Juxta 1.2.2 Released

Juxta 1.2.2 is now available for download here. The major new features in this release are an improved fragment selection mechanism and the ability to easily preview files before collating them. This functionality is accessed via the “Files” tab on the left-hand panel, depicted below.

[Image: juxta-frag1.jpg]

Clicking on the “Files” tab brings up a tree of the files in the currently selected base directory. Clicking on the file icon allows the scholar to choose a directory from which to select files for collation.

[Image: juxta-frag2.jpg]

Double-clicking on files with a “txt” or “xml” extension opens them in a preview mode. The scholar can then choose to import the entire file into the collation or to highlight a fragment and pull just the highlighted fragment into the collation. Fragments carry with them the metadata and lineation from the source text, if any. This new functionality replaces the old fragment selection mode with a more integrated solution.

April 14th, 2008 Nick Laiacona

for dummies?

Wesley Raabe, a former colleague at UVA (now CLIR Fellow at the Center for Digital Research in the Humanities), has written a very nice blog post describing his experiences with Juxta. He subtitles it “textual collation for dummies,” which I take as a real compliment, because Juxta was designed to open up this esoteric practice and make it easier for literary scholars to see the utility of analyzing variant texts without having to hunker over a Lindstrand Comparator or be dazzled by the flashing lights of a Hinman.

Wesley also points out that Juxta accepts unmarked, plain-text (.txt) documents as a baseline for comparison. But we want to make it clear that Juxta can work with more than plain-text files. For scholars who want to record even very complex line or other numbering schemes, or to embed bibliographic citation information and other notes in the files, Juxta’s particular flavor of XML can be useful. Juxta XML can be constructed by hand or generated via XSLT from other XML-formatted files (such as TEI). Its simple format is described beginning on page 17 of our user manual.

Why bother? Juxta XML is a great choice if you’d like the printable apparatus to be generated complete with bibliographic information and your notes, keyed to line and page or scene or chapter or canto numbers that make sense to scholars studying your particular texts.

I haven’t seen anybody do this yet, but Juxta XML would also be a nice choice for the editor of an existing archive of well-proofed XML documents of various editions to provide to end users as a download option. In that case, Juxta — in its most sophisticated form — would be plug-and-play. Even for dummies.

January 18th, 2008 Bethany Nowviskie

Juxta-dev mailing list

A mailing list is now available for following Juxta’s development and communicating with others who are using Juxta. Please subscribe to the mailing list here.

November 1st, 2007 Nick Laiacona

Juxta 1.2.1 Released

It has been a while since we have had an update on this blog, but work has been occurring this year behind the scenes at ARP. Juxta 1.2.1 is now available for download.

Last summer, Performant Software Solutions developed a new version of Juxta for ARP. This is the version I demoed at the COST 32 Workshop in Antwerp, Belgium in September. Below is a summary of the new features and bug fixes found in this release.

New Features

Passage Collation - Juxta can now collate texts in which passages appear in a different order from one text to the next. A new user interface component, the Passage Panel, guides the user through the process. See the updated user’s manual for more information.

Fragment Collation - Juxta can also collate fragments of texts. This is useful when the target of collation is embedded in a document.

Free Scrolling in Comparison View - Documents in the side-by-side comparison view can now be scrolled independently of one another.

Easier to Find Samples Directory - A file menu shortcut has been added that takes you straight to the samples directory.

Improvements to generation of Critical Apparatus - A dialog box now allows you to specify a title for the critical apparatus. A progress bar is now provided while the critical apparatus is being generated. Some bugs with the generation of lemmas within the apparatus have been fixed.

Find Works in Either Document - The Find Dialog can now find text in either document in the side-by-side comparison mode.

Bug Fixes

Improvements to Collation Algorithm - Hans Walter Gabler found some problems with the collation output in specific circumstances. We have corrected these errors.

Margin Box Clipping - Occasionally, resizing the window caused the margin boxes to be clipped. This has been fixed.

Scrollbar Positioning Sometimes Incorrect - When loading a large document, the scrollbar would become positioned incorrectly. This has been fixed.

Image Display Not Updating - When flipping between documents, the image associated with the document on screen would sometimes not appear. This has been fixed.

Large Images Cause Scrolling to Be Erratic - Related to the issue above, scrolling through documents which have images associated with them could be jerky at times as the images loaded. Loading has been adjusted so this no longer occurs.

Out of Memory Error - Loading two large, completely unrelated documents could cause a system error. This has been fixed.

October 31st, 2007 Nick Laiacona

The Difference Algorithm

The following is an excerpt from a talk I gave in September at the COST 32 meeting in Antwerp, Belgium. It explains at a high level the workings of Juxta’s collation procedure.

Users of UNIX-based operating systems may be familiar with the operating system command diff. Certainly most computer programmers have used some version of this program. Using diff, programmers can identify variations between versions of source code files to detect modifications made by themselves or by other programmers. They can detect additions, deletions and replacements of code within a source file. There are many implementations and variations, but the essential functionality is always the same: line-by-line comparison of text files. If humanities researchers also wish to conduct a line-by-line comparison of text files, why not employ one of the many implementations of this tool?

Unfortunately, a Difference Algorithm geared for use by programmers is not satisfactory for comparing variations in natural language texts. The reason is that the text of a computer program is structured differently than that of prose text. Computer programs are written in lines and hard carriage returns (inserted into text by hitting the ‘Return’ key on a computer keyboard) usually punctuate the end of a single line of code. Comparing one line to another is sufficient for a programmer to determine what changed. But in natural language texts, the positioning of hard carriage returns is less predictable. Poetry is a best-case scenario, but once you start looking at prose you find hard carriage returns only at the ends of entire paragraphs. Having the algorithm report that an entire paragraph changed when really only one word or perhaps just a comma was omitted is not helpful. There may be additions and deletions within the paragraph, but all we can detect is broadly that something changed.
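
The contrast can be seen in a small sketch. It uses Python and the standard library's difflib purely as a stand-in (this is neither Juxta's code nor its algorithm), and the helper names and the sample sentence are made up for illustration. The same one-comma variant is compared first line by line, then token by token.

```python
import difflib
import re

def changed_lines(base, witness):
    """Lines of the base that a line-by-line comparison reports as changed."""
    a, b = base.splitlines(), witness.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed.extend(a[i1:i2])
    return changed

def tokenize(text):
    """Split text into words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def changed_tokens(base, witness):
    """(tag, base tokens, witness tokens) for every non-matching token run."""
    a, b = tokenize(base), tokenize(witness)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# A "paragraph" with a single hard return at its end; the witness drops a comma.
base = "It was the best of times, it was the worst of times.\n"
witness = "It was the best of times it was the worst of times.\n"

print(changed_lines(base, witness))   # the entire paragraph is flagged
print(changed_tokens(base, witness))  # only the omitted comma is flagged
```

The line-level comparison can only say that the paragraph differs; the token-level comparison pinpoints the missing comma.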

Juxta’s Difference Algorithm is tuned to the needs of collating natural language. It identifies differences in the text down to the word and punctuation level. It is highly accurate over large spans of prose and has been tested on Mark Twain, medieval Brand, Hamlet and of course the poetry of Dante Gabriel Rossetti.

The Difference Algorithm employed by Juxta was first described in the April 1978 issue of Communications of the ACM, in an article entitled “A Technique for Isolating Differences Between Files”. Donald C. Lindsay of Carnegie Mellon University later implemented this algorithm in the C programming language. This version worked on a pair of text files and operated on a line-by-line basis. A few bug fixes and improvements were made to this version and for years it floated through the pre-Web Internet. Then in 1997 Ian F. Darwin took the C version of this code and translated it to Java, keeping the same functionality with an eye toward making the source more “pedagogic”. This was the version received by Nick Laiacona at Applied Research in Patacriticism at the University of Virginia in 2005.

The algorithm as received had little separation of concerns within the code and was written more as a demonstration than as a reusable software component. For example, file format and the visual output code were built into the same methods that performed the actual algorithm. One of the first challenges of this project was to untangle all of this (without breaking anything!) so that the code could be a reusable module within the context of Juxta. The diff algorithm itself is now a standalone package which can be loaded into any Java-based software by a programmer, regardless of whether that software looks or feels like Juxta.

A number of changes were made to the structure and functionality of the code at this point. First, core sub-routines were named, encapsulated and separated into a structure described below. The original version constructed a symbol table and then promptly destroyed the table’s contents through the workings of the code. The symbol table is computationally expensive to build, so this was an inefficient way of operating when comparing many copies of the same text. The code was changed to make it non-destructive and thus reusable. Also, the original version was only capable of producing a plain text report as output. This was changed so that the output is now a data structure, which can be manipulated by the calling program in any way it sees fit. The code was also made so that it can operate on texts loaded in memory, which divorces the core algorithm from concerns over file format. Perhaps most importantly, the algorithm was made to operate on tokens composed of individual words and punctuation marks instead of entire lines of text.
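
As a rough illustration of the last two points (working on text held in memory, and on word and punctuation tokens rather than lines), here is a minimal Python sketch. It is not Juxta's Java code: the Token class, the regular expression, and the idea of storing a character offset with each token are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    text: str    # the word or punctuation mark itself
    offset: int  # character position in the source string (an assumption)

# One token per run of word characters, or per single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Turn an in-memory string into a list of word and punctuation tokens."""
    return [Token(m.group(), m.start()) for m in TOKEN_PATTERN.finditer(text)]

for token in tokenize("Hello, world!"):
    print(token)
```

Because the function takes a plain string, the caller is free to obtain that string from a .txt file, an XML file, or anywhere else, and the offsets let a calling program map a reported difference back onto the original text.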

The Difference Algorithm code is composed of the following discrete objects. These objects are chained together and the output of one is the input of the next.

Symbol Table – The symbol table takes the raw texts and turns them into a hash table of symbols with indices stored to either one or both of the texts. Every word’s disposition is established: does it appear once in the base, once in the witness, once in each, or more than once in each? Symbols that appear once in each text are used as landmarks in the following step.

Correlator – Takes the symbol table and scans outward from the points where both texts match to establish where there are “blocks” of symbols which are the same between both files.

Collector – The collector then takes this series of blocks and “interprets” them as one of three difference types: addition, deletion, or change. For example, a block of symbols that appears in the base but not in the witness would be considered a deletion. Each difference is captured and added to the Difference Set.

Difference Set – This is the final output of the process, which contains a complete list of all the differences encountered, their location and type.
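
The chain above can be sketched in a few dozen lines. What follows is a simplified Python illustration, not Juxta's Java code: the names (Symbol, Difference, build_symbol_table, correlate, collect, diff) are hypothetical, landmarks are assumed to be symbols occurring exactly once in each text, and matches that cross one another (moved passages) are simply ignored.

```python
from dataclasses import dataclass, field

@dataclass
class Symbol:
    base_positions: list = field(default_factory=list)
    witness_positions: list = field(default_factory=list)

@dataclass(frozen=True)
class Difference:
    kind: str            # "addition", "deletion" or "change"
    base_span: tuple     # (start, end) token indices in the base, end exclusive
    witness_span: tuple  # (start, end) token indices in the witness

def build_symbol_table(base, witness):
    """Hash every token to the positions where it occurs in each text."""
    table = {}
    for i, tok in enumerate(base):
        table.setdefault(tok, Symbol()).base_positions.append(i)
    for j, tok in enumerate(witness):
        table.setdefault(tok, Symbol()).witness_positions.append(j)
    return table

def correlate(table, base, witness):
    """Pair landmark tokens, then grow each pair over identical neighbours."""
    pairs, used = {}, set()  # base index -> witness index; witness indices taken
    for sym in table.values():
        if len(sym.base_positions) == 1 and len(sym.witness_positions) == 1:
            pairs[sym.base_positions[0]] = sym.witness_positions[0]
            used.add(sym.witness_positions[0])
    for step in (1, -1):  # scan forward first, then backward
        for i in sorted(pairs, reverse=(step == -1)):
            ni, nj = i + step, pairs[i] + step
            while (0 <= ni < len(base) and 0 <= nj < len(witness)
                   and ni not in pairs and nj not in used
                   and base[ni] == witness[nj]):
                pairs[ni] = nj
                used.add(nj)
                ni, nj = ni + step, nj + step
    return pairs

def collect(pairs, base, witness):
    """Interpret the gaps between matched blocks as typed differences."""
    diffs, bi, wi = [], 0, 0
    # A sentinel pair just past the end flushes any trailing difference.
    for i, j in sorted(pairs.items()) + [(len(base), len(witness))]:
        if j < wi:
            continue  # crossing match (moved text): ignored in this sketch
        if i > bi and j > wi:
            diffs.append(Difference("change", (bi, i), (wi, j)))
        elif i > bi:
            diffs.append(Difference("deletion", (bi, i), (wi, wi)))
        elif j > wi:
            diffs.append(Difference("addition", (bi, bi), (wi, j)))
        bi, wi = i + 1, j + 1
    return diffs

def diff(base, witness):
    table = build_symbol_table(base, witness)
    return collect(correlate(table, base, witness), base, witness)

base = "to be or not to be , that is the question".split()
witness = "to be or not to be that was the real question".split()
for difference in diff(base, witness):
    print(difference)
```

In this example the comma is reported as a deletion, “is”/“was” as a change, and “real” as an addition. Note that the repeated words “to” and “be” are never landmarks themselves; they are matched only because the correlator grows outward from the landmarks next to them.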

During the development of this tool we used Microsoft Corporation’s MS Word Compare Documents function as a baseline. This algorithm is excellent, but unfortunately it is proprietary and closed-source, works only on Word documents, and collates only two documents at a time. Nevertheless, it is very accurate and we used it for checking our work and also as inspiration for some of the visualizations found in Juxta.

When we first ran the difference algorithm, we found that we were doing pretty well, but not as well as Microsoft. Sometimes the algorithm was able to isolate the change area in a very small span of text but at other times the spans were too wide. While the algorithm was correct that a change did occur within the paragraph or multi-line area that it had marked, humanities researchers at ARP wanted it to be more specific as to what exactly had changed.

The diff algorithm’s strategy involves finding words that are unique within the texts. This has its limitations because it depends on the richness of the vocabulary employed in the target text. Also, if you write enough in any language in which spelling is regularized, you end up using the same words over and over again. The algorithm code is capable of computing the ratio of symbols to text length, so that it can actually report a confidence level in its own results. While this is important, it is better to do better.
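
To make the idea of such a ratio concrete, here is one possible reading of it in Python. The function name and the exact definition (distinct symbols divided by token count) are assumptions for illustration; the measure Juxta actually computes may differ.

```python
def symbol_length_ratio(tokens):
    """Distinct symbols divided by text length: 1.0 means no word repeats."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

passage = "the cat sat on the mat the end".split()
print(symbol_length_ratio(passage))       # 6 distinct symbols over 8 tokens
print(symbol_length_ratio(passage[1:4]))  # a short span with no repeats
```

The longer a text runs, the more often words recur and the lower this ratio tends to fall, which leaves fewer unique words for the algorithm to use as landmarks.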

At this point we introduced a multi-pass approach, which gave us a significant improvement in accuracy. This approach is easy to understand. Remember that there are three types of differences that we identify: additions, deletions and changes. The definition of the first two is pretty obvious; the passage is either in one text or the other. What are changes? Changes are areas where we are not exactly sure what happened. We know that the text changed and exactly where and for how long. But that’s it. We can take these areas of ambiguity and resolve them by running the whole diff algorithm again on these blocks of change. This treats these changed passages as texts in their own right for the purposes of collation and greatly improves the symbol/length ratio in that moment. This change brought us to word level accuracy, which was our goal.
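
The multi-pass idea can be sketched as a small recursive wrapper around a single pass. This is a simplified Python illustration, not Juxta's Java code: first_pass is a compact, hypothetical landmark-based pass (tokens occurring exactly once in each text, grown over identical neighbours, with crossing matches ignored), and multi_pass is an assumed name for the wrapper that re-runs it on every block of change.

```python
from collections import Counter

def first_pass(base, witness):
    """One pass: return unmatched gaps as (kind, base_span, witness_span)."""
    count_b, count_w = Counter(base), Counter(witness)
    where_w = {tok: j for j, tok in enumerate(witness)}
    # Landmarks: tokens that occur exactly once in each text.
    pairs = {i: where_w[tok] for i, tok in enumerate(base)
             if count_b[tok] == 1 and count_w[tok] == 1}
    used = set(pairs.values())
    for step in (1, -1):  # grow landmarks forward, then backward
        for i in sorted(pairs, reverse=(step == -1)):
            ni, nj = i + step, pairs[i] + step
            while (0 <= ni < len(base) and 0 <= nj < len(witness)
                   and ni not in pairs and nj not in used
                   and base[ni] == witness[nj]):
                pairs[ni] = nj
                used.add(nj)
                ni, nj = ni + step, nj + step
    gaps, bi, wi = [], 0, 0
    for i, j in sorted(pairs.items()) + [(len(base), len(witness))]:
        if j < wi:
            continue  # crossing match (moved text): ignored in this sketch
        if i > bi and j > wi:
            gaps.append(("change", (bi, i), (wi, j)))
        elif i > bi:
            gaps.append(("deletion", (bi, i), (wi, wi)))
        elif j > wi:
            gaps.append(("addition", (bi, bi), (wi, j)))
        bi, wi = i + 1, j + 1
    return gaps

def multi_pass(base, witness, b0=0, w0=0):
    """Re-run the pass on each block of change, treating it as its own text."""
    out = []
    for kind, (bs, be), (ws, we) in first_pass(base, witness):
        covers_everything = (be - bs == len(base) and we - ws == len(witness))
        if kind == "change" and not covers_everything:
            # Words repeated elsewhere may be unique inside this small block.
            out.extend(multi_pass(base[bs:be], witness[ws:we], b0 + bs, w0 + ws))
        else:
            out.append((kind, (b0 + bs, b0 + be), (w0 + ws, w0 + we)))
    return out

base = "john saw the red dog and mary saw the old cat".split()
witness = "john saw the red dog and mary heard the young cat".split()
print(first_pass(base, witness))  # one wide change: "saw the old" / "heard the young"
print(multi_pass(base, witness))  # two one-word changes: saw/heard and old/young
```

In the full texts “the” occurs twice, so it cannot serve as a landmark and the first pass reports one three-word change. Inside that block “the” is unique, so the second pass anchors on it and isolates the two words that actually differ.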

October 31st, 2007 Nick Laiacona

“new horizons” demo

If you’re in the Charlottesville area this week and are interested in a hands-on look at Juxta, please join us at the New Horizons in Teaching and Research conference, jointly sponsored by ITC and the University of Virginia Library. Bethany Nowviskie will be demonstrating the software and answering questions at 10 AM on Tuesday, May 22nd. For more details, see the conference program.

May 21st, 2007 Bethany Nowviskie

juxta open source

We’re happy to announce, as promised, a release of the Juxta source code under an Educational Community License. All materials are hosted at SourceForge, and full instructions for downloading and compiling the software are available here.

If you simply wish to use Juxta as a scholar rather than as a Java developer, you can download Mac, Windows, or Unix versions here.

May 9th, 2006 Bethany Nowviskie

Juxta’s Future Development

Juxta is being released under an Educational Community License. We will be making it available on SourceForge very shortly, and we welcome any and all suggestions about its future development. As currently released, Juxta’s interface has a kind of “black-box” appearance. We wanted it to be as user-friendly as possible, and in that effort we left developers without easy access to the code. Ron Van den Branden’s comment (on our initial post, below) has spurred us to decide that we should, as soon as possible, release a “developer’s pack” with information about how jxt and source XML files are located and constructed, and maybe with some sample XSLT for TEI-lite. Look for a posting on that matter shortly, within a couple of weeks.

As to the other issue raised by Ron’s feedback: at present Juxta outputs HTML and so doesn’t facilitate XSL transformations for databases and online editions. But the tool was conceived for XML output (it’s built using the Velocity template engine) and this development is planned (in the next release of Juxta, this year).

Jerry

May 5th, 2006 Jerome McGann

call for contributors

We’d like to invite Juxta users to share their experiences with the tool on this blog. We’re happy to publish short essays on your bibliographic research and the way Juxta assists or complements it and (most useful for us) on any shortcomings you find in the software. Bug reports or questions not of general interest are still best addressed to technologies@nines.org — but if you’d like to introduce yourself and your project here and give us some indication of the way Juxta fits into your work, please register for the blog under “Contribute” at bottom right.

Once your introductory post is in draft state in our WordPress system, kindly drop us a line at technologies@nines.org so that we can authorize you to contribute! And thanks.

February 27th, 2006 Bethany Nowviskie

Initial Juxta Release

Juxta 1.0 is now ready for anyone to try out.  If you can break it we’ll be pleased.  If you find it useful we’ll also be pleased.  If you can suggest improvements and/or modifications we’ll be happy to have them.

The tool was initially conceived and designed primarily in relation to scholarly editing issues raised by 19th- and 20th-century texts. But as you will see from the samples given with this initial release, we’re especially interested in developing the tool so that it will be useful for pre-modern texts. (The samples are from Dante Gabriel Rossetti, from Shakespeare, and from Walter Pater. The Pater example is supplied to show how the tool works with prose texts.)

The tool comes with full documentation.

We’d be grateful if you would post your comments and suggestions to this blog so that everyone involved with the development of Juxta can benefit from your thoughts and experience.  Feel free to register and log in as a contributor to the blog using the links listed under “Contribute” at bottom right. We’d be happy to publish posts on your experiences with Juxta as part of this communal blog.

You can also comment on posts others have made, by clicking the “comments” link under each post. Thanks.

February 6th, 2006 Jerome McGann

