Read the abstract as presented in the conference proceedings (use an XML/XSLT-capable browser).
Students of this topic will recognize that I barely skim the surface of the problem here.
The "overlap problem", although famously difficult and worthy itself of a more extended
treatment, must also be considered within the context of broader problems with the currently
dominant architecture of document processing, which is designed to support the goals of
publishing (especially publishing at scale in multiple formats), not of scholarly
interpretation. In brief, what is called for is a data model and architecture supporting the
Each of these three points could be elaborated at length. The first two in particular are the subjects of ongoing research. The third one especially is the focus of this presentation. Truly expressive applications of markup to scholarly text processing will be rewarded, it seems to me, by shifting attention from markup as such, to document modeling as a research project in its own right, underlying and enabling markup and applications based on it.
The presentation slides are in PDF format. This is a very high-level view of the problem, and is not intended to be self-explanatory.
A demonstration shows nothing like the full potential,
only a hint, of what will be possible in a markup
regimen that does not impose a single unitary hierarchy over a text. The markup here is
extremely simple and straightforward, even trivial. The only thing at all remarkable about
it (and the fact that this is remarkable is itself somewhat remarkable) is that it
identifies phenomena in the texts that overlap, and therefore cannot be directly represented
together, at least at the same time, in XML.
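What "overlap" means here can be made concrete with a minimal sketch. It assumes, purely for illustration, that each marked-up phenomenon is reduced to a named range of character offsets; the `Range` type, the function names, and the sample offsets are hypothetical and are not taken from the demonstration's code. Two ranges overlap when they share some text but neither contains the other, which is exactly the case a single XML tree cannot express with ordinary nested elements.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """A named span of text (hypothetical type, for illustration only)."""
    name: str
    start: int  # offset of the first character in the span
    end: int    # offset just past the last character in the span


def nests(outer: Range, inner: Range) -> bool:
    """True when inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a: Range, b: Range) -> bool:
    """True when a and b share text but neither contains the other."""
    shares_text = a.start < b.end and b.start < a.end
    return shares_text and not nests(a, b) and not nests(b, a)


# Made-up offsets: a sentence that starts mid-line and runs past the line end.
line = Range("line", 0, 20)
sentence = Range("s", 12, 35)
phrase = Range("phr", 12, 20)

print(overlaps(line, sentence))  # the sentence crosses the line boundary
print(overlaps(line, phrase))    # the phrase sits cleanly inside the line
```

Ranges that merely nest, like the phrase inside the line, fit in one hierarchy; only the crossing case needs something beyond plain XML nesting.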
The scholarly intent of this demonstration (such as it is) is to depict the way different
examples of the sonnet form (mainly in English, but also with German, French and Spanish
cases) have different
rhythmic profiles in the interplay between their metrical
(verse) structures and rhetorical or grammatical (sentence/phrasing) structures. The thesis
is that any particular sonnet, and any moment within a sonnet, is more or less
turbulent, turbulence occurring when the speech rhythms proper to the phrasing of
the sonnet interfere with the regular flow of the meter. While these differences within and
between sonnets are subtly apparent in reading (subject, of course, to the different
interpretations provided to them by different enunciations), they can also be more
dramatically represented by a graphical rendition in which the correspondence or
interference between the two hierarchies is specifically drawn.
This markup, trivial though it is, is manifestly interpretive in at least two respects:
"Now Sleeps the Crimson Petal, Now the White" and Meredith's
"Modern Love XXX"). In the case of sentences and phrasing, perhaps a greater measure of more or less deliberate decision-making by the encoder is required to establish proper boundaries (although punctuation is also a serviceable guide, and indeed in the case of most of these examples, a first cut at the markup was facilitated by an automated routine that marked up phrase and sentence boundaries indicated by punctuation).
"Correspondences" (nineteenth century, French) as compared to Milton's
"On His Blindness" (seventeenth century, English).
In order to create these representations, a library of sonnets is marked up in XML, with
the tree structure representing the verse form, namely lines within couplets or
quatrains. Another hierarchy, indicating the grammar or phrasing of the poem (elements are
s for sentence and phr for phrase), is marked up using a
convention (the LMNL
CLIX notation) in which XML elements, rather than simple start-
or end-tags, indicate the beginnings and ends of structures. This enables pipeline
processing to create the following alternative formats and renditions:
extended CLIX: the document sources use CLIX elements interspersed into regular XML to indicate the presence of document structures that violate the clean nesting of the XML elements identified in the document.
"overlap problem" will recognize xLMNL as a kind of
"standoff" representation of overlapping ranges.
"sawtooth syntax" (an alternative markup syntax for representing this data model) is also easily generated from xLMNL.
"XML induction" is the process of deriving XML hierarchies from flat LMNL instances. Here, inductions of two hierarchies are demonstrated: (a) the verse/line hierarchy (which happens to be the XML we started with) and (b) the sentence/phrase hierarchy, with verse/line structures represented using CLIX notation.
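The first step in such a pipeline, reading a CLIX-style source into a flat text plus a list of ranges, can be sketched roughly as follows. This is an illustration in Python, not the XSLT 2.0 code used for the demonstration. It assumes that the marker elements are empty elements carrying `sID` (start) and `eID` (end) attributes with matching values; the function name `extract_ranges`, the sample verse text, and the tuple layout of the ranges are all made up for the example.

```python
import xml.etree.ElementTree as ET


def extract_ranges(xml_source):
    """Return (text, ranges) where ranges are (name, start, end) offsets.

    Ordinary elements and sID/eID marker pairs are both turned into ranges,
    so the two hierarchies end up side by side in one flat list.
    """
    root = ET.fromstring(xml_source)
    text_parts = []
    ranges = []
    open_markers = {}  # marker id -> (element name, start offset)
    offset = 0

    def add_text(s):
        nonlocal offset
        if s:
            text_parts.append(s)
            offset += len(s)

    def walk(el):
        if "sID" in el.attrib:
            # Start marker (assumed attribute name): remember where it opened.
            open_markers[el.attrib["sID"]] = (el.tag, offset)
        elif "eID" in el.attrib:
            # End marker (assumed attribute name): close the matching range.
            name, start = open_markers.pop(el.attrib["eID"])
            ranges.append((name, start, offset))
        else:
            # Ordinary element: its range covers everything it contains.
            start = offset
            add_text(el.text)
            for child in el:
                walk(child)
                add_text(child.tail)
            ranges.append((el.tag, start, offset))

    walk(root)
    return "".join(text_parts), ranges


# Made-up sample: one sentence (s) that runs across a line boundary.
SAMPLE = (
    "<quatrain>"
    '<line><s sID="s1"/>One two three,</line>'
    '<line>four.<s eID="s1"/> Five six.</line>'
    "</quatrain>"
)

text, ranges = extract_ranges(SAMPLE)
for name, start, end in ranges:
    print(name, start, end, repr(text[start:end]))
```

Once every structure is a range over the same text, either hierarchy can in principle be rebuilt as the XML tree while the other is written back out as markers, which is the idea behind the two inductions described above.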
Of the stylesheets that perform these conversions, the only ones that are not entirely
generic are the two that display the sonnet structures, the
views. These have been tuned for display of documents with ranges of the types given in the
sonnets, namely octave, sestet, quatrain, couplet, line, s, and
phr. All other stylesheets will work equally on any documents in which
the CLIX notation is used to represent structures overlapping the main hierarchy – a
format that is easily generated from many common workarounds used to represent overlap in
More information about LMNL, the Layered Markup and Annotation Language, is available at the LMNL wiki.
Readers who wish to see or adapt the XSLT 2.0 code that performs these conversions are invited to contact the author at wapiez (at) mulberrytech.com.