Tangentium

January '04

All material on this site remains © the original authors: please see our submission guidelines for more information. If no author is shown material is © Drew Whitworth. For any reproduction beyond fair dealing, permission must be sought: e-mail drew@comp.leeds.ac.uk.

ISSN number: 1746-4757

Electronic Text Technologies

Drew Whitworth (based on materials created by Martin Thomas)


Structural markup is based on the assumption that all texts have an underlying structure. For example, consider the structure of the novel The Lord of the Rings. In the pseudo-code below (which is only the beginning of a complex code structure that could, in principle, mark up this entire epic novel), all markup information has been enclosed in angle brackets < > to distinguish it from the content. Note also how the scope of each piece of markup is limited by "closing tags", which include the / character. In some cases a pair of tags covers only a single passage, but in others (e.g. each volume) its scope is large and includes other sub-levels within it. Texts therefore have an inbuilt hierarchy. (All the following should be essentially familiar to anyone who has written web pages in HTML, although I have used tags which would not work in that particular markup language; see the discussion of XML below, however.)

<novel>
    <title>The Lord of the Rings</title>
    <author>J. R. R. Tolkien</author>
    <volume>
        <volumetitle>The Fellowship of the Ring</volumetitle>
        <book>
             <booknumber>Book One</booknumber>
             <chapter>
                 <chapternumber>1</chapternumber>
                 <chaptertitle>A Long-Expected Party</chaptertitle>
                 <paragraph>When Mr. Bilbo Baggins of Bag End announced
                      that he would shortly be celebrating his eleventy-first
                      birthday...</paragraph>

                 <!-- [etcetera] -->

             </chapter>
             <chapter>
                 <chapternumber>2</chapternumber>
                 <chaptertitle>The Shadows of the Past</chaptertitle>
                 <paragraph>The talk did not die down in nine or even
                      ninety-nine days...  </paragraph>

                 <!-- [etcetera] -->

             </chapter>

             <!-- [until eventually, at the end of the book and appendices...] -->

        </book>
    </volume>
</novel>

This markup code is simple and unsophisticated. In several cases the tags above could be combined, with some of the information moved into attributes inside the opening tag, thereby encoding more complex information within a single passage of text. For example:

<volume title="The Fellowship of the Ring">
    <book number="1">
         <chapter number="1" title="A Long-Expected Party">
             <paragraph>When Mr. <character race="hobbit">Bilbo
                 Baggins</character> of Bag End announced that he would
                 shortly be celebrating his eleventy-first birthday...
             </paragraph>
             <!-- [and so on] -->
         </chapter>
         <chapter number="2" title="The Shadows of the Past">
             <paragraph>The talk did not die down in nine or even
                 ninety-nine days...  </paragraph>
             <!-- [and so on] -->
         </chapter>
    </book>
</volume>
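To give a sense of what this kind of markup makes possible, here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module. It is only an illustration: the sample text is a shortened, closed-off version of the fragment above, and the function names chapter_titles and characters are my own inventions rather than part of any markup standard.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed version of the fragment above. The elided
# passages are simply left out so that the sample can be parsed.
SAMPLE = """
<volume title="The Fellowship of the Ring">
    <book number="1">
        <chapter number="1" title="A Long-Expected Party">
            <paragraph>When Mr. <character race="hobbit">Bilbo
                Baggins</character> of Bag End announced that he would
                shortly be celebrating his eleventy-first birthday...
            </paragraph>
        </chapter>
        <chapter number="2" title="The Shadows of the Past">
            <paragraph>The talk did not die down in nine or even
                ninety-nine days...</paragraph>
        </chapter>
    </book>
</volume>
"""


def chapter_titles(volume):
    """Return (number, title) pairs for every chapter, in document order."""
    return [(ch.get("number"), ch.get("title")) for ch in volume.iter("chapter")]


def characters(volume, race=None):
    """Return names marked up as characters, optionally filtered by race."""
    names = []
    for element in volume.iter("character"):
        if race is None or element.get("race") == race:
            # Collapse the line break and indentation inside the name.
            names.append(" ".join("".join(element.itertext()).split()))
    return names


volume = ET.fromstring(SAMPLE)
print(volume.get("title"))
print(chapter_titles(volume))
print(characters(volume, race="hobbit"))
```

The program never has to guess where a chapter begins or which words name a character: it reads those facts directly from the tags and attributes, which is exactly what the hierarchy is for.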

Note also the extra information given about the words "Bilbo Baggins", which informs anyone handling or analysing the text of the words' context. "Bilbo Baggins" is not just an arbitrary string of thirteen bytes (the words mean nothing in English or any other language); it actively symbolises a character in the novel. Similarly, "The Fellowship of the Ring" actively symbolises one-third of the whole novel. The text is not only something which can be spoken: it represents the complete volume, having an equal status in this hierarchy to "The Two Towers" and a "superior" status to "A Long-Expected Party", and so on. The volume title can also be appropriated for other uses, such as when referring to the first of Peter Jackson's films. Here is a case where further markup (perhaps an attribute called "MEDIUM"?) would be required, as otherwise how would a computer "know" whether the reference was to the film or to the printed novel? And which edition of the novel, anyway?

In principle, however, markup can deal with any such issue. Returning to the earlier example, one could envision the following markup in two different electronic texts:

I have recently applied to join the
<ACRONYM TITLE="Automobile Association">AA</ACRONYM>

I have recently applied to join the
<ACRONYM TITLE="Architectural Association">AA</ACRONYM>

Note that these are working examples of markup: one of these tags has been included around the acronym AA in this sentence and, unless you are viewing this page through an older web browser which cannot understand the ACRONYM tag, you can hover your mouse pointer over the "AA" and see which particular definition I am referring to.
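A browser is not the only kind of program that can make use of this information. The following is a rough sketch, in Python, of how software might read the two texts above and recover which "AA" each one means. It uses the standard library's html.parser module; the class name AcronymCollector and the function expansions are hypothetical names of my own, and ACRONYM elements without a TITLE attribute are simply ignored.

```python
from html.parser import HTMLParser


class AcronymCollector(HTMLParser):
    """Collect (acronym, expansion) pairs from ACRONYM elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._title = None  # TITLE of the ACRONYM element currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        if tag == "acronym":
            self._title = dict(attrs).get("title")
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "acronym" and self._title is not None:
            self.pairs.append(("".join(self._text).strip(), self._title))
            self._title = None


def expansions(html):
    """Return every (acronym, expansion) pair found in a piece of HTML."""
    collector = AcronymCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


motoring = ('I have recently applied to join the\n'
            '<ACRONYM TITLE="Automobile Association">AA</ACRONYM>')
building = ('I have recently applied to join the\n'
            '<ACRONYM TITLE="Architectural Association">AA</ACRONYM>')

print(expansions(motoring))
print(expansions(building))
```

The visible text of the two sentences is identical; only the markup distinguishes them, and that is enough for the program to tell them apart.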

Markup's applications are limited only by the willingness to encode the information and, more significantly, the interface's ability to understand the markup. As I observed, not all browsers can handle the ACRONYM tag, despite the fact that it is now included in the HTML 4.0 standard. This is therefore a good point in the discussion to talk about particular markup technologies, how they have developed over time, and the current situation.

SGML (Standard Generalized Markup Language) is the grandparent of markup technologies. It grew out of work begun in the 1970s by Charles Goldfarb and was published as an international standard (ISO 8879) in 1986. With SGML, users can create their own markup elements and rules as to how these are to be handled, but this is a complex and time-consuming process for individual authors. Pre-fabricated families of tags and rules, which can be applied in particular contexts, have therefore developed since. The most familiar of these is HTML (HyperText Markup Language), first developed by Tim Berners-Lee. This emphasises the formatting of text over structure, and also makes it easier to include hypertext links in documents.

HTML's ubiquity is not without its drawbacks. It works well enough for the formatting of text (and other things such as images) - i.e., presentational markup - but using ICT to analyse HTML text is difficult. The use of ICT in the scholarly analysis of text depends on good structural markup, but HTML was not designed for this purpose and is also often sloppily written, something for which both authors and the designers of web browsers are responsible. The last page of this essay describes how XML has been developed to compensate for these problems, and concludes with a discussion of potential consequences of these new technologies.
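The problem of sloppily written HTML can be illustrated with a short sketch in Python. The two snippets of markup are invented for the purpose, as are the names is_well_formed, StartTagLogger and lenient_start_tags. The first function hands the markup to a strict XML parser from the standard library; the second uses the standard library's forgiving HTML parser, which (like a web browser) accepts whatever it is given.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippets, invented for illustration. The first leaves a
# paragraph unclosed and closes its inline tags in the wrong order.
SLOPPY = "<p>First paragraph<p>Second <b>bold <i>text</b></i>"
TIDY = "<div><p>First paragraph</p><p>Second <b>bold <i>text</i></b></p></div>"


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class StartTagLogger(HTMLParser):
    """A forgiving parser that simply records each opening tag it meets."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_start_tags(markup):
    """Return the opening tags seen by the forgiving HTML parser."""
    logger = StartTagLogger()
    logger.feed(markup)
    logger.close()
    return logger.tags


print(is_well_formed(SLOPPY))
print(is_well_formed(TIDY))
print(lenient_start_tags(SLOPPY))
```

The forgiving parser reports the tags in the sloppy snippet without complaint, but it cannot say where the first paragraph ends or which element contains which. The strict parser refuses the snippet outright. Neither outcome gives an analyst the reliable hierarchy that structural markup is meant to provide.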
