Electronic Text Technologies

Drew Whitworth (based on materials created by Martin Thomas)


Computers were originally invented to process numerical data. Over time their functions have evolved and diversified into the pattern we see today. Number-crunching is still important, but networking has also turned computers into communications devices. Whereas networking was originally developed so that computers could talk to each other, it has become more common for people to use computers to talk to other people. Number-crunching and CMC (computer-mediated communication) probably constitute the main applications of ICT in most people's minds.

However, an equally important application of modern ICT is the handling of electronic text. Indeed, without the set of software technologies and standards which have developed in this area, one of CMC's main media - the World Wide Web - would not exist. Nor could any of the democratising technologies suggested by Kevin Carey (in this month's first feature essay) work.

Yet the history and workings of electronic text are often unfamiliar to the general public, even to many who use ICT regularly. This essay summarises these issues. For more detail the reader is directed to resources listed on this month's page of links.

The fundamental principle of electronic text is that of markup. Without markup, electronic text handling would be limited merely to character encoding: that is, standards for translating strings of binary digits such as 0001001100101111 into letters and other symbols. Markup predates computing, with punctuation being the most familiar form. Punctuation turns writing from a mere string of words into sentences with varying meanings. Symbols such as quotation marks, ? and ! are not actually spoken, but can influence meaning and intonation, "marking" a particular sentence as a quotation, question or exclamation respectively.
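Character encoding itself operates below the level of markup: a standard such as ASCII or Unicode simply maps bit patterns to characters (under ASCII, for example, the byte 01000001 is the letter A). An XML document - a format discussed later in this essay - makes the choice of encoding explicit by declaring it before any markup proper begins. The single line below is a minimal sketch of such a declaration, not part of any of the later examples:

<?xml version="1.0" encoding="UTF-8"?>
<!-- everything after this declaration is interpreted as UTF-8 characters -->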

At another level, meaning is influenced by context. Many words have ambiguous meanings which depend on the context, particularly certain technical terms or acronyms. For example, Chambers' English Dictionary defines the acronym AA as having several meanings, among them Alcoholics Anonymous, the Automobile Association and the Architectural Association - doubtless there are more in other specialist contexts.

For different reasons, there is no guarantee that either a human or a machine interpreter would make the right selection from such a list. Misinterpretations are a fact of human communication; our logic systems are fuzzy, our usage often sloppy, and we make mistakes. More seriously, we throw up boundaries around certain fields by the use of jargon, buzzwords or slogans (see this month's second supplementary essay). Computers, on the other hand, follow rules, but their ability to interpret context is limited (and in the case of deriving clues to meaning in speech from intonation and body language, non-existent).

Markup compensates for some of these problems. Simply, the computer is not left to its own devices when it tries to interpret a text. The author of the text explicitly tells the machine what a particular word, sentence or passage actually is and how it should be handled.

Broadly, markup takes two forms: presentational (or stylistic) markup, which encodes information about how the text should be produced or displayed, and structural markup, which encodes information about what the text actually is. Presentational markup is important for technologies such as the WWW which rely more heavily on style (what font, colour scheme, layout etc. are to be used), but this is of less interest to us here than structural markup.
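The difference is easy to see in the markup itself. Of the two lines below, the first (real HTML) merely tells a display device to italicise some words; the second records what those words actually are. The element name booktitle is an invented illustration rather than part of any standard:

<i>The Lord of the Rings</i>                   <!-- presentational: "show this in italics" -->
<booktitle>The Lord of the Rings</booktitle>   <!-- structural: "this is the title of a book" -->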

Structural markup is based on the assumption that all texts have an underlying structure. For example, consider the structure of the novel The Lord of the Rings. The pseudo-code below is only the beginning of a complex code structure which could, in principle, mark up this entire epic novel. All markup information has been enclosed in angle brackets < > to distinguish it from the content. Note also how the scope of each piece of markup is limited by "closing tags", which include the / character. In some cases these cover only single passages, but in others (e.g. each volume) their scope is large and includes other sub-levels within it. Texts therefore have an inbuilt hierarchy. (All the following should be essentially familiar to anyone who has written web pages in HTML, although I have used tags which would not work in that particular markup language - see the discussion of XML below, however.)

<novel>
    <title>The Lord of the Rings</title>
    <author>J. R. R. Tolkien</author>
    <volume>
        <volumetitle>The Fellowship of the Ring</volumetitle>
        <book>
             <booknumber>Book One</booknumber>
             <chapter>
                 <chapternumber>1</chapternumber>                        
                 <chaptertitle>A Long-Expected Party</chaptertitle>
                 <paragraph>When Mr. Bilbo Baggins of Bag End announced
                      that he would shortly be celebrating his eleventy-first
                      birthday...</paragraph>

                 [etcetera]

             </chapter>
             <chapter>
                 <chapternumber>2</chapternumber>
                 <chaptertitle>The Shadow of the Past</chaptertitle>
                 <paragraph>The talk did not die down in nine or even 
                      ninety-nine days...  </paragraph>

                 [etcetera]

             </chapter>            

             [until eventually, at the end of the book and appendices...]

        </book>
    </volume>
</novel>

This markup is workable but unsophisticated. In several cases the tags above could be combined, thereby encoding more complex information within a single passage of text. For example:

<volume title="The Fellowship of the Ring">
    <book number="1">
         <chapter number="1" title="A Long-Expected Party">
             <paragraph>When Mr. <character race="hobbit">Bilbo 
                 Baggins</character> of Bag End announced that he would 
                 shortly be celebrating his eleventy-first birthday...  
             </paragraph>
             [and so on]
         </chapter>
         <chapter number="2" title="The Shadows of the Past">
             <paragraph>The talk did not die down in nine or even
                 ninety-nine days...  </paragraph>

Note also the extra information given about the words "Bilbo Baggins", which informs anyone handling or analysing the text of the words' context. "Bilbo Baggins" is not just a random string of thirteen bytes (one which means nothing in English or any other language); it actively symbolises a character in the novel. Similarly "The Fellowship of the Ring" actively symbolises one-third of the whole novel. In each case the short string of text is not only something which can be spoken; the words have a certain status in the context of the novel. Each may also be appropriated for other uses, such as when referring to the first of Peter Jackson's films. Here is a case where further markup (perhaps an attribute called "MEDIUM"?) would be required, as otherwise how would a computer "know" whether the reference was to the film or to the printed novel? And which edition of the novel, anyway?
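Such markup might look something like the lines below - a sketch only, in which the element and attribute names are invented for the purpose of illustration:

<title medium="film" year="2001">The Fellowship of the Ring</title>
<title medium="novel" edition="1954">The Fellowship of the Ring</title>

A tool processing the text could then treat references to the film and references to the printed volume quite differently.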

In principle, however, markup can deal with any such issue. Returning to the earlier example, one could envision the following markup in two different electronic texts:

I have recently applied to join the
<acronym title="Automobile Association">AA</acronym>

I have recently applied to join the
<acronym title="Architectural Association">AA</acronym>

Note that these are working examples of HTML markup, although some older browsers do not support them.

Markup's applications are limited only by the willingness to encode the information and, more significantly, the interface's ability to understand the markup. As I observed, not all browsers can handle the ACRONYM tag, despite the fact that it is now included in the HTML 4.0 standard. This is therefore a good point at which to discuss particular markup technologies, how they have developed over time, and the current situation.

SGML (Standard Generalized Markup Language) is the grandparent of markup technologies. It grew out of work begun by Charles Goldfarb and his colleagues at IBM in the late 1960s, and was standardised by the ISO in 1986. Users can create their own markup elements and rules as to how these are to be handled. But this is a complex and time-consuming process for individual authors. What have therefore developed since are, effectively, pre-fabricated families of tags and rules which can be applied in particular contexts. The most familiar of these is HTML (Hypertext Markup Language), first developed by Tim Berners-Lee. This emphasises the formatting of text over structure, and also makes it easier to include hypertext links in documents.
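In SGML (and in XML, discussed below), these author-defined rules are typically gathered into a Document Type Definition (DTD). The fragment below is a rough sketch of what a DTD for the attribute-based novel example above might contain - illustrative only, not a complete or authoritative definition:

<!-- a novel consists of one or more volumes; a volume of one or more books, and so on -->
<!ELEMENT novel     (volume+)>
<!ELEMENT volume    (book+)>
<!ATTLIST volume    title  CDATA #REQUIRED>
<!ELEMENT book      (chapter+)>
<!ATTLIST book      number CDATA #REQUIRED>
<!ELEMENT chapter   (paragraph+)>
<!ATTLIST chapter   number CDATA #REQUIRED
                    title  CDATA #REQUIRED>
<!-- paragraphs are text, optionally containing character elements -->
<!ELEMENT paragraph (#PCDATA | character)*>
<!ELEMENT character (#PCDATA)>
<!ATTLIST character race   CDATA #IMPLIED>

HTML, by contrast, comes with its set of elements and rules already fixed.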

HTML's ubiquity is not without its drawbacks. It works well enough for the formatting of text (and other things such as images) - i.e. presentational markup - but using ICT to analyse HTML text is difficult. The use of ICT in the scholarly analysis of text depends on good structural markup; HTML is not designed for this purpose, and it is also often sloppily written - something for which both authors and the designers of web browsers are responsible.
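A trivial illustration of the kind of sloppiness most browsers silently tolerate: the first pair of lines below omits closing tags and quotation marks, while the second pair is the well-formed equivalent (page.html is a placeholder filename):

<p>An <b>important point
<p>Another <a href=page.html>point</a>

<p>An <b>important</b> point</p>
<p>Another <a href="page.html">point</a></p>

A browser will display both versions without complaint, but a program trying to extract structure from the first has far more guesswork to do.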

Like HTML, XML is a simplified derivative of SGML, but it adheres more closely to the principles of structural markup, and more effectively enforces proper practice on those who use it to mark up text. As documents marked up in XML are more reliable than those marked up in HTML, it is now possible to use a further technology, XSLT (Extensible Stylesheet Language Transformations), to take an XML document and transform it into a variety of formats. HTML (web-friendly) would be one, but there is no reason why a second XSLT transformation could not be performed on the same file to produce content suitable for transmission to a mobile phone, say, or a digital television.
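To make this concrete, the fragment below is a minimal sketch of an XSLT stylesheet which would turn each chapter of the (invented) novel markup above into an HTML heading followed by paragraphs:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- each chapter becomes an HTML heading followed by its paragraphs -->
    <xsl:template match="chapter">
        <h2><xsl:value-of select="@title"/></h2>
        <xsl:apply-templates select="paragraph"/>
    </xsl:template>

    <!-- each paragraph element becomes an HTML p element -->
    <xsl:template match="paragraph">
        <p><xsl:value-of select="."/></p>
    </xsl:template>

</xsl:stylesheet>

A second stylesheet, written along the same lines but producing a different output format, could be applied to exactly the same source file.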

It is through this kind of application that suggestions made by Kevin Carey in this month's feature essay might be approached. To take one of his more involved examples, it may well be possible for an interpreted or "plain English" version of the Maastricht Treaty to be derived from the "pure" version. This, and similar projects, would require two things (a sketch of what the markup might look like follows the list):

  1. The willingness to go through a source document and mark it up appropriately
  2. The willingness to develop a text-handling application which could interpret this markup and then create the alternative version of the document according to certain rules.
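Purely as an illustration - every element name below is invented, and no attempt is made to reproduce the Treaty's actual wording - the markup for a single provision might pair the official text with an interpreted version, leaving a transformation tool to select whichever the reader asks for:

<provision number="1">
    <official>[the official wording of the provision]</official>
    <plain>[the same provision restated in plain English]</plain>
</provision>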

For a document as large and complex as the Maastricht Treaty this would not be a straightforward task! "Machine translation" between languages remains a limited and error-prone technique, and its limitations would merely be replicated when trying to automate translation "within" a language. Nevertheless, improvements are being made here, and in any case what is presently lacking (as Carey suggests) is not technology but the political will to spend time and energy providing alternative versions of text. Even HTML can quite easily encode hypertext links, as anyone who has browsed the Web will know. Therefore, placing source documents online, and linking to them from any other document which draws on that source, is straightforward if the will is there to do it. As this month's second supplementary essay argues, however, there are often political reasons to withhold sources.

Like many applications of ICT, exploiting the possibilities of electronic text is potentially democratising, for the reasons Carey describes. The convergence of computing and other technologies, such as TV or mobile phones, and the ability of XML/XSLT to produce real multimedia content, may even answer some critics who point to the increasing migration of important information to a medium (computing) which is expensive to everyone, and inaccessible to many. Electronic texts are more accessible; more analysable; more searchable; and more flexible than traditional printed words.

Yet the drawbacks of electronic text cannot be ignored either. Truly open texts, accessible (and perhaps even adjustable) by all, are firstly a threat to copyright and intellectual property; secondly, they may be politically sensitive. The book remains a very convenient medium for the storage of text, despite its rather one-dimensional nature. And as is becoming clear, the mere provision of information does not necessarily provoke the revitalisation of the public sphere which a more democratic, political society requires. (For reasons why, look at almost any other issue of Tangentium, as this is, essentially, our central theme.) At the present time intermediaries would still be required both to mark up text in the first place and to write tools which can interpret that markup: and any intermediary will insert their own assumptions and prejudices between the author and reader (as Carey observes). In the end we must remember that though "XML may help humans predict what information might lie 'between the tags'", computers have no intrinsic understanding of their own, and ultimately, to a computer "<trunk> and <i> and <booktitle> are all equally (and totally) meaningless" (both quotes from Robin Cover, XML and Semantic Transparency). Computers cannot invest any text, electronic or not, with meaning. Only humans can do that.

Despite all this, there seem to be certain types of text for which digitisation is eminently suited - public documents, as Carey notes. This essay has summarised the technologies which already exist to simplify and widen access to these documents - in all the multi-faceted ways which truly democratic "access" requires. Applying these to our public information sphere is now a political challenge as much as, or more than, a technological one.