A Corpus of late Modern English Prose

Summary
Content of corpus
Copyright
File format
How to get a copy
WordCruncher

Summary

A corpus of informal private letters by British writers, covering the period 1861 to 1919.
All six decades in that range are represented: four by about 20,000 words of text each, 1880-89 by only about 6,000, and 1890-99 by about 13,000.
The writers' dates of birth, however, span a narrower range: 1837-67.
Corpus constructed 1992-1994 by David Denison with the very considerable assistance of Graeme Trousdale and Linda van Bergen.

Professor David Denison
Department of Linguistics & English Language
University of Manchester
Manchester M13 9PL
U.K.

Content of corpus

The total number of words is approximately 100,000, made up of approx. 20,000 words in one randomly chosen block from each of:

Russell, Bertrand and Patricia Russell (eds.) 1937. The Amberley Papers: The Letters and Diaries of Lord and Lady Amberley, vol. 2, pp.512-71. London: Leonard & Virginia Woolf at the Hogarth Press.
Bell, Lady (ed.) 1927. The Letters of Gertrude Bell, vol. 1, pp.396-403; vol. 2, pp.404-55. London: Ernest Benn.
Flower, Desmond and Henry Maas (eds.) 1967. The Letters of Ernest Dowson. pp.110-59. London: Cassell.
Stephen, Leslie (ed.) 1901. Letters of John Richard Green. pp.72-123. London: Macmillan.
Mackenzie, Norman (ed.) 1978. The Letters of Sidney and Beatrice Webb, vol. 1, Apprenticeships 1873-1892, pp.270-319. Cambridge: Cambridge University Press in cooperation with the London School of Economics and Political Science.

For fuller details about the selection and coding of texts, see my article:

Denison, David. 1994. A corpus of late Modern English prose. In Merja Kytö, Matti Rissanen & Susan Wright (eds.), Corpora across the centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, St Catharine's College Cambridge, 25-27 March 1993 (Language and Computers - Studies in Practical Linguistics 11), 7-16. Amsterdam and Atlanta GA: Rodopi.

Copyright

Compilation and coding

Text of The Letters of Ernest Dowson

Other texts

The letters of Bell and the Webbs are out of copyright. We believe the Amberley and Green texts to be out of copyright too, having tried without success to trace copyright holders.

Permission for reproduction in the corpus has been given by C.U.P. for the Webbs and (now that a copyright fee has been paid) by A.U.P. for Dowson. I have not been able to find any other copyright obligations. As far as I am aware, all that is required is that users of the corpus should acknowledge the Dowson copyright holder in any research derived from the corpus. Reproduction of the text beyond the usual limits of fair dealing is, of course, not allowed. Please acknowledge the corpus in any published work that makes use of it.

File format

The text is stored in 7 files totalling a little under 600 Kb. The files are extended (8-bit) Ascii, and the text is coded as far as possible according to the conventions used in the Helsinki Corpus, that is, with COCOA-style codes giving information on writer, recipient, relationship, date, genre, page, etc., enclosed within angle brackets (<...>). Two subperiods are identified: items dated 1860-1889 are coded as L86 and 1890-1919 as L89. Note, though, that the "social" information - on relationship, social status, degree of formality, etc. - is not complete and is often deliberately underspecified. Almost all such bracketed codes are on separate lines and start in column 1, apart from embedded editorial comments on e.g. cancelled text. For further information on the coding system see

Kytö, Merja. 1994. Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts, 2nd edn. Helsinki: Helsinki University Press for Department of English, University of Helsinki.
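
As a rough illustration of the layout described above, the following Python sketch separates code lines from running text. It assumes only what is stated here: a code occupies its own line, starts in column 1 with "<" and ends with ">". The function name split_codes and the sample lines are invented for the example and are not taken from the corpus; the meaning of each code letter is documented in Kytö's manual, not in this sketch.

```python
import re

# Assumption: a code line is "<KEY value>" on a line of its own, starting in
# column 1. Lines with brackets embedded in running text are left as text.
CODE_LINE = re.compile(r"^<([^\s<>]+)\s*([^<>]*)>\s*$")

def split_codes(lines):
    """Separate bracketed code lines from running text.

    Returns (codes, text): codes is a list of (key, value) pairs,
    text is the list of remaining lines, unchanged.
    """
    codes, text = [], []
    for line in lines:
        match = CODE_LINE.match(line)
        if match:
            codes.append((match.group(1), match.group(2)))
        else:
            text.append(line)
    return codes, text

# Invented sample, for illustration only.
sample = [
    "<A WRITER NAME>",
    "<O DATE>",
    "Dear Friend,",
    "I was <an embedded comment> delighted to hear from you.",
]
codes, text = split_codes(sample)
```

Embedded editorial comments stay inside the text lines, so a second pass would be needed to strip or extract them.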

Lineation of the original editions is preserved, making for lines of variable length, maximum 95 characters. The "line-continues" characters noted in my article have been removed. Another change from the information published there is that the Green selection, like the Amberley selection, contains a diary entry in addition to the letters.

Original spelling is preserved, and a file is provided which lists correspondences between misspellings and standard British spellings and between abbreviations and expanded forms. The text is clean and now pretty accurate.

How to get a copy

The Corpus has been lodged with the Oxford Text Archive. You can get the text from the OTA over the Internet at no cost after completing their written application form, and I encourage you to do so. Scholars can also get the corpus from me on request. I will mail two versions in a zip file:

the 7-file "plain" version
a 1-file WordCruncher-indexed version with associated files (see below).

There is also a README file and a file of abbreviations and non-standard spellings.

WordCruncher

We used the text here with the now-obsolete WordCruncher 4.5 for DOS, which preindexes a text for rapid search and retrieval. We provide a version fully indexed for WordCruncher. (To get the full benefit you needed WordCruncher Viewer for DOS 4.1 or later.) This version of the corpus has identical text and coding to the Ascii version, but stored in a single file, and with additional marking of sentence and page boundaries. (Page boundary markers are always made to coincide with a sentence boundary.) The spelling and text files are concatenated in such a way that users can search either

the whole corpus
or, by the use of Bookmarks (which we have preset),
    material from any one of the five editions
    letters only, excluding journal entries

The lists of abbreviations/spellings can be referred to from within WordCruncher. They are not indexed, nor is editorial and reference coding within <...> or [...] brackets. This one large text file, LMODEPRS.BYB, is supplied with associated WordCruncher index and other files.

Even if you do not have WordCruncher, the file LMODEPRS.BYB can still be used with other software: it is an Ascii file of just under 600 Kb. (We have moved to using MonoConc Pro as our concordance program.) Sentence boundaries are marked with |s (vertical bar + s), page boundaries by |p, and books by |b.
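
For users working without WordCruncher, the boundary markers can be handled with a few lines of Python. The sketch below is only an illustration: it assumes that each marker precedes the unit it introduces, which is not stated above, and the function name segment and the sample string are invented. The book and page numbers it yields are running counts of |b and |p markers, not the page numbers of the printed editions.

```python
import re

# Markers described in the text: |b (book), |p (page), |s (sentence).
MARKER = re.compile(r"\|([bps])")

def segment(text):
    """Yield (book_count, page_count, sentence) triples from marked-up text.

    Assumption: a marker stands before the unit it introduces.
    """
    book = page = 0
    parts = MARKER.split(text)
    # parts[0] is any text before the first marker; after that,
    # marker letters and the text following them alternate.
    for kind, chunk in zip(parts[1::2], parts[2::2]):
        if kind == "b":
            book += 1
        elif kind == "p":
            page += 1
        sentence = " ".join(chunk.split())  # rejoin the preserved line breaks
        if sentence:
            yield book, page, sentence

# Invented sample, for illustration only.
sample = (
    "|b|p|s First sentence here.\n"
    "|s Second one,\nrunning over a line.\n"
    "|p|s Third, on a new page."
)
sentences = list(segment(sample))
```

Because the files are 8-bit rather than plain 7-bit Ascii, the encoding to use when reading LMODEPRS.BYB from disk has to be chosen by the user; the sketch works on a string already in memory for that reason.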

If you have further queries about the corpus, please ask me by e-mail at david.denison@manchester.ac.uk.

Page previously updated 11 May 2005; emended 23 February 2009 and 30 March 2013; URLs last updated 21 January 2020.