A Corpus of late Modern English Prose

Summary

Professor David Denison
Department of Linguistics & English Language
University of Manchester
Manchester M13 9PL
U.K.

e-mail: david.denison@manchester.ac.uk

Content of corpus

The total number of words is approximately 100,000, made up of approx. 20,000 words in one randomly chosen block from each of:

For fuller details about the selection and coding of texts, see my article:

Copyright

Compilation and coding

© 1994 University of Manchester (Department of English and American Studies, formerly Department of English Language & Literature).

Text of The Letters of Ernest Dowson

© Associated University Presses, NJ, USA.

Other texts

The letters of Bell and the Webbs are out of copyright. We believe the Amberley and Green texts to be out of copyright too, having tried without success to trace copyright holders.

Permission for reproduction in the corpus has been given by C.U.P. for Amberley and (now that a copyright fee has been paid) by A.U.P. for Dowson. I have not been able to find any other copyright obligations. As far as I am aware, all that is required is that users of the corpus should acknowledge the Dowson copyright holder in any research derived from the corpus. Reproduction of the text beyond the usual limits of fair dealing is, of course, not allowed. Please acknowledge the corpus in any published work that makes use of it.

File format

The text is stored in 7 files totalling a little under 600 Kb. The files are extended (8-bit) ASCII, and the text is coded as far as possible according to the conventions used in the Helsinki Corpus, that is, with COCOA-style brackets giving information on writer, recipient, relationship, date, genre, page, etc., enclosed within carets (angle brackets). Two subperiods are identified: items dated 1860-1889 are coded as L86 and 1890-1919 as L89. Note, though, that the "social" information (on relationship, social status, degree of formality, etc.) is not complete and is often deliberately underspecified. Almost all such caret brackets are on separate lines and start in column 1, apart from embedded editorial comments on, e.g., cancelled text. For further information on the coding system see

Lineation of the original editions is preserved, making for lines of variable length, with a maximum of 95 characters. The "line-continues" characters noted in my article have been removed. Another change from the information published there is that the Green selection, like the Amberley selection, contains a diary entry in addition to the letters.
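For users who want to process the files programmatically, the conventions described in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a description of how the corpus was actually built: it assumes that a reference line has the shape `<KEY value>` and stands alone on a line starting in column 1, and the key letters and sample lines below are invented placeholders rather than the corpus's actual codes. Only the subperiod labels (L86, L89) and the 95-character limit are taken from the description above.

```python
import re

# Assumed shape of a reference line: "<KEY value>" alone on a line, from column 1.
CODE_LINE = re.compile(r"^<(?P<key>\S+)\s+(?P<value>[^>]*)>\s*$")

MAX_LINE_LENGTH = 95  # maximum line length stated in the corpus description


def split_codes_and_text(lines):
    """Separate caret-bracketed reference lines from lines of running text."""
    codes, text = [], []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        match = CODE_LINE.match(line)
        if match:
            codes.append((number, match.group("key"), match.group("value").strip()))
        else:
            # Embedded editorial comments stay inside the text line untouched.
            text.append((number, line))
    return codes, text


def subperiod(year):
    """Map a year to the subperiod label given in the description."""
    if 1860 <= year <= 1889:
        return "L86"
    if 1890 <= year <= 1919:
        return "L89"
    return None  # outside the two subperiods covered by the corpus


def overlong_lines(lines, limit=MAX_LINE_LENGTH):
    """Return the numbers of any lines longer than the stated maximum."""
    return [n for n, raw in enumerate(lines, start=1)
            if len(raw.rstrip("\r\n")) > limit]


# Invented sample; the key letters are placeholders, not the corpus's real codes.
sample = [
    "<A HYPOTHETICAL WRITER>",
    "<O 1891>",
    "Dear Sir,",
    "I write in haste <cancelled word> to thank you.",
]
codes, text = split_codes_and_text(sample)
```

When reading the real files, an 8-bit encoding has to be named explicitly when opening them (for example `encoding="latin-1"`); which code page the files actually use is not stated here, so that choice is an assumption to check against the accented characters in the text.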

Original spelling is preserved, and a file is provided which lists correspondences between misspellings and standard British spellings and between abbreviations and expanded forms. The text is clean and now pretty accurate.

How to get a copy

The Corpus has been lodged with the Oxford Text Archive. You can get the text from the OTA over the Internet at no cost after completing their written application form, and I encourage you to do so. Scholars can also get the corpus from me on request. I will mail two versions in a zip file:

There is also a README file and a file of abbreviations and non-standard spellings.

WordCruncher

We used the text here with the now-obsolete WordCruncher 4.5 for DOS, which preindexes a text for rapid search and retrieval. We provide a version fully indexed for WordCruncher. (To get the full benefit you needed WordCruncher Viewer for DOS 4.1 or later.) This version of the corpus has identical text and coding to the ASCII version, but is stored in a single file, with additional marking of sentence and page boundaries. (Page boundary markers are always made to coincide with a sentence boundary.) The spelling and text files are concatenated in such a way that users can search either

The lists of abbreviations/spellings can be referred to from within WordCruncher. They are not indexed, nor is editorial and reference coding within <...> or [...] brackets. This one large text file, LMODEPRS.BYB, is supplied with associated WordCruncher index and other files.

Even if you do not have WordCruncher, the file LMODEPRS.BYB can still be used with other software: it is an ASCII file of just under 600 Kb. (We have moved to using MonoConc Pro as our concordance program.) Sentence boundaries are marked with |s (vertical bar + s), page boundaries with |p, and book boundaries with |b.
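As an illustration of using the single-file version without WordCruncher, the boundary markers can be handled with a few lines of Python. This is only a sketch: the sample string is invented, and it assumes nothing about whether a marker precedes or follows the unit it delimits, since splitting on |s gives the same pieces either way. Editorial and reference coding within brackets is left in place.

```python
import re

# The three boundary markers named above: |s sentence, |p page, |b book.
MARKER = re.compile(r"\|([spb])")


def count_markers(text):
    """Count sentence, page and book boundary markers in a text."""
    counts = {"s": 0, "p": 0, "b": 0}
    for match in MARKER.finditer(text):
        counts[match.group(1)] += 1
    return counts


def split_sentences(text):
    """Split a text into sentences on |s, discarding |p and |b markers."""
    without_page_and_book = re.sub(r"\|[pb]", "", text)
    pieces = without_page_and_book.split("|s")
    # Collapse line breaks and runs of spaces; drop empty pieces.
    return [" ".join(piece.split()) for piece in pieces if piece.strip()]


# Invented sample text, not taken from the corpus.
sample = "|b|p|s Dear Sir, |s I thank you\nfor your letter. |p|s Yours ever,"
sentences = split_sentences(sample)
counts = count_markers(sample)
```

Because page markers always coincide with a sentence boundary, dropping |p before splitting does not merge material from two sentences.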

If you have further queries about the corpus, please ask me by e-mail at david.denison@manchester.ac.uk.


Page previously updated 11 May 2005, emended 23 February 2009, last updated 30 March 2013.