The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English

1 November 2003: Although the Brooklyn Corpus is still available, prospective users should instead download The York-Toronto-Helsinki Parsed Corpus of Old English Prose, a much larger corpus with much more detailed annotation.

The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English (henceforth the Brooklyn Corpus) is a selection of texts from the Old English Section of the Helsinki Corpus of English Texts (henceforth the Helsinki Corpus), annotated to facilitate searches on lexical items and syntactic structure. It is intended for the use of students and scholars of the history of the English language. The Brooklyn Corpus contains 106,210 words of Old English text; the samples from the longer texts are 5,000 to 10,000 words in length. The texts included in the corpus represent a range of dates of composition, authors, and genres. The texts are syntactically and morphologically annotated, and each word is glossed. The size of the corpus is approximately 12 megabytes.

The Brooklyn Corpus is the joint project of five linguists: Susan Pintzuk (University of York, UK), Eric Haeberli (University of Geneva, Switzerland, and University of Reading, UK), Ans van Kemenade (University of Nijmegen, the Netherlands), Willem Koopman (University of Amsterdam, the Netherlands), and Frank Beths (Vrije Universiteit, Amsterdam, the Netherlands, and University of York, UK). Pintzuk's work was funded by grant #RT-21583-94 from the National Endowment for the Humanities (USA), an independent federal agency. Van Kemenade, Koopman, and Beths were responsible for the design and implementation of the morphological annotation scheme, which was based on the one developed for the Old French corpus of Anthonij Dees at Vrije Universiteit. Pintzuk and Haeberli were responsible for the design and implementation of the syntactic annotation scheme, which was based on the one developed at the University of Pennsylvania for the first edition of the Penn-Helsinki Parsed Corpus of Middle English. Our intent was to make the syntactic annotation of the two corpora as similar as possible, while taking into account the syntactic and morphological differences between Old and Middle English.

Each text in the Brooklyn Corpus is supplied in four formats, each as a separate file with the same name and a different extension. The formats are suited to different search tools.

The Brooklyn Corpus is available without fee for educational and research purposes, but it is not in the public domain. Copyright to the Helsinki Corpus texts in their computerized form is retained by the Helsinki Corpus (© 1991); copyright to the syntactic annotation is retained by Susan Pintzuk and Eric Haeberli (© 2000); copyright to the completely annotated files is retained by Susan Pintzuk, Eric Haeberli, Ans van Kemenade, Willem Koopman, and Frank Beths (© 2000); and copyright to the Brooklyn Corpus Manual is retained by Susan Pintzuk (© 2000). Some of the original texts are also under copyright and are distributed under permission granted to the Helsinki Corpus.

Downloading the manual is unrestricted, but the texts themselves and the Perl search scripts are available only to users who formally agree to the conditions of use by filling out the access request form and returning it via e-mail to Susan Pintzuk.

The manual for the Brooklyn Corpus is available in three formats:

The Brooklyn Corpus is part of a larger project to produce syntactically annotated corpora for all stages of the history of English: