The Penn-Helsinki Parsed Corpus of Middle English, Second Edition



Annotation, General Introduction

The Corpus

The second edition of the Penn-Helsinki Parsed Corpus of Middle English (PPCME2) consists of 1.3 million words of syntactically annotated Middle English prose. It is based on the Middle English portion of the Helsinki Corpus of English Texts, which was created under the direction of Matti Rissanen and Ossi Ihalainen, University of Helsinki. The annotation work was done with the support of the National Science Foundation (Grants #BNS89-19701 and #SBR95-11368) and with supplementary support from the University of Pennsylvania Research Foundation.

Information on the texts included in the corpus, including the date, dialect and genre of each text, is available in the /info directory.

Philosophy and Goals

  1. Our primary goal has been to create an annotation system that facilitates automatic searches, not to give a correct linguistic analysis of each sentence. Thus, if a construction can be found unambiguously through a combination of properties of a bracketed sentence, the annotation may not contain all of the structure that a full phrase structure diagram of the sentence would have.
  2. We have tried to plan our system so that it only includes annotations that will not need to be revised later. All changes should add detail rather than revise previous bracketings. This goal implies that subjective judgments must be avoided since they are extremely error prone; so, for example, we have not distinguished verbal from adjectival passive participles.
  3. As many categories as possible should have clear meanings so that unclear cases should wind up in a small number of categories of residual cases. The price of making most categories homogeneous is that these residual categories will not be. In further revisions of the corpus some of these residual categories may be divided into subclassifications that are homogeneous, to the extent that this is objectively possible.
  4. As much as possible, we have avoided making decisions that would be controversial, either linguistically or as far as interpretation is concerned. Where there is room for doubt as to the correct parse or interpretation of the sentence we have either used a mechanical rule to decide the case for searching purposes or we have left structure unmarked. Adherence to this criterion has made it impossible for us to include a VP constituent in most cases since, given the word order variation/change in Middle English and the possibility of the existence of scrambling, the boundaries of the VP are all too often indeterminate.

General Format Of The Texts

File Types
Text Files
Part-of-Speech Files
Parsed Files
Text Markup

File Types

The PPCME2 comes in three different formats: (1) a text-only file, (2) a part-of-speech (POS) tagged file, and (3) a parsed file.

Text Files

Text files contain, in addition to the text, the Helsinki TEXT LEVEL CODES (see Kytö 1991, p.28), converted into HTML type codes, as outlined in Section TEXT MARKUP. The original page layout is not retained. Rather, the text is divided into tokens on the same basis as the POS and parsed files; that is, each token contains one main clause (with any subordinate clauses it incorporates) and a token ID. The ID consists of the Helsinki filename (e.g. CMHALI, the title reference for the file cmhali.m1.txt), followed by a comma, the page number from the printed text, a period, and the token number in the computer file. The whole ID is enclosed in parentheses. Tokens consisting entirely of CODE material (i.e. text level codes) do not have IDs, although they are counted by the token counter; as a result, there are some gaps in the token numbers. Punctuation in text files is separated from the words in order to simplify searches. All text filenames have the extension .txt.
 

<P_2>

<heading>

I . (CMMALORY,2.3)

Merlin (CMMALORY,2.4)

</heading>

HIT befel in the dayes of Uther Pendragon , when he was kynge of all
Englond and so regned , that there was a myghty duke in Cornewaill that
helde warre ageynst hym long tyme . (CMMALORY,2.6)

and the duke was called the duke of Tyntagil . (CMMALORY,2.7)

And so by meanes kynge Uther send for this duk chargyng hym to brynge
his wyf with hym . (CMMALORY,2.8)

for she was called a fair lady and a passynge wyse . (CMMALORY,2.9)

and her name was called Igrayne . (CMMALORY,2.10)

So whan the duke and his wyf were comyn unto the kynge , by the meanes
of grete lordes they were accorded bothe . (CMMALORY,2.11)
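
The ID layout described above can be taken apart mechanically. The following is a minimal sketch, not part of the corpus distribution; the names TokenID and parse_token_id are hypothetical. It assumes that the token number is whatever follows the last period, since page references are not always plain integers (compare CMANCRIW,II.53.502 elsewhere in this documentation).

```python
import re
from typing import NamedTuple


class TokenID(NamedTuple):
    text: str   # title reference, e.g. "CMMALORY"
    page: str   # page reference from the printed text, kept as a string
    token: int  # token number in the computer file


# Optional surrounding parentheses, then: reference, comma, page, period, number.
# The page part is greedy, so only the LAST period separates the token number.
_ID_RE = re.compile(r"^\(?([^,\s()]+),(.+)\.(\d+)\)?$")


def parse_token_id(raw: str) -> TokenID:
    """Split a token ID such as "(CMMALORY,2.6)" into its three parts."""
    match = _ID_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a token ID: {raw!r}")
    return TokenID(match.group(1), match.group(2), int(match.group(3)))


if __name__ == "__main__":
    print(parse_token_id("(CMMALORY,2.6)"))
    print(parse_token_id("CMANCRIW,II.53.502"))
```

Keeping the page reference as a string avoids guessing at how volume or part numbers should be interpreted.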

Part-of-Speech Files

Part-of-speech tagged texts contain the original text with a part-of-speech tag added to each word. The text is divided into tokens in the same way as the text files. Editorial material is given the tag CODE. Tokens consisting entirely of CODE material do not have an ID. All POS filenames have the extension .pos.
<P_2>_CODE

<heading>_CODE

I_NUM ._. CMMALORY,2.3_ID

Merlin_NPR CMMALORY,2.4_ID

</heading>_CODE

HIT_PRO befel_VBD in_P the_D dayes_NS of_P Uther_NPR Pendragon_NPR ,_,
when_P he_PRO was_BED kynge_N of_P all_Q Englond_NPR and_CONJ so_ADV
regned_VBD ,_, that_C there_EX was_BED a_D myghty_ADJ duke_N in_P
Cornewaill_NPR that_C helde_VBD warre_N ageynst_P hym_PRO long_ADJ
tyme_N ,_. CMMALORY,2.6_ID

and_CONJ the_D duke_N was_BED called_VAN the_D duke_N of_P Tyntagil_NPR
._. CMMALORY,2.7_ID

And_CONJ so_ADV by_P meanes_NS kynge_NPR Uther_NPR send_VBD for_P
this_D duk_N chargyng_VAG hym_PRO to_TO brynge_VB his_PRO$ wyf_N with_P
hym_PRO ,_. CMMALORY,2.8_ID

for_CONJ she_PRO was_BED called_VAN a_D fair_ADJ lady_N and_CONJ a_D
passynge_ADV wyse_ADJ ,_. CMMALORY,2.9_ID

and_CONJ her_PRO$ name_N was_BED called_VAN Igrayne_NPR ._.
CMMALORY,2.10_ID

So_ADV whan_P the_D duke_N and_CONJ his_PRO$ wyf_N were_BED comyn_VBN
unto_P the_D kynge_N ,_, by_P the_D meanes_NS of_P grete_ADJ lordes_NS
they_PRO were_BED accorded_VAN bothe_Q ._. CMMALORY,2.11_ID
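
As an illustration of the word_TAG layout, the sketch below reads a .pos file into tokens. It is only a sketch under two assumptions drawn from the sample above: tokens are separated by blank lines, and the tag is whatever follows the last underscore of an item (items such as <P_2>_CODE contain underscores of their own). The function names split_item and read_pos_tokens are hypothetical.

```python
import re
from typing import List, Optional, Tuple

Item = Tuple[str, str]  # (word, tag)


def split_item(item: str) -> Item:
    """Split "word_TAG" at the last underscore."""
    word, sep, tag = item.rpartition("_")
    if not sep or not word or not tag:
        raise ValueError(f"not a word_TAG item: {item!r}")
    return word, tag


def read_pos_tokens(text: str) -> List[Tuple[Optional[str], List[Item]]]:
    """Return (token ID or None, tagged items) for each blank-line-separated token."""
    tokens = []
    for block in re.split(r"\n\s*\n", text.strip()):
        items = [split_item(raw) for raw in block.split()]
        if not items:
            continue
        token_id = None
        # Tokens made up entirely of CODE material carry no ID.
        if items[-1][1] == "ID":
            token_id = items.pop()[0]
        tokens.append((token_id, items))
    return tokens


if __name__ == "__main__":
    sample = "<P_2>_CODE\n\nI_NUM ._. CMMALORY,2.3_ID\n"
    for token_id, items in read_pos_tokens(sample):
        print(token_id, items)
```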

Parsed Files

The parsed files contain a labelled bracketing of the text, with the same clause division as in the text and POS files. The POS tags are included in the parsed file as the first set of labelled parens surrounding a word. The ID ends the token and the whole is enclosed in parens.
( (CODE <P_2>))
( (CODE <heading>))
( (NUMP (NUM I) 
        (E_S .)) (ID CMMALORY,2.3))
( (NP (NPR Merlin)) (ID CMMALORY,2.4))
( (CODE </heading>))
( (IP-MAT (NP-SBJ-1 (PRO HIT))
          (VBD befel)
          (PP (P in)
              (NP (D the) (NS dayes)
                  (PP (P of)
                      (NP (NPR Uther) (NPR Pendragon)))))
          (, ,)
          (PP (P when)
              (CP-ADV (C 0)
                      (IP-SUB (IP-SUB (NP-SBJ (PRO he))
                                      (BED was)
                                      (NP-OB1 (N kynge)
                                              (PP (P of)
                                                  (NP (Q all) (NPR Englond)))))
                              (CONJP (CONJ and)
                                     (IP-SUB (NP-SBJ *con*) 
                                             (ADVP (ADV so))
                                             (VBD regned))))))
          (, ,)
          (CP-THT-1 (C that)
                    (IP-SUB (NP-SBJ-2 (EX there))
                            (BED was)
                            (NP-2 (D a) (ADJ myghty) (N duke)
                                  (CP-REL *ICH*-3))
                            (PP (P in)
                                (NP (NPR Cornewaill)))
                            (CP-REL-3 (WNP-4 0)
                                      (C that)
                                      (IP-SUB (NP-SBJ *T*-4)
                                              (VBD helde)
                                              (NP-OB1 (N warre))
                                              (PP (P ageynst)
                                                  (NP (PRO hym)))
                                              (NP-MSR (ADJ long) (N tyme))))))
          (E_S ,)) (ID CMMALORY,2.6))
( (IP-MAT (CONJ and)
          (NP-SBJ-1 (D the) (N duke))
          (BED was)
          (VAN called)
          (IP-SMC (NP-SBJ *-1)
                  (NP-OB1 (D the) (N duke)
                          (PP (P of)
                              (NP (NPR Tyntagil)))))
          (E_S .)) (ID CMMALORY,2.7))
( (IP-MAT (CONJ And)
          (ADVP (ADV so))
          (PP (P by)
              (NP (NS meanes)))
          (NP-SBJ (NPR kynge) (NPR Uther))
          (VBD send)
          (PP (P for)
              (NP (D this) (N duk)))
          (IP-PPL (VAG chargyng)
                  (NP-OB1 (PRO hym))
                  (IP-INF (TO to)
                          (VB brynge)
                          (NP-OB1 (PRO$ his) (N wyf))
                          (PP (P with)
                              (NP (PRO hym)))))
          (E_S ,)) (ID CMMALORY,2.8))
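
The labelled bracketing can be read into nested lists with a small stack-based reader. This is a minimal sketch rather than a description of any official tool; parse_trees and leaves are hypothetical names, and the code assumes that words in the parsed files never contain parentheses themselves.

```python
import re
from typing import Iterator, List, Tuple

# A token is either a single parenthesis or a run of non-space, non-paren characters.
_TOKEN_RE = re.compile(r"[()]|[^\s()]+")


def parse_trees(text: str) -> List[list]:
    """Parse bracketed text into nested lists, one list per top-level token."""
    stack: List[list] = [[]]
    for tok in _TOKEN_RE.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced closing parenthesis")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced opening parenthesis")
    return stack[0]


def leaves(node: list) -> Iterator[Tuple[str, str]]:
    """Yield (label, text) for every node of the form (LABEL text)."""
    if len(node) == 2 and isinstance(node[0], str) and isinstance(node[1], str):
        yield node[0], node[1]
        return
    for child in node:
        if isinstance(child, list):
            yield from leaves(child)


if __name__ == "__main__":
    trees = parse_trees("( (NP (NPR Merlin)) (ID CMMALORY,2.4))")
    print(list(leaves(trees[0])))
```

Each top-level list corresponds to the unlabelled outer pair of parens; its children are the parse itself and, where present, the ID node. Empty categories such as (C 0) or (NP-SBJ *con*) come out as ordinary leaves and can be filtered by the caller.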

Text Markup

In general it has not been possible to retain the markup conventions of the Helsinki Corpus exactly because of conflicts with the annotation system. The major changes made are as follows:
  1. The representation of the text as it was printed on the page has been lost. The text is presented in main clause units, rather than line by line.
  2. All TEXT LEVEL CODES (see Kytö 1991, p.28) which occur inside the text have been changed to HTML type codes or omitted as follows:
    1. Single-word emendations are preceded by a dollar sign (e.g. $the), while multi-word emendations are surrounded by <em>...</em>. Emendations include both those in the text as printed and changes made to the text by Penn.
    2. Headings that are part of the original text are designated by <heading>...</heading>. Headings added by the editor are contained in {ED:...} like other comments by editors (see below). Some editorial headings may be omitted. Headings in the Helsinki samples are in all caps.
    3. Language codes are omitted.
    4. Font codes are retained.
    5. Editor comments are either omitted or enclosed in {ED:...}. Comments added by Helsinki or Penn are treated the same way, except that they carry the label COM: instead of ED:.
      Al_Q so_ADV hali_ADJ scrift_N bi+d_BEP in_P mine_PRO$
      {ED:ure_WRITTEN_ABOVE_THE_LINE}_CODE wunde_N hwan_P we_PRO scale_MD
      festen_VB ._, and_CONJ fleis_N bileuen_VB and_CONJ muchel_Q of_P
      ure_PRO$ {ED:mine_WRITTEN_ABOVE_THE_LINE}_CODE wille_N for_P ure_PRO$
      {ED:mine_WRITTEN_ABOVE_THE_LINE}_CODE wrechede_NS ._. CMLAMB1,83.196_ID
      
      (NODE (IP-SUB (IP-SUB-2 (NP-SBJ (PRO we))
                              (MD scale)
                              (VB festen))
                    (, .)
                    (CONJP (CONJ and)
                           (IP-SUB=2 (NP-OB1 (NP (N fleis))
                                             (CONJP *ICH*-1)) 
                                     (VB bileuen) 
                                     (CONJP-1 (CONJ and) 
                                              (NP (Q muchel)
                                                  (PP (P of)
                                                      (NP (PRO$ ure) (CODE {ED:mine_WRITTEN_ABOVE_THE_LINE})
                                                          (N wille)))))
                                     (PP (P for)
                                         (NP (PRO$ ure) (CODE {ED:mine_WRITTEN_ABOVE_THE_LINE})
                                             (NS wrechede))))))
            (E_S .)) (ID CMLAMB1,83.196))
      
      
    6. Parentheses are represented as ... unless they indicate emendations in the original text in which case they are treated like other emendations.
All editorial material in the files, such as the text level codes, as well as comments, page numbers, etc., is labelled CODE to differentiate it from the contents of the text itself.
<P_73>_CODE     <--- page number

<heading>_CODE  <--- begin heading code

VII_NUM ._, {COM:Trinity_Homily_IV}_CODE <--- comment added by Penn
CREDO_NPR ._. LAMB1,73.3_ID

</heading>_CODE <--- end heading code

( (CODE <P_73>)) 
( (CODE <heading>))
( (FRAG (NUM VII) (CODE {COM:Trinity_Homily_IV}) 
        (, .)
        (LATIN (FW CREDO)) 
        (E_S .)) (ID CMLAMB1,73.3))
( (CODE </heading>))
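
Since all editorial material carries the label CODE, a search tool can sort CODE items by their shape. The sketch below is one hypothetical way to do so (classify_code is not a corpus utility). It recognises only the forms shown above: page numbers, opening and closing HTML type codes, and {ED:...} and {COM:...} comments; anything else is returned unchanged as "other".

```python
import re
from typing import Tuple


def classify_code(code: str) -> Tuple[str, str]:
    """Return (kind, value) for the content of one CODE item."""
    # Page numbers, e.g. <P_73>
    match = re.fullmatch(r"<P_(.+)>", code)
    if match:
        return ("page", match.group(1))
    # Opening and closing HTML type codes, e.g. <heading> and </heading>
    match = re.fullmatch(r"<(/?)([A-Za-z]+)>", code)
    if match:
        return ("close" if match.group(1) else "open", match.group(2))
    # Editor comments {ED:...} and Helsinki/Penn comments {COM:...}
    match = re.fullmatch(r"\{(ED|COM):(.*)\}", code)
    if match:
        return (match.group(1).lower(), match.group(2))
    return ("other", code)


if __name__ == "__main__":
    for item in ["<P_73>", "<heading>", "{COM:Trinity_Homily_IV}", "</heading>"]:
        print(item, classify_code(item))
```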

In cases where the PPCME2 pos-tagged or parsed text differs from the printed text, this is indicated by enclosing the original in (CODE {TEXT:...}) and marking the changed word(s) as emendations. The text may be changed for the following reasons:
  1. Certain cases of multiple words written as one are split. In general, words are split only when necessary for the parse; that is, when the two words are members of different constituent phrases (as defined in the PPCME2, not necessarily as defined by linguistic theory). If the words belong to the same constituent phrase, they are left together. In the case of PPs, a preposition is not separated from the following word if that word is a single-word complement to the preposition (noun, adverb, particle, etc.); rather, the complex is given the tag P+N (or P+ADV, etc.) and at the syntactic level is simply bracketed as a PP, e.g. (PP (P+N imu+d)). If, on the other hand, the following word is part of a multi-word NP complement, the preposition is separated. Although these changes are made to facilitate parsing, they are also reflected in the pos-tagged files in order to retain compatibility.
    ( (IP-MAT (NP-SBJ (NPR$ Godes) (N word)
                      (, ,)
                      (NP-PRN (NPR$ godes) (N r+ad)))
              (BEP is)
              (NP-OB1 (PRO$ +din) (N unwine))
              (, ,)
              (PP (P $for)                     <--- for+dat split
                  (CP-ADV (C $+dat)
                          (CODE {TEXT:for+dat})
                          (IP-SUB (NP-SBJ (PRO hit))
                                  (NEG ne)
                                  (VBP sei+d)
                                  (NEG noht)
                                  (ADVP (Q al)
                                        (CP-REL (WNP-1 0)
                                                (C +dat)
                                                (IP-SUB (NP-OB1 *T*-1)
                                                        (NP-SBJ (PRO tu))
                                                        (VBD woldest)))))))
              (E_S .)) (ID CMVICES1,75.855))
    
    ( (IP-MAT (CONJ and)
              (ADVP (ADV eft))
              (NP-SBJ-1 (PRO hit))
              (VBP +gelimp+d)
              (CP-THT-1 (C +dat)
                        (IP-SUB (NP-SBJ (D a) (N mann))
                                (VBP cum+t)
                                (PP (P $to)        <--- tan split
                                    (NP (D $an) (CODE {TEXT:tan}) (OTHER o+der)))
                                (PP (P +durh)
                                    (NP (NPR$ dieules) (N mene-+ginge)))
                                (, ,)
                                (PP (P +teih)
                                    (CP-ADV (C 0)
                                            (IP-SUB (NP-SBJ (PRO he))
                                                    (NP-OB1 (PRO hit))
                                                    (NEG naht)
                                                    (NEG ne)
                                                    (VBP wite))))))
              (E_S .)) (ID CMVICES1,101.1203))
    
    (NP-SBJ (D+OTHER ano+der) (NPR$ godes) (N +giue)) <--- ano+der not split
    
    (ADJP (D+ADJR +dunwor+dere))     <--- +dunwor+dere not split
    
    ( (IP-MAT (PP (ADV+P Hierfore))
              (NP-SBJ (PRO ic))
              (BEP am)
              (ADJP (ADJ ne+der) (CONJ and) (ADJ unmihti))
              (, ,)
              (PP (P+D for+dan)      <--- for+dan not split
                  (CP-ADV (C 0) 
                          (IP-SUB (IP-SUB-1 (NP-SBJ (PRO ic))
                                            (HVP habbe)
                                            (BEN +geben)
                                            (ADJP (ADJ prud) (CONJ and) (ADJ modi)))
                                  (, ,)
                                  (CONJP (CONJ and)
                                         (IP-SUB=1 (NP-MSR (Q michel))
                                                   (VBN ilaten)
                                                   (PP (P of)
                                                       (NP (PRO me) (N seluen))))))))
              (E_S .)) (ID CMVICES1,5.40))
    
    ( (IP-MAT (NP-SBJ (N Smellunge) (CONJ &) (N smechunge))
              (BEP beo+d)
              (PP (P+N imu+d))    <--- imu+d not split
              (Q ba+de)
              (PP (P ase)
                  (CP-ADV (WADVP-1 0)
                          (C 0)
                          (IP-SUB (ADVP *T*-1)
                                  (NP-SBJ (N sich+de))
                                  (BEP *)
                                  (PP (P+N inech+ge)))))   <--- inech+ge not split
              (E_S .)) (ID CMANCRIW,II.53.502))
    
    ( (IP-MAT (PP (ADV+P Hierafter))     <--- R-pronouns not split
              (VBP cum+t)                     from prepositions
              (NP-SBJ (D an) (OTHER o+der)
                      (, ,)
                      (CP-REL (WNP-1 0)
                              (C +de)
                              (IP-SUB (NP-SBJ-2 *T*-1)
                                      (BEP is)
                                      (VAN i-cleped)
                                      (IP-SMC (NP-SBJ *-2) 
                                              (NP-OB1 (FW superbia)
                                                      (, ,)
                                                      (IP-MAT-PRN (NP-SBJ (D +tat))
                                                                  (BEP is)
                                                                  (, ,)
                                                                  (NP-OB1 (N modinesse))))))))
              (E_S .)) (ID CMVICES1,5.35))
    
    
  2. Clear cases of error in the manuscript are sometimes corrected, especially when the error yields the wrong part of speech. Commonly the correction is the one suggested by the editor, but occasionally corrections are made without outside support.
    ( (IP-MAT-SPE (' ')
                  (NP-SBJ (D +De) (N mann))
                  (NEG ne)
                  (VBP leue+d)
                  (NEG naht)
                  (PP (P $be)
                      (CODE {TEXT:he})
                      (NP (N bread) (FP ane)))
                  (E_S ,)) (ID CMVICES1,89.1030))
    
    ( (IP-MAT-SPE (NP-SBJ (PRO $we))
                  (CODE {TEXT:+te})
                  (MD wulle+d)
                  (VB fole+ge)
                  (NP-OB1 (PRO +te))) (ID CMANCRIW,II.130.1708))
    
    (IP-SUB (NP-SBJ (N mihte))
            (NP-OB1 (PRO $+te))
            (NEG $ne)
            (CODE {TEXT:+te_+te})
            (VBP atiere+d)) (ID CMTRINIT,29.395))
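
To review where the PPCME2 text departs from the printed text, the emended words and the original readings can be pulled out of a parsed token. The following is a rough sketch under the conventions described above (a leading dollar sign on an emended word, and the original reading in (CODE {TEXT:...})); the function name emendations is hypothetical, and multi-word emendations are not handled.

```python
import re
from typing import Dict, List

# A leaf whose word starts with "$", e.g. (P $for); the "$" itself is dropped.
_EMENDED_RE = re.compile(r"\(\S+ \$([^\s()]+)\)")
# The original reading, e.g. (CODE {TEXT:for+dat})
_TEXT_RE = re.compile(r"\(CODE \{TEXT:([^}]*)\}\)")


def emendations(parsed_token: str) -> Dict[str, List[str]]:
    """Collect emended words and original readings from one parsed token."""
    return {
        "emended": _EMENDED_RE.findall(parsed_token),
        "original": _TEXT_RE.findall(parsed_token),
    }


if __name__ == "__main__":
    token = """(IP-SUB (NP-SBJ (N mihte))
            (NP-OB1 (PRO $+te))
            (NEG $ne)
            (CODE {TEXT:+te_+te})
            (VBP atiere+d))"""
    print(emendations(token))
```

Tags that end in a dollar sign, such as PRO$, are not mistaken for emendations, because the pattern requires the dollar sign to begin the word rather than end the tag.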