The Parsed Corpus of Early English Correspondence: Metadata

The contents of the metadata node
How to search the metadata node
- Using CorpusSearch 2
- Using CorpusSearch 1.1

In addition to parsed sentences, the PCEEC contains searchable sociolinguistic information for each token. This information is collectively referred to as 'metadata' and occurs in its own node (labelled METADATA), as the first node inside the wrapper.

( (METADATA (AUTHOR NICHOLAS_BACON_II:MALE:BROTHER:1543:26)
            (RECIPIENT NATHANIEL_BACON_I:MALE:BROTHER:1546?:23?)
            (LETTER BACON_001:E1:1569:AUTO:FAMILY_NUCLEAR))
  (IP-MAT (CONJ nor)
          (NP-1 (D the) (N commyssion)
                (PP (P for)
                    (NP (D the) (N pease))))
          (NP-SBJ (PRO I))
          (ADVP-TMP (ADV never))
          (VBD harde)
          (PP (P of)
              (NP *ICH*-1))
          (. .)) (ID BACON,I,7.001.5))

The contents of the metadata node

The METADATA node contains three nodes: AUTHOR, RECIPIENT, and LETTER. Each of these nodes contains relevant information separated into fields by colons.

Author and recipient information
Letter information

Author and recipient information

The following information is given for the author and recipient:

Field 1 name
Field 2 gender
Field 3 relationship (if any, if known)
Field 4 date of birth
Field 5 age (at time of writing)
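Outside CorpusSearch, the five fields can be recovered by splitting the value on colons. The following is a minimal Python sketch, not part of the corpus tools; the function name and the dictionary keys are illustrative assumptions, and the underscore-for-empty convention is the one described in the field descriptions below.

```python
from typing import Dict, Optional

# Illustrative key names for the five AUTHOR/RECIPIENT fields (not corpus terminology).
PERSON_FIELDS = ("name", "gender", "relationship", "date_of_birth", "age")


def parse_person(value: str) -> Dict[str, Optional[str]]:
    """Split an AUTHOR or RECIPIENT value into its five named fields."""
    parts = value.split(":")
    if len(parts) != len(PERSON_FIELDS):
        raise ValueError(f"expected 5 colon-separated fields, got {len(parts)}: {value!r}")
    # An underscore marks an empty field; represent it as None.
    return {key: (None if part == "_" else part)
            for key, part in zip(PERSON_FIELDS, parts)}


author = parse_person("NICHOLAS_BACON_II:MALE:BROTHER:1543:26")
recipient = parse_person("NATHANIEL_BACON_I:MALE:BROTHER:1546?:23?")
```

Uncertain values keep their trailing question mark (for example `recipient["age"]` is the string `23?`), so any numeric use has to strip it first.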

Each author/recipient in the corpus is given a unique name. Father/son and mother/daughter pairs are generally distinguished by SR/JR (e.g. EDWARD_CONWAY_SR, EDWARD_CONWAY_JR). More than two members of the same (extended) family with the same name are distinguished by Roman numerals I, II, III, etc. (e.g., WILLIAM_PASTON_I, WILLIAM_PASTON_II, WILLIAM_PASTON_III, WILLIAM_PASTON_IV, etc.). Roman numerals always start at I, except in the case of kings, who are indicated in the usual way (e.g. HENRY_TUDOR_VIII). Women with the same name may be distinguished by their maiden name, rather than by numbers or SR/JR. Thus ANNE_BACON is a different individual from ANNE_BACON[N.COOKE] and ANNE_BACON[N.GRESHAM]. Unrelated writers with the same name are distinguished by Arabic numbers (JOHN_WILLIAMS_1, JOHN_WILLIAMS_2).

For most correspondents both a first and a last name are known, but occasionally only one name is known. When only the last name is known, a title is included when supplied (e.g., MRS_NECTON, COLONEL_STRODE); if no title is supplied, the correspondent is referred to by the single name only (e.g., WADEHILL). Individuals known only by first name are generally (early or foreign) royalty, or certain religious figures (abbots, in particular). These are given either a number (EDWARD_IV), a title (PRINCE MAURICE), or an epithet: JOHN_OF_NORTON, JOHN_OF_TEWKESBURY, JOHN_OF_LANCASTER. These epithets distinguish three Johns: the Abbot of Norton, the Abbot of Tewkesbury, and the Duke of Bedford.

Women
Women are generally referred to by the name they were using at the time of writing. If a woman was married and her maiden name is known, it is included in square brackets prefaced by N. for née (e.g. MARY_PEYTON[N.ASTON]). In general, the inclusion of a maiden name indicates a married state at the time of writing, but its absence indicates nothing, since not all maiden names are known.
Women who wrote letters under more than one name are given both names separated by a slash (DOROTHY_OSBORNE/TEMPLE). In some cases the two names are the woman's maiden and married names, but they may also be two married names. In the latter case, the maiden name is also included if it is known (e.g., ELIZABETH_POYNINGS/BROWNE[N.PASTON]). The difference between DOROTHY_OSBORNE/TEMPLE and MARY_PEYTON[N.ASTON] is that Dorothy wrote both as Dorothy Osborne and as Dorothy Temple, whereas Mary wrote only as Mary Peyton (i.e. when married); her maiden name is included for information only. From the name DOROTHY_OSBORNE/TEMPLE alone, it is not possible to tell whether one name is a maiden name and the other a married name, or whether both are married names (although we know from other information that Osborne is her maiden name and Temple her married name). For ELIZABETH_POYNINGS/BROWNE[N.PASTON], on the other hand, who wrote both as Elizabeth Poynings and as Elizabeth Browne, the inclusion of the maiden name tells us that both of these are married names.
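The slash and bracket conventions just described can be unpacked mechanically. The sketch below is an illustration in Python under the assumption that names follow exactly the patterns shown above; `split_name` and `CorrespondentName` are hypothetical names, not part of the corpus or of CorpusSearch.

```python
import re
from typing import List, NamedTuple, Optional


class CorrespondentName(NamedTuple):
    names: List[str]            # parts separated by "/" (names written under)
    maiden_name: Optional[str]  # value after "N." in square brackets, if present


# Base name, optionally followed by a bracketed maiden name such as [N.ASTON].
_NAME_RE = re.compile(r"^(?P<base>[^\[\]]+)(?:\[N\.(?P<maiden>[^\[\]]+)\])?$")


def split_name(field: str) -> CorrespondentName:
    """Separate the slash-joined names and the optional maiden name."""
    match = _NAME_RE.match(field)
    if match is None:
        raise ValueError(f"unrecognised name format: {field!r}")
    return CorrespondentName(match.group("base").split("/"), match.group("maiden"))


elizabeth = split_name("ELIZABETH_POYNINGS/BROWNE[N.PASTON]")
mary = split_name("MARY_PEYTON[N.ASTON]")
dorothy = split_name("DOROTHY_OSBORNE/TEMPLE")
```

Note that the slash is not exclusive to women's names: in JAMES_STUART_I/VI it separates two regnal numbers, so the parts returned in `names` need to be interpreted in context.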

Royalty
Pre-Tudor kings are referred to by first name and number only (e.g. RICHARD_III, EDWARD_IV). Other members of the royal family are referred to by common epithets (RICHARD_OF_YORK, HUMPHREY_OF_LANCASTER). Tudor and Stuart kings and family are named Tudor or Stuart along with their number (e.g. HENRY_TUDOR_VIII, JAMES_STUART_I/VI). Queens are referred to in the same way as other women, with the exception of Mary Tudor, Queen of England (referred to as MARY_TUDOR_I), who has the same name as Mary Tudor, Queen of France (referred to as simply MARY_TUDOR).

Unknown recipient/Name unknown
In a small number of cases, nothing is known about the recipient of a letter. In these cases the recipient is listed as UNKNOWN_RECIPIENT. In line with our policy of uniquely identifying every individual, each unknown recipient is given a separate Arabic number (e.g., UNKNOWN_RECIPIENT_1, UNKNOWN_RECIPIENT_2, etc.). When something about the identity of the recipient is known (usually from a distinguishing epithet given by the editor, or from the content of the letter), e.g., 'sheriff of Norfolk', but the name is not, the recipient is listed as NAME_UNKNOWN (or NAMES_UNKNOWN for groups such as the 'commissioners for recusancy'). As with unknown recipients, each unnamed individual/group is given a different number. The same unnamed individual or group appearing as author or recipient of multiple letters is given the same number. The reason for distinguishing the two types is that for the latter, but not the former, additional personal information (API) is included in the associated information file (AIF). Note that the PRIVY COUNCIL is treated as an individual, i.e., a name, and not classified as an unnamed group.

Gender (AUTHOR/RECIPIENT field 2)

The gender of all correspondents is identified as either MALE or FEMALE. Unknown recipients, who always occur in official/professional, rather than personal/family contexts, are safely assumed to be MALE.

Relationship (AUTHOR/RECIPIENT field 3)

This field is an expanded version of the Helsinki 'recipient type code' (see also Recipient classification below). Specific nuclear family relationships are given in all cases (DAUGHTER, SON, MOTHER, FATHER, etc.). Specific extended family relationships (SISTER-IN-LAW, AUNT, etc.) are also given when known. If the exact relationship is not known (usually because it is distant, complicated, or both), the cover term KIN is used. The final two possible entries for this field are FAMILY_SERVANT, which simply reproduces the Helsinki 'recipient type' code FS, and FRIEND (Helsinki code TC). For any relationship that does not fall into one of the above categories, the field is left empty (indicated by an underscore).

Date of birth (AUTHOR/RECIPIENT field 4)

Dates of birth for authors and recipients are given when known. Dates followed by a question mark (1530?) are uncertain. Some are baptismal dates. All dates are given in new style (i.e. with the year beginning in January). All dates of birth (DOB), whether accompanied by a ? or not, should be assumed to have a margin of error of a few years on either side, as sources sometimes differ.

Author birthdates are largely supplied by Helsinki. Recipient birthdates not supplied by the edition or Helsinki come from the on-line DNB. When no DOB is known, the field is empty (indicated, as usual, by an underscore).

Age at time of writing (AUTHOR/RECIPIENT field 5)

This number is calculated by subtracting the year of birth (DOB) of the author/recipient from the year of the letter. If either the letter date or the DOB of the correspondent is uncertain (indicated by a following ?), the age is likewise marked as uncertain (e.g., 49?). When the date of the letter is given as a decade (1500S), age is not calculated. If the DOB or the date of the letter is unknown, the field is empty (indicated, as usual, by an underscore).
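The rules for the age field can be written out as a short Python sketch. This is an illustration rather than the procedure actually used to build the corpus; the function name is hypothetical, and returning an underscore when the letter date is a decade is an assumption (the text says only that age is not calculated in that case).

```python
def age_at_writing(dob: str, letter_date: str) -> str:
    """Derive the age field from a DOB field and a letter-date field."""
    # Unknown DOB or letter date: the age field is empty (underscore).
    if dob == "_" or letter_date == "_":
        return "_"
    # Decade dates such as 1500S or 1530S?: age is not calculated.
    # Assumption: the field is then left empty.
    if letter_date.rstrip("?").endswith("S"):
        return "_"
    # A question mark on either input makes the age uncertain as well.
    uncertain = dob.endswith("?") or letter_date.endswith("?")
    age = int(letter_date.rstrip("?")) - int(dob.rstrip("?"))
    return f"{age}?" if uncertain else str(age)


nicholas = age_at_writing("1543", "1569")
nathaniel = age_at_writing("1546?", "1569")
```

With the values from the example token at the top of this page, the sketch reproduces the ages given there for the author (born 1543) and the recipient (born 1546?) of a letter dated 1569.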

Letter information

The following information is given for the letter:

Field 1 unique reference
Field 2 time period
Field 3 date of letter
Field 4 authenticity
Field 5 recipient classification

Unique reference (LETTER field 1)

Each letter in the corpus is uniquely identified by a combination of the filename and letter number (e.g., ARUNDEL_001).

Time period (LETTER field 2)

The time period of the letter is identified according to the Helsinki Corpus time periods, as follows:

Time period Dates
M4 1420-1500
E1 1500-1569
E2 1570-1639
E3 1640-1710

This information is intended to make it possible to compare PCEEC data to that from the PPCME2 and PPCEME.
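For users who need to assign periods to dates themselves, the table of time periods can be expressed as a lookup. The following Python sketch is illustrative only. Because the year 1500 appears in both the M4 and E1 ranges as listed, the sketch assumes that 1500 belongs to E1; that choice is an assumption, not something stated here.

```python
from typing import Optional

# (label, first year, last year), copied from the table of time periods.
PERIODS = [
    ("M4", 1420, 1500),
    ("E1", 1500, 1569),
    ("E2", 1570, 1639),
    ("E3", 1640, 1710),
]


def period_for_year(year: int) -> Optional[str]:
    """Return the period label for a year, or None if it lies outside 1420-1710."""
    # Assumption: check later periods first so that 1500 resolves to E1.
    for label, first, last in reversed(PERIODS):
        if first <= year <= last:
            return label
    return None


bacon_period = period_for_year(1569)
```

The letter BACON_001, dated 1569, comes out as E1, which agrees with the LETTER node in the example token.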

Date of letter (LETTER field 3)

This information is taken from the Helsinki 'text identifier' parameter (Q). When the year of the letter is known it is given precisely. If the date is conjectural it is followed by a question mark. When a precise year cannot be suggested the decade is given, indicated by S, as follows: 1500S and 1600S indicate the first decade of each century, 1510S indicates 1510-1519, etc. If even the decade is uncertain, a question mark follows the decade (e.g., 1530S?).
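The date formats described above (exact year, decade marked with S, optional question mark) can be decoded as follows. This is a minimal Python sketch; `parse_letter_date` and `LetterDate` are hypothetical names, and an empty field (underscore) is deliberately not handled and raises an error.

```python
import re
from typing import NamedTuple


class LetterDate(NamedTuple):
    first_year: int   # earliest year covered by the field
    last_year: int    # latest year covered (same as first_year for exact years)
    uncertain: bool   # True if the field ends in a question mark


# Four digits, an optional decade marker S, and an optional question mark.
_DATE_RE = re.compile(r"^(\d{4})(S?)(\??)$")


def parse_letter_date(field: str) -> LetterDate:
    """Decode a LETTER date field such as 1569, 1569?, 1530S or 1530S?."""
    match = _DATE_RE.match(field)
    if match is None:
        raise ValueError(f"unrecognised letter date: {field!r}")
    year = int(match.group(1))
    # A decade such as 1510S covers 1510-1519; 1500S covers 1500-1509.
    last = year + 9 if match.group(2) == "S" else year
    return LetterDate(year, last, match.group(3) == "?")


exact = parse_letter_date("1569")
decade = parse_letter_date("1530S?")
```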

Authenticity (LETTER field 4)

Three values are supplied for the authenticity of the letter. This is a simplified version of the Helsinki 'authenticity code' found in the 'text identifier' parameter (Q).

AUTOGRAPH
Includes all letters written by the author (Helsinki 'authenticity codes' A and B)
COPY
Includes both copies of letters and letters written by a secretary for the author (code C)
UNKNOWN
Authenticity is unknown (code D)
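When comparing PCEEC data with material that still carries the Helsinki codes, the simplification above amounts to a small lookup table. The Python sketch below only restates the mapping given in the list; the names are illustrative.

```python
# Helsinki 'authenticity code' -> PCEEC authenticity value, as listed above.
HELSINKI_TO_PCEEC = {
    "A": "AUTOGRAPH",
    "B": "AUTOGRAPH",
    "C": "COPY",
    "D": "UNKNOWN",
}


def simplify_authenticity(code: str) -> str:
    """Map a Helsinki authenticity code (A-D) to the PCEEC value."""
    return HELSINKI_TO_PCEEC[code.upper()]
```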

Recipient classification (LETTER field 5)

This field contains a more general classification of the information in field 3 (Relationship) of the AUTHOR/RECIPIENT nodes. The first two values group family relationships into nuclear and extended. The last three values simply replicate the information found in the AUTHOR/RECIPIENT node.

FAMILY_NUCLEAR
FAMILY_OTHER
FAMILY_SERVANT
FRIEND
OTHER

The information in this field comes from the Helsinki 'recipient type' code found in the 'text identifier' parameter (Q).

How to search the metadata

Accessing metadata with CorpusSearch 2
Accessing metadata with CorpusSearch 1.1

Accessing metadata with CorpusSearch 2

The metadata in the format discussed here can only be searched by CorpusSearch 2 (CS2). Users of CorpusSearch 1.1 (CS1.1) can either upgrade to CorpusSearch 2 (see Upgrade information and the CS1.1 to CS2 conversion guide) or access the metadata in another format (see Accessing metadata with CS1.1). This section presupposes familiarity with CorpusSearch 1.1. The relevant background for new users can be found in the coding section of the CS 1.1 Reference Manual, or in CorpusSearch Lite.

The three nodes included in the METADATA node (AUTHOR, RECIPIENT, and LETTER) are searched in the same way as CODING strings are searched. The query below searches column 1 of the CODING string for the value x.

query: (CODING column 1 x)  <-- note space between 'column' and '1'

CS2 difference alert! There must be a space between the column and column number in CS2 CODING queries (see the Conversion guide).

To search one of the METADATA nodes, simply substitute the name of the node for CODING. Note that the metadata 'fields' are referred to as 'columns' when searching (for historical reasons).

query: (AUTHOR column 2 MALE)

This query will find all the tokens in which the author is male. To search for other values as well, simply join search functions with AND.

CS2 difference alert! It is not necessary to join multiple search calls one at a time with right-branching parentheses in CS2 (see the Conversion guide).

query: (AUTHOR column 2 MALE)
AND (RECIPIENT column 2 FEMALE)
AND (LETTER column 2 E2)

This query finds all men writing to women in period E2. The following query finds all women writing to women.

query: (AUTHOR column 2 [1]FEMALE)
AND (RECIPIENT column 2 [2]FEMALE)

Note the use of prefix indices. Indices must be used to distinguish different referents of the same search term when searching the METADATA, just as with any other type of search.

To find all women writing to men or men writing to women, use the new CS2 OR function.

query: ((AUTHOR column 2 MALE)
AND (RECIPIENT column 2 FEMALE))
OR
((AUTHOR column 2 FEMALE)
AND (RECIPIENT column 2 MALE))

Metadata can be searched in conjunction with structure.

query: (AUTHOR column 2 FEMALE)
AND (RECIPIENT column 2 MALE)
AND (IP* iDoms NP-VOC)

This query finds all tokens written by women to men, in which vocatives are used.
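For spot-checking results outside CorpusSearch, the column lookups used in these queries can be imitated in plain Python. The sketch below is not CorpusSearch and does not search structure; it only reads the three metadata nodes of a single token with a regular expression. The function names are hypothetical, and the sketch assumes that metadata values contain no spaces or parentheses.

```python
import re
from typing import Dict, List

# Matches (AUTHOR value), (RECIPIENT value) and (LETTER value) inside a token.
_NODE_RE = re.compile(r"\((AUTHOR|RECIPIENT|LETTER)\s+([^\s()]+)\)")


def metadata_columns(token: str) -> Dict[str, List[str]]:
    """Map each metadata node label to its list of colon-separated columns."""
    return {label: value.split(":") for label, value in _NODE_RE.findall(token)}


def column(token: str, node: str, number: int) -> str:
    """Return column `number` of a node, counting from 1 as in the queries."""
    return metadata_columns(token)[node][number - 1]


def male_to_female_in_e2(token: str) -> bool:
    """Rough analogue of the three-part AND query shown earlier."""
    return (column(token, "AUTHOR", 2) == "MALE"
            and column(token, "RECIPIENT", 2) == "FEMALE"
            and column(token, "LETTER", 2) == "E2")


# The example token from the top of this page.
token = """( (METADATA (AUTHOR NICHOLAS_BACON_II:MALE:BROTHER:1543:26)
            (RECIPIENT NATHANIEL_BACON_I:MALE:BROTHER:1546?:23?)
            (LETTER BACON_001:E1:1569:AUTO:FAMILY_NUCLEAR))
  (IP-MAT (CONJ nor)
          (NP-1 (D the) (N commyssion)
                (PP (P for)
                    (NP (D the) (N pease))))
          (NP-SBJ (PRO I))
          (ADVP-TMP (ADV never))
          (VBD harde)
          (PP (P of)
              (NP *ICH*-1))
          (. .)) (ID BACON,I,7.001.5))"""

matches = male_to_female_in_e2(token)
```

The example token is a letter between two brothers in period E1, so it does not satisfy the male-to-female E2 condition.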

Accessing metadata with CorpusSearch 1.1

For CS1.1 users, the metadata has been provided in the form of a CODING string.

( (CODING BACON_001:E1:1569:AUTOGRAPH:FAMILY_NUCLEAR:NICHOLAS_BACON_II:MALE:BROTHER:1543:26:NATHANIEL_BACON_I:MALE:BROTHER:1546?:23?:16%)
  (IP-MAT (NP-SBJ (PRO I))
	  (HVP have)
	  (VBN reseyved)
	  (ALSO also)
	  (NP-OB1 (PRO$ my) (N hose))
	  (. .)) (ID BACON,I,7.001.3))

This coding string contains all the information in the metadata nodes, with the columns from the LETTER node first, followed by the AUTHOR and then the RECIPIENT node, as follows:

Column 1 letter unique reference
Column 2 letter time period
Column 3 date of letter
Column 4 letter authenticity
Column 5 recipient classification
Column 6 author name
Column 7 author gender
Column 8 author relationship (if any)
Column 9 author date of birth
Column 10 author age (at time of writing)
Column 11 recipient name
Column 12 recipient gender
Column 13 recipient relationship (if any)
Column 14 recipient date of birth
Column 15 recipient age (at time of writing)

The final column, column 16, contains the value '16%'. This is not a metadata value; it simply marks the end of the provided CODING string. To add your own coding to this string (without losing any of the metadata), start coding in column 17. You can, in fact, code over the provided values in any particular search if you aren't interested in them. This will not affect the metadata CODING string in the corpus files.
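As an illustration of the layout, the Python sketch below splits the example CODING string into named columns and appends a value in column 17. It follows the order of the example string (LETTER columns first, then AUTHOR, then RECIPIENT); the key names and function names are assumptions made for this example, and the sketch works on a copy of the string, not on the corpus files.

```python
from typing import Dict

# Illustrative names for columns 1-15, in the order of the example string.
CODING_COLUMNS = (
    "letter_reference", "letter_period", "letter_date",
    "letter_authenticity", "recipient_classification",
    "author_name", "author_gender", "author_relationship",
    "author_dob", "author_age",
    "recipient_name", "recipient_gender", "recipient_relationship",
    "recipient_dob", "recipient_age",
)


def split_coding(value: str) -> Dict[str, str]:
    """Return columns 1-15 of a metadata CODING string as a dictionary."""
    parts = value.split(":")
    # Column 16 must hold the end marker '16%'.
    if len(parts) < 16 or parts[15] != "16%":
        raise ValueError("column 16 does not contain the end marker '16%'")
    return dict(zip(CODING_COLUMNS, parts[:15]))


def append_coding(value: str, *extra: str) -> str:
    """Add user-defined columns after the end marker, starting at column 17."""
    return ":".join([value, *extra])


example = ("BACON_001:E1:1569:AUTOGRAPH:FAMILY_NUCLEAR:"
           "NICHOLAS_BACON_II:MALE:BROTHER:1543:26:"
           "NATHANIEL_BACON_I:MALE:BROTHER:1546?:23?:16%")
columns = split_coding(example)
extended = append_coding(example, "x")
```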

Searching the metadata CODING string is done in the same way as for any other CODING string (see the section on searching the CODING node in the CorpusSearch Lite Manual). For information on how to add coding to the string, see the section on the coding function.

Note that if you want to use any values from the coding string as input to varbrul, you will have to recode them as single character values, since varbrul will not accept multi-character columns.
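One way to prepare such a recoding is to assign each distinct value in a column its own character. The Python sketch below builds such a table; it is only an illustration, the function name is hypothetical, and the choice of lowercase letters and digits as replacement characters is an assumption, since the characters varbrul accepts are not specified here.

```python
import string
from typing import Dict, Iterable


def single_char_codes(values: Iterable[str]) -> Dict[str, str]:
    """Assign one character to each distinct multi-character value."""
    # Assumption: lowercase letters and digits are acceptable factor codes.
    symbols = string.ascii_lowercase + string.digits
    distinct = sorted(set(values))
    if len(distinct) > len(symbols):
        raise ValueError("more distinct values than available single characters")
    return dict(zip(distinct, symbols))


gender_codes = single_char_codes(["MALE", "FEMALE", "MALE"])
```

Sorting the distinct values first makes the assignment reproducible from one run to the next, so the same value always receives the same character for a given set of values.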