Splitting and joining words

Summary overview

                      Spelled together                    Spelled apart

Split                 Emendation                          Separate tags
                      (MD $can) (NEG $n't)                (MD can) (NEG not)
                      (CODE {TEXT:can't})

Treated as unitary    Simple tag                          Numbered tag
                      (ADJ blue-eyed)                     (ADJ (ADJ21 blue) (ADJ22 eyed))

Treated as compound   Complex (+) tag                     Separate tags
                      (ADJ+NS gentlemen)                  (ADJ gentle) (NS men)

Treated as written    Relationship between tagging for variant spellings not necessarily transparent
                      (NPRS Englishmen)                   (ADJ English) (NS men)

Fused form            Phrase and complex (+) tag          Phrase and separate tags
                      (PP (P+N bicause))                  (PP (P be)
                                                              (NP (N cause)))

Items that are split

When an orthographic word in the original text belongs to different constituents (as defined by our annotation guidelines), the word is split into relevant parts, which are marked as emendations. As is usual with emendations, the original form is enclosed in (CODE {TEXT:...}).

Some combinations, such as a pronoun and a modal (e.g., 'twill), always belong to separate constituents and are therefore always separated. A systematic exception to the above concerns prepositions and single-word complements when they are spelled together (e.g., abed, on't, therewith); see below. Other combinations, such as determiner-modifier combinations (e.g., tother), do not always belong to distinct constituents in the sense of the annotation guidelines and are therefore not always split; see below.

In the PPCEME/PCEEC, we attempt to regularize the spelling of split forms to the standard modern equivalent (if there is one). However, in two exceptional texts (stevenson, udall), the split forms are not standardized, but reflect the characteristic dialect forms used elsewhere in these texts. To facilitate searches, we distinguish contracted and non-contracted forms in the emendations (see "Modal plus negation" for examples).
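As a rough illustration of how emended sequences could be read back mechanically, the sketch below assumes the word_TAG token shape used in the examples in this section (split parts marked with a leading "$", the original form recorded as {TEXT:...}_CODE). The names `read_split`, `TOKEN`, and `TEXT_CODE` are hypothetical and are not part of any corpus tool.

```python
import re

# Assumed token shape, taken from the examples in this section:
#   $shall_MD $be_BE {TEXT:shalbe}_CODE
TOKEN = re.compile(r"^(?P<word>.+)_(?P<tag>[^_]+)$")
TEXT_CODE = re.compile(r"^\{TEXT:(?P<orig>.*)\}$")


def read_split(line):
    """Return (parts, original) for one emended sequence.

    parts is a list of (word, tag) pairs with the leading "$" removed;
    original is the form recorded in {TEXT:...}, or None if there is none.
    """
    parts, original = [], None
    for token in line.split():
        m = TOKEN.match(token)
        if m is None:
            continue  # not a word_TAG token (e.g. part of a trailing comment)
        word, tag = m.group("word"), m.group("tag")
        code = TEXT_CODE.match(word) if tag == "CODE" else None
        if code:
            original = code.group("orig")
        elif word.startswith("$"):
            parts.append((word[1:], tag))
    return parts, original


def is_contracted_negation(parts):
    """True if the NEG part is the contracted form n't rather than not."""
    return any(tag == "NEG" and word == "n't" for word, tag in parts)
```

Because contracted and non-contracted forms are kept distinct in the emendations, a check such as `is_contracted_negation` only needs to look at the regularized NEG part, not at the original spelling.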

The following cases of split words are particularly common:

  1. Modal plus BE
    $shall_MD $be_BE {TEXT:shalbe}_CODE
    $shall_MD $be_BE {TEXT:shallbee}_CODE
    $will_MD $be_BE {TEXT:wolbe}_CODE
    $will_MD $be_BE {TEXT:wylbe}_CODE
  2. Modal plus negation
    $can_MD $not_NEG {TEXT:cannot}_CODE	<--- non-contracted form
    $can_MD $n't_NEG {TEXT:cant}_CODE	<--- contracted form
    $can_MD $n't_NEG {TEXT:can't}_CODE
    $shall_MD $n't_NEG {TEXT:shant}_CODE
    $wo_MD $n't_NEG {TEXT:won't}_CODE	<--- "wo" rather than "will"
  3. Modal or verb plus pronoun
    $grin_VBI $it_PRO {TEXT:grinit}_CODE
    $maist_MD $tow_PRO {TEXT:maistow}_CODE
    $Pray_VBP $thee_PRO {TEXT:Prithee}_CODE
    $Pray_VBP $thee_PRO {TEXT:Prethe}_CODE
  4. Pronoun plus modal or verb
    $ich_PRO $challe_MD {TEXT:ichalle}_CODE		<--- dialect form retained
    $it_PRO $'s_BEP {TEXT:its}_CODE
    $it_PRO $'s_BEP {TEXT:it's}_CODE
    $me_PRO $thynketh_VBP {TEXT:methynketh}_CODE	<--- original spelling in Middle English emendation
    $me_PRO $thinks_VBP {TEXT:methinkes}_CODE	<--- regularized spelling in Modern English emendation 
    $there_EX $'s_BEP {TEXT:thers}_CODE
    $they_PRO $'ll_MD {TEXT:they'l}_CODE
    $'t_PRO $is_BEP {TEXT:'tis}_CODE		<--- position of apostrophe invariant in emendation
    $'T_PRO $is_BEP {TEXT:T'is}_CODE
    $'t_PRO $will_MD {TEXT:twil}_CODE		<--- apostrophe added in emendation
  5. Possessive clitic ('S, S)
    See Dollar tag, Possessive clitic.

    Exceptionally not split

    Although prepositions and their complements always belong to different constituents according to our guidelines, prepositions are exceptionally not split from single-word complements if both are spelled together. Most frequently, these single-word complements are R-pronouns or a contracted form of IT. The entire sequence is treated as a PP or WPP.
    (PP (ADV+P heretofore))		(PP (ADV+P therefore))
    (WPP (WADV+P wherewith))
    (PP (P+PRO for't))		(PP (P+PRO in't))
    (PP (P+PRO on't))		(PP (P+PRO too't))
    (PP (P+NS acneon))		(PP (P+N areawe))
    (PP (P+N ibedde))		(PP (P+N iwit))

    Split or not depending on syntactic context

    Some common cases are:

    1. Preposition plus determiner

      Instances of this type are not split when the determiner is the head of the NP (following the rule that prepositions are not split from single-word complements).

      ( (CP-QUE (IP-SUB (BEP Are)
      		  (NP-SBJ (PRO you))
      		  (VAN a-uis'd)
      		  (PP (P+D o'that)))
      	  (. ?)))

      Otherwise, they are split.

      (PP (P $on)
          (NP (PRO$ $my) (CODE {TEXT:o'my}) (N life)))
    2. Determiner plus adjective, noun, etc.

      Instances of this type are not split if the noun, adjective, or other element does not form a phrasal constituent with any following words. See also Items treated as compounds.

      (NP (D+ADJ thilke) (NPR Iuditha))
      (NP (D+N th'emperour))
      (NP (D+N thestate))
      (NP (D+ADJ thilke) (N matter))
      (NP (D+ADV+VAN th'aforesayde) (N matter))

      Otherwise, they are split.

      (NP (D $the)
          (ADJP (ADV $right) (CODE {TEXT:theright}) (ADJ honourable))
          (N Earle)
          (PP (P of) 
      	(NP (NPR Atholl))))
      (NP (D $th')
          (ADJP (ADV $afore) (CODE {TEXT:th'afore}) (VAN sayde))
          (N matter))

    Items treated as unitary

    Items in this category may be spelled as one orthographic word or several. When written together, they are given a simple POS tag. When written apart, each part of the multiword sequence is surrounded by a numbered POS tag. The first number indicates the total number of parts; the second number indicates each part's place within the entire sequence. In order to facilitate CorpusSearch queries, an additional POS tag (unnumbered) surrounds the entire sequence in the parsed files.

    (ADV nevertheless)		(ADV (ADV31 never) (ADV32 the) (ADV33 less))
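The numbering convention can be made concrete with a small sketch. It assumes that a numbered tag is a base tag followed by exactly two digits (total number of parts, then position), as in the examples in this section; sequences of ten or more parts are outside what this sketch handles. The function names are hypothetical.

```python
import re

# Assumption: numbered tag = base tag + two digits, e.g. ADV31 is
# part 1 of a 3-part ADV. The base may contain "$" (as in N$21).
NUMBERED = re.compile(r"^(?P<base>.+?)(?P<total>[1-9])(?P<place>[1-9])$")


def parse_numbered(tag):
    """Return (base, total, place) for a numbered tag, else None."""
    m = NUMBERED.match(tag)
    if m is None:
        return None
    total, place = int(m.group("total")), int(m.group("place"))
    if place > total:
        return None
    return m.group("base"), total, place


def check_sequence(outer_tag, parts):
    """Check that parts, a list of (tag, word) pairs, is numbered
    consistently under the unnumbered outer tag."""
    parsed = [parse_numbered(tag) for tag, _ in parts]
    if any(p is None for p in parsed):
        return False
    n = len(parts)
    return all(p == (outer_tag, n, i) for i, p in enumerate(parsed, start=1))
```

For the example above, `check_sequence("ADV", [("ADV31", "never"), ("ADV32", "the"), ("ADV33", "less")])` holds, whereas a sequence with a missing middle part does not.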

    Although our treatment of fused forms generally reflects their phrasal origin, certain such items must be treated as unitary because of their syntactic distribution. For instance, UNDERHAND must be treated as an adjective because it can appear as a prenominal modifier.

    (NP (ADJ underhand) (NS courses))

    Once an item is treated as unitary in one context, it is treated that way consistently.

    (ADVP (ADV secretly) (CONJ and) (ADV underhand))	<--- not (PP (P+N underhand))

    For items that go the other way (e.g., ALIVE, ASLEEP), see Fused forms.

    Historical changes in distribution can lead to differences in the way that items are treated in the PPCME2 and in the PPCEME/PCEEC.

    1. Unitary adjectives

      Common items in this category include:

      (ADJ alone)			(ADJ (ADJ21 a) (ADJ22 lone))
      (ADJ backward)			(ADJ (ADJ21 back) (ADJ22 ward))
      (ADJ gladful)			(ADJ (ADJ21 glad) (ADJ22 ful))
      other adjectives in -WARD
      (ADJ derworthy)
      (ADJ red-hot)
      (ADJ selfsame)
      (ADJ sevenfold)
      (NP (ADJ upright) (NS men))	(also adverb)
      (ADJ welcome)
      This category includes apparent compounds with 'false participles':
      (ADJ feather-footed)		(ADJ (ADJ21 feather) (ADJ22 footed))
      (ADJ mild-hearted)		(ADJ (ADJ21 mild) (ADJ22 hearted))
      (ADJ two-toothed)		(ADJ (ADJ21 two) (ADJ22 toothed))
      (ADJ ill-natured), but (ADV+VAN ill-favoured)
    2. Unitary adverbs and prepositions

      The following adverbs and prepositions are treated as unitary.

      a+det			about			above
      abroad			afore			again
      against			almost			already
      although(inwith)	always			alwhat
      among			amore			anon
      apon (but not upon)	aright			asswa
      away			before			behind
      beneath			beside(s)		between
      betwixt			beyond			bimong
      eftsoon			evermore		for+ti
      fornigh			forthright		forto
      fromward(tofore)	furthermore		furtherover
      henceforward		intil			inwith
      la(n)hure		maybe			mayfortune
      mayhap			moreover		na+gtuor+tan
      natforthi		ne+taget		nethelatter & variants
      nevermore & variants	nevertheless & variants	nonetheless & variants
      notwithstanding		onward			outake(n)
      overal			overmete		peradventure
      percase			perchance		perhaps
      thenceforth		there(to)against	throughout
      tilinto			tilto			toeke(n)
      tofore(hand)		togains			together
      toward			towhether		umbestunde
      underhand (also adjective)			upright (also adjective)
      unto (but not into)	whatforthi		withal
      within			without(forth)		+te+get
      +tewhether		+tohhswa+tehh
    3. Unitary nouns

      Certain items are treated differently in the PPCME2 and in the PPCEME/PCEEC (e.g., AFTERNOON, TODAY, and TONIGHT).

      Common items in this category include:

      (N ado)			(N (N21 a) (N22 do))		A = northern infinitival marker
      (N todo)		(N (N21 to) (N22 do))
      (N to-morrow)		(N (N21 to) (N22 morrow))
      (N$ tomorrow's)		(N$ (N$21 to) (N$22 Morrows))
      (N yesterday)		(N (N21 yester) (N22 day))
      (N$ yesterdays)		(N$ (N$21 yester) (N$22 day's))
      (N yesternight)		(N (N21 yester) (N22 night))
      (NPR Wednesday)		(NPR (NPR21 Wadenes) (NPR22 day))

    4. Unitary verbs

      1. Verbs with A (overwhelmingly in Middle English). In most of these verbs, A is originally a prefix (adding "intensity").
        (VBD (VBD21 a) (VBD22 resunede))	(VBD (VBD21 a) (VBD22 seide))
        (VBP (VBP21 a) (VBP22 kel+t))		(VBP (VBP21 a) (VBP22 turne+t))
      2. Verbs with the perfective prefix GE-, I-, Y-, etc. (only in Middle English).
        (VAN (VAN21 y) (VAN22 cleped))
        (VB (VB21 i) (VB22 heren))
        (VBP (VBP21 +ge) (VBP22 bette))
      3. Verbs with separable/inseparable prefixes. Because separable and inseparable prefixes cannot be reliably told apart when they precede the verb, we do not distinguish between them in that position: all preverbal prefixes are treated as part of the verb. By contrast, separable prefixes that follow the verb are tagged RP.
        (VAN (VAN21 fore) (VAN22 said))		<-- FORE treated as prefix, not P, because of meaning
        (VB (VB21 with) (VB22 say))
        (VBD (VBD21 by) (VBD22 shone))
        (VBD (VBD21 to) (VBD22 brake))

    Items treated as compounds

    When they are spelled together, items that are treated as compounds receive a complex POS tag, consisting of two or more POS tags joined by "+".

    When items treated as compounds are spelled apart, each part receives a simple POS tag. But unlike in the case of unitary items, no additional pair of POS brackets is added to indicate the item's compound character.
    (NP (ADJ+NS gentlemen))		(NP (ADJ gentle) (NS men))          <-- no added NS
    Phrasal brackets, on the other hand, are added as appropriate.
    (NP (D the)			(NP (D the)
        (ADV+VAN aforesayde)	    (ADJP (ADV afore) (VAN sayde))  <-- added ADJP
        (N matter))			    (N matter))		

    The first part of a compound is tagged as N if that is possible given the meaning of the compound (EVIL-DOERS, ILL-BODING). Otherwise (EVIL-FAVOURED, ILL-DISPOSED, WELL-DOERS), the first part is tagged with the appropriate POS tag (here, ADV).
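The correspondence between the spelled-together and spelled-apart treatment of compounds can be sketched as follows. The sketch assumes that "+" occurs in a POS tag only as the joiner of a complex tag, and that the segmentation of the word into parts is supplied by the annotator; `tag_components` and `spell_apart` are hypothetical names.

```python
def tag_components(tag):
    """Split a complex (+) tag into its simple component tags."""
    return tag.split("+")


def spell_apart(complex_tag, word_parts):
    """Pair each component of a complex tag with one part of the word.

    Returns the spelled-apart bracketing as a string. No outer POS
    bracket is added, in contrast to the numbered tags of unitary items.
    """
    tags = tag_components(complex_tag)
    if len(tags) != len(word_parts):
        raise ValueError("number of tags and word parts differ")
    return " ".join(f"({t} {w})" for t, w in zip(tags, word_parts))
```

For instance, `spell_apart("ADJ+NS", ["gentle", "men"])` yields the two separate leaves shown above; any phrasal bracket (such as the ADJP around AFORE SAYDE) still has to be added by hand, since it depends on the syntax and not on the tag alone.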

    1. Comparative AS
      (ADVR+ADV assone)   = as soon
      (ADVR+ADV a-swythe) = as swythe (quickly)
    2. Infinitival TO plus verb
      (FOR for) (TO+VB tabyde)	(TO+VB tappeal)
      (TO+VB toffrenn)		(TO+VB toslenne)
    3. Negation plus modal or verb
      (NEG+HVD nade)	= NE + had
      (NEG+HVP nave)	= NE + have
      (NEG+MD nolde)	= NE + wolde
      (NEG+VBD nyst)	= NE + wist
    4. Noun compounds

      This category includes compounds in which the first part is a noun or some other category.

      1. Noun-noun (N+N)
        (N+N alderman)			(N+N bishopric)
        (N+N eortheware)		(N+NS evil-doers)
        (N+N godfather)			(N+N household)
        (N+N lifetime)			(N+N mankind)
      2. Other (ADJ+N, ADV+N, OTHER+N, etc.)
        (ADJ+NS gentlemen)		(ADJ+N grandsire)
        (ADJ+NPR Halichurche)		(ADJ+NS noblemen)
        (ADJ+N vainglory)
        (ADV+N hidercume)		(NP-TMP (ADV+N oftesy+de))
        (NP-TMP (ADV+N ofte-tide))	(NP-TMP (ADV+N(S) often-tyme(s)))
        (NP-TMP (ADV+N often-while))	(ADV+NS well-doers)
        (ADV+NS well-wishes)
        (NP-TMP (ADV+N afor-tyme))	(NP-TMP (ADV+N beforetime))
        (NP-ADV (OTHER+NS othergates))	(NP-ADV (OTHER+N otherwise))
        (PP (P+N beforehand))
    5. Degree OVER
      (ADVR+Q overmanie)
      (ADVR+Q overmuch)
      (ADVR+ADJ overproud)
    6. Participles with modifying adverb
      (ADV+VAN abouesaide)		(ADV+VAN aforn-seyd)
      (ADV+VAN be-forn-wretyn)	(ADV+VAG everlasting)
      (ADV+VAN ill-disposed)		(ADV+VAN new-born)
      (ADV+VBN new-come)		(ADV+VAN well-knowyn)
    7. Participles with modifying noun
      (N+VAG alms-willing)		(NPR+VAG god-fearing)
      (N+VAG ill-boding)		(N+VAN self-conceited)
      (N+VAN wind-driven)

    8. Possessive clitic ('S, S)

      See Dollar tag, Possessive clitic.

    9. Reflexive pronouns

    10. Quantified adverbs and nouns (see also EVER plus quantifier and EVERY)

      Quantified adverbs are treated as compounds of ANY, EVERY, NO, SOME, etc. (Q) + HOW, WHERE, etc. (WADV).

      (ADVP (Q+WADV anyhow))		(ADVP (Q any) (WADV how))
      (ADVP-LOC (Q+WADV anywhere))	(ADVP-LOC (Q any) (WADV where))
      somehow				somewhere

      Quantified nouns are treated as compounds of ANY, EVERY, NO, SOME, etc. (Q) + ONE, PLACE, THING, TIME(S), WHAT, WIHT, etc. (N, NS, ONE).

      (NP (Q+ONE anyone))		(NP (Q any) (N thing))
      (NP-LOC (Q+N anyplace))		(NP-LOC (Q any) (N place))
      (NP (Q+N eawiht))
      (NP (Q+ONE echone))
      (NP (Q+ONE ilkane))
      (NP (Q+N somdel))		<-- -MSR, -OB1, -SBJ, etc. according to function
      (NP-TMP (Q+N sometime))		(NP-TMP (Q some) (N time))
      (NP-TMP (Q+NS sometimes))	(NP-TMP (Q some) (NS times))
      (NP (Q+N somewhat))		<-- -MSR, -OB1, -SBJ, etc. according to function
      similarly (some items repeated here for convenience):
      	  anyone	anyplace	anything	anytime
      everydel  everyone	everyplace	everything	everytime
        	  no-one	noplace		nothing		
      	  someone	someplace	something	sometime, sometimes
    11. -WARD

      The information here focuses on issues concerning splitting and joining. The various uses of -WARD and its compounds and the POS categories associated with them are discussed in Treatment of individual words, s.v. -WARD.

      Combinations with -WARD that are used as adjectives or prepositions are treated as unitary items.

      All other uses and occurrences are tagged WARD, which is either part of a complex (+) POS tag (ADV+WARD, N+WARD, NPR+WARD, RP+WARD, etc.) or a separate tag, depending on whether -WARD is spelled as a separate orthographic word.

      (ADVP-TMP (ADV+WARD afterward))		(ADVP-TMP (ADV after) (WARD ward))
      (ADVP-DIR (ADV+WARD backward))		(ADVP-DIR (ADV back) (WARD ward))
      	like adverbial BACKWARD: FORWARD
      (ADVP-DIR (RP+WARD downward))		(ADVP-DIR (RP down) (WARD ward))
    12. WHAT SO EVER, etc.
      (WPRO+ADV whatever)
      (WPRO+ADV+ADV whatsoever)
      (WADV+ADV+ADV wheresomever)
      (WPRO+ADV whoso)

    Items treated as written

    This category is used largely for fused forms, but also includes the following.

    ALMIGHTY and BETIME(S) are treated differently in the PPCME2 and in the PPCEME/PCEEC.

    Spelled together        Spelled apart
    (P into)                (RP in) (P to)                    <--- unlike unitary unto
    (P up-on)               (RP up) (P on)                    <--- unlike unitary apon
    (NPR Englishman)        (NP (ADJ English) (N man))
    (NPRS Englishmen)       (NP (ADJ English) (NS men))
    (NUM fifty-three)       (NUMP (NUM fifty) (NUM three))

    Fused forms

    Certain items in later English are fusions of earlier multi-word phrases. Given the time coverage of our diachronic corpora and the fact that word division in early texts is not always well represented, these items are very difficult to treat in a consistent way. The strategy we have adopted is as follows.

    See Items treated as unitary for the distinction between cases like ALIVE and ASLEEP (fused forms) and UNDERHAND (unitary adjective or adverb).

    1. A- words (A < IN, ON) (including the A-HUNTING construction)
      (PP (P a)				(PP (P+ADV+WARD abackward))
          (ADVP (ADV+WARD backward)))
      (PP (P a)				(PP (P+N abed))
          (NP (N bed)))	
      (PP (P a)				(PP (P+RP adown))
          (ADVP (RP down)))
      (PP (P a)				(PP (P+N ahunting))
          (NP (N hunting)))
      (PP (P a)				(PP (P+ADJ asunder))
          (ADJP (ADJ sunder)))
      (PP (P a)				(PP (P+NUM atwo))
          (NP (NUM two)))
      alive	<--- not ADJ because impossible in prenominal position
      asleep	<--- not ADJ because impossible in prenominal position
    2. Other
      (PP (P at)				(PP (P+ADV atonce))
          (ADVP (ADV once)))
      (PP (P be)				(PP (P+N bycaus)
          (NP (N cause)                           (CP-ADV ...))
              (CP-THT ...)))
      (PP (P before)				(PP (P+N beforehand))	similarly: AFOREHAND, BEHINDHAND 
          (NP (N hand)))
      (NP-TMP (ADV before) (N time))		(NP-TMP (ADV+N beforetime))
      (NP-TMP (ADV before) (NS times))	(NP-TMP (ADV+NS beforetimes))
      (PP (P for)				(PP (P+ADV forever))
          (ADVP (ADV ever)))
      (PP (P for)				(PP (P+N forsooth))	similarly: INSOOTH
          (NP (N sooth)))
      (PP (P for) (D thi)			(PP (P+D forthi))
          (CP-ADV ...))			    (CP-ADV ...))
      (PP (P for) (WADV whi)			(PP (P+WADV forwhi))	when used as subordinator
          (CP-ADV ...))			    (CP-ADV ...))
      (PP (P in)				(PP (P+N indeed))
          (NP (N deed)))
      (PP (P in)				(PP (P+N instead))
          (NP (N stead)))
      (PP (P o')				(PP (P+N o'clock))
          (NP (N clock)))
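
The bracketed examples in this section can be read mechanically, which is also a convenient way to check that a bracketing is balanced. The following is a minimal sketch of a reader for single labeled bracketings, not the parser used by CorpusSearch; `parse` and `leaves` are hypothetical names, and words that themselves contain parentheses (such as the shorthand often-tyme(s)) are not handled.

```python
def parse(text):
    """Parse one labeled bracketing into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        items = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        if pos == len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing bracket
        return items

    tree = node()
    if pos != len(tokens):
        raise ValueError("material after the closing bracket")
    return tree


def leaves(tree):
    """Return (tag, word) pairs for the terminal nodes, left to right."""
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        return [(tree[0], tree[1])]
    out = []
    for child in tree:
        if isinstance(child, list):
            out.extend(leaves(child))
    return out
```

Applied to the two spellings of ABED, `leaves` returns one leaf tagged P+N for the fused form and two leaves tagged P and N for the form spelled apart, while the enclosing PP is the same in both.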