Splitting and Joining Parts of Words

Words that are always treated as separate parts
Words that are sometimes treated as separate parts
Words which are always treated as unitary
Collocations which are treated as written

Words that are always treated as separate parts

There are two ways in which it may be indicated that we consider a single written sequence as two words.
  1. The parts may be physically separated. This is done when the two parts belong to different constituents (as defined by the PPCME2). The split is made mainly to facilitate syntactic parsing, but it appears in the POS and text files as well to preserve compatibility. In this case a note is always added indicating what the form was in the original text, and the parts are marked as emendations. Some combinations, such as a pronoun and a verb, are always separated, since they always belong to different constituents. Others, such as determiner-adjective combinations, may or may not be split depending on whether the adjective belongs to a separate constituent (again as defined by the PPCME2). One exception to this rule is a preposition taking a single-word complement. These are so commonly written as single words in the early texts that it is not feasible to separate them all. In this case, therefore, the combination is tagged with multiple tags and the whole constituent is labelled as a PP. See below. The following types of cases are always separated:
    1. pronoun plus modal or verb
              $ich_PRO $challe_MD {TEXT:ichalle}_CODE
              $me_PRO $thynketh_VBP {TEXT:methynketh}_CODE
    2. modal or verb plus pronoun
              $maist_MD $tow_PRO {TEXT:maistow}_CODE
              $grin_VBI $it_PRO {TEXT:grinit}_CODE
    A preposition taking a single word complement, however, is not split. See above. A very common type is the combination of an R-pronoun plus preposition.
    Some common types that may be split depending on the syntactic configuration that they occur in are:
    1. preposition plus determiner. This type is split when the determiner is followed by a noun, adjective, etc., but not if the determiner itself is the head of the NP (following the rule that prepositions plus single-word complements are not split).
              hu_WADV god_NPR seolf_N wes_BED i+tis_P+D iderued_VAN
              +Te_D blake_ADJ cros_N limpe+d_VBP to_P +teo_D +te_C 
              make+d_VBP $i_P $+te_D {TEXT:i+te}_CODE world_N hare_PRO$
              penitence_N for_P ladliche_ADJ sunnen_NS ._.
    2. determiner plus noun, adjective, or adverb. This type is not split if the noun or adjective (or other element) does not form a constituent with any following words.
              $a_D $ful_ADV {TEXT:aful}_CODE wac_ADJ knif_N
  2. In cases in which the two parts do not belong to different constituents (again as defined by the PPCME2), the sequence is tagged with all the appropriate tags, joined by +. The following are some common cases which are given multiple tags:
    1. negation plus modal or verb
    2. infinitival TO plus verb
              for_FOR tabyde_TO+VB
    3. pronoun plus SELF
      Note that SELF is always tagged N whether it appears to be singular or plural.
    4. AS plus comparative
              a-swythe_ADVR+ADV       = as swythe (quickly)
              assone_ADVR+ADV         = as soon
    5. noun compounds. When the parts of a noun compound are written together, both are tagged.
              lifetime_N+N                    alderman_N+N
              eortheware_N+N                  godfather_N+N
              bishopric_N+N                   household_N+N
      Compounds like EVIL-DOERS are tagged N+NS if the first part can be interpreted as a noun. If not, as in WELL-DOERS, the compound is tagged appropriately: ADV+NS.
    6. WHAT SO EVER, etc.
    7. Modifier plus noun
              vainglory_ADJ+N                 othergates_OTHER+NS
              Halichurche_ADJ+NPR             otherwise_OTHER+N
              hidercume_ADV+N                 grandsire_ADJ+N
              oftyn-tyme_ADV+N                oftesy+de_ADV+N 
              oftyn-tymes_ADV+NS              ofte-tide_ADV+N
    8. Adverb plus past participle
    9. Combinations involving WARD. Combinations involving WARD used as prepositions are tagged P (see Section PREPOSITIONS IN -WARD), and those used adjectivally (INWARD, OUTWARD) are tagged ADJ (see Section INWARD, OUTWARD (ADJECTIVE)). All others are tagged by parts.
              toward_P                   inward_ADJ beauty_N 
              inward_RP+WARD             in_RP warde_WARD
              tyll_P Jhesuwarde_NPR+WARD
    10. Quantifier plus anything
              ilkane_Q+ONE            echone_Q+ONE            someone_Q+ONE
    11. Noun plus participle
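
The token conventions in this section can be read mechanically. The following is a minimal Python sketch, not part of the corpus tooling. The names parse_token and parse_line are hypothetical, and the token shapes are assumptions read off the examples above: the tag is whatever follows the last underscore, a leading $ marks a part produced by splitting, {TEXT:...}_CODE records the original written form, and + joins multiple tags.

```python
import re

# Hypothetical helpers for illustration; not part of any corpus distribution.
# Assumption: the tag is everything after the LAST underscore of a token.
NOTE_RE = re.compile(r"^\{TEXT:(?P<original>.*)\}$")


def parse_token(token):
    """Describe one 'form_TAG' token as a dictionary."""
    form, sep, tag_part = token.rpartition("_")
    if not sep or not form or not tag_part:
        raise ValueError(f"not a tagged token: {token!r}")

    note = NOTE_RE.match(form)
    if tag_part == "CODE" and note:
        # e.g. {TEXT:ichalle}_CODE keeps the form found in the original text.
        return {"kind": "note", "original": note.group("original")}

    # Assumption: a leading '$' marks a part produced by splitting.
    emended = form.startswith("$")
    return {
        "kind": "word",
        "form": form[1:] if emended else form,
        # '+' in the tag joins several tags; '+' in the form is left alone,
        # since the forms themselves may contain '+' (as in 'limpe+d').
        "tags": tag_part.split("+"),
        "emended": emended,
    }


def parse_line(line):
    """Parse a whitespace-separated line of tagged tokens."""
    return [parse_token(token) for token in line.split()]
```

Under these assumptions, parse_line("$ich_PRO $challe_MD {TEXT:ichalle}_CODE") gives two emended words tagged PRO and MD followed by a note whose original form is ichalle, while parse_token("tabyde_TO+VB") gives one unemended word carrying the tags TO and VB.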

Words that are sometimes treated as separate parts

Certain words of later English are fusions of earlier multi-word phrases. Given the time coverage of the corpus and the fact that word division in early texts is not always well represented, this is very difficult to deal with in a consistent way. The solution we have adopted is as follows. With a few exceptions outlined below, if a 'phrasal word' (i.e., a word which was once a phrase) ever occurs in our corpus as a phrase, then it is always tagged as a phrase, whether it is written as two words or not. Phrases that had fused before the beginning of the Middle English period are treated as single words. The following are some common categories of 'words' tagged as phrases.
  1. A- words, where A < ON
            a morwe         amorwe_P+N
            a two           atwo_P+NUM
            a +tre          a+tre_P+NUM
            a foure         afoure_P+NUM
            a fire          afire_P+N
            a backward      abackward_P+ADV+WARD
            a doun          adoun_P+RP
            a fishing       afishing_P+N
            a sunder        asunder_P+ADJ
  2. miscellaneous
            for ever        forever_P+ADV
            on live         onlive_P+N
            in deed         indeed_P+N
            for thi         forthi_P+D
            for soothe      forsooth_P+N
            by cause        because_P+N
            at once         atonce_P+ADV
            by times        betimes_P+N
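
One consequence of tagging phrasal words as phrases is that the sequence of tags is the same whether the word is written together or apart. The sketch below illustrates this in Python; tag_sequence is a hypothetical helper, and the tagging shown for the separated spelling (a_P morwe_N) is an assumption inferred from the fused tag P+N in the list above, not an example quoted from the corpus.

```python
def tag_sequence(tagged_text):
    """Return the flat list of tags for whitespace-separated 'form_TAG' tokens.

    Hypothetical helper: tags joined by '+' are split apart, so a fused
    spelling and a separated spelling can be compared tag by tag.
    """
    tags = []
    for token in tagged_text.split():
        form, sep, tag_part = token.rpartition("_")
        if not sep or not form or not tag_part:
            raise ValueError(f"not a tagged token: {token!r}")
        tags.extend(tag_part.split("+"))
    return tags


# Fused spelling from the list above versus an assumed separated tagging.
fused = tag_sequence("amorwe_P+N")
separate = tag_sequence("a_P morwe_N")
print(fused == separate)
```

Comparing flat tag sequences in this way is one possible consistency check on a phrasal word across its spellings; it says nothing about which words count as phrasal, which is determined by the rule stated above.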

Words which are always treated as unitary

In general, the words in this category do not have (obviously) compositional semantics. They also tend to be written consistently, whether together or separately, over centuries.
  1. Adverbs

    At least the following adverbs are treated as unitary.

            above                      henseforward
            about                      la(n)hure
            abroad                     nevermore and variants
            afore                      nevertheless, natforthi, and variants
            again                      overal
            anon                       overmete
            aright                     peraventure
            away                       toeken
            almost                     tofore
            already                    toforehand
            always                     together
            asswa                      towhether
            behind                     whatforthi
            before                     withal
            eftsoon                    within
            evermore and variants      without
            fornigh                    +te+get
  2. Prepositions

    The following list of prepositions is treated as unitary.

            about               forto               tofore
            a+det               fromward            togains
            agains              intil               toward
            although            inwith              towhethere
            al-what             lest                unto
            among               notwithstanding     withal
            apon                onward              within
            before              outake(n)           without
            behind              +te+get            
            beneath             thereagainst        
            beside(s)           theretogainst
            between             throughout
            betwix              tilinto
            beyond              tilto
            bimong              toeke
  3. Verbs
    1. verbs with separable/inseparable prefixes. We do not (because we cannot easily) distinguish between separable and inseparable prefixes preceding the verb. All verbal prefixes are joined to the verb. When separable prefixes follow the verb, they are tagged RP.
              with_VB21 say_VB22
              to_VBD21 brake_VBD22
              by_VBD21 shone_VBD22
              fore_VAN21 said_VAN22
    2. verbs with prefix A-. In most verbs with A, the A is originally a prefix (adding "intensity"), although ADO is from AT DO, where AT is originally the northern infinitive marker.
              a_VBP21 kel+t_VBP22
              a_VBD21 resunede_VBD22
              a_VBD21 seide_VBD22
              a_VBP21 turne+t_VBP22
    3. verbs with GE-, Y-, I-, etc.
              i_VBN21 onswered_VBN22
              i_VB21 heren_VB22
              +ge_VBP21 bette_VBP22
  4. Adjectives. If a clearly bipartite adjective cannot be accurately tagged according to its parts, it is tagged ADJ.
            mild_ADJ21 heorted_ADJ22
            tweie_ADJ21 to+ted_ADJ22
            fe+der_ADJ21 fotetd_ADJ22
            freo_ADJ21 Iheorted_ADJ22
            gled_ADJ21 ful_ADJ22
            +tusend_ADJ21 falde_ADJ22
            un_ADJ21 seli_ADJ22
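
The examples in this section tag the written parts of a unitary word with the same tag followed by two digits (VB21 VB22, ADJ21 ADJ22). The Python sketch below rejoins such parts. It rests on an assumption inferred from the examples rather than stated in the text: the first digit is the total number of parts and the second is the position of the part. The name join_parts is hypothetical.

```python
import re

# Assumption: a numbered tag is a base tag plus two digits, <total><index>.
PART_RE = re.compile(r"^(?P<tag>\D+)(?P<total>[2-9])(?P<index>[1-9])$")


def join_parts(tokens):
    """Rejoin numbered parts into (form, tag) pairs; pass other tokens through."""
    result = []
    pending = []  # forms of the word currently being assembled
    for token in tokens:
        form, sep, tag = token.rpartition("_")
        if not sep or not form or not tag:
            raise ValueError(f"not a tagged token: {token!r}")
        match = PART_RE.match(tag)
        if match is None:
            # An ordinary token: keep it unchanged.
            result.append((form, tag))
            continue
        pending.append(form)
        if match.group("index") == match.group("total"):
            # Last part reached: emit the joined form with the base tag.
            result.append(("".join(pending), match.group("tag")))
            pending = []
    if pending:
        raise ValueError(f"incomplete multi-part word: {pending!r}")
    return result
```

With the examples above, join_parts(["with_VB21", "say_VB22"]) returns [("withsay", "VB")] and join_parts(["mild_ADJ21", "heorted_ADJ22"]) returns [("mildheorted", "ADJ")]. Tokens whose tags do not end in two digits are returned unchanged.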

Collocations which are treated as written

This category is tagged as written: as a phrase when the parts are written separately, and as a single word when they are written together.
  1. Time words with TO
            to_P day_N      today_N 
            to_P morrow_N   tomorrow_N
  2. Other time words
            yester_ADJ day_N        yesterday_N
            after_P noon_N          afternoon_N
            seteres_NPR day_NPR     Saturday_NPR
  3. Modification of adjectives and adverbs
            almighty_ADJ            all_Q mighty_ADJ
  4. Double prepositions (particle plus preposition)
            into_P                  in_RP to_P
            up-on_P                 up_RP on_P