Lexicon

What is a lexicon?

A lexicon is a list of the words used in an input file or files. Following each word is the number of times the word was found, followed by the part-of-speech labels associated with the word and the number of times each part of speech was found. Word identity is determined by spelling. No morphological analysis or spelling normalization is performed. However, spellings that vary only by capitalization are listed on the same line. Also, an initial "$" is ignored.

In the following example, this line:

a-boute 11: [9 P] [1 RP] [1 ADV]

means that the word "a-boute" was found 11 times, 9 times with the part of speech label "P", 1 time with the part of speech label "RP", and 1 time with the part of speech label "ADV".
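The line format described above can be read mechanically. The following Python sketch is not part of CorpusSearch; the function name parse_lexicon_line is hypothetical, and the sketch assumes that spellings contain no ":" and that each line fits on one physical line.

```python
import re

# One "[count TAG]" group, e.g. "[9 P]".
TAG_GROUP = re.compile(r"\[(\d+)\s+([^\]\s]+)\]")

def parse_lexicon_line(line):
    """Split a lexicon line into (spellings, total, {tag: count})."""
    # Everything before the first ":" is the spelling(s) plus the total.
    head, _, tail = line.partition(":")
    *spellings, total = head.split()
    tag_counts = {tag: int(count) for count, tag in TAG_GROUP.findall(tail)}
    return spellings, int(total), tag_counts

spellings, total, tag_counts = parse_lexicon_line("a-boute 11: [9 P] [1 RP] [1 ADV]")
print(spellings, total, tag_counts)

# For this line the per-label counts add up to the word total.
assert sum(tag_counts.values()) == total
```

Returning the spellings as a list keeps lines that carry several capitalization variants, such as "zorobabel Zorobabel 3: [3 NPR]", in the same shape as single-spelling lines.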

make_lexicon

This is the basic command that causes a lexicon to be built. On its own, the following will generate a lexicon of every word in the input file.

make_lexicon: t
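To make the counting rules concrete, here is a rough Python sketch of how such a lexicon could be assembled from (word, label) pairs. It is an illustration only, not CorpusSearch's implementation: the function build_lexicon is hypothetical, the grouping key (strip one initial "$", then ignore case) is an assumption based on the description in this chapter, and the order of the labels within a line is arbitrary here.

```python
from collections import Counter, defaultdict

def build_lexicon(tokens):
    """tokens: iterable of (word, pos_label) pairs -> list of lexicon lines."""
    spellings = defaultdict(list)       # key -> spellings, in order first seen
    tag_counts = defaultdict(Counter)   # key -> how often each label occurred
    for word, tag in tokens:
        # Assumed grouping key: drop one initial "$", then ignore case.
        bare = word[1:] if word.startswith("$") else word
        key = bare.lower()
        if word not in spellings[key]:
            spellings[key].append(word)
        tag_counts[key][tag] += 1

    lines = []
    for key in sorted(spellings):
        counts = tag_counts[key]
        total = sum(counts.values())
        tags = " ".join(f"[{n} {tag}]" for tag, n in counts.items())
        lines.append(f"{' '.join(spellings[key])} {total}: {tags}")
    return lines

# Made-up sample input, not taken from any corpus.
sample = [("a", "D"), ("A", "D"), ("$a", "HV"), ("zele", "N"), ("a", "P")]
for line in build_lexicon(sample):
    print(line)
```

No stemming or respelling happens anywhere in the sketch, which mirrors the statement that word identity is determined by spelling alone.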

pos_labels

This command restricts the lexicon to words with certain part of speech tags. For instance, to obtain a list of words labelled as prepositions:

make_lexicon: t
pos_labels: P|P#|P-*

text_labels

This command restricts the lexicon to certain words. For instance, to find only words beginning with "th" or "+t", in either capitalization:

make_lexicon: t
text_labels: th*|+t*|Th*|+T*
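One way to read such a label list is as a set of alternatives separated by "|", where "*" stands for any run of characters. The Python sketch below models only that reading; it is an assumption about the semantics, not CorpusSearch code, and it deliberately leaves out "#", whose meaning is not described in this chapter. The names compile_labels and matches are hypothetical.

```python
import re

def compile_labels(spec):
    """Turn a spec such as 'th*|+t*' into one regular expression."""
    alternatives = []
    for piece in spec.split("|"):
        # Escape every character, then restore "*" as "any run of characters".
        alternatives.append(re.escape(piece).replace(r"\*", ".*"))
    return re.compile("(?:" + "|".join(alternatives) + ")")

def matches(spec, word):
    """True if the whole word fits at least one alternative in spec."""
    return compile_labels(spec).fullmatch(word) is not None

spec = "th*|+t*|Th*|+T*"
for word in ["the", "The", "+tat", "a-boute"]:
    print(word, matches(spec, word))
```

Because the match is case-sensitive in this sketch, the capitalized alternatives "Th*" and "+T*" are what let "The" through.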

Both pos_labels and text_labels can be specified in one query. For instance, to obtain prepositions beginning with "in":

make_lexicon: t
pos_labels: P|P#|P-*
text_labels: in*

an example

The following query:

make_lexicon: t

results in this output:

/*
PREFACE:
CorpusSearch copyright Beth Randall 2000.
Date:  Tue Sep 21 09:55:12 EDT 2004

command file:     test/lex.q
output file:      test/lex.out

Lexicon:
*/

/*  ~A~  */
a A $a 3713: [3421 D] [10 FW] [104 HV] [15 VAN21] [24 ADV21] [25 P21] [8 VBD21] [15 P] [1 RP21] [1 N21] [4 CONJ] [5 VB21] [6 N] [4 ADJ21] [68 INTJ] [1 VBN21] [1 NUM21]
a 1: [1 D]
a+gen 15: [9 ADV] [6 P]
a+gennyst 1: [1 P]
a+gens 4: [4 P]
a+genst 2: [2 P]
a+geyne 10: [10 ADV]
a-+gen 63: [52 ADV] [11 P]
a-+gens 12: [12 P]
a-bak 1: [1 P+ADV]
a-bakke 1: [1 P+ADV]
a-baschyd 2: [2 VAN]
a-basshed 1: [1 VAN]
a-basshyd 1: [1 VAN]
a-beyn 1: [1 VB]
a-bod 1: [1 VBD]
a-bode 5: [5 VBD]
a-bood 4: [4 VBD]
a-boode 1: [1 VBD]
a-bouen 1: [1 P]
a-boute 11: [9 P] [1 RP] [1 ADV]
.
.
.
.
.
.
.
/*  ~Z~  */
zacari 1: [1 NPR]
zacharie 1: [1 NPR]
zaram 1: [1 NPR]
zebede 1: [1 NPR]
zelator 1: [1 N]
zelatoris 1: [1 NS]
zele 6: [6 N]
zelose 2: [2 ADJ]
zelously 1: [1 ADV]
zeno 1: [1 NPR]
zenocrates 1: [1 NPR]
zenon 1: [1 NPR]
zepherine 1: [1 NPR]
zorobabel Zorobabel 3: [3 NPR]
zorobabell Zorobabell 4: [4 NPR]
zozime 1: [1 NPR]
