The default ignore-lists are:
ignore_nodes: COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*
ignore_words: COMMENT|CODE|ID|LB|'|\"|,|E_S|.|/|RMV:*|0|\**
For instance, if you run this query:
(NP* iPrecedes PP*)
this sentence will be returned:
/*
1 IP-MAT-SPE: 5 NP-1, 9 PP
*/
/~*
There ar two bretheren beyond the see,
(CMMALORY,15.439)
*~/
(0 (1 IP-MAT-SPE (2 NP-SBJ-1 (3 EX There))
                 (4 BEP ar)
                 (5 NP-1 (6 NUM two) (7 NS bretheren))
                 (8 CODE <P_15>)
                 (9 PP (10 P beyond)
                       (11 NP (12 D the) (13 N see)))
                 (14 E_S ,))
   (15 ID CMMALORY,15.439))
Notice that NP-1 immediately precedes PP in spite of the intervening node (8 CODE <P_15>). This is because CODE is on the default ignore-list.
We will sometimes refer to nodes that are not to be ignored as "legitimate" nodes.
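The effect of the ignore-list on immediate precedence can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not CorpusSearch's own code: the sentence is reduced to a flat sequence of sibling labels, the helper names (`matches`, `immediately_precedes`) are hypothetical, and the `*` wildcard is approximated with Python's `fnmatchcase`.

```python
from fnmatch import fnmatchcase

# Default ignore-list from the text above (the \" entry is written as a plain quote).
DEFAULT_IGNORE = ("COMMENT", "CODE", "ID", "LB", "'", '"', ",", "E_S", ".", "/", "RMV:*")

def matches(label, patterns):
    """True if the label matches any pattern in the list."""
    return any(fnmatchcase(label, p) for p in patterns)

def immediately_precedes(labels, x, y, ignore=DEFAULT_IGNORE):
    """True if a legitimate label matching x is directly followed by one matching y."""
    # Drop ignored labels first, so that only "legitimate" nodes are compared.
    legit = [lab for lab in labels if not matches(lab, ignore)]
    return any(fnmatchcase(a, x) and fnmatchcase(b, y)
               for a, b in zip(legit, legit[1:]))

# Daughters of IP-MAT-SPE in the example sentence.
siblings = ["NP-SBJ-1", "BEP", "NP-1", "CODE", "PP", "E_S"]

print(immediately_precedes(siblings, "NP*", "PP*"))             # CODE ignored
print(immediately_precedes(siblings, "NP*", "PP*", ignore=()))  # nothing ignored
```

With the default list the CODE node is skipped and NP-1 counts as immediately preceding PP; with an empty list (the analogue of ignore_nodes: null) the CODE node intervenes and the match is lost.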
The value of node: gives CorpusSearch a node boundary within which to search. The list of labels gives boundaries that any structure you search for will fall within; for example, IP* would yield all the basic clauses in the corpus, and $ROOT is the topmost level of every syntactic tree, whatever its label. In the case of searches on the output of a previous search in which nodes_only is set to "true", $ROOT refers to the root of the tree, which will have the label of the node boundary.
Whenever you want to consider the entire tree as the domain within which to search, use:
node: $ROOT
The choice of node boundary determines the following:
node: IP*|$ROOT
query: (NP* iDominates PRO*)
Here's the output; notice that 1 hit is counted because there is one IP* node, (1 IP-MAT ...), containing both NP* nodes:
/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/
/*
1 IP-MAT: 3 NP-SBJ, 4 PRO he
1 IP-MAT: 6 NP-OB2, 7 PRO them
*/
(0 (1 IP-MAT (2 CONJ and)
             (3 NP-SBJ (4 PRO he))
             (5 VBD made)
             (6 NP-OB2 (7 PRO them))
             (8 NP-OB1 (9 ADJ grete) (10 N chere))
             (11 ADVP (12 ADV out)
                      (13 PP (14 P of)
                             (15 NP (16 N mesure)))))
   (ID CMMALORY,2.13))
/*
FOOTER
source file: CMMALORY
hits found: 1
sentences containing the hits: 1
total sentences searched: 1
*/
Next we ran the query with node boundary NP*:
node: NP* query: (NP* iDominates PRO*)
Here's the output; this time 2 hits are counted, because there are two distinct NP* nodes, (3 NP-SBJ ...) and (6 NP-OB2 ...). Because nodes_only is true by default, only the NP* nodes are printed:
/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/
/*
3 NP-SBJ: 4 PRO he
6 NP-OB2: 7 PRO them
*/
( (3 NP-SBJ (4 PRO he)) (ID CMMALORY,2.13))
( (6 NP-OB2 (7 PRO them)) (ID CMMALORY,2.13))
/*
FOOTER
source file: CMMALORY
hits found: 2
sentences containing the hits: 1
total sentences searched: 1
*/
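The way the node boundary changes the hit count can be modelled roughly as follows. This is a simplified sketch, not CorpusSearch's algorithm: the tree is hand-written as nested tuples, the function names are hypothetical, `*` is approximated with `fnmatchcase`, and $ROOT and nested boundary nodes are not modelled. A hit is counted once for each boundary node whose subtree contains the queried structure.

```python
from fnmatch import fnmatchcase

# The example sentence as nested tuples: (label, child, ...); a leaf is (pos, word).
tree = ("IP-MAT",
        ("CONJ", "and"),
        ("NP-SBJ", ("PRO", "he")),
        ("VBD", "made"),
        ("NP-OB2", ("PRO", "them")),
        ("NP-OB1", ("ADJ", "grete"), ("N", "chere")),
        ("ADVP", ("ADV", "out"),
                 ("PP", ("P", "of"), ("NP", ("N", "mesure")))))

def children(node):
    """Daughter nodes of a node (words, which are plain strings, are skipped)."""
    return [c for c in node[1:] if isinstance(c, tuple)]

def subtree(node):
    """The node itself followed by all of its descendants, in pre-order."""
    yield node
    for child in children(node):
        yield from subtree(child)

def contains_idominates(node, parent_pat, child_pat):
    """True if some node at or below `node` matches parent_pat and has a daughter matching child_pat."""
    return any(fnmatchcase(n[0], parent_pat)
               and any(fnmatchcase(c[0], child_pat) for c in children(n))
               for n in subtree(node))

def count_hits(root, boundary_pat, parent_pat, child_pat):
    """One hit per boundary node whose subtree contains the structure."""
    return sum(1 for n in subtree(root)
               if fnmatchcase(n[0], boundary_pat)
               and contains_idominates(n, parent_pat, child_pat))

print(count_hits(tree, "IP*", "NP*", "PRO*"))  # one IP* node holds both matches
print(count_hits(tree, "NP*", "NP*", "PRO*"))  # NP-SBJ and NP-OB2 each count
```

Under these assumptions the sketch reproduces the counts in the two footers above: 1 hit with the IP* boundary and 2 hits with the NP* boundary.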
add_to_ignore adds the given labels to the ignore_list. For instance,
add_to_ignore: \**
will tell CorpusSearch to ignore traces for this search.
ignore_nodes tells CorpusSearch which nodes to ignore.
To replace the default ignore-list with your own ignore-list, include this command in your command file:
ignore_nodes: <your_ignore_list>
To tell CorpusSearch not to ignore any nodes, include this command in your command file:
ignore_nodes: null
If you try to search for an item that is on the ignore_list, you'll get a warning message. For instance, this query:
(NP-SBJ* iPrecedes CODE)
generates this message:
WARNING! CODE in y_argument to iPrecedes is on the ignore_list.
To make the ignore_list empty, add this line to your command file:
ignore_nodes: null
To write your own ignore_list, add this line to your command file:
ignore_nodes:
The program goes ahead and runs as usual, but if you don't get the results you were looking for, you should probably change the ignore_list.
ignore_words tells CorpusSearch which nodes to ignore when counting words.
To replace the default word-ignore-list with your own word-ignore-list, include this command in your command file:
ignore_words: <your_word_ignore_list>
To tell CorpusSearch not to ignore any nodes in counting words, include this command in your command file:
ignore_words: null
To add nodes to the word-ignore-list, use
add_to_ignore_words:
The following search functions are governed by the word-ignore-list: DomsWords, DomsWords<, DomsWords>. All other functions use the main ignore-list.
These commands do not in any way influence the current search. They only give instructions about how the results of the current search should be printed to the output file. However, because these commands can cause the output of the current search to take different forms, they may influence future searches which will take as their input the output of the current search.
The begin_remark: ... end_remark command tells CorpusSearch to print the user's remark in the output preface. This is a way for the user to write a note to herself, for instance to remember the goal of the search.
For instance, the command file "pro-obj.q" contains this command:
begin_remark: pronoun objects end_remark
which is printed in the output preface like this:
/*
PREFACE: regular output file.
CorpusSearch copyright Beth Randall 1999.
Date: Wed Nov 03 19:12:03 EST 1999
command file: pro-obj.q
input file: ipmat-2vb.out
output file: pro-obj.out
remark: pronoun objects
node: IP*
query: (NP-OB* iDominates PRO)
*/
If nodes_only is true, CorpusSearch prints out only the nodes that contain the structure described in "query".
If nodes_only is false, CorpusSearch prints out the entire sentence that contains the structure described in "query".
For instance, suppose you have this query:
node: ADVP*
query: (ADVP* iDominates ADVP*)
Here's what a piece of the output looks like with nodes_only true.
/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.574)
*~/
/*
2 ADVP: 3 ADVP
*/
( (ADVP (ADVP (ADV certayn))
        (CONJP (CONJ and)
               (PP (P wit-owte)
                   (NP (N doute))))
        (, ,))
  (ID CMAELR3,45.574))
And here's the same piece of output with nodes_only false:
/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.589)
*~/
/*
2 ADVP: 3 ADVP
*/
( (IP-MAT (ADVP (ADVP (ADV certayn))
                (CONJP (CONJ and)
                       (PP (P wit-owte)
                           (NP (N doute)))))
          (, ,)
          (NP-OB1 (NPR Ihon))
          (BEP is)
          (NP-SBJ (PRO$ is) (N name))
          (E_S .))
  (ID CMAELR3,45.589))
print_indices tells CorpusSearch whether or not to print indices in the output.
Indices start at 0 and are used to label every node in the tree. CorpusSearch uses indices to distinguish, for instance, between several different NP nodes in the same sentence.
Here's a piece of an output sentence with indices:
(10 NP-OB1 (11 NPR Morgan) (12 NPR le) (13 NPR Fay)
Here's how it looks without indices:
(NP-PRN (NPR Morgan) (NPR le) (NPR Fey)))
remove_nodes removes nodes of the same syntactic category as the node boundary that do not themselves contain the searched-for structure but are embedded under a node of that category that does contain it.
The purpose of this feature is to make it easier to search output. For instance, if you were looking for IP nodes containing a certain structure, remove_nodes will ensure that your output contains only IP nodes with that structure, and no other IP nodes.
CorpusSearch uses the following algorithm to find the syntactic category of a node: Start with the node boundary label. If that label contains any hyphens, the node's syntactic category is the substring of the label up to the leftmost hyphen, with a '*' tacked on. If the node boundary label does not contain a hyphen, the syntactic category is simply the label with a '*' tacked on, unless the label already has one.
Thus, if the node boundary label is IP-PRN*, the node category is IP*.
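The rule just described is short enough to write out directly. The following Python function is a sketch of that rule as stated in the text, with a hypothetical name (`syntactic_category`); it handles a single label only and makes no attempt to cover alternations such as IP*|$ROOT.

```python
def syntactic_category(boundary_label):
    """Derive the syntactic category from a node boundary label."""
    # Keep the text up to the leftmost hyphen (the whole label if there is none).
    base = boundary_label.split("-", 1)[0]
    # Tack on a '*' unless the result already ends with one.
    return base if base.endswith("*") else base + "*"

for label in ["IP-PRN*", "IP*", "NP-OB1", "ADVP"]:
    print(label, "->", syntactic_category(label))
```

For the four labels above this gives IP*, IP*, NP* and ADVP* respectively.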
Consider the following command file, in which remove_nodes is set to true, and its effect on the output below:
remove_nodes: true
query: (NP-OB* iDoms PRO)
Output:
/~*
'And I shall defende the,' seyde the knyght.
(CMMALORY,39.1264)
*~/
/*
1 IP-MAT-SPE: 8 NP-OB1, 9 PRO the
*/
(0 (1 IP-MAT-SPE (2 ' ')
                 (3 CONJ And)
                 (4 NP-SBJ (5 PRO I))
                 (6 MD shall)
                 (7 VB defende)
                 (8 NP-OB1 (9 PRO the))
                 (10 , ,)
                 (11 ' ')
                 (12 IP-MAT-PRN RMV:seyde_the_knyght...)
                 (13 E_S .))
   (ID CMMALORY,39.1264))
The structure of sub-sentence "seyde the knyght" has been removed from the parsed sentence and replaced with the symbol RMV:<rmv_string>, where rmv_string stands for a string of (up to) the first three words (leaf nodes) of the removed material and serves as a reminder of what has been removed. A further search on this output will be a search only on IP* nodes that contain a pronoun object, and on no other nodes.
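As an illustration of how such a placeholder could be built, here is a small Python sketch. The function name `rmv_symbol` is hypothetical, and the underscore separator and trailing "..." are assumptions inferred from the single example RMV:seyde_the_knyght... above, not from a specification of the format.

```python
def rmv_symbol(leaf_words, max_words=3):
    """Build an RMV: placeholder from (up to) the first three leaf words."""
    # Assumption: words are joined with underscores and followed by "...",
    # as in the one example shown in the output above.
    return "RMV:" + "_".join(leaf_words[:max_words]) + "..."

print(rmv_symbol(["seyde", "the", "knyght"]))
print(rmv_symbol(["and", "he", "made", "them", "grete", "chere"]))
```

The first call reproduces the symbol in the output above; the second shows that only the first three words of longer removed material are kept.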
print_only: CODING
The resultant output file will bear the extension .ooo. In theory, you could substitute a part of speech label for CODING, although if you wanted a list of, for instance, all the nouns in your file, you would probably be better off using the make_lexicon feature.
Comments may be added anywhere in the command file, and CorpusSearch defines default delimiters for them: comment lines begin with "//" and block comments appear between "/*" and "*/". In addition to these default comment delimiters, which are always respected, the user may define comment delimiters of her own by adding the following commands to the file preamble, each followed by the desired delimiter string.
Line comment:
corpus_line_comment:
Block comment:
corpus_comment_begin:
corpus_comment_end: