Search Tips

about this chapter

This chapter gives tips on a number of common problems and errors that arise when using CorpusSearch. The reader is assumed to have a general familiarity with the rest of the CorpusSearch manual. Many of the example queries assume a standard definition file containing definitions for at least finite_verb and non_finite_verb.

The author of this chapter is Ann Taylor. Ann helped in the design of CorpusSearch and has used the program more than anyone else.

using definition files

The following are useful definitions to include in a definition file:

finite_verb:  *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD
non_finite_verb:  *VB|V*N|*HV|H*N|*DO|D*N|*BE|BEN
non-pronominal_NP: *N*|D*|Q*|ADJ*|CONJ*|*ONE*|*OTHER*|CP*

A common error is to forget to use the define command to specify the definition file when using definitions. No error message will be generated, but the search will result in no output.

using *

Be liberal in using *. Using NP-SBJ as a search term will only find a subset of subjects. Some subjects are resumptive (NP-SBJ-RSP), some are coindexed to a clause or to a trace in a lower clause (NP-SBJ-1), and some may have other additional labels. Using NP-SBJ* will find all the subjects labelled in this way, no matter what might be added on to the end of the label. In general, only leave off the * if you are sure you don't want it.

When you want to refer to all the labels covered by, for instance, ADVP*, except one, you have to list all the options you are interested in, as in ADVP|ADVP-LOC|ADVP-TMP (this omits ADVP-DIR, which ADVP* would include). This is what definition files are for; you only have to write the list once.

Note that if you want to refer to an actual * in a search (all traces start with *), escape it with a backslash (\). The following query finds subjects which immediately dominate traces. The first * in \** is escaped and thus refers to an actual *, while the second is not and thus matches anything that follows the *; this will match, for instance, *con*, *exp*, *T-1* and others.

query: (NP-SBJ* iDoms \**)
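
The matching rules described in this section can be modelled in a few lines of Python. This is only a sketch for checking your intuitions about a search term, not CorpusSearch's own implementation; the function name label_matches is hypothetical, and the model assumes just three rules: | separates alternatives, a bare * matches any run of characters, and \* is a literal asterisk.

```python
import re

def label_matches(pattern: str, label: str) -> bool:
    # Hypothetical helper modelling a CorpusSearch-style search term.
    # Assumed rules: "|" separates alternatives, a bare "*" is a wildcard,
    # and a backslash makes the following character literal.
    for alternative in pattern.split("|"):
        regex = ""
        i = 0
        while i < len(alternative):
            ch = alternative[i]
            if ch == "\\" and i + 1 < len(alternative):
                regex += re.escape(alternative[i + 1])  # escaped character is literal
                i += 2
            elif ch == "*":
                regex += ".*"  # unescaped * matches anything
                i += 1
            else:
                regex += re.escape(ch)
                i += 1
        if re.fullmatch(regex, label):
            return True
    return False

print(label_matches("NP-SBJ*", "NP-SBJ-RSP"))               # wildcard covers the extra label
print(label_matches("NP-SBJ", "NP-SBJ-1"))                  # no wildcard, no match
print(label_matches("ADVP|ADVP-LOC|ADVP-TMP", "ADVP-DIR"))  # list omits ADVP-DIR
print(label_matches(r"\**", "*T-1*"))                       # literal * followed by anything
```

In this model the term \** matches any string beginning with an asterisk and nothing else, which is the behaviour the query above relies on.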

the "exists" function

A common error is to overuse the exists function. Using a search term forces that term to exist; it is not necessary to specify this separately. Thus the following is an inefficient query, although it is not ill-formed.

query: ((NP-SBJ* exists)
AND (IP* iDoms NP-SBJ*))

The second part of the query alone will accomplish the same thing and use fewer resources.

same instance

Same instance works by literal match. Thus NP-SBJ does not match NP-SBJ*, and MD|VBD does not match VBD|MD; that is, in neither case would same instance be invoked between the two terms.

When two search terms match, they are forced to apply to the same node. Thus two uses of NP-OB* will require that if, for instance, NP-OB2 is found as an instance of the first NP-OB*, then the next use of NP-OB* will also apply to that same NP-OB2 (not, for instance, to an NP-OB1 which may also be in the vicinity).

When two search terms do not match but might refer to the same node, as for instance, NP-SBJ and NP-SBJ*, or MD|VBD and VBD|MD, same instance is not forced, but neither is it ruled out; that is, the two label strings in the query may or may not wind up referring to the same node in the corpus.

In order to force non-same instance, use index numbers. [1]NP-SBJ* and [2]NP-SBJ* cannot apply to the same NP-SBJ* node.

A common error is to forget that CorpusSearch applies same instance even in cases that are (to the linguist) impossible. Thus, for instance, a query such as the following will produce no results:

query: ((NP-SBJ* iDoms PRO)
AND (NP-OB1* iDoms PRO))

Although it is impossible for these PROs to refer to the same node, since they are dominated by different nodes, CorpusSearch will assume they do, and consequently will find no matches. Giving the two PROs different index numbers ([1]PRO and [2]PRO) avoids this. Traces and zeros also need to be differentiated, as in the following:

query: ((MD iDoms [1]!\**)
AND (VB iDoms [2]!\**))

query: ((WNP iDoms [1]0)
AND (C iDoms [2]0))

An easier way to accomplish the former is to add traces to the ignore list.
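
Because same instance is triggered by literal repetition, it can help to check a query for repeated terms before running it. The following Python sketch does this for simple two-argument clauses. It is a hypothetical helper, not part of CorpusSearch, and it assumes every clause has the shape (term function term); one-argument clauses such as (NP-SBJ* exists) are simply skipped.

```python
import re
from collections import Counter

# Matches a simple two-argument clause such as "(NP-SBJ* iDoms PRO)".
CLAUSE = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s+([^\s()]+)\s*\)")

def repeated_terms(query: str) -> list:
    # Hypothetical lint: list search terms that occur more than once in
    # exactly the same form, i.e. the terms same instance would tie together.
    counts = Counter()
    for left, _function, right in CLAUSE.findall(query):
        counts[left] += 1
        counts[right] += 1
    return sorted(term for term, n in counts.items() if n > 1)

unindexed = "((NP-SBJ* iDoms PRO)\nAND (NP-OB1* iDoms PRO))"
indexed = "((NP-SBJ* iDoms [1]PRO)\nAND (NP-OB1* iDoms [2]PRO))"

print(repeated_terms(unindexed))  # PRO is repeated, so same instance applies
print(repeated_terms(indexed))    # the index numbers make the two terms differ
```

A repeated term is not necessarily a mistake, since tying terms together is often exactly what you want; the list is only a prompt to check that each repetition is intended.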

ignoring certain nodes

A default "ignore list" is supplied with CorpusSearch. It contains such things as punctuation and various meta labels that are not part of the text. If you want to search for punctuation, for instance, or line breaks, then you must provide your own ignore list which does not include the items you want to be able to access.

Although the ignore list is primarily a way to avoid non-text annotations, linguistic labels can also be added to the ignore list, in which case CorpusSearch will simply act as if they are not there. Thus for instance, if you add NEG to the ignore list, you can find cases in which nothing but negation intervenes between the subject and the finite verb.

add_to_ignore: NEG
query: (NP-SBJ* iPrecedes finite_verb)

This will find the following two sentences:

Arthur loves Guinevere
Arthur ne loves Guinevere

but not:

Arthur madly loves Guinevere
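
The effect of the ignore list on these three sentences can be modelled in Python as follows. This is a simplified sketch, not CorpusSearch's implementation: each clause is reduced to a flat list of sister nodes, the labels (NP-SBJ, NEG, ADVP, VBP, NP-OB1) are assumed annotations for the examples, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Expansion of the finite_verb definition given earlier in this chapter.
FINITE_VERB = "*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD".split("|")

def matches_any(label, patterns):
    return any(fnmatchcase(label, p) for p in patterns)

def subject_iprecedes_finite_verb(nodes, ignore=()):
    # nodes is a list of (label, word) sisters in surface order.
    # Ignored labels are dropped first, as if they were not in the tree.
    visible = [label for label, _word in nodes if not matches_any(label, ignore)]
    return any(
        fnmatchcase(first, "NP-SBJ*") and matches_any(second, FINITE_VERB)
        for first, second in zip(visible, visible[1:])
    )

plain = [("NP-SBJ", "Arthur"), ("VBP", "loves"), ("NP-OB1", "Guinevere")]
negated = [("NP-SBJ", "Arthur"), ("NEG", "ne"), ("VBP", "loves"), ("NP-OB1", "Guinevere")]
adverb = [("NP-SBJ", "Arthur"), ("ADVP", "madly"), ("VBP", "loves"), ("NP-OB1", "Guinevere")]

for clause in (plain, negated, adverb):
    print(subject_iprecedes_finite_verb(clause, ignore=["NEG"]))
```

With NEG ignored, the first two clauses match and the third does not, because the adverb is still visible between the subject and the verb.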

Using the ignore list is also helpful in looking for V2. In many cases, the verb is not technically the second node in the IP because of an initial conjunction. Adding CONJ (and possibly some other things, such as INTJ* and NP-VOC) to the ignore list will solve this problem (or at least reduce it). The query below will find all the following:

The sword desired Lancelot
And the sword desired Lancelot
Gramercy, Arthur, the sword desired Lancelot

add_to_ignore: INTJ*|NP-VOC|CONJ
query: ((IP* iDomsNumber1 NP-OB*)
AND (IP* iDomsNumber2 finite_verb))
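
To see why the ignore list helps here, the position counting can be modelled in Python. The sketch below is an illustration under stated assumptions, not CorpusSearch's implementation: each IP is reduced to the list of its daughters' labels, the labels are assumed annotations for the three example sentences (punctuation is left out, since the default ignore list already covers it), and nth_daughter is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def nth_daughter(daughters, n, ignore=()):
    # Return the label of the n-th daughter (counting from 1) once ignored
    # labels have been skipped, or None if there is no such daughter.
    visible = [d for d in daughters if not any(fnmatchcase(d, p) for p in ignore)]
    return visible[n - 1] if 0 < n <= len(visible) else None

IGNORE = "INTJ*|NP-VOC|CONJ".split("|")

clauses = [
    ["NP-OB1", "VBD", "NP-SBJ"],                     # The sword desired Lancelot
    ["CONJ", "NP-OB1", "VBD", "NP-SBJ"],             # And the sword desired Lancelot
    ["INTJ", "NP-VOC", "NP-OB1", "VBD", "NP-SBJ"],   # Gramercy, Arthur, the sword ...
]

for daughters in clauses:
    print(nth_daughter(daughters, 1, IGNORE), nth_daughter(daughters, 2, IGNORE))
```

In this model, the object is daughter number 1 and the finite verb daughter number 2 in all three clauses once the ignored labels are skipped; without the ignore list, the second and third clauses would begin with CONJ and INTJ respectively.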

searching for traces

Traces (which all start with * in the PPCME2) are treated as text by CorpusSearch, and thus can be searched for. In order to differentiate the * which means "match anything" from the * that is part of the text of a trace, use \* to refer to the latter. The string \** will match any trace.

In the more common case, in which you want to simply ignore traces, add them to the ignore list as follows:

add_to_ignore: \**

This means that a node containing nothing but a trace will not be found. Thus a query such as (NP* exists) will not find any NPs which contain only traces.

finding non-pronominal NPs

Do not search for non-pronominal NPs with the following query:

(NP* iDoms !PRO)

This will also eliminate cases like "Robin and me" and "he and I", since these contain a PRO. Instead use the non-pronominal_NP definition.

restricting searches to a single IP

CorpusSearch requires that you specify a node boundary in which to search. The node boundary includes everything dominated by the node, no matter how deeply embedded. Thus, if IP* is specified as the node boundary and an IP contains a subordinate clause IP, the contents of the embedded subordinate clause are also within the node. A common error is to write a query such as

query: ((IP* iDomsNumber1 NP-OB*)
AND (finite_verb iPrecedes NP-SBJ*))

with the intent of finding V2 clauses with a topicalized object. The first function looks for IPs which have an object as the first element; the second for a finite verb immediately preceding the subject. This query will, in fact, find V2 clauses with a topicalized object, but it may also find some other clauses as well. It will find (if there are any) IPs which contain one clause in which the first element is an object, and another, different clause within the same node boundary in which the finite verb precedes the subject. Either one of these clauses may be the main clause and the other an embedded clause, or they may both be embedded IPs within a dominating IP.

There are two ways to avoid this error and force all parts of the query to apply within the same IP.

The first is to make use of the built-in same instance feature. Same instance means that if you use a node label in the query more than once in exactly the same form, CorpusSearch assumes that you intend each use to apply to the same instance of that node. Same instance applies across query clauses conjoined by AND. You can use same instance to keep all parts of the query inside the same IP (or any other node) by "tying" one term of the query to the node, as in the first clause of the query above, and then making sure that every subsequent search function uses either that "tied" term or the node itself. For instance, we could fix the query above by writing it as:
```
query: (((IP* iDomsNumber1 NP-OB*)
AND (NP-OB* iPrecedes finite_verb))
AND (finite_verb iPrecedes NP-SBJ*))
```
or alternatively:
```
query: (((IP* iDomsNumber1 NP-OB*)
AND (IP* iDoms finite_verb))
AND (finite_verb iPrecedes NP-SBJ*))
```
The repeated instances of NP-OB* in the first example and IP* in the second refer to the same instance of NP-OB* and IP* respectively, thus forcing all parts of the query to be immediately dominated by the node.

The second solution is to use the remove_nodes function. The default setting for remove_nodes is false, so to activate it you must include the line remove_nodes: t in the query file. Removing nodes removes any embedded structure whose root matches the specified boundary node. When IP* is specified, all embedded IPs will be removed. If the boundary node is set as NP*, all NPs embedded within another NP will be removed.

Note that all that is required for a match is that the syntactic category of the label (the part before the hyphen) matches. Thus, if the node is IP-MAT*, any node whose label starts with IP will be removed, including in this case IP-SUB, IP-SMC, IP-PPL, etc. You cannot, for instance, set the boundary node to IP-MAT* and keep IP-SUBs from being removed. When "remove nodes" is in force, any node that doesn't match the query is removed completely from the output; any embedded node that matches the query is removed from its matrix and printed below it.

To solve our problem the "remove nodes" way, we would first create a file with only single clauses with all embedded nodes removed, by a query such as
```
remove_nodes: t
query: (IP* iDoms finite_verb)
```
This query will produce a file in which every token is an IP containing a finite verb with all embedded IPs removed. The following query:
```
query: ((IP* iDomsNumber1 NP-OB*)
AND (finite_verb iPrecedes NP-SBJ*))
```
can then be used on the output of the first query and will yield only the cases intended. (But note that this query is not actually going to produce all V2 clauses with a topicalized object anyway, since many such clauses begin with a conjunction or other introductory word, and thus the object will be the second element in the IP*; for a solution to this problem, see the section on ignoring certain nodes.)
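
The category-matching rule for remove_nodes described above can be sketched in Python. This models only the splitting step and makes several assumptions that are not taken from CorpusSearch itself: trees are written as (label, children) pairs with words as plain strings, the tree passed in is rooted in a boundary node, the removed clause leaves its label behind with a placeholder as its only content, and the literal text RMV:<rmv_string> stands in for whatever string CorpusSearch actually generates. The function names are hypothetical.

```python
PLACEHOLDER = "RMV:<rmv_string>"  # stand-in for the string CorpusSearch generates

def category(label):
    # Syntactic category: the part of the label before the first hyphen.
    return label.rstrip("*").split("-", 1)[0]

def prune(children, target, pending):
    # Copy the children, cutting out embedded nodes of the target category.
    kept = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)                      # a word or trace
        elif category(child[0]) == target:
            pending.append(child)                   # becomes a token of its own
            kept.append((child[0], [PLACEHOLDER]))
        else:
            kept.append((child[0], prune(child[1], target, pending)))
    return kept

def split_clauses(tree, boundary):
    # Return the matrix node followed by the nodes removed from it.
    target = category(boundary)
    pending, tokens = [tree], []
    while pending:
        label, children = pending.pop(0)
        tokens.append((label, prune(children, target, pending)))
    return tokens

tree = ("IP-MAT", [
    ("NP-SBJ", ["Arthur"]),
    ("VBD", ["said"]),
    ("CP-THT", [
        ("C", ["that"]),
        ("IP-SUB", [("NP-SBJ", ["Lancelot"]), ("VBD", ["came"])]),
    ]),
])

for token in split_clauses(tree, "IP-MAT*"):
    print(token)
```

Even though the boundary is given as IP-MAT*, the IP-SUB is cut out of the matrix clause, because only the category IP is compared.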

counting words and remove_nodes

Note that if you have remove_nodes turned on, the string RMV:<rmv_string> counts as text, so you can search for it. It will not, however, be counted as a word when doing word counts (like traces, which likewise are not counted). But if you count the number of words in a node that contains RMV:<rmv_string>, you will, of course, get the wrong answer, since RMV:<rmv_string> replaces a clause full of words. In order to avoid this result, either don't use remove_nodes when counting, or use a query like the following, which won't count any node containing RMV:<rmv_string>. Nodes containing RMV:<rmv_string> can then be counted separately.

query: (((IP* iDoms NP-OB*)
AND (NP-OB* domsWords3))
AND (NP-OB* doms !RMV:*))

Another way to do this is to add RMV:* to the ignore list and then, as before, count the nodes containing RMV:* separately.

add_to_ignore: RMV:*
query: ((IP* iDoms NP-OB*)
AND (NP-OB* domsWords3))
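
The counting problem can be illustrated with a small Python model. It is a sketch only, not CorpusSearch's implementation: trees are written as (label, children) pairs with words as plain strings, anything starting with * is treated as a trace, anything starting with RMV: as a remove_nodes placeholder, and the function names and example trees are hypothetical.

```python
def leaves(node):
    # Yield every word, trace or placeholder under the node, left to right.
    _label, children = node
    for child in children:
        if isinstance(child, str):
            yield child
        else:
            yield from leaves(child)

def count_words(node):
    # Traces and RMV: placeholders are not counted as words.
    return sum(
        1 for leaf in leaves(node)
        if not leaf.startswith("*") and not leaf.startswith("RMV:")
    )

def contains_rmv(node):
    return any(leaf.startswith("RMV:") for leaf in leaves(node))

simple = ("NP-OB1", [("D", ["the"]), ("ADJ", ["good"]), ("N", ["sword"])])
truncated = ("NP-OB1", [
    ("D", ["the"]),
    ("N", ["sword"]),
    ("CP-REL", [("C", ["that"]), ("IP-SUB", ["RMV:<rmv_string>"])]),
])

for np in (simple, truncated):
    print(count_words(np), contains_rmv(np))
```

Both objects come out with a count of 3 in this model, but only the first count is meaningful: the second object contains a placeholder, so the words of the removed clause are missing from its total.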