Our aim is to find all the NPs that contain negatives, but we want to exclude any sentential negation that may be contained in embedded clauses within NPs. This is a two-step process; first we are going to get rid of all embedded clauses that don't contain NPs because they are completely irrelevant, and print any embedded clause that contains an NP as a separate token. We do this by setting remove_nodes to true and then searching for all IPs that contain NPs.
This search is done on the corpus file. Since most IPs in the corpus contain NPs, the output of this search will be quite extensive, with most original tokens broken up into a series of individual NODES with all the embedded clauses removed. The query file for this search is:

    remove_nodes: t
    node: IP*
    query: (IP* iDoms NP*)
The output of this first search looks like this for one corpus token:

    /~*
    +Tu wast leof +t+at we awendon on +tam twam +arrum bocum +t+ara halgena +trowunga and lif, +te Angelcynn mid freolsdagum wur+ta+d.
    (copreflives,+ALS_[Pref]:7.4)
    *~/
    /*
    2 IP-MAT: 3 NP-NOM
    2 IP-MAT: 6 NP-NOM-VOC
    10 IP-SUB: 11 NP-NOM
    10 IP-SUB: 21 NP-ACC
    35 IP-SUB: 37 NP-NOM
    */
    (NODE (2 IP-MAT (3 NP-NOM (4 PRO^N +Tu)) (5 VBPI wast) (6 NP-NOM-VOC (7 ADJ^N leof)) (8 CP-THT (9 C +t+at) (10 IP-SUB RMV:we_awendon_on...)) (44 . .)) (ID copreflives,+ALS_[Pref]:7.4))
    (NODE (10 IP-SUB (11 NP-NOM (12 PRO^N we)) (13 VBDI awendon) (14 PP (15 P on) (16 NP-DAT (17 D^D +tam) (18 NUM^D twam) (19 ADJR^D +arrum) (20 N^D bocum))) (21 NP-ACC (22 NP-GEN (23 D^G +t+ara) (24 N^G halgena) (25 CP-REL *ICH*-1)) (26 N^A +trowunga) (27 CONJP (28 CONJ and) (29 NX-ACC (30 N^A lif))) (31 , ,) (32 CP-REL-1 (33 WNP-2 0) (34 C +te) (35 IP-SUB RMV:*T*-2_Angelcynn_mid...)))) (ID copreflives,+ALS_[Pref]:7.4))
    (NODE (35 IP-SUB (36 NP *T*-2) (37 NP-NOM (38 NR^N Angelcynn)) (39 PP (40 P mid) (41 NP-DAT (42 N^D freolsdagum))) (43 VBPI wur+ta+d)) (ID copreflives,+ALS_[Pref]:7.4))

We can now use our original query on the output of the previous search.
In this case, since embedded clauses have already been removed, there is no danger of getting unwanted sentential negation.

    node: NP*
    query: (NEG* exists)
    /~*
    and hit mid ealle forbernde, swa +t+at +d+ar n+as to lafe nan+ding +te hyre w+as.
    (coaelive,+ALS_[Eugenia]:260.347)
    *~/
    /*
    9 NP-NOM: 10 NEG+Q+N^N nan+ding
    */
    (NODE (10 NP-ACC (11 NEG+Q^A n+anne) (12 ADJ^A geleaffulne) (13 N^A mann) (14 CP-REL (15 WNP-NOM-1 0) (16 C +te) (17 IP-SUB RMV:*T*-1_hi_l+aren...))) (ID coaelive,+ALS_[Eugenia]:30.208))
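The effect of remove_nodes in the first search can be pictured with a small Python sketch. This is only an illustration on a toy bracketed tree, not CorpusSearch's own code: the function names (parse, prune, search) and the English toy sentence are assumptions made for the example, and the RMV stub is imitated from the sample output above (node label kept, up to three words joined with underscores).

```python
import fnmatch


def parse(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    toks = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [toks[pos]]            # node label
        pos += 1
        while toks[pos] != ")":
            if toks[pos] == "(":
                node.append(read())
            else:
                node.append(toks[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def matches(label, pattern):
    return fnmatch.fnmatchcase(label, pattern)


def leaves(node):
    for child in node[1:]:
        if isinstance(child, str):
            yield child
        else:
            yield from leaves(child)


def prune(node, pattern):
    """Copy node, replacing embedded nodes that match pattern with an RMV stub."""
    out = [node[0]]
    for child in node[1:]:
        if isinstance(child, str):
            out.append(child)
        elif matches(child[0], pattern):
            words = list(leaves(child))
            out.append([child[0], "RMV:" + "_".join(words[:3]) + "..."])
        else:
            out.append(prune(child, pattern))
    return out


def boundary_nodes(node, pattern):
    """Yield every node whose label matches pattern, outermost first."""
    if matches(node[0], pattern):
        yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from boundary_nodes(child, pattern)


def idoms(node, pattern):
    """True if node immediately dominates a child whose label matches pattern."""
    return any(not isinstance(c, str) and matches(c[0], pattern) for c in node[1:])


def search(tree, node_pattern, child_pattern):
    """Toy version of: remove_nodes: t / node: X / query: (X iDoms Y)."""
    return [prune(n, node_pattern)
            for n in boundary_nodes(tree, node_pattern)
            if idoms(n, child_pattern)]


def show(node):
    return "(" + " ".join(c if isinstance(c, str) else show(c) for c in node) + ")"


# Toy tree (invented for illustration, not corpus data).
TREE = ("(IP-MAT (NP-NOM (PRO we)) (VBD saw) "
        "(CP-THT (C that) (IP-SUB (NP-NOM (PRO he)) (NEG not) (VBD came))))")

hits = search(parse(TREE), "IP*", "NP*")
for hit in hits:
    print(show(hit))
```

Each matching IP comes out as its own token, and the embedded IP inside the matrix clause is reduced to a stub, so a later search for NEG* inside NP* cannot see the sentential negation of an embedded clause.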
Example 2
This is an extended example of how to set up a file of subordinate clauses
for further searching. For our investigation we want only subordinate
clauses which are introduced by an overt complementizer, and we want the
clauses to have a finite verb. The last condition rules out a lot of
clauses which are incomplete because of elision, and probably won't be
useful (although this depends on what the real investigation is).
The first step is to extract all the CPs with overt complementizers. We set the node to CP* because at this stage we want to access the CP-level. We also set remove_nodes to true so that embedded CPs will either be thrown away if they don't have an overt complementizer or, if they do, will be printed as separate tokens. The query specifies that a CP must dominate a C, the label for complementizer, and that this complementizer doesn't immediately dominate 0, which is the way empty complementizers are indicated. If a complementizer is not empty it is overt, so this will give us what we are looking for.
The query file for this step is:

    remove_nodes: t
    node: CP*
    query: ((CP* iDoms C) AND (C iDoms !0))

Typical output looks like the following. Note that 32 CP-REL-1 has been removed from 8 CP-THT and printed as a separate token. Another CP, 25 CP-REL, has also been removed, but is not printed since it doesn't match the query.
    /~*
    +Tu wast leof +t+at we awendon on +tam twam +arrum bocum +t+ara halgena +trowunga and lif, +te Angelcynn mid freolsdagum wur+ta+d.
    (copreflives,+ALS_[Pref]:7.4)
    *~/
    /*
    8 CP-THT: 9 C +t+at
    32 CP-REL-1: 34 C +te
    */
    (NODE (8 CP-THT (9 C +t+at) (10 IP-SUB (11 NP-NOM (12 PRO^N we)) (13 VBDI awendon) (14 PP (15 P on) (16 NP-DAT (17 D^D +tam) (18 NUM^D twam) (19 ADJR^D +arrum) (20 N^D bocum))) (21 NP-ACC (22 NP-GEN (23 D^G +t+ara) (24 N^G halgena) (25 CP-REL RMV:*ICH*-1...)) (26 N^A +trowunga) (27 CONJP (28 CONJ and) (29 NX-ACC (30 N^A lif))) (31 , ,) (32 CP-REL-1 RMV:0_+te_*T*-2...)))) (ID copreflives,+ALS_[Pref]:7.4))
    (NODE (32 CP-REL-1 (33 WNP-2 0) (34 C +te) (35 IP-SUB (36 NP *T*-2) (37 NP-NOM (38 NR^N Angelcynn)) (39 PP (40 P mid) (41 NP-DAT (42 N^D freolsdagum))) (43 VBPI wur+ta+d))) (ID copreflives,+ALS_[Pref]:7.4))

Once we have this file we can throw away the CP-level and concentrate just on the IPs, so we set the node to IP*. We also set remove_nodes to true. In general, because we have already removed all embedded CPs, there won't be a lot of embedded IPs left, but there are some types of embedded IPs that aren't under CPs, namely infinitives, small clauses, direct speech, and parentheticals. So we use remove_nodes one more time to be safe. The definition file named in the query, OE.def, contains a definition for finite_verb.
The query file for this step is:

    define: OE.def
    remove_nodes: t
    node: IP*
    query: (IP* iDoms finite_verb)

Now we run this query on the output of the previous one. The new output is a file in which every token is a subordinate clause (introduced by an overt complementizer, although this is no longer visible) with a finite verb.
Tokens in the new output file look like this:

    /~*
    We awrita+d fela wundra on +tissere bec, for+tan +te God is wundorlic on his halgum swa swa we +ar s+adon, and his halgena wundra wur+dia+d hine, for+tan +te he worhte +ta wundra +turh hi.
    (copreflives,+ALS_[Pref]:22.13)
    *~/
    /*
    5 IP-SUB: 8 BEPI is
    23 IP-SUB-CON: 29 VBPI wur+dia+d
    */
    (NODE (5 IP-SUB (6 NP-NOM (7 NR^N God)) (8 BEPI is) (9 ADJP-NOM-PRD (10 ADJ^N wundorlic)) (11 PP (12 P on) (13 NP-DAT (14 PRO$ his) (15 N^D halgum))) (16 PP (17 ADV swa) (18 P swa) (19 CPX-CMP RMV:we_+ar_s+adon...))) (ID copreflives,+ALS_[Pref]:22.13))
    (NODE (23 IP-SUB-CON (24 NP-NOM (25 NP-GEN (26 PRO$ his) (27 N^G halgena)) (28 N^N wundra)) (29 VBPI wur+dia+d) (30 NP-ACC (31 PRO^A hine)) (32 , ,) (33 CP-ADV RMV:for+tan_+te_he...)) (ID copreflives,+ALS_[Pref]:22.13))

This file can now be used for various kinds of investigations of sentential syntax. The restrictions placed on the CPs and IPs are just examples of what might be done. The same strategy can be used with different requirements for the CP and IP nodes. If you don't want to restrict the type of CP at all, then use (CP* iDoms IP*). You should always restrict the IP in some way at this point if at all possible, since a file consisting of all the IPs in the corpus will be extremely large, quite possibly too large to work with.
If you are having space problems, you can erase the first output file once you have made the second one. You can always recreate it if need be from the query file. Example 4 shows how to retain information from the CP-level once it's been thrown away.
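The condition used in the first step of this example, ((CP* iDoms C) AND (C iDoms !0)), can be sketched in Python as a check on a single CP. This is a reading of the query as described above, not CorpusSearch code; the nested-tuple representation and the function name has_overt_complementizer are assumptions for illustration. The CPs are abbreviated from the sample tokens in this document, with "..." standing in for the omitted IP contents.

```python
def has_overt_complementizer(cp):
    """True if cp immediately dominates a C that does not immediately dominate 0."""
    for child in cp[1:]:
        # (CP* iDoms C): the C must be an immediate child of the CP.
        if isinstance(child, tuple) and child[0] == "C":
            # (C iDoms !0): that same C must dominate something other than 0.
            if any(item != "0" for item in child[1:]):
                return True
    return False


# Abbreviated CPs; "..." marks IP contents left out of the sketch.
CP_THT = ("CP-THT", ("C", "+t+at"), ("IP-SUB", "..."))
CP_REL = ("CP-REL-1", ("WNP-2", "0"), ("C", "+te"), ("IP-SUB", "..."))
CP_ADV = ("CP-ADV", ("P", "buton"), ("C", "0"), ("IP-SUB", "..."))
CP_ICH = ("CP-REL", "*ICH*-1")  # trace-only CP with no C node at all

results = {cp[0]: has_overt_complementizer(cp)
           for cp in (CP_THT, CP_REL, CP_ADV, CP_ICH)}
print(results)
```

The CP with an empty complementizer (C 0) and the CP with no C node both fail the check, which matches the text's observation that 25 CP-REL is removed but not printed.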
Example 3
This example is similar to the previous one, but we want all IPs, both
matrix and subordinate. It is impossible to collect all matrix and
subordinate IPs in one search. The reason is that if you set the node to
CP* you won't get any IPs that are not dominated by CPs, but if you
set the node to IP* you won't get any CPs that aren't dominated by
IPs.
The solution is to collect the two sets separately and then join them. First we get the matrix IPs. We'll use the same restriction as in example 2, but this time we want only matrix IPs, so we make the node IP-MAT*. remove_nodes is set to true to remove embedded clauses. We can call this query ip-mat.q, so the output will be ip-mat.out.
    define: OE.def
    remove_nodes: t
    node: IP-MAT*
    query: (IP-MAT* iDoms finite_verb)

    /~*
    and ic secge +te leof, +t+at ic h+abbe nu gegaderod on +tyssere bec +t+ara halgena +trowunga +te me to onhagode on englisc to awendene, for +tan +te +du leof swi+dost and +A+delm+ar swylcera gewrita me b+adon, and of handum gel+ahton eowerne geleafan to getrymmenne, mid +t+are gerecednysse, +te ge on eowrum gereorde n+afdon +ar.
    (copreflives,+ALS_[Pref]:1.3)
    *~/
    /*
    1 IP-MAT: 5 VBP secge
    */
    (0 (1 IP-MAT (2 CONJ and) (3 NP-NOM (4 PRO^N ic)) (5 VBP secge) (6 NP (7 PRO +te)) (8 NP-NOM-VOC (9 ADJ^N leof)) (10 , ,) (11 CP-THT (12 C +t+at) (13 IP-SUB RMV:ic_h+abbe_nu...)) (111 . .)) (ID copreflives,+ALS_[Pref]:1.3))

Then we get the CPs using a query we'll call cp.q, so the output will be cp.out. We won't restrict the type of CP at all.
    remove_nodes: t
    node: CP*
    query: (CP* iDoms IP*)

    /~*
    he ne m+ag beon wur+dful cynincg buton he h+abbe +ta ge+tinc+de +te him gebyria+d, and swylce +teningmen, +te +teawf+astnysse him gebeodon.
    (copreflives,+ALS_[Pref]:25.15)
    *~/
    /*
    10 CP-ADV: 13 IP-SUB
    21 CP-REL: 24 IP-SUB
    36 CP-REL: 39 IP-SUB
    */
    (NODE (10 CP-ADV (11 P buton) (12 C 0) (13 IP-SUB (14 NP-NOM (15 PRO^N he)) (16 HVPS h+abbe) (17 NP-ACC (18 NP-ACC (19 D^A +ta) (20 N^A ge+tinc+de) (21 CP-REL RMV:0_+te_*T*-1...)) (29 , ,) (30 CONJP (31 CONJ and) (32 NP-ACC (33 ADJ^A swylce) (34 N^A +teningmen) (35 , ,) (36 CP-REL RMV:0_+te_*T*-2...)))))) (ID copreflives,+ALS_[Pref]:25.15))
    (NODE (21 CP-REL (22 WNP-NOM-1 0) (23 C +te) (24 IP-SUB (25 NP-NOM *T*-1) (26 NP-DAT (27 PRO^D him)) (28 VBPI gebyria+d))) (ID copreflives,+ALS_[Pref]:25.15))
    (NODE (36 CP-REL (37 WNP-NOM-2 0) (38 C +te) (39 IP-SUB (40 NP-NOM *T*-2) (41 NP (42 N +teawf+astnysse)) (43 NP-DAT (44 PRO^D him)) (45 VBDI gebeodon))) (ID copreflives,+ALS_[Pref]:25.15))

But we still need to extract the IPs from cp.out. To do this we can use a variation of the query that found the matrix IPs (called ip-sub.q), specifying subordinate IPs this time. We need to specify the IP type because we might otherwise get embedded matrix clauses like direct speech and parentheticals. We run this query not on corpus files but on cp.out, the output of the CP search. Note that the output this time lists each of the IP-SUBs from the token above separately, along with its own ur-text.
    define: OE.def
    remove_nodes: t
    node: IP-SUB*
    query: (IP-SUB* iDoms finite_verb)

    /~*
    he ne m+ag beon wur+dful cynincg buton he h+abbe +ta ge+tinc+de +te him gebyria+d, and swylce +teningmen, +te +teawf+astnysse him gebeodon.
    (copreflives,+ALS_[Pref]:25.15)
    *~/
    /*
    4 IP-SUB: 7 HVPS h+abbe
    */
    (NODE (4 IP-SUB (5 NP-NOM (6 PRO^N he)) (7 HVPS h+abbe) (8 NP-ACC (9 NP-ACC (10 D^A +ta) (11 N^A ge+tinc+de) (12 CP-REL RMV:0_+te_*T*-1...)) (13 , ,) (14 CONJP (15 CONJ and) (16 NP-ACC (17 ADJ^A swylce) (18 N^A +teningmen) (19 , ,) (20 CP-REL RMV:0_+te_*T*-2...))))) (ID copreflives,+ALS_[Pref]:25.15))

    /~*
    he ne m+ag beon wur+dful cynincg buton he h+abbe +ta ge+tinc+de +te him gebyria+d, and swylce +teningmen, +te +teawf+astnysse him gebeodon.
    (copreflives,+ALS_[Pref]:25.15)
    *~/
    /*
    4 IP-SUB: 8 VBPI gebyria+d
    */
    (NODE (4 IP-SUB (5 NP-NOM *T*-1) (6 NP-DAT (7 PRO^D him)) (8 VBPI gebyria+d)) (ID copreflives,+ALS_[Pref]:25.15))

    /~*
    he ne m+ag beon wur+dful cynincg buton he h+abbe +ta ge+tinc+de +te him gebyria+d, and swylce +teningmen, +te +teawf+astnysse him gebeodon.
    (copreflives,+ALS_[Pref]:25.15)
    *~/
    /*
    4 IP-SUB: 10 VBDI gebeodon
    */
    (NODE (4 IP-SUB (5 NP-NOM *T*-2) (6 NP (7 N +teawf+astnysse)) (8 NP-DAT (9 PRO^D him)) (10 VBDI gebeodon)) (ID copreflives,+ALS_[Pref]:25.15))

We now have two output files, ip-sub.out and ip-mat.out. (We can throw away cp.out at this point if necessary.) The two sets can now be searched together simply by listing both output files as input files in subsequent searches. The output of such a search will list the hits by source text as usual: first all the IP-MATs, and then, starting again at the first source text, all the IP-SUBs. But the summary statistics will list each source text only once, with all the hits added together.
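The way hits from the two files are added together per source text can be mimicked outside CorpusSearch with a short Python sketch. It is an approximation under stated assumptions: the source text is taken to be the part of each ID before the comma, the function name hits_per_text is made up for the example, and the two strings below are toy stand-ins for ip-mat.out and ip-sub.out with abbreviated trees (only the IDs are taken from this document).

```python
import re
from collections import Counter

# Assumption: the source text name is whatever precedes the comma in (ID ...).
ID_PATTERN = re.compile(r"\(ID ([^,\s)]+),")


def hits_per_text(output_text):
    """Count tokens per source text in a string of output tokens."""
    return Counter(ID_PATTERN.findall(output_text))


# Toy stand-ins for the two output files (trees abbreviated).
IP_MAT_OUT = """\
(NODE (IP-MAT (VBP secge)) (ID copreflives,+ALS_[Pref]:1.3))
(NODE (IP-MAT (VBD forbernde)) (ID coaelive,+ALS_[Eugenia]:260.347))
"""
IP_SUB_OUT = """\
(NODE (IP-SUB (HVPS h+abbe)) (ID copreflives,+ALS_[Pref]:25.15))
(NODE (IP-SUB (VBPI gebyria+d)) (ID copreflives,+ALS_[Pref]:25.15))
(NODE (IP-SUB (VBDI gebeodon)) (ID copreflives,+ALS_[Pref]:25.15))
"""

# Adding the counters merges the two sets: each source text appears once.
total = hits_per_text(IP_MAT_OUT) + hits_per_text(IP_SUB_OUT)
for text, count in sorted(total.items()):
    print(text, count)
```

The per-file counts stay available separately if the matrix and subordinate totals are needed on their own.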
Example 4
This example makes use of the coding function.
In this example we want to work with only subordinate clauses but we want
to know what type of CP originally dominated the IP. In addition we want to
know whether there is an overt complementizer.
In our first search we extract all the CPs with a C node. This condition forces all the clauses to be embedded. Direct questions lack a C node altogether. The output is a set of tokens each consisting of a CP of the appropriate type with all embedded CPs removed.
    remove_nodes: t
    node: CP*
    query: (CP* iDoms C)

    /~*
    him geris+d +t+at he h+abbe halige +tenas +te his willan gefylla+d,
    (copreflives,+ALS_[Pref]:29.17)
    *~/
    /*
    6 CP-THT-x: 7 C +t+at
    15 CP-REL: 17 C +te
    */
    (NODE (6 CP-THT-x (7 C +t+at) (8 IP-SUB (9 NP-NOM (10 PRO^N he)) (11 HVPS h+abbe) (12 NP-ACC (13 ADJ^A halige) (14 N^A +tenas) (15 CP-REL RMV:0_+te_*T*-1...)))) (ID copreflives,+ALS_[Pref]:29.17))
    (NODE (15 CP-REL (16 WNP-NOM-1 0) (17 C +te) (18 IP-SUB (19 NP-NOM *T*-1) (20 NP (21 PRO$ his) (22 N willan)) (23 VBPI gefylla+d))) (ID copreflives,+ALS_[Pref]:29.17))

At this point in example 2 we threw away the CP-level. This time, before we throw it away, we're going to store some information about it in a coding string. The first column codes for the type of CP. The first condition codes adverbial CPs with "for" as the subordinating conjunction. The second codes for all other adverbial CPs, and so on through the types of CPs. In the second and subsequent conditions, the query is actually a bit otiose, since in the previous search we made sure that all the clauses had C nodes. The condition is really just a way to get the clause type coded. We could have used iDoms * or iDoms IP* or anything we're sure will be found in every token. Don't use exists here, though, as in CP-ADV exists, since when embedded CPs are removed their labels remain, and therefore there are other CP labels that might be matched. The second column codes for whether the C node is overt or empty.
The coding query is:

    node: CP*
    1: {
        f: ((CP-ADV* iDoms P) AND (P iDoms F*|f*))
        a: (CP-ADV* iDoms C)
        t: (CP-THT* iDoms C)
        g: (CP-DEG* iDoms C)
        c: (CP-CMP* iDoms C)
        q: (CP-QUE* iDoms C)
        r: (CP-REL* iDoms C)
        r: (CP-CAR* iDoms C)
        r: (CP-FRL* iDoms C)
        k: (CP-CLF* iDoms C)
        x: (CP-EXL* iDoms C)
    }
    2: {
        0: (C iDoms 0)
        1: (C iDoms !0)
    }

The output of this run looks like this:
    /~*
    him geris+d +t+at he h+abbe halige +tenas +te his willan gefylla+d,
    (copreflives,+ALS_[Pref]:29.17)
    *~/
    (0 NODE (0 CODING t:1) (1 CP-THT-x (2 C +t+at) (3 IP-SUB (4 NP-NOM (5 PRO^N he)) (6 HVPS h+abbe) (7 NP-ACC (8 ADJ^A halige) (9 N^A +tenas) (10 CP-REL RMV:0_+te_*T*-1...)))) (11 ID copreflives,+ALS_[Pref]:29.17))

    /~*
    him geris+d +t+at he h+abbe halige +tenas +te his willan gefylla+d,
    (AelfLives,+ALS_[Pref]:29.17)
    *~/
    (0 NODE (0 CODING r:1) (1 CP-REL (2 WNP-NOM-1 0) (3 C +te) (4 IP-SUB (5 NP-NOM *T*-1) (6 NP (7 PRO$ his) (8 N willan)) (9 VBPI gefylla+d))) (10 ID copreflives,+ALS_[Pref]:29.17))

Now, because the coding string is passed on from search to search, we can get rid of the CP-level without losing the information we are interested in. We use the same query as in example 2. remove_nodes is set to true for the same reasons as well.
    define: OE.def
    remove_nodes: t
    node: IP*
    query: (IP* iDoms finite_verb)
Our tokens now look like this. The coding string is retained but the CP is gone.

    /~*
    him geris+d +t+at he h+abbe halige +tenas +te his willan gefylla+d,
    (AelfLives,+ALS_[Pref]:29.17)
    *~/
    /*
    4 IP-SUB: 7 HVPS h+abbe
    */
    (NODE (CODING t:1) (4 IP-SUB (5 NP-NOM (6 PRO^N he)) (7 HVPS h+abbe) (8 NP-ACC (9 ADJ^A halige) (10 N^A +tenas) (11 CP-REL RMV:0_+te_*T*-1...))) (ID copreflives,+ALS_[Pref]:29.17))

    /~*
    him geris+d +t+at he h+abbe halige +tenas +te his willan gefylla+d,
    (copreflives,+ALS_[Pref]:29.17)
    *~/
    /*
    5 IP-SUB: 10 VBPI gefylla+d
    */
    (NODE (CODING r:1) (5 IP-SUB (6 NP-NOM *T*-1) (7 NP (8 PRO$ his) (9 N willan)) (10 VBPI gefylla+d)) (ID copreflives,+ALS_[Pref]:29.17))

At this point we can search this file including the information in the coding string, or we could add further coding (just make sure you start at column 3!), or any combination of these. You can add or replace columns at any time, and you can search the coding string in conjunction with searching the parse.
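If the coded tokens are also to be counted outside CorpusSearch, the coding string can be pulled out with a few lines of Python. This is a minimal sketch, assuming the CODING node has the shape shown in the tokens in this example (columns separated by colons, optionally preceded by a node index); the function name coding_columns is hypothetical, and the tokens are abbreviated versions of the two in this example.

```python
import re
from collections import Counter

# Matches both "(CODING t:1)" and the indexed form "(0 CODING t:1)".
CODING = re.compile(r"\((?:\d+ )?CODING ([^\s)]+)\)")


def coding_columns(token):
    """Return the coding string of a token as a list of column values."""
    match = CODING.search(token)
    return match.group(1).split(":") if match else []


# Abbreviated versions of the two tokens in this example.
TOKENS = [
    "(NODE (CODING t:1) (4 IP-SUB (7 HVPS h+abbe)) (ID copreflives,+ALS_[Pref]:29.17))",
    "(NODE (CODING r:1) (5 IP-SUB (10 VBPI gefylla+d)) (ID copreflives,+ALS_[Pref]:29.17))",
]

# Cross-tabulate column 1 (CP type) against column 2 (overt complementizer).
table = Counter(tuple(coding_columns(t)) for t in TOKENS)
for (cp_type, overt), count in sorted(table.items()):
    print(cp_type, overt, count)
```

Column 1 here is the CP type (t for CP-THT, r for the relative types) and column 2 records whether the complementizer was overt, as set up by the coding query in this example.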