Coding

Contents of this chapter:

What is coding?
a coding file example
an output file example
how to search coding strings
just the codes

What is coding?

Coding creates input for multivariate analysis programs such as Varbrul; general statistical programming environments such as S, S-PLUS, and R; and statistical analysis packages such as Data Desk, JMP, SAS, and SPSS.

The values in a coding string may be determined in part automatically, with coding queries, and in part entered by hand in a text editor. The resulting files can then serve as input to further searches.

a coding file example

Here's an example of a basic coding file, written by Ann Taylor. It's called "obj.c". All coding file names must end with ".c". To simplify our discussion, we show only the first three columns of an originally more complicated coding system.

node: IP*
coding_query:

1: {
        s: (IP-SPE* iDoms NP-OB*)
        n: ELSE
   }

2: {
        m: (IP-MAT* iDoms NP-OB*)
        s: (IP-SUB* iDoms NP-OB*)
        i: (IP-INF* iDoms NP-OB*)
        e: ELSE
   }

3: {
        t: ((IP* iDoms NEG)
          AND (NEG iDoms !ne))
        p: (IP* iDoms !NEG)
        n: ELSE
   }

In general, coding files have this form:

<PREAMBLE>
coding_query:

column_number: {
	label: condition
	label: condition 
	.
	.
	.
	}

The coding file begins with the preamble commands (see the Command File chapter), which must include the obligatory boundary node for the coding queries. The obligatory query specification "coding_query:" then introduces the coding queries, one for each column of the output coding string.

In the present example, column 1 of the coding string will contain an "s" if the query (IP-SPE* iDoms NP-OB*) is satisfied, that is, if an IP-SPE* node immediately dominates an NP-OB* node. Otherwise, because of the "ELSE" function (used only in coding queries), the column will contain an "n".
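The choice of label for a column can be pictured as a first-match lookup with a fallback. The Python sketch below is only an illustration and not how CorpusSearch is implemented: it assumes that labels are tried in the order written, and it replaces real queries with a hypothetical idoms helper over a toy record holding a node's label and the labels of its immediate daughters.

```python
from fnmatch import fnmatchcase

ELSE = "ELSE"  # marker for the fallback label, mirroring the coding file


def idoms(node, mother_pat, daughter_pat):
    """Toy stand-in for iDoms: the node's label matches mother_pat and
    at least one immediate daughter matches daughter_pat."""
    return fnmatchcase(node["label"], mother_pat) and any(
        fnmatchcase(d, daughter_pat) for d in node["daughters"]
    )


def code_column(node, rules):
    """Return the label of the first rule whose condition holds
    (order of evaluation is an assumption), else the ELSE label."""
    fallback = None
    for label, condition in rules:
        if condition == ELSE:
            fallback = label
        elif condition(node):
            return label
    return fallback


# Columns 1 and 2 of the example coding file, as toy rules.
column1 = [("s", lambda n: idoms(n, "IP-SPE*", "NP-OB*")), ("n", ELSE)]
column2 = [
    ("m", lambda n: idoms(n, "IP-MAT*", "NP-OB*")),
    ("s", lambda n: idoms(n, "IP-SUB*", "NP-OB*")),
    ("i", lambda n: idoms(n, "IP-INF*", "NP-OB*")),
    ("e", ELSE),
]

# Hypothetical boundary node: an IP-SUB with an object daughter.
node = {"label": "IP-SUB", "daughters": ["NP-SBJ", "VBD", "NP-OB1"]}
codes = [code_column(node, column1), code_column(node, column2)]
print(":".join(codes))  # n:s
```

The two labels are joined with ":" because that is the separator seen in coding strings in the output files.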

Coding query files are alternatives to ordinary query files in a CorpusSearch run. So, to code a file, invoke CorpusSearch as follows:

java CorpusSearch <coding_file.c> <file_to_code>
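When many files are to be coded, the invocation above can be wrapped in a script. The following Python sketch assumes that java is on the search path and that the CorpusSearch class can be found by Java; the helper names build_command and run_coding, and the input file name cmhorses.psd, are made up for illustration.

```python
import subprocess


def build_command(coding_file, file_to_code):
    """Build the argument list for the invocation shown above."""
    if not coding_file.endswith(".c"):
        raise ValueError("coding file names must end with '.c'")
    return ["java", "CorpusSearch", coding_file, file_to_code]


def run_coding(coding_file, file_to_code):
    """Run CorpusSearch on one file; raises CalledProcessError on failure."""
    return subprocess.run(build_command(coding_file, file_to_code), check=True)


# Only builds and prints the command; call run_coding(...) to execute it.
print(build_command("obj.c", "cmhorses.psd"))
```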

an output file example

Output files resulting from coding will carry the extension .cod. They contain every token of the input file, with coding nodes inserted at every boundary node. A coding node has the form:

(CODING <coding_string>)

If a given sentence contains more than one boundary node, the output sentence will contain multiple coding nodes. Here's a sentence from the output file resulting from the above coding file:

/~*
knewe kyndes & complexciones of men & of bestus
(CMHORSES,85.2)
*~/


( (IP-SUB (CODING n:s:p)
          (NP-SBJ *T*-1)
          (VBD knewe)
          (NP-OB1 (NS kyndes)
                  (CONJ &)
                  (NS complexciones)
                  (PP (PP (P of)
                          (NP (NS men)))
                      (CONJP (CONJ &)
                             (PP (P of)
                                 (NP (NS bestus)))))))
  (ID CMHORSES,85.2))
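To pass the codes on to a statistics program, the coding nodes can be pulled out of the output text and split into columns. The Python sketch below assumes only what the example shows: coding nodes of the form (CODING <coding_string>) with columns separated by ":". The function name coding_rows is hypothetical, and the sample is an abridged version of the sentence above.

```python
import re

# A coding node: "(CODING", whitespace, then a string without spaces or parentheses.
CODING_NODE = re.compile(r"\(CODING\s+([^\s()]+)\)")


def coding_rows(text):
    """Return one list of column values per coding node found in text."""
    return [m.group(1).split(":") for m in CODING_NODE.finditer(text)]


sample = """( (IP-SUB (CODING n:s:p)
          (NP-SBJ *T*-1)
          (VBD knewe))
  (ID CMHORSES,85.2))"""

rows = coding_rows(sample)
print(rows)  # [['n', 's', 'p']]
```

Each row could then be written out with Python's csv module, one column of the coding string per field.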

how to search coding strings

Coding strings may be searched with the column search function. For instance, to find all boundary nodes whose coding string has "m" or "p" in the 7th column, use this query:

query:  (CODING column7 m|p)
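The test this query expresses can be mimicked outside CorpusSearch. The Python sketch below is an illustration only: it assumes that columns are counted from 1, as in the coding file, and are separated by ":". The function name column_matches and the seven-column coding strings are invented for the example.

```python
def column_matches(coding_string, column, values):
    """True if the 1-based column of a ':'-separated coding string
    holds one of the given values."""
    fields = coding_string.split(":")
    return 1 <= column <= len(fields) and fields[column - 1] in values


# Hypothetical seven-column coding strings.
strings = ["n:s:p:a:b:c:m", "n:m:t:a:b:c:x", "s:e:n:a:b:c:p"]
hits = [s for s in strings if column_matches(s, 7, {"m", "p"})]
print(hits)  # ['n:s:p:a:b:c:m', 's:e:n:a:b:c:p']
```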

just the codes

To obtain a file with only the coding strings, use print_only as follows:

print_only: CODING

The extension of the resultant output file will be .ooo.