4.1 Natural language processing

We’ll conclude first with an example of rather more linguistic relevance than the preceding ones, which deploys a lot of what we have covered so far.

Most computer programs which handle natural language input contain a component known as a parser. This is a program which assigns a linguistic analysis to the input (a parse).

Parsing is a big topic, which we will cover in some depth later, but we will present an example here.

4.1.1 Grammar and lexicon

We need to supply the program with information about the grammar and lexicon of the language it will be dealing with.

So, first we need to decide on a representation for these. We will keep things very simple and try to stay close to what elementary linguistics texts provide.

Linguists are in the habit of writing rules such as

S ===> NP VP

V ===> likes

We can keep quite close to this kind of respresentation in Prolog by defining ===> as an infix operator:

 :- op(1050, xfx, `==>').  

We can now specify a grammar as follows, keeping to the Prolog conventions of using lower case letters for constants, commas to conjoin items and full-stop to terminate a statement.

s ==> np, vp.  
np ==> det, n.  
vp ==> v, np.  
det ==> the.  
n ==> dog.  
n ==> cat.  
v ==> chased.

4.1.2 The parser

The parser will be defined as a predicate parse/3, taking three arguments:

  1. its first argument will be a syntactic category
  2. the second will be the string of words to be parsed (represented as a Prolog list)
  3. the third will be the string of words left over when the parse is complete

parse/3 will return ‘yes’ if the string can be analysed as an instance of the syntactic category. So we will get interactions like the following:

?- parse(s, [the,dog,chased,the,cat], []).  
?- parse(np, [the,dog,chased,the,cat], []).  
?- parse(vp, [chased,the,cat], []).  
?- parse(np, [the,cat], []).  

The parser that we will describe here operates (like Prolog) in a left-to-right, top-down, depth-first fashion.

We start by seeing how the syntactic category we are seeking (Cat) is defined by the grammar. There are two possibilities

  1. the category introduces a lexical item
  2. it introduces one or more syntactic categories

In the first case, we compare that first item in the string of words with the right hand side of a rule which has Cat as its left-hand side. If they match, we have parsed the first word, with the rest of the string left over (i.e. the third argument):

parse(Cat, [Firstword|Restwords], Restwords):-  
    Cat ==> Firstword.

In the second case, we need to look up a rule which has Cat as its left-hand side, find out what its daughters are and parse them. Because parse/3 is expecting only a single category as its first argument, and there may be more than one daughter in a rule, we need to define a new predicate to handle the parsing of the daughters of a rule: parsedaughters/3:13

parse(Cat, String, Rest ):-  
   (Cat ==> Daughters),  
   parsedaughters(Daughters, String, Rest).

That is parse/3 dealt with. We now have to define parsedaughters/3. Again, there a two cases to consider:

  1. there is more than one daughter
  2. there is exactly one daughter

If there is more than one, we pass the first and the string to be parsed to parse/3 to deal with, and then pass the remaining daughters and what is left of the string to parsedaughters/3 to work on further:14

parsedaughters((Cat, Cats), String, Rest):-  
    parse(Cat, String, Next), parsedaughters(Cats, Next, Rest).

If there is exactly one, we just pass it and the string to be parsed to parse/3:

parsedaughters(Cat, String, Rest):-  
   parse(Cat, String, Rest).

That’s it.

The complete code is:

parse(Cat, [Firstword|Restwords], Restwords):-  
    (Cat ==> Firstword).  
parse(Cat, String, Rest ):-  
   (Cat ==> Daughters),  
   parsedaughters(Daughters, String, Rest).  
parsedaughters((Cat, Cats), String, Rest):-  
   parse(Cat, String, Next),  
   parsedaughters(Cats, Next, Rest).  
parsedaughters(Cat, String, Rest):-  
   parse(Cat, String, Rest).

Finally, here is a variant, parse/4, whose extra (second) argument builds a tree of the parse - represented like a bracketed string.

parse(Cat, [Cat,Firstword], [Firstword|Restwords], Restwords):-  
    Cat ==> Firstword.  
parse(Cat, [Cat|Tree], String, Rest ):-  
   (Cat ==> Daughters),  
   parserest(Daughters, Tree, String, Rest).  
parserest((Cat, Cats),  [Tree, Resttree], String, Rest):-  
   parse(Cat, Tree, String, Next),  
   parserest(Cats, Resttree, Next, Rest).  
parserest(Cat, Tree, String, Rest):-  
   parse(Cat, Tree, String, Rest).

Here is an example:

?- parse(s, T, [the,dog,chased,the,cat], []).  
T = [s,[np,[det,the],[n,dog]],[vp,[v,chased],[np,[det,the],[n,cat]]]] ?