web analytics

Parsing Language

It has occurred to me that I have drifted from building an intelligent machine to building a working language parser. I suppose this is a necessary first step, but I must not lose sight of the real goal. There are so many fascinating distractions along the way!

In any case, the next step in this journey was to connect the parse-WordNet-concept pieces together to generate a symbolic representation of the meaning in a normal English sentence. As with much of this project, it’s easy to get a simple system going and much harder to make it work generally.

I feel like I’m “standing on theĀ  shoulders of giants” to steal a phrase from Stephen Hawking (who borrowed it from others as well). There has been an enormous amount of work done by the Natural Language Toolkit (nltk) folks to implement NLP algorithms in Python. Virtually everything I’m doing uses the software they have written. When I say, “I have built”, or “I have written” you must understand that what I really mean is that I have labored mightilly to stick a couple of lines of glue code between calls to the nltk functions.

So, I have successfully connected the WordNet lexical database to a recursive-descent parser. The parser is running a simple context-free grammar (CFG) that covers a small fraction of the English language. Even so, it does surprisingly well. For example, it correctly parses “the old man the ship” as a noun phrase (the old), a verb (man), and a noun phrase (the ship):

(S
  (NP (Det the) (Nom (N old)))
  (VP (V man) (NP (Det the) (Nom (N ship)))))

This is a sentence that would not be obvious to a person but is easy for the machine because it’s not confused (yet) by the fact that “man” is generally a noun not a verb.

What it doesn’t do yet is handle punctuation or capitalization. For example, it fails to parse “The” as “the” at the moment and chokes on commas, semicolon, quotes, etc. Some of these are easy to fix, others might require more low-level coding to replace the functions already part of NLTK.

The other thing that seems to be facing me is that a CFG is unlikely to be flexible enough for general language parsing. There are simply too many special cases. That’s why I need the semantic concepts data. I’m going to use a simpler, non-generic parser even though it can’t weed out nonsense sentences like, “the dog flew water.” I’ll use a semantic filter to get at the meaning of the sentence, if any. I think that will still not be enough to get rid of the ambiguity, but it might make a CFG good enough to use as a parser.