Computational Semantics

I’ve got such a long way to go.

These guys are really good! Their stuff actually does what I’m struggling mightily to conceive might be possible. Why isn’t this stuff already embedded in the products we use?

These guys are really smart! They’re a special interest group of the big group of smart people working on Computational Semantics.

What have I got myself into…

Parsing Language

It has occurred to me that I have drifted from building an intelligent machine to building a working language parser. I suppose this is a necessary first step, but I must not lose sight of the real goal. There are so many fascinating distractions along the way!

In any case, the next step in this journey was to connect the parse-WordNet-concept pieces together to generate a symbolic representation of the meaning in a normal English sentence. As with much of this project, it’s easy to get a simple system going and much harder to make it work generally.

I feel like I’m “standing on the shoulders of giants,” to steal a phrase from Stephen Hawking (who borrowed it from others as well). There has been an enormous amount of work done by the Natural Language Toolkit (NLTK) folks to implement NLP algorithms in Python. Virtually everything I’m doing uses the software they have written. When I say “I have built” or “I have written,” you must understand that what I really mean is that I have labored mightily to stick a couple of lines of glue code between calls to the NLTK functions.

So, I have successfully connected the WordNet lexical database to a recursive-descent parser. The parser is running a simple context-free grammar (CFG) that covers a small fraction of the English language. Even so, it does surprisingly well. For example, it correctly parses “the old man the ship” as a noun phrase (the old), a verb (man), and a noun phrase (the ship):

  (S
    (NP (Det the) (Nom (N old)))
    (VP (V man) (NP (Det the) (Nom (N ship)))))

This is a sentence whose structure would not be obvious to a person, but it is easy for the machine because it isn’t confused (yet) by the fact that “man” is usually a noun, not a verb.
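The setup above can be sketched with NLTK’s grammar and parsing classes. The grammar below is a toy of my own construction, just big enough for this one sentence — it is not the project’s actual grammar — and it uses the current NLTK API (`nltk.CFG.fromstring`), which differs from older releases:

```python
import nltk

# Toy context-free grammar (illustrative only, not the project's grammar).
# It covers just enough English to parse the garden-path sentence below.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det Nom
    Nom -> N
    VP  -> V NP
    Det -> 'the'
    N   -> 'old' | 'ship'
    V   -> 'man'
""")

# A recursive-descent parser tries the grammar rules top-down,
# backtracking whenever a rule fails to match the remaining tokens.
parser = nltk.RecursiveDescentParser(grammar)

for tree in parser.parse("the old man the ship".split()):
    print(tree)
```

Because this grammar gives “man” only a verb reading, the parser finds exactly the one parse shown above.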

What it doesn’t do yet is handle punctuation or capitalization. For example, it fails to parse “The” as “the” at the moment, and it chokes on commas, semicolons, quotes, etc. Some of these are easy to fix; others might require more low-level coding to replace functions that are already part of NLTK.
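Until the parser handles these cases itself, a crude workaround is to normalize the input before it reaches the grammar. The `normalize` helper below is hypothetical — something I might bolt on, not an NLTK function:

```python
import re

def normalize(sentence):
    # Lowercase everything and keep only alphabetic tokens, so "The"
    # matches the grammar's 'the' and commas, quotes, etc. are dropped.
    # This throws information away (it can't tell "US" from "us"),
    # so it's a stopgap, not a fix.
    return re.findall(r"[a-z]+", sentence.lower())

print(normalize('The old man the ship, "obviously."'))
# ['the', 'old', 'man', 'the', 'ship', 'obviously']
```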

The other problem facing me is that a CFG is unlikely to be flexible enough for general language parsing; there are simply too many special cases. That’s why I need the semantic-concept data. I’m going to use a simpler, non-generic parser even though it can’t weed out nonsense sentences like “the dog flew water,” and then apply a semantic filter to get at the meaning of the sentence, if any. I don’t think that alone will eliminate all the ambiguity, but it might make a CFG good enough to use as a parser.

A swamp of details

Well, not much progress on the main task has occurred in recent days. I’m slowly sinking into a swamp of details around how to implement ideas in code.

There was a long trek through C#, Mono, and MySQL that turned out to be mostly a dead end. It is possible to put the WordNet data into MySQL, and to access the database from C#/Mono on Linux, but it’s not easy. I think the Linux tools are not really ready to be used quite that way by people with my limited skills, and I’m not willing to lock myself into a Windows-only platform.

The current frontrunner is Python with the NLTK toolkit, which offers a lot of high-quality AI code that promises to be useful. It seems fairly straightforward to get the whole thing running with a web front end. Unfortunately, NLTK really wants Python 2.5+, and my server is on Ubuntu 6.06 LTS (with Python 2.4). I guess it’s time to upgrade the server to Ubuntu 8.04 LTS anyway.

Once all this is done maybe the main task can resume. I have run across a bunch of work that has been done on the Semantic Web that seems very close to what I’m trying to do. Things like the Resource Description Framework (RDF) and Web Ontology Language (OWL) look like exactly what I’m after and will likely be among the first “meaning storage” schemes I try.
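RDF’s core idea is just subject–predicate–object triples, so before committing to any particular library the “meaning storage” model can be mocked up with plain Python. This is a toy sketch of the data model only — real RDF tooling adds URIs, namespaces, and a query language on top:

```python
# A "meaning store" as a set of (subject, predicate, object) triples,
# mimicking RDF's data model in plain Python (toy sketch only).
store = set()

def add_fact(subject, predicate, obj):
    store.add((subject, predicate, obj))

def query(subject=None, predicate=None, obj=None):
    # None acts as a wildcard, like a variable in a SPARQL pattern.
    return [
        (s, p, o) for (s, p, o) in store
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

add_fact("the_old", "man", "the_ship")   # fact from a parsed sentence
add_fact("dog", "is_a", "animal")        # fact from WordNet

print(query(predicate="is_a"))   # [('dog', 'is_a', 'animal')]
```

If this shape holds up, swapping the set for a real RDF store later should be mostly mechanical.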