It has occurred to me that I have drifted from building an intelligent machine to building a working language parser. I suppose this is a necessary first step, but I must not lose sight of the real goal. There are so many fascinating distractions along the way!
In any case, the next step in this journey was to connect the parse-WordNet-concept pieces together to generate a symbolic representation of the meaning in a normal English sentence. As with much of this project, it’s easy to get a simple system going and much harder to make it work generally.
I feel like I’m “standing on the shoulders of giants,” to steal a phrase from Stephen Hawking (who borrowed it from others as well). There has been an enormous amount of work done by the Natural Language Toolkit (NLTK) folks to implement NLP algorithms in Python. Virtually everything I’m doing uses the software they have written. When I say “I have built” or “I have written,” you must understand that what I really mean is that I have labored mightily to stick a couple of lines of glue code between calls to the NLTK functions.
So, I have successfully connected the WordNet lexical database to a recursive-descent parser. The parser is running a simple context-free grammar (CFG) that covers a small fraction of the English language. Even so, it does surprisingly well. For example, it correctly parses “the old man the ship” as a noun phrase (the old), a verb (man), and a noun phrase (the ship):
(S (NP (Det the) (Nom (N old))) (VP (V man) (NP (Det the) (Nom (N ship)))))
This is a sentence whose structure would not be obvious to a person, but it is easy for the machine because it’s not confused (yet) by the fact that “man” is generally a noun, not a verb.
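To make the idea concrete, here is a minimal sketch of a backtracking recursive-descent parser, written with only the Python standard library. It is not the NLTK parser and not my actual grammar: the `GRAMMAR` and `LEXICON` tables, and the function names, are assumptions invented for illustration. All it shows is how trying each production in turn, and backing up when one fails, arrives at the tree above.

```python
# Toy grammar and lexicon (hypothetical, just large enough for the example).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Nom"]],
    "Nom": [["Adj", "Nom"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "Det": {"the"},
    "Adj": {"old"},
    "N": {"old", "man", "ship"},
    "V": {"man"},
}


def match_sequence(symbols, tokens, pos):
    """Yield (subtrees, next_pos) for each way to match all symbols in order."""
    if not symbols:
        yield [], pos
        return
    first, rest = symbols[0], symbols[1:]
    for tree, mid in expand(first, tokens, pos):
        for trees, end in match_sequence(rest, tokens, mid):
            yield [tree] + trees, end


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` can match tokens[pos:]."""
    if symbol in LEXICON:
        # Word category: consume one token if it belongs to the category.
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            yield f"({symbol} {tokens[pos]})", pos + 1
        return
    # Phrase category: try every production, backtracking on failure.
    for production in GRAMMAR[symbol]:
        for subtrees, end in match_sequence(production, tokens, pos):
            yield f"({symbol} {' '.join(subtrees)})", end


def parse(sentence):
    """Return every parse of the sentence that consumes all of its tokens."""
    tokens = sentence.split()
    return [tree for tree, end in expand("S", tokens, 0) if end == len(tokens)]


for tree in parse("the old man the ship"):
    print(tree)
```

The parser first tries “old” as an adjective modifying “man”, finds that nothing after it can serve as a verb, and backs up to read “old” as the noun. Note that a plain recursive-descent parser like this one would loop forever on a left-recursive rule such as `NP -> NP PP`, so the toy grammar avoids those.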
What it doesn’t do yet is handle punctuation or capitalization. For example, it fails to parse “The” as “the” at the moment and chokes on commas, semicolons, quotes, etc. Some of these are easy to fix; others might require more low-level coding to replace functions that are already part of NLTK.
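One possible workaround for the easy cases is to normalize the text before it ever reaches the parser. The sketch below is an assumption about how that could look, not a fix I have actually made: the `normalize` function is a hypothetical helper that lowercases the sentence and keeps only runs of letters, allowing one internal apostrophe.

```python
import re

# A word is a run of letters, optionally followed by an apostrophe and more
# letters (so "dog's" survives but surrounding quote marks do not).
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def normalize(sentence):
    """Lowercase the sentence and return its word tokens, dropping punctuation."""
    return WORD.findall(sentence.lower())


print(normalize('The old man, the ship; "quoted".'))
```

This throws away information the parser might eventually want: capitalization can mark a proper noun, and a comma can mark a clause boundary. It only gets the simple sentences through the door.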
The other thing that seems to be facing me is that a CFG is unlikely to be flexible enough for general language parsing. There are simply too many special cases. That’s why I need the semantic concepts data. I’m going to use a simpler, non-generic parser even though it can’t weed out nonsense sentences like “the dog flew water.” I’ll use a semantic filter to get at the meaning of the sentence, if any. I think that will still not be enough to get rid of all the ambiguity, but it might make a CFG good enough to use as a parser.
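As a rough illustration of what a semantic filter could do, here is a toy sketch. Everything in it is hypothetical: the `CONCEPTS` table is a hand-written stand-in for concept data that would really come from WordNet, and `VERB_FRAMES` is an invented list of which concepts each verb accepts as subject and object. It is not the filter I intend to build, only the general shape of one.

```python
# Hypothetical concept tags, standing in for WordNet-derived concepts.
CONCEPTS = {
    "dog": {"animal"},
    "pilot": {"person"},
    "plane": {"vehicle"},
    "kite": {"artifact"},
    "water": {"substance"},
}

# Hypothetical restrictions on what each verb accepts as subject and object.
VERB_FRAMES = {
    "flew": {"subject": {"person"}, "object": {"vehicle", "artifact"}},
    "drank": {"subject": {"person", "animal"}, "object": {"substance"}},
}


def is_plausible(subject, verb, obj):
    """Return True if the subject and object fit the verb's restrictions."""
    frame = VERB_FRAMES.get(verb)
    if frame is None:
        return True  # unknown verb: no basis for rejecting the sentence
    subject_ok = bool(CONCEPTS.get(subject, set()) & frame["subject"])
    object_ok = bool(CONCEPTS.get(obj, set()) & frame["object"])
    return subject_ok and object_ok


print(is_plausible("dog", "flew", "water"))
print(is_plausible("pilot", "flew", "plane"))
```

The grammar happily accepts “the dog flew water” because it is syntactically fine; the filter rejects it because neither the subject nor the object fits the verb. Hard-coded tables like these obviously do not scale, which is exactly why the concept data has to come from something like WordNet.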
3 thoughts on “Parsing Language”
Actually, it was Isaac Newton who said, “If I have seen so far, it is only because I stand on the shoulders of giants.” At least, that’s how history records it. Such a towering intellect as Newton could equally well have said, “If I have seen so far, it is only because I am surrounded by midgets.” 🙂
Okay, so you go build this thing that understands English – how hard would it be to make it understand, say, Spanish? Assuming you put in a new database of word definitions (replacing “man” with “hombre,” for example) would that be adequate? Presumably the grammar would have to be tweaked.
How far could you stray from English before the entire thing came apart? How hard would it be to do German, and then French, and then Japanese, and then Mandarin?
I guess my question is: are you teaching your computer to understand LANGUAGE, or just ENGLISH?
That’s one of the reasons people want to do this “understanding” thing. Once you have converted English into some internal format that captures the meaning of the sentence, it should be possible to convert it back to natural language again. The specific language would not matter much. You would have a machine translation device, and be one step towards a “Universal Translator”.
Remember how the translator on Enterprise had to have enough of a new language to start translating? It was doing statistical semantic parsing!
I’m teaching my boy to speak English. You can do the Spanish later.