I’ve spent the last few weeks building and deleting little snippets of code in the general vicinity of language parsing. Mostly it has been an exercise of exploring some of the things that have been learned by the serious folks in the NLP field over the past 10-20 years. Some of what I’ve tried has worked out, and some has not. Most of it will never see the light of day.

Much of what I’ve been tinkering with is code for parsing English grammar. There has been a lot of very complex work done on solving this problem, so there are a lot of things to read and understand. I’ve learned a lot about top-down, bottom-up, shift-reduce, and left-corner parsers. I’ve tinkered with taggers and chunkers and sense and valency in an effort to deal with the complexity that we build into our language.

One of the “big” things I’ve concluded from all this is that I’m not as interested as I thought in building a good parser (i.e., a parser that includes all valid sentences while excluding all invalid ones).  What the machine really needs is to do is extract the “meaning” from the word-stream coming in. Meaning can be gleaned from all sorts of ungrammatical constructions. For example, “In the park running the dog is” and “girl like boy” are easily understood in spite of the fact they’re nonsense grammatically. Even things like “Tihs snceente is esay to raed eevn wtih meixd up lteetrs” can be understood without much effort (try it here, and read more about this effect here). There is a lot of redundancy in our language, so the grammar and spelling doesn’t matter as much as our teachers implied, especially since the average letters/word runs around 4 or so.

This brings me back to an earlier question, “What does it mean to understand?”  I’m going to start with the idea that if the machine can convert the incoming word-stream into a set of entities, the relationships among those entities, and answer questions about them requiring inference or deduction, then it has understood the statements.

I’m currently grappling with understanding propositional logic, first-order logic, and lambda abstraction (or λ-calculus) because I think these ideas might lead to a way of systematically encoding meaning in a form that the machine can use easily.