Language Matters and Computational Linguistics

A cover of Computational Linguistics

I'm rather disappointed with the second edition of Language Matters, by Donna Jo Napoli and Vera Lee-Schoenfeld. Published in 2010, the second edition makes some minor updates to the earlier 2003 edition, along with some added material.

Chapter 7, "Can computers learn language?", received only minor edits, updating references in the examples: the term VCR becomes DVR. The examples themselves, however, have not changed, nor has their conclusion.

The two examples they use are:

1) Record "Law and Order" at 9 P.M. on Channel 10.

2) If there's a movie on tonight with Harrison Ford in it, then record it. But if it's American Graffiti, then don't bother because I already have a copy of that.

As Napoli notes (Lee-Schoenfeld was not involved in the first edition), this task would involve asking the computer "to scan a list of TV programs, recognize which ones are movies, filter out the particular movie American Graffiti, determine whether Harrison Ford is an actor in the remaining movies, and then activate the "record" function on the DVR at all the appropriate times on all of the appropriate channels" (Language Matters, 2nd ed., p. 99). Napoli goes on to suggest that "we'd be asking the computer to work from ordinary sentences, extracting the operations and then properly associating them with the correct vocabulary items, a much harder task" (Language Matters, 2nd ed., p. 99).

Notably, Napoli's summary does not follow the lexical and syntactic structure of the command. In particular, she filters out American Graffiti before performing any search for Harrison Ford. This strikes me as odd, since the first step in parsing this statement would be the same whether done by a linguist or a software parser: parse the first sentence before attempting to add context from the second.

While Napoli and Lee-Schoenfeld make several bold, definitive statements throughout the text which I found lacking in support, in this case they seem to dismiss the concept as a "much harder task". That statement may have gotten a bare pass in 2003, but in 2010 it's a harder sell. Admittedly, the Jeopardy! showdown with IBM's Watson had not yet occurred, but in a text revision I would expect some level of research to validate these claims. There are several journals on computational linguistics available, such as Computational Linguistics, which has been open access since March of 2009.

In particular, the example given above is domain-specific. It deals with television-specific language, for which there are databases of particular terms, such as movie titles and casting information.
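To illustrate why the limited domain makes the task tractable, here is a minimal sketch of the second DVR command, once its operations have been extracted. The program listing, field names, and function name here are all hypothetical illustrations, not any real DVR or TV-database API:

```python
# A sketch of the filtering steps behind "If there's a movie on tonight
# with Harrison Ford in it, then record it. But if it's American Graffiti,
# then don't bother." All data structures here are illustrative.

def schedule_recordings(listings):
    """Return (title, start, channel) for tonight's Harrison Ford movies,
    skipping American Graffiti."""
    to_record = []
    for program in listings:
        if program["genre"] != "movie":
            continue  # only consider movies
        if program["title"] == "American Graffiti":
            continue  # "don't bother" -- already have a copy
        if "Harrison Ford" in program["cast"]:
            to_record.append((program["title"], program["start"], program["channel"]))
    return to_record

# A toy listing standing in for a real programming guide database.
listings = [
    {"title": "American Graffiti", "genre": "movie",
     "cast": ["Richard Dreyfuss", "Ron Howard", "Harrison Ford"],
     "start": "20:00", "channel": 4},
    {"title": "Witness", "genre": "movie",
     "cast": ["Harrison Ford", "Kelly McGillis"],
     "start": "21:00", "channel": 10},
    {"title": "Evening News", "genre": "news",
     "cast": [], "start": "18:00", "channel": 2},
]

print(schedule_recordings(listings))  # [('Witness', '21:00', 10)]
```

The filtering itself is a few lines of straightforward code; the genuinely hard part, as Napoli says, is extracting those operations from the ordinary English sentence in the first place.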

Even before Watson, I would not have considered a problem of this scope to be extraordinarily difficult, primarily due to the limited domain. While a more general domain would increase the difficulty considerably, current research looks more promising. Computers are still not ready to pass the Turing test, but there are some indications that this may happen in the relatively near future.

Language Matters is a very accessible text, introducing many aspects of language and linguistics to readers without much experience in the field. Aside from the chapter on computers and language, this book provides a good introduction to a number of topics. I only wish that in the revision process the authors had revisited some of their conclusions in an active field of research.

Ancient Writing and the Odyssey

How are texts passed down and recomposed through history? In my English 301H class, we're studying a modern English translation of The Odyssey, by Homer. While I have mentioned the recent edition of the Adventures of Huckleberry Finn, and how some of its language has changed, the question itself dates back much further, to the Homeric epic poems, the Iliad and the Odyssey. Scholars today believe that the sack of Troy was a historical event, which took place around 1300 BCE, roughly five hundred years before the Phoenicians introduced the alphabet to ancient Greece. That means five hundred years of oral recitation and recomposition passed before the poem was codified in writing.

There are several theories about how the written version came to be. Scholars originally thought that the Odyssey's complex structure was formed when the poem was first recorded in writing. Newer theories suggest instead that the complex structure aided the bards in reciting the poem, as a form of mnemonic. On this view, the Odyssey was not recited word for word but re-composed from a common template, every recitation a new work of art. As the recorded version contains over twelve thousand lines of poetry, I can easily see how composing the poem during recitation, based on a structured skeleton, could be preferable to rote memorization of such a lengthy work.

I don't know how many times it was edited after first being committed to writing, but there are signs that the Athenian tyrant Peisistratus commissioned a revision of Homer's works sometime between 546 and 524 BCE. This is presumably the source of the "canonical" Greek text of the Iliad and the Odyssey.

The later heritage of the text is also interesting, given the number of works that take Odysseus and his journeys as their source. The Romans called him Ulysses and portrayed him as a villain; Odysseus appears in Dante's Inferno, and James Joyce's Ulysses has many things in common with the voyages of Odysseus.