2013-06-09

NLTK 2.1 - Working with Text Corpora

Let's return to the start of Chapter 2 and explore the tools needed to work easily and efficiently with various linguistic resources.

What are the most used and useful corpora? This is a difficult question to answer because different problems will likely require specific annotations and often a specific corpus. There are even special conferences dedicated to corpus linguistics.

Here's a list of the most well-known general-purpose corpora:

  • Brown Corpus - one of the first big corpora and the only one in this list that is really easily accessible - we've already worked with it in the first chapter
  • Penn Treebank - Treebank is a corpus of sentences annotated with their constituency parse trees so that they can be used to train and evaluate parsers
  • Reuters Corpus (not to be confused with the ApteMod version provided with NLTK)
  • British National Corpus (BNC) - a really huge corpus, but, unfortunately, not freely available

Another very useful resource, which isn't structured as an academic corpus like the ones mentioned above but has other dimensions of useful connections and annotations, is Wikipedia. And a lot of interesting linguistic research has been performed with it.

Besides, there are two additional valuable language resources that can't be classified as text corpora at all, but rather as language databases: WordNet and Wiktionary. We have already discussed the CL-NLP interface to Wordnet. And we'll touch on working with Wiktionary in this part.

Processing corpora

All of the text corpora we'll encounter come as a set of files in some format, either custom or standard (like XML). So to be able to access this data we'll need to implement reading and processing of individual files and assembling these files into groups. Besides, as some corpora are really large (for instance, a zipped instance of the Reuters corpus amounts to 1.5+ GB), it is also useful to be able to process the data one text at a time, without retaining all of it in memory. All this defines a rather simple protocol for corpora processing.

A corpus is just a list of texts that can be arbitrarily grouped.

(defstruct corpus
  desc texts groups)

A text has a name and several representations: basic texts may have a raw, a cleaned-up and a tokenized one. We'll also see texts with other representations, like the parse trees for treebanks - we'll use structure inheritance to describe them.

(defstruct text
  name raw clean tokens)

The protocol itself comprises 3 operations:

(defgeneric read-corpus (type path))

(defgeneric read-corpus-file (type source))

(defgeneric map-corpus (type path fn))

read-corpus creates a corpus object, and it uses read-corpus-file to read individual files and return their contents in multiple forms, potentially needed for further processing. map-corpus calls the function fn on every text produced by calling read-corpus-file on a corpus, in arbitrary order. This function works more like maphash than mapcar, because it doesn't collect the results of applying fn.

The methods of these functions are usually boring. For text-based formats we implement read-corpus-file by reading each file's text either as a whole or line-by-line with the dolines macro, then performing tokenization and some post-processing. For XML-based variants we may use SAX processing with the Closure XML (CXML) library.
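
For illustration, here's a minimal sketch of such a method for a hypothetical :plain-text corpus type. The type name, the naive whitespace tokenization, and the use of the alexandria and split-sequence libraries are assumptions made just for this example - the real methods rely on the library's own tokenizers:

(defmethod read-corpus-file ((type (eql :plain-text)) file)
  ;; read the whole file, trim it, and tokenize naively on whitespace
  (let* ((raw (alexandria:read-file-into-string file))
         (clean (string-trim '(#\Space #\Tab #\Newline) raw))
         (tokens (split-sequence:split-sequence-if
                  (lambda (char) (member char '(#\Space #\Tab #\Newline)))
                  clean :remove-empty-subseqs t)))
    ;; return the same kinds of values the real methods produce
    (values raw clean tokens)))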

Processing the NPS Chat corpus

Let's look at the NPS Chat corpus that is provided with NLTK. It has a rather simple XML format. The entries are called Posts and have the following structure:

<Post class="Emotion" user="10-19-30sUser2">10-19-30sUser11 lol
    <terminals>
        <t pos="NNP" word="10-19-30sUser11"/>
        <t pos="UH" word="lol"/>
    </terminals>
</Post>

We process each file by calling cxml:parse with an instance of the nps-chat-sax class.

(defmethod read-corpus-file ((type (eql :nps-chat)) source)
  (cxml:parse source (make 'nps-chat-sax)))

For this object we'll specialize the parser methods. It will also serve as storage for the processing state, since SAX parsing operates with callbacks that have no access to the current context other than the parser object.

(defclass nps-chat-sax (sax:sax-parser-mixin)
  ((texts :initform nil)
   (tokens :initform nil)
   (classes :initform nil)
   (users :initform nil)
   (cur-tag :initform nil)
   (cur-tokens :initform nil)))

read-corpus-file will return a list of each post's contents and tokens, as well as the lists of the posts' classes and users, which are indicated in the posts' attributes. Technically, these values will be produced by the sax:end-document method that is called at the end of SAX processing:

(defmethod sax:end-document ((sax nps-chat-sax))
  (with-slots (texts tokens users classes) sax
    (values nil  ;; raw text isn't useful for this corpus
            (reverse texts)
            (reverse tokens)
            (reverse classes)
            (reverse users))))

In the sax:start-element method we store the current tag and record the attributes of each post or token, depending on the element.

(defmethod sax:start-element ((sax nps-chat-sax)
                              namespace-uri local-name qname attributes)
  (with-slots (classes users cur-tokens cur-tag) sax
    (case (setf cur-tag (mkeyw local-name))
      (:post (push (attr "class" attributes) classes)
             (push (attr "user" attributes) users))
      (:t (push (make-token :word (attr "word" attributes)
                            :tag (mkeyw (attr "pos" attributes)))
                cur-tokens)))))

In sax:characters we record the current post's text:

(defmethod sax:characters ((sax nps-chat-sax) data)
  (with-slots (cur-tag texts) sax
    (when (eql :post cur-tag)  ;; the post's text is a direct child of <Post>
      (push data texts))))

And in sax:end-element we dump current tokens:

(defmethod sax:end-element ((sax nps-chat-sax) namespace-uri
                            local-name qname)
  (when (eql :terminals (mkeyw local-name))
    (with-slots (tokens cur-tokens) sax
      (push (reverse cur-tokens) tokens)
      (setf cur-tokens nil))))
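
Putting the pieces together, processing a single corpus file could look like this (the file name here is just a hypothetical example):

(multiple-value-bind (raw texts tokens classes users)
    (read-corpus-file :nps-chat #p"10-19-30sUsers.xml")
  (declare (ignore raw))
  ;; each of the lists has one element per post in the file
  (list (first texts) (first tokens) (first classes) (first users)))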

If you have used SAX parsing in some other language, like Python, you should immediately recognize this approach and see how it can be translated to that language almost line-by-line.

Processing the Reuters corpus

Slightly more complex processing is required for the Reuters corpus. The reason is that, unlike the Chat corpus, which we assumed to reside in a filesystem directory, this corpus isn't always reasonable to extract: it's big, and it is stored as a two-level zip archive, with individual archives packed inside one big archive. Extracting all of them is, obviously, tedious.

For such a corpus it makes sense to resort to using map-corpus. Here's a definition of its method for this corpus:

(defmethod map-corpus ((type (eql :reuters)) path fn)
  (zip:with-zipfile (zip path)
    (zip:do-zipfile-entries (name entry zip)
      (unless (char= #\/ (last-char name))
        (with-zipped-zip (in entry :raw t)
          (mv-bind (_ text tokens __ ___ headline byline dateline)
              (read-corpus-file :reuters in)
            (declare (ignore _ __ ___))
            (funcall fn (make-reuters-text
                         :name name
                         :clean text
                         :tokens tokens
                         :headline headline
                         :byline byline
                         :dateline dateline))))))))

It relies on read-corpus-file, which processes individual XML documents just like we saw earlier for the NPS Chat corpus. But the documents are fed into it not by fad:walk-directory, but with the help of utilities provided by David Lichteblau's excellent zip library.

(zip:with-zipfile (zip path)
  (zip:do-zipfile-entries (name entry zip)
    (unless (char= #\/ (last-char name))
      (with-zipped-zip (in entry :raw t)

In this snippet we open the zip file at path with with-zipfile and iterate over its entries with do-zipfile-entries. These are the usual Lisp patterns for handling such resources. Yet inside the zip file we find another layer of zip archives. To remove unnecessary hassle and boilerplate I've added an additional macro, with-zipped-zip. It has to replicate, to some extent, the functionality of with-zipfile, but it takes its data from the contents of one of the entries of the already open zip file. Another twist is that it optionally gives access to the raw unzipped binary stream, not yet converted to a text representation - this is useful for CXML, which can work with such streams efficiently.

(defmacro with-zipped-zip ((name stream zipfile-entry
                                   &key (external-format :utf-8) raw)
                           &body body)
  (with-gensyms (zipstream v end entry entries zip x n)
    ;; here we transform the archives contents into an input stream
    ;; and then read it
    `(flex:with-input-from-sequence
          (,zipstream (zip:zipfile-entry-contents ,zipfile-entry))
        (let ((,v (make-array (zip:zipfile-entry-size ,zipfile-entry)
                              :element-type '(unsigned-byte 8))))
          (read-sequence ,v ,zipstream)

          ;; here we look for the central directory header
          ;; and move to it in the stream
          (if-it (search #(80 75 5 6) ,v :from-end t)
                 (file-position ,zipstream it)
                 (error "end of central directory header not found"))

          ;; here we create a corresponding zipfile object for the entry
          (let* ((,end (zip::make-end-header ,zipstream))
                 (,n (zip::end/total-files ,end))
                 (,entries (make-hash-table :test #'equal))
                 (,zip (zip::make-zipfile
                        :stream ,zipstream
                        :entries ,entries
                        :external-format ,external-format)))
            (file-position ,zipstream
                           (zip::end/central-directory-offset ,end))

            ;; and here we read individual entries from the archive
            (dotimes (,x ,n)
              (let ((,entry (zip::read-entry-object ,zipstream
                                                    ,external-format)))
                (set# (zip:zipfile-entry-name ,entry) ,entries ,entry)))

            ;; finally, we're able to access entries in the same manner
            ;; as we'll do for the ordinary archive
            (do-entries (,name ,stream ,zip
                         :external-format ,external-format :raw ,raw)
              ,@body))))))

As you see, this macro uses a lot of zip's internal symbols, as it implements functionality similar to what the library itself does, just in a way not anticipated by its author...

Let's see how it works:

NLP> (map-corpus :reuters "Reuters.zip"
                 #`(print (text-name %)))
"Reuters/Eng Lang_Disk 1/19960824.zip/12536newsML.xml"
"Reuters/Eng Lang_Disk 1/19960824.zip/12537newsML.xml"
...
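
Since map-corpus doesn't accumulate anything itself, collecting or counting is done on the caller's side. For instance, counting the documents in the archive without extracting it could look like this:

(let ((count 0))
  (map-corpus :reuters "Reuters.zip"
              #'(lambda (text)
                  (declare (ignore text))
                  ;; just count each text passed to the callback
                  (incf count)))
  count)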

Processing the Penn Treebank

The Penn Treebank is a very important corpus which is also not easy to obtain: you have to order a CD from the Linguistic Data Consortium (LDC) and pay more than $1,500 for it. Yet there are 2 workarounds:

  • NLTK data provides 5% of the treebank, so you can use it for experimenting with the data format
  • There's a more recent corpus, OntoNotes, which includes the Penn Treebank; you also have to obtain it from LDC, but it costs only $50 (I learned about its existence on StackOverflow, ordered it, and it indeed has a treebank inside, as well as a lot of other useful linguistic annotations; the only thing I'm not sure about is whether it is an exact copy of the Penn Treebank)

Let's see how to load NLTK's treebank sample. The obvious thing about treebanks is that they are basically Lisp code, so it should be very easy to parse the data with Lisp. The only catch is that the treebank doesn't obey Lisp's formatting rules for strings and doesn't treat special characters like the quote and the pipe in any special way. So the task is to properly handle all of that.

In CL-NLP we have several utilities for dealing with trees: the macros dotree and doleaves, which execute arbitrary code for each subtree or leaf of a tree for side effects, and their counterpart functions maptree and mapleaves, which create isomorphic tree structures by applying a function to all of a tree's nodes or leaf nodes.
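
For example, here's a toy use of dotree on a hand-written tree (not an actual corpus tree), collecting every tag-word pair with the same test we'll apply below when extracting tokens:

(let (leaves)
  (dotree (subtree '(S (NP (DT "the") (NN "cat")) (VP (VBZ "sat"))))
    ;; a leaf-level constituent is a 2-element list: a tag and a word
    (when (and (listp subtree)
               (single (cdr subtree))
               (atom (cadr subtree)))
      (push subtree leaves)))
  (reverse leaves))
;; should yield ((DT "the") (NN "cat") (VBZ "sat")), assuming left-to-right traversal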

So, reading a treebanked tree will be performed in 2 steps:

  • first, we prepare the tree for loading by properly escaping everything (a small worked example of this step is shown right after the second step below):

    (defun prepare-tree-for-reading (string)
      (strjoin " " (mapcar #`(cond-it
                               ((char= #\( (char % 0))
                                (cond-it
                                  ((member (sub % 1)
                                           '("." "," ";" ":" "#" "''" "``")
                                           :test 'string=)
                                   (fmt "(|~A|" (sub % 1)))
                                  ((position #\| %)
                                   (fmt "(|~A\\|~A|"
                                        (sub % 1 it) (sub % (1+ it))))
                                  (t %)))
                               ((position #\) %)
                                (if (zerop it)
                                    %
                                    (fmt "\"~A\"~A" (sub % 0 it) (sub % it))))
                               (t (fmt "\"~A\"" %)))
                           (split-sequence-if #`(member % +white-chars+)
                                              string :remove-empty-subseqs t))))
    
  • and then we read the tree with the Lisp reader and separately collect its tokens:

    (defmethod read-corpus-file ((type (eql :treebank)) file)
      (let ((raw (string-trim +white-chars+ (read-file file))))
        (with-input-from-string (in (prepare-tree-for-reading raw))
          (loop
             :for tree := (car (read in nil)) :while tree
             :collect raw :into raws
             :collect tree :into trees
             :collect (let ((pos 0)
                            toks)
                        (dotree (subtree tree)
                          (when (and (listp subtree)
                                     (single (cdr subtree))
                                     (atom (cadr subtree)))
                            (let ((word (second subtree)))
                              (push (make-token
                                     :beg pos
                                     :end (1- (incf pos (1+ (length word))))
                                     :word word
                                     :tag (first subtree))
                                    toks))))
                        (reverse toks))
             :into tokens
             :finally
             (return (values (strjoin #\Newline raws)
                             (mapcar #`(strjoin #\Space
                                                (mapcar #'token-word %))
                                     tokens)
                             tokens
                             trees))))))
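
To make the escaping step more concrete, here's what prepare-tree-for-reading does to a small hand-made tree (an illustration, not an actual Penn Treebank sentence):

NLP> (prepare-tree-for-reading "(S (NP (DT The) (NN cat)) (VP (VBZ sat) (. .)))")
"(S (NP (DT \"The\") (NN \"cat\")) (VP (VBZ \"sat\") (|.| \".\")))"

The words have become proper Lisp strings and the period tag is escaped with pipes, so the result can be handed directly to the Lisp reader.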
    

Other corpora

Most of the other corpora use formats similar to those of the Brown Corpus, the Reuters Corpus, or the Penn Treebank, so their readers may easily be defined by analogy. Besides, the implementations of the read-corpus methods assume a certain way of presenting the corpus to the library: in a filesystem directory, in a zipfile, etc. Certainly, a corpus may come in a different form, but it will take just a couple of changed lines of code to re-purpose the methods to handle a different representation.

Working with some other collections of documents presented in the NLTK book, like the Project Gutenberg collection of famous public-domain literary texts or the collection of Inaugural speeches, is trivial - you can see how it's done in our implementation of Chapter 1.

For instance, if we have the Project Gutenberg collection in some directory dir, we can process it using the following pattern:

(fad:walk-directory
 dir
 #`(let ((text (string-trim +white-chars+ (read-file %))))
     ... do some processing on the raw text of the corpus ...
     ))

Working with Wikipedia and Wiktionary

Unlike most academic corpora, which are not so easy to obtain, Wikipedia and Wiktionary give you hassle-free access to full dumps of their data in various formats: SQL, XML and text. For instance, we can obtain the latest dump of the English Wiktionary from http://dumps.wikimedia.org/enwiktionary/.

I found it most convenient to process them using the same XML SAX parsing approach that was used for the Reuters and NPS Chat corpora. The difference is that inside the XML documents a special Mediawiki markup is used to add metadata. It is, in fact, a heroic feat to parse this markup, because it doesn't have a clear, precise spec - its real spec is the PHP code of Wikipedia's parser - and different people tend to (ab)use different variations of annotations to express the same or similar things, as well as to use the markup in an untidy manner. I call such documents semi-structured, as opposed to unstructured raw text and structured XML markup. Still, there's a lot of value in this annotation because of its crowdsourced nature, which allows capturing much more information than most centralized efforts. Basically, there are 2 ways to work with Mediawiki markup: using regexes or implementing a complete parser.

Below is a snippet of code that uses the regex-based approach to extract some information from the English Wiktionary - English word definitions. In Wiktionary markup the definitions reside inside each language's section (denoted with a ==Lang== marker, where Lang can be English or some other language). The definitions themselves are placed on their own lines and start with one of these markers: #, #:, #*, #*:, #*::, * [[.

(defvar *defs* (list))

(defclass word-collector (sax:sax-parser-mixin)
  ((title :accessor title :initform nil)
   (tag :accessor tag :initform nil)
   (text :accessor text :initform ())
   (itemcount :accessor itemcount :initform 0)
   (tagcount :accessor tagcount :initform 0)))

(defmethod sax:end-element ((sax word-collector)
                            namespace-uri local-name qname)
  (with-slots (title tag text) sax
    (when (string= "text" local-name)
      (let ((lang-marker "==English==")
            (whole-text (strjoin "" (reverse text))))
        (when-it (search lang-marker whole-text)
          (let ((start (+ it (length lang-marker))))
            (dolist (line (split-sequence
                           #\Newline (sub whole-text start
                                          (re:scan "[^=]==[^=]" whole-text
                                                   :start start))))
              (when (some #`(starts-with % line)
                          '("# " "#: " "#* " "#*: " "#*:: " "* [["))
                (push (clean-words line) *defs*)
                (incf (itemcount sax))))))))))))

(defun clean-words (line)
  "Return a list of downcased words (only alpha-chars or hyphens
   in the middle) found in LINE."
  (mapcar #`(string-downcase (string-trim "-" %))
          (remove-if #`(or (blankp %)
                           (loop :for char :across % :do
                             (unless (or (alpha-char-p char)
                                         (char= #\- char))
                               (return t))))
                     (re:split +punctuation+ (raw-text line)))))

(defun raw-text (line)
  "Return LINE without special-purpose Wiktionary chars."
  (re:regex-replace-all "({[^}]+})|(\\[\\[)|(\\]\\])|('''?)"
                        (sub line (1+ (position #\Space line)))
                        ""))

As the task is clearly defined and limited - extract just the text of definitions - it is possible to use regexes here:

  • one regex to spot the interesting lines
  • another one to filter out all the unnecessary markup.

Using corpora

Now that we can load and play with our corpora, let's take a look at some of the NLTK examples.

First, let's examine the Brown corpus. Here are its categories:

NLP> (ht-keys (corpus-groups *brown*))
(:PRESS-REPORTAGE :PRESS-EDITORIAL :PRESS-REVIEWS :RELIGION
 :SKILL-AND-HOBBIES :POPULAR-LORE :BELLES-LETTRES
 :MISCELLANEOUS-GOVERNMENT-HOUSE-ORGANS :LEARNED :FICTION-GENERAL
 :FICTION-MYSTERY :FICTION-SCIENCE :FICTION-ADVENTURE :FICTION-ROMANCE
 :HUMOR)

As you see, there is no "news" category which is mentioned in the NLTK example. Probably, what they refer to is :press-reportage.

NLP> (take 9 (mapcar #'token-word
                     (flatten (mapcar #'text-tokens
                                      (get# :press-reportage
                                            (corpus-groups *brown*))))))
("The" "Fulton" "County" "Grand" "Jury" "said" "Friday" "an" "investigation")

NLP> (take 9 (mapcar #'token-word
                     (flatten
                      (mapcar #'text-tokens
                              (remove-if-not #`(string= "cg22" (text-name %))
                                             (corpus-texts *brown*))))))
("Does" "our" "society" "have" "a" "runaway" "," "uncontrollable" "growth")

NLP> (setf *print-length* 3)
NLP> (mapcan #`(ncorpus::sentences-from-tokens (text-tokens %))
             (get# :press-reportage (corpus-groups *brown*)))
((#<The/at 0..3> #<Fulton/np-tl 4..10> #<County/nn-tl 11..17> ...)
 (#<The/at 166..169> #<jury/nn 170..174> #<further/rbr 175..182> ...)
 (#<The/at 409..412> #<September-October/np 413..430> #<term/nn 431..435> ...)
 ...)

The function sentences-from-tokens here operates under the assumption that every token with the . tag ends a sentence, and it splits the sentences on one or more such tokens.
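
A rough sketch of such a splitter might look as follows (this is only an illustration of the idea, not the actual ncorpus::sentences-from-tokens code; it assumes the token-tag accessor and, unlike the real function, doesn't merge runs of consecutive sentence-ending tokens):

(defun split-on-period-tag (tokens)
  "Group TOKENS into sentences, closing a sentence at every token
whose tag prints as a period."
  (let (sentences current)
    (dolist (tok tokens)
      (push tok current)
      (when (string= "." (symbol-name (token-tag tok)))
        (push (reverse current) sentences)
        (setf current nil)))
    (when current  ;; keep trailing tokens that lack a final period
      (push (reverse current) sentences))
    (reverse sentences)))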

The presented code may seem much more elaborate than the NLTK version:

>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

Yet it should be understood that it is easy to add a ton of various utility accessors, like it is done in NLTK, but they will inevitably make the code more complex and harder to maintain. And the question is: how frequently are they going to be used, and what will we miss anyway? Coding such utilities is very pleasant and easy in the functional style shown above, so they are left outside of the scope of cl-nlp, at least until proven essential...

(NB. Don't forget to set *print-length* back to nil).

Next, we'll once again be looking at frequency distributions. First, we need to build an ngram index from the Brown corpus words in the "news" category. In Chapter 1 we've already done a very similar thing:

NLP> (index-ngrams
      1 (mapcar #'token-word
                (flatten (mapcar #'text-tokens
                                 (get# :press-reportage
                                       (corpus-groups *brown*))))))
#<TABLE-NGRAMS order:1 count:14395 outcomes:108130 {101C2B57E3}>
NLP> (defvar *news-1grams* *)  ;; * is the previous returned value
NLP> (dolist (m '("can" "could" "may" "might" "must" "will"))
       (format t "~A: ~A " m (freq *news-1grams* m)))
can: 93 could: 86 may: 66 might: 38 must: 50 will: 389

Funny - some of the numbers don't match what's in NLTK:

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Well, we can double-check them from a different direction:

NLP> (length (remove-if-not
              #`(string= "can" %)
              (mapcar #'token-word
                      (flatten (mapcar #'text-tokens
                                       (get# :press-reportage
                                             (corpus-groups *brown*)))))))
93

Looks like our calculation is precise.

Next is conditional frequency distribution, but we'll leave this topic for the next part that is fully dedicated to that.

So far so good. We have examined the common approaches to handling various corpora. There are, basically, 2 main things you need to do with them: load & parse them, and filter them for interesting information. The loading for most corpora will be performed either with some regexes on raw strings, with XML SAX parsing, or using the Lisp reader. And for filtering, Lisp offers a great, versatile toolset of higher-order functions: mapcar, remove-if-not, position, and reduce - to name a few.

2013-04-29

Errors in your Platform

One of the always trending topics in our pop-culture-like computer industry is dealing with errors. There are whole movements addressing it. Yet, in my humble opinion, these movements often see errors in a rather narrow, I would even say simplistic, manner: like the proverbial example of passing a string to a function which expects integers. But, in fact, the scope of errors that can occur in a program is indefinite. In a sense, the language of errors is Turing-complete, and creating a program can, from a certain point of view, be seen as devising a way of working around all relevant errors.

Today, I'd like to talk about a particularly nasty type of errors - those that occur in the platform on which we're building the program. And the reality is such that we base our programs on at least several layers of platforms: a hardware platform, an OS and its standard library, a language's VM and the language's standard library, some user-level framework(s) - to name a few. There are several aspects of this picture - some positive for us and some negative.

  • the lower we go, the more widely-used the platform is and thus the better it is tested and the more robust (positive)
  • at the same time it becomes, usually, more general and so more complex and feature-full, which means that some of the more obscure features may, actually, be not so well tested (negative)
  • the platform is usually heterogeneous: each level has its own language with its own semantics and concepts, often conflicting with the ones at the other levels (negative)

There's a revealing rant by node.js creator Ryan Dahl on the topic of such platforms - now removed, but available in the all-remembering Internet - "I hate almost all software".

Platform errors are so nasty because of two things:

  • you learn to trust the platform (or you can easily go insane otherwise), so you first search for errors in your code
  • the majority of platforms are not built with a priority on traceability, so it's an order of magnitude harder to find the problem when you finally start looking for it IN the platform

And the thing is, there are severe bugs in all of the platforms we use. The Pentium FDIV bug is one acute example, but let's talk about Java here, because there are so many people bragging about how rock-solid the JVM is. One of the problems of the JVM is jar hell - an issue that haunts almost every platform under different names (like DLL hell). We had an encounter with it recently, when after an update our server started hanging for no apparent reason. After an investigation we found that part of our code was compiled against a more recent version of the Gson library for parsing JSON, while the main server used the older version. The interesting thing is that no error condition was signalled. I wonder how many projects make the same mistake and don't even know the source of their problems. We had a similar but more visible issue with different versions of the servlet-api, but that time it was caught at compile time because of a minor API inconsistency.

The other problem we fought fruitlessly and eventually had to work around with external tools was the JVM not properly closing network sockets. Yes, it leaks sockets in some scenarios, and we couldn't find a way to prevent it. This is an example of another nasty kind of platform error - the one which happens at layer boundaries. Also, many Java libraries, like the ubiquitous Jetty webserver, leak memory in their default settings. Actually, I wonder what percentage of Java libraries doesn't leak memory by default?.. This is not to say that the JVM is exceptional in this regard - other platforms suffer from the same problems. At the same time, to the JVM's credit, it has one of the tools for fighting platform errors that many other platforms lack.

So, what are the ways to deal with those errors? Unfortunately, I don't have a lot of answers here, but I hope to discover some with this article. Here are some of the suggestions:

First of all, don't trust the platform.
Try to check everything and cross-validate your assumptions with external tools. This is much easier said than done, and there's a tradeoff to make somewhere between slackness and paranoia. But if you think about it beforehand, you can find ways to set up the system so that it's easier to deal with the problem when you have to.
Have the platform's source code at hand.
This is one dimension of traceability - the static one. It won't help you easily find all the problems, but without it you're just at the mercy of the platform's creator. And I wouldn't trust those guys (see p.1).
Do proper error-handling.
There's a plethora of information on that, and I'd not repeat it here. See the error handling manifesto for one of the good examples.

But the biggest upside is not in fighting with the existing system, but in removing or diminishing the source of the problems - i.e. creating more programmer-friendly platforms. So the most important recommendations I see are for platform authors (and, in fact, most of us become them to some extent when we release a utility library or a framework).

Build platforms with traceability in mind.
One example of such traceability is the JVM's VisualVM, which is a great tool for getting visibility into what's happening in the platform. Another example is Lisp images, whose state you can inspect interactively with Lisp commands. For instance, Lisp has the built-in operator TRACE, which lets you see the invocations of any function of interest, and INSPECT, which is an interactive object browser (see the short illustration after this list). These things are really hard to add after the fact: look, for example, at the trouble of bringing DTrace into the Linux world.
Create homogeneous platforms.
Why? Because in homogeneous platforms the boundaries are not so sharp, and so there tend to be fewer bugs on the boundaries. And it's much easier to build in traceability: for example, you don't need to deal with multiple stacks, multiple object representations, etc. Once again, Lisp comes up here, because with its flexibility it allows building systems of different levels of abstraction in the same language. And that was exemplified by the Lisp Machines - see Kalman Reti's talk describing their inner workings.
Support component versioning at core.
This is, actually, one of the topics not addressed well by any mainstream platform (except, maybe, .Net), as far as I know: how to allow several versions of the same component to live together in one program image.
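
Here's the short illustration promised above (standard Common Lisp, runnable in any implementation; fib is just a made-up example function):

(defun fib (n)
  (if (< n 2)
      n
      (+ (fib (- n 1)) (fib (- n 2)))))

(trace fib)    ; from now on, every call to FIB prints its arguments and result
(fib 4)
(untrace fib)  ; turn tracing off again

(inspect (list 1 2 3))  ; drop into the interactive object browser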

How can static typing help with these types of errors? I don't know. What about Erlang's failure isolation? Once again, it's pretty orthogonal to the mentioned issues: if your Erlang library leaks memory, the whole machine will suffer - i.e. you can't isolate errors which touch the underlying layers of your platform, the ones you don't fully control...

Please, share your approaches and stories about dealing with platform errors.

2013-04-11

NLTK 2.3 - Working with Wordnet

I'm a little bit behind my schedule of implementing NLTK examples in Lisp, with no posts on the topic in March. It doesn't mean that work on CL-NLP has stopped - I just had an unexpected vacation and have also worked on parts related to writing programs for the excellent Natural Language Processing course by Michael Collins on Coursera.
Today we'll start looking at Chapter 2, but we'll do it from the end, first exploring the topic of Wordnet.

Wordnet and Lisp

Wordnet is a database of word meanings (senses) and word semantic relations (synonymy, hyponymy and various other -nymies). It is a really important, freely available resource, so having easy access to it is very valuable. At the same time, getting Wordnet to work is somewhat involved technically. That's why I've decided to implement its support in the cl-nlp-contrib system, so that the basic cl-nlp functionality can be loaded without the additional hassle of having to ensure that Wordnet (and other similar stuff) loads properly.
The implementation of the Wordnet interface in cl-nlp is incomplete in the sense that it doesn't provide matching entities for all Wordnet tables and so doesn't support all possible interactions with it out-of-the-box. Yet it is sufficient to run all of NLTK's examples, and it is trivial to implement the rest as the need arises (I plan to do it later this year).
There are several other Wordnet interfaces for CL, developed over the previous years, starting from as long ago as the early nineties.
None of them uses my desired storage format (sqlite), and their overall support ranges from non-existent to minimal. Besides, one of the principles behind cl-nlp is to prefer a uniform interface and ease of extensibility over performance and efficiency, with the premise that a specialized efficient version can easily be implemented by restricting the more general one, while the opposite is often much harder. And adapting the existing packages to the desired interface didn't seem like a straightforward task, so I decided to implement Wordnet support from scratch.

Wordnet installation

Wordnet isn't a corpus, although NLTK categorizes it as such, placing it in the nltk.corpus module. It is a word database distributed in a custom format, as well as in the binary formats of several databases. I have picked the sqlite3-based distribution as the easiest to work with — it's a single file of roughly 64 MB in size which can be downloaded from the WNSQL project page. The default place where cl-nlp will search for Wordnet is the data/ directory. If you call (download :wordnet), it will fetch a copy and place it there.
There are several options for working with sqlite databases in Lisp, and I have chosen CLSQL as the most mature and comprehensive system — it's the go-to library for doing simple SQL stuff in CL. Besides the low-level interface it provides basic ORM functionality and some other goodies.
Also, to work with sqlite from Lisp, a C client library is needed. It can be obtained in various ways depending on your OS and distribution (on most Linuxes with apt-get or a similar package manager). Yet there's another issue of making Lisp find the installed library. So for Linux I decided to ship the appropriate binaries in the project's lib/ dir. If you are not on Linux and CLSQL can't find the sqlite native library, you need to manually call the following command before you load cl-nlp-contrib with quicklisp or ASDF:
(clsql:push-library-path "path to sqlite3 or libsqlite3 directory")

Supporting Wordnet interaction

Using CLSQL we can define classes for Wordnet entities, such as Synset, Sense or Lexlink. Here's the synset class:
(clsql:def-view-class synset ()
  ((synsetid :initarg :synsetid :reader synset-id
             :type integer :db-kind :key)
   (pos :initarg :pos :reader synset-pos
        :type char)
   (lexdomainid :initarg :lexdomain
                :type smallint :db-constraints (:not-null))
   (definition :initarg :definition :reader synset-def
               :type string)
   (word :initarg :word :db-kind :virtual)
   (sensenum :initarg :sensenum :db-kind :virtual))
  (:base-table "synsets"))
Notice the presence of the virtual slots word and sensenum, which are used to cache data from other tables in the synset object, to display it like it is done in NLTK. They are populated lazily with a slot-unbound method, just like we've seen in Chapter 1.
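Here's a sketch of how such lazy population could look for the word slot (an illustrative assumption, not the library's actual method; it reuses the lemmas and lemma-word accessors that appear later in this post):
(defmethod slot-unbound (class (synset synset) (slot-name (eql 'word)))
  (declare (ignore class))
  ;; on first access, fetch the synset's words and cache the first one
  (setf (slot-value synset 'word)
        (lemma-word (first (lemmas <wordnet> synset)))))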
All the DB entities are cached as well in a global cache, so that the same requests don't produce different objects and Wordnet objects can be compared with eql.
There are some inconsistencies in Wordnet lexicon which have also migrated to NLTK interface (and then NLTK introduced some more). I've done a cleanup on that so some classes and slot accessors don't fully mimic the names of their DB counterparts or the NLTK's interface. For instance, in our dictionary word refers to a raw string and lemma — to its representation as a DB entity.
CLSQL also has special syntactic support for writing SQL queries in Lisp:
(clsql:select [item_id] [as [count [*]] 'num]
              :from [table]
              :where [is [null [sale_date]]]
              :group-by [item_id]
              :order-by '((num :desc))
              :limit 1)
Yet I've chosen to use another, very simple, custom solution for that, which I've kept in the back of my mind for a long time, ever since I started working with CLSQL's literal SQL syntax. Here's how we get all lemma objects for a specific synset — they are connected through senses:
(query (select 'lemma
               `(:where wordid
                 :in ,(select '(wordid :from senses)
                              `(:where synsetid := ,(synset-id synset))))))
For the sake of efficiency we don't open a new DB connection for every such request, but use a combination of the following:
  • there's a CLSQL special variable *default-database* that points to the current DB connection; it is used by all Wordnet interfacing functions
  • the connection can be established with connect-wordnet, which is given an instance of the sql-wordnet3 class (it has the usual default instance <wordnet>, but you can use any other instance, which may connect to some other Wordnet SQL DB, like MySQL, if needed)
  • the ensure-db function checks if the connection is present and opens it otherwise
  • it is assumed that all functions will perform Wordnet interaction inside the with-wordnet macro, which implements the standard Lisp call-with-* resource-management pattern for the Wordnet connection (sketched below)
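A minimal sketch of that macro might look like this (the real with-wordnet's lambda list and internals may differ; this only shows the call-with-* pattern):
(defmacro with-wordnet ((&optional (wordnet '<wordnet>)) &body body)
  "Ensure a connection to WORDNET is open around BODY."
  `(progn
     (ensure-db ,wordnet)  ;; reuse an existing connection or open a new one
     ,@body))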

Running NLTK's examples

Now, let's run the examples from the book to see how they work in our interface.
First, connect to Wordnet:
WORDNET> (connect-wordnet <wordnet>)
#<CLSQL-SQLITE3:SQLITE3-DATABASE /cl-nlp/data/wordnet30.sqlite OPEN {1011D1FAF3}>
Now we can run the functions with the existing connection passed implicitly through the special variable. wn is a special convenience package which implements the logic of implicitly relying on the default <wordnet>. There are functions with the same names in wordnet package that take a wordnet instance as first argument, similar to how other parts of cl-nlp are organized.
WORDNET> (wn:synsets "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
WORDNET> (synsets <wordnet> "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
The printing of synsets and other Wordnet objects is performed with custom print-object methods that use the built-in print-unreadable-object macro. Here is, for example, the method for the sample entity:
(defmethod print-object ((sample sample) stream)
  (print-unreadable-object (sample stream :type t :identity t)
    (with-slots (sample sampleid) sample
      (format stream "~A ~A" sample sampleid))))
Let's proceed further.
WORDNET> (wn:words (wn:synset "car.n.1"))
("auto" "automobile" "car" "machine" "motorcar")
This snippet is equivalent to the following NLTK code:
wn.synset('car.n.01').lemma_names
Definitions and examples:
WORDNET> (synset-def (wn:synset "car.n.1"))
"a motor vehicle with four wheels; usually propelled by an internal combustion engine"
WORDNET> (wn:examples (wn:synset "car.n.1"))
("he needs a car to get to work")
Next are lemmas.
WORDNET> (wn:lemmas (wn:synset "car.n.1"))
(#<LEMMA auto 9953 {100386BCC3}> #<LEMMA automobile 10063 {100386BCF3}>
 #<LEMMA car 20722 {100386BC03}> #<LEMMA machine 80312 {100386BD23}>
 #<LEMMA motorcar 86898 {1003250543}>)
As I've written earlier, there's some confusion in Wordnet between words and lemmas, and, IMHO, it is propagated in NLTK. The term lemma only appears once in the Wordnet DB, as a column in the words table. And synsets are not directly related to words — they are linked through the senses table. NLTK calls a synset-to-word pairing a lemma, which only adds to the confusion. I decided to call the entities of the words table lemmas. Now, how do you implement the equivalent of
wn.lemma('car.n.01.automobile')
We can do it like this:
WORDNET> (remove-if-not #`(string= "automobile" (lemma-word %))
                        (wn:lemmas (wn:synset "car.n.1")))
(#<LEMMA automobile 10063 {100386BCF3}>)
But, actually, we don't need a raw lemma object but a sense object, because a sense is exactly the mapping of a word to its meaning (defined by a synset), and it's the proper Wordnet entity for this:
WORDNET> (wn:sense "car~automobile.n.1")
#<SENSE car~auto.n.1 28261 {1011F226C3}>
You can also get at the raw lemma by its DB id:
WORDNET> (wn:synset (wn:lemma 10063))
#<SYNSET auto.n.1 102958343 {100386BB33}>
Notice that the synset here is named auto. This is different from NLTK's 'car', and the reason is that it's unclear from the Wordnet DB what the "primary" lemma for a synset is, so I just use the word which appears first in the DB. Probably NLTK does the same, but gets a different ordering — compare the next output with the book's one:
WORDNET> (dolist (synset (wn:synsets "car"))
           (print (wn:words synset)))
("cable car" "car")
("auto" "automobile" "car" "machine" "motorcar")
("car" "railcar" "railroad car" "railway car")
("car" "elevator car")
("car" "gondola")
Also I've chosen a different naming scheme for senses — word~synset — as it better reflects the semantic meaning of this concept.

Wordnet relations

There are two types of Wordnet relations: semantic ones between synsets and lexical ones between senses. All of the relations, or links, can be found in the *link-types* parameter. There are 28 of them in Wordnet 3.0.
To get at any relation there's a generic function related:
WORDNET> (defvar *motorcar* (wn:synset "car.n.1"))
WORDNET> (defvar *types-of-motorcar* (wn:related *motorcar* :hyponym))
WORDNET> (nth 0 *types-of-motorcar*)
#<SYNSET ambulance.n.1 102701002 {1011E35063}>
In our case "ambulance" is the first motorcar type.
WORDNET> (sort (mapcan #'wn:words *types-of-motorcar*) 'string<)
("ambulance" "beach waggon" "beach wagon" "bus" "cab" "compact" "compact car"
 "convertible" "coupe" "cruiser" "electric" "electric automobile"
 "electric car" "estate car" "gas guzzler" "hack" "hardtop" "hatchback" "heap"
 "horseless carriage" "hot rod" "hot-rod" "jalopy" "jeep" "landrover" "limo"
 "limousine" "loaner" "minicar" "minivan" "model t" "pace car" "patrol car"
 "phaeton" "police car" "police cruiser" "prowl car" "race car" "racer"
 "racing car" "roadster" "runabout" "s.u.v." "saloon" "secondhand car" "sedan"
 "sport car" "sport utility" "sport utility vehicle" "sports car" "squad car"
 "stanley steamer" "station waggon" "station wagon" "stock car" "subcompact"
 "subcompact car" "suv" "taxi" "taxicab" "tourer" "touring car" "two-seater"
 "used-car" "waggon" "wagon")
Hypernyms are more general entities in synset hierarchy:
WORDNET> (wn:related *motorcar* :hypernym)
(#<SYNSET automotive vehicle.n.1 103791235 {10125702C3}>)
Let's trace them up to root entity:
WORDNET> (defvar *paths* (wn:hypernym-paths *motorcar*))
WORDNET> (length *paths*)
2
WORDNET> (mapcar #'synset-name (nth 0 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
 "instrumentality.n.3" "conveyance.n.3" "vehicle.n.1" "wheeled vehicle.n.1"
 "self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
WORDNET> (mapcar #'synset-name (nth 1 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
 "instrumentality.n.3" "container.n.1" "wheeled vehicle.n.1"
 "self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
And here are just the root hypernyms:
WORDNET> (remove-duplicates (mapcar #'car *paths*))
(#<SYNSET entity.n.1 100001740 {101D016453}>)
Now, if we look at part-meronym, substance-meronym and member-holonym relations and try to get them with related, we'll get an empty set.
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym)
NIL
That's because the relation is actually one-way: a burl is part of a tree, but not vice versa. For this case there's a :reverse key to related:
WORDNET> (wn:related (wn:synset "tree.n.1") :part-meronym :reverse t)
(#<SYNSET stump.n.1 113111504 {10114220B3}>
 #<SYNSET crown.n.7 113128003 {1011423B73}>
 #<SYNSET limb.n.2 113163803 {1011425633}>
 #<SYNSET bole.n.2 113165815 {10114270F3}>
 #<SYNSET burl.n.2 113166044 {1011428BE3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym :reverse t)
(#<SYNSET sapwood.n.1 113097536 {10113E8BE3}>
 #<SYNSET duramen.n.1 113097752 {10113EA6A3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :member-holonym :reverse t)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
While the tree is a member-meronym of forest (i.e. NLTK has it slightly the other way around):
WORDNET> (wn:related (wn:synset "tree.n.1") :member-meronym)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
And here's the mint example:
WORDNET> (dolist (s (wn:synsets "mint" :pos #\n))
           (format t "~A: ~A~%" (synset-name s) (synset-def s)))
mint.n.6: a plant where money is coined by authority of the government
mint.n.5: a candy that is flavored with a mint oil
mint.n.4: the leaves of a mint plant used fresh or candied
mint.n.3: any member of the mint family of plants
mint.n.2: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
batch.n.2: (often followed by `of') a large number or amount or extent
WORDNET> (wn:related (wn:synset "mint.n.4") :part-holonym :reverse t)
(#<SYNSET mint.n.2 112855042 {1011CF3CF3}>)
WORDNET> (wn:related (wn:synset "mint.n.4") :substance-holonym :reverse t)
(#<SYNSET mint.n.5 107606278 {1011CEECA3}>)
Verbs:
WORDNET> (wn:related (wn:synset "walk.v.1") :entail)
(#<SYNSET step.v.1 201928838 {1004139FF3}>)
WORDNET> (wn:related (wn:synset "eat.v.1") :entail)
(#<SYNSET chew.v.1 201201089 {10041BDC33}>
 #<SYNSET get down.v.4 201201856 {10041BF6F3}>)
WORDNET> (wn:related (wn:synset "tease.v.3") :entail)
(#<SYNSET arouse.v.7 201762283 {10042495E3}>
 #<SYNSET disappoint.v.1 201798936 {100424B0A3}>)
And, finally, here's antonymy, which is a lexical, not semantic, relationship, and it takes place between senses:
WORDNET> (wn:related (wn:sense "supply~supply.n.2") :antonym)
(#<SENSE demand~demand.n.2 48880 {1003637C03}>)
WORDNET> (wn:related (wn:sense "rush~rush.v.1") :antonym)
(#<SENSE linger~dawdle.v.4 107873 {1010780DE3}>)
WORDNET> (wn:related (wn:sense "horizontal~horizontal.a.1") :antonym)
(#<SENSE inclined~inclined.a.2 94496 {1010A5E653}>)
WORDNET> (wn:related (wn:sense "staccato~staccato.r.1") :antonym)
(#<SENSE legato~legato.r.1 105844 {1010C18AF3}>)

Similarity measures

Let's now use the lowest-common-hypernyms function, abbreviated to lch, to see the semantic grouping of marine animals:
WORDNET> (defvar *right* (wn:synset "right whale.n.1"))
WORDNET> (defvar *orca* (wn:synset "orca.n.1"))
WORDNET> (defvar *minke* (wn:synset "minke whale.n.1"))
WORDNET> (defvar *tortoise* (wn:synset "tortoise.n.1"))
WORDNET> (defvar *novel* (wn:synset "novel.n.1"))
WORDNET> (wn:lch *right* *minke*)
(#<SYNSET baleen whale.n.1 102063224 {102A2E0003}>)
2
1
WORDNET> (wn:lch *right* *orca*)
(#<SYNSET whale.n.2 102062744 {102A373323}>)
3
2
WORDNET> (wn:lch *right* *tortoise*)
(#<SYNSET craniate.n.1 101471682 {102A3734A3}>)
7
5
WORDNET> (wn:lch *right* *novel*)
(#<SYNSET entity.n.1 100001740 {102A373653}>)
15
7
The second and third return values of lch come in handy here, as they show the depths of the paths to the common ancestor and give some immediate data for estimating semantic relatedness, which we're going to explore more now.
WORDNET> (wn:min-depth (wn:synset "baleen whale.n.1"))
14
WORDNET> (wn:min-depth (wn:synset "whale.n.2"))
13
WORDNET> (wn:min-depth (wn:synset "vertebrate.n.1"))
8
WORDNET> (wn:min-depth (wn:synset "entity.n.1"))
0
By the way, there's another whale, that is much closer to the root:
WORDNET> (wn:min-depth (wn:synset "whale.n.1"))
5
Guess what it is?
WORDNET> (synset-def (wn:synset "whale.n.1"))
"a very large person; impressive in size or qualities"
Now, let's calculate different similarity measures:
WORDNET> (wn:path-similarity *right* *minke*)
1/4
WORDNET> (wn:path-similarity *right* *orca*)
1/6
WORDNET> (wn:path-similarity *right* *tortoise*)
1/13
WORDNET> (wn:path-similarity *right* *novel*)
1/23
The algorithm here is very simple — it just reuses the secondary return values of lch:
(defmethod path-similarity ((wordnet sql-wordnet3)
                            (synset1 synset) (synset2 synset))
  (mv-bind (_ l1 l2) (lowest-common-hypernyms wordnet synset1 synset2)
    (declare (ignore _))
    (when l1
      (/ 1 (+ l1 l2 1)))))
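For example, for the right whale and minke whale synsets above, lch returned path lengths 2 and 1, so path-similarity is 1/(2 + 1 + 1) = 1/4, which is exactly the value we got earlier.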
There are many more Wordnet similarity measures in NLTK, and they are also implemented in cl-nlp. For example, lch-similarity, whose name luckily coincides with the lch function:
WORDNET> (wn:lch-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
Calculating taxonomy depth for #<SYNSET entity.n.1 100001740 {102A9F7A83}>
2.0281482
This measure depends on an expensive computation of taxonomy depth, which is done lazily in the max-depth function:
(defmethod max-depth ((wordnet sql-wordnet3) (synset synset))
  (reduce #'max (mapcar #`(or (get# % *taxonomy-depths*)
                              (progn
                                (format *debug-io*
                                        "Calculating taxonomy depth for ~A~%" %)
                                (set# % *taxonomy-depths*
                                      (calc-max-depth wordnet %))))
                        (mapcar #'car (hypernym-paths wordnet synset)))))
Other similarity measures include wup-similarity, res-similarity, lin-similarity and others. You can read about them in the NLTK Wordnet manual. Most of them depend on the words' information content database which is calculated for different corpora (e.g. BNC and Penn Treebank) and is not part of Wordnet. There's a WordNet::Similarity project that distributes pre-calculated databases for some popular corpora.
Lin similarity uses a formula similar to LCH similarity for calculating the score, but takes information content values instead of taxonomy depths as its arguments:
WORDNET> (wn:lin-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
0.87035906
To calculate it we first need to fetch the information content corpus with (download :wordnet-ic) and load the data for SemCor:
WORDNET> (setf *lc* (load-ic :filename-pattern "semcor"))
#<HASH-TABLE :TEST EQUAL :COUNT 213482 {100D7B80D3}>
The default :filename-pattern will load the combined information content scores for the following 5 corpora: BNC, Brown, SemCor and its raw version, all of Shakespeare's works and Penn Treebank. We'll talk more about various corpora in the next part of the series...
Well, this is all as far as Wordnet is concerned for now. There's another useful piece of Wordnet functionality — the word morpher, but we'll return to it when we talk about word forming, stemming and lemmatization. Meanwhile, here's a good schema showing the full potential of Wordnet:
Wordnet 3.0 UML diagram

2013-03-05

Lisp Hackers: Vladimir Sedach

Vladimir Sedach is an active open-source Common Lisp developer and proponent, as well as, to some extent, a computing philosopher. At his carcaddar blog he writes about decentralized social networks, forgotten bits of computer history and, surely, Lisp. He is the maintainer or originator of a few open-source libraries like parenscript and Eager-Future2, and works on the Vacietis C-to-Lisp compiler, which he describes in more detail in the interview. Together with Andrey Moskvitin, he was the driving force behind the 2012 cliki update effort (see cliki2).

Tell us something interesting about yourself.

In 2012 my best friend and I rode our bicycles across the USA.

What's your job? Tell us about your company.

Right now I work at ZestFinance in Hollywood. We're a relatively new company that's helping the underbanked receive access to credit on more reasonable terms than otherwise obtainable, by using machine learning. The only thing bad about my job is that I get so focused on programming at work that I don't have any desire to hack on Free Software projects when I come home, so a lot of my own projects are currently badly neglected.

Do you use Lisp at work? If yes, how you've made it happen? If not, why?

I've used Lisp for new commercial projects where people trust me with the technology choices. I've also come in to work on existing Lisp projects. I have used Lisp for one-off tasks at "non-Lisp" companies, but I've never started a Lisp project at a company that wasn't using Lisp already. Most computer programmers tend to make technical decisions based on cliches they read about on the Internet, and unfortunately I don't know the cure for being dumb.

What brought you to Lisp? What holds you?

As a teenager I got interested in computer graphics, and came across this really cool 3d graphics program called Mirai from Nichimen Graphics. Mirai was written in Common Lisp, and I read more about Lisp in the Slashdot interview with Kent Pitman, and started reading SICP based on Kent's recommendation. By the second chapter computer programming had finally made sense to me, and by the third I had decided to study mathematics in university.

As an aside, Mirai is directly descended from the S-Graphics software from Symbolics. Not many people know this, but Symbolics was actually the second producer of commercially available 3d computer animation software (the first was Wavefront). S-Graphics was a complete modeling, animation, and painting system, and it was all written in Lisp in the mid 1980s.

Common Lisp is still the best programming language I've come across. There is the feeling of freedom. No one is telling you what patterns or types to use. You can always just write code in a way that is most appropriate for what you're currently doing.

What's the most exciting use of Lisp you had?

Working with Daniel Gackle on Skysheet. We did some very awesome things with Parenscript.

What you dislike the most about Lisp?

No one has yet invented a good way to write "one-liners" in Lisp. I would love to replace my Unix shell with a Lisp REPL someday.

Among the software projects you've participated in what's your favorite?

Vacietis, for the sheer amount of hacks per line of code, and for the "I proved them wrong" factor. It was during the development of Vacietis that I came up with my current programming philosophy: "when stuck, do the stupidest thing possible"

Tell us about Vacietis: your vision for it, the project's progress and roadmap, what's lacking?..

Vacietis was originally intended to be a translator for a subset of C code to Common Lisp. I discussed the idea for Vacietis a few times online before starting to code, and people generally thought it wouldn't be doable. Scott Burson, who wrote the Zeta-C compiler for Lisp Machines in the 1980s, told me it would take at least a year of full-time of work.

As I worked on Vacietis, I realized that adding more and more of the "advanced" C functionality was actually easy. First came the C preprocessor (which is implemented as part of the Common Lisp readtable that also parses the C code), then large patches of the C standard library, and then one day Brit Butler came along and sent me a one-page patch that actually used Vacietis as a stand-alone C compiler! I hadn't even realized the project had gotten to that phase. The compiler comes as the "vacietis.vcc" ASDF system as part of Vacietis.

Some big things that need to be done are struct call-by-value, pointer scaling, arguments to main(), some libc stuff, setjmp, and making VCC produce linkable Lisp "object" files (fasls) for the different implementations. I would also like to change the pointer representation to be a fixnum offset into a sparse array (right now pointers are represented by structures that look like <offset, array>).

Overall, I am amazed by how much progress Vacietis has made given how little time I spent on it. I am now convinced that Common Lisp is mostly a superset of C, and have a newfound appreciation for gotos and PROG.

If you had all the time in the world for a Lisp project, what would it be?

A single-address space Common Lisp operating system based on capabilities (Jonathan Rees wrote a dissertation on doing this in Scheme in 1995) and a virtualized package system. I really like Azul Systems' hack of using the EPT/RVI virtualized page tables as a hardware read barrier for real-time garbage collection, and with a single-address space operating system I think you can do the same thing on hardware with a regular MMU.

If you follow the steps of the Viewpoints Research Institute's Fundamental New Computer Technologies project and keep the system as simple as possible, it should be a surmountable amount of work to realize a somewhat working system.

Vacietis is actually the first step in the Common Lisp operating system project. I'd like to have a C runtime onto which I can port hardware drivers from OpenBSD with the minimal amount of hand coding. The way hardware drivers are written in OpenBSD (a set of patterns it borrows from NetBSD, where they originated and are colloquially known as "bus_dma") is very pleasant. Theo de Raadt was also the first well-known Free Software operating system developer to take a position against binary hardware drivers, and I respect him tremendously for that. Linus Torvalds doesn't care, and as a result it is now impossible to get drivers for any of your ARM devices, even though 99% of them run Linux. OpenBSD will continue to have great, stable Free Software device drivers that are easily portable for different hardware architectures and to the other BSD-derived operating systems. I hope OpenBSD's model of Free Software hardware drivers catches on more widely.

Describe your workflow, give some productivity tips to fellow programmers.

I tend to design software systems in advance only at the level of the domain. I find that programmers who try to come up with APIs or draw class diagrams ahead of starting to code tend to write very poor code.

I rely heavily on interactive programming for doing mundane tasks, and since most programming time is spent dealing with mundane things, I spend most of my time in the REPL. Fortunately the Common Lisp REPL is the best programming environment I have come across. Things like the Rails console don't even come close. The only comparable environment is the Unix shell.

It's also very easy to turn Common Lisp REPL code into unit tests, which I tend to do a lot. That is something that's very hard to do with object-oriented code, which is why idiotic things like dependency injection and Test-Driven Development have to be invented.

For difficult problems, it's always better to step back from the keyboard and just think about your code. Hacking around in the debugger is usually a waste of time (the only exception is when you're having data structure issues, but then technically you're in the object inspector). I mostly tend to debug with print statements.

The best productivity tip I've come across is the "Seinfeld technique" that I learned about from reading Hacker News. It involves doing something, no matter how small, on your project every single consecutive day, without any gaps or interruptions. Apparently Jerry Seinfeld uses a big calendar and tries to mark up large "runs" of consecutive days on it. That really keeps you focused.

One thing I try to focus on in my code is giving unambiguous, unique (i.e. easily grep-able) names to symbols. That makes it easy to go back to old code and quickly figure out what is going on. Common words like "status," "mapping," etc. make terrible names unless you qualify them.


2013-02-26

NLTK 1.3 - Computing with Language: Simple Statistics

Most of the remaining parts of the first chapter of the NLTK book serve as an introduction to Python in the context of text processing. I won't translate that to Lisp, because there are much better resources explaining how to use Lisp properly. First and foremost I'd refer anyone interested to the appropriate chapters of Practical Common Lisp.

It's only worth noting that Lisp has a different notion of lists than Python. Lisp's lists are linked lists, while Python's are essentially vectors. Lisp also has vectors as a separate data structure, and it also has multidimensional arrays (something Python mostly lacks). And the set of Lisp's list operations is somewhat different from Python's. The list is the default sequence data structure, but you should understand its limitations and know when to switch to vectors (when you have a lot of elements and often access them at random). Also, Lisp doesn't provide Python-style syntactic sugar for slicing and dicing lists, although all the operations are there in the form of functions. The only thing that isn't easily reproducible in Lisp is assigning to a slice:

>>> sent[1:9] = ['Second', 'Third']
>>> sent
['First', 'Second', 'Third', 'Last']

There's replace but it can't shrink a sequence:

CL-USER> (defvar sent '(1 2 3 4 5 6 7 8 9 0))
CL-USER> (replace sent '("Second" "Third") :start1 1 :end1 9)
(1 "Second" "Third" 4 5 6 7 8 9 0)

Ngrams

So, the only part worth discussing here is statistics.

Let's start with a frequency distribution. We have already used something similar in the previous part for text generation, but it was very basic and tailored to the task. Now, it's time to get into some serious language modeling and discuss a more general-purpose implementation.

Such modeling is accomplished by collecting large amounts of statistical data about the occurrences of words and their sequences in texts. These sequences are called ngrams. In a nutshell, you can think of an ngrams distribution as a table mapping ngram sequences to numbers.

(defclass ngrams ()
  ((order :initarg :order :reader ngrams-order)
   (count :reader ngrams-count)
   (max-freq :reader ngrams-max-freq)
   (min-freq :reader ngrams-min-freq)
   (total-freq :reader ngrams-total-freq)))

The crucial parameter of this class is order which defines the length of a sequence. In practice, ngrams of order from 1 to 5 may be used.

ngrams is an abstract class. In Lisp you don't have to specify this property explicitly: you just don't implement methods for it. The simplest ngrams implementation — table-ngrams — uses an in-memory hash-table as a store. You can get an ngram's frequency and "probability" (the maximum likelihood estimate) from it, as well as the log of the probability, which is used more often in calculations because it avoids the floating-point underflow and rounding errors that occur when multiplying many small probabilities.
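
Here's a minimal sketch of how these accessors might look for table-ngrams (the ngrams-table reader and the log base are my assumptions, not necessarily the actual CL-NLP code):

(defmethod freq ((ngrams table-ngrams) ngram)
  (get# ngram (ngrams-table ngrams) 0))

(defmethod prob ((ngrams table-ngrams) ngram)
  ;; maximum likelihood estimate: ngram count / total count
  (/ (freq ngrams ngram) (ngrams-total-freq ngrams)))

(defmethod logprob ((ngrams table-ngrams) ngram)
  ;; summing logs avoids the underflow of multiplying many tiny floats
  (log (prob ngrams ngram) 2))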

NLTK> (freq (text-bigrams *moby*) "The whale")
Indexing bigrams...
Number of bigrams: 116727
14
NLTK> (logprob (text-bigrams *moby*) "The whale")
-14.255587

So how do we get the bigrams of Moby Dick? For that we just have to count all of them in the text (this is a simplified version — some additional processing for sentence starts and ends is needed):

(defun index-ngrams (order words &key ignore-case)
  (make 'table-ngrams :order order
        :table
        (let ((ht (make-hash-table :test (if ignore-case 'equalp 'equal))))
          (do ((tail words (rest tail)))
               ((shorter? tail order))
            (incf (get# (if (= order 1)
                            (car tail)
                            (sub tail 0 order))
                        ht 0)))
          ht)))

table-ngrams will be useful for simple experimentation and prototyping, like we do in our NLTK examples.

NLTK> (defvar *1grams* (text-ugrams *moby*))
Indexing unigrams...
Number of unigrams: 1924
NLTK> (freq *1grams* "whale")
906
NLTK> (take 50 (vocab *1grams* :order-by '>))
("," "the" "<S>" "</S>" "." "of" "and" "-" "a" "to" ";" "in" "\"" "that" "'"
 "his" "it" "I" "!" "s" "is" "he" "with" "was" "as" "all" "for" "this" "at"
 "by" "but" "not" "him" "from" "be" "on" "?" "so" "whale" "one" "you" "had"
 "have" "there" "But" "or" "were" "now" "which" "me")

The strings "<S>" and "</S>" here denote special symbols for sentence start and end.

Here's a cumulative plot of them:

Cumulative Frequency Plot for 50 Most Frequent Words in Moby Dick

And here's just the counts graph:

Frequency Plot for 50 Most Frequent Words in Moby Dick

And, finally, here are the hapaxes:

NLTK> (take 50 (hapaxes (text-ugrams *moby*)))
("orphan" "retracing" "sheathed" "padlocks" "dirgelike" "Buoyed" "liberated"
 "Till" "Ixion" "closing" "suction" "halfspent" "THEE" "ESCAPED" "ONLY"
 "Epilogue" "thrill" "etherial" "intercept" "incommoding" "tauntingly"
 "backwardly" "coincidings" "ironical" "intermixingly" "whelmings" "inanimate"
 "animate" "lookouts" "infatuation" "Morgana" "Fata" "gaseous" "mediums"
 "bewildering" "bowstring" "mutes" "voicelessly" "THUS" "grapple"
 "unconquering" "comber" "foregone" "bullied" "uncracked" "unsurrendered"
 "Diving" "flume" "dislodged" "buttress")

The next Python feature showcased here is list comprehensions. The idea behind them is to resemble set-theoretic notation in list definitions. There's no such thing out-of-the-box in Lisp (although you can implement a variant even closer to set notation in just 24 lines), and the general approach is to favor functional-style filtering with variants of map and remove-if.

NLTK> (sort (remove-if #`(< (length %) 15)
                       (uniq (text-words *moby*)))
            'string<)
("CIRCUMNAVIGATION" "Physiognomically" "apprehensiveness" "cannibalistically" "characteristically" "circumnavigating" "circumnavigation" "circumnavigations" "comprehensiveness" "hermaphroditical" "indiscriminately" "indispensableness" "irresistibleness" "physiognomically" "preternaturalness" "responsibilities" "simultaneousness" "subterraneousness" "supernaturalness" "superstitiousness" "uncomfortableness" "uncompromisedness" "undiscriminating" "uninterpenetratingly")

NLTK> (sort (remove-if #`(or (<= (length %) 7)
                             (<= (freq (text-ugrams *chat*) %) 7))
                       (vocab (text-ugrams *chat*)))
            'string<)
("20sUser104" <... another 130 users ...> "Question" "actually" "anything" "computer" "everyone" "football" "innocent" "listening" "remember" "seriously" "something" "talkcity_adults" "thinking" "together" "watching")

In the NLTK variant all the usernames are removed from the corpus with some pre-processing.

Language Modeling

But to be useful for real-world scenarios ngrams have to be large, really large (on the order of tens of gigabytes of data for trigrams). This means that you won't be able to simply store them in memory and will have to use some external storage: a general-purpose data-store, like a relational database, or special-purpose software.

One such ngrams service that is available on the internet is Microsoft Web N-gram Services. If you have a developer token you can query it over HTTP. The service returns only log-probabilities (and also log-conditional-probabilities) and runs really slowly, but it is capable of serving batch requests, i.e. returning probabilities for several ngrams at once. The implementation of the ngrams interface for this service is provided in contrib/ms-ngrams.lisp.

We have already encountered conditional probabilities in the previous part. They have the following relationship with regular (so called, "joint") probabilities (for bigrams):

p(A,B) = p(B|A) * p(A)
where p(A,B) is the joint probability and p(B|A) is the conditional one

I.e. they can be calculated from the current ngrams plus the ngrams of the preceding order. So this operation is performed not on a single ngrams object, but on a pair of such objects. And conditional probabilities serve an important role, as we'll see below when we get to language models.
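
For bigrams the computation itself is a one-liner. A minimal sketch (strcat is from rutils; I'm assuming here that bigrams are keyed by space-joined strings, as in the freq examples above):

(defun bigram-cond-prob (bigrams unigrams w1 w2)
  "MLE estimate of p(W2|W1) = freq(W1 W2) / freq(W1)."
  (let ((context-freq (freq unigrams w1)))
    (if (zerop context-freq)
        0
        (/ (freq bigrams (strcat w1 " " w2))
           context-freq))))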

A language model is, basically, a collection of ngrams of different orders. Combining these ngrams, we're able to obtain other measures beyond a simple frequency value or probability estimate. The biggest added value of such a model is the smoothing capability it implements. The problem smoothing solves is that you'll almost never be able to have all possible ngrams in your data-store — there are just too many of them, and language users keep adding more. But it's very nasty to get a 0 probability for some ngram. A language model allows us to find a balance between the number of ngrams we have to store and the ability to get meaningful probability numbers for any ngram. This is achieved with various smoothing techniques: interpolation and discounting. Some of the smoothing methods are:

  • +1 smoothing
  • Kneser-Ney smoothing
  • and Stupid backoff

A good general compilation of various smoothing methods is assembled in this presentation.
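
To give a feel for these methods, here's roughly what the Stupid Backoff conditional score looks like (a minimal sketch assuming ngrams keyed by lists of words, unigrams included, and the accessors used elsewhere in this post; the 0.4 backoff penalty is the value suggested in the original Stupid Backoff paper):

(defun stupid-backoff-score (model word context)
  "Score WORD following the list of words CONTEXT:
   use the plain count ratio if the full ngram was seen,
   otherwise back off to a shorter context with a fixed 0.4 penalty."
  (let* ((ngram (append context (list word)))
         (f (freq (get-ngrams (length ngram) model) ngram)))
    (cond ((null context)
           ;; bottom of the recursion: unigram MLE with a floor for unseen words
           (/ (max f 1) (ngrams-total-freq (get-ngrams 1 model))))
          ((plusp f)
           (/ f (freq (get-ngrams (length context) model) context)))
          (t (* 0.4 (stupid-backoff-score model word (rest context)))))))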

Let's look at the simplified implementation of scoring a sentence with the Stupid Backoff model:

(defmethod logprob ((model language-model) (sentence list))
  (with-slots (order) model
    (let ((rez 0)
          (s (append (cons "<S>" sentence) (list "</S>"))))
      (when (shorter? s order)
        (return-from logprob (logprob (get-ngrams (length s) model) s)))
      ;; start of the sentence: p(A|<S>) * p(B|<S>,A) * ...
      (do ((i 2 (1+ i)))
          ((= i order))
        (incf rez (cond-logprob model (sub s 0 i))))
      ;; middle of the sentence
      (do ((tail s (rest tail)))
          ((shorter? tail order))
        (incf rez (cond-logprob model (sub tail 0 order))))
      rez)))

Eventually, the language model is able to return the estimated probability of any sequence of words, not limited by the maximum ngram order in it. This is usually calculated using the Markov assumption with the following formula (for a bigram language model):

p(s) = p(A) * p(B|A) * p(C|B) * p(D|C) * ... * p(Z|Y)
where s = A B ... Z

NLTK> (defvar *moby-lm2* (make-lm 'stupid-backoff-lm
                                  :1g (text-ugrams *moby*)
                                  :2g (text-bigrams *moby*)))
NLTK> (prob *moby-lm2* "This is a test sentence.")
6.139835e-20

That was, by the way, the probability of an unseen sentence with the word "sentence" completely missing from the vocabulary.

NLTK> (prob *moby-lm2* '("<S>" "Moby" "Dick" "." "</S>"))
5.084481e-9
NLTK> (float (prob (text-bigrams *moby*) '("Moby" "Dick")))
3.0310333e-4

As you see, it's much more likely to encounter the sentence "Moby Dick." in this text, although not so likely as the phrase "Moby Dick". :)

Also, such a model is able to generate random texts just like we did in the previous part. But because of the smoothing capability it's much more general, i.e. it can generate sequences containing any word from the vocabulary, even phrases unseen before. At the same time it's much more computationally expensive, because now generating each new word takes O(vocabulary size), while previously it was O(average number of words following any particular word).
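
A rough sketch of what generating a single word now involves (cond-prob is assumed to be the probability counterpart of the cond-logprob used in the scoring code above):

(defun sample-next-word (model context)
  ;; every word in the vocabulary has to be scored against the current
  ;; context, hence the O(vocabulary size) cost per generated word
  (let ((r (random 1.0))
        (total 0))
    (dolist (word (vocab (get-ngrams 1 model)))
      (when (> (incf total (cond-prob model (append context (list word)))) r)
        (return word)))))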

NLTK> (princ (generate *genesis* :order 2 :n 93))
burial to judged eaten sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung foreign longed them ought up Temanites Aran earth earth blessings surface surface surface surface surface surface surface surface surface floated Darkness Now homage earth Now In princes said vengeance It passed said divide In beginning earth Asenath said re The peaceful kind Calah said blameless mistress Chaldees said hunter said middle surface surface surface surface yonder earth rib said said smoking smoking smoking

And, as you see, this example totally doesn't resemble the one in the previous part. Is this a bug? No, it's just a trick played on us because we aren't following basic math principles. In the Stupid Backoff model the probabilities don't add up to 1, and the conditional probability of an unseen ngram may be larger than the largest probability of any recorded one! This is why we end up producing sequences of repeated words. The problem is much less obvious for the trigram model, although the text remains complete gibberish.

 NLTK> (princ (generate *genesis* :order 3 :n 93))
 brink time wagons fourth Besides south darkness listen foreigner Stay possessor lentils backwards be call dignity Kenizzites tar witness strained Yes appear colts bodies Reuel burn inheritance Galeed Hadar money touches conceal mighty foreigner spices Set pit straw son hurry yoke numbered gutters Dedan honest drove Magdiel Nod life assembly your Massa iniquity Tola still fifteen ascending wilderness everywhere shepherd harm bore Elah Jebusites Assyria butler Euphrates sinners gave Nephilim Stay garments find lifted communing closed Ir lights doing weeping shortly disobedience possessions drank peoples fifteen bless talked songs lamb far Shaveh heavens

This example shows us at least two things:

  • we should always check that mathematical properties of our models still hold as we tweak them
  • although the major use-case for a language model is scoring, you can get a feel for how well it will perform by looking at the texts it generates

Finding collocations

This is another interesting and useful NLP problem with a very elegant baseline solution, which is explained in this article. Hopefully, we'll get back to it in more detail in future chapters. For now, here are the results of implementing the algorithm from the article:

NLTK> (collocations *inaugural*)
(("United" "States") ("fellow" "citizens") ("four" "years") ("years" "ago")
 ("Federal" "Government") ("General" "Government") ("American" "people")
 ("Vice" "President") ("Old" "World") ("Almighty" "God") ("Fellow" "citizens")
 ("Chief" "Magistrate") ("Chief" "Justice") ("God" "bless") ("go" "forward")
 ("every" "citizen") ("Indian" "tribes") ("public" "debt") ("one" "another")
 ("foreign" "nations") ("political" "parties") ("State" "governments")
 ("National" "Government") ("United" "Nations") ("public" "money")
 ("national" "life") ("beloved" "country") ("upon" "us") ("fellow" "Americans")
 ("Western" "Hemisphere"))

I'm surprised at how similar they are to NLTK's, considering that I didn't look at their implementation. In fact, they are the same up to the difference in the list of stopwords (the dirty secret of every NLP application :) The code for the collocation extraction function can be found in core/measures.lisp.

Other uses of ngrams

Ngrams are also sometimes built over individual characters to obtain character-level language models (a quick sketch of that follows after the word-lengths example below). And here's another usage from NLTK — counting word lengths.

NLTK> (defvar *moby-lengths*
              (index-ngrams 1 (mapcar #'length (text-words *moby*))))
NLTK> (vocab *moby-lengths*)
(1 4 2 6 8 9 11 5 7 3 10 12 13 14 16 15 17 18 20)
NLTK> (ngrams-pairs *moby-lengths*)
((1 . 58368) (4 . 42273) (2 . 35726) (6 . 17111) (8 . 9966) (9 . 6428)
 (11 . 1873) (5 . 26595) (7 . 14399) (3 . 49633) (10 . 3528) (12 . 1053)
 (13 . 567) (14 . 177) (16 . 22) (15 . 70) (17 . 12) (18 . 1) (20 . 1))
NLTK> (ngrams-max-freq *moby-lengths*)
58368
NLTK> (freq *moby-lengths* 3)
49633
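
The character-level models mentioned above need no new machinery either, since index-ngrams doesn't care whether the "words" are actual words or single characters. A sketch (text-raw is an assumed reader for the text's raw string):

(defvar *moby-char-bigrams*
  (index-ngrams 2 (map 'list #'string (text-raw *moby*))))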

Final thoughts

Language modeling is really the foundation of any serious NLP work. Having access to ngrams expands your possibilities immensely, but the problem with them is that moving from a prototype to a production implementation becomes tricky due to the problems of collecting a representative data-set and then storing it efficiently. Yet, there are solutions: Google Books Ngrams and Google Web1T are examples of web-scale ngram data-sets, and there's also special-purpose software for storing large ngram corpora and obtaining language models from them; the notable examples are BerkeleyLM and KenLM.

Here's a link to some code for this chapter.

- Previous: NLTK 1.1 - Computing with Language: Texts and Words


2013-02-14

NLTK 1.1 - Computing with Language: Texts and Words

OK, let's get started with the NLTK book. Its first chapter tries to impress the reader with how simple it is to accomplish some neat things with texts using it. Actually, the underlying algorithms that allow us to achieve these results are mostly quite basic. We'll discuss them in this post, and the code for the first part of the chapter can be found in nltk/ch1-1.lisp.

Setting up texts for processing

For the purpose of this demonstration we'll need several texts which can be downloaded from NLTK data. Namely, we'll use the following 5 texts:

  • Moby Dick (can be found inside Project Gutenberg)
  • Sense and Sensibility (likewise from Project Gutenberg)
  • The Book of Genesis
  • Inaugural Address Corpus (this one comes as a collection of separate texts, that you'll need to cat together into one file)
  • NPS Chat Corpus

These texts are in nltk/data/ directory in CL-NLP.

NLTK guys have created a special Text class and have defined all the operations in this chapter as its methods. We'll employ a slightly simpler approach and implement them as ordinary functions. Yet we'll also have a special-purpose text class to cache reusable results of long-running operations, like tokenization.

NLTK> (load-nltk-texts "data/")
#<MOBY [Moby Dick by Herman... 1242986>
#<SENSE [Sense and Sensibili... 673019>
#<GENESIS In the beginning God... 188665>
...

As you've already guessed, we've just loaded all the texts. The number in the last column is each text's character count.

Now they are stored in *texts* hash-table. This is how we can access an individual text and name it for future usage:

(defparameter *sense* (get# :sense *texts*))

get# is one of the shorthand functions for operating on hash-tables defined in rutils.

Now we have a variable pointing to "Sense and Sensibility". If we examine it, this is what we'll see:

NLTK> (describe *sense*)
#<SENSE [Sense and Sensibili... 673019>
  [standard-object]
Slots with :INSTANCE allocation:
  NAME         = :SENSE
  RAW          = "[Sense and Sensibility by Jane Austen 1811]..
  WORDS        = #<unbound slot>
  CTXS         = #<unbound slot>
  TRANSITIONS  = #<unbound slot>
  DISPERSION   = #<unbound slot>

As you see, there are some unbound slots in this structure: words will hold every word in the text after tokenization, and ctxs will be a table of contexts for each word with their probabilities. By analogy, transitions will be a table of transition probabilities between words. Finally, dispersion will be a table of indices of word occurrences in the text. We'll use a lazy initialization strategy for them by defining slot-unbound CLOS methods that will be called on first access to each slot. For example, here's how words is initialized:

(defmethod slot-unbound (class (obj text) (slot (eql 'words)))
  (with-slots (raw words) obj
    (format t "~&Tokenizing text...~%")
    ;; split RAW into paragraphs, then tokenize each paragraph into words,
    ;; marking every paragraph boundary with a pilcrow sign
    ;; (<paragraph-splitter> and <word-tokenizer> stand for pre-defined
    ;;  tokenizer singletons; the exact names here are placeholders)
    (prog1 (setf words (mapcan #`(cons "¶" (tokenize <word-tokenizer> %))
                               (tokenize <paragraph-splitter> raw)))
      (format t "Number of words: ~A~%" (length words)))))

First we split the raw text into paragraphs, because we'd like to preserve paragraph information. Splitting is slightly involved, as paragraphs are separated by double newlines, while single newlines end every line in the text, and we have to distinguish the two. We insert pilcrow signs at paragraph boundaries. Then we tokenize the paragraphs into separate words (real words, punctuation marks, symbols, etc.).
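
A sketch of how such splitting might be done (not the actual CL-NLP code; re: is the cl-ppcre nickname used elsewhere in this blog):

(defun split-paragraphs (raw)
  "Split RAW text into paragraphs: blank lines separate paragraphs,
   while single newlines just wrap lines inside one."
  (mapcar #`(substitute #\Space #\Newline %)
          (remove-if #`(zerop (length (string-trim '(#\Space #\Tab) %)))
                     (re:split "\\n[ \\t]*\\n" raw))))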

NB. I consider tokenization the crucial function of the NLP toolkit, and we'll explore it in more detail in one of the future posts.

Implementing the examples

OK, now we are ready to start churning out examples from the first chapter.

The first one finds occurrences of certain words in the text. The NLTK guys perform the search on the tokenized texts. But I think it's quite OK to do it on raw strings with regexes. This has the added benefit of preserving text structure.
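
A minimal sketch of such a regex-based concordance (text-raw is an assumed accessor for the raw slot; the real version also has to deal with newlines, alignment and the :pass-newlines option):

(defun concordance* (text word &key (width 25))
  "Print every occurrence of WORD in TEXT's raw string
   with WIDTH characters of context on each side."
  (let ((raw (text-raw text)))
    (re:do-matches (start end word raw)
      (format t "~A~%" (subseq raw
                               (max 0 (- start width))
                               (min (length raw) (+ end width)))))))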

NLTK> (concordance *moby* "monstrous")
Displaying 11 of 11 matches
    former, one was of a most monstrous size. ...  This came towards
               "Touching that monstrous bulk of the whale or ork we h
                     array of monstrous clubs and spears.  Some were
 you gazed, and wondered what monstrous
 has survived the flood; most monstrous
                              monstrous fable, or still worse and mor
                       Of the Monstrous Pictures of Whales.
        In connexion with the monstrous pictures of whales, I am stro
o enter upon those still more monstrous stories of them
ave been rummaged out of this monstrous
 Whale-Bones; for Whales of a monstrous size are                     

With :pass-newlines on we can get output similar to NLTK's. Let's try one of the homework tasks:

NLTK> (concordance *genesis* "lived" :pass-newlines t)
Displaying 75 of 75 matches
t from Yahweh's presence, and lived in the land of Nod, east of
 when they were created. Adam lived one hundred thirty years, and
...

Now let's try similarity. Here we can't do without proper tokenization.

NLTK> (similar *moby* "monstrous")
Building word contexts...
Tokenizing text...
Number of words: 267803
Number of unique words: 19243
("mystifying" "subtly" "maddens" "impalpable" "modifies" "vexatious" "candid"
 "exasperate" "doleful" "delightfully" "trustworthy" "domineering" "abundant"
 "puzzled" "untoward" "contemptible" "gamesome" "reliable" "mouldy"
 "determined")
NLTK> (similar *sense* "monstrous")
Building word contexts...
Tokenizing text...
Number of words: 146926
Number of unique words: 6783
("amazingly" "vast" "heartily" "extremely" "remarkably" "great" "exceedingly"
 "sweet" "very" "so" "good" "a" "as")

We mostly get the same words as in NLTK's result, but with a slightly different ordering. It turns out that the reason for this is very simple. The function similar matches words based on the contexts where they occur. As the famous quote by John Rupert Firth goes:

You shall know a word by the company it keeps

But if we look at the context overlap between various words from our list, we'll see that the similarity relation between all these words is extremely weak: the decision is based on a match of a single context in which both words appeared in the text. In fact, all the listed words are similar to the same extent.
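
The underlying machinery can be sketched like this (names and representation are my guesses, not the actual CL-NLP code): collect a prev_next context for every occurrence of every word, then rank other words by how many contexts they share with the target.

(defun word-contexts (words)
  "Map each word to the list of its prev_next contexts."
  (let ((ctxs (make-hash-table :test 'equal)))
    (loop :for (prev cur next) :on words :while next :do
      (set# cur ctxs (cons (strcat prev "_" next) (get# cur ctxs))))
    ctxs))

(defun similar* (words word)
  "Words sharing the most contexts with WORD."
  (let* ((ctxs (word-contexts words))
         (target (get# word ctxs)))
    (sort (remove word (ht-keys ctxs) :test 'string=)
          '> :key #`(length (intersection (get# % ctxs) target
                                          :test 'string=)))))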

NLTK> (common-contexts *moby* "monstrous" "loving")
("most_and")
NLTK> (common-contexts *moby* "monstrous" "mystifying")
("most_and")
NLTK> (apply #'common-contexts *moby* (similar *moby* "monstrous"))
("most_and")

Actually, the next NLTK example is, probably, the best context overlap you can get from those texts:

NLTK> (common-contexts *sense* "monstrous" "very")
("am_glad" "is_pretty" "a_pretty" "a_lucky")

Now let's draw a dispersion plot of the words from the inaugural corpus. This task may seem difficult to approach at first, because the authors use the Python library matplotlib for drawing the graph. Fortunately, there's a language-agnostic tool to achieve similar goals, called gnuplot. There are a couple of Lisp wrapper libraries for it, and the actual code you need to write to drive it amounts to 2 lines (not counting the code to format the data for consumption). There are, actually, numerous language-agnostic tools on the Unix platform — don't forget to look for them when you have this kind of specific need :)

gnuplot dispersion graph with cl-nlp

The next problem in this part also seems pretty hard. And, in fact, it is extremely hard if framed correctly — to generate a meaningful text based on some other text. But the example solves an easier task: to generate a somewhat meaningful text. And the approach taken to solve it is a very simple one — it is the baseline method in this area and is based on Markov chains. There was even a famous Markov-chain spoof poster in Usenet times called Mark V. Shaney. Markov models have one principal parameter — order. Mark V. Shaney was an order-2 chain.

Let's try to generate something with it:

NLTK> (princ (generate *genesis* :order 2 :n 93))
Building prefix transitions index of length 2...
Number of prefixes: 14326
In the marsh grass . Behold , the male and female . Of these , and that Jacob obeyed his father . I will return to the earth was dry . God set them in the earth . Noah built an altar there , and put me in the pit ; and that which was our father ' s hand , then the handmaids and their sin is very grievous in your flesh for an everlasting covenant between God and every man to see the nakedness of his flock and keep it .

And what if we raise the order?

NLTK> (princ (generate *genesis* :order 3 :n 93))
Building prefix transitions index of length 3...
Number of prefixes: 28206
In the beginning God created the large sea creatures , and every bird , whatever moves on the earth . He stayed yet another seven days , and sent over that which he had gained in Paddan Aram . Esau saw that the interpretation was good , he said to them , they conspired against him to kill him . They took captive all their little ones , and for days and years ; and let it divide the waters from the waters . God said to the younger , and his seed

The text starts to resemble the original more and more. Also you may notice that the text will always start with "In". That's because Genesis isn't split into paragraphs, and our generation starts from paragraph beginnings, of which there's only one here.

OK, this seems to work, but with probabilities you never know for sure... ;)

Now, we're left with very simple tasks. Let's just do them:

NLTK> (length (text-words *genesis*))
44671

In the book they had a slightly different number: 44764. This is because of the different tokenization scheme. The differences can be seen in the next snippet (we have a cleaner version for this use case :)

NLTK> (take 20 (sort (remove-duplicates (text-words *genesis*) :test 'string=) 'string<))
("!" "\"" "'" "(" ")" "," "-" "." ":" ";" "?" "A" "Abel" "Abida" "Abimael"
 "Abimelech" "About" "Abraham" "Abram" "Accad")

What about the vocabulary size? Well, once again very similar to the NLTK number (2789).

NLTK> (length (remove-duplicates (text-words *genesis*) :test 'string=))
2634

Now, let's look for words:

NLTK> (count "smote" (text-words *genesis*) :test 'string=)
0

Hmm... What about some other word?

NLTK> (count "Abraham" (text-words *genesis*) :test 'string=)
134

This seems to work. What's the problem with "smote"? Turns out, there's no such word in the Genesis text: at least the examination of the text doesn't show any traces of it. Looks like we've found a bug in the book :)

(defun percentage (count total)
  (/ (* 100.0 count) total))
NLTK> (with-slots (words) *inaugural*
        (percentage (count "a" words :test 'string=) (length words)))
1.4597242
(defun lexical-diversity (text)
  (with-slots (words) text
    (/ (float (length words))
       (length (remove-duplicates words :test 'string=)))))
NLTK> (lexical-diversity *genesis*)
16.959377
NLTK> (lexical-diversity *chat*)
6.9837084

Interestingly, the results for the *chat* corpus differ from the NLTK ones, although they are calculated based on the tokens provided in the corpus, not ones extracted by our tokenization algorithms. This text is special, because it is extracted from an XML-structured document which also contains the full tokenization. To use it we swap the words in the *chat* corpus:

NLTK> (setf (text-words *chat*)
            (mapcar #'token-word
                    (flatten (corpus-text-tokens ncorp:+nps-chat-corpus+))))

But first we need to get the corpus and extract the data from it - see corpora/nps-chat.lisp for details.

And, finally, we can examine the Brown Corpus.

       Genre        |  Tokens  |  Types  |  Lexical Diversity
--------------------+----------+---------+--------------------
PRESS-REPORTAGE     |  100554  |  14393  |    7.0
PRESS-EDITORIAL     |  61604   |  9889   |    6.2
PRESS-REVIEWS       |  40704   |  8625   |    4.7
RELIGION            |  39399   |  6372   |    6.2
SKILL-AND-HOBBIES   |  82345   |  11934  |    6.9
POPULAR-LORE        |  110299  |  14502  |    7.6
BELLES-LETTRES      |  173096  |  18420  |    9.4
MISCELLANEOUS-GOVER |  70117   |  8180   |    8.6
LEARNED             |  181888  |  16858  |   10.8
FICTION-GENERAL     |  68488   |  9301   |    7.4
FICTION-MYSTERY     |  57169   |  6981   |    8.2
FICTION-SCIENCE     |  14470   |  3232   |    4.5
FICTION-ADVENTURE   |  69342   |  8873   |    7.8
FICTION-ROMANCE     |  70022   |  8451   |    8.3
HUMOR               |  21695   |  5016   |    4.3

OK, seems like we're done with this chapter. So far there was no rocket science involved, but it was interesting...

Implementation details

So, what are the interesting bits we haven't discussed?

First, let's look at a small optimization trick for calculating lexical-diversity. Our initial variant uses a library function remove-duplicates which is highly inefficient for this case.

NLTK> (time (lexical-diversity *chat*))
Evaluation took:
  9.898 seconds of real time
  9.888618 seconds of total run time (9.888618 user, 0.000000 system)
  99.91% CPU
  23,687,560,947 processor cycles
  229,392 bytes consed

What we'd like to do is something similar to the Python's version, which puts everything in a set and calculates its size. A set is easily represented with a hash-table:

(defun uniq (list &key raw case-insensitive)
  "Return only unique elements from LIST either as a new list
   or as hash-table if RAW is set. Can be CASE-INSENSITIVE."
  (let ((uniqs (make-hash-table :test (if case-insensitive 'equalp 'equal))))
    (dolist (elt list)
      (set# elt uniqs t))
    (if raw uniqs (ht-keys uniqs))))

Here's the time of the same calculation using uniq:

NLTK> (time (lexical-diversity *chat*))
Evaluation took:
  0.014 seconds of real time
  0.012001 seconds of total run time (0.012001 user, 0.000000 system)
  85.71% CPU
  33,396,336 processor cycles
  613,568 bytes consed

Roughly a 700x speed increase!

Now, let's return to text generation. It is accomplished with the following loop (a simplified version):

(loop :for i :from 1 :to length :do
  (let ((r (random 1.0))
        (total 0))
    (dotable (word prob
              (or (get# prefix transitions)
                  ;; no continuation - start anew
                  (prog1 (get# (setf prefix initial-prefix) transitions)
                    ;; add period unless one is already there
                    (unless (every #'period-char-p (car rez))
                      (push "." rez)
                      (incf i)))))
      (when (> (incf total prob) r)
        (push word rez)
        (setf prefix (cons word (butlast prefix)))
        (return)))))

On each iteration it places all possible continuations of the current prefix on a segment from 0 to 1 and generates a random number that points to one of the variants. If there's no continuation it starts anew.

The NLTK book, actually, uses a slightly more complicated model: first it builds a probability distribution on top of the transition frequencies and then generates the text from the probabilities. As of now I don't see why this is needed or whether it makes any difference in the results.

And, finally, here's how we draw the dispersion plot:

(defun plot-dispersion (words file)
  "Plot WORDS dispersion data from FILE."
  (cgn:start-gnuplot)
  (cgn:format-gnuplot "set title \"Lexical Dispersion Plot\"")
  (cgn:format-gnuplot "plot \"~A\" using 1:2:yticlabels(3) title \"\"" file))

It's just 1 line of gnuplot code, actually, but we also need to prepare the data in a tab-separated text file:

(defun dump-data (words dispersion-table)
  "Dump data from DISPERSION-TABLE for WORDS into a temporary file
   and return its name."
  (let ((filename (fmt "/tmp/~A" (gensym))))
    (with-out-file (out filename)
      (format out "0~t0~t~%")
      (doindex (i word words)
        (dolist (idx (get# word dispersion-table))
          (format out "~A~t~A~t~A~%" idx (1+ i) word)))
      (format out "0~t~A~t~%" (1+ (length words))))
    filename))

To wrap up, we've seen a demonstration of a lot of useful tools for text processing, and also discussed how they can be built. Among all of them I want to highlight the utility of the seemingly simplistic concordance, which is actually a kind of grep tool that is indispensable for exploring large texts. I even used it a couple of times to debug issues in more complex functions from this pack.

- Previous: Introduction


2013-02-02

Natural Language Meta Processing with Lisp

Recently I've started work on gathering and assembling a comprehensive suite of NLP tools for Lisp — CL-NLP. Something along the lines of OpenNLP or NLTK. There are actually quite a lot of NLP tools in Lisp accumulated over the years, but they are scattered over various libraries, internet sites and books. I'd like to have them in one place with a clean and concise API which would provide an easy starting point for anyone willing to do some NLP experiments or real work in Lisp. There are already a couple of NLP libraries, most notably langutils, but I don't find them very well structured, and their development isn't very active. So, I see real value in creating CL-NLP.

Besides, I'm currently reading the NLTK book. I thought that implementing the examples from the book in Lisp could likewise be a great introduction to NLP and to Lisp, just as the original is an introduction to Python. So I'm going to work through them using the CL-NLP toolset. I plan to cover 1 or 2 chapters per month. The goal is to implement pretty much everything meaningful, including the graphs — for them I'm going to use gnuplot driven by cgn, of which I've learned answering questions on StackOverflow. :) I'll try to implement the examples just from the description — not looking at the NLTK code — although I reckon that will sometimes be necessary if the results don't match. Also in the process I'm going to discuss different stuff re NLP, Lisp, Python, and NLTK — that's why there's "meta" in the title. :)

CL-NLP overview

So here's a brief overview of CL-NLP 0.1.0. I think the overall structure of NLTK is very natural and simple to understand and use, and CL-NLP will be structured somewhat similarly. The packages I'm currently working on include: util, core, and corpora — the foundational tools. There will also be phonetics, syntax, semantics, generation, and learning (classification and other machine learning activities), as well as, probably, others.

Each package will export a small number of generic functions for major operations that will serve as the API entry points: tokenize, lemmatize, pos-tag, parse, etc. Each of these functions will take as its first argument an instance of a specific class corresponding to some concrete algorithm: regex-tokenizer, markov-chain-generator, porter-stemmer and so on. The instance may or may not have configurable properties and/or state depending on the algorithm. So the algorithms themselves will be implemented in the generic functions' methods. This makes the addition of new algorithms for well-specified tasks, like stemming, really straightforward: define a subclass of stemmer and implement a method on stem for it. Finally, we'll have a way to easily access pre-defined instances of some frequently used algorithms. This will both provide an easy start with the library and a simple way to experiment and extend it.
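
For instance, adding a naive suffix-stripping stemmer would look something like this (a sketch: the stemmer class and the stem generic function are taken from the description above, everything else is made up for illustration):

(defclass naive-stemmer (stemmer)
  ((suffixes :initarg :suffixes :initform '("ing" "ed" "s")))
  (:documentation "Stemmer that just strips a few common suffixes."))

(defmethod stem ((stemmer naive-stemmer) (word string))
  (dolist (suffix (slot-value stemmer 'suffixes) word)
    (let ((pos (- (length word) (length suffix))))
      (when (and (plusp pos)
                 (string= word suffix :start1 pos))
        (return (subseq word 0 pos))))))

After that (stem (make 'naive-stemmer) "walking") should return "walk".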

Let's take a look at the tokenization module, which is one of the few non-vaporware ones so far. There are a couple of things I'd like to note:

  • Terminology: when we speak about tokenization we may assume that the result of this operation will be some kind of tokens. But in the NLP field a token is often thought of as a specific data structure that has other properties beyond its string value, for example, a POS tag. Also, tokenization can produce different kinds of results: letters, phonemes, words or sentences, for instance. NLTK tokenizers return strings by default, and they have some additional methods to return spans — tuples of corresponding beginning-end pairs. I support the approach that tokenization shouldn't produce opaque data structures, so as not to restrict the choice for higher-level tools. In general, I adhere to the principle that each tool in CL-NLP should produce the most raw data for its level of work. Yet I find it not very convenient to have different functions that return string tokens and the corresponding spans. This is easily amended using Lisp's multiple-values feature. So here's the definition of tokenize:
    (defgeneric tokenize (tokenizer string)
      (:documentation
       "Tokenize STRING with TOKENIZER. Outputs 2 values:
        - list of words
        - list of spans as beg-end cons pairs"))
    
  • Using a generic function allows us to define an :around method for performing pre-processing of the input text by splitting it into lines and assembling each line's tokenization.
    (defclass tokenizer ()
      ()
      (:documentation
       "Base class for tokenizers."))
    
    (defmethod tokenize :around ((tokenizer tokenizer) string)
      "Pre-split text into lines and tokenize each line separately."
      (let ((offset 0)
            words spans)
        (loop :for line :in (split-sequence #\Newline string) :do
           (mv-bind (ts ss) (call-next-method tokenizer line)
             (setf words (nconc words ts)
                   spans (nconc spans (mapcar #`(cons (+ (car %) offset)
                                                      (+ (cdr %) offset))
                                              ss)))
             (incf offset (1+ (length line)))))
        (values words
                spans)))
    
  • And here's how we'll implement pre-defined instances of the tokenizers:
    (define-lazy-singleton word-chunker
        (make 'regex-word-tokenizer :regex (re:create-scanner "[^\\s]+"))
      "Dumb word tokenizer, that will not split punctuation from words.")
    
    It is done with a special macro that defines a lazily initialized singleton and a special syntactic sugar for it. This facility is mainly intended for convenience in interactive experimentation, and not for production deployment.
    (defmacro define-lazy-singleton (name init &optional docstring)
      "Define a function NAME, that will return a singleton object,
       initialized lazily with INIT on first call.
       Also define a symbol macro <NAME> that will expand to (NAME)."
      (with-gensyms (singleton)
        `(let (,singleton)
           (defun ,name ()
             ,docstring
             (or ,singleton
                 (setf ,singleton ,init)))
           (define-symbol-macro ,(mksym name :format "<~A>") (,name)))))
    
  • Finally, a very interesting direction for development is stream tokenization. It has a whole different set of optimization compromises, and I hope to return to it in some of the future posts.

NB. I find it one of the biggest shortcomings of NLTK's design that the algorithms are implemented inside the concrete classes. There's a good saying by Joe Armstrong about this: you wanted a banana, but got a gorilla holding the banana. I think it characterizes very well the additional conceptual load that you have to incur when you look at NLTK's code with algorithms scattered over various classes. If Python supported CLOS-like decoupling of classes and methods, it would be much easier to separate the algorithms from the other stuff cleanly in NLTK. Well, in Lisp that's exactly what we're going to do.

Quickstart

To start working with CL-NLP you have to get it from my github account: vseloved/cl-nlp.

For those who are Lisp newbies: you'll also need Lisp, obviously. There are a lot of implementations of it (an "implementation" is what a program that runs Lisp code is usually called), but all of them support the same standard, so you can choose any. The ones I recommend are SBCL and CCL. We'll call them Lisp from now on. You can interact with the implementation by starting it and pasting commands at the presented prompt. Or you can use fancier tools: if you're already an Emacs user or think that you're savvy enough, I recommend getting SLIME and enjoying hacking with it. If you're on Vim, try SLIMV. Otherwise your choice is pretty limited, but take a look at ABLE — a very simple program for interacting with Lisp with some useful features.

Also get quicklisp to make your work with 3rd-party libraries seamless and comfortable.

To load CL-NLP in your Lisp do the following:

(push "path/to/cl-nlp/" asdf:*central-registry*)
(ql:quickload "cl-nlp")

Here, "path/to/cl-nlp/" should point to a directory where you've downloaded the project, and the path should end with a slash. We use ql:quickload to load the project with quicklisp which will take care of loading its few dependencies.

And now if everything worked fine we're ready to start exploring the NLTK book!


2013-01-10

Common Lisp is just Lisp

There's this perpetual confusion about Lisp: everyone will call a language with prefix notation and parentheses a Lisp or a Lisp dialect. This is one of the popular urban legends about Lisp, which I'd like to dispel.

So, let's look at the language familiar to most programmers: C. Would anyone call C++ a C? Maybe, but only someone very inexperienced in these languages. On the contrary, many experienced people will often say: I would only program in C, but not in C++, because it's such a mess. Let's go further: would anyone call Java a C? I very much doubt it. What unites those languages is a common syntax: braces, basic keywords and operators, and more. A common root. And although the gap between C and Java is pretty big, it can be said that they also have somewhat similar semantics. And they are often used in similar scenarios. That's why all these languages are attributed to the C family, as is Objective-C. Indeed, it would be much easier for a C programmer to switch to C++ or Java than to, say, Lua or Haskell.

Now, the difference between the 3 popular Lisp-family languages (Common Lisp, Scheme and Clojure) is just as big as between C, C++ and Java. Conceptually and syntactically, Common Lisp is as far (or even further) from Clojure as C is from Java.

The language is not just surface syntax — it's about syntax, semantics and pragmatics. And about community. The Lisp community has split at several points in time, and the main offshoots were Scheme and Clojure. It's not a coincidence that these languages don't have Lisp in their names. And, most importantly, each language's community cares about different things.

So, when someone asks you again "Which Lisp?", the answer is very simple: for good or for worse, the only Lisp currently is Common Lisp. It's like, say, the C99 standard of C. There were other standards and variants, like ZetaLisp or ISLISP, but they aren't in use now. Every other Lisp-like language isn't a Lisp, it's just similar to some extent.
