2013-02-02

Natural Language Meta Processing with Lisp

Recently I've started work on gathering and assembling a comprehensive suite of NLP tools for Lisp — CL-NLP. Something along the lines of OpenNLP or NLTK. There are actually quite a lot of NLP tools in Lisp accumulated over the years, but they are scattered across various libraries, internet sites, and books. I'd like to have them in one place with a clean and concise API, which would provide an easy starting point for anyone willing to do some NLP experiments or real work in Lisp. There are already a couple of NLP libraries, most notably langutils, but I don't find them very well structured, and their development isn't very active either. So I see real value in creating CL-NLP.

Besides, I'm currently reading the NLTK book. I thought that implementing the examples from the book in Lisp could be as good an introduction to NLP and to Lisp as it is to Python. So I'm going to work through them using the CL-NLP toolset. I plan to cover 1 or 2 chapters per month. The goal is to implement pretty much everything meaningful, including the graphs — for them I'm going to use gnuplot driven by cgn, which I learned about while answering questions on StackOverflow. :) I'll try to implement the examples from the description alone — without looking at NLTK's code — although I reckon that will sometimes be necessary if the results don't match. Also, in the process I'm going to discuss various topics related to NLP, Lisp, Python, and NLTK — that's why there's "meta" in the title. :)

CL-NLP overview

So here's a brief overview of CL-NLP 0.1.0. I think the overall structure of NLTK is very natural and simple to understand and use, and CL-NLP will be structured somewhat similarly. The packages I'm currently working on include util, core, and corpora — the foundational tools. There will also be phonetics, syntax, semantics, generation, and learning (classification and other machine learning activities), as well as, probably, others.

Each package will export a small number of generic functions for the major operations, which will serve as the API entry points: tokenize, lemmatize, pos-tag, parse, etc. Each of these functions will take as its first argument an instance of a specific class corresponding to some concrete algorithm: regex-tokenizer, markov-chain-generator, porter-stemmer, and so on. The instance may or may not have configurable properties and/or state, depending on the algorithm. So the algorithms themselves will be implemented in the generic functions' methods. This makes adding new algorithms for well-specified tasks, like stemming, really straightforward: define a subclass of stemmer and implement a method on stem for it (a sketch follows below). Finally, we'll have a way to easily access pre-defined instances of some frequently used algorithms. This will provide both an easy start with the library and a simple way to experiment with and extend it.
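For illustration, here's a minimal sketch of what plugging in a new algorithm might look like. The stemmer class and the stem generic function are the ones described above; the s-stemmer class and its naive "drop the trailing s" rule are invented just for this example:

(defclass s-stemmer (stemmer)
  ()
  (:documentation
   "Toy stemmer that just drops a single trailing s."))

(defmethod stem ((stemmer s-stemmer) word)
  "Return WORD without its final s, if it has one."
  (if (and (> (length word) 1)
           (char= #\s (char word (1- (length word)))))
      (subseq word 0 (1- (length word)))
      word))

No existing code has to change: once the method is defined, the new algorithm works with everything written against the stem entry point.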

Let's take a look at the tokenization module, which is one of the few non-vaporware ones so far. There are a couple of things I'd like to note:

  • Terminology: when we speak about tokenization, we may assume that the result of this operation will be some kind of tokens. But in the NLP field a token is often thought of as a specific data structure that has some other properties beyond its string value, for example, a POS tag. Also, tokenization can produce different kinds of results: letters, phonemes, words, or sentences, for instance. NLTK tokenizers return strings by default, and they have additional methods to return spans — tuples of the corresponding beginning-end pairs. I support the approach that tokenization shouldn't produce opaque data structures, so as not to restrict the choice for higher-level tools. In general, I adhere to the principle that each tool in CL-NLP should produce the rawest data for its level of work. Yet I don't find it very convenient to have separate functions that return string tokens and the corresponding spans. This is easily amended using Lisp's multiple-values feature (there's an example session after this list). So here's the definition of tokenize:
    (defgeneric tokenize (tokenizer string)
      (:documentation
       "Tokenize STRING with TOKENIZER. Outputs 2 values:
        - list of words
        - list of spans as beg-end cons pairs"))
    
  • Using a generic function allows us to define an :around method that performs pre-processing of the input text: it splits the text into lines, tokenizes each line separately, and assembles the results.
    (defclass tokenizer ()
      ()
      (:documentation
       "Base class for tokenizers."))
    
    (defmethod tokenize :around ((tokenizer tokenizer) string)
      "Pre-split text into lines and tokenize each line separately."
      (let ((offset 0)
            words spans)
        (loop :for line :in (split-sequence #\Newline string) :do
           (mv-bind (ts ss) (call-next-method tokenizer line)
             (setf words (nconc words ts)
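                    ;; shift this line's spans by its offset in the whole string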
                   spans (nconc spans (mapcar #`(cons (+ (car %) offset)
                                                      (+ (cdr %) offset))
                                              ss)))
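              ;; +1 accounts for the newline consumed by the split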
             (incf offset (1+ (length line)))))
        (values words
                spans)))
    
  • And here's how we'll implement pre-defined instances of the tokenizers:
    (define-lazy-singleton word-chunker
        (make 'regex-word-tokenizer :regex (re:create-scanner "[^\\s]+"))
      "Dumb word tokenizer, that will not split punctuation from words.")
    
    This is done with a special macro that defines a lazily initialized singleton and some syntactic sugar for it. The facility is mainly intended for convenience in interactive experimentation, not for production deployment.
    (defmacro define-lazy-singleton (name init &optional docstring)
      "Define a function NAME, that will return a singleton object,
       initialized lazily with INIT on first call.
       Also define a symbol macro  that will expand to (NAME)."
      (with-gensyms (singleton)
        `(let (,singleton)
           (defun ,name ()
             ,docstring
             (or ,singleton
                 (setf ,singleton ,init)))
           (define-symbol-macro ,(mksym name :format "<~A>") (,name)))))
    
  • Finally, a very interesting direction for development is stream tokenization. It involves a whole different set of optimization trade-offs, and I hope to return to it in one of the future posts.
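To give a feel for how these pieces fit together, here's what an interactive session might look like (the NLP> prompt is hypothetical, the output is illustrative, and the regex-word-tokenizer method is assumed to be implemented as described above):

NLP> (tokenize <word-chunker> "One two.
Three four.")
("One" "two." "Three" "four.")
((0 . 3) (4 . 8) (9 . 14) (15 . 20))

Note that the spans of the second line are shifted by the length of the first line plus the newline; that's the :around method at work. Also, <word-chunker> is just the symbol macro defined by define-lazy-singleton, so writing (word-chunker) instead works exactly the same.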

NB. I find it one of the biggest shortcomings of NLTK's design that the algorithms are implemented inside the concrete classes. There's a good saying by Joe Armstrong about this: you wanted to have a banana, but got a gorilla holding a banana. I think it characterizes very well the additional conceptual load you have to bear when you look at NLTK's code, with algorithms scattered across various classes. If Python supported CLOS-like decoupling of classes and methods, it would be much easier to separate the algorithms from the other stuff in NLTK much more cleanly. Well, in Lisp that's exactly what we're going to do.

Quickstart

To start working with CL-NLP you have to get it from my GitHub account: vseloved/cl-nlp.

For those who are Lisp newbies: you'll also need Lisp itself, obviously. There are a lot of implementations of it (this is what a program that runs Lisp code is usually called), but all of them support the same standard, so you can choose any. The ones I recommend are SBCL and CCL. We'll call them Lisp from now on. You can interact with the implementation by starting it and typing commands at the prompt it presents, or you can use fancier tools: if you're already an Emacs user, or think that you're savvy enough, I recommend getting SLIME and enjoying hacking with it. If you're on Vim, try SLIMV. Otherwise your choice is pretty limited, but take a look at ABLE — a very simple program for interacting with Lisp with some useful features.

Also get Quicklisp to make your work with third-party libraries seamless and comfortable.
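If you don't have Quicklisp yet, its installation boils down to the following (download quicklisp.lisp from quicklisp.org first; see that site for the full instructions):

(load "quicklisp.lisp")
(quicklisp-quickstart:install)
(ql:add-to-init-file)  ; optional: load Quicklisp automatically on startup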

To load CL-NLP in your Lisp do the following:

(push "path/to/cl-nlp/" asdf:*central-registry*)
(ql:quickload "cl-nlp")

Here, "path/to/cl-nlp/" should point to a directory where you've downloaded the project, and the path should end with a slash. We use ql:quickload to load the project with quicklisp which will take care of loading its few dependencies.

And now, if everything worked fine, we're ready to start exploring the NLTK book!
