2013-02-26

NLTK 1.3 - Computing with Language: Simple Statistics

Most of the remaining parts of the first chapter of the NLTK book serve as an introduction to Python in the context of text processing. I won't translate that to Lisp, because there are much better resources explaining how to use Lisp properly. First and foremost I'd refer anyone interested to the appropriate chapters of Practical Common Lisp.

It's only worth noting that Lisp has a different notion of lists than Python. Lisp's lists are linked lists, while Python's are essentially vectors. Lisp also has vectors as a separate data structure, as well as multidimensional arrays (something Python mostly lacks). And the set of Lisp's list operations is somewhat different from Python's. The list is the default sequence data structure, but you should understand its limitations and know when to switch to vectors (when you have a lot of elements and often access them at random). Also, Lisp doesn't provide Python-style syntactic sugar for slicing and dicing lists, although all the operations are there in the form of functions. The only thing which isn't easily reproducible in Lisp is assigning to a slice:

>>> sent[1:9] = ['Second', 'Third']
>>> sent
['First', 'Second', 'Third', 'Last']

There's replace but it can't shrink a sequence:

CL-USER> (defvar sent '(1 2 3 4 5 6 7 8 9 0))
CL-USER> (replace sent '("Second" "Third") :start1 1 :end1 9)
(1 "Second" "Third" 4 5 6 7 8 9 0)

Ngrams

So, the only part worth discussing here is statistics.

Let's start with a frequency distribution. We have already used something similar in the previous part for text generation, but it was very basic and tailored to the task. Now, it's time to get into some serious language modeling and discuss a more general-purpose implementation.

Such modeling is accomplished by collecting large amounts of statistical data about words and the appearances of their sequences in texts. These sequences are called ngrams. In a nutshell, you can think of an ngrams distribution as a table mapping ngram sequences to numbers.

(defclass ngrams ()
  ((order :initarg :order :reader ngrams-order)
   (count :reader ngrams-count)
   (max-freq :reader ngrams-max-freq)
   (min-freq :reader ngrams-min-freq)
   (total-freq :reader ngrams-total-freq)))

The crucial parameter of this class is order, which defines the length of a sequence. In practice, ngrams of orders from 1 to 5 are used.

ngrams is an abstract class. In Lisp you don't have to specify this property explicitly; you just don't implement methods for it. The simplest ngrams implementation — table-ngrams — uses an in-memory hash-table as a store. You can get an ngram's frequency and "probability" (the maximum likelihood estimate) from it, as well as the log of the probability, which is used more often in calculations because it avoids the floating-point rounding errors that occur when multiplying probabilities, which are rather small values.

NLTK> (freq (text-bigrams *moby*) "The whale")
Indexing bigrams...
Number of bigrams: 116727
14
NLTK> (logprob (text-bigrams *moby*) "The whale")
-14.255587
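
For table-ngrams these accessors boil down to hash-table lookups and a division. Here is a rough sketch of how they might look (the method bodies and the ngrams-table accessor are assumptions based on the interface used in this post, not the actual CL-NLP code, and the normalization between string and word-list keys is omitted):

(defmethod freq ((ngrams table-ngrams) ngram)
  (gethash ngram (ngrams-table ngrams) 0))

(defmethod prob ((ngrams table-ngrams) ngram)
  ;; maximum likelihood estimate: ngram count over the total count
  (/ (freq ngrams ngram)
     (ngrams-total-freq ngrams)))

(defmethod logprob ((ngrams table-ngrams) ngram)
  ;; assuming base-2 logs here
  (log (prob ngrams ngram) 2))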

So how do we get the bigrams of Moby Dick? For that we just have to count all of them in the text (this is a simplified version — some additional processing for sentence starts/ends is needed):

(defun index-ngrams (order words &key ignore-case)
  (make 'table-ngrams :order order
        :table
        (let ((ht (make-hash-table :test (if ignore-case 'equalp 'equal))))
          (do ((tail words (rest tail)))
               ((shorter? tail order))
            (incf (get# (if (= order 1)
                            (car tail)
                            (sub tail 0 order))
                        ht 0)))
          ht)))

table-ngrams will be useful for simple experimentation and prototyping, like we do in our NLTK examples.

NLTK> (defvar *1grams* (text-ugrams *moby*))
Indexing unigrams...
Number of unigrams: 1924
NLTK> (freq *1grams* "whale")
906
NLTK> (take 50 (vocab *1grams* :order-by '>))
("," "the" "<S>" "</S>" "." "of" "and" "-" "a" "to" ";" "in" "\"" "that" "'"
 "his" "it" "I" "!" "s" "is" "he" "with" "was" "as" "all" "for" "this" "at"
 "by" "but" "not" "him" "from" "be" "on" "?" "so" "whale" "one" "you" "had"
 "have" "there" "But" "or" "were" "now" "which" "me")

The strings "<S>" and "</S>" here denote special symbols for sentence start and end.

Here's a cumulative plot of them:

Cumulative Frequency Plot for 50 Most Frequent Words in Moby Dick

And here's just the counts graph:

Frequency Plot for 50 Most Frequent Words in Moby Dick

And, finally, here are the hapaxes:

NLTK> (take 50 (hapaxes (text-ugrams *moby*)))
("orphan" "retracing" "sheathed" "padlocks" "dirgelike" "Buoyed" "liberated"
 "Till" "Ixion" "closing" "suction" "halfspent" "THEE" "ESCAPED" "ONLY"
 "Epilogue" "thrill" "etherial" "intercept" "incommoding" "tauntingly"
 "backwardly" "coincidings" "ironical" "intermixingly" "whelmings" "inanimate"
 "animate" "lookouts" "infatuation" "Morgana" "Fata" "gaseous" "mediums"
 "bewildering" "bowstring" "mutes" "voicelessly" "THUS" "grapple"
 "unconquering" "comber" "foregone" "bullied" "uncracked" "unsurrendered"
 "Diving" "flume" "dislodged" "buttress")

The next Python feature showcased here is list comprehensions. The idea behind them is to resemble set-theoretic notation in list definitions. There's no such thing out of the box in Lisp (although you can implement a variant even closer to set notation in just 24 lines), and the general approach is to favor functional-style filtering with variants of map and remove-if.

NLTK> (sort (remove-if #`(< (length %) 15)
                       (uniq (text-words *moby*)))
            'string<)
("CIRCUMNAVIGATION" "Physiognomically" "apprehensiveness" "cannibalistically" "characteristically" "circumnavigating" "circumnavigation" "circumnavigations" "comprehensiveness" "hermaphroditical" "indiscriminately" "indispensableness" "irresistibleness" "physiognomically" "preternaturalness" "responsibilities" "simultaneousness" "subterraneousness" "supernaturalness" "superstitiousness" "uncomfortableness" "uncompromisedness" "undiscriminating" "uninterpenetratingly")

NLTK> (sort (remove-if #`(or (<= (length %) 7)
                             (<= (freq (text-ugrams *chat*) %) 7))
                       (vocab (text-ugrams *chat*)))
            'string<)
("20sUser104" <... another 130 users ...> "Question" "actually" "anything" "computer" "everyone" "football" "innocent" "listening" "remember" "seriously" "something" "talkcity_adults" "thinking" "together" "watching")

In the NLTK variant all user names are removed from the corpus by some pre-processing.

Language Modeling

But to be useful in real-world scenarios ngrams have to be large, really large (on the order of tens of gigabytes of data for trigrams). This means that you won't be able to simply store them in memory and will have to use some external storage: a general-purpose data store, like a relational database, or special-purpose software.

One such ngrams service available on the internet is Microsoft Web N-gram Services. If you have a developer token you can query it over HTTP. The service only returns log-probabilities (and also log-conditional-probabilities) and is really slow, but it is capable of serving batch requests, i.e. returning probabilities for several ngrams at once. The implementation of the ngrams interface for this service is provided in contrib/ms-ngrams.lisp.

We have already encountered conditional probabilities in the previous part. They have the following relationship with regular (so-called "joint") probabilities (here for bigrams):

p(A,B) = p(B|A) * p(A)
where p(A,B) is the joint probability and p(B|A) the conditional one
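
In code, that relationship between a pair of ngrams objects of adjacent orders could be expressed roughly like this (a sketch with an assumed cond-prob name, not the library's implementation):

(defmethod cond-prob ((current table-ngrams) (preceding table-ngrams) ngram)
  ;; p(B|A) = p(A,B) / p(A) = freq(A B) / freq(A)
  (/ (freq current ngram)
     (freq preceding (butlast ngram))))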

I.e. they can be calculated from the current ngrams plus the ngrams of the preceding order, which is what the sketch above does. So this operation is performed not on a single ngrams object, but on a pair of such objects. Conditional probabilities also serve an important role that we'll see below. But first we need to talk about language models.

A language model is, basically, a collection of ngrams of different orders. By combining these ngrams we're able to obtain other measures beyond a simple frequency value or probability estimate. The biggest added value of such a model is the smoothing capability it implements. The problem smoothing solves is that you'll almost never be able to have all possible ngrams in your data store — there are just too many of them, and language users keep adding more. But it's very nasty to get a 0 probability for some ngram. A language model allows finding a balance between the number of ngrams you have to store and the ability to get meaningful probability numbers for any ngram. This is achieved with various smoothing techniques: interpolation and discounting. Some of the smoothing methods are:

  • +1 smoothing
  • Kneser-Ney smoothing
  • and Stupid backoff

A good general compilation of various smoothing methods is assembled in this presentation.

Let's look at a simplified implementation of scoring a sentence with the Stupid Backoff model:

(defmethod logprob ((model language-model) (sentence list))
  (with-slots (order) model
    (let ((rez 0)
          (s (append (cons "<S>" sentence) (list "</S>"))))
      (when (shorter? s order)
        (return-from logprob (logprob (get-ngrams (length s) model) s)))
      ;; start of the sentence: p(A|<S>) * p(B|<S>,A) * ...
      (do ((i 2 (1+ i)))
          ((= i order))
        (incf rez (cond-logprob model (sub s 0 i))))
      ;; middle of the sentence
      (do ((tail s (rest tail)))
          ((shorter? tail order))
        (incf rez (cond-logprob model (sub tail 0 order))))
      rez)))
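
The cond-logprob it relies on isn't shown above. Under Stupid Backoff it would behave roughly as in the following sketch (this is not the actual CL-NLP code: the accessor names are assumed, and 0.4 is the backoff factor suggested by Brants et al.):

(defmethod cond-logprob ((model stupid-backoff-lm) (ngram list))
  (let* ((order (length ngram))
         (count (freq (get-ngrams order model) ngram)))
    (cond
      ;; base case: plain unigram log-probability
      ((= order 1)
       (logprob (get-ngrams 1 model) (first ngram)))
      ;; seen ngram: log of count(w1..wn) / count(w1..wn-1)
      ((plusp count)
       (log (/ count (freq (get-ngrams (1- order) model) (butlast ngram))) 2))
      ;; unseen ngram: back off to the shorter context with a constant penalty
      (t (+ (log 0.4 2)
            (cond-logprob model (rest ngram)))))))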

Eventually, the language model is able to return the estimated probability of any sequence of words, not limited by the maximum order of ngrams in it. This is usually calculated using the Markov assumption with the following formula (here for a bigram language model):

p(s) = p(A) * p(B|A) * p(C|B) * ... * p(Z|Y)
where s = A B ... Z

NLTK> (defvar *moby-lm2* (make-lm 'stupid-backoff-lm
                                  :1g (text-ugrams *moby*)
                                  :2g (text-bigrams *moby*)))
NLTK> (prob *moby-lm2* "This is a test sentence.")
6.139835e-20

That was, by the way, the probability of an unseen sentence, with the word "sentence" completely missing from the vocabulary.

NLTK> (prob *moby-lm2* '("<S>" "Moby" "Dick" "." "</S>"))
5.084481e-9
NLTK> (float (prob (text-bigrams *moby*) '("Moby" "Dick")))
3.0310333e-4

As you can see, the sentence "Moby Dick." is much more likely to be encountered in this text than our test sentence, although not as likely as the phrase "Moby Dick". :)

Such a model is also able to generate random texts, just like we did in the previous part. But because of its smoothing capability it's much more general, i.e. it can generate sequences containing any word from the vocabulary, even in phrases never seen before. At the same time it's much more computationally expensive, because generating each new word now takes O(vocabulary size), while before it was O(average number of words following any particular word).
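
A sketch of that O(vocabulary size) generation step might look as follows (cond-prob here stands for a probability counterpart of the cond-logprob used earlier; all the names are assumptions, not the actual CL-NLP code):

(defun sample-next (model context vocab)
  "Score every vocabulary word as a continuation of CONTEXT and sample
   proportionally to the scores."
  (let* ((scores (mapcar (lambda (word)
                           (cond-prob model (append context (list word))))
                         vocab))
         (r (random (float (reduce #'+ scores))))
         (acc 0))
    (loop :for word :in vocab
          :for score :in scores
          :do (when (> (incf acc score) r)
                (return word))
          :finally (return (car (last vocab))))))

Note that with Stupid Backoff these scores don't form a true probability distribution, which is exactly the quirk discussed below.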

NLTK> (princ (generate *genesis* :order 2 :n 93))
burial to judged eaten sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung foreign longed them ought up Temanites Aran earth earth blessings surface surface surface surface surface surface surface surface surface floated Darkness Now homage earth Now In princes said vengeance It passed said divide In beginning earth Asenath said re The peaceful kind Calah said blameless mistress Chaldees said hunter said middle surface surface surface surface yonder earth rib said said smoking smoking smoking

And, as you can see, this example doesn't resemble the one from the previous part at all. Is this a bug? No, just a trick played on us because we aren't following basic mathematical principles. In the Stupid Backoff model the probabilities don't add up to 1, and the conditional probability of an unseen ngram may be larger than the largest probability of any recorded one! This is why we end up producing sequences of repeated words. The problem is much less obvious for the trigram model, although the text remains complete gibberish.

 NLTK> (princ (generate *genesis* :order 3 :n 93))
 brink time wagons fourth Besides south darkness listen foreigner Stay possessor lentils backwards be call dignity Kenizzites tar witness strained Yes appear colts bodies Reuel burn inheritance Galeed Hadar money touches conceal mighty foreigner spices Set pit straw son hurry yoke numbered gutters Dedan honest drove Magdiel Nod life assembly your Massa iniquity Tola still fifteen ascending wilderness everywhere shepherd harm bore Elah Jebusites Assyria butler Euphrates sinners gave Nephilim Stay garments find lifted communing closed Ir lights doing weeping shortly disobedience possessions drank peoples fifteen bless talked songs lamb far Shaveh heavens

What this example shows us are at least two things:

  • we should always check that mathematical properties of our models still hold as we tweak them
  • although the major use-case for language model is scoring, you can get a feel of how good it will perform by looking at the texts it generates

Finding collocations

This is another interesting and useful NLP problem with a very elegant baseline solution, which is explained in this article. Hopefully, we'll get back to it in more detail in future chapters. For now, here are the results of implementing the algorithm from the article:

NLTK> (collocations *inaugural*)
(("United" "States") ("fellow" "citizens") ("four" "years") ("years" "ago")
 ("Federal" "Government") ("General" "Government") ("American" "people")
 ("Vice" "President") ("Old" "World") ("Almighty" "God") ("Fellow" "citizens")
 ("Chief" "Magistrate") ("Chief" "Justice") ("God" "bless") ("go" "forward")
 ("every" "citizen") ("Indian" "tribes") ("public" "debt") ("one" "another")
 ("foreign" "nations") ("political" "parties") ("State" "governments")
 ("National" "Government") ("United" "Nations") ("public" "money")
 ("national" "life") ("beloved" "country") ("upon" "us") ("fellow" "Americans")
 ("Western" "Hemisphere"))

I'm surprised at how similar they are to NLTK's, considering that I didn't look at their implementation. In fact, they are the same up to the difference in the list of stopwords (the dirty secret of every NLP application :) The code for the collocation extraction function can be found in core/measures.lisp.

Other uses of ngrams

Ngrams are also sometimes used over individual characters to build character-level language models. And here's another usage from NLTK — counting word lengths.

NLTK> (defvar *moby-lengths*
              (index-ngrams 1 (mapcar #'length (text-words *moby*))))
NLTK> (vocab *moby-lengths*)
(1 4 2 6 8 9 11 5 7 3 10 12 13 14 16 15 17 18 20)
NLTK> (ngrams-pairs *moby-lengths*)
((1 . 58368) (4 . 42273) (2 . 35726) (6 . 17111) (8 . 9966) (9 . 6428)
 (11 . 1873) (5 . 26595) (7 . 14399) (3 . 49633) (10 . 3528) (12 . 1053)
 (13 . 567) (14 . 177) (16 . 22) (15 . 70) (17 . 12) (18 . 1) (20 . 1))
NLTK> (ngrams-max-freq *moby-lengths*)
58368
NLTK> (freq *moby-lengths* 3)
49633

Final thoughts

Language modeling is really the foundation of any serious NLP work. Having access to ngrams expands your possibilities immensely, but the problem with them is that moving from a prototype to a production implementation becomes tricky due to the difficulty of collecting a representative data set and, consequently, storing it efficiently. Yet there are solutions: Google Books Ngrams and Google Web1T are examples of web-scale ngram data sets, and there's also special-purpose software for storing large ngram corpora and obtaining language models from them. The notable examples are BerkeleyLM and KenLM.

Here's a link to some code for this chapter.

- Previous: NLTK 1.1 - Computing with Language: Texts and Words


2013-02-14

NLTK 1.1 - Computing with Language: Texts and Words

OK, let's get started with the NLTK book. Its first chapter tries to impress the reader with how simple it is to accomplish some neat things with texts using it. Actually, the underlying algorithms that achieve these results are mostly quite basic. We'll discuss them in this post, and the code for the first part of the chapter can be found in nltk/ch1-1.lisp.

Setting up texts for processing

For the purpose of this demonstration we'll need several texts which can be downloaded from NLTK data. Namely, we'll use the following 5 texts:

  • Moby Dick (can be found inside Project Gutenberg)
  • Sense and Sensibility (likewise from Project Gutenberg)
  • The Book of Genesis
  • Inaugural Address Corpus (this one comes as a collection of separate texts, that you'll need to cat together into one file)
  • NPS Chat Corpus

These texts are in nltk/data/ directory in CL-NLP.

The NLTK guys have created a special Text class and have defined all the operations in this chapter as its methods. We'll employ a slightly simpler approach and implement them as ordinary functions. Yet we'll also have a special-purpose text class to cache reusable results of long-running operations, like tokenization.

NLTK> (load-nltk-texts "data/")
#<MOBY [Moby Dick by Herman... 1242986>
#<SENSE [Sense and Sensibili... 673019>
#<GENESIS In the beginning God... 188665>
...

As you've already guessed, we've just loaded all the texts. The number in the last column is each text's character count.

Now they are stored in *texts* hash-table. This is how we can access an individual text and name it for future usage:

(defparameter *sense* (get# :sense *texts*))

get# is one of the shorthand functions for operating on hash-tables defined in rutils.

Now we have a variable pointing to "Sense and Sensibility". If we examine it, this is what we'll see:

NLTK> (describe *sense*)
#<SENSE [Sense and Sensibili... 673019>
  [standard-object]
Slots with :INSTANCE allocation:
  NAME         = :SENSE
  RAW          = "[Sense and Sensibility by Jane Austen 1811]..
  WORDS        = #<unbound slot>
  CTXS         = #<unbound slot>
  TRANSITIONS  = #<unbound slot>
  DISPERSION   = #<unbound slot>

As you see, there are some unbound slots in this structure: words will hold every word in the text after tokenization, and ctxs will be a table of contexts for each word with their probabilities. By analogy, transitions will be a table of transition probabilities between words. Finally, dispersion will be a table of indices of word occurrences in the text. We'll use a lazy initialization strategy for them by defining slot-unbound CLOS methods that will be called on first access to each slot. For example, here's how words is initialized:

(defmethod slot-unbound (class (obj text) (slot (eql 'words)))
  (with-slots (raw words) obj
    (format t "~&Tokenizing text...~%")
    ;; <paragraph-splitter> and <word-tokenizer> stand for pre-defined
    ;; tokenizer singletons (the exact names are assumed here)
    (prog1 (setf words (mapcan #`(cons "¶" (tokenize <word-tokenizer> %))
                               (tokenize <paragraph-splitter> raw)))
      (format t "Number of words: ~A~%" (length words)))))

First we split the raw text into paragraphs, because we'd like to preserve paragraph information. The splitting is slightly involved, since paragraphs are separated by double newlines, while single newlines end every line in the text, and we have to distinguish the two. We insert pilcrow signs at paragraph boundaries. Then we tokenize the paragraphs into separate words (real words, punctuation marks, symbols, etc.).
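
As a rough sketch of that splitting step (using cl-ppcre directly; this is not the exact CL-NLP code):

(defun split-paragraphs (raw)
  "Split RAW text into paragraphs on blank lines, i.e. two or more newlines."
  (cl-ppcre:split "\\n\\s*\\n" raw))

(defun rough-words (raw)
  "Paragraph-split RAW, then whitespace-tokenize each paragraph,
   marking paragraph boundaries with a pilcrow."
  (mapcan (lambda (par)
            (cons "¶" (cl-ppcre:split "\\s+" par)))
          (split-paragraphs raw)))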

NB. I consider tokenization the crucial function of the NLP toolkit, and we'll explore it in more detail in one of the future posts.

Implementing the examples

OK, now we are ready to start churning out examples from the first chapter.

The first one finds occurrences of certain words in the text. NLTK guys perform the search on the tokenized texts. But I think, it's quite OK to do it on raw strings with regexes. This has an added benefit of preserving text structure.

NLTK> (concordance *moby* "monstrous")
Displaying 11 of 11 matches
    former, one was of a most monstrous size. ...  This came towards
               "Touching that monstrous bulk of the whale or ork we h
                     array of monstrous clubs and spears.  Some were
 you gazed, and wondered what monstrous
 has survived the flood; most monstrous
                              monstrous fable, or still worse and mor
                       Of the Monstrous Pictures of Whales.
        In connexion with the monstrous pictures of whales, I am stro
o enter upon those still more monstrous stories of them
ave been rummaged out of this monstrous
 Whale-Bones; for Whales of a monstrous size are                     

With :pass-newlines on we can get output similar to NLTK's. Let's try one of the homework tasks:

NLTK> (concordance *genesis* "lived" :pass-newlines t)
Displaying 75 of 75 matches
t from Yahweh's presence, and lived in the land of Nod, east of
 when they were created. Adam lived one hundred thirty years, and
...

Now let's try similarity. Here we won't do without proper tokenization.

NLTK> (similar *moby* "monstrous")
Building word contexts...
Tokenizing text...
Number of words: 267803
Number of unique words: 19243
("mystifying" "subtly" "maddens" "impalpable" "modifies" "vexatious" "candid"
 "exasperate" "doleful" "delightfully" "trustworthy" "domineering" "abundant"
 "puzzled" "untoward" "contemptible" "gamesome" "reliable" "mouldy"
 "determined")
NLTK> (similar *sense* "monstrous")
Building word contexts...
Tokenizing text...
Number of words: 146926
Number of unique words: 6783
("amazingly" "vast" "heartily" "extremely" "remarkably" "great" "exceedingly"
 "sweet" "very" "so" "good" "a" "as")

We mostly get the same words as NLTK's result, but with a slightly different ordering. It turns out that the reason for this is very simple. The function similar matches words based on the contexts in which they occur. As the famous quote by John Rupert Firth goes:

You shall know a word by the company it keeps

But if we look at the context overlap between various words from our list, we'll see that the similarity relation between all these words is extremely weak: the decision is based on a single shared context in which both words appeared in the text. In fact, all the listed words are similar to the same extent.
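
For reference, the context matching behind similar can be sketched like this (assumed names, not the actual CL-NLP implementation):

(defun collect-contexts (words)
  "Map each word to the list of its (left . right) neighbour pairs."
  (let ((ctxs (make-hash-table :test 'equalp)))
    (loop :for (left word right) :on words :while right :do
      (push (cons left right) (gethash word ctxs)))
    ctxs))

(defun similar-to (target words)
  "Rank all words by the number of contexts they share with TARGET."
  (let* ((ctxs (collect-contexts words))
         (target-ctxs (gethash target ctxs))
         (scores ()))
    (maphash (lambda (word word-ctxs)
               (let ((shared (length (intersection word-ctxs target-ctxs
                                                   :test 'equalp))))
                 (when (and (plusp shared) (not (equalp word target)))
                   (push (cons word shared) scores))))
             ctxs)
    (mapcar #'car (sort scores #'> :key #'cdr))))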

NLTK> (common-contexts *moby* "monstrous" "loving")
("most_and")
NLTK> (common-contexts *moby* "monstrous" "mystifying")
("most_and")
NLTK> (apply #'common-contexts *moby* (similar *moby* "monstrous"))
("most_and")

Actually, the next NLTK example is, probably, the best context overlap you can get from those texts:

NLTK> (common-contexts *sense* "monstrous" "very")
("am_glad" "is_pretty" "a_pretty" "a_lucky")

Now let's draw a dispersion plot of the words from the inaugural corpus. This task may seem difficult to approach at first, because the authors use the Python library matplotlib for drawing the graph. Fortunately, there's a language-agnostic tool for similar goals called gnuplot. There are a couple of Lisp wrapper libraries for it, and the actual code you need to write to drive it amounts to 2 lines (not counting the code to format the data for consumption). There are, actually, numerous language-agnostic tools on the Unix platform — don't forget to look for them when you have this kind of specific need :)

gnuplot dispersion graph with cl-nlp

The next problem in this part also seems pretty hard. And, in fact, it is extremely hard if framed correctly — generating a meaningful text based on some other text. But the example solves an easier task: generating a somewhat meaningful text. The approach taken is a very simple one — it is the baseline method in this area and is based on Markov chains. There was even a famous Markov-chain mock persona in Usenet times called Mark V. Shaney. Markov models have one principal parameter — order. Mark V. Shaney was an order-2 chain.

Let's try to generate something with it:

NLTK> (princ (generate *genesis* :order 2 :n 93))
Building prefix transitions index of length 2...
Number of prefixes: 14326
In the marsh grass . Behold , the male and female . Of these , and that Jacob obeyed his father . I will return to the earth was dry . God set them in the earth . Noah built an altar there , and put me in the pit ; and that which was our father ' s hand , then the handmaids and their sin is very grievous in your flesh for an everlasting covenant between God and every man to see the nakedness of his flock and keep it .

And what if we raise the order?

NLTK> (princ (generate *genesis* :order 3 :n 93))
Building prefix transitions index of length 3...
Number of prefixes: 28206
In the beginning God created the large sea creatures , and every bird , whatever moves on the earth . He stayed yet another seven days , and sent over that which he had gained in Paddan Aram . Esau saw that the interpretation was good , he said to them , they conspired against him to kill him . They took captive all their little ones , and for days and years ; and let it divide the waters from the waters . God said to the younger , and his seed

The text starts to resemble the original more and more. Also, you may notice that the text will always start with "In". That's because Genesis isn't split into paragraphs, and our generation starts from paragraph beginnings, of which there's only one here.

OK, this seems to work, but with probabilities you never know for sure... ;)

Now, we're left with very simple tasks. Let's just do them:

NLTK> (length (text-words *genesis*))
44671

In the book they had a slightly different number: 44764. This is because of a different tokenization scheme. The differences can be seen in the next snippet (we have a cleaner version for this use case :)

NLTK> (take 20 (sort (remove-duplicates (text-words *genesis*) :test 'string=) 'string<))
("!" "\"" "'" "(" ")" "," "-" "." ":" ";" "?" "A" "Abel" "Abida" "Abimael"
 "Abimelech" "About" "Abraham" "Abram" "Accad")

What about the vocabulary size? Well, once again very similar to the NLTK number (2789).

NLTK> (length (remove-duplicates (text-words *genesis*) :test 'string=))
2634

Now, let's look for words:

NLTK> (count "smote" (text-words *genesis*) :test 'string=)
0

Hmm... What about some other word?

NLTK> (count "Abraham" (text-words *genesis*) :test 'string=)
134

This seems to work. What's the problem with "smote"? Turns out there's no such word in the Genesis text: at least examining the text doesn't show any trace of it. Looks like we've found a bug in the book :)

(defun percentage (count total)
  (/ (* 100.0 count) total))
NLTK> (with-slots (words) *inaugural*
        (percentage (count "a" words :test 'string=) (length words)))
1.4597242
(defun lexical-diversity (text)
  (with-slots (words) text
    (/ (float (length words))
       (length (remove-duplicates words :test 'string=)))))
NLTK> (lexical-diversity *genesis*)
16.959377
NLTK> (lexical-diversity *chat*)
6.9837084

Interestingly, the results for the *chat* corpus differ from the NLTK ones, although they are calculated based on the tokens provided in the corpus and not extracted by our tokenization algorithms. This text is special because it is extracted from an XML-structured document, which also contains the full tokenization. To use it we swap the words in the *chat* corpus:

NLTK> (setf (text-words *chat*)
            (mapcar #'token-word
                    (flatten (corpus-text-tokens ncorp:+nps-chat-corpus+))))

But first we need to get the corpus and extract the data from it - see corpora/nps-chat.lisp for details.

And, finally, we can examine the Brown Corpus.

       Genre        |  Tokens  |  Types  |  Lexical Diversity
--------------------+----------+---------+--------------------
PRESS-REPORTAGE     |  100554  |  14393  |    7.0
PRESS-EDITORIAL     |  61604   |  9889   |    6.2
PRESS-REVIEWS       |  40704   |  8625   |    4.7
RELIGION            |  39399   |  6372   |    6.2
SKILL-AND-HOBBIES   |  82345   |  11934  |    6.9
POPULAR-LORE        |  110299  |  14502  |    7.6
BELLES-LETTRES      |  173096  |  18420  |    9.4
MISCELLANEOUS-GOVER |  70117   |  8180   |    8.6
LEARNED             |  181888  |  16858  |   10.8
FICTION-GENERAL     |  68488   |  9301   |    7.4
FICTION-MYSTERY     |  57169   |  6981   |    8.2
FICTION-SCIENCE     |  14470   |  3232   |    4.5
FICTION-ADVENTURE   |  69342   |  8873   |    7.8
FICTION-ROMANCE     |  70022   |  8451   |    8.3
HUMOR               |  21695   |  5016   |    4.3

OK, seems like we're done with this chapter. So far there was no rocket science involved, but it was interesting...

Implementation details

So, what are the interesting bits we haven't discussed?

First, let's look at a small optimization trick for calculating lexical-diversity. Our initial variant uses the library function remove-duplicates, which is highly inefficient for this case.

NLTK> (time (lexical-diversity *chat*))
Evaluation took:
  9.898 seconds of real time
  9.888618 seconds of total run time (9.888618 user, 0.000000 system)
  99.91% CPU
  23,687,560,947 processor cycles
  229,392 bytes consed

What we'd like to do is something similar to Python's version, which puts everything into a set and calculates its size. A set is easily represented with a hash-table:

(defun uniq (list &key raw case-insensitive)
  "Return only unique elements from LIST either as a new list
   or as hash-table if RAW is set. Can be CASE-INSENSITIVE."
  (let ((uniqs (make-hash-table :test (if case-insensitive 'equalp 'equal))))
    (dolist (elt list)
      (set# elt uniqs t))
    (if raw uniqs (ht-keys uniqs))))

Here's the time of the same calculation using uniq:

NLTK> (time (lexical-diversity *chat*))
Evaluation took:
  0.014 seconds of real time
  0.012001 seconds of total run time (0.012001 user, 0.000000 system)
  85.71% CPU
  33,396,336 processor cycles
  613,568 bytes consed

A 1000x speed increase!

Now, let's return to text generation. It is accomplished with the following loop (a simplified version):

(loop :for i :from 1 :to length :do
  (let ((r (random 1.0))
        (total 0))
    (dotable (word prob
              (or (get# prefix transitions)
                  ;; no continuation - start anew
                  (prog1 (get# (setf prefix initial-prefix) transitions)
                    ;; add period unless one is already there
                    (unless (every #'period-char-p (car rez))
                      (push "." rez)
                      (incf i)))))
      (when (> (incf total prob) r)
        (push word rez)
        (setf prefix (cons word (butlast prefix)))
        (return)))))

On each iteration it places all possible continuations of the current prefix on a segment from 0 to 1 and generates a random number that points to one of the variants. If there's no continuation it starts anew.

The NLTK book actually uses a slightly more complicated model: first it builds a probability distribution on top of the transition frequencies, and then it generates the text from the probabilities. As of now I don't see why this is needed or whether it makes any difference in the results.

And, finally, here's how we draw the dispersion plot:

(defun plot-dispersion (words file)
  "Plot WORDS dispersion data from FILE."
  (cgn:start-gnuplot)
  (cgn:format-gnuplot "set title \"Lexical Dispersion Plot\"")
  (cgn:format-gnuplot "plot \"~A\" using 1:2:yticlabels(3) title \"\"" file))

It's just 1 line of gnuplot code, actually, but we also need to prepare the data in a tab-separated text file:

(defun dump-data (words dispersion-table)
  "Dump data from DISPERSION-TABLE for WORDS into a temporary file
   and return its name."
  (let ((filename (fmt "/tmp/~A" (gensym))))
    (with-out-file (out filename)
      (format out "0~t0~t~%")
      (doindex (i word words)
        (dolist (idx (get# word dispersion-table))
          (format out "~A~t~A~t~A~%" idx (1+ i) word)))
      (format out "0~t~A~t~%" (1+ (length words))))
    filename))

To wrap up, we've seen a demonstration of a lot of useful tools for text processing, and also discussed how they can be built. Among all of them I want to highlight the utility of the seemingly simplistic concordance, which is actually a kind of grep tool that is indispensable for exploring large texts. I even used it a couple of times to debug issues in more complex functions from this pack.

- Previous: Introduction


2013-02-02

Natural Language Meta Processing with Lisp

Recently I've started work on gathering and assembling a comprehensive suite of NLP tools for Lisp — CL-NLP. Something along the lines of OpenNLP or NLTK. There are actually quite a lot of NLP tools in Lisp accumulated over the years, but they are scattered over various libraries, internet sites and books. I'd like to have them in one place with a clean and concise API which would provide an easy starting point for anyone willing to do some NLP experiments or real work in Lisp. There are already a couple of NLP libraries, most notably langutils, but I don't find them very well structured, and their development isn't very active either. So I see real value in creating CL-NLP.

Besides, I'm currently reading the NLTK book. I thought that implementing the examples from the book in Lisp could likewise be a great introduction to NLP and to Lisp, just as the original is an introduction to Python. So I'm going to work through them using the CL-NLP toolset. I plan to cover 1 or 2 chapters per month. The goal is to implement pretty much everything meaningful, including the graphs — for them I'm going to use gnuplot driven by cgn, of which I learned answering questions on StackOverflow. :) I'll try to implement the examples just from the description — not looking at NLTK code — although I reckon that will be necessary sometimes if the results don't match. Also in the process I'm going to discuss different stuff regarding NLP, Lisp, Python, and NLTK — that's why there's "meta" in the title. :)

CL-NLP overview

So here's a brief overview of CL-NLP 0.1.0. I think the overall structure of NLTK is very natural and simple to understand and use, and CL-NLP will be structured somewhat similarly. The packages I'm currently working on include: util, core, and corpora — the foundational tools. There will also be phonetics, syntax, semantics, generation, and learning (classification and other machine learning activities), as well as, probably, others.

Each package will export a small number of generic functions for major operations, which will serve as the API entry points: tokenize, lemmatize, pos-tag, parse etc. Each of these functions will take as its first argument an instance of a specific class corresponding to some concrete algorithm: regex-tokenizer, markov-chain-generator, porter-stemmer and so on. The instance may or may not have configurable properties and/or state depending on the algorithm. So the algorithms themselves will be implemented in the generic functions' methods. This makes the addition of new algorithms for well-specified tasks, like stemming, really straightforward: define a subclass of stemmer and implement a stem method for it. Finally, we'll have a way to easily access pre-defined instances of some frequently used algorithms. This will both provide an easy start with the library and a simple way to experiment with and extend it.
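
For instance, the extension pattern just described could look like this toy sketch (the stemmer and stem names here illustrate the idea and aren't taken from the actual CL-NLP code):

(defclass stemmer ()
  ()
  (:documentation "Base class for stemmers."))

(defgeneric stem (stemmer word)
  (:documentation "Return the stem of WORD."))

(defclass s-stripping-stemmer (stemmer)
  ()
  (:documentation "A toy stemmer that just strips a trailing s."))

(defmethod stem ((stemmer s-stripping-stemmer) (word string))
  (if (and (> (length word) 1)
           (char-equal #\s (char word (1- (length word)))))
      (subseq word 0 (1- (length word)))
      word))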

Let's take a look at the tokenization module, which is one of the few non-vaporware ones so far. There are a couple of things I'd like to note:

  • Terminology: when we speak about tokenization we may assume that the result of this operation will be some kind of tokens. But in the NLP field a token is often thought of as a specific data structure that has some other properties beyond its string value, for example a POS tag. Also, tokenization can produce different kinds of results: letters, phonemes, words or sentences, for instance. NLTK tokenizers return strings by default, and they have additional methods to return spans — tuples of the corresponding beginning-end pairs. I support the approach that tokenization shouldn't produce opaque data structures, so as not to restrict the choice for higher-level tools. In general, I adhere to the principle that each tool in CL-NLP should produce the most raw data for its level of work. Yet I find it not very convenient to have different functions that return string tokens and the corresponding spans. This is easily amended using Lisp's multiple-values feature (a short usage sketch follows after this list). So here's the definition of tokenize:
    (defgeneric tokenize (tokenizer string)
      (:documentation
       "Tokenize STRING with TOKENIZER. Outputs 2 values:
        - list of words
        - list of spans as beg-end cons pairs"))
    
  • Using a generic function allows us to define an :around method for performing pre-processing of the input text by splitting it into lines and assembling each line's tokenization.
    (defclass tokenizer ()
      ()
      (:documentation
       "Base class for tokenizers."))
    
    (defmethod tokenize :around ((tokenizer tokenizer) string)
      "Pre-split text into lines and tokenize each line separately."
      (let ((offset 0)
            words spans)
        (loop :for line :in (split-sequence #\Newline string) :do
           (mv-bind (ts ss) (call-next-method tokenizer line)
             (setf words (nconc words ts)
                   spans (nconc spans (mapcar #`(cons (+ (car %) offset)
                                                      (+ (cdr %) offset))
                                              ss)))
             (incf offset (1+ (length line)))))
        (values words
                spans)))
    
  • And here's how we'll implement pre-defined instances of the tokenizers:
    (define-lazy-singleton word-chunker
        (make 'regex-word-tokenizer :regex (re:create-scanner "[^\\s]+"))
      "Dumb word tokenizer, that will not split punctuation from words.")
    
    It is done with a special macro that defines a lazily initialized singleton and special syntactic sugar for it. This facility is mainly intended for convenience in interactive experimentation, not for production deployment.
    (defmacro define-lazy-singleton (name init &optional docstring)
      "Define a function NAME, that will return a singleton object,
       initialized lazily with INIT on first call.
       Also define a symbol macro  that will expand to (NAME)."
      (with-gensyms (singleton)
        `(let (,singleton)
           (defun ,name ()
             ,docstring
             (or ,singleton
                 (setf ,singleton ,init)))
           (define-symbol-macro ,(mksym name :format "<~A>") (,name)))))
    
  • Finally, a very interesting direction for development is stream tokenization. It has a whole different set of optimization compromises, and I hope to return to it in some of the future posts.
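
To illustrate how the pieces above fit together, here's a small usage sketch (the span convention is assumed to be ppcre-style end-exclusive offsets; the exact printed form is not taken from the library):

(tokenize <word-chunker> "This is a test.")
=> ("This" "is" "a" "test.")
   ((0 . 4) (5 . 7) (8 . 9) (10 . 15))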

NB. I find it one of the biggest shortcomings of NLTK's design that the algorithms are implemented inside the concrete classes. There's a good saying by Joe Armstrong about this: you wanted a banana, but got a gorilla holding a banana. I think it characterizes very well the additional conceptual load you have to bear when you look at NLTK's code with algorithms scattered over various classes. If Python supported CLOS-like decoupling of classes and methods, it would be much easier to separate the algorithms from the other stuff cleanly in NLTK. Well, in Lisp that's exactly what we're going to do.

Quickstart

To start working with CL-NLP you have to get it from my github account: vseloved/cl-nlp.

For those who are Lisp newbies: you'll also need Lisp, obviously. There are a lot of implementations of it (this is what a program running Lisp code is usually called), but all of them support the same standard, so you can choose any. The ones I recommend are SBCL and CCL. We'll call them Lisp from now on. You can interact with the implementation by starting it and pasting commands at the presented prompt. Or you can use fancier tools: if you're already an Emacs user, or think that you're savvy enough, I recommend getting SLIME and enjoying hacking with it. If you're on Vim, try SLIMV. Otherwise your choice is pretty limited, but take a look at ABLE — a very simple program for interacting with Lisp with some useful features.

Also get quicklisp to make your work with 3rd-party libraries seamless and comfortable.

To load CL-NLP in your Lisp do the following:

(push "path/to/cl-nlp/" asdf:*central-registry*)
(ql:quickload "cl-nlp")

Here, "path/to/cl-nlp/" should point to a directory where you've downloaded the project, and the path should end with a slash. We use ql:quickload to load the project with quicklisp which will take care of loading its few dependencies.

And now if everything worked fine we're ready to start exploring the NLTK book!


2013-01-10

Common Lisp is just Lisp

There's this perpetual confusion about Lisp: everyone will call a language with prefix notation and parenthesis a Lisp or a Lisp-dialect. This is one of the popular urban legends about Lisp, which I'd like to dissolve.

So, let's look at the language familiar to most programmers: C. Would anyone call C++ a C? Maybe, but only someone very inexperienced in these languages. On the contrary, many experienced people will often say, for instance: I would only program in C, but not in C++, because it's such a mess. Let's go further: would anyone call Java a C? I very much doubt it. What unites those languages is a common syntax: braces, basic keywords and operators, and more. A common root. And although the gap between C and Java is pretty big, it can be said that they also have somewhat similar semantics. And they are often used in similar scenarios. That's why all these languages are attributed to the C family, as is Objective-C. Indeed, it would be much easier for a C programmer to switch to C++ or Java than to, say, Lua or Haskell.

Now, the difference between the 3 popular Lisp-family languages — Common Lisp, Scheme and Clojure — is just as big as between C, C++ and Java. Conceptually and syntactically, Common Lisp is as far (or even further) from Clojure as C is from Java.

A language is not just superficial syntax — it's about syntax, semantics and pragmatics. And about community. The Lisp community has split at several points in time, and the main offshoots were Scheme and Clojure. It's not a coincidence that these languages don't have Lisp in their names. And, what's most important, each language's community cares about different things.

So, when someone asks you again "Which Lisp?", the answer is very simple: for good or for worse, the only Lisp currently is Common Lisp. It's like, say, the C99 standard of C. There were other standards and variants, like ZetaLisp or ISLISP, but they aren't in use now. Every other Lisp-like language isn't a Lisp; it's just similar to some extent.


2013-01-03

Lisp Hackers: John Fremlin

John Fremlin has created a couple of very performant Common Lisp programs that beat, on some microbenchmarks, the fastest similar software written in any other language, including C: the teepeedee2 dynamic webserver, which managed to break the c10k record on a single-core machine, and the cl-irregsexp regex library. While working at MSI in Japan he also wrote manardb, an object-persistence DB for CL. Besides, he writes interesting blog posts on software optimization, programming languages and technology in general.

Tell us something interesting about yourself.

I've been to more than eighty countries; I want to go everywhere!

What's your job? Tell us about your company.

I work at Facebook on the growth team, on data-driven improvements to the sign-up flows.

Do you use Lisp at work? If yes, how you've made it happen? If not, why?

I used to at msi.co.jp. It is a Japanese consultancy based in Tokyo called originally Mathematical Systems Institute. Mr Kuroda leads the Lisp group there and I think it has hovered around five or six people over many years. He's done a great many very interesting projects for a range of companies over the years: for example a crash-test data inspection tool for a big Japanese car company, text mining, graph visualisation and so on. I worked primarily on building up a visualisation and mapping of a very large set of routers for the world's biggest telecoms, which led to the creation of manardb.

I really enjoyed working for Mr Kuroda and I'm sorry I had to leave for personal reasons. There were always many very fascinating problems around — with great people to discuss them and find solutions. It was a very stimulating workplace!
At Facebook, I use PHP, Python, C++, Java and miscellaneous things. I think we would all be better off if we hadn't balkanised the different systems that we program for — and Lisp is one of the few programming languages with the flexibility to serve in all these roles.

What brought you to Lisp? What holds you?

My initial programming was following Michael Abrash's graphics books and building on his ideas, by doing things like runtime native code generation for drawing dynamically generated bitmaps efficiently. This is not so interesting for modern processors as they have good branch prediction but the idea of code generation stuck with me and Lisp is one of the few programing languages that makes this easy and efficient.

I appreciate the intellectual coherence of Lisp, and its sensible approach to numeric computations. In terms of using it today, I feel that Common Lisp has an advantage over many other programming languages in that it has multiple mature independent implementations. Running on multiple compilers tends to greatly increase the quality of a program in my opinion, as the code is exposed to different static analyses.

What's the most exciting use of Lisp you had?

I helped someone use Lisp for an automated trading project.

What you dislike the most about Lisp?

In trying to make efficient code one ends up fighting against the compiler and the runtime system and most of the time is spent in coming up with clever ways to circumvent and outwit both. This is not a good use of resources, and means that it usually makes more sense to start with C++.

Tell us about your approach(es) to optimizing Common Lisp code (and maybe code optimization in general)?

The most important thing is to try to hold in your head an understanding of where the program is going to spend time. Profilers can be misleading and inaccurate, and it is sometimes difficult to get representative workloads to profile. I think their main utility is in confirming that there is no sloppy mistake (in Lisp, typically, consing accidentally) that prevents you from achieving the natural performance of your approach.

Complexity analysis in terms of computation, network usage, disk accesses and memory accesses is a first step as obviously if you can improve the asymptotic usage of a bottlenecked resource, you will very likely do much better than trying to tweak some little detail. The second step is to try to characterize interactions with caches and, in Lisp, garbage collection, which is pretty tricky.

Among the software projects you've participated in what's your favorite?

I think the one I enjoyed most was an embedded H.264 decoder in 2005. This was for the VideoCore, a really wonderful CPU architecture that could deal with parallelizable problems incredibly efficiently if programmed correctly. It would have been awesome to use Lisp for it!

If you had all the time in the world for a Lisp project, what would it be?

I wish there were Lisp bridges to other runtime systems (Java, Android, Objective C, Perl, Python, C++, R, etc.) so that the libraries and tools for each could be leveraged efficiently in Lisp and vice versa. That would mean being able to call Java code and handle Java objects in Lisp, for example -- perhaps initially by spinning up a Java implementation in a separate process running a CL-SWANK style interface.

I really don't think this would be that difficult and it would make a huge difference to the convenience of building programs in Common Lisp!

Describe your workflow, give some productivity tips to fellow programmers.

I use emacs and I have a bunch of elisp code that I keep meaning to publish!

2013-01-02

"Real" List Comprehensions in 24 Lines of Lisp

I've just come across a post on Hacker News titled List Comprehensions in Eight Lines of Clojure. It's definitely a nice little example. But it also feels kind of unreal, even like cheating ;) Because who would really use such list comprehensions? It seems to me that the whole purpose of this construct is to make the code concise and resemble the set-theoretic notation for sets. But here we need to use them inside a list-comp macro. This actually looks more like just another iteration construct.

What you really want from a list comprehension syntax is for it to be able to "comprehend" something like this:

{x|x ∊ some set}
or like this:
{x|x ∊ some set|x < 10}
Just like math.

But, surely, it's impossible to implement such syntax in Clojure or any other language without an extensible reader. The only route you can go is what Python does — to implement it as part of the language itself.

But you can do more, if you have control of the reader. This is one of the many cases, when Lisp's reader macros prove indispensable.

So here's an implementation of "real" (i.e. really resembling the mathematical notation, no cheating) list comprehensions in 24 lines of Lisp (if you don't count the utility function group):
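
A minimal sketch of the idea might look as follows (this is not the original 24-line version, and the helper names are mine):

(defun split-on-bars (items)
  "Split ITEMS into sublists on every || symbol (the symbol with an empty name)."
  (let (parts current)
    (dolist (item items (nreverse (cons (nreverse current) parts)))
      (if (and (symbolp item) (zerop (length (symbol-name item))))
          (progn (push (nreverse current) parts)
                 (setf current nil))
          (push item current)))))

(defun read-comprehension (stream char)
  (declare (ignore char))
  (destructuring-bind (expr generator &rest filters)
      (split-on-bars (read-delimited-list #\} stream t))
    (destructuring-bind (var arrow source) generator
      (declare (ignore arrow))  ; the <- marker
      `(loop :for ,var :in ,source
             :when (and ,@(apply #'append filters))
             :collect ,(first expr)))))

(set-macro-character #\{ #'read-comprehension)
(set-macro-character #\} (get-macro-character #\)))

;; e.g.: {x || x <- '(1 2 3 4 5 6 7 8 9 10) || (< x 5) (oddp x)} => (1 3)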

Unfortunately, I had to use || instead of |, because on its own the | is used to escape characters in symbols, and also <- instead of ∊, for ease of typing, obviously. As for the filter part, there's an implicit and, so you can write several conditions and they should all hold. Otherwise, I think, this can be considered a literal implementation of the idea.

PS. And using named-readtables instead of plain set-macro-character this syntax can be used on a per-file basis, just like in-package forms.

PPS. I won't discuss here the issues of list vs. sets or lists vs. sequences. They are an implementation detail, worth another post.


2012-12-14

ANSI Common Lisp in Russian

The Russian translation of Paul Graham's book "ANSI Common Lisp", to which I lent a hand as the "scientific" editor, has recently been published. There have already been posts on the lisper.ru forum from happy owners of the paper version, and the publisher's site even offers an electronic version at a pay-what-you-want price.

Although I was initially skeptical about the choice of this particular book for translation, now I'm glad it turned out this way. Working on the translation, like it or not, I had to read the book practically from cover to cover, and I can say that it is probably the shortest, simplest and most accessible introduction to the language. Practical Common Lisp is more of an eye-opener and still remains the best book on Lisp overall, but it is substantially bigger. All in all, ANSI CL is a very good option for beginners. And although Paul Graham's style is often criticized in the modern Lisp community, this book is quite balanced and doesn't contain any apocryphal ideas :)

The book consists of two parts, the smaller of which, a reference, is practically useless given the existence of the HyperSpec. But that's fine, since it leaves less text to read :) The first part consists of 13 chapters describing different aspects of the language and 3 chapters solving practical problems. The language chapters contain plenty of examples of using various data structures and of implementing non-trivial algorithms with them, which is a good way to level up in this area for those who don't spend their time solving algorithmic puzzles on Codeforces, especially given the beauty and clarity of these algorithms implemented in Lisp. Several chapters turned out to be quite useful even to me, with my five years of practical experience with the language: for example, I came to appreciate the elegance of structs and started using them much more, and the chapters on optimization and on structuring programs were also interesting. The last 3 chapters work through classic Lisp problems: logical inference, building your own object system (essentially an implementation of the internals of JavaScript), and generating HTML from a meta-language — the kind of things that show off some of the strongest sides of the language.

Because of the publisher's troubles, the work on the translation dragged on for a very long time, about two years. To be precise, the work itself took much less, but its separate stages were divided by long gaps. The translation was done by allchemist, and he did it with spirit and cheer. I saw my own task primarily as correcting deviations from the original and working on the terminology. Regarding the latter, I'd like to finish with an amusing story about the words for "stack" and "pool".

Stack and pool

A couple of years ago Ivan Sagalaev, who served in the same scientific-editor role for the book "Coders at Work", wrote the following about that role:

For those who don't know, a scientific editor is someone absolutely necessary for specialized literature. He takes the raw translation and brings the domain terminology in line with what is actually used in the real world. As a result you read a book that says not "the process of syntactic analysis" but simply "parsing", and not "application program interface" but "API".

As applied to Coders at Work, which should read like an adventure novel, I agree with Ivan's approach. But when it comes to books like ANSI CL, intended first of all for (relative) beginners, I believe the choice should be made in favor of maximum clarity of the terms rather than their familiarity to people who are already in the know. That is, in Russian: not "the process of syntactic analysis" but simply "syntactic analysis", and in places just "analysis" — yet not the borrowed word "parsing". Why? If only because for a beginner a borrowing like "parsing" creates a kind of magical aura around the term and singles it out from the other terms named in the native language, although there is nothing special about it. Yes, it's often very hard to find an adequate term in one's native language, and sometimes they even have to be invented, but that's exactly how terminology develops.

In this regard the book had 2 very interesting examples; for the first of them you may well pelt me with tomatoes, but I'm still going to keep insisting on it. Let's list the abstract data structures we meet most often; in Russian texts they could all simply be transliterated from English: "list", "tree", "queue", "stack", "heap", "deque". Oops... I mean, of course, the proper Russian words for list, tree, queue, heap, deque and... still "стек", the transliteration of "stack". Somehow all these structures got real native names, yet the stack is somehow special. Why? Probably out of laziness, but never mind. If you look into a dictionary, you can find 2 perfectly suitable translations of the English word "stack". The first of them is "стог", a haystack :) To my mind an amazing case of similar sounds and, in its own way, a very amusing option. That is exactly what I proposed to use as the term for this data structure, and it held out almost until the last revision, but at the last moment it was replaced by the less odious option "стопка", a pile (as of papers). That is also a good translation, and in terms of matching reality it is even more adequate, so I was satisfied. It's surprising how rarely it appears in the literature!

But there is one more difficulty here: what to do with the program's function call stack, which is no longer an abstract data structure but a concrete technological artifact, with other terms built around it, like "stacktrace"? Here it is, of course, much harder, and I settled on the view that in this case, to avoid confusion, it's better to use the established term, i.e. the transliterated "стек". Perhaps, once "стопка" firmly enters common usage, this term can be carried over here as well: "стопка вызовов" (a pile of calls) sounds mundane, but then there's no extra accidental complexity :)

The second term I remained unhappy with was "pool". This case is worse, since there is no adequate Russian translation for it at all; surely not "бассейн" (a swimming pool). I never came up with anything. But if you have thoughts on the subject, do share...

2012-12-12

Utilitarian Lisp

Here's what a "client" (if such a grand name is even appropriate for so simple a piece of code) for the increasingly popular Graylog2 log server looks like in modern Lisp (a sketch is given below). In my opinion, this little piece of code nicely dispels the myth about library problems in the Lisp world: in our pipeline the message is first serialized to JSON with the cl-json library, then encoded into a byte stream with babel, then gzipped with salza2, and then sent over a UDP socket with usocket. It's also an occasion to use the wonderful local-time library for working with time, which is based on Erik Naggum's article. Plus a bit of syntactic sugar from rutils, including literal syntax for hash-tables (as in Clojure), pluggable per module with the help of named-readtables. Nothing extraneous.
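
The snippet itself isn't reproduced here, but a rough sketch of such a client could look as follows (the function name and the exact set of GELF fields are assumptions, not the original code):

(ql:quickload '(:cl-json :babel :salza2 :usocket :local-time))

(defun log-to-graylog2 (short-message &key (host "localhost") (port 12201))
  "Serialize SHORT-MESSAGE as a GELF-style JSON payload, gzip it
   and send it to a Graylog2 server over UDP."
  (let* ((payload (json:encode-json-alist-to-string
                   `(("version" . "1.0")
                     ("host" . ,(machine-instance))
                     ("short_message" . ,short-message)
                     ("timestamp" . ,(local-time:timestamp-to-unix
                                      (local-time:now))))))
         (bytes (salza2:compress-data
                 (babel:string-to-octets payload :encoding :utf-8)
                 'salza2:gzip-compressor))
         (socket (usocket:socket-connect host port :protocol :datagram)))
    (unwind-protect
         (usocket:socket-send socket bytes (length bytes))
      (usocket:socket-close socket))))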

2012-12-07

Lisp Books

All Lisp books in one place!

PS. If you know more, drop me a line, and I'll add them.

2012-11-04

cl-redis: Separation of Concerns in Library Design

TL;DR This article describes a "lispy" approach to implementing a spec, connection handling, and namespacing in the cl-redis library with special variables, generic functions, macros, packages, and restarts, and compares it to an object-oriented one.

Redis is a simple and powerful tool that can have a lot of different uses: coordination in a distributed system, message queue, cache, static database, dynamic database for ephemeral data — to list just a few. Seeing such potential, I created the cl-redis Lisp client back when Redis hadn't yet reached version 1.0. A couple of weeks ago version 2.6 was released, which, as usual, added a handful of commands (now there are 140 of them — more than a twofold increase since cl-redis was first released with 64), and a couple of small but sometimes incompatible communication protocol changes.

Upgrading the library to support 2.6, I implemented a couple of improvements to the overall design, which made me rethink its original premises and prompted me to write this article to summarize those points, as they may not be the most mainstream manifestation of the very mainstream and basic principle of separation of concerns.

Anticipating a rapid rate of change in Redis, I decided from the beginning to base the library on the following principles:

  • uniform declarative command definition, separated from the details of protocol implementation
  • a sort of TDD approach: adding each new command requires adding a test case for it (which can currently be extracted from the official docs)
  • implementing each command as a regular Lisp function, exported from the REDIS package, and prefixing each command's name to avoid potential symbol conflicts (for instance, get is at the same time a core Lisp function and a Redis command, so it gets defined as red-get in cl-redis)

Declarative Command Definition

Such an approach should be very scalable with regard to the addition of new commands. The actions required should be just these: copy the definition from the spec, put parentheses around it, copy a test from the spec, record and run it. And protocol changes should have no or very little effect on the command definitions. Those were the assumptions, and they worked out pretty well, making it relatively easy to handle all 76 new commands added over time and to go through the transition from the old protocol to the new binary-safe one (which in the end prompted only one change at the interface level: removing the output spec from the command definition).

This is how a command definition looks in Redis spec:

HMGET key field [field ...]

Available since 2.0.0.
Time complexity: O(N) where N is the number of fields being requested.
Returns the values associated with the specified fields in the hash stored at key.
...
Return value
Multi-bulk reply: list of values associated with the given fields, in the same order as they are requested.

And this is its representation in Lisp code:

(def-cmd HMGET (key field &rest fields) :multi
  "Get the values associated with the specified FIELDS in the hash
stored at KEY.")

The difference from defun is that a return "type" is specified (:multi in this case) and that there's no body — the body, a piece of code that handles the communication, is generated automatically.
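
For illustration, here is a rough sketch (an assumption, not the actual generated code) of what the HMGET definition might expand into, in terms of the tell and expect generic functions introduced below:

;; Sketch only: the real expansion is more involved, but the shape is roughly
;; this: a regular function that sends the request and reads the reply
;; according to the declared type.
(defun red-hmget (key field &rest fields)
  "Get the values associated with the specified FIELDS in the hash
stored at KEY."
  (apply #'tell 'HMGET key field fields)
  (expect :multi))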

But still there were some quirks. The biggest one was a small impedance mismatch between how Lisp handles function arguments and how Redis does. It should be said that among all programming languages Common Lisp has the richest function-argument protocol, matched to some extent only by Python. At first glance it seemed that Lisp would be able to accommodate all Redis commands as is. Yet Redis' conventions turned out to be more ad hoc, so for some commands additional pre-processing of arguments was required. For instance, the ZRANGE and ZREVRANGE commands have a WITHSCORES argument which, if present, should be the literal string "WITHSCORES". This is something in between Lisp's &optional and &key arguments, and either choice requires some pre-processing of arguments. My final choice was to go with &optional, but to ensure that whatever non-nil value is provided, it gets transformed into the proper string. Still, this was relatively easy to implement, because the Redis interaction protocol is built on 2 generic functions: tell for sending a request and expect for receiving the response. This makes it possible to decorate the methods with additional processing or to override them altogether for a specific command. In this case a slight pre-processing step is added to tell:

(defmethod tell :before ((cmd (eql 'ZRANGE)) &rest args)
  (when (and (= 4 (length args))
             (last1 args))
    (setf (car (last args)) :withscores)))
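
With this :before method in place, any non-nil value in the WITHSCORES position is normalized before being sent, so both of the following calls (illustrative usage, written with the RED package naming introduced later) end up putting the literal WITHSCORES token on the wire:

(red:zrange "myzset" 0 -1 :withscores)
(red:zrange "myzset" 0 -1 t)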

There are some more involved cases, like the ZUNIONSTORE and ZINTERSTORE commands, which place some restrictions on their arguments and also require the insertion of the special keywords WEIGHTS and AGGREGATE:

(def-cmd ZINTERSTORE (dstkey n keys &rest args &key weights aggregate) :integer
  "Perform an intersection in DSTKEY over a number (N) of sorted sets at KEYS
with optional WEIGHTS and AGGREGATE.")

(defmethod tell ((cmd (eql 'ZUNIONSTORE)) &rest args)
  (ds-bind (cmd dstkey n keys &key weights aggregate) args
    (assert (integerp n))
    (assert (= n (length keys)))
    (when weights
      (assert (= (length keys) (length weights)))
      (assert (every #'numberp weights)))
    (when aggregate
      (assert (member aggregate '(:sum :min :max))))
    (apply #'tell (princ-to-string cmd)
           (cl:append (list dstkey n)
                      keys
                      (when weights (cons "WEIGHTS" weights))
                      (when aggregate (list "AGGREGATE" aggregate))))))

Overall, of the 140 Redis commands only 10 required some special handling.

Proper Encapsulation

The only drawback of the described solution, or rather just a consequence of it being implemented in Common Lisp, is the somewhat ugly format of Redis commands: red-incr definitely looks worse than r.incr. Yet if the commands were defined under the names of their Redis equivalents (incr), it wouldn't be possible to import the whole REDIS package into your application, because of name clashes with the COMMON-LISP package and inevitable conflicts with other packages — these names are just too common. This is where the objects-as-namespaces approach seems better than Lisp's packages-as-namespaces. But it shouldn't be, as the Lisp approach implements proper separation of concerns, without "complecting" things, to use Rich Hickey's parlance.

And it isn't: the solution is actually so simple that I'm surprised I didn't think of it until the latest version of the library. It is to define two packages: the REDIS one with all the ordinary functions, and a special package just for Redis commands — I called it RED, because it's a purely syntactic-sugar addition, so it should be short. The RED package should never be imported as a whole. This way we basically get the same thing as before, red:incr, but now it's done properly. You can import a single command, you can rename the package as you wish, etc. So this solution is actually more elegant than the object-oriented one (we don't have to entangle the commands with connection state), and we don't have to sacrifice the good parts described previously.
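
A rough sketch of what this two-package layout might look like (the exports here are just examples; the actual package definitions in cl-redis are larger):

(defpackage :redis
  (:use :common-lisp)
  (:export #:connect #:disconnect
           #:with-connection #:with-persistent-connection #:with-pipelining))

;; Commands only; never (:use :red): refer to them as red:incr, red:get, etc.
(defpackage :red
  (:use)
  (:export #:incr #:get #:set #:sadd #:hset #:hmget))

So red:get never clashes with cl:get, a single command can still be imported explicitly, and the package can be renamed with rename-package if the name RED is taken.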

Connection Handling

I have also worked with a couple of other Redis libraries, written in Java and Python. Generally, they use a classic object-oriented approach: the commands are defined as methods of a Redis class, which also handles the state of the connection to the server. The resulting code looks pretty neat; it is well encapsulated and at first glance has no boilerplate, even in Java:

Redis r = Redis.getClient();
Long id = r.incr("WORKER_ID_COUNTER");
r.sadd("WORKERS", id.toString());
r.hset("WORKER" + id, "queue", queueName);

There's the issue of not forgetting to release resources (close the connection), but in more advanced languages like Python it's solvable with a context manager:

with redis(port=3333) as r:
    id = r.incr("WORKER_ID_COUNTER")
    r.sadd("WORKERS", id)
    r.hset("WORKER%s" % id, "queue", queueName)

What could be simpler?

Yet below the surface it suffers from a serious problem. In my work I've seen a lot of cases (I'd even say a majority) where the Redis connection should be persisted and passed from one function to another, because it's very inefficient to reopen it over and over again. Most of the object-oriented libraries use some form of connection pooling for that. In terms of usability it's not great, but tolerable. The much greater problem, though, is handling connection errors: such long-living connections always break at some unpredictable point (a timeout or a network hiccup), which throws an exception. This exception (or rather two or three different types of exceptions) has to be handled, by trying to reconnect, in every function that uses our client. And the context manager wouldn't help here: this is one of the cases where it really is no match for the macro-based counterparts it tries to mimic. Those connection errors also break the execution of the current function, and it's often not trivial to restart it. With connection pooling it's even worse, because a bad implementation (and I've seen one) will return broken connections to the pool, and they will have an action-at-a-distance effect on the other parts of the code that happen to acquire them. So, in practice, connection pooling, which may seem like a neat idea, turns out to be a can of worms. This is one of the cases of excessive coupling that often arise in object-oriented languages. (This is not to say that connection pooling can't be useful — it can, but it should be restricted to the cases that really need it.)

In Lisp there are elegant ways to solve all these problems: as usual, it's my favourite Lisp tool, special variables, combined with macros and the condition system. A couple of times I've seen the criticism of the Lisp condition system that its restart facility isn't actually used in practice. Well, it may not be used extensively, but when it's needed, it becomes truly indispensable, and this is one of those cases.

The solution is to introduce the notion of a current connection — a *connection* special variable. All Redis commands operate on it, so they don't have to pass the connection around. At the same time, different macros can be created as proper context managers to alter the state of the connection and to react to its state changes via condition handlers and restarts.
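
A minimal sketch of the idea (open-connection and close-connection here are hypothetical helpers standing in for the real internals):

(defvar *connection* nil
  "The current Redis connection, used implicitly by all commands.")

(defmacro with-connection ((&key (host #(127 0 0 1)) (port 6379)) &body body)
  "Run BODY with *CONNECTION* bound to a fresh connection, closing it afterwards."
  `(let ((*connection* (open-connection :host ,host :port ,port)))
     (unwind-protect (progn ,@body)
       (close-connection *connection*))))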

So if you just work with Redis from the console, as I often do (it's more pleasant than working with the native client), you simply (connect) and issue the commands as is. A one-time job run inside a function can use with-connection, which is the analogue of a Python context manager. The with-pipelining macro will delay reading of Redis replies until all the commands in its body are sent to the server, to save time. Such a trick isn't anything special: while a Java library needs to create a special object to handle this strategy, in Ruby it is done simply with blocks (like: node.pipelined{ data.each{ |key| node.incr(key) }}).

But what can't be done gracefully in these languages is handling a connection hiccup. In the latest cl-redis a macro with-persistent-connection is responsible for handling such situations:

(defmacro with-persistent-connection ((&key (host #(127 0 0 1))
                                            (port 6379))
                                      &body body)
  `(with-connection (:host ,host :port ,port)
     (handler-bind ((redis-connection-error
                     (lambda (e)
                       (declare (ignore e))
                       (warn "Reconnecting to Redis.")
                       (invoke-restart :reconnect))))
       ,@body)))

It doesn't work on its own and requires some support from the command-definition code, though a very small one: just one line — wrapping the body of the command in (with-reconnect-restart ...), another macro that intercepts all the possible failure conditions and adds a :reconnect restart to them, which tries to re-establish the connection once and then retries the body of the command. It's somewhat similar to retrying a failed transaction in a database system. So, for instance, in such a scenario all the side effects of the command may be performed twice. But it's a necessary evil if we want to support long-running operation on the server side.
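
A rough sketch of how such a macro might be put together (an assumption: the real one also distinguishes between the different failure conditions and limits the number of retries; reopen-connection is a hypothetical helper):

(defmacro with-reconnect-restart (&body body)
  "Run BODY, offering a :RECONNECT restart that re-establishes the
connection and retries BODY."
  `(loop
     (restart-case (return (progn ,@body))
       (:reconnect ()
         :report "Reconnect to Redis and retry."
         (reopen-connection *connection*)))))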

The Lisp condition system separates error handling into 3 distinct phases: signalling a condition, handling it, and restarting the control flow. This is its unique difference from the mainstream systems found in Java or Python, where the second and third parts are colocated. It may seem unnecessary, until it's necessary and there's no other way to achieve the desired properties. Consider the case of pipelining: unlike an ordinary command sent on its own, a pipelined command is actually part of a larger batch, so restarting just a single command after reconnection will not return the expected results for the whole batch. The whole body of with-pipelining should be restarted instead. Thanks to this separation, it is possible: the trick is to check in the condition-handler code whether we're in a pipeline and not to react in that case — the reaction will be performed outside of the pipeline.

And here's the whole with-pipelining macro implementation. Did I mention that the pipelining context is also managed with a special variable (two of them, in fact)?.. ;)

(defmacro with-pipelining (&body body)
  `(if *pipelined*
       (progn
         (warn "Already in a pipeline.")
         ,@body)
       (with-reconnect-restart
         (let (*pipeline*)
           (let ((*pipelined* t))
             ,@body)
           (mapcar #'expect (reverse *pipeline*))))))

All in all, there's a proper separation of concerns: the package system ensures namespacing, the commands are just functions which operate on the current connection, and the connection-handling logic is completely separate from these functions.

(let ((redis:*echo-p* t))
  (redis:with-persistent-connection (:port 10000)
    (loop (process-job))))

(defun process-job ()
  (red:blpop "queue" 0)  ;; block until a job arrives (0 = no timeout)
  (let* ((id (red:incr "id"))
         (results (do-something-involved))
         (result-key (fmt "result~A" id)))
    (redis:with-pipelining
      (dolist (result results)
        (red:lpush result-key result))
      (red:lpush "results" result-key))))

In this code we can independently and dynamically toggle debugging, use a persistent connection, and pipeline some of the commands, while the commands themselves carry only the bare minimum of information necessary for their operation.

A Note on Testing

In general I'm not a fan of TDD and similar approaches that put testing at the head of the design process, and I prefer a much less disciplined REPL-driven development. :) Yet this is one of the classic examples where testing really pays off: we have a clear spec to implement and an external way to test the implementation against it. Developing a comprehensive test suite covering all the commands (except a couple of maintenance ones that just can't be easily tested) really sped up the whole process and made it much less error-prone. Literally every time I updated the library and added new tests, I saw some of the existing tests fail! In total, the test suite has accumulated around 600 cases for those 140 commands; still, I didn't come up with a way to test complicated failure conditions on the wire, for which I had to resort to the REPL.
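
A hypothetical sketch of such a spec-derived case (the actual cl-redis test-suite macros differ; the expected values follow the HMGET example in the Redis docs):

(defun test-hmget ()
  (red:hset "myhash" "field1" "Hello")
  (red:hset "myhash" "field2" "World")
  (assert (equal '("Hello" "World" nil)
                 (red:hmget "myhash" "field1" "field2" "nofield"))))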

Afterword

In this article I wanted to showcase the benefits of combining some of Common Lisp's unique approaches to managing complexity and state: special variables, which have a dynamic context attached to them, macros, generic functions, and the condition-handling protocol. They provide several powerful non-mainstream ways to achieve the desired level of separation of concerns in the design of complex systems. In Lisp you should really keep them in mind and not constrain your solutions to recreating common patterns from other languages.

Finally, I would like to acknowledge Kent Pitman and his essay "Condition Handling in the Lisp Language Family", which is one of the most profound articles on the topic of protocols and separation of concerns — highly recommended. I think the example of cl-redis is a clear manifestation of one trait of the CL condition system: you don't need it until one day you do, and when you do, there's really no other way to deal with the problem.
