Lisp, the Universe and Everything


Running Lisp in Production at Grammarly

We have written a blog post describing almost 3 years of our Lisp in production experience at Grammarly. Here's a small abstract for it.

At Grammarly, the foundation of our business, our core grammar engine, is written in Common Lisp. It currently processes more than a thousand sentences per second, is horizontally scalable, and has reliably served in production for almost 3 years.

We noticed that there are very few, if any, accounts of how to deploy Lisp software to modern cloud infrastructure, so we thought that it would be a good idea to share our experience. The Lisp runtime and programming environment provides several unique, albeit obscure, capabilities to support production systems (for the impatient, they are described in the final chapter).

Continue to the full text »


Announcing SHOULD-TEST

Once upon a time, it occurred to me that all sound software should be slightly self-ironic. That is how this library's name came into being: yes, you should test even Common Lisp code sometimes. :) But that's not the whole irony...

So, y u makes YATF?

Testing software has always fascinated me because it is both almost always necessary and at the same time almost always excessive - it's extremely hard to find the right amount of resources to allocate to it. You will most likely end up afraid to change your system either because you have too few tests and some of the important scenarios aren't covered, or too many tests that you need to constantly re-write. Surely, in Lisp the problem is not so drastic because in many cases you can rely on the REPL to help, but it's not a one-size-fits-all solution. There's also too much dogma in the space of general error handling in programming (which I addressed a little bit in this post). So, to find out how to test properly, around 7 years ago I wrote my first test framework, which was called NUTS (non-unit test suite). It worked OK, and I used it in a couple of projects, including the huge test suite of CL-REDIS that I'm really proud of. However, it was the first version, and you always have to re-write the first version. :) This is how MUTEST (microtest) appeared. In it, I was aiming at making a tool with the smallest footprint possible. It was also partially inspired by RT, which I consider to be the simplest (with a positive connotation) Lisp test framework (before ST). But both of them, MUTEST and RT, are not lispy, because they are not extensible, and it's a shame not to have extensibility in Lisp, which provides excellent tools for building it in.

Well, "version 2 always sucks, but version 3..." So, SHOULD-TEST is version 3, and I'm really happy with it. It's truly minimal and intuitive to the extreme: like in the popular BDD approach you just write (in Yodaspeak, obviously): should be = 1 this-stuff and then st:test. And it's extensible - you can add specialized assertion strategies to the provided 3 basic ones: normal testing, exception catching, and capturing output streams.

I wasn't content with the existing Lisp test frameworks because they aren't concerned first and foremost with the things I care about:

  • intuitive defining and running arbitrary tests
  • testing from the REPL and ease of analyzing the test output
  • piping the test output to upstream systems like CI (by supporting common protocols, such as xUnit and TAP)

These are the 3 things that SHOULD-TEST should do best.

Over more than a year, I have written or re-written with it the whole test suites for the main open-source libraries I support - RUTILS, CL-REDIS, and CL-NLP (which doesn't yet have extensive test coverage). And I also use it for all my in-house projects.

Working with ST

Here's a quick overview of the SHOULD-TEST workflow.

Tests are defined with deftest:

(deftest some-fn ()
  (should be = 1 (some-fn 2))
  (should be = 2 (some-fn 1)))

When run, the defined test returns either T or NIL as its primary value. In the case of NIL, the second and third values are lists of:

  • all failed assertions returned by individual assertions
  • and all uncaught errors signaled inside assertions

should is a macro that takes care of checking assertions. If an assertion doesn't hold, should signals a condition of type should-failed or should-erred, which is aggregated by deftest. Also, should returns either T or NIL and a list of the failed expression with expected and actual outputs as additional values.

Under the hood, should calls the generic function should-check and passes it a keyword produced from the first symbol (in this case, :be), a test predicate (here, '=), and a tested expression as thunk (here it will be e.g. (lambda () (some-fn 1))), and expected results if any. If multiple expected results are given, like in (should be eql nil #{:failed 1} (some-other-fn :dummy)), it means that multiple values are expected. As you see, the keyword and test predicate are passed unevaluated, so you can't use expressions here.

The pre-defined types of assertions are be, signal, and print-to. They check, correspondingly, that a predicate holds, that a condition of the given type is signaled, and that the expected text is printed to the given stream.
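
Here's a sketch of how the three kinds might look inside a test body; the exact argument order for signal and print-to is inferred from the keyword/predicate/expression pattern described above, so treat it as illustrative rather than a verbatim API reference:

(deftest assertion-kinds ()
  ;; plain predicate check: expected value, then the tested expression
  (should be equal '(1 2) (list 1 2))
  ;; expect a condition of the given type to be signaled by the expression
  (should signal division-by-zero (/ 1 0))
  ;; expect the expression to print the given string to the given stream
  (should print-to *standard-output* "hello" (format t "hello")))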

deftest and should write the summary of test results to *test-output* (by default bound to *standard-output*). The var *verbose* (default T) controls if the summary contains full failure reports or just test names.

Tests are defined as lambda-functions attached to a symbol's test property, so (deftest some-fn ... will do the following:

(setf (get 'some-fn 'test)
      (lambda () ...))

One feature that is pending implementation is establishing dependencies between tests while defining them, i.e. specifying the partial order in which they should be run. However, I haven't seen heavy demand for it in my test code so far.

To run the tests, use test. Without arguments, it runs all the tests in the current package. Given a :package argument, it will do the same for that package, and given a :test argument, it will run that individual test. If an individual test fails, it will return NIL plus a list of failed assertions and a list of assertions that triggered uncaught errors. If a package's test run fails, it will return NIL and 2 hash-tables holding the same lists as above, keyed by the failed tests' names.
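
For illustration, with the test defined above the typical invocations look like this (:my-lib is just a placeholder package name):

(st:test)                    ; run all the tests in the current package
(st:test :package :my-lib)   ; run all the tests defined in package :my-lib
(st:test :test 'some-fn)     ; run a single test by name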

As you see, the system uses a somewhat recursive protocol for test results:

  • at the lowest level should returns T or NIL and signals information about the failed assertion
  • this information is aggregated by deftest which will return aggregate information about all the failed assertions in the hash-table
  • at the highest level test will once again aggregate information over all tests

So, the structure of the summary, returned from test, will be the following:

  failed-test-1 ((failed-assertion-1 expected actual)
                 (failed-assertion-2 ...
  failed-test-2 ...

There's also a :failed key to test that will re-run only the tests which failed at their last run.

Usage patterns

As SHOULD-TEST is agnostic, it doesn't impose any restrictions on how each project organizes its tests. Yet, having established patterns and best practices never hurts. Below is the approach I use.

There's no restriction on naming tests, though it seems like a good approach to name them the same as the functions they test. As for generic functions, I have different tests for different methods; in this case, I add a suffix to the test's name to indicate which method is tested (like transform-string for the method of the gf transform that is specialized on the string class of arguments).

As for code organization, I use the following directory structure for a typical project:

 |----src
 |    `----module
 |         `-----file.lisp
 |----test
 |    `----test-module
 |          `----test-file.lisp
 `----some-general-tests.lisp

I also usually place the tests in the same package as the code they test but protect them with a #+dev guard, so that in a production environment they are not compiled and loaded at all.
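
For example, a test living next to the code it checks might be guarded like this (assuming :dev is pushed onto *features* only in development images; the package name is hypothetical):

(in-package :my-module)

#+dev
(deftest some-fn ()
  (should be = 1 (some-fn 2)))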

ASDF provides a way to define a standard operation for testing a system that can be invoked with asdf:test-system. The easiest way to hook into this facility is to define the following method for asdf:test-op somewhere, either in package.lisp or in some common file in the test module (in the example above: some-general-tests.lisp):

(defmethod asdf:perform ((o asdf:test-op)
                         (s (eql (asdf:find-system <your-system>))))
  (asdf:load-system <your-system>)
  (st:test :package <your-package>))
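
With that method in place, the whole suite is run through the standard ASDF entry point:

(asdf:test-system <your-system>)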

There's also a minimal test suite defined in src/self-test.lisp. The test suite is also hooked to asdf:test-op for the should-test system - just as described above :)

Finally, there's an idea that ST will provide useful connector facilities that are mostly lacking in the existing Lisp test frameworks, to be able to integrate into the general testing landscape (primarily, CI systems). As a start, xUnit support was implemented (most of the thanks go to Maxim Zholoback). As it often happens, it was actually almost impossible to find a proper xUnit spec, but this SO answer saved the day. test-for-xunit generates the appropriate XML string to *test-output*. I also plan on implementing TAP support some day (this should be pretty easy, actually), but I'm not in a hurry.

Well, if SHOULD-TEST proves useful to some of you, I'd be glad. Enjoy the hacking!


A Crash Course on Lisp - Who Might Find It Interesting

In July I'm giving a workshop introducing practical Common Lisp development. On this occasion I was asked to write an article for the blog of SmartMe, the company hosting the event. In it, I tried to answer the question of who might want to get to grips with Lisp nowadays, and why.

Lisp is one of the oldest and, perhaps, the most enigmatic of modern programming languages. There's also a widespread opinion that it is not just old but obsolete. In this article I'll try to explain why that's not the case and where its niche lies.

Lisp is arguably the only dynamic systems programming language. And among dynamic languages it remains an unsurpassed choice thanks to the following properties:
  • true multi-paradigm support, making it possible to elegantly combine procedural, object-oriented, functional, and other styles
  • unique support for metaprogramming
  • a legendary interactive development environment
  • a rock-solid language standard backed by years of work by the brightest minds of MIT, Xerox PARC, CMU, and other similar places, paid for by DARPA
  • a broad set of implementations (compiler and runtime) - around 25 have been developed over its history, of which up to 10 are actively maintained and evolving
In fact, the current position of Lisp as a language that is usually not considered for serious development is due not to technical reasons, but to historical accident - the language was far ahead of its time - and to the human factor: it is neither the default choice (like C++, Java, or JavaScript), nor a fashionable new technology (like Scala, Go, or Ruby), nor does it have a major organization or community behind it promoting its use (like C#, Swift, or Rust). Nevertheless, the myth of Lisp's impracticality is refuted both by my experience of using it in the core of Grammarly and in my previous commercial projects (for more than 7 years now), and by the experience of the airfare search engine ITA Software, acquired by Google for a billion dollars, or of the Portuguese company Siscog, a developer of solutions for railways that employs more than fifty Lisp programmers. And adherents of the theory that the language needs modernization can have a look at the changelog of SBCL (the leading open-source implementation) :)

Of course, Lisp has its drawbacks as well - besides the small community, consisting mostly of enthusiasts, these are:
  • unfamiliar syntax
  • often unfamiliar approaches and ways of development
  • a perceived lack of libraries for interfacing with the rest of the environment (the Quicklisp project has long proven otherwise :)
Thus, to reiterate: the Common Lisp language and ecosystem have no serious technical shortcomings alongside a number of undeniable advantages, but the language is too unusual and atypical, so it suffers from a chicken-and-egg problem: the lack of Lisp programmers prevents serious projects from being started in it, while the lack of momentum in the community doesn't bring new programmers into it. Therefore, Lisp is unlikely to be seriously used in the mainstream software industry any time soon. What, then, is its niche today? Leaving trends aside, I would say that, first of all, it suits systems that are written to last for years and have to evolve constantly: here it sits at the sweet spot between the classic systems languages, such as C++ and Java, and their dynamic competitors, providing an incredibly flexible and, at the same time, sufficiently performant environment (both in terms of execution speed and development speed). Especially if such systems have to represent and process large amounts of knowledge. That is exactly what Greenspun's tenth rule is about:
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
However, very few people would dare to start such projects in Lisp today. More relevant is another of its applications - rapid prototyping and an environment for experimentation. In this niche Lisp has many competitors, such as Python or specialized languages like R and MatLab, but Lisp's advantage is that a successful prototype can, over time, be brought up to a production system. Still, the most important thing about Lisp for me is the complete creative freedom it provides. No joke: access to such an environment lets a programmer grow not merely by accumulating experience with particular tools and frameworks, but by solving non-standard problems and trying to find the best way to do so, independent of the accidental limitations imposed by current circumstances and accepted norms.


How to write an English POS tagger with CL-NLP


The problem of POS tagging is a sequence labeling task: assign each word in a sentence the correct part of speech. For English, it is considered to be more or less solved, i.e. there are taggers that have around 95% accuracy. It also has a rather high baseline: assigning each word its most probable tag will give you up to 90% accuracy to start with. Also, only around 10% of the words have more than 1 possible tag. However, their usage accounts for around 40% of total text volume.

Let's start our exercise by first verifying these facts in a data-driven manner. This will also allow us to examine the basic building blocks for our solution.

POS taggers are usually built as statistical models trained on existing pre-labeled data. For the English language there is quite a lot of it already available. For other languages that don't have such data, an unsupervised or a rule-based approach can be applied.

Data sources

The standard dataset that is used not only for training POS taggers but, most importantly, for their evaluation is the Penn Treebank Wall Street Journal dataset. It contains not only POS tag annotations, but also noun phrase and parse tree ones.

Here's an example of the combined POS tag and noun phrase annotations from this corpus:

[ Pierre/NNP Vinken/NNP ]
,/,
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB
[ the/DT board/NN ]
as/IN
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
./.

The tagset used in the annotation contains such symbols as NNP for proper nouns, , for commas, and CD for cardinal numbers. The whole set is provided for CL-NLP in the file src/syntax/word-tags.txt. Here's a snippet from it:

X Unknown, uncertain, or unbracketable. X is often used for bracketing typos and in bracketing the...the-constructions.
CC Coordinating conjunction

It's, obviously, possible to extend it with other tags if necessary. All of them are, finally, available as symbols of the tag package in CL-NLP.

Available data and tools to process it

What we're interested in is obtaining a structured representation of this data. The ncorp package implements interfaces to various raw representations, such as this one.

Different NLP corpora exist in various formats:

  • slightly annotated text, like the above
  • XML-based formats (a well-known example is the Reuters corpus)
  • a treebank format used for representing syntax trees
  • and many others

Most of these representations are supported by the ncorp adapters at least to some extent. The interface of this module consists of the following entities:

  • a text structure for representing an individual text of the corpus (most of the corpora are, thankfully, divided into separate files)
  • a generic function read-corpus-file that should read a raw file and return as multiple values its several representations. The common representations supported by almost all its methods are: raw or almost raw text, clean text - just sentences stripped of all annotations, and tokens - a list of token lists extracted from each sentence. Additionally, other corpus-specific slots may be present. For example, a treebank-text structure adds a trees slot to hold the syntax trees extracted from the treebank
  • a generic function read-corpus creates a corpus structure out of all corpus' texts. In practice, it's usually not feasible to do that due to large sizes of most of the useful corpora. That's why the next function is more practical
  • a generic function map-corpus reads and streams each corpus file as a text structure to the argument function. This is the default way to deal with corpus processing

For our task we'll be able to utilize just the tokens slot of the ptb-tagged-text structure, produced with map-corpus. Let's collect the tag distribution for each word from the WSJ section of the PTB:

NLP> (let ((words-dist #h(equal)))
       (map-corpus :ptb-tagged (corpus-file "ptb/TAGGED/POS/WSJ")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (unless (in# (token-word tok) words-dist)
                           (:= (get# (token-word tok) words-dist) #h()))
                         (:+ (get# (token-tag tok)
                                   (get# (token-word tok) words-dist)
                                   0))))
                   :ext "POS")
       words-dist)
#<HASH-TABLE :TEST EQUAL :COUNT 51457 {10467E6543}>
NLP> (reduce #'+ (mapcan #'ht-vals (ht-vals *)))

So, we have around 50k distinct words and 1.3m tokens.

But, actually, the field has made some progress in the last decades, and there's a bigger corpus now available that contains not only the whole Penn Treebank, but also some more data from other domains. The annotations of the WSJ section in it were also improved. It is called OntoNotes. Let's do the same with its data:

NLP> (let ((words-dist #h(equal)))
       (map-corpus :treebank (corpus-file "ontonotes/")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (with-accessors ((tag token-tag) (word token-word)) tok
                           (unless (eql tag 'tag:-NONE-)
                             (unless (in# word words-dist)
                               (:= (get# word words-dist) #h()))
                             (:+ (get# tag (get# word words-dist) 0))))))
                   :ext "parse")
       words-dist)
#<HASH-TABLE :TEST EQUAL :COUNT 60925 {1039EAE243}>

So, in OntoNotes 4.0 there are 60925 distinct words. 50925 of them (~84%) are tagged with a single tag, i.e. we have 16% of multi-tag words, which corresponds well with the theoretical data. Also, there are 2.1m tokens in the corpus in total.

Calculating the number of words with more than one distinct tag:

(count-if-not #`(= 1 (ht-count (rt %)))
              (ht->pairs words-dist))

And what about the total volume?

NLP> (let ((total1 0)
           (total 0))
      (map-corpus :treebank "ontonotes"
                  #`(dolist (sent (text-tokens %))
                      (dolist (tok sent)
                        (unless (eql (token-tag tok) 'tag:-NONE-)
                          (:+ total)
                          (when (in# (token-word tok) *single-tag-words*)
                            (:+ total1)))))
                  :ext "parse")
      (float (/ total1 total)))

Only 24% instead of 60%! What's wrong?

OK, here's the trick: let's add words that have more than 1 tag, but >99% of their occurrences are labeled with a single tag. For instance, the word "the" has 9 distinct tags in OntoNotes, but 0.9997 of the times it's a DT.

If we consider such words to have a single tag, we'll get just a slight increase in the number of single-tag words (+386: 51302 instead of 50916), but a dramatic increase in the volume of their occurrence - now it's 63%! Just as the literature tells us.

(NB. Taking such a shortcut will only increase the quality of the POS tagger, as 99% is above the accuracy it will be able to achieve anyhow, which is at most 97% on same-domain data and even lower for out-of-domain data.)

Here's how we can determine such a set of words:

(remove-if-not #`(let ((tag-dist (ht-vals (rt %))))
                   (> (/ (reduce #'max tag-dist)
                         (reduce #'+   tag-dist))
                      0.99))
               (ht->pairs words-dist))

NB. The above code samples contain some non-standard utilities and idioms that may look somewhat alien to some Lisp programmers. All of them are from my RUTILS library, and you'll see more below. Mostly, these include some hash-table-specific operators, new iteration constructs, a few radical abbreviations for common operations, and literal syntax for hash-tables (#h()) and anonymous functions (#`()).

Some of them are:

  • get#/in#/set# specialized hash-table access routines, and dotable hash-table iteration
  • a pair data type with lt/rt accessors and conversion routines to/from hash-tables
  • ? - a generic element access operator with support for generic setf, plus := as an abbreviation of setf (using a common assignment symbol) and :+ as an analogue of incf (see the short illustration right after this list)
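
To give a rough feel for these idioms, here are schematic translations into standard CL (rutils' actual definitions are more general than these one-liners suggest):

(:= x 42)               ; ≈ (setf x 42)
(:+ (get# :k ht 0))     ; ≈ (incf (gethash :k ht 0))
(in# :k ht)             ; key-presence test, ≈ (nth-value 1 (gethash :k ht))
(set# :k ht 42)         ; ≈ (setf (gethash :k ht) 42)
(dotable (k v ht) ...)  ; iterate over HT, binding key/value to K and V
(? obj :k 1)            ; generic element access, chaining the keys left to right
(pair 1 2)              ; a pair; (lt p) and (rt p) are its left/right accessors
#h(:a 1 :b 2)           ; literal hash-table syntax
#`(+ % 1)               ; ≈ (lambda (%) (+ % 1))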

Building the POS tagger

We have explored how to access different corpus data that we'll need to train the POS tagger. To actually do that, we'll re-implement the approach described by Matthew Honnibal in "A good POS tagger in about 200 lines of Python". In fact, due to the expressiveness of Lisp and the efficiency of SBCL, we'll need even less than 200 lines, and we'll get performance comparable to the much more complex Cython implementation of the parser (6s against 4s on 130k tokens), but those are details... ;)

Here's the source code we'll be discussing below on github.

Our tagger will use a greedy averaged perceptron model with single-tag words dictionary lookup:

(defclass greedy-ap-tagger (avg-perceptron tagger)
  ((dict :initform #h(equal) :initarg :dict :accessor tgr-dict)
   (single-tag-words :initform #h(equalp) :initarg :single-tag-words
                     :accessor tgr-single-tag-words))
  (:documentation
   "A greedy averaged perceptron tagger with single-tag words dictionary lookup."))

As you see, it is derived from a generic class tagger and an avg-perceptron learning model. It also has a dict slot that holds all the words known to the tagger.

Every tagger has just one generic function associated with it. You guessed its name - tag :)

(defmethod tag ((tagger greedy-ap-tagger) (sentence sentence))
  (let ((prev :-START-)
        (prev2 :-START2-)
        (ctx (sent-ctx sentence)))
    (doindex (i token (sent-tokens sentence))
      (:= (token-tag token)
          (classify tagger
                    (extract-fs tagger i (token-word token) ctx prev prev2))
          prev2 prev
          prev (token-tag token)))))

It accepts an already tokenized sentence and (destructively) assigns tags to each of its tokens.

The main job is performed by the call to classify method that is defined for every statistical learning model in CL-NLP. Another model-associated method here is extract-fs which produces a list of features that describe the current sample.

Now, let's take a look at the implementation of these learning model-related methods.

(defmethod classify ((model greedy-ap-tagger) fs)
  (or (get# (first fs) (tgr-single-tag-words model))
      (call-next-method model (rest fs))))

For the tagger, we first check the current word against the dictionary of single-tag-words that we've built in the previous part and then call the classify method of the avg-perceptron model itself. That one is a matter of simply returning a class that is an argmax of a dot product between model weights and sample features fs that in this case can only have values of 1 or 0.

(defmethod classify ((model greedy-ap) fs)
  (let ((scores #h()))
    (dotable (class weights (m-weights model))
      (dolist (f fs)
        (:+ (get# class scores 0) (get# f weights 0))))
    (keymax scores)))  ; keymax is argmax for hash-tables

As you see, averaged perceptron is very straightforward - a simple linear model that has a weights slot which is a mapping of feature weights for every class. In the future this method will probably be assigned to a linear-model class, but it hasn't been added to CL-NLP so far.


Let's take a look at the training part. It consists of 2 steps. extract-fs performs feature extraction. What it, basically, does in our case of a simple perceptron model is returning a list of features preceded by the word we're currently tagging.

(cons word (make-fs model
                    "bias"
                    ("i pref1" (char word 0))
                    ("i word" word)
                    ("i-1 tag" prev-tag)
                    ...))

The make-fs macro is responsible for interning the features as symbols in the package f by concatenating the provided prefixes and calculated variables. It is a standard Lisp practice to use symbols instead of raw strings for such things. So, in the above example, the word "the" preceded by a word tagged as VBZ will get the following list of features:

'("the" f:|bias| f:|i pref1 t| f:|word the| f:|i-1 tag VBZ| ...)

The second part of learning is training. It is the most involved procedure here, yet still conceptually very simple. Just like in the tag method, we're iterating over all tokens, preceded by the dummy :-START2- and :-START- ones, and guessing the current tag using classify. Afterwards, we're updating the model's weights in train1. The only difference is that we first need to explicitly handle the case of single-tag-words, so as not to run the model update step for them.

This is how it all looks modulo debugging instrumentation. Note the use of psetf to update prev and prev2 simultaneously.

(defmethod train ((model greedy-ap-tagger) sents &key (epochs 5))
  (with-slots (single-tag-words dict) model
    ;; expand dict
    (dolist (sent sents)
      (dolist (tok (sent-tokens sent))
        (set# (token-word tok) dict nil)))
    ;; expand single-tag-words
    (dotable (word tag (build-single-tag-words-dict (mapcar #'sent-tokens sents)
                                                    :ignore-case? t))
      (unless (in# word single-tag-words)
        (set# word single-tag-words tag)))
    ;; train
    (dotimes (epoch epochs)
      (dolist (sent (mapcar #'sent-tokens sents))
        (let* ((prev :-START-)
               (prev2 :-START2-)
               (ctx (sent-ctx sent)))
          (doindex (i token sent)
            (let ((word (token-word token)))
              (psetf prev
                     (or (get# word single-tag-words)
                         (let* ((fs (extract-fs model i word ctx prev prev2))
                                (guess (classify model fs)))
                           (train1 model (rest fs) (token-tag token) guess)
                           guess))
                     prev2 prev)))))
      (:= sents (shuffle sents)))))

Note the additional expansion of the single-tag-words dict of the model (as well as of the normal dict).

An interesting feature of the problem's object-oriented decomposition in this case is that we have generic perceptron machinery we'd like to capture and reuse for different concrete models, and model-specific implementation details.

This dichotomy is manifested in the training phase:

  • the train method is specific to the greedy-ap-tagger. The generic perceptron training is much simpler, because it doesn't operate in a sequence labeling scenario (see the source code here)

  • however, there's also an :after method defined for train on the avg-perceptron model which averages all the weights in the end and prunes the model by removing zero weights

  • there are also 2 more methods that are not specialized for greedy-ap-tagger: train1 and update1. They perform 1 step of the normal perceptron training and model update

    (defmethod train1 ((model perceptron) fs gold guess)
      (:+ (ap-step model))
      (dolist (f fs)
        (ensure-f-init model f gold guess)
        (loop :for class :in (list gold guess)
              :for val :in '(1 -1) :do
              (update1 model f class val))))

    (defmethod update1 ((model avg-perceptron) f class val)
      (with-slots (step timestamps weights totals) model
        (:+ (? totals class f) (* (- step (? timestamps class f))
                                  (? weights class f)))
        (:+ (? weights class f) val)
        (:= (? timestamps class f) step)))

Evaluation & persisting the model

We have reached the last part of every machine learning exercise - evaluation. Usually it's about measuring precision/recall/f-measure, but in the tagger case both precision and recall are the same, because the sets of relevant and retrieved items are the same, so we can calculate just the accuracy:
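
For reference, such an accuracy computation boils down to counting matching tags. Here's a minimal sketch, under the assumption (and it is just an assumption) that extract-gold returns a list of (features . gold-tag) pairs; the real accuracy function ships with CL-NLP:

(defun accuracy% (tagger gold)
  "Percentage of GOLD samples for which TAGGER's guess matches the gold tag."
  (let ((correct 0)
        (total 0))
    (loop :for (fs . tag) :in gold :do
      (:+ total)
      (when (eql tag (classify tagger fs))
        (:+ correct)))
    (float (* 100 (/ correct total)))))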

NLP> (accuracy *tagger* *gold-test*)

A "gold" corpus is used for evaluation. This evaluation was performed on the standard evaluation set, which is the Wall Street Journal corpus (parts 22-24), OntoNotes 4.0 version. The model was also trained on the standard training set (parts 0-18). Its results are consistent with the performance of the reference model from the blog post. The "gold" features were obtained by calling the extract-gold method of our model on the data from the treebank.

But wait, we can do more.

First, on the evaluation part. It has long been no secret in the NLP community that the WSJ corpus is far from representative of real-world use cases. And I'm not even talking about twitter here: even different genres of ordinary writing have different vocabularies and distributions of sentence structures. So, the high baselines shown by many results on the WSJ corpus may not be that robust. To help with this kind of evaluation, Google and Yahoo have recently released another treebank called WebText that collects 5 different types of texts seen on the web: from dialogues to blog posts. It's smaller than the Penn Treebank: 273k tokens instead of 1.3m, with 23k distinct word types. If we evaluate on it, the accuracy drops substantially: only 89.74406!

What we can do is train on more data with better variability. Let's retrain our model on the whole OntoNotes (minus the evaluation set of WSJ 22-24). Here are the results:

  • on WSJ evaluation set: 96.76323 - a modest gain of 0.4%: we're already at max here
  • on Webtext: 92.9431 - a huge gain of more than 4%!

So, broader data helps. What else can we do?

Another aspect we haven't touched is normalization. There are some kinds of arbitrarily generated tokens in English which lend themselves well to normalization to some root form. These include numbers, emails, urls, and hyphenated words. The normalization variant proposed by Honnibal is rather primitive and can be improved.

Here's an original variant:

(defmethod normalize ((model greedy-ap-tagger) (word string))
  (cond
    ((and (find #\- word) (not (char= #\- (char word 0))))
     "!HYPHEN")
    ((every #'digit-char-p word)
     (if (= 4 (length word)) "!YEAR" "!DIGITS"))
    (t (string-downcase word))))

And here's a modified one:

(defmethod normalize ((model greedy-ap-tagger) (word string))
  (cond-it
    ((re:scan *number-regex* word) (make-string (length word) :initial-element #\0))
    ((re:scan *email-regex* word) "!EMAIL")
    ((re:scan *url-regex* word) "!URL")
    ((in# word (tgr-dict model)) (string-downcase word))
    ((position #\- word :start 1 :from-end t)
     (let ((suffix (slice word (1+ it))))
       (if (in# suffix (tgr-dict model))
           (string-downcase suffix)
           (string-downcase word))))
    (t (string-downcase word))))

Such a change allows us to gain another 0.06% of accuracy on the Webtext corpus. So, the normalization improvement doesn't help that much. However, I think it should be more useful in real-world scenarios.

Now, as we finally have the best model, we need a way to persist and restore it. The corresponding save-model/load-model methods exist for any categorical model. They use the handy ZIP and USERIAL libraries to save models into a single zip file, serializing textual (categories and feature names) and binary data (floating-point weights) into separate files. Here's what our serialized POS tagger model looks like:

  Length  File
--------  --------------------
     552  classes.txt
 4032099  fs.txt
 2916012  fs.bin
 2916012  weights.bin
   35308  single-tag-words.txt
  484712  dict.txt
--------  --------------------
10384695  6 files
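
Schematically, persisting and then restoring the model looks like this (the save-model argument order and the file name are my assumptions; the load-model call mirrors the test code shown further below):

(save-model *tagger* "pos-tagger.zip")

(defvar *tagger2*
  (load-model (make 'greedy-ap-tagger) "pos-tagger.zip"
              :classes-package :tag))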

Finally, I believe it's an essential practice to make all results we post online reproducible, but, unfortunately, there are restrictions on the use of the Penn Treebank corpus data, so we can't just add an automated test that will reproduce the contents of this post. Still, a small portion of the OntoNotes WSJ corpus can be used under the fair use policy, and it is provided with CL-NLP for evaluation purposes.

Let's add such a test to give users confidence in the performance of our model. For testing CL-NLP I'm using yet another library of my own called SHOULD-TEST - I'll have another blog post devoted to it some time in the future.

Here's a test we need:

(defun extract-sents (text)
  (mapcar #`(make 'ncore:sentence :tokens (ncorp:remove-dummy-tokens %))
          (ncore:text-tokens text)))

(defvar *tagger* (load-model (make 'greedy-ap-tagger)
                             (models-file "pos-tagging/")
                             :classes-package :tag))
(defvar *gold*
  (let (test)
    (ncorp:map-corpus :treebank (corpus-file "onf-wsj/")
                      #`(appendf test (extract-sents %)))
    (extract-gold *tagger* test)))

(deftest greedy-ap-tagger-quality ()
  (should be = 96.31641
          (accuracy *tagger* *gold*)))

Summing up

In this article I've tried to describe the whole process of creating a new statistics-based model using CL-NLP. As long as you have the necessary data, it is quite a straightforward and routine exercise.

If you want to use one of the existing models (namely, greedy averaged perceptron, as of now) you can reuse almost all of the machinery and just add a couple of functions to reflect the specifics of your task. I think, it's a great demonstration of the power of the generic programming capabilities of CLOS.

Obviously, feature engineering is on you, but training/evaluation/saving/restoring the model can be handled transparently by CL-NLP tools. There's also support for common data processing and calculation tasks.

We have looked at some of the popular corpora in this domain (all of which, unfortunately, have some usage restrictions and are not readily available, but can be obtained for research purposes). And we've observed some of the factors that impact the performance and robustness of machine learning models. I'd say that our final model is of a production-ready, state-of-the-art level, so you can safely use it for your real-world tasks (under the licensing restrictions of the OntoNotes dataset used for training it). Surely, if you have your own data, it should be straightforward to re-train the model with it.

You can also add your own learning algorithms, and I'm going to continue doing the same.

Stay tuned and have fun!


Announcing RADICAL-UTILS (a.k.a RUTILS 3.0)


RUTILS is, to my knowledge, the most comprehensive and all-encompassing suite of syntactic utilities to support modern day-to-day Common Lisp development. It aims to simplify the experience of both serious system-level development and rapid experimentation in the REPL.
The library has stabilized over more than 3 years of evolution, it introduces some substantial improvements not available elsewhere, and it has thorough documentation and a good test suite. The only thing it lacks is, probably, a manual, yet example usage can be found in my other libraries (notably, CL-REDIS and SHOULD-TEST).

A short history

According to the git history, I started the project RUTILS (originally, REASONABLE-UTILITIES) on Saturday, Sep 12 2009, almost 4.5 years ago. Yet, I've never made any serious announcement of it. One of the reasons was that it's kind of controversial to release "yet another"™ utilities library for Common Lisp, so I wanted to see if it would stick without any promotion. And it hasn't - i.e. it hasn't gotten any serious adoption in the CL crowd (e.g. only one of the authors of libraries in Quicklisp has used it, and only once was there serious external interest to contribute). Yet, it did stick very well with me and has grown quite well over the years.
And now it has reached version 3 by finally getting a proper test suite, and I've also come to the conclusion that it makes sense to rename the project to RADICAL-UTILITIES - just to add a bit of self-irony that should be an essential component of any grown-up open source project. ;)

So, why RUTILS?

Initially, the concept of the library was expressed in its manifesto. Here are the three main points:
  • Become an extension to the Common Lisp standard by adding the missing pieces that were proposed by notable members of the community in the past or have been proven useful in other languages. This aim included support of as many CDRs as possible.
  • Accumulate all the useful syntactic utilities in one suite. Contrary to ALEXANDRIA, the most widely-used Lisp utilities library, which proclaims the concept of a "good citizen" (i.e. it doesn't include such libraries as SPLIT-SEQUENCE, ANAPHORA, or ITERATE), RUTILS aims to include all the needed pieces, regardless of their origin and presence in other libraries.
  • Introduce fine-grained modularity - one of the common complaints about the standard is the lack of this property, so, as RUTILS is intended as an extension of the standard, it makes sense to address the complaint. This point, actually, aroused interest from the folks at Mathematical Systems Institute, who proposed to co-develop the library, and that prompted the appearance of version 2.0. Yet, their plans changed, and this collaboration didn't work out. However, it pushed me to the understanding of what should be done to make the library production-ready and generally useful.
REASONABLE-UTILITIES failed in several aspects. First of all, there was not enough effort put into documentation and testing. Also, CDR support didn't prove very useful in practice. Finally, the modularity aspect, regardless of its theoretical appeal, doesn't seem to make a difference either: :shadow is just enough to deal with modularity at this level :)
However, all that was not critical, because the benefits for me personally were always substantial.

Version 3.0. Radical?

So, RUTILS has gradually solidified. In 2011 I found the time to document everything, and last year I finally added a good test suite (after also finally finding a comfortable way to write tests with SHOULD-TEST).
So, here are the main benefits of RUTILS:
  • Adding a big set of list, hash-table, string, sequence, tree, array and other general-purpose utilities.
  • Providing a readtable with support for short lambdas (similar to Clojure's), literal hash-tables and vectors, and heredoc strings (a short example follows this list). This is the first "radical" step, as it radically cuts down on boilerplate, and it is a good use of the standard reader facilities (as well as of the excellent NAMED-READTABLES library). In general, I would say that the support for making hash-tables first-class citizens is the main achievement of RUTILS.
  • And the second "radical" step - adding shortcuts for many standard Common Lisp operators and for some of RUTILS' own utilities as well. This serves three purposes: it simplifies the introduction of many operations to beginners, reduces typing in the REPL, and saves horizontal line space (I adhere to the 80-character rule). This is a debatable decision, yet the stance of RUTILS is to provide the choice, but not enforce it.
  • Adding a pair structure instead of the cons-pair. Frankly, there's nothing very wrong with the cons-pair, except for the ugly middle-dot syntax, but it is, I believe, a general agreement in the CL community that the cons-pair is legacy and should be deprecated. Getting rid of it also allows retiring car and cdr - which is a necessary step despite the strong affection many seasoned lispers have for them (myself included, although I'm not very seasoned :)
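
Here's a quick taste of the readtable-driven syntax mentioned above (shown with the rutils readtable enabled via NAMED-READTABLES; consult the rutils docs for the exact readtable name):

#h(equal "foo" 1 "bar" 2)        ; a literal hash-table with an EQUAL test
(mapcar #`(* % %) '(1 2 3))      ; short lambda: % is its argument => (1 4 9)
#`(search %% % :test 'char=)     ; %% names the second argument
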
Some more radicality may be found in the RUTILSX system, which is treated as contrib. Here I plan to place the stuff whose value may not yet be well-proven. Currently, it includes iterate with keywords from the keyword package (this solves naming conflicts and allows the use of arbitrary clause keywords), my own version of a generic bind (I had a couple of attempts at it), and, perhaps most radical of all, the addition of the ~ operator for generic access to elements of arbitrary structures. Think:
(~ #{:foo '(1 2 3)} :foo 1) => 2  ;; #{} is a literal hash-table syntax
Finally, as mentioned earlier, RUTILS also includes a comprehensive test suite that covers all but the most basic and straightforward functions and macros. Most other utility libraries don't.

FAQ (imaginary)

ALEXANDRIA is the most widely used CL utilities library, and, actually, CL library in general. Why not use it?
I don't have anything against it, as it's just useful. Yet, if I were using it, I'd still need to depend on several other libraries (which is not a problem now with Quicklisp). Still, I'd also have to include my own stuff in each project, as no external library provides reader macros for hash-tables and lambdas, as well as many other small things provided by RUTILS.
What about quickutil?
I think it solves the wrong problem. The main problem is utilities' raw utility :) Lack of modularity is often mentioned as a drawback of the Lisp standard, but practice shows that it's just a perception issue, not a real-world one. I believe that the level of modularity provided by RUTILS is just good enough, and even it isn't utilized so far (but, maybe, in the future it will be to some extent). The QUICKUTIL approach just adds unnecessary bookkeeping and introduces another import mechanism, while sticking with standard package import would also work without creating additional mental tax and confusion.
Also, I've come to the conclusion that Lisp's flat namespaces are, actually, a win over most implementations of hierarchical namespaces, but that's a subject of another rant.
What about other individuals' utility suites (metatilities, arnesi, fare-utils, ...)? I remember Xach counted around 13 of them.
Most of them are not supported. They also tend to be a random collection of stuff without a vision. Yet, sometimes it makes sense to use them as well: for instance, I've used ARNESI for its delimited continuations facility and have even put a fork of it on my github to fix some bugs. Although, as discussed below, a finer-grained approach is better - see CL-CONT.
Have you heard of cl21? How does it compare with RUTILS?
Surely. It is a similar effort to RUTILS, but with a much bolder aim. I, personally, doubt that it will succeed in it, because Lisp's culture is quite conservative, and you need to get a buy-in from the community in order to make such radical moves. The problem with this is that no one has enough motivation, given that Lisp is already good enough, and there's no central authority to steward the evolution process.
So, the way to go, as for me, is to make small incremental improvements and get them adopted. This has always worked in the past, and there are many good examples: CLOSER-MOP, CL-PPCRE, NAMED-READTABLES, Marco Antoniotti's libraries, CL-CONT, OPTIMA, etc.
The main improvements proposed by cl21 are or can be addressed by existing libraries, including RUTILS. For instance, cl21 proposes the same literal syntax for hash-tables. Another utility already present in RUTILS is a generic sequence iteration loop: doeach in cl21 and doseq in RUTILSX.
The only thing that, IMHO, may make sense to explicitly modernize in Lisp is the reader, as it has 2 hard-coded cases for handling . and :. I'm thinking along the lines of providing access to the previously read item and thus allowing the definition of arbitrary infix syntaxes. But this needs much more research...
What about speed and performance?
Lisp is a dynamic language with good performance characteristics. One of the reasons for that is a standard library engineered with performance in mind. This manifests in the presence of various specific functions for working with different data structures (like the nth, svref, aref, and char accessors). It prompts valid criticism for being unfriendly to newcomers. The solution, in my opinion, is to build generic versions on top of that, using the capabilities provided by CLOS. The sequence-manipulation part of the standard sets such an example with the likes of elt. RUTILS continues along this track. But there's an obvious drawback of losing the performance benefit. I think that the Lisp approach here is balanced, as it's always possible to fall back to the "low-level" specific functions (which should always be provided), and at the same time to use the easy approach in the 95% of cases where performance isn't critical.
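
To illustrate the trade-off: the type-specific accessors remain available when performance matters, while the generic ones buy uniformity at the cost of dispatch.

;; fast, type-specific accessors from the standard
(aref #(1 2 3) 0)
(char "foo" 0)
(gethash :a #h(:a 1))
;; generic access: the standard ELT for sequences, rutils' ? for any structure
(elt #(1 2 3) 0)
(? #h(:a 1) :a)
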
In RUTILS, there's very little support for functional programming. Why is it missing?
There are several aspects of functional programming that are not present in CL standard. One that is usually addressed by utilities is function composition and currying. It is also addressed by RUTILS but in an uncommon way - with sharp-backquote reader macro. In my opinion it's a more concise, yet easier to understand approach. Here are a couple of examples taken from cl21 docs and elsewhere:
 #`(member % '("foo" "bar") :test 'string=) is a generalized approach to currying
 #`(sin (1+ %)) is the equivalent of (compose #'sin #'1+) in cl21
 #`(and (integerp %) (evenp %)) is the equivalent of (conjoin #'integerp #'evenp) in cl21
 #`(or (oddp %) (zerop %)) is the equivalent of (disjoin #'oddp #'zerop) in cl21
The benefit of this approach is that it is the same for any kind of composition, while with the "functional" style you need to define your own functions to accommodate each case: for example, how do you join on xor? (btw, xor is also provided by RUTILS)
Other functional features, like lazy evaluation, pattern matching, or functional data structures, are much more specific in use (for example, I rarely use them, if at all), and they have dedicated support elsewhere. See CLAZY, OPTIMA, and FSet for examples.
Where's the documentation you're mentioning?
It's in the docstrings and, subsequently, in the excellent quickdocs. With its appearance, I can only recommend that everyone not waste time on creating function-level documentation for your library, and focus on a manual and use cases instead. Though, we need to wait for the update of the project to the most recent quicklisp.


Another Concurrency FUD

Today I stumbled on an article entitled Locks, Actors, And STM In Pictures, which looks rather cool but is total FUD. It tells a FoxNews kind of story about how troublesome using locks is compared to the new synchronization primitives.

First of all, the author uses an example synchronization problem that is very well solved with locks, and simply asserts that it is a bad solution without any concrete argumentation.

What's more important though is that he needs to modify the original problem to fit with the alternative solutions.

In the Actors case we need to resort to a waiter. Can the problem be solved without it? Surely, but the code will be, probably, twice as big and very complicated compared to the lock-based solution. Not to mention that it will also be prone to deadlock. And the lock-based solution with a waiter is even simpler: lock the waiter, tell it what to do, unlock it.

And in the STM case we are taking both forks at once, while in the original task we need to take them one by one. This seems like a minor issue, but now try to add printing to the mix: can we print "Aristotle took left fork" after he takes the left fork, and "Aristotle put left fork" if he couldn't take the right one?

PS. I like actors and STM and try to use them where appropriate. I also teach them to students as part of the Operating Systems course.


NLTK 2.2 - Conditional Frequency Distributions

The summer is over, and it's time to go back to school and continue our exploration of the NLTK book. We're finally getting at the last part of the second chapter - conditional frequency distributions.

What is a conditional frequency distribution? From a statistical point of view, it is a function of 2 arguments - a condition and a concrete outcome - producing an integer result that is the frequency of the outcome's occurrence under the condition. The simplest variant of such a function is a table.
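
To make that concrete, here's a throwaway sketch of such a table as nested hash-tables, using the rutils idioms from the previous chapters and a couple of counts from the Brown corpus table shown later in this post:

(let ((cfd #h()))   ; condition -> (outcome -> count)
  (:= (get# :fiction-romance cfd) #h(equal "can" 74 "could" 193))
  (get# "could" (get# :fiction-romance cfd)))
;; => 193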

In object-oriented programming there's a tendency to put everything in a program into objects, including functions, and the NLTK authors took the same route by creating a ConditionalFreqDist class. Yet, in my view, it is quite dubious, and I have a hard time with such objects. What we really need here is just a protocol for working with tables. At the same time, there's a classic separation-of-concerns issue here, as this protocol shouldn't mandate the internal representation or the way different values in the distribution are generated. This is where objects come to the rescue in OOP: they allow you to change the value storage/generation policy via subclassing. But, as Joe Armstrong has wittily pointed out, "you wanted to have a banana, but got a gorilla holding a banana"...

So, let's get to the protocol part, for which the proper concept in OOP languages would be an interface. Its main operations are: creating a distribution and examining it (tabulating, plotting, etc.)

Creating a distribution

Let's create a CFD from the Brown corpus. Remember that its topics are the following (and they'll become the conditions of our CFD):

NLTK> (keys (corpus-groups +brown-corpus+))

Let's create the simplest distribution:

NLTK> (make-cfd (maptable (lambda (topic texts)
                            (declare (ignore topic))
                            (mapcar #'token-word
                                    (flatten (mapcar #'text-tokens texts))))
                          (corpus-groups +brown-corpus+)))
NLTK> (defvar *cfd* *)

We'll use a very concrete hash-table representation for the distribution, and we can go a very long way with it. Although, later we'll see how to deal with other representations.

There's a new utility here, maptable, which is an equivalent of mapcar in its simple form (when it works with a single list), but operating on a hash-table instead. (And, being a generic function, it can also operate on any other table-like structures.)
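
A toy example of maptable, assuming (as in the call above) that it returns a new hash-table with the same keys and the transformed values:

(maptable (lambda (k v)
            (declare (ignore k))
            (length v))
          #h(:a "abc" :b "de"))
;; => a hash-table mapping :a -> 3 and :b -> 2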

Each frequency distribution in a CFD is an instance of ngrams class that we've worked with in the previous chapter:

NLTK> (~ *cfd* :fiction-romance)
#<TABLE-NGRAMS order:1 count:8451 outcomes:70022 {101C030083}>

Another new utility here, ~, is an alias to the generic element access operator generic-elt that is meant to provide uniform access to elements of different linguistic structures that are defined throughout CL-NLP. It's the analogue of [] access operator in Python and other C-like languages, with the upside that it's not special — just a function. And it also supports chaining using the generic-function :around method-combination:

(defgeneric generic-elt (obj key &rest keys)
  (:documentation
   "Generic element access in OBJ by KEY.
    Supports chaining with KEYS.")
  (:method :around (obj key &rest keys)
    (reduce #'generic-elt keys :initial-value (call-next-method obj key))))

As you can see, a lot of utilities, including very generic ones, are defined along the way of our exploration. Most of them, or similar ones, come built into Python. This can be seen as a failure of the Lisp standard at first sight. Yet it may as well be considered a win, because it's so easy to add these things on top of the existing core with generic functions and macros, and there's no issue of "second-class citizenship". In Python, the [] operator comes pre-built: you can redefine it with special hooks, but only in one way; you can't have different [] implementations for one class. The other benefit of Lisp's approach is more control: you can go with the fast functions that work directly with the concrete data structures, like mapcar for lists or gethash for hash-tables, or you can use a generic operator and pay the price of generic dispatch - the standard map for abstract sequences, the user-defined ~, etc. - when you need future extensibility.

So, we can also use generic-elt to access individual frequency in the CFD:

NLTK> (~ *cfd* :fiction-romance "could")
193

For the first key it will be mapped to hash-table accessor (gethash) and for the second to the method freq of the ngrams object.
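
In other words, the chained call above is roughly equivalent to the following explicit accesses (assuming freq takes the ngrams object first and the word second):

(freq (gethash :fiction-romance *cfd*) "could")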

Examining the distribution

Now, we can move to the more interesting part: analysing the distribution.

The first way to explore the CFD shown in NLTK book is tabulating. It prints the distribution values in a nice and readable 2D table.

NLTK> (tabulate *cfd*
                :conditions '(:press-reportage :religion :skill-and-hobbies
                              :finction-science :fiction-romance :humor)
                :samples '("can" "could" "may" "might" "must" "will"))

                     can  could  may  might  must  will
    PRESS-REPORTAGE   93     86   66     38    50   389
           RELIGION   82     59   78     12    54    71
  SKILL-AND-HOBBIES  268     58  131     22    83   264
    FICTION-ROMANCE   74    193   11     51    45    43
              HUMOR   16     30    8      8     9    13

And, by the way, here's how we can get the individual numbers for one category:

NLTK> (dolist (m '("can" "could" "may" "might" "must" "will"))
        (format t "~A: ~A " m (~ *cfd* :press-reportage m)))
can: 93 could: 86 may: 66 might: 38 must: 50 will: 389

Interestingly enough, these results differ from NLTK's ones:

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

What's the reason? The answer is case-sensitivity. Let's try the same with the case-insensitive version of the CFD:

NLTK> (let ((cfd (make-cfd (maptable (lambda (topic texts)
                                       (declare (ignore topic))
                                       (mapcar #`(string-downcase (token-word %))
                                               (flatten (mapcar #'text-tokens
                                                                texts))))
                                     (corpus-groups +brown-corpus+)))))
        (dolist (m '("can" "could" "may" "might" "must" "will"))
          (format t "~A: ~A " m (~ cfd :press-reportage m))))
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Now, there's an exact match :)

The printing of each row in tabulate is performed in the following loop:

(dotable (k v table)
  (when (or (null conditions)
            (member k conditions))
    (format stream "  ~V:@A" key-width k)
    (let ((total 0))
      (dolist (s samples)
        (format stream "  ~V:@A" (strlen s)
                (funcall (if cumulative
                             #`(incf total %)
                         (or (~ v s) 0)))))))

It has to account for scenarios when conditions may or may not be provided, as well as to treat cumulative output.

And, finally, let's draw some nice pictures. But first we need to create the inaugural corpus from individual texts:

(defparameter *inaugural* (make-corpus-from-dir
                           "Inaugural speeches corpus"
                           (data-file "../nltk/data/inaugural/")
                           :test #`(re:scan "^\\d+" %)))

To do that, we introduce the function make-corpus-from-dir in the ncorpus package, which walks a directory of raw text files and creates a basic corpus from them.

(let (texts)
  (fad:walk-directory dir
                      #`(when (funcall test (pathname-name %))
                          (push (pair (pathname-name %)
                                      (read-file %))
                                texts)))
  (make-corpus
   :desc name
   :texts (mapcar #`(make-text
                     :name (lt %) :raw (rt %)
                     :tokens (mapcar #`(ncore:make-token :word %)
                                     (mapcan #`(ncore:tokenize ncore:<word-tokenizer> %)
                                             (ncore:tokenize ncore:<sentence-splitter>
                                                             (rt %)))))
                  texts)))

(Here, pair, lt, and rt are yet another generic utility group — the equivalents of cons, car, and cdr).

Now, we need to make a CFD from the corpus. It is somewhat complicated, because the corpus is keyed by text names, and we want a CFD keyed by words:

(make-cfd (let ((ht (make-hash-table :test 'equalp)))
            (dolist (text (corpus-texts *inaugural*))
              (let ((year (sub (text-name text) 0 4)))
                (dolist (word (mapcar #'token-word (text-tokens text)))
                  (when-it (find word '("america" "citizen")
                                 :test #`(search %% % :test 'string-equal))
                    (set# it ht (cons year (get# it ht)))))))
            ht)
          :eq-test 'equalp)

And now, we can plot it:

NLTK> (plot-table *)

Plot of usage of 'america' and 'citizen' in Inaugural speeches

Once again, plotting is performed with gnuplot. This time, though, with a general plot-table function that is able to handle any table-like structure. Here is its code:

(defun plot-table (table &rest args
                   &key keys cols cumulative (order-by (constantly nil)))
  "Plot all or selected KEYS and COLS from a TABLE.
   CUMULATIVE counts may be used, as well as custom ordering with ORDER-BY."
  (declare (ignore keys cols order-by))
  (mv-bind (file cols-count keys-count)
      (apply #'write-tsv table args)
    (let ((row-format (fmt "\"~A\" using ~~A:xtic(2) with lines ~
                            title columnheader(~~:*~~A)"
                           file)))
      (cgn:with-gnuplot (t)
        (cgn:format-gnuplot "set grid")
        (cgn:format-gnuplot "set xtics rotate 90 1")
        (cgn:format-gnuplot "set ylabel \"~@[Cumulative ~]Counts\"" cumulative)
        (cgn:format-gnuplot "plot [0:~A] ~A"
                            cols-count
                            (strjoin "," (mapcar #`(fmt row-format (+ 3 %))
                                                 (range 0 keys-count))))))
    (delete-file file)))

A generic function write-tsv creates a temporary file in tab-separated format. Here, we use its method for hash-table objects (which in our case represent a CFD — each condition is a key and each sample is a column). So, the TSV file's contents for our last distribution look like this:

No Label citizen america
0 1789 5 2
1 1793 1 1

Using different data sources

Finally, as promised, here's how to use the newly developed CFD machinery with other ways of obtaining the raw data for the distribution. Oftentimes, it's infeasible to read the whole contents of some files into memory, i.e. it makes sense to use line-by-line processing. There's a handy dolines macro that allows just that. Let's see how we can use it to build the male/female names last-letters distribution from NLTK Figure 2.10.

First of all, we can just use it when we create the distribution:

(make-cfd (let ((ht (make-hash-table)))
            (dolist (gender '(:female :male))
              (dolines (word (fmt "../nltk/data/names/~(~A~).txt" gender))
                (let ((last-char (char word (1- (length word)))))
                  (when (alpha-char-p last-char)
                    (set# gender ht
                          (cons last-char (get# gender ht)))))))
            ht))

And here's the plot of the distribution for reference:

(plot-table * :order-by 'char<)

Plot of the distribution of last letters in male/female names

Another approach would be to use a proxy object that allows pulling the data for a CFD from a list of files:

(defclass file-list ()
  ((files :initarg :files :reader files)))

and specialize make-cfd generic function for it (adding a transform keyword argument to be able to manipulate file data):

(defmethod make-cfd ((raw file-list)
                     &key (eq-test 'equal) (transform 'identity))
  (let ((rez (make-hash-table :test eq-test)))
    (dolist (file (files raw))
      (set# (mkeyw (pathname-name file)) rez
            (make 'table-ngrams :order 1
                  :table (count-ngram-freqs
                          (mapcar transform
                                  (split-sequence #\Newline
                                                  (read-file file)
                                                  :remove-empty-subseqs t))))))
    rez))

And this is how we would call it:

(make-cfd (make 'file-list
                :files (mapcar #`(fmt "../nltk/data/names/~(~A~).txt" %)
                               '(:male :female)))
          :eq-test 'eql
          :transform #`(char % (1- (length %))))

This looks much heavier than the original approach, but sometimes it may be inevitable.

Yet another way would be to use a different plotting function that dynamically builds the plot as it pulls the data. This would require a different style of communication with gnuplot: not via a temporary file, but by giving commands to it dynamically. This is left as an exercise for the reader :)

Parting words

This part was more of a discussion of various rather abstract design issues than a demonstration of NLP techniques.

Overall, I don't see a lot of value in the special treatment of conditional frequency distributions as it is done in NLTK. Surely, it's one of many useful analytic tools, but I haven't seen such objects used in production code - unlike conditional frequencies per se or ngrams, which are all over the place.