2013-09-12

NLTK 2.2 - Conditional Frequency Distributions

The summer is over, and it's time to go back to school and continue our exploration of the NLTK book. We're finally getting to the last part of the second chapter - conditional frequency distributions.

What is a conditional frequency distribution? From a statistical point of view, it is a function of two arguments - a condition and a concrete outcome - producing an integer result: the frequency of the outcome's occurrence under the condition. The simplest variant of such a function is a table.
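In code, this table view is just nested maps: condition → outcome → count. Here is a minimal illustration with plain nested hash-tables (cfd-count is a hypothetical helper for this post, not part of CL-NLP):

(defun cfd-count (cfd condition outcome)
  "Frequency of OUTCOME under CONDITION in a nested hash-table CFD."
  (let ((fd (gethash condition cfd)))
    (if fd (gethash outcome fd 0) 0)))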

In object-oriented programming there's a tendency to put everything in a program into objects, including functions, and the NLTK authors took the same route by creating a ConditionalFreqDist class. Yet, in my view, this is quite dubious, and I have a hard time with such objects. What we really need here is just a protocol for working with tables. At the same time there's a classic separation-of-concerns issue: this protocol shouldn't mandate the internal representation or the way different values in the distribution are generated. This is where objects come to the rescue in OOP: they allow you to change the value storage/generation policy via subclassing. But, as Joe Armstrong has wittily pointed out, "you wanted to have a banana, but got a gorilla holding a banana"...

So, let's get to the protocol part, for which the proper concept in OOP languages would be an interface. Its main operations are: creating a distribution and examining it (tabulating, plotting, etc.)
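In Lisp, such a protocol is naturally expressed as a couple of generic functions. Here is a sketch, with signatures reconstructed from the way they are used later in this post:

(defgeneric make-cfd (raw &key eq-test &allow-other-keys)
  (:documentation
   "Create a conditional frequency distribution from RAW data."))

(defgeneric tabulate (cfd &key conditions samples cumulative)
  (:documentation
   "Print CFD as a 2D table, restricted to CONDITIONS and SAMPLES
    if they are provided."))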

Creating a distribution

Let's create a CFD from the Brown corpus. Remember that its topics are the following (and they'll become the conditions of our CFD):

NLTK> (keys (corpus-groups +brown-corpus+))
(:PRESS-REPORTAGE :PRESS-EDITORIAL :PRESS-REVIEWS :RELIGION
 :SKILL-AND-HOBBIES :POPULAR-LORE :BELLES-LETTRES
 :MISCELLANEOUS-GOVERNMENT-HOUSE-ORGANS :LEARNED :FICTION-GENERAL
 :FICTION-MYSTERY :FICTION-SCIENCE :FICTION-ADVENTURE :FICTION-ROMANCE
 :HUMOR)

Let's create the simplest distribution:

NLTK> (make-cfd (maptable (lambda (topic texts)
                            (declare (ignore topic))
                            (mapcar #'token-word
                                    (flatten (mapcar #'text-tokens texts))))
                          (corpus-groups +brown-corpus+)))
#<HASH-TABLE :TEST EQL :COUNT 15 {101AA320A3}>
NLTK> (defvar *cfd* *)

We'll use a very concrete hash-table representation for the distribution, and we can go a very long way with it. Later, though, we'll see how to deal with other representations.

There's a new utility here, maptable, which is an equivalent of mapcar in its simple form (when it works with a single list), but operating on a hash-table instead. (And, being a generic function, it can also operate on any other table-like structures.)
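For plain hash-tables its essence boils down to the following (a sketch; the real maptable is generic and dispatches on the type of the table):

(defun maptable (fn table)
  "Collect the results of applying FN to each key-value pair of TABLE
   into a new hash-table."
  (let ((rez (make-hash-table :test (hash-table-test table))))
    (maphash (lambda (key val)
               (setf (gethash key rez) (funcall fn key val)))
             table)
    rez))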

Each frequency distribution in a CFD is an instance of ngrams class that we've worked with in the previous chapter:

NLTK> (~ *cfd* :fiction-romance)
#<TABLE-NGRAMS order:1 count:8451 outcomes:70022 {101C030083}>
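Under the hood, make-cfd presumably just wraps each condition's list of outcomes into such a frequency table. Here is my reconstruction of what its hash-table method might look like (a sketch, not the actual CL-NLP source), built from the same helpers that appear later in this post:

(defmethod make-cfd ((raw hash-table) &key (eq-test 'eql) &allow-other-keys)
  (let ((rez (make-hash-table :test eq-test)))
    ;; each condition's list of outcomes becomes a unigram frequency table
    (dotable (condition outcomes raw)
      (set# condition rez
            (make 'table-ngrams :order 1
                  :table (count-ngram-freqs outcomes))))
    rez))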

The other new utility in the snippet above, ~, is an alias to the generic element access operator generic-elt, which is meant to provide uniform access to elements of the different linguistic structures defined throughout CL-NLP. It's the analogue of the [] access operator in Python and other C-like languages, with the upside that it's not special — just a function. And it also supports chaining, thanks to an :around method:

(defgeneric generic-elt (obj key &rest keys)
  (:documentation
   "Generic element access in OBJ by KEY.
    Supports chaining with KEYS.")
  (:method :around (obj key &rest keys)
    (reduce #'generic-elt keys :initial-value (call-next-method obj key))))

As you can see, a lot of utilities, including very generic ones, are defined along the way of our exploration. Most of them, or similar ones, come built into Python. At first sight this can be seen as a failure of the Lisp standard. Yet it may as well be considered a win, because it's so easy to add these things on top of the existing core with generic functions and macros, and there's no issue of "second-class citizenship". In Python the [] operator comes pre-built: you can redefine it with special hooks, but only in one way; you can't have different [] implementations for one class. The other benefit of Lisp's approach is more control: you can go with the fast functions that work directly with the concrete data structures, like mapcar for lists or gethash for hash-tables. Or you can use a generic operator and pay the price of generic dispatch - the standard map for abstract sequences, or the user-defined ~ etc. - when you need future extensibility.
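For instance, here's the "concrete" way to get at a single frequency in our CFD, calling gethash and the freq method of the ngrams object (explained below) directly:

NLTK> (freq (gethash :fiction-romance *cfd*) "could")
193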

So, we can also use generic-elt to access an individual frequency in the CFD:

NLTK> (~ *cfd* :fiction-romance "could")
193

For the first key it is mapped to the hash-table accessor (gethash) and for the second to the freq method of the ngrams object.
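Roughly, the relevant methods might look like this (a sketch; the actual CL-NLP methods may differ in details):

(defmethod generic-elt ((obj hash-table) key &rest keys)
  (declare (ignore keys))  ; chaining is handled by the :around method
  (gethash key obj))

(defmethod generic-elt ((obj table-ngrams) key &rest keys)
  (declare (ignore keys))
  (freq obj key))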

Examining the distribution

Now, we can move to the more interesting part: analysing the distribution.

The first way to explore the CFD shown in the NLTK book is tabulating. It prints the distribution values in a nice, readable 2D table.

NLTK> (tabulate *cfd*
                :conditions '(:press-reportage :religion :skill-and-hobbies
                              :fiction-science :fiction-romance :humor)
                :samples '("can" "could" "may" "might" "must" "will"))

                     can  could  may  might  must  will
    PRESS-REPORTAGE   93     86   66     38    50   389
           RELIGION   82     59   78     12    54    71
  SKILL-AND-HOBBIES  268     58  131     22    83   264
    FICTION-ROMANCE   74    193   11     51    45    43
              HUMOR   16     30    8      8     9    13

And, by the way, here's how we can get the individual numbers for one category:

NLTK> (dolist (m '("can" "could" "may" "might" "must" "will"))
        (format t "~A: ~A " m (~ *cfd* :press-reportage m)))
can: 93 could: 86 may: 66 might: 38 must: 50 will: 389

Interestingly enough, these results differ from NLTK's:

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

What's the reason? The answer is case-sensitivity. Let's try the same with the case-insensitive version of the CFD:

NLTK> (let ((cfd (make-cfd (maptable (lambda (topic texts)
                                       (mapcar #`(string-downcase (token-word %))
                                               (flatten (mapcar #'text-tokens
                                                                texts))))
                                     (corpus-groups +brown-corpus+)))))
        (dolist (m '("can" "could" "may" "might" "must" "will"))
          (format t "~A: ~A " m (~ cfd :press-reportage m))))
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Now, there's an exact match :)

The printing of each row in tabulate is performed in the following loop:

(dotable (k v table)
  ;; print only the rows for the requested conditions (or all of them)
  (when (or (null conditions)
            (member k conditions))
    (format stream "  ~V:@A" key-width k)
    (let ((total 0))
      (dolist (s samples)
        (format stream "  ~V:@A" (strlen s)
                (funcall (if cumulative
                             #`(incf total %)  ; incf returns the running total
                             #'identity)
                         (or (~ v s) 0)))))))

It has to account for the cases when conditions may or may not be provided, as well as handle cumulative output.
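For example, with :cumulative t the humor row from the table above would come out with running totals (16, then 16 + 30 = 46), roughly like this:

NLTK> (tabulate *cfd* :conditions '(:humor)
                :samples '("can" "could")
                :cumulative t)

         can  could
  HUMOR   16     46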

And, finally, let's draw some nice pictures. But first we need to create the inaugural corpus from individual texts:

(defparameter *inaugural* (make-corpus-from-dir
                           "Inaugural speeches corpus"
                           (data-file "../nltk/data/inaugural/")
                           :test #`(re:scan "^\\d+" %)))

To do that, we introduce the function make-corpus-from-dir in the ncorpus package, which walks a directory of raw text files and creates a basic corpus from them:

(defun make-corpus-from-dir (name dir &key (test (constantly t)))
  "Create a corpus NAME from the raw text files in DIR
   whose names satisfy TEST."
  (let (texts)
    (fad:walk-directory dir
                        #`(when (funcall test (pathname-name %))
                            (push (pair (pathname-name %)
                                        (read-file %))
                                  texts)))
    (make-corpus
     :desc name
     :texts (mapcar #`(make-text
                       :name (lt %) :raw (rt %)
                       :tokens
                       (mapcar #`(ncore:make-token :word %)
                               (mapcan #`(ncore:tokenize ncore:<word-tokenizer> %)
                                       (ncore:tokenize ncore:<sentence-splitter>
                                                       (rt %)))))
                    texts))))

(Here, pair, lt, and rt are yet another generic utility group — the equivalents of cons, car, and cdr.)

Now, we need to make a CFD from the corpus. It is somewhat complicated, because the corpus is keyed by text names, and we want a CFD keyed by words:

(make-cfd (let ((ht (make-hash-table :test 'equalp)))
            (dolist (text (corpus-texts *inaugural*))
              (let ((year (sub (text-name text) 0 4)))
                (dolist (word (mapcar #'token-word (text-tokens text)))
                  ;; when-it binds the non-nil result of its test to IT
                  (when-it (find word '("america" "citizen")
                                 :test #`(search %% % :test 'string-equal))
                    ;; collect, per matched word, the years of its occurrences
                    (set# it ht (cons year (get# it ht)))))))
            ht)
          :eq-test 'equalp)

And now, we can plot it:

NLTK> (plot-table *)

Plot of usage of 'america' and 'citizen' in Inaugural speeches

Once again, plotting is performed with gnuplot. This time, though, with a general plot-table function that is able to handle any table-like structure. Here is its code:

(defun plot-table (table &rest args
                   &key keys cols cumulative (order-by (constantly nil)))
  "Plot all or selected KEYS and COLS from a TABLE.
   CUMULATIVE counts may be used, as well as custom ordering with ORDER-BY."
  (declare (ignorable keys cols order-by))  ; passed to write-tsv via ARGS
  (mv-bind (file cols-count keys-count)
      (apply #'write-tsv table args)
    ;; each key becomes a separate line on the plot, reading its data
    ;; from the corresponding column of the TSV file
    (let ((row-format (fmt "\"~A\" using ~~A:xtic(2) with lines ~
                            title columnheader(~~:*~~A)"
                           file)))
      (cgn:with-gnuplot (t)
        (cgn:format-gnuplot "set grid")
        (cgn:format-gnuplot "set xtics rotate 90 1")
        (cgn:format-gnuplot "set ylabel \"~@[Cumulative ~]Counts\"" cumulative)
        (cgn:format-gnuplot "plot [0:~A] ~A"
                            cols-count
                            (strjoin "," (mapcar #`(fmt row-format (+ 3 %))
                                                 (range 0 keys-count))))))
    (delete-file file)))

A generic function write-tsv creates a temporary file in tab-separated format. Here, we use its method for hash-tables (a hash-table in our case represents a CFD): each condition becomes a column of the file and each sample a row. So, the TSV file's contents for our last distribution look like this:

No Label citizen america
0 1789 5 2
1 1793 1 1
...

Using different data sources

Finally, as promised, here's how to use the newly developed CFD machinery with other ways of obtaining the raw data for the distribution. Oftentimes it's infeasible to read the whole contents of some files into memory, so it makes sense to use line-by-line processing. There's a handy dolines macro that allows just that. Let's see how we can use it to build the distribution of last letters in male/female names from NLTK's Figure 2.10.
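If you haven't encountered it, its essence fits in a few lines (a simplified sketch - the real macro may differ in details):

(defmacro dolines ((line file) &body body)
  "Execute BODY with LINE bound to each line of FILE,
   reading the file one line at a time."
  (let ((in (gensym "IN")))
    `(with-open-file (,in ,file)
       (loop :for ,line := (read-line ,in nil) :while ,line
             :do (progn ,@body)))))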

First of all, we can just use it when we create the distribution:

(make-cfd (let ((ht (make-hash-table)))
            (dolist (gender '(:female :male))
              (dolines (word (fmt "../nltk/data/names/~(~A~).txt" gender))
                (let ((last-char (char word (1- (length word)))))
                  (when (alpha-char-p last-char)
                    (set# gender ht
                          (cons last-char (get# gender ht)))))))
            ht))

And here's the plot of the distribution for reference:

(plot-table * :order-by 'char<)

Plot of the distribution of last letters in male/female names

Another approach would be to use a proxy object that allows pulling the data for a CFD from a list of files:

(defclass file-list ()
  ((files :initarg :files :reader files)))

and specialize the make-cfd generic function for it (adding a transform keyword argument to be able to manipulate the raw file data):

(defmethod make-cfd ((raw file-list)
                     &key (eq-test 'equal) (transform 'identity)
                     &allow-other-keys)
  (let ((rez (make-hash-table :test eq-test)))
    (dolist (file (files raw))
      ;; the file's name becomes the condition, its lines - the outcomes
      (set# (mkeyw (pathname-name file)) rez
            (make 'table-ngrams :order 1
                  :table (count-ngram-freqs
                          (mapcar transform
                                  (split-sequence-if
                                   #'newline-char-p
                                   (read-file file)
                                   :remove-empty-subseqs t))))))
    rez))

And this is how we would call it:

(make-cfd (make 'file-list
                :files (mapcar #`(fmt "../nltk/data/names/~(~A~).txt" %)
                               '(:male :female)))
          :eq-test 'eql
          :transform #`(char % (1- (length %))))

This looks much heavier than the original approach, but sometimes it may be inevitable.

Yet another way is to use a different plotting function that builds the plot dynamically as it pulls the data. This would require a different style of communication with gnuplot: not via a temporary file, but by feeding commands to it dynamically. This is left as an exercise for the reader :)

Parting words

This part was more of a discussion of various rather abstract design issues than a demonstration of NLP techniques.

Overall, I don't see a lot of value in the special treatment of conditional frequency distributions as it is done in NLTK. Surely, it's one of many useful analytic tools, but I haven't seen such objects used in production code - unlike conditional frequencies per se or ngrams, which are all over the place.

2013-09-04

Looking for a Lisper with a Spark in Their Eyes

I've been working at Grammarly for 3 years now, where we're building (quite successfully, too) the most accurate service for correcting and improving English texts. Taken to its extreme, this task comes down to full-fledged language understanding, but since we haven't reached that level yet, we have to look for workarounds. :) Obviously, our team, which we call the "core team", has more than enough work. But this work is quite different from ordinary software engineering; or rather, engineering is its final stage, preceded by a huge amount of experimentation and research.

I once called our system Grammar OS, because it somewhat resembles an operating system: it has a low level of language services, intermediate layers of checkers, and user-facing applications on top. One of the key components is the grammar checking system, which is written in Lisp and constantly evolving. Besides the genuinely hard problem itself, onto which Lisp's capabilities map very well, and the high load, this project is also interesting in that about 5 computational linguists work on it, adding code to it all the time...

What I'm getting at is that for more than half a year now we've been looking for a Lisp developer who wants to join the development of this system and, down the road, the new Lisp projects we're thinking about. Besides Lisp, our core uses a lot of Java (even more, actually), and we generally love it too. And a bit of Python. Meanwhile, the main focus is shifting more and more toward various machine learning algorithms.

Finally, Grammarly has a great atmosphere and plenty of people to learn from.

All in all, to my mind it's practically a dream job for a programmer who enjoys text processing, R&D, or Lisp. A logical question, then: why haven't we been able to find a suitable person for so long? There are several answers:
  • First, we hire everyone very slowly (it took us almost half a year to find even a Java developer). Why? See above: the quality of the team matters more to us than speed.
  • Second, there aren't that many Lispers with industrial experience: in Kyiv a couple dozen at most; in Russia more, but few of them want to relocate.
  • Third, even technically strong candidates often lack that spark :(
In short, the situation is this:
  • There is complex and interesting work with a good salary, a great team, a great company, and good prospects.
  • We need a programmer who is passionate about their work, has a good grasp of algorithms, math, and statistics, and wants to write Lisp.
  • Lisp experience is required, but not necessarily from the real world: open-source or hobby projects will do as well. In that case, though, the desire and ability to grow quickly in this direction are essential.
If interested, drop me a line at vseloved@gmail.com...