Lisp, the Universe and Everything

2016-05-17

Improving Lisp UX One Form at a Time

At the recent ELS, I presented a lightning talk about RUTILS and how I see it as a way of "modernizing" CL, i.e. updating the basic language elements to be simpler, clearer and more generic, thus improving the everyday user experience and answering outsiders' complaints about "historical cruft" in the Lisp standard. Indeed, Lisp has a lot of unrecognizable names (like mapcar and svref) or just unnecessarily long ones (multiple-value-bind or defparameter), and out of the box it lacks a lot of things that many current programmers are used to: unified generic accessors, generators, literal syntax for defining hash-tables or dynamic vectors, etc. This may not be a problem for people working with the language on a regular basis (or if it is, they probably already have a personal solution for it), but it impedes communication with the outside world. I've been paying extra attention to this recently, as I was preparing code examples for an experimental course on algorithms, which I now teach using Lisp instead of pseudocode (actually, modulo the naming/generics issue, Lisp is a great fit for that).

Unfortunately, the lightning talk format is too short for a good presentation of this topic, so here's a more elaborate post, in which I want to show a few examples from the RUTILS library of using Lisp's built-in capabilities to introduce clear, uniform, and generic syntactic abstractions that may be used alongside the standard Lisp operators, as well as replace them in the cases when we want to get more concise and understandable code.

What's cool about this problem is that, in Lisp, besides the common way of extending the language with functions and methods (and even macros/templates, which find their way into more and more languages), there are several other approaches that allow us to tackle issues that can't be covered by functions or even macros. Those include, for instance, reader macros and aliasing. Aliasing is, actually, a rather simple idea (and can, probably, be implemented in other dynamic languages): duplicating the functionality of an existing function or macro under a new name. The idea for such an operator came from Paul Graham's "On Lisp", and it may be implemented in the following way (see a full implementation here):


(defmacro abbr (short long &optional lambda-list)
  `(progn
     (cond
      ((macro-function ',long)
       (setf (macro-function ',short) (macro-function ',long)))
      ((fboundp ',long)
       (setf (fdefinition ',short) (fdefinition ',long))
       ,(when lambda-list
          `(define-setf-expander ,short ,lambda-list
             (values ,@(multiple-value-bind
                           (dummies vals store store-form access-form)
                           (get-setf-expansion
                            (cons long (remove-if (lambda (sym)
                                                    (member sym '(&optional &key)))
                                                  lambda-list)))
                         (let ((expansion-vals (mapcar (lambda (x) `(quote ,x))
                                                       (list dummies
                                                             vals
                                                             store
                                                             store-form
                                                             access-form))))
                           (setf (second expansion-vals)
                                 (cons 'list vals))
                           expansion-vals))))))
      (t (error "Can't abbreviate ~a" ',long)))
     (setf (documentation ',short 'function) (documentation ',long 'function))
     ',short))

As you may have noticed, it is also capable of duplicating a setf-expander for a given function if the lambda-list is provided. Using abbr we can define a lot of shorthands or alternative names, and it is heavily used in RUTILS to provide more than 50 alternative names; we'll see some of them in this post. What this example shows is the malleability of Lisp, which allows approaching its own improvement from different angles depending on the problem at hand and the tradeoffs you're willing to make.
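
For illustration, here are a few aliases defined with abbr. The particular names are hypothetical examples in the spirit of RUTILS (call is mentioned later in this post); the library's actual alias set may differ:

;; hypothetical aliases for illustration; the actual RUTILS names may differ
(abbr mv-bind multiple-value-bind)   ; macro alias
(abbr defpar defparameter)           ; another macro alias
(abbr call funcall)                  ; function alias (used later in this post)

;; with a lambda-list, the setf-expander is duplicated as well:
(abbr href gethash (key hash-table &optional default))

(defvar *tbl* (make-hash-table))
(setf (href :a *tbl*) 42)            ; works like (setf (gethash :a *tbl*) 42)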

Introducing generic element access

One of the examples of historical baggage in CL is the substantial variety of different methods for accessing elements of collections, hash-tables, structures, and objects, with no generic function unifying them. Not that other languages have a totally uniform accessor mechanism, either. Usually, there will be two or three general-purpose ways to organize it: dot notation for object field access, something square-bracket-ish for arrays and other collections, and some generic operator like get for all the other cases. And occasionally (e.g. in Python or C++) there are hooks to plug into the built-in operators. Still, it's a much smaller number than in Lisp, and, what's more important, it's sufficiently distinct and non-surprising.

In Lisp, actually, nothing prevents us from doing even better — both better than the current state and better than other languages — i.e. from having a fully uniform and extensible solution. To a first approximation, it's just a matter of defining a generic function that works on different container types and utilizes all the existing optimized accessor functions in its methods. This interface will be extensible for any container object. In RUTILSX (a part of RUTILS where any experiments are allowed) this function is called generic-elt:


(defgeneric generic-elt (obj key &rest keys)
  (:method :around (obj key &rest keys)
    (reduce #'generic-elt keys :initial-value (call-next-method obj key))))

One important aspect you can see in this definition is the presence of an :around method that allows us to chain multiple accesses in one call and dispatch each one to an appropriate basic method via call-next-method. Thus, we may write something like (generic-elt obj 'children 0 :key) to access, for instance, the element indexed by :key in a hash-table that is the first element of a sequence stored in the slot children of some object obj.
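
To give an idea of the basic methods, here is a sketch of a few of them; the actual methods in RUTILSX cover more container types and may differ in details. They simply delegate to the standard optimized accessors:

;; a minimal sketch of a few basic methods
(defmethod generic-elt ((obj list) key &rest keys)
  (declare (ignore keys))
  (nth key obj))

(defmethod generic-elt ((obj vector) key &rest keys)
  (declare (ignore keys))
  (aref obj key))

(defmethod generic-elt ((obj hash-table) key &rest keys)
  (declare (ignore keys))
  (gethash key obj))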

The only problem with this function is its long name. Unfortunately, most of the good short element-access names, like elt and nth, are already taken by the Common Lisp standard, while for RUTILS I've adopted the religious principle of retaining full backward compatibility and not altering anything from the standard. This is a critical point: not redefining CL, but building on top of it and extending it!

Moreover, element access has two notable properties: it's a very common operation, and it's not a usual function that does some computation, so ideally it should have a short but prominent look in the code. The perfect solution occurred to me at one point: introduce the alias ? for it. Lisp allows naming operations with arbitrary characters, and a question mark, in my opinion, matches the inner intent of this operation very well: query a container-like object using a certain key. With it, our previous example becomes very succinct and cool: (? obj 'children 0 :key).

In addition to element reading, there's also element write access. This operation in Lisp, like in most other languages, has a unified entry point called setf. There's a special interface for providing specific "methods" for it based on the accessor function. Yet, what to do when an access function is polymorphic? Well, provide a polymorphic setter companion: (defsetf generic-elt generic-setf). Like generic-elt, generic-setf defers work to already defined specific setters:


(defmethod generic-setf ((obj list) key &rest keys-and-val)
  (setf (nth key obj) (atomize keys-and-val)))

And it also supports key chaining, so you can write: (setf (? obj 'children 0 :key) new-value).

Having this unified access functionality is nice and cool, but some people may still long for the familiar dot syntax for object slot access. We can't blame them: habits are a basis of good UX. Unfortunately, this is contrary to the Lisp way... But Lisp is a pro-choice and future-proof language: if you want something badly, even something not done in the usual ways, you can almost always find a clean and supported means of implementing it. And this case is not an exception. If you can tolerate a small addition — an @-prefix to the object reference (which is also an extra prominent indicator that something unusual is going on) — when accessing its slots, you can define a reader macro that will expand forms like @obj.slot into our (? obj 'slot) or the standard (slot-value obj 'slot). With it, we can write something like (? tokens @dep.govr.id), which is much more succinct and, arguably, readable than (elt tokens (slot-value (slot-value dep 'govr) 'id)).
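
Here is a minimal sketch of what such a reader macro could look like. The helper names are made up for this example, and the actual RUTILS implementation differs in details and is installed via a NAMED-READTABLES readtable rather than globally:

(defun split-on-dots (string)
  ;; "DEP.GOVR.ID" => ("DEP" "GOVR" "ID")
  (loop :for start := 0 :then (1+ dot)
        :for dot := (position #\. string :start start)
        :collect (subseq string start dot)
        :while dot))

(defun at-reader (stream char)
  (declare (ignore char))
  (let ((parts (split-on-dots (symbol-name (read stream t nil t)))))
    ;; @obj.s1.s2 => (slot-value (slot-value obj 's1) 's2)
    (reduce (lambda (form slot) `(slot-value ,form ',(intern slot)))
            (rest parts)
            :initial-value (intern (first parts)))))

;; in real use this would go into a dedicated readtable, not the global one
(set-macro-character #\@ #'at-reader)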

Still, one issue remains unsolved in this approach: the preferred Lisp slot-access method is not slot-value, but an accessor method that is exported. And one of the reasons for that is that slot names, which are usually short and can clash, are kept private to the package where they are defined. This means that in most cases @obj.slot will not work across packages. (Unlike in OO languages, where every class is its own namespace, in Lisp this functionality is not "complected" into the OO system: packages are the namespacing mechanism, while objects serve for encapsulation and inheritance.)

There are two ways to tackle this problem. As I said, Lisp is future-proof: being thoroughly dynamic and extensible, CLOS defines a method that is called when there's a problem accessing an object's slot — slot-missing. Once again, we can define an :around method that will be a little smarter (?) and try to look up the slot name not only in the current package, but also in the class's original package.


(defmethod slot-missing :around
    (class instance slot-name (operation (eql 'slot-value)) &optional new-value)
  (declare (ignore new-value))
  (let ((class-package (symbol-package (class-name (class-of instance)))))
    (if (eql class-package (symbol-package slot-name))  ;; to avoid infinite looping
        (call-next-method)
        (if-it (find-symbol (string-upcase slot-name) class-package)
               (slot-value instance it)
               (call-next-method)))))

This is a rather radical way and comes at a cost: two additional virtual function calls (the slot-missing method itself and an additional slot-value call). But in most cases it may be worth paying it for convenience's sake, especially since you can always optimize a particular call site by changing the code to the most direct (slot-value obj 'package::slot) variant. By the way, using a slot accessor method is also costlier than plain slot-value, so we are compensating somewhat here. Anyway, it's cool to have all the options on the table: the beautiful slow method and the ugly fast one that our backward-compatibility approach allows us. As usual, you can't have your cake and eat it too...

Though sometimes you can. :) If you think about it some more, it becomes apparent that slot-value could have been implemented this way from the start: look up the slot name in the class's original package. As classes and structs are defined together with their slots, it is very rare, if not almost impossible, to encounter slot names that are not available in the package where their class is defined (you'd have to explicitly use a private name from another package when defining a class to pull off such a trick). So slot-value should always look for slot names in the class's package first. We can define a "smart" slot-value variant that does just that, and with our nice generic-elt frontend it can be easily integrated without breaking backward compatibility.


(defun smart-slot-value (object slot-name)
  (slot-value object
              (or (find-symbol (string-upcase slot-name)
                               (symbol-package (class-name (class-of object))))
                  slot-name)))
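
The integration could then be as simple as a method along the following lines (a sketch; the way it is actually wired up in RUTILSX may differ):

;; route symbol keys on CLOS objects through smart-slot-value
(defmethod generic-elt ((obj standard-object) (key symbol) &rest keys)
  (declare (ignore keys))
  (smart-slot-value obj key))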

Unifying variable binding with with

Almost everything in functional variable definition and binding was pioneered by Lisp at some point, including the concept of destructuring. Yet the CL standard, once again, lacks unification in this area. There are at least 4 major constructs: let and let*, destructuring-bind and multiple-value-bind, as well as a few specialized ones like with-slots or ppcre:register-groups-bind. One more thing to mention is that the parallel-assignment behavior of plain let can also be achieved with destructuring-bind and multiple-value-bind. Overall, it just screams for unification in a single construct, and there have already been a few attempts to do that (like metabang-bind). In RUTILS, I present a novel implementation of generic bind that has two distinct features: a more plausible name — with — and a simple method-based extension mechanism. The implementation is very simple: the binding construct is selected at compile time based on the structure of the clause and, optionally, the presence of special symbols in it:


(defmacro with ((&rest bindings) &body body)
  (let ((rez body))
    (dolist (binding (reverse bindings))
      (:= rez `((,@(call #'expand-binding binding rez)))))
    (first rez)))

A very small number of methods covering the basic cases is defined:

  • the first one expands to let or multiple-value-bind depending on the number of symbols in the clause (i.e. for multiple values you should have more than 2)
  • the second group triggers when the first element of the clause is a list and defaults to destructuring-bind, but has special behaviors for the 2 symbols ? and @, generating clauses for our generic element access and smart slot access discussed in the previous section

(defun expand-binding (binding form)
  (append (apply #'bind-dispatch binding)
          form))

(defgeneric bind-dispatch (arg1 arg2 &rest args)
  (:method ((arg1 symbol) arg2 &rest args)
    (if args
        `(multiple-value-bind (,arg1 ,arg2 ,@(butlast args)) ,(last1 args))
        `(let ((,arg1 ,arg2)))))
  (:method ((arg1 list) (arg2 (eql '?)) &rest args)
    `(let (,@(mapcar (lambda (var-key)
                       `(,(first (mklist var-key))
                         (? ,(first args) ,(last1 (mklist var-key)))))
                     arg1))))
  (:method ((arg1 list) (arg2 (eql '@)) &rest args)
    (with-gensyms (obj)
      `(let* ((,obj ,(first args))
              ,@(mapcar (lambda (var-slot)
                          `(,(first (mklist var-slot))
                            (smart-slot-value ,obj ',(last1 (mklist var-slot)))))
                        arg1)))))
  (:method ((arg1 list) arg2 &rest args)
    `(destructuring-bind ,arg1 ,arg2)))

In a sense, it's a classic example of combining generic functions and macros to create a clean and extensible UI. Another great benefit of using with is reduced code nesting, which can become quite deep with the standard operators. Here's one of the examples from my codebase:

(with (((stack buffer ctx) @ parser)
       (fs (extract-fs parser interm))
       (((toks :tokens) (cache :cache)) ? ctx))
  ...)

And here's how it would have looked in plain CL:

(with-slots (stack buffer ctx) parser
  (let ((fs (extract-fs parser interm))
        (toks (gethash :tokens ctx))
        (cache (gethash :cache ctx)))
    ...))

Implementing simple generators on top of signals

One of my friends and a Lisp enthusiast, Valery Zamarayev, who's also a long-time Python user, once complained that the only thing he misses in CL coming from Python is generators. This feature is popular in many dynamic languages, such as Ruby or Perl, and even Java 8 has introduced something similar. Sure, there are multiple ways to implement lazy evaluation in Lisp, with many libraries for that, like SERIES, pygen or CLAZY. And we don't have to wait for another version of the spec (especially since it's not coming 8-)

In RUTILS I have discovered, I believe, a novel and very clean way to implement generators — on top of the signal system. The signal, or condition, facility is, by the way, one of the most underappreciated assets of Common Lisp that often comes to the rescue in seemingly dead-end control-flow situations. And Kent Pitman's description of it is one of my favorite reads in Computer Science. Anyway, here's all you need to implement Python-style generators in Lisp:


(define-condition generated ()
  ((item :initarg :item :reader generated-item)))

(defun yield (item)
  (restart-case (signal 'generated :item item)
    (resume () item)))

(defmacro doing ((item generator-form &optional result) &body body)
  (with-gensyms (e)
    `(block nil
       (handler-bind ((generated (lambda (,e)
                                   (let ((,item (generated-item ,e)))
                                     ,@body
                                     (invoke-restart (find-restart 'resume))))))
         ,generator-form)
       ,result)))

The doing macro works just like dolist, but iterates over the values yielded by the generator form instead of an existing sequence. As you can see from this example, restarts are generators in disguise. Or, to be more precise, they are a more general way to handle such functionality, and it takes just a thin layer of syntactic sugar to adapt them to a particular usage style.
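
Here's what using it might look like, with a simple hypothetical generator function for illustration:

;; a toy generator: yields the natural numbers below limit
(defun generate-numbers (limit)
  (dotimes (i limit)
    (yield i)))

;; prints "0 1 2 3 4 " and returns :done
(doing (x (generate-numbers 5) :done)
  (format t "~a " x))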

And a few mischiefs

We have seen three different approaches to extending CL in order to accommodate new popular syntactic constructs and approaches. Lastly, I wanted to tread a little into the "danger zone" that may be considered unconventional or plain bad style by many lispers — modifying syntax at the reader level. One thing that Clojure (following other dynamic languages before it) has proven, I believe, is the importance of shorthand literal notation for popular operations. The CL standard predates this understanding: although it has specific print representations for various important objects, and even a special syntax for static arrays, it lacks such shorthands for many other common things. Yet the language is really future-proof in this respect, because it provides a way to hook into the reader mechanism by modifying readtables. This was further smoothed and packaged by the popular NAMED-READTABLES library, which allows treating readtables similarly to packages. In RUTILS I have defined several extended readtables that implement a few shortcuts that are used in literally every second function or macro I define in my code. These include (a short usage sketch follows the list):

  • a shorthand notation for zero-, one- or two-argument lambda functions: ^(+ % %%) expands into (lambda (% %%) (+ % %%))
  • a literal syntax for hash-tables: #h(equal "key" "val") will create an EQUAL hash-table with one key-value pair
  • a syntax for heredoc-strings: #/this quote (") shouldn't be escaped/# (which, unfortunately, doesn't always work smoothly in the REPL)
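
Assuming one of these readtables is enabled (e.g. via named-readtables' in-readtable; the readtable name below is an assumption, check the RUTILS documentation for the exact one), usage looks roughly like this:

(named-readtables:in-readtable rutils-readtable)

(reduce ^(+ % %%) '(1 2 3 4 5))   ; => 15
(? #h(equal "key" "val") "key")   ; => "val"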

Overall, I have experimented a lot with naming — it was sort of an obsession in this work to find short and obvious names for new things, many of which substitute existing functionality, under the constraint of not altering anything that's already in the standard. For that sake, I've ventured into non-alphabetic symbols and even the keyword package — a major offence, I reckon... And here are a few of the findings I wanted to share (besides ? and with, mentioned previously):

  • call is a new alias for funcall — I suppose in the 70's it was a really fun experience to call a function, hence the name, but now it's just too clumsy
  • get#, set#, and getset# are aliases and new operations for hash-tables (for when you can't or won't use ? for that)
  • finally, the grandest mischief is := (alongside :+, :-, :*, :/), which is an alias for setf (and, you've guessed it, incf etc.). The justification is that everyone is confused by the -f suffix, that setting a variable is a very important operation that we should immediately notice in our clean and functional code ;), and that := is a very familiar syntax for it, even used by some languages, such as Pascal or Go. It may be controversial, but it's super-convenient.

The only thing I have failed to find a proper renaming for so far is mapcar. It is another one of those emblematic operations that should be familiar to everyone, yet -car creates confusion. For now, I resist the temptation to rename map into map-into and make map smarter by using the first sequence's type for the result expression. However, I haven't been able to find a plausible alternative even among the zoo of other languages' names for this concept. Any thoughts?

PS. Those were a few prominent examples, but RUTILS, in fact, has much more to offer. A lot of stuff was borrowed from other utility projects, as well as implemented from scratch: anaphoric operators, the famous iter — a replacement for loop, Clojure-style threading macros, a new semantic pair data type to replace cons-cells, lots of utilities for working with the standard data structures (sequences, vectors, hash-tables, strings) making them truly first-class, iteration with explicit indices, etc. With all that in the toolbox, there's now no ground to claim that Lisp is in any aspect inferior in terms of day-to-day UX to some other language, be it Haskell, Ruby or Clojure. Surely, I'm not talking about the semantic differences here.

2016-05-10

European Lisp Symposium 2016

For the last two days, I've been at ELS 2016. So far, it has been a great experience - I had actually forgotten the joy of being in one room with several dozen Lisp enthusiasts. The peculiarity of this particular event is that it sits somewhere in the middle between a scientific conference, like ACL, which I have had a chance to attend in recent years thanks to my work at Grammarly, and a tech gathering: it employs the same peer-reviewed approach and scientific presentation style you will find at research conferences, but most of the topics are very applied and engineering-related.

Anyway, the program was really entertaining, with several deep and insightful presentations (here are the proceedings). The highlights for me were the talk on a heterogeneous-sequences type checker implementation based on the Lisp declare facility (which I'm growing more and more fond of) by Jim Newton, and the presentation of an image-processing DSL, an excellent example of the state-of-the-art Lisp approach to DSL design, by Kai Selgrad. Other things, like the description of the editor buffers protocol and the local-variable preservation technique, were also quite insightful. And more good stuff is coming...

It's also great to hear new people bringing fresh ideas alongside old-timers sharing their wisdom and perspective - one of the things I appreciate in the Common Lisp community.

Near the end, I'm going to present a lightning talk about RUTILS and how I view it as a vehicle for evolving the Common Lisp user experience.

2016-01-03

KPI no more

This is the last year I'm teaching the "Operating Systems" course at KPI. Going through my correspondence archive, I concluded that I started doing it in 2009, i.e. 6 years ago, and over that time more than 500 students have passed through my course. It turns out I've now worked as a teacher for longer than I studied at the institute; in other words, my debt to my alma mater can be considered paid :)

In this text I want to describe my experience and answer the question of why I decided to leave. Let's start from the end: because, by and large, nobody except me needs it. And every year I have less and less time for doing things nobody needs.

And now, in order.

Six years ago I started building this course from a clean slate. What I liked was having complete carte blanche for experiments and for the choice of teaching methods, which gave me the opportunity to try many things and understand what works (for me) and what doesn't. For the students this wasn't always beneficial, since far from all experiments were successful, but I still think it was better than what we had in my time, when this course was simply left to coast along (that, in fact, was my original motivation for coming to teach it). Every year I gave myself a notional grade for the work done, and if at the start it was a solid 2 (out of 5), now I consider it at least a 4 or 4+ (last year I also asked the students to grade me, and their average also came out to about a 4).

Overall, with this course I pursued the following goals:

  • introduce students to the systems programming paradigm, and show its interesting sides and problems
  • give them an alternative view of software development (beyond the OOP and C# that are the mainstays at our department)
  • teach them to work with a Unix environment and the console
  • teach them not to be afraid of assembly and low-level things in general
  • give them a little exposure to new programming languages that are relevant to systems development
  • introduce them to open source
I didn't achieve all of these goals within the course in any given year: sometimes one thing worked out better, sometimes another, but I think they are all still relevant. Moreover, with some determination, this course could easily serve as the basis for an advanced sequel, "OS Development" (an analog of MIT's Operating System Engineering). Over the 6 years I managed to gather enough material, which is available in the form of lecture notes and assignments (as well as the source code for them). What's more, as a by-product I developed a small system for publishing the materials (my own leanpub), as well as one for testing the theory. The only thing I never got around to is a system for automatically checking the lab assignments. If one could figure out how to scale such a thing effectively, it would truly be the holy grail, but, unfortunately, the main work there, it seems, would still consist of writing the tests by hand. And the teacher's first law is: manual grading of practical assignments is the biggest motivation killer...

Why does nobody at KPI need this?

Four years ago I wrote a column on DOU, "Ukrainian Stanford", in which I laid out my limited view of the problems, challenges, and opportunities of our higher technical education. Since then the situation at KPI specifically has rather deteriorated: many of the old professors have left, and many of the good young teachers also didn't stay, for various reasons, so in another 5 years or so, it seems, the only people left teaching at FICT will be those with no particular connection to either industry or science. That doesn't mean the situation is the same everywhere, though: it seems to me that nowadays 1-2 modernization-minded people in the leadership of a faculty or department are enough to organize quite decent education there (everything else is, in principle, in place). In other words, from my point of view the main brake on change at the university right now is its leadership (at all levels). In this regard it will be interesting to see how the situation develops at KhNURE, where Eduard Rubin has become rector...

What's also interesting is that outwardly KPI is changing for the better. The campus is developing well, and students have actively joined this process with projects like BelkaVezha or Radio KPI. There's a lot of activity in cultural and social life (despite the inertia of the system). The Summer School, which has been running for 10 years now and in which I took part for the last 2 years, is an excellent initiative. Last year we even held TEDxKPI! Usually such positive external manifestations follow a change in internal processes, but here the dynamic is somewhat different: a university is by definition an open system that constantly receives an influx of new blood, and within this new blood the same trends are at work as across the whole country ("just go and do it"). But, at the same time, with the same ease with which new energy arrives here, it doesn't stay for long: active students have almost no incentive to stay on for graduate studies, and for many teachers this is effectively volunteer work (for example, my salary at KPI is more than ten times less than what I can earn in industry), which is almost never sustainable. That is exactly why I say the situation could be changed with the right orientation of the leadership: towards improving the quality of the students (at the expense of their quantity), supporting new initiatives, and reducing bureaucracy.

Unfortunately, however, there is no strategic vision for the university's development.

From my point of view, there are 2 working models of higher education:

  • an elitist one, where a small number of the most talented and motivated students are selected and the teacher works with them individually (there's a good anecdote about an astrophysicist who taught his course at some remote observatory to just two students, both of whom later became Nobel laureates)
  • a mass one, where a large group of students is enrolled and all of them are brought up to some baseline level

Obviously, KPI implements the second approach: the last second-year cohort I taught has about 100 students, and there are several such cohorts in computer science at the university. But for a mass system to work, it has to actually be a system (one with no weak links). First of all, the students must be motivated, by the carrot-and-stick principle.

The positive motivation should be the desire to obtain a modern and promising profession, while doing interesting and fun things along the way; the negative one should be a real possibility of being expelled. Unfortunately, at KPI both of these motivations are underdeveloped. By and large, in our realities at least 20% of students should be expelled in the first and second years (roughly that many people in my course do nothing at all during the semester and start trying to hand something in, at best, towards its end, if not at exam time). But there is no systematic policy of expelling; rather, on the contrary, there is sluggish resistance towards those who try to do it.

As for positive motivation, at KPI it is drowned out by the Soviet-style tradition of this university, passed down from generation to generation, which can be summed up in a few memes popular among students:

  • the main thing is to pass the test/the exams/get the diploma (and any means are good for that, including cheating)
  • what we are taught will never be needed in real life
  • a student's best friend is the freebie (getting a pass without doing the work)
But the good news is that the demand for quality technical education exists and isn't going anywhere, and the more the existing institutions degrade, the more room will open up for new undertakings and new forms. I think that within the next 5 years we will see at least one new-style technical "university" here where one will be able to get a world-class education (as I wrote, everything needed for that is, in principle, in place, except for a certain amount of leadership). I also hope that I will be able to help with that in some way. But to start something new, you first have to finish the old. The only pity is that I'll miss the KPI park, its poplars and chestnut trees...

2015-06-26

Running Lisp in Production at Grammarly


We have written a blog post describing almost 3 years of our Lisp in production experience at Grammarly. Here's a small abstract for it.

At Grammarly, the foundation of our business, our core grammar engine, is written in Common Lisp. It currently processes more than a thousand sentences per second, is horizontally scalable, and has reliably served in production for almost 3 years.

We noticed that there are very few, if any, accounts of how to deploy Lisp software to modern cloud infrastructure, so we thought that it would be a good idea to share our experience. The Lisp runtime and programming environment provides several unique, albeit obscure, capabilities to support production systems (for the impatient, they are described in the final chapter).

Continue to the full text »

2015-04-14

Announcing SHOULD-TEST


Once upon a time, it occurred to me that all sound software should be slightly self-ironic. That is how this library's name came into being: yes, you should test even Common Lisp code sometimes. :) But that's not the whole irony...

So, y u makes YATF?

Testing software has always fascinated me because it is both almost always necessary and at the same time almost always excessive - it's extremely hard to find the right amount of resources you should allocate to it. You will most likely end up being afraid to change your system either because you have too few tests and some of the important scenarios aren't covered, or because you have too many tests and you need to constantly re-write them. Surely, in Lisp the problem is not so drastic because in many cases you can rely on the REPL to help, but it's not a one-size-fits-all solution. There's also too much dogma in the space of general error handling in programming (which I addressed a little bit in this post). So, to find out how to test properly, around 7 years ago I wrote my first test framework, which was called NUTS (non-unit test suite). It worked OK, and I used it in a couple of projects, including the huge test suite of CL-REDIS that I'm really proud of. However, it was the first version, and you always have to re-write the first version. :) This is how MUTEST (microtest) appeared. With it, I was aiming to make a tool with the smallest footprint possible. It was also partially inspired by RT, which I consider to be the simplest (with a positive connotation) Lisp test framework (before ST). But both of them, MUTEST and RT, are not lispy, because they are not extensible, and it's a shame not to have extensibility in Lisp, which provides excellent tools for building it in.

Well, "version 2 always sucks, but version 3..." So, SHOULD-TEST is version 3, and I'm really happy with it. It's truly minimal and intuitive to the extreme: like in the popular BDD approach you just write (in Yodaspeak, obviously): should be = 1 this-stuff and then st:test. And it's extensible - you can add specialized assertion strategies to the provided 3 basic ones: normal testing, exception catching, and capturing output streams.

I wasn't content with the existing Lisp test frameworks because they aren't concerned first and foremost with the things I care about:

  • intuitive defining and running arbitrary tests
  • testing from the REPL and ease of analyzing the test output
  • piping the test output to upstream systems like CI (by supporting common protocols, such as xUnit and TAP)

These are the 3 things that SHOULD-TEST should do the best.

Over more than a year, I have written or re-written with it the test suites for the main open-source libraries I support - RUTILS, CL-REDIS, and CL-NLP (which doesn't have extensive test coverage yet). And I also use it for all my in-house projects.

Working with ST

Here's a quick overview of the SHOULD-TEST workflow.

Test are defined with deftest:

(deftest some-fn ()
  (should be = 1 (some-fn 2))
  (should be = 2 (some-fn 1)))

When run, the defined test returns either T or NIL as its primary value. In the case of NIL, the secondary and tertiary values are lists of:

  • all failed assertions returned by individual assertions
  • and all uncaught errors signaled inside assertions

should is a macro that takes care of checking assertions. If the assertion doesn't hold, should signals a condition of type should-failed or should-erred, which is aggregated by deftest. Also, should returns either T or NIL and a list of the failed expression with the expected and actual outputs as values.

Under the hood, should calls the generic function should-check and passes it a keyword produced from the first symbol (in this case, :be), a test predicate (here, '=), the tested expression as a thunk (here it will be e.g. (lambda () (some-fn 1))), and the expected results, if any. If multiple expected results are given, like in (should be eql nil #{:failed 1} (some-other-fn :dummy)), it means that multiple values are expected. As you see, the keyword and test predicate are passed unevaluated, so you can't use expressions here.
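
Based on this protocol, adding your own assertion strategy could look roughly like the following sketch. The assertion kind satisfy is made up for this example, and the method signature and return convention are assumptions derived from the description above, so they should be verified against the SHOULD-TEST source:

;; sketch only: a hypothetical `satisfy' assertion kind, so that
;; (should satisfy evenp (* 2 x)) checks a unary predicate against
;; the value of the tested expression
(defmethod should-check ((key (eql :satisfy)) predicate fn &rest expected)
  (declare (ignore expected))
  (funcall predicate (funcall fn)))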

The pre-defined types of assertions are be, signal, and print-to. They check, correspondingly, that a predicate holds, that a condition is signaled, and that the expected output is printed to a stream.

deftest and should write the summary of test results to *test-output* (by default bound to *standard-output*). The var *verbose* (default T) controls if the summary contains full failure reports or just test names.

Tests are defined as lambda-functions attached to a symbol's test property, so (deftest some-fn ... will do the following:

(setf (get 'some-fn 'test)
      (lambda () ...))

One feature that is pending implementation is establishing dependencies between tests while defining them, i.e. specifying the partial order in which they should be run. However, I haven't seen heavy demand for it in my test code so far.

To run the tests, use test. Without arguments, it runs all the tests in the current package. Given a :package argument, it will do the same for that package, and given a :test argument, it will run that individual test. In the case of an individual test's failure, it will return NIL, a list of failed assertions, and a list of assertions that triggered uncaught errors. In the case of a failed test of a package, it will return NIL and 2 hash-tables holding the same lists as above, keyed by the failed tests' names.

As you see, the system uses a somewhat recursive protocol for test results:

  • at the lowest level should returns T or NIL and signals information about the failed assertion
  • this information is aggregated by deftest which will return aggregate information about all the failed assertions in the hash-table
  • at the highest level test will once again aggregate information over all tests

So, the structure of the summary, returned from test, will be the following:

#{
  failed-test-1 ((failed-assertion-1 expected actual)
                 (failed-assertion-2 ...
  failed-test-2 ...
 }

There's also a :failed key to test that will re-run only the tests which failed at their last run.

Usage patterns

As SHOULD-TEST is agnostic, it doesn't impose any restrictions on how each project organizes its tests. Yet, having established patterns and best practices never hurts. Below is the approach I use...

There's no restriction on naming tests, though it seems like a good approach to name them the same as the functions they test. As for generic functions, I have different tests for different methods. In that case, I add a suffix to the test's name to indicate which method is tested (like transform-string for one of the methods of the generic function transform that is specialized on the string class of arguments).

As for code organization, I use the following directory structure of the typical project:

/project-root
 |----src
 |    `----module
 |         `-----file.lisp
 `----test
      |----some-general-tests.lisp
      `----module
           `-----file-test.lisp

I also usually place the tests in the same package as the code they test but protect them with a #+dev guard, so that in a production environment they are not compiled and loaded at all.
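
For instance (assuming :dev is pushed onto *features* only in the development environment, and with a made-up function under test):

;; compiled and loaded only when :dev is present in *features*
#+dev
(deftest frobnicate ()
  (should be = 42 (frobnicate 6 7)))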

ASDF provides a way to define the standard for testing a system that can be invoked with asdf:test-system. The easiest way to hook into this facility is to define the following method for asdf:test-op somewhere either in package.lisp or in some common file in the test module (in the example above: some-general-tests.lisp):

(defmethod asdf:perform ((o asdf:test-op)
                         (s (eql (asdf:find-system <your-system>))))
  (asdf:load-system <your-system>)
  (st:test :package <your-package>)
  t)

There's also a minimal test suite defined in src/self-test.lisp. The test suite is also hooked to asdf:test-op for the should-test system - just as described above :)

Finally, there's an idea that ST will provide useful connector facilities that are mostly lacking in the existing Lisp test frameworks, to be able to integrate into the general testing landscape (primarily, CI systems). As a start, xUnit support was implemented by us (most of the thanks go to Maxim Zholoback). As often happens, it was actually almost impossible to find the proper xUnit spec, but this SO answer saved the day for us. test-for-xunit generates the appropriate XML string to *test-output*. I also plan on implementing TAP support some day (it should be pretty easy, actually), but I'm not in a hurry.

Well, if SHOULD-TEST proves useful to some of you, I'd be glad. Enjoy the hacking!

2015-04-07

A crash course on Lisp - who might find it interesting


In July, my master class introducing practical development in Common Lisp is scheduled to take place. On this occasion I was asked to write an article for the blog of SmartMe, the company organizing the event. In it I tried to answer the question of who might be interested in getting to grips with Lisp these days, and why.

Lisp is one of the oldest and arguably the most mysterious of modern programming languages. There's also a common opinion that it is not just old but obsolete. Why that isn't so, and where its niche lies, is what I'll try to answer in this article.

Lisp is perhaps the only dynamic systems programming language. And among dynamic languages it remains an unrivaled choice thanks to the following properties:
  • true multi-paradigm support, making it possible to elegantly combine procedural, object-oriented, functional and other styles
  • unique support for metaprogramming
  • a legendary interactive development environment
  • a rock-solid language standard, backed by years of work by the great minds of MIT, Xerox PARC, CMU and other such places, paid for by DARPA
  • an extensive set of implementations (compiler plus runtime): around 25 have been developed over its history, up to 10 of which are actively maintained and developed
In fact, Lisp's current position as a language that is usually not considered for serious development is due not to technical reasons at all, but merely to a historical accident — the language was far ahead of its time — and to the human factor: it is neither the default choice (like C++, Java or JavaScript), nor a fashionable new technology (like Scala, Go or Ruby), and it has no serious organization or community behind it promoting its use (like C#, Swift or Rust). Nevertheless, the myth of Lisp's impracticality is refuted both by my experience of using it in the core of Grammarly and in my previous commercial projects (for more than 7 years now), and by the experience of the airfare search engine ITA Software, bought by Google for a billion dollars, or of the Portuguese company Siscog, a developer of solutions for railways that employs more than fifty Lisp programmers. And adepts of the theory that it needs modernization can go read the Changelog of SBCL (the leading open-source implementation) :)

Of course, Lisp has its shortcomings too — besides the small community, consisting mostly of enthusiasts, these are:
  • unfamiliar syntax
  • often unfamiliar approaches and ways of development
  • a lack of libraries for interacting with the surrounding environment (the Quicklisp project has long since proven otherwise :)
Thus, to repeat once more: the Common Lisp language and ecosystem have no serious technical shortcomings, along with a number of indisputable advantages, but the language is too unfamiliar and atypical, so it suffers from a chicken-and-egg problem: the lack of Lisp programmers prevents serious projects from being started in it, and the lack of momentum in the community doesn't bring new programmers into it. Therefore, Lisp is unlikely to be seriously used in the software industry any time soon. What, then, is its niche today? Leaving trends aside, I would say that, first of all, it is systems that are written to last for years and must constantly evolve: here it sits at the golden mean between classical systems languages, like C++ and Java, and their dynamic competitors, providing an incredibly flexible and, at the same time, sufficiently performant (both in execution speed and in development speed) environment. Especially if such systems need the means to represent and process large amounts of knowledge. Greenspun's tenth rule is precisely about them:
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
However, very few people would dare to write such projects in Lisp nowadays. Another use of it is more immediately relevant: rapid prototyping and an environment for experiments. In this niche Lisp has many competitors, such as Python or specialized languages like R and MATLAB, but Lisp's advantage is that a successful prototype can, over time, be brought up to a production system. The most important thing about Lisp for me, though, is the complete creative freedom it provides. Jokes aside, access to such an environment allows a programmer to grow not simply by accumulating experience with particular tools and frameworks, but by solving non-standard problems and trying to find the best approach to them, independent of the accidental constraints imposed by current circumstances and accepted norms.

2014-09-03

How to write an English POS tagger with CL-NLP

State-of-the-art

The problem of POS tagging is a sequence labeling task: assign each word in a sentence the correct part of speech. For English, it is considered to be more or less solved, i.e. there are taggers with around 95% accuracy. It also has a rather high baseline: assigning each word its most probable tag will give you up to 90% accuracy to start with. Also, only around 10% of the words have more than 1 possible tag; however, their usage accounts for around 40% of the total text volume.

Let's start our exercise by first verifying these facts in a data-driven manner. This will also allow us to examine the basic building blocks for our solution.

POS taggers are usually built as statistical models trained on some existing pre-labeled data. For the English language there is quite a lot of it already available. For other languages that don't possess such data, an unsupervised or a rule-based approach can be applied.

Data sources

The standard dataset that is used not only for training POS taggers but, most importantly, for their evaluation, is the Penn Tree Bank Wall Street Journal dataset. It contains not only POS tags, but also noun phrase and parse tree annotations.

Here's an example of the combined POS tag and noun phrase annotations from this corpus:

[ Pierre/NNP Vinken/NNP ]
,/,
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB
[ the/DT board/NN ]
as/IN
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
./.

The tagset used in the annotation contains such symbols as NNP for proper nouns, , for commas, and CD for cardinal numbers. The whole set is provided for CL-NLP in the file src/syntax/word-tags.txt. Here's a snippet from it:

X Unknown, uncertain, or unbracketable. X is often used for bracketing typos and in bracketing the...the-constructions.
CC Coordinating conjunction
...

It's, obviously, possible to extend it with other tags if necessary. All of them are, finally, available as symbols of the tag package in CL-NLP.

Available data and tools to process it

What we're interested in is obtaining a structured representation of this data. The ncorp package implements interfaces to various raw representations, such as this one.

Different NLP corpora exist in various formats:

  • slightly annotated text, like the above
  • XML-based formats (a well-known example is the Reuters corpus)
  • a treebank format used for representing syntax trees
  • and many others

Most of these representations are supported by the ncorp adapters at least to some extent. The interface of this module consists of the following entities:

  • a text structure for representing an individual text of the corpus (most of the corpora are, thankfully, divided into separate files)
  • a generic function read-corpus-file that should read a raw file and return as multiple values its several representations. The common representations supported by almost all its methods are: raw or almost raw text, clean text - just sentences stripped of all annotations, and tokens - a list of token lists extracted from each sentence. Additionally, other corpus-specific slots may be present. For example, a treebank-text structure adds a trees slot to hold the syntax trees extracted from the treebank
  • a generic function read-corpus creates a corpus structure out of all corpus' texts. In practice, it's usually not feasible to do that due to large sizes of most of the useful corpora. That's why the next function is more practical
  • a generic function map-corpus reads and streams each corpus file as a text structure to the argument function. This is the default way to deal with corpus processing

For our task we'll be able to utilize just the tokens slot of the ptb-tagged-text structure, produced with map-corpus. Let's collect the tag distribution for each word from the WSJ section of the PTB:

NLP> (let ((words-dist #h(equal)))
       (map-corpus :ptb-tagged (corpus-file "ptb/TAGGED/POS/WSJ")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (unless (in# (token-word tok) words-dist)
                           (:= (get# (token-word tok) words-dist) #h()))
                         (:+ (get# (token-tag tok)
                                   (get# (token-word tok) words-dist)
                                   0))))
                   :ext "POS")
       words-dist)
#<HASH-TABLE :TEST EQUAL :COUNT 51457 {10467E6543}>
NLP> (reduce #'+ (mapcan #'ht-vals (ht-vals *)))
1289201

So, we have around 50k distinct words and 1.3m tokens.

Actually, though, the field's resources have made some progress in the last decades, and there's a bigger corpus now available that contains not only the whole Penn Tree Bank, but also some more data from other domains. The annotations of the WSJ section in it were also improved. It is called OntoNotes. Let's do the same with its data:

NLP> (let ((words-dist #h(equal)))
       (map-corpus :treebank (corpus-file "ontonotes/")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (with-accessors ((tag token-tag) (word token-word)) tok
                           (unless (eql tag 'tag:-NONE-)
                             (unless (in# word words-dist)
                               (:= (get# word words-dist) #h()))
                             (:+ (get# tag (get# word words-dist) 0))))))
                   :ext "parse")
       words-dist)
 #<HASH-TABLE :TEST EQUAL :COUNT 60925 {1039EAE243}>

So, in OntoNotes 4.0 there are 60925 distinct words. 50925 of them (~84%) are tagged with a single tag, i.e. we have 16% of multi-tag words, which corresponds well with the theoretical data. Also, there are 2.1m tokens in the corpus in total.

Calculating the number of words with more than one tag:

(count-if-not #`(= 1 (ht-count (rt %)))
              (ht->pairs words-dist))

And what about the total volume?

NLP> (let ((total1 0)
           (total 0))
      (map-corpus :treebank "ontonotes"
                  #`(dolist (sent (text-tokens %))
                      (dolist (tok sent)
                        (unless (eql (token-tag tok) 'tag:-NONE-)
                          (:+ total)
                          (when (in# (token-word tok) *single-tag-words*)
                            (:+ total1)))))
                  :ext "parse")
      (float (/ total1 total)))
0.2409884

Only 24% instead of 60%! What's wrong?

OK, here's the trick: let's add the words that have more than 1 tag, but whose occurrences are labeled with a single tag more than 99% of the time. For instance, the word "the" has 9 distinct tags in OntoNotes, but 0.9997 of the time it's a DT.

If we consider such words to have a single tag, we'll get just a slight increase in the number of single-tag words (+386: 51302 instead of 50916), but a dramatic increase in the volume of their occurrences - now it's 63%! Just as the literature tells us.

(NB. Taking such a shortcut will only increase the quality of the POS tagger, as 99% is above the accuracy it will be able to achieve anyhow, which is at most 97% on same-domain data and even lower for out-of-domain data.)

Here's how we can determine such a set of words:

(remove-if-not #`(let ((tag-dist (ht-vals (rt %))))
                   (> (/ (reduce #'max tag-dist)
                         (reduce #'+   tag-dist))
                      0.99))
               (ht->pairs words-dist))

NB. The above code samples contain some non-standard utilities and idioms that may look somewhat alien to some Lisp programmers. All of them are from my RUTILS library, and you'll see more below. Mostly, these include some hash-table-specific operators, new iteration constructs, a few radical abbreviations for common operations, and literal syntax for hash-tables (#h()) and anonymous functions (#`()).

Some of them are:

  • get#/in#/set# specialized hash-table access routines, and dotable hash-table iteration
  • a pair data type with lt/rt accessors and conversion routines to/from hash-tables
  • the ? generic element access operator with support for generic setf, plus the := abbreviation for setf (which uses a common assignment symbol) and :+ as an analog of incf (see the short sketch right after this list)
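
To make these concrete, here is a tiny artificial example combining the idioms above (the data is made up):

;; counting occurrences with a literal EQUAL hash-table, ? access, and :=/:+
(let ((counts #h(equal)))
  (:= (? counts "cat") 0)   ; setf through the generic accessor
  (:+ (? counts "cat"))     ; incf analog
  (get# "cat" counts))      ; => 1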

Building the POS tagger

We have explored how to access the different corpus data we'll need to train the POS tagger. To actually do that, we'll re-implement the approach described by Matthew Honnibal in "A good POS tagger in about 200 lines of Python". In fact, due to the expressiveness of Lisp and the efficiency of SBCL, we'll need even fewer than 200 lines, and we'll get performance comparable to the much more complex Cython implementation of the parser (6s against 4s on 130k tokens), but those are details... ;)

Here's the source code we'll be discussing below on github.

Our tagger will use a greedy averaged perceptron model with single-tag words dictionary lookup:

(defclass greedy-ap-tagger (avg-perceptron tagger)
  ((dict :initform #h(equal) :initarg :dict :accessor tgr-dict)
   (single-tag-words :initform #h(equalp) :initarg :single-tag-words
                     :accessor tgr-single-tag-words))
  (:documentation
   "A greedy averaged perceptron tagger with single-tag words dictionary lookup."))

As you see, it is derived from a generic class tagger and an avg-perceptron learning model. It also has a dict slot that holds all the words known to the tagger.

Every tagger has just one generic function associated with it. You guessed its name - tag :)

(defmethod tag ((tagger greedy-ap-tagger) (sentence sentence))
  (let ((prev :-START-)
        (prev2 :-START2-)
        (ctx (sent-ctx sentence)))
    (doindex (i token (sent-tokens sentence))
      (:= (token-tag token)
          (classify tagger
                    (extract-fs tagger i (token-word token) ctx prev prev2))
          prev2 prev
          prev (token-tag token)))
    sentence))

It accepts an already tokenized sentence and (destructively) assigns tags to each of its tokens.

The main job is performed by the call to classify method that is defined for every statistical learning model in CL-NLP. Another model-associated method here is extract-fs which produces a list of features that describe the current sample.

Now, let's take a look at the implementation of these learning model-related methods.

(defmethod classify ((model greedy-ap-tagger) fs)
  (or (get# (first fs) (tgr-single-tag-words model))
      (call-next-method model (rest fs))))

For the tagger, we first check the current word against the dictionary of single-tag-words that we've built in the previous part, and then call the classify method of the avg-perceptron model itself. That one simply returns the class that is the argmax of the dot product between the model weights and the sample features fs, which in this case can only have values of 1 or 0.

(defmethod classify ((model greedy-ap) fs)
  (let ((scores #h()))
    (dotable (class weights (m-weights model))
      (dolist (f fs)
        (:+ (get# class scores 0) (get# f weights 0))))
    (keymax scores)))  ; keymax is argmax for hash-tables

As you see, averaged perceptron is very straightforward - a simple linear model that has a weights slot which is a mapping of feature weights for every class. In the future this method will probably be assigned to a linear-model class, but it hasn't been added to CL-NLP so far.

Training

Let's take a look at the training part. It consists of 2 steps. extract-fs performs feature extraction. What it, basically, does in our case of a simple perceptron model is returning a list of features preceded by the word we're currently tagging.

(cons word (make-fs model
                    "bias"
                    ("i pref1" (char word 0))
                    ("i word" word)
                    ("i-1 tag" prev-tag)
                    ...

The make-fs macro is responsible for interning the features as symbols in the package f by concatenating the provided prefixes and the calculated variables. Using symbols instead of raw strings for such things is standard Lisp practice. So, in the above example, the word "the" preceded by a word tagged as VBZ will get the following list of features:

'("the" f:|bias| f:|i pref1 t| f:|word the| f:|i-1 tag VBZ| ...)

The second part of learning is training proper. It is the most involved procedure here, yet still very simple conceptually. Just like in the tag method, we iterate over all tokens, preceded by the dummy :-START2- and :-START- ones, and guess the current tag using classify. Afterwards we update the model's weights in train1. The only difference is that we first need to explicitly handle the case of single-tag-words, so as not to run the model update step for them.

This is how it all looks modulo debugging instrumentation. Note the use of psetf to update prev and prev2 simultaneously.

(defmethod train ((model greedy-ap-tagger) sents &key (epochs 5))
  (with-slots (single-tag-words dict) model
    ;; expand dict
    (dolist (sent sents)
      (dolist (tok (sent-tokens sent))
        (set# (token-word tok) dict nil)))
    ;; expand single-tag-words
    (dotable (word tag (build-single-tag-words-dict (mapcar #'sent-tokens sents)
                                                    :ignore-case? t))
      (unless (in# word single-tag-words)
        (set# word single-tag-words tag)))
    ;; train
    (dotimes (epoch epochs)
      (dolist (sent (mapcar #'sent-tokens sents))
        (let* ((prev :-START-)
               (prev2 :-START2-)
               (ctx (sent-ctx sent)))
          (doindex (i token sent)
            (let ((word (token-word token)))
              (psetf prev
                     (or (get# word single-tag-words)
                         (let* ((fs (extract-fs model i word ctx prev prev2))
                                (guess (classify model fs)))
                           (train1 model (rest fs) (token-tag token) guess)
                           guess))
                     prev2 prev)))))
      (:= sents (shuffle sents))))
  model)

Note the additional expansion of the single-tag-words dict of the model (as well as of the normal dict).
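
Putting it all together, training boils down to a couple of calls once the data is loaded (a hypothetical usage snippet; *gold-train* stands for the training sentences prepared as described in the previous part):

(defvar *tagger* (make 'greedy-ap-tagger))
(train *tagger* *gold-train* :epochs 5)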

An interesting feature of the problem's object-oriented decomposition in this case is that we have generic perceptron machinery we'd like to capture and reuse for different concrete models, as well as model-specific implementation details.

This dichotomy is manifested in the training phase:

  • the train method is specific to the greedy-ap-tagger. The generic perceptron training is much simpler, because it doesn't operate in a sequence labeling scenario (see the source code here)

  • however, there's also an :after method defined for train on the avg-perceptron model, which averages all the weights in the end and prunes the model by removing zero weights (a simplified sketch of it is given after the code below)

  • there are also 2 more methods that are not specialized for greedy-ap-tagger: train1 and update1. They perform 1 step of the normal perceptron training and model update

    (defmethod train1 ((model perceptron) fs gold guess)
      (:+ (ap-step model))
      (dolist (f fs)
        (ensure-f-init model f gold guess)
        (loop
           :for class :in (list gold guess)
           :for val :in '(1 -1) :do
           (update1 model f class val))))
    
    (defmethod update1 ((model avg-perceptron) f class val)
      (with-slots (step timestamps weights totals) model
        (:+ (? totals class f) (* (- step (? timestamps class f))
                                  (? weights class f)))
        (:+ (? weights class f) val)
        (:= (? timestamps class f) step)))
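
As for the :after averaging method mentioned above, here's a simplified sketch of what it does (an illustration, not the exact CL-NLP code): each weight becomes its time-weighted average over all training steps, and zero weights are pruned.

(defmethod train :after ((model avg-perceptron) sents &key epochs)
  (declare (ignore sents epochs))
  (with-slots (step timestamps weights totals) model
    (dotable (class class-weights weights)
      (let (zeros)
        (dotable (f w class-weights)
          (let ((avg (float (/ (+ (? totals class f)
                                  (* (- step (? timestamps class f)) w))
                               step))))
            (if (zerop avg)
                (push f zeros)          ; remember zero weights...
                (:= (? class-weights f) avg))))
        (dolist (f zeros)               ; ...and prune them after the iteration
          (remhash f class-weights))))))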
    

Evaluation & persisting the model

We have reached the last part of every machine learning exercise - evaluation. Usually it's about measuring precision/recall/F-measure, but in the tagger's case precision and recall are the same, because the sets of relevant and retrieved items coincide (every token receives exactly one tag), so we can calculate just the accuracy:

NLP> (accuracy *tagger* *gold-test*)
....................................................................................................
96.39183
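
In other words, accuracy here is simply the share of tokens whose guessed tag matches the gold one. A minimal sketch of such a measure over two tag lists (not the actual CL-NLP accuracy method):

(defun tag-accuracy (gold-tags guessed-tags)
  "Percentage of positions where GOLD-TAGS and GUESSED-TAGS agree."
  (* 100.0 (/ (count t (mapcar #'eql gold-tags guessed-tags))
              (length gold-tags))))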

A "gold" corpus is used for evaluation. This one was performed on the standard evaluation set which is the Wall Street Journal corpus (parts 22-24), OntoNotes 4.0 version. The model was also trained on the standard training set (0-18). Its results are consistent with the performance of the reference model from the blog post. The "gold" features where obtained by calling the extract-gold method of our model on the data from the treebank.

But wait, we can do more.

First, on the evaluation part. It has long been no secret in the NLP community that the WSJ corpus is far from representative of real-world use cases. And I'm not even talking about Twitter here: various genres of writing simply have different vocabularies and distributions of sentence structures. So the high baselines shown by many results on the WSJ corpus may not be that robust. To help with this kind of evaluation, Google and Yahoo have recently released another treebank called WebText that collects 5 different types of texts seen on the web: from dialogues to blog posts. It's smaller than the Penn Treebank: 273k tokens instead of 1.3m, with 23k distinct word types. If we evaluate on it, the accuracy drops substantially: only 89.74406!

What we can do is train on more data with better variability. Let's retrain our model on the whole OntoNotes (minus the evaluation set of WSJ 22-24). Here are the results:

  • on WSJ evaluation set: 96.76323 - a modest gain of 0.4%: we're already at max here
  • on WebText: 92.9431 - a huge gain of more than 4%!

So, broader data helps. What else can we do?

Another aspect we haven't touched on is normalization. There are some classes of arbitrarily generated tokens in English that lend themselves well to normalization to some root form. These include numbers, emails, urls, and hyphenated words. The normalization variant proposed by Honnibal is rather primitive and can be improved.

Here's the original variant:

(defmethod normalize ((model greedy-ap-tagger) (word string))
  (cond
    ((and (find #\- word) (not (char= #\- (char word 0))))
     "!HYPHEN")
    ((every #'digit-char-p word)
     (if (= 4 (length word)) "!YEAR" "!DIGITS"))
    (t (string-downcase word))))

And here's a modified one:

(defmethod normalize ((model greedy-ap-tagger) (word string))
  (cond-it
    ((re:scan *number-regex* word) (make-string (length word) :initial-element #\0))
    ((re:scan *email-regex* word) "!EMAIL")
    ((re:scan *url-regex* word) "!URL")
    ((in# word (tgr-dict model)) (string-downcase word))
    ((position #\- word :start 1 :from-end t)
     (let ((suffix (slice word (1+ it))))
       (if (in# suffix (tgr-dict model))
           (string-downcase suffix)
           "!HYPH")))
    (t (string-downcase word))))
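
The regexes referenced above are defined elsewhere in the code; purely for illustration (these are assumptions, not the exact CL-NLP definitions), they might look something like this:

(defparameter *number-regex* "^[0-9][0-9.,/:-]*$"
  "Numbers, dates, fractions and the like.")
(defparameter *email-regex* "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$")
(defparameter *url-regex* "^(https?://|www\\.)\\S+$")

;; With these, e.g. (normalize *tagger* "2014/05/17") would yield "0000000000".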

Such a change gains another 0.06% accuracy on the WebText corpus. So, the normalization improvement doesn't help that much here. However, I think it should be more useful in real-world scenarios.

Now, as we finally have the best model, we need a way to persist and restore it. The corresponding save-model/load-model methods exist for any categorical model. They use the handy ZIP and USERIAL libraries to save models into a single zip file, serializing textual (categories and feature names) and binary data (floating-point weights) into separate files. Here's what our serialized POS tagger model looks like:

  Length  File
--------  --------------------
     552  classes.txt
 4032099  fs.txt
 2916012  fs.bin
 2916012  weights.bin
   35308  single-tag-words.txt
  484712  dict.txt
--------  --------------------
10384695  6 files
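
Usage-wise, persisting and restoring is just a pair of calls along these lines (a sketch: the argument order of save-model is my assumption, while load-model is called the same way as in the test below):

;; persist the trained model
(save-model *tagger* (models-file "pos-tagging/onf.zip"))

;; ...and restore it later into a fresh instance
(load-model (make 'greedy-ap-tagger)
            (models-file "pos-tagging/onf.zip")
            :classes-package :tag)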

Finally, I believe it's an essential practice to make all results we post online reproducible, but, unfortunately, there are restrictions on the use of the Penn Treebank corpus data, so we can't just add an automated test that will reproduce the contents of this post. Still, a small portion of the OntoNotes WSJ corpus can be used under the fair use policy, and it is provided with CL-NLP for evaluation purposes.

Let's add such a test to give the users confidence in the performance of our model. For testing CL-NLP I'm using yet another library of my own called SHOULD-TEST - I'll devote a separate blog post to it some time in the future.

Here's a test we need:

(defun extract-sents (text)
  (mapcar #`(make 'ncore:sentence :tokens (ncorp:remove-dummy-tokens %))
          (ncore:text-tokens text)))

(defvar *tagger* (load-model (make 'greedy-ap-tagger)
                             (models-file "pos-tagging/onf.zip")
                             :classes-package :tag))
(defvar *gold*
  (let (test)
    (ncorp:map-corpus :treebank (corpus-file "onf-wsj/")
                      #`(appendf test (extract-sents %)))
    (extract-gold *tagger* test)))

(deftest greedy-ap-tagger-quality ()
  (should be = 96.31641
          (accuracy *tagger* *gold*)))

Summing up

In this article I've tried to describe the whole process of creating a new statistics-based model using CL-NLP. As long as you have the necessary data, it is quite straightforward and routine.

If you want to use one of the existing models (namely, the greedy averaged perceptron, as of now) you can reuse almost all of the machinery and just add a couple of functions to reflect the specifics of your task. I think it's a great demonstration of the power of the generic programming capabilities of CLOS.

Obviously, feature engineering is on you, but training/evaluation/saving/restoring the model can be handled transparently by CL-NLP tools. There's also support for common data processing and calculation tasks.

We have looked at some of the popular corpora in this domain (all of which, unfortunately, have some usage restrictions and are not readily available, but can be obtained for research purposes). And we've observed some of the factors that impact the performance and robustness of machine learning models. I'd say that our final model is at a production-ready, state-of-the-art level, so you can safely use it for your real-world tasks (under the licensing restrictions of the OntoNotes dataset used for training it). Surely, if you have your own data, it should be straightforward to re-train the model with it.

You can also add your own learning algorithms, and I'm going to continue doing the same.

Stay tuned and have fun!