2018-11-27

Structs vs Parametric Polymorphism

Recently, Tamas Papp wrote about one problem he had with Lisp in the context of scientific computing: that it's impossible to specialize methods on parametric types.
While you can tell a function that operates on arrays that these arrays have element type double-float, you cannot dispatch on this, as Common Lisp does not have parametric types.
I encountered the same issue while developing the CL-NLP Lisp toolkit for natural language processing. For instance, I needed to specialize methods on sentences, which may come in different flavors: as lists of tokens, vectors of tokens, lists of strings, or some more elaborate data structure with attached metadata.

Here's some example code. There's a generic function that performs various tagging jobs (POS, NER, SRL etc.). It takes two arguments: the first, as with all CL-NLP generic functions, is the tagger object, which is used for algorithm selection and configuration, as well as for storing intermediate state when necessary. The second one is the sentence being tagged. Here are two of its possible methods:
(defmethod tag ((tagger ap-dict-postagger) (sent string)) ...)
(defmethod tag ((tagger ap-dict-postagger) (sent list)) ...)
The first method processes a raw string, which means that we should invoke some pre-processing machinery to tokenize it and then, basically, call the second method, which performs the actual tagging of the resulting tokens. So, list here means a list of tokens. But what if we already have the tokenization, yet haven't created the token objects, i.e. a list of strings is supplied as the input to the tag method? The CLOS machinery doesn't have a way to distinguish the two cases, so we'd have to resort to using typecase inside the method (sketched below), which is exactly what defmethod is supposed to replace with a transparent and extensible alternative. In most other languages, we'd have to stop here and just accept that nothing can be done: after all, it's a local nuisance and not a game changer for our code (although Tamas refers to it as a game changer for his). In Lisp, we can do better.

Thinking about this problem, I see at least three solutions with varying levels of elegance and portability. Surely, they may seem slightly inferior to having such a capability built directly into the language, but demanding to have everything built in is unrealistic, to say the least. Instead, having a way to build it ourselves is the only future-proof and robust alternative. And this is what Lisp is known for.

The first approach was mentioned by Tamas himself:
You can of course branch on the array element types and maybe even paper over the whole mess with sufficient macrology (which is what LLA ended up doing), but this approach is not very extensible, as, eventually, you end up hardcoding a few special types for which your functions will be "fast", otherwise they have to fall back to a generic, boxed type. With multiple arguments, the number of combinations explodes very quickly.
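In the context of our tagging example, that branching approach (before any macrology) might look something like the following sketch. The helpers here are hypothetical, not actual CL-NLP code, and a tok structure for tokens is assumed:
(defmethod tag ((tagger ap-dict-postagger) (sent list))
  ;; CLOS dispatch got us as far as LIST; TYPECASE has to do the rest
  (etypecase (first sent)
    (string (tag tagger (mapcar 'string->tok sent)))  ; hypothetical pre-processing helper
    (tok (tag-tok-list tagger sent))))                ; hypothetical function doing the real tagging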
Essentially, rely on typecase-ing but use macros to blend it into the code in the most non-intrusive way, minimizing boilerplate. This is a straightforward path in Lisp, but it has its drawbacks for long-running projects that need to evolve over time. Still, it remains a no-brainer for custom one-offs, which is why, usually, few venture further to explore other alternatives. The second solution was mentioned in the Reddit discussion of the post:
Generics dispatching on class rather than type is an interesting topic. I've definitely sometimes wanted the latter so far in doing CL for non-scientific things. It is certainly doable to make another group of generics that do this using the MOP.
I.e. use the MOP to introduce type-based generic dispatch. I won't discuss it here but will say that similar things were tried in the past quite successfully: ContextL and layered functions are some of the examples. Yet, the MOP path is rather heavy and has portability issues (as the MOP is not in the standard, although there is the closer-mop project that unifies most of the implementations). From my point of view, its best use is for serious and fundamental extensions of the CL object system, not for solving a local problem that may occur in some contexts but is not so pervasive. Also, I'd say that the Lisp approach of (almost) not mixing objects and types is, conceptually, the right one, as these two facilities solve different sets of problems.

There's a third, much simpler, clearer, and portable solution that requires minimal boilerplate and, in my view, is best suited for this level of problem: to use structs. Structs are somewhat underappreciated in the Lisp world; not a lot of books and study materials give them enough attention. And that is understandable, as there's not a lot to explain. But structs are handy for many problems, as they are a hassle-free and efficient facility that provides some fundamental capabilities.

In its basic form, the solution is obvious, although a bit heavy: we'll have to define wrapper structs for each parametric type we'd like to dispatch upon, for example, list-of-strings and list-of-tokens. This looks a little stupid, and it is, because what's the semantic value of a list of strings? That's why I'd go for sentence/string and sentence/token (abbreviated below to sent/str and sent/tok), which is a clearer naming scheme. (Or, if we want to mimic Julia, sentence<string>.)
(defstruct sent/str
  toks)
Now, from the method's signature, we will already see that we're dealing with sentences in the tagging process, and we will be able to spot when some other tagging algorithm operates on paragraphs instead of words: let's say, tagging parts of an email with such labels as greeting, signature, and content. Yes, this can also be conveyed via the name of the tagger, but, still, it's helpful. And it's also one of the hypothetical fail cases for a parametric type-based dispatch system: if we had two different kinds of lists of strings that needed to be processed differently, we'd have to resort to similar workarounds in it as well. However, if we'd like to distinguish between lists of strings and vectors of strings, as well as more generic sequences of strings, we'll have to resort to more elaborate names, like sent-vec/str, as a variant.

It's worth noting, though, that, for the sake of producing efficient compiled code, only vectors of different types of numbers really make a difference. A list of strings or a list of tokens, in Lisp, uses the same accessors, so optimization here is useless, and type information may be used only for dispatch and, possibly, type checking. Actually, Lisp doesn't support type-checking of homogeneous lists, so you can't say :type (list string), only :type list. (Well, you can, actually, go through satisfies with a named predicate, as sketched after the struct definitions below, but what's the gain?)

Yet, using structs adds more semantic dimensions to the code than just naming. They may store additional metadata and support simple inheritance, which will come in handy when we'd like to track sentence positions in the text and so on.
(defstruct sent-vec/tok
  (toks #() :type (vector tok)))

(defstruct (corpus-sent-vec/tok (:include sent-vec/tok))
  file beg end)
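As for the aside about homogeneous lists: satisfies only accepts the name of a globally defined predicate, not a lambda form, so the checked variant would have to look something like this sketch (the names are made up):
;; a named predicate, since SATISFIES doesn't take lambdas
(defun list-of-strings-p (list)
  (and (listp list)
       (every 'stringp list)))

(deftype list-of-strings ()
  '(and list (satisfies list-of-strings-p)))

;; now a slot could be declared as (toks () :type list-of-strings),
;; but the check walks the whole list, so the gain is indeed questionable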
And structs are efficient in terms of both space consumption and speed of slot access.
So, now we can do the following:
(defmethod tag ((tagger ap-dict-postagger) (sent sent/str)) ...)
(defmethod tag ((tagger ap-dict-postagger) (sent sent/tok)) ...)
(defmethod tag ((tagger ap-dict-postagger) (sent sent-vec/tok)) ...)
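For completeness, here's roughly what the token-level structs assumed by these methods could look like; this is a sketch, and the actual CL-NLP definitions may differ:
;; a token with its text, character offsets, and POS tag
(defstruct tok
  word beg end pos)

;; a sentence as a plain list of TOK objects
(defstruct sent/tok
  toks)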
So, yes, we'll have to defstruct each parametric type we'd like to use but, as a result, we get the following clean and efficient dispatch:
(defgeneric tag (tagger sent)
  (:method (tagger (sent string))
    (tag tagger (tokenize *word-splitter* sent)))
  (:method (tagger (sent sent/str))
    ;; turn the raw strings into TOK objects that track their offsets
    (let ((off 0))
      (tag tagger (make-sent/tok
                   :toks (map* ^(prog1 (make-tok :word %
                                                 :beg off
                                                 :end (+ off (length %)))
                                 (:+ off (1+ (length %))))
                               @sent.toks)))))
  (:method ((tagger pos-tagger) (sent sent/tok))
    (copy sent :toks (map* ^(copy % :pos (classify tagger
                                                   (extract-features tagger %)))
                           @sent.toks))))

CL-USER> (tag *pos-tagger* "This is a test.")
#S(SENT/TOK :TOKS (<This/DT 0..4> <is/VBZ 5..7> <a/DT 8..9>
                   <test/NN 10..14> <./. 14..15>))
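Incidentally, continuing the session shows how much a bare defstruct provides with no extra code; everything below is standard CL, and the interaction is just illustrative:
CL-USER> (let ((sent (make-sent/str :toks '("This" "is" "a" "test" "."))))
           (list (sent/str-p sent)       ; auto-generated predicate
                 (sent/str-toks sent)    ; auto-generated accessor
                 (copy-sent/str sent)))  ; auto-generated copier
(T ("This" "is" "a" "test" ".") #S(SENT/STR :TOKS ("This" "is" "a" "test" ".")))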
Some of the functions used in the tag definition above (?, map*, and copy, as well as the @ and ^ reader macros) come from my RUTILS, which fills in the missing pieces of the CL standard library. Another advantage of structs, as just shown, is that they define a lot of things in the background: type-checking for slots, a readable print function, a constructor, a built-in copy-structure, and more.

In my view, this solution isn't any less easy to use than the statically-typed one (Julia's). There's a little additional boilerplate (the defstructs), which may even be considered to have a positive impact on the code's overall clarity. And yes, you have to write some boilerplate in Lisp sometimes, although not so much of it. Here's a fun quote on the topic I saw on Twitter a few days ago:
Lisp is an embarrassingly concise language. If you’re writing a bunch of boilerplate in it, you need to read SICP & “Lisp: A Language for Stratified Design”.
P.S. There's one more thing I wanted to address from Tamas's post:
Now I think that one of the main reasons for this is that while you can write scientific code in CL that will be (1) fast, (2) portable, and (3) convenient, you cannot do all of these at the same time.
I'd say that this choice (or, rather, the need to prioritize one over the others) exists in every ecosystem. At least, looking at his Julia example, there's no word about portability (citing Tamas's own words about the language: "At this stage, code that was written half a year ago is very likely to be broken with the most recent release."), while convenience may well hold for his current use case, but what if we need to implement, in the same system, features that deal with areas outside of numeric computing? I'm not so convinced. Or take Python, which is the go-to language for scientific computing: in terms of performance, the only viable solution is to implement the critical parts in C (or Cython). Portable? No. Convenient? Likewise. Well, as a user you get convenience, speed, and portability (although a pretty limited one). But at what cost? I'd argue that developing the Common Lisp scientific computing ecosystem to a similar level of quality would have required only 10% of the effort that went into building numpy and scipy...
