Programming Algorithms: Compression

Compression is one of the tools that every programmer should understand and wield confidently. Such situations when the size of the dataset is larger than the program can handle directly and it becomes a bottleneck are quite frequent and can be encountered in any domain. There are many forms of compression, yet the most general subdivision is between lossless one which preserves the original information intact and lossy compression which discards some information (assumed to be the most useless part or just noise). Lossless compression is applied to numeric or text data, whole files or directories — the data that will become partially or utterly useless if even a slight modification is made. Lossy compression, as a rule, is applied to data that originates in the "analog world": sound or video recordings, images, etc. We have touched the subject of lossy compression slightly in the previous chapter when talking about such formats as JPEG. In this chapter, we will discuss the lossless variants in more detail. Besides, we'll talk a bit about other, non-compressing, forms of encoding.


Let's start with encoding. Lossless compression is, in fact, a form of encoding, but there are other, simpler forms. And it makes sense to understand them before moving to compression. Besides, encoding itself is a fairly common task. It is the mechanism that transforms the data from an internal representation of a particular program into some specific format that can be recognized and processed (decoded) by other programs. What we gain is that the encoded data may be serialized and transferred to other computers and decoded by other programs, possibly, independent of the program that performed the encoding.

Encoding may be applied to different semantic levels of the data. Character encoding operates on the level of individual characters or even bytes, while various serialization formats deal with structured data. There are two principal approaches to serialization: text-based and binary. The pros and cons are the opposite: text-based formats are easier to handle by humans but are usually more expensive to process, while binary variants are not transparent (and so, much harder to deal with) but much faster to process. From the point of view of algorithms, binary formats are, obviously, better. But my programming experience is that they are a severe form of premature optimization. The rule of thumb should be to always start with text-based serialization and move to binary formats only as a last resort when it was proven that the impact on the program performance will be significant and important.


Encoding may have both a reduction and a magnification effect on the size of the data. For instance, there's a popular encoding scheme — Base64. It is a byte-level (lowest level) encoding that doesn't discriminate between different input data representations and formats. No, the encoder just takes a stream of bytes and produces another stream of bytes. Or, more precisely, bytes in the specific range of English ASCII letters, numbers, and three more characters (usually, +, /, and =). This encoding is often used for transferring data in the Web, in conjunction with SMTP (MIME), HTTP, and other popular protocols. The idea behind it is simple: split the data stream into sextets (6-bit parts — there's 64 different variants of those), and map each sextet to an ASCII character according to a fixed dictionary. As the last byte of the original data may not align with the last sextet, an additional padding character (=) is used to indicate 2 (=) or 4 (==) misaligned bits. As we see, Base64 encoding increases the size of the input data by a factor of 1.25.

Here is one of the ways to implement a Base64 serialization routine:

(defparameter *b64-dict*
  (coerce (append (loop :for ch :from (char-code #\A) :to (char-code #\Z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\a) :to (char-code #\z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\0) :to (char-code #\9)
                        :collect (code-char ch))
                  '(#\+ #\/ #\=))

(defun b64-encode (in out)
  (let ((key 0)
        (limit 6))
    (flet ((fill-key (byte off beg limit)
             (:= (ldb (byte limit off) key)
                 (ldb (byte limit beg) byte))
             (:= off (- 6 beg)))
           (emit1 (k)
             (write-byte (char-code (svref *b64-dict* k)) out)))
      (loop :for byte := (read-byte in nil) :while byte :do
        (let ((beg (- 8 limit)))
          (fill-key byte 0 beg limit)
          (emit1 key)
          (fill-key byte (:= limit (- 6 beg)) 0 beg)
          (when (= 6 beg)
            (emit1 key)
            (:= limit 6))))
      (when (< limit 6)
        (:= (ldb (byte limit 0) key)
            (ldb (byte limit 0) 0))
        (emit1 key)
        (loop :repeat (ceiling limit 2) :do
          (emit1 64))))))

This is one of the most low-level pieces of Lisp code in this book. It could be written in a much more high-level manner: utilizing the generic sequence access operations, say, on bit-vectors, instead of the bit manipulating ones on numbers. However, it would be also orders of magnitude slower due to the need to constantly "repackage" the bits, converting the data from integers to vectors and back. I also wanted to show a bit of bit fiddling, in Lisp. The standard, in fact, defines a comprehensive vocabulary of bit manipulation functions and there's nothing stopping the programmer from writing performant code operating at a single bit level.

One important choice made for Base64 encoding is the usage of streams as the input and output. This is a common approach to such problems based on the following considerations:

  • It is quite easy to wrap the code so that we could feed/extract strings as inputs and outputs. Doing the opposite, and wrapping a string-based code for stream operation is also possible, but it defeats the whole purpose of streams, which is...
  • Streams allow to efficiently handle data of any size and not waste memory, as well as CPU, for storing intermediary copies of the strings we're processing. Encoding a huge file is a good illustration of why this matters: with streams, we do it in an obvious manner: (with-open-file (in ...) (with-out-file (out) (base64-encode in out)). With strings, however, it will mean, first, reading the file contents into memory — and we may not even have enough memory for that. And, after that, filling another big chunk of memory with the encoded data. Which we'll still, probably, need to either dump to a file or send over the network.

So, what happens in the code above? First, the bytes are read from the binary input stream in, then each one is slashed into 2 parts. The higher bits are set into the current base64 key, which is translated, using b64-dict, into an appropriate byte and emitted to the binary output stream out. The lower bits are deposited in the higher bits of the next key in order to use this leftover during the processing of the next byte. However, if the leftover from the previous byte was 4 bits, at the current iteration, we will have 2 base64 bytes available as the first will use 2 bits from the incoming byte, and the second will consume the remaining 6 bits. This is addressed in the code block (when (= 6 beg) ...). The function relies on the standard Lisp ldb operation which provides access to the individual bits of an integer. It uses the byte-spec (byte limit offset) to control the bits it wants to obtain.

Implementing a decoder procedure is left as an exercise to the reader...

Taking the example from the Wikipedia article, we can see our encoding routine in action (here, we also rely on the FLEXI-STREAMS library to work with binary in-memory streams):

CL-USER> (with-input-from-string (str "Man i")
           (let ((in (flex:make-in-memory-input-stream 
                      (map 'vector 'char-code
                           (loop :for ch := (read-char str nil) :while ch :collect ch))))
                 (out (flex:make-in-memory-output-stream)))
             (b64-encode in out)
             (map 'string 'code-char (? out 'vector))))

This function, although it's not big, is quite hard to debug due to the need for careful tracking and updating of the offsets into both the current base64 chunk (key) and the byte being processed. What really helps me tackle such situations is a piece of paper that serves for recording several iterations with all the relevant state changes. Something along these lines:

        M (77)    |    a (97)     |    n (110)
   0 1 0 0 1 1 0 1|0 1 1 0 0 0 0 1|0 1 1 0 1 1 1 0
0: 0 1 0 0 1 1    |               |                 19 = T
               0 1|               |
1:             0 1|0 1 1 0        |                 22 = W
                  |        0 0 0 1|
2:                |        0 0 0 1|0 1               5 = F

Iteration 0:

beg: 2
off: 0
limit: 6

beg: 0
off: (- 6 2) = 4
limit: 2

Iteration 1:

beg: 4
off: 0
limit: 4

beg: 0
off: (- 6 4) = 2
limit: 4

Another thing that is indispensable, when coding such procedures, is the availability of the reference examples of the expected result, like the ones in Wikipedia. Lisp REPL makes iterating on a solution and constantly rechecking the results, using such available data, very easy. However, sometimes, in makes sense to reject the transient nature of code in the REPL and record some of the test cases as unit tests. As the motto of my test library SHOULD-TEST declares: you should test even Lisp code sometimes :) The tests also help the programmer to remember and systematically address the various corner cases. In this example, one of the special cases is the padding at the end, which is handled in the code block (when (< limit 6) ...). Due to the availability of a clear spec and reference examples, this algorithm lends itself very well to automated testing. As a general rule, all code paths should be covered by the tests. If I were to write those tests, I'd start with the following simple version. They address all 3 variants of padding and also the corner case of an empty string.

(deftest b64-encode ()
  ;; B64STR would be the function wrapped over the REPL code presented above
  (should be blankp (b64str ""))
  (should be string= "TWFu" (b64str "Man"))
  (should be string= "TWFuIA==" (b64str "Man "))
  (should be string= "TWFuIGk=" (b64str "Man i")))

Surely, many more tests should be added to a production-level implementation: to validate operation on non-ASCII characters, handling of huge data, etc.

Lossless Compression

The idea behind lossless compression is straightforward: find an encoding that is tailored to our particular dataset and allows the encoding procedure to produce a shorter version than using a standard encoding. Not being general-purpose, the vocabulary for this encoding may use a more compact representation for those things that occur often, and a longer one for those that appear rarely, skipping altogether those that don't appear at all. Such an encoding scheme will be, probably, structure-agnostic and just convert sequences of bytes into other sequences of a smaller size, although custom structure-aware compression is also possible.

This approach can be explained with a simple example. The phrase "this is a test" uses 8-bit ASCII characters to represent each letter. There are 256 different ASCII characters in total. However, for this particular message, only 7 characters are used: t, h, i, s, , a, and e. 7 characters, in theory, need only 2.81 bits to be distinguished. Encoding them in just 3 bits instead of 8 will reduce the size of the message almost thrice. In other words, we could create the following vocabulary (where #*000 is a Lisp literal representation of a zero bit-vector of 3 bits):

#h(#\t #*000
   #\h #*001
   #\i #*010
   #\s #*011
   #\a #*100
   #\e #*101
   #\Space #*110)

Using this vocabulary, our message could be encoded as the following bit-vector: #*0000010100111100100111101001100001010111000. The downside, compared to using some standard encoding, is that we now need to package the vocabulary alongside the message, which will make its total size larger than the original that used an 8-bit standard encoding with a known vocabulary. It's clear, though, that, as the message becomes longer, the fixed overhead of the vocabulary will quickly be exceeded by the gain from message size reduction. Although, we have to account for the fact that the vocabulary may also continue to grow and require more and more bits to represent each entry (for instance, if we use all Latin letters and numbers it will soon reach 6 or 7 bits, and our gains will diminish as well). Still, the difference may be pre-calculated and the decision made for each message or a batch of messages. For instance, in this case, the vocabulary size may be, say, 30 bytes, and the message size reduction is 62.5%, so a message of 50 or more characters will be already more compact if encoded with this vocabulary even when the vocabulary itself will be sent with it. The case of only 7 characters is pretty artificial, but consider that DNA strings have only 4 characters.

However, this simplistic approach is just the beginning. Once again, if we use an example of the Latin alphabet, some letters, like q or x may end up used much less frequently, than, say, p or a. Our encoding scheme uses equal length vectors to represent them all. Yet, if we were to use shorter representations for more frequently used chars at the expense of longer ones for the characters occurring less often, additional compression could be gained. That's exactly the idea behind Huffman coding.

Huffman Coding

Huffman coding tailors an optimal "alphabet" for each message, sorting all letters based on their frequency and putting them in a binary tree, in which the most frequent ones are closer to the top and the less frequent ones — to the bottom. This tree allows calculating a unique encoding for each letter based on a sequence of left or right branches that need to be taken to reach it, from the top. The key trick of the algorithm is the usage of a heap to maintain the characters (both individual and groups of already processed ones) in sorted order. It builds the tree bottom-up by first extracting two least frequent letters and combining them: the least frequent on the left, the more frequent — on the right. Let's consider our test message. In it, the letters are sorted by frequency in the following order:

((#\a 1) (#\e 1) (#\h 1) (#\i 2) (#\s 3) (#\t 3) (#\Space 3)) 

Extracting the first two letters results in the following treelet:

 ((#\a #\e) 2)
  /         \
(#\a 1)  (#\e 1)

Uniting the two letters creates a tree node with a total frequency of 2. To use this information further, we add it back to the queue in place of the original letters, and it continues to represent them, during the next steps of the algorithm:

((#\h 1) ((#\a #\e) 2) (#\i 2) (#\s 3) (#\t 3) (#\Space 3)) 

By continuing this process, we'll come to the following end result:

  ((#\s #\t #\Space #\i #\h #\a #\e) 14)
    /                                \
 ((#\s #\t) 6)     ((#\Space #\i #\h #\a #\e) 8)
   /        \        /                       \
(#\s 3)   (#\t 3) (#\Space 3)  ((#\i #\h #\a #\e) 5)
                                 /               \              
                              (#\i 2)  ((#\h #\a #\e) 3)     
                                         /           \
                                      (#\h 1)  ((#\a #\e) 2)
                                                 /       \
                                               (#\a 1)  (#\e 1)

From this tree, we can construct the optimal encoding:

#h(#\s #*00
   #\t #*01
   #\Space #*10
   #\i #*110
   #\h #*1110
   #\a #*11110
   #\e #*11111)

Compared to the simple approach that used constantly 3 bits per character, it takes 1 bit less for the 3 most frequent letters and 2 bits more for two least frequent ones. The encoded message becomes: #*01111011000101100010111101001111110001, and it has a length of 38 compared to 43 for our previous attempt.

To be clear, here are the encoding and decoding methods that use the pre-built vocabulary (for simplicity's sake, they operate on vectors and strings instead of streams):

(defun huffman-encode (envocab str)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (dovec (char str)
      (dovec (bit (? envocab char))
        (vector-push-extend bit rez)))

(defun huffman-decode (devocab vec)
  (let (rez)
    (dotimes (i (length vec))
      (dotimes (j (- (length vec) i))
        (when-it (? devocab (slice vec i (+ i j 1)))
          (push it rez)
          (:+ i j)
    (coerce (reverse rez) 'string)))

It is worth recalling that vector-push-extend is implemented in a way, which will not adjust the array by only 1 bit each time it is called. The efficient implementation "does the right thing", for whatever the right thing means in this particular case (maybe, adjusting by 1 machine word). You can examine the situation in more detail by trying to extend the array by hand (using adjust-array or providing a third optional argument to vector-push-extend) and comparing the time taken by the different variants, to verify my words.

Finally, here is the most involved part of the Huffman algorithm, which builds the encoding and decoding vocabularies (with the help of a heap implementation we developed in the chapter on Trees):

(defun huffman-vocabs (str)
  ;; here we assume more than a single unique character in STR
  (let ((counts #h())
        (q (make-heap :op '< :key 'rt))
        (envocab #h())
        (devocab #h(equal)))  ; bit-vectors as keys require 'equal comparison
    ;; count character frequencies
    (dovec (char str)
      (:+ (get# char counts 0)))  ; here, we use the default third argument of gethash set to 0
    ;; heapsort the characters based on their frequency
    (dotable (char count counts)
      (heap-push (pair char count) q))
    ;; build the tree
    (dotimes (i (1- (heap-size q)))
      (with (((lt cl) (heap-pop q))
             ((rt cr) (heap-pop q)))
        (heap-push (pair (list lt rt) (+ cl cr))
    ;; traverse the tree in DFS manner
    ;; encoding the path to each leaf node as a bit-vector
    (labels ((dfs (node &optional (level 0) path)
               (if (listp node)
                     (dfs (lt node) (1+ level) (cons 0 path))
                     (dfs (rt node) (1+ level) (cons 1 path)))
                   (let ((vec (make-array level :element-type 'bit
                                          :initial-contents (reverse list))))
                     (:= (? envocab node) vec
                         (? devocab vec) node)))))
      (dfs (lt (heap-pop q))))
    (list envocab devocab)))

Huffman Coding in Action

Compression is one of the areas for which it is especially interesting to directly compare the measured gain in space usage to the one expected theoretically. Yet, as we discussed in one of the previous chapters, such measurements are not so straightforward as execution speed measurements. Yes, if we compress a single sequence of bytes into another one, there's nothing more trivial than to compare their lengths, but, in many tasks, we want to see a cumulative effect of applying compression on a complex data structure. This is what we're going to do next.

Consider the problem that I had in my work on the tool for text language identification WIKI-LANG-DETECT. This software relies on a number of big dictionaries that map strings (character trigrams and individual words) to floats. The obvious approach to storing these maps is with a hash-table. However, due to the huge number of keys, such table will, generally, have a sizeable overhead, which we would like to avoid. Besides, the keys are strings, so they have a good potential for reduction in occupied size when compressed. The data is also serialized into per-language files in the tab-separated format. This is the sample of a word-level file for the Danish language:

afrika    -8.866735
i    -2.9428265
the    -6.3879676
ngo    -11.449115
of    -6.971129
kanye    -12.925021
e    -8.365895
natal    -12.171249

Our task is to load the data in memory so that access to the keys had constant runtime and minimal occupied space.

Let's begin with a simple hash-table-based approach. The following function will load two files from the default directory (*default-pathname-defaults*) and return a list of two hash-tables: for the word and trigram probabilities.

(defun load-data-into-hts (lang)
  (declare (optimize sb-c::instrument-consing))
  (mapcar (lambda (kind)
            (let ((rez (make-hash-table :test 'equal)))
              (dolines (line (fmt "~A-~A.csv" lang kind))
                (let ((space (position #\Space line)))
                  (set# (slice line 0 space) rez
                        (read-from-string (slice line (1+ space))))))
          '("words" "3gs")))

To measure the space it will take, we'll use a new SBCL extension called allocation profiling from the sb-aprof package[1]. To enable the measurement, we have put a special declaration immediately after the defun header: (optimize sb-c::instrument-consing).

Now, prior to running the code, let's look at the output of room:

CL-USER> (room)
Dynamic space usage is:   60,365,216 bytes.

This is a freshly loaded image, so space usage is minimal. Usually, before proceeding with the experiment, I invoke garbage collection to ensure that we don't have some leftover data from the previous runs that may overlap with the current one. In SBCL, you run it with (sb-ext:gc :full t).

Now, let's load the files for the German language (the biggest ones) under aprof. The data can be obtained from the github repository of the project. The total size of 2 German-language files on disk (words and trigrams dictionaries) is around 4 MB.

CL-USER> (sb-aprof:aprof-run
          (lambda () (defparameter *de* (load-data-into-hts "DE"))))
227 (of 50000 max) profile entries consumed

       %        Bytes        Count    Function
 -------  -----------    ---------    --------
  24.2       34773600       434670    SB-KERNEL:%MAKE-ARRAY - #:|unknown|
  19.4       27818880       217335    SB-IMPL::%MAKE-STRING-INPUT-STREAM - SB-IMPL::STRING-INPUT-STREAM
  19.4       27818880       434670    SLICE - LIST

  17.3       24775088                 SB-IMPL::HASH-TABLE-NEW-VECTORS
      54.0     13369744         52        SIMPLE-VECTOR
      46.0     11405344        156        (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

  14.9       21406176                 SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER
      99.4     21280192     225209        (SIMPLE-ARRAY CHARACTER (*))
       0.6       125984       7874        LIST

   4.8        6957184       217412    SB-KERNEL::INTEGER-/-INTEGER - RATIO

  00.0          14160                 SB-IMPL::%MAKE-PATHNAME
      91.8        12992        812        LIST
       8.2         1168          1        SIMPLE-VECTOR

  00.0           4160            2    SB-IMPL::SET-FD-STREAM-ROUTINES - (SIMPLE-ARRAY CHARACTER (*))

  00.0           3712                 SB-IMPL::%MAKE-DEFAULT-STRING-OSTREAM
      62.1         2304          8        (SIMPLE-ARRAY CHARACTER (*))
      37.9         1408          8        SB-IMPL::CHARACTER-STRING-OSTREAM

  00.0           1024                 MAKE-HASH-TABLE
      53.1          544          2        SIMPLE-VECTOR
      46.9          480          6        (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

  00.0            832                 SB-IMPL::%MAKE-FD-STREAM
      73.1          608          2        SB-SYS:FD-STREAM
      19.2          160          2        SB-VM::ARRAY-HEADER
       7.7           64          2        (SIMPLE-ARRAY CHARACTER (*))

  00.0            576                 GET-OUTPUT-STREAM-STRING
      55.6          320          8        SIMPLE-BASE-STRING
      44.4          256          8        SB-KERNEL:CLOSURE

  00.0            400                 SB-KERNEL:VECTOR-SUBSEQ*
      60.0          240          6        (SIMPLE-ARRAY CHARACTER (*))
      40.0          160          5        SIMPLE-BASE-STRING

  00.0            400            5    SB-IMPL::%%MAKE-PATHNAME - PATHNAME
  00.0            384            2    SB-IMPL::%MAKE-HASH-TABLE - HASH-TABLE
  00.0            288            4    SB-KERNEL:%CONCATENATE-TO-STRING - (SIMPLE-ARRAY CHARACTER (*))
  00.0            192           12    SB-IMPL::UNPARSE-NATIVE-PHYSICAL-FILE - LIST
  00.0            176            2    SB-IMPL::READ-FROM-C-STRING/UTF-8 - (SIMPLE-ARRAY CHARACTER (*))
  00.0            128            4    SB-ALIEN-INTERNALS:%SAP-ALIEN - SB-ALIEN-INTERNALS:ALIEN-VALUE

  00.0             96                 SB-IMPL::QUERY-FILE-SYSTEM
      66.7           64          2        SB-KERNEL:CLOSURE
      33.3           32          2        SB-VM::VALUE-CELL

 =======  ===========
 100.0      143576336

The profiling report is pretty cryptic, at first sight, and requires some knowledge of SBCL internals to understand. It contains all the allocations performed during the test run, so we should mind that some of the used memory is garbage and will be collected at the next gc. We can confirm that by looking at the room output:

CL-USER> (room)
Dynamic space usage is:   209,222,464 bytes.
CL-USER> (sb-ext:gc :full t)
CL-USER> (room)
Dynamic space usage is:   107,199,296 bytes.

Let's study the report in detail. Around 47 MB were, in fact, used for the newly created data structures — more than 10 times more than what was needed to store the data on disc. Well, efficient access requires sacrificing a lot of space. From the report, we can make an educated guess where these 47 MB originate: 24.7 MB was used for the hash-table structures themselves (SB-IMPL::HASH-TABLE-NEW-VECTORS) and 21.4 MB for the keys (SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER), plus some small amount of bookkeeping information. We can also infer that the floating-point values required around 7 MB (SB-KERNEL::INTEGER-/-INTEGER - RATIO), but, it seems like they were put inside the hash-table arrays without any indirection. To verify that this assumption is correct we can calculate the total number of keys in the hash-tables, which amounts to 216993, and multiply it by 32 (the number of bits in a short-float used here). Also, the first 3 lines, which, in total, accrued around 90 MB or almost 2/3 of the memory used, are all related to reading the data and its processing; and this space was freed during gc.

So, this report, although it is not straightforward to understand, gives a lot of insight into how space is used during the run of the algorithm. And the ability to specify what to track on a per-code block basis makes it even more useful.

From the obtained breakdown, we can see the optimization potential of the current solution:

  • the use of a more space-efficient data structure instead of a hash-table might save us up to 17 MB of space (7 MB of float values will remain intact)
  • and another 20 MB may be saved if we compress the keys

Let's try the second option as it is exactly the focus of this chapter. We'll use the created hash-tables to make new ones with Huffman-encoded keys. Here are the contents of the word probabilities table:

CL-USER> (print-ht (first *de*))
  "afrika" -9.825206
  "i" -7.89809
  "the" -7.0929685
  "ngo" -12.696277
  "noma" -14.284437
  "of" -6.82038
  "kanye" -14.233144
  "e" -7.7334323
  "natal" -11.476304
  "c" -8.715089

And here is the function that will transform the tables:

(defun huffman-tables (hts envocab)
  (declare (optimize sb-c::instrument-consing))
  (mapcar (lambda (ht)
            (let ((rez #h(equal)))
              (dotable (str logprob ht)
                (:= (? rez (huffman-encode envocab str)) logprob))

;; the Huffman encoding vocabulary *DE-VOCAB* should be built
;; from all the keys of *DE* tables separately
CL-USER> (sb-aprof:aprof-run
          (lambda () (defparameter *de2* (huffman-tables *de* *de-vocab*))))
1294 (of 50000 max) profile entries consumed
       %        Bytes        Count    Function
 -------  -----------    ---------    --------
  42.5       44047104      1376461    SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG - ARRAY

  23.9       24775088                 SB-IMPL::HASH-TABLE-NEW-VECTORS
      54.0     13369744         52        SIMPLE-VECTOR
      46.0     11405344        156        (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

  20.1       20864160                 HUFFMAN-ENCODE
      83.3     17386800     217335        SB-VM::ARRAY-HEADER
      16.7      3477360     217335        SIMPLE-BIT-VECTOR

   6.7        6955072       217335    SB-KERNEL:VECTOR-SUBSEQ* - SIMPLE-BIT-VECTOR
   3.4        3477360       217335    (SB-PCL::FAST-METHOD RUTILS.GENERIC::GENERIC-SETF :AROUND (T T)) - LIST
   3.4        3477360       217335    (SB-PCL::FAST-METHOD RUTILS.GENERIC::GENERIC-SETF (HASH-TABLE T)) - LIST
  00.0           2464           77    SB-KERNEL::INTEGER-/-INTEGER - RATIO

  00.0           1024                 MAKE-HASH-TABLE
      53.1          544          2        SIMPLE-VECTOR
      46.9          480          6        (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

  00.0            384            2    SB-IMPL::%MAKE-HASH-TABLE - HASH-TABLE

  00.0             96                 SB-C::%PROCLAIM
      66.7           64          2        LIST
      33.3           32          1        SB-KERNEL:CLOSURE

  00.0             96            2    SB-INT:SET-INFO-VALUE - SIMPLE-VECTOR
  00.0             64            2    SB-THREAD:MAKE-MUTEX - SB-THREAD:MUTEX
  00.0             32            1    SB-IMPL::%COMPILER-DEFVAR - LIST
  00.0             32            2    HUFFMAN-TABLES - LIST
  00.0             16            1    SB-KERNEL:ASSERT-SYMBOL-HOME-PACKAGE-UNLOCKED - LIST
 =======  ===========
 100.0      103600352
CL-USER> (sb-ext:gc :full t)
CL-USER> (room)
Dynamic space usage is:   139,922,208 bytes.

So, we have claimed 32 MB of additional space (instead of 47) and some of it seems to be used by other unrelated data (some functions I have redefined in the REPL during the experiment etc), as the compressed keys amount for only 3.5 MB:

3477360     217335        SIMPLE-BIT-VECTOR 

That is more than 5 times reduction or almost 40% compression of the whole data structure!

And what about performance? Huffman compression will be needed at every data access, so let's measure the time it will take for vanilla string keys and the bit-vector ones. We will use another file from the wiki-lang-detect repository for the smoke test — a snippet from Faust:

CL-USER> (defparameter *de-words*
           (let ((words (list))
                 (dict (first *de*)))
             (dolines (line "~/prj/lisp/wiki-lang-detect/data/smoke/de.txt")
               (dolist (word (split #\Space line))
                 (push word words)))
CL-USER> (length *de-words*)

CL-USER> (let ((vocab (first *de*)))
           (time (loop :repeat 1000 :do
                   (dolist (word *de-words*)
                     (get# word vocab)))))
Evaluation took:
  0.045 seconds of real timeEvaluation took:

CL-USER> (let ((vocab (first *de2*)))
           (time (loop :repeat 1000 :do
                   (dolist (word *de-words*)
                     (get# (huffman-encode *de-vocab* word) vocab)))))
Evaluation took:
  0.341 seconds of real time

Hmm, with Huffman coding, it's almost 10x slower :( Is there a way to speed it up somewhat? To answer it, we can utilize another profiler — this time a more conventional one, which measures the time spent in each operation. SBCL provides access to two versions of such profilers: a precise and a statistical one. The statistical doesn't seriously interfere with the flow of the program as it uses sampling to capture the profiling data, and it's the preferred one among the developers. To use it, we need to perform (require 'sb-sprof) and then run the computation with profiling enabled (the lengthy output is redacted to show only the most important parts):

CL-USER> (let ((dict (first *de2*)))
           (sb-sprof:with-profiling (:report :graph)
             (loop :repeat 100 :do
               (dolist (word *de-words*)
                 (get# (huffman-encode *de-vocab* word) dict)))))
Number of samples:   34
Sample interval:     0.01 seconds
Total sampling time: 0.34 seconds
Number of cycles:    0
Sampled threads:
 #<SB-THREAD:THREAD "repl-thread" RUNNING {100FB19BC3}>

                 Total.     Function
 Count     %  Count     %      Callees
    24  70.6                   "Unknown component: #x52CD6390" [41]
     5  14.7     24  70.6   HUFFMAN-ENCODE [1]
     1   2.9                   SB-IMPL::GETHASH/EQL [17]
     1   2.9                   SB-IMPL::GETHASH3 [6]
     1   2.9                   LENGTH [14]
     1   2.9                   SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS [13]
     2   5.9                   (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT) [5]
    13  38.2                   VECTOR-PUSH-EXTEND [11]
     4  11.8                   SB-VM::EXTEND-VECTOR [4]
     4  11.8      4  11.8   SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG [2]
     6  17.6                   "Unknown component: #x52CD6390" [41]
     3   8.8      6  17.6   SB-IMPL::GETHASH/EQUAL [3]
     1   2.9                   SXHASH [42]
     2   5.9                   SB-INT:BIT-VECTOR-= [10]
     8  23.5                   VECTOR-PUSH-EXTEND [11]
     2   5.9      8  23.5   SB-VM::EXTEND-VECTOR [4]
     2   5.9                   SB-VM::COPY-VECTOR-DATA [9]
     4  11.8                   SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG [2]
     2   5.9                   HUFFMAN-ENCODE [1]
     2   5.9      2   5.9   (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT) [5]

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
   1      5  14.7     24  70.6      5  14.7        -  HUFFMAN-ENCODE
   2      4  11.8      4  11.8      9  26.5        -  SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG
   3      3   8.8      6  17.6     12  35.3        -  SB-IMPL::GETHASH/EQUAL
   4      2   5.9      8  23.5     14  41.2        -  SB-VM::EXTEND-VECTOR
   5      2   5.9      2   5.9     16  47.1        -  (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT)
   6      2   5.9      2   5.9     18  52.9        -  SB-IMPL::GETHASH3
   7      2   5.9      2   5.9     20  58.8        -  GETHASH
   8      2   5.9      2   5.9     22  64.7        -  (SB-VM::OPTIMIZED-DATA-VECTOR-SET BIT)
   9      2   5.9      2   5.9     24  70.6        -  SB-VM::COPY-VECTOR-DATA
  10      2   5.9      2   5.9     26  76.5        -  SB-INT:BIT-VECTOR-=
  11      1   2.9     13  38.2     27  79.4        -  VECTOR-PUSH-EXTEND
  12      1   2.9      1   2.9     28  82.4        -  SB-VM::SLOW-HAIRY-DATA-VECTOR-SET
  13      1   2.9      1   2.9     29  85.3        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
  14      1   2.9      1   2.9     30  88.2        -  LENGTH
  15      1   2.9      1   2.9     31  91.2        -  SB-KERNEL:HAIRY-DATA-VECTOR-SET
  16      1   2.9      1   2.9     32  94.1        -  SB-KERNEL:VECTOR-SUBSEQ*
  17      1   2.9      1   2.9     33  97.1        -  SB-IMPL::GETHASH/EQL

Unsurprisingly, most of the time is spent in huffman-encode, and of it the biggest chunks are vector-push-extend and hash-table access (to get the Huffman code of a letter). Surely, instead of extending the vector at each iteration, it would be much nicer to just perform a bulk copy of the bits for each character directly into the vector. Let's try that and see the difference:

(defun huffman-encode2 (envocab str)
  (let ((vecs (map 'vector (lambda (ch) (get# ch envocab))
        (total-size 0))
    (dovec (vec vecs)
      (:+ total-size (length vec)))
    (let ((rez (make-array total-size :element-type 'bit))
          (i 0))
      (dovec (vec vecs)
        (let ((size (length vec)))
          (:= (subseq rez i) vec)
          (:+ i size)))

CL-USER> (let ((vocab (first *de2*)))
           (time (loop :repeat 1000 :do
                   (dolist (word *de-words*)
                     (get# (huffman-encode2 *de-vocab* word) vocab)))))
Evaluation took:
  0.327 seconds of real time

Almost no difference. Well, it's a usual case with these micro-optimizations: you have a brilliant idea, try it under the profiler — and, bah, no difference... This doesn't have to stop us, though. Another idea could be to use a jump-table instead of a hash-table to store character-vector mappings. There are only around 500 characters that have a mapping in my data, although they span the whole Unicode range:

CL-USER> (reduce 'max (mapcar 'char-code (keys de-vocab)))
CL-USER> (defparameter jvocab (make-array (1+ 65533)
                                            :element-type 'bit-vector
                                            :initial-element #))
CL-USER> (dotable (k v *de-vocab)
           (:= (? jvocab (char-code k)) v))

(defun huffman-encode3 (envocab str)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (dovec (char str)
      ;; here, we have changed the hash-table to a jump-table
      (dovec (bit (svref envocab (char-code char)))
        (vector-push-extend bit rez)))

CL-USER> (let ((vocab (first de2))) (time (loop :repeat 1000 :do (dolist (word de-words) (get# (huffman-encode3 jvocab word) vocab))))) Evaluation took: 0.308 seconds of real time

OK, we get an improvement of around 10%[2]. That's a start. But many more ideas and experiments are needed if we want to significantly optimize this implementation. Yet, for the sake of space conservation on the pages of this book, we won't continue with it.

Another tool we could use to analyze the performance and think about further improvement is flamegraphs — a way to visualize profiler output. [CL-FLAMGRAPH](https://github.com/40ants/cl-flamegraph) is a wrapper around `sb-sprof` that generates the output in the common format which can be further processed by the Perl tool, in order to generate the image itself. Here is the basic output I got. It's rather rough and, probably, requires some fiddling with the Perl tool to obtain a prettier image:

To conclude, key compression alone gives a sizeable reduction in used space at the cost of deteriorated performance.

Another possible angle of attack is to move from a hash-table to a more space-efficient structure. We have explored this direction somewhat in the chapter on hash-tables already.

Arithmetic Coding

Why Huffman coding works? The answer lies in Shannon's Source Coding Theorem and has to do with a notion of entropy. Entropy is one of the ways to represent expectation and surprise, in a message. The most random message has the maximal surprise, i.e. it's very hard to predict what symbol will appear at a certain position in it, while the least random (for instance, containing only repetitions of a single char) is the least surprising. Obviously, any kind of useful data is not uniformly distributed, or, otherwise, it's indistinguishable from white noise. Most of the data representations use an "alphabet" (encoding) that is redundant, for a particular message. Why? Because it is general-purpose and should allow expressing arbitrary messages. Yet, in practice, some passages appear much more often than the others, some words are more frequent, some letters are, and even some patterns in the images may be.

The idea of character-level compression algorithms is to tailor a custom vocabulary that uses fewer bits for low-entropy (frequent) characters and more bits for high-entropy ones. In general, the probability distribution of characters may be thought of as a [0,1) interval, in which each char occupies a slice proportionate to its frequency. If we rely on standard encoding, the interval for our test example will look like this:

0 e   a   h    i        s         t       Space  1

Here, each subinterval for a character is its probability times the number of bits per character (8 for each). Huffman coding tries to equalize this distribution by assigning fewer bits to characters that occupy larger space. For the Huffman vocabulary we have constructed, the distribution will look like this:

0  e     a    h     i      s      t      Space 1

As you can see, it has become more even, but still not totally. This is due to the discrete nature of the encoding that results in rounding the number of bits to the closest integer value. There's another approach to solving the same problem that aims at reducing the rounding error even further — Arithmetic coding. It acts directly on our interval and encodes the whole message in a single number that represents the point in this interval. How this point is found and used? Let's consider a message with a single character i. In our example, the subinterval for it is [0.214285714, 0.357142857). So, if we use any number from this interval and know that the message contains a single character we can unambiguously decode it back. Ideally, we'd use the number from the interval that has the least amount of digits. Here is a simple example of how such a number can be found:

(defun find-shortest-bitvec (lo hi)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
      (with ((lod lof (floor (* lo 2)))
             (hid hif (floor (* hi 2))))
        (when (or (zerop lof)
                  (zerop hif)
                  (/= lod hid))
          (vector-push-extend hid rez)
        (vector-push-extend lod rez)
        (:= lo lof
            hi hif)))

CL-USER> (find-shortest-bitvec 0.214285714 0.357142857)

The result is a bit-vector that represents the fractional part of some floating point number lying within the interval, which may be also used as an encoding of our one-character message. Obviously, we could use just a single bit to encode it with a custom vocabulary of one entry, but, here, for the purpose of illustration, I wanted to use an existing pre-calculated vocabulary that includes other characters as well. Also, if we compare this version with the Huffman coding, the message length is decreased by 1 bit.

Now, how can we process the longer messages? In the same manner: by recursively dividing the currently selected part using the same original distribution. For the message is:

  • on step 1 (for character i), the interval [0.214285714, 0.357142857) will be selected
  • on step 2 (for character s), we'll narrow it down to [0.26530612, 0.29591838) (using the subinterval [0.357142857, 0.5714286) for s)

For this interval, the shortest encoding will be 01001. In this case, it has the same size as the Huffman one.

So, the naive arithmetic encoding implementation is quite simple:

(defun arithm-encode (envocab message)
  (let ((lo 0.0)
        (hi 1.0))
    (dovec (char message)
      (let ((coef (- hi lo)))
        (dotable (ch prob envocab)
          (let ((off (* prob coef)))
            (when (eql char ch)
              (:= hi (+ lo off))
            (:+ lo off)))))
    (find-shortest-bitvec lo hi)))

CL-USER> (arithm-encode #h(#\e 1/14
                           #\a 1/14
                           #\h 1/14
                           #\i 2/14
                           #\s 3/14
                           #\t 3/14
                           #\Space 3/14)    
                        "this is a test")

However, this function has a hidden bug. The problem lies in the dreaded floating-point overflow that happens quite soon in the process of narrowing the interval, which results in using more and more digits of the floating-point number until all the bits are utilized and we can't distinguish the intervals any further. If we try to faithfully decode even the short message encoded above, we'll already see this effect by getting the output this ist sssst.

The implementation of this approach, which works around the bug, relies on the same idea but uses a clever bit arithmetics trick. Due to that, it becomes less clean and obvious, because it has to work not with the whole number, but with a bounded window in that number (in this case: a 32-bit one) and, also, still take care of potential overflow that may happen when the range collapses around 0.5. Here it is shown, for illustration purposes, without a detailed explanation[3]. This function is another showcase of the Lisp standard support for handling bit-level values. Besides, read-eval (#.) is used here to provide literal values of bitmasks.

(defun arithm-encode-correct (envocab message)
  (let ((lo 0)
        (hi (1- (expt 2 32)))
        (pending-bits 0)
        (rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (flet ((emit-bit (bit)
             (vector-push-extend bit rez)
             (let ((pbit (if (zerop bit) 1 0)))
               (loop :repeat pending-bits :do (vector-push-extend pbit rez))
               (:= pending-bits 0))))
      (dovec (char message)
        (with ((range (- hi lo -1))
               ((plo phi) (? envocab char)))
          (:= lo (round (+ lo (* plo range)))
              hi (round (+ lo (* phi range) -1)))
            (cond ((< hi #.(expt 2 31))
                   (emit-bit 0))
                  ((>= lo #.(expt 2 31))
                   (emit-bit 1)
                   (:- lo #.(expt 2 31))
                   (:- hi #.(expt 2 31)))
                  ((and (>= lo #.(expt 2 30))
                        (< hi (+ #.(expt 2 30) #.(expt 2 31))))
                   (:- lo #.(expt 2 30))
                   (:- hi #.(expt 2 30))
                   (:+ pending-bits))
                  (t (return)))
            (:= lo (mask32 (ash lo 1))
                hi (mask32 (1+ (ash hi 1)))))))
      (:+ pending-bits)
      (emit-bit (if (< lo #.(expt 2 30)) 0 1)))

(defun mask32 (num)
  ;; this utility is used to confine the number in 32 bits
  (logand num #.(1- (expt 2 32))))

CL-USER> (arithm-encode-correct #h(#\e '(0 1/14)
                                   #\a '(1/14 1/7)
                                   #\h '(1/7 3/14)
                                   #\i '(3/14 5/14)
                                   #\s '(5/14 4/7)
                                   #\t '(4/7 11/14)
                                   #\Space '(11/14 1))    
                                "this is a test")

Note that the length of the compressed message is 38 bits. The same as the Huffman version!

And here, for the sake of completeness and verification, is the decoding routine. It works in a similar fashion but backwards: we determine the interval into which our current number falls, emit the corresponding character, and narrow the search interval to the currently found one. We'll need to have access to the same vocabulary and know the length of the message.

(defun bitvec->int (bits)
  (reduce (lambda (bit1 bit2) (+ (ash bit1 1) bit2)

(defun arithm-decode (dedict vec size)
  (with ((len (length vec))
         (lo 0)
         (hi (1- (expt 2 32)))
         (val (bitvec->int (subseq vec 0 (min 32 len))))
         (off 32)
         (rez (make-string size)))
    (dotimes (i size)
      (with ((range (- hi lo -1))
             (prob (/ (- val lo) range)))
        (dotable (char r dedict)
          (with (((plo phi) r))
            (when (>= phi prob)
              (:= (? rez i) char
                  lo (round (+ lo (* plo range)))
                  hi (round (+ lo (* phi range) -1)))
        (print (list val lo hi))
          (cond ((< hi #.(expt 2 31))
                 ;; do nothing
                ((>= lo #.(expt 2 31))
                 (:- lo #.(expt 2 31))
                 (:- hi #.(expt 2 31))
                 (:- val #.(expt 2 31)))
                ((and (>= lo #.(expt 2 30))
                      (< hi #.(* 3 (expt 2 30))))
                 (:- lo #.(expt 2 30))
                 (:- hi #.(expt 2 30))
                 (:- val #.(expt 2 30)))
          (:= lo (mask32 (ash lo 1))
              hi (mask32 (1+ (ash hi 1)))
              val (mask32 (+ (ash val 1)
                             (if (< off len)
                                 (? vec off)
              off (1+ off)))))

CL-USER> (let ((vocab #h(#\e '(0 1/14)
                         #\a '(1/14 1/7)
                         #\h '(1/7 3/14)
                         #\i '(3/14 5/14)
                         #\s '(5/14 4/7)
                         #\t '(4/7 11/14)
                         #\Space '(11/14 1))))
           (arithm-decode vocab
                          (arithm-encode-correct vocab "this is a test")
"this is a test"


Entropy-based compression — or, as I would call it, сharacter-level one — can do only so much: it can't account for repetitions of the larger-scale message parts. For instance, a message with a single word repeated twice, when compressed with Huffman or Arithmetic encodings, will have twice the length of the message with a single occurrence of that word. The reason being that the probability distribution will not change, and thus the encodings of each character. Yet, there's an obvious possibility to reduce the compressed size here. This and other similar cases are much better treated by dictionary-based or block-level encoding approaches. The most well-known and wide-spread of them is the DEFLATE algorithm that is a variant of LZ77. Surely, there are other approaches like LZW, LZ78 or the Burrows-Wheeler algorithm (used in bzip2), but they are based on the same principle approach, so studying DEFLATE will allow you to grasp other algorithms if necessary.

But, before considering DEFLATE, let's first look at the simplest block-level scheme — Run-Length Encoding (RLE). This is not even a block-level algorithm, in full, as it operates on single characters, once again. The idea is to encode sequences of repeating characters as a single character followed by the number of repetitions. Of course, such an approach will hardly help with natural language texts that have almost no long character repetitions, instead, it was used in images with limited palettes (like those encoded in the GIF format). It is common for such images to have large areas filled with the same color, so the GIF format, for instance, used RLE for each line of pixels. That was one of the reasons that an image with a horizontal pattern like this:




lent itself to stellar compression, while the same one rotated 90 degrees didn't :)

x  x  x
x  x  x
x  x  x
x  x  x
x  x  x

LZ77 is a generalization of the RLE approach that considers runs not just of single characters but of variable-length character sequences. Under such conditions, it becomes much better suited for text compression, especially, when the text has some redundancies. For example, program code files tend to have some identifiers constantly repeated (like, if, loop or nil, in Lisp), each code file may have a lengthy identical copyright notice at the top, and so on and so forth. The algorithm operates by replacing repeated occurrences of data with references to a single copy of that data seen earlier in the uncompressed stream. The encoding is by a pair of numbers: the length of the sequence and the offset back into the stream where the same sequence was originally encountered.

The most popular LZ77-based compression method is DEFLATE. In the algorithm, literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances are placed into a separate alphabet as they occur just after lengths, so they cannot be mistaken for another kind of symbol or vice versa. A DEFLATE stream consists of a series of blocks. Each block is preceded by a 3-bit header indicating the position of the block (last or intermediate) and the type of character-level compression used: no compression, Huffman with a predefined tree, and Huffman with a custom tree. Most compressible data will end up being encoded using the dynamic Huffman encoding. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the loss in compression due to using a non-optimal code.

The algorithm performs the following steps:

  1. Matching and replacement of duplicate strings with pointers: within a single block, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of an 8-bit length (the repeated block size is between 3 and 258 bytes) and a 15-bit distance (which specifies an offset of 1-32768 bytes inside the so-called "sliding window") to the beginning of the duplicate. If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of any number identical bytes can be encoded as a single byte followed by a length of (1- n).

  2. Huffman coding of the obtained block. Instructions to generate the necessary Huffman trees immediately follow the block header. There are, actually, 2 trees: the 288-symbol length/literal tree and the 32-symbol distance tree, which themselves encoded as canonical Huffman codes by giving the bit length of the code for each symbol. The bit lengths are then run-length encoded to produce as compact a representation as possible.

An interesting fact is that DEFLATE compression is so efficient in terms of speed that it is faster to read a compressed file from an ATA hard drive and decompress it in-memory than to read an original longer version: disk access is much longer than CPU processing, for this rather simple algorithm! Even more, it applies to network traffic. That's why compression is used (and enabled by default) in many popular network protocols, for instance, HTTP.


This chapter, unlike the previous one, instead of exploring many different approaches, dealt with, basically, just a single one in order to dig deeper and to demonstrate the use of all the tools that can be applied in algorithmic programming: from a piece of paper to sophisticated profilers. Moreover, the case we have analyzed provides a great showcase not just of the tools but of the whole development process with all its setbacks, trial and error, and discoveries.

Bit fiddling was another topic that naturally emerged in this chapter. It may look cryptic to those who have never ventured into this territory, but mastering the technique is necessary to gain access to a number of important areas of the algorithms landscape.

[1] To make full use of this feature and be able to profile SBCL internal functions, you'll need to compile SBCL with --with-cons-profiling flag. Many thanks to Douglas Katzman for developing this feature and guiding me through its usage.

[2] It was verified by taking the average of multiple test runs.

[3] You can study the details in the [relevant article](https://www.drdobbs.com/cpp/data-compression-with-arithmetic-encodin/240169251).


prj-nlp v.3

Ми з Мар'яною Романишин розпочали третій набір курсу з обробки людської письмової мови (чи як краще перекласти NLP :-p) в Проджекторі. Хотів написати трохи деталей про нього, бо курс нам дуже подобається і, звісно, кожного року хочеться, щоб його рівень продовжував зростати. А для цього треба, щоб потенційні студенти про нього знали і розуміли, як він влаштований.

Курс є трьохмісячним інтенсивом, який ставить на меті підготувати спеціаліста, здатного самостійно і якісно вирішувати NLP-задачі будь-якої складності. Як наодинці, так і в команді. Для того, щоб бути максимально успішним в ньому, треба мати певну базу. Як правило, найкраще себе показують, звісно, програмісти. Але є й винятки: ми залюбки беремо лінгвістів, журналістів, та й, загалом, спеціалістів з інших галузей за умови, що вони володють достатніми навиками програмування, щоб самостійно писати програми, налаштовувати зручне для себе середовище розробки і розуміти базові алгоритмічні концепції. Для курсу не треба знати ML, хоча якесь уявлення про нього бажано мати. Але курс побудований так, що ми почергово розбираємо NLP-теми і пов'язані з ними розділи ML. Звісно, це не значить, що в результаті людина буде гарно розбиратись у машинному навчанні, але необхідну базу для продовження поглиблення у цій сфері отримає.

Другою передумовою успіху на курсі є наявність достатнього часу. Ми кажемо, що необхідний мінімум — це 10 годин самостійної роботи на тиждень, плюс 5 годин на заняттях. Іншими словами, враховуючи час на дорогу, це вже пів робочих ставки. Але, звісно, комусь може знадобитись і більше самостійного часу. Крім того мозок буде досить сильно завантажений новими темами, тому на час занять доведеться відмовитись від інших додаткових проєктів, хоббі і т.п. Також не дуже добре виходить, якщо більше тижня підряд випадає з якоїсь зовнішньої причини: хвороби, відрядження, шлюбу, народження дітей... :)

Як побудований цей курс? Ми збираємо групу з 15 студентів і зустрічаємось два рази на тиждень: одне заняття — теоретичне у четвер увечері, присвячене розбору певної теми, друге — практичне у суботу, на якому ми показуємо приклад вирішення задачі по цій теми і разом програмуємо його. У більшості випадків, ця програма буде основою для розв'язку більш просунутої, але подібної задачі, яка дається на домашню роботу. Відповідно, у нас є 12 тижнів основної роботи, тобто 12 тем і близько 10 повноцінних домашніх проєктів рівня побудувати систему аналізу тональності, перевірки фактів чи синтаксичного розбору. Звісно, в умовах обмеженого часу, кожний з проєктів робиться в рамках певного обмеженого домену.

Курс розбитий на 3 частини:

  • перший місяць — це дуже ґрунтовна підготовча робота: основи структурної лінгвістики, робота з даними, метрики, правильні експерименти, підхід на правилах. В кінці місяця в голові у студента має сформуватись цілком структуроване уявлення про те, як правильно підходити до вирішення NLP-задач. Інший результат цієї частини — сформульоване завдання для курсового і початок роботи над ним: зібрані перші дані, визначена метрика, пророблений план експериментів
  • другий місяць — це занурення в класичне NLP з паралельною проробкою наійбільш розповсюджених ML-технік, які в ньому використовуються. В кінці місяця, після вирішення на лекціях, практичних і вдома майже десятка NLP-задач у студентів вже мають сформуватись навички для самостійного застосування цих технік у реальних проєктах. Ну і зроблена основна частина курсової роботи
  • останній місяць — це deep learning в NLP. Ми одразу попереджаємо, що цей курс не ставить на меті розказати побільше найгарячішого і проривного: для цього є достатньо інших майданчиків. Ми хочемо сформувати систематичне розуміння NLP, з всією його 70-річною історією. Бо в цій історії є дуже багато корисних і, можливо, timeless речей. Тож до state-of-the-art ми підходимо тільки під кінець (і на останньому занятті у нас виступає запрошений лектор, який розказує щось про bleeding edge :) Але принципові речі, пов'язані з DL, ми також пророблюємо як на заняттях, так і в рамках курсового. Ті зі студентів, кого цікавить саме ця сфера, під кінець курсу ганяють навчання десяток нейронок в своїх пісочницях, а також випробовують можливості глибинного навчання у своєму курсовому проєкті.. Втім, тут ми не можемо похвалитись приголомшливими результатами по якості, адже для їх досягнення замало пари тижнів, які по-максимуму є на ту чи іншу задачу: навчання глибинних моделей потребує багато як обчислювальних ресурсів, так і часових. Але тому, як до цього підходити, ми навчаємося

Як можна побачити з цього опису, дуже велику увагу ми приділяємо курсовому проєкту, роботу над яким стимулюємо кожного тижня. В результаті, у більшості виходять досить непогані і цікаві штуки, понад 70% студентів доходять до фінішу з якісною завершеною роботою (нечувана кількість для інших курсів, в яких мені доводилось брати участь). Деякі з проєктів навіть виходять у великий світ: хтось робить речі, пов'язані з роботою, хтось — з хоббі. За 2 роки у нас було 2 дослідження для журналістики даних, проєкт з аналізу конфліктів ліків між собою та з хронічними хворобами людини (на основі обробки інструкцій), система пошуку у соціальному застосунку для конференцій з запитами природньою мовою. Був і ряд цікавих проєктів для себе, які досягли класних результатів по якості та були зроблені з душею. Все це студенти презентують після закінчення на великій фінальній вечірці в офісі Grammarly.

Одна з основних цілей цього курсу для нас полягає в тому, щоб вирощувати українську NLP-спільноту. Адже школи комп'ютерної лінгвістики у нас, по суті, ніколи не було. І ми сподіваємось, що нам вдастся долучитись до її формування разом з іншими прогресивними проєктами навчання в цій сфері, зокрема магістерською програмою по Data Science в УКУ. У курсу вже більше 30 випускників, які увійшли до закритого клубу prj-nlp-alumni де ми ділимось цікавими речами та можливостями, а також плануємо періодично зустрічатись у неформальній атмосфері, а не тільки на івентах. Тож сподіваємось на розширення цього клубу ще на половину у червні цього року :)

P.S. До речі, про УКУ. Я також беру участь як викладач і ментор дипломних робіт у курсі NLP в їх програмі. Це трохи інший досвід, ніж цей курс. Звісно, УКУ пропонує більш академічну програму, яка триває довший час. Студенти отримують там гарну і, що важливо, систематичну підготовку з ML та DL. Тому цьому зовсім не треба приділяти увагу на курсі по NLP. З іншого боку, курс більш короткий і читається декількома викладачами, тому в його рамках важче сформувати цілісну картинку, нема можливості організувати такий же рівень занурення і концентрації, як на курсі в Проджекторі. Зате на магістерську роботу у них більше часу, ніж на весь наш курс. Але, найголовніше, що і тут, і там підбираються гарно підготовлені і мотивовані студенти, тож результати в УКУ також виходять гарної якості з великою кількістю цікавих робіт. Деякі з яких мають рівень статей на топові наукові конференції у цій галузі. Хоча мені особисто все-таки більше подобається формат Проджектора, адже він дає можливість відчути дух інтенсивної командної роботи протягом трьох місяців, зітхнувши з полегшенням в кінці у передчутті дев'ятимісячного перепочинку і нової ітерації...