2021-10-12

Watching a Model Train

Last week, I did a quick hack that quite delighted me: I added a way to visually watch the progress of training my MGL-based neural networks inside Emacs. And then people on twitter asked me to show the code. So, it will be here, but first I wanted to rant a bit about one of my pet peeves.

Low-Tech

In the age of Jupyter and TensorBoard, adding a way to see an image that records the value of a loss function blinking on the screen — "huh, big deal" you would say. But I believe this example showcases a difference between low-tech and high-tech approaches. Just recently I chatted with one of my friends who is entering software engineering at a rather late age (30+), and we talked of how frontend development became even more complicated than backend one (while, arguably, the complexity of tasks solved on the frontend is significantly lower). And that discussion just confirmed to me that the tendency to overcomplicate things is always there, with our pop-culture industry, surely, following it. But I always tried to stay on the simple side, on the side of low-tech solutions. And that's, by the way, one of the reasons I chose to stick with Lisp: with it, you would hardly be forced into some nonsense framework hell, or playing catch-up with the constant changes of your environment, or following crazy "best practices". Lisp is low-tech just like the Unix command-line or vanilla Python or JS. Contrary to the high-tech Rust, Haskell or Java. Everything text-based is also low-tech: text-based data formats, text-based visualization, text-based interfaces.

So, what is low-tech, after all? I saw the term popularized by Kris De Decker from the Low-Tech Magazine, which focuses on using simple (perhaps, outdated by some standards) technologies for solving serious engineering problems. Most people, and the software industry is no exception, are after high-tech, right? Progress of technology enables solving more and more complex tasks. And, indeed, that happens. Sometimes, not always. Sometimes, the whole thing crumbles, but that's a different story. Yet, even when it happens, there's a catch, a negative side-effect: the barrier of entry rises. If 5 or 10 years ago it was enough to know HTML, CSS, and JavaScript to be a competent frontend developer, now you have to learn a dozen more things: convoluted frameworks, complicated deploy toolchains, etc., etc. Surely, sometimes it's inevitable, but it really delights me when you can avoid all the bloat and use simple tools to achieve the same result. OK, maybe not completely the same, maybe not a perfect one. But good enough. The venerable 80% solution that requires 20% effort.

Low-tech is not low-quality, it's low-barrier of entry.

And I would argue that, in the long run, better progress in our field will be made if we strive towards lowering the bar to more people in, than if we continue raising it (ensuring our "job security" this way). Which doesn't mean that the technologies should be primitive (like BASIC). On the contrary, the most ingenious solutions are also the simplest ones. So, I'm going to continue this argument in the future posts I'd like to write about interactive programming. And now, back to our hacks.

Getting to Terms with MGL

In my recent experiments I returned to MGL — an advanced, although pretty opinionated, machine learning library by the prolific Gabor Melis — for playing around with neural networks. Last time, a few years ago I stumbled when I tried to use it to reproduce a very advanced (by that time's standards) recurrent neural network and failed. Yet, before that, I was very happy using it (or rather, it's underlying MGL-MAT library) for running in Lisp (in production) some of the neural networks that were developed by my colleagues. I know it's usually the other way around: Lisp for prototyping, some high-tech monstrosity for production, but we managed to turn the tides for some time :D

So, this time, I decided to approach MGL step by step, starting from simple building blocks. First, I took on training a simple feed-forward net with a number of word inputs converted to vectors using word2vec-like approach.

This is the network I created. Jumping slightly ahead, I've experimented with several variations of the architecture, starting from a single hidden layer MLP, and this one worked the best so far. As you see, it has 2 hidden layers (l1/l1-l and l2/l2-l) and performs 2-class classification. It also uses dropout after each of the layers as a standard means of regularization in the training process.

(defun make-nlp-mlp (&key (n-hidden 100))
  (mgl:build-fnn (:class 'nlp-mlp)
    (in (->input :size *input-len*))
    (l1-l (->activation in :size n-hidden))
    (l1 (->relu l1-l))
    (d1 (->dropout l1 :dropout 0.5))
    (l2-l (->activation d1 :size (floor n-hidden 2)))
    (l2 (->relu l2-l))
    (d2 (->dropout l2 :dropout 0.5))
    (out-l (->activation d2 :size 2))
    (out (->softmax-xe-loss out-l))))

MGL model definition is somewhat different from the approach one might be used to with Keras or TF: you don't imperatively add layers to the network, but, instead, you define all the layers at once in a declarative fashion. A typical Lisp style it is. Yet, what still remains not totally clear to me yet, is the best way to assemble layers when the architecture is not a straightforward one-direction or recurrent, but combines several parts in nonstandard ways. That's where I stumbled previously. I plan to get to that over time, but if someone has good examples already, I'd be glad to take a look at those. Unfortunately, despite the proven high-quality of MGL, there's very little open-source code that uses it.

Now, to make a model train (and watch it), we have to pass it to mgl:minimize alongside with a learner:

(defun train-nlp-fnn (&key data (batch-size 100) (epochs 1000) (n-hidden 100)
                       (random-state *random-state*))
  (let ((*random-state* random-state)
        (*agg-loss* ())
        (opt (make 'mgl:segmented-gd-optimizer
                   :termination (* epochs batch-size)
                   :segmenter (constantly
                                (make 'mgl:adam-optimizer
                                      :n-instances-in-batch batch-size))))
        (fnn (make-nlp-mlp :n-hidden n-hidden)))
    (mgl:map-segments (lambda (layer)
                        (mgl:gaussian-random!
                         (mgl:nodes layer)
                         :stddev (/ 2 (reduce '+ (mgl:mat-dimensions (mgl:nodes layer))))))
                      fnn)
    (mgl:monitor-optimization-periodically
     opt
     `((:fn mgl:reset-optimization-monitors :period ,batch-size :last-eval 0)
       (:fn draw-test-error :period ,batch-size)))
    (mgl:minimize opt (make 'mgl:bp-learner
                            :bpn fnn
                            :monitors (mgl:make-cost-monitors
                                       fnn :attributes `(:event "train")))
                  :dataset (sample-data data (* epochs batch-size)))
    fnn))

This code is rather complex, so let me try to explain each part.

  • We use (let ((*random-state* random-state) to ensure that we can reproduce training in exactly the same way if needed.
  • mgl:segmented-gd-optimizer is a class that allows us to specify a different optimization algorithm for each segment (layer) of the network. Here we use the same standard mgl:adam-optimizer with vanilla parameters for each segment (constantly).
  • The following mgl:map-segments call is performing the Xavier initialization of the input layers. It is crucial to properly initialize the layers of the network before training or, at least, ensure that they are not all set to zeroes.
  • The next part is, finally, responsible for WATCHING THE MODEL TRAIN. mgl:monitor-optimization-periodically is a hook to make MGL invoke some callbacks that will help you peek into the optimization process (and, perhaps, do other needful things). That's where we insert our draw-test-error function that will run each batch. There's also an out-of-the-box cost-monitor attached directly to the mgl:bp-learner, which is collecting the data for us and also printing it on the screen. I guess, we could build the draw-test-error monitor in a similar way, but I opted for my favorite Lisp magic wand — a special variable *agg-loss*.
  • And last but not least, we need to provide the dataset to the model: (sample-adata data (* epochs batch-size)). The simple approach that I use here is to pre-sample the necessary number of examples beforehand. However, streaming sampling may be also possible with a different dataset-generating function.

Now, let's take a look at the function that is drawing the graph:

(defun draw-test-error (opt learner)
  ;; here, we print out the architecture and parameters of
  ;; our model and learning algorithm
  (when (zerop (mgl:n-instances opt))
    (describe opt)
    (describe (mgl:bpn learner)))
  ;; here, we rely on the fact that there's
  ;; just a single cost monitor defined
  (let ((mon (first (mgl:monitors learner))))
    ;; using some of RUTILS syntax sugar here to make the code terser
    (push (pair (+ (? mon 'counter 'denominator)
                   (if-it (first *agg-loss*)
                          (lt it)
                          0))
                (? mon 'counter 'numerator))
          *agg-loss*)
    (redraw-loss-graph)))

(defun redraw-loss-graph (&key (file "/tmp/loss.png") (smoothing 10))
  (adw-charting:with-chart (:line 800 600)
    (adw-charting:add-series "Loss" *agg-loss*)
    (adw-charting:add-series
     (fmt "Smoothed^~a Loss" smoothing)
     (loop :for i :from 0
           :for off := (* smoothing (1+ i))
           :while (< off (length *agg-loss*))
           :collect (pair (? *agg-loss* (- off (floor smoothing 2)) 0)
                          (/ (reduce ^(+ % (rt %%))
                                     (subseq *agg-loss* (- off smoothing) off)
                                     :initial-value 0)
                             smoothing))))
    (adw-charting:set-axis :y "Loss" :draw-gridlines-p t)
    (adw-charting:set-axis :x "Iteration #")
    (adw-charting:save-file file)))

Using this approach, I could also draw the change of the validation loss on the same graph. And I'll do that in the next version.

ADW-CHARTING is my goto-library when I need to draw a quick-and-dirty chart. As you see, it is very straightforward to use and doesn't require a lot of explanation. I've looked into a couple other charting libraries and liked their demo screenshots (probably, more than the style of ADW-CHARTING), but there were some blockers that prevented me from switching to them. Maybe, next time, I'll have more inclination.  

To complete the picture we now need to display our learning progress not just with text running in the console (produced by the standard cost-monitor), but also by updating the graph. This is where Emacs' nature of a swiss-army knife for any interactive workflow came into play. Surely, there was already an existing auto-revert-mode that updates the contents of a Emacs buffer on any change or periodically. For my purposes, I've added this lines to my Emacs config:

(setq auto-revert-use-notify nil)
(setq auto-revert-interval 6)  ; refresh every seconds

Obviously, this can be abstracted away into a function which could be invoked by pressing some key or upon other conditions occurring.

2021-02-08

"Programming Algorithms in Lisp" Is Out!

The updated version of my book "Programming Algorithms" has been released by Apress recently. It has undergone a number of changes that I want to elaborate on in this post.

But first, I'd like to thank all the people who contributed to the book or supported my work on it in other ways. It was an honor for me to be invited to Apress as "Practical Common Lisp" published by them a decade ago was my one-way ticket to the wonderful world of Lisp. Writing "Programming Algorithms" was, in a way, an attempt to give something back. Also, I was very curious to see how the cooperation with the publisher would go. And I can say that they have done a very professional job and helped significantly improve the book through the review process. That 5-10% change that is contributed by the editors, although it may seem insignificant, is very important to bring any book to the high standard that allows not to annoy many people. Unfortunately, I am not a person who can produce a flawless result at once, so helping with correcting those flaws is very valuable. Part of gratitude for that also, surely, goes to many of the readers who have sent their suggestions.

I was very pleased that Michał "phoe" Herda has agreed to become the technical reviewer. He has found a number of bugs and suggested lots of improvements, of which I could implement, maybe, just a third. Perhaps, the rest will go into the second edition :)

Now, let's speak about some of those additions to Programming Algorithms in Lisp.

Curious Fixes

First of all, all the executable code from the book was published in a github repo (and also republished to the oficial Apress repo). As suggested by Michał, I have added automated tests to ensure (for now, partially, but we plan to make the test suite all-encompassing) that everything compiles and runs correctly. Needless to say that some typos and other issues were found in the process. Especially, connected with handling different corner cases. So, if you have trouble running some code from the book, you can use the github version. Funny enough, I got into a similar situation recently, when I tried to utilize the dynamic programming example in writing a small tool for aligning outputs of different ASR systems and found a bug in it. The bug was is in the matrix initialization code:

-    (dotimes (k (1+ (length s1))) (setf (aref ld k 0) 0))
-    (dotimes (k (1+ (length s2))) (setf (aref ld 0 k) 0)))
+    (dotimes (k (1+ (length s1))) (setf (aref ld k 0) k))
+    (dotimes (k (1+ (length s2))) (setf (aref ld 0 k) k)))

Another important fix that originated from the review process touched not only the book but also the implementation of the slice function in RUTILS! It turned out that I was naively assuming that displaced arrays will automatically recursively point into the original array, and thus, inadvertently, created a possibility for O(n) slice performance instead of O(1). It explains the strange performance of array sorting algorithms at the end of Chapter 5. After fixing slice, the measurements started to perfectly resemble the theoretical expectations! And, also the performance has improved an order of magnitude :D

CL-USER> (let ((vec (random-vec 10000)))
           (print-sort-timings "Insertion " 'insertion-sort vec)
           (print-sort-timings "Quick" 'quicksort vec)
           (print-sort-timings "Prod" 'prod-sort vec))
= Insertion sort of random vector (length=10000) =
Evaluation took:
  0.632 seconds of real time
...
= Insertion sort of sorted vector (length=10000) =
Evaluation took:
  0.000 seconds of real time
...
= Insertion sort of reverse sorted vector (length=10000) =
Evaluation took:
  1.300 seconds of real time
...
= Quicksort of random vector (length=10000) =
Evaluation took:
  0.039 seconds of real time
...
= Quicksort of sorted vector (length=10000) =
Evaluation took:
  1.328 seconds of real time
...
= Quicksort of reverse sorted vector (length=10000) =
Evaluation took:
  1.128 seconds of real time
...
= Prodsort of random vector (length=10000) =
Evaluation took:
  0.011 seconds of real time
...
= Prodsort of sorted vector (length=10000) =
Evaluation took:
  0.011 seconds of real time
...
= Prodsort of reverse sorted vector (length=10000) =
Evaluation took:
  0.021 seconds of real time
...

Also, there were some missing or excess closing parens in a few code blocks. This, probably, resulted from incorrectly copying the code from the REPL after finishing experimenting with it. :)

New Additions

I have also added more code to complete the full picture, so to say, in several parts where it was lacking, from the reviewers' point of view. Most new additions went into expanding "In Action" sections where it was possible. Still, unfortunately, some parts remain on the level of general explanation of the solution as it was not possible to include whole libraries of code into the book. You can see a couple of snippets below:

Binary Search in Action: a Fast Specialized In-Memory DB

We can outline the operation of such a datastore with the following key structures and functions.

A dictionary *dict* will be used to map words to numeric codes. (We'll discuss hash-tables that are employed for such dictionaries several chapters later. For now, it will be sufficient to say that we can get the index of a word in our dictionary with (rtl:? *dict* word)). The number of entries in the dictionary will be around 1 million.

All the ngrams will be stored alphabetically sorted in 2-gigabyte files with the following naming scheme: ngram-rank-i.bin. rank is the ngram word count (we were specifically using ngrams of ranks from 1 to 5) and i is the sequence number of the file. The contents of the files will constitute the alternating ngram indices and their frequencies. The index for each ngram will be a vector of 32-bit integers with the length equal to the rank of an ngram. Each element of this vector will represent the index of the word in *dict*. The frequency will also be a 32-bit integer.

All these files will be read into memory. As the structure of the file is regular — each ngram corresponds to a block of (1+ rank) 32-bit integers — it can be treated as a large vector.

For each file, we know the codes of the first and last ngrams. Based on this, the top-level index will be created to facilitate efficiently locating the file that contains a particular ngram.

Next, binary search will be performed directly on the contents of the selected file. The only difference with regular binary search is that the comparisons need to be performed rank times: for each 32-bit code.

A simplified version of the main function get-freq intended to retrieve the ngram frequency for ranks 2-5 will look something like this:

(defun get-freq (ngram)
  (rt:with ((rank (length ngram))
            (codes (ngram-codes ngram))
            (vec index found?
                 (bin-search codes
                             (ngrams-vec rank codes)
                             :less 'codes<
                             :test 'ngram=)))
     (if found?
         (aref vec rank)
         0)))

where

(defun ngram-codes (ngram)
  (map-vec (lambda (word) (rtl:? *dict* word))
           ngram))

(defun ngrams-vec (rank codes)
  (loop :for ((codes1 codes2) ngrams-vec) :across *ngrams-index*
        :when (and (<= (aref codes1 0) (aref codes 0))
                   (codes< codes codes2 :when= t))
        :do (return ngrams-vec)))
             
(defun codes< (codes1 codes2 &key when=)
  (dotimes (i (length codes1)
              ;; this will be returned when all
              ;; corresponding elements of codes are equal
              when=)
    (cond ((< (aref codes1 i)
              (aref codes2 i))
           (return t))
          ((> (aref codes1 i)
              (aref codes2 i))
           (return nil))))) 

(defun ngram= (block1 block2)
  (let ((rank (1- (length block1))))
    (every '= (rtl:slice block1 0 rank)
              (rtl:slice block2 0 rank)))

We assume that the *ngrams-index* array containing a pair of pairs of codes for the first and last ngram in the file and the ngrams data from the file itself was already initialized. This array should be sorted by the codes of the first ngram in the pair. A significant drawback of the original version of this program was that it took quite some time to read all the files (tens of gigabytes) from disk. During this operation, which measured in several dozens of minutes, the application was not responsive. This created a serious bottleneck in the system as a whole and complicated updates, as well as put normal operation at additional risk. The solution we utilized to counteract this issue was a common one for such cases: switching to lazy loading using the Unix mmap facility. With this approach, the bounding ngram codes for each file should be precalculated and stored as metadata, to initialize the *ngrams-index* before loading the data itself.

Pagerank MapReduce Explanation

;; this function will be executed by mapper workers
(defun pr1 (node n p &key (d 0.85))
  (let ((pr (make-arrray n :initial-element 0))
        (m (hash-table-count (node-children node))))
    (rtl:dokv (j child (node-children node))
      (setf (aref pr j) (* d (/ p m))))
    pr))

(defun pagerank-mr (g &key (d 0.85) (repeat 100))
  (rtl:with ((n (length (nodes g)))
             (pr (make-arrray n :initial-element (/ 1 n))))
    (loop :repeat repeat :do
      (setf pr (map 'vector (lambda (x)
                              (- 1 (/ d n)))
                    (reduce 'vec+ (map 'vector (lambda (node p)
                                                 (pr1 node n p :d d))
                                       (nodes g)
                                       pr)))))
    pr))

Here, we have used the standard Lisp map and reduce functions, but a map-reduce framework will provide replacement functions which, behind-the-scenes, will orchestrate parallel execution of the provided code. We will talk a bit more about map-reduce and see such a framework in the last chapter of this book.

One more thing to note is that the latter approach differs from the original version in that each mapper operates independently on an isolated version of the pr vector, and thus the execution of Pagerank on the subsequent nodes during a single iteration will see an older input value p. However, since the algorithm is stochastic and the order of calculations is not deterministic, this is acceptable: it may impact only the speed of convergence (and hence the number of iterations needed) but not the final result.

Other Significant Changes

My decision to heavily rely on syntactic utilities from my RUTILS library was a controversial one, from the start. And, surely, I understood it. But my motivation, in this regard, always was and still remains not self-promotion but a desire to present Lisp code so that it didn't seem cumbersome, old-fashioned, or cryptic (and, thankfully, the language provides all possibilities to tune its surface look to your preferences). However, as it bugged so many people, including the reviewers, for the new edition, we have come to a compromise to use all RUTILS code only qualified with the rtl prefix so that it was apparent. Besides, I have changed some of the minor purely convenience abbreviations to their standard counterparts (like returning funcall instead of call).

Finally, the change that I regret the most, but understand that it was inevitable, is the change of title and the new cover, which is in standard Apress style. However, they have preserved the Draco tree in the top right corner. And it's like a window through which you can glance at the original book :)  


So, that is an update on the status of the book.

For those who were waiting for the Apress release to come out, it's your chance to get it. The price is quite affordable. Basically, the same as the one I asked for (individual shipping via post is a huge expense).

And for those who have already gotten the original version of the book, all the major changes and fixes are listed in the post. Please, take notice if you had any issues.

I hope the book turns out to be useful to the Lisp community and serves both Lisp old-timers and newcomers.