Lisp, the Universe and Everything

Watching a Model Train

2021-10-12T13:50:00.004+03:00

Last week, I did a quick hack that quite delighted me: I added a way to visually watch the progress of training my MGL-based neural networks inside Emacs. And then people on twitter asked me to show the code. So, it will be here, but first I wanted to rant a bit about one of my pet peeves.

Low-Tech

In the age of Jupyter and TensorBoard, adding a way to see an image that records the value of a loss function blinking on the screen — "huh, big deal" you would say. But I believe this example showcases a difference between low-tech and high-tech approaches. Just recently I chatted with one of my friends who is entering software engineering at a rather late age (30+), and we talked of how frontend development became even more complicated than backend one (while, arguably, the complexity of tasks solved on the frontend is significantly lower). And that discussion just confirmed to me that the tendency to overcomplicate things is always there, with our pop-culture industry, surely, following it. But I always tried to stay on the simple side, on the side of low-tech solutions. And that's, by the way, one of the reasons I chose to stick with Lisp: with it, you would hardly be forced into some nonsense framework hell, or playing catch-up with the constant changes of your environment, or following crazy "best practices". Lisp is low-tech just like the Unix command-line or vanilla Python or JS. Contrary to the high-tech Rust, Haskell or Java. Everything text-based is also low-tech: text-based data formats, text-based visualization, text-based interfaces.

So, what is low-tech, after all? I saw the term popularized by Kris De Decker from the Low-Tech Magazine, which focuses on using simple (perhaps, outdated by some standards) technologies for solving serious engineering problems. Most people, and the software industry is no exception, are after high-tech, right? Progress of technology enables solving more and more complex tasks. And, indeed, that happens. Sometimes, not always. Sometimes, the whole thing crumbles, but that's a different story. Yet, even when it happens, there's a catch, a negative side-effect: the barrier of entry rises. If 5 or 10 years ago it was enough to know HTML, CSS, and JavaScript to be a competent frontend developer, now you have to learn a dozen more things: convoluted frameworks, complicated deploy toolchains, etc., etc. Surely, sometimes it's inevitable, but it really delights me when you can avoid all the bloat and use simple tools to achieve the same result. OK, maybe not completely the same, maybe not a perfect one. But good enough. The venerable 80% solution that requires 20% effort.

Low-tech is not low-quality, it's low-barrier of entry.

And I would argue that, in the long run, better progress in our field will be made if we strive towards lowering the bar to more people in, than if we continue raising it (ensuring our "job security" this way). Which doesn't mean that the technologies should be primitive (like BASIC). On the contrary, the most ingenious solutions are also the simplest ones. So, I'm going to continue this argument in the future posts I'd like to write about interactive programming. And now, back to our hacks.

Getting to Terms with MGL

In my recent experiments I returned to MGL — an advanced, although pretty opinionated, machine learning library by the prolific Gabor Melis — for playing around with neural networks. Last time, a few years ago I stumbled when I tried to use it to reproduce a very advanced (by that time's standards) recurrent neural network and failed. Yet, before that, I was very happy using it (or rather, it's underlying MGL-MAT library) for running in Lisp (in production) some of the neural networks that were developed by my colleagues. I know it's usually the other way around: Lisp for prototyping, some high-tech monstrosity for production, but we managed to turn the tides for some time :D

So, this time, I decided to approach MGL step by step, starting from simple building blocks. First, I took on training a simple feed-forward net with a number of word inputs converted to vectors using word2vec-like approach.

This is the network I created. Jumping slightly ahead, I've experimented with several variations of the architecture, starting from a single hidden layer MLP, and this one worked the best so far. As you see, it has 2 hidden layers (l1/l1-l and l2/l2-l) and performs 2-class classification. It also uses dropout after each of the layers as a standard means of regularization in the training process.

(defun make-nlp-mlp (&key (n-hidden 100))
  (mgl:build-fnn (:class 'nlp-mlp)
    (in (->input :size *input-len*))
    (l1-l (->activation in :size n-hidden))
    (l1 (->relu l1-l))
    (d1 (->dropout l1 :dropout 0.5))
    (l2-l (->activation d1 :size (floor n-hidden 2)))
    (l2 (->relu l2-l))
    (d2 (->dropout l2 :dropout 0.5))
    (out-l (->activation d2 :size 2))
    (out (->softmax-xe-loss out-l))))

MGL model definition is somewhat different from the approach one might be used to with Keras or TF: you don't imperatively add layers to the network, but, instead, you define all the layers at once in a declarative fashion. A typical Lisp style it is. Yet, what still remains not totally clear to me yet, is the best way to assemble layers when the architecture is not a straightforward one-direction or recurrent, but combines several parts in nonstandard ways. That's where I stumbled previously. I plan to get to that over time, but if someone has good examples already, I'd be glad to take a look at those. Unfortunately, despite the proven high-quality of MGL, there's very little open-source code that uses it.

Now, to make a model train (and watch it), we have to pass it to mgl:minimize alongside with a learner:

(defun train-nlp-fnn (&key data (batch-size 100) (epochs 1000) (n-hidden 100)
                       (random-state *random-state*))
  (let ((*random-state* random-state)
        (*agg-loss* ())
        (opt (make 'mgl:segmented-gd-optimizer
                   :termination (* epochs batch-size)
                   :segmenter (constantly
                                (make 'mgl:adam-optimizer
                                      :n-instances-in-batch batch-size))))
        (fnn (make-nlp-mlp :n-hidden n-hidden)))
    (mgl:map-segments (lambda (layer)
                        (mgl:gaussian-random!
                         (mgl:nodes layer)
                         :stddev (/ 2 (reduce '+ (mgl:mat-dimensions (mgl:nodes layer))))))
                      fnn)
    (mgl:monitor-optimization-periodically
     opt
     `((:fn mgl:reset-optimization-monitors :period ,batch-size :last-eval 0)
       (:fn draw-test-error :period ,batch-size)))
    (mgl:minimize opt (make 'mgl:bp-learner
                            :bpn fnn
                            :monitors (mgl:make-cost-monitors
                                       fnn :attributes `(:event "train")))
                  :dataset (sample-data data (* epochs batch-size)))
    fnn))

This code is rather complex, so let me try to explain each part.

We use (let ((*random-state* random-state) to ensure that we can reproduce training in exactly the same way if needed.
mgl:segmented-gd-optimizer is a class that allows us to specify a different optimization algorithm for each segment (layer) of the network. Here we use the same standard mgl:adam-optimizer with vanilla parameters for each segment (constantly).
The following mgl:map-segments call is performing the Xavier initialization of the input layers. It is crucial to properly initialize the layers of the network before training or, at least, ensure that they are not all set to zeroes.
The next part is, finally, responsible for WATCHING THE MODEL TRAIN. mgl:monitor-optimization-periodically is a hook to make MGL invoke some callbacks that will help you peek into the optimization process (and, perhaps, do other needful things). That's where we insert our draw-test-error function that will run each batch. There's also an out-of-the-box cost-monitor attached directly to the mgl:bp-learner, which is collecting the data for us and also printing it on the screen. I guess, we could build the draw-test-error monitor in a similar way, but I opted for my favorite Lisp magic wand — a special variable *agg-loss*.
And last but not least, we need to provide the dataset to the model: (sample-adata data (* epochs batch-size)). The simple approach that I use here is to pre-sample the necessary number of examples beforehand. However, streaming sampling may be also possible with a different dataset-generating function.

Now, let's take a look at the function that is drawing the graph:

(defun draw-test-error (opt learner)
  ;; here, we print out the architecture and parameters of
  ;; our model and learning algorithm
  (when (zerop (mgl:n-instances opt))
    (describe opt)
    (describe (mgl:bpn learner)))
  ;; here, we rely on the fact that there's
  ;; just a single cost monitor defined
  (let ((mon (first (mgl:monitors learner))))
    ;; using some of RUTILS syntax sugar here to make the code terser
    (push (pair (+ (? mon 'counter 'denominator)
                   (if-it (first *agg-loss*)
                          (lt it)
                          0))
                (? mon 'counter 'numerator))
          *agg-loss*)
    (redraw-loss-graph)))

(defun redraw-loss-graph (&key (file "/tmp/loss.png") (smoothing 10))
  (adw-charting:with-chart (:line 800 600)
    (adw-charting:add-series "Loss" *agg-loss*)
    (adw-charting:add-series
     (fmt "Smoothed^~a Loss" smoothing)
     (loop :for i :from 0
           :for off := (* smoothing (1+ i))
           :while (< off (length *agg-loss*))
           :collect (pair (? *agg-loss* (- off (floor smoothing 2)) 0)
                          (/ (reduce ^(+ % (rt %%))
                                     (subseq *agg-loss* (- off smoothing) off)
                                     :initial-value 0)
                             smoothing))))
    (adw-charting:set-axis :y "Loss" :draw-gridlines-p t)
    (adw-charting:set-axis :x "Iteration #")
    (adw-charting:save-file file)))

Using this approach, I could also draw the change of the validation loss on the same graph. And I'll do that in the next version.

ADW-CHARTING is my goto-library when I need to draw a quick-and-dirty chart. As you see, it is very straightforward to use and doesn't require a lot of explanation. I've looked into a couple other charting libraries and liked their demo screenshots (probably, more than the style of ADW-CHARTING), but there were some blockers that prevented me from switching to them. Maybe, next time, I'll have more inclination.

To complete the picture we now need to display our learning progress not just with text running in the console (produced by the standard cost-monitor), but also by updating the graph. This is where Emacs' nature of a swiss-army knife for any interactive workflow came into play. Surely, there was already an existing auto-revert-mode that updates the contents of a Emacs buffer on any change or periodically. For my purposes, I've added this lines to my Emacs config:

(setq auto-revert-use-notify nil)
(setq auto-revert-interval 6)  ; refresh every seconds

Obviously, this can be abstracted away into a function which could be invoked by pressing some key or upon other conditions occurring.

A nice lisp/interactive dev hack I did today: watching an MGL neural net train with a chart dynamically redrawn by adw-charting and updated inside Emacs by auto-revert-mode. Slime/Emacs - the ultimately customizable interactive experience FTW pic.twitter.com/y5ie9Xm20D
— Vsevolod (@vseloved) October 6, 2021

"Programming Algorithms in Lisp" Is Out!

2021-02-08T12:48:00.002+02:00

The updated version of my book "Programming Algorithms" has been released by Apress recently. It has undergone a number of changes that I want to elaborate on in this post.

But first, I'd like to thank all the people who contributed to the book or supported my work on it in other ways. It was an honor for me to be invited to Apress as "Practical Common Lisp" published by them a decade ago was my one-way ticket to the wonderful world of Lisp. Writing "Programming Algorithms" was, in a way, an attempt to give something back. Also, I was very curious to see how the cooperation with the publisher would go. And I can say that they have done a very professional job and helped significantly improve the book through the review process. That 5-10% change that is contributed by the editors, although it may seem insignificant, is very important to bring any book to the high standard that allows not to annoy many people. Unfortunately, I am not a person who can produce a flawless result at once, so helping with correcting those flaws is very valuable. Part of gratitude for that also, surely, goes to many of the readers who have sent their suggestions.

I was very pleased that Michał "phoe" Herda has agreed to become the technical reviewer. He has found a number of bugs and suggested lots of improvements, of which I could implement, maybe, just a third. Perhaps, the rest will go into the second edition :)

Now, let's speak about some of those additions to Programming Algorithms in Lisp.

Curious Fixes

First of all, all the executable code from the book was published in a github repo (and also republished to the oficial Apress repo). As suggested by Michał, I have added automated tests to ensure (for now, partially, but we plan to make the test suite all-encompassing) that everything compiles and runs correctly. Needless to say that some typos and other issues were found in the process. Especially, connected with handling different corner cases. So, if you have trouble running some code from the book, you can use the github version. Funny enough, I got into a similar situation recently, when I tried to utilize the dynamic programming example in writing a small tool for aligning outputs of different ASR systems and found a bug in it. The bug was is in the matrix initialization code:

-    (dotimes (k (1+ (length s1))) (setf (aref ld k 0) 0))
-    (dotimes (k (1+ (length s2))) (setf (aref ld 0 k) 0)))
+    (dotimes (k (1+ (length s1))) (setf (aref ld k 0) k))
+    (dotimes (k (1+ (length s2))) (setf (aref ld 0 k) k)))

Another important fix that originated from the review process touched not only the book but also the implementation of the slice function in RUTILS! It turned out that I was naively assuming that displaced arrays will automatically recursively point into the original array, and thus, inadvertently, created a possibility for O(n) slice performance instead of O(1). It explains the strange performance of array sorting algorithms at the end of Chapter 5. After fixing slice, the measurements started to perfectly resemble the theoretical expectations! And, also the performance has improved an order of magnitude :D

CL-USER> (let ((vec (random-vec 10000)))
           (print-sort-timings "Insertion " 'insertion-sort vec)
           (print-sort-timings "Quick" 'quicksort vec)
           (print-sort-timings "Prod" 'prod-sort vec))
= Insertion sort of random vector (length=10000) =
Evaluation took:
  0.632 seconds of real time
...
= Insertion sort of sorted vector (length=10000) =
Evaluation took:
  0.000 seconds of real time
...
= Insertion sort of reverse sorted vector (length=10000) =
Evaluation took:
  1.300 seconds of real time
...
= Quicksort of random vector (length=10000) =
Evaluation took:
  0.039 seconds of real time
...
= Quicksort of sorted vector (length=10000) =
Evaluation took:
  1.328 seconds of real time
...
= Quicksort of reverse sorted vector (length=10000) =
Evaluation took:
  1.128 seconds of real time
...
= Prodsort of random vector (length=10000) =
Evaluation took:
  0.011 seconds of real time
...
= Prodsort of sorted vector (length=10000) =
Evaluation took:
  0.011 seconds of real time
...
= Prodsort of reverse sorted vector (length=10000) =
Evaluation took:
  0.021 seconds of real time
...

Also, there were some missing or excess closing parens in a few code blocks. This, probably, resulted from incorrectly copying the code from the REPL after finishing experimenting with it. :)

New Additions

I have also added more code to complete the full picture, so to say, in several parts where it was lacking, from the reviewers' point of view. Most new additions went into expanding "In Action" sections where it was possible. Still, unfortunately, some parts remain on the level of general explanation of the solution as it was not possible to include whole libraries of code into the book. You can see a couple of snippets below:

Binary Search in Action: a Fast Specialized In-Memory DB

We can outline the operation of such a datastore with the following key structures and functions.

A dictionary *dict* will be used to map words to numeric codes. (We'll discuss hash-tables that are employed for such dictionaries several chapters later. For now, it will be sufficient to say that we can get the index of a word in our dictionary with (rtl:? *dict* word)). The number of entries in the dictionary will be around 1 million.

All the ngrams will be stored alphabetically sorted in 2-gigabyte files with the following naming scheme: ngram-rank-i.bin. rank is the ngram word count (we were specifically using ngrams of ranks from 1 to 5) and i is the sequence number of the file. The contents of the files will constitute the alternating ngram indices and their frequencies. The index for each ngram will be a vector of 32-bit integers with the length equal to the rank of an ngram. Each element of this vector will represent the index of the word in *dict*. The frequency will also be a 32-bit integer.

All these files will be read into memory. As the structure of the file is regular — each ngram corresponds to a block of (1+ rank) 32-bit integers — it can be treated as a large vector.

For each file, we know the codes of the first and last ngrams. Based on this, the top-level index will be created to facilitate efficiently locating the file that contains a particular ngram.

Next, binary search will be performed directly on the contents of the selected file. The only difference with regular binary search is that the comparisons need to be performed rank times: for each 32-bit code.

A simplified version of the main function get-freq intended to retrieve the ngram frequency for ranks 2-5 will look something like this:

(defun get-freq (ngram)
  (rt:with ((rank (length ngram))
            (codes (ngram-codes ngram))
            (vec index found?
                 (bin-search codes
                             (ngrams-vec rank codes)
                             :less 'codes<
                             :test 'ngram=)))
     (if found?
         (aref vec rank)
         0)))

where

(defun ngram-codes (ngram)
  (map-vec (lambda (word) (rtl:? *dict* word))
           ngram))

(defun ngrams-vec (rank codes)
  (loop :for ((codes1 codes2) ngrams-vec) :across *ngrams-index*
        :when (and (<= (aref codes1 0) (aref codes 0))
                   (codes< codes codes2 :when= t))
        :do (return ngrams-vec)))
             
(defun codes< (codes1 codes2 &key when=)
  (dotimes (i (length codes1)
              ;; this will be returned when all
              ;; corresponding elements of codes are equal
              when=)
    (cond ((< (aref codes1 i)
              (aref codes2 i))
           (return t))
          ((> (aref codes1 i)
              (aref codes2 i))
           (return nil))))) 

(defun ngram= (block1 block2)
  (let ((rank (1- (length block1))))
    (every '= (rtl:slice block1 0 rank)
              (rtl:slice block2 0 rank)))

We assume that the *ngrams-index* array containing a pair of pairs of codes for the first and last ngram in the file and the ngrams data from the file itself was already initialized. This array should be sorted by the codes of the first ngram in the pair. A significant drawback of the original version of this program was that it took quite some time to read all the files (tens of gigabytes) from disk. During this operation, which measured in several dozens of minutes, the application was not responsive. This created a serious bottleneck in the system as a whole and complicated updates, as well as put normal operation at additional risk. The solution we utilized to counteract this issue was a common one for such cases: switching to lazy loading using the Unix mmap facility. With this approach, the bounding ngram codes for each file should be precalculated and stored as metadata, to initialize the *ngrams-index* before loading the data itself.

Pagerank MapReduce Explanation

;; this function will be executed by mapper workers
(defun pr1 (node n p &key (d 0.85))
  (let ((pr (make-arrray n :initial-element 0))
        (m (hash-table-count (node-children node))))
    (rtl:dokv (j child (node-children node))
      (setf (aref pr j) (* d (/ p m))))
    pr))

(defun pagerank-mr (g &key (d 0.85) (repeat 100))
  (rtl:with ((n (length (nodes g)))
             (pr (make-arrray n :initial-element (/ 1 n))))
    (loop :repeat repeat :do
      (setf pr (map 'vector (lambda (x)
                              (- 1 (/ d n)))
                    (reduce 'vec+ (map 'vector (lambda (node p)
                                                 (pr1 node n p :d d))
                                       (nodes g)
                                       pr)))))
    pr))

Here, we have used the standard Lisp map and reduce functions, but a map-reduce framework will provide replacement functions which, behind-the-scenes, will orchestrate parallel execution of the provided code. We will talk a bit more about map-reduce and see such a framework in the last chapter of this book.

One more thing to note is that the latter approach differs from the original version in that each mapper operates independently on an isolated version of the pr vector, and thus the execution of Pagerank on the subsequent nodes during a single iteration will see an older input value p. However, since the algorithm is stochastic and the order of calculations is not deterministic, this is acceptable: it may impact only the speed of convergence (and hence the number of iterations needed) but not the final result.

Other Significant Changes

My decision to heavily rely on syntactic utilities from my RUTILS library was a controversial one, from the start. And, surely, I understood it. But my motivation, in this regard, always was and still remains not self-promotion but a desire to present Lisp code so that it didn't seem cumbersome, old-fashioned, or cryptic (and, thankfully, the language provides all possibilities to tune its surface look to your preferences). However, as it bugged so many people, including the reviewers, for the new edition, we have come to a compromise to use all RUTILS code only qualified with the rtl prefix so that it was apparent. Besides, I have changed some of the minor purely convenience abbreviations to their standard counterparts (like returning funcall instead of call).

Finally, the change that I regret the most, but understand that it was inevitable, is the change of title and the new cover, which is in standard Apress style. However, they have preserved the Draco tree in the top right corner. And it's like a window through which you can glance at the original book :)

So, that is an update on the status of the book.

For those who were waiting for the Apress release to come out, it's your chance to get it. The price is quite affordable. Basically, the same as the one I asked for (individual shipping via post is a huge expense).

And for those who have already gotten the original version of the book, all the major changes and fixes are listed in the post. Please, take notice if you had any issues.

I hope the book turns out to be useful to the Lisp community and serves both Lisp old-timers and newcomers.

The Common Lisp Condition System Book

2020-11-23T14:41:00.004+02:00

Several months ago I had a pleasure to be one of the reviewers of the book The Common Lisp Condition System (Beyond Exception Handling with Control Flow Mechanisms) by Michał Herda. I doubt that I have contributed much to the book, but, at least, I can express my appreciation in the form of a reader review here.

My overall impression is that the book is very well-written and definitely worth reading. I always considered special variables, the condition system, and multiple returns values to be the most underappreciated features of Common Lisp, although I have never imagined that a whole book may be written on these topics (and even just two of them). So, I was pleasantly flabbergasted.

The book has a lot of things I value in good technical writing: a structured and logical exposition, detailed discussions of various nuances, a subtle sense of humor, and lots of Lisp. I should say that reading the stories of Tom, Kate, and Mark was so entertaining that I wished to learn more about their lives. I even daydreamt (to use the term often seen throughout the book) about a new semi-fiction genre: stories about people who behave like computer programs. I guess a book of short stories containing the two from this book and the story of Mac from "Practical Common Lisp" can already be initialized. "Anthropomorphic Lisp Tales"...

So, I can definitely recommend reading CLCS to anyone interested in expanding their Lisp knowledge and general understanding of programming concepts. And although I can call myself quite well versed with the CL condition system, I was also able to learn several new tricks and enrich my understanding. Actually, that is quite valuable as you never know when one of its features could become handy to save your programming day. In my own Lisp career, I had several such a-ha moments and continue appreciating them.

This book should also be relevant to those, who have a general understanding of Lisp, but are compelled to spend their careers programming in inferior languages: you can learn more about one of the foundations of interactive programming and appreciate its value. Perhaps, one day you'll have access to programming environments that focus on this dimension or you'll be able to add elements of interactivity to your own workflow.

As for those who are not familiar with Lisp, I'd first start with the classic Practical Common Lisp.

So, thanks to Michał for another great addition to my virtual Lisp books collection. The spice mush flow, as they say...

Why RDF* Is a Mess... and How to Fix It

2020-10-14T12:02:00.018+03:00

TL;DR

RDF* is a new group of standards that aims to bridge the gap between RDF and property graphs. However, it has taken an "easy" route that made it ambiguous and backward incompatible. An alternative approach that doesn't suffer from the mentioned problems would be to introduce the notion of triple labels instead of using the embedded triples syntax.

How Standards Should Be Developed

Our CTO used to say that standards should be written solely by those who are implementing them. And although this statement may be a little too extreme in some cases, it's a good rule of thumb. The main reason for it is not that it will make the standards simple to implement. Moreover, I don't want to argue that allowing a simple implementation is the main requirement for a standard. What's more important is that the implementors have a combined exposure to the whole variety of potential use cases both from the user feedback and own experience of being a consumer of their own dogfood. Besides, it doesn't hurt, in the long run, that if something is simple to implement it's also, usually, simple to understand, reason about, and use.

Obviously, given all power to the implementers might lead to abuse of this power, but it's a second-order problem and there are known ways to mitigate it. Primarily, by assembling representatives of several implementations in a committee. An approach that is often frowned upon by hotheads due to alleged bureaucracy and the need for compromise, yet leading to much more thought-out and lasting standards. A good example of such is the Common Lisp standard.

RDF*

But I digressed, let's talk about RDF*. This is the major candidate to solve the problem of RDF triple reification that is crucial to basic compatibility between RDF and property graph representations. In short, RDF defines the simplest elegant abstraction for representing any kind of data — a triple that comprises a subject, predicate, and object. Triples are used to represent facts. Besides, there's a third component to a triple called a graph. So, in fact, the historic name triple, in the realm of RDF triple-stores, currently stands for a structure of 4 elements. The graph may be used to group triples. And despite the beauty of the simple and elegant concept of a triple, in theory, having this fourth component is essential for any serious data modeling.

Now, we know how to represent facts and arbitrarily group them together. So, the usual next-level question arises: how to represent facts about facts? Which leads to the problem of reification. Let's consider a simple example:

:I :have :dream .

This is a simple triple represented in the popular Turtle format. As RDF deals with resources, it assumes that there's some default prefix defined elsewhere (let's say it's https://foo.com/). The sam triple may be represented in the basic NTriples format like this:

<https://foo.com/I> <https://foo.com/have> <https://foo.com/dream> .

What if we want to express the facts that it is a quote by Martin Luther King and that it was uttered in 1963? There are at least 3 ways to approach it:

The obvious but wrong (as it reimplements the semantics of RDF in RDF introducing unnecessary complexity) one of turning it into a set of triples each one describing some of the properties of the original statement with all of subject, predicate, and object being one of the properties:
```
_:fact1 rdf:subject :I ;
        rdf:predicate :have ;
        rdf:object :dream ;
        meta:author wiki:Martin_Luther_King_Jr. ;
        meta:date "1963" .
```
This works but, as I said, is conceptually wrong. It's the RDF's Java-style cancer of the semicolon. It leads to storage waste and poor performance.
The other one is to use singleton predicates and assign metadata to them:
```
:I :have#1 :dream .
:have#1 meta:author wiki:Martin_Luther_King_Jr. ;
        meta:date "1963" .
```
This is complete nonsense as it makes SPARQL queries unreasonably complex unless you implement special syntax that will ignore the #1 suffix, in the query engine.
Yet another one, which I consider to be the best (close to perfect), is to use the graph to attach triple metadata instead of grouping the triples.
```
:t1 { :I :have :dream . }
:t1 meta:author wiki:Martin_Luther_King_Jr. ;
    meta:date "1963" .
```
Here, we use another (there's many more of them :) ) RDF format — TriG, which is an extension to Turtle for representing graphs. :t1 is a unique graph that is associated with our triple, and it is also used as a subject resource for metadata triples. This approach also has minor drawbacks, the most important of which is that grouping triples needs more overhead. We'll have to add an additional triple if we'd like to express that :t1 belongs to a graph :g1:
```
:t1 meta:graph :g1 .
```
On the flip side, that will open the possibility of putting the triple into more than a single graph. In other words, now grouping may be expressed as yet another triple property, which it, in fact, is.
RDF* takes a different approach: embedding triples. We may say it tries to do the obvious by attaching metadata directly to the triple:
```
<< :I :have :dream >> meta:author wiki:Martin_Luther_King_Jr. ;
                      meta:date "1963" .
```
Besides, you can also embed a triple into an object:
```
wiki:Martin_Luther_King_Jr. meta:quote << :I :have :dream >> .
```
And do nesting:
```
<< wiki:Martin_Luther_King_Jr. meta:quote << :I :have :dream >> >> meta:date "1963" .
```
Neat, at first glance... Yet, there are many pitfalls of this seemingly simple approach.

What's Wrong with RDF*

The first obvious limitation of this approach is that this syntax is not able to unambiguously express all the possible cases. What if we want to say something like this:

<< << :I :have :dream >> meta:author wiki:Martin_Luther_King_Jr. ;
                         meta:date "1963" >>
   meta:timestamp "2020-10-13T01:02:03" .

Such syntax is not specified in the RFC and it's unclear if it is allowed (it seems like it shouldn't be), although this is perfectly legit:

<< << :I :have :dream >> meta:author wiki:Marthin_Luther_King_Jr. >>
   meta:timestamp "2020-10-13T01:02:03" .

What about this:

wiki:Martin_Luther_King_Jr. meta:quote << :I :have :dream >> .
wiki:John_Doe meta:quote << :I :have :dream >> .

Do these statements refer to the same :I :have :dream . triple or two different ones? RDF* seems to assume (although the authors don't say that anywhere explicitly) that each subject-predicate-object combination is a unique triple, i.e. there can be no duplicates. But RDF doesn't mandate it. So, some triple stores support duplicate triples. In this case, there is no way to express referencing the same embedded triple in object position from multiple triples in Turtle*.

Moreover, there's a not in the RDF* spec that mentions that the embedded triples should not, actually, be asserted (at least, in the object position — it is unclear whether that also applies to them in the subject position). I.e. in the following example:

wiki:Martin_Luther_King_Jr. meta:quote << :I :have :dream >> .

the triple :I :have :dream might be treated differently then the toplevel triples, and the SPARQL query like SELECT ?obj { :I :have ?obj } will not return :dream. And only SELECT ?obj { ?s ?p << :I :have ?obj >> } will be an acceptable way of accessing the embedded triple. We're now questioning the most basic principles of RDF...

And I haven't even started talking about graphs (for there's no TriG* yet). With graphs, there're more unspecified corner cases. For instance, the principal question is: can an embedded triple have a different graph than the enclosing property triple. It seems like a desirable property, moreover, it will be hard to prevent the appearance of such situations from directly manipulating the triple store (and not by reading serialized TriG* statements).

This is, actually, the major problem with Turtle*: it makes an impression that it exists in a vacuum. To see RDF* in context, we have to understand that the core of RDF comprises a group of connected standards: NTriples/NQuads, Turtle/TriG, and SPARQL. Turtle is a successor to NTriples that makes it more human-friendly, but all of them build on the same syntax. And this syntax is used by SPARQL also. Yet, there's no NTriples* and it's unclear whether it can exist. GraphDB implements a hack by embedding the triple (or rather its hash, but that doesn't matter much) in a resource (like <urn:abcdefgh>), but, first of all, that's ugly, and, secondly, it also assumes no duplicates. Yet, NTriples is the basic data interchange format for RDF, and forsaking it is a huge mistake. There's also no TriG* yet as I mentioned. Another sign that RDF* is mostly a theoretical exercise. TriG* can be defined as an extension to TriG with Turtle* syntax, but I have already briefly mentioned the issue it will face.

To sum up, the main deficiencies of Trutle* are:

poor backward compatibilty (up to a point of not taking into account other related standards)
limited syntax
lots of underspecified corners

And, in my opinion, they originate from the desire to make the most obvious UI not paying attention to all other considerations at all.

An Obvious Fix

What's the alternative? Well, probably, Turtle* will end up being implemented in some way or another by all the triple-store vendors. Although, I expect the implementations to be quite incompatible due to the high level of underspecification in the RFC.

Yet, you don't have to wait for Turtle* as graph-based reification is already available and quite usable.

Also, if we still had a choice to define an extension to RDF with the same purpose as RDF*, I'd take another quite obvious route. It may be less sexy, but it is at least as simple to understand and much more consistent both within itself and with other RDF standards. Moreover, a similar approach is already part of RDF — blank nodes.

Blank nodes are resources that are used just as ids:

We could as well use a blank node instead of http://foo.com/t1 as our graph label resouce:


_:b1 { :I :have :dream . }
_:b1 meta:author wiki:Martin_Luther_King_Jr. ;
      meta:date "1963" .

The underscore syntax is for blank nodes so _:b1 will create a graph node that is used to connect other nodes together but we don't care about its representation at all.

Similarly to blank nodes syntax, we could introduce triple label syntax:

^:t1 :I :have :dream .

This statement will mean that our triple has a t1 label. Now, we could add metadata to that label — in exactly the same manner as with graph-based reification (*:t1 is a "dereference" of the triple label):

*:t1 meta:author wiki:Martin_Luther_King_Jr. ;
     meta:date "1963" .

This would map directly to the implementation that will be able to unambiguously link the triple to its properties. Also, it would enable this:

wiki:Martin_Luther_King_Jr. *:t1 .
wiki:John_Doe meta:quote *:t1 .

And defining NTriples*/NQuads* becomes possible, as well. Here are NQuads triples for the MLK quote (with a graph g1 added for completeness).

^:t1 <https://foo.com/I> <https://foo.com/have> <https://foo.com/dream> <https://foo.com/g1> .
^:t2 *:t1 <https://meta.org/author> <https://en.wikipedia.org/wiki/Martin_Luther_King_Jr.> .
*:t1 <https://meta.org/date> "1963" .

Alas, this simple and potent approach was overlooked for RDF*, so now we have to deal with a mess that is both hard to implement and will likely lead to more fragmentation.

Announcing CL-AGRAPH

2020-08-12T22:50:00.005+03:00

AllegroGraph (agraph) is one of the hugely underappreciated pieces of software freely available to the programming community. Especially this relates to the Lisp community as agraph is one of the largest, most successful, and mature Lisp software projects around, yet it is hardly used by lispers. In part, its relative obscurity may be explained by the following factors:

the software is commercial... but it also has a free version with very reasonable limitations that can be used in the majority of hobby and other small projects
it is written in a commercial Lisp — Allegro CL... but it also has a free version; and it can run agraph
out-of-the-box, agraph has only an ACL client (in addition to the ones in mainstream languages like java or python)... in this post, a portable Lisp client is introduced to fill this gap

In fact, free access to agraph may enable the development of a wide variety of useful applications, and I plan another post about the unbeknown value that RDF may bring to almost any software project. Yet, to showcase it in full with code, I was missing the client. Besides, I have an occasional personal need for it, and so, with some hacking over several weekends, here it is — a minimal portable Lisp client for agraph that has, IMHO, a couple of interesting high-level features and can also be rather easily extended to support other RDF-backends.

Disclosure: for the last 2,5 years, I've been working for Franz on AllegroGraph. Over that period I was able to participate in the development and support of different parts of the system and come to gradually appreciate it both as an engineering accomplishment and an extremely useful data store.

The HTTP API

cl-agraph provides a way to interact from a running Lisp process with AllegroGraph via its HTTP API. I call it minimal as the client implements only the essential CRUD commands and the SPARQL interface. That is the critical part that enables the usage of the triple store as part of any application. However, the agraph API also provides many administrative capabilities. Those are not (yet) supported by cl-agraph, although they may be implemented rather easily (I'll show how this can be done below). Yet, those endpoints are accessible directly both via the WebView management web interface and the agtool command-line utility. So, the absence of their support in the client doesn't preclude the effective use of agraph from any application.

The client uses nquads as the data interchange format. The availability of standard data formats, such as nquads, is one of the great advantages of RDF as a way to model any data. And it also made the development of this client much easier. To work with nquads, I have extended the cl-ntriples library by Viktor Anyakin (ntriples is a simpler version of the nquads format).

The basic data structure of agraph is a `triple` (actually, a quad, but it's the name "triple" is more common):

(defstruct (triple (:conc-name nil)
                   (:print-object print-triple))
  s p o g
  triple-id
  obj-lang obj-type)

Triple components are, uris, blank-nodes, and literals (strings, numbers, booleans).

When the triple is printed, it is displayed in the standard nquads format:

AGRAPH> (<> (make-blank-node) "http://foo.com/foo" "bar" :g "http://foo.com/baz" :lang "en")

_:bn1899  "bar"@en  .
AGRAPH> (s *)
_:bn1899

I have chosen the diamond sign (<>) to signify triples (as in ntriples/nquads formats, the URIs are enclosed in it). So, the API functions that deal with triples are mostly accompanied by this sign. The parts enclosed in <> in the nquads representation are uris. Also, the very short names s, p, o, and g are used as triple parts accessors. This is a generally discouraged approach, but from my experience working with AG, I have learned that these functions are used very often and no one will be mistaken when seeing them, in the context of triple-store interactions. Also, usually, they will be used with a package prefix anyway, so the common code pattern, in a convenient setup, may look like this:

(defpackage #:foo
  (:local-nicknames (#:ag #:agraph))
  ...)

FOO> (ag:with-ag (:repo "bar")
       ;; Though, there's a more efficient variant of triple iteration
       ;; that will be shown below
       (dolist (tr (ag:get<> :p (uri "baz:quux")))
         (when (ag:blank-node-p (ag:s *))
           (ag:rem<> tr))))

The function <> ensures proper types of the triple components. There's also raw make-triple that creates the triple structure using the arguments as is.

RDF permits specifying aliases for uri prefixes and the uri function is aware of that:

AGRAPH> (register-prefix "foo" "http://foo.com/")
"http://foo.com/"
AGRAPH> (<> (make-blank-node) "rdf:type" (uri "foo:quux"))
_:bn1921 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://foo.com/quux> .

You can see that we have used the default expansion for prefix "rdf" and the user-defined one for prefix "foo". The object of the triple needed to be explicitly converted to an uri (unlike the predicate) before it was passed to the <> function as objects may be also strings and it's impossible to reliably distinguish in the background.

The other core data structure of CL-AGRAPH is ag-config. It lists the connection parameters that are used to make the client HTTP requests. Most of the parameters have reasonable defaults. The macro with-ag is a common Lisp with-style macro that is intended for creating an interaction context with fixed config parameters. Usually, it should be given at least the :repo argument.

Here are some simple interactions with agraph:

AGRAPH> (open-ag :repo "test" :port 12345)
NIL
AGRAPH> (let ((subj (make-blank-node)))
          (add<> (<> subj "rdf:type" (uri "foo:bar"))
                 (<> subj "foo:baz" "quux" :lang "en")))
2
AGRAPH> (get<>)
(_:bF049DE41x7325578 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://foo.com/bar> .
 _:bF049DE41x7325578 <http://foo.com/baz> "quux"@en .)
AGRAPH> (rem<> :g (uri "foo:bar"))
0
AGRAPH> (get<>)
(_:bF049DE41x7325578 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://foo.com/bar> .
 _:bF049DE41x7325578 <http://foo.com/baz> "quux"@en .)
AGRAPH> (rem<> :o (uri "foo:bar"))
1
AGRAPH> (get<>)
(_:bF049DE41x7325578 <http://foo.com/baz> "quux"@en .)
AGRAPH> (count<>)
1
AGRAPH> (close-ag)
T

CL-AGRAPH defines the function map<> and the macro do<> in the standard Lisp iteration paradigm. Map performs iteration with accumulation, while do is intended to be used just for the side-effects. Actually, do<> is expressed, internally, in terms of map<>. The main advantage of using these functions instead of just calling the normal mapcar or dolist on the results of get<> is their streaming mode of operation. Instead of pulling, potentially, millions of triples from the triple-store into the program's memory (or organizing paging-based iteration, as get<> has a :limit option), the triples are streamed from the backend and discarded after being processed.

AGRAPH> (map<> 'o)
("quux")

Unlike the usual mapcar, this call didn't have the second argument: it iterated all triples in the repository. Yet, surely, it can be limited to certain subjects, predicates, objects, and/or graphs:

AGRAPH> (do<> (tr :s "foo" :p "bar" :o "baz" :g "quuz")
          (print tr))
NIL  ; sorry, there were no triples with such parameters

AGRAPH> (do<> (tr :p (uri "foo:baz"))
          (print tr))
_:bF049DE41x7325578 <http://foo.com/baz> "quux"@en .

Defining Your Own Commands

All these commands use, under the hood, the ag-req utility that can be utilized to define other API wrappers. For example, here is a function to get all the duplicate triples in the repository (that's one of the features of agraph that permits several triples with the same SPOG to be added):


(defun duplicates (&key mode)
  (ag-req "/statements/duplicates" (list "mode" mode)))

However, the simplest commands can be defined even without ag-req, by using just the high-level functions. Here is a small example — the function that checks if a triple exists in the repository:


(defun storedp (tr)
  (assert (triple-p tr))
  (when (get<> :s (s tr) :p (p tr) :o (o tr) :limit 1)
    t))

NB. As the client uses a REST HTTP + nquads protocol, it should be rather easy to extend it to support other triple-store backends such as GraphDB, Stardog or Virtuoso. Provided they also support this method of interaction.

Sessions & Transactions

Now, let's return to open-ag and with-ag. Both of them have a sessionp keyword argument (that is, by default, true for with-ag and nil for open-ag). A session is a mechanism for speeding up some agraph operations and for running transactions. Without an active session, each update is committed at once. It is much more costly than batching up groups of operations. However, if a session is established, you need to explicitly call commit to enact the modifications to the triple-store. I.e. sessions create an implicit transaction. with-ag will commit the transcation after executing its body. It is also possible to manually rollback the changes. Any unhandled error inside with-ag will also, effectively, cause a rollback: the session will be terminated without a commit.

An agraph session has a certain lifetime/timeout that can also be specified as a parameter to open-ag/with-ag. However, there's also a maximum possible lefitime that is configured by the triple-store admin. Once the timeout expires, the session is terminated. with-ag will try to rerun the transaction if it encounteres a terminated session — but that will be done just once. And the user should be careful not to place transaction-unfriedly code in the body of with-ag. open-ag, on the contrary, defaults to sessionless mode. This way the additional complexity of timeouts and transactions is removed. In this mode, the only thing that open-ag does is configuring the connection spec and internal caches.

Symbolic SPARQL

Another thing worth some attention in this client is its symbolic SPARQL facility that allows generating SPARQL requests from s-expressions. Query generation from sexps is a common Lisp trick that can be found in such libraries as CLSQL & co. However, the implementation I came up with is, from my point of view, much simpler.

Here are a few simple examples that give a general impression of symbolic SPARQL:


AGRAPH> (generate-sparql '(select * (?s ?p ?o))
                         nil)
"SELECT 
* 
{
?S ?P ?O .
 }
"
AGRAPH> (generate-sparql '(select * (:union (?s ?p ?o)
                                            (:graph ?g (?s ?p ?o))))
                         nil)
"SELECT 
* 
{ {
?S ?P ?O .
 }
UNION
{
GRAPH ?G {
?S ?P ?O .
 } } }
"

The function generate-sparql uses a very simple evaluation rule. It will print any symbols as is, while lists are processed recursively in 3 possible ways depending on the first element:

a keyword in first position means that custom rules should be invoked;
any other symbol causes the list to be treated as a tiples pattern containing subject, predicate(s), and object(s);
another list invokes recursive processing.

Now, the custom rules are defined as methods of a generic function process-custom, which makes this mechanism quite extensible. Let's see an example SPARQL sexps and the custom rules that were used to handle it:


(defun storedp (tr)
  (assert (triple-p tr))
  (when (get<> :s (s tr) :p (p tr) :o (o tr) :limit 1)
    t))
AGRAPH> (generate-sparql '(select ?date ?title
                                  ((?g |dc:date| ?date)
                                   (:filter (:> ?date (:|^| "2005-08-01T00:00:00Z"
                                                            |xsd:dateTime|)))
                                   (:graph ?g (?b |dc:title| ?title))))
                         nil)
"SELECT 
?DATE 
?TITLE 
{ ?G dc:date ?DATE .
 FILTER ( (?DATE > \"2005-08-01T00:00:00Z\"^^xsd:dateTime ) )
 GRAPH ?G {
?B dc:title ?TITLE .
 } }
"
(defgeneric process-custom (key tree out &key)
  ...
  (:method ((key (eql :|^|)) tree out &key)
    (assert (and (dyadic tree)
                 (stringp (first tree))
                 (symbolp (second tree))))
    (format out "~S^^~A" (first tree) (second tree)))
  (:method ((key (eql :filter)) tree out &key)
    (assert (single tree))
    (format out "FILTER ( ~A )~%"
            (process-expr (first tree) nil :top-level nil)))
  (:method ((key (eql :graph)) tree out &key)
    (assert (dyadic tree))
    (format out "GRAPH ~A " (first tree))
    (process-expr (second tree) out :top-level t))
  (:method ((key (eql :>)) tree out &key top-level)
    (process-arithm key tree out :top-level top-level))
  ...

The sexp-based form of SPAQRL queries may seem unusual, but it is much more convenient and powerful than the standard string format:

it is more convenient to edit;
passing variables is easy;
and you can write function and macros to construct these expressions from parts, which is very rough and error-prone using the string-based format.

I considered implementing symbolic SPARQL ever since I started working with it as programmatically filling string templates is so primitive. Finally, I've found time to realize this idea!

Afterword

This announcement is targeted mainly at those who are already "enlightened" about RDF triple stores and were eagerly waiting for a chance to try agraph. :) I hope that it provides a good starting point for you to actually do it. I believe, the agraph download webpage gives enough guidance regarding installing it either on your machine or running it from the AWS Marketplace.

As I said, there will be another post (for now, unclear when) that will be an introduction to RDF cabilities for those developers who are still "in ignorance" about the possibilities that triple stores may open for their applications. Stay tuned...

Programming Algorithms 2nd Edition

2020-07-17T23:11:00.000+03:00

Apress — the most dedicated publisher of Common Lisp books, famous for giving the world "Practical Common Lisp" and "Common Lisp Recipes" — has approached me to publish "Programming Algorithms", and, after some consideration, I have agreed. So, the book will be released under the title "Programming Algorithms in Lisp" and with some slight modifications to the content.

It was not an easy decision to make. Ultimately, my goal for the book is to make it as widely read as possible. For these three months since it had been published on Leanpub, it was downloaded more than 1500 times, and almost 250 people have also donated some money in its support. The paperback book was shipped to around 40 locations around the globe: even to Australia and Colombia. Besides, I have received lots of positive feedback and some improvement suggestions. I'm very grateful and happy that it has seen such positive reception.

In my opinion, the book has the potential to hit at least an order of magnitude more readers. However, to achieve that, targeted promotion effort is necessary. I have already mostly exhausted the capacity of the free PR channels I had access to (such as Hacker News, Reddit, and Twitter). I had a long-term promotion strategy though, but it required spending the time and (possibly) financial resource that could be used elsewhere.

The Apress edition of the book will not be free, but it will have the full power of this respected publisher behind it. So, my hope is that thus it will reach an even wider audience. Very soon I will have to take down the free version of the book, so this is the last chance to download it (if you or some of your friends planned to do it). The book webpage will remain active and will collect relevant information and news, so stay tuned...

vseloved.github.io/progalgs

Eval Spotted in the Wild

2020-06-22T20:27:00.000+03:00

(#lisptips on the dynamic nature of CLOS magnified by eval)

Since starting programming in Lisp, I always had an impression that using eval is a taboo. Or rather, a cul-de-sac that you never want to touch. When I was only learning Lisp, I had a couple of unsuccessful and rather stupid attempts of utilizing it to bend the language to my view of how it should function — only to learn how it is really intended to function. After that, it occupied its rightful place on my mind's shelf of "low-level constructs only needed to implement the language".

Yet, recently, I saw a legitimate use case for it and even wrote a piece of production code containing eval! That was such a revelation that I wanted to share it in this short post.

So, here is the case I needed to solve: I was developing a parser for a new data format that had to fit into an existing set of parsers. The parsers not only decode the data but also store it in the datastore using the CLOS machinery for datastore access. I.e. there's a generic function to store an individual piece of data that is specialized for different connection/datastore types. Now, my parser had to prepare the individual pieces and, eventually, they would be fed to this function. But that may happen independently of the parser operation: when the data store commit is performed.

Yet, there was another issue at play: the data format allows the individual items to be interdependent, i.e. reference one another via an implicit reference. And when the data is persisted, due to the properties of the data store, these references should be changed to the internal ids of the referenced items. And those are not known before the commit happens. I.e. I was in the following situation:

my parser produces an array of items that are to be persisted to the dataset at some later time
the order of their addition matters as the dependent items should be added after the ones they reference
and as the referenced item is added its id should be saved
and assigned to a field of the dependent item before that item is also added

This program logic is quite normal, apart from the fact that my parser doesn't have control over the whole workflow. Actually, the data persistence stage operates in the inversion of control paradigm, i.e. I can only override (rather, augment) the part of the program that is responsible for processing an individual item. Needless to say that I had no desire or intention to reimplement all the different (I believe, 7) ways of interaction with the datastore that had their own methods plus a number of before/after/around-methods.

In fact, CLOS is very flexible and provides a way, using an object of my own mixin-class to hold the state and around-method specialized on it, to achieve my goal of fitting into this whole machinery without distracting it or having to reimplement anything. If not for one issue: limited facilities for dynamic creation of classes.

So, here's what I wanted to do and to avoid:

I wanted to define a mixin-class and an around-method for it to augment the data storing procedure with saving of the ids of specified items and assigning them to fields in other items before persisting them. Here's the sketch of the relevant code:


(defclass my-data-store-mixin ()
  ((linked-items-table :reader my-mixin-table
                       :initform (make-hash-table))))

(defmethod add-item :around ((db my-data-store-mixin) item)
  (let ((linked-items-table (my-mixin-table db))
        (item-id (call-next-method)))
    (dolist (it (gethash item linked-items-table))
      (remhash it linked-items-table)
      (setf (reference it) item-id))
    (remhash item linked-items-table)
    item-id))

Yet, I didn't want this code to run when other data formats are imported, hence my mixin should have been "activated" if and only if my specific format is parsed.
In other words, I needed a way to dynamically add this mixin to an existing connection object, in the context of the parser call, and then restore the connection to its previous state. In general, CLOS also provides such a facility with its change-class operator. I would say, this would have been a manifestation of a dynamic object system in all its glory if not for one deficiency.
You can't just dynamically define a temporary class: the one that will inherit from the class of the current connection and my mixin. defclass is a macro that's expected to deal with names known ahead-of-time and coded as part of its call: it doesn't evaluate variables. Usually, all such APIs in Lisp have a make-function counterpart. I.e. what I needed was:
```
(let ((temp-class (gensym))
      (current-db-class (class-of *db*)))
  (make-class temp-class (list (class-name current-db-class) my-data-store-mixin) nil)
  (unwind-protect (progn (change-class *db* temp-class)
                         ;; execute my code
                  )
    (change-class *db* current-db-class)))
```
But CLOS just doesn't have an API for that. (Which might be perfectly reasonable — and I don't want to delve into the discussion of those reasons in this post). Actually, there's MOP for that. But I'd prefer not to take the MOP route here for another set of reasons I want to skip discussing now :) Suffice to say that it is complicated and, from my experience with the MOP, I developed a stance that it's another area intended for language implementation usage — not for user-level code.

And here's where eval comes to the rescue. In place of the nonexisting make-class I could just put this piece:


(let ((class (intern (format nil "my-mixed-~a" (class-name current-db-class)))))
  (when (not (find-class class nil))
    (eval `(defclass ,class (,(class-of *db*) my-data-store-mixin) ()))))

So, eval is an escape hatch into the world of ultimate dynamism. This operator can add it anywhere: whether an appropriate API was left out due to lack of foresight or even when it was not intended to exist... :)

"Programming Algorithms" Book Gets its own Page

2020-05-18T21:08:00.002+03:00

Recently, I was busy organizing the process of postal delivery of the paperback version of the book to all the interested people. Here, I have created a map showing everyone who has already ordered:

I'm glad to see the global geography of readership and all the positive feedback. Thanks a lot! Please, share your thoughts online :)

Finally, the book got its own webpage with all the relevant details.

Dead-Tree Version of "Programming Algorithms"

2020-05-08T12:33:00.001+03:00

I have finally obtained the first batch of the printed "Programming Algorithms" books and will shortly be sending them to the 13 people who asked for a hardcopy.

Here is a short video showing the book "in action":

If you also want to get a copy, here's how you do it:

Send an email to vseloved@gmail.com with your postal address — I'll send you a Paypal money request.
Once I see the donation, I'll go to the post office and send you the book.
Optionaly step: if you want it to be signed, please, indicate it in your letter.

Shipping details: As I said originally, the price of the dead-tree version will be $20+shipping. I'll ship via the Ukrainian national post. You can do the fee calculation online here (book weight is 0.58 kg, size is 23 x 17 x 2 cm): https://calc.ukrposhta.ua/international-calculator. Alas, the interface is only in Ukrainian. According to the examples I've tried, the cost will be approximately $10-15. To make it easier, I've just settled on $10 shipping without a tracking number of $15 if you want a tracking number. Regardless of your country. I don't know how long it will take - probably depends on the location (I'll try to inquire when sending).

The book was already downloaded more than 1170 times (I'm not putting the exact number here as it's constantly growing little by little). I wish I knew how many people have, actually, read it in full or in part. I've also received some error corrections (special thanks goes to Serge Kruk), several small reviews and letters of encouragement. Those were very valuable and I hope to see more :)

Greetings from the far away city of Lima, Peru!
I loved this part: "Only losers don’t comment their code, and comments will be used extensively"
Thank you so much for putting this comprehensive collection of highly important data structures, i'm already recommending this to two of my developers, which I hope i'll induce into my Lisp addiction.
--Flavio Egoavil

And here's another one:

Massively impressive book you've written! I've been a Lisp programmer for a long time and truly appreciate the work put in here. Making Lisp accessible for more people in relation to practical algorithms is very hard to do. But you truly made it. You'll definitely end up in the gallery of great and modern Lisp contributions like "Land of Lisp" and "Let Over Lambda". Totally agree with your path to focus on practical algorithmic thinking with Lisp and not messing it up with macros, oop and other advanced concepts.
--Lars Hård

Thanks guys, it's really appreciated!

If you feel the same or you've liked the book in some respect and have found it useful, please, continue to share news about it: that definitely helps attract more readers. And my main goal is to make it as widely read as possible...

"Programming Algorithms" Book Freely Available

2020-04-15T12:29:00.000+03:00

The book "Programming Algorithms (A comprehensive guide to writing efficient programs with examples in Lisp)" has been completed. It turned out to be more than 360 pages in a standard technical book format, with over 100k words (that's 2 NanoWriMos :). It covers more than 100 topics that made it to the TOC — but, actually, more. Phew, making it to this point was quite a challenge...

This book is, surely, not perfect. Hopefully, most of the mistakes in it were fixed with the help of many nice people who commented on the chapters as they were published on my blog.

Also, the book is terribly incomplete. Almost every chapter could be expanded by a factor of two or three with relevant details and concrete implementations of some of the general ideas that are presented, currently. But neither did I have the time to write those down, nor, what's much more important, anyone would have had the time to read them, in entirety. I believe I have put enough concrete examples with executable code to illustrate all the important concepts in each part. This is a great advantage of using Lisp for the book: the code is clear and compact enough to serve both to explain the algorithms and to permit testing them for real, in the REPL. The main compromise each author has to make is between brevity and completeness. I hope that I made the right choices in this regard, but, for sure, there's much more to learn about every piece of technology mentioned. My hope is that the book lays a solid groundwork to facilitate further deeper exploration.

There are also a couple of topics that I would like to cover but couldn't find a good place for them. Probabilistic data structures is the most important of them. Yet, they are not big enough to justify a separate chapter and, also, don't fit into any of the existing chapters.

But enough with the whining :) In fact, I'm quite satisfied with the end result as my main goal was to sufficiently develop the following key themes:

The main one, obviously, was the description of all the important data structures and the associated algorithms.
The next, also very important, was the demonstration of the essential tools that help in the development, testing, and verification of the produced algorithmic code: tracing, profiling, pretty-printing, etc.
We have also discussed, when it was relevant, the real-world engineering considerations and constraints that influence the programs using our algorithms. And sometimes these constraints have more impact than the purely theoretical complexity calculations.
Finally, in each chapter, I tried to present the practical use case of the algorithms we have studied, showing the broad variety of such applications. In fact, it spans all the different corners of the software landscape we're used to. We have talked, albeit briefly, about such different domains as neural networks, plagiarism detection, web search, mapping, chess-playing, image compression, and many others.

There is a lot of books on algorithms, but I haven't seen any that primarily aims to bridge the gap between the theory and practice. This is one of the key distinctions of "Programming Algorithms". It is definitely not the best exposition of the theoretical ideas, but I hope that, instead, it builds sufficient understanding and skill, for the common developer, to start writing efficient algorithmic programs.

I wanted to finish the book with the following statement: programming craft is primarily about making choices. What approach to prefer, which algorithm to choose, what tradeoffs to make. And, at the other level, what properties to give more priority: speed or safety, brevity or consistency, space or debuggability, clarity or conciseness, and so on and so forth. Lisp is one of the few languages that are "pro-choice". Its authors understood very well the importance of freedom to make the critical choices, and it is felt in the design of the language. For instance, with the help of declaim we can even signal our preferences to the compiler, to some extent, at the level of a single file or even an individual form. (declaim (optimize (speed 3) (safety 1) (debug 0) (compilation-speed 0))) will ask the compiler to produce the fastest possible code. Yes, this language will not guard you against poor choices like some others claim to do. Sometimes, you're not wise enough to make a correct choice, but, much more often, every choice just has its pros and cons, so someone will approve of it and someone won't. And that's what freedom is about: ownership and responsibility. So, use Lisp if you liked it. And if you prefer other languages, I'd urge you to still take advantage of the concept of freedom of choice in programming. Don't be constrained by the prevailing paradigms and try to use the best parts of all the different approaches you know...

Acknowledgments

Finally, the most pleasant part. I'm very thankful to those who helped me in the work on "Programming Algorithms" by providing support, advice, corrections, and suggestions. First of all, many thanks to my wife Ksenya who encouraged me to work on it despite the time for that is, in part, taken from my family duties. Also, I am very grateful to Dr. Robert Standh who humbly volunteered his help as an editor to make it sound more native (as my English is far from perfect since I'm not a native speaker) and point out the mistakes that I made. He and his wife had contributed lots of improvements to more than half of the chapters, and I tried to follow their advice in the subsequent ones. Thanks to Rainer Joswig for commenting on the Lisp choices. Thanks to @dzenicv on reddit who posted links to almost all of the chapters there, which triggered some helpful discussions. Thanks to @tom_mellior on Hacker News for pointing a serious deficiency in the explanation of Union-Find. Thanks to all those people who shared the links to the chapters, contributed their comments and attention.

If you've found the book helpful and interesting, you can also support it's past and (potential) further development in several ways. First and foremost, you can share it with your friends, colleagues, social network. The book was made free and will remain free as its main premise, for me, was to spread the knowledge gathered inside. Yet, you can also make a donation at Leanpub if you believe that it has brought you some value. Last but not least, if you find some way the book can be improved, don't hesitate to contact me.

Finally, I wanted to solicit reviews: if you've read the book and liked it, please, write a paragraph or two to let others know your opinion!

Programming Algorithms: Synchronization

2020-03-30T16:56:00.000+03:00

This is a snippet of the "Synchronization" chapter of the book.

This is the final chapter of the book, in which we will discuss optimization of parallel computations: whether concurrently on a single machine in and a shared-memory setting or in a distributed shared-nothing environment. This is a huge topic that spans synchronization itself, parallelization, concurrency, distributed computations, and the functional approach. And every senior software developer should be well-versed in it.

Usually, synchronization is studied in the context of system or distributed programming, but it has a significant algorithmic footprint and is also one of the hottest topics for new algorithm research. In fact, there are whole books that concentrate on it, but, usually, they attack the problem from other angles, not focusing on the algorithmic part. This chapter will be more algorithm-centered, although it will also present an overview of the problem space. SO that, in the end, you'll have a good foundation to explore the topic further if a desire or need for that will appear.

Let's start from the basics. In the previous chapters of the book we, mainly, viewed algorithms as single computations running without external interruptions. This approach, obviously, removes the unnecessary complexity, but it also isn't totally faithful to reality. Most of the programs we deal with, now, run in multiprocessing environments (sometimes, even distributed ones), and even when they don't utilize these capabilities, those are still available, they sometimes have their impact, and, besides, might have improved the performance of the programs if they would have been utilized. The majority of the backend stuff, which, currently, is comprised of services running in the datacenters, are multithreaded. There's a notorious "Zawinski's Law" that states that every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can. Being a good joke it also reflects an important truth about the tendency of all programs over time to become network-aware, and thus distributed to at least some extent.

There are two principally different types of environments in which the programs that need synchronization run: shared-memory and shared-nothing ones.

In a shared-memory setting, there exists some shared storage (not necessarily, RAM) that can be directly accessed by all the threads of the application. Concurrent access to data in this shared memory is the principal source of the synchronization challenges, although not the only one. The example of a shared-memory program is a normal application that uses multithreading provided either directly by the OS or, more frequently, by the language runtime.

The opposite of shared-memory is a shared-nothing environment, in which all threads don't have any common data storage and can coordinate only by sending messages directly to other processes. The contents of the messages have to be copied from the memory of the sender to the receiver. In this setting, some of the synchronization problems disappear, but others still remain. At the fundamental level, some synchronization or coordination still needs to happen. From a performance standpoint, however, the shared-nothing mode is, usually, inferior due to the need for additional data copying. So, both paradigms have their place and the choice, which one to utilize, depends on the context of a particular task.

The main goal of synchronization is ensuring program correctness when multiple computations are running in parallel. Another side of the coin is achieving optimal performance, which is also addressed by parallelization that we have somewhat discussed in a couple of prior chapters. Prioritizing performance before correctness, although tempting, is one of the primary sources of bugs in the concurrent systems. The trivial example would be building a shared-memory program without explicit use of any synchronization mechanisms. It is, definitely, the most performant approach, but non-coordinated access to the shared data will inevitably result in failures like data corruption.

More details about of the book may be found on its website.

Programming Algorithms: Compression

2020-02-19T14:01:00.001+02:00

This is a snippet of the "Compression" chapter of the book.

Compression is one of the tools that every programmer should understand and wield confidently. Such situations when the size of the dataset is larger than the program can handle directly and it becomes a bottleneck are quite frequent and can be encountered in any domain. There are many forms of compression, yet the most general subdivision is between lossless one which preserves the original information intact and lossy compression which discards some information (assumed to be the most useless part or just noise). Lossless compression is applied to numeric or text data, whole files or directories — the data that will become partially or utterly useless if even a slight modification is made. Lossy compression, as a rule, is applied to data that originates in the "analog world": sound or video recordings, images, etc. We have touched the subject of lossy compression slightly in the previous chapter when talking about such formats as JPEG. In this chapter, we will discuss the lossless variants in more detail. Besides, we'll talk a bit about other, non-compressing, forms of encoding.

Encoding

Let's start with encoding. Lossless compression is, in fact, a form of encoding, but there are other, simpler forms. And it makes sense to understand them before moving to compression. Besides, encoding itself is a fairly common task. It is the mechanism that transforms the data from an internal representation of a particular program into some specific format that can be recognized and processed (decoded) by other programs. What we gain is that the encoded data may be serialized and transferred to other computers and decoded by other programs, possibly, independent of the program that performed the encoding.

Encoding may be applied to different semantic levels of the data. Character encoding operates on the level of individual characters or even bytes, while various serialization formats deal with structured data. There are two principal approaches to serialization: text-based and binary. The pros and cons are the opposite: text-based formats are easier to handle by humans but are usually more expensive to process, while binary variants are not transparent (and so, much harder to deal with) but much faster to process. From the point of view of algorithms, binary formats are, obviously, better. But my programming experience is that they are a severe form of premature optimization. The rule of thumb should be to always start with text-based serialization and move to binary formats only as a last resort when it was proven that the impact on the program performance will be significant and important.

Base64

Encoding may have both a reduction and a magnification effect on the size of the data. For instance, there's a popular encoding scheme — Base64. It is a byte-level (lowest level) encoding that doesn't discriminate between different input data representations and formats. No, the encoder just takes a stream of bytes and produces another stream of bytes. Or, more precisely, bytes in the specific range of English ASCII letters, numbers, and three more characters (usually, +, /, and =). This encoding is often used for transferring data in the Web, in conjunction with SMTP (MIME), HTTP, and other popular protocols. The idea behind it is simple: split the data stream into sextets (6-bit parts — there's 64 different variants of those), and map each sextet to an ASCII character according to a fixed dictionary. As the last byte of the original data may not align with the last sextet, an additional padding character (=) is used to indicate 2 (=) or 4 (==) misaligned bits. As we see, Base64 encoding increases the size of the input data by a factor of 1.25.

Here is one of the ways to implement a Base64 serialization routine:

(defparameter *b64-dict*
  (coerce (append (loop :for ch :from (char-code #\A) :to (char-code #\Z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\a) :to (char-code #\z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\0) :to (char-code #\9)
                        :collect (code-char ch))
                  '(#\+ #\/ #\=))
          'simple-vector))

(defun b64-encode (in out)
  (let ((key 0)
        (limit 6))
    (flet ((fill-key (byte off beg limit)
             (:= (ldb (byte limit off) key)
                 (ldb (byte limit beg) byte))
             (:= off (- 6 beg)))
           (emit1 (k)
             (write-byte (char-code (svref *b64-dict* k)) out)))
      (loop :for byte := (read-byte in nil) :while byte :do
        (let ((beg (- 8 limit)))
          (fill-key byte 0 beg limit)
          (emit1 key)
          (fill-key byte (:= limit (- 6 beg)) 0 beg)
          (when (= 6 beg)
            (emit1 key)
            (:= limit 6))))
      (when (< limit 6)
        (:= (ldb (byte limit 0) key)
            (ldb (byte limit 0) 0))
        (emit1 key)
        (loop :repeat (ceiling limit 2) :do
          (emit1 64))))))

This is one of the most low-level pieces of Lisp code in this book. It could be written in a much more high-level manner: utilizing the generic sequence access operations, say, on bit-vectors, instead of the bit manipulating ones on numbers. However, it would be also orders of magnitude slower due to the need to constantly "repackage" the bits, converting the data from integers to vectors and back. I also wanted to show a bit of bit fiddling, in Lisp. The standard, in fact, defines a comprehensive vocabulary of bit manipulation functions and there's nothing stopping the programmer from writing performant code operating at a single bit level.

One important choice made for Base64 encoding is the usage of streams as the input and output. This is a common approach to such problems based on the following considerations:

It is quite easy to wrap the code so that we could feed/extract strings as inputs and outputs. Doing the opposite, and wrapping a string-based code for stream operation is also possible, but it defeats the whole purpose of streams, which is...
Streams allow to efficiently handle data of any size and not waste memory, as well as CPU, for storing intermediary copies of the strings we're processing. Encoding a huge file is a good illustration of why this matters: with streams, we do it in an obvious manner: (with-open-file (in ...) (with-out-file (out) (base64-encode in out)). With strings, however, it will mean, first, reading the file contents into memory — and we may not even have enough memory for that. And, after that, filling another big chunk of memory with the encoded data. Which we'll still, probably, need to either dump to a file or send over the network.

So, what happens in the code above? First, the bytes are read from the binary input stream in, then each one is slashed into 2 parts. The higher bits are set into the current base64 key, which is translated, using b64-dict, into an appropriate byte and emitted to the binary output stream out. The lower bits are deposited in the higher bits of the next key in order to use this leftover during the processing of the next byte. However, if the leftover from the previous byte was 4 bits, at the current iteration, we will have 2 base64 bytes available as the first will use 2 bits from the incoming byte, and the second will consume the remaining 6 bits. This is addressed in the code block (when (= 6 beg) ...). The function relies on the standard Lisp ldb operation which provides access to the individual bits of an integer. It uses the byte-spec (byte limit offset) to control the bits it wants to obtain.

Implementing a decoder procedure is left as an exercise to the reader...

Taking the example from the Wikipedia article, we can see our encoding routine in action (here, we also rely on the FLEXI-STREAMS library to work with binary in-memory streams):

CL-USER> (with-input-from-string (str "Man i")
           (let ((in (flex:make-in-memory-input-stream 
                      (map 'vector 'char-code
                           (loop :for ch := (read-char str nil) :while ch :collect ch))))
                 (out (flex:make-in-memory-output-stream)))
             (b64-encode in out)
             (map 'string 'code-char (? out 'vector))))
"TWFuIGk="

This function, although it's not big, is quite hard to debug due to the need for careful tracking and updating of the offsets into both the current base64 chunk (key) and the byte being processed. What really helps me tackle such situations is a piece of paper that serves for recording several iterations with all the relevant state changes. Something along these lines:

        M (77)    |    a (97)     |    n (110)
   0 1 0 0 1 1 0 1|0 1 1 0 0 0 0 1|0 1 1 0 1 1 1 0
0: 0 1 0 0 1 1    |               |                 19 = T
               0 1|               |
1:             0 1|0 1 1 0        |                 22 = W
                  |        0 0 0 1|
2:                |        0 0 0 1|0 1               5 = F

Iteration 0:

beg: 2
off: 0
limit: 6

beg: 0
off: (- 6 2) = 4
limit: 2


Iteration 1:

beg: 4
off: 0
limit: 4

beg: 0
off: (- 6 4) = 2
limit: 4

Another thing that is indispensable, when coding such procedures, is the availability of the reference examples of the expected result, like the ones in Wikipedia. Lisp REPL makes iterating on a solution and constantly rechecking the results, using such available data, very easy. However, sometimes, in makes sense to reject the transient nature of code in the REPL and record some of the test cases as unit tests. As the motto of my test library SHOULD-TEST declares: you should test even Lisp code sometimes :) The tests also help the programmer to remember and systematically address the various corner cases. In this example, one of the special cases is the padding at the end, which is handled in the code block (when (< limit 6) ...). Due to the availability of a clear spec and reference examples, this algorithm lends itself very well to automated testing. As a general rule, all code paths should be covered by the tests. If I were to write those tests, I'd start with the following simple version. They address all 3 variants of padding and also the corner case of an empty string.

(deftest b64-encode ()
  ;; B64STR would be the function wrapped over the REPL code presented above
  (should be blankp (b64str ""))
  (should be string= "TWFu" (b64str "Man"))
  (should be string= "TWFuIA==" (b64str "Man "))
  (should be string= "TWFuIGk=" (b64str "Man i")))

Surely, many more tests should be added to a production-level implementation: to validate operation on non-ASCII characters, handling of huge data, etc.

More details about of the book may be found on its website.

prj-nlp v.3

2020-02-10T21:32:00.001+02:00

The state of NLP in 2019.

I’m talking with an amazing undergrad who has already published multiple papers on BERT-type things.

We are discussing deep into a new idea on pretraining.

Me: What would TFIDF do here, as a simple place to start?
Him: ....
Me: ....
Him: What’s TFIDF?
— Eric Wallace (@Eric_Wallace_) December 19, 2019

Ми з Мар'яною Романишин розпочали третій набір курсу з обробки людської письмової мови (чи як краще перекласти NLP :-p) в Проджекторі. Хотів написати трохи деталей про нього, бо курс нам дуже подобається і, звісно, кожного року хочеться, щоб його рівень продовжував зростати. А для цього треба, щоб потенційні студенти про нього знали і розуміли, як він влаштований.

Курс є трьохмісячним інтенсивом, який ставить на меті підготувати спеціаліста, здатного самостійно і якісно вирішувати NLP-задачі будь-якої складності. Як наодинці, так і в команді. Для того, щоб бути максимально успішним в ньому, треба мати певну базу. Як правило, найкраще себе показують, звісно, програмісти. Але є й винятки: ми залюбки беремо лінгвістів, журналістів, та й, загалом, спеціалістів з інших галузей за умови, що вони володють достатніми навиками програмування, щоб самостійно писати програми, налаштовувати зручне для себе середовище розробки і розуміти базові алгоритмічні концепції. Для курсу не треба знати ML, хоча якесь уявлення про нього бажано мати. Але курс побудований так, що ми почергово розбираємо NLP-теми і пов'язані з ними розділи ML. Звісно, це не значить, що в результаті людина буде гарно розбиратись у машинному навчанні, але необхідну базу для продовження поглиблення у цій сфері отримає.

Другою передумовою успіху на курсі є наявність достатнього часу. Ми кажемо, що необхідний мінімум — це 10 годин самостійної роботи на тиждень, плюс 5 годин на заняттях. Іншими словами, враховуючи час на дорогу, це вже пів робочих ставки. Але, звісно, комусь може знадобитись і більше самостійного часу. Крім того мозок буде досить сильно завантажений новими темами, тому на час занять доведеться відмовитись від інших додаткових проєктів, хоббі і т.п. Також не дуже добре виходить, якщо більше тижня підряд випадає з якоїсь зовнішньої причини: хвороби, відрядження, шлюбу, народження дітей... :)

Як побудований цей курс? Ми збираємо групу з 15 студентів і зустрічаємось два рази на тиждень: одне заняття — теоретичне у четвер увечері, присвячене розбору певної теми, друге — практичне у суботу, на якому ми показуємо приклад вирішення задачі по цій теми і разом програмуємо його. У більшості випадків, ця програма буде основою для розв'язку більш просунутої, але подібної задачі, яка дається на домашню роботу. Відповідно, у нас є 12 тижнів основної роботи, тобто 12 тем і близько 10 повноцінних домашніх проєктів рівня побудувати систему аналізу тональності, перевірки фактів чи синтаксичного розбору. Звісно, в умовах обмеженого часу, кожний з проєктів робиться в рамках певного обмеженого домену.

Курс розбитий на 3 частини:

перший місяць — це дуже ґрунтовна підготовча робота: основи структурної лінгвістики, робота з даними, метрики, правильні експерименти, підхід на правилах. В кінці місяця в голові у студента має сформуватись цілком структуроване уявлення про те, як правильно підходити до вирішення NLP-задач. Інший результат цієї частини — сформульоване завдання для курсового і початок роботи над ним: зібрані перші дані, визначена метрика, пророблений план експериментів
другий місяць — це занурення в класичне NLP з паралельною проробкою наійбільш розповсюджених ML-технік, які в ньому використовуються. В кінці місяця, після вирішення на лекціях, практичних і вдома майже десятка NLP-задач у студентів вже мають сформуватись навички для самостійного застосування цих технік у реальних проєктах. Ну і зроблена основна частина курсової роботи
останній місяць — це deep learning в NLP. Ми одразу попереджаємо, що цей курс не ставить на меті розказати побільше найгарячішого і проривного: для цього є достатньо інших майданчиків. Ми хочемо сформувати систематичне розуміння NLP, з всією його 70-річною історією. Бо в цій історії є дуже багато корисних і, можливо, timeless речей. Тож до state-of-the-art ми підходимо тільки під кінець (і на останньому занятті у нас виступає запрошений лектор, який розказує щось про bleeding edge :) Але принципові речі, пов'язані з DL, ми також пророблюємо як на заняттях, так і в рамках курсового. Ті зі студентів, кого цікавить саме ця сфера, під кінець курсу ганяють навчання десяток нейронок в своїх пісочницях, а також випробовують можливості глибинного навчання у своєму курсовому проєкті.. Втім, тут ми не можемо похвалитись приголомшливими результатами по якості, адже для їх досягнення замало пари тижнів, які по-максимуму є на ту чи іншу задачу: навчання глибинних моделей потребує багато як обчислювальних ресурсів, так і часових. Але тому, як до цього підходити, ми навчаємося

Як можна побачити з цього опису, дуже велику увагу ми приділяємо курсовому проєкту, роботу над яким стимулюємо кожного тижня. В результаті, у більшості виходять досить непогані і цікаві штуки, понад 70% студентів доходять до фінішу з якісною завершеною роботою (нечувана кількість для інших курсів, в яких мені доводилось брати участь). Деякі з проєктів навіть виходять у великий світ: хтось робить речі, пов'язані з роботою, хтось — з хоббі. За 2 роки у нас було 2 дослідження для журналістики даних, проєкт з аналізу конфліктів ліків між собою та з хронічними хворобами людини (на основі обробки інструкцій), система пошуку у соціальному застосунку для конференцій з запитами природньою мовою. Був і ряд цікавих проєктів для себе, які досягли класних результатів по якості та були зроблені з душею. Все це студенти презентують після закінчення на великій фінальній вечірці в офісі Grammarly.

Одна з основних цілей цього курсу для нас полягає в тому, щоб вирощувати українську NLP-спільноту. Адже школи комп'ютерної лінгвістики у нас, по суті, ніколи не було. І ми сподіваємось, що нам вдастся долучитись до її формування разом з іншими прогресивними проєктами навчання в цій сфері, зокрема магістерською програмою по Data Science в УКУ. У курсу вже більше 30 випускників, які увійшли до закритого клубу prj-nlp-alumni де ми ділимось цікавими речами та можливостями, а також плануємо періодично зустрічатись у неформальній атмосфері, а не тільки на івентах. Тож сподіваємось на розширення цього клубу ще на половину у червні цього року :)

P.S. До речі, про УКУ. Я також беру участь як викладач і ментор дипломних робіт у курсі NLP в їх програмі. Це трохи інший досвід, ніж цей курс. Звісно, УКУ пропонує більш академічну програму, яка триває довший час. Студенти отримують там гарну і, що важливо, систематичну підготовку з ML та DL. Тому цьому зовсім не треба приділяти увагу на курсі по NLP. З іншого боку, курс більш короткий і читається декількома викладачами, тому в його рамках важче сформувати цілісну картинку, нема можливості організувати такий же рівень занурення і концентрації, як на курсі в Проджекторі. Зате на магістерську роботу у них більше часу, ніж на весь наш курс. Але, найголовніше, що і тут, і там підбираються гарно підготовлені і мотивовані студенти, тож результати в УКУ також виходять гарної якості з великою кількістю цікавих робіт. Деякі з яких мають рівень статей на топові наукові конференції у цій галузі. Хоча мені особисто все-таки більше подобається формат Проджектора, адже він дає можливість відчути дух інтенсивної командної роботи протягом трьох місяців, зітхнувши з полегшенням в кінці у передчутті дев'ятимісячного перепочинку і нової ітерації...

Programming Algorithms: Approximation

2020-01-11T23:08:00.000+02:00

This is a snippet of the "Approximation" chapter of the book.

This chapter will be a collection of stuff from somewhat related but still distinct domains. What unites it is that all the algorithms we will discuss are, after all, targeted at calculating approximations to some mathematical functions. There are no advanced data structures involved, neither is the aim to find a clever way to improve the runtime of some common operations. No, these algorithms are about calculations and computing an acceptable result within the allocated time budget.

Combinatorial Optimization

Dynamic Programming is a framework that can be used for finding the optimal value of some loss function when there are multiple configurations of the problem space that result in different values. Such search is an example of discrete optimization for there is a countable number of states of the system and a distinct value of the cost function we're optimizing corresponding to each state. There are also similar problems that have an unlimited and uncountable number of states, but there is still a way to find a global or local optimum of the cost function for them. They comprise the continuous optimization domain. Why is optimization not just a specialized area relevant to a few practitioners but a toolbox that every senior programmer should know how to utilize? The primary reason is that it is applicable in almost any domain: the problem just needs to be large enough to rule out simple brute force. You can optimize how the data is stored or how the packets are routed, how the blueprint is laid out or the servers are loaded. Many people are just not used to looking at their problems this way. Also, understanding optimization is an important prerequisite for having a good grasp of machine learning, which is revolutionizing the programming world.

DP is an efficient and, overall, great optimization approach, but it can't succeed if the problem doesn't have an optimal substructure. Combinatorial Optimization approaches deal with finding a near-optimum for the problems where an exhaustive search requires O(2^n) computations. Such problems are called NP-hard and a classic example of those is the Travelling Salesman (TSP). The task is to find an optimal order of edges in a cycle spanning all vertices of a fully-connected weighted graph. As we saw previously, this problem doesn't have an optimal substructure, i.e. an optimal partial solution isn't necessarily a part of the best overall one, and so taking the shortest edge doesn't allow the search procedure to narrow down the search space when looking at the next vertex. A direct naive approach to TSP will enumerate all the possible variants and select the one with a minimal cost. However, the number of variants is n!, so this approach becomes intractable very fast. A toy example of visiting all the capitals of the 50 US states has 10^64 variants. This is where quantum computers promise to overturn the situation, but while we're waiting for them to mature, the only feasible approach is developing approximation methods that will get us a good enough solution in polynomial (ideally, linear) time. TSP may look like a purely theoretical problem, but it has some real-world applications. Besides vehicle routing, automated drilling and soldering in electronics is another example. Yet, even more important is that there are many other combinatorial optimization problems, but, in essence, the approaches to solving one of them apply to all the rest. I.e., like with shortest path, coming up with an efficient solution to TSP allows to efficiently solve a very broad range of problems over a variety of domains.

So, let's write down the code for the basic TSP solution. As usual, we have to select the appropriate graph representation. From one point of view, we're dealing with a fully-connected graph, so every representation will work and a matrix one will be the most convenient. However, storing an n^2-sized array is not the best option, especially for a large n. A better "distributed" representation might be useful here. Yet, for the TSP graph, an even better approach would be to do the opposite of our usual optimization trick: trade computation for storage space. When the graph is fully-connected, usually, there exists some kind of an underlying metric space that contains all the vertices. The common example is the Euclidian space, in which each vertex has a coordinate (for example, the latitude and longitude). Anyway, whichever way to represent the vertex position is used, the critical requirement is the existence of the metric that may be calculated at any time (and fast). Under such conditions, we don't have to store the edges at all. So, our graph will be just a list of vertices.

Let's use the example with the US state capitals. Each vertex will be representated as a pair of floats (lat & lon). We can retireve the raw data from the Wikipedia article about the US capitols (with an 'o') and extract the values we need with the following code snippet[1], which cuts a few corners:

(defstruct city
  name lat lon)

(defparameter *wp-link* "https://en.wikipedia.org/w/index.php?title=List_of_state_and_territorial_capitols_in_the_United_States&action=edit&section=1")

(defparameter *cs*
  (with ((raw (drakma:http-request *wp-link*))
         (coords-regex (ppcre:create-scanner "\\{\\{coord\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|type"))
         (capitals (list)))
    (flet ((dms->rad (vec off)
             (* (/ pi 180)
                (+     (? vec (+ off 0))
                    (/ (? vec (+ off 1)) 60)
                    (/ (? vec (+ off 2)) 3600)))))
      (dolist (line (split #\Newline (slice raw
                                            (search "{| class=\"wikitable sortable\"" raw)
                                            (search "</textarea><div class='editOptions'>" raw))))
        (when-it (and (starts-with "|" line)
                      (search "{{coord" line))
          (with ((_ coords (ppcre:scan-to-strings coords-regex line))
                 (coords (map* 'read-from-string coords)))
            (push (make-city :name (slice line (position-if 'alpha-char-p line)
                                          (position-if (lambda (ch) (member ch '(#\] #\|)))
                                                       line :start 1))
                           :lat (dms->rad coords 0)
                           :lon (dms->rad coords 3))
                capitals)))))
    (coerce capitals 'vector)))

CL-USER> (length *cs*)
50

We also need to define the metric. The calculation of distances on Earth, though, is not so straightforward as on a plain. Usually, as a first approximation, the haversine formula is used that provides the estimate of the shortest distance over the surface "as-the-crow-flies" (ignoring the relief).

(defun earth-dist (c1 c2)
  (with ((lat1 (? c1 'lat))
         (lat2 (? c2 'lat))
         (a (+ (expt (sin (/ (- lat2 lat1) 2))
                     2)
               (* (cos lat1)
                  (cos lat2)
                  (expt (sin (/ (- (? c2 'lon) (? c1 'lon)) 2)) 
                        2)))))
    (* 1.2742e7  ; Earth diameter
       (atan (sqrt a) (sqrt (- 1 a))))))

With the metric at our disposal, let's define the function that will calculate the length of the whole path and use it for a number of random paths (we'll use the RUTILS function shuffle to produce a random path).

(defun path-length (path)
  (let ((rez (earth-dist (? path 0) (? path -1))))
    (dotimes (i (1- (length path)))
      (:+ rez (earth-dist (? path i) (? path (1+ i)))))
    rez))

CL-USER> (path-length *cs*)
9.451802301259182d7
CL-USER> (path-length (shuffle *cs*))
9.964776273250546d7
CL-USER> (path-length (shuffle *cs*))
1.009761841183094d8

We can see that an average path may have a length of around 10k kilometers. However, we don't know anything about the shortest or the longest one, and to find out reliably, we'll have to evaluate 50! paths... Yet, as we accept the sad fact that it is not possible to do with our current technology, it's not time to give up yet. Yes, we may not be able to find the absolute best path, but at least we can try to improve on the random one. Already, the three previous calculations had a variance of 5%. So, if we're lucky, maybe we could hit a better path purely by chance. Let's try a thousand paths using our usual argmin pattern:

(defun random-search (path n)
  (let ((min (path-length path))
        (arg path))
    (loop :repeat n :do
      (with ((path (shuffle path))
             (len (path-length path)))
        (when (< len min)
          (:= min len
              arg path))))
    (values arg
            min)))

CL-USER> (:= *print-length* 2)
2
CL-USER> (random-search *cs* 1000)
(#S(CITY :NAME "Atlanta" :LAT 0.5890359059538811d0 ...)
 #S(CITY :NAME "Montpelier, Vermont" :LAT 0.772521512027179d0 ...) ...)
7.756170773802838d7

OK, we've got a sizable 20% improvement. What about 1,000,000 combinations?

CL-USER> (time (random-search *cs* 1000000))
Evaluation took:
  31.338 seconds of real time
...
(#S(CITY :NAME "Boise, Idaho" :LAT 0.7612723873453388d0 ...)
 #S(CITY :NAME "Helena, Montana" :LAT 0.813073800024579d0 ...) ...)
6.746660953705506d7

Cool, another 15%. Should we continue increasing the size of the sample? Maybe, after a day of computations, we could get the path length down by another 20-30%. And that's already a good gain. Surely, we could also parallelize the algorithm or use a supercomputer in order to analyze many more variants. But there should be something smarter than simple brute force, right?

More details about of the book may be found on its website.

Programming Algorithms: Dynamic Programming

2019-12-13T12:40:00.000+02:00

This is a snippet of the "Dynamic Programming" chapter of the book.

This chapter opens the final part of the book entitled "Selected Algorithms". In it, we're going to apply the knowledge from the previous chapters in analyzing a selection of important problems that are mostly application-independent and find usages in many applied domains: optimization, synchronization, compression, and similar.

We will start with a single approach that is arguably the most powerful algorithmic technic in use. If we managed to reduce the problem to Dynamic Programming (DP), in most of the cases, we can consider it solved. The fact that we progressed so far in this book without mentioning DP is quite amazing. Actually, we could have already talked about it several times, especially in the previous chapter on strings, but I wanted to contain this topic to its own chapter so deliberately didn't start the exposition earlier. Indeed, strings are one of the domains where dynamic programming is used quite heavily, but the technic finds application in almost every area.

Also, DP is one of the first marketing terms in CS. When Bellman had invented, he wanted to use the then hyped term "programming" to promote his idea. This has, probably, caused more confusion over the years than benefit. In fact, a good although unsexy name for this technic сould be simply "filling the table" as the essence of the approach is an exhaustive evaluating of all variants with memoization of partial results (in a table) to avoid repetition of redundant computations. Obviously, it will have any benefits only when there are redundant computations, which is not the case, for example, with combinatorial optimization. To determine if a problem may be solved with DP we need to validate that it has the optimal substructure property:

A problem has optimal substructure if when we take its subproblem an optimal solution to the whole problem includes an optimal solution to this subproblem.

An example of the optimal substructure is the shortest path problem. If the shortest path from point A to point B passes through some point C and there are multiple paths from C to B, the one included in the shortest path A-B should be the shortest of them. In fact, the shortest path is an archetypical DP problem which we'll discuss later in this chapter. A counterexample is a Travelling Salesman Problem (TSP): if it had optimal substructure the subpath between any two nodes in the result path should have been the shortest possible path between these nodes. But it isn't true for all nodes because it can't be guaranteed that the edges of the path will form a cycle with all the other shortest paths.

Fibonacci Numbers

So, as we said, the essence of DP is filling a table. This table, though, may have a different number of dimensions for different problems. Let's start with a 1d case. What book on algorithms can omit discussing the Fibonacci numbers? Usually, they are used to illustrate recursion, yet they are also a great showcase for the power of memoization. Besides, recursion is, conceptually, also an integral part of DP.

A naive approach to calculating the i-th number will be directly coding the Fibonacci formula:

(defun naive-fib (i)
  (assert (typep i '(integer 0)))
  (if (< i 2) 1
      (+ (naive-fib (- i 1))
         (naive-fib (- i 2)))))

However, applying it will result in an exponential growth of the number of computations: each call to naive-fib results in two more calls. So, the number of calls needed for the n-th number, with this approach, is O(2^n).

> (time (naive-fib 40))
Evaluation took: 3.390 seconds of real time
165580141
> (time (naive-fib 42))
Evaluation took: 7.827 seconds of real time
433494437

Yet, we can see here a direct manifestation of an optimal substructure property: the i-th number calculation uses the result of the (1- i)-th one. To utilize this recurrence, we'll need to store the previous results and reuse them. It may be achieved by changing the function call to the table access. Actually, from the point of view of math, tables and functions are, basically, the same thing.

(let ((fib (vec 1 1)))  ; our table will be an adjustable vector
  (defun fib (i)
    (when (< (length fib) i)
      (vector-push-extend (fib (- i 1)) fib))
    (+ (? fib (- i 1))
       (? fib (- i 2)))))

What we've done here is added a layer of memoization to our function that uses an array fib that is filled with the consecutive Fibonacci numbers. The array is hidden inside the closure of the fib procedure, so it will persist between the calls to it and accumulate the numbers as they are requested. There will also be no way to clear it, apart from redefining the function, as the closed over variables of this kind are not accessible outside of the function. The consecutive property is ensured by the arrangement of the recursive calls: the table is filled on the recursive ascent starting from the lowest yet unknown number. This approach guarantees that each Fibonacci number is calculated exactly once and reduces our dreaded O(2^n) running time to a mere O(n)!

Such a calculation is the simplest example of top-down DP that is performed using recursion. Despite its natural elegance, it suffers from a minor problem that may turn significant, in some cases: extra space consumption by each recursive call. It's not only O(n) in time, but also in space. The alternative strategy that gets rid of redundant space usage is called bottom-up DP and is based on loops instead of recursion. Switching to it is quite trivial, in this case:

(let ((fib (vec 1 1)))
  (defun bottom-up-fib (i)
    (let ((off (length fib)))
      (adjust-array fib (1+ i) :fill-pointer t)
      (dotimes (j (- (1+ i) off))
        (let ((j (+ j off)))
          (:= (aref fib j)
              (+ (aref fib (- j 1))
                 (aref fib (- j 2)))))))
    (aref fib i)))
> (time (bottom-up-fib 42))
Evaluation took: 0.000 seconds of real time
> (time (bottom-up-fib 4200))
Evaluation took: 0.004 seconds of real time
40512746637826407679504078155833145442086707013857032517543...  ; this number is a Lisp's bignum that has unlimited size

Funny enough, a real-word-ready implementation of Fibonacci numbers ends up not using recursion at all...

More details about of the book may be found on its website.

Programming Algorithms: Strings

2019-11-20T12:47:00.000+02:00

This is a snippet of the "Srings" chapter of the book.

It may not be immediately obvious why the whole chapter is dedicated to strings. Aren't they just glorified arrays? There are several answers to these challenges:

indeed, strings are not just arrays, or rather, not only arrays: in different contexts, other representations, such as trees or complex combinations of arrays, may be used. And, besides, there are additional properties that are important for strings even when they are represented as arrays
there's a lot of string-specific algorithms that deserve their own chapter
finally, strings play a significant role in almost every program, so they have specific handling: in the OS, standard library, and even, sometimes, your application framework

In the base case, a string is, indeed, an array. As we already know, this array may either store its length or be a 0-terminated security catastrophe, like in C (see buffer overflow). So, to iterate, strings should store their length. Netstrings are a notable take on the idea of the length-aware strings: it's a simple external format that serializes a string as a tuple of length and contents, separated by a colon and ending with a comma: 3:foo, is the netsrting for the string foo.

More generally, a string is a sequence of characters. The characters themselves may be single bytes as well as fixed or variable-length byte sequences. The latter character encoding poses raises a challenging question of what to prefer, correctness or speed? With variable-length Unicode code points, the simplest and fastest string variant, a byte array, breaks, for it will incorrectly report its length (in bytes, not in characters) and fail to retrieve the character by index. Different language ecosystems address this issue differently, and the majority is, unfortunately, broken in one aspect or another. Overall, there may be two possible solution paths. The first one is to use a fixed-length representation and pad shorter characters to full length. Generally, such representation will be 32-bit UTF-32 resulting in up to 75% storage space waste for the most common 1-byte ASCII characters. The alternative approach will be to utilize a more advanced data-structure. The naive variant is a list, which implies an unacceptable slowdown of character access operation to O(n). Yet, a balanced approach may combine minimal additional space requirements with acceptable speed. One of the solutions may be to utilize the classic bitmap trick: use a bit array indicating, for each byte, whether it's the start of a character (only a 12% overhead). Calculating the character position may be performed in a small number of steps with the help of an infamous, in close circles, operation — Population count aka Hamming weight. This hardware instruction calculates the number of 1-bits in an integer and is accessible via logcount Lisp standard library routine. Behind the scenes, it is also called for bit arrays if you invoke count 1 on them. At least this is the case for SBCL:

CL-USER> (disassemble (lambda (x) 
                        (declare (type (simple-array bit) x))
                        (count 1 x)))

; disassembly for (LAMBDA (X))
; Size: 267 bytes. Origin: #x100FC9FD1A
...
; DA2:       F3480FB8FA       POPCNT RDI, RDX

The indexing function implementation may be quite tricky, but the general idea is to try to jump ahead n characters and calculate the popcount of the substring from the previous position to the current that will tell us the number of characters we have skipped. For the base case of a 1-byte string, we will get exactly where we wanted in just 1 jump and 1 popcount. However, if there were multibyte characters in the string, the first jump would have skipped less than n characters. If the difference is sufficiently small (say, below 10) we can just perform a quick linear scan of the remainder and find the position of the desired character. If it's larger than n/2 we can jump ahead n characters again (this will repeat at most 3 times as the maximum byte-length of a character is 4), and if it's below n/2 we can jump n/2 characters. And if we overshoot we can reverse the direction of the next jump or search. You can see where it's heading: if at each step (or, at least, at each 4th step) we are constantly half dividing our numbers this means O(log n) complexity. That's the worst performance for this function we can get, and it will very efficiently handle the cases when the character length doesn't vary: be it 1 byte — just 2 operations, or 4 bytes — 8 ops.

Here is the prototype of the char-index operation implemented according to the described algorithm (without the implementation of the mb-linear-char-index that performs the final linear scan):

(defstruct (mb-string (:conc-name mbs-))
  bytes
  bitmap)

(defparameter *mb-threshold* 10)

(defun mb-char-index (string i)
  (let ((off 0))
    (loop
      (with ((cnt (count 1 (mbs-bitmap string) :start off :end (+ off i))))
             (diff (- i cnt)))
        (cond
         ((= cnt i) (return (+ off i)))
         ((< diff *mb-threshold*) (return (mb-linear-char-index
                                           string diff off)))
         ((< cnt (floor i 2)) (:+ off i)
                              (:- i cnt))
         (t (:+ off (floor i 2))
            (:- i cnt)))))))

The length of such a string may be calculated by perfoming the popcount on the whole bitmap:

(defun mb-length (string)
  (count 1 (mbs-bitmap string)))

It's also worth taking into account that there exists a set of rules assembled under the umbrella of the Unicode collation algorithm that specifies how to order strings containing Unicode code-points.

More details about of the book may be found on its website.

Programmatically Drawing Simple Graphviz Graphs

2019-10-25T13:48:00.001+03:00

For my book, I had to draw a number of graphs. Obviously, I wanted to have a programmatic way to do that, and Graphviz is the goto-library for that. In Lisp, the interface to Graphviz is cl-dot, but, for me, it wasn't easy to figure out from the manual the "simple and easy" way to use it. I.e. I couldn't find a stupid beginner-level interface, so I had to code it myself. Here's the implementation that allows anyone with a REPL to send to Graphviz lists of edges and obtain graph images.

Generated images:

Programming Algorithms: Graphs

2019-10-24T16:09:00.001+03:00

This is a snippet of the "Graphs" chapter of the book.

Graphs have already been mentioned several times, in the book, in quite diverse contexts. Actually, if you are familiar with graphs you can spot opportunities to use them in quite different areas for problems that aren't explicitly formulated with graphs in mind. So, in this chapter, we'll discuss how to handle graphs to develop such intuition to some degree.

But first, let's list the most prominent examples of the direct graph applications, some of which we'll see here in action:

pathfinding
network analysis
dependency analysis in planning, compilers, etc.
various optimization problems
distributing and optimizing computations
knowledge representation and reasoning with it
meaning representation in natural language processing

Graphs may be thought of as a generalization of trees: indeed, trees are, as we said earlier, connected directed acyclic graphs. But there's an important distinction in the patterns of the usage of graphs and trees. Graphs, much more frequently than trees, have weights associated with the edges, which adds a whole new dimension both to algorithms for processing them and to possible data that can be represented in the graph form. So, while the main application of trees is reflecting some hierarchy, for graphs, it is often more about determining connectedness and its magnitude, based on the weights.

Graph Representations

A graph is, basically, a set of nodes (called "vertices", V) and an enumeration of their pairs ("edges", E). The edges may be directed or undirected (i.e. bidirectional), and also weighted or unweighted. There are many ways that may be used to represent these sets, which have varied utility for different situations. Here are the most common ones:

as a linked structure: (defstruct node data links) where links may be either a list of other nodes, possibly, paired with weights, or a list of edge structures represented as (defsturct edge source destination weight). For directed graphs, this representation will be similar to a singly-linked list but for undirected — to a heavier doubly-linked one
as an adjacency matrix (V x V). This matrix is indexed by vertices and has zeroes when there's no connection between them and some nonzero number for the weight (1 — in case of unweighted graphs) when there is a connection. Undirected graphs have a symmetric adjacency matrix and so need to store only the abovediagonal half of it
as an adjacency list that enumerates for each vertex the other vertices it's connected to and the weights of connections
as an incidence matrix (V x E). This matrix is similar to the previous representation, but with much more wasted space. The adjacency list may be thought of as a sparse representation of the incidence matrix. The matrix representation may be more useful for hypergraphs (that have more than 2 vertices for each edge), though
just as a list of edges

More details about of the book may be found on its website.

Programming Algorithms: Trees

2019-09-28T11:51:00.000+03:00

This is a snippet of the "Trees" chapter of the book.

Couldn't resist adding this xkcd

Balancing a binary tree is the infamous interview problem that has all that folklore and debate associated with it. To tell you the truth, like the other 99% of programmers, I never had to perform this task for some work-related project. And not even due to the existence of ready-made libraries, but because self-balancing binary trees are, actually, pretty rarely used. But trees, in general, are ubiquitous even if you may not recognize their presence. The source code we operate with, at some stage of its life, is represented as a tree (a popular term here is Abstract Syntax Tree or AST, but the abstract variant is not the only one the compilers process). The directory structure of the file system is the tree. The object-oriented class hierarchy is likewise. And so on. So, returning to interview questions, trees indeed are a good area as they allow to cover a number of basic points: linked data structures, recursion, complexity. But there's a much better task, which I have encountered a lot in practice and also used quite successfully in the interview process: breadth-first tree traversal. We'll talk about it a bit later.

Similar to how hash-tables can be thought of as more sophisticated arrays (they are sometimes even called "associative arrays"), trees may be considered an expansion of linked lists. Although technically, a few specific trees are implemented not as a linked data structure but are based on arrays, the majority of trees are linked. Like hash-tables, some trees also allow for efficient access to the element by key, representing an alternative key-value implementation option.

Basically, a tree is a recursive data structure that consists of nodes. Each node may have zero or more children. If the node doesn't have a parent, it is called the root of the tree. And the constraint on trees is that the root is always single. Graphs may be considered a generalization of trees that don't impose this constraint, and we'll discuss them in a separate chapter. In graph terms, a tree is an acyclic directed single-component graph. Directed means that there's a one-way parent-child relation. And acyclic means that a child can't have a connection to the parent neither directly, nor through some other nodes (in the opposite case, what will be the parent and what — the child?) The recursive nature of trees manifests in the fact that if we extract an arbitrary node of the tree with all of its descendants, the resulting part will remain a tree. We can call it a subtree. Besides parent-child or, more generally, ancestor-descendant "vertical" relationships that apply to all the nodes in the tree, we can also talk about horizontal siblings — the set of nodes that have the same parent/ancestor.

Another important tree concept is the distinction between terminal (leaf) and nonterminal (branch) nodes. Leaf nodes don't have any children. In some trees, the data is stored only in the leaves with branch nodes serving to structure the tree in a certain manner. In other trees, the data is stored in all nodes without any distinction.

More details about of the book may be found on its website.

Programming Algorithms: Hash-Tables

2019-09-11T22:25:00.000+03:00

This is a snippet of the "Hash-Tables" chapter of the book.

Now, we can move on to studying advanced data structures which are built on top of the basic ones such as arrays and lists, but may exhibit distinct properties, have different use cases, and special algorithms. Many of them will combine the basic data structures to obtain new properties not accessible to the underlying structures. The first and most important of these advanced structures is, undoubtedly, the hash-table. However vast is the list of candidates to serve as key-values, hash-tables are the default choice for implementing them.

Also, hash-sets, in general, serve as the main representation for medium and large-sized sets as they ensure O(1) membership test, as well as optimal set-theoretic operations complexity. A simple version of a hash-set can be created using a normal hash-table with t for all values.

Implementation

The basic properties of hash-tables are average O(1) access and support for arbitrary keys. These features can be realized by storing the items in an array at indices determined by a specialized function that maps the keys in a pseudo-random way — hashes them. Technically, the keys should pertain to the domain that allows hashing, but, in practice, it is always possible to ensure either directly or by using an intermediate transformation. The choice of variants for the hash-function is rather big, but there are some limitations to keep in mind:

As the backing array has a limited number of cells (n), the function should produce values in the interval [0, n). This limitation can be respected by a 2-step process: first, produce a number in an arbitrary range (for instance, a 32-bit integer) and then take the remainder of its division by n.
Ideally, the distribution of indices should be uniform, but similar keys should map to quite distinct indices. I.e. hashing should turn things which are close, into things which are distant. This way, even very small changes to the input will yield sweeping changes in the value of the hash. This property is called the "avalanche effect".

Dealing with Collisions

Even better would be if there were no collisions — situations when two or more keys are mapped to the same index. Is that, at all, possible? Theoretically, yes, but all the practical implementations that we have found so far are too slow and not feasible for a hash-table that is dynamically updated. However, such approaches may be used if the keyset is static and known beforehand. They will be covered in the discussion of perfect hash-tables.

For dynamic hash-tables, we have to accept that collisions are inevitable. The probability of collisions is governed by an interesting phenomenon called "The Birthday Paradox". Let's say, we have a group of people of some size, for instance, 20. What is the probability that two of them have birthdays on the same date? It may seem quite improbable, considering that there are 365 days in a year and we are talking just about a handful of people. But if you take into account that we need to examine each pair of people to learn about their possible birthday collision that will give us (/ (* 20 19) 2), i.e. 190 pairs. We can calculate the exact probability by taking the complement to the probability that no one has a birthday collision, which is easier to reason about. The probability that two people don't share their birthday is (/ (- 365 1) 365): there's only 1 chance in 365 that they do. For three people, we can use the chain rule and state that the probability that they don't have a birthday collision is a product of the probability that any two of them don't have it and that the third person also doesn't share a birthday with any of them. This results in (* (/ 364 365) (/ (- 365 2) 365)). The value (- 365 2) refers to the third person not having a birthday intersection with neither the first nor the second individually, and those are distinct, as we have already asserted in the first term. Continuing in such fashion, we can count the number for 20 persons:

(defun birthday-collision-prob (n)
  (let ((rez 1))
    (dotimes (i n)
      (:* rez (/ (- 365 i) 365)))
    ;; don't forget that we want the complement
    ;; of the probability of no collisions,
    ;; hence (- 1.0 ...)
    (- 1.0 rez)))

CL-USER> (birthday-collision-prob 20)
0.4114384

So, among 20 people, there's already a 40% chance of observing a coinciding birthday. And this number grows quickly: it will become 50% at 23, 70% at 30, and 99.9% at just 70!

But why, on Earth, you could ask, have we started to discusss birthdays? Well, if you substitute keys for persons and the array size for the number of days in a year, you'll get the formula of the probability of at least one collision among the hashed keys in an array, provided the hash function produces perfectly uniform output. (It will be even higher if the distribution is non-uniform).

(defun hash-collision-prob (n size)
  (let ((rez 1))
    (dotimes (i n)
      (:* rez (/ (- size i) size)))
    (- 1.0 rez)))

Let's say, we have 10 keys. What should be the array size to be safe against collisions?

CL-USER> (hash-collision-prob 10 10)
0.9996371

99.9%. OK, we don't stand a chance to accidentally get a perfect layout. :( What if we double the array size?

CL-USER> (hash-collision-prob 10 20)
0.9345271

93%. Still, pretty high.

CL-USER> (hash-collision-prob 10 100)
0.37184352
CL-USER> (hash-collision-prob 10 10000)
0.004491329

So, if we were to use a 10k-element array to store 10 items the chance of a collision would fall below 1%. Not practical...

Note that the number depends on both arguments, so (hash-collision-prob 10 100) (0.37) is not the same as (hash-collision-prob 20 200) (0.63).

We did this exercise to completely abandon any hope of avoiding collisions and accept that they are inevitable. Such mind/coding experiments may be an effective smoke-test of our novel algorithmic ideas: before we go full-speed and implement them, it makes sense to perform some back-of-the-envelope feasibility calculations.

More details about of the book may be found on its website.

Programming Algorithms: Key-Values

2019-08-30T20:29:00.002+03:00

This is a snippet of the "Key-Values" chapter of the book.

To conclude the description of essential data structures, we need to discuss key-values (kvs), which are the broadest family of structures one can imagine. Unlike arrays and lists, kvs are not concrete structures. In fact, they span, at least in some capacity, all of the popular concrete ones, as well as some obscure.

The main feature of kvs is efficient access to the values by some kind of keys that they are associated with. In other words, each element of such data structure is a key-value pair that can be easily retrieved if we know the key, and, on the other hand, if we ask for the key that is not in the structure, the null result is also returned efficiently. By "efficiently", we usually mean O(1) or, at least, something sublinear (like O(log n)), although, for some cases, even O(n) retrieval time may be acceptable. See how broad this is! So, a lot of different structures may play the role of key-values.

By the way, there isn't even a single widely-adopted name for such structures. Besides key-values — which isn't such a popular term (I derived it from key-value stores) — in different languages, they are called maps, dictionaries, associative arrays, tables, objects and so on.

In a sense, these are the most basic and essential data structures. They are so essential that some dynamic languages — for example, Lua, explicitly, and JavaScript, without a lot of advertisement — rely on them as the core (sometimes sole) language's data structure. Moreover, key-values are used almost everywhere. Below is a list of some of the most popular scenarios:

implementation of the object system in programming languages
most of the key-value stores are, for the most part, glorified key-value structures
internal tables in the operating system (running process table or file descriptor tables in the Linux kernel), programming language environment or application software
all kinds of memoization and caching
efficient implementation of sets
ad hoc or predefined records for returning aggregated data from function calls
representing various dictionaries (in language processing and beyond)

Considering such a wide spread, it may be surprising that, historically, the programming language community only gradually realized the usefulness of key-values. For instance, such languages as C and C++ don't have the built-in support for general kvs (if we don't count structs and arrays, which may be considered significantly limited versions). Lisp, on the contrary, was to some extent pioneering their recognition with the concepts of alists and plists, as well as being one of the first languages to have hash-table support in the standard.

More details about of the book may be found on its website.

RUTILS 5.0 and Tutorial

2019-08-29T11:10:00.002+03:00

RUTILS is my take on the Lisp "modernization" effort that adds the missing syntactic and data structure pieces, which became proven and well-established, in Lisp itself or in other languages. The programming field is constantly developing while the Lisp standard remains fixed so some additions, over time, are not only desirable but inevitable, if we don't want to lag behind. Thankfully, Lisp provides all the necessary means for implementing them and so, with some creativity, there's a way to have access to almost anything you want and need while retaining full backward compatibility (a lack of which is the most critical problem of some alternative solutions).

I, surely, understand that using such an extension remains a matter of taste and not every Lisper will like it. I didn't try to seriously promote it and was quite satisfied with the benefit that it provided to me and my teams' development. However, as I decided to use it for the "Programming Algorithms" book, it received some attention and a number of questions. From the direction of the discussions, I realized that the docs are lacking a critical part — the tutorial explaining how to effectively use the library. This text is intended to bridge that gap. I had to finish it before publishing the next chapter of the book, which I'll do on Friday.

So, today, version 5 of RUTILS is released alongside with the tutorial that aims to better explain its usage.

Programming Algorithms: Linked Lists

2019-08-19T15:32:00.001+03:00

This is a snippet of the "Lists" chapter of the book.

Linked data structures are in many ways the opposite of the contiguous ones that we have explored to some extent in the previous chapter using the example of arrays. In terms of complexity, they fail where those ones shine (first of all, at random access) — but prevail at scenarios when a repeated modification is necessary. In general, they are much more flexible and so allow the programmer to represent almost any kind of a data structure, although the ones that require such level of flexibility may not be too frequent. Usually, they are specialized trees or graphs.

The basic linked data structure is a singly-linked list.

Just like arrays, lists in Lisp may be created both with a literal syntax for constants and by calling a function — make-list — that creates a list of a certain size filled with nil elements. Besides, there's a handy list utility that is used to create lists with the specified content (the analog of vec).

CL-USER> '("hello" world 111)
("hello" WORLD 111)
CL-USER> (make-list 3)
(NIL NIL NIL)
CL-USER> (list "hello" 'world 111)
("hello" WORLD 111)

An empty list is represented as () and, interestingly, in Lisp, it is also a synonym of logical falsehood (nil). This property is used very often, and we'll have a chance to see that.

If we were to introduce our own lists, which may be quite a common scenario in case the built-in ones' capabilities do not suit us, we'd need to define the structure "node", and our list would be built as a chain of such nodes. We might have wanted to store the list head and, possibly, tail, as well as other properties like size. All in all, it would look like the following:

(defstruct list-cell
  data
  next)

(defstruct our-own-list
  (head nil :type (or list-cell null))
  (tail nil :type (or list-cell null)))

CL-USER> (let ((tail (make-list-cell :data "world")))
           (make-our-own-list
            :head (make-list-cell
                   :data "hello"
                   :next tail)
            :tail tail))
#S(OUR-OWN-LIST
   :HEAD #S(LIST-CELL
            :DATA "hello"
            :NEXT #S(LIST-CELL :DATA "world" :NEXT NIL))
   :TAIL #S(LIST-CELL :DATA "world" :NEXT NIL))

More details about of the book may be found on its website.

Programming Algorithms: Arrays

2019-08-12T16:37:00.001+03:00

This is a snippet of the "Arrays" chapter of the book.

Arrays are, alongside structs, the most basic data structure and, at the same time, the default choice for implementing algorithms. A one-dimensional array that is also called a "vector" is a contiguous structure consisting of the elements of the same type. One of the ways to create such arrays, in Lisp, is this:

CL-USER> (make-array 3)
#(0 0 0)

The printed result is the literal array representation. It happens that the array is shown to hold 0's, but that's implementation-dependent. Additional specifics can be set during array initialization: for instance, the :element-type, :initial-element, and even full contents:

CL-USER> (make-array 3 :element-type 'list :initial-element nil)
#(NIL NIL NIL)
CL-USER> (make-array 3 :initial-contents '(1.0 2.0 3.0))
#(1.0 2.0 3.0)

If you read back such an array you'll get a new copy with the same contents:

CL-USER> #(1.0 2.0 3.0)
#(1.0 2.0 3.0)

It is worth noting that the element type restriction is, in fact, not a limitation the default type is T[1]. In this case, the array will just hold pointers to its elements that can be of arbitrary type. If we specify a more precise type, however, the compiler might be able to optimize storage and access by putting the elements in memory directly in the array space. This is, mainly, useful for numeric arrays, but it makes multiple orders of magnitude difference for them for several reasons, including the existence of vector CPU instructions that operate on such arrays.

The arrays we have created are mutable, i.e. we can change their contents, although we cannot resize them. The main operator to access array elements is aref. You will see it in those pieces of code, in this chapter, where we care about performance.

CL-USER> (let ((vec (make-array 3 :initial-contents '(1.0 2.0 3.0))))
           (print (aref vec 0))
           (print (? vec 1))
           (:= (aref vec 2) 4.0))
           (print (? vec 2))
           (aref vec 3))
1.0 
2.0 
4.0
; Evaluation aborted on #<SIMPLE-TYPE-ERROR expected-type: (MOD 3) datum: 3>

In Lisp, array access beyond its boundary, as expected, causes an error.

It is also possible to create constant arrays using the literal notation #(). These constants can, actually, be changed in some environments, but don't expect anything nice to come out of such abuse — and the compiler will warn you of that:

CL-USER> (let ((vec #(1.0 2.0 3.0)))
           (:= (aref vec 2) nil)
           (print vec))
; caught WARNING:
;   Destructive function (SETF AREF) called on constant data.
;   See also:
;     The ANSI Standard, Special Operator QUOTE
;     The ANSI Standard, Section 3.2.2.3
; 
; compilation unit finished
;   caught 1 WARNING condition

#(1.0 2.0 NIL)

RUTILS provides more options to easily create arrays with a shorthand notation:

CL-USER> #v(1 2 3)
#(1 2 3)
CL-USER> (vec 1 2 3)
#(1 2 3)

Although the results seem identical they aren't. The first version creates a mutable analog of #(1 2 3), and the second also makes it adjustable (we'll discuss adjustable or dynamic arrays next).

More details about of the book may be found on its website.

Programming Algorithms: Data Structures

2019-08-05T13:40:00.000+03:00

The next several chapters will be describing the basic data structures that every programming language provides, their usage, and the most important algorithms relevant to them. And we'll start with the notion of a data-structure and tuples or structs that are the most primitive and essential one. Here is a snippet from this chapter.

Data Structures vs Algorithms

Let's start with a somewhat abstract question: what's more important, algorithms or data structures?

From one point of view, algorithms are the essence of many programs, while data structures may seem secondary. Besides, although a majority of algorithms rely on certain features of particular data structures, not all do. Good examples of the data-structure-relying algorithms are heapsort, search using BSTs, and union-find. And of the second type: the sieve of Erastophenes and consistent hashing.

At the same time, some seasoned developers state that when the right data structure is found, the algorithm will almost write itself. Linus Torvalds, the creator of Linux, is quoted saying:

Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

A somewhat less poignant version of the same idea is formulated in the Art of Unix Programming by Eric S. Raymond as the "Rule of Representation":

Fold knowledge into data so program logic can be stupid and robust.

Even the simplest procedural logic is hard for humans to verify, but quite complex data structures are fairly easy to model and reason about. To see this, compare the expressiveness and explanatory power of a diagram of (say) a fifty-node pointer tree with a flowchart of a fifty-line program. Or, compare an array initializer expressing a conversion table with an equivalent switch statement. The difference in transparency and clarity is dramatic.

Data is more tractable than program logic. It follows that where you see a choice between complexity in data structures and complexity in code, choose the former. More: in evolving a design, you should actively seek ways to shift complexity from code to data.

Data structures are more static than algorithms. Surely, most of them allow change of their contents over time, but there are certain invariants that always hold. This allows reasoning by simple induction: consider only two (or at least a small number of) cases, the base one(s) and the general. In other words, data structures remove, in the main, the notion of time from consideration, and change over time is one of the major causes of program complexity. In other words, data structures are declarative, while most of the algorithms are imperative. The advantage of the declarative approach is that you don't have to imagine (trace) the flow of time through it.

So, this book, like most other books on the subject, is organized around data structures. The majority of the chapters present a particular structure, its properties and interface, and explain the algorithms, associated with it, showing its real-world use cases. Yet, some important algorithms don't require a particular data structure, so there are also several chapters dedicated exclusively to them.

More information about the book may be found on its website.