2020-06-22

Eval Spotted in the Wild

(#lisptips on the dynamic nature of CLOS magnified by eval)

Since I started programming in Lisp, I have always had the impression that using eval is taboo. Or rather, a cul-de-sac that you never want to touch. When I was only learning Lisp, I made a couple of unsuccessful and rather stupid attempts at utilizing it to bend the language to my view of how it should function — only to learn how it is really intended to function. After that, it occupied its rightful place on my mind's shelf of "low-level constructs only needed to implement the language".

Yet, recently, I saw a legitimate use case for it and even wrote a piece of production code containing eval! That was such a revelation that I wanted to share it in this short post.

So, here is the case I needed to solve: I was developing a parser for a new data format that had to fit into an existing set of parsers. The parsers not only decode the data but also store it in the datastore using the CLOS machinery for datastore access. I.e. there's a generic function to store an individual piece of data that is specialized for different connection/datastore types. Now, my parser had to prepare the individual pieces and, eventually, they would be fed to this function. But that may happen independently of the parser operation: when the data store commit is performed.

Yet, there was another issue at play: the data format allows the individual items to be interdependent, i.e. reference one another via an implicit reference. And when the data is persisted, due to the properties of the data store, these references should be changed to the internal ids of the referenced items. And those are not known before the commit happens. I.e. I was in the following situation:

  • my parser produces an array of items that are to be persisted to the dataset at some later time
  • the order of their addition matters as the dependent items should be added after the ones they reference
  • and as the referenced item is added its id should be saved
  • and assigned to a field of the dependent item before that item is also added

This program logic is quite normal, apart from the fact that my parser doesn't have control over the whole workflow. Actually, the data persistence stage operates in the inversion-of-control paradigm, i.e. I can only override (rather, augment) the part of the program that is responsible for processing an individual item. Needless to say, I had no desire or intention to reimplement all the different (I believe, 7) ways of interaction with the datastore that had their own methods plus a number of before/after/around-methods.

In fact, CLOS is very flexible and provides a way to achieve my goal of fitting into this whole machinery without disturbing it or having to reimplement anything: an object of my own mixin class to hold the state and an around-method specialized on it. If not for one issue: limited facilities for dynamic creation of classes.

So, here's what I wanted to do and to avoid:

  1. I wanted to define a mixin-class and an around-method for it to augment the data storing procedure with saving of the ids of specified items and assigning them to fields in other items before persisting them. Here's the sketch of the relevant code:
    
    (defclass my-data-store-mixin ()
      ((linked-items-table :reader my-mixin-table
                           :initform (make-hash-table))))
    
    (defmethod add-item :around ((db my-data-store-mixin) item)
      (let ((linked-items-table (my-mixin-table db))
            (item-id (call-next-method)))
        (dolist (it (gethash item linked-items-table))
          (remhash it linked-items-table)
          (setf (reference it) item-id))
        (remhash item linked-items-table)
        item-id)) 
    
  2. Yet, I didn't want this code to run when other data formats are imported, hence my mixin should have been "activated" if and only if my specific format is parsed.
  3. In other words, I needed a way to dynamically add this mixin to an existing connection object, in the context of the parser call, and then restore the connection to its previous state. In general, CLOS also provides such a facility with its change-class operator. I would say, this would have been a manifestation of a dynamic object system in all its glory if not for one deficiency.
  4. You can't just dynamically define a temporary class: one that inherits from the class of the current connection and my mixin. defclass is a macro that's expected to deal with names known ahead of time and coded as part of its call: it doesn't evaluate variables. Usually, such macro APIs in Lisp have a make- function counterpart. I.e. what I needed was:
    
    (let ((temp-class (gensym))
          (current-db-class (class-of *db*)))
      (make-class temp-class (list (class-name current-db-class) 'my-data-store-mixin) nil)
      (unwind-protect (progn (change-class *db* temp-class)
                             ;; execute my code
                      )
        (change-class *db* current-db-class)))
    
    But CLOS just doesn't have an API for that. (Which might be perfectly reasonable — and I don't want to delve into the discussion of those reasons in this post.) Actually, there's the MOP for that. But I'd prefer not to take the MOP route here for another set of reasons I want to skip discussing now :) Suffice it to say that it is complicated and, from my experience with the MOP, I have developed a stance that it's another area intended for language implementation usage — not for user-level code.
  5. And here's where eval comes to the rescue. In place of the nonexistent make-class I could just put this piece:
    
    (let ((class (intern (format nil "my-mixed-~a" (class-name current-db-class)))))
      (unless (find-class class nil)
        (eval `(defclass ,class (,(class-name current-db-class) my-data-store-mixin) ()))))
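
    Putting the pieces together, the whole dance might look something like the following minimal sketch (call-with-my-mixin and parse-my-format are hypothetical names of mine, and *db* is assumed to hold the current connection object; the temporary class is cached by name, so eval runs only once per connection class):

    (defun call-with-my-mixin (db fn)
      (let* ((db-class (class-of db))
             (mixed-name (intern (format nil "my-mixed-~a" (class-name db-class)))))
        (unless (find-class mixed-name nil)
          (eval `(defclass ,mixed-name (,(class-name db-class) my-data-store-mixin)
                   ())))
        (change-class db mixed-name)
        (unwind-protect (funcall fn)
          ;; whatever happens, restore the connection to its original class
          (change-class db db-class))))

    ;; usage: (call-with-my-mixin *db* (lambda () (parse-my-format stream)))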
    

So, eval is an escape hatch into the world of ultimate dynamism: it can add dynamism anywhere, whether an appropriate API was left out due to lack of foresight or was never even intended to exist... :)

2020-05-18

"Programming Algorithms" Book Gets its own Page

Recently, I was busy organizing the process of postal delivery of the paperback version of the book to all the interested people. Here, I have created a map showing everyone who has already ordered:

I'm glad to see the global geography of readership and all the positive feedback. Thanks a lot! Please, share your thoughts online :)

Finally, the book got its own webpage with all the relevant details.

2020-05-08

Dead-Tree Version of "Programming Algorithms"

I have finally obtained the first batch of the printed "Programming Algorithms" books and will shortly be sending them to the 13 people who asked for a hardcopy.

Here is a short video showing the book "in action":

If you also want to get a copy, here's how you do it:

  1. Send an email to vseloved@gmail.com with your postal address — I'll send you a Paypal money request.
  2. Once I see the donation, I'll go to the post office and send you the book.
  3. Optional step: if you want it to be signed, please, indicate it in your letter.
Shipping details: As I said originally, the price of the dead-tree version will be $20+shipping. I'll ship via the Ukrainian national post. You can do the fee calculation online here (book weight is 0.58 kg, size is 23 x 17 x 2 cm): https://calc.ukrposhta.ua/international-calculator. Alas, the interface is only in Ukrainian. According to the examples I've tried, the cost will be approximately $10-15. To make it easier, I've just settled on $10 shipping without a tracking number or $15 if you want a tracking number, regardless of your country. I don't know how long delivery will take - it probably depends on the location (I'll try to inquire when sending).

The book was already downloaded more than 1170 times (I'm not putting the exact number here as it's constantly growing little by little). I wish I knew how many people have, actually, read it in full or in part. I've also received some error corrections (special thanks goes to Serge Kruk), several small reviews and letters of encouragement. Those were very valuable and I hope to see more :)

Greetings from the far away city of Lima, Peru!
I loved this part: "Only losers don’t comment their code, and comments will be used extensively"
Thank you so much for putting this comprehensive collection of highly important data structures, i'm already recommending this to two of my developers, which I hope i'll induce into my Lisp addiction.
--Flavio Egoavil

And here's another one:

Massively impressive book you've written! I've been a Lisp programmer for a long time and truly appreciate the work put in here. Making Lisp accessible for more people in relation to practical algorithms is very hard to do. But you truly made it. You'll definitely end up in the gallery of great and modern Lisp contributions like "Land of Lisp" and "Let Over Lambda". Totally agree with your path to focus on practical algorithmic thinking with Lisp and not messing it up with macros, oop and other advanced concepts.
--Lars Hård

Thanks guys, it's really appreciated!

If you feel the same or you've liked the book in some respect and have found it useful, please, continue to share news about it: that definitely helps attract more readers. And my main goal is to make it as widely read as possible...

2020-04-15

"Programming Algorithms" Book Freely Available

The book "Programming Algorithms (A comprehensive guide to writing efficient programs with examples in Lisp)" has been completed. It turned out to be more than 360 pages in a standard technical book format, with over 100k words (that's 2 NaNoWriMos :). More than 100 topics made it into the TOC — and, actually, the book covers even more. Phew, making it to this point was quite a challenge...

This book is, surely, not perfect. Hopefully, most of the mistakes in it were fixed with the help of many nice people who commented on the chapters as they were published on my blog.

Also, the book is terribly incomplete. Almost every chapter could be expanded by a factor of two or three with relevant details and concrete implementations of some of the general ideas that are presented, currently. But neither did I have the time to write those down, nor, what's much more important, would anyone have had the time to read them in their entirety. I believe I have put in enough concrete examples with executable code to illustrate all the important concepts in each part. This is a great advantage of using Lisp for the book: the code is clear and compact enough to serve both to explain the algorithms and to permit testing them for real, in the REPL. The main compromise each author has to make is between brevity and completeness. I hope that I made the right choices in this regard, but, for sure, there's much more to learn about every piece of technology mentioned. My hope is that the book lays a solid groundwork to facilitate further, deeper exploration.

There are also a couple of topics that I would like to cover but couldn't find a good place for. Probabilistic data structures are the most important of them. Yet, they are not big enough to justify a separate chapter and, also, don't fit into any of the existing chapters.

But enough with the whining :) In fact, I'm quite satisfied with the end result as my main goal was to sufficiently develop the following key themes:

  • The main one, obviously, was the description of all the important data structures and the associated algorithms.
  • The next, also very important, was the demonstration of the essential tools that help in the development, testing, and verification of the produced algorithmic code: tracing, profiling, pretty-printing, etc.
  • We have also discussed, when it was relevant, the real-world engineering considerations and constraints that influence the programs using our algorithms. And sometimes these constraints have more impact than the purely theoretical complexity calculations.
  • Finally, in each chapter, I tried to present the practical use case of the algorithms we have studied, showing the broad variety of such applications. In fact, it spans all the different corners of the software landscape we're used to. We have talked, albeit briefly, about such different domains as neural networks, plagiarism detection, web search, mapping, chess-playing, image compression, and many others.

There are a lot of books on algorithms, but I haven't seen any that primarily aims to bridge the gap between theory and practice. This is one of the key distinctions of "Programming Algorithms". It is definitely not the best exposition of the theoretical ideas, but I hope that, instead, it builds sufficient understanding and skill, for the common developer, to start writing efficient algorithmic programs.

I wanted to finish the book with the following statement: programming craft is primarily about making choices. What approach to prefer, which algorithm to choose, what tradeoffs to make. And, at the other level, what properties to give more priority: speed or safety, brevity or consistency, space or debuggability, clarity or conciseness, and so on and so forth. Lisp is one of the few languages that are "pro-choice". Its authors understood very well the importance of freedom to make the critical choices, and it is felt in the design of the language. For instance, with the help of declaim we can even signal our preferences to the compiler, to some extent, at the level of a single file or even an individual form. (declaim (optimize (speed 3) (safety 1) (debug 0) (compilation-speed 0))) will ask the compiler to produce the fastest possible code. Yes, this language will not guard you against poor choices like some others claim to do. Sometimes, you're not wise enough to make a correct choice, but, much more often, every choice just has its pros and cons, so someone will approve of it and someone won't. And that's what freedom is about: ownership and responsibility. So, use Lisp if you liked it. And if you prefer other languages, I'd urge you to still take advantage of the concept of freedom of choice in programming. Don't be constrained by the prevailing paradigms and try to use the best parts of all the different approaches you know...

Acknowledgments

Finally, the most pleasant part. I'm very thankful to those who helped me in the work on "Programming Algorithms" by providing support, advice, corrections, and suggestions. First of all, many thanks to my wife Ksenya who encouraged me to work on it despite the time for that being, in part, taken from my family duties. Also, I am very grateful to Dr. Robert Strandh who humbly volunteered his help as an editor to make it sound more native (as my English is far from perfect since I'm not a native speaker) and to point out the mistakes that I made. He and his wife have contributed lots of improvements to more than half of the chapters, and I tried to follow their advice in the subsequent ones. Thanks to Rainer Joswig for commenting on the Lisp choices. Thanks to @dzenicv on reddit who posted links to almost all of the chapters there, which triggered some helpful discussions. Thanks to @tom_mellior on Hacker News for pointing out a serious deficiency in the explanation of Union-Find. Thanks to all those people who shared the links to the chapters and contributed their comments and attention.

If you've found the book helpful and interesting, you can also support its past and (potential) further development in several ways. First and foremost, you can share it with your friends, colleagues, and social networks. The book was made free and will remain free, as its main premise, for me, was to spread the knowledge gathered inside. Yet, you can also make a donation at Leanpub if you believe that it has brought you some value. Last but not least, if you see some way the book can be improved, don't hesitate to contact me.

Finally, I wanted to solicit reviews: if you've read the book and liked it, please, write a paragraph or two to let others know your opinion!

2020-03-30

Programming Algorithms: Synchronization

This is the final chapter of the book, in which we will discuss optimization of parallel computations: whether running concurrently on a single machine in a shared-memory setting or in a distributed shared-nothing environment. This is a huge topic that spans synchronization itself, parallelization, concurrency, distributed computations, and the functional approach. And every senior software developer should be well-versed in it.

Usually, synchronization is studied in the context of system or distributed programming, but it has a significant algorithmic footprint and is also one of the hottest topics for new algorithm research. In fact, there are whole books that concentrate on it, but, usually, they attack the problem from other angles, not focusing on the algorithmic part. This chapter will be more algorithm-centered, although it will also present an overview of the problem space. So that, in the end, you'll have a good foundation to explore the topic further if the desire or need for that appears.

Let's start from the basics. In the previous chapters of the book we, mainly, viewed algorithms as single computations running without external interruptions. This approach, obviously, removes unnecessary complexity, but it also isn't totally faithful to reality. Most of the programs we deal with, now, run in multiprocessing environments (sometimes, even distributed ones), and even when they don't utilize these capabilities, those are still available, they sometimes have their impact, and, besides, might have improved the performance of the programs had they been utilized. The majority of the backend stuff, which, currently, is comprised of services running in the datacenters, is multithreaded. There's a notorious "Zawinski's Law" that states that every program attempts to expand until it can read mail, and those programs which cannot so expand are replaced by ones which can. Besides being a good joke, it also reflects an important truth about the tendency of all programs over time to become network-aware, and thus distributed to at least some extent.

There are two principally different types of environments in which the programs that need synchronization run: shared-memory and shared-nothing ones.

In a shared-memory setting, there exists some shared storage (not necessarily RAM) that can be directly accessed by all the threads[1] of the application. Concurrent access to data in this shared memory is the principal source of the synchronization challenges, although not the only one. An example of a shared-memory program is a normal application that uses multithreading provided either directly by the OS or, more frequently, by the language runtime[2].

The opposite of shared-memory is a shared-nothing environment, in which the threads[3] don't have any common data storage and can coordinate only by sending messages directly to other processes. The contents of the messages have to be copied from the memory of the sender to that of the receiver. In this setting, some of the synchronization problems disappear, but others still remain. At the fundamental level, some synchronization or coordination still needs to happen. From a performance standpoint, however, the shared-nothing mode is, usually, inferior due to the need for additional data copying. So, both paradigms have their place, and the choice of which one to utilize depends on the context of a particular task.

The main goal of synchronization is to ensure program correctness when multiple computations are running in parallel. The other side of the coin is achieving optimal performance, which is also addressed by parallelization that we have somewhat discussed in a couple of prior chapters. Prioritizing performance over correctness, although tempting, is one of the primary sources of bugs in concurrent systems. A trivial example would be building a shared-memory program without explicit use of any synchronization mechanisms. It is, definitely, the most performant approach, but non-coordinated access to the shared data will inevitably result in failures like data corruption.

Synchronization Troubles

So, let's talk in more detail about the most common synchronization problems that the methods we will discuss next are trying to handle. Such situations are called race conditions: multiple threads compete for the same resource — be it data storage or the processor — and, in the absence of special coordination, the order of execution is unspecified, which may result in unpredictable and unintended outcomes. There are two main results of this unpredictability (often, both occur simultaneously):

  • data corruption or loss
  • incorrect order of execution up to total failure of the program

Here is the simplest code segment that is amenable to data corruption in multithreaded programs:

(incf i)

It seems like there's just one operation involved — how can there be a race condition? From the point of view of a programmer in a high-level language, indeed, we deal with a single operation, but if we go down to the level of machine code we'll see that it is not the case. The relevant assembly snippet will, probably, contain three instructions:

mov i, register
inc register
mov register, i

You've just seen one more piece of convincing evidence for why every programmer should understand how the lower levels of their platform operate. :)

The issue is that modern processors can't directly modify data in RAM (our variable i). First, the data needs to be moved into a register, only then can some operation on it be performed by the CPU, and, finally, it needs to be put back where the high-level program can find it. If an interrupt occurs (we're talking about multithreaded execution in a single address space, in this context) after mov i, register, the current thread will remember the old value of i (let it be 42) and be put into a waitqueue. If another thread that wants to change i is given processor time next, it may set it to whatever value it wants and continue execution (suppose it will be 0). However, when control returns to the first thread, it will increment the value it remembered (42), so i will change its value in the following sequence: 42 - 0 - 43. Hardly the expected behavior.

Such data corruption will only impact the mentioned variable and may not cause catastrophic failures in the program. Its behavior will be incorrect, but in some situations that can be tolerated (for instance, if we gather some statistics, and occasional off-by-one errors will not be noticed). Yet, if i was some counter that impacts the core behavior of the program, it might easily lead to a catastrophe.

Ultimately, incorrect execution order should be considered the root cause of all synchronization problems. And here it is also manifest: we expected the increment to be a single (atomic) operation and thus to finish execution before anything else happens to i.

What are some other common cases of execution order errors? The most well-known and dreaded race condition is a deadlock. It is a situation of mutual blocking among two or more threads. Here is the simplest illustration of how it can occur:

thread 1 ---> acquire resource1 --> try to acquire resource2
thread 2 --> acquire resource2 ------> try to acquire resource1

In other words, two threads need exclusive access to two resources, but the order of access is opposite and the timing of the operations is such that the first thread manages to acquire access to the first resource while the second — to the second. After that, the deadlock is inevitable and both threads will be blocked as soon as they try to access the other resource. The period between a thread acquiring a resource for exclusive access and the release of this resource is called a critical section of the program. Only in the critical sections may a synchronization issue manifest.

The only way to untangle such a deadlock situation "from within" is for one of the threads to release the resource it already holds. Another approach, which requires external intervention, is often employed in database management systems — deadlock monitoring. A separate thread periodically examines blocked threads to check for some conditions that signify a deadlock situation, and it resets the threads that were spotted in such a condition. Yet, instead of trying to fix deadlock situations, it may be better to prevent them from occurring altogether. The prevention techniques may utilize time-limited exclusive leases on resources or mandate that the threads acquire resources in a specific order. However, such approaches are limited and don't cover all the use cases. It would be nice to find some way to totally exclude deadlocks, but we should remember that the original reason why they may occur, at all, is the need to prevent data corruption in the case of uncontrolled access to the data. Exclusive access to the resource ensures that this problem will not occur, but results in the possibility of a deadlock, which is a comparatively lesser evil.

A livelock is a dynamic counterpart to deadlock, which occurs much more rarely. It is a situation when threads don't constantly hold the resources exclusively (for instance, they might release them after a timeout), but the timing of the operations is such that at the time when the resource is needed by one thread it happens to be occupied by the other, and, ultimately, mutual blocking still occurs.

One more obnoxious race condition is priority inversion — a phenomenon one can frequently observe in real life: when a secondary lane of cars merges into the main road, but, for some extraneous reason (a traffic light malfunctioning, an accident that is blocking part of the road, etc.), the cars from it get more time to merge than the ones on the main road get to progress. Priority inversion may be the reason for a more severe problem, which is starvation — a situation when the execution of a thread is stalled as it can't access the resource it needs. Deadlocks result in starvation of all the involved threads, but the issue may occur in other conditions, as well. I would say that starvation or, more generally, underutilization is the most common performance issue of multithreaded applications.

Low-Level Synchronization

I hope, in the previous section, the importance of ensuring proper execution order in the critical sections of the program was demonstrated well enough. How to approach this task? There are many angles of attack. Partially, the problem may be solved by the introduction of atomic operations. Atomic increment/decrement are a common example of those, which may be found in the ecosystems of the majority of programming languages. For instance, SBCL provides the sb-ext:atomic-incf macro that operates on the fixnum slots of structures, array cells, contents of cons pairs or global variables. Some other languages, like Java, provide AtomicInteger and similar classes that guarantee atomic operations on their main slot.
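
For illustration, here is a hedged SBCL-specific sketch of such an atomic counter (it assumes a recent SBCL where atomic-incf accepts a structure slot declared of type sb-ext:word; the exact set of supported places varies between versions):

(defstruct hits
  (count 0 :type sb-ext:word))

(defparameter *hits* (make-hits))

(defun bump-hits ()
  ;; increments the slot atomically; returns the slot's previous value
  (sb-ext:atomic-incf (hits-count *hits*)))

;; several threads bumping the counter concurrently still produce the exact total
(let ((threads (loop :repeat 4
                     :collect (sb-thread:make-thread
                               (lambda () (loop :repeat 100000 :do (bump-hits)))))))
  (mapc 'sb-thread:join-thread threads)
  (hits-count *hits*))  ; => 400000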

What enables atomic operations are special hardware instructions:

  • TSL — test-and-set lock
  • CAS — compare-and-swap
  • LL/SC — load-link/store-conditional

The most widespread of them is CAS, which has the same effect as if the following code worked as a single atomic operation:

(defmacro cas (place old new)
  `(when (eql ,place ,old)
     (:= ,place ,new)))

Based on this spec, we could define atomic-incf using cas:

(defmacro atomic-incf (place &optional (i 1))
  (let ((cur (gensym "CUR"))
        (rez (gensym "REZ")))
    `(loop :for ,rez := (let ((,cur ,place))
                          (cas ,place ,cur (+ ,cur ,i)))
           :when ,rez :do (return ,rez))))

Here, we read the current value of place and then try to set it with cas. These two operations happen non-atomically, so there's a chance that cas will return nil. In that case, we redo the whole sequence again. It is clear that the execution time of such an operation is non-deterministic, but, in a reasonably configured multithreaded system, there should be, generally, just a single chance for cas to fail: when the thread is preempted between the assignment and the cas. It shouldn't fail the second time the thread gets its time slice, for it will start from these two operations and should have enough time to complete them.

Another important low-level instruction is a memory barrier. It causes the CPU to enforce an ordering constraint on memory operations issued before and after the barrier instruction. I.e. the operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier. Memory barriers are necessary because most modern CPUs employ performance optimizations that can result in out-of-order execution. The reordering of memory loads and stores goes unnoticed within a single thread of execution but can cause unpredictable behavior in concurrent programs. One more leak from the low level adding to the list of synchronization worries...

On top of CAS and atomic operations, some higher-level synchronization primitives are provided by the OS and the execution runtimes of most programming languages. The most popular of them is the semaphore. It is a counter that is initially set to the number of threads that may proceed past querying its value. If the counter is above zero, the thread may continue execution, but it also atomically decrements the counter. This operation is usually called wait, acquire, or down on the semaphore. However, if the counter is already down to zero, the thread goes to sleep and is put into an OS waitqueue until a wakeup notification arrives. The notification is initiated by some thread calling release/up on the same semaphore. This operation atomically increments the counter value and also allows some of the waiting threads to continue execution. The most used type of semaphore is called the mutex; it allows only a single thread to enter and also mandates that the implementation check that the thread releasing the mutex is the one that has previously acquired it. There are also other types of semaphores or more complex locks built on top of them, such as the read-write lock or a monitor.
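
As a small illustration, here is the unsafe increment from the beginning of the chapter guarded by a mutex (a sketch using the portable bordeaux-threads library, nicknamed bt, rather than any implementation-specific API):

(defparameter *counter-lock* (bt:make-lock "counter"))
(defparameter *i* 0)

(defun safe-incf ()
  ;; only one thread at a time may execute the body of the critical section
  (bt:with-lock-held (*counter-lock*)
    (incf *i*)))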

Semaphores are an alternative to a lower-level spin-lock primitive that uses busy waiting, i.e. constant checking of the counter variable until it rises above zero. Another, more general, name for this method is polling, which refers to constantly querying the state of some resource (a lock, a network socket, a file descriptor) to know when its state changes. Polling has both drawbacks and advantages: it occupies the thread instead of yielding the CPU to other workers, which is a serious downside, but it also avoids the expensive OS context switching required by semaphores.
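
For comparison, here is a minimal spin-lock sketch built directly on SBCL's compare-and-swap (a real implementation would also have to care about memory barriers, interrupts, and fairness, which are ignored here):

;; the lock is a cons whose car is nil when free and t when taken
(defun make-spin-lock ()
  (cons nil nil))

(defun spin-acquire (lock)
  ;; busy-wait until we are the one who flips the car from nil to t;
  ;; compare-and-swap returns the old value, so nil means we succeeded
  (loop :until (null (sb-ext:compare-and-swap (car lock) nil t))))

(defun spin-release (lock)
  (setf (car lock) nil))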

So, both semaphores and spin-locks find their place. In the low-level OS code, spin-locks prevail, while semaphores are a default synchronization primitive in the user-space.

Mutual Exclusion Algorithms

Relying on hardware features for synchronization is a common approach taken by most software systems. However, since the beginning of work on this problem, computer scientists, including such famous algorithmists as Dijkstra and Lamport, have proposed mutual exclusion algorithms that allow guarding the critical sections without any special support from the platform. One of the simplest of them is Peterson's algorithm. It guarantees mutual exclusion of two threads with the use of two variables: a two-element array interest and a turn variable holding the index of one of the threads. A true value of the interest item corresponding to a thread indicates that it wants to enter the critical section. Entrance is granted if the second thread either does not want the same or has yielded priority to the first thread.

(defparameter *interest* (vec nil nil))
(defparameter *turn* nil)

(defun peterson-call (i fn)
  (let ((other (abs (1- i))))
    (:= (? *interest* i) t
        *turn* other)
    ;; busy waiting
    (loop :while (and (? *interest* other) (= *turn* other)))
    ;; critical section
    (call fn)
    (:= (? *interest* i) nil)))
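
Here is a hypothetical usage sketch with two threads incrementing a shared counter under the protection of peterson-call (using the bordeaux-threads library; note that on modern out-of-order hardware the algorithm additionally needs memory barriers to be reliable, which this sketch glosses over):

(defparameter *shared* 0)

(defun worker (i)
  (loop :repeat 1000
        :do (peterson-call i (lambda () (incf *shared*)))))

(let ((thread0 (bt:make-thread (lambda () (worker 0))))
      (thread1 (bt:make-thread (lambda () (worker 1)))))
  (bt:join-thread thread0)
  (bt:join-thread thread1)
  *shared*)  ; => 2000 if mutual exclusion holds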

The algorithm satisfies the three essential criteria to solve the critical section problem: mutual exclusion, progress, and bounded waiting. Mutual exclusion means that several competing threads can never be in the critical section at the same time. For Peterson's algorithm, if thread 0 is in its critical section, then (? *interest* 0) is true. In addition, either (? *interest* 1) is nil (meaning thread 1 has left its critical section and isn't interested in coming back into it), or *turn* is 0 (meaning that thread 1 is just now trying to enter the critical section but waiting), or thread 1 is trying to enter its critical section, after setting (? *interest* 1) to true but before setting *turn* to 0. So if both threads are in the critical section, then we conclude that the state must satisfy (and (? *interest* 0) (? *interest* 1) (= *turn* 0) (= *turn* 1)), which is, obviously, impossible. I.e. only one of the threads could have entered the section. The condition of progress, basically, says that only those threads that wish to enter the critical section can participate in making the decision as to which one will do it next, and that this selection cannot be postponed indefinitely. In our case, a thread cannot immediately reenter the critical section if the other thread has set its interest flag. Thus the thread that has just left the critical section will not impact the progress of the waiting thread. Bounded waiting means that the number of times a thread is bypassed by another thread after it has indicated its desire to enter the critical section is bounded by a function of the number of threads in the system. In Peterson's algorithm, a thread will never wait longer than one turn for entrance to the critical section.

The drawback of Peterson's algorithm is busy waiting[4]. So, it may be compared to a spin-lock. There are a number of other similar algorithms, including Dekker's and Lamport's, which also share this property. The newer Szymański's algorithm is designed to avoid busy waiting, but it requires access to the OS scheduling facilities to make the thread sleep while waiting for the wakeup call, making the algorithm similar to semaphores.

High-Level Synchronization

All the mentioned synchronization primitives don't solve the challenges of synchronization completely. Rather, they provide tools that enable reasonable solutions but still require advanced understanding and careful application. The complexity of multithreaded programs is a level up compared to their single-threaded counterparts, and thus a lot of effort continues to be spent on trying to come up with high-level ways to contain it, i.e. to remove it from the sight of a regular programmer by providing primitives that handle synchronization behind the scenes. A simple example of that is Java synchronized classes that employ an internal monitor to ensure atomic access to the slots of a synchronized object. The major problem with regular locks (like semaphores) is that working with them brings us into the realm of global state manipulation. Such locking can't be isolated within the boundaries of a single function — it leaks through the whole caller chain, and this makes the program much harder to reason about. In this regard, it is somewhat similar to the use of goto, albeit on a larger scale, and so the push for higher-level synchronization facilities resembles Dijkstra's famous appeal to introduce structured programming ("goto considered harmful"). Ironically, Dijkstra is one of the creators of the classic synchronization mechanisms that are now frowned upon. However, synchronization has intrinsic complexity that can't be fully contained, so no silver bullet exists (and hardly will ever be created) and every high-level solution will be effective only in a subset of cases. I have seen that very well on my own when teaching a course on system programming and witnessing how students solve the so-called classic synchronization problems. The task was to apply both classic synchronization techniques (semaphores et al.) and the new high-level ones (using Erlang, Haskell, Clojure or Go, which all provide some of those). The outcome, in terms of complexity, was not always in favor of the new approaches.

There are a number of these classic synchronization problems, and I was even collecting them to be able to provide more variants of the tasks to diminish cheating. :) But, in essence, they all boil down to just a few archetypical cases: producer-consumer, readers-writers, sleeping barber, and the dining philosophers. Each problem demonstrates a certain basic synchronization scenario and allows the researchers to see how their approach will handle it. I won't include them in the book but strongly encourage anyone interested in this topic to study them in more detail and also to try solving them using different synchronization mechanisms.

Now, let's talk about some of the prominent high-level approaches. Remember that they try to change the paradigm and avoid the need for explicit locking of critical sections altogether.

Lock-free Data Structures

My favorite among them is lock-free data structures. This is a simple and effective idea that can help deal with many common use cases and, indeed, avoid the necessity for explicit synchronization. Still, their use is limited and, obviously, can't cover all the possible scenarios.

The most important among them is arguably a lock-free queue. It can be implemented in different ways, and there's a simple and efficient implementation using cas provided by SBCL in the SB-CONCURRENCY contrib package. Here is the implementation of the main operations (taken from the SBCL source code and slightly simplified):

(defstruct lf-queue
  (head (error "No HEAD.") :type cons)
  (tail (error "No TAIL.") :type cons))

(defun lf-enqueue (value queue)
  (let ((new (cons value nil)))
    (loop (when (eq nil (sb-ext:compare-and-swap (cdr (lf-queue-tail queue))
                                                 nil new))
            (:= (lf-queue-tail queue) new)
            (return value)))))

(defun lf-dequeue (queue)
  (loop (with ((head (lf-queue-head queue))
               (next (cdr head)))
          (typecase next
            (null (return (values nil
                                  nil)))
            (cons (when (eq head (sb-ext:compare-and-swap (lf-queue-head queue)
                                                          head next))
                    (let ((value (car next)))
                      (setf (cdr head) +dead-end+
                            (car next) +dummy+)
                      (return (values value
                                      t)))))))))

The value of this structure lies in the fact that it enables the implementation of the master-worker pattern, which is a backbone of many backend applications, as well as, in general, different forms of lock-free and wait-free coordination between the running threads. Basically, it's a lock-free solution to the producer-consumer problem. The items are put in the queue by some producer threads (masters) and consumed by the worker threads. Such an architecture allows the programmer to separate concerns between different layers of the application: for instance, one type of threads may be responsible for handling incoming connections and, in order to ensure high system availability, these threads shouldn't spend much time processing them. So, after some basic processing, the connection sockets are put into the queue, from which the heavy-lifting worker threads can consume them and process them in a more elaborate fashion. I.e. it's a job queue for a thread pool. Surely, a lock-based queue may also be utilized as an alternative in these scenarios, but the necessity to lock from the caller's side makes the code for all the involved threads more complicated: what if a thread that has just acquired the lock is abruptly terminated for some reason?
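
To make the pattern concrete, here is a hedged sketch of such a master-worker setup that uses the stock SB-CONCURRENCY API (make-queue/enqueue/dequeue) instead of the simplified code above; process-connection is a hypothetical stand-in for the workers' heavy lifting:

(defparameter *jobs* (sb-concurrency:make-queue))

(defun worker-loop ()
  (loop
    (multiple-value-bind (job found) (sb-concurrency:dequeue *jobs*)
      (if found
          (process-connection job)  ; hypothetical heavy-lifting function
          (sleep 0.01)))))          ; queue empty: back off briefly

;; the master thread only does the cheap part and hands the job off
(defun accept-connection (socket)
  (sb-concurrency:enqueue socket *jobs*))

;; start a small pool of workers
(defparameter *workers*
  (loop :repeat 4 :collect (sb-thread:make-thread 'worker-loop)))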

Data-Parallelism and Message Passing

Beyond thread pools, there's the whole concept of data parallelism, which, in essence, lies in submitting different computations to the pool and implementing synchronization as an orchestration of those tasks. In addition, node.js and Go use non-blocking IO in conjunction with such thread pools (and a special syntax for its seamless integration) for an efficient implementation of user-space green threads to support this paradigm.

Even further along this direction is Erlang, which is a whole language built around non-blocking IO, efficient user-space threading, and a shared-nothing memory model. It is the language of message-passing concurrency that aims to solve all synchronization problems within this single approach. As discussed in the beginning, such a stance has its advantages and drawbacks, and so Erlang fits some problems (like coordination between a large number of simple agents) exceptionally well, while for others it imposes unaffordable costs in terms of both performance and complexity.

I won't go deeper into this topic as it is not directly related to the matter of this book.

STM

Another take on concurrency is a technology that has been used, for quite a long time, in database systems, and was reimplemented in several languages, being popularized by the author of Clojure — Software Transactional Memory (STM). The idea is to treat all data accesses in memory as parts of transactions — computations that possess the ACID properties: atomicity, consistency, and isolation (minus durability, which is only relevant to the database systems persisting data on disk). These transactions should still be initiated by the programmer, so the control over synchronization remains, to a large extent, in their hands, with some portion of the associated complexity. The transactions may be implemented in different ways, but they will still use locking behind the scenes, and there are two main approaches to applying locking:

  • pessimistic — when the locks are acquired for the whole duration of the transaction, basically, making it analogous to a very conservative programming style that avoids the deadlocks but seriously hinders program performance: acquiring all locks at once and then entering the critical section; in the context of STM, each separate variable will have its own lock
  • optimistic — when the initial state of the transaction variables is remembered in the thread-local storage, and locking occurs only at the last (commit) phase, when all the changes are applied — but only when there were no external changes to the transaction variables; if at least one of them were changed, the whole transaction would need to be rolled back and retried

In both cases, the main issue is the same: contention. If the number of threads competing for the locks is small, an optimistic approach should perform better, while, in the opposite case, there will be too many rollbacks and even a possibility of a livelock.

The optimistic transactions are, usually, implemented using the Multiversion Concurrency Control (MVCC) mechanism. MVCC ensures a transaction never has to wait to read an object by maintaining several versions of this object. Each version has both a Read Timestamp and a Write Timestamp, which lets a particular transaction read the most recent version of the object that precedes the transaction's own Read Timestamp.
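
Here is a hypothetical sketch of the read side of MVCC: each object keeps a list of versions, and a transaction reads the newest version committed before its own Read Timestamp (the per-version Read Timestamp, used for detecting write conflicts, is not shown):

(defstruct version
  value
  write-ts)

(defun mvcc-read (versions tx-read-ts)
  (let (best)
    (dolist (v versions (and best (version-value best)))
      (when (and (<= (version-write-ts v) tx-read-ts)
                 (or (null best)
                     (> (version-write-ts v) (version-write-ts best))))
        (setf best v)))))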

STM is an interesting technology, which hasn't yet proven its case beyond the distinct area of data management systems, such as RDBMSs and their analogs.

Distributed Computations

So far, we have discussed synchronization, mainly, in the context of software running in a single address space on a single machine. Yet, the same issues, although magnified, are also relevant to distributed systems. Actually, the same models of computation are relevant: shared-memory and shared-nothing message passing. Although, for distributed computing, message passing becomes much more natural, while the significance of shared-memory is seriously diminished and the "memory" itself becomes some kind of a network storage system like a database or a network file system.

However, more challenges are imposed by the introduction of the unreliable network as a communication environment between the parts of a system. These challenges are reflected in the so-called "fallacies of distributed computing":

  • the network is reliable
  • latency is zero
  • bandwidth is infinite
  • the network is secure
  • topology doesn't change
  • there is a single administrator
  • transport cost is zero
  • the network is homogeneous
  • clocks on all nodes are synchronized

Another way to summarize those challenges, which is the currently prevailing way to look at them, is the famous Brewer's CAP Theorem, which states that any distributed system may have only two of the three desired properties at once: consistency, availability, and partition tolerance. And since partition tolerance is a required property of any network system as it's the ability to function in an unreliable network environment (which is the norm), the only possible distributed systems are CP and AP, i.e. they either guarantee consistency but might be unavailable at times or are constantly available but might sometimes be inconsistent.

Distributed Algorithms

Distributed computation requires distributed data structures and distributed algorithms. The domains that are in active development are distributed consensus, efficient distribution of computation, efficient change propagation. Google pioneered the area of efficient network computation with the MapReduce framework that originated from the ideas of functional programming and Lisp, in particular. The next-generation systems such as Apache Spark develop these ideas even further.

Yet, the primary challenge for distributed systems is efficient consensus. The addition of the unreliable network makes the problem nontrivial compared to a single-machine variant where the consensus may be achieved easily in a shared-memory setting. The world has seen an evolution of distributed consensus algorithms implemented in different data management systems, from the 2-Phase Commit (2PC) to the currently popular RAFT protocol.

2PC is an algorithm for coordination of all the processes that participate in a distributed atomic transaction on whether to commit or rollback the transaction. The protocol achieves its goal even in many cases of temporary system failure. However, it is not resilient to all possible failure configurations, and in rare cases, manual intervention is needed. To accommodate recovery from failure, the participants of the transaction use logging of states, which may be implemented in different ways. Though usually intended to be used infrequently, recovery procedures compose a substantial portion of the protocol, due to many possible failure scenarios to be considered.

In a "normal execution" of any single distributed transaction, the 2PC consists of two phases (a minimal coordinator sketch follows the list):

  1. The "commit-request" or voting phase, in which a coordinator process attempts to prepare all the participating processes to take the necessary steps for either committing or aborting the transaction and to vote, either "Yes" (commit) or "No" (abort).
  2. The "commit" phase, in which, based on the voting of the participants, the coordinator decides whether to commit (only if all have voted "Yes") or rollback the transaction and notifies the result to all the participants. The participants then follow with the needed actions (commit or rollback) with their local transactional resources.
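
A minimal sketch of the coordinator's side of the protocol might look as follows (vote, commit, and rollback are hypothetical generic functions dispatched on the participant type; the logging and failure-recovery machinery that makes up most of a real implementation is omitted):

(defun two-phase-commit (participants transaction)
  ;; phase 1: voting, every participant must answer :yes
  (if (every (lambda (p) (eql :yes (vote p transaction)))
             participants)
      ;; phase 2: commit on unanimous agreement...
      (dolist (p participants :committed)
        (commit p transaction))
      ;; ...otherwise roll everyone back
      (dolist (p participants :aborted)
        (rollback p transaction))))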

It is clear, from the description, that 2PC is a centralized algorithm that depends on the authority and high availability of the coordinator process. Centralized and peer-to-peer are the two opposite modes of the network algorithms, and each algorithm is distinguished by its level of centralization.

The 3PC is a refinement of the 2PC which is supposed to be more resilient to failures by introducing an intermediate stage called "prepared to commit". However, it doesn't solve the fundamental challenges of the approach that are due to its centralized nature, only making the procedure more complex to implement and thus having more failure modes.

The modern peer-to-peer coordination algorithm alternatives are Paxos and RAFT. RAFT is considered to be a simpler (and, thus, more reliable) approach. It is also, not surprisingly, based on voting. It adds a preliminary phase to each transaction, which is leader election. The election, as well as other activities within a transaction, don't require unanimous agreement, but a simple majority. Besides, execution of all the stages on each machine is timeout-based, so if a network failure or a node failure occurs, the operations are aborted and retried with an updated view of the other peers. The details of the algorithm can be best understood from the RAFT website, which provides a link to the main paper, good visualizations and other references.

Distributed Data Structures

We have already mentioned various distributed hash-tables and content-addressable storage as examples of these types of structures. Another exciting and rapidly developing direction is eventually-consistent data structures or CRDTs (Conflict-free Replicated Data Types). They are the small-scale representatives of the AP (or eventually-consistent) systems that favor high availability over constant consistency, as such systems become more and more the preferred mode of operation of distributed systems.

The issue that CRDTs address is conflict resolution when different versions of the structure appear due to network partitions and their eventual repair. For a general data structure, if there are two conflicting versions, the solution is either to choose one (according to some general rules, like take the random one or the latest one, or application-specific logic) or to keep both versions and defer conflict resolution to the client code. CRDTs are conflict-free, i.e. the structures are devised so that any conflict is resolved automatically in a way that doesn't bring any data loss or corruption.

There are two ways to implement CRDTs: convergent structures rely on the replication of the whole state, while commutative ones use operation-based replication. Yet, both strategies result in CRDTs with equivalent properties.

The simplest CRDT is a G-Counter (where "G" stands for grow only). Its operation is based on the trivial fact that addition is commutative, i.e. the order of applying the addition operation doesn't matter: we'll get the same result as long as the number of operations is the same. Every convergent CRDT has a merge operation that combines the states of each node. On each node, the G-Counter stores an array that holds the per-node numbers of the local increments. And its merge operation takes the maximums of the elements of this array across all nodes while obtaining the value of the counter requires summing all of the cells:

(defstruct (g-counter (:conc-name nil))
  ccs)

(defun make-gcc (n)
  (make-g-counter :ccs (make-array n :initial-element 0)))

(defun gcc-val (gcc)
  (reduce '+ (ccs gcc)))

(defun gcc-merge (gcc1 gcc2)
  (make-g-counter :ccs (map 'vector 'max (ccs gcc1) (ccs gcc2))))
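
For completeness, here is a hedged sketch of the local increment operation and of how two replicas converge after a merge (gcc-incf is my own name for it, not something from the text above):

(defun gcc-incf (gcc node-id &optional (delta 1))
  ;; each node only ever increments its own cell
  (incf (aref (ccs gcc) node-id) delta))

;; two replicas of a 3-node counter diverge and then converge on merge
(let ((a (make-gcc 3))
      (b (make-gcc 3)))
  (gcc-incf a 0)      ; node 0 counts 1 on its replica
  (gcc-incf b 1 2)    ; node 1 counts 2 on its replica
  (gcc-val (gcc-merge a b)))  ; => 3, regardless of the merge order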

The structure is eventually consistent as, at any point in time, asking any live node, we can get the current value of the counter from it (so there's constant availability). However, if not all changes have already been replicated to this node, the value may be smaller than the actual one (so consistency is only eventual once all the replications are over).

The next step is a PN-Counter (positive-negative). It uses a common strategy in CRDT creation: combining several simpler CRDTs. In this case, it is a combination of two G-Counters: one for the number of increments, and another for the decrements.
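
A hedged sketch of such a combination, reusing the G-Counter operations from above (the pnc- names are mine):

(defstruct pn-counter
  incs
  decs)

(defun make-pnc (n)
  (make-pn-counter :incs (make-gcc n) :decs (make-gcc n)))

(defun pnc-val (pnc)
  (- (gcc-val (pn-counter-incs pnc))
     (gcc-val (pn-counter-decs pnc))))

(defun pnc-merge (pnc1 pnc2)
  (make-pn-counter :incs (gcc-merge (pn-counter-incs pnc1) (pn-counter-incs pnc2))
                   :decs (gcc-merge (pn-counter-decs pnc1) (pn-counter-decs pnc2))))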

A set is, in some sense, a more sophisticated analog of a counter (a counter may be considered a set of 1s). So, a G-Set functions similarly to a G-Counter: it allows each node to add items to the set, which are stored in the relevant cell of the main array. The merging and value retrieval operations use union. Similarly, there's the 2P-Set (2-phase) that is similar in construction to the PN-Counter. The difference of a 2P-Set from a normal set is that once an element is put into the removal G-Set (called the "tombstone" set) it cannot be re-added to the set. I.e. addition may be undone, but deletion is permanent. This misfeature is amended by the LWW-Set (last-write-wins) that adds timestamps to all the records. Thus, an item with a more recent timestamp prevails, i.e. if an object is present in both underlying G-Sets it is considered present in the set if its timestamp in the addition set is greater than the one in the removal set, and removed — in the opposite case.
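
To illustrate the LWW rule, here is a minimal sketch of the membership test, assuming (this representation is my own) that each underlying G-Set is a hash-table mapping an item to the timestamp of its latest addition or removal:

(defun lww-member-p (item additions removals)
  ;; present if it was added and the addition is newer than any removal
  (let ((add-ts (gethash item additions))
        (rem-ts (gethash item removals)))
    (and add-ts
         (or (null rem-ts)
             (> add-ts rem-ts)))))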

There are also more complex CRDTs used to model sequences, including Treedoc, RGA, Woot, Logoot, and LSEQ. Their implementations differ, but the general idea is that each character (or chunk of characters) is assigned a key that can be ordered. When new text is added, it's given a key that is derived from the key of some adjacent text. As a result, the merge is the best-possible approximation of the intent of the edits.

The use cases for CRDTs are, as mentioned above, collaborative editing, maintaining such structures as shopping carts (e.g. with an LWW-Set), counters of page visits to a site or reactions in a social network, and so on and so forth.

Distributed Algorithms in Action: Collaborative Editing

In fact, CRDTs are a data-structure-centric answer to another technology that has been used, for quite some time, to support collaborative editing: Operational Transformation (OT). OT was employed in such products as Google Docs and its predecessors to implement lock-free simultaneous rich-text editing of the same document by many actors.

OT is an umbrella term that covers a whole family of algorithms sharing the same basic principles. Such systems use replicated document storage, i.e. each node in the system operates on its own copy in a non-blocking manner as if it was a single-user scenario. The changes from every node are constantly propagated to the rest of the nodes. When a node receives a batch of changes it transforms the changes before executing them to account for the local changes that were already made since the previous changeset. Thus the name "operational transformation".

The basic idea of OT can be illustrated with the following example. Let's say we have a text document with a string "bar" replicated by two nodes and two concurrent operations:

(insert 0 "f") # o1 on node1
(delete 2 "r") # o2 on node2

Suppose, on node 1, the operations are executed in the order o1, o2. After executing o1, the document changes to "fbar". Now, before executing o2 we must transform it against o1 according to the transformation rules. As a result, it will change to (delete 3 "r"). So, the basic idea of OT is to adjust (transform) the parameters of incoming editing operations to account for the effects of the previously executed concurrent operations locally (whether they were invoked locally or received from some other node) so that the transformed operation can achieve the correct effect and maintain document consistency. The word "concurrent" here means operations that happened since some state that was recorded on the node that has sent the new batch of changes. The transformation rules are operation-specific.
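
Here is a minimal sketch of a transformation rule covering only character-level insert/delete pairs like the ones in the example (ties and overlapping edits, which real OT systems must handle carefully, are glossed over):

(defstruct op
  kind  ; either :insert or :delete
  pos   ; offset in the text
  text) ; the inserted or deleted string

(defun transform (o2 o1)
  "Adjust O2 so that it can be applied after O1 has already been executed."
  (if (< (op-pos o2) (op-pos o1))
      o2  ; o2 acts strictly before o1's position: nothing to adjust
      (let ((shift (ecase (op-kind o1)
                     (:insert (length (op-text o1)))
                     (:delete (- (length (op-text o1)))))))
        (make-op :kind (op-kind o2)
                 :pos (+ (op-pos o2) shift)
                 :text (op-text o2)))))

;; the example above: (delete 2 "r") transformed against (insert 0 "f")
(transform (make-op :kind :delete :pos 2 :text "r")
           (make-op :kind :insert :pos 0 :text "f"))
;; => an op with :kind :delete and :pos 3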

In theory, OT seems quite simple, but it has its share of implementation nuances and issues:

  • While the classic OT approach of defining operations through their offsets in the text seems to be simple and natural, real-world distributed systems raise serious issues: namely, that operations propagate with finite speed (remember one of the network fallacies), states of participants are often different, thus the resulting combinations of states and operations are extremely hard to foresee and understand.
  • For OT to work, every single change to the data needs to be captured: "Obtaining a snapshot of the state is usually trivial, but capturing edits is a different matter altogether. The richness of modern user interfaces can make this problematic, especially within a browser-based environment."
  • The notion of a "point in time" relative to which the operations should be transformed is nontrivial to implement correctly (another network fallacy in play). Relying on global time synchronization is one of the approaches, but it requires tight control over the whole environment (which Google has demonstrated to be possible for its datacenters). So, in most cases, a distributed solution instead of simple timestamps is needed.

THe most popular of these solutions is a vector clock. The VC of a distributed system of n nodes is a vector of n logical clocks, one clock per process; a local "smallest possible values" copy of the global clock-array is kept in each process, with the following rules for clock updates:

  • Initially all clocks are zero.
  • Each time a process experiences an internal event, it increments its own logical clock in the vector by one.
  • Each time a process sends a message, it increments its own logical clock in the vector by one (as in the bullet above, but not twice for the same event) and then sends a copy of its own vector.
  • Each time a process receives a message, it increments its own logical clock in the vector by one and updates each element in its vector by taking the maximum of the value in its own vector clock and the value in the vector in the received message (for every element).

You might notice that the operation of vector clocks is similar to the CRDT G-counter.

VCs allow the partial causal ordering of events. A vector clock value for the event x is less than the value for y if and only if for all indices the items of the x's clock are less or equal, and, at least for one element, they are stricktly smaller.

Besides vector clocks, the other mechanisms to implement distributed partial ordering include Lamport Timestamps, Plausible Clocks, Interval Tree Clocks, Bloom Clocks, and others.

Persistent Data Structures

To conclude this chapter, I wanted to say a few words about the role of the functional paradigm in synchronization and distributed computing. That's no coincidence that it was mentioned several times in the description of different synchronization strategies: essentially, functional programming is about achieving good separation of concerns by splitting computations into independent referentially-transparent units that are easier to reason about. Such an approach supports concurrency more natively than the standard imperative paradigm, although it might not be optimal computationally (at least, in the small). Yet, the gains obtained from parallelism and utilizing the scale of distributed computing may greatly outweigh this low-level inefficiency. So, with the advent of concurrent and distributed paradigms, functional programming gains more traction and adoption. Such ideas as MapReduce, STM, and message-passing-based coordination originated in the functional programming world.

Another technology coming from the functional paradigm that is relevant to synchronization is Purely Functional Data Structures. Their principal property is that any modification doesn't cause a destructive effect on the previous version of the structure, i.e., with each change, a new version is created, while the old one may be preserved or discarded depending on the particular program requirements. This feature makes them very well suited for concurrent usage as the possibility of corruption due to incorrect operation order is removed, and such structures are also compatible with any kind of transactional behavior. The perceived inefficiency of constant copying, in many cases, may be mostly avoided by using structure sharing. So the actual cost of maintaining these data structures is not proportional to their size, but rather constant or, at worst, logarithmic in size. Another name for these structures is persistent data structures — contrast to "ephemeral" ones which operate by destructive modification.

The persistent functional structure is, as we already mentioned in one of the preceding chapters, is a Lisp list[5] used as a stack. We have also seen the queue implemented with two stacks called a Real-Time Queue. It is a purely functional data structure, as well. The other examples are mostly either list- or tree-based, i.e. they also use the linked backbone structured in a certain way.

To illustrate once again how most persistent data structures operate, we can look at a Zipper that may be considered a generalization of a Real-Time Queue. It is is a technique of representing a data structure so that it is convenient for writing programs that traverse the structure arbitrarily and update its contents in a purely functional manner, i.e. without destructive operations. A list-zipper represents the entire list from the perspective of a specific location within it. It is a pair consisting of a recording of the reverse path from the list start to the current location and the tail of the list starting at the current location. In particular, the list-zipper of a list (1 2 3 4 5), when created will look like this: (() . (1 2 3 4 5)). As we traverse the list, it will change in the following manner:

  • ((1) . (2 3 4 5))
  • ((2 1) . (3 4 5))
  • etc.

If we want to replace 3 with 0, the list-zipper will become ((2 1) . (0 4 5)) while the previous version will still persist. The new zipper will reuse the list (2 1) and create a new list by consing 0 to the front of the sublist (4 5). Consequently, there memory after performing 2 movements and 1 update will look like this:

It is apparent that each operation on the zipper (movement or modification) adds at most a single additional element. So, it's complexity is the same as for normal lists (although, with larger constants).

Zippers can operate on any linked structures. A very similar structure for trees is called a Finger Tree. To create it from a normal tree, we need to put "fingers" to the right and left ends of the tree and transform it like a zipper. A finger is simply a point at which you can access part of a data structure.

Let's consider the case of a 2-3 tree for which the finger approach was first developed. First, we restructure the entire tree and make the parents of the first and last children the two roots of our tree. This finger is composed of several layers that sit along the spine of the tree. Each layer of the finger tree has a prefix (on the left) and a suffix (on the right), as well as a link further down the spine. The prefix and suffix contain values in the finger tree – on the first level, they contain values (2-3 trees of depth 0); on the second level, they contain 2-3 trees of depth 1; on the third level, they contain 2-3 trees of depth 2, and so on. This somewhat unusual property comes from the fact that the original 2-3 tree was of uniform depth. The edges of the original 2-3 tree are now at the top of the spine. The root of the 2-3 tree is now the very bottom element of the spine. As we go down the spine, we are traversing from the leaves to the root of the original 2-3 tree; as we go closer to the root, the prefix and suffixes contain deeper and deeper subtrees of the original 2-3 tree.

Now, the principle of operation (traversal or modification) on the finger tree is the same as with the zipper: with each change, some elements are tossed from one side of the spine to the other, and the number of such elements remains withing the O(log n) limits.

Finally, another data structure that is crucial for the efficient implementation of systems that rely solely on persistent data structures (like the Clojure language environment) is a Hash-Array Mapped Trie (HAMT). It may be used both in ephemeral and persistent mode to represent maps and sets with O(log n) access complexity[6]. HAMT is a special trie that uses the following two tricks:

  • as an array-mapped trie, instead of storing pointers to the children nodes in a key-value indexed with their subkeys, it stores a list an array of pointers and a bitmap that is used to determine if the pointer is present and at what position in the array it resides. This feature requires limiting the number of possible subkeys (for example, individual characters, which are the dominant use case for tries) to the length of a bitmap. The default length is 32, which is enough to represent English alphabet :)
  • however, the hash feature gives us a number of benefits including limiting the limitations on the subkeys. Actually, in a HAMT, all values are stored at the leaves that have the same depth, while the subkeys are obtained by, first, hashing the key and then splitting the obtained hash into n-bit ranges (where n is usually also 5)[7]. Each subkey is used as an index into the bitmap: if the element at it is 1 the key is present. To calculate the index of a pointer in the pointer array, we need to perform popcount on the preceding bits.

With such a structure, all major operations will have O(log 5) i.e. O(1) complexity. However, hash collisions are possible, so the hash-table-related collision considerations also apply to a HAMT. In other words, HAMTs are pretty similar to hash-tables, with the keys being split into parts and put into a trie. However, due to their tree-based nature, the memory footprints and the runtime performance of iteration and equality checking of the HAMTs lag behind array-based counterparts:

  • Increased memory overhead as each internal node adds an overhead over a direct array-based encoding, so finding a small representation for internal nodes is crucial.
  • On the other hand, HAMTs do not need expensive table resizing and do not waste (much) space on null references.
  • Iteration is slower due to non-locality, while a hash-table uses a simple linear scan through a continuous array.
  • Delete can cause the HAMT to deviate from the most compact representation (leave nodes with no children, in the tree)
  • Equality checking can be expensive due to non-locality and the possibility of a degenerate structure due to deletes.

So, what's the value of this structure if it's just a slightly less efficient hash-table? The difference is that a HAMT can not only be implemented both with destructive operations, but also, being a tree, it can be easily adapted to persistent mode with a usual path-copying trick that we have already seen.

Complexity estimations for persistent data structres use amortized analysis to prove acceptable performance (O(log n)). Another trick at play here is called scheduling, and it lies in properly planning heavy structure-rebuilding operations and splitting them into chunks to avoid having to execute some at a time when optimal complexity can't be achieved. To learn more about these topics read the seminal book by Chris Okasaki "Purely Functional Data Structures"[8] that describes these methods in more detail and provides complexity analysis for various structures.

Besides, the immutability of persistent data structures enables additional optimizations that may be important in some scenarios:

  • native copy-on-write (COW) semantics that is required in some domains and algorithms
  • objects can be easily memoized
  • properties, such as hashes, sizes, etc., can be precomputed

The utility of persistent data structures is only gradually being realized and apprehended. Recently, some languages, including Clojure, were built around them as core structures. Moreover, some people even go as far as to claim that git is a purely functional data structure due to its principal reliance on structure-sharing persistent trees to store the data.

Take-aways

We have covered a lot of ground in this chapter at a pretty high level. Obviously, you can go much deeper: the whole books are written on the topics of concurrency and distributed computing.

Overall, concurrency can be approached from, at least, three different directions:

  1. There's a low-level view: the means that should be provided by the underlying platforms to support concurrent operation. It includes the threading/process APIs, the atomic operation, synchronization, and networking primitives.
  2. Then, there's an architecture viewpoint: what constraints our systems should satisfy and how to ensure that. At this level, the main distinctions are drawn: shared-memory vs shared-nothing, centralized vs peer-to-peer.
  3. And, last but not least, comes the algorithmic perspective. What data structures (as usual, they are, in fact, more important than the algorithms) can be used to satisfy the constraints in the most efficient way possible, or to simplify the architecture? We have seen several examples of special-purpose ones that cater to the needs of a particular problem: lock-free data structures, eventually-consistent, purely functional persistent ones. And then, there are some areas where special-purpose algorithms also play a major role. Their main purpose, there, is not so much computational efficiency (like we're used to), but, mostly, correctness coupled with good enough efficiency[9]. Mutual exclusion and distributed consensus algorithms are examples of such targeted algorithm families.

There's a lot of room for further research in the realms of synchronization and, especially, distributed computation. It is unclear, whether the new breakthroughs will come from our current computing paradigms or we'll have to wait for the new tide and new approaches. Anyway, there's still a chance to make a serious and lasting contribution to the field by developing new algorithm-related stuff. And not only that. Unlike other chapters, we haven't talked much here about the tools that can help a developer of concurrent programs. The reason for that is, actually, an apparent lack of such tools, at least, of widely adopted ones. Surely, the toolbox we have already studied in the previous chapters is applicable here, but an environment with multiple concurrent threads and, possibly, multiple address spaces adds new classes of issues and seriously complicates debugging. There are network service tools to collect metrics and execution traces, but none of them is tightly integrated into the development toolboxes, not to speak of their limited utility. So, substantial pieces are still missing from the picture and are waiting to be filled.


[1] We will further use the term "thread" to denote a separate computation running as part of our application, as it is less ambiguous than "process" and also much more widespread than all the other terms.

[2] This internal "threading", usually, also relies on the OS threading API behind the scenes.

[3] In this context, they tend to be called "processes", but we'll still stick to the term "thread".

[4] The other apparent limitation of supporting only two threads can be lifted by a modification to the algorithm, which requires some hardware support.

[5] If we forbid the destructive rplaca/rplacd operations and their derivatives.

[6] and a quite high algorithm base — usually 32 — that means very shallow trees resulting in just a handful of hops even for quite large structures

[7] Except for the length of the leftmost range that depends on the number of bits in a hash. For instance, for a 32-bit hash, it may be 7, and the depth of the whole HAMT would be 5.

[8] His thesis with the same title is freely available, but the book covers more and is more accessible.

[9] The reason for that might be relative immaturity of this space, as well as its complexity, so that our knowledge of it hasn't been developed enough to reach the stage when optimization becomes the main focus.

2020-02-19

Programming Algorithms: Compression

Compression is one of the tools that every programmer should understand and wield confidently. Such situations when the size of the dataset is larger than the program can handle directly and it becomes a bottleneck are quite frequent and can be encountered in any domain. There are many forms of compression, yet the most general subdivision is between lossless one which preserves the original information intact and lossy compression which discards some information (assumed to be the most useless part or just noise). Lossless compression is applied to numeric or text data, whole files or directories — the data that will become partially or utterly useless if even a slight modification is made. Lossy compression, as a rule, is applied to data that originates in the "analog world": sound or video recordings, images, etc. We have touched the subject of lossy compression slightly in the previous chapter when talking about such formats as JPEG. In this chapter, we will discuss the lossless variants in more detail. Besides, we'll talk a bit about other, non-compressing, forms of encoding.

Encoding

Let's start with encoding. Lossless compression is, in fact, a form of encoding, but there are other, simpler forms. And it makes sense to understand them before moving to compression. Besides, encoding itself is a fairly common task. It is the mechanism that transforms the data from an internal representation of a particular program into some specific format that can be recognized and processed (decoded) by other programs. What we gain is that the encoded data may be serialized and transferred to other computers and decoded by other programs, possibly, independent of the program that performed the encoding.

Encoding may be applied to different semantic levels of the data. Character encoding operates on the level of individual characters or even bytes, while various serialization formats deal with structured data. There are two principal approaches to serialization: text-based and binary. The pros and cons are the opposite: text-based formats are easier to handle by humans but are usually more expensive to process, while binary variants are not transparent (and so, much harder to deal with) but much faster to process. From the point of view of algorithms, binary formats are, obviously, better. But my programming experience is that they are a severe form of premature optimization. The rule of thumb should be to always start with text-based serialization and move to binary formats only as a last resort when it was proven that the impact on the program performance will be significant and important.

Base64

Encoding may have both a reduction and a magnification effect on the size of the data. For instance, there's a popular encoding scheme — Base64. It is a byte-level (lowest level) encoding that doesn't discriminate between different input data representations and formats. No, the encoder just takes a stream of bytes and produces another stream of bytes. Or, more precisely, bytes in the specific range of English ASCII letters, numbers, and three more characters (usually, +, /, and =). This encoding is often used for transferring data in the Web, in conjunction with SMTP (MIME), HTTP, and other popular protocols. The idea behind it is simple: split the data stream into sextets (6-bit parts — there's 64 different variants of those), and map each sextet to an ASCII character according to a fixed dictionary. As the last byte of the original data may not align with the last sextet, an additional padding character (=) is used to indicate 2 (=) or 4 (==) misaligned bits. As we see, Base64 encoding increases the size of the input data by a factor of 1.25.

Here is one of the ways to implement a Base64 serialization routine:

(defparameter *b64-dict*
  (coerce (append (loop :for ch :from (char-code #\A) :to (char-code #\Z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\a) :to (char-code #\z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\0) :to (char-code #\9)
                        :collect (code-char ch))
                  '(#\+ #\/ #\=))
          'simple-vector))

(defun b64-encode (in out)
  (let ((key 0)
        (limit 6))
    (flet ((fill-key (byte off beg limit)
             (:= (ldb (byte limit off) key)
                 (ldb (byte limit beg) byte))
             (:= off (- 6 beg)))
           (emit1 (k)
             (write-byte (char-code (svref *b64-dict* k)) out)))
      (loop :for byte := (read-byte in nil) :while byte :do
        (let ((beg (- 8 limit)))
          (fill-key byte 0 beg limit)
          (emit1 key)
          (fill-key byte (:= limit (- 6 beg)) 0 beg)
          (when (= 6 beg)
            (emit1 key)
            (:= limit 6))))
      (when (< limit 6)
        (:= (ldb (byte limit 0) key)
            (ldb (byte limit 0) 0))
        (emit1 key)
        (loop :repeat (ceiling limit 2) :do
          (emit1 64))))))

This is one of the most low-level pieces of Lisp code in this book. It could be written in a much more high-level manner: utilizing the generic sequence access operations, say, on bit-vectors, instead of the bit manipulating ones on numbers. However, it would be also orders of magnitude slower due to the need to constantly "repackage" the bits, converting the data from integers to vectors and back. I also wanted to show a bit of bit fiddling, in Lisp. The standard, in fact, defines a comprehensive vocabulary of bit manipulation functions and there's nothing stopping the programmer from writing performant code operating at a single bit level.

One important choice made for Base64 encoding is the usage of streams as the input and output. This is a common approach to such problems based on the following considerations:

  • It is quite easy to wrap the code so that we could feed/extract strings as inputs and outputs. Doing the opposite, and wrapping a string-based code for stream operation is also possible, but it defeats the whole purpose of streams, which is...
  • Streams allow to efficiently handle data of any size and not waste memory, as well as CPU, for storing intermediary copies of the strings we're processing. Encoding a huge file is a good illustration of why this matters: with streams, we do it in an obvious manner: (with-open-file (in ...) (with-out-file (out) (base64-encode in out)). With strings, however, it will mean, first, reading the file contents into memory — and we may not even have enough memory for that. And, after that, filling another big chunk of memory with the encoded data. Which we'll still, probably, need to either dump to a file or send over the network.

So, what happens in the code above? First, the bytes are read from the binary input stream in, then each one is slashed into 2 parts. The higher bits are set into the current base64 key, which is translated, using b64-dict, into an appropriate byte and emitted to the binary output stream out. The lower bits are deposited in the higher bits of the next key in order to use this leftover during the processing of the next byte. However, if the leftover from the previous byte was 4 bits, at the current iteration, we will have 2 base64 bytes available as the first will use 2 bits from the incoming byte, and the second will consume the remaining 6 bits. This is addressed in the code block (when (= 6 beg) ...). The function relies on the standard Lisp ldb operation which provides access to the individual bits of an integer. It uses the byte-spec (byte limit offset) to control the bits it wants to obtain.

Implementing a decoder procedure is left as an exercise to the reader...

Taking the example from the Wikipedia article, we can see our encoding routine in action (here, we also rely on the FLEXI-STREAMS library to work with binary in-memory streams):

CL-USER> (with-input-from-string (str "Man i")
           (let ((in (flex:make-in-memory-input-stream 
                      (map 'vector 'char-code
                           (loop :for ch := (read-char str nil) :while ch :collect ch))))
                 (out (flex:make-in-memory-output-stream)))
             (b64-encode in out)
             (map 'string 'code-char (? out 'vector))))
"TWFuIGk="

This function, although it's not big, is quite hard to debug due to the need for careful tracking and updating of the offsets into both the current base64 chunk (key) and the byte being processed. What really helps me tackle such situations is a piece of paper that serves for recording several iterations with all the relevant state changes. Something along these lines:

        M (77)    |    a (97)     |    n (110)
   0 1 0 0 1 1 0 1|0 1 1 0 0 0 0 1|0 1 1 0 1 1 1 0
0: 0 1 0 0 1 1    |               |                 19 = T
               0 1|               |
1:             0 1|0 1 1 0        |                 22 = W
                  |        0 0 0 1|
2:                |        0 0 0 1|0 1               5 = F

Iteration 0:

beg: 2
off: 0
limit: 6

beg: 0
off: (- 6 2) = 4
limit: 2


Iteration 1:

beg: 4
off: 0
limit: 4

beg: 0
off: (- 6 4) = 2
limit: 4

Another thing that is indispensable, when coding such procedures, is the availability of the reference examples of the expected result, like the ones in Wikipedia. Lisp REPL makes iterating on a solution and constantly rechecking the results, using such available data, very easy. However, sometimes, in makes sense to reject the transient nature of code in the REPL and record some of the test cases as unit tests. As the motto of my test library SHOULD-TEST declares: you should test even Lisp code sometimes :) The tests also help the programmer to remember and systematically address the various corner cases. In this example, one of the special cases is the padding at the end, which is handled in the code block (when (< limit 6) ...). Due to the availability of a clear spec and reference examples, this algorithm lends itself very well to automated testing. As a general rule, all code paths should be covered by the tests. If I were to write those tests, I'd start with the following simple version. They address all 3 variants of padding and also the corner case of an empty string.

(deftest b64-encode ()
  ;; B64STR would be the function wrapped over the REPL code presented above
  (should be blankp (b64str ""))
  (should be string= "TWFu" (b64str "Man"))
  (should be string= "TWFuIA==" (b64str "Man "))
  (should be string= "TWFuIGk=" (b64str "Man i")))

Surely, many more tests should be added to a production-level implementation: to validate operation on non-ASCII characters, handling of huge data, etc.

Lossless Compression

The idea behind lossless compression is straightforward: find an encoding that is tailored to our particular dataset and allows the encoding procedure to produce a shorter version than using a standard encoding. Not being general-purpose, the vocabulary for this encoding may use a more compact representation for those things that occur often, and a longer one for those that appear rarely, skipping altogether those that don't appear at all. Such an encoding scheme will be, probably, structure-agnostic and just convert sequences of bytes into other sequences of a smaller size, although custom structure-aware compression is also possible.

This approach can be explained with a simple example. The phrase "this is a test" uses 8-bit ASCII characters to represent each letter. There are 256 different ASCII characters in total. However, for this particular message, only 7 characters are used: t, h, i, s, , a, and e. 7 characters, in theory, need only 2.81 bits to be distinguished. Encoding them in just 3 bits instead of 8 will reduce the size of the message almost thrice. In other words, we could create the following vocabulary (where #*000 is a Lisp literal representation of a zero bit-vector of 3 bits):

#h(#\t #*000
   #\h #*001
   #\i #*010
   #\s #*011
   #\a #*100
   #\e #*101
   #\Space #*110)

Using this vocabulary, our message could be encoded as the following bit-vector: #*0000010100111100100111101001100001010111000. The downside, compared to using some standard encoding, is that we now need to package the vocabulary alongside the message, which will make its total size larger than the original that used an 8-bit standard encoding with a known vocabulary. It's clear, though, that, as the message becomes longer, the fixed overhead of the vocabulary will quickly be exceeded by the gain from message size reduction. Although, we have to account for the fact that the vocabulary may also continue to grow and require more and more bits to represent each entry (for instance, if we use all Latin letters and numbers it will soon reach 6 or 7 bits, and our gains will diminish as well). Still, the difference may be pre-calculated and the decision made for each message or a batch of messages. For instance, in this case, the vocabulary size may be, say, 30 bytes, and the message size reduction is 62.5%, so a message of 50 or more characters will be already more compact if encoded with this vocabulary even when the vocabulary itself will be sent with it. The case of only 7 characters is pretty artificial, but consider that DNA strings have only 4 characters.

However, this simplistic approach is just the beginning. Once again, if we use an example of the Latin alphabet, some letters, like q or x may end up used much less frequently, than, say, p or a. Our encoding scheme uses equal length vectors to represent them all. Yet, if we were to use shorter representations for more frequently used chars at the expense of longer ones for the characters occurring less often, additional compression could be gained. That's exactly the idea behind Huffman coding.

Huffman Coding

Huffman coding tailors an optimal "alphabet" for each message, sorting all letters based on their frequency and putting them in a binary tree, in which the most frequent ones are closer to the top and the less frequent ones — to the bottom. This tree allows calculating a unique encoding for each letter based on a sequence of left or right branches that need to be taken to reach it, from the top. The key trick of the algorithm is the usage of a heap to maintain the characters (both individual and groups of already processed ones) in sorted order. It builds the tree bottom-up by first extracting two least frequent letters and combining them: the least frequent on the left, the more frequent — on the right. Let's consider our test message. In it, the letters are sorted by frequency in the following order:

((#\a 1) (#\e 1) (#\h 1) (#\i 2) (#\s 3) (#\t 3) (#\Space 3)) 

Extracting the first two letters results in the following treelet:

 ((#\a #\e) 2)
  /         \
(#\a 1)  (#\e 1)

Uniting the two letters creates a tree node with a total frequency of 2. To use this information further, we add it back to the queue in place of the original letters, and it continues to represent them, during the next steps of the algorithm:

((#\h 1) ((#\a #\e) 2) (#\i 2) (#\s 3) (#\t 3) (#\Space 3)) 

By continuing this process, we'll come to the following end result:

  ((#\s #\t #\Space #\i #\h #\a #\e) 14)
    /                                \
 ((#\s #\t) 6)     ((#\Space #\i #\h #\a #\e) 8)
   /        \        /                       \
(#\s 3)   (#\t 3) (#\Space 3)  ((#\i #\h #\a #\e) 5)
                                 /               \              
                              (#\i 2)  ((#\h #\a #\e) 3)     
                                         /           \
                                      (#\h 1)  ((#\a #\e) 2)
                                                 /       \
                                               (#\a 1)  (#\e 1)

From this tree, we can construct the optimal encoding:

#h(#\s #*00
   #\t #*01
   #\Space #*10
   #\i #*110
   #\h #*1110
   #\a #*11110
   #\e #*11111)

Compared to the simple approach that used constantly 3 bits per character, it takes 1 bit less for the 3 most frequent letters and 2 bits more for two least frequent ones. The encoded message becomes: #*01111011000101100010111101001111110001, and it has a length of 38 compared to 43 for our previous attempt.

To be clear, here are the encoding and decoding methods that use the pre-built vocabulary (for simplicity's sake, they operate on vectors and strings instead of streams):

(defun huffman-encode (envocab str)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (dovec (char str)
      (dovec (bit (? envocab char))
        (vector-push-extend bit rez)))
    rez))

(defun huffman-decode (devocab vec)
  (let (rez)
    (dotimes (i (length vec))
      (dotimes (j (- (length vec) i))
        (when-it (? devocab (slice vec i (+ i j 1)))
          (push it rez)
          (:+ i j)
          (return))))
    (coerce (reverse rez) 'string)))

It is worth recalling that vector-push-extend is implemented in a way, which will not adjust the array by only 1 bit each time it is called. The efficient implementation "does the right thing", for whatever the right thing means in this particular case (maybe, adjusting by 1 machine word). You can examine the situation in more detail by trying to extend the array by hand (using adjust-array or providing a third optional argument to vector-push-extend) and comparing the time taken by the different variants, to verify my words.

Finally, here is the most involved part of the Huffman algorithm, which builds the encoding and decoding vocabularies (with the help of a heap implementation we developed in the chapter on Trees):

(defun huffman-vocabs (str)
  ;; here we assume more than a single unique character in STR
  (let ((counts #h())
        (q (make-heap :op '< :key 'rt))
        (envocab #h())
        (devocab #h(equal)))  ; bit-vectors as keys require 'equal comparison
    ;; count character frequencies
    (dovec (char str)
      (:+ (get# char counts 0)))  ; here, we use the default third argument of gethash set to 0
    ;; heapsort the characters based on their frequency
    (dotable (char count counts)
      (heap-push (pair char count) q))
    ;; build the tree
    (dotimes (i (1- (heap-size q)))
      (with (((lt cl) (heap-pop q))
             ((rt cr) (heap-pop q)))
        (heap-push (pair (list lt rt) (+ cl cr))
                   q)))
    ;; traverse the tree in DFS manner
    ;; encoding the path to each leaf node as a bit-vector
    (labels ((dfs (node &optional (level 0) path)
               (if (listp node)
                   (progn
                     (dfs (lt node) (1+ level) (cons 0 path))
                     (dfs (rt node) (1+ level) (cons 1 path)))
                   (let ((vec (make-array level :element-type 'bit
                                          :initial-contents (reverse list))))
                     (:= (? envocab node) vec
                         (? devocab vec) node)))))
      (dfs (lt (heap-pop q))))
    (list envocab devocab)))

Huffman Coding in Action

Compression is one of the areas for which it is especially interesting to directly compare the measured gain in space usage to the one expected theoretically. Yet, as we discussed in one of the previous chapters, such measurements are not so straightforward as execution speed measurements. Yes, if we compress a single sequence of bytes into another one, there's nothing more trivial than to compare their lengths, but, in many tasks, we want to see a cumulative effect of applying compression on a complex data structure. This is what we're going to do next.

Consider the problem that I had in my work on the tool for text language identification WIKI-LANG-DETECT. This software relies on a number of big dictionaries that map strings (character trigrams and individual words) to floats. The obvious approach to storing these maps is with a hash-table. However, due to the huge number of keys, such table will, generally, have a sizeable overhead, which we would like to avoid. Besides, the keys are strings, so they have a good potential for reduction in occupied size when compressed. The data is also serialized into per-language files in the tab-separated format. This is the sample of a word-level file for the Danish language:

afrika    -8.866735
i    -2.9428265
the    -6.3879676
ngo    -11.449115
of    -6.971129
kanye    -12.925021
e    -8.365895
natal    -12.171249

Our task is to load the data in memory so that access to the keys had constant runtime and minimal occupied space.

Let's begin with a simple hash-table-based approach. The following function will load two files from the default directory (*default-pathname-defaults*) and return a list of two hash-tables: for the word and trigram probabilities.

(defun load-data-into-hts (lang)
  (declare (optimize sb-c::instrument-consing))
  (mapcar (lambda (kind)
            (let ((rez (make-hash-table :test 'equal)))
              (dolines (line (fmt "~A-~A.csv" lang kind))
                (let ((space (position #\Space line)))
                  (set# (slice line 0 space) rez
                        (read-from-string (slice line (1+ space))))))
              rez))
          '("words" "3gs")))

To measure the space it will take, we'll use a new SBCL extension called allocation profiling from the sb-aprof package[1]. To enable the measurement, we have put a special declaration immediately after the defun header: (optimize sb-c::instrument-consing).

Now, prior to running the code, let's look at the output of room:

CL-USER> (room)
Dynamic space usage is:   60,365,216 bytes.
...

This is a freshly loaded image, so space usage is minimal. Usually, before proceeding with the experiment, I invoke garbage collection to ensure that we don't have some leftover data from the previous runs that may overlap with the current one. In SBCL, you run it with (sb-ext:gc :full t).

Now, let's load the files for the German language (the biggest ones) under aprof. The data can be obtained from the github repository of the project. The total size of 2 German-language files on disk (words and trigrams dictionaries) is around 4 MB.

CL-USER> (sb-aprof:aprof-run
          (lambda () (defparameter *de* (load-data-into-hts "DE"))))
227 (of 50000 max) profile entries consumed

       %        Bytes        Count    Function
 -------  -----------    ---------    --------
  24.2       34773600       434670    SB-KERNEL:%MAKE-ARRAY - #:|unknown|
  19.4       27818880       217335    SB-IMPL::%MAKE-STRING-INPUT-STREAM - SB-IMPL::STRING-INPUT-STREAM
  19.4       27818880       434670    SLICE - LIST

  17.3       24775088                 SB-IMPL::HASH-TABLE-NEW-VECTORS
      54.0     13369744         52        SIMPLE-VECTOR
      46.0     11405344        156        (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

  14.9       21406176                 SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER
      99.4     21280192     225209        (SIMPLE-ARRAY CHARACTER (*))
       0.6       125984       7874        LIST

   4.8        6957184       217412    SB-KERNEL::INTEGER-/-INTEGER - RATIO

  00.0          14160                 SB-IMPL::%MAKE-PATHNAME
      91.8        12992        812        LIST
       8.2         1168          1        SIMPLE-VECTOR

  00.0           4160            2    SB-IMPL::SET-FD-STREAM-ROUTINES - (SIMPLE-ARRAY CHARACTER (*))

  00.0           3712                 SB-IMPL::%MAKE-DEFAULT-STRING-OSTREAM
      62.1         2304          8        (SIMPLE-ARRAY CHARACTER (*))
      37.9         1408          8        SB-IMPL::CHARACTER-STRING-OSTREAM

  00.0           1024                 MAKE-HASH-TABLE
      53.1          544          2        SIMPLE-VECTOR
      46.9          480          6        (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

  00.0            832                 SB-IMPL::%MAKE-FD-STREAM
      73.1          608          2        SB-SYS:FD-STREAM
      19.2          160          2        SB-VM::ARRAY-HEADER
       7.7           64          2        (SIMPLE-ARRAY CHARACTER (*))

  00.0            576                 GET-OUTPUT-STREAM-STRING
      55.6          320          8        SIMPLE-BASE-STRING
      44.4          256          8        SB-KERNEL:CLOSURE

  00.0            400                 SB-KERNEL:VECTOR-SUBSEQ*
      60.0          240          6        (SIMPLE-ARRAY CHARACTER (*))
      40.0          160          5        SIMPLE-BASE-STRING

  00.0            400            5    SB-IMPL::%%MAKE-PATHNAME - PATHNAME
  00.0            384            2    SB-IMPL::%MAKE-HASH-TABLE - HASH-TABLE
  00.0            288            4    SB-KERNEL:%CONCATENATE-TO-STRING - (SIMPLE-ARRAY CHARACTER (*))
  00.0            192           12    SB-IMPL::UNPARSE-NATIVE-PHYSICAL-FILE - LIST
  00.0            176            2    SB-IMPL::READ-FROM-C-STRING/UTF-8 - (SIMPLE-ARRAY CHARACTER (*))
  00.0            128            4    SB-ALIEN-INTERNALS:%SAP-ALIEN - SB-ALIEN-INTERNALS:ALIEN-VALUE

  00.0             96                 SB-IMPL::QUERY-FILE-SYSTEM
      66.7           64          2        SB-KERNEL:CLOSURE
      33.3           32          2        SB-VM::VALUE-CELL

 =======  ===========
 100.0      143576336

The profiling report is pretty cryptic, at first sight, and requires some knowledge of SBCL internals to understand. It contains all the allocations performed during the test run, so we should mind that some of the used memory is garbage and will be collected at the next gc. We can confirm that by looking at the room output:

CL-USER> (room)
Dynamic space usage is:   209,222,464 bytes.
CL-USER> (sb-ext:gc :full t)
NIL
CL-USER> (room)
Dynamic space usage is:   107,199,296 bytes.

Let's study the report in detail. Around 47 MB were, in fact, used for the newly created data structures — more than 10 times more than what was needed to store the data on disc. Well, efficient access requires sacrificing a lot of space. From the report, we can make an educated guess where these 47 MB originate: 24.7 MB was used for the hash-table structures themselves (SB-IMPL::HASH-TABLE-NEW-VECTORS) and 21.4 MB for the keys (SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER), plus some small amount of bookkeeping information. We can also infer that the floating-point values required around 7 MB (SB-KERNEL::INTEGER-/-INTEGER - RATIO), but, it seems like they were put inside the hash-table arrays without any indirection. To verify that this assumption is correct we can calculate the total number of keys in the hash-tables, which amounts to 216993, and multiply it by 32 (the number of bits in a short-float used here). Also, the first 3 lines, which, in total, accrued around 90 MB or almost 2/3 of the memory used, are all related to reading the data and its processing; and this space was freed during gc.

So, this report, although it is not straightforward to understand, gives a lot of insight into how space is used during the run of the algorithm. And the ability to specify what to track on a per-code block basis makes it even more useful.

From the obtained breakdown, we can see the optimization potential of the current solution:

  • the use of a more space-efficient data structure instead of a hash-table might save us up to 17 MB of space (7 MB of float values will remain intact)
  • and another 20 MB may be saved if we compress the keys

Let's try the second option as it is exactly the focus of this chapter. We'll use the created hash-tables to make new ones with Huffman-encoded keys. Here are the contents of the word probabilities table:

CL-USER> (print-ht (first *de*))
#{EQUAL
  "afrika" -9.825206
  "i" -7.89809
  "the" -7.0929685
  "ngo" -12.696277
  "noma" -14.284437
  "of" -6.82038
  "kanye" -14.233144
  "e" -7.7334323
  "natal" -11.476304
  "c" -8.715089
  ...
 }

And here is the function that will transform the tables:

(defun huffman-tables (hts envocab)
  (declare (optimize sb-c::instrument-consing))
  (mapcar (lambda (ht)
            (let ((rez #h(equal)))
              (dotable (str logprob ht)
                (:= (? rez (huffman-encode envocab str)) logprob))
              rez))
          hts))

;; the Huffman encoding vocabulary *DE-VOCAB* should be built
;; from all the keys of *DE* tables separately
CL-USER> (sb-aprof:aprof-run
          (lambda () (defparameter *de2* (huffman-tables *de* *de-vocab*))))
1294 (of 50000 max) profile entries consumed
       %        Bytes        Count    Function
 -------  -----------    ---------    --------
  42.5       44047104      1376461    SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG - ARRAY

  23.9       24775088                 SB-IMPL::HASH-TABLE-NEW-VECTORS
      54.0     13369744         52        SIMPLE-VECTOR
      46.0     11405344        156        (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

  20.1       20864160                 HUFFMAN-ENCODE
      83.3     17386800     217335        SB-VM::ARRAY-HEADER
      16.7      3477360     217335        SIMPLE-BIT-VECTOR

   6.7        6955072       217335    SB-KERNEL:VECTOR-SUBSEQ* - SIMPLE-BIT-VECTOR
   3.4        3477360       217335    (SB-PCL::FAST-METHOD RUTILS.GENERIC::GENERIC-SETF :AROUND (T T)) - LIST
   3.4        3477360       217335    (SB-PCL::FAST-METHOD RUTILS.GENERIC::GENERIC-SETF (HASH-TABLE T)) - LIST
  00.0           2464           77    SB-KERNEL::INTEGER-/-INTEGER - RATIO

  00.0           1024                 MAKE-HASH-TABLE
      53.1          544          2        SIMPLE-VECTOR
      46.9          480          6        (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

  00.0            384            2    SB-IMPL::%MAKE-HASH-TABLE - HASH-TABLE

  00.0             96                 SB-C::%PROCLAIM
      66.7           64          2        LIST
      33.3           32          1        SB-KERNEL:CLOSURE

  00.0             96            2    SB-INT:SET-INFO-VALUE - SIMPLE-VECTOR
  00.0             64            2    SB-THREAD:MAKE-MUTEX - SB-THREAD:MUTEX
  00.0             32            1    SB-IMPL::%COMPILER-DEFVAR - LIST
  00.0             32            2    HUFFMAN-TABLES - LIST
  00.0             16            1    SB-KERNEL:ASSERT-SYMBOL-HOME-PACKAGE-UNLOCKED - LIST
 =======  ===========
 100.0      103600352
CL-USER> (sb-ext:gc :full t)
NIL
CL-USER> (room)
Dynamic space usage is:   139,922,208 bytes.

So, we have claimed 32 MB of additional space (instead of 47) and some of it seems to be used by other unrelated data (some functions I have redefined in the REPL during the experiment etc), as the compressed keys amount for only 3.5 MB:

3477360     217335        SIMPLE-BIT-VECTOR 

That is more than 5 times reduction or almost 40% compression of the whole data structure!

And what about performance? Huffman compression will be needed at every data access, so let's measure the time it will take for vanilla string keys and the bit-vector ones. We will use another file from the wiki-lang-detect repository for the smoke test — a snippet from Faust:

CL-USER> (defparameter *de-words*
           (let ((words (list))
                 (dict (first *de*)))
             (dolines (line "~/prj/lisp/wiki-lang-detect/data/smoke/de.txt")
               (dolist (word (split #\Space line))
                 (push word words)))
             words))
CL-USER> (length *de-words*)
562

CL-USER> (let ((vocab (first *de*)))
           (time (loop :repeat 1000 :do
                   (dolist (word *de-words*)
                     (get# word vocab)))))
Evaluation took:
  0.045 seconds of real timeEvaluation took:

CL-USER> (let ((vocab (first *de2*)))
           (time (loop :repeat 1000 :do
                   (dolist (word *de-words*)
                     (get# (huffman-encode *de-vocab* word) vocab)))))
Evaluation took:
  0.341 seconds of real time

Hmm, with Huffman coding, it's almost 10x slower :( Is there a way to speed it up somewhat? To answer it, we can utilize another profiler — this time a more conventional one, which measures the time spent in each operation. SBCL provides access to two versions of such profilers: a precise and a statistical one. The statistical doesn't seriously interfere with the flow of the program as it uses sampling to capture the profiling data, and it's the preferred one among the developers. To use it, we need to perform (require 'sb-sprof) and then run the computation with profiling enabled (the lengthy output is redacted to show only the most important parts):

CL-USER> (let ((dict (first *de2*)))
           (sb-sprof:with-profiling (:report :graph)
             (loop :repeat 100 :do
               (dolist (word *de-words*)
                 (get# (huffman-encode *de-vocab* word) dict)))))
Number of samples:   34
Sample interval:     0.01 seconds
Total sampling time: 0.34 seconds
Number of cycles:    0
Sampled threads:
 #<SB-THREAD:THREAD "repl-thread" RUNNING {100FB19BC3}>

                               Callers
                 Total.     Function
 Count     %  Count     %      Callees
------------------------------------------------------------------------
    24  70.6                   "Unknown component: #x52CD6390" [41]
     5  14.7     24  70.6   HUFFMAN-ENCODE [1]
     1   2.9                   SB-IMPL::GETHASH/EQL [17]
     1   2.9                   SB-IMPL::GETHASH3 [6]
     1   2.9                   LENGTH [14]
     1   2.9                   SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS [13]
     2   5.9                   (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT) [5]
    13  38.2                   VECTOR-PUSH-EXTEND [11]
------------------------------------------------------------------------
     4  11.8                   SB-VM::EXTEND-VECTOR [4]
     4  11.8      4  11.8   SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG [2]
------------------------------------------------------------------------
     6  17.6                   "Unknown component: #x52CD6390" [41]
     3   8.8      6  17.6   SB-IMPL::GETHASH/EQUAL [3]
     1   2.9                   SXHASH [42]
     2   5.9                   SB-INT:BIT-VECTOR-= [10]
------------------------------------------------------------------------
     8  23.5                   VECTOR-PUSH-EXTEND [11]
     2   5.9      8  23.5   SB-VM::EXTEND-VECTOR [4]
     2   5.9                   SB-VM::COPY-VECTOR-DATA [9]
     4  11.8                   SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG [2]
------------------------------------------------------------------------
     2   5.9                   HUFFMAN-ENCODE [1]
     2   5.9      2   5.9   (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT) [5]
------------------------------------------------------------------------
...

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1      5  14.7     24  70.6      5  14.7        -  HUFFMAN-ENCODE
   2      4  11.8      4  11.8      9  26.5        -  SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG
   3      3   8.8      6  17.6     12  35.3        -  SB-IMPL::GETHASH/EQUAL
   4      2   5.9      8  23.5     14  41.2        -  SB-VM::EXTEND-VECTOR
   5      2   5.9      2   5.9     16  47.1        -  (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT)
   6      2   5.9      2   5.9     18  52.9        -  SB-IMPL::GETHASH3
   7      2   5.9      2   5.9     20  58.8        -  GETHASH
   8      2   5.9      2   5.9     22  64.7        -  (SB-VM::OPTIMIZED-DATA-VECTOR-SET BIT)
   9      2   5.9      2   5.9     24  70.6        -  SB-VM::COPY-VECTOR-DATA
  10      2   5.9      2   5.9     26  76.5        -  SB-INT:BIT-VECTOR-=
  11      1   2.9     13  38.2     27  79.4        -  VECTOR-PUSH-EXTEND
  12      1   2.9      1   2.9     28  82.4        -  SB-VM::SLOW-HAIRY-DATA-VECTOR-SET
  13      1   2.9      1   2.9     29  85.3        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
  14      1   2.9      1   2.9     30  88.2        -  LENGTH
  15      1   2.9      1   2.9     31  91.2        -  SB-KERNEL:HAIRY-DATA-VECTOR-SET
  16      1   2.9      1   2.9     32  94.1        -  SB-KERNEL:VECTOR-SUBSEQ*
  17      1   2.9      1   2.9     33  97.1        -  SB-IMPL::GETHASH/EQL
...

Unsurprisingly, most of the time is spent in huffman-encode, and of it the biggest chunks are vector-push-extend and hash-table access (to get the Huffman code of a letter). Surely, instead of extending the vector at each iteration, it would be much nicer to just perform a bulk copy of the bits for each character directly into the vector. Let's try that and see the difference:

(defun huffman-encode2 (envocab str)
  (let ((vecs (map 'vector (lambda (ch) (get# ch envocab))
                   str))
        (total-size 0))
    (dovec (vec vecs)
      (:+ total-size (length vec)))
    (let ((rez (make-array total-size :element-type 'bit))
          (i 0))
      (dovec (vec vecs)
        (let ((size (length vec)))
          (:= (subseq rez i) vec)
          (:+ i size)))
      rez)))

CL-USER> (let ((vocab (first *de2*)))
           (time (loop :repeat 1000 :do
                   (dolist (word *de-words*)
                     (get# (huffman-encode2 *de-vocab* word) vocab)))))
Evaluation took:
  0.327 seconds of real time

Almost no difference. Well, it's a usual case with these micro-optimizations: you have a brilliant idea, try it under the profiler — and, bah, no difference... This doesn't have to stop us, though. Another idea could be to use a jump-table instead of a hash-table to store character-vector mappings. There are only around 500 characters that have a mapping in my data, although they span the whole Unicode range:

CL-USER> (reduce 'max (mapcar 'char-code (keys de-vocab)))
65533
CL-USER> (defparameter jvocab (make-array (1+ 65533)
                                            :element-type 'bit-vector
                                            :initial-element #))
CL-USER> (dotable (k v *de-vocab)
           (:= (? jvocab (char-code k)) v))

(defun huffman-encode3 (envocab str)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (dovec (char str)
      ;; here, we have changed the hash-table to a jump-table
      (dovec (bit (svref envocab (char-code char)))
        (vector-push-extend bit rez)))
    rez))

CL-USER> (let ((vocab (first de2))) (time (loop :repeat 1000 :do (dolist (word de-words) (get# (huffman-encode3 jvocab word) vocab))))) Evaluation took: 0.308 seconds of real time

OK, we get an improvement of around 10%[2]. That's a start. But many more ideas and experiments are needed if we want to significantly optimize this implementation. Yet, for the sake of space conservation on the pages of this book, we won't continue with it.

Another tool we could use to analyze the performance and think about further improvement is flamegraphs — a way to visualize profiler output. [CL-FLAMGRAPH](https://github.com/40ants/cl-flamegraph) is a wrapper around `sb-sprof` that generates the output in the common format which can be further processed by the Perl tool, in order to generate the image itself. Here is the basic output I got. It's rather rough and, probably, requires some fiddling with the Perl tool to obtain a prettier image:

To conclude, key compression alone gives a sizeable reduction in used space at the cost of deteriorated performance.

Another possible angle of attack is to move from a hash-table to a more space-efficient structure. We have explored this direction somewhat in the chapter on hash-tables already.

Arithmetic Coding

Why Huffman coding works? The answer lies in Shannon's Source Coding Theorem and has to do with a notion of entropy. Entropy is one of the ways to represent expectation and surprise, in a message. The most random message has the maximal surprise, i.e. it's very hard to predict what symbol will appear at a certain position in it, while the least random (for instance, containing only repetitions of a single char) is the least surprising. Obviously, any kind of useful data is not uniformly distributed, or, otherwise, it's indistinguishable from white noise. Most of the data representations use an "alphabet" (encoding) that is redundant, for a particular message. Why? Because it is general-purpose and should allow expressing arbitrary messages. Yet, in practice, some passages appear much more often than the others, some words are more frequent, some letters are, and even some patterns in the images may be.

The idea of character-level compression algorithms is to tailor a custom vocabulary that uses fewer bits for low-entropy (frequent) characters and more bits for high-entropy ones. In general, the probability distribution of characters may be thought of as a [0,1) interval, in which each char occupies a slice proportionate to its frequency. If we rely on standard encoding, the interval for our test example will look like this:

|---+---+---+------+---------+---------+---------|
0 e   a   h    i        s         t       Space  1

Here, each subinterval for a character is its probability times the number of bits per character (8 for each). Huffman coding tries to equalize this distribution by assigning fewer bits to characters that occupy larger space. For the Huffman vocabulary we have constructed, the distribution will look like this:

|-----+-----+----+------+------+-------+-------|
0  e     a    h     i      s      t      Space 1

As you can see, it has become more even, but still not totally. This is due to the discrete nature of the encoding that results in rounding the number of bits to the closest integer value. There's another approach to solving the same problem that aims at reducing the rounding error even further — Arithmetic coding. It acts directly on our interval and encodes the whole message in a single number that represents the point in this interval. How this point is found and used? Let's consider a message with a single character i. In our example, the subinterval for it is [0.214285714, 0.357142857). So, if we use any number from this interval and know that the message contains a single character we can unambiguously decode it back. Ideally, we'd use the number from the interval that has the least amount of digits. Here is a simple example of how such a number can be found:

(defun find-shortest-bitvec (lo hi)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (loop
      ;; shift the binary point one position to the right:
      ;; the integer parts (lod/hid) are the next bits of lo and hi,
      ;; the fractional parts (lof/hif) are what remains of them
      (with ((lod lof (floor (* lo 2)))
             (hid hif (floor (* hi 2))))
        ;; once the next bits diverge (or one of the fractional parts
        ;; becomes zero), emitting the bit of hi pins down a number
        ;; inside the interval, and we are done
        (when (or (zerop lof)
                  (zerop hif)
                  (/= lod hid))
          (vector-push-extend hid rez)
          (return))
        ;; otherwise, this bit is common to all numbers in [lo, hi)
        (vector-push-extend lod rez)
        (:= lo lof
            hi hif)))
    rez))

CL-USER> (find-shortest-bitvec 0.214285714 0.357142857)
#*01

The result is a bit-vector that represents the fractional part of some floating-point number lying within the interval, which may also be used as an encoding of our one-character message. Obviously, we could use just a single bit to encode it with a custom vocabulary of one entry, but, here, for the purpose of illustration, I wanted to use an existing pre-calculated vocabulary that includes other characters as well. Compared to the Huffman coding, the message length is decreased by 1 bit.

Now, how can we process longer messages? In the same manner: by recursively dividing the currently selected part using the same original distribution. For the message "is":

  • on step 1 (for character i), the interval [0.214285714, 0.357142857) will be selected
  • on step 2 (for character s), we'll narrow it down to [0.26530612, 0.29591838) (using the subinterval [0.357142857, 0.5714286) for s)

For this interval, the shortest encoding will be 01001. In this case, it has the same size as the Huffman one.
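This may be double-checked with the function defined above:

CL-USER> (find-shortest-bitvec 0.26530612 0.29591838)
#*01001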

So, the naive arithmetic encoding implementation is quite simple:

(defun arithm-encode (envocab message)
  (let ((lo 0.0)
        (hi 1.0))
    (dovec (char message)
      (let ((coef (- hi lo)))
        (dotable (ch prob envocab)
          (let ((off (* prob coef)))
            (when (eql char ch)
              (:= hi (+ lo off))
              (return))
            (:+ lo off)))))
    (find-shortest-bitvec lo hi)))

CL-USER> (arithm-encode #h(#\e 1/14
                           #\a 1/14
                           #\h 1/14
                           #\i 2/14
                           #\s 3/14
                           #\t 3/14
                           #\Space 3/14)    
                        "this is a test")
#*100110110100001110000001

However, this function has a hidden bug. The problem lies in the dreaded floating-point precision loss that kicks in quite soon in the process of narrowing the interval: more and more digits of the floating-point number are required, until all of its bits are used up and we can't distinguish the intervals any further. If we try to faithfully decode even the short message encoded above, we'll already see this effect by getting the output this ist sssst.
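How soon the precision runs out is easy to see with a small experiment (purely illustrative, not part of the final code): keep narrowing [lo, hi) with the same character's subinterval, as arithm-encode does, and count the steps until single-float arithmetic can no longer tell the bounds apart.

(defun steps-until-collapse (plo phi)
  (let ((lo 0.0)
        (hi 1.0))
    (loop :for i :from 0
          :until (>= lo hi)
          :do (let ((coef (- hi lo)))
                ;; hi is updated first so that both bounds use the old lo
                (:= hi (+ lo (* phi coef))
                    lo (+ lo (* plo coef))))
          :finally (return i))))

For the subinterval of #\i, i.e. (steps-until-collapse 3/14 5/14), the bounds become indistinguishable after only a dozen or so steps (the exact number depends on the rounding behavior), so the naive version can't even get through a short sentence reliably.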

The implementation of this approach, which works around the bug, relies on the same idea but uses a clever bit-arithmetic trick. Due to that, it becomes less clean and obvious, as it has to work not with the whole number but with a bounded window into it (in this case, a 32-bit one) and also take care of the potential underflow that may happen when the range collapses around 0.5. It is shown here, for illustration purposes, without a detailed explanation[3]. This function is another showcase of the Lisp standard's support for handling bit-level values. Besides, read-eval (#.) is used here to provide literal values of the bitmasks.

(defun arithm-encode-correct (envocab message)
  (let ((lo 0)
        (hi (1- (expt 2 32)))
        (pending-bits 0)
        (rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (flet ((emit-bit (bit)
             (vector-push-extend bit rez)
             (let ((pbit (if (zerop bit) 1 0)))
               (loop :repeat pending-bits :do (vector-push-extend pbit rez))
               (:= pending-bits 0))))
      (dovec (char message)
        (with ((range (- hi lo -1))
               ((plo phi) (? envocab char)))
          ;; hi has to be updated first: its formula uses the value of lo
          ;; from the previous iteration
          (:= hi (round (+ lo (* phi range) -1))
              lo (round (+ lo (* plo range))))
          (loop
            (cond ((< hi #.(expt 2 31))
                   ;; both bounds are in the lower half: the next bit is 0
                   (emit-bit 0))
                  ((>= lo #.(expt 2 31))
                   ;; both bounds are in the upper half: the next bit is 1
                   (emit-bit 1)
                   (:- lo #.(expt 2 31))
                   (:- hi #.(expt 2 31)))
                  ((and (>= lo #.(expt 2 30))
                        (< hi (+ #.(expt 2 30) #.(expt 2 31))))
                   ;; the range is collapsing around the midpoint:
                   ;; postpone the decision about the next bit
                   (:- lo #.(expt 2 30))
                   (:- hi #.(expt 2 30))
                   (:+ pending-bits))
                  (t (return)))
            (:= lo (mask32 (ash lo 1))
                hi (mask32 (1+ (ash hi 1)))))))
      (:+ pending-bits)
      (emit-bit (if (< lo #.(expt 2 30)) 0 1)))
    rez))

(defun mask32 (num)
  ;; this utility is used to confine the number in 32 bits
  (logand num #.(1- (expt 2 32))))
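The corrected version expects the vocabulary to map each character not to a bare probability but to its cumulative subinterval, i.e. a pair of lower and upper bounds. In the examples below these bounds are written out by hand, but they could also be derived from the probability table used earlier with a small helper like this one (my own sketch, assuming the table iterates in the order in which it was created):

(defun probs->intervals (probvocab)
  ;; accumulate the probabilities to obtain, for each character,
  ;; the (plo phi) bounds of its subinterval of [0, 1)
  (let ((rez (make-hash-table))
        (lo 0))
    (dotable (char prob probvocab)
      (:= (? rez char) (list lo (+ lo prob))
          lo (+ lo prob)))
    rez))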

CL-USER> (arithm-encode-correct #h(#\e '(0 1/14)
                                   #\a '(1/14 1/7)
                                   #\h '(1/7 3/14)
                                   #\i '(3/14 5/14)
                                   #\s '(5/14 4/7)
                                   #\t '(4/7 11/14)
                                   #\Space '(11/14 1))    
                                "this is a test")
#*10011011010000111000001101010110010101

Note that the length of the compressed message is 38 bits, the same as the Huffman version!

And here, for the sake of completeness and verification, is the decoding routine. It works in a similar fashion but backwards: we determine the interval into which our current number falls, emit the corresponding character, and narrow the search interval to the currently found one. We'll need to have access to the same vocabulary and know the length of the message.

(defun bitvec->int (bits)
  (reduce (lambda (acc bit) (+ (ash acc 1) bit))
          bits))

(defun arithm-decode (dedict vec size)
  (with ((len (length vec))
         (lo 0)
         (hi (1- (expt 2 32)))
         (val (bitvec->int (subseq vec 0 (min 32 len))))
         (off 32)
         (rez (make-string size)))
    (dotimes (i size)
      (with ((range (- hi lo -1))
             (prob (/ (- val lo) range)))
        (dotable (char r dedict)
          (with (((plo phi) r))
            (when (>= phi prob)
              (:= (? rez i) char
                  ;; as in the encoder, hi is updated before lo
                  hi (round (+ lo (* phi range) -1))
                  lo (round (+ lo (* plo range))))
              (return))))
        (loop
          (cond ((< hi #.(expt 2 31))
                 ;; do nothing
                 )
                ((>= lo #.(expt 2 31))
                 (:- lo #.(expt 2 31))
                 (:- hi #.(expt 2 31))
                 (:- val #.(expt 2 31)))
                ((and (>= lo #.(expt 2 30))
                      (< hi #.(* 3 (expt 2 30))))
                 (:- lo #.(expt 2 30))
                 (:- hi #.(expt 2 30))
                 (:- val #.(expt 2 30)))
                (t
                 (return)))
          (:= lo (mask32 (ash lo 1))
              hi (mask32 (1+ (ash hi 1)))
              val (mask32 (+ (ash val 1)
                             (if (< off len)
                                 (? vec off)
                                 0)))
              off (1+ off)))))
    rez))

CL-USER> (let ((vocab #h(#\e '(0 1/14)
                         #\a '(1/14 1/7)
                         #\h '(1/7 3/14)
                         #\i '(3/14 5/14)
                         #\s '(5/14 4/7)
                         #\t '(4/7 11/14)
                         #\Space '(11/14 1))))
           (arithm-decode vocab
                          (arithm-encode-correct vocab "this is a test")
                          14))
"this is a test"

DEFLATE

Entropy-based compression, or, as I would call it, character-level compression, can do only so much: it can't account for repetitions of larger-scale parts of the message. For instance, a message with a single word repeated twice, when compressed with Huffman or Arithmetic coding, will have twice the length of the message with a single occurrence of that word: the probability distribution doesn't change, and so neither do the encodings of the individual characters. Yet, there's an obvious possibility to reduce the compressed size here. This and other similar cases are much better treated by dictionary-based or block-level encoding approaches. The most well-known and widespread of them is the DEFLATE algorithm, which is a variant of LZ77. Surely, there are other approaches like LZW, LZ78, or the Burrows-Wheeler transform (used in bzip2), but they are based on the same general idea, so studying DEFLATE will allow you to grasp the other algorithms if necessary.

But, before considering DEFLATE, let's first look at the simplest block-level scheme: Run-Length Encoding (RLE). Strictly speaking, it is not even a block-level algorithm, as it still operates on single characters. The idea is to encode a sequence of repeating characters as a single character followed by the number of repetitions. Of course, such an approach will hardly help with natural-language texts, which have almost no long character repetitions; instead, it was used for images with limited palettes (like those encoded in the GIF format). It is common for such images to have large areas filled with the same color, and the compression in such formats processes the image line by line (RLE in formats like PCX, LZW in GIF). That was one of the reasons why an image with a horizontal pattern like this:

xxxxx

xxxxx

xxxxx

lent itself to stellar compression, while the same one rotated 90 degrees didn't :)

x  x  x
x  x  x
x  x  x
x  x  x
x  x  x
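To make the RLE idea concrete, here is a minimal sketch (my own illustration, not production code) that encodes a string as a list of (character . run-length) pairs:

(defun rle-encode (string)
  (let ((rez (list)))
    (loop :for char :across string
          :do (if (and rez (char= char (car (first rez))))
                  ;; extend the current run
                  (incf (cdr (first rez)))
                  ;; or start a new one
                  (push (cons char 1) rez)))
    (reverse rez)))

CL-USER> (rle-encode "xxxxx")
((#\x . 5))
CL-USER> (rle-encode "x  x  x")
((#\x . 1) (#\Space . 2) (#\x . 1) (#\Space . 2) (#\x . 1))

A line of the first image collapses into a single pair, while a line of the rotated one turns into five pairs, i.e. it doesn't get any shorter.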

LZ77 is a generalization of the RLE approach that considers runs not just of single characters but of variable-length character sequences. Under such conditions, it becomes much better suited for text compression, especially when the text has some redundancy. For example, program code tends to have some identifiers constantly repeated (like if, loop, or nil, in Lisp), each code file may have a lengthy identical copyright notice at the top, and so on and so forth. The algorithm operates by replacing repeated occurrences of data with references to a single copy of that data seen earlier in the uncompressed stream. Each match is encoded by a pair of numbers: the length of the repeated sequence and the offset back into the stream to the place where it was originally encountered.
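To illustrate the principle, here is a naive greedy tokenizer (my own simplified sketch; the real DEFLATE matcher uses hash chains and other optimizations) that emits either literal characters or (length . distance) pairs. The window size and the minimal match length are arbitrary small values chosen for readability:

(defun lz77-tokenize (string &key (window 255) (min-match 3))
  (let ((rez (list))
        (pos 0)
        (len (length string)))
    (loop :while (< pos len) :do
      (let ((best-len 0)
            (best-dist 0))
        ;; look for the longest match starting inside the window of
        ;; already processed text (a match may run past POS, i.e. overlap)
        (loop :for start :from (max 0 (- pos window)) :below pos :do
          (let ((match-len (loop :for i :from 0
                                 :while (and (< (+ pos i) len)
                                             (char= (char string (+ start i))
                                                    (char string (+ pos i))))
                                 :count t)))
            (when (> match-len best-len)
              (:= best-len match-len
                  best-dist (- pos start)))))
        (cond ((>= best-len min-match)
               ;; emit a (length . distance) back-reference
               (push (cons best-len best-dist) rez)
               (:+ pos best-len))
              (t
               ;; no long enough match: emit a literal
               (push (char string pos) rez)
               (:+ pos)))))
    (reverse rez)))

CL-USER> (lz77-tokenize "this is a test")
(#\t #\h #\i #\s #\Space (3 . 3) #\a #\Space #\t #\e #\s #\t)

Here, the second "is " (the word plus the following space) is replaced by the pair (3 . 3): copy 3 characters starting 3 positions back.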

The most popular LZ77-based compression method is DEFLATE. In the algorithm, literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances are placed into a separate alphabet as they occur just after lengths, so they cannot be mistaken for another kind of symbol or vice versa. A DEFLATE stream consists of a series of blocks. Each block is preceded by a 3-bit header indicating the position of the block (last or intermediate) and the type of character-level compression used: no compression, Huffman with a predefined tree, or Huffman with a custom tree. Most compressible data will end up being encoded using the dynamic Huffman encoding. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the loss in compression due to using a non-optimal code.

The algorithm performs the following steps:

  1. Matching and replacement of duplicate strings with pointers: within a single block, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of an 8-bit length (the repeated block size is between 3 and 258 bytes) and a 15-bit distance (which specifies an offset of 1-32768 bytes inside the so-called "sliding window") to the beginning of the duplicate. If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of n identical bytes can be encoded as a single literal byte followed by a match of length (1- n) with a distance of 1 (see the decoding sketch after this list).

  2. Huffman coding of the obtained block. Instructions to generate the necessary Huffman trees immediately follow the block header. There are, actually, 2 trees: the 288-symbol length/literal tree and the 32-symbol distance tree, which are themselves encoded as canonical Huffman codes, by giving the bit length of the code for each symbol. The bit lengths are then run-length encoded to produce as compact a representation as possible.
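The overlapping-match rule from step 1 is easiest to see on the decoding side: if the referenced bytes are copied one by one, a match that reaches into the bytes it is itself producing naturally expands into a run. Here is a sketch of such an expansion for the tokens produced by the lz77-tokenize illustration above (again, a toy, not the actual DEFLATE format):

(defun lz77-expand (tokens)
  (let ((rez (make-array 0 :element-type 'character
                           :adjustable t :fill-pointer t)))
    (dolist (token tokens)
      (if (characterp token)
          (vector-push-extend token rez)
          ;; a (length . distance) pair: copy char by char so that
          ;; an overlapping match (distance < length) works as a run
          (destructuring-bind (len . dist) token
            (loop :repeat len
                  :do (vector-push-extend (char rez (- (length rez) dist))
                                          rez)))))
    (coerce rez 'string)))

CL-USER> (lz77-expand (lz77-tokenize "this is a test"))
"this is a test"
CL-USER> (lz77-expand (list #\a '(5 . 1)))
"aaaaaa"

The second example is exactly the run encoding mentioned in step 1: 6 identical bytes become a literal plus a match of length 5 with a distance of 1.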

An interesting fact is that DEFLATE compression is so efficient in terms of speed that it is faster to read a compressed file from an ATA hard drive and decompress it in memory than to read the original, longer version: disk access takes much more time than the CPU processing required by this rather simple algorithm! The same applies, even more so, to network traffic. That's why compression is used (and enabled by default) in many popular network protocols, for instance, HTTP.

Take-aways

This chapter, unlike the previous one, instead of exploring many different approaches, dealt with, basically, just a single one in order to dig deeper and to demonstrate the use of all the tools that can be applied in algorithmic programming: from a piece of paper to sophisticated profilers. Moreover, the case we have analyzed provides a great showcase not just of the tools but of the whole development process with all its setbacks, trial and error, and discoveries.

Bit fiddling was another topic that naturally emerged in this chapter. It may look cryptic to those who have never ventured into this territory, but mastering the technique is necessary to gain access to a number of important areas of the algorithms landscape.


[1] To make full use of this feature and be able to profile SBCL internal functions, you'll need to compile SBCL with the --with-cons-profiling flag. Many thanks to Douglas Katzman for developing this feature and guiding me through its usage.

[2] It was verified by taking the average of multiple test runs.

[3] You can study the details in the [relevant article](https://www.drdobbs.com/cpp/data-compression-with-arithmetic-encodin/240169251).