<p><em>Lisp, the Universe and Everything, by Vsevolod Dyomkin, 2020-03-30</em></p> <h1>Programming Algorithms: Synchronization</h1> <p>This is the final chapter of the book, in which we will discuss optimization of parallel computations: whether concurrently on a single machine in a shared-memory setting or in a distributed shared-nothing environment. This is a huge topic that spans synchronization itself, parallelization, concurrency, distributed computations, and the functional approach. And every senior software developer should be well-versed in it.</p> <p>Usually, synchronization is studied in the context of system or distributed programming, but it has a significant algorithmic footprint and is also one of the hottest topics for new algorithm research. In fact, there are whole books that concentrate on it, but, usually, they attack the problem from other angles, not focusing on the algorithmic part. This chapter will be more algorithm-centered, although it will also present an overview of the problem space, so that, in the end, you'll have a good foundation to explore the topic further if the desire or need for that appears.</p> <p>Let's start from the basics. In the previous chapters of the book, we mainly viewed algorithms as single computations running without external interruptions. This approach, obviously, removes unnecessary complexity, but it also isn't totally faithful to reality. Most of the programs we deal with now run in multiprocessing environments (sometimes, even distributed ones), and even when they don't utilize these capabilities, those are still available, they sometimes have their impact, and, besides, the programs might have performed better had these capabilities been utilized. 
The majority of the backend stuff, which, currently, is comprised of services running in the datacenters, is multithreaded. There's a notorious "Zawinski's Law" that states that "every program attempts to expand until it can read mail; those programs which cannot so expand are replaced by ones which can". Besides being a good joke, it also reflects an important truth about the tendency of all programs over time to become network-aware, and thus distributed to at least some extent.</p> <p>There are two principally different types of environments in which the programs that need synchronization run: shared-memory and shared-nothing ones.</p> <p>In a shared-memory setting, there exists some shared storage (not necessarily RAM) that can be directly accessed by all the threads<a href="#f14-1" name="r14-1">[1]</a> of the application. Concurrent access to data in this shared memory is the principal source of the synchronization challenges, although not the only one. An example of a shared-memory program is a normal application that uses multithreading provided either directly by the OS or, more frequently, by the language runtime<a href="#f14-2" name="r14-2">[2]</a>.</p> <p>The opposite of shared-memory is a shared-nothing environment, in which the threads<a href="#f14-3" name="r14-3">[3]</a> don't have any common data storage and can coordinate only by sending messages directly to other processes. The contents of the messages have to be copied from the memory of the sender to the receiver. In this setting, some of the synchronization problems disappear, but others still remain. At the fundamental level, some synchronization or coordination still needs to happen. From a performance standpoint, however, the shared-nothing mode is, usually, inferior due to the need for additional data copying. 
So, both paradigms have their place, and the choice of which one to utilize depends on the context of a particular task.</p> <p>The main goal of synchronization is ensuring program correctness when multiple computations are running in parallel. The other side of the coin is achieving optimal performance, which is also addressed by parallelization that we have somewhat discussed in a couple of prior chapters. Prioritizing performance over correctness, although tempting, is one of the primary sources of bugs in concurrent systems. The trivial example would be building a shared-memory program without explicit use of any synchronization mechanisms. It is, definitely, the most performant approach, but non-coordinated access to the shared data will inevitably result in failures like data corruption.</p> <h2 id="synchronizationtroubles">Synchronization Troubles</h2> <p>So, let's talk in more detail about the most common synchronization problems that the methods we will discuss next are trying to handle. Such situations are called <strong>race conditions</strong>: multiple threads compete for the same resource — be it data storage or the processor — and, in the absence of special coordination, the order of execution is unspecified, which may result in unpredictable and unintended outcomes. There are two main results of this unpredictability (often, both occur simultaneously):</p> <ul><li>data corruption or loss</li> <li>incorrect order of execution up to total failure of the program</li></ul> <p>Here is the simplest code segment that is amenable to data corruption in multithreaded programs:</p> <pre><code>(incf i)<br /></code></pre> <p>It seems like there's just one operation involved — how can there be a race condition? From the point of view of a programmer in a high-level language, indeed, we deal with a single operation, but if we go deeper, to the level of machine code, we'll see that it is not the case. 
The relevant assembly snippet will, probably, contain three instructions:</p> <pre><code>mov i, register<br />inc register<br />mov register, i<br /></code></pre> <p>You've just seen one more piece of convincing evidence why every programmer should understand how the lower levels of their platform operate. :)</p> <p>The issue is that modern processors can't directly modify data in the RAM (our variable <code>i</code>). First, the data needs to be moved into a register, only then may some operation on it be performed by the CPU, and, finally, it needs to be put back where the high-level program can find it. If an interrupt occurs (we're talking about multithreaded execution in a single address space, in this context) after <code>mov i, register</code>, the current thread will remember the old value of <code>i</code> (let it be 42) and be put into a waitqueue. If another thread that wants to change <code>i</code> is given processor time next, it may set it to whatever value it wants and continue execution (suppose it will be 0). However, when control returns to the first thread, it will increment the value it remembered (42), so <code>i</code> will take the following sequence of values: 42, 0, 43. Hardly the expected behavior.</p> <p>Such data corruption will only impact the mentioned variable and may not cause catastrophic failures in the program. Its behavior will be incorrect, but in some situations that can be tolerated (for instance, if we gather some statistics, occasional off-by-one errors will go unnoticed). Yet, if <code>i</code> was some counter that impacts the core behavior of the program, it might easily lead to a catastrophe.</p> <p>Ultimately, incorrect execution order should be considered the root cause of all synchronization problems. 
And here it is also manifest: we expected increment to be a single (<strong>atomic</strong>) operation and thus to finish execution before anything else happens to <code>i</code>.</p> <p>What are some other common cases of execution order errors? The most well-known and dreaded race condition is a <strong>deadlock</strong>. It is a situation of mutual blocking among two or more threads. Here is the simplest illustration of how it can occur:</p> <pre><code>thread 1 ---> acquire resource1 --> try to acquire resource2<br />thread 2 --> acquire resource2 ------> try to acquire resource1<br /></code></pre> <p>In other words, two threads need exclusive access to two resources, but the order of access is opposite, and the timing of the operations is such that the first thread manages to acquire access to the first resource while the second — to the second. After that, the deadlock is inevitable and both threads will be blocked as soon as they try to access the other resource. The period between a thread acquiring the first resource for exclusive access and the release of this resource is called a <strong>critical section</strong> of the program. A synchronization issue may manifest only in the critical sections.</p> <p>The only way to untangle such a deadlock situation "from within" is for one of the threads to release the resource it already holds. Another approach, which requires external intervention, is often employed in database management systems — deadlock monitoring. A separate thread periodically examines blocked threads to check for some conditions that signify a deadlock situation, and it resets the threads that were spotted in such a condition. Yet, instead of trying to fix the deadlock situations, it may be better to prevent them from occurring altogether. The prevention techniques may utilize time-limited exclusive leases on resources or mandating the threads to acquire resources in a specific order. 
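</p>

<p>The lock-ordering idea can be shown in a short sketch. It is not from the book's codebase: the portable <code>bordeaux-threads</code> library is assumed for the locks, and the helper names are made up for the example. If every thread acquires its locks according to a single global order (here, by descending creation rank), the circular wait from the diagram above cannot form:</p>

<pre><code>(defvar *lock-order* (make-hash-table :test 'eq))<br />(defvar *lock-count* 0)<br /><br />(defun make-ordered-lock (name)<br />  (let ((lock (bt:make-lock name)))<br />    ;; remember the global rank of each lock at creation time<br />    (:= (gethash lock *lock-order*) (incf *lock-count*))<br />    lock))<br /><br />(defun call-with-ordered-locks (fn &rest locks)<br />  "Acquire LOCKS in the global order, call FN, release in reverse order."<br />  (labels ((acquire (locks)<br />             (if locks<br />                 (bt:with-lock-held ((first locks))<br />                   (acquire (rest locks)))<br />                 (call fn))))<br />    (acquire (sort (copy-list locks) '><br />                   :key (lambda (lock) (gethash lock *lock-order*))))))<br /></code></pre>

<p>With such a discipline, both threads from the example would have to take the same resource first, so one of them would simply wait at the first lock instead of deadlocking at the second.</p> <p>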
However, such approaches are limited and don't cover all the use cases. It would be nice to find some way to totally exclude deadlocks, but we should remember that the original reason why they may occur at all is the need to prevent data corruption in the case of uncontrolled access to the data. Exclusive access to the resource ensures that this problem will not occur, but results in the possibility of a deadlock, which is a comparatively lesser evil.</p> <p>A <strong>livelock</strong> is a dynamic counterpart to deadlock, which occurs much more rarely. It is a situation when threads don't constantly hold the resources exclusively (for instance, they might release them after a timeout), but the timing of the operations is such that, at the time when the resource is needed by one thread, it happens to be occupied by the other, and, ultimately, mutual blocking still occurs. </p> <p>One more obnoxious race condition is <strong>priority inversion</strong> — a phenomenon one can frequently observe in real life: when a secondary lane of cars merges into the main road, but, for some extraneous reason (traffic light malfunctioning, an accident that is blocking part of the road, etc.), the cars from it get more opportunities to merge than the cars on the main road get to progress. Priority inversion may be the reason for a more severe problem, which is <strong>starvation</strong> — a situation when the execution of a thread is stalled as it can't access the resource it needs. Deadlocks result in starvation of all the involved threads, but the issue may occur in other conditions, as well. I would say that starvation or, more generally, underutilization is the most common performance issue of multithreaded applications.</p> <h2 id="lowlevelsynchronization">Low-Level Synchronization</h2> <p>I hope, in the previous section, the importance of ensuring proper execution order in the critical sections of the program was demonstrated well enough. How to approach this task? 
There are many angles of attack. Partially, the problem may be solved by the introduction of atomic operations. Atomic increment/decrement are a common example of those, which may be found in the ecosystems of the majority of programming languages. For instance, SBCL provides an <code>sb-ext:atomic-incf</code> macro that operates on the fixnum slots of structures, array cells, contents of cons pairs or global variables. Some other languages, like Java, provide <code>AtomicInteger</code> and similar structures that guarantee atomic operations on their main slot.</p> <p>What enables atomic operations are special hardware instructions:</p> <ul><li><code>TSL</code> — test and set lock</li> <li><code>CAS</code> — compare and swap</li> <li><code>LL/SC</code> — load link/store conditional</li></ul> <p>The most widespread of them is <code>CAS</code>, which has the same effect as if the following code worked as a single atomic operation:</p> <pre><code>(defmacro cas (place old new)<br /> `(when (eql ,place ,old)<br /> (:= ,place ,new)))<br /></code></pre> <p>Based on this spec, we could define <code>atomic-incf</code> using <code>cas</code>:</p> <pre><code>(defmacro atomic-incf (place &optional (i 1))<br /> (let ((cur (gensym "CUR"))<br /> (rez (gensym "REZ")))<br /> `(loop :for ,rez := (let ((,cur ,place))<br /> (cas ,place ,cur (+ ,cur ,i)))<br /> :when ,rez :do (return ,rez))))<br /></code></pre> <p>Here, we read the current value of <code>place</code>, and then try to set it with <code>cas</code>. These two operations happen non-atomically, so there's a chance that <code>cas</code> will return nil. In that case, we redo the whole sequence again. It is clear that the execution time of such an operation is non-deterministic, but, in a reasonably configured multithreaded system, there should be, generally, just a single chance for <code>cas</code> to fail: when the thread is preempted between the assignment and <code>cas</code>. 
The failure shouldn't repeat the next time this thread gets its time slice, as it will start right from these two operations and should have enough time to complete both.</p> <p>Another important low-level instruction is a <strong>memory barrier</strong>. It causes the CPU to enforce an ordering constraint on memory operations issued before and after the barrier instruction. I.e. the operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier. Memory barriers are necessary because most modern CPUs employ performance optimizations that can result in out-of-order execution. The reordering of memory loads and stores goes unnoticed within a single thread of execution but can cause unpredictable behavior in concurrent programs. One more leak from the low level adding to the list of synchronization worries...</p> <p>On top of CAS and atomic operations, some higher-level synchronization primitives are provided by the OS and the execution runtimes of most programming languages. The most popular of them is the <strong>semaphore</strong>. It is a counter that is initially set to the number of threads that can proceed past querying its value. If the counter is above zero, the thread may continue execution, but it also atomically decrements the counter. This operation is usually called <code>wait</code>, <code>acquire</code> or <code>down</code> on the semaphore. However, if the counter is already down to zero, the thread goes to sleep and is put into an OS waitqueue until a wakeup notification arrives. The notification is initiated by some thread calling <code>release</code>/<code>up</code> on the same semaphore. This operation atomically increments the counter value and also allows some of the waiting threads to continue execution. 
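</p>

<p>Returning to the <code>(incf i)</code> example from the beginning of the chapter, here is a minimal sketch of how a semaphore with the initial value of 1 protects the counter. The portable <code>bordeaux-threads</code> library is assumed: its <code>bt:make-lock</code> creates exactly this kind of lock. The whole read-modify-write sequence becomes a critical section that only one thread at a time may execute:</p>

<pre><code>(defvar *counter* 0)<br />(defvar *counter-lock* (bt:make-lock "counter"))<br /><br />(defun safe-incf ()<br />  ;; acquire the lock, increment, release: even if preempted,<br />  ;; no other thread can enter the critical section in between<br />  (bt:with-lock-held (*counter-lock*)<br />    (incf *counter*)))<br /><br />(let ((threads (loop :repeat 100<br />                     :collect (bt:make-thread<br />                               (lambda ()<br />                                 (loop :repeat 1000 :do (safe-incf)))))))<br />  (mapc 'bt:join-thread threads)<br />  *counter*)  ; => 100000, while a bare INCF would, likely, lose updates<br /></code></pre>

<p>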
The most used type of semaphore is the <strong>mutex</strong>. It allows only a single thread to enter and also mandates the implementation to check that the thread that releases the mutex is the one that has previously acquired it. There are also other types of semaphores or more complex locks built on top of them, such as the read-write lock or the monitor.</p> <p>Semaphores are an alternative to a lower-level <strong>spin-lock</strong> primitive that uses <strong>busy waiting</strong>, i.e. constant checking of the counter variable until it increases above zero. Another, more general, name for this method is <strong>polling</strong>, which refers to constantly querying the state of some resource (a lock, a network socket, a file descriptor) to know when its state changes. Polling has both drawbacks and advantages: it occupies the thread instead of yielding the CPU to other workers, which is a serious downside, but it also avoids the expensive OS context switches required by semaphores.</p> <p>So, both semaphores and spin-locks find their place. In the low-level OS code, spin-locks prevail, while semaphores are the default synchronization primitive in user space.</p> <h2 id="mutualexclusionalgorithms">Mutual Exclusion Algorithms</h2> <p>Relying on hardware features for synchronization is a common approach taken by most software systems. However, since the beginning of work on this problem, computer scientists, including such famous algorithmists as Dijkstra and Lamport, proposed mutual exclusion algorithms that allowed guarding the critical sections without any special support from the platform. One of the simplest of them is Peterson's algorithm. It guarantees mutual exclusion of two threads with the use of two variables: a two-element array <code>interest</code> and a variable <code>turn</code> that holds the index of the thread that has to yield. A true value of the <code>interest</code> item corresponding to a thread indicates that it wants to enter the critical section. 
Entrance is granted if the second thread does not want to enter or has yielded priority to the first thread.</p> <pre><code>(defparameter *interest* (vec nil nil))<br />(defparameter *turn* 0)<br /><br />(defun peterson-call (i fn)<br /> (let ((other (abs (1- i))))<br /> (:= (? *interest* i) t<br /> *turn* other)<br /> ;; busy waiting<br /> (loop :while (and (? *interest* other) (= *turn* other))) <br /> ;; critical section<br /> (call fn)<br /> (:= (? *interest* i) nil)))<br /></code></pre> <p>The algorithm satisfies the three essential criteria to solve the critical section problem: mutual exclusion, progress, and bounded waiting. Mutual exclusion means that several competing threads can never be in the critical section at the same time. For Peterson's algorithm, if thread 0 is in its critical section, then <code>(? *interest* 0)</code> is true. In addition, either <code>(? *interest* 1)</code> is nil (meaning thread 1 has left its critical section and isn't interested in coming back into it), or <code>*turn*</code> is 0 (meaning that thread 1 is just now trying to enter the critical section but waiting), or thread 1 is trying to enter its critical section, after setting <code>(? *interest* 1)</code> to true but before setting <code>*turn*</code> to 0. So if both processes are in the critical section then we conclude that the state must satisfy <code>(and (? *interest* 0) (? *interest* 1) (= *turn* 0) (= *turn* 1))</code>, which is, obviously, impossible. I.e. only one of the threads could have entered the section. The condition of progress, basically, says that only those threads that wish to enter the critical section can participate in making the decision as to which one will do it next, and that this selection cannot be postponed indefinitely. In our case, a thread cannot immediately reenter the critical section if the other thread has set its interest flag. 
Thus the thread that has just left the critical section will not impact the progress of the waiting thread. Bounded waiting means that the number of times a thread is bypassed by another thread after it has indicated its desire to enter the critical section is bounded by a function of the number of threads in the system. In Peterson's algorithm, a thread will never wait longer than one turn for entrance to the critical section.</p> <p>The drawback of Peterson's algorithm is busy waiting<a href="#f14-4" name="r14-4">[4]</a>. So, it may be compared to a spin-lock. There are a number of other similar algorithms, including Dekker's and Lamport's ones, which also share this property. The newer Szymański's algorithm is designed to avoid busy waiting, but it requires access to the OS scheduling facilities to make the thread sleep, waiting for the wakeup call, which makes the algorithm similar to semaphores.</p> <h2 id="highlevelsynchronization">High-Level Synchronization</h2> <p>All the mentioned synchronization primitives don't solve the challenges of synchronization completely. Rather, they provide tools that enable reasonable solutions but still require advanced understanding and careful application. The complexity of multithreaded programs is a level up compared to their single-threaded counterparts, and thus a lot of effort continues being spent on trying to come up with high-level ways to contain it, i.e. remove it from the sight of a regular programmer by providing primitives that handle synchronization behind the scenes. A simple example of that is Java's <code>synchronized</code> blocks and methods that employ an internal monitor to ensure atomic access to the state of an object. The major problem with regular locks (like semaphores) is that working with them brings us into the realm of global state manipulation. 
Such locking can't be isolated within the boundaries of a single function — it leaks through the whole caller chain, and this makes the program much harder to reason about. In this regard, it is somewhat similar to the use of <code>goto</code>, albeit on a larger scale, and so the push for higher-level synchronization facilities resembles Dijkstra's famous appeal to introduce structured programming ("goto considered harmful"). Ironically, Dijkstra is one of the creators of the classic synchronization mechanisms that are now frowned upon. However, synchronization has intrinsic complexity that can't be fully contained, so no silver bullet exists (and hardly ever will be created), and every high-level solution will be effective only in a subset of cases. I have seen this very well myself when teaching a course on system programming and witnessing how students solve the so-called classic synchronization problems. The task was to apply both classic synchronization techniques (semaphores et al.) and the new high-level ones (using Erlang, Haskell, Clojure or Go, which all provide some of those). The outcome, in terms of complexity, was not always in favor of the new approaches.</p> <p>There are a number of these classic synchronization problems, and I was even collecting them to be able to provide more variants of the tasks to diminish cheating. :) But, in essence, they all boil down to just a few archetypical cases: producer-consumer, readers-writers, sleeping barber, and the dining philosophers. Each problem demonstrates a certain basic synchronization scenario and allows the researchers to see how their approach will handle it. I won't include them in the book but strongly encourage anyone interested in this topic to study them in more detail and also to try to solve them using different synchronization mechanisms.</p> <p>Now, let's talk about some of the prominent high-level approaches. 
Remember that they try to change the paradigm and avoid the need for explicit locking of critical sections altogether.</p> <h3 id="lockfreedatastructures">Lock-free Data Structures</h3> <p>My favorite among them is lock-free data structures. This is a simple and effective idea that can help deal with many common use cases and, indeed, avoid the necessity for explicit synchronization. Still, their use is limited and, obviously, can't cover all the possible scenarios.</p> <p>The most important among them is arguably the lock-free queue. It can be implemented in different ways, and there's a simple and efficient implementation using <code>cas</code> provided by SBCL in the <a href="http://www.sbcl.org/manual/#sb_002dconcurrency">SB-CONCURRENCY</a> contrib package. Here is the implementation of the main operations (taken from the SBCL source code and slightly simplified):</p> <pre><code>(defstruct lf-queue<br /> (head (error "No HEAD.") :type cons)<br /> (tail (error "No TAIL.") :type cons))<br /><br />(defun lf-enqueue (value queue)<br /> (let ((new (cons value nil)))<br /> (loop (when (eq nil (sb-ext:compare-and-swap (cdr (lf-queue-tail queue))<br /> nil new))<br /> (:= (lf-queue-tail queue) new)<br /> (return value)))))<br /><br />(defun lf-dequeue (queue)<br /> (loop (with ((head (lf-queue-head queue))<br /> (next (cdr head)))<br /> (typecase next<br /> (null (return (values nil<br /> nil)))<br /> (cons (when (eq head (sb-ext:compare-and-swap (lf-queue-head queue)<br /> head next))<br /> (let ((value (car next)))<br /> (setf (cdr head) +dead-end+<br /> (car next) +dummy+)<br /> (return (values value<br /> t)))))))))<br /></code></pre> <p>The value of this structure lies in enabling the implementation of the master-worker pattern that is a backbone of many backend applications, as well as, in general, different forms of lock-free and wait-free coordination between the running threads. Basically, it's a lock-free solution to the producer-consumer problem. 
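</p>

<p>Here is a sketch of how the consumer side of such a coordination scheme might look. The <code>new-lf-queue</code> constructor (initializing both slots to the same dummy cons) and the <code>:stop</code> convention are assumptions made for the example; in the actual SB-CONCURRENCY package the structure is called <code>queue</code> and the operations are <code>enqueue</code>/<code>dequeue</code>:</p>

<pre><code>(defun new-lf-queue ()<br />  (let ((dummy (cons nil nil)))<br />    (make-lf-queue :head dummy :tail dummy)))<br /><br />(defvar *jobs* (new-lf-queue))<br /><br />(defun worker-loop (queue)<br />  "Consume and run jobs from QUEUE until a :stop message is dequeued."<br />  (loop (multiple-value-bind (job ok) (lf-dequeue queue)<br />          (cond ((not ok) (sleep 0.001))     ; empty queue, yield briefly<br />                ((eql job :stop) (return))<br />                (t (funcall job))))))<br /><br />;; the producer side: any thread may submit work at any time<br />(lf-enqueue (lambda () (print "hello from the pool")) *jobs*)<br /></code></pre>

<p>Note that neither side takes any locks: a preempted or even killed thread can never leave the queue in a locked state.</p> <p>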
The items are put in the queue by the producer threads (masters) and consumed by the worker threads. Such an architecture allows the programmer to separate concerns between different layers of the application: for instance, one type of threads may be responsible for handling incoming connections and, in order to ensure high availability of the system, these threads shouldn't spend much time processing them. So, after some basic processing, the connection sockets are put into the queue, from which the heavy-lifting worker threads can consume them and process in a more elaborate fashion. I.e. it's a job queue for a thread pool. Surely, a lock-based queue may also be utilized as an alternative, in these scenarios, but the necessity to lock from the caller's side makes the code for all the involved threads more complicated: what if a thread that has just acquired the lock is abruptly terminated for some reason?</p> <h3 id="dataparalelismandmessagepassing">Data-Parallelism and Message Passing</h3> <p>Beyond thread pools, there's a whole concept of data parallelism, which, in essence, lies in submitting different computations to the pool and implementing synchronization as an orchestration of those tasks. In addition, Node.js and Go use lock-free IO in conjunction with such thread pools (and a special syntax for its seamless integration) for an efficient implementation of user-space green threads to support this paradigm. </p> <p>Even further along this direction is Erlang, a whole language built around lock-free IO, efficient user-space threading, and a shared-nothing memory model. It is the language of message-passing concurrency that aims to solve all synchronization problems within this single approach. 
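</p>

<p>The gist of the model may be sketched on top of the lock-free queue from the previous section: a process is just a thread plus a private mailbox, and sending a message is the only way to interact with it. This is a toy illustration, not how Erlang is implemented: <code>bt:make-thread</code> comes from the portable bordeaux-threads library, and, within a single Lisp image, "copying" a message degenerates into passing a reference:</p>

<pre><code>(defstruct process mailbox thread)<br /><br />(defun spawn (behavior)<br />  "Start a thread that applies BEHAVIOR to each message from its mailbox."<br />  (let* ((dummy (cons nil nil))<br />         (mailbox (make-lf-queue :head dummy :tail dummy))<br />         (thread (bt:make-thread<br />                  (lambda ()<br />                    (loop (multiple-value-bind (msg ok) (lf-dequeue mailbox)<br />                            (if ok<br />                                (funcall behavior msg)<br />                                (sleep 0.001))))))))<br />    (make-process :mailbox mailbox :thread thread)))<br /><br />(defun send (process message)<br />  (lf-enqueue message (process-mailbox process)))<br /><br />;; a logger process: all its state is confined to a single thread,<br />;; so no locking is needed anywhere<br />(defvar *logger* (spawn (lambda (msg) (format t "log: ~a~%" msg))))<br />(send *logger* "hello")<br /></code></pre>

<p>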
As discussed in the beginning, such a stance has its advantages and drawbacks, and so Erlang fits some problems (like coordination between a large number of simple agents) exceptionally well, while for others it imposes unaffordable costs in terms of both performance and complexity.</p> <p>I won't go deeper into this topic as it is not directly related to the matter of this book.</p> <h3 id="stm">STM</h3> <p>Another take on concurrency is a technology that has been used, for quite a long time, in database systems and was reimplemented in several languages, being popularized by the author of Clojure — Software Transactional Memory (STM). The idea is to treat all data accesses in memory as parts of transactions — computations that possess the ACID properties: atomicity, consistency, and isolation (minus durability, which is only relevant to the database systems persisting data on disk). These transactions should still be initiated by the programmer, so the control over synchronization remains, to a large extent, in their hands, with some portion of the associated complexity. 
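</p>

<p>To give a feel for the programming model, here is a toy sketch of the optimistic flavor for the simplest case: a transaction over a single variable. A real STM tracks whole read and write sets across many variables; here, <code>sb-ext:compare-and-swap</code> plays the role of the commit-time conflict check, and the <code>tvar</code> structure is made up for the example:</p>

<pre><code>(defstruct tvar val)<br /><br />(defun tvar-transact (tvar fn)<br />  "Optimistically compute a new value with FN and commit it, retrying on conflict."<br />  (loop (let* ((old (tvar-val tvar))       ; remember the initial state<br />               (new (funcall fn old)))<br />          ;; the commit succeeds only if no other thread<br />          ;; has changed the tvar in the meantime<br />          (when (eql old (sb-ext:compare-and-swap (tvar-val tvar) old new))<br />            (return new)))))<br /><br />(defvar *account* (make-tvar :val 100))<br />(tvar-transact *account* (lambda (balance) (+ balance 42)))  ; => 142<br /></code></pre>

<p>The pessimistic flavor would, instead, acquire a lock on every involved variable upfront.</p> <p>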
The transactions may be implemented in different ways, but they will still use locking behind the scenes, and there are two main approaches to applying locking:</p> <ul><li>pessimistic — when the locks are acquired for the whole duration of the transaction, basically, making it analogous to a very conservative programming style that avoids deadlocks but seriously hinders program performance: acquiring all the locks at once and only then entering the critical section; in the context of STM, each separate variable will have its own lock</li> <li>optimistic — when the initial state of the transaction variables is remembered in the thread-local storage, and locking occurs only at the last (commit) phase, when all the changes are applied — but only if there were no external changes to the transaction variables; if at least one of them was changed, the whole transaction would need to be rolled back and retried</li></ul> <p>In both cases, the main issue is the same: contention. If the number of threads competing for the locks is small, the optimistic approach should perform better, while, in the opposite case, there will be too many rollbacks and even a possibility of a livelock.</p> <p>The optimistic transactions are, usually, implemented using the Multiversion Concurrency Control mechanism. MVCC ensures a transaction never has to wait to read an object by maintaining several versions of this object. Each version has both a Read Timestamp and a Write Timestamp, which lets a particular transaction read the most recent version of the object that precedes the transaction's own Read Timestamp.</p> <p>STM is an interesting technology, which hasn't yet proven its case beyond the distinct area of data management systems, such as RDBMSs and their analogs.</p> <h2 id="distributedcomputations">Distributed Computations</h2> <p>So far, we have discussed synchronization, mainly, in the context of software running in a single address space on a single machine. 
Yet, the same issues, although magnified, are also relevant to distributed systems. Actually, the same models of computation apply: shared-memory and shared-nothing message passing. Although, for distributed computing, message passing becomes much more natural, while the significance of shared-memory is seriously diminished and the "memory" itself becomes some kind of a network storage system like a database or a network file system.</p> <p>However, more challenges are imposed by the introduction of the unreliable network as a communication environment between the parts of a system. These challenges are reflected in the so-called "fallacies of distributed computing":</p> <ul><li>the network is reliable</li> <li>latency is zero</li> <li>bandwidth is infinite</li> <li>the network is secure</li> <li>topology doesn't change</li> <li>there is a single administrator</li> <li>transport cost is zero</li> <li>the network is homogeneous</li> <li>clocks on all nodes are synchronized</li></ul> <p>Another way to summarize those challenges, which is the currently prevailing way to look at them, is the famous Brewer's CAP Theorem, which states that any distributed system may have only two of the three desired properties at once: consistency, availability, and partition tolerance. And since partition tolerance is a required property of any network system as it's the ability to function in the unreliable network environment (which is the norm), the only possible distributed systems are CP and AP, i.e. they either guarantee consistency but might be unavailable at times, or are constantly available but might be sometimes inconsistent.</p> <h3 id="distributedalgorithms">Distributed Algorithms</h3> <p>Distributed computation requires distributed data structures and distributed algorithms. The domains that are in active development are distributed consensus, efficient distribution of computation, and efficient change propagation. 
Google pioneered the area of efficient network computation with the MapReduce framework that originated from the ideas of functional programming and Lisp, in particular. The next-generation systems, such as Apache Spark, develop these ideas even further.</p> <p>Yet, the primary challenge for distributed systems is efficient consensus. The addition of the unreliable network makes the problem nontrivial compared to the single-machine variant, where consensus may be achieved easily in a shared-memory setting. The world has seen an evolution of distributed consensus algorithms implemented in different data management systems, from the 2-Phase Commit (2PC) to the currently popular RAFT protocol.</p> <p><strong>2PC</strong> is an algorithm for coordination of all the processes that participate in a distributed atomic transaction on whether to commit or rollback the transaction. The protocol achieves its goal even in many cases of temporary system failure. However, it is not resilient to all possible failure configurations, and, in rare cases, manual intervention is needed. To accommodate recovery from failure, the participants of the transaction use logging of states, which may be implemented in different ways. Though usually intended to be used infrequently, recovery procedures compose a substantial portion of the protocol, due to the many possible failure scenarios that have to be considered.</p> <p>In a "normal execution" of any single distributed transaction, the 2PC consists of two phases:</p> <ol><li>The "commit-request" or voting phase, in which a coordinator process attempts to prepare all the participating processes to take the necessary steps for either committing or aborting the transaction and to vote, either "Yes" (commit) or "No" (abort).</li> <li>The "commit" phase, in which, based on the voting of the participants, the coordinator decides whether to commit (only if all have voted "Yes") or rollback the transaction, and notifies all the participants of the result. 
The participants then follow with the needed actions (commit or rollback) with their local transactional resources.</li></ol> <p>It is clear, from the description, that 2PC is a centralized algorithm that depends on the authority and high availability of the coordinator process. Centralized and peer-to-peer are the two opposite modes of network algorithms, and each algorithm is characterized by its level of centralization.</p> <p>The <strong>3PC</strong> is a refinement of the 2PC that is supposed to be more resilient to failures thanks to an intermediate stage called "prepared to commit". However, it doesn't solve the fundamental challenges of the approach that are due to its centralized nature, merely making the procedure more complex to implement and thus introducing more failure modes.</p> <p>The modern peer-to-peer coordination algorithm alternatives are Paxos and RAFT. RAFT is considered to be a simpler (and, thus, more reliable) approach. It is also, not surprisingly, based on voting. It adds a preliminary phase to each transaction, which is leader election. The election, as well as other activities within a transaction, doesn't require unanimous agreement, only a simple majority. Besides, execution of all the stages on each machine is timeout-based, so if a network failure or a node failure occurs, the operations are aborted and retried with an updated view of the other peers. The details of the algorithm can be best understood from the <a href="https://raft.github.io/">RAFT website</a>, which provides a link to the main paper, good visualizations, and other references.</p> <h3 id="distributeddatastructures">Distributed Data Structures</h3> <p>We have already mentioned various distributed hash-tables and content-addressable storage as examples of these types of structures. Another exciting and rapidly developing direction is eventually-consistent data structures or <strong>CRDTs</strong> (Conflict-free Replicated Data Types). 
They are the small-scale representatives of the AP (or eventually-consistent) systems that favor high availability over constant consistency, which becomes more and more the preferred mode of operation of distributed systems.</p> <p>The issue that CRDTs address is conflict resolution when different versions of the structure appear due to network partitions and their eventual repair. For a general data structure, if there are two conflicting versions, the solution is either to choose one (according to some general rules, like taking a random one or the latest one, or application-specific logic) or to keep both versions and defer conflict resolution to the client code. CRDTs are conflict-free, i.e. the structures are devised so that any conflict is resolved automatically in a way that doesn't cause any data loss or corruption.</p> <p>There are two ways to implement CRDTs: convergent structures rely on the replication of the whole state, while commutative structures use operation-based replication. Yet, both strategies result in CRDTs with equivalent properties.</p> <p>The simplest CRDT is a <strong>G-Counter</strong> (where "G" stands for grow only). Its operation is based on the trivial fact that addition is commutative, i.e. the order of applying the addition operation doesn't matter: we'll get the same result as long as the number of operations is the same. Every convergent CRDT has a <code>merge</code> operation that combines the states of each node. On each node, the G-Counter stores an array that holds the per-node numbers of the local increments. 
And its <code>merge</code> operation takes the maximums of the elements of this array across all nodes, while obtaining the value of the counter requires summing all of the cells:</p> <pre><code>(defstruct (g-counter (:conc-name nil))<br /> ccs)<br /><br />(defun make-gcc (n)<br /> ;; initialize all the per-node counters to 0<br /> (make-g-counter :ccs (make-array n :initial-element 0)))<br /><br />(defun gcc-val (gcc)<br /> (reduce '+ (ccs gcc)))<br /><br />(defun gcc-merge (gcc1 gcc2)<br /> ;; element-wise maximum of the per-node counters<br /> (make-g-counter :ccs (map 'vector 'max (ccs gcc1) (ccs gcc2))))<br /></code></pre> <p>The structure is eventually consistent as, at any point in time, asking any live node, we can get the current value of the counter from it (so there's constant availability). However, if not all changes have already been replicated to this node, the value may be smaller than the actual one (so consistency is achieved only eventually, after all the replications are over).</p> <p>The next step is a <strong>PN-Counter</strong> (positive-negative). It uses a common strategy in CRDT creation: combining several simpler CRDTs. In this case, it is a combination of two G-Counters: one for the number of increments and another for the number of decrements.</p> <p>A set is, in some sense, a more sophisticated analog of a counter (a counter may be considered a set of 1s). So, a <strong>G-Set</strong> functions similarly to a G-Counter: it allows each node to add items to the set, storing them in the relevant cell of the main array. The merging and value retrieval operations use <code>union</code>. Similarly, there's the <strong>2P-Set</strong> (2-phase) that is similar in construction to the PN-Counter. The difference of a 2P-Set from a normal set is that once an element is put into the removal G-Set (called the "tombstone" set) it cannot be re-added to the set. I.e. addition may be undone, but deletion is permanent. This misfeature is amended by the <strong>LWW-Set</strong> (last-write-wins) that adds timestamps to all the records. Thus, an item with a more recent timestamp prevails, i.e. 
if an object is present in both underlying G-Sets, it is considered present in the set if its timestamp in the addition set is greater than the one in the removal set, and removed in the opposite case.</p> <p>There are also more complex CRDTs used to model sequences, including Treedoc, RGA, Woot, Logoot, and LSEQ. Their implementations differ, but the general idea is that each character (or chunk of characters) is assigned a key that can be ordered. When new text is added, it's given a key that is derived from the key of some adjacent text. As a result, the merge is the best-possible approximation of the intent of the edits.</p> <p>The use cases for CRDTs are, as mentioned above, collaborative editing, maintaining such structures as shopping carts (e.g. with an LWW-Set), counters of page visits to a site or reactions in a social network, and so on and so forth.</p> <h3 id="distributedalgorithmsinactioncollaborativeediting">Distributed Algorithms in Action: Collaborative Editing</h3> <p>In fact, CRDTs are a data-structure-centric answer to another technology that has been used, for quite some time, to support collaborative editing: Operational Transformation (OT). OT was employed in such products as Google Docs and its predecessors to implement lock-free simultaneous rich-text editing of the same document by many actors.</p> <p>OT is an umbrella term that covers a whole family of algorithms sharing the same basic principles. Such systems use replicated document storage, i.e. each node in the system operates on its own copy in a non-blocking manner as if it were a single-user scenario. The changes from every node are constantly propagated to the rest of the nodes. When a node receives a batch of changes, it transforms the changes before executing them to account for the local changes that were already made since the previous changeset. Thus the name "operational transformation".</p> <p>The basic idea of OT can be illustrated with the following example. 
Let's say we have a text document with a string <code>"bar"</code> replicated by two nodes and two concurrent operations:</p> <pre><code>(insert 0 "f") # o1 on node1<br />(delete 2 "r") # o2 on node2<br /></code></pre> <p>Suppose, on node 1, the operations are executed in the order <code>o1</code>, <code>o2</code>. After executing <code>o1</code>, the document changes to <code>"fbar"</code>. Now, before executing <code>o2</code>, we must transform it against <code>o1</code> according to the transformation rules. As a result, it will change to <code>(delete 3 "r")</code>. So, the basic idea of OT is to adjust (transform) the parameters of incoming editing operations to account for the effects of the previously executed concurrent operations (whether they were invoked locally or received from some other node) so that the transformed operation can achieve the correct effect and maintain document consistency. The word "concurrent" here means operations that happened since some state that was recorded on the node that has sent the new batch of changes. The transformation rules are operation-specific.</p> <p>In theory, OT seems quite simple, but it has its share of implementation nuances and issues:</p> <ul><li>While the classic OT approach of defining operations through their offsets in the text seems to be simple and natural, real-world distributed systems raise serious issues: operations propagate with finite speed (remember one of the network fallacies), the states of the participants are often different, and thus the resulting combinations of states and operations are extremely hard to foresee and understand.</li> <li>For OT to work, every single change to the data needs to be captured: "Obtaining a snapshot of the state is usually trivial, but capturing edits is a different matter altogether. 
The richness of modern user interfaces can make this problematic, especially within a browser-based environment."</li> <li>The notion of a "point in time" relative to which the operations should be transformed is nontrivial to implement correctly (another network fallacy in play). Relying on global time synchronization is one of the approaches, but it requires tight control over the whole environment (which Google has demonstrated to be possible for its datacenter). So, in most cases, a distributed solution instead of simple timestamps is needed.</li></ul> <p>The most popular of these solutions is a <strong>vector clock</strong>. The VC of a distributed system of <code>n</code> nodes is a vector of <code>n</code> logical clocks, one clock per process; a local "smallest possible values" copy of the global clock-array is kept in each process, with the following rules for clock updates:</p> <ul><li>Initially, all clocks are zero.</li> <li>Each time a process experiences an internal event, it increments its own logical clock in the vector by one.</li> <li>Each time a process sends a message, it increments its own logical clock in the vector by one (as in the bullet above, but not twice for the same event) and then sends a copy of its own vector.</li> <li>Each time a process receives a message, it increments its own logical clock in the vector by one and updates each element in its vector by taking the maximum of the value in its own vector clock and the value in the vector in the received message.</li></ul> <p>You might notice that the operation of vector clocks is similar to the CRDT G-Counter.</p> <p>VCs allow the partial causal ordering of events. 
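</p> <p>To make the update rules above concrete, here is a minimal sketch of vector-clock bookkeeping for a fixed set of <code>n</code> processes (the function names are mine, for illustration, and not taken from any particular library):</p> <pre><code>(defun make-vc (n)<br /> (make-array n :initial-element 0))<br /><br />;; rule 2: an internal event increments the process's own slot<br />(defun vc-tick (vc i)<br /> (incf (svref vc i))<br /> vc)<br /><br />;; rule 3: tick, then ship a copy of the vector with the message<br />(defun vc-send (vc i)<br /> (copy-seq (vc-tick vc i)))<br /><br />;; rule 4: tick, then take the element-wise maximum<br />;; of the local vector and the received one<br />(defun vc-receive (vc i received)<br /> (map-into (vc-tick vc i) 'max vc received))<br /></code></pre> <p>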
A vector clock value for the event <code>x</code> is less than the value for <code>y</code> if and only if all the elements of <code>x</code>'s clock are less than or equal to the corresponding elements of <code>y</code>'s clock and, for at least one element, strictly smaller.</p> <p>Besides vector clocks, the other mechanisms to implement distributed partial ordering include Lamport Timestamps, Plausible Clocks, Interval Tree Clocks, Bloom Clocks, and others.</p> <h2 id="persistentdatastructures">Persistent Data Structures</h2> <p>To conclude this chapter, I wanted to say a few words about the role of the functional paradigm in synchronization and distributed computing. It's no coincidence that it was mentioned several times in the description of different synchronization strategies: essentially, functional programming is about achieving good separation of concerns by splitting computations into independent referentially-transparent units that are easier to reason about. Such an approach supports concurrency more natively than the standard imperative paradigm, although it might not be optimal computationally (at least, in the small). Yet, the gains obtained from parallelism and utilizing the scale of distributed computing may greatly outweigh this low-level inefficiency. So, with the advent of concurrent and distributed paradigms, functional programming gains more traction and adoption. Such ideas as MapReduce, STM, and message-passing-based coordination originated in the functional programming world.</p> <p>Another technology coming from the functional paradigm that is relevant to synchronization is Purely Functional Data Structures. Their principal property is that any modification doesn't cause a destructive effect on the previous version of the structure, i.e., with each change, a new version is created, while the old one may be preserved or discarded depending on the particular program requirements. 
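</p> <p>A tiny illustration: "updating" the head of a Lisp list creates a new version that shares structure with the old one, while the old version remains fully intact:</p> <pre><code>(defparameter *v1* (list 1 2 3))<br />(defparameter *v2* (cons 0 (rest *v1*)))  ; a new version with the head "replaced"<br /><br />*v1*  ; => (1 2 3), the old version is untouched<br />*v2*  ; => (0 2 3), the new version shares the tail (2 3)<br /></code></pre> <p>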
This feature makes them very well suited for concurrent usage as the possibility of corruption due to incorrect operation order is removed, and such structures are also compatible with any kind of transactional behavior. The perceived inefficiency of constant copying, in many cases, may be mostly avoided by using structure sharing. So the actual cost of maintaining these data structures is not proportional to their size, but rather constant or, at worst, logarithmic in the size. Another name for these structures is persistent data structures, in contrast to "ephemeral" ones, which operate by destructive modification.</p> <p>The simplest persistent functional structure is, as we already mentioned in one of the preceding chapters, a Lisp list<a href="#f14-5" name="r14-5">[5]</a> used as a stack. We have also seen the queue implemented with two stacks called a <strong>Real-Time Queue</strong>. It is a purely functional data structure, as well. The other examples are mostly either list- or tree-based, i.e. they also use the linked backbone structured in a certain way.</p> <p>To illustrate once again how most persistent data structures operate, we can look at a <strong>Zipper</strong> that may be considered a generalization of a Real-Time Queue. It is a technique of representing a data structure so that it is convenient for writing programs that traverse the structure arbitrarily and update its contents in a purely functional manner, i.e. without destructive operations. A list-zipper represents the entire list from the perspective of a specific location within it. It is a pair consisting of a recording of the reverse path from the list start to the current location and the tail of the list starting at the current location. In particular, the list-zipper of a list <code>(1 2 3 4 5)</code>, when created, will look like this: <code>(() . (1 2 3 4 5))</code>. As we traverse the list, it will change in the following manner:</p> <ul><li><code>((1) . 
(2 3 4 5))</code></li> <li><code>((2 1) . (3 4 5))</code></li> <li>etc.</li></ul> <p>If we want to replace 3 with 0, the list-zipper will become <code>((2 1) . (0 4 5))</code> while the previous version will still persist. The new zipper will reuse the list <code>(2 1)</code> and create a new list by consing 0 to the front of the sublist <code>(4 5)</code>. Consequently, the memory after performing 2 movements and 1 update will look like this:</p> <a href="https://2.bp.blogspot.com/-0H6RrTnzwL4/XoH5lFIHVDI/AAAAAAAACWE/N-s4hEuHsSU98dHxzboo-j_IDhXJJcCjQCLcBGAsYHQ/s1600/zipper.jpg" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-0H6RrTnzwL4/XoH5lFIHVDI/AAAAAAAACWE/N-s4hEuHsSU98dHxzboo-j_IDhXJJcCjQCLcBGAsYHQ/s1600/zipper.jpg" data-original-width="790" data-original-height="728" width="500" /></a> <p>It is apparent that each operation on the zipper (movement or modification) adds at most a single additional element. So, its complexity is the same as for normal lists (although, with larger constants).</p> <p>Zippers can operate on any linked structures. A very similar structure for trees is called a <strong>Finger Tree</strong>. To create it from a normal tree, we need to put "fingers" to the right and left ends of the tree and transform it like a zipper. A finger is simply a point at which you can access part of a data structure.</p> <p>Let's consider the case of a 2-3 tree, for which the finger approach was first developed. First, we restructure the entire tree and make the parents of the first and last children the two roots of our tree. The resulting finger tree is composed of several layers that sit along the spine of the tree. Each layer of the finger tree has a prefix (on the left) and a suffix (on the right), as well as a link further down the spine. 
The prefix and suffix contain values in the finger tree – on the first level, they contain values (2-3 trees of depth 0); on the second level, they contain 2-3 trees of depth 1; on the third level, they contain 2-3 trees of depth 2, and so on. This somewhat unusual property comes from the fact that the original 2-3 tree was of uniform depth. The edges of the original 2-3 tree are now at the top of the spine. The root of the 2-3 tree is now the very bottom element of the spine. As we go down the spine, we are traversing from the leaves to the root of the original 2-3 tree; as we go closer to the root, the prefixes and suffixes contain deeper and deeper subtrees of the original 2-3 tree.</p> <p>Now, the principle of operation (traversal or modification) on the finger tree is the same as with the zipper: with each change, some elements are tossed from one side of the spine to the other, and the number of such elements remains within the <code>O(log n)</code> limits.</p> <p>Finally, another data structure that is crucial for the efficient implementation of systems that rely solely on persistent data structures (like the Clojure language environment) is a <strong>Hash-Array Mapped Trie</strong> (HAMT). It may be used both in ephemeral and persistent mode to represent maps and sets with <code>O(log n)</code> access complexity<a href="#f14-6" name="r14-6">[6]</a>. HAMT is a special trie that uses the following two tricks:</p> <ul><li>as an array-mapped trie, instead of storing pointers to the children nodes in a key-value structure indexed with their subkeys, it stores an array of pointers and a bitmap that is used to determine if a pointer is present and at what position in the array it resides. This feature requires limiting the number of possible subkeys (for example, individual characters, which are the dominant use case for tries) to the length of the bitmap. 
The default length is 32, which is enough to represent the English alphabet :)</li> <li>however, the hash feature gives us a number of benefits, including lifting the limitations on the subkeys. Actually, in a HAMT, all values are stored at the leaves that have the same depth, while the subkeys are obtained by, first, hashing the key and then splitting the obtained hash into <code>n</code>-bit ranges (where <code>n</code> is usually 5, to match the 32-bit bitmap)<a href="#f14-7" name="r14-7">[7]</a>. Each subkey is used as an index into the bitmap: if the bit at that index is 1, the key is present. To calculate the index of a pointer in the pointer array, we need to perform <code>popcount</code> on the preceding bits.</li></ul> <p>With such a structure, all major operations will have <code>O(log n)</code> complexity with a base of 32, i.e. effectively <code>O(1)</code>. However, hash collisions are possible, so the hash-table-related collision considerations also apply to a HAMT. In other words, HAMTs are pretty similar to hash-tables, with the keys being split into parts and put into a trie. 
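</p> <p>The subkey extraction and bitmap indexing just described can be sketched as follows (a toy helper for illustration, not a complete HAMT implementation):</p> <pre><code>(defun hamt-subkey (hash level)<br /> ;; extract the 5-bit subkey for the given trie level from the hash<br /> (ldb (byte 5 (* 5 level)) hash))<br /><br />(defun hamt-child-index (bitmap subkey)<br /> ;; if the subkey's bit is set in the bitmap, return the position of the<br /> ;; child pointer in the node's array: the popcount of the preceding bits<br /> (when (logbitp subkey bitmap)<br />   (logcount (ldb (byte subkey 0) bitmap))))<br /></code></pre> <p>For instance, with the bitmap <code>#b1011</code> and the subkey <code>3</code>, the bit is set and the preceding bits <code>#b011</code> contain two 1s, so the pointer resides at index 2 of the array.</p> <p>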
However, due to their tree-based nature, the memory footprints and the runtime performance of iteration and equality checking of the HAMTs lag behind their array-based counterparts:</p> <ul><li>Increased memory overhead, as each internal node adds an overhead over a direct array-based encoding, so finding a small representation for internal nodes is crucial.</li> <li>On the other hand, HAMTs do not need expensive table resizing and do not waste (much) space on null references.</li> <li>Iteration is slower due to non-locality, while a hash-table uses a simple linear scan through a contiguous array.</li> <li>Deletes can cause the HAMT to deviate from the most compact representation (leaving nodes with no children in the tree).</li> <li>Equality checking can be expensive due to non-locality and the possibility of a degenerate structure resulting from deletes.</li></ul> <p>So, what's the value of this structure if it's just a slightly less efficient hash-table? The difference is that a HAMT can be implemented not only with destructive operations but, being a tree, can also be easily adapted to persistent mode with the usual path-copying trick that we have already seen.</p> <p>Complexity estimations for persistent data structures use amortized analysis to prove acceptable performance (<code>O(log n)</code>). Another trick at play here is called <strong>scheduling</strong>, and it lies in properly planning heavy structure-rebuilding operations and splitting them into chunks to avoid having to execute some of them at a time when optimal complexity can't be achieved. 
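</p> <p>The path-copying trick mentioned above can be sketched on a plain binary search tree: to "modify" a node, we copy it and every node on the path from the root down to it, while all the untouched subtrees are shared between the old and the new versions (a simplified sketch of my own, without balancing):</p> <pre><code>(defstruct tnode key val left right)<br /><br />(defun pbst-insert (tree key val)<br /> ;; return a new version of the tree with key set to val;<br /> ;; only the nodes on the search path are copied, the rest is shared<br /> (cond ((null tree) (make-tnode :key key :val val))<br />       ((< key (tnode-key tree))<br />        (make-tnode :key (tnode-key tree) :val (tnode-val tree)<br />                    :left (pbst-insert (tnode-left tree) key val)<br />                    :right (tnode-right tree)))  ; shared<br />       ((> key (tnode-key tree))<br />        (make-tnode :key (tnode-key tree) :val (tnode-val tree)<br />                    :left (tnode-left tree)  ; shared<br />                    :right (pbst-insert (tnode-right tree) key val)))<br />       (t (make-tnode :key key :val val<br />                      :left (tnode-left tree) :right (tnode-right tree)))))<br /></code></pre> <p>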
To learn more about these topics, read the seminal book by Chris Okasaki "Purely Functional Data Structures"<a href="#f14-8" name="r14-8">[8]</a> that describes these methods in more detail and provides complexity analysis for various structures.</p> <p>Besides, the immutability of persistent data structures enables additional optimizations that may be important in some scenarios:</p> <ul><li>native copy-on-write (COW) semantics that is required in some domains and algorithms</li> <li>objects can be easily memoized</li> <li>properties, such as hashes, sizes, etc., can be precomputed</li></ul> <p>The utility of persistent data structures is only gradually being realized and appreciated. Recently, some languages, including Clojure, were built around them as core structures. Moreover, some people even go as far as to claim that <a href="https://blog.jayway.com/2013/03/03/git-is-a-purely-functional-data-structure/">git is a purely functional data structure</a> due to its principal reliance on structure-sharing persistent trees to store the data.</p> <h2 id="takeaways">Take-aways</h2> <p>We have covered a lot of ground in this chapter at a pretty high level. Obviously, you can go much deeper: whole books are written on the topics of concurrency and distributed computing.</p> <p>Overall, concurrency can be approached from, at least, three different directions:</p> <ol><li>There's a low-level view: the means that should be provided by the underlying platforms to support concurrent operation. It includes the threading/process APIs, the atomic operations, and the synchronization and networking primitives.</li> <li>Then, there's an architecture viewpoint: what constraints our systems should satisfy and how to ensure that. At this level, the main distinctions are drawn: shared-memory vs shared-nothing, centralized vs peer-to-peer.</li> <li>And, last but not least, comes the algorithmic perspective. 
What data structures (as usual, they are, in fact, more important than the algorithms) can be used to satisfy the constraints in the most efficient way possible, or to simplify the architecture? We have seen several examples of special-purpose ones that cater to the needs of a particular problem: lock-free, eventually-consistent, and purely functional persistent data structures. And then, there are some areas where special-purpose algorithms also play a major role. Their main purpose, there, is not so much computational efficiency (as we're used to), but, mostly, correctness coupled with good enough efficiency<a href="#f14-9" name="r14-9">[9]</a>. Mutual exclusion and distributed consensus algorithms are examples of such targeted algorithm families.</li></ol> <p>There's a lot of room for further research in the realms of synchronization and, especially, distributed computation. It is unclear whether the new breakthroughs will come from our current computing paradigms or whether we'll have to wait for a new tide and new approaches. Anyway, there's still a chance to make a serious and lasting contribution to the field by developing new algorithm-related stuff. And not only that. Unlike other chapters, we haven't talked much here about the tools that can help a developer of concurrent programs. The reason for that is, actually, an apparent lack of such tools, at least, of widely adopted ones. Surely, the toolbox we have already studied in the previous chapters is applicable here, but an environment with multiple concurrent threads and, possibly, multiple address spaces adds new classes of issues and seriously complicates debugging. There are network service tools to collect metrics and execution traces, but none of them is tightly integrated into the development toolboxes, not to speak of their limited utility. 
So, substantial pieces are still missing from the picture and are waiting to be filled.</p><hr size="1"><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-1" name="f14-1">[1]</a> We will further use the term "thread" to denote a separate computation running as part of our application, as it is less ambiguous than "process" and also much more widespread than all the other terms.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-2" name="f14-2">[2]</a> This internal "threading", usually, also relies on the OS threading API behind the scenes.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-3" name="f14-3">[3]</a> In this context, they tend to be called "processes", but we'll still stick to the term "thread".</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-4" name="f14-4">[4]</a> The other apparent limitation of supporting only two threads can be lifted by a modification to the algorithm, which requires some hardware support.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-5" name="f14-5">[5]</a> If we forbid the destructive <code>rplaca</code>/<code>rplacd</code> operations and their derivatives.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-6" name="f14-6">[6]</a> and a quite high algorithm base — usually 32 — that means very shallow trees resulting in just a handful of hops even for quite large structures</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-7" name="f14-7">[7]</a> Except for the length of the leftmost range that depends on the number of bits in a hash. 
For instance, for a 32-bit hash, it may be 7, and the depth of the whole HAMT would be 5.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-8" name="f14-8">[8]</a> His <a href="https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf">thesis</a> with the same title is freely available, but the book covers more and is more accessible.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r14-9" name="f14-9">[9]</a> The reason for that might be the relative immaturity of this space, as well as its complexity, so that our knowledge of it hasn't been developed enough to reach the stage when optimization becomes the main focus.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-78210322917277323192020-02-19T14:01:00.001+02:002020-02-19T19:53:16.490+02:00Programming Algorithms: Compression<p>Compression is one of the tools that every programmer should understand and wield confidently. Situations when the size of the dataset is larger than what the program can handle directly, so that it becomes a bottleneck, are quite frequent and can be encountered in any domain. There are many forms of compression, yet the most general subdivision is between lossless compression, which preserves the original information intact, and lossy compression, which discards some information (assumed to be the most useless part or just noise). Lossless compression is applied to numeric or text data, whole files or directories — the data that will become partially or utterly useless if even a slight modification is made. Lossy compression, as a rule, is applied to data that originates in the "analog world": sound or video recordings, images, etc. We have touched on the subject of lossy compression briefly in the previous chapter when talking about such formats as JPEG. 
In this chapter, we will discuss the lossless variants in more detail. Besides, we'll talk a bit about other, non-compressing, forms of encoding.</p> <h2 id="encoding">Encoding</h2> <p>Let's start with encoding. Lossless compression is, in fact, a form of encoding, but there are other, simpler forms. And it makes sense to understand them before moving to compression. Besides, encoding itself is a fairly common task. It is the mechanism that transforms the data from an internal representation of a particular program into some specific format that can be recognized and processed (decoded) by other programs. What we gain is that the encoded data may be serialized and transferred to other computers and decoded by other programs, possibly, independent of the program that performed the encoding.</p> <p>Encoding may be applied to different semantic levels of the data. Character encoding operates on the level of individual characters or even bytes, while various serialization formats deal with structured data. There are two principal approaches to serialization: text-based and binary. Their pros and cons are the opposite of each other: text-based formats are easier for humans to handle but are usually more expensive to process, while binary variants are not transparent (and so, much harder to deal with) but much faster to process. From the point of view of algorithms, binary formats are, obviously, better. But my programming experience tells me that they are often a severe form of premature optimization. The rule of thumb should be to always start with text-based serialization and move to binary formats only as a last resort, when it is proven that the impact on the program's performance will be significant.</p> <h2 id="base64">Base64</h2> <p>Encoding may have both a reduction and a magnification effect on the size of the data. For instance, there's a popular encoding scheme — Base64. 
It is a byte-level (lowest level) encoding that doesn't discriminate between different input data representations and formats. Instead, the encoder just takes a stream of bytes and produces another stream of bytes. Or, more precisely, bytes in the specific range of English ASCII letters, numbers, and three more characters (usually, <code>+</code>, <code>/</code>, and <code>=</code>). This encoding is often used for transferring data on the Web, in conjunction with SMTP (MIME), HTTP, and other popular protocols. The idea behind it is simple: split the data stream into sextets (6-bit parts — there are 64 different variants of those), and map each sextet to an ASCII character according to a fixed dictionary. As the last byte of the original data may not align with the last sextet, an additional padding character (<code>=</code>) is used to indicate 2 (<code>=</code>) or 4 (<code>==</code>) misaligned bits. As we see, Base64 encoding increases the size of the input data by a factor of 4/3 (every 3 bytes become 4 characters).</p> <p>Here is one of the ways to implement a Base64 serialization routine:</p> <pre><code>(defparameter *b64-dict*<br /> (coerce (append (loop :for ch :from (char-code #\A) :to (char-code #\Z)<br /> :collect (code-char ch))<br /> (loop :for ch :from (char-code #\a) :to (char-code #\z)<br /> :collect (code-char ch))<br /> (loop :for ch :from (char-code #\0) :to (char-code #\9)<br /> :collect (code-char ch))<br /> '(#\+ #\/ #\=))<br /> 'simple-vector))<br /><br />(defun b64-encode (in out)<br /> (let ((key 0)<br /> (limit 6))<br /> (flet ((fill-key (byte off beg limit)<br /> (:= (ldb (byte limit off) key)<br /> (ldb (byte limit beg) byte))<br /> (:= off (- 6 beg)))<br /> (emit1 (k)<br /> (write-byte (char-code (svref *b64-dict* k)) out)))<br /> (loop :for byte := (read-byte in nil) :while byte :do<br /> (let ((beg (- 8 limit)))<br /> (fill-key byte 0 beg limit)<br /> (emit1 key)<br /> (fill-key byte (:= limit (- 6 beg)) 0 beg)<br /> (when (= 6 beg)<br /> (emit1 key)<br /> (:= limit 6))))<br 
/> (when (< limit 6)<br /> (:= (ldb (byte limit 0) key)<br /> (ldb (byte limit 0) 0))<br /> (emit1 key)<br /> (loop :repeat (ceiling limit 2) :do<br /> (emit1 64))))))<br /></code></pre> <p>This is one of the most low-level pieces of Lisp code in this book. It could be written in a much more high-level manner: utilizing the generic sequence access operations, say, on bit-vectors, instead of the bit-manipulating ones on numbers. However, it would also be orders of magnitude slower due to the need to constantly "repackage" the bits, converting the data from integers to vectors and back. I also wanted to show a bit of bit fiddling in Lisp. The standard, in fact, defines a comprehensive vocabulary of bit manipulation functions, and there's nothing stopping the programmer from writing performant code operating at a single bit level.</p> <p>One important choice made for Base64 encoding is the usage of streams as the input and output. This is a common approach to such problems based on the following considerations:</p> <ul><li>It is quite easy to wrap the code so that we could feed/extract strings as inputs and outputs. Doing the opposite, wrapping string-based code for stream operation, is also possible, but it defeats the whole purpose of streams, which is...</li> <li>Streams allow us to efficiently handle data of any size and not waste memory, as well as CPU, for storing intermediary copies of the strings we're processing. Encoding a huge file is a good illustration of why this matters: with streams, we do it in an obvious manner: <code>(with-open-file (in ...) (with-out-file (out) (b64-encode in out)))</code>. With strings, however, it will mean, first, reading the file contents into memory — and we may not even have enough memory for that. And, after that, filling another big chunk of memory with the encoded data, which we'll still, probably, need to either dump to a file or send over the network.</li></ul> <p>So, what happens in the code above? 
First, the <code>byte</code>s are read from the binary input stream <code>in</code>, then each one is split into 2 parts. The higher bits are set into the current base64 <code>key</code>, which is translated, using <code>*b64-dict*</code>, into an appropriate byte and emitted to the binary output stream <code>out</code>. The lower bits are deposited in the higher bits of the next key in order to use this leftover during the processing of the next byte. However, if the leftover from the previous byte was 4 bits, at the current iteration, we will have 2 base64 bytes available as the first will use 2 bits from the incoming <code>byte</code>, and the second will consume the remaining 6 bits. This is addressed in the code block <code>(when (= 6 beg) ...)</code>. The function relies on the standard Lisp <code>ldb</code> operation which provides access to the individual bits of an integer. It uses the byte-spec <code>(byte limit offset)</code> to control the bits it wants to obtain.</p> <p>Implementing a decoder procedure is left as an exercise to the reader...</p> <p>Taking the example from the Wikipedia article, we can see our encoding routine in action (here, we also rely on the <a href="http://edicl.github.io/flexi-streams/">FLEXI-STREAMS</a> library to work with binary in-memory streams):</p> <pre><code>CL-USER> (with-input-from-string (str "Man i")<br /> (let ((in (flex:make-in-memory-input-stream <br /> (map 'vector 'char-code<br /> (loop :for ch := (read-char str nil) :while ch :collect ch))))<br /> (out (flex:make-in-memory-output-stream)))<br /> (b64-encode in out)<br /> (map 'string 'code-char (? out 'vector))))<br />"TWFuIGk="<br /></code></pre> <p>This function, although it's not big, is quite hard to debug due to the need for careful tracking and updating of the offsets into both the current base64 chunk (<code>key</code>) and the <code>byte</code> being processed. 
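<p>To cross-check the bit arithmetic by hand, the same sextet-splitting scheme can be sketched in Python (an illustrative re-implementation, not a translation of the Lisp code; the stdlib <code>base64</code> module is used only to verify the result):</p>

```python
import base64

B64_DICT = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

def b64_encode(data: bytes) -> str:
    # Concatenate all input bits, then slice them into sextets
    bits = "".join(f"{byte:08b}" for byte in data)
    pad = (-len(bits)) % 6          # 0, 2, or 4 zero bits to complete a sextet
    bits += "0" * pad
    chars = [B64_DICT[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # One '=' per 2 bits of padding: 2 bits -> "=", 4 bits -> "=="
    return "".join(chars) + "=" * (pad // 2)

assert b64_encode(b"Man i") == "TWFuIGk="   # the example from the text
assert b64_encode(b"Man i") == base64.b64encode(b"Man i").decode()
```

<p>Operating on a bit string is, of course, far slower than the <code>ldb</code>-based version, but it makes the sextet boundaries and the padding rule easy to see at a glance.</p>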
What really helps me tackle such situations is a piece of paper that serves to record several iterations with all the relevant state changes. Something along these lines:</p> <pre><code> M (77) | a (97) | n (110)<br /> 0 1 0 0 1 1 0 1|0 1 1 0 0 0 0 1|0 1 1 0 1 1 1 0<br />0: 0 1 0 0 1 1 | | 19 = T<br /> 0 1| |<br />1: 0 1|0 1 1 0 | 22 = W<br /> | 0 0 0 1|<br />2: | 0 0 0 1|0 1 5 = F<br /><br />Iteration 0:<br /><br />beg: 2<br />off: 0<br />limit: 6<br /><br />beg: 0<br />off: (- 6 2) = 4<br />limit: 2<br /><br /><br />Iteration 1:<br /><br />beg: 4<br />off: 0<br />limit: 4<br /><br />beg: 0<br />off: (- 6 4) = 2<br />limit: 4<br /></code></pre> <p>Another thing that is indispensable, when coding such procedures, is the availability of reference examples of the expected result, like the ones in Wikipedia. The Lisp REPL makes iterating on a solution and constantly rechecking the results, using such available data, very easy. However, sometimes, it makes sense to reject the transient nature of code in the REPL and record some of the test cases as unit tests. As the motto of my test library <a href="https://github.com/vseloved/should-test">SHOULD-TEST</a> declares: you should test even Lisp code sometimes :) The tests also help the programmer to remember and systematically address the various corner cases. In this example, one of the special cases is the padding at the end, which is handled in the code block <code>(when (< limit 6) ...)</code>. Due to the availability of a clear spec and reference examples, this algorithm lends itself very well to automated testing. As a general rule, all code paths should be covered by the tests. If I were to write those tests, I'd start with the following simple version. 
They address all 3 variants of padding and also the corner case of an empty string.</p> <pre><code>(deftest b64-encode ()<br /> ;; B64STR would be the function wrapped over the REPL code presented above<br /> (should be blankp (b64str ""))<br /> (should be string= "TWFu" (b64str "Man"))<br /> (should be string= "TWFuIA==" (b64str "Man "))<br /> (should be string= "TWFuIGk=" (b64str "Man i")))<br /></code></pre> <p>Surely, many more tests should be added to a production-level implementation: to validate operation on non-ASCII characters, handling of huge data, etc.</p> <h2 id="losslesscompression">Lossless Compression</h2> <p>The idea behind lossless compression is straightforward: find an encoding that is tailored to our particular dataset and allows the encoding procedure to produce a shorter version than using a standard encoding. Not being general-purpose, the vocabulary for this encoding may use a more compact representation for those things that occur often, and a longer one for those that appear rarely, skipping altogether those that don't appear at all. Such an encoding scheme will, probably, be structure-agnostic and just convert sequences of bytes into other sequences of a smaller size, although custom structure-aware compression is also possible.</p> <p>This approach can be explained with a simple example. The phrase "this is a test" uses 8-bit characters to represent each letter, and an 8-bit encoding can distinguish 256 different characters in total. However, for this particular message, only 7 characters are used: <code>t</code>, <code>h</code>, <code>i</code>, <code>s</code>, <code>#\Space</code>, <code>a</code>, and <code>e</code>. 7 characters, in theory, need only 2.81 bits to be distinguished. Encoding them in just 3 bits instead of 8 will reduce the size of the message almost thrice. 
In other words, we could create the following vocabulary (where <code>#*000</code> is a Lisp literal representation of a zero bit-vector of 3 bits):</p> <pre><code>#h(#\t #*000<br /> #\h #*001<br /> #\i #*010<br /> #\s #*011<br /> #\a #*100<br /> #\e #*101<br /> #\Space #*110)<br /></code></pre> <p>Using this vocabulary, our message could be encoded as the following bit-vector: <code>#*000001010011110010011110100110000101011000</code>. The downside, compared to using some standard encoding, is that we now need to package the vocabulary alongside the message, which will make its total size larger than the original that used an 8-bit standard encoding with a known vocabulary. It's clear, though, that, as the message becomes longer, the fixed overhead of the vocabulary will quickly be exceeded by the gain from message size reduction. However, we have to account for the fact that the vocabulary may also continue to grow and require more and more bits to represent each entry (for instance, if we use all Latin letters and numbers it will soon reach 6 or 7 bits, and our gains will diminish as well). Still, the difference may be pre-calculated and the decision made for each message or a batch of messages. For instance, in this case, the vocabulary size may be, say, 30 bytes, and the message size reduction is 62.5%, so a message of 50 or more characters will already be more compact if encoded with this vocabulary even when the vocabulary itself will be sent with it. The case of only 7 characters is pretty artificial, but consider that DNA strings have only 4 characters.</p> <p>However, this simplistic approach is just the beginning. Once again, if we use an example of the Latin alphabet, some letters, like <code>q</code> or <code>x</code>, may end up used much less frequently than, say, <code>p</code> or <code>a</code>. Our encoding scheme uses equal-length vectors to represent them all. 
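<p>For reference, the fixed-length scheme just described fits in a few lines of Python (an illustrative sketch; the helper name <code>fixed_codes</code> is made up for this example):</p>

```python
from math import ceil, log2

def fixed_codes(message: str) -> dict:
    # Assign an equal-length binary code to each distinct character
    alphabet = sorted(set(message))
    width = max(1, ceil(log2(len(alphabet))))   # 7 characters -> 3 bits
    return {ch: format(i, f"0{width}b") for i, ch in enumerate(alphabet)}

codes = fixed_codes("this is a test")
encoded = "".join(codes[ch] for ch in "this is a test")
assert len(encoded) == 3 * 14       # 42 bits vs 112 for 8-bit characters
# Decoding is just reading fixed-width chunks back
decode = {v: k for k, v in codes.items()}
assert "".join(decode[encoded[i:i + 3]]
               for i in range(0, len(encoded), 3)) == "this is a test"
```

<p>The round-trip decode works precisely because every code has the same width, so chunk boundaries never need to be guessed — the property the variable-length schemes below have to re-establish with prefix-freeness.</p>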
Yet, if we were to use shorter representations for more frequently used chars at the expense of longer ones for the characters occurring less often, additional compression could be gained. That's exactly the idea behind Huffman coding.</p> <h2 id="huffmancoding">Huffman Coding</h2> <p>Huffman coding tailors an optimal "alphabet" for each message, sorting all letters based on their frequency and putting them in a binary tree, in which the most frequent ones are closer to the top and the less frequent ones — to the bottom. This tree allows calculating a unique encoding for each letter based on a sequence of left or right branches that need to be taken to reach it from the top. The key trick of the algorithm is the usage of a heap to maintain the characters (both individual and groups of already processed ones) in sorted order. It builds the tree bottom-up by first extracting the two least frequent letters and combining them: the least frequent on the left, the more frequent — on the right. Let's consider our test message. In it, the letters are sorted by frequency in the following order:</p> <pre><code>((#\a 1) (#\e 1) (#\h 1) (#\i 2) (#\s 3) (#\t 3) (#\Space 3)) <br /></code></pre> <p>Extracting the first two letters results in the following treelet:</p> <pre><code> ((#\a #\e) 2)<br /> / \<br />(#\a 1) (#\e 1)<br /></code></pre> <p>Uniting the two letters creates a tree node with a total frequency of 2. 
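<p>This extract-and-merge step repeats until a single node remains. The whole procedure can be sketched in Python with the stdlib <code>heapq</code> (illustrative only; the exact codes depend on tie-breaking among equal frequencies, but the total encoded length does not):</p>

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(message: str) -> dict:
    freqs = Counter(message)
    tie = count()  # tie-breaker: nested tuples aren't mutually comparable
    heap = [(f, next(tie), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fl, _, left = heapq.heappop(heap)    # least frequent goes left (0)
        fr, _, right = heapq.heappop(heap)   # next least frequent goes right (1)
        heapq.heappush(heap, (fl + fr, next(tie), (left, right)))
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):          # internal node: recurse
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                                # leaf: a character
            codes[node] = path or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("this is a test")
encoded = "".join(codes[ch] for ch in "this is a test")
assert len(encoded) == 38   # same total as the hand-built tree below
```

<p>Note the merged node goes back onto the heap with the combined frequency, exactly mirroring the pen-and-paper steps shown here.</p>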
To use this information further, we add it back to the queue in place of the original letters, and it continues to represent them during the next steps of the algorithm:</p> <pre><code>((#\h 1) ((#\a #\e) 2) (#\i 2) (#\s 3) (#\t 3) (#\Space 3)) <br /></code></pre> <p>By continuing this process, we'll come to the following end result:</p> <pre><code> ((#\s #\t #\Space #\i #\h #\a #\e) 14)<br /> / \<br /> ((#\s #\t) 6) ((#\Space #\i #\h #\a #\e) 8)<br /> / \ / \<br />(#\s 3) (#\t 3) (#\Space 3) ((#\i #\h #\a #\e) 5)<br /> / \ <br /> (#\i 2) ((#\h #\a #\e) 3) <br /> / \<br /> (#\h 1) ((#\a #\e) 2)<br /> / \<br /> (#\a 1) (#\e 1)<br /></code></pre> <p>From this tree, we can construct the optimal encoding:</p> <pre><code>#h(#\s #*00<br /> #\t #*01<br /> #\Space #*10<br /> #\i #*110<br /> #\h #*1110<br /> #\a #*11110<br /> #\e #*11111)<br /></code></pre> <p>Compared to the simple approach that used a constant 3 bits per character, it takes 1 bit less for the 3 most frequent letters, 1 bit more for <code>h</code>, and 2 bits more for the two least frequent ones. The encoded message becomes: <code>#*01111011000101100010111101001111110001</code>, and it has a length of 38 compared to 42 for our previous attempt.</p> <p>To be clear, here are the encoding and decoding methods that use the pre-built vocabulary (for simplicity's sake, they operate on vectors and strings instead of streams):</p> <pre><code>(defun huffman-encode (envocab str)<br /> (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))<br /> (dovec (char str)<br /> (dovec (bit (? envocab char))<br /> (vector-push-extend bit rez)))<br /> rez))<br /><br />(defun huffman-decode (devocab vec)<br /> (let (rez)<br /> (dotimes (i (length vec))<br /> (dotimes (j (- (length vec) i))<br /> (when-it (? 
devocab (slice vec i (+ i j 1)))<br /> (push it rez)<br /> (:+ i j)<br /> (return))))<br /> (coerce (reverse rez) 'string)))<br /></code></pre> <p>It is worth recalling that <code>vector-push-extend</code> is implemented in a way that will not adjust the array by only 1 bit each time it is called. The efficient implementation "does the right thing", for whatever the right thing means in this particular case (maybe, adjusting by 1 machine word). You can examine the situation in more detail by trying to extend the array by hand (using <code>adjust-array</code> or providing a third optional argument to <code>vector-push-extend</code>) and comparing the time taken by the different variants, to verify this.</p> <p>Finally, here is the most involved part of the Huffman algorithm, which builds the encoding and decoding vocabularies (with the help of a heap implementation we developed in the chapter on Trees):</p> <pre><code>(defun huffman-vocabs (str)<br /> ;; here we assume more than a single unique character in STR<br /> (let ((counts #h())<br /> (q (make-heap :op '< :key 'rt))<br /> (envocab #h())<br /> (devocab #h(equal))) ; bit-vectors as keys require 'equal comparison<br /> ;; count character frequencies<br /> (dovec (char str)<br /> (:+ (get# char counts 0))) ; here, we use the default third argument of gethash set to 0<br /> ;; heapsort the characters based on their frequency<br /> (dotable (char count counts)<br /> (heap-push (pair char count) q))<br /> ;; build the tree<br /> (dotimes (i (1- (heap-size q)))<br /> (with (((lt cl) (heap-pop q))<br /> ((rt cr) (heap-pop q)))<br /> (heap-push (pair (list lt rt) (+ cl cr))<br /> q)))<br /> ;; traverse the tree in DFS manner<br /> ;; encoding the path to each leaf node as a bit-vector<br /> (labels ((dfs (node &optional (level 0) path)<br /> (if (listp node)<br /> (progn<br /> (dfs (lt node) (1+ level) (cons 0 path))<br /> (dfs (rt node) (1+ level) (cons 1 path)))<br /> (let ((vec (make-array level :element-type 
'bit<br /> :initial-contents (reverse path))))<br /> (:= (? envocab node) vec<br /> (? devocab vec) node)))))<br /> (dfs (lt (heap-pop q))))<br /> (list envocab devocab)))<br /></code></pre> <h3 id="hufmancodinginaction">Huffman Coding in Action</h3> <p>Compression is one of the areas for which it is especially interesting to directly compare the measured gain in space usage to the one expected theoretically. Yet, as we discussed in one of the previous chapters, such measurements are not as straightforward as execution speed measurements. Yes, if we compress a single sequence of bytes into another one, there's nothing more trivial than to compare their lengths, but, in many tasks, we want to see a cumulative effect of applying compression on a complex data structure. This is what we're going to do next.</p> <p>Consider the problem that I had in my work on the tool for text language identification <a href="https://github.com/vseloved/wiki-lang-detect">WIKI-LANG-DETECT</a>. This software relies on a number of big dictionaries that map strings (character trigrams and individual words) to floats. The obvious approach to storing these maps is with a hash-table. However, due to the huge number of keys, such a table will, generally, have a sizeable overhead, which we would like to avoid. Besides, the keys are strings, so they have a good potential for reduction in occupied size when compressed. The data is also serialized into per-language files in the tab-separated format. Here is a sample of a word-level file for the Danish language:</p> <pre><code>afrika -8.866735<br />i -2.9428265<br />the -6.3879676<br />ngo -11.449115<br />of -6.971129<br />kanye -12.925021<br />e -8.365895<br />natal -12.171249<br /></code></pre> <p>Our task is to load the data in memory so that access to the keys has constant runtime and minimal occupied space.</p> <p>Let's begin with a simple hash-table-based approach. 
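<p>The overhead in question is not Lisp-specific. Here is a quick CPython illustration of how much the container and the boxed keys cost compared to the raw payload (the key names are synthetic, and the exact numbers are implementation-dependent):</p>

```python
import sys

# Build a dict shaped like the trigram tables: short string keys -> floats
table = {f"trigram{i}": -8.866735 for i in range(100_000)}

container = sys.getsizeof(table)                # the bucket arrays alone
keys = sum(sys.getsizeof(k) for k in table)     # boxed string objects
values = sum(sys.getsizeof(v) for v in table.values())
payload = sum(len(k) for k in table)            # raw characters, as on disk

# The in-memory structure costs several times the raw payload
assert container + keys + values > 3 * payload
```

<p>The same multiple shows up in the SBCL measurements that follow: constant-time access is bought with per-key object headers and a sparsely filled bucket array.</p>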
The following function will load two files from the default directory (<code>*default-pathname-defaults*</code>) and return a list of two hash-tables: for the word and trigram probabilities.</p> <pre><code>(defun load-data-into-hts (lang)<br /> (declare (optimize sb-c::instrument-consing))<br /> (mapcar (lambda (kind)<br /> (let ((rez (make-hash-table :test 'equal)))<br /> (dolines (line (fmt "~A-~A.csv" lang kind))<br /> (let ((space (position #\Space line)))<br /> (set# (slice line 0 space) rez<br /> (read-from-string (slice line (1+ space))))))<br /> rez))<br /> '("words" "3gs")))<br /></code></pre> <p>To measure the space it will take, we'll use a new SBCL extension called allocation profiling from the <code>sb-aprof</code> package<a href="#f13-1" name="r13-1">[1]</a>. To enable the measurement, we have put a special declaration immediately after the defun header: <code>(optimize sb-c::instrument-consing)</code>.</p> <p>Now, prior to running the code, let's look at the output of <code>room</code>:</p> <pre><code>CL-USER> (room)<br />Dynamic space usage is: 60,365,216 bytes.<br />...<br /></code></pre> <p>This is a freshly loaded image, so space usage is minimal. Usually, before proceeding with the experiment, I invoke garbage collection to ensure that we don't have some leftover data from the previous runs that may overlap with the current one. In SBCL, you run it with <code>(sb-ext:gc :full t)</code>.</p> <p>Now, let's load the files for the German language (the biggest ones) under <code>aprof</code>. The data can be obtained from the <a href="https://github.com/vseloved/wiki-lang-detect/blob/master/models/wiki156min.zip">github repository of the project</a>. 
The total size of 2 German-language files on disk (words and trigrams dictionaries) is around 4 MB.</p> <pre><code>CL-USER> (sb-aprof:aprof-run<br /> (lambda () (defparameter *de* (load-data-into-hts "DE"))))<br />227 (of 50000 max) profile entries consumed<br /><br /> % Bytes Count Function<br /> ------- ----------- --------- --------<br /> 24.2 34773600 434670 SB-KERNEL:%MAKE-ARRAY - #:|unknown|<br /> 19.4 27818880 217335 SB-IMPL::%MAKE-STRING-INPUT-STREAM - SB-IMPL::STRING-INPUT-STREAM<br /> 19.4 27818880 434670 SLICE - LIST<br /><br /> 17.3 24775088 SB-IMPL::HASH-TABLE-NEW-VECTORS<br /> 54.0 13369744 52 SIMPLE-VECTOR<br /> 46.0 11405344 156 (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))<br /><br /> 14.9 21406176 SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER<br /> 99.4 21280192 225209 (SIMPLE-ARRAY CHARACTER (*))<br /> 0.6 125984 7874 LIST<br /><br /> 4.8 6957184 217412 SB-KERNEL::INTEGER-/-INTEGER - RATIO<br /><br /> 00.0 14160 SB-IMPL::%MAKE-PATHNAME<br /> 91.8 12992 812 LIST<br /> 8.2 1168 1 SIMPLE-VECTOR<br /><br /> 00.0 4160 2 SB-IMPL::SET-FD-STREAM-ROUTINES - (SIMPLE-ARRAY CHARACTER (*))<br /><br /> 00.0 3712 SB-IMPL::%MAKE-DEFAULT-STRING-OSTREAM<br /> 62.1 2304 8 (SIMPLE-ARRAY CHARACTER (*))<br /> 37.9 1408 8 SB-IMPL::CHARACTER-STRING-OSTREAM<br /><br /> 00.0 1024 MAKE-HASH-TABLE<br /> 53.1 544 2 SIMPLE-VECTOR<br /> 46.9 480 6 (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))<br /><br /> 00.0 832 SB-IMPL::%MAKE-FD-STREAM<br /> 73.1 608 2 SB-SYS:FD-STREAM<br /> 19.2 160 2 SB-VM::ARRAY-HEADER<br /> 7.7 64 2 (SIMPLE-ARRAY CHARACTER (*))<br /><br /> 00.0 576 GET-OUTPUT-STREAM-STRING<br /> 55.6 320 8 SIMPLE-BASE-STRING<br /> 44.4 256 8 SB-KERNEL:CLOSURE<br /><br /> 00.0 400 SB-KERNEL:VECTOR-SUBSEQ*<br /> 60.0 240 6 (SIMPLE-ARRAY CHARACTER (*))<br /> 40.0 160 5 SIMPLE-BASE-STRING<br /><br /> 00.0 400 5 SB-IMPL::%%MAKE-PATHNAME - PATHNAME<br /> 00.0 384 2 SB-IMPL::%MAKE-HASH-TABLE - HASH-TABLE<br /> 00.0 288 4 SB-KERNEL:%CONCATENATE-TO-STRING - (SIMPLE-ARRAY CHARACTER (*))<br /> 
00.0 192 12 SB-IMPL::UNPARSE-NATIVE-PHYSICAL-FILE - LIST<br /> 00.0 176 2 SB-IMPL::READ-FROM-C-STRING/UTF-8 - (SIMPLE-ARRAY CHARACTER (*))<br /> 00.0 128 4 SB-ALIEN-INTERNALS:%SAP-ALIEN - SB-ALIEN-INTERNALS:ALIEN-VALUE<br /><br /> 00.0 96 SB-IMPL::QUERY-FILE-SYSTEM<br /> 66.7 64 2 SB-KERNEL:CLOSURE<br /> 33.3 32 2 SB-VM::VALUE-CELL<br /><br /> ======= ===========<br /> 100.0 143576336<br /></code></pre> <p>The profiling report is pretty cryptic, at first sight, and requires some knowledge of SBCL internals to understand. It contains all the allocations performed during the test run, so we should mind that some of the used memory is garbage and will be collected at the next gc. We can confirm that by looking at the <code>room</code> output:</p> <pre><code>CL-USER> (room)<br />Dynamic space usage is: 209,222,464 bytes.<br />CL-USER> (sb-ext:gc :full t)<br />NIL<br />CL-USER> (room)<br />Dynamic space usage is: 107,199,296 bytes.<br /></code></pre> <p>Let's study the report in detail. Around 47 MB were, in fact, used for the newly created data structures — more than 10 times what was needed to store the data on disk. Well, efficient access requires sacrificing a lot of space. From the report, we can make an educated guess where these 47 MB originate: 24.8 MB was used for the hash-table structures themselves (<code>SB-IMPL::HASH-TABLE-NEW-VECTORS</code>) and 21.4 MB for the keys (<code>SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER</code>), plus some small amount of bookkeeping information. We can also infer that the floating-point values required around 7 MB (<code>SB-KERNEL::INTEGER-/-INTEGER - RATIO</code>), but it seems like they were put inside the hash-table arrays without any indirection. To verify that this assumption is correct, we can calculate the total number of keys in the hash-tables, which amounts to 216993, and multiply it by 32 — the number of bytes allocated per value, judging by the report (6,957,184 bytes for 217,412 allocations) — which gives the observed ~7 MB. 
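<p>The arithmetic behind this cross-check can be spelled out explicitly (the figures are taken from the aprof report above):</p>

```python
# Figures from the aprof report above
ratio_bytes = 6_957_184    # SB-KERNEL::INTEGER-/-INTEGER - RATIO, bytes
ratio_count = 217_412      # number of such allocations
total_keys = 216_993       # total keys across both hash-tables

per_value = ratio_bytes / ratio_count
assert 31.9 < per_value < 32.1                       # ~32 bytes per value
assert abs(total_keys * 32 - ratio_bytes) < 100_000  # ~7 MB, as observed
```

<p>The key count and the allocation count agree to within a fraction of a percent, which is what makes the one-allocation-per-value reading plausible.</p>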
Also, the first 3 lines, which, in total, accrued around 90 MB or almost 2/3 of the memory used, are all related to reading the data and its processing; and this space was freed during gc.</p> <p>So, this report, although it is not straightforward to understand, gives a lot of insight into how space is used during the run of the algorithm. And the ability to specify what to track on a per-code block basis makes it even more useful.</p> <p>From the obtained breakdown, we can see the optimization potential of the current solution:</p> <ul><li>the use of a more space-efficient data structure instead of a hash-table might save us up to 17 MB of space (7 MB of float values will remain intact)</li> <li>and another 20 MB may be saved if we compress the keys</li></ul> <p>Let's try the second option as it is exactly the focus of this chapter. We'll use the created hash-tables to make new ones with Huffman-encoded keys. Here are the contents of the word probabilities table:</p> <pre><code>CL-USER> (print-ht (first *de*))<br />#{EQUAL<br /> "afrika" -9.825206<br /> "i" -7.89809<br /> "the" -7.0929685<br /> "ngo" -12.696277<br /> "noma" -14.284437<br /> "of" -6.82038<br /> "kanye" -14.233144<br /> "e" -7.7334323<br /> "natal" -11.476304<br /> "c" -8.715089<br /> ...<br /> }<br /></code></pre> <p>And here is the function that will transform the tables:</p> <pre><code>(defun huffman-tables (hts envocab)<br /> (declare (optimize sb-c::instrument-consing))<br /> (mapcar (lambda (ht)<br /> (let ((rez #h(equal)))<br /> (dotable (str logprob ht)<br /> (:= (? 
rez (huffman-encode envocab str)) logprob))<br /> rez))<br /> hts))<br /><br />;; the Huffman encoding vocabulary *DE-VOCAB* should be built<br />;; from all the keys of *DE* tables separately<br />CL-USER> (sb-aprof:aprof-run<br /> (lambda () (defparameter *de2* (huffman-tables *de* *de-vocab*))))<br />1294 (of 50000 max) profile entries consumed<br /> % Bytes Count Function<br /> ------- ----------- --------- --------<br /> 42.5 44047104 1376461 SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG - ARRAY<br /><br /> 23.9 24775088 SB-IMPL::HASH-TABLE-NEW-VECTORS<br /> 54.0 13369744 52 SIMPLE-VECTOR<br /> 46.0 11405344 156 (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))<br /><br /> 20.1 20864160 HUFFMAN-ENCODE<br /> 83.3 17386800 217335 SB-VM::ARRAY-HEADER<br /> 16.7 3477360 217335 SIMPLE-BIT-VECTOR<br /><br /> 6.7 6955072 217335 SB-KERNEL:VECTOR-SUBSEQ* - SIMPLE-BIT-VECTOR<br /> 3.4 3477360 217335 (SB-PCL::FAST-METHOD RUTILS.GENERIC::GENERIC-SETF :AROUND (T T)) - LIST<br /> 3.4 3477360 217335 (SB-PCL::FAST-METHOD RUTILS.GENERIC::GENERIC-SETF (HASH-TABLE T)) - LIST<br /> 00.0 2464 77 SB-KERNEL::INTEGER-/-INTEGER - RATIO<br /><br /> 00.0 1024 MAKE-HASH-TABLE<br /> 53.1 544 2 SIMPLE-VECTOR<br /> 46.9 480 6 (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))<br /><br /> 00.0 384 2 SB-IMPL::%MAKE-HASH-TABLE - HASH-TABLE<br /><br /> 00.0 96 SB-C::%PROCLAIM<br /> 66.7 64 2 LIST<br /> 33.3 32 1 SB-KERNEL:CLOSURE<br /><br /> 00.0 96 2 SB-INT:SET-INFO-VALUE - SIMPLE-VECTOR<br /> 00.0 64 2 SB-THREAD:MAKE-MUTEX - SB-THREAD:MUTEX<br /> 00.0 32 1 SB-IMPL::%COMPILER-DEFVAR - LIST<br /> 00.0 32 2 HUFFMAN-TABLES - LIST<br /> 00.0 16 1 SB-KERNEL:ASSERT-SYMBOL-HOME-PACKAGE-UNLOCKED - LIST<br /> ======= ===========<br /> 100.0 103600352<br />CL-USER> (sb-ext:gc :full t)<br />NIL<br />CL-USER> (room)<br />Dynamic space usage is: 139,922,208 bytes.<br /></code></pre> <p>So, we have claimed 32 MB of additional space (instead of 47) and some of it seems to be used by other unrelated data (some functions I have redefined in the 
REPL during the experiment etc), as the compressed keys account for only 3.5 MB:</p> <pre><code>3477360 217335 SIMPLE-BIT-VECTOR <br /></code></pre> <p>That is more than a 5-fold reduction, or almost 40% compression of the whole data structure!</p> <p>And what about performance? Huffman compression will be needed at every data access, so let's measure the time it will take for vanilla string keys and the bit-vector ones. We will use another file from the wiki-lang-detect repository for the smoke test — <a href="https://github.com/vseloved/wiki-lang-detect/blob/master/data/smoke/de.txt">a snippet from Faust</a>:</p> <pre><code>CL-USER> (defparameter *de-words*<br /> (let ((words (list))<br /> (dict (first *de*)))<br /> (dolines (line "~/prj/lisp/wiki-lang-detect/data/smoke/de.txt")<br /> (dolist (word (split #\Space line))<br /> (push word words)))<br /> words))<br />CL-USER> (length *de-words*)<br />562<br /><br />CL-USER> (let ((vocab (first *de*)))<br /> (time (loop :repeat 1000 :do<br /> (dolist (word *de-words*)<br /> (get# word vocab)))))<br />Evaluation took:<br /> 0.045 seconds of real time<br /><br />CL-USER> (let ((vocab (first *de2*)))<br /> (time (loop :repeat 1000 :do<br /> (dolist (word *de-words*)<br /> (get# (huffman-encode *de-vocab* word) vocab)))))<br />Evaluation took:<br /> 0.341 seconds of real time<br /></code></pre> <p>Hmm, with Huffman coding, it's more than 7x slower :( Is there a way to speed it up somewhat? To answer that, we can utilize another profiler — this time a more conventional one, which measures the time spent in each operation. SBCL provides access to two versions of such profilers: a precise and a statistical one. The statistical one doesn't seriously interfere with the flow of the program as it uses sampling to capture the profiling data, and it's the preferred one among the developers. 
To use it, we need to perform <code>(require 'sb-sprof)</code> and then run the computation with profiling enabled (the lengthy output is redacted to show only the most important parts):</p> <pre><code>CL-USER> (let ((dict (first *de2*)))<br /> (sb-sprof:with-profiling (:report :graph)<br /> (loop :repeat 100 :do<br /> (dolist (word *de-words*)<br /> (get# (huffman-encode *de-vocab* word) dict)))))<br />Number of samples: 34<br />Sample interval: 0.01 seconds<br />Total sampling time: 0.34 seconds<br />Number of cycles: 0<br />Sampled threads:<br /> #<SB-THREAD:THREAD "repl-thread" RUNNING {100FB19BC3}><br /><br /> Callers<br /> Total. Function<br /> Count % Count % Callees<br />------------------------------------------------------------------------<br /> 24 70.6 "Unknown component: #x52CD6390" [41]<br /> 5 14.7 24 70.6 HUFFMAN-ENCODE [1]<br /> 1 2.9 SB-IMPL::GETHASH/EQL [17]<br /> 1 2.9 SB-IMPL::GETHASH3 [6]<br /> 1 2.9 LENGTH [14]<br /> 1 2.9 SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS [13]<br /> 2 5.9 (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT) [5]<br /> 13 38.2 VECTOR-PUSH-EXTEND [11]<br />------------------------------------------------------------------------<br /> 4 11.8 SB-VM::EXTEND-VECTOR [4]<br /> 4 11.8 4 11.8 SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG [2]<br />------------------------------------------------------------------------<br /> 6 17.6 "Unknown component: #x52CD6390" [41]<br /> 3 8.8 6 17.6 SB-IMPL::GETHASH/EQUAL [3]<br /> 1 2.9 SXHASH [42]<br /> 2 5.9 SB-INT:BIT-VECTOR-= [10]<br />------------------------------------------------------------------------<br /> 8 23.5 VECTOR-PUSH-EXTEND [11]<br /> 2 5.9 8 23.5 SB-VM::EXTEND-VECTOR [4]<br /> 2 5.9 SB-VM::COPY-VECTOR-DATA [9]<br /> 4 11.8 SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG [2]<br />------------------------------------------------------------------------<br /> 2 5.9 HUFFMAN-ENCODE [1]<br /> 2 5.9 2 5.9 (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT) [5]<br 
/>------------------------------------------------------------------------<br />...<br /><br /> Self Total Cumul<br /> Nr Count % Count % Count % Calls Function<br />------------------------------------------------------------------------<br /> 1 5 14.7 24 70.6 5 14.7 - HUFFMAN-ENCODE<br /> 2 4 11.8 4 11.8 9 26.5 - SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG<br /> 3 3 8.8 6 17.6 12 35.3 - SB-IMPL::GETHASH/EQUAL<br /> 4 2 5.9 8 23.5 14 41.2 - SB-VM::EXTEND-VECTOR<br /> 5 2 5.9 2 5.9 16 47.1 - (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT)<br /> 6 2 5.9 2 5.9 18 52.9 - SB-IMPL::GETHASH3<br /> 7 2 5.9 2 5.9 20 58.8 - GETHASH<br /> 8 2 5.9 2 5.9 22 64.7 - (SB-VM::OPTIMIZED-DATA-VECTOR-SET BIT)<br /> 9 2 5.9 2 5.9 24 70.6 - SB-VM::COPY-VECTOR-DATA<br /> 10 2 5.9 2 5.9 26 76.5 - SB-INT:BIT-VECTOR-=<br /> 11 1 2.9 13 38.2 27 79.4 - VECTOR-PUSH-EXTEND<br /> 12 1 2.9 1 2.9 28 82.4 - SB-VM::SLOW-HAIRY-DATA-VECTOR-SET<br /> 13 1 2.9 1 2.9 29 85.3 - SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS<br /> 14 1 2.9 1 2.9 30 88.2 - LENGTH<br /> 15 1 2.9 1 2.9 31 91.2 - SB-KERNEL:HAIRY-DATA-VECTOR-SET<br /> 16 1 2.9 1 2.9 32 94.1 - SB-KERNEL:VECTOR-SUBSEQ*<br /> 17 1 2.9 1 2.9 33 97.1 - SB-IMPL::GETHASH/EQL<br />...<br /></code></pre> <p>Unsurprisingly, most of the time is spent in <code>huffman-encode</code>, and of it the biggest chunks are <code>vector-push-extend</code> and hash-table access (to get the Huffman code of a letter). Surely, instead of extending the vector at each iteration, it would be much nicer to just perform a bulk copy of the bits for each character directly into the vector. 
Let's try that and see the difference:</p> <pre><code>(defun huffman-encode2 (envocab str)<br /> (let ((vecs (map 'vector (lambda (ch) (get# ch envocab))<br /> str))<br /> (total-size 0))<br /> (dovec (vec vecs)<br /> (:+ total-size (length vec)))<br /> (let ((rez (make-array total-size :element-type 'bit))<br /> (i 0))<br /> (dovec (vec vecs)<br /> (let ((size (length vec)))<br /> (:= (subseq rez i) vec)<br /> (:+ i size)))<br /> rez)))<br /><br />CL-USER> (let ((vocab (first *de2*)))<br /> (time (loop :repeat 1000 :do<br /> (dolist (word *de-words*)<br /> (get# (huffman-encode2 *de-vocab* word) vocab)))))<br />Evaluation took:<br /> 0.327 seconds of real time</code></pre> <p>Almost no difference. Well, it's the usual case with these micro-optimizations: you have a brilliant idea, try it under the profiler — and, bah, no difference... This doesn't have to stop us, though. Another idea could be to use a jump-table instead of a hash-table to store character-vector mappings. There are only around 500 characters that have a mapping in my data, although they span the whole Unicode range:</p> <pre><code>CL-USER> (reduce 'max (mapcar 'char-code (keys *de-vocab*)))<br />65533<br />CL-USER> (defparameter *jvocab* (make-array (1+ 65533)<br /> :element-type 'bit-vector<br /> :initial-element #*))<br />CL-USER> (dotable (k v *de-vocab*)<br /> (:= (? 
*jvocab* (char-code k)) v))<br /><br />(defun huffman-encode3 (envocab str)<br /> (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))<br /> (dovec (char str)<br /> ;; here, we have changed the hash-table to a jump-table<br /> (dovec (bit (svref envocab (char-code char)))<br /> (vector-push-extend bit rez)))<br /> rez))<br /><br />CL-USER> (let ((vocab (first *de2*)))<br /> (time (loop :repeat 1000 :do<br /> (dolist (word *de-words*)<br /> (get# (huffman-encode3 *jvocab* word) vocab)))))<br />Evaluation took:<br /> 0.308 seconds of real time</code></pre> <p>OK, we get an improvement of around 10%<a href="#f13-2" name="r13-2">[2]</a>. That's a start. But many more ideas and experiments are needed if we want to significantly optimize this implementation. Yet, for the sake of space conservation on the pages of this book, we won't continue with it. <p>Another tool we could use to analyze the performance and think about further improvement is flamegraphs — a way to visualize profiler output. <a href="https://github.com/40ants/cl-flamegraph">CL-FLAMEGRAPH</a> is a wrapper around <code>sb-sprof</code> that generates the output in the common format, which can be further processed by the Perl flamegraph tool to generate the image itself. Here is the basic output I got. It's rather rough and, probably, requires some fiddling with the Perl tool to obtain a prettier image: <a href="https://1.bp.blogspot.com/-tgZbR6phkNM/Xk0gaDWwXsI/AAAAAAAACUs/8yOoXkixpJEN1qX2MMJNekw9mz71nuc8wCLcBGAsYHQ/s1600/huff3.jpg" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-tgZbR6phkNM/Xk0gaDWwXsI/AAAAAAAACUs/8yOoXkixpJEN1qX2MMJNekw9mz71nuc8wCLcBGAsYHQ/s1600/huff3.jpg" data-original-width="1200" data-original-height="166" /></a> <p>To conclude, key compression alone gives a sizeable reduction in used space at the cost of deteriorated performance. <p>Another possible angle of attack is to move from a hash-table to a more space-efficient structure. 
We have explored this direction somewhat in the chapter on hash-tables already. <h2>Arithmetic Coding</h2> <p>Why does Huffman coding work? The answer lies in Shannon's Source Coding Theorem and has to do with the notion of entropy. Entropy is one of the ways to quantify expectation and surprise in a message. The most random message has the maximal surprise, i.e. it's very hard to predict what symbol will appear at a certain position in it, while the least random one (for instance, containing only repetitions of a single char) is the least surprising. Obviously, any kind of useful data is not uniformly distributed, or, otherwise, it would be indistinguishable from white noise. Most data representations use an "alphabet" (encoding) that is redundant for any particular message. Why? Because it is general-purpose and should allow expressing arbitrary messages. Yet, in practice, some passages appear much more often than others, some words are more frequent, some letters are, and even some patterns in images may be. <p>The idea of character-level compression algorithms is to tailor a custom vocabulary that uses fewer bits for low-entropy (frequent) characters and more bits for high-entropy ones. In general, the probability distribution of characters may be thought of as a <code>[0,1)</code> interval, in which each char occupies a slice proportionate to its frequency. If we rely on the standard encoding, the interval for our test example will look like this: <pre><code>|---+---+---+------+---------+---------+---------|<br />0 e a h i s t Space 1</code></pre> <p>Here, the size of each character's subinterval is its frequency times the number of bits per character (8 for each). Huffman coding tries to equalize this distribution by assigning fewer bits to the characters that occupy larger space. 
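<p>To make the entropy argument concrete, we can compute the Shannon limit for our test message. The character counts below are those of "this is a test"; the sketch is in Python purely for illustration, since the arithmetic is language-independent:

```python
import math

# character counts in "this is a test" (14 characters in total)
freqs = {'e': 1, 'a': 1, 'h': 1, 'i': 2, 's': 3, 't': 3, ' ': 3}
total = sum(freqs.values())

# Shannon entropy: the minimal average number of bits per character
entropy = -sum(n / total * math.log2(n / total) for n in freqs.values())

print(round(entropy, 2))          # 2.65 bits per character
print(round(entropy * total, 1))  # 37.0 bits for the whole message
```

So no lossless character-level encoding of this message can be shorter than about 37 bits, which is a useful yardstick for the encodings discussed in this section.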
For the Huffman vocabulary we have constructed, the distribution will look like this: <pre><code>|-----+-----+----+------+------+-------+-------|<br />0 e a h i s t Space 1</code></pre> <p>As you can see, it has become more even, but still not completely. This is due to the discrete nature of the encoding, which results in rounding the number of bits to the closest integer value. There's another approach to solving the same problem that aims at reducing the rounding error even further — Arithmetic coding. It acts directly on our interval and encodes the whole message in a single number that represents a point in this interval. How is this point found and used? Let's consider a message with a single character <code>i</code>. In our example, the subinterval for it is <code>[0.214285714, 0.357142857)</code>. So, if we use any number from this interval and know that the message contains a single character, we can unambiguously decode it back. Ideally, we'd use the number from the interval that has the fewest digits. Here is a simple example of how such a number can be found: <pre><code>(defun find-shortest-bitvec (lo hi)<br /> (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))<br /> (loop<br /> (with ((lod lof (floor (* lo 2)))<br /> (hid hif (floor (* hi 2))))<br /> (when (or (zerop lof)<br /> (zerop hif)<br /> (/= lod hid))<br /> (vector-push-extend hid rez)<br /> (return))<br /> (vector-push-extend lod rez)<br /> (:= lo lof<br /> hi hif)))<br /> rez))<br /><br />CL-USER> (find-shortest-bitvec 0.214285714 0.357142857)<br />#*01<br /></code></pre> <p>The result is a bit-vector that represents the fractional part of some floating-point number lying within the interval, which may also be used as an encoding of our one-character message. 
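<p>For cross-checking, the digit-by-digit search above ports almost mechanically to other languages. Here is a Python sketch of the same procedure (`divmod` plays the role of `floor`'s two return values):

```python
def find_shortest_bits(lo, hi):
    """Digits of the shortest binary fraction lying inside [lo, hi)."""
    bits = []
    while True:
        lod, lof = divmod(lo * 2, 1)  # next binary digit and remaining fraction
        hid, hif = divmod(hi * 2, 1)
        # stop as soon as the bounds disagree (or one of them is exhausted):
        # the digit taken from hi is guaranteed to land inside the interval
        if lof == 0 or hif == 0 or lod != hid:
            bits.append(int(hid))
            return bits
        bits.append(int(lod))
        lo, hi = lof, hif

print(find_shortest_bits(0.214285714, 0.357142857))  # [0, 1]
```

The call reproduces the `#*01` result above.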
Obviously, we could use just a single bit to encode it with a custom vocabulary of one entry, but, here, for the purpose of illustration, I wanted to use an existing pre-calculated vocabulary that includes other characters as well. Also, if we compare this version with the Huffman coding, the message length is decreased by 1 bit.</p> <p>Now, how can we process longer messages? In the same manner: by recursively dividing the currently selected part using the same original distribution. For the message <code>is</code>:</p> <ul><li>on step 1 (for character <code>i</code>), the interval <code>[0.214285714, 0.357142857)</code> will be selected</li><li>on step 2 (for character <code>s</code>), we'll narrow it down to <code>[0.26530612, 0.29591838)</code> (using the subinterval <code>[0.357142857, 0.5714286)</code> for <code>s</code>)</li></ul> <p>For this interval, the shortest encoding will be <code>01001</code>. In this case, it has the same size as the Huffman one.</p> <p>So, the naive arithmetic encoding implementation is quite simple:</p> <pre><code>(defun arithm-encode (envocab message)<br /> (let ((lo 0.0)<br /> (hi 1.0))<br /> (dovec (char message)<br /> (let ((coef (- hi lo)))<br /> (dotable (ch prob envocab)<br /> (let ((off (* prob coef)))<br /> (when (eql char ch)<br /> (:= hi (+ lo off))<br /> (return))<br /> (:+ lo off)))))<br /> (find-shortest-bitvec lo hi)))<br /><br />CL-USER> (arithm-encode #h(#\e 1/14<br /> #\a 1/14<br /> #\h 1/14<br /> #\i 2/14<br /> #\s 3/14<br /> #\t 3/14<br /> #\Space 3/14) <br /> "this is a test")<br />#*100110110100001110000001<br /></code></pre> <p>However, this function has a hidden bug. The problem lies in the dreaded exhaustion of floating-point precision that happens quite soon in the process of narrowing the interval: more and more digits of the floating-point number are needed, until all the bits are used up and we can't distinguish the intervals any further. 
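<p>This precision exhaustion is easy to demonstrate directly. After all 14 characters, the interval width equals the product of the characters' probabilities, which is far below the resolution of a single-float in that region of <code>[0,1)</code>. A Python sketch (the `f32` helper just round-trips a number through 32-bit precision; 0.6 stands in for the approximate position of the final interval):

```python
import struct

def f32(x):
    """Round x to the nearest 32-bit float, like a Lisp single-float."""
    return struct.unpack('f', struct.pack('f', x))[0]

# interval width for "this is a test": the product of all character probabilities
# (t, s, and Space occur 3 times each, i twice, h, a, and e once)
width = (3/14)**9 * (2/14)**2 * (1/14)**3
print(width)  # ~7.1e-12

# adding the entire remaining width to the lower bound doesn't even
# flip the last bit of a single-float in this range
assert f32(0.6 + width) == f32(0.6)
```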
If we try to faithfully decode even the short message encoded above, we'll already see this effect by getting the output <code>this ist sssst</code>.</p> <p>The implementation of this approach, which works around the bug, relies on the same idea but uses a clever bit arithmetic trick. Due to that, it becomes less clean and obvious, because it has to work not with the whole number, but with a bounded window in that number (in this case, a 32-bit one) and, also, still take care of the potential overflow that may happen when the range collapses around 0.5. Here it is shown, for illustration purposes, without a detailed explanation<a href="#f13-3" name="r13-3">[3]</a>. This function is another showcase of the Lisp standard's support for handling bit-level values. Besides, read-eval (<code>#.</code>) is used here to provide literal values of bitmasks.</p> <pre><code>(defun arithm-encode-correct (envocab message)<br /> (let ((lo 0)<br /> (hi (1- (expt 2 32)))<br /> (pending-bits 0)<br /> (rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))<br /> (flet ((emit-bit (bit)<br /> (vector-push-extend bit rez)<br /> (let ((pbit (if (zerop bit) 1 0)))<br /> (loop :repeat pending-bits :do (vector-push-extend pbit rez))<br /> (:= pending-bits 0))))<br /> (dovec (char message)<br /> (with ((range (- hi lo -1))<br /> ((plo phi) (? 
envocab char)))<br /> (:= lo (round (+ lo (* plo range)))<br /> hi (round (+ lo (* phi range) -1)))<br /> (loop<br /> (cond ((< hi #.(expt 2 31))<br /> (emit-bit 0))<br /> ((>= lo #.(expt 2 31))<br /> (emit-bit 1)<br /> (:- lo #.(expt 2 31))<br /> (:- hi #.(expt 2 31)))<br /> ((and (>= lo #.(expt 2 30))<br /> (< hi (+ #.(expt 2 30) #.(expt 2 31))))<br /> (:- lo #.(expt 2 30))<br /> (:- hi #.(expt 2 30))<br /> (:+ pending-bits))<br /> (t (return)))<br /> (:= lo (mask32 (ash lo 1))<br /> hi (mask32 (1+ (ash hi 1)))))))<br /> (:+ pending-bits)<br /> (emit-bit (if (< lo #.(expt 2 30)) 0 1)))<br /> rez))<br /><br />(defun mask32 (num)<br /> ;; this utility is used to confine the number in 32 bits<br /> (logand num #.(1- (expt 2 32))))<br /><br />CL-USER> (arithm-encode-correct #h(#\e '(0 1/14)<br /> #\a '(1/14 1/7)<br /> #\h '(1/7 3/14)<br /> #\i '(3/14 5/14)<br /> #\s '(5/14 4/7)<br /> #\t '(4/7 11/14)<br /> #\Space '(11/14 1)) <br /> "this is a test")<br />#*10011011010000111000001101010110010101<br /></code></pre> <p>Note that the length of the compressed message is 38 bits. The same as the Huffman version!</p> <p>And here, for the sake of completeness and verification, is the decoding routine. It works in a similar fashion but backwards: we determine the interval into which our current number falls, emit the corresponding character, and narrow the search interval to the currently found one. We'll need to have access to the same vocabulary and know the length of the message. 
</p> <pre><code>(defun bitvec->int (bits)<br /> (reduce (lambda (bit1 bit2) (+ (ash bit1 1) bit2))<br /> bits))<br /><br />(defun arithm-decode (dedict vec size)<br /> (with ((len (length vec))<br /> (lo 0)<br /> (hi (1- (expt 2 32)))<br /> (val (bitvec->int (subseq vec 0 (min 32 len))))<br /> (off 32)<br /> (rez (make-string size)))<br /> (dotimes (i size)<br /> (with ((range (- hi lo -1))<br /> (prob (/ (- val lo) range)))<br /> (dotable (char r dedict)<br /> (with (((plo phi) r))<br /> (when (>= phi prob)<br /> (:= (? rez i) char<br /> lo (round (+ lo (* plo range)))<br /> hi (round (+ lo (* phi range) -1)))<br /> (return))))<br /> (loop<br /> (cond ((< hi #.(expt 2 31))<br /> ;; do nothing<br /> )<br /> ((>= lo #.(expt 2 31))<br /> (:- lo #.(expt 2 31))<br /> (:- hi #.(expt 2 31))<br /> (:- val #.(expt 2 31)))<br /> ((and (>= lo #.(expt 2 30))<br /> (< hi #.(* 3 (expt 2 30))))<br /> (:- lo #.(expt 2 30))<br /> (:- hi #.(expt 2 30))<br /> (:- val #.(expt 2 30)))<br /> (t<br /> (return)))<br /> (:= lo (mask32 (ash lo 1))<br /> hi (mask32 (1+ (ash hi 1)))<br /> val (mask32 (+ (ash val 1)<br /> (if (< off len)<br /> (? vec off)<br /> 0)))<br /> off (1+ off)))))<br /> rez)))<br /><br />CL-USER> (let ((vocab #h(#\e '(0 1/14)<br /> #\a '(1/14 1/7)<br /> #\h '(1/7 3/14)<br /> #\i '(3/14 5/14)<br /> #\s '(5/14 4/7)<br /> #\t '(4/7 11/14)<br /> #\Space '(11/14 1))))<br /> (arithm-decode vocab<br /> (arithm-encode-correct vocab "this is a test")<br /> 14))<br />"this is a test"<br /></code></pre> <h2 id="deflate">DEFLATE</h2> <p>Entropy-based compression — or, as I would call it, character-level compression — can do only so much: it can't account for repetitions of larger-scale message parts. For instance, a message with a single word repeated twice, when compressed with Huffman or Arithmetic encodings, will have twice the length of the message with a single occurrence of that word. 
The reason is that the probability distribution will not change, and thus neither will the encoding of each character. Yet, there's an obvious possibility to reduce the compressed size here. This and other similar cases are much better treated by dictionary-based or block-level encoding approaches. The most well-known and widespread of them is the DEFLATE algorithm, which is a variant of LZ77. Surely, there are other approaches like LZW, LZ78, or the Burrows-Wheeler algorithm (used in bzip2), but they are based on the same principles, so studying DEFLATE will allow you to grasp the other algorithms if necessary.</p> <p>But, before considering DEFLATE, let's first look at the simplest block-level scheme — <strong>Run-Length Encoding</strong> (RLE). Strictly speaking, this is not even a block-level algorithm, as it operates on single characters, once again. The idea is to encode sequences of repeating characters as a single character followed by the number of repetitions. Of course, such an approach will hardly help with natural-language texts that have almost no long character repetitions. Instead, it was used for images with limited palettes (like those encoded in the GIF format). It is common for such images to have large areas filled with the same color, so the GIF format, for instance, used RLE for each line of pixels. That was one of the reasons that an image with a horizontal pattern like this:</p> <pre><code>xxxxx<br /><br />xxxxx<br /><br />xxxxx<br /></code></pre> <p>lent itself to stellar compression, while the same one rotated 90 degrees didn't :)</p> <pre><code>x x x<br />x x x<br />x x x<br />x x x<br />x x x<br /></code></pre> <p>LZ77 is a generalization of the RLE approach that considers runs not just of single characters but of variable-length character sequences. Under such conditions, it becomes much better suited for text compression, especially when the text has some redundancies. 
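<p>Before going further, note that the RLE scheme just described fits in a few lines. A Python sketch (illustration only; a real encoder would pack the counts into bytes rather than keep pairs):

```python
def rle_encode(row):
    """Collapse runs of identical symbols into (symbol, count) pairs."""
    runs = []
    for ch in row:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

# a row of the horizontal pattern collapses into a single run...
print(rle_encode("xxxxx"))  # [('x', 5)]
# ...while a row of the rotated pattern yields one run per symbol
print(rle_encode("x x x"))  # [('x', 1), (' ', 1), ('x', 1), (' ', 1), ('x', 1)]
```

Text rarely contains such single-character runs, and that is exactly the kind of redundancy gap that LZ77's variable-length sequences fill.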
For example, program code files tend to have some identifiers constantly repeated (like <code>if</code>, <code>loop</code>, or <code>nil</code> in Lisp), each code file may have a lengthy identical copyright notice at the top, and so on and so forth. The algorithm operates by replacing repeated occurrences of data with references to a single copy of that data seen earlier in the uncompressed stream. The encoding is a pair of numbers: the length of the sequence and the offset back into the stream where the same sequence was originally encountered.</p> <p>The most popular LZ77-based compression method is DEFLATE. In this algorithm, literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances are placed into a separate alphabet; as they occur just after lengths, they cannot be mistaken for another kind of symbol, or vice versa. A DEFLATE stream consists of a series of blocks. Each block is preceded by a 3-bit header indicating the position of the block (last or intermediate) and the type of character-level compression used: no compression, Huffman with a predefined tree, or Huffman with a custom tree. Most compressible data will end up being encoded using the dynamic Huffman encoding. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the loss in compression due to using a non-optimal code.</p> <p>The algorithm performs the following steps:</p> <ol><li><p>Matching and replacement of duplicate strings with pointers: within a single block, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. 
An encoded match to an earlier string consists of an 8-bit length (the repeated block size is between 3 and 258 bytes) and a 15-bit distance (which specifies an offset of 1-32768 bytes inside the so-called "sliding window") to the beginning of the duplicate. If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of <code>n</code> identical bytes can be encoded as the single byte itself followed by a match of length <code>(1- n)</code> with a distance of 1.</p></li><li><p>Huffman coding of the obtained block. Instructions to generate the necessary Huffman trees immediately follow the block header. There are, actually, 2 trees: the 288-symbol length/literal tree and the 32-symbol distance tree, which are themselves encoded as canonical Huffman codes by giving the bit length of the code for each symbol. The bit lengths are then run-length encoded to produce as compact a representation as possible.</p></li></ol> <p>An interesting fact is that DEFLATE compression is so efficient in terms of speed that it is faster to read a compressed file from an ATA hard drive and decompress it in memory than to read the original longer version: disk access takes much longer than CPU processing for this rather simple algorithm! The same applies, even more so, to network traffic. That's why compression is used (and enabled by default) in many popular network protocols, for instance, HTTP.</p> <h2 id="takeaways">Take-aways</h2> <p>This chapter, unlike the previous one, dealt with, basically, just a single approach instead of exploring many different ones, in order to dig deeper and to demonstrate the use of all the tools that can be applied in algorithmic programming: from a piece of paper to sophisticated profilers. Moreover, the case we have analyzed provides a great showcase not just of the tools but of the whole development process, with all its setbacks, trial and error, and discoveries.</p> <p>Bit fiddling was another topic that naturally emerged in this chapter. 
It may look cryptic to those who have never ventured into this territory, but mastering the technique is necessary to gain access to a number of important areas of the algorithms landscape.</p> <hr size="1"><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r13-1" name="f13-1">[1]</a> To make full use of this feature and be able to profile SBCL internal functions, you'll need to compile SBCL with the <code>--with-cons-profiling</code> flag. Many thanks to Douglas Katzman for developing this feature and guiding me through its usage.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r13-2" name="f13-2">[2]</a> It was verified by taking the average of multiple test runs.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r13-3" name="f13-3">[3]</a> You can study the details in the <a href="https://www.drdobbs.com/cpp/data-compression-with-arithmetic-encodin/240169251">relevant article</a>.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com1tag:blogger.com,1999:blog-6031647961506005424.post-14204358913923168802020-02-10T21:32:00.001+02:002020-02-11T13:15:03.051+02:00prj-nlp v.3<blockquote class="twitter-tweet"><p lang="en" dir="ltr">The state of NLP in 2019.<br><br>I’m talking with an amazing undergrad who has already published multiple papers on BERT-type things.<br><br>We are discussing deep into a new idea on pretraining.<br><br>Me: What would TFIDF do here, as a simple place to start?<br>Him: ....<br>Me: ....<br>Him: What’s TFIDF?</p>— Eric Wallace (@Eric_Wallace_) <a href="https://twitter.com/Eric_Wallace_/status/1207528697239982080?ref_src=twsrc%5Etfw">December 19, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script><p>Mariana Romanyshyn and I have launched the third run of our course on the processing of human written language (or however NLP is best translated into Ukrainian :-p) at Projector. I wanted to write up some details about it, because we like the course a lot and, naturally, want its level to keep growing every year. For that, prospective students need to know about it and understand how it is organized. <p>The course is a three-month intensive that aims to prepare a specialist capable of solving NLP problems of any complexity independently and well, both alone and in a team. To be maximally successful in it, you need a certain background. As a rule, programmers, unsurprisingly, show the best results. But there are exceptions: we gladly take linguists, journalists, and, in general, specialists from other fields, provided that they have sufficient programming skills to write programs on their own, set up a development environment that is comfortable for them, and understand basic algorithmic concepts. You don't need to know ML for the course, although it's desirable to have some idea of it. The course is built so that we alternate between NLP topics and the ML areas related to them. Of course, this doesn't mean that, as a result, a person will be well-versed in machine learning, but they will get the base necessary for going deeper into this field. <p>The second precondition for success in the course is having enough time. We say that the necessary minimum is 10 hours of independent work per week, plus 5 hours in class. In other words, counting the commute, that's already half of a full-time job. And, of course, someone may need even more time for the independent work. Besides, your brain will be quite heavily loaded with new topics, so for the duration of the course you'll have to give up other side projects, hobbies, and the like. It also doesn't work out well if you drop out for more than a week in a row for some external reason: illness, a business trip, a wedding, the birth of a child... :) <p>How is the course organized? We gather a group of 15 students and meet twice a week: one session, the theoretical one, on Thursday evening, is devoted to working through a particular topic; the second, the practical one, on Saturday, is where we show an example of solving a problem on this topic and program it together. In most cases, this program will be the basis for solving a more advanced but similar problem given as homework. Accordingly, we have 12 weeks of core work, that is, 12 topics and about 10 full-fledged homework projects on the level of building a sentiment analysis system, a fact checker, or a syntactic parser. Of course, given the limited time, each of the projects is done within a certain restricted domain. <p>The course is split into 3 parts: <ul><li>the first month is very thorough preparatory work: the basics of structural linguistics, working with data, metrics, proper experiments, the rule-based approach. By the end of the month, a well-structured picture of how to properly approach solving NLP problems should form in the student's head. The other result of this part is a formulated assignment for the course project and the start of work on it: the first data collected, a metric defined, a plan of experiments worked out</li><li>the second month is a dive into classical NLP with a parallel study of the most widespread ML techniques used in it. By the end of the month, after solving almost a dozen NLP problems in lectures, practical sessions, and at home, students should have developed the skills for applying these techniques in real projects on their own. And the main part of the course project is done</li><li>the last month is deep learning in NLP. We warn right away that this course does not aim to tell as much as possible about the hottest and most groundbreaking work: there are enough other venues for that. We want to form a systematic understanding of NLP, with all of its 70-year history, because that history holds a great many useful and, perhaps, timeless things. So we only get to the state of the art towards the very end (and at the last session an invited lecturer tells us something about the bleeding edge :) But we also work through the fundamental things related to DL, both in class and within the course project. Those students who are interested in this particular area end up, by the end of the course, training dozens of neural networks in their sandboxes and also trying out the capabilities of deep learning in their course projects. Here, though, we can't boast of stunning quality results, since the couple of weeks that any one task can get, at most, is not enough to achieve them: training deep models requires a lot of both computational resources and time. But we do learn how to approach it</li></ul><p>As you can see from this description, we pay a great deal of attention to the course project, work on which we encourage every week. As a result, most students end up with quite good and interesting things, and over 70% of them reach the finish line with a high-quality completed work (an unheard-of share for other courses I've happened to take part in). Some of the projects even make it out into the big world: some people build things related to their work, others to their hobbies. Over 2 years we have had 2 studies for data journalism, a project analyzing the interactions of drugs with each other and with a person's chronic illnesses (based on processing package inserts), and a natural-language-query search system in a social app for conferences. There was also a series of interesting personal projects that achieved great quality results and were made with soul. The students present all of this after graduation at a big final party at the Grammarly office. <p>One of the main goals of this course for us is to grow the Ukrainian NLP community, since we have, in essence, never had a school of computational linguistics. And we hope that we'll manage to contribute to its formation together with the other progressive educational projects in this area, in particular the master's program in Data Science at UCU. The course already has more than 30 graduates who have joined the private prj-nlp-alumni club, where we share interesting things and opportunities and also plan to meet periodically in an informal atmosphere, not only at events. So we hope to expand this club by another half in June of this year :) <p>P.S. Speaking of UCU. I also take part as a lecturer and a mentor of theses in the NLP course of their program. That's a somewhat different experience from this course. UCU, of course, offers a more academic program that runs for a longer time. The students there get a good and, importantly, systematic grounding in ML and DL, so there's no need to devote any attention to that in the NLP course. On the other hand, the course is shorter and is taught by several lecturers, so within it it's harder to form a complete picture, and there's no way to arrange the same level of immersion and concentration as in the Projector course. Then again, they have more time for the master's thesis than our whole course takes. But, most importantly, both here and there the students are well-prepared and motivated, so the results at UCU also turn out to be of high quality, with a large number of interesting works, some of which are at the level of papers for the top academic conferences in this field. Although, personally, I still like the Projector format more, as it gives you the chance to feel the spirit of intense teamwork for three months, breathe a sigh of relief at the end, and look forward to a nine-month respite and a new iteration... 
<p><a href="https://3.bp.blogspot.com/-s8heybINgvo/XkKMlbi2daI/AAAAAAAACT0/wGsuFJ2ejrsWW6amXpUK0ibm5rgPVV0vACLcBGAsYHQ/s1600/download.jpeg" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-s8heybINgvo/XkKMlbi2daI/AAAAAAAACT0/wGsuFJ2ejrsWW6amXpUK0ibm5rgPVV0vACLcBGAsYHQ/s400/download.jpeg" width="400" height="225" data-original-width="1600" data-original-height="900" /></a>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com1tag:blogger.com,1999:blog-6031647961506005424.post-20014411548085757012020-01-11T23:08:00.000+02:002020-01-13T11:09:20.275+02:00Programming Algorithms: Approximation<p>This chapter will be a collection of stuff from somewhat related but still distinct domains. What unites it is that all the algorithms we will discuss are, after all, targeted at calculating approximations to some mathematical functions. There are no advanced data structures involved, neither is the aim to find a clever way to improve the runtime of some common operations. No, these algorithms are about calculations and computing an acceptable result within the allocated time budget.</p> <h2 id="combinatorialoptimization">Combinatorial Optimization</h2> <p>Dynamic Programming is a framework that can be used for finding the optimal value of some loss function when there are multiple configurations of the problem space that result in different values. Such search is an example of discrete optimization for there is a countable number of states of the system and a distinct value of the cost function we're optimizing corresponding to each state. There are also similar problems that have an unlimited and uncountable number of states, but there is still a way to find a global or local optimum of the cost function for them. They comprise the continuous optimization domain. Why is optimization not just a specialized area relevant to a few practitioners but a toolbox that every senior programmer should know how to utilize? 
The primary reason is that it is applicable in almost any domain: the problem just needs to be large enough to rule out simple brute force. You can optimize how the data is stored or how the packets are routed, how the blueprint is laid out or the servers are loaded. Many people are just not used to looking at their problems this way. Also, understanding optimization is an important prerequisite for having a good grasp of machine learning, which is revolutionizing the programming world.</p> <p>DP is an efficient and, overall, great optimization approach, but it can't succeed if the problem doesn't have an optimal substructure. Combinatorial Optimization approaches deal with finding a near-optimum for the problems where an exhaustive search requires <code>O(2^n)</code> computations. Such problems are called NP-hard and a classic example of those is the Travelling Salesman (<strong>TSP</strong>). The task is to find an optimal order of edges in a cycle spanning all vertices of a fully-connected weighted graph. As we saw previously, this problem doesn't have an optimal substructure, i.e. an optimal partial solution isn't necessarily a part of the best overall one, and so taking the shortest edge doesn't allow the search procedure to narrow down the search space when looking at the next vertex. A direct naive approach to TSP will enumerate all the possible variants and select the one with a minimal cost. However, the number of variants is <code>n!</code>, so this approach becomes intractable very fast. A toy example of visiting all the capitals of the 50 US states has <code>10^64</code> variants. This is where quantum computers promise to overturn the situation, but while we're waiting for them to mature, the only feasible approach is developing approximation methods that will get us a good enough solution in polynomial (ideally, linear) time. TSP may look like a purely theoretical problem, but it has some real-world applications. 
Besides vehicle routing, automated drilling and soldering in electronics are other examples. Yet, even more important is that there are many other combinatorial optimization problems, and, in essence, the approaches to solving one of them apply to all the rest. That is, as with shortest path, coming up with an efficient solution to TSP allows us to efficiently solve a very broad range of problems over a variety of domains.</p> <p>So, let's write down the code for the basic TSP solution. As usual, we have to select the appropriate graph representation. From one point of view, we're dealing with a fully-connected graph, so every representation will work and a matrix one will be the most convenient. However, storing an <code>n^2</code>-sized array is not the best option, especially for a large <code>n</code>. A better "distributed" representation might be useful here. Yet, for the TSP graph, an even better approach would be to do the opposite of our usual optimization trick: trade computation for storage space. When the graph is fully-connected, usually, there exists some kind of an underlying metric space that contains all the vertices. The common example is the Euclidean space, in which each vertex has a coordinate (for example, the latitude and longitude). Anyway, whichever way to represent the vertex position is used, the critical requirement is the existence of a metric that may be calculated at any time (and fast). Under such conditions, we don't have to store the edges at all. So, our graph will be just a list of vertices.</p> <p>Let's use the example with the US state capitals. Each vertex will be represented as a pair of floats (lat and lon). 
We can retrieve the raw data from the Wikipedia article about the <a href="https://en.wikipedia.org/w/index.php?title=List_of_state_and_territorial_capitols_in_the_United_States&action=edit&section=1">US capitols (with an 'o')</a> and extract the values we need with the following code snippet<a href="#f12-1" name="r12-1">[1]</a>, which cuts a few corners:</p> <pre><code>(defstruct city<br /> name lat lon)<br /><br />(defparameter *wp-link* "https://en.wikipedia.org/w/index.php?title=List_of_state_and_territorial_capitols_in_the_United_States&action=edit&section=1")<br /><br />(defparameter *cs*<br /> (with ((raw (drakma:http-request *wp-link*))<br /> (coords-regex (ppcre:create-scanner "\\{\\{coord\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|type"))<br /> (capitals (list)))<br /> (flet ((dms->rad (vec off)<br /> (* (/ pi 180)<br /> (+ (? vec (+ off 0))<br /> (/ (? vec (+ off 1)) 60)<br /> (/ (? vec (+ off 2)) 3600)))))<br /> (dolist (line (split #\Newline (slice raw<br /> (search "{| class=\"wikitable sortable\"" raw)<br /> (search "</textarea><div class='editOptions'>" raw))))<br /> (when-it (and (starts-with "|" line)<br /> (search "{{coord" line))<br /> (with ((_ coords (ppcre:scan-to-strings coords-regex line))<br /> (coords (map* 'read-from-string coords)))<br /> (push (make-city :name (slice line (position-if 'alpha-char-p line)<br /> (position-if (lambda (ch) (member ch '(#\] #\|)))<br /> line :start 1))<br /> :lat (dms->rad coords 0)<br /> :lon (dms->rad coords 3))<br /> capitals)))))<br /> (coerce capitals 'vector)))<br /><br />CL-USER> (length *cs*)<br />50<br /></code></pre> <p>We also need to define the metric. The calculation of distances on Earth, though, is not so straightforward as on a plane. 
Usually, as a first approximation, the haversine formula is used, which provides an estimate of the shortest distance over the surface, "as the crow flies" (ignoring the relief).</p> <pre><code>(defun earth-dist (c1 c2)<br /> (with ((lat1 (? c1 'lat))<br /> (lat2 (? c2 'lat))<br /> (a (+ (expt (sin (/ (- lat2 lat1) 2))<br /> 2)<br /> (* (cos lat1)<br /> (cos lat2)<br /> (expt (sin (/ (- (? c2 'lon) (? c1 'lon)) 2)) <br /> 2)))))<br /> (* 1.2742e7 ; Earth diameter<br /> (atan (sqrt a) (sqrt (- 1 a)))))) <br /></code></pre> <p>With the metric at our disposal, let's define the function that will calculate the length of the whole path and use it for a number of random paths (we'll use the RUTILS function <code>shuffle</code> to produce a random path).</p> <pre><code>(defun path-length (path)<br /> (let ((rez (earth-dist (? path 0) (? path -1))))<br /> (dotimes (i (1- (length path)))<br /> (:+ rez (earth-dist (? path i) (? path (1+ i)))))<br /> rez))<br /><br />CL-USER> (path-length *cs*)<br />9.451802301259182d7<br />CL-USER> (path-length (shuffle *cs*))<br />9.964776273250546d7<br />CL-USER> (path-length (shuffle *cs*))<br />1.009761841183094d8<br /></code></pre> <p>We can see that an average path may have a length of around 100 thousand kilometers (the distances are in meters). However, we don't know anything about the shortest or the longest one, and to find out reliably, we'd have to evaluate all <code>50!</code> paths... Yet, while we have to accept the sad fact that this is not possible with our current technology, it's no reason to give up. Yes, we may not be able to find the absolute best path, but at least we can try to improve on the random one. Already, the three previous calculations had a spread of about 5%. So, if we're lucky, maybe we could hit a better path purely by chance. 
Let's try a thousand paths using our usual argmin pattern:</p> <pre><code>(defun random-search (path n)<br /> (let ((min (path-length path))<br /> (arg path))<br /> (loop :repeat n :do<br /> (with ((path (shuffle path))<br /> (len (path-length path)))<br /> (when (< len min)<br /> (:= min len<br /> arg path))))<br /> (values arg<br /> min)))<br /><br />CL-USER> (:= *print-length* 2)<br />2<br />CL-USER> (random-search *cs* 1000)<br />(#S(CITY :NAME "Atlanta" :LAT 0.5890359059538811d0 ...)<br /> #S(CITY :NAME "Montpelier, Vermont" :LAT 0.772521512027179d0 ...) ...)<br />7.756170773802838d7<br /></code></pre> <p>OK, we've got a sizable 20% improvement. What about 1,000,000 combinations?</p> <pre><code>CL-USER> (time (random-search *cs* 1000000))<br />Evaluation took:<br /> 31.338 seconds of real time<br />...<br />(#S(CITY :NAME "Boise, Idaho" :LAT 0.7612723873453388d0 ...)<br /> #S(CITY :NAME "Helena, Montana" :LAT 0.813073800024579d0 ...) ...)<br />6.746660953705506d7<br /></code></pre> <p>Cool, another 15%. Should we continue increasing the size of the sample? Maybe, after a day of computations, we could get the path length down by another 20-30%. And that's already a good gain. Surely, we could also parallelize the algorithm or use a supercomputer in order to analyze many more variants. But there should be something smarter than simple brute force, right?</p> <h2 id="localsearch">Local Search</h2> <p>Local Search is the "dumbest" of these smart approaches, built upon the following idea: if we had a way to systematically improve our solution, instead of performing purely random sampling, we could arrive at better variants much faster. The local search procedure starts from a random path and continues improving it until the optimum is reached. This optimum will be a local one (hence the name), but it will still be better than what we have started with. 
Besides, we could run the optimization procedure many times from different initial points, basically, getting the benefits of the brute force approach. We can think of this multiple-run local search as sampling + optimization.</p> <pre><code>(defun local-search (path improve-fn)<br /> (let ((min (path-length path))<br /> (cc 0)) ; iteration count<br /> (loop<br /> (:+ cc)<br /> (if-it (call improve-fn path)<br /> (:= min (path-length it)<br /> path it)<br /> (return (values path<br /> min<br /> cc))))))<br /></code></pre> <p>For this code to work, we also need to supply the <code>improve-fn</code>. Coming up with it is where the creativity of the algorithmic researcher needs to be channeled. Different problems (and even a single problem) may allow for different approaches. For TSP, several improvement possibilities have been discovered so far. And all of them use the planar (2d) nature of the graph we're processing. It is an additional constraint that has a useful consequence: if the paths between two pairs of nodes intersect, there are definitely shorter paths between them that are nonintersecting. So, swapping the edges will improve the whole path. If we were to draw a picture of this swap, it would look like this (the edges <code>A-D</code> and <code>C-B</code> intersect, while <code>A-B</code> and <code>C-D</code> don't and hence their total length is shorter):</p> <pre><code> - A B - - A - B -<br /> X ==><br /> - C D - - C - D -<br /></code></pre> <p>This rule allows us to specify the so-called <code>2-opt</code> improvement procedure:</p> <pre><code>(defun 2-opt (path)<br /> (loop :repeat (* 2 (length path)) :do<br /> (with ((len (length path))<br /> (v1 (random len))<br /> (v1* (if (= #1=(1+ v1) len) 0 #1#))<br /> (v2 (loop :for v := (random len)<br /> :when (and (/= v v1) (/= v (1- v1))) :do (return v)))<br /> (v2* (if (= #2=(1+ v2) len) 0 #2#)))<br /> (when (< (+ (path-length (vec (? path v1) (? path v2)))<br /> (path-length (vec (? 
path v1*) (? path v2*))))<br /> (+ (path-length (vec (? path v1) (? path v1*)))<br /> (path-length (vec (? path v2) (? path v2*)))))<br /> (let ((beg (min v1* v2*))<br /> (end (max v1* v2*)))<br /> (return (concatenate 'vector <br /> (subseq path 0 beg)<br /> (reverse (subseq path beg end))<br /> (subseq path end))))))))<br /></code></pre> <p>Note that we do not need to perform a complicated check for path intersection (which requires an algorithm of its own, and there are a number of papers dedicated to this task). In fact, we don't care if there is an intersection: we just need to know that the new path, which consists of the newly replaced edges and a reversed part of the path between the two inner nodes of the old edges, is shorter. One more thing to notice is that this implementation doesn't perform an exhaustive analysis of all possible edge swaps, which is suggested by the original 2-opt algorithm (an <code>O(n^2)</code> operation). Here, we select just a random pair. Both variants are acceptable, and ours is simpler to implement.</p> <pre><code>CL-USER> (local-search *cs* '2-opt)<br />#(#S(CITY :NAME "Jackson, Mississippi" :LAT 0.5638092223095238d0 ...)<br /> #S(CITY :NAME "Baton Rouge, Louisiana" :LAT 0.5315762080646039d0 ...) ...)<br />3.242702077795514d7<br />111<br /></code></pre> <p>So, outright, we've got a 100% improvement on the <code>random-search</code> path obtained after a much larger number of iterations. Iteration counting was added to the code in order to estimate the work we had to do. To make a fair comparison, let's run <code>random-search</code> with the same <code>n</code> (111):</p> <pre><code>CL-USER> (random-search *cs* 111)<br />#(#S(CITY :NAME "Boise, Idaho" :LAT 0.7612723873453388d0 ...)<br /> #S(CITY :NAME "Springfield, Illinois" :LAT 0.6946151297363367d0 ...) 
...)<br />7.522044767585556d7<br /></code></pre> <p>But this is still not 100% fair as we haven't yet factored in the time needed for the <code>2-opt</code> call, which is much heavier than the way random search operates. By my estimate, 111 iterations of <code>local-search</code> took 4 times as long, so...</p> <pre><code>CL-USER> (random-search *cs* 444)<br />#(#S(CITY :NAME "Lansing, Michigan" :LAT 0.745844229097319d0 ...)<br /> #S(CITY :NAME "Springfield, Illinois" :LAT 0.6946151297363367d0 ...) ...)<br />7.537249874357127d7<br /></code></pre> <p>Now, the runtimes are the same, but there's not really much improvement in the random search outcome. That's expected: as we have already observed, achieving a significant improvement in <code>random-search</code> results requires performing orders of magnitude more operations.</p> <p>Finally, let's define <code>multi-local-search</code> to leverage the power of random sampling:</p> <pre><code>(defun multi-local-search (path n)<br /> (let ((min (path-length path))<br /> (arg path))<br /> (loop :repeat n :do<br /> (with ((cur (local-search (shuffle path) '2-opt)))<br /> (when (< #1=(path-length cur) min)<br /> (:= min #1#<br /> arg cur))))<br /> (values arg<br /> min)))<br /><br />CL-USER> (time (multi-local-search *cs* 1000))<br />Evaluation took:<br /> 22.394 seconds of real time<br />...<br />#(#S(CITY :NAME "Atlanta" :LAT 0.5890359059538811d0 ...)<br /> #S(CITY :NAME "Montgomery, Alabama" :LAT 0.5650930224896327d0 ...) ...)<br />2.8086843039667137d7<br /></code></pre> <p>Quite a good improvement that took only 20 seconds to achieve!</p> <p>As a final touch, let's draw the paths on the map. It's always good to double-check the result using some visual approach when it's available. 
Here is our original random path (Anchorage and Honolulu are a bit off due to the issues with the map projection):</p> <a href="https://4.bp.blogspot.com/-8yB_JUhrpo4/Xho2WWUj1uI/AAAAAAAACSg/wJAgbR8nfF80Eyvc7pWDoWeDKs9xQjt5wCLcBGAsYHQ/s1600/usa-random.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-8yB_JUhrpo4/Xho2WWUj1uI/AAAAAAAACSg/wJAgbR8nfF80Eyvc7pWDoWeDKs9xQjt5wCLcBGAsYHQ/s640/usa-random.png" width="640" height="376" data-original-width="1479" data-original-height="868" /></a> <p>This is the result of random search with a million iterations:</p> <a href="https://4.bp.blogspot.com/-BT9-2cTjrBM/Xho2WVRfvKI/AAAAAAAACSc/B2cstEx8Z9QOBcalifN3yIGaAWYOD0s7QCPcBGAYYCw/s1600/usa-rs.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-BT9-2cTjrBM/Xho2WVRfvKI/AAAAAAAACSc/B2cstEx8Z9QOBcalifN3yIGaAWYOD0s7QCPcBGAYYCw/s640/usa-rs.png" width="640" height="376" data-original-width="1479" data-original-height="868" /></a> <p>And this is our multistart local search outcome. Looks nice, doesn't it?</p> <a href="https://3.bp.blogspot.com/-YXH9QxiOpkw/Xho2WW7BL2I/AAAAAAAACSY/ZtVq4Rz--QAl5r4CCfymQRXoTz3NRyElACPcBGAYYCw/s1600/usa-ls.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-YXH9QxiOpkw/Xho2WW7BL2I/AAAAAAAACSY/ZtVq4Rz--QAl5r4CCfymQRXoTz3NRyElACPcBGAYYCw/s640/usa-ls.png" width="640" height="376" data-original-width="1479" data-original-height="868" /></a> <p>2-opt is the simplest path improving technique. There are more advanced ones like 3-opt and Lin-Kernighan heuristic. Yet, the principle remains the same: for local search to work, we have to find a way to locally improve our current best solution.</p> <p>Another direction of the development of the basic algorithm, besides better local improvement procedures and trying multiple times, is devising a way to avoid being stuck in local optima. <strong>Simulated Annealing</strong> is the most well-known technique for that. 
The idea is to replace the unconditional selection of a better variant (if it exists) with a probabilistic one. The name and inspiration for the technique come from the physical process of cooling molten materials down to the solid state. When molten steel is cooled too quickly, cracks and bubbles form, marring its surface and structural integrity. Annealing is a metallurgical technique that uses a disciplined cooling schedule to efficiently bring the steel to a low-energy, optimal state. The application of this idea to the optimization procedure introduces the temperature parameter <code>T</code>. At each step, a new state is produced from the current one. For instance, it can be achieved using 2-opt, although the algorithm doesn't impose the limitation on the state to necessarily be better than the current one, so even such a simple thing as a random swap of vertices in the path is admissible. Next, unlike with local search, the transition to the candidate state doesn't happen unconditionally: an improvement is always accepted, while a worse candidate is accepted only with a probability that depends on <code>T</code> (classically, the Metropolis criterion <code>(exp (- (/ delta T)))</code>, where <code>delta</code> is the cost increase), so the worse the candidate and the lower the temperature, the less likely the transition. Initially, we start with a high value of <code>T</code> and then decrease it following some annealing schedule. Eventually, <code>T</code> falls to 0 towards the end of the allotted time budget. In this way, the system is expected to wander, at first, towards a broad region of the search space containing good solutions, ignoring small fluctuations; then the drift towards low-energy regions becomes narrower and narrower; and, finally, it transitions to ordinary local search according to the steepest descent heuristic.</p> <h2 id="evolutionaryalgorithms">Evolutionary Algorithms</h2> <p>Local search is the simplest example of a family of approaches that are collectively called <strong>Metaheuristics</strong>. All the algorithms from this family operate, in general, by sampling and evaluating a set of solutions which is too large to be completely evaluated. 
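The annealing loop described above can be sketched as follows (a minimal sketch in Python; the <code>neighbor</code> and <code>energy</code> functions are caller-supplied placeholders, and the geometric cooling schedule and Metropolis-style acceptance rule are common choices, not the only ones):

```python
import math
import random

def anneal(state, energy, neighbor, t0=1.0, cooling=0.995, steps=10_000):
    # ENERGY scores a state (lower is better), NEIGHBOR produces
    # a candidate state; both are supplied by the caller
    cur, cur_e = state, energy(state)
    best, best_e = cur, cur_e
    t = t0
    for _ in range(steps):
        cand = neighbor(cur)
        delta = energy(cand) - cur_e
        # always accept improvements; accept worsenings with
        # probability exp(-delta/T), which falls together with T
        if delta <= 0 or random.random() < math.exp(-delta / t):
            cur, cur_e = cand, cur_e + delta
            if cur_e < best_e:
                best, best_e = cur, cur_e
        t *= cooling  # geometric annealing schedule
    return best, best_e

# toy usage: minimize x^2 over the integers with +-1 moves
best, best_e = anneal(100, lambda x: x * x,
                      lambda x: x + random.choice((-1, 1)))
```

For TSP, the state would be a path, the energy its length, and the neighbor a 2-opt move or a random swap of vertices.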
The difference is in the specific approach to sampling that is employed.</p> <p>A prominent group of metaheuristic approaches is called Evolutionary (and/or nature-inspired) algorithms. It includes such methods as Genetic Algorithms, Ant Colony and Particle Swarm Optimization, Cellular and even Grammatical Evolution. The general idea is to perform optimization in parallel by maintaining the so-called population of states and altering this population using a set of rules that improve the aggregate quality of the whole set while permitting some outliers in hopes that they may lead to better solutions unexplored by the currently fittest part of the population.</p> <p>We'll take a brief glance at evolutionary approaches using the example of <strong>Genetic Algorithms</strong>, which are, probably, the most well-known technique among them. The genetic algorithm (GA) views each possible state of the system as an individual "genome" (encoded as a vector). GA is best viewed as a framework that requires specification of several procedures that operate on the genomes of the current population:</p> <ul><li>The initialization procedure which creates the initial population. After it, the size of the population remains constant, but each individual may be replaced with another one obtained by applying the evolution procedures.</li> <li>The fitness function that evaluates the quality of the genome and assigns some weight to it. For TSP, the length of the path is the fitness function. For this problem, the smaller the value of the function, the better.</li> <li>The selection procedure specifies which items from the population to use for generating new variants. In the simplest case, this procedure can use the whole population.</li> <li>The evolution operations which may be applied. The usual GA operations are mutation and crossover, although others can be devised also.</li></ul> <p>Mutation operates on a single genome and alters some of its slots according to a specified rule. 
2-opt may be a valid mutation strategy, although even the generation of a random permutation of the TSP nodes may work if it is applied to a part of the genome and not to the whole. By controlling the magnitude of mutation (what portion of the genome is allowed to be involved in it), it is possible to choose the level of stochasticity in this process. But the key idea is that each change should retain at least some resemblance to the previous version, or we'll just end up with stochastic search.</p> <p>The crossbreeding operation isn't, strictly speaking, necessary in the GA, but some of the implementations use it. This process transforms two partial solutions into two others by swapping some of their parts. Of course, it's not possible to apply it directly to TSP, as it would result in the violation of the main problem constraint of producing a loop that spans all the nodes. Instead, another procedure called the ordered crossover should be used. Without crossbreeding, GA may be considered a parallel version of local search.</p> <p>Here is the basic GA skeleton. It requires definition of the procedures <code>init-population</code>, <code>select-candidates</code>, <code>mutate</code>, <code>crossbread</code>, and <code>score-fitness</code>.</p> <pre><code>(defun ga (population-size &key (n 100))<br /> (let ((genomes (init-population population-size)))<br /> (loop :repeat n :do<br /> (let ((candidates (select-candidates genomes)))<br /> (dolist (ex (mapcar 'mutate candidates))<br /> (push ex genomes))<br /> (dolist (ex (crossbread candidates))<br /> (push ex genomes))<br /> (:= genomes (take population-size<br /> (sort genomes '< :key 'score-fitness)))))<br /> genomes))<br /></code></pre> <p>This template is not a gold standard; it can be tweaked and altered, but it gives you the general idea. The other evolutionary optimization methods also follow the same principles but define different ways to evolve the population. 
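The ordered crossover mentioned above is easy to illustrate. Here is a sketch of a simplified OX1-style variant in Python (city indices stand in for genome slots; the names are illustrative):

```python
import random

def ordered_crossover(p1, p2):
    # copy a random slice from the first parent, then fill the
    # remaining slots with the missing cities in the order they
    # appear in the second parent -- the child is again a valid tour
    n = len(p1)
    i, j = sorted(random.sample(range(n), 2))
    middle = p1[i:j]
    rest = [x for x in p2 if x not in middle]
    return rest[:i] + middle + rest[i:]

child = ordered_crossover(list(range(10)), random.sample(range(10), 10))
assert sorted(child) == list(range(10))  # still a permutation of the cities
```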
For example, Particle Swarm Optimization operates by moving candidate solutions (particles) around in the search space according to simple mathematical formulae over their position and velocity. The movement of each particle is influenced by its local best known position, as well as guided toward the global best known positions in the search space. And those are, in turn, updated as better positions are found by other particles. By the way, the same idea underlies the Particle Filter algorithm used in signal processing and statistical inference.</p> <h2 id="branchbound">Branch & Bound</h2> <p>Metaheuristics can be, in general, classified as local search optimization methods, for they operate in a bottom-up manner by selecting a random solution and trying to improve it by gradual change. The opposite approach is global search that tries to systematically find the optimum by narrowing the whole problem space. We have already seen the same pattern of two alternative ways to approach the task — top-down and bottom-up — in parsing, and it also manifests in other domains that permit problem formulation as a search task.</p> <p>How is a top-down systematic evaluation of the combinatorial search space even possible? Obviously, not in its entirety. However, there are methods that allow the algorithm to rule out significant chunks that certainly contain suboptimal solutions and narrow the search to only the relevant portions of the domain that may be much smaller in cardinality. If we manage to discard, this way, a large number of variants, we have more time to evaluate the other parts, thus achieving better results (for example, with Local search).</p> <p>The classic global search is represented by the Branch & Bound method. It views the set of all candidate solutions as a rooted tree with the full set being at the root. The algorithm explores branches of this tree, which represent subsets of the solution set. 
Before enumerating the candidate solutions of a branch, the branch is checked against upper and lower estimated bounds on the optimal solution and is discarded if it cannot produce a better solution than the best one found so far by the algorithm. The key feature of the algorithm is efficient bounds estimation. When it is not possible, the algorithm degenerates to an exhaustive search.</p> <p>Here is a skeleton B&B implementation. Similar to the one for Genetic Algorithms, it relies on providing implementations of the key procedures separately for each search problem. For the case of TSP, the function will accept a graph, and all the permutations of its vertices comprise the search space. We'll use the <code>branch</code> struct to represent the subspace we're dealing with. We can narrow down the search by pinning a particular subset of edges: this way, the subspace will contain only the variants originating from the possible permutations of the vertices that are not attached to those edges.</p> <pre><code>(defstruct branch<br /> (upper most-positive-fixnum)<br /> (lower 0)<br /> (edges (list)))<br /></code></pre> <p>The <code>b&b</code> procedure will operate on the graph <code>g</code> and will have an option to either work until the shortest path is found or terminate after <code>n</code> steps.</p> <pre><code>(defun b&b (g &key n)<br /> (with ((cur (vertices g))<br /> (min (cost cur))<br /> (arg cur)<br /> (q (list (make-branch :upper min :lower (lower-bound g ())))))<br /> (loop :for i :from 0<br /> :for branch := (pop q) :while branch :do<br /> (when (eql i n) (return))<br /> (if (branchp branch)<br /> (dolist (item (branch-out branch))<br /> ;; we leave only the subbranches that can,<br /> ;; at least in theory, improve on the current solution<br /> (when (< (branch-lower item) min)<br /> (push item q)))<br /> (let ((cost (branch-upper branch)))<br /> (when (< cost min)<br /> (:= min cost<br /> arg branch)))))<br /> (values arg<br /> min)))<br /></code></pre> 
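To make the bound estimation concrete, here is a hedged sketch in Python of the simple TSP lower bound this skeleton relies on: half the sum, over all vertices, of the shortest edge attached to each vertex (pinned edges are omitted, and the toy Euclidean metric and the names are illustrative):

```python
import itertools
import math

def lower_bound(coords, dist):
    # any tour uses two edges per vertex, each at least as long as the
    # shortest edge attached to that vertex, and each edge is shared by
    # two vertices, so a tour is never shorter than half the sum of
    # the per-vertex shortest edges
    total = 0.0
    for v in range(len(coords)):
        total += min(dist(coords[v], coords[u])
                     for u in range(len(coords)) if u != v)
    return total / 2

def tour_length(coords, dist, order):
    return sum(dist(coords[order[i]], coords[order[(i + 1) % len(order)]])
               for i in range(len(order)))

# sanity check on a unit square: the optimal tour is its perimeter (4),
# and the bound (4 * 1 / 2 = 2) never exceeds it
pts = [(0, 0), (0, 1), (1, 1), (1, 0)]
euclid = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
best = min(tour_length(pts, euclid, p)
           for p in itertools.permutations(range(4)))
assert lower_bound(pts, euclid) <= best
```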
<p>The <code>branch-out</code> function is rather trivial: it will generate all the possible variants by expanding the current edge set with a single new edge, and it will also calculate the bounds for each variant. The most challenging part is figuring out the way to compute the <code>lower-bound</code>. The key insight here is the observation that each path in the graph is not shorter than half the sum of the shortest edges attached to each vertex. So, the lower bound for a branch with pinned edges <code>e1</code>, <code>e2</code>, and <code>e3</code> will be the sum of the lengths of these edges plus half the sum of the shortest edges attached to all the other vertices that those edges don't cover. It is the most straightforward and crude approximation that will allow the algorithm to operate. It can be improved upon further; devising ways to make it more precise and estimating whether they are worth applying in terms of computational complexity is left as an exercise for the reader.</p> <p>B&B may also use additional heuristics to further optimize its performance at the expense of producing a slightly more suboptimal solution. For example, one may wish to stop branching when the gap between the upper and lower bounds becomes smaller than a certain threshold. Another improvement may be to use a priority queue instead of a stack, in our example, in order to process the most promising branches first.</p> <p>One more thing I wanted to mention in the context of global heuristic search is <strong>Monte Carlo Tree Search</strong> (MCTS), which, in my view, uses a very similar strategy to B&B. It is the currently dominant method for finding near-optimal paths in the decision tree of turn-based and other similar games (like go or chess). The difference between B&B and MCTS is that, typically, B&B will use a conservative exact lower bound for determining which branches to skip. 
MCTS, instead, calculates an estimate of the potential of the branch to yield the optimal solution by sampling a number of random items from the branch and averaging their scores. So, it can be considered a "softer" variant of B&B. The two approaches can also be combined, for example, to prioritize the branches in the B&B queue. The term "Monte Carlo", by the way, is applied to many algorithms that use uniform random sampling as the basis of their operation.</p> <h2 id="gradientdescent">Gradient Descent</h2> <p>The key idea behind Local Search was to find a way to somehow improve the current best solution and change it in that direction. It can be similarly utilized when switching from discrete problems to continuous ones. And in this realm, the direction of improvement (actually, the best possible one) is called the <strong>gradient</strong> (or rather, the opposite of the gradient). Gradient Descent (GD) is the principal optimization approach in the continuous space that works in the same manner as Local Search: find the direction of improvement and proceed along it. There's also a colloquial name for this approach: hill climbing. It has a lot of variations and improvements that we'll discuss in this chapter. But we'll start with the code for the basic algorithm. Once again, it will be a template that can be filled in with specific implementation details for the particular problem. 
We see this "framework" pattern recurring over and over in optimization methods as most of them provide a general solution that can be applied in various domains and be appropriately adjusted for each one.</p> <pre><code>(defun gd (fn data &key n (learning-rate 0.1) (precision 1e-6))<br /> (with ((ws (init-weights fn))<br /> (cost (cost fn ws))<br /> (i 0))<br /> (loop<br /> (update-weights ws learning-rate<br /> (grad fn ws data))<br /> (let ((prev cost))<br /> (:= cost (cost fn ws))<br /> (when (or (< (abs (- prev cost))<br /> precision)<br /> (eql n (:+ i)))<br /> (return))))<br /> (values ws<br /> cost)))<br /></code></pre> <p>This procedure optimizes the weights (<code>ws</code>) of some function <code>fn</code>. Moreover, whether or not we know the mathematical formula for <code>fn</code> doesn't really matter: the key is to be able to compute <code>grad</code>, which may be done analytically (using a formula that is just coded) or in a purely data-driven fashion (what Backprop, which we have seen in the previous chapter, does). <code>ws</code> will usually be a vector or a matrix and <code>grad</code> will be an array of the same dimensions. In the simplest (and least interesting) toy case, both are just scalar numbers.</p> <p>Besides, in this framework, we need to define the following procedures:</p> <ul><li><code>init-weights</code> sets the starting values in the <code>ws</code> vector according to <code>fn</code>. There are several popular ways to do that: the obvious all-zeroes initialization, which doesn't work in conjunction with backprop; sampling from a uniform distribution with a small amplitude; and more advanced heuristics like Xavier initialization.</li> <li><code>update-weights</code> has a simple mathematical formulation: <code>(:- ws (* learning-rate gradient))</code>. 
But as <code>ws</code> is usually a multi-dimensional structure, in Lisp we can't just use <code>-</code> and <code>*</code> on them as these operations are reserved for dealing with numbers.</li> <li>It is also important to be able to calculate the <code>cost</code> function (also often called the "loss"). As you can see from the code, the GD procedure may terminate in two cases: either it has used the whole iteration budget assigned to it, or it has approached the optimum so closely that, at each new iteration, the change in the value of the cost function is negligible. Apart from this usage, tracking the <code>cost</code> function is also important for monitoring the "learning" process (another name for the optimization procedure, popular in this domain). If GD is operating correctly, the cost should monotonically decrease at each step.</li></ul> <p>This template is the most basic one, and you can see many ways to further improve and tune it. One important direction is controlling the learning rate: similar to Simulated Annealing, it may change over time according to some schedule or heuristics.</p> <p>Another set of issues that we won't elaborate upon now is related to dealing with numeric precision, and it also includes such problems as vanishing/exploding gradients.</p> <h3 id="improvinggd">Improving GD</h3> <p>In the majority of interesting real-world optimization problems, the gradient can't be computed analytically using a formula. Instead, it has to be recovered from the data, and this is a computationally-intensive process: for each item in the dataset, we have to run the "forward" computation and then compute the gradient in the "backward" step. A diametrically opposite approach in terms of both computation speed and quality of the gradient would be to take just a single item and use the gradient for it as an approximation of the actual gradient. 
From the statistics point of view, after a long sequence of such samples, we have to converge to some optimum anyway. This technique, called <strong>Stochastic Gradient Descent</strong> (SGD), can be considered a form of combining sampling with gradient descent. Yet, sampling could also be applied to the dataset directly, taking a batch of several items at each step. The latter approach is called <strong>Batch Gradient Descent</strong>, and it combines the best of both worlds: decent performance and a much more predictable gradient that is closer to the actual value, which is more suitable for supporting the more advanced approaches, such as momentum.</p> <p>In essence, momentum makes the gradient that is calculated on a batch of samples more stable and less prone to oscillation due to the random fluctuations of the batch samples. It is, basically, achieved by using a moving average of the gradient. Different momentum-based algorithms operate by combining the currently computed value of the update with the previous value. For example, simple SGD with momentum will have the following update code:</p> <pre><code>(let ((dws 0))<br /> (loop<br /> (with ((batch (sample data batch-size))<br /> (g (calculate-gradient batch)))<br /> (:= dws (- (* decay-rate dws)<br /> (* learning-rate g)))<br /> (:+ ws dws))))<br /></code></pre> <p>An alternative variant is called the Nesterov accelerated gradient, which uses the following update procedure:</p> <pre><code>(let ((dws 0))<br /> (loop<br /> (:+ ws dws)<br /> (with ((batch (sample data batch-size))<br /> (g (- (* learning-rate (calculate-gradient batch)))))<br /> (:= dws (+ (* decay-rate dws) g)) <br /> (:+ ws g))))<br /></code></pre> <p>I.e., we first perform the update using the previous momentum, and only then calculate the gradient and perform the gradient-based update. The motivation for it is the following: while the gradient term always points in the right direction, the momentum term may not. 
If the momentum term points in the wrong direction or overshoots, the gradient can still "go back" and correct it in the same update step.</p> <p>Another direction of GD improvement is using an adaptive <code>learning-rate</code>. For instance, the famous <strong>Adam</strong> algorithm tracks a per-cell learning rate for the <code>ws</code> matrix.</p> <p>These are not all the ways in which plain gradient descent may be made more sophisticated in order to converge faster. I won't mention here second-order methods or conjugate gradients. Numerous papers exploring this space continue to be published.</p> <h2 id="sampling">Sampling</h2> <p>Speaking of sampling, which we have mentioned several times throughout this book, this is a good place to present a couple of simple sampling tricks that may prove useful in many different problems.</p> <p>The sampling that is used in SGD is the simplest form of random selection: pick a random element from the set and repeat the specified number of times. This sampling is called "with replacement". The reason for the name is that, after picking an element, it is not removed from the set (i.e. it can be considered "replaced" by an equal element), and so it can be picked again. Such an approach is the simplest one to implement and reason about. There's also the "without replacement" version that removes the element from the set after selecting it. It ensures that each element may be picked only once, but it also changes the probabilities of picking the remaining elements on subsequent iterations.</p> <p>Here is an abstract (as we don't specify the representation of the set and the related <code>size</code>, <code>remove-item</code>, and <code>empty?</code> procedures) implementation of these sampling methods:</p> <pre><code>(defun sample (n set &key (with-replacement t))<br /> (loop :repeat n<br /> :for i := (random (size set))<br /> :collect (? 
set i)<br /> :unless with-replacement :do<br /> (remove-item set i)<br /> (when (empty? set) (loop-finish))))<br /></code></pre> <p>This simplest approach samples from a uniform probability distribution, i.e. it assumes that the elements of the set have an equal chance of being selected. In many tasks, these probabilities have to be different. For such cases, a more general sampling implementation is needed:</p> <pre><code>(defun sample-from-dist (n dist)<br /> ;; here, DIST is a hash-table with keys being items<br /> ;; and values — their probabilities<br /> (let ((scale (reduce '+ (vals dist))))<br /> (loop :repeat n :collect<br /> (let ((r (* scale (random 1.0)))<br /> (acc 0))<br /> (dotable (k v dist)<br /> (:+ acc v)<br /> (when (>= acc r)<br /> (return k)))))))<br /><br />CL-USER> (sample-from-dist 10 #h(:foo 2<br /> :quux 1<br /> :baz 10))<br />(:BAZ :BAZ :BAZ :QUUX :BAZ :BAZ :BAZ :BAZ :BAZ :FOO)<br /></code></pre> <p>I'm surprised how often I have to retell this simple sampling technique. In it, all the items are placed on a [0, 1) interval occupying the parts proportionate to their weight in the probability distribution (<code>:baz</code> has 10/13 or about 77% of the weight in the distribution above). Then we put a random point in this interval and determine in which part it falls.</p> <p>The final sampling approach I'd like to show here — quite a popular one for programming interviews — is <strong>Reservoir Sampling</strong>. It deals with uniform sampling from an infinite set. Well, how do you represent an infinite set? For practical purposes, it can be thought of as a stream. So, the items are read sequentially from this stream and we need to decide which ones to collect and which to skip. 
This is achieved by the following procedure:</p> <pre><code>(defun reservoir-sample (n stream)<br /> (let ((rez (make-array n :initial-element nil))) ; reservoir<br /> (handler-case<br /> (loop :for item := (read stream)<br /> :for i :from 0<br /> :for r := (random (1+ i))<br /> :do (cond<br /> ;; initially, fill the reservoir with the first N items<br /> ;; afterwards, replace the R-th item of the reservoir<br /> ;; with probability (/ n (1+ i))<br /> ((< i n) (:= (? rez i) item))<br /> ((< r n) (:= (? rez r) item))))<br /> ;; sampling stops when the stream is exhausted<br /> ;; we'll use an input stream and read items from it<br /> (end-of-file () rez))))<br /><br />CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")<br /> (reservoir-sample 3 in))<br />#(BAR BAZ FOO)<br />CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")<br /> (reservoir-sample 3 in))<br />#(FOO FOO FOO)<br />CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")<br /> (reservoir-sample 3 in))<br />#(BAZ FOO FOO)<br />CL-USER> (with-input-from-string (in (format nil "~{~A ~}"<br /> (loop :for i :from 0 :to 100 :collect i)))<br /> (reservoir-sample 10 in))<br />#(30 42 66 68 76 5 22 39 51 24) ; note that 5 stayed on the appropriate position where it was placed initially<br /></code></pre> <h2 id="matrixfactorization">Matrix Factorization</h2> <p>Matrix factorization is a decomposition of a matrix into a product of matrices. It has many different variants that fit particular classes of problems. Matrix factorization is a computationally-intensive task that has many applications: from machine learning to information retrieval to data compression. 
Its use cases include: background removal in images, topic modeling, collaborative filtering, CT scan reconstruction, etc.</p> <p>Among many factorization methods, the following two stand out as the most prominent: Singular Value Decomposition (SVD) and non-negative matrix factorization/non-negative sparse coding (NNSC). NNSC is interesting as it produces much sharper vectors that still remain sparse, i.e. all the information is concentrated in the non-null slots.</p> <h3 id="singularvaluedecomposition">Singular Value Decomposition</h3> <p>SVD is the generalization of the eigendecomposition (which is defined only for square matrices) to any matrix. It is extremely important as the eigenvectors define the basis of the matrix and the eigenvalues — the relative importance of the eigenvectors. Once SVD is performed, using the obtained vectors, we can immediately figure out a lot of useful properties of the dataset. Thus, SVD is behind such methods as PCA in statistical analysis, LSI topic modeling in NLP, etc.</p> <p>Formally, the singular value decomposition of an <code>m x n</code> matrix <code>M</code> is a factorization of the form <code>(* U S V)</code>, where <code>U</code> is an <code>m x m</code> unitary matrix, <code>V</code> is an <code>n x n</code> unitary matrix, and <code>S</code> (usually, greek sigma) is an <code>m x n</code> rectangular diagonal matrix with non-negative real numbers on the diagonal. The columns of <code>U</code> are left-singular vectors of <code>M</code>, the rows of <code>V</code> are right-singular vectors, and the diagonal elements of <code>S</code> are known as the singular values of <code>M</code>.</p> <p>The singular value decomposition can be computed either analytically or via approximation methods. The analytic approach is not tractable for large matrices — the ones that occur in practice. Thus, approximation methods are used. 
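</p>

<p>As a quick illustration of the approximation idea, here is a sketch of <strong>power iteration</strong>, one of the simplest schemes for approximating the leading singular triple. This is an illustrative Python sketch, not the method described next:</p>

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

def top_singular(m, iters=100):
    """Power iteration on M^T M converges to the leading right-singular
    vector v; then sigma = |Mv| and u = Mv/sigma complete the triple.
    (A random start vector would guard against a start orthogonal
    to the answer; a fixed one keeps the sketch short.)"""
    mt = transpose(m)
    v = [1.0] * len(m[0])
    for _ in range(iters):
        w = matvec(mt, matvec(m, v))   # one application of (M^T M)
        n = norm(w)
        v = [x / n for x in w]
    mv = matvec(m, v)
    sigma = norm(mv)
    u = [x / sigma for x in mv]
    return u, sigma, v

u, sigma, v = top_singular([[2.0, 1.0], [1.0, 2.0]])
# for this symmetric matrix, the singular values are 3 and 1,
# so sigma converges to 3.0
```

<p>Subtracting the found component <code>sigma * u * v^T</code> from the matrix and repeating the process yields the subsequent singular triples.</p>

<p>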
One of the well-known algorithms is QuasiSVD, which was developed during the famous Netflix challenge in the 2000s. The idea behind QuasiSVD is, basically, gradient descent. The algorithm approximates the decomposition with random matrices and then iteratively improves it using the following update procedure:</p> <pre><code>(defun svd-1 (u v rank training-data &key (learning-rate 0.001))<br /> (dotimes (f rank)<br /> (loop :for (i j val) :in training-data :do<br /> (let ((err (- val (predict rank u i v j))))<br /> (:+ (aref u f i) (* learning-rate err (aref v f j)))<br /> (:+ (aref v f j) (* learning-rate err (aref u f i)))))))<br /></code></pre> <p>The described method is called QuasiSVD because the singular values are not explicit: the decomposition is into just two matrices of non-unit vectors. Another constraint of the algorithm is that the rank of the decomposition (the number of features) should be specified by the user. Yet, for practical purposes, this is often what is actually needed. Here is a brief description of the usage of the method for <a href="https://sifter.org/~simon/journal/20061211.html">predicting movie ratings for the Netflix challenge</a>.</p> <blockquote> <p>For visualizing the problem, it makes sense to think of the data as a big sparsely filled matrix, with users across the top and movies down the side, and each cell in the matrix either contains an observed rating (1-5) for that movie (row) by that user (column) or is blank meaning you don't know. This matrix would have about 8.5 billion entries (number of users times number of movies). Note also that this means you are only given values for one in 85 of the cells. The rest are all blank.</p> <p>The assumption is that a user's rating of a movie is composed of a sum of preferences about the various aspects of that movie. 
For example, imagine that we limit it to forty aspects, such that each movie is described only by forty values saying how much that movie exemplifies each aspect, and correspondingly each user is described by forty values saying how much they prefer each aspect. To combine these all together into a rating, we just multiply each user preference by the corresponding movie aspect, and then add those forty leanings up into a final opinion of how much that user likes that movie. [...] Such a model requires <code>(* 40 (+ 17k 500k))</code> or about <code>20M</code> values — 400 times less than the original <code>8.5B</code>.</p></blockquote> <p>Here is the function that approximates the rating. The QuasiSVD matrix <code>u</code> is <code>user-features</code> and <code>v</code> — <code>movie-features</code>. As you see, we don't need to further factor <code>u</code> and <code>v</code> into the matrix of singular values and the matrices of unit vectors. </p> <pre><code>(defun predict-rating (rank user-features user movie-features movie)<br /> (loop :for f :from 0 :below rank<br /> :sum (* (aref user-features f user)<br /> (aref movie-features f movie))))<br /></code></pre> <h2 id="fouriertransform">Fourier Transform</h2> <p>The last item we'll discuss in this chapter is not exactly an optimization problem, but it is also a numeric algorithm that is closely related to the previous ones and has broad practical applications. 
The Discrete Fourier Transform (DFT) is the most important discrete transform, used to perform Fourier analysis in many practical applications: in digital signal processing, the function is any quantity or signal that varies over time, such as the pressure of a sound wave, a radio signal, or daily temperature readings, sampled over a finite time interval; in image processing, the samples can be the values of pixels along a row or column of a raster image.</p> <p>It is said that the Fourier Transform transforms a "signal" from the time/space domain (represented by observed samples) into the frequency domain. Put simply, a time-domain graph shows how a signal changes over time, whereas a frequency-domain graph shows how much of the signal lies within each given frequency band over a range of frequencies. The inverse Fourier Transform performs the reverse operation and converts the frequency-domain signal back into the time domain. Explaining the deep meaning of the transform is beyond the scope of this book; the only thing worth mentioning here is that operating on the frequency domain allows us to perform many useful operations on the signal, such as determining the most important features, compression (that we'll discuss below), etc.</p> <p>The complexity of computing the DFT naively, just by applying its definition on <code>n</code> samples, is <code>O(n^2)</code>:</p> <pre><code>(defun dft (vec)<br /> (with ((n (length vec))<br /> (rez (make-array n))<br /> (scale (/ (- (* 2 pi #c(0 1))) n)))<br /> ;; #c(0 1) is imaginary unit (i) - Lisp allows us to operate on complex numbers directly<br /> (dotimes (i n)<br /> (:= (? rez i) (loop :for j :from 0 :below n<br /> :sum (* (? vec j) (exp (* scale i j))))))<br /> rez))<br /></code></pre> <p>However, the well-known Fast Fourier Transform (FFT) achieves a much better performance of <code>O(n log n)</code>. Actually, a group of algorithms shares the name FFT, but their main principle is the same. 
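</p>

<p>For a cross-check of the quadratic definition, here is a direct Python rendition of the same computation, using only the standard <code>cmath</code> module (an illustrative sketch, not the book's code):</p>

```python
import cmath

def naive_dft(xs):
    """O(n^2) DFT: output i is the sum over inputs j of
    x_j * e^(-2*pi*I*i*j/n), where I is the imaginary unit."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * i * j / n)
                for j, x in enumerate(xs))
            for i in range(n)]

# sanity checks: the DFT of a unit impulse is flat,
# and the zeroth bin of any signal is just its sum
assert all(abs(c - 1) < 1e-9 for c in naive_dft([1.0, 0.0, 0.0, 0.0]))
assert abs(naive_dft([1.0, 1.0, 1.0, 1.0])[0] - 4) < 1e-9
```

<p>The frequency bins returned by the transform are complex numbers; their magnitudes give the amplitudes of the corresponding frequencies.</p>

<p>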
You might have already guessed, from our previous chapters, that such a reduction in complexity is achieved with the help of the divide-and-conquer approach. A radix-2 decimation-in-time (DIT) FFT is the simplest and most common form of the Cooley-Tukey algorithm, which is the standard FFT implementation. It first computes the DFTs of the even-indexed inputs (indices: <code>0, 2, ..., (- n 2)</code>) and of the odd-indexed inputs (indices: <code>1, 3, ..., (- n 1)</code>), and then combines those two results to produce the DFT of the whole sequence. This idea is utilized recursively. What enables such decomposition is the observation that, thanks to the periodicity of the complex exponential, the elements <code>(? rez i)</code> and <code>(? rez (+ i n/2))</code> may be calculated from the FFTs of the same subsequences. The formulas are the following: </p> <pre><code>(let ((e (fft-of-even-indexed-part))<br /> (o (fft-of-odd-indexed-part))<br /> (scale (exp (/ (- (* 2 pi #c(0 1) i))<br /> n)))<br /> (n/2 (floor n 2)))<br /> (:= (? rez i) (+ (? e i) (* scale (? o i)))<br /> (? rez (+ i n/2)) (- (? e i) (* scale (? o i)))))<br /></code></pre> <h3 id="fouriertransforminactionjpeg">Fourier Transform in Action: JPEG</h3> <p>Fourier Transform — or rather its variant that uses only cosine functions<a href="#f12-2" name="r12-2">[2]</a> and operates on real numbers — the Discrete Cosine Transform (DCT) is the enabling factor of the main lossy media compression formats, such as JPEG, MPEG, and MP3. All of them achieve the drastic reduction in the size of the compressed file by first transforming it into the frequency domain, and then identifying the long tail of low-amplitude frequencies and removing all the data that is associated with these frequencies (which is, basically, noise). Such an approach allows specifying a threshold of the percentage of data that should be discarded or retained. 
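</p>

<p>Putting the recursion and the combination formulas together, the whole radix-2 DIT FFT fits in a few lines of Python (an illustrative sketch, assuming the input length is a power of two; verified against the direct definition):</p>

```python
import cmath

def fft(xs):
    """Radix-2 decimation-in-time Cooley-Tukey FFT."""
    n = len(xs)
    if n == 1:
        return list(xs)
    evens = fft(xs[0::2])   # FFT of the even-indexed part
    odds = fft(xs[1::2])    # FFT of the odd-indexed part
    rez = [0j] * n
    for i in range(n // 2):
        scale = cmath.exp(-2j * cmath.pi * i / n)
        rez[i] = evens[i] + scale * odds[i]
        rez[i + n // 2] = evens[i] - scale * odds[i]
    return rez

# cross-check against the O(n^2) definition on a small signal
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
n = len(signal)
reference = [sum(x * cmath.exp(-2j * cmath.pi * i * j / n)
                 for j, x in enumerate(signal))
             for i in range(n)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(signal), reference))
```

<p>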
The use of cosine rather than sine functions is critical for compression since it turns out that fewer cosine functions are needed to approximate a typical signal. Also, this allows sticking to only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry. There are, actually, eight different DCT variants, and we won't go into detail about their differences.</p> <p>The general JPEG compression procedure operates in the following steps:</p> <ul><li>an RGB to YCbCr color space conversion (a special color space with luminance and chrominance components more suited for further processing)</li> <li>division of the image into 8 x 8 pixel blocks</li> <li>shifting the pixel values from <code>[0,256)</code> to <code>[-128,128)</code></li> <li>applying DCT to each block from left to right, top to bottom</li> <li>compressing each block through quantization</li> <li>entropy encoding the quantized matrix (we'll discuss this in the next chapter)</li> <li>upon decompression, the image is reconstructed through the reverse process using the Inverse Discrete Cosine Transform (IDCT)</li></ul> <p>The quantization step is where the lossy part of compression takes place. It aims at reducing most of the less important high-frequency DCT coefficients to zero: the more zeros, the better the image will compress. Lower frequencies are used to reconstruct the image because the human eye is more sensitive to them, and higher frequencies are discarded.</p> <p>P.S. Also, further development of the Fourier-related transforms for lossy compression lies in using the Wavelet family of transforms.</p> <h2 id="takeaways">Take-aways</h2> <p>It was not easy to select the name for this chapter. Originally, I planned to dedicate it to optimization approaches. Then I thought that a number of other numerical algorithms needed to be presented, but they were not substantial enough to justify a separate chapter. 
After all, I saw that what all these different approaches are about is, first of all, approximation. And, after gathering all the descriptions in one place and combining them, I came to the conclusion that approximation is, in a way, a more general and correct term than optimization. Although they go hand in hand, and it's somewhat hard to say which one enables the other...</p> <p>A conclusion that we can draw from this chapter is that the main optimization methods currently in use boil down to greedy local probabilistic search. In both the discrete and continuous domains, the key idea is to quickly find a direction in which we can somewhat improve the current state of the system and advance along that direction. All the rest is, basically, fine-tuning of this concept. There are alternatives, but local search aka gradient descent aka hill climbing dominates the optimization landscape.</p> <p>Another interesting observation is that many of the approaches we have seen here are more like templates or frameworks than concrete algorithms. Branch & Bound, Genetic Programming or Local Search define a certain skeleton that should be filled with domain-specific code which will perform the main computations. Such a "big picture" approach is somewhat uncommon in the algorithm world, which tends to concentrate on the low-level details and optimize them down to the last bit. So, the skills needed to design such generic frameworks are no less important for algorithmic developers than knowledge of the low-level optimization techniques.</p> <p>SGD, SVD, MCTS, NNSC, FFT — this sphere has plenty of algorithms with abbreviated names for solving particular numerical problems. We have discussed only the most well-known and principal ones with broad practical significance in the context of software development. But, besides them, there are many other famous numerical algorithms like the Sieve of Eratosthenes, the Finite Element Method, the Simplex Method, and so on and so forth. 
Yet, many of the ways to tackle them and the issues you will encounter in the process are, essentially, similar.</p> <hr size="1"><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r12-1" name="f12-1">[1]</a> It uses the popular <a href="https://edicl.github.io/drakma/">drakma</a> HTTP client and <a href="http://edicl.github.io/cl-ppcre/">cl-ppcre</a> regex library.</p><p><a href="#r12-2" name="f12-2">[2]</a> The DFT uses a complex exponent, which consists of a cosine and a sine part.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script><h1>Programming Algorithms: Dynamic Programming</h1><p>Vsevolod Dyomkin, 2019-12-13</p><p>This chapter opens the final part of the book entitled "Selected Algorithms". In it, we're going to apply the knowledge from the previous chapters in analyzing a selection of important problems that are mostly application-independent and find usage in many applied domains: optimization, synchronization, compression, and similar.</p> <p>We will start with a single approach that is arguably the most powerful algorithmic technique in use. If we manage to reduce a problem to Dynamic Programming (DP), in most cases, we can consider it solved. The fact that we progressed so far in this book without mentioning DP is quite amazing. Actually, we could have already talked about it several times, especially in the previous chapter on strings, but I wanted to confine this topic to its own chapter so deliberately didn't start the exposition earlier. Indeed, strings are one of the domains where dynamic programming is used quite heavily, but the technique finds application in almost every area.</p> <p>Also, DP is one of the first marketing terms in CS. 
When Bellman invented the method, he wanted to use the then-hyped term "programming" to promote his idea. This has, probably, caused more confusion over the years than benefit. In fact, a good although unsexy name for this technique could be simply "filling the table" as the essence of the approach is an exhaustive evaluation of all variants with memoization of partial results (in a table) to avoid repetition of redundant computations. Obviously, it brings benefits only when there are redundant computations, which is not the case, for example, with combinatorial optimization. To determine if a problem may be solved with DP we need to validate that it has the <strong>optimal substructure property</strong>:</p> <blockquote> <p>A problem has optimal substructure if an optimal solution to the whole problem necessarily includes an optimal solution to each of its subproblems.</p></blockquote> <p>An example of the optimal substructure is the shortest path problem. If the shortest path from point A to point B passes through some point C and there are multiple paths from C to B, the one included in the shortest path A-B should be the shortest of them. In fact, the shortest path is an archetypical DP problem which we'll discuss later in this chapter. A counterexample is the Travelling Salesman Problem (TSP): if it had optimal substructure, the subpath between any two nodes in the resulting path would have to be the shortest possible path between these nodes. But this doesn't hold: the resulting path must form a cycle that visits every node exactly once, and this constraint can't, in general, be satisfied by simply combining the shortest paths between the nodes.</p> <h2 id="fibonaccinumbers">Fibonacci Numbers</h2> <p>So, as we said, the essence of DP is filling a table. This table, though, may have a different number of dimensions for different problems. Let's start with a 1d case. What book on algorithms can omit discussing the Fibonacci numbers? 
Usually, they are used to illustrate recursion, yet they are also a great showcase for the power of memoization. Besides, recursion is, conceptually, also an integral part of DP.</p> <p>A naive approach to calculating the <code>i</code>-th number will be directly coding the Fibonacci formula:</p> <pre><code>(defun naive-fib (i)<br /> (assert (typep i '(integer 0)))<br /> (if (< i 2) 1<br /> (+ (naive-fib (- i 1))<br /> (naive-fib (- i 2)))))<br /></code></pre> <p>However, applying it will result in an exponential growth of the number of computations: each call to <code>naive-fib</code> results in two more calls. So, the number of calls needed for the <code>n</code>-th number, with this approach, is <code>O(2^n)</code>.</p> <pre><code>> (time (naive-fib 40))<br />Evaluation took: 3.390 seconds of real time<br />165580141<br />> (time (naive-fib 42))<br />Evaluation took: 7.827 seconds of real time<br />433494437<br /></code></pre> <p>Yet, we can see here a direct manifestation of an optimal substructure property: the <code>i</code>-th number calculation uses the result of the <code>(1- i)</code>-th one. To utilize this recurrence, we'll need to store the previous results and reuse them. It may be achieved by changing the function call to the table access. Actually, from the point of view of math, tables and functions are, basically, the same thing.</p> <pre><code>(let ((fib (vec 1 1))) ; our table will be an adjustable vector<br /> (defun fib (i)<br /> (when (< (length fib) i)<br /> (vector-push-extend (fib (- i 1)) fib))<br /> (+ (? fib (- i 1))<br /> (? fib (- i 2)))))<br /></code></pre> <p>What we've done here is added a layer of memoization to our function that uses an array <code>fib</code> that is filled with the consecutive Fibonacci numbers. The array is hidden inside the closure of the <code>fib</code> procedure, so it will persist between the calls to it and accumulate the numbers as they are requested. 
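</p>

<p>As an aside, in languages with first-class caching utilities the same top-down memoization is a one-liner; a Python sketch (not the book's code):</p>

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(i):
    """Top-down DP: each value is computed once and then served
    from the cache, turning the O(2^n) recursion into O(n)."""
    return 1 if i < 2 else fib(i - 1) + fib(i - 2)

assert fib(42) == 433494437   # same value as (naive-fib 42) above
```

<p>Here the cache plays the role of the table, and, unlike a closed-over array, it can be emptied on demand with <code>fib.cache_clear()</code>.</p>

<p>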
There will also be no way to clear it, apart from redefining the function, as the closed-over variables of this kind are not accessible outside of the function. The consecutive property is ensured by the arrangement of the recursive calls: the table is filled on the recursive ascent starting from the lowest yet unknown number. This approach guarantees that each Fibonacci number is calculated exactly once and reduces our dreaded <code>O(2^n)</code> running time to a mere <code>O(n)</code>!</p> <p>Such a calculation is the simplest example of top-down DP that is performed using recursion. Despite its natural elegance, it suffers from a minor problem that may turn significant, in some cases: extra space consumption by each recursive call. It's not only <code>O(n)</code> in time, but also in space. The alternative strategy that gets rid of redundant space usage is called bottom-up DP and is based on loops instead of recursion. Switching to it is quite trivial, in this case:</p> <pre><code>(let ((fib (vec 1 1)))<br /> (defun bottom-up-fib (i)<br /> (let ((off (length fib)))<br /> (adjust-array fib (1+ i) :fill-pointer t)<br /> (dotimes (j (- (1+ i) off))<br /> (let ((j (+ j off)))<br /> (:= (aref fib j)<br /> (+ (aref fib (- j 1))<br /> (aref fib (- j 2)))))))<br /> (aref fib i)))<br />> (time (bottom-up-fib 42))<br />Evaluation took: 0.000 seconds of real time<br />> (time (bottom-up-fib 4200))<br />Evaluation took: 0.004 seconds of real time<br />40512746637826407679504078155833145442086707013857032517543... ; this number is a Lisp's bignum that has unlimited size<br /></code></pre> <p>Funny enough, a real-world-ready implementation of Fibonacci numbers ends up not using recursion at all...</p> <h2 id="stringsegmentation">String Segmentation</h2> <p>Let's consider another 1d problem: suppose we have a dictionary of words and a string consisting of those words that somehow lost the spaces between them — the words got glued together. 
We need to restore the original string with spaces or, to phrase it differently, split the string into words. This is one of the instances of string segmentation problems, and if you're wondering how and where such a situation could occur for real, consider Chinese text that doesn't have to contain spaces. Every Chinese language processing system needs to solve a similar task.</p> <p>Here's an example input<a href="#f11-1" name="r11-1">[1]</a>:</p> <pre><code>String: thisisatest<br />Dictionary: a, i, at, is, hi, ate, his, sat, test, this <br />Expected output: this is a test<br /></code></pre> <p>It is clear that even with such a small dictionary there are multiple ways we could segment the string. The straightforward and naive approach is to use a greedy algorithm. For instance, a shortest-first solution will try to find the shortest word from the dictionary starting at the current position and then split it (as a prefix) from the string. It will result in the following split: <code>this i sat est</code>. But the last part <code>est</code> isn't in the dictionary, so the algorithm has failed to produce some of the possible correct splits (although, by chance, if the initial conditions were different, it could have succeeded). Another version — the longest-first approach — could look for the longest words instead of the shortest. This would result in: <code>this is ate st</code>. Once again the final token is not a word. It is pretty obvious that these simple takes are not correct and we need a more nuanced solution.</p> <p>As a common next step in developing such brute-force approaches, a developer would resort to backtracking: when the computation reaches a position in the string, from which no word in the dictionary may be recovered, it unwinds to the position of the previous successful split and tries a different word. This procedure may have to return multiple steps back — possibly to the very beginning. 
As a result, in the worst case, to find a correct split, we may need to exhaustively try all possible combinations of words that fit into the string.</p> <p>Here's an illustration of the recursive shortest-first greedy algorithm operation:</p> <pre><code>(defun shortest-first-restore-spaces (dict str)<br /> (dotimes (i (length str))<br /> (let ((word (slice str 0 (1+ i))))<br /> (when (? dict word)<br /> (return-from shortest-first-restore-spaces<br /> (cond-it<br /> ((= (1+ i) (length str))<br /> word)<br /> ((shortest-first-restore-spaces dict (slice str (1+ i)))<br /> (format nil "~A ~A" word it))))))))<br /><br />CL-USER> (defparameter *dict* (hash-set 'equal "a" "i" "at" "is" "hi" "ate" "his" "sat" "test" "this"))<br />CL-USER> (shortest-first-restore-spaces *dict* "thisisatest")<br /> 0: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")<br /> 1: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")<br /> 2: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")<br /> 3: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")<br /> 3: SHORTEST-FIRST-RESTORE-SPACES returned NIL<br /> 2: SHORTEST-FIRST-RESTORE-SPACES returned NIL<br /> 1: SHORTEST-FIRST-RESTORE-SPACES returned NIL<br /> 0: SHORTEST-FIRST-RESTORE-SPACES returned NIL<br />NIL<br /></code></pre> <p>To add backtracking into the picture, we need to avoid returning in the case of the failure of the recursive call:</p> <pre><code>(defun bt-shortest-first-restore-spaces (dict str)<br /> (dotimes (i (length str))<br /> (let ((word (slice str 0 (1+ i))))<br /> (when (in# word dict)<br /> (when (= (1+ i) (length str))<br /> (return-from bt-shortest-first-restore-spaces word))<br /> (when-it (bt-shortest-first-restore-spaces dict (slice str (1+ i)))<br /> (return-from bt-shortest-first-restore-spaces (format nil "~A ~A" word it)))))))<br /><br />CL-USER> 
(bt-shortest-first-restore-spaces *dict* "thisisatest")<br /> 0: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")<br /> 1: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")<br /> 2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")<br /> 3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")<br /> 3: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL<br /> 2: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL<br /> ;; backtracking kicks in here<br /> 2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "atest")<br /> 3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "test")<br /> 3: BT-SHORTEST-FIRST-RESTORE-SPACES returned "test"<br /> 2: BT-SHORTEST-FIRST-RESTORE-SPACES returned "a test"<br /> 1: BT-SHORTEST-FIRST-RESTORE-SPACES returned "is a test"<br /> 0: BT-SHORTEST-FIRST-RESTORE-SPACES returned "this is a test"<br />"this is a test"<br /></code></pre> <p>Lisp <code>trace</code> is an invaluable tool to understand the behavior of recursive functions. Unfortunately, it doesn't work for loops, for which one has to resort to debug printing.</p> <p>Realizing that this is brute force, we could just as well use another approach: generate all combinations of words from the dictionary with a total length equal to the number of characters in the string (<code>n</code>) and choose the ones that match the current string. The exact complexity of this scheme is <code>O(2^n)</code><a href="#f11-2" name="r11-2">[2]</a>. 
In other words, our solution leads to a <strong>combinatorial explosion</strong> in the number of possible variants — a clear no-go for every algorithmic developer.</p> <p>So, we need to come up with something different, and, as you might have guessed, DP fits in perfectly as the problem has the optimal substructure: a complete word in the substring of the string remains a complete word in the whole string as well. Based on this understanding, let's reframe the task in a way that lends itself to DP better: find each character in the string that ends a complete word so that all the words combined cover the whole string and do not intersect<a href="#f11-3" name="r11-3">[3]</a>.</p> <p>Here is an implementation of the DP-based procedure. Apart from calculating the maximum length of a word in the dictionary, which usually may be done offline, it requires single forward and backward passes. The forward pass is a linear scan of the string that at each character tries to find all the words starting at it and matching the string. The complexity of this pass is <code>O(n * w)</code>, where <code>w</code> is the constant length of the longest word in the dictionary, i.e. it is, actually, <code>O(n)</code>. The backward pass (called, in the context of DP, <strong>decoding</strong>) restores the spaces using the so-called backpointers stored in the <code>dp</code> array. Below is a simplistic implementation that returns a single match. A recursive variant is possible with or without a backward pass that will accumulate all the possible variants. 
</p> <pre><code>(defun dp-restore-spaces (dict str)<br /> (let ((dp (make-array (1+ (length str)) :initial-element nil))<br /> ;; in the production implementation, the following calculation<br /> ;; should be performed at the pre-processing stage<br /> (w (reduce 'max (mapcar 'length (keys dict))))<br /> (begs (list))<br /> (rez (list)))<br /> ;; the outer loop tries to find the next word<br /> ;; only starting from the ends of the words that were found previously<br /> (do ((i 0 (pop begs)))<br /> ((or (null i)<br /> (= i (length str))))<br /> ;; the inner loop checks all substrings of length 1..w<br /> (do ((j (1+ i) (1+ j)))<br /> ((>= j (1+ (min (length str)<br /> (+ w i)))))<br /> (when (? dict (slice str i j))<br /> (:= (? dp j) i)<br /> (push j begs)))<br /> (:= begs (reverse begs)))<br /> ;; the backward pass<br /> (do ((i (length str) (? dp i)))<br /> ((null (? dp i)))<br /> (push (slice str (? dp i) i) rez))<br /> (strjoin #\Space rez)))<br /><br />CL-USER> (dp-restore-spaces *dict* "thisisatest")<br />"this is a test"<br /></code></pre> <p>Similarly to the Fibonacci numbers, the solution to this problem doesn't use any additional information to choose between several variants of a split; it just takes the first one. However, if we wanted to find the variant that is most plausible to the human reader, we'd need to add some measure of plausibility. One idea might be to use a frequency dictionary, i.e. prefer the words that have a higher frequency of occurrence in the language. Such an approach, unfortunately, also has drawbacks: it overemphasizes short and frequent words, such as determiners, and also doesn't account for how words are combined in context. A more advanced option would be to use a frequency dictionary not just of words but of separate phrases (ngrams). 
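</p>

<p>To illustrate the frequency-based idea in its simplest, unigram-only form, here is a Python sketch with made-up toy frequencies (a real system would use corpus statistics and, as discussed, ngrams):</p>

```python
import math

# hypothetical toy frequency counts, not real corpus data
FREQ = {"a": 9, "i": 7, "is": 6, "at": 3, "hi": 2, "ate": 2,
        "his": 2, "sat": 1, "test": 4, "this": 5}
TOTAL = sum(FREQ.values())
MAXLEN = max(map(len, FREQ))

def restore_spaces_plausibly(s):
    """Forward DP: best[j] is the highest log-probability of any
    segmentation of s[:j]; back[j] remembers where its last word starts."""
    best = [0.0] + [-math.inf] * len(s)
    back = [0] * (len(s) + 1)
    for j in range(1, len(s) + 1):
        for i in range(max(0, j - MAXLEN), j):
            w = s[i:j]
            if w in FREQ:
                score = best[i] + math.log(FREQ[w] / TOTAL)
                if score > best[j]:
                    best[j] = score
                    back[j] = i
    if best[-1] == -math.inf:
        return None   # no segmentation exists
    # backward pass: follow the backpointers
    words, j = [], len(s)
    while j > 0:
        words.append(s[back[j]:j])
        j = back[j]
    return " ".join(reversed(words))

assert restore_spaces_plausibly("thisisatest") == "this is a test"
```

<p>Working with sums of log-probabilities instead of raw products keeps the scores from underflowing on long strings.</p>

<p>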
The longer the phrases, the better from the linguistic standpoint, but the worse from the engineering point of view: more storage space is needed, and more data has to be processed if we want to collect reliable statistics for all the possible variants. And, once again, as the number of words in an ngram rises, we will be facing the issue of combinatorial explosion pretty soon. The optimal point for this particular task might be bigrams or trigrams, i.e. phrases of 2 or 3 words. Using them, we'd have to supply another dictionary to our procedure and track the measure of plausibility of the current split as a product of the frequencies of the selected ngrams. Formulated this way, our exercise becomes not merely an algorithmic task but an optimization problem. And DP is also suited to solving such problems. In fact, that was the primary purpose it was intended for in the Operations Research community. We'll see it in action with our next problem — text justification. And developing a <code>restore-spaces-plausibly</code> procedure is left as an exercise to the reader. :)</p> <h2 id="textjustification">Text Justification</h2> <p>The task of text justification is relevant to both editing and reading software: given a text consisting of paragraphs, split each paragraph into lines that contain only whole words, with a given line length limit, so that the variance of line lengths is the smallest. Its solution may be used, for example, to display text in HTML blocks with an <code>align=justify</code> property.</p> <p>A more formal task description would be the following:</p> <ul><li><p>the algorithm is given a text string and a line length limit (say, 80 characters)</p></li> <li><p>there's a plausibility formula that specifies the penalty for each line being shorter than the length limit. 
A usual formula is this:</p> <pre><code>(defun penalty (length limit)<br /> (if (<= length limit)<br /> (expt (- limit length) 3)<br /> most-positive-fixnum))<br /></code></pre></li> <li><p>the result should be a list of strings</p></li></ul> <p>As we are discussing this problem in the context of DP, first we need to determine its optimal substructure. Superficially, we could claim that the optimal solution should contain only the lines that have the smallest penalty, according to the formula. However, this doesn't work as some of the potential lines that have the best plausibility (length closest to 80 characters) may overlap, i.e. the optimal split may not be able to include all of them. What we can reliably claim is that, if the text is already justified from position 0 to <code>i</code>, we can still justify the remainder optimally regardless of how the prefix is split into lines. This is, basically, the same as with string segmentation where we didn't care how the string was segmented before position <code>i</code>. And it's a common theme in DP problems: the key feature that allows us to save on redundant computation is that we only remember the optimal result of the computation that led to a particular partial solution, but we don't care about what particular path was taken to obtain it (except that we need to restore the path in the end, but that's what the backpointers are for — it doesn't impact the forward pass of the algorithm). 
So the optimal substructure property of text justification is that if the best split of the whole string includes the consecutive indices <code>x</code> and <code>y</code>, then the best split from 0 to <code>y</code> should include <code>x</code>.</p> <p>Let's justify the following text with a line limit of 50 chars:</p> <pre><code>Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,<br />most prominent descendant of the long-running family of Lisp programming languages.<br /></code></pre> <p>Suppose we've already justified the first 104 characters. This leaves us with a suffix that has a length of 69: <code>descendant of the long-running family of Lisp programming languages.</code> As its length is above 50 chars but below 100, we can conclude that it requires exactly 1 split. This split may be performed after the first, second, third, etc. token. Let's calculate the total plausibility of each candidate:</p> <pre><code>after "the": 5832 + 0 = 5832<br />after "long-running": 6859 + 2197 = 9056<br />after "family": 1728 + 8000 = 9728<br />after "of": 729 + 12167 = 12896<br />after "Lisp": 64 + 21952 = 22016<br /></code></pre> <p>So, the optimal split starting at index 105<a href="#f11-4" name="r11-4">[4]</a> is into the strings <code>"descendant of the"</code> and <code>"long-running family of Lisp programming languages."</code> Now, we haven't guaranteed that index 105 will be, in fact, a split point of the optimal split of the whole string, but, if it were, we would have already known how to continue. This is the key idea of the DP-based justification algorithm: starting from the end, calculate the cost of justifying the remaining suffix after each token using the results of previous calculations. At first, while the suffix length is below the line limit, the values are trivially computed by a single call to the plausibility function. 
After exceeding the line limit, the calculation will consist of two parts: the plausibility penalty of the current line + the previously calculated value. (Note that the output is printed with <code>format t</code>; with <code>nil</code> as the stream designator, nothing would reach standard output.)</p> <pre><code>(defun justify (limit str)<br /> (with ((toks (reverse (split #\Space str)))<br /> (n (length toks))<br /> (penalties (make-array n))<br /> (backptrs (make-array n))<br /> (lengths (make-array n)))<br /> ;; forward pass (from the end of the string)<br /> (doindex (i tok toks)<br /> (let ((len (+ (length tok) (if (> i 0)<br /> (? lengths (1- i))<br /> 0))))<br /> (:= (? lengths i) (1+ len))<br /> (if (<= len limit)<br /> (:= (? penalties i) (penalty len limit)<br /> (? backptrs i) -1)<br /> ;; minimization loop<br /> (let ((min most-positive-fixnum)<br /> arg)<br /> (dotimes (j i)<br /> (with ((j (- i j 1))<br /> (len (- (? lengths i)<br /> (? lengths j)))<br /> (penalty (+ (penalty len limit)<br /> (? penalties j))))<br /> (when (> len limit) (return))<br /> (when (< penalty min)<br /> (:= min penalty<br /> arg j))))<br /> (:= (? penalties i) min<br /> (? backptrs i) arg)))))<br /> ;; backward pass (decoding)<br /> (loop :for end := (1- n) :then beg<br /> :for beg := (? backptrs end)<br /> :do (format t "~A~%" (strjoin #\Space (reverse (subseq toks (1+ beg) (1+ end)))))<br /> :until (= -1 beg))))<br /><br />CL-USER> (justify 50 "Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,<br />most prominent descendant of the long-running family of Lisp programming languages.")<br /><br />Common Lisp is the modern, multi-paradigm,<br />high-performance, compiled, ANSI-standardized,<br />most prominent descendant of the long-running<br />family of Lisp programming languages.<br /></code></pre> <p>This function is somewhat longer, but, conceptually, it is pretty simple. The only insight I needed to implement it efficiently was the additional array for storing the <code>lengths</code> of all the string suffixes we have examined so far. 
This way, we apply memoization twice: both the penalties and the suffix lengths are calculated only once, and all of the values computed so far are reused at each iteration. If we were to store the suffixes themselves, we would have to perform an additional <code>O(n)</code> length calculation at each iteration.</p> <p>The algorithm performs two passes. In the forward pass (which is, in fact, performed from the end), it fills the slots of the DP arrays using the minimum joint penalty for the potential current line and the remaining suffix, the penalty for which was calculated during one of the previous iterations of the algorithm. In the backward pass, the resulting lines are extracted by traversing the backpointers starting from the last index.</p> <p>The key difference from the previous DP example is these lines:</p> <pre><code>(:= (? penalties i) min<br /> (? backptrs i) arg)<br /></code></pre> <p>Adding them (alongside the whole minimization loop) turns DP into an optimization framework that, in this case, is used to minimize the penalty. The <code>backptrs</code> array, as we said, is used to restore the steps that have led to the optimal solution since, eventually (and this is true for the majority of DP optimization problems), we care about this sequence and not just the optimization result itself.</p> <p>As we can see, for optimization problems, the optimal substructure property is manifested as a mathematical formula called the <strong>recurrence relation</strong>. It is the basis for the selection of a particular substructure among several variants that may be available for the current step of the algorithm. The relation involves an already memoized partial solution and the cost of the next part we consider adding to it. For text justification, the formula is the sum of the current line's penalty and the penalty of the newly split suffix. 
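</p>

<p>The recurrence just described (the penalty of the current line plus the already-memoized penalty of the remaining suffix) can also be sketched compactly. Below is an illustrative reimplementation in Python rather than the book's Lisp; the function names and array layout are mine, not taken from the book:</p>

```python
import math

def penalty(length, limit):
    # cubic penalty for a short line; a too-long line is infeasible
    return (limit - length) ** 3 if length <= limit else math.inf

def justify(text, limit):
    """Split text into lines of <= limit chars, minimizing the total penalty."""
    words = text.split()
    n = len(words)
    best = [math.inf] * n + [0.0]   # best[i]: minimal total penalty of words[i:]
    split = [n] * (n + 1)           # backpointer: the line starting at i ends at split[i]
    for i in range(n - 1, -1, -1):
        length = -1                 # running length of words[i:j] including spaces
        for j in range(i + 1, n + 1):
            length += len(words[j - 1]) + 1
            if length > limit:
                break               # the same cut-off as in the Lisp version
            cost = penalty(length, limit) + best[j]
            if cost < best[i]:
                best[i], split[i] = cost, j
    lines, i = [], 0
    while i < n:                    # decoding: follow the backpointers
        lines.append(" ".join(words[i:split[i]]))
        i = split[i]
    return lines
```

<p>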
Each DP optimization task is based on a recurrence relation of a similar kind.</p> <p>Now, let's look at this problem from a different perspective. We can represent our decision space as a directed acyclic graph. Its leftmost node (the "source") will be index 0, and it will have several direct descendants: nodes with those indices in the string at which we can potentially split it without exceeding the 50-character line limit, or, alternatively, each substring that spans from index 0 to the end of some token and is not longer than 50 characters. Next, we'll connect each descendant node in a similar manner with all nodes that are "reachable" from it, i.e. those that have a higher associated string position, with the difference between their index and this node's index below 50. The final node of the graph (the "sink") will have the value of the length of the string. The cost of each edge is the value of the penalty function. Now, the task is to find the shortest path from source to sink.</p> <p>Here is the DAG for the example string with the nodes labeled with the indices of the potential string splits. As you can see, even for such a simple string, it's already quite big, to say nothing of real texts. But it can provide some sense of the number of variants that an algorithm has to evaluate.</p> <a href="https://3.bp.blogspot.com/-7mCwsB25qYM/XfNpHElxKCI/AAAAAAAACRE/WbleHc0LzBwOz8xDpROSdoq559hd21iyQCPcBGAYYCw/s1600/just-dp.jpg" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-7mCwsB25qYM/XfNpHElxKCI/AAAAAAAACRE/WbleHc0LzBwOz8xDpROSdoq559hd21iyQCPcBGAYYCw/s1600/just-dp.jpg" data-original-width="1600" data-original-height="718" width="1000" /></a> <p>What is the complexity of this algorithm? On the surface, it may seem to be <code>O(m^2)</code> where <code>m</code> is the token count, as there are two loops: over all tokens and over the tail. 
However, the line <code>(when (> len limit) (return))</code> limits the inner loop to only the part of the string that can fit into <code>limit</code> chars, effectively reducing it to a constant number of operations (not more than <code>limit</code>, but, in practice, an order of magnitude less). Thus, the actual complexity is <code>O(m)</code><a href="#f11-5" name="r11-5">[5]</a>.</p> <h2 id="pathfindingrevisited">Pathfinding Revisited</h2> <p>In fact, any DP problem may be reduced to pathfinding in a graph: the shortest path, if optimization is involved, or just any path otherwise. The nodes in this graph are the intermediate states (for instance, a split at index <code>x</code> or an <code>i</code>-th Fibonacci number) and the edges — possible transitions that may bear an associated cost (as in text justification) or not (as in string segmentation). And the classic DP algorithm to solve the problem is called the Bellman-Ford algorithm. Not incidentally, one of its authors, Bellman, is the "official" inventor of DP.</p> <pre><code>(defun bf-shortest-path (g)<br /> (with ((n (array-dimension g 0))<br /> (edges (edges-table g))<br /> (dists (make-array n :initial-element most-positive-fixnum))<br /> (backptrs (make-array n))<br /> (path (list)))<br /> ;; the distances are calculated from the sink backwards<br /> (:= (? dists (1- n)) 0)<br /> ;; relax all the edges up to n-1 times<br /> (dotimes (i (1- n))<br /> (dolist (v (vertices g))<br /> (dolist (e (? edges v))<br /> (with ((u (src e))<br /> (dist (+ (dist e)<br /> (? dists v))))<br /> (when (< dist (? dists u))<br /> (:= (? dists u) dist<br /> (? backptrs u) v))))))<br /> ;; restore the path following the backpointers from the source<br /> (loop :for v := 0 :then (? backptrs v)<br /> :do (push v path)<br /> :until (= v (1- n)))<br /> (values (reverse path)<br /> (? dists 0))))<br /></code></pre> <p>The code for the algorithm is very straightforward, provided that our graph representation already has the vertices and edges as a data structure in a convenient format or implements such operations (in the worst case, the overall complexity should be not greater than <code>O(V+E)</code>). 
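</p>

<p>For readers who want to experiment outside of Lisp, here is a self-contained Bellman-Ford sketch in Python over a plain edge list (with relaxation running from the source, the more conventional direction); the representation and names are my own choice, not the book's:</p>

```python
import math

def bf_shortest_path(n, edges, src=0):
    """Bellman-Ford over an edge list [(u, v, weight), ...] with n vertices.
    Handles negative weights; returns distances from src and backpointers."""
    dist = [math.inf] * n
    back = [None] * n
    dist[src] = 0
    for _ in range(n - 1):              # n-1 rounds of relaxing every edge
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v], back[v] = dist[u] + w, u
                changed = True
        if not changed:                 # early exit: everything has converged
            break
    return dist, back

def restore_path(back, v):
    """Follow the backpointers from v back to the source."""
    path = []
    while v is not None:
        path.append(v)
        v = back[v]
    return path[::-1]
```

<p>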
For the edges, we need a kv indexed by the edge destination — the opposite of the usual representation that groups them by their sources<a href="#f11-6" name="r11-6">[6]</a>.</p> <p>Compared to text justification, this function looks simpler as we don't have to perform task-specific processing that accounts for the character limit and the spaces between words. However, if we were to use <code>bf-shortest-path</code>, we'd have to first create the graph data structure from the original text, so all that complexity would go into the graph creation routine. Still, from the architectural point of view, such a split may be beneficial as the pathfinding procedure could be reused for other problems.</p> <p>One might ask a reasonable question: how does Bellman-Ford fare against Dijkstra's algorithm (DA)? As we have already learned, Dijkstra's is a greedy and optimal solution to pathfinding, so why consider yet another approach? Both algorithms operate by relaxation, in which approximations to the correct distance are replaced by better ones until the final result is reached. And in both of them, the approximate distance to each vertex is always an overestimate of the true distance, and it is replaced by the minimum of its old value and the length of a newly found path. It turns out that DA is also a DP-based approach, but with additional optimizations! It uses the same optimal substructure property and recurrence relations. The advantage of DA is the utilization of a priority queue to efficiently select the closest vertex that has not yet been processed. Then it performs the relaxation process on all of its outgoing edges, while the Bellman-Ford algorithm relaxes all the edges. This method allows BF to calculate the shortest paths not to a single node but to all of them (which is also possible for DA but will make its runtime, basically, the same as for BF). 
So, Bellman-Ford complexity is <code>O(V E)</code> compared to <code>O(E + V logV)</code> for the optimal implementation of DA. Besides, BF can account for negative edge weights, which would break DA.</p> <p>So, DA remains the algorithm of choice for the standard shortest path problem, and it's worth keeping in mind that it can also be applied as a solver for some DP problems if they are decomposed into graph construction + pathfinding. However, some DP problems have additional constraints that make using DA for them pointless. For example, in text justification, the number of edges to consider at each step is limited by a constant factor, so the complexity of the exhaustive search is, in fact, <code>O(V)</code>. Proving that for our implementation of <code>justify</code> is left as an exercise to the reader...</p> <h2 id="lcsdiff">LCS & Diff</h2> <p>Let's return to strings and the application of DP to them. The ultimate DP-related string problem is string alignment. It manifests in many formulations. The basic one is the Longest Common Subsequence (LCS) task: determine the length of the common part among two input strings. Solving it, however, provides enough data to go beyond that — it enables determining the best alignment of the strings, as well as enumerating the edit operations needed to transform one string into another. The edit operations usually considered in the context of LCS are:</p> <ul><li>insertion of a character</li> <li>deletion of a character</li> <li>substitution of a character</li></ul> <p>Based on the number of those operations, we can calculate a metric of commonality between two strings that is called the <strong>Levenshtein distance</strong>. It is one of the examples of the so-called <strong>edit distances</strong>. 
Identical strings have a Levenshtein distance of 0, and the strings <code>foobar</code> and <code>baz</code> — of 4 (3 deletion operations for the prefix <code>foo</code> and a substitution operation of <code>r</code> into <code>z</code>). There are also other variants of edit distances. For instance, the Damerau-Levenshtein distance, which is better suited to comparing texts with misspellings produced by humans, adds another modification operation: <code>swap</code>. It reduces the edit distance in the case of two adjacent characters being transposed to 1 instead of 2 for the plain Levenshtein distance (1 deletion and 1 insertion).</p> <p>The Levenshtein distance, basically, gives us the DP recurrence relations for free: when we consider the <code>i</code>-th character of the first string and the <code>j</code>-th one of the second, the edit distance between the prefixes <code>0,i</code> and <code>0,j</code> is either the same as for the pair of chars <code>(1- i)</code> and <code>(1- j)</code> respectively, if the current characters are the same, or <code>1+</code> the minimum of the edit distances of the pairs <code>i (1- j)</code>, <code>(1- i) (1- j)</code>, and <code>(1- i) j</code>.</p> <p>We can encode this calculation as a function that uses a matrix for memoization. 
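</p>

<p>As a sanity check of the recurrence, here is a bottom-up (iterative) version in Python; it is an illustrative sketch of my own, independent of the book's Lisp code:</p>

```python
def lev_dist(s1, s2):
    """Bottom-up Levenshtein distance implementing the recurrence above."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i       # i deletions turn s1[:i] into the empty string
    for j in range(n + 1):
        dp[0][j] = j       # j insertions build s2[:j] from the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i][j - 1],      # insertion
                                   dp[i - 1][j])      # deletion
    return dp[m][n]
```

<p>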
Basically, this is the DP solution to the LCS problem as well: you just have to relate the bottom-right element of the matrix, which holds the measure of the difference between the strings, to the lengths of the strings.</p> <pre><code>(defun lev-dist (s1 s2 &optional <br /> (i1 (1- (length s1)))<br /> (i2 (1- (length s2)))<br /> (ld (make-array (list (1+ (length s1))<br /> (1+ (length s2)))<br /> :initial-element nil)<br /> ldp)) ; a flag indicating that the argument was supplied<br /> ;; initialization of the 0-th column and row<br /> (unless ldp<br /> (dotimes (k (1+ (length s1))) (:= (aref ld k 0) k))<br /> (dotimes (k (1+ (length s2))) (:= (aref ld 0 k) k)))<br /> (values (or (aref ld (1+ i1) (1+ i2))<br /> (:= (aref ld (1+ i1) (1+ i2))<br /> (if (eql (? s1 i1) (? s2 i2))<br /> (lev-dist s1 s2 (1- i1) (1- i2) ld t)<br /> (1+ (min (lev-dist s1 s2 (1- i1) (1- i2) ld t)<br /> (lev-dist s1 s2 i1 (1- i2) ld t)<br /> (lev-dist s1 s2 (1- i1) i2 ld t))))))<br /> ld))<br /></code></pre> <p>However, if we want to also use this information to align the sequences, we'll have to make a reverse pass (here, a separate backpointers array isn't necessary as we can infer the direction by reversing the distance formula).</p> <pre><code>(defun align (s1 s2)<br /> (with ((i1 (length s1))<br /> (i2 (length s2))<br /> ;; our Levenshtein distance procedure returns the whole DP matrix<br /> ;; as a second value<br /> (ld (nth-value 1 (lev-dist s1 s2)))<br /> (rez (list)))<br /> (loop<br /> (let ((min (min (aref ld (1- i1) (1- i2))<br /> (aref ld i1 (1- i2))<br /> (aref ld (1- i1) i2))))<br /> (cond ((= min (aref ld (1- i1) (1- i2)))<br /> (push (pair (? s1 (1- i1)) (? s2 (1- i2)))<br /> rez)<br /> (:- i1)<br /> (:- i2))<br /> ((= min (aref ld (1- i1) i2))<br /> (push (pair (? s1 (1- i1)) nil)<br /> rez)<br /> (:- i1))<br /> ((= min (aref ld i1 (1- i2)))<br /> (push (pair nil (? 
s2 (1- i2)))<br /> rez)<br /> (:- i2))))<br /> (when (= 0 i1)<br /> (loop :for j :from (1- i2) :downto 0 :do<br /> (push (pair #\* (? s2 j)) rez))<br /> (return))<br /> (when (= 0 i2)<br /> (loop :for j :from (1- i1) :downto 0 :do<br /> (push (pair (? s1 j) nil) rez))<br /> (return)))<br /> ;; pretty output formatting<br /> (with-output-to-string (s1)<br /> (with-output-to-string (s2)<br /> (with-output-to-string (s3)<br /> (loop :for (c1 c2) :in rez :do<br /> (format s1 "~C " (or c1 #\.))<br /> (format s2 "~C " (cond ((null c1) #\↓)<br /> ((null c2) #\↑)<br /> ((char= c1 c2) #\|)<br /> (t #\x)))<br /> (format s3 "~C " (or c2 #\.)))<br /> (format t "~A~%~A~%~A~%"<br /> (get-output-stream-string s1)<br /> (get-output-stream-string s2)<br /> (get-output-stream-string s3)))))<br /> rez))<br /><br />CL-USER> (align "democracy" "remorse")<br />d e m o c r a c y<br />x | | | ↑ | ↑ x x<br />r e m o . r . s e<br /><br />CL-USER> (lev-dist "democracy" "remorse")<br />5<br />#2A((0 1 2 3 4 5 6 7)<br /> (1 1 2 3 4 5 6 7)<br /> (2 2 1 2 3 4 5 6)<br /> (3 3 2 1 2 3 4 5)<br /> (4 4 3 2 1 2 3 4)<br /> (5 5 4 3 2 2 3 4)<br /> (6 5 5 4 3 2 3 4)<br /> (7 6 6 5 4 3 3 4)<br /> (8 7 7 6 5 4 4 4)<br /> (9 8 8 7 6 5 5 5))<br /></code></pre> <p>It should be pretty clear how we can also extract the edit operations during the backward pass: depending on the direction of the movement, horizontal, vertical or diagonal, it's either an insertion, deletion or substitution. The same operations may be also grouped to reduce noise. The alignment task is an example of a 2d DP problem. Hence, the diff computation has a complexity of <code>O(n^2)</code>. There are other notable algorithms, such as CYK parsing or the Viterbi algorithm, that also use a 2d array, although they may have higher complexity than just <code>O(n^2)</code>. 
For instance, CYK parsing is <code>O(n^3)</code>, which is very slow compared to the greedy <code>O(n)</code> shift-reduce algorithm.</p> <p>However, the diff we will obtain from the basic LCS computation will still be pretty basic. There are many small improvements that are made by production diff implementations on both the UX and performance sides. Besides, the complexity of the algorithm is <code>O(n^2)</code>, which is quite high, so practical variants perform many additional optimizations to reduce the actual number of operations, at least for the common cases.</p> <p>The simplest improvement is a <strong>preprocessing</strong> step that is warranted by the fact that, in many applications, the diff is performed on texts that are usually mostly identical and have a small number of differences localized in an even smaller number of places. For instance, consider source code management, where diff plays an essential role: programmers don't tend to rewrite whole files too often; on the contrary, such practice is discouraged due to collaboration considerations.</p> <p>So, some heuristics may be used in library diff implementations to speed up such common cases:</p> <ul><li>check that the texts are identical</li> <li>identify the common prefix/suffix and perform the diff only on the remaining part</li> <li>detect situations when there's just a single edit or two</li></ul> <p>A perfect diff algorithm will report the minimum number of edits required to convert one text into the other. However, sometimes the result is too perfect and not very good for human consumption: people expect the edited parts to be separated at token boundaries when possible, and larger contiguous changes are preferred to an alternation of many small ones. All these and other diff ergonomics issues may be addressed by various <strong>postprocessing</strong> tweaks.</p> <p>But, besides these simple tricks, are global optimizations to the algorithm possible? 
After all, <code>O(n^2)</code> space and time requirements are still pretty significant. Originally, diff was developed for Unix by Hunt and McIlroy. Their approach computes matches in the whole file and indexes them into so-called k-candidates, <code>k</code> being the LCS length. The LCS is augmented progressively by finding matches that fall within proper ordinates (following a rule explained in their paper). While doing this, each path is memoized. The problem with the approach is that it performs more computation than necessary: it memoizes all the paths, which requires <code>O(n^2)</code> memory in the worst case, and <code>O(n^2 log n)</code> time!</p> <p>The current standard approach is the divide-and-conquer Myers algorithm. It works by recursively finding the central match of two sequences with the smallest edit script. Once this is done, only the match is memoized, and the two subsequences preceding and following it are compared again recursively by the same procedure until there is nothing more to compare. Finding the central match is done by matching the ends of subsequences as far as possible, and any time it is not possible, augmenting the edit script by 1 operation, scanning each furthest position attained up to there for each diagonal, and checking how far the match can expand. If two matches merge, the algorithm has just found the central match. This approach has the advantage of using only <code>O(n)</code> memory, and it executes in <code>O(n d)</code>, where <code>d</code> is the edit script complexity (<code>d</code> is less than <code>n</code>, usually much less). The Myers algorithm wins because it does not memoize the paths while working and does not need to "foresee" where to go, so it can concentrate only on the furthest positions it could reach with an edit script of the smallest complexity. The smallest complexity constraint ensures that what is found is the LCS. 
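</p>

<p>As an aside, you rarely need to implement diff from scratch: many standard libraries ship one. For instance, Python's <code>difflib</code> (which is based on the Ratcliff-Obershelp heuristic rather than the Myers algorithm) exposes the edit script directly; a small illustration:</p>

```python
from difflib import SequenceMatcher

# the edit script between two strings: (op, source span, target span) tuples
a, b = "democracy", "remorse"
script = SequenceMatcher(None, a, b).get_opcodes()

# replaying the script on a reconstructs b
out = []
for op, i1, i2, j1, j2 in script:
    if op in ("equal", "replace", "insert"):
        out.append(b[j1:j2])   # take the target-side text; 'delete' adds nothing
print("".join(out))  # → remorse
```

<p>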
Unlike the Hunt-McIlroy algorithm, the Myers one doesn't have to memoize the paths. In a sense, the Myers algorithm, compared to the vanilla DP diff, like Dijkstra's versus Bellman-Ford, cuts down on the calculation of the edit distances between the substrings that don't contribute to the optimal alignment, while solving LCS by building the whole edit-distance matrix performs the computation for all substrings.</p> <p>The diff tool is a prominent example of a transition from quite an abstract algorithm to a practical utility that is an essential part of many ubiquitous software products, and of the additional work needed to ensure that the final result is not only theoretically sound but also usable.</p> <p>P.S. Ever wondered how GitHub and other tools, when displaying the diff, not only show the changed line but also highlight the exact changes in the line? The answer is given in <a href="#f11-7" name="r11-7">[7]</a>.</p> <h2 id="dpinactionbackprop">DP in Action: Backprop</h2> <p>As we said in the beginning, DP has applications in many areas: from Machine Learning to graphics to Source Code Management. Literally, you can find an algorithm that uses DP in every specialized domain, and if you don't, it probably means you can still advance this domain and create something useful by applying DP to it. In recent years, Deep Learning has been the fastest-developing area of Machine Learning. At its core, the discipline is about training huge multilayer optimization functions called "neural networks". And the principal approach to doing that, which, practically speaking, has enabled the rapid development of the machine learning techniques we see today, is the Backpropagation (backprop) optimization algorithm.</p> <p>As <a href="https://colah.github.io/posts/2015-08-Backprop/">pointed out</a> by Christopher Olah, for modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. 
That's the difference between a model taking a week to train and taking 200,000 years. Beyond its use in deep learning, backprop is a computational tool that may be applied in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names there. In fact, the algorithm has been reinvented dozens of times in different fields. The general, application-independent name for it is Reverse-Mode Differentiation. Essentially, it's a technique for calculating partial derivatives quickly using DP on computational graphs.</p> <p>Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression <code>(:= e (* (+ a b) (1+ b)))</code>. There are four operations: two additions, one multiplication, and an assignment. Let's arrange those computations in the same way they would be performed on the computer:</p> <pre><code>(let ((c (+ a b))<br /> (d (1+ b)))<br /> (:= e (* c d)))<br /></code></pre> <p>To create a computational graph, we make each of these operations, along with the input variables, into nodes. When the outcome of one expression is an input to another one, a link points from one node to another:</p> <a href="https://1.bp.blogspot.com/-aFpDlGlz5gc/XfNmQb2Hf9I/AAAAAAAACQg/5zCLeU25V1IuCdQqq_Aw9DZyEoYCFrQDQCLcBGAsYHQ/s1600/comp-graph.jpg" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-aFpDlGlz5gc/XfNmQb2Hf9I/AAAAAAAACQg/5zCLeU25V1IuCdQqq_Aw9DZyEoYCFrQDQCLcBGAsYHQ/s400/comp-graph.jpg" width="400" height="274" data-original-width="387" data-original-height="265" /></a> <p>We can evaluate the expression by setting the values in the input nodes (<code>a</code> and <code>b</code>) to certain values and computing nodes in the graph along the dependency paths. 
For example, let's set <code>a</code> to 2 and <code>b</code> to 1: the result in node <code>e</code> will be, obviously, 6.</p> <p>The derivatives in a computational graph can be thought of as edge labels. If <code>a</code> directly affects <code>c</code>, then we can write a partial derivative <code>∂c/∂a</code> along the edge from <code>a</code> to <code>c</code>.</p> <p>Here is the computational graph with all the derivatives for the evaluation with the values of <code>a</code> and <code>b</code> set to 2 and 1, respectively.</p> <a href="https://4.bp.blogspot.com/-0mkX5wkFfdI/XfNmLdsTwJI/AAAAAAAACQc/d1oXW18EQzsDR5dsbfcWaZez2dbdH3QCwCLcBGAsYHQ/s1600/comp-graph2.jpg" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-0mkX5wkFfdI/XfNmLdsTwJI/AAAAAAAACQc/d1oXW18EQzsDR5dsbfcWaZez2dbdH3QCwCLcBGAsYHQ/s400/comp-graph2.jpg" width="400" height="167" data-original-width="424" data-original-height="177" /></a> <p>But what if we want to understand how nodes that aren't directly connected affect each other? Let's consider how <code>e</code> is affected by <code>a</code>. If we change <code>a</code> at a speed of 1, <code>c</code> also changes at a speed of 1. In turn, <code>c</code> changing at a speed of 1 causes <code>e</code> to change at a speed of 2. So <code>e</code> changes at a rate of <code>(* 1 2)</code> with respect to <code>a</code>. The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. We can see that this graph is, basically, the same as the graph we used to calculate the shortest path.</p> <p>This is where Forward-mode differentiation and Reverse-mode differentiation come in. They're algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once. 
Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total derivative. Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node. Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output.</p> <p>So, what if we do reverse-mode differentiation from <code>e</code> down? This gives us the derivative of <code>e</code> with respect to every node. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of the derivatives we need in one go. When training neural networks, the cost is a function of the weights of each edge. And using reverse-mode differentiation (aka backprop), we can calculate the derivatives of the cost with respect to all the weights in a single pass through the graph, and then feed them into gradient descent. As there are millions and even tens of millions of weights in a neural network, reverse-mode differentiation results in a speedup of the same factor!</p> <p>Backprop is an example of simple memoization DP. 
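</p>

<p>To make the toy example concrete, here is the graph <code>e = (* (+ a b) (1+ b))</code> evaluated and differentiated in reverse mode, as a minimal Python sketch (the variable names are mine):</p>

```python
def reverse_mode(a, b):
    """Reverse-mode differentiation of e = (a + b) * (b + 1) on its tiny graph.
    One backward sweep yields de/d<node> for every node at once."""
    # forward pass: evaluate the graph, memoizing every intermediate node
    c = a + b
    d = b + 1
    e = c * d
    # backward pass: chain rule, starting from the output's seed derivative
    de_de = 1.0
    de_dc = de_de * d                  # local derivative of e = c * d w.r.t. c
    de_dd = de_de * c                  # ... and w.r.t. d
    de_da = de_dc * 1.0                # dc/da = 1
    de_db = de_dc * 1.0 + de_dd * 1.0  # b feeds both c and d: sum over the paths
    return e, {"a": de_da, "b": de_db, "c": de_dc, "d": de_dd}
```

<p>With <code>a = 2</code> and <code>b = 1</code>, a single backward sweep produces the derivatives of <code>e</code> with respect to all four nodes at once, which is exactly the property the text describes.</p>

<p>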
No selection of the best variant is needed; it's just a proper arrangement of the operations to avoid redundant computations.</p> <h2 id="takeaways">Take-aways</h2> <p>DP-based algorithms may operate on one of these three levels:</p> <ul><li>just systematic memoization, when every intermediate result is cached and used to compute subsequent results for larger problems (Fibonacci numbers, backprop)</li><li>memoization + backpointers that allow for the reconstruction of the sequence of actions that lead to the final solution (text segmentation)</li><li>memoization + backpointers + a target function that selects the best intermediate solution (text justification, diff, shortest path)</li></ul> <p>If we want to apply DP to some task, we need to find its optimal substructure: i.e. verify that an optimal solution to a subproblem will remain a part of the optimal solution to the whole problem. Next, if we deal with an optimization task, we may have to formulate the recurrence relations. After that, it's just a matter of technique: those relations may be either programmed directly as a recursive or iterative procedure (like in LCS) or indirectly using the method of consecutive approximations (like in Bellman-Ford).</p> <p>Ultimately, all DP problems may be reduced to pathfinding in the graph, but it doesn't always make sense to have this graph explicitly as a data structure in the program. If it does, however, remember that Dijkstra's algorithm is the optimal algorithm to find a single shortest path in it.</p> <p>DP, usually, is a reasonable next thing to think about after the naive greedy approach (which, let's be frank, everyone tends to take initially) stumbles over backtracking. However, we saw that DP and greedy approaches do not contradict each other: in fact, they can be combined as demonstrated by Dijkstra's algorithm. Yet, an optimal greedy algorithm is more of an exception than a rule. 
However, there is a number of problems for which a top-n greedy solution (the so-called <strong>Beam search</strong>) can be a near-optimal solution that is good enough.</p> <p>Also, DP doesn't necessarily mean optimal. A vanilla dynamic programming algorithm exhaustively explores the decision space, which may be excessive in many cases. This is demonstrated by the examples of Dijkstra's and Myers' algorithms that improve on the DP solution by cutting some corners.</p> <p>P.S. We have also discussed, for the first time in this book, the value of heuristic pre- and postprocessing. From the theoretical standpoint, it is not something you have to pay attention to, but, in practice, that's a very important aspect of the production implementation of many algorithms and, thus, shouldn't be frowned upon or neglected. In an ideal world, an algorithmic procedure should both have optimal worst-case complexity and the fastest operation in the common cases.</p> <hr size="1"><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r11-1" name="f11-1">[1]</a> If you wonder, <code>s</code> is a word that is usually present in English programmatic dictionaries because when <code>it's</code> and friends are tokenized they're split into two tokens, and the apostrophe may be missing sometimes. 
Also, our dictionary is case-insensitive.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r11-2" name="f11-2">[2]</a> The intuition for it is the following: in the worst case, every character has two choices: either to be the last letter of the previous word or the first one of the next word, hence the branching factor is 2.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r11-3" name="f11-3">[3]</a> Actually, the condition of complete string coverage may be lifted, which will allow using almost the same algorithm while skipping over "undictionary" words like misspellings.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r11-4" name="f11-4">[4]</a> A space at the end of the line is discarded.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r11-5" name="f11-5">[5]</a> Provided all the length calculations are implemented efficiently. For simplicity, I have used plain lists here with a linear <code>length</code> complexity, but a separate variable may be added to avoid the extra cost.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r11-6" name="f11-6">[6]</a> However, if we think of it, we could reuse the already proven linked representation just putting the incoming edges into the node structure instead of the outgoing ones.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r11-7" name="f11-7">[7]</a> It runs diff twice: first, at the line-level (using each line as a single unit/token) and then at the character level, as you would normally expect. 
Then the results are just combined.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-20877866847045248332019-11-20T12:47:00.000+02:002019-11-21T11:52:32.409+02:00Programming Algorithms: Strings<p>It may not be immediately obvious why the whole chapter is dedicated to strings. Aren't they just glorified arrays? There are several answers to these challenges:</p> <ul><li>indeed, strings are not just arrays, or rather, not only arrays: in different contexts, other representations, such as trees or complex combinations of arrays, may be used. And, besides, there are additional properties that are important for strings even when they are represented as arrays</li> <li>there's a lot of string-specific algorithms that deserve their own chapter</li> <li>finally, strings play a significant role in almost every program, so they have specific handling: in the OS, standard library, and even, sometimes, your application framework</li></ul> <p>In the base case, a string is, indeed, an array. As we already know, this array may either store its length or be a 0-terminated security catastrophe, like in C (see buffer overflow). So, to reiterate, strings should store their length. <strong>Netstrings</strong> are a notable take on the idea of the length-aware strings: it's a simple external format that serializes a string as a tuple of length and contents, separated by a colon and ending with a comma: <code>3:foo,</code> is the netstring for the string <code>foo</code>.</p> <p>More generally, a string is a sequence of characters. The characters themselves may be single bytes as well as fixed or variable-length byte sequences. The latter kind of character encoding raises a challenging question of what to prefer, correctness or speed? 
With variable-length Unicode code points, the simplest and fastest string variant, a byte array, breaks, for it will incorrectly report its length (in bytes, not in characters) and fail to retrieve the character by index. Different language ecosystems address this issue differently, and the majority is, unfortunately, broken in one aspect or another. Overall, there may be two possible solution paths. The first one is to use a fixed-length representation and pad shorter characters to full length. Generally, such representation will be 32-bit UTF-32 resulting in up to 75% storage space waste for the most common 1-byte ASCII characters. The alternative approach will be to utilize a more advanced data-structure. The naive variant is a list, which implies an unacceptable slowdown of character access operation to <code>O(n)</code>. Yet, a balanced approach may combine minimal additional space requirements with acceptable speed. One of the solutions may be to utilize the classic <strong>bitmap</strong> trick: use a bit array indicating, for each byte, whether it's the start of a character (only a 12% overhead). Calculating the character position may be performed in a small number of steps with the help of an infamous, in close circles, operation — Population count aka Hamming weight. This hardware instruction calculates the number of 1-bits in an integer and is accessible via <code>logcount</code> Lisp standard library routine. Behind the scenes, it is also called for bit arrays if you invoke <code>count 1</code> on them. At least this is the case for SBCL:</p> <pre><code>CL-USER> (disassemble (lambda (x) <br /> (declare (type (simple-array bit) x))<br /> (count 1 x)))<br /><br />; disassembly for (LAMBDA (X))<br />; Size: 267 bytes. 
Origin: #x100FC9FD1A<br />...<br />; DA2: F3480FB8FA POPCNT RDI, RDX<br /></code></pre> <p>The indexing function implementation may be quite tricky, but the general idea is to try to jump ahead <code>n</code> characters and calculate the popcount of the substring from the previous position to the current that will tell us the number of characters we have skipped. For the base case of a 1-byte string, we will get exactly where we wanted in just 1 jump and 1 popcount. However, if there were multibyte characters in the string, the first jump would have skipped less than <code>n</code> characters. If the difference is sufficiently small (say, below 10) we can just perform a quick linear scan of the remainder and find the position of the desired character. If it's larger than <code>n/2</code> we can jump ahead <code>n</code> characters again (this will repeat at most 3 times as the maximum byte-length of a character is 4), and if it's below <code>n/2</code> we can jump <code>n/2</code> characters. And if we overshoot we can reverse the direction of the next jump or search. You can see where it's heading: if at each step (or, at least, at each 4th step) we are constantly half dividing our numbers this means <code>O(log n)</code> complexity. 
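<p>A simplified Python sketch of the bitmap part of this scheme may look as follows (purely linear, without the jump-ahead trick, and with made-up helper names):</p>

```python
def starts_bitmap(utf8_bytes):
    # bit i is 1 iff byte i starts a character: in UTF-8,
    # continuation bytes have the form 0b10xxxxxx
    return [0 if (b & 0xC0) == 0x80 else 1 for b in utf8_bytes]

def char_index(utf8_bytes, bitmap, n):
    # byte offset of the n-th (0-based) character: scan the bitmap
    # counting 1-bits; the efficient version described in the text
    # replaces this scan with popcounts over whole machine words
    seen = -1
    for off, bit in enumerate(bitmap):
        seen += bit
        if seen == n:
            return off
    raise IndexError(n)

def mb_length(bitmap):
    # the length in characters is the popcount of the whole bitmap
    return sum(bitmap)
```

<p>For the string <code>naïve</code> (6 bytes, 5 characters), the bitmap is <code>1 1 1 0 1 1</code>, so the character with index 3 starts at byte 4.</p>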
That's the worst performance we can get for this function, and it will very efficiently handle the cases when the character length doesn't vary: be it 1 byte — just 2 operations, or 4 bytes — 8 ops.</p> <p>Here is the prototype of the <code>mb-char-index</code> operation implemented according to the described algorithm (without the implementation of the <code>mb-linear-char-index</code> that performs the final linear scan):</p> <pre><code>(defstruct (mb-string (:conc-name mbs-))<br /> bytes<br /> bitmap)<br /><br />(defparameter *mb-threshold* 10)<br /><br />(defun mb-char-index (string i)<br /> (let ((off 0))<br /> (loop<br /> (with ((cnt (count 1 (mbs-bitmap string) :start off :end (+ off i)))<br /> (diff (- i cnt)))<br /> (cond<br /> ((= cnt i) (return (+ off i)))<br /> ((< diff *mb-threshold*) (return (mb-linear-char-index<br /> string diff off)))<br /> ((< cnt (floor i 2)) (:+ off i)<br /> (:- i cnt))<br /> (t (:+ off (floor i 2))<br /> (:- i cnt)))))))<br /></code></pre> <p>The <code>length</code> of such a string may be calculated by performing the popcount on the whole bitmap:</p> <pre><code>(defun mb-length (string)<br /> (count 1 (mbs-bitmap string)))<br /></code></pre> <p>It's also worth taking into account that there exists a set of rules assembled under the umbrella of the Unicode collation algorithm that specifies how to order strings containing Unicode code-points.</p> <h2 id="basicstringrelatedoptimizations">Basic String-Related Optimizations</h2> <p>Strings are often subject to subsequencing, so an efficient implementation may use structure sharing. As we remember, in Lisp, this is accessible via the displaced arrays mechanism (and a convenience RUTILS function <code>slice</code> that we have already used in the code above). Yet, structure sharing should be utilized with care as it opens a possibility for action-at-a-distance bugs if the derived string is modified, which results in parallel modification of the original. 
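<p>Python has no displaced arrays, but a <code>memoryview</code> over a <code>bytearray</code> shares storage in a similar way, so the action-at-a-distance effect is easy to demonstrate (an illustrative sketch, not the book's code):</p>

```python
buf = bytearray(b"hello world")
head = memoryview(buf)[:5]  # a "substring" sharing storage with buf
head[0] = ord("H")          # modify the derived string...
# ...and the original is modified in parallel: buf is now b"Hello world"
```

<p>After the assignment, both <code>head</code> and <code>buf</code> observe the change, which is exactly the kind of surprise that mandatory immutability prevents.</p>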
Though, strings are rarely modified in-place, so even in its basic form (without mandatory immutability) the approach works well. Moreover, some programming language environments make strings immutable by default. In such cases, to perform on-the-fly string modification (or rather, creation) such patterns as the Java <code>StringBuilder</code> are used, which creates the string from parts by first accumulating them in a list and then, when necessary, concatenating the list's contents into a single final string. An alternative approach is string formatting (the <code>format</code> function in Lisp) that is a higher-level interface, which still needs to utilize some underlying mutation/combination mechanism.</p> <p>Another important string-related technology is <strong>interning</strong>. It is a space-saving measure to avoid duplicating the same strings over and over again, which operates by putting a string in a table and using its index afterwards. This approach also enables efficient equality comparison. Interning is performed by the compiler implicitly for all constant strings (in the special segment of the program's memory called "string table"/<code>sstab</code>), and also may be used explicitly. In Lisp, there's a standard function <code>intern</code> for this. Lisp symbols use interned strings as their names. Another variant of interning is string pooling. The difference is that interning uses a global string table while the pools may be local.</p> <h2 id="stringsintheeditor">Strings in the Editor</h2> <p>Now, let's consider situations in which representing strings as arrays doesn't work. The primary one is the editor, i.e. an environment where constant random modification is the norm. There's another not so obvious requirement related to editing: handling potentially arbitrarily long strings that still need to be dynamically modified. Have you tried opening a hundred-megabyte text document in your favorite editor? 
You'd better not, unless you're a Vim user :) Finally, an additional limitation of handling the strings in the editor is posed when we allow concurrent modification. This we'll discuss in the chapter on concurrent algorithms.</p> <p>So, why doesn't an array as a string backend work well in the editor? Because of content relocation required by all edit operations. <code>O(n)</code> editing is, obviously, not acceptable. What to do? There are several more advanced approaches:</p> <ol><li>The simplest change will be, once again, to use an array of arrays. For example, for each line. This will not change the general complexity of <code>O(n)</code> but, at least, will reduce <code>n</code> significantly. The issue is that, still, it will depend on the length of the line so, for the not so rare degenerate case when there are few or no linebreaks, the performance will seriously deteriorate. And, moreover, having observable performance differences between editing different paragraphs of the text is not user-friendly at all.</li> <li>A more advanced approach would be to use trees, reducing access time to <code>O(log n)</code>. There are many different kinds of trees and, in fact, only a few may work as efficient string representations. Among them, a popular data structure for representing strings is a <strong>Rope</strong>. It's a binary tree where each leaf holds a substring and its length, and each intermediate node further holds the sum of the lengths of all the leaves in its left subtree. It's a more-or-less classic application of binary trees to a storage problem so we won't spend more time on it here. Suffice it to say that it has the expected binary-tree performance of <code>O(log n)</code> for all operations, provided that we keep it balanced. It's an ok alternative to a simple array, but, for such a specialized problem, we can do better with a custom solution.</li> <li>And the custom solution is to return to arrays. 
There's one clever way to use them that works very well for dynamic strings. It is called a <strong>Gap buffer</strong>. This structure is an array (buffer) with a gap in the middle. I.e., let's imagine that we have a text of <code>n</code> characters. The Gap buffer will have a length of <code>n + k</code> where <code>k</code> is the gap size — some value, derived from practice, that may fluctuate in the process of string modification. You can recognize this gap as the position of the cursor in the text. Insertion operation in the editor is performed exactly at this place, so it's <code>O(1)</code>. Just, afterwards, the gap will shrink by 1 character, so we'll have to resize the array, at some point, if there are too many insertions and the gap shrinks below some minimum size (maybe, below 1). The deletion operation will act exactly the opposite by growing the gap at one of the sides. The Gap buffer is an approach that is especially suited for normal editing — a process that has its own pace. It also allows the system to represent multiple cursors by maintaining several gaps. Also, it may be a good idea to represent each paragraph as a gap buffer and use an array of them for the whole text. The gap buffer is a special case of the Zipper pattern that we'll discuss in the chapter on functional data structures.</li></ol> <h2 id="substringsearch">Substring Search</h2> <p>One of the most common string operations is substring search. For ordinary sequences we, usually, search for a single element, but strings, on the contrary, more often need subsequence search, which is more complex. A naive approach will start by looking for the first character, then trying to match the next character and the next, until either something ends or there's a mismatch. 
Unlike with hash-tables, the Lisp standard library has good support for string processing, including such operations as <code>search</code> (which, actually, operates on any sequence type) and <code>mismatch</code> that compares two strings from a chosen side and returns the position at which they start to diverge.</p> <p>If we were to implement our own string-specific search, the most basic version would, probably, look like this:</p> <pre><code>(defun naive-match (pat str)<br /> (dotimes (i (- (1+ (length str)) (length pat)))<br /> (let ((mis (mismatch pat (slice str i))))<br /> ;; MISMATCH returns NIL when the whole slice matches PAT<br /> (when (or (null mis)<br /> (= mis (length pat)))<br /> (return-from naive-match i)))))<br /></code></pre> <p>If the strings had been random, the probability that we are correctly matching each subsequent character would have dropped to 0 very fast. Even if we consider just the English alphabet, the probability of the first character being the same in 2 random strings is <code>1/26</code>, the first and second — <code>1/676</code>, and so on. And if we assume that the whole charset may be used, we'll have to substitute 26 with 256 or a greater value. So, in theory, such a naive approach has almost <code>O(n)</code> complexity, where <code>n</code> is the length of the string. Yet, the worst case is <code>O(n * m)</code>, where <code>m</code> is the length of the pattern. Why? If we try to match a pattern <code>a..ab</code> against a string <code>aa.....ab</code>, at each position, we'll have to check the whole pattern until the last character mismatches. This may seem like an artificial example and, indeed, it rarely occurs. But, still, real-world strings are not so random and are much closer to the uniform corner case than to the random one. So, researchers have come up with a number of ways to improve subsequence matching performance. Those include the four well-known inventor-glorifying substring search algorithms: Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp, and Aho-Corasick. 
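<p>Before turning to them, the quadratic worst case is easy to observe by instrumenting the naive routine with a comparison counter (an illustrative Python sketch, not the book's code):</p>

```python
def naive_match_counted(pat, s):
    # return (position of the first match or None,
    #         number of character comparisons made)
    comparisons = 0
    m = len(pat)
    for i in range(len(s) - m + 1):
        for j in range(m):
            comparisons += 1
            if s[i + j] != pat[j]:
                break
        else:  # no break: full match at position i
            return i, comparisons
    return None, comparisons
```

<p>On the adversarial <code>a..ab</code>-style inputs almost every one of the <code>n</code> starting positions costs close to <code>m</code> comparisons, which is exactly the <code>O(n * m)</code> behavior described above.</p>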
Let's discuss each one of them and try to determine their interesting properties.</p> <h3 id="kmp">KMP</h3> <p>Knuth-Morris-Pratt is the most basic of these algorithms. Prior to performing the search, it examines the pattern to find repeated subsequences in it and creates a table containing, for each character of the pattern, the length of the prefix of the pattern that can be skipped if we have reached this character and failed the search at it. This table is also called the "failure function". The number in the table is calculated as the length of the proper suffix<a href="#f10-1" name="r10-1">[1]</a> of the pattern substring ending before the current character that matches the start of the pattern.</p> <p>I'll repeat here the example provided in Wikipedia that explains the details of the table-building algorithm, as it's somewhat tricky.</p> <p>Let's build the table for the pattern <code>abcdabd</code>. We set the table entry for the first char <code>a</code> to -1. To find the entry for <code>b</code>, we must discover a proper suffix of <code>a</code> which is also a prefix of the pattern. But there are no proper suffixes of <code>a</code>, so we set this entry to 0. To find the entry with index 2, we see that the substring <code>ab</code> has a proper suffix <code>b</code>. However <code>b</code> is not a prefix of the pattern. Therefore, we also set this entry to 0.</p> <p>For the next entry, we first check the proper suffix of length 1, and it fails like in the previous case. Should we also check longer suffixes? No. We can formulate a shortcut rule: at each stage, we need to consider checking suffixes of a given size <code>(1+ n)</code> only if a valid suffix of size <code>n</code> was found at the previous stage and should not bother to check longer lengths. So we set the table entry for <code>d</code> to 0 also.</p> <p>We pass to the subsequent character <code>a</code>. 
The same logic shows that the longest substring we need to consider has length 1, and as in the previous case it fails since <code>d</code> is not a prefix. But instead of setting the table entry to 0, we can do better by noting that <code>a</code> is also the first character of the pattern, and also that the corresponding character of the string can't be <code>a</code> (as we're calculating for the mismatch case). Thus there is no point in trying to match the pattern for this character again — we should begin 1 character ahead. This means that we may shift the pattern by match length plus one character, so we set the table entry to -1.</p> <p>Considering now the next character <code>b</code>: though by inspection the longest substring would appear to be <code>a</code>, we still set the table entry to 0. The reasoning is similar to the previous case. <code>b</code> itself extends the prefix match begun with <code>a</code>, and we can assume that the corresponding character in the string is not <code>b</code>. So backtracking before it is pointless, but that character may still be <code>a</code>, hence we set the entry not to -1, but to 0, which means shifting the pattern by 1 character to the left and trying to match again.</p> <p>Finally, for the last character <code>d</code>, the rule of the proper suffix matching the prefix applies, so we set the table entry to 2.</p> <p>The resulting table is:</p> <pre><code> a b c d a b d <br /> -1 0 0 0 -1 0 2<br /></code></pre> <p>Here's the implementation of the table-building routine:</p> <pre><code>(defun kmp-table (pat)<br /> (let ((rez (make-array (length pat)))<br /> (i 0)) ; prefix length<br /> (:= (? rez 0) -1)<br /> (loop :for j :from 1 :below (length pat) :do<br /> (if (char= (char pat i) (char pat j))<br /> (:= (? rez j) (? rez i))<br /> (progn (:= (? rez j) i<br /> i (? rez i))<br /> (loop :while (and (>= i 0)<br /> (not (char= (char pat i) (char pat j))))<br /> :do (:= i (? 
rez i)))))<br /> (:+ i))<br /> rez))<br /></code></pre> <p>It can be proven that it runs in <code>O(m)</code>. We won't show it here, so coming up with proper calculations is left as an exercise for the reader.</p> <p>Now, the question is, how shall we use this table? Let's look at the code:</p> <pre><code>(defun kmp-match (pat str)<br /> (let ((s 0)<br /> (p 0)<br /> (ff (kmp-table pat)))<br /> (loop :while (< s (length str)) :do<br /> (if (char= (char pat p) (char str s))<br /> ;; if the current characters of the pattern and string match<br /> (if (= (1+ p) (length pat))<br /> ;; if we reached the end of the pattern - success<br /> (return (- s p))<br /> ;; otherwise, match the subsequent characters<br /> (:= p (1+ p)<br /> s (1+ s)))<br /> ;; if the characters don't match<br /> (if (= -1 (? ff p))<br /> ;; shift the pattern for the whole length <br /> (:= p 0<br /> ;; and skip to the next char in the string<br /> s (1+ s))<br /> ;; try matching the current char again,<br /> ;; shifting the pattern to align the prefix<br /> ;; with the already matched part<br /> (:= p (? ff p)))))))<br /></code></pre> <p>As we see, the index in the string (<code>s</code>), is incremented at each iteration except when the entry in the table is positive. In the latter case, we may examine the same character more than once but not more than we have advanced in the pattern. And the advancement in the pattern meant the same advancement in the string (as the match is required for the advancement). In other words, we can backtrack not more than <code>n</code> times over the whole algorithm runtime, so the worst-case number of operations in <code>kmp-match</code> is <code>2n</code>, while the best-case is just <code>n</code>. Thus, the total complexity is <code>O(n + m)</code>.</p> <p>And what will happen in our <code>aa..ab</code> example? The failure function for it will look like the following: <code>-1 -1 -1 -1 (- m 2)</code>. 
Once we reach the first mismatch, we'll need to backtrack by 1 character, perform the comparison, which will mismatch, advance by 1 character (to <code>b</code>), mismatch again, again backtrack by 1 character, and so on until the end of the string. So, this case will have almost the abovementioned <code>2n</code> runtime.</p> <p>To conclude, the optimization of KMP lies in excluding unnecessary repetition of the same operations by memoizing the results of partial computations — both in table-building and matching parts. The next chapter of the book will be almost exclusively dedicated to studying this approach in algorithm design.</p> <h3 id="bm">BM</h3> <p>Boyer-Moore algorithm is conceptually similar to KMP, but it matches from the end of the pattern. It also builds a table, or rather three tables, but using a different set of rules, which also involve the characters in the string we search. More precisely, there are two basic rules instead of one for KMP. Besides, there's another rule, called the Galil rule, that is required to ensure the linear complexity of the algorithm. Overall, BM is pretty complex in the implementation details and also requires more preprocessing than KMP, so its utility outweighs these factors only when the search is repeated multiple times for the same pattern.</p> <p>Overall, BM may be faster with normal text (and the longer the pattern, the faster), while KMP will work the best with strings that have a short alphabet (like DNA). However, I would choose KMP as the default due to its relative simplicity and much better space utilization.</p> <h3 id="rk">RK</h3> <p>Now, let's talk about alternative approaches that rely on techniques other than pattern preprocessing. They are usually used to find matches of multiple patterns in one go as, for the base case, their performance will be worse than that of the previous algorithms.</p> <p>Rabin-Karp algorithm uses an idea of the <strong>Rolling hash</strong>. 
It is a hash function that can be calculated incrementally. The RK hash is calculated for each substring of the length of the pattern. If we were to calculate a normal hash function like fnv-1, we'd need to use each character for the calculation — resulting in <code>O(n * m)</code> complexity of the whole procedure. The rolling hash is different as it requires, at each step of the algorithm, to perform just 2 operations: as the "sliding window" moves over the string, subtract the part of the hash corresponding to the character that is no longer part of the substring and add the new value for the character that has just become part of the substring.</p> <p>Here is the skeleton of the RK algorithm:</p> <pre><code>(defun rk-match (pat str)<br /> (let ((len (length pat))<br /> (phash (rk-hash pat)))<br /> (loop :for i :from len :to (length str)<br /> :for beg := (- i len)<br /> :for shash := (rk-hash (slice str 0 len))<br /> :then (rk-rehash shash len (char str (1- beg)) (char str (1- i)))<br /> :when (and (= phash shash)<br /> (string= pat (slice str beg i)))<br /> :collect beg)))<br /></code></pre> <p>A trivial <code>rk-hash</code> function would be just:</p> <pre><code>(defun rk-hash (str)<br /> (loop :for ch :across str :sum (char-code ch)))<br /></code></pre> <p>But it is, obviously, not a good hash-function as it doesn't ensure the equal distribution of hashes. Besides, in this case, we need a reversible hash-function. Usually, such hashes add position information into the mix. The original hash-function for the RK algorithm is the Rabin fingerprint that uses random irreducible polynomials over Galois fields of order 2. The mathematical background needed to explain it is somewhat beyond the scope of this book. 
However, there are simpler alternatives such as the following:</p> <pre><code>(defun rk-hash (str)<br /> (assert (> (length str) 0))<br /> (let ((rez (char-code (char str 0))))<br /> (loop :for ch :across (slice str 1) :do<br /> (:= rez (+ (rem (* rez 256) 101)<br /> (char-code ch))))<br /> (rem rez 101)))<br /></code></pre> <p>Its basic idea is to treat the partial values of the hash as the coefficients of some polynomial.</p> <p>The implementation of <code>rk-rehash</code> for this function will look like this:</p> <pre><code>(defun rk-rehash (hash len ch1 ch2)<br /> (rem (+ (* (+ hash 101<br /> (- (rem (* (char-code ch1)<br /> (expt 256 (1- len)))<br /> 101)))<br /> 256)<br /> (char-code ch2))<br /> 101))<br /></code></pre> <p>Our <code>rk-match</code> could be used to find many matches of a single pattern. To adapt it for operating on multiple patterns at once, we'll just need to pre-calculate the hashes for all patterns and look up the current rk-hash value in this set. Additional optimization of this lookup may be performed with the help of a Bloom filter — a stochastic data structure we'll discuss in more detail later.</p> <p>Finally, it's worth noting that there are other similar approaches to the rolling hash concept that trade some of the uniqueness properties of the hash function for the ability to produce hashes incrementally or have similar hashes for similar sequences. For instance, the <strong>Perceptual hash</strong> (phash) is used to find near-match images.</p> <h3 id="ac">AC</h3> <p>Aho-Corasick is another algorithm that allows matching multiple strings at once. The preprocessing step of the algorithm constructs a <strong>Finite-State Machine</strong> (FSM) that resembles a trie with additional links between the various internal nodes. 
The FSM is a graph data structure that encodes possible states of the system and actions needed to transfer it from one state to the other.</p> <p>The AC FSM is constructed in the following manner:</p> <ol><li>Build a trie of all the words in the set of search patterns (the search dictionary). This trie represents the possible flows of the program when there's a successful character match at the current position. Add a loop edge for the root node.</li> <li>Add backlinks transforming the trie into a graph. The backlinks are used when a failed match occurs. These backlinks are pointing either to the root of the trie or if there are some prefixes that correspond to the part of the currently matched path — to the end of the longest prefix. The longest prefix is found using BFS of the trie. This approach is, basically, the same idea used in KMP and BM to avoid reexamining the already matched parts. So backlinks to the previous parts of the same word are also possible.</li></ol> <p>Here is the example FSM for the search dictionary <code>'("the" "this" "that" "it" "his")</code>:</p> <a href="https://2.bp.blogspot.com/-w84XBosyVpk/XdUY_5st4HI/AAAAAAAACPc/GnyFn0F62hk0LtqDlaZMU4Nbg-j1086EwCPcBGAYYCw/s1600/ac.jpg" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-w84XBosyVpk/XdUY_5st4HI/AAAAAAAACPc/GnyFn0F62hk0LtqDlaZMU4Nbg-j1086EwCPcBGAYYCw/s640/ac.jpg" width="640" height="381" data-original-width="751" data-original-height="447" /></a> <p>Basically, it's just a trie with some backlinks to account for already processed prefixes. 
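<p>The two construction steps can be sketched in Python as follows (an illustration of the described scheme, not production code): a dict-based trie, plus BFS-computed backlinks, with terminal outputs merged along them:</p>

```python
from collections import deque

def build_ac(patterns):
    # step 1: a trie; node 0 is the root
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    # step 2: backlinks via BFS; depth-1 nodes fall back to the root
    q = deque(goto[0].values())
    while q:
        node = q.popleft()
        for ch, child in goto[node].items():
            q.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]  # patterns ending mid-path
    return goto, fail, out

def ac_search(text, fsm):
    goto, fail, out = fsm
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]  # follow backlinks on a failed match
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

<p>For the dictionary from the figure, searching <code>"this is the thing"</code> reports <code>this</code> and the overlapping <code>his</code> at once, thanks to the output sets merged along the backlinks.</p>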
One more detail missing for this graph to be a complete FSM is an implicit backlink to the root node from all the nodes that don't have an explicit backlink.</p> <p>The main loop of the algorithm is rather straightforward: examine each character and then:</p> <ul><li>either follow one of the transitions (direct edge) if the character of the edge matches</li><li>or follow the backlink if it exists</li><li>or reset the FSM state — go to root</li><li>if the transition leads us to a terminal node, record the match(es) and return to the root, as well</li></ul> <p>As we see from the description, the complexity of the main loop is linear in the length of the string: amortized, a constant number of transitions is performed for each character. The FSM construction is also linear in the total length of all the words in the search dictionary.</p> <p>The algorithm is often used in antivirus software to perform an efficient search for code signatures against a database of known viruses. It also formed the basis of the original Unix command <code>fgrep</code>. And, from my point of view, it's the simplest to understand yet pretty powerful and versatile substring search algorithm that may be a default choice if you ever have to implement one yourself.</p> <h2 id="regularexpressions">Regular Expressions</h2> <p>Searching is, probably, the most important advanced string operation. Besides, it is not limited to mere substring search — matching of more complex patterns is even in higher demand. These patterns, which are called "regular expressions" or, simply, <strong>regex</strong>es, may include optional characters, repetition, alternatives, backreferences, etc. Regexes play an important role in the history of the Unix command-line, being the principal technology of the famous <code>grep</code> utility, and then the cornerstone of Perl. 
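</p><p>To make the construction and the main loop concrete, here is a compact Python sketch of Aho-Corasick (an illustration of the steps above, not the chapter's code; the node layout and field names are made up):</p>

```python
from collections import deque

def build_ac(patterns):
    # Trie of all patterns; each node keeps its edges, a backlink (fail),
    # and the dictionary words matched when the node is reached (out).
    trie = [{"next": {}, "fail": 0, "out": []}]
    for p in patterns:
        cur = 0
        for ch in p:
            if ch not in trie[cur]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[cur]["next"][ch] = len(trie) - 1
            cur = trie[cur]["next"][ch]
        trie[cur]["out"].append(p)
    # BFS: point each node's backlink at the longest proper suffix of its
    # path that is also a prefix of some dictionary word.
    q = deque(trie[0]["next"].values())
    while q:
        cur = q.popleft()
        for ch, child in trie[cur]["next"].items():
            q.append(child)
            f = trie[cur]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            fail = trie[f]["next"].get(ch, 0)
            trie[child]["fail"] = 0 if fail == child else fail
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
    return trie

def ac_search(trie, text):
    matches, cur = [], 0
    for i, ch in enumerate(text):
        while cur and ch not in trie[cur]["next"]:
            cur = trie[cur]["fail"]         # failed match: follow backlinks
        cur = trie[cur]["next"].get(ch, 0)  # direct edge, or reset to root
        for w in trie[cur]["out"]:          # terminal node(s) reached
            matches.append((i - len(w) + 1, w))
    return matches
```

<p>For the dictionary from the figure above, matching the word "this" also reports "his", thanks to the backlink from the terminal node of "this".</p> <p>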
All modern programming languages support them either in the standard library or, as in Lisp, with high-quality third-party addons (<a href="http://edicl.github.io/cl-ppcre/">cl-ppcre</a>).</p> <p>One of my favorite programming books, "Beautiful Code", has a chapter on implementing simple regex matching from Brian Kernighan with code written by Rob Pike. It shows how easy it is to perform basic matching of the following patterns:</p> <pre><code>c matches any literal character c<br />. matches any single character<br />^ matches the beginning of the input string<br />$ matches the end of the input string<br />* matches zero or more occurrences of the previous character<br /></code></pre> <p>Below the C code from the book is translated into an equivalent Lisp version:</p> <pre><code>(defun match (regex text)<br /> "Search for REGEX anywhere in TEXT."<br /> (if (starts-with "^" regex) ; STARTS-WITH is from RUTILS<br /> (match-here (slice regex 1) text)<br /> (dotimes (i (1+ (length text))) ; also try the empty suffix<br /> (when (match-here regex (slice text i))<br /> (return t)))))<br /><br />(defun match-here (regex text)<br /> "Search for REGEX at beginning of TEXT."<br /> (cond ((= 0 (length regex))<br /> t)<br /> ((and (> (length regex) 1)<br /> (char= #\* (char regex 1)))<br /> (match-star (char regex 0) (slice regex 2) text))<br /> ((string= "$" regex)<br /> (= 0 (length text)))<br /> ((and (> (length text) 0)<br /> (member (char regex 0) (list #\. (char text 0)))<br /> (match-here (slice regex 1) (slice text 1))))))<br /><br />(defun match-star (c regex text)<br /> "Search for C*REGEX at beginning of TEXT."<br /> (loop<br /> (when (match-here regex text) (return t))<br /> (:= text (slice text 1))<br /> (unless (and (> (length text) 0)<br /> (member c (list #\. (char text 0))))<br /> (return))))<br /></code></pre> <p>This is a greedy linear algorithm. However, modern regexes are much more advanced than this naive version. 
They include such features as register groups (to record the spans of text that match a particular subpattern), backreferences, non-greedy repetition, and so on and so forth. Implementing those will require changing the simple linear algorithm to a backtracking one. And incorporating all of them would quickly transform the code above into a horrible unmaintainable mess: not even due to the number of cases that have to be supported but due to the need to account for the complex interdependencies between them.</p> <p>What's worse, the backtracking approach has a critical performance flaw: potential exponential runtime for certain input patterns. For instance, the Perl regex engine requires over sixty seconds to match a 30-character string <code>aa..a</code> against the pattern <code>a?{15}a{15}</code> (on standard hardware). While the alternative approach, which we'll discuss next, requires just twenty microseconds — a million times faster. And it handles a 100-character string of a similar kind in under 200 microseconds, while Perl would require over 10<sup>15</sup> years.<a href="#f10-2" name="r10-2">[2]</a></p> <p>This issue is quite severe and has even prompted Google to release their own regex library with strict linear performance guarantees — <a href="https://github.com/google/re2">RE2</a>. The goal of the library is not to be faster than all other engines under all circumstances. Although RE2 guarantees linear-time performance, the linear-time constant varies depending on the overhead entailed by its way of handling the regular expression. In a sense, RE2 behaves pessimistically whereas backtracking engines behave optimistically, so it can be outperformed in various situations. Also, its goal is not to implement all of the features offered by PCRE and other engines. As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. 
Thus, backreferences and look-around assertions are not supported.</p> <p>The figures above are taken from a <a href="https://swtch.com/~rsc/regexp/regexp1.html">seminal article</a> by Russ Cox. He goes on to add:</p> <blockquote> <p>Historically, regular expressions are one of computer science's shining examples of how using good theory leads to good programs. They were originally developed by theorists as a simple computational model, but Ken Thompson introduced them to programmers in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex. Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools.</p></blockquote> <p>The linear-time approach to regex matching relies on a technique similar to the one used in the Aho-Corasick algorithm — the FSM. Actually, if by regular expressions we mean the set of languages that abide by the rules of the regular grammars in the Chomsky hierarchy of languages, the FSM is their exact theoretical computation model. 
Here is what an FSM for a simple regex <code>a*b$</code> might look like:</p> <a href="https://3.bp.blogspot.com/-hgkgpHFZwVg/XdUY6Wwvg8I/AAAAAAAACPc/gjDY5Qza96YLrRtg-rXeib87AQw13QYZgCPcBGAYYCw/s1600/regex.jpg" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-hgkgpHFZwVg/XdUY6Wwvg8I/AAAAAAAACPc/gjDY5Qza96YLrRtg-rXeib87AQw13QYZgCPcBGAYYCw/s400/regex.jpg" width="400" height="150" data-original-width="393" data-original-height="147" /></a> <p>Such an FSM is called an <strong>NFA</strong> (Nondeterministic Finite Automaton) as some states have more than one alternative successor. Another type of automata are <strong>DFA</strong>s (Deterministic Finite Automata), which permit, for each state and input character, a transition to at most one state. The method to transform the regex into an NFA is called Thompson's construction. And an NFA can be made into a DFA by the powerset construction and then be minimized to get an optimal automaton. DFAs are more efficient to execute than NFAs, because DFAs are only ever in one state at a time: they never have a choice of multiple next states. But the construction takes additional time. Anyway, both NFAs and DFAs guarantee linear-time execution.</p> <p>Thompson's algorithm builds the NFA up from partial NFAs for each subexpression, with a different construction for each operator. The partial NFAs have no matching states: instead, they have one or more dangling arrows, pointing to nothing. 
The construction process will finish by connecting these arrows to a matching state.</p> <ul><li>The NFA for matching a single character <code>e</code> is a single node with a slot for an incoming arrow and a pending outgoing arrow labeled with <code>e</code>.</li> <li>The NFA for the concatenation <code>e1e2</code> connects the outgoing arrow of the <code>e1</code> machine to the incoming arrow of the <code>e2</code> machine.</li> <li>The NFA for the alternation <code>e1|e2</code> adds a new start state with a choice of either the <code>e1</code> machine or the <code>e2</code> machine.</li> <li>The NFA for <code>e?</code> alternates the <code>e</code> machine with an empty path.</li> <li>The NFA for <code>e*</code> uses the same alternation but loops a matching <code>e</code> machine back to the start.</li> <li>The NFA for <code>e+</code> also creates a loop, but one that requires passing through <code>e</code> at least once.</li></ul> <p>Counting the states in the above constructions, we can see that this technique creates exactly one state per character or metacharacter in the regular expression. The only exceptions are the constructs <code>c{n}</code> and <code>c{n,m}</code>, which require duplicating the single-character automaton <code>n</code> or <code>m</code> times respectively, but that is still a constant number. Therefore the number of states in the final NFA is at most equal to the length of the original regular expression plus some constant.</p> <h3 id="implementationofthethompsonsconstruction">Implementation of Thompson's Construction</h3> <p>The core of the algorithm could be implemented very transparently with the help of Lisp generic functions. However, to enable their application, we'd first need to transform the raw expression into a sexp (tree-based) form. 
Such representation is supported, for example, in the cl-ppcre library:</p> <pre><code>PPCRE> (parse-string "ab[0-9]+c$")<br />(:SEQUENCE "ab" (:GREEDY-REPETITION 1 NIL (:RANGE #\0 #\9)) #\c :END-ANCHOR)<br /></code></pre> <p>Parsing is a whole separate topic that will be discussed next. But once we have performed it, we can straightforwardly implement Thompson's construction by traversing the parse tree and emitting, for each subexpression, the corresponding part of the automaton. The Lisp generic functions are a great tool for implementing such transformations as they allow defining methods that are selected based on either the type or the identity of the arguments. And those methods can be added independently, so the implementation is clear and extensible. We will define 2 generic functions: one to emit the automaton fragment (<code>th-part</code>) and another to help in transition selection (<code>th-match</code>).</p> <p>First, let's define the state node of the FSM. We will use a linked graph representation for the automaton. So, a variable for the FSM in the code will point to its start node, and it will, in turn, reference the other nodes. 
There will also be a special node that will be responsible for recording the matches (<code>*matched-state*</code>).</p> <pre><code>(defstruct th-state<br /> transitions)<br /><br />(defparameter *initial-state* nil)<br />(defparameter *matched-state* (make-th-state))<br /><br />(defun th-state (&rest transitions)<br /> "A small convenience function to construct TH-STATE structs."<br /> (make-th-state :transitions (loop :for (cond state) :on transitions :by 'cddr<br /> :collect (pair cond state))))<br /></code></pre> <p>And now, we can define the generic function that will emit the nodes:</p> <pre><code>(define-condition check-start-anchor () ())<br /><br />(defgeneric th-part (next-state kind &rest args)<br /> (:documentation<br /> "Emit the TH-STATE structure of a certain KIND<br /> (which may be a keyword or a raw string) using the other ARGS<br /> and pointing to NEXT-STATE struct.")<br /> (:method (next-state (kind (eql :sequence)) &rest args)<br /> (apply 'th-part (if (rest args)<br /> (apply 'th-part :sequence (rest args))<br /> next-state)<br /> (first args)))<br /> (:method (next-state (kind (eql :greedy-repetition)) &rest args)<br /> ;; this method can handle *, +, {n}, and {n,m} regex modifiers<br /> ;; in any case, there's a prefix sequence of fixed nonnegative length<br /> ;; of identical elements that should unconditionally match<br /> ;; followed by a bounded or unbounded sequence that,<br /> ;; in case of a failed match, transitions to the next state<br /> (apply 'th-part<br /> (let ((*initial-state* next-state))<br /> (apply 'th-part next-state :sequence<br /> (loop :repeat (or (second args) 1)<br /> :collect (mklist (third args)))))<br /> :sequence (loop :repeat (first args)<br /> :collect (mklist (third args)))))<br /> (:method (next-state (kind character) &rest args)<br /> (th-state kind next-state<br /> ;; Usually, *initial-state* will be nil, i.e. further computations<br /> ;; along this path will be aborted, but for some variants (? 
or *)<br /> ;; they will just continue normally to the next state.<br /> ;; The special variable allows controlling this as you can see in<br /> ;; the method for :greedy-repetition <br /> t *initial-state*))<br /> (:method (next-state (kind (eql :end-anchor)) &rest args)<br /> (th-state nil *matched-state*<br /> t *initial-state*))<br /> (:method (next-state (kind (eql :start-anchor)) &rest args)<br /> ;; This part is unique in that all the other parts consume the next character<br /> ;; (we're not implementing lookahead here), but this one shouldn't.<br /> ;; To implement such behavior without the additional complexity created by passing<br /> ;; the string being searched to this function (which we'll still, probably, need to do<br /> ;; later on, but were able to avoid so far), we can resort to a cool Lisp technique<br /> ;; of signaling a condition that can be handled specially in the top-level code<br /> (signal 'check-start-anchor)<br /> next-state))<br /></code></pre> <p>Here, we have defined some of the methods of <code>th-part</code> that specialize for the basic <code>:sequence</code> of expressions, <code>:greedy-repetition</code> (regex <code>*</code> and <code>+</code>), a single character, and the symbols <code>:start-anchor</code>/<code>:end-anchor</code> (regexes <code>^</code> and <code>$</code>). As you can see, some of them dispatch on (are chosen based on) the identity of the <code>kind</code> argument (using <code>eql</code> specializers), while the character-related method specializes on its class. As we develop this facility, we could add more methods with <code>defmethod</code>. Running <code>th-part</code> on the whole parse-tree will produce the complete automaton; we don't need to do anything else!</p> <p>To use the constructed FSM, we run it with the string as input. NFAs are endowed with the ability to guess perfectly when faced with a choice of next state: to run the NFA on a real computer, we must find a way to simulate this guessing. 
One way to do that is to guess one option, and if that doesn't work, try the other. A more efficient way to simulate perfect guessing is to follow all admissible paths simultaneously. In this approach, the simulation allows the machine to be in multiple states at once. To process each letter, it advances all the states along all the arrows that match the letter. In the worst case, the NFA might be in every state at each step, but this results in at worst a constant amount of work per character, independent of the length of the string, so arbitrarily large input strings can be processed in linear time. The efficiency comes from tracking the set of reachable states but not which paths were used to reach them. In an NFA with <code>n</code> nodes, there can only be <code>n</code> reachable states at any step.</p> <pre><code>(defun run-nfa (nfa str)<br /> (let ((i 0)<br /> (start 0)<br /> (matches (list))<br /> (states (list nfa)))<br /> ;; this is the counterpart for the start-anchor signal<br /> (handler-bind ((check-start-anchor<br /> ;; there's no sense in proceeding to match a ^... regex<br /> ;; if the string is not at its start<br /> (lambda (c) (when (> i 0) (return-from run-nfa)))))<br /> (dovec (char (concatenate 'vector str<br /> #(nil))) ; for handling end-anchor <br /> (let ((new-states (list)))<br /> (dolist (state states)<br /> (dolist (tr (? state 'transitions))<br /> (when (th-match tr char)<br /> ;; CASE won't do here: it would compare with the literal<br /> ;; symbol *MATCHED-STATE* instead of the variable's value<br /> (cond ((eql (rt tr) *matched-state*)<br /> (push start matches))<br /> ((null (rt tr))) ; dead end - ignore it<br /> (t (pushnew (rt tr) new-states)))<br /> (return))))<br /> (if new-states<br /> (:= states new-states)<br /> (:= states (list nfa)<br /> start nil)))<br /> (:+ i)<br /> (unless start (:= start i))))<br /> matches))<br /></code></pre> <p>The <code>th-match</code> function may have methods to match a single char and a character range, as well as a particular predicate. 
Its implementation is trivial and left as an exercise to the reader.</p> <p>Overall, interpreting an automaton is a simple and robust approach, yet if we want to squeeze all the possible performance, we can compile it directly to machine code. This is much easier to do with the DFA, as each of its states has at most one transition per input character, so the automaton can be compiled to a multi-level conditional and even a jump-table.</p> <h2 id="grammars">Grammars</h2> <p>Regexes are called "regular" for a reason: there's a corresponding mathematical formalism "regular languages" that originates from the hierarchy of grammars compiled by Noam Chomsky. This hierarchy has 4 levels, each one allowing strictly more complex languages to be expressed with it. And for each level, there's an equivalent computation model:</p> <ul><li>Type-0: recursively enumerable (or universal) grammars — Turing machine</li> <li>Type-1: context-dependent (or context-sensitive) grammars — a linear bounded automaton</li> <li>Type-2: context-free grammars — pushdown automaton</li> <li>Type-3: regular grammars — FSM</li></ul> <p>We have already discussed the bottom layer of the hierarchy. Regular languages are the most limited (and thus the simplest to implement): for example, you can write a regex <code>a{15}b{15}</code>, but you won't be able to express <code>a{n}b{n}</code> for an arbitrary <code>n</code>, i.e. ensure that <code>b</code> is repeated the same number of times as <code>a</code>. The top layer corresponds to arbitrary programs, so all of programming science and lore, in general, applies to it. Now, let's talk about context-free grammars which are another type that is heavily used in practice and even has a dedicated set of algorithms. Such grammars can be used not only for simple matching but also for parsing and generation. 
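</p><p>To make the limitation concrete, here is a hypothetical Python sketch: a counter (the degenerate form of a pushdown stack) recognizes <code>a{n}b{n}</code> for any <code>n</code>, while a regex can only pin down one particular <code>n</code>:</p>

```python
import re

def anbn_p(s):
    """Recognize a^n b^n (n >= 1): count the a's up, then count them down
    against the b's. The counter plays the role of the pushdown stack."""
    i, n = 0, 0
    while i < len(s) and s[i] == "a":
        i += 1; n += 1
    while i < len(s) and s[i] == "b":
        i += 1; n -= 1
    return i == len(s) and i > 0 and n == 0

# A regex, by contrast, can only fix one particular n:
fixed = re.compile(r"\Aa{15}b{15}\Z")
```

<p>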
<strong>Parsing</strong>, as we have seen above, is the process of transforming a text that is assumed to follow the rules of a certain grammar into a structured form that reflects the particular rules that apply to this text. And generation is the reverse process: applying the rules to obtain the text. This topic is huge and there's a lot of literature on it including the famous <a href="https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools">Dragon Book</a>.</p> <p>Parsing is used for processing both artificial (including programming) and natural languages. And, although different sets of rules may be used, as well as different approaches for selecting a particular rule, the resulting structure will be a tree. In fact, formally, each grammar consists of 4 items:</p> <ul><li>The set of terminals (leaves of the parse tree) or tokens of the text: these could be words or characters for the natural language; keywords, identifiers, and literals for the programming language; etc.</li> <li>The set of nonterminals — symbols used to name different items in the rules and in the resulting parse tree — the non-leaf nodes of the tree. These symbols are abstract and not encountered in the actual text. Examples of nonterminals are <code>VB</code> (verb) or <code>NP</code> (noun phrase) in natural language parsing, and <code>if-section</code> or <code>template-argument</code> in parsing of C++ code.</li> <li>The root symbol (which should be one of the nonterminals).</li> <li>The set of production rules that have two sides: a left-hand side (lhs) and a right-hand side (rhs). On the left-hand side, there should be at least one nonterminal, which is substituted with a number of other terminals or nonterminals on the right-hand side. During generation, the rule allows the algorithm to select a particular surface form for an abstract nonterminal (for example, turn a nonterminal <code>VB</code> into a word <code>do</code>). 
During parsing, which is a reverse process, it allows the program, when it's looking at a particular substring, to replace it with a nonterminal and expand the tree structure. When the parsing process reaches the root symbol by performing such substitutions and expansions, it is considered terminated.</li></ul> <p>Each compiler has to use parsing as a step in transforming the source into executable code. Also, parsing may be applied to any data format (for instance, JSON) to transform it into machine data. In natural language processing, parsing is used to build the various tree representations of the sentence, which encode linguistic rules and structure.</p> <p>There are many different types of parsers that differ in the additional constraints they impose on the structure of the production rules of the grammar. The generic context-free constraint is that in each production rule the left-hand side may only be a single nonterminal. The most widespread classes of context-free grammars are LL(k) (in particular, LL(1)) and LR (LR(1), SLR, LALR, GLR, etc). For example, an LL(1) parser (one of the easiest to build) parses the input from left to right, performing a leftmost derivation of the sentence, and is allowed to look ahead at most 1 token. Not all combinations of derivation rules allow the algorithm to build a parser that will be able to perform unambiguous rule selection under such constraints. But, as LL(1) parsing is simple and efficient, some authors of grammars specifically target their language to be LL(1)-parseable. For example, Pascal and other programming languages created by Niklaus Wirth fall into this category.</p> <p>There are also two principal approaches to implementing the parser: a top-down and a bottom-up one. In a top-down approach, the parser tries to build the tree from the root, while, in a bottom-up one, it tries to find the rules that apply to groups of terminal symbols and then combine those until the root symbol is reached. 
Obviously, we can't enumerate all parsing algorithms here, so we'll study only a single approach, which is one of the most widespread, efficient, and flexible ones — <strong>Shift-Reduce Parsing</strong>. It's a bottom-up linear algorithm that can be considered one of the instances of the pushdown automaton approach — a theoretical computational model for context-free grammars.</p> <p>A shift-reduce parser operates on a queue of tokens of the original sentence. It also has access to a stack. At each step, the algorithm can perform:</p> <ul><li>either a <code>shift</code> operation: take the token from the queue and push it onto the stack</li> <li>or a <code>reduce</code> operation: take the top items from the stack, select a matching rule from the grammar, and add the corresponding subtree to the partial parse tree, in the process, removing the items from the stack</li></ul> <p>Thus, for each token, it will perform exactly 2 "movement" operations: push it onto the stack and pop from the stack. Plus, it will perform rule lookup, which requires a constant number of operations (maximum length of the rhs of any rule) if an efficient structure is used for storing the rules. A hash-table indexed by the rhs's or a trie are good choices for that.</p> <p>Here's a small example from the domain of NLP syntactic parsing. Let's consider a toy grammar:</p> <pre><code>S -> NP VP .<br />NP -> DET ADJ NOUN<br />NP -> PRP$ NOUN ; PRP$ is a possessive pronoun<br />VP -> VERB VP<br />VP -> VERB NP<br /></code></pre> <p>and the following vocabulary:</p> <pre><code>DET -> a|an|the<br />NOUN -> elephant|pyjamas<br />ADJ -> large|small<br />VERB -> is|wearing<br />PRP$ -> my<br /></code></pre> <p>Now, let's parse the sentence (already tokenized): <code>A large elephant is wearing my pyjamas .</code> First, we'll need to perform part-of-speech tagging, which, in this example, is a matter of looking up the appropriate nonterminals from the vocabulary grammar. 
This will result in the following:</p> <pre><code>DET ADJ NOUN VERB VERB PRP$ NOUN .<br /> | | | | | | | |<br /> A large elephant is wearing my pyjamas .<br /></code></pre> <p>These POS tags will serve as the terminals for our parsing grammar. Now, the shift-reduce process itself begins:</p> <pre><code style="font-family: Courier New;">1. Initial queue: (DET ADJ NOUN VERB VERB PRP$ NOUN .)<br /> Initial stack: ()<br /> Operation: shift<br /><br />2. Queue: (ADJ NOUN VERB VERB PRP$ NOUN .)<br /> Stack: (DET)<br /> Operation: shift (as there are no rules with the rhs DET)<br /><br />3. Queue: (NOUN VERB VERB PRP$ NOUN .)<br /> Stack: (ADJ DET)<br /> Operation: shift<br /><br />4. Queue: (VERB VERB PRP$ NOUN .)<br /> Stack: (NOUN ADJ DET)<br /> Operation: reduce (rule NP -> DET ADJ NOUN)<br /> ; we match the rules in reverse to the stack<br /><br />5. Queue: (VERB VERB PRP$ NOUN .)<br /> Stack: (NP)<br /> Operation: shift<br /><br />6. Queue: (VERB PRP$ NOUN .)<br /> Stack: (VERB NP)<br /> Operation: shift<br /><br />7. Queue: (PRP$ NOUN .)<br /> Stack: (VERB VERB NP)<br /> Operation: shift<br /><br />8. Queue: (NOUN .)<br /> Stack: (PRP$ VERB VERB NP)<br /> Operation: shift<br /><br />9. Queue: (.)<br /> Stack: (NOUN PRP$ VERB VERB NP)<br /> Operation: reduce (rule: NP -> PRP$ NOUN)<br /><br />10. Queue: (.)<br /> Stack: (NP VERB VERB NP)<br /> Operation: reduce (rule: VP -> VERB NP)<br /><br />11. Queue: (.)<br /> Stack: (VP VERB NP)<br /> Operation: reduce (rule: VP -> VERB VP)<br /><br />12. Queue: (.)<br /> Stack: (VP NP)<br /> Operation: shift<br /><br />13. Queue: ()<br /> Stack: (. VP NP)<br /> Operation: reduce (rule: S -> NP VP .)<br /><br />14. 
Reached root symbol - end.<br /><br /> The resulting parse tree is:<br /><br /> __________S___________<br /> / \ \<br /> / __VP__ \<br /> / / \ \<br /> / / __VP_ \<br /> / / / \ \<br /> ___NP_____ / / _NP_ \<br /> / | \ / / / \ \<br />DET ADJ NOUN VERB VERB PRP$ NOUN .<br /> | | | | | | | |<br /> A large elephant is wearing my pyjamas .<br /></code></pre> <p>The implementation of the basic algorithm is very simple:</p> <pre><code>(defstruct grammar<br /> rules<br /> max-length)<br /><br />(defmacro grammar (&rest rules)<br /> `(make-grammar<br /> :rules (pairs->ht (mapcar (lambda (rule)<br /> (pair (nthcdr 2 rule) (first rule)))<br /> ',rules))<br /> :max-length<br /> (let ((max 0))<br /> (dolist (rule ',rules)<br /> (when (> #1=(length (nthcdr 2 rule)) max)<br /> (:= max #1#)))<br /> max))) ; #1= & #1# are reader-macros for anonymous variables<br /><br />(defun parse (grammar queue)<br /> (let ((stack (list)))<br /> (loop :while queue :do<br /> (print stack) ; diagnostic output<br /> (if-it (find-rule stack grammar)<br /> ;; reduce<br /> (dotimes (i (length (cdr it))<br /> (push it stack))<br /> (pop stack))<br /> ;; shift<br /> (push (pop queue) stack))<br /> :finally (return (find-rule stack grammar)))))<br /><br />(defun find-rule (stack grammar)<br /> (let (prefix)<br /> (loop :for item in stack<br /> :repeat (? grammar 'max-length) :do<br /> (push (car (mklist item)) prefix)<br /> (when-it (? 
grammar 'rules prefix)<br /> ;; otherwise parsing will fail with a stack<br /> ;; containing a number of partial subtrees<br /> (return (cons it (reverse (subseq stack 0 (length prefix)))))))))<br /><br />CL-USER> (parse (print (grammar (S -> NP VP |.|)<br /> (NP -> DET ADJ NOUN)<br /> (NP -> PRP$ NOUN)<br /> (VP -> VERB VP)<br /> (VP -> VERB NP)))<br /> '(DET ADJ NOUN VERB VERB PRP$ NOUN |.|))<br />#S(GRAMMAR<br /> :RULES #{<br /> '(NP VP |.|) S<br /> '(DET ADJ NOUN) NP<br /> '(PRP$ NOUN) NP<br /> '(VERB VP) VP<br /> '(VERB NP) VP<br /> }<br /> :MAX-LENGTH 3)<br />NIL <br />(DET) <br />(ADJ DET) <br />(NOUN ADJ DET) <br />((NP DET ADJ NOUN)) <br />(VERB (NP DET ADJ NOUN)) <br />(VERB VERB (NP DET ADJ NOUN)) <br />(PRP$ VERB VERB (NP DET ADJ NOUN)) <br />(NOUN PRP$ VERB VERB (NP DET ADJ NOUN)) <br />((NP PRP$ NOUN) VERB VERB (NP DET ADJ NOUN)) <br />((VP VERB (NP PRP$ NOUN)) VERB (NP DET ADJ NOUN)) <br />((VP VERB (VP VERB (NP PRP$ NOUN))) (NP DET ADJ NOUN)) <br />(S (NP DET ADJ NOUN) (VP VERB (VP VERB (NP PRP$ NOUN))) |.|)<br /></code></pre> <p>However, the additional level of complexity of the algorithm arises when the grammar becomes ambiguous, i.e. there may be situations when several rules apply. Shift-reduce is a greedy algorithm, so, in its basic form, it will select some rule (for instance, with the shortest rhs or just the first match), and it cannot backtrack. This may result in a parsing failure. If some form of rule weights is added, the greedy selection may produce a suboptimal parse. Anyway, there's no option of backtracking to correct a parsing error. In the NLP domain, the peculiarity of shift-reduce parsing application is that the number of rules is quite significant (it can reach thousands) and, certainly, there's ambiguity. 
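</p><p>For comparison, the same greedy bottom-up loop can be sketched in Python (an illustration, not the chapter's code; the rule-selection policy, preferring the longest rhs that matches the top of the stack, is one concrete choice among those just discussed):</p>

```python
def shift_reduce(rules, tokens):
    """RULES maps an rhs tuple of symbols to its lhs nonterminal.
    Subtrees are tuples whose first element is the nonterminal."""
    max_len = max(len(rhs) for rhs in rules)
    symbol = lambda item: item[0] if isinstance(item, tuple) else item
    stack, queue = [], list(tokens)
    while True:
        # reduce: look for a rule whose rhs matches the top of the stack,
        # preferring longer matches (a greedy, non-backtracking policy)
        for k in range(min(max_len, len(stack)), 0, -1):
            rhs = tuple(symbol(x) for x in stack[-k:])
            if rhs in rules:
                subtree = (rules[rhs],) + tuple(stack[-k:])
                del stack[-k:]
                stack.append(subtree)
                break
        else:
            if not queue:
                break                    # nothing to reduce or shift: done
            stack.append(queue.pop(0))   # shift
    return stack
```

<p>Running it on the toy grammar and token list above produces a single S tree of the same shape as in the trace.</p> <p>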
In this setting, shift-reduce parsing is paired with machine learning techniques, which perform a "soft" selection of the action to take at each step, as <code>reduce</code> is applicable almost always, so a naive greedy technique becomes pointless.</p> <p>Actually, shift-reduce would better be called something like stack-queue parsing, as different parsers may not limit the implementation to just the shift and reduce operations. For example, an NLP parser that allows the construction of non-projective trees (trees where the arcs may cross, i.e. adjacent words may not always belong to a single or adjacent upper-level categories) adds a <code>swap</code> operation. A more advanced NLP parser that produces a graph structure called an AMR (abstract meaning representation) has 9 different operations.</p> <p>Shift-reduce parsing is implemented in many of the parser generator tools, which generate a parser program from a set of production rules. For instance, the popular Unix tool <code>yacc</code> is an LALR parser generator that uses shift-reduce. Another popular tool, ANTLR, is a parser generator for LL(k) languages that uses a non-shift-reduce direct pushdown automaton-based implementation.</p> <p>Besides shift-reduce and similar automata-based parsers, there are many other parsing techniques used in practice. For example, CYK probabilistic parsing was popular in NLP for some time, but it's an <code>O(n^3)</code> algorithm, so it gradually fell from grace and lost to machine-learning enhanced shift-reduce variants. Another approach is packrat parsing (based on PEG — parsing expression grammars) that has a great Lisp parser-generator library <a href="https://scymtym.github.io/esrap/">esrap</a>. Packrat is a more powerful top-down parsing approach with backtracking and unlimited lookahead that nevertheless guarantees linear parse time. 
Any language defined by an LL(k) or LR(k) grammar can be recognized by a packrat parser, in addition to many languages that conventional linear-time algorithms do not support. This additional power simplifies the handling of common syntactic idioms such as the widespread but troublesome longest-match rule, enables the use of sophisticated disambiguation strategies such as syntactic and semantic predicates, provides better grammar composition properties, and allows lexical analysis to be integrated seamlessly into parsing. The last feature makes packrat very appealing to programmers as they don't have to define separate tools for lexical analysis (tokenization and token categorization) and parsing. Moreover, the rules for tokens use the same syntax, which is also quite similar to regular expression syntax. For example, here's a portion of the esrap rules for parsing tables in Markdown documents. The Markdown table may look something like this:</p> <pre><code>| Left-Aligned | Center Aligned | Right Aligned |<br />| :------------ |:---------------:| -----:|<br />| col 3 is | some wordy text | $1600 |<br />| col 2 is | centered | $12 |<br />| zebra stripes | are neat | $1 |<br /></code></pre> <p>You can see that the code is quite self-explanatory: each <code>defrule</code> form consists of a rule name (lhs), its rhs, and a transformation of the rhs into a data structure. For instance, in the rule <code>table-row</code> the rhs is <code>(and (& #\|) (+ table-cell) #\| sp newline)</code>. The row should start with a <code>|</code> char followed by 1 or more <code>table-cell</code>s (a separate rule), and end with a <code>|</code>, some space characters, and a newline. And the transformation <code>(:destructure (_ cells &rest __) ...</code> only cares about the content, i.e. the table cells.</p> <pre><code style="font-family: Courier New;">(defrule sp (* space-char)<br /> (:text t))<br /><br />(defrule table-cell (and #\|<br /> sp<br /> (* (and (! 
(or (and sp #\|) endline)) inline))<br /> sp<br /> (& #\|))<br /> (:destructure (_ __ content &rest ___)<br /> (mapcar 'second content)))<br /><br />(defrule table-row (and (& #\|) (+ table-cell) #\| sp newline)<br /> (:destructure (_ cells &rest __)<br /> (mapcar (lambda (a) (cons :plain a))<br /> cells)))<br /><br />(defrule table-align-cell (and sp (? #\:) (+ #\-) (? #\:) sp #\|)<br /> (:destructure (_ left __ right &rest ___)<br /> (if right (if left 'center 'right) (when left 'left))))<br /><br />(defrule table-align-row (and #\| (+ table-align-cell) sp newline)<br /> (:destructure (_ aligns &rest __)<br /> aligns))<br /><br />(defrule table-head (and table-row table-align-row))<br /></code></pre> <p>To conclude the topic of parsing, I wanted to pose a question: can it be used to match regular expressions? And the answer, of course, is that it can, as we are operating in a more powerful paradigm that includes the regexes as a subdomain. However, the critical showstopper of applying parsing to this problem is the need to define the grammar instead of writing a compact and more or less intuitive regex...</p> <h2 id="stringsearchinactionplagiarismdetection">String Search in Action: Plagiarism Detection</h2> <p>Plagiarism detection is a very challenging problem that doesn't have an exact solution. The reason is that there's no exact definition of what can be considered plagiarism and what can't: the boundary is rather blurry. Obviously, if the text or its part is just copy-pasted, everything is clear. But, usually (and especially when they know that plagiarism detection is at play), people will apply their creativity to alter the text in some slight or even significant ways. However, over the years, researchers have come up with numerous plagiarism detection algorithms, with quality good enough to be used in our educational institutions. The problem is very popular and there are even shared task challenges dedicated to improving plagiarism catchers. 
It's somewhat of an arms race between the plagiarists and the detection systems.</p> <p>One of the earliest but, still, quite effective ways of implementing plagiarism detection is the Shingle algorithm. It is also based on the idea of using hashes and some basic statistical sampling techniques. The algorithm operates in the following stages:</p> <ol><li>Text normalization (this may include case normalization, reduction of the words to basic forms, error correction, cleanup of punctuation, stopwords, etc.)</li> <li>Selection of the shingles and calculation of their hashes.</li> <li>Sampling the shingles from the text in question.</li> <li>Comparison of the hashes of the original shingles to the sampled hashes and evaluation.</li></ol> <p>A single shingle is a contiguous sequence of words from the normalized text (another name for this object, in NLP, is <code>ngram</code>). For shingles of length 2, the original text will give us <code>(1- n)</code> shingles, where <code>n</code> is the number of words (in general, a shingle length of <code>l</code> yields <code>(- n (1- l))</code> shingles). The hashes of the shingles are normal string hashes (like fnv-1).</p> <p>The text being analyzed for plagiarism is also split into shingles, but not all of them are used: just a random sample of <code>m</code> of them. Sampling theory can give a good estimate of the sample size that can be trusted with a high degree of confidence. For efficient comparison, all the original hashes can be stored in a hash-set. If the number of overlapping shingles exceeds some threshold, the text can be considered plagiarised. Alternatively, the algorithm may return the degree of plagiarism: the percentage of overlapping shingles. 
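The stages above can be sketched in a few lines of Python (a toy illustration, not the book's Lisp code: normalization is reduced to lowercasing, and Python's built-in `hash` stands in for fnv-1):

```python
import random

def shingle_hashes(text, l=3):
    "Return the set of hashes of all length-l word shingles of TEXT."
    words = text.lower().split()         # stage 1: (very naive) normalization
    return {hash(tuple(words[i:i + l]))  # stage 2: shingles and their hashes
            for i in range(len(words) - l + 1)}

def plagiarism_degree(original, suspect, l=3, m=None):
    "Share of (sampled) SUSPECT shingles that also occur in ORIGINAL."
    orig = shingle_hashes(original, l)   # a hash-set for O(1) lookups
    sample = list(shingle_hashes(suspect, l))
    if m is not None and m < len(sample):  # stage 3: sampling
        sample = random.sample(sample, m)
    if not sample:
        return 0.0
    # stage 4: compare the sampled hashes against the original ones
    return sum(h in orig for h in sample) / len(sample)
```

With no sampling (`m=None`) the result is exact; with sampling, it is the statistical estimate described above.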
The complexity of the algorithm is <code>O(n + m)</code>.</p> <p>In a sense, the Shingle algorithm may be viewed as an instance of massive string search, where the outcome we're interested in is not so much the positions of the patterns in the text (although those may also be used to indicate the parts of the text that are plagiarism-suspicious) as the fact that they are present in it.</p> <h2 id="takeaways">Take-aways</h2> <p>Strings are peculiar objects: initially, it may seem that they are just arrays. But, beyond this simple understanding, due to the main usage patterns, a much more interesting picture can be seen. Advanced string representations and algorithms are examples of special-purpose optimization applied to general-purpose data structures. This is another reason why strings are presented at the end of the part on derived data structures: string algorithms make heavy use of the material we have covered previously, such as trees and graphs.</p> <p>We have also discussed FSMs — a powerful formalism that can be used to reliably implement complex workflows. FSMs may be used not only for string matching but also for implementing protocol handling (for example, in an HTTP server), complex user interactions, and so on. The Erlang programming language even has a standard library behavior <code>gen_fsm</code> (replaced by the newer <code>gen_statem</code>) that is a framework for easy implementation of FSMs — as many Erlang applications are mass service systems that have state machine-like operation.</p> <p>P.S. Originally, I expected this chapter to be one of the smallest in the book, but it turned out to be the longest one. Strings are not as simple as they might seem... ;)</p> <hr size="1"><p ><a href="#r10-1" name="f10-1">[1]</a> A proper suffix is a suffix that is at least one character shorter than the string itself. 
For example, in the string <code>abc</code> the proper suffixes are <code>bc</code> and <code>c</code>.</p><p><a href="#r10-2" name="f10-2">[2]</a> Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the same applies to Python, or PHP, or Ruby, or many other languages.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com1tag:blogger.com,1999:blog-6031647961506005424.post-50685229529855328652019-10-25T13:48:00.001+03:002019-10-28T10:02:26.209+02:00Programmatically Drawing Simple Graphviz Graphs<p>For <a href="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa">my book</a>, I had to draw a number of graphs. Obviously, I wanted to have a programmatic way to do that, and Graphviz is the go-to library for that. In Lisp, the interface to Graphviz is <a href="http://www.foldr.org/~michaelw/projects/cl-dot/">cl-dot</a>, but, for me, it wasn't easy to figure out from the manual the "simple and easy" way to use it. I.e. I couldn't find a stupid beginner-level interface, so I had to code it myself. 
Here's the implementation that allows anyone with a REPL to send to Graphviz lists of edges and obtain graph images.</p> <script src="https://gist.github.com/vseloved/6275d131d27fb873667a95a681168ca8.js"></script> <p>Generated images:</p> <div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-rd2h-P3xDxI/XbLTq6mLEnI/AAAAAAAACOc/gLmrlXfrLXMHCunu8OPVLGLfQcXCctjzQCLcBGAsYHQ/s1600/max-flow-graph.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-rd2h-P3xDxI/XbLTq6mLEnI/AAAAAAAACOc/gLmrlXfrLXMHCunu8OPVLGLfQcXCctjzQCLcBGAsYHQ/s1600/max-flow-graph.jpg" data-original-width="375" data-original-height="154" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-Q_TABCIyEBQ/XbLTjZsFsNI/AAAAAAAACOY/iaL0XfAjn_szYXnts69yN30kU-8gPZg6gCLcBGAsYHQ/s1600/g.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-Q_TABCIyEBQ/XbLTjZsFsNI/AAAAAAAACOY/iaL0XfAjn_szYXnts69yN30kU-8gPZg6gCLcBGAsYHQ/s1600/g.jpg" data-original-width="347" data-original-height="131" /></a></div>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-65370307569991497012019-10-24T16:09:00.001+03:002019-10-24T16:13:14.140+03:00Programming Algorithms: Graphs<p>Graphs have already been mentioned several times, in the book, in quite diverse contexts. Actually, if you are familiar with graphs you can spot opportunities to use them in quite different areas for problems that aren't explicitly formulated with graphs in mind. 
So, in this chapter, we'll discuss how to handle graphs in order to develop such an intuition, at least to some degree.</p> <p>But first, let's list the most prominent examples of the direct graph applications, some of which we'll see here in action:</p> <ul><li>pathfinding</li><li>network analysis</li><li>dependency analysis in planning, compilers, etc.</li><li>various optimization problems</li><li>distributing and optimizing computations</li><li>knowledge representation and reasoning with it</li><li>meaning representation in natural language processing</li></ul> <p>Graphs may be thought of as a generalization of trees: indeed, trees are, as we said earlier, connected directed acyclic graphs. But there's an important distinction in the patterns of the usage of graphs and trees. Graphs, much more frequently than trees, have weights associated with the edges, which adds a whole new dimension both to algorithms for processing them and to possible data that can be represented in the graph form. So, while the main application of trees is reflecting some hierarchy, for graphs, it is often more about determining connectedness and its magnitude, based on the weights.</p> <h2 id="graphrepresentations">Graph Representations</h2> <p>A graph is, basically, a set of nodes (called "vertices", <code>V</code>) and an enumeration of their pairs ("edges", <code>E</code>). The edges may be directed or undirected (i.e. bidirectional), and also weighted or unweighted. There are many ways that may be used to represent these sets, which have varied utility for different situations. Here are the most common ones:</p> <ul><li>as a linked structure: <code>(defstruct node data links)</code> where <code>links</code> may be either a list of other <code>node</code>s, possibly, paired with weights, or a list of <code>edge</code> structures represented as <code>(defstruct edge source destination weight)</code>. 
For directed graphs, this representation will be similar to a singly-linked list but for undirected — to a heavier doubly-linked one</li><li>as an adjacency matrix (<code>V x V</code>). This matrix is indexed by vertices and has zeroes when there's no connection between them and some nonzero number for the weight (1 — in case of unweighted graphs) when there is a connection. Undirected graphs have a symmetric adjacency matrix and so need to store only the above-diagonal half of it</li><li>as an adjacency list that enumerates for each vertex the other vertices it's connected to and the weights of connections</li><li>as an incidence matrix (<code>V x E</code>). This matrix is similar to the previous representation, but with much more wasted space. The adjacency list may be thought of as a sparse representation of the incidence matrix. The matrix representation may be more useful for hypergraphs (that have more than 2 vertices for each edge), though</li><li>just as a list of edges</li></ul> <h2 id="topologicalsort">Topological Sort</h2> <p>Graphs may be divided into several kinds according to the different properties they have and the specific algorithms that work on them:</p> <ul><li>disjoint (with several unconnected subgraphs), connected, and fully-connected (every vertex is linked to all the others)</li><li>cyclic and acyclic, including directed acyclic (DAG)</li><li>bipartite: when there are 2 groups of vertices and each vertex from one group is connected only to the vertices from the other</li></ul> <p>In practice, <strong>Directed Acyclic Graphs</strong> are quite important. These are directed graphs, in which there's no vertex from which you can start a path and return back to it. They find applications in optimizing scheduling and computation, determining historical and other types of dependencies (for example, in dataflow programming and even spreadsheets), etc. 
In particular, every compiler uses one, and even <code>make</code> does when building its plan of operations. The basic algorithm on DAGs is <strong>Topological sort</strong>. It creates a partial ordering of the graph's vertices which ensures that every child vertex always precedes all of its ancestors.</p> <p>Here is an example. This is a DAG:</p> <img border="0" src="https://1.bp.blogspot.com/-HCu7HDd2hTw/XbFzkugDFDI/AAAAAAAACNk/JBJfk9o9sdMKu7xLFn0h1FR7A9D_dle9ACPcBGAYYCw/s400/graph-topo.png" width="400" height="170" data-original-width="613" data-original-height="260" /> <p>And these are the variants of its topological ordering:</p> <pre><code>6 4 5 3 2 1 8 7<br />6 4 5 2 3 1 8 7<br />8 7 6 4 5 3 2 1<br />8 7 6 4 5 2 3 1<br /></code></pre> <p>There are several variants as the graph is disjoint, and also the order in which the vertices are traversed is not fully deterministic.</p> <p>There are two common approaches to topological sort: Kahn's algorithm and the DFS-based one. Here is the DFS version:</p> <ol><li>Choose an arbitrary vertex and perform the DFS from it until we find a vertex that has no unvisited children.</li><li>While performing the DFS, add each vertex to the set of visited ones. Also check that the vertex hasn't been visited already, or else the graph is not acyclic.</li><li>Then, add the vertex we have found to the resulting sorted array.</li> <li>Return to the previous vertex and repeat searching for the next descendant that doesn't have children and add it.</li><li>Finally, when all of the current vertex's children are visited, add it to the result array.</li><li>Repeat this for the next unvisited vertex until no unvisited ones remain.</li></ol> <p>Why does the algorithm satisfy the desired constraints? First of all, it is obvious that it will visit all the vertices. Next, when we add a vertex, we have already added all of its descendants — satisfying the main requirement. 
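For comparison, Kahn's algorithm, mentioned above but not implemented in this chapter, can be sketched in Python (a sketch under the assumption that the graph is a plain dict mapping each vertex to a list of its children; note that it naturally produces the opposite ordering convention: parents before children):

```python
from collections import deque

def kahn_topo_sort(graph):
    """Kahn's algorithm: repeatedly emit a vertex with zero in-degree.
    GRAPH maps each vertex to the list of its children."""
    in_degree = {v: 0 for v in graph}
    for children in graph.values():
        for child in children:
            in_degree[child] = in_degree.get(child, 0) + 1
    queue = deque(v for v, d in in_degree.items() if d == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for child in graph.get(v, []):
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)
    # if some vertex never reached in-degree 0, there is a cycle
    assert len(order) == len(in_degree), "the graph isn't acyclic"
    return order
```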
Finally, there's a consistency check during the execution of the algorithm that ensures there are no cycles.</p> <p>Before proceeding to the implementation, as with other graph algorithms, it makes sense to ponder what representation will work the best for this problem. The default one — a linked structure — suits it quite well as we'll have to iterate all the outgoing edges of each node. If we had to traverse by incoming edges then it wouldn't have worked, but a matrix one would have.</p> <pre><code>(defstruct node<br /> id edges)<br /><br />(defstruct edge<br /> src dst label)<br /><br />(defstruct (graph (:conc-name nil) (:print-object pprint-graph))<br /> (nodes (make-hash-table))) ; mapping of node ids to nodes<br /></code></pre> <p>As usual, we'll need a more visual way to display the graph than the default print-function. But that is pretty tricky considering that graphs may have an arbitrary structure with possibly intersecting edges. The simplest approach for small graphs would be to just draw the adjacency matrix. We'll utilize it for our examples (relying on the fact that we have control over the set of node ids):</p> <pre><code>(defun pprint-graph (graph stream)<br /> (let ((ids (sort (keys (nodes graph)) '<)))<br /> (format stream "~{ ~A~}~%" ids) ; here, Tab is used for space<br /> (dolist (id1 ids)<br /> (let ((node (? graph 'nodes id1)))<br /> (format stream "~A" id1)<br /> (dolist (id2 ids)<br /> (format stream " ~:[~;x~]" ; here, Tab as well<br /> (find id2 (? node 'edges) :key 'edge-dst)))<br /> (terpri stream)))))<br /></code></pre> <p>Also, let's create a function to simplify graph initialization:</p> <pre><code>(defun init-graph (edges)<br /> (with ((rez (make-graph))<br /> (nodes (nodes rez)))<br /> (loop :for (src dst) :in edges :do<br /> (let ((src-node (getset# src nodes (make-node :id src))))<br /> (getset# dst nodes (make-node :id dst))<br /> (push (make-edge :src src :dst dst)<br /> (? 
src-node 'edges))))<br /> rez))<br /><br />CL-USER> (init-graph '((7 8)<br /> (1 3)<br /> (1 2)<br /> (3 4)<br /> (3 5)<br /> (2 4)<br /> (2 5)<br /> (5 4)<br /> (5 6)<br /> (4 6)))<br /><br /> 1 2 3 4 5 6 7 8<br />1 x x <br />2 x x <br />3 x x <br />4 x <br />5 x x <br />6 <br />7 x<br />8 <br /></code></pre> <p>So, we have already seen in action 3 different ways of graph representation: linked structures, matrices, and edge lists.</p> <p>Now, we can implement and test topological sort:</p> <pre><code>(defun topo-sort (graph)<br /> (let ((nodes (nodes graph))<br /> (visited (make-hash-table))<br /> (rez (vec)))<br /> (dokv (id node nodes)<br /> (unless (? visited id)<br /> (visit node nodes visited rez)))<br /> rez))<br /><br />(defun visit (node nodes visited rez)<br /> (dolist (edge (? node 'edges))<br /> (with ((id (? edge 'dst))<br /> (child (? nodes id)))<br /> (unless (find id rez)<br /> (assert (not (? visited id)) nil<br /> "The graph isn't acyclic for vertex: ~A" id)<br /> (:= (? visited id) t)<br /> (visit child nodes visited rez))))<br /> (vector-push-extend (? node 'id) rez)<br /> rez)<br /><br />CL-USER> (topo-sort (init-graph '((7 8)<br /> (1 3)<br /> (1 2)<br /> (3 4)<br /> (3 5)<br /> (2 4)<br /> (2 5)<br /> (5 4)<br /> (5 6)<br /> (4 6))))<br />#(8 7 6 4 5 2 3 1)<br /></code></pre> <p>This technique of tracking the visited nodes is used in almost every graph algorithm. As noted previously, it can either be implemented using an additional hash-table (like in the example) or by adding a boolean flag to the vertex/edge structure itself.</p> <h2 id="mst">MST</h2> <p>Now, we can move to algorithms that work with weighted graphs. They represent the majority of the interesting graph-based solutions. One of the most basic of them is determining the Minimum Spanning Tree. Its purpose is to select only those graph edges that form a tree with the lowest total sum of weights. 
Spanning trees play an important role in network routing, where a number of protocols directly use them: STP (Spanning Tree Protocol), RSTP (Rapid STP), MSTP (Multiple STP), etc.</p> <p>If we consider the graph from the previous picture, its MST will include the edges 1-2, 1-3, 3-4, 3-5, 5-6, and 7-8. Its total weight will be 24.</p> <p>Although there are quite a few MST algorithms, the most well-known are Prim's and Kruskal's. Both of them rely on some interesting solutions and are worth studying.</p> <h3 id="primsalgorithm">Prim's Algorithm</h3> <p>Prim's algorithm grows the tree one edge at a time, starting from an arbitrary vertex. At each step, the least-weight edge that has one of the vertices already in the MST and the other one outside is added to the tree. This algorithm always has an MST of the already processed subgraph, and when all the vertices are visited, the MST of the whole graph is completed. The most interesting property of Prim's algorithm is that its time complexity depends on the choice of the data structure for ordering the edges by weight. The straightforward approach that searches for the shortest edge will have <code>O(V^2)</code> complexity, but if we use a priority queue it can be reduced to <code>O(E logV)</code> with a binary heap or even <code>O(E + V logV)</code> with a Fibonacci heap. Obviously, <code>V logV</code> is significantly smaller than <code>E logV</code> for the majority of graphs: up to <code>E = V^2</code> for fully-connected graphs.</p> <p>Here's the implementation of Prim's algorithm with an abstract heap:</p> <pre><code>(defun prim-mst (graph)<br /> (let ((initial-weights (list))<br /> (mst (list))<br /> (total 0)<br /> weights<br /> edges<br /> cur)<br /> (dokv (id node (nodes graph))<br /> (if cur<br /> (push (pair id (or (? edges id)<br /> ;; a standard constant that is<br /> ;; a good enough substitute for infinity<br /> most-positive-fixnum))<br /> initial-weights)<br /> (:= cur id<br /> edges (? 
node 'edges))))<br /> (:= weights (heapify initial-weights))<br /> (loop<br /> (with (((id weight) (heap-pop weights)))<br /> (unless id (return))<br /> (when (? edges id)<br /> ;; if not, we have moved to the new connected component<br /> ;; so there's no edge connecting it to the previous one<br /> (push (pair cur id) mst)<br /> (:+ total weight))<br /> (dokv (id w edges)<br /> (when (< w weight)<br /> (heap-decrease-key weights id w)))<br /> (:= cur id<br /> edges (? graph 'nodes id 'edges))))<br /> (values mst<br /> total)))<br /></code></pre> <p>To make it work, we need to perform several modifications:</p> <ul><li>first of all, the list of all node edges should be changed to a hash-table to ensure <code>O(1)</code> access by child id</li> <li>the heap should store not only the keys but also values (a trivial change)</li> <li>we need to implement another fundamental heap operation <code>heap-decrease-key</code>, which we haven't mentioned in the previous chapter</li></ul> <p>For the binary heap, it's, actually, just a matter of performing <code>heap-up</code>. But the tricky part is that it requires an initial search for the key. To ensure constant-time search and, subsequently, <code>O(log n)</code> total complexity, we need to store the pointers to heap elements in a separate hash-table.</p> <p>Let's confirm the stated complexity of this implementation. First, the outer loop operates for each vertex so it has <code>V</code> iterations. Each iteration has an inner loop that involves a <code>heap-pop</code> (<code>O(log V)</code>) and a <code>heap-update</code> (also <code>O(log V)</code>) for a number of vertices, plus a small number of constant-time operations. <code>heap-pop</code> will be invoked exactly once per vertex, so it will need <code>O(V logV)</code> total operations, and <code>heap-update</code> will be called at most once for each edge (<code>O(E logV)</code>). 
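As an aside, the pointer-tracking scheme just described — a binary heap plus a hash-table of element positions, giving an O(log n) decrease-key — is easy to see in a toy Python sketch (illustrative only; the book's heap API is different):

```python
class IndexedMinHeap:
    "Binary min-heap of (priority, key) pairs with a position index."
    def __init__(self):
        self.items = []  # the array-backed heap of (priority, key) pairs
        self.pos = {}    # key -> index in items, for O(1) key lookup

    def _swap(self, i, j):
        self.items[i], self.items[j] = self.items[j], self.items[i]
        self.pos[self.items[i][1]] = i
        self.pos[self.items[j][1]] = j

    def _sift_up(self, i):  # the "heap-up" operation
        while i > 0 and self.items[i][0] < self.items[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.items)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.items[child][0] < self.items[smallest][0]:
                    smallest = child
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def push(self, priority, key):
        self.items.append((priority, key))
        self.pos[key] = len(self.items) - 1
        self._sift_up(len(self.items) - 1)

    def pop(self):
        "Remove and return (key, priority) of the minimum element."
        self._swap(0, len(self.items) - 1)
        priority, key = self.items.pop()
        del self.pos[key]
        if self.items:
            self._sift_down(0)
        return key, priority

    def decrease_key(self, key, priority):
        i = self.pos[key]           # O(1) search thanks to the index
        assert priority <= self.items[i][0]
        self.items[i] = (priority, key)
        self._sift_up(i)            # O(log n) restoration
```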
Considering that <code>E</code> is usually greater than <code>V</code>, this is how we arrive at the final complexity estimate.</p> <p>The Fibonacci heap improves on the binary heap, in this context, as its <code>decrease-key</code> operation is <code>O(1)</code> instead of <code>O(log V)</code>, so we are left with just <code>O(V logV)</code> for <code>heap-pop</code>s and <code>E</code> <code>heap-decrease-key</code>s. Unlike the binary heap, the Fibonacci one is not just a single tree but a set of trees. And this is used in decrease-key: instead of popping an item up the heap and rearranging it in the process, a new tree rooted at this element is cut from the current one. This is not always possible in constant time as there are some invariants that might be violated, which will in turn trigger some updates to the two newly created trees. Yet, the amortized cost of the operation is still <code>O(1)</code>.</p> <p>Here's a brief description of the principle behind the Fibonacci heap adapted from Wikipedia:</p> <p>A Fibonacci heap is a collection of trees. The trees do not have a prescribed shape and, in the extreme case, every element may be its own separate tree. This flexibility allows some operations to be executed in a lazy manner, postponing the work for later operations. For example, merging heaps is done simply by concatenating the two lists of trees, and the operation decrease key sometimes cuts a node from its parent and forms a new tree. However, at some point, order needs to be introduced to the heap to achieve the desired running time. In particular, every node can have at most <code>O(log n)</code> children and the size of a subtree rooted in a node with <code>k</code> children is at least <code>F(k+2)</code>, where <code>F(k)</code> is the <code>k</code>-th Fibonacci number. This is achieved by the rule that we can cut at most one child of each non-root node. 
When a second child is cut, the node itself needs to be cut from its parent and becomes the root of a new tree. The number of trees is decreased in the operation delete minimum, where trees are linked together. Here's an example Fibonacci heap that consists of 3 trees:</p> <pre><code>6 2 1 <- minimum<br /> | / | \<br /> 5 3 4 7<br /> |<br /> 8<br /> |<br /> 9<br /></code></pre> <h3 id="kruskalsalgorithm">Kruskal's Algorithm</h3> <p>Kruskal's algorithm operates from the point of view of not vertices but edges. At each step, it adds to the tree the current smallest edge unless it would produce a cycle. Obviously, the biggest challenge here is to efficiently detect the cycle. Yet, the good news is that, like with Prim's algorithm, we already have access to an efficient solution for this problem — Union-Find. Isn't it great that we have already built a library of techniques that may be reused in creating more advanced algorithms? Actually, this is the goal of developing as an algorithms programmer — to be able to see a way to reduce the problem, at least partially, to some already known and proven solution.</p> <p>Like Prim's algorithm, Kruskal's approach also has <code>O(E logV)</code> complexity: for each vertex, it needs to find the minimum edge not forming a cycle with the already built partial MST. With Union-Find, this search requires <code>O(logE)</code>, but, as <code>E</code> is at most <code>V^2</code>, <code>logE</code> is at most <code>logV^2</code>, which is equal to <code>2 logV</code>. Unlike Prim's algorithm, the partial MST built by Kruskal's algorithm isn't necessarily a tree for the already processed part of the graph.</p> <p>The implementation of the algorithm, using the existing code for Union-Find, is trivial and left as an exercise to the reader.</p> <h2 id="pathfinding">Pathfinding</h2> <p>So far, the algorithms we have seen were concerned with the global structure of a graph. Now, we can move to the most common point-to-point problem. 
Pathfinding in graphs is a huge topic that is crucial in many domains: maps, games, networks, etc. The goal, usually, is to find the shortest path between two nodes in a directed weighted graph. Yet, there may be variations like finding the shortest paths from a selected node to all other nodes, finding the shortest path in a maze (that may be represented as a grid graph with all edges of weight 1), etc.</p> <p>There are, once again, two classic pathfinding algorithms, each one with a certain feature that makes it interesting and notable. Dijkstra's algorithm is a classic example of greedy algorithms, as its alternative name suggests — shortest path first (SPF). A* builds upon it by adding the notion of a heuristic. Dijkstra's approach is the basis of many computer network routing algorithms, such as IS-IS and OSPF, while A* and its modifications are often used in games, as well as in pathfinding on maps.</p> <h3 id="dijkstrasalgorithm">Dijkstra's Algorithm</h3> <p>The idea of Dijkstra's pathfinding is to perform a weighted variant of BFS on the graph, always expanding the vertex that is currently closest to the origin. Dijkstra's approach is very similar to Prim's MST algorithm: it also uses a heap (binary or Fibonacci) to store the shortest paths from the origin to each node with their weights (lengths). At each step, it selects the minimum from the heap, expands it to the neighbor nodes, and updates the weights of the neighbors if they become smaller (the weights start from infinity).</p> <p>For our SPF implementation, we'll need to use the same trick that was shown in the Union-Find implementation — extend the node structure to hold its weight and the path leading to it:</p> <pre><code>(defstruct (spf-node (:include node))<br /> (weight most-positive-fixnum)<br /> (path (list)))<br /></code></pre> <p>Here is the main algorithm:</p> <pre><code>(defun spf (graph src dst)<br /> (with ((nodes (? 
graph 'nodes))<br /> (spf (list))<br /> ;; the following code should initialize the heap<br /> ;; with a single node of weight 0<br /> ;; and all other nodes of weight MOST-POSITIVE-FIXNUM<br /> ;; (instead of running a O(n*log n) HEAPIFY)<br /> (weights (init-weights-heap nodes src)))<br /> (loop<br /> (with (((id weight) (heap-pop weights)))<br /> (cond ((eql id dst) (let ((dst (? nodes dst)))<br /> ;; we return 2 values: the path and its length<br /> (return (values (cons dst (? dst 'path))<br /> (? dst 'weight)))))<br /> ((= most-positive-fixnum weight) (return))) ; no path exists<br /> (dolist (edge (? nodes id 'edges))<br /> (with ((cur (? edge 'dst))<br /> (node (? nodes cur))<br /> (w (+ weight (? edge 'weight))))<br /> (when (< w (? node 'weight))<br /> (heap-decrease-key weights cur w) <br /> (:= (? node 'weight) w<br /> (? node 'path) (cons (? nodes id)<br /> (? nodes id 'path))))))))))<br /></code></pre> <h3 id="aalgorithm">A* Algorithm</h3> <p>There are many ways to improve the vanilla SPF. One of them is to move in parallel from both sides: the source and the destination.</p> <p>The A* algorithm (also called Best-First Search) improves upon Dijkstra's method by changing how the weight of the path is estimated. Initially, it was just the distance we've already traveled in the search, which is known exactly. But we don't know for sure the length of the remaining part. However, in Euclidean and similar spaces, where the triangle inequality holds (i.e. the direct distance between 2 points is not greater than the distance between them through any other point), it's not an unreasonable assumption that the direct path will be shorter than the circuitous ones. This premise does not always hold as there may be obstacles, but quite often it does. So, we add a second term to the weight, which is the direct distance between the current node and the destination. 
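To make this two-term weight concrete, here is a compact Python sketch of the whole scheme (illustrative names, not the book's code; it also uses the simpler lazy re-push variant instead of decrease-key, and assumes vertices carry 2-d coordinates):

```python
import heapq
import math

def euclidean(a, b):
    "Straight-line distance heuristic between 2-d points A and B."
    return math.hypot(a[0] - b[0], a[1] - b[1])

def a_star(neighbors, src, dst, heuristic=euclidean):
    """NEIGHBORS maps a node to a list of (neighbor, edge-weight) pairs.
    The heap is ordered by traveled-so-far + estimated-remaining."""
    weights = {src: 0}
    heap = [(heuristic(src, dst), src, [src])]
    while heap:
        _, node, path = heapq.heappop(heap)
        if node == dst:
            return path, weights[node]
        for nb, w in neighbors.get(node, []):
            new = weights[node] + w
            if new < weights.get(nb, math.inf):
                weights[nb] = new
                # the second term of the priority is the heuristic estimate
                heapq.heappush(heap, (new + heuristic(nb, dst), nb, path + [nb]))
    return None, math.inf
```

With a consistent heuristic like the Euclidean distance, the first time the destination is popped from the heap its weight is already optimal, so duplicate stale heap entries do no harm.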
This simple idea underpins the A* search and allows it to perform much faster in many real-world scenarios, although its theoretical complexity is the same as for simple SPF. The exact guesstimate of the remaining distance is called the algorithm's heuristic and should be specified for each domain separately: for maps, it is the linear distance, but there are clever ways to invent similar estimates where distances can't be calculated directly.</p> <p>Overall, this algorithm is one of the simplest examples of the <strong>heuristic</strong> approach. The idea of heuristics, basically, lies in finding patterns that may significantly improve the performance of the algorithm for the common cases, although their efficiency can't be proven for the general case. Isn't it the same approach as, for example, hash-tables or splay trees that also don't guarantee optimal performance for each operation? The difference is that, although those techniques may be locally suboptimal, they provide global probabilistic guarantees. For heuristic algorithms, usually, even such estimations are not available, although they may be performed for some of them. For instance, the performance of the A* algorithm will suffer if there is an "obstacle" on the direct path to the destination, and it's not possible to predict, for the general case, what the configuration of the graph will be and where the obstacles will be. Yet, even in the worst case, A* will still have at least the same speed as the basic SPF.</p> <p>The changes to the SPF algorithm needed for A* are the following:</p> <ul><li><code>init-weights-heap</code> will use the value of the heuristic instead of <code>most-positive-fixnum</code> as the initial weight. This will also require us to change the loop termination criteria from <code>(= most-positive-fixnum weight)</code> by adding some notion of visited nodes</li> <li>there will be an additional term added to the node weight formula: <code>(+ weight (? 
edge 'weight) (heuristic node))</code></li></ul> <p>A good comparison of the benefits A* brings over simple SPF may be shown with this picture of pathfinding on a rectangular grid without diagonal connections, where each node is labeled with its 2d-coordinates. To find the path from node <code>(0 0)</code> to <code>(2 2)</code> (length 4) using Dijkstra's algorithm, we'll need to visit all of the points in the grid: </p> <pre><code> 0 1 2<br />0 + .<br />1 .<br />2<br /><br /> 0 1 2<br />0 + . .<br />1 . .<br />2 .<br /><br /> 0 1 2<br />0 + . .<br />1 . . .<br />2 . .<br /><br /> 0 1 2<br />0 + > v<br />1 . . v<br />2 . . +<br /></code></pre> <p>With A*, however, we'll move straight to the point:</p> <pre><code> 0 1 2<br />0 + .<br />1 .<br />2<br /><br /> 0 1 2<br />0 + .<br />1 . .<br />2<br /><br /> 0 1 2<br />0 + .<br />1 . . .<br />2 .<br /><br /> 0 1 2<br />0 + v<br />1 . > v<br />2 . +<br /></code></pre> <p>The final path, in these pictures, is selected by the rule to always open the left neighbor first.</p> <h2 id="maximumflow">Maximum Flow</h2> <p>Weighted directed graphs are often used to represent different kinds of networks. And one of the main tasks on such networks is efficient capacity planning. The main algorithm for that is Maximum Flow calculation. It works on so-called transport networks, which contain three kinds of vertices: a source, a sink, and intermediate nodes. The source has only outgoing edges, the sink has only incoming ones, and all the other nodes obey the balance condition: the total weights (flow) of all incoming and outgoing edges are equal. The task of determining maximum flow is to estimate the largest amount that can flow through the whole net from the source to the sink. Besides knowing the actual capacity of the network, this also allows finding the bottlenecks and edges that are not fully utilized. 
From this point of view, the problem is called Minimum Cut estimation.</p> <img border="0" src="https://4.bp.blogspot.com/-KYHrlhezQrY/XbF8UlBT8SI/AAAAAAAACOA/izf2WxDpp3ssHKenmtmTOwJLx5PqMzBjQCLcBGAsYHQ/s1600/max-flow-graph.jpg" data-original-width="375" data-original-height="154" /> <p>There are many approaches to solving this problem. The most direct and intuitive of them is the Ford-Fulkerson method. It is a greedy algorithm, once again, that computes the maximum flow by trying all the paths from source to sink until there is some residual capacity available. These paths are called "augmented paths" as they augment the network flow. And, to track the residual capacity, a copy of the initial weight graph called the "residual graph" is maintained. With each new path added to the total flow, its flow is subtracted from the weights of all of its edges in the residual graph. Besides — and this is the key point in the algorithm that allows it to be optimal despite its greediness — the same amount is added to the backward edges in the residual graph. The backward edges don't exist in the original graph, and they are added to the residual graph in order to let the subsequent iterations reduce the flow along some edge, but not below zero. Why may this be necessary? Each graph node has a maximum input and output capacity. It is possible to saturate the output capacity by different input edges, and the optimal edge to use depends on the whole graph, so, in a single greedy step, it's not possible to determine over which edges more incoming flow should be directed. The backward edges virtually increase the output capacity by the value of the seized input capacity, thus allowing the algorithm to redistribute the flow later on if necessary.</p> <p>We'll implement the FFA using the matrix graph representation. First of all, to show it in action, and also because it's easy to deal with backward edges in a matrix, as they are already present, just with zero initial capacity. 
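As a sanity check of the method just described, here is an illustrative Python sketch (my own, not the book's Lisp) of Ford-Fulkerson over an adjacency matrix, with DFS used to find the augmented paths and backward edges tracked directly in the residual matrix:

```python
def max_flow(capacity):
    # Node 0 is the source, the last node is the sink.
    n = len(capacity)
    residual = [row[:] for row in capacity]  # copy: the residual graph

    def aug_path(node, visited):
        # DFS for a path of edges with positive residual capacity
        if node == n - 1:
            return []
        visited.add(node)
        for nxt in range(n):
            if residual[node][nxt] > 0 and nxt not in visited:
                rest = aug_path(nxt, visited)
                if rest is not None:
                    return [(node, nxt)] + rest
        return None

    total = 0
    while True:
        path = aug_path(0, set())
        if path is None:
            return total
        # the flow along the path is the capacity of its narrowest edge
        flow = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= flow  # use up forward capacity
            residual[b][a] += flow  # grant backward capacity for undoing
        total += flow

# a made-up 6-node transport network: node 0 = source, node 5 = sink
net = [[0, 4, 4, 0, 0, 0],
       [0, 0, 0, 4, 2, 0],
       [0, 0, 0, 1, 2, 0],
       [0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 5],
       [0, 0, 0, 0, 0, 0]]
```

Note how the backward updates let a later augmented path route flow "against" an earlier, locally suboptimal choice.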
However, as this matrix will be sparse in the majority of cases, to achieve optimal efficiency, just like with most other graph algorithms, we'll need a better way to store the edges, for instance, an edge list. With it, we could implement the addition of backward edges directly but lazily during the processing of each augmented path.</p> <pre><code>(defstruct mf-edge<br /> beg end capacity)<br /><br />(defun max-flow (g)<br /> (let ((rg (copy-array g)) ; residual graph<br /> (rez 0))<br /> (loop :for path := (aug-path rg) :while path :do<br /> (let ((flow most-positive-fixnum))<br /> ;; the flow along the path is the residual capacity<br /> ;; of the least wide edge<br /> (dolist (edge path)<br /> (let ((cap (? edge 'capacity)))<br /> (when (< (abs cap) flow)<br /> (:= flow (abs cap)))))<br /> (dolist (edge path)<br /> (with (((beg end) ? edge))<br /> (:- (aref rg beg end) flow)<br /> (:+ (aref rg end beg) flow)))<br /> (:+ rez flow)))<br /> rez))<br /><br />(defun aug-path (g)<br /> (with ((sink (1- (array-dimension g 0)))<br /> (visited (make-array (1+ sink) :initial-element nil)))<br /> (labels ((dfs (g i)<br /> (if (zerop (aref g i sink))<br /> (dotimes (j sink)<br /> (unless (or (zerop (aref g i j))<br /> (? visited j))<br /> (when-it (dfs g j)<br /> (:= (? 
visited j) t)<br /> (return (cons (make-mf-edge :beg i :end j<br /> :capacity (aref g i j))<br /> it)))))<br /> (list (make-mf-edge :beg i :end sink<br /> :capacity (aref g i sink))))))<br /> (dfs g 0))))<br /><br />CL-USER> (max-flow #2A((0 4 4 0 0 0)<br /> (0 0 0 4 2 0)<br /> (0 0 0 1 2 0)<br /> (0 0 0 0 0 3)<br /> (0 0 0 0 0 5)<br /> (0 0 0 0 0 0)))<br />7<br /></code></pre> <p>So, as you can see from the code, to find an augmented path, we need to perform DFS on the graph from the source, sequentially examining the edges with some residual capacity to find a path to the sink.</p> <p>A peculiarity of this algorithm is that there is no certainty that we'll eventually reach the state when there will be no augmented paths left. The FFA works correctly for integer and rational weights, but when they are irrational it is not guaranteed to terminate. When the capacities are integers, the runtime of Ford-Fulkerson is bounded by <code>O(E f)</code> where <code>f</code> is the maximum flow in the graph. This is because each augmented path can be found in <code>O(E)</code> time and it increases the flow by an integer amount of at least 1. A variation of the Ford-Fulkerson algorithm with guaranteed termination and a runtime independent of the maximum flow value is the Edmonds-Karp algorithm, which runs in <code>O(V E^2)</code>.</p> <h2 id="graphsinactionpagerank">Graphs in Action: PageRank</h2> <p>Another important set of problems from the field of network analysis is determining "centers of influence", densely and sparsely populated parts, and "cliques". PageRank is the well-known algorithm for ranking the nodes in terms of influence (i.e. the number and weight of incoming connections they have), which was the secret sauce behind Google's initial success as a search engine. It will be the last of the graph algorithms we'll discuss in this chapter, so many more will remain untouched. 
We'll be returning to some of them in the following chapters, and you'll be seeing them in many problems once you develop an eye for spotting the graphs hidden in many domains.</p> <p>The PageRank algorithm outputs a probability distribution of the likelihood that a person randomly clicking on links will arrive at any particular page. This distribution ranks the relative importance of all pages. The probability is expressed as a numeric value between 0 and 1, but Google used to multiply it by 10 and round to the greater integer, so a PR of 10 corresponded to a probability of 0.9 and more, and PR=1 — to the interval from 0 to 0.1. In the context of PageRank, all web pages are the nodes in the so-called webgraph, and the links between them are the edges, originally, weighted equally.</p> <p>PageRank is an iterative algorithm that may be considered an instance of the <strong>Expectation Maximization</strong> (EM) approach, which is very popular in unsupervised optimization and machine learning. The general idea of EM is to randomly initialize the quantities that we want to estimate, and then iteratively recalculate each quantity, using the information from the neighbors, to "move" it closer to the value that ensures optimality of the target function. Epochs (an epoch is an iteration that spans the whole data set, using each node at most once) of such recalculation should continue either until the whole epoch doesn't produce a significant change of the loss function we're optimizing, i.e. we have reached the stationary point, or a satisfactory number of iterations was performed. Sometimes a stationary point either can't be reached or will take too long to reach, but, according to Pareto's principle, 20% of effort might have moved us 80% to the goal.</p> <p>In each epoch, we recalculate the PageRank of all nodes by transferring weights from a node equally to all of its neighbors. The neighbors with more inbound connections will thus receive more weight. 
However, the PageRank concept adds a condition that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability that the transfer will continue is called a damping factor <code>d</code>. Various studies have tested different damping factors, but it is generally assumed that the damping factor for the webgraph should be set around 0.85. The damping factor is subtracted from 1 (and, in some variations of the algorithm, the result is divided by the number of documents (N) in the collection) and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. So the PageRank of a page is mostly derived from the PageRanks of other pages, with the damping factor adjusting the derived value downward.</p> <h3 id="implementation">Implementation</h3> <p>Actually, PageRank can be computed both iteratively and algebraically. In algebraic form, each PageRank iteration may be expressed simply as:</p> <pre><code>(:= pr (+ (* d (mat* g pr))<br /> (/ (- 1 d) n)))<br /></code></pre> <p>where <code>g</code> is the graph incidence matrix and <code>pr</code> is the vector of PageRank for each node.</p> <p>However, the definitive property of PageRank is that it is estimated for huge graphs, so it isn't feasible to represent them directly as matrices, let alone perform matrix operations on them. The iterative algorithm allows more control, as well as distribution of the computation, so it is usually preferred, in practice, not only for PageRank but also for most other optimization techniques. So PageRank should be viewed primarily as a distributed algorithm. 
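Before turning to the Lisp version, the update rule just described (damped transfer of rank over outgoing links) can be condensed into a short illustrative Python sketch; the tiny 3-page webgraph here is made up:

```python
def pagerank(links, d=0.85, repeat=100):
    # links maps each node to the list of nodes it links to
    n = len(links)
    pr = {node: 1 / n for node in links}
    for _ in range(repeat):
        # every page starts each epoch with the damping term (1-d)/N
        pr2 = {node: (1 - d) / n for node in links}
        for node, children in links.items():
            for child in children:
                # transfer this node's rank equally to its children,
                # scaled down by the damping factor
                pr2[child] += d * pr[node] / len(children)
        pr = pr2
    return pr

# a made-up webgraph: pages "b" and "c" link to "a", "a" links to "b"
pr = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

As expected, the ranks form a probability distribution (they sum to 1), and the page with more inbound links ends up on top. Note that this sketch assumes every page has at least one outgoing link; dangling nodes need special treatment.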
The need to implement it on a large cluster triggered Google's development of the influential MapReduce distributed computation framework.</p> <p>Here is a simplified implementation of the iterative method:</p> <pre><code>(defun pagerank (g &key (d 0.85) (repeat 100))<br /> (with ((n (tally (nodes g)))<br /> (pr (make-array n :initial-element (/ 1 n))))<br /> (loop :repeat repeat :do<br /> (let ((pr2 (map 'vector (lambda (x)<br /> (declare (ignore x))<br /> (/ (- 1 d) n))<br /> pr)))<br /> (dokv (i node (nodes g))<br /> (let ((p (? pr i))<br /> (m (tally (? node 'children))))<br /> (dokv (j child (? node 'children))<br /> (:+ (? pr2 j) (* d (/ p m))))))<br /> (:= pr pr2)))<br /> pr))<br /></code></pre> <p>We use the same graph representation as previously and perform the update "backwards": not by gathering all incoming edges, which would require us to add another layer of data that is both unnecessary and hard to maintain, but by transferring the PR value over outgoing edges one by one. Such an approach also makes the computation trivial to distribute as we can split the whole graph into arbitrary sets of nodes and the computation for each set can be performed in parallel: we'll just need to maintain a local copy of the <code>pr2</code> vector and merge it at the end of each iteration by simple summation. This method naturally fits the map-reduce framework: the <code>map</code> step is the inner node loop, while the <code>reduce</code> step is the merging of the <code>pr2</code> vectors.</p> <h2 id="takeaways">Take-aways</h2> <ol><li>The more we progress into the advanced topics of this book, the more apparent becomes the tendency to reuse the approaches, tools, and technologies we have developed previously. Graph algorithms are good demonstrations of new features and qualities that can be obtained by a smart combination and reuse of existing data structures.</li><li>Many graph algorithms are greedy. This means that they use the locally optimal solution trying to arrive at a global one. 
This is conditioned by the structure — or rather lack of structure — of graphs that don't have a specific hierarchy to guide the optimal solution. The greediness, however, shouldn't mean suboptimality. In many greedy algorithms, like FFA, there is a way to roll back a wrong solution. Others provide a way to trade off execution speed and optimality. A good example of the latter approach is Beam search that we'll discuss in the next chapters.</li><li>In A*, we had a first glimpse of heuristic algorithms — an area that may be quite appealing to many programmers who are used to solving problems by primarily optimizing for their main scenarios. This approach may lack some mathematical rigor, but it also has its place and we'll see other heuristic algorithms in the following chapters that are, like A*, the best practical solution in their domains: for instance, the Monte Carlo Tree Search (MCTS).</li><li>Another thing that becomes more apparent as this book progresses is how small a percentage of each domain we can cover in detail in a single chapter. This is true for graphs: we have just scratched the surface and outlined the main approaches to handling them. We'll see more graph-related stuff in the following chapters, as well. 
Graph algorithms may be quite useful in a great variety of areas that don't necessarily have a direct formulation as graph problems (like maps or networks do), and so developing an intuition to recognize the hidden graph structure may help the programmer reuse the existing elegant techniques instead of having to deal with their own cumbersome ad-hoc solutions.</li></ol> <hr size="1"><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-87565131004886340612019-09-28T11:51:00.000+03:002019-10-05T23:35:14.367+03:00Programming Algorithms: Trees<p><a href="https://2.bp.blogspot.com/-4WyyUbnXKJg/XY8dBr98KmI/AAAAAAAACMk/Iz-QOloTiEAbW95KHRguwHl5dr6xU-aegCLcBGAsYHQ/s1600/trees.jpg" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-4WyyUbnXKJg/XY8dBr98KmI/AAAAAAAACMk/Iz-QOloTiEAbW95KHRguwHl5dr6xU-aegCLcBGAsYHQ/s320/trees.jpg" width="400" data-original-width="538" data-original-height="386" /></a><br>Couldn't resist adding this <a href="https://xkcd.com/835/">xkcd</a></p> <p>Balancing a binary tree is the infamous interview problem that has all that folklore and debate associated with it. To tell you the truth, like the other 99% of programmers, I never had to perform this task for any work-related project. And not even due to the existence of ready-made libraries, but because self-balancing binary trees are, actually, pretty rarely used. But trees, in general, are ubiquitous even if you may not recognize their presence. The source code we operate with, at some stage of its life, is represented as a tree (a popular term here is Abstract Syntax Tree or AST, but the abstract variant is not the only one the compilers process). The directory structure of the file system is a tree. The object-oriented class hierarchy is likewise. And so on. 
So, returning to interview questions, trees indeed are a good area as they allow covering a number of basic points: linked data structures, recursion, complexity. But there's a much better task, which I have encountered a lot in practice and also used quite successfully in the interview process: breadth-first tree traversal. We'll talk about it a bit later.</p> <p>Similar to how hash-tables can be thought of as more sophisticated arrays (they are sometimes even called "associative arrays"), trees may be considered an expansion of linked lists. Although technically, a few specific trees are implemented not as a linked data structure but are based on arrays, the majority of trees are linked. Like hash-tables, some trees also allow for efficient access to the element by key, representing an alternative key-value implementation option.</p> <p>Basically, a tree is a recursive data structure that consists of nodes. Each node may have zero or more children. If the node doesn't have a parent, it is called the <strong>root</strong> of the tree. And the constraint on trees is that the root is always single. Graphs may be considered a generalization of trees that don't impose this constraint, and we'll discuss them in a separate chapter. In graph terms, a tree is an acyclic directed single-component graph. Directed means that there's a one-way parent-child relation. And acyclic means that a child can't have a connection to the parent, either directly or through some other nodes (in the opposite case, what will be the parent and what — the child?) The recursive nature of trees manifests in the fact that if we extract an arbitrary node of the tree with all of its descendants, the resulting part will remain a tree. We can call it a <strong>subtree</strong>. 
Besides parent-child or, more generally, ancestor-descendant "vertical" relationships that apply to all the nodes in the tree, we can also talk about horizontal siblings — the set of nodes that have the same parent/ancestor.</p> <p>Another important tree concept is the distinction between terminal (leaf) and nonterminal (branch) nodes. Leaf nodes don't have any children. In some trees, the data is stored only in the leaves with branch nodes serving to structure the tree in a certain manner. In other trees, the data is stored in all nodes without any distinction.</p> <h2 id="implementationvariants">Implementation Variants</h2> <p>As we said, the default tree implementation is a linked structure. A linked list may be considered a degenerate tree with all nodes having a single child. A tree node may have more than one child, and so, in a linked representation, each tree root or subroot is the origin of a number of linked lists (sometimes, they are called "paths"). </p> <pre><code>Tree: a<br /> / \<br /> b c<br /> / \ \<br /> d e f<br /><br />Lists:<br />a -> b -> d<br />a -> b -> e<br />a -> c -> f<br />b -> d<br />b -> e<br />c -> f<br /></code></pre> <p>So, a simple linked tree implementation will look a lot like a linked list one:</p> <pre><code>(defstruct (tree-node (:conc-name nil))<br /> key<br /> children) ; instead of linked list's next<br /><br />CL-USER> (with ((f (make-tree-node :key "f"))<br /> (e (make-tree-node :key "e"))<br /> (d (make-tree-node :key "d"))<br /> (c (make-tree-node :key "c" :children (list f)))<br /> (b (make-tree-node :key "b" :children (list d e))))<br /> (make-tree-node :key "a"<br /> :children (list b c)))<br />#S(TREE-NODE<br /> :KEY "a"<br /> :CHILDREN (#S(TREE-NODE<br /> :KEY "b"<br /> :CHILDREN (#S(TREE-NODE :KEY "d" :CHILDREN NIL)<br /> #S(TREE-NODE :KEY "e" :CHILDREN NIL)))<br /> #S(TREE-NODE<br /> :KEY "c"<br /> :CHILDREN (#S(TREE-NODE :KEY "f" :CHILDREN NIL)))))<br /></code></pre> <p>Similar to lists that had to be 
constructed from tail to head, we had to populate the tree in reverse order: from leaves to root. With lists, we could, as an alternative, use push and reverse the result in the end. But, for trees, there's no such operation as reverse.</p> <p>Obviously, not only lists can be used as a data structure to hold the children. When the number of children is fixed (for example, in a binary tree), they may be defined as separate slots: e.g. <code>left</code> and <code>right</code>. Another option would be to use a key-value, which allows assigning labels to tree edges (as the keys of the kv), but the downside is that the ordering isn't defined (unless we use an ordered kv like a linked hash-table). We may also want to assign weights or other properties to the edges, and, in this case, either an additional collection (say <code>child-weights</code>) or a separate <code>edge</code> struct should be defined to store all those properties. In the latter case, the <code>node</code> structure will contain <code>edges</code> instead of <code>children</code>. In fact, the tree can also be represented as a list of such <code>edge</code> structures, although this approach is quite inefficient for most of the use cases.</p> <p>Another tree representation utilizes the available linked list implementation directly instead of reimplementing it. Let's consider the following simple Lisp form:</p> <pre><code>(defun foo (bar)<br /> "Foo function."<br /> (baz bar))<br /></code></pre> <p>It is a tree with the root containing the symbol <code>defun</code> and 4 children:</p> <ul><li>the terminal symbol <code>foo</code></li> <li>the tree containing the function arguments (<code>(bar)</code>)</li> <li>the terminal string (the docstring "Foo function.")</li> <li>and the tree containing the form to evaluate (<code>(baz bar)</code>)</li></ul> <p>By default, in the list-based tree, the first element is the head and the rest are the leaves. 
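The same head-first convention is easy to mimic outside of Lisp; here is an illustrative Python sketch (my own, not from the book) that traverses such a nested-list tree:

```python
def dfs_list(fn, tree):
    # handle both subtrees (lists) and leaves (atoms):
    # an atom is wrapped into a singleton list (an mklist analog)
    tree = tree if isinstance(tree, list) else [tree]
    fn(tree[0])              # visit the head first
    for child in tree[1:]:
        dfs_list(fn, child)  # then recurse into the rest

visited = []
dfs_list(visited.append,
         ["defun", "foo", ["bar"], "Foo function.", ["baz", "bar"]])
```

Running it over the `defun` form above visits the nodes in the same head-first order as the Lisp `dfs-list` shown later in this chapter.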
This representation is very compact and convenient for humans, so it is used not only for source code. For example, you can see a similar representation for constituency trees in linguistics:</p> <pre><code>(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))) (. .)))<br />;; if we'd like to use the above form as Lisp code,<br />;; we'd have to shield the symbol "." with ||: (|.| |.|) instead of (. .)<br /></code></pre> <p>It is equivalent to the following parse tree:</p> <pre><code> TOP<br /> / | \<br /> | VP |<br /> | | \ |<br /> NP | NP |<br /> | | / \ |<br /> DT VBZ DT NN .<br />This is a test .<br /></code></pre> <p>Another, more specific, alternative is when we are interested only in the terminal nodes. In that case, there will be no explicit root and each list item will be a subtree. The following trees are equivalent:</p> <pre><code>(((a b (c d)) e) (f (g h)))<br /><br /> <root><br /> / \<br /> / \ / \<br /> / | \ e f /\ <br /> a b /\ g h<br /> c d<br /></code></pre> <p>A tree that has all terminals at the same depth and all nonterminal nodes present — a complete tree — with a specified number of children may be stored in a vector. This is a very efficient implementation that we'll glance at when we talk about heaps.</p> <p>Finally, a tree may also be represented, although quite inefficiently, with a matrix (only one half is necessary).</p> <h2 id="treetraversal">Tree Traversal</h2> <p>It should be noted that, unlike with other structures, basic operations, such as tree construction, modification, element search and retrieval, work differently for different tree variants. Thus we'll discuss them further when describing those variants.</p> <p>Yet, one tree-specific operation is common to all tree representations: traversal. Traversing a tree means iterating over its subtrees or nodes in a certain order. The most direct traversal is called depth-first search or <strong>DFS</strong>. 
It is the recursive traversal from parent to child and then to the next child after we return from the recursion. The simplest DFS for our <code>tree-node</code>-based tree may be coded in the following manner:</p> <pre><code>(defun dfs-node (fn root)<br /> (call fn (key root))<br /> (dolist (child (children root))<br /> (dfs-node fn child)))<br /><br />CL-USER> (dfs-node 'print *tree*) ; where *tree* is taken from the previous example<br />"a" <br />"b" <br />"d" <br />"e" <br />"c" <br />"f"<br /></code></pre> <p>In the spirit of Lisp, we could also define a convenience macro:</p> <pre><code>(defmacro dotree-dfs ((value root) &body body)<br /> (let ((node (gensym))) ; this code is needed to prevent possible symbol collisions for NODE<br /> `(dfs-node (lambda (,node)<br /> (let ((,value (key ,node)))<br /> ,@body))<br /> ,root)))<br /></code></pre> <p>And if we'd like to traverse a tree represented as a list, the changes are minor:</p> <pre><code>(defun dfs-list (fn tree)<br /> ;; we need to handle both subtrees (lists) and leaves (atoms)<br /> ;; so, we'll just convert everything to a list<br /> (let ((tree (mklist tree)))<br /> (call fn (first tree))<br /> (dolist (child (rest tree))<br /> (dfs-list fn child))))<br /><br />CL-USER> (dfs-list 'print '(defun foo (bar)<br /> "Foo function."<br /> (baz bar)))<br />DEFUN <br />FOO <br />BAR <br />"Foo function." <br />BAZ <br />BAR<br /></code></pre> <p>Recursion is very natural in tree traversal: we could even say that trees are recursion realized in a data structure. And the good news here is that there's very rarely a chance to hit recursion limits: the majority of trees are not infinite, and the height of the tree, which conditions the depth of recursion, grows proportionally to the logarithm of the tree size<a href="#f7-1" name="r7-1">[1]</a>, and that's pretty slow. </p> <p>These simple DFS implementations apply the function before descending down the tree. 
This style is called <strong>preorder</strong> traversal. There are alternative styles: <strong>inorder</strong> and <strong>postorder</strong>. With postorder, the call is executed after the recursion returns, i.e. on the recursive ascent:</p> <pre><code>(defun post-dfs (fn node)<br /> (dolist (child (children node))<br /> (post-dfs fn child))<br /> (call fn (key node)))<br /><br />CL-USER> (post-dfs 'print *tree*)<br />"d" <br />"e" <br />"b" <br />"f" <br />"c" <br />"a" <br /></code></pre> <p>Inorder traversal is applicable only to binary trees: first traverse the left side, then call <code>fn</code> and then descend into the right side.</p> <p>An alternative traversal approach is Breadth-first search (<strong>BFS</strong>). It isn't as natural as DFS: it traverses the tree layer by layer, i.e. it has to first accumulate all the nodes that have the same depth and then iterate over them. In the general case, it isn't justified, but there's a number of algorithms where exactly such ordering is required.</p> <p>Here is an implementation of BFS (preorder) for our <code>tree-node</code>s:</p> <pre><code>(defun bfs (fn nodes)<br /> (let ((next-level (list)))<br /> (dolist (node (mklist nodes))<br /> (call fn (key node))<br /> (dolist (child (children node))<br /> (push child next-level)))<br /> (when next-level<br /> (bfs fn (reverse next-level)))))<br /><br />CL-USER> (bfs 'print *tree*)<br />"a" <br />"b" <br />"c" <br />"d" <br />"e" <br />"f" <br /></code></pre> <p>An advantage of BFS traversal is that it can handle potentially unbounded trees, i.e. it is suitable for processing trees in a streamed manner, layer-by-layer.</p> <p>In object-orientation, tree traversal is usually accomplished by means of the so-called <strong>Visitor pattern</strong>. Basically, it's the same approach of passing a function to the traversal procedure but in the disguise of additional (and excessive) OO-related machinery. 
Here is a Visitor pattern example in Java:</p> <pre><code>interface ITreeNode {<br /> void accept(ITreeVisitor visitor);<br /> List<ITreeNode> children();<br />}<br /><br />interface ITreeVisitor {<br /> void visit(ITreeNode node);<br />}<br /><br />interface IFn {<br /> void call(ITreeNode node);<br />}<br /><br />class TreeNode implements ITreeNode {<br /> private List<ITreeNode> children;<br /><br /> public void accept(ITreeVisitor visitor) {<br /> visitor.visit(this);<br /> }<br /><br /> public List<ITreeNode> children() {<br /> return children;<br /> }<br />}<br /><br />class PreOrderTreeVisitor implements ITreeVisitor {<br /> private IFn fn;<br /><br /> public PreOrderTreeVisitor(IFn fn) {<br /> this.fn = fn;<br /> }<br /><br /> public void visit(ITreeNode node) {<br /> fn.call(node);<br /> for (ITreeNode child : node.children())<br /> child.accept(this);<br /> }<br />}<br /></code></pre> <p>The zest of this example is the implementation of the method <code>visit</code> that calls the function with the current node and iterates over its children by recursively applying the same visitor. You can see that it's exactly the same as our <code>dfs-node</code>.</p> <p>One of the interesting tree-traversal tasks is tree printing. There are many ways in which trees can be displayed. The simplest one is directory-style (like the one used by the Unix <code>tree</code> utility):</p> <pre><code>$ tree /etc/acpi<br />/etc/acpi<br />├── asus-wireless.sh<br />├── events<br />│ ├── asus-keyboard-backlight-down<br />│ ├── asus-keyboard-backlight-up<br />│ ├── asus-wireless-off<br />│ └── asus-wireless-on<br />└── undock.sh<br /></code></pre> <p>It may be implemented with DFS and only requires tracking of the current level in the tree:</p> <pre><code>(defun pprint-tree-dfs (node &optional (level 0) (skip-levels (make-hash-table)))<br /> (when (= 0 level)<br /> (format t "~A~%" (key node)))<br /> (let ((last-index (1- (length (children node))))) <br /> (doindex (i child (children node))<br /> (let ((last-child-p (= i last-index)))<br /> (dotimes (j level)<br /> (format t "~C " (if (? 
skip-levels j) #\Space #\│)))<br /> (format t "~C── ~A~%"<br /> (if last-child-p #\└ #\├)<br /> (key child))<br /> (:= (? skip-levels level) last-child-p)<br /> (pprint-tree-dfs child<br /> (1+ level)<br /> skip-levels)))))<br /><br />CL-USER> (pprint-tree-dfs *tree*)<br />a<br />├── b<br />│ ├── d<br />│ └── e<br />└── c<br /> └── f<br /></code></pre> <p><code>1+</code> and <code>1-</code> are standard Lisp shortcuts for adding/subtracting 1 from a number. The <code>skip-levels</code> argument is used for the last elements to not print the excess <code>│</code>.</p> <p>A more complicated variant is top-to-bottom printing:</p> <pre><code>;; example from CL-NLP library<br />CL-USER> (nlp:pprint-tree<br /> '(TOP (S (NP (NN "This"))<br /> (VP (VBZ "is")<br /> (NP (DT "a")<br /> (JJ "simple")<br /> (NN "test")))<br /> (|.| ".")))<br /> TOP <br /> : <br /> S <br /> .-----------:---------. <br /> : VP : <br /> : .---------. : <br /> NP : NP : <br /> : : .----:-----. : <br /> NN VBZ DT JJ NN . <br /> : : : : : : <br />This is a simple test . <br /></code></pre> <p>This style, most probably, will need a BFS and a careful calculation of spans of each node to properly align everything. Implementing such a function is left as an exercise to the reader, and a very enlightening one, I should say.</p> <h2 id="binarysearchtrees">Binary Search Trees</h2> <p>Now, we can return to the topic of basic operations on tree elements. The advantage of trees is that, when built properly, they guarantee <code>O(log n)</code> for all the main operations: search, insertion, modification, and deletion.</p> <p>This quality is achieved by keeping the leaves sorted and the trees in a balanced state. "Balanced" means that any pair of paths from the root to the leaves have lengths that may differ by at most some predefined quantity: ideally, just 1 (AVL trees), or, as in the case of Red-Black trees, the longest path can be at most twice as long as the shortest. 
In any case, situations when all the elements align along a single path, effectively turning the tree into a list, should be completely ruled out. We have already seen, with Binary search and Quicksort (remember the justification for the 3-medians rule), why this constraint guarantees logarithmic complexity.</p> <p>The classic examples of balanced trees are Binary Search Trees (BSTs), of which AVL and Red-Black trees are the most popular variants. All the properties of BSTs may be trivially extended to n-ary trees, so we'll discuss the topic using binary tree examples.</p> <p>Just to reiterate the general intuition for the logarithmic complexity of tree operations, let's examine a complete binary tree: a tree that has all levels completely filled with elements, except maybe for the last one. In it, we have <code>n</code> elements, and each level contains twice as many nodes as the previous. This property means that <code>n</code> is not greater than <code>(+ 1 2 4 ... (/ k 2) k)</code>, where <code>k</code> is the capacity of the last level. This formula is nothing but the sum of a geometric progression with the number of items equal to <code>h</code> (the height of the tree), which is, by the textbook:</p> <pre><code>(/ (* 1 (- 1 (expt 2 h)))<br /> (- 1 2))<br /></code></pre> <p>In turn, this expression may be reduced to: <code>(- (expt 2 h) 1)</code>. So <code>(+ n 1)</code> equals <code>(expt 2 h)</code>, i.e. the height of the tree (<code>h</code>) equals <code>(log (+ n 1) 2)</code>.</p> <p>BSTs have the ordering property: if some element is to the right of another in the tree, it should consistently be greater (or smaller — depending on the ordering direction). This constraint means that after the tree is built, just extracting its elements by performing an inorder DFS produces a sorted array. The Treesort algorithm utilizes this approach directly to achieve the same <code>O(n * log n)</code> complexity as other efficient sorting algorithms. 
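Treesort itself fits in a few lines. Here is an illustrative Python sketch (my own, with a plain unbalanced BST, so, unlike the balanced variants discussed in this chapter, its worst case degrades to quadratic):

```python
def insert(node, x):
    # a node is a mutable triple: [key, left, right]
    if node is None:
        return [x, None, None]
    idx = 1 if x < node[0] else 2  # smaller keys go left, the rest right
    node[idx] = insert(node[idx], x)
    return node

def inorder(node, fn):
    if node is not None:
        inorder(node[1], fn)  # the left subtree first
        fn(node[0])           # then the node itself
        inorder(node[2], fn)  # then the right subtree

def treesort(items):
    # build the BST, then read it back with an inorder DFS
    root = None
    for x in items:
        root = insert(root, x)
    out = []
    inorder(root, out.append)
    return out

sorted_items = treesort([3, 1, 2, 5, 4])
```

The inorder traversal is what turns the ordering property of the tree into a sorted sequence.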
This <code>n * log n</code> is the complexity of each insertion (<code>O(log n)</code>) multiplied by the number of times it should be performed (<code>n</code>). So, Treesort operates by taking an array and adding its elements to the BST, then traversing the tree and putting the encountered elements into the resulting array, in proper order.</p> <p>Besides, the ordering property also means that, after adding a new element to the tree, in the general case, it should be rebalanced as the newly added element may not be placed in an arbitrary spot, but has just two admissible locations, and choosing any of those may violate the balance constraint. The specific balance invariants and approaches to tree rebalancing are the distinctive properties of each variant of BSTs that we will see below.</p> <h2 id="splaytree">Splay Trees</h2> <p>A Splay tree represents a kind of BST that is one of the simplest to understand and to implement. It is also quite useful in practice. It has the least strict constraints and a nice property that recently accessed elements occur near the root. Thus, a Splay tree can naturally act as an LRU-cache. However, there are degraded scenarios that result in <code>O(n)</code> access performance, although the average complexity of Splay tree operations is <code>O(log n)</code> due to amortization (we'll talk about it in a bit).</p> <p>The approach to balancing a Splay tree is to move the element we have accessed/inserted into the root position. The movement is performed by a series of operations that are called tree rotations. A certain pattern of rotations forms a step of the algorithm. For all BSTs, there are just two possible tree rotations, and they serve as the basic building blocks of all balancing algorithms. A rotation may be either a left or a right one. Their purpose is to put the left or the right child into the position of its parent, preserving the order of all the other child elements.
The rotations can be illustrated by the following diagrams in which <code>x</code> is the parent node, <code>y</code> is the target child node that will become the new parent, and <code>A</code>,<code>B</code>,<code>C</code> are subtrees. It is said that the rotation is performed around the edge <code>x -> y</code>.</p> <p>Left rotation:</p> <pre><code> x y<br /> / \ / \<br /> y C -> A x<br /> / \ / \<br />A B B C<br /></code></pre> <p>Right rotation:</p> <pre><code> x y<br /> / \ / \<br />A y -> x C<br /> / \ / \<br /> B C A B<br /></code></pre> <p>As you see, the left and right rotations are complementary operations, i.e. performing one after the other will return the tree to the original state. During the rotation, the inner subtree (<code>B</code>) has its parent changed from <code>y</code> to <code>x</code>.</p> <p>Here's an implementation of rotations:</p> <pre><code>(defstruct (bst-node (:conc-name nil)<br /> (:print-object (lambda (node out)<br /> (format out "[~a-~@[~a~]-~@[~a~]]" <br /> (key node)<br /> (lt node)<br /> (rt node)))))<br /> key<br /> val ; we won't use this slot in the examples,<br /> ; but without it, in real-world use cases,<br /> ; the tree doesn't make any sense :)<br /> lt ; left child<br /> rt) ; right child<br /><br />(defun tree-rotate (node parent grandparent)<br /> (cond<br /> ((eql node (lt parent)) (:= (lt parent) (rt node)<br /> (rt node) parent))<br /> ((eql node (rt parent)) (:= (rt parent) (lt node)<br /> (lt node) parent))<br /> (t (error "NODE (~A) is not the child of PARENT (~A)"<br /> node parent)))<br /> (cond <br /> ((null grandparent) (return-from tree-rotate node))<br /> ((eql parent (lt grandparent)) (:= (lt grandparent) node))<br /> ((eql parent (rt grandparent)) (:= (rt grandparent) node))<br /> (t (error "PARENT (~A) is not the child of GRANDPARENT (~A)"<br /> parent grandparent))))<br /></code></pre> <p>You have probably noticed that we need to pass to this function not only the nodes on the edge around which 
the rotation is executed but also the grandparent node of the target to link the changes to the tree. If <code>grandparent</code> is not supplied, it is assumed that <code>parent</code> is the root, and we need to separately reassign the variable holding the reference to the tree to the result of <code>tree-rotate</code> (the rotated-up node), after the rotation.</p> <p>Splay trees combine rotations into three possible actions:</p> <ul><li>The Zig step is used to make the node the new root when it's already the direct child of the root. It is accomplished by a single left/right rotation (depending on whether the target is to the left or to the right of the <code>root</code>) followed by an assignment.</li> <li>The Zig-zig step is a combination of two zig steps that is performed when both the target node and its parent are left/right nodes. The first rotation is around the edge between the target node and its parent, and the second — around the target and its former grandparent that has become its new parent, after the first rotation.</li> <li>The Zig-zag step is performed when the target and its parent are not in the same direction: either one is left while the other is right or vice versa. In this case, correspondingly, first a left rotation around the parent is needed, and then a right one around its former grandparent (that has now become the new parent of the target).
Or vice versa.</li></ul> <p>However, with our implementation of tree rotations, we don't have to distinguish the 3 different steps and the implementation of the operation <code>splay</code> becomes really trivial:</p> <pre><code>(defun splay (node &rest chain)<br /> (loop :for (parent grandparent) :on chain :do<br /> (tree-rotate node parent grandparent))<br /> node)<br /></code></pre> <p>The key point here and in the implementation of Splay tree operations is the use of reverse chains of nodes from the child to the root which will allow us to perform chains of splay operations in an end-to-end manner and also custom modifications of the tree structure.</p> <p>From the code, it is clear that splaying requires at maximum the same number of steps as the height of the tree because each rotation brings the target element 1 level up. Now, let's discuss why all Splay tree operations are <code>O(log n)</code>. Element access requires binary search for the element in the tree, which is <code>O(log n)</code> provided the tree is balanced, and then splaying it to root — also <code>O(log n)</code>. Deletion requires search, then swapping the element either with the rightmost child of its left subtree or the leftmost child of its right subtree (direct predecessor/successor) — to make it childless, removing it, and, finally, splaying the parent of the removed node. And update is, at worst, deletion followed by insertion. </p> <p>Here is the implementation of the Splay tree built of <code>bst-node</code>s and restricted to only arithmetic comparison operations. 
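By the way, the same rotate-and-splay mechanics is easy to play with outside of a Lisp image. Here is a rough Python transcription of <code>tree-rotate</code> and <code>splay</code> (names and layout are mine; a sketch for experimentation, not the book's code):

```python
# Rotate a node up through its parent, then splay it to the root by
# repeating the rotation along the whole ancestor chain.
class Node:
    def __init__(self, key):
        self.key, self.lt, self.rt = key, None, None

def tree_rotate(node, parent, grandparent):
    """Left or right rotation around the parent-node edge."""
    if node is parent.lt:
        parent.lt, node.rt = node.rt, parent   # right rotation
    elif node is parent.rt:
        parent.rt, node.lt = node.lt, parent   # left rotation
    else:
        raise ValueError("NODE is not a child of PARENT")
    # relink the rotated-up node to its new parent (if any)
    if grandparent is not None:
        if parent is grandparent.lt:
            grandparent.lt = node
        elif parent is grandparent.rt:
            grandparent.rt = node
    return node

def splay(node, chain):
    """CHAIN lists the ancestors of NODE from its parent up to the root."""
    ancestors = chain + [None]
    for parent, grandparent in zip(ancestors, ancestors[1:]):
        tree_rotate(node, parent, grandparent)
    return node  # the new root of the tree
```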
All of the high-level functions, such as <code>st-search</code>, <code>st-insert</code> or <code>st-delete</code> return the new tree root obtained after the operation, which should substitute the previous one in the caller code.</p> <pre><code>(defun node-chain (item root &optional chain)<br /> "Return as the values the node equal to ITEM or the closest one to it<br /> and the chain of nodes leading to it, in the splay tree based in ROOT."<br /> (if root<br /> (with (((key lt rt) ? root)<br /> (chain (cons root chain)))<br /> (cond ((= item key) (values root<br /> chain))<br /> ((< item key) (node-chain item lt chain))<br /> ((> item key) (node-chain item rt chain))))<br /> (values nil<br /> chain)))<br /><br />(defun st-search (item root)<br /> (with ((node chain (node-chain item root)))<br /> (when node<br /> (apply 'splay chain))))<br /><br />(defun st-insert (item root)<br /> (assert root nil "Can't insert item into a null tree")<br /> (with ((node chain (node-chain item root)))<br /> (unless node<br /> (let ((parent (first chain)))<br /> ;; here, we use the property of the := expression<br /> ;; that it returns the item being set<br /> (push (:= (? parent (if (> (key parent) item) 'lt 'rt))<br /> (make-bst-node :key item))<br /> chain)))<br /> (apply 'splay chain)))<br /><br />(defun idir (dir)<br /> (case dir<br /> (lt 'rt)<br /> (rt 'lt)))<br /><br />(defun closest-child (node)<br /> (dolist (dir '(lt rt))<br /> (let ((parent nil)<br /> (current nil))<br /> (do ((child (call dir node) (call (idir dir) child)))<br /> ((null child) (when current<br /> (return-from closest-child<br /> (values dir<br /> current<br /> parent))))<br /> (:= parent current<br /> current child)))))<br /><br />(defun st-delete (item root)<br /> (with ((node chain (node-chain item root))<br /> (parent (second chain))) <br /> (if (null node)<br /> root ; ITEM was not found<br /> (with ((dir child child-parent (closest-child node))<br /> (idir (idir dir)))<br /> (when parent<br /> (:= (? 
parent (if (eql (lt parent) node) 'lt 'rt))<br /> child))<br /> (when child<br /> (:= (? child idir) (? node idir))<br /> (when child-parent<br /> (:= (? child-parent idir) (? child dir))))<br /> (if parent<br /> (apply 'splay (rest chain))<br /> child)))))<br /><br />(defun st-update (old new root)<br /> (st-insert new (st-delete old root)))<br /></code></pre> <p>The deletion is somewhat tricky due to the need to account for different cases: when removing the root, the direct child of the root, or any other node.</p> <p>Let's test the Splay tree operation in the REPL (coding <code>pprint-bst</code> as a slight modification of <code>pprint-tree-dfs</code> is left as an exercise to the reader):</p> <pre><code>CL-USER> (defparameter *st* (make-bst-node :key 5))<br />CL-USER> *st*<br />[5--]<br />CL-USER> (pprint-bst (:= *st* (st-insert 1 *st*)))<br />1<br />├── .<br />└── 5<br />CL-USER> (pprint-bst (:= *st* (st-insert 10 *st*)))<br />10<br />├── 1<br />│ ├── .<br />│ └── 5<br />└── .<br />CL-USER> (pprint-bst (:= *st* (st-insert 3 *st*)))<br />3<br />├── 1<br />└── 10<br /> ├── .<br /> └── 5<br />CL-USER> (pprint-bst (:= *st* (st-insert 7 *st*)))<br />7<br />├── 3<br />│ ├── 1<br />│ └── 5<br />└── 10<br />CL-USER> (pprint-bst (:= *st* (st-insert 8 *st*)))<br />8<br />├── 7<br />│ ├── 3<br />│ │ ├── 1<br />│ │ └── 5<br />│ └── .<br />└── 10<br />CL-USER> (pprint-bst (:= *st* (st-insert 2 *st*)))<br />2<br />├── 1<br />└── 8<br /> ├── 7<br /> │ ├── 3<br /> │ │ ├── .<br /> │ │ └── 5<br /> │ └── .<br /> └── 10<br />CL-USER> (pprint-bst (:= *st* (st-insert 4 *st*)))<br />4<br />├── 2<br />│ ├── 1<br />│ └── 3<br />└── 8<br /> ├── 7<br /> │ ├── 5<br /> │ └── .<br /> └── 10<br />CL-USER> *st*<br />[4-[2-[1--]-[3--]]-[8-[7-[5--]-]-[10--]]]<br /></code></pre> <p>As you can see, the tree gets constantly rearranged at every insertion.</p> <p>Accessing an element, when it's found in the tree, also triggers tree restructuring:</p> <pre><code>CL-USER> (pprint-bst (st-search 5 
*st*))<br />5<br />├── 4<br />│ ├── 2<br />│ │ ├── 1<br />│ │ └── 3<br />│ └── .<br />└── 8<br /> ├── 7<br /> └── 10<br /></code></pre> <p>The insertion and deletion operations, for the Splay tree, also may have an alternative implementation: first, split the tree in two at the place of the element to be added/removed and then combine them. For insertion, the combination is performed by making the new element the root and linking the previously split subtrees to its left and right. As for deletion, splitting the Splay tree requires splaying the target element and then breaking the two subtrees apart (removing the target that has become the root). The combination is also <code>O(log n)</code> and it is performed by splaying the rightmost node of the left subtree (the largest element) so that it doesn't have a right child. Then the right subtree can be linked to this vacant slot.</p> <p>Although regular access to the Splay tree requires splaying of the element we have touched, tree traversal should be implemented without splaying. Or rather, just the normal DFS/BFS procedures should be used. First of all, this approach will keep the complexity of the operation at <code>O(n)</code> without the unnecessary <code>log n</code> multiplier added by the splaying operations. Besides, accessing all the elements inorder will trigger the edge-case scenario and turn the Splay tree into a list — exactly the situation we want to avoid.</p> <h3 id="complexityanalysis">Complexity Analysis</h3> <p>All of those considerations apply under the assumption that all the tree operations are <code>O(log n)</code>. But we haven't proven it yet. It turns out that, for Splay trees, this isn't a trivial task and requires <strong>amortized analysis</strong>. Basically, this approach bounds the average cost of an operation over a whole sequence of operations, rather than the cost of each operation individually.
Amortized analysis allows us to confidently use many advanced data structures for which it isn't possible to prove the required time bounds for individual operations, but the general performance over the lifetime of the data structure is in those bounds.</p> <p>The principal tool of the amortized analysis is the <strong>potential method</strong>. Its idea is to combine, for each operation, not only its direct cost but also the change to the <em>potential</em> cost of other operations that it brings. For Splay trees, we can observe that only zig-zig and zig-zag steps are important, for the analysis, as zig step happens only once for each splay operation and changes the height of the tree by at most 1. Also, both zig-zig and zig-zag have the same potential.</p> <p>Rigorously calculating the exact potential requires a number of mathematical proofs that we don't have space to show here, so let's just list the main results.</p> <ol><li><p>The potential of the whole Splay tree is the sum of the ranks of all nodes, where rank is the logarithm of the number of elements in the subtree rooted at node:</p> <pre><code>(defun rank (node)<br /> (let ((size 0))<br /> (dotree-dfs (_ node)<br /> (:+ size))<br /> (log size 2)))<br /></code></pre></li> <li><p>The change of potential produced by a single zig-zig/zig-zag step can be calculated in the following manner:</p> <pre><code>(+ (- (rank grandparent-new) (rank grandparent-old))<br /> (- (rank parent-new) (rank parent-old))<br /> (- (rank node-new) (rank node-old)))<br /></code></pre> <p>Since <code>(= (rank node-new) (rank grandparent-old))</code> it can be reduced to:</p> <pre><code>(- (+ (rank grandparent-new) (rank parent-new))<br /> (+ (rank parent-old) (rank node-old)))<br /></code></pre> <p>Which is not larger than:</p> <pre><code>(- (+ (rank grandparent-new) (rank node-new))<br /> (* 2 (rank node-old)))<br /></code></pre> <p>Which, in turn, due to the concavity of the log function, may be reduced to:</p> <pre><code>(- (* 
3 (- (rank node-new) (rank node-old))) 2)<br /></code></pre> <p>The amortized cost of any step is 2 operations larger than the change in potential as we need to perform 2 tree rotations, so it's not larger than:</p> <pre><code>(* 3 (- (rank node-new) (rank node-old)))<br /></code></pre></li> <li><p>When summed over the entire splay operation, this expression "telescopes" to <code>(* 3 (- (rank root) (rank node)))</code> which is <code>O(log n)</code>. Telescoping means that when we calculate the sum of the cost of all zig-zag/zig-zig steps, the inner terms cancel each other and only the boundary ones remain. The difference in ranks is, in the worst case, <code>log n</code> as the rank of the root is <code>(log n 2)</code> and the rank of the arbitrary node is between that value and <code>(log 1 2)</code> (0).</p></li> <li><p>Finally, the total cost for <code>m</code> splay operations is <code>O(m log n + n log n)</code>, where <code>m log n</code> term represents the total amortized cost of a sequence of <code>m</code> operations and <code>n log n</code> is the change in potential that it brings.</p></li></ol> <p>As mentioned, the above exposition is just a cursory look at the application of the potential method that skips some important details. If you want to learn more you can start with this <a href="https://cstheory.stackexchange.com/questions/22277/splay-tree-potential-function-why-sum-the-logs-of-the-sizes">discussion on CS Theory StackExchange</a>.</p> <p>To conclude, similar to hash-tables, the performance of Splay tree operations for a concrete element depends on the order of the insertion/removal of all the elements of the tree, i.e. it has an unpredictable (random) nature. This property is a disadvantage compared to some other BST variants that provide precise performance guarantees. 
Another disadvantage, in some situations, is that the tree is constantly restructured, which makes it mostly unfit for usage as a persistent data structure and also may not play well with many storage options. Yet, Splay trees are simple and, in many situations, due to their LRU-property, may be preferable over other BSTs.</p> <h2 id="redblackandavltrees">Red-Black and AVL Trees</h2> <p>Another BST that has similar complexity characteristics to Splay trees and, in general, a somewhat similar approach to rebalancing is the Scapegoat tree. Neither of these BSTs requires storing any additional information about the current state of the tree, which results in the random aspect of their operation. And although it is smoothed over all the tree accesses, it may not be acceptable in some usage scenarios.</p> <p>An alternative approach, if we want to exclude the random factor, is to track the tree state. Tracking may be achieved by adding just 1 bit to each tree node (as with Red-Black trees) or 2 bits, the so-called balance factors (AVL trees)<a href="#f7-2" name="r7-2">[2]</a>. However, for most of the high-level languages, including Lisp, we'll need to go to great lengths or even perform low-level non-portable hacking to, actually, ensure that exactly 1 or 2 bits are spent for this data, as the standard structure implementation will allocate a whole word even for a bit-sized slot. Moreover, likewise in C, due to cache alignment, the structure will have its size aligned to memory word boundaries. So, by and large, we usually don't really care whether the data we need to track is a single bit flag or a full integer counter.</p> <p>The balance guarantee of an RB tree is that, for each node, the heights of the left and right subtrees may differ by at most a factor of 2. Such a boundary condition occurs when the longer path contains alternating red and black nodes, and the shorter — only black nodes.
Balancing is ensured by the requirement to satisfy the following invariants:</p> <ol><li>Each tree node is assigned a label: red or black (basically, a 1-bit flag: 0 or 1).</li> <li>The root should be black (0).</li> <li>All the leaves are also black (0). And the leaves don't hold any data. A good implementation strategy to satisfy this property is to have a constant singleton terminal node that all preterminals will link to. (<code>(defparameter *rb-leaf* (make-rb-node))</code>).</li> <li>If a parent node is red (1) then both its children should be black (0). Due to mock leaves, each node has exactly 2 children.</li> <li>Every path from a given node to any of its descendant leaf nodes should contain the same number of black nodes.</li></ol> <p>So, to keep the tree in a balanced state, the insert/update/delete operations should perform rebalancing when the constraints are violated. Robert Sedgewick has proposed the simplest version of the red-black tree called the Left-Leaning Red-Black Tree (LLRB). The LLRB maintains an additional invariant that all red links must lean left except during inserts and deletes, which makes for the simplest implementation of the operations. 
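Invariants 4 and 5 are easy to check mechanically. Here is a Python sketch of such a checker (the dict-based node layout and all names are mine, just for illustration; <code>None</code> plays the role of the black mock leaf):

```python
# Check the Red-Black invariants: no red node has a red parent (4), and
# every root-to-leaf path contains the same number of black nodes (5).
def check_rb(node, parent_red=False):
    """Return the black height of the subtree rooted at NODE,
    raising AssertionError if invariant 4 or 5 is violated."""
    if node is None:
        return 1  # mock leaves are black
    # invariant 4: a red node may not have a red parent
    assert not (parent_red and node["red"]), "red node with a red parent"
    lh = check_rb(node["lt"], node["red"])
    rh = check_rb(node["rt"], node["red"])
    # invariant 5: equal number of black nodes on every path
    assert lh == rh, "unequal black heights"
    return lh + (0 if node["red"] else 1)
```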
Below, we can see the outline of the insert operation:</p> <pre><code>(defstruct (rb-node (:include bst-node) (:conc-name nil))<br /> (red nil :type boolean))<br /><br />;; NB: this outline assumes that the RED accessor<br />;; returns NIL when called on an empty (null) child<br />(defun rb-insert (item root &optional parent)<br /> (when (null root)<br /> ;; new nodes are always inserted as red leaves<br /> (return-from rb-insert (make-rb-node :key item :red t)))<br /> ;; color flip: a node with two red children passes redness up<br /> (when (and (red (lt root))<br /> (red (rt root)))<br /> (:= (red root) (not (red root))<br /> (red (lt root)) nil<br /> (red (rt root)) nil))<br /> (cond ((< item (key root))<br /> (:= (lt root) (rb-insert item (lt root) root)))<br /> ((> item (key root))<br /> (:= (rt root) (rb-insert item (rt root) root))))<br /> ;; left rotation: a red right child with a black left sibling<br /> (when (and (red (rt root))<br /> (not (red (lt root))))<br /> (:= (red (rt root)) (red root)<br /> root (tree-rotate (rt root) root parent)<br /> (red (lt root)) t))<br /> ;; right rotation: two red nodes in a row on the left<br /> (when (and (red (lt root))<br /> (red (lt (lt root))))<br /> (:= (red (lt root)) (red root)<br /> root (tree-rotate (lt root) root parent)<br /> (red (rt root)) t))<br /> root)<br /></code></pre> <p>This code is more of an outline. You can easily find the complete implementation of the RB-tree on the internet. The key here is to understand the principle of their operation. Also, we won't discuss AVL trees in detail. Suffice it to say that they are based on the same principles but use a different set of balancing operations.</p> <p>Both Red-Black and AVL trees may be used when worst-case performance guarantees are required, for example, in real-time systems. Besides, they serve as a basis for implementing persistent data structures that we'll talk about later. The Java <code>TreeMap</code> and similar data structures from the standard libraries of many languages are implemented with one of these BSTs.
And the implementations of them both are present in the Linux kernel and are used as data structures for various queues.</p> <p>OK, now you know how to balance a binary tree :D</p> <h2 id="btrees">B-Trees</h2> <p>B-tree is a generalization of a BST that allows for more than two children. The number of children is not unbounded and should be in a predefined range. For instance, the simplest B-tree — 2-3 tree — allows for 2 or 3 children. Such trees combine the main advantage of self-balanced trees — logarithmic access time — with the benefit of arrays — locality — the property which allows for faster cache access or retrieval from the storage. That's why B-trees are mainly used in data storage systems. Overall, B-tree implementations perform the same trick as we saw in <code>prod-sort</code>: switching to sequential search when the sequence becomes small enough to fit into the cache line of the CPU.</p> <p>Each internal node of a B-tree contains a number of keys. For a 2-3 tree, the number is either 1 or 2. The keys act as separation values which divide the subtrees. For example, if the keys are <code>x</code> and <code>y</code>, all the values in the leftmost subtree will be less than <code>x</code>, all values in the middle subtree will be between <code>x</code> and <code>y</code>, and all values in the rightmost subtree will be greater than <code>y</code>. Here is an example:</p> <pre><code> [ 7 . 18 ]<br /> / | \<br />[ 1 . 3 ] [ 10 . 15 ] [ 20 . _ ]<br /></code></pre> <p>This tree has 4 nodes. Each node has 2 key slots and may have 0 (in the case of the leaf nodes), 2 or 3 children. 
The node structure for it might look like this:</p> <pre><code>(defstruct 23-node<br /> key1<br /> key2<br /> val1<br /> val2<br /> lt<br /> md<br /> rt)<br /></code></pre> <p>Yet, a more general B-tree node would, probably, contain arrays for keys/values and children links:</p> <pre><code>(defstruct bt-node<br /> (keys (make-array *max-keys*))<br /> (vals (make-array *max-keys*))<br /> (children (make-array (1+ *max-keys*))))<br /></code></pre> <p>The element search in a B-tree is very similar to that of a BST. The only difference is that up to <code>*max-keys*</code> comparisons, instead of one, may be needed in each node. Insertion is trickier as it may require rearranging the tree items to satisfy its invariants. A B-tree is kept balanced after insertion by the procedure of splitting a would-be overfilled node, of <code>(1+ n)</code> keys, into two <code>(/ n 2)</code>-key siblings and inserting the mid-value key into the parent. That's why, usually, the range of the number of keys in a B-tree node is chosen to be between <code>k</code> and <code>(* 2 k)</code>. Also, in practice, <code>k</code> will be pretty large: on the order of tens or even hundreds. Depth only increases when the root is split, maintaining balance. Similarly, a B-tree is kept balanced after deletion by merging or redistributing keys among siblings to maintain the minimum number of keys for non-root nodes. A merger reduces the number of keys in the parent, potentially forcing it to merge or redistribute keys with its siblings, and so on. The depth of the tree increases slowly as elements are added to it: an increase in the overall depth is infrequent and results in all leaf nodes being one more node farther away from the root.</p> <p>A version of B-trees that was specifically developed for storage systems and is used in a number of filesystems, such as NTFS and ext4, and databases, such as Oracle and SQLite, is B+ trees.
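The per-node multi-key search just described can be sketched like this (a Python illustration with a hypothetical dict-based node layout loosely mirroring <code>bt-node</code>; all names are mine):

```python
def bt_search(node, key):
    """Multi-key node search: scan the keys of the node, then descend
    into the child between the two separating keys. NODE is a dict with
    "keys", "vals" and "children" (the latter is empty in the leaves)."""
    if node is None:
        return None
    i = 0
    # a linear scan is fine for small nodes; for a large k,
    # a binary search over the keys could be used instead
    while i < len(node["keys"]) and key > node["keys"][i]:
        i += 1
    if i < len(node["keys"]) and key == node["keys"][i]:
        return node["vals"][i]
    if not node["children"]:
        return None  # reached a leaf: the key is absent
    return bt_search(node["children"][i], key)
```

With the example tree above, searching for 15 descends into the middle subtree, while searching for 4 ends, unsuccessfully, in the leftmost leaf.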
A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves. The leaves of the B+ tree are linked to one another in a linked list, making range queries or an (ordered) iteration through the blocks simpler and more efficient. Such a property could not be achieved in a B-tree, since not all keys are present in the leaves: some are stored in the root or intermediate nodes.</p> <p>However, a newer Linux filesystem called btrfs, developed specifically for use on SSDs, uses plain B-trees instead of B+ trees because the former allow implementing copy-on-write, which is needed for efficient snapshots. The issue with B+ trees is that their leaf nodes are interlinked, so if a leaf were copied on write, its siblings and parents would have to be copied as well, as would their siblings and parents and so on until the entire tree was copied. We can recall the same situation pertaining to doubly-linked lists compared to singly-linked ones. So, a modified B-tree without leaf linkage is used in btrfs, with a refcount associated with each tree node but stored in an ad-hoc free map structure.</p> <p>Overall, B-trees are a very natural continuation of BSTs, so we won't spend more time with them here. I believe it should be clear how to deal with them. Surely, there are a lot of B-tree variants that have their nuances, but those should be studied in the context of a particular problem they are considered for.</p> <h2 id="heaps">Heaps</h2> <p>A different variant of a binary tree is a Binary Heap. Heaps are used in many different algorithms, such as pathfinding, encoding, minimum spanning tree construction, etc. They even have their own <code>O(n * log n)</code> sorting algorithm — the elegant Heapsort. In a heap, each element is either the smallest (min-heap) or the largest (max-heap) element of its subtree.
It is also a complete tree and the last layer should be filled left-to-right. This invariant makes the heap well suited for keeping track of element priorities. So Priority Queues are, usually, based on heaps. Thus, it's beneficial to be aware of the existence of this peculiar data structure.</p> <p>The constraints on the heap allow representing it in a compact and efficient manner — as a simple vector. Its first element is the heap root, the second and third are its left and right child (if present) and so on, by recursion. This arrangement permits access to the parent and children of any element using the simple offset-based formulas (in which the element is identified by its index):</p> <pre><code>(defun hparent (i)<br /> "Calculate the index of the parent of the heap element with index I."<br /> (floor (- i 1) 2))<br /><br />(defun hrt (i)<br /> "Calculate the index of the right child of the heap element with index I."<br /> (* (+ i 1) 2))<br /><br />(defun hlt (i)<br /> "Calculate the index of the left child of the heap element with index I."<br /> (- (hrt i) 1))<br /></code></pre> <p>So, to implement a heap, we don't need to define a custom node structure, and besides, can get to any element in <code>O(1)</code>! Here is the utility to rearrange an arbitrary array in a max-heap formation (in other words, we can consider a binary heap to be a special arrangement of array elements). It works by iteratively placing each element in its proper place by swapping it with its children until it's not smaller than both of them.</p> <pre><code>(defun heapify (vec)<br /> (let ((mid (floor (length vec) 2)))<br /> (dotimes (i mid)<br /> (heap-down vec (- mid i 1))))<br /> vec)<br /><br />(defun heap-down (vec beg &optional (end (length vec)))<br /> (let ((l (hlt beg))<br /> (r (hrt beg)))<br /> (when (< l end)<br /> (let ((child (if (or (>= r end)<br /> (> (? vec l)<br /> (? vec r)))<br /> l r)))<br /> (when (> (? vec child)<br /> (? 
vec beg))<br /> ;; rotatef swaps the elements of the sequence<br /> (rotatef (? vec beg)<br /> (? vec child))<br /> (heap-down vec child end)))))<br /> vec)<br /></code></pre> <p>And here is the reverse operation that moves an item up the heap:</p> <pre><code>(defun heap-up (vec i)<br /> (when (and (> i 0) ; the root has no parent<br /> (> (? vec i)<br /> (? vec (hparent i))))<br /> (rotatef (? vec i)<br /> (? vec (hparent i)))<br /> (heap-up vec (hparent i)))<br /> vec)<br /></code></pre> <p>Also, as with other data structures, it's essential to be able to visualize the content of the heap in a convenient form, as well as to check the invariants. These tasks may be accomplished with the help of the following functions:</p> <pre><code>(defun draw-heap (vec)<br /> (format t "~%")<br /> (with ((size (length vec))<br /> (h (+ 1 (floor (log size 2)))))<br /> (dotimes (i h)<br /> (let ((spaces (loop :repeat (- (expt 2 (- h i)) 1) :collect #\Space)))<br /> (dotimes (j (expt 2 i))<br /> (let ((k (+ (expt 2 i) j -1)))<br /> (when (= k size) (return))<br /> (format t "~{~C~}~2D~{~C~}" spaces (? vec k) spaces)))<br /> (format t "~%"))))<br /> (format t "~%")<br /> vec)<br /><br />(defun check-heap (vec)<br /> (dotimes (i (floor (length vec) 2))<br /> (when (= (hlt i) (length vec)) (return))<br /> (assert (not (> (? vec (hlt i)) (? vec i)))<br /> () "Left child (~A) is > parent at position ~A (~A)."<br /> (? vec (hlt i)) i (? vec i))<br /> (when (= (hrt i) (length vec)) (return))<br /> (assert (not (> (? vec (hrt i)) (? vec i)))<br /> () "Right child (~A) is > parent at position ~A (~A)."<br /> (? vec (hrt i)) i (? 
vec i)))<br /> vec)<br /><br />CL-USER> (check-heap #(10 5 8 2 3 7 1 9))<br />Left child (9) is > parent at position 3 (2).<br /> [Condition of type SIMPLE-ERROR]<br /><br />CL-USER> (check-heap (draw-heap (heapify #(1 22 10 5 3 7 8 9 7 13))))<br /><br /> 22 <br /> 13 10 <br /> 9 3 7 8 <br /> 5 7 1 <br /><br />#(22 13 10 9 3 7 8 5 7 1)<br /></code></pre> Due to the regular nature of the heap, drawing it with BFS is much simpler than for most other trees. As with ordered trees, heap element insertion and deletion require repositioning of some of the elements. <pre><code>(defun heap-push (node vec)<br /> (vector-push-extend node vec)<br /> (heap-up vec (1- (length vec))))<br /><br />(defun heap-pop (vec)<br /> (rotatef (? vec 0) (? vec (- (length vec) 1)))<br /> ;; PROG1 is used to return the result of the first form<br /> ;; instead of the last, like it happens with PROGN<br /> (prog1 (vector-pop vec)<br /> (heap-down vec 0)))</code></pre> <p>Now, we can implement Heapsort. The idea is to iteratively arrange the array in heap order element by element. Each arrangement will take <code>log n</code> time as we're pushing the item down a complete binary tree the height of which is <code>log n</code>. And we'll need to perform <code>n</code> such iterations.</p> <pre><code>(defun heapsort (vec)<br /> (heapify vec)<br /> (dotimes (i (length vec))<br /> (let ((last (- (length vec) i 1)))<br /> (rotatef (? vec 0)<br /> (? vec last))<br /> (heap-down vec 0 last)))<br /> vec)<br /><br />CL-USER> (heapsort #(1 22 10 5 3 7 8 9 7 13))<br />#(1 3 5 7 7 8 9 10 13 22)<br /></code></pre> <p>There are so many sorting algorithms, so why invent another one? That's a totally valid point, but the advantage of heaps is that they keep the maximum/minimum element constantly at the top so you don't have to perform a full sort or even descend into the tree if you need just the top element. 
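As an aside, this top-element access pattern is exactly what Python's standard <code>heapq</code> module provides for plain lists (note that, unlike the max-heap code above, it maintains a min-heap):

```python
import heapq

items = [1, 22, 10, 5, 3, 7, 8, 9, 7, 13]
heapq.heapify(items)              # O(n), rearranges the list in-place
assert items[0] == 1              # the minimum always sits at index 0

heapq.heappush(items, 0)          # O(log n) insertion
smallest = heapq.heappop(items)   # O(log n) removal of the top element

# popping everything yields the elements in sorted order (Heapsort, in essence)
ordered = [heapq.heappop(items) for _ in range(len(items))]
```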
This simplification is especially relevant if we constantly need to access such elements, as with priority queues.</p> <p>Actually, a heap doesn't have to be a single tree. Besides the Binary Heap, there are also Binomial, Fibonacci, and other kinds of heaps that may be not just trees but whole collections of trees (<b>forests</b>). We'll discuss some of them in more detail in the next chapters, in the context of the algorithms for which their use makes a notable difference in performance.</p> <h2 id="tries">Tries</h2> <p>If I were to answer the question, what's the most underappreciated data structure, I'd probably say, a trie. For me, tries are a gift that keeps on giving, and they have already saved the performance of my programs in a couple of situations that seemed hopeless. Besides, they are very simple to understand and implement.</p> <p>A trie is also called a prefix tree. It is, usually, used to optimize dictionary storage and lookup when the dictionary has a lot of entries and there is some overlap between them. The most obvious example is a normal English language dictionary. A lot of words have common stems ("work", "word", "worry" all share the same beginning "wor"), and there are many wordforms of the same word ("word", "words", "wording", "worded"). </p> <p>There are many approaches to trie implementation. Let's start with the most straightforward and, so to say, primitive one. 
Here is a trie for representing a string dictionary that is character-based and uses an alist to store children pointers:</p> <pre><code>(defstruct (tr-node (:conc-name nil))<br /> val<br /> (children (list)))<br /><br />(defun tr-lookup (key root)<br /> (dovec (ch key<br /> ;; when iteration terminates normally<br /> ;; we have found the node we were looking for<br /> (val root))<br /> (if-it (assoc1 ch (children root))<br /> (:= root it)<br /> (return))))<br /><br />(defun tr-add (key val root)<br /> (let ((i 0))<br /> (dovec (ch key)<br /> (if-it (assoc1 ch (children root))<br /> (:= root it<br /> i (1+ i))<br /> (return)))<br /> (if (= i (length key))<br /> ;; there was already something at key -<br /> ;; several actions might be taken:<br /> ;; signal an error (continuable), overwrite, abort<br /> (cerror "Assign a new value"<br /> "There was already a value at key: ~A" (val root))<br /> (dovec (ch (slice key i))<br /> (let ((child (make-tr-node)))<br /> (push (cons ch child) (children root))<br /> (:= root child))))<br /> (:= (val root) val)))<br /><br />CL-USER> (defparameter *trie* (make-tr-node))<br />*TRIE*<br />CL-USER> *trie*<br />#S(TR-NODE :VAL NIL :CHILDREN NIL)<br /></code></pre> <p>For the sake of brevity, we won't define a special print-function for our trie and will use a default one. In a real setting, though, it is highly advisable.</p> <pre><code>CL-USER> (tr-lookup "word" *trie*)<br />NIL<br />CL-USER> (tr-add "word" 42 *trie*)<br />42<br />CL-USER> *trie*<br />#S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\w<br /> . #S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\o<br /> . #S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\r<br /> . #S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\d<br /> . 
#S(TR-NODE<br /> :VAL 42<br /> :CHILDREN NIL)))))))))))))<br />CL-USER> (tr-lookup "word" *trie*)<br />42<br />CL-USER> (tr-add "word" :foo *trie*)<br /><br />There was already a value at key: 42<br /> [Condition of type SIMPLE-ERROR]<br /><br />Restarts:<br /> 0: [CONTINUE] Assign a new value<br /> 1: [RETRY] Retry SLIME REPL evaluation request.<br /> 2: [*ABORT] Return to SLIME's top level.<br /> 3: [ABORT] abort thread (#<THREAD "repl-thread" RUNNING {100F6297C3}>)<br /><br />Backtrace:<br /> 0: (TR-ADD "word" :FOO #S(TR-NODE :VAL 42 :CHILDREN NIL))<br /> 1: (SB-INT:SIMPLE-EVAL-IN-LEXENV (TR-ADD "word" :FOO *TRIE*) #<NULL-LEXENV>)<br /> 2: (EVAL (TR-ADD "word" :FOO *TRIE*))<br /> --more--<br /><br />;;; Take the restart 0<br /><br />:FOO<br />CL-USER> (tr-add "work" :bar *trie*)<br />:BAR<br />CL-USER> (tr-add "we" :baz *trie*)<br />:BAZ<br />CL-USER> *trie*<br />#S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\w<br /> . #S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\e . #S(TR-NODE :VAL :BAZ :CHILDREN NIL))<br /> (#\o<br /> . #S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\r<br /> . #S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\k<br /> . #S(TR-NODE<br /> :VAL :BAR<br /> :CHILDREN NIL))<br /> (#\d<br /> . #S(TR-NODE<br /> :VAL :FOO<br /> :CHILDREN NIL)))))))))))))<br /></code></pre> <p>There are many ways to optimize this trie implementation. First of all, you can see that some space is wasted on intermediate nodes with no values. This is mended by <strong>Radix Trees</strong> (also known as Patricia Trees) that merge all intermediate nodes. I.e., our trie would change into the following more compact structure:</p> <pre><code>#S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\w<br /> . #S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\e . #S(TR-NODE :VAL :BAZ :CHILDREN NIL))<br /> ("or" . #S(TR-NODE<br /> :VAL NIL<br /> :CHILDREN ((#\k<br /> . #S(TR-NODE<br /> :VAL :BAR<br /> :CHILDREN NIL))<br /> (#\d<br /> . 
#S(TR-NODE<br /> :VAL :FOO<br /> :CHILDREN NIL)))))))))))))<br /></code></pre> <p>Besides, there are ways to utilize an array to store trie offsets (similar to heaps), instead of using a linked backbone for it. Such a variant is called a <strong>succinct</strong> trie. Also, there are compressed (C-tries), hash-array mapped (HAMTs), and other kinds of tries.</p> <p>The main advantage of tries is efficient space usage thanks to the elimination of repetition in keys storage. In many scenarios, usage of tries also improves the speed of access. Consider the task of matching against a dictionary of phrases, for example, biological or medical terms, names of companies or works of art, etc. These are, usually, 2-3 word phrases, but, occasionally, there may be an outlier of 10 or more words. The straightforward approach would be to put the dictionary into a hash-table, then iterate over the input string trying to find the phrases in the table, starting from each word. The only question is: where does the phrase end? As we said, the phrase may be from 1 to, say, 10 words in length. With a hash-table, we have to check every variant: a single-word phrase, a two-word one, and so on up to the maximum length. Moreover, if there are phrases with the same beginning, which is often the case, we'd do duplicate work of hashing that beginning, for each variant (unless we use an additive hash, but this isn't advised for hash-tables). With a trie, none of this duplication is necessary: we can iteratively match each word until we either find the match in the tree or discover that there is no continuation of the current subphrase.</p> <h2 id="treesinactionefficientmapping">Trees in Action: Efficient Mapping</h2> <p>Finally, the last family of tree data structures I have to mention is trees for representing spatial relations. Overall, mapping and pathfinding is an area that prompted the creation of a wide range of useful algorithms and data structures. 
There are two fundamental operations for processing spatial data: nearest neighbor search and range queries. Given a set of points on the plane, how do we determine the closest points to a particular one? How do we retrieve all points inside a rectangle or a circle? A primitive approach is to loop through all the points and collect the relevant information, which results in at least <code>O(n)</code> complexity — prohibitively expensive if the number of points is beyond several tens or hundreds. And such problems, by the way, arise not only in the field of processing geospatial data (they are at the core of such systems as PostGIS, mapping libraries, etc.) but also in Machine Learning (for instance, the k-NN algorithm directly requires such calculations) and other areas.</p> <p>A more efficient solution has an <code>O(log n)</code> complexity and is, as you might expect, based on indexing the data in a special-purpose tree. The changes to the tree will also have <code>O(log n)</code> complexity, while the initial indexing is <code>O(n log n)</code>. However, in most of the applications that use this technique, changes are much less frequent than read operations, so the upfront cost pays off.</p> <p>There are a number of trees that allow efficient storage of spatial data: segment trees, interval trees, k-d trees, R-trees, etc. The most common spatial data structure is an <strong>R-tree</strong> (rectangle-tree). It distributes all the points in an <code>n</code>-dimensional space (usually, <code>n</code> will be 2 or 3) among the leaves of the tree by recursively dividing the space into <code>k</code> rectangles holding roughly the same number of points until each tree node has at most <code>k</code> points. Let's say we have started from 1000 points on the plane and chosen <code>k</code> to be 10. In this case, the first level of the tree (i.e. 
children of the root) will contain 10 nodes, each one having as the value the dimensions of the rectangle that bounds approximately 100 points. Every node like that will have 10 more children, each one having around 10 points. Maybe, some will have more, and, in this case, we'll give those nodes 10 children each with, probably, 1 or 2 points in the rectangles they will command. Now, we can perform a range search with the obtained tree by selecting only the nodes that intersect the query rectangle. For a small query box, this approach will result in the discarding of the majority of the nodes at each level of the tree. So, a range search over an R-tree has <code>O(k log n)</code> where <code>k</code> is the number of intersecting rectangles.</p> <p>Now, let's consider neighbor search. Obviously, the closest points to a particular one we are examining lie either in the same rectangle as the point or in the closest ones to it. So, we need to, first, find the smallest bounding rectangle, which contains our point, perform the search in it, then, if we haven't got enough points yet, process the siblings of the current tree node in the order of their proximity to it.</p> <p>There are many other spatial problems that may be efficiently solved with this approach. One thing to note is that the described procedures require the tree to store, in the leaf nodes, references to every point contained in their rectangles.</p> <h2 id="takeaways">Take-aways</h2> <p>So, balancing a tree isn't such a unique and interesting task. On the contrary, it's quite simple yet boring due to the number of edge cases you have to account for. Yet, we have just scratched the surface of the general topic of trees. It is vast: the Wikipedia section for tree data structures contains almost 100 of them and it's, definitely, not complete. Moreover, new tree variants will surely be invented in the future. 
But you will hardly deal with more than just a few variants during the course of your career, spending the majority of time with the simple "unconstrained" trees. And we have seen, in action, the basic principles of tree operation that will be helpful, in the process.</p> <p>There are a couple of other general observations about programming algorithms we can draw from this chapter:</p> <ol><li>Trees are very versatile data structures that are a default choice when you need to represent some hierarchy. They are also one of a few data structures for which recursive processing is not only admissible but also natural and efficient. </li> <li>Visualization is key to efficient debugging of complex data-structures. It's hard to show in the book, but I have spent several hours on the code for the splay tree, and, without an efficient way to display the trees coupled with dynamic tracing, I would probably have spent twice as much. And, both the <code>print-function</code> for individual nodes and <code>pprint-bst</code> were helpful here.</li></ol><hr size="1"><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r7-1" name="f7-1">[1]</a> This statement is strictly true for balanced trees, but, even for imbalanced trees, such estimation is usually correct.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r7-2" name="f7-2">[2]</a> Although it was shown that this value may also be reduced to a single bit using a clever implementation trick.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-82814239418249227812019-09-11T22:25:00.000+03:002019-09-11T23:02:58.621+03:00Programming Algorithms: Hash-Tables<div class="separator" style="clear: both; text-align: center;"><a 
href="https://4.bp.blogspot.com/-A6g3QCjrqN0/XXlGLFRnhbI/AAAAAAAACLU/yQp3nT9j6NEhjdc9lurDRqPf_PKiV5BdgCLcBGAsYHQ/s1600/hash.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://4.bp.blogspot.com/-A6g3QCjrqN0/XXlGLFRnhbI/AAAAAAAACLU/yQp3nT9j6NEhjdc9lurDRqPf_PKiV5BdgCLcBGAsYHQ/s320/hash.jpg" width="180" data-original-width="645" data-original-height="990" /></a></div> <p>Now, we can move on to studying advanced data structures which are built on top of the basic ones such as arrays and lists, but may exhibit distinct properties, have different use cases, and special algorithms. Many of them will combine the basic data structures to obtain new properties not accessible to the underlying structures. The first and most important of these advanced structures is, undoubtedly, the hash-table. However vast the list of candidate structures for implementing key-values may be, hash-tables are the default choice for that task.</p> <p>Also, hash-sets, in general, serve as the main representation for medium and large-sized sets as they ensure <code>O(1)</code> membership test, as well as optimal set-theoretic operations complexity. A simple version of a hash-set can be created using a normal hash-table with <code>t</code> for all values.</p> <h2 id="implementation">Implementation</h2> <p>The basic properties of hash-tables are average <code>O(1)</code> access and support for arbitrary keys. These features can be realized by storing the items in an array at indices determined by a specialized function that maps the keys in a pseudo-random way — hashes them. Technically, the keys should pertain to a domain that allows hashing, but, in practice, it is always possible to ensure this either directly or by using an intermediate transformation. 
The choice of variants for the hash-function is rather big, but there are some limitations to keep in mind:</p> <ol><li>As the backing array has a limited number of cells (<code>n</code>), the function should produce values in the interval <code>[0, n)</code>. This limitation can be respected by a 2-step process: first, produce a number in an arbitrary range (for instance, a 32-bit integer) and then take the remainder of its division by <code>n</code>.</li> <li>Ideally, the distribution of indices should be uniform, but similar keys should map to quite distinct indices. I.e. hashing should turn things which are close, into things which are distant. This way, even very small changes to the input will yield sweeping changes in the value of the hash. This property is called the "avalanche effect".</li></ol> <h3 id="dealingwithcollisions">Dealing with Collisions</h3> <p>Even better would be if there were no collisions — situations when two or more keys are mapped to the same index. Is that, at all, possible? Theoretically, yes, but all the practical implementations that we have found so far are too slow and not feasible for a hash-table that is dynamically updated. However, such approaches may be used if the keyset is static and known beforehand. They will be covered in the discussion of perfect hash-tables.</p> <p>For dynamic hash-tables, we have to accept that collisions are inevitable. The probability of collisions is governed by an interesting phenomenon called "The Birthday Paradox". Let's say, we have a group of people of some size, for instance, 20. What is the probability that two of them have birthdays on the same date? It may seem quite improbable, considering that there are 365 days in a year and we are talking just about a handful of people. But if you take into account that we need to examine each pair of people to learn about their possible birthday collision that will give us <code>(/ (* 20 19) 2)</code>, i.e. 190 pairs. 
We can calculate the exact probability by taking the complement to the probability that no one has a birthday collision, which is easier to reason about. The probability that two people don't share their birthday is <code>(/ (- 365 1) 365)</code>: there's only 1 chance in 365 that they do. For three people, we can use the chain rule and state that the probability that they don't have a birthday collision is a product of the probability that any two of them don't have it and that the third person also doesn't share a birthday with any of them. This results in <code>(* (/ 364 365) (/ (- 365 2) 365))</code>. The value <code>(- 365 2)</code> refers to the third person not having a birthday intersection with either the first or the second individually, and those are distinct, as we have already asserted in the first term. Continuing in such fashion, we can compute the probability for 20 persons:</p> <pre><code>(defun birthday-collision-prob (n)<br /> (let ((rez 1))<br /> (dotimes (i n)<br /> (:* rez (/ (- 365 i) 365)))<br /> ;; don't forget that we want the complement<br /> ;; of the probability of no collisions,<br /> ;; hence (- 1.0 ...)<br /> (- 1.0 rez)))<br /><br />CL-USER> (birthday-collision-prob 20)<br />0.4114384<br /></code></pre> <p>So, among 20 people, there's already a 40% chance of observing a coinciding birthday. And this number grows quickly: it will become 50% at 23, 70% at 30, and 99.9% at just 70!</p> <p>But why, on Earth, you could ask, have we started to discuss birthdays? Well, if you substitute keys for persons and the array size for the number of days in a year, you'll get the formula of the probability of at least one collision among the hashed keys in an array, provided the hash function produces perfectly uniform output. 
(It will be even higher if the distribution is non-uniform).</p> <pre><code>(defun hash-collision-prob (n size)<br /> (let ((rez 1))<br /> (dotimes (i n)<br /> (:* rez (/ (- size i) size)))<br /> (- 1.0 rez)))<br /></code></pre> <p>Let's say, we have 10 keys. What should be the array size to be safe against collisions?</p> <pre><code>CL-USER> (hash-collision-prob 10 10)<br />0.9996371<br /></code></pre> <p>99.9%. OK, we don't stand a chance to accidentally get a perfect layout. :( What if we double the array size?</p> <pre><code>CL-USER> (hash-collision-prob 10 20)<br />0.9345271<br /></code></pre> <p>93%. Still, pretty high.</p> <pre><code>CL-USER> (hash-collision-prob 10 100)<br />0.37184352<br />CL-USER> (hash-collision-prob 10 10000)<br />0.004491329<br /></code></pre> <p>So, if we were to use a 10k-element array to store 10 items the chance of a collision would fall below 1%. Not practical...</p> <p>Note that the number depends on both arguments, so <code>(hash-collision-prob 10 100)</code> (0.37) is not the same as <code>(hash-collision-prob 20 200)</code> (0.63).</p> <p>We did this exercise to completely abandon any hope of avoiding collisions and accept that they are inevitable. Such mind/coding experiments may be an effective smoke-test of our novel algorithmic ideas: before we go full-speed and implement them, it makes sense to perform some back-of-the-envelope feasibility calculations.</p> <p>Now, let's discuss what difference the presence of these collisions makes to our hash-table idea and how to deal with this issue. The obvious solution is to have a fallback option: when two keys hash to the same index, store both of the items in a list. The retrieval operation, in this case, will require a sequential scan to find the requested key and return the corresponding value. Such an approach is called "chaining" and it is used by some implementations. 
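<p>To make the chaining approach concrete, here is a minimal sketch in Python (illustrative only: a fixed-size backing array and no resizing; the class and method names are ours):</p>

```python
class ChainedHT:
    """A toy chaining hash-table: each slot holds a list of (key, value) pairs."""
    def __init__(self, size=16):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, val):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: overwrite
                bucket[i] = (key, val)
                return
        bucket.append((key, val))      # collision or new key: chain it

    def get(self, key, default=None):
        # a sequential scan over the chain resolves collisions
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return default
```

<p>Note how <code>get</code> has to scan a whole chain in the worst case — this is the sequential-scan cost of chaining.</p>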
Yet, it has a number of drawbacks:</p> <ul><li>It complicates the implementation: we now have to deal with both a static array and a dynamic list/array/tree. This change opens a possibility for some hard-to-catch bugs, especially, <a href="http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html">in the concurrent settings</a>.</li> <li>It requires more memory than the hash-table backing array, so we will be in a situation when some of the slots of the array are empty while others chain several elements.</li> <li>It will have poor performance due to the necessity of dealing with a linked structure and, what's worse, not respecting cache locality: the chain will not fit in the original array so at least one additional RAM round-trip will be required.</li></ul> <p>One upside of this approach is that it can store more elements than the size of the backing array. And, in the extreme case, it degrades to bucketing: when a small number of buckets point to long chains of randomly shuffled elements.</p> <p>The more widely-used alternative to chaining is called "open addressing" or "closed hashing". With it, the chains are, basically, stored in the same backing array. The algorithm is simple: when the calculated hash is pointing at an already occupied slot in the array, find the next vacant slot by cycling over the array. If the table isn't full we're guaranteed to find one. If it is full, we need to resize it, first. Now, when the element is retrieved by key, we need to perform the same procedure: calculate the hash, then compare the key of the item at the returned index. 
If the keys are the same, we've found the desired element, otherwise — we need to cycle over the array comparing keys until we encounter the item we need.</p> <p>Here's an implementation of the simple open addressing hash-table using <code>eql</code> for keys comparison:</p> <pre><code>(defstruct ht<br /> array<br /> (count 0))<br /><br />(defun ht (&rest kvs)<br /> (let ((rez (make-ht :array (make-array 16 :initial-element nil))))<br /> (loop :for (k v) :in kvs :do<br /> (ht-add k v rez))<br /> rez))<br /><br />(defun ht-get (key ht)<br /> (with ((size (length @ht.array))<br /> (start (rem (hash key) size)))<br /> (do ((count 0 (1+ count))<br /> (i start (rem (1+ i) size))<br /> (item (? ht 'array start)<br /> (? ht 'array i)))<br /> ((or (null item)<br /> (= count size)))<br /> (when (eql key (car item))<br /> (return <br /> (values (cdr item)<br /> ;; the second value is an index, at which<br /> ;; the item was found (also used to distinguish<br /> ;; the value nil from not found, which is also<br /> ;; represented by nil but without the second value)<br /> i))))))<br /><br />(defun ht-add (key val ht)<br /> (with ((array (ht-array ht))<br /> (size (length array)))<br /> ;; flet defines a local function that has access<br /> ;; to the local variables defined in HT-ADD<br /> (flet ((add-item (k v)<br /> (do ((i (rem (hash k) size)<br /> (rem (1+ i) size)))<br /> ((null @ht.array#i)<br /> (:= @ht.array#i (cons k v)))<br /> ;; this do-loop doesn't have a body<br /> )))<br /> ;; TALLY is a generic function for size retrieval, from RUTILS<br /> (when (= (tally ht) size)<br /> ;; when the backing array is full<br /> ;; expand it to the next power of 2, i.e. twice the current size<br /> (:= size (* 2 size)<br /> @ht.array (make-array size :initial-element nil))<br /> ;; and re-add the contents of the old array<br /> (dovec (item array)<br /> (when item<br /> (add-item (car item) (cdr item)))))<br /> ;; finally, add the new item and count it<br /> (add-item key val)<br /> (:+ (ht-count ht)))))<br /><br />(defun ht-rem (key ht)<br /> ;; here, we use the index of 
the item<br /> ;; returned as the 2nd value of HT-GET<br /> ;; (when-it is a so-called anaphoric macro, from RUTILS,<br /> ;; that assigns the value of its first argument<br /> ;; to an implicitly created variable IT<br /> ;; and evaluates the body when IT isn't null)<br /> (when-it (nth-value 1 (ht-get key ht))<br /> (void (? ht 'array it))<br /> ;; don't forget to decrement the count<br /> (:- (ht-count ht))<br /> ;; return the index to indicate that the item was found<br /> it))<br /></code></pre> <p>To avoid constant resizing of the hash-table, just as with dynamic arrays, the backing array is, usually, allocated to have the size equal to a power of 2: 16 elements, to begin with. When it is filled up to a certain capacity it is resized to the next power of 2: 32, in this case. Usually, around 70-80% is considered peak occupancy as too many collisions may happen afterward and the table access performance severely degrades. In practice, this means that normal open-addressing hash-tables also waste from 20 to 50 percent of allocated space. This inefficiency becomes a serious problem with large tables, so other implementation strategies become preferable when the size of data reaches tens and hundreds of megabytes. Note that, in our trivial implementation above, we have, effectively, used the threshold of 100% to simplify the code. Adding a configurable threshold is just a matter of introducing a parameter and initiating resizing not when <code>(= (ht-count ht) size)</code> but upon <code>(= (ht-count ht) (floor (* size threshold)))</code>. As we've seen, resizing the hash-table requires calculating the new indices for all stored elements and adding them anew into the resized array.</p> <p>Analyzing the complexity of the access function of the hash-table and proving that it is amortized <code>O(1)</code> isn't trivial. It depends on the properties of the hash-function, which should ensure good uniformity. Besides, the resizing threshold also matters: the more elements are in the table, the higher the chance of collisions. 
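<p>The effect of occupancy on open addressing can also be observed directly with a small simulation. The following Python sketch (illustrative; a random number generator stands in for an ideal uniform hash) fills a table to a given load factor using linear probing and reports the average number of slots examined per insertion:</p>

```python
import random

def avg_probes(load, size=4096, seed=42):
    """Average number of slots examined per insertion while filling
    a linear-probing table up to the given load factor."""
    random.seed(seed)
    table = [None] * size
    probes = 0
    inserted = 0
    while inserted < int(load * size):
        i = random.randrange(size)   # stands in for a uniform hash
        probes += 1
        while table[i] is not None:  # linear probing for a vacant slot
            i = (i + 1) % size
            probes += 1
        table[i] = True
        inserted += 1
    return probes / inserted
```

<p>At 50-60% occupancy, the average stays below 2 probes, but it grows noticeably as the load factor approaches 1 — which is exactly why the resizing threshold is usually set at around 70-80%.</p>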
Also, you should keep in mind that if the keys possess some strange qualities that prevent them from being hashed uniformly, the theoretical results will not hold.</p> <p>In short, if we consider a hash-table with 60% occupancy (which should be the average number, for a common table) we end up with the following probabilities:</p> <ul><li>probability that we'll need just 1 operation to access the item (i.e. the initially indexed slot is empty): 0.4</li> <li>probability that we'll need 2 operations (the current slot is occupied, the next one is empty): <code>(* 0.6 0.4)</code> — 0.24</li> <li>probability that we'll need 3 operations: <code>(* (expt 0.6 2) 0.4)</code> — 0.14</li> <li>probability that we'll need 4 operations: <code>(* (expt 0.6 3) 0.4)</code> — 0.09</li></ul> <p>Actually, these calculations are slightly off and the correct probability of finding an empty slot should be somewhat lower, although the larger the table is, the smaller the deviation in the numbers. Finding out why is left as an exercise for the reader :)</p> <p>As you see, there's a progression here. With probability around 0.87, we'll need no more than 4 operations. Without continuing with the arithmetic, I think, it should be obvious that we'll need, on average, around 3 operations to access each item and the probability that we'll need twice as many (6) is quite low (below 5%). So, we can say that the number of access operations is constant (i.e. independent of the number of elements in the table) and is determined only by the occupancy percent. So, if we keep the occupancy in the reasonable bounds, named earlier, on average, 1 hash code calculation/lookup and a couple of retrievals and equality comparisons will be needed to access an item in our hash-table.</p> <h3 id="hashcode">Hash-Code</h3> <p>So, we can conclude that a hash-table is primarily parametrized by two things: the hash-function and the equality predicate. 
In Lisp, in particular, there's a choice of just the four standard equality predicates: <code>eq</code>, <code>eql</code>, <code>equal</code>, and <code>equalp</code>. It's somewhat of a legacy that you can't use other comparison functions, so some implementations, as an extension, allow the programmer to specify other predicates. However, in practice, the following approach is sufficient for the majority of the hash-table use cases:</p> <ul><li>use the <code>eql</code> predicate if the keys are numbers, characters, or symbols</li> <li>use <code>equal</code> if the keys are strings or lists of the mentioned items</li> <li>use <code>equalp</code> if the keys are vectors, structs, CLOS objects or anything else containing one of those</li></ul> <p>But I'd recommend trying your best to avoid using the complex keys requiring <code>equalp</code>. Besides the performance penalty of using the heaviest equality predicate that performs deep structural comparison, structs and vectors, in particular, will most likely hash to the same index. Here is a quote from one of the implementors describing why this happens:</p> <blockquote> <p>Structs have no extra space to store a unique hash code within them. The decision was made to implement this because automatic inclusion of a hashing slot in all structure objects would have made all structs an average of one word longer. For small structs this is unacceptable. Instead, the user may define a struct with an extra slot, and the constructor for that struct type could store a unique value into that slot (either a random value or a value gotten by incrementing a counter each time the constructor is run). Also, create a hash generating function which accesses this hash-slot to generate its value. If the structs to be hashed are buried inside a list, then this hash function would need to know how to traverse these keys to obtain a unique value. 
Finally, then, build your hash-table using the <code>:hash-function</code> argument to make-hash-table (still using the equal test argument), to create a hash-table which will be well-distributed. Alternatively, and if you can guarantee that none of the slots in your structures will be changed after they are used as keys in the hash-table, you can use the <code>equalp</code> test function in your make-hash-table call, rather than equal. If you do, however, make sure that these struct objects don't change, because then they may not be found in the hash-table.</p></blockquote> <p>But what if you still need to use a struct or a CLOS object as a hash key (for instance, if you want to put them in a set)? There are three possible workarounds:</p> <ul><li>Choose one of their slots as a key (if you can guarantee its uniqueness).</li> <li>Add a special slot to hold a unique value that will serve as a key.</li> <li>Use the literal representation obtained by calling the print-function of the object. Still, you'll need to ensure that it will be unique and constant. Using an item that changes while being the hash key is a source of very nasty bugs, so avoid it at all cost.</li></ul> <p>These considerations are also applicable to the question of why Java requires defining both <code>equals</code> and <code>hashCode</code> methods for objects that are used as keys in the hash-table or hash-set.</p> <h3 id="advancedhashingtechniques">Advanced Hashing Techniques</h3> <p>Beyond the direct implementation of open addressing, called "linear probing" (for it tries to resolve collisions by performing a linear scan for an empty slot), a number of approaches were proposed to improve hash distribution and reduce the collision rate. However, for the general case, their superiority remains questionable, and so the utility of a particular approach has to be tested in the context of the situations when linear probing demonstrates suboptimal behavior. 
One type of such situations occurs when the hash-codes become clustered near some locations due to deficiencies of either the hash-function or the keyset.</p> <p>The simplest modification of linear probing is called "quadratic probing". It operates by performing the search for the next vacant slot using the linear probing offsets (or some other sequence of offsets) that are just raised to the power 2. I.e. if, with linear probing, the offset sequence was 1,2,3, etc., with the quadratic one, it is 1,4,9,... "Double hashing" is another simple alternative, which, instead of a linear sequence of offsets, calculates the offsets using another hash-function. This approach makes the sequence specific to each key, so the keys that map to the same location will have different possible variants of collision resolution. "2-choice hashing" also uses 2 hash-functions but selects the particular one for each key based on the distance from the original index it has to be moved for collision resolution.</p> <p>More elaborate changes to the original idea are proposed in Cuckoo, Hopscotch, and Robin Hood hashing, to name some of the popular alternatives. We won't discuss them now, but if the need arises to implement a non-standard hash-table it's worth studying all of those before proceeding with an idea of your own. Although, who knows, someday you might come up with a viable alternative technique, as well...</p> <h2 id="hashfunctions">Hash-Functions</h2> <p>The class of possible hash-functions is very diverse: any function that sufficiently randomizes the key hashes will do. But what does "good enough" mean? One of the ways to find out is to look at <a href="https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed">the pictures of the distribution of hashes</a>. 
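<p>A quick numeric substitute for looking at such pictures is to bucket a batch of keys and examine the counts. Here is an illustrative Python sketch (the function names are ours; Python's built-in <code>hash</code> stands in for a reasonably good hash-function):</p>

```python
def bucket_counts(keys, nbuckets, hash_fn):
    """Histogram of bucket occupancy -- a crude uniformity check."""
    counts = [0] * nbuckets
    for key in keys:
        counts[hash_fn(key) % nbuckets] += 1
    return counts

def first_char_hash(s):
    # a deliberately poor "hash": it looks only at the first character,
    # so all keys sharing a prefix collide
    return ord(s[0])

keys = ["key%03d" % i for i in range(1000)]
print(bucket_counts(keys, 16, hash))             # roughly even counts
print(bucket_counts(keys, 16, first_char_hash))  # everything in one bucket
```

<p>A function without the avalanche effect piles similar keys into a few buckets — precisely the clustering situation described earlier.</p>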
Yet, there are other factors that may condition the choice: speed, complexity of implementation, collision resistance (important for cryptographic hashes that we won't discuss in this book).</p> <p>The good news is that, for most practical purposes, there's a single function that is both fast and easy to implement and understand. It is called <strong>FNV-1a</strong>.</p> <pre><code>(defparameter *fnv-primes*<br /> '((32 . 16777619)<br /> (64 . 1099511628211)<br /> (128 . 309485009821345068724781371)<br /> (256 . 374144419156711147060143317175368453031918731002211)))<br /><br />(defparameter *fnv-offsets*<br /> '((32 . 2166136261)<br /> (64 . 14695981039346656037)<br /> (128 . 144066263297769815596495629667062367629)<br /> (256 . 100029257958052580907070968620625704837092796014241193945225284501741471925557)))<br /><br />(defun fnv-1a (x &key (bits 32))<br /> (assert (member bits '(32 64 128 256)))<br /> (let ((rez (assoc1 bits *fnv-offsets*))<br /> (prime (assoc1 bits *fnv-primes*)))<br /> (dotimes (i (/ bits 8))<br /> (:= rez (ldb (byte bits 0)<br /> (* (logxor rez (ldb (byte 8 (* i 8)) x))<br /> prime))))<br /> rez))<br /></code></pre> <p>The constants <code>*fnv-primes*</code> and <code>*fnv-offsets*</code> are precalculated up to 1024 bits (here, I used just a portion of the tables).</p> <p>Note that, in this implementation, we use normal Lisp multiplication (<code>*</code>) that is not limited to fixed-size numbers (32-bit, 64-bit,...) so we need to extract only the first <code>bits</code> with <code>ldb</code>.</p> <p>Also note that if you were to calculate FNV-1a with some online hash calculator you'd, probably, get a different result. Experimenting with it, I noticed that it is the same if we use only the non-zero bytes from the input number. This observation aligns well with calculating the hash for simple strings when each character is a single byte. 
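</p><p>The same function is easy to cross-check in another language. Here is a byte-oriented 32-bit FNV-1a in Python; it matches the published FNV test vectors (for instance, hashing the single byte "a" yields the well-known value <code>#xE40C292C</code>):</p>

```python
# 32-bit FNV-1a over a byte string (constants from the FNV specification).
FNV32_PRIME = 16777619
FNV32_OFFSET = 2166136261

def fnv_1a_32(data):
    rez = FNV32_OFFSET
    for byte in data:
        # xor in one byte, multiply by the prime, keep the low 32 bits
        rez = ((rez ^ byte) * FNV32_PRIME) & 0xFFFFFFFF
    return rez
```

<p>This byte-at-a-time formulation also hints at the source of the discrepancy mentioned above: a calculator consumes exactly the bytes of the input, while the number-based version always processes a fixed number of bytes, zeros included.</p> <p>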
For them the hash-function would look like the following:</p> <pre><code>(defun fnv-1a-str (str)<br />  (let ((rez (assoc1 32 *fnv-offsets*))<br />        (prime (assoc1 32 *fnv-primes*)))<br />    (dovec (char str)<br />      (:= rez (ldb (byte 32 0)<br />                   (* (logxor rez (char-code char))<br />                      prime))))<br />    rez))<br /></code></pre> <p>So, even such a simple hash-function has nuances in its implementation and it should be meticulously checked against some reference implementation or a set of expected results.</p> <p>Alongside FNV-1a, there's also FNV-1, which is a slightly worse variation, but it may be used if we need to apply 2 different hash functions at once (like, in 2-way or double hashing).</p> <p>What is the source of the hashing property of FNV-1a? Xors and modulos. Combining these simple and efficient operations is enough to create the desired level of randomization. Most of the other hash-functions use the same building blocks as FNV-1a. They all perform arithmetic (usually, addition and multiplication as division is slow) and xor'ing, adding into the mix some prime numbers. For instance, here's what the code for another popular hash-function "djb2" approximately looks like:</p> <pre><code>(defun djb2-str (str)<br />  (let ((rez 5381)) ; the DJB2 initial value, a prime number<br />    (loop :for char :across str :do<br />      (:= rez (ldb (byte 32 0)<br />                   (+ (char-code char)<br />                      (ldb (byte 32 0)<br />                           (+ (ash rez 5)<br />                              rez))))))<br />    rez))<br /></code></pre> <h2 id="operations">Operations</h2> <p>The generic key-value operations discussed in the previous chapter obviously also apply to hash-tables. There are also specific low-level ones, defined by the Lisp standard. And it's worth mentioning that, in regard to hash-tables, I find the standard quite lacking, so a lot of utilities were added as part of RUTILS. 
The reason for the deficiency in the standard is, I believe, that when hash-tables were added to Lisp they were still a pretty novel technology, not widely adopted in the programming language community. So there was neither significant experience using them nor a good understanding of the important role they would play. Languages such as Python or Clojure, as well as the ones designed even later, were developed with this knowledge already in mind. Yet, this situation doesn't pose an insurmountable difficulty for Lisp users as the language provides advanced extension tools such as macros and reader macros, so the necessary parts can be added and, in fact, exist as 3rd-party extensions. Using them becomes just a question of changing your habits and adapting to more efficient approaches. The situation is different for the users of many other languages, such as Java, who had to wait for a new major version of the language to get access to such things as literal hash-table initialization, a feature I consider crucial to improving code clarity in the declarative paradigm.</p> <h3 id="initialization">Initialization</h3> <p>Normally, the hash-table can be created with <code>make-hash-table</code>, which has a number of configuration options, including <code>:test</code> (default: <code>eql</code>). Most of the implementations allow the programmer to make synchronized (thread-safe) hash-tables via another configuration parameter, but the variants of concurrency control will differ.</p> <p>Yet, it is important to have a way to define hash-tables already pre-initialized with a number of key-value pairs, and <code>make-hash-table</code> can't handle this. Pre-initialized hash-tables are a common necessity for tables serving as dictionaries, and such pre-initialization greatly simplifies many code patterns. 
Thus RUTILS provides such a syntax (in fact, in 2 flavors) with the help of reader macros:</p> <pre><code>#{equal "foo" :bar "baz" 42}<br />#h(equal "foo" :bar "baz" 42)<br /></code></pre> <p>Both of these expressions will expand into a call to <code>make-hash-table</code> with the <code>equal</code> test and two calls to the set operation to populate the table with the kv-pairs <code>"foo" :bar</code> and <code>"baz" 42</code>. For this stuff to work, you need to switch to the appropriate readtable by executing: <code>(named-readtables:in-readtable rutils-readtable)</code>.</p> <p>The reader-macro to parse <code>#h()</code>-style literal hash-tables isn't very complicated. Like all reader-macros, it operates on the character stream of the program text, processing one character at a time. Here is its implementation:</p> <pre><code>(defun |#h-reader| (stream char arg)<br />  (read-char stream) ; skip the open paren<br />  ;; we can also add a sanity check to ensure that this character<br />  ;; is indeed a #\(<br />  (with (;; read-delimited-list is a standard library function<br />         ;; that reads items until a delimiter is encountered<br />         ;; and then returns them as a list of parsed Lisp objects<br />         (sexp (read-delimited-list #\) stream t))<br />         ;; the idea is that the first element may be a hash-table<br />         ;; test function; in this case, the number of items in the<br />         ;; definition will be odd as each key-value pair should have<br />         ;; an even number of elements<br />         (test (when (oddp (length sexp))<br />                 (first sexp)))<br />         ;; the rest of the values, after the possible test function,<br />         ;; are key-value pairs<br />         (kvs (group 2 (if test (rest sexp) sexp)))<br />         (ht (gensym)))<br />    `(let ((,ht (make-hash-table :test ',(or test 'eql))))<br />       ;; iterate over the KVS list and, for each key-value pair,<br />       ;; generate an expression to add the value for the key<br />       ;; in the resulting hash-table<br />       ,@(mapcar (lambda (kv)<br />                   `(:= 
(? ,ht ,(first kv)) ,(second kv)))<br />                 kvs)<br />       ,ht)))<br /></code></pre> <p>After such a function is defined, it can be plugged into the standard readtable:</p> <pre><code>(set-dispatch-macro-character #\# #\h '|#h-reader|)</code></pre> <p>Or it may be used in a named-readtable (you can learn how to do that from the docs).</p> <p><code>print-hash-table</code> is the utility to perform the reverse operation — display hash-tables in a similar manner:</p> <pre><code>RUTILS> (print-hash-table #h(equal "foo" :bar "baz" 42))<br />#{EQUAL<br /> "foo" :BAR<br /> "baz" 42<br /> } <br />#<HASH-TABLE :TEST EQUAL :COUNT 2 {10127C0003}><br /></code></pre> <p>The last line of the output is the default Lisp printed representation of the hash-table. As you see, it is opaque and doesn't display the elements of the table. RUTILS also allows switching to printing the literal representation instead of the standard one with the help of <code>toggle-print-hash-table</code>. However, this extension is intended only for debugging purposes as it is not fully standard-conforming.</p> <h3 id="access">Access</h3> <p>Accessing the hash-table elements is performed with <code>gethash</code>, which returns two things: the value at key and <code>t</code> when the key was found in the table, or two nils otherwise. By using <code>(:= (gethash key ht) val)</code> (or <code>(:= (? ht key) val)</code>) we can modify the stored value. Notice the reverse order of arguments of <code>gethash</code> compared to the usual order in most accessor functions, where the structure is placed first and the key second. However, <code>gethash</code> differs from generic <code>?</code> in that it accepts an optional argument that is used as the default value if the requested key is not present in the table. In some languages, like Python, there's a notion of "default hash-tables" that may be initialized with a common default element. In Lisp, a different approach is taken. 
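</p><p>For comparison, the Python flavor of this, the standard <code>collections.defaultdict</code>, transparently manufactures the default value on every miss (a small illustration):</p>

```python
from collections import defaultdict

# defaultdict takes a factory; int() returns 0, so every missing key
# starts out as zero, which makes counting patterns one-liners
counts = defaultdict(int)
for word in ["foo", "bar", "foo"]:
    counts[word] += 1
```

<p>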
However, it's possible to easily implement default hash-tables and plug them into the <code>generic-elt</code> mechanism:</p> <pre><code>(defstruct default-hash-table<br />  table<br />  default-value)<br /><br />(defun gethash-default (key ht)<br />  (gethash key (? ht 'table) (? ht 'default-value)))<br /><br />(defmethod generic-elt ((kv default-hash-table) key &rest keys)<br />  (declare (ignore keys))<br />  (gethash-default key kv))<br /></code></pre> <p>RUTILS also defines a number of aliases/shorthands for hash-table operations. As the <code>#</code> symbol is etymologically associated with hashes, it is used in the names of all these functions:</p> <ul><li><code>get#</code> is a shorthand and a more distinctive alias for <code>gethash</code></li> <li><code>set#</code> is an alias for <code>(:= (gethash ...</code></li> <li><code>getset#</code> is an implementation of the common pattern: this operation either retrieves the value if the key is found in the table or calculates its third argument, returns it, and also sets it for the given key for future retrieval</li> <li><code>rem#</code> is an alias for <code>remhash</code> (remove the element from the table)</li> <li><code>take#</code> both returns the value at key and removes the entry (unlike <code>rem#</code> that only removes it)</li> <li><code>in#</code> tests for the presence of the key in the table</li> <li>also, <code>p#</code> is an abbreviated version of <code>print-hash-table</code></li></ul> <h3 id="iteration">Iteration</h3> <p>Hash-tables are unordered collections, in principle. But, still, there is always a way to iterate over them in some (unspecified) order. The standard utility for that is either <code>maphash</code>, which unlike <code>map</code> doesn't populate the resulting collection and is called just for the side effects, or the special <code>loop</code> syntax. 
Both are suboptimal, from several points of view, so RUTILS defines a couple of alternative options:</p> <ul><li><code>dotable</code> functions in the same manner as <code>dolist</code> except that it uses two variables: for the key and the value</li> <li><code>mapkv</code>, mentioned in the previous chapter, works just like <code>mapcar</code> by creating a new result table with the same configuration as the hash-table it iterates over and assigns the results of invoking the first argument — the function of two elements — with each of the kv-pairs</li></ul> <p>Despite the absence of a predefined ordering, there are ways in which some order may be introduced. For example, in SBCL, the order in which the elements are added is preserved by using additional vectors called <code>index-vector</code> and <code>next-vector</code> that store this information. Another option which allows forcing arbitrary ordering is to use the so-called <strong>Linked Hash-Table</strong>. It is a combination of a hash-table and a linked list: each key-value pair also has the next pointer, which links it to some other item in the table. This way, it is possible to have ordered key-values without resorting to tree-based structures. A poor man's linked hash-table can be created on top of the normal one with the following trick: substitute values by pairs containing a value plus a pointer to the next pair and keep track of the pointer to the first pair in a special slot.</p> <pre><code>(defstruct linked-hash-table-item<br />  key<br />  val<br />  next)<br /><br />(defstruct linked-hash-table<br />  table<br />  head<br />  tail)<br /><br />(defun gethash-linked (key ht)<br />  (? ht 'table key 'val))<br /><br />(defun sethash-linked (key ht val)<br />  ;; The initial order of items is the order of addition.<br />  ;; If we'd like to impose a different order,<br />  ;; we'll have to perform reordering after each addition<br />  ;; or implement a custom sethash function.<br />  (with (((table head tail) ? 
ht)<br />         (cur (gethash key table)))<br />    (if cur<br />        (:= (? cur 'val) val)<br />        (let ((new (make-linked-hash-table-item<br />                    :key key :val val)))<br />          ;; record the new item in the backing table<br />          (:= (gethash key table) new)<br />          (when (null head)<br />            (:= (? ht 'head) new))<br />          (:= (? ht 'tail)<br />              (if tail<br />                  (:= (? ht 'tail 'next) new)<br />                  new))))))<br /><br />(defmethod mapkv (fn (ht linked-hash-table))<br />  (let ((rez (make-linked-hash-table<br />              :table (make-hash-table<br />                      :test (hash-table-test (? ht 'table))))))<br />    (do ((item (? ht 'head) (? item 'next)))<br />        ((null item))<br />      (sethash-linked (? item 'key) rez<br />                      (call fn (? item 'val))))<br />    rez))<br /></code></pre> <p>The issue with this approach, as you can see from the code, is that we also need to store the key, and it duplicates the data also stored in the backing hash-table itself. So, an efficient linked hash-table has to be implemented from scratch using an array as a base instead of a hash-table.</p> <h2 id="perfecthashing">Perfect Hashing</h2> <p>In the previous exposition, we have concluded that using hash-tables implies a significant level of reserved unused space (up to 30%) and inevitable collisions. Yet, if the keyset is static and known beforehand, we can do better: find a hash-function that will exclude collisions (simple perfect hashing) and even totally get rid of reserved space (minimal perfect hashing, MPH). Although the last variant will still need extra space to store the additional information about the hash-functions, it may be much smaller: in some methods, down to ~3-4 bits per key, so just 5-10% overhead. Statistically speaking, constructing such a hash-function is possible. But the search for its parameters may require some trial and error.</p> <h3 id="implementation-1">Implementation</h3> <p>The general idea is simple, but how to find the appropriate hash-function? There are several approaches described in sometimes hard-to-follow scientific papers and a number of cryptic programs in low-level C libraries. 
At a certain point in time, I needed to implement some variant of an MPH, so I read those papers and studied the libraries to some extent. Not the most pleasant process, I should confess. One of my Twitter pals once wrote: "Looks like it's easier for people to read 40 blog posts than a single whitepaper." And, although he meant it negatively, I recognized the statement as a very precise description of what a research engineer does: read a whitepaper (or a dozen, for what it's worth) and transform it into working code and — as a possible byproduct — into an explanation ("blog post") that other engineers will understand and be able to reproduce. And it's a skill that not every software developer can easily master. Not all papers can even be reproduced because the experiment was not set up correctly, some parts of the description are missing, the data is not available, etc. Of those that, in principle, can be reproduced, only some are presented in a form clear enough to be reliably programmed.</p> <p>Here is one of the variants of minimal perfect hashing that possesses such qualities. It works for datasets of any size as a 3-step process:</p> <ol><li>At the first stage, by the use of a common hash-function (in particular, the Jenkins hash), all keys are near-uniformly distributed into buckets, so that the number of keys in each bucket doesn't exceed 256. It can be achieved with very high probability if the hash divisor is set to <code>(ceiling (length keyset) 200)</code>. This allows the algorithm to work for data sets of arbitrary size, thereby reducing the problem to a simpler one that already has a known solution.</li> <li>Next, for each bucket, the perfect hash function is constructed. This function is a table (and it's an important mathematical fact that each discrete function is equivalent to a table, albeit, potentially, of unlimited length). 
The table contains byte-sized offsets for each hash code, calculated by another application of the Jenkins hash, which produces two values in one go (actually, three, but one of them is not used). The divisor of the hash-function, this time, equals double the number of elements in the bucket. And the uniqueness requirement is that the sum of offsets corresponding, in the table, to the two values produced by the Jenkins hash is unique, for each key. To check if the constraint is satisfied, the hashes are treated as vertices of a graph, and if it happens to be acyclic (the probability of this event is quite high if the parameters are chosen properly), the requirement can be satisfied, and it is possible to construct the perfect hash function, by the process described as the next step. Otherwise, we change the seed of the Jenkins hash and try again until the resulting graph is acyclic. In practice, just a couple of tries are needed.</li> <li>Finally, the hash-function for the current bucket may be constructed from the graph by the CHM92 algorithm (named after the authors and the year of the paper), which is another version of perfect hashing but suitable only for limited keysets. Here, you can see the CHM92 formula implemented in code:</li></ol> <pre><code>(defstruct mpht<br />  (data nil :type simple-vector)<br />  (offsets nil :type (simple-array octet))<br />  (meta nil :type (simple-array quad))<br />  (div nil))<br /><br />;; div is the divisor of the top-level hash, which is calculated as:<br />;; (/ (1- (length meta)) 2)<br /><br />(defun mpht-index (item mpht)<br />  (with (((offsets meta div) ? mpht)<br />         (bucket-id (* (mod (jenkins-hash item) div) 2))<br />         (bucket-offset (? meta bucket-id))<br />         (bucket-seed (? meta (+ 1 bucket-id)))<br />         ;; the number of items in the bucket is calculated<br />         ;; by subtracting the offset of the next bucket<br />         ;; from the offset of the current one<br />         (bucket-count (- (? 
meta (+ 2 bucket-id))<br />                          bucket-offset))<br />         ;; the divisor of the in-bucket hash is double<br />         ;; the number of elements in the bucket<br />         (bucket-div (* 2 bucket-count))<br />         (hash1 hash2 (jenkins-hash2 item bucket-seed bucket-div))<br />         (base (* bucket-offset 2)))<br />    (+ bucket-offset (mod (+ (? offsets (+ base hash1))<br />                             (? offsets (+ base hash2)))<br />                          bucket-count))))<br /></code></pre> <p>This algorithm guarantees exactly <code>O(1)</code> hash-table access and uses 2 bytes per key, i.e. it will result in a constant 25% overhead on the table's size (in a 64-bit system): 2 byte-sized offsets for the hashes plus a negligible 8 bytes per bucket (each bucket contains ~200 elements) for meta information. Better space-utilization solutions (up to 4 times more efficient) exist, but they are harder to implement and explain.</p> <p>The Jenkins hash-function was chosen for two reasons:</p> <ul><li>Primarily, because, being a relatively good-quality hash, it has a configurable parameter <code>seed</code> that is used for probabilistic probing (searching for an acyclic graph). On the contrary, FNV-1a doesn't work well with an arbitrary prime, hence the usage of a pre-calculated one that isn't subject to change.</li> <li>Also, it produces 3 pseudo-random numbers right away, and we need 2 for the second stage of the algorithm.</li></ul> <h3 id="chm92">The CHM92 Algorithm</h3> <a href="https://2.bp.blogspot.com/-EVu1wmAKXAE/XXlH-pIo_WI/AAAAAAAACLg/xmafp7c42FYbj3tF3Eqzmx6p7VWEclO7gCLcBGAsYHQ/s1600/mpht-graph.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-EVu1wmAKXAE/XXlH-pIo_WI/AAAAAAAACLg/xmafp7c42FYbj3tF3Eqzmx6p7VWEclO7gCLcBGAsYHQ/s320/mpht-graph.png" width="320" height="133" data-original-width="546" data-original-height="227" /></a> <p>The CHM92 algorithm operates by performing a depth-first search (DFS) on the graph, in the process, labeling the edges with unique numbers and calculating the corresponding offset for each of the Jenkins hash values. 
In the picture, you can see one of the possible labelings: each vertex is the value of one of the two hash-codes returned by <code>jenkins-hash2</code> for each key, and every edge, connecting them, corresponds to a key that produced the hashes. The unique indices of the edges were obtained during DFS. Now, each hash-code is iteratively mapped to the number <code>(- edge-index other-vertex-number)</code>, i.e. the index of the edge minus the number already assigned to the vertex at its other end. So, some codes will map to the same number, but it is guaranteed that, for each key, the sum of the two corresponding numbers will be unique (as the edge indices are unique).</p> <p>CHM92 is an example of the probabilistic algorithms we will discuss in more detail near the end of the book.</p> <p>Let's say we have implemented the described scheme like I did in the <a href="https://github.com/vseloved/const-table">const-table</a> library. Now, we need to perform the measurements to validate that we have, in fact, achieved the desired improvement over the standard hash-table implementation. In this case, we are interested not only in speed measurements, which we already know how to perform, but also in calculating the space occupied.</p> <p>The latter goal is harder to achieve. Most programming languages provide an analog of the <code>sizeof</code> function that returns the space occupied by an array, a structure or an object. Here, we're interested not in "shallow" <code>sizeof</code> but in a "deep" one that will descend into the structure's slots and add their sizes recursively.</p> <p>First, let's create functions to populate the tables with a significant number of random string key-value pairs.</p> <pre><code>(defun random-string (size)<br /> (coerce (loop :repeat size :collect (code-char (+ 32 (random 100))))<br /> 'string))<br /><br />(defun random-hash-table (&key (n 100000))<br /> (let ((rez (make-hash-table :test 'equal)))<br /> (loop :repeat n :do<br /> (:= (? 
rez (random-string (+ 3 (random 4))))<br /> (random-string (+ 3 (random 4)))))<br /> rez))<br /><br />(defun random-const-table (&key (n 100000))<br /> (let ((rez (make-const-table :test 'equal)))<br /> (loop :repeat n :do<br /> (:= (? rez (random-string (+ 3 (random 4))))<br /> (random-string (+ 3 (random 4)))))<br /> rez))<br /></code></pre> <p>A very approximate space measurement may be performed using the standard operator <code>room</code>. But it doesn't provide detailed per-object statistics. Here's a result of the <code>room</code> measurement, in SBCL (the format of the report will be somewhat different, for each implementation):</p> <pre><code>CL-USER> (room)<br />Dynamic space usage is: 45,076,224 bytes.<br />Immobile space usage is: 18,998,832 bytes (64,672 bytes overhead).<br />Read-only space usage is: 0 bytes.<br />Static space usage is: 1,264 bytes.<br />Control stack usage is: 9,048 bytes.<br />Binding stack usage is: 640 bytes.<br />Control and binding stack usage is for the current thread only.<br />Garbage collection is currently enabled.<br /><br />Breakdown for dynamic space:<br /> 11,369,232 bytes for 76,040 simple-vector objects<br /> 9,095,952 bytes for 160,669 instance objects<br /> 8,289,568 bytes for 518,098 cons objects<br /> 3,105,920 bytes for 54,655 simple-array-unsigned-byte-8 objects<br /> 2,789,168 bytes for 54,537 simple-base-string objects<br /> 2,344,672 bytes for 9,217 simple-character-string objects<br /> 6,973,472 bytes for 115,152 other objects<br /><br /> 43,967,984 bytes for 988,368 dynamic objects (space total)<br /><br />Breakdown for immobile space:<br /> 16,197,840 bytes for 24,269 code objects<br /> 1,286,496 bytes for 26,789 symbol objects<br /> 1,041,936 bytes for 27,922 other objects<br /><br /> 18,526,272 bytes for 78,980 immobile objects (space total)<br /><br /><br />CL-USER> (defparameter *ht* (random-hash-table))<br />*HT*<br />CL-USER> (room)<br />...<br />Breakdown for dynamic space:<br /> 13,349,920 bytes 
for 77,984 simple-vector objects<br /> 11,127,008 bytes for 208,576 simple-character-string objects<br /> 9,147,824 bytes for 161,469 instance objects<br /> 8,419,360 bytes for 526,210 cons objects<br /> 3,517,792 bytes for 2,997 simple-array-unsigned-byte-32 objects<br /> 3,106,288 bytes for 54,661 simple-array-unsigned-byte-8 objects<br /> 7,671,168 bytes for 166,882 other objects<br /><br /> 56,339,360 bytes for 1,198,779 dynamic objects (space total)<br /></code></pre> <p>So, it seems like we added roughly 10 megabytes by creating a hash-table with 100,000 random 5-9 character keys and values. Almost all of that space went into the keys and values themselves — 9 Mb ("11,127,008 bytes for 208,576 simple-character-string objects" versus "2,344,672 bytes for 9,217 simple-character-string objects" — a bit less than 200,000 new strings were added).</p> <p>Also, if we examine the hash-table, we can see that its occupancy is rather high — around 90%! (The number of keys, 99,706 instead of 100,000, tells us that there was a small portion of duplicate keys among the randomly generated ones.)</p> <pre><code>CL-USER> (describe *ht*)<br />#<HASH-TABLE :TEST EQUAL :COUNT 99706 {1002162EF3}><br /> [hash-table]<br /><br />Occupancy: 0.9<br />Rehash-threshold: 1.0<br />Rehash-size: 1.5<br />Size: 111411<br /></code></pre> <p>And now, a simple time measurement:</p> <pre><code>CL-USER> (let ((keys (keys *ht*)))<br /> (time (loop :repeat 100 :do<br /> (dolist (k keys)<br /> (gethash k *ht*)))))<br />Evaluation took:<br /> 0.029 seconds of real time<br /> 0.032000 seconds of total run time (0.032000 user, 0.000000 system)<br /> 110.34% CPU<br /> 72,079,880 processor cycles<br /> 0 bytes consed<br /></code></pre> <p>Now, let's try the <code>const-table</code>s that are the MPHT implementation:</p> <pre><code>CL-USER> (time (defparameter *ct* (cstab:build-const-table *ht*)))<br />...................................................................................................<br 
/>Evaluation took:<br /> 0.864 seconds of real time<br />...<br />CL-USER> (room)<br />...<br />Breakdown for dynamic space:<br /> 14,179,584 bytes for 78,624 simple-vector objects<br /> 11,128,464 bytes for 208,582 simple-character-string objects<br /> 9,169,120 bytes for 161,815 instance objects<br /> 8,481,536 bytes for 530,096 cons objects<br /> 3,521,808 bytes for 2,998 simple-array-unsigned-byte-32 objects<br /> 3,305,984 bytes for 54,668 simple-array-unsigned-byte-8 objects<br /> 7,678,064 bytes for 166,992 other objects<br /><br /> 57,464,560 bytes for 1,203,775 dynamic objects (space total)<br /></code></pre> <p>Another megabyte was added for the metadata of the new table, which doesn't seem significantly different from the hash-table version. Of course, we'd often like to be much more precise in space measurements. For this, SBCL recently added an allocation profiler <code>sb-aprof</code>, but we won't go into the details of its usage in this chapter.</p> <p>And now, time measurement:</p> <pre><code>CL-USER> (let ((keys (keys *ht*)))<br /> (time (loop :repeat 100 :do<br /> (dolist (k keys)<br /> (cstab:csget k *ct*)))))<br />Evaluation took:<br /> 3.561 seconds of real time<br /></code></pre> <p>Oops, a two-orders-of-magnitude slowdown! Probably, it has to do with many factors: the lack of optimization in my implementation compared to the one in SBCL, the need to calculate more hashes, with a slower hash-function at that, etc. I'm sure that the implementation may be sped up at least an order of magnitude, but, even then, what's the benefit of using it over the default hash-tables? Especially, considering that MPHTs have a lot of moving parts and rely on a number of "low-level" algorithms like graph traversal or efficient membership testing, most of which need a custom efficient implementation...</p> <p>Still, there's one dimension in which MPHTs may provide an advantage: significantly reduced space usage, achieved by not storing the keys. 
Though, it becomes problematic if we need to distinguish the keys that are in the table from the unknown ones as those will also hash to some index, i.e. overlap with an existing key. So, either the keyspace should be known beforehand and exhaustively covered in the table or some precursory membership test is necessary when we anticipate the possibility of unseen keys. Yet, there are ways to perform the test efficiently (exactly or probabilistically), which require much less storage space than would be needed to store the keys themselves. Some of them we'll see in the following chapters.</p> <p>If the keys are omitted, the whole table may be reduced to a <strong>Jump-table</strong>. Jump-tables are a low-level trick possible when all the keys are integers in the interval <code>[0, n)</code>. It removes the necessity to perform sequential equality comparisons for every possible branch until one of the conditions matches: instead, the numbers are used directly as an offset. I.e. the table is represented by a vector, each hash-code being the index in that vector.</p> <p>A jump-table for the MPHT will be simply a data array, but sometimes evaluation of different code is required for different keys. Such more complex behavior may be implemented in Lisp using the lowest-level operators <code>tagbody</code> and <code>go</code> (and a bit of macrology if we need to generate a huge table). This implementation will be a complete analog of the C <code>switch</code> statement. The skeleton for such "executable" table will look like this, where 0, 1,... are goto labels:</p> <pre><code>(block nil<br /> (tagbody (go key)<br /> 0 (return (do-something0))<br /> 1 (return (do-something1))<br /> ...))<br /></code></pre> <h2 id="distributedhashtables">Distributed Hash-Tables</h2> <p>Another active area of hash-table-related research is algorithms for distributing them over the network. 
This is a natural way to represent a lot of datasets, and thus there are numerous storage systems (both general- and special-purpose) which are built as distributed hash-tables. Among them are, for instance, Amazon DynamoDB and the influential Kademlia protocol. We will discuss in more detail, in the chapter on Distributed Algorithms, some of the technologies developed for this use case, and here I wanted to mention just one concept.</p> <p><strong>Consistent Hashing</strong> addresses the problem of distributing the hash-codes among <code>k</code> storage nodes under the real-world limitations that some of them may become temporarily unavailable or new peers may be added into the system. Both kinds of changes result in changes to the value of <code>k</code>. The straightforward approach would just divide the space of all codes into <code>k</code> equal portions and select the node into whose portion the particular key maps. Yet, if <code>k</code> is changed, all the keys need to be rehashed, which we'd like to avoid at all cost: rehashing the whole database and moving the majority of the keys between the nodes, at once, will saturate the network and bring the system to a complete halt.</p> <p>The idea, or rather the tweak, behind Consistent Hashing is simple: we also hash the node ids and store the keys on the node that has the next hash-code larger than the hash of the key (modulo the size of the hash space, i.e. wrapping around 0). Now, when a new node is added, it is placed on this so-called "hash ring" between two other peers, so only part of the keys from a single node (the next on the ring) need to be redistributed to it. 
Likewise, when the node is removed, only its keys need to be reassigned to the next peer on the ring (it is supposed that the data is stored in multiple copies on different nodes, so when one of the nodes disappears the data doesn't become totally lost).</p> <p>The only problem with applying this approach directly is the uneven distribution of keys originating from uneven placement of the hash-codes of the nodes on the hash ring. This problem can be solved with another simple tweak: have multiple ids for each node that will be hashed to different locations, effectively emulating a larger number of virtual nodes, each storing a smaller portion of the keys. Due to the randomization property of hashes, not so many virtual nodes will be needed to obtain a nearly uniform distribution of keys over the nodes.</p> <p>A more general version of this approach is called <strong>Rendezvous Hashing</strong>. In it, the key for the item is combined with the node id for each node and then hashed. The largest value of the hash determines the designated node to store the item.</p> <h2 id="hashinginactioncontentaddressing">Hashing in Action: Content Addressing</h2> <p>Hash-tables are so ubiquitous that it's, actually, difficult to single out a particular use case. Instead, let's talk about hash-functions. They can find numerous uses beyond determining the positions of the items in the hash-table, and one of them is called "content addressing": globally identify a piece of data by its fingerprint instead of using external meta information like name or path.
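</p> <p>At its core, content addressing is just a hash-function plus a table. Here is a minimal sketch (with <code>sxhash</code> standing in for a real cryptographic hash — it is neither collision-resistant nor stable across runs, so this is for illustration only):</p> <pre><code>(defvar *store* (make-hash-table))<br /><br />(defun put-content (data)<br /> "Store DATA under its own fingerprint and return the address."<br /> (let ((address (sxhash data)))<br /> (setf (gethash address *store*) data)<br /> address))<br /><br />(defun get-content (address)<br /> (gethash address *store*))<br /></code></pre> <p>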
This is one of the suggested building blocks for large-scale distributed storage systems, but it works locally, as well: your git SCM system silently uses it behind the scenes to identify the changesets it operates upon.</p> <p>The advantages of Content Addressing are:</p> <ul><li>Potential for space economy: if the system has a chance of operating on repeated items (like git does, although it's not the only reason for choosing such a naming scheme for blobs: the other being the lack of a better variant), content addressing will make it possible to avoid storing them multiple times.</li> <li>It guarantees that the links will always return the same content, regardless of where it is retrieved from, who added it to the network, how and when. This enables such distributed protocols as BitTorrent that split the original file into multiple pieces, each one identified by its hash. These pieces can be distributed in an untrusted network.</li> <li>As mentioned above, content addressing also results in a conflict-free naming scheme (provided that the hash has enough bits — usually, cryptographic hashes such as SHA-1 are used for this purpose, although, in many cases, such powerful hash-functions are overkill).</li></ul> <h2 id="takeaways">Take-aways</h2> <p>This chapter presented a number of complex approaches that require a lot of attention to detail to be implemented efficiently. On the surface, the hash-table concept may seem rather simple, but, as we have seen, the production-grade implementations are not that straightforward. What general conclusions can we make?</p> <ol><li>In such mathematically loaded areas as hash-function and hash-table implementation, rigorous testing is critically important. For there are a number of unexpected sources of errors: incorrect implementation, integer overflow, concurrency issues, etc. A good testing strategy is to use an already existing trusted implementation and perform large-scale comparison testing with a lot of random inputs.
We haven't discussed the testing code here but will return to the practical implementation of such testing frameworks in the following chapters.</li> <li>Besides, a correct implementation doesn't necessarily mean a fast one. Low-level optimization techniques play a crucial role here.</li> <li>In the implementation of MPHT, we have seen in action another important approach to solving algorithmic and, more generally, mathematical problems: reducing them to a problem that has a known solution.</li> <li>Space measurement is another important area of algorithm evaluation that is somewhat harder to accomplish than runtime profiling. We'll also see more usage of both of these tools throughout the book.</li></ol> <hr size="1"><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-87515507509631701512019-08-30T20:29:00.002+03:002019-09-28T12:03:45.536+03:00Programming Algorithms: Key-Values<p>To conclude the description of essential data structures, we need to discuss key-values (kvs), which are the broadest family of structures one can imagine. Unlike arrays and lists, kvs are not concrete structures. In fact, they span, at least in some capacity, all of the popular concrete ones, as well as some obscure ones.</p> <p>The main feature of kvs is efficient access to the values by some kind of keys that they are associated with. In other words, each element of such a data structure is a key-value pair that can be easily retrieved if we know the key, and, on the other hand, if we ask for the key that is not in the structure, the null result is also returned efficiently. By "efficiently", we usually mean <code>O(1)</code> or, at least, something sublinear (like <code>O(log n)</code>), although, for some cases, even <code>O(n)</code> retrieval time may be acceptable. See how broad this is!
So, a lot of different structures may play the role of key-values.</p> <p>By the way, there isn't even a single widely-adopted name for such structures. Besides key-values — which isn't such a popular term (I derived it from key-value stores) — in different languages, they are called maps, dictionaries, associative arrays, tables, objects and so on.</p> <a href="https://4.bp.blogspot.com/-JXtohwYM8Wg/XWlbkO1s__I/AAAAAAAACKE/AdKSUpDZcTAEPAmX23mvcv1b7vHTEFOIgCLcBGAs/s1600/kv.jpg" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-JXtohwYM8Wg/XWlbkO1s__I/AAAAAAAACKE/AdKSUpDZcTAEPAmX23mvcv1b7vHTEFOIgCLcBGAs/s320/kv.jpg" width="320" height="213" data-original-width="600" data-original-height="400" /></a> <p>In a sense, these are the most basic and essential data structures. They are so essential that some dynamic languages — for example, Lua, explicitly, and JavaScript, without a lot of advertisement — rely on them as the core (sometimes sole) language's data structure. Moreover, key-values are used almost everywhere. Below is a list of some of the most popular scenarios:</p> <ul><li>implementation of the object system in programming languages</li><li>most of the key-value stores are, for the most part, glorified key-value structures</li><li>internal tables in the operating system (running process table or file descriptor tables in the Linux kernel), programming language environment or application software</li><li>all kinds of memoization and caching</li><li>efficient implementation of sets</li><li>ad hoc or predefined records for returning aggregated data from function calls</li><li>representing various dictionaries (in language processing and beyond)</li></ul> <p>Considering such a wide spread, it may be surprising that, historically, the programming language community only gradually realized the usefulness of key-values. 
For instance, such languages as C and C++ don't have built-in support for general kvs (if we don't count structs and arrays, which may be considered significantly limited versions). Lisp, on the contrary, to some extent pioneered their recognition with the concepts of alists and plists, as well as being one of the first languages to have hash-table support in the standard.</p> <h2 id="concretekeyvalues">Concrete Key-values</h2> <p>Let's see what concrete structures can be considered key-values and in which cases it makes sense to use them.</p> <h3 id="simplearrays">Simple Arrays</h3> <p>Simple sequences, especially arrays, may be regarded as a particular variant of kvs that allows only numeric keys with efficient (and fastest) constant-time access. This restriction is serious. However, as we'll see below, it can often be worked around with clever algorithms. As a result, arrays actually play a major role in the key-value space, but not in the most straightforward form. Although, if it is possible to be content with numeric keys and their number is known beforehand, vanilla arrays are the best possible implementation option. Example: OS kernels that have a predefined limit on the number of processes and a "process table" that is indexed by pid (process id), which lies in the range <code>0..MAX_PID</code>.</p> <p>So, let's note this curious fact that arrays are also a variant of key-values.</p> <h3 id="associativelists">Associative Lists</h3> <p>The main drawback of using simple arrays for kvs is not even the restriction that all keys should somehow be reduced to numbers, but the static nature of arrays, which do not lend themselves well to resizing. As an alternative, we could then use linked lists, which do not have this restriction. If the key-value contains many elements, linked lists are clearly not ideal in terms of efficiency. Many times, the key-value contains very few elements, perhaps only half a dozen or so.
In this case, even a linear scan of the whole list may not be such an expensive operation. This is where various forms of associative lists enter the scene. They store pairs of keys and values and don't impose any restrictions, either on the keys or on the number of elements. But their performance quickly degrades below acceptable once the number of elements grows beyond a handful. Many flavors of associative lists can be invented. Historically, Lisp supports two variants in the standard library:</p> <ul><li><strong>alists</strong> (association lists) are lists of cons pairs. A cons pair is the original Lisp data structure, and it consists of two values called the <code>car</code> and the <code>cdr</code> (the names come from two IBM machine instructions). Association lists have dedicated operations to find a pair in the list (<code>assoc</code>) and to add an item to it (<code>pairlis</code>), although, it may be easier to just <code>push</code> the new cons cell onto it. Modification may be performed simply by altering the <code>cdr</code> of the appropriate cons-cell. <code>((:foo . "bar") (42 . "baz"))</code> is an alist of 2 items with keys <code>:foo</code> and <code>42</code>, and values <code>"bar"</code> and <code>"baz"</code>. As you can see, it's heterogeneous in the sense that it allows keys of arbitrary type.</li> <li><strong>plists</strong> (property lists) are flat lists of alternating keys and values. They also have dedicated search (<code>getf</code>) and modify operations (<code>setf getf</code>), while insertion may be performed by calling <code>push</code> twice (on the value, and then the key). The plist with the same data as the previous alist will look like this: <code>(:foo "bar" 42 "baz")</code>.
Plists are used in Lisp to represent the keyword function arguments as a whole.</li></ul> <p>Deleting an item from such lists is quite efficient if we already know the place that we want to clear, but tracking this place if we haven't found it yet is a bit cumbersome. In general, the procedure will be to iterate the list by tails until the relevant cons cell is found and then make the previous cell point to this one's tail. A destructive version for alists will look like this:</p> <pre><code>(defun alist-del (key alist)<br /> (loop :for prev := nil :then tail<br /> :for tail := alist :then (rest tail)<br /> :while tail<br /> :when (eql key (car (first tail)))<br /> :do (return (if prev<br /> (progn (setf (rest prev) (rest tail))<br /> alist)<br /> (rest alist))) ; special case of first item<br /> :finally (return alist)))<br /></code></pre> <p>However, the standard provides higher-level delete operations for plists (<code>remf</code>) and alists: <code>(remove key alist :key 'car)</code>.</p> <p>Both of these ad-hoc list-based kvs have some historical baggage associated with them and are not very convenient to use. Nevertheless, they can be utilized for some simple scenarios, as well as for interoperability with the existing language machinery. And, however counter-intuitive it may seem, if the number of items is small, alists may be the most efficient key-value data structure.</p> <p>Another nonstandard but more convenient and slightly more efficient variant of associative lists was proposed by Ron Garret and is called <strong>dlists</strong> (dictionary lists). It is a cons-pair of two lists: the list of keys and the list of values. The dlist for our example will look like this: <code>((:foo 42) . 
("bar" "baz"))</code>.</p> <p>As the interface of different associative lists is a thin wrapper over the standard list API, the general list-processing knowledge can be applied to dealing with them, so we won't spend any more time describing how they work.</p> <h3 id="hashtables">Hash-Tables</h3> <p>Hash-tables are, probably, the most common way to do key-values, nowadays. They are dynamic and don't impose restrictions on keys while having an amortized <code>O(1)</code> performance, albeit with a rather high constant. The next chapter will be exclusively dedicated to hash-table implementation and usage. Here, it suffices to say that hash-tables come in many different flavors, including the ones that can be efficiently pre-computed if we want to store a set of items that is known ahead-of-time. Hash-tables are, definitely, the most versatile key-value variant and thus the default choice for such a structure. However, they are not so simple and may pose a number of surprises that the programmer should understand in order to use them properly.</p> <h3 id="structs">Structs</h3> <p>Speaking of structs, they may also be considered a special variant of key-values with a predefined set of keys. In this respect, structs are similar to arrays, which have a fixed set of keys (from 0 to <code>MAX_KEY</code>). As we already know, structs internally map to arrays, so they may be considered a layer of syntactic sugar that provides names for the keys and handy accessors. Usually, the struct is pictured not as a key-value but rather as a way to make the code more "semantic" and understandable. Yet, if we consider returning the aggregate value from a function call, as the possible set of keys is known beforehand, it's a good stylistic and implementation choice to define a special-purpose one-off struct for this instead of using an alist or a hash-table.
Here is a small example — compare the clarity of the alternatives:</p> <pre><code>(defun foo-adhoc-list (arg)<br /> (let ((rez (list)))<br /> ...<br /> (push "hello" rez)<br /> ...<br /> (push arg rez)<br /> ...<br /> rez))<br /><br />CL-USER> (foo-adhoc-list 42)<br />(42 "hello")<br /><br />(defun foo-adhoc-hash (arg)<br /> (let ((rez (make-hash-table)))<br /> ...<br /> (:= (gethash :baz rez) "hello")<br /> ...<br /> (:= (gethash :quux rez) arg)<br /> ...<br /> rez))<br /><br />CL-USER> (foo-adhoc-hash 42)<br />#<HASH-TABLE :TEST EQL :COUNT 2 {1040DBFE83}><br /><br />(defstruct foo-rez<br /> baz quux)<br /><br />(defun foo-struct (arg)<br /> (let ((rez (make-foo-rez)))<br /> ...<br /> (:= (foo-rez-baz rez) "hello")<br /> ...<br /> (:= (foo-rez-quux rez) arg)<br /> ...<br /> rez))<br /><br />CL-USER> (foo-struct 42)<br />#S(FOO-REZ :BAZ "hello" :QUUX 42)<br /></code></pre> <h3 id="trees">Trees</h3> <p>Another versatile option for implementing kvs is by using trees. There are even more tree variants than hash-tables and we'll also have dedicated chapters to study them. Generally, the main advantage of trees, compared to simple hash-tables, is the possibility to impose some ordering on the keys (although, linked hash-tables also allow for that), while the disadvantage is less efficient operation: <code>O(log n)</code>. Also, trees don't require hashing. Another major direction that the usage of trees opens is the possibility of persistent key-values implementation. Some languages, like Java, have standard-library support for tree-based kvs (<code>TreeMap</code>), but most languages delegate dealing with such structures to library authors for there is a wide choice of specific trees and none of them may serve as the default choice of a key-value structure.</p> <h2 id="kvoperations">KV Operations</h2> <p>The primary operation for a kv structure is access to its elements by key: to set, change, and remove.
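</p> <p>To make the diversity of the concrete accessors tangible, here is the same lookup performed with the structure-specific operations of three different kvs (standard Common Lisp only):</p> <pre><code>(let ((table (make-hash-table))<br /> (alist (list (cons :foo "bar")))<br /> (plist (list :foo "bar")))<br /> (setf (gethash :foo table) "bar")<br /> (values (gethash :foo table) ; hash-table access<br /> (cdr (assoc :foo alist)) ; alist access<br /> (getf plist :foo))) ; plist access<br /></code></pre> <p>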
As there are so many different variants of concrete kvs, there is a number of different low-level access operations, some of which we have already discussed in the previous chapters, and the others we will see in the next ones.</p> <p>Yet, most of the algorithms don't necessarily require the efficiency of built-in accessors, while their clarity will seriously benefit from a uniform generic access operation. Such an operation, as we have already mentioned, is defined by RUTILS and is called <code>generic-elt</code> or <code>?</code>, for short. We have already seen it in action in some of the examples before. And that's not an accident as kv access is among the most frequent operations. In the following chapters, we will stick to the rule of using the specific accessors like <code>gethash</code> when we are talking about some structure-specific operations and <code>?</code> in all other cases — when clarity matters more than low-level considerations. <code>?</code> is implemented using the CLOS generic function machinery that provides dynamic dispatch to a concrete retrieval operation and allows defining additional variants for new structures as the need arises. Another useful feature of <code>generic-elt</code> is chaining that allows expressing multiple accesses as a single call. This comes in very handy for nested structures. Consider an example of accessing the field of the struct that is the first element of the value in some hash-table: <code>(? x :key 0 'field)</code>.
If we were to use concrete operations, it would look like this: <code>(slot-value (nth 0 (gethash :key x)) 'field)</code>.</p> <p>Below is the backbone of the <code>generic-elt</code> function that handles chaining and error reporting:</p> <pre><code>(defgeneric generic-elt (obj key &rest keys)<br /> (:documentation<br /> "Generic element access in OBJ by KEY.<br /> Supports chaining with KEYS.")<br /> (:method :around (obj key &rest keys)<br /> (reduce #'generic-elt keys :initial-value (call-next-method obj key)))<br /> (:method (obj key &rest keys)<br /> (declare (ignore keys))<br /> (error 'generic-elt-error :obj obj :key key)))<br /></code></pre> <p>And here are some methods for specific kvs (as well as sequences):</p> <pre><code>(defmethod generic-elt ((obj hash-table) key &rest keys)<br /> (declare (ignore keys))<br /> (gethash key obj))<br /><br />(defmethod generic-elt ((obj vector) key &rest keys)<br /> (declare (ignore keys))<br /> ;; Python-like handling of negative indices as offsets from the end<br /> (when (minusp key) (setf key (+ (length obj) key)))<br /> (aref obj key))<br /><br />(defmethod generic-elt ((obj (eql nil)) key &rest keys)<br /> (declare (ignore key keys))<br /> (error "Can't access NIL with generic-elt!"))<br /></code></pre> <p><code>generic-setf</code> is a complementary function that allows defining setter operations for <code>generic-elt</code>. There exists a built-in protocol to make Lisp aware that <code>generic-setf</code> should be called whenever <code>:=</code> (or the standard <code>setf</code>) is invoked for the value accessed with <code>?</code>: <code>(defsetf ? generic-setf)</code>.</p> <p>It is also common to retrieve all keys or values of the kv, which is handled in a generic way by the <code>keys</code> and <code>vals</code> RUTILS functions.</p> <p>Key-values are not sequences in the sense that they are not necessarily ordered, although some variants are. But even unordered kvs may be traversed in some random order.
Iterating over kvs is another common and essential operation. In Lisp, as we already know, there are two complementary iteration patterns: the functional <code>map-</code> and the imperative <code>do</code>-style. RUTILS provides both of them as <code>mapkv</code> and <code>dokv</code>, although I'd recommend first considering the macro <code>dotable</code> that is specifically designed to operate on hash-tables.</p> <p>Finally, another common necessity is the transformation between different kv representations, primarily, between hash-tables and lists of pairs, which is also handled by RUTILS with its <code>ht->pairs</code>/<code>ht->alist</code> and <code>pairs->ht</code>/<code>alist->ht</code> functions.</p> <p>As you see, the authors of the Lisp standard library didn't envision a generic key-value access protocol, and so it is implemented completely in a 3rd-party addon. Yet, what's most important is that the building blocks for doing that were provided by the language, so this case shows the critical importance that these blocks (primarily, CLOS generic functions) have in future-proofing the language's design.</p> <h2 id="memoization">Memoization</h2> <p>One of the major use cases for key-values is memoization — storing the results of previous computations in a dedicated table (<strong>cache</strong>) to avoid recalculating them. Memoization is one of the main optimization techniques; I'd even say the default one. Essentially, it trades space for speed. And the main issue is that space is also limited so memoization algorithms are geared towards optimizing its usage to retain the most relevant items, i.e. maximize the probability that the items in the cache will be reused.</p> <p>Memoization may be performed explicitly, in an ad-hoc fashion: just set up some key scheme and a table to store the results and add/retrieve/remove the items as needed. It can also be delegated to the language machinery, in the implicit form.
For instance, Python provides the <code>@lru_cache</code> decorator (and similar library-provided annotations exist for Java): once it is applied to the function definition, each call to it will pass through the assigned cache using the call arguments as the cache keys. This is how the same feature may be implemented in Lisp, in the simplest fashion:</p> <pre><code>(defun start-memoizing (fn)<br /> (stop-memoizing fn)<br /> (:= (symbol-function fn)<br /> (let ((table (make-hash-table :test 'equal))<br /> (vanilla-fn (symbol-function fn)))<br /> (:= (get fn :cache) table<br /> (get fn :fn) vanilla-fn)<br /> (lambda (&rest args)<br /> (getset# (format nil "~{~A~^|~}" args)<br /> table<br /> (apply vanilla-fn args))))))<br /><br />(defun stop-memoizing (fn)<br /> (when (get fn :fn)<br /> (:= (symbol-function fn) (get fn :fn)<br /> (get fn :fn) nil)))<br /><br />CL-USER> (defun foo (x)<br /> (sleep 5)<br /> x) <br />CL-USER> (start-memoizing 'foo)<br />CL-USER> (time (foo 1))<br />Evaluation took:<br /> 5.000 seconds of real time<br />CL-USER> (time (foo 1))<br />Evaluation took:<br /> 0.000 seconds of real time<br />CL-USER> (time (foo 2))<br />Evaluation took:<br /> 5.001 seconds of real time<br /></code></pre> <p>We use a hash-table to store the memoized results. The <code>getset#</code> macro from RUTILS tries to retrieve the item from the table by key and, if it's not present there, performs the calculation given as its last argument returning its result while also storing it in the table at key. Another useful Lisp feature utilized in this facility is called "symbol plist": every symbol has an associated key-value plist. Items in this plist can be retrieved using the <code>get</code> operator.<a href="#f5-1" name="r5-1">[1]</a></p> <p>This approach is rather primitive and has a number of drawbacks. First of all, the hash-table is not limited in capacity. Thus if it is used carelessly, a memory-leak is inevitable.
Another possible issue may occur with the keys, which are determined by simply concatenating the string representations of the arguments — possibly, non-unique. Such a bug may be very subtle and hard to catch. Overall, memoization is the source of implicit behavior that always poses potential trouble but sometimes is just necessary. A more nuanced solution will allow us to configure both how the keys are calculated and various parameters of the cache, which we'll discuss next. One more possible decision to make might be about what to cache and what not: for example, we could add a time measurement around the call to the original function and cache the results only when it exceeds a predefined limit.</p> <h2 id="cacheinvalidation">Cache Invalidation</h2> <p>The problem of cache invalidation arises when we set some limit on the size of the cache. Once it is full — and a properly set up cache should be full, effectively, all the time — we have to decide which item to remove (evict) when we need to put a new one in the cache. I've already mentioned the saying that (alongside naming things) it is the hardest challenge in computer science. In fact, it's not, it's rather trivial, from the point of view of algorithms. The hard part is defining the notion of relevance. There are two general approximations which are used unless there are some specific considerations: frequency of access or time of last access. Let's see the algorithms built around these. Each approach uses some additional data stored with each key. The purpose of the data is to track one of the properties, i.e., either frequency of access or time of last access.</p> <h3 id="secondchanceandclockalgorithms">Second Chance and Clock Algorithms</h3> <p>The simplest approach to cache invalidation, apart from random eviction, may be utilized when we are severely limited in the amount of additional space we can use per key. Usually, this situation is typical for hardware caches.
The minimal possible amount of information to store is 1 bit. If that is all the space we have, the only option is to use it as a flag indicating whether the item was accessed again after it was put into the cache. This technique is very fast and very simple. And it improves cache performance to some extent. There are two ways of tracking this bit efficiently:</p> <ol><li>Just use a bit vector (usually called "bitmap", in such context) of the same length as the cache size. To select the item for eviction, find the first 0 from the left or right. With the help of one of the hardware instructions from the bit scan family (<code>ffs</code> — find first set, <code>clz</code> — count leading zeros, <code>ctz</code> — count trailing zeros, etc.), this operation can be blazingly fast. In Lisp, we could use the high-level function <code>position</code>: <pre><code>(defun find-candidate-second-chance (bitmap)<br /> (declare (type bit-vector bitmap))<br /> (position 0 bitmap))<br /></code></pre> <p>The type declaration is necessary for the implementation to emit the appropriate machine instruction. If you're not confident in that, just disassemble the function and look at the generated machine code:</p> <pre><code>CL-USER> (disassemble 'find-candidate-second-chance)<br />; disassembly for FIND-CANDIDATE-SECOND-CHANCE<br />; Size: 228 bytes. Origin: #x103A8E42F0<br />...<br />; 340: B878D53620 MOV EAX, #x2036D578 ; #<FDEFN SB-KERNEL:%BIT-POSITION/0><br />...<br /></code></pre> <p>So, SBCL uses <code>sb-kernel:%bit-position/0</code>, nice. If you look inside this function, though, you'll find out that it's also pretty complicated.
And, overall, there are lots of other assembler instructions in this piece, so if our goal is squeezing the last bit out of it there's more we can do:</p> <ul><li>Force the implementation to optimize for speed: put <code>(declaim (optimize (speed 3) (debug 0) (safety 1)))</code> at the top of the file with the function definition or use <code>proclaim</code> in the REPL with the same declarations.</li> <li>Use the low-level function <code>sb-kernel:%bit-position/0</code> directly.</li> <li>Go even deeper and use the machine instruction directly — SBCL allows that as well: <code>(sb-vm::%primitive sb-vm::unsigned-word-find-first-bit x)</code>. But this will be truly context-dependent (on the endianness, hardware architecture, and the size of the bit vector itself, which should fit into a machine word for this technique to work).</li></ul> <p>However, there's one problem with the function <code>find-candidate-second-chance</code>: if all the bits are set it will return nil. By selecting the first element (or even better, some random element), we can fix this problem. Still, eventually, we'll end up with all elements of the bitmap set to 1, so the method will degrade to simple random choice. It means that we need to periodically reset the bit vector. Either on every eviction — this is a good strategy if we happen to hit the cache more often than miss. Or after some number of iterations. Or after every bit is set to 1.</p> </li><li>Another method for selecting a candidate to evict is known as the Clock algorithm. It keeps examining the visited bit of each item, in a cycle: if it's equal to 1 reset it and move to the next item; if it's 0 — select the item for eviction. Basically, it's yet another strategy for dealing with the saturation of the bit vector. 
Here's how it may be implemented in Lisp with the help of the <strong>closure pattern</strong>: the function keeps track of its internal state, using a lexical variable that is only accessible from inside the function, and that has a value that persists between calls to the function. The closure is created by the <code>let</code> block and the variable closed over is <code>i</code>, here:</li> <pre><code>(let ((i 0))<br /> (defun find-candidate-clock (bitmap)<br /> (declare (type simple-bit-vector bitmap))<br /> (loop :with len := (length bitmap)<br /> :until (zerop (sbit bitmap i))<br /> :do (:= (sbit bitmap i) 0)<br /> (:+ i)<br /> (when (= i len)<br /> (:= i 0)))<br /> i))<br /></code></pre> <p>Our loop is guaranteed to find the zero bit at least after we cycle over all the elements and return to the first one that we have set to zero ourselves. Obviously, here and in other places where it is not stated explicitly, we're talking about single-threaded execution only.</p></ol> <h3 id="lfu">LFU</h3> <p>So, what if we don't have such a serious restriction on the size of the access counter? In this case, a similar algorithm that uses a counter instead of a flag will be called least frequently used (LFU) item eviction. There is one problem though: the access counter will only grow over time, so some items that were heavily used during some period will never be evicted from the cache, even though they may never be accessed again. To counteract this accumulation property, which is similar to the bitmap saturation we've seen in the previous algorithm, a similar measure can be applied. Namely, we'll have to introduce some notion of epochs, which reset or diminish the value of all counters. The most common approach to epochs is to right shift each counter, i.e. divide by 2. This strategy is called <strong>aging</strong>. An LFU cache with aging may be called LRFU — least frequently and recently used.</p> <p>As usual, the question arises, how often to apply aging.
The answer may be context-dependent and dependent on the size of the access counter. For instance, usually, a 1-byte counter, which can distinguish between 256 access operations, will be good enough, and it rarely makes sense to use a smaller one as most hardware operates in byte-sized units. The common strategies for aging may be:</p> <ul><li>periodically with an arbitrarily chosen interval — which should be enough to accumulate some number of changes in the counters but not to overflow them</li> <li>after a certain number of cache access operations. Such an approach may ensure that the counter doesn't overflow: say, if we use a 1-byte counter and age after every 128 access operations, the counter can never overflow — after aging it is at most 127, and at most 128 increments may follow before the next aging, so it never exceeds 255. Or we could perform the shift after 256 operations and still ensure lack of overflows with high probability</li></ul> <h3 id="lru">LRU</h3> <p>An alternative approach to LFU is LRU — evict the item that was used the longest time ago. LRU means that we need to store either last-access timestamps or some generation/epoch counters. Another possibility is to utilize access counters, similar to the ones that were used for LFU, except that we initialize them by setting all bits to 1, i.e. to the maximum possible value (255 for a 1-byte counter). The counters are decremented, on each cache access, simultaneously for all items except for the item being accessed. The benefit of such an approach is that it doesn't require accessing any external notion of time making the cache fully self-contained, which is necessary for some hardware implementations, for instance. The only thing to remember is not to decrement the counter beyond 0 :)</p> <p>Unlike LFU, this strategy can't distinguish between a heavily-accessed item and a sparingly-accessed one.
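</p> <p>The counter-based LRU variant just described may be sketched as follows (a minimal illustration using a vector of 1-byte counters and the standard <code>setf</code>, so it runs without RUTILS):</p> <pre><code>(defun lru-access (counters i)<br /> "Register an access to item I: decrement all the other counters,<br /> not going below 0."<br /> (dotimes (j (length counters))<br /> (unless (= j i)<br /> (setf (aref counters j)<br /> (max 0 (1- (aref counters j)))))))<br /><br />(defun lru-victim (counters)<br /> "The item with the smallest counter was accessed longest ago."<br /> (position (reduce 'min counters) counters))<br /></code></pre> <p>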
So, in the general case, I'd say that LFU with aging (LRFU) should be the default approach, although its implementation is slightly more complex.</p> <h2 id="memoizationinactiontranspositiontables">Memoization in Action: Transposition Tables</h2> <p>Transposition tables are a characteristic example of the effective usage of memoization, which comes from classic game AI. But the same approach may be applied in numerous other areas with lots of computation paths that converge and diverge at times. We'll return to similar problems in the last third of this book.</p> <p>In such games as chess, the same position may be reached via a great variety of move sequences. The different sequences leading to the same position are called transpositions, and it is obvious that, regardless of how we reached a certain position, if we have already analyzed it previously, we don't need to repeat the analysis when it reappears. So, caching the results allows us to save a lot of redundant computation. However, the number of positions, in chess, that come up during the analysis is huge, so we don't stand a chance of remembering all of them. In this case, a good predictor of the chance that a position will occur again is the number of times it has occurred in the past. For that reason, an appropriate caching technique, in this context, is plain LFU. But there's more: yet another measure of the value of a certain position is how early it occurs in the game tree (since the number of possible developments from it is larger). So, classic LFU should be mixed with this temporal information, yielding a domain-specific caching approach. And the parameters for combining the two measures are subject to empirical evaluation and research.</p> <p>There's much more to transposition tables than mentioned in this short introduction. 
For instance, the keys describing a position may need to include additional information if the history of how the position was reached impacts the further course of the game (the castling and repetition rules). Here's, also, a quote from Wikipedia on their additional use in another common chess-playing algorithm:</p> <blockquote> <p>The transposition table can have other uses than finding transpositions. In alpha-beta pruning, the search is fastest (in fact, optimal) when the child of a node corresponding to the best move is always considered first. Of course, there is no way of knowing the best move beforehand, but when iterative deepening is used, the move that was found to be the best in a shallower search is a good approximation. Therefore this move is tried first. For storing the best child of a node, the entry corresponding to that node in the transposition table is used.</p></blockquote> <h2 id="lowlevelcaching">Low-Level Caching</h2> <p>So, memoization is the primary tool for algorithm optimization, and the lower we descend into our computing platform the more this fact becomes apparent. For hardware, it is, basically, the only option. There are many caches in the platform that act behind the scenes, but which have a great impact on the actual performance of your code: the CPU caches, the disk cache, the page cache, and other OS caches. The main issue, here, is the lack of transparency into their operation and sometimes even the lack of awareness of their existence. This topic is, largely, beyond the scope of our book, so if you want to learn more, there's a well-known talk <a href="https://www.infoq.com/presentations/click-crash-course-modern-hardware/">"A Crash Course in Modern Hardware"</a> and an accompanying list of <a href="https://gist.github.com/jboner/2841832">"Latency Numbers Every Programmer Should Know"</a> that you can start with. 
Here, I can provide only a brief outline.</p> <p>The most important cache in the system is the CPU cache — or, rather, in most of the modern architectures, a system of 2 or 3 caches. There's an infamous <strong>von Neumann bottleneck</strong> in the conventional computer hardware design: the CPU works roughly 2 orders of magnitude faster than it can fetch data from memory. Last time I checked, the numbers were: one memory transfer took around 250-300 CPU cycles, i.e. around 300 additions or other primitive instructions could be run during that time. And the problem is that CPUs operate only on data that they get from memory, so if the bottleneck didn't exist at all, theoretically, we could have 2 orders of magnitude faster execution. Fortunately, the degradation in performance is not so drastic, thanks to the use of CPU caches: only around an order of magnitude. The cache transfer numbers are the following: from the L1 (the fastest and hence smallest) cache — around 5 cycles, from L2 — 20-30 cycles, from L3 — 50-100 cycles (that's why L3 is not always used, as it's almost on par with the main memory). Why do I say that fastest means smallest? Just because fast access memory is more expensive and requires more energy. Otherwise, we could just make all RAM as fast as the L1 cache.</p> <p>How do these caches operate? This is one of the things that every algorithmic programmer should know, at least, in general. Even if some algorithm seems good on paper, a more cache-friendly one with worse theoretical properties may very well outperform it.</p> <p>The CPU cache temporarily stores contents of the memory cells (memory words) indexed by their addresses. It operates not on single cells but on sequential blocks of those — the so-called cache lines; most caches are also set-associative, meaning that each memory block may be placed only into a small set of cache slots. A typical L1 cache of 32KB with 64-byte cache lines will store 512 such blocks, each one holding 8 8-byte words. 
This approach is oriented towards the normal sequential layout of executable code, structures, and arrays — the majority of the memory contents. And it matches the corresponding common memory access pattern — sequential. I.e., after reading one memory cell, usually, the processor will move on to the next: either because it's the next instruction to execute or the next item in the array being iterated over. That's why so much importance in program optimization folklore is given to <strong>cache alignment</strong>, i.e. structuring the program's memory so that the things commonly accessed together will fit into the same cache line. One example of this principle is the padding of structures with zeroes to align their size to be a multiple of 32 or 64. The same applies to code padding with <code>nop</code>s. And this is another reason why arrays are a preferred data structure compared to linked lists: when the whole contents fit in the same cache line, their processing performance is blazingly fast. The catch, though, is that it's practically impossible for normal programmers to directly observe how the CPU cache interoperates with their programs. There are few tools to make it transparent, so what remains is to rely on the general principles, second-guessing, and trial & error.</p> <p>Another interesting choice for hardware (and some software) caches is write-through versus write-back behavior. The question is how the cache deals with cached data being modified:</p> <ul><li>either the modifications will be immediately stored to the main storage, effectively, making the whole operation longer</li> <li>or they may, first, be persisted to the cache only, while writing to the backing store (synchronization) is performed for all the data in the cache at configured intervals</li></ul> <p>The second option is faster as there's a smaller number of expensive round-trips, but it is less resilient to failure. 
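<p>To illustrate the second option, here is a minimal write-back cache sketch in Python (the names are illustrative, not from any real API; a plain dict stands in for the slow backing store):</p>

```python
class WriteBackCache:
    """A toy write-back cache: writes land in the cache and are marked dirty;
    they reach the backing store only on an explicit sync."""

    def __init__(self, backing_store):
        self.backing = backing_store  # e.g. a dict standing in for the disk
        self.cache = {}
        self.dirty = set()

    def read(self, key):
        if key not in self.cache:
            # Cache miss: one slow round-trip to the backing store.
            self.cache[key] = self.backing.get(key)
        return self.cache[key]

    def write(self, key, value):
        # Fast: no round-trip to the backing store.
        self.cache[key] = value
        self.dirty.add(key)

    def sync(self):
        # The expensive round-trips happen in one batch, at configured
        # intervals or on explicit request ("safely remove hardware").
        for key in self.dirty:
            self.backing[key] = self.cache[key]
        self.dirty.clear()
```

<p>Between syncs, the backing store and the cache disagree — which is exactly the failure window the write-back strategy trades for speed.</p>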
A good example of the write-back cache in action is the origin of the Windows "Safely remove hardware" option. The underlying assumption is that the data to be written to the flash drive passes through the OS cache, which may be configured in the write-back fashion. In this case, a forced sync is required before disconnecting the device to ensure that the latest version of the cached data is saved to it.</p> <p>Another example of caching drastically impacting performance, which everyone is familiar with, is paging or swapping — an operation performed by the operating system. When the executing programs together require more (virtual) memory than the size of the RAM that is physically available, the OS saves some of the pages of data that these programs use to a place on disk known as the swap section.</p> <p>A few points we can take away from this chapter:</p> <ol><li>Key-values are very versatile and widely-used data structures. Don't limit your understanding of them to a particular implementation choice made by the designers of the programming language you're currently using.</li> <li>Trading space for time is, probably, the most widespread and impactful algorithmic technique.</li> <li>Caching, which is a direct manifestation of this technique and one of the main applications of key-value data structures, is one of the principal factors impacting program performance, on a large scale. It may be utilized by the programmer in the form of memoization, and will also inevitably be used by the underlying platform, in ways that are hard to control and predict. 
The area of program optimization for efficient hardware utilization represents a distinct set of techniques, requiring skills that are obscure and also not fully systematized.</li></ol> <hr size="1"><p>Footnotes:</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r5-1" name="f5-1">[1]</a> Symbol plists represent one of the unpleasant legacy features of the language, in that the most obvious accessor name, namely <code>get</code>, is reserved for working with symbols. Therefore, this name cannot be used for accessing other kinds of data. Historically, symbol plists were the first and only variant of key-values available in the language (at that time, the other languages didn't have the slightest idea of such a high-level concept).</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-3682337440627104352019-08-29T11:10:00.002+03:002019-08-29T11:33:48.929+03:00RUTILS 5.0 and Tutorial<img src="https://github.com/vseloved/rutils/raw/master/docs/logo.jpg" width="300"/><p>RUTILS is my take on the Lisp "modernization" effort that adds the missing syntactic and data structure pieces, which have become proven and well-established, in Lisp itself or in other languages. The programming field is constantly developing while the Lisp standard remains fixed, so some additions, over time, are not only desirable but inevitable, if we don't want to lag behind. Thankfully, Lisp provides all the necessary means for implementing them and so, with some creativity, there's a way to have access to almost anything you want and need while retaining full backward compatibility (a lack of which is the most critical problem of some alternative solutions). <p>I, surely, understand that using such an extension remains a matter of taste and not every Lisper will like it. 
I didn't try to seriously promote it and was quite satisfied with the benefit that it provided to me and my teams' development. However, as I decided to use it for the <a href="http://lisp-univ-etc.blogspot.com/2019/07/programming-algorithms-book.html">"Programming Algorithms" book</a>, it received some attention and a number of questions. From the direction of the discussions, I realized that the docs were lacking a critical part — a tutorial explaining how to effectively use the library. This text is intended to bridge that gap. I had to finish it before publishing the next chapter of the book, which I'll do on Friday. <p>So, today, version 5 of RUTILS is released alongside the <a href="https://github.com/vseloved/rutils/blob/master/docs/tutorial.md">tutorial</a> that aims to better explain its usage. <script src="https://gist.github.com/vseloved/9c6e36f2fa89f4accf3f3cbc371b3ce1.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-54091898401375240482019-08-19T15:32:00.001+03:002019-08-21T13:55:13.243+03:00Programming Algorithms: Linked Lists<p>Linked data structures are in many ways the opposite of the contiguous ones that we have explored to some extent in the previous chapter using the example of arrays. In terms of complexity, they fail where those ones shine (first of all, at random access) — but prevail in scenarios where repeated modification is necessary. In general, they are much more flexible and so allow the programmer to represent almost any kind of data structure, although the ones that require such a level of flexibility may not be too frequent. Usually, they are specialized trees or graphs.</p> <p>The basic linked data structure is a singly-linked list. 
</p> <p><a href="https://4.bp.blogspot.com/-s1EnynikJWI/XVqUcv4GFhI/AAAAAAAACJE/cBFOEuqxc18XSGOCx8E8mQlk2NUppizyQCLcBGAs/s1600/list.jpg" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-s1EnynikJWI/XVqUcv4GFhI/AAAAAAAACJE/cBFOEuqxc18XSGOCx8E8mQlk2NUppizyQCLcBGAs/s320/list.jpg" width="320" height="185" data-original-width="1600" data-original-height="926" /></a></p> <p>Just like arrays, lists in Lisp may be created both with a literal syntax for constants and by calling a function — <code>make-list</code> — that creates a list of a certain size filled with <code>nil</code> elements. Besides, there's a handy <code>list</code> utility that is used to create lists with the specified content (the analog of <code>vec</code>).</p> <pre><code>CL-USER> '("hello" world 111)<br />("hello" WORLD 111)<br />CL-USER> (make-list 3)<br />(NIL NIL NIL)<br />CL-USER> (list "hello" 'world 111)<br />("hello" WORLD 111)<br /></code></pre> <p>An empty list is represented as <code>()</code> and, interestingly, in Lisp, it is also a synonym of logical falsehood (<code>nil</code>). This property is used very often, and we'll have a chance to see that.</p> <p>If we were to introduce our own lists, which may be quite a common scenario in case the built-in ones' capabilities do not suit us, we'd need to define the structure "node", and our list would be built as a chain of such nodes. We might have wanted to store the list head and, possibly, tail, as well as other properties like size. 
All in all, it would look like the following:</p> <pre><code>(defstruct list-cell<br /> data<br /> next)<br /><br />(defstruct our-own-list<br /> (head nil :type (or list-cell null))<br /> (tail nil :type (or list-cell null)))<br /><br />CL-USER> (let ((tail (make-list-cell :data "world")))<br /> (make-our-own-list<br /> :head (make-list-cell<br /> :data "hello"<br /> :next tail)<br /> :tail tail))<br />#S(OUR-OWN-LIST<br /> :HEAD #S(LIST-CELL<br /> :DATA "hello"<br /> :NEXT #S(LIST-CELL :DATA "world" :NEXT NIL))<br /> :TAIL #S(LIST-CELL :DATA "world" :NEXT NIL))<br /></code></pre> <h2 id="listsassequences">Lists as Sequences</h2> <p>Alongside arrays, list is the other basic data structure that implements the sequence abstract data type. Let's consider the complexity of basic sequence operations for linked lists:</p> <ul><li>so-called random access, i.e. access by index of a random element, requires <code>O(n)</code> time as we have to traverse all the preceding elements before we can reach the desired one (<code>n/2</code> operations on average)</li> <li>yet, once we have reached some element, removing it or inserting something after it takes <code>O(1)</code></li> <li>subsequencing is also <code>O(n)</code></li></ul> <p>Getting the list length, in the basic case, is also <code>O(n)</code> i.e. it requires full list traversal. It is possible, though, to store list length as a separate slot, tracking each change on the fly, which means <code>O(1)</code> complexity. Lisp, however, implements the simplest variant of lists without size tracking. This is an example of a small but important decision that real-world programming is full of. Why is such a solution the right thing™, in this case? 
Adding the size counter to each list would have certainly made this common <code>length</code> operation more effective, but the cost of doing that would've included: an increase in occupied storage space for all lists, a need to update the size in all list modification operations, and, possibly, a need for a more complex cons cell implementation<a href="#f4-1" name="r4-1">[1]</a>. These considerations make the situation with lists almost opposite to arrays, for which size tracking is quite reasonable because they change much less often and not tracking the length historically proved to be a terrible security decision. So, which side to choose? A default approach is to prefer the solution which doesn't completely rule out the alternative strategy. If we were to choose a simple cons cell sans size (what the authors of Lisp did), we'd always be able to add a "smart" list data structure with the size field on top of it. Yet, stripping the size field from built-in lists wouldn't be possible. Similar reasoning is also applicable to other questions, such as: why aren't lists, in Lisp, doubly-linked? Also, it helps that there's no security implication, as lists aren't used as data exchange buffers, for which the problem manifests itself. </p> <p>For demonstration, let's add the size field to <code>our-own-list</code> (and, meanwhile, consider all the functions that will need to update it...):</p> <pre><code>(defstruct our-own-list<br /> (head nil :type (or list-cell null))<br /> (tail nil :type (or list-cell null))<br /> (size 0 :type (integer 0)))<br /></code></pre> <p>Given that obtaining the length of a list, in Lisp, is an expensive operation, a common pattern in programs that require multiple requests of the length field is to store its value in some variable at the beginning of the algorithm and then use this cached value, updating it if necessary.</p> <p>As we see, lists are quite inefficient in random access scenarios. 
However, many sequences don't require random access and can satisfy all the requirements of a particular use case using just the sequential one. That's one of the reasons why they are called sequences, after all. And if we consider the special case of list operations at index 0, they are, obviously, efficient: both access and addition/removal are <code>O(1)</code>. Also, if the algorithm requires a sequential scan, list traversal is rather efficient too, although not as good as array traversal, as it still requires chasing pointers through memory. There are numerous sequence operations that are based on sequential scans. The most common is <code>map</code>, which we analyzed in the previous chapter. It is the functional programming alternative to looping, a more high-level operation, and thus simpler to understand for the common cases, although less versatile.</p> <p><code>map</code> is a function that works with different types of built-in sequences. It takes as the first argument the target sequence type (if <code>nil</code> is supplied, it won't create the resulting sequence and so will be used just for side-effects). Here is a polymorphic example involving lists and vectors:</p> <pre><code>CL-USER> (map 'vector '+<br /> '(1 2 3 4 5)<br /> #(1 2 3))<br />#(2 4 6)<br /></code></pre> <p><code>map</code> applies the function provided as its second argument (here, addition) sequentially to every element of the sequences that are supplied as other arguments, until one of them ends, and records the result in the output sequence. <code>map</code> would have been even more intuitive if it had just used the type of the first argument for the result sequence, i.e. been a "do what I mean" <code>dwim-map</code>, while a separate advanced variant with result-type selection might have been used in the background. 
Unfortunately, the current standard scheme is not going to change, but we can define our own wrapper function:</p> <pre><code>(defun dwim-map (fn seq &rest seqs)<br /> "A thin wrapper over MAP that uses the first SEQ's type for the result."<br /> (apply 'map (type-of seq) fn seq seqs))<br /></code></pre> <p><code>map</code> in Lisp is, historically, used for lists. So there's also a number of list-specific map variants that predated the generic <code>map</code>, in the earlier versions of the language, and are still in wide use today. These include <code>mapcar</code>, <code>mapc</code>, and <code>mapcan</code> (replaced in RUTILS by a safer <code>flat-map</code>). Now, let's see a couple of examples of using mapping. Suppose that we'd like to extract odd numbers from a list of numbers. Using <code>mapcar</code> as a list-specific <code>map</code> we might try to call it with an anonymous function that tests its argument for oddity and keeps it in that case:</p> <pre><code>CL-USER> (mapcar (lambda (x) (when (oddp x) x))<br /> (range 1 10))<br />(1 NIL 3 NIL 5 NIL 7 NIL 9)<br /></code></pre> <p>However, the problem is that non-odd numbers still have their place reserved in the result list, although it is not filled by them. Keeping only the results that satisfy (or don't) certain criteria and discarding the others is a very common pattern that is known as "filtering". There's a set of Lisp functions for such scenarios: <code>remove</code>, <code>remove-if</code>, and <code>remove-if-not</code>, as well as RUTILS' complements to them, <code>keep-if</code> and <code>keep-if-not</code>. We can achieve the desired result by adding <code>remove</code> to the picture:</p> <pre><code>CL-USER> (remove nil (mapcar (lambda (x) (when (oddp x) x))<br /> (range 1 10)))<br />(1 3 5 7 9)<br /></code></pre> <p>A more elegant solution will use the <code>remove-if(-not)</code> or <code>keep-if(-not)</code> variants. <code>remove-if-not</code> is the most popular among these functions. 
It takes a predicate and a sequence and returns the sequence of the same type holding only the elements that satisfy the predicate:</p> <pre><code>CL-USER> (remove-if-not 'oddp (range 1 10))<br />(1 3 5 7 9)<br /></code></pre> <p>Using such high-level mapping functions is very convenient, which is why there's a number of other <code>-if(-not)</code> operations, like <code>find(-if(-not))</code>, <code>member(-if(-not))</code>, <code>position(-if(-not))</code>, etc.</p> <p>The implementation of <code>mapcar</code> or any other list mapping function, including your own task-specific variants, follows the same pattern of traversing the list accumulating the result into another list and reversing it, in the end:</p> <pre><code>(defun simple-mapcar (fn list)<br /> (let ((rez ()))<br /> (dolist (item list)<br /> (:= rez (cons (call fn item) rez)))<br /> (reverse rez)))<br /></code></pre> <p>The function <code>cons</code> is used to add an item to the beginning of the list. It creates a new list head that points to the previous list as its tail.</p> <p>From the complexity point of view, if we compare such iteration with looping over an array we'll see that it is also a linear traversal that requires twice as many operations as with arrays because we need to traverse the result fully once again, in the end, to reverse it. Its advantage, though, is higher versatility: if we don't know the size of the resulting sequence (for example, in the case of <code>remove-if-not</code>) we don't have to change anything in this scheme and just add a filter line (<code>(when (oddp item) ...</code>), while for arrays we'd either need to use a dynamic array (that will need constant resizing and so have at least the same double number of operations) or pre-allocate the full-sized result sequence and then downsize it to fit the actual accumulated number of elements, which may be problematic when we deal with large arrays. 
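<p>The same accumulate-and-reverse scheme, with the filter line added, may be sketched in Python using nested pairs as cons cells (illustrative code of my own, not any library's API):</p>

```python
def keep_if(pred, items):
    """Accumulate matching items at the front of a linked list, then
    reverse once at the end -- the same scheme as simple-mapcar,
    with a filter line added."""
    rez = None  # a singly-linked list built of (head, tail) pairs
    for item in items:
        if pred(item):
            rez = (item, rez)  # "cons" onto the front: O(1)
    # One extra full traversal to restore the original order.
    out = None
    while rez is not None:
        head, rez = rez
        out = (head, out)
    return out

def to_pylist(cell):
    """Helper: turn the (head, tail) chain into a Python list for display."""
    acc = []
    while cell is not None:
        acc.append(cell[0])
        cell = cell[1]
    return acc
```

<p>Note that no up-front size estimate is needed: the linked result simply grows by one cons per kept item.</p>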
</p> <h2 id="listsasfunctionaldatastructures">Lists as Functional Data Structures</h2> <p>The distinction between arrays and linked lists in many ways reflects the distinction between the imperative and functional programming paradigms. Within the imperative or, in this context, procedural approach, the program is built out of low-level blocks (conditionals, loops, and sequentials) that allow for the most fine-tuned and efficient implementation, at the expense of abstraction level and modularization capabilities. It also heavily utilizes in-place modification and manual resource management to keep overhead at a minimum. An array is the most suitable data-structure for such a way of programming. Functional programming, on the contrary, strives to bring the abstraction level higher, which may come at a cost of sacrificing efficiency (only when necessary, and, ideally, only for non-critical parts). Functional programs are built by combining referentially transparent computational procedures (aka "pure functions") that operate on more advanced data structures (either persistent ones or having special access semantics, e.g. transactional) that are also more expensive to manage but provide additional benefits.</p> <p>Singly-linked lists are a simple example of functional data structures. A <strong>functional</strong> or <strong>persistent</strong> data structure is the one that doesn't allow in-place modification. In other words, to alter the contents of the structure a fresh copy with the desired changes should be created. The flexibility of linked data structures makes them suitable for serving as functional ones. We have seen the <code>cons</code> operation that is one of the earliest examples of non-destructive, i.e. functional, modification. 
This action prepends an element to the head of a list, and as we're dealing with a singly-linked list, the original doesn't have to be updated: a new cons cell is added in front of it with its <code>next</code> pointer referencing the original list that becomes the new tail. This way, we can preserve both the pointer to the original head and add a new head. Such an approach is the basis for most of the functional data structures: the functional trees, for example, add a new head and a new route from the head to the newly added element, adding new nodes along the way — according to the same principle.</p> <p>It is interesting, though, that lists can be used in destructive and non-destructive fashion alike. There are both low- and high-level functions in Lisp that perform list modification, and their existence is justified by the use cases in many algorithms. Purely functional lists render many of the efficient list algorithms useless. One of the high-level list modification functions is <code>nconc</code>. 
It concatenates two lists together updating in the process the <code>next</code> pointer of the last cons cell of the first list:</p> <pre><code>CL-USER> (let ((l1 (list 1 2 3))<br /> (l2 (list 4 5 6)))<br /> (nconc l1 l2) ; note no assignment to l1<br /> l1) ; but it is still changed<br />(1 2 3 4 5 6)<br /></code></pre> <p>There's a functional variant of this operation, <code>append</code>, and, in general, it is considered distasteful to use <code>nconc</code> for two reasons:</p> <ul><li>the risk of unwarranted modification</li> <li>funny enough, the implementation of <code>nconc</code>, actually, isn't mandated to be more efficient than that of <code>append</code></li></ul> <p>So, forget <code>nconc</code>, <code>append</code> all the lists!</p> <p>Using <code>append</code> we'll need to modify the previous piece of code because otherwise the newly created list will be garbage-collected immediately:</p> <pre><code>CL-USER> (let ((l1 (list 1 2 3))<br /> (l2 (list 4 5 6)))<br /> (:= l1 (append l1 l2))<br /> l1)<br />(1 2 3 4 5 6)<br /></code></pre> <p>The low-level list modification operations are <code>rplaca</code> and <code>rplacd</code>. They can be combined with list-specific accessors <code>nth</code> and <code>nthcdr</code> that provide indexed access to list elements and tails respectively. Here's, for example, how to add an element in the middle of a list:</p> <pre><code>CL-USER> (let ((l1 (list 1 2 3)))<br /> (rplacd (nthcdr 0 l1)<br /> (cons 4 (nthcdr 1 l1)))<br /> l1)<br />(1 4 2 3)<br /></code></pre> <p>Just to re-iterate, although functional list operations are the default choice, for efficient implementation of some algorithms, you'll need to resort to the ugly destructive ones.</p> <h2 id="differentkindsoflists">Different Kinds of Lists</h2> <p>We have, thus far, seen the most basic linked list variant — a singly-linked one. It has a number of limitations: for instance, it's impossible to traverse it from the end to the beginning. 
Yet, there are many algorithms that require accessing the list from both sides or doing other things with it that are inefficient or even impossible with the singly-linked one; hence other, more advanced, list variants exist.</p> <p>But first, let's consider an interesting tweak to the regular singly-linked list — a circular list. It can be created from the normal one by making the last cons cell point to the first. It may seem like a problematic data structure to work with, but all the potential issues with infinite looping while traversing it are solved if we keep a pointer to any node and stop iteration when we encounter this node for the second time. What's the use for such a structure? Well, not so many, but there's a prominent one: the ring buffer. A ring or circular buffer is a structure that can hold a predefined number of items, and each new item is added to the slot after the current one. This way, when the buffer is completely filled it will wrap around to the first element, which will be overwritten at the next modification. With this filling scheme, the element to be overwritten is always the one that was written the earliest among the current items. Using a circular linked list is one of the simplest ways to implement such a buffer. Another approach would be to use an array of a certain size, moving the pointer to the next item by incrementing an index into the array. Obviously, when the index reaches the array size it should be reset to zero.</p> <p>A more advanced list variant is a doubly-linked one, in which all the elements have both the <code>next</code> and <code>previous</code> pointers. The following definition, using inheritance, extends our original <code>list-cell</code> with a pointer to the previous element. 
Thanks to the basic object-oriented capabilities of structs, it will work with the current definition of <code>our-own-list</code> as well, and allow it to function as a doubly-linked list.</p> <pre><code>(defstruct (list-cell2 (:include list-cell))<br /> prev)<br /></code></pre> <p>Yet, we still haven't shown the implementation of the higher-level operations of adding and removing an element to/from <code>our-own-list</code>. Obviously, they will differ for singly- and doubly-linked lists, and that distinction will require us to differentiate the doubly-linked list types. That, in turn, will demand invocation of rather heavy OO-machinery, which is beyond the subject of this book. Instead, for now, let's just examine the basic list addition function, for the doubly-linked list:</p> <pre><code>(defun our-cons2 (data list)<br /> (when (null list) (:= list (make-our-own-list)))<br /> (let ((new-head (make-list-cell2<br /> :data data<br /> :next @list.head)))<br /> (when @list.head<br /> (:= @list.head.prev new-head))<br /> (make-our-own-list<br /> :head new-head<br /> :tail (or @list.tail new-head)<br /> :size (1+ @list.size))))<br /></code></pre> <p>The first thing to note is the use of the <code>@</code> syntactic sugar, from RUTILS, that implements the mainstream dot notation for slot-value access (i.e. <code>@list.head.prev</code> refers to the <code>prev</code> field of the <code>head</code> field of the provided <code>list</code> structure of the assumed <code>our-own-list</code> type, which in more classically Lispy, although cumbersome, variants may look like one of the following: <code>(list-cell2-prev (our-own-list-head list))</code> or <code>(slot-value (slot-value list 'head) 'prev)</code><a href="#f4-2" name="r4-2">[2]</a>).</p> <p>More important here is that, unlike for the singly-linked list, this function requires an in-place modification of the head element of the original list: setting its <code>prev</code> pointer. 
This immediately makes doubly-linked lists non-persistent.</p> <p>Finally, the first line is the protection against trying to access a null list (that would result in the much-feared, especially in Java-land, null-pointer exception class of error).</p> <p>At first sight, it may seem that doubly-linked lists are more useful than singly-linked ones. But they also have higher overhead, so, in practice, they are used quite sporadically. We may see just a couple of use cases for them on the pages of this book. One of them is presented in the next part — a double-ended queue. </p> <p>Besides doubly-linked lists, there are also association lists that serve as a variant of key-value data structures. At least 3 types may be found in Common Lisp code, and we'll briefly discuss them in the chapter on key-value structures. Finally, a skip list is a probabilistic data structure based on singly-linked lists that allows for faster search; we'll discuss it in a separate chapter on probabilistic structures. Other, more esoteric, list variants, such as self-organizing lists and XOR-lists, may also be found in the literature — but very rarely, in practice.</p> <h2 id="fifolifo">FIFO & LIFO</h2> <p>The flexibility of lists allows them to serve as a common choice for implementing a number of popular abstract data structures.</p> <h3 id="queue">Queue</h3> <p>A queue or FIFO has the following interface:</p> <ul><li><code>enqueue</code> an item at the end</li> <li><code>dequeue</code> the first element: get it and remove it from the queue</li></ul> <p>It imposes a first-in-first-out (FIFO) ordering on the elements. A queue can be implemented directly with a singly-linked list like <code>our-own-list</code>.
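</p> <p>To make this concrete, here is a minimal sketch of such a list-based queue. The struct and function names are hypothetical, and plain <code>setf</code> is used instead of the RUTILS operators to keep the snippet self-contained: the idea is to keep a pointer to the first cons cell for dequeuing and to the last one for <code>O(1)</code> enqueuing at the end.</p>

```lisp
;; A hypothetical singly-linked-list queue sketch:
;; HEAD points to the first cons cell of the list, TAIL to the last one.
(defstruct list-queue
  head
  tail)

(defun list-enqueue (item queue)
  "Append ITEM at the end of the list in O(1) using the tail pointer."
  (let ((cell (cons item nil)))
    (if (list-queue-head queue)
        (setf (cdr (list-queue-tail queue)) cell)
        (setf (list-queue-head queue) cell))
    (setf (list-queue-tail queue) cell)
    item))

(defun list-dequeue (queue)
  "Remove and return the first item; the second value indicates success."
  (let ((cell (list-queue-head queue)))
    (when cell
      (setf (list-queue-head queue) (cdr cell))
      (unless (list-queue-head queue) ; the queue became empty
        (setf (list-queue-tail queue) nil))
      (values (car cell) t))))
```

<p>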
Obviously, it can also be built on top of a dynamic array, but that will require constant expansion and contraction of the collection, which, as we already know, isn't the preferred usage scenario for dynamic arrays.</p> <p>There are numerous uses for queue structures when processing items in a certain order (some of which we'll see in further chapters of this book).</p> <h3 id="stack">Stack</h3> <p>A stack or LIFO (last-in-first-out) is even simpler than a queue, and it is used even more widely. Its interface is:</p> <ul><li><code>push</code> an item on top of the stack, making it the first element</li> <li><code>pop</code> an item from the top: get it and remove it from the stack</li></ul> <p>A simple Lisp list can serve as a stack, and you can see such uses in almost every file with Lisp code. The most common pattern is result accumulation during iteration: using the stack interface, we can rewrite <code>simple-mapcar</code> in an even simpler way (which is idiomatic Lisp):</p> <pre><code>(defun simple-mapcar (fn list)<br /> (let ((rez ()))<br /> (dolist (item list)<br /> (push (call fn item) rez))<br /> (reverse rez)))<br /></code></pre> <p>Stacks hold elements in reverse-chronological order and can thus be used to keep the history of changes in order to be able to undo them. This feature is used by compilers in the procedure calling conventions: there exists a separate segment of program memory called the Stack segment, and when a function call happens (beginning from the program's entry point, called the <code>main</code> function in C) all of its arguments and local variables are put on this stack, as well as the return address in the program code segment where the call was initiated. Such an approach allows for the existence of local variables that last only for the duration of the call and are referenced relative to the current stack head, not bound to some absolute position in memory like the global ones.
After the procedure call returns, the stack is "unwound" and all the local data is forgotten, returning the context to the same state it was in before the call. Such stack-based history-keeping is a very common and useful pattern that may likewise be utilized in userland code.</p> <p>Lisp itself also uses this trick to implement global variables with a capability to have context-dependent values through the extent of <code>let</code> blocks: each such variable also has a stack of values associated with it. This is one of the most underappreciated features of the Lisp language, used quite often by experienced lispers. Here is a small example with a standard global variable (they are called <strong>special</strong> in Lisp parlance due to this special property) <code>*standard-output*</code> that stores a reference to the current output stream:</p> <pre><code>CL-USER> (print 1)<br />1<br />1<br />CL-USER> (let ((*standard-output* (make-broadcast-stream)))<br /> (print 1))<br />1<br /></code></pre> <p>In the first call to <code>print</code>, we see both the printed value and the returned one, while, in the second, only the return value of the <code>print</code> function, as its output is sent, effectively, to /dev/null.</p> <p>Stacks can also be used to implement queues. We'll need two of them to do that: one will be used for enqueuing the items and another — for dequeuing. Here's the implementation:</p> <pre><code>(defstruct queue<br /> head<br /> tail)<br /><br />(defun enqueue (item queue)<br /> (push item @queue.head))<br /><br />(defun dequeue (queue)<br /> ;; Here and in the next condition, we use the property that an empty list<br /> ;; is also logically false.
This is discouraged by many Lisp style-guides,<br /> ;; but, in many cases, such code is not only more compact but also more clear.<br /> (unless @queue.tail<br /> (do ()<br /> ((null @queue.head)) ; this loop continues until head becomes empty<br /> (push (pop @queue.head) @queue.tail)))<br /> ;; By pushing all the items from the head to the tail we reverse<br /> ;; their order — this is the second reversing that cancels the reversing<br /> ;; performed when we push the items to the head and it restores the original order.<br /> (when @queue.tail<br /> (values (pop @queue.tail)<br /> t))) ; this second value is used to indicate that the queue was not empty<br /> <br />CL-USER> (let ((q (make-queue)))<br /> (print q)<br /> (enqueue 1 q)<br /> (enqueue 2 q)<br /> (enqueue 3 q)<br /> (print q)<br /> (print q)<br /> (dequeue q)<br /> (print q)<br /> (enqueue 4 q)<br /> (print q)<br /> (dequeue q)<br /> (print q)<br /> (dequeue q)<br /> (print q)<br /> (dequeue q)<br /> (print q)<br /> (dequeue q))<br />#S(QUEUE :HEAD NIL :TAIL NIL) <br />#S(QUEUE :HEAD (3 2 1) :TAIL NIL) <br />#S(QUEUE :HEAD (3 2 1) :TAIL NIL) <br />#S(QUEUE :HEAD NIL :TAIL (2 3)) <br />#S(QUEUE :HEAD (4) :TAIL (2 3)) <br />#S(QUEUE :HEAD (4) :TAIL (3)) <br />#S(QUEUE :HEAD (4) :TAIL NIL) <br />#S(QUEUE :HEAD NIL :TAIL NIL) <br />NIL ; no second value indicates that the queue is now empty<br /></code></pre> <p>Such a queue implementation still has amortized <code>O(1)</code> operation times for <code>enqueue</code>/<code>dequeue</code>. Each element will experience exactly 4 operations: 2 <code>push</code>es and 2 <code>pop</code>s (one for the <code>head</code> and one for the <code>tail</code>).</p> <p>Another stack-based structure is the stack with a minimum element, i.e. a structure that not only holds elements in LIFO order but also keeps track of the minimum among them.
The challenge is that if we just add the <code>min</code> slot that holds the current minimum, when this minimum is <code>pop</code>ped out of the stack we'll need to examine all the remaining elements to find the new one. We can avoid this additional work by adding another stack — a stack of minimums. Now, each <code>push</code> and <code>pop</code> operation requires us to also check the head of this second stack and, in case the added/removed element is the minimum, <code>push</code> it to the stack of minimums or <code>pop</code> it from there, accordingly.</p> <p>A well-known algorithm that illustrates stack usage is the evaluation of fully-parenthesized arithmetic expressions:</p> <pre><code>(defun arith-eval (expr)<br /> "EXPR is a list of symbols that may include:<br /> square brackets, arithmetic operations, and numbers."<br /> (let ((ops ())<br /> (vals ())<br /> (op nil)<br /> (val nil))<br /> (dolist (item expr)<br /> (case item<br /> ([ ) ; do nothing<br /> ((+ - * /) (push item ops))<br /> (] (:= op (pop ops)<br /> val (pop vals))<br /> (case op<br /> (+ (:+ val (pop vals)))<br /> ;; for - and /, the argument popped second is the left operand<br /> (- (:= val (- (pop vals) val)))<br /> (* (:* val (pop vals)))<br /> (/ (:= val (/ (pop vals) val))))<br /> (push val vals))<br /> (otherwise (push item vals))))<br /> (pop vals)))<br /><br />CL-USER> (arith-eval '([ 1 + [ [ 2 + 3 ] * [ 4 * 5 ] ] ]))<br />101<br /></code></pre> <h3 id="deque">Deque</h3> <p>A deque is a short name for a double-ended queue, which can be traversed in both orders: FIFO and LIFO. It has 4 operations: <code>push-front</code> (called <code>unshift</code> in some languages) and <code>push-back</code>, <code>pop-front</code> (<code>shift</code>) and <code>pop-back</code>. This structure may be implemented with a doubly-linked list or, likewise, with 2 stacks, like the simple queue above.
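</p> <p>To illustrate the 2-stacks option, here is a minimal deque sketch. The names are hypothetical, and plain <code>setf</code> is used instead of the RUTILS operators so that the snippet is self-contained:</p>

```lisp
;; A hypothetical 2-stacks deque sketch: two plain Lisp lists serve as
;; the front and back stacks. When one side runs out, all the items are
;; moved over from the other side, which reverses their order.
(defstruct deque
  front
  back)

(defun push-front (item deque)
  (push item (deque-front deque)))

(defun push-back (item deque)
  (push item (deque-back deque)))

(defun pop-front (deque)
  (unless (deque-front deque) ; refill the front from the back
    (setf (deque-front deque) (nreverse (deque-back deque))
          (deque-back deque) nil))
  (pop (deque-front deque)))

(defun pop-back (deque)
  (unless (deque-back deque) ; refill the back from the front
    (setf (deque-back deque) (nreverse (deque-front deque))
          (deque-front deque) nil))
  (pop (deque-back deque)))
```

<p>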
The difference for the 2-stacks implementation is that now items may be pushed back and forth between the <code>head</code> and the <code>tail</code> depending on the direction we're <code>pop</code>ping from, which results in worst-case linear complexity of these operations when there's constant alternation of the front and back directions. </p> <p>The use case for such a structure is an algorithm that utilizes both direct and reverse ordering: a classic example being job-stealing algorithms, in which the main worker processes the queue from the front, while other workers, when idle, may steal the lowest-priority items from the back (to minimize the chance of a conflict over the same job).</p> <h3 id="stacksinactionsaxparsing">Stacks in Action: SAX Parsing</h3> <p>Custom XML parsing is a common task for those who deal with different datasets, as many of them come in XML form, for example, Wikipedia and other Wikidata resources. There are two main approaches to XML parsing:</p> <ul><li>DOM parsing reads the whole document and creates its tree representation in memory. This technique is handy for small documents, but, for huge ones, such as the dump of Wikipedia, it will quickly fill all available memory. Also, dealing with the deep tree structure, if you want to extract only some specific pieces from it, is not very convenient.</li> <li>SAX parsing is an alternative variant that uses the stream approach. The parser reads the document and, upon completing the processing of a particular part, invokes the relevant callback: what to do when an opening tag is read, when a closing one is read, and what to do with the contents of the current element.
These actions happen for each tag, and we can think of the whole process as traversing the document tree utilizing the so-called "visitor pattern": when visiting each node, we have a chance to react at its beginning, in the middle, and at the end.</li></ul> <p>Once you get used to SAX parsing, due to its simplicity, it becomes a tool of choice for processing XML, as well as JSON and other formats that allow for a similar stream parsing approach. Often the simplest parsing pattern is enough: remember the tag we're looking at, and when it matches a set of interesting tags, process its contents. However, sometimes, we need to make decisions based on the broader context. For example, let's say, we have the text marked up into paragraphs, which are split into sentences, which are, in turn, tokenized. To process such a three-level structure, with SAX parsing, we could use the following outline (utilizing CXML library primitives):</p> <pre><code>(defclass text-sax (sax:sax-parser-mixin)<br /> ((parags :initform nil :accessor sax-parags)<br /> (parag :initform nil :accessor sax-parag)<br /> (sent :initform nil :accessor sax-sent)<br /> (tag :initform nil :accessor sax-tag)))<br /><br />(defmethod sax:start-element ((sax text-sax)<br /> namespace-uri local-name qname attrs)<br /> (declare (ignore namespace-uri qname attrs))<br /> (:= (sax-tag sax) (mkeyw local-name)))<br /><br />(defmethod sax:end-element ((sax text-sax)<br /> namespace-uri local-name qname)<br /> (declare (ignore namespace-uri qname))<br /> (with-slots (tag parags parag sent) sax<br /> (case tag<br /> (:paragraph (push (reverse parag) parags)<br /> (:= parag nil))<br /> (:sentence (push (reverse sent) parag)<br /> (:= sent nil)))))<br /><br />(defmethod sax:characters ((sax text-sax) text)<br /> (when (eql :token (sax-tag sax))<br /> (push text (sax-sent sax))))<br /><br />(defmethod sax:end-document ((sax text-sax))<br /> (reverse (sax-parags sax)))<br /></code></pre> <p>This code will return the accumulated
structure of paragraphs from the <code>sax:end-document</code> method. Two stacks, the current sentence and the current paragraph, are used to accumulate the intermediate data while parsing. In a similar fashion, another stack of encountered tags might have been used to track our exact position in the document tree if there were such a necessity. Overall, the more you use SAX parsing, the more you'll realize that stacks are enough to address 99% of the challenges that arise.</p> <h2 id="listsassets">Lists as Sets</h2> <p>Another very important abstract data structure is the Set. It is a collection that holds each element only once, no matter how many times we add it. This structure may be used in a variety of cases: when we need to track the items we have already seen and processed, when we want to calculate some relations between groups of elements, and so forth.</p> <p>Basically, its interface consists of set-theoretic operations:</p> <ul><li>add/remove an item </li> <li>check whether an item is in the set</li> <li>check whether a set is a subset of another set</li> <li>union, intersection, difference, etc.</li></ul> <p>Sets have an interesting aspect: efficient implementations of the element-wise operations (add/remove/member) and the set-wise ones (union/...) require different concrete data structures, so the choice should be made depending on the main use case. One way to implement sets is by using linked lists. Lisp has standard library support for this with the following functions:</p> <ul><li><code>adjoin</code> to add an item to the list if it's not already there</li> <li><code>member</code> to check for item presence in the set</li> <li><code>subsetp</code> for subset relationship query</li> <li><code>union</code>, <code>intersection</code>, <code>set-difference</code>, and <code>set-exclusive-or</code> for set operations</li></ul> <p>This approach works well for small sets (up to tens of elements), but it is rather inefficient, in general.
Adding an item to the set or checking for membership will require <code>O(n)</code> operations, while, in a hash-set (which we'll discuss in the chapter on key-value structures), these are <code>O(1)</code> operations. A naive implementation of <code>union</code> and the other set-theoretic operations will require <code>O(n^2)</code> as we'll have to compare each element from one set with each one from the other. However, if our set lists are kept in sorted order, set-theoretic operations can be implemented efficiently in just <code>O(n)</code>, where <code>n</code> is the total number of elements in all the sets, by performing a single linear scan over each set in parallel. Using a hash-set will also result in the same complexity.</p> <p>Here is a simplified implementation of <code>union</code> for sets of numbers built on sorted lists:</p> <pre><code>(defun sorted-union (s1 s2)<br /> (let ((rez ()))<br /> (do ()<br /> ((and (null s1) (null s2)))<br /> (let ((i1 (first s1))<br /> (i2 (first s2)))<br /> (cond ((null i1) (dolist (i2 s2)<br /> (push i2 rez))<br /> (return))<br /> ((null i2) (dolist (i1 s1)<br /> (push i1 rez))<br /> (return))<br /> ((= i1 i2) (push i1 rez)<br /> (:= s1 (rest s1)<br /> s2 (rest s2)))<br /> ((< i1 i2) (push i1 rez)<br /> (:= s1 (rest s1)))<br /> ;; just T may be used instead<br /> ;; of the following condition<br /> ((> i1 i2) (push i2 rez)<br /> (:= s2 (rest s2))))))<br /> (reverse rez)))<br /><br />CL-USER> (sorted-union '(1 2 3)<br /> '(0 1 5 6))<br />(0 1 2 3 5 6)<br /></code></pre> <p>This approach may be useful even for unsorted list-based sets as sorting is merely an <code>O(n * log n)</code> operation.
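</p> <p>This sort-then-merge idea might be sketched as follows. The function name is hypothetical, and plain <code>setf</code>/<code>pop</code> are used instead of the RUTILS operators to keep the snippet self-contained:</p>

```lisp
;; A sketch of union for unsorted list-based sets of numbers: sort
;; copies of both lists first, then merge them with a single linear
;; scan, exactly as in the sorted-union approach.
(defun union-via-sorting (s1 s2)
  (let ((s1 (sort (copy-list s1) '<))
        (s2 (sort (copy-list s2) '<))
        (rez ()))
    (loop
      (cond ((and (null s1) (null s2)) (return (nreverse rez)))
            ((null s1) (push (pop s2) rez))
            ((null s2) (push (pop s1) rez))
            ((= (first s1) (first s2)) (push (pop s1) rez)
                                       (pop s2))
            ((< (first s1) (first s2)) (push (pop s1) rez))
            (t (push (pop s2) rez))))))
```

<p>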
Even better though, when the use case requires primarily set-theoretic operations on our sets and the number of changes/membership queries is comparatively low, the most efficient technique may be to keep the lists sorted at all times.</p> <h2 id="mergesort">Merge Sort</h2> <p>Speaking of sorting, the algorithms we discussed for array sorting in the previous chapter do not work as efficiently for lists, for they are based on swap operations, which are <code>O(n)</code> in the list case. Thus, another approach is required, and there exist a number of efficient list sorting algorithms, the most prominent of which is Merge sort. It works by splitting the list into two equal parts until we get to trivial one-element lists and then merging the sorted lists into bigger sorted ones. The merging procedure for sorted lists is efficient, as we've seen in the previous example. A nice feature of this approach is its stability, i.e. the preservation of the original order of equal elements, given a proper implementation of the merge procedure.</p> <pre><code>(defun merge-sort (list comp)<br /> (if (null (rest list))<br /> list<br /> (let ((half (floor (length list) 2)))<br /> (merge-lists (merge-sort (subseq list 0 half) comp)<br /> (merge-sort (subseq list half) comp)<br /> comp))))<br /><br />(defun merge-lists (l1 l2 comp)<br /> (let ((rez ()))<br /> (do ()<br /> ((and (null l1) (null l2)))<br /> (let ((i1 (first l1))<br /> (i2 (first l2)))<br /> (cond ((null i1) (dolist (i l2)<br /> (push i rez))<br /> (return))<br /> ((null i2) (dolist (i l1)<br /> (push i rez))<br /> (return))<br /> ((call comp i1 i2) (push i1 rez)<br /> (:= l1 (rest l1)))<br /> (t (push i2 rez)<br /> (:= l2 (rest l2))))))<br /> (reverse rez)))<br /></code></pre> <p>The same complexity analysis as for binary search applies to this algorithm.
At each level of the recursion tree, we perform <code>O(n)</code> operations: each element is pushed into the resulting list once, reversed once, and there are at most 4 comparison operations: 3 null checks and 1 call of the <code>comp</code> function. We also need to perform one copy per element in the <code>subseq</code> operation and take the length of the list (although it can be remembered and passed down as a function call argument) on the recursive descent. This totals to not more than 10 operations per element, which is a constant. And the height of the tree is, as we already know, <code>(log n 2)</code>. So, the total complexity is <code>O(n * log n)</code>.</p> <p>Let's now measure the real time needed for such sorting, and let's compare it to the time of <code>prod-sort</code> (with optimal array accessors) from the Arrays chapter:</p> <pre><code>CL-USER> (with ((lst (random-list 10000))<br /> (vec (make-array 10000 :initial-contents lst)))<br /> (print-sort-timings "Prod" 'prod-sort vec)<br /> (print-sort-timings "Merge " 'merge-sort lst))<br />= Prodsort of random vector =<br />Evaluation took:<br /> 0.048 seconds of real time<br />= Prodsort of sorted vector =<br />Evaluation took:<br /> 0.032 seconds of real time<br />= Prodsort of reverse sorted vector =<br />Evaluation took:<br /> 0.044 seconds of real time<br />= Merge sort of random vector =<br />Evaluation took:<br /> 0.007 seconds of real time<br />= Merge sort of sorted vector =<br />Evaluation took:<br /> 0.007 seconds of real time<br />= Merge sort of reverse sorted vector =<br />Evaluation took:<br /> 0.008 seconds of real time<br /></code></pre> <p>Interestingly enough, Merge sort turned out to be around 5 times faster, although it seems that the number of operations required at each level of recursion is at least 2-3 times bigger than for quicksort.
Why we got such a result is left as an exercise for the reader: I'd start by profiling the function calls and looking at where most of the time is spent...</p> <p>It should be apparent that the <code>merge-lists</code> procedure works in a similar way to the set-theoretic operations on sorted lists that we've discussed in the previous part. It is, in fact, provided in the Lisp standard library. Using the standard <code>merge</code>, Merge sort may be written in a completely functional and also generic way that supports any kind of sequence:</p> <pre><code>(defun merge-sort (seq comp)<br /> (if (or (null seq) ; avoid expensive length calculation for lists<br /> (<= (length seq) 1))<br /> seq<br /> (let ((half (floor (length seq) 2)))<br /> (merge (type-of seq)<br /> (merge-sort (subseq seq 0 half) comp)<br /> (merge-sort (subseq seq half) comp)<br /> comp))))<br /></code></pre> <p>There's still one substantial difference between Merge sort and the array sorting functions: it is not in-place. So it also requires <code>O(n * log n)</code> additional space to hold the half sublists that are produced at each iteration. Sorting and merging them in-place is not possible. There are ways to somewhat reduce this extra space usage but not to totally eliminate it. </p> <h3 id="parallelizationofmergesort">Parallelization of Merge Sort</h3> <p>The extra-space drawback of Merge sort may, however, turn out to be irrelevant if we consider the problem of parallelizing this procedure. The general idea of a parallelized implementation of any algorithm is to split the work in a way that allows reducing the runtime proportionally to the number of workers performing the jobs. In the ideal case, if we have <code>m</code> workers and are able to spread the work evenly, the running time should be reduced by a factor of <code>m</code>. For Merge sort, this will mean just <code>O(n/m * log n)</code>.
Such an ideal reduction is not always achievable, though, because often there are bottlenecks in the algorithm that require all or some of the workers to wait for one of them to complete its job.</p> <p>Here's a trivial parallel Merge sort implementation that uses the <code>eager-future2</code> library, which adds high-level data parallelism capabilities based on the Lisp implementation's multithreading facilities:</p> <pre><code>(defun parallel-merge-sort (seq comp)<br /> (if (or (null seq) (<= (length seq) 1))<br /> seq<br /> (with ((half (floor (length seq) 2))<br /> (thread1 (eager-future2:pexec<br /> (merge-sort (subseq seq 0 half) comp)))<br /> (thread2 (eager-future2:pexec<br /> (merge-sort (subseq seq half) comp))))<br /> (merge (type-of seq)<br /> (eager-future2:yield thread1)<br /> (eager-future2:yield thread2) <br /> comp))))<br /></code></pre> <p>The <code>eager-future2:pexec</code> procedure submits each <code>merge-sort</code> to the thread pool that manages multiple CPU threads available in the system and continues program execution without waiting for it to return.
<code>eager-future2:yield</code>, in turn, pauses execution until the thread performing the appropriate <code>merge-sort</code> returns.</p> <p>When I ran our testing function with both the serial and parallel merge sorts on my machine with 4 CPUs, I got the following result:</p> <pre><code>CL-USER> (with ((lst1 (random-list 10000))<br /> (lst2 (copy-list lst1)))<br /> (print-sort-timings "Merge " 'merge-sort lst1)<br /> (print-sort-timings "Parallel Merge " 'parallel-merge-sort lst2))<br />= Merge sort of random vector =<br />Evaluation took:<br /> 0.007 seconds of real time<br /> 114.29% CPU<br />= Merge sort of sorted vector =<br />Evaluation took:<br /> 0.006 seconds of real time<br /> 116.67% CPU<br />= Merge sort of reverse sorted vector =<br />Evaluation took:<br /> 0.007 seconds of real time<br /> 114.29% CPU<br />= Parallel Merge sort of random vector =<br />Evaluation took:<br /> 0.003 seconds of real time<br /> 266.67% CPU<br />= Parallel Merge sort of sorted vector =<br />Evaluation took:<br /> 0.003 seconds of real time<br /> 266.67% CPU<br />= Parallel Merge sort of reverse sorted vector =<br />Evaluation took:<br /> 0.005 seconds of real time<br /> 220.00% CPU<br /></code></pre> <p>We see a speedup of approximately 2x, which is also reflected in the rise of CPU utilization from around 100% (i.e. 1 CPU) to 250%. These numbers make sense, as the merge procedure is still executed serially and remains the bottleneck. There are more sophisticated ways to achieve an optimal <code>m</code>-times speedup in Merge sort parallelization, but we won't discuss them here due to their complexity.</p> <h2 id="listsandlisp">Lists and Lisp</h2> <p>Historically, Lisp's name originated as an abbreviation of "List Processing", which points both to the significance that lists played in the language's early development and to the fact that flexibility (a major feature of lists) was always a cornerstone of its design. Why are lists important to Lisp?
Maybe, originally, it was connected with the availability and good support of this data structure in the language itself. But, quickly, the focus shifted to the fact that, unlike in other languages, Lisp code is fed to the compiler not in a custom string-based format but in the form of nested lists that directly represent the syntax tree. Coupled with superior support for the list data structure, this opens numerous possibilities for programmatic processing of the code itself, which manifest in the macro system, code walkers and generators, etc. So, "List Processing" turns out to be not about lists of data, but about lists of code, which perfectly describes the main distinctive feature of this language...</p> <hr size="1"><p>Footnotes:</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r4-1" name="f4-1">[1]</a> In the Lisp machines, cons cells even had special hardware support, and such a change would have made it useless.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r4-2" name="f4-2">[2]</a> Although, for structs, it is implementation-dependent whether this will work.
In the major implementations, it will.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-92125002997480243072019-08-12T16:37:00.001+03:002019-08-13T11:19:06.021+03:00Programming Algorithms: Arrays<a href="https://1.bp.blogspot.com/-E9ImAEIb2kM/XVFn93I7lKI/AAAAAAAACIg/S98DEvEJDWwJhdfxgcmENTLL2e1f_zzRACLcBGAs/s1600/array.jpg" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-E9ImAEIb2kM/XVFn93I7lKI/AAAAAAAACIg/S98DEvEJDWwJhdfxgcmENTLL2e1f_zzRACLcBGAs/s320/array.jpg" width="320" height="227" data-original-width="1600" data-original-height="1133" /></a> <p>Arrays are, alongside structs, the most basic data structure and, at the same time, the default choice for implementing algorithms. A one-dimensional array, which is also called a "vector", is a contiguous structure consisting of elements of the same type. One of the ways to create such arrays, in Lisp, is this:</p> <pre><code>CL-USER> (make-array 3)<br />#(0 0 0)<br /></code></pre> <p>The printed result is the literal array representation. It happens that the array is shown to hold 0's, but that's implementation-dependent. Additional specifics can be set during array initialization: for instance, the <code>:element-type</code>, <code>:initial-element</code>, and even the full contents:</p> <pre><code>CL-USER> (make-array 3 :element-type 'list :initial-element nil)<br />#(NIL NIL NIL)<br />CL-USER> (make-array 3 :initial-contents '(1.0 2.0 3.0))<br />#(1.0 2.0 3.0)<br /></code></pre> <p>If you read back such an array, you'll get a new copy with the same contents:</p> <pre><code>CL-USER> #(1.0 2.0 3.0)<br />#(1.0 2.0 3.0)<br /></code></pre> <p>It is worth noting that the element type restriction is, in fact, not a limitation: the default type is <code>T</code><a href="#f3-1" name="r3-1">[1]</a>.
In this case, the array will just hold pointers to its elements, which can be of arbitrary type. If we specify a more precise type, however, the compiler might be able to optimize storage and access by putting the elements in memory directly in the array space. This is, mainly, useful for numeric arrays, but, for them, it makes a difference of multiple orders of magnitude, for several reasons, including the existence of vector CPU instructions that operate on such arrays.</p> <p>The arrays we have created are mutable, i.e. we can change their contents, although we cannot resize them. The main operator to access array elements is <code>aref</code>. You will see it in those pieces of code, in this chapter, where we care about performance. </p> <pre><code>CL-USER> (let ((vec (make-array 3 :initial-contents '(1.0 2.0 3.0))))<br /> (print (aref vec 0))<br /> (print (? vec 1))<br /> (:= (aref vec 2) 4.0)<br /> (print (? vec 2))<br /> (aref vec 3))<br />1.0 <br />2.0 <br />4.0<br />; Evaluation aborted on #<SIMPLE-TYPE-ERROR expected-type: (MOD 3) datum: 3><br /></code></pre> <p>In Lisp, array access beyond its boundary, as expected, causes an error.</p> <p>It is also possible to create constant arrays using the literal notation <code>#()</code>.
These constants can, actually, be changed in some environments, but don't expect anything nice to come out of such abuse — and the compiler will warn you of that:</p> <pre><code>CL-USER> (let ((vec #(1.0 2.0 3.0)))<br /> (:= (aref vec 2) nil)<br /> (print vec))<br />; caught WARNING:<br />; Destructive function (SETF AREF) called on constant data.<br />; See also:<br />; The ANSI Standard, Special Operator QUOTE<br />; The ANSI Standard, Section 3.2.2.3<br />; <br />; compilation unit finished<br />; caught 1 WARNING condition<br /><br />#(1.0 2.0 NIL)<br /></code></pre> <p>RUTILS provides more options to easily create arrays with a shorthand notation:</p> <pre><code>CL-USER> #v(1 2 3)<br />#(1 2 3)<br />CL-USER> (vec 1 2 3)<br />#(1 2 3)<br /></code></pre> <p>Although the results seem identical, they aren't. The first version creates a mutable analog of <code>#(1 2 3)</code>, and the second also makes it adjustable (we'll discuss adjustable, or dynamic, arrays next).</p> <h2 id="arraysassequences">Arrays as Sequences</h2> <p>Vectors are one of the representatives of the abstract <code>sequence</code> container type that has the following basic interface:</p> <ul><li>inquire the length of a sequence — performed in Lisp using the function <code>length</code></li> <li>access the element by index — the RUTILS <code>?</code> operator is the most generic variant while the native one for arrays is <code>aref</code> and a more general <code>elt</code>, for all built-in sequences (this also includes lists and, in some implementations, user-defined, so-called, extensible sequences)</li> <li>get the subsequence — the standard provides the function <code>subseq</code> for this purpose</li></ul> <p>These methods have some specifics that you should mind:</p> <ul><li>the <code>length</code> function, for arrays, works in <code>O(1)</code> time as the length is tracked in the array structure.
There is an alternative (more primitive) way to handle arrays, employed, primarily, in C, when the length is not stored and, instead, there's a special termination "symbol" that indicates the end of an array. For instance, C strings have a <code>'\0'</code> termination character, and arrays representing command-line arguments, in the Unix syscalls API for such functions as <code>exec</code>, are terminated with null-pointers. Such an approach is, first of all, not efficient from the algorithmic point of view as it requires <code>O(n)</code> time to query the array's length. But, what's even more important, it has proven to be a source of a number of catastrophic security vulnerabilities — the venerable "buffer overflow" family of errors</li> <li>the <code>subseq</code> function creates a copy of the part of its argument, which is an expensive operation. This is the functional approach that is a proper default, but many of the algorithms don't involve subarray mutation, and, for them, a more efficient variant would be a shared-structure approach that doesn't make a copy but merely returns a pointer into the original array. Such an option is provided, in the Lisp standard, via the so-called displaced arrays, but they are somewhat cumbersome to use, which is why a more straightforward version, named <code>slice</code>, is present in RUTILS</li></ul> <pre><code>CL-USER> (with ((vec (vec 1 2 3))<br /> (part (slice vec 2)))<br /> (print part)<br /> (:= (? part 0) 4)<br /> (print part)<br /> vec)<br /><br />#(3)<br />#(4)<br />#(1 2 4)<br /></code></pre> <p>Beyond the basic operations, sequences in Lisp are the target of a number of higher-order functions, such as <code>find</code>, <code>position</code>, <code>remove-if</code> etc. We'll get back to discussing their use later in the book.</p> <h2 id="dynamicvectors">Dynamic Vectors</h2> <p>Let's examine arrays from the point of view of algorithmic complexity.
General-purpose data structures are usually compared by their performance on several common operations and, also, by their space requirements. These common operations are: access, insertion, deletion, and, sometimes, search.</p> <p>In the case of ordinary arrays, the space used is the minimum possible: almost no overhead is incurred except, perhaps, for some meta-information about the array size. Array element access is performed by index in constant time because it's just an offset from the beginning that is the product of the index and the size of a single element. Search for an element requires a linear scan of the whole array or, in the special case of a sorted array, it can be done in <code>O(log n)</code> using binary search.</p> <p>Insertion (at the end of an array) and deletion with arrays are problematic, though. Basic arrays are static, i.e. they can't be expanded or shrunk at will. Expansion requires free space after the end of the array, which isn't generally available (because it's already occupied by other data used by the program), so the whole array needs to be relocated to another place in memory with sufficient space. Shrinking is possible, but it still requires relocation of the elements following the deleted one. Hence, both of these operations require <code>O(n)</code> time and may also cause memory fragmentation. This is a major drawback of arrays. </p> <p>However, arrays definitely should be the default choice for most algorithms. Why? First of all, because of the other excellent properties arrays provide and also because, in many cases, the lack of flexibility can be circumvented in a certain manner. One common example is iteration with accumulation of results in a sequence. This is often performed with the help of a stack (as a rule, implemented with a linked list), but, in many cases (especially, when the length of the result is known beforehand), arrays may be used to the same effect.
Another approach is using dynamic arrays, which add array resizing capabilities. Only when an algorithm requires constant manipulation (insertion and deletion) in the middle of a collection of items, or other advanced flexibility, are linked data structures preferred.</p> <p>So, the first approach to working around the static nature of arrays is possible when we know the target number of elements. For instance, the most common pattern of sequence processing is to <code>map</code> a function over it, which produces the new sequence of the same size filled with results of applying the function to each element of the original sequence. With arrays, it can be performed even more efficiently than with a list. We just need to pre-allocate the resulting vector and set its elements one by one as we process the input:</p> <pre><code>(defun map-vec (fn vec)<br /> "Map function FN over each element of VEC<br /> and return the new vector with the results."<br /> (let ((rez (make-array (length vec))))<br /> (dotimes (i (length vec))<br /> (:= (aref rez i) (call fn (aref vec i))))<br /> rez))<br /><br />CL-USER> (map-vec '1+ #(1 2 3))<br />#(2 3 4)<br /></code></pre> <p>We use the specific accessor <code>aref</code> here instead of the generic <code>?</code> to ensure efficient operation in the so-called "inner loop" — although there's just one loop here, it will be the inner loop of many complex algorithms.</p> <p>However, in some cases we don't know the size of the result beforehand. For instance, another popular sequence processing function is called <code>filter</code> or <code>remove-if</code>(<code>-not</code>) in Lisp. It iterates over the sequence and keeps only elements that satisfy/don't satisfy a certain predicate. It is, generally, unknown how many elements will remain, so we can't predict the size of the resulting array. One solution would be to allocate the full-sized array and fill only as many cells as needed. It is a viable approach although suboptimal.
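The same over-allocate-and-fill idea can be sketched in a few lines of Python (an illustrative translation, not the book's code): allocate the result at full size, track the fill index by hand, and trim the unused tail at the end.

```python
def clumsy_filter(pred, xs):
    """Filter XS into a pre-allocated result list,
    tracking the fill index manually."""
    rez = [None] * len(xs)   # allocate the full-sized result upfront
    fill = 0                 # the "fill pointer": next free cell
    for x in xs:
        if pred(x):
            rez[fill] = x
            fill += 1
    return rez[:fill]        # trim the unused tail

print(clumsy_filter(lambda x: x % 2 == 1, [1, 2, 3]))  # → [1, 3]
```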
Filling the result array can be performed by tracking the current index in it or, in Lisp, by using an array with a <strong>fill-pointer</strong>:</p> <pre><code>(defun clumsy-filter-vec (pred vec)<br /> "Return the vector with only those elements of VEC<br /> for which calling pred returns true."<br /> (let ((rez (make-array (length vec) :fill-pointer t)))<br /> (dotimes (i (length vec))<br /> (when (call pred (aref vec i))<br /> (vector-push (aref vec i) rez)))<br /> rez))<br /><br />CL-USER> (describe (clumsy-filter-vec 'oddp #(1 2 3)))<br />#(1 3)<br /> [vector]<br />Element-type: T<br />Fill-pointer: 2<br />Size: 3<br />Adjustable: yes<br />Displaced-to: NIL<br />Displaced-offset: 0<br />Storage vector: #<(SIMPLE-VECTOR 3) {100E9AF30F}><br /></code></pre> <p>Another, more general way, would be to use a "dynamic vector". This is a kind of an array that supports insertion by automatically expanding its size (usually, not one element at a time but proportionally to the current size of the array). 
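The growth policy itself is easy to model. Here is a toy sketch in Python (illustrative only; it assumes the same 1→3→7→15 capacity progression as the REPL session below, while real runtimes differ in details):

```python
class DynVec:
    """A toy growable vector: grows its storage proportionally when full."""
    def __init__(self):
        self.size = 1            # allocated capacity
        self.fill = 0            # number of cells actually used
        self.storage = [None]

    def push(self, item):
        if self.fill == self.size:
            # grow proportionally: copy into a roughly twice-larger storage
            self.size = 2 * self.size + 1
            new = [None] * self.size
            new[:self.fill] = self.storage
            self.storage = new
        self.storage[self.fill] = item
        self.fill += 1

vec = DynVec()
for i in range(10):
    vec.push(i)
print(vec.fill, vec.size)  # → 10 15
```

Note that only a logarithmic number of pushes trigger a copy; all the others are a single store plus an index increment.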
Here is how it works:</p> <pre><code>CL-USER> (let ((vec (make-array 0 :fill-pointer t :adjustable t)))<br /> (dotimes (i 10)<br /> (vector-push-extend i vec)<br /> (describe vec)))<br />#(0)<br /> [vector]<br />Element-type: T<br />Fill-pointer: 1<br />Size: 1<br />Adjustable: yes<br />Displaced-to: NIL<br />Displaced-offset: 0<br />Storage vector: #<(SIMPLE-VECTOR 1) {100ED9238F}><br /><br />#(0 1)<br />Fill-pointer: 2<br />Size: 3<br /><br />#(0 1 2)<br />Fill-pointer: 3<br />Size: 3<br /><br />#(0 1 2 3)<br />Element-type: T<br />Fill-pointer: 4<br />Size: 7<br /><br />...<br /><br />#(0 1 2 3 4 5 6 7)<br />Fill-pointer: 8<br />Size: 15<br /><br />#(0 1 2 3 4 5 6 7 8)<br />Element-type: T<br />Fill-pointer: 9<br />Size: 15<br /><br />#(0 1 2 3 4 5 6 7 8 9)<br />Element-type: T<br />Fill-pointer: 10<br />Size: 15<br /></code></pre> <p>For such "smart" arrays the complexity of insertion of an element becomes <strong>asymptotically</strong> constant: resizing and moving elements happens less and less often the more elements are added. With a large number of elements, this comes at a cost of a lot of wasted space, though. At the same time, when the number of elements is small (below 20), it happens often enough, so that the performance is worse than for a linked list that requires a constant number of 2 operations for each insertion (or 1 if we don't care to preserve the order). So, dynamic vectors are the solution that can be used efficiently only when the number of elements is neither too big nor too small. </p> <h2 id="whyarearraysindexedfrom0">Why Are Arrays Indexed from 0</h2> <p>Although most programmers are used to it, not everyone understands clearly why the choice was made, in most programming languages, for 0-based array indexing. Indeed, there are several languages that prefer a 1-based variant (for instance, MATLAB and Lua). 
This is quite a deep and yet very practical issue that several notable computer scientists, including Dijkstra, have <a href="https://www.quora.com/Why-do-array-indexes-start-with-0-zero-in-many-programming-languages">contributed to</a>.</p> <p>At first glance, it is "natural" to expect the first element of a sequence to be indexed with 1, the second — with 2, etc. This means that if we have a subsequence from the first element to the tenth it will have the beginning index 1 and the ending — 10, i.e. be a closed interval also called a segment: <code>[1, 10]</code>. The cons of this approach are the following:</p> <ol><li><p>It is more straightforward to work with half-open intervals (i.e. the ones that don't include the ending index): especially, it is much more convenient to split and merge such intervals, and, also, test for membership. With 0-based indexing, our example interval would be half-open: <code>[0, 10)</code>.</p></li> <li><p>If we consider multi-dimensional arrays that are most often represented using one-dimensional ones, getting an element of a matrix with indices <code>i</code> and <code>j</code> translates to accessing the element of an underlying vector with an index <code>i*w + j</code> or <code>i + j*h</code> for 0-based arrays, while for 1-based ones, it's more cumbersome: <code>(i-1)*w + j</code>. And if we consider 3-dimensional arrays (tensors), we'll still get the obvious <code>i*w*h + j*h + k</code> formula, for 0-based arrays, and, maybe, <code>(i-1)*w*h + (j-1)*h + k</code> for 1-based ones, although I'm not, actually, sure if it's correct (which shows how such calculations quickly become intractable). Besides, multi-dimensional array operations that are much more complex than mere indexing also often occur in many practical tasks, and they are also more complex and thus error-prone with base 1.</p></li></ol> <p>There are other arguments, but I consider them to be much more minor and a matter of taste and convenience.
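Point 2 is easy to check mechanically. Here's a quick sanity check of the 0-based row-major formula in Python (purely illustrative; the dimensions are made up):

```python
# a 3x4 matrix stored row by row in a flat list (0-based indexing)
w, h = 4, 3                                # width (columns) and height (rows)
flat = list(range(h * w))                  # the flat storage: 0..11
rows = [flat[i * w:(i + 1) * w] for i in range(h)]  # the same data as nested rows

for i in range(h):
    for j in range(w):
        # 0-based: i*w + j; the 1-based equivalent would be (i1-1)*w + (j1-1)
        # with i1 = i+1, j1 = j+1 — one extra subtraction per dimension
        assert flat[i * w + j] == rows[i][j]
print("row-major indexing checks out")
```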
However, the intervals and multi-dimensional arrays issues are quite serious. And here is a good place to quote one of my favorite anecdotes that there are two hard problems in CS: cache invalidation and naming things... and off-by-one errors. Arithmetic errors with indexing are a very nasty kind of bug, and although they can't be avoided altogether, 0-based indexing turns out to be a much more balanced solution. </p> <p>Now, using 0-based indexing, let's write down the formula for finding the middle element of an array. Usually, it is chosen to be <code>(floor (length array) 2)</code>. This element will divide the array into two parts, left and right, each one having length at least <code>(floor (length array) 2)</code>: the left part will always have exactly such size and will not include the middle element. The right side will start from the middle element and will have the same size if the total number of array elements is even or be one element larger if it is odd.</p> <h2 id="multidimensionalarrays">Multi-Dimensional Arrays</h2> <p>So far, we have only discussed one-dimensional arrays. However, more complex data-structures can be represented using simple arrays. The most obvious example of such structures is multi-dimensional arrays. There's a staggering variety of other structures that can be built on top of arrays, such as binary (or, in fact, any n-ary) trees, hash-tables, and graphs, to name a few. If we have a chance to implement the data structure on an array, usually, we should not hesitate to take it as it will result in constant access time, good cache locality contributing to faster processing, and, in most cases, efficient space usage.</p> <p>Multi-dimensional arrays are a contiguous data-structure that stores its elements so that, given the coordinates of an element in all dimensions, it can be retrieved according to a known formula. Such arrays are also called <strong>tensors</strong>, and in the case of 2-dimensional arrays — <strong>matrices</strong>.
We have already seen one matrix example in the discussion of complexity:</p> <pre><code>#2A((1 2 3)<br /> (4 5 6))<br /></code></pre> <p>A matrix has rows (first dimension) and columns (second dimension). Accordingly, the elements of a matrix may be stored in row-major or column-major order. In row-major order, the elements are placed row after row — just like in this example, i.e. the memory will contain the sequence: <code>1 2 3 4 5 6</code>. In column-major order, they are stored by column (this approach is used in many "mathematical" languages, such as Fortran or MATLAB), so raw memory will look like this: <code>1 4 2 5 3 6</code>. If row-major order is used, the formula to access the element with coordinates <code>i</code> (row) and <code>j</code> (column) is: <code>(+ (* i n) j)</code> where <code>n</code> is the length of the matrix's row, i.e. its width. In the case of column-major order, it is: <code>(+ i (* j m))</code> where <code>m</code> is the matrix's height. It is necessary to know which storage style is used in a particular language as in numeric computing it is common to intermix libraries written in many languages — C, Fortran, and others — and, in the process, incompatible representations may clash.<a href="#f3-2" name="r3-2">[2]</a></p> <p>Such a matrix representation is the most obvious one, but it's not the only option. Many languages, including Java, use <strong>Iliffe vectors</strong> to represent multi-dimensional arrays. These are vectors of vectors, i.e. each matrix row is stored in a separate 1-dimensional array, and the matrix is the vector of such vectors. Besides, more specific multi-dimensional arrays, such as sparse or diagonal matrices, may be represented using more efficient storage techniques at the expense of a possible loss in access speed. Higher-order tensors may also be implemented with the described approaches.</p> <p>One classic example of operations on multi-dimensional arrays is matrix multiplication.
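To connect the row-major storage formula with a concrete operation, here is a sketch in Python (illustrative only, not the book's code) of multiplying two matrices kept in flat row-major lists, using the <code>i*w + j</code> access formula throughout:

```python
def m_mul(a, wa, b, wb):
    """Multiply matrix A (width WA) by matrix B (width WB),
    both stored as flat row-major lists; return a flat result."""
    ha = len(a) // wa            # A's height
    assert len(b) // wb == wa    # A's width must equal B's height
    rez = [0] * (ha * wb)
    for i in range(ha):
        for j in range(wb):
            acc = 0
            for k in range(wa):
                acc += a[i * wa + k] * b[k * wb + j]  # row-major indexing
            rez[i * wb + j] = acc
    return rez

# (1 2 3)   (1 0)
# (4 5 6) x (0 1)
#           (1 0)
print(m_mul([1, 2, 3, 4, 5, 6], 3, [1, 0, 0, 1, 1, 0], 2))  # → [4, 2, 10, 5]
```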
The simple straightforward algorithm below has the complexity of <code>O(n^3)</code> where <code>n</code> is the matrix dimension. The condition for successful multiplication is the equality of the width of the first matrix and the height of the second one. The cubic complexity is due to 3 loops: by the outer dimensions of each matrix and by the inner identical dimension.</p> <pre><code>(defun m* (m1 m2)<br /> (let ((n (array-dimension m1 1))<br /> (n1 (array-dimension m1 0))<br /> (n2 (array-dimension m2 1))<br /> (rez (make-array (list n1 n2))))<br /> (assert (= n (array-dimension m2 0)))<br /> (dotimes (i n1)<br /> (dotimes (j n2)<br /> (let ((cur 0))<br /> (dotimes (k n)<br /> ;; :+ is the incrementing analog of :=<br /> (:+ cur (* (aref m1 i k)<br /> (aref m2 k j))))<br /> (:= (aref rez i j) cur))))<br /> rez))<br /></code></pre> <p>There are more efficient albeit much more complex versions using a divide-and-conquer approach that can work in only <code>O(n^2.37)</code>, but they have significant hidden constants and, that's why, are rarely used in practice, although if you're relying on an established library for matrix operations, such as the Fortran-based BLAS/ATLAS, you will find one of them under the hood.</p> <h2 id="binarysearch">Binary Search</h2> <p>Now, let's talk about some of the important and instructive array algorithms. The most prominent ones are searching and sorting.</p> <p>A common sequence operation is searching for the element either to determine if it is present, to get its position, or to retrieve the object that has a certain property (key-based search).
The simplest way to search for an element in Lisp is using the function <code>find</code>:</p> <pre><code>CL-USER> (let ((vec #v((pair :foo :bar) (pair :baz :quux))))<br /> (print (find (pair :foo :bar) vec))<br /> (print (find (pair :foo :bar) vec :test 'equal))<br /> (print (find (pair :bar :baz) vec :test 'equal))<br /> (print (find :foo vec :key 'lt)))<br />NIL<br />(:FOO :BAR) <br />NIL<br />(:FOO :BAR) <br /></code></pre> <p>In the first case, the element was not found due to the wrong comparison predicate: the default <code>eql</code> will only consider two structures the same if they are the same object, and, in this case, there are two separate pairs with the same content. So, the second search is successful as <code>equal</code> performs deep comparison. Then the element is not found as it is just not present. And, in the last case, we did the key-based search looking just at the <code>lt</code> element of all pairs in <code>vec</code>.</p> <p>Such a search is called a sequential scan because it is performed in a sequential manner over all elements of the vector starting from the beginning (or the end if we specify <code>:from-end t</code>) until either the element is found or we have examined all the elements. The complexity of such a search is, obviously, <code>O(n)</code>, i.e. we need to access each element of the collection (if the element is present we'll look, on average, at <code>n/2</code> elements, and if not present — always at all <code>n</code> elements).</p> <p>However, if we know that our sequence is sorted, we can perform the search much faster. The algorithm used for that is one of the most famous algorithms that every programmer has to know and use, from time to time — binary search. The more general idea behind it is called "divide and conquer": if there's some way, looking at one element, to determine the outcome of our global operation for more than just this element, we can discard the part for which we already know that the outcome is negative.
In binary search, we look at an arbitrary element of the sorted vector and compare it with the item we search for:</p> <ul><li>if the element is the same we have found it</li> <li>if it's smaller all the previous elements are also smaller and thus uninteresting to us — we need to look only at the subsequent ones</li> <li>if it's greater all the following elements are not interesting</li></ul> <a href="https://2.bp.blogspot.com/-E8mRensw4Ko/XVFoObqZQ7I/AAAAAAAACIo/DbhiAW3XKpQ1hw68evWnMDJLiKJKszWQACLcBGAs/s1600/bin-search.jpg" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-E8mRensw4Ko/XVFoObqZQ7I/AAAAAAAACIo/DbhiAW3XKpQ1hw68evWnMDJLiKJKszWQACLcBGAs/s320/bin-search.jpg" width="320" height="272" data-original-width="1584" data-original-height="1344" /></a> <p>Thus, each time we can examine the middle element and, after that, can discard half of the elements of the array without checking them. We can repeat such comparisons and halving until the resulting array contains just a single element.</p> <p>Here's the straightforward binary search implementation using recursion:</p> <pre><code>(defun bin-search (val vec &optional (pos 0))<br /> (if (> (length vec) 1)<br /> (with ((mid (floor (length vec) 2))<br /> (cur (aref vec mid)))<br /> (cond ((< cur val) (bin-search val<br /> (slice vec mid)<br /> (+ pos mid)))<br /> ((> cur val) (bin-search val<br /> (slice vec 0 mid)<br /> pos))<br /> (t (+ pos mid))))<br /> (when (= (aref vec 0) val)<br /> pos)))<br /></code></pre> <p>If the middle element differs from the one we're looking for, it halves the vector until just one element remains. If the element is found, its position (which is passed as an optional 3rd argument to the recursive function) is returned. Note that we assume that the array is sorted. Generally, there's no way to quickly check this property unless we examine all array elements (and thus lose all the benefits of binary search).
That's why we don't assert the property in any way and just trust the programmer :)</p> <p>An important observation is that such recursion is very similar to a loop that, at each stage, changes the boundaries we're looking in-between. Not every recursive function can be matched with a similar loop so easily (for instance, when there are multiple recursive calls in its body an additional memory data structure is needed), but when it is possible it usually makes sense to choose the loop variant. The pros of looping are the avoidance of both the function calls' overhead and the danger of hitting the recursion limit or the stack overflow associated with it, while the pros of recursion are simpler code and better debuggability that comes with the possibility to examine each iteration by tracing using the built-in tools.</p> <p>Another thing to note is the interesting, counter-intuitive arithmetic of additional comparisons. In our naive approach, we had 3 <code>cond</code> clauses, i.e. up to 2 comparisons to make at each iteration. In total, we'll look at <code>(log n 2)</code> elements of our array, so we have no more than <code>(/ (1- (log n 2)) n)</code> chance to match the element with the <code>=</code> comparison before we get to inspect the final 1-element array. I.e. with the probability of <code>(- 1 (/ (1- (log n 2)) n))</code> we'll have to make all the comparisons up to the final one. Even for such a small <code>n</code> as 10 this probability is 0.77 and for 100 — 0.94. And this is an optimistic estimate for the case when the element searched for is actually present in the array, which may not always be so. Otherwise, we'll have to make all the comparisons anyway. Effectively, these numbers render the equality comparison meaningless, just a waste of computation, although from "normal" programmer intuition it might seem like a good idea to implement an early exit in this situation...
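The estimates above are easy to reproduce. A quick check in Python, under the same simplifying assumptions as in the text:

```python
from math import log2

# probability of NOT matching with = before reaching the 1-element array:
# 1 - (log2(n) - 1) / n, i.e. the text's (- 1 (/ (1- (log n 2)) n))
for n in (10, 100):
    p = 1 - (log2(n) - 1) / n
    print(n, round(p, 2))
# → 10 0.77
# → 100 0.94
```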
</p> <p>Finally, there's also one famous non-obvious bug associated with binary search that was still present in many production implementations, for many years past the algorithm's inception. It's also a good example of the dangers of forfeiting boundary-condition checks, which is the root of many severe problems plaguing our computer systems, opening them to various exploits. The problem may manifest in systems that have limited integer arithmetic with potential overflow. In Lisp, if the result of summing two fixnums is greater than <code>most-positive-fixnum</code> (the maximum number that can be represented directly by the machine word) it will be automatically converted to bignums, which are a slower representation but with unlimited precision:</p> <pre><code>CL-USER> most-positive-fixnum<br />4611686018427387903<br />CL-USER> (type-of most-positive-fixnum)<br />(INTEGER 0 4611686018427387903)<br />CL-USER> (+ most-positive-fixnum most-positive-fixnum)<br />9223372036854775806<br />CL-USER> (type-of (+ most-positive-fixnum most-positive-fixnum))<br />(INTEGER 4611686018427387904)<br /></code></pre> <p>In many other languages, such as C or Java, what will happen is either silent overflow (the worst), in which case we'll get just the result modulo the integer range, or an overflow error. Neither of these situations is accounted for by the common midpoint computation <code>(floor (+ beg end) 2)</code>. The simple fix to this problem, which makes sense to keep in mind for future similar situations, is to change the computation to the following equivalent form: <code>(+ beg (floor (- end beg) 2))</code>. It will never overflow. Why?
Try to figure it out on your own ;)</p> <p>Taking all that into account and allowing for a custom comparator function, here's an "optimized" version of binary search that returns 3 values:</p> <ul><li>the final element of the array</li><li>its position</li><li>has it, actually, matched the element we were searching for?</li></ul> <pre><code>(defun bin-search (val vec &key (less '<) (test '=) (key 'identity))<br /> (when (plusp (length vec))<br /> (let ((beg 0)<br /> (end (length vec)))<br /> (do ()<br /> ((= beg end))<br /> (let ((mid (+ beg (floor (- end beg) 2))))<br /> (if (call less (call key (aref vec mid)) val)<br /> (:= beg (1+ mid))<br /> (:= end mid))))<br /> ;; if VAL is greater than all the elements,<br /> ;; BEG ends up equal to the vector's length,<br /> ;; so we clamp it to the last valid index<br /> (:= beg (min beg (1- (length vec))))<br /> (values (aref vec beg)<br /> beg<br /> (call test (call key (aref vec beg)) val)))))<br /></code></pre> <p>How many loop iterations do we need to complete the search? If we were to take the final one-element array and expand the array from it by adding the discarded half it would double in size at each step, i.e. we'll be raising 2 to the power of the number of expansion iterations (initially, before expansion — after 0 iterations — we have 1 element, which is <code>2^0</code>, after 1 iteration, we have 2 elements, after 2 — 4, and so on). The number of iterations needed to expand to the full array may be calculated by the inverse of exponentiation — the logarithmic function. I.e. we'll need <code>(log n 2)</code> iterations (where <code>n</code> is the initial array size). Shrinking the array takes the same number of steps as expanding, just in the opposite order, so the complexity of binary search is <code>O(log n)</code>.</p> <p>How big is the speedup from linear to logarithmic complexity? Let's do a quick-and-dirty speed comparison between the built-in (and optimized) sequential scan function <code>find</code> and our <code>bin-search</code>:</p> <pre><code>CL-USER> (with ((size 100000000)<br /> (mid (1+ (/ size 2)))<br /> (vec (make-array size)))<br /> (dotimes (i size)<br /> (:= (? 
vec i) i))<br /> (time (find mid vec))<br /> (time (bin-search mid vec)))<br />Evaluation took:<br /> 0.591 seconds of real time<br /> 0.595787 seconds of total run time (0.595787 user, 0.000000 system)<br /> 100.85% CPU<br /> ...<br />Evaluation took:<br /> 0.000 seconds of real time<br /> 0.000000 seconds of total run time (0.000000 user, 0.000000 system)<br /> 100.00% CPU<br /> ...<br /></code></pre> <p>Unfortunately, I don't have enough RAM on my notebook to make <code>bin-search</code> take at least a millisecond of CPU time. We can count nanoseconds to get the exact difference, but a good number to remember is that <code>(log 1000000 2)</code> is approximately 20, so, for the million elements array, the speedup will be 50000x!</p> <p>The crucial limitation of binary search is that it requires our sequence to be pre-sorted because sorting before each search already requires at least linear time to complete, which kills any performance benefit we might have expected. There are multiple situations when the pre-sort condition may hold without our intervention:</p> <ul><li>all the data is known beforehand and we can sort it just once prior to running the search, which may be repeated multiple times for different values</li> <li>we maintain the sorted order as we add data. Such an approach is feasible only if addition is performed less frequently than search. This is often the case with databases, which store their indices in sorted order</li></ul> <p>A final note on binary search: obviously, it will only work fast for vectors and not linked sequences.</p> <h3 id="binarysearchinaction">Binary Search in Action</h3> <p>In one consumer internet company I was working for, a lot of text processing (which was the company's bread-and-butter) relied on access to a huge statistical dataset called "ngrams". Ngrams is a simple Natural Language Processing concept: basically, they are phrases of a certain length. 
A unigram (1gram) is a single word, a bigram — a pair of words, a fivegram — a list of 5 words. Each ngram has some weight associated with it, which is calculated (estimated) from the huge corpus of texts (we used the crawl of the whole Internet). There are numerous ways to estimate this weight, but the basic one is to just count the frequency of the occurrence of a specific ngram phrase in the corpus.</p> <p>The total number of ngrams may be huge: in our case, the whole dataset, on disk, measured in tens of gigabytes. And the application required constant random access to it. Using an off-the-shelf database would have incurred too much overhead as such systems are general-purpose and don't optimize for particular use cases, like the one we had. So, a special-purpose solution was needed. In fact, now there is readily-available ngrams handling software, such as KenLM. We built our own, and, initially, it relied on binary search of the in-memory dataset to answer the queries. Considering the size of the data, what do you think was the number of operations required? I don't remember it exactly, but somewhere between 25 and 30. For handling tens of gigabytes or hundreds of millions/billions of ngrams — quite a decent result. And, most importantly, it didn't exceed our application's latency limits! The key property that enabled such a solution was the fact that all the ngrams were known beforehand and hence the dataset could be pre-sorted. Yet, eventually, we moved to an even faster solution based on perfect hash-tables (that we'll discuss later in this book).</p> <p>One more interesting property of this program was that it took significant time to initialize as all the data had to be loaded into memory from disk. During that time, which took several dozen minutes, the application was not available, which created a serious bottleneck in the whole system and complicated updates, as well as put normal operation at additional risk.
The solution we utilized to counteract this was also a common one for such cases: lazy loading in memory using the Unix <code>mmap</code> facility.</p> <h2 id="sorting">Sorting</h2> <p>Sorting is another fundamental sequence operation that has many applications. Unlike searching in a sorted sequence, there is no single optimal algorithm for sorting, and different data structures allow different approaches to it. In general, the problem of sorting a sequence is to place all of its elements in a certain order determined by the comparison predicate. There are several aspects that differentiate sorting functions:</p> <ul><li><strong>in-place</strong> sorting is a destructive operation, but it is often desired because it may be faster and also preserves space (especially relevant when sorting big amounts of data at once). The alternative is copying sort</li> <li><strong>stable</strong>: whether 2 elements, which are considered the same by the predicate, retain their original order or may be shuffled</li> <li><strong>online</strong>: does the function require seeing the whole sequence before starting the sorting process, or can it work with each element one-by-one, always preserving the result of processing the already seen part of the sequence in the sorted order </li></ul> <p>One more aspect of a particular sorting algorithm is its behavior on several special kinds of input data: already sorted (in direct and reversed order), almost sorted, completely random. An ideal algorithm should show better than average performance (down to <code>O(n)</code>) on the sorted and almost sorted special cases.</p> <p>Over the history of CS, sorting was and still remains a popular research topic. Not surprisingly, several dozens of different sorting algorithms were developed. But before discussing the prominent ones, let's talk about "Stupid sort" (or "Bogosort"). It is one of the sorting algorithms that has a very simple idea behind it, but an outstandingly nasty performance.
The idea is that among all the permutations of the input sequence there definitely is the completely sorted one. If we were to find it, we wouldn't need to do anything else. It's an example of the so-called "generate and test" paradigm that may be employed when we know next to nothing about the nature of our task: then, put some input into the black box and see the outcome. In the case of bogosort, the number of possible inputs is the number of all permutations, which is equal to <code>n!</code>, so considering that we need to also examine each permutation's order, the algorithm's complexity is <code>O(n * n!)</code> — quite a bad number, especially, since some specialized sorting algorithms can work as fast as <code>O(n)</code> (for instance, Bucket sort for integer numbers). On the other hand, if generating all permutations is a library function and we don't care about complexity, such an algorithm will have a rather simple implementation that looks quite innocent. So you should always inquire about the performance characteristics of 3rd-party functions. And, by the way, your standard library <code>sort</code> function is also a good example of this rule.</p> <pre><code>(defun bogosort (vec comp)<br /> (dolist (variant (all-permutations vec))<br /> (dotimes (i (1- (length variant))<br /> ;; this is the 3rd optional argument of the dotimes header<br /> ;; that is evaluated only after the loop finishes normally<br /> ;; if it does, we have found a completely sorted permutation!<br /> (return-from bogosort variant))<br /> (when (call comp (? variant (1+ i)) (? variant i))<br /> (return))))) ; current variant is not sorted, skip it<br /></code></pre> <h3 id="on2sorting">O(n^2) Sorting</h3> <p>Although we can imagine an algorithm with even worse complexity factors than this, bogosort gives us a good lower bound on the sorting algorithm's performance and an idea of the potential complexity of this task.
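To get a feel for that complexity, here is a bogosort sketch in Python (illustrative only; <code>itertools.permutations</code> plays the role of <code>all-permutations</code>), instrumented to count how many candidates it examines:

```python
from itertools import permutations

def bogosort(xs):
    """Try every permutation until a sorted one is found.
    Returns the sorted tuple and the number of permutations examined."""
    for count, variant in enumerate(permutations(xs), start=1):
        if all(variant[i] <= variant[i + 1] for i in range(len(variant) - 1)):
            return variant, count
    # unreachable for non-empty input: some permutation is always sorted

print(bogosort([3, 1, 2]))  # → ((1, 2, 3), 4)
```

Even in this lucky run, a third of the 3! = 6 permutations had to be generated and scanned; for a reverse-sorted input of length n, the count approaches n! candidates of n elements each.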
However, there are much faster approaches that don't have a particularly complex implementation. There is a number of such simple algorithms that work in quadratic time. A very well-known one, which is considered by many a kind of "Hello world" algorithm, is Bubble sort. Yet, in my opinion, it's quite a bad example to teach (sadly, it often is taught) because it's both not very straightforward and has poor performance characteristics. That's why it's <em>never</em> used in practice. There are two other simple quadratic sorting algorithms that you actually have a chance to encounter in the wild, especially, Insertion sort, which is used rather frequently. Their comparison is also quite insightful, so we'll take a look at both, instead of focusing just on the former. </p> <p><strong>Selection sort</strong> is an in-place sorting algorithm that moves left-to-right from the beginning of the vector one element at a time and builds the sorted prefix to the left of the current element. This is done by finding the "best" (according to the comparator predicate, e.g. the smallest for less-than) element in the right part and swapping it with the current element.</p> <pre><code>(defun selection-sort (vec comp)<br /> (dotimes (i (1- (length vec)))<br /> (let ((best (aref vec i))<br /> (idx i))<br /> (dotimes (j (- (length vec) i 1))<br /> (when (call comp (aref vec (+ i j 1)) best)<br /> (:= best (aref vec (+ i j 1))<br /> idx (+ i j 1))))<br /> (rotatef (aref vec i) (aref vec idx)))) ; this is the lisp's swap operator<br /> vec)<br /></code></pre> <p>Selection sort requires a constant number of operations regardless of the level of sortedness of the original sequence: <code>(/ (* n (- n 1)) 2)</code> — the sum of the arithmetic progression from 1 to <code>(- n 1)</code>, because, at each step, it needs to fully examine the remainder of the elements to find the best one, and the remainder's size varies from <code>(- n 1)</code> to <code>1</code>.
It handles both contiguous and linked sequences equally well.</p> <p><strong>Insertion sort</strong> is another quadratic-time in-place sorting algorithm that builds the sorted prefix of the sequence. However, it has a few key differences from Selection sort: instead of looking for the global maximum in the right-hand side, it looks for the proper place of the current element in the left-hand side. As this part is always sorted, it takes linear time to find the place for the new element, and inserting it there leaves the part in sorted order. Such a change has important implications:</p> <ul><li>it is stable</li> <li>it is online: the left part is already sorted, and, in contrast with Selection sort, it doesn't have to find the maximum element of the whole sequence in the first step; it can handle encountering it at any step</li> <li>for sorted sequences it works in the fastest possible way — in linear time — as all the elements are already in their proper places and don't need moving. The same applies to almost sorted sequences, for which it works in almost linear time. However, for reverse sorted sequences, its performance will be the worst. In fact, the algorithm's complexity is directly proportional to the average offset of the elements from their proper positions in the sorted sequence: <code>O(k * n)</code>, where <code>k</code> is the average offset of an element. For sorted sequences <code>k=0</code>, and for reverse sorted ones it's <code>(/ (- n 1) 2)</code>.
</li></ul> <pre><code>(defun insertion-sort (vec comp)<br /> (dotimes (i (1- (length vec)))<br />   (do ((j i (1- j)))<br />       ((minusp j))<br />     (if (call comp (aref vec (1+ j)) (aref vec j))<br />         (rotatef (aref vec (1+ j)) (aref vec j))<br />         (return))))<br /> vec)<br /></code></pre> <p>As you see, the implementation is very simple: we look at each element starting from the second, compare it to the previous element, and if it's better we swap them and continue the comparison with the previous element until we reach the array's beginning.</p> <p>So, where's the catch? Is there anything that makes Selection sort better than Insertion? Well, if we closely examine the number of operations required by each algorithm we'll see that Selection sort needs exactly <code>(/ (* n (- n 1)) 2)</code> comparisons and around <code>n</code> swaps (one at the end of each outer iteration). For Insertion sort, the number of comparisons varies from <code>n-1</code> to <code>(/ (* n (- n 1)) 2)</code>, so, in the average case, it will be <code>(/ (* n (- n 1)) 4)</code>, i.e. half as many as for the other algorithm. In the sorted case, each element is already in its position, and it will take just 1 comparison to discover that. In the reverse sorted case, the average distance of an element from its position is <code>(/ (- n 1) 2)</code>, and for the middle variant, it's in the middle, i.e. <code>(/ (- n 1) 4)</code>. Times the number of elements (<code>n</code>). But, as we can see from the implementation, Insertion sort requires almost the same number of swaps as comparisons, i.e. <code>(/ (* (- n 1) (- n 2)) 4)</code> in the average case, and it matches the number of swaps of Selection sort only in the close-to-best case, when each element is, on average, 1/2 steps away from its proper position.
If we sum up all the comparisons and swaps for the average case, we'll get the following numbers:</p> <ul><li>Selection sort: <code>(+ (/ (* n (- n 1)) 2) n) = (/ (+ (* n n) n) 2)</code></li> <li>Insertion sort: <code>(+ (/ (* n (- n 1)) 2) (/ (* (- n 1) (- n 2)) 4)) = (/ (+ (* 1.5 n n) (* -2.5 n) 1) 2)</code></li></ul> <p>The second number is slightly higher than the first. For small <code>n</code>s the difference is almost negligible: for instance, when <code>n=10</code>, we get 55 operations for Selection sort and 63 for Insertion. But, asymptotically (for huge <code>n</code>s like millions and billions), Insertion sort will need 1.5 times more operations. Also, it is often the case that swaps are more expensive operations than comparisons (although the opposite is also possible).</p> <p>In practice, Insertion sort ends up being used more often, for, in general, quadratic sorts are only used when the input array is small (and so the difference in the number of operations doesn't matter), while it has the other good properties we mentioned. However, one situation when Selection sort's predictable performance is an important factor is in systems with strict deadlines.</p> <h3 id="quicksort">Quicksort</h3> <p>There is a number of other <code>O(n^2)</code> sorting algorithms similar to Selection and Insertion sorts, but studying them quickly turns boring, so we won't, especially as there is also a number of significantly faster algorithms that work in <code>O(n * log n)</code> time (almost linear). They usually rely on the <strong>divide-and-conquer</strong> approach, whereby the whole sequence is recursively divided into smaller subsequences that have some property, thanks to which it's easier to sort them, and then these subsequences are combined back into the final sorted sequence. The feasibility of such performance characteristics is justified by the observation that ordering relations are recursive, i.e.
if we have compared two elements of an array and then compare one of them to a third element, with a probability of 1/2 we'll also know how it relates to the other element. </p> <p>Probably, the most famous of such algorithms is Quicksort. Its idea is, at each iteration, to select some element of the array as the "pivot" point and divide the array into two parts: all the elements that are smaller and all those that are larger than the pivot; then recursively sort each subarray. As all the left elements are below the pivot and all the right ones are above it, once we manage to sort the left and right sides, the whole array will be sorted. This invariant holds for all iterations and for all subarrays. The word "invariant", literally, means some property that doesn't change over the course of the algorithm's execution while other factors, e.g. the bounds of the array we're processing, are changing.</p> <p>There are several tricks in the Quicksort implementation. The first one has to do with pivot selection. The simplest approach is to always use the last element as the pivot. Now, how do we put all the elements greater than the pivot after it if it's already the last element? Let's say that all the elements are greater — then the pivot should be at index 0. Now, if, moving left to right over the array, we encounter an element that is not greater than the pivot, we should put it before the pivot, i.e. the pivot's index should increase by 1. When we reach the end of the array we know the correct position of the pivot, and, in the process, we can swap all the elements that should precede it in front of this position. Now, we have to put the element that is currently occupying the pivot's place somewhere. Where?
Anywhere after the pivot, but the most obvious thing is to swap it with the pivot.</p> <pre><code>(defun quicksort (vec comp)<br /> (when (> (length vec) 1)<br />   (with ((pivot-i 0)<br />          (pivot (aref vec (1- (length vec)))))<br />     (dotimes (i (1- (length vec)))<br />       (when (call comp (aref vec i) pivot)<br />         (rotatef (aref vec i)<br />                  (aref vec pivot-i))<br />         (:+ pivot-i)))<br />     ;; swap the pivot (last element) into its proper place<br />     (rotatef (aref vec (1- (length vec)))<br />              (aref vec pivot-i))<br />     (quicksort (slice vec 0 pivot-i) comp)<br />     (quicksort (slice vec (1+ pivot-i)) comp)))<br /> vec)<br /></code></pre> <p>Although recursion is employed here, such an implementation is space-efficient as it uses array displacement ("slicing") that doesn't create new copies of the subarrays, so sorting happens in-place. Speaking of recursion, this is one of the cases when it's not so straightforward to turn it into looping (this is left as an exercise to the reader :) ).</p> <p>What is the complexity of such an implementation? Well, if, on every iteration, we divide the array in two equal halves, we'll need to perform <code>n</code> comparisons and <code>n/2</code> swaps and increments, which totals <code>2n</code> operations. And we'll need to do that <code>(log n 2)</code> times, which is the height of a complete binary tree with <code>n</code> elements. At every level in the recursion tree, we'll need to perform twice as many sorts with half as much data, so each level will take the same number of <code>2n</code> operations. Total complexity: <code>2n * (log n 2)</code>, i.e. <code>O(n * log n)</code>. In the ideal case.</p> <p>However, we can't guarantee that the selected pivot will divide the array into two ideally equal parts.
In the worst case, if we were to split it into 2 totally unbalanced subarrays, with <code>n-1</code> and 0 elements respectively, we'd need to perform sorting <code>n</code> times and to perform a number of operations that diminishes in arithmetic progression from <code>2n</code> to 2, which sums to <code>(* n (- n 1))</code>: a dreaded <code>O(n^2)</code> complexity. So, the worst-case performance for quicksort is not just worse, but in a different complexity league than the average-case one. Moreover, the conditions for such performance (given our pivot selection scheme) are not so uncommon: sorted and reverse-sorted arrays. And the almost sorted ones will result in the almost worst-case scenario.</p> <p>It is also interesting to note that if, at each stage, we were to split the array into parts that have a 10:1 ratio of lengths, this would still result in <code>n * log n</code> complexity! How come? The 10:1 ratio, basically, means that the bigger part each time is shortened by a factor of around 1.1, which still yields a logarithmic number of division steps. The base of the logarithm will be different, though: 1.1 instead of 2. Yet, from the complexity theory point of view, the logarithm base is not important because it's still a constant: <code>(log n x)</code> is the same as <code>(/ (log n 2) (log x 2))</code>, and <code>(/ 1 (log x 2))</code> is a constant for any fixed logarithm base <code>x</code>. In our case, if <code>x</code> is 1.1 the constant factor is 7.27. Which means that quicksort, in the quite bad case of recurring 10:1 splits, will be just a little more than 7 times slower than in the best case of recurring equal splits. Significant — yes. But, if we were to compare <code>n * log n</code> (with base 2) vs <code>n^2</code> performance for <code>n=1000</code>, we'd already get a 100 times slowdown, which will only continue increasing as the input size grows.
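</p> <p>Both constants are easy to check at the REPL (plain Common Lisp; <code>log</code> accepts the base as an optional second argument):</p> <pre><code>(/ 1 (log 1.1 2))       ; => ~7.27, the slowdown factor of recurring 10:1 splits<br /><br />(let ((n 1000))<br />  (/ (* n n)            ; n^2<br />     (* n (log n 2))))  ; n * log n (base 2)<br />;; => ~100.3<br /></code></pre> <p>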
Compare this to a constant factor of 7...</p> <p>So, how do we achieve at least a 10:1 split or, at worst, a 100:1 or similar one? One of the simple solutions is called the 3-medians approach. The idea is to consider not just a single point as a potential pivot but 3 candidates: the first, the middle, and the last points — and select the one which has the median value among them. Unless two or all three points are accidentally equal, this guarantees us not taking the extreme value that is the cause of the all-to-nothing split. Also, for a sorted array, this should produce a near-equal split. How probable is stumbling on the special case when we always get the extreme value due to equality of the selected points? The calculations here are not so simple, so I'll just give the answer: it's extremely improbable that such a condition will hold for all iterations of the algorithm, due to the fact that we'll always remove the last element and due to all the swapping that is going on. More precisely, the only practical variant when it may happen is when the array consists almost or just entirely of the same elements. And this case will be addressed next. One more refinement of the 3-medians approach that will work even better for large arrays is 9-medians, which, as is apparent from its name, performs the median selection among not 3 but 9 equidistant points in the array.</p> <p>Dealing with equal elements is another corner case for quicksort that should be addressed properly. The fix is simple: divide the array not in 2 but 3 parts: smaller than, larger than, and equal to the pivot. This will allow for the removal of the equal elements from further consideration and will even speed up sorting instead of slowing it down.
The implementation adds another index (this time, from the end of the array) that will tell us where the equal-to-pivot elements will start, and we'll be gradually swapping them into this tail as they are encountered during array traversal.</p> <h3 id="productionsort">Production Sort</h3> <p>I had always wondered how it was possible for Quicksort to be the default sorting algorithm when it has such bad worst-case performance, while there are other algorithms, like Merge sort or Heap sort, with guaranteed <code>O(n * log n)</code> complexity. With all the mentioned refinements, it's apparent that the worst-case scenario for Quicksort can be completely avoided (in the probabilistic sense), while it has the very nice property of sorting in-place with good cache locality, which significantly contributes to better real-world performance. Moreover, a production sort implementation will be even smarter: utilizing Quicksort while the array is large and switching to something like Insertion sort when the size of the subarray reaches a certain threshold (10-20 elements). All this, however, is applicable only to arrays. When we consider lists, other factors come into play that make Quicksort much less suitable.</p> <p>Here's an attempt at such an implementation — let's call it "Production sort" (the function <code>3-medians</code> is left as an exercise to the reader).
</p> <pre><code>(defun prod-sort (vec comp &optional (eq 'eql))<br /> (cond ((< (length vec) 2)<br /> vec)<br /> ((< (length vec) 10)<br /> (insertion-sort vec comp))<br /> (t<br /> (rotatef (aref vec (1- (length vec)))<br /> (aref vec (3-medians vec comp eq)))<br /> (with ((pivot-i 0)<br /> (pivot-count 1)<br /> (last-i (1- (length vec)))<br /> (pivot (aref vec last-i)))<br /> (do ((i 0 (1+ i)))<br /> ((> i (- last-i pivot-count)))<br /> (cond ((call comp (aref vec i) pivot)<br /> (rotatef (aref vec i)<br /> (aref vec pivot-i))<br /> (:+ pivot-i))<br /> ((call eq (aref vec i) pivot)<br /> (rotatef (aref vec i)<br /> (aref vec (- last-i pivot-count)))<br /> (:+ pivot-count)<br /> (:- i)))) ; decrement i to reprocess newly swapped point<br /> (dotimes (i pivot-count)<br /> (rotatef (aref vec (+ pivot-i i))<br /> (aref vec (- last-i i))))<br /> (prod-sort (slice vec 0 pivot-i) comp eq)<br /> (prod-sort (slice vec (+ pivot-i pivot-count)) comp eq))))<br /> vec)<br /></code></pre> <p>All in all, the example of Quicksort is very interesting, from the point of view of complexity analysis. It shows the importance of analyzing the worst-case and other corner-case scenarios, and, at the same time, teaches that we shouldn't give up immediately if the worst case is not good enough, for there may be ways to handle such corner cases that reduce or remove their impact.</p> <h3 id="performancebenchmark">Performance Benchmark</h3> <p>Finally, let's look at our problem from another angle: simple and stupid. We have developed 3 sorting functions' implementations: Insertion, Quick, and Prod. Let's create a tool to compare their performance on randomly generated datasets of decent sizes. This may be done with the following code and repeated many times to exclude the effects of randomness.</p> <pre><code>(defun random-vec (size)<br /> (let ((vec (make-array size)))<br /> (dotimes (i size)<br /> (:= (? 
vec i) (random size)))<br /> vec))<br /><br />(defun print-sort-timings (sort-name sort-fn vec)<br /> ;; we'll use in-place modification of the input vector VEC<br /> ;; so we need to copy it to preserve the original for future use<br /> (let ((vec (copy-seq vec))<br /> (len (length vec)))<br /> (format t "= ~Asort of random vector (length=~A) =~%"<br /> sort-name len)<br /> (time (call sort-fn vec '<))<br /> (format t "= ~Asort of sorted vector (length=~A) =~%"<br /> sort-name len)<br /> (time (call sort-fn vec '<))<br /> (format t "= ~Asort of reverse sorted vector (length=~A) =~%"<br /> sort-name len)<br /> (time (call sort-fn vec '>))))<br /><br />CL-USER> (let ((vec (random-vec 1000)))<br /> (print-sort-timings "Insertion " 'insertion-sort vec)<br /> (print-sort-timings "Quick" 'quicksort vec)<br /> (print-sort-timings "Prod" 'prod-sort vec))<br />= Insertion sort of random vector (length=1000) =<br />Evaluation took:<br /> 0.128 seconds of real time<br />...<br />= Insertion sort of sorted vector (length=1000) =<br />Evaluation took:<br /> 0.001 seconds of real time<br />...<br />= Insertion sort of reverse sorted vector (length=1000) =<br />Evaluation took:<br /> 0.257 seconds of real time<br />...<br />= Quicksort of random vector (length=1000) =<br />Evaluation took:<br /> 0.005 seconds of real time<br />...<br />= Quicksort of sorted vector (length=1000) =<br />Evaluation took:<br /> 5.429 seconds of real time<br />...<br />= Quicksort of reverse sorted vector (length=1000) =<br />Evaluation took:<br /> 2.176 seconds of real time<br />...<br />= Prodsort of random vector (length=1000) =<br />Evaluation took:<br /> 0.008 seconds of real time<br />...<br />= Prodsort of sorted vector (length=1000) =<br />Evaluation took:<br /> 0.004 seconds of real time<br />...<br />= Prodsort of reverse sorted vector (length=1000) =<br />Evaluation took:<br /> 0.007 seconds of real time<br /></code></pre> <p>Overall, this is a really primitive approach that can't serve as 
conclusive evidence on its own, but it has value as it aligns well with our previous calculations. Moreover, it once again reveals some things that may be omitted in those calculations: for instance, the effects of the hidden constants of the Big-O notation or of the particular programming vehicles used. We can see that, for their worst-case scenarios, where Quicksort and Insertion sort both have <code>O(n^2)</code> complexity and work the longest, Quicksort comes around 20 times slower, although it's also more than 20 times faster for the average case. This slowdown may be attributed both to the larger number of operations and to using recursion. Also, our Prodsort algorithm demonstrates its expected performance. As you see, such simple testbeds quickly become essential in testing, debugging, and fine-tuning our algorithms' implementations. So it's a worthy investment.</p> <p>Finally, it is worth noting that array sort is often implemented as in-place sorting, which means that it will modify (spoil) the input vector. We use that in our test function: first, we sort the array and then sort the sorted array in direct and reverse orders. This way, we can omit creating new arrays. Such destructive sort behavior may be both intended and surprising. The standard Lisp <code>sort</code> and <code>stable-sort</code> functions also exhibit it, which is, unfortunately, a source of numerous bugs due to the application programmer's forgetfulness of the function's side-effects (an acute problem for myself, at least). That's why RUTILS provides an additional function <code>safe-sort</code> that is just a thin wrapper over standard <code>sort</code> to free the programmer's mind from worrying or forgetting about this treacherous property of <code>sort</code>.</p> <p>A few points we can take away from this chapter:</p> <ol><li>Array is the go-to structure for implementing your algorithms.
First, try to fit it before moving on to other things like lists, trees, and so on.</li> <li>Complexity estimates should be considered in context: of the particular task's requirements and limitations, of the hardware platform, etc. Performing some real-world benchmarking alongside back-of-the-napkin abstract calculations may be quite insightful.</li> <li>It's always worth thinking of how to reduce the code to its simplest form: additional condition checks, recursion, and many other forms of code complexity, although rarely a game changer, often lead to significant unnecessary slowdowns.</li></ol><hr size="1"><p>Footnotes:</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r3-1" name="f3-1">[1]</a> or <code>void*</code> in C, or some other type that allows any element in your language of choice</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r3-2" name="f3-2">[2]</a> Such incompatibility errors are not a cheap thing: for instance, it is reported that the loss of NASA's Mars Climate Orbiter happened due to the interoperation of two programs that used the metric and the imperial measurement systems without explicit conversion of the data. There's an elegant solution to such problems: "dimensional numbers", which use a custom reader macro to encode the measure alongside the number. Here is a formula expressed with such numbers:</p> <pre><code>(defun running-distance-for-1kg-weight-loss (mass)<br /> (* 1/4 (/ #M37600kJ (* #M0.98m/s2 mass))))<br /><br />CL-USER> (running-distance-for-1kg-weight-loss #M80kg)<br />119897.96<br />CL-USER> (running-distance-for-1kg-weight-loss #I200lb)<br />105732.45<br /></code></pre> <p>The output is, of course, in metric units. Unfortunately, this approach will not be useful for arrays encoded by different languages as they are obtained not by reading the input but by referencing external memory.
Instead, a wrapper struct/class is, usually, used to specify the element order.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-39422374906246238982019-08-05T13:40:00.000+03:002019-08-08T08:02:08.248+03:00Programming Algorithms: Data Structures<p>The next several chapters will describe the basic data structures that every programming language provides, their usage, and the most important algorithms relevant to them. And we'll start with the notion of a data structure and with tuples or structs, which are the most primitive and essential ones.</p> <a href="https://4.bp.blogspot.com/-cZjINhOsM30/XUgGvPsIu1I/AAAAAAAACIA/xO5fMmXSPNEyw5iokbENCFFFDohNwFjGQCLcBGAs/s1600/ds.jpg" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-cZjINhOsM30/XUgGvPsIu1I/AAAAAAAACIA/xO5fMmXSPNEyw5iokbENCFFFDohNwFjGQCLcBGAs/s320/ds.jpg" width="320" height="164" data-original-width="450" data-original-height="230" /></a> <h2 id="datastructuresvsalgorithms">Data Structures vs Algorithms</h2> <p>Let's start with a somewhat abstract question: what's more important, algorithms or data structures?</p> <p>From one point of view, algorithms are the essence of many programs, while data structures may seem secondary. Besides, although a majority of algorithms rely on certain features of particular data structures, not all do. Good examples of the data-structure-relying algorithms are heapsort, search using BSTs, and union-find. And of the second type: the sieve of Eratosthenes and consistent hashing.</p> <p>At the same time, some seasoned developers state that when the right data structure is found, the algorithm will almost write itself.
Linus Torvalds, the creator of Linux, is <a href="http://programmers.stackexchange.com/questions/163185/torvalds-quote-about-good-programmer">quoted saying</a>:</p> <blockquote> <p>Bad programmers worry about the code. Good programmers worry about data structures and their relationships.</p></blockquote> <p>A somewhat less poignant version of the same idea is formulated in The Art of Unix Programming by Eric S. Raymond as the "<a href="http://www.catb.org/esr/writings/taoup/html/ch01s06.html#id2878263">Rule of Representation</a>":</p> <blockquote> <p>Fold knowledge into data so program logic can be stupid and robust.</p> <p>Even the simplest procedural logic is hard for humans to verify, but quite complex data structures are fairly easy to model and reason about. To see this, compare the expressiveness and explanatory power of a diagram of (say) a fifty-node pointer tree with a flowchart of a fifty-line program. Or, compare an array initializer expressing a conversion table with an equivalent switch statement. The difference in transparency and clarity is dramatic.</p> <p>Data is more tractable than program logic. It follows that where you see a choice between complexity in data structures and complexity in code, choose the former. More: in evolving a design, you should actively seek ways to shift complexity from code to data.</p></blockquote> <p>Data structures are more static than algorithms. Surely, most of them allow change of their contents over time, but there are certain invariants that always hold. This allows reasoning by simple induction: consider only two (or at least a small number of) cases, the base one(s) and the general. In other words, data structures remove, in the main, the notion of time from consideration, and change over time is one of the major causes of program complexity. Put differently, data structures are declarative, while most of the algorithms are imperative.
The advantage of the declarative approach is that you don't have to imagine (trace) the flow of time through it.</p> <p>So, this book, like most other books on the subject, is organized around data structures. The majority of the chapters present a particular structure, its properties and interface, and explain the algorithms associated with it, showing its real-world use cases. Yet, some important algorithms don't require a particular data structure, so there are also several chapters dedicated exclusively to them.</p> <h2 id="thedatastructureconcept">The Data Structure Concept</h2> <p>Among data structures, there are, actually, two distinct kinds: abstract and concrete. The significant difference between them is that an abstract structure is just an interface (a set of operations) and a number of conditions or invariants that have to be met. Their particular implementations, which may differ significantly in efficiency characteristics and inner mechanisms, are provided by the concrete data structures. For instance, an abstract data structure <code>queue</code> has just two operations: <code>enqueue</code> that adds an item to the end of the queue and <code>dequeue</code> that gets an item at the beginning and removes it. There's also a constraint that the items should be dequeued in the same order they are enqueued. Now, a queue may be implemented using a number of different underlying data structures: a singly- or doubly-linked list, an array, or a tree, each with different efficiency characteristics and additional properties beyond the queue interface. We'll discuss both kinds in the book, focusing on the concrete structures and explaining their usage to implement a particular abstract interface.</p> <p>The term data structures has somewhat fallen from grace, in recent years, being often replaced by the conceptually more loaded notions of types, in the context of the functional programming paradigm, or classes, in the object-oriented one.
Yet, both of those notions imply something more than just the algorithmic machinery we're exclusively interested in for this book. First of all, they also distinguish among primitive values (numbers, characters, etc.) that are all non-distinct in the context of algorithms. Besides, classes form a hierarchy of inheritance while types are associated with the algebraic rules of category theory. So, we'll stick to the neutral term data structures throughout the book, with occasional mentions of the other variants where appropriate.</p> <h2 id="contiguousandlinkeddatastructures">Contiguous and Linked Data Structures</h2> <p>The current computer architectures consist of a central processor (CPU), memory, and peripheral input-output devices. The data is somehow exchanged with the outside world via the IO-devices, stored in memory, and processed by the CPU. And there's a crucial constraint, called the von Neumann bottleneck: the CPU can only process data that is stored inside of it in a limited number of special basic memory blocks called registers. So it has to constantly move data elements back and forth between the registers and main memory (using an intermediate cache to speed up the process). Now, there are things that can fit in a register and those that can't. The first ones are called primitive and mostly comprise those items that can be directly represented with integer numbers: integers proper, floats, characters. Everything that requires a custom data structure to be represented can't be put in a register as a whole.</p> <p>Another item that fits into a processor register is a memory address. In fact, there's an important constant — the number of bits in a general-purpose register, which defines the maximum memory address that a particular CPU may handle and, thus, the maximum amount of memory it can work with. For a 32-bit architecture it's <code>2^32</code> (4 GB) and for 64-bit — you've guessed it, <code>2^64</code>.
A memory address is usually called a <strong>pointer</strong>, and if you put a pointer in a register, there are commands that allow the CPU to retrieve from memory the data it points to.</p> <p>So, there are two ways to place a data structure inside the memory:</p> <ul><li>a <strong>contiguous</strong> structure occupies a single chunk of memory and its contents are stored in adjacent memory blocks. To access a particular piece we should know the offset of its beginning from the start of the memory range allocated to the structure. (This is usually handled by the compiler). When the processor needs to read or write to this piece it will use the pointer calculated as the sum of the base address of the structure and the offset. Examples of contiguous structures are arrays and structs</li> <li>a <strong>linked</strong> structure, on the contrary, doesn't occupy a contiguous block of memory, i.e. its contents reside in different places. This means that pointers to a particular piece can't be pre-calculated and should be stored in the structure itself. Such structures are much more flexible at the cost of this additional overhead, both in terms of used space and time to access an element (which may require several hops when there's nesting, while in the contiguous structure it is always constant). There exists a multitude of linked data structures like lists, trees, and graphs</li></ul> <h2 id="tuples">Tuples</h2> <p>In most languages, some common data structures, like arrays or lists, are "built-in", but, under the hood, they will mostly work in the same way as any user-defined ones. To implement an arbitrary data structure, these languages provide a special mechanism called records, structs, objects, etc. The proper name for it would be "tuple": the data structure that consists of a number of fields, each one holding either a primitive value, another tuple, or a pointer to another tuple of any type.
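</p> <p>For instance, a singly-linked list node is nothing more than a tuple whose second field holds a pointer to another tuple of the same kind (a minimal sketch using Lisp's <code>defstruct</code>, which is introduced just below; the name <code>list-node</code> is, of course, arbitrary):</p> <pre><code>(defstruct list-node<br />  data   ; the payload: a primitive value or another tuple<br />  next)  ; NIL or a pointer to the next LIST-NODE<br /><br />(make-list-node :data 1<br />                :next (make-list-node :data 2))<br /></code></pre> <p>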
This way a tuple can represent any structure, including nested and recursive ones. In the context of type theory, such structures are called product types.</p> <p>A tuple is an abstract data structure and its sole interface is the field accessor function: by name (a named tuple) or index (an anonymous tuple). It can be implemented in various ways, although a contiguous variant with constant-time access is preferred. However, in many languages, especially dynamic ones, programmers often use lists or dynamic arrays to create throw-away ad-hoc tuples. Python has a dedicated tuple data type that is often used for this purpose. The following Python function will return a tuple (written in parens) of the integer and remainder parts of the number <code>x</code><a href="#f2-1" name="r2-1">[1]</a>:</p> <pre><code>def truncate(x):<br /> dec = int(x)<br /> rem = x - dec<br /> return (dec, rem)<br /></code></pre> <p>This is a simple and not very efficient way that may have its place when the number of fields is small and the lifetime of the structure is short. However, a better approach, both from the point of view of efficiency and code clarity, is to use a pre-defined structure. In Lisp, a tuple is called "struct" and is defined with <code>defstruct</code>, which uses a contiguous representation by default (although there's an option to use a linked list under the hood). Following is the definition of a simple pair data structure that has two fields (called "slots" in Lisp parlance): <code>left</code> and <code>right</code>.</p> <pre><code>(defstruct pair<br /> left right)<br /></code></pre> <p>The <code>defstruct</code> macro, in fact, generates several definitions: of the struct type, of its constructor that will be called <code>make-pair</code> and have 2 keyword arguments <code>:left</code> and <code>:right</code>, and of the field accessors <code>pair-left</code> and <code>pair-right</code>.
Also, a common <code>print-object</code> method for structs will work for our new structure, as well as a reader-macro to restore it from the printed form. Here's how it all fits together:</p> <pre><code>CL-USER> (make-pair :left "foo" :right "bar")<br />#S(PAIR :LEFT "foo" :RIGHT "bar")<br />CL-USER> (pair-right (read-from-string (prin1-to-string *)))<br />"bar"<br /></code></pre> <p><code>prin1-to-string</code> and <code>read-from-string</code> are complementary Lisp functions that allow printing the value in a computer-readable form (if an appropriate print-function is provided) and reading it back. Good print representations, readable to both humans and, ideally, computers, are very important for code transparency and should never be neglected.</p> <p>There's a way to customize every part of the definition. For instance, if we plan to use pairs frequently, we can leave out the <code>pair-</code> prefix by specifying the <code>(:conc-name nil)</code> property. Here is an improved <code>pair</code> definition and a shorthand constructor for it from RUTILS, which we'll use throughout the book. It uses <code>:type list</code> allocation to integrate with destructuring macros.</p> <pre><code>(defstruct (pair (:type list) (:conc-name nil))<br /> "A generic pair with left (LT) and right (RT) elements."<br /> lt rt)<br /><br />(defun pair (x y)<br /> "A shortcut to make a pair of X and Y."<br /> (make-pair :lt x :rt y))<br /></code></pre> <h2 id="passingdatastructuresinfunctioncalls">Passing Data Structures in Function Calls</h2> <p>One final remark. There are two ways to use data structures with functions: either pass them directly via copying the appropriate memory areas (<strong>call-by-value</strong>) — an approach usually applied to primitive types — or pass a pointer (<strong>call-by-reference</strong>). 
In the first case, there's no way to modify the contents of the original structure in the called function, while in the second variant it is possible, so the risk of unwarranted change should be taken into account. The usual way to handle it is by making a copy before invoking any changes, although, sometimes, mutation of the original data structure may be intended, so a copy is not needed. Obviously, the call-by-reference approach is more general, because it allows both modification and copying, and more efficient, because copying happens only on demand. That's why it is the default way to handle structures (and objects) in most programming languages. In a low-level language like C, however, both variants are supported. Moreover, in C++ passing by reference comes in two kinds: passing a pointer and passing what's actually called a reference, which is syntax sugar over pointers that allows accessing the argument with non-pointer syntax (dot instead of arrow) and adds a couple of restrictions. But the general idea, regardless of the idiosyncrasies of particular languages, remains the same.</p> <h2 id="datastructuresinactionunionfind">Structs in Action: Union-Find</h2> <p>Data structures come in various shapes and flavors. Here, I'd like to mention one peculiar and interesting example that is, to some extent, both a data structure and an algorithm. Even the name speaks about certain operations rather than a static form. Well, most of the more advanced data structures have this feature: they are defined not only by their shape and arrangement but also by the set of operations that are applicable to them. Union-Find is a family of data-structure-algorithms that can be used for efficient determination of membership in sets that change over time. They may be used for finding disjoint parts in networks, detecting cycles in graphs, finding the minimum spanning tree, and so forth. 
One practical example of such problems is automatic image segmentation: separating different parts of an image, a car from the background or a cancer cell from a normal one.</p> <p>Let's consider the following problem: how to determine if two points of a graph have a path between them? A graph is a set of points (vertices) and edges between some of the pairs of these points. A path in the graph is a sequence of points leading from source to destination, with each pair having an edge that connects them. If some path between two points exists, they belong to the same component; if it doesn't — to two disjoint ones.</p> <a href="https://4.bp.blogspot.com/-9bYtjGnuln8/XUgHVIlJY_I/AAAAAAAACII/uHKkADHw8ewweGfGNsUN10Py4m_lQVnEwCLcBGAs/s1600/graph3.jpg" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-9bYtjGnuln8/XUgHVIlJY_I/AAAAAAAACII/uHKkADHw8ewweGfGNsUN10Py4m_lQVnEwCLcBGAs/s320/graph3.jpg" width="200" data-original-width="277" data-original-height="252" /></a><br>A graph with 3 disjoint components <p>For two arbitrary points, how do we determine if they have a connecting path? The naive implementation may take one of them and start building all the possible paths (this may be done in a breadth-first or depth-first manner, or even randomly). Anyway, such a procedure will, generally, require a number of steps proportional to the number of vertices of the graph. Can we do better? This is a usual question that leads to the creation of more efficient algorithms.</p><p>The Union-Find approach is based on a simple idea: when adding items, record the id of the component they belong to. But how to determine this id? Use the id associated with some point already in this subset or the current point's id if the point is in a subset of its own. And what if we have the subsets already formed? No problem, we can simulate the addition process by iterating over each vertex and taking the id of an arbitrary point it's connected to as the subset's id. 
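To make the cost of the naive strategy concrete, here is a breadth-first reachability check sketched in Python (an illustration of mine, not code from the book; the graph is assumed to be an adjacency list):

```python
from collections import deque

def connected(graph, source, dest):
    """Naive check: explore paths from SOURCE breadth-first until DEST
    is found or all reachable vertices are exhausted."""
    seen = {source}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if vertex == dest:
            return True
        for neighbor in graph.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

# Two components: {a, b, c} and {d}
g = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b'], 'd': []}
print(connected(g, 'a', 'c'))  # True
print(connected(g, 'a', 'd'))  # False
```

Each query may have to visit every reachable vertex, which is exactly the per-query cost that Union-Find avoids.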
Below is the implementation of the Union-Find approach (to simplify the code, we'll use pointers to <code>point</code> structs instead of ids, but, conceptually, it's the same idea):</p> <pre><code>(defstruct point<br /> parent) ; if the parent is null the point is the root<br /><br />(defun uf-union (point1 point2)<br /> "Join the subsets of POINT1 and POINT2."<br /> (:= (point-parent point1) (or (point-parent point2)<br /> point2)))<br /><br />(defun uf-find (point)<br /> "Determine the id of the subset that a POINT belongs to."<br /> (let ((parent (point-parent point)))<br /> (if parent<br /> (uf-find parent)<br /> point)))<br /></code></pre> <p>Just calling <code>(make-point)</code> will add a new subset with a single item in it to our set.</p> <p>Note that <code>uf-find</code> uses recursion to find the root of the subset, i.e. the point that was added first. So, for each vertex, we store some intermediary data and, to get the subset id, each time, we'll have to perform additional calculations. This way, we managed to reduce the average-case find time, but, still, haven't completely excluded the possibility of it requiring traversal of every element of the set. Such a so-called degenerate case may manifest when each item is added referencing the previously added one. I.e. there will be a single subset with its members chained one to the next like this: <code>a -> b -> c -> d</code>. 
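This degenerate chain is easy to reproduce with a direct Python transcription of the naive structure (parent links in a dict; an illustrative sketch of mine):

```python
def uf_find(parent, point):
    """Follow parent links to the root, counting the hops taken."""
    hops = 0
    while parent[point] is not None:
        point = parent[point]
        hops += 1
    return point, hops

# Each point was united with the previously added one: a -> b -> c -> d
parent = {'a': 'b', 'b': 'c', 'c': 'd', 'd': None}
print(uf_find(parent, 'a'))  # ('d', 3): the find walks the whole chain
```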
If we call <code>uf-find</code> on <code>a</code>, it will have to enumerate all of the set's elements.</p> <p>Yet, there is a way to improve the behavior of <code>uf-find</code>: by compressing the tree depth to make all points along the path to the root point to it, i.e. squashing each chain into a wide shallow tree of depth 1.</p> <pre><code> d<br />^ ^ ^<br />| | |<br />a b c<br /></code></pre> <p>Unfortunately, we can't do that, at once, for the whole subset, but, during each run of <code>uf-find</code>, we can compress one path, which will also shorten all the paths in the subtree that is rooted in the points on it! Still, this cannot guarantee that there will not be a sequence of enough unions to grow the trees faster than finds can flatten them. But there's another tweak that, combined with path compression, makes it possible to ensure sublinear (actually, almost constant) time for both operations: keep track of the size of all trees and link the smaller tree below the larger one. This will ensure that all trees' heights will stay below <code>(log n)</code>. 
The rigorous proof of that is quite complex, although, intuitively, we can see the tendency by looking at the base case: if we join a 2-element tree and a 1-element one, we'll still get a tree of height 2.</p> <p>Here is the implementation of the optimized version:</p> <pre><code>(defstruct point<br /> parent<br /> (size 1))<br /><br />(defun uf-find (point)<br /> (let ((parent (point-parent point)))<br /> (if parent<br /> ;; here, we use the fact that the assignment will also return<br /> ;; the value to perform both path compression and find<br /> (:= (point-parent point) (uf-find parent))<br /> point)))<br /><br />(defun uf-union (point1 point2)<br /> (with ((root1 (uf-find point1))<br /> (root2 (uf-find point2))<br /> (major minor (if (> (point-size root1)<br /> (point-size root2))<br /> (values root1 root2)<br /> (values root2 root1))))<br /> (:+ (point-size major) (point-size minor))<br /> (:= (point-parent minor) major)))<br /> </code></pre> <p>Here, Lisp multiple <code>values</code> come in handy to simplify the code. See footnote [1] for more details about them.</p> <p>The suggested approach is quite simple to implement, but its complexity analysis is rather involved. 
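For comparison, the same optimized scheme — path compression plus linking by size — can be transcribed into Python like this (a sketch of mine, not code from the book; it additionally guards against uniting a point with its own subset):

```python
class Point:
    """A disjoint-set node: a parent pointer plus a subtree size."""
    def __init__(self):
        self.parent = None  # None means the point is a root
        self.size = 1

def uf_find(point):
    """Find the root, re-pointing every node on the path to it."""
    if point.parent is None:
        return point
    point.parent = uf_find(point.parent)  # path compression
    return point.parent

def uf_union(point1, point2):
    """Link the root of the smaller tree below the larger one."""
    root1, root2 = uf_find(point1), uf_find(point2)
    if root1 is root2:
        return root1  # already in the same subset
    major, minor = ((root1, root2) if root1.size > root2.size
                    else (root2, root1))
    major.size += minor.size
    minor.parent = major
    return major

a, b, c = Point(), Point(), Point()
uf_union(a, b)
uf_union(b, c)
print(uf_find(a) is uf_find(c))  # True
```

After any find, every point on the visited path hangs directly off the root, so subsequent finds on those points take a single hop.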
So, I'll have to give just the final result: <code>m</code> union/find operations, with tree weighting and path compression, on a set of <code>n</code> objects will work in <code>O((m + n) log* n)</code> (where <code>log*</code> is the iterated logarithm — a very slowly increasing function that can be considered a constant for practical purposes).</p> <p>Finally, this is how to check if none of the points belong to the same subset in almost <code>O(n)</code>, where <code>n</code> is the number of points to check<a href="#f2-2" name="r2-2">[2]</a>, so in <code>O(1)</code> for 2 points:</p> <pre><code class=" language- ">(defun uf-disjoint (points)<br /> "Return true if all of the POINTS belong to different subsets."<br /> (let (roots)<br /> (dolist (point points)<br /> (let ((root (uf-find point)))<br /> (when (member root roots)<br /> (return-from uf-disjoint nil))<br /> (push root roots))))<br /> t)<br /></code></pre> <p>A couple more observations may be drawn from this simple example:</p> <ol><li>The clever idea that we initially have doesn't always work flawlessly at once. It is important to check the edge cases for potential problems.</li><li>We've seen an example of a data structure that doesn't, directly, exist: pieces of information are distributed over individual data points. Sometimes, there's a choice between storing the information, in a centralized way, in a dedicated structure like a hash-table and distributing it over individual nodes. The latter approach is often more elegant and efficient, although it's not so obvious.</li></ol><hr size="1"><p>Footnotes:</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r2-1" name="f2-1">[1]</a> Moreover, Python has special syntax for destructuring such tuples: <code>dec, rem = truncate(3.14)</code>. However, this is not the optimal way to handle returning the primary and one or more secondary values from a function. 
Lisp provides a more elegant solution called multiple values: all the necessary values are returned via the <code>values</code> form: <code>(values dec rem)</code> and can be retrieved with <code>(multiple-value-bind (dec rem) (truncate 3.14) ...)</code> or <code>(with ((dec rem (truncate 3.14))) ...)</code>. It is more elegant because secondary values may be discarded at will by calling the function in the usual way: <code>(+ 1 (truncate 3.14)) => 4</code> — not possible in Python, because you can't add a number to a tuple.</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r2-2" name="f2-2">[2]</a> Actually, the complexity here is <code>O(n^2)</code> due to the use of the function <code>member</code> that performs the set membership test in <code>O(n)</code>, but it's not essential to the general idea. If <code>uf-disjoint</code> is expected to be called with tens or hundreds of points, the <code>roots</code> structure has to be changed to a hash-set that has an <code>O(1)</code> membership operation.</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-25464123944170408692019-07-29T11:39:00.000+03:002019-12-16T07:43:36.830+02:00Programming Algorithms: A Crash Course in Lisp<p class="has-line-data" data-line-start="4" data-line-end="5">The introductory post for this book, unexpectedly, received quite a lot of attention, which is nice since it prompted some questions, one of which I planned to address in this chapter.</p><p>I expect that there will be two main audiences for this book:</p><ul><li class="has-line-data" data-line-start="6" data-line-end="7">people who’d like to advance in algorithms and writing efficient programs — the major group</li><li class="has-line-data" data-line-start="7" data-line-end="9">lispers, either accomplished or aspiring, who also 
happen to be interested in algorithms</li></ul><p class="has-line-data" data-line-start="9" data-line-end="10">This introductory chapter is, primarily, for the first group. After reading it, the rest of the book’s Lisp code should become understandable to you. Besides, you’ll know the basics to run Lisp and experiment with it, should you so desire.</p><p class="has-line-data" data-line-start="11" data-line-end="12">For the lispers, I have one comment and one remark. You might be interested to read this part just to understand my approach to utilizing the language throughout the book. Also, you’ll find my stance regarding the question that was voiced several times in the comments: whether it’s justified to use some 3rd-party extensions and to what extent, or should the author vigilantly stick to only the tools provided by the standard.</p><h2 class="code-line" data-line-start=14 data-line-end=15 ><a id="The_Core_of_Lisp_14"></a>The Core of Lisp</h2><img border="0" src="https://2.bp.blogspot.com/-SNngueqcr7I/XT6s1nCHafI/AAAAAAAACHo/SxVHMjDK_MMr1NI5HhJ1kVSPPHYGZgLYQCLcBGAs/s640/lisp.jpg" width="640" data-original-width="1600" data-original-height="936" /><p class="has-line-data" data-line-start="16" data-line-end="17">To effortlessly understand Lisp, you’ll have to forget, for a moment, any concepts of how programming languages should work that you might have acquired from your prior experience in coding. Lisp is simpler; and when people bring their Java, C or Python approaches to programming with it, first of all, the results are suboptimal in terms of code quality (simplicity, clarity, and beauty), and, what’s more important, there’s much less satisfaction from the process, not to mention very few insights and little new knowledge gained.</p><p class="has-line-data" data-line-start="18" data-line-end="19">It is much easier to explain Lisp if we begin from a blank slate. 
In essence, all there is to it is just an evaluation rule: Lisp programs consist of <strong>forms</strong> that are <strong>evaluated</strong> by the compiler. There are 3+2 ways in which that can happen:</p><ul><li class="has-line-data" data-line-start="20" data-line-end="21">self-evaluation: all literal constants (like <code>1</code>, <code>"hello"</code>, etc.) are evaluated to themselves. These literal objects can be either built-in primitive types (<code>1</code>) or data structures (<code>"hello"</code>)</li><li class="has-line-data" data-line-start="21" data-line-end="22">symbol evaluation: separate symbols are evaluated as names of variables, functions, types or classes depending on the context. The default is variable evaluation, i.e. if we encounter a symbol <code>foo</code>, the compiler will substitute in its place the current value associated with this variable (more on this a little bit later)</li><li class="has-line-data" data-line-start="22" data-line-end="27">expression evaluation: compound expressions are formed by grouping symbols and literal objects with parentheses. The following form <code>(oper 1 foo)</code> is considered a “functional” expression: the operator name is situated in the first position (head), and its arguments, if any, in the subsequent positions (rest). There are 3 ways to evaluate a functional expression: <ul><li class="has-line-data" data-line-start="23" data-line-end="24">there are 25 special operators that are defined in lower-level code and may be considered something like axioms of the language: they are pre-defined, always present, and immutable. Those are the building blocks, on top of which all else is constructed, and they include the sequential <code>block</code> operator, the conditional expression <code>if</code>, and the unconditional jump <code>go</code>, to name a few. 
If <code>oper</code> is the name of a special operator, the low-level code for this operator that deals with the arguments in its own unique way is executed</li><li class="has-line-data" data-line-start="24" data-line-end="25">there’s also ordinary function evaluation: if <code>oper</code> is a function name, first, all the arguments are evaluated with the same evaluation rule, and then the function is called with the obtained values</li><li class="has-line-data" data-line-start="25" data-line-end="27">finally, there’s macro evaluation. Macros provide a way to change the evaluation rule for a particular form. If <code>oper</code> names a macro, its code is substituted instead of our expression and then evaluated. Macros are a major topic in Lisp, and they are used to build a large part of the language, as well as provide an accessible way for the users to extend it. However, they are orthogonal to the subject of this book and won’t be discussed in further detail here. You can delve deeper into macros in such books as <a href="http://www.paulgraham.com/onlisp.html">On Lisp</a> or <a href="https://letoverlambda.com/">Let Over Lambda</a></li></ul></li></ul><p class="has-line-data" data-line-start="27" data-line-end="28">It’s important to note that, in Lisp, there’s no distinction between statements and expressions, no special keywords, no operator precedence rules, and no other similar arbitrary stuff you can stumble upon in other languages. Everything is uniform; everything is an expression in the sense that it will be evaluated and return some value.</p><h2 class="code-line" data-line-start=29 data-line-end=30 ><a id="A_Code_Example_29"></a>A Code Example</h2><p class="has-line-data" data-line-start="31" data-line-end="32">To sum up, let’s consider an example of the evaluation of a Lisp form. 
The following one implements the famous binary search algorithm over a sorted vector (we’ll discuss it in more detail in one of the following chapters):</p><pre><code class="has-line-data" data-line-start="34" data-line-end="47">(when (> (length vec) 0)<br /> (let ((beg 0)<br /> (end (length vec)))<br /> (do ()<br /> ((= beg end))<br /> (let ((mid (floor (+ beg end) 2)))<br /> (if (< (? vec mid) val)<br /> (:= beg (1+ mid))<br /> (:= end mid))))<br /> (values beg<br /> (? vec beg)<br /> (= (? vec beg) val))))<br /></code></pre><p class="has-line-data" data-line-start="48" data-line-end="49">It is a compound form. In it, the so-called top-level form is <code>when</code>, which is a macro for a one-clause conditional expression: an <code>if</code> with only the true-branch. First, it evaluates the expression <code>(> (length vec) 0)</code>, which is an ordinary function call applying the logical operator <code>></code> to two args: the result of obtaining the <code>length</code> of the contents of the variable <code>vec</code> and a constant <code>0</code>. If the evaluation returns true, i.e. the length of <code>vec</code> is greater than <code>0</code>, the rest of the form is evaluated in the same manner. The result of the evaluation, if nothing exceptional happens, is either false (which is called <code>nil</code>, in Lisp) or 3 values returned from the last form <code>(values ...)</code>. <code>?</code> is the generic access operator, which abstracts over different ways to query data structures by key. In this case, it retrieves the item from <code>vec</code> at the index given by the second argument. Below we’ll talk about the other operators shown here.</p><p class="has-line-data" data-line-start="50" data-line-end="51">But first I need to say a few words about <code>RUTILS</code>. It is a 3rd-party library that provides a number of extensions to the standard Lisp syntax and its basic operators. 
The reason for its existence is that the Lisp standard is not going to change, ever, and, like everything in this world, it has its flaws. Besides, our understanding of what’s elegant and efficient code evolves over time. The great advantage of the Lisp standard, however, which counteracts the issue of its immutability, is that its authors had put into it multiple ways to modify and evolve the language at almost all levels, starting from even the basic syntax. And this addresses our ultimate need, after all: we’re not so interested in changing the standard as we are in changing the language. So, <code>RUTILS</code> is one of the ways of evolving Lisp, and its purpose is to make programming in it more accessible without compromising the principles of the language. So, in this book, I will use some basic extensions from <code>RUTILS</code> and will explain them as needed. Surely, using 3rd-party tools is a question of preference and taste and might not be approved by some of the Lisp old-timers, but no worries: in your code, you’ll be able to easily swap them for your favorite alternatives.</p><h2 class="code-line" data-line-start=53 data-line-end=54 ><a id="The_REPL_53"></a>The REPL</h2><p class="has-line-data" data-line-start="55" data-line-end="56">Lisp programs are supposed to be run not only in a one-off fashion of simple scripts, but also as live systems that operate over long periods of time, experiencing change not only of their data but also code. This general way of interaction with a program is called Read-Eval-Print-Loop (REPL), which literally means that the Lisp compiler <code>read</code>s a form, <code>eval</code>uates it with the aforementioned rule, <code>print</code>s the results back to the user, and <code>loop</code>s over.</p><p class="has-line-data" data-line-start="57" data-line-end="58">REPL is the default way to interact with a Lisp program, and it is very similar to the Unix shell. 
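The read-eval-print skeleton itself is easy to model in any language. Here is a toy Python version of the loop, driven by a list of inputs instead of a terminal (purely illustrative; a real Lisp REPL reads s-expressions and keeps much richer state):

```python
def toy_repl(lines):
    """Read each form, evaluate it, and 'print' (collect) the result."""
    outputs = []
    for line in lines:               # loop
        form = line.strip()          # read
        value = eval(form)           # eval: fine for a demo, unsafe in general
        outputs.append(repr(value))  # print
    return outputs

print(toy_repl(['1 + 2', "'re' + 'pl'"]))  # ['3', "'repl'"]
```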
When you run your Lisp (for example, by entering <code>sbcl</code> at the shell) you’ll drop into the REPL. We’ll precede all REPL-based code interactions in the book with a REPL prompt (<code>CL-USER></code> or similar). Here’s an example:</p><pre><code class="has-line-data" data-line-start="60" data-line-end="64">CL-USER> (print "Hello world")<br />"Hello world" <br />"Hello world"<br /></code></pre><p class="has-line-data" data-line-start="65" data-line-end="66">A curious reader may be asking why <code>"Hello world"</code> is printed twice. It’s a proof that everything is an expression in Lisp. :) The <code>print</code> “statement”, unlike in most other languages, not only prints its argument to the console (or other output stream), but also returns it as is. This comes in very handy when debugging, as you can wrap almost any form in a <code>print</code> without changing the flow of the program.</p><p class="has-line-data" data-line-start="67" data-line-end="68">Obviously, if the interaction is not necessary, just the read-eval part may remain. 
But, what’s more important, Lisp provides a way to customize every stage of the process:</p><ul><li class="has-line-data" data-line-start="69" data-line-end="70">at the <code>read</code> stage special syntax (“syntax sugar”) may be introduced via a mechanism called reader macros</li><li class="has-line-data" data-line-start="70" data-line-end="71">ordinary macros are a way to customize the <code>eval</code> stage</li><li class="has-line-data" data-line-start="71" data-line-end="72">the <code>print</code> stage is conceptually the simplest one, and there’s also a standard way to customize object printing via the Common Lisp Object System’s (CLOS) <code>print-object</code> function</li><li class="has-line-data" data-line-start="72" data-line-end="73">and the <code>loop</code> stage can be replaced by any desired program logic</li></ul><h2 class="code-line" data-line-start=75 data-line-end=76 ><a id="Basic_Expressions_75"></a>Basic Expressions</h2><p class="has-line-data" data-line-start="77" data-line-end="78">The structured programming paradigm states that all programs can be expressed in terms of 3 basic constructs: sequential execution, branching, and looping. Let’s see how these operators are expressed in Lisp.</p><h3 class="code-line" data-line-start=79 data-line-end=80 ><a id="Sequential_Execution_79"></a>Sequential Execution</h3><p class="has-line-data" data-line-start="81" data-line-end="82">The simplest program flow is sequential execution. In all imperative languages, it is what is assumed to happen if you put several forms in a row and evaluate the resulting code block. 
Like this:</p><pre><code class="has-line-data" data-line-start="84" data-line-end="88">CL-USER> (print "hello") (+ 2 2)<br />"hello"<br />4<br /></code></pre><p class="has-line-data" data-line-start="89" data-line-end="90">The value returned by the last expression is returned as the value of the whole sequence.</p><p class="has-line-data" data-line-start="91" data-line-end="92">Here, the REPL-interaction forms an implicit unit of sequential code. However, there are many cases when we need to explicitly delimit such units. This can be done with the <code>block</code> operator:</p><pre><code class="has-line-data" data-line-start="94" data-line-end="100">CL-USER> (block test<br /> (print "hello")<br /> (+ 2 2))<br />"hello"<br />4<br /></code></pre><p class="has-line-data" data-line-start="101" data-line-end="102">Such a block has a name (in this example: <code>test</code>). This makes it possible to end its execution prematurely by using the operator <code>return-from</code>:</p><pre><code class="has-line-data" data-line-start="104" data-line-end="110">CL-USER> (block test<br /> (return-from test 0)<br /> (print "hello")<br /> (+ 2 2))<br />0<br /></code></pre><p class="has-line-data" data-line-start="111" data-line-end="112">A shorthand <code>return</code> is used to exit from blocks with a <code>nil</code> name (which are implicit in most of the looping constructs we’ll see further):</p><pre><code class="has-line-data" data-line-start="114" data-line-end="120">CL-USER> (block nil<br /> (return 0)<br /> (print "hello")<br /> (+ 2 2))<br />0<br /></code></pre><p class="has-line-data" data-line-start="121" data-line-end="122">Finally, if we don’t plan to ever return from a block prematurely, we can use the <code>progn</code> operator that doesn’t require a name:</p><pre><code class="has-line-data" data-line-start="124" data-line-end="130">CL-USER> (progn<br /> (print "hello")<br /> (+ 2 2))<br />"hello"<br />4<br /></code></pre><h3 class="code-line" data-line-start=131 
data-line-end=132 ><a id="Branching_131"></a>Branching</h3><p class="has-line-data" data-line-start="133" data-line-end="134">Conditional expressions calculate the value of their first form and, depending on it, execute one of several alternative code paths. The basic conditional expression is <code>if</code>:</p><pre><code class="has-line-data" data-line-start="136" data-line-end="142">CL-USER> (if nil<br /> (print "hello")<br /> (print "world"))<br />"world"<br />"world"<br /></code></pre><p class="has-line-data" data-line-start="143" data-line-end="144">As we’ve seen, <code>nil</code> is used to represent logical falsity in Lisp. All other values are considered logically true, including the symbol <code>T</code> or <code>t</code>, which directly denotes truth.</p><p class="has-line-data" data-line-start="145" data-line-end="146">And when we need to do several things at once in one of the conditional branches, this is one of the cases where we need to use <code>progn</code> or <code>block</code>:</p><pre><code class="has-line-data" data-line-start="148" data-line-end="156">CL-USER> (if (+ 2 2)<br /> (progn<br /> (print "hello")<br /> 4)<br /> (print "world"))<br />"hello"<br />4<br /></code></pre><p class="has-line-data" data-line-start="157" data-line-end="158">However, often we don’t need both branches of the expression, i.e. we don’t care what will happen if our condition doesn’t hold (or holds). 
This is such a common case that there are special expressions for it in Lisp — <code>when</code> and <code>unless</code>:</p><pre><code class="has-line-data" data-line-start="160" data-line-end="170">CL-USER> (when (+ 2 2)<br /> (print "hello")<br /> 4)<br />"hello"<br />4<br />CL-USER> (unless (+ 2 2)<br /> (print "hello")<br /> 4)<br />NIL<br /></code></pre><p class="has-line-data" data-line-start="171" data-line-end="172">As you see, it’s also handy because you don’t have to explicitly wrap the sequential forms in a <code>progn</code>.</p><p class="has-line-data" data-line-start="173" data-line-end="174">One other standard conditional expression is <code>cond</code>, which is used when we want to evaluate several conditions in a row:</p><pre><code class="has-line-data" data-line-start="176" data-line-end="187">CL-USER> (cond<br /> ((typep 4 'string)<br /> (print "hello"))<br /> ((> 4 2)<br /> (print "world")<br /> nil)<br /> (t<br /> (print "can't get here")))<br />"world"<br />NIL<br /></code></pre><p class="has-line-data" data-line-start="188" data-line-end="189">The <code>t</code> case is a catch-all that will trigger if none of the previous conditions worked (as its condition is always true). 
The above code is equivalent to the following:</p><pre><code class="has-line-data" data-line-start="191" data-line-end="199">(if (typep 4 'string)<br /> (print "hello")<br /> (if (> 4 2)<br /> (progn<br /> (print "world")<br /> nil)<br /> (print "can't get here")))<br /></code></pre><p class="has-line-data" data-line-start="200" data-line-end="201">There are many more conditional expressions in Lisp, and it’s very easy to define your own with macros (that’s, actually, how <code>when</code>, <code>unless</code>, and <code>cond</code> are defined), and when there arises a need to use a special one, we’ll discuss its implementation.</p><h3 class="code-line" data-line-start=202 data-line-end=203 ><a id="Looping_202"></a>Looping</h3><p class="has-line-data" data-line-start="204" data-line-end="205">Like with branching, Lisp has a rich set of looping constructs, and it’s also easy to define new ones when necessary. This approach is different from the mainstream languages, which usually have a small number of such statements and, sometimes, provide an extension mechanism via polymorphism. And it’s even considered to be a virtue justified by the idea that it’s less confusing for beginners. It makes sense to a degree. Still, in Lisp, both generic and custom approaches manage to coexist and complement each other. Yet, the tradition of defining custom control constructs is very strong. Why? One justification for this is the parallel to human languages: indeed, <code>when</code> and <code>unless</code>, as well as <code>dotimes</code> and <code>loop</code> are either directly words from the human language or are derived from natural language expressions. Our mother tongues are not so primitive and dry. The other reason is because you can™. I.e. it’s so much easier to define custom syntactic extensions in Lisp than in other languages that sometimes it’s just impossible to resist. 
:) And in many use cases they make the code much simpler and clearer.</p><p class="has-line-data" data-line-start="206" data-line-end="207">Anyway, as a complete beginner, you only have to know about the same number of iteration constructs as in any other language. The simplest one is <code>dotimes</code> that iterates the counter variable a given number of times (from 0 to <code>(- times 1)</code>) and executes the body on each iteration. It is analogous to <code>for (int i = 0; i < times; i++)</code> loops found in C-like languages.</p><pre><code class="has-line-data" data-line-start="209" data-line-end="216">CL-USER> (dotimes (i 3)<br /> (print i))<br />0<br />1<br />2<br />NIL<br /></code></pre><p class="has-line-data" data-line-start="217" data-line-end="218">The return value is <code>nil</code> by default, although it may be specified in the loop header.</p><p class="has-line-data" data-line-start="219" data-line-end="220">The most versatile (and low-level) looping construct, on the other hand, is <code>do</code>:</p><pre><code class="has-line-data" data-line-start="222" data-line-end="236">CL-USER> (do ((i 0 (1+ i))<br /> (prompt (read-line) (read-line)))<br /> ((> i 1) i)<br /> (print (pair i prompt))<br /> (terpri))<br />foo<br /><br />(0 "foo") <br />bar<br /><br />(1 "bar") <br /><br />2<br /></code></pre><p class="has-line-data" data-line-start="237" data-line-end="238"><code>do</code> iterates a number of variables (zero or more) that are defined in the first part (here, <code>i</code> and <code>prompt</code>) until the termination condition in the second part is satisfied (here, <code>(> i 1)</code>), and as with <code>dotimes</code> (and other do-style macros) executes its body — the rest of the forms (here, <code>print</code> and <code>terpri</code>, which is a shorthand for printing a newline). 
<code>read-line</code> reads from standard input until a newline is encountered and <code>1+</code> returns the current value of <code>i</code> increased by 1.</p><p class="has-line-data" data-line-start="239" data-line-end="240">All do-style macros (and there’s quite a number of them, both built-in and provided by external libraries: <code>dolist</code>, <code>dotree</code>, <code>do-register-groups</code>, <code>dolines</code> etc.) have an optional return value. In <code>do</code>, it follows the termination condition — here, it is just the final value of <code>i</code>.</p><p class="has-line-data" data-line-start="241" data-line-end="242">Besides do-style iteration, there’s also a substantially different beast in the CL ecosystem — the infamous <code>loop</code> macro. It is very versatile, although somewhat unlispy in terms of syntax and with a few surprising behaviors. But elaborating on it is beyond the scope of this book, especially since there’s an excellent introduction to <code>loop</code> in Peter Seibel’s "<a href="http://www.gigamonkeys.com/book/loop-for-black-belts.html">LOOP for Black Belts</a>".</p><p class="has-line-data" data-line-start="243" data-line-end="244">Many languages provide a generic looping construct that is able to iterate an arbitrary sequence, a generator, and other similar-behaving things — usually, some variant of <code>foreach</code>. 
We’ll return to such constructs after speaking about sequences in more detail.</p><p class="has-line-data" data-line-start="245" data-line-end="246">And there’s also an alternative iteration philosophy: the functional one, which is based on higher-order functions (<code>map</code>, <code>reduce</code> and similar) — we’ll also cover it in more detail in the following chapters.</p><h3 class="code-line" data-line-start=247 data-line-end=248 ><a id="Procedures_and_Variables_247"></a>Procedures and Variables</h3><p class="has-line-data" data-line-start="249" data-line-end="250">We have covered the 3 pillars of structural programming, but one essential, in fact, the most essential, construct still remains — variables and procedures.</p><p class="has-line-data" data-line-start="251" data-line-end="252">What if I told you that you can perform the same computation many times, but changing some parameters… OK, OK, pathetic joke. So, procedures are the simplest way to reuse computations, and procedures accept arguments, which allows passing values into their bodies. A procedure, in Lisp, is called <code>lambda</code>. You can define one like this: <code>(lambda (x y) (+ x y))</code>. When used, such a procedure — often also called a function, although it’s quite different from what we consider a mathematical function, and, since it doesn’t have any name, in this case, an anonymous function — will produce the sum of its inputs:</p><pre><code class="has-line-data" data-line-start="254" data-line-end="257">CL-USER> ((lambda (x y) (+ x y)) 2 2)<br />4<br /></code></pre><p class="has-line-data" data-line-start="258" data-line-end="259">It is quite cumbersome to refer to procedures by their full code signature, and an obvious solution is to assign names to them. 
A common way to do that in Lisp is via the <code>defun</code> macro:</p><pre><code class="has-line-data" data-line-start="261" data-line-end="266">CL-USER> (defun add2 (x y) (+ x y))<br />ADD2<br />CL-USER> (add2 2 2)<br />4<br /></code></pre><p class="has-line-data" data-line-start="267" data-line-end="268">The arguments of a procedure are examples of variables. Variables are used to name memory cells whose contents are used more than once and may be changed in the process. They serve different purposes:</p><ul><li class="has-line-data" data-line-start="269" data-line-end="270">to pass data into procedures</li><li class="has-line-data" data-line-start="270" data-line-end="271">as temporary placeholders for some varying data in code blocks (like loop counters)</li><li class="has-line-data" data-line-start="271" data-line-end="272">as a way to store computation results for further reuse</li><li class="has-line-data" data-line-start="272" data-line-end="273">to define program configuration parameters (like the OS environment variables, which can also be thought of as arguments to the main function of our program)</li><li class="has-line-data" data-line-start="273" data-line-end="274">to refer to global objects that should be accessible from anywhere in the program (like the <code>*standard-output*</code> stream)</li><li class="has-line-data" data-line-start="274" data-line-end="276">and more</li></ul><p class="has-line-data" data-line-start="276" data-line-end="277">Can we live without variables? Theoretically, well, maybe. At least, there’s the so-called point-free style of programming that strongly discourages the use of variables. But, as they say, don’t try this at home (at least, until you know perfectly well what you’re doing :) Can we replace variables with constants, or single-assignment variables, i.e. variables that can’t change over time? Such an approach is promoted by the so-called <em>purely</em> functional languages. To a certain degree, yes. 
But, from the point of view of algorithm development, it makes life a lot harder by complicating many optimizations if not ruling them out entirely.</p><p class="has-line-data" data-line-start="278" data-line-end="279">So, how to define variables in Lisp? You’ve already seen some of the variants: procedural arguments and <code>let</code>-bindings. Such variables are called local or lexical, in Lisp parlance. That’s because they are only accessible locally throughout the execution of the code block, in which they are defined. <code>let</code> is a general way to introduce such local variables, which is <code>lambda</code> in disguise (a thin layer of syntax sugar over it):</p><pre><code class="has-line-data" data-line-start="281" data-line-end="288">CL-USER> (let ((x 2))<br /> (+ x x))<br />4<br />CL-USER> ((lambda (x) (+ x x))<br /> 2)<br />4<br /></code></pre><p class="has-line-data" data-line-start="289" data-line-end="290">While with <code>lambda</code> you can create a procedure in one place, possibly, assign it to a variable (that’s what, in essence, <code>defun</code> does), and then apply it many times in various places, with <code>let</code> you define a procedure and immediately call it, leaving no way to store it and re-apply it afterwards. That’s even more anonymous than an anonymous function! Also, it incurs no overhead from the compiler. But the mechanism is the same.</p><p class="has-line-data" data-line-start="291" data-line-end="292">Creating variables via <code>let</code> is called binding, because they are immediately assigned (bound with) values. It is possible to bind several variables at once:</p><pre><code class="has-line-data" data-line-start="294" data-line-end="299">CL-USER> (let ((x 2)<br /> (y 2))<br /> (+ x y))<br />4<br /></code></pre><p class="has-line-data" data-line-start="300" data-line-end="301">However, often we want to define a series of variables, with the next ones using the values of the previous ones. 
It is cumbersome to do with <code>let</code>, because you need nesting (as procedural arguments are assigned independently):</p><pre><code class="has-line-data" data-line-start="303" data-line-end="309">(let ((len (length list)))<br /> (let ((mid (floor len 2)))<br /> (let ((left-part (subseq list 0 mid))<br /> (right-part (subseq list mid)))<br /> ...)))<br /></code></pre><p class="has-line-data" data-line-start="310" data-line-end="311">To simplify this use case, there’s <code>let*</code>:</p><pre><code class="has-line-data" data-line-start="313" data-line-end="319">(let* ((len (length list))<br /> (mid (floor len 2))<br /> (left-part (subseq list 0 mid))<br /> (right-part (subseq list mid)))<br /> ...)<br /></code></pre><p class="has-line-data" data-line-start="320" data-line-end="321">However, there are many other ways to define variables: bind multiple values at once; perform the so-called “destructuring” binding when the contents of a data structure (usually, a list) are assigned to several variables, first element to the first variable, second to the second, and so on; access the slots of a certain structure etc. For such use cases, there’s <code>with</code> binding from RUTILS, which works like <code>let*</code> with extra powers. 
Here’s a very simple example:</p><pre><code class="has-line-data" data-line-start="323" data-line-end="331">(with ((len (length list))<br /> (mid rem (floor len 2))<br /> ;; this group produces a list of 2 sublists<br /> ;; that are bound to left-part and right-part<br /> ;; and ; character starts a comment in lisp<br /> ((left-part right-part) (group mid list)))<br /> ...)<br /></code></pre><p class="has-line-data" data-line-start="332" data-line-end="333">In the code throughout this book, you’ll only see these two binding constructs: <code>let</code> for trivial and parallel bindings and <code>with</code> for all the rest.</p><p class="has-line-data" data-line-start="334" data-line-end="335">As we said, variables may not only be defined but also modified (otherwise, they’d be called “constants” instead). To alter the variable’s value we’ll use <code>:=</code> from RUTILS (it is an abbreviation of the standard <code>psetf</code> macro):</p><pre><code class="has-line-data" data-line-start="337" data-line-end="344">CL-USER> (let ((x 2))<br /> (print (+ x x))<br /> (:= x 4)<br /> (+ x x))<br />4<br />8<br /></code></pre><p class="has-line-data" data-line-start="345" data-line-end="346">Modification, generally, is a dangerous construct as it can create unexpected action-at-a-distance effects, when changing the value of a variable in one place of the code affects the execution of a different part that uses the same variable. 
This, however, can’t happen with lexical variables: each <code>let</code> creates its own scope that shields the previous values from modification (just like passing arguments to a procedure call and modifying them within the call doesn’t alter those values, in the calling code):</p><pre><code class="has-line-data" data-line-start="348" data-line-end="357">CL-USER> (let ((x 2))<br /> (print (+ x x))<br /> (let ((x 4))<br /> (print (+ x x)))<br /> (print (+ x x)))<br />4<br />8<br />4<br /></code></pre><p class="has-line-data" data-line-start="358" data-line-end="359">Obviously, when you have two <code>let</code>s in different places using the same variable name they don’t affect each other and these two variables are, actually, totally distinct.</p><p class="has-line-data" data-line-start="360" data-line-end="361">Yet, sometimes it is useful to modify a variable in one place and see the effect in another. The variables that have such behavior are called global or dynamic (and also special, in Lisp jargon). They have several important purposes. One is defining important configuration parameters that need to be accessible anywhere. The other is referencing general-purpose singleton objects like the standard streams or the state of the random number generator. Yet another is pointing to some context that can be altered in certain places subject to the needs of a particular procedure (for instance, the <code>*package*</code> global variable determines in what package we operate — <code>CL-USER</code> in all previous examples). More advanced uses for global variables also exist. 
The common way to define a global variable is with <code>defparameter</code>, which specifies its initial value:</p><pre><code class="has-line-data" data-line-start="363" data-line-end="366">(defparameter *connection* nil<br /> "A default connection object.") ; this is a docstring describing the variable<br /></code></pre><p class="has-line-data" data-line-start="367" data-line-end="368">Global variables, in Lisp, usually have so-called “earmuffs” around their names to remind the user of what they are dealing with. Due to their action-at-a-distance behavior, they are not the safest programming language feature, and even a “global variables considered harmful” mantra exists. Lisp is, however, not one of those squeamish languages, and it finds many uses for special variables. By the way, they are called “special” due to a special feature, which greatly broadens the possibilities for their sane usage: if bound in <code>let</code> they act as lexical variables, i.e. the previous value is preserved and restored upon leaving the body of a <code>let</code>:</p><pre><code class="has-line-data" data-line-start="370" data-line-end="384">CL-USER> (defparameter *temp* 1)<br />*TEMP*<br />CL-USER> (print *temp*)<br />1<br />CL-USER> (progn<br /> (let ((*temp* 2))<br /> (print *temp*)<br /> (:= *temp* 4)<br /> (print *temp*))<br /> *temp*)<br />2<br />4<br />1<br /></code></pre><p class="has-line-data" data-line-start="385" data-line-end="386">Procedures in Lisp are first-class objects. This means that you can assign them to variables, as well as inspect and redefine them at run-time, and, consequently, do many other useful things with them. 
The RUTILS function <code>call</code><a href="#f1-1" name="r1-1">[1]</a> will call a procedure passed to it as an argument:</p><pre><code class="has-line-data" data-line-start="388" data-line-end="394">CL-USER> (call 'add2 2 2)<br />4<br />CL-USER> (let ((add2 (lambda (x y) (+ x y))))<br /> (call add2 2 2))<br />4<br /></code></pre><p class="has-line-data" data-line-start="399" data-line-end="400">In fact, defining a function with <code>defun</code> also creates a global variable, although in the function namespace. Functions, types, classes — all of these objects are usually defined as global. Though, for functions there’s a way to define them locally with <code>flet</code>:</p><pre><code class="has-line-data" data-line-start="402" data-line-end="410">CL-USER> (foo 1)<br />;; ERROR: The function COMMON-LISP-USER::FOO is undefined.<br />CL-USER> (flet ((foo (x) (1+ x)))<br /> (foo 1))<br />2<br />CL-USER> (foo 1)<br />;; ERROR: The function COMMON-LISP-USER::FOO is undefined.<br /></code></pre><h3 class="code-line" data-line-start=411 data-line-end=412 ><a id="Comments_411"></a>Comments</h3><p class="has-line-data" data-line-start="413" data-line-end="414">Finally, there’s one more syntax we need to know: how to put comments in the code. Only losers don’t comment their code, and comments will be used extensively, throughout this book, to explain some parts of the code examples right inside of them. Comments, in Lisp, start with a <code>;</code> character and end at the end of a line. So, the following snippet is a comment: <code>; this is a comment</code>. There’s also a common style of commenting, when short comments that follow the current line of code start with a single <code>;</code>, longer comments for a certain code block precede it, occupy the whole line or a number of lines and start with <code>;;</code>, and comments for code sections that include several Lisp top-level forms (global definitions) start with <code>;;;</code> and also occupy whole lines. 
Besides, each global definition can have a special comment-like string, called the “docstring”, that is intended to describe its purpose and usage, and that can be queried programmatically. To put it all together, this is what different comments may look like:</p><pre><code class="has-line-data" data-line-start="416" data-line-end="428">;;; Some code section<br /><br />(defun this ()<br /> "This has a curious docstring."<br /> ...)<br /><br />(defun that ()<br /> ...<br /> ;; this is an interesting block don't you find?<br /> (block interesting<br /> (print "hello"))) ; it prints hello<br /></code></pre><h2 class="code-line" data-line-start=430 data-line-end=431 ><a id="Getting_Started_430"></a>Getting Started</h2><p class="has-line-data" data-line-start="432" data-line-end="433">I strongly encourage you to play around with the code presented in the following chapters of the book. Try to improve it, find issues with it, and come up with fixes, measure and trace everything. This will not only help you master some Lisp, but also gain a much deeper understanding of the discussed algorithms and data structures, their pitfalls and corner cases. Doing that is, in fact, quite easy. All you need is to install some Lisp (preferably, SBCL or CCL), add Quicklisp, and, with its help, RUTILS.</p><p class="has-line-data" data-line-start="434" data-line-end="435">As I said above, the usual way to work with Lisp is interacting with its REPL. Running the REPL is fairly straightforward. On my Mint Linux I’d run the following commands:</p><pre><code class="has-line-data" data-line-start="437" data-line-end="447">$ apt-get install sbcl rlwrap<br />...<br />$ rlwrap sbcl<br />...<br />* (print "hello world")<br /><br />"hello world" <br />"hello world"<br />* <br /></code></pre><p class="has-line-data" data-line-start="448" data-line-end="449"><code>*</code> is the Lisp raw prompt. It’s, basically, the same as the <code>CL-USER></code> prompt you’ll see in SLIME. 
You can also run a Lisp script file: <code>sbcl --script hello.lisp</code>. If it contains just a single <code>(print "hello world")</code> line, we’ll see the “hello world” phrase printed to the console.</p><p class="has-line-data" data-line-start="450" data-line-end="451">This is a working, but not the most convenient setup. A much more advanced environment is <a href="https://common-lisp.net/project/slime/">SLIME</a> that works inside Emacs (a similar project for vim is called SLIMV). There are a number of other solutions: some Lisp implementations provide an IDE, some IDEs and editors provide integration.</p><p class="has-line-data" data-line-start="452" data-line-end="453">After getting into the REPL, you’ll have to issue the following commands:</p><pre><code class="has-line-data" data-line-start="455" data-line-end="459">* (ql:quickload :rutilsx)<br />* (use-package :rutilsx)<br />* (named-readtables:in-readtable rutilsx-readtable)<br /></code></pre><p class="has-line-data" data-line-start="460" data-line-end="461">Well, that’s all the Lisp you’ll need to know to start. We’ll get acquainted with other Lisp concepts as they become necessary for the next chapters of this book. Yet, you’re all set to read and write Lisp programs. They may seem unfamiliar, at first, but as you overcome the initial bump and get used to their parenthesized prefix surface syntax, I promise that you’ll be able to recognize and appreciate their clarity and conciseness.</p><p class="has-line-data" data-line-start="462" data-line-end="463">So, as they say in Lisp land, happy hacking!</p><hr size="1"><p>Footnotes:</p><p class="has-line-data" data-line-start="395" data-line-end="398"><a href="#r1-1" name="f1-1">[1]</a> <code>call</code> is the RUTILS abbreviation of the standard <code>funcall</code>. 
It was surely fun to be able to call a function from a variable back in the 60’s, but now it has become so much more common that there’s no need for the prefix ;)</p><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-34817072413946390422019-07-22T17:47:00.000+03:002019-07-29T11:48:24.804+03:00"Programming Algorithms" Book<div class="separator" style="text-align: center; clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><a href="https://2.bp.blogspot.com/-xmwP2e5QeSA/XTXJQd0BTFI/AAAAAAAACG4/FAlIVeffx1M524S7ANdlbwu7Pg4DxK74QCLcBGAs/s1600/cover3.png" imageanchor="1"><img border="0" src="https://2.bp.blogspot.com/-xmwP2e5QeSA/XTXJQd0BTFI/AAAAAAAACG4/FAlIVeffx1M524S7ANdlbwu7Pg4DxK74QCLcBGAs/s320/cover3.png" width="213" height="320" data-original-width="960" data-original-height="1440" /></a><br>Drago — a nice example of a real-world binary tree</div><p>I'm writing a book about algorithms and Lisp. It, actually, started several years ago, but as I experience a constant shortage of quality time to devote to such side activities, short periods of writing alternated with long pauses. Now, I'm, finally, at the stage when I can start publishing it. But I intend to do that, first, gradually in this blog and then put the final version — hopefully, improved and polished thanks to the comments of the first readers — on Leanpub. The book will be freely available with a <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND license</a>. <p>The book will have 16 chapters grouped into 3 parts: essential data structures, derivative ones, and advanced algorithms. I plan to publish each one approximately once a week, so as to finish the process by the end of the year. 
<p>I hope the book turns out to be an enlightening read for those who start their career in programming or want to level up in it. At least, I tried to accumulate in it all my experience from production algorithmic development, teaching these topics, and Lisp programming, over the last 10+ years. Below is a short preface and an introductory chapter about Complexity. <h2>Why Algorithms Matter</h2><p>In our industry, currently, there seems to prevail a certain misunderstanding of the importance of algorithms for the working programmer. There's often a disconnect between the algorithmic questions posed at the job interviews and the everyday essence of the same job. That's why <a href="http://nathanmarz.com/blog/the-limited-value-of-a-computer-science-education.html">opinions</a> are <a href="https://news.ycombinator.com/item?id=9695102">voiced</a> that you, actually, don't have to know CS to be successful in the software developer's job. That's true, you don't, but you'd better, if you want to be among the notorious top 10% of programmers. For several reasons. One is that, actually, you can find room for algorithms almost at every corner of your work — provided you are aware of their existence. To put it simply, the fact that you don't know a more efficient or elegant solution to a particular programming problem doesn't make your code less crappy. The current trend in software development is that, although the hardware becomes more performant, the software becomes slower faster. There are two reasons for that, in my humble opinion: <ol><li>Most of the application programmers don't know the inner workings of the underlying platforms. And the number of platform layers keeps increasing.</li><li>Most of the programmers also don't know enough algorithms and algorithmic development techniques to squeeze the most from their code. 
And often this means a loss of one or more orders of magnitude of performance.</li></ol><p>In the book, I'll address, primarily, the second issue but will also try to touch on the first whenever possible. <p>Besides, learning the art of solving difficult algorithmic problems trains the brain and makes it more apt at solving various other problems, in the course of your day-to-day work. <p>Finally, you will be speaking the same lingua franca as other advanced programmers — the tongue that transcends the mundane differences of particular programming languages. And you'll gain a more detached view of those differences, freeing your mind from the dictate of a particular set of choices exhibited in any one of them. <p>One of the reasons for this gap in the understanding of the value of algorithms, probably, lies in how they are usually presented in the computer science curriculum. First, it is often done in a rather theoretical or "mathematical" way with rigorous proofs and lack of connection to the real world™. Second, the audience is usually freshmen or sophomores who don't have a lot of practical programming experience and thus can't appreciate how this knowledge may be applied to their own programming challenges (because they haven't had those yet) — rather, most of them are still at the level of struggling to learn their first programming language well and, in their understanding of computing, are very much tied to its choices and idiosyncrasies. <p>In this book, the emphasis is made on the demonstration of the use of the described data structures and algorithms in various areas of computer programming. Moreover, I anticipate that the self-selected audience will comprise programmers with some experience in the field. This makes a significant difference in the set of topics that are relevant and how they can be conveyed. 
Another thing that helps a lot is when the programmer has a good command of more than one programming language, especially, if the languages are from different paradigms: static and dynamic, object-oriented and functional. These factors allow bridging the gap between "theoretical" algorithms and practical coding, making the topic accessible, interesting, and inspiring. <p>This is one answer to a possible question: why write another book on algorithms? Indeed, there are several good textbooks and online courses on the topic, of which I'd most recommend Steven Skiena's <a href="http://www.algorist.com/">The Algorithm Design Manual</a>. Yet, as I said, this book is not at all academic in the presentation of the material, which is the norm for other textbooks. Except for simple arithmetic, it contains almost no "math" or proofs. And, although proper attention is devoted to algorithm complexity, it doesn't deal with theories of complexity or computation and similar scientific topics. Besides, all the algorithms and data structures come with some example practical use cases. Last, but not least, there's no book on algorithms in Lisp, and, in my opinion, it's a great topic to introduce the language. The next chapter will provide a crash course to grasp the basic ideas, and then we'll discuss various Lisp programming approaches alongside the algorithms they will be used to implement. <p>This is an introductory book, not a bible of algorithms. It will draw a comprehensive picture and cover all topics necessary for further advancement of your algorithmic knowledge. However, it won't go too deep into the advanced topics, such as persistent or probabilistic data structures, advanced tree, graph, and optimization algorithms, as well as algorithms for particular fields, such as Machine Learning, Cryptography or Computational Geometry. All of those fields require (and usually have) separate books of their own. 
<h2>A Few Words about Lisp</h2>For a long time, I've been contemplating writing an introductory book on Lisp, but something didn't add up: I couldn't see a coherent picture in my mind. And then I got a chance to teach algorithms with Lisp. From my point of view, it's a perfect fit for demonstrating data structures and algorithms (with a caveat that students should be willing to learn it), while discussing the practical aspects of those algorithms allows explaining the language naturally. At the same time, this topic requires almost no foray into the adjacent areas of programming, such as architecture and program design, integration with other systems, user interface, and use of advanced language features, such as types or macros. And that is great because those topics are overkill for an introductory text and they are also addressed nicely and in great detail elsewhere (see <a href="http://www.gigamonkeys.com/book/">Practical Common Lisp</a> and <a href="http://www.paulgraham.com/acl.html">ANSI Common Lisp</a>). <p>Why is Lisp great for algorithmic programs? One reason is that the language was created with such a use case in mind. It has support for all the proper basic data structures, such as arrays, hash-tables, linked lists, strings, and tuples. It also has a numeric tower, which means no overflow errors and, so, a much saner math. Next, it's created for the interactive development style, so the experimentation cycle is very short, there's no compile-wait-run-revise red tape, and there are no unnecessary constraints, like the need for additional annotations (a.k.a. types), prohibition of variable mutation or other stuff like that. You just write a function in the REPL, run it and see the results. In my experience, Lisp programs look almost like pseudocode. Compared to other languages, they may be slightly more verbose at times but are much clearer, simpler, and directly compatible with the algorithm's logical representation. 
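<p>To give a tiny taste of the numeric tower mentioned above, here is a hypothetical REPL session (not taken from the book's examples, though any conforming CL implementation should behave the same way): integers silently grow into bignums instead of overflowing, and dividing integers yields exact rationals rather than rounded floats: <code><pre><br />CL-USER> (expt 2 100) ; no overflow — a bignum is produced<br />1267650600228229401496703205376<br />CL-USER> (/ 1 3) ; an exact rational, not 0 and not 0.333...<br />1/3<br />CL-USER> (+ 1/3 2/3) ; rational arithmetic stays exact<br />1<br /></pre></code>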
<p>But why not choose a popular programming language? The short answer is that it wouldn't have been optimal. There are 4 potential mainstream languages that could be considered for this book: C++, Java, Python, and JavaScript. (Surely, there's already enough material on algorithms that uses them). The first two are statically-typed, which is, in itself, a big obstacle to using them as teaching languages. Java is also too verbose, while C++ — too low-level. These qualities don't prevent them from being used in the majority of production algorithm code, in the wild, and you'll, probably, end up dealing with such code sooner rather than later if not already. Besides, their standard libraries provide great examples of practical algorithm implementation. But, I believe that gaining a good conceptual understanding will allow you to easily adapt to one of these languages if necessary, while learning them in parallel with diving into algorithms creates unnecessary complexity. Python and JS are, in many ways, the opposite choices: they are dynamic and provide some level of an interactive experience (albeit inferior compared to Lisp), but those languages are in many ways anti-algorithmic. Trying to be simple and accessible, they hide too much from the programmer and don't give enough control of the concrete data. Teaching algorithms, using their standard libraries, seems like cheating to me as their basic data structures often are not what they claim to be. Lisp is in the middle: it is both highly interactive and gives enough control of the environment, while not being too verbose and demanding. And the price to pay — the unfamiliar syntax — is really small, in my humble opinion. <p>Yet, there's another language that is rapidly gaining popularity and is considered by many to be a good choice for algorithmic development — Rust. It's also a static language, although not so ceremonial as Java or C++. However, neither am I an expert in Rust, nor do I intend to become one. 
Moreover, I think the same general considerations regarding static languages apply to it. <h2>Algorithmic Complexity</h2><p>Complexity is a point that will be mentioned literally on every page of this book; the discussion of any algorithm or data structure can't avoid this topic. After correctness, it is the second most important quality of every algorithm — moreover, often correctness alone doesn't matter if complexity is neglected, while the opposite is possible: to compromise correctness somewhat in order to get significantly better complexity. By and large, algorithm theory differs from other subjects of CS in that it is concerned not with presenting a working (correct) way to solve some problem but with finding an efficient way to do it, where efficiency is understood as the minimal (or admissible) number of operations performed and the amount of memory occupied. <p>In principle, the complexity of an algorithm is the dependence of the number of operations that will be performed on the size of the input. It is crucial to the computer system's scalability: it may be easy to solve the programming problem for a particular set of inputs, but how will the solution behave if the input is doubled, increased tenfold or million-fold? This is not a theoretical question, and an analysis of any general-purpose algorithm should have a clear answer to it. <p>Complexity is a substantial research topic: a whole separate branch of CS — Complexity Theory — exists to study it. Yet, throughout the book, we'll try to utilize the end results of such research without delving deep into rigorous proofs or complex math, especially since, in most of the cases, measuring complexity is a matter of simple counting. 
Let's look at the following illustrative example: <code><pre><br />(defun mat-max (mat)<br />  (let (max)<br />    (dotimes (i (array-dimension mat 0))<br />      (dotimes (j (array-dimension mat 1))<br />        (when (or (null max)<br />                  (> (aref mat i j) max))<br />          (:= max (aref mat i j)))))<br />    max))<br /></pre></code><p>This function finds the maximum element of a two-dimensional array (matrix): <code><pre><br />CL-USER> (mat-max #2A((1 2 3) (4 5 6)))<br />6<br /></pre></code><p>What's its complexity? To answer, we can just count the number of operations performed: at each iteration of the inner loop, there are 2 comparisons involving 1 array access, and, sometimes, if the planets align, we perform another access for the assignment. The inner loop is executed <code>(array-dimension mat 1)</code> times (let's call it <code>m</code> where <code>m=3</code>), and the outer one — <code>(array-dimension mat 0)</code> (<code>n=2</code>, in the example). If we sum this all up we'll get: <code>n * m * 4</code> as an upper limit, for the worst case when each subsequent array element is larger than the previous. As a rule of thumb, each loop adds a multiplication to the formula, and each sequential block adds a plus sign. <p>In this calculation, there are two variables (array dimensions <code>n</code> and <code>m</code>) and one constant (the number of operations performed for each array element). There exists a special notation — <b>Big-O</b> — used to simplify the representation of end results of such complexity arithmetic. In it, all constants are reduced to 1, so <code>n * m * 4</code> becomes just <code>n * m</code>, and, since we don't care about the differences between individual array dimensions, we can put <code>n * n</code> instead of <code>n * m</code>. With such simplification, we can write down the final complexity result for this function: <code>O(n^2)</code>. 
In other words, our algorithm has quadratic complexity (which happens to be a variant of a broader class called "polynomial complexity") in array dimensions. It means that by increasing the dimensions of our matrix ten times, we'll increase the number of operations of the algorithm 100 times. In this case, however, it may be more natural to be concerned with the dependence of the number of operations on the number of <b>elements</b> of the matrix, not its dimensions. We can observe that <code>n^2</code> is the actual number of elements, so it can also be written as just <code>n</code> — if by <code>n</code> we mean the number of elements, and then the complexity is linear in the number of elements (<code>O(n)</code>). As you see, it is crucial to understand what <code>n</code> we are talking about! <p>There are just a few more things to know about Big-O complexity before we can start using it to analyze our algorithms. <p>1. There are 6 major complexity classes of algorithms: <ul><li>constant-time (<code>O(1)</code>) <li>sublinear (usually, logarithmic — <code>O(log n)</code>) <li>linear (<code>O(n)</code>) and superlinear (<code>O(n * log n)</code>) <li>higher-order polynomial (<code>O(n^c)</code>, where <code>c</code> is some constant greater than 1) <li>exponential (<code>O(c^n)</code>, where <code>c</code> is usually 2 but, at least, greater than 1) <li>and just plain lunatic complex (<code>O(n!)</code> and so forth) — I call them <code>O(mg)</code>, jokingly</ul><p>Each class is a step-function change in performance, especially at scale. We'll talk about each of them as we discuss the particular examples of algorithms falling into each class. <p>2. Worst-case vs. average-case behavior. In this example, we saw that there may be two counts of operations: for the average case, we can assume that approximately half of the iterations will require an assignment (which results in 3.5 operations in each inner loop), and, for the worst case, the number will be exactly 4. 
As Big-O reduces all numbers to 1, for this example, the difference is irrelevant, but there may be others, for which it is much more drastic and can't be discarded. Usually, for such algorithms, both complexities should be mentioned (along with ways to avoid worst-case scenarios): a good example is the quicksort algorithm described in a subsequent chapter. <p>3. We have also seen the so-called "constant factors hidden by the Big-O notation". I.e., from the point of view of algorithm complexity, it doesn't matter if we need to perform 3 operations in the inner loop or 30. Yet, it is quite important in practice, and we'll also discuss it below when examining binary search. Moreover, some algorithms with better theoretical complexity may be worse in many practical applications due to these hidden factors (for example, until the dataset reaches a certain size). <p>4. Finally, besides execution time complexity, there's also space complexity, which instead of the number of operations measures the amount of storage space used, proportional to the size of the input. In general, similar approaches are applied to its estimation. 
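<p>To make the operation counting above concrete, here is a sketch of <code>mat-max</code> instrumented with a simple operation counter. It uses the standard <code>setf</code> and <code>incf</code> instead of the <code>:=</code> shorthand, and the per-iteration constants are the ones from the calculation above: <code><pre><br />(defun mat-max-counting (mat)<br />  "A version of MAT-MAX that also returns the number of performed operations."<br />  (let ((max nil)<br />        (ops 0))<br />    (dotimes (i (array-dimension mat 0))<br />      (dotimes (j (array-dimension mat 1))<br />        ;; the null check, the array access, and the comparison<br />        (incf ops 3)<br />        (when (or (null max)<br />                  (> (aref mat i j) max))<br />          ;; the additional access for the assignment<br />          (incf ops)<br />          (setf max (aref mat i j)))))<br />    (values max ops)))<br /><br />CL-USER> (mat-max-counting #2A((1 2 3) (4 5 6)))<br />6<br />24<br /></pre></code><p>For this worst-case (ascending) matrix, the count is exactly <code>n * m * 4</code>, and doubling both dimensions roughly quadruples it. 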
<hr size="1"><script src="https://gist.github.com/vseloved/915a2aad64bddfae8376e0b1b4ca29aa.js"></script>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-63139770371752513992018-11-27T21:49:00.000+02:002018-11-29T22:14:42.953+02:00Structs vs Parametric Polymorphism<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-eBK3aGy9-J4/W_2ZFV5fPRI/AAAAAAAACDg/r5Ht5YZHqPo6oRTysuMD47CzgQe54JuVgCLcBGAs/s1600/oop.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://4.bp.blogspot.com/-eBK3aGy9-J4/W_2ZFV5fPRI/AAAAAAAACDg/r5Ht5YZHqPo6oRTysuMD47CzgQe54JuVgCLcBGAs/s320/oop.png" width="320" height="224" data-original-width="220" data-original-height="154" /></a></div>Recently, Tamas Papp <a href="https://tpapp.github.io/post/common-lisp-to-julia/">wrote</a> about one problem he had with Lisp in the context of scientific computing: that it's impossible to specialize methods on parametric types. <blockquote>While you can tell a function that operates on arrays that these arrays have element type double-float, you cannot dispatch on this, as Common Lisp does not have parametric types.</blockquote> I encountered the same issue while developing the CL-NLP Lisp toolkit for natural language processing. For instance, I needed to specialize methods on sentences, which may come in different flavors: as lists of tokens, vectors of tokens, lists of strings, or some more elaborate data structures with attached metadata. Here's some example code. There's a generic function to perform various tagging jobs (POS, NER, SRL, etc.). It takes two arguments: the first — as with all CL-NLP generic functions — is the tagger object that is used for algorithm selection, configuration, as well as for storing intermediate state when necessary. The second one is a sentence being tagged. 
Here are two of its possible methods: <code><pre><br />(defmethod tag ((tagger ap-dict-postagger) (sent string)) ...)<br />(defmethod tag ((tagger ap-dict-postagger) (sent list)) ...)<br /></pre></code> The first processes a raw string, which assumes that we should invoke some pre-processing machinery that tokenizes it and then, basically, call the second method, which will perform the actual tagging of the resulting tokens. So, <code>list</code> here means a list of tokens. But what if we already have the tokenization, but haven't created the token objects? I.e. a list of strings is supplied as the input to the <code>tag</code> method. The CLOS machinery doesn't have a way to distinguish the two, so we'll have to resort to using <code>typecase</code> inside the method, i.e. to the very thing <code>defmethod</code> is meant to replace as a transparent and extensible alternative. Well, in most other languages, we'd have to stop here and just accept that nothing can be done. After all, it's a local nuisance and not a game changer for our code (although Tamas refers to it as a game-changer for his). In Lisp, we can do better. Thinking about this problem, I see at least 3 solutions with varying levels of elegance and portability. Surely, they may seem slightly inferior to having such a capability built directly into the language, but demanding to have everything built-in is unrealistic, to say the least. Instead, having a way to build something in ourselves is the only future-proof and robust alternative. And this is what Lisp is known for. The first approach was mentioned by Tamas himself: <blockquote>You can of course branch on the array element types and maybe even <b>paper over the whole mess with sufficient macrology</b> (which is what LLA ended up doing), but this approach is not very extensible, as, eventually, you end up hardcoding a few special types for which your functions will be "fast", otherwise they have to fall back to a generic, boxed type. 
With multiple arguments, the number of combinations explodes very quickly.</blockquote> Essentially, rely on <code>typecase</code>-ing but use macros to blend it into the code in the most non-intrusive way, minimizing boilerplate. This is a straightforward path, in Lisp, but it has its drawbacks for long-running projects that need to evolve over time. Still, it remains a no-brainer for custom one-offs. That's why, usually, few venture further to explore other alternatives. The other solution was <a href="https://www.reddit.com/r/lisp/comments/9y425b/switching_from_common_lisp_to_julia_your_thoughts/e9y3qap/">mentioned</a> in the Reddit discussion of the post: <blockquote>Generics dispatching on class rather than type is an interesting topic. I've definitely sometimes wanted the latter so far in doing CL for non-scientific things. It is certainly doable to make another group of generics that do this using the MOP.</blockquote> I.e. use the MOP to introduce type-based generic dispatch. I won't discuss it here but will say that similar things were tried in the past quite successfully. ContextL and layered functions are some of the examples. Yet, the MOP path is rather heavy and has portability issues (as the MOP is not in the standard, although there is the closer-mop project that unifies most of the implementations). From my point of view, its best use is for serious and fundamental extension of the CL object system, not for solving a local problem that may occur in some contexts but is not so pervasive. Also, I'd say that the Lisp approach of (almost) not mixing objects and types is, conceptually, the right one, as these two facilities solve different sets of problems. There's a third, much simpler, clearer, and portable solution that requires minimal boilerplate and, in my view, is best suited for this level of problem: to use <code>struct</code>-s. 
Structs are somewhat underappreciated in the Lisp world; not a lot of books and study materials give them enough attention. And that is understandable, as there's not a lot to explain. But structs are handy for many problems, as they are a hassle-free and efficient facility that provides some fundamental capabilities. In its basic form, the solution is obvious, although a bit heavy. We'll have to define wrapper structs for each parametric type we'd like to dispatch upon. For example, <code>list-of-strings</code> and <code>list-of-tokens</code>. This looks a little stupid, and it is, because what's the semantic value of a list of strings? That's why I'd go for <code>sentence/string</code> and <code>sentence/token</code>, which is a clearer naming scheme. (Or, if we want to mimic Julia, <code>sentence&lt;string&gt;</code>.) <code><pre><br />(defstruct sent/str<br />  toks)<br /></pre></code> Now, from the method's signature, we will already see that we're dealing with sentences in the tagging process. And we will be able to spot when some other tagging algorithm operates on paragraphs instead of words: let's say, tagging parts of an email with such labels as greeting, signature, and content. Yes, this can also be conveyed via the name of the tagger, but, still, it's helpful. And it's also one of the hypothetical fail cases for a parametric type-based dispatch system: if we have two different kinds of lists of strings that need to be processed differently, we'd have to resort to similar workarounds in it as well. However, if we'd like to distinguish between lists of strings and vectors of strings, as well as more generic sequences of strings, we'll have to resort to more elaborate names, like <code>sent-vec/str</code>, as a variant. It's worth noting, though, that, for the sake of producing efficient compiled code, only vectors of different types of numbers really make a difference. 
A list of strings or a list of tokens, in Lisp, uses the same accessors, so optimization here is useless, and type information may be used only for dispatch and, possibly, type checking. Actually, Lisp doesn't support type-checking of homogeneous lists, so you can't say <code>:type (list string)</code>, only <code>:type list</code>. (Well, you can, actually, define a predicate like <code>list-of-strings-p</code> and use <code>(and list (satisfies list-of-strings-p))</code>, as <code>satisfies</code> requires a named predicate, but what's the gain?) Yet, using structs adds more semantic dimensions to the code than just naming. They may store additional metadata and support simple inheritance, which will come in handy when we'd like to track sentence positions in the text and so on. <code><pre><br />(defstruct sent-vec/tok<br />  (toks nil :type (vector tok)))<br /><br />(defstruct (corpus-sent-vec/tok (:include sent-vec/tok))<br />  file beg end)<br /></pre></code> And structs are efficient in terms of both space consumption and speed of slot access. <br>So, now we can do the following: <code><pre><br />(defmethod tag ((tagger ap-dict-postagger) (sent sent/str)) ...)<br />(defmethod tag ((tagger ap-dict-postagger) (sent sent/tok)) ...)<br />(defmethod tag ((tagger ap-dict-postagger) (sent sent-vec/tok)) ...)<br /></pre></code> We'll also have to <code>defstruct</code> each parametric type we'd like to use. 
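<p>For completeness, the remaining wrapper structs used in the methods above can be defined along the same lines. Note that this is just an illustrative sketch: the particular slots of <code>tok</code> are my assumption here, not the actual CL-NLP definitions. <code><pre><br />;; a minimal token representation (hypothetical slots)<br />(defstruct tok<br />  word beg end pos)<br /><br />(defstruct sent/tok<br />  (toks nil :type list))<br /></pre></code>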
As a result, with this approach, we can have the following clean and efficient dispatch: <code><pre><br />(defgeneric tag (tagger sent)<br />  (:method (tagger (sent string))<br />    (tag tagger (tokenize *word-splitter* sent)))<br />  (:method (tagger (sent sent/str))<br />    (let ((off 0))<br />      (tag tagger (make-sent/tok<br />                   :toks (map* ^(prog1 (make-tok<br />                                        :word %<br />                                        :beg off<br />                                        :end (+ off (length %)))<br />                                  (:+ off (1+ (length %))))<br />                               @sent.toks)))))<br />  (:method ((tagger pos-tagger) (sent sent/tok))<br />    (copy sent :toks (map* ^(copy % :pos (classify tagger<br />                                          (extract-features tagger %)))<br />                           @sent.toks))))<br /><br />CL-USER> (tag *pos-tagger* "This is a test.")<br />#S(SENT/TOK :TOKS (&lt;This/DT 0..4&gt; &lt;is/VBZ 5..7&gt; &lt;a/DT 8..9&gt;<br />                   &lt;test/NN 10..14&gt; &lt;./. 14..15&gt;))<br /></pre></code> Some of the functions used here, <code>map*</code> and <code>copy</code>, as well as the <code>@</code> and <code>^</code> reader macros, come from my RUTILS, which fills in the missing pieces of the CL standard library. Another advantage of structs is that they define a lot of things in the background: type-checking for slots, a readable print function, a constructor, a built-in <code>copy-structure</code>, and more. In my view, this solution isn't any less easy to use than the statically-typed one (Julia's). There's a little additional boilerplate (the defstructs), which may even be considered to have a positive impact on the code's overall clarity. And yes, you have to write boilerplate in Lisp sometimes, although not so much of it. Here's a <a href="https://twitter.com/johanatan/status/1064747571090874369">fun quote on the topic</a> I saw on twitter some days ago: <blockquote>Lisp is an embarrassingly concise language. If you’re writing a bunch of boilerplate in it, you need to read SICP &amp; “Lisp: A Language for Stratified Design”.</blockquote> P.S. 
There's one more thing I wanted to address from Tamas's post: <blockquote>Now I think that one of the main reasons for this is that while you can write scientific code in CL that will be (1) fast, (2) portable, and (3) convenient, you cannot do all of these at the same time.</blockquote> I'd say that this choice (or, rather, the need to prioritize one over the others) exists in every ecosystem. At least, looking at his Julia example, there's no word about portability (citing Tamas's own words about the language: "At this stage, code that was written half a year ago is very likely to be broken with the most recent release."), while convenience may manifest well for his current use case; but what if we need to implement, in the same system, features that deal with areas outside of numeric computing? I'm not so convinced. Or, speaking about Python, which is the go-to language for scientific computing: in terms of performance, the only viable solution is to implement the critical parts in C (or Cython). Portable? No. Convenient — likewise. Well, as a user you get convenience, and speed, and portability (although a pretty limited one). But at what cost? I'd argue that developing the Common Lisp scientific computing ecosystem to a similar quality would have required only 10% of the effort that went into building numpy and scipy... 
Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-16605301956114591722018-09-30T19:15:00.002+03:002018-09-30T19:17:27.380+03:00ANN: flight-recorder - a robust REPL logging facility<div dir="ltr" style="text-align: left;" trbidi="on"><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-McqUxpNdprc/W7Dxkmki6AI/AAAAAAAACBQ/i8WWNQgE5HA02JjsD8_umII0wXTewfTfACLcBGAs/s1600/frlogo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://2.bp.blogspot.com/-McqUxpNdprc/W7Dxkmki6AI/AAAAAAAACBQ/i8WWNQgE5HA02JjsD8_umII0wXTewfTfACLcBGAs/s320/frlogo.png" width="320" height="151" data-original-width="360" data-original-height="170" /></a></div><p>Interactivity is a principal requirement for a usable programming environment. Interactivity means that there should be a shell/console/REPL or other similar text-based command environment. And a principal requirement for such an environment is keeping history. And not just keeping it, but doing it robustly: <ul><li>recording history from concurrently running sessions <li>keeping unlimited history <li>identifying the time of the record and its context </ul><p>This allows you to experiment freely and reproduce the results of successful experiments, to go back to an arbitrary point in time and take another direction in your work, and to keep DRY while performing common repetitive tasks in the REPL (e.g. initialization of an environment or context). <p><a href="https://github.com/vseloved/flight-recorder">flight-recorder</a> or <code>frlog</code> (when you need a distinct name) is a small tool that intends to support all these requirements. 
It grew out of a frustration with how history is kept in SLIME, so it was primarily built to support this environment, but it can be easily utilized for other shells that don't have a good enough history facility. This is possible due to its reliance on the most common and accessible data-interchange facility: text-based HTTP. <p><code>frlog</code> is a backend service that supports any client that is able to send an HTTP request. <p>The backend is a Common Lisp script that can be run in the following manner (probably, the best way to do it is inside screen): <code><pre><br />sbcl --noprint --load hunch.lisp -- -port 7654 -script flight-recorder.lisp<br /></pre></code><p>It will print a bunch of messages that should end with the following line (modulo timestamp): <code><pre><br />[2018-09-29 16:00:53 [INFO]] Started hunch acceptor at port: 7654.<br /></pre></code><p>The service appends each incoming request to the text file in markdown format: <code>~/.frlog.md</code>. <p>The API is just a single endpoint - <code>/frlog</code> that accepts GET and POST requests. The parameters are: <ul><li><code>text</code> is the content (url-encoded, for sure) of the record that can, alternatively, be sent in the POST request's body (more robust)</li></ul><p>Optional query parameters are: <ul><li><code>title</code> - used to specify that this is a new record: for console-based interactions, usually, there's a command and zero or more results - a command starts the record (and thus should be accompanied by the title: for SLIME interactions it's the current Lisp package and a serial number). 
An entitled record is added in the following manner: <code><pre><br />### cl-user (10) 2018-09-29_15:49:17<br /><br /> (uiop:pathname-directory-pathname )<br /></pre></code>If there's no title, the text is added like this: <code><pre><br />;;; 2018-09-29_15:49:29<br /><br /> #&lt;program-error @ #x100074bfd72&gt;<br /></pre></code></li><li><code>tag</code> - if provided, it signals that the record should be made not to the standard <code>.frlog.md</code> file, but to <code>.frlog-&lt;tag&gt;.md</code>. This allows you to easily log a specific group of interactions separately. If the response code is 200, everything's fine. </li></ul><p>Currently, 2 clients are available: <ul><li>a SLIME client <code>flight-recorder.el</code> that monkey-patches a couple of basic SLIME functions (just load it from Emacs if you have SLIME initialized) <li>and a tiny Lisp client <code>frlog.lisp</code></li></ul><p>P.S. To sum up, more and more I've grown to appreciate simple (sometimes even primitive - the more primitive the better :) tools. <code>flight-recorder</code> seems to me to be just like that: it was very easy to hack together, but it solves an important problem for me and, I guess, for many. And it's the modern "Unix way": small independent daemons, text-based formats and HTTP instead of pipes... <p>P.P.S. <code>frlog</code> uses another tiny tool of mine - <a href="https://github.com/vseloved/flight-recorder/blob/master/hunch.lisp">hunch</a> that I've already utilized in a number of projects but haven't described yet - it's a topic for another post. In short, it is a script to streamline running hunchentoot that does all the basic setup and reduces the developer's work to just defining the HTTP endpoints. <p> P.P.P.S. I know, the name is, probably, taken and it's a rather obvious one. But I think it just doesn't matter in this case... 
:) </div>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-18217777389880721332018-01-31T18:49:00.000+02:002018-02-01T09:02:59.142+02:00Minimal Perfect Hash-Tables in Common Lisp<p>Recently, my twitter pal @ifesdjeen wrote a <a href="https://twitter.com/ifesdjeen/status/951841941733421057">line</a> that resonated with me: "Looks like it's easier for people to read 40 blog posts than a single whitepaper." And although he used it in a negative context, I recognized it as a very precise (and, actually, positive) description of what a research engineer does: read a whitepaper (or a dozen, for what it's worth) and transform it into working code and - as a possible byproduct - into a blog post that other engineers will understand and be able to reproduce. I've been in the business of reproducing papers for about 7 years now, and, I believe, everyone with the same experience will confirm that it's not a skill that comes easily to every engineer. Not all papers can even be reproduced (because the experiment was just not set up correctly), and of those which, in principle, can be, only some are presented in a form that I can grasp well enough to be able to program them. And, after all, working code is the ultimate judge of your understanding. <p>But I digressed... :) This post is meant to explain (in simple "engineering" terms) the concept of minimal perfect hash-tables and how I've recently implemented one of their varieties to improve the memory efficiency of my language identification library <a href="https://github.com/vseloved/wiki-lang-detect">WILD</a>. Uncovering, in the process, a more abstract concept behind it. 
<h2>The Justification of Perfect Hash-Tables</h2><p>Minimal perfect hash-tables are persistent data structures that solve 2 basic deficiencies of normal hash-tables: they guarantee constant (not only amortized) O(1) time for collision-free key-based access, and they require no reserved space (which for hash-tables may be in the range of 20%-50% of their size). This comes at the cost of two restrictions: the table should be filled with a keyset known ahead of time, and the process of building one takes longer than for a usual hash-table (although, for many methods, it can still be bounded by amortized O(1) time). <p>From my point of view, the main advantage of perfect HTs is the possibility to substantially reduce memory usage, which is important in such use cases as storing big dictionaries (relevant to many NLP tasks). Moreover, the space requirements can be lowered even further if the whole keyset is stored in the table, i.e. there can be no misses. Under such constraints, unlike with normal hash-tables, which still require storing the keys alongside the values due to the need to compare them in cases of hash collisions, in a perfect HT the keys can be omitted altogether. Unfortunately, this is rarely the case, but if some level of false positives is permitted, with the help of additional simple programmatic tricks, the memory usage for keys can also be reduced by orders of magnitude. <p>Hash-tables on their own are the trickiest among the basic data structures. But, in terms of complexity, they pale in comparison to minimal perfect hash-tables, which belong to the advanced algorithms family. One reason for that is that perfect hash-tables require more "moving parts", but the main one is, probably, that there's no common well-known algorithm to distribute keys in a perfect manner. And reading many perfect hash-table papers will scare away most programmers, at least it did me. 
However, after some research and trial-and-error, I've managed to find the <a href="http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf">one</a> that presents a simple and scalable approach. So, I'm going to explain it in this post. <h2>Hash-tables in SBCL</h2><p>If you take a look at some particular hash-table implementation, your mental model of what a hash-table is may be quite seriously shaken. A straightforward open addressing table will assume a vector of key-value pairs as an underlying concrete structure, filled as the table is modified. On a 64-bit system, it will require: (8 × 1.5 + 8 × 16) × entries-count + total size of all keys + total size of all values + constant size of metadata. The 8 × 1.5 bytes in the formula is storage needed to hold a pointer to a hash-table entry including an average 33% extra overhead, the additional 8 × 16 bytes per entry will be spent on a cons cell (if we use this efficient although antiquated way to represent pairs in Lisp). It should be noted that depending on the size of keys and values the first term may be either a significant (up to half of the total size) or a negligible part of the table's memory profile. <p>However, this is not how SBCL implements hash-tables. See for yourself. Let's create a random hash-table with approximately 1000 keys: <pre><code><br />> (defparameter *ht*<br /> (pairs->ht (loop :repeat 1000<br /> :collect (pair (fmt "~{~C~}"<br /> (loop :repeat (random 10)<br /> :collect (code-char (+ 65 (random 50)))))<br /> (random 1000)))<br /> :test 'equal))<br /></code></pre><p>And inspect its contents: <pre><code><br />> (inspect *ht*)<br /><br />The object is a STRUCTURE-OBJECT of type HASH-TABLE.<br />0. TEST: EQUAL<br />1. TEST-FUN: #<FUNCTION EQUAL><br />2. HASH-FUN: #<FUNCTION SB-IMPL::EQUAL-HASH><br />3. REHASH-SIZE: 1.5<br />4. REHASH-THRESHOLD: 1.0<br />5. REHASH-TRIGGER: 1024<br />6. NUMBER-ENTRIES: 844<br />7. 
TABLE: #(#{EQUAL<br /> "plCjU" 985<br /> "VVYZqKm[" 861<br /> "N\\" 870<br /> "fqfBdZP\\" 364<br /> "cHNjIM]Y" 804<br /> "p_oHUWc^" 630<br /> "CqGRaMH" 408<br /> "NecO" 636<br /> "QDBq" 380<br /> "M" 838<br /> ...<br /> }<br /> 0 "plCjU" 985 "VVYZqKm[" 861 "N\\" 870 "fqfBdZP\\" 364 ...)<br />8. NEXT-WEAK-HASH-TABLE: NIL<br />9. %WEAKNESS: 0<br />10. NEXT-FREE-KV: 845<br />11. CACHE: 1688<br />12. INDEX-VECTOR: #(0 617 332 393 35 0 194 512 0 420 ...)<br />13. NEXT-VECTOR: #(0 72 0 0 253 499 211 0 349 488 ...)<br />14. HASH-VECTOR: #(9223372036854775808 1830284672361826498 3086478891113066655<br /> 24243962159539 2602570438820086662 2431530612434713043<br /> 4568274338569089845 3085527599069570347 2374133389538826432<br /> 3322613783177911862 ...)<br />15. LOCK: #<SB-THREAD:MUTEX "hash-table lock" (free)><br />16. SYNCHRONIZED-P: NIL<br /></code></pre><p>As the fog of initial surprise clears, we can see that instead of a single vector it uses 4! The keys and values are placed in the one called TABLE (that starts with a reference to the hash-table object itself). Note that storing the entries as pairs is redundant both in terms of memory and access (additional dereferencing). That's why an obvious optimization is to put them directly in the backing array: so the keys and values here are interleaved. One more indirection that may be removed, at least in Lisp, is "unboxing" the numbers, i.e. storing them immediately in the array avoiding the extra pointer. This may be an additional specialized optimization for number-keyed hash-tables (to which we'll return later), but it is hardly possible with the interleaved scheme. <p>Perhaps, the biggest surprise is that the entries of the TABLE array are stored sequentially, i.e. there are no gaps. If they were to be added randomly, we'd expect the uniform distribution of 845×2 unique keys and corresponding values over the 2048-element array. 
Instead, this randomness is transferred to the 1024-element integer INDEX-VECTOR: another level of indirection. Why? For the same reason yet another 1024-element array is used — a HASH-VECTOR, which stores the values of hashes for all the table keys: <b>efficient resizing</b>. Although it may seem that the biggest overhead of a hash-table is incurred due to reserved space for new keys, in fact, as we have now learned, resizing is a much heavier factor. This especially applies to the HASH-VECTOR: if the hashes of all the keys were not stored, they would have to be recalculated anew each time the table was resized: an intolerable slowdown for larger tables. Having a separate INDEX-VECTOR is not so critical, but it provides two additional nice capabilities. Resizing the table is performed by adjusting the vectors' lengths (an optimized operation) without the need to redistribute the entries. Besides, unlike in theory, we can iterate this table in a deterministic order: the order in which the keys were added to the table. Which comes in quite handy, although SBCL developers don't want to advertise this as a public API as it may restrict potential future modifications to the data structure. There's also a NEXT-VECTOR that is used to optimize insertion. <p>Overall, we can see that a production-optimized hash-table turns out to be much heavier than a textbook variant. And, indeed, these optimizations are relevant, as SBCL's hash-tables are quite efficient. In my experiments several years ago, they turned out to be 2-3 times faster than, for instance, the CCL ones. Yet, it's clear that the speed optimizations, as usual, come at the cost of storage deoptimization. To restore storage sanity, we could fall back to the basic hash-table variant, and it's a possible solution for some cases, although a mediocre one, in general. Neither will it be the fastest, nor fully optimized. 
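<p>As an aside, the deterministic iteration order mentioned above is easy to observe. This is only a sketch relying on SBCL implementation details, not a portable guarantee: <code><pre><br />(let ((ht (make-hash-table :test 'equal))<br />      (keys ()))<br />  (dolist (k '("foo" "bar" "baz" "quux"))<br />    (setf (gethash k ht) (length k)))<br />  ;; collect the keys in the order MAPHASH visits them<br />  (maphash (lambda (k v)<br />             (declare (ignore v))<br />             (push k keys))<br />           ht)<br />  (nreverse keys))<br />;; => ("foo" "bar" "baz" "quux") on SBCL, i.e. the insertion order<br /></pre></code>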
<h2>Building an MPHT</h2><p>Most of the space in SBCL's hash-table implementation is spent on metadata and keys, not on values. Yet, if we introduce a restriction that the table cannot be altered — no new entries added after initial construction and no resizing possible — most of those optimizations and the connected storage become redundant. An ideally space-efficient data structure is an array, but it doesn't allow key-based access. In theory, minimal perfect hash-tables are arrays with a minimal amount of metadata overhead. Yet, the precise amount depends on the algorithm, and there's still ongoing research improving them. Overviewing all the existing approaches (most of which I don't fully understand) is beyond the scope of this post. <p>Besides, MPHTs require additional calculations at access time compared to normal hash-tables. So, if a hash-table is well-optimized, it will usually be faster than an MPHT. <p>And still, MPHTs require a non-negligible amount of additional space for metadata: for the algorithm that will be discussed, it's around 2 bytes per key. The best algorithms in the literature claim to reduce this amount more than 4 times, to less than 4 bits per key. However, for now, I have picked the simpler approach, since this difference is not critical considering that every value in the table, on a 64-bit machine, occupies at least 16 bytes (a pointer plus the data), so an overhead of 2 bytes versus 0.5 bytes (which will probably be rounded to 1 byte) is already negligible. <p>Now, let's think of how we can distribute the keys in a hash-table so that there are no collisions and the backing array has the same number of elements as the number of hash-table keys. Theoretically, as the set of keys is known before the creation of the table, we can find a hash function that produces such a distribution. Unfortunately, due to the Birthday paradox, it may be a long search. The algorithms for MPHTs suggest ways of structuring it. 
A good algorithm should have at most O(n) complexity as, otherwise, it will be infeasible for large keysets, which are the main use case for perfect hash-tables. <p>The algorithm I will briefly describe now was suggested by <a href="http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf">Botelho and Ziviani</a>. It is a 2-step process: <ul><li>at the first stage, using a normal hash function (in particular, the Jenkins hash), all keys are nearly uniformly distributed into buckets, so that the number of keys in each bucket doesn't exceed 256. This can be done by setting the hash divisor to <code>(ceiling (length keyset) 200 #| or slightly less |#)</code>;</li><li>next, for each bucket, a perfect hash function is constructed via a simple algorithm: for each key, two hash codes are calculated by one call to the Jenkins hash (this function outputs 3 hashes at once), and they are treated as vertices of a graph. If the graph happens to be acyclic (which can be ensured with high probability by a suitable hash divisor), it is possible to construct the desired function as a sum of the 2 hashes. Otherwise, we change the Jenkins hash seed and try constructing a new graph until an acyclic one is obtained. 
In practice, this requires just a couple of tries;</li><li>the construction of the final hash function is described very clearly by <a href="http://cmph.sourceforge.net/papers/chm92.pdf">Czech, Havas and Majewski</a>, who proposed this method: it consists in performing a depth-first search on the graph, labelling the edges with unique numbers and deducing the corresponding number for each possible Jenkins hash value.</li></ul><p>Here you can see one of the possible labelings (each edge corresponds to a key and its unique index, each vertex — to the value for each of the possible Jenkins hashes): <p><a href="https://3.bp.blogspot.com/-o3hfno10q4U/WnHxAr_70DI/AAAAAAAAB9w/rSLUH6ph29M78gH4vPt6LM2VnGG04XUVwCLcBGAs/s1600/g.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-o3hfno10q4U/WnHxAr_70DI/AAAAAAAAB9w/rSLUH6ph29M78gH4vPt6LM2VnGG04XUVwCLcBGAs/s320/g.png" width="320" height="133" data-original-width="546" data-original-height="227" /></a><p>Now, the per-bucket hash function can be reconstructed from an array of numbers (in the range 0-255) associated with each possible hash. The divisor of the hash is twice the number of keys in the bucket, though it can be any number above the number of keys: the greater the divisor, the more space is used for storing the numbers (the divisor is the length of that array) and the less time it takes to find an acyclic graph. <p>The algorithm works quite fast: on my laptop, it takes 8 seconds for the table of 725 359 character trigrams from a mid-size language identification model and 13 seconds for 1 236 452 words from the same model. <p>To sum up, this is how to find the index of an element (the <code>bytes</code> argument) in our perfect hash-table: <pre><code><br />(defun csindex (bytes cstab)<br /> (with ((mod (/ (1- (length @cstab.meta)) 2)) ; divisor of the top-level hash<br /> (hash (* (mod (jenkins-hash bytes) mod) 2)) ; determine the bucket<br /> (off (? @cstab.meta hash))<br /> (seed (? 
@cstab.meta (1+ hash))) ; each bucket has its own Jenkins seed<br /> (mod2 (- (? @cstab.meta (+ 2 hash)) off))<br /> (b c (jenkins-hash2 bytes seed (* mod2 2)))<br /> (goff (* 2 off)))<br /> ;; bucket offset + in-bucket hash<br /> (+ off (mod (+ (? @cstab.gs (+ goff b))<br /> (? @cstab.gs (+ goff c)))<br /> mod2))))<br /></code></pre><p>Note that, in the code for this article, I refer to perfect hash-tables as cstabs, for the reasons described at the end. <h2>Efficiency In Practice</h2><p>So, let's now examine the memory efficiency of this method. Thankfully, just recently the SBCL developers started working on a missing piece of the SBCL API that is critical for every algorithmic developer: a <a href="https://sourceforge.net/p/sbcl/sbcl/ci/e2e9060d4b4a76600d44b44e151a02a2755373f7/">function to determine the size in memory occupied by an arbitrary object</a>. As we know from the famous koan, "LISP programmers know the value of everything and the cost of nothing". Indeed, from a certain point of view, this applies to SBCL. Although now we have a rough tool at our disposal that patches this hole... ;) <p>Using this unofficial function, we can roughly calculate the space occupied by the character trigrams hash-table mentioned above: <pre><code><br />> (let ((ht (? wild:*lang-detector* '3gs)))<br /> (+ (object-size ht)<br /> (object-size @ht.table)<br /> (object-size @ht.index-vector)<br /> (object-size @ht.hash-vector)<br /> (reduce '+ (map 'list<br /> ;; the keys of the table are strings<br /> ;; and values -- alists of a symbol and a float<br /> ^(if (listp %)<br /> (sum 'object-size %)<br /> (object-size %))<br /> @ht.table))))<br />102856432<br /></code></pre><p>100 MB! <pre><code><br />> (let ((ct (build-const-table (? 
wild:*lang-detector* '3gs))))<br /> (+ (object-size ct)<br /> (object-size @ct.gs)<br /> (object-size @ct.meta)<br /> (reduce '+ (map 'list ^(sum 'object-size %)<br /> @ct.data))))<br />53372880<br /></code></pre><p>I.e., we have reduced the memory usage almost by half: all the metadata now occupies just 1479808 bytes, or 1.5 MB. <p>One critical decision that allows for such a drastic memory-use improvement is omitting the keys from the table. It should be noted that adding them back is trivial: <code>(defstruct (keyed-pht (:include const-table)) (keys nil :type simple-vector) (test 'eql))</code>, for which getting the value will work like this: <pre><code><br />(defun csget (key cstab)<br /> (let ((idx (csindex (bytes key) cstab)))<br /> (when (call @cstab.test key (svref @cstab.keys idx))<br /> (values (svref @cstab.data idx)<br /> t))))<br /></code></pre><p>However, this will, basically, return us to the same ballpark as a textbook hash-table implementation in terms of memory usage, while losing in terms of speed. Yet, if we allow for some controlled level of false positives, there are a few algorithmic tricks that can be used to once again make the keys almost negligible. <p>The first one is really simple and straightforward: replace the vector of keys with the vector of their hashes. In particular, if we take a single byte of the hash, such an array will be 10-100 times smaller than the generic keys array and produce an FP-rate of 1/255. <p>Another trick is to use a Bloom filter. For instance, a filter with a 0.1 FP-rate for all the trigrams from our language identification model will require just around 0.4 MB of storage, compared to 0.7 MB needed to store the 1-byte hashes and 30 MB needed to store just the keys. 
The disadvantage of a Bloom filter, however, is slower processing time: for the mentioned 10% FP-rate we'll need to perform 4 hash calculations, and if we'd like to reach the same 0.3% rate as the 1-byte hash array, we'd have to increase the memory usage to 1 MB and perform 8 hash calculations. <p>Finally, it should be noted that the main motivator for this work was reducing the memory footprint of my language identification library, for, to be widely adopted, such a project should be very frugal in its memory usage: 10-30 MB is its maximum polite footprint. By switching to a perfect hash-table, we haven't reached that goal (at least, for this particular model), but there's also plenty of room for optimizing the memory usage of the values, which I will return to later. <h2>Const-tables</h2><p>The other motivator for this work, in general, was my interest in the topic of efficient "static" data structures. In this context, I feel that the notion of a perfect hash-table doesn't fully capture the essential features of the class of data structures we're dealing with. First of all, the main distinguishing factor is static vs dynamic usage. A hash-table is thought of as a dynamic entity, while our data structure is primarily a static one. Next, hashing is, for sure, involved in constructing all these structures, but it's only part of a bigger picture. The way of structuring the metadata, in particular the index arrays, may differ greatly not only between perfect and usual hash-tables, but also among the different implementations of ordinary hash-tables. <p>So, I decided to come up with a different name for this data structure — a const-table (or cstab). It defines a broad class of persistent data structures that allow constant-time key-based access and, ideally, efficiently store the keys. The implementation described here is released as the <a href="https://github.com/vseloved/const-table">library with the same name</a>. 
It is still in its infancy, so major changes are possible — and suggestions on its improvement are also welcome.Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-28419229502147161972017-04-17T21:09:00.000+03:002019-09-05T22:45:51.662+03:00Pretty-Printing Trees<div dir="ltr" style="text-align: left;" trbidi="on"> (or The Ugliest Code I've Ever Written)<br><br><a href="https://1.bp.blogspot.com/-IpVDrelfZ9w/WPT_wTT9lII/AAAAAAAABtk/cP2Fng43B3g4AQOlRJphjvr6EN7R7uU3QCLcB/s1600/pic.jpg" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-IpVDrelfZ9w/WPT_wTT9lII/AAAAAAAABtk/cP2Fng43B3g4AQOlRJphjvr6EN7R7uU3QCLcB/s1600/pic.jpg" /></a><p>In the last couple of days, I was ill and had to stay in bed, so I've also used this time to tidy up the work that accumulated over the past year in <a href="https://github.com/vseloved/cl-nlp">cl-nlp</a>. That was especially timely, considering the interest in using it that was expressed by some people whom I've met at the recent Lisp-related events. </p><p>I've even assembled a rough <a href="https://github.com/vseloved/cl-nlp/issues/30">checklist</a> of the things that need to be finished to get it to v.1.0 and beyond. </p><p>Besides, after finishing the basic cleaning, I've returned to one of the programming tasks that has been racking my brain for a long time: tree pretty-printing. In NLP, we constantly have to deal with various versions of parse trees, like the constituency or dependency ones, but the problem is that they are not easily visualized. And good visualization plays, at least for me, a critical role in effective debugging, ideation and programming. It's an essential part of a solid interactive experience that is one of the fundamental traits of Lisp development. </p><p>For instance, a constituency tree is usually presented as a Lisp list. 
Here's an infamous example from the Penn Treebank: </p><code><pre><br /> (S <br /> (NP-SBJ <br /> (NP (NNP Pierre) (NNP Vinken) )<br /> (, ,) <br /> (ADJP <br /> (NP (CD 61) (NNS years) )<br /> (JJ old) )<br /> (, ,) )<br /> (VP (MD will) <br /> (VP (VB join) <br /> (NP (DT the) (NN board) )<br /> (PP-CLR (IN as) <br /> (NP (DT a) (JJ nonexecutive) (NN director) ))<br /> (NP-TMP (NNP Nov.) (CD 29) )))<br /> (. .) )<br /></pre></code><p>A dependency tree has several representations, all of which are not really intuitive to grasp. This is the Stanford format: </p><code><pre><br /> amod(ideas-2, Colorless-0)<br /> amod(ideas-2, green-1)<br /> nsubj(sleep-3, ideas-2)<br /> root(sleep-3, sleep-3)<br /> advmod(sleep-3, furiously-4)<br /> punct(sleep-3, .-5)<br /></pre></code><p>And here's the CoNLL one: </p><code><pre><br />0 Colorless _ _ ADJ 2<br />1 green _ _ ADJ 2<br />2 ideas _ _ NOUN 3<br />3 sleep _ _ NOUN 3<br />4 furiously _ _ ADV 3<br />5 . _ _ PUNCT 3<br /></pre></code><p>Also, Google's Parsey McParseface offers another - presumably, more visual - representation (using the asciitree lib). Still, it is not good enough, as it messes with the order of words in a sentence. </p><code><pre><br />Input: Bob brought the pizza to Alice .<br />Parse:<br />brought VBD ROOT<br /> +-- Bob NNP nsubj<br /> +-- pizza NN dobj<br /> | +-- the DT det<br /> +-- to IN prep<br /> | +-- Alice NNP pobj<br /> +-- . . punct<br /></pre></code><p>As you see, dependency trees are not trivial to visualize (or pretty-print) in ASCII. 
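<p>Incidentally, the indented asciitree-style format shown above is trivial to generate, which also makes its drawback obvious: children are emitted in tree order, not in sentence order. A minimal sketch in Python (the (label, children) tuple representation of the tree is my assumption for illustration, not cl-nlp's):

```python
# Minimal sketch of the indented ASCII format in the style of the
# asciitree output above. The (label, children) tuple shape is an
# assumed toy representation.

def render_tree(node):
    label, children = node
    lines = [label]
    for i, child in enumerate(children):
        last = i == len(children) - 1
        child_lines = render_tree(child)
        lines.append("+-- " + child_lines[0])
        # continuation lines of a non-last child keep a vertical bar
        indent = "    " if last else "|   "
        lines.extend(indent + line for line in child_lines[1:])
    return lines

tree = ("brought VBD ROOT",
        [("Bob NNP nsubj", []),
         ("pizza NN dobj", [("the DT det", [])]),
         ("to IN prep", [("Alice NNP pobj", [])]),
         (". . punct", [])])

print("\n".join(render_tree(tree)))
# brought VBD ROOT
# +-- Bob NNP nsubj
# +-- pizza NN dobj
# |   +-- the DT det
# +-- to IN prep
# |   +-- Alice NNP pobj
# +-- . . punct
```

Note how "the" ends up after "pizza" in the output: exactly the word-order scrambling that the Parsey McParseface representation suffers from.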
The authors of Spacy creatively approached solving this problem by using CSS in their displaCy tool: </p><a href="https://1.bp.blogspot.com/-Ie7YvRAsNkQ/WPT0w1v9ueI/AAAAAAAABtE/UtfqSAs57xwyaPUERmCJE4EqetYuvchFACPcB/s1600/displacy1.jpg" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-Ie7YvRAsNkQ/WPT0w1v9ueI/AAAAAAAABtE/UtfqSAs57xwyaPUERmCJE4EqetYuvchFACPcB/s1600/displacy1.jpg" width="1024" /></a><p>However, it seems like an overkill to bring a browser to with you for such a small task. And it's also not very scalable: </p><a href="https://3.bp.blogspot.com/-gE4UsfVVVHg/WPT0yc_RIlI/AAAAAAAABtE/sMZZtc-2-q4kV1o6_t-v6Mg64Z3HbdNXACPcB/s1600/displacy2.jpg" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-gE4UsfVVVHg/WPT0yc_RIlI/AAAAAAAABtE/sMZZtc-2-q4kV1o6_t-v6Mg64Z3HbdNXACPcB/s1600/displacy2.jpg" width="1024" /></a><p>I, in fact, was always interested in creative ways of text-based visualization. So, I thought of ways to represent parse trees in ASCII. </p><p>With constituency ones, it's rather trivial: </p><code><pre><br />> (pprint-tree '(TOP (S (NP (NN <This:0 0..4>))<br /> (VP (VBZ <is:1 5..7>)<br /> (NP (DT <a:2 8..9>)<br /> (JJ <simple:3 10..16>)<br /> (NN <test:4 17..21>)))<br /> (|.| <.:5 22..23>)))<br /> TOP <br /> : <br /> S <br /> .-----------:---------. <br /> : VP : <br /> : .---------. : <br /> NP : NP : <br /> : : .----:-----. : <br /> NN VBZ DT JJ NN . <br /> : : : : : : <br />This is a simple test . <br /></pre></code><p>The dependencies are trickier, but I managed to find a way to show them without compromising the sentence word order: </p><code><pre><br />> (pprint-deps '(<This:0 0..4> <is:1 5..7> <a:2 8..9> <simple:3 10..16> <test:4 17..21> <.:5 22..23>)<br /> '(root(_ROOT_-0, is-1) nsubj(is-1, This-0) dobj(is-1, test-4) det(test-4, a-2) amod(test-4, simple-3) punct(is-1, .-5)))<br />Colorless green ideas sleep furiously . <br /> ^ ^ .^ .^. ^ ^<br /> : `. amod .´: ::: : :<br /> `..... amod .....´: ::: : :<br /> `. 
nsubj .´:: : :<br /> :`. advmod .´ :<br /> :`.... punct .....´<br /> root<br /></pre></code><p>And it looks pretty neat even for longer sentences: </p><code><pre style="overflow: scroll;"><br />We hold these truths to be self - evident , that all men are created equal , that they are endowed by their Creator with certain unalienable Rights , that among these are Life , Liberty and the pursuit of Happiness . <br /> ^ .^. ^ .^ ^ .^. ^ ^ .^ ^ ^ ^ .^ ^ .^. ^ ^ ^ ^ ^ .^. ^. ^ .^. ^. ^ ^ .^ ^ ^ ^. ^ .^. ^. ^ ^. ^ ^ .^. ^. ^ ^<br /> `. nsubj .´:: `. det .´: `. aux .´:: : `. punct .´: : : `. det .´: `. auxpass .´:: : : : : `. auxpass .´:: :: `. poss .´:: :: : `. amod .´: : : :`. pobj .´ ::: :`. punct .´ :`. cc .´ `. det .´:: :`. pobj .´ :<br /> :`... dobj ...´ :: `. npadvmod .´: : : : ::`. advcl .´ : : : ::: :: :: :: `...... amod ......´: : : : ::: :: :: :`. prep .´ :<br /> :: :`..... acomp .....´ : : `.. nsubjpass ..´:: : : : ::: :: :: :`......... pobj .........´ : : : ::: :: :`...... conj .......´ :<br /> :`......... advcl ..........´ : : ::`... punct ...´ : : ::: :: :`. prep .´ : : : ::: :`.... conj ....´ :<br /> :`..................... punct ......................´ `........... mark ...........´:: : : ::: :`... pobj ....´ : : : ::`. attr .´ :<br /> :: :: : : ::`. agent .´ : : `... prep ....´: :<br /> :: :: : `.. nsubjpass ..´:: : `...... mark ......´: :<br /> :: :: `....... mark .......´:: : : :<br /> :: :: :`............................ punct .............................´ : :<br /> :: :: :`........................................ advcl .........................................´ :<br /> :: :`................ advcl ................´ :<br /> :`...................................... ccomp .......................................´ :<br /> :`............................................................................................................................................ 
punct .............................................................................................................................................´<br /> root<br /><br /></pre></code></p><p>However, writing the visualization code was one of the most intimidating programming tasks I've ever encountered. One explanation is that trees are most naturally processed in depth-first order top-down, while the visualization requires bottom-up BFS approach. The other may be that pixel-perfect (or, in this case, character-perfect display is always tedious). As far as I'm concerned, this is not a sufficient explanation, but I couldn't find any other. The ugliest part of this machinery is <a href="https://github.com/vseloved/cl-nlp/blob/master/src/parsing/pprint.lisp#L184">deps->levels</a> function that prints the dependency relations in a layered fashion. The problem is to properly calculate minimal space necessary to accommodate both tokens and dependency labels and to account for different cases when the token has outgoing dependency arcs or doesn't. In theory sounds pretty easy, but in practice, it turned out a nightmare. </p><p>And all of this assumes projective trees (non-intersecting arcs), as well as doesn't know how to show on one level two arcs going from one token in two directions. Finally, I still couldn't align the two trees (constituency and dependency) above and under the sentence. Here's the target: </p><code><pre><br /> TOP <br /> : <br /> S <br /> .----------------:--------------. <br /> : VP : <br /> : .---------. : <br /> NP : NP : <br /> : : .----:---------. : <br /> NN VBZ DT JJ NN . <br />This is a simple test . <br /> ^ .^. ^ ^ .^ ^<br /> `. nsubj .´:: : `. amod .´: :<br /> :: `.... det ....´: :<br /> :`..... dobj .....´ :<br /> :`...... 
punct ......´<br /> root<br /></pre></code></p><p>and this is how it prints for now (one more challenge was to transfer additional offsets from dependencies into the constituency tree): </p><code><pre><br /> TOP <br /> : <br /> S <br /> .-----------:---------. <br /> : VP : <br /> : .---------. : <br /> NP : NP : <br /> : : .----:-----. : <br /> NN VBZ DT JJ NN . <br />This is a simple test . <br /> ^ .^. ^ ^ .^ ^<br /> `. nsubj .´:: : `. amod .´: :<br /> :: `.... det ....´: :<br /> :`..... dobj .....´ :<br /> :`...... punct ......´<br /> root<br /></pre></code><p>Well, the good news is that it is usable, but it still needs more work to be feature complete. I wonder what was I doing wrong: maybe, someone can come up with a clean and simple implementation of this functionality (in any language)? I consider it a great coding challenge, although it may require a week of your free time and a bunch of dead neurons to accomplish. But if you're willing to take it, I'd be glad to see the results... :D </p></div>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-60612655310517670262017-01-02T18:46:00.000+02:002017-01-03T21:34:14.635+02:00(m8n)ware Open for Business<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-ztrGRNC-psk/WGoPZQF7JyI/AAAAAAAABp4/RVDK7xpmgnsNkvqtynVnTgBJTrxQ5uv6gCLcB/s1600/logo.png" imageanchor="1" style="float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-ztrGRNC-psk/WGoPZQF7JyI/AAAAAAAABp4/RVDK7xpmgnsNkvqtynVnTgBJTrxQ5uv6gCLcB/s1600/logo.png" width="150" /></a></div> <p>Today, I want to announce (m8n)ware (the name is an i18n-abbreviation of "meditationware" with a mix of Lisp parens). This is a thing I always wanted to build. After parting ways with Grammarly almost a year ago, I had some time to rest and think about my next move. 
And I couldn't let go of this thought, so I figured: you can always go work somewhere, but you don't get a lot of stabs at realizing some of your own ideas. Maybe two or three in a lifetime. I had tried this once already with fin-ack almost 8 years ago, and the concept behind it was, basically, the same — the implementation differed.</p> <h2><a id="user-content-in-theory" class="anchor" href="#in-theory" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>In theory</h2> <p>What is (m8n)ware? It is a company aimed at solving problems in the area of cognition-related computing, which will be built as a distributed network of mostly Lisp research-oriented engineers. This sounds rather complex, so let me try to explain a couple of points:</p> <ul><li><strong>Cognition-related computing</strong> is the best term I arrived at after thinking long about the area of CS that includes various tasks related to cognition, intelligence, knowledge, and associated logic. The common marketing buzzword is Artificial Intelligence, but it has a negative history and is quite misleading. All computer programs implement some form of "artificial" intelligent behavior. The defining feature of cognition-related computing is that it requires some transformation of raw data into structured computer-processable information and back, which is similar to human cognitive functions that arguably do the same transformation for our own internal processing.</li><li>The gist of the <strong>distributed network</strong> notion is that the primary focus of (m8n)ware is building not a localized corporate-like structure, in which people are bound primarily by legal contracts and payment obligations, but a loosely coupled group of like-minded people, who share the same values, interests, and approaches to technology. 
This organization will be seeking a perfect middle-ground between a corporation and an open-source community.</li><li><strong>Research-oriented engineers</strong> is another "middle-ground" term that describes the main multidisciplinary role needed in this organization. We're not a scientific lab that is focused on fundamental research, nor are we an outsourcing shop that faithfully implements existing results according to a given spec. We're engineers in the sense that we deliver production-ready technology that may be useful to the end users in a straightforward manner. And, at the same time, we're researchers because we wield the methodology and are ready to experiment in new areas that don't have satisfactory state-of-the-art solutions.</li></ul> <p>I don't believe in the VC mantra of "build a startup, get rich, change the world." First of all, I don't believe in changing the material world (which implies a conviction that you know better). I believe in changing yourself. Also, getting rich and doing something good (for the world) are not always aligned goals. Moreover, they are usually in conflict. I'm not a businessman in the sense that money is not my ultimate goal. But I like to see things grow and develop, things that are bigger than myself. Thus I'm interested not in market share but in mind share.</p> <p>Considering all of the above, (m8n)ware is not going to be a product company in the traditional sense. It will be a technology company that will create and disseminate knowledge-based services and products. Also, it will not aim at rapid growth, but rather at sustainable development.</p> <p>In the <a href="http://lisp-univ-etc.blogspot.com/2016/09/the-technology-company-case.html">previous post</a>, I've listed my motivations for moving in this particular direction and explained my values. 
I can't say that it got overwhelmingly positive feedback, but, in general, the results were better than I expected :) Now, I have several clues on how to cross the most challenging chasm of scaling its operation from a single-person endeavor to a productive and sustainable group. Meanwhile, I was testing whether this approach might work in practice and doing market research. Now, I'm ready to go all in and devote at least the next year of my professional life to building this thing.</p> <p><a href="http://m8nware.com">http://m8nware.com</a> is officially live. If you're interested in cooperation as a client, partner or co-worker, please let me know...</p> <h2><a id="user-content-in-practice" class="anchor" href="#in-practice" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>In practice</h2> <p>There are several aspects that I'm betting on in (m8n)ware that are non-mainstream and may seem somewhat counter-intuitive. In this part, I want to provide more details about why and how, I think, this will work.</p> <p>The first aspect is radical transparency. From this and the previous post, it should be clear that (m8n)ware originated and plans to continue functioning fully exposed to the outside world, not relying on any secret know-how or clever tricks. I don't plan to conceal what we're going to do and why. Neither "fake it till you make it." Why will this work? First of all, my experience shows that in the current age of information overload, we're fighting primarily not for the purse but for the thoughts of our "customers" (in many possible markets: not only where you sell your product/services, but also in the labor market, and in the ecosystem of potential competitors, partners, and vendors). And this requires information sharing, not safekeeping and concealment. 
Secondly, in general, I'm not interested in competition — rather, I'd like to find a unique niche in the market that will be served by the company in the best possible manner and will be big enough to sustain it. The good news is that the AI market is currently growing very fast, and this trend will last at least for a couple more years. So demand is greater than supply, which means the competitive environment is not very harsh. Another thing is Lisp: no one in their right mind will bet a company on Lisp, so I'm not really worried about the competition in the labor market. :) The final point about openness is that I personally endorse it, and as this is the company that aims to be as close to my ideal as possible, it should endorse it too.</p> <p>Although it's not a classic product company, it's not going to be a typical outsourcing one either. Yes, initially, it will provide primarily consulting services, but the idea is that, with time, the share of these services will decrease in favor of supporting more general-purpose tools and technology developed in-house. And to ensure the constant priority of this goal, we'll be doing such work from day one. Currently, I see it in the following manner: the time of all engineers will be split in some proportion between for-pay consulting and developing open-source/research projects for free, and with time, as some of these projects become important to the company, it will start paying the people who develop them for this work as well. This is a frugal approach, but I advocate it based on my personal perspective: working at my previous gigs, I'd have been eager to forfeit, say, 20% of my salary to be able to spend 20% of my time on open-source projects that matter to me personally. Actually, the percentage may be much bigger. Currently, I spend 50% of my time working on such projects and am quite happy with this. 
I deeply believe that such a balance is more appealing to many programmers (especially the kind of people I'd be willing to cooperate with) than a conventional approach.</p> <p>Lisp again. From my experience working in the cognitive problems domain, I can definitely say that it's not about coding. For several reasons. The obvious one is that 80% of resources are spent in other parts: thinking/learning, working with data, experiments, documentation. (The remaining 20% are still critical, especially since most of the solutions are resource-demanding and the code is algorithm-heavy). Then, there's the current technology situation: the days of backend-only solutions are, unfortunately, gone. A lot of problems require a heavy mobile or in-browser presence. And on the backend, thanks to microservices and other stuff, no one is developing in a single language or even on a single platform anymore. Finally, there's knowledge transfer. Programs may not be a bad medium to express concepts, but they are not the optimal one either: between scientific papers, blog posts, markdown documents, experiment notebooks, and production-optimized programs, there is no one-size-fits-all solution. All this creates conditions in which the choice of a programming language becomes much less of a constraint than it was just a few years ago. On the other hand, from the point of view of "internal" productivity (not concerned with integration into the bigger picture), Lisp has proven to be a great and rewarding environment very well suited for research work. Plus, a great way to differentiate in the labor market... 
:)</p> <h2><a id="user-content-our-value-proposition" class="anchor" href="#our-value-proposition" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Our value proposition</h2> <p>So, if you need to solve some cognitive computing problems, here's what (m8n)ware may offer.</p> <ol><li>Provide small-scale consulting services: talk to your people and help them with their challenges, perform an audit, study the feasibility of a serious project, help with gathering relevant data sets, etc.</li><li>Develop a prototype solution for a particular problem, and deliver it as a set of data, documentation, and a working web-service to allow integration into your prototypes, testing in your environment, with your clients and data.</li><li>Develop a turnkey solution and integrate it into your environment. This is rather tricky as we'll prefer to work in Lisp, and not every environment will be ready for this. We're also willing to compromise and develop some non-critical integration parts in other languages when necessary, provided that the core remains Lisp-based. </li></ol> <p>Why should you come to us instead of solving the problem on your own? The current situation in cognition-related computing is that such projects have high business value, but are not easy to complete: they require not just engineering, but a substantial/prevailing research component. Productive work in this area assumes a skill set of developers and managers that is different from conventional software development. Obviously, you want to develop this expertise in-house, but growing it, currently, is a slow and daunting process. Still, you should definitely do that for the long-term benefits, but this doesn't mean that you can say something along the lines of: in the next half-year I need to solve this complex AI problem, so I'll just hire a person/team in a couple of months and let them do it. 
It's a risky approach even in conventional software development, and in this field, it just doesn't work. The competition for AI researchers is insane and, moreover, if you're a regular company and not Google/Facebook or the latest-hottest startup, your chances of hiring and retaining top talent are, basically, nil. (Why will we be able to attract talented people while you won't? Because our particular focus, Cognitive+Tech+Distributed+Lisp, will allow us to appeal to a portion of the talent pool that is not happy in mainstream environments).</p> <p>Cognitive computing projects are risky and hard to predict. That's why for any serious long-term (longer than a couple of months) partnership we'll be dividing the work into reasonable chunks that will allow you to get at least part of the value at each checkpoint, see and assess progress, and pivot if your plans or conditions change.</p> <p>We're open for business — write to <a href="mailto:info@m8nware.com">info@m8nware.com</a> if interested.</p>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-20105554836243714662016-12-31T14:21:00.000+02:002016-12-31T14:23:28.554+02:00Lessons from the "Algorithmics" Course<img src="https://prjctr.com.ua/assets/static/v1.4.2/images/logo.png" style="float:right; margin-left: 30px;"/><p>On December 28, the second cohort of attendees of the algorithms course at Projector finished their studies. For now, they are precisely "attendees" rather than full-fledged graduates, because the course is still very far from 100% of what it can and should become. But to reach those 100%, we need, among other things, to draw some interim conclusions. The first attempt, which ran in spring and early summer, was officially positioned as a beta version, but the second one, in fact, also turned out to be a trial run, as the course was split into 2 difficulty levels and substantially reworked. <p>What I can say for sure is that the course is needed. 
This knowledge and these skills allow a programmer to move from the category of coder to that of a full-fledged developer capable of solving problems of any complexity. Such people are needed both by the Googles and Facebooks (which is why they pay so much attention to algorithms at interviews) and by small ambitious product teams. And for one's own sake too: this is one of the things that lets a programmer "spread their wings" and come to love their work by making it varied and far more rewarding. And the number of people actively interested in the subject shows that the market understands this as well. Although this knowledge belongs to the basics of a university Computer Science curriculum, far from everyone has had the chance to acquire it: some came into the profession by a non-classical route, some were unlucky with their university and teachers, and some simply skipped classes and now, having grown wiser, would like to catch up. <p>Another matter is that different categories within this group of people are served by different learning formats. For some it is online learning, for others the classical university, and only a few are objectively ready for the format of intensive three-month courses. Practice shows that about a third of the people who sign up for and start attending such courses quickly realize this and drop out. As a result, slightly less than half of the participants made it to the end of my first two courses, and far from all of them worked through all the material successfully. That is, besides those who simply ended up in the wrong place and left right away, there is roughly the same number of people who could have achieved a good result but didn't. Why so? First of all, of course, because of shortcomings on my side. <p>Finally, while I have no doubts about the demand for the algorithms track as a whole, the need for the advanced course specifically has not been proven yet. Yes, the first enrollment was filled, but we made a key mistake: we didn't filter it properly, since we 
hoped that those who weren't confident would go for the basic course. That worked only in part. The second point: from advanced algorithms people expect something more than just algorithms — specifically, they want machine learning. And that's understandable: everyone wants machine learning :) It was the concluding part of the program of the course's last iteration (and went down, in my view, very well), but the rest of the material suffered because of it, as it objectively didn't fit into the format. So there will be no more machine learning within Algorithmics Pro. However, I have no doubt that it will appear very soon as a separate course, since demand for this area is now off the charts. <p>So, in the end, what is Algorithmics Pro, and are there enough people who need it? It is a hard course and a hard question. Its core idea is to immerse practicing programmers in the problems of solving real algorithmic tasks: working with data at least gigabytes in size or with real graphs with millions of nodes; understanding how modern databases, editors, and version control systems function; studying various optimization algorithms and applying them to problems from the surrounding reality; getting under the hood of a server and reducing the waiting time of requests in its queue. Unfortunately, so far I am only feeling my way in this direction. On the one hand, not all the practical problems within reach were lying on the surface (although now, with the experience of two courses, this topic has become fairly clear to me). On the other hand, there wasn't enough readiness, involvement, and commitment on the participants' side. Do we have many programmers who want to level up in algorithmic development, are at the required level, and are ready to set aside a large chunk of their time for this over three months — that is the main question, to which we don't know the answer yet. Such people were definitely present at my first two courses. 
But, unfortunately, they could be counted on the fingers of one hand, whereas at least four hands' worth would have been needed... <h2>The main challenges of the course for me</h2><p>The first one I have, in effect, described above. From the very first day of discussions we said that this course should be as practical as possible. And I honestly tried to achieve that. But practicality has different aspects: one aspect is practical work, i.e., in this context, plain programming. Unfortunately, I didn't manage to draw every participant into it, although by my estimates and previous experience it seemed that this would happen naturally. The second aspect of practicality is discussing (and implementing) examples from the real world where all of this is used. I tried to give this due attention as well: telling case studies in the lectures, giving assignments of this kind (though there were fewer of them than there could have been), and in the course project. Unfortunately, this part, in my view, rests on the first one, i.e. active programming, and without it, it stalled badly too. This is problem number one, which I intend to actively address in the next course. <p>It also comes down to every course participant having a convenient environment for such work, and to that environment being the same for all participants, so that progress could be nudged along where needed. I started the first course by demonstrating code examples in Lisp. Not everyone liked that, although everyone had been honestly warned about it. So the second variant was more abstract: describing algorithms on the whiteboard without tying them to any particular language. This A/B test showed that it doesn't work this way: you need to have code at hand that can be touched and tinkered with, and that can be tossed to a person who got stuck, so that they can move on. Given my own attachment to Lisp, as well as the fact that the language is well suited to implementing algorithms, I plan to keep insisting on its use. Why not Python or something else? 
First, many languages are not particularly suitable for studying algorithms at all: a vivid example is JavaScript, which is too sloppy and lacks full-fledged support for arithmetic and the necessary data structures; the other extreme is static languages, especially low-level ones, which, on the one hand, offer many optimization opportunities but, on the other, impose too many restrictions (in particular, a more complicated development process) and excessive complexity. As for Python, it is more or less suitable, but I simply don't like it, all the more so since there are plenty of algorithms courses in Python. As for Lisp specifically and its peculiarities: I consider it a good filter, which we lacked when enrolling the previous courses. In fact, getting to grips with Lisp at the basic level needed for this course is not hard. And if a person lacks the motivation and trust to do that, it says a lot about their future motivation to overcome difficulties during the course itself. And, as practice has shown, to quote one of the course's students, "The mere fact that a lecture costs 550 UAH does not yet have enough of a stimulating effect" to get the homework done :( <h2>And why pay at all?</h2><p>An obvious and reasonable question that everyone wishing to take this course must answer for themselves is whether it is worth it and why pay at all. After all, there is the internet, Wikipedia, and excellent online courses where the same things can be learned. And that is indeed so! Unlike online courses, offline courses cannot be free, since they have to recoup the rent of the premises and other expenses, pay the teacher and the staff decently, and bring some profit to the organizers. Nor does the freemium model used by Coursera and others apply to them. And, in general, you have to pay for everything in life. <p>But looking at it from the practical side, the ROI of any education is the ratio of the obtained result to the money and time spent on achieving it. 
In theory, offline courses can win due to a higher average result and lower time costs. What can make up this better result? <ul><li>first, as banal as it sounds, the "magic kick", i.e. external motivation to run this race from start to finish. The money invested is also part of this motivation, although, as practice shows, not a sufficient one. So far these courses have lacked the competitive element inherent in classical education, and that is another direction that needs some work (I have some sketches) <li>second, the possibility of personal interaction with the teacher and the other students. For me this is actually one of the main motivations for doing this course: the opportunity to interact with programmers who are searching and want to grow in the profession. The paradox is that even though I am paid decent money for this course, I still earn more at my main job. That is, the smaller earnings have to be compensated by something else. For me, that something else is the possibility of co-creation with the course participants. And this means we have to be on the same wavelength and, above all, have the desire to come to each session and run it to the fullest. Ideally, the connections formed during the studies should be one of the main long-term assets after the course ends <li>a comfortable environment for learning and communication, and a sense of belonging to a community. Projector is doing a very important thing by building, on top of its venue, communities of professionals in design, product development, and programming (and, I think, other areas in the future)</ul> <h2>Who should and who shouldn't take the algorithms courses at Projector</h2><p>For me, this is actually the key question of this whole endeavor. Neither I nor Projector aim at mass scale and super-profits. First, that isn't sustainable and will end in a fizzle; second, such work brings no inner satisfaction. 
Given the choice between a group of 8-10 motivated people who know where they have come and why, and 20 casually curious ones, I pick the first option, although the second is actually easier. The first two iterations of the courses were a search: a search for the right format and for an audience adequate to it. <p>My conclusion is the following: these courses suit those who <ul><li>already have some programming experience (ideally, at least a couple of years) <li>have realized the value of algorithms for themselves and will not agonize over the eternal question of the Ukrainian student: "where is all this applied in real life?" The answer, from my point of view: those who want to will find a way. There is demand for algorithm-savvy programmers, and although it is rather niche, niches always offer both more interest and more income <li>are ready (both psychologically and organizationally) to steadily devote at least 10 hours a week to these studies for 3 months and, even more importantly, to devote to them the main resource of their brain. In practice, this means that during this period it will not be possible to work at full intensity. As these 2 courses have shown, the best time to take part in this adventure is either a break between jobs or the final years of university. Those who try to work intensively in development and study at the same time either neglect the course, or complain that their work starts to suffer, or take a vacation to catch up on what they've missed. It can also work out for those whose current job doesn't involve active coding. 
But if you have just changed jobs, are about to wrap up an important project, are expecting a child (yes, there have already been such cases :), or plan to leave for a vacation or business trip in the middle of it, then this format is definitely not for you</ul> Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-64159922753682206292016-12-28T13:16:00.003+02:002016-12-29T22:27:54.907+02:005 Steps to Grasping Modern ML<img src="https://imgs.xkcd.com/comics/progeny.png" width="200"> <p>Recently, I've been teaching an advanced Algorithms course, which concluded in a short introduction to Machine Learning. Obviously, ML is its own track in the Computer Science curriculum, but, nevertheless, there's a substantial overlap between these 2 disciplines: algorithms and ML. However, ML adds another dimension that is not usually considered in the world of algorithmic thinking.</p> <p>Anyhow, this experience helped me formulate the minimal selection of concepts that need to be grasped in order to start practical ML work. An ML crash course, so to speak.</p> <p>As I've never seen such a compilation, I'd like to share it in this post. Here are the 5 essential steps to understanding modern ML.</p> <h2><a id="user-content-step-1-understanding-the-ml-problem-formulation-knn-algorithm" class="anchor" href="#step-1-understanding-the-ml-problem-formulation-knn-algorithm" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Step 1. Understanding the ML problem formulation. kNN algorithm</h2> <p>The first thing one needs to realize is the difference between an ML problem and common programming problems. Here, training/test data and an objective function should be explained alongside the 3 common "learning" approaches: supervised, unsupervised, and reinforcement. 
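</p><p>To make the supervised setup concrete, here is a minimal nearest-neighbor classifier sketch in Python. All the data points and labels below are made up for illustration, not taken from any real data set:</p>

```python
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (squared Euclidean distance)."""
    dists = []
    for point, label in zip(train, labels):
        d = sum((p - q) ** 2 for p, q in zip(point, query))
        dists.append((d, label))
    nearest = sorted(dists)[:k]  # the k smallest distances
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-feature "training set": two well-separated clusters
train = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8),
         (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train, labels, (1.0, 1.0)))  # a
print(knn_predict(train, labels, (5.0, 5.0)))  # b
```

<p>The same majority-vote logic applies unchanged once the feature vectors come from a real data set such as Iris.</p><p>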
A widely used and good initial example is the <a href="https://archive.ics.uci.edu/ml/datasets/Iris">Iris data set</a> and the kNN algorithm.</p> <h2><a id="user-content-step-2-adding-features-and-iterative-training-into-the-picture-perceptron-algorithm" class="anchor" href="#step-2-adding-features-and-iterative-training-into-the-picture-perceptron-algorithm" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Step 2. Adding features and iterative training into the picture. Perceptron algorithm</h2> <p>The second step is the introduction of the concept of feature extraction, which allows approaching a problem from different angles. The Iris data set already has features, but initially they may be perceived as given. Iterative training is another common ML approach (although some popular algorithms like kNN or Decision Trees don't rely upon it). The Perceptron is the simplest algorithm to explain (and it still remains in practical use) and leads nicely to the next step.</p> <p>A good example task and data set for this part is the <a href="https://archive.org/details/BrownCorpus">Brown Corpus</a> and the problem of POS tagging. And there's a <a href="https://explosion.ai/blog/part-of-speech-pos-tagger-in-python">great post</a> outlining its solution by Matthew Honnibal.</p> <h2><a id="user-content-step-3-continuous-vs-discrete-learning-gradient-descent-softmax-algorithm" class="anchor" href="#step-3-continuous-vs-discrete-learning-gradient-descent-softmax-algorithm" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Step 3. Continuous vs discrete learning, gradient descent. Softmax algorithm</h2> <p>The obvious next step is transitioning from discrete perceptron learning to the continuous gradient descent used in Logistic regression. Andrew Ng provides a lucid connection in Parts II &amp; III of his <a href="https://datajobs.com/data-science-repo/Generalized-Linear-Models-%5BAndrew-Ng%5D.pdf">tutorial on Linear Models</a>. 
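</p><p>The gradient-descent mechanics can be shown with a tiny one-feature logistic regression written from scratch. This is a sketch with made-up numbers, not code from the tutorial:</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit y ~ sigmoid(w*x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of log-loss wrt the linear output
            gw += err * x
            gb += err
        w -= lr * gw / len(xs)  # step against the averaged gradient
        b -= lr * gb / len(xs)
    return w, b

# Toy data: negative examples near 0, positive ones near 4
xs = [0.0, 0.5, 1.0, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
print(round(sigmoid(w * 0.2 + b)))  # 0
print(round(sigmoid(w * 3.8 + b)))  # 1
```

<p>Replacing the single sigmoid with a normalized exponential over several classes turns this same training loop into Softmax regression.</p><p>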
It also helps that Logistic regression and Softmax are the basic building blocks of the Neural Networks that are to be discussed next. The example task for this problem may remain the same POS tagging, although others, like the ones used by Andrew, may also be utilized.</p> <h2><a id="user-content-step-4-learning-graphs-aka-neural-nets-backprop-feed-forward-neural-network-algorithm" class="anchor" href="#step-4-learning-graphs-aka-neural-nets-backprop-feed-forward-neural-network-algorithm" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Step 4. Learning graphs (aka neural nets), backprop. Feed-forward Neural Network algorithm</h2> <p>As soon as we understand gradient descent and logistic regression, it's rather easy to make the next step of forming layers of such blocks to allow the combined model to "learn" higher-level feature representations. This is where the Backprop algorithm for efficient training comes into play (which is, by the way, another example of a dynamic programming algorithm). Also in this part, it's possible to talk about vector representations of words and other highly contextualized objects (landmark positions in images, etc.) A great explanation of Backprop is presented in <a href="http://colah.github.io/posts/2015-08-Backprop/">this post</a> by Christopher Olah. Also, a good example data set here is <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a>.</p> <h2><a id="user-content-step-5-bias-variance-tradeoff-regularization--ensembles-random-forest-algorithm" class="anchor" href="#step-5-bias-variance-tradeoff-regularization--ensembles-random-forest-algorithm" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Step 5. Bias-variance tradeoff, regularization & ensembles. Random Forest algorithm</h2> <p>Finally, we should return to the beginning and revisit the learning problem, but with some practical experience already under our belt. 
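</p><p>As an aside, the layered-blocks idea from Step 4 can be condensed into a from-scratch sketch: a one-hidden-layer network trained by backprop on the XOR function. All numbers here are toy values, and the code is an illustration, not a production implementation:</p>

```python
import math, random

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: the classic function a single linear unit cannot represent
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# one hidden layer of 3 tanh units (weights + bias), one sigmoid output unit
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
w2 = [random.uniform(-1, 1) for _ in range(4)]

def forward(x1, x2):
    h = [math.tanh(w[0] * x1 + w[1] * x2 + w[2]) for w in w1]
    out = sigmoid(sum(w2[j] * h[j] for j in range(3)) + w2[3])
    return h, out

def total_error():
    return sum((forward(x1, x2)[1] - y) ** 2 for (x1, x2), y in data)

loss_before = total_error()
for _ in range(3000):
    for (x1, x2), y in data:
        h, out = forward(x1, x2)
        d_out = out - y  # gradient of log-loss wrt the output unit's input
        for j in range(3):
            # backpropagate through the output weight and the tanh derivative
            d_h = d_out * w2[j] * (1 - h[j] ** 2)
            w1[j][0] -= 0.5 * d_h * x1
            w1[j][1] -= 0.5 * d_h * x2
            w1[j][2] -= 0.5 * d_h
        for j in range(3):
            w2[j] -= 0.5 * d_out * h[j]
        w2[3] -= 0.5 * d_out
loss_after = total_error()

for (x1, x2), _ in data:
    print((x1, x2), round(forward(x1, x2)[1]))
```

<p>The hidden layer is exactly what lets the model learn the intermediate features that make XOR separable.</p><p>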
This is where the essential bias-variance tradeoff and the common ways to tackle it should be discussed: regularization and ensembles. It's also a good place to introduce the Decision Tree algorithms and the ensemble methods based upon them (Random Forest and, maybe, others) as some of the most widely used current practical approaches.</p> <p><a href="https://jvns.ca/blog/2016/01/02/winning-the-bias-variance-tradeoff/">"Winning the Bias-Variance Tradeoff"</a> by Julia Evans may be a good introductory text on this.</p> <p>Overall, due to the highly condensed nature of such a presentation, a lot of important things will be barely covered: for example, unsupervised learning, computer vision with its convolutions, and sequence models. However, I believe that with the obtained knowledge and a conceptual understanding of the mentioned basics, those parts may be grasped quite easily.</p> <p>If this plan turns out to be helpful to you, or some essential improvements are necessary, please leave your thoughts and comments...</p>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0tag:blogger.com,1999:blog-6031647961506005424.post-27536105030371741532016-09-23T21:42:00.003+03:002016-09-23T21:44:38.332+03:00The Technology Company Case<h2>What's a Technology Company?</h2><p>I'm a programmer. Obviously, this means that I have to earn money and realize my talents working for some company that employs programmers (or on my own). It's worth noting that there are several kinds of such companies. </p><p>One is traditional enterprises, like banks or government agencies, that need programmers to automate their processes and improve output. Every company needs an accountant, and, likewise, nowadays every one needs a programmer. </p><p>There are also companies that provide software development and related services - the so-called consulting or outsourcing firms. They employ programmers to automate the work and improve the output of, mainly, the first breed of companies. 
</p><p>Then, there are also technology product companies, like Instagram or Apple, that employ engineers to build their products, services or media, which are then consumed by ordinary people. </p><p>Finally, there are truly technology companies that produce new technology that is used by all the previous three groups, as well as by the technology companies themselves. From the business standpoint, this technology may be supplied either in the form of on-the-spot consulting work, licensing or even separate products. </p><p>Every group has some percentage of technology work in its operation. This work, often called R&D, comprises the implementation of existing technology (D in R&D) and the creation of new technology (R). The share of the two differs substantially between the groups. The companies from the first group may be 1 to 10% dependent on R&D work and have almost 0% of R in it; the second group is 90% R&D work, still with a mere few percent of R in it; the third group is just 30-50% R&D, and the share of R in it may rise to 10-20% but rarely more; and the last group should have 90% R&D with >50% R in it. </p><p>A technology company should be a thought leader in its sphere. This means not chasing fashions in our pop-culture-like industry but setting an example justified by technological excellence instead of marketing. This means building something that will last and have an impact for a substantially longer period of time than the ever-accelerating hype cycle. This means having an ultimate goal of solving hard technical problems and not chasing profits or market share. While product companies try to change the world by producing their innovative products that merely use technology, a technology company does that by producing technology that enables more innovative products. A closed vs an open approach. </p><h2>10x Programmers</h2><p>There's this popular meme of 10x programmers that constantly spurs discussion and flamewars among our peers. 
Is it just a fad? Who are those 10xers, and do they really exist? </p><p>Let's first consider this question from the perspective of other crafts and professions. Are there 10x painters? Well, if we compare painter productivity by the number of pieces painted, it would be hard to tell. But if you think about price, clearly, there are even 1000x ones: an ordinary painter's work may cost $1000, and a famous masterpiece will be in the millions. If we consider the number of people reached, the same rule applies: maybe thousands will see quality works of a common professional painter, and millions or even billions - the works of a master. But you may say that painting, unlike programming, is an art. What about carpentry? Well, I'd rather compare with professions that require mostly intellectual work. Are there 100x doctors? Surely, there are those who saved 100x more people by inventing a new surgical method or treatment. Lawyers? A person who writes a law impacts orders of magnitude more people than an ordinary counselor at some random firm. This list can go on and on. </p><p>I've compiled a book called "<a href="https://leanpub.com/lisphackers">Interviews with 100x programmers</a>". To some extent, the name was an exaggeration. But, as they say, every joke has some truth in it. In fact, I fully subscribe to the 10x programmer concept. Moreover, I consider that there are not only 10x ones but also 100x, 1000x... Definitely, there are hardly any 10x coders, i.e. people who produce 10x the amount of code a good professional programmer will create in the same timeframe. But there's much more to programming than merely writing program code. </p><p>To be an order of magnitude more productive means to solve problems an order of magnitude more complex than the ones considered accessible at a given point in time. Obviously, such problems exist, and there will, probably, always be an unlimited supply of them. 
Also, it should be clear from the short history of computing that there are some people capable of bringing a new perspective, coming up with approaches that allow solving such problems either in a much better way or solving them at all. As Alan Kay, who's for sure one of such 100x programmers, has famously said: "A change in perspective is worth 80 IQ points." </p><p>Still, there's more to it than just solving harder problems. Another popular explanation given to the 10x thing is that such a programmer is the one who makes 10 other programmers 2x more productive. This, from my point of view, implies the one who shows a better approach, in other words, a thought leader, and the one who implements this vision in some technology that other programmers use. In fact, we're productive in our work at our current level mostly thanks to such prolific programmers: every day I use Unix, Emacs, Lisp, git and other tools that were initially conceived and built by a handful of the 10x programmers. Their vision and impulse made thousands and even millions more productive. </p><p>Those 10x programmers are the ones I'd like to be around at work. And so, my ideal company is the one that attracts such people. And although a significant percentage of such people are loners, most of them are also highly motivated by the presence of similar colleagues. </p><p>So which one of the 4 company types mentioned above will such people choose? </p><p>The first one is mostly out of consideration because in it the programmers are not the primary value creators - on the contrary, often they are considered a cost center. I.e. they are just another service function similar to an accountant or a janitor. Surely, there are exceptions to this rule when the company leaders realize the potential that technology change bears for their company, which, basically, means that the firm is transitioning to type 3. 
Even in such a case, it's still a much less productive environment than a type 3 firm built with the right principles in mind from the start. </p><p>What about outsourcing companies? Their advantage is that programmers are their primary asset, which means that the company will be built around them, have a substantial number of them, and will do a lot to attract and retain prominent people. The nature of the work, unfortunately, is usually a severely limiting factor here. First of all, in most cases, the customer doesn't really care about the technological excellence or innovative nature of the result. The projects are, in most cases, counter-innovative, i.e. the more mundane, reproducible, and ordinary the technological solution that achieves the desired result is, the better. And it's quite reasonable from the business standpoint: innovation is risky. This means that, ultimately, such companies reward uniformity and interchangeability of their staff and their output, especially since it's much easier to manage and scale. Have I mentioned that managing programmers is very hard (the common metaphor used is "herding cats")? </p><p>Now, let's look at product companies. Are they a haven for 10x programmers? Well, a lot of such people flock there. One reason is that such companies understand the need for talented programmers because, unlike the previous 2 types, they may and should face unique technological challenges, and, moreover, their leadership is able to recognize that (type 1 companies also face those challenges, but usually they just don't view them from the technology standpoint). Yet, a product company is only X% new technology and another (100-X)% other things. What is the value of X? Maybe, it's 20-30% at Google or Facebook, and even less at smaller companies with fewer resources. Why? Because, as we discussed above, the ultimate goal of most of such companies is making money by serving masses of customers. 
This requires huge marketing, sales, operations, and support "vehicles" that employ professionals to operate and programmers to build, maintain and develop. But these vehicles offer quite few interesting technical challenges. Once again, this is the right thing from the business standpoint, especially if you have to earn more and more money each year and grow your market share. But a focus on earnings and market share means that technological excellence becomes secondary. Surely, the best of the leaders and managers realize its importance, but they have to make many trade-offs all the time. </p><p>That's why I have singled out "pure" technology companies. Such organizations are naturally inclined to make tech excellence their focus. There are, surely, counterexamples that are infected with the Silicon Valley "growth virus" and try to win the market as fast as possible with marketing, but it doesn't mean that it always has to work that way. In my opinion, purely technological companies are the best place for 10x programmers because they will not merely utilize their work to some other end goal but have a vested interest in amplifying its influence. They are not inclined to conceal the know-how and innovations as trade secrets, but will benefit from sharing and promoting them. They may also provide maximum freedom of choice: of approaches, tools, supporting technologies, because their primary concern is not effective scaling of the same ultimately repetitive work to many similar programmers but creating breakthroughs. Their dependence on such ultra-productive programmers is existential. </p><p>I don't consider myself to be a 10x programmer, but, surely, I'd like to reach such a level someday, and I also aspire to work alongside them. </p><h2>A Company I'd Build</h2><p>All in all, being part of a technology company seems like the best choice for me both in terms of potential impact and possibilities to have 10x programmer colleagues. 
Eventually, either you have to join one or create one yourself. For the last 5 years, I've been working in so-called AI, and my experience, both from the product company side and from individual consulting work, shows that demand for research-related technology expertise here is growing much faster than the supply. I see it as a chance for new technology companies to emerge and gather those few capable people in this field to amplify their impact. So I'm seriously considering starting a technology company, and I'm looking for like-minded people who share my values and vision to join forces. </p><p>If I were to start such a company, I'd build its foundation on a few things that really matter to me personally. Some principles or, as they used to call them, values. Unfortunately, the notion of "values" has somewhat lost its original meaning in the corporate world. When you see such qualities as effectiveness or adaptability cast as values, that's a sign of such a misconception. Values are something that you don't compromise upon at all. Surely, it's pointless to compromise any parts of your professionalism (such as effectiveness), so professionalism is a default value not even worth discussing. Real "values", however, are those aspects of your work culture that run a real risk of conflicting with the things that are considered universally important. In business, those are profits, market share, favorable competitive position. So, being true to your values means not forfeiting them even if you're going to lose in those basic areas. </p><p>Here is a list of the values that I subscribe to:</p><ul><li><b>Technological excellence</b> should be a basic trait of any technology company. For me, an example of applying such a value would be using Lisp as a starting point for most of the solutions despite the fact that the language is quite unpopular and underappreciated - my personal experience shows that it works very well, especially in the fields that are heavily knowledge-based. 
Another example is that in a technology company literally everyone should be technology-savvy: even the office manager should be programming at times. <li><b>Personalism</b> is the main quality that a company has to support in its dealings with all the people it's interacting with: employees, customers, contractors and providers. This means, for example, striving to provide flexible and productive working conditions to each employee instead of trying to fit everyone into the same conditions (because management is hard). Overall, lack of management competency should never become a limiting factor. One manifestation of this is that a modern technology company should be built as a distributed organization from day 1. <li><b>Ahimsa</b> is an ancient word meaning not harming anyone. It is a little bit more than our modern-day ethics, but it's worth it. Why create something if you know that it will cause misery and suffering to others? In effect, this means, for example, a refusal to provide services to companies that are clearly unethical. <li><b>Radical openness</b>. As they say, "information wants to be free." :) Maximal sharing and minimal secrecy make so many things much simpler. And in our lowest-common-denominator technology world, ultimately, the risk of competitors copying and abusing your work is much smaller than that of brilliant people not joining your cause because they just haven't heard of it. </ul><p>So... If you're interested in solving complex AI challenges from whatever part of the world you're living in, working with 10x programmers, using Lisp and other advanced technologies in the process - drop me a line, I'd be glad to chat.<br><img border="0" src="https://4.bp.blogspot.com/-9x-_9mn-PXw/V-VumICcBdI/AAAAAAAABmM/B3jegDJz9K0exsMFTcyjBVsNPCQ_guv5gCLcB/s320/m8n.png"></p>Vsevolod Dyomkinhttp://www.blogger.com/profile/07729454371491530027noreply@blogger.com0