2019-12-13

Programming Algorithms: Dynamic Programming

This chapter opens the final part of the book, entitled "Selected Algorithms". In it, we're going to apply the knowledge from the previous chapters to the analysis of a selection of important problems that are mostly application-independent and find use in many applied domains: optimization, synchronization, compression, and so on.

We will start with a single approach that is, arguably, the most powerful algorithmic technique in use. If we manage to reduce a problem to Dynamic Programming (DP), in most cases, we can consider it solved. The fact that we have progressed so far in this book without mentioning DP is quite amazing. Actually, we could have already talked about it several times, especially in the previous chapter on strings, but I deliberately didn't start the exposition earlier because I wanted to contain this topic in its own chapter. Indeed, strings are one of the domains where dynamic programming is used quite heavily, but the technique finds application in almost every area.

Also, DP is one of the first marketing terms in CS. When Bellman invented it, he wanted to use the then-hyped term "programming" to promote his idea. This has, probably, caused more confusion over the years than benefit. In fact, a good although unsexy name for this technique could be simply "filling the table", as the essence of the approach is an exhaustive evaluation of all variants with memoization of partial results (in a table) to avoid redundant recomputation. Obviously, it will bring benefits only when there are redundant computations to reuse, which is not the case, for example, with combinatorial optimization. To determine if a problem may be solved with DP, we need to validate that it has the optimal substructure property:

A problem has optimal substructure if, for any of its subproblems, an optimal solution to the whole problem includes an optimal solution to that subproblem.

An example of optimal substructure is the shortest path problem. If the shortest path from point A to point B passes through some point C and there are multiple paths from C to B, the one included in the shortest path A-B must be the shortest of them. In fact, the shortest path is an archetypal DP problem which we'll discuss later in this chapter. A counterexample is the Travelling Salesman Problem (TSP): if it had optimal substructure, the subpath between any two nodes in the resulting tour would have to be the shortest possible path between these nodes. But this can't be true for all nodes, because there is no guarantee that the edges of all those shortest paths can be combined into a single cycle visiting all the nodes.

Fibonacci Numbers

So, as we said, the essence of DP is filling a table. This table, though, may have a different number of dimensions for different problems. Let's start with a 1d case. What book on algorithms can omit discussing the Fibonacci numbers? Usually, they are used to illustrate recursion, yet they are also a great showcase for the power of memoization. Besides, recursion is, conceptually, also an integral part of DP.

A naive approach to calculating the i-th number will be directly coding the Fibonacci formula:

(defun naive-fib (i)
  (assert (typep i '(integer 0)))
  (if (< i 2) 1
      (+ (naive-fib (- i 1))
         (naive-fib (- i 2)))))

However, applying it will result in an exponential growth of the number of computations: each call to naive-fib results in two more calls. So, the number of calls needed for the n-th number, with this approach, is O(2^n).

> (time (naive-fib 40))
Evaluation took: 3.390 seconds of real time
165580141
> (time (naive-fib 42))
Evaluation took: 7.827 seconds of real time
433494437

Yet, we can see here a direct manifestation of the optimal substructure property: the calculation of the i-th number uses the result of the (1- i)-th one. To utilize this recurrence, we'll need to store the previous results and reuse them. It may be achieved by replacing the function call with a table access. Actually, from the point of view of math, tables and functions are, basically, the same thing.

(let ((fib (vec 1 1)))  ; our table will be an adjustable vector
  (defun fib (i)
    (if (< i 2) 1
        (progn
          (when (< (length fib) i)
            (vector-push-extend (fib (- i 1)) fib))
          (+ (? fib (- i 1))
             (? fib (- i 2)))))))

What we've done here is add a layer of memoization to our function, which uses an array fib that is filled with the consecutive Fibonacci numbers. The array is hidden inside the closure of the fib procedure, so it will persist between the calls to it and accumulate the numbers as they are requested. There will also be no way to clear it, apart from redefining the function, as closed-over variables of this kind are not accessible outside of the function. The consecutive property is ensured by the arrangement of the recursive calls: the table is filled on the recursive ascent starting from the lowest yet unknown number. This approach guarantees that each Fibonacci number is calculated exactly once and reduces our dreaded O(2^n) running time to a mere O(n)!
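
As an aside, the same table-behind-a-function trick can be packaged into a generic helper. Below is a minimal sketch of such a wrapper (the name memoize is ours, not a standard utility). Note that wrapping an already defined recursive function doesn't automatically memoize its internal recursive calls unless the function's global definition is replaced with the wrapped version:

;; a generic memoization wrapper: cache the results of FN
;; keyed by the full argument list
(defun memoize (fn)
  (let ((cache (make-hash-table :test 'equal)))
    (lambda (&rest args)
      (multiple-value-bind (result found) (gethash args cache)
        (if found
            result
            (:= (gethash args cache) (apply fn args)))))))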

Such a calculation is the simplest example of top-down DP that is performed using recursion. Despite its natural elegance, it suffers from a minor problem that may turn significant, in some cases: extra space consumption by each recursive call. It's not only O(n) in time, but also in space. The alternative strategy that gets rid of redundant space usage is called bottom-up DP and is based on loops instead of recursion. Switching to it is quite trivial, in this case:

(let ((fib (vec 1 1)))
  (defun bottom-up-fib (i)
    (let ((off (length fib)))
      (adjust-array fib (1+ i) :fill-pointer t)
      (dotimes (j (- (1+ i) off))
        (let ((j (+ j off)))
          (:= (aref fib j)
              (+ (aref fib (- j 1))
                 (aref fib (- j 2)))))))
    (aref fib i)))
> (time (bottom-up-fib 42))
Evaluation took: 0.000 seconds of real time
> (time (bottom-up-fib 4200))
Evaluation took: 0.004 seconds of real time
40512746637826407679504078155833145442086707013857032517543...  ; this number is a Lisp's bignum that has unlimited size

Funny enough, a real-world-ready implementation of Fibonacci numbers ends up not using recursion at all...
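
For illustration, here is a sketch of what such an implementation might look like (the name iter-fib is made up): it keeps only the last two numbers instead of the whole table, so it runs in O(n) time and O(1) space.

(defun iter-fib (i)
  (let ((prev 1)
        (cur 1))
    (loop :repeat (1- i) :do
      (psetf prev cur
             cur (+ prev cur)))
    cur))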

String Segmentation

Let's consider another 1d problem: suppose we have a dictionary of words and a string consisting of those words with the spaces between them somehow lost — the words got glued together. We need to restore the original string with spaces or, to phrase it differently, split the string into words. This is one of the instances of the string segmentation problem, and if you're wondering how and where such a situation could occur for real, consider Chinese text, which doesn't have to contain spaces. Every Chinese language processing system needs to solve a similar task.

Here's an example input[1]:

String: thisisatest
Dictionary: a, i, at, is, hi, ate, his, sat, test, this
Expected output: this is a test

It is clear that, even with such a small dictionary, there are multiple ways we could segment the string. The straightforward and naive approach is to use a greedy algorithm. For instance, a shortest-first solution will try to find the shortest word from the dictionary starting at the current position and then split it off (as a prefix) from the string. It will result in the following split: this i sat est. But the last part est isn't in the dictionary, so the algorithm has failed to produce a correct split (although, by chance, if the initial conditions were different, it could have succeeded). Another version — the longest-first approach — could look for the longest words instead of the shortest. This would result in: this is ate st. Once again, the final token is not a word. It is pretty obvious that these simple takes are not correct and we need a more nuanced solution.

As a common next step in developing such brute-force approaches, a developer would resort to backtracking: when the computation reaches a position in the string from which no word in the dictionary may be recovered, it unwinds to the position of the previous successful split and tries a different word. This procedure may have to go back multiple steps — possibly to the very beginning. As a result, in the worst case, to find a correct split, we may need to exhaustively try all possible combinations of words that fit into the string.

Here's an illustration of the recursive shortest-first greedy algorithm operation:

(defun shortest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    (let ((word (slice str 0 (1+ i))))
      (when (? dict word)
        (return-from shortest-first-restore-spaces
          (cond-it
            ((= (1+ i) (length str))
             word)
            ((shortest-first-restore-spaces dict (slice str (1+ i)))
             (format nil "~A ~A" word it))))))))

CL-USER> (defparameter *dict* (hash-set 'equal "a" "i" "at" "is" "hi" "ate" "his" "sat" "test" "this"))
CL-USER> (shortest-first-restore-spaces *dict* "thisisatest")
  0: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")
    1: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")
      2: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")
        3: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")
        3: SHORTEST-FIRST-RESTORE-SPACES returned NIL
      2: SHORTEST-FIRST-RESTORE-SPACES returned NIL
    1: SHORTEST-FIRST-RESTORE-SPACES returned NIL
  0: SHORTEST-FIRST-RESTORE-SPACES returned NIL
NIL

To add backtracking into the picture, we need to avoid returning in the case of the failure of the recursive call:

(defun bt-shortest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    (let ((word (slice str 0 (1+ i))))
      (when (in# word dict)
        (when (= (1+ i) (length str))
          (return-from bt-shortest-first-restore-spaces word))
        (when-it (bt-shortest-first-restore-spaces dict (slice str (1+ i)))
          (return-from bt-shortest-first-restore-spaces (format nil "~A ~A" word it)))))))

CL-USER> (bt-shortest-first-restore-spaces *dict* "thisisatest")
  0: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")
    1: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")
      2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")
        3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")
        3: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL
      2: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL
      ;; backtracking kicks in here
      2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "atest")
        3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "test")
        3: BT-SHORTEST-FIRST-RESTORE-SPACES returned "test"
      2: BT-SHORTEST-FIRST-RESTORE-SPACES returned "a test"
    1: BT-SHORTEST-FIRST-RESTORE-SPACES returned "is a test"
  0: BT-SHORTEST-FIRST-RESTORE-SPACES returned "this is a test"
"this is a test"

Lisp trace is an invaluable tool for understanding the behavior of recursive functions. Unfortunately, it doesn't work for loops, for which one has to resort to debug printing.
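
For reference, the trace output shown above is produced by switching tracing on for the functions of interest before calling them (and untrace switches it off); the exact output format is implementation-dependent, the one above being SBCL's:

CL-USER> (trace shortest-first-restore-spaces
                bt-shortest-first-restore-spaces)
CL-USER> (untrace shortest-first-restore-spaces
                  bt-shortest-first-restore-spaces)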

Realizing that this is brute force, we could just as well use another approach: generate all possible sequences of words from the dictionary that total the same number of characters (n) as the string and select the ones that match it. The complexity of this scheme is O(2^n)[2]. In other words, our solution leads to a combinatorial explosion in the number of possible variants — a clear no-go for every algorithmic developer.

So, we need to come up with something different, and, as you might have guessed, DP fits in perfectly, as the problem has optimal substructure: a complete word in a substring of the string remains a complete word in the whole string as well. Based on this understanding, let's reframe the task in a way that lends itself to DP better: find the characters in the string that end complete words so that the words, combined, cover the whole string and do not overlap[3].

Here is an implementation of the DP-based procedure. Apart from calculating the maximum length of a word in the dictionary, which usually may be done offline, it requires a single forward pass and a single backward pass. The forward pass is a linear scan of the string that, at each character, tries to find all the words starting at it and matching the string. The complexity of this pass is O(n * w), where w is the length of the longest word in the dictionary (a constant), so it is, actually, O(n). The backward pass (called, in the context of DP, decoding) restores the spaces using the so-called backpointers stored in the dp array. Below is a simplistic implementation that returns a single match. A recursive variant, with or without a backward pass, could accumulate all the possible variants.

(defun dp-restore-spaces (dict str)
  (let ((dp (make-array (1+ (length str)) :initial-element nil))
        ;; in the production implementation, the following calculation
        ;; should be performed at the pre-processing stage
        (w (reduce 'max (mapcar 'length (keys dict))))
        (begs (list))
        (rez (list)))
    ;; the outer loop tries to find the next word
    ;; only starting from the ends of the words that were found previously
    (do ((i 0 (pop begs)))
        ((or (null i)
             (= i (length str))))
      ;; the inner loop checks all substrings of length 1..w
      (do ((j (1+ i) (1+ j)))
          ((>= j (1+ (min (length str)
                          (+ w i)))))
        ;; record each position only once to keep the pass linear
        (when (and (? dict (slice str i j))
                   (null (? dp j)))
          (:= (? dp j) i)
          (push j begs)))
      (:= begs (reverse begs)))
    ;; the backward pass
    (do ((i (length str) (? dp i)))
        ((null (? dp i)))
      (push (slice str (? dp i) i) rez))
    (strjoin #\Space rez)))

CL-USER> (dp-restore-spaces *dict* "thisisatest")
"this is a test"

Similarly to the Fibonacci numbers, the solution to this problem doesn't use any additional information to choose between several variants of a split; it just takes the first one found. However, if we wanted to find the variant that is most plausible to a human reader, we'd need to add some measure of plausibility. One idea might be to use a frequency dictionary, i.e. prefer the words that have a higher frequency of occurrence in the language. Such an approach, unfortunately, also has drawbacks: it overemphasizes short and frequent words, such as determiners, and doesn't account for how words are combined in context. A more advanced option would be to use a frequency dictionary not just of words but of whole phrases (ngrams). The longer the phrases, the better from the standpoint of linguistics, but also the worse from the engineering point of view: more storage space is needed, and more data has to be processed if we want to collect reliable statistics for all the possible variants. And, once again, as the number of words in an ngram grows, we will be facing the issue of combinatorial explosion pretty soon. The sweet spot for this particular task might be bigrams or trigrams, i.e. phrases of 2 or 3 words. Using them, we'd have to supply another dictionary to our procedure and track the measure of plausibility of the current split as a product of the frequencies of the selected ngrams. Formulated this way, our exercise becomes not merely an algorithmic task but an optimization problem. And DP is also suited to solving such problems. In fact, that was the primary purpose it was intended for in the Operations Research community. We'll see it in action with our next problem — text justification. And developing a restore-spaces-plausibly procedure is left as an exercise to the reader. :)
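
Without spoiling the exercise, here is a sketch of just the scoring part for the simplest, unigram, variant. It assumes a hypothetical freqs hash-table mapping words to their counts and sums log-frequencies, which is equivalent to multiplying the frequencies but avoids floating-point underflow on long strings:

(defun split-plausibility (freqs words)
  (reduce '+ (mapcar (lambda (word)
                       ;; add-one smoothing so that unseen words don't zero out the score
                       (log (1+ (gethash word freqs 0))))
                     words)))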

Text Justification

The task of text justification is relevant to both editing and reading software: given a text consisting of paragraphs and a line length limit, split each paragraph into lines that contain only whole words and do not exceed the limit, so that the lines are filled as evenly as possible. Its solution may be used, for example, to display text in HTML blocks with an align=justify property.

A more formal task description would be the following:

  • the algorithm is given a text string and a line length limit (say, 80 characters)

  • there's a plausibility formula that specifies the penalty for each line being shorter than the length limit. A usual formula is this:

    (defun penalty (length limit)
      (if (<= length limit)
          (expt (- limit length) 3)
          most-positive-fixnum))
    
  • the result should be a list of strings

As we are discussing this problem in the context of DP, first, we need to determine its optimal substructure. Superficially, we could claim that the optimal solution should be built only from the lines that have the smallest individual penalties, according to the formula. However, this doesn't work, as some of the potential lines with the best plausibility (length closest to 80 characters) may overlap, i.e. the optimal split may not be able to include all of them. What we can reliably claim is that, if the text is already justified from position 0 to i, we can still justify the remainder optimally regardless of how the prefix was split into lines. This is, basically, the same as with string segmentation, where we didn't care how the string was segmented before position i. And it's a common theme in DP problems: the key feature that allows us to save on redundant computation is that we only remember the optimal result of the computation that led to a particular partial solution, but we don't care what particular path was taken to obtain it (except when we need to restore the path, but that's what the backpointers are for, and it doesn't impact the forward pass of the algorithm). So the optimal substructure property of text justification is the following: if the best split of the whole string includes the consecutive split indices x and y, then the best split from 0 to y should also include x.

Let's justify the following text with a line limit of 50 chars:

Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running family of Lisp programming languages.

Suppose we've already justified the first 104 characters. This leaves us with a suffix of length 69: descendant of the long-running family of Lisp programming languages. As its length is above 50 chars but below 100, we can conclude that it requires exactly 1 split. This split may be performed after the first, second, third, etc. token. Let's calculate the total penalty of each candidate (the first number is the penalty of the first line, the second one of the remaining suffix):

after "the": 5832 + 0 = 5832
after "long-running": 6859 + 2197 = 9056
after "family": 1728 + 8000 = 9728
after "of": 729 + 12167 = 12896
after "Lisp": 64 + 21952 = 22016

So, the optimal split starting at index 105[4] is into the strings: "descendant of the long-running" and "family of Lisp programming languages." Now, we haven't guaranteed that index 105 will, in fact, be a split point in the optimal split of the whole string, but, if it were, we would have already known how to continue. This is the key idea of the DP-based justification algorithm: starting from the end, calculate the cost of justifying the remaining suffix after each token, using the results of previous calculations. At first, while the suffix length is below the line limit, the penalties are trivially computed by a single call to the penalty function. After exceeding the line limit, the calculation will consist of two parts: the penalty of the first line plus the previously calculated value for the remaining suffix.

(defun justify (limit str)
  (with ((toks (reverse (split #\Space str)))
         (n (length toks))
         (penalties (make-array n))
         (backptrs (make-array n))
         (lengths (make-array n)))
    ;; forward pass (from the end of the string)
    (doindex (i tok toks)
       (let ((len (+ (length tok) (if (> i 0)
                                      (? lengths (1- i))
                                    0))))
         (:= (? lengths i) (1+ len))
         (if (<= len limit)
             (:= (? penalties i) (penalty len limit)
                 (? backptrs i) -1)
           ;; minimization loop
           (let ((min most-positive-fixnum)
                 arg)
             (dotimes (j i)
               (with ((j (- i j 1))
                      (len (- (? lengths i)
                              (? lengths j)))
                      (penalty (+ (penalty len limit)
                                  (? penalties j))))
                 (when (> len limit) (return))
                 (when (< penalty min)
                   (:= min penalty
                       arg j))))
             (:= (? penalties i) min
                 (? backptrs  i) arg)))))
    ;; backward pass (decoding)
    (loop :for end := (1- n) :then beg
          :for beg := (? backptrs end)
          :do (format t "~A~%" (strjoin #\Space (reverse (subseq toks (1+ beg) (1+ end)))))
          :until (= -1 beg))))

CL-USER> (justify 50 "Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running family of Lisp programming languages.")

Common Lisp is the modern, multi-paradigm,
high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running
family of Lisp programming languages.

This function is somewhat longer, but, conceptually, it is pretty simple. The only insight I needed to implement it efficiently was the additional array for storing the lengths of all the string suffixes examined so far. This way, we apply memoization twice: to prevent the recalculation of both the penalties and the suffix lengths, and all of the previously computed values are reused at each iteration. If we were to store the suffixes themselves, we would have had to perform an additional O(n) length calculation at each iteration.

The algorithm performs two passes. In the forward pass (which is, in fact, performed from the end), it fills the slots of the DP arrays using the minimum joint penalty for the potential current line and the remaining suffix, the penalty for which was calculated during one of the previous iterations of the algorithm. In the backward pass, the resulting lines are extracted by traversing the backpointers starting from the last index.

The key difference from the previous DP example is in these lines:

(:= (? penalties i) min
    (? backptrs  i) arg)

Adding them (alongside the whole minimization loop) turns DP into an optimization framework that, in this case, is used to minimize the penalty. The backptrs array, as we said, is used to restore the sequence of steps which led to the optimal solution, as, eventually (and this is true for the majority of DP optimization problems), we care about this sequence and not about the value of the optimization objective itself.

As we can see, for optimization problems, the optimal substructure property is manifested as a mathematical formula called the recurrence relation. It is the basis for the selection of a particular substructure among several variants that may be available at the current step of the algorithm. The relation involves an already memoized partial solution and the cost of the next part we consider adding to it. For text justification, it is the sum of the penalty of the line being added and the previously calculated penalty of the remaining suffix. Each DP optimization task is based on a recurrence relation of a similar kind.

Now, let's look at this problem from a different perspective. We can represent our decision space as a directed acyclic graph. Its leftmost node (the "source") will be index 0, and it will have several direct descendants: the nodes with those indices in the string at which we can potentially split it without exceeding the 50-character line limit or, alternatively, the ends of all the substrings that span from index 0 to the end of some token and are not longer than 50 characters. Next, we'll connect each descendant node, in a similar manner, with all the nodes that are "reachable" from it, i.e. that have a higher associated string position whose difference from this node's index is at most 50. The final node of the graph (the "sink") will have the value of the length of the string. The cost of each edge is the value of the penalty function. Now, the task is to find the shortest path from the source to the sink.

Here is the DAG for the example string, with the nodes labeled with the indices of the potential string splits. As you can see, even for such a simple string, it's already quite big, let alone real texts. But it can provide some sense of the number of variants that an algorithm has to evaluate.

What is the complexity of this algorithm? On the surface, it may seem to be O(m^2), where m is the token count, as there are two loops: over all the tokens and over the tail. However, the line (when (> len limit) (return)) limits the inner loop to only the part of the string that can fit into limit characters, effectively reducing it to a constant number of operations (not more than limit, and, in practice, an order of magnitude less). Thus, the actual complexity is O(m)[5].

Pathfinding Revisited

In fact, any DP problem may be reduced to pathfinding in a graph: the shortest path, if optimization is involved, or just any path otherwise. The nodes in this graph are the intermediate states (for instance, a split at index x or the i-th Fibonacci number), and the edges are the possible transitions that may bear an associated cost (as in text justification) or not (as in string segmentation). And the classic DP algorithm to solve this problem is called the Bellman-Ford algorithm. Not incidentally, one of its authors, Bellman, is the "official" inventor of DP.

(defun bf-shortest-path (g)
  (with ((n (array-dimension g 0))
         (edges (edges-table g))  ; the incoming edges keyed by their destination vertex
         (dists (make-array n :initial-element most-positive-fixnum))
         (backptrs (make-array n :initial-element nil))
         (path (list)))
    ;; the distance from the sink (the last vertex) to itself is 0
    (:= (? dists (1- n)) 0)
    ;; the classic Bellman-Ford iteration: relax all the edges n-1 times
    (dotimes (k (1- n))
      (dotimes (v n)
        (dolist (e (? edges v))
          (with ((u (src e))
                 (dist (+ (dist e)
                          (? dists v))))
            (when (< dist (? dists u))
              (:= (? dists u) dist
                  (? backptrs u) v))))))
    ;; restore the path from the source (vertex 0) to the sink
    (loop :for v := 0 :then (? backptrs v) :do
      (push v path)
      :until (= v (1- n)))
    (values (reverse path)
            (? dists 0))))

The code for the algorithm is very straightforward, provided that our graph representation already exposes the vertices and edges as data structures in a convenient format or implements such operations (in the worst case, obtaining them should cost no more than O(V+E)). For the edges, we need a kv structure indexed by the edge destination — the opposite of the usual representation that groups them by their sources[6].

Compared to text justification, this function looks simpler, as we don't have to perform the task-specific processing that accounts for the character limit and the spaces between words. However, if we were to use bf-shortest-path, we'd have to first create the graph data structure from the original text, so all that complexity would go into the graph creation routine. Still, from the architectural point of view, such a split may be beneficial, as the pathfinding procedure could be reused for other problems.

One might ask a reasonable question: how does Bellman-Ford fare against Dijkstra's algorithm (DA)? As we have already learned, Dijkstra's is a greedy and optimal solution to pathfinding, so why consider yet another approach? Both algorithms operate by relaxation, in which approximations to the correct distance are replaced by better ones until the final result is reached. And in both of them, the approximate distance to each vertex is always an overestimate of the true distance, and it is replaced by the minimum of its old value and the length of a newly found path. It turns out that DA is also a DP-based approach. But with additional optimizations! It uses the same optimal substructure property and recurrence relations. The advantage of DA is the utilization of a priority queue to efficiently select the closest vertex that has not yet been processed; then it performs the relaxation of all of its outgoing edges, while the Bellman-Ford algorithm relaxes all the edges. This method allows BF to calculate the shortest paths not to a single node but to all of them (which is also possible for DA, but that will make its runtime, basically, the same as for BF). So, Bellman-Ford complexity is O(V E), compared to O(E + V logV) for the optimal implementation of DA. Besides, BF can handle negative edge weights, which would break DA.

So, DA remains the algorithm of choice for the standard shortest path problem, and it's worth keeping in mind that it can also be applied as a solver for some DP problems if they are decomposed into graph construction + pathfinding. However, some DP problems have additional constraints that make using DA for them pointless. For example, in text justification, the number of edges to consider at each step is limited by a constant factor, so the complexity of the exhaustive search is, in fact, O(V). Proving that for our implementation of justify is left as an exercise to the reader...

LCS & Diff

Let's return to strings and the application of DP to them. The ultimate DP-related string problem is string alignment. It manifests in many formulations. The basic one is the Longest Common Subsequence (LCS) task: determine the length of the common part of two input strings. Solving it, however, provides enough data to go beyond that: it enables determining the best alignment of the strings, as well as enumerating the edit operations needed to transform one string into another. The edit operations usually considered in the context of LCS are:

  • insertion of a character
  • deletion of a character
  • substitution of a character

Based on the number of those operations, we can calculate a measure of the difference between two strings that is called the Levenshtein distance. It is one of the examples of the so-called edit distances. Identical strings have a Levenshtein distance of 0, and the strings foobar and baz — of 4 (3 deletion operations for the prefix foo and a substitution operation of r into z). There are also other variants of edit distances. For instance, the Damerau-Levenshtein distance, which is better suited to comparing texts with misspellings produced by humans, adds another modification operation: swap, which reduces the cost of two adjacent characters being swapped to 1, instead of 2 for the Levenshtein distance (1 deletion and 1 insertion).

The Levenshtein distance, basically, gives us the DP recurrence relations for free: when we consider the i-th character of the first string and the j-th one of the second, the edit distance between the prefixes 0..i and 0..j is either the same as for the pair of characters (1- i) and (1- j), if the current characters are the same, or 1 plus the minimum of the edit distances for the pairs i (1- j), (1- i) (1- j), and (1- i) j.

We can encode this calculation as a function that uses a matrix for memoization. Basically, this also gives us the DP solution to the LCS problem: the bottom-right element of the matrix holds the measure of the difference between the strings, from which (knowing the lengths of the strings) the size of their common part may also be derived.

(defun lev-dist (s1 s2 &optional 
                         (i1 (1- (length s1)))
                         (i2  (1- (length s2)))
                         (ld (make-array (list (1+ (length s1))
                                               (1+ (length s2)))
                                         :initial-element nil)
                             ldp))  ; a flag indicating that the argument was supplied
  ;; initialization of the 0-th column and row:
  ;; the distance between a prefix and the empty string is the prefix length
  (unless ldp
    (dotimes (k (1+ (length s1))) (:= (aref ld k 0) k))
    (dotimes (k (1+ (length s2))) (:= (aref ld 0 k) k)))
  (values (or (aref ld (1+ i1) (1+ i2))
              (:= (aref ld (1+ i1) (1+ i2))
                  (if (eql (? s1 i1) (? s2 i2))
                      (lev-dist s1 s2 (1- i1) (1- i2) ld)
                      (1+ (min (lev-dist s1 s2 (1- i1) (1- i2) ld)
                               (lev-dist s1 s2 i1 (1- i2) ld)
                               (lev-dist s1 s2 (1- i1) i2 ld))))))
          ld))

However, if we want to also use this information to align the sequences, we'll have to make a reverse pass (here, a separate backpointers array isn't necessary, as we can infer the direction by reversing the distance formula).

(defun align (s1 s2)
  (with ((i1 (length s1))
         (i2 (length s2))
         ;; our Levenshtein distance procedure returns the whole DP matrix
         ;; as a second value
         (ld (nth-value 1 (lev-dist s1 s2)))
         (rez (list)))
    (loop
      (let ((min (min (aref ld (1- i1) (1- i2))
                      (aref ld     i1  (1- i2))
                      (aref ld (1- i1)     i2))))
        (cond ((= min (aref ld (1- i1) (1- i2)))
               (push (pair (? s1 (1- i1)) (? s2 (1- i2)))
                     rez)
               (:- i1)
               (:- i2))
             ((= min (aref ld (1- i1) i2))
              (push (pair (? s1 (1- i1)) nil)
                    rez)
              (:- i1))
             ((= min (aref ld i1 (1- i2)))
              (push (pair nil (? s2 (1- i2)))
                    rez)
              (:- i2))))
      (when (= 0 i1)
        (loop :for j :from (1- i2) :downto 0 :do
          (push (pair nil (? s2 j)) rez))
        (return))
      (when (= 0 i2)
        (loop :for j :from (1- i1) :downto 0 :do
          (push (pair (? s1 j) nil) rez))
        (return)))
    ;; pretty output formatting
    (with-output-to-string (s1)
      (with-output-to-string (s2)
        (with-output-to-string (s3)
          (loop :for (c1 c2) :in rez :do
            (format s1 "~C " (or c1 #\.))
            (format s2 "~C " (cond ((null c1) #\↓)
                                  ((null c2) #\↑)
                                  ((char= c1 c2) #\|)
                                  (t #\x)))
            (format s3 "~C " (or c2 #\.)))
          (format t "~A~%~A~%~A~%"
                  (get-output-stream-string s1)
                  (get-output-stream-string s2)
                  (get-output-stream-string s3)))))
  rez))

CL-USER> (align "democracy" "remorse")
d e m o c r a c y
x | | | ↑ | ↑ x x
r e m o . r . s e

CL-USER> (lev-dist "democracy" "remorse")
5
#2A((0 1 2 3 4 5 6 7)
    (1 1 2 3 4 5 6 7)
    (2 2 1 2 3 4 5 6)
    (3 3 2 1 2 3 4 5)
    (4 4 3 2 1 2 3 4)
    (5 5 4 3 2 2 3 4)
    (6 5 5 4 3 2 3 4)
    (7 6 6 5 4 3 3 4)
    (8 7 7 6 5 4 4 4)
    (9 8 8 7 6 5 5 5))

It should be pretty clear how we can also extract the edit operations during the backward pass: depending on the direction of the movement, horizontal, vertical or diagonal, it's either an insertion, deletion or substitution. The same operations may be also grouped to reduce noise. The alignment task is an example of a 2d DP problem. Hence, the diff computation has a complexity of O(n^2). There are other notable algorithms, such as CYK parsing or the Viterbi algorithm, that also use a 2d array, although they may have higher complexity than just O(n^2). For instance, the CYK parsing is O(n^3), which is very slow compared to the greedy O(n) shift-reduce algorithm.

However, the diff we will obtain from the basic LCS computation will still be pretty rudimentary. There are many small improvements that production diff implementations make, both on the UX and performance sides. Besides, the complexity of the algorithm is O(n^2), which is quite high, so many practical variants perform additional optimizations to reduce the actual number of operations, at least for the common cases.

The simplest improvement is a preprocessing step that is warranted by the fact that, in many applications, the diff is performed on texts that are mostly identical and have a small number of differences localized in an even smaller number of places. For instance, consider source code management, where diff plays an essential role: programmers don't tend to rewrite whole files too often; on the contrary, such a practice is discouraged due to collaboration considerations.

So, some heuristics may be used in the library diff implementations to speed up such common cases (a sketch of the second one is given after the list):

  • check that the texts are identical
  • identify common prefix/suffix and perform the diff only on the remaining part
  • detect situations when there's just a single or two edits
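
For illustration, the prefix/suffix trimming heuristic might be sketched like this (the function name is made up); the expensive quadratic alignment then only has to deal with the middle parts:

(defun trim-common-affixes (s1 s2)
  (let* ((n1 (length s1))
         (n2 (length s2))
         ;; length of the common prefix
         (pre (or (mismatch s1 s2) (min n1 n2)))
         ;; length of the common suffix (not overlapping with the prefix)
         (suf (loop :for k :from 0
                    :while (and (< k (- n1 pre))
                                (< k (- n2 pre))
                                (char= (char s1 (- n1 k 1))
                                       (char s2 (- n2 k 1))))
                    :finally (return k))))
    (values (subseq s1 pre (- n1 suf))
            (subseq s2 pre (- n2 suf))
            pre)))

For example, for "the quick brown fox" and "the quick red fox" it leaves just "brown" and "red" to be diffed.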

A perfect diff algorithm will report the minimum number of edits required to convert one text into the other. However, sometimes the result is too perfect and not very good for human consumption. People expect the edited parts to be separated at token boundaries when possible; also, larger contiguous chunks are preferred to an alternation of many small changes. All these and other diff ergonomics issues may be addressed by various postprocessing tweaks.

But, besides these simple tricks, are global optimizations to the algorithm possible? After all, the O(n^2) space and time requirements are still pretty significant. Originally, diff was developed for Unix by Hunt and McIlroy. Their approach computes matches in the whole file and indexes them into the so-called k-candidates, k being the LCS length. The LCS is augmented progressively by finding matches that fall within proper ordinates (following a rule explained in their paper). While doing this, each path is memoized. The problem with this approach is that it performs more computation than necessary: it memoizes all the paths, which requires O(n^2) memory in the worst case and O(n^2 log n) time!

The current standard approach is the divide-and-conquer Myers algorithm. It works by recursively finding the central match of the two sequences, i.e. the one that belongs to the smallest edit script. Once this is done, only this match is memoized, and the two pairs of subsequences preceding and following it are compared again, recursively, by the same procedure until there is nothing more to compare. Finding the central match is done by matching the ends of subsequences as far as possible and, any time it is not possible, augmenting the edit script by 1 operation, scanning each furthest position attained up to there for each diagonal, and checking how far the match can expand. If two matches merge, the algorithm has just found the central match. This approach has the advantage of using only O(n) memory and executes in O(n d), where d is the length of the edit script (d is less than n, usually much less). The Myers algorithm wins because it doesn't memoize the paths while working and doesn't need to "foresee" where to go, so it can concentrate only on the furthest positions it can reach with an edit script of the smallest size. The smallest-size constraint ensures that what is found is the LCS. In a sense, the Myers algorithm relates to the vanilla DP diff like Dijkstra's algorithm relates to Bellman-Ford: it cuts down on the calculation of the edit distances between the substrings that don't contribute to the optimal alignment, while solving LCS by building the whole edit-distance matrix performs the computation for all substrings.

The diff tool is a prominent example of the transition from quite an abstract algorithm to a practical utility that is an essential part of many ubiquitous software products, and of the additional work needed to ensure that the final result is not only theoretically sound but also usable.

P.S. Ever wondered how github and other tools, when displaying the diff, not only show the changed line but also highlight the exact changes in the line? The answer is given in [7].

DP in Action: Backprop

As we said in the beginning, DP has applications in many areas: from Machine Learning to graphics to Source Code Management. Literally, you can find an algorithm that uses DP in every specialized domain, and if you don't — this, probably, means you can still advance that domain and create something useful by applying DP to it. Deep Learning has been the fastest-developing area of Machine Learning in recent years. At its core, the discipline is about training huge multilayer functions called "neural networks" by numeric optimization. And the principal approach to doing that, which, practically speaking, has enabled the rapid development of the machine learning techniques we see today, is the Backpropagation (backprop) optimization algorithm.

As pointed out by Christopher Olah, for modern neural networks, backprop can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That's the difference between a model taking a week to train and taking 200,000 years. Beyond its use in deep learning, backprop is a computational tool that may be applied in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names there. In fact, the algorithm has been reinvented at least dozens of times in different fields. The general, application-independent name for it is Reverse-Mode Differentiation. Essentially, it's a technique for calculating partial derivatives quickly, using DP on computational graphs.

Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression (:= e (* (+ a b) (1+ b))). There are four operations: two additions, one multiplication, and an assignment. Let's arrange those computations in the same way they would be performed on the computer:

(let ((c (+ a b))
      (d (1+ b)))
  (:= e (* c d)))

To create a computational graph, we make each of these operations, along with the input variables, into nodes. When the outcome of one expression is an input to another one, a link points from one node to another:

We can evaluate the expression by setting the values in the input nodes (a and b) to certain values and computing nodes in the graph along the dependency paths. For example, let's set a to 2 and b to 1: the result in node e will be, obviously, 6.

The derivatives in a computational graph can be thought of as edge labels. If a directly affects c, then we can write a partial derivative ∂c/∂a along the edge from a to c.

Here is the computational graph with all the derivatives for the evaluation with the values of a and b set to 2 and 1.

But what if we want to understand how nodes that aren't directly connected affect each other? Let's consider how e is affected by a. If we change a at a speed of 1, c also changes at a speed of 1. In turn, c changing at a speed of 1 causes e to change at a speed of 2. So e changes at a rate of (* 1 2) with respect to a. The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. We can see that this graph is, basically, the same kind of graph as the one we used to calculate the shortest path.

This is where Forward-mode differentiation and Reverse-mode differentiation come in. They're algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once. Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total derivative. Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node. Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output.

So, what if we do reverse-mode differentiation from e down? This gives us the derivative of e with respect to every node. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of the derivatives we need in one go. When training neural networks, the cost is a function of the weights of each edge. And using reverse-mode differentiation (aka backprop), we can calculate the derivatives of the cost with respect to all the weights in a single pass through the graph, and then feed them into gradient descent. As there are millions and tens of millions of weights, in a neural network, reverse-mode differentiation results in a speedup of the same factor!
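
To make this concrete, here is a sketch of both passes performed by hand for our toy expression (the variable names mimic the partial derivative notation; a real implementation would, of course, derive the propagation order from the graph instead of spelling it out):

(defun reverse-mode-example (a b)
  (let* (;; forward pass: compute the value of every node
         (c (+ a b))
         (d (1+ b))
         (e (* c d))
         ;; backward pass: propagate the derivatives from the output
         ;; towards the inputs, touching each edge exactly once
         (de/de 1)
         (de/dc (* de/de d))   ; e = c * d, so de/dc = d
         (de/dd (* de/de c))   ; e = c * d, so de/dd = c
         (de/da de/dc)         ; c = a + b, so dc/da = 1
         (de/db (+ de/dc       ; b affects e through both c and d,
                   de/dd)))    ; so the two paths are summed
    (values e de/da de/db)))

CL-USER> (reverse-mode-example 2 1)
6
2
5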

Backprop is an example of simple memoization DP. No selection of the best variant is needed; it's just a proper arrangement of the operations to avoid redundant computations.

Take-aways

DP-based algorithms may operate on one of these three levels:

  • just systematic memoization, when every intermediate result is cached and used to compute subsequent results for larger problems (Fibonacci numbers, backprop)
  • memoization + backpointers that allow for the reconstruction of the sequence of actions that led to the final solution (string segmentation)
  • memoization + backpointers + a target function that selects the best intermediate solution (text justification, diff, shortest path)

If we want to apply DP to some task, we need to find its optimal substructure: i.e. verify that an optimal solution to a subproblem will remain a part of the optimal solution to the whole problem. Next, if we deal with an optimization task, we may have to formulate the recurrence relations. After that, it's just a matter of technique: those relations may be either programmed directly as a recursive or iterative procedure (like in LCS) or indirectly, using the method of consecutive approximations (like in Bellman-Ford).

Ultimately, all DP problems may be reduced to pathfinding in the graph, but it doesn't always make sense to have this graph explicitly as a data structure in the program. If it does, however, remember that Dijkstra's algorithm is the optimal algorithm to find a single shortest path in it.

DP, usually, is a reasonable next thing to think about after the naive greedy approach (which, let's be frank, everyone tends to try initially) stumbles over backtracking. However, we saw that the DP and greedy approaches do not contradict each other: in fact, they can be combined, as demonstrated by Dijkstra's algorithm. Yet, an optimal greedy algorithm is more of an exception than a rule, although there is a number of problems for which a top-n greedy solution (the so-called Beam search) can be a near-optimal solution that is good enough.

Also, DP doesn't necessarily mean optimal. A vanilla dynamic programming algorithm exhaustively explores the decision space, which may be excessive in many cases. This is demonstrated by the examples of the Dijkstra's and Myers algorithms, which improve on the DP solution by cutting some corners.

P.S. We have also discussed, for the first time in this book, the value of heuristic pre- and postprocessing. From the theoretical standpoint, it is not something you have to pay attention to, but, in practice, it's a very important aspect of the production implementation of many algorithms and, thus, shouldn't be frowned upon or neglected. In an ideal world, an algorithmic procedure should have both optimal worst-case complexity and the fastest operation in the common cases.


[1] In case you're wondering about the single-letter entries: our dictionary is case-insensitive, so a and i stand for the words "A" and "I". Real English programmatic dictionaries also often contain s as a separate word because, when it's and similar contractions are tokenized, they're split into two tokens, and the apostrophe may sometimes be missing.

[2] The intuition for it is the following: in the worst case, every character has two choices: either to be the last letter of the previous word or the first one of the next word, hence the branching factor is 2.

[3] Actually, the condition of complete string coverage may be lifted, which will allow us to use almost the same algorithm but skip over out-of-dictionary words like misspellings.

[4] A space at the end of the line is discarded.

[5] Provided all the length calculations are implemented efficiently. For simplicity, I have used plain lists here with a linear length complexity, but a separate variable may be added to avoid the extra cost.

[6] However, if we think of it, we could reuse the already proven linked representation just putting the incoming edges into the node structure instead of the outgoing ones.

[7] It runs diff twice: first, at the line-level (using each line as a single unit/token) and then at the character level, as you would normally expect. Then the results are just combined.

2019-11-20

Programming Algorithms: Strings

It may not be immediately obvious why a whole chapter should be dedicated to strings. Aren't they just glorified arrays? There are several answers to this challenge:

  • indeed, strings are not just arrays, or rather, not only arrays: in different contexts, other representations, such as trees or complex combinations of arrays, may be used. And, besides, there are additional properties that are important for strings even when they are represented as arrays
  • there's a lot of string-specific algorithms that deserve their own chapter
  • finally, strings play a significant role in almost every program, so they have specific handling: in the OS, standard library, and even, sometimes, your application framework

In the base case, a string is, indeed, an array. As we already know, this array may either store its length or be a 0-terminated security catastrophe, like in C (see buffer overflow). So, to reiterate, strings should store their length. Netstrings are a notable take on the idea of length-aware strings: it's a simple external format that serializes a string as a tuple of length and contents, separated by a colon and ending with a comma: 3:foo, is the netstring for the string foo.
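
A toy netstring writer might look like this (the reader would do the reverse: parse the decimal length up to the colon, read exactly that many characters, and check the trailing comma):

(defun write-netstring (string &optional (out *standard-output*))
  (format out "~A:~A," (length string) string))

CL-USER> (with-output-to-string (out)
           (write-netstring "foo" out))
"3:foo,"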

More generally, a string is a sequence of characters. The characters themselves may be single bytes as well as fixed- or variable-length byte sequences. The latter kind of character encoding raises a challenging question: what to prefer, correctness or speed? With variable-length Unicode code points, the simplest and fastest string variant, a byte array, breaks, for it will incorrectly report its length (in bytes, not in characters) and fail to retrieve a character by index. Different language ecosystems address this issue differently, and the majority are, unfortunately, broken in one aspect or another. Overall, there are two possible solution paths. The first one is to use a fixed-length representation and pad shorter characters to full length. Generally, such a representation will be 32-bit UTF-32, resulting in up to 75% wasted storage space for the most common 1-byte ASCII characters. The alternative approach is to utilize a more advanced data structure. The naive variant is a list, which implies an unacceptable slowdown of the character access operation to O(n). Yet, a balanced approach may combine minimal additional space requirements with acceptable speed. One of the solutions is the classic bitmap trick: use a bit array indicating, for each byte, whether it's the start of a character (only a 12.5% overhead). Calculating the character position may be performed in a small number of steps with the help of an operation famous in narrow circles — Population count, aka Hamming weight. This hardware instruction calculates the number of 1-bits in an integer and is accessible via the logcount Lisp standard library routine. Behind the scenes, it is also used for bit arrays if you invoke count 1 on them. At least, this is the case for SBCL:

CL-USER> (disassemble (lambda (x) 
                        (declare (type (simple-array bit) x))
                        (count 1 x)))

; disassembly for (LAMBDA (X))
; Size: 267 bytes. Origin: #x100FC9FD1A
...
; DA2:       F3480FB8FA       POPCNT RDI, RDX

The implementation of the indexing function may be quite tricky, but the general idea is as follows: try to jump ahead n characters and calculate the popcount of the substring from the previous position to the current one, which will tell us the number of characters we have skipped. For the base case of a 1-byte string, we will get exactly where we wanted in just 1 jump and 1 popcount. However, if there were multibyte characters in the string, the first jump would have skipped fewer than n characters. If the difference is sufficiently small (say, below 10), we can just perform a quick linear scan of the remainder and find the position of the desired character. If it's larger than n/2, we can jump ahead n characters again (this will repeat at most 3 times, as the maximum byte-length of a character is 4), and if it's below n/2, we can jump n/2 characters. And if we overshoot, we can reverse the direction of the next jump or search. You can see where this is heading: if at each step (or, at least, at every 4th step) we are constantly halving our numbers, this means O(log n) complexity. That's the worst-case performance of this function, and it will very efficiently handle the cases when the character length doesn't vary: be it 1 byte — just 2 operations, or 4 bytes — 8 ops.

Here is a prototype of the char-index operation implemented according to the described algorithm (excluding the implementation of mb-linear-char-index that performs the final linear scan):

(defstruct (mb-string (:conc-name mbs-))
  bytes
  bitmap)

(defparameter *mb-threshold* 10)

(defun mb-char-index (string i)
  (let ((off 0))
    (loop
      (with ((cnt (count 1 (mbs-bitmap string) :start off :end (+ off i)))
             (diff (- i cnt)))
        (cond
          ((= cnt i) (return (+ off i)))
          ((< diff *mb-threshold*) (return (mb-linear-char-index
                                            string diff off)))
          ((< cnt (floor i 2)) (:+ off i)
                               (:- i cnt))
          (t (:+ off (floor i 2))
             (:- i cnt)))))))

The length of such a string may be calculated by performing popcount on the whole bitmap:

(defun mb-length (string)
  (count 1 (mbs-bitmap string)))

It's also worth taking into account that there exists a set of rules assembled under the umbrella of the Unicode collation algorithm that specifies how to order strings containing Unicode code-points.

Basic String-Related Optimizations

Strings are often subject to subsequencing, so an efficient implementation may use structure sharing. As we remember, in Lisp, this is accessible via the displaced arrays mechanism (and the convenience RUTILS function slice that we have already used in the code above). Yet, structure sharing should be utilized with care, as it opens the possibility of action-at-a-distance bugs if the derived string is modified, which results in a parallel modification of the original. Still, strings are rarely modified in-place, so, even in its basic form (without mandatory immutability), the approach works well. Moreover, some programming language environments make strings immutable by default. In such cases, to perform on-the-fly string modification (or, rather, creation), patterns such as the Java StringBuilder are used: the string is assembled from parts that are first accumulated and then, when necessary, concatenated into a single final string. An alternative approach is string formatting (the format function in Lisp), which is a higher-level interface that still needs to utilize some underlying mutation/combination mechanism.
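
Here is a quick illustration of both the sharing and the action-at-a-distance effect, using the displaced arrays mechanism directly:

(let* ((str (copy-seq "foobar"))
       (sub (make-array 3 :element-type 'character
                          :displaced-to str
                          :displaced-index-offset 3)))
  ;; sub shares storage with str, so modifying it also modifies the original
  (:= (char sub 0) #\z)
  (values sub str))  ; => "zar", "foozar"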

Another important string-related technology is interning. It is a space-saving measure that avoids duplicating the same strings over and over again by putting a string in a table and using its index afterwards. This approach also enables efficient equality comparison. Interning is performed by the compiler implicitly for all constant strings (in the special segment of the program's memory called the "string table"/sstab), and it may also be used explicitly. In Lisp, there's the standard function intern, for this: Lisp symbols use interned strings as their names. Another variant of interning is string pooling. The difference is that interning uses a global string table, while pools may be local.
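
A toy version of explicit interning may look as follows (the names are made up): strings are put into a table once, an index is handed out instead, and comparing two interned strings boils down to comparing two integers:

(let ((table (make-hash-table :test 'equal))
      (strings (make-array 0 :adjustable t :fill-pointer t)))
  (defun intern-string (str)
    (or (gethash str table)
        (:= (gethash str table)
            (vector-push-extend str strings))))
  (defun interned-string (idx)
    (aref strings idx)))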

Strings in the Editor

Now, let's consider situations in which representing strings as arrays doesn't work. The primary one is the editor, i.e. when constant random modification is the norm. There's another not-so-obvious requirement related to editing: handling potentially arbitrarily long strings that still need to be dynamically modified. Have you tried opening a hundred-megabyte text document in your favorite editor? You'd better not, unless you're a Vim user :) Finally, an additional challenge of handling strings in the editor arises when we allow concurrent modification. This we'll discuss in the chapter on concurrent algorithms.

So, why doesn't an array work well as a string backend in the editor? Because of the content relocation required by all edit operations. O(n) editing is, obviously, not acceptable. What to do? There are several more advanced approaches:

  1. The simplest change will be, once again, to use an array of arrays. For example, for each line. This will not change the general complexity of O(n) but, at least, will reduce n significantly. The issue is that, still, it will depend on the length of the line so, for not so rare degraded case when there are few or no linebreaks, the performance will seriously deteriorate. And, moreover, having observable performance differences between editing different paragraphs of the text is not user-friendly at all.
  2. A more advanced approach would be to use trees, reducing access time to O(log n). There are many different kinds of trees and, in fact, only a few may work as efficient string representations. Among them a popular data structure, for representing strings, is a Rope. It's a binary tree where each leaf holds a substring and its length, and each intermediate node further holds the sum of the lengths of all the leaves in its left subtree. It's a more-or-less classic application of binary trees to a storage problem so we won't spend more time on it here. Suffice to say that it has the expected binary-tree performance of O(log n) for all operations, provided that we keep it balanced. It's an ok alternative to a simple array, but, for such a specialized problem, we can do better with a custom solution.
  3. And the custom solution is to return to arrays. There's one clever way to use them that works very well for dynamic strings. It is called a Gap buffer. This structure is an array (buffer) with a gap in the middle. I.e., let's imagine that we have a text of n characters. The Gap buffer will have a length of n + k where k is the gap size — some value, derived from practice, that may fluctuate in the process of string modification. You can recognize this gap as the position of the cursor in the text. Insertion operation in the editor is performed exactly at this place, so it's O(1). Just, afterwards, the gap will shrink by 1 character, so we'll have to resize the array, at some point, if there are too many insertions and the gap shrinks below some minimum size (maybe, below 1). The deletion operation will act exactly the opposite by growing the gap at one of the sides. The Gap buffer is an approach that is especially suited for normal editing — a process that has its own pace. It also allows the system to represent multiple cursors by maintaining several gaps. Also, it may be a good idea to represent each paragraph as a gap buffer and use an array of them for the whole text. The gap buffer is a special case of the Zipper pattern that we'll discuss in the chapter on functional data structures.
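
To make the idea concrete, here is a minimal gap-buffer sketch; the structure and function names are mine, not a canonical implementation:

(defstruct (gap-buffer (:conc-name gb-))
  (chars (make-array 16 :element-type 'character :initial-element #\Space))
  (gap-start 0)
  (gap-end 16))

(defun gb-insert (buf char)
  "Insert CHAR at the cursor (the start of the gap) in O(1) amortized time."
  (when (= (gb-gap-start buf) (gb-gap-end buf))
    ;; the gap is exhausted: double the buffer and move the tail to its end
    (let* ((old (gb-chars buf))
           (len (length old))
           (new (make-array (* 2 len) :element-type 'character
                                      :initial-element #\Space)))
      (replace new old :end2 (gb-gap-start buf))
      (replace new old :start1 (+ len (gb-gap-end buf)) :start2 (gb-gap-end buf))
      (:= (gb-chars buf) new
          (gb-gap-end buf) (+ len (gb-gap-end buf)))))
  (:= (char (gb-chars buf) (gb-gap-start buf)) char)
  (:+ (gb-gap-start buf)))

(defun gb-delete (buf)
  "Delete the character before the cursor by growing the gap to the left."
  (when (> (gb-gap-start buf) 0)
    (:- (gb-gap-start buf))))

(defun gb-string (buf)
  "The text currently in the buffer (the gap excluded)."
  (concatenate 'string
               (subseq (gb-chars buf) 0 (gb-gap-start buf))
               (subseq (gb-chars buf) (gb-gap-end buf))))

Moving the cursor amounts to shifting characters across the gap, which costs O(distance) and so stays cheap for the local edits typical of interactive use.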

Substring Search

One of the most common string operations is substring search. For ordinary sequences we, usually, search for a single element, but strings, on the contrary, more often need subsequence search, which is more complex. A naive approach will start by looking for the first character, then trying to match the next character and the next, until either something ends or there's a mismatch. Unlike with hash-tables, the Lisp standard library has good support for string processing, including such operations as search (which, actually, operates on any sequence type) and mismatch that compares two strings from a chosen side and returns the position at which they start to diverge.

If we were to implement our own string-specific search, the most basic version would, probably, look like this:

(defun naive-match (pat str)
  (dotimes (i (- (1+ (length str)) (length pat)))
    (let ((mis (mismatch pat (slice str i))))
      ;; MISMATCH returns NIL when the pattern matches the whole remaining string
      (when (or (null mis)
                (= mis (length pat)))
        (return-from naive-match i)))))

If the strings had been random, the probability that we are correctly matching each subsequent character would have dropped to 0 very fast. Even if we consider just the English alphabet, the probability of the first character being the same in 2 random strings is 1/26, the first and second — 1/676, and so on. And if we assume that the whole charset may be used, we'll have to substitute 26 with 256 or a greater value. So, in theory, such naive approach has almost O(n) complexity, where n is the length of the string. Yet, the worst case has O(n * m), where m is the length of the pattern. Why? If we try to match a pattern a..ab against a string aa.....ab, at each position, we'll have to check the whole pattern until the last character mismatches. This may seem like an artificial example and, indeed, it rarely occurs. But, still, real-world strings are not so random and are much closer to the uniform corner case than to the random one. So, researchers have come up with a number of ways to improve subsequence matching performance. Those include the four well-known inventor-glorifying substring search algorithms: Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp, and Aho-Corasick. Let's discuss each one of them and try to determine their interesting properties.

KMP

Knuth-Morris-Pratt is the most basic of these algorithms. Prior to performing the search, it examines the pattern to find repeated subsequences in it and creates a table containing, for each character of the pattern, the length of the prefix of the pattern that can be skipped if we have reached this character and failed the search at it. This table is also called the "failure function". The number in the table is calculated as the length of the proper suffix[1] of the pattern substring ending before the current character that matches the start of the pattern.

I'll repeat here the example provided in Wikipedia that explains the details of the table-building algorithm, as it's somewhat tricky.

Let's build the table for the pattern abcdabd. We set the table entry for the first char a to -1. To find the entry for b, we must discover a proper suffix of a which is also a prefix of the pattern. But there are no proper suffixes of a, so we set this entry to 0. To find the entry with index 2, we see that the substring ab has a proper suffix b. However b is not a prefix of the pattern. Therefore, we also set this entry to 0.

For the next entry, we first check the proper suffix of length 1, and it fails like in the previous case. Should we also check longer suffixes? No. We can formulate a shortcut rule: at each stage, we need to consider checking suffixes of a given size (1+ n) only if a valid suffix of size n was found at the previous stage and should not bother to check longer lengths. So we set the table entry for c to 0 also.

We pass to the subsequent character a. The same logic shows that the longest substring we need to consider has length 1, and as in the previous case it fails since d is not a prefix. But instead of setting the table entry to 0, we can do better by noting that a is also the first character of the pattern, and also that the corresponding character of the string can't be a (as we're calculating for the mismatch case). Thus there is no point in trying to match the pattern for this character again — we should begin 1 character ahead. This means that we may shift the pattern by match length plus one character, so we set the table entry to -1.

Considering now the next character b: though by inspection the longest substring would appear to be a, we still set the table entry to 0. The reasoning is similar to the previous case. b itself extends the prefix match begun with a, and we can assume that the corresponding character in the string is not b. So backtracking before it is pointless, but that character may still be a, hence we set the entry not to -1, but to 0, which means shifting the pattern by 1 character to the left and trying to match again.

Finally, for the last character d, the rule of the proper suffix matching the prefix applies, so we set the table entry to 2.

The resulting table is:

    a    b   c   d   a   b   d   
   -1    0   0   0  -1   0   2

Here's the implementation of the table-building routine:

(defun kmp-table (pat)
  (let ((rez (make-array (length pat)))
        (i 0))  ; prefix length
    (:= (? rez 0) -1)
    (loop :for j :from 1 :below (length pat) :do
      (if (char= (char pat i) (char pat j))
          (:= (? rez j) (? rez i))
          (progn (:= (? rez j) i
                     i (? rez i))
                 (loop :while (and (>= i 0)
                                   (not (char= (char pat i) (char pat j))))
                       :do (:= i (? rez i)))))
      (:+ i))
    rez))
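
A quick sanity check against the table we built by hand:

CL-USER> (kmp-table "abcdabd")
#(-1 0 0 0 -1 0 2)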

It can be proven that it runs in O(m). We won't show it here, so coming up with proper calculations is left as an exercise to the reader.

Now, the question is, how shall we use this table? Let's look at the code:

(defun kmp-match (pat str)
  (let ((s 0)
        (p 0)
        (ff (kmp-table pat)))
    (loop :while (< s (length str)) :do
      (if (char= (char pat p) (char str s))
          ;; if the current characters of the pattern and string match
          (if (= (1+ p) (length pat))
              ;; if we reached the end of the pattern - success
              (return (- s p))
              ;; otherwise, match the subsequent characters
              (:= p (1+ p)
                  s (1+ s)))
          ;; if the characters don't match
          (if (= -1 (? ff p))
              ;; shift the pattern for the whole length 
              (:= p 0
                  ;; and skip to the next char in the string
                  s (1+ s))
              ;; try matching the current char again,
              ;; shifting the pattern to align the prefix
              ;; with the already matched part
              (:= p (? ff p)))))))

As we see, the index in the string (s) is incremented at each iteration except when we follow the failure function, i.e. when the table entry is not -1. In the latter case, we may examine the same character more than once but not more than we have advanced in the pattern. And each advancement in the pattern meant the same advancement in the string (as a match is required for the advancement). In other words, we can backtrack not more than n times over the whole algorithm runtime, so the worst-case number of operations in kmp-match is 2n, while the best-case is just n. Thus, the total complexity is O(n + m).

And what will happen in our aa..ab example? The failure function for it will look like the following: -1 -1 -1 -1 (- m 2). Once we reach the first mismatch, we'll need to backtrack by 1 character, perform the comparison, which will mismatch, advance by 1 character (to b), mismatch again, again backtrack by 1 character, and so on until the end of the string. So, this case will have almost the abovementioned 2n runtime.
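
And a quick check of the matcher itself:

CL-USER> (kmp-match "abcdabd" "xxabcdabdyy")
2
CL-USER> (kmp-match "abd" "abc")
NIL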

To conclude, the optimization of KMP lies in excluding unnecessary repetition of the same operations by memoizing the results of partial computations — both in table-building and matching parts. The next chapter of the book will be almost exclusively dedicated to studying this approach in algorithm design.

BM

Boyer-Moore algorithm is conceptually similar to KMP, but it matches from the end of the pattern. It also builds a table, or rather three tables, but using a different set of rules, which also involve the characters in the string we search. More precisely, there are two basic rules instead of one for KMP. Besides, there's another rule, called the Galil rule, that is required to ensure the linear complexity of the algorithm. Overall, BM is pretty complex in the implementation details and also requires more preprocessing than KMP, so its utility outweighs these factors only when the search is repeated multiple times for the same pattern.

Overall, BM may be faster with normal text (and the longer the pattern, the faster), while KMP will work the best with strings that have a short alphabet (like DNA). However, I would choose KMP as the default due to its relative simplicity and much better space utilization.

RK

Now, let's talk about alternative approaches that rely on techniques other than pattern preprocessing. They are usually used to find matches of multiple patterns in one go as, for the single-pattern case, their performance will be worse than that of the previous algorithms.

Rabin-Karp algorithm uses an idea of the Rolling hash. It is a hash function that can be calculated incrementally. The RK hash is calculated for each substring of the length of the pattern. If we were to calculate a normal hash function like fnv-1, we'd need to use each character for the calculation — resulting in O(n * m) complexity of the whole procedure. The rolling hash is different as it requires, at each step of the algorithm, to perform just 2 operations: as the "sliding window" moves over the string, subtract the part of the hash corresponding to the character that is no longer part of the substring and add the new value for the character that has just become the part of the substring.

Here is the skeleton of the RK algorithm:

(defun rk-match (pat str)
  (let ((len (length pat))
        (phash (rk-hash pat)))
    (loop :for i :from len :to (length str)
          :for beg := (- i len)
          :for shash := (rk-hash (slice str 0 len))
            :then (rk-rehash shash len (char str (1- beg)) (char str (1- i)))
          :when (and (= phash shash)
                     (string= pat (slice str beg i)))
          :collect beg)))

A trivial rk-hash function would be just:

(defun rk-hash (str)
  (loop :for ch :across str :sum (char-code ch)))

But it is, obviously, not a good hash-function as it doesn't ensure an even distribution of hashes. Besides, in this case, we need a hash-function whose incremental update can also be reversed, i.e. the contribution of the outgoing character subtracted out. Usually, such hashes add position information into the mix. The original hash-function for the RK algorithm is the Rabin fingerprint that uses random irreducible polynomials over Galois fields of order 2. The mathematical background needed to explain it is somewhat beyond the scope of this book. However, there are simpler alternatives such as the following:

(defun rk-hash (str)
  (assert (> (length str) 0))
  (let ((rez (char-code (char str 0))))
    (loop :for ch :across (slice str 1) :do
      (:= rez (+ (rem (* rez 256) 101)
                 (char-code ch))))
    (rem rez 101)))

Its basic idea is to treat the characters of the string as the coefficients of a polynomial in base 256, reduced modulo the prime 101.

The implementation of rk-rehash for this function will look like this:

(defun rk-rehash (hash len ch1 ch2)
  (rem (+ (* (+ hash 101
                (- (rem (* (char-code ch1)
                           (expt 256 (1- len)))
                        101)))
             256)
          (char-code ch2))
       101))
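
A quick REPL check that rehashing agrees with hashing from scratch (the concrete values depend on the base 256 and modulus 101 used above):

CL-USER> (rk-hash "ab")
84
CL-USER> (rk-rehash (rk-hash "ab") 2 #\a #\c)  ; slide the window from "ab" to "bc"
38
CL-USER> (rk-hash "bc")
38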

Our rk-match could be used to find many matches of a single pattern. To adapt it for operating on multiple patterns at once, we'll just need to pre-calculate the hashes for all patterns and lookup the current rk-hash value in this set. Additional optimization of this lookup may be performed with the help of a Bloom filter — a stochastic data structure we'll discuss in more detail later.
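
Here is a hedged sketch of such an adaptation; all patterns are assumed to have the same length, and rk-match-multiple is not from the original code:

(defun rk-match-multiple (pats str)
  "Find the positions of all matches of the equal-length patterns PATS in STR."
  (let ((len (length (first pats)))
        (phashes (make-hash-table)))
    ;; pre-calculate the hashes of all the patterns
    (dolist (pat pats)
      (push pat (gethash (rk-hash pat) phashes)))
    (loop :for i :from len :to (length str)
          :for beg := (- i len)
          :for shash := (rk-hash (slice str 0 len))
            :then (rk-rehash shash len (char str (1- beg)) (char str (1- i)))
          :when (member (slice str beg i) (gethash shash phashes)
                        :test 'string=)
          :collect beg)))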

Finally, it's worth noting that there are other similar approaches to the rolling hash concept that trade some of the uniqueness properties of the hash function for the ability to produce hashes incrementally or have similar hashes for similar sequences. For instance, the Perceptual hash (phash) is used to find near-match images.

AC

Aho-Corasick is another algorithm that allows matching multiple strings at once. The preprocessing step of the algorithm constructs a Finite-State Machine (FSM) that resembles a trie with additional links between the various internal nodes. The FSM is a graph data structure that encodes possible states of the system and actions needed to transfer it from one state to the other.

The AC FSM is constructed in the following manner:

  1. Build a trie of all the words in the set of search patterns (the search dictionary). This trie represents the possible flows of the program when there's a successful character match at the current position. Add a loop edge for the root node.
  2. Add backlinks transforming the trie into a graph. The backlinks are used when a failed match occurs. They point either to the root of the trie or, if some prefix of a dictionary word coincides with a suffix of the currently matched path, to the node at the end of the longest such prefix. The longest prefix is found using BFS of the trie. This approach is, basically, the same idea used in KMP and BM to avoid reexamining the already matched parts. So backlinks to the previous parts of the same word are also possible.

Here is the example FSM for the search dictionary '("the" "this" "that" "it" "his"):

Basically, it's just a trie with some backlinks to account for already processed prefixes. One more detail missing for this graph to be a complete FSM is an implicit backlink to the root node from all the nodes that don't have an explicit one.

The main loop of the algorithm is rather straightforward: examine each character and then:

  • either follow one of the transitions (direct edge) if the character of the edge matches
  • or follow the backlink if it exists
  • or reset the FSM state — go to root
  • if the transition leads us to a terminal node, record the match(es) and return to the root as well

As we see from the description, the complexity of the main loop is linear in the length of the string: for each character, an amortized constant number of transitions is performed. The FSM construction is also linear in the total length of all the words in the search dictionary.
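
Here is a hedged sketch of the main loop. The node structure and the construction of the trie with its backlinks are assumed and omitted, and each node's output list is assumed to already aggregate the words reachable via its backlinks:

(defstruct ac-node
  (children (make-hash-table))  ; char -> ac-node
  fail                          ; the backlink
  output)                       ; dictionary words ending at this node

(defun ac-match (root str)
  "Collect (position . word) pairs for every dictionary word occurring in STR."
  (let ((state root)
        (matches (list)))
    (dotimes (i (length str) (reverse matches))
      (let ((char (char str i)))
        ;; follow backlinks until a transition on CHAR exists or we are at the root
        (loop :until (or (gethash char (ac-node-children state))
                         (eql state root))
              :do (:= state (ac-node-fail state)))
        ;; take the transition if there is one, otherwise stay at the root
        (:= state (or (gethash char (ac-node-children state)) root))
        (dolist (word (ac-node-output state))
          (push (cons (- i (1- (length word))) word) matches))))))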

The algorithm is often used in antivirus software to perform an efficient search for code signatures against a database of known viruses. It also formed the basis of the original Unix command fgrep. And, from my point of view, it's the simplest to understand yet pretty powerful and versatile substring search algorithm that may be a default choice if you ever have to implement one yourself.

Regular Expressions

Searching is, probably, the most important advanced string operation. Besides, it is not limited to mere substring search — matching of more complex patterns is in even higher demand. These patterns, which are called "regular expressions" or, simply, regexes, may include optional characters, repetition, alternatives, backreferences, etc. Regexes play an important role in the history of the Unix command-line, being the principal technology of the famous grep utility, and then the cornerstone of Perl. All modern programming languages support them either in the standard library or, as in Lisp's case, with high-quality third-party addons (cl-ppcre).

One of my favorite programming books, "Beautiful Code", has a chapter by Brian Kernighan on implementing simple regex matching, with code written by Rob Pike. It shows how easy it is to perform basic matching of the following patterns:

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

Below the C code from the book is translated into an equivalent Lisp version:

(defun match (regex text)
  "Search for REGEX anywhere in TEXT."
  (if (starts-with "^" regex)  ; STARTS-WITH is from RUTILS
      (match-here (slice regex 1) text)
      (dotimes (i (1+ (length text)))  ; also try the empty tail (e.g. for "a*")
        (when (match-here regex (slice text i))
          (return t)))))

(defun match-here (regex text)
  "Search for REGEX at beginning of TEXT."
  (cond ((= 0 (length regex))
         t)
        ((and (> (length regex) 1)
              (char= #\* (char regex 1)))
         (match-star (char regex 0) (slice regex 2) text))
        ((string= "$" regex)
         (= 0 (length text)))
        ((and (> (length text) 0)
              (member (char regex 0) (list #\. (char text 0))))
         (match-here (slice regex 1) (slice text 1)))))

(defun match-star (c regex text)
  "Search for C*REGEX at beginning of TEXT."
  (loop
    (when (match-here regex text) (return t))
    ;; consume one more character matched by C, if possible
    (unless (and (> (length text) 0)
                 (member c (list #\. (char text 0))))
      (return))
    (:= text (slice text 1))))

This is a greedy linear algorithm. However, modern regexes are much more advanced than this naive version. They include such features as register groups (to record the spans of text that match a particular subpattern), backreferences, non-greedy repetition, and so on and so forth. Implementing those will require changing the simple linear algorithm to a backtracking one. And incorporating all of them would quickly transform the code above into a horrible unmaintainable mess: not even due to the number of cases that have to be supported but due to the need to account for the complex interdependencies between them.

And, what's worse, soon there will arise a need to resort to backtracking. Yet, a backtracking approach has a critical performance flaw: potential exponential runtime for certain input patterns. For instance, the Perl regex engine (PCRE) requires over sixty seconds to match a 30-character string aa..a against the pattern a?{15}a{15} (on standard hardware). The alternative approach, which we'll discuss next, requires just twenty microseconds — a million times faster. And it handles a 100-character string of a similar kind in under 200 microseconds, while Perl would require over 10^15 years.[2]

This issue is quite severe and has even prompted Google to release their own regex library with strict linear performance guarantees — RE2. The goal of the library is not to be faster than all other engines under all circumstances. Although RE2 guarantees linear-time performance, the linear-time constant varies depending on the overhead entailed by its way of handling of the regular expression. In a sense, RE2 behaves pessimistically whereas backtracking engines behave optimistically, so it can be outperformed in various situations. Also, its goal is not to implement all of the features offered by PCRE and other engines. As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported.

The figures above are taken from a seminal article by Russ Cox. He goes on to add:

Historically, regular expressions are one of computer science's shining examples of how using good theory leads to good programs. They were originally developed by theorists as a simple computational model, but Ken Thompson introduced them to programmers in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex. Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools.

The linear-time approach to regex matching relies on a technic similar to the one used in the Aho-Corasick algorithm — the FSM. Actually, if by regular expressions we mean a set of languages that abide by the rules of the regular grammars in the Chomsky hierarchy of languages, the FSM is their exact theoretical computation model. Here is how an FSM for a simple regex a*b$ might look:

Such an FSM is called an NFA (Nondeterministic Finite Automaton) as some states have more than one alternative successor. Another type of automata is the DFA (Deterministic Finite Automaton), which permits at most one transition for each state and input character. The method to transform the regex into an NFA is called the Thompson's construction. And an NFA can be made into a DFA by the Powerset construction and then be minimized to get an optimal automaton. DFAs are more efficient to execute than NFAs, because DFAs are only ever in one state at a time: they never have a choice of multiple next states. But the construction takes additional time. Anyway, both NFAs and DFAs guarantee linear-time execution.

The Thompson's algorithm builds the NFA up from partial NFAs for each subexpression, with a different construction for each operator. The partial NFAs have no matching states: instead, they have one or more dangling arrows, pointing to nothing. The construction process will finish by connecting these arrows to a matching state.

  • The NFA for matching a single character e is a single node with a slot for an incoming arrow and a pending outgoing arrow labeled with e.
  • The NFA for the concatenation e1e2 connects the outgoing arrow of the e1 machine to the incoming arrow of the e2 machine.
  • The NFA for the alternation e1|e2 adds a new start state with a choice of either the e1 machine or the e2 machine.
  • The NFA for e? alternates the e machine with an empty path.
  • The NFA for e* uses the same alternation but loops a matching e machine back to the start.
  • The NFA for e+ also creates a loop, but one that requires passing through e at least once.

Counting the states in the above constructions, we can see that this technic creates exactly one state per character or metacharacter in the regular expression. The only exception is the constructs c{n} or c{n,m} which require duplicating the single-character automaton n or m times respectively, but that is still a constant number. Therefore the number of states in the final NFA is at most equal to the length of the original regular expression plus some constant.

Implementation of the Thompson's Construction

The core of the algorithm could be implemented very transparently with the help of the Lisp generic functions. However, to enable their application, we'd first need to transform the raw expression into a sexp (tree-based) form. Such representation is supported, for example, in the cl-ppcre library:

PPCRE> (parse-string "ab[0-9]+c$")
(:SEQUENCE "ab" (:GREEDY-REPETITION 1 NIL (:RANGE #\0 #\9)) #\c :END-ANCHOR)

Parsing is a whole separate topic that will be discussed next. But once we have performed it, we gain the possibility to straightforwardly implement the Thompson's construction by traversing the parse tree and emitting, for each state, the corresponding part of the automaton. The Lisp generic functions are a great tool for implementing such a transformation as they allow us to define methods that are selected based on either the type or the identity of the arguments. And those methods can be added independently, so the implementation is clear and extensible. We will define 2 generic functions: one to emit the automaton fragment (th-part) and another to help in transition selection (th-match).

First, let's define the state node of the FSM. We will use a linked graph representation for the automaton. So, a variable for the FSM in the code will point to its start node, and it will, in turn, reference the other nodes. There will also be a special node that will be responsible for recording the matches (*matched-state*).

(defstruct th-state
  transitions)

(defparameter *initial-state* nil)
(defparameter *matched-state* (make-th-state))

(defun th-state (&rest transitions)
  "A small convenience function to construct TH-STATE structs."
  (make-th-state :transitions (loop :for (cond state) :on transitions :by 'cddr
                                    :collect (pair cond state))))

And now, we can define the generic function that will emit the nodes:

(define-condition check-start-anchor () ())

(defgeneric th-part (next-state kind &rest args)
  (:documentation
   "Emit the TH-STATE structure of a certain KIND
    (which may be a keyword or a raw string) using the other ARGS
    and pointing to NEXT-STATE struct.")
  (:method (next-state (kind (eql :sequence)) &rest args)
    (apply 'th-part (if (rest args)
                        ;; the machine for the rest of the sequence points to NEXT-STATE
                        (apply 'th-part next-state :sequence (rest args))
                        next-state)
           ;; MKLIST (from RUTILS) wraps atomic elements (characters, strings)
           ;; so that APPLY can spread them into the KIND position
           (mklist (first args))))
  (:method (next-state (kind (eql :greedy-repetition)) &rest args)
    ;; this method can handle *, +, {n}, and {n,m} regex modifiers
    ;; in any case, there's a prefix sequence of fixed nonnegative length
    ;; of identical elements that should unconditionally match
    ;; followed by a bounded or unbounded sequence that,
    ;; in case of a failed match, transitions to the next state
    (apply 'th-part
           (let ((*initial-state* next-state))
             (apply 'th-part next-state :sequence
                    (loop :repeat (or (second args) 1)
                          :collect (mklist (third args)))))
           :sequence (loop :repeat (first args)
                           :collect (mklist (third args)))))
  (:method (next-state (kind character) &rest args)
    (th-state kind next-state
              ;; Usually, *initial-state* will be nil, i.e. further computations
              ;; along this path will be aborted, but for some variants (? or *)
              ;; they will just continue normally to the next state.
              ;; The special variable allows us to control this as you can see in
              ;; the method for :greedy-repetition 
              t *initial-state*))
  (:method (next-state (kind (eql :end-anchor)) &rest args)
    (th-state nil *matched-state*
              t *initial-state*))
  (:method (next-state (kind (eql :start-anchor)) &rest args)
    ;; This part is unique in that all the other parts consume the next character
    ;; (we're not implementing lookahead here), but this one shouldn't.
    ;; To implement such behavior without the additional complexity created by passing
    ;; the string being searched to this function (which we'll still, probably, need to do
    ;; later on, but were able to avoid so far), we can resort to a cool Lisp technique
    ;; of signaling a condition that can be handled specially in the top-level code
    (signal 'check-start-anchor)
    next-state))

Here, we have defined some of the methods of th-part that specialize for the basic :sequence of expressions, :greedy-repetition (regex * and +), a single character, and the single symbols :start-anchor/:end-anchor (regexes ^ and $). As you can see, some of them dispatch on (i.e. are chosen based on) the identity of the first argument (using eql specializers), while the character-related method specializes on the class of the arg. As we develop this facility, we could add more methods with defmethod. Running th-part on the whole parse tree will produce the complete automaton; we don't need to do anything else!
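
For instance, gluing it together with the cl-ppcre parser might look like this. This is a hedged guess at the entry point (thompson-nfa is my name; the original doesn't show this step), assuming parse-string is accessible from the ppcre package as in the transcript above:

(defun thompson-nfa (regex)
  "Build the NFA for REGEX by running TH-PART over its parse tree,
   pointing the final state at *MATCHED-STATE*."
  (apply 'th-part *matched-state* (mklist (ppcre:parse-string regex))))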

To use the constructed FSM, we run it with the string as input. NFAs are endowed with the ability to guess perfectly when faced with a choice of next state: to run the NFA on a real computer, we must find a way to simulate this guessing. One way to do that is to guess one option, and if that doesn't work, try the other. A more efficient way to simulate perfect guessing is to follow all admissible paths simultaneously. In this approach, the simulation allows the machine to be in multiple states at once. To process each letter, it advances all the states along all the arrows that match the letter. In the worst case, the NFA might be in every state at each step, but this results in at worst a constant amount of work independent of the length of the string, so arbitrarily large input strings can be processed in linear time. The efficiency comes from tracking the set of reachable states but not which paths were used to reach them. In an NFA with n nodes, there can only be n reachable states at any step.

(defun run-nfa (nfa str)
  (let ((i 0)
        (start 0)
        (matches (list))
        (states (list nfa)))
    ;; this is the counterpart for the start-anchor signal
    (handler-bind ((check-start-anchor
                     ;; there's no sense to proceed matching a ^... regex
                     ;; if the string is not at its start
                     (lambda (c) (when (> i 0) (return-from run-nfa)))))
      (dovec (char (concatenate 'vector str
                                #(nil)))  ; for handling end-anchor 
        (let ((new-states (list)))
          (dolist (state states)
            (dolist (tr (? state 'transitions))
              (when (th-match tr char)
                ;; CASE would compare against literal symbols, so we use COND
                ;; to check against the actual target state of the transition
                (cond ((eql (rt tr) *matched-state*) (push start matches))
                      ((null (rt tr)))  ; a dead-end transition - ignore it
                      (t (pushnew (rt tr) new-states)))
                (return))))
           (if new-states
               (:= states new-states)
               (:= states (list nfa)
                   start nil)))
        (:+ i)
        (unless start (:= start i))))
    matches))

The th-match function may have methods to match a single char and a character range, as well as a particular predicate. Its implementation is trivial and left as an exercise to the reader.
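
For reference, a minimal version might look as follows, assuming (as in th-state above) that a transition is a pair of a condition and a target state, with the RUTILS lt accessor complementing the rt used in run-nfa:

(defgeneric th-match (tr char)
  (:documentation "Decide whether the transition TR accepts CHAR.")
  (:method ((tr list) char)
    ;; unpack the (condition state) pair and dispatch on the condition
    (th-match (lt tr) char))
  (:method ((tr character) char)
    (and char (char= tr char)))
  (:method ((tr (eql t)) char)
    ;; an unconditional transition (used by the ? and * variants)
    t)
  (:method ((tr null) char)
    ;; the end-anchor transition: only the NIL sentinel appended by RUN-NFA matches
    (null char)))

Character ranges and predicates from richer parse trees could be added as further methods in the same manner.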

Overall, interpreting an automaton is a simple and robust approach, yet if we want to squeeze all the possible performance, we can compile it directly to machine code. This is easier to do with a DFA, as each of its states has at most one transition for any given input character, so the automaton can be compiled into nested conditionals or even a jump-table.
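
To illustrate what "compiling" means here, this hard-coded toy DFA matches a whole string against a*b using nothing but conditionals (purely an illustration, not generated code):

(defun dfa-match-a*b (str)
  "A hand-compiled DFA that matches the whole STR against a*b."
  (let ((state :as))  ; :as - reading a's, :done - the final b has been read
    (dotimes (i (length str) (eql state :done))
      (let ((char (char str i)))
        (ecase state
          (:as (cond ((char= char #\a))              ; stay in :as
                     ((char= char #\b) (:= state :done))
                     (t (return nil))))
          (:done (return nil)))))))                  ; anything after b fails

CL-USER> (dfa-match-a*b "aaab")
T
CL-USER> (dfa-match-a*b "aabx")
NIL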

Grammars

Regexes are called "regular" for a reason: there's a corresponding mathematical formalism "regular languages" that originates from the hierarchy of grammars compiled by Noam Chomsky. This hierarchy has 4 levels, each one allowing strictly more complex languages to be expressed with it. And for each level, there's an equivalent computation model:

  • Type-0: recursively-enumerable (or universal) grammars — Turing machine
  • Type-1: context-dependent (or context-sensitive) grammars — a linear bounded automaton
  • Type-2: context-free grammars — pushdown automaton
  • Type-3: regular grammars — FSM

We have already discussed the bottom layer of the hierarchy. Regular languages are the most limited (and thus the simplest to implement): for example, you can write a regex a{15}b{15}, but you won't be able to express a{n}b{n} for an arbitrary n, i.e. ensure that b is repeated the same number of times as a. The top layer corresponds to all programs and so all the programming science and lore, in general, is applicable to it. Now, let's talk about context-free grammars which are another type that is heavily used in practice and even has a dedicated set of algorithms. Such grammars can be used not only for simple matching but also for parsing and generation. Parsing, as we have seen above, is the process of transforming a text that is assumed to follow the rules of a certain grammar into the structured form that corresponds to the particular rules that can be applied to this text. And generation is the reverse process: applying the rules to obtain the text. This topic is huge and there's a lot of literature on it including the famous Dragon Book.

Parsing is used for processing both artificial (including programming) and natural languages. And, although different sets of rules may be used, as well as different approaches for selecting a particular rule, the resulting structure will be a tree. In fact, formally, each grammar consists of 4 items:

  • The set of terminals (leaves of the parse tree) or tokens of the text: these could be words or characters for the natural language; keywords, identifiers, and literals for the programming language; etc.
  • The set of nonterminals — symbols used to name different items in the rules and in the resulting parse tree — the non-leaf nodes of the tree. These symbols are abstract and not encountered in the actual text. The examples of nonterminals could be VB (verb) or NP (noun phrase) in natural language parsing, and if-section or template-argument in parsing of C++ code.
  • The root symbol (which should be one of the nonterminals).
  • The set of production rules that have two sides: a left-hand (lhs) and a right-hand (rhs) one. In the left-hand side, there should be at least one nonterminal, which is substituted with a number of other terminals or nonterminals in the right-hand side. During generation, the rule allows the algorithm to select a particular surface form for an abstract nonterminal (for example, turn a nonterminal VB into a word do). During parsing, which is a reverse process, it allows the program, when it's looking at a particular substring, to replace it with a nonterminal and expand the tree structure. When the parsing process, by performing such substitutions and expansions, finally reaches the root symbol, it is considered terminated.

Each compiler has to use parsing as a step in transforming the source into executable code. Also, parsing may be applied for any data format (for instance, JSON) to transform it into machine data. In natural language processing, parsing is used to build the various tree representations of the sentence, which encode linguistic rules and structure.

There are many different types of parsers that differ in the additional constraints they impose on the structure of the production rules of the grammar. The generic context-free constraint is that in each production rule the left-hand side may only be a single nonterminal. The most widespread varieties of context-free grammars are LL(k) (in particular, LL(1)) and LR (LR(1), SLR, LALR, GLR, etc). For example, an LL(1) parser (one of the easiest to build) parses the input from left to right, performing the leftmost derivation of the sentence, and is allowed to look ahead at most 1 token. Not all combinations of derivation rules allow the algorithm to build a parser that will be able to perform unambiguous rule selection under such constraints. But, as LL(1) parsing is simple and efficient, some authors of grammars specifically target their language to be LL(1)-parseable. For example, Pascal and the other programming languages created by Niklaus Wirth fall into this category.

There are also two principal approaches to implementing the parser: a top-down and a bottom-up one. In a top-down approach, the parser tries to build the tree from the root, while, in a bottom-up one, it tries to find the rules that apply to groups of terminal symbols and then combine those until the root symbol is reached. Obviously, we can't enumerate all parsing algorithms here, so we'll study only a single approach, which is one of the most wide-spread, efficient, and flexible ones — Shift-Reduce Parsing. It's a bottom-up linear algorithm that can be considered one of the instances of the pushdown automaton approach — a theoretical computational model for context-free grammars.

A shift-reduce parser operates on a queue of tokens of the original sentence. It also has access to a stack. At each step, the algorithm can perform:

  • either a shift operation: take the token from the queue and push it onto the stack
  • or a reduce operation: take the top items from the stack, select a matching rule from the grammar, and add the corresponding subtree to the partial parse tree, in the process, removing the items from the stack

Thus, for each token, it will perform exactly 2 "movement" operations: push it onto the stack and pop from the stack. Plus, it will perform rule lookup, which requires a constant number of operations (maximum length of the rhs of any rule) if an efficient structure is used for storing the rules. A hash-table indexed by the rhs's or a trie are good choices for that.

Here's a small example from the domain of NLP syntactic parsing. Let's consider a toy grammar:

S -> NP VP .
NP -> DET ADJ NOUN
NP -> PRP$ NOUN  ; PRP$ is a possessive pronoun
VP -> VERB VP
VP -> VERB NP

and the following vocabulary:

DET -> a|an|the
NOUN -> elephant|pyjamas
ADJ -> large|small
VERB -> is|wearing
PRP$ -> my

Now, let's parse the sentence (already tokenized): A large elephant is wearing my pyjamas . First, we'll need to perform part-of-speech tagging, which, in this example, is a matter of looking up the appropriate nonterminals in the vocabulary grammar. This will result in the following:

DET ADJ    NOUN  VERB  VERB PRP$  NOUN  .
 |   |      |      |    |     |    |    |
 A large elephant is wearing my pyjamas .

These POS tags will serve as the terminals for our parsing grammar. Now, the shift-reduce process itself begins:

1. Initial queue: (DET ADJ NOUN VERB VERB PRP$ NOUN .)
   Initial stack: ()
   Operation: shift


2. Queue: (ADJ NOUN VERB VERB PRP$ NOUN .)
   Stack: (DET)
   Operation: shift (as there are no rules with the rhs DET)

3. Queue: (NOUN VERB VERB PRP$ NOUN .)
   Stack: (ADJ DET)
   Operation: shift

4. Queue: (VERB VERB PRP$ NOUN .)
   Stack: (NOUN ADJ DET)
   Operation: reduce (rule NP -> DET ADJ NOUN)
   ; we match the rules in reverse to the stack

5. Queue: (VERB VERB PRP$ NOUN .)
   Stack: (NP)
   Operation: shift

6. Queue: (VERB PRP$ NOUN .)
   Stack: (VERB NP)
   Operation: shift

7. Queue: (PRP$ NOUN .)
   Stack: (VERB VERB NP)
   Operation: shift

8. Queue: (NOUN .)
   Stack: (PRP$ VERB VERB NP)
   Operation: shift

9. Queue: (.)
   Stack: (NOUN PRP$ VERB VERB NP)
   Operation: reduce (rule: NP -> PRP$ NOUN)

10. Queue: (.)
    Stack: (NP VERB VERB NP)
    Operation: reduce (rule: VP -> VERB NP)

11. Queue: (.)
    Stack: (VP VERB NP)
    Operation: reduce (rule: VP -> VERB VP)

12. Queue: (.)
    Stack: (VP NP)
    Operation: shift

13. Queue: ()
    Stack: (. VP NP)
    Operation: reduce (rule: S -> NP VP .)

14. Reached the root symbol - end.

    The resulting parse tree is:

           __________S___________
          /              \       \
         /             __VP__     \
        /             /      \     \
       /             /     __VP_    \
      /             /     /     \    \
  ___NP_____       /     /     _NP_   \
 /   |      \     /     /     /    \   \
DET ADJ    NOUN  VERB  VERB PRP$  NOUN  .
 |   |      |      |    |     |    |    |
 A large elephant is wearing my pyjamas .

The implementation of the basic algorithm is very simple:

(defstruct grammar
  rules
  max-length)

(defmacro grammar (&rest rules)
  `(make-grammar
    :rules (pairs->ht (mapcar (lambda (rule)
                                (pair (nthcdr 2 rule) (first rule)))
                              ',rules))
    :max-length
    (let ((max 0))
      (dolist (rule ',rules)
        (when (> #1=(length (nthcdr 2 rule)) max)
          (:= max #1#)))
      max)))  ; #1= & #1# are reader labels used to avoid repeating the expression

(defun parse (grammar queue)
  (let ((stack (list)))
    (loop :while queue :do
      (print stack)  ; diagnostic output
      (if-it (find-rule stack grammar)
             ;; reduce
             (dotimes (i (length (cdr it))
                         (push it stack))
               (pop stack))
             ;; shift
             (push (pop queue) stack))
      :finally (return (find-rule stack grammar)))))

(defun find-rule (stack grammar)
  (let (prefix)
    (loop :for item :in stack
          :repeat (? grammar 'max-length) :do
          (push (car (mklist item)) prefix)
          (when-it (? grammar 'rules prefix)
            ;; otherwise parsing will fail with a stack
            ;; containing a number of partial subtrees
            (return (cons it (reverse (subseq stack 0 (length prefix)))))))))

CL-USER> (parse (print (grammar (S -> NP VP |.|)
                                (NP -> DET ADJ NOUN)
                                (NP -> PRP$ NOUN)
                                (VP -> VERB VP)
                                (VP -> VERB NP)))
                '(DET ADJ NOUN VERB VERB PRP$ NOUN |.|))
#S(GRAMMAR
   :RULES #{
            '(NP VP |.|) S
            '(DET ADJ NOUN) NP
            '(PRP$ NOUN) NP
            '(VERB VP) VP
            '(VERB NP) VP
           }
   :MAX-LENGTH 3)
NIL 
(DET) 
(ADJ DET) 
(NOUN ADJ DET) 
((NP DET ADJ NOUN)) 
(VERB (NP DET ADJ NOUN)) 
(VERB VERB (NP DET ADJ NOUN)) 
(PRP$ VERB VERB (NP DET ADJ NOUN)) 
(NOUN PRP$ VERB VERB (NP DET ADJ NOUN)) 
((NP PRP$ NOUN) VERB VERB (NP DET ADJ NOUN)) 
((VP VERB (NP PRP$ NOUN)) VERB (NP DET ADJ NOUN)) 
((VP VERB (VP VERB (NP PRP$ NOUN))) (NP DET ADJ NOUN)) 
(S (NP DET ADJ NOUN) (VP VERB (VP VERB (NP PRP$ NOUN))) |.|)

However, the additional level of complexity of the algorithm arises when the grammar becomes ambiguous, i.e. there may be situations when several rules apply. Shift-reduce is a greedy algorithm, so, in its basic form, it will select some rule (for instance, with the shortest rhs or just the first match), and it cannot backtrack. This may result in a parsing failure. If some form of rule weights is added, the greedy selection may produce a suboptimal parse. Anyway, there's no option of backtracking to correct a parsing error. In the NLP domain, the peculiarity of shift-reduce parsing application is that the number of rules is quite significant (it can reach thousands) and, certainly, there's ambiguity. In this setting, shift-reduce parsing is paired with machine learning technics, which perform a "soft" selection of the action to take at each step, as reduce is applicable almost always, so a naive greedy technique becomes pointless.

Actually, shift-reduce would better be called something like stack-queue parsing, as different parsers may not limit the implementation to just the shift and reduce operations. For example, an NLP parser that allows the construction of non-projective trees (those, where the arrows may cross, i.e. subsequent words may not always belong to a single or subsequent upper-level categories), adds a swap operation. A more advanced NLP parser that produces a graph structure called an AMR (abstract meaning representation) has 9 different operations.

Shift-reduce parsing is implemented in many of the parser generator tools, which generate a parser program from a set of production rules. For instance, the popular Unix tool yacc is an LALR parser generator that uses shift-reduce. Another popular tool, ANTLR, is a parser generator for LL(k) languages that uses a non-shift-reduce, direct pushdown automaton-based implementation.

Besides shift-reduce and similar automata-based parsers, there are many other parsing technics used in practice. For example, CYK probabilistic parsing was popular in NLP for some time, but it's an O(n^3) algorithm, so it gradually fell from grace and lost to machine-learning enhanced shift-reduce variants. Another approach is packrat parsing (based on PEG — parsing expression grammars) that has a great Lisp parser-generator library esrap. Packrat is a more powerful top-down parsing approach with backtracking and unlimited lookahead that nevertheless guarantees linear parse time. Any language defined by an LL(k) or LR(k) grammar can be recognized by a packrat parser, in addition to many languages that conventional linear-time algorithms do not support. This additional power simplifies the handling of common syntactic idioms such as the widespread but troublesome longest-match rule, enables the use of sophisticated disambiguation strategies such as syntactic and semantic predicates, provides better grammar composition properties, and allows lexical analysis to be integrated seamlessly into parsing. The last feature makes packrat very appealing to the programmers as they don't have to define separate tools for lexical analysis (tokenization and token categorization) and parsing. Moreover, the rules for tokens use the same syntax, which is also quite similar to regular expression syntax. For example, here's a portion of the esrap rules for parsing tables in Markdown documents. The Markdown table may look something like this:

| Left-Aligned  | Center Aligned  | Right Aligned |
| :------------ |:---------------:| -----:|
| col 3 is      | some wordy text | $1600 |
| col 2 is      | centered        |   $12 |
| zebra stripes | are neat        |    $1 |

You can see that the code below is quite self-explanatory: each defrule form consists of a rule name (lhs), its rhs, and a transformation of the rhs into a data structure. For instance, in the rule table-row the rhs is (and (& #\|) (+ table-cell) #\| sp newline). The row should start with a | char, be followed by 1 or more table-cells (a separate rule), and end with a |, some space characters, and a newline. And the transformation (:destructure (_ cells &rest __) ...) only cares about the content, i.e. the table cells.

(defrule sp (* space-char)
  (:text t))

(defrule table-cell (and #\|
                         sp
                         (* (and (! (or (and sp #\|) endline)) inline))
                         sp
                         (& #\|))
  (:destructure (_ __ content &rest ___)
    (mapcar 'second content)))

(defrule table-row (and (& #\|) (+ table-cell) #\| sp newline)
  (:destructure (_ cells &rest __)
    (mapcar (lambda (a) (cons :plain a))
            cells)))

(defrule table-align-cell (and sp (? #\:) (+ #\-) (? #\:) sp #\|)
  (:destructure (_ left __ right &rest ___)
    (if right (if left 'center 'right) (when left 'left))))

(defrule table-align-row (and #\| (+ table-align-cell) sp newline)
  (:destructure (_ aligns &rest __)
    aligns))

(defrule table-head (and table-row table-align-row))

To conclude the topic of parsing, I wanted to pose a question: can it be used to match the regular expressions? And the answer, of course, is that it can, as we are operating in a more powerful paradigm that includes the regexes as a subdomain. However, the critical showstopper of applying parsing to this problem is the need to define the grammar instead of writing a compact and more or less intuitive regex...

String Search in Action: Plagiarism Detection

Plagiarism detection is a very challenging problem that doesn't have an exact solution. The reason is that there's no exact definition of what can be considered plagiarism and what can't; the boundary is rather blurry. Obviously, if the text or its part is just copy-pasted, everything is clear. But, usually (and especially when they know that plagiarism detection is at play), people will apply their creativity to alter the text in some slight or even significant ways. However, over the years, researchers have come up with numerous plagiarism detection algorithms, with quality good enough to be used in our educational institutions. The problem is very popular and there are even shared-task challenges dedicated to improving plagiarism catchers. It's somewhat of an arms race between the plagiarists and the detection systems.

One of the earliest but, still, quite effective ways of implementing plagiarism detection is the Shingle algorithm. It is also based on the idea of using hashes and some basic statistical sampling techniques. The algorithm operates in the following stages:

  1. Text normalization (this may include case normalization, reduction of the words to basic forms, error correction, cleanup of punctuation, stopwords, etc.)
  2. Selection of the shingles and calculation of their hashes.
  3. Sampling the shingles from the text in question.
  4. Comparison of the hashes of the original shingles to the sampled hashes and evaluation.

A single shingle is a contiguous sequence of words from the normalized text (another name for this object, in NLP, is an ngram). A text of n words will give us (- n (1- l)) shingles of length l. The hashes of the shingles are normal string hashes (like fnv-1).

The text, which is analyzed for plagiarism, is also split into shingles, but not all of them are used. Just a random sample of m. Basic sampling theory can give a good estimate of the sample size that can be trusted with a high degree of confidence. For efficient comparison, all the original hashes can be stored in a hash-set. If the number of overlapping shingles exceeds some threshold, the text can be considered plagiarised. Alternatively, the algorithm may return the plagiarism degree: the percentage of overlapping shingles. The complexity of the algorithm is O(n + m).
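
A minimal sketch of stages 2-4 follows (normalization omitted; for brevity, the shingles themselves are stored in an EQUAL hash-table instead of their explicit fnv-1 hashes, and the sample is drawn with replacement):

(defun shingles (words l)
  "All contiguous L-word subsequences of the list WORDS."
  (loop :for tail :on words
        :while (nthcdr (1- l) tail)
        :collect (subseq tail 0 l)))

(defun plagiarism-degree (original suspect &key (l 4) (m 100))
  "The share of M randomly sampled L-word shingles of SUSPECT
   that also occur in ORIGINAL (both are lists of normalized words)."
  (let ((table (make-hash-table :test 'equal)))
    (dolist (shingle (shingles original l))
      (:= (gethash shingle table) t))
    (let ((candidates (coerce (shingles suspect l) 'vector))
          (hits 0))
      (when (zerop (length candidates)) (return-from plagiarism-degree 0))
      (loop :repeat m
            :when (gethash (aref candidates (random (length candidates))) table)
            :do (:+ hits))
      (/ hits m))))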

In a sense, the Shingle algorithm may be viewed as an instance of massive string search, where the outcome we're interested in is not so much the positions of the patterns in the text (although, those may also be used to indicate the parts of the text that are plagiarism-suspicious) as the fact that they are present in it.

Take-aways

Strings are peculiar objects: initially, it may seem that they are just arrays. But, beyond this simple understanding, due to the main usage patterns, a much more interesting picture can be seen. Advanced string representations and algorithms are examples of special-purpose optimization applied to general-purpose data structures. This is another reason why strings are presented at the end of the part on derived data structures: string algorithms make heavy use of the material we have covered previously, such as trees and graphs.

We have also discussed the FSMs — a powerful data-structure that can be used to reliably implement complex workflows. FSMs may be used not only for string matching but also for implementing protocol handling (for example, in the HTTP server), complex user interactions, and so on. The Erlang programming language even has a standard library behavior gen_fsm (replaced by the newer gen_statem) that is a framework for easy implementation of FSMs — as many Erlang applications are mass service systems that have state machine-like operation.

P.S. Originally, I expected this chapter to be one of the smallest in the book, but it turned out to be the longest one. Strings are not so simple as they might seem... ;)


[1] A proper suffix is a suffix that is at least one character shorter than the string itself. For example, in the string abc the proper suffixes are bc and c.

[2] Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the same applies to Python, or PHP, or Ruby, or many other languages.