2019-12-13

Programming Algorithms: Dynamic Programming

This chapter opens the final part of the book entitled "Selected Algorithms". In it, we're going to apply the knowledge from the previous chapters to the analysis of a selection of important problems that are mostly application-independent and find use in many applied domains: optimization, synchronization, compression, and so on.

We will start with a single approach that is arguably the most powerful algorithmic technique in use. If we manage to reduce a problem to Dynamic Programming (DP), in most cases, we can consider it solved. The fact that we have progressed this far in the book without mentioning DP is quite amazing. Actually, we could have already talked about it several times, especially in the previous chapter on strings, but I wanted to contain this topic to its own chapter, so I deliberately didn't start the exposition earlier. Indeed, strings are one of the domains where dynamic programming is used quite heavily, but the technique finds application in almost every area.

Also, DP is one of the first marketing terms in CS. When Bellman invented it, he wanted to use the then-hyped term "programming" to promote his idea. This has, probably, caused more confusion over the years than benefit. In fact, a good although unsexy name for this technique could simply be "filling the table" as the essence of the approach is an exhaustive evaluation of all variants with memoization of partial results (in a table) to avoid redundant recomputation. Obviously, it brings benefits only when there are redundant computations, which is not the case, for example, with combinatorial optimization. To determine if a problem may be solved with DP, we need to validate that it has the optimal substructure property:

A problem has optimal substructure if an optimal solution to the whole problem includes optimal solutions to its subproblems.

An example of the optimal substructure is the shortest path problem. If the shortest path from point A to point B passes through some point C and there are multiple paths from C to B, the one included in the shortest path A-B should be the shortest of them. In fact, the shortest path is an archetypal DP problem which we'll discuss later in this chapter. A counterexample is the Travelling Salesman Problem (TSP): if it had optimal substructure, the subpath between any two nodes of the resulting tour would have to be the shortest possible path between these nodes. But this isn't true for all nodes because it can't be guaranteed that such shortest subpaths will combine into a valid cycle with all the other shortest paths.

Fibonacci Numbers

So, as we said, the essence of DP is filling a table. This table, though, may have a different number of dimensions for different problems. Let's start with a 1d case. What book on algorithms can omit discussing the Fibonacci numbers? Usually, they are used to illustrate recursion, yet they are also a great showcase for the power of memoization. Besides, recursion is, conceptually, also an integral part of DP.

A naive approach to calculating the i-th number will be directly coding the Fibonacci formula:

(defun naive-fib (i)
  (assert (typep i '(integer 0)))
  (if (< i 2) 1
      (+ (naive-fib (- i 1))
         (naive-fib (- i 2)))))

However, applying it will result in an exponential growth of the number of computations: each call to naive-fib results in two more calls. So, the number of calls needed for the n-th number, with this approach, is O(2^n).

> (time (naive-fib 40))
Evaluation took: 3.390 seconds of real time
165580141
> (time (naive-fib 42))
Evaluation took: 7.827 seconds of real time
433494437

Yet, we can see here a direct manifestation of the optimal substructure property: the i-th number calculation uses the result of the (1- i)-th one. To utilize this recurrence, we'll need to store the previous results and reuse them. This may be achieved by replacing the function call with a table access. Actually, from the point of view of math, tables and functions are, basically, the same thing.

(let ((fib (vec 1 1)))  ; our table will be an adjustable vector
  (defun fib (i)
    (when (< (length fib) i)
      (vector-push-extend (fib (- i 1)) fib))
    (+ (? fib (- i 1))
       (? fib (- i 2)))))

What we've done here is add a layer of memoization to our function, using an array fib that is filled with consecutive Fibonacci numbers. The array is hidden inside the closure of the fib procedure, so it will persist between the calls to it and accumulate the numbers as they are requested. There will also be no way to clear it, apart from redefining the function, as closed-over variables of this kind are not accessible outside of the function. The consecutive property is ensured by the arrangement of the recursive calls: the table is filled on the recursive ascent starting from the lowest yet unknown number. This approach guarantees that each Fibonacci number is calculated exactly once and reduces our dreaded O(2^n) running time to a mere O(n)!
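The same effect can, by the way, be achieved with a general-purpose memoization wrapper. Here is a minimal sketch in plain Lisp (the memoize function is not part of the book's toolkit, and it handles only single-argument functions):

(defun memoize (fn)
  "Return a version of the one-argument function FN that caches its results."
  (let ((cache (make-hash-table :test 'equal)))
    (lambda (arg)
      (multiple-value-bind (result foundp) (gethash arg cache)
        (if foundp
            result
            (setf (gethash arg cache) (funcall fn arg)))))))

For the cache to help the recursive calls and not just the top-level ones, the recursion has to go through the memoized version, for instance, by replacing the global definition with (setf (symbol-function 'naive-fib) (memoize #'naive-fib)) — whether the compiled self-calls pick up the redefinition, though, depends on the implementation and its optimization settings. In this chapter, we'll stick to explicit tables tailored to each problem.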

This fib calculation is the simplest example of top-down DP that is performed using recursion. Despite its natural elegance, it suffers from a minor problem that may become significant in some cases: extra space consumption by each recursive call. It's not only O(n) in time, but also in space. The alternative strategy that gets rid of the redundant space usage is called bottom-up DP and is based on loops instead of recursion. Switching to it is quite trivial, in this case:

(let ((fib (vec 1 1)))
  (defun bottom-up-fib (i)
    (let ((off (length fib)))
      (adjust-array fib (1+ i) :fill-pointer t)
      (dotimes (j (- (1+ i) off))
        (let ((j (+ j off)))
          (:= (aref fib j)
              (+ (aref fib (- j 1))
                 (aref fib (- j 2)))))))
    (aref fib i)))
> (time (bottom-up-fib 42))
Evaluation took: 0.000 seconds of real time
> (time (bottom-up-fib 4200))
Evaluation took: 0.004 seconds of real time
40512746637826407679504078155833145442086707013857032517543...  ; this number is a Lisp bignum, which has unlimited size

Funnily enough, a real-world-ready implementation of Fibonacci numbers ends up not using recursion at all...
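For instance, an iterative version doesn't even need the table: remembering just the last two numbers is enough. Here is a sketch (iter-fib is a name introduced for this illustration; it follows the same convention as naive-fib, i.e. the 0th and 1st numbers are both 1):

(defun iter-fib (i)
  (assert (typep i '(integer 0)))
  (let ((prev 1)
        (cur 1))
    (loop :repeat (1- i)
          ;; psetf updates both variables "in parallel", using the old values
          :do (psetf prev cur
                     cur (+ prev cur)))
    cur))

It runs in O(n) time like bottom-up-fib but in O(1) space, at the cost of not remembering the previously computed numbers between the calls.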

String Segmentation

Let's consider another 1d problem: suppose we have a dictionary of words and a string consisting of those words that somehow lost the spaces between them — the words got glued together. We need to restore the original string with spaces or, to phrase it differently, split the string into words. This is one of the instances of string segmentation problems, and if you're wondering how and where such a situation could occur for real, consider Chinese text that doesn't have to contain spaces. Every Chinese language processing system needs to solve a similar task.

Here's an example input[1]:

String: thisisatest
Dictionary: a, i, at, is, hi, ate, his, sat, test, this
Expected output: this is a test

It is clear that even with such a small dictionary there are multiple ways we could segment the string. The straightforward and naive approach is to use a greedy algorithm. For instance, a shortest-first solution will try to find the shortest word from the dictionary starting at the current position and then split it (as a prefix) from the string. It will result in the following split: this i sat est. But the last part est isn't in the dictionary, so the algorithm has failed to produce some of the possible correct splits (although, by chance, if the initial conditions were different, it could have succeeded). Another version — the longest-first approach — could look for the longest words instead of the shortest. This would result in: this is ate st. Once again, the final token is not a word. It is pretty obvious that these simple takes are not correct and we need a more nuanced solution.

As a common next step in developing such brute-force approaches, a developer would resort to backtracking: when the computation reaches a position in the string from which no word in the dictionary may be recovered, it unwinds to the position of the previous successful split and tries a different word. This procedure may have to go back multiple steps — possibly to the very beginning. As a result, in the worst case, to find a correct split, we may need to exhaustively try all possible combinations of words that fit into the string.

Here's an illustration of the recursive shortest-first greedy algorithm operation:

(defun shortest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    (let ((word (slice str 0 (1+ i))))
      (when (? dict word)
        (return-from shortest-first-restore-spaces
          (cond-it
            ((= (1+ i) (length str))
             word)
            ((shortest-first-restore-spaces dict (slice str (1+ i)))
             (format nil "~A ~A" word it))))))))

CL-USER> (defparameter *dict* (hash-set 'equal "a" "i" "at" "is" "hi" "ate" "his" "sat" "test" "this"))
CL-USER> (shortest-first-restore-spaces *dict* "thisisatest")
  0: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")
    1: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")
      2: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")
        3: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")
        3: SHORTEST-FIRST-RESTORE-SPACES returned NIL
      2: SHORTEST-FIRST-RESTORE-SPACES returned NIL
    1: SHORTEST-FIRST-RESTORE-SPACES returned NIL
  0: SHORTEST-FIRST-RESTORE-SPACES returned NIL
NIL
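For comparison, the longest-first variant mentioned above differs only in the direction of the scan over the candidate prefixes. Here is a sketch in the same style (longest-first-restore-spaces is not from the book; it uses the same RUTILS utilities as the procedure above). On our example, it gets stuck on the split this is ate st and likewise returns NIL:

(defun longest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    ;; start from the whole remaining string and shrink the prefix
    (let ((word (slice str 0 (- (length str) i))))
      (when (? dict word)
        (return-from longest-first-restore-spaces
          (cond-it
            ((= (length word) (length str))
             word)
            ((longest-first-restore-spaces dict (slice str (length word)))
             (format nil "~A ~A" word it))))))))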

To add backtracking into the picture, we need to avoid returning in the case of the failure of the recursive call:

(defun bt-shortest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    (let ((word (slice str 0 (1+ i))))
      (when (in# word dict)
        (when (= (1+ i) (length str))
          (return-from bt-shortest-first-restore-spaces word))
        (when-it (bt-shortest-first-restore-spaces dict (slice str (1+ i)))
          (return-from bt-shortest-first-restore-spaces (format nil "~A ~A" word it)))))))

CL-USER> (bt-shortest-first-restore-spaces *dict* "thisisatest")
  0: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")
    1: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")
      2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")
        3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")
        3: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL
      2: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL
      ;; backtracking kicks in here
      2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "atest")
        3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "test")
        3: BT-SHORTEST-FIRST-RESTORE-SPACES returned "test"
      2: BT-SHORTEST-FIRST-RESTORE-SPACES returned "a test"
    1: BT-SHORTEST-FIRST-RESTORE-SPACES returned "is a test"
  0: BT-SHORTEST-FIRST-RESTORE-SPACES returned "this is a test"
"this is a test"

Lisp trace is an invaluable tool for understanding the behavior of recursive functions. Unfortunately, it doesn't work for loops, for which one has to resort to debug printing.

Realizing that this is brute force, we could just as well use another approach: generate all sequences of words from the dictionary with a total number of characters equal to the length of the string (n) and choose the ones that match the current string. The exact complexity of this scheme is O(2^n)[2]. In other words, our solution leads to a combinatorial explosion in the number of possible variants — a clear no-go for every algorithmic developer.

So, we need to come up with something different, and, as you might have guessed, DP fits in perfectly as the problem has the optimal substructure: a complete word in the substring of the string remains a complete word in the whole string as well. Based on this understanding, let's reframe the task in a way that lends itself to DP better: find each character in the string that ends a complete word so that all the words combined cover the whole string and do not intersect[3].

Here is an implementation of the DP-based procedure. Apart from calculating the maximum length of a word in the dictionary, which usually may be done offline, it requires a single forward pass and a single backward pass. The forward pass is a linear scan of the string that, at each character, tries to find all the words starting at it and matching the string. The complexity of this pass is O(n * w), where w is the length of the longest word in the dictionary (a constant), i.e. it is, actually, O(n). The backward pass (called, in the context of DP, decoding) restores the spaces using the so-called backpointers stored in the dp array. Below is a simplistic implementation that returns a single match. A recursive variant is possible, with or without a backward pass, that will accumulate all the possible variants.

(defun dp-restore-spaces (dict str)
  (let ((dp (make-array (1+ (length str)) :initial-element nil))
        ;; in the production implementation, the following calculation
        ;; should be performed at the pre-processing stage
        (w (reduce 'max (mapcar 'length (keys dict))))
        (begs (list))
        (rez (list)))
    ;; the outer loop tries to find the next word
    ;; only starting from the ends of the words that were found previously
    (do ((i 0 (pop begs)))
        ((or (null i)
             (= i (length str))))
      ;; the inner loop checks all substrings of length 1..w
      (do ((j (1+ i) (1+ j)))
          ((>= j (1+ (min (length str)
                          (+ w i)))))
        (when (? dict (slice str i j))
          (:= (? dp j) i)
          (push j begs)))
      (:= begs (reverse begs)))
    ;; the backward pass
    (do ((i (length str) (? dp i)))
        ((null (? dp i)))
      (push (slice str (? dp i) i) rez))
    (strjoin #\Space rez)))

CL-USER> (dp-restore-spaces *dict* "thisisatest")
"this is a test"

Similarly to the Fibonacci numbers, the solution to this problem doesn't use any additional information to choose between several variants of a split; it just takes the first one. However, if we wanted to find the variant that is most plausible to a human reader, we'd need to add some measure of plausibility. One idea might be to use a frequency dictionary, i.e. prefer the words that have a higher frequency of occurrence in the language. Such an approach, unfortunately, also has drawbacks: it overemphasizes short and frequent words, such as determiners, and also doesn't account for how words are combined in context. A more advanced option would be to use a frequency dictionary not just of words but of separate phrases (ngrams). The longer the phrases, the better from the standpoint of linguistics, but the worse from the engineering point of view: more storage space is needed, and more data has to be processed if we want to collect reliable statistics for all the possible variants. And, once again, as the number of words in an ngram grows, we will be facing the issue of combinatorial explosion pretty soon. The optimal point for this particular task might be bigrams or trigrams, i.e. phrases of 2 or 3 words. Using them, we'd have to supply another dictionary to our procedure and track the measure of plausibility of the current split as a product of the frequencies of the selected ngrams. Formulated this way, our exercise becomes not merely an algorithmic task but an optimization problem. And DP is also suited to solving such problems. In fact, that was the primary purpose it was intended for, in the Operations Research community. We'll see it in action with our next problem — text justification. And developing a restore-spaces-plausibly procedure is left as an exercise to the reader. :)
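To give a hint of what the scoring part of such a procedure might look like, here is a minimal sketch of the unigram case (split-log-prob is a hypothetical helper, not the solution to the exercise): the plausibility of a split is the product of the word frequencies which, to avoid floating-point underflow, is usually accumulated as a sum of log-probabilities.

(defun split-log-prob (freq-dict words)
  "FREQ-DICT maps each word to its count; unseen words get a small smoothing count."
  (let ((total (loop :for count :being :the :hash-values :of freq-dict
                     :sum count)))
    (loop :for word :in words
          :sum (log (/ (gethash word freq-dict 1/10)
                       total)))))

The split with the largest such sum would be preferred. Doing the same for ngrams, and integrating the scoring into the DP pass itself, is the interesting part of the exercise.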

Text Justification

The task of text justification is relevant to both editing and reading software: given a text consisting of paragraphs, split each paragraph into lines that contain whole words only, with a given line length limit, so that the lines are as close to that limit as possible. Its solution may be used, for example, to display text in HTML blocks with an align=justify property.

A more formal task description would be the following:

  • the algorithm is given a text string and a line length limit (say, 80 characters)

  • there's a plausibility formula that specifies the penalty for each line being shorter than the length limit. A usual formula is this:

    (defun penalty (length limit)
      (if (<= length limit)
          (expt (- limit length) 3)
          most-positive-fixnum))
    
  • the result should be a list of strings

As we are discussing this problem in the context of DP, first, we need to determine what its optimal substructure is. Superficially, we could claim that the optimal solution should be built only from the lines that have the smallest penalty, according to the formula. However, this doesn't work as some of the potential lines that have the best plausibility (length closest to 80 characters) may overlap, i.e. the optimal split may not be able to include all of them. What we can reliably claim is that, if the text is already justified from position 0 to i, we can still justify the remainder optimally regardless of how the prefix was split into lines. This is, basically, the same as with string segmentation, where we didn't care how the string was segmented before position i. And it's a common theme in DP problems: the key feature that allows us to save on redundant computation is that we only remember the optimal result of the computation that led to a particular partial solution, but we don't care what particular path was taken to obtain it (except that we do want to restore the path in the end, but that's what the backpointers are for — they don't impact the forward pass of the algorithm). So the optimal substructure property of text justification is that if the best split of the whole string includes the consecutive split indices x and y, then the best split from 0 to y should include x.

Let's justify the following text with a line limit of 50 chars:

Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running family of Lisp programming languages.

Suppose we've already justified the first 104 characters. This leaves us with a suffix that has a length of 69: descendant of the long-running family of Lisp programming languages. As its length is above 50 chars but below 100, we can conclude that it requires exactly one split. This split may be performed after one of the tokens, provided both resulting lines fit into the limit. Let's calculate the total penalty of each such candidate:

after "the": 5832 + 0 = 5832
after "long-running": 6859 + 2197 = 9056
after "family": 1728 + 8000 = 9728
after "of": 729 + 12167 = 12896
after "Lisp": 64 + 21952 = 22016

So, the optimal split starting at index 105[4] is into the strings: "descendant of the long-running" and "family of Lisp programming languages." Now, we haven't guaranteed that index 105 will be, in fact, a split point in the optimal split of the whole string, but, if it were, we would have already known how to continue. This is the key idea of the DP-based justification algorithm: starting from the end, calculate the cost of justifying the remaining suffix after each token, using the results of the previous calculations. At first, while the suffix length is below the line limit, the costs are trivially computed by a single call to the penalty function. After exceeding the line limit, the calculation will consist of two parts: the penalty of the current line + the previously calculated cost of the rest of the suffix.
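Before the full implementation, here is a minimal top-down sketch of just this recurrence (justify-cost is a name introduced for this illustration; it computes only the optimal total cost, without backpointers, counts line lengths without a trailing space, and assumes that every single token fits into the limit, so the concrete numbers may differ slightly from the hand calculation above):

(defun justify-cost (limit tokens &optional (i 0) (memo (make-hash-table)))
  "Minimal total penalty for justifying the vector TOKENS starting from index I."
  (if (= i (length tokens))
      0
      (or (gethash i memo)
          (setf (gethash i memo)
                ;; try ending the current line after each token j that still fits
                (loop :with len := -1  ; -1 compensates for the leading space
                      :for j :from i :below (length tokens)
                      :do (incf len (1+ (length (aref tokens j))))
                      :while (<= len limit)
                      :minimize (+ (penalty len limit)
                                   (justify-cost limit tokens (1+ j) memo)))))))

The complete bottom-up implementation, which also records the backpointers needed to reconstruct the actual lines, looks like this: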

(defun justify (limit str)
  (with ((toks (reverse (split #\Space str)))
         (n (length toks))
         (penalties (make-array n))
         (backptrs (make-array n))
         (lengths (make-array n)))
    ;; forward pass (from the end of the string)
    (doindex (i tok toks)
       (let ((len (+ (length tok) (if (> i 0)
                                      (? lengths (1- i))
                                    0))))
         (:= (? lengths i) (1+ len))
         (if (<= len limit)
             (:= (? penalties i) (penalty len limit)
                 (? backptrs i) -1)
           ;; minimization loop
           (let ((min most-positive-fixnum)
                 arg)
             (dotimes (j i)
               (with ((j (- i j 1))
                      (len (- (? lengths i)
                              (? lengths j)))
                      (penalty (+ (penalty len limit)
                                  (? penalties j))))
                 (when (> len limit) (return))
                 (when (< penalty min)
                   (:= min penalty
                       arg j))))
             (:= (? penalties i) min
                 (? backptrs  i) arg)))))
    ;; backward pass (decoding)
    (loop :for end := (1- n) :then beg
          :for beg := (? backptrs end)
          :do (format t "~A~%" (strjoin #\Space (reverse (subseq toks (1+ beg) (1+ end)))))
          :until (= -1 beg))))

CL-USER> (justify 50 "Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running family of Lisp programming languages.")

Common Lisp is the modern, multi-paradigm,
high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running
family of Lisp programming languages.

This function is somewhat longer, but, conceptually, it is pretty simple. The only insight I needed to implement it efficiently was the additional array for storing the lengths of all the string suffixes examined so far. This way, we apply memoization twice: both the penalties and the suffix lengths are calculated only once, and all of the previously computed values are reused at each iteration. If we were to store the suffixes themselves, we would have had to perform an additional O(n) length calculation at each iteration.

The algorithm performs two passes. In the forward pass (which is, in fact, performed from the end), it fills the slots of the DP arrays using the minimum joint penalty for the potential current line and the remaining suffix, the penalty for which was calculated during one of the previous iterations of the algorithm. In the backward pass, the resulting lines are extracted by traversing the backpointers starting from the last index.

The key difference from the previous DP example is these lines:

(:= (? penalties i) min
    (? backptrs  i) arg)

Adding them (alongside the whole minimization loop) turns DP into an optimization framework that, in this case, is used to minimize the penalty. The backptrs array, as we said, is used to restore the steps that have led to the optimal solution, as, eventually (and this is true for the majority of DP optimization problems), we care about this sequence and not about the value of the objective itself.

As we can see, for the optimization problems, the optimal substructure property is manifested as a mathematical formula called the recurrence relation. It is the basis for the selection of a particular substructure among several variants that may be available for the current step of the algorithm. The relation involves an already memoized partial solution and the cost of the next part we consider adding to it. For text justification, the formula is the sum of the current penalty and the penalty of the newly split suffix. Each DP optimization task is based on a recurrence relation of a similar kind.

Now, let's look at this problem from a different perspective. We can represent our decision space as a directed acyclic graph. Its leftmost node (the "source") will be index 0, and it will have several direct descendants: the nodes with those indices in the string at which we can potentially split it without exceeding the 50-character line limit or, alternatively, each substring that spans from index 0 to the end of some token and is not longer than 50 characters. Next, we'll connect each descendant node in a similar manner with all the nodes that are "reachable" from it, i.e. that have a higher associated string position whose difference from this node's position is below 50. The final node of the graph (the "sink") will have the value of the length of the string. The cost of each edge is the value of the penalty function. Now, the task is to find the shortest path from the source to the sink.

Here is the DAG for the example string with the nodes labeled with the indices of the potential string splits. As you can see, even for such a simple string, it's already quite big, to say nothing of real texts. But it can provide some sense of the number of variants that an algorithm has to evaluate.

What is the complexity of this algorithm? On the surface, it may seem to be O(m^2) where m is the token count, as there are two loops: over all tokens and over the tail. However, the line (when (> len limit) (return)) limits the inner loop to only the part of the string that can fit into limit chars, effectively, reducing it to a constant number of operations (not more than limit, but, in practice, an order of magnitude less). Thus, the actual complexity is O(m)[5].

Pathfinding Revisited

In fact, any DP problem may be reduced to pathfinding in a graph: the shortest path, if optimization is involved, or just any path otherwise. The nodes in this graph are the intermediate states (for instance, a split at index x or the i-th Fibonacci number) and the edges — possible transitions that may bear an associated cost (as in text justification) or not (as in string segmentation). And the classic DP-based algorithm to solve this problem is called the Bellman-Ford algorithm. Not incidentally, one of its authors, Bellman, is the "official" inventor of DP.

(defun bf-shortest-path (g)
  (with ((n (array-dimension g 0))
         ;; edges-table indexes the edges by their destination vertex
         (edges (edges-table g))
         (dists (make-array n :initial-element most-positive-fixnum))
         (backptrs (make-array n))
         (path (list)))
    ;; the distance from the sink to itself is zero
    (:= (? dists (1- n)) 0)
    ;; relaxation: the classic Bellman-Ford repeats this pass up to (1- n) times;
    ;; for a DAG whose edges go from lower-numbered to higher-numbered vertices
    ;; (as in our DP problems), a single pass from the sink down to the source suffices
    (loop :for v :from (1- n) :downto 0 :do
      (dolist (e (? edges v))
        (with ((u (src e))
               (dist (+ (dist e)
                        (? dists v))))
          (when (< dist (? dists u))
            (:= (? dists u) dist
                (? backptrs u) v)))))
    ;; reconstruct the path from the source (vertex 0) to the sink
    (loop :for v := 0 :then (? backptrs v) :do
      (push v path)
      :until (= v (1- n)))
    (values (reverse path)
            (? dists 0))))

The code for the algorithm is very straightforward, provided that our graph representation already has the vertices and edges as data structures in a convenient format or implements such operations (in the worst case, the overall complexity of those operations should be not greater than O(V+E)). For the edges, we need a kv (key-value table) indexed by the edge destination — the opposite of the usual representation that groups them by their sources[6].

Compared to text justification, this function looks simpler as we don't have to perform task-specific processing that accounts for the character limit and the spaces between words. However, if we were to use bf-shortest-path, we'd have to first create the graph data structure from the original text, so all that complexity would go into the graph creation routine. Still, from the architectural point of view, such a split may be beneficial as the pathfinding procedure could be reused for other problems.
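For text justification, such a graph creation routine might look like the following sketch. The edge representation here (a struct with src, dst, and dist slots, and a vector of incoming-edge lists indexed by the destination node) is an assumption of this illustration, chosen to match the accessors used in bf-shortest-path; justification-graph itself is also a name invented for this sketch.

(defstruct (edge (:conc-name nil))
  src dst dist)

(defun justification-graph (limit tokens)
  "Return a vector mapping each node to the list of its incoming edges.
Node i corresponds to a split after the first I tokens, so node 0 is
the source and node (length TOKENS) is the sink."
  (let* ((n (length tokens))
         (edges (make-array (1+ n) :initial-element nil)))
    (dotimes (i n)  ; i -- the position where a candidate line starts
      (loop :with len := -1  ; -1 compensates for the leading space
            :for j :from i :below n
            :do (incf len (1+ (length (elt tokens j))))
            :while (<= len limit)
            ;; an edge i -> j+1 means "tokens i..j form a single line"
            :do (push (make-edge :src i :dst (1+ j)
                                 :dist (penalty len limit))
                      (aref edges (1+ j)))))
    edges))

This table is what the edges-table call in bf-shortest-path is assumed to return, with the vertex count being (1+ (length tokens)).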

One might ask a reasonable question: how does Bellman-Ford fare against Dijkstra's algorithm (DA)? As we have already learned, Dijkstra's is a greedy and optimal solution to pathfinding, so why consider yet another approach? Both algorithms operate by relaxation, in which approximations to the correct distance are replaced by better ones until the final result is reached. And in both of them, the approximate distance to each vertex is always an overestimate of the true distance, and it is replaced by the minimum of its old value and the length of a newly found path. It turns out that DA is also a DP-based approach. But with additional optimizations! It uses the same optimal substructure property and recurrence relations. The advantage of DA is the utilization of the priority queue to effectively select the closest vertex that has not yet been processed. Then it performs the relaxation process on all of its outgoing edges, while the Bellman-Ford algorithm relaxes all the edges. This method allows BF to calculate the shortest paths not to a single node but to all of them (which is also possible for DA, but then it has to process the whole graph and loses the benefit of stopping early). So, Bellman-Ford complexity is O(V E) compared to O(E + V logV) for the optimal implementation of DA. Besides, BF can account for negative edge weights, which would break DA.

So, DA remains the algorithm of choice for the standard shortest path problem, and it's worth keeping in mind that it can also be applied as a solver for some DP problems if they are decomposed into graph construction + pathfinding. However, some DP problems have additional constraints that make using DA for them pointless. For example, in text justification, the number of edges to consider at each step is limited by a constant factor, so the complexity of the exhaustive search is, in fact, O(V). Proving that for our implementation of justify is left as an exercise to the reader...

LCS & Diff

Let's return to strings and the application of DP to them. The ultimate DP-related string problem is string alignment. It manifests in many formulations. The basic one is the Longest Common Subsequence (LCS) task: determine the length of the common part of two input strings. Solving it, however, provides enough data to go beyond that — it enables determining the best alignment of the strings, as well as enumerating the edit operations needed to transform one string into another. The edit operations usually considered in the context of LCS are:

  • insertion of a character
  • deletion of a character
  • substitution of a character

Based on the number of those operations, we can calculate a measure of commonality between two strings that is called the Levenshtein distance. It is one of the examples of the so-called edit distances. Identical strings have a Levenshtein distance of 0, while the strings foobar and baz have a distance of 4 (3 deletion operations for the prefix foo and a substitution operation of r into z). There are also other variants of edit distances. For instance, the Damerau-Levenshtein distance, which is better suited to comparing texts with misspellings produced by humans, adds another modification operation: swap, which reduces the edit distance in the case of two adjacent characters being swapped to 1 instead of 2 for the Levenshtein distance (1 deletion and 1 insertion).
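As an aside, here is a sketch of how the swap operation fits into the calculation: the restricted Damerau-Levenshtein variant (also known as the optimal string alignment distance) just adds one more case to the minimum. The function below (osa-dist, a name used only for this illustration) is written in plain Lisp, without the book's utilities, and uses the usual matrix whose 0-th row and column are initialized with the prefix lengths:

(defun osa-dist (s1 s2)
  "Levenshtein distance extended with swaps of adjacent characters."
  (let* ((n (length s1))
         (m (length s2))
         (d (make-array (list (1+ n) (1+ m)))))
    (dotimes (i (1+ n)) (setf (aref d i 0) i))
    (dotimes (j (1+ m)) (setf (aref d 0 j) j))
    (loop :for i :from 1 :to n :do
      (loop :for j :from 1 :to m :do
        (let ((cost (if (char= (char s1 (1- i)) (char s2 (1- j))) 0 1)))
          (setf (aref d i j)
                (min (+ (aref d (1- i) j) 1)           ; deletion
                     (+ (aref d i (1- j)) 1)           ; insertion
                     (+ (aref d (1- i) (1- j)) cost))) ; substitution or match
          ;; the extra Damerau case: the last two characters are swapped
          (when (and (> i 1) (> j 1)
                     (char= (char s1 (1- i)) (char s2 (- j 2)))
                     (char= (char s1 (- i 2)) (char s2 (1- j))))
            (setf (aref d i j)
                  (min (aref d i j)
                       (+ (aref d (- i 2) (- j 2)) 1)))))))
    (aref d n m)))

For example, (osa-dist "algorithm" "algortihm") returns 1 instead of the plain Levenshtein 2.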

The Levenshtein distance, basically, gives us the DP recurrence relations for free: when we consider the i-th character of the first string and the j-th one of the second, the edit distance between the prefixes 0,i and 0,j is either the same as for the pair of chars (1- i) and (1- j) respectively, if the current characters are the same, or 1+ the minimum of the edit distances of the pairs i (1- j), (1- i) (1- j), and (1- i) j.

We can encode this calculation as a function that uses a matrix for memoization. Basically, this is also the DP solution to the LCS problem: the bottom-right element of the matrix gives the measure of the difference between the strings, and subtracting it from the length of the longer string gives a rough measure of the size of their common part.

(defun lev-dist (s1 s2 &optional 
                         (i1 (1- (length s1)))
                         (i2  (1- (length s2)))
                         (ld (make-array (list (1+ (length s1))
                                               (1+ (length s2)))
                                         :initial-element nil)
                             ldp))  ; a flag indicating that the argument was supplied
  ;; initialization of the 0-th column and row
  (unless ldp
    (dotimes (k (1+ (length s1))) (:= (aref ld k 0) k))
    (dotimes (k (1+ (length s2))) (:= (aref ld 0 k) k)))
  (values (or (aref ld (1+ i1) (1+ i2))
              (:= (aref ld (1+ i1) (1+ i2))
                  (if (eql (? s1 i1) (? s2 i2))
                      (lev-dist s1 s2 (1- i1) (1- i2) ld)
                      (1+ (min (lev-dist s1 s2 (1- i1) (1- i2) ld)
                               (lev-dist s1 s2 i1 (1- i2) ld)
                               (lev-dist s1 s2 (1- i1) i2 ld))))))
          ld))

However, if we want to also use this information to align the sequences, we'll have to make a reverse pass (here, a separate backpointers array isn't necessary as we can infer the direction of each step by reversing the distance formula).

(defun align (s1 s2)
  (with ((i1 (length s1))
         (i2 (length s2))
         ;; our Levenshtein distance procedure returns the whole DP matrix
         ;; as a second value
         (ld (nth-value 1 (lev-dist s1 s2)))
         (rez (list)))
    (loop
      (let ((min (min (aref ld (1- i1) (1- i2))
                      (aref ld     i1  (1- i2))
                      (aref ld (1- i1)     i2))))
        (cond ((= min (aref ld (1- i1) (1- i2)))
               (push (pair (? s1 (1- i1)) (? s2 (1- i2)))
                     rez)
               (:- i1)
               (:- i2))
             ((= min (aref ld (1- i1) i2))
              (push (pair (? s1 (1- i1)) nil)
                    rez)
              (:- i1))
             ((= min (aref ld i1 (1- i2)))
              (push (pair nil (? s2 (1- i2)))
                    rez)
              (:- i2))))
      (when (= 0 i1)
        (loop :for j :from (1- i2) :downto 0 :do
          (push (pair nil (? s2 j)) rez))
        (return))
      (when (= 0 i2)
        (loop :for j :from (1- i1) :downto 0 :do
          (push (pair (? s1 j) nil) rez))
        (return)))
    ;; pretty output formatting
    (with-output-to-string (s1)
      (with-output-to-string (s2)
        (with-output-to-string (s3)
          (loop :for (c1 c2) :in rez :do
            (format s1 "~C " (or c1 #\.))
            (format s2 "~C " (cond ((null c1) #\↓)
                                  ((null c2) #\↑)
                                  ((char= c1 c2) #\|)
                                  (t #\x)))
            (format s3 "~C " (or c2 #\.)))
          (format t "~A~%~A~%~A~%"
                  (get-output-stream-string s1)
                  (get-output-stream-string s2)
                  (get-output-stream-string s3)))))
  rez))

CL-USER> (align "democracy" "remorse")
d e m o c r a c y
x | | | ↑ | ↑ x x
r e m o . r . s e

CL-USER> (lev-dist "democracy" "remorse")
5
#2A((0 1 2 3 4 5 6 7)
    (1 1 2 3 4 5 6 7)
    (2 2 1 2 3 4 5 6)
    (3 3 2 1 2 3 4 5)
    (4 4 3 2 1 2 3 4)
    (5 5 4 3 2 2 3 4)
    (6 5 5 4 3 2 3 4)
    (7 6 6 5 4 3 3 4)
    (8 7 7 6 5 4 4 4)
    (9 8 8 7 6 5 5 5))

It should be pretty clear how we can also extract the edit operations during the backward pass: depending on the direction of the movement (horizontal, vertical, or diagonal), it's either an insertion, a deletion, or a substitution. The same operations may also be grouped to reduce noise. The alignment task is an example of a 2d DP problem. Hence, the diff computation has a complexity of O(n^2). There are other notable algorithms, such as CYK parsing or the Viterbi algorithm, that also use a 2d array, although they may have higher complexity than just O(n^2). For instance, CYK parsing is O(n^3), which is very slow compared to the greedy O(n) shift-reduce algorithm.
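Turning the pairs produced by align into an explicit edit script is, then, a matter of a short loop. A sketch (edit-ops is a hypothetical helper; it relies on align putting nil in the pair for a missing character):

(defun edit-ops (alignment)
  "Convert a list of character pairs into a list of edit operations."
  (loop :for (c1 c2) :in alignment
        :unless (eql c1 c2)
        :collect (cond ((null c1) (list :insert c2))
                       ((null c2) (list :delete c1))
                       (t (list :substitute c1 c2)))))

For the alignment of democracy and remorse above, it yields 5 operations, matching the distance of 5.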

However, the diff we will obtain from the basic LCS computation will still be pretty basic. There are many small improvements that are made by production diff implementations on both the UX and performance sides. Besides, the complexity of the algorithm is O(n^2), which is quite high, so practical variants perform additional optimizations to reduce the actual number of operations, at least for the common cases.

The simplest improvement is a preprocessing step that is warranted by the fact that, in many applications, the diff is performed on texts that are mostly identical and have a small number of differences between them, localized in an even smaller number of places. For instance, consider source code management, where diff plays an essential role: programmers don't tend to rewrite whole files too often; on the contrary, such practice is discouraged due to collaboration considerations.

So, some heuristics may be used in the library diff implementations to speed up such common cases:

  • check that the texts are identical
  • identify the common prefix/suffix and perform the diff only on the remaining part (see the sketch after this list)
  • detect situations when there are just one or two edits
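The prefix/suffix heuristic is a few lines of code. Here is a sketch (trim-common-affixes is a hypothetical helper, not a function of any particular diff library): it returns the differing middle parts, which are then fed to the expensive O(n^2) procedure, along with the lengths of the trimmed prefix and suffix.

(defun trim-common-affixes (s1 s2)
  (let* ((pre (or (mismatch s1 s2) (min (length s1) (length s2))))
         ;; the common suffix must not overlap with the already matched prefix
         (max-suf (- (min (length s1) (length s2)) pre))
         (suf (loop :for k :from 0 :below max-suf
                    :while (char= (char s1 (- (length s1) k 1))
                                  (char s2 (- (length s2) k 1)))
                    :count t)))
    (values (subseq s1 pre (- (length s1) suf))
            (subseq s2 pre (- (length s2) suf))
            pre suf)))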

A perfect diff algorithm will report the minimum number of edits required to convert one text into the other. However, sometimes the result is too perfect and not very good for human consumption. People expect the edited parts to be separated at token boundaries when possible; also, larger contiguous chunks are preferred to an alternation of many small changes. All these and other diff ergonomics issues may be addressed by various postprocessing tweaks.

But, besides these simple tricks, are global optimizations to the algorithm possible? After all, O(n^2) space and time requirements are still pretty significant. Originally, diff was developed for Unix by Hunt and McIlroy. Their approach computes matches in the whole file and indexes them into the so-called k-candidates, k being the LCS length. The LCS is augmented progressively by finding matches that fall within proper ordinates (following a rule explained in their paper). While doing this, each path is memoized. The problem with the approach is that it performs more computation than necessary: it memoizes all the paths, which requires O(n^2) memory in the worst case, and O(n^2 log n) for the time complexity!

The current standard approach is the divide-and-conquer Myers algorithm. It works by recursively finding the central match of two sequences with the smallest edit script. Once this is done, only the match is memoized, and the two subsequences preceding and following it are compared again recursively by the same procedure until there is nothing more to compare. Finding the central match is done by matching the ends of subsequences as far as possible, and any time it is not possible, augmenting the edit script by 1 operation, scanning each furthest position attained up to there for each diagonal and checking how far the match can expand. If two matches merge, the algorithm has just found the central match. This approach has the advantage of using only O(n) memory and executes in O(n d), where d is the edit script complexity (d is less than n, usually, much less). The Myers algorithm wins because it does not memoize the paths while working and does not need to "foresee" where to go, so it can concentrate only on the furthest positions it can reach with an edit script of the smallest complexity. The smallest complexity constraint ensures that what is found is the LCS. In a sense, the Myers algorithm, compared to the vanilla DP diff, is like Dijkstra's versus Bellman-Ford: it cuts down on the calculation of the edit distances between the substrings that don't contribute to the optimal alignment, while solving LCS by building the whole edit-distance matrix performs the computation for all the substrings.

The diff tool is a prominent example of a transition from quite an abstract algorithm to a practical utility that is an essential part of many ubiquitous software products, and of the additional work needed to ensure that the final result is not only theoretically sane but also usable.

P.S. Ever wondered how GitHub and other tools, when displaying a diff, not only show the changed lines but also highlight the exact changes within each line? The answer is given in [7].

DP in Action: Backprop

As we said in the beginning, DP has applications in many areas: from Machine Learning to graphics to Source Code Management. Literally, you can find an algorithm that uses DP in every specialized domain, and if you don't — this means you, probably, can still advance this domain and create something useful by applying DP to it. Deep Learning is the fastest-developing area of the Machine Learning domain in recent years. At its core, the discipline is about training huge multilayer functions called "neural networks" by optimization. And the principal approach to doing that, which, practically speaking, has enabled the rapid development of the machine learning techniques that we see today, is the Backpropagation (backprop) optimization algorithm.

As pointed out by Christopher Olah, for modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That's the difference between a model taking a week to train and taking 200,000 years. Beyond its use in deep learning, backprop is a computational tool that may be applied in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names there. In fact, the algorithm has been reinvented dozens of times in different fields. The general, application-independent name for it is Reverse-Mode Differentiation. Essentially, it's a technique for calculating partial derivatives quickly using DP on computational graphs.

Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression (:= e (* (+ a b) (1+ b))). There are four operations: two additions, one multiplication, and an assignment. Let's arrange those computations in the same way they would be performed on the computer:

(let ((c (+ a b))
      (d (1+ b)))
  (:= e (* c d)))

To create a computational graph, we make each of these operations, along with the input variables, into nodes. When the outcome of one expression is an input to another one, a link points from one node to another:

We can evaluate the expression by setting the values in the input nodes (a and b) to certain values and computing nodes in the graph along the dependency paths. For example, let's set a to 2 and b to 1: the result in node e will be, obviously, 6.

The derivatives in a computational graph can be thought of as edge labels. If a directly affects c, then we can write a partial derivative ∂c/∂a along the edge from a to c.

Here is the computational graph with all the derivatives for the evaluation with the values of a and b set to 2 and 1.

But what if we want to understand how nodes that aren't directly connected affect each other? Let's consider how e is affected by a. If we change a at a speed of 1, c also changes at a speed of 1. In turn, c changing at a speed of 1 causes e to change at a speed of 2. So e changes at a rate of (* 1 2) with respect to a. The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. We can see that this is, basically, the same kind of graph as the one we used to calculate the shortest path.

This is where Forward-mode differentiation and Reverse-mode differentiation come in. They're algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once. Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total derivative. Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node. Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output.

So, what if we do reverse-mode differentiation from e down? This gives us the derivative of e with respect to every node. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of the derivatives we need in one go. When training neural networks, the cost is a function of the weights of each edge. And using reverse-mode differentiation (aka backprop), we can calculate the derivatives of the cost with respect to all the weights in a single pass through the graph and then feed them into gradient descent. As there are millions or even tens of millions of weights in a neural network, reverse-mode differentiation results in a speedup of the same order!
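To make the arithmetic concrete, here is a toy, hand-unrolled version of both passes for our example expression (toy-backprop is just an illustration of the bookkeeping, not a general autodiff engine): the forward pass computes the node values, and the backward pass accumulates the derivatives of e with respect to every node, touching each edge exactly once.

(defun toy-backprop (a b)
  (let* (;; forward pass
         (c (+ a b))
         (d (1+ b))
         (e (* c d))
         ;; backward pass: seed with de/de = 1 and move towards the inputs
         (de/de 1)
         (de/dc (* de/de d))      ; e = c * d, so de/dc = d
         (de/dd (* de/de c))      ; ... and de/dd = c
         (de/da (* de/dc 1))      ; c = a + b, so dc/da = 1
         (de/db (+ (* de/dc 1)    ; b affects e through both c and d;
                   (* de/dd 1)))) ; the two paths merge here
    (values e de/da de/db)))

CL-USER> (toy-backprop 2 1)
6
2
5

The results agree with the path-summing rule above: ∂e/∂a = 2 and ∂e/∂b = 2 + 3 = 5.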

Backprop is an example of simple memoization DP. No selection of the best variant is needed, it's just a proper arrangement of the operations to avoid redundant computations.

Take-aways

DP-based algorithms may operate on one of these three levels:

  • just systematic memoization, when every intermediate result is cached and used to compute subsequent results for larger problems (Fibonacci numbers, backprop)
  • memoization + backpointers that allow for the reconstruction of the sequence of actions that lead to the final solution (text segmentation)
  • memoization + backpointers + a target function that selects the best intermediate solution (text justification, diff, shortest path)

If we want to apply DP to some task, we need to find its optimal substructure: i.e. verify that an optimal solution to a subproblem will remain a part of the optimal solution to the whole problem. Next, if we deal with an optimization task, we may have to formulate the recurrence relations. After that, it's just a matter of technique: those relations may be either programmed directly as a recursive or iterative procedure (like in LCS) or indirectly, using the method of successive approximations (like in Bellman-Ford).

Ultimately, all DP problems may be reduced to pathfinding in the graph, but it doesn't always make sense to have this graph explicitly as a data structure in the program. If it does, however, remember that Dijkstra's algorithm is the optimal algorithm to find a single shortest path in it.

DP, usually, is a reasonable next thing to think about after the naive greedy approach (which, let's be frank, everyone tends to take initially) stumbles and has to resort to backtracking. However, we saw that DP and greedy approaches do not contradict each other: in fact, they can be combined, as demonstrated by Dijkstra's algorithm. Yet, an optimal greedy algorithm is more of an exception than a rule. Although, there are a number of problems for which a top-n greedy solution (the so-called Beam search) can be a near-optimal solution that is good enough.

Also, DP doesn't necessarily mean optimal. A vanilla dynamic programming algorithm exhaustively explores the decision space, which may be excessive in many cases. This is demonstrated by the examples of Dijkstra's and Myers' algorithms, which improve on the DP solution by cutting some corners.

P.S. We have also discussed, for the first time in this book, the value of heuristic pre- and postprocessing. From the theoretical standpoint, it is not something you have to pay attention to, but, in practice, it's a very important aspect of the production implementation of many algorithms and, thus, shouldn't be frowned upon or neglected. In an ideal world, an algorithmic procedure would both have optimal worst-case complexity and be the fastest in the common cases.


[1] Note that a standalone s is also usually present in English programmatic dictionaries because, when it's and friends are tokenized, they are split into two tokens, and the apostrophe may sometimes be missing. We omit it from this example dictionary, though, so that the greedy approaches fail in the way described. Also, our dictionary is case-insensitive.

[2] The intuition for it is the following: in the worst case, every character has two choices: either to be the last letter of the previous word or the first one of the next word, hence the branching factor is 2.

[3] Actually, the condition of complete string coverage may be lifted, which will allow using almost the same algorithm while skipping over out-of-dictionary words like misspellings.

[4] A space at the end of the line is discarded.

[5] Provided all the length calculations are implemented efficiently. For simplicity, I have used plain lists here with a linear length complexity, but a separate variable may be added to avoid the extra cost.

[6] However, if we think of it, we could reuse the already proven linked representation just putting the incoming edges into the node structure instead of the outgoing ones.

[7] It runs diff twice: first, at the line-level (using each line as a single unit/token) and then at the character level, as you would normally expect. Then the results are just combined.
