2020-01-11

Programming Algorithms: Approximation

This chapter will be a collection of stuff from somewhat related but still distinct domains. What unites it is that all the algorithms we will discuss are, after all, targeted at calculating approximations to some mathematical functions. There are no advanced data structures involved, neither is the aim to find a clever way to improve the runtime of some common operations. No, these algorithms are about calculations and computing an acceptable result within the allocated time budget.

Combinatorial Optimization

Dynamic Programming is a framework that can be used for finding the optimal value of some loss function when there are multiple configurations of the problem space that result in different values. Such search is an example of discrete optimization for there is a countable number of states of the system and a distinct value of the cost function we're optimizing corresponding to each state. There are also similar problems that have an unlimited and uncountable number of states, but there is still a way to find a global or local optimum of the cost function for them. They comprise the continuous optimization domain. Why is optimization not just a specialized area relevant to a few practitioners but a toolbox that every senior programmer should know how to utilize? The primary reason is that it is applicable in almost any domain: the problem just needs to be large enough to rule out simple brute force. You can optimize how the data is stored or how the packets are routed, how the blueprint is laid out or the servers are loaded. Many people are just not used to looking at their problems this way. Also, understanding optimization is an important prerequisite for having a good grasp of machine learning, which is revolutionizing the programming world.

DP is an efficient and, overall, great optimization approach, but it can't succeed if the problem doesn't have an optimal substructure. Combinatorial Optimization approaches deal with finding a near-optimum for the problems where an exhaustive search requires O(2^n) computations. Such problems are called NP-hard and a classic example of those is the Travelling Salesman Problem (TSP). The task is to find the shortest cycle that spans all vertices of a fully-connected weighted graph. As we saw previously, this problem doesn't have an optimal substructure, i.e. an optimal partial solution isn't necessarily a part of the best overall one, and so taking the shortest edge doesn't allow the search procedure to narrow down the search space when looking at the next vertex. A direct naive approach to TSP will enumerate all the possible variants and select the one with a minimal cost. However, the number of variants is n!, so this approach becomes intractable very fast. A toy example of visiting all the capitals of the 50 US states has 50! or around 10^64 variants. This is where quantum computers promise to overturn the situation, but while we're waiting for them to mature, the only feasible approach is developing approximation methods that will get us a good enough solution in polynomial (ideally, linear) time. TSP may look like a purely theoretical problem, but it has some real-world applications. Besides vehicle routing, automated drilling and soldering in electronics is another example. Yet, even more important is that there are many other combinatorial optimization problems, and, in essence, the approaches to solving one of them apply to all the rest. I.e., like with shortest path, coming up with an efficient solution to TSP allows us to efficiently solve a very broad range of problems over a variety of domains.

So, let's write down the code for the basic TSP solution. As usual, we have to select the appropriate graph representation. From one point of view, we're dealing with a fully-connected graph, so every representation will work and a matrix one will be the most convenient. However, storing an n^2-sized array is not the best option, especially for a large n. A better "distributed" representation might be useful here. Yet, for the TSP graph, an even better approach would be to do the opposite of our usual optimization trick: trade computation for storage space. When the graph is fully-connected, usually, there exists some kind of an underlying metric space that contains all the vertices. The common example is the Euclidean space, in which each vertex has a coordinate (for example, the latitude and longitude). Anyway, whichever way to represent the vertex position is used, the critical requirement is the existence of the metric that may be calculated at any time (and fast). Under such conditions, we don't have to store the edges at all. So, our graph will be just a list of vertices.

Let's use the example with the US state capitals. Each vertex will be represented as a pair of floats (lat & lon). We can retrieve the raw data from the Wikipedia article about the US capitols (with an 'o') and extract the values we need with the following code snippet[1], which cuts a few corners:

(defstruct city
  name lat lon)

(defparameter *wp-link* "https://en.wikipedia.org/w/index.php?title=List_of_state_and_territorial_capitols_in_the_United_States&action=edit&section=1")

(defparameter *cs*
  (with ((raw (drakma:http-request *wp-link*))
         (coords-regex (ppcre:create-scanner "\\{\\{coord\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|type"))
         (capitals (list)))
    (flet ((dms->rad (vec off)
             (* (/ pi 180)
                (+ (? vec (+ off 0))
                   (/ (? vec (+ off 1)) 60)
                   (/ (? vec (+ off 2)) 3600)))))
      (dolist (line (split #\Newline (slice raw
                                            (search "{| class=\"wikitable sortable\"" raw)
                                            (search "</textarea><div class='editOptions'>" raw))))
        (when-it (and (starts-with "|" line)
                      (search "{{coord" line))
          (with ((_ coords (ppcre:scan-to-strings coords-regex line))
                 (coords (map* 'read-from-string coords)))
            (push (make-city :name (slice line (position-if 'alpha-char-p line)
                                          (position-if (lambda (ch) (member ch '(#\] #\|)))
                                                       line :start 1))
                             :lat (dms->rad coords 0)
                             :lon (dms->rad coords 3))
                  capitals)))))
    (coerce capitals 'vector)))

CL-USER> (length *cs*)
50

We also need to define the metric. The calculation of distances on Earth, though, is not as straightforward as on a plane. Usually, as a first approximation, the haversine formula is used: it provides an estimate of the shortest distance over the surface "as-the-crow-flies" (ignoring the relief).

(defun earth-dist (c1 c2)
  (with ((lat1 (? c1 'lat))
         (lat2 (? c2 'lat))
         (a (+ (expt (sin (/ (- lat2 lat1) 2))
                     2)
               (* (cos lat1)
                  (cos lat2)
                  (expt (sin (/ (- (? c2 'lon) (? c1 'lon)) 2)) 
                        2)))))
    (* 1.2742e7  ; Earth diameter
       (atan (sqrt a) (sqrt (- 1 a)))))) 

With the metric at our disposal, let's define the function that will calculate the length of the whole path and use it for a number of random paths (we'll use the RUTILS function shuffle to produce a random path).

(defun path-length (path)
  (let ((rez (earth-dist (? path 0) (? path -1))))
    (dotimes (i (1- (length path)))
      (:+ rez (earth-dist (? path i) (? path (1+ i)))))
    rez))

CL-USER> (path-length *cs*)
9.451802301259182d7
CL-USER> (path-length (shuffle *cs*))
9.964776273250546d7
CL-USER> (path-length (shuffle *cs*))
1.009761841183094d8

We can see that an average path may have a length of around 100 thousand kilometers (the path-length values are in meters). However, we don't know anything about the shortest or the longest one, and to find out reliably, we'd have to evaluate 50! paths... Yet, even as we accept the sad fact that this is not possible with our current technology, it's not time to give up yet. We may not be able to find the absolute best path, but at least we can try to improve on the random one. Already, the three previous calculations varied by around 5%. So, if we're lucky, maybe we could hit a better path purely by chance. Let's try a thousand paths using our usual argmin pattern:

(defun random-search (path n)
  (let ((min (path-length path))
        (arg path))
    (loop :repeat n :do
      (with ((path (shuffle path))
             (len (path-length path)))
        (when (< len min)
          (:= min len
              arg path))))
    (values arg
            min)))

CL-USER> (:= *print-length* 2)
2
CL-USER> (random-search *cs* 1000)
(#S(CITY :NAME "Atlanta" :LAT 0.5890359059538811d0 ...)
 #S(CITY :NAME "Montpelier, Vermont" :LAT 0.772521512027179d0 ...) ...)
7.756170773802838d7

OK, we've got a sizable 20% improvement. What about 1,000,000 combinations?

CL-USER> (time (random-search *cs* 1000000))
Evaluation took:
  31.338 seconds of real time
...
(#S(CITY :NAME "Boise, Idaho" :LAT 0.7612723873453388d0 ...)
 #S(CITY :NAME "Helena, Montana" :LAT 0.813073800024579d0 ...) ...)
6.746660953705506d7

Cool, another 15%. Should we continue increasing the size of the sample? Maybe, after a day of computations, we could get the path length down by another 20-30%. And that's already a good gain. Surely, we could also parallelize the algorithm or use a supercomputer in order to analyze many more variants. But there should be something smarter than simple brute force, right?

Local Search

Local Search is the "dumbest" of these smart approaches, built upon the following idea: if we had a way to systematically improve our solution, instead of performing purely random sampling, we could arrive at better variants much faster. The local search procedure starts from a random path and continues improving it until a local optimum is reached. This optimum will be local (hence the name), but it will still be better than what we started with. Besides, we could run the optimization procedure many times from different initial points, basically, retaining the benefits of the brute force approach. We can think of such multi-start local search as sampling + optimization.

(defun local-search (path improve-fn)
  (let ((min (path-length path))
        (cc 0))  ; iteration count
    (loop
      (:+ cc)
      (if-it (call improve-fn path)
             (:= min (path-length it)
                 path it)
             (return (values path
                             min
                             cc))))))

For this code to work, we also need to supply the improve-fn. Coming up with it is where the creativity of the algorithmic researcher needs to be channeled. Different problems (and even a single problem) may allow for different approaches. For TSP, several improvement possibilities have been discovered so far. And all of them use the planar (2d) nature of the graph we're processing. It is an additional constraint that has a useful consequence: if the paths between two pairs of nodes intersect, there are definitely shorter paths between them that are nonintersecting. So, swapping the edges will improve the whole path. If we were to draw a picture of this swap, it would look like this (the edges A-D and C-B intersect, while A-B and C-D don't and hence their total length is shorter):

 - A   B -      - A - B -
     X      ==>
 - C   D -      - C - D -

This rule allows us to specify the so-called 2-opt improvement procedure:

(defun 2-opt (path)
  (loop :repeat (* 2 (length path)) :do
    (with ((len (length path))
           ;; select two random edges: (v1 v1*) and (v2 v2*)
           (v1 (random len))
           (v1* (if (= #1=(1+ v1) len) 0 #1#))
           (v2 (loop :for v := (random len)
                     :when (and (/= v v1) (/= v (1- v1))) :do (return v)))
           (v2* (if (= #2=(1+ v2) len) 0 #2#)))
      ;; check if the new edges (v1 v2) and (v1* v2*)
      ;; are shorter than the existing ones
      (when (< (+ (path-length (vec (? path v1) (? path v2)))
                  (path-length (vec (? path v1*) (? path v2*))))
               (+ (path-length (vec (? path v1) (? path v1*)))
                  (path-length (vec (? path v2) (? path v2*)))))
        ;; perform the swap by reversing the part of the path
        ;; between the two affected edges
        (let ((beg (min v1* v2*))
              (end (max v1* v2*)))
          (return (concatenate 'vector
                               (subseq path 0 beg)
                               (reverse (subseq path beg end))
                               (subseq path end))))))))

Note that we do not need to perform a complicated check for path intersection (which requires an algorithm of its own, and there are a number of papers dedicated to this task). In fact, we don't care if there is an intersection: we just need to know that the new path, which consists of the newly replaced edges and a reversed part of the path between the two inner nodes of the old edges, is shorter. One more thing to notice is that this implementation doesn't perform an exhaustive analysis of all possible edge swaps, which is suggested by the original 2-opt algorithm (an O(n^2) operation). Here, we select just a random pair. Both variants are acceptable, and ours is simpler to implement.

CL-USER> (local-search *cs* '2-opt)
#(#S(CITY :NAME "Jackson, Mississippi" :LAT 0.5638092223095238d0 ...)
  #S(CITY :NAME "Baton Rouge, Louisiana" :LAT 0.5315762080646039d0 ...) ...)
3.242702077795514d7
111

So, outright, we've got a more than 2x improvement over the random-search path that was obtained after a much larger number of iterations. Iteration counting was added to the code in order to estimate the work we had to do. To make a fair comparison, let's run random-search with the same n (111):

CL-USER> (random-search *cs* 111)
#(#S(CITY :NAME "Boise, Idaho" :LAT 0.7612723873453388d0 ...)
  #S(CITY :NAME "Springfield, Illinois" :LAT 0.6946151297363367d0 ...) ...)
7.522044767585556d7

But this is still not 100% fair as we haven't yet factored in the time needed for the 2-opt call, which is much heavier than what random search does per iteration. By my estimates, 111 iterations of local-search took 4 times as long, so...

CL-USER> (random-search *cs* 444)
#(#S(CITY :NAME "Lansing, Michigan" :LAT 0.745844229097319d0 ...)
  #S(CITY :NAME "Springfield, Illinois" :LAT 0.6946151297363367d0 ...) ...)
7.537249874357127d7

Now, the runtimes are the same, but there's not really much improvement in the random search outcome. That's expected for, as we have already observed, achieving a significant improvement in random-search results requires performing orders of magnitude more operations.

Finally, let's define multi-local-search to leverage the power of random sampling:

(defun multi-local-search (path n)
  (let ((min (path-length path))
        (arg path))
    (loop :repeat n :do
      (with ((cur (local-search (shuffle path) '2-opt)))
        (when (< #1=(path-length cur) min)
          (:= min #1#
              arg cur))))
    (values arg
            min)))

CL-USER> (time (multi-local-search *cs* 1000))
Evaluation took:
  22.394 seconds of real time
...
#(#S(CITY :NAME "Atlanta" :LAT 0.5890359059538811d0 ...)
  #S(CITY :NAME "Montgomery, Alabama" :LAT 0.5650930224896327d0 ...) ...)
2.8086843039667137d7

Quite a good improvement that took only 20 seconds to achieve!

As a final touch, let's draw the paths on the map. It's always good to double-check the result using some visual approach when it's available. Here is our original random path (Anchorage and Honolulu are a bit off due to the issues with the map projection):

This is the result of random search with a million iterations:

And this is our multistart local search outcome. Looks nice, doesn't it?

2-opt is the simplest path improving technique. There are more advanced ones like 3-opt and Lin-Kernighan heuristic. Yet, the principle remains the same: for local search to work, we have to find a way to locally improve our current best solution.

Another direction of the development of the basic algorithm, besides better local improvement procedures and trying multiple times, is devising a way to avoid being stuck in local optima. Simulated Annealing is the most well-known technique for that. The idea is to replace the unconditional selection of a better variant (if it exists) with a probabilistic one. The name and inspiration for the technique come from the physical process of cooling molten materials down to the solid state. When molten steel is cooled too quickly, cracks and bubbles form, marring its surface and structural integrity. Annealing is a metallurgical technique that uses a disciplined cooling schedule to efficiently bring the steel to a low-energy, optimal state. The application of this idea to the optimization procedure introduces the temperature parameter T. At each step, a new candidate state is produced from the current one. For instance, it can be achieved using 2-opt, although the algorithm doesn't require the candidate to necessarily be better than the current state, so even such a simple thing as a random swap of vertices in the path is admissible. Next, unlike with local search, the transition to the candidate doesn't happen unconditionally: a candidate that is worse than the current state may still be accepted, with a probability that grows with the temperature (the classic choice is (exp (- (/ delta T))), where delta is the increase in the cost function). Initially, we start with a high value of T and then decrease it following some annealing schedule. Eventually, T falls to 0 towards the end of the allotted time budget. In this way, the system is expected to wander, at first, towards a broad region of the search space containing good solutions, ignoring small fluctuations; then the drift towards low-energy regions becomes narrower and narrower; and, finally, it transitions to ordinary local search according to the steepest descent heuristic.
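To make the idea more concrete, here is a minimal sketch of Simulated Annealing applied to our TSP setup. It is just a sketch, not a definitive implementation: the transition procedure (a plain random swap of two vertices), the exponential cooling schedule, and the starting temperature t0 are all assumptions that would need tuning for real use.

(defun random-swap (path)
  (with ((rez (copy-seq path))
         (i (random (length rez)))
         (j (random (length rez))))
    (rotatef (? rez i) (? rez j))
    rez))

(defun annealed-search (path &key (t0 1d6) (decay 0.999) (n 100000))
  (let ((cur (copy-seq path))
        (cost (path-length path))
        (temp t0))
    (loop :repeat n :do
      (with ((cand (random-swap cur))
             (cand-cost (path-length cand))
             (delta (- cand-cost cost)))
        ;; always accept an improvement; accept a worse candidate
        ;; with a probability that falls together with the temperature
        (when (or (minusp delta)
                  (< (random 1.0) (exp (- (/ delta temp)))))
          (:= cur cand
              cost cand-cost))
        ;; cool down according to the schedule
        (:= temp (* temp decay))))
    (values cur
            cost)))

Note how, with the acceptance probability (exp (- (/ delta temp))), a high temperature makes transitions to worse candidates likely, while a low one makes them nearly impossible.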

Evolutionary Algorithms

Local search is the simplest example of a family of approaches that are collectively called Metaheuristics. All the algorithms from this family operate, in general, by sampling and evaluating a set of solutions which is too large to be completely evaluated. The difference is in the specific approach to sampling that is employed.

A prominent group of metaheuristic approaches is called Evolutionary (and/or nature-inspired) algorithms. It includes such methods as Genetic Algorithms, Ant Colony and Particle Swarm Optimization, Cellular and even Grammatical Evolution. The general idea is to perform optimization in parallel by maintaining the so-called population of states and altering this population using a set of rules that improve the aggregate quality of the whole set while permitting some outliers in hopes that they may lead to better solutions unexplored by the currently fittest part of the population.

We'll take a brief glance at evolutionary approaches using the example of Genetic Algorithms, which are, probably, the most well-known technique among them. The genetic algorithm (GA) views each possible state of the system as an individual "genome" (encoded as a vector). GA is best viewed as a framework that requires specification of several procedures that operate on the genomes of the current population:

  • The initialization procedure which creates the initial population. After it, the size of the population remains constant, but each individual may be replaced with another one obtained by applying the evolution procedures.
  • The fitness function that evaluates the quality of the genome and assigns some weight to it. For TSP, the length of the path is the fitness function. For this problem, the smaller the value of the function, the better.
  • The selection procedure specifies which items from the population to use for generating new variants. In the simplest case, this procedure can use the whole population.
  • The evolution operations which may be applied. The usual GA operations are mutation and crossover, although others can also be devised.

Mutation operates on a single genome and alters some of its slots according to a specified rule. 2-opt may be a valid mutation strategy, although even the generation of a random permutation of the TSP nodes may work if it is applied to a part of the genome and not to the whole. By controlling the magnitude of mutation (what portion of the genome is allowed to be involved in it) it is possible to choose the level of stochasticity in this process. But the key idea is that each change should retain at least some resemblance with the previous version, or we'll just end up with stochastic search.

The crossbreeding operation isn't, strictly speaking, necessary in the GA, but some of the implementations use it. This process transforms two partial solutions into two others by swapping some of their parts. Of course, it's not possible to apply it directly to TSP, as that would result in the violation of the main problem constraint of producing a loop that spans all the nodes. Instead, another procedure called ordered crossover should be used. Without crossbreeding, GA may be considered a parallel version of local search.
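For illustration, here is a minimal sketch of one possible ordered crossover variant that produces a single child (assuming, as before, that the paths are vectors of the same city structs): a random slice of the first parent is kept in place, and the remaining positions are filled with the rest of the cities in the order they appear in the second parent.

(defun ordered-crossover (p1 p2)
  (with ((n (length p1))
         (beg (random n))
         (end (+ beg (random (- n beg))))
         ;; the slice of the first parent that is kept in place
         (slice (subseq p1 beg (1+ end)))
         ;; the second parent's cities that are not in the slice,
         ;; in their original order
         (rest (remove-if (lambda (city) (find city slice)) p2))
         (child (make-array n))
         (k 0))
    (dotimes (i n)
      (if (<= beg i end)
          (:= (? child i) (? slice (- i beg)))
          (progn (:= (? child i) (? rest k))
                 (:+ k))))
    child))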

Here is the basic GA skeleton. It requires definition of the procedures init-population, select-candidates, mutate, crossbreed, and score-fitness.

(defun ga (population-size &key (n 100))
  (let ((genomes (init-population population-size)))
    (loop :repeat n :do
      (let ((candidates (select-candidates genomes)))
        (dolist (ex (mapcar 'mutate candidates))
          (push ex genomes))
        (dolist (ex (crossbreed candidates))
          (push ex genomes)))
      ;; in our TSP setting, the smaller the fitness, the better,
      ;; so we sort in ascending order of the score
      (:= genomes (take population-size (sort genomes '< :key 'score-fitness))))
    genomes))

This template is not a gold standard; it can be tweaked and altered, but you get the general idea. The other evolutionary optimization methods also follow the same principles but define different ways to evolve the population. For example, Particle Swarm Optimization operates by moving candidate solutions (particles) around in the search space according to simple mathematical formulae over their position and velocity. The movement of each particle is influenced by its local best known position, as well as guided toward the global best known positions in the search space. And those are, in turn, updated as better positions are found by other particles. By the way, the same idea underlies the Particle Filter algorithm used in signal processing and statistical inference.
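As an illustration, here is a minimal sketch of the PSO update step for a single particle. The particle struct, the use of a single global best, and the values of the coefficients are all assumptions of this particular variant.

(defstruct particle
  pos vel best)  ; BEST is the particle's own best known position

(defun pso-update (particle global-best &key (w 0.7) (c1 1.5) (c2 1.5))
  (with ((pos (particle-pos particle))
         (vel (particle-vel particle))
         (best (particle-best particle)))
    (dotimes (i (length pos))
      ;; the new velocity combines inertia, attraction to the particle's
      ;; own best position, and attraction to the global best one
      (:= (? vel i) (+ (* w (? vel i))
                       (* c1 (random 1.0) (- (? best i) (? pos i)))
                       (* c2 (random 1.0) (- (? global-best i) (? pos i)))))
      (:+ (? pos i) (? vel i)))))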

Branch & Bound

Metaheuristics can be, in general, classified as local search optimization methods for they operate in a bottom-up manner by selecting a random solution and trying to improve it by gradual change. The opposite approach is global search that tries to systematically find the optimum by narrowing the whole problem space. We have already seen the same pattern of two alternative ways to approach the task — top-down and bottom-up — in parsing, and it also manifests in other domains that permit problem formulation as a search task.

How is a top-down systematic evaluation of the combinatorial search space even possible? Obviously, not in its entirety. However, there are methods that allow the algorithm to rule out significant chunks that certainly contain suboptimal solutions and narrow the search to only the relevant portions of the domain that may be much smaller in cardinality. If we manage to discard, this way, a large number of variants, we have more time to evaluate the other parts, thus achieving better results (for example, with Local search).

The classic global search is represented by the Branch & Bound method. It views the set of all candidate solutions as a rooted tree with the full set being at the root. The algorithm explores branches of this tree, which represent subsets of the solution set. Before enumerating the candidate solutions of a branch, the branch is checked against upper and lower estimated bounds on the optimal solution and is discarded if it cannot produce a better solution than the best one found so far by the algorithm. The key feature of the algorithm is efficient bounds estimation. When it is not possible, the algorithm degenerates to an exhaustive search.

Here is a skeleton B&B implementation. Similar to the one for Genetic Algorithms, it relies on providing implementations of the key procedures separately for each search problem. For the case of TSP, the function will accept a graph and all the permutations of its vertices comprise the search space. We'll use the branch struct to represent the subspace we're dealing with. We can narrow down the search by pinning a particular subset of edges: this way, the subspace will contain only the variants originating from the possible permutations of the vertices that are not attached to those edges.

(defstruct branch
  (upper most-positive-fixnum)
  (lower 0)
  (edges (list)))

The b&b procedure will operate on the graph g and will have an option to either work until the shortest path is found or terminate after n steps.

(defun b&b (g &key n)
  (with ((cur (vertices g))
         (min (cost cur))
         (arg cur)
         (q (list (make-branch :upper min :lower (lower-bound g ())))))
    (loop :for i :from 0
          :for branch := (pop q) :while branch :do
      (when (eql i n) (return))
      (if (branchp branch)
          ;; the branch may be further subdivided
          (dolist (item (branch-out branch))
            ;; we leave only the subbranches that can,
            ;; at least in theory, improve on the current solution
            (when (< (branch-lower item) min)
              (push item q)))
          ;; the branch is, in fact, a leaf, i.e. a candidate solution
          (let ((cost (branch-upper branch)))
            (when (< cost min)
              (:= min cost
                  arg branch)))))
    (values arg
            min)))

The branch-out function is rather trivial: it will generate all the possible variants by expanding the current edge set with a single new edge, and it will also calculate the bounds for each variant. The most challenging part is figuring out the way to compute the lower-bound. The key insight here is the observation that each path in the graph is not shorter than half the sum of the shortest edges attached to each vertex. So, the lower bound for a branch with pinned edges e1, e2, and e3 will be the sum of the lengths of these edges plus half the sum of the shortest edges attached to all the other vertices that those edges don't cover. It is the most straightforward and raw approximation that will allow the algorithm to operate. It can be further improved upon — a homework task for the reader is to devise ways to make it more precise and to estimate whether they are worth applying in terms of computational complexity.
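Here is a sketch of this estimation under the stated assumptions. The helpers edge-length, vertex-edges, and edge-covers? are hypothetical: their implementations depend on the chosen graph representation.

(defun lower-bound (g pinned-edges)
  (with ((pinned-len (reduce '+ (mapcar 'edge-length pinned-edges)))
         (rest-len 0))
    (dolist (v (vertices g))
      ;; only the vertices not covered by the pinned edges
      ;; contribute their shortest attached edge
      (unless (some (lambda (e) (edge-covers? e v)) pinned-edges)
        (:+ rest-len (reduce 'min (mapcar 'edge-length (vertex-edges g v))))))
    (+ pinned-len (/ rest-len 2))))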

B&B may also use additional heuristics to further optimize its performance at the expense of producing a slightly more suboptimal solution. For example, one may wish to stop branching when the gap between the upper and lower bounds becomes smaller than a certain threshold. Another improvement may be to use a priority queue instead of the plain stack used in the example, in order to process the most promising branches first.

One more thing I wanted to mention in the context of global heuristic search is Monte Carlo Tree Search (MCTS), which, in my view, uses a very similar strategy to B&B. It is the currently dominant method for finding near-optimal paths in the decision trees of turn-based and other similar games (like Go or chess). The difference between B&B and MCTS is that, typically, B&B will use a conservative exact lower bound for determining which branches to skip. MCTS, instead, estimates the potential of a branch to yield the optimal solution by sampling a number of random items from the branch and averaging their scores. So, it can be considered a "softer" variant of B&B. The two approaches can also be combined, for example, to prioritize the branches in the B&B queue. The term "Monte Carlo", by the way, is applied to many algorithms that use uniform random sampling as the basis of their operation.

Gradient Descent

The key idea behind Local Search was to find a way to somehow improve the current best solution and change it in that direction. It can be similarly utilized when switching from discrete problems to continuous ones. And in this realm, the direction of improvement (actually, the best possible one) is called the gradient (or rather, the opposite of the gradient). Gradient Descent (GD) is the principal optimization approach in the continuous space, and it works in the same manner as Local Search: find the direction of improvement and advance along it. There's also a vulgar name for this approach: hill climbing. It has a lot of variations and improvements that we'll discuss in this chapter. But we'll start with the code for the basic algorithm. Once again, it will be a template that can be filled in with specific implementation details for the particular problem. We see this "framework" pattern recurring over and over in optimization methods as most of them provide a general solution that can be applied in various domains and be appropriately adjusted for each one.

(defun gd (fn data &key n (learning-rate 0.1) (precision 1e-6))
  (let ((ws (init-weights fn))
        (cost (cost fn ws))
        (i 0))
    (loop
      (update-weights ws learning-rate
                      (grad fn ws data))
      ;; compare the previous value of the cost function
      ;; with the one obtained after the update
      (let ((prev cost))
        (:= cost (cost fn ws))
        (when (or (< (abs (- prev cost)) precision)
                  (eql n (:+ i)))
          (return))))
    (values ws
            cost)))

This procedure optimizes the weights (ws) of some function fn. Moreover, whether or not we know the mathematical formula for fn doesn't really matter: the key is to be able to compute grad, which may be done analytically (using a formula that is just coded) or in a purely data-driven fashion (which is what Backprop, seen in the previous chapter, does). ws will usually be a vector or a matrix and grad will be an array of the same dimensions. In the simplest (and uninteresting) toy case, both are just scalar numbers.

Besides, in this framework, we need to define the following procedures:

  • init-weights sets the starting values in the ws vector according to fn. There are several popular ways to do that: the obvious all-zeroes initialization, which doesn't work in conjunction with backprop; sampling from a uniform distribution with a small amplitude; more advanced heuristics like Xavier initialization.
  • update-weights has a simple mathematical formulation: (:- ws (* learning-rate gradient)). But as ws is usually a multi-dimensional structure, in Lisp we can't just use - and * on it as these operations are reserved for dealing with numbers (a minimal elementwise sketch is shown after this list).
  • it is also important to be able to calculate the cost function (also often called "loss"). As you can see from the code, the GD procedure may terminate in two cases: either it has used the whole iteration budget assigned to it, or it has approached the optimum so closely that, at each new iteration, the change in the value of the cost function is negligible. Apart from this usage, tracking the cost function is also important for monitoring the "learning" process (another name for the optimization procedure, popular in this domain). If GD is operating correctly, the cost should monotonically decrease at each step.
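Here is the promised minimal sketch of an elementwise update-weights for the simplest case when ws and the gradient are one-dimensional arrays of the same size (the generalization to matrices is straightforward):

(defun update-weights (ws learning-rate grad)
  (dotimes (i (length ws))
    ;; the elementwise analog of (:- ws (* learning-rate grad))
    (:- (? ws i) (* learning-rate (? grad i)))))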

This gd template is the most basic one, and you can see a lot of ways of its further improvement and tuning. One important direction is controlling the learning rate: similar to Simulated Annealing, it may change over time according to some schedule or heuristics.
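For instance, here is one common schedule, exponential decay of the learning rate with the iteration number, as a hypothetical helper (just one of many possible choices):

(defun decayed-lr (lr0 decay i)
  ;; LR0 is the initial rate, DECAY is a constant slightly below 1
  (* lr0 (expt decay i)))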

Another set of issues that we won't elaborate upon now are related to dealing with numeric precision, and they also include such problems as vanishing/exploding gradients.

Improving GD

In the majority of interesting real-world optimization problems, the gradient can't be computed analytically using a formula. Instead, it has to be recovered from the data, and this is a computationally-intensive process: for each item in the dataset, we have to run the "forward" computation and then compute the gradient in the "backward" step. A diametrically opposite approach in terms of both computation speed and quality of the gradient would be to take just a single item and use the gradient for it as an approximation of the actual gradient. From the statistics point of view, after a long sequence of such samples, we have to converge to some optimum anyway. This technique, called Stochastic Gradient Descent (SGD), can be considered a form of combining sampling with gradient descent. Yet, sampling could also be applied to the dataset directly: instead of a single item, take a small batch of them. The latter approach is usually called Mini-Batch Gradient Descent, and it combines the best of both worlds: decent performance and a much more predictable gradient that is closer to the actual value, which is more suitable for supporting the more advanced approaches, such as momentum.

In essence, momentum smooths the gradient updates that are calculated on batches of samples, making them less prone to oscillation due to the random fluctuations among the batches. It is, basically, achieved by applying a moving average to the gradient. Different momentum-based algorithms operate by combining the currently computed value of the update with the previous value. For example, simple SGD with momentum will have the following update code:

(let ((dws 0))  ; the accumulated momentum
  (loop
    (with ((batch (sample batch-size data))
           (g (calculate-gradient batch)))
      ;; the new update combines the decayed previous updates
      ;; with the current gradient step
      (:= dws (- (* decay-rate dws)
                 (* learning-rate g)))
      (:+ ws dws))))

An alternative variant is called the Nesterov accelerated gradient which uses the following update procedure:

(let ((dws 0))
  (loop
    ;; the "lookahead" step: apply the decayed momentum first...
    (:+ ws (* decay-rate dws))
    (with ((batch (sample batch-size data))
           (g (- (* learning-rate (calculate-gradient batch)))))
      ;; ...and only then compute the gradient (at the new point)
      ;; and perform the gradient-based update
      (:= dws (+ (* decay-rate dws) g))
      (:+ ws g))))

I.e., we first perform the update using the previous momentum, and only then calculate the gradient and perform the gradient-based update. The motivation for it is the following: while the gradient term always points in the right direction, the momentum term may not. If the momentum term points in the wrong direction or overshoots, the gradient can still "go back" and correct it in the same update step.

Another direction of GD improvement is using an adaptive learning rate. For instance, the famous Adam algorithm tracks a per-cell learning rate for the ws matrix.
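A minimal elementwise sketch of the Adam update, under the stated assumptions, may look as follows. The arrays m and v are the running moment estimates that have to be kept between the calls, step counts the updates starting from 1, and the default coefficients are the ones commonly cited in the literature.

(defun adam-update (ws grad m v step &key (learning-rate 0.001)
                                          (b1 0.9) (b2 0.999) (eps 1e-8))
  (dotimes (i (length ws))
    ;; update the moving averages of the gradient and of its square
    (:= (? m i) (+ (* b1 (? m i)) (* (- 1 b1) (? grad i)))
        (? v i) (+ (* b2 (? v i)) (* (- 1 b2) (expt (? grad i) 2))))
    ;; correct the initial bias of the estimates towards zero
    (with ((mhat (/ (? m i) (- 1 (expt b1 step))))
           (vhat (/ (? v i) (- 1 (expt b2 step)))))
      (:- (? ws i) (/ (* learning-rate mhat)
                      (+ (sqrt vhat) eps))))))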

These are not all the ways in which plain gradient descent may be made more sophisticated in order to converge faster. I won't mention here second-order methods or conjugate gradients. Numerous papers exploring this space continue to be published.

Sampling

Speaking of sampling, which we have mentioned several times throughout this book: I think this is a good place to present a couple of simple sampling tricks that may prove useful in many different problems.

The sampling that is used in SGD is the simplest form of random selection: pick a random element from the set and repeat the specified number of times. This sampling is called "with replacement". The reason for this name is that, after picking an element, it is not removed from the set (i.e. it can be considered "replaced" by an equal element), and so it can be picked again. Such an approach is the simplest one to implement and reason about. There's also the "without replacement" version that removes the element from the set after selecting it. It ensures that each element may be picked only once, but it also changes the probabilities of picking the remaining elements on subsequent iterations.

Here is an abstract (as we don't specify the representation of the set and the related size, remove-item, and empty? procedures) implementation of these sampling methods:

(defun sample (n set &key (with-replacement t))
  (loop :repeat n
        :for i := (random (size set))
        :collect (? set i)
        :unless with-replacement :do
          (remove-item set i)
          (when (empty? set) (loop-finish))))

This simplest approach samples from a uniform probability distribution, i.e. it assumes that the elements of the set have an equal chance of being selected. In many tasks, these probabilities have to be different. For such cases, a more general sampling implementation is needed:

(defun sample-from-dist (n dist)
  ;; here, DIST is a hash-table with keys being items
  ;; and values — their probabilities
  (let ((scale (reduce '+ (vals dist))))
    (loop :repeat n :collect
      (let ((r (* scale (random 1.0)))
            (acc 0))
        (dotable (k v dist)
          (:+ acc v)
          (when (>= acc r)
            (return k)))))))

CL-USER> (sample-from-dist 10 #h(:foo 2
                                 :quux 1
                                 :baz 10))
(:BAZ :BAZ :BAZ :QUUX :BAZ :BAZ :BAZ :BAZ :BAZ :FOO)

I'm surprised how often I have to retell this simple sampling technique. In it, all the items are placed on a [0, 1) interval occupying the parts proportionate to their weight in the probability distribution (:baz will have 80% of the weight in the distribution above). Then we put a random point in this interval and determine in which part it falls.

The final sampling approach I'd like to show here — quite a popular one for programming interviews — is Reservoir Sampling. It deals with uniform sampling from an infinite set. Well, how do you represent an infinite set? For practical purposes, it can be thought of as a stream. So, the items are read sequentially from this stream and we need to decide which ones to collect and which to skip. This is achieved by the following procedure:

(defun reservoir-sample (n stream)
  (let ((rez (make-array n :initial-element nil)))  ; reservoir
    (handler-case
        (loop :for item := (read stream)
              :for i :from 0
              :for r := (random (1+ i))
              :do (cond
                    ;; initially, fill the reservoir with the first N items
                    ((< i n) (:= (? rez i) item))
                    ;; afterwards, replace the item at the random index R
                    ;; (so the I-th item lands in the reservoir
                    ;; with probability (/ n (1+ i)))
                    ((< r n) (:= (? rez r) item))))
      ;; sampling stops when the stream is exhausted
      ;; (here, the items are read from an input stream)
      (end-of-file () rez))))

CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")
           (reservoir-sample 3 in))
#(BAR BAZ FOO)
CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")
           (reservoir-sample 3 in))
#(FOO FOO FOO)
CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")
           (reservoir-sample 3 in))
#(BAZ FOO FOO)
CL-USER> (with-input-from-string (in (format nil "~{~A ~}"
                                             (loop :for i :from 0 :to 100 :collect i)))
           (reservoir-sample 10 in))
#(30 42 66 68 76 5 22 39 51 24)  ; note that 5 stayed at the position where it was placed initially

Matrix Factorization

Matrix factorization is a decomposition of a matrix into a product of matrices. It has many different variants that fit particular classes of problems. Matrix factorization is a computationally-intensive task that has many applications: from machine learning to information retrieval to data compression. Its use cases include background removal in images, topic modeling, collaborative filtering, CT scan reconstruction, etc.

Among many factorization methods, the following two stand out as the most prominent: Singular Value Decomposition (SVD) and non-negative matrix factorization/non-negative sparse coding (NNSC). NNSC is interesting as it produces much sharper vectors that still remain sparse, i.e. all the information is concentrated in the non-null slots.

Singular Value Decomposition

SVD is the generalization of the eigendecomposition (which is defined only for square matrices) to any matrix. It is extremely important as the eigenvectors define the basis of the matrix and the eigenvalues — the relative importance of the eigenvectors. Once SVD is performed, using the obtained vectors, we can immediately figure out a lot of useful properties of the dataset. Thus, SVD is behind such methods as PCA in statistical analysis, LSI topic modeling in NLP, etc.

Formally, the singular value decomposition of an m x n matrix M is a factorization of the form (* U S V), where U is an m x m unitary matrix, V is an n x n unitary matrix, and S (usually, greek sigma) is an m x n rectangular diagonal matrix with non-negative real numbers on the diagonal. The columns of U are left-singular vectors of M, the rows of V are right-singular vectors, and the diagonal elements of S are known as the singular values of M.

The singular value decomposition can be computed either analytically or via approximation methods. The analytic approach is not tractable for large matrices — the ones that occur in practice. Thus, approximation methods are used. One of the well-known algorithms is QuasiSVD that was developed as a result of the famous Netflix challenge in the 2000s. The idea behind QuasiSVD is, basically, gradient descent. The algorithm approximates the decomposition with random matrices and then iteratively improves it using the following formula:

(defun svd-1 (u v rank training-data &key (learning-rate 0.001))
  (dotimes (f rank)
    (loop :for (i j val) :in training-data :do
      ;; the error of the current prediction
      (let ((err (- val (predict-rating rank u i v j))))
        ;; nudge both factors in the direction that reduces the error
        (:+ (aref u f i) (* learning-rate err (aref v f j)))
        (:+ (aref v f j) (* learning-rate err (aref u f i)))))))

The described method is called QuasiSVD because the singular values are not explicit: the decomposition is into just two matrices of non-unit vectors. Another constraint of the algorithm is that the rank of the decomposition (the number of features) should be specified by the user. Yet, for practical purposes, this is often what is actually needed. Here is a brief description of the usage of the method for predicting movie ratings in the Netflix challenge.

For visualizing the problem, it makes sense to think of the data as a big sparsely filled matrix, with users across the top and movies down the side, and each cell in the matrix either contains an observed rating (1-5) for that movie (row) by that user (column) or is blank meaning you don't know. This matrix would have about 8.5 billion entries (number of users times number of movies). Note also that this means you are only given values for one in 85 of the cells. The rest are all blank.

The assumption is that a user's rating of a movie is composed of a sum of preferences about the various aspects of that movie. For example, imagine that we limit it to forty aspects, such that each movie is described only by forty values saying how much that movie exemplifies each aspect, and correspondingly each user is described by forty values saying how much they prefer each aspect. To combine these all together into a rating, we just multiply each user preference by the corresponding movie aspect, and then add those forty leanings up into a final opinion of how much that user likes that movie. [...] Such a model requires (* 40 (+ 17k 500k)) or about 20M values — 400 times less than the original 8.5B.

Here is the function that approximates the rating. The QuasiSVD matrix u is user-features and v is movie-features. As you see, we don't need to further factor u and v into the matrix of singular values and the unit vector matrices.

(defun predict-rating (rank user-features user movie-features movie)
  (loop :for f :from 0 :below rank
        :sum (* (aref user-features f user)
                (aref movie-features f movie))))

Fourier Transform

The last item we'll discuss in this chapter is not exactly an optimization problem, but it's also a numeric algorithm that bears a lot of significance to the previous one and has broad practical applications. The Discrete Fourier Transform (DFT) is the most important discrete transform, used to perform Fourier analysis in many practical applications: in digital signal processing, the function is any quantity or signal that varies over time, such as the pressure of a sound wave, a radio signal, or daily temperature readings, sampled over a finite time interval; in image processing, the samples can be the values of pixels along a row or column of a raster image.

It is said that the Fourier Transform transforms a "signal" from the time/space domain (represented by observed samples) into the frequency domain. Put simply, a time-domain graph shows how a signal changes over time, whereas a frequency-domain graph shows how much of the signal lies within each given frequency band over a range of frequencies. The inverse Fourier Transform performs the reverse operation and converts the frequency-domain signal back into the time domain. Explaining the deep meaning of the transform is beyond the scope of this book, the only thing worth mentioning here is that operating on the frequency domain allows us to perform many useful operations on the signal, such as determining the most important features, compression (that we'll discuss below), etc.

The complexity of computing DFT naively just by applying its definition on n samples is O(n^2):

(defun dft (vec)
  (with ((n (length vec))
         (rez (make-array n))
         ;; #c(0 1) is the imaginary unit (i) - Lisp allows us
         ;; to operate on complex numbers directly
         (scale (/ (- (* 2 pi #c(0 1))) n)))
    (dotimes (i n)
      (:= (? rez i) (loop :for j :from 0 :below n
                          :sum (* (? vec j) (exp (* scale i j))))))
    rez))

However, the well-known Fast Fourier Transform (FFT) achieves a much better performance of O(n log n). Actually, a group of algorithms shares the name FFT, but their main principle is the same. You might have already guessed, from our previous chapters, that such reduction in complexity is achieved with the help of the divide-and-conquer approach. A radix-2 decimation-in-time (DIT) FFT is the simplest and most common form of the Cooley-Tukey algorithm, which is the standard FFT implementation. It first computes the DFTs of the even-indexed inputs (indices: 0, 2, ..., (- n 2)) and of the odd-indexed inputs (indices: 1, 3, ..., (- n 1)), and then combines those two results to produce the DFT of the whole sequence. This idea is utilized recursively. What enables such decomposition is the observation that thanks to the periodicity of the complex exponential, the elements (? rez i) and (? rez (+ i n/2)) may be calculated from the FFTs of the same subsequences. The formulas are the following:

;; the combination step that is performed for each I below N/2
(let ((e (fft-of-even-indexed-part))
      (o (fft-of-odd-indexed-part))
      (scale (exp (/ (- (* 2 pi #c(0 1) i))
                     n)))
      (n/2 (floor n 2)))
  (:= (? rez i) (+ (? e i) (* scale (? o i)))
      (? rez (+ i n/2)) (- (? e i) (* scale (? o i)))))
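Putting these pieces together, here is a minimal sketch of the whole recursive procedure, assuming the length of the input is a power of 2 (a production FFT would, instead, operate non-recursively and in-place):

(defun fft (vec)
  (let ((n (length vec)))
    (if (= n 1)
        vec
        (with ((e (fft (coerce (loop :for i :from 0 :below n :by 2
                                     :collect (? vec i))
                               'vector)))
               (o (fft (coerce (loop :for i :from 1 :below n :by 2
                                     :collect (? vec i))
                               'vector)))
               (rez (make-array n))
               (n/2 (floor n 2)))
          (dotimes (i n/2)
            (let ((scale (exp (/ (- (* 2 pi #c(0 1) i)) n))))
              (:= (? rez i) (+ (? e i) (* scale (? o i)))
                  (? rez (+ i n/2)) (- (? e i) (* scale (? o i))))))
          rez))))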

Fourier Transform in Action: JPEG

Fourier Transform — or rather its variant that uses only cosine functions[2] and operates on real numbers — the Discrete Cosine Transform (DCT) is the enabling factor of the main lossy media compression formats, such as JPEG, MPEG, and MP3. All of them achieve a drastic reduction in the size of the compressed file by first transforming it into the frequency domain, then identifying the long tail of low-amplitude frequencies and removing all the data associated with these frequencies (which is, basically, noise). Such an approach allows specifying what percentage of the data should be discarded and what retained. The use of cosine rather than sine functions is critical for compression since it turns out that fewer cosine functions are needed to approximate a typical signal. Also, this allows sticking to only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry. There are, actually, eight different DCT variants, and we won't go into detail about their differences.
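To illustrate the compression idea, here is a sketch of a naive one-dimensional DCT-II, computed directly from its definition (so it is O(n^2), like our dft, and unnormalized), together with the thresholding of the resulting coefficients:

(defun dct (vec)
  (with ((n (length vec))
         (rez (make-array n)))
    (dotimes (k n)
      (:= (? rez k) (loop :for i :from 0 :below n
                          :sum (* (? vec i)
                                  (cos (/ (* pi (+ i 0.5) k) n))))))
    rez))

(defun compress-block (vec threshold)
  ;; zeroing out the low-amplitude frequencies is where
  ;; the loss of information happens
  (map 'vector (lambda (x) (if (< (abs x) threshold) 0 x))
       (dct vec)))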

The general JPEG compression procedure operates in the following steps:

  • an RGB to YCbCr color space conversion (a special color space with luminance and chrominance components, more suited for further processing)
  • division of the image into 8 x 8 pixel blocks
  • shifting the pixel values from [0,256) to [-128,128)
  • applying DCT to each block from left to right, top to bottom
  • compression of each block through quantization
  • entropy encoding of the quantized matrix (we'll discuss this in the next chapter)
  • the compressed image is later reconstructed through the reverse process, using the Inverse Discrete Cosine Transform (IDCT)

The quantization step is where the lossy part of the compression takes place. It aims at reducing most of the less important high-frequency DCT coefficients to zero: the more zeros, the better the image will compress. Lower frequencies are used to reconstruct the image because the human eye is more sensitive to them, while higher frequencies are discarded.

P.S. Further development of the Fourier-related transforms for lossy compression lies in the Wavelet family of transforms.

Take-aways

It was not easy to select the name for this chapter. Originally, I planned to dedicate it to optimization approaches. Then I thought that a number of other numerical algorithms need to be presented, but they were not substantial enough to justify a separate chapter. After all, I saw that what all these different approaches are about is, first of all, approximation. And, after gathering all the descriptions in one place and combining them, I came to the conclusion that approximation is, in a way, a more general and correct term than optimization. Although they go hand in hand, and it's somewhat hard to say which one enables the other...

A conclusion that we can draw from this chapter is that the main optimization methods currently in use boil down to greedy local probabilistic search. In both the discrete and continuous domains, the key idea is to quickly find the direction in which we can somewhat improve the current state of the system, and advance along it. All the rest is, basically, fine-tuning of this concept. There are alternatives, but local search aka gradient descent aka hill climbing dominates the optimization landscape.

Another interesting observation is that many approaches we have seen here are more templates or frameworks than algorithms. Branch & Bound, Genetic Algorithms, or Local Search define a certain skeleton that should be filled with domain-specific code which will perform the main computations. Such a "big picture" approach is somewhat uncommon in the algorithm world, which tends to concentrate on the low-level details and optimize them down to the last bit. So, the skills needed to design such generic frameworks are no less important to algorithmic developers than knowledge of the low-level optimization techniques.

SGD, SVD, MCTS, NNSC, FFT — this sphere has plenty of algorithms with abbreviated names for solving particular numerical problems. We have discussed only the most well-known and principal ones with broad practical significance in the context of software development. But, besides them, there are many other famous numerical algorithms like the Sieve of Eratosthenes, the Finite Element Method, the Simplex Method, and so on and so forth. Yet, many of the ways to tackle them and the issues you will encounter in the process are, essentially, similar.


[1] It uses the popular drakma HTTP client and cl-ppcre regex library.

[2] The DFT uses a complex exponent, which consists of a cosine and a sine part.