# Lisp, the Universe and Everything

## 2019-09-28

### Programming Algorithms: Trees

Balancing a binary tree is the infamous interview problem that has all that folklore and debate associated with it. To tell you the truth, like the other 99% of programmers, I never had to perform this task for some work-related project. And not even due to the existence of ready-made libraries, but because self-balancing binary trees are, actually, pretty rarely used. But trees, in general, are ubiquitous even if you may not recognize their presence. The source code we operate with, at some stage of its life, is represented as a tree (a popular term here is Abstract Syntax Tree or AST, but the abstract variant is not the only one the compilers process). The directory structure of the file system is the tree. The object-oriented class hierarchy is likewise. And so on. So, returning to interview questions, trees indeed are a good area as they allow to cover a number of basic points: linked data structures, recursion, complexity. But there's a much better task, which I have encountered a lot in practice and also used quite successfully in the interview process: breadth-first tree traversal. We'll talk about it a bit later.

Similar to how hash-tables can be thought of as more sophisticated arrays (they are sometimes even called "associative arrays"), trees may be considered an expansion of linked lists. Although technically, a few specific trees are implemented not as a linked data structure but are based on arrays, the majority of trees are linked. Like hash-tables, some trees also allow for efficient access to the element by key, representing an alternative key-value implementation option.

Basically, a tree is a recursive data structure that consists of nodes. Each node may have zero or more children. If the node doesn't have a parent, it is called the root of the tree. And the constraint on trees is that the root is always single. Graphs may be considered a generalization of trees that don't impose this constraint, and we'll discuss them in a separate chapter. In graph terms, a tree is an acyclic directed single-component graph. Directed means that there's a one-way parent-child relation. And acyclic means that a child can't have a connection to the parent neither directly, nor through some other nodes (in the opposite case, what will be the parent and what — the child?) The recursive nature of trees manifests in the fact that if we extract an arbitrary node of the tree with all of its descendants, the resulting part will remain a tree. We can call it a subtree. Besides parent-child or, more generally, ancestor-descendant "vertical" relationships that apply to all the nodes in the tree, we can also talk about horizontal siblings — the set of nodes that have the same parent/ancestor.

Another important tree concept is the distinction between terminal (leaf) and nonterminal (branch) nodes. Leaf nodes don't have any children. In some trees, the data is stored only in the leaves with branch nodes serving to structure the tree in a certain manner. In other trees, the data is stored in all nodes without any distinction.

## Implementation Variants

As we said, the default tree implementation is a linked structure. A linked list may be considered a degenerate tree with all nodes having a single child. A tree node may have more than one child, and so, in a linked representation, each tree root or subroot is the origin of a number of linked lists (sometimes, they are called "paths").

``````Tree:  a
/ \
b   c
/ \   \
d   e   f

Lists:
a -> b -> d
a -> b -> e
a -> c -> f
b -> d
b -> e
c -> f
``````

So, a simple linked tree implementation will look a lot like a linked list one:

``````(defstruct (tree-node (:conc-name nil))
key

CL-USER> (with ((f (make-tree-node :key "f"))
(e (make-tree-node :key "e"))
(d (make-tree-node :key "d"))
(c (make-tree-node :key "c" :children (list f)))
(b (make-tree-node :key "b" :children (list d e))))
(make-tree-node :key "a"
:children (list b c)))
#S(TREE-NODE
:KEY "a"
:CHILDREN (#S(TREE-NODE
:KEY "b"
:CHILDREN (#S(TREE-NODE :KEY "d" :CHILDREN NIL)
#S(TREE-NODE :KEY "e" :CHILDREN NIL)))
#S(TREE-NODE
:KEY "c"
:CHILDREN (#S(TREE-NODE :KEY "f" :CHILDREN NIL)))))
``````

Similar to lists that had to be constructed from tail to head, we had to populate the tree in reverse order: from leaves to root. With lists, we could, as an alternative, use push and reverse the result, in the end. But, for trees, there's no such operation as reverse.

Obviously, not only lists can be used as a data structure to hold the children. When the number of children is fixed (for example, in a binary tree), they may be defined as separate slots: e.g. `left` and `right`. Another option will be to use a key-value, which allows assigning labels to tree edges (as the keys of the kv), but the downside is that the ordering isn't defined (unless we use an ordered kv like a linked hash-table). We may also want to assign weights or other properties to the edges, and, in this case, either an additional collection (say `child-weights`) or a separate `edge` struct should be defined to store all those properties. In the latter case, the `node` structure will contain `edges` instead of `children`. In fact, the tree can also be represented as a list of such `edge` structures, although this approach is quite inefficient, for most of the use cases.

Another tree representation utilizes the available linked list implementation directly instead of reimplementing it. Let's consider the following simple Lisp form:

``````(defun foo (bar)
"Foo function."
(baz bar))
``````

It is a tree with the root containing the symbol `defun` and 4 children:

• the terminal symbol `foo`
• the tree containing the function arguments (`(bar)`)
• the terminal sting (the docstring "Foo function.")
• and the tree containing the form to evaluate (`(baz bar)`)

By default, in the list-based tree, the first element is the head and the rest are the leaves. This representation is very compact and convenient for humans, so it is used not only for source code. For example, you can see a similar representation for the constituency trees, in linguistics:

``````(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))) (. .)))
;; if we'd like to use the above form as Lisp code,
;; we'd have to shield the symbol "." with ||: (|.| |.|) instead of (. .)
``````

It is equivalent to the following parse tree:

``````     TOP
/   |          \
|    VP          |
|    |   \       |
NP   |    NP     |
|    |   /  \    |
DT  VBZ DT  NN   .
This is  a  test  .
``````

Another, more specific, alternative is when we are interested only in the terminal nodes. In that case, there will be no explicit root and each list item will be a subtree. The following trees are equivalent:

``````(((a b (c d)) e) (f (g h)))

<root>
/      \
/  \    /  \
/ | \ e  f   /\
a  b /\      g  h
c  d
``````

A tree that has all terminals at the same depth and all nonterminal nodes present — a complete tree — with a specified number of children may be stored in a vector. This is a very efficient implementation that we'll have a glance at when we'll talk about heaps.

Finally, a tree may be also represented, although quite inefficiently, with a matrix (only one half is necessary).

## Tree Traversal

It should be noted that, unlike with other structures, basic operations, such as tree construction, modification, element search and retrieval, work differently for different tree variants. Thus we'll discuss them further when describing those variants.

Yet, one tree-specific operation is common to all tree representations: traversal. Traversing a tree means iterating over its subtrees or nodes in a certain order. The most direct traversal is called depth-first search or DFS. It is the recursive traversal from parent to child and then to the next child after we return from the recursion. The simplest DFS for our `tree-node`-based tree may be coded in the following manner:

``````(defun dfs-node (fn root)
(call fn (key root))
(dolist (child (children root))
(dfs-node fn child)))

CL-USER> (dfs-node 'print *tree*)  ; where *tree* is taken from the previous example
"a"
"b"
"d"
"e"
"c"
"f"
``````

In the spirit of Lisp, we could also define a convenience macro:

``````(defmacro dotree-dfs ((value root) &body body)
(let ((node (gensym)))  ; this code is needed to prevent possible symbol collisions for NODE
`(dfs-node (lambda (,node)
(let ((,value (key ,node)))
,@body))
,root)))
``````

And if we'd like to traverse a tree represented as a list, the changes are minor:

``````(defun dfs-list (fn tree)
;; we need to handle both subtrees (lists) and leaves (atoms)
;; so, we'll just convert everything to a list
(let ((tree (mklist tree)))
(call fn (first tree))
(dolist (child (rest tree))
(dfs-list fn child))))

CL-USER> (dfs-list 'print '(defun foo (bar)
"Foo function."
(baz bar)))
DEFUN
FOO
BAR
"Foo function."
BAZ
BAR
``````

Recursion is very natural in tree traversal: we could even say that trees are recursion realized in a data structure. And the good news here is that, very rarely, there's a chance to hit recursion limits as the majority of trees are not infinite, and also the height of the tree, which conditions the depth of recursion, grows proportionally to the logarithm of the tree size[1], and that's pretty slow.

These simple DFS implementations apply the function before descending down the tree. This style is called preorder traversal. There are alternative styles: inorder and postorder. With postorder, the call is executed after the recursion returns, i.e. on the recursive ascent:

``````(defun post-dfs (fn node)
(dolist (child (children node))
(post-dfs fn child))
(call fn (key node)))

CL-USER> (post-dfs 'print *tree*)
"d"
"e"
"b"
"f"
"c"
"a"
``````

Inorder traversal is applicable only to binary trees: first traverse the left side, then call `fn` and then descend into the right side.

An alternative traversal approach is Breadth-first search (BFS). It isn't so natural as DFS as it traverses the tree layer by layer, i.e. it has to, first, accumulate all the nodes that have the same depth and then integrate them. In the general case, it isn't justified, but there's a number of algorithms where exactly such ordering is required.

Here is an implementation of BFS (preorder) for our `tree-node`s:

``````(defun bfs (fn nodes)
(let ((next-level (list)))
(dolist (node (mklist nodes))
(call fn (key node))
(dolist (child (children node))
(push child next-level)))
(when next-level
(bfs fn (reverse next-level)))))

CL-USER> (bfs 'print *tree*)
"a"
"b"
"c"
"d"
"e"
"f"
``````

An advantage of BFS traversal is that it can handle potentially unbounded trees, i.e. it is suitable for processing trees in a streamed manner, layer-by-layer.

In object-orientation, tree traversal is usually accomplished with by the means of the so-called Visitor pattern. Basically, it's the same approach of passing a function to the traversal procedure but in disguise of additional (and excessive) OO-related machinery. Here is a Visitor pattern example in Java:

``````interface ITreeVisitor {
List<ITreeNode> children;
void visit(ITreeNode node);
}

interface ITreeNode {
void accept(ITreeVisitor visitor);
}

interface IFn {
void call(ITreeNode);
}

class TreeNode implements ITreeNode {
public void accept(ITreeVisitor visitor) {
visitor.visit(this);
}
}

class PreOrderTreeVisitor implements ITreeVisitor {
private IFn fn;

public PreOrderTreeVisitor(IFn fn) {
this.fn = fn;
}

public void visit(ITreeNode node) {
fn.call(node);
for (ITreeeNode child : node.children())
child.visit(this);
}
}
``````

The zest of this example is the implementation of the method `visit` that calls the function with the current node and iterates over its children by recursively applying the same visitor. You can see that it's exactly the same as our `dfs-node`.

One of the interesting tree-traversal tasks is tree printing. There are many ways in which trees can be displayed. The simplest one is directory-style (like the one used by the Unix `tree` utility):

``````\$ tree /etc/acpi
/etc/acpi
├── asus-wireless.sh
├── events
│   ├── asus-keyboard-backlight-down
│   ├── asus-keyboard-backlight-up
│   ├── asus-wireless-off
│   └── asus-wireless-on
└── undock.sh
``````

It may be implemented with DFS and only requires tracking of the current level in the tree:

``````(defun pprint-tree-dfs (node &optional (level 0) (skip-levels (make-hash-table)))
(when (= 0 level)
(format t "~A~%" (key node)))
(let ((last-index (1- (length (children node)))))
(doindex (i child (children node))
(let ((last-child-p (= i last-index)))
(dotimes (j level)
(format t "~C    " (if (? skip-levels j) #\Space #\│)))
(format t "~C── ~A~%"
(if last-child-p #\└ #\├)
(key child))
(:= (? skip-levels level) last-child-p)
(pprint-tree-dfs child
(1+ level)
skip-levels))))))

CL-USER> (pprint-tree-dfs *tree*)
a
├── b
│   ├── d
│   └── e
└── c
└── f
``````

`1+` and `1-` are standard Lisp shortucts for adding/substracting 1 from a number. The `skip-levels` argument is used for the last elements to not print the excess `│`.

A more complicated variant is top-to-bottom printing:

``````;; example from CL-NLP library
CL-USER> (nlp:pprint-tree
'(TOP (S (NP (NN "This"))
(VP (VBZ "is")
(NP (DT "a")
(JJ "simple")
(NN "test")))
(|.| ".")))
TOP
:
S
.-----------:---------.
:          VP         :
:   .---------.       :
NP   :        NP       :
:   :   .----:-----.  :
NN  VBZ DT   JJ    NN  .
:   :   :    :     :  :
This  is  a simple test .
``````

This style, most probably, will need a BFS and a careful calculation of spans of each node to properly align everything. Implementing such a function is left as an exercise to the reader, and a very enlightening one, I should say.

## Binary Search Trees

Now, we can return to the topic of basic operations on tree elements. The advantage of trees is that, when built properly, they guarantee `O(log n)` for all the main operations: search, insertion, modification, and deletion.

This quality is achieved by keeping the leaves sorted and the trees in a balanced state. "Balanced" means that any pair of paths from the root to the leaves have lengths that may differ by at most some predefined quantity: ideally, just 1 (AVL trees), or, as in the case of Red-Black trees, the longest path can be at most twice as long as the shortest. Yet, such situations when all the elements align along a single path, effectively, turning the tree into a list, should be completely ruled out. We have already seen, with Binary search and Quicksort (remember the justification for the 3-medians rule), why this constraint guarantees logarithmic complexity.

The classic example of balanced trees are Binary Search Trees (BSTs), of which AVL and Red-Black trees are the most popular variants. All the properties of BSTs may be trivially extended to n-ary trees, so we'll discuss the topic using the binary trees examples.

Just to reiterate the general intuition for the logarithmic complexity of tree operations, let's examine a complete binary tree: a tree that has all levels completely filled with elements, except maybe for the last one. In it, we have `n` elements, and each level contains twice as many nodes as the previous. This property means that `n` is not greater than `(+ 1 2 4 ... (/ k 2) k)`, where `k` is the capacity of the last level. This formula is nothing but the sum of a geometric progression with the number of items equal to `h`, which is, by the textbook:

``````(/ (* 1 (- 1 (expt 2 h)))
(- 1 2))
``````

In turn, thisexpression may be reduced to: `(- (expt 2 h) 1)`. So `(+ n 1)` equals to `(expt 2 h)`, i.e. the height of the tree (`h`) equals to `(log (+ n 1) 2)`.

BSTs have the ordering property: if some element is to the right of another in the tree, it should consistently be greater (or smaller — depending on the ordering direction). This constraint means that after the tree is built, just extracting its elements by performing an inorder DFS produces a sorted array. The Treesort algorithm utilizes this approach directly to achieve the same `O(n * log n)` complexity as other efficient sorting algorithms. This `n * log n` is the complexity of each insertion (`O(log n)`) multiplied by the number of times it should be performed (`n`). So, Treesort operates by taking an array and adding its elements to the BST, then traversing the tree and putting the encountered elements into the resulting array, in a proper order.

Besides, the ordering property also means that, after adding a new element to the tree, in the general case, it should be rebalanced as the newly added element may not be placed in an arbitrary spot, but has just two admissible locations, and choosing any of those may violate the balance constraint. The specific balance invariants and approaches to tree rebalancing are the distinctive properties of each variant of BSTs that we will see below.

## Splay Trees

A Splay tree represents a kind of BST that is one of the simplest to understand and to implement. It is also quite useful in practice. It has the least strict constraints and a nice property that recently accessed elements occur near the root. Thus, a Splay tree can naturally act as an LRU-cache. However, there are degraded scenarios that result in `O(n)` access performance, although, the average complexity of Splay tree operations is `O(log n)` due to amortization (we'll talk about it in a bit).

The approach to balancing a Splay tree is to move the element we have accessed/inserted into the root position. The movement is performed by a series of operations that are called tree rotations. A certain pattern of rotations forms a step of the algorithm. For all BSTs, there are just two possible tree rotations, and they serve as the basic block, in all balancing algorithms. A rotation may be either a left or a right one. Their purpose is to put the left or the right child into the position of its parent, preserving the order of all the other child elements. The rotations can be illustrated by the following diagrams in which `x` is the parent node, `y` is the target child node that will become the new parent, and `A`,`B`,`C` are subtrees. It is said that the rotation is performed around the edge `x -> y`.

Left rotation:

``````    x          y
/ \        / \
y   C  ->  A   x
/ \            / \
A   B          B   C
``````

Right rotation:

``````  x              y
/ \            / \
A   y    ->    x   C
/ \        / \
B   C      A   B
``````

As you see, the left and right rotations are complementary operations, i.e. performing one after the other will return the tree to the original state. During the rotation, the inner subtree (`B`) has its parent changed from `y` to `x`.

Here's an implementation of rotations:

``````(defstruct (bst-node (:conc-name nil)
(:print-object (lambda (node out)
(format out "[~a-~@[~a~]-~@[~a~]]"
(key node)
(lt node)
(rt node)))))
key
val  ; we won't use this slot in the examples,
; but without it, in real-world use cases,
; the tree doesn't make any sense :)
lt   ; left child
rt)  ; right child

(defun tree-rotate (node parent grandparent)
(cond
((eql node (lt parent)) (:= (lt parent) (rt node)
(rt node) parent))
((eql node (rt parent)) (:= (rt parent) (lt node)
(lt node) parent))
(t (error "NODE (~A) is not the child of PARENT (~A)"
node parent)))
(cond
((null grandparent) (return-from tree-rotate node))
((eql parent (lt grandparent)) (:= (lt grandparent) node))
((eql parent (rt grandparent)) (:= (rt grandparent) node))
(t (error "PARENT (~A) is not the child of GRANDPARENT (~A)"
parent grandparent))))
``````

You have probably noticed that we need to pass to this function not only the nodes on the edge around which the rotation is executed but also the grandparent node of the target to link the changes to the tree. If `grandparent` is not supplied, it is assumed that `parent` is the root and we need to separately reassign the variable holding the reference to the tree to `child`, after the rotation.

Splay trees combine rotations into three possible actions:

• The Zig step is used to make the node the new root when it's already the direct child of the root. It is accomplished by a single left/right rotation(depending on whether the target is to the left or to the right of the `root`) followed by an assignment.
• The Zig-zig step is a combination of two zig steps that is performed when both the target node and its parent are left/right nodes. The first rotation is around the edge between the target node and its parent, and the second — around the target and its former grandparent that has become its new parent, after the first rotation.
• The Zig-zag step is performed when the target and its parent are not in the same direction: either one is left while the other is right or vise versa. In this case, correspondingly, first a left rotation around the parent is needed, and then a right one around its former grandparent (that has now become the new parent of the target). Or vice versa.

However, with our implementation of tree rotations, we don't have to distinguish the 3 different steps and the implementation of the operation `splay` becomes really trivial:

``````(defun splay (node &rest chain)
(loop :for (parent grandparent) :on chain :do
(tree-rotate node parent grandparent))
node)
``````

The key point here and in the implementation of Splay tree operations is the use of reverse chains of nodes from the child to the root which will allow us to perform chains of splay operations in an end-to-end manner and also custom modifications of the tree structure.

From the code, it is clear that splaying requires at maximum the same number of steps as the height of the tree because each rotation brings the target element 1 level up. Now, let's discuss why all Splay tree operations are `O(log n)`. Element access requires binary search for the element in the tree, which is `O(log n)` provided the tree is balanced, and then splaying it to root — also `O(log n)`. Deletion requires search, then swapping the element either with the rightmost child of its left subtree or the leftmost child of its right subtree (direct predecessor/successor) — to make it childless, removing it, and, finally, splaying the parent of the removed node. And update is, at worst, deletion followed by insertion.

Here is the implementation of the Splay tree built of `bst-node`s and restricted to only arithmetic comparison operations. All of the high-level functions, such as `st-search`, `st-insert` or `st-delete` return the new tree root obtained after that should substitute the previous one in the caller code.

``````(defun node-chain (item root &optional chain)
"Return as the values the node equal to ITEM or the closest one to it
and the chain of nodes leading to it, in the splay tree based in ROOT."
(if root
(with (((key lt rt) ? root)
(chain (cons root chain)))
(cond ((= item key) (values root
chain))
((< item key) (st-search item lt chain))
((> item key) (st-search item rt chain))))
(values nil
chain)))

(defun st-search (item root)
(with ((node chain (node-chain item root)))
(when node
(apply 'splay chain))))

(defun st-insert (item root)
(assert root nil "Can't insert item into a null tree")
(with ((node chain (st-search item root)))
(unless node
(let ((parent (first chain)))
;; here, we use the property of the := expression
;; that it returns the item being set
(push (:= (? parent (if (> (key parent) item) 'lt 'rt))
(make-bst-node :key item))
chain)))
(apply 'splay chain)))

(defun idir (dir)
(case dir
(lt 'rt)
(rt 'lt)))

(defun closest-child (node)
(dolist (dir '(lt rt))
(let ((parent nil)
(current nil))
(do ((child (call dir node) (call (idir dir) child)))
((null child) (when current
(return-from closest-child
(values dir
current
parent))))
(:= current child
parent current)))))

(defun st-delete (item root)
(with ((node chain (st-search item root))
(parent (second chain)))
(if (null node)
(with ((dir child child-parent (closest-child node))
(idir (idir dir)))
(when parent
(:= (? parent (if (eql (lt parent) node) 'lt 'rt))
child))
(when child
(:= (? child idir) (? node idir))
(when child-parent
(:= (? child-parent idir) (? child dir))))
(if parent
(apply 'splay (rest chain))
child)))))

(defun st-update (old new root)
(st-insert new (st-delete old root)))
``````

The deletion is somewhat tricky due to the need to account for different cases: when removing the root, the direct child of the root, or the other node.

Let's test the Splay tree operation in the REPL (coding `pprint-bst` as a slight modification of `pprint-tree-dfs` is left as an excercise to the reader):

``````CL-USER> (defparameter *st* (make-bst-node :key 5))
CL-USER> *st*
[5--]
CL-USER> (pprint-bst (:= *st* (st-insert 1 *st*)))
1
├── .
└── 5
CL-USER> (pprint-bst (:= *st* (st-insert 10 *st*)))
10
├── 1
│    ├── .
│    └── 5
└── .
CL-USER> (pprint-bst (:= *st* (st-insert 3 *st*)))
3
├── 1
└── 10
├── .
└── 5
CL-USER> (pprint-bst (:= *st* (st-insert 7 *st*)))
7
├── 3
│    ├── 1
│    └── 5
└── 10
CL-USER> (pprint-bst (:= *st* (st-insert 8 *st*)))
8
├── 7
│    ├── 3
│    │    ├── 1
│    │    └── 5
│    └── .
└── 10
CL-USER> (pprint-bst (:= *st* (st-insert 2 *st*)))
2
├── 1
└── 8
├── 7
│    ├── 3
│    │    ├── .
│    │    └── 5
│    └── .
└── 10
CL-USER> (pprint-bst (:= *st* (st-insert 4 *st*)))
4
├── 2
│    ├── 1
│    └── 3
└── 8
├── 7
│    ├── 5
│    └── .
└── 10
CL-USER> *st*
[4-[2-[1--]-[3--]]-[8-[7-[5--]-]-[10--]]]
``````

As you can see, the tree gets constantly rearranged at every insertion.

Accessing an element, when it's found in the tree, also triggers tree restructuring:

``````RTL-USER> (pprint-bst (st-search 5 *st*))
5
├── 4
│    ├── 2
│    │    ├── 1
│    │    └── 3
│    └── .
└── 8
├── 7
└── 10
``````

The insertion and deletion operations, for the Splay tree, also may have an alternative implementation: first, split the tree in two at the place of the element to be added/removed and then combine them. For insertion, the combination is performed by making the new element the root and linking the previously split subtrees to its left and right. As for deletion, splitting the Splay tree requires splaying the target element and then breaking the two subtrees apart (removing the target that has become the root). The combination is also `O(log n)` and it is performed by splaying the rightmost node of the left subtree (the largest element) so that it doesn't have the right child. Then the right subtree can be linked to this vacant slot.

Although regular access to the Splay tree requires splaying of the element we have touched, tree traversal should be implemented without splaying. Or rather, just the normal DFS/BFS procedures should be used. First of all, this approach will keep the complexity of the operation at `O(n)` without the unnecessary `log n` multiplier added by the splaying operations. Besides, accessing all the elements inorder will trigger the edge-case scenario and turn the Splay tree into a list — exactly the situation we want to avoid.

### Complexity Analysis

All of those considerations apply under the assumption that all the tree operations are `O(log n)`. But we haven't proven it yet. Turns out that, for Splay trees, it isn't a trivial task and requires amortized analysis. Basically, this approach averages the cost of all operations over all tree elements. Amortized analysis allows us to confidently use many advanced data structures for which it isn't possible to prove the required time bounds for individual operations, but the general performance over the lifetime of the data structure is in those bounds.

The principal tool of the amortized analysis is the potential method. Its idea is to combine, for each operation, not only its direct cost but also the change to the potential cost of other operations that it brings. For Splay trees, we can observe that only zig-zig and zig-zag steps are important, for the analysis, as zig step happens only once for each splay operation and changes the height of the tree by at most 1. Also, both zig-zig and zig-zag have the same potential.

Rigorously calculating the exact potential requires a number of mathematical proofs that we don't have space to show here, so let's just list the main results.

1. The potential of the whole Splay tree is the sum of the ranks of all nodes, where rank is the logarithm of the number of elements in the subtree rooted at node:

``````(defun rank (node)
(let ((size 0))
(dotree-dfs (_ node)
(:+ size))
(log size 2)))
``````
2. The change of potential produced by a single zig-zig/zig-zag step can be calculated in the following manner:

``````(+ (- (rank grandparent-new) (rank grandparent-old))
(- (rank parent-new) (rank parent-old))
(- (rank node-new) (rank node-old)))
``````

Since `(= (rank node-new) (rank grandparent-old))` it can be reduced to:

``````(- (+ (rank grandparent-new) (rank parent-new))
(+ (rank parent-old) (rank node-old)))
``````

Which is not larger than:

``````(- (+ (rank grandparent-new) (rank node-new))
(* 2 (rank node-old)))
``````

Which, in turn, due to the concavity of the log function, may be reduced to:

``````(- (* 3 (- (rank node-new) (rank node-old))) 2)
``````

The amortized cost of any step is 2 operations larger than the change in potential as we need to perform 2 tree rotations, so it's not larger than:

``````(* 3 (- (rank node-new) (rank node-old)))
``````
3. When summed over the entire splay operation, this expression "telescopes" to `(* 3 (- (rank root) (rank node)))` which is `O(log n)`. Telescoping means that when we calculate the sum of the cost of all zig-zag/zig-zig steps, the inner terms cancel each other and only the boundary ones remain. The difference in ranks is, in the worst case, `log n` as the rank of the root is `(log n 2)` and the rank of the arbitrary node is between that value and `(log 1 2)` (0).

4. Finally, the total cost for `m` splay operations is `O(m log n + n log n)`, where `m log n` term represents the total amortized cost of a sequence of `m` operations and `n log n` is the change in potential that it brings.

As mentioned, the above exposition is just a cursory look at the application of the potential method that skips some important details. If you want to learn more you can start with this discussion on CS Theory StackExchange.

To conclude, similar to hash-tables, the performance of Splay tree operations for a concrete element depends on the order of the insertion/removal of all the elements of the tree, i.e. it has an unpredictable (random) nature. This property is a disadvantage compared to some other BST variants that provide precise performance guarantees. Another disadvantage, in some situations, is that the tree is constantly restructured, which makes it mostly unfit for usage as a persistent data structure and also may not play well with many storage options. Yet, Splay trees are simple and, in many situations, due to their LRU-property, may be preferable over other BSTs.

## Red-Black and AVL Trees

Another BST that also has similar complexity characteristics to Splay trees and, in general, a somewhat similar approach to rebalancing is the Scapegoat tree. Both of these BSTs don't require storing any additional information about the current state of the tree, which results in the random aspect of their operation. And although it is smoothed over all the tree accesses, it may not be acceptable in some usage scenarios.

An alternative approach, if we want to exclude the random factor, is to track the tree state. Tracking may be achieved by adding just 1 bit to each tree node (as with Red-Black trees) or 2 bits, the so-called balance factors (AVL trees)[2]. However, for most of the high-level languages, including Lisp, we'll need to go to great lengths or even perform low-level non-portable hacking to, actually, ensure that exactly 1 or 2 bits is spent for this data, as the standard structure implementation will allocate a whole word even for a bit-sized slot. Moreover, in C likewise, due to cache alignment, the structure will also have the size aligned to memory word boundaries. So, by and large, usually we don't really care whether the data we'll need to track is a single bit flag or a full integer counter.

The balance guarantee of an RB tree is that, for each node, the height of the left and right subtrees may differ by at most a factor of 2. Such boundary condition occurs when the longer path contains alternating red and black nodes, and the shorter — only black nodes. Balancing is ensured by the requirement to satisfy the following invariants:

1. Each tree node is assigned a label: red or black (basically, a 1-bit flag: 0 or 1).
2. The root should be black (0).
3. All the leaves are also black (0). And the leaves don't hold any data. A good implementation strategy to satisfy this property is to have a constant singleton terminal node that all preterminals will link to. (`(defparameter *rb-leaf* (make-rb-node))`).
4. If a parent node is red (1) then both its children should be black (0). Due to mock leaves, each node has exactly 2 children.
5. Every path from a given node to any of its descendant leaf nodes should contain the same number of black nodes.

So, to keep the tree in a balanced state, the insert/update/delete operations should perform rebalancing when the constraints are violated. Robert Sedgewick has proposed the simplest version of the red-black tree called the Left-Leaning Red-Black Tree (LLRB). The LLRB maintains an additional invariant that all red links must lean left except during inserts and deletes, which makes for the simplest implementation of the operations. Below, we can see the outline of the insert operation:

``````(defstruct (rb-node (:include bst-node) (:conc-name nil))
(red nil :type boolean))

(defun rb-insert (item root &optional parent)
(let ((node (make-rb-node :key item)))
(when (null root)
(return-from rb-insert node))
(when (and (red (lt root))
(red (rt root)))
(:= (red root) (not (red root))
(red (lt root)) nil
(red (rt root)) nil))
(cond ((< (key root) value)
(:= (lt root) (rb-insert node (lt root) root)))
((> (key root) value)
(:= (rt root) (rb-insert node (rt root) root))))
(when (and (red (rt root))
(not (red (lt root))))
(:= (red (lt root)) (red root)
root (tree-rotate (lt root) root parent)
(red root) t))
(when (and (red (lt root))
(not (red (rt root))))
(:= (red (rt root)) (red root)
root (tree-rotate (rt root) root parent)
(red root) t)))
root)
``````

This code is more of an outline. You can easily find the complete implementation of the RB-tree on the internet. The key here is to understand the principle of their operation. Also, we won't discuss AVL trees, in detail. Suffice to say that they are based on the same principles but use a different set of balancing operations.

Both Red-Black and AVL trees may be used when worst-case performance guarantees are required, for example, in real-time systems. Besides, they serve a basis for implementing persistent data-structures that we'll talk about later. The Java `TreeMap` and similar data structures from the standard libraries of many languages are implemented with one of these BSTs. And the implementations of them both are present in the Linux kernel and are used as data structures for various queues.

OK, now you know how to balance a binary tree :D

## B-Trees

B-tree is a generalization of a BST that allows for more than two children. The number of children is not unbounded and should be in a predefined range. For instance, the simplest B-tree — 2-3 tree — allows for 2 or 3 children. Such trees combine the main advantage of self-balanced trees — logarithmic access time — with the benefit of arrays — locality — the property which allows for faster cache access or retrieval from the storage. That's why B-trees are mainly used in data storage systems. Overall, B-tree implementations perform the same trick as we saw in `prod-sort`: switching to sequential search when the sequence becomes small enough to fit into the cache line of the CPU.

Each internal node of a B-tree contains a number of keys. For a 2-3 tree, the number is either 1 or 2. The keys act as separation values which divide the subtrees. For example, if the keys are `x` and `y`, all the values in the leftmost subtree will be less than `x`, all values in the middle subtree will be between `x` and `y`, and all values in the rightmost subtree will be greater than `y`. Here is an example:

``````              [ 7 . 18 ]
/        |        \
[ 1 . 3 ]    [ 10 . 15 ]    [ 20 . _ ]
``````

This tree has 4 nodes. Each node has 2 key slots and may have 0 (in the case of the leaf nodes), 2 or 3 children. The node structure for it might look like this:

``````(defstruct 23-node
key1
key2
val1
val2
lt
md
rt)
``````

Yet, a more general B-tree node would, probably, contain arrays for keys/values and children links:

``````(defstruct bt-node
(keys (make-array *max-keys*))
(vals (make-array *max-keys*))
(children (make-array (1+ *max-keys*)))
``````

The element search in a B-tree is very similar to that of a BST. Just, there will be up to `*max-keys*` comparisons instead of 1, in each node. Insertion is more tricky as it may require rearranging the tree items to satisfy its invariants. A B-tree is kept balanced after insertion by the procedure of splitting a would-be overfilled node, of `(1+ n)` keys, into two `(/ n 2)`-key siblings and inserting the mid-value key into the parent. That's why, usually, the range of the number of keys in the node, in the B-tree is chosen to be between `k` and `(* 2 k)`. Also, in practice, `k` will be pretty large: an order of 10s or even 100. Depth only increases when the root is split, maintaining balance. Similarly, a B-tree is kept balanced after deletion by merging or redistributing keys among siblings to maintain the minimum number of keys for non-root nodes. A merger reduces the number of keys in the parent potentially forcing it to merge or redistribute keys with its siblings, and so on. The depth of the tree will increase slowly as elements are added to it, but an increase in the overall depth is infrequent and results in all leaf nodes being one more node farther away from the root.

A version of B-trees that is particularly developed for storage systems and is used in a number of filesystems, such as NTFS and ext4, and databases, such as Oracle and SQLite, is B+ trees. A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves. The leaves of the B+ tree are linked to one another in a linked list, making range queries or an (ordered) iteration through the blocks simpler and more efficient. Such a property could not be achieved in a B-tree, since not all keys are present in the leaves: some are stored in the root or intermediate nodes.

However, a newer Linux file-system, developed specifically for use on the SSDs and called btrfs uses plain B-trees instead of B+ trees because the former allows implementing copy-on-write, which is needed for efficient snapshots. The issue with B+ trees is that its leaf nodes are interlinked, so if a leaf were copy-on-write, its siblings and parents would have to be as well, as would their siblings and parents and so on until the entire tree was copied. We can recall the same situation pertaining to the doubly-linked list compared to singly-linked ones. So, a modified B-tree without leaf linkage is used in btrfs, with a refcount associated with each tree node but stored in an ad-hoc free map structure.

Overall, B-trees are a very natural continuation of BSTs, so we won't spend more time with them here. I believe, it should be clear how to deal with them, overall. Surely, there are a lot of B-tree variants that have their nuances, but those should be studied in the context of a particular problem they are considered for.

## Heaps

A different variant of a binary tree is a Binary Heap. Heaps are used in many different algorithms, such as path pathfinding, encoding, minimum spanning tree, etc. They even have their own `O(log n)` sorting algorithm — the elegant Heapsort. In a heap, each element is either the smallest (min-heap) or the largest (max-heap) element of its subtree. It is also a complete tree and the last layer should be filled left-to-right. This invariant makes the heap well suited for keeping track of element priorities. So Priority Queues are, usually, based on heaps. Thus, it's beneficial to be aware of the existence of this peculiar data structure.

The constraints on the heap allow representing it in a compact and efficient manner — as a simple vector. Its first element is the heap root, the second and third are its left and right child (if present) and so on, by recursion. This arrangement permits access to the parent and children of any element using the simple offset-based formulas (in which the element is identified by its index):

``````(defun hparent (i)
"Calculate the index of the parent of the heap element with index I."
(floor (- i 1) 2))

(defun hrt (i)
"Calculate the index of the right child of the heap element with index I."
(* (+ i 1) 2))

(defun hlt (i)
"Calculate the index of the left child of the heap element with index I."
(- (hrt i) 1))
``````

So, to implement a heap, we don't need to define a custom node structure, and besides, can get to any element in `O(1)`! Here is the utility to rearrange an arbitrary array in a min-heap formation (in other words, we can consider a binary heap to be a special arrangement of array elements). It works by iteratively placing each element in its proper place by swapping with children until it's larger than both of the children.

``````(defun heapify (vec)
(let ((mid (floor (length vec) 2)))
(dotimes (i mid)
(heap-down vec (- mid i 1))))
vec)

(defun heap-down (vec beg &optional (end (length vec)))
(let ((l (hlt beg))
(r (hrt beg)))
(when (< l end)
(let ((child (if (or (>= r end)
(> (? vec l)
(? vec r)))
l r)))
(when (> (? vec child)
(? vec beg))
;; rotatef swaps the elements of the sequence
(rotatef (? vec beg)
(? vec child))
(heap-down vec child end)))))
vec)
``````

And here is the reverse operation to pop the item up the heap:

``````(defun heap-up (vec i)
(when (> (? vec i)
(? vec (hparent i)))
(rotatef (? vec i)
(? vec (hparent i)))
(heap-up vec (hparent i)))
heap)
``````

Also, as with other data structures, it's essential to be able to visualize the content of the heap in a convenient form, as well as to check the invariants. These tasks may be accomplished with the help of the following functions:

``````(defun draw-heap (vec)
(format t "~%")
(with ((size (length vec))
(h (+ 1 (floor (log size 2)))))
(dotimes (i h)
(let ((spaces (loop :repeat (- (expt 2 (- h i)) 1) :collect #\Space)))
(dotimes (j (expt 2 i))
(let ((k (+ (expt 2 i) j -1)))
(when (= k size) (return))
(format t "~{~C~}~2D~{~C~}" spaces (? vec k) spaces)))
(format t "~%"))))
(format t "~%")
vec)

(defun check-heap (vec)
(dotimes (i (floor (length vec) 2))
(when (= (hlt i) (length vec)) (return))
(assert (not (> (? vec (hlt i)) (? vec i)))
() "Left child (~A) is > parent at position ~A (~A)."
(? vec (hlt i)) i (? vec i))
(when (= (hrt i) (length vec)) (return))
(assert (not (> (? vec (hrt i)) (? vec i)))
() "Right child (~A) is > than parent at position ~A (~A)."
(? vec (hrt i)) i (? vec i)))
vec)

CL-USER> (check-heap #(10 5 8 2 3 7 1 9))
Left child (9) is > parent at position 3 (2).
[Condition of type SIMPLE-ERROR]

CL-USER> (check-heap (draw-heap (heapify #(1 22 10 5 3 7 8 9 7 13))))

22
13              10
9       3       7       8
5   7   1

#(22 13 10 9 3 7 8 5 7 1)
``````
Due to the regular nature of the heap, drawing it with BFS is much simpler than for most other trees. As with ordered trees, heap element insertion and deletion require repositioning of some of the elements.
``````(defun heap-push (node vec)
(vector-push-extend node vec)
(heap-up vec (1- (length vec))))

(defun heap-pop (vec)
(rotatef (? vec 0) (? vec (- (length vec) 1)))
;; PROG1 is used to return the result of the first form
;; instead of the last, like it happens with PROGN
(prog1 (vector-pop vec)
(heap-down vec 0)))``````

Now, we can implement Heapsort. The idea is to iteratively arrange the array in heap order element by element. Each arrangement will take `log n` time as we're pushing the item down a complete binary tree the height of which is `log n`. And we'll need to perform `n` such iterations.

``````(defun heapsort (vec)
(heapify vec)
(dotimes (i (length vec))
(let ((last (- (length vec) i 1)))
(rotatef (? vec 0)
(? vec last))
(heap-down vec 0 last)))
vec)

CL-USER> (heapsort #(1 22 10 5 3 7 8 9 7 13))
#(1 3 5 7 7 8 9 10 13 22)
``````

There are so many sorting algorithms, so why invent another one? That's a totally valid point, but the advantage of heaps is that they keep the maximum/minimum element constantly at the top so you don't have to perform a full sort or even descend into the tree if you need just the top element. This simplification is especially relevant if we constantly need to access such elements as with priority queues.

Actually, a heap should not necessarily be a tree. Besides the Binary Heap, there are also Binomial, Fibonacci and other kinds of heaps that may even not necessary be trees, but even collections of trees (forests). We'll discuss some of them in more detail in the next chapters, in the context of the algorithms for which their use makes a notable difference in performance.

## Tries

If I were to answer the question, what's the most underappreciated data structure, I'd probably say, a trie. For me, tries are a gift that keeps on giving, and they have already saved me program performance in a couple of situations that seemed hopeless. Besides, they are very simple to understand and implement.

A trie is also called a prefix tree. It is, usually, used to optimize dictionary storage and lookup when the dictionary has a lot of entries and there is some overlap between them. The most obvious example is a normal English language dictionary. A lot of words have common stems ("work", "word", "worry" all share the same beginning "wor"), there are many wordforms of the same word ("word", "words", "wording", "worded").

Thre are many approaches to trie implementation. Let's discuss with the most straightforward and, to so to say, primitive one. Here is a trie for representing a string dictionary that is character-based and uses an alist to store children pointers:

``````(defstruct (tr-node (:conc-name nil))
val
(children (list)))

(defun tr-lookup (key root)
(dovec (ch key
;; when iteration terminates normally
;; we have found the node we were looking for
(val root))
(if-it (assoc1 ch (children root))
(:= root it)
(return))))

(let ((i 0))
(dovec (ch key)
(if-it (assoc1 ch (children root))
(:= root it
i (1+ i))
(return)))
(if (= i (length key))
;; there was already something at key -
;; several actions might be taken:
;; signal an error (continuable), overwrite, abort
(cerror "Assign a new value"
"There was already a value at key: ~A" (val root))
(dovec (ch (slice key i))
(let ((child (make-tr-node)))
(push (cons ch child) (children root))
(:= root child))))
(:= (val root) val)))

CL-USER> (defparameter *trie* (make-tr-node))
*TRIE*
CL-USER> *trie*
#S(TR-NODE :VAL NIL :CHILDREN NIL)
``````

For the sake of brevity, we won't define a special print-function for our trie and will use a default one. In a real setting, though, it is highly advisable.

``````CL-USER> (tr-lookup "word" *trie*)
NIL
42
CL-USER> *trie*
#S(TR-NODE
:VAL NIL
:CHILDREN ((#\w
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\o
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\r
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\d
. #S(TR-NODE
:VAL 42
:CHILDREN NIL)))))))))))))
CL-USER> (tr-lookup "word" *trie*)
42

There was already a value at key: 42
[Condition of type SIMPLE-ERROR]

Restarts:
0: [CONTINUE] Assign a new value
1: [RETRY] Retry SLIME REPL evaluation request.

Backtrace:
0: (TR-ADD "word" :FOO #S(TR-NODE :VAL 42 :CHILDREN NIL))
1: (SB-INT:SIMPLE-EVAL-IN-LEXENV (TR-ADD "word" :FOO *TRIE*) #<NULL-LEXENV>)
2: (EVAL (TR-ADD "word" :FOO *TRIE*))
--more--

;;; Take the restart 0

:FOO
:BAZ
CL-USER> *trie*
#S(TR-NODE
:VAL NIL
:CHILDREN ((#\w
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\e . #S(TR-NODE :VAL :BAZ :CHILDREN NIL))
(#\o
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\r
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\k
. #S(TR-NODE
:VAL :BAR
:CHILDREN NIL))
(#\d
. #S(TR-NODE
:VAL :FOO
:CHILDREN NIL)))))))))))))
``````

There are many ways to optimize this trie implementation. First of all, you can see that some space is wasted on intermediate nodes with no values. This is mended by Radix Trees (also known as Patricia Trees) that merge all intermediate nodes. I.e., our trie would change into the following more compact structure:

``````#S(TR-NODE
:VAL NIL
:CHILDREN ((#\w
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\e . #S(TR-NODE :VAL :BAZ :CHILDREN NIL))
("or" . #S(TR-NODE
:VAL NIL
:CHILDREN ((#\k
. #S(TR-NODE
:VAL :BAR
:CHILDREN NIL))
(#\d
. #S(TR-NODE
:VAL :FOO
:CHILDREN NIL)))))))))))))
``````

Besides, there are ways to utilize the array to store trie offsets (similar to heaps), instead of using a linked backbone for it. Such variant is called a succinct trie. Also, there are compressed (C-tries), hash-array mapped (HAMTs), and other kinds of tries.

The main advantage of tries is efficient space usage thanks to the elimination of repetition in keys storage. In many scenarios, usage of tries also improves the speed of access. Consider the task of matching against a dictionary of phrases, for example, biological or medical terms, names of companies or works of art, etc. These are usually 2-3 words long phrases, but, occasionally, there may be an outlier of 10 or more words. The straightforward approach would be to put the dictionary into a hash-table, then iterate over the input string trying to find the phrases in the table, starting from each word. The only question is: where do we put an end of the phrase? As we said, the phrase may be from 1 to, say, 10 words in length. With a hash-table, we have to check every variant: a single-word phrase, a two-word one, and so on up to the maximum length. Moreover, if there are phrases with the same beginning, which is often the case, we'd do duplicate work of hashing that beginning, for each variant (unless we use an additive hash, but this isn't adviced for hash-tables). With a trie, all the duplication is not necessary: we can iteratively match each word until we either find the match in the tree or discover that there is no continuation of the current subphrase.

## Trees in Action: Efficient Mapping

Finally, the last family of tree data structures I had to mention is trees for representing spatial relations. Overall, mapping and pathfinding is an area that prompted the creation of a wide range of useful algorithms and data structures. There are two fundamental operations for processing spatial data: nearest neighbor search and range queries. Given the points on the plane, how do we determine the closest points to a particular one? How do we retrieve all points inside a rectangle or a circle? A primitive approach is to loop through all the points and collect the relevant information, which results in at least `O(n)` complexity — prohibitively expensive if the number of points is beyond several tens or hundreds. And such problems, by the way, arise not only in the field of processing geospatial data (they are at the core of such systems as PostGIS, mapping libraries, etc.) but also in Machine Learning (for instance, the k-NN algorithm directly requires such calculations) and other areas.

A more efficient solution has an `O(log n)` complexity and is, as you might expect, based on indexing the data in a special-purpose tree. The changes to the tree will also have `O(log n)` complexity, while the initial indexing is `O(n log n)`. However, in most of the applications that use this technic, changes are much less frequent than read operations, so the upfront cost pays off.

There is a number of trees that allow efficient storage of spatial data: segment trees, interval trees, k-d trees, R-trees, etc. The most common spatial data structure is an R-tree (rectangle-tree). It distributes all the points in an `n`-dimensional space (usually, `n` will be 2 or 3) among the leaves of the tree by recursively dividing the space into `k` rectangles holding roughly the same number of points until each tree node has at most `k` points. Let's say we have started from 1000 points on the plane and chosen `k` to be 10. In this case, the first level of the tree (i.e. children of the root) will contain 10 nodes, each one having as the value the dimensions of the rectangle that bounds approximately 100 points. Every node like that will have 10 more children, each one having around 10 points. Maybe, some will have more, and, in this case, we'll give those nodes 10 children each with, probably, 1 or 2 points in the rectangles they will command. Now, we can perform a range search with the obtained tree by selecting only the nodes that intersect the query rectangle. For a small query box, this approach will result in the discarding of the majority of the nodes at each level of the tree. So, a range search over an R-tree has `O(k log n)` where `k` is the number of intersecting rectangles.

Now, let's consider neighbor search. Obviously, the closest points to a particular one we are examining lie either in the same rectangle as the point or in the closest ones to it. So, we need to, first, find the smallest bounding rectangle, which contains our point, perform the search in it, then, if we haven't got enough points yet, process the siblings of the current tree node in the order of their proximity to it.

There are many other spatial problems that may be efficiently solved with this approach. One thing to note is that the described procedures require the tree to store, in the leaf nodes, references to every point contained in their rectangles.

## Take-aways

So, balancing a tree isn't such a unique and interesting task. On the contrary, it's quite simple yet boring due to the number of edge cases you have to account for. Yet, we have just scratched the surface of the general topic of trees. It is vast: the Wikipedia section for tree data structures contains almost 100 of them and it's, definitely, not complete. Moreover, new tree variants will surely be invented in the future. But you will hardly deal with more than just a few variants during the course of your career, spending the majority of time with the simple "unconstrained" trees. And we have seen, in action, the basic principles of tree operation that will be helpful, in the process.

There's a couple of other general observations about programming algorithms we can draw from this chapter:

1. Trees are very versatile data structures that are a default choice when you need to represent some hierarchy. They are also one of a few data structures for which recursive processing is not only admissible but also natural and efficient.
2. Visualization is key to efficient debugging of complex data-structures. Unfortunately, it's hard to show that in the book how I have spent several hours on the code for the splay tree, but without an efficient way to display the trees coupled with dynamic tracing, I would probably have spent twice as much. And, both the `print-function` for individual node and `pprint-bst` were helpful here.

[1] This statement is stricktly true for balanced trees, but, even for imbalanced trees, such estimation is usually correct.

[2] Although it was shown that this value may also be reduced to a single bit using a clever implementation trick.

## 2019-09-11

### Programming Algorithms: Hash-Tables

Now, we can move on to studying advanced data structures which are built on top of the basic ones such as arrays and lists, but may exhibit distinct properties, have different use cases, and special algorithms. Many of them will combine the basic data structures to obtain new properties not accessible to the underlying structures. The first and most important of these advanced structures is, undoubtedly, the hash-table. However vast is the list of candidates to serve as key-values, hash-tables are the default choice for implementing them.

Also, hash-sets, in general, serve as the main representation for medium and large-sized sets as they ensure `O(1)` membership test, as well as optimal set-theoretic operations complexity. A simple version of a hash-set can be created using a normal hash-table with `t` for all values.

## Implementation

The basic properties of hash-tables are average `O(1)` access and support for arbitrary keys. These features can be realized by storing the items in an array at indices determined by a specialized function that maps the keys in a pseudo-random way — hashes them. Technically, the keys should pertain to the domain that allows hashing, but, in practice, it is always possible to ensure either directly or by using an intermediate transformation. The choice of variants for the hash-function is rather big, but there are some limitations to keep in mind:

1. As the backing array has a limited number of cells (`n`), the function should produce values in the interval `[0, n)`. This limitation can be respected by a 2-step process: first, produce a number in an arbitrary range (for instance, a 32-bit integer) and then take the remainder of its division by `n`.
2. Ideally, the distribution of indices should be uniform, but similar keys should map to quite distinct indices. I.e. hashing should turn things which are close, into things which are distant. This way, even very small changes to the input will yield sweeping changes in the value of the hash. This property is called the "avalanche effect".

### Dealing with Collisions

Even better would be if there were no collisions — situations when two or more keys are mapped to the same index. Is that, at all, possible? Theoretically, yes, but all the practical implementations that we have found so far are too slow and not feasible for a hash-table that is dynamically updated. However, such approaches may be used if the keyset is static and known beforehand. They will be covered in the discussion of perfect hash-tables.

For dynamic hash-tables, we have to accept that collisions are inevitable. The probability of collisions is governed by an interesting phenomenon called "The Birthday Paradox". Let's say, we have a group of people of some size, for instance, 20. What is the probability that two of them have birthdays on the same date? It may seem quite improbable, considering that there are 365 days in a year and we are talking just about a handful of people. But if you take into account that we need to examine each pair of people to learn about their possible birthday collision that will give us `(/ (* 20 19) 2)`, i.e. 190 pairs. We can calculate the exact probability by taking the complement to the probability that no one has a birthday collision, which is easier to reason about. The probability that two people don't share their birthday is `(/ (- 365 1) 365)`: there's only 1 chance in 365 that they do. For three people, we can use the chain rule and state that the probability that they don't have a birthday collision is a product of the probability that any two of them don't have it and that the third person also doesn't share a birthday with any of them. This results in `(* (/ 364 365) (/ (- 365 2) 365))`. The value `(- 365 2)` refers to the third person not having a birthday intersection with neither the first nor the second individually, and those are distinct, as we have already asserted in the first term. Continuing in such fashion, we can count the number for 20 persons:

``````(defun birthday-collision-prob (n)
(let ((rez 1))
(dotimes (i n)
(:* rez (/ (- 365 i) 365)))
;; don't forget that we want the complement
;; of the probability of no collisions,
;; hence (- 1.0 ...)
(- 1.0 rez)))

CL-USER> (birthday-collision-prob 20)
0.4114384
``````

So, among 20 people, there's already a 40% chance of observing a coinciding birthday. And this number grows quickly: it will become 50% at 23, 70% at 30, and 99.9% at just 70!

But why, on Earth, you could ask, have we started to discusss birthdays? Well, if you substitute keys for persons and the array size for the number of days in a year, you'll get the formula of the probability of at least one collision among the hashed keys in an array, provided the hash function produces perfectly uniform output. (It will be even higher if the distribution is non-uniform).

``````(defun hash-collision-prob (n size)
(let ((rez 1))
(dotimes (i n)
(:* rez (/ (- size i) size)))
(- 1.0 rez)))
``````

Let's say, we have 10 keys. What should be the array size to be safe against collisions?

``````CL-USER> (hash-collision-prob 10 10)
0.9996371
``````

99.9%. OK, we don't stand a chance to accidentally get a perfect layout. :( What if we double the array size?

``````CL-USER> (hash-collision-prob 10 20)
0.9345271
``````

93%. Still, pretty high.

``````CL-USER> (hash-collision-prob 10 100)
0.37184352
CL-USER> (hash-collision-prob 10 10000)
0.004491329
``````

So, if we were to use a 10k-element array to store 10 items the chance of a collision would fall below 1%. Not practical...

Note that the number depends on both arguments, so `(hash-collision-prob 10 100)` (0.37) is not the same as `(hash-collision-prob 20 200)` (0.63).

We did this exercise to completely abandon any hope of avoiding collisions and accept that they are inevitable. Such mind/coding experiments may be an effective smoke-test of our novel algorithmic ideas: before we go full-speed and implement them, it makes sense to perform some back-of-the-envelope feasibility calculations.

Now, let's discuss what difference the presence of these collisions makes to our hash-table idea and how to deal with this issue. The obvious solution is to have a fallback option: when two keys hash to the same index, store both of the items in a list. The retrieval operation, in this case, will require a sequential scan to find the requested key and return the corresponding value. Such an approach is called "chaining" and it is used by some implementations. Yet, it has a number of drawbacks:

• It complicates the implementation: we now have to deal with both a static array and a dynamic list/array/tree. This change opens a possibility for some hard-to-catch bugs, especially, in the concurrent settings.
• It requires more memory than the hash-table backing array, so we will be in a situation when some of the slots of the array are empty while others chain several elements.
• It will have poor performance due to the necessity of dealing with a linked structure and, what's worse, not respecting cache locality: the chain will not fit in the original array so at least one additional RAM round-trip will be required.

One upside of this approach is that it can store more elements than the size of the backing array. And, in the extreme case, it degrades to bucketing: when a small number of buckets point to long chains of randomly shuffled elements.

The more widely-used alternative to chaining is called "open addressing" or "closed hashing". With it, the chains are, basically, stored in the same backing array. The algorithm is simple: when the calculated hash is pointing at an already occupied slot in the array, find the next vacant slot by cycling over the array. If the table isn't full we're guaranteed to find one. If it is full, we need to resize it, first. Now, when the element is retrieved by key, we need to perform the same procedure: calculate the hash, then compare the key of the item at the returned index. if the keys are the same, we've found the desired element, otherwise — we need to cycle over the array comparing keys until we encounter the item we need.

Here's an implementation of the simple open addressing hash-table using `eql` for keys comparison:

``````(defstruct ht
array
(count 0))

(defun ht (&rest kvs)
(let ((rez (make-ht :array (make-array 16))))
(loop :for (k v) :in kvs :do
rez))

(defun ht-get (key ht)
(with ((size (length @ht.array)))
(start (rem (hash key) size)))
(do ((count 0 (1+ count))
(i start (rem (1+ i) size))
(item (? ht 'array start)
(? ht 'array i))
((or (null item)
(= count size)))
(when (eql key (car item))
(return
(values (cdr item)
;; the second value is an index, at which
;; the item was found (also used to distinguish
;; represented by nil but without the second value)
i))))))

(with ((array (ht-array ht))
(size (length array)))
;; flet defines a local function that has access
;; to the local variables defined in HT-ADD
(do ((i (rem (hash k) size)
(rem (1+ i) size))
((null @ht.array#i)
(:= @ht.array#i (cons k v)))
;; this do-loop doesn't have a body
)))
;; TALLY is a generic function for size retrieval, from RUTILS
(when (= (tally ht) size)
;; when the backing array is full
;; expand it to have the length equal to the next power of 2
(:= size (expt 2 (ceiling (log (1+ count) 2)))
@ht.array (make-array size))
(dovec (item array)
;; finally, add the new item

(defun ht-rem (key ht)
;; here, we use the index of the item
;; returned as the 2nd value of HT-GET
;; (when-it is a so called anaphoric macro, from RUTILS,
;; that assigns the value of its first argument
;; to an implicitly created variable IT
;; and evaluates the body when IT isn't null)
(when-it (nth-value 2 (ht-get key ht))
(void (? ht 'array it))
;; return the index to indicate that the item was found
it))
``````

To avoid constant resizing of the hash-table, just as with dynamic arrays, the backing array is, usually, allocated to have the size equal to a power of 2: 16 elements, to begin with. When it is filled up to a certain capacity it is resized to the next power of 2: 32, in this case. Usually, around 70-80% is considered peak occupancy as too collisions may happen afterward and the table access performance severely degrades. In practice, this means that normal open-addressing hash-tables also waste from 20 to 50 percent of allocated space. This inefficiency becomes a serious problem with large tables, so other implementation strategies become preferable when the size of data reaches tens and hundreds of megabytes. Note that, in our trivial implementation above, we have, effectively, used the threshold of 100% to simplify the code. Adding a configurable threshold is just a matter of introducing a parameter and initiating resizing not when `(= (ht-count ht) size)` but upon `(= (ht-count ht) (floor size threshold))`. As we've seen, resizing the hash-table requires calculating the new indices for all stored elements and adding them anew into the resized array.

Analyzing the complexity of the access function of the hash-table and proving that it is amortized `O(1)` isn't trivial. It depends on the properties of the hash-function, which should ensure good uniformity. Besides, the resizing threshold also matters: the more elements are in the table, the higher the chance of collisions. Also, you should keep in mind that if the keys possess some strange qualities that prevent them from being hashed uniformly, the theoretical results will not hold.

In short, if we consider a hash-table with 60% occupancy (which should be the average number, for a common table) we end up with the following probabilities:

• probability that we'll need just 1 operation to access the item (i.e. the initially indexed slot is empty): 0.4
• probability that we'll need 2 operations (the current slot is occupied, the next one is empty): `(* 0.6 0.4)` — 0.24
• probability that we'll need 3 operations: `(* (expt 0.6 2) 0.4)` — 0.14
• probability that we'll need 4 operations: `(* (expt 0.6 3) 0.4)` — 0.09

Actually, these calculations are slightly off and the correct probability of finding an empty slot should be somewhat lower, although the larger the table is, the smaller the deviation in the numbers. Finding out why is left as an exercise for the reader :)

As you see, there's a progression here. With probability around 0.87, we'll need no more than 4 operations. Without continuing with the arithmetic, I think, it should be obvious that we'll need, on average, around 3 operations to access each item and the probability that we'll need twice as many (6) is quite low (below 5%). So, we can say that the number of access operations is constant (i.e. independent of the number of elements in the table) and is determined only by the occupancy percent. So, if we keep the occupancy in the reasonable bounds, named earlier, on average, 1 hash code calculation/lookup and a couple of retrievals and equality comparisons will be needed to access an item in our hash-table.

### Hash-Code

So, we can conclude that a hash-table is primarily parametrized by two things: the hash-function and the equality predicate. In Lisp, in particular, there's a choice of just the four standard equality predicates: `eq`, `eql`, `equal`, and `equalp`. It's somewhat of a legacy that you can't use other comparison functions so some implementations, as an extension, allow th programmer to specify other predicates. However, in practice, the following approach is sufficient for the majority of the hash-table use cases:

• use the `eql` predicate if the keys are numbers, characters, or symbols
• use `equal` if the keys are strings or lists of the mentioned items
• use `equalp` if the keys are vectors, structs, CLOS objects or anything else containing one of those

But I'd recommend trying your best to avoid using the complex keys requiring `equalp`. Besides the performance penalty of using the heaviest equality predicate that performs deep structural comparison, structs, and vectors, in particular, will most likely hash to the same index. Here is a quote from one of the implementors describing why this happens:

Structs have no extra space to store a unique hash code within them. The decision was made to implement this because automatic inclusion of a hashing slot in all structure objects would have made all structs an average of one word longer. For small structs this is unacceptable. Instead, the user may define a struct with an extra slot, and the constructor for that struct type could store a unique value into that slot (either a random value or a value gotten by incrementing a counter each time the constructor is run). Also, create a hash generating function which accesses this hash-slot to generate its value. If the structs to be hashed are buried inside a list, then this hash function would need to know how to traverse these keys to obtain a unique value. Finally, then, build your hash-table using the `:hash-function` argument to make-hash-table (still using the equal test argument), to create a hash-table which will be well-distributed. Alternatively, and if you can guarantee that none of the slots in your structures will be changed after they are used as keys in the hash-table, you can use the `equalp` test function in your make-hash-table call, rather than equal. If you do, however, make sure that these struct objects don't change, because then they may not be found in the hash-table.

But what if you still need to use a struct or a CLOS object as a hash key (for instance, if you want to put them in a set)? There are three possible workarounds:

• Choose one of their slots as a key (if you can guarantee its uniqueness).
• Add a special slot to hold a unique value that will serve as a key.
• Use the literal representation obtained by calling the print-function of the object. Still, you'll need to ensure that it will be unique and constant. Using an item that changes while being the hash key is a source of very nasty bugs, so avoid it at all cost.

These considerations are also applicable to the question of why Java requires defining both `equals` and `hashCode` methods for objects that are used as keys in the hash-table or hash-set.

Beyond the direct implementation of open addressing, called "linear probing" (for it tries to resolve collisions by performing a linear scan for an empty slot), a number of approaches were proposed to improve hash distribution and reduce the collision rate. However, for the general case, their superiority remains questionable, and so the utility of a particular approach has to be tested in the context of the situations when linear probing demonstrates suboptimal behavior. One type of such situations occurs when the hash-codes become clustered near some locations due to deficiencies of either the hash-function or the keyset.

The simplest modification of linear probing is called "quadratic probing". It operates by performing the search for the next vacant slot using the linear probing offsets (or some other sequence of offsets) that are just raised to the power 2. I.e. if, with linear probing, the offset sequence was 1,2,3,etc, with the quadratic one, it is 1,4,9,... "Double hashing" is another simple alternative, which, instead of a linear sequence of offsets, calculates the offsets using another hash-function. This approach makes the sequence specific to each key, so the keys that map to the same location will have different possible variants of collision resolution. "2-choice hashing" also uses 2 hash-functions but selects the particular one for each key based on the distance from the original index it has to be moved for collision resolution.

More elaborate changes to the original idea are proposed in Cuckoo, Hopscotch, and Robin Hood caching, to name some of the popular alternatives. We won't discuss them now, but if the need arises to implement a non-standard hash-table it's worth studying all of those before proceeding with an idea of your own. Although, who knows, someday you might come up with a viable alternative technique, as well...

## Hash-Functions

The class of possible hash-functions is very diverse: any function that sufficiently randomizes the key hashes will do. But what good enough means? One of the ways to find out is to look at the the pictures of the distribution of hashes. Yet, there are other factors that may condition the choice: speed, complexity of implementation, collision resistance (important for cryptographic hashes that we won't discuss in this book).

The good news is that, for most practical purposes, there's a single function that is both fast and easy to implement and understand. It is called FNV-1a.

``````(defparameter *fnv-primes*
'((32 . 16777619)
(64 . 1099511628211)
(128 . 309485009821345068724781371)
(256 . 374144419156711147060143317175368453031918731002211)))

(defparameter *fnv-offsets*
'((32 . 2166136261)
(64 . 14695981039346656037)
(128 . 144066263297769815596495629667062367629)
(256 . 100029257958052580907070968620625704837092796014241193945225284501741471925557)))

(defun fnv-1a (x &key (bits 32))
(assert (member bits '(32 64 128 256)))
(let ((rez (assoc1 bits *fnv-offsets*))
(prime (assoc1 bits *fnv-primes*)))
(dotimes (i (/ bits 8))
(:= rez (ldb (byte bits 0)
(* (logxor rez (ldb (byte 8 (* i 8)) x))
prime))))
rez))
``````

The constants `*fnv-primes*` and `*fnv-offsets*` are precalculated up to 1024 bits (here, I used just a portion of the tables).

Note that, in this implementation, we use normal Lisp multiplication (`*`) that is not limited to fixed-size numbers (32-bit, 64-bit,...) so we need to extract only the first `bits` with `ldb`.

Also note that if you were to calculate FNV-1a with some online hash calculator you'd, probably, get a different result. Experimenting with it, I noticed that it is the same if we use only the non-zero bytes from the input number. This observation aligns well with calculating the hash for simple strings when each character is a single byte. For them the hash-function would look like the following:

``````(defun fnv-1a-str (str)
(let ((rez (assoc1 32 *fnv-offsets*))
(prime (assoc1 32 *fnv-primes*)))
(dovec (char str)
(:= rez (ldb 32 (* (logxor rez (ldb (byte 8 (* i 8))
(char-code char)))
prime))))
rez))
``````

So, even such a simple hash-function has nuances in its implementation and it should be meticulously checked against some reference implementation or a set of expected results.

Alongside FNV-1a, there's also FNV-1, which is a slightly worse variation, but it may be used if we need to apply 2 different hash functions at once (like, in 2-way or double hashing).

What is the source of the hashing property of FNV-1a? Xors and modulos. Combining these simple and efficient operations is enough to create a desired level of randomization. Most of the other hash-functions use the same building blocks as FNV-1a. They all perform arithmetic (usually, addition and multiplication as division is slow) and xor'ing, adding into the mix some prime numbers. For instance, here's what the code for another popular hash-function "djb2" approximately looks like:

``````(defun djb2-str (str)
(let ((rez 5381)  ; a DJB2 prime number
(loop :for char :across str :do
(:= rez (ldb 32 (+ (char-code char)
(ldb 32 (+ (ash rez 5)
rez)))))))
rez))
``````

## Operations

### Initialization

Normally, the hash-table can be created with `make-hash-table`, which has a number of configuration options, including `:test` (default: `eql`). Most of the implementations allow the programmer to make synchronized (thread-safe) hash-tables via another configuration parameter, but the variants of concurrency control will differ.

Yet, it is important to have a way to define hash-tables already pre-initialized with a number of key-value pairs, and `make-hash-table` can't handle this. Pre-initialized hash tables represent a common necessity for tables serving as dictionaries, and such pre-initialization greatly simplifies many code patterns. Thus RUTILS provides such a syntax (in fact, in 2 flavors) with the help of reader macros:

``````#{equal "foo" :bar "baz" 42}
#h(equal "foo" :bar "baz" 42)
``````

Both of these expressions will expand into a call to `make-hash-table` with `equal` test and a two calls to set operation to populate the table with the kv-pairs `"foo" :bar` and `"baz" 42`. For this stuff to work, you need to switch to the appropriate readtable by executing: `(named-readtables:in-readtable rutils-readtable)`.

The reader-macro to parse `#h()`-style literal readtables isn't very complicated. As all reader-macros, it operates on the character stream of the program text, processing one character at a time. Here is it's implementation:

``````(defun |#h-reader| (stream char arg)
(read-char stream)  ; skip the open paren
;; we can also add a sanity check to ensure that this character
;; is indeed a #\(
(with (;; read-delimited-list is a standard library function
;; that reads items until a delimiter is encountered
;; and then returns them as a list of parsed Lisp objects
;; the idea is that the first element may be a hash-table
;; test function; in this case, the number of items in the
;; definition will be odd as each key-value pair should have
;; an even number of elements
(test (when (oddp (length sexp))
(first sexp)))
;; the rest of the values, after the possible test function,
;; are key-value pairs
(kvs (group 2 (if test (rest sexp) sexp)))
(ht (gensym)))
`(let ((,ht (make-hash-table :test ',(or test 'eql))))
;; iterate the tail of the KVS list (:on loop clause)
;; and, for each key-value pair, generate an expression
;; to add the value for the key in the resulting hash-table
,@(mapcar (lambda (kv)
`(:= (? ,ht ,(first kv)) ,(second kv)))
,kvs)
,ht)))
``````

After such a function is defined, it can be plugged into the standard readtable:

``(set-dispatch-macro-character #\# #\h '|#h-reader|)``

Or it may be used in a named-readtable (you can learn how to do that, from the docs).

`print-hash-table` is the utility to perform the reverse operation — display hash-tables in the similar manner:

``````RUTILS> (print-hash-table #h(equal "foo" :bar "baz" 42))
#{EQUAL
"foo" :BAR
"baz" 42
}
#<HASH-TABLE :TEST EQUAL :COUNT 2 {10127C0003}>
``````

The last line of the output is the default Lisp printed representation of the hash-table. As you see, it is opaque and doesn't display the elements of the table. RUTILS also allows switching to printing the literal representation instead of the standard one with the help of `toggle-print-hash-table`. However, this extension is intended only for debugging purposes as it is not fully standard-conforming.

### Access

Accessing the hash-table elements is performed with `gethash`, which returns two things: the value at key and `t` when the key was found in the table, or two nils otherwise. By using `(:= (gethash key ht) val)` (or `(:= (? ht key) val)`) we can modify the stored value. Notice the reverse order of arguments of `gethash` compared to the usual order in most accessor functions, when the structure is placed first and the key second. However, `gethash` differs from generic `?` in that it accepts an optional argument that is used as the default value if the requested key is not present in the table. In some languages, like Python, there's a notion of "default hash-tables" that may be initialized with a common default element. In Lisp, a different approach is taken. However, it's possible to easily implement default hash-tables and plug them into the `generic-elt` mechanism:

``````(defstruct default-hash-table
table
default-value)

(defun gethash-default (key ht)
(gethash key (? ht 'table) (? ht 'default-value)))

(defmethod generic-elt ((kv default-hash-table) key &rest keys)
(gethash-default key kv))
``````

RUTILS also defines a number of aliases/shorthands for hash-table operations. As the `#` symbol is etymologically associated with hashes, it is used in the names of all these functions:

• `get#` is a shorthand and a more distinctive alias for `gethash`
• `set#` is an alias for `(:= (gethash ...`
• `getset#` is an implementation of the common pattern: this operation either retrieves the value if the key is found in the table or calculates its third argument returns it and also sets it for the given key for future retrieval
• `rem#` is an alias for `remhash` (remove the element from the table)
• `take#` both returns the key and removes it (unlike `rem#` that only removes)
• `in#` tests for the presence of the key in the table
• also, `p#` is an abbreviated version of `print-hash-table`

### Iteration

Hash-tables are unordered collections, in principle. But, still, there is always a way to iterate over them in some (unspecified) order. The standard utility for that is either `maphash`, which unlike `map` doesn't populate the resulting collection and is called just for the side effects, or the special `loop` syntax. Both are suboptimal, from several points of view, so RUTILS defines a couple of alternative options:

• `dotable` functions in the same manner as `dolist` except that it uses two variables: for the key and the value
• `mapkv`, mentioned in the previous chapter, works just like `mapcar` by creating a new result table with the same configuration as the hash-table it iterates over and assigns the results of invoking the first argument — the function of two elements — with each of the kv-pairs

Despite the absence of a predefined ordering, there are ways in which some order may be introduced. For example, in SBCL, the order in which the elements are added, is preserved by using additional vectors called `index-vector` and `next-vector` that store this information. Another option which allows forcing arbitrary ordering is to use the so-called Linked Hash-Table. It is a combination of a hash-table and a linked list: each key-value pair also has the next pointer, which links it to some other item in the table. This way, it is possible to have ordered key-values without resorting to tree-based structures. A poor man's linked hash-table can be created on top of the normal one with the following trick: substitute values by pairs containing a value plus a pointer to the next pair and keep track of the pointer to the first pair in a special slot.

``````(deftruct linked-hash-table-item
key
val
next)

table
tail)

(? ht 'table key 'val))

;; The initial order of items is the order of addition.
;; If we'd like to impose a different order,
;; we'll have to perform reordering after each addition
;; or implement a custom sethash function.
(with (((table head tail) ? ht)
(cur (gethash key table)))
(if cur
(:= (? cur 'val) val)
:key key :val val)))
(:= (? ht 'tail)
(if tail
(:= (? ht 'tail 'next) new)
new))))))

:table (hash-table :key (hash-table-key (? rez 'table)))))
(prev nil))
(do ((item (? rez 'head) (? item 'next)))
((null item))
(sethash-linked key rez (call fn (? item 'val))))))
``````

The issue with this approach, as you can see from the code, is that we also need to store the key, and it duplicates the data also stored in the backing hash-table itself. So, an efficient linked hash-table has to be implemented from scratch using an array as a base instead of a hash-table.

## Perfect Hashing

In the previous exposition, we have concluded that using hash-tables implies a significant level of reserved unused space (up to 30%) and inevitable collisions. Yet, if the keyset is static and known beforehand, we can do better: find a hash-function, which will exclude collisions (simple perfect hashing) and even totally get rid of reserved space (minimal perfect hashing, MPH). Although the last variant will still need extra space to store the additional information about the hash-functions, it may be much smaller: in some methods, down to ~3-4 bits per key, so just 5-10% overhead. Statistically speaking, constructing such a hash-function is possible. But the search for its parameters may require some trial and error.

### Implementation

The general idea is simple, but how to find the appropriate hash-function? There are several approaches described in sometimes hard-to-follow scientific papers and a number of cryptic programs in low-level C libraries. At a certain point in time, I needed to implement some variant of an MPH so I read those papers and studied the libraries to some extent. Not the most pleasant process, I should confess. One of my twitter pals once wrote: "Looks like it's easier for people to read 40 blog posts than a single whitepaper." And, although he was putting a negative connotation to it, I recognized the statement as a very precise description of what a research engineer does: read a whitepaper (or a dozen, for what it's worth) and transform it into working code and — as a possible byproduct — into an explanation ("blog post") that other engineers will understand and be able to reproduce. And it's not a skill every software developer should be easily capable of. Not all papers can even be reproduced because the experiment was not set up correctly, some parts of the description are missing, the data is not available, etc. Of those, which, in principle, can be, only some are presented in the form that is clear enough to be reliably programmed.

Here is one of the variants of minimal perfect hashing that possesses such qualities. It works for datasets of any size as a 3-step process:

1. At the first stage, by the use of a common hash-function (in particular, the Jenkins hash), all keys are near-uniformly distributed into buckets, so that the number of keys in each bucket doesn't exceed 256. It can be achieved with very high probability if the hash divisor is set to `(ceiling (length keyset) 200)`. This allows the algorithm to work for data sets of arbitrary size, thereby reducing the problem to a simpler one that already has a known solution.
2. Next, for each bucket, the perfect hash function is constructed. This function is a table (and it's an important mathematical fact that each discrete function is equivalent to a table, albeit, potentially, of unlimited length). The table contains byte-sized offsets for each hash code, calculated by another application of the Jenkins hash, which produces two values in one go (actually, three, but one of them is not used). The divisor of the hash-function, this time, equals to double the number of elements in the bucket. And the uniqueness requirement is that the sum of offsets corresponding, in the table, to the two values produced by the Jenkins hash is unique, for each key. To check if the constraint is satisfied, the hashes are treated as vertices of a graph, and if it happens to be acyclic (the probability of this event is quite high if the parameters are chosen properly), the requirement can be satisfied, and it is possible to construct the perfect hash function, by the process described as the next step. Otherwise, we change the seed of the Jenkins hash and try again until the resulting graph is acyclic. In practice, just a couple of tries are needed.
3. Finally, the hash-function for the current bucket may be constructed from the graph by the CHM92 algorithm (named after the authors and the year of the paper), which is another version of perfect hashing but suitable only for limited keysets. Here, you can see the CHM92 formula implemented in code:
``````(defstruct mpht
(data nil :type simple-vector)
(offsets nil :type (simple-array octet))
(div nil))

;; div is the divisor of the top-level hash, which is calculated as:
;; (/ (1- (length meta)) 2)

(defun mpht-index (item mpht)
(with (((offsets meta div) ? mpht)
(bucket-id (* (mod (jenkins-hash item) div) 2))
(bucket-offset (? meta bucket-id))
(bucket-seed (? meta (+ 1 bucket-id)))
;; the number of items in the bucket is calculated
;; by substracting the offset of the next bucket
;; from the offset of the current one
(bucket-count (- (? meta (+ 2 bucket-id))
bucket-offset))
(hash1 hash2 (jenkins-hash2 item bucket-seed bucket-div))
(base (* bucket-offset 2)))
(+ bucket-offset (mod (+ (? offsets (+ base hash1))
(? offsets (+ base hash2)))
bucket-count))))
``````

This algorithm guarantees exactly `O(1)` hash-table access and uses 2 bytes per key, i.e. it will result in a constant 25% overhead on the table's size (in a 64-bit system): 2 byte-sized offsets for the hashes plus negligible 8 bytes per bucket (each bucket contains ~200 elements) for meta information. Better space-utilization solutions (up to 4 times more efficient) exist, but they are harder to implement and explain.

The Jenkins hash-function was chosen for two reasons:

• Primarily, because, being a relatively good-quality hash, it has a configurable parameter `seed` that is used for probabilistic probing (searching for an acyclic graph). On the contrary, FNV-1a doesn't work well with an arbitrary prime hence the usage of a pre-calculated one that isn't subject to change.
• Also, it produces 3 pseudo-random numbers right away, and we need 2 for the second stage of the algorithm.

### The CHM92 Algorithm

The CHM92 algorithm operates by performing a depth-first search (DFS) on the graph, in the process, labeling the edges with unique numbers and calculating the corresponding offset for each of the Jenkins hash values. In the picture, you can see one of the possible labelings: each vertex is the value of one of the two hash-codes returned by `jenkins-hash2` for each key, and every edge, connecting them, corresponds to a key that produced the hashes. The unique indices of the edges were obtained during DFS. Now, each hash-code is mapped iteratively to the number that is `(- edge-index other-vertex-index)`. So, some codes will map to the same number, but it is guaranteed that, for each key, the sum of two corresponding numbers will be unique (as the edge indices are unique).

CHM92 is an example of the probabilistic algorithms we will discuss in more detail near the end of the book.

Let's say we have implemented the described scheme like I did in the const-table library. Now, we need to perform the measurements to validate that we have, in fact, achieved the desired improvement over the standard hash-table implementation. In this case, we are interested not only in speed measurements, which we already know how to perform but also in calculating the space occupied.

The latter goal is harder to achieve. Usually, most of the programming languages will provide the analog of a `sizeof` function that returns the space occupied by an array, a structure or an object. Here, we're interested not in "shallow" `sizeof` but in a "deep" one that will descend into the structure's slots and add their sizes recursively.

First, let's create functions to populate the tables with a significant number of random string key-value pairs.

``````(defun random-string (size)
(coerce (loop :repeat size :collect (code-char (+ 32 (random 100))))
'string))

(defun random-hash-table (&key (n 100000))
(let ((rez (make-hash-table :test 'equal)))
(loop :repeat n :do
(:= (? rez (random-string (+ 3 (random 4))))
(random-string (+ 3 (random 4)))))
rez))

(defun random-const-table (&key (n 100000))
(let ((rez (make-const-table :test 'equal)))
(loop :repeat n :do
(:= (? rez (random-string (+ 3 (random 4))))
(random-string (+ 3 (random 4)))))
rez))
``````

A very approximate space measurement may be performed using the standard operator `room`. But it doesn't provide detailed per-object statistics. Here's a result of the `room` measurement, in SBCL (the format of the report will be somewhat different, for each implementation):

``````CL-USER> (room)
Dynamic space usage is:   45,076,224 bytes.
Immobile space usage is:  18,998,832 bytes (64,672 bytes overhead).
Read-only space usage is:          0 bytes.
Static space usage is:         1,264 bytes.
Control stack usage is:        9,048 bytes.
Binding stack usage is:          640 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
11,369,232 bytes for  76,040 simple-vector objects
9,095,952 bytes for 160,669 instance objects
8,289,568 bytes for 518,098 cons objects
3,105,920 bytes for  54,655 simple-array-unsigned-byte-8 objects
2,789,168 bytes for  54,537 simple-base-string objects
2,344,672 bytes for   9,217 simple-character-string objects
6,973,472 bytes for 115,152 other objects

43,967,984 bytes for 988,368 dynamic objects (space total)

Breakdown for immobile space:
16,197,840 bytes for 24,269 code objects
1,286,496 bytes for 26,789 symbol objects
1,041,936 bytes for 27,922 other objects

18,526,272 bytes for 78,980 immobile objects (space total)

CL-USER> (defparameter *ht* (random-hash-table))
*HT*
CL-USER> (room)
...
Breakdown for dynamic space:
13,349,920 bytes for    77,984 simple-vector objects
11,127,008 bytes for   208,576 simple-character-string objects
9,147,824 bytes for   161,469 instance objects
8,419,360 bytes for   526,210 cons objects
3,517,792 bytes for     2,997 simple-array-unsigned-byte-32 objects
3,106,288 bytes for    54,661 simple-array-unsigned-byte-8 objects
7,671,168 bytes for   166,882 other objects

56,339,360 bytes for 1,198,779 dynamic objects (space total)
``````

So, it seems like we added roughly 10 megabytes by creating a hash-table with 100,000 random 5-9 character keys and values. Almost all of that space went into the keys and values themselves — 9 Mb ("11,127,008 bytes for 208,576 simple-character-string objects" versus "2,344,672 bytes for 9,217 simple-character-string objects" — a bit less than 200,000 new strings were added).

Also, if we examine the hash-table, we can see that its occupancy is rather high — around 90%! (The number of keys 99706 instead of 10000 tells us that there was a small portion of duplicate keys among the randomly generated ones).

``````CL-USER> (describe *ht*)
#<HASH-TABLE :TEST EQUAL :COUNT 99706 {1002162EF3}>
[hash-table]

Occupancy: 0.9
Rehash-threshold: 1.0
Rehash-size: 1.5
Size: 111411
``````

And now, a simple time measurement:

``````CL-USER> (let ((keys (keys *ht*)))
(time (loop :repeat 100 :do
(dolist (k keys)
(gethash k *ht*)))))
Evaluation took:
0.029 seconds of real time
0.032000 seconds of total run time (0.032000 user, 0.000000 system)
110.34% CPU
72,079,880 processor cycles
0 bytes consed
``````

Now, let's try the `const-table`s that are the MPHT implementation:

``````СL-USER> (time (defparameter *ct* (cstab:build-const-table *ht*)))
...................................................................................................
Evaluation took:
0.864 seconds of real time
...
СL-USER> (room)
...
Breakdown for dynamic space:
14,179,584 bytes for    78,624 simple-vector objects
11,128,464 bytes for   208,582 simple-character-string objects
9,169,120 bytes for   161,815 instance objects
8,481,536 bytes for   530,096 cons objects
3,521,808 bytes for     2,998 simple-array-unsigned-byte-32 objects
3,305,984 bytes for    54,668 simple-array-unsigned-byte-8 objects
7,678,064 bytes for   166,992 other objects

57,464,560 bytes for 1,203,775 dynamic objects (space total)
``````

Another megabyte was added for the metadata of the new table, which doesn't seem significantly different from the hash-table version. Surely, often we'd like to be much more precise in space measurements. For this, SBCL recently added an allocation profiler `sb-aprof`, but we won't go into the details of its usage, in this chapter.

And now, time measurement:

``````CL-USER> (let ((keys (keys *ht*)))
(time (loop :repeat 100 :do
(dolist (k keys)
(cstab:csget k *ct*)))))
Evaluation took:
3.561 seconds of real time
``````

Oops, a two-orders-of-magnitude slowdown! Probably, it has to do with many factors: the lack of optimization in my implementation compared to the one in SBCL, the need to calculate more hashes and with a slower hash-function, etc. I'm sure that the implementation may be sped up at least an order of magnitude, but, even then, what's the benefit of using it over the default hash-tables? Especially, considering that MPHTs have a lot of moving parts and rely on a number of "low-level" algorithms like graph traversal or efficient membership testing, most of which need a custom efficient implementation...

Still, there's one dimension in which MPHTs may provide an advantage: significantly reduce space usage by not storing the keys. Though, it becomes problematic if we need to distinguish the keys that are in the table from the unknown ones as those will also hash to some index, i.e. overlap with an existing key. So, either the keyspace should be known beforehand and exhaustively covered in the table or some precursory membership test is necessary when we anticipate the possibility of unseen keys. Yet, there are ways to perform the test efficiently (exactly or probabilistically), which require much less storage space than would be needed to store the keys themselves. Some of them we'll see in the following chapters.

If the keys are omitted, the whole table may be reduced to a Jump-table. Jump-tables are a low-level trick possible when all the keys are integers in the interval `[0, n)`. It removes the necessity to perform sequential equality comparisons for every possible branch until one of the conditions matches: instead, the numbers are used directly as an offset. I.e. the table is represented by a vector, each hash-code being the index in that vector.

A jump-table for the MPHT will be simply a data array, but sometimes evaluation of different code is required for different keys. Such more complex behavior may be implemented in Lisp using the lowest-level operators `tagbody` and `go` (and a bit of macrology if we need to generate a huge table). This implementation will be a complete analog of the C `switch` statement. The skeleton for such "executable" table will look like this, where 0, 1,... are goto labels:

``````(block nil
(tagbody (go key)
0 (return (do-something0))
1 (return (do-something1))
...))
``````

## Distributed Hash-Tables

Another active area of hash-table-related research is algorithms for distributing them over the network. This is a natural way to represent a lot of datasets, and thus there are numerous storage systems (both general- and special-purpose) which are built as distributed hash-tables. Among them are, for instance, Amazon DynamoDB or an influential open-source project Kademlia. We will discuss in more detail, in the chapter on Distributed Algorithms, some of the technologies developed for this use case, and here I wanted to mention just one concept.

Consistent Hashing addresses the problem of distributing the hash-codes among `k` storage nodes under the real-world limitations that some of them may become temporarily unavailable or new peers may be added into the system. The changes result in changes of the value of `k`. The straightforward approach would just divide the space of all codes into `k` equal portions and select the node into whose portion the particular key maps. Yet, if `k` is changed, all the keys need to be rehashed, which we'd like to avoid at all cost as rehashing the whole database and moving the majority of the keys between the nodes, at once, will saturate the network and bring the system to a complete halt.

The idea or rather the tweak behind Consistent Hashing is simple: we also hash the node ids and store the keys on the node that has the next hash-code larger than the hash of the key (modulo `n`, i.e. wrap around 0). Now, when a new node is added, it is placed on this so-called "hash ring" between two other peers, so only part of the keys from a single node (the next on the ring) require being redistributed to it. Likewise, when the node is removed, only its keys need to be reassigned to the next peer on the ring (it is supposed that the data is stored in multiple copies on different nodes, so when one of the nodes disappears the data doesn't become totally lost).

The only problem with applying this approach directly is the uneven distribution of keys originating from uneven placement of the hash-codes of the nodes on the hash ring. This problem can be solved with another simple tweak: have multiple ids for each node that will be hashed to different locations, effectively emulating a larger number of virtual nodes, each storing a smaller portion of the keys. Due to the randomization property of hashes, not so many virtual nodes will be needed, to obtain a nearly uniform distribution of keys over the nodes.

A more general version of this approach is called Rendezvous Hashing. In it, the key for the item is combined with the node id for each node and then hashed. The largest value of the hash determines the designated node to store the item.

## Hashing in Action: Content Addressing

Hash-tables are so ubiquitous that it's, actually, difficult to single out some peculiar use case. Instead, let's talk about hash-functions. They can find numerous uses beyond determining the positions of the items in the hash-table, and one of them is called "content addressing": globally identify a piece of data by its fingerprint instead of using external meta information like name or path. This is one of the suggested building blocks for large-scale distributed storage systems, but it works locally, as well: your git SCM system silently uses it behind the scenes to identify the changesets it operates upon.

• Potential for space economy: if the system has a chance of operating on repeated items (like git does, although it's not the only reason for choosing such naming scheme for blobs: the other being the lack of a better variant), content addressing will make it possible to avoid storing them multiple times.
• It guarantees that the links will always return the same content, regardless of where it is retrieved from, who added it to the network, how and when. This enables such distributed protocols as BitTorrent that split the original file into multiple pieces, each one identified by its hash. These pieces can be distributed in an untrusted network.
• As mentioned above, content addressing also results in a conflict-free naming scheme (provided that the hash has enough bits — usually, cryptographic hashes such as SHA-1 are used for this purpose, although, in many cases, such powerful hash-functions are an overkill).

## Take-aways

This chapter resented a number of complex approaches that require a lot of attention to detail to be implemented efficiently. On the surface, the hash-table concept may seem rather simple, but, as we have seen, the production-grade implementations are not that straightforward. What general conclusions can we make?

1. In such mathematically loaded areas as hash-function and hash-table implementation, rigorous testing is critically important. For there is a number of unexpected sources of errors: incorrect implementation, integer overflow, concurrency issues, etc. A good testing strategy is to use an already existing trusted implementation and perform a large-scale comparison testing with a lot of random inputs. We haven't discussed the testing code here but will return to the practical implementation of such testing frameworks in the following chapters.
2. Besides, a correct implementation doesn't necessarily mean a fast one. Low-level optimization techniques play a crucial role here.
3. In the implementation of MPHT, we have seen in action another important approach to solving algorithmic and, more generally, mathematic problems: reducing them to a problem that has a known solution.
4. Space measurement is another important area of algorithms evaluation that is somewhat harder to accomplish than runtime profiling. We'll also see more usage of both of these tools throughout the book.