## 2019-09-11

### Programming Algorithms: Hash-Tables

Now, we can move on to studying advanced data structures which are built on top of the basic ones such as arrays and lists, but may exhibit distinct properties, have different use cases, and special algorithms. Many of them will combine the basic data structures to obtain new properties not accessible to the underlying structures. The first and most important of these advanced structures is, undoubtedly, the hash-table. However vast is the list of candidates to serve as key-values, hash-tables are the default choice for implementing them.

Also, hash-sets, in general, serve as the main representation for medium and large-sized sets as they ensure `O(1)` membership test, as well as optimal set-theoretic operations complexity. A simple version of a hash-set can be created using a normal hash-table with `t` for all values.

## Implementation

The basic properties of hash-tables are average `O(1)` access and support for arbitrary keys. These features can be realized by storing the items in an array at indices determined by a specialized function that maps the keys in a pseudo-random way — hashes them. Technically, the keys should pertain to the domain that allows hashing, but, in practice, it is always possible to ensure either directly or by using an intermediate transformation. The choice of variants for the hash-function is rather big, but there are some limitations to keep in mind:

1. As the backing array has a limited number of cells (`n`), the function should produce values in the interval `[0, n)`. This limitation can be respected by a 2-step process: first, produce a number in an arbitrary range (for instance, a 32-bit integer) and then take the remainder of its division by `n`.
2. Ideally, the distribution of indices should be uniform, but similar keys should map to quite distinct indices. I.e. hashing should turn things which are close, into things which are distant. This way, even very small changes to the input will yield sweeping changes in the value of the hash. This property is called the "avalanche effect".

### Dealing with Collisions

Even better would be if there were no collisions — situations when two or more keys are mapped to the same index. Is that, at all, possible? Theoretically, yes, but all the practical implementations that we have found so far are too slow and not feasible for a hash-table that is dynamically updated. However, such approaches may be used if the keyset is static and known beforehand. They will be covered in the discussion of perfect hash-tables.

For dynamic hash-tables, we have to accept that collisions are inevitable. The probability of collisions is governed by an interesting phenomenon called "The Birthday Paradox". Let's say, we have a group of people of some size, for instance, 20. What is the probability that two of them have birthdays on the same date? It may seem quite improbable, considering that there are 365 days in a year and we are talking just about a handful of people. But if you take into account that we need to examine each pair of people to learn about their possible birthday collision that will give us `(/ (* 20 19) 2)`, i.e. 190 pairs. We can calculate the exact probability by taking the complement to the probability that no one has a birthday collision, which is easier to reason about. The probability that two people don't share their birthday is `(/ (- 365 1) 365)`: there's only 1 chance in 365 that they do. For three people, we can use the chain rule and state that the probability that they don't have a birthday collision is a product of the probability that any two of them don't have it and that the third person also doesn't share a birthday with any of them. This results in `(* (/ 364 365) (/ (- 365 2) 365))`. The value `(- 365 2)` refers to the third person not having a birthday intersection with neither the first nor the second individually, and those are distinct, as we have already asserted in the first term. Continuing in such fashion, we can count the number for 20 persons:

``````(defun birthday-collision-prob (n)
(let ((rez 1))
(dotimes (i n)
(:* rez (/ (- 365 i) 365)))
;; don't forget that we want the complement
;; of the probability of no collisions,
;; hence (- 1.0 ...)
(- 1.0 rez)))

CL-USER> (birthday-collision-prob 20)
0.4114384
``````

So, among 20 people, there's already a 40% chance of observing a coinciding birthday. And this number grows quickly: it will become 50% at 23, 70% at 30, and 99.9% at just 70!

But why, on Earth, you could ask, have we started to discusss birthdays? Well, if you substitute keys for persons and the array size for the number of days in a year, you'll get the formula of the probability of at least one collision among the hashed keys in an array, provided the hash function produces perfectly uniform output. (It will be even higher if the distribution is non-uniform).

``````(defun hash-collision-prob (n size)
(let ((rez 1))
(dotimes (i n)
(:* rez (/ (- size i) size)))
(- 1.0 rez)))
``````

Let's say, we have 10 keys. What should be the array size to be safe against collisions?

``````CL-USER> (hash-collision-prob 10 10)
0.9996371
``````

99.9%. OK, we don't stand a chance to accidentally get a perfect layout. :( What if we double the array size?

``````CL-USER> (hash-collision-prob 10 20)
0.9345271
``````

93%. Still, pretty high.

``````CL-USER> (hash-collision-prob 10 100)
0.37184352
CL-USER> (hash-collision-prob 10 10000)
0.004491329
``````

So, if we were to use a 10k-element array to store 10 items the chance of a collision would fall below 1%. Not practical...

Note that the number depends on both arguments, so `(hash-collision-prob 10 100)` (0.37) is not the same as `(hash-collision-prob 20 200)` (0.63).

We did this exercise to completely abandon any hope of avoiding collisions and accept that they are inevitable. Such mind/coding experiments may be an effective smoke-test of our novel algorithmic ideas: before we go full-speed and implement them, it makes sense to perform some back-of-the-envelope feasibility calculations.

Now, let's discuss what difference the presence of these collisions makes to our hash-table idea and how to deal with this issue. The obvious solution is to have a fallback option: when two keys hash to the same index, store both of the items in a list. The retrieval operation, in this case, will require a sequential scan to find the requested key and return the corresponding value. Such an approach is called "chaining" and it is used by some implementations. Yet, it has a number of drawbacks:

• It complicates the implementation: we now have to deal with both a static array and a dynamic list/array/tree. This change opens a possibility for some hard-to-catch bugs, especially, in the concurrent settings.
• It requires more memory than the hash-table backing array, so we will be in a situation when some of the slots of the array are empty while others chain several elements.
• It will have poor performance due to the necessity of dealing with a linked structure and, what's worse, not respecting cache locality: the chain will not fit in the original array so at least one additional RAM round-trip will be required.

One upside of this approach is that it can store more elements than the size of the backing array. And, in the extreme case, it degrades to bucketing: when a small number of buckets point to long chains of randomly shuffled elements.

The more widely-used alternative to chaining is called "open addressing" or "closed hashing". With it, the chains are, basically, stored in the same backing array. The algorithm is simple: when the calculated hash is pointing at an already occupied slot in the array, find the next vacant slot by cycling over the array. If the table isn't full we're guaranteed to find one. If it is full, we need to resize it, first. Now, when the element is retrieved by key, we need to perform the same procedure: calculate the hash, then compare the key of the item at the returned index. if the keys are the same, we've found the desired element, otherwise — we need to cycle over the array comparing keys until we encounter the item we need.

Here's an implementation of the simple open addressing hash-table using `eql` for keys comparison:

``````(defstruct ht
array
(count 0))

(defun ht (&rest kvs)
(let ((rez (make-ht :array (make-array 16))))
(loop :for (k v) :in kvs :do
rez))

(defun ht-get (key ht)
(with ((size (length @ht.array)))
(start (rem (hash key) size)))
(do ((count 0 (1+ count))
(i start (rem (1+ i) size))
(item (? ht 'array start)
(? ht 'array i))
((or (null item)
(= count size)))
(when (eql key (car item))
(return
(values (cdr item)
;; the second value is an index, at which
;; the item was found (also used to distinguish
;; represented by nil but without the second value)
i))))))

(with ((array (ht-array ht))
(size (length array)))
;; flet defines a local function that has access
;; to the local variables defined in HT-ADD
(do ((i (rem (hash k) size)
(rem (1+ i) size))
((null @ht.array#i)
(:= @ht.array#i (cons k v)))
;; this do-loop doesn't have a body
)))
;; TALLY is a generic function for size retrieval, from RUTILS
(when (= (tally ht) size)
;; when the backing array is full
;; expand it to have the length equal to the next power of 2
(:= size (expt 2 (ceiling (log (1+ count) 2)))
@ht.array (make-array size))
(dovec (item array)
;; finally, add the new item

(defun ht-rem (key ht)
;; here, we use the index of the item
;; returned as the 2nd value of HT-GET
;; (when-it is a so called anaphoric macro, from RUTILS,
;; that assigns the value of its first argument
;; to an implicitly created variable IT
;; and evaluates the body when IT isn't null)
(when-it (nth-value 2 (ht-get key ht))
(void (? ht 'array it))
;; return the index to indicate that the item was found
it))
``````

To avoid constant resizing of the hash-table, just as with dynamic arrays, the backing array is, usually, allocated to have the size equal to a power of 2: 16 elements, to begin with. When it is filled up to a certain capacity it is resized to the next power of 2: 32, in this case. Usually, around 70-80% is considered peak occupancy as too collisions may happen afterward and the table access performance severely degrades. In practice, this means that normal open-addressing hash-tables also waste from 20 to 50 percent of allocated space. This inefficiency becomes a serious problem with large tables, so other implementation strategies become preferable when the size of data reaches tens and hundreds of megabytes. Note that, in our trivial implementation above, we have, effectively, used the threshold of 100% to simplify the code. Adding a configurable threshold is just a matter of introducing a parameter and initiating resizing not when `(= (ht-count ht) size)` but upon `(= (ht-count ht) (floor size threshold))`. As we've seen, resizing the hash-table requires calculating the new indices for all stored elements and adding them anew into the resized array.

Analyzing the complexity of the access function of the hash-table and proving that it is amortized `O(1)` isn't trivial. It depends on the properties of the hash-function, which should ensure good uniformity. Besides, the resizing threshold also matters: the more elements are in the table, the higher the chance of collisions. Also, you should keep in mind that if the keys possess some strange qualities that prevent them from being hashed uniformly, the theoretical results will not hold.

In short, if we consider a hash-table with 60% occupancy (which should be the average number, for a common table) we end up with the following probabilities:

• probability that we'll need just 1 operation to access the item (i.e. the initially indexed slot is empty): 0.4
• probability that we'll need 2 operations (the current slot is occupied, the next one is empty): `(* 0.6 0.4)` — 0.24
• probability that we'll need 3 operations: `(* (expt 0.6 2) 0.4)` — 0.14
• probability that we'll need 4 operations: `(* (expt 0.6 3) 0.4)` — 0.09

Actually, these calculations are slightly off and the correct probability of finding an empty slot should be somewhat lower, although the larger the table is, the smaller the deviation in the numbers. Finding out why is left as an exercise for the reader :)

As you see, there's a progression here. With probability around 0.87, we'll need no more than 4 operations. Without continuing with the arithmetic, I think, it should be obvious that we'll need, on average, around 3 operations to access each item and the probability that we'll need twice as many (6) is quite low (below 5%). So, we can say that the number of access operations is constant (i.e. independent of the number of elements in the table) and is determined only by the occupancy percent. So, if we keep the occupancy in the reasonable bounds, named earlier, on average, 1 hash code calculation/lookup and a couple of retrievals and equality comparisons will be needed to access an item in our hash-table.

### Hash-Code

So, we can conclude that a hash-table is primarily parametrized by two things: the hash-function and the equality predicate. In Lisp, in particular, there's a choice of just the four standard equality predicates: `eq`, `eql`, `equal`, and `equalp`. It's somewhat of a legacy that you can't use other comparison functions so some implementations, as an extension, allow th programmer to specify other predicates. However, in practice, the following approach is sufficient for the majority of the hash-table use cases:

• use the `eql` predicate if the keys are numbers, characters, or symbols
• use `equal` if the keys are strings or lists of the mentioned items
• use `equalp` if the keys are vectors, structs, CLOS objects or anything else containing one of those

But I'd recommend trying your best to avoid using the complex keys requiring `equalp`. Besides the performance penalty of using the heaviest equality predicate that performs deep structural comparison, structs, and vectors, in particular, will most likely hash to the same index. Here is a quote from one of the implementors describing why this happens:

Structs have no extra space to store a unique hash code within them. The decision was made to implement this because automatic inclusion of a hashing slot in all structure objects would have made all structs an average of one word longer. For small structs this is unacceptable. Instead, the user may define a struct with an extra slot, and the constructor for that struct type could store a unique value into that slot (either a random value or a value gotten by incrementing a counter each time the constructor is run). Also, create a hash generating function which accesses this hash-slot to generate its value. If the structs to be hashed are buried inside a list, then this hash function would need to know how to traverse these keys to obtain a unique value. Finally, then, build your hash-table using the `:hash-function` argument to make-hash-table (still using the equal test argument), to create a hash-table which will be well-distributed. Alternatively, and if you can guarantee that none of the slots in your structures will be changed after they are used as keys in the hash-table, you can use the `equalp` test function in your make-hash-table call, rather than equal. If you do, however, make sure that these struct objects don't change, because then they may not be found in the hash-table.

But what if you still need to use a struct or a CLOS object as a hash key (for instance, if you want to put them in a set)? There are three possible workarounds:

• Choose one of their slots as a key (if you can guarantee its uniqueness).
• Add a special slot to hold a unique value that will serve as a key.
• Use the literal representation obtained by calling the print-function of the object. Still, you'll need to ensure that it will be unique and constant. Using an item that changes while being the hash key is a source of very nasty bugs, so avoid it at all cost.

These considerations are also applicable to the question of why Java requires defining both `equals` and `hashCode` methods for objects that are used as keys in the hash-table or hash-set.

Beyond the direct implementation of open addressing, called "linear probing" (for it tries to resolve collisions by performing a linear scan for an empty slot), a number of approaches were proposed to improve hash distribution and reduce the collision rate. However, for the general case, their superiority remains questionable, and so the utility of a particular approach has to be tested in the context of the situations when linear probing demonstrates suboptimal behavior. One type of such situations occurs when the hash-codes become clustered near some locations due to deficiencies of either the hash-function or the keyset.

The simplest modification of linear probing is called "quadratic probing". It operates by performing the search for the next vacant slot using the linear probing offsets (or some other sequence of offsets) that are just raised to the power 2. I.e. if, with linear probing, the offset sequence was 1,2,3,etc, with the quadratic one, it is 1,4,9,... "Double hashing" is another simple alternative, which, instead of a linear sequence of offsets, calculates the offsets using another hash-function. This approach makes the sequence specific to each key, so the keys that map to the same location will have different possible variants of collision resolution. "2-choice hashing" also uses 2 hash-functions but selects the particular one for each key based on the distance from the original index it has to be moved for collision resolution.

More elaborate changes to the original idea are proposed in Cuckoo, Hopscotch, and Robin Hood caching, to name some of the popular alternatives. We won't discuss them now, but if the need arises to implement a non-standard hash-table it's worth studying all of those before proceeding with an idea of your own. Although, who knows, someday you might come up with a viable alternative technique, as well...

## Hash-Functions

The class of possible hash-functions is very diverse: any function that sufficiently randomizes the key hashes will do. But what good enough means? One of the ways to find out is to look at the the pictures of the distribution of hashes. Yet, there are other factors that may condition the choice: speed, complexity of implementation, collision resistance (important for cryptographic hashes that we won't discuss in this book).

The good news is that, for most practical purposes, there's a single function that is both fast and easy to implement and understand. It is called FNV-1a.

``````(defparameter *fnv-primes*
'((32 . 16777619)
(64 . 1099511628211)
(128 . 309485009821345068724781371)
(256 . 374144419156711147060143317175368453031918731002211)))

(defparameter *fnv-offsets*
'((32 . 2166136261)
(64 . 14695981039346656037)
(128 . 144066263297769815596495629667062367629)
(256 . 100029257958052580907070968620625704837092796014241193945225284501741471925557)))

(defun fnv-1a (x &key (bits 32))
(assert (member bits '(32 64 128 256)))
(let ((rez (assoc1 bits *fnv-offsets*))
(prime (assoc1 bits *fnv-primes*)))
(dotimes (i (/ bits 8))
(:= rez (ldb (byte bits 0)
(* (logxor rez (ldb (byte 8 (* i 8)) x))
prime))))
rez))
``````

The constants `*fnv-primes*` and `*fnv-offsets*` are precalculated up to 1024 bits (here, I used just a portion of the tables).

Note that, in this implementation, we use normal Lisp multiplication (`*`) that is not limited to fixed-size numbers (32-bit, 64-bit,...) so we need to extract only the first `bits` with `ldb`.

Also note that if you were to calculate FNV-1a with some online hash calculator you'd, probably, get a different result. Experimenting with it, I noticed that it is the same if we use only the non-zero bytes from the input number. This observation aligns well with calculating the hash for simple strings when each character is a single byte. For them the hash-function would look like the following:

``````(defun fnv-1a-str (str)
(let ((rez (assoc1 32 *fnv-offsets*))
(prime (assoc1 32 *fnv-primes*)))
(dovec (char str)
(:= rez (ldb 32 (* (logxor rez (ldb (byte 8 (* i 8))
(char-code char)))
prime))))
rez))
``````

So, even such a simple hash-function has nuances in its implementation and it should be meticulously checked against some reference implementation or a set of expected results.

Alongside FNV-1a, there's also FNV-1, which is a slightly worse variation, but it may be used if we need to apply 2 different hash functions at once (like, in 2-way or double hashing).

What is the source of the hashing property of FNV-1a? Xors and modulos. Combining these simple and efficient operations is enough to create a desired level of randomization. Most of the other hash-functions use the same building blocks as FNV-1a. They all perform arithmetic (usually, addition and multiplication as division is slow) and xor'ing, adding into the mix some prime numbers. For instance, here's what the code for another popular hash-function "djb2" approximately looks like:

``````(defun djb2-str (str)
(let ((rez 5381)  ; a DJB2 prime number
(loop :for char :across str :do
(:= rez (ldb 32 (+ (char-code char)
(ldb 32 (+ (ash rez 5)
rez)))))))
rez))
``````

## Operations

### Initialization

Normally, the hash-table can be created with `make-hash-table`, which has a number of configuration options, including `:test` (default: `eql`). Most of the implementations allow the programmer to make synchronized (thread-safe) hash-tables via another configuration parameter, but the variants of concurrency control will differ.

Yet, it is important to have a way to define hash-tables already pre-initialized with a number of key-value pairs, and `make-hash-table` can't handle this. Pre-initialized hash tables represent a common necessity for tables serving as dictionaries, and such pre-initialization greatly simplifies many code patterns. Thus RUTILS provides such a syntax (in fact, in 2 flavors) with the help of reader macros:

``````#{equal "foo" :bar "baz" 42}
#h(equal "foo" :bar "baz" 42)
``````

Both of these expressions will expand into a call to `make-hash-table` with `equal` test and a two calls to set operation to populate the table with the kv-pairs `"foo" :bar` and `"baz" 42`. For this stuff to work, you need to switch to the appropriate readtable by executing: `(named-readtables:in-readtable rutils-readtable)`.

The reader-macro to parse `#h()`-style literal readtables isn't very complicated. As all reader-macros, it operates on the character stream of the program text, processing one character at a time. Here is it's implementation:

``````(defun |#h-reader| (stream char arg)
(read-char stream)  ; skip the open paren
;; we can also add a sanity check to ensure that this character
;; is indeed a #\(
(with (;; read-delimited-list is a standard library function
;; that reads items until a delimiter is encountered
;; and then returns them as a list of parsed Lisp objects
;; the idea is that the first element may be a hash-table
;; test function; in this case, the number of items in the
;; definition will be odd as each key-value pair should have
;; an even number of elements
(test (when (oddp (length sexp))
(first sexp)))
;; the rest of the values, after the possible test function,
;; are key-value pairs
(kvs (group 2 (if test (rest sexp) sexp)))
(ht (gensym)))
`(let ((,ht (make-hash-table :test ',(or test 'eql))))
;; iterate the tail of the KVS list (:on loop clause)
;; and, for each key-value pair, generate an expression
;; to add the value for the key in the resulting hash-table
,@(mapcar (lambda (kv)
`(:= (? ,ht ,(first kv)) ,(second kv)))
,kvs)
,ht)))
``````

After such a function is defined, it can be plugged into the standard readtable:

``(set-dispatch-macro-character #\# #\h '|#h-reader|)``

Or it may be used in a named-readtable (you can learn how to do that, from the docs).

`print-hash-table` is the utility to perform the reverse operation — display hash-tables in the similar manner:

``````RUTILS> (print-hash-table #h(equal "foo" :bar "baz" 42))
#{EQUAL
"foo" :BAR
"baz" 42
}
#<HASH-TABLE :TEST EQUAL :COUNT 2 {10127C0003}>
``````

The last line of the output is the default Lisp printed representation of the hash-table. As you see, it is opaque and doesn't display the elements of the table. RUTILS also allows switching to printing the literal representation instead of the standard one with the help of `toggle-print-hash-table`. However, this extension is intended only for debugging purposes as it is not fully standard-conforming.

### Access

Accessing the hash-table elements is performed with `gethash`, which returns two things: the value at key and `t` when the key was found in the table, or two nils otherwise. By using `(:= (gethash key ht) val)` (or `(:= (? ht key) val)`) we can modify the stored value. Notice the reverse order of arguments of `gethash` compared to the usual order in most accessor functions, when the structure is placed first and the key second. However, `gethash` differs from generic `?` in that it accepts an optional argument that is used as the default value if the requested key is not present in the table. In some languages, like Python, there's a notion of "default hash-tables" that may be initialized with a common default element. In Lisp, a different approach is taken. However, it's possible to easily implement default hash-tables and plug them into the `generic-elt` mechanism:

``````(defstruct default-hash-table
table
default-value)

(defun gethash-default (key ht)
(gethash key (? ht 'table) (? ht 'default-value)))

(defmethod generic-elt ((kv default-hash-table) key &rest keys)
(gethash-default key kv))
``````

RUTILS also defines a number of aliases/shorthands for hash-table operations. As the `#` symbol is etymologically associated with hashes, it is used in the names of all these functions:

• `get#` is a shorthand and a more distinctive alias for `gethash`
• `set#` is an alias for `(:= (gethash ...`
• `getset#` is an implementation of the common pattern: this operation either retrieves the value if the key is found in the table or calculates its third argument returns it and also sets it for the given key for future retrieval
• `rem#` is an alias for `remhash` (remove the element from the table)
• `take#` both returns the key and removes it (unlike `rem#` that only removes)
• `in#` tests for the presence of the key in the table
• also, `p#` is an abbreviated version of `print-hash-table`

### Iteration

Hash-tables are unordered collections, in principle. But, still, there is always a way to iterate over them in some (unspecified) order. The standard utility for that is either `maphash`, which unlike `map` doesn't populate the resulting collection and is called just for the side effects, or the special `loop` syntax. Both are suboptimal, from several points of view, so RUTILS defines a couple of alternative options:

• `dotable` functions in the same manner as `dolist` except that it uses two variables: for the key and the value
• `mapkv`, mentioned in the previous chapter, works just like `mapcar` by creating a new result table with the same configuration as the hash-table it iterates over and assigns the results of invoking the first argument — the function of two elements — with each of the kv-pairs

Despite the absence of a predefined ordering, there are ways in which some order may be introduced. For example, in SBCL, the order in which the elements are added, is preserved by using additional vectors called `index-vector` and `next-vector` that store this information. Another option which allows forcing arbitrary ordering is to use the so-called Linked Hash-Table. It is a combination of a hash-table and a linked list: each key-value pair also has the next pointer, which links it to some other item in the table. This way, it is possible to have ordered key-values without resorting to tree-based structures. A poor man's linked hash-table can be created on top of the normal one with the following trick: substitute values by pairs containing a value plus a pointer to the next pair and keep track of the pointer to the first pair in a special slot.

``````(deftruct linked-hash-table-item
key
val
next)

table
tail)

(? ht 'table key 'val))

;; The initial order of items is the order of addition.
;; If we'd like to impose a different order,
;; we'll have to perform reordering after each addition
;; or implement a custom sethash function.
(with (((table head tail) ? ht)
(cur (gethash key table)))
(if cur
(:= (? cur 'val) val)
:key key :val val)))
(:= (? ht 'tail)
(if tail
(:= (? ht 'tail 'next) new)
new))))))

:table (hash-table :key (hash-table-key (? rez 'table)))))
(prev nil))
(do ((item (? rez 'head) (? item 'next)))
((null item))
(sethash-linked key rez (call fn (? item 'val))))))
``````

The issue with this approach, as you can see from the code, is that we also need to store the key, and it duplicates the data also stored in the backing hash-table itself. So, an efficient linked hash-table has to be implemented from scratch using an array as a base instead of a hash-table.

## Perfect Hashing

In the previous exposition, we have concluded that using hash-tables implies a significant level of reserved unused space (up to 30%) and inevitable collisions. Yet, if the keyset is static and known beforehand, we can do better: find a hash-function, which will exclude collisions (simple perfect hashing) and even totally get rid of reserved space (minimal perfect hashing, MPH). Although the last variant will still need extra space to store the additional information about the hash-functions, it may be much smaller: in some methods, down to ~3-4 bits per key, so just 5-10% overhead. Statistically speaking, constructing such a hash-function is possible. But the search for its parameters may require some trial and error.

### Implementation

The general idea is simple, but how to find the appropriate hash-function? There are several approaches described in sometimes hard-to-follow scientific papers and a number of cryptic programs in low-level C libraries. At a certain point in time, I needed to implement some variant of an MPH so I read those papers and studied the libraries to some extent. Not the most pleasant process, I should confess. One of my twitter pals once wrote: "Looks like it's easier for people to read 40 blog posts than a single whitepaper." And, although he was putting a negative connotation to it, I recognized the statement as a very precise description of what a research engineer does: read a whitepaper (or a dozen, for what it's worth) and transform it into working code and — as a possible byproduct — into an explanation ("blog post") that other engineers will understand and be able to reproduce. And it's not a skill every software developer should be easily capable of. Not all papers can even be reproduced because the experiment was not set up correctly, some parts of the description are missing, the data is not available, etc. Of those, which, in principle, can be, only some are presented in the form that is clear enough to be reliably programmed.

Here is one of the variants of minimal perfect hashing that possesses such qualities. It works for datasets of any size as a 3-step process:

1. At the first stage, by the use of a common hash-function (in particular, the Jenkins hash), all keys are near-uniformly distributed into buckets, so that the number of keys in each bucket doesn't exceed 256. It can be achieved with very high probability if the hash divisor is set to `(ceiling (length keyset) 200)`. This allows the algorithm to work for data sets of arbitrary size, thereby reducing the problem to a simpler one that already has a known solution.
2. Next, for each bucket, the perfect hash function is constructed. This function is a table (and it's an important mathematical fact that each discrete function is equivalent to a table, albeit, potentially, of unlimited length). The table contains byte-sized offsets for each hash code, calculated by another application of the Jenkins hash, which produces two values in one go (actually, three, but one of them is not used). The divisor of the hash-function, this time, equals to double the number of elements in the bucket. And the uniqueness requirement is that the sum of offsets corresponding, in the table, to the two values produced by the Jenkins hash is unique, for each key. To check if the constraint is satisfied, the hashes are treated as vertices of a graph, and if it happens to be acyclic (the probability of this event is quite high if the parameters are chosen properly), the requirement can be satisfied, and it is possible to construct the perfect hash function, by the process described as the next step. Otherwise, we change the seed of the Jenkins hash and try again until the resulting graph is acyclic. In practice, just a couple of tries are needed.
3. Finally, the hash-function for the current bucket may be constructed from the graph by the CHM92 algorithm (named after the authors and the year of the paper), which is another version of perfect hashing but suitable only for limited keysets. Here, you can see the CHM92 formula implemented in code:
``````(defstruct mpht
(data nil :type simple-vector)
(offsets nil :type (simple-array octet))
(div nil))

;; div is the divisor of the top-level hash, which is calculated as:
;; (/ (1- (length meta)) 2)

(defun mpht-index (item mpht)
(with (((offsets meta div) ? mpht)
(bucket-id (* (mod (jenkins-hash item) div) 2))
(bucket-offset (? meta bucket-id))
(bucket-seed (? meta (+ 1 bucket-id)))
;; the number of items in the bucket is calculated
;; by substracting the offset of the next bucket
;; from the offset of the current one
(bucket-count (- (? meta (+ 2 bucket-id))
bucket-offset))
(hash1 hash2 (jenkins-hash2 item bucket-seed bucket-div))
(base (* bucket-offset 2)))
(+ bucket-offset (mod (+ (? offsets (+ base hash1))
(? offsets (+ base hash2)))
bucket-count))))
``````

This algorithm guarantees exactly `O(1)` hash-table access and uses 2 bytes per key, i.e. it will result in a constant 25% overhead on the table's size (in a 64-bit system): 2 byte-sized offsets for the hashes plus negligible 8 bytes per bucket (each bucket contains ~200 elements) for meta information. Better space-utilization solutions (up to 4 times more efficient) exist, but they are harder to implement and explain.

The Jenkins hash-function was chosen for two reasons:

• Primarily, because, being a relatively good-quality hash, it has a configurable parameter `seed` that is used for probabilistic probing (searching for an acyclic graph). On the contrary, FNV-1a doesn't work well with an arbitrary prime hence the usage of a pre-calculated one that isn't subject to change.
• Also, it produces 3 pseudo-random numbers right away, and we need 2 for the second stage of the algorithm.

### The CHM92 Algorithm The CHM92 algorithm operates by performing a depth-first search (DFS) on the graph, in the process, labeling the edges with unique numbers and calculating the corresponding offset for each of the Jenkins hash values. In the picture, you can see one of the possible labelings: each vertex is the value of one of the two hash-codes returned by `jenkins-hash2` for each key, and every edge, connecting them, corresponds to a key that produced the hashes. The unique indices of the edges were obtained during DFS. Now, each hash-code is mapped iteratively to the number that is `(- edge-index other-vertex-index)`. So, some codes will map to the same number, but it is guaranteed that, for each key, the sum of two corresponding numbers will be unique (as the edge indices are unique).

CHM92 is an example of the probabilistic algorithms we will discuss in more detail near the end of the book.

Let's say we have implemented the described scheme like I did in the const-table library. Now, we need to perform the measurements to validate that we have, in fact, achieved the desired improvement over the standard hash-table implementation. In this case, we are interested not only in speed measurements, which we already know how to perform but also in calculating the space occupied.

The latter goal is harder to achieve. Usually, most of the programming languages will provide the analog of a `sizeof` function that returns the space occupied by an array, a structure or an object. Here, we're interested not in "shallow" `sizeof` but in a "deep" one that will descend into the structure's slots and add their sizes recursively.

First, let's create functions to populate the tables with a significant number of random string key-value pairs.

``````(defun random-string (size)
(coerce (loop :repeat size :collect (code-char (+ 32 (random 100))))
'string))

(defun random-hash-table (&key (n 100000))
(let ((rez (make-hash-table :test 'equal)))
(loop :repeat n :do
(:= (? rez (random-string (+ 3 (random 4))))
(random-string (+ 3 (random 4)))))
rez))

(defun random-const-table (&key (n 100000))
(let ((rez (make-const-table :test 'equal)))
(loop :repeat n :do
(:= (? rez (random-string (+ 3 (random 4))))
(random-string (+ 3 (random 4)))))
rez))
``````

A very approximate space measurement may be performed using the standard operator `room`. But it doesn't provide detailed per-object statistics. Here's a result of the `room` measurement, in SBCL (the format of the report will be somewhat different, for each implementation):

``````CL-USER> (room)
Dynamic space usage is:   45,076,224 bytes.
Immobile space usage is:  18,998,832 bytes (64,672 bytes overhead).
Read-only space usage is:          0 bytes.
Static space usage is:         1,264 bytes.
Control stack usage is:        9,048 bytes.
Binding stack usage is:          640 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
11,369,232 bytes for  76,040 simple-vector objects
9,095,952 bytes for 160,669 instance objects
8,289,568 bytes for 518,098 cons objects
3,105,920 bytes for  54,655 simple-array-unsigned-byte-8 objects
2,789,168 bytes for  54,537 simple-base-string objects
2,344,672 bytes for   9,217 simple-character-string objects
6,973,472 bytes for 115,152 other objects

43,967,984 bytes for 988,368 dynamic objects (space total)

Breakdown for immobile space:
16,197,840 bytes for 24,269 code objects
1,286,496 bytes for 26,789 symbol objects
1,041,936 bytes for 27,922 other objects

18,526,272 bytes for 78,980 immobile objects (space total)

CL-USER> (defparameter *ht* (random-hash-table))
*HT*
CL-USER> (room)
...
Breakdown for dynamic space:
13,349,920 bytes for    77,984 simple-vector objects
11,127,008 bytes for   208,576 simple-character-string objects
9,147,824 bytes for   161,469 instance objects
8,419,360 bytes for   526,210 cons objects
3,517,792 bytes for     2,997 simple-array-unsigned-byte-32 objects
3,106,288 bytes for    54,661 simple-array-unsigned-byte-8 objects
7,671,168 bytes for   166,882 other objects

56,339,360 bytes for 1,198,779 dynamic objects (space total)
``````

So, it seems like we added roughly 10 megabytes by creating a hash-table with 100,000 random 5-9 character keys and values. Almost all of that space went into the keys and values themselves — 9 Mb ("11,127,008 bytes for 208,576 simple-character-string objects" versus "2,344,672 bytes for 9,217 simple-character-string objects" — a bit less than 200,000 new strings were added).

Also, if we examine the hash-table, we can see that its occupancy is rather high — around 90%! (The number of keys 99706 instead of 10000 tells us that there was a small portion of duplicate keys among the randomly generated ones).

``````CL-USER> (describe *ht*)
#<HASH-TABLE :TEST EQUAL :COUNT 99706 {1002162EF3}>
[hash-table]

Occupancy: 0.9
Rehash-threshold: 1.0
Rehash-size: 1.5
Size: 111411
``````

And now, a simple time measurement:

``````CL-USER> (let ((keys (keys *ht*)))
(time (loop :repeat 100 :do
(dolist (k keys)
(gethash k *ht*)))))
Evaluation took:
0.029 seconds of real time
0.032000 seconds of total run time (0.032000 user, 0.000000 system)
110.34% CPU
72,079,880 processor cycles
0 bytes consed
``````

Now, let's try the `const-table`s that are the MPHT implementation:

``````СL-USER> (time (defparameter *ct* (cstab:build-const-table *ht*)))
...................................................................................................
Evaluation took:
0.864 seconds of real time
...
СL-USER> (room)
...
Breakdown for dynamic space:
14,179,584 bytes for    78,624 simple-vector objects
11,128,464 bytes for   208,582 simple-character-string objects
9,169,120 bytes for   161,815 instance objects
8,481,536 bytes for   530,096 cons objects
3,521,808 bytes for     2,998 simple-array-unsigned-byte-32 objects
3,305,984 bytes for    54,668 simple-array-unsigned-byte-8 objects
7,678,064 bytes for   166,992 other objects

57,464,560 bytes for 1,203,775 dynamic objects (space total)
``````

Another megabyte was added for the metadata of the new table, which doesn't seem significantly different from the hash-table version. Surely, often we'd like to be much more precise in space measurements. For this, SBCL recently added an allocation profiler `sb-aprof`, but we won't go into the details of its usage, in this chapter.

And now, time measurement:

``````CL-USER> (let ((keys (keys *ht*)))
(time (loop :repeat 100 :do
(dolist (k keys)
(cstab:csget k *ct*)))))
Evaluation took:
3.561 seconds of real time
``````

Oops, a two-orders-of-magnitude slowdown! Probably, it has to do with many factors: the lack of optimization in my implementation compared to the one in SBCL, the need to calculate more hashes and with a slower hash-function, etc. I'm sure that the implementation may be sped up at least an order of magnitude, but, even then, what's the benefit of using it over the default hash-tables? Especially, considering that MPHTs have a lot of moving parts and rely on a number of "low-level" algorithms like graph traversal or efficient membership testing, most of which need a custom efficient implementation...

Still, there's one dimension in which MPHTs may provide an advantage: significantly reduce space usage by not storing the keys. Though, it becomes problematic if we need to distinguish the keys that are in the table from the unknown ones as those will also hash to some index, i.e. overlap with an existing key. So, either the keyspace should be known beforehand and exhaustively covered in the table or some precursory membership test is necessary when we anticipate the possibility of unseen keys. Yet, there are ways to perform the test efficiently (exactly or probabilistically), which require much less storage space than would be needed to store the keys themselves. Some of them we'll see in the following chapters.

If the keys are omitted, the whole table may be reduced to a Jump-table. Jump-tables are a low-level trick possible when all the keys are integers in the interval `[0, n)`. It removes the necessity to perform sequential equality comparisons for every possible branch until one of the conditions matches: instead, the numbers are used directly as an offset. I.e. the table is represented by a vector, each hash-code being the index in that vector.

A jump-table for the MPHT will be simply a data array, but sometimes evaluation of different code is required for different keys. Such more complex behavior may be implemented in Lisp using the lowest-level operators `tagbody` and `go` (and a bit of macrology if we need to generate a huge table). This implementation will be a complete analog of the C `switch` statement. The skeleton for such "executable" table will look like this, where 0, 1,... are goto labels:

``````(block nil
(tagbody (go key)
0 (return (do-something0))
1 (return (do-something1))
...))
``````

## Distributed Hash-Tables

Another active area of hash-table-related research is algorithms for distributing them over the network. This is a natural way to represent a lot of datasets, and thus there are numerous storage systems (both general- and special-purpose) which are built as distributed hash-tables. Among them are, for instance, Amazon DynamoDB or an influential open-source project Kademlia. We will discuss in more detail, in the chapter on Distributed Algorithms, some of the technologies developed for this use case, and here I wanted to mention just one concept.

Consistent Hashing addresses the problem of distributing the hash-codes among `k` storage nodes under the real-world limitations that some of them may become temporarily unavailable or new peers may be added into the system. The changes result in changes of the value of `k`. The straightforward approach would just divide the space of all codes into `k` equal portions and select the node into whose portion the particular key maps. Yet, if `k` is changed, all the keys need to be rehashed, which we'd like to avoid at all cost as rehashing the whole database and moving the majority of the keys between the nodes, at once, will saturate the network and bring the system to a complete halt.

The idea or rather the tweak behind Consistent Hashing is simple: we also hash the node ids and store the keys on the node that has the next hash-code larger than the hash of the key (modulo `n`, i.e. wrap around 0). Now, when a new node is added, it is placed on this so-called "hash ring" between two other peers, so only part of the keys from a single node (the next on the ring) require being redistributed to it. Likewise, when the node is removed, only its keys need to be reassigned to the next peer on the ring (it is supposed that the data is stored in multiple copies on different nodes, so when one of the nodes disappears the data doesn't become totally lost).

The only problem with applying this approach directly is the uneven distribution of keys originating from uneven placement of the hash-codes of the nodes on the hash ring. This problem can be solved with another simple tweak: have multiple ids for each node that will be hashed to different locations, effectively emulating a larger number of virtual nodes, each storing a smaller portion of the keys. Due to the randomization property of hashes, not so many virtual nodes will be needed, to obtain a nearly uniform distribution of keys over the nodes.

A more general version of this approach is called Rendezvous Hashing. In it, the key for the item is combined with the node id for each node and then hashed. The largest value of the hash determines the designated node to store the item.

## Hashing in Action: Content Addressing

Hash-tables are so ubiquitous that it's, actually, difficult to single out some peculiar use case. Instead, let's talk about hash-functions. They can find numerous uses beyond determining the positions of the items in the hash-table, and one of them is called "content addressing": globally identify a piece of data by its fingerprint instead of using external meta information like name or path. This is one of the suggested building blocks for large-scale distributed storage systems, but it works locally, as well: your git SCM system silently uses it behind the scenes to identify the changesets it operates upon.

• Potential for space economy: if the system has a chance of operating on repeated items (like git does, although it's not the only reason for choosing such naming scheme for blobs: the other being the lack of a better variant), content addressing will make it possible to avoid storing them multiple times.
• It guarantees that the links will always return the same content, regardless of where it is retrieved from, who added it to the network, how and when. This enables such distributed protocols as BitTorrent that split the original file into multiple pieces, each one identified by its hash. These pieces can be distributed in an untrusted network.
• As mentioned above, content addressing also results in a conflict-free naming scheme (provided that the hash has enough bits — usually, cryptographic hashes such as SHA-1 are used for this purpose, although, in many cases, such powerful hash-functions are an overkill).

## Take-aways

This chapter resented a number of complex approaches that require a lot of attention to detail to be implemented efficiently. On the surface, the hash-table concept may seem rather simple, but, as we have seen, the production-grade implementations are not that straightforward. What general conclusions can we make?

1. In such mathematically loaded areas as hash-function and hash-table implementation, rigorous testing is critically important. For there is a number of unexpected sources of errors: incorrect implementation, integer overflow, concurrency issues, etc. A good testing strategy is to use an already existing trusted implementation and perform a large-scale comparison testing with a lot of random inputs. We haven't discussed the testing code here but will return to the practical implementation of such testing frameworks in the following chapters.
2. Besides, a correct implementation doesn't necessarily mean a fast one. Low-level optimization techniques play a crucial role here.
3. In the implementation of MPHT, we have seen in action another important approach to solving algorithmic and, more generally, mathematic problems: reducing them to a problem that has a known solution.
4. Space measurement is another important area of algorithms evaluation that is somewhat harder to accomplish than runtime profiling. We'll also see more usage of both of these tools throughout the book.