2019-09-11

Programming Algorithms: Hash-Tables

This is a snippet of the "Hash-Tables" chapter of the book.

Now, we can move on to studying advanced data structures which are built on top of the basic ones such as arrays and lists, but may exhibit distinct properties, have different use cases, and special algorithms. Many of them will combine the basic data structures to obtain new properties not accessible to the underlying structures. The first and most important of these advanced structures is, undoubtedly, the hash-table. However vast is the list of candidates to serve as key-values, hash-tables are the default choice for implementing them.

Also, hash-sets, in general, serve as the main representation for medium and large-sized sets as they ensure O(1) membership test, as well as optimal set-theoretic operations complexity. A simple version of a hash-set can be created using a normal hash-table with t for all values.

Implementation

The basic properties of hash-tables are average O(1) access and support for arbitrary keys. These features can be realized by storing the items in an array at indices determined by a specialized function that maps the keys in a pseudo-random way — hashes them. Technically, the keys should pertain to the domain that allows hashing, but, in practice, it is always possible to ensure either directly or by using an intermediate transformation. The choice of variants for the hash-function is rather big, but there are some limitations to keep in mind:

  1. As the backing array has a limited number of cells (n), the function should produce values in the interval [0, n). This limitation can be respected by a 2-step process: first, produce a number in an arbitrary range (for instance, a 32-bit integer) and then take the remainder of its division by n.
  2. Ideally, the distribution of indices should be uniform, but similar keys should map to quite distinct indices. I.e. hashing should turn things which are close, into things which are distant. This way, even very small changes to the input will yield sweeping changes in the value of the hash. This property is called the "avalanche effect".

Dealing with Collisions

Even better would be if there were no collisions — situations when two or more keys are mapped to the same index. Is that, at all, possible? Theoretically, yes, but all the practical implementations that we have found so far are too slow and not feasible for a hash-table that is dynamically updated. However, such approaches may be used if the keyset is static and known beforehand. They will be covered in the discussion of perfect hash-tables.

For dynamic hash-tables, we have to accept that collisions are inevitable. The probability of collisions is governed by an interesting phenomenon called "The Birthday Paradox". Let's say, we have a group of people of some size, for instance, 20. What is the probability that two of them have birthdays on the same date? It may seem quite improbable, considering that there are 365 days in a year and we are talking just about a handful of people. But if you take into account that we need to examine each pair of people to learn about their possible birthday collision that will give us (/ (* 20 19) 2), i.e. 190 pairs. We can calculate the exact probability by taking the complement to the probability that no one has a birthday collision, which is easier to reason about. The probability that two people don't share their birthday is (/ (- 365 1) 365): there's only 1 chance in 365 that they do. For three people, we can use the chain rule and state that the probability that they don't have a birthday collision is a product of the probability that any two of them don't have it and that the third person also doesn't share a birthday with any of them. This results in (* (/ 364 365) (/ (- 365 2) 365)). The value (- 365 2) refers to the third person not having a birthday intersection with neither the first nor the second individually, and those are distinct, as we have already asserted in the first term. Continuing in such fashion, we can count the number for 20 persons:

(defun birthday-collision-prob (n)
  (let ((rez 1))
    (dotimes (i n)
      (:* rez (/ (- 365 i) 365)))
    ;; don't forget that we want the complement
    ;; of the probability of no collisions,
    ;; hence (- 1.0 ...)
    (- 1.0 rez)))

CL-USER> (birthday-collision-prob 20)
0.4114384

So, among 20 people, there's already a 40% chance of observing a coinciding birthday. And this number grows quickly: it will become 50% at 23, 70% at 30, and 99.9% at just 70!

But why, on Earth, you could ask, have we started to discusss birthdays? Well, if you substitute keys for persons and the array size for the number of days in a year, you'll get the formula of the probability of at least one collision among the hashed keys in an array, provided the hash function produces perfectly uniform output. (It will be even higher if the distribution is non-uniform).

(defun hash-collision-prob (n size)
  (let ((rez 1))
    (dotimes (i n)
      (:* rez (/ (- size i) size)))
    (- 1.0 rez)))

Let's say, we have 10 keys. What should be the array size to be safe against collisions?

CL-USER> (hash-collision-prob 10 10)
0.9996371

99.9%. OK, we don't stand a chance to accidentally get a perfect layout. :( What if we double the array size?

CL-USER> (hash-collision-prob 10 20)
0.9345271

93%. Still, pretty high.

CL-USER> (hash-collision-prob 10 100)
0.37184352
CL-USER> (hash-collision-prob 10 10000)
0.004491329

So, if we were to use a 10k-element array to store 10 items the chance of a collision would fall below 1%. Not practical...

Note that the number depends on both arguments, so (hash-collision-prob 10 100) (0.37) is not the same as (hash-collision-prob 20 200) (0.63).

We did this exercise to completely abandon any hope of avoiding collisions and accept that they are inevitable. Such mind/coding experiments may be an effective smoke-test of our novel algorithmic ideas: before we go full-speed and implement them, it makes sense to perform some back-of-the-envelope feasibility calculations.

More details about of the book may be found on its website.

No comments: