2019-08-05

Programming Algorithms: Data Structures

The next several chapters will describe the basic data structures that every programming language provides, their usage, and the most important algorithms relevant to them. We'll start with the notion of a data structure and with tuples (structs), which are the most primitive and essential kind.

Data Structures vs Algorithms

Let's start with a somewhat abstract question: what's more important, algorithms or data structures?

From one point of view, algorithms are the essence of many programs, while data structures may seem secondary. Besides, although a majority of algorithms rely on certain features of particular data structures, not all do. Good examples of algorithms that rely on data structures are heapsort, search using BSTs, and union-find; of the second kind: the sieve of Eratosthenes and consistent hashing.

At the same time, some seasoned developers state that when the right data structure is found, the algorithm will almost write itself. Linus Torvalds, the creator of Linux, is quoted saying:

Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

A somewhat less poignant version of the same idea is formulated in the Art of Unix Programming by Eric S. Raymond as the "Rule of Representation":

Fold knowledge into data so program logic can be stupid and robust.

Even the simplest procedural logic is hard for humans to verify, but quite complex data structures are fairly easy to model and reason about. To see this, compare the expressiveness and explanatory power of a diagram of (say) a fifty-node pointer tree with a flowchart of a fifty-line program. Or, compare an array initializer expressing a conversion table with an equivalent switch statement. The difference in transparency and clarity is dramatic.

Data is more tractable than program logic. It follows that where you see a choice between complexity in data structures and complexity in code, choose the former. More: in evolving a design, you should actively seek ways to shift complexity from code to data.

Data structures are more static than algorithms. Of course, most of them allow changing their contents over time, but there are certain invariants that always hold. This allows reasoning by simple induction: consider only a small number of cases, the base one(s) and the general one. In other words, data structures mostly remove the notion of time from consideration, and change over time is one of the major causes of program complexity. Put differently, data structures are declarative, while most algorithms are imperative. The advantage of the declarative approach is that you don't have to trace the flow of time through it.

So, this book, like most other books on the subject, is organized around data structures. The majority of the chapters present a particular structure, its properties and interface, and explain the algorithms associated with it, showing its real-world use cases. Yet, some important algorithms don't require a particular data structure, so there are also several chapters dedicated exclusively to them.

The Data Structure Concept

Among data structures, there are, actually, two distinct kinds: abstract and concrete. The significant difference between them is that an abstract structure is just an interface (a set of operations) and a number of conditions or invariants that have to be met. The particular implementations, which may differ significantly in efficiency characteristics and inner mechanisms, are provided by the concrete data structures. For instance, the abstract data structure queue has just two operations: enqueue that adds an item to the end of the queue and dequeue that gets an item at the beginning and removes it. There's also a constraint that the items should be dequeued in the same order they are enqueued. Now, a queue may be implemented using a number of different underlying data structures: a linked or a doubly-linked list, an array, or a tree, each one having different efficiency characteristics and additional properties beyond the queue interface. We'll discuss both kinds in the book, focusing on the concrete structures and explaining their usage to implement a particular abstract interface.
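To make the distinction tangible, here is a minimal sketch of a list-backed queue (the names queue, enqueue, and dequeue are ours, not a standard library API, and the naive list representation is chosen for clarity rather than efficiency; defstruct itself will be properly introduced later in this chapter):

(defstruct queue
  items)  ; the underlying list, oldest item first

(defun enqueue (item queue)
  "Add ITEM to the end of the QUEUE (linear time in this naive variant)."
  (setf (queue-items queue) (append (queue-items queue) (list item)))
  queue)

(defun dequeue (queue)
  "Remove and return the item at the beginning of the QUEUE."
  (pop (queue-items queue)))

A different concrete implementation, for instance an array-based ring buffer or a list with a tail pointer, would have different performance characteristics, but the abstract interface and its FIFO invariant would stay the same.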

The term data structures has somewhat fallen from grace in recent years, being often replaced by the conceptually more loaded notions of types, in the context of the functional programming paradigm, or classes, in the object-oriented one. Yet, both of those notions imply something more than just the algorithmic machinery we're exclusively interested in for this book. First of all, they also distinguish among primitive values (numbers, characters, etc.) that are all alike from the point of view of algorithms. Besides, classes form a hierarchy of inheritance, while types are associated with the algebraic rules of category theory. So, we'll stick to the neutral term data structures throughout the book, with occasional mentions of the other variants where appropriate.

Contiguous and Linked Data Structures

Current computer architectures consist of a central processor (CPU), memory, and peripheral input-output devices. Data is exchanged with the outside world via the IO devices, stored in memory, and processed by the CPU. And there's a crucial constraint, called the von Neumann bottleneck: the CPU can only process data that is stored inside of it, in a limited number of special basic memory blocks called registers. So it has to constantly move data elements back and forth between the registers and main memory (using an intermediate cache to speed up the process). Now, there are things that can fit in a register and those that can't. The first ones are called primitive and mostly comprise the items that can be directly represented with integer numbers: integers proper, floats, characters. Everything that requires a custom data structure to be represented can't be put in a register as a whole.

Another item that fits into a processor register is a memory address. In fact, there's an important constant, the number of bits in a general-purpose register, which defines the maximum memory address that a particular CPU may handle and, thus, the maximum amount of memory it can work with. For a 32-bit architecture it's 2^32 (4 GB) and for 64-bit — you've guessed it, 2^64. A memory address is usually called a pointer, and if you put a pointer in a register, there are commands that allow the CPU to retrieve the data from the memory location it points to.

So, there are two ways to place a data structure inside the memory:

  • a contiguous structure occupies a single chunk of memory, and its contents are stored in adjacent memory blocks. To access a particular piece, we should know the offset of its beginning from the start of the memory range allocated to the structure (this is usually handled by the compiler). When the processor needs to read or write to this piece, it will use the pointer calculated as the sum of the base address of the structure and the offset. Examples of contiguous structures are arrays and structs.
  • a linked structure, on the contrary, doesn't occupy a contiguous block of memory, i.e. its contents reside in different places. This means that pointers to a particular piece can't be pre-calculated and have to be stored in the structure itself. Such structures are much more flexible, at the cost of additional overhead both in terms of used space and of time to access an element (which may require several hops when there's nesting, while in a contiguous structure it is always constant). There exists a multitude of linked data structures: lists, trees, graphs, and so on. The difference in access between the two kinds is illustrated in the short sketch after this list.
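As a tiny illustration (just a sketch using built-in Lisp types; the complexity remarks are conceptual, not measured), compare accessing the fourth element of a vector, which is contiguous, with accessing the fourth element of a list, which is a chain of linked cons cells:

;; contiguous: one memory block, so access is a single
;; base-plus-offset calculation, i.e. constant-time
(aref #(10 20 30 40) 3)  ; => 40

;; linked: a chain of cons cells, so access requires
;; following pointers one by one, i.e. linear-time
(nth 3 '(10 20 30 40))   ; => 40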

Tuples

In most languages, some common data structures, like arrays or lists, are "built-in", but, under the hood, they mostly work in the same way as any user-defined ones. To implement an arbitrary data structure, these languages provide a special mechanism called records, structs, objects, etc. The proper name for it would be "tuple". It is a data structure that consists of a number of fields, each one holding either a primitive value, another tuple, or a pointer to a tuple of any type. This way a tuple can represent any structure, including nested and recursive ones. In the context of type theory, such structures are called product types.

A tuple is an abstract data structure, and its sole interface is the field accessor function: by name (a named tuple) or by index (an anonymous tuple). It can be implemented in various ways, although a contiguous variant with constant-time access is preferred. However, in many languages, especially dynamic ones, programmers often use lists or dynamic arrays to create throw-away ad-hoc tuples. Python has a dedicated tuple data type that is often used for this purpose. The following Python function will return a tuple (written in parens) of the integer and fractional parts of the number x[1]:

def truncate(x):
    dec = int(x)
    rem = x - dec
    return (dec, rem)

This is a simple and not very efficient way that may have its place when the number of fields is small and the lifetime of the structure is short. However, a better approach, both from the point of view of efficiency and code clarity, is to use a pre-defined structure. In Lisp, a tuple is called a "struct" and is defined with defstruct, which uses a contiguous representation by default (although there's an option to use a linked list under the hood). Following is the definition of a simple pair data structure that has two fields (called "slots" in Lisp parlance): left and right.

(defstruct pair
  left right)

The defstruct macro, in fact, generates several definitions: of the struct type, its constructor that will be called make-pair and have 2 keyword arguments :left and :right, and field accessors pair-left and pair-right. Also, a common print-object method for structs will work for our new structure, as well as a reader-macro to restore it from the printed form. Here's how it all fits together:

CL-USER> (make-pair :left "foo" :right "bar")
#S(PAIR :LEFT "foo" :RIGHT "bar")
CL-USER> (pair-right (read-from-string (prin1-to-string *)))
"bar"

prin1-to-string and read-from-string are complementary Lisp functions that allow printing a value in a computer-readable form (if an appropriate print-function is provided) and reading it back. Good print representations, readable to both humans and, ideally, computers, are very important to code transparency and should never be neglected.

There's a way to customize every part of the definition. For instance, if we plan to use pairs frequently, we can leave out the pair- prefix by specifying the (:conc-name nil) property. Here is an improved pair definition and a shorthand constructor for it from RUTILS, which we'll use throughout the book. It uses :type list allocation to integrate with destructuring macros.

(defstruct (pair (:type list) (:conc-name nil))
  "A generic pair with left (LT) and right (RT) elements."
  lt rt)

(defun pair (x y)
  "A shortcut to make a pair of X and Y."
  (make-pair :lt x :rt y))
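Assuming these definitions (or RUTILS itself) are loaded, usage looks like this. Note that, thanks to :type list, the pair is represented and printed as a plain list, so any list-destructuring facility can take it apart:

CL-USER> (pair "foo" "bar")
("foo" "bar")
CL-USER> (rt (pair "foo" "bar"))
"bar"
CL-USER> (destructuring-bind (l r) (pair 1 2)
           (list r l))
(2 1)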

Passing Data Structures in Function Calls

One final remark. There are two ways to pass data structures to functions: either pass them directly via copying the appropriate memory areas (call-by-value), an approach usually applied to primitive types, or pass a pointer to them (call-by-reference). In the first case, there's no way to modify the contents of the original structure in the called function, while in the second variant it is possible, so the risk of unwarranted change should be taken into account. The usual way to handle it is by making a copy before invoking any changes, although, sometimes, mutation of the original data structure may be intended, so a copy is not needed. Obviously, the call-by-reference approach is more general, because it allows both modification and copying, and more efficient, because copying happens only on demand. That's why it is the default way to handle structures (and objects) in most programming languages. In a low-level language like C, however, both variants are supported. Moreover, in C++ pass-by-reference comes in two kinds: passing a pointer and passing what's actually called a reference, which is syntactic sugar over pointers that allows accessing the argument with non-pointer syntax (dot instead of arrow) and adds a couple of restrictions. But the general idea, regardless of the idiosyncrasies of particular languages, remains the same.
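Here is a small Lisp sketch of how this plays out with the pair structure defined above (swap-pair! is a hypothetical helper introduced just for this example): structures are passed by reference, so a destructive operation inside a function is visible to the caller, unless we pass a copy.

(defun swap-pair! (pair)
  "Destructively exchange the LT and RT slots of PAIR."
  (rotatef (lt pair) (rt pair))
  pair)

CL-USER> (let ((p (pair 1 2)))
           (swap-pair! p)              ; the original is modified
           p)
(2 1)
CL-USER> (let ((p (pair 1 2)))
           (swap-pair! (copy-pair p))  ; copy first to protect the original
           p)
(1 2)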

Structs in Action: Union-Find

Data structures come in various shapes and flavors. Here, I'd like to mention one peculiar and interesting example that is, to some extent, both a data structure and an algorithm. Even its name speaks of certain operations rather than a static form. In fact, most of the more advanced data structures share this feature: they are defined not only by their shape and arrangement but also by the set of operations that are applicable to them. Union-Find is a family of data-structure-algorithms that can be used for efficient determination of membership in sets that change over time. They may be used for finding the disjoint parts of networks, detecting cycles in graphs, finding the minimum spanning tree, and so forth. One practical example of such problems is automatic image segmentation: separating different parts of an image, a car from the background or a cancer cell from a normal one.

Let's consider the following problem: how to determine if two points of a graph have a path between them? Here, a graph is a set of points (vertices) and edges between some pairs of these points. A path in the graph is a sequence of points leading from the source to the destination, with each consecutive pair connected by an edge. If some path between two points exists, they belong to the same connected component; if it doesn't, they belong to two disjoint ones.


A graph with 3 disjoint components

For two arbitrary points, how do we determine if they have a connecting path? The naive implementation may take one of them and start building all the possible paths (this may be done in a breadth-first or depth-first manner, or even randomly). Anyway, such a procedure will, generally, require a number of steps proportional to the number of vertices of the graph. Can we do better? This is the usual question that leads to the creation of more efficient algorithms.

The Union-Find approach is based on a simple idea: when adding an item, record the id of the component it belongs to. But how to determine this id? Use the id associated with some point already in this subset, or the current point's id if the point is in a subset of its own. And what if we have the subsets already formed? No problem, we can simulate the addition process by iterating over each vertex and taking the id of an arbitrary point it's connected to as the subset's id. Below is the implementation of this approach (to simplify the code, we'll use pointers to `point` structs instead of ids, but, conceptually, it's the same idea):

(defstruct point
  parent)  ; if the parent is null the point is the root

(defun uf-union (point1 point2)
  "Join the subsets of POINT1 and POINT2."
  (:= (point-parent point1) (or (point-parent point2)
                                point2)))

(defun uf-find (point)
  "Determine the id of the subset that a POINT belongs to."
  (let ((parent (point-parent point)))
    (if parent
        (uf-find parent)
        point)))

Just calling (make-point) will add a new single-item subset to our set.
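Here is a quick sketch of these functions in action at the REPL (assuming the definitions above, and the RUTILS := macro they rely on, are loaded):

CL-USER> (let ((a (make-point))
               (b (make-point))
               (c (make-point)))
           (uf-union a b)
           (list (eql (uf-find a) (uf-find b))
                 (eql (uf-find a) (uf-find c))))
(T NIL)

The points a and b end up in the same subset (their roots are identical), while c remains in a subset of its own.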

Note that uf-find uses recursion to find the root of the subset, i.e. the point that was added first. So, for each vertex, we store some intermediary data and, to get the subset id, we have to perform additional calculations each time. This way, we managed to reduce the average-case find time, but still haven't completely excluded the possibility of it requiring traversal of every element of the set. Such a degenerate case may manifest when each item is added referencing the previously added one, i.e. when there is a single subset whose members form a chain, each connected to the next one like this: a -> b -> c -> d. If we call uf-find on a, it will have to enumerate all of the set's elements.

Yet, there is a way to improve uf-find behavior: by compressing the tree depth to make all points along the path point directly to the root, i.e. squashing each chain into a wide shallow tree of depth 1.

  d
^ ^ ^
| | |
a b c

Unfortunately, we can't do that for the whole subset at once, but, during each run of uf-find, we can compress one path, which will also shorten all the paths in the subtrees rooted in the points on it! Still, this cannot guarantee that there will not be a long enough sequence of unions to grow the trees faster than finds can flatten them. But there's another tweak that, combined with path compression, ensures sublinear (actually, almost constant) time of both operations: keep track of the size of all trees and link the smaller tree below the larger one. This will ensure that all trees' heights stay below (log n). The rigorous proof of that is quite complex, although, intuitively, we can see the tendency by looking at the base case: if we add a 2-element tree and a 1-element one, we'll still get a tree of height 2.

Here is the implementation of the optimized version:

(defstruct point
  parent
  (size 1))

(defun uf-find (point)
  (let ((parent (point-parent point)))
    (if parent
        ;; here, we use the fact that the assignment will also return
        ;; the value to perform both path compression and find
        (:= (point-parent point) (uf-find parent))
        point)))

(defun uf-union (point1 point2)
  (with ((root1 (uf-find point1))
         (root2 (uf-find point2))
         (major minor (if (> (point-size root1)
                             (point-size root2))
                          (values root1 root2)
                          (values root2 root1))))
    (:+ (point-size major) (point-size minor))
    (:= (point-parent minor) major)))
 

Here, Lisp multiple values come in handy to simplify the code. See footnote [1] for more details about them.

The suggested approach is quite simple to implement, but its complexity analysis is involved. So, I'll have to give just the final result: m union/find operations, with tree weighting and path compression, on a set of n objects will work in O((m + n) log* n) (where log* is the iterated logarithm, a very slowly increasing function that can be considered a constant for practical purposes).

Finally, this is how to check that no two of the points belong to the same subset in almost O(n), where n is the number of points to check[2], so in O(1) for 2 points:

(defun uf-disjoint (points)
  "Return true if all of the POINTS belong to different subsets."
  (let (roots)
    (dolist (point points)
      (let ((root (uf-find point)))
        (when (member root roots)
          (return-from uf-disjoint nil))
        (push root roots))))
  t)
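A quick check at the REPL (a sketch assuming the optimized definitions above are loaded):

CL-USER> (uf-disjoint (list (make-point) (make-point)))
T
CL-USER> (let ((a (make-point))
               (b (make-point)))
           (uf-union a b)
           (uf-disjoint (list a b)))
NIL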

A couple more observations may be drawn from this simple example:

  1. The clever idea we initially have doesn't always work flawlessly at once. It is important to check the edge cases for potential problems.
  2. We've seen an example of a data structure that doesn't exist as a single entity: pieces of information are distributed over individual data points. Sometimes, there's a choice between storing the information centrally, in a dedicated structure like a hash-table, and distributing it over individual nodes. The latter approach is often more elegant and efficient, although it's not so obvious.

Footnotes:

[1] Moreover, Python has special syntax for destructuring such tuples: dec, rem = truncate(3.14). However, this is not the optimal way to handle returning the primary and one or more secondary values from a function. Lisp provides a more elegant solution called multiple values: all the necessary values are returned via the values form: (values dec rem) and can be retrieved with (multiple-value-bind (dec rem) (truncate 3.14) ...) or (with ((dec rem (truncate 3.14))) ...). It is more elegant because secondary values may be discarded at will by calling the function in a usual way: (+ 1 (truncate 3.14)) => 4 — not possible in Python, because you can't sum a tuple with something.

[2] Actually, the complexity here is O(n^2) due to the use of the function member that performs the set membership test in O(n), but this is not essential to the general idea. If uf-disjoint is expected to be called with tens or hundreds of points, the roots structure has to be changed to a hash-set that has an O(1) membership operation.
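For illustration, such a change might look like this (a sketch that replaces the list of roots with a standard Common Lisp hash-table used as a set):

(defun uf-disjoint (points)
  "Return true if all of the POINTS belong to different subsets."
  (let ((roots (make-hash-table)))
    (dolist (point points)
      (let ((root (uf-find point)))
        (when (gethash root roots)
          (return-from uf-disjoint nil))
        (setf (gethash root roots) t))))
  t)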
