2019-08-30

Programming Algorithms: Key-Values

This is a snippet of the "Key-Values" chapter of the book.

To conclude the description of essential data structures, we need to discuss key-values (kvs), which are the broadest family of structures one can imagine. Unlike arrays and lists, kvs are not concrete structures. In fact, they span, at least in some capacity, all of the popular concrete ones, as well as some obscure.

The main feature of kvs is efficient access to the values by some kind of keys that they are associated with. In other words, each element of such data structure is a key-value pair that can be easily retrieved if we know the key, and, on the other hand, if we ask for the key that is not in the structure, the null result is also returned efficiently. By "efficiently", we usually mean O(1) or, at least, something sublinear (like O(log n)), although, for some cases, even O(n) retrieval time may be acceptable. See how broad this is! So, a lot of different structures may play the role of key-values.

By the way, there isn't even a single widely-adopted name for such structures. Besides key-values — which isn't such a popular term (I derived it from key-value stores) — in different languages, they are called maps, dictionaries, associative arrays, tables, objects and so on.

In a sense, these are the most basic and essential data structures. They are so essential that some dynamic languages — for example, Lua, explicitly, and JavaScript, without a lot of advertisement — rely on them as the core (sometimes sole) language's data structure. Moreover, key-values are used almost everywhere. Below is a list of some of the most popular scenarios:

  • implementation of the object system in programming languages
  • most of the key-value stores are, for the most part, glorified key-value structures
  • internal tables in the operating system (running process table or file descriptor tables in the Linux kernel), programming language environment or application software
  • all kinds of memoization and caching
  • efficient implementation of sets
  • ad hoc or predefined records for returning aggregated data from function calls
  • representing various dictionaries (in language processing and beyond)

Considering such a wide spread, it may be surprising that, historically, the programming language community only gradually realized the usefulness of key-values. For instance, such languages as C and C++ don't have the built-in support for general kvs (if we don't count structs and arrays, which may be considered significantly limited versions). Lisp, on the contrary, was to some extent pioneering their recognition with the concepts of alists and plists, as well as being one of the first languages to have hash-table support in the standard.

More details about of the book may be found on its website.

2019-08-29

RUTILS 5.0 and Tutorial

RUTILS is my take on the Lisp "modernization" effort that adds the missing syntactic and data structure pieces, which became proven and well-established, in Lisp itself or in other languages. The programming field is constantly developing while the Lisp standard remains fixed so some additions, over time, are not only desirable but inevitable, if we don't want to lag behind. Thankfully, Lisp provides all the necessary means for implementing them and so, with some creativity, there's a way to have access to almost anything you want and need while retaining full backward compatibility (a lack of which is the most critical problem of some alternative solutions).

I, surely, understand that using such an extension remains a matter of taste and not every Lisper will like it. I didn't try to seriously promote it and was quite satisfied with the benefit that it provided to me and my teams' development. However, as I decided to use it for the "Programming Algorithms" book, it received some attention and a number of questions. From the direction of the discussions, I realized that the docs are lacking a critical part — the tutorial explaining how to effectively use the library. This text is intended to bridge that gap. I had to finish it before publishing the next chapter of the book, which I'll do on Friday.

So, today, version 5 of RUTILS is released alongside with the tutorial that aims to better explain its usage.

2019-08-19

Programming Algorithms: Linked Lists

This is a snippet of the "Lists" chapter of the book.

Linked data structures are in many ways the opposite of the contiguous ones that we have explored to some extent in the previous chapter using the example of arrays. In terms of complexity, they fail where those ones shine (first of all, at random access) — but prevail at scenarios when a repeated modification is necessary. In general, they are much more flexible and so allow the programmer to represent almost any kind of a data structure, although the ones that require such level of flexibility may not be too frequent. Usually, they are specialized trees or graphs.

The basic linked data structure is a singly-linked list.

Just like arrays, lists in Lisp may be created both with a literal syntax for constants and by calling a function — make-list — that creates a list of a certain size filled with nil elements. Besides, there's a handy list utility that is used to create lists with the specified content (the analog of vec).

CL-USER> '("hello" world 111)
("hello" WORLD 111)
CL-USER> (make-list 3)
(NIL NIL NIL)
CL-USER> (list "hello" 'world 111)
("hello" WORLD 111)

An empty list is represented as () and, interestingly, in Lisp, it is also a synonym of logical falsehood (nil). This property is used very often, and we'll have a chance to see that.

If we were to introduce our own lists, which may be quite a common scenario in case the built-in ones' capabilities do not suit us, we'd need to define the structure "node", and our list would be built as a chain of such nodes. We might have wanted to store the list head and, possibly, tail, as well as other properties like size. All in all, it would look like the following:

(defstruct list-cell
  data
  next)

(defstruct our-own-list
  (head nil :type (or list-cell null))
  (tail nil :type (or list-cell null)))

CL-USER> (let ((tail (make-list-cell :data "world")))
           (make-our-own-list
            :head (make-list-cell
                   :data "hello"
                   :next tail)
            :tail tail))
#S(OUR-OWN-LIST
   :HEAD #S(LIST-CELL
            :DATA "hello"
            :NEXT #S(LIST-CELL :DATA "world" :NEXT NIL))
   :TAIL #S(LIST-CELL :DATA "world" :NEXT NIL))

More details about of the book may be found on its website.

2019-08-12

Programming Algorithms: Arrays

This is a snippet of the "Arrays" chapter of the book.

Arrays are, alongside structs, the most basic data structure and, at the same time, the default choice for implementing algorithms. A one-dimensional array that is also called a "vector" is a contiguous structure consisting of the elements of the same type. One of the ways to create such arrays, in Lisp, is this:

CL-USER> (make-array 3)
#(0 0 0)

The printed result is the literal array representation. It happens that the array is shown to hold 0's, but that's implementation-dependent. Additional specifics can be set during array initialization: for instance, the :element-type, :initial-element, and even full contents:

CL-USER> (make-array 3 :element-type 'list :initial-element nil)
#(NIL NIL NIL)
CL-USER> (make-array 3 :initial-contents '(1.0 2.0 3.0))
#(1.0 2.0 3.0)

If you read back such an array you'll get a new copy with the same contents:

CL-USER> #(1.0 2.0 3.0)
#(1.0 2.0 3.0)

It is worth noting that the element type restriction is, in fact, not a limitation the default type is T[1]. In this case, the array will just hold pointers to its elements that can be of arbitrary type. If we specify a more precise type, however, the compiler might be able to optimize storage and access by putting the elements in memory directly in the array space. This is, mainly, useful for numeric arrays, but it makes multiple orders of magnitude difference for them for several reasons, including the existence of vector CPU instructions that operate on such arrays.

The arrays we have created are mutable, i.e. we can change their contents, although we cannot resize them. The main operator to access array elements is aref. You will see it in those pieces of code, in this chapter, where we care about performance.

CL-USER> (let ((vec (make-array 3 :initial-contents '(1.0 2.0 3.0))))
           (print (aref vec 0))
           (print (? vec 1))
           (:= (aref vec 2) 4.0))
           (print (? vec 2))
           (aref vec 3))
1.0 
2.0 
4.0
; Evaluation aborted on #<SIMPLE-TYPE-ERROR expected-type: (MOD 3) datum: 3>

In Lisp, array access beyond its boundary, as expected, causes an error.

It is also possible to create constant arrays using the literal notation #(). These constants can, actually, be changed in some environments, but don't expect anything nice to come out of such abuse — and the compiler will warn you of that:

CL-USER> (let ((vec #(1.0 2.0 3.0)))
           (:= (aref vec 2) nil)
           (print vec))
; caught WARNING:
;   Destructive function (SETF AREF) called on constant data.
;   See also:
;     The ANSI Standard, Special Operator QUOTE
;     The ANSI Standard, Section 3.2.2.3
; 
; compilation unit finished
;   caught 1 WARNING condition

#(1.0 2.0 NIL)

RUTILS provides more options to easily create arrays with a shorthand notation:

CL-USER> #v(1 2 3)
#(1 2 3)
CL-USER> (vec 1 2 3)
#(1 2 3)

Although the results seem identical they aren't. The first version creates a mutable analog of #(1 2 3), and the second also makes it adjustable (we'll discuss adjustable or dynamic arrays next).

More details about of the book may be found on its website.

2019-08-05

Programming Algorithms: Data Structures

The next several chapters will be describing the basic data structures that every programming language provides, their usage, and the most important algorithms relevant to them. And we'll start with the notion of a data-structure and tuples or structs that are the most primitive and essential one. Here is a snippet from this chapter.

Data Structures vs Algorithms

Let's start with a somewhat abstract question: what's more important, algorithms or data structures?

From one point of view, algorithms are the essence of many programs, while data structures may seem secondary. Besides, although a majority of algorithms rely on certain features of particular data structures, not all do. Good examples of the data-structure-relying algorithms are heapsort, search using BSTs, and union-find. And of the second type: the sieve of Erastophenes and consistent hashing.

At the same time, some seasoned developers state that when the right data structure is found, the algorithm will almost write itself. Linus Torvalds, the creator of Linux, is quoted saying:

Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

A somewhat less poignant version of the same idea is formulated in the Art of Unix Programming by Eric S. Raymond as the "Rule of Representation":

Fold knowledge into data so program logic can be stupid and robust.

Even the simplest procedural logic is hard for humans to verify, but quite complex data structures are fairly easy to model and reason about. To see this, compare the expressiveness and explanatory power of a diagram of (say) a fifty-node pointer tree with a flowchart of a fifty-line program. Or, compare an array initializer expressing a conversion table with an equivalent switch statement. The difference in transparency and clarity is dramatic.

Data is more tractable than program logic. It follows that where you see a choice between complexity in data structures and complexity in code, choose the former. More: in evolving a design, you should actively seek ways to shift complexity from code to data.

Data structures are more static than algorithms. Surely, most of them allow change of their contents over time, but there are certain invariants that always hold. This allows reasoning by simple induction: consider only two (or at least a small number of) cases, the base one(s) and the general. In other words, data structures remove, in the main, the notion of time from consideration, and change over time is one of the major causes of program complexity. In other words, data structures are declarative, while most of the algorithms are imperative. The advantage of the declarative approach is that you don't have to imagine (trace) the flow of time through it.

So, this book, like most other books on the subject, is organized around data structures. The majority of the chapters present a particular structure, its properties and interface, and explain the algorithms, associated with it, showing its real-world use cases. Yet, some important algorithms don't require a particular data structure, so there are also several chapters dedicated exclusively to them.

More information about the book may be found on its website.