Structs vs Parametric Polymorphism

Recently, Tamas Papp wrote about one problem he had with Lisp in the context of scientific computing: that it's impossible to specialize methods on parametric types.
While you can tell a function that operates on arrays that these arrays have element type double-float, you cannot dispatch on this, as Common Lisp does not have parametric types.
I encountered the same issue while developing the CL-NLP Lisp toolkit for natural language processing. For instance, I needed to specialize methods on sentences, which may come in different flavors: as lists of tokens, vectors of tokens, lists of strings or some more elaborate data-structure with attached metadata. Here's an example code. There's a generic function to perform various tagпing jobs (POS, NER, SRL etc.) It takes two arguments: the first — as with all CL-NLP generic functions — is the tagger object that is used for algorithm selection, configuration, as well as for storing intermediate state when necessary. The second one is a sentence being tagged. Here are two of its possible methods:
(defmethod tag ((tagger ap-dict-postagger) (sent string)) ...)
(defmethod tag ((tagger ap-dict-postagger) (sent list)) ...)
The first processes a raw string, which assumes that we should invoke some pre-processing machinery that tokenizes it and then, basically, call the second method, which will perform the actual tagging of the resulting tokens. So, list here means list of tokens. But what if we already have the tokenization, but haven't created the token objects? I.e. a list of strings is supplied as the input to the tag method. The CLOS machinery doesn't have a way to distinguish, so we'll have to resort to using typecase inside the method, which is exactly what defmethod replaces as a transparent and extensible alternative. Well, in most other languages we'll stop here and just have to assert that nothing can be done and it should be accepted as is. After all, it's a local nuisance and not a game changer for our code (although Tamas refers to it as a game-changer for his). In Lisp, we can do better. Thinking about this problem, I see at least 3 solutions with a varying level of elegance and portability. Surely, they may seem slightly inferior to such capability being built directly into the language, but demanding to have everything built-in is unrealistic, to say the least. Instead, having a way to build something in ourselves is the only future-proof and robust alternative. And this is what Lisp is known for. The first approach was mentioned by Tamas himself:
You can of course branch on the array element types and maybe even paper over the whole mess with sufficient macrology (which is what LLA ended up doing), but this approach is not very extensible, as, eventually, you end up hardcoding a few special types for which your functions will be "fast", otherwise they have to fall back to a generic, boxed type. With multiple arguments, the number of combinations explodes very quickly.
Essentially, rely on typecase-ing but use macros to blend it into the code in the most non-intrusive way, minimizing boilerplate. This is a straightforward path, in Lisp, and it has its drawbacks for long-running projects that need to evolve over time. But it remains a no-brainer for custom one-offs. That's why, usually, few venture further to explore other alternatives. The other solution was mentioned in the Reddit discussion of the post:
Generics dispatching on class rather than type is an interesting topic. I've definitely sometimes wanted the latter so far in doing CL for non-scientific things. It is certainly doable to make another group of generics that do this using the MOP.
I.e. use the MOP to introduce type-based generic dispatch. I won't discuss it here but will say that similar things were tried in the past quite successfully. ContextL and Layered functions are some of the examples. Yet, the MOP path is rather heavy and has portability issues (as the MOP is not in the standard, although there is the closer-to-mop project that unifies most of the implementations). In my point of view, its best use is for serious and fundamental extension of the CL object system, not to solve a local problem that may occur in some contexts but is not so pervasive. Also, I'd say that the Lisp approach that doesn't mix objects and types (almost) is, conceptually, the right one as these two facilities solve a different set of problems. There's a third — much simpler, clear and portable solution that requires minimal boilerplate and, in my view, is best suited for such level of problems. To use struct-s. Structs are somewhat underappreciated in the Lisp world, not a lot of books and study materials give them enough attention. And that is understandable as there's not a lot to explain. But structs are handy for many problems as they are a hassle-free and efficient facility that provides some fundamental capabilities. In its basic form, the solution is obvious, although a bit heavy. We'll have to define the wrapper structs for each parametric type we'd like to dispatch upon. For example, list-of-strings and list-of-tokens. This looks a little stupid and it is, because what's the semantic value of a list of strings? That's why I'd go for sentence/string and sentence/token which is a clearer naming scheme. (Or, if we want to mimic Julia, sentence<string>).
(defstruct sent/str
Now, from the method's signature, we will already see that we're dealing with sentences in the tagging process. And will be able to spot when some other tagging algorithm operates on the paragraphs instead of words: let's say, tagging parts of an email with such labels as greeting, signature, and content. Yes, this can also be conveyed via the name of the tagger, but, still, it's helpful. And it's also one of the hypothetical fail cases for a parametric type-based dispatch system: if we have two different kinds of lists of strings that need to be processed differently, we'd have to resort to similar workarounds in it as well. However, if we'd like to distinguish between lists of strings and vectors of strings, as well as more generic sequences of strings we'll have to resort to more elaborate names, like sent-vec/str, as a variant. It's worth noting though that, for the sake of producing efficient compiled code, only vectors of different types of numbers really make a difference. A list of strings or a list of tokens, in Lisp, uses the same accessors so optimization here is useless and type information may be used only for dispatch and, possibly, type checking. Actually, Lisp doesn't support type-checking of homogenous lists, so you can't say :type (list string), only :type list. (Wel, you can, actually uses (and satisfies (lambda (x) (every 'stringp x)), but what's the gain?) Yet, using structs adds more semantic dimensions to the code than just naming. They may store additional metadata and support simple inheritance, which will come handy when we'd like to track sentence positions in the text and so on.
(defstruct sent-vec/tok
  (toks nil :type (vector tok)))

(defstruct (corpus-sent-vec/tok (:include sent-vec/tok))
  file beg end)
And structs are efficient in terms of both space consumption and speed of slot access.
So, now we can do the following:
(defmethod tag ((tagger ap-dict-postagger) (sent sent/str)) ...)
(defmethod tag ((tagger ap-dict-postagger) (sent sent/tok)) ...)
(defmethod tag ((tagger ap-dict-postagger) (sent sent-vec/tok)) ...)
We'll also have to defstruct each parametric type we'd like to use. As a result, with this approach, we can have the following clean and efficient dispatch:
(defgeneric tag (tagger sent)
  (:method (tagger (sent string))
    (tag tagger (tokenize *word-splitter* sent))
  (:method (tagger (sent sent/str))
    (tag tagger (make-sent/tok :toks (map* ^(prog1 (make-tok 
                                                    :word %
                                                    :beg off 
                                                    :end (+ off (length %))) 
                                              (:+ off (1+ (length %)))
  (:method ((tagger pos-tagger) (sent sent/tok))
    (copy sent :toks (map* ^(copy % :pos (classify tagger
                                                   (extract-features tagger %))

CL-USER> (tag *pos-tagger* "This is a test.")
#S(SENT/TOK :TOKS (<This/DT 0..4> <is/VBZ 5..7> <a/DT 8..9>
                   <test/NN 10..14> <./. 14..15>))
Some of the functions used here, ?, map*, copy, as well as @ and ^ reader macros, come from my RUTILS, which fills the missing pieces of the CL standard library. Also an advantage of structs is that they define a lot of things in the background: invoking type-checking for slots, a readable print-function, a constructor, a builtin copy-structure and more. In my view, this solution isn't any less easy-to-use than the static-typed one (Julia's). There's a little additional boilerplate (defstructs), which may be even considered to have a positive impact on the code's overall clarity. And yes, you have to write boilerplate in Lisp sometimes, although not so much of it. Here's a fun quote on the topic I saw on twitter some days ago:
Lisp is an embarrassingly concise language. If you’re writing a bunch of boilerplate in it, you need to read SICP & “Lisp: A Language for Stratified Design”.
P.S. There's one more thing I wanted to address from Tamas's post
Now I think that one of the main reasons for this is that while you can write scientific code in CL that will be (1) fast, (2) portable, and (3) convenient, you cannot do all of these at the same time.
I'd say that this choice (or rather a need to prioritize one over the others) exists in every ecosystem. At least, looking at his Julia example, there's no word of portability (citing Tamas's own words about the language: "At this stage, code that was written half a year ago is very likely to be broken with the most recent release."), while convenience may be manifest well for his current use case, but what if we require to implement in the same system features that deal with other areas outside of numeric computing? I'm not so convinced. Or, speaking about Python, which is a goto language for scientific computing. In terms of performance, the only viable solution is to implement the critical parts in C (or Cython). Portable? No. Convenient — likewise. Well, as a user you get convenience, and speed, and portability (although, pretty limited). But at what cost? I'd argue that developing the Common Lisp scientific computing ecosystem to a similar quality would have required only 10% of the effort that went into building numpy and scipy...


ANN: flight-recorder - a robust REPL logging facility

Interactivity is a principal requirement for a usable programming environment. Interactivity means that there should be a shell/console/REPL or other similar text-based command environment. And a principal requirement for such an environment is keeping history. And not just keeping it, but doing it robustly:

  • recording history from concurrently running sessions
  • keeping unlimited history
  • identifying the time of the record and its context

This allows to freely experiment and reproduce the results of successful experiments, as well as go back to an arbitrary point in time and take another direction of your work, as well as keeping DRY while performing common repetitive tasks in the REPL (e.g. initialization of an environment or context).

flight-recorder or frlog (when you need a distinct name) is a small tool that intends to support all these requirements. It grew out of a frustration with how history is kept in SLIME, so it was primarily built to support this environment, but it can be easily utilized for other shells that don't have good enough history facility. It is possible due to its reliance on the most common and accessible data-interchange facility: text-based HTTP.

frlog is a backend service that supports any client that is able to send an HTTP request.

The backend is a Common Lisp script that can be run in the following manner (probably, the best way to do it is inside screen):

sbcl --noprint --load hunch.lisp -- -port 7654 -script flight-recorder.lisp

If will print a bunch of messages that should end with the following line (modulo timestamp):

[2018-09-29 16:00:53 [INFO]] Started hunch acceptor at port: 7654.

The service appends each incoming request to the text file in markdown format: ~/.frlog.md.

The API is just a single endpoint - /frlog that accepts GET and POST requests. The parameters are:

  • text is the content (url-encoded, for sure) of the record that can, alternatively, be sent in the POST request's body (more robust)

Optional query parameters are:

  • title - used to specify that this is a new record: for console-based interactions, usually, there's a command and zero or more results - a command starts the record (and thus should be accompanied with the title: for SLIME interactions it's the current Lisp package and a serial number). An entitled record is added in the following manner:
    ### cl-user (10) 2018-09-29_15:49:17
        (uiop:pathname-directory-pathname )
    If there's no title, the text is added like this:
    ;;; 2018-09-29_15:49:29
        #<program-error @ #x100074bfd72>
  • tag - if provided it signals that the record should be made not to a standard .frlog.md file, but to .frlog-<tag>.md. This allows to easily log a specific group of interactions separately If the response code is 200 everything's fine.

Currently, 2 clients are available:

  • a SLIME client flight-recorder.el that monkey-patches a couple of basic SLIME functions (just load it from Emacs if you have SLIME initialized)
  • and a tiny Lisp client frlog.lisp

P.S. To sum up, more and more I've grown to appreciate simple (sometimes even primitive - the primitive the better :) tools. flight-recorder seems to me to be just like that: it was very easy to hack together, but it solves an important problem for me and, I guess, for many. And it's the modern "Unix way": small independent daemons, text-based formats and HTTP instead of pipes...

P.P.S. frlog uses another tiny tool of mine - hunch that I've already utilized in a number of projects but haven't described yet - it's a topic for another post. In short, it is a script to streamline running hunchentoot that does all the basic setup and reduces the developer's work to just defining the HTTP endpoints.

P.P.P.S. I know, the name is, probably, taken and it's a rather obvious one. But I think it just doesn't matter in this case... :)


Minimal Perfect Hash-Tables in Common Lisp

Recently, my twitter pal @ifesdjeen wrote a line that resonated with me: "Looks like it's easier for people to read 40 blog posts than a single whitepaper." And although he used it in a negative context, I recognized it as a very precise (and, actually, positive) description of what a research engineer does: read a whitepaper (or a dozen, for what it's worth) and transform it into working code and - as a possible byproduct - into a blog post that other engineers will understand and be able to reproduce. I'm in the business of reproducing papers for about 7 years now, and, I believe, everyone with the same experience will confirm that it's not a skill every engineer should be easily capable of. Not all papers even can be reproduced (because the experiment was just not set up correctly), and of those, which, in principle, can be, only some are presented in the form that I can grasp well enough to be able to program them. And, after all, working code is an ultimate judge of your understanding.

But I digressed... :) This post is meant to explain (in simple "engineering" terms) the concept of minimal perfect hash-tables and how I've recently implemented one of their varieties to improve the memory efficiency of my language identification library WILD. Uncovering, in the process, a more abstract concept behind it.

The Justification of Perfect Hash-Tables

Minimal perfect hash-tables are persistent data structures that solve 2 basic deficiencies of normal hash-tables: they guarantee both constant (not only amortized) O(1) time for collision-free key-based access, while no reserved space is required (which for hash-tables may be in the range of 20%-50% of its size). It comes at a cost of two restrictions: the table should be filled with a keyset known ahead-of-time, and the process of building one takes longer than for a usual hash-table (although, for many methods, it can still be bounded by amortized O(1) time).

From my point of view, the main advantage of perfect HTs is the possibility to substantially reduce memory usage, which is important in such use cases as storing big dictionaries (relevant to many NLP tasks). Moreover, the space requirements can be lowered even more if the whole keyset is stored in the table, i.e. there can be no misses. Under such constraints, unlike with normal hash-tables, which still require storing the keys alongside the values due to the need to compare them in cases of hash collisions, in a perfect HT they can be just omitted. Unfortunately, such situation is rarely the case, but if some level of false positives is permitted, with the help of additional simple programmatic tricks, the memory usage for keys can be also reduced by orders of magnitude.

Hash-tables on their own are the trickiest among the basic data structures. But, in terms of complexity, they pale in comparison to minimal perfect hash-tables, which belong to the advanced algorithms family. One reason for that is that perfect hash-tables require more "moving parts", but the main one is, probably, that there's no common well-known algorithm to distribute keys in a perfect manner. And reading many perfect hash-table papers will scare away most programmers, at least it did me. However, after some research and trial-and-error, I've managed to find the one that presents a simple and scalable approach. So, I'm going to explain it in this post.

Hash-tables in SBCL

If you take a look at some particular hash-table implementation, your mental model of what a hash-table is may be quite seriously shaken. A straightforward open addressing table will assume a vector of key-value pairs as an underlying concrete structure, filled as the table is modified. On a 64-bit system, it will require: (8 × 1.5 + 8 × 16) × entries-count + total size of all keys + total size of all values + constant size of metadata. The 8 × 1.5 bytes in the formula is storage needed to hold a pointer to a hash-table entry including an average 33% extra overhead, the additional 8 × 16 bytes per entry will be spent on a cons cell (if we use this efficient although antiquated way to represent pairs in Lisp). It should be noted that depending on the size of keys and values the first term may be either a significant (up to half of the total size) or a negligible part of the table's memory profile.

However, this is not how SBCL implements hash-tables. See for yourself. Let's create a random hash-table with approximately 1000 keys:

> (defparameter *ht*
    (pairs->ht (loop :repeat 1000
                     :collect (pair (fmt "~{~C~}"
                                         (loop :repeat (random 10)
                                               :collect (code-char (+ 65 (random 50)))))
                                    (random 1000)))
               :test 'equal))

And inspect its contents:

> (inspect *ht*)

The object is a STRUCTURE-OBJECT of type HASH-TABLE.
1. TEST-FUN: #
2. HASH-FUN: #
              "plCjU" 985
              "VVYZqKm[" 861
              "N\\" 870
              "fqfBdZP\\" 364
              "cHNjIM]Y" 804
              "p_oHUWc^" 630
              "CqGRaMH" 408
              "NecO" 636
              "QDBq" 380
              "M" 838
            0 "plCjU" 985 "VVYZqKm[" 861 "N\\" 870 "fqfBdZP\\" 364 ...)
10. NEXT-FREE-KV: 845
11. CACHE: 1688
12. INDEX-VECTOR: #(0 617 332 393 35 0 194 512 0 420 ...)
13. NEXT-VECTOR: #(0 72 0 0 253 499 211 0 349 488 ...)
14. HASH-VECTOR: #(9223372036854775808 1830284672361826498 3086478891113066655
                   24243962159539 2602570438820086662 2431530612434713043
                   4568274338569089845 3085527599069570347 2374133389538826432
                   3322613783177911862 ...)
15. LOCK: #

As the fog of initial surprise clears, we can see that instead of a single vector it uses 4! The keys and values are placed in the one called TABLE (that starts with a reference to the hash-table object itself). Note that storing the entries as pairs is redundant both in terms of memory and access (additional dereferencing). That's why an obvious optimization is to put them directly in the backing array: so the keys and values here are interleaved. One more indirection that may be removed, at least in Lisp, is "unboxing" the numbers, i.e. storing them immediately in the array avoiding the extra pointer. This may be an additional specialized optimization for number-keyed hash-tables (to which we'll return later), but it is hardly possible with the interleaved scheme.

Perhaps, the biggest surprise is that the entries of the TABLE array are stored sequentially, i.e. there are no gaps. If they were to be added randomly, we'd expect the uniform distribution of 845×2 unique keys and corresponding values over the 2048-element array. Instead, this randomness is transferred to the 1024-element integer INDEX-VECTOR: another level of indirection. Why? For the same reason yet another 1024-element array is used — a HASH-VECTOR, which stores the values of hashes for all the table keys: efficient resizing. Although, it may seem that the biggest overhead of a hash-table is incurred due to reserved space for new keys, in fact, as we have now learned, resizing is a much heavier factor. Especially this relates to the HASH-VECTOR: if the hashes of all keys would not have been stored, each time the table got resized they would have had to be recalculated anew: an intolerable slowdown for larger tables. Having a separate INDEX-VECTOR is not so critical, but it provides two additional nice capabilities. Resizing the table is performed by adjusting the vectors' lengths (an optimized operation) without the need to redistribute the entries. Besides, unlike in theory, we can iterate this table in a deterministic order: the one of keys addition into the table. Which comes quite handy, although SBCL developers don't want to advertise this as a public API as it may restrict potential future modifications to the data structure. There's also a NEXT-VECTOR that is used to optimize insertion.

Overall, we can see that a production-optimized hash-table turns out to be much heavier than a textbook variant. And, indeed, these optimizations are relevant, as SBCL's hash-table are quite efficient. In my experiments several years ago, they turned out to be 2-3 times faster than, for instance, the CCL ones. Yet, it's clear that the speed optimizations, as usual, come at a cost of storage deoptimization. To restore storage sanity, we could fall back to the basic hash-table variant, and it's a possible solution for some cases, although a mediocre one, in general. Neither will it be the fastest, nor fully optimized.

Building an MPHT

Most of space in the SBCL's hash-table implementation is spent on metadata and keys, not on values. Yet, if we introduce a restriction that the table cannot be altered — no new values added after initial construction and no resizing possible — most of those optimizations and connected storage become redundant. An ideally space-efficient data structure is an array, but it doesn't allow key-based access. In theory, minimal perfect hash-tables are arrays with a minimal amount of metadata overhead. Yet, the precise amount is dependent on the algorithm, and there's still new research improving them going on. Overviewing all the existing approaches (most of which I don't fully understand) is beyond the scope of this post.

Besides, MPHTs require additional calculations at access time compared to normal hash-tables. So, if a hash-table is well-optimized it will usually be faster than and MPH.

And still, MPHs require a non-negligible amount of additional space for metadata: for the algorithm that will be discussed it's around 2 bytes per key. The best algorithms in the literature claim to reduce this amount more than 4 times to less than 4 bits per key. However, for now, I have picked the simpler approach, since this difference is not critical considering that every value in the table, on a 64-bit machine, occupies at least 16 bytes (a pointer plus the data), so an overhead of 2 bytes versus 0.5 bytes (that will probably be rounded to 1 byte) is already negligible.

Now, let's think of how we can distribute the keys in a hash-table so that there are no collisions and the backing array has the same number of elements as the number of hash-table keys. Theoretically, as the set of keys is known before the creation of the table, we can find a hash function that produces such distribution. Unfortunately, due to the Birthday paradox, it may be a long search. The algorithms for MPHTs suggest ways of structuring it. A good algorithm should have at most O(n) complexity as, otherwise, it will be infeasible for large keysets, which are the main use case for perfect hash-tables.

The algorithm I will briefly describe now was suggested by Botelho and Ziviani. It is a 2-step process:

  • at first stage, using a normal hash-function (in particular, Jenkins hash), all keys are nearly uniformly distributed into buckets, so that the number of keys in each bucket doesn't exceed 256. This can be done by setting the hash divisor to (ceiling (length keyset) 200 #| or slightly less |#);
  • next, for each bucket, a perfect hash function is constructed via a simple algorithm: for each key, two hash codes are calculated by one call to the Jenkins hash (this function outputs 3 hashes at once), which are treated as vertices of a graph. If the graph happens to be non-circular (which can be ensured with high probability with a suitable hash divisor), it is possible to construct the desired function as a sum of 2 hashes. Otherwise, we change the Jenkins hash seed and try constructing a new graph until a non-circular one is obtained. In practice, this requires just a couple of tries;
  • the construction of the final hash function is described very clearly by Czech, Havas and Majewski who have proposed this method: it lies in performing a depth-first search on the graph, labelling the edges with unique numbers and deducing the corresponding number for each possible Jenkins hash value.

Here you can see one of the possible labelings (each edge corresponds to a key and its unique index, each vertex — to the value for each of the possible Jenkins hashes):

Now, the per-bucket hash-function can be reconstructed from an array of numbers (in the range 0-255) associated with each possible hash. The divisor of the hash is twice the number of keys in the bucket, though it can be any number above the number of keys: the greater the divisor, the more space is used for storing the numbers (the divisor is the length of an array) and the less time it takes to find an acyclic graph.

The algorithm works quite fast: on my laptop, it takes 8 seconds for the table of 725 359 character trigrams from a mid-size language identification model and 13 seconds for 1 236 452 words from the same model.

To sum up, this is how to find the index of an element (bytes argument) in our perfect hash-table:

(defun csindex (bytes cstab)
  (with ((mod (/ (1- (length @cstab.meta)) 2))  ; divisor of the top-level hash
         (hash (* (mod (jenkins-hash bytes) mod) 2))  ; determine the bucket
         (off (? @cstab.meta hash))
         (seed (? @cstab.meta (1+ hash)))  ; each bucket has its own Jenkins seed
         (mod2 (- (? @cstab.meta (+ 2 hash)) off))
         (b c (jenkins-hash2 bytes seed (* mod2 2)))
         (goff (* 2 off)))
    ;; bucket offset + in-bucket hash
    (+ off (mod (+ (? gs (+ goff b))
                   (? gs (+ goff c)))

Note, that, in the code for this article, I refer to perfect hash tables as cstabs for the reasons described in the end.

Efficiency In Practice

So, let's now examine the memory efficiency of this method. Thankfully, just recently the SBCL developers started working on a critical for every algorithmic developer missing piece in the SBCL API: a function to determine the size in memory occupied by an arbitrary object. As we know from the famous koan, "LISP programmers know the value of everything and the cost of nothing". Indeed, from a certain point of view, this applies to SBCL. Although we, now, have a rough tool at our disposal that patches this hole... ;)

Using this unofficial function, we can roughly calculate the space occupied by the character trigrams hash-table mentioned above:

> (let ((ht (? wild:*lang-detector* '3gs)))
    (+ (object-size ht)
       (object-size @ht.table)
       (object-size @ht.index-vector)
       (object-size @ht.hash-vector)
       (reduce '+ (map 'list
                       (lambda (obj)
                         ;; the keys of the table are strings
                         ;; and values -- alists of a symbol and a float
                         (reduce '+ (map 'list ^(if (listp %)
                                                    (sum 'object-size %)
                                                    (object-size %))

100 MB!

> (let ((ct (build-const-table (? wild:*lang-detector* '3gs))))
    (+ (object-size ct)
       (object-size @ct.gs)
       (object-size @ct.meta)
       (reduce '+ (map 'list ^(sum 'object-size %)

I.e., we have reduced the memory usage almost twice. Because all the metadata now occupies just 1479808 bytes or 1,5 MB.

One critical decision that allows for such drastic memory-use improvement is omitting the keys from the table. It should be noted that adding them back is trivial: (defstruct (keyed-pht (:include const-table)) (keys nil :type simple-vector) (test 'eql)), for which getting the value will work like this:

(defun csget (key cstab)
  (let ((idx (csindex (bytes key) cstab)))
    (when (call @cstab.test key (svref @cstab.keys idx))
      (values (svref @cstab.data idx)

However, this will, basically, return us to the same ballpark as a textbook hash-table implementation in terms of memory usage while loosing in terms of speed. Yet, if we allow for some controlled level of false positives, there are a few algorithmic tricks that can be used to once again make keys almost negligible.

The first one is really simple and straightforward: replace the vector of keys with the vector of their hashes. In particular, if we take a single byte of the hash, such array will be 10-100 times smaller than the generic keys array and produce an FP-rate of 1/255.

Another trick will be to use a Bloom filter. For instance, a filter with 0.1 FP-rate for all the trigrams from our language identification model will require just around 0.4 MB of storage compared to 0.7 MB needed to store the 1-byte hashes and 30 MB needed to store just the keys. The disadvantage of a Bloom filter, however, is slower processing time: for the mentioned 10% FP-rate we'll need to perform 4 hash calculations, and if we'd like to reach the same 0.3% rate as the 1-byte hash array we'd have to increase the memory usage to 1MB and perform 8 hash calculations.

Finally, it should be noted that the main motivator for this work was reducing the memory footprint of my language identification library, for, to be widely adopted, such kind of project should be very frugal in its memory usage: 10-30 MB is its maximum polite footprint. By switching to a perfect hash-table, we haven't reached that goal (at least, for this particular model), but there's also plenty of room for optimizing values memory usage that I will return to later.


The other motivator for this work, in general, was my interest in the topic of efficient "static" data structures. In this context, I feel that the notion of a perfect hash table doesn't fully capture the essential features of the data structures class we're dealing with. First of all, the main distinguishing factor is static vs dynamic usage. A hash-table is thought of as a dynamic entity while our data structure is primarily a static one. Next, hashing is, for sure, involved in constructing all these structures, but it's only part of a bigger picture. The way of structuring the metadata, in particular, the index arrays may differ greatly not only between perfect and usual hash-table, but also within the ordinary hash-table class — within the different implementations.

So, I decided to come up with a different name for this data structure — a const-table (or cstab). It defines a broad class of persistent data structures that allow constant-time key-based access and, ideally, efficiently store the keys. The implementation, described here, is released as the library with the same name. It is still in its infancy, so major changes are possible — and suggestions on its improvement are also welcome.