2019-11-20

Programming Algorithms: Strings

This is a snippet of the "Srings" chapter of the book.

It may not be immediately obvious why the whole chapter is dedicated to strings. Aren't they just glorified arrays? There are several answers to these challenges:

  • indeed, strings are not just arrays, or rather, not only arrays: in different contexts, other representations, such as trees or complex combinations of arrays, may be used. And, besides, there are additional properties that are important for strings even when they are represented as arrays
  • there's a lot of string-specific algorithms that deserve their own chapter
  • finally, strings play a significant role in almost every program, so they have specific handling: in the OS, standard library, and even, sometimes, your application framework

In the base case, a string is, indeed, an array. As we already know, this array may either store its length or be a 0-terminated security catastrophe, like in C (see buffer overflow). So, to iterate, strings should store their length. Netstrings are a notable take on the idea of the length-aware strings: it's a simple external format that serializes a string as a tuple of length and contents, separated by a colon and ending with a comma: 3:foo, is the netsrting for the string foo.

More generally, a string is a sequence of characters. The characters themselves may be single bytes as well as fixed or variable-length byte sequences. The latter character encoding poses raises a challenging question of what to prefer, correctness or speed? With variable-length Unicode code points, the simplest and fastest string variant, a byte array, breaks, for it will incorrectly report its length (in bytes, not in characters) and fail to retrieve the character by index. Different language ecosystems address this issue differently, and the majority is, unfortunately, broken in one aspect or another. Overall, there may be two possible solution paths. The first one is to use a fixed-length representation and pad shorter characters to full length. Generally, such representation will be 32-bit UTF-32 resulting in up to 75% storage space waste for the most common 1-byte ASCII characters. The alternative approach will be to utilize a more advanced data-structure. The naive variant is a list, which implies an unacceptable slowdown of character access operation to O(n). Yet, a balanced approach may combine minimal additional space requirements with acceptable speed. One of the solutions may be to utilize the classic bitmap trick: use a bit array indicating, for each byte, whether it's the start of a character (only a 12% overhead). Calculating the character position may be performed in a small number of steps with the help of an infamous, in close circles, operation — Population count aka Hamming weight. This hardware instruction calculates the number of 1-bits in an integer and is accessible via logcount Lisp standard library routine. Behind the scenes, it is also called for bit arrays if you invoke count 1 on them. At least this is the case for SBCL:

CL-USER> (disassemble (lambda (x) 
                        (declare (type (simple-array bit) x))
                        (count 1 x)))

; disassembly for (LAMBDA (X))
; Size: 267 bytes. Origin: #x100FC9FD1A
...
; DA2:       F3480FB8FA       POPCNT RDI, RDX

The indexing function implementation may be quite tricky, but the general idea is to try to jump ahead n characters and calculate the popcount of the substring from the previous position to the current that will tell us the number of characters we have skipped. For the base case of a 1-byte string, we will get exactly where we wanted in just 1 jump and 1 popcount. However, if there were multibyte characters in the string, the first jump would have skipped less than n characters. If the difference is sufficiently small (say, below 10) we can just perform a quick linear scan of the remainder and find the position of the desired character. If it's larger than n/2 we can jump ahead n characters again (this will repeat at most 3 times as the maximum byte-length of a character is 4), and if it's below n/2 we can jump n/2 characters. And if we overshoot we can reverse the direction of the next jump or search. You can see where it's heading: if at each step (or, at least, at each 4th step) we are constantly half dividing our numbers this means O(log n) complexity. That's the worst performance for this function we can get, and it will very efficiently handle the cases when the character length doesn't vary: be it 1 byte — just 2 operations, or 4 bytes — 8 ops.

Here is the prototype of the char-index operation implemented according to the described algorithm (without the implementation of the mb-linear-char-index that performs the final linear scan):

(defstruct (mb-string (:conc-name mbs-))
  bytes
  bitmap)

(defparameter *mb-threshold* 10)

(defun mb-char-index (string i)
  (let ((off 0))
    (loop
      (with ((cnt (count 1 (mbs-bitmap string) :start off :end (+ off i))))
             (diff (- i cnt)))
        (cond
         ((= cnt i) (return (+ off i)))
         ((< diff *mb-threshold*) (return (mb-linear-char-index
                                           string diff off)))
         ((< cnt (floor i 2)) (:+ off i)
                              (:- i cnt))
         (t (:+ off (floor i 2))
            (:- i cnt)))))))

The length of such a string may be calculated by perfoming the popcount on the whole bitmap:

(defun mb-length (string)
  (count 1 (mbs-bitmap string)))

It's also worth taking into account that there exists a set of rules assembled under the umbrella of the Unicode collation algorithm that specifies how to order strings containing Unicode code-points.

More details about of the book may be found on its website.