This is a snippet of the "Compression" chapter of the book.
Compression is one of the tools that every programmer should understand and wield confidently. Situations in which a dataset is larger than the program can handle directly, so that its size becomes a bottleneck, are quite frequent and can be encountered in any domain. There are many forms of compression, yet the most general subdivision is between lossless compression, which preserves the original information intact, and lossy compression, which discards some information (assumed to be the least useful part or just noise). Lossless compression is applied to numeric or text data, whole files or directories: data that will become partially or utterly useless if even a slight modification is made. Lossy compression, as a rule, is applied to data that originates in the "analog world": sound or video recordings, images, etc. We touched on the subject of lossy compression in the previous chapter when talking about such formats as JPEG. In this chapter, we will discuss the lossless variants in more detail. Besides, we'll talk a bit about other, non-compressing, forms of encoding.
Encoding
Let's start with encoding. Lossless compression is, in fact, a form of encoding, but there are other, simpler forms. And it makes sense to understand them before moving to compression. Besides, encoding itself is a fairly common task. It is the mechanism that transforms the data from an internal representation of a particular program into some specific format that can be recognized and processed (decoded) by other programs. What we gain is that the encoded data may be serialized and transferred to other computers and decoded by other programs, possibly, independent of the program that performed the encoding.
Encoding may be applied to different semantic levels of the data. Character encoding operates on the level of individual characters or even bytes, while various serialization formats deal with structured data. There are two principal approaches to serialization: text-based and binary. Their pros and cons mirror each other: text-based formats are easier for humans to handle but are usually more expensive to process, while binary variants are not transparent (and so, much harder to deal with) but much faster to process. From the point of view of algorithms, binary formats are, obviously, better. But my programming experience is that they are a severe form of premature optimization. The rule of thumb should be to always start with text-based serialization and move to binary formats only as a last resort, once it has been proven that the impact on the program performance will be significant and important.
Base64
Encoding may have both a reduction and a magnification effect on the size of the data. For instance, there's a popular encoding scheme called Base64. It is a byte-level (lowest-level) encoding that doesn't discriminate between different input data representations and formats: the encoder just takes a stream of bytes and produces another stream of bytes. Or, more precisely, bytes in the specific range of English ASCII letters, numbers, and three more characters (usually +, /, and =). This encoding is often used for transferring data on the Web, in conjunction with SMTP (MIME), HTTP, and other popular protocols. The idea behind it is simple: split the data stream into sextets (6-bit parts, of which there are 64 different variants), and map each sextet to an ASCII character according to a fixed dictionary. As the last byte of the original data may not align with the last sextet, an additional padding character (=) is used to indicate 2 (=) or 4 (==) misaligned bits. As we see, Base64 encoding increases the size of the input data by a factor of 4/3 (about 1.33), since every 3 input bytes become 4 output characters.
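The size arithmetic can be checked with a couple of one-liners. This is only a sketch in plain Common Lisp; the function names B64-ENCODED-LENGTH and B64-PADDING-LENGTH are made up for this illustration and do not appear elsewhere in the chapter. It assumes the padded variant of Base64 described above, in which every group of up to 3 input bytes yields exactly 4 output characters.

```lisp
(defun b64-encoded-length (n)
  "Number of Base64 characters, padding included, produced for N input bytes."
  ;; Every started group of 3 bytes yields 4 characters.
  (* 4 (ceiling n 3)))

(defun b64-padding-length (n)
  "Number of trailing = characters in the Base64 encoding of N input bytes."
  ;; 1 leftover byte -> 2 padding characters, 2 leftover bytes -> 1, none -> 0.
  (mod (- 3 (mod n 3)) 3))
```

For a 3-byte input such as "Man" this gives 4 characters and no padding, while 4 and 5 bytes both give 8 characters, with 2 and 1 padding characters respectively.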
Here is one of the ways to implement a Base64 serialization routine:
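(defparameter *b64-dict*
  (coerce (append (loop :for ch :from (char-code #\A) :to (char-code #\Z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\a) :to (char-code #\z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\0) :to (char-code #\9)
                        :collect (code-char ch))
                  '(#\+ #\/ #\=))
          'simple-vector))

;; := is the RUTILS shorthand for SETF.
(defun b64-encode (in out)
  (let ((key 0)     ; the sextet currently being assembled
        (limit 6))  ; how many low bits of KEY the next byte has to fill
    (flet ((fill-key (byte off beg limit)
             ;; Copy LIMIT bits of BYTE, starting at bit BEG,
             ;; into KEY, starting at bit OFF.
             (:= (ldb (byte limit off) key)
                 (ldb (byte limit beg) byte))
             ;; This only rebinds the local parameter; callers don't see it.
             (:= off (- 6 beg)))
           (emit1 (k)
             (write-byte (char-code (svref *b64-dict* k)) out)))
      (loop :for byte := (read-byte in nil) :while byte :do
        (let ((beg (- 8 limit)))
          ;; The top LIMIT bits of BYTE complete the current key.
          (fill-key byte 0 beg limit)
          (emit1 key)
          ;; The remaining BEG low bits become the top bits of the next key.
          (fill-key byte (:= limit (- 6 beg)) 0 beg)
          ;; If 6 bits remained, they form a whole key of their own.
          (when (= 6 beg)
            (emit1 key)
            (:= limit 6))))
      ;; Padding: zero the unfilled low bits, emit the last key,
      ;; and add one = (index 64) for every 2 missing bits.
      (when (< limit 6)
        (:= (ldb (byte limit 0) key)
            (ldb (byte limit 0) 0))
        (emit1 key)
        (loop :repeat (ceiling limit 2) :do
          (emit1 64))))))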
This is one of the most low-level pieces of Lisp code in this book. It could be written in a much more high-level manner: utilizing the generic sequence access operations, say, on bit-vectors, instead of the bit-manipulating ones on numbers. However, it would also be orders of magnitude slower due to the need to constantly "repackage" the bits, converting the data from integers to vectors and back. I also wanted to show a bit of bit fiddling in Lisp. The standard, in fact, defines a comprehensive vocabulary of bit manipulation functions, and there's nothing stopping the programmer from writing performant code operating at a single-bit level.
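For comparison, here is a sketch of what a simpler, non-streaming formulation might look like. It is not the bit-vector variant alluded to above and not a replacement for the stream-based routine: it assumes the whole input is already in memory as a vector of octets, packs each group of up to 3 bytes into a 24-bit integer, and cuts that integer into 4 sextets. The names *B64-ALPHABET* and B64-ENCODE-OCTETS are made up for this illustration.

```lisp
(defparameter *b64-alphabet*
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
  "The 64 Base64 digits; the padding character = is handled separately.")

(defun b64-encode-octets (octets)
  "Return the Base64 string for OCTETS, a vector of integers in [0, 255]."
  (with-output-to-string (out)
    (loop :with n = (length octets)
          :for i :from 0 :below n :by 3
          ;; Number of real bytes in this group: 3, except maybe at the end.
          :for chunk = (min 3 (- n i))
          ;; Pack the group into a 24-bit integer, zero-filling missing bytes.
          :for group = (loop :with acc = 0
                             :for j :from 0 :below 3
                             :do (setf acc (+ (ash acc 8)
                                              (if (< j chunk)
                                                  (aref octets (+ i j))
                                                  0)))
                             :finally (return acc))
          ;; A group of CHUNK bytes yields CHUNK + 1 digits, then padding.
          :do (dotimes (k 4)
                (write-char (if (<= k chunk)
                                (char *b64-alphabet*
                                      (ldb (byte 6 (- 18 (* 6 k))) group))
                                #\=)
                            out)))))
```

With ASCII input, (b64-encode-octets (map 'vector #'char-code "Man i")) should produce the same "TWFuIGk=" as the Wikipedia example. Such a version is handy as a reference to compare against, but it holds both the input and the output in memory at once.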
One important choice made for Base64 encoding is the usage of streams as the input and output. This is a common approach to such problems based on the following considerations:
- It is quite easy to wrap the code so that we can feed in and extract strings as inputs and outputs. Doing the opposite, wrapping string-based code for stream operation, is also possible, but it defeats the whole purpose of streams, which is...
- Streams allow us to efficiently handle data of any size and not waste memory, as well as CPU, on storing intermediary copies of the strings we're processing. Encoding a huge file is a good illustration of why this matters: with streams, we do it in an obvious manner: (with-open-file (in ...) (with-out-file (out ...) (b64-encode in out))). With strings, however, it will mean, first, reading the file contents into memory, and we may not even have enough memory for that, and, after that, filling another big chunk of memory with the encoded data, which we'll still, probably, need to either dump to a file or send over the network.
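To make the file case concrete, here is a minimal sketch that uses only standard Common Lisp. ENCODE-FILE is a made-up helper name, and it takes the encoder as an argument, so any function of a binary input stream and a binary output stream (such as the Base64 routine above) can be plugged in. Memory use stays constant regardless of the file size, because bytes flow from one stream to the other without an intermediate copy of the whole contents.

```lisp
(defun encode-file (encoder in-path out-path)
  "Run ENCODER, a function of a binary input stream and a binary output
stream, over the file at IN-PATH, writing the result to OUT-PATH."
  (with-open-file (in in-path :element-type '(unsigned-byte 8))
    (with-open-file (out out-path :direction :output
                                  :element-type '(unsigned-byte 8)
                                  :if-exists :supersede)
      (funcall encoder in out))))

;; Hypothetical usage (the file names are placeholders):
;; (encode-file #'b64-encode "input.bin" "output.b64")
```

Note that an existing output file is silently overwritten here; that is a choice made for brevity in this sketch.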
So, what happens in the code above? First, the bytes are read from the binary input stream in, then each one is split into 2 parts. The higher bits are set into the current base64 key, which is translated, using *b64-dict*, into an appropriate byte and emitted to the binary output stream out. The lower bits are deposited in the higher bits of the next key in order to use this leftover during the processing of the next byte. However, if the leftover from the previous byte was 4 bits, at the current iteration, we will have 2 base64 bytes available, as the first will use 2 bits from the incoming byte, and the second will consume the remaining 6 bits. This is addressed in the code block (when (= 6 beg) ...). The function relies on the standard Lisp ldb operation, which provides access to the individual bits of an integer. It uses the byte-spec (byte limit offset) to control the bits it wants to obtain.
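A few REPL-style expressions may help to see what ldb does in isolation. This is a small illustration in standard Common Lisp, using the character code of M (77) and the first bits of a (97) from the example traced later. Note that the standard byte function takes the size first and the position second, and that positions are counted from the least significant bit.

```lisp
;; 77 is #b01001101.
(ldb (byte 6 2) 77)   ; 6 bits starting at bit 2, i.e. the top six: #b010011 = 19
(ldb (byte 2 0) 77)   ; the 2 lowest bits: #b01 = 1

;; LDB is also a SETF-able place, which is how a key can be assembled
;; from pieces of two different bytes.
(let ((key 0))
  (setf (ldb (byte 2 4) key) #b01)    ; leftover of the previous byte -> bits 4-5
  (setf (ldb (byte 4 0) key) #b0110)  ; top bits of the next byte -> bits 0-3
  key)                                ; #b010110 = 22
```

The values 19 and 22 are the dictionary indices of T and W, the first two characters of the encoded "Man".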
Implementing a decoder procedure is left as an exercise to the reader...
Taking the example from the Wikipedia article, we can see our encoding routine in action (here, we also rely on the FLEXI-STREAMS library to work with binary in-memory streams):
CL-USER> (with-input-from-string (str "Man i")
           (let ((in (flex:make-in-memory-input-stream
                      (map 'vector 'char-code
                           (loop :for ch := (read-char str nil) :while ch :collect ch))))
                 (out (flex:make-in-memory-output-stream)))
             (b64-encode in out)
             (map 'string 'code-char (? out 'vector))))
"TWFuIGk="
This function, although it's not big, is quite hard to debug due to the need for careful tracking and updating of the offsets into both the current base64 chunk (key) and the byte being processed. What really helps me tackle such situations is a piece of paper that serves for recording several iterations with all the relevant state changes. Something along these lines:
       M (77)     |    a (97)     |    n (110)
   0 1 0 0 1 1 0 1|0 1 1 0 0 0 0 1|0 1 1 0 1 1 1 0
0: 0 1 0 0 1 1    |               |                 19 = T
               0 1|               |
1:             0 1|0 1 1 0        |                 22 = W
                  |        0 0 0 1|
2:                |        0 0 0 1|0 1               5 = F

Iteration 0:

beg: 2
off: 0
limit: 6

beg: 0
off: (- 6 2) = 4
limit: 2

Iteration 1:

beg: 4
off: 0
limit: 4

beg: 0
off: (- 6 4) = 2
limit: 4
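The bit rows of such a table don't have to be written out by hand. The following sketch, in plain Common Lisp, prints the bits of the input and lists the expected sextet values, which can then be compared with what the encoder emits at each step. BIT-STRING and SEXTETS are made-up names for this illustration, and the string-based approach is chosen for readability, not speed.

```lisp
(defun bit-string (octets)
  "Concatenate the 8-bit binary representations of OCTETS into one string."
  (format nil "~{~8,'0b~}" (coerce octets 'list)))

(defun sextets (octets)
  "Split the bits of OCTETS into 6-bit groups, zero-padding the last one,
and return the groups as a list of integers."
  (let* ((bits (bit-string octets))
         ;; Number of zero bits needed to reach a multiple of 6.
         (pad (mod (- (length bits)) 6))
         (padded (concatenate 'string bits
                              (make-string pad :initial-element #\0))))
    (loop :for i :from 0 :below (length padded) :by 6
          :collect (parse-integer padded :start i :end (+ i 6) :radix 2))))
```

For the three bytes of "Man", (sextets #(77 97 110)) returns (19 22 5 46), matching the T, W, and F rows of the table plus the final u.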
Another thing that is indispensable when coding such procedures is the availability of reference examples of the expected result, like the ones in Wikipedia. The Lisp REPL makes iterating on a solution and constantly rechecking the results, using such available data, very easy. However, sometimes, it makes sense to reject the transient nature of code in the REPL and record some of the test cases as unit tests. As the motto of my test library SHOULD-TEST declares: you should test even Lisp code sometimes :) The tests also help the programmer to remember and systematically address the various corner cases. In this example, one of the special cases is the padding at the end, which is handled in the code block (when (< limit 6) ...). Due to the availability of a clear spec and reference examples, this algorithm lends itself very well to automated testing. As a general rule, all code paths should be covered by the tests. If I were to write those tests, I'd start with the following simple version. It addresses all 3 variants of padding and also the corner case of an empty string.
(deftest b64-encode ()
  ;; B64STR would be the function wrapped over the REPL code presented above
  (should be blankp (b64str ""))
  (should be string= "TWFu" (b64str "Man"))
  (should be string= "TWFuIA==" (b64str "Man "))
  (should be string= "TWFuIGk=" (b64str "Man i")))
Surely, many more tests should be added to a production-level implementation: to validate operation on non-ASCII characters, handling of huge data, etc.
More details about the book may be found on its website.