Errors in your Platform

One of the always trending topics in our pop-culture-like computer industry is dealing with errors. There are whole movements addressing it. Yet, in my humble opinion, these movements often see errors in a rather narrow, I would even say, simplistic manner: like the proverbial example of passing a string to a function which expects integers. But, in fact, the scope of errors that can occur in a program is indefinite. In a sense, the language of errors is Turing complete, and creating a program can be viewed from a certain point as devising a way of working around all relevant errors.

Today, I'd like to talk about a particularly nasty type of errors - those that occur in the platform on which we're building the program. And the reality is such that we base our programs on at least several layers of platforms: a hardware platform, an OS and its standard library, a language's VM and the language's standard library, some user-level framework(s) - to name a few. There are several aspects of this picture - some positive for us and some negative.

  • the lower we go, the more widely-used the platform is and thus the better it is tested and the more robust (positive)
  • at the same time it becomes, usually, more general and so more complex and feature-full, which means that some of the more obscure features may, actually, be not so well tested (negative)
  • the platform is usually heterogeneous: each level has it's own language with its semantics and concepts, often conflicting with the ones at the other levels (negative)

There's a revealing rant by node.js creator Ryan Dahl on the topic of such platforms - now removed, but available in the all-remembering Internet - "I hate almost all software".

The platform errors are so nasty, because of two things:

  • you learn to trust the platform (or you can easily go insane otherwise), so you first search for errors in your code
  • the majority of the platforms are not built with a priority on traceability, so it's an order of magnitude harder to find the problem, when you finally start looking for it IN the platform

And the thing is, there are severe bugs in all of the platforms we use. The Pentium FDIV bug is one acute example, but let's talk Java here, because there's so many people bragging how rock solid the JVM is. One of the problems of the JVM is jar hell - the issue that haunts almost every platform under different names (like DLL hell). We had an encounter with it recently, when after an update our server started hanging for no apparent reason. After an investigation we've found that part of our code was compiled with a more recent version of Gson library for parsing JSON, while the main server used the older version. The interesting thing is that there was no error condition signalled. I wonder, how many projects do the same mistake and don't even know what's the source of their problems. We had a similar but more visible issue with different versions of the servlet-api, but that time they were caught at compile-time because of the minor API inconsistency.

The other problem we'd fought fruitlessly and had to eventually work around with external tools was the JVM not properly closing network sockets. Yes, it leaks sockets in some scenarios and we couldn't find a way to prevent it. This is an example of another nasty platform error - the one which happens at layer boundaries. Also, many java libraries, like the ubiquitous jetty webserver leak memory in their default setting. Actually, I wonder, what percentage of Java libraries doesn't leak memory by default?.. This is no way to say, that JVM is exceptional in this regard - other platforms suffer from the same problems. At the same time to JVM's credit it has one of the tools to fight platform errors that many other platforms lack.

So, what are the ways to deal with those errors? Unfortunately, I don't have a lot of answers here, but I hope to discover some with this article. Here are some of the suggestions:

First of all, don't trust the platform.
Try to check everything and cross-validate your assumptions with external tools. This is much harder to do than say, and there's a tradeoff to make somewhere between slackness and paranoia. But if you think of it beforehand, you can see ways to set-up the system so that it would be easier to deal with the problem when it's necessary.
Have the platform's source code at hand.
This is one dimension of traceability - the static one. It won't help you easily find all the problems, but without it you're just at the mercy of the platform's creator. And I'd not trust those guys (see p.1)
Do proper error-handling.
There's a plethora of information on that, and I'd not repeat it here. See the error handling manifesto for one of the good examples.

But the biggest upside is not in fighting with the existing system, but in removing or diminishing the source of the problems - i.e. creating more programmer-friendly platforms. So the most important recommendations I see are for platform authors (and, in fact, most of us become them to some extent when we release a utility library or a framework).

Build platforms with traceability in mind.
One example of such traceability is JVM's VisualVM which is a great tool to get visibility into what's happening in the platform. Another example is Lisp images the state of which you can inspect interactively with Lisp commands. For instance, Lisp has built-in operators TRACE that allow to see the invocations of any interesting function, and INSPECT which is an interactive object browser. These things are really hard to add after-the-fact: look, for example, at the trouble of bringing DTrace into the Linux world.
Create homogenous platforms.
Why? Because in homogenous platforms the boundaries are not so sharp, and so there tend to be fewer bugs on the boundaries. And it's much easier to build in traceability: for example you don't need to deal with multiple stacks, multiple object representation etc. Once again, Lisp comes up here, because with its flexibility it allows to build systems of different levels of abstraction in the same language. And that was exemplified by the Lisp Machines - see the Kalman Reti's talk describing their inner workings.
Support component versioning at core.
This is, actually, one of the topics not addressed well by any mainstream platform (except, maybe, .Net), as far as I know: how to allow several versions of the same component to live together in one program image.

How can static typing help with these types of errors? I don't know. What about the Erlang's failure isolation. Once again, it's pretty orthogonal to the mentioned issues: if your Erlang library leaks memory, the whole machine will suffer - i.e. you can't isolate errors which touch the underlying layers of your platform which you don't fully control...

Please, share your approaches and stories about dealing with platform errors.

No comments: