I’m currently working on a small web application that has to do a fair amount of binary data munging in the front end (meaning: JavaScript). One of the things it needs to do is inspect data packets, unpack them (from a simple 7/8 bit encoding scheme invented in the 80s) and checksum them. The checksumming is done with a standard CRC32 algorithm. It took me a few hours to find a JavaScript library that uses the same polynomial as the app’s counterpart (luckily a widely used one; zlib uses it, for example) and that was usable within my Ember/Rails setup.
I tested around a bit and had a setup that worked, until I started testing with bigger packets and suddenly the checksums wouldn’t match anymore. As it turns out, the library had only worked in the first place by chance: it returns a signed 32 bit integer, and my initial test setup simply produced a checksum that didn’t have the sign bit set. In parallel, I verified the results with two tools: the Ruby zlib bindings (part of the stdlib) and the crc32 command line tool that comes with OS X. Both return unsigned integers.
So I asked the maintainer of the library if this behaviour is intentional and it turns out, yes, it is.
First of all, the main reason is a JavaScript gotcha you may or may not be aware of: All bitwise operators (with one notable exception I’ll come back to later) return 32 bit signed integers. This is documented. I am not sure about the reasons for this, but it could very well simply be an artefact of the rather chaotic past of JavaScript. At some point it becomes impossible to change something like this, because you will break people’s stuff.
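To make the gotcha concrete, here is a minimal sketch of what happens when a value with the top bit set passes through a bitwise operator. The variable names are made up for illustration; the behaviour itself is plain JavaScript.

```javascript
// Bitwise operators convert their operands to signed 32 bit integers first.
const allBitsSet = 0xffffffff;          // 4294967295 as a regular Number

const viaOr = allBitsSet | 0;           // -1
const viaAnd = allBitsSet & allBitsSet; // -1
const viaShift = 1 << 31;               // -2147483648
const viaNot = ~0;                      // -1

// Values that fit into 31 bits plus sign pass through unchanged.
const small = 0x7fffffff | 0;           // 2147483647

console.log(viaOr, viaAnd, viaShift, viaNot, small);
```

As long as the top bit is clear, nothing looks wrong, which is exactly why a test with a "lucky" checksum can pass.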
For CRC32, it’s the resulting bitfield that matters, so the sign is not really important. But there’s a good reason for simply keeping the number within the bounds of 32 bit signed integers: performance. V8 (and probably other engines as well) keeps numbers as signed 32 bit integers if you let it, and this lets it optimize a lot of calculations. Given that so many operators specifically return signed ints, this makes sense, I guess.
The fix for my app was simple. Turns out that there is one bitwise operator that returns 32 bit unsigned integers. It’s called the unsigned right shift, looks like this: >>>, and works exactly like its signed counterpart, >>, but fills up with 0 bits instead of sign bits. If you have never seen that before: Welcome to the club. Here’s the fix: crc32(stuff) >>> 0. Looks and feels ridiculous, but works.
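For illustration, here is a rough sketch of the effect. It is not the library I used: crc32Signed is a hypothetical, textbook table-driven CRC32 (reflected polynomial 0xEDB88320, the one zlib uses) that, like most JavaScript implementations built on bitwise operators, ends up returning a signed value. The input "123456789" is the usual check string for this CRC variant.

```javascript
// Lookup table for the reflected polynomial 0xEDB88320.
const CRC_TABLE = new Int32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xedb88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[n] = c;
}

// Hypothetical helper: returns the checksum as a SIGNED 32 bit integer.
function crc32Signed(bytes) {
  let crc = -1; // same bit pattern as 0xffffffff
  for (const byte of bytes) {
    crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  }
  return ~crc; // final XOR with 0xffffffff, result stays signed
}

const input = Uint8Array.from("123456789", (ch) => ch.charCodeAt(0));

const signedCrc = crc32Signed(input); // -873187034
const unsignedCrc = signedCrc >>> 0;  // 3421780262, i.e. 0xcbf43926

console.log(signedCrc, unsignedCrc, unsignedCrc.toString(16));
```

Both numbers describe the same 32 bits. Only the unsigned one matches what tools that report unsigned checksums print.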
My short interaction with the library maintainer also revealed a rather interesting tidbit: As I mentioned earlier, the zlib bindings of Ruby return the checksum as an unsigned value. The reason is probably that the bindings cast the result to the standard integer type (Fixnum) in Ruby, which, on my machine, is a 64 bit value. If I were on a 32 bit platform (I haven’t tested it), I would assume that it would be cast to a Bignum, because that’s how Ruby usually handles numbers that don’t fit into a Fixnum. Python 2.x, on the other hand, as mentioned by the library maintainer, has a less strict behaviour. Before 2.6, zlib.crc32 would return signed or unsigned values, depending on your platform. From 2.6 on, it always returns a signed 32 bit integer, and in the 3.x series it always returns an unsigned integer. The docs state that you should use crc32 & 0xffffffff to get consistent behaviour over all platforms. Sure.
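A side note for anyone tempted to carry that Python idiom over to JavaScript: as far as I can tell it does not help there, because & is itself one of the operators that returns a signed result. A small sketch, using a checksum value with the sign bit set (the constant is just an example value):

```javascript
const signedChecksum = -873187034; // bit pattern 0xcbf43926

// The mask 0xffffffff is converted to the signed value -1 first,
// so the result is the unchanged, still negative, input.
const masked = signedChecksum & 0xffffffff; // -873187034

// The unsigned right shift is the operator that yields an unsigned result.
const shifted = signedChecksum >>> 0;       // 3421780262

console.log(masked, shifted);
```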
So far, I’ve talked about three dynamic languages, and their behaviour around integer types is highly inconsistent. All of this made me think a bit. When we talk about dynamically typed vs. statically typed (or weakly vs. strongly typed) programming languages, we often talk about how statically typed languages make it easier to find coding errors before you even run the code, vs. how much you often have to declare additionally to make this work (which is why well implemented type inference is such a big deal and why writing and reading Java is such a painful process). But as the JavaScript example shows, explicit typing would also help you when designing optimized code.
It is one thing to know about the 32 bit signed integer optimisations in V8 and a completely different thing to correctly design your code to stay within the bounds and prevent performance degradation over a large codebase.
But probably even more important: While the fact that all the bitwise operators in JavaScript operate on 32 bit integers has been documented for ages (I’ve looked it up in ECMA-262, 3rd Edition, just to be sure), I would assume it’s not really widely known, and, quite frankly, it does not satisfy the principle of least surprise. Not because people don’t know that integers usually have boundaries, but because the number types in JavaScript usually act as if these boundaries do not exist.
Now, you could simply chalk this up as “one of the things Brendan Eich got wrong in that wretched week he developed JavaScript”, but as the Python example shows, it’s not only about the language itself, but also about how, in this case, libraries are built and how these type issues are handled in C bindings and whatnot. Being explicit about the types you’re using in these cases would help tremendously, because it simply takes away the element of surprise. This does not mean that I became a proponent of strong type systems overnight. A language like Ruby, crafted with deliberate care to handle these cases in a way that keeps surprises down to a minimum, makes some of this irrelevant, but sometimes I simply wish I had the possibility to bolt on types when I need them. Which is why JavaScript has typed arrays, by the way, which are used in my application quite a lot.
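As a small illustration of what “bolting on types” looks like with typed arrays, here is a sketch with made-up variable names (not code from my application). The element type of the array decides how a value is interpreted, so the signed/unsigned question is answered explicitly by the type you pick.

```javascript
const signedValue = -873187034; // bit pattern 0xcbf43926

// Storing a number in a typed array converts it to the element type.
const asUnsigned = new Uint32Array([signedValue])[0];  // 3421780262
const backToSigned = new Int32Array([asUnsigned])[0];  // -873187034

// Two views on the same four bytes: same bits, different interpretation.
const buffer = new ArrayBuffer(4);
const signedView = new Int32Array(buffer);
const unsignedView = new Uint32Array(buffer);
signedView[0] = signedValue;

console.log(asUnsigned, backToSigned, unsignedView[0]); // unsignedView[0] is 3421780262

// Byte-sized elements wrap around instead of growing.
const wrappedByte = Uint8Array.of(300)[0]; // 44, because 300 mod 256 is 44
console.log(wrappedByte);
```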
As you can tell, this is not a well thought out analysis of the additional benefits of type safety. It is more of a snapshot of my current thoughts on this, but I wanted to share it anyway, not least in the spirit of my “you need to write more in 2016” initiative. Please let me know what you think.
I couldn’t find a good post head for this article in my own collection of photos, so I looked on Flickr, for the first time in years. You can find the original by Marco Ooi here. It is licensed under a Creative Commons BY-SA license, which in turn means that I share the post head image under the same license as well.