Jan Krutisch - The Binary Toolbox (for JavaScript)

The first time I tried my luck in parsing binary files within the browser must have been the Cloudtracker2 project, my (slightly out of date) try to make a good Protracker player/Editor for the web (It sort of lives on in the Halfplayer project if you’re interested). Parsing binary files in the browser is actually no longer a problem, but I thought it might be a fun exercise to write down some notes on what I’ve come across in one of my current projects, which involves intensive binary data munging on a much bigger scale than what I have tried so far.

So, here’s the toolbox’s contents:

Additionally, we’re going to talk about file drag and drop, creating object URLs and other things.

Before we get started: Here’s a JSBin containing the final example code. Also a word of caution: I didn’t test my example code outside of Chrome and some things I’m using do not properly work in other browsers. I’ll try to amend the article later on with some compatibility notes. (My current project is, for a number of reasons, Chrome only)

Let’s get something binary

To get binary data into the browser, the easiest way is to use a standard file upload widget. Later on I’ll be talking about how to do this with drag and drop, but since that is a lot more involved, let’s start simple. Here’s your off-the-shelf file dialog:

<label>Select File: <input type="file" id="file"></label>

To load a file that has been selected, all you need to do is listen for the change event, and then you can load the file into an ArrayBuffer using a FileReader.

var fileSelector = document.getElementById('file');
fileSelector.addEventListener('change', function(e) {
  var file = e.target.files[0];
  var reader = new FileReader();
  reader.onload = function(e) {
    parse(e.target.result);
  };
  reader.readAsArrayBuffer(file);
});

I hear you say “Whoah, back off a bit” and right you are.

So the fileSelector, which is our type-file input field, has a property named files which looks like an Array and is actually a FileList. It simply contains references to the selected files, which are File objects. In our case, we’re just interested in one (and by default, the file selector only allows single files).

With that File at hand, we can then instantiate a FileReader, which encapsulates the actual I/O process.

Of course, this being JavaScript, reading a file is an asynchronous process, which is why the actual result comes in the form of an event callback. The e.target.result now points to an ArrayBuffer. You can think of an ArrayBuffer as a contiguous piece of memory that contains your data. You cannot directly access that piece of memory, but we’re very close.

Of course, to properly implement this, you should also add a listener on onerror on the FileReader, but I’m the ignorant teacher here and can just assume you won’t be that stupid. Right?
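For the record, here is a minimal sketch of what that could look like. The helper name readFileAsArrayBuffer and its two callback parameters are made up for illustration; FileReader, onload, onerror and readAsArrayBuffer are the standard pieces.

```javascript
// Hypothetical helper: reads a File (or Blob) and reports success or failure.
function readFileAsArrayBuffer(file, onLoad, onError) {
  var reader = new FileReader();
  reader.onload = function() {
    onLoad(reader.result); // an ArrayBuffer
  };
  reader.onerror = function() {
    onError(reader.error); // describes what went wrong
  };
  reader.readAsArrayBuffer(file);
}
```

In the change handler you would then call something like readFileAsArrayBuffer(e.target.files[0], parse, showError), where showError is whatever error reporting your application uses.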

A humble byte array

Now, what to do with it. Glad you asked. We need to create a “view” into the data. This can be done with two constructs: typed arrays and DataView. Let’s start with the very obvious thing: We want stupid bytes. All of them.

function parse(buffer) {
  var bytesView = new Uint8Array(buffer);
}

Typed Arrays are, as the name says, a special type of array. They only contain typed values, something that sounds a bit weird for a language like JavaScript but of course is very handy when dealing with binary data. There are a number of other types as well:

Uint8ClampedArray
Int8Array
Uint16Array
Int16Array
Uint32Array
Int32Array
Float32Array
Float64Array

The larger types often come in handy, but beware: You can’t define the byte order of these types. Instead, they use the byte order of the system the browser runs on. Most systems nowadays are little endian, so you can kind of assume that all larger types use little endian byte order. We’ll see later on how to explicitly read or write bytes in a specific byte order.
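If you would rather check than assume, the byte order can be detected by writing a 16 bit value and looking at which byte ends up first. This is just a sketch; the function name isLittleEndian is made up for illustration.

```javascript
// Hypothetical helper: returns true on a little endian platform.
function isLittleEndian() {
  var buffer = new ArrayBuffer(2);
  new Uint16Array(buffer)[0] = 0x0102; // written in platform byte order
  // On little endian systems the low byte (0x02) is stored first.
  return new Uint8Array(buffer)[0] === 0x02;
}
```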

Now, we can actually access the bytes. Let’s assume we want to parse the header of a JPEG file (why do we want to do that? Beats me. Because we can, I guess)

function parse(buffer) {
  var bytesView = new Uint8Array(buffer);
  var headerOffset = bytesView.findIndex(function(element, i, array) {
    return (element === 0xFF && array[i+1] === 0xD8);
  });      
}

This searches the array for the SOI (Start Of Image) marker. After that follows the JFIF tag, which we will parse with a different method: Let’s get a view that only covers a part of the ArrayBuffer.

Please note that while current versions of Chrome and Firefox support findIndex on typed arrays, Safari sadly does not, so you’d have to fall back to a for loop for Safari compatibility.
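Such a for loop fallback could look roughly like this. The helper name findMarker is made up for illustration; it looks for a 0xFF byte followed by a given second byte, which is all the searches in this article need.

```javascript
// Hypothetical helper: index of the first 0xFF byte that is directly
// followed by `second`, or -1 if there is no such pair.
function findMarker(bytes, second) {
  for (var i = 0; i < bytes.length - 1; i++) {
    if (bytes[i] === 0xFF && bytes[i + 1] === second) {
      return i;
    }
  }
  return -1;
}
```

With that, findMarker(bytesView, 0xD8) would stand in for the findIndex call when looking for the SOI marker.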

function parse(buffer) {
  // [...]
  var jfifTag = new Uint8Array(buffer, headerOffset + 6, 4);
  var jfifString = Array.prototype.map.call(jfifTag, function(byte) {
    return String.fromCharCode(byte);
  }).join("");
}

With most well-formed JPEG images, this should give you “JFIF”. So, by specifying an offset and a length, you can create typed arrays that only look into a specific part of the ArrayBuffer.

But what about that weird prototype.map action? Well, .map on typed arrays insists on returning the same type of typed array. This does not work for me and so I’m using the Array prototype’s map instead. This works great and is, for example, a simple way of turning a typed array into a non typed array.
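To make the difference visible, here is a small side by side comparison. The four bytes are typed in by hand instead of coming from a real file, and the variable names are only for illustration.

```javascript
var tag = new Uint8Array([0x4A, 0x46, 0x49, 0x46]); // the bytes of "JFIF"

// Typed array map: the returned strings are coerced back to numbers.
// Non-numeric strings become NaN, which a Uint8Array stores as 0.
var typed = tag.map(function(byte) {
  return String.fromCharCode(byte);
});

// Array.prototype.map returns a plain array, so the strings survive.
var plain = Array.prototype.map.call(tag, function(byte) {
  return String.fromCharCode(byte);
});
var text = plain.join("");
```

Here typed ends up as a Uint8Array full of zeros, while text is the string we were after.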

One thing I need to mention (another drawback of the “big” typed arrays, if you will) is that the offset in the typed array constructor must be divisible by the byte size of the type. So new Float64Array(buffer, 3, 2) simply won’t work; it throws a RangeError. As with the byte order, I’ll show you later how to handle that.
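A quick sketch of the failure, plus one possible workaround that is not the one covered later in this article: copying the bytes out with ArrayBuffer.prototype.slice so that the new view starts at offset 0. The variable names are only for illustration, and keep in mind that writes to the copy do not end up in the original buffer.

```javascript
var buffer = new ArrayBuffer(24);

var error = null;
try {
  new Float64Array(buffer, 3, 2); // offset 3 is not a multiple of 8
} catch (e) {
  error = e; // a RangeError
}

// Workaround by copying: slice() returns a new ArrayBuffer holding the bytes
// from offset 3 on, so a view on the copy starts at the aligned offset 0.
var copy = buffer.slice(3, 3 + 2 * Float64Array.BYTES_PER_ELEMENT);
var doubles = new Float64Array(copy);
```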

Now, while typed arrays are fixed in length, they are not read only. You can modify them by using the array accessor:

jfifTag[0] = "H".charCodeAt(0);
jfifTag[1] = "E".charCodeAt(0);
jfifTag[2] = "M".charCodeAt(0);
jfifTag[3] = "P".charCodeAt(0);

This will certainly irritate future image parsers. Please note, as the typed array we’re using is only a view into the ArrayBuffer, we’re actually changing the binary data in the array buffer.

If you want to overwrite larger parts of the typed array, there’s also a .set method. Here’s a specially contrived example:

  jfifTag.set(Array.prototype.map.call("JANS", function(c) {
    return c.charCodeAt(0);
  }));

(Yes, it looks like you can call Array.prototype iterators on strings. I’m not even mad.)

You can also copy typed arrays into other typed arrays with this, but beware: type conversions will happen, it’s not just copying bytes.
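A small sketch of what that conversion means in practice, using hand-picked values. The second half shows one way to copy the raw bytes instead, by viewing the source buffer as bytes first.

```javascript
var floats = new Float32Array([1.5, 300, -1]);

// set() converts each value: fractions are truncated and the result wraps
// around modulo 256 to fit into an unsigned byte.
var converted = new Uint8Array(3);
converted.set(floats); // 1, 44, 255

// To copy the underlying bytes, view the source buffer as bytes first.
var raw = new Uint8Array(floats.byteLength);
raw.set(new Uint8Array(floats.buffer, floats.byteOffset, floats.byteLength));
```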

DataView, the swiss army knife of binary parsing

So, like I said, this is all very well, but what happens if you need to actually read or write a 16 bit signed integer in big endian from or at offset 3? Worry not, DataView to the rescue.

So, let’s see if our JPEG file contains an EXIF block and see how long the block is.

var exifOffset = bytesView.findIndex(function(element, i, array) {
  return (element === 0xFF && array[i+1] === 0xE1);
});
if (exifOffset !== -1) {
  var view = new DataView(buffer, exifOffset, 4);
  var length = view.getUint16(2, false);
}

(Yes, the offset is not uneven, so a typed array would have worked as well, but JPEG stores the chunk sizes in big endian format, so we would have needed the DataView anyway…)

The DataView interface provides a set of functions to set and get single, typed values from the view. Contrary to the behavior of the typed arrays, these methods do not care for byte alignment and you can actually set and get big endian values on little endian systems.
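To pick up the question from above, here is a sketch that writes and reads a 16 bit signed integer in big endian at offset 3. The buffer is created from scratch so that the example stands on its own.

```javascript
var buffer = new ArrayBuffer(8);
var view = new DataView(buffer);

// Write -2 as a 16 bit signed integer in big endian at the odd offset 3.
view.setInt16(3, -2, false);

var bytes = new Uint8Array(buffer);
// bytes[3] is 0xFF and bytes[4] is 0xFE: most significant byte first.

var bigEndian = view.getInt16(3, false);   // -2
var littleEndian = view.getInt16(3, true); // -257, same bytes read as 0xFEFF
```

The last argument of the get and set methods is the littleEndian flag. Leaving it out means big endian.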

The important thing to note: This of course has an impact on performance. The alignment rules of typed arrays are not there because the browser implementors were lazy, but because typed arrays were, for the most part, introduced for performance reasons. CPUs nowadays can read words of various sizes from memory very quickly, but only if they are properly aligned (and, of course, if the endianness matches).

But especially when dealing with older file formats, big endian numbers are aplenty. The biggest reason for that, if you allow me the short detour, is Motorola’s 68k architecture, which was big endian and which powered popular machines such as the original Macintosh, the ATARI ST series and the AMIGA, which, of course, was my machine of choice back in the day. All of this during a time when many popular file formats (such as JPEG) were specified.

Let’s get it back!

So, we have done something with the binary file and we now want the user to be able to download the file. Well, not so much actually download (we’re still in the browser on our user’s computer), but more like, save. As in the old days, 8, 1, if you catch my drift.

Some of the things you’ve seen so far may have seemed absurd, but nothing prepares you for the following set of hoops you’re about to jump through.

First, let’s make a Blob. A Blob (also used as the underpinning of the File we saw earlier) is, like the ArrayBuffer, an arbitrary representation of binary data. But, in contrast to the ArrayBuffer, the Blob is completely opaque, unless read by a FileReader (as we did in the very beginning). So, let’s create a Blob:

var blob = new Blob([buffer], {type: 'image/jpeg'});

The blob constructor needs an array of things that can be turned into binary data. This could be an ArrayBuffer (our case), any of the typed arrays, another Blob, or a String. Additionally, we can specify a mime type for the “file” the Blob will represent.
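To illustrate the mixing of part types, here is a contrived sketch that glues all four kinds together. The helper name buildBlob and the contents of the parts are made up for illustration.

```javascript
// Hypothetical helper: builds one Blob out of four different kinds of parts.
function buildBlob() {
  var header = new Uint8Array([0xFF, 0xD8]); // a typed array (2 bytes)
  var payload = new ArrayBuffer(4);          // an ArrayBuffer (4 bytes)
  var inner = new Blob(["abc"]);             // another Blob (3 bytes)
  return new Blob([header, payload, inner, "!"], {
    type: 'application/octet-stream'
  });
}
```

The parts are concatenated in the order given, so the resulting Blob has a size of 10 bytes.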

To be able to download, err, save the file, we need to create a URL from this Blob using URL.createObjectURL:

var url = URL.createObjectURL(blob);

We can now assign this URL to a link and let the user click it. Or, we could, well, see for yourself.

var link = document.createElement('a');
link.href = url;
link.download = 'test.jpg';
link.style.display = 'none';
link.innerHTML = "click me";
document.body.appendChild(link);
link.click();
window.setTimeout(function() {
  link.remove();
  URL.revokeObjectURL(url);        
}, 0);

So, by setting the download property of the link, we effectively did the same thing as sending a Content-Disposition: attachment header with an HTTP response: If the user follows the link, the linked file will be downloaded, err, saved.

In Chrome, simply creating the link and clicking it is enough. In Firefox, the link needs to be in the DOM (but can be invisible). Also, the setTimeout bit (poor man’s nextTick) is needed for the download, err, save to work in Firefox.

By revoking the object URL, we tell the browser that the Blob is not attached to a URL anymore and can be normally garbage collected.

Drag and drop you said?

For many people, the standard file dialog can be a bit tedious. And while, at least on OS X, you can actually drag a file to the file upload field, the drop zone is so small that this is not really sufficient. Building drag and drop into a web app is easy, so let’s take a look.

Let’s say, you have a div that has a certain size, so it can act as a drop zone:

<div id="dropzone" style="height: 200px;  border: 2px solid black; line-height: 200px; text-align: center;">Drop files here!</div>

The minimal code to start the same routine as on the file input field is this:

var dropZone = document.getElementById('dropzone');

dropZone.addEventListener('dragover', function(e) {
  e.preventDefault();
});

dropZone.addEventListener('drop', function(e) {
  e.preventDefault();
  var file = e.dataTransfer.files[0];
  var reader = new FileReader();
  reader.onload = function(e) {
    parse(e.target.result);
  };
  reader.readAsArrayBuffer(file);
});

That wasn’t so bad, right? The ondragover handler looks a bit weird, but it is, unfortunately, needed for the drag and drop operation to work.

To make this half usable, we should add an indicator to the drop zone when something is hovering over it. This can be done by using two other events:

dropZone.addEventListener('dragenter', function(e) {
  e.target.style.background = '#ff8';
  e.preventDefault();
});

dropZone.addEventListener('dragleave', function(e) {
  e.target.style.background = null;
  e.preventDefault();
});

(You should, for good measure, also remove the style on a successful drop. And probably you should not use inline styles for all of this crap.)

What we have left out so far

Error handling, mostly. For demonstration purposes, of course. And because I’m lazy. But also validation in any form. The first and easiest thing would be to check the MIME type of the file. The File object we’re accessing in both examples has a type property that will contain the MIME type if it is a common file type like in our JPEG example. But more often, when you’re dealing with binary files in the browser in the first place, the type will simply be an empty string or a generic 'application/octet-stream'. Then, you should try to identify the file with different methods, for example, look for the JFIF tag in JPEG images. But since this is application specific, it won’t make much sense to go deeper here.
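Just to give an idea, a very rough check for the JPEG case could look like this. Both helper names are made up for illustration, and the sketch only looks at the SOI marker at the very start of the file instead of the JFIF tag, which is an assumption about what counts as good enough.

```javascript
// Hypothetical helper: true if the buffer starts with the JPEG SOI marker.
function looksLikeJpeg(buffer) {
  if (buffer.byteLength < 2) {
    return false;
  }
  var bytes = new Uint8Array(buffer, 0, 2);
  return bytes[0] === 0xFF && bytes[1] === 0xD8;
}

// Hypothetical helper: trust the reported MIME type if there is a useful
// one, otherwise fall back to sniffing the first bytes.
function isJpeg(type, buffer) {
  return type === 'image/jpeg' || looksLikeJpeg(buffer);
}
```

In the handlers above, this would be called as isJpeg(file.type, e.target.result) before handing the buffer to parse.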

Also, binary files can, of course, be loaded from a server. As this article is strictly client only, let’s just say that both old trusty XMLHttpRequest and its modern cousin, the Fetch API, allow you to read in a response as either an ArrayBuffer or a Blob.
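For completeness, here is a minimal sketch of the XMLHttpRequest variant. The helper name loadBinary and its callbacks are made up for illustration; responseType and response are the standard properties.

```javascript
// Hypothetical helper: loads a URL and hands the body over as an ArrayBuffer.
function loadBinary(url, onLoad, onError) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  xhr.responseType = 'arraybuffer'; // 'blob' would yield a Blob instead
  xhr.onload = function() {
    if (xhr.status >= 200 && xhr.status < 300) {
      onLoad(xhr.response);
    } else {
      onError(new Error('HTTP status ' + xhr.status));
    }
  };
  xhr.onerror = function() {
    onError(new Error('Network error'));
  };
  xhr.send();
}
```

With the Fetch API, the rough equivalent is calling arrayBuffer() or blob() on the response object, both of which return a promise.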

Performance, Web Workers, UI-Updates

Some of these things can take time. My current project involves parsing roughly 7 MB of binary data and basically touching every byte at least twice. We’re talking a large number of seconds here.

The problem of doing this all in the main thread is that this prevents UI updates. Or put more bluntly: Do that a lot and your users will think the browser crashed.

Luckily, there’s an alternative: Web Workers. These are a bit clunky to work with, but they are generally available and make this a lot easier.

I will, maybe, write up my experiences with web workers in my current project in another article. Here are just two things I would like to mention:

Make sure you use the Transferable feature, which allows you to send objects such as ArrayBuffers from the main thread to workers without copying, which is a lot faster.

Also, if you can, make sure the main thread gets frequent progress updates so that you can maintain a progress bar or so for the user to see that things are actually happening.
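Both points can be sketched in a few lines. The two helper names, the message shape and the byte sum that stands in for the real work are all made up for illustration; only postMessage with a transfer list is the actual API.

```javascript
// Main thread. Hypothetical helper: hands the buffer over to the worker.
// Listing the buffer in the second argument transfers it instead of copying;
// afterwards it is detached (byteLength 0) on the sending side.
function sendToWorker(worker, buffer) {
  worker.postMessage({buffer: buffer}, [buffer]);
}

// Worker side. Hypothetical helper: walks over the bytes in chunks and calls
// reportProgress(fraction) after each chunk; returns a simple byte sum.
function processInChunks(bytes, chunkSize, reportProgress) {
  var sum = 0;
  for (var start = 0; start < bytes.length; start += chunkSize) {
    var end = Math.min(start + chunkSize, bytes.length);
    for (var i = start; i < end; i++) {
      sum += bytes[i];
    }
    reportProgress(end / bytes.length);
  }
  return sum;
}
```

Inside the worker’s message handler, reportProgress would typically be a small function that calls self.postMessage with the fraction, and the main thread would update its progress bar whenever such a message arrives.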

And this is the end of my little article on handling binary data in the browser. I hope you enjoyed it. If you think I got something wrong or I missed something, please let me know in the comments or on Twitter.

Again, a post head found on Flickr. The original is by Christiaan Colen and is licensed under a Creative Commons BY-SA license, which in turn means that the post head image is shared by me under the same license as well.