Huffman Coding


In much of computing, we run up against a very practical problem: the complicated devices we use to store digital information are expensive. We prefer to live in the world of the theoretical as much as possible, but eventually accounting has to have a say in how our programs work.

At some point every developer learns that the hardest thing in computer science has nothing to do with programming, and everything to do with avoiding getting killed by an angry finance department.

So, say we're trying to store a ton of text. We may at first reach for some of the encodings we've used before, such as ASCII, UTF-8*, etc. These are fine, but they may eventually make accounting rather mad at us—they take up way too much space!

Let's just focus on ASCII for now. Every single time we have to store a character, we're taking up a whole 8 bits*! That may not sound like a ton, but with enough text it becomes rather huge.

Wouldn't it be nice if we could make things smaller? We've already seen that if we make every character the same size, we have to use our same 8 bits just to fit all the symbols we need. So, things are going to have to be different sizes.

Thinking about it for a while, we may realize that we use the letter e rather often, but z is rather uncommon—why can't we "steal" some size off of one and give it to the other? Our code for z can be a good bit longer, and as long as that lets us make e pretty short, we'll save space overall!

Letter | Code
e      | 0
z      | 110100101110101
(Maybe not quite this intense)

So, we now have two new problems:

1. If every character's code can be a different length, how do we know where one character starts and ends?

2. Which characters should get the short codes? You may also realize pretty quickly that this is going to be different for different text. An article about pizza is going to have a ton of z's, so we can't universally say that z isn't used often*.

How do we do that?


This is a pretty complex problem! Luckily for us, we have an elegant solution already, given to us in a 1952 paper by David A. Huffman*, written while he was a graduate student at MIT.

We start by taking some block of text we want to encode, such as:

Our first step is to count characters, and see how often each one shows up. We'll then sort by frequency:

Here we see which character occurs the most frequently, and which the least.
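If you want to follow along outside the interactive bits, here's a rough Python sketch of this counting step. (The `text` string here is just a stand-in I made up, not the article's actual sample, so swap in whatever you like.)

```python
from collections import Counter

text = "an article about pizza has a ton of z's"  # stand-in sample text

# Count how often each character shows up, then sort from least to most frequent.
frequencies = Counter(text)
sorted_frequencies = sorted(frequencies.items(), key=lambda pair: pair[1])

for character, count in sorted_frequencies:
    print(repr(character), count)
```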

In fact, let's look at the bottom two:

We'll squish these together and create a structure like this, showing the letter and how often it shows up:

What we're doing here is essentially "combining" these letters into a single place. If we look at the newly created grey node, we'll see that the number in it represents the total, added-up frequencies of the letters "in" it.

Now that we've created this little tree, we're gonna use it! We'll take out those bottom two letters, and then place our tree into the list, making sure to keep it sorted. We say that the top grey number is our frequency for this "character."

We then simply repeat this process over and over again. Take the bottom two things off the list, combine them, add up their frequencies, and put the result back in the list. After we're done, we'll end up with this tree.
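Here's one way you might write that repeat-and-combine loop in Python, building on the counting sketch above. It uses a heap to keep the list sorted, and it represents each grey node as a simple (left, right) pair; that layout is just my choice for the sketch, not something the tree has to be.

```python
import heapq
import itertools
from collections import Counter

text = "an article about pizza has a ton of z's"  # same stand-in sample text

# Each heap entry is (frequency, tie_breaker, node). The tie breaker only exists
# so heapq never has to compare two nodes that happen to share a frequency.
tie_breaker = itertools.count()
heap = [(count, next(tie_breaker), char) for char, count in Counter(text).items()]
heapq.heapify(heap)

# Take the two least frequent "characters" off the list, combine them into a
# grey node whose frequency is their sum, and put that node back in the list.
while len(heap) > 1:
    freq_a, _, node_a = heapq.heappop(heap)
    freq_b, _, node_b = heapq.heappop(heap)
    heapq.heappush(heap, (freq_a + freq_b, next(tie_breaker), (node_a, node_b)))

# The single entry left is the root of the finished tree.
_, _, tree = heap[0]
```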


Wait, so why did we do all that?


It still isn't super clear why we built this tree, but bear with me a little longer. First, we're gonna label all of the branches. Any time we branch left, we'll label it 0. If we branch right, we'll label it 1. So, we get:

Note that if you're on mobile, this is rotated for better visual clarity.

Now, notice that if we "go down" the first 0 branch and ignore everything outside of it, we cut everything on the 1 side away. If we start by going down that left branch, we're guaranteed that we cannot reach anything on the right.

So, we keep following branches and cutting pieces away, until finally we get to a node without any other nodes below it. At this point, we can't reach anything other than the character contained inside that node, and so we stop.

Therefore, with a specific sequence of 0s and 1s, we can reach any letter in our tree. That sounds like an encoding! Let's write out all of these paths:

Char | Path
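If you're playing along with the sketches above, here's how you could write out these paths yourself: walk the tree, adding a 0 every time you branch left and a 1 every time you branch right, and record the path once you hit a letter. (This assumes the tuple-shaped tree from the earlier sketch.)

```python
def build_codes(node, path="", codes=None):
    """Record the 0/1 path taken to reach every character in the tree."""
    if codes is None:
        codes = {}
    if isinstance(node, str):       # a leaf: a single character
        codes[node] = path or "0"   # degenerate case: a one-character alphabet
        return codes
    left, right = node              # a grey node: recurse down both branches
    build_codes(left, path + "0", codes)
    build_codes(right, path + "1", codes)
    return codes

codes = build_codes(tree)
for character, code in sorted(codes.items(), key=lambda pair: len(pair[1])):
    print(repr(character), code)
```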

If you just looked at these codes on their own, it might feel hard to prove that there isn't any overlap. How would we know for sure where one code ends and the next one begins?

But, if you think back to the tree, realize that as we "take a path" by following either a zero or a one, there's no way to get confused about where we're going. Every sequence of branches lands you on exactly one letter!
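Here's a tiny decoder sketch (again leaning on the `tree` and `codes` from the earlier sketches) that makes this concrete: it follows one bit at a time, and the instant it lands on a letter it knows that character has ended, so there's never any ambiguity about where the next one starts.

```python
def encode(text, codes):
    return "".join(codes[character] for character in text)

def decode(bits, tree):
    decoded = []
    node = tree
    for bit in bits:
        # Follow the 0 branch left or the 1 branch right...
        node = node[0] if bit == "0" else node[1]
        # ...and the moment we land on a letter, one character is finished.
        if isinstance(node, str):
            decoded.append(node)
            node = tree  # head back to the top for the next character
    return "".join(decoded)

bits = encode(text, codes)
assert decode(bits, tree) == text
```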


Earlier, I presented two problems. We've solved the first: we know where characters start and end.

Sneakily, though, we also solved the second! Because we kept our list sorted by frequency, things that didn't occur much got shoved to the very bottom of the tree. Ergo, they have the longest paths, while the characters we use most often sit near the top with the shortest ones!


So, how'd we do?


We've done all this, but how much has it actually helped?

Here's how many bits the ASCII-encoded version of our string takes:

And here's how many bits the Huffman coding takes:

That's a pretty major reduction!
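If you'd like to check the arithmetic yourself with the sketches above, it only takes a couple of lines: count 8 bits per character for ASCII, and the length of each character's path for the Huffman version. (This ignores the cost of storing the tree itself, which a real format would also need.)

```python
ascii_bits = 8 * len(text)
huffman_bits = sum(len(codes[character]) for character in text)

print(f"ASCII:   {ascii_bits} bits")
print(f"Huffman: {huffman_bits} bits")
print(f"Saved:   {ascii_bits - huffman_bits} bits")
```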


So, now that we've made accounting a bit happier, we'll live to see another day! While you're here, go ahead and change the text used in this article. Put anything in below, and scroll back up to see the changes! I've also thrown the tree down below again, just for easy access. Play around with it!