Huffman coding refers to a method of generating an optimal binary sequence code, which is often used for lossless data compression seen in computer science and information theory. This was developed by David A. Huffman (hence the name) and published in a 1952 paper titled “A Method for the Construction of Minimum-Redundancy Codes”.
There are multiple variations of the algorithm that take into consideration several other factors (such as unequal letter costs in case of Morse code), but this blog will only be covering the standard one.
The terms ‘weights’ and ‘frequencies’ will be used interchangeably in this blog.
Procedure
The input, or prerequisite to get started is a table containing all the characters and their frequencies [fig 1]. It is expected that each letter will have the same “cost”, i.e. every character will take the same amount of time to transmit.
The expected output, or the goal is to find a set of codewords with minimum expected codeword length, and/or a tree with minimum weighted path from the root.
The first step is to sort the given table in ascending order of frequency [fig 2], and lie down the nodes containing the letters in that order [fig 3].
From left to right, combine two nodes containing the lowest frequencies. The frequency of this new node will be the sum of the individual two nodes.
Repeat this process till there is only one node left. The weight of this node should be equal to the sum of frequencies of all the original nodes.
Once the tree has been constructed, assign bits to each edge of a node. There is no “correct” way to assign bits- 0 and 1 can be assigned to either left or right sides, but keep in mind that consistency has to be maintained. That is, if 0 is assigned to the left edge, the same has to be repeated for every branch.
Starting from the root, trace the edges of this generated tree to each node to find their respective codes.
The size of each character can be found by multiplying the number of digits in the final code by the given frequency of the character.
Pseudo-code for Huffman Coding
create a priority queue Q consisting of each unique character.
sort then in ascending order of their frequencies.
for all the unique characters:
create a newNode
extract minimum value from Q and assign it to leftChild of newNode
extract minimum value from Q and assign it to rightChild of newNode
calculate the sum of these two minimum values and assign it to the value of newNode
insert this newNode into the tree
return rootNode
Limitations
As mentioned earlier Huffman coding is a lossless encoding technique i.e. data can be perfectly retrieved from the compressed data without any loss of information.
However, these techniques often achieve a lower compression ratio when compared to lossy encoding techniques. For this reason, Huffman coding is suitable for encoding text and program files only, and cannot be used for digital images or other media.
Moreover, Huffman coding provides a variable length code- length of all binary codes is different. This makes it difficult for the decoding software to determine if the encoded data is corrupt, leading to incorrect decoding and therefore, a wrong output.