Encoding Using Greedy Algorithm: Huffman Coding

Harine
4 min readApr 10, 2022

--

Huffman coding refers to a method of generating an optimal binary sequence code, which is often used for lossless data compression seen in computer science and information theory. This was developed by David A. Huffman (hence the name) and published in a 1952 paper titled “A Method for the Construction of Minimum-Redundancy Codes”.

There are multiple variations of the algorithm that take into consideration several other factors (such as unequal letter costs in case of Morse code), but this blog will only be covering the standard one.

The terms ‘weights’ and ‘frequencies’ will be used interchangeably in this blog.

Procedure

The input, or prerequisite to get started is a table containing all the characters and their frequencies [fig 1]. It is expected that each letter will have the same “cost”, i.e. every character will take the same amount of time to transmit.

1: Given frequency table

The expected output, or the goal is to find a set of codewords with minimum expected codeword length, and/or a tree with minimum weighted path from the root.

The first step is to sort the given table in ascending order of frequency [fig 2], and lie down the nodes containing the letters in that order [fig 3].

2: Table after sorting
3: step 1

From left to right, combine two nodes containing the lowest frequencies. The frequency of this new node will be the sum of the individual two nodes.

step 2

Repeat this process till there is only one node left. The weight of this node should be equal to the sum of frequencies of all the original nodes.

step-by-step illustration of the construction of Huffman tree

Once the tree has been constructed, assign bits to each edge of a node. There is no “correct” way to assign bits- 0 and 1 can be assigned to either left or right sides, but keep in mind that consistency has to be maintained. That is, if 0 is assigned to the left edge, the same has to be repeated for every branch.

Huffman tree after assigning bits to edges

Starting from the root, trace the edges of this generated tree to each node to find their respective codes.

Code generated by following the edges

The size of each character can be found by multiplying the number of digits in the final code by the given frequency of the character.

Finding the sizes of each character after encoding

Pseudo-code for Huffman Coding

create a priority queue Q consisting of each unique character.
sort then in ascending order of their frequencies.
for all the unique characters:
create a newNode
extract minimum value from Q and assign it to leftChild of newNode
extract minimum value from Q and assign it to rightChild of newNode
calculate the sum of these two minimum values and assign it to the value of newNode
insert this newNode into the tree
return rootNode

Limitations

As mentioned earlier Huffman coding is a lossless encoding technique i.e. data can be perfectly retrieved from the compressed data without any loss of information.

However, these techniques often achieve a lower compression ratio when compared to lossy encoding techniques. For this reason, Huffman coding is suitable for encoding text and program files only, and cannot be used for digital images or other media.

Moreover, Huffman coding provides a variable length code- length of all binary codes is different. This makes it difficult for the decoding software to determine if the encoded data is corrupt, leading to incorrect decoding and therefore, a wrong output.

--

--

Harine
Harine

Written by Harine

Programmer, artist, dreamer

No responses yet