Huffman Coding
Table of contents:
1. Introduction
2. Algorithm & Implementation
3. Time Complexity
4. Conclusion
1. Introduction
The Huffman Coding algorithm is a famous greedy algorithm, developed by David A. Huffman in 1951. It produces an optimal prefix code and is used for lossless data compression.
In today's rapidly growing digital world, an enormous amount of data is being sent and received, and people keep developing methods and algorithms to reduce storage space and transmission time. Huffman Coding is one such method: it reduces the size of a message while encoding it, thereby decreasing both space and transmission time.
Huffman Coding always follows the prefix rule, which states that no code should be the prefix of another code. For example, {0, 10, 110} is a valid prefix code, while {0, 01} is not, because 0 is a prefix of 01.
2. Algorithm & Implementation
Let’s look at an example to see how different techniques of data encoding work:
Assume we have a message made up of seven distinct characters, together with the number of times each character appears in the message (its frequency): the frequencies are 10, 15, 12, 13, 3, 4, and 1, for a total of 58 characters.
Fixed-Size Coding
1. ASCII coding method
We will use ASCII values in this method. In ASCII, every character is represented using an 8-bit code, and here we have a total of 58 characters. So, when encoding with this method, the total size of the message will be 58*8 = 464 bits.
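As a quick sanity check, the same arithmetic in Python:

```python
# Fixed-size ASCII encoding: every character costs 8 bits.
total_chars = 58
print(total_chars * 8)  # 464 bits
```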
2. n-bit binary representation
But since we know we don't use all of the symbols and characters on the keyboard, can we assign our own codes to the characters?
Yes, we can do it!
Here we have seven different characters, so we can represent them with 3 bits each: 2^3 = 8, which means 3 bits can encode 8 different combinations, and we only need seven.
Therefore we can assign each of the seven characters its own 3-bit code (for example, 000 through 110).
Now, let’s calculate the size of the message:
We have a total of 58 characters and each of these characters is represented using a 3-bit code, so the total message size will be 58*3=174 bits.
However, we also need to send the corresponding code table. This is because, at the time of decoding, we need to know which code stands for which character, since these are our own codes. So now let's calculate the table size.
All the characters will be sent as their 8-bit ASCII values, and we have 7 different characters, so the calculation is (7*8) + (7*3), because each of the seven characters also carries its 3-bit code. The size of the table therefore becomes 56+21 = 77 bits.
So, the total size of the message becomes: message size + table size = 174+77 = 251 bits.
With this, we can conclude that the n-bit binary representation reduces the message size significantly, from 464 bits to 251 bits.
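Here is the same fixed-size arithmetic as a small Python sketch, using the example's frequencies:

```python
import math

frequencies = [10, 15, 12, 13, 3, 4, 1]   # the example's character counts
total_chars = sum(frequencies)             # 58

# Smallest fixed code length that distinguishes 7 symbols: ceil(log2(7)) = 3 bits.
bits_per_char = math.ceil(math.log2(len(frequencies)))

message_size = total_chars * bits_per_char                             # 58*3 = 174
table_size = len(frequencies) * 8 + len(frequencies) * bits_per_char   # 56+21 = 77
print(message_size + table_size)                                       # 251 bits
```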
But with Huffman coding, we can decrease this size even further.
Huffman Coding (Variable-Size Coding)
So far we have looked only at fixed-size coding. But Huffman's idea is that we can also give different characters variable-size codes, and this helps us reduce the size of the message even more.
Let's take the same example as above, so it will be easy to compare both results.
So, again we have 7 different characters along with their frequencies. Now, the very first thing we must do is draw the Huffman tree.
To draw the Huffman tree, we have to arrange the characters in ascending order of their count or frequency, giving 1, 3, 4, 10, 12, 13, 15.
Now we have to choose the two characters with the minimum counts, i.e., the first two, and sum their frequencies (1+3 = 4). Then we remove (pop) those two frequencies (1 and 3) and push the new sum, 4, so the remaining frequencies are 4, 4, 10, 12, 13, and 15.
We keep repeating these steps until the tree is complete. The minimum frequencies are now 4 and 4, and their sum is 8, leaving 8, 10, 12, 13, and 15.
Next, the minimum frequencies are 8 and 10, so the new sum is 18, leaving 12, 13, 15, and 18.
As you can see, the frequencies are no longer sorted, so we have to sort them again. After sorting, we again pick the two minimum frequencies and sum them: 12+13 = 25, leaving 15, 18, and 25.
We sort the new frequencies once more and pick the two smallest: 15+18 = 33, leaving 25 and 33.
After sorting one last time, we complete the tree with the two remaining frequencies: 25+33 = 58.
The resulting structure is the Huffman tree, also known as the optimal merge pattern tree.
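If you want to reproduce the merge order above, here is a minimal sketch using Python's heapq; only the frequencies matter for the merge order, so the characters themselves are omitted:

```python
import heapq

heap = [1, 3, 4, 10, 12, 13, 15]  # the example's frequencies
heapq.heapify(heap)

# Repeatedly merge the two smallest frequencies, exactly as in the walkthrough.
while len(heap) > 1:
    a = heapq.heappop(heap)
    b = heapq.heappop(heap)
    print(f"{a} + {b} = {a + b}")
    heapq.heappush(heap, a + b)
# Prints the six merges: 1+3=4, 4+4=8, 8+10=18, 12+13=25, 15+18=33, 25+33=58
```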
Now we will write the codes, and for that we first need to assign a weight to each edge. Let's assign '0' to the left edges and '1' to the right edges.
(Note: we can also assign them the other way around, but always remember to use the same convention while decoding that you chose at the time of encoding.)
After assigning weights to all of the edges, we can read off the code for each character by traversing the tree from the root node down to that character's leaf node.
One thing to notice about the resulting codes is that characters appearing more frequently have fewer bits in their code, whereas characters appearing less frequently have more bits: the frequencies 15, 13, and 12 get 2-bit codes, 10 gets a 3-bit code, 4 gets a 4-bit code, and 3 and 1 get 5-bit codes. This is exactly what reduces the total size of the message.
Now, the total size of the message is calculated as ∑ (frequency_i * code length_i), which gives: (10*3) + (15*2) + (12*2) + (3*5) + (4*4) + (13*2) + (1*5) = 30 + 30 + 24 + 15 + 16 + 26 + 5 = 146 bits.
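As a quick check, the same weighted sum in Python, with the (frequency, code length) pairs taken from the calculation above:

```python
# (frequency, code length) pairs from the worked example.
pairs = [(10, 3), (15, 2), (12, 2), (3, 5), (4, 4), (13, 2), (1, 5)]
print(sum(freq * length for freq, length in pairs))  # 146 bits
```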
Now, just like the earlier example of n-bit binary representation, we have to send the tree or table so that the message can be decoded.
So, let’s calculate the table size.
Here also, the seven different characters are sent as their 8-bit ASCII values (7*8 = 56 bits). The total number of bits across the codes of all characters is 3+2+2+5+4+2+5 = 23.
Therefore, the size of the table is 56+23=79 bits.
Now, the total size of the message along with the table is 146+79 = 225 bits, which is much less than the 464 bits of the ASCII method and the 251 bits of the fixed 3-bit codes.
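Putting it all together, here is one possible Python implementation of Huffman coding using a priority queue. It is a sketch under the example's assumptions rather than a definitive implementation, and the character names c1 through c7 are placeholders, since only the frequencies matter:

```python
import heapq
from collections import namedtuple

Node = namedtuple("Node", ["char", "left", "right"])

def huffman_codes(freq_table):
    """Build a Huffman tree with a priority queue and return {char: code}."""
    # Heap entries are (frequency, tiebreaker, node); the unique tiebreaker
    # keeps equal frequencies from ever comparing the Node objects.
    heap = [(freq, i, Node(char, None, None))
            for i, (char, freq) in enumerate(freq_table.items())]
    heapq.heapify(heap)
    next_id = len(heap)

    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least-frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, Node(None, left, right)))
        next_id += 1

    codes = {}
    def walk(node, code):
        if node.char is not None:            # leaf: record its code
            codes[node.char] = code or "0"   # "0" covers a one-character message
            return
        walk(node.left, code + "0")          # '0' on the left edges
        walk(node.right, code + "1")         # '1' on the right edges
    walk(heap[0][2], "")
    return codes

# Placeholder names c1..c7 for the example's seven frequencies.
freqs = {"c1": 10, "c2": 15, "c3": 12, "c4": 13, "c5": 3, "c6": 4, "c7": 1}
codes = huffman_codes(freqs)
print(codes)
print(sum(freqs[c] * len(codes[c]) for c in freqs))  # 146 bits
```

Running this prints a code table whose lengths match the pattern above (2 bits for the most frequent characters, up to 5 bits for the rarest) and a total message size of 146 bits.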
3. Time Complexity
In the implementation above, Huffman coding is built using a priority queue. It can also be implemented with other data structures such as arrays or linked lists, but heaps are generally preferred because they keep the time complexity low: push() and pop() on a heap each cost O(log(n)).
If we have n characters, the algorithm performs n-1 merges, each using a constant number of heap operations, so the total time complexity is O(n*log(n)).
4. Conclusion
I hope you found the blog interesting and informative.
I’d also like to express my gratitude to Bennett University for inspiring me to write this blog and expand my knowledge on the subject.
Thank you!