Chapter 55: Huffman Coding

Huffman Coding – Let’s learn this topic properly, slowly and clearly — like we are sitting together and I’m teaching you from scratch with a notebook and pencil, drawing trees, writing bits, and counting everything so you really understand how it works and why it’s so clever.

Huffman Coding is one of the most beautiful and most practical compression algorithms ever invented. It is used everywhere — in ZIP files, JPEG images, MP3 audio, PNG images, PDFs, network protocols, and many more places.

1. What problem does Huffman Coding solve?

Most text files and data contain some characters that appear very frequently and others that appear very rarely.

Examples:

  • In English text: space, e, t, a, o, i, n appear very often, while z, q, x, j appear very rarely
  • In source code: ; { } ( ) appear very often, while rare symbols like @ # $ appear rarely

Normal encoding (like ASCII) uses fixed 8 bits for every character — even for very rare ones.

Huffman Coding solves this by giving:

  • Short codes (few bits) to frequent characters
  • Long codes (more bits) to rare characters

The result: total number of bits needed is much smaller than fixed-length encoding.

2. Core Idea – Variable-length prefix codes

Huffman Coding creates a variable-length code where:

  • More frequent characters get shorter bit sequences
  • Less frequent characters get longer bit sequences
  • No code is a prefix of any other code (very important!)

This last property is called the prefix-free property (such codes are called prefix codes) — it means you can decode the bit stream unambiguously, without any markers between codes.

Example (very simple):

Character   Frequency   Huffman code
space       very high   0
e           high        10
t           high        110
z           very low    1111

You can see: frequent letters get short codes, rare letters get long codes.
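To see why the prefix-free property matters, here is a tiny sketch in Python of a greedy decoder using the toy table above (the table values and bit string are purely illustrative). Because no code is a prefix of another, the moment the bits read so far match a code, that match must be the right one.

```python
# Toy prefix-free table from the example above (illustrative values only)
table = {"0": " ", "10": "e", "110": "t", "1111": "z"}

bits = "1001101111"  # "e" + " " + "t" + "z" packed together, with no separators
decoded, buffer = [], ""
for bit in bits:
    buffer += bit
    # Because no code is a prefix of another, the first time the buffer
    # matches a code we can safely emit that character and start over.
    if buffer in table:
        decoded.append(table[buffer])
        buffer = ""
print("".join(decoded))  # prints "e tz"
```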

3. How Huffman Coding builds the codes (the algorithm)

The algorithm builds a binary tree (Huffman tree) using a greedy strategy.

Steps:

  1. Count the frequency of each character in the text
  2. Create a min-heap (priority queue) where each node contains:
    • a character (or combined characters)
    • its frequency
  3. While there is more than one node in the heap:
    • Remove the two nodes with the smallest frequencies
    • Create a new parent node with frequency = sum of the two children
    • Make the two nodes its left and right children (convention: left = 0, right = 1)
    • Put the new parent node back into the heap
  4. When only one node remains → that is the root of the Huffman tree
  5. Traverse the tree from root to each leaf:
    • Every left branch = 0
    • Every right branch = 1
    • The path from root to leaf = Huffman code for that character
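If you want to see these five steps as code, here is a minimal sketch in Python using the standard heapq module and collections.Counter. The function name huffman_codes and the layout of the heap entries are my own choices for illustration, not a fixed API.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a {character: bit string} table for `text` (illustrative helper)."""
    freq = Counter(text)                                  # Step 1: count frequencies
    # Step 2: heap entries are (frequency, tie-breaker, tree); a tree is either
    # a single character (leaf) or a (left, right) pair of subtrees.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    # Step 3: repeatedly merge the two smallest nodes
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    # Steps 4-5: the last node is the root; walk to every leaf,
    # appending 0 for a left branch and 1 for a right branch.
    codes = {}
    def walk(node, path):
        if isinstance(node, str):                         # leaf holds a single character
            codes[node] = path or "0"                     # edge case: only one unique symbol
        else:
            walk(node[0], path + "0")
            walk(node[1], path + "1")
    walk(heap[0][2], "")
    return codes
```

The unique integer tie-breaker in each heap entry keeps heapq from ever comparing two subtrees when frequencies are equal; which of two equal-frequency nodes comes out first only affects the 0/1 labels, not the code lengths.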

4. Complete Worked Example (very detailed)

Text to compress: “abacabad”

Step 1: Count frequencies

Character   Count
a           4
b           2
c           1
d           1

Step 2: Create the initial min-heap (a priority queue keyed by frequency)

Nodes:

  • c : 1
  • d : 1
  • b : 2
  • a : 4

Step 3: Build the tree

  1. Take the two smallest: c (1) and d (1). Create a parent node with frequency = 1 + 1 = 2. Tree so far:

           (2)
          /   \
        c:1   d:1

     Put this node (2) back into the heap.

     Heap now: 2 (c-d), b (2), a (4)

  2. Take the two smallest: c-d (2) and b (2). Create a parent node with frequency = 2 + 2 = 4. Tree now:

             (4)
            /   \
          (2)   b:2
          /  \
        c:1  d:1

     Put node 4 back.

     Heap now: 4 (c-d-b), a (4)

  3. Take the two remaining nodes: 4 (c-d-b) and a (4). Create the root node with frequency = 4 + 4 = 8. Final tree:

               (8)
              /   \
            (4)   a:4
           /   \
         (2)   b:2
         /  \
       c:1  d:1

Step 4: Assign codes (left = 0, right = 1)

  • From root to a: right → 1
  • From root to b: left → right → 01
  • From root to c: left → left → left → 000
  • From root to d: left → left → right → 001

Final Huffman codes:

Character   Code
a           1
b           01
c           000
d           001

Step 5: Encode the message “abacabad”

a b a c a b a d → 1 01 1 000 1 01 1 001

Bit string: 10110001011001

Total bits = 1+2+1+3+1+2+1+3 = 14 bits

Normal ASCII (8 bits per char × 8 chars) = 64 bits

Savings: from 64 bits → 14 bits ≈ 78% compression
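Putting the pieces together, here is a short usage sketch that assumes the huffman_codes helper sketched in section 3. Note that the exact 0/1 labels can differ from the hand-worked codes above, depending on how ties are broken while building the tree, but the code lengths (and therefore the 14-bit total) come out the same.

```python
text = "abacabad"
codes = huffman_codes(text)                  # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
encoded = "".join(codes[ch] for ch in text)

print(codes)
print(encoded, "->", len(encoded), "bits")            # 14 bits
print("fixed 8-bit encoding ->", 8 * len(text), "bits")  # 64 bits
```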

5. Time Complexity of Huffman Coding

Step                              Time Complexity   Explanation
Count frequencies                 O(n)              Scan the string once
Build initial min-heap            O(σ)              σ = number of unique symbols
Build Huffman tree (σ-1 merges)   O(σ log σ)        Each merge uses heap operations costing ≈ log σ
Generate codes (DFS traversal)    O(σ)              Visit each node once
Encode the message                O(n)              Replace each character with its code

Total time complexity: O(n + σ log σ), where σ is the number of distinct characters (usually very small, e.g. at most 256 for byte data)

In practice → almost linear in the length of the text

6. Summary – What you should remember & say in interviews

  • Huffman Coding is a greedy compression algorithm that produces an optimal prefix code: variable-length codes based on character frequencies, with shorter codes for more frequent characters.
  • It builds a binary tree by repeatedly combining the two nodes with the smallest frequencies.
  • The path from the root to a leaf gives the Huffman code for that character (left = 0, right = 1).
  • Time complexity is O(n + σ log σ), where n is the text length and σ is the number of unique symbols.
  • It is optimal among all prefix codes for a given set of frequencies.
  • It is very widely used: ZIP, JPEG, MP3, PNG, PDF, network protocols, and more.

Where can you go from here?

Very common next topics:

  • How to decode a Huffman-encoded bit stream
  • Why Huffman is optimal (proof intuition)
  • Huffman vs Shannon-Fano coding
  • Adaptive Huffman coding (when frequencies change)
  • Building Huffman tree with priority queue (step-by-step code dry run)
  • Huffman coding for real compression (handling EOF, tree transmission)

