Huffman Coding: Greedy Prefix Codes with a Heap
Huffman coding is a greedy algorithm that builds an optimal prefix-free code — a way to assign short binary codes to frequent symbols and longer codes to rare ones so a message compresses to the fewest possible bits. It powers real compressors (ZIP, JPEG, MP3 all use it) and is the canonical greedy-with-a-heap algorithm. The greedy move: repeatedly merge the two least-frequent symbols.
The problem and the prefix rule
Fixed-length codes waste space: if every character is 8 bits, the letter e costs as much as z, even though e is far more common. Huffman gives common symbols shorter codes. The catch is decodability — we need a prefix-free code, meaning no code is a prefix of another. With that rule, a decoder can read bits left to right and always know exactly where each symbol ends, with no separators.
Prefix-free codes correspond to a binary tree: each leaf is a symbol, and the path from the root (left = 0, right = 1) spells its code. Because symbols only sit at leaves, no code can be a prefix of another.
The greedy idea
To minimize total bits — the sum of frequency × code length over all symbols — the rarest symbols should sit deepest (longest codes). Huffman achieves this bottom-up:
1. Make a leaf node for each symbol, keyed by its frequency.
2. Put all nodes in a min-heap (priority queue) ordered by frequency.
3. While more than one node remains:
a. Pop the two smallest-frequency nodes.
b. Make a new internal node whose frequency is their sum.
c. Push it back on the heap.
4. The last remaining node is the root of the optimal code tree.
A min-heap makes “fetch the two smallest” an O(log n) operation, so the whole build is O(n log n).
Computing the optimal cost
The cleanest thing to verify is the total number of encoded bits (the weighted path length). Here we build the tree with a heap and sum the cost as we merge — a neat fact is that the total bits equals the sum of every internal node’s frequency:
#include <queue>
#include <vector>
// Returns the total bits of the optimal Huffman encoding given symbol frequencies.
long long huffmanCost(const std::vector<int>& freqs) {
if (freqs.size() <= 1) return 0; // a single symbol needs no bits to distinguish
std::priority_queue<long long, std::vector<long long>, std::greater<>> pq(
freqs.begin(), freqs.end());
long long total = 0;
while (pq.size() > 1) {
long long a = pq.top(); pq.pop();
long long b = pq.top(); pq.pop();
total += a + b; // each merge contributes one bit to every symbol below
pq.push(a + b);
}
return total;
}
import java.util.PriorityQueue;
// Returns the total bits of the optimal Huffman encoding given symbol frequencies.
long huffmanCost(int[] freqs) {
if (freqs.length <= 1) return 0; // single symbol needs no bits
PriorityQueue<Long> pq = new PriorityQueue<>();
for (int f : freqs) pq.add((long) f);
long total = 0;
while (pq.size() > 1) {
long a = pq.poll();
long b = pq.poll();
total += a + b; // each merge adds one bit to every symbol below
pq.add(a + b);
}
return total;
}
// Returns the total bits of the optimal Huffman encoding given symbol frequencies.
// Simple array-backed min-extraction (fine for teaching; use a real heap at scale).
function huffmanCost(freqs) {
if (freqs.length <= 1) return 0; // single symbol needs no bits
const heap = [...freqs];
const popMin = () => {
let m = 0;
for (let i = 1; i < heap.length; i++) if (heap[i] < heap[m]) m = i;
return heap.splice(m, 1)[0];
};
let total = 0;
while (heap.length > 1) {
const a = popMin(), b = popMin();
total += a + b; // each merge adds one bit to every symbol below
heap.push(a + b);
}
return total;
}
import heapq
# Returns the total bits of the optimal Huffman encoding given symbol frequencies.
def huffman_cost(freqs):
if len(freqs) <= 1:
return 0 # single symbol needs no bits
heap = list(freqs)
heapq.heapify(heap)
total = 0
while len(heap) > 1:
a = heapq.heappop(heap)
b = heapq.heappop(heap)
total += a + b # each merge adds one bit to every symbol below
heapq.heappush(heap, a + b)
return total
For beginners: Why does summing every internal node’s frequency give the total bits? Each symbol’s code length equals how many merges happen above it in the tree. Every merge of total weight
wtherefore adds exactly one bit to each of thewsymbols underneath it — so adding up the merge weights counts every bit exactly once. For frequencies{5, 9, 12, 13, 16, 45}the answer is224bits.
Building the actual codes
To emit the codes themselves, keep real tree nodes in the heap and walk the tree afterward, appending 0 for left and 1 for right:
#include <queue>
#include <unordered_map>
#include <string>
#include <memory>
struct Node {
int freq; char ch;
std::shared_ptr<Node> left, right;
};
using PN = std::shared_ptr<Node>;
void walk(PN n, std::string code, std::unordered_map<char,std::string>& out) {
if (!n->left && !n->right) { out[n->ch] = code.empty() ? "0" : code; return; }
walk(n->left, code + "0", out);
walk(n->right, code + "1", out);
}
std::unordered_map<char,std::string> huffmanCodes(
const std::vector<std::pair<char,int>>& syms) {
auto cmp = [](PN a, PN b){ return a->freq > b->freq; };
std::priority_queue<PN, std::vector<PN>, decltype(cmp)> pq(cmp);
for (auto& [c, f] : syms) pq.push(PN(new Node{f, c, nullptr, nullptr}));
while (pq.size() > 1) {
PN a = pq.top(); pq.pop();
PN b = pq.top(); pq.pop();
pq.push(PN(new Node{a->freq + b->freq, '\0', a, b}));
}
std::unordered_map<char,std::string> out;
if (!pq.empty()) walk(pq.top(), "", out);
return out;
}
import java.util.*;
class Node {
int freq; char ch; Node left, right;
Node(int f, char c) { freq = f; ch = c; }
Node(int f, Node l, Node r) { freq = f; left = l; right = r; }
}
void walk(Node n, String code, Map<Character,String> out) {
if (n.left == null && n.right == null) {
out.put(n.ch, code.isEmpty() ? "0" : code);
return;
}
walk(n.left, code + "0", out);
walk(n.right, code + "1", out);
}
Map<Character,String> huffmanCodes(char[] chars, int[] freqs) {
PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> a.freq - b.freq);
for (int i = 0; i < chars.length; i++) pq.add(new Node(freqs[i], chars[i]));
while (pq.size() > 1) {
Node a = pq.poll(), b = pq.poll();
pq.add(new Node(a.freq + b.freq, a, b));
}
Map<Character,String> out = new HashMap<>();
if (!pq.isEmpty()) walk(pq.peek(), "", out);
return out;
}
// Teaching version with array-backed min-extraction.
function huffmanCodes(syms) { // syms: [{ch, freq}, ...]
const heap = syms.map(s => ({ freq: s.freq, ch: s.ch, left: null, right: null }));
const popMin = () => {
let m = 0;
for (let i = 1; i < heap.length; i++) if (heap[i].freq < heap[m].freq) m = i;
return heap.splice(m, 1)[0];
};
while (heap.length > 1) {
const a = popMin(), b = popMin();
heap.push({ freq: a.freq + b.freq, ch: null, left: a, right: b });
}
const out = {};
const walk = (n, code) => {
if (!n.left && !n.right) { out[n.ch] = code || "0"; return; }
walk(n.left, code + "0");
walk(n.right, code + "1");
};
if (heap.length) walk(heap[0], "");
return out;
}
import heapq
import itertools
def huffman_codes(syms): # syms: list of (char, freq)
counter = itertools.count() # tiebreaker so nodes never compare by payload
heap = [(f, next(counter), {"ch": c, "left": None, "right": None})
for c, f in syms]
heapq.heapify(heap)
while len(heap) > 1:
fa, _, a = heapq.heappop(heap)
fb, _, b = heapq.heappop(heap)
node = {"ch": None, "left": a, "right": b}
heapq.heappush(heap, (fa + fb, next(counter), node))
out = {}
def walk(n, code):
if n["left"] is None and n["right"] is None:
out[n["ch"]] = code or "0"
return
walk(n["left"], code + "0")
walk(n["right"], code + "1")
if heap:
walk(heap[0][2], "")
return out
Why greedy is correct here
Huffman’s greedy choice — merge the two least-frequent symbols — is safe by an exchange argument. In any optimal tree, the two deepest leaves can be assumed to be the two lowest-frequency symbols (if not, swapping a more-frequent shallow leaf with a less-frequent deep one would lower the total cost — a contradiction). Those two deepest leaves are siblings, so making them siblings up front loses nothing. Replacing them with a merged node leaves a smaller identical problem, so induction finishes the proof.
Complexity
- Time: O(n log n) for
ndistinct symbols —n − 1merges, each two pops and a push at O(log n). - Space: O(n) for the heap and the tree.
Pitfall: When two nodes tie on frequency, the exact tree can differ, but the total cost stays optimal. In languages where the heap compares whole objects (like Python tuples), include a unique tiebreaker so it never tries to compare the node payloads — the code above does this with a counter.
Practice greedy end to end on the greedy exercises, or compare against dynamic programming for the problems where greedy fails.