Huffman Coding

From WikiCoder

Huffman Coding is a method to generate optimal binary Prefix Codes. A prefix code being basically a binary pattern that uniquely identifies a specific symbol or byte. These binary patterns to encode all possible 256 values in a byte need not take 8-bits per byte -- in fact you have fewer bits for some values in exchange for more bits in others. A prefix code is essentially describing uniquely decodable bit-patterns. Huffman Coding uses a binary tree to represent this.

The problem essentially breaks down to creating a tree where each node/leaf in the tree represents a 1 or 0 in a bit stream. If its 0 you take the left path or if its 1 you take the right path in the tree. When you get to a leaf, that is a "symbol" -- which is commonly a byte from the input, but could be any particular number or sequence of numbers. For example, when encoding purely English, each leaf could be a different word.

The Huffman Algorithm itself for generating the tree is a bottom up tree construction method.

You start with an array of leaves (symbols/bytes) sorted in ascending frequency from Histogram generation. You then pick the lowest frequency of this set and combine those into a node. The node has the frequency of the combined children. You then re-sort the leaves and repeat until you have the root node.

Typically when sorting, you break frequency ties with leaves rather than nodes. This can create a more balanced tree with a shorter tail or maximum code length.

TODO: code examples... I have these just need to put them in here.