Huffman coding

In computer science, Huffman coding is an entropy encoding algorithm used for data compression.
It was developed by David A. Huffman and published in 1952 in the paper "A Method for the Construction of Minimum-Redundancy Codes".

The basic idea is borrowed from an older and slightly less efficient method called Shannon-Fano coding.

The text to be compressed is considered as a string of symbols. Symbols that are likely to be frequent are represented by short sequences of bits, and symbols that are likely to be rare are represented by longer sequences of bits.

Huffman coding uses a specific method for choosing the representation of each symbol, resulting in a prefix-free code (i.e. no bit string of any symbol is a prefix of the bit string of any other symbol). It has been proven that Huffman coding is the most effective compression method of this type, in which each symbol is assigned its own fixed string of bits. That is, no other mapping of individual source symbols to strings of bits will produce a smaller output when the actual symbol frequencies agree with those used to create the code. For a set of symbols with a uniform probability distribution and a number of members that is a power of two, Huffman coding is equivalent to simple binary block encoding.

Huffman coding reaches the entropy of the source exactly when the probability of each input symbol is a negative power of two (1/2, 1/4, 1/8, ...). In other cases arithmetic coding produces slight gains over Huffman coding, but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties (as of November 2001, IBM owns patents on the core concepts of arithmetic coding in several jurisdictions).
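The gap between a symbol-by-symbol code and the entropy of the source can be checked with a little arithmetic. The following Python sketch is only an illustration: the two probability tables and the names entropy and average_length are made up for this example, and the code lengths are written in by hand (they are the lengths a Huffman code has for these tables) rather than computed.

```python
import math

def entropy(probs):
    """Shannon entropy of a {symbol: probability} table, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def average_length(probs, lengths):
    """Expected number of code bits per symbol for the given code lengths."""
    return sum(probs[s] * lengths[s] for s in probs)

# Every probability is a negative power of two.
dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
dyadic_lengths = {"a": 1, "b": 2, "c": 3, "d": 3}  # e.g. 0, 10, 110, 111

# A skewed two-symbol source: each symbol still needs at least one whole bit.
skewed = {"a": 0.9, "b": 0.1}
skewed_lengths = {"a": 1, "b": 1}

print(entropy(dyadic), average_length(dyadic, dyadic_lengths))
print(entropy(skewed), average_length(skewed, skewed_lengths))
```

For the first table the two numbers agree at 1.75 bits per symbol. For the second the code spends 1 bit per symbol while the entropy is roughly 0.47 bits, and this difference is the room that arithmetic coding can exploit.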

Huffman coding works by building a binary tree of symbols:

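Start with as many trees as there are symbols, each consisting of a single leaf weighted by the probability of its symbol.
While there is more than one tree:
1. Find the two trees with the smallest total probability.
2. Combine them into one tree by making one the left child and the other the right child of a new root, whose probability is the sum of theirs.
The single remaining tree contains all the symbols. The code for a symbol is read off the path from the root to its leaf: a '0' represents following the left child; a '1' represents following the right child.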
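
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names huffman_code and decode, the tuple layout of the trees and the use of a heap to find the two smallest trees are all choices made for this example. Raw symbol counts are used in place of probabilities, which gives the same tree.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(weights):
    """Return {symbol: bit string} for a {symbol: weight} table."""
    if not weights:
        return {}
    tiebreak = count()  # unique number so the heap never compares trees
    # A tree is ("leaf", symbol) or ("node", left, right).
    heap = [(w, next(tiebreak), ("leaf", s)) for s, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two trees with the smallest total weight and combine them.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), ("node", left, right)))
    root = heap[0][2]
    codes = {}
    stack = [(root, "")]
    while stack:
        tree, prefix = stack.pop()
        if tree[0] == "leaf":
            # A one-symbol alphabet would get an empty code; give it one bit.
            codes[tree[1]] = prefix or "0"
        else:
            stack.append((tree[1], prefix + "0"))  # '0' follows the left child
            stack.append((tree[2], prefix + "1"))  # '1' follows the right child
    return codes

def decode(bits, codes):
    """Decode a bit string; unambiguous because the code is prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

text = "abracadabra"
codes = huffman_code(Counter(text))
encoded = "".join(codes[ch] for ch in text)
print(codes)
print(len(encoded), decode(encoded, codes) == text)
```

The exact bit strings depend on how ties between equal weights are broken, but the total encoded length does not: for "abracadabra" it is 23 bits, against 88 bits for the text at 8 bits per character.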

There are variations. The frequencies used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. (The latter variation requires that a frequency table or other hint as to the encoding be stored with the compressed text; implementations employ various tricks to store these tables efficiently.) A variation called "adaptive Huffman coding" calculates the frequencies dynamically based on recent actual frequencies in the source string. This is somewhat related to the LZ family of algorithms.

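The extreme cases of Huffman codes, those with the longest possible codewords for a given number of symbols, are connected with Fibonacci numbers: weights that follow the Fibonacci sequence produce a maximally unbalanced tree. For example, see http://mathforum.org/discuss/sci.math/t/207334.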
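
The connection can be observed by computing code lengths for Fibonacci weights. The Python sketch below is an illustration only; the function name code_lengths and the choice to track groups of symbol indices instead of building an explicit tree are assumptions made for this example.

```python
import heapq

def code_lengths(weights):
    """Huffman code length for each weight, in input order."""
    lengths = [0] * len(weights)
    # Heap entries: (total weight, unique id, indices of the symbols inside).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    next_id = len(weights)
    while len(heap) > 1:
        w1, _, m1 = heapq.heappop(heap)
        w2, _, m2 = heapq.heappop(heap)
        # Every symbol under the new root moves one level deeper.
        for i in m1 + m2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next_id, m1 + m2))
        next_id += 1
    return lengths

print(code_lengths([1, 1, 2, 3, 5, 8, 13, 21]))  # → [7, 7, 6, 5, 4, 3, 2, 1]
print(code_lengths([1] * 8))                     # → [3, 3, 3, 3, 3, 3, 3, 3]
```

With eight Fibonacci weights the tree degenerates into a chain and the longest codeword has seven bits, the maximum possible for eight symbols. With eight equal weights every codeword has three bits.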

Huffman coding today is often used as a "back-end" to some other compression method. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model (and, in the lossy codecs, quantization) followed by Huffman coding.

The n-ary Huffman algorithm uses the alphabet {0, 1, ..., n-1} to encode the message. The tree it builds is an n-ary one.
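
A minimal Python sketch of the n-ary construction follows. It merges the n lightest trees at each step instead of two. The function name nary_code_lengths is made up for this example, and the padding with zero-weight dummy symbols is the usual way to make every merge take exactly n trees, which is not spelled out above.

```python
import heapq

def nary_code_lengths(weights, n):
    """Code length, in n-ary digits, for each weight in input order."""
    if n < 2:
        raise ValueError("n must be at least 2")
    lengths = [0] * len(weights)
    # Heap entries: (total weight, unique id, indices of the symbols inside).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    next_id = len(weights)
    # Pad with zero-weight dummies until (count - 1) is a multiple of (n - 1),
    # so that the final merge also combines exactly n trees.
    while len(heap) > 1 and (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next_id, []))
        next_id += 1
    heapq.heapify(heap)
    while len(heap) > 1:
        total, members = 0, []
        for _ in range(n):
            w, _id, m = heapq.heappop(heap)
            total += w
            members = members + m
        # Every real symbol under the new root moves one level deeper.
        for i in members:
            lengths[i] += 1
        heapq.heappush(heap, (total, next_id, members))
        next_id += 1
    return lengths

print(nary_code_lengths([1] * 9, 3))         # → [2, 2, 2, 2, 2, 2, 2, 2, 2]
print(nary_code_lengths([5, 2, 2, 1, 1], 3))
print(nary_code_lengths([5, 2, 2, 1, 1], 2))
```

With n = 2 no padding is ever needed and the function reduces to the binary case.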

The Huffman template algorithm allows the use of non-numerical weights (costs, frequencies). For example, see http://alexvn.freeservers.com/s1/huffman_template_algorithm.html