Variable-length code
Variable-length code or variable-length encoding is encoding that maps source symbols to variable number of bits. One does not need to be a genius, to figure out, variable-lenght encoding is a milestone of data compression.
The basic idea behind many lossless compression algorithms is simple, the more frequent the symbol is, the shorter code it gets.
From compression point of view, there are two interesting groups of encodings:
- uniquely decodable codes
- prefix codes
Uniquely decodable code
Intuitively uniquely decodable encoding is such, that if you stick any two code words, they can be unambigously decodes.
Following table presents uniquely decodable encoding (that is not a prefix code).
Symbol | Encoding |
---|---|
A | 0 |
B | 01 |
C | 011 |
The following encoding is not uniquely decodable
Symbol | Encoding |
---|---|
A | 0 |
B | 01 |
C | 10 |
Encoded bit string 010, can be encoded either as
- 0 10 = AC
or
- 01 0 = BA
Prefix code
Prefix code are especially valuable, as they make decoding a breeze.
Prefix encoding are context-free, that means that the prefix of a code word defines how many more bits need to be read.
Consider following example:
Symbol | Encoding |
---|---|
C | 0 |
A | 10 |
G | 110 |
T | 111 |
Now consider the following bit string:
- 101111100010
First bit is 1, so the only single-bit code-word C can be skipped. Second bit is 0, so because this is a prefix code, we've identified A and decoding of next code word can proceed.
By going left to right, this gives following code words:
- 10-111-110-0-0-10
Formally: prefix code is a code where there is no code word in the encoding, which is also a prefix of some other word. Compare with first example in previous subsection, where 0 is prefix of 01, and 01, is prefix of 011