Burrows–Wheeler Transform

The Burrows–Wheeler transform (BWT, also called block-sorting compression) is an algorithm used in data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler.

In outline, the transform sorts all rotations of the data and then replaces each sorted rotation with the character that immediately precedes it in the original, unsorted text. The transform is reversible. Because the characters that precede similar contexts tend to be the same, the output tends to contain runs of repeated characters, which compress more readily. Such runs can be compressed further by techniques such as the move-to-front transform and run-length encoding.

For example, the string:

SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES

could be transformed into this string, which is easier to compress because it has many repeated characters:

 TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
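
As a rough illustration of why runs help the later stages, here is a minimal sketch of a move-to-front pass in C. The function name `mtf_encode` and its signature are assumptions made for this example and do not come from any particular library. Each byte is replaced by its position in a recency list, so a run of identical bytes turns into one index followed by zeros.

```c
#include <stddef.h>
#include <string.h>

/* Move-to-front: replace each byte by its current position in a
 * recency list, then move that byte to the front of the list.
 * (mtf_encode is a hypothetical name used only for this sketch.) */
void mtf_encode(const unsigned char *in, unsigned char *out, size_t n)
{
    unsigned char table[256];
    size_t i;
    int k;

    for (k = 0; k < 256; k++)
        table[k] = (unsigned char)k;      /* initial list: 0, 1, ..., 255 */

    for (i = 0; i < n; i++) {
        unsigned char c = in[i];
        int pos = 0;

        while (table[pos] != c)           /* find c in the list */
            pos++;
        out[i] = (unsigned char)pos;
        /* shift the entries in front of c back by one slot */
        memmove(table + 1, table, (size_t)pos);
        table[0] = c;                     /* c is now the most recent byte */
    }
}
```

For the four bytes "XXSS" (ASCII), this sketch emits 88, 0, 84, 0: the second byte of each run becomes a zero, which a run-length or entropy coder can then exploit.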


The transform is done by sorting all rotations of the text, then taking the last column. For example, the text "^BANANA@" is transformed into "BNN^AA@A" through these steps (the @ character marks the 'EOF' pointer):

 Transformation
 
 Input      All Rotations   Sort the Rows   Output
 ^BANANA@   ^BANANA@        ANANA@^B        BNN^AA@A
            @^BANANA        ANA@^BAN
            A@^BANAN        A@^BANAN
            NA@^BANA        BANANA@^
            ANA@^BAN        NANA@^BA
            NANA@^BA        NA@^BANA
            ANANA@^B        ^BANANA@
            BANANA@^        @^BANANA

The following pseudocode gives a simple, but inefficient, way to calculate the BWT and its inverse. It assumes that the input string s contains a special character 'EOF' which is the last character, occurs nowhere else in the text, and is given a fixed place in the sort order (in the example above, '@' sorts after every other character, including '^').

 function BWT (string s)
   create a list of all possible rotations of s
   let each rotation be one row in a large, square table
   sort the rows of the table alphabetically, treating each row as a string
   return the last (rightmost) column of the table
 
 function inverseBWT (string s)
   create an empty table with no rows or columns
   repeat length(s) times
       insert s as a new column down the left side of the table
       sort the rows of the table alphabetically
   return the row that ends with the 'EOF' character
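
The forward pseudocode can be sketched in C roughly as follows. This is a minimal, deliberately inefficient version under stated assumptions: the names `rank_of`, `compare_rotations` and `naive_bwt` are invented for this example, the EOF marker is assumed to sort after every other character (as in the "^BANANA@" table), and the caller is assumed to supply the scratch and output buffers.

```c
#include <stddef.h>

/* Sort rank of a character: the EOF marker sorts after everything else. */
static int rank_of(char c, char eof)
{
    return c == eof ? 256 : (unsigned char)c;
}

/* Compare the rotations of s (length n) that start at offsets a and b. */
static int compare_rotations(const char *s, size_t n, size_t a, size_t b,
                             char eof)
{
    size_t k;

    for (k = 0; k < n; k++) {
        int ra = rank_of(s[(a + k) % n], eof);
        int rb = rank_of(s[(b + k) % n], eof);
        if (ra != rb)
            return ra < rb ? -1 : 1;
    }
    return 0;
}

/* Naive BWT: insertion-sort the start offsets of all rotations, then
 * emit the last character of each sorted row.
 * start: caller-provided scratch space for n offsets.
 * out:   receives n characters plus a terminating NUL. */
void naive_bwt(const char *s, size_t n, char eof, size_t *start, char *out)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        /* insert offset i into the already sorted prefix start[0..i) */
        for (j = i; j > 0 &&
                    compare_rotations(s, n, start[j - 1], i, eof) > 0; j--)
            start[j] = start[j - 1];
        start[j] = i;
    }
    for (i = 0; i < n; i++)
        out[i] = s[(start[i] + n - 1) % n];   /* last column of row i */
    out[n] = '\0';
}
```

Calling `naive_bwt("^BANANA@", 8, '@', start, out)` with an 8-element `start` array and a 9-byte `out` buffer yields "BNN^AA@A", matching the table above. The table of rotations is never stored; each row is represented by its start offset only.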

To understand why this creates more easily compressible data, consider transforming a long English text that frequently contains the word "the". Sorting the rotations of this text will often group rotations starting with "he " together, and the last character of each such rotation (which is also the character before the "he ") will usually be "t". The result of the transform therefore contains a number of "t" characters in a row, mixed with less common exceptions (for example, where the text contains "Brahe "). The success of the transform thus depends on one value having a high probability of occurring before a given sequence, so in general it needs fairly long samples (a few kilobytes at least) of appropriate data, such as text.

The remarkable thing about the BWT is not that it generates a more easily encoded output (an ordinary sort would do that) but that it is reversible, allowing the original document to be regenerated from the last column data.

The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column. Then, the first and last columns together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:

 Inverse Transformation
 
 Input: BNN^AA@A
 
 Add 1      Sort 1     Add 2      Sort 2
 B          A          BA         AN
 N          A          NA         AN
 N          A          NA         A@
 ^          B          ^B         BA
 A          N          AN         NA
 A          N          AN         NA
 @          ^          @^         ^B
 A          @          A@         @^
 
 Add 3      Sort 3     Add 4      Sort 4
 BAN        ANA        BANA       ANAN
 NAN        ANA        NANA       ANA@
 NA@        A@^        NA@^       A@^B
 ^BA        BAN        ^BAN       BANA
 ANA        NAN        ANAN       NANA
 ANA        NA@        ANA@       NA@^
 @^B        ^BA        @^BA       ^BAN
 A@^        @^B        A@^B       @^BA
 
 Add 5      Sort 5     Add 6      Sort 6
 BANAN      ANANA      BANANA     ANANA@
 NANA@      ANA@^      NANA@^     ANA@^B
 NA@^B      A@^BA      NA@^BA     A@^BAN
 ^BANA      BANAN      ^BANAN     BANANA
 ANANA      NANA@      ANANA@     NANA@^
 ANA@^      NA@^B      ANA@^B     NA@^BA
 @^BAN      ^BANA      @^BANA     ^BANAN
 A@^BA      @^BAN      A@^BAN     @^BANA
 
 Add 7      Sort 7     Add 8      Sort 8
 BANANA@    ANANA@^    BANANA@^   ANANA@^B
 NANA@^B    ANA@^BA    NANA@^BA   ANA@^BAN
 NA@^BAN    A@^BANA    NA@^BANA   A@^BANAN
 ^BANANA    BANANA@    ^BANANA@   BANANA@^
 ANANA@^    NANA@^B    ANANA@^B   NANA@^BA
 ANA@^BA    NA@^BAN    ANA@^BAN   NA@^BANA
 @^BANAN    ^BANANA    @^BANANA   ^BANANA@
 A@^BANA    @^BANAN    A@^BANAN   @^BANANA
 
 Output: ^BANANA@
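
The add-and-sort steps of this example can be sketched in C as follows. This is only an illustration of the naive method, not an efficient decoder: the names `rank_of`, `compare_rows` and `naive_inverse_bwt` are invented for this example, the EOF marker is assumed to sort after every other character, and the caller is assumed to provide an n*n scratch table.

```c
#include <stddef.h>
#include <string.h>

/* Sort rank of a character: the EOF marker sorts after everything else. */
static int rank_of(char c, char eof)
{
    return c == eof ? 256 : (unsigned char)c;
}

/* Compare the first len characters of two table rows. */
static int compare_rows(const char *a, const char *b, size_t len, char eof)
{
    size_t k;

    for (k = 0; k < len; k++) {
        int ra = rank_of(a[k], eof), rb = rank_of(b[k], eof);
        if (ra != rb)
            return ra < rb ? -1 : 1;
    }
    return 0;
}

/* Naive inverse BWT by repeated "add a column, sort the rows".
 * table: caller-provided scratch of n*n chars, row r at table + r*n.
 * out:   receives n characters plus a terminating NUL; it doubles as
 *        the temporary row buffer while sorting.
 * Returns 0 on success, -1 if no row ends with the eof marker. */
int naive_inverse_bwt(const char *bwt, size_t n, char eof,
                      char *table, char *out)
{
    size_t len, r, j;

    for (len = 0; len < n; len++) {
        /* "Add": prepend bwt as a new leftmost column */
        for (r = 0; r < n; r++) {
            char *row = table + r * n;
            memmove(row + 1, row, len);
            row[0] = bwt[r];
        }
        /* "Sort": insertion sort on the first len + 1 characters */
        for (r = 1; r < n; r++) {
            memcpy(out, table + r * n, len + 1);
            for (j = r; j > 0 && compare_rows(table + (j - 1) * n, out,
                                              len + 1, eof) > 0; j--)
                memcpy(table + j * n, table + (j - 1) * n, len + 1);
            memcpy(table + j * n, out, len + 1);
        }
    }
    for (r = 0; r < n; r++) {
        if (table[r * n + n - 1] == eof) {   /* row ending with EOF */
            memcpy(out, table + r * n, n);
            out[n] = '\0';
            return 0;
        }
    }
    return -1;
}
```

With the input "BNN^AA@A", n = 8 and '@' as the marker, the sketch returns 0 and writes "^BANANA@" to `out`. It performs n full sorts of n rows, which is why it is only suitable for small demonstrations.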

A number of optimizations can make these algorithms run more efficiently without changing the output. In the BWT, there is no need to actually store the table: each row can be represented by a single pointer into the string. In the inverse BWT, there is no need to store the table or to perform multiple sorts. It is sufficient to sort the characters once with a stable sort and remember where each character moved. This gives a single-cycle permutation, whose cycle is the output. A "character" in the algorithm can be a byte, a bit, or any other convenient size.
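
The single-sort inverse described above might look like the following sketch, which keeps the explicit EOF marker of the "^BANANA@" example. The names `rank_of`, `inverse_bwt_one_sort` and `from` are assumptions for illustration, and the marker is again assumed to sort after every other character. A counting sort plays the role of the stable sort, and `from[p]` records which input position supplied the character at sorted position p.

```c
#include <stddef.h>

/* Sort rank of a character: the EOF marker sorts after everything else. */
static int rank_of(char c, char eof)
{
    return c == eof ? 256 : (unsigned char)c;
}

/* Inverse BWT using one stable counting sort.
 * from: caller-provided scratch space for n indices.
 * out:  receives n characters plus a terminating NUL.
 * Returns 0 on success, -1 if bwt contains no eof marker. */
int inverse_bwt_one_sort(const char *bwt, size_t n, char eof,
                         size_t *from, char *out)
{
    size_t first[257] = {0};  /* per rank: count, then next free slot */
    size_t i, j, sum = 0, eof_row = n;
    int r;

    for (i = 0; i < n; i++)
        first[rank_of(bwt[i], eof)]++;
    for (r = 0; r < 257; r++) {           /* counts -> start positions */
        size_t count = first[r];
        first[r] = sum;
        sum += count;
    }
    /* Stable sort: the k-th occurrence of a character in bwt lands in
     * the k-th slot for that character in the sorted first column. */
    for (i = 0; i < n; i++) {
        from[first[rank_of(bwt[i], eof)]++] = i;
        if (bwt[i] == eof)
            eof_row = i;                  /* this row is the original text */
    }
    if (eof_row == n)
        return -1;
    /* Follow the permutation: each step moves to the row that starts
     * one character later in the text and emits the character passed. */
    for (i = 0, j = eof_row; i < n; i++) {
        j = from[j];
        out[i] = bwt[j];
    }
    out[n] = '\0';
    return 0;
}
```

For "BNN^AA@A" the recorded permutation is 4, 5, 7, 0, 1, 2, 3, 6, and walking it from the row that ends in '@' reproduces "^BANANA@" with one pass instead of eight sorts.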

There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.

A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.

Example Code

 #include <stdlib.h>
 #include <string.h>
 #include <assert.h>
 #include <stdio.h>
 
 typedef unsigned char byte;
 
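 /* Buffer and size shared with the qsort comparator below. */
 byte *rotlexcmp_buf = NULL;
 int rotlexcmp_bufsize = 0;
 
 /* Compare two rotations of rotlexcmp_buf, given by their start offsets. */
 int rotlexcmp(const void *l, const void *r)
 {
     int li = *(const int*)l, ri = *(const int*)r, ac=rotlexcmp_bufsize;
     if(li == ri) return 0;
     while (rotlexcmp_buf[li] == rotlexcmp_buf[ri])
     {
         if (++li == rotlexcmp_bufsize)
             li = 0;
         if (++ri == rotlexcmp_bufsize)
             ri = 0;
         if (!--ac)
             return 0; /* every character matched: identical rotations */
     }
     if (rotlexcmp_buf[li] > rotlexcmp_buf[ri])
         return 1;
     else
         return -1;
 }
 
 void bwt_encode(byte *buf_in, byte *buf_out, int size, int *primary_index)
 {
     int indices[size]; /* start offset of each rotation (stack-allocated) */
     int i;
 
     assert (size > 0);
     for(i=0; i<size; i++)
         indices[i] = i;
     rotlexcmp_buf = buf_in;
     rotlexcmp_bufsize = size;
     qsort (indices, size, sizeof(int), rotlexcmp);
 
     /* The last column holds the byte preceding each sorted rotation. */
     for (i=0; i<size; i++)
         buf_out[i] = buf_in[(indices[i]+size-1)%size];
     /* The primary index is the row of the rotation starting at offset 1
        (offset 0 when size is 1); bwt_decode starts from that row. */
     for (i=0; i<size; i++)
     {
         if (indices[i] == 1 % size) {
             *primary_index = i;
             return;
         }
     }
     assert (0);
 }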
 
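 void bwt_decode(byte *buf_in, byte *buf_out, int size, int primary_index)
 {
     byte F[size];       /* first column: the input bytes in sorted order */
     int buckets[256];
     int i,j,k;
     int indices[size];  /* indices[p]: position in buf_in of the byte at F[p] */
 
     /* Count each byte value, then write F in sorted order. */
     for (i=0; i<256; i++)
         buckets[i] = 0;
     for (i=0; i<size; i++)
         buckets[buf_in[i]] ++;
     for (i=0,k=0; i<256; i++)
         for (j=0; j<buckets[i]; j++)
             F[k++] = i;
     assert (k==size);
     /* Turn buckets[i] into the position of the first i in F.
        The bound is checked first so that F[size] is never read. */
     for (i=0,j=0; i<256; i++)
     {
         while (j<size && i>F[j])
             j++;
         buckets[i] = j; // it will get fake values if there is no i in F, but
                         // that won't bring us any problems
     }
     /* Stable sort: the k-th occurrence of a byte in buf_in maps to the
        k-th occurrence of that byte in F. */
     for(i=0; i<size; i++)
         indices[buckets[buf_in[i]]++] = i;
     /* Follow the permutation from the primary index to emit the text. */
     for(i=0,j=primary_index; i<size; i++)
     {
         buf_out[i] = buf_in[j];
         j=indices[j];
     }
 }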
 
 int main()
 {
     byte buf1[] = "WikiCoder is a great code reference website.";
     int size = strlen((const char*)buf1);
     byte buf2[size];
     byte buf3[size];
     int primary_index;
 
     bwt_encode (buf1, buf2, size, &primary_index);
     bwt_decode (buf2, buf3, size, primary_index); 
 
     assert (!memcmp (buf1, buf3, size));
     printf ("Result is the same as input, that is: <%.*s>\n", size, buf3);
     // Print out encode/decode results:
     printf ("Input : <%.*s>\n", size, buf1);
     printf ("Output: <%.*s>\n", size, buf2);
     return 0;
 }
