To send information as fast as possible across the Internet, web servers first represent the data in a compact format (compression) before it is actually sent over the network. A web browser that receives this compressed data therefore needs to unpack it first (decompression) before the information can be used. To understand how data compression can be achieved, you need to know first that all data (text, images, sound, video, …) are represented inside a computer as bit strings — strings of consecutive zeros and ones. Textual data, for example, is traditionally encoded by representing each symbol (character) with a fix-length bit string. If we represent all 8 symbols of the word internet by a single byte (8 bits), the computer representation of this word takes 64 bits in total.

While he was a PhD student at MIT, David A. Huffman1 developed an algorithm that represents each symbol as a unique variable-length bit string. The details of this algorithm are not relevant for this assignment, but the gain for data compression lies in the fact that more common symbols are generally represented using fewer bits than less common symbols. The table below shows an example on how a list of symbols is converted to their corresponding bit string representation. From this you can derive that the letter i is represented by the bit string 1000, the letter e by the bit string 000, the letter x by the bit string 10010 and a space by the bit string 111. A full text is compressed by concatenating the bit string representations of the consecutive symbols in the text. According to the scheme of the table below, the word internet is compressed as the bit string 1000001001100001100000100000110 (31 bits).

symbol bit string
space 111
a 010
e 000
f 1101
h 1010
i 1000
m 0111
n 0010
symbol bit string
s 1011
t 0110
l 11001
o 00110
p 10011
r 11000
u 00111
x 10010

Decompressing a fragment of compressed data relies on an ingenious feature of the bit strings that are assigned by Huffman's algorithm. Of all possible symbols in the coding scheme, there is always exactly one symbol whose corresponding bit string is a prefix (a string at the start of another string) of the remaining portion of the compressed data.

Suppose that we wish to decompress the compressed bit string 1000001001100001100000100000110. We observe in the table above that the symbol i is the only one whose corresponding bit string (1000) is a prefix of the compressed bit string. This gives us the first symbol, and we can reduce the compressed bit string to 001001100001100000100000110 by removing the corresponding prefix. Again, we find that the symbol n is the only one whose corresponding bit string (0010) is a prefix of the remaining compressed bit string, so we found the second symbol. By repeatedly finding prefixes and reducing the compressed bit string, we finally end up decompressing the entire bit string.

Assignment

You are asked to compress strings into bit strings, and decompress compressed bit string into the original strings according to the principle of Huffman coding. Bit strings are represented as strings that only consist of the characters 0 and 1. The bit strings that are used to compress the individual symbols are given in a text file, whose lines consist of a symbol (a single character), followed by a tab and the bit string used to represent this symbol. Below, you find the text file containing the bit strings that correspond to the symbols in the table above. Note that the symbol on the first line of this text file is a space.

 	111
a	010
e	000
f	1101
h	1010
i	1000
m	0111
n	0010
s	1011
t	0110
l	11001
o	00110
p	10011
r	11000
u	00111
x	10010

Define a class ZIP that can be used to compress strings as bit strings, and decompress bit strings into the original strings according to the principle of Huffman coding. Upon initialization of an object of this class, the location (str) of a text file must be passed. This file must contain the encoding of the individual symbols in the format as outlined above. In addition, the objects of the class ZIP must support at least the following methods:

Example

In the following interactive session, we assume that the text file codes.txt2 to be located in the current directory.

>>> zip = ZIP('codes.txt3')

>>> zip.symbol2bitstring('i')
'1000'
>>> zip.symbol2bitstring('e')
'000'
>>> zip.symbol2bitstring('T')
Traceback (most recent call last):
AssertionError: unknown symbol "T"

>>> zip.bitstring2symbol('1000')
'i'
>>> zip.bitstring2symbol('000')
'e'
>>> zip.bitstring2symbol('01')
Traceback (most recent call last):
AssertionError: invalid bitstring
	
>>> zip.compress('internet')
'1000001001100001100000100000110'
>>> len(zip.compress('internet'))
31
>>> zip.compress('internet explorer')
'1000001001100001100000100000110111000100101001111001001101100000011000'
>>> zip.compress('mozilla firefox')
Traceback (most recent call last):
AssertionError: unknown symbol "z"

>>> zip.decompress('1000001001100001100000100000110')
'internet'
>>> zip.decompress('1000001001100001100000100000110111000100101001111001001101100000011000')
'internet explorer'
>>> zip.decompress('10000010011000011000001000001101')
Traceback (most recent call last):
AssertionError: invalid bitstring
>>> zip.decompress('10000010011000011000000000110')
Traceback (most recent call last):
AssertionError: invalid bitstring

Epilogue

When David Huffman died in 1999, the world lost a talented computer scientist — Huffman was best known for discovering the Huffman coding technique used in data compression. But it also lost a pioneer in mathematical origami, an extension of the traditional art of paper folding that applies computational geometry, number theory, coding theory, and linear algebra. The field today is finding wide application, helping researchers to fold everything from proteins to automobile airbags and space-based telescopes.

paperwork
David Huffman, with one of his paper foldings, in 1978.

Huffman was drawn to the work through his investigations into the mathematical properties of "zero curvature" surfaces, studying how paper behaves near creases and apices of cones. During the last two decades of his life he created hundreds of beautiful, perplexing paper models in which the creases were curved rather than straight.

But he kept his folding research largely to himself. He published only one paper on the subject4, and much of what he discovered was lost at his death. Laser physicist Robert Lang told the New York Times5 in 2004:

He anticipated a great deal of what other people have since rediscovered or are only now discovering. At least half of what he did is unlike anything I've seen.

MIT computer scientist Erik Demaine is working now with Huffman's family to recover and document his discoveries6.

Huffman told an audience in 1979:

I don't claim to be an artist. I'm not even sure how to define art. But I find it natural that the elegant mathematical theorems associated with paper surfaces should lead to visual elegance as well.