To send information as fast as possible across the Internet, web servers first represent the data in a compact format (compression) before it is actually sent over the network. A web browser that receives this compressed data therefore needs to unpack it first (decompression) before the information can be used. To understand how data compression can be achieved, you need to know first that all data (text, images, sound, video, …) are represented inside a computer as bit strings — strings of consecutive zeros and ones. Textual data, for example, is traditionally encoded by representing each symbol (character) with a fix-length bit string. If we represent all 8 symbols of the word internet by a single byte (8 bits), the computer representation of this word takes 64 bits in total.

While he was a PhD student at MIT, David A. Huffman¹ developed an algorithm that represents each symbol as a unique variable-length bit string. The details of this algorithm are not relevant for this assignment, but the gain for data compression lies in the fact that more common symbols are generally represented using fewer bits than less common symbols. The table below shows an example on how a list of symbols is converted to their corresponding bit string representation. From this you can derive that the letter i is represented by the bit string 1000, the letter e by the bit string 000, the letter x by the bit string 10010 and a space by the bit string 111. A full text is compressed by concatenating the bit string representations of the consecutive symbols in the text. According to the scheme of the table below, the word internet is compressed as the bit string 1000001001100001100000100000110 (31 bits).

symbol	bit string
space	`111`
`a`	`010`
`e`	`000`
`f`	`1101`
`h`	`1010`
`i`	`1000`
`m`	`0111`
`n`	`0010`

symbol	bit string
`s`	`1011`
`t`	`0110`
`l`	`11001`
`o`	`00110`
`p`	`10011`
`r`	`11000`
`u`	`00111`
`x`	`10010`

Decompressing a fragment of compressed data relies on an ingenious feature of the bit strings that are assigned by Huffman's algorithm. Of all possible symbols in the coding scheme, there is always exactly one symbol whose corresponding bit string is a prefix (a string at the start of another string) of the remaining portion of the compressed data.

Suppose that we wish to decompress the compressed bit string 1000001001100001100000100000110. We observe in the table above that the symbol i is the only one whose corresponding bit string (1000) is a prefix of the compressed bit string. This gives us the first symbol, and we can reduce the compressed bit string to 001001100001100000100000110 by removing the corresponding prefix. Again, we find that the symbol n is the only one whose corresponding bit string (0010) is a prefix of the remaining compressed bit string, so we found the second symbol. By repeatedly finding prefixes and reducing the compressed bit string, we finally end up decompressing the entire bit string.

Assignment

You are asked to compress strings into bit strings, and decompress compressed bit string into the original strings according to the principle of Huffman coding. Bit strings are represented as strings that only consist of the characters 0 and 1. The bit strings that are used to compress the individual symbols are given in a text file, whose lines consist of a symbol (a single character), followed by a tab and the bit string used to represent this symbol. Below, you find the text file containing the bit strings that correspond to the symbols in the table above. Note that the symbol on the first line of this text file is a space.

Define a class ZIP that can be used to compress strings as bit strings, and decompress bit strings into the original strings according to the principle of Huffman coding. Upon initialization of an object of this class, the location (str) of a text file must be passed. This file must contain the encoding of the individual symbols in the format as outlined above. In addition, the objects of the class ZIP must support at least the following methods:

A method symbol2bitstring that takes a symbol (str; a single character). The method must return the compressed bit string (str) that corresponds to the given symbol. In case the given symbol does not occur in the file that was passed upon initialization, the method must raise an AssertionError with the message unknown symbol "x", with x being the given symbol.
A method bitstring2symbol that takes a bit string (str). The method must return the symbol (str) that corresponds to the given bit string. In case the given bit string does not occur in the file that was passed upon initialization, the method must raise an AssertionError with the message invalid bitstring.
A method compress that takes a string (str). The method must return the compressed bit string (str) that corresponds to the given string. In case the given string contains characters whose bit string encoding is not contained in the file that was passed upon initialization, the method must raise an AssertionError with the message unknown symbol "x", with x being the leftmost symbol that does not occur in the file.
A method decompress that takes a compressed bit string (str). The method must return the original string (str) that corresponds to the given bit string according to the Huffman coding scheme. In case the given bit string is not valid according to the Huffman coding scheme, the method must raise an AssertionError with the message invalid bitstring.

Example

In the following interactive session, we assume that the text file codes.txt² to be located in the current directory.

>>> zip = ZIP('codes.txt³')

>>> zip.symbol2bitstring('i')
'1000'
>>> zip.symbol2bitstring('e')
'000'
>>> zip.symbol2bitstring('T')
Traceback (most recent call last):
AssertionError: unknown symbol "T"

>>> zip.bitstring2symbol('1000')
'i'
>>> zip.bitstring2symbol('000')
'e'
>>> zip.bitstring2symbol('01')
Traceback (most recent call last):
AssertionError: invalid bitstring
	
>>> zip.compress('internet')
'1000001001100001100000100000110'
>>> len(zip.compress('internet'))
31
>>> zip.compress('internet explorer')
'1000001001100001100000100000110111000100101001111001001101100000011000'
>>> zip.compress('mozilla firefox')
Traceback (most recent call last):
AssertionError: unknown symbol "z"

>>> zip.decompress('1000001001100001100000100000110')
'internet'
>>> zip.decompress('1000001001100001100000100000110111000100101001111001001101100000011000')
'internet explorer'
>>> zip.decompress('10000010011000011000001000001101')
Traceback (most recent call last):
AssertionError: invalid bitstring
>>> zip.decompress('10000010011000011000000000110')
Traceback (most recent call last):
AssertionError: invalid bitstring

Assignment

Example

Epilogue