Huffman coding is a computer science technique that is commonly used for lossless data compression. Inside computers, data always needs to be represented as bitstrings (strings of successive zeros and ones). As a result, each symbol — for example a character in a text — is traditionally represented using a fixed number of bits. This way, if we represent each character in the word huffman as a single byte (8 bits), the computer representation of the word uses 56 bits in total.

During his PhD at MIT, David A. Huffman1 developed an algorithm that represents each symbol as a variable-length bitstring: the number of bits for encoding the symbols may differ. The details of this algorithm are not important for this assignment, but the gain in data compression results from the fact that symbols that appear more often are represented by shorter bitstrings than symbols that appear less often. The output of Huffman's algorithm is a binary tree, whose left edges are labeled with bit 0 and whose right edges are labeled with bit 1. The internal nodes of the tree are not labeled with symbols, whereas all leaf nodes of the tree are labeled with a unique symbol.

Huffman tree
Huffman tree generated from the exact frequencies of the text "this is an example of a huffman tree". Encoding the sentence with this code requires 135 bits, as opposed to 288 (or 180) bits if 36 characters of 8 (or 5) bits were used. This assumes that the code tree structure is known to the decoder and thus does not need to be counted as part of the transmitted information.

To encode a text as a bitstring, each individual symbol is represented by a unique bitstring. The bitstring that corresponds to a symbol is found by traversing the tree from the root to the leaf labeled with that symbol, and concatenating the labels of the edges visited during this traversal. For example, based on the above tree we see that the letter h is represented by the bitstring 1010, the letter u by the bitstring 00111, the letter f by the bitstring 1101 and a space by the bitstring 111. As such, the encoding of symbols as given in the above tree results in a representation of the text huffman as the bitstring 1010001111101110101110100010 (28 bits). Note that text encoding does not necessarily needs to make use of the tree, since each symbol is represented by a fixed bitstring. According to the above tree, the symbols are translated into bitstrings in the following way.

symbol bitstring
space 111
a 010
e 000
f 1101
h 1010
i 1000
m 0111
n 0010
symbol bitstring
s 1011
t 0110
l 11001
o 00110
p 10011
r 11000
u 00111
x 10010

However, to decode a bitstring we do need to make use of the tree. Say, for example, that we want to decode the bitstring 1010001111101110101110100010. We start at the root of the tree, and traverse the tree downwards to the right (bit 1), then to the left (bit 0), then to the right (bit 1), and to the left again (bit 0) to finally reach the leaf of the tree labeled with the letter h. As soon as we reach a leaf, we use the label of the leaf as the next symbol in the decoded text. Then, we turn back to the root of the tree as the starting point for processing the remaining bits. The bitstring 00111 gives us the letter u, twice the bitstring 1101 gives us twice the letter f, and so on, until we have decoded the bitstring as huffman.

Assignment

We ask you to encode strings (str) into bitstrings, and to decode bitstrings into strings according to the principles of Huffman coding. In this, bitstrings are represented as strings (str) containing zeros and ones only. You are given the bitstrings that are used to encode the individual symbols in a text file. Each line of such a file contains a symbol (a single character), followed by a tab and the bitstring used to represent the symbol. As an example, you can find below the context of a text file that corresponds to the tree in the introduction of this assignment. Note that the symbol at the first line of this example is a space.

 	111
a	010
e	000
f	1101
h	1010
i	1000
m	0111
n	0010
s	1011
t	0110
l	11001
o	00110
p	10011
r	11000
u	00111
x	10010

Define a class Node that can be used to represent the nodes of binary trees as used in Huffman coding. The means that each node can have a left and/or right edge that links it to another node (respectively called the left and right nodes), and that each node can be labeled with a symbol (leaf nodes are labeled with a symbol, internal nodes are not). The objects of this class must support methods left and right, respectively returning the left and right nodes, or the value None in case there is no left or right node. In addition, the objects of this class must also support a method symbol returning the symbol with which the node is labeled: a single character (str) if the node is labeled with a symbol or the value None is the node is not labeled with a symbol. None of these three methods takes arguments.

Define a class Huffman that can be used to encode strings into bitstrings, and to decode bitstrings into the original strings according the principle of Huffman coding. Upon initialization of an object of this class, one must pass the location (str) of a text file containing the encoding of the individual symbols in the format as described above. The initialization method must guarantee that objects have a property root that points to the root (Node) of the binary tree with the bitstring encodings of all symbols contained in the given file. In addition, the object of this class must support at least the following methods:

Example

In the following interactive session, we assume the text file codes.txt2 to be located in the current directory.

>>> huffman = Huffman('codes.txt3')
>>> isinstance(huffman.wortel, Knoop)
True
>>> huffman.wortel.rechts().rechts().rechts().symbool()
' '
>>> huffman.wortel.links().links().links().symbool()
'e'
>>> huffman.wortel.rechts().links().links().rechts().links().symbool()
'x'
>>> huffman.wortel.links().rechts().links().symbool()
'a'

>>> huffman.codeer('huffman')
'1010001111101110101110100010'
>>> len(huffman.codeer('huffman'))
28
>>> huffman.codeer('huffman codering')
Traceback (most recent call last):
ValueError: onbekend symbool "c"

>>> huffman.decodeer('1010001111101110101110100010')
'huffman'
>>> huffman.decodeer('10100011111011101011101000101')
Traceback (most recent call last):
ValueError: ongeldige bitstring
>>> huffman.decodeer('101000111110111010110100010')
Traceback (most recent call last):
ValueError: ongeldige bitstring