Huffman tree warmup


Warmup from Julie Zelenski

"Debugging is hard, take breaks!" by Julia Evans

In these exercises, you will practice with Huffman trees to establish a solid understanding of the Huffman algorithm before you start implementing the program.

Huffman coding was introduced in Lecture 24. We also posted a handout that covers the same material if reading works better for you than lecture. Be sure you have watched/read this content first! The assignment writeup assumes familiarity with this content and does not repeat it.

Encoding and decoding using an encoding tree

First, practice decoding a bit sequence using an encoding tree. The diagram below is an encoding tree for the characters O, N, M, and S. Each leaf node corresponds to a character. The path from the root to a leaf node traces the sequence of bits that encode that node's character. In the diagram, we marked interior nodes with * for visualization purposes; an interior node does not store a character, and the path from the root to an interior node is only a partial encoding path.

             *
            / \
           *   O
          / \
         N   *  
            / \
           M   S

We label each leftward branch zero and each rightward branch one. The path from the root node to the leaf node S traces left-right-right, which corresponds to the bit sequence 011.

Answer the following questions in short_answer.txt:

Q1. Use the above encoding tree to decode the bit sequence 0101100011.

Q2. Prepare a table for the above encoding tree that lists each character with its assigned bit sequence. Use your table to encode the string "SONS".

As you work these operations manually, take a moment to consider what will be required to do these same operations in code; a small decoding sketch follows Q3 below.

Q3. Huffman codes obey the prefix property: no character's encoded bit sequence is a prefix of any other. What feature of an encoding tree demonstrates that it obeys the prefix property?
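
As promised, here is a minimal sketch of how the manual decoding from Q1 might translate to code. It assumes the EncodingTreeNode struct described later in this handout (a leaf stores a character; an interior node has zero and one child pointers), and purely for illustration it represents the bit sequence as a std::string of '0'/'1' characters, which will likely differ from the representation the assignment itself uses. The name decodeBits is made up for this sketch.

    #include <string>

    // Sketch only: decode by walking the tree, one branch per bit.
    // Assumes the EncodingTreeNode struct declared in treenode.h.
    std::string decodeBits(EncodingTreeNode* root, const std::string& bits) {
        std::string output;
        EncodingTreeNode* cur = root;
        for (char b : bits) {
            cur = (b == '0') ? cur->zero : cur->one;   // follow one branch per bit
            if (cur->isLeaf()) {
                output += cur->getChar();              // reached a character
                cur = root;                            // restart at the root
            }
        }
        return output;
    }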

Flattening and unflattening an encoding tree

One of the practical concerns of Huffman coding is that the compressed data must include information about the encoding tree, since that encoding is unique to the input data. It is not possible to decode the bit sequences without the original encoding tree.

We want to write the Huffman tree into the compressed file, but it isn't possible to write a tree directly. Therefore we must come up with a way to "flatten" the tree's data and structure in a form that we can later use to reconstruct the same tree.

There are a variety of ways to flatten the tree, but one tidy and compact approach is to summarize the tree as two sequences: a sequence of bits that give the shape of the tree and a sequence of characters that correspond to the tree leaves.

  1. The tree shape is flattened into a sequence of bits as follows:

    • If the root of the tree is a leaf node, it’s represented by the bit 0.
    • If the root of the tree is not a leaf node, it’s represented by the bit 1, followed by the flattened version of its zero (left) subtree, followed by the flattened version of its one (right) subtree.

    The sequence of bits describes the tree structure in the order that the tree nodes are visited in a pre-order traversal.

  2. The tree leaves are flattened into a sequence of characters by listing the characters of the leaf nodes as visited during an in-order traversal.

Together the two flattened sequences describe the tree shape and data in a way that allows us to later recreate the original tree.
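
For a preview of how this translates to code, here is a minimal sketch of the two flatten traversals. The function names are illustrative, it assumes the EncodingTreeNode struct described later in this handout, and it uses std::queue only to keep the sketch self-contained; the assignment itself uses the Stanford Queue types mentioned further below.

    #include <queue>

    // Sketch only: record the tree shape in pre-order (1 = interior, 0 = leaf).
    // Assumes a non-empty tree and the EncodingTreeNode struct from treenode.h.
    void flattenShape(EncodingTreeNode* t, std::queue<int>& shapeBits) {
        if (t->isLeaf()) {
            shapeBits.push(0);
        } else {
            shapeBits.push(1);
            flattenShape(t->zero, shapeBits);
            flattenShape(t->one, shapeBits);
        }
    }

    // Sketch only: record the leaf characters left to right (in-order visit;
    // interior nodes contribute nothing).
    void flattenLeaves(EncodingTreeNode* t, std::queue<char>& leafChars) {
        if (t->isLeaf()) {
            leafChars.push(t->getChar());
        } else {
            flattenLeaves(t->zero, leafChars);
            flattenLeaves(t->one, leafChars);
        }
    }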

For example, the left and middle encoding trees below are labeled with their flattened sequence of bits (tree shape) and sequence of characters (tree leaves):

     *                          *                          *
   /   \                      /   \                       / \
  E     *                    A     *                     *   O
       / \                        /  \                  / \
      W   K                      *    N                N   * 
                                / \                       / \
                               D   B                     M   S
 
  10100                       1011000
  EWK                         ADBN

Answer the following questions in short_answer.txt:

Q4. Flatten the encoding tree above on the right into its sequence of bits (tree shape) and sequence of characters (tree leaves).

Q5. Unflatten the sequences 110100100 (tree shape) and FLERA (tree leaves) to reconstruct the original encoding tree.

As you do these operations by hand, work out how you will translate these same steps to code. Each sequence is stored in a Queue. The flatten operation writes the tree shape into a Queue<bit> and the leaves into a Queue<char>. The unflatten operation takes in two queues and uses them to reconstruct the encoding tree.
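
A minimal unflatten sketch might look like the following. It uses std::queue rather than the Stanford Queue only to stay self-contained, and it assumes the interior-node constructor takes the two child pointers, as suggested by the EncodingTreeNode description below; the names and signatures here are illustrative, not the required ones.

    #include <queue>

    // Sketch only: rebuild the tree by consuming the shape bits and leaf
    // characters in the same order that flattening produced them.
    EncodingTreeNode* unflatten(std::queue<int>& shapeBits, std::queue<char>& leafChars) {
        int bit = shapeBits.front();
        shapeBits.pop();
        if (bit == 0) {                                // 0 bit: this node is a leaf
            char ch = leafChars.front();
            leafChars.pop();
            return new EncodingTreeNode(ch);
        }
        EncodingTreeNode* zero = unflatten(shapeBits, leafChars);  // 1 bit: interior node,
        EncodingTreeNode* one  = unflatten(shapeBits, leafChars);  // rebuild both subtrees
        return new EncodingTreeNode(zero, one);        // assumed interior constructor
    }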

Generating an optimal Huffman tree

The last task of the warmup is to construct a Huffman tree for a given input by manually following the Huffman algorithm.

Answer the following questions in short_answer.txt:

Q6. Construct a Huffman coding tree for the input "BOOKKEEPER".

While constructing a Huffman tree, there are some arbitrary choices to make that do not affect the correctness or optimality of the resulting tree. For example, if the priority queue contains several nodes/subtrees of the same weight, these are all tied, and the algorithm can break the tie arbitrarily. Each choice produces a slightly different tree, but all are equally optimal. Similarly, exchanging the zero/one subtrees of any interior node results in a mirrored tree that is an equally valid alternative.
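
For orientation, here is a minimal sketch of the core loop you are simulating by hand for Q6. The use of std::map and std::priority_queue, and all names here, are illustrative choices for this sketch rather than the assignment's required design; it also assumes the EncodingTreeNode constructors described below and a non-empty input.

    #include <map>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    // Sketch only: repeatedly merge the two lightest subtrees until one remains.
    EncodingTreeNode* buildHuffmanTree(const std::string& input) {
        std::map<char, int> freq;
        for (char c : input) freq[c]++;                  // count occurrences

        using Entry = std::pair<int, EncodingTreeNode*>; // (weight, subtree)
        auto cmp = [](const Entry& a, const Entry& b) { return a.first > b.first; };
        std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> pq(cmp);

        for (const auto& [ch, count] : freq) {
            pq.push({count, new EncodingTreeNode(ch)});  // one leaf per distinct character
        }
        while (pq.size() > 1) {
            Entry a = pq.top(); pq.pop();                // two lightest subtrees;
            Entry b = pq.top(); pq.pop();                // ties broken arbitrarily
            pq.push({a.first + b.first, new EncodingTreeNode(a.second, b.second)});
        }
        return pq.top().second;
    }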

One way to confirm the optimality of a Huffman tree despite these variations is to compute its entropy. The entropy is the weighted average of the encoding lengths of all characters in the input. Consider an input of 10 characters: AAAAAABBBC. One Huffman tree for this input could look like this:

     *                  
   /   \                
  A     *               
       / \        
      B   C   

To compute the entropy, first sum the encoding lengths per character weighted by the number of occurrences: 6*1 (6 A's, encoding length for A is 1 bit) + 3*2 (3 B's, encoding length 2) + 1*2 (1 C, encoding length 2). Take that sum 14 and divide by 10 (number of characters in input) to compute the average encoding length of 1.4 bits per character. This is the entropy.
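
If you want to double-check a hand calculation like this one, a small sketch along the following lines computes the same weighted average by recording each leaf's depth. The names are illustrative, and it assumes the EncodingTreeNode struct described below and a non-empty input.

    #include <map>
    #include <string>

    // Sketch only: a character's encoding length equals the depth of its leaf.
    void tallyDepths(EncodingTreeNode* t, int depth, std::map<char, int>& codeLength) {
        if (t->isLeaf()) {
            codeLength[t->getChar()] = depth;
        } else {
            tallyDepths(t->zero, depth + 1, codeLength);
            tallyDepths(t->one, depth + 1, codeLength);
        }
    }

    // Sketch only: entropy = (sum of count * encoding length) / total characters.
    double entropy(EncodingTreeNode* tree, const std::string& input) {
        std::map<char, int> codeLength;
        tallyDepths(tree, 0, codeLength);
        int totalBits = 0;
        for (char c : input) totalBits += codeLength[c];
        return double(totalBits) / input.size();         // e.g. 14 / 10 = 1.4 above
    }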

Q7. Calculate the entropy of your Huffman tree for "BOOKKEEPER". Confirm that if you had chosen to break ties in a different manner when constructing the tree, this alternate tree has the same entropy as the optimal result.

Good versus bad Huffman trees. In lecture, Chris described how the performance of a BST depends on the tree height, which in turn is dictated by how balanced the tree is. A fully balanced tree (equal node counts in each child subtree and leaf nodes at equal depth) has a height of about log N and exhibits optimal performance. A little imbalance in a BST is no big deal, but as a BST becomes more and more lopsided, the height grows and performance degrades toward the fully degenerate case of a linear linked list. Does the height/balance of a Huffman tree matter in the same way as it does for BSTs?

Answer the following question in short_answer.txt:

Q8. Consider the space of inputs of length 1000 that consist of varied occurrences of 100 distinct characters. Of those various inputs, contrast which inputs result in a balanced Huffman tree versus those that produce a very lopsided Huffman tree. As an example, what if each of the 100 characters were represented 10 times (for a total of 1000 characters)? What would the Huffman tree look like? What if, instead, 99 of the characters were represented only once, but the remaining character was represented 901 times (again, for a total of 1000 characters)? What would the Huffman tree look like? Which of the two Huffman trees achieves more significant compression? What does this tell you about what makes for a "good" versus "bad" Huffman tree?

Test utility functions

As you learned on the previous assignment, test cases that validate operations on linked structures don't come easily. By first investing in writing utility functions (functions to create, compare, and deallocate linked lists, for example), writing test cases later becomes more manageable. Testing on trees will be similar. Our provided tests give you some idea of what writing tests for trees should look like, but they depend on you writing some tree utility functions in huffman.cpp as a foundation.

EncodingTreeNode

First review the definition of struct EncodingTreeNode in the file treenode.h. The struct declares two convenience constructors, one to initialize a new leaf node, the other for a new interior node. Every EncodingTreeNode has a character field ch and two child pointers, zero and one. For a leaf node, both child pointers are nullptr and the ch field is set to a character value. For an interior node, the child pointers are non-null and the character field is unused.
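
The authoritative definition is the one in treenode.h; based on the description above, it looks roughly like the sketch below. The exact declarations in treenode.h may differ, so read the real header rather than relying on this sketch.

    // Sketch only -- see treenode.h for the real definition.
    struct EncodingTreeNode {
        char ch;                  // character stored in a leaf; unused in an interior node
        EncodingTreeNode* zero;   // left (0) child, nullptr in a leaf
        EncodingTreeNode* one;    // right (1) child, nullptr in a leaf

        EncodingTreeNode(char c);                                   // leaf constructor
        EncodingTreeNode(EncodingTreeNode* z, EncodingTreeNode* o); // interior constructor
        bool isLeaf() const;
        char getChar() const;
    };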

The struct fields (zero, one, and ch) are public, which allows your code to access them directly. However, the struct also provides the member functions isLeaf and getChar, which access the fields on your behalf. We recommend that you use these functions: they make your code a bit tidier, and in the case of getChar you gain a safety check that alerts you if you mistakenly access the character field of an interior node. The code below demonstrates some sample use:

    EncodingTreeNode* t = new EncodingTreeNode('A');
    if (t->isLeaf()) {
        cout << t->getChar() << endl;
    }

createExampleTree

Some of the provided test cases use this sample encoding tree:

               *
             /   \
            T     *
                 / \
                *   E
               / \
              R   S

The tree has 4 leaf nodes, each corresponding to one of the characters {T, R, S, E} in the encoding. The 3 interior nodes are marked with * just for the purpose of visualization. An interior node does not store any character value.

EncodingTreeNode* createExampleTree()

The utility function createExampleTree is intended to construct an encoding tree that matches the one shown above. This is not a general-purpose tree creation function; it manually constructs exactly and only this tree. Implement the function to allocate the necessary nodes and wire them together in the exact form shown above.
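
One possible sketch of that wiring is below. It assumes the interior-node constructor takes the two child pointers, as suggested by the description in the EncodingTreeNode section; check treenode.h for the exact constructors to call.

    // Sketch only: hand-build the T-R-S-E tree pictured above.
    EncodingTreeNode* createExampleTree() {
        EncodingTreeNode* rs  = new EncodingTreeNode(new EncodingTreeNode('R'),
                                                     new EncodingTreeNode('S'));
        EncodingTreeNode* rse = new EncodingTreeNode(rs, new EncodingTreeNode('E'));
        return new EncodingTreeNode(new EncodingTreeNode('T'), rse);
    }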

To confirm that your function is doing what you intend, run it under the debugger and set a breakpoint after constructing the tree. Use the Variables pane of the debugger to explore the tree structure that you have created, starting from the root. Your process will be reminiscent of how you explored your personal labyrinth in the previous assignment. Confirm that the tree has the correct shape and correct character in each leaf node. Remember that the character field for an interior node is unused, so its value is not relevant.

deallocateTree

Any memory allocated using new during a test case should be deallocated with delete before finishing with the test case. Our provided test cases call deallocateTree to deallocate all of the memory for the nodes of a tree.

void deallocateTree(EncodingTreeNode* t)
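
A minimal recursive sketch, assuming the node layout described earlier, deletes both subtrees before the node itself:

    // Sketch only: post-order deletion frees the children before their parent.
    void deallocateTree(EncodingTreeNode* t) {
        if (t == nullptr) return;   // empty tree: nothing to free
        deallocateTree(t->zero);
        deallocateTree(t->one);
        delete t;
    }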

After implementing deallocateTree, confirm it works correctly by writing a student test case that calls your createExampleTree to allocate a tree and uses deallocateTree to clean it up. Run the test case and confirm no memory leaks are reported.

areEqual

The areEqual utility function is given pointers to the root nodes of two encoding trees and returns true if the two trees are equal; that is, they have the same shape and the same character in each corresponding leaf node. If the two trees have mismatched shapes or differing leaf characters, the function should return false.

bool areEqual(EncodingTreeNode* a, EncodingTreeNode* b)

To confirm your areEqual function is working as intended, you should add a variety of your own student tests. Here are some test cases to be sure to cover:

  • An empty tree is nullptr. Two empty trees should compare as true according to areEqual.
  • The simplest non-empty encoding tree consists of a root node and two leaf children. Create this simple tree, compare it to an empty tree, and confirm areEqual returns false.
  • Create another copy of the simple encoding tree from above and confirm it is equal to the first. Rearrange the second tree to swap the left and right children or assign different characters to the leaf nodes. Confirm the altered tree is not equal to the original.
  • Call createExampleTree to build the T-R-S-E example tree and confirm it is not equal to the simple tree.
  • Call createExampleTree a second time and confirm the two copies of the example tree are equal.
  • Compare the example tree with one of its subtrees and confirm they are not equal.

Be sure that areEqual is only comparing the valid fields of each tree node. In particular, interior nodes do not use the character field. The only characters you should compare are the ones stored in the leaf nodes.
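
The comparison itself has the same recursive shape as the other utilities. A minimal sketch, assuming the node layout described above:

    // Sketch only: equal trees have matching shapes and matching leaf characters;
    // interior nodes' character fields are deliberately ignored.
    bool areEqual(EncodingTreeNode* a, EncodingTreeNode* b) {
        if (a == nullptr || b == nullptr) return a == b;       // both empty, or only one
        if (a->isLeaf() != b->isLeaf()) return false;          // mismatched shape
        if (a->isLeaf()) return a->getChar() == b->getChar();  // compare leaves only
        return areEqual(a->zero, b->zero) && areEqual(a->one, b->one);
    }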

After you have implemented and thoroughly tested your utility functions, you're ready to move on to the main event!