Huffman coding


Assignment written by Julie Zelenski. Thanks to Keith Schwarz for the structure of the tree flatten/unflatten operations.

Huffman coding is an algorithm devised by David Huffman in 1952 for compressing data. This simple and elegant approach is powerful enough that variants of it are still used today in computer networks, fax machines, modems, HDTV, and other areas. You will be impressed with its clever use of trees and your ability to implement such a nifty tool!

Huffman warmup

Be sure to have completed the preparatory warmup exercises.

Review starter code

Before you get started in earnest, take stock of the files in the starter project and what's provided to you. Here are some details to become acquainted with:

  • treenode.h defines the EncodingTreeNode struct you will use as the node type for an encoding tree. You were introduced to this type in the warmup.
  • bits.h defines the Bit type that is used to represent a single value of 0 or 1. A Bit can be used somewhat interchangeably with the integers 0 and 1:

      Queue<Bit> q;
      Bit b = 0;                   // a Bit can be initialized from an int 0/1
      q.enqueue(0);                // ints convert to Bit when enqueued
      if (q.dequeue() == 1) { }    // and a Bit compares equal to 0 or 1
    

    The special feature of Bit is that it is restricted to the values 0 and 1; any other value will raise an error. (Out-of-range bit values were a surprisingly common bug in past quarters and the source of much grief, so we introduced this constrained type to help you avoid the issue.)

  • bits.h also defines the EncodedData struct for the data stored in a Huffman compressed file.
  • In bits.cpp, we provide code to read and write bits to and from a stream as this is material beyond the scope of CS106B. The code uses the C++ features for bit-manipulation, a topic that you'll explore if you go on to CS107. However, if you’re curious to get a jump on it, you’re welcome to poke around in bits.cpp to check it out.
  • main.cpp contains an interactive console main program to do end-to-end file compression and decompression by calling your functions.
  • huffman.cpp is where you will implement the required functions for Huffman compression and decompression and add your student test cases.
  • When constructing a Huffman coding tree, you will need a priority queue. The PQHeap you previously implemented is close to what you need, but it only stored elements of type DataPoint. The priorityqueue.h interface in the Stanford library provides a general-purpose PriorityQueue implemented as an efficient binary min-heap. Its design is similar to your PQHeap but has a few differences of note: it is defined as a template and thus is capable of storing the client's choice of element type, and an element's priority is specified separately from the element value. Consult the PriorityQueue documentation for more details. In particular, make sure you know how to enqueue and dequeue elements and how to determine the priorities of relevant items. The elements you store in the priority queue will be trees; that is to say, the element type is EncodingTreeNode*, a pointer to the root node. The sketch below shows these operations in action.
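
For illustration, here is a minimal sketch of the PriorityQueue operations at the heart of the Huffman algorithm. It assumes the two EncodingTreeNode constructors (leaf and interior) from the warmup; treat it as a demonstration of the queue interface, not a complete solution:

    PriorityQueue<EncodingTreeNode*> pq;
    pq.enqueue(new EncodingTreeNode('a'), 3);   // priority = character frequency
    pq.enqueue(new EncodingTreeNode('b'), 2);
    pq.enqueue(new EncodingTreeNode('c'), 1);

    // dequeue returns the element with the lowest priority value first
    double zeroWeight = pq.peekPriority();      // priority of the frontmost element
    EncodingTreeNode* zero = pq.dequeue();      // least-weight subtree
    double oneWeight = pq.peekPriority();
    EncodingTreeNode* one = pq.dequeue();       // next least-weight subtree

    // join the two subtrees under a new interior node whose weight is the
    // sum of its children's weights, then put it back in the queue
    pq.enqueue(new EncodingTreeNode(zero, one), zeroWeight + oneWeight);

    // note: in real code, every node allocated here must eventually be deallocated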

Strategy

"I think that little by little I’ll be able to solve my problems" – wise words from Frida Kahlo

Ultimately, the goal is to develop a full compress and decompress pipeline, but your job will be more manageable if you break it down into steps and develop in stages. We have decomposed the project into a number of helper functions and propose an order of implementation that works from the bottom up, writing and testing individual functions and then moving on to combine them into the final program.

We recommend you first tackle the decode/unflatten/decompress tasks. This may seem a little backwards, but it turns out to be a great strategy. The code to decode/decompress is a bit simpler than encode/compress, and it is more straightforward to verify the correctness of its output because it restores the original text.

Then when you move on to implement encode/flatten/compress, you will already have a working decode/unflatten/decompress, which means you can feed the output from your encode operation back into your decode for testing.

With both decompress and compress working, you can use our provided main program to do end-to-end file compression and decompression and then bask in awe of the impressive tool you have created!

A note on recursion versus iteration. As you write each function, you will need to first consider your approach. Many tasks could be implemented using either recursion or iteration, and it will be up to you to choose an appropriate strategy.

Some tasks must be implemented iteratively in order to successfully handle larger inputs. For example, if the decodeText or encodeText function made a recursive call for every bit/character in the input, it could easily exhaust the limits of the stack and lead to stack overflow. Such tasks must be implemented iteratively.

Some tasks more naturally lend themselves to a recursive approach, such as the tree traversals for flatten/unflatten and deallocate. Since the depth of the recursive calls here is bounded by the height of the tree, there is no concern about stack overflow. Recursion is a great choice!

Absent a compelling reason to prefer one over the other, it is your call to choose for yourself.

Decode/decompress

We recommend that you implement the functions in the order listed below. You are also welcome to write additional helper functions if that aids your decomposition.

  • string decodeText(EncodingTreeNode* tree, Queue<Bit>& messageBits)
    

    The decodeText function is a good introduction to writing code that operates on trees. To review decoding using a tree, look back to Q1 of the warmup.

    Be sure to test! The starter code has one provided test for decodeText; augment it with student tests of your own to achieve full coverage. (A sketch of one possible iterative approach appears after this list.)

  • EncodingTreeNode* unflattenTree(Queue<Bit>& treeShape, Queue<char>& treeLeaves)
    

    If needed, revisit Q5 of the warmup to practice on paper with the unflatten operation before trying to code unflattenTree. This task is all about tree traversal and tree construction.

    Checking the unflattened tree by expanding/tracing its contents in the debugger tends to be an easier way to confirm the function is working than by hand-constructing test cases.

  • string decompress(EncodedData& data)
    

    The final decompress sounds like it would be a difficult task, but you've already done all the hard work, so all that's left to do is call the functions you already wrote and tested, woo hoo! The inputs/trees that you manually traced in the warmup make good candidates for simple test cases.
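
To make the iterative shape of decodeText concrete, here is a hedged sketch of one possible approach. It assumes the zero/one child pointers and ch field of EncodingTreeNode from the warmup; your decomposition may reasonably differ:

    string decodeText(EncodingTreeNode* tree, Queue<Bit>& messageBits) {
        string output;
        EncodingTreeNode* cur = tree;
        while (!messageBits.isEmpty()) {
            // follow one branch per bit: 0 = zero child, 1 = one child
            cur = (messageBits.dequeue() == 0) ? cur->zero : cur->one;
            if (cur->zero == nullptr && cur->one == nullptr) {   // reached a leaf
                output += cur->ch;      // emit the decoded character
                cur = tree;             // the next code starts back at the root
            }
        }
        return output;
    }

Note there is no recursion here: the loop adds no stack frames per bit, so even very long inputs pose no risk of stack overflow.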

Decode a mystery

Now that you have a working decompress, use it to unpack the compressed mystery file we included in the starter project.

Run your program, and when the main menu prompts you for which tests to run, respond 0 (None). The main program then enters a simple console loop that lets you select a file to compress or decompress. The starter project contains a compressed file you can use to try out your decompress function.

When prompted to enter the input filename, specify the name relative to the project folder, for example res/mystery.jpg.huf (on Windows, the path separator is a backslash: res\mystery.jpg.huf). The uncompressed version will be written to a file named res/unhuf.mystery.jpg. Locate the uncompressed file in the Finder/File Explorer and open it to confirm your success at restoring the original. Neat!

At this point, you should have working versions of decodeText and unflattenTree and a complete decompress pipeline. Awesome! These will serve you well when writing and testing the encode/compress side coming up next.

Encode/compress

We recommend that you implement the functions in the order listed below. You are also welcome to write additional helper functions as a decomposition aid.

  • Queue<Bit> encodeText(EncodingTreeNode* tree, string text)
    

    Review Q2 of the warmup for practice with encoding. We recommend that you start by writing your own helper function that traverses the tree once and builds a map that associates each character with its encoded bit sequence. You will later use this map to look up the bit sequence for a character, rather than having to repeatedly traverse the tree to find it. The helper function operates recursively; the depth of the recursive calls is bounded by the height of the tree, so there is no concern about stack overflow. (A sketch of one such helper appears after this list.)

    When moving on to the encoding task, you must use iteration to process each character of the text, not recursion. A recursive approach that added a stack frame per character of input would overflow the stack on longer inputs.

    When writing test cases, be sure to consider how a call to encodeText followed by decodeText is a round trip that should restore the original. And consider how handy it is that you already have a working decodeText!

  • void flattenTree(EncodingTreeNode* tree, Queue<Bit>& treeShape, Queue<char>& treeLeaves)
    

    Review Q4 of the warmup for practice with the flatten operation. This function involves more tree traversals; once you are done writing flattenTree, you will be a tree traversal expert!

    A good approach for test cases would be to flatten a tree, then unflatten it, and confirm the original and reconstructed trees are equal. Your helper function to compare two trees is just what you'll need for these test cases.

  • EncodingTreeNode* buildHuffmanTree(string text)
    

    The buildHuffmanTree function is the workhorse of the entire program. It takes the input text and follows the Huffman algorithm to construct an optimal Huffman coding tree for the input. You practiced this task in Q6 of the warmup.

    The input text must contain at least two distinct characters in order to build a Huffman encoding. If the given input does not meet this requirement, buildHuffmanTree should raise an error.

    When ready to test this function, keep in mind that there can be many valid Huffman trees for a given input, differing only in how the algorithm breaks ties. Be sure to consider how tie-breaking affects the tree outcome when constructing your test cases. It would be a bummer to get mired in trying to resolve why your tree is not equal to an expected output, only to realize that your tree is perfectly valid and equally optimal but just different than you thought it would be.

  • EncodedData compress(string messageText)
    

    This final compress function pulls together the above operations into a complete whole. Test cases that run an input through compress -> decompress should reconstruct the original and allow you to confirm the functionality of the entire pipeline. You can also use the console main program to try compressing and decompressing files now.
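
As promised above, here is a minimal sketch of the map-building helper recommended for encodeText. The name buildEncodingMap and its exact signature are our invention for illustration, and it assumes the EncodingTreeNode fields from the warmup:

    // hypothetical helper: records the root-to-leaf bit path for each character
    void buildEncodingMap(EncodingTreeNode* tree, Vector<Bit>& path,
                          Map<char, Vector<Bit>>& map) {
        if (tree->zero == nullptr && tree->one == nullptr) {    // at a leaf
            map[tree->ch] = path;   // the path from the root is this char's code
            return;
        }
        path.add(0);                                // descend the zero branch
        buildEncodingMap(tree->zero, path, map);
        path[path.size() - 1] = 1;                  // then the one branch
        buildEncodingMap(tree->one, path, map);
        path.remove(path.size() - 1);               // undo the choice on the way out
    }

With the map in hand, encodeText can make a single iterative pass over the text, appending the stored bit sequence for each character to the output queue.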

You have already tested each function in isolation and you might reasonably expect that they will all happily work together, but sometimes there can be latent issues that only surface in the complete program. This can be an indication that the earlier test coverage was lacking, that there are unexpected interactions between the components that were not previously triggered, or that a nefarious memory bug has been lying in wait. If this happens to you, bust out your superior testing and debugging skills to hunt down those gremlins and eradicate them.

We included a variety of files in the res folder of the starter project to use as inputs to compress. You can also add files of your own for testing.

Encode a message for your SL

Compress a secret message to send your section leader as part of your submission. In the res subfolder of the starter project, you'll find a file named secretmessage.txt. Edit that file to contain your message. Share about your journey in CS106B, thank your SL, tell us of your future plans, or send a funny meme. The message can be anything (non-offensive ☺) that you like.

Use your program to compress the file and save the result to secretmessage.txt.huf. Submit the compressed huf file to Paperless along with your code. Your section leader will uncompress your message with your program and read it while grading.

Specifications

  • Do not change prototypes for any of the required functions. This applies to both the utility functions described in the warmup and the functions described in this document.
  • The documentation for the individual functions (decode, unflatten, etc.) states what assumptions you can make about the inputs being valid and well-formed. Even so, writing those functions to detect/error on bad inputs can be a good defense. We won't test on invalid inputs, but while you are in development, you might accidentally slip up and feed your function a bad input. Your efforts to defend against that misstep could save you some time debugging an incorrect test case.
  • An input must contain at least two distinct characters to be Huffman-encodable, but beyond that, it can be of any length and may include any/all possible characters. Non-text files, such as sounds and images, make use of the full range of character values from 0 to 255, so these inputs make good test cases to confirm your handling of all values.
  • Your code should have no memory leaks, including no memory leaks in test cases.
    • Working out when and where to deallocate your trees will require careful thinking! Each tree that is allocated needs to be deallocated once and only once, and only after you are done using that memory. (A sketch of the standard postorder deallocation pattern appears after this list.)
  • Consider this assignment an opportunity to show your section leader that you've been absorbing all that great feedback on style from your IGs. Let's see your best work on writing the beautiful code we know you are capable of!
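
For reference, tree deallocation is naturally a postorder traversal: free both subtrees before freeing the node itself. A minimal sketch, assuming the warmup's deallocate utility takes the root pointer and uses the node fields shown there:

    void deallocateTree(EncodingTreeNode* tree) {
        if (tree == nullptr) return;    // empty tree: nothing to free
        deallocateTree(tree->zero);     // free both subtrees first (postorder)
        deallocateTree(tree->one);
        delete tree;                    // then the node itself, exactly once
    }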

Testing

  • The starter code includes provided tests that test basic operation of each function on a simple known input. These cases are good for early testing. You should supplement with other small test cases that are similarly very targeted.
  • Constructing large-scale tests by hand is quite tedious and not something we recommend. When ready for more comprehensive tests, instead move on to an end-to-end test strategy across the full compress -> decompress cycle. Our provided main console program runs compress -> decompress on files, giving you easy access to larger and more varied test inputs than would be reasonable to construct by hand. (Sample student tests in this spirit appear after this list.)
  • Your implementation should be robust enough to compress all types of files: text, binary, image, or even a file that has previously been compressed. Your program probably won’t be able to further squish an already compressed file (and in fact, it can get larger because of the additional overhead of the flattened tree), but it should be possible to compress a file multiple times, decompress it the same number of times, and return to the original file.
  • On what sorts of files do you expect Huffman to achieve particularly good compression? On what sorts of files will it be less effective? Are there files that grow instead of shrink when they are Huffman encoded? Create sample files to test out your theories.
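
For example, round-trip and error-case student tests might look like the following. This assumes the SimpleTest macros used by the provided tests in the starter code; the inputs are just placeholders:

    STUDENT_TEST("compress -> decompress round trip restores the original text") {
        string input = "happy hip hop";
        EncodedData data = compress(input);
        EXPECT_EQUAL(decompress(data), input);
    }

    STUDENT_TEST("buildHuffmanTree errors on fewer than two distinct characters") {
        EXPECT_ERROR(buildHuffmanTree("aaaa"));
    }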

Extensions

There are many, many other techniques based on Huffman encoding that you can use to get even better data compression. Here are a few concepts to look up if you’d like to learn more about this.

  • Adaptive Huffman coding. Huffman encoding works by looking at the entire piece of text to compress, finding the global frequencies of each character, and then building a single encoding tree. However, in some files – especially image data – different regions of the file will likely have wildly different frequencies of each letter. For example, in a picture of a yellow sunflower against a bright blue sky, the parts of the file that store the sky data will have most of their bits dedicated to storing different shades of blue, and the parts of the file storing the sunflower will primarily store bits used to hold shades of yellow. For such patterns, it can be worth the effort to adjust the encoding while running through the file. Adaptive Huffman coding slices the file into smaller sections and uses a different encoding tree per section.
  • LZW compression. The LZW algorithm was developed by Lempel, Ziv, and Welch. This extension on Huffman coding achieves higher compression by additionally assigning short codes to common letter sequences, instead of just common letters. For example, the letter 'r' is often followed by the letter 'e', so we could treat the sequence "re" as just another letter when assigning codes.
  • Shannon entropy. Huffman encoding is based on the assumption that, when reading characters from the file, each character is chosen randomly, with the likelihood of any character coming up being directly proportional to the number of times that letter appears in the overall file. Under this assumption, you can show that the theoretical limit of any compression algorithm for a file is related to a quantity called the Shannon entropy of the distribution. The Shannon entropy is actually relatively easy to compute, and if you’re looking for a nifty “how does theory meet practice” exercise, take a look at the file sizes generated by Huffman encoding and see how they compare against the theoretical lower bound.
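
As a starting point for that exercise, here is a minimal sketch of computing the Shannon entropy (in bits per character) of a string, from which the theoretical minimum compressed size follows. This is plain standard C++, independent of the assignment code, and ignores the overhead of storing the flattened tree:

    #include <cmath>
    #include <map>
    #include <string>

    // Shannon entropy in bits per character: H = -sum over chars of p * log2(p)
    double shannonEntropy(const std::string& text) {
        std::map<char, int> counts;
        for (char ch : text) counts[ch]++;
        double entropy = 0.0;
        for (const auto& entry : counts) {
            double p = static_cast<double>(entry.second) / text.size();
            entropy -= p * std::log2(p);
        }
        return entropy;
    }

    // theoretical lower bound on the compressed size, in bytes
    double minCompressedBytes(const std::string& text) {
        return shannonEntropy(text) * text.size() / 8;
    }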