Today we dive into the wonderful world of hashing!
- Readings: Text 15.3
- Lecture quiz on Canvas
- Lecture video on Canvas
Contents:
1. Introduction and Overview
2. Preliminaries: Approaches to Storing and Retrieving Student Records by Student ID
3. Hash Tables: Overview
4. Linear Probing (Collision Resolution Policy 1 of 2)
5. Supplementary: Linear Probing in Code
6. Best- and Worst-Case Runtimes for Insertion with Linear Probing
7. Average-Case Runtime for Insertion with Linear Probing
8. Additional Data for Average-Case Analysis
9. The Impact of Repeated Collisions and the Role of Probability in Hashing
10. Search with Linear Probing
11. Clustering and Table Size with Linear Probing
12. Separate Chaining (Collision Resolution Policy 2 of 2)
13. Runtimes for Insertion, Search, and Deletion with Separate Chaining
14. Properties of Good Hash Functions
15. HashSet and HashMap
16. A Cautionary Note About the Expression of Hash Runtimes
17. Further Exploration
18. What's next?
19. Exam Prep, Including a Super Important Exercise
Introduction and Overview
We began lecture today with a discussion of tradeoffs between a variety of approaches to (efficiently) storing and retrieving student records associated with unique student IDs. I then introduced hash tables and hash functions. We examined two collision resolution policies (linear probing and separate chaining) and explored the runtimes of our insertion and search operations. We also discussed some properties of good hash functions.
(Important note!) Before we dive in, I just want to say that it's hard to overstate just how ✨ AMAZiNG ✨ hash tables are! Let me say that a bit larger for appropriate emphasis:
Hash tables are ✨ AMAZiNG ✨!
The way they work is pretty incredible, as is the fact that they give us O(1) runtimes for most operations while only using O(n) space to store n elements. These attributes make hash tables super powerful. They are widely used for handling vast amounts of data at scale and solving complex problems efficiently. They're also the answer to a lot of interview problems. If you're preparing for tech interviews, I encourage you to do a deeper dive on this incredible data structure soon. Some of the supplementary notes included in today's write-up can serve as a launching point for that exploration.
Preliminaries: Approaches to Storing and Retrieving Student Records by Student ID
I began today with the following problem:
Suppose we want to store 17,000 student records and be able to look them up efficiently. Knowing that every student at Stanford has a unique 8-digit ID, how could we do that?
The ideas we explored were:
- Direct indexing in an obscenely large array. Create an array of length 100 million (since there are 100 million possible IDs on the range 00000000 through 99999999). We could then use each ID as the index where we would store the corresponding student's data in the array (perhaps in a struct that is filled with information about that student). So, if a student's Stanford ID were 05883264, we would store the data for that student at index 05883264 in the array, like so:
| ... | (data) | ... |
        ^ index 05883264
- This approach would give us O(1) insertion, deletion, and search operations, but we would have a lot of wasted space, especially since we only want to store data for 17,000 active students.
- Binary searching a more reasonably sized, sorted array. Alternatively, if we know we only have to store data for, say, 17,000 students at any given time, we could create an array of length 17,000, insert records into that array, sort the array by the IDs, and then use binary search to look up any student record as we need it. We'd save a lot of space compared to the first approach, but we have a setup cost of O(n log n) for the sorting, and search becomes an O(log n) operation (in comparison to the O(1) operations afforded by the previous approach). Inserting a new student record would also be problematic, even if we left some extra space in the array, because we would potentially have to re-sort the entire array or perform an O(n) operation to move elements over to accommodate new ones at the beginning of the array.
- Balanced BSTs. We could also throw the elements into a balanced BST, which would give us worst-case logarithmic runtimes for individual insertion and search operations. Constructing the initial tree would take O(n log n) time. Like the binary search approach above, this would use O(n) space, as the tree would contain exactly one node per element. This would take more space than a sorted array, since each node would contain two pointers, but would allow for more efficient insertion of new elements than the sorted array approach above. (This idea came up in class today when someone suggested using a map of student IDs to student records. Recall that the map in the Stanford C++ Library uses a balanced BST to store its keys.)
- Tries. I briefly mentioned we could create a tree where each node has 10 children -- one for each digit, 0 through 9 -- and use each digit of the student ID to tell us what child node to go to next at each level of the tree. This idea is a variation of a data structure called a "trie," which is a super awesome data structure that you should look into on your own time at some point!
(Key take-away!) The approaches above highlight a classical trade-off in computer science: sometimes using extra space can get us much better runtimes. On the other hand, in a space-constrained system, we might find ourselves forced to sacrifice runtime efficiency for the sake of getting a solution that will operate with a limited amount of memory.
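To make the second approach above (binary searching a sorted array) a bit more concrete, here is a minimal sketch of a lookup using std::lower_bound. This is not code from lecture; the StudentRecord struct and the sample data (borrowed from the hash table example later in these notes) are just for illustration.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct StudentRecord
{
    int id;             // 8-digit Stanford ID
    std::string name;
};

int main()
{
    // Already sorted by ID. In general, we'd pay O(n log n) up front to sort.
    std::vector<StudentRecord> records = {
        {6028532, "Rushank"}, {6194315, "Xander"}, {6233639, "Ila"},
        {6335094, "Anjali"}, {6682497, "Shreya"}, {6846021, "Jason"}
    };

    // O(log n) lookup by ID using binary search.
    int target = 6335094;
    auto it = std::lower_bound(records.begin(), records.end(), target,
        [](const StudentRecord& r, int id) { return r.id < id; });

    if (it != records.end() && it->id == target)
    {
        std::cout << "Found: " << it->name << std::endl;   // prints "Anjali"
    }
    return 0;
}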
Hash Tables: Overview
The new data structure we saw today -- the hash table -- combines the best of both worlds from the two solutions we explored to the problem above. It generally supports insert, delete, and search operations that have O(1) runtimes in the average case, but it only uses O(n) memory (where n is the number of elements we're storing).
Here's the fundamental nature of a hash table: It's just an array coupled with a hash function that takes an element as its input (which we call the "input" or the "key") and returns a number (called a "hash code" or "hash value") that we use to find an index for that key in our array:
| "input" or "key" → | hash function | → "hash value" or "hash code" (an integer) |
(Key take-away!) A hash function typically produces a large range of values, and we mod each hash code by the length of our hash table (i.e., our array) to get a valid index in that array. If you encounter a hash function in the wild that you want to use for some application, it's up to you to mod the values it returns to ensure you get a valid array index.
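For example (this helper is just an illustration, not something from lecture), turning an arbitrary hash code into a valid index might look like this. The extra "+ tableSize" guards against hash codes that have overflowed into negative territory, since in C++ a negative value modded by a positive one can come out negative (a pitfall we'll revisit later in these notes):
// Maps any hash code (even a negative one) to a valid index in [0, tableSize).
int indexFor(int hashCode, int tableSize)
{
    return ((hashCode % tableSize) + tableSize) % tableSize;
}

// Example: in C++, -7 % 10 is -7, which is not a valid array index.
// indexFor(-7, 10) instead returns ((-7 % 10) + 10) % 10 = 3.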
We saw a very simple example of a hash table in which we wanted to store eight student records in memory, and our hash function simply took each student record as input and returned the student ID as the hash code.
Here are eight names and Stanford IDs I gathered from my class in a previous quarter. We encountered very similar results when we did this live in class today, as well:
| Name | Student ID (hash code) | Initial index in hash table (hash code % array length) |
| Jason | 06846021 | 1 |
| Shreya | 06682497 | 7 |
| Ila | 06233639 | 9 |
| Xander | 06194315 | 5 |
| Anjali | 06335094 | 4 |
| Rushank | 06028532 | 2 |
| Michael | 06629695 | 5 |
| Jack | 06211096 | 6 |
(Each person only gave the last three digits of their ID in class, but I only preserved the last two digits, just to be on the safe side. The rest of each ID is fabricated.)
To be clear, our hash function in this case is taking an entire student record and returning just the student ID portion of that record as our hash code:
| student record → | hash function | → hash code (student ID) |
After inserting the first six of those records into the hash table, we'll have something that looks like this:
| | Jason | Rushank | | Anjali | Xander | | Shreya | | Ila |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Here's the big problem: When we try to insert Michael's record into the hash table, our hash function tells us to go to index 5, which is already occupied by Xander's record. When multiple keys map to the same position in a hash table, we call that a "collision." The rest of our lecture today revolved around mechanisms for resolving collisions in hash tables (and how that impacts insertion, search, and deletion in a hash table).
Linear Probing (Collision Resolution Policy 1 of 2)
With linear probing, if we encounter a collision, we simply search linearly for the next available space in the hash table. If necessary, we wrap back around to the beginning of the array. (To do that, just mod by the length of the array, often called table_size.)
For example, when we try to insert Michael's record into the hash table above, we see there's already an element at index 5. We then move on to index 6. Since index 6 is empty, we insert Michael's record into that position. The hash table now looks like this:
| | Jason | Rushank | | Anjali | Xander | Michael | Shreya | | Ila |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Finally, when we try to insert Jack's record into the hash table, we see there's already an element at index 6. We then move on to index 7 and observe that it's occupied, as well. So, we move forward to the next spot (index 8). Since index 8 is empty, we insert Jack's record into that position. The hash table now looks like this:
| | Jason | Rushank | | Anjali | Xander | Michael | Shreya | Jack | Ila |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Supplementary: Linear Probing in Code
In code, our insertion algorithm with linear probing would look something like this:
void HashTable::insert(ElementType key)
{
int hashCode = hash(key);
// We usually use 'size' to refer to the number of elements in a
// data structure, and 'capacity' to refer to its overall capacity.
// People often break from that tradition with hash tables, however,
// and refer to the capacity of the array as the "table size." Hence
// the variable name below.
for (int i = 0; i < _tableSize; i++)
{
// WARNING! This is problematic if we have a hash function that
// is capable of producing negative hash codes, which is actually
// quite common. A lot of hash functions let hash codes grow and
// grow and grow, to the point where they might encounter integer
// overflow and become negative. In Python, modding a negative
// integer by a positive one produces a non-negative result, but
// in C++ (as well as C and Java), that will produce a negative
// result. We need to be careful not to use a negative integer
// as an array index.
//
// Note that taking the absolute value of index is NOT a solid
// solution to this problem, as the absolute value of
// numeric_limits<int>::min() is not a valid int value.
int index = (hashCode + i) % _tableSize;
if (_array[index] == EMPTY)
{
_array[index] = key;
_numElements++;
break;
}
}
}
In the code above, I'm assuming the following:
- insert() is a member function in a HashTable.
- ElementType is the type for all the elements in our hash table (such as StudentRecord).
- hash() is a function that produces hash codes for the given element type.
- _array is a class member that holds the elements in our hash table.
- EMPTY is a constant used to indicate that a cell in our array does not contain an element.
- _tableSize is a class member that keeps track of the capacity of _array.
- _numElements is a class member that keeps track of the number of elements in our hash table.
A more robust insertion function would check whether the array is full before kicking off the linear probing loop and expand the array if so (see the sketch below).
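For completeness, here is a rough sketch (not something we wrote in class) of what that check-and-expand step might look like, assuming _array is a dynamically allocated ElementType array and that this hypothetical helper is called at the top of insert(). Real implementations typically expand before the table is completely full (for example, once a load factor threshold is crossed), but the basic mechanics are the same:
void HashTable::expandIfNeeded()
{
    if (_numElements < _tableSize)
    {
        return;   // there's still at least one open cell; nothing to do
    }

    int oldTableSize = _tableSize;
    ElementType* oldArray = _array;

    // Double the table length and start with every cell empty.
    _tableSize = oldTableSize * 2;
    _array = new ElementType[_tableSize];
    for (int i = 0; i < _tableSize; i++)
    {
        _array[i] = EMPTY;
    }

    // Re-insert every element from scratch. We can't just copy cells over,
    // because each element's index depends on the (new) table size.
    _numElements = 0;
    for (int i = 0; i < oldTableSize; i++)
    {
        if (oldArray[i] != EMPTY)
        {
            insert(oldArray[i]);
        }
    }
    delete[] oldArray;
}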
Best- and Worst-Case Runtimes for Insertion with Linear Probing
The best-case runtime for insertion into a hash table using linear probing comes when our hash function sends us to an empty cell in the array. In that case, we encounter O(1) insertion. We encounter that best-case runtime with the first six insertions in the linear probing example given above in today's notes.
The worst-case runtime occurs when all our elements have formed a cluster, and we need to loop through all of them before finding an open position in our table. Suppose, for example, that our hash table looks like this:
| Frank | Tamsen | Hanan | | Arya | Asha | Dierdre | Ptolemy | Alexa | Fred |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Suppose we try to insert a new record for "Bobo" into the table above, and our hash function sends us to index 4. We would have to loop through every single element in that array before finding the only open position (at index 3). That would be an O(n) operation and corresponds to the worst case for insertion.
So far, we have the following runtimes:
| Operation | Best Case | Worst Case |
| Insertion | O(1) | O(n) |
Average-Case Runtime for Insertion with Linear Probing
Given our best-case runtime of O(1) and our worst-case runtime of O(n), the next reasonable question to ask is: what's our average-case runtime? Is it closer to O(n) or O(1)? Is it somewhere in the middle, such as O(log n)?
To figure that out, let's count the number of comparisons ("read operations") we encountered as we inserted the above elements into our hash table and see what the average number of comparisons was per operation:
- To add Jason to the table, we had to check whether index 1 was empty. It was, and so we added Jason to the hash table at that index after just a single comparison. That's true of the first six keys we inserted into the table, actually.
- To add Michael to the hash table, we checked two indices (5 and 6) in order to find an open spot.
- To add Jack to the hash table, we checked three indices (6, 7, and 8) in order to find an open spot.
So, the total number of comparisons we performed for all eight insertion operations was 1 + 1 + 1 + 1 + 1 + 1 + 2 + 3 = 11.
We went through 11 comparisons to perform 8 insertion operations, which is a rate of 11/8 = 1.375 comparisons per operation. That's a very small constant number of comparisons. We're getting average runtimes of O(1).
(Key take-away!) The key take-away here is that even though some insertions might be more expensive than others, we'd expect the vast majority of them to require only one or two comparisons if we have a good hash function and a fairly uniformly distributed set of keys (inputs), which means that the runtime for insertion averages out to O(1) overall.
To summarize, our insertion runtimes are:
| Operation | Best Case | Average Case | Worst Case |
| Insertion | O(1) | O(1) | O(n) |
Additional Data for Average-Case Analysis
For an additional example, here are the names and hash codes from when I did this demo in another one of my classes a few years ago. All nine of these came from actual students:
| Name | Student ID (hash code) | Initial Index | Comparisons |
| Chase | 3918939 | 9 | 1 |
| Easton | 1318625 | 5 | 1 |
| Patrick | 9846224 | 4 | 1 |
| Mike | 8746232 | 2 | 1 |
| Adrian | 9899820 | 0 | 1 |
| Austin | 8741265 | 5 ** Collision ** | 2 |
| Matthew | 9864733 | 3 | 1 |
| Alex | 9481227 | 7 | 1 |
| Sebastian | 4985574 | 4 ** Collision ** | 5 |
That leads to an average of 14/9 ≈ 1.56 comparisons per operation -- again, a small constant.
In another section of one of my classes, we had similar results. Again, all nine of these came from actual students a few years ago:
| Name | Student ID (hash code) | Initial Index | Comparisons |
| Matthew | 6482243 | 3 | 1 |
| Ryan | 9422411 | 1 | 1 |
| Eryn | 8455428 | 8 | 1 |
| Jack | 9631657 | 7 | 1 |
| Jeffrey | 4912362 | 2 | 1 |
| Steven | 4681278 | 8 ** collision ** | 2 |
| Josh | 8492145 | 5 | 1 |
| Ashley | 9812360 | 0 | 1 |
| Charles | 9841138 | 8 ** collision ** | 7 |
That leads to an average of 16/9 ≈ 1.78 comparisons per operation -- again, a small constant.
The Impact of Repeated Collisions and the Role of Probability in Hashing
(Important note!) If every student hashed to the same index, inserting all n records would result in an O(n²) runtime. (It's that Gauss sum we keep seeing: 1 + 2 + 3 + ... + n.) Even if you have a good hash function, there's no way to get around this if you just happen to get unlucky and encounter a set of inputs that all map to the same index in the hash table.
I mentioned this in class today: it's totally conceivable that if we polled 10 random students on campus for their student IDs, all 10 of them could have IDs ending in the same number -- in which case they'd all collide in our hash table. In that unlucky situation, our nice O(1) runtimes go out the window. There's no way to guard against an unlucky set of inputs, but notice that the probability of that happening is very low. Probability is working in our favor here and plays a pivotal role in keeping our collision count low -- both in this example and in hashing more generally.
Search with Linear Probing
Now, consider the implications that linear probing has for search (retrieval) in the hash table we constructed above. Here's the table again, for your convenience:
| | Jason | Rushank | | Anjali | Xander | Michael | Shreya | Jack | Ila |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
What if we want to look up Jack, whose hash code is 6? We have to perform linear probing to search for that element. We'll end up looking at indices 6, 7, and 8 before we stop searching.
Both insertion and retrieval with linear probing are O(n) operations in the worst case, which happens when we have lots of collisions or big clusters of data with no gaps, and we have to go through all n elements in the hash table before we find what we're looking for. With search, we still have best-case runtimes of O(1) and expected (or average-case) runtimes of O(1).
To summarize:
| Operation | Best Case | Average Case | Worst Case |
| Insertion | O(1) | O(1) | O(n) |
| Search | O(1) | O(1) | O(n) |
With linear probing, we might observe that there are three conditions that could cause us to stop searching for a key:
1. We find the key we're looking for.
2. We loop through every index in the hash table without finding the key we're looking for or any empty cells.
3. We find an empty space in the table, since a key would never skip an empty space during insertion and end up at a later index in the array.*
(*) The claim above in (3) about being able to stop when we see an empty space is actually quite problematic if we allow deletion from our hash table!
For example, suppose we delete "Michael" from the table we came up with above and then search for "Jack" (starting at index 6). We don't want to stop our search upon seeing that index 6 is empty. We need to keep probing forward to find Jack at index 8.
This has rather concerning implications for search. If we have to keep probing forward any time we see an empty cell, then our search runtime for something that is not in the table will always be O(n); we'll have to look at every index in the array before determining that our target isn't there, because every cell is either empty (which causes us to keep searching) or some element other than what we're searching for (which also causes us to keep searching).
Can you see a way to fix this problem? How can we allow deletion but still allow our search algorithm to stop searching when it sees an empty cell? Note that rearranging the elements in our hash table any time we perform a deletion wouldn't be viable (at least not if we care about efficiency) because any collisions that happened prior to the deletion could have caused elements to end up very far away from the original positions they mapped to with the (hashCode % tableSize) formula. Recovering those elements and trying to undo all the chain reactions caused by various collisions that got them there would be a huge mess.
Highlight for a solution: One potential solution to this problem is to use some sort of flag to mark a cell as "dirty" whenever we delete an element from the table. A "dirty" cell is empty in the sense that we can insert a new element at that index. So, when performing an insertion, if we hit a "dirty" cell, we stop and insert at that index. However, if we encounter a "dirty" cell while searching the table for an element, we don't stop; we keep probing forward, because it's possible the thing we're looking for was inserted into the hash table at a time when that dirty cell was occupied, causing the element we're looking for to end up in a cell after that one. This presents a new problem, which is that if we perform a lot of deletions, we could end up with a lot of "dirty" cells in our hash table, and those can slow down our search runtimes. A potential workaround there is to simply keep track of what proportion of cells in a hash table are "dirty," and if that proportion gets too high, simply re-insert all the remaining elements into a fresh hash table.
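In code, search with linear probing and those "dirty" markers might look something like the following sketch. It reuses the class members assumed in the insertion code earlier, plus a hypothetical DIRTY constant that marks deleted cells; this is an illustration, not code we wrote in class, and the same caveat about negative hash codes applies here:
bool HashTable::contains(ElementType key) const
{
    int hashCode = hash(key);
    for (int i = 0; i < _tableSize; i++)
    {
        int index = (hashCode + i) % _tableSize;
        if (_array[index] == EMPTY)
        {
            // A truly empty (never-used) cell means the key we're looking for
            // could not have been probed past this point during insertion.
            return false;
        }
        if (_array[index] == key)
        {
            return true;
        }
        // Otherwise, this cell holds the DIRTY marker or some other key.
        // Either way, keep probing forward.
    }
    return false;   // we scanned the whole table without finding the key
}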
Clustering and Table Size with Linear Probing
One of the problems we keep seeing with linear probing is that if we end up with all our elements in one big cluster, a single search or insertion operation can incur an O(n) runtime.
One trick for getting around that is to maintain an array of length 2n, which is still O(n) space complexity. By using a table of length 2n, we expect our elements to spread out a bit more and have breathing room interspersed between them. Having some gaps spread throughout our array means that insertion and search will never result in O(n) runtimes.
For example, suppose we create an array of length 16 to hold the 8 student records from above. For reference, here are those hash codes again, along with the initial indices we get when we mod by the new array length of 16:
| Name | Student ID (hash code) | Initial index in hash table (hash code % array length) |
| Jason | 06846021 | 5 |
| Shreya | 06682497 | 1 |
| Ila | 06233639 | 7 |
| Xander | 06194315 | 11 |
| Anjali | 06335094 | 6 |
| Rushank | 06028532 | 4 |
| Michael | 06629695 | 15 |
| Jack | 06211096 | 8 |
Here's what we get if we insert those into a table of length 16:
| | Shreya | | | Rushank | Jason | Anjali | Ila | Jack | | | Xander | | | | Michael |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
Notice that there's some breathing room in our table now. That means that no insertion or search operation can cause us to probe through all eight elements. Consider this:
- If our next insertion maps to index 0, 2, 3, 9, 10, 12, 13, or 14, we insert immediately without any further probing. That's as fast as insertion can get, and there's a 50% chance of that happening.
- If our next insertion maps to index 1, 8, 11, or 15, we have to look at a total of two cells in order to find an empty one (the one we map to initially and the one after that, which is empty). That's also still very fast, and we have a 25% chance of that happening.
- To varying degrees, landing on any of the other cells requires more work than the two cases listed above. Landing on index 6 initially is more expensive than landing on index 8, and landing on index 4 initially is even more expensive still. We have only a 25% chance that our next insertion will land on one of those more expensive cells initially (indices 4, 5, 6, or 7).
In contrast, consider the following table, which is almost full:
| Frank | Tamsen | Hanan | | Arya | Asha | Dierdre | Ptolemy | Alexa | Fred |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
- The only insertion that won't involve any probing is if we map to index 3 initially. We have only a 10% chance of that happening -- much lower than the 50% in the larger table above.
- Similarly, the only insertion that looks at exactly two cells before stopping is where we map to index 2 and stop probing at index 3. Again, we have only a 10% chance that our next insertion will do that, which is much lower than the 25% chance we had with the larger table.
- All other indices result in slower insertions than the two cases listed above. We have an 80% chance of encountering one of those indices (compared to only a 25% chance of that happening in the larger table), and some of those probing operations would be more expensive than the worst probing operation we could encounter with the larger table.
More formally, we could compute the expected value of the number of cells we will have to examine in the next insertion operation. For the first table, with all its breathing room, our next insertion is expected to examine just 2.125 cells in order to find an open position. In the latter table, we expect to have to examine 5.5 cells, which is more than double the work required for the previous one.
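If you want to see where those numbers come from: in the length-16 table, 8 of the 16 possible starting indices cost 1 cell, four of them (1, 8, 11, and 15) cost 2 cells, and starting at indices 7, 6, 5, and 4 costs 3, 4, 5, and 6 cells, respectively. That gives an expected cost of (8 × 1 + 4 × 2 + 3 + 4 + 5 + 6) / 16 = 34/16 = 2.125 cells. In the nearly full table, starting at index 3 costs 1 cell, starting at index 2 costs 2 cells, and so on, all the way up to starting at index 4, which costs 10 cells. That gives an expected cost of (1 + 2 + 3 + ... + 10) / 10 = 55/10 = 5.5 cells.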
In general, keeping our hash table between 25% and 50% full can help ensure that we are less likely to encounter expensive, O(n) insertion or search operations.
Separate Chaining (Collision Resolution Policy 2 of 2)
After discussing linear probing at length, we turned to another collision resolution policy: separate chaining (sometimes called "chaining").
The idea behind separate chaining is to maintain an array of linked lists. If two elements collide, rather than having to do any probing, we can simply add those elements to the same linked list. For example, using the same student records from the top of today's notes with an array of 10 linked lists, we would get the following:
Records and Hash Codes:
| Name | Student ID (hash code) | Initial index in hash table (hash code % array length) |
| Jason | 06846021 | 1 |
| Shreya | 06682497 | 7 |
| Ila | 06233639 | 9 |
| Xander | 06194315 | 5 |
| Anjali | 06335094 | 4 |
| Rushank | 06028532 | 2 |
| Michael | 06629695 | 5 |
| Jack | 06211096 | 6 |
Separate Chaining Diagram:
| 0 | nullptr |
| 1 | → Jason |
| 2 | → Rushank |
| 3 | nullptr |
| 4 | → Anjali |
| 5 | → Michael → Xander |
| 6 | → Jack |
| 7 | → Shreya |
| 8 | nullptr |
| 9 | → Ila |
The array itself does not hold any student records. Each cell in the array holds a pointer to the head of a linked list. There are 10 separate linked lists (three of which are empty, as indicated by the nullptr pointers). In the diagram above, the left column represents the array cells (head pointers), and the names to the right of the arrows are linked list nodes.
Here are some key notes about separate chaining:
- When we encounter a collision, we simply throw our elements into the same linked list. Since a linked list can hold multiple elements, there is no need to probe forward to a different cell in the array.
- New elements typically get inserted at the head of a linked list. There are two reasons for this: (1) this gives us a fast, O(1) insertion without any need to maintain a tail pointer, and (2) elements that have been accessed recently tend to be more likely to be accessed again sometime soon, so having those at the beginning of our lists could help with our runtimes.
- As an aside, I mentioned that we could also order our linked lists by access frequency. To do that, we would add a counter to each node that we increment every time that record is looked up in our hash table, and we would stop every now and then to sort our nodes by those access counts.
- Keeping these lists short is essential for maintaining good runtimes with separate chaining. If any one linked list gets too long, traversing it becomes a slow operation.
- Related to the previous point: we typically keep track of a load factor when using separate chaining. The load factor is n/b, where n is the number of elements in our hash table and b is the number of "buckets" (or linked lists). This tells us the expected number of elements per bucket. In the table above, our load factor is 8/10 = 0.8. If a load factor is too high, our lists are getting too long. If a load factor is too low, we might have an excessive number of empty linked lists, meaning we would have a lot of wasted space in our array of pointers.
- With separate chaining, it's still possible to get unlucky and have all our elements collide into the same bucket. In that case, we end up with one long linked list, which leads to O(n) search, insertion, and deletion runtimes.
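In code, insertion with separate chaining might look something like the following sketch. Here I'm assuming a hypothetical Node struct (holding a key and a next pointer) and a _buckets member that is an array of _tableSize head pointers; this mirrors the assumptions in the linear probing code above, but it is not code we wrote in class:
void HashTable::insert(ElementType key)
{
    // As before, a robust version would guard against negative hash codes.
    int index = hash(key) % _tableSize;

    // Disallow duplicates: walk the (hopefully short) chain for this bucket.
    for (Node* cur = _buckets[index]; cur != nullptr; cur = cur->next)
    {
        if (cur->key == key)
        {
            return;   // already present; nothing to do
        }
    }

    // Insert at the head of the chain: O(1) once the duplicate check is done.
    Node* node = new Node;
    node->key = key;
    node->next = _buckets[index];
    _buckets[index] = node;
    _numElements++;
}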
Runtimes for Insertion, Search, and Deletion with Separate Chaining
Following are the runtimes for hash table operations with separate chaining. The average-case runtimes of O(1) are based on the expectation that each linked list will contain a small constant number of elements.
(Important note!) We sometimes articulate these runtimes as O(n/b) (referring to the load factor). We expect this to be a small constant, however, which is where the O(1) runtimes come from in the following table. The O(n) runtimes correspond to situations where all our elements end up colliding into a single bucket.
| Operation | Best Case | Average Case | Worst Case |
| Insertion | O(1) | O(1) | O(n)* |
| Search | O(1) | O(1) | O(n) |
| Deletion | O(1) | O(1) | O(n) |
(*) It might seem counterintuitive to say the worst-case runtime for inserting at the head of a linked list can incur an O(n) runtime, but that's because we typically disallow the insertion of duplicate elements in a hash table. So, when performing an insertion, we first need to search the linked list we're inserting into in order to make sure it doesn't already contain the element we're inserting. If our list contains all of the table's n elements, the search operation that precedes our insertion can have an O(n) runtime. If we are willing to allow the insertion of duplicates into a hash table for some reason, we can achieve worst-case insertion runtimes of O(1).
Properties of Good Hash Functions
At the very end of today's lecture, I mentioned a few important properties of good hash functions. While the design of hash functions falls outside the scope of this class, having a good, working understanding of these properties is essential to understanding how and why hashing is so awesome:
- A hash function should be deterministic.
- A hash function must be deterministic (i.e., given the same input, a hash function must always produce the same output). A non-deterministic hash function (one that involves random number generation, for example) would be disastrous. If it mapped a record to index 3 upon insertion, but then mapped a record to index 8 when we went to look it up later, we'd lose our ability to efficiently look up the keys we had inserted into our hash tables.
- Given a uniform set of inputs, a hash function should produce a uniform set of hash codes.
- One of the most important properties I mentioned is that a good hash function will produce a nice, uniform distribution of hash codes when given a nice, uniform distribution of inputs to process. Notice that if we gathered up 10 Stanford students at random, we'd expect approximately one of them to have a student ID ending in 0, one to have an ID ending in 1, one to have an ID ending in 2, and so on. Assuming Stanford IDs are uniformly distributed in terms of their ones digits (which I strongly expect to be the case), our hash function would put about 10% of students at index 0 in the array, 10% at index 1, 10% at index 2, and so on.
- A bad hash function is one that would map, say, 70% of students to index 1 and 30% of students to index 8. That would result in a lot of collisions, which leads to sucky runtimes. If we insert n elements into a hash table and they all collide to the same position, we end up with an overall runtime of O(n²) to insert all those elements.
- The function should produce a large range of values.
- A good hash function should also produce a large range of values so that it can be used with hash tables that are both large and small. If we create a hash table of length 10,000 and insert thousands of keys into it, but our hash function only produces values 0 through 9, then we'll have a lot of collisions, which will drastically slow down the runtimes of our insertion operations. (Recall that when a hash function produces a hash value that exceeds the length of your array, we always mod that hash value by the array length to figure out what index to go to.)
- (Important note!) It's not uncommon for hash functions to allow the hash codes they produce to grow so large that they actually cause integer overflow. For that reason, it can be dangerous to simply mod a hash code by our hash table length and assume we have a valid array index. Depending on the language we're in, modding a negative integer by a positive integer could yield a negative result. (I believe that's the case for C++, C, and Java, but not Python.) If we were coding up an industry-grade implementation of hash tables, we would want to guard against that and avoid trying to access a negative array index.
- The function should produce very different hash codes for similar inputs.
- The utility of this property is perhaps a bit more obscure than the others. In the real world, data tends to cluster. So, if we're entering a bunch of data into a hash table, we might expect to encounter several values that are close to one another in that dataset. Having those values spread out across the hash table could be useful in avoiding clusters in our table -- and remember, clusters lead to slow runtimes.
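To make a couple of these properties more concrete, here is a sketch of a common polynomial-style string hash function. The constants (5381 and 31) are just illustrative choices, not something from lecture. Because every character gets mixed into the running total, similar strings like "cat" and "cab" end up with very different hash codes, and because the total is allowed to wrap around, the function produces a huge range of values (including, once cast back to int, possibly negative ones -- hence the warning above):
#include <string>

int hashString(const std::string& s)
{
    unsigned int h = 5381;   // unsigned, so overflow simply wraps around
    for (char c : s)
    {
        h = h * 31 + static_cast<unsigned char>(c);
    }
    return static_cast<int>(h);   // may come out negative!
}

// Note that this loops over all k characters of the string, which is where
// the O(k) cost mentioned later in these notes comes from.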
HashSet and HashMap
I mentioned in class that the Stanford C++ Library has HashSet and HashMap classes:
These classes are powered using hash tables (as opposed to the balanced BSTs that drive the Set and Map classes). They offer us O(1) runtimes (as opposed to the logarithmic runtimes we see with various Set and Map operations), but the tradeoff here is that when we iterate over the elements in a HashSet or the keys in a HashMap, they are not guaranteed to be in sorted order.
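Here's a quick sketch of what using these classes looks like. I'm assuming the usual Stanford collection interface here (add/contains for sets; put/get/containsKey for maps) -- check the library documentation for the exact details:
#include <iostream>
#include <string>
#include "hashmap.h"
#include "hashset.h"
using namespace std;

int main()
{
    HashSet<int> ids;
    ids.add(6846021);                          // average-case O(1) insertion
    cout << ids.contains(6846021) << endl;     // average-case O(1) lookup (prints 1)

    HashMap<int, string> names;
    names.put(6846021, "Jason");               // average-case O(1) insertion
    if (names.containsKey(6846021))            // average-case O(1) lookup
    {
        cout << names.get(6846021) << endl;    // prints "Jason"
    }

    // Iterating over ids or names would visit elements in no particular order.
    return 0;
}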
A Cautionary Note About the Expression of Hash Runtimes
(Not mentioned in class; please review.)
Note that people often present the runtimes for insertion, search, and deletion operations on hash tables as O(1) without further qualification, despite the fact that our worst-case runtimes for those operations could be O(n). This is a widely used convention, which breaks from the tradition of defaulting to the worst case when presenting an unqualified runtime. Such is the strength of our expectation that these operations will tend to be O(1) on average.
These O(1) runtimes also have a hidden assumption: that the hash function we're calling is also O(1). Calling our hash function is requisite for each insertion, search, and deletion operation, and so the runtimes for those operations must be at least as bad as the runtime for our hash function. Many hash functions do, in fact, have O(1) runtimes, but a common exception is with strings. Many string hashing functions loop through all k characters in a string, resulting in (at least) O(k) runtimes just to produce a hash code. The process that follows after we have our hash code might be O(1), but if calling our hash function was O(k), the overall cost of our insertion, search, or deletion operation in that case is also O(k).
Further Exploration
Hashing is everywhere in CS, and with this one lecture, we have only scratched the surface of this complex, beautiful, and powerful topic. If you want to learn more about hashing, here are some directions you could go in:
- Cryptographic hash functions. Read up on how hashing is used in cryptographic applications, including password storage. Check out the notes on the Wikipedia page for cryptographic hash functions about checksums, digital fingerprints, MD5, and SHA-256, as well.
- Hash table expansion. Read up on hash table expansion. It's not particularly complex, but when we expand a hash table, we end up having to re-insert all our elements into the table from scratch. It's an expensive operation, but we tend to at least double the length of a hash table with each expansion, so it doesn't take many expansions for them to grow quite large. Look into how hash table expansion impacts the worst- and average-case runtimes for insertion operations.
- Deleted element markers. Look into how deletion works with linear probing. There are some comments about that in today's notes.
- Hash functions. Read more about hash functions. Consider looking up common approaches to hashing strings.
- Quadratic probing. Read up on quadratic probing, a collision resolution policy that is very similar to linear probing. What advantage does quadratic probing have over linear probing? If you look into this topic, you'll also bump into an interesting proof related to the ability of quadratic probing to find an empty cell (if one exists) when attempting to perform an insertion.
- Code up hash tables! It shouldn't be hard to find a good hash table assignment online, or if you're feeling ambitious, you could just code them up yourself from scratch. Consider plugging different hash functions into your hash tables and performing some experiments where you keep track of the average number of cells probed per insertion/search/deletion operation to help evaluate how good those various hash functions are.
What's next?
On Wednesday and Thursday, we'll talk about another super important data structure: graphs!
Next week, we'll tie up a few loose ends as we head into the final exam. On Monday, we'll implement a few neat graph algorithms. Tuesday will be a wrap-up day where we talk about end-of-quarter logistics.
We will likely take next Wednesday and Thursday off as study days. On Friday, we have the final exam.
Exam Prep, Including a Super Important Exercise
1. Insert the following elements into the hash table below using linear probing. Separately, insert the keys into a hash table using separate chaining. In both cases, use a table of length 10 (which means you will have to mod the hash code by 10 to kick off the insertion process).
int myHash(int key)
{
return (key - 31) * 2;
}
Keys to insert: 37, 61, 39, 46, 87, 92
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
2. Insert the following elements into the hash table below using linear probing. Separately, insert the keys into a hash table using separate chaining. In both cases, use a table of length 11 this time (which means you will have to mod the hash code by 11 to kick off the insertion process).
int myHash(string s)
{
if (s == "")
return 0;
else
return (int(tolower(s[0]) - 'a') + 1);
}
Keys to insert: "abacus", "dwindle", "fox", "goose", "bag", "beans", "ferocious", "delicate"
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
3. Suppose we want to use the hash function from the previous question to insert a much larger set of strings into our hash table (some arbitrarily large number of strings, n), and so we create a hash table whose initial length is O(n). Is the hash function in this problem a good hash function for that purpose? Why or why not?
4. If we're using linear probing, what is the worst-case big-oh runtime for inserting a single element into a hash table that already contains n elements? Briefly describe the specific situation that leads to this worst-case performance. What would be the best-case runtime for inserting n distinct elements into an initially empty hash table? What would be the worst-case runtime? What are the answers to this question if we instead use separate chaining?
5. Using the names and student IDs from today's notes, trace through the insertion of those student records into hash tables of length 10 using linear probing and separate chaining. Verify the number of comparisons for each insertion, and verify the calculations for the average number of comparisons per operation.
6. Knowing that hash tables give us O(1) runtimes and O(n) space complexity, why would we ever want to use any of the other data structures we've covered this quarter? In particular, can you think of anything that a linked list can do that a hash table won't do for us? What about a balanced BST? What about a priority queue?
7. Be sure to carefully read through today's notes to absorb any additional tidbits of information sprinkled throughout. Hash tables are an incredibly powerful and important data structure. It's worth investing some extra time to develop a more complete understanding of them.
Super Important Exercise:
8. Write an efficient* function named twoSum() that takes a vector of integers, v, and an integer value, target, and returns true if any two integers in v add up to the target value. Otherwise, return false.
For example:
- twoSum({5, 1, 3, 1, 9}, 8) would return true because we can select 5 and 3 to get our target: 5 + 3 = 8
- twoSum({5, 1, 3, 1, 9}, 2) would return true because we can select 1 + 1 = 2
- twoSum({5, 1, 3, 1, 9}, 9) would return false (we cannot take 9 alone; we must take exactly two values)
- twoSum({5, 1, 3, 1, 9}, 18) would return false (there is only one 9; we cannot take it twice to get 18)
(*) Note: One solution to Problem #8 would be to use nested for-loops to check whether each pair of integers adds up to the target sum, like so:
bool twoSum(Vector<int>& v, int target)
{
// Try every pair of integers to see if they sum to target.
for (int i = 0; i < v.size(); i++)
{
// Start at (i + 1) here so we don't try adding some
// element in the array to itself.
for (int j = i + 1; j < v.size(); j++)
{
if (v[i] + v[j] == target)
{
return true;
}
}
}
// If we get here, then none of the pairs add up to target.
return false;
}
The above solution is a brute force approach and has a slow, sad, O(n²) runtime. Can you come up with something faster?
Highlight for hint to Problem #8: Use a hash table! Specifically, a HashSet will be handy here.
Highlight for another hint to Problem #8: Consider the first example above. When we hit the 3 in the vector, it would be nice if there were a quick way to check whether we had already seen a 5 elsewhere in the vector. If so, we know we have two elements that add up to 8. How can we facilitate fast lookup of elements we have already seen?