The A-Star Algorithm and Esoteric Data Structures (Skip Lists and Bloom Filters)

CS 106B: Programming Abstractions

Fall 2024, Stanford University Computer Science Department

Lecturers: Cynthia Bailey and Chris Gregg, Head CA: Jonathan Coronado

A skip list and a bloom filter (to be described in this lecture)


Slide 2

Announcements

Assignment 7 is due Wednesday at midnight


Slide 3

Today's Topics

  • A brief review of Dijkstra
  • The A-Star algorithm
  • Skip Lists
  • Bloom Filters (we probably won't get to this)

Slide 4

Dijkstra's Algorithm Refresh

  • Cynthia demonstrated Dijkstra's algorithm using a priority queue instead of a queue (as in BFS), with the priorities as the total path length so far.
  • A different form of Dijkstra's algorithm uses a table of nodes, and keeps track of how far away from the start each node is, updating by looking at neighbors of unseen nodes. It also keeps track of enough information to create the shortest path after the table is filled in.
  • We will do an example with the table method.
  • Example website: https://web.stanford.edu/~cgregg/graphs.html
    • Example to load (click "Solution" below and then copy all text. Paste into above website after clicking Load)
{"nodes":[{"color":"green","x":195,"y":439,"name":"A"},{"color":"green","x":411,"y":439,"name":"B"},{"color":"green","x":296,"y":333,"name":"C"},{"color":"green","x":293,"y":201,"name":"D"},{"color":"green","x":178,"y":272,"name":"E"},{"color":"green","x":125,"y":124,"name":"F"},{"color":"green","x":416,"y":117,"name":"G"},{"color":"green","x":289,"y":510,"name":"SJ"},{"color":"green","x":269,"y":46,"name":"SF"}],"edges":[{"node1":{"color":"green","x":195,"y":439,"name":"A"},"node2":{"color":"green","x":289,"y":510,"name":"SJ"},"weight":5,"textX":242,"textY":461.5,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":289,"y":510,"name":"SJ"},"node2":{"color":"green","x":411,"y":439,"name":"B"},"weight":10,"textX":350,"textY":461.5,"textWidth":20.021484375,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":411,"y":439,"name":"B"},"node2":{"color":"green","x":296,"y":333,"name":"C"},"weight":8,"textX":353.5,"textY":373,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":195,"y":439,"name":"A"},"node2":{"color":"green","x":296,"y":333,"name":"C"},"weight":200,"textX":245.5,"textY":373,"textWidth":30.0322265625,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":296,"y":333,"name":"C"},"node2":{"color":"green","x":293,"y":201,"name":"D"},"weight":13,"textX":294.5,"textY":254,"textWidth":20.021484375,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":178,"y":272,"name":"E"},"node2":{"color":"green","x":293,"y":201,"name":"D"},"weight":1,"textX":235.5,"textY":223.5,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":293,"y":201,"name":"D"},"node2":{"color":"green","x":125,"y":124,"name":"F"},"weight":5,"textX":209,"textY":149.5,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":125,"y":124,"name":"F"},"node2":{"color":"green","x":178,"y":272,"name":"E"},"weight":9,"textX":151.5,"textY":185,"te
xtWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":125,"y":124,"name":"F"},"node2":{"color":"green","x":269,"y":46,"name":"SF"},"weight":1,"textX":197,"textY":72,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":269,"y":46,"name":"SF"},"node2":{"color":"green","x":416,"y":117,"name":"G"},"weight":18,"textX":342.5,"textY":68.5,"textWidth":20.021484375,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":293,"y":201,"name":"D"},"node2":{"color":"green","x":416,"y":117,"name":"G"},"weight":6,"textX":354.5,"textY":146,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":416,"y":117,"name":"G"},"node2":{"color":"green","x":125,"y":124,"name":"F"},"weight":2,"textX":270.5,"textY":107.5,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":416,"y":117,"name":"G"},"node2":{"color":"green","x":411,"y":439,"name":"B"},"weight":20,"textX":413.5,"textY":265,"textWidth":20.021484375,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":178,"y":272,"name":"E"},"node2":{"color":"green","x":195,"y":439,"name":"A"},"weight":40,"textX":186.5,"textY":342.5,"textWidth":20.021484375,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":296,"y":333,"name":"C"},"node2":{"color":"green","x":416,"y":117,"name":"G"},"weight":15,"textX":356,"textY":212,"textWidth":20.021484375,"textHeight":13}]}

  • Setup:
    • Create a table of all nodes, with a distance from the start as one row. A second row will be called previous, which will annotate which node we were looking at when we updated the node's distance. A final row, seen starts out with all nodes as Not Seen, and gets updated after we analyze each node in turn. Here is an example graph, representing a map where we are trying to get from San Jose to San Francisco and we have nodes A-G as potential different paths to take. An example graph with SJ (San Jose) at the bottom, and SF (San Francisco) at the top. There are also nodes A-G
  • You might think, "this is going to take all day to run the algorithm!" but it is very doable.
  • Here is the basic algorithm:
    1. Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start.
    2. Look at this node's neighbors, and update each neighbor's total distance if going through this node gives a shorter path (the distance already to this node plus the edge weight).
    3. If the node visited is the destination, stop
    4. Repeat from step 1.
  • Here is the table we will update:
                     SJ   A    B    C    D    E    F    G    SF
    Distance from SJ 0    ∞    ∞    ∞    ∞    ∞    ∞    ∞    ∞
    Previous         -                                
    Seen?            N    N    N    N    N    N    N    N    N
    
  • Notice:
    • All the distances are listed as infinity, except for the distance from SJ to itself, which is 0. We don't yet know the distances from SJ to the other nodes, because we haven't analyzed them yet (formally – we can see them in the graph, but for our table we haven't yet analyzed anything).
    • The previous row is empty
    • All nodes (including SJ) are listed as Not Seen (N).
  • Step 1:
    • Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start. That would be SJ (it has a distance of 0, and all others have a distance of ∞).
    • Look at SJ's neighbors, A and B. Both can be reached from SJ by their respective distances, and both are less than ∞. So we update the table, and then we mark SJ as visited. We also update Previous to show that we got to A and B directly from SJ:
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   ∞    ∞    ∞    ∞    ∞    ∞
Previous         -    SJ   SJ                            
Seen?            Y    N    N    N    N    N    N    N    N
  • Step 2:
    • Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start. Now this is A.
    • Look at A's neighbors, SJ, C, and E. SJ has been visited, so we ignore it. We update E to have a distance of 5 + 40 from SJ (because we have already gone 5 to get to A). We update C to be 205.
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   205  ∞    45   ∞    ∞    ∞
Previous         -    SJ   SJ   A         A                           
Seen?            Y    Y    N    N    N    N    N    N    N
  • Step 3:
    • Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start. Now this is B.
    • Look at B's neighbors, SJ, C, and G. We recognize that SJ-B-C is going to be shorter than SJ-A-C because 10 + 8 = 18 is shorter than 205. So, we update C to have a distance of 18 and we modify its Previous to be B. We update G to be 30.
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   18   ∞    45   ∞    30   ∞
Previous         -    SJ   SJ   B         A         B                 
Seen?            Y    Y    Y    N    N    N    N    N    N
  • Step 4:
    • Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start. Now this is C.
    • Look at C's neighbors, A, B, D, and G. We ignore A and B (already seen). We make D 18 + 13 = 31 and we see that we don't need to update G, because 18 + 15 = 33 is greater than 30.
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   18   31   45   ∞    30   ∞
Previous         -    SJ   SJ   B    C    A         B                 
Seen?            Y    Y    Y    Y    N    N    N    N    N
  • Step 5:
    • Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start. Now this is G.
    • Look at G's neighbors, B, C, D, F, and SF. We ignore B and C. We don't update D (already smaller than 36), we update F to be 32, and we update SF to be 48. We can't stop yet, even though we have reached SF! This is because we may have a shorter path.
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   18   31   45   32   30   48 
Previous         -    SJ   SJ   B    C    A    G    B    G            
Seen?            Y    Y    Y    Y    N    N    N    Y    N
  • Step 6:
    • Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start. Now this is D.
    • Look at D's neighbors, C, E, F, and G. We ignore C and G. We update E to be 32, because 31 + 1 = 32 and this is less than 45. We don't update F (already shorter than 36).
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   18   31   32   32   30   48 
Previous         -    SJ   SJ   B    C    D    G    B    G            
Seen?            Y    Y    Y    Y    Y    N    N    Y    N
  • Step 7:
    • Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start. Now this is either E or F, and it doesn't matter which one we choose. Let's choose E.
    • Look at E's neighbors, A, D, and F. We ignore A and D. We don't update F. Notice that we have to do less updating as we get further through the graph!
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   18   31   32   32   30   48 
Previous         -    SJ   SJ   B    C    D    G    B    G            
Seen?            Y    Y    Y    Y    Y    Y    N    Y    N
  • Step 8:
    • Of the unseen/unvisited nodes, find the node that currently has the shortest distance from the start. Now this is F.
    • Look at F's neighbors, D, E, G, and SF. We ignore D, E, and G. We update SF because 32 + 1 = 33 is less than 48. So, we did need to update SF after all!
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   18   31   32   32   30   33 
Previous         -    SJ   SJ   B    C    D    G    B    F            
Seen?            Y    Y    Y    Y    Y    Y    Y    Y    N
  • Step 9:
    • Of the unseen nodes, find the node with the shortest distance from the start: now this is SF (it might not always be the last node, but it happens to be in this case).
    • Because we are visiting the destination, we know that we have found the shortest path. We are done, and the shortest path is 33.
                 SJ   A    B    C    D    E    F    G    SF
Distance from SJ 0    5    10   18   31   32   32   30   33 
Previous         -    SJ   SJ   B    C    D    G    B    F            
Seen?            Y    Y    Y    Y    Y    Y    Y    Y    Y
  • Now, to get the actual path, we simply follow the Previous nodes backwards from SF:
    • SF's previous was F
    • F's previous was G
    • G's previous was B
    • B's previous was SJ
  • Therefore, the shortest path is SJ -> B -> G -> F -> SF with a length of 33.
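The table method above can be sketched in code. Here is a minimal illustration (the function name dijkstra_table and the edge-list representation are our own, not from the lecture), run on the example graph:

```python
import math

# Edge list for the example graph (undirected), taken from the lecture's JSON.
edges = [("A","SJ",5), ("SJ","B",10), ("B","C",8), ("A","C",200),
         ("C","D",13), ("E","D",1), ("D","F",5), ("F","E",9),
         ("F","SF",1), ("SF","G",18), ("D","G",6), ("G","F",2),
         ("G","B",20), ("E","A",40), ("C","G",15)]

def dijkstra_table(edges, start, goal):
    # Build an adjacency map from the edge list.
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    # The three table rows: Distance, Previous, and Seen?.
    dist = {node: math.inf for node in graph}
    prev = {node: None for node in graph}
    seen = {node: False for node in graph}
    dist[start] = 0
    while True:
        # Step 1: among unseen nodes, pick the one closest to the start.
        node = min((n for n in graph if not seen[n]), key=lambda n: dist[n])
        seen[node] = True
        if node == goal:              # Step 3: stop when we visit the goal.
            break
        # Step 2: update this node's unseen neighbors.
        for neighbor, w in graph[node]:
            if not seen[neighbor] and dist[node] + w < dist[neighbor]:
                dist[neighbor] = dist[node] + w
                prev[neighbor] = node
    # Follow the Previous entries backwards to recover the path.
    path, n = [], goal
    while n is not None:
        path.append(n)
        n = prev[n]
    return dist[goal], path[::-1]

length, path = dijkstra_table(edges, "SJ", "SF")
print(length, path)   # 33 ['SJ', 'B', 'G', 'F', 'SF']
```

Running this reproduces the table walkthrough: the shortest path is SJ -> B -> G -> F -> SF with length 33.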

Slide 5

A* Search

  • One of the downsides to Dijkstra's algorithm is that it can, in many circumstances, ignore other sources of information that might prove useful to finding the shortest path in the fewest number of steps.
  • For example, take a look at the following graph, where we want to find the shortest path from A to J. If all we have is the graph in this form, Dijkstra's is the best we can do.
  • But, what if we know that this is a map, and that any paths to the south of A are probably going to take us in the wrong direction? Yes, it could be the case that those would lead to a shorter path, but on a real map, it is unlikely that going too far south when your destination is to the north will lead to the shortest path.

A graph where there are many nodes that probably aren't worth looking at, because they are not in the 'direction' of the other nodes.

  • We call the idea of using external information about a graph a heuristic. The heuristic estimates the cost of the cheapest path to the goal. It is different for every problem.
  • A heuristic should always underestimate the distance to the goal. If it overestimates the distance, it could end up finding a solution that is not actually optimal (though it will do so relatively fast).
  • We use the heuristic as an addition to the value for the priority.
    • For the case of maps, if the distance to the destination is closer, this will weight the nodes in that direction to be preferable (i.e., they will actually have a lower priority calculation).
    • In other words, priority(u) = distance(s, u) + heuristic(u, d), where s is the start, u is the node we are considering, distance(s, u) is the length of the best path found so far from s to u, and d is the destination.

Slide 6

A* Search example

A graph where there are many nodes that probably aren't worth looking at, because they are not in the 'direction' of the other nodes.

{"nodes":[{"color":"green","x":292,"y":355,"name":"A"},{"color":"green","x":223,"y":249,"name":"B"},{"color":"green","x":377,"y":241,"name":"C"},{"color":"green","x":288,"y":153,"name":"D"},{"color":"green","x":179,"y":419,"name":"E"},{"color":"green","x":230,"y":483,"name":"F"},{"color":"green","x":344,"y":487,"name":"G"},{"color":"green","x":438,"y":440,"name":"H"},{"color":"green","x":206,"y":64,"name":"I"},{"color":"green","x":360,"y":62,"name":"J"}],"edges":[{"node1":{"color":"green","x":292,"y":355,"name":"A"},"node2":{"color":"green","x":344,"y":487,"name":"G"},"weight":6,"textX":318,"textY":408,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":344,"y":487,"name":"G"},"node2":{"color":"green","x":438,"y":440,"name":"H"},"weight":3,"textX":391,"textY":450.5,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":292,"y":355,"name":"A"},"node2":{"color":"green","x":230,"y":483,"name":"F"},"weight":7,"textX":261,"textY":406,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":230,"y":483,"name":"F"},"node2":{"color":"green","x":179,"y":419,"name":"E"},"weight":6,"textX":204.5,"textY":438,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":292,"y":355,"name":"A"},"node2":{"color":"green","x":179,"y":419,"name":"E"},"weight":2,"textX":235.5,"textY":374,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":292,"y":355,"name":"A"},"node2":{"color":"green","x":377,"y":241,"name":"C"},"weight":8,"textX":334.5,"textY":285,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":292,"y":355,"name":"A"},"node2":{"color":"green","x":223,"y":249,"name":"B"},"weight":12,"textX":257.5,"textY":289,"textWidth":20.021484375,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":223,"y":249,"name":"B"},"node2":{"color":"green","x":288,"y":153,"name":"D"},"weight"
:7,"textX":255.5,"textY":188,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":377,"y":241,"name":"C"},"node2":{"color":"green","x":288,"y":153,"name":"D"},"weight":1,"textX":332.5,"textY":184,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":288,"y":153,"name":"D"},"node2":{"color":"green","x":360,"y":62,"name":"J"},"weight":1,"textX":324,"textY":94.5,"textWidth":20.021484375,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":288,"y":153,"name":"D"},"node2":{"color":"green","x":206,"y":64,"name":"I"},"weight":3,"textX":247,"textY":95.5,"textWidth":10.0107421875,"textHeight":13,"color":"black"},{"node1":{"color":"green","x":206,"y":64,"name":"I"},"node2":{"color":"green","x":360,"y":62,"name":"J"},"weight":4,"textX":283,"textY":50,"textWidth":10.0107421875,"textHeight":13,"color":"black"}]}

  • Let's look at this graph with Dijkstra, first. We want to find the shortest path from A to J.
                  A    B    C    D    E    F    G    H   I   J
Distance from A   0    ∞    ∞    ∞    ∞    ∞    ∞    ∞   ∞   ∞
Previous          -                                
Seen?             N    N    N    N    N    N    N    N   N   N

Dijkstra:
1.(A)             A    B    C    D    E    F    G    H   I   J
Distance from A   0    12   8    ∞    2    7    6    ∞   ∞   ∞
Previous          -    A    A         A    A    A  
Seen?             Y    N    N    N    N    N    N    N   N   N

2.(E)             A    B    C    D    E    F    G    H   I   J
Distance from A   0    12   8    ∞    2    7    6    ∞   ∞   ∞
Previous          -    A    A         A    A    A  
Seen?             Y    N    N    N    Y    N    N    N   N   N

3.(G)             A    B    C    D    E    F    G    H   I   J
Distance from A   0    12   8    ∞    2    7    6    9   ∞   ∞
Previous          -    A    A         A    A    A    G
Seen?             Y    N    N    N    Y    N    Y    N   N   N

4.(F)             A    B    C    D    E    F    G    H   I   J
Distance from A   0    12   8    ∞    2    7    6    9   ∞   ∞
Previous          -    A    A         A    A    A    G
Seen?             Y    N    N    N    Y    Y    Y    N   N   N

5.(C)             A    B    C    D    E    F    G    H   I   J
Distance from A   0    12   8    9    2    7    6    9   ∞   ∞
Previous          -    A    A    C    A    A    A    G
Seen?             Y    N    Y    N    Y    Y    Y    N   N   N

6.(H)             A    B    C    D    E    F    G    H   I   J
Distance from A   0    12   8    9    2    7    6    9   ∞   ∞
Previous          -    A    A    C    A    A    A    G
Seen?             Y    N    Y    N    Y    Y    Y    Y   N   N
...
  • We'll stop calculating after 6 steps, as you can see the pattern (as in the other Dijkstra example). There would be many more steps to get all the way to J.
  • Notice that the shortest paths at the beginning are all below A. If we know this is a map, why are we going South?
  • We can use a heuristic based on the straight-line-distance to J from all other nodes. In this case, we're dividing the pixel distance by the shortest pixel distance for all non-J nodes.
A*
Straight-line (pixel) distances to J, divided by the smallest distance (116, for D):
A 301, 2.6
B 232, 2
C 180, 1.6
D 116, 1
E 400, 3.4
F 441, 3.8
G 425, 3.7
H 386, 3.3
I 154, 1.3
J 0, 0

Now, when we run the algorithm, we use Distance + future (estimated) distance as the priority when pulling out of the priority queue. We still record the real path lengths in the Distance from A row, so that we have the true distances. The heuristic only influences the order in which we visit nodes.

A*:
                    A      B      C      D      E      F     G     H     I     J
Distance from A     0      ∞      ∞      ∞      ∞      ∞     ∞     ∞     ∞     ∞
Distance + future   2.6
Previous            -                                                
Seen?               N      N      N      N      N      N     N     N     N     N

1(A)                A      B      C      D      E      F     G     H     I     J
Distance from A     0      12     8      ∞      2      7     6     ∞     ∞     ∞
Distance + future   2.6    14     9.6           5.4    10.8  9.7
Previous            -      A      A             A      A     A
Seen?               Y      N      N      N      N      N     N     N     N     N

2(E)                A      B      C      D      E      F     G     H     I     J
Distance from A     0      12     8      ∞      2      7     6     ∞     ∞     ∞
Distance + future   2.6    14     9.6           5.4    10.8  9.7
Previous            -      A      A             A      A     A          
Seen?               Y      N      N      N      Y      N     N     N     N     N

3(C)                A      B      C      D      E      F     G     H     I     J
Distance from A     0      12     8      9      2      7     6     ∞     ∞     ∞
Distance + future   2.6    14     9.6    10     5.4    10.8  9.7
Previous            -      A      A      C      A      A     A
Seen?               Y      N      Y      N      Y      N     N     N     N     N

4(G)                A      B      C      D      E      F     G     H     I     J
Distance from A     0      12     8      9      2      7     6     9     ∞     ∞
Distance + future   2.6    14     9.6    10     5.4    10.8  9.7   12.3
Previous            -      A      A      C      A      A     A     G
Seen?               Y      N      Y      N      Y      N     Y     N     N     N 

5(D)                A      B      C      D      E      F     G     H     I     J
Distance from A     0      12     8      9      2      7     6     9     12    10
Distance + future   2.6    14     9.6    10     5.4    10.8  9.7   12.3  13.3  10
Previous            -      A      A      C      A      A     A     G     D     D
Seen?               Y      N      Y      N      Y      N     Y     N     N     N

6(J)                A      B      C      D      E      F     G     H     I     J
Distance from A     0      12     8      9      2      7     6     9     12    10
Distance + future   2.6    14     9.6    10     5.4    10.8  9.7   12.3  13.3  10
Previous            -      A      A      C      A      A     A     G     D     D
Seen?               Y      N      Y      N      Y      N     Y     N     N     Y
  • Done! Following the Previous entries backwards: J's previous is D, D's previous is C, and C's previous is A, so the shortest path is A -> C -> D -> J with a length of 10.
  • This only took six total steps! It is much faster than Dijkstra, because we prioritize going in the direction where we know the solution is.
  • Would it always work out this well? Possibly not – let's say there was a big traffic accident in the direction we want to go. This would mean that we might spend more time calculating in the direction we think we want to go, but really we should have gone on a more roundabout route. But, it will never be worse than Dijkstra's algorithm.
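The A* walkthrough above can also be sketched in code. This is a minimal illustration using Python's heapq as the priority queue (the a_star name is our own; the edges and scaled straight-line heuristic values are transcribed from the slide's graph):

```python
import heapq

# Edges from the slide's graph (undirected).
edges = [("A","G",6), ("G","H",3), ("A","F",7), ("F","E",6), ("A","E",2),
         ("A","C",8), ("A","B",12), ("B","D",7), ("C","D",1),
         ("D","J",1), ("D","I",3), ("I","J",4)]

# Straight-line-distance heuristic to J, scaled as in the lecture's table.
h = {"A": 2.6, "B": 2, "C": 1.6, "D": 1, "E": 3.4,
     "F": 3.8, "G": 3.7, "H": 3.3, "I": 1.3, "J": 0}

def a_star(edges, h, start, goal):
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    dist = {start: 0}
    prev = {start: None}
    # Priority = real distance so far + heuristic estimate to the goal.
    pq = [(h[start], start)]
    seen = set()
    while pq:
        _, node = heapq.heappop(pq)
        if node in seen:
            continue               # skip stale queue entries
        seen.add(node)
        if node == goal:
            break
        for neighbor, w in graph[node]:
            new_dist = dist[node] + w
            if new_dist < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_dist
                prev[neighbor] = node
                heapq.heappush(pq, (new_dist + h[neighbor], neighbor))
    path, n = [], goal
    while n is not None:
        path.append(n)
        n = prev[n]
    return dist[goal], path[::-1]

print(a_star(edges, h, "A", "J"))   # (10, ['A', 'C', 'D', 'J'])
```

Only the priority changes versus Dijkstra; the dist table still holds the true path lengths.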

Slide 7

Esoteric Data Structures

A list of esoteric data structures: difference-list, table, 2-3-heap, b-tree, rolling-hash, heightmap, skip-list, kd-tree, lookup, splay-tree

  • In CS 106B, we have talked about many standard, famous, and commonly used data structures: Vectors, Linked Lists, Trees, Hash Tables, Graphs.
  • However, we only scratched the surface of available data structures, and data structure research is alive and well to this day.
  • If you want to take further classes on data structures, check out Keith Schwarz's CS 166 Data Structures class.
  • Let's take a look at two data structures with interesting properties that you might not see covered in detail in a standard course: the skip list and the bloom filter.
  • Both data structures involve probability, which is something we haven't yet seen in a data structure.

Slide 8

Skip Lists

  • Note: much of the material for the Skip List part of this lecture was borrowed from a brilliant lecture by Erik Demaine, at MIT.
  • A skip list is a balanced search structure that maintains an ordered, dynamic set for insertion, deletion and search
  • What other efficient (log n or better) sorted search structures have we talked about?
    • Hash Tables? (nope, not sorted)
    • Heaps? (nope, not searchable)
    • Sorted Array? (kind of, but insert/delete is O(n))
    • Binary Trees? (only if balanced, e.g., AVL or Red/Black)
  • A skip list is a simple, randomized search structure that will give us O(log N) in expectation for search, insert, and delete, but also with high probability.
  • Invented by William Pugh in 1989 – fairly recent! (Okay, not too recent, but most of the data structures we've looked at have been around since the 1950s/60s)

Slide 9

Improving the Linked List

  • Let's see what we can do with a linked list to make it better.
  • How long does it take to search a sorted, doubly-linked list for an element?
    head→14⇔23⇔34⇔42⇔50⇔59⇔66⇔72⇔79
    
  • log(n)?
    • Nope!
  • It is O(n) – we must traverse the list!


Slide 10

Improving the Linked List

  • How might we help the situation of O(n) searching in a linked list?
    head→14⇔23⇔34⇔42⇔50⇔59⇔66⇔72⇔79
    
  • What if we put another link in the middle?
                      middle
                        ↓
    head→14⇔23⇔34⇔42⇔50⇔59⇔66⇔72⇔79
    
  • This would help a little…we could start searching from the middle, but we would still have to traverse the list.
  • O(n) becomes … O(n/2), which is just O(n)
    • We didn't really help anything!
    • We also had to keep track of the middle.
  • Maybe we could just keep adding pointers:
            23        50     66
             ↓         ↓      ↓
    head→14⇔23⇔34⇔42⇔50⇔59⇔66⇔72⇔79
    
  • This would help some more, but still doesn't solve the underlying problem.
    • We would have to search the pointers, too, so we're basically back to where we started.

Slide 11

Improving the Linked List

  • Let's play a game: the list we've been using was chosen for a very specific reason. Any idea what that reason is? Do you happen to recognize the sequence? (If you're from New York City, you might!)
    14⇔23⇔34⇔42⇔50⇔59⇔66⇔72⇔79
    
  • These are subway stops on the NYC 7th Avenue line!

  • A somewhat unique feature in the New York City subway system is that it has express lines:
    14←--→34⇔42←---------→72
    ⇕     ⇕  ⇕            ⇕
    14⇔23⇔34⇔42⇔50⇔59⇔66⇔72⇔79
    
  • To search the list (or ride the subway): walk right in the top list, and when you've gone too far, go back and then down to the bottom list.
    • To search for 59 above, start at the 14 in the top list, and then traverse to the 34, then the 42, then the 72. Recognize that you've gone too far, and go back to the 42 and down to the corresponding 42 in the bottom list. Then go to the 50 and then finally to the 59 in the bottom list.

Slide 12

Skip Lists

  L1: 14←--→34⇔42←---------→72
      ⇕     ⇕  ⇕            ⇕
  L2: 14⇔23⇔34⇔42⇔50⇔59⇔66⇔72⇔79
  • For the two lists, L1 and L2 above, what is the best placement for the nodes in L1? For the subway, it is based on ridership, but for our purposes we want to minimize the worst-case search cost.
    • We want equally spaced nodes.
  • The search cost can be represented as follows:
    |L1| + (|L2| / |L1|)
    
  • where |L1| is the number of nodes in L1, and |L2| is the number of nodes in L2.
  • We always want all the nodes, n, in L2, so we can represent the search cost as follows:
    |L1| + (n / |L1|)
    
  • We can do a bit of calculus to minimize the cost, by taking the derivative with respect to |L1| and setting it equal to 0 (welcome to Calc I!)
    d(|L1| + (n / |L1|))/d(|L1|) = 1 - n * |L1|^(-2)
    
  • We can set this equal to 0 and solve for |L1|:
    1 - n * |L1|^(-2) = 0
    1 = n / |L1|^2
    |L1|^2 = n
    |L1| = sqrt(n) 
    
  • So, for two lists, the cost is minimized when L1 has sqrt(n) nodes, spaced sqrt(n) apart.
  • How does this change our search time in terms of Big O?
    • We can plug in sqrt(n) in for |L1| in our original function, and solve:
      sqrt(n) + n / sqrt(n)  = 2 * sqrt(n)
      
    • So, we have O(2 * sqrt(n)) = O(sqrt(n)). Not bad!
    • But, not as good as O(log n).
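If you'd rather not take the derivative, a quick numeric check (our own, not from the lecture) confirms the same minimum: for n = 10,000, the cost |L1| + n/|L1| bottoms out at |L1| = 100 = sqrt(n):

```python
# Numerically confirm that |L1| = sqrt(n) minimizes |L1| + n/|L1|.
n = 10000
cost = lambda L1: L1 + n / L1
best = min(range(1, n + 1), key=cost)
print(best, cost(best))   # 100 200.0
```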

Slide 13

Skip Lists

  • What if we had more lists?
  • The search calculations get a bit tricky, but this is the end result: A graphic showing that 2 sorted lists is 2 root n, 3 sorted lists is 3 n^(1/3), k sorted lists is k n^(1/k), and log n sorted lists is log n * n^(1/log n)
  • It turns out that if we have log n lists, we have an interesting situation, because n^(1 / log2 n) is just equal to 2, which means that we end up with O(2 log2 n), or just O(log n).
  • This is what we want!

Slide 14

Skip Lists

  • What we want to build is the following:
    • log n number of lists, with the lowest list with n nodes, the next higher list with n/2 nodes, the next higher list with n/4 nodes, etc., until the top list has a single node (well, really a node on the left and on the right):

    A set of linked lists with the top list having just 14 on the left and 79 on the right. The next list has 14 on the left, 50 in the middle, and 79 on the right. The next list down has 14, 34, 50, 66, and 72, equally spaced. The final list has all the nodes: 14, 23, 34, 42, 50, 59, 66, 72, and 79

  • This should look a bit like a binary search tree, and it acts like one!

Slide 15

Building a Skip List

  • The question now is, how do we build a skip list?
  • We could try to keep all the elements perfectly aligned – in the lowest list, we have n elements, and in the next list up we have n/2 elements, etc.
    • This would not be efficient, because we would be moving links all over the place!
  • What we do instead is to build the list using a probabilistic strategy
    • We flip a coin!
  • Here's the algorithm:
    1. All lists will have a -∞ node at the beginning of the list, in order to make a standard "minimum" node.
    2. All elements must go into the bottom list (search to find the spot)
    3. After inserting into the bottom list, flip a fair, two-sided coin. If the coin comes up heads, add the element to the next list up, and flip again, repeating this step.
    4. If the coin comes up tails, stop.
  • Let's build one!
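Here is one way to sketch that build algorithm in code. This is a simplified illustration, not the lecture's implementation: it uses singly-linked next/down pointers rather than the doubly-linked lists pictured above, and the class and method names are our own.

```python
import random

# A minimal skip list sketch following the lecture's algorithm: every element
# goes into the bottom list, then coin flips promote it to higher lists.
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None   # right neighbor in this list
        self.down = None   # same value in the list below

class SkipList:
    def __init__(self):
        self.head = Node(float("-inf"))   # the -infinity sentinel (step 1)

    def search(self, value):
        node = self.head
        while node:
            # Walk right while the next value doesn't overshoot.
            while node.next and node.next.value <= value:
                node = node.next
            if node.value == value:
                return True
            node = node.down          # drop down to the slower list
        return False

    def insert(self, value):
        # Record the rightmost node visited on each level (top to bottom).
        path, node = [], self.head
        while node:
            while node.next and node.next.value < value:
                node = node.next
            path.append(node)
            node = node.down
        # Step 2: always insert into the bottom list, then maybe promote.
        below = None
        while True:
            pred = path.pop() if path else None
            if pred is None:
                # Ran out of levels: add a new list on top with its sentinel.
                new_head = Node(float("-inf"))
                new_head.down = self.head
                self.head = new_head
                pred = new_head
            new_node = Node(value)
            new_node.next = pred.next
            pred.next = new_node
            new_node.down = below
            below = new_node
            # Steps 3-4: flip a coin; heads promotes, tails stops.
            if random.random() < 0.5:
                break

# The NYC 7th Avenue stops from the lecture:
sl = SkipList()
for stop in [14, 23, 34, 42, 50, 59, 66, 72, 79]:
    sl.insert(stop)
print(sl.search(59), sl.search(60))   # True False
```

Whatever the coin flips produce, search still works; the flips only affect how fast it is in expectation.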

Slide 16

Bloom Filters

  • Our second esoteric data structure is called a bloom filter, named for its creator, Burton Howard Bloom, who invented the data structure in 1970.
  • A bloom filter is a space-efficient, probabilistic data structure that is used to tell whether an element is in a set.
  • Bloom filters are a bit odd because they can definitely tell you whether an element is not in the set, but can only say whether the element is possibly in the set.
  • In other words: false positives are possible, but false negatives are not.
    • (A false positive would say that the element is in the set when it isn’t, and a false negative would say that the element is not in the set when it is).

Slide 17

Bloom Filters

  • The idea behind a Bloom Filter is that we have a bit array. We will model a bit array with a regular array, but a real bit array packs its bits: there are 8 bits in a byte and 4 bytes in a 32-bit number, so compared to storing each 0 or 1 as its own 32-bit number, a bit array is up to 32x smaller (although Bloom Filters themselves need more space per element than 1 bit).
  • This is a bit array:
1 0 1 1 0 1 1 1
  • For a bloom filter, we start with an empty bit array (all zeros, indexes added), and we have k hash functions:
    k1 = (13 - (x % 13)) % 8
    k2 = (3 + 5x) % 8
    etc.
    
0 1 2 3 4 5 6 7
0 0 0 0 0 0 0 0
  • The hash functions should be independent, and the optimal number of hash functions can be calculated from the number of items you are hashing and the length of your table (see the Wikipedia article on Bloom Filters for details).
  • To insert a value, hash the value with all k hashes, and then set the bit in the hashed position to 1 for each result.

Slide 18

Bloom Filter Example Insertions

  • Insert 119:
    • x = 119
    • k1 = (13 - (x % 13)) % 8 = 3
    • k2 = (3 + 5 * x) % 8 = 6
  • Because k1 is 3, we change bit 3 to 1. Because k2 is 6, we change bit 6 to 1:
0 1 2 3 4 5 6 7
0 0 0 1 0 0 1 0
  • Insert 132:
    • x = 132
    • k1 = (13 - (x % 13)) % 8 = 3
    • k2 = (3 + 5 * x) % 8 = 7
  • Because k1 is 3, we don't change the array, because bit 3 is already a 1. Because k2 is 7, we change bit 7 to 1:
0 1 2 3 4 5 6 7
0 0 0 1 0 0 1 1

Slide 19

Bloom Filters: Determining if a value is in the set

  • To check if values are in the set, re-hash with all the hash functions. If the bits returned are 1, then we can say that the value is probably in the table.
    • Why probably? There is overlap, and we could potentially have two values overlap bits such that another value will appear as if it was inserted.
  • Example from above:
    • To see if 119 is in the set, we re-hash: k1 = 3, k2 = 6:
    • 119 is probably in the set (and we did add it to the set)
  • Another example: search for 126 in the set: k1 = 4, k2 = 1
    • 126 is definitely not in the set. Why?
    • There are 0s in the resulting bits (both index 4 and index 1 have 0s)
  • Another example: search for 143: k1 = 5, k2 = 6
    • 143 is definitely not in the set. Why?
    • There are 0s in the resulting bits (index 5 is 0)
  • Another example: search for 175: k1 = 7, k2 = 6:
    • 175 is probably in the set. Why?
    • Both bits 6 and 7 are 1.
    • But, we never entered 175 into the set! This is a false positive, and we can't rely on positive values.
  • Here is a great online example to see how Bloom Filters work: https://llimllib.github.io/bloomfilter-tutorial/
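All of the worked examples above, insertions and membership checks, can be sketched with a few lines of code using the lecture's two hash functions (the function names insert and possibly_contains are our own):

```python
# A sketch of the lecture's 8-bit bloom filter with its two hash functions.
bits = [0] * 8

def k1(x): return (13 - (x % 13)) % 8
def k2(x): return (3 + 5 * x) % 8

def insert(x):
    # Set the bit at every hash position.
    for h in (k1, k2):
        bits[h(x)] = 1

def possibly_contains(x):
    # A single 0 bit proves the value was never inserted.
    return all(bits[h(x)] == 1 for h in (k1, k2))

insert(119)   # sets bits 3 and 6
insert(132)   # sets bits 3 (already set) and 7

print(possibly_contains(119))  # True  (was inserted)
print(possibly_contains(126))  # False (bits 4 and 1 are 0)
print(possibly_contains(143))  # False (bit 5 is 0)
print(possibly_contains(175))  # True  -- a false positive!
```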

Slide 20

Bloom Filters: Probability of a False Positive

  • What is the probability that we have a false positive? Let's do a bit of math.
  • If \(m\) is the number of bits in our bit array, then the probability that a particular bit is not set to 1 by a single hash function is:
\[1 - \left(\frac{1}{m}\right)\]
  • If \(k\) is the number of hash functions, the probability that the bit is not set to 1 by any hash function is:
\[\left(1 - \frac{1}{m}\right)^{k}\]
  • If we have inserted \(n\) elements, the probability that a certain bit is still 0 is:
\[\left(1 - \frac{1}{m}\right)^{kn}\]
  • The probability that a bit is 1 is just 1 minus the previous result:
\[1 - \left(1 - \frac{1}{m}\right)^{kn}\]
  • Now we test membership of an element that is not in the set. Each of the \(k\) array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, (false positive) is:
\[\left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k} \approx \left(1 - e^{\frac{-kn}{m}}\right)^{k}\]
  • For our prior example, \(m = 8\), \(n = 2\), and \(k = 2\), so:
\[\left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k} = 0.17\]
  • or 17% of the time we will get a false positive.
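A quick check of the arithmetic (the exact and approx variable names are just ours):

```python
import math

# Evaluate the false-positive formula for the lecture's example.
m, n, k = 8, 2, 2   # bits in the array, elements inserted, hash functions
exact = (1 - (1 - 1/m) ** (k * n)) ** k
approx = (1 - math.exp(-k * n / m)) ** k
print(round(exact, 2), round(approx, 2))   # 0.17 0.15
```

The exact formula rounds to the 17% quoted above; the exponential approximation is close but slightly lower for such a tiny array.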

Slide 21

Bloom Filters: Why?

  • Why would we want a structure that can produce false positives?
  • Example: Google Chrome uses a local Bloom Filter to check for malicious URLs; if there is a hit, a stronger check is performed.
  • There is one more negative issue with a Bloom Filter: you can’t delete! Clearing the bits for one value might also wipe out bits belonging to other inserted values. You could keep a second bloom filter of removals, but then you could get false positives in that filter…
  • One other thing: Big O?
    • You have to compute k hash functions for an element, and then either set bits or read bits. Therefore, operations run in O(k) time, which is independent of the number of elements in the structure. Additionally, because the hashes are independent, they can be computed in parallel, which gives drastically better performance with multiple processors.