Our dynamic ordered set implementation is based on an augmented skip list. The skip list data structure is described in W. Pugh's paper "Skip Lists: A Probabilistic Alternative to Balanced Trees," Communications of the ACM, Vol. 33, No. 6, June 1990, pp. 668-676.
Skip lists use a parameter p: the fraction of the nodes with level i pointers that also have level i+1 pointers. We chose p = 0.25 because it slightly improves the constant factors in the running time. We could have chosen 1/e for slightly better constants, but it is more efficient to make 1/p a power of 2, so that multiplying by p reduces to a right shift. Choosing p = 0.5 would have given less variability in the running time, but, as stated in the assignment, this was not a concern: we are only interested in the total running time, not the per-operation time. Skip lists also use a constant MAX_LEVEL to limit the maximum number of forward pointers per node. Since the maximum number of nodes is n = 10,000, we chose MAX_LEVEL = ceil(log_{1/p}(n)) = ceil(log_4(10,000)) = ceil(6.64) = 7.
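As a concrete illustration (a minimal sketch, not our submitted code; the function name is ours), the level of a new node can be generated with p = 1/4 by testing two random bits at a time:

```c
#include <stdlib.h>

#define MAX_LEVEL 7   /* ceil(log4(10000)) */

/* Returns the number of levels for a new node, in [1, MAX_LEVEL].
 * With p = 1/4, a node is promoted to the next level with
 * probability 1/4, which we can test by examining two random bits
 * at a time -- no floating-point arithmetic needed. */
static int random_level(void)
{
    int level = 1;
    while ((rand() & 3) == 0 && level < MAX_LEVEL)  /* P = 1/4 */
        level++;
    return level;
}
```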
Skip lists perform well on Insert, Delete, Search, Minimum, and Successor without augmentation. We augmented the basic data structure to improve the performance of Maximum and Predecessor: we added a tail pointer to the list for Maximum, and we added a back pointer to each node, pointing to the previous node, for Predecessor.
The level of each node is determined randomly. In the basic algorithm, nodes vary in size according to their number of levels. To avoid the high cost of malloc, we decided to use the simple, efficient storage manager provided in the class List implementation; however, that storage manager requires all nodes to be the same size. Since MAX_LEVEL is only 7, we chose to always allocate a maximum-size node. This improves performance at the expense of using extra space.
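A sketch of the resulting fixed-size node layout (field and type names are ours; we assume, as the range queries below suggest, that each element carries a value alongside its ordering key):

```c
#define MAX_LEVEL 7

/* Every node is allocated at the maximum size so the fixed-size
 * storage manager can be used; only the first `level` entries of
 * forward[] are meaningful. */
typedef struct Node {
    int          key;                 /* ordering key                   */
    int          value;               /* associated value               */
    int          level;               /* forward pointers in use        */
    struct Node *back;                /* augmentation: previous node    */
    struct Node *forward[MAX_LEVEL];  /* forward[0] is the level 0 link */
} Node;

typedef struct SkipList {
    int   level;   /* highest level currently in use            */
    Node *header;  /* sentinel node                             */
    Node *tail;    /* augmentation: last node, for O(1) Maximum */
} SkipList;
```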
We based our implementation of Search, Insert, and Delete on the pseudo-code in the paper. We made an optimization to Search, so it behaves slightly differently from Insert and Delete when locating the proper place in the list. For example, in a skip list containing the keys 6, 9, 17, 19, 21, and 25, a search for key 25 with the unoptimized code would visit nodes 6, 9, 17, 19, 21, and 25: the loop moves forward and down levels while the next key is less than 25. We optimized the loop to break out as soon as the next key equals 25. We could not optimize Insert or Delete in this way because they must find, at every level, the nodes whose forward pointers point to the node being inserted or deleted.
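A sketch of the optimized Search, assuming the Node and SkipList layouts above:

```c
/* Optimized Search. Unlike Pugh's pseudo-code, it returns as soon
 * as the key is seen at any level, instead of always descending to
 * level 0. */
Node *search(const SkipList *list, int key)
{
    Node *x = list->header;
    for (int i = list->level - 1; i >= 0; i--) {
        while (x->forward[i] != NULL && x->forward[i]->key < key)
            x = x->forward[i];
        /* the optimization: break out early on an exact match */
        if (x->forward[i] != NULL && x->forward[i]->key == key)
            return x->forward[i];
    }
    return NULL;  /* key not present */
}
```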
We modified Insert and Delete to maintain the back pointers and the tail of the list. Since a skip list is just an augmented singly linked list, the modifications required were the same as those needed to turn a singly linked list into a doubly linked list.
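A sketch of Insert showing where the augmentations are maintained (node_alloc() is a hypothetical stand-in for the class storage manager; duplicate keys are ignored for brevity):

```c
/* Insert, following Pugh's pseudo-code, with the two augmentations
 * maintained at the end. */
void skiplist_insert(SkipList *list, int key, int value)
{
    Node *update[MAX_LEVEL];
    Node *x = list->header;
    int   i;

    /* record, at each level, the rightmost node left of the key */
    for (i = list->level - 1; i >= 0; i--) {
        while (x->forward[i] != NULL && x->forward[i]->key < key)
            x = x->forward[i];
        update[i] = x;
    }

    int level = random_level();
    if (level > list->level) {
        for (i = list->level; i < level; i++)
            update[i] = list->header;
        list->level = level;
    }

    Node *node = node_alloc(key, value, level);  /* hypothetical */
    for (i = 0; i < level; i++) {
        node->forward[i]      = update[i]->forward[i];
        update[i]->forward[i] = node;
    }

    /* augmentation: back pointers and tail -- exactly the extra work
     * of turning a singly linked list into a doubly linked one */
    node->back = update[0];  /* may be the header sentinel */
    if (node->forward[0] != NULL)
        node->forward[0]->back = node;
    else
        list->tail = node;
}
```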
Minimum is implemented by returning the level 0 forward pointer of the list header. By augmenting the list with a tail, Maximum is implemented by returning the tail instead of scanning forward until the last element is found. Successor is implemented by returning the level 0 forward pointer of the node, or Minimum if the node is null. By augmenting the list with back pointers, Predecessor is implemented by returning the back pointer of the node, or Maximum if the node is null.
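The four constant-time operations then reduce to pointer lookups (a sketch, assuming the conventions above):

```c
/* A null argument follows the wrap-around convention described
 * above: Successor of null yields Minimum, and Predecessor of null
 * yields Maximum. */
Node *minimum(const SkipList *list)  { return list->header->forward[0]; }
Node *maximum(const SkipList *list)  { return list->tail; }

Node *successor(const SkipList *list, const Node *x)
{
    return (x == NULL) ? minimum(list) : x->forward[0];
}

Node *predecessor(const SkipList *list, const Node *x)
{
    return (x == NULL) ? maximum(list) : x->back;
}
```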
MinInRange and MaxInRange are implemented by finding the start of the range with the optimized search, then moving forward to the end of the range along the level 0 forward pointers to find the min or max value. We spent a lot of time trying to figure out how to augment the skip list to optimize these functions, but we did not have time to get a working implementation. Unfortunately, we did not realize that it would be easier to augment a tree structure than a skip list until it was too late to change the basic data structure. The idea we had for a tree structure was to augment each node with the min and max values of the node and all its descendants. These values can be computed from the values of the node and its two children, and can be maintained through Insert and Delete in O(lg n) time, because only nodes along the search path from the root to the inserted or deleted node, plus the nodes involved in rotations, are affected. Finding an equivalent augmentation for skip lists was more difficult.
The idea we had was to augment each node with min and max values at each level. These values summarize the node and all the nodes to its right, up to the next node at the same level. For example, the min at level 3 in node 6 would cover the min values of nodes 6 through 21, because 25 is the next node at level 3. Each value can be determined from the values one level down, up to the next node at the same level: the min at level 3 in node 6 can be determined from the mins at level 2 in nodes 6, 9, and 17. Looking at the nodes at the next lower level in a skip list is equivalent to looking at the children in a tree. The information can be maintained through Insert and Delete in O(lg n) time because the affected nodes and levels are exactly the ones already examined to find the point of insertion or deletion, much like the path from the root to a node in a tree. The min and max values would be updated along with the forward pointers, each recalculated from the values at the next lower level up to the next node at the same level.
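A sketch of what the augmented node might have looked like (never implemented; names are ours):

```c
/* Proposed augmentation: per-level min/max summaries covering this
 * node and everything to its right, up to the next node at the
 * same level. */
typedef struct AugNode {
    int             key;
    int             value;
    int             level;
    int             min[MAX_LEVEL];  /* min value over each level's span */
    int             max[MAX_LEVEL];  /* max value over each level's span */
    struct AugNode *back;
    struct AugNode *forward[MAX_LEVEL];
} AugNode;
```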
Search is implemented as a randomized algorithm operating on a data structure with the following expected distribution of elements: 75% of the elements appear only at the lowest level, 18.75% reach exactly one level higher, and so on, with each higher level containing one-fourth as many elements (for a probability value p = 0.25). Although the search algorithm we implemented stops the first time it encounters the desired key, which may happen at a level above the lowest, in the worst case the algorithm must continue the search all the way down to the lowest level.
As the search algorithm moves from element to element at a given level, each element it skips past at level i eliminates an expected (1/p)^i elements from the search pool. It is therefore possible to examine a few elements at high levels and eliminate many intermediate elements without examining them. The search then drops to the next level, looks at a subgroup of the elements there, and continues until it reaches the lowest level. In all, it examines log_{1/p}(n) levels and, at each level, it expects to look at only a small fraction of the n total elements. Therefore, this operation is expected to run in O(lg n) time. Pugh's paper presents a more rigorous derivation of the expected running time.
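To make the constant concrete, Pugh's analysis gives an expected search cost of roughly (log_{1/p} n)/p comparisons; the worked number below plugs in our parameters and is our own illustration, not a figure from the assignment:

```latex
% Approximate expected search cost from Pugh's analysis,
% with our parameters p = 1/4 and n = 10{,}000:
\frac{\log_{1/p} n}{p}
  = \frac{\log_{4} 10000}{1/4}
  \approx \frac{6.64}{0.25}
  \approx 27 \text{ expected comparisons}
```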
Insertion requires searching for the element with the largest key smaller than the key of the element being inserted. This requires a search, which takes an expected O(lg n) time. Once this element is found, pointers need to be updated; the maximum number of pointer changes is proportional to the maximum number of levels in the list, which is a constant, so this step is O(1). Therefore, the expected running time of Insert is O(lg n) + O(1) = O(lg n). Delete is similar to Insert in that it requires a search followed by pointer updates, and its maximum number of pointer changes is likewise proportional to the maximum number of levels in the list. Therefore, Delete also runs in expected O(lg n) + O(1) = O(lg n) time.
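A matching sketch of Delete (node_free() is again a hypothetical stand-in for the storage manager):

```c
/* Delete, mirroring Insert: splice the node out of every level it
 * occupies, then repair the back pointer and tail. */
void skiplist_delete(SkipList *list, int key)
{
    Node *update[MAX_LEVEL];
    Node *x = list->header;
    int   i;

    for (i = list->level - 1; i >= 0; i--) {
        while (x->forward[i] != NULL && x->forward[i]->key < key)
            x = x->forward[i];
        update[i] = x;
    }
    x = x->forward[0];
    if (x == NULL || x->key != key)
        return;  /* key not present */

    for (i = 0; i < list->level; i++) {
        if (update[i]->forward[i] != x)
            break;  /* x does not reach this level */
        update[i]->forward[i] = x->forward[i];
    }

    /* augmentation: back pointers and tail */
    if (x->forward[0] != NULL)
        x->forward[0]->back = update[0];
    else
        list->tail = update[0];  /* header sentinel if list is now empty */

    node_free(x);  /* hypothetical */

    while (list->level > 1 && list->header->forward[list->level - 1] == NULL)
        list->level--;
}
```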
Minimum runs in O(1) worst-case time because it only requires accessing the first element in the list, which can be accomplished by looking at the Next pointer of the Header of the list. Similarly, Maximum runs in O(1) worst-case time because it just returns the Tail from the list. Successor runs in O(1) worst-case time. In general, it simply needs to return the Next pointer of the element. In the special case where it needs to find the successor of element 0, it returns the minimum, which is still O(1). Predecessor also runs in O(1) worst-case time because it can just return the Previous pointer of the element. It also handles the special case of returning the predecessor of element 0 in O(1) because that is just the case of finding the maximum.
Both MinInRange and MaxInRange follow a similar algorithm and have an expected running time of O(lg n + k), where k is the number of elements in the range. The algorithm searches for the smallest element in the range, which takes an expected O(lg n). It then linearly proceeds through the list until it encounters a key that is outside the desired range. This requires looking at k elements, so the running time for this part of the algorithm is O(k). Therefore, the running time is O(lg n + k).
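A sketch of MinInRange under the same assumptions (MaxInRange is symmetric; returning INT_MAX to signal an empty range is a convention of this sketch, not of the assignment):

```c
#include <limits.h>

int min_in_range(const SkipList *list, int lo, int hi)
{
    Node *x = list->header;

    /* search-style descent to the first node with key >= lo */
    for (int i = list->level - 1; i >= 0; i--)
        while (x->forward[i] != NULL && x->forward[i]->key < lo)
            x = x->forward[i];
    x = x->forward[0];

    /* level 0 scan: O(k) for k elements in the range */
    int min = INT_MAX;
    while (x != NULL && x->key <= hi) {
        if (x->value < min)
            min = x->value;
        x = x->forward[0];
    }
    return min;
}
```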
Because the algorithms are randomized, no particular input sequence creates a worst or best case for Insert, Delete, and Search. However, certain situations can cause these operations to perform poorly. In general, inputs with a small number of elements perform poorly because they do not allow full use of the levels; depending on the levels at which elements are inserted, the structure may end up as just an ordinary linked list. Similarly, if a user deleted every element that rose above the lowest level, the result would again be an ordinary linked list. In these cases, Insert, Delete, and Search take O(n) time. Conversely, inputs with a large number of elements tend to produce the best performance for these operations.
Minimum and Maximum are constant time operations and have no best or worst case. Successor and Predecessor are also constant time operations that have similar running times no matter what the input is. Successor might be slightly slower if the user requested the successor of element 0, since that would require one extra pointer assignment.
There are two cases where MinInRange and MaxInRange perform poorly. Because they depend on the search algorithm, the situations described above for poor searches also hurt MinInRange and MaxInRange. In addition, the longer the range being examined, the worse these operations perform; in fact, once the number of elements in the range exceeds log_{1/p}(n), the linear scan dominates the running time. The best performance occurs for very short ranges on lists with a large number of elements and a maximum number of levels.
Structure | Insert | Delete | Search | Min | Max | Successor | Predecessor | MaxInRange | MinInRange
Doubly Linked Lists | O(n) | O(n) | O(n) | O(1) | O(1) | O(1) | O(1) | O(n) | O(n)
Compact Linked Lists | O(sqrt(n)) | O(sqrt(n)) | O(sqrt(n)) | O(1) | O(1) | O(1) | O(1) | O(n) | O(n)
Binary Search Trees | O(n) | O(n) | O(n) | O(n) | O(n) | O(n) | O(n) | O(n) | O(n)
Red-Black Trees | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n + k) | O(lg n + k)
B-Trees | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n + k) | O(lg n + k)
AVL Trees | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n + k) | O(lg n + k)
Amortized Weight Balanced Trees | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n + k) | O(lg n + k)
Radix Trees | O(r) | O(r) | O(r) | O(r) | O(r) | O(r) | O(r) | O(r) | O(r)
Splay Trees | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n + k) | O(lg n + k)
Unaugmented Skip Lists | O(lg n) | O(lg n) | O(lg n) | O(1) | O(1) | O(1) | O(1) | O(lg n + k) | O(lg n + k)
Scapegoat Trees | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n + k) | O(lg n + k)
Randomized Search Trees | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n) | O(lg n + k) | O(lg n + k)

(k = number of elements in the range; r = maximum number of bits in a key)
We found that the tree structures yield asymptotic running times very similar to one another. For example, tree structures tend to give O(lg n) running times for Insert and Delete regardless of the augmentations made to them, because the depth of the tree must be at least O(lg n), where n is the number of nodes in the tree. Our structure improves on a general tree structure through its back pointers (which let us do Predecessor more efficiently): Predecessor, Successor, Min, and Max run in O(1) time with only minimal changes to Insert and Delete. It was not clear that MinInRange and MaxInRange could be done sublinearly. Our decision to use skip lists was based on their efficiency for the Min/Max and Predecessor/Successor operations, which together account for 40% of the client's usage patterns.
We tested our data structure for various values of p on a Sun Ultra 1 Model 170. With a set size of 10,000, we determined that p = 0.25 was the most efficient, running 150-200% faster than p = 0.5. Although our implementation may be slower than some alternatives, it is correct, and in this respect we feel our choice of data structure was a good one: what it lacked in technical difficulty, it more than made up for in ease of implementation and debugging time.
There is an ftp site that has C and Pascal code in its /pub/skipLists directory. Although we cannot honestly say that we have not seen this code, its impact on our implementation was marginal: it did not include optimizations for searching, and we did not find it easy to incorporate our tail node into the representation it provided.