hashing

hashing

A hash table contains a set of key-value pairs. The access functions for hash tables accept a key and retrieve the corresponding value. Here are the details, together with a discussion of how we designed the hash table system seen in Macaulay 2.

The keys and values are stored in the hash table. The hash table consists of a certain number of buckets, each of which can hold an arbitrary number of key-value pairs. The number of buckets is chosen to be large enough that typically one may expect each bucket to hold fewer than three key-value pairs. The key is used as follows to determine in which bucket the key-value pair is to be stored. The function hash is applied to the key to produce, in a deterministic way, an integer called the hash code, and the remainder of the hash code upon division by the number of buckets tells which bucket will be used.

It is essential that the hash code of a key never change, for otherwise the next attempt to find the key in the hash table will have an unpredictable result - the new hash code of the key may or may not lead to the same bucket, and the value may or may not be located.

Some hash tables and lists are mutable, i.e., their contents can be altered by assignment statements. As explained above, the hash code of a mutable thing is not permitted to change when the contents of the thing change. Hence, the algorithm for computing the hash code may not refer to the contents of a mutable thing.

The strong comparison operator === is provided to parrot the equality testing that occurs implicitly when accessing a key in a hash table. The fundamental requirement for this strong comparison operator is that things with different hash codes must also turn out to be different when tested with this comparison operator.

Here we come to a question of design. As discussed above, we must assign hash codes to mutable things in such a way that the hash codes don't depend on their contents. We can do this in various ways.

One way to assign hash codes to mutable things is to give the same hash code, say 1000000, to every mutable thing. We could then implement a strong comparison operator for mutable things that would proceed by examining the contents of the things, so that two mutable things would be equal if and only if their contents were equal. A disadvantage of this approach would be that a hash table in which many mutable things appear as keys would have all of those key-value pairs appearing in the same bucket, so that access to this hashtable would be slow. (Each bucket is implemented as a linear list, and searching a long linear list is slow.)

Another way to assign hash codes to mutable things is to give different hash codes to each mutable thing; for example, the first mutable thing could receive hash code 1000000, the second could receive the hash code 1000001, and so on. (Another choice for such a hash code is the address in memory of the thing. But this address can change depending on environmental factors not under the control of the interpreter, and thus its use as a hash code would lead to unpredictable behavior.) A disadvantage of this approach is that the strong comparison operator could not examine the contents of mutable objects! (Remember that if the hash codes are different, the strong comparison must declare the things to be different, too.) The offsetting advantage is that a hash table in which many mutable things appear as keys would typically have the key-value pairs distributed among the buckets, so that access to this hashtable would be fast.

In Macaulay 2, we chose the second approach listed above; we expect to have many mutable things appearing as keys in hash tables, and we need the speed. A counter with initial value 1000000 is incremented each time a mutable thing is created, and its value is taken as the hash code of the thing and stored within it. The strong comparison test cannot depend on the contents of mutable things, and thus such things appear to be containers with opaque walls. For mutable things, the test for equality must be the same as equality of the hash codes.

It is essential to have some hash tables for which equality amounts to equality of the contents. This cannot be achieved for mutable hash tables, but we do achieve it for non-mutable hash tables -- the hash code is computed directly from the contents of the thing in a deterministic way. This allows us to implement the notion of polynomial, say, as a hash table -- the keys can be the monomials (necessarily non-mutable) and the values can be the coefficients. The notion of monomial can be implemented as a hash table where the keys are the variables and the values are the corresponding exponents.

One further comforting remark: the routines that compute hash codes or strong equality do not get into infinite loops, despite the existence of circular structures: any circular structure must come into being by means of changing something, and so the circular loop in the structure necessarily involves a mutable thing, whose contents are not examined by the routines. This provides another argument in favor of taking the second approach listed above.

See also:

HashTable -- the class of all hash tables