Lecture 11 - Hashing

1.3. Hashing

Both observations lead to the same idea, the idea of hashing the key values:

We do not use the key directly as an array subscript. Instead, we have a function that converts a key value into a valid array subscript.

Such a function is called a hashing function. We will use H(k) for the hashing function. The value it returns - H(K) - is called the hash value for key K. The array used in conjunction with a hashing function to implement the abstract type `Table' is called a Hash Table.

For example, your social insurance number could be mapped to a subscript in the range 0..99 by taking it MOD 100 - i.e. by looking at the last two digits. Or we could extract the third and sixth digits. Indeed, we could do just about anything we liked.

Similarly, if your name was being used as the key, it could be mapped to a subscript in the range 0..99 by converting each character to an integer, adding up all the integers, and taking the sum MOD 100. Of course, there are endless other ways of mapping a string to an array subscript.

In order for the operations to be constant time, it is necessary for the hashing function to be a constant time function. This precludes, for example, storing the keys and their associated hash-values in a binary search tree and looking up the hash-value each time it's needed.

A fundamental premise of our original strategy was that different keys would correspond to different positions in the array. If we use a hashing function, this means that if K1 =/= K2, then H(K1) =/= H(K2). A hashing function with this property is called a perfect hashing function.

The construction of a perfect hashing function is usually difficult, even when the actual key values are known. In practice, it is almost always necessary to choose the hashing function before the set of actual key values is entirely known.

In general, then, our hashing functions will not be perfect; it will happen that two different keys get mapped to the same hash value. We call this a collision. If there were no collisions, we would have a constant-time implementation of all the Table operations. In order for our implementation to be as close to constant-time as possible, we need to do two things:

  1. minimize the number of collisions that do occur
  2. `resolve' collisions in a way that does not degrade performance