Lecture 11 - Hashing

## 1.3. Hashing

Both observations lead to the same idea, the idea of *hashing*
the key values:

We do not use the *key* directly as an array subscript.
Instead, we have a function that converts a key value into a valid
array subscript.

Such a function is called a *hashing function*. We will use
H(k) for the hashing function. The value it returns - H(K) - is called
the hash value for key K. The array used in conjunction with a hashing
function to implement the abstract type `Table' is called a Hash
Table.

For example, your social insurance number could be mapped to a
subscript in the range 0..99 by taking it MOD 100 - i.e. by looking at
the last two digits. Or we could extract the third and sixth digits.
Indeed, we could do just about anything we liked.

Similarly, if your name was being used as the key, it could be
mapped to a subscript in the range 0..99 by converting each character
to an integer, adding up all the integers, and taking the sum MOD 100.
Of course, there are endless other ways of mapping a string to an
array subscript.

In order for the operations to be constant time, it is necessary
for the hashing function to be a constant time function. This
precludes, for example, storing the keys and their associated
hash-values in a binary search tree and *looking up* the
hash-value each time it's needed.

A fundamental premise of our original strategy was that different
keys would correspond to different positions in the array. If we use a
hashing function, this means that if K1 =/= K2, then H(K1) =/= H(K2).
A hashing function with this property is called a *perfect hashing
function*.

The construction of a perfect hashing function is usually
difficult, even when the *actual* key values are known. In
practice, it is almost always necessary to choose the hashing function
before the set of actual key values is entirely known.

In general, then, our hashing functions will not be perfect; it
will happen that two different keys get mapped to the same hash value.
We call this a collision. If there were no collisions, we would have a
constant-time implementation of all the Table operations. In order for
our implementation to be as close to constant-time as possible, we
need to do two things:

- minimize the number of collisions
that do occur
- `resolve' collisions in a way
that does not degrade performance