Hash Considerations Discussion

 

This page is a discussion of hashing as a storage mechanism.  The basic idea of hashing is storage of data (called the value) based on some key associated with the value.  The key may be the data itself.  

The key is said to be hashed (corned beef hash is chopped and stirred up) to produce an index into a hash table.  The function that produces the hashing of the key is called the hash function.

Hashing is a fundamental part of every programmers education.  It is commonly used and provides for constant (or nearly constant) time storage and retrieval.  Unlike storage in an array, which requires integer indices,  keys may be of any type and may even be objects.

One draw back to hashing is that the hash function is not guaranteed to generate unique indices. When duplicate slots are chosen for different keys, a collision is said to have occured. 

Hash functions do not exist in a vacuum.  We must consider them with respect to the hash table.

One of the ways that the hash function relates to the table is that after a hash calculation, the typical thing to do is to take the modulus with respect to the size of the table.  This insures that you will never generate an index that is outside the size of the table.

For example: Suppose that from some key you generate a hash of 12345 and suppose your table has 139 slots for storage. When you divide 12345 by 139 you get a remainder of 113.  113 would be the slot where you would place the key value pair.  You can never generate a slot greater than 139 if you are taking hash(key)%139.  In fact, your range will be 0-138.

Keep in mind that a good hashing function creates a maximum dispersal of indices.  This means that keys of nearly identical data can be stored with a minimum of collisions.  

When we design a hash table, we do not know a priori what use the hash table will be put to and, so, we must design for degenerate cases.  For example, someone may use our hash table to store very similar strings or they may provide a hash function that is not very good: i.e. it does not disperse indices well.

The following table shows the result of using a table size that is a multiple of 2 - 128 verses a table of a prime size - 97. 

The hash function used is:

size_t  hash(char const* str,int mult)  {

  size_t  h = 0; 
  for (; *str; ++str)  {
    h = mult * h + *str;
  }
  return  h ;
}

*NOTE: I am using unsigned arithmetic (size_t) the overflow just wraps. If we used int, we could generate negative numbers with overflow.

string multiplier hash hash % 128 hash % 97
a1 256 24881 49 49
b1  256  25137 49 14
c1 256  25393 49 76
z1 256  31281 49

47

Notice that a hash table size of 128 generates identical slots whereas when the table size is a prime number - 97, it generates a unique slot for each string.

Here is a similar table, but the hash function now uses a prime multiplier.

string multiplier hash hash % 128 hash % 97
a1 13 1310 30 49
b1  13  1323 43 62
c1 13  1336 56 75
z1 13 1635 99

83

file to generate hashing results

So we can conclude that a good choice of table size (size being prime is good) can compensate for a poor choice of hashing functions.  It is generally agreed that using prime numbers to size hash tables is a good idea.

This code could be used to give table sizes up to 805,306,457.


static const int num_primes = 25;
static const unsigned long prime_list[] =
{
  53,         97,         193,       389,       769,
  1543,       3079,       6151,      12289,     24593,
  49157,      98317,      196613,    393241,    786433,
  1572869,    3145739,    6291469,   12582917,  25165843,
  50331653,   100663319,  201326611, 402653189, 805306457
};