Understanding Hash Tables

This article is regarding the basic understanding of a relatively complicated topic in Computer Science - Hash Tables. I have prepared this article keeping in mind to provide a basic and conceptual overview of the hashing mechanism.


* What is Hash Table or Hashing ?
    Hash Table is a data-structure, which provides a highly efficient way of retreival of data, in other words hashing is a concept that provide fast and efficient search algorithms to the applied data set. Let us first understand the need of hash tables using an example:
   
    Suppose, we want to implement a dictionary for a specified list of words. For a good implementation of a dictionary, it is imperative to have a very good search mechanism. Let us consider few different approaches to get a better search algorithm:

 Randomly ordered list  linear sequential search  O(n)
 Alphabetically ordered list  binary search  O(log n)
 Binary Search Tree  binary search  O(log n)
 Hash Tables  Hash Maps  O(1)

       
* So, one thing is for sure, hash table is the way to go to achieve a good search algorithm. But, what is it that makes searching so fast in a hash table ?
    The problem with other search method is the way data (or more appropriately - keys) is arranged. In an alphabetically ordered list of words, we start comparsion with the middle element of a list and then at each step we keep on dividing the data-set into 2 halves and then proceed with one of them. It is better then sequential searching, but still we have to browse through many elements to reach our desired result.
   
    But, what if we could directly access the required word from dictionary by some means and avoid the need to go through different words for elimination ! If we really can do that, we can reduce our search time complexity to O(1) i.e. Constant Time. This is the basic idea behind a hash table. We provide specific hash map functions which can be used to provide a mapping so that every word in the dictionary can be accessed directly.
    
EduSagar - hash table    

int H1(char *key) {
      return (key[0] - 'a');
}

    
    In the above example, we have 10 words and through the use of mapping function H(key) we can now directly access any word out of these 10 words in constant time. By, constant time complexity it is meant that even if the number of words in our data set increases, the time taken to search a word remains the same. In simple words, even if the words in dictionary doubles at a later stage, the search routine should take the same time.
  
    What if we have two different words but mapping to the same bucket? e.g. "apple" and "ant". They both map to the same bucket - bucket0. Such a situation is known as "collision". There are several techniques to handle collision in the context of hash tables.
   
* Hashing by Chaining
    One simple way to resolve collision is to provide hash functionality through chaining mechanism. We put all the elements that map to the same value at the appropriate bucket but that bucket now contains multiple elements linked with each other through a link-list.
    

EduSagar - hashing by chaining
    
   

int H2(char *key) {
      return (key[0] - 'a');
}


    
    This is very popular form of hashing, as it requires knowledge of basic data structures only. To insert a new element, we add it to the end of list. To search for an element, we need to search it in the link list associated with the bucket to which the hash map function maps to. In the worst case, we might have to traverse the full link list to search the required word. But, in average cases, if we consider simple uniform distribution of words we can say that the search depends on the average number of elements per bucket (also known as "load factor").
   
* Open Addressing
    Suppose, we have the same original map function H1(k) and we try to insert "ant" into the table when "apple" is already present at that slot in the hash table. Now, if we simply go on and place "ant" in the next available location e.g. bucket3. Now, if another word "america" comes, we would find still another empty in the table and insert it there. This technique is known as "linear probing" and is an example of a general method of resolving collisions called "rehashing" / "open addressing".
   
    To search an element, we systematically examine table slots until the desired element or a NULL is found. Every time the hash function will provide us the next slot which needs to be searched.
   
    Linear Probing: H3(k, i) = (H1(k) + i) mod m
    Here m is the total number of elements in the table and i = 0...m. For key k, we first search slot [H1(k)] i.e. slot returned by the original hash map function. Next, we search [H1(k) + 1] and so on till we wrap around our search after slot [m] to [0]..[1] till slot [H1(k) - 1]. Linear probing is easy to implement, but it suffers from "primary clustering" issue. After some time, the probability of a key being mapped to a particular empty slot keeps on increasing e.g. in the longer runs all the new words would be in contention to be filled up at the next available empty bucket.
   
    Quadratic Probing: H3(k, i) = (H1(k) + c1 + (c2)) mod m
    Here, c1 and c2 are some constants. The slots being probed depends on a quadratic equation. Although, quadratic probing is much better then linear probing but on longer runs it leads to "secondary clustering".
   
    Note: In both of the above methods, the initial probe determines the entire sequence, so only m distict probe sequences are available.
   
    Double Hashing : H3(k, i) = (h1(k) + ih2(k)) mod m
    Here, we have two different hash functions, and hence the probing sequence may vary from the initial probing sequence. However, there are limitations on the selection of these two hash functions as stated below:
   

    h1(k) = k mod m
    h2(k) = 1 + (k mod m')

    
    Here, m is to be chosen as a power of 2 and design h2 so that it produces and odd number. Another way is to choose m to be a prime number and choose h2 so that it gives a +ve integer less than m.
   
    Open Addressing saves space usage only if the words/elements are smaller in size and if the load factor is not to small. If the load factor is close to zero, open addressing is considered wasteful.
   
    Example : A hash table of size 13 and h1(k) = k mod 13 and h2(k) = 1 + (k mod 11). For inserting 14, for i=0 => H3(14) => 1 which is full, Next for i=1, H3(14) => 4, which is full again. Next, for i=2 H3(14) => 9 which is empty and hence, we fill the key 14 there.
    

EduSagar - double hashing
   
Coalesced Hashing
    A combination of chaining and open-addressing gives way to coalesced hashing. Here, we maintain chains of nodes / elements within the table itself. Hence, we have the advantage of space utilization of the open-addressing scheme while like chaining scheme it doesnt have clustering issues.
   
Summary of different hashing methods:
    In the end, let us summarize the different attributes of chaining and re-hashing mechanisms.

 Method  Advantages  Disadvantage
 Chaining  unlimited number of elements  overhead of multiple nodes in linked list form
   no collisions  
 Re-hashing   Fast access  maximum number of elements must be known in advance
     multiple collision issues

 

- Pankaj
Happy Programming !!

comments powered by Disqus