Monday, February 7, 2011

Hashmap 101: Build your own!

Recently I saw an article on dzone talking about harassing future minions about HashMap details. I have never personally asked an interview victim this particular question but I might just start!

As I prefer to see code (hmm ... perhaps I should ask candidates to code their own HashMap on a whiteboard?) I have taken it upon myself to build a very simple one as an example. The sample code is all for Java 6, tests are for JUnit 4.

First, an overview of how a basic hashtable works. Every entry has a key that has a hash code and a value. Keys must be comparable in some fashion. Many different keys can have the same hash code, though in practice we want this to tend to be true only of keys that are also considered equivalent.

In Java Object helpfully provides hashCode() and equals() so we know that any object will be usable as a key in a hashtable.

Internally a hashtable stores a set of buckets, typically in a simple array. Each bucket contains 0..N entries, each of which has a key and a value. To find what bucket an entry goes into we generate a numeric hashcode for the key, and determine what bucket index from hashcode mod the number of buckets. In Java, something like:
int bucketIndex = key.hashCode() % buckets.length;
In a crude drawing we can see that we have an array of buckets, each of which contains 0..N entries. As shown below, probably in a linked list, though this is very much an implementation detail:
Anyway, as I'm pro-code anti-word (50% of that is true!), lets see some damn code. First we'll need a simple entry class. We need a key, which can't change, a value, which can change, and a pointer to next so we can make our array that has 0..N entries in each spot. Something like this should do it:
package whileonefork.map;

public class AnEntry<K, V> {
 private AnEntry<K, V> next;
 private final K key;
 private V value;
 
 public AnEntry(K key, V value) {
  this.key = key;
  this.setValue(value);
 }

 public K getKey() {
  return key;
 }

 public void setValue(V value) {
  this.value = value;
 }

 public V getValue() {
  return value;
 }

 public void setNext(AnEntry<K, V> next) {
  this.next = next;
 }

 public AnEntry<K, V> getNext() {
  return next;
 } 
}

As a complete aside ... what a lot of bloody boilerplate. If only Java had a keyword or other shortcut for properties that don't have any logic. private final property K key anyone?

Anyway, returning to the point, next we'll need the skeleton of our map class. We'll provide a dead simple API: no null keys, only get/put as operations.
package whileonefork.map;

public class SimpleHashMap<K, V> {
 
 private int DEFAULT_BUCKET_COUNT = 128;
 
 private AnEntry<K, V>[] buckets;
 
 public SimpleHashMap() {
  buckets = new AnEntry[DEFAULT_BUCKET_COUNT];
 }
 
 public V get(K key) {
  //TODO
 }
 
 public void put(K key, V value) {
  //TODO
 }
}
The first operation we'll implement is get. We'll need to identify what bucket to use, then cycle through the 0..N entries for that bucket to see if any of them have the exact key we are interested in:
public V get(K key) {
  throwIfKeyNull(key);
  
  AnEntry<K, V> entry = buckets[bucketIndexForKey(key)];
  while (entry != null && !key.equals(entry.getKey())) 
   entry = entry.getNext();
  return entry != null ? entry.getValue() : null;
 }
Next, put. If there is nothing at all in the bucket we can just create a new entry and shove it in. If there is already 1..N things there we need to search through them all. If we find one with the exact same key we'll update it, otherwise we'll just shove a new entry on the end:
public void put(K key, V value) {
  throwIfKeyNull(key);
  
  int bucketIndex = bucketIndexForKey(key);
  AnEntry<K, V> entry = buckets[bucketIndex];
  
  if (null != entry) {
   boolean done = false;
   while(!done) {
    if (key.equals(entry.getKey())) {
     entry.setValue(value);
     done = true;
    } else if (entry.getNext() == null) {
     entry.setNext(new AnEntry<K, V>(key, value));
     done = true;
    }
    entry = entry.getNext();
   }
  } else {
   //nothing there at all; just shove the new entry in
   buckets[bucketIndex] = new AnEntry<K, V>(key, value);
  }
 }

Note that the extent to which hash codes are widely distributed will significantly influence how many entries tend to be in the same bucket. If people implement hashCode() as 'return 1;' or some other minimally unique value then our hashtable will tend to be quite slow as all the entries will go into the same bucket. Ideally the vast majority of buckets have 0 or 1 entries.

A real implementation may do fancy things like monitor it's own stats and resize the buckets array for optimal operation if it thinks doing so might help. The most common heuristic for this is how many buckets have something in them, eg if more than 60% of our buckets are non-null (have 1..N entries) then we should rebuild a new, larger, buckets array. hashCode mod buckets.length changes if you alter the bucket count so you have to migrate all the entries from the old buckets into the new ones at potentially different indices. This functionality is omitted from the example.

Putting it all together our final hashmap looks like this:
package whileonefork.map;

public class SimpleHashMap<K, V> {
 
 private int DEFAULT_BUCKET_COUNT = 128;
 
 private AnEntry<K, V>[] buckets;
 
 public SimpleHashMap() {
  buckets = new AnEntry[DEFAULT_BUCKET_COUNT];
 }
 
 public V get(K key) {
  throwIfKeyNull(key);
  
  AnEntry<K, V> entry = buckets[bucketIndexForKey(key)];
  while (entry != null && !key.equals(entry.getKey())) 
   entry = entry.getNext();
  return entry != null ? entry.getValue() : null;
 }
 
 public void put(K key, V value) {
  throwIfKeyNull(key);
  
  int bucketIndex = bucketIndexForKey(key);
  AnEntry<K, V> entry = buckets[bucketIndex];
  
  if (null != entry) {
   boolean done = false;
   while(!done) {
    if (key.equals(entry.getKey())) {
     entry.setValue(value);
     done = true;
    } else if (entry.getNext() == null) {
     entry.setNext(new AnEntry<K, V>(key, value));
     done = true;
    }
    entry = entry.getNext();
   }
  } else {
   //nothing there at all; just shove the new entry in
   buckets[bucketIndex] = new AnEntry<K, V>(key, value);
  }
 }

 /**
  * THIS SHOULDN'T ACTUALLY BE PUBLIC; IT IS SO JUST FOR CLARIFICATION UNIT TEST PURPOSES
  * 
  */
 public int bucketIndexForKey(K key) {
  int bucketIndex = key.hashCode() % buckets.length;
  return bucketIndex;
 }
 
 private void throwIfKeyNull(K key) {
  if (key == null) {
   throw new IllegalArgumentException("key may not be null");
  }
 } 
}

We can even write some basic tests for it:
package whileonefork.map;

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class SimpleHashMapTestCase { 
 @Test
 public void readWriteSimpleValue() {
  SimpleHashMap<String, Integer> map = new SimpleHashMap<String, Integer>();
  
  map.put("duck", 7);
  map.put("goose", 42);
  
  assertEquals(Integer.valueOf(7), map.get("duck"));
  assertEquals(Integer.valueOf(42), map.get("goose"));
 }
 
 @Test
 public void getSomethingThatIsntThere() {
  SimpleHashMap<Integer, String> map = new SimpleHashMap<Integer, String>();  
  assertEquals(null, map.get(42));
 }
 
 @Test
 public void overWriteValue() {
  SimpleHashMap<Integer, String> map = new SimpleHashMap<Integer, String>();
  
  map.put(42, "meh");
  assertEquals("meh", map.get(42));
  map.put(42, "we have the technology to rebuild him");
  assertEquals("we have the technology to rebuild him", map.get(42));
 }
 
 @Test
 public void doNotOverwriteValueJustBecauseSameHashCode() {
  String diffButSameHash1 = "Ea";
        String diffButSameHash2 = "FB";
        
        //prove they really have the same hash
        assertEquals(diffButSameHash1.hashCode(), diffButSameHash2.hashCode());
                
        SimpleHashMap<String, String> map = new SimpleHashMap<String, String>();
        
        //prove they really go into the same bucket (yes, might be overdoing it here!)
        assertEquals(map.bucketIndexForKey(diffButSameHash1), map.bucketIndexForKey(diffButSameHash2));
                
        map.put(diffButSameHash1, "We Are All Unique");
        map.put(diffButSameHash2, "I'm Not!");
        
        //same hash but amazingly still different values!
        assertEquals("We Are All Unique", map.get(diffButSameHash1));
        assertEquals("I'm Not!", map.get(diffButSameHash2));
 }
}

Disclaimer: This code is garbage, purely for example purposes. It almost certainly has horrible bugs. You'd be crazy to use it for anything.

12 comments:

Anonymous said...

Hi,
you made this question How HashMap works in Java even more difficult by asking interviewee to write your own HashMap :) , Nice article though. by the way some important thing to note here is how interviewee reacts to this question not only on coding perspective but also on design and considering performance and usability aspect e.g. how often rehashing should be done, way to provide load factor and collision resolution handling etc.

Thanks
Javin
Why String is immutable in Java

Foudres said...

Changing the number of buckets is costly and require that the keys are evently split accross the hashcode values interval.

Imagine that you are in java. hash code range is something like from -2^31 to 2^31.

Let say you want to store numbers from 1 to 1000 in the map as a key. And that you use a naive hashCode algorithm : just the number itself.

This is exactly what the JDK 1.5.0_18 does at least.

Then you need something like 4 billions buckets to ensure that each number can be accessed instantly.

This is really, really bad.

Instead what you need is a tree. Each bucket can contain a list of leafs or a sub array of buckets.

Your heuristic is now to decide when you switch from a list of leafs to a full array of subbucket.

Go back to the numbers from 1 to 1000 exemple. Let say we start with 16 buckets per tree lvl.

You will need 8 lvl for your tree to ensure that there is one bucket for each number from 1 to 1000. And you need something like 1200 bucket total. Not so bad...

Anonymous said...

Nice Article

thoughts_anshul said...

Why didnt you use Linked List for implementing a bucket. You are traversing it sequentially so not taking any advantage of array. By taking Linked List your problem of increasing bucket size would be solved

penn73er said...

I like the article, but it would be nice if the code would compile. My guess is you're British, a race that is fond of omitting details that in their informed opinion are unimportant.

Shashank Jain said...

@penn73er, he might be british, but you are a racist for sure!

Gopal said...

nice code!
And what are dumb racist kids doing in a such a technical blog! I thought they were dead already!

Javed said...

very clean any easy to understand the basics of how hash map works. I had spent a lot of time on the source code of HashMap.

Jaskaran Singh Khalsa said...

excellent code...had i read this article before my interview i would have got the internship

HARRY said...

How about using key.hashCode() & (buckets.length -1) ??

A M said...

DETAILED EXPLANATION of how HASHMAP works internally in java , with diagrams.
http://www.javamadesoeasy.com/2015/02/hashmap-custom-implementation.html

A M said...

DETAILED EXPLANATION of how HASHMAP works internally in java , with diagrams.
http://www.javamadesoeasy.com/2015/02/hashmap-custom-implementation.html

Post a Comment