The Department of Computer Science & Engineering
STUART C. SHAPIRO: CSE 116 B

# Hash Tables

Riley Chapter 8

Introduction
Recall, an ArrayList is an unbounded mutable random-access collection indexed by ints. A hash table is an unbounded mutable random-access (well, almost) collection indexed by Objects.

Keys & Values
Instead of talking about a value stored in an array at some index, we say that a value is stored in a hash table paired with some key. The key is used to find the value previously stored in the hash table. The storage and retrieval methods are `put(Object key, Object value)` and `get(Object key)`.
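A minimal sketch of `put` and `get`, using `java.util.HashMap` (discussed later in these notes). The class name `KeyValueDemo` and the sample names and numbers are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class KeyValueDemo {
    // Builds a small table mapping names (keys) to office numbers (values).
    static Map<String, Integer> offices() {
        Map<String, Integer> table = new HashMap<>();
        table.put("Ada", 201);
        table.put("Grace", 305);
        table.put("Ada", 210);   // same key again: the old value is replaced
        return table;
    }

    public static void main(String[] args) {
        Map<String, Integer> table = offices();
        System.out.println(table.get("Ada"));    // the value most recently stored under "Ada"
        System.out.println(table.get("Grace"));
        System.out.println(table.get("Alan"));   // no such key: get returns null
    }
}
```

Note that storing a second value under an existing key replaces the first, just as assigning to an array index does.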

Hash Functions
How can any Object be used as a key? Turn the object into an int with a hash function. See `java.lang.Object.hashCode()`. In order to find the value previously stored, it is essential that if `obj1.equals(obj2)` then `obj1.hashCode() == obj2.hashCode()`. (The converse need not hold: objects that are not equal may still have the same hash code.)

Example:

```
bsh % int[] a = {2, 4, 6, 2, 7, 9};

bsh % a2 = new ArrayList();
bsh % b2 = new ArrayList();

bsh % for (int i=0; i<a.length; i++) a2.add(a[i]);
bsh % for (int i=0; i<a.length; i++) b2.add(a[i]);

bsh % print(a2);
[2, 4, 6, 2, 7, 9]
bsh % print(b2);
[2, 4, 6, 2, 7, 9]

bsh % print(a2 == b2);
false
bsh % print(a2.equals(b2));
true

bsh % print(a2.hashCode());
948636961
bsh % print(b2.hashCode());
948636961
```
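The same property has to be arranged by hand for a class you write yourself. Below is a sketch, assuming a hypothetical class `GridPoint` (not from Riley or the Java library), that overrides `equals` and `hashCode` together so that content-equal objects get equal hash codes.

```java
import java.util.Objects;

class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Two points are content equal when both coordinates match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) other;
        return x == p.x && y == p.y;
    }

    // Computed from exactly the fields that equals compares,
    // so equal points always produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        GridPoint p1 = new GridPoint(3, 4);
        GridPoint p2 = new GridPoint(3, 4);
        System.out.println(p1 == p2);                        // false: two distinct objects
        System.out.println(p1.equals(p2));                   // true: same content
        System.out.println(p1.hashCode() == p2.hashCode());  // true: required for hashing to work
    }
}
```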

Compression & Capacity
It is not feasible for the array used within a hash table to have a place to store every possible Object. (Consider above `a2.hashCode()` = 948,636,961.) So every hash table has some capacity, and the index used is actually `obj.hashCode() % capacity`, adjusted to be non-negative, since `hashCode()` may return a negative int and Java's `%` then gives a negative result.
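A small sketch of this compression step. The method name `indexFor` and the capacity of 16 are assumptions for illustration; `Math.floorMod` is used in place of a bare `%` so that negative hash codes still give a valid index. This is not necessarily how `java.util.HashMap` computes its index.

```java
class Compression {
    // Maps any hash code to an index in the range 0 .. capacity-1.
    // Math.floorMod always returns a non-negative result for a positive
    // capacity, whereas % would return a negative index for a negative hash code.
    static int indexFor(Object key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        // An Integer's hash code is its own int value.
        System.out.println(indexFor(Integer.valueOf(948636961), 16)); // prints 1
        System.out.println(indexFor(Integer.valueOf(-7), 16));        // prints 9
        System.out.println(-7 % 16);                                  // prints -7, not usable as an index
    }
}
```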

Collision
Since the index used to store a key-value pair is `key.hashCode() % capacity`, keys that are not content equal may get the same index. This is called a collision, and some collision resolution strategy is needed.
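Two ways a collision can arise, sketched under the assumption of a capacity of 16 and a hypothetical helper `indexFor`: different keys may have the very same hash code, or different hash codes may compress to the same index.

```java
class CollisionDemo {
    // Compresses a hash code into the range 0 .. capacity-1.
    static int indexFor(Object key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        // Case 1: unequal keys with identical hash codes.
        // By the documented String.hashCode formula, both strings hash to 2112.
        System.out.println("Aa".equals("BB"));                 // false
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true

        // Case 2: different hash codes, same index after compression.
        // 3 and 19 differ by exactly the capacity, so both land at index 3.
        System.out.println(indexFor(3, 16));
        System.out.println(indexFor(19, 16));
    }
}
```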

Buckets
One approach, usually called open addressing, stores the key-value pair in the next available place in the array if its rightful place is already taken. But this is usually not a great idea (see Riley).

The alternative, usually called separate chaining, stores at each position of the array a bucket: a collection of all the values whose keys hash to this same array position. Any collection may be used for the buckets. A linked list is common. Another hash table is sometimes used, but this requires a second hash function. (The terms open hashing and closed hashing are also used for these two approaches, but not consistently across texts.)
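For comparison, the "next available place" strategy can be sketched as follows. This is linear probing over a bare array of keys; the class and method names are hypothetical, and removal and resizing are deliberately left out.

```java
class LinearProbing {
    // Returns the index where key is already stored or should be stored,
    // or -1 if every slot is occupied by some other key.
    static int findSlot(Object[] keys, Object key) {
        int capacity = keys.length;
        int start = Math.floorMod(key.hashCode(), capacity);
        for (int step = 0; step < capacity; step++) {
            int i = (start + step) % capacity;   // wrap around at the end of the array
            if (keys[i] == null || keys[i].equals(key)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        Object[] keys = new Object[8];
        // 3, 11 and 19 all compress to index 3 when the capacity is 8.
        for (int k : new int[] {3, 11, 19}) {
            int slot = findSlot(keys, k);
            keys[slot] = k;
            System.out.println(k + " stored at index " + slot);
        }
    }
}
```

Each colliding key spills into a neighboring slot, so later keys whose rightful place is one of those neighbors are displaced in turn.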

Searching Buckets
So looking up a value in a hash table, given a key, actually requires two searches:
1. Apply the hash function to the key, getting the index of the bucket, an O(1), random access search.
2. Search the bucket for the appropriate value, a search whose cost depends on the collection used for the bucket; it is often linear.

Storing Keys
When searching the bucket, how do you know when you have the appropriate value? You must store the key along with the value, as a key-value pair. So searching the bucket involves searching the key-value pairs, comparing the key of the pair with the supplied key using `equals` for comparison purposes.
Notice that an array or ArrayList doesn't have to store the keys, but a hash table does.
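The pieces so far (buckets, the two searches, and stored keys) fit together roughly as in the sketch below. `ChainedTable` and `Entry` are hypothetical names, the capacity is fixed, and null keys are not supported; this is not the source of any Java library class.

```java
import java.util.LinkedList;

class ChainedTable {
    // One key-value pair. The key is kept so that a lookup can compare against it.
    private static class Entry {
        final Object key;
        Object value;

        Entry(Object key, Object value) {
            this.key = key;
            this.value = value;
        }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    // Search 1: hash the key to pick a bucket, a single array access.
    private LinkedList<Entry> bucketFor(Object key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    // Search 2: scan the bucket linearly, comparing stored keys with equals.
    private Entry find(Object key) {
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) return e;
        }
        return null;
    }

    void put(Object key, Object value) {
        Entry e = find(key);
        if (e != null) e.value = value;                   // key already present: replace
        else bucketFor(key).add(new Entry(key, value));   // new key: append to its bucket
    }

    Object get(Object key) {
        Entry e = find(key);
        return e == null ? null : e.value;
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(4);
        table.put("Aa", "first");    // "Aa" and "BB" have the same hash code,
        table.put("BB", "second");   // so they share a bucket
        System.out.println(table.get("Aa"));
        System.out.println(table.get("BB"));
    }
}
```

Without the stored keys, `get("BB")` would have no way to tell the two pairs in the shared bucket apart.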

Capacity, as introduced above, is the number of available buckets in the hash table. Size, as usual for a collection, is the number of key-value pairs stored in it. Notice that as the size increases, the probability of collisions increases, making the hash table less efficient. The load factor is the threshold value of the ratio size/capacity which, when exceeded, causes the capacity of the hash table to be increased. "When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the capacity is roughly doubled" [java.util.HashMap API]. (Compare the behavior of an ArrayList when the size is about to exceed the capacity.) Java hash tables use a default load factor of 0.75.
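The doubling rule quoted above can be traced with a little arithmetic. The sketch below only simulates the rule; it assumes an initial capacity of 16 and exact doubling, and the name `capacityAfter` is made up.

```java
class LoadFactorDemo {
    // Simulates adding entries one at a time, doubling the capacity whenever
    // the size exceeds loadFactor * capacity.
    static int capacityAfter(int entries, int initialCapacity, double loadFactor) {
        int capacity = initialCapacity;
        for (int size = 1; size <= entries; size++) {
            if (size > loadFactor * capacity) capacity *= 2;
        }
        return capacity;
    }

    public static void main(String[] args) {
        // With capacity 16 and load factor 0.75 the threshold is 12 entries.
        System.out.println(capacityAfter(12, 16, 0.75));   // prints 16
        System.out.println(capacityAfter(13, 16, 0.75));   // prints 32
        System.out.println(capacityAfter(100, 16, 0.75));  // prints 256
    }
}
```

So under these assumptions a quarter of the buckets' worth of slack is always kept, which is the space cost paid for keeping collisions rare.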

java.util.HashMap vs. java.util.HashSet
Java has two classes based on hash tables, java.util.HashMap, and java.util.HashSet.

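HashMaps are Java's implementation of hash tables, as discussed above. Since it is legal to store `null` as a value, a `null` result from `get` does not tell you whether the key is absent or is present with a `null` value. For this reason HashMaps have a useful `boolean containsKey(Object key)` method.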
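A short illustration of why `containsKey` is needed. The class name and the keys are made up.

```java
import java.util.HashMap;

class NullValueDemo {
    static HashMap<String, String> person() {
        HashMap<String, String> table = new HashMap<>();
        table.put("firstName", "Ada");
        table.put("middleName", null);   // key is present, value is null
        return table;
    }

    public static void main(String[] args) {
        HashMap<String, String> table = person();
        // get cannot distinguish the two cases: both print null.
        System.out.println(table.get("middleName"));
        System.out.println(table.get("nickname"));
        // containsKey can.
        System.out.println(table.containsKey("middleName"));  // true
        System.out.println(table.containsKey("nickname"));    // false
    }
}
```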

HashSet implements a set on top of a HashMap. Conceptually the map acts as a bit vector (where the value is either `true` or `false`), but in practice the elements are simply stored as keys, so testing membership amounts to `containsKey`.
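The idea can be sketched as a thin wrapper around a HashMap. `MapBackedSet` is a hypothetical class written for these notes, not the actual source of `java.util.HashSet`.

```java
import java.util.HashMap;

class MapBackedSet {
    // Elements are the keys; the value is a placeholder that is never read.
    private final HashMap<Object, Boolean> map = new HashMap<>();

    // Returns true if the element was not already in the set.
    boolean add(Object element) {
        return map.put(element, Boolean.TRUE) == null;
    }

    boolean contains(Object element) {
        return map.containsKey(element);
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        MapBackedSet set = new MapBackedSet();
        System.out.println(set.add("red"));        // true: newly added
        System.out.println(set.add("red"));        // false: already present
        System.out.println(set.contains("red"));   // true
        System.out.println(set.contains("blue"));  // false
        System.out.println(set.size());            // 1
    }
}
```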

Bottom Line
A hash table is a generalization of an array, but performance tends to be slightly worse than O(1), since each lookup also searches a bucket, and the space it requires is generally more than the minimum needed to store the number of entries in it. Nevertheless, it is a very useful data structure when its functionality is needed, and it is strongly recommended in those cases.