UNIVERSITY AT BUFFALO, THE STATE UNIVERSITY OF NEW YORK
The Department of Computer Science & Engineering


CSE 116
Introduction To Computer Science for Majors 2
Lecture B
Lecture Notes
Stuart C. Shapiro
Spring, 2003

Hash Tables

Readings

Riley Chapter 8

Introduction

Recall, an ArrayList is an unbounded mutable random-access collection indexed by ints. A hash table is an unbounded mutable random-access (well, almost) collection indexed by Objects.

Keys & Values

Instead of talking about a value stored in an array at some index, we say that a value is stored in a hash table paired with some key. The key is used to find the value previously stored in the hash table. The storage and retrieval methods are

put(Object key, Object value)

and get(Object key).
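
As a minimal sketch of put and get, the following uses java.util.HashMap with String keys and Integer values. The class name PutGetDemo and the enrollment figures are made up for illustration, and the generic type parameters are a convenience of Java versions later than the one these notes were written for.

```java
import java.util.HashMap;
import java.util.Map;

class PutGetDemo {
    // Builds a small table mapping course names to (made-up) enrollment counts.
    static Map<String, Integer> build() {
        Map<String, Integer> enrollment = new HashMap<>();
        enrollment.put("CSE 115", 200);  // store a value under a key
        enrollment.put("CSE 116", 150);
        enrollment.put("CSE 116", 160);  // same key again: the old value is replaced
        return enrollment;
    }

    public static void main(String[] args) {
        Map<String, Integer> enrollment = build();
        System.out.println(enrollment.get("CSE 116"));  // 160
        System.out.println(enrollment.get("CSE 250"));  // null: no such key
        System.out.println(enrollment.size());          // 2
    }
}
```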

Hash Functions

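How can any Object be used as a key? Turn the object into an int with a hash function. See java.lang.Object.hashCode(). In order to find a value previously stored, it is essential that if obj1.equals(obj2) then

obj1.hashCode() == obj2.hashCode()

This is part of the documented contract of hashCode(). The converse is not required: two objects that are not equal may have the same hash code.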

Example:

bsh % int[] a = {2, 4, 6, 2, 7, 9};

bsh % a2 = new ArrayList();
bsh % b2 = new ArrayList();

bsh % for (int i=0; i<a.length; i++) a2.add(a[i]);
bsh % for (int i=0; i<a.length; i++) b2.add(a[i]);

bsh % print(a2);
[2, 4, 6, 2, 7, 9]
bsh % print(b2);
[2, 4, 6, 2, 7, 9]

bsh % print(a2 == b2);
false
bsh % print(a2.equals(b2));
true

bsh % print(a2.hashCode());
948636961
bsh % print(b2.hashCode());
948636961
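
The same property has to hold for classes you write yourself if their instances are to be used as keys. The following is a sketch only: GridPoint and GridPointDemo are hypothetical names, and Objects.hash is just one convenient way to combine fields.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class, for illustration only.
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) other;
        return x == p.x && y == p.y;
    }

    // Uses exactly the fields that equals compares,
    // so equal points always produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}

class GridPointDemo {
    public static void main(String[] args) {
        Map<GridPoint, String> labels = new HashMap<>();
        labels.put(new GridPoint(2, 4), "start");

        // A different object with the same content finds the stored value.
        System.out.println(labels.get(new GridPoint(2, 4)));  // start
    }
}
```

If hashCode were left as inherited from Object while equals was overridden, the second GridPoint would most likely be sent to a different array position and the lookup would fail.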

Compression & Capacity

It is not feasible for the array used within a hash table to have a place to store every possible Object. (Consider a2.hashCode() above, which is 948,636,961.) So every hash table has some capacity, and the index used is actually obj.hashCode() % capacity. (Since hashCode() may be negative, an implementation must also make sure the resulting index is non-negative.)
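
The compression step can be sketched as below. The capacity of 16 and the name indexFor are assumptions made for the example, and Math.floorMod is used in place of a plain % so that negative hash codes still yield a legal index. This is an illustration of the idea, not a description of how java.util.HashMap computes its indices.

```java
class CompressDemo {
    // Maps any hash code to an index in 0 .. capacity-1.
    // Math.floorMod is used instead of % because % can return a negative result.
    static int indexFor(Object key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        int capacity = 16;  // assumed capacity, chosen only for the example

        // An Integer's hash code is its own int value.
        System.out.println(indexFor(948636961, capacity));  // 1
        System.out.println(-7 % capacity);                  // -7, not a legal array index
        System.out.println(indexFor(-7, capacity));         // 9
    }
}
```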

Collision

Since the index used to store a key-value pair is key.hashCode() % capacity, keys that are not equal (in the sense of equals) may get the same index. This is called a collision, and some collision resolution strategy is needed.
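
Two small cases illustrate how collisions arise. In the first, two unequal Strings happen to have the same hash code; in the second, two Integers with different hash codes are compressed to the same index. The capacity of 16 and the helper indexFor are assumptions for the example.

```java
class CollisionDemo {
    // Hypothetical helper: compress a hash code to an index in 0 .. capacity-1.
    static int indexFor(Object key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        // Unequal keys, equal hash codes.
        System.out.println("Aa".equals("BB"));  // false
        System.out.println("Aa".hashCode());    // 2112
        System.out.println("BB".hashCode());    // 2112

        // Unequal hash codes (1 and 17), equal index after compression.
        System.out.println(indexFor(1, 16));    // 1
        System.out.println(indexFor(17, 16));   // 1
    }
}
```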

Buckets

An open addressing method (sometimes called closed hashing) stores the value in the next available place in the array, if its rightful place is already taken. But this is usually not a great idea (see Riley).

Separate chaining (sometimes called open hashing) stores, at each position of the array, a bucket: a collection of all the values whose keys hash to this same array position. Any collection may be used for the buckets. A linked list is common. Another hash table is sometimes used, but this requires a second hash function.
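
The first strategy above, storing a pair in the next available place, can be sketched as follows (stepping one slot at a time is commonly called linear probing). This is a minimal illustration under several assumptions: the class name ProbingTable is hypothetical, the capacity is fixed, null keys are not supported, and removal is omitted because it needs extra care with this scheme.

```java
class ProbingTable {
    private final Object[] keys;
    private final Object[] values;
    private int size = 0;

    ProbingTable(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    // The "rightful place" of a key.
    private int home(Object key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    void put(Object key, Object value) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) {         // free slot: store the pair here
                keys[i] = key;
                values[i] = value;
                size++;
                return;
            }
            if (keys[i].equals(key)) {     // key already present: replace its value
                values[i] = value;
                return;
            }
            i = (i + 1) % keys.length;     // slot taken by another key: try the next one
        }
        throw new IllegalStateException("table is full");
    }

    Object get(Object key) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) return null;           // empty slot: key was never stored
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % keys.length;
        }
        return null;                                    // every slot checked, key not found
    }

    int size() {
        return size;
    }
}
```

With a capacity of 4, the Integer keys 1 and 5 share the rightful place 1, so the second one stored ends up in slot 2. Runs of occupied slots like this tend to grow as the table fills, which is one reason the approach can perform poorly.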

Searching Buckets

So looking up a value in a hash table, given a key, actually requires two searches:

Apply the hash function to the key and compress the result, getting the index of the bucket: an O(1), random-access step.
Search the bucket for the appropriate value, a search whose cost depends on the collection used for the bucket, often linear.

Storing Keys

When searching the bucket, how do you know when you have the appropriate value? You must store the key along with the value, as a key-value pair. So searching the bucket involves searching the key-value pairs, comparing the key of the pair with the supplied key using equals for comparison purposes.
Notice that an array or ArrayList doesn't have to store the keys, but a hash table does.
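
Putting the last three sections together, a hash table with linked-list buckets might look like the sketch below. It is an illustration under stated assumptions, not the way java.util.HashMap is written: ChainedTable and Pair are hypothetical names, the capacity is fixed, and null keys are not supported.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    // A key-value pair; the key is kept so that a bucket search can recognize it.
    private static final class Pair {
        final Object key;
        Object value;

        Pair(Object key, Object value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<List<Pair>> buckets = new ArrayList<>();
    private int size = 0;

    ChainedTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<Pair>());
        }
    }

    // Search 1: hash and compress to pick the bucket, an O(1) step.
    private List<Pair> bucketFor(Object key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    // Search 2: linear scan of the bucket, comparing keys with equals.
    private Pair find(Object key) {
        for (Pair p : bucketFor(key)) {
            if (p.key.equals(key)) return p;
        }
        return null;
    }

    void put(Object key, Object value) {
        Pair p = find(key);
        if (p != null) {
            p.value = value;                           // key already present: replace
        } else {
            bucketFor(key).add(new Pair(key, value));  // new key: append to its bucket
            size++;
        }
    }

    Object get(Object key) {
        Pair p = find(key);
        return (p == null) ? null : p.value;
    }

    int size() {
        return size;
    }
}
```

For example, in a ChainedTable of capacity 4 the keys "Aa" and "BB" have the same hash code and so share a bucket, yet each can still be retrieved because the stored keys are compared with equals.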

Size, Capacity & Load Factor

Capacity, as introduced above, is the number of available buckets in the hash table. Size, as usual for a collection, is the number of key-value pairs stored in it. Notice that as the size increases, the probability of collisions increases, making the hash table less efficient. The load factor is the value of the ratio size/capacity that, when exceeded, causes the capacity of the hash table to be increased. "When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the capacity is roughly doubled" [java.util.HashMap API]. (Compare the behavior of an ArrayList when the size is about to exceed the capacity.) Java hash tables use a default load factor of 0.75.
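
The arithmetic behind the quoted rule can be worked through with a small model. This is only a simulation of the rule as stated, assuming a starting capacity of 16 and exact doubling; the names LoadFactorDemo, threshold and capacityAfter are hypothetical, and the code does not inspect a real HashMap.

```java
class LoadFactorDemo {
    // Largest number of entries the table may hold before it must grow.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Capacity after adding n entries one at a time,
    // doubling whenever the size exceeds the current threshold.
    static int capacityAfter(int n, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        for (int size = 1; size <= n; size++) {
            if (size > threshold(capacity, loadFactor)) {
                capacity *= 2;
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f));          // 12
        System.out.println(capacityAfter(12, 16, 0.75f));  // 16: 12 does not exceed 12
        System.out.println(capacityAfter(13, 16, 0.75f));  // 32: the 13th entry forces growth
        System.out.println(capacityAfter(25, 16, 0.75f));  // 64: 25 exceeds 32 * 0.75 = 24
    }
}
```

When the expected number of entries is known in advance, the two-argument constructor HashMap(int initialCapacity, float loadFactor) lets a program choose both values up front and avoid repeated growth.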

java.util.HashMap vs. java.util.HashSet

Java has two classes based on hash tables, java.util.HashMap, and java.util.HashSet.

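HashMap is Java's implementation of hash tables, as discussed above. Since it is legal to store null as a value, a null result from get(key) does not tell you whether the key is absent or is present with the value null. For that reason, HashMap has a useful boolean containsKey(Object key) method.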
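
A short sketch of why containsKey matters when null values are stored. The class name ContainsKeyDemo and the names in the table are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class ContainsKeyDemo {
    // Builds a table in which one key is deliberately paired with null.
    static Map<String, String> build() {
        Map<String, String> advisor = new HashMap<>();
        advisor.put("alice", "Shapiro");
        advisor.put("bob", null);  // bob is in the table, but has no advisor yet
        return advisor;
    }

    public static void main(String[] args) {
        Map<String, String> advisor = build();

        // get cannot tell the two situations apart...
        System.out.println(advisor.get("bob"));            // null
        System.out.println(advisor.get("carol"));          // null

        // ...but containsKey can.
        System.out.println(advisor.containsKey("bob"));    // true
        System.out.println(advisor.containsKey("carol"));  // false
    }
}
```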

HashSet implements a set on top of a HashMap: the elements of the set are stored as the keys, the values stored with them are irrelevant, and an object is a member of the set exactly when it is present as a key (in effect, a containsKey test).
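
As a brief usage sketch, the following collects the distinct elements of the int array used in the earlier example. The class name HashSetDemo and the method name distinct are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class HashSetDemo {
    // Returns the set of distinct values in a; duplicates are absorbed by the set.
    static Set<Integer> distinct(int[] a) {
        Set<Integer> seen = new HashSet<>();
        for (int x : a) {
            seen.add(x);  // add returns false, and changes nothing, if x is already present
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] a = {2, 4, 6, 2, 7, 9};
        Set<Integer> s = distinct(a);
        System.out.println(s.size());       // 5: the second 2 was not added again
        System.out.println(s.contains(7));  // true
        System.out.println(s.contains(3));  // false
    }
}
```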

Bottom Line

A hash table is a generalization of an array, but performance tends to be slightly worse than O(1), and the space it requires is generally more than the minimum needed to store the number of entries in it. Nevertheless, it is a very useful data structure when its functionality is needed, and it is strongly recommended in those cases.


Stuart C. Shapiro <shapiro@cse.buffalo.edu>
