Hash table

Hash table #

A hash table is a data structure often used to implement an associative array (like a Java Map).

Associative array #

Reminder. An associative array (or map) is a finite set of key-value pairs, where keys are distinct.

Terminology. A key-value pair in an associative array is also called an entry.

Running example. The following associative array has 5 entries.

$\qquad$ 1 $\mapsto$ Alice $\qquad$ 3 $\mapsto$ Bob $\qquad$ 4 $\mapsto$ Carol $\qquad$ 9 $\mapsto$ Dan $\qquad$ 12 $\mapsto$ Eve

In this associative array, keys are integers, and values are strings.

Notation. We will use $K$ for the type of the keys (integers in this example), and $V$ for the type of the values (strings in this example).

Structure #

A hash table is an array $A$, whose entries are called slots (or buckets).

Hash tables come in two varieties:

  • chaining hash tables and
  • hash tables with open indexing.

In a chaining hash table, each slot is a pointer to a data structure (typically a linked list). These data structures store the entries (a.k.a. key-value pairs) of the associative array.

Example (continued). The following is a chaining hash table, which stores our associative array.

The array has 4 slots, each of which points to a (possibly empty) linked list.

In a hash table with open indexing, each slot stores one entry of the associative array.

We will discuss open indexing later in this section (and focus for now on chaining hash tables).

Hash function #

In a hash table, a so-called hash function determines which slot each entry is assigned to.

Definition. A hash function maps each key in $K$ to an index in $A$.

Example (continued). In the above example, we used

$\qquad h(k) = k \mod 4$

as our hash function.

For instance,

$\qquad h(9) = 9 \mod 4 = 1$

which explains why the entry

$\qquad$ $9 \mapsto$ Dan

is assigned to slot $1$.

Collision #

The size of the set $K$ of possible keys is often (much) larger than the size of the hash table $A$.

Example (continued). In our example, $K$ is the set of integers. In Java, this set has size $2^{32}$ (i.e. more than 4 billion), which is unrealistic for the size of $A$.

As a consequence, hashing functions are usually not injective.

Terminology. A collision occurs when two entries of an associative array are assigned to the same slot.

In other words, a collision occurs if the associative array contains two entries ${k_1 \mapsto v_1}$ and ${k_2 \mapsto v_1}$ such that $h(k_1) = h(k_2)$.

Example (continued). In the example above, 1 $\mapsto$ Alice and 9 $\mapsto$ Dan are assigned to the same slot, because

$$ h(1) = h(9) = 1 $$

Collisions are often unavoidable (this is a consequence of what is sometimes called the birthday paradox).

Example. Let us assume that our hashing function $h$ partitions $K$ evenly across slots (like the function $h(x) = x \mod 4$ used above).

  • If $A$ has $16$ slots (which is the default size of a hash table in Java), then the probability of a collision is $\ge 0.5$ for $5$ entries already

  • If $A$ has $1000$ slots, then the probability of a collision is $\ge 0.5$ for $38$ entries already, and $\ge 0.9$ for $69$ entries already.

Performance #

Reminder. An associative array exposes (at least) the following methods:

  • lookup (or get) takes a key as input, and returns the value for this key (if any).
  • insert (or put ) inserts a pair (key, value). If an entry for this key was already present, then overwrites its value.
  • remove (or delete) deletes the entry for a given key (if any).

Worst case #

In a chained hash table, in order to check whether the structure contains an entry for an input key $k$, we need to:

  • compute $h(k)$
  • search for an entry with key $k$ in the data structure that $A[h(k)]$ refers to (if any).

Consider a associative array with $n$ entries, stored as a chained hash table with linked lists (as above).

What is the worst case asymptotic running time of the lookup operation?

Instead, what would be the best possible scenario?

A worst possible input (of size $n$) for the lookup operation consists of an input key $k$ and an input hash table $A$ with hash function $h$ such that, for each pair $k’ \mapsto v$ stored in $A$:

  • $k’ \neq k$ and
  • $h(k’) = h(k)$.

In this case, any algorithm for lookup would need to inspect the whole list $A[h(k)]$. So the worst-case running time of such an algorithm is in $\Theta(n)$.

A best possible input consists of an input key $k$ and an input hash table $A$ with hash function $h$ such that $h(k’) \neq h(k)$ for each pair $k’ \mapsto v$ stored in $A$. A standard algorithm for lookup in this case would have constant running time (expressed as function of $n$).

Load factor and rehashing #

Definition. Let $n$ be the number of entries in the associative array, and let $m$ be the size of $A$. The ratio $\alpha = \frac{n}{m}$ is called the load factor of the hash table.

A high load factor increases the chances of collision, thus affecting performance.

When the load factor becomes too high, the hash table is typically rehashed, meaning that a larger hash table is created (thus with a different hashing function), in which entries are re-inserted.

From Wikipedia, for a chaining hash table, the maximal acceptable load factor is between 1 and 3.

Average running time #

A good hash function is usually expected to distribute keys evenly across slots.

In a chaining hash table, this means that all lists should have length close to the load factor $\alpha$.

Universal hashing refers to using not one, but a collection of hash functions that satisfies certain properties, and randomly selecting one of these functions at runtime.

This guarantees that for a fixed set $S \subset K$ of keys with size $n$, and for a fixed $k \in S$, the average length of the slot for $k$ is the load factor $\alpha$.

In other words, the average running time of lookup, insertion and deletion is $\alpha = \frac{n}{m}$.

If we assume that the size $m$ of $A$ is always reasonable for the number $n$ of entries of the associate array, then $\alpha$ is independent of $n$. So under this assumption, the average running time of lookup, insertion and deletion is constant (i.e. is in $\Theta(1)$).

Open indexing #

In a hash table with open indexing, each slot stores at most one entry of the associative array. In case of collision between two entries, one of them is assigned to a free slot.

This is achieved via a so-called probing function (from indices to indices), which defines an order over indices. More precisely:

  • Insert. When inserting an entry with key $k$, if the slot $A[h(k)]$ is already occupied, then the “next” free slot is used instead (where “next” depends on the probing function).
  • Lookup. When searching for the entry with key $k$, the first investigated slot is $A[h(k)]$, then its successor $s$ (according to the probing function), then the successor of $s$, etc., until we reach the entry with key $k$ (if present) or an empty slot.
  • Delete. When deleting a entry, the slot that it was occupying cannot be deleted, because the resulting empty slot may break future searches. Instead, the slot is marked as a so called tombstone, which does not interrupt a search, but can still be used for future insertions.

Performance #

The (average and worst case) asymptotic costs of these operations are identical to the ones for chained hash tables.

Open addressing can be more efficient in practice than chaining, because it is more likely to benefit from caching (since the whole hash table it an array, it is likely to be contiguous in memory).

A downside is the maximal acceptable load factor, which is (necessarily) $\le 1$ (from Wikipedia still, between 0.6 and 0.75).

in Java #

Implementation #

Java uses chained hash tables internally, with 16 slots by default.

Each slot points to a linked list by default.

If the maximal acceptable load factor is exceeded, then the hash table is rehashed (as expected).

In addition, Java uses a maximal load threshold for an individual slot. If this value is exceeded (but the maximal load factor is not), then the linked list for this slot is converted to a red-black tree.

Associative array #

As we saw in the dedicated section, the interface java.util.Map represents an associative array.

  • The class java.util.HashMap implements the interface as a hash table.

  • The class java.util.LinkedHashMap also implements this interface as a hash table. However, in addition, each entry contains two pointers to its predecessor and successor (if any) in the order of insertion, thus forming a double linked list, independent of the hash table itself. This allows iterating over the entries of a map in a predictable order (precisely, their insertion order), which is not possible with a HashMap.

Warning. In a LinkedHashMap, overwriting the value of an entry (with the method Map.put) does not affect the order of insertion.

Set #

As we saw in the dedicated section, the interface java.util.Set represents a set.

  • The class java.util.Hashset implements this interface as a hash table, where keys are the elements of the set, and an “entry” consists of a key only (i.e. has no value).

  • The class java.util.LinkedHashSet also implements this interface as a hash table. It is analogous to LinkedHashMap, i.e. it guarantees that iteration preserves the order of insertion.

hashCode #

In Java, the keys of a Map and the entries of a Set are objects.

So a Java hash function must map an object to a hash table slot. Technically, this function is the composition of:

  • the function hashCode, which maps an object to an int, and
  • a proper hash function $h$ that maps this int to a hash table slot.

The function hashCode is an instance method of the class Object, thus it is inherited by default by all objects. This function behaves like a pure function, in the sense that it must return twice the same int if it is called twice for the same object. This property must be preserved if the method is overridden.

The hash function $h$ is also a function in the mathematical sense. Therefore the composition of these two functions has this property. As a consequence, a same key is always mapped to the same slot.

hashCode and equals #

Warning. In a Java hash table, two keys $o_1$ and $o_2$ are considered equal iff

$\qquad o_1$.equals($o_2$)

returns true.

Consider the following Java class City, which overrides the method equals, so that two cities are considered equal iff they have the same name and zip code.

public class City(){
    String name;
    int zipCode;

    public City(String name, int zipCode){
        this.name = name;
        this.zipCode = zipCode;
    }

    @Override
    public boolean equals(Object o) {
       if (this == o) { // same reference
           return true;
       }
       if (o == null || getClass() != o.getClass())
           return false; // o is null or has a different type
       }
       City downcastObject = (City) o;
       return zipCode == downcastObject.zipCode &&
           name.equals(downcastObject.name);
    }

Can we predict the output of the following program (and if so, what is it)?

City bologna = new City("Bologna", 40100);
City bolognaAgain = new City("Bologna", 40100);

Set<City> cities = new HashSet();
cities.add(bologna);
cities.add(bolognaAgain);

System.out.println(cities.size());

The output is unpredictable.

Observe that the variables bologna and bolognaAgain point to two different objects $o_1$ and $o_2$.

In the hash table that stores the set cities:

  • the slot associated to $o_1$ is $i_1 = h$($o_1$.hashCode()),
  • the slot associated to $o_2$ is $i_2 = h$($o_2$.hashCode()).

Since $o_1$ and $o_2$ are different objects, $i_1$ and $i_2$ may of may not be identical.

The instruction cities.add(bologna) adds a node with key $o_1$ to $A[i_1]$. The instruction cities.add(bolognaAgain) adds a node with key $o_2$ to $A[i_2]$.

Then the instruction cities.size() returns the number of entries in the hash table.

  • If $i_1 = i_2$ (i.e. if we have a collision), then the method put will not insert $o2$ in the hash table (since $o_1$.equals($o_2$) evaluates to true). Therefore the method cities.size() will return 1,

  • If $i_1 \neq i_2$, then the method put will insert $o2$ in the hash table (at slot $i_2$). Therefore the method cities.size() will return 2.

Warning. As show with this exercise, if a class overrides equals without overriding hashCode, then the behavior of the basic operations on hash tables (insertion, lookup and deletion) may depend on collisions.

This makes the behavior of several methods implemented by HashMap, LinkedHashMap, HashSet and LinkedHashSet (among others) unpredictable.

To avoid this, whenever equals is overridden, hashCode should be overridden accordingly.

Requirement. For two Java objects $o_1$ and $o_2$,

$\qquad$ if $o_1$.equals($o_2$), then hashCode($o_1$) = hashCode($o_2$).

Example(continued). hashCode could be implemented as follows:

public class City(){
    ...
    @Override
    public int hashCode(){
        return name.length + zipCode;
    }
}

In the example above, any (pure) function that produces an int out of the values of name and zipCode could in theory be used to implement hashCode in the class City.

However, the choice of this function may affect performance.

Example(continued). The following implementation of hashCode also complies with the implementation of equals above. However, it has the effect of mapping all keys with type City to the same hash table slot.

public class City(){
    ...
    @Override
    public int hashCode(){
        return 1;
    }
}

This is why a common practice consists in implementing hashCode in terms of the method hashCode of the selected attributes (and implement the method hashCode of these attributes similarly, in a recursive fashion).

Note. Most native Java classes (like String, Integer, Boolean, ArrayList, HasSet, etc.) already implement hashCode (in accordance with equals).

The utility method Objects.hash takes any number of arguments, and wraps the ones that have primitive types (e.g. int into Integer). It returns an int whose value is a function of the outputs of the respective hashCode methods of the arguments.

Example(continued). In our example, we can implement hashCode as follows:

public class City(){
    ...
    @Override
    public int hashCode() {
        return Objects.hash(name, zipCode);
    }
}

The returned value depends on the values of name.hashCode() and Integer.valueOf(zipCode).hashCode().