Hash Functions: What Makes a Good Hash

A hash function maps a key of any size to a fixed-range integer — the bucket index where its value lives in a hash table. A good hash function is deterministic (same key always gives the same result), fast to compute, and spreads keys uniformly across the buckets so collisions stay rare. Get this right and lookups stay O(1) on average; get it wrong and the table degrades to an O(n) list.

From key to index, in two steps

Hash: turn the key into a large integer — its hash code.
Compress: map that integer into [0, numBuckets), usually with % numBuckets.

#include <string>

// Polynomial rolling hash for a string, then compress to a bucket.
int hashIndex(const std::string& key, int numBuckets) {
    unsigned long h = 0;
    for (char c : key) h = h * 31 + (unsigned char)c; // mix each char
    return (int)(h % numBuckets);                     // compress
}

// Polynomial rolling hash for a string, then compress to a bucket.
int hashIndex(String key, int numBuckets) {
    long h = 0;
    for (int i = 0; i < key.length(); i++)
        h = h * 31 + key.charAt(i); // mix each char
    return (int)(Math.floorMod(h, numBuckets)); // compress (non-negative)
}

// Polynomial rolling hash for a string, then compress to a bucket.
function hashIndex(key, numBuckets) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep it a 32-bit unsigned int
  }
  return h % numBuckets; // compress
}

def hash_index(key, num_buckets):
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)   # mix each char
    return h % num_buckets     # compress

Why 31? Multiplying by a small odd prime before adding each character spreads similar strings (“ab” vs “ba”) to different codes. Java’s built-in String.hashCode() uses exactly this h * 31 + c recurrence.

What makes a hash function “good”

Deterministic. The same key must always hash to the same index — otherwise you could never find what you stored.
Uniform. Keys should scatter evenly across buckets. Clustering means long buckets and slow lookups.
Fast. The hash runs on every operation, so it must be cheap (typically O(length of key)).
Avalanche. A tiny change in the key (one character) should change the index a lot, breaking up patterns in real data.

Collisions are inevitable

A hash function maps a huge (often infinite) key space into a small set of buckets, so by the pigeonhole principle two different keys must sometimes share a bucket. This is a collision.

// With only 8 buckets, distinct keys can land on the same index.
int b1 = hashIndex("cat", 8);
int b2 = hashIndex("act", 8);
// b1 and b2 may be equal — that's a collision.

// With only 8 buckets, distinct keys can land on the same index.
int b1 = hashIndex("cat", 8);
int b2 = hashIndex("act", 8);
// b1 and b2 may be equal — that's a collision.

// With only 8 buckets, distinct keys can land on the same index.
const b1 = hashIndex("cat", 8);
const b2 = hashIndex("act", 8);
// b1 and b2 may be equal — that's a collision.

# With only 8 buckets, distinct keys can land on the same index.
b1 = hash_index("cat", 8)
b2 = hash_index("act", 8)
# b1 and b2 may be equal — that's a collision.

A good hash function does not eliminate collisions; it makes them rare and evenly spread. Handling the ones that do happen is the job of collision resolution.

A bad hash, for contrast

hash(key) = key.length % numBuckets is deterministic and fast — but every 3-letter word lands in the same bucket. Real data (“cat”, “dog”, “car”…) would clump into a few buckets, turning the table into a handful of O(n) lists. Uniformity is what separates a usable hash from a useless one.

Load factor: the other half

Even a perfect hash collides if there are far more keys than buckets. The load factor α = entries / buckets controls this. Libraries keep α below a threshold (often ~0.75) and resize — doubling the bucket count and rehashing every key — when it is exceeded. Resizing is O(n) but rare, so inserts stay O(1) amortized.

For beginners: You almost never write your own hash function for built-in maps — the language provides good ones for strings, ints, and tuples. You do need to provide one when you use a custom object as a key.

Going deeper: Hash functions for security (SHA-256) and for hash tables are different goals. Table hashes optimize speed and uniformity; cryptographic hashes optimize irreversibility and collision-resistance against attackers. Don’t use one where the other is needed.

Next: how tables cope when keys collide — collision resolution.