What is hashing?

Hashing transforms a piece of data of arbitrary size into another value. If the data is transformed into an integer, for example, the hash space is a range of integers from 0 to some maximum value x_n. The smaller the range, the greater the probability of collisions (multiple pieces of data mapping to the same value). A good hash function reduces the probability of collisions.
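As a minimal sketch of the idea (assuming Python and SHA-256; the helper name hash_to_range and the 0-99 range are just for illustration), inputs of wildly different sizes all land in the same small hash space:

```python
import hashlib

def hash_to_range(data: bytes, max_value: int) -> int:
    """Map data of arbitrary size to an integer in [0, max_value]."""
    digest = hashlib.sha256(data).digest()                 # fixed-size 32-byte digest
    return int.from_bytes(digest, "big") % (max_value + 1)

# Inputs of very different sizes land somewhere in the same small range.
print(hash_to_range(b"a", 99))
print(hash_to_range(b"a much longer piece of data" * 1000, 99))
```

With only 100 possible outputs, collisions become likely after hashing just a handful of values; a larger range or a better hash function pushes that probability down.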

A data structure called a hash map uses an array of size x_n + 1, where each element points to a data bucket. If there were only one bucket, searching among N stored values would take O(N) time. If we add k buckets, the time complexity drops to roughly O(N/k).
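A rough sketch of that idea, assuming Python (the class name BucketHashMap and the default of 16 buckets are arbitrary choices for illustration):

```python
class BucketHashMap:
    """Minimal hash map: an array of k buckets, each holding a list of (key, value) pairs."""

    def __init__(self, k: int = 16):
        self.buckets = [[] for _ in range(k)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)    # which bucket this key belongs to

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:             # overwrite an existing entry
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        # Only the ~N/k entries in one bucket are scanned, not all N entries.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)
```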

If we need to scale out our hash map, we can design a distributed hashing system. Let's say our system uses the following lookup formula: server_id = hash(key) % N, where hash(...) is the hash function and N is the number of servers. The first time a client asks for a value there will be a cache miss, so the server retrieves the value from the database and adds it to the cache for next time.
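A minimal sketch of that lookup and the read-through behaviour, assuming Python; the in-memory dicts and the fetch_from_database helper are stand-ins for real cache servers and a real data store:

```python
NUM_SERVERS = 4
caches = [dict() for _ in range(NUM_SERVERS)]     # stand-ins for four cache servers

def fetch_from_database(key):
    return f"value-for-{key}"                     # placeholder for the real data service

def get(key):
    server_id = hash(key) % NUM_SERVERS           # the lookup formula from above
    cache = caches[server_id]
    if key not in cache:                          # cache miss: fall back to the database
        cache[key] = fetch_from_database(key)     # ...and populate the cache for next time
    return cache[key]
```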

The rehashing problem

This distributed hashing scheme works well as long as all of your servers remain online. If one or more servers go offline and we keep using the same formula with the diminished server pool, each key still hashes to the same value, but hash(key) % N now yields a different server index for most keys because N has changed. A significant number of requests will be routed to the wrong server. If the system is a cache, a storm of cache misses will ensue, load on the database (or data service) will increase, and performance may be seriously degraded. Consistent hashing is an effective method to mitigate the rehashing problem.

```mermaid
flowchart LR
    H["hash(key=1002) % N"]
    S0[server 0]
    S1[server 1]
    S2[server 2]
    S3[server 3]
    C((Client Request)) --> H
    H --> S0
    H --> S1
    H --"1002 % 4 = 2"--> S2
    H --> S3
```
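To get a feel for the scale of the problem, here is a quick experiment (assuming Python; the 10,000 synthetic keys and the drop from 4 to 3 servers are arbitrary choices):

```python
keys = [f"key-{i}" for i in range(10_000)]
before = {key: hash(key) % 4 for key in keys}   # four servers online
after  = {key: hash(key) % 3 for key in keys}   # one server has gone offline

moved = sum(1 for key in keys if before[key] != after[key])
print(f"{moved / len(keys):.0%} of keys now map to a different server")
# Roughly three out of four keys move, so most cached entries are suddenly unreachable.
```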

Hash function

A hash function maps strings “randomly” onto a range of numbers (e.g. 0 to 2^64 - 1). A good hash function spreads the keys approximately uniformly across that range.
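One way to see this uniformity, assuming Python and using the first 8 bytes of a SHA-256 digest as a 64-bit hash (the helper name hash64 is just for illustration):

```python
import hashlib
from collections import Counter

def hash64(s: str) -> int:
    """Interpret the first 8 bytes of a SHA-256 digest as an integer in [0, 2**64 - 1]."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

# Drop 100,000 keys into 10 equal slices of the 64-bit range;
# each slice ends up with close to 10,000 keys.
counts = Counter(hash64(f"key-{i}") * 10 // 2**64 for i in range(100_000))
print(sorted(counts.values()))
```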

Hash partitioning

By taking the modulo of the hashed key and the number of partitions N, the data corresponding to the keys can be evenly distributed across the available partitions. As long as the data is evenly distributed, simple hash partitioning works well. If there are hotspots (keys that are accessed far more often than others), this method falls short.

  • hash(key) % N = index

Consistent hashing

With consistent hashing, a hash function randomly maps both the partition identifiers and the keys onto a circle. Each key is assigned to the next node moving clockwise. Sort order is only preserved within a partition. (The diagram below should be rendered as a circle; the flowchart library I used doesn’t offer that feature.)

```mermaid
flowchart LR
    S0[S0]
    S1[S1]
    S2[S2]
    S3[S3]
    S4[S4]
    S5[S5]
    K1((key 1))
    S0 --- S1
    S0 --- S5
    S2 --- S3
    S4 --- S3
    S5 --- S4
    S1 --- K1
    K1 --> S2
```
In this example, the hashing function mapped key 1 to the partition between S1 and S2 so key 1 is stored in server S2. If a new server is added between key 1 and S2, key 1 is moved to it.
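A tiny numeric version of the same picture, assuming Python (the ring size of 1,000 and the server positions are made up for illustration):

```python
from bisect import bisect_left

# Made-up positions for servers S0..S5 on a ring of size 1,000.
ring = [(80, "S0"), (230, "S1"), (400, "S2"), (560, "S3"), (720, "S4"), (910, "S5")]
positions = [position for position, _ in ring]

def owner(key_position: int) -> str:
    index = bisect_left(positions, key_position) % len(ring)  # next server clockwise, wrapping around
    return ring[index][1]

print(owner(300))   # between S1 (230) and S2 (400) -> "S2", like key 1 in the diagram
print(owner(950))   # past the last server -> wraps around to "S0"
```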

If SHA-1 is used as the hashing function, the hash space ranges from 0 to 2^160 - 1. Note: we do not use the modulus operator here.

```mermaid
flowchart LR
    0 --> 1
    1 --> 2
    2 --> A[. . .]
    A --> B["2^160-2"]
    B --> C["2^160-1"]
    C --> 0
```
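In Python, for example, the SHA-1 digest of a key can be read directly as an integer in that range (the helper name sha1_position is just for illustration):

```python
import hashlib

def sha1_position(key: str) -> int:
    """Map a key onto the ring: an integer in [0, 2**160 - 1]. No modulus is applied."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")

print(sha1_position("key 1"))   # somewhere in the 160-bit hash space
print(2**160 - 1)               # upper bound of the hash space
```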

One way to implement consistent hashing in code is to keep a sorted list of the servers’ positions (angles or numeric hash values) on the hash ring. When a request is made to retrieve data for a key, a linear walk or a binary search finds the first server whose position is at or after the hashed key’s position (the next server moving clockwise). If no such server is found, you wrap around and retrieve the data from the first server in the list.
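Here is one possible sketch of that approach, assuming Python; it reuses the sha1_position helper defined above, and the class and method names are illustrative choices, not a standard API:

```python
from bisect import bisect_left, insort

class ConsistentHashRing:
    def __init__(self, servers=()):
        self._ring = []                                 # sorted list of (position, server) pairs
        for server in servers:
            self.add_server(server)

    def add_server(self, server: str) -> None:
        insort(self._ring, (sha1_position(server), server))

    def remove_server(self, server: str) -> None:
        self._ring.remove((sha1_position(server), server))

    def get_server(self, key: str) -> str:
        """Binary-search for the first server at or after the key's position, wrapping around."""
        if not self._ring:
            raise RuntimeError("no servers on the ring")
        positions = [position for position, _ in self._ring]
        index = bisect_left(positions, sha1_position(key)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["s0", "s1", "s2", "s3"])
print(ring.get_server("key 1"))                         # whichever server sits next clockwise
```

Rebuilding the positions list on every lookup keeps the sketch short; a production version would maintain a separate sorted list of positions so each lookup is a single binary search.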

Practical applications

In order to ensure that keys are evenly distributed among the available servers, virtual nodes can be added for each server. If we used only one node per server, we could very easily end up with several servers bunched up side by side (recall that server locations on the hash ring are assigned “randomly”), which would leave a small subset of the servers handling the majority of the load.

An elegantly simple trick solves this problem. If we define each server as a collection of virtual servers, we can take advantage of the uniform distribution property of the hash function. With 360 virtual nodes (e.g. s0_0, s0_1, s0_2, … s0_59, s1_0, …) we would closely approximate an even distribution of servers around the circle. According to reference 1, the standard deviation is about 10% with 100 virtual nodes and 5% with 200. If we have servers of differing capacities, we simply give the larger ones more virtual servers; virtual servers function as weights.
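Continuing the hypothetical ConsistentHashRing sketch from above, virtual nodes are a small change: place each physical server on the ring many times (the default weight of 60 mirrors the s0_0 … s0_59 example and is otherwise arbitrary):

```python
from bisect import insort

class VirtualNodeRing(ConsistentHashRing):
    """Same ring as above, but each physical server appears at many positions (virtual nodes)."""

    def add_server(self, server: str, weight: int = 60) -> None:
        # "s0" is placed on the ring as s0_0, s0_1, ..., s0_59;
        # a server with more capacity can simply be given a larger weight.
        for i in range(weight):
            insort(self._ring, (sha1_position(f"{server}_{i}"), server))

    def remove_server(self, server: str) -> None:
        self._ring = [(position, s) for position, s in self._ring if s != server]
```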

If we lose a server, the data for all of its virtual servers will be redistributed to the next virtual server moving clockwise around the ring. On average, only k/N keys need to be remapped, where k is the number of keys and N is the number of servers.
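Using the hypothetical VirtualNodeRing above, we can check this: removing one of six servers remaps only the keys that lived on it, roughly one sixth of the total.

```python
ring = VirtualNodeRing(["s0", "s1", "s2", "s3", "s4", "s5"])
keys = [f"key-{i}" for i in range(10_000)]

before = {key: ring.get_server(key) for key in keys}
ring.remove_server("s2")
after = {key: ring.get_server(key) for key in keys}

moved = sum(1 for key in keys if before[key] != after[key])
print(f"{moved / len(keys):.0%} of keys moved")   # only the keys that were stored on s2
```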

Consistent hashing mitigates both the rehashing problem and the hotspot key problem.

Sources

  1. System Design Interview by Alex Xu
  2. Understanding Distributed Systems by Roberto Vitillo
  3. A Guide to Consistent Hashing - Toptal