What is hashing?
Hashing transforms a piece of data of arbitrary size into another value. If the data is transformed into an integer, for example, the hash space is a range of integers from x0 to xn. The smaller the range, the greater the probability of collisions (multiple pieces of data mapping to the same value). A good hash function reduces the probability of collisions.
A data structure called a hash map uses an array of size n + 1 where each element points to a data bucket. If there were only one bucket, searching for a value would take O(N) time. If we add k buckets, the time complexity reduces to O(N/k).
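As a sketch, a bucketed hash map along these lines could look like the following (the class and method names are illustrative, not from any particular library):

```python
# A minimal sketch of a bucketed hash map. With k buckets, a lookup only
# scans the ~N/k items that landed in one bucket instead of all N items.

class BucketedHashMap:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # hash() maps the key into the array index range via modulo.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```

Real hash maps additionally resize (rehash) when buckets grow too full, which keeps the average lookup close to O(1).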
If we need to scale out our hash map, we can design a distributed hashing system. Let's say our system uses the following lookup formula: server_id = hash(key) % N, where hash(...) is the hashing function and N is the number of servers. The first time a client asks for a value, there will be a cache miss; the server will retrieve the value from the database and add it to the cache for next time.
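A minimal sketch of this lookup formula and the cache-aside read path might look like the following (the caches and database here are plain dictionaries standing in for real services):

```python
# server_id = hash(key) % N, plus a cache-aside read path.
# `caches` is a list of per-server caches; `database` is the fallback store.

def server_for(key, num_servers):
    return hash(key) % num_servers

def cached_get(key, caches, database):
    cache = caches[server_for(key, len(caches))]
    if key in cache:
        return cache[key]          # cache hit
    value = database[key]          # cache miss: fall back to the database
    cache[key] = value             # populate the cache for next time
    return value
```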
The rehashing problem
This distributed hashing scheme works well as long as the server pool stays fixed. If one or more servers go offline and we keep using hash(key) % N with the diminished pool, most keys will hash to a different server index than before, so lookups will no longer reach the servers that hold the data. A significant number of requests will be routed to the wrong server. If the system is a cache, a storm of cache misses will ensue, load on the database (or data service) will increase, and performance may be seriously degraded. Consistent hashing is an effective method to mitigate this rehashing problem.
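The scale of the problem is easy to demonstrate: shrinking the pool from 4 to 3 servers remaps roughly 75% of the keys under hash(key) % N. A quick sketch (MD5 is used here only as a stand-in for a deterministic hash):

```python
# Demonstration of the rehashing problem: when N drops from 4 to 3,
# most keys map to a different server index, so most lookups miss.

import hashlib

def stable_hash(key):
    # Deterministic hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if stable_hash(k) % 4 != stable_hash(k) % 3)
print(f"{moved / len(keys):.0%} of keys remapped")  # around 75%
```

The 75% figure follows from the math: a key stays put only when hash % 4 == hash % 3, which happens for just 3 of the 12 residues modulo 12.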
A hashing function maps strings “randomly” onto a range of numbers (e.g. 0 to 2^64 - 1). This ensures that the keys are mapped uniformly across the range. By taking the modulo of the hashed key and the number of partitions N, the data corresponding to the keys can be evenly distributed across the available partitions:

hash(key) % N = index

As long as the data is evenly distributed, simple hash partitioning works well. If there are hotspots (keys that are accessed far more often than others), this method falls short.
With the consistent hashing method, a hash function randomly maps both the partition identifiers and the keys onto a circle. Each key is assigned to the next node moving clockwise. Sort order is only preserved within a partition. (The diagram below should be rendered as a circle. The flowchart library I used doesn’t offer that feature.)
In the diagram, key 1 falls in the partition just before server S2 moving clockwise, so key 1 is stored in server S2. If a new server is added between key 1 and S2, key 1 is moved to the new server.
If SHA-1 is used as the hashing function, the hash space ranges from 0 to 2^160 - 1. Note: we do not use the modulo operator here.
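For example, a key's position on a SHA-1 ring can be computed directly from the digest (a small sketch; the function name is illustrative):

```python
# Mapping a key onto the SHA-1 hash space [0, 2^160 - 1]. No modulo is
# needed: the 160-bit digest itself is the position on the ring.

import hashlib

def ring_position(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

pos = ring_position("user:42")  # some integer in [0, 2^160)
```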
One way to implement consistent hashing in code would be to use a sorted list that stores the angles or the numerical values of the server locations on a hash ring. When a request is made to retrieve data for a key, a linear walk or binary search can be used to find the first server whose value is larger than the hashed lookup key value (the next server moving clockwise). If no larger server value is found, you wrap around and retrieve the data from the first server in the list.
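A minimal sketch of that approach, using a sorted list plus binary search (the class and method names are illustrative):

```python
# A minimal consistent-hash ring: a sorted list of server positions,
# searched with bisect to find the next server moving clockwise.

import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self):
        self._positions = []   # sorted hash values of the servers
        self._servers = {}     # position -> server name

    def add_server(self, name: str):
        pos = _hash(name)
        bisect.insort(self._positions, pos)
        self._servers[pos] = name

    def remove_server(self, name: str):
        pos = _hash(name)
        self._positions.remove(pos)
        del self._servers[pos]

    def server_for(self, key: str) -> str:
        # Binary search for the first server clockwise from the key's
        # position; wrap around to the first server if none is larger.
        i = bisect.bisect(self._positions, _hash(key))
        if i == len(self._positions):
            i = 0
        return self._servers[self._positions[i]]
```

Note that removing a server only affects keys that were assigned to it; every other key's owner is unchanged, which is the whole point of the scheme.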
In order to ensure that keys are evenly distributed among the available servers,
virtual nodes can be added for each server. If we used one node for each server, we could very easily find ourselves with several servers bunched up side-by-side. Recall that server locations on the hash ring are assigned “randomly”. This would cause a small subset of the servers to handle the majority of the load.
An elegantly simple trick can be used to solve this problem. If we define each server as a collection of virtual servers, we can take advantage of the uniform distribution property of the hashing function. If we defined 360 virtual nodes (e.g. s1_0, …), we would closely approximate an even distribution of servers around the circle. According to reference 1, the standard deviation of the load is roughly 10% with 100 virtual nodes and 5% with 200 virtual nodes per server. If we have servers of differing capacities, we can add more virtual servers to those; virtual servers function as weights.
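A sketch of weighting via virtual nodes: each server is hashed onto the ring once per virtual node, so a heavier server simply gets more virtual nodes (the names and counts below are illustrative):

```python
# Virtual nodes: each server appears on the ring many times ("s1_0",
# "s1_1", ...), smoothing the distribution. A server's weight is
# expressed as its virtual-node count.

import bisect
import hashlib
from collections import Counter

def _hash(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

def build_ring(weights):
    """weights: {server: virtual_node_count} -> sorted (position, server) list."""
    ring = []
    for server, count in weights.items():
        for i in range(count):
            ring.append((_hash(f"{server}_{i}"), server))
    ring.sort()
    return ring

def server_for(ring, key):
    i = bisect.bisect(ring, (_hash(key), ""))
    return ring[i % len(ring)][1]  # wrap around past the last position

# A server with twice the virtual nodes receives roughly twice the keys.
ring = build_ring({"s1": 200, "s2": 100})
load = Counter(server_for(ring, f"key-{i}") for i in range(30_000))
```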
If we lose a server, the keys for all of its virtual servers are redistributed to the next virtual server moving clockwise. In general, only about k/N of the keys need to be remapped, given N servers and k keys.
Consistent hashing mitigates both the rehashing and the hotspot key problems.