Last modified: December 05, 2024
This article is written in: 🇺🇸
Consistent Hashing
Imagine you're organizing books in a vast library with shelves arranged in a circle. Each book is placed on a shelf based on its title's position in the alphabet, looping back to the beginning after 'Z'. If you add a new shelf or remove one, you wouldn't want to reshuffle all the books—just a few should need to move. Consistent hashing works similarly in computer systems, allowing data to be distributed across multiple servers efficiently, even as servers are added or removed.
The Hash Ring Concept
Consistent hashing uses a logical ring to represent the entire range of possible hash values. Both data items and nodes (servers) are mapped onto this ring using a hash function.
Visualizing the Hash Ring:
#
+---------+
| |
+-----+ 0° +-----+
| | | |
| +---------+ |
| |
+------+ +------+
| |
270° + + 90°
| |
+------+ +------+
| |
| +---------+ |
| | | |
+-----+ 180° +-----+
| |
+---------+
- The circle represents the entire hash space (e.g., 0 to 2³² - 1).
- Positions on the ring correspond to hash values from the hash function.
- Nodes and data are placed on the ring based on their hash values.
Mapping Nodes and Data onto the Ring
Suppose we have three nodes—Node A, Node B, and Node C—and several data keys that need to be stored.
Assigning Nodes to the Ring:
- Node A hashes to position at 0°.
- Node B hashes to position at 120°.
- Node C hashes to position at 240°.
Visual Representation with Nodes:
#
+---------+
| Node A |
+-----+ 0° +-----+
| | | |
| +---------+ |
| |
+------+ +------+
| |
| |
| |
+------+ +------+
| |
| +---------+ |
| | Node C | |
+-----+ 240° +-----+
| |
+---------+
Placing Data Items on the Ring:
Let's say we have data items with keys K1, K2, and K3.
- K1 hashes to 100°.
- K2 hashes to 200°.
- K3 hashes to 330°.
Visual Representation with Data:
#
+---------+
| Node A |
+-----+ 0° +-----+
| | | |
| +---------+ |
| ↑ |
+------+ K3 +------+
| |
| |
| |
+------+ +------+
| |
| +---------+ |
| | Node C | |
+-----+ 240° +-----+
| ↑ |
| K2 |
How Data Assignment Works
In consistent hashing, each data item is assigned to the next node encountered when moving clockwise around the ring.
- K1 (100°) is stored on Node B (120°).
- K2 (200°) is stored on Node C (240°).
- K3 (330°) wraps around the ring and is stored on Node A (0°).
Complete Ring with Nodes and Data:
#
+---------+
| Node A |
+-----+ 0° +-----+
| | ↑ | |
| +--- K3 ---+ |
| |
+------+ +------+
| |
| |
| |
+------+ +------+
| |
| +---------+ |
| | Node C | |
+-----+ 240° +-----+
| ↑ |
| K2 |
Adding and Removing Nodes
One of the strengths of consistent hashing is that it minimizes the amount of data that needs to move when nodes join or leave the system.
Adding a New Node
Suppose we add Node D that hashes to 80°.
- K1 (100°) now maps to Node D instead of Node B.
- Only K1 needs to be moved to the new node.
Ring After Adding Node D:
#
+---------+
| Node A |
+-----+ 0° +-----+
| | | |
| +---------+ |
| |
+------+ +------+
| Node D (80°) |
| ↑ |
| K1 |
+------+ +------+
| |
| +---------+ |
| | Node C | |
+-----+ 240° +-----+
| ↑ |
| K2 |
Removing a Node
If Node B leaves the system:
- Data items previously mapped to Node B now map to the next node clockwise.
- K1 (if still at Node B) would move to Node D or Node C depending on its position.
Practical Example: Distributed Caching with Consistent Hashing
Consider a web application using a distributed cache to store session data. Let's see how consistent hashing helps in this scenario.
Without Consistent Hashing
- Servers are selected using a simple modulo operation:
server = hash(key) % number_of_servers
. - Adding or removing a server changes
number_of_servers
, causing most keys to remap to different servers. - This results in cache misses and increased load on the database.
With Consistent Hashing
- Servers and keys are placed on the hash ring.
- Only keys that would map to the affected servers need to be remapped.
- The majority of keys continue to map to the same servers, preserving cache hits.
Virtual Nodes (VNodes)
To enhance data distribution and fault tolerance, consistent hashing often uses virtual nodes.
What Are Virtual Nodes?
- Each physical node is represented multiple times on the hash ring at different positions.
- For example, Node A might be placed at positions 5°, 120°, and 250°.
Benefits of Virtual Nodes
- Improved load balancing is achieved as increasing the number of points on the hash ring leads to a more even distribution of data among nodes.
- Easier scaling is possible because adding or removing a physical node only requires redistributing its associated virtual nodes, minimizing disruption.
- Supporting heterogeneous nodes allows nodes with greater capacity to have more virtual nodes, enabling them to handle a proportionally larger share of data and requests.
Implementing Consistent Hashing
Let's walk through how you might implement consistent hashing in practice.
Step 1: Hash Function Selection
Choose a hash function that distributes values uniformly, such as MD5 or SHA-1.
Step 2: Mapping Nodes to the Ring
Assign each node (or virtual node) a hash value to determine its position on the ring.
Step 3: Mapping Keys to Nodes
For each data key:
- Compute its hash value.
- Locate the first node on the ring whose hash value is greater than or equal to the key's hash.
- If no such node exists (the key's hash is higher than any node's hash), wrap around to the first node on the ring.
Sample Code Snippet (Pseudocode)
# Import a reliable hash function
import hashlib
# Function to compute hash value
def compute_hash(key):
return int(hashlib.sha1(key.encode()).hexdigest(), 16)
# Nodes in the system
nodes = ['NodeA', 'NodeB', 'NodeC']
# Create the ring with virtual nodes
ring = {}
virtual_nodes = 100 # Number of virtual nodes per physical node
for node in nodes:
for i in range(virtual_nodes):
vnode_key = f"{node}:{i}"
hash_val = compute_hash(vnode_key)
ring[hash_val] = node
# Sort the hash ring
sorted_hashes = sorted(ring.keys())
# Function to find the node for a given key
def get_node(key):
hash_val = compute_hash(key)
for node_hash in sorted_hashes:
if hash_val <= node_hash:
return ring[node_hash]
return ring[sorted_hashes[0]] # Wrap around
# Example usage
key = 'my_data_key'
assigned_node = get_node(key)
print(f"Key '{key}' is assigned to {assigned_node}")
Interpreting the Output
- The key
'my_data_key'
is assigned to a node based on its hash value. - If you add or remove nodes, only the keys that map to the affected virtual nodes need to change assignments.
Real-World Applications
Distributed Databases: Apache Cassandra
- Uses consistent hashing to distribute data across nodes in a cluster.
- Ensures high availability and fault tolerance.
- Supports virtual nodes to improve load balancing.
Distributed Cache: Amazon DynamoDB
- Employs consistent hashing to distribute data and handle partitions.
- Provides seamless scaling by adding or removing nodes with minimal impact.
Load Balancing: Web Servers
- Consistent hashing can distribute client requests based on client IP addresses.
- Helps maintain session affinity without storing session data on every server.
Advantages of Consistent Hashing
- The scalability of consistent hashing allows nodes to be added or removed with minimal impact on existing data placement.
- Efficiency is enhanced by reducing cache misses and lowering database load in distributed caching systems.
- Built-in fault tolerance ensures the system remains operational even if some nodes fail.
- A balanced load is achieved by distributing data and requests evenly across available nodes.
Challenges and Considerations
- Hotspots can arise if the hash function or data keys do not distribute data uniformly, leading to overloaded nodes.
- Choosing the right hash function is critical to ensure a good spread of hash values across nodes.
- The complexity of implementing virtual nodes and managing the hash ring can increase system overhead.
Best Practices
- Incorporating virtual nodes improves load distribution and minimizes hotspots.
- Regularly monitor system performance to identify and address uneven load or emerging bottlenecks.
- Selecting a consistent hash function that avoids clustering of values ensures better distribution of data.
- Designing for scaling from the outset helps the system accommodate growth and handle additional nodes smoothly.