Step-by-Step Redis Cluster Slot Migration Guide
Redis Cluster partitions the keyspace across 16,384 deterministic hash slots. Horizontal scaling requires redistributing these slots across newly provisioned or underutilized nodes without disrupting live traffic. Because the architecture relies on a decentralized gossip protocol and strict key hashing, any misalignment during the handoff triggers MOVED or ASK redirect storms that can exhaust connection pools and degrade application latency. Executing a Zero-Downtime Slot Migration requires treating the operation as a stateful, multi-phase transition rather than a bulk data transfer. The following guide details production-tested diagnostics, CLI orchestration, client-side resilience patterns, and CI/CD gating for Redis 6.2+ and 7.x environments.
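The slot a key lands in follows from the Redis Cluster specification: CRC16 (XMODEM variant) of the key, modulo 16384, where only the hash-tag substring is hashed if the key contains a non-empty `{...}` section. The sketch below is a minimal illustration of that rule, assuming Python's `binascii.crc_hqx` with a zero seed as the CRC16 implementation; the function name `key_slot` is illustrative rather than part of any tool used in this guide.

```python
from binascii import crc_hqx

HASH_SLOTS = 16384

def key_slot(key: bytes) -> int:
    """Return the cluster hash slot for a key, honoring hash tags."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" counts
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc_hqx(key, 0) % HASH_SLOTS

# Keys sharing a hash tag map to the same slot, so they migrate together
same_slot = key_slot(b"{user:42}:profile") == key_slot(b"{user:42}:orders")
print(same_slot)
```

Knowing which slot a hot key occupies helps decide which slot ranges to move in small batches.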
Pre-Migration Diagnostics & Baseline Validation
Before initiating any topology change, establish a deterministic health baseline. Uneven memory distribution, CPU saturation, or orphaned slots will amplify migration latency and increase the probability of partial failures.
- Verify Slot Ownership & Topology Consistency
redis-cli -h <any-master> -p <port> --cluster check <any-master>:<port>
redis-cli -h <any-master> -p <port> CLUSTER SLOTS
redis-cli -h <any-master> -p <port> CLUSTER NODES
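Cross-reference the output to confirm that no slot is reported under [ERR], left unassigned, or claimed by more than one master. Each master normally carries its own configEpoch, so differing values between masters are expected; the warning signs are nodes that disagree on slot ownership or that report different cluster_current_epoch values in CLUSTER INFO, both of which point to stale gossip state. CLUSTER SET-CONFIG-EPOCH is only accepted on a freshly created node (empty node table, config epoch zero), so on a live cluster repair open or conflicting slots with:
redis-cli --cluster fix <any-master>:<port>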
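The ownership cross-check can also be scripted. The following is a rough sketch, assuming the documented CLUSTER NODES line layout (node ID, address, flags, ..., then slot ranges from the ninth field onward); the helper names `slot_owners` and `audit_slots` are hypothetical. Feeding it the concatenated output collected from several nodes lets disagreements between their views show up as contested slots.

```python
from collections import defaultdict

def slot_owners(cluster_nodes_output: str) -> dict:
    """Map each slot number to the set of master node IDs claiming it."""
    owners = defaultdict(set)
    for line in cluster_nodes_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 8 or "master" not in fields[2].split(","):
            continue  # skip replicas and malformed lines
        node_id = fields[0]
        for token in fields[8:]:
            if token.startswith("["):
                continue  # transient migrating/importing marker
            lo, _, hi = token.partition("-")
            for slot in range(int(lo), int(hi or lo) + 1):
                owners[slot].add(node_id)
    return owners

def audit_slots(cluster_nodes_output: str, total: int = 16384):
    """Return (unassigned slots, {slot: owner IDs} for contested slots)."""
    owners = slot_owners(cluster_nodes_output)
    missing = [s for s in range(total) if s not in owners]
    contested = {s: ids for s, ids in owners.items() if len(ids) > 1}
    return missing, contested
```

An empty `missing` list and an empty `contested` dict correspond to the clean baseline described above.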
- Validate Critical Cluster Parameters
Ensure the following are explicitly configured in redis.conf or via CONFIG SET:
  - cluster-node-timeout 15000 (the default value; treat it as a floor in production so that slow MIGRATE operations do not trigger premature failover)
  - cluster-require-full-coverage no (ensures the cluster continues serving unaffected slots if a migration temporarily blocks a subset)
  - cluster-migration-barrier 1 (the minimum number of replicas a master keeps before one of its replicas may migrate to an orphaned master, so masters involved in a transfer retain a replica)
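A pre-flight script can compare these three settings against a node's live configuration. The sketch below is only an illustration: it assumes the caller already holds the values as a dict of strings (for example, what a CONFIG GET wrapper such as redis-py's `config_get("cluster-*")` returns), and the function name `config_drift` is made up for this example.

```python
def config_drift(actual: dict) -> dict:
    """Return {parameter: problem description} for settings that deviate."""
    problems = {}
    # Treat 15000 ms as a floor rather than an exact value
    timeout = int(actual.get("cluster-node-timeout", 0))
    if timeout < 15000:
        problems["cluster-node-timeout"] = f"{timeout} is below 15000"
    for name, wanted in (
        ("cluster-require-full-coverage", "no"),
        ("cluster-migration-barrier", "1"),
    ):
        found = str(actual.get(name))
        if found != wanted:
            problems[name] = f"expected {wanted}, found {found}"
    return problems
```

Running the check against every master before a reshard keeps a single misconfigured node from going unnoticed.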
- Identify Hot-Key Skew
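Run redis-cli --bigkeys or sample MEMORY USAGE on high-throughput keys (redis-cli --hotkeys reports access frequency instead, but only when an LFU maxmemory-policy is active). If a single slot contains >15% of a node's memory footprint, split the migration into smaller batches to avoid blocking the source event loop.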
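The 15% rule of thumb is easy to apply once per-slot memory estimates exist. The sketch below assumes those estimates were gathered separately (for instance by sampling MEMORY USAGE over keys returned by CLUSTER GETKEYSINSLOT); the function name `oversized_slots` and its input shape are hypothetical.

```python
def oversized_slots(slot_bytes: dict, node_used_bytes: int,
                    threshold: float = 0.15) -> list:
    """Return slots whose estimated size exceeds `threshold` of node memory."""
    if node_used_bytes <= 0:
        raise ValueError("node_used_bytes must be positive")
    return sorted(
        slot for slot, size in slot_bytes.items()
        if size / node_used_bytes > threshold
    )
```

Slots returned by this check are candidates for migrating on their own, in a separate small batch.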
Orchestrating the Slot Handoff
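Slot migration is governed by two transient per-slot states: MIGRATING on the source and IMPORTING on the target. Both are cleared once the slot is assigned to its new owner, which returns the slot to its normal stable condition. The redis-cli --cluster reshard utility automates the underlying CLUSTER SETSLOT and MIGRATE commands.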
Incremental Migration Pattern
Never migrate more than 500 slots in a single execution for datasets exceeding 10 GB. Use the following sequence:
redis-cli --cluster reshard <target-host>:<target-port> \
--cluster-from <source-node-id> \
--cluster-to <target-node-id> \
--cluster-slots 200 \
--cluster-yes
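Moving a large range in increments means running the command above several times. The sketch below shows one way to plan those runs; it only builds argument lists and does not execute anything. The helper names (`reshard_batches`, `reshard_argv`) and the default batch size of 200 are assumptions taken from the example above.

```python
def reshard_batches(total_slots: int, batch_size: int = 200) -> list:
    """Split a slot count into batches no larger than batch_size."""
    if total_slots < 0 or batch_size <= 0:
        raise ValueError("total_slots must be >= 0 and batch_size > 0")
    full, rest = divmod(total_slots, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

def reshard_argv(entry: str, source_id: str, target_id: str, slots: int) -> list:
    """Build the redis-cli argument list for one reshard batch."""
    return [
        "redis-cli", "--cluster", "reshard", entry,
        "--cluster-from", source_id,
        "--cluster-to", target_id,
        "--cluster-slots", str(slots),
        "--cluster-yes",
    ]

# Plan a 500-slot move as three runs; a health check belongs between runs
plan = [reshard_argv("10.0.0.5:6379", "<source-node-id>", "<target-node-id>", n)
        for n in reshard_batches(500)]
```

Each list can be handed to `subprocess.run` once the previous batch has passed `--cluster check`.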
During execution, the source node marks each slot in the batch as MIGRATING. It keeps serving keys that are still present locally; a query for a key in a migrating slot that no longer exists on the source (because it has already moved, or was never created) returns ASK <slot> <target-host>:<target-port>. The client must follow the ASK redirect, send an ASKING command to the target, and retry the original operation on that same connection. Once all keys are transferred, CLUSTER SETSLOT <slot> NODE <dest_node_id> is issued to the destination and source nodes to assign the slot to its new owner, and gossip propagates the updated ownership map. (redis-cli --cluster reshard performs this finalization automatically; SETSLOT ... STABLE only clears the transient migrating/importing state and does not transfer ownership.)
While a slot is migrating, a client may hit the source for a key that has already moved; the ASK redirect routes that single request to the target:
sequenceDiagram
participant C as Client
participant Src as Source node
participant Dst as Target node
C->>Src: GET key (slot migrating)
Src-->>C: ASK slot dst
C->>Dst: ASKING
C->>Dst: GET key
Dst-->>C: value
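The difference between the two redirect types can be made concrete with a small parser. This is a sketch of the client-side decision only, assuming the error text has the `ASK <slot> <host>:<port>` or `MOVED <slot> <host>:<port>` shape; the `Redirect` type and both function names are illustrative.

```python
from typing import NamedTuple, Optional

class Redirect(NamedTuple):
    kind: str   # "ASK" or "MOVED"
    slot: int
    host: str
    port: int

def parse_redirect(error_message: str) -> Optional[Redirect]:
    """Parse a redirect error; return None for any other error text."""
    parts = error_message.split()
    if len(parts) != 3 or parts[0] not in ("ASK", "MOVED"):
        return None
    # rpartition keeps hosts that themselves contain ":" (IPv6) intact
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))

def should_refresh_slot_map(redirect: Redirect) -> bool:
    """MOVED is permanent; ASK applies to the next request only."""
    return redirect.kind == "MOVED"
```

An ASK reply leaves the cached slot map untouched, which is why the client keeps contacting the source first until the slot is formally reassigned.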
When ACLs are enforced, MIGRATE accepts AUTH2 <username> <password> (available since Redis 6.0; AUTH <password> covers requirepass-only deployments). Capture the reshard output and server logs to detect MIGRATE failures, such as IOERR replies caused by network timeouts.
Client-Side Resilience & Python Retry Patterns
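Cluster-aware clients such as redis-py's RedisCluster follow MOVED and ASK redirects internally, so application code mostly needs to absorb the connection errors and exhausted redirect budgets that occur during a migration window. Implement exponential backoff with jitter for those transient failures. The explicit ASK branch in the wrapper below is a fallback for clients or code paths that surface the redirect to the caller; it must send ASKING and the retried command on the same connection to avoid redirect loops and connection thrashing.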
import redis
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type

class RedisClusterMigrationClient:
    def __init__(self, nodes):
        # nodes: list of redis.cluster.ClusterNode(host, port) seed nodes
        self.client = redis.RedisCluster(startup_nodes=nodes, decode_responses=True)

    @retry(
        stop=stop_after_attempt(5),
        # Exponential backoff with random jitter, capped at 2 seconds
        wait=wait_random_exponential(multiplier=0.1, max=2),
        retry=retry_if_exception_type((redis.exceptions.AskError, redis.exceptions.ConnectionError)),
        reraise=True,
    )
    def safe_set(self, key, value):
        try:
            return self.client.set(key, value)
        except redis.exceptions.AskError as e:
            # Fallback path: AskError carries the redirect target as .host and .port
            ask_conn = redis.Redis(host=e.host, port=e.port, decode_responses=True)
            # ASKING only affects the next command on the same connection,
            # so send both through one non-transactional pipeline
            pipe = ask_conn.pipeline(transaction=False)
            pipe.execute_command("ASKING")
            pipe.set(key, value)
            return pipe.execute()[-1]
For high-throughput services, configure redis-py with retry_on_timeout=True and socket_timeout=2.0 to prevent thread pool starvation during gossip propagation delays. Refer to the official redis-py documentation for cluster client initialization best practices.
CI/CD Gating & Pipeline Automation
Automate migration safety checks to prevent topology corruption during automated deployments. The following GitHub Actions pattern demonstrates pre-flight validation, a small canary reshard, and a post-migration validation gate.
jobs:
  redis-slot-migration:
    runs-on: ubuntu-latest
    steps:
      - name: Pre-flight Cluster Health Check
        run: |
          # REDIS_ENDPOINT is expected in host:port form
          redis-cli --cluster check "$REDIS_ENDPOINT" | grep -q "All 16384 slots covered" || exit 1
          redis-cli -u "redis://$REDIS_ENDPOINT" CLUSTER INFO | grep -q "cluster_state:ok" || exit 1
      - name: Canary Slot Redistribution (50 slots)
        run: |
          # --cluster reshard has no dry-run mode; with --cluster-yes this executes
          # a real migration, and --cluster-to is required for non-interactive runs.
          redis-cli --cluster reshard "$TARGET_NODE" \
            --cluster-from "$SOURCE_ID" \
            --cluster-to "$TARGET_ID" \
            --cluster-slots 50 \
            --cluster-yes --cluster-timeout 10000
      - name: Post-Migration Validation Gate
        run: |
          sleep 30  # Allow gossip convergence
          redis-cli --cluster check "$REDIS_ENDPOINT" > cluster_report.txt
          grep -q "\[OK\] All 16384 slots covered" cluster_report.txt || { echo "FAIL: slot coverage gap"; exit 1; }
          if grep -q "\[ERR\]" cluster_report.txt; then echo "FAIL: cluster check reported errors"; exit 1; fi
          echo "PASS: Migration complete"
Enforce mandatory manual approval gates for migrations exceeding 1,000 slots. Integrate Prometheus redis_cluster_slots_assigned and redis_cluster_slots_ok metrics into your deployment dashboard to trigger automated rollbacks if slot coverage drops below 100% for >60 seconds.
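The rollback trigger reduces to a check over a time series. The sketch below is one possible reading of "coverage below 100% for more than 60 seconds", assuming samples of the slots-ok gauge arrive as `(unix_timestamp, value)` pairs in chronological order; the function name `coverage_breached` and the sample format are hypothetical and not tied to any Prometheus client API.

```python
def coverage_breached(samples, total: int = 16384,
                      max_gap_seconds: float = 60) -> bool:
    """True if slots_ok stayed below `total` for longer than max_gap_seconds."""
    gap_start = None
    for timestamp, slots_ok in samples:
        if slots_ok < total:
            if gap_start is None:
                gap_start = timestamp  # first sample of a new coverage gap
            if timestamp - gap_start > max_gap_seconds:
                return True
        else:
            gap_start = None  # coverage recovered, reset the window
    return False
```

A brief dip that recovers resets the window, so only a sustained gap trips the gate.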
Post-Migration Verification & Telemetry
After the final slot batch has been assigned to its new owner and no slot remains in a migrating or importing state, validate topology consistency and monitor convergence metrics:
redis-cli --cluster check <any-node>:<port>
redis-cli -h <any-node> -p <port> CLUSTER NODES | grep "master" | awk '{print $2}' | sort -n
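Confirm in CLUSTER INFO that cluster_state is ok, that cluster_slots_assigned and cluster_slots_ok both read 16384, and that cluster_current_epoch is identical on every node; agreement typically settles within 5–10 seconds of the final handoff. The cluster_stats_messages_sent and cluster_stats_messages_received counters only ever increase, so watch their rate of change rather than waiting for the values to stop moving. Capture the reshard output and server logs to catch MIGRATE failures (for example IOERR replies on timeouts), which point to persistent network bottlenecks. For teams scaling beyond single-region deployments, review the broader architectural patterns documented in Redis Cluster Scaling, Sharding & Automation to align slot distribution with cross-AZ latency requirements.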
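A convergence check can compare CLUSTER INFO output across nodes. The sketch below assumes the raw `field:value` text of CLUSTER INFO has already been collected from each node; `parse_cluster_info` and `converged` are illustrative names, and the criteria (state ok, all 16384 slots ok, a single shared current epoch) are one reasonable definition rather than an official one.

```python
def parse_cluster_info(raw: str) -> dict:
    """Turn CLUSTER INFO 'field:value' lines into a dict of strings."""
    info = {}
    for line in raw.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = value
    return info

def converged(per_node_raw: list) -> bool:
    """True if every node reports ok state, full coverage, and one epoch."""
    infos = [parse_cluster_info(raw) for raw in per_node_raw]
    epochs = {info.get("cluster_current_epoch") for info in infos}
    return (
        len(epochs) == 1
        and None not in epochs
        and all(
            info.get("cluster_state") == "ok"
            and info.get("cluster_slots_ok") == "16384"
            for info in infos
        )
    )
```

Polling this a few times after the last batch gives a simple signal for when the observation window can start.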
Maintain a 24-hour observation window before decommissioning legacy nodes. Monitor connection pool utilization, latency_percentiles_usec, and replica sync lag (master_repl_offset delta) to ensure the cluster has fully absorbed the new topology.
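Replica lag during the observation window can be derived from a master's INFO replication output. The sketch below assumes the usual `slaveN:ip=...,port=...,state=...,offset=...,lag=...` line format and computes the byte delta against `master_repl_offset`; the function name `replica_lag_bytes` is hypothetical.

```python
def replica_lag_bytes(info_replication: str) -> dict:
    """Return {replica address: bytes behind master} from INFO replication."""
    master_offset = None
    replica_offsets = {}
    for line in info_replication.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # section headers and blank lines
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and "offset=" in value:
            fields = dict(pair.split("=", 1) for pair in value.split(","))
            address = f"{fields['ip']}:{fields['port']}"
            replica_offsets[address] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found in INFO output")
    return {addr: master_offset - off for addr, off in replica_offsets.items()}
```

A delta that keeps growing across samples suggests the replica has not absorbed the post-migration write load.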