Step-by-Step Redis Cluster Slot Migration Guide

You need to move part of the keyspace onto a freshly provisioned node — or off a hot one — while application traffic keeps flowing, and a single misstep will scatter MOVED and ASK redirects across every client. Redis Cluster partitions the keyspace across 16,384 deterministic hash slots, and because ownership is propagated through a decentralized gossip protocol with strict key hashing, any misalignment during the handoff triggers redirect storms that exhaust connection pools and inflate tail latency. This guide treats a zero-downtime slot migration as a stateful, multi-phase transition rather than a bulk copy: you baseline the topology, move slots in bounded batches, keep clients resilient to redirects, gate the operation in CI, and verify convergence before decommissioning anything. Every step is independently runnable against a live cluster so you can validate each transition with real output instead of assumptions. If you are still deciding how slots should be distributed in the first place, start from Redis Cluster slot allocation basics.

Prerequisites

Redis 6.2+ or 7.x running in cluster mode, with redis-cli reachable from your operations host.
Cluster admin access — permission to run CLUSTER SLOTS, CLUSTER NODES, CLUSTER SETSLOT, and --cluster reshard.
redis-py 5.x on Python 3.10+ (pip install "redis>=5,<6" tenacity) for the client-resilience example.
Node IDs and endpoints for both the source and target masters (redis-cli CLUSTER NODES prints them).
A maintenance runbook and a rollback owner: migrations that move more than 1,000 slots should carry a manual approval gate.

Step-by-Step Slot Migration

1. Verify slot ownership and topology consistency

Establish a deterministic baseline first, because uneven memory distribution or orphaned slots will amplify migration latency and turn a partial failure into a coverage gap.

redis-cli --cluster check <any-master>:<port>
redis-cli -h <any-master> -p <port> CLUSTER SLOTS
redis-cli -h <any-master> -p <port> CLUSTER NODES

Cross-reference the output to confirm that check prints no [ERR] lines, that every node's view assigns each slot to the same single master, and that all 16,384 slots are reported as covered.
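
The cross-check can also be scripted. The sketch below is one possible approach, not something redis-cli ships: it takes the raw CLUSTER NODES text collected from several nodes and reports slots that no master owns or that different nodes attribute to different masters. The function names `slot_owners` and `compare_views` are hypothetical.

```python
TOTAL_SLOTS = 16384


def slot_owners(cluster_nodes_text: str) -> dict[int, str]:
    """Map each slot to the master ID that owns it in one node's view."""
    owners: dict[int, str] = {}
    for line in cluster_nodes_text.splitlines():
        fields = line.split()
        # Fields: id, address, flags, master-id, ping, pong, epoch, link, slots...
        if len(fields) < 9 or "master" not in fields[2].split(","):
            continue
        for token in fields[8:]:
            if token.startswith("["):
                continue  # bracketed entries mark open slots, not ownership
            first, _, last = token.partition("-")
            for slot in range(int(first), int(last or first) + 1):
                owners[slot] = fields[0]
    return owners


def compare_views(views: list[str]) -> dict[str, list[int]]:
    """Compare CLUSTER NODES output gathered from one or more nodes."""
    maps = [slot_owners(view) for view in views]
    uncovered = sorted({s for m in maps for s in range(TOTAL_SLOTS) if s not in m})
    disputed = [s for s in range(TOTAL_SLOTS) if len({m.get(s) for m in maps}) > 1]
    return {"uncovered": uncovered, "disputed": disputed}
```

Feed it one string per node (for example the output of redis-cli CLUSTER NODES against each master); two empty lists mean every view agrees and coverage is complete.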

2. Reconcile stale configuration epochs

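Every master should carry its own unique configuration epoch (configEpoch). Two masters that share an epoch, or a target whose epoch is lower than a competing claim, can cause the new owner's claim to lose an ownership race, so resolve collisions before moving data.

redis-cli -h <affected-node> -p <port> CLUSTER BUMPEPOCH

CLUSTER BUMPEPOCH raises that node's epoch above the current cluster-wide maximum (it replies STILL if the node already holds the highest epoch); the highest epoch wins when two nodes disagree about who owns a slot. CLUSTER SET-CONFIG-EPOCH <new-epoch> is only accepted on a freshly created node whose node table is empty and whose epoch is still zero, so it cannot be used to reconcile a live cluster.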
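
Epoch collisions can be spotted from CLUSTER NODES output before you change anything. The following is a minimal sketch under the assumption that the seventh column is the config epoch, as in the documented CLUSTER NODES format; `epoch_collisions` is a hypothetical helper name.

```python
from collections import defaultdict


def epoch_collisions(cluster_nodes_text: str) -> dict[int, list[str]]:
    """Map each configEpoch shared by several masters to those masters' IDs."""
    by_epoch: dict[int, list[str]] = defaultdict(list)
    for line in cluster_nodes_text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        flags = fields[2].split(",")
        # Replicas report their master's epoch, so only masters are compared.
        if "master" in flags:
            by_epoch[int(fields[6])].append(fields[0])
    return {epoch: ids for epoch, ids in by_epoch.items() if len(ids) > 1}
```

An empty result means no two masters share an epoch; any entry names the masters that need attention.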

3. Confirm the critical cluster parameters

Set the parameters that keep the Redis cluster stable while a MIGRATE call briefly blocks the source event loop, either in redis.conf or live via CONFIG SET.

redis-cli -h <node> -p <port> CONFIG SET cluster-node-timeout 15000
redis-cli -h <node> -p <port> CONFIG SET cluster-require-full-coverage no
redis-cli -h <node> -p <port> CONFIG SET cluster-migration-barrier 1

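CONFIG SET changes only the node it is sent to, so repeat these commands on every node or persist them in redis.conf. cluster-node-timeout 15000 prevents premature failover during high-latency transfers, cluster-require-full-coverage no keeps covered slots serving even if some slots temporarily lose coverage, and cluster-migration-barrier 1 requires a master to keep at least one replica before any of its replicas may migrate to an orphaned master, which preserves replica coverage during the handoff.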
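
Because these settings are per node, it helps to apply and audit them in a loop. This is a hedged sketch rather than a prescribed tool: `enforce_settings` is a hypothetical helper, and it assumes only that the object passed in exposes redis-py's `config_get(pattern)` and `config_set(name, value)` methods, as a `redis.Redis` connection to a single node does.

```python
from typing import Optional

# Values taken from the CONFIG SET commands in this step.
DESIRED = {
    "cluster-node-timeout": "15000",
    "cluster-require-full-coverage": "no",
    "cluster-migration-barrier": "1",
}


def enforce_settings(node, desired: dict[str, str] = DESIRED) -> dict[str, tuple[Optional[str], str]]:
    """Apply the desired settings to one node and return {name: (old, new)} for changes."""
    changed: dict[str, tuple[Optional[str], str]] = {}
    for name, wanted in desired.items():
        current = node.config_get(name).get(name)
        if current is None or str(current) != wanted:
            node.config_set(name, wanted)
            changed[name] = (current, wanted)
    return changed
```

Call it once per node and log the returned dictionary; an empty result on a second pass confirms the node already matches.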

4. Identify hot-key skew before moving data

Sample the largest keys so a single oversized key does not stall the source while its value is serialized and shipped in one blocking MIGRATE.

redis-cli -h <source> -p <port> --bigkeys
redis-cli -c -h <source> -p <port> MEMORY USAGE <suspect-key>

If one slot holds more than roughly 15% of a node's memory footprint, split the migration into smaller batches so no single transfer monopolizes the event loop.
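
To attribute a large key to its slot without a round trip, you can reproduce the key-to-slot mapping locally. The sketch below follows the rule in the Redis Cluster specification (CRC16/XMODEM of the key, or of its hash tag, modulo 16384); the function names are hypothetical, and CLUSTER KEYSLOT <key> on any node remains the authoritative answer.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the hash slot for a key, honoring {hash tag} syntax."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is hashed.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384
```

Grouping the --bigkeys output by `key_slot` shows whether the heavy keys cluster in a handful of slots that deserve their own small batch.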

5. Reshard an incremental batch of slots

Move slots in bounded batches rather than all at once — --cluster reshard automates the underlying CLUSTER SETSLOT and MIGRATE commands, transitioning the source to MIGRATING and the target to IMPORTING.

redis-cli --cluster reshard <target-host>:<target-port> \
  --cluster-from <source-node-id> \
  --cluster-to <target-node-id> \
  --cluster-slots 200 \
  --cluster-yes

Keep a single execution to roughly 200–500 slots, and go smaller once the dataset exceeds 10 GB. Once all keys in a slot are transferred, the tool issues CLUSTER SETSLOT <slot> NODE <dest-node-id> to both nodes and gossip propagates the new owner.
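
The batching itself is simple arithmetic that is easy to script. As an illustration only, the sketch below splits a migration into a small first batch followed by bounded batches and builds the matching redis-cli argument list; `plan_batches` and `reshard_argv` are hypothetical names, and the defaults (50 and 200) are the figures used elsewhere in this guide.

```python
def plan_batches(total_slots: int, batch_size: int = 200, canary: int = 50) -> list[int]:
    """Split a migration into a canary batch followed by bounded batches."""
    if total_slots <= 0 or batch_size <= 0 or canary < 0:
        raise ValueError("total_slots and batch_size must be positive, canary non-negative")
    batches: list[int] = []
    remaining = total_slots
    first = min(canary, remaining)
    if first:
        batches.append(first)
        remaining -= first
    while remaining > 0:
        size = min(batch_size, remaining)
        batches.append(size)
        remaining -= size
    return batches


def reshard_argv(target_addr: str, source_id: str, target_id: str, slots: int) -> list[str]:
    """Build the argument list for one reshard execution (pass it to subprocess.run)."""
    return [
        "redis-cli", "--cluster", "reshard", target_addr,
        "--cluster-from", source_id,
        "--cluster-to", target_id,
        "--cluster-slots", str(slots),
        "--cluster-yes",
    ]
```

For example, `plan_batches(1000)` returns `[50, 200, 200, 200, 200, 150]`; run a health check between batches rather than firing them back to back.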

6. Follow ASK redirects from the client

While a slot is MIGRATING, a request for a key that has already moved returns ASK <slot> <target-host>:<target-port>, and the client must issue ASKING on the target before retrying. Cluster-aware clients such as redis-py's RedisCluster do this automatically, but custom routing layers must replicate the mechanics.

import redis
from redis.cluster import ClusterNode
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

class SlotMigrationClient:
    def __init__(self, nodes: list[ClusterNode]) -> None:
        # redis-py 5.x expects ClusterNode objects, e.g. ClusterNode("<host>", 6379).
        # RedisCluster resolves MOVED and ASK redirects internally.
        self.client = redis.RedisCluster(startup_nodes=nodes, decode_responses=True)

    @retry(
        stop=stop_after_attempt(5),
        wait=wait_exponential(multiplier=0.1, min=0.1, max=2),
        retry=retry_if_exception_type((redis.exceptions.AskError, redis.exceptions.ConnectionError)),
        reraise=True,
    )
    def safe_set(self, key: str, value: str) -> bool:
        try:
            return self.client.set(key, value)
        except redis.exceptions.AskError as exc:
            # Manual ASK handling for a bespoke router. ASKING only covers the next
            # command on the same connection, so send both through one pipeline.
            with redis.Redis(host=exc.host, port=exc.port, decode_responses=True) as ask_conn:
                pipe = ask_conn.pipeline(transaction=False)
                pipe.execute_command("ASKING")
                pipe.set(key, value)
                _, result = pipe.execute()
                return result

For high-throughput services, configure the client with retry_on_timeout=True and a bounded socket_timeout=2.0 so gossip-propagation delays cannot starve the connection pool.
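
For a custom routing layer that sees raw error replies, the redirect mechanics from this step come down to parsing the reply and deciding whether ASKING must precede the retry. The sketch below is a hypothetical illustration (`Redirect`, `parse_redirect`, and `commands_for` are not redis-py APIs) and performs no network I/O.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "ASK" or "MOVED"
    slot: int
    host: str
    port: int


def parse_redirect(error_line: str) -> Optional[Redirect]:
    """Parse 'ASK <slot> <host>:<port>' or 'MOVED <slot> <host>:<port>'."""
    parts = error_line.lstrip("-").split()
    if len(parts) != 3 or parts[0] not in ("ASK", "MOVED"):
        return None
    # rpartition keeps IPv6 hosts intact by splitting on the last colon only.
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


def commands_for(redirect: Redirect, command: list[str]) -> list[list[str]]:
    """Commands to send, in order, on ONE connection to the redirect target."""
    prefix = [["ASKING"]] if redirect.kind == "ASK" else []
    return prefix + [command]
```

The important design point is that the ASKING prefix and the retried command travel on the same connection; splitting them across pooled connections silently drops the ASKING flag.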

7. Gate the migration in CI/CD

Wrap the reshard in pre-flight and post-flight checks so an automated pipeline aborts before it can corrupt topology, and treat a canary batch as the first move.

jobs:
  redis-slot-migration:
    runs-on: ubuntu-latest
    steps:
      - name: Pre-flight cluster health check
        run: |
          redis-cli --cluster check "$REDIS_ENDPOINT" | grep -q "All 16384 slots covered" || exit 1
          # REDIS_ENDPOINT is host:port, so split it for -h/-p.
          redis-cli -h "${REDIS_ENDPOINT%:*}" -p "${REDIS_ENDPOINT##*:}" CLUSTER INFO | grep -q "cluster_state:ok" || exit 1

      - name: Canary slot redistribution (50 slots)
        run: |
          redis-cli --cluster reshard "$TARGET_NODE" \
            --cluster-from "$SOURCE_ID" \
            --cluster-to "$TARGET_ID" \
            --cluster-slots 50 \
            --cluster-yes \
            --cluster-timeout 10000

      - name: Post-migration validation gate
        run: |
          sleep 30  # allow gossip convergence
          redis-cli --cluster check "$REDIS_ENDPOINT" > cluster_report.txt || true  # judged by the greps below
          grep -q "\[OK\] All 16384 slots covered" cluster_report.txt || {
            echo "FAIL: slot coverage gap"; exit 1;
          }
          if grep -q "\[ERR\]" cluster_report.txt; then
            echo "FAIL: cluster check reported errors"; exit 1
          fi
          echo "PASS: migration complete"

Enforce a mandatory manual approval gate for migrations exceeding 1,000 slots, and feed Prometheus redis_cluster_slots_assigned and redis_cluster_slots_ok into the deployment dashboard to trigger rollback if coverage drops below 100% for more than 60 seconds.
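
The rollback trigger described above can be expressed as a small predicate over scraped samples. This is a sketch under stated assumptions: `should_rollback` is a hypothetical name, and `samples` is assumed to be a time-ordered sequence of (unix_timestamp, slots_ok) pairs pulled from whatever metrics store you use.

```python
from typing import Iterable


def should_rollback(
    samples: Iterable[tuple[float, int]],
    total_slots: int = 16384,
    max_gap_seconds: float = 60.0,
) -> bool:
    """True if coverage stayed below 100% for longer than max_gap_seconds."""
    gap_started = None
    for timestamp, slots_ok in samples:
        if slots_ok < total_slots:
            if gap_started is None:
                gap_started = timestamp  # first sample of this coverage gap
            if timestamp - gap_started > max_gap_seconds:
                return True
        else:
            gap_started = None  # coverage recovered, reset the clock
    return False
```

A brief dip that recovers inside the window resets the clock, so only a sustained gap trips the rollback.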

Failure Modes

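Stalled slot stuck in MIGRATING/IMPORTING. A killed reshard or a timed-out MIGRATE can leave a slot half-transferred, so --cluster check reports it open on both nodes.

# Option A: finish the interrupted transfer (needs the MIGRATING/IMPORTING flags still set)
redis-cli --cluster fix <any-node>:<port>

# Option B: abort the batch by clearing the flags on both sides
redis-cli -h <source> -p <port> CLUSTER SETSLOT <slot> STABLE
redis-cli -h <target> -p <port> CLUSTER SETSLOT <slot> STABLE

CLUSTER SETSLOT <slot> STABLE only clears the transient state: it does not transfer ownership and does not move back keys that already reached the target. Those keys stay on the target, out of reach of clients, until the slot is migrated again, so after aborting a stalled batch re-run the reshard for that slot. When you simply want the transfer completed, run --cluster fix on its own instead, because it relies on the open-slot flags to find what to finish.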
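
A stalled slot is visible in CLUSTER NODES as a bracketed entry on the reporting node's own line: [slot->-node-id] for MIGRATING and [slot-<-node-id] for IMPORTING. The sketch below, with the hypothetical helper name `open_slots`, extracts those entries so a runbook can list what is stuck before choosing a remedy.

```python
import re

# Matches "[93->-<node-id>]" (migrating) and "[93-<-<node-id>]" (importing).
OPEN_SLOT = re.compile(r"^\[(\d+)-([<>])-([0-9a-f]+)\]$")


def open_slots(cluster_nodes_text: str) -> list[tuple[int, str, str, str]]:
    """Return (slot, state, reporting_node_id, peer_node_id) for each open slot."""
    found = []
    for line in cluster_nodes_text.splitlines():
        fields = line.split()
        for token in fields[8:]:
            match = OPEN_SLOT.match(token)
            if match:
                slot, arrow, peer = match.groups()
                state = "MIGRATING" if arrow == ">" else "IMPORTING"
                found.append((int(slot), state, fields[0], peer))
    return found
```

Run it against the output of both the source and the target, since each node only reports its own side of the transfer.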

ASK redirect storm exhausting the connection pool. A client that does not send ASKING before retrying will loop on ASK errors, opening new connections until the pool is drained and latency spikes.

redis-cli -h <target> -p <port> INFO clients | grep -E "connected_clients|blocked_clients"
redis-cli -h <target> -p <port> INFO commandstats | grep asking

If cmdstat_asking is flat while AskError counts climb in the application, the client is not following redirects correctly — upgrade to a modern cluster-aware client or add the ASKING step from Step 6.
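
That comparison can be automated by sampling INFO commandstats twice. The helpers below are a hypothetical sketch (`asking_calls` and `client_ignores_ask` are illustrative names) that reads the calls= field of the cmdstat_asking line; the application-side AskError count is assumed to come from your own metrics.

```python
def asking_calls(commandstats_text: str) -> int:
    """Return the ASKING call count from INFO commandstats output, or 0 if absent."""
    for line in commandstats_text.splitlines():
        name, _, stats = line.strip().partition(":")
        if name == "cmdstat_asking":
            fields = dict(item.split("=", 1) for item in stats.split(","))
            return int(fields["calls"])
    return 0


def client_ignores_ask(before: str, after: str, app_ask_errors: int) -> bool:
    """True when the app saw ASK errors but the target received no new ASKING calls."""
    return app_ask_errors > 0 and asking_calls(after) == asking_calls(before)
```

A missing cmdstat_asking line is treated as zero calls, which is what a target reports when no client has ever sent ASKING.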

Split ownership after a configEpoch collision. Two masters claiming the same slot with equal epochs cause gossip to flip-flop the owner, producing intermittent MOVED loops.

redis-cli -h <node> -p <port> CLUSTER NODES | awk '{print $1, $7, $9}'
redis-cli -h <winning-node> -p <port> CLUSTER BUMPEPOCH

Bump the epoch on the node that should own the slot so its claim wins the next gossip round, then re-run --cluster check.

Verification

Confirm topology consistency and gossip convergence before you consider the migration finished.

redis-cli --cluster check <any-node>:<port>
redis-cli -h <any-node> -p <port> CLUSTER NODES | grep "master" | awk '{print $2, $9}' | sort
redis-cli -h <any-node> -p <port> CLUSTER INFO | grep -E "cluster_state|cluster_slots_ok|cluster_slots_pfail|cluster_slots_fail"

cluster_state:ok with cluster_slots_ok:16384 and zero pfail/fail slots confirms a clean handoff. cluster_stats_messages_sent and cluster_stats_messages_received are running counters, so watch their growth rate stabilize within 5–10 seconds to confirm gossip has converged. Maintain a 24-hour observation window before decommissioning legacy nodes, tracking connection-pool utilization, the latency_percentiles_usec_* fields from INFO latencystats (Redis 7.0+), and replica sync lag (master_repl_offset delta) so you know the Redis cluster has fully absorbed the new topology. When you are ready to remove the drained node, follow automated node provisioning and removal so the teardown is scripted rather than manual.
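
The clean-handoff criteria can be checked mechanically from CLUSTER INFO text. A minimal sketch, with hypothetical function names, assuming the key:value line format that CLUSTER INFO returns:

```python
def parse_cluster_info(text: str) -> dict[str, str]:
    """Turn CLUSTER INFO output into a {field: value} dictionary."""
    info: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = value
    return info


def handoff_is_clean(text: str, total_slots: int = 16384) -> bool:
    """True when state is ok, every slot is ok, and no slot is pfail or fail."""
    info = parse_cluster_info(text)
    return (
        info.get("cluster_state") == "ok"
        and info.get("cluster_slots_ok") == str(total_slots)
        and info.get("cluster_slots_pfail") == "0"
        and info.get("cluster_slots_fail") == "0"
    )
```

Run it against every master, not just one, since each node reports its own view of the cluster.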

FAQ

How many slots should I move in a single reshard batch?

Keep batches to 200–500 slots for datasets under 10 GB and smaller for larger ones. MIGRATE serializes and ships each key on the source's main thread, so a big batch that lands on a slot full of large values blocks the event loop and can trip cluster-node-timeout. Start with a 50-slot canary, watch latency, then scale the batch size up.

What is the difference between MOVED and ASK during migration?

MOVED is a permanent redirect: the slot's owner has changed and the client should update its slot map. ASK is a one-time, transient redirect that only applies to the single key being requested while its slot is mid-migration — the client must send ASKING to the target before retrying and must not cache the redirect. Treating ASK like MOVED corrupts the client's routing table.
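
The caching rule is small enough to show directly. The class below is a hypothetical illustration of a client-side slot table, not code from any Redis client: MOVED rewrites the cached owner, while ASK only returns the address for a single retry and leaves the table untouched.

```python
class SlotMap:
    """Toy slot table showing how MOVED and ASK differ for caching."""

    def __init__(self) -> None:
        self.owner: dict[int, str] = {}

    def on_redirect(self, kind: str, slot: int, addr: str) -> str:
        """Return the address to retry against; only MOVED updates the cache."""
        if kind == "MOVED":
            self.owner[slot] = addr  # permanent: remember the new owner
        elif kind != "ASK":
            raise ValueError(f"unknown redirect kind: {kind}")
        return addr  # for ASK: one retry, prefixed with ASKING, nothing cached
```

After an ASK, the next request for the same slot still goes to the cached owner first, which is correct because keys that have not yet moved remain on the source.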

Do I need to stop writes during the migration?

No — that is the point of an online reshard. Writes to keys still on the source succeed normally, and writes to already-moved keys are steered to the target by the ASK redirect. You only need application changes if you run a custom, non-cluster-aware client that does not handle redirects.

Why does CLUSTER SETSLOT STABLE not finish my migration?

STABLE only clears the MIGRATING/IMPORTING flags; it does not copy keys or transfer ownership. It is an abort switch for a stalled batch. To actually reassign a slot you need CLUSTER SETSLOT <slot> NODE <dest-id> on both nodes (which --cluster reshard and --cluster fix issue for you) so gossip propagates the new owner.

Can I roll back a migration that is going wrong?

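Yes, as long as the batch is still in progress. Slots that already finished moving can be resharded back to the original owner. For the slot in flight, ownership only flips after all keys transfer, so the source is still authoritative, but any keys already copied sit on the target, and CLUSTER SETSLOT <slot> STABLE on both nodes clears the transient state without returning them. Finish that one slot (re-run the reshard or use --cluster fix) and then move it back. This is why the bounded batches from Step 5 keep rollback cheap.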
