Automated Node Provisioning and Removal in Redis Cluster

This guide covers how to add and drain Redis Cluster nodes automatically — choosing between imperative runbook orchestration and a declarative reconciling controller — so capacity tracks load without manual redis-cli intervention or dropped requests.

Modern distributed caching treats Redis nodes as ephemeral compute units rather than permanent infrastructure fixtures. Manual redis-cli --cluster invocations during traffic spikes or capacity contraction introduce unacceptable operational risk: a mistyped node ID, a half-completed reshard, or a premature termination can strand hash slots and take part of the keyspace offline. The scaling discipline framed in Redis Cluster Scaling, Sharding & Automation decouples compute provisioning from storage topology, letting an automation layer expand and contract the Redis cluster while preserving read/write continuity. This page compares the two automation models teams actually run in production, then covers the failure modes and verification steps that separate a safe pipeline from a data-loss incident.

The same primitives, both directions: a node is never handed slots until cluster_state reads ok, and it is never destroyed until every slot has drained back off it.

Architectural Trade-offs: Imperative vs Declarative Lifecycle Automation

There are two dominant ways to automate node lifecycle. Imperative orchestration scripts an ordered sequence of redis-cli and MIGRATE calls, triggered by a runbook, a CI job, or an on-call engineer pressing a button. Declarative reconciliation runs a long-lived controller that continuously compares desired topology (a target node count derived from telemetry) against actual topology and issues the minimum set of operations to converge. The trade-off is between predictability-per-run and hands-off responsiveness.

Automation model	Consistency	Latency	Write Amplification	Operational Complexity
Imperative orchestration (scripted `redis-cli` / Ansible runbook)	Deterministic per run; stops on first error, easy to reason about	High reaction latency — waits for a human or a scheduled trigger	Bounded — you reshard exactly the slot count the runbook names	Low to build, high to operate under frequent scaling
Declarative reconciliation (telemetry-driven controller)	Eventually consistent to a desired state; must guard against oscillation	Low reaction latency — responds to metrics within a scrape interval	Higher if thresholds flap — repeated inbound/outbound migrations move the same keys twice	High to build correctly, low to operate once stable

Both models share the same primitive operations — bootstrap, gossip join, slot migration, drain, del-node. The difference is who decides when to run them and how failure is contained. Small clusters that scale a few times a quarter are well served by the imperative model. Clusters that ride diurnal or bursty traffic, where a human cannot react fast enough, justify the declarative controller.
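The shared primitives can be made concrete with a small planning function. This is a minimal sketch, not a real controller: the operation names are hypothetical labels for the primitives listed above, and `plan_operations` is an illustrative name. An imperative runbook would call it once per run; a declarative controller would call it on every reconcile tick.

```python
from typing import List


def plan_operations(actual_primaries: int, desired_primaries: int) -> List[str]:
    """Return the ordered primitive operations needed to converge on the desired count."""
    if desired_primaries < 1:
        raise ValueError("a cluster needs at least one primary")
    if desired_primaries > actual_primaries:
        # Scale out: a node must join and pass health checks before it receives slots.
        per_node = ["bootstrap", "gossip-join", "migrate-slots-in"]
        return per_node * (desired_primaries - actual_primaries)
    if desired_primaries < actual_primaries:
        # Scale in: slots drain first, and the instance is destroyed last.
        per_node = ["drain-slots", "del-node", "destroy-instance"]
        return per_node * (actual_primaries - desired_primaries)
    return []  # already converged: nothing to do


print(plan_operations(3, 4))
```

Returning an empty plan when the counts already match is what makes the same function safe to call repeatedly from a reconcile loop.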

Approach A — Imperative Orchestration

The imperative model expresses the whole lifecycle as an ordered, idempotent script. It is the right starting point because every step is inspectable and the blast radius of a single run is bounded.

Infrastructure-as-Code bootstrapping

Provisioning begins with declarative compute allocation and strict configuration enforcement. Terraform defines the instance footprint; Ansible handles the Redis bootstrap. The automation must inject non-negotiable cluster parameters before the service starts:

# ansible/roles/redis-cluster/templates/redis.conf.j2
cluster-enabled yes
cluster-config-file nodes-{{ ansible_hostname }}.conf
cluster-node-timeout 5000
cluster-migration-barrier 1
cluster-announce-ip {{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}
cluster-announce-port 6379
cluster-announce-bus-port 16379
save ""
appendonly yes

The cluster-announce-ip and cluster-announce-bus-port directives are critical in cloud environments with NAT or VPC routing. Misconfiguration here causes gossip fragmentation and split-brain conditions. Once Ansible converges, the runbook validates the bootstrap sequence before touching topology. For the full IaC implementation — Terraform modules, Ansible roles, and the wiring between them — see Automating Node Scaling with Terraform and Ansible.

Gossip integration and topology validation

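After startup, a node is still a standalone, empty cluster of one. The runbook introduces it to an existing member with CLUSTER MEET (or redis-cli --cluster add-node). The node then enters the handshake state and begins participating in the gossip protocol. The runbook must verify cluster health before assigning any slots:

# Introduce the new node to an existing member; it joins as an empty primary
redis-cli --cluster add-node <new-node-ip>:6379 <any-cluster-node>:6379

# Validate cluster state and gossip convergence on the new node
redis-cli -h <new-node-ip> -p 6379 CLUSTER INFO | grep -E "cluster_state|cluster_known_nodes"
redis-cli -h <new-node-ip> -p 6379 CLUSTER NODES | grep -E "myself|master|slave"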

Poll for cluster_state:ok and confirm cluster_known_nodes matches the expected topology. Only after gossip converges can the script safely query slot distribution and compute exact slot deltas: every one of the 16,384 slots must stay assigned to exactly one primary, never zero and never two. The mechanics of that ownership map are covered in Redis Cluster Slot Allocation Basics.
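The slot-delta arithmetic can be sketched as follows. This is an illustrative calculation under one assumption: the new node should receive an even share, taken proportionally from each existing primary. The function name and the example node names are hypothetical.

```python
from typing import Dict

TOTAL_SLOTS = 16384


def slots_to_donate(owned: Dict[str, int]) -> Dict[str, int]:
    """Return how many slots each existing primary gives to one new, empty primary."""
    if sum(owned.values()) != TOTAL_SLOTS:
        raise ValueError("existing primaries must cover all 16384 slots")
    target = TOTAL_SLOTS // (len(owned) + 1)  # even share for the new node
    # Proportional donation, rounded down so no donor gives more than it owns.
    deltas = {node: count * target // TOTAL_SLOTS for node, count in owned.items()}
    shortfall = target - sum(deltas.values())
    # Hand the rounding remainder to the donors that would keep the most slots.
    for node in sorted(owned, key=lambda n: owned[n] - deltas[n], reverse=True):
        if shortfall == 0:
            break
        deltas[node] += 1
        shortfall -= 1
    return deltas


print(slots_to_donate({"a": 5461, "b": 5461, "c": 5462}))
```

Because the deltas always sum to the new node's share, the total across all primaries stays at 16,384 after the move.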

Deterministic slot reallocation

Slot migration is the operational bottleneck and the step most sensitive to ordering. The script executes a strict state machine to move slots from overloaded primaries onto the new node:

Mark the target as importing first: CLUSTER SETSLOT <slot> IMPORTING <source_node_id>, issued on the target. The destination must be ready before the source begins sending ASK redirects.
Mark the source as migrating: CLUSTER SETSLOT <slot> MIGRATING <target_node_id>, issued on the source.
Stream keys: on the source, repeat CLUSTER GETKEYSINSLOT <slot> <count> followed by MIGRATE <target_ip> <target_port> "" 0 5000 REPLACE KEYS <key1> <key2> ... until the slot is empty. REPLACE overwrites any stale copy on the target; omit COPY so keys are deleted from the source after a successful transfer.
Atomic handoff: CLUSTER SETSLOT <slot> NODE <target_node_id>, issued on the destination first and then on the source.

The runbook tracks MIGRATING and IMPORTING state through CLUSTER NODES, where each node lists its in-flight slots as [slot->-node_id] (migrating) or [slot-<-node_id] (importing) on its own myself line; CLUSTER SLOTS does not expose this transient state. If a migration stalls on a large-key serialization or network jitter, the runbook applies exponential backoff and raises the MIGRATE timeout before aborting. This atomic, redirect-safe flow is documented end to end in Zero-Downtime Slot Migration, and the CLI-level walkthrough lives in Step-by-Step Redis Cluster Slot Migration Guide.
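One way to express the backoff-and-raise-timeout policy is as a precomputed schedule. The sketch below assumes both the delay and the MIGRATE timeout double per attempt and that the timeout is capped; the doubling factor, the cap, and the function name are illustrative choices, not values taken from Redis.

```python
from typing import Iterator, Tuple


def migrate_retry_schedule(
    base_delay_s: float = 1.0,
    base_timeout_ms: int = 5000,
    max_attempts: int = 4,
    max_timeout_ms: int = 60000,
) -> Iterator[Tuple[float, int]]:
    """Yield (sleep_before_attempt_s, migrate_timeout_ms) for each attempt."""
    for attempt in range(max_attempts):
        # No wait before the first attempt, then 1s, 2s, 4s, ...
        delay = 0.0 if attempt == 0 else base_delay_s * 2 ** (attempt - 1)
        # Give a large key more time on each retry, up to the cap.
        timeout = min(base_timeout_ms * 2 ** attempt, max_timeout_ms)
        yield delay, timeout


for delay, timeout in migrate_retry_schedule():
    print(f"sleep {delay:.0f}s, then MIGRATE with timeout {timeout} ms")
```

When the schedule is exhausted, the runbook aborts and leaves the slot in its MIGRATING/IMPORTING state for an operator to resume or reset.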

Controlled decommissioning

Removal is riskier than provisioning because the node still owns live data. The script drains slots, respects cluster-migration-barrier, and confirms replica failover before terminating the instance:

# 1. Drain all slots from the node to be removed, while it is still a primary.
#    Slots MUST be fully migrated before del-node is called.
redis-cli --cluster reshard <any-cluster-node>:6379 \
  --cluster-from <node-id-to-remove> \
  --cluster-to <target-node-id> \
  --cluster-slots <count> \
  --cluster-yes

# 2. Verify the node now owns zero slots
redis-cli -h <node-to-remove> -p 6379 CLUSTER NODES | grep myself

# 3. Remove the now-empty node from the cluster topology
redis-cli --cluster del-node <any-cluster-node>:6379 <node-id-to-remove>

In containerized environments, the script must coordinate with the Kubernetes control plane — a preStop hook that blocks until slot count reaches zero — to prevent premature pod termination while migration is still in flight.
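The decision such a hook has to make can be sketched as a pure function over CLUSTER NODES output. This is a minimal illustration: the function names are hypothetical, and a real preStop script would fetch the output from the local node and poll until the check passes.

```python
from typing import Tuple


def drain_status(cluster_nodes_output: str) -> Tuple[int, bool]:
    """Return (owned_slot_count, migration_in_flight) for the 'myself' node."""
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        # Field 3 holds comma-separated flags; slot ranges start at field 9.
        if len(fields) < 8 or "myself" not in fields[2].split(","):
            continue
        owned, in_flight = 0, False
        for token in fields[8:]:
            if token.startswith("["):
                in_flight = True  # [slot->-id] or [slot-<-id] marker
                continue
            start, _, end = token.partition("-")
            owned += int(end or start) - int(start) + 1
        return owned, in_flight
    raise ValueError("no 'myself' line found in CLUSTER NODES output")


def safe_to_terminate(cluster_nodes_output: str) -> bool:
    """True only when the node owns no slots and no migration is in flight."""
    owned, in_flight = drain_status(cluster_nodes_output)
    return owned == 0 and not in_flight


sample = "07c3 10.0.1.20:6379@16379 myself,master - 0 0 3 connected 0-99 200"
print(drain_status(sample), safe_to_terminate(sample))
```

The hook should block while `safe_to_terminate` is false, and the pod's termination grace period has to be long enough to cover the slowest expected drain.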

Approach B — Declarative Reconciliation

The declarative model wraps the same primitives in a controller that reacts to real-time signals instead of a schedule. It replaces "run the runbook when someone notices load" with "converge continuously toward a desired topology."

Telemetry-driven scaling triggers

The controller subscribes to Prometheus-exported Redis metrics and OpenTelemetry traces, and derives a desired node count from thresholds rather than fixed timetables:

Metric	Threshold	Action
`redis_memory_used_bytes / redis_memory_max_bytes`	> 0.75 per primary	Provision a new primary and rebalance
`redis_cluster_known_nodes`	< expected count	Trigger gossip repair before scaling
`redis_commands_processed_total` rate	p95 latency spike	Scale out replicas
`redis_cluster_slots_fail`	> 0	Immediate alert; investigate before any scaling

During extreme load events the controller prioritizes slot migration over replica synchronization to prevent cascading latency. Critically, it applies hysteresis: separate scale-out and scale-in thresholds plus a cooldown window, so a metric hovering at the boundary does not trigger repeated inbound/outbound migrations of the same keys.
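The hysteresis rule can be sketched as a small decision gate. Only the 0.75 scale-out threshold comes from the table above; the 0.50 scale-in threshold, the 600-second cooldown, and the class name are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HysteresisGate:
    scale_out_above: float = 0.75   # from the threshold table
    scale_in_below: float = 0.50    # assumed: well below the scale-out threshold
    cooldown_s: float = 600.0       # assumed: minimum gap between topology changes
    last_action_at: float = float("-inf")

    def decide(self, memory_ratio: float, now: float) -> str:
        """Return 'scale-out', 'scale-in', or 'hold' for one metric sample."""
        if now - self.last_action_at < self.cooldown_s:
            return "hold"  # still cooling down from the previous change
        if memory_ratio > self.scale_out_above:
            self.last_action_at = now
            return "scale-out"
        if memory_ratio < self.scale_in_below:
            self.last_action_at = now
            return "scale-in"
        return "hold"  # inside the dead band between the two thresholds


gate = HysteresisGate()
for t, ratio in [(0, 0.80), (100, 0.40), (1000, 0.60), (1000, 0.40)]:
    print(t, ratio, gate.decide(ratio, now=float(t)))
```

The gap between the two thresholds absorbs a metric hovering near a boundary, and the cooldown prevents a scale-in from immediately undoing a scale-out.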

A reconciling scaling controller

The following module sketches the migration core of such a controller using redis-py 5.x, standard-library logging, and exponential backoff. It implements the slot-migration state transitions from Approach A, sends each command to the node that must execute it, and enforces a failure boundary: ownership of a slot is committed only after CLUSTER GETKEYSINSLOT reports the source empty, so a partial migration never silently drops keys.

import time
import logging
from typing import Dict, List

import redis
from redis.cluster import RedisCluster, ClusterNode
from redis.exceptions import ConnectionError as RedisConnectionError
from redis.exceptions import TimeoutError as RedisTimeoutError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("redis_cluster_orchestrator")


class RedisClusterScaler:
    def __init__(
        self,
        startup_nodes: List[Dict[str, str]],
        max_retries: int = 5,
        base_delay: float = 1.0,
    ):
        # startup_nodes is a list of {"host": ..., "port": ...} dicts
        nodes = [ClusterNode(n["host"], int(n["port"])) for n in startup_nodes]
        # Cluster-aware client: used only for cluster-wide health checks.
        self.client = RedisCluster(startup_nodes=nodes, decode_responses=True, socket_timeout=5)
        self.max_retries = max_retries
        self.base_delay = base_delay

    @staticmethod
    def _node(host: str, port: int, socket_timeout: float) -> redis.Redis:
        # Direct connection to a single node. SETSLOT, GETKEYSINSLOT and MIGRATE
        # are node-local, so they must not be routed by the cluster client.
        return redis.Redis(host=host, port=port, decode_responses=True, socket_timeout=socket_timeout)

    def _execute_with_backoff(self, func, *args, **kwargs):
        # Retry transient transport errors only. A ResponseError (for example a
        # rejected SETSLOT or a MIGRATE IOERR) propagates and aborts the run,
        # leaving the slot in MIGRATING/IMPORTING for an operator to resume.
        for attempt in range(self.max_retries):
            try:
                return func(*args, **kwargs)
            except (RedisConnectionError, RedisTimeoutError) as e:
                delay = self.base_delay * (2 ** attempt)
                logger.warning("Attempt %d failed: %s. Retrying in %.1fs...", attempt + 1, e, delay)
                time.sleep(delay)
        raise RuntimeError("Max retries exceeded for cluster operation")

    def migrate_slots(
        self,
        source_id: str,
        source_host: str,
        source_port: int,
        target_id: str,
        target_host: str,
        target_port: int,
        slots: List[int],
        timeout_ms: int = 5000,
        batch_size: int = 100,
    ):
        # The socket timeout must outlast the MIGRATE timeout, or the client
        # gives up while the server is still transferring keys.
        socket_timeout = timeout_ms / 1000 + 5
        source = self._node(source_host, source_port, socket_timeout)
        target = self._node(target_host, target_port, socket_timeout)

        for slot in slots:
            logger.info("Migrating slot %d from %s to %s", slot, source_id, target_id)

            # Step 1: IMPORTING on the target MUST precede MIGRATING on the source.
            # Otherwise the source sends ASK redirects to a target not yet importing.
            self._execute_with_backoff(
                target.execute_command, "CLUSTER", "SETSLOT", slot, "IMPORTING", source_id
            )
            self._execute_with_backoff(
                source.execute_command, "CLUSTER", "SETSLOT", slot, "MIGRATING", target_id
            )

            # Step 2: Transfer keys in batches until the source reports the slot empty.
            # REPLACE overwrites stale copies; omit COPY so keys are deleted from
            # the source after a successful transfer.
            while True:
                keys = self._execute_with_backoff(
                    source.execute_command, "CLUSTER", "GETKEYSINSLOT", slot, batch_size
                )
                if not keys:
                    break
                self._execute_with_backoff(
                    source.execute_command,
                    "MIGRATE", target_host, target_port, "", 0, timeout_ms, "REPLACE", "KEYS", *keys,
                )

            # Step 3: Commit ownership, destination first and then the source.
            for node in (target, source):
                self._execute_with_backoff(
                    node.execute_command, "CLUSTER", "SETSLOT", slot, "NODE", target_id
                )
            logger.info("Slot %d successfully migrated.", slot)

    def validate_cluster_health(self) -> bool:
        info = self.client.cluster_info()
        state = info.get("cluster_state", "fail")
        known_nodes = int(info.get("cluster_known_nodes", 0))
        logger.info("Cluster state: %s | Known nodes: %d", state, known_nodes)
        return state == "ok"


# Usage:
# scaler = RedisClusterScaler(startup_nodes=[{"host": "10.0.1.10", "port": "6379"}])
# if scaler.validate_cluster_health():
#     scaler.migrate_slots(
#         "source_node_id", "10.0.1.10", 6379,
#         "target_node_id", "10.0.1.20", 6379,
#         list(range(1000, 1050)),
#     )

The validate_cluster_health gate is what makes the controller safe to run unattended: every reconcile loop must see cluster_state:ok before it issues a single SETSLOT, so a mid-failover cluster is never handed a fresh migration.

When to Choose Which

Pick the model against concrete production signals, not preference:

Scaling frequency. Fewer than roughly one topology change per week → imperative runbook. Multiple changes per day, or unpredictable bursts → declarative controller.
Reaction-time SLA. If a memory-pressure event must be absorbed inside a metric scrape interval (tens of seconds), only the controller reacts fast enough. If you can tolerate minutes of human lead time, the runbook is simpler and safer.
Team operations burden. A controller is a stateful service you now own, monitor, and page on. If your team cannot commit to operating that reliably, an imperative script triggered from CI carries far less standing risk.
Traffic shape. Flat, predictable load rarely justifies reconciliation. Diurnal or spiky workloads — where the same node is added and removed within a day — are exactly where continuous convergence pays for itself, provided hysteresis is tuned to avoid churn.

A common maturity path is to start imperative, harden every step to be idempotent, then lift the same primitives into a controller once scaling frequency crosses the point where a human in the loop is the bottleneck.

Failure Modes and Diagnostics

Every failure below is an edge that escapes the happy path: gossip fragmentation throws a node back to HANDSHAKE, a stalled migration traps it in RESHARDING, and an early terminate turns a clean DRAINING exit into data loss.

Gossip fragmentation / split-brain. A newly provisioned node behind NAT with a wrong cluster-announce-ip never fully joins; two partitions each believe they own the same slots. Diagnose by comparing epochs and known-node counts across partitions:

redis-cli -h <node-a> -p 6379 CLUSTER NODES | awk '{print $1, $2, $7, $8}'
redis-cli -h <node-b> -p 6379 CLUSTER INFO | grep -E "cluster_state|cluster_known_nodes|cluster_size"

Divergent cluster_known_nodes or a slot claimed by two masters confirms fragmentation. Fix the announce directives, then let gossip re-converge before resuming any migration.
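The "slot claimed by two masters" check can be automated by diffing the CLUSTER NODES output captured from each side. The sketch below is a hypothetical helper, with shortened node IDs in the sample data for readability.

```python
from typing import Dict, List


def slot_owners(cluster_nodes_output: str) -> Dict[int, str]:
    """Map each assigned slot to the node ID of the master that claims it."""
    owners: Dict[int, str] = {}
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        # Only masters that list at least one slot token are relevant.
        if len(fields) < 9 or "master" not in fields[2].split(","):
            continue
        for token in fields[8:]:
            if token.startswith("["):
                continue  # skip in-flight migration markers
            start, _, end = token.partition("-")
            for slot in range(int(start), int(end or start) + 1):
                owners[slot] = fields[0]
    return owners


def conflicting_slots(view_a: str, view_b: str) -> List[int]:
    """Slots that the two views assign to different masters."""
    a, b = slot_owners(view_a), slot_owners(view_b)
    return sorted(s for s in a.keys() & b.keys() if a[s] != b[s])


view_a = "aaa 10.0.1.10:6379@16379 myself,master - 0 0 1 connected 0-10"
view_b = "bbb 10.0.1.11:6379@16379 myself,master - 0 0 2 connected 5-12"
print(conflicting_slots(view_a, view_b))
```

A non-empty result means the partitions disagree about ownership, so no migration should resume until the views match again.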

Stalled slot migration. A migration blocks on a large key or a flapping link, leaving a slot pinned in MIGRATING/IMPORTING. Clients see ASK redirect storms that exhaust their connection pools. Detect the stuck state and inspect the source event loop:

redis-cli -h <source> -p 6379 CLUSTER NODES | grep myself | grep -oE '\[[0-9]+-[<>]-[0-9a-f]+\]'   # [slot->-id] migrating, [slot-<-id] importing
redis-cli -h <source> -p 6379 --bigkeys        # find the blocking key
redis-cli -h <source> -p 6379 SLOWLOG GET 10    # confirm MIGRATE is the stall

Resolve by raising the MIGRATE timeout, moving the slot's keys in smaller MIGRATE batches, or clearing the transient state with CLUSTER SETSLOT <slot> STABLE on both the source and the target before retrying.

Premature decommission (data loss). Terminating an instance before its slots fully drain destroys any keys not yet migrated and can leave slots unassigned. Guard with a hard assertion that the node owns zero slots before del-node and before the compute is destroyed:

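# Fields 9 and up hold slot ranges and in-flight migration markers; all must be gone.
OWNED=$(redis-cli -h <node-to-remove> -p 6379 CLUSTER NODES | awk '/myself/{for (i = 9; i <= NF; i++) printf "%s ", $i}')
[ -z "$OWNED" ] && echo "safe to remove" || echo "ABORT: still owns slots $OWNED"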

Verification

Confirm each lifecycle operation landed correctly in a live cluster:

# Full coverage: all 16,384 slots assigned, state ok, no failing slots
redis-cli --cluster check <any-node>:6379
redis-cli -h <any-node> -p 6379 CLUSTER INFO | grep -cE "cluster_state:ok|cluster_slots_assigned:16384|cluster_slots_fail:0"   # expect 3

# Post-provision: the new node owns a balanced share and carries a replica
redis-cli -h <new-node> -p 6379 CLUSTER NODES | grep myself
redis-cli -h <new-node> -p 6379 CLUSTER REPLICAS <new-node-id>   # expect at least one replica line

# Post-decommission: the removed node is gone from every survivor's view
redis-cli -h <any-node> -p 6379 CLUSTER NODES | grep -c <removed-node-id>   # expect 0

Pair these with the metric checks from Approach B — redis_cluster_slots_fail at 0 and per-primary memory back under the scale-out threshold — to confirm the Redis cluster reached a stable steady state rather than merely a syntactically valid one.
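For pipelines that gate on these checks programmatically, the CLUSTER INFO conditions can be evaluated together. This is a minimal sketch with a hypothetical function name; it expects the raw text that CLUSTER INFO returns.

```python
def cluster_is_healthy(cluster_info_output: str) -> bool:
    """True only if state is ok, all 16384 slots are assigned, and none are failing."""
    # CLUSTER INFO returns one "key:value" pair per line.
    info = dict(
        line.strip().split(":", 1)
        for line in cluster_info_output.splitlines()
        if ":" in line
    )
    return (
        info.get("cluster_state") == "ok"
        and info.get("cluster_slots_assigned") == "16384"
        and info.get("cluster_slots_fail") == "0"
    )


sample = "cluster_state:ok\r\ncluster_slots_assigned:16384\r\ncluster_slots_fail:0\r\n"
print(cluster_is_healthy(sample))
```

Requiring all three fields at once avoids the trap of a filter that passes when only one of the conditions matches.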

Operational Guardrails

Keep all 16,384 slots assigned exactly once. Validate that per-node slot counts sum to 16,384 before and after committing migrations.
Respect cluster-migration-barrier. Keep at least one replica online for every primary during rebalancing.
Enforce idempotency. Every provisioning and removal step must be safe to re-run; use CLUSTER NODES snapshots as state anchors.
Integrate with the service mesh. Shift traffic away from draining nodes via Istio/Envoy weight adjustments before issuing CLUSTER SETSLOT.

Automated Redis Cluster scaling is a reliability discipline, not a feature toggle. Whether you run scripted runbooks or a reconciling controller, the same guarantees hold: bounded slot movement, atomic handoff, and a hard drain gate before any instance is destroyed.
