Automating Node Scaling with Terraform and Ansible

You run a Redis Cluster that has to grow and shrink with traffic, and every manual redis-cli --cluster add-node you type risks a half-joined member, an unassigned slot range, or a client-side MOVED/ASK redirect storm the moment gossip disagrees with itself. This page shows how to make horizontal expansion and contraction deterministic by splitting the work cleanly: Terraform owns the declarative compute, network, and storage state, and Ansible owns idempotent Redis configuration and the Redis cluster handshake. It is the concrete implementation behind Automated Node Provisioning and Removal in Redis Cluster; the broader scaling context lives in Redis Cluster Scaling, Sharding & Automation. The focus here is the exact code — and the failure surfaces you hit once real traffic is riding the live cluster during a resize.

Prerequisites

Redis 7.2+ on every node, with the Redis cluster bus reachable on the client port + 10000 (e.g. 6379 and 16379).
Terraform 1.6+ with a remote backend and state locking (S3 + DynamoDB, or GCS) so two apply runs can never race on the same topology.
Ansible 2.15+ with the community.general collection and a cloud inventory plugin (amazon.aws.aws_ec2 or google.cloud.gcp_compute).
Python 3.10+ with redis-py 5.x (redis.asyncio) available on the pipeline runner for the slot-coverage gate.
A convention for how hash slots are assigned, and agreement that a resize is a multi-phase state transition, not a bulk copy — the same discipline applied in the Step-by-Step Redis Cluster Slot Migration Guide.

Step-by-Step Implementation

Step 1 — Declare cluster nodes and network boundaries in Terraform

Define each member as an indexed instance with a stable role tag and a security group that opens only the client port and the Redis cluster bus, so Ansible can discover nodes later by tag instead of by hand-maintained IP lists.

resource "aws_instance" "redis" {
  count                  = var.node_count            # scale by changing one number
  ami                    = var.redis_ami
  instance_type          = "r6i.xlarge"
  subnet_id              = var.private_subnet_ids[count.index % length(var.private_subnet_ids)]
  vpc_security_group_ids = [aws_security_group.redis_bus.id]

  root_block_device {
    volume_type = "gp3"
    iops        = 6000          # AOF rewrite + RDB snapshots are IOPS-bound
    throughput  = 250
  }

  tags = {
    redis_cluster = "prod-primary"
    role          = count.index < var.master_count ? "master" : "replica"
  }
}

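resource "aws_security_group_rule" "client_port" {
  security_group_id = aws_security_group.redis_bus.id
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 6379        # client port; nodes also use it for MIGRATE while resharding
  to_port           = 6379
  self              = true        # node-to-node only; application clients need their own rule
}

resource "aws_security_group_rule" "cluster_bus" {
  security_group_id = aws_security_group.redis_bus.id
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 16379       # gossip bus = client port + 10000
  to_port           = 16379
  self              = true
}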
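
To preview what the count-based layout produces before running a plan, the two expressions in the instance resource can be mirrored in a few lines of Python. This is only an illustrative sketch: plan_layout and the subnet names are hypothetical, and it relies on Terraform destroying the highest indexes first when count is lowered.

```python
from typing import List, Tuple


def plan_layout(node_count: int, master_count: int,
                subnets: List[str]) -> List[Tuple[int, str, str]]:
    """Mirror the Terraform expressions: (index, role, subnet) per instance."""
    layout = []
    for index in range(node_count):
        # Same rule as: count.index < var.master_count ? "master" : "replica"
        role = "master" if index < master_count else "replica"
        # Same rule as: count.index % length(var.private_subnet_ids)
        subnet = subnets[index % len(subnets)]
        layout.append((index, role, subnet))
    return layout


if __name__ == "__main__":
    subnets = ["subnet-a", "subnet-b", "subnet-c"]  # hypothetical names
    before = plan_layout(6, 3, subnets)
    after = plan_layout(5, 3, subnets)
    for row in before:
        print(row)
    # Lowering node_count drops the highest index, so that node is the one to drain.
    print("removed on scale-in:", [row for row in before if row not in after])
```

One consequence worth noticing in the output: raising master_count re-tags an existing replica index as a master, so role changes and node additions are best planned as separate applies.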

Step 2 — Generate a dynamic Ansible inventory from node tags

Point Ansible at the cloud provider and group hosts by the role tag Terraform stamped, so a newly provisioned instance appears in the right play automatically with zero inventory edits.

# inventory.aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions: [us-east-1]
filters:
  tag:redis_cluster: prod-primary
keyed_groups:
  - key: tags.role          # -> groups: master, replica
    prefix: ""
    separator: ""           # without this the groups are named _master and _replica
hostnames:
  - private-ip-address
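
Before any play runs, it can help to assert that the dynamic inventory actually produced both role groups. The sketch below assumes the JSON shape emitted by ansible-inventory --list (one top-level key per group, each with a hosts list); group_sizes and the sample addresses are illustrative, not part of the original pipeline.

```python
import json
from typing import Dict, Tuple


def group_sizes(inventory_json: str,
                groups: Tuple[str, ...] = ("master", "replica")) -> Dict[str, int]:
    """Count hosts per role group in `ansible-inventory --list` output."""
    inventory = json.loads(inventory_json)
    return {name: len(inventory.get(name, {}).get("hosts", [])) for name in groups}


if __name__ == "__main__":
    # Hypothetical output of: ansible-inventory -i inventory.aws_ec2.yml --list
    sample = json.dumps({
        "_meta": {"hostvars": {}},
        "all": {"children": ["ungrouped", "master", "replica"]},
        "master": {"hosts": ["10.0.1.10", "10.0.2.10", "10.0.3.10"]},
        "replica": {"hosts": ["10.0.1.11", "10.0.2.11"]},
    })
    sizes = group_sizes(sample)
    print(sizes)
    if 0 in sizes.values():
        raise SystemExit("a role group is empty; check the tag filter and keyed_groups")
```

An empty group usually means the tag filter matched nothing or the group names carry an unexpected prefix.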

Step 3 — Template an identical redis.conf across every node

Render one templated redis.conf from a single source of truth and restart Redis only when the rendered file actually changes, so configuration drift between members is impossible and restarts are idempotent.

- name: Render redis.conf
  ansible.builtin.template:
    src: redis.conf.j2          # cluster-enabled yes, cluster-node-timeout 5000, appendonly yes, ...
    dest: /etc/redis/redis.conf
    mode: "0640"
  notify: restart redis         # handler fires only on a checksum change

- name: Ensure Redis is enabled and running
  ansible.builtin.systemd:
    name: redis-server
    state: started
    enabled: true

The template pins the directives that must match on every node — cluster-enabled yes, cluster-config-file nodes.conf, cluster-node-timeout 5000, appendonly yes, appendfsync everysec, and maxmemory-policy volatile-lru — so a divergent cluster-node-timeout can never trigger asymmetric failover.
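
A drift check over the rendered files is one way to confirm that claim after a run. The following is a minimal sketch that assumes the rendered redis.conf text of each node has already been fetched; parse_redis_conf, drifted_directives, and the node names are hypothetical helpers, and only the directives listed above are compared.

```python
from typing import Dict, Set

PINNED = (
    "cluster-enabled", "cluster-config-file", "cluster-node-timeout",
    "appendonly", "appendfsync", "maxmemory-policy",
)


def parse_redis_conf(text: str) -> Dict[str, str]:
    """Parse 'directive value' lines, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key.lower()] = value.strip()
    return settings


def drifted_directives(confs: Dict[str, str]) -> Set[str]:
    """Return pinned directives whose value differs between any two nodes."""
    parsed = [parse_redis_conf(text) for text in confs.values()]
    return {
        key for key in PINNED
        if len({settings.get(key) for settings in parsed}) > 1
    }


if __name__ == "__main__":
    good = "cluster-enabled yes\ncluster-node-timeout 5000\nappendonly yes\n"
    bad = "cluster-enabled yes\ncluster-node-timeout 15000\nappendonly yes\n"
    print(drifted_directives({"node-a": good, "node-b": bad}))
```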

Step 4 — Join new nodes with bounded retry and backoff

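Add each new member behind a bounded retry loop, because a naive add-node issued mid-gossip can fail until the Redis cluster's view of membership settles. Ansible's retries/delay pair waits a fixed interval between attempts: it bounds the retries but does not back off.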

- name: Join node to the cluster with retry
  ansible.builtin.command: >
    redis-cli --cluster add-node
    {{ new_node_ip }}:6379 {{ existing_node_ip }}:6379
    {{ '--cluster-slave' if node_role == 'replica' else '' }}
  register: join_result
  retries: 8
  delay: 10
  until: join_result.rc == 0 and 'OK' in join_result.stdout

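Although Redis 5.0 moved commands and configuration to "replica" terminology (REPLICAOF, for example), redis-cli has kept the --cluster-slave spelling for add-node; confirm what your build accepts with redis-cli --cluster help. To bind a replica to a specific primary, add --cluster-master-id <id>; without it, redis-cli attaches the replica to one of the primaries that has the fewest replicas. Omitting --cluster-slave adds the node as an empty primary that owns no slots until Step 5 rebalances.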
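
Where a true backoff is wanted, for example in a wrapper script on the pipeline runner, the retry can be sketched as below. This is an assumption-laden illustration rather than part of the playbook: retry_with_backoff and backoff_delays are hypothetical helpers, and the action passed in stands for whatever call performs the join and reports success.

```python
import time
from typing import Callable, List


def backoff_delays(attempts: int, base: float = 2.0, cap: float = 30.0) -> List[float]:
    """Delays between attempts: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(attempts - 1)]


def retry_with_backoff(action: Callable[[], bool], attempts: int = 8,
                       base: float = 2.0, cap: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `action` until it returns True; return the attempt that succeeded."""
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            sleep(delays[attempt - 1])  # wait longer after each failure
    raise RuntimeError(f"gave up after {attempts} attempts")


if __name__ == "__main__":
    outcomes = iter([False, False, True])  # stand-in for a join that succeeds on try 3
    print(retry_with_backoff(lambda: next(outcomes), base=0.01))
```

Injecting sleep keeps the helper testable without real waiting; the cap keeps the total wait bounded so a stuck join still fails the pipeline in finite time.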

Step 5 — Wait for gossip convergence before assigning slots

Block the play until cluster_state:ok reports back from an existing member, so slot rebalancing never runs against a Redis cluster that is still reconciling membership.

- name: Wait for gossip convergence
  ansible.builtin.command: redis-cli -h {{ existing_node_ip }} cluster info
  register: cluster_state
  retries: 12
  delay: 5
  until: "'cluster_state:ok' in cluster_state.stdout"
  changed_when: false

- name: Rebalance slots onto the new primary
  ansible.builtin.command: >
    redis-cli --cluster rebalance {{ existing_node_ip }}:6379
    --cluster-use-empty-masters --cluster-yes
  when: node_role == 'master'
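
The convergence check can be made stricter than a substring match by parsing CLUSTER INFO and also requiring that the existing member knows about every expected node. A minimal sketch, assuming the standard key:value output of CLUSTER INFO; is_converged and the expected node count are hypothetical inputs supplied by the pipeline.

```python
from typing import Dict


def parse_cluster_info(raw: str) -> Dict[str, str]:
    """Turn CLUSTER INFO's 'key:value' lines into a dict."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key] = value
    return fields


def is_converged(raw: str, expected_nodes: int) -> bool:
    """True when state is ok, no slot is failing, and every node is known."""
    info = parse_cluster_info(raw)
    return (
        info.get("cluster_state") == "ok"
        and info.get("cluster_slots_pfail") == "0"
        and info.get("cluster_slots_fail") == "0"
        and int(info.get("cluster_known_nodes", 0)) == expected_nodes
    )


if __name__ == "__main__":
    sample = (
        "cluster_state:ok\r\ncluster_slots_assigned:16384\r\n"
        "cluster_slots_pfail:0\r\ncluster_slots_fail:0\r\n"
        "cluster_known_nodes:7\r\ncluster_size:3\r\n"
    )
    print(is_converged(sample, expected_nodes=7))
```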

Step 6 — Drain and remove a node on scale-in

Migrate every slot off the departing primary first, then delete it, so contraction never strands a slot range. A stranded range leaves the cluster without full coverage, and with cluster-require-full-coverage yes that makes it stop serving queries.

# 1. Move all owned slots to another primary before removing the node
redis-cli --cluster reshard <existing>:6379 \
  --cluster-from   <departing-node-id> \
  --cluster-to     <target-node-id> \
  --cluster-slots  <count-owned-by-departing> \
  --cluster-yes

# 2. Only once it owns zero slots, purge it from every node's gossip view
redis-cli --cluster del-node <existing>:6379 <departing-node-id>

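Only after del-node confirms removal from gossip should Terraform destroy the instance (terraform apply with a lowered node_count), so no client is still routing to a host that is about to disappear. Because the instances are created with count, lowering node_count destroys the highest-indexed instance, so that is the node to drain.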
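
The value for --cluster-slots, and the later "owns zero slots" confirmation, can both be derived from CLUSTER NODES output. The helper below is a sketch that assumes the documented line layout (slot tokens start at the ninth field, and bracketed tokens mark in-flight migrations); slots_owned and the node IDs in the sample are hypothetical.

```python
def slots_owned(cluster_nodes: str, node_id: str) -> int:
    """Count the slots owned by `node_id` in CLUSTER NODES output."""
    for line in cluster_nodes.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != node_id:
            continue
        total = 0
        for token in fields[8:]:
            if token.startswith("["):
                continue  # [slot->-id] / [slot-<-id]: migration state, not ownership
            start, _, end = token.partition("-")
            total += int(end or start) - int(start) + 1
        return total
    raise KeyError(f"node {node_id} not found")


if __name__ == "__main__":
    sample = (
        "aaa 10.0.1.10:6379@16379 myself,master - 0 0 1 connected 0-5460\n"
        "bbb 10.0.2.10:6379@16379 master - 0 1700000000000 2 connected 5461-10922\n"
        "ccc 10.0.3.10:6379@16379 master - 0 1700000000000 3 connected 10923-16383\n"
    )
    print(slots_owned(sample, "bbb"))  # value to pass as --cluster-slots
```

Running the same helper after the reshard and requiring a result of 0 gives a scriptable precondition for del-node.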

Step 7 — Gate the pipeline on full slot coverage

Fail the deployment unless the Redis cluster reports ok and all 16,384 slots are assigned, so a partial resize can never be promoted to production.

import asyncio
import sys
from redis.asyncio.cluster import RedisCluster


async def verify_slot_coverage(host: str, port: int = 6379) -> None:
    client = RedisCluster(host=host, port=port, decode_responses=True)  # pass ssl=True only if the nodes serve TLS
    try:
        info = await client.cluster_info()                 # CLUSTER INFO
        if info.get("cluster_state") != "ok":
            sys.exit(f"FAIL cluster_state={info.get('cluster_state')} via {host}")
        assigned = int(info.get("cluster_slots_assigned", 0))
        if assigned != 16384:
            sys.exit(f"FAIL slot coverage {assigned}/16384 via {host}")
        print(f"PASS all 16384 slots assigned via {host}")
    finally:
        await client.aclose()


if __name__ == "__main__":
    asyncio.run(verify_slot_coverage(sys.argv[1]))

Critical Path

The flow below shows the single deterministic path from a Terraform plan to a gated, production-ready cluster — and the one branch that rolls the change back.

A single member walks the same state machine on the way in and the way back out. The scale-out path (top) is gated before it is promoted; the scale-in path (bottom) only begins once the node owns zero slots.

Failure Modes

Partial join leaves slots unassigned

An add-node that succeeds but is never followed by a rebalance (Step 5) leaves an empty primary that serves no keys. Worse, a scale-in that reshards to a not-yet-converged node can drop a slot range entirely. Detect both before promoting the change:

redis-cli --cluster check <any-node>:6379   # look for "Not all 16384 slots are covered"
redis-cli -h <any-node> CLUSTER INFO | grep -E "cluster_state|cluster_slots_assigned"

Fix it by running redis-cli --cluster fix <any-node>:6379 to reassign orphaned slots, then --cluster rebalance to even out ownership. The Step 7 gate exists precisely to stop this state from shipping.
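
When the check reports missing coverage, it is often useful to know exactly which ranges are orphaned. The sketch below computes the gaps from a list of owned ranges, which could be taken from a CLUSTER SLOTS or CLUSTER SHARDS reply; uncovered_ranges is a hypothetical helper and the input is assumed to be inclusive (start, end) pairs.

```python
from typing import Iterable, List, Tuple

TOTAL_SLOTS = 16384


def uncovered_ranges(owned: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) gaps in 0..16383 not covered by `owned`."""
    gaps = []
    next_expected = 0
    for start, end in sorted(owned):
        if start > next_expected:
            gaps.append((next_expected, start - 1))
        next_expected = max(next_expected, end + 1)
    if next_expected < TOTAL_SLOTS:
        gaps.append((next_expected, TOTAL_SLOTS - 1))
    return gaps


if __name__ == "__main__":
    # Hypothetical state after a primary holding the middle third was lost.
    print(uncovered_ranges([(0, 5460), (10923, 16383)]))
```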

Timeout breaches cascade into failover during scale-out

If AOF rewrite or RDB snapshotting saturates disk IOPS while a new node syncs, cluster-node-timeout (5000 ms) expires and healthy primaries are wrongly flagged PFAIL/FAIL, triggering needless failovers. Confirm the root cause is infrastructure, not Redis:

redis-cli -h <node> INFO persistence | grep -E "aof_rewrite_in_progress|rdb_bgsave_in_progress|latest_fork_usec"
journalctl -u redis-server --since "10 min ago" | grep -iE "timeout|cluster|fork"

Fix it by provisioning higher-IOPS volumes in Step 1 (raise the gp3 iops/throughput) rather than masking the symptom; only bump cluster-node-timeout to 8000 as a temporary bridge while the storage change rolls out.
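
The INFO persistence fields above can also be turned into an automated warning. This is a rough sketch only: persistence_warnings is a hypothetical helper, and the fork_budget threshold (a fork pause above 10% of cluster-node-timeout) is an arbitrary assumption to tune, not a Redis recommendation.

```python
from typing import Dict, List


def parse_info(raw: str) -> Dict[str, str]:
    """Parse INFO output, skipping blank lines and '# Section' headers."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition(":")
            fields[key] = value
    return fields


def persistence_warnings(raw: str, node_timeout_ms: int = 5000,
                         fork_budget: float = 0.1) -> List[str]:
    """List persistence activity that may compete with a syncing node."""
    info = parse_info(raw)
    warnings = []
    if info.get("aof_rewrite_in_progress") == "1":
        warnings.append("AOF rewrite in progress")
    if info.get("rdb_bgsave_in_progress") == "1":
        warnings.append("RDB bgsave in progress")
    fork_ms = int(info.get("latest_fork_usec", 0)) / 1000  # microseconds -> ms
    if fork_ms > node_timeout_ms * fork_budget:
        warnings.append(f"last fork took {fork_ms:.0f} ms")
    return warnings


if __name__ == "__main__":
    sample = ("# Persistence\r\naof_rewrite_in_progress:1\r\n"
              "rdb_bgsave_in_progress:0\r\nlatest_fork_usec:750000\r\n")
    print(persistence_warnings(sample))
```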

Config-epoch collision from a racing apply

Two concurrent terraform apply runs, or a node re-imported with a stale nodes.conf, can produce duplicate configEpoch values, so gossip cannot agree on who owns a contested slot and MOVED redirects flap. Inspect epochs across the Redis cluster:

redis-cli -h <node> CLUSTER NODES | awk '{print $1, $3, $7}'   # id, flags, config-epoch

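Fix it by running redis-cli -h <node> CLUSTER BUMPEPOCH on the node that should own the contested slot. CLUSTER SET-CONFIG-EPOCH is not an option on a live member, because Redis only accepts it on a fresh node whose node table is empty and whose config epoch is still zero. Prevent recurrence by enforcing remote-state locking (see Prerequisites) so only one apply mutates topology at a time.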
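
Spotting duplicates by eye in the awk output gets harder as the cluster grows, so the comparison can be scripted. The sketch below assumes the same CLUSTER NODES columns as the awk command (ID in field 1, flags in field 3, config epoch in field 7) and compares primaries only, since a replica reports its primary's epoch; epoch_collisions and the sample IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def epoch_collisions(cluster_nodes: str) -> Dict[int, List[str]]:
    """Map each configEpoch shared by several primaries to their node IDs."""
    by_epoch = defaultdict(list)
    for line in cluster_nodes.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        node_id, flags, epoch = fields[0], fields[2].split(","), int(fields[6])
        if "master" in flags:
            by_epoch[epoch].append(node_id)
    return {epoch: ids for epoch, ids in by_epoch.items() if len(ids) > 1}


if __name__ == "__main__":
    sample = (
        "aaa 10.0.1.10:6379@16379 myself,master - 0 0 3 connected 0-8191\n"
        "bbb 10.0.2.10:6379@16379 master - 0 1700000000000 3 connected 8192-16383\n"
        "ccc 10.0.3.10:6379@16379 slave aaa 0 1700000000000 3 connected\n"
    )
    print(epoch_collisions(sample))
```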

Verification

Confirm the resized cluster is healthy, fully covered, and internally consistent before closing the change:

# 1. Every slot is owned and the cluster agrees it is healthy
redis-cli --cluster check <any-node>:6379          # expect "[OK] All 16384 slots covered"
redis-cli -h <any-node> CLUSTER INFO | grep cluster_state   # expect cluster_state:ok

# 2. Master/replica topology matches what Terraform declared
redis-cli -h <any-node> CLUSTER NODES | grep master | awk '{print $2, $9}' | sort

# 3. Replicas are actually streaming from their primary
redis-cli -h <replica> INFO replication | grep -E "role|master_link_status|master_last_io_seconds_ago"

A master_link_status:down or a growing master_last_io_seconds_ago means the replica never finished syncing after the join — do not remove or fail over the old node until it recovers. A slot count other than 16384 in --cluster check means Step 5 or Step 6 left the topology mid-transition.
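
The replica check in item 3 can be folded into the pipeline with a small parser. A minimal sketch, assuming the INFO replication field names shown above; replica_link_ok is a hypothetical helper and the 10-second idle limit is an arbitrary assumption.

```python
def replica_link_ok(raw: str, max_idle_seconds: int = 10) -> bool:
    """True when INFO replication shows a replica with a live, recent link."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key] = value
    if fields.get("role") != "slave":  # INFO reports replicas as role:slave
        return False
    idle = int(fields.get("master_last_io_seconds_ago", -1))
    # A down link reports -1, which the lower bound rejects.
    return fields.get("master_link_status") == "up" and 0 <= idle <= max_idle_seconds


if __name__ == "__main__":
    sample = ("# Replication\r\nrole:slave\r\nmaster_link_status:up\r\n"
              "master_last_io_seconds_ago:1\r\n")
    print(replica_link_ok(sample))
```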

FAQ

Should Terraform run the `redis-cli --cluster` commands directly?

No — keep provisioning and cluster orchestration separate. Terraform is declarative and has no notion of gossip convergence or retry, so a local-exec calling add-node will report success while the Redis cluster is still reconciling. Let Terraform own only compute, network, and storage, and let Ansible's retry-until loops (Steps 4-5) own the handshake.

Why add a new primary empty and rebalance, instead of resharding during the join?

Adding the node empty gets it into the gossip view and healthy first, so a transient join failure never leaves keys mid-migration. Rebalancing as a distinct, gated step (Step 5) means you can verify cluster_state:ok before any slot moves, and you can cap batch size to avoid blocking the source event loop — the same incremental discipline used for a full slot migration.

Is it safe to scale in by just running `terraform destroy` on a node?

Never. Destroying an instance that still owns slots strands that range, and with cluster-require-full-coverage yes the cluster stops serving queries altogether, not just for the stranded keys. Always drain with --cluster reshard and confirm the node owns zero slots via --cluster check, then del-node, and only then lower node_count in Terraform.

How do I keep `redis.conf` from drifting between nodes?

Render it from one Jinja template with ansible.builtin.template and gate the restart on a checksum-changing handler (Step 3). Because every node materializes from the same source, a directive like cluster-node-timeout is identical everywhere, which is what prevents the asymmetric-failover failure mode above. Never edit a live redis.conf by hand.

What belongs in the CI gate versus a manual approval?

Automate the objective checks — cluster_state:ok, 16,384 slots assigned, replica link status — as the Step 7 gate that blocks promotion. Reserve manual approval for large-blast-radius operations (resizing more than a few hundred slots at once, or removing a primary in a small cluster), where a human should confirm blast radius before gossip propagates the change.
