Production-Grade Pub/Sub Routing for Cross-Service Cache Invalidation

This page covers how to route invalidation signals between services that share a Redis keyspace — choosing between native Pub/Sub and Streams consumer groups, and wiring deterministic channel topology, async connection pools, and idempotent consumers so a single write reaches every cache-owning service without broadcast storms or stale reads.

When multiple microservices maintain overlapping Redis keyspaces, naive TTL expiration or synchronous DEL cascades introduce thundering-herd effects, inconsistent read states, and unpredictable tail latencies. Cross-service invalidation instead needs an explicit event layer: a producer emits one invalidation directive and every cache-aware consumer reconciles its local view. This is one of the propagation mechanisms mapped out in the parent guide to Advanced Cache Invalidation Patterns & Synchronization; the routing substrate you pick decides message granularity, ordering, and what happens when a subscriber is offline.

Two Routing Substrates: Native Pub/Sub vs Streams Consumer Groups

Redis gives you two production-grade transports for fan-out invalidation, and they sit at opposite ends of the durability/latency curve. Native Pub/Sub is a fire-and-forget bus: PUBLISH delivers a message to whichever subscribers are connected right now and forgets it. Redis Streams with consumer groups persist every entry, track per-consumer acknowledgements, and let a service that was offline replay everything it missed on reconnect. Keyspace notifications are a third, narrower option — Redis itself emits an event whenever a key changes — but they are effectively Pub/Sub under the hood and share its at-most-once semantics.

Routing substrate	Consistency / delivery	Latency	Fan-out amplification	Operational complexity
Native Pub/Sub (`PUBLISH`/`SUBSCRIBE`)	At-most-once — lost if subscriber offline	Lowest (single hop, no persistence)	High — every subscriber gets every message on the channel	Low — no group state, but no replay
Pub/Sub + hierarchical channels	At-most-once, but scoped fan-out	Lowest	Low — only interested services subscribe	Low–medium — channel naming discipline
Streams + consumer groups	At-least-once, replayable after reconnect	Low (append + `XREADGROUP`)	Low — one entry, many groups read independently	Medium — `XACK`, PEL, `MAXLEN` trimming
Keyspace notifications	At-most-once, per-key granularity	Lowest	High — fires on every matching key event	Low — but noisy, hard to scope

The rest of this page treats native Pub/Sub with disciplined channel topology as Approach A (the low-latency default) and Streams consumer groups as Approach B (the durable upgrade), then gives concrete criteria for choosing between them.

Approach A — Native Pub/Sub With Deterministic Channel Topology

Broadcasting every invalidation directive to a single monolithic cache:invalidate channel creates cross-service message storms and prevents targeted scaling: every consumer wakes for every event and filters in application code. The fix is strict namespace isolation using hierarchical topic routing, where channel names encode service boundary, domain context, and operation scope.

svc:orders:invalidate          # Single-service, row-level
svc:inventory:invalidate       # Single-service, SKU-level
domain:catalog:bulk            # Cross-service, aggregate-level
region:us-east1:session:flush  # Geographic/tenant-scoped

This topology lets you apply per-channel rate limits, isolate noisy neighbours, and scale each service's subscribers independently. Pattern subscriptions (PSUBSCRIBE) should be reserved for operational dashboards or debugging endpoints, never for primary invalidation consumers: a wildcard subscriber re-couples the fan-out you just decoupled.
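
One way to keep the naming discipline enforceable is to build channel names through a single helper instead of formatting strings ad hoc. The sketch below is illustrative only: `channel_name`, the allowed scope set, and the segment pattern are assumptions modelled on the example channels above, not part of any Redis client.

```python
import re

# Hypothetical helper: scopes mirror the prefixes used in the topology above.
_SCOPES = {"svc", "domain", "region"}
_SEGMENT = re.compile(r"[a-z0-9][a-z0-9-]*")


def channel_name(scope: str, *segments: str) -> str:
    """Build a colon-delimited channel name, rejecting wildcards and unknown scopes."""
    if scope not in _SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    if not segments:
        raise ValueError("at least one segment is required")
    for seg in segments:
        # fullmatch rejects "*", "?", ":" and empty segments in one check
        if not _SEGMENT.fullmatch(seg):
            raise ValueError(f"invalid channel segment: {seg!r}")
    return ":".join((scope, *segments))


print(channel_name("svc", "orders", "invalidate"))
print(channel_name("region", "us-east1", "session", "flush"))
```

Rejecting `*` at construction time means a primary consumer cannot subscribe to a wildcard by accident; dashboards that genuinely need PSUBSCRIBE would bypass the helper deliberately.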

Connection Lifecycle and Async Pool Tuning

Once a connection issues SUBSCRIBE or PSUBSCRIBE it is dedicated to receiving messages and, under RESP2, cannot run ordinary commands, so a synchronous client ties up a thread that could otherwise serve requests. Production deployments use redis.asyncio with an explicitly tuned connection pool so the publisher path and the subscriber loop never contend for the same socket.

import asyncio
import msgspec
from redis.asyncio import Redis, ConnectionPool

INVALIDATION_POOL = ConnectionPool.from_url(
    "redis://cache-broker.internal:6379/2",
    max_connections=50,
    socket_timeout=1.5,
    socket_connect_timeout=1.0,
    retry_on_timeout=True,
    health_check_interval=15,
    decode_responses=False,  # Binary payloads for msgpack
)

class InvalidationPublisher:
    def __init__(self, pool: ConnectionPool):
        self.pool = pool
        self.encoder = msgspec.msgpack.Encoder()

    async def publish(self, channel: str, payload: dict) -> None:
        async with Redis(connection_pool=self.pool) as client:
            envelope = self.encoder.encode({
                "seq_id": payload["seq_id"],   # monotonic per source
                "source": payload["source"],
                "ts": payload["ts"],
                "ttl": payload.get("ttl", 30),
                "keys": payload.get("keys", []),
                "tag": payload.get("tag"),
            })
            await client.publish(channel, envelope)

Size the pool to exceed the expected subscriber count by roughly 20% to absorb connection churn during rolling deployments, when old and new pods briefly hold overlapping subscriptions. Broker-side buffer tuning and the full client lifecycle are worked through in Implementing Redis Pub/Sub for Real-Time Cache Invalidation.
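
The sizing rule can be written down as a small calculation. This is a sketch of the arithmetic only; `pool_size` and its `headroom_pct` parameter are hypothetical names, and the 20% figure is the rule of thumb from the paragraph above rather than a measured value.

```python
def pool_size(expected_subscribers: int, headroom_pct: int = 20) -> int:
    """Subscriber count plus percentage headroom, rounded up (integer math only)."""
    if expected_subscribers < 0 or headroom_pct < 0:
        raise ValueError("inputs must be non-negative")
    # Ceiling division avoids float rounding surprises
    headroom = (expected_subscribers * headroom_pct + 99) // 100
    return expected_subscribers + headroom


print(pool_size(40))  # 40 subscribers plus 8 connections of headroom
```

The result would feed `max_connections` on the pool; deployments with long rollout overlap windows may want a larger headroom percentage.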

Idempotent Async Subscribers

Redis delivers each PUBLISH to a connected subscriber at most once, but producer retries, reconnect races, and overlapping deploys can still emit the same directive twice. Consumers must therefore be strictly idempotent, deduplicating on a (source, seq_id) pair, and must run in an isolated event loop so a slow UNLINK never blocks the request path.

import asyncio
import logging
import msgspec
from collections import OrderedDict
from redis.asyncio import Redis

logger = logging.getLogger("cache.invalidator")

class AsyncInvalidationConsumer:
    def __init__(self, redis_client: Redis, channels: list[str]):
        self.redis = redis_client
        self.channels = channels
        # Bounded dedup cache: last 10k seq_ids across all sources
        self.seen_seqs: OrderedDict[tuple, float] = OrderedDict()
        self.max_dedup = 10_000

    async def run(self) -> None:
        async with self.redis.pubsub() as pubsub:
            await pubsub.subscribe(*self.channels)
            logger.info("Subscribed to invalidation channels: %s", self.channels)
            async for message in pubsub.listen():
                if message["type"] != "message":
                    continue
                await self._process(message)

    async def _process(self, raw_msg: dict) -> None:
        try:
            payload = msgspec.msgpack.decode(raw_msg["data"])
            dedup_key = (payload["source"], payload["seq_id"])
            if dedup_key in self.seen_seqs:
                return  # already applied: idempotent no-op

            if payload.get("tag"):
                await self._resolve_tag(payload)
            else:
                await self._delete_keys(payload["keys"])
            # Record the directive only after it was applied, so a failed
            # attempt does not cause a later retry to be dropped as a duplicate.
            self._track_dedup(dedup_key)
        except Exception as exc:
            logger.error("Invalidation processing failed: %s", exc, exc_info=True)

    def _track_dedup(self, key: tuple) -> None:
        self.seen_seqs[key] = asyncio.get_running_loop().time()
        if len(self.seen_seqs) > self.max_dedup:
            self.seen_seqs.popitem(last=False)  # evict oldest

    async def _delete_keys(self, keys: list[str]) -> None:
        if keys:
            await self.redis.unlink(*keys)  # non-blocking reclaim
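
The bounded dedup cache in the consumer above has one behaviour worth seeing in isolation: once a pair is evicted, a late duplicate is treated as new. The following stand-alone sketch reproduces the same OrderedDict technique without Redis; `BoundedDedup` and `first_time` are illustrative names, not part of the consumer class.

```python
from collections import OrderedDict


class BoundedDedup:
    """Remembers the most recent `capacity` (source, seq_id) pairs."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._seen = OrderedDict()  # insertion order doubles as age order

    def first_time(self, source: str, seq_id: int) -> bool:
        """Return True if the pair is new (and record it), False for a duplicate."""
        key = (source, seq_id)
        if key in self._seen:
            return False
        self._seen[key] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest pair
        return True


dedup = BoundedDedup(capacity=2)
results = [
    dedup.first_time("orders", 1),   # new
    dedup.first_time("orders", 1),   # duplicate, dropped
    dedup.first_time("orders", 2),   # new
    dedup.first_time("catalog", 1),  # new, evicts ("orders", 1)
    dedup.first_time("orders", 1),   # evicted earlier, so treated as new again
]
print(results)  # [True, False, True, True, True]
```

Re-applying an evicted directive is harmless here because UNLINK on an already-removed key is a no-op; the cache only needs to be large enough to cover the window in which duplicates realistically arrive.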

Approach B — Streams With Consumer Groups for Durable Routing

When a lost invalidation is a correctness bug rather than a tolerable blip — inventory counts, entitlement changes, anything a user can observe going stale — replace the fire-and-forget bus with a durable stream. Producers XADD one entry per domain event; each consuming service reads through its own consumer group, so a slow or restarted service never blocks the others and can replay the pending entries list (PEL) on reconnect. This is the same at-least-once contract used across the asynchronous invalidation workflows that back-pressure and retry the actual key deletions.

import redis.asyncio as redis
from redis.exceptions import ResponseError

class StreamInvalidationRouter:
    def __init__(self, r: redis.Redis, stream: str, group: str, consumer: str):
        self.r = r
        self.stream = stream          # e.g. "cache:invalidate:catalog"
        self.group = group            # one group per subscribing service
        self.consumer = consumer      # one name per pod/worker

    async def ensure_group(self) -> None:
        try:
            # id="$" starts at the tail; use "0" to replay history on first boot
            await self.r.xgroup_create(self.stream, self.group, id="$", mkstream=True)
        except ResponseError as exc:
            if "BUSYGROUP" not in str(exc):
                raise  # group already exists is fine; anything else is not

    async def run(self) -> None:
        await self.ensure_group()
        # Start at "0" to re-read this consumer's own pending (unacked) entries
        # after a crash; once that backlog is drained, switch to the ">" cursor
        # for entries never delivered to this group.
        cursor = "0"
        while True:
            resp = await self.r.xreadgroup(
                self.group, self.consumer,
                {self.stream: cursor}, count=100, block=5000,
            )
            batch = [e for _stream, entries in resp or [] for e in entries]
            if cursor != ">":
                # Advance past what was just re-read; an empty batch means the
                # pending backlog is exhausted.
                cursor = batch[-1][0] if batch else ">"
            for entry_id, fields in batch:
                try:
                    await self._apply(fields)
                    await self.r.xack(self.stream, self.group, entry_id)
                except Exception:
                    # No XACK: entry stays in the PEL for XAUTOCLAIM / retry
                    continue

    async def _apply(self, fields: dict) -> None:
        keys = fields.get(b"keys", b"").split(b",")
        if keys and keys != [b""]:
            await self.r.unlink(*keys)

Cap unbounded growth with XADD ... MAXLEN '~' 100000 and reclaim entries stuck in a dead consumer's PEL with XAUTOCLAIM so a crashed pod never strands a directive forever. The trade-off is memory and the housekeeping that native Pub/Sub does not need — which is exactly why Approach A stays the default until durability is a hard requirement.
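
The XAUTOCLAIM reclaim step might look like the sketch below. It assumes a `redis.asyncio.Redis`-like client whose `xautoclaim` reply is a list starting with the next cursor followed by the claimed entries; `reclaim_stale`, the `apply` callback, and the 60-second idle threshold are hypothetical choices for illustration.

```python
from typing import Any, Awaitable, Callable


async def reclaim_stale(
    client: Any,                       # assumed: a redis.asyncio.Redis-like client
    stream: str,
    group: str,
    consumer: str,
    apply: Callable[[dict], Awaitable[None]],
    min_idle_ms: int = 60_000,         # assumed threshold; tune to your SLA
    batch: int = 100,
) -> int:
    """Claim entries idle longer than min_idle_ms, apply them, and XACK each one."""
    reclaimed = 0
    start_id: Any = "0-0"
    while True:
        reply = await client.xautoclaim(
            stream, group, consumer, min_idle_time=min_idle_ms,
            start_id=start_id, count=batch,
        )
        next_id, entries = reply[0], reply[1]
        for entry_id, fields in entries:
            await apply(fields)        # an exception here leaves the entry pending
            await client.xack(stream, group, entry_id)
            reclaimed += 1
        if next_id in (b"0-0", "0-0"):  # cursor wrapped: the whole PEL was scanned
            return reclaimed
        start_id = next_id
```

Running this periodically from one worker per group is usually enough; running it from every pod works too, since a claimed entry resets its idle time and will not be claimed twice within the threshold.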

When to Choose Which

Pick the substrate from concrete production signals, not preference:

Freshness SLA. If a missed invalidation is user-visible incorrectness (pricing, entitlements, inventory), use Streams so an offline consumer replays on reconnect. If stale-for-seconds is acceptable and a short backstop TTL bounds the blast radius, native Pub/Sub is simpler and lower latency.
Subscriber availability. Long-lived, always-connected consumers suit Pub/Sub. Services that scale to zero, deploy frequently, or run on spot capacity will miss fire-and-forget messages and need the Streams PEL.
Throughput and fan-out width. At very high event rates across many services, Pub/Sub's per-subscriber fan-out multiplies socket traffic; a stream stores one entry that N groups read independently, flattening amplification.
Ordering and audit. If you need replay, ordered redelivery, or an audit trail of what was invalidated when, Streams give you XRANGE history. Pub/Sub keeps nothing.
Operational budget. Streams add XACK/PEL/trimming to your runbook and monitoring. If the team cannot own that, disciplined hierarchical Pub/Sub with a reconciliation TTL is the lower-burden choice.

A common production shape is hybrid: publish on native channels for the low-latency common path, and mirror the same directive into a stream so restarted services replay whatever they missed from their last acknowledged entry. This pairs Approach A's latency with Approach B's durability.
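
A minimal sketch of the hybrid producer follows, assuming a `redis.asyncio.Redis`-like client. `publish_hybrid` and the stream name in the comment are hypothetical; the comma-joined `keys` field is chosen to match what the stream router's `_apply` method splits, which in turn assumes keys never contain commas.

```python
from typing import Any, Iterable


async def publish_hybrid(
    client: Any,            # assumed: a redis.asyncio.Redis-like client
    channel: str,           # e.g. "svc:orders:invalidate"
    stream: str,            # e.g. "cache:invalidate:orders" (hypothetical name)
    keys: Iterable[str],
    envelope: bytes,        # pre-encoded Pub/Sub payload
    maxlen: int = 100_000,
) -> Any:
    """Append to the durable stream first, then publish on the low-latency channel."""
    # Stream first: if the process dies between the two calls, the durable
    # record still exists and consumers can replay it.
    entry_id = await client.xadd(
        stream, {"keys": ",".join(keys)}, maxlen=maxlen, approximate=True,
    )
    await client.publish(channel, envelope)
    return entry_id
```

A service that listens on both paths will see each directive twice, which is safe only because the consumers are idempotent; the two writes are not atomic, so the ordering above is a design choice, not a guarantee.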

Bulk Invalidation via Tag Resolution

Emitting thousands of individual DEL commands through either substrate saturates network buffers and routinely trips Redis client-output-buffer-limit. Bulk scenarios need tag-based routing: producers publish a single invalidate:tag:<name> directive, and each consumer resolves the tag locally with non-blocking cursor iteration rather than shipping the whole key list over the wire.

    async def _resolve_tag(self, payload: dict) -> None:
        tag_key = f"cache:tag:{payload['tag']}"
        cursor = 0
        batch_size = 500
        while True:
            cursor, keys = await self.redis.sscan(tag_key, cursor=cursor, count=batch_size)
            if keys:
                await self.redis.unlink(*keys)
            if cursor == 0:
                break
        await self.redis.delete(tag_key)

This keeps the key list off the wire: the producer sends one small directive, and each consumer walks the tag set in bounded batches so no single command blocks the server for long. Note that SSCAN-based iteration is not atomic. If a concurrent write can add keys to the tag mid-sweep, resolve and delete inside a single Lua script instead. Tag schema design, hash-tag slot co-location, and set maintenance are covered in Key Tagging Strategies for Bulk Updates and applied end-to-end in Using Key Tags to Invalidate Related Data Sets.

Failure Modes and Diagnostics

Each routing choice fails in a characteristic way. Name the mode so on-call can move from symptom to fix quickly.

Lost invalidation (silent stale reads)

A subscriber was offline (deploy, crash, network blip) during a native PUBLISH, so it never deleted the key and keeps serving stale data until the key expires or is invalidated again. There is no error; the only signal is divergence. Compare a known entity across services and confirm subscriber presence:

# Is anyone actually listening on the channel?
redis-cli PUBSUB NUMSUB "svc:orders:invalidate"
# If a subscriber shows 0 during a window a write happened, the message was dropped.

The durable fix is to move the correctness-critical directive onto a stream (Approach B) so the consumer replays on reconnect; the interim mitigation is a bounded reconciliation TTL on affected keys.
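
The subscriber-presence check can be automated by comparing the NUMSUB reply against the number of pods you expect per channel. The sketch below works on the reply shape redis-py returns from `pubsub_numsub` (a list of channel/count pairs); `under_subscribed` and the expected-count mapping are hypothetical and would come from your deployment metadata.

```python
from typing import Any, Dict, Iterable, Mapping, Tuple


def under_subscribed(
    numsub: Iterable[Tuple[Any, int]],
    expected: Mapping[str, int],
) -> Dict[str, Tuple[int, int]]:
    """Return {channel: (actual, expected)} for channels below their expected count."""
    actual = {
        (ch.decode() if isinstance(ch, bytes) else ch): int(count)
        for ch, count in numsub
    }
    return {
        ch: (actual.get(ch, 0), want)
        for ch, want in expected.items()
        if actual.get(ch, 0) < want
    }


reply = [(b"svc:orders:invalidate", 0), (b"svc:inventory:invalidate", 3)]
expected = {"svc:orders:invalidate": 2, "svc:inventory:invalidate": 3}
print(under_subscribed(reply, expected))
```

A non-empty result during a write window is the closest thing to an error signal native Pub/Sub offers, so it is worth alerting on.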

Buffer overflow on bulk fan-out

A producer sends thousands of per-key deletes and Redis disconnects slow subscribers when their output buffer breaches client-output-buffer-limit, which itself causes dropped invalidations. Diagnose from the pub/sub buffer class and switch to tag-based routing:

redis-cli CONFIG GET client-output-buffer-limit        # inspect pubsub class limits
redis-cli INFO clients | grep -E "blocked_clients|client_recent_max_output_buffer"
# Temporary relief while you cut over to tag routing:
redis-cli CONFIG SET client-output-buffer-limit "pubsub 32mb 16mb 60"

Consumer-group backlog (Streams)

With Approach B, a stalled or under-scaled consumer group lets the PEL and stream length climb, widening the eventual-consistency window. Watch group lag and pending counts:

redis-cli XINFO GROUPS cache:invalidate:catalog        # lag + pending per group
redis-cli XPENDING cache:invalidate:catalog svc-orders # entries stuck unacked

If pending climbs monotonically, scale consumers in the lagging group and XAUTOCLAIM entries orphaned by a dead pod. A sustained gap between the stream tail and the group's delivered id means producers are outrunning consumers — the same backlog runaway detailed in the asynchronous invalidation workflows guide.
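
The same XINFO GROUPS data can drive an automated check. This sketch assumes the list-of-dicts shape redis-py returns from `xinfo_groups`, with string keys such as `name`, `pending`, and `lag` (the `lag` field is only reported by Redis 7.0 and later and may be null). `lagging_groups` and both thresholds are hypothetical.

```python
from typing import Any, List, Mapping, Sequence


def lagging_groups(
    groups: Sequence[Mapping[str, Any]],
    max_lag: int = 1_000,        # assumed threshold
    max_pending: int = 500,      # assumed threshold
) -> List[str]:
    """Names of groups whose lag or pending count exceeds the given thresholds."""
    flagged = []
    for g in groups:
        name = g["name"]
        name = name.decode() if isinstance(name, bytes) else name
        lag = g.get("lag")       # None when Redis cannot compute it
        pending = g.get("pending", 0)
        if (lag is not None and lag > max_lag) or pending > max_pending:
            flagged.append(name)
    return flagged


sample = [
    {"name": b"svc-orders", "pending": 12, "lag": 3},
    {"name": b"svc-inventory", "pending": 900, "lag": 40},
    {"name": b"svc-search", "pending": 0, "lag": None},
]
print(lagging_groups(sample))  # ['svc-inventory']
```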

Verification

Confirm routing is actually working in a live cluster before trusting it in an incident:

Check	Command	Healthy signal
Channels are live	`redis-cli PUBSUB CHANNELS "svc:*"`	Expected channels present, no stray wildcards
Every channel has a listener	`redis-cli PUBSUB NUMSUB "svc:orders:invalidate"`	Subscriber count matches deployed pods
Stream groups keep up	`redis-cli XINFO GROUPS cache:invalidate:catalog`	`lag` near zero, `pending` bounded
No orphaned pending entries	`redis-cli XPENDING cache:invalidate:catalog svc-orders`	Idle time under your `XAUTOCLAIM` threshold
Live message stream (sparingly)	`redis-cli MONITOR | grep invalidate`	Directives appear within your SLA of the write

Wire the same signals into metrics rather than eyeballing them: count applied invalidations and record per-directive processing latency so you alert on trend breaks, not absolute values.

import asyncio

from opentelemetry import metrics

meter = metrics.get_meter("cache.invalidation")
invalidation_counter = meter.create_counter("cache.invalidation.count")
latency_histogram = meter.create_histogram("cache.invalidation.latency.ms")

# Method of AsyncInvalidationConsumer. Note that _process as written logs and
# swallows its own exceptions, so the "error" branch below only fires if
# _process is changed to re-raise.
async def _process_with_observability(self, raw_msg: dict) -> None:
    loop = asyncio.get_running_loop()
    start = loop.time()
    try:
        await self._process(raw_msg)
        invalidation_counter.add(1, {"status": "success"})
    except Exception:
        invalidation_counter.add(1, {"status": "error"})
        raise
    finally:
        latency_histogram.record((loop.time() - start) * 1000)

Security and Network Isolation

Pub/Sub channels and streams are unauthenticated by default: in a shared cluster, any client can publish to any channel, enabling cache poisoning or a denial-of-service flood of spurious invalidations. Enforce Redis ACLs so an invalidation identity can only touch invalidation resources, and terminate TLS at the broker.

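# Dedicated invalidation user scoped to cache channels and streams only.
# If consumers also delete keys under this identity, grant +unlink, +sscan and
# +del as well, and make sure the key pattern covers the keys they remove.
redis-cli ACL SETUSER svc-invalidator on ">strong_password" \
  "~cache:*" "&svc:*" "&domain:*" "&region:*" \
  "+subscribe" "+psubscribe" "+publish" \
  "+xadd" "+xreadgroup" "+xack" "-@dangerous"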

redis-cli ACL GETUSER svc-invalidator   # verify the grant

Deploy Redis into private subnets, restrict port 6379 to application CIDRs with security-group rules, and keep the invalidation broker off any internet-facing path. Broader network segmentation for shared clusters is covered under multi-tenant security boundaries.

By pairing hierarchical channel routing (or durable streams) with async pool management, idempotent consumers, tag-based bulk resolution, and metric-driven verification, engineering teams decouple invalidation from the request path while keeping cross-service reads consistent.
