Building Async Invalidation Queues with Celery

Synchronous cache eviction — issuing DEL or UNLINK inline on the request thread after a database commit — couples write latency to Redis availability and shard fan-out. On write-heavy endpoints this blocks the connection pool until every key is swept, and a slow shard or a large-key reclaim turns a 5 ms write into a tail-latency spike. This page shows how to hand eviction to a Celery worker pool so the request thread acknowledges the commit immediately, while a background task applies the sweep with retries and dead-lettering. It is the durable, orchestrated option compared against Pub/Sub and Redis Streams in Asynchronous Invalidation Workflows; if a missed eviction is a correctness bug rather than a briefly stale read, the at-least-once delivery built here is what you want.

Prerequisites

Celery 5.3+ with celery[redis], and redis-py 5.x on Python 3.10+.
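Redis 7.2+ reachable as the Celery broker, ideally on an instance separate from your cache so task messages and cached values never contend for memory. A separate logical database on the same instance isolates the keyspace but still shares maxmemory and its eviction policy.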
A working eviction primitive: this guide uses UNLINK, which reclaims large-value memory on a background thread and never blocks the Redis main thread the way DEL can.
An understanding of where eviction sits in your consistency budget — Celery adds a broker hop, so pair it with a short TTL as a backstop for any key whose staleness window must be bounded.
For bulk sweeps, an explicit tag-to-key mapping rather than a keyspace SCAN; the set-based approach is covered in Key Tagging Strategies for Bulk Updates.

Step-by-Step Implementation

Each step is independently runnable — apply them in order to stand up a queue that survives worker crashes and broker restarts.

1. Pin the broker and queue semantics in a dedicated config module. Celery's defaults optimize for throughput, not cache hygiene, so override acknowledgement, prefetch, and routing before a single task runs.

# celery_config.py
broker_url = "redis://redis-broker:6379/1"      # separate DB from the cache
result_backend = "redis://redis-broker:6379/2"

# Retain the message until the worker explicitly ACKs — survives a crash mid-sweep.
task_acks_late = True
task_reject_on_worker_lost = True

# One task in flight per worker process: no local hoarding, even shard fan-out.
worker_prefetch_multiplier = 1

broker_transport_options = {
    "visibility_timeout": 3600,          # long sweeps are not re-queued prematurely
    "queue_order_strategy": "priority",
}

# Priority routing: point invalidations jump ahead of background sweeps.
# On the Redis transport priorities are reversed: 0 is the highest.
task_routes = {
    "cache.invalidate.*": {"queue": "cache_invalidation", "priority": 0},
    "cache.sweep.*": {"queue": "cache_invalidation", "priority": 5},
}

# Emit task events so queue depth and latency are observable.
worker_send_task_events = True
task_send_sent_event = True

2. Define the eviction task and guard it with an atomic Lua script so a redelivered message is a safe no-op. The script checks existence and UNLINKs in a single round trip, avoiding a redundant delete that would generate cluster replication traffic.

# tasks.py
import random
import redis
from celery import Celery

app = Celery("cache_worker")
app.config_from_object("celery_config")

# KEYS[1]: exists? then UNLINK. Returns 1 if a key was reclaimed, else 0.
INVALIDATION_SCRIPT = """
local key = KEYS[1]
if redis.call('EXISTS', key) == 1 then
    redis.call('UNLINK', key)
    return 1
end
return 0
"""

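# Evict from the cache keyspace, not the broker DB. The URL is an assumed placeholder;
# point it at the instance and database that hold your cached values.
_r = redis.Redis.from_url("redis://redis-cache:6379/0", decode_responses=True)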

@app.task(bind=True, name="cache.invalidate.key", max_retries=4)
def invalidate_key(self, key: str):
    try:
        return _r.eval(INVALIDATION_SCRIPT, 1, key)
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError) as exc:
        # Exponential backoff with jitter; a bare self.retry() would reuse the fixed default_retry_delay.
        delay = 2 ** self.request.retries + random.uniform(0, 1)
        raise self.retry(exc=exc, countdown=delay)
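
To see how long a task keeps trying before it gives up, the sketch below reproduces the same countdown expression outside Celery. The names backoff_delay and retry_window are illustrative, not Celery APIs, and the jitter source is injectable only so the bounds are easy to check.

```python
import random

def backoff_delay(retries: int, jitter=random.uniform) -> float:
    """Same expression as the task: 2**retries seconds plus up to 1 s of jitter."""
    return 2 ** retries + jitter(0, 1)

def retry_window(max_retries: int, jitter=random.uniform) -> float:
    """Total seconds spent waiting across all retries before the task fails."""
    # self.request.retries is 0 on the first failure, so the countdowns that
    # are actually scheduled use exponents 0 .. max_retries - 1.
    return sum(backoff_delay(n, jitter) for n in range(max_retries))

if __name__ == "__main__":
    for n in range(4):
        print(f"retry {n + 1}: wait {backoff_delay(n):.2f}s")
    print(f"total window: {retry_window(4):.2f}s")
```

With max_retries=4 the total wait lands between 15 and 19 seconds, so a Redis outage longer than that exhausts the retries. Raising max_retries or capping the exponent are the two obvious knobs if your outages tend to run longer.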

3. Fence bulk sweeps on a tag set instead of scanning the keyspace. Iterate the members of a Redis Set that names the business entity and pipeline the UNLINKs, so the cost of a mass update scales with the tag's membership rather than with a SCAN across the entire keyspace.

@app.task(bind=True, name="cache.sweep.tag", max_retries=4)
def sweep_tag(self, tag: str):
    set_key = f"tag:{tag}:keys"
    try:
        members = _r.smembers(set_key)          # small, bounded per-tag set
        if members:
            with _r.pipeline(transaction=False) as pipe:
                for k in members:
                    pipe.unlink(k)
                pipe.unlink(set_key)             # drop the mapping after the sweep
                pipe.execute()
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError) as exc:
        delay = 2 ** self.request.retries + random.uniform(0, 1)
        raise self.retry(exc=exc, countdown=delay)
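
The sweep can only evict keys that were registered in the tag set when they were cached. A minimal sketch of that write side follows, assuming the same tag:{tag}:keys naming as the task above. cache_with_tag is a hypothetical helper, and client is any redis-py style client (for example the _r instance from tasks.py).

```python
def cache_with_tag(client, key: str, value: str, tag: str, ttl: int = 300) -> None:
    """Store a cached value and record its key under the tag's Redis Set."""
    set_key = f"tag:{tag}:keys"
    # MULTI/EXEC pipeline: the value and its tag membership are written together,
    # so a sweep never misses a key that is already readable.
    with client.pipeline(transaction=True) as pipe:
        pipe.set(key, value, ex=ttl)   # the TTL is the backstop if a sweep is lost
        pipe.sadd(set_key, key)
        pipe.execute()
```

A call such as cache_with_tag(_r, "product:8492", '{"v":1}', tag="catalog") would then make the key reachable by sweep_tag.delay("catalog"). On Redis Cluster the value key and the set key may hash to different slots, in which case the transactional pipeline would need hash tags or a non-transactional fallback.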

4. Enqueue from the write path only after the source-of-truth commit succeeds. Enqueueing before the transaction commits races the reader: the worker can evict the key while the old row is still visible, and a concurrent read then repopulates the cache with the stale value. Enqueue in an after-commit hook so the ordering is deterministic.

# In the request handler, after the DB transaction commits:
from tasks import invalidate_key

def on_product_update(product_id: int):
    # ... db.session.commit() has already returned ...
    invalidate_key.apply_async(
        args=[f"product:{product_id}"],
        queue="cache_invalidation",
        priority=0,  # Redis transport: 0 is the highest priority
    )
    # request thread returns now; eviction happens on a worker
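
One way to make "enqueue only after commit" hard to get wrong is to buffer keys during the transaction and flush them only when it succeeds. The sketch below is framework-agnostic and is not taken from the original handler: PendingInvalidations and its enqueue callback are hypothetical names. In a real service the callback might be something like lambda key: invalidate_key.apply_async(args=[key], queue="cache_invalidation").

```python
from typing import Callable, List

class PendingInvalidations:
    """Buffer cache keys during a transaction; enqueue them only if it commits."""

    def __init__(self, enqueue: Callable[[str], None]):
        self._enqueue = enqueue
        self._keys: List[str] = []

    def add(self, key: str) -> None:
        if key not in self._keys:          # de-duplicate repeated writes to one key
            self._keys.append(key)

    def __enter__(self) -> "PendingInvalidations":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        keys, self._keys = self._keys, []
        if exc_type is None:               # block finished cleanly: commit succeeded
            for key in keys:
                self._enqueue(key)
        return False                       # never swallow the exception on rollback
```

The commit has to happen inside the with block, so that an exception raised by the commit itself skips the enqueue. Frameworks with a native after-commit hook (Django's transaction.on_commit, for instance) serve the same purpose.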

5. Run the worker with late acks and single-task prefetch, and route exhausted tasks to a dead-letter queue. A task that burns through its retries must land somewhere inspectable rather than vanishing.

# tasks.py — dead-letter exhausted invalidations for later reconciliation
from celery.signals import task_failure

@task_failure.connect
def to_dead_letter(sender=None, task_id=None, args=None, einfo=None, **kw):
    if sender and sender.request.retries >= sender.max_retries:
        _r.xadd("dlq:invalidation", {"task": sender.name, "args": str(args)})

# Launch a worker bound to the priority invalidation queue.
celery -A tasks worker --queues=cache_invalidation \
  --concurrency=8 --prefetch-multiplier=1 --loglevel=info
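
The dead-letter stream only helps if something reads it back. Below is a minimal reconciliation sketch, assuming the entry shape written by to_dead_letter above (a task field, and args stored via str()). replay_dead_letters and its enqueue callback are hypothetical, and client is expected to be a redis-py client created with decode_responses=True.

```python
import ast

def replay_dead_letters(client, enqueue, stream: str = "dlq:invalidation", batch: int = 100) -> int:
    """Re-enqueue dead-lettered invalidations, then delete the replayed entries."""
    replayed = 0
    # XRANGE returns (entry_id, fields) pairs, oldest first.
    for entry_id, fields in client.xrange(stream, count=batch):
        try:
            task_name = fields["task"]
            args = list(ast.literal_eval(fields["args"]))   # undo str(args) safely
        except (KeyError, ValueError, SyntaxError, TypeError):
            continue                                        # leave malformed entries for inspection
        enqueue(task_name, args)
        client.xdel(stream, entry_id)                       # remove only after a successful enqueue
        replayed += 1
    return replayed
```

A plausible callback is lambda name, args: app.send_task(name, args=args, queue="cache_invalidation"). Because the eviction itself is idempotent, replaying an entry twice is harmless.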

The critical path runs in four stages: enqueue, attempt UNLINK, retry with backoff, and dead-letter once retries are exhausted.

Failure Modes

1. Lost invalidation on worker crash → silent stale reads. With early acknowledgement (Celery's default), the broker deletes the message the moment a worker picks it up; a crash mid-UNLINK drops the eviction and the cache serves the old value until the next write or TTL. Confirm the ack policy and that unacked messages are being redelivered:

celery -A tasks inspect conf | grep -E "task_acks_late|task_reject_on_worker_lost"
# both must report True; if not, set them in celery_config.py and redeploy

The fix is task_acks_late = True plus task_reject_on_worker_lost = True, so an interrupted task returns to the queue instead of being acknowledged as done.

2. Broker eviction drops queued tasks under memory pressure. If the broker runs on an instance whose maxmemory-policy evicts keys (typically because it shares that instance with the cache), Redis can discard unprocessed task messages during a spike, and the queue silently shrinks. Detect it by watching the eviction counter on the broker instance:

redis-cli -n 1 INFO stats | grep evicted_keys
redis-cli -n 1 CONFIG GET maxmemory-policy

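A non-zero, climbing evicted_keys on the broker means messages are being thrown away. Both the counter and maxmemory-policy are server-wide rather than per database, so give the broker its own instance and set maxmemory-policy noeviction there; producers then get an error (which you can retry) instead of losing work.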

3. Retry storm exhausts the connection pool during recovery. When Redis recovers from a blip, every failed task retries at once; without jitter the synchronized wave re-opens connections faster than the pool allows and trips max number of clients reached. Correlate blocked clients against connections:

redis-cli INFO clients | grep -E "connected_clients|blocked_clients"

If blocked_clients runs above ~15% of connected_clients during recovery, the retries are correlated — the random.uniform(0, 1) jitter in each task de-correlates the wave, and an explicit redis.ConnectionPool(max_connections=...) bounds concurrent connections per worker.

Verification

Confirm the queue drains and a real key is actually evicted on a live broker rather than inferring it from logs:

# 1. Queue depth is trending to zero. Priority-0 messages sit in the list named
#    after the queue; other priority steps use separate, suffixed lists.
redis-cli -n 1 LLEN cache_invalidation

# 2. Active tasks and pool health per worker.
celery -A tasks inspect active
celery -A tasks inspect stats | grep -E "pool|total"

# 3. End-to-end: a key set, then invalidated, is gone.
redis-cli SET product:8492 '{"v":1}' EX 300
python -c "from tasks import invalidate_key; invalidate_key.delay('product:8492').get(timeout=5)"
redis-cli EXISTS product:8492        # expect (integer) 0

# 4. No pathological eviction on the broker under load.
redis-cli -n 1 INFO stats | grep -E "evicted_keys|expired_keys"

For continuous assurance, alert on queue depth (LLEN cache_invalidation > 1000 for two minutes → scale worker concurrency or widen the sweep batch) and on broker evicted_keys moving off zero. Propagate a trace ID from the originating request through the task args so a stale-data report resolves to a specific enqueue.
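
The "above 1000 for two minutes" rule needs a little state so that a single spike does not page anyone. A minimal sketch follows. SustainedThreshold is a hypothetical helper, and the depth samples are assumed to come from polling LLEN cache_invalidation at whatever interval your monitoring uses.

```python
import time
from typing import Optional

class SustainedThreshold:
    """Fire only when a value stays above `limit` for `hold_seconds` without a dip."""

    def __init__(self, limit: int = 1000, hold_seconds: float = 120.0):
        self.limit = limit
        self.hold_seconds = hold_seconds
        self._breach_started: Optional[float] = None

    def observe(self, depth: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if depth <= self.limit:
            self._breach_started = None      # any dip resets the clock
            return False
        if self._breach_started is None:
            self._breach_started = now       # first sample above the limit
        return now - self._breach_started >= self.hold_seconds
```

Calling observe() once per poll returns True only after the depth has stayed above the limit for the full hold period, which is the point at which scaling workers is worth the churn.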

FAQ

Why Celery instead of Redis Pub/Sub for invalidation? Pub/Sub is fire-and-forget: a subscriber mid-reconnect misses the event with no replay, which is fine for briefly stale reads but not for correctness-critical keys. Celery gives at-least-once delivery, bounded retries, and dead-lettering. Many teams run both — a fast Pub/Sub tier for the common case and this durable queue as the recovery backbone.

Should I use DEL or UNLINK in the task? Prefer UNLINK. It removes the key from the keyspace immediately but reclaims the memory on a background thread, so evicting a large value never blocks the Redis main thread and stalls unrelated commands. DEL frees memory inline and can cause a latency spike on multi-megabyte values.

Do I need task_acks_late if my tasks are idempotent? Yes — they solve different problems. Idempotency makes a redelivered message harmless; task_acks_late is what causes the redelivery to happen at all when a worker dies mid-task. Without late acks the broker discards the message on pickup, so an interrupted eviction is simply lost.

How do I stop a mass update from flooding the queue? Don't enqueue one task per key. Route bulk changes through a single tag sweep (Step 3) that evicts a whole related set in one pipelined task, and give sweeps a lower priority than point invalidations so interactive traffic drains first. This mirrors the write-behind idea of batching deferred work.

Why is worker_prefetch_multiplier = 1 recommended here? The default multiplier of 4 lets each worker process reserve several tasks in local memory. For invalidation that starves sibling workers and clusters deletions unevenly across shards. Setting it to 1 keeps one task in flight per process, so work spreads evenly and a slow sweep doesn't hold a batch of point invalidations hostage.
