
Part 3 of the HealthKit + Celery series. Part 1 covered why we needed Celery. Part 2 covered the sync pipeline. This post covers deploying it all to ECS Fargate.

One Image, Three Services

The entire Celery deployment reuses the existing Django Docker image. No separate Dockerfiles, no separate build pipelines. The ECS task definitions differ only in their command override:

| Service | Command | CPU | Memory | Replicas | Ports |
| --- | --- | --- | --- | --- | --- |
| api | gunicorn backend.wsgi... | 512 | 1024 | 2+ (auto-scale) | 8000 (ALB) |
| worker | celery -A backend worker -c 4 -Q default | 1024 | 2048 | 2+ (auto-scale) | none |
| beat | celery -A backend beat --scheduler django_celery_beat | 256 | 512 | 1 | none |
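The worker's task definition, for instance, might look roughly like this (a trimmed sketch for illustration; the account ID, region, family name, and image tag are placeholders, and only the command, cpu, and memory values differ across the three services):

```json
{
    "family": "worker",
    "cpu": "1024",
    "memory": "2048",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "containerDefinitions": [
        {
            "name": "worker",
            "image": "<account>.dkr.ecr.us-east-1.amazonaws.com/backend:v42",
            "command": ["celery", "-A", "backend", "worker", "-c", "4", "-Q", "default"]
        }
    ]
}
```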

The worker gets more CPU and memory than the API because it’s doing the actual computation — running update_or_create loops, processing health data batches, executing AI tasks. The API just validates input and enqueues.

Beat gets minimal resources because it’s just a scheduler — it reads cron entries from the database and pushes task messages to Redis. It never executes task code.

The First Deploy Crash

The first time we deployed the Celery worker to ECS, the CloudFormation stack rolled back immediately. The CloudWatch logs showed:

decouple.UndefinedValueError: COMPASS_DB_USER not found.
Declare it as envvar or define a default value.

The traceback pointed at worker_pool initialization, but that was misleading. The actual failure was in Django settings import, which happens before Celery’s pool even initializes.

Here’s what happens when a Celery worker starts:

  1. Celery loads the Django settings module (our backend/settings/__init__.py)
  2. That file imports from .development import * (or production, based on env)
  3. development.py calls config('COMPASS_DB_USER') — a legacy database config
  4. python-decouple raises UndefinedValueError because that env var wasn’t in the worker’s ECS task definition

The API container had it. The worker container didn’t. Same image, different task definitions, different environment blocks.

The lesson: When your Celery worker uses the same Django settings as your API (which it should), the worker’s ECS environment must have every env var the API has — even for features the worker doesn’t use. Django settings import is all-or-nothing at module scope.

Our fix was straightforward: mirror the API’s environment block into the worker and beat task definitions. We also added a default value to the legacy config call so it wouldn’t crash on import: config('COMPASS_DB_USER', default='').
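To make the guard concrete, here is a minimal sketch of the pattern. The real project uses python-decouple's config(); the stand-in below reproduces just enough of its behavior (env var lookup, optional default, error on a missing var) to keep the example self-contained:

```python
import os

_UNSET = object()

def config(key, default=_UNSET):
    """Minimal python-decouple stand-in: env var lookup with optional default."""
    value = os.environ.get(key, default)
    if value is _UNSET:
        # python-decouple raises UndefinedValueError here
        raise RuntimeError(
            f"{key} not found. Declare it as envvar or define a default value."
        )
    return value

# Before: crashes at import time in any container missing the var,
# including a Celery worker that never touches the legacy database.
# COMPASS_DB_USER = config('COMPASS_DB_USER')

# After: safe default, so settings import succeeds everywhere.
COMPASS_DB_USER = config('COMPASS_DB_USER', default='')
```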

Redis Configuration Across Environments

We run three environments — DEV, UAT, PROD — each with its own ElastiCache Redis instance. The configuration lives in Django settings, keyed by environment:

REDIS_URL = config('REDIS_URL', default='redis://localhost:6379/0')

CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': REDIS_URL,
        'OPTIONS': {
            'CLIENT_CLASS': 'django_redis.client.DefaultClient',
            'SOCKET_CONNECT_TIMEOUT': 5,
            'SOCKET_TIMEOUT': 5,
        },
        'KEY_PREFIX': 'ourself',
    }
}

CELERY_BROKER_URL = REDIS_URL
CELERY_RESULT_BACKEND = REDIS_URL

One REDIS_URL env var drives everything — cache, Celery broker, and result backend. In production, this points to an ElastiCache cluster in the same VPC as the ECS tasks. In dev, it points to a single-node instance.

During an audit of the deploy script, we found a gap: the Redis URL was being set for DEV and UAT but not PROD. The PROD worker was silently falling back to localhost:6379, which doesn’t exist in a Fargate container. Tasks were being enqueued by the API (which had the correct Redis URL) but never consumed by the worker (which was looking at a nonexistent local Redis). The queue grew, HealthKit syncs never completed, and there were no errors — just silence.

The fix was a one-line addition to the deploy script. But the debugging took hours because the failure mode was “nothing happens” rather than “something crashes.”
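In hindsight, a few lines in the production settings would have turned hours of silence into an immediate crash at deploy time. A sketch, assuming an ENVIRONMENT variable that identifies the deploy target (adapt to however your settings detect prod):

```python
import os
from urllib.parse import urlparse

def assert_redis_not_localhost(redis_url: str, environment: str) -> None:
    """Fail fast at import time instead of silently enqueueing into the void."""
    host = urlparse(redis_url).hostname
    if environment in ("UAT", "PROD") and host in ("localhost", "127.0.0.1"):
        raise RuntimeError(
            f"REDIS_URL points at {host} in {environment}; "
            "the deploy script probably failed to set it."
        )

REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
assert_redis_not_localhost(REDIS_URL, os.environ.get("ENVIRONMENT", "DEV"))
```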

Health Checks for Headless Workers

ECS health checks are designed for HTTP services. You configure a path, ECS hits it, and if it gets a 200, the container is healthy. This works great for the API container (/health/).

Celery workers don’t serve HTTP. There’s no port to hit. ECS has no built-in way to know if your worker is actually processing tasks or if it’s hung.

Our options:

Option 1: celery inspect ping — runs a broadcast command to all workers and waits for a response. Works, but it requires the worker to be connected to Redis and responsive. If Redis is down, the ping fails even if the worker process is technically alive.

Option 2: A sidecar HTTP health check — run a tiny HTTP server alongside the worker that calls celery inspect ping internally and returns 200/503. This is the “proper” solution but adds complexity.
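For reference, the sidecar could be as small as this sketch: a stdlib HTTP server that shells out to the same inspect ping and maps the result to 200/503. The port and the injectable check hook are assumptions for illustration:

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

def worker_is_healthy(timeout: int = 10) -> bool:
    """True if `celery -A backend inspect ping` gets a pong from a worker."""
    result = subprocess.run(
        ["celery", "-A", "backend", "inspect", "ping", "--timeout", str(timeout)],
        capture_output=True,
    )
    return result.returncode == 0

class HealthHandler(BaseHTTPRequestHandler):
    # Injectable check so the handler can be exercised without Celery running.
    check = staticmethod(worker_is_healthy)

    def do_GET(self):
        status = 200 if self.check() else 503
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok" if status == 200 else b"unhealthy")

    def log_message(self, *args):  # keep container logs quiet
        pass

def serve(port: int = 8001) -> None:
    """Run the sidecar; point the ECS/ALB health check at this port."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```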

Option 3: ECS command health check — instead of an HTTP check, use a command-based health check in the task definition:

"healthCheck": {
    "command": ["CMD-SHELL", "celery -A backend inspect ping --timeout 10 || exit 1"],
    "interval": 30,
    "timeout": 15,
    "retries": 3
}

We went with Option 3. It’s not perfect — the inspect ping command adds a small amount of overhead every 30 seconds — but it catches the two failure modes we care about: worker process died, and worker lost connection to Redis.

The Deploy Script

Our local ECS deploy script (deploy-ecs.sh) handles all three services in sequence:

  1. Build and push Docker image to ECR
  2. Register new task definitions for API, worker, and beat (same image tag, different commands)
  3. Update each ECS service to use the new task definition
  4. Wait for API service stability (health check passes)
  5. Worker and beat don’t have ALB-based health checks, so we verify via CloudWatch logs

The script ensures that all three services always run the same image version. A mismatch — where the API is on v42 but the worker is still on v41 — would cause task serialization failures if the schema changed between versions.
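Condensed, the script looks something like this sketch (the repo URI, cluster name, and log group are placeholders, and the real script also substitutes the freshly pushed image tag into each task definition JSON before registering it):

```shell
#!/usr/bin/env bash
# Trimmed sketch of deploy-ecs.sh: one image, three services, same tag.
set -euo pipefail

TAG="${1:?usage: deploy-ecs.sh <image-tag>}"
REPO="<account>.dkr.ecr.us-east-1.amazonaws.com/backend"
CLUSTER="app-cluster"

# 1. Build and push one image for all three services.
aws ecr get-login-password | docker login --username AWS --password-stdin "${REPO%%/*}"
docker build -t "$REPO:$TAG" .
docker push "$REPO:$TAG"

# 2-3. Register a new task definition per service, then roll the service to it.
for SVC in api worker beat; do
  ARN=$(aws ecs register-task-definition \
          --cli-input-json "file://taskdefs/$SVC.json" \
          --query 'taskDefinition.taskDefinitionArn' --output text)
  aws ecs update-service --cluster "$CLUSTER" --service "$SVC" \
          --task-definition "$ARN" >/dev/null
done

# 4. Only the API has an ALB health check to wait on.
aws ecs wait services-stable --cluster "$CLUSTER" --services api

# 5. Worker and beat: verify startup by eye in the logs.
aws logs tail /ecs/worker --since 5m
```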

Celery Configuration for Production

The production Celery settings evolved through debugging:

CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_TASK_TRACK_STARTED = True
CELERY_TASK_TIME_LIMIT = 30 * 60        # 30 min hard limit
CELERY_TASK_SOFT_TIME_LIMIT = 28 * 60   # 28 min soft limit (raises SoftTimeLimitExceeded)
CELERY_WORKER_PREFETCH_MULTIPLIER = 1   # Don't hoard tasks
CELERY_TASK_ACKS_LATE = True            # ACK after completion, not on delivery
CELERY_BEAT_SCHEDULER = 'django_celery_beat.schedulers:DatabaseScheduler'

PREFETCH_MULTIPLIER = 1 — prevents one worker from grabbing 10 tasks at once. Critical when some tasks (HealthKit backfill) take 30 seconds and others (notification check) take 100ms. Without this, one worker grabs a batch and the heavy tasks block the light ones.

ACKS_LATE = True — the task stays in Redis until the worker confirms completion. If the worker crashes mid-task, the task reappears in the queue. Combined with update_or_create on hk_uuid, this makes the pipeline truly idempotent.
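The interplay is easy to see in miniature. In this sketch a dict stands in for the Postgres table and a plain function for the update_or_create call; replaying the "task" after a simulated mid-task crash leaves exactly one row:

```python
store = {}  # hk_uuid -> sample row, standing in for the Postgres table

def upsert_sample(hk_uuid, value):
    """Stand-in for HealthSample.objects.update_or_create(hk_uuid=...)."""
    created = hk_uuid not in store
    store[hk_uuid] = {"hk_uuid": hk_uuid, "value": value}
    return created

# Worker crashes after writing but before ACKing; with ACKS_LATE the
# message is still in Redis, so the task runs again on redelivery.
first = upsert_sample("ABC-123", 72)
replay = upsert_sample("ABC-123", 72)  # redelivery is a no-op update

assert first is True and replay is False
assert len(store) == 1  # no duplicate rows
```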

DatabaseScheduler — Beat reads its schedule from PostgreSQL instead of a local file. On Fargate, the filesystem is ephemeral — if the container restarts, a file-based schedule is gone. The database scheduler also lets us edit schedules through Django admin without redeploying.

The Monitoring Gap

This is the honest part: monitoring Celery on ECS is harder than it should be.

We can’t use Flower effectively because it needs persistent WebSocket connections to workers, which is awkward through ECS service discovery. We rely instead on CloudWatch metrics — custom metrics emitted by our Django app on task enqueue/completion, plus the standard ElastiCache metrics for Redis queue depth.
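For what it's worth, the custom metric itself is simple. This sketch builds the MetricData payload that boto3's cloudwatch.put_metric_data(Namespace=..., MetricData=[...]) accepts; the metric and dimension names here are illustrative, not the exact ones in our app:

```python
def build_metric_data(task_name: str, duration_s: float, success: bool) -> dict:
    """One CloudWatch datum per task completion, dimensioned by task and outcome."""
    return {
        "MetricName": "CeleryTaskDuration",
        "Dimensions": [
            {"Name": "TaskName", "Value": task_name},
            {"Name": "Outcome", "Value": "success" if success else "failure"},
        ],
        "Value": duration_s,
        "Unit": "Seconds",
    }

datum = build_metric_data("healthkit.backfill", 8.2, True)
```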

The getHealthItemTrackingData query pattern also surfaced a frontend issue during debugging: the Flutter app was firing 4 overlapping health data queries on every screen render, every few seconds. Not a backend bug, but the noise made it harder to distinguish “HealthKit sync is slow” from “the client is hammering the API with redundant reads.”

What I’d Do Differently

Start with Redis, not SQS. The plan doc originally specified SQS. We switched to Redis before implementation, which was the right call. But if I were starting fresh, I wouldn’t even consider SQS for a Django/Celery stack unless the team has dedicated DevOps and doesn’t need Flower or inspect.

Gate legacy settings behind feature flags. The COMPASS_DB_USER crash was avoidable. Any Django setting that’s not needed by all containers should have a default='' or be behind a config('FEATURE_ENABLED', default=False, cast=bool) guard.

Instrument task duration from day one. We added CloudWatch custom metrics for task execution time after the first production incident. Should have been there from the start. Knowing that HealthKit backfill tasks average 8 seconds but occasionally spike to 45 seconds would have informed the VISIBILITY_TIMEOUT and TIME_LIMIT settings earlier.
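A day-one version can be as small as a decorator. This sketch assumes an emit(name, seconds) callback that forwards to CloudWatch (or just a log line); the duration is recorded even when the task raises:

```python
import functools
import time

def timed(emit):
    """Wrap a task function and report its wall-clock duration via emit()."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                # Runs on success and on exception alike.
                emit(fn.__name__, time.monotonic() - start)
        return wrapper
    return decorator

durations = []

@timed(lambda name, secs: durations.append((name, secs)))
def backfill_samples():
    time.sleep(0.01)  # placeholder for real work

backfill_samples()
```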

The Full Series