HealthKit + Celery on ECS, Part 3: Deploying Workers to Fargate
Part 3 of the HealthKit + Celery series. Part 1 covered why SQS over Redis for the broker. Part 2 covered the chunk-based import pipeline. This post covers deploying it all to ECS Fargate.
One Image, Two Services
The Celery worker reuses the Django Docker image. No separate Dockerfile, no separate build pipeline. The ECS task definitions differ only in their command override:
| Service | Command | CPU | Memory | Replicas | Ports |
|---|---|---|---|---|---|
| api | gunicorn backend.wsgi... | 512 | 1024 | 2+ (auto-scale) | 8000 (ALB) |
| worker | celery -A backend worker -c 2 -Q apple-health-imports | 1024 | 2048 | 1 | none |
The worker gets more CPU and memory than the API because it’s doing the actual computation — running update_or_create loops, processing health data batches, computing fingerprint hashes. The API just validates input, creates the batch/chunk records, and fires .delay().
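In ECS terms, the override lives in the container definition. A trimmed sketch of what the worker's task definition might look like (family, container name, and image value are illustrative, not the project's actual values):

```json
{
  "family": "worker",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "worker",
      "image": "<same image the api service uses>",
      "command": ["celery", "-A", "backend", "worker", "-c", "2", "-Q", "apple-health-imports"]
    }
  ]
}
```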
SQS Queue Setup
The apple-health-imports queue is pre-provisioned in AWS, not auto-created by Celery. The task definition passes the queue URL via environment variable:
```python
_celery_sqs_queue_url = config('CELERY_APPLE_HEALTH_QUEUE_URL', default='')
if _celery_sqs_queue_url:
    CELERY_BROKER_TRANSPORT_OPTIONS['predefined_queues'] = {
        'apple-health-imports': {'url': _celery_sqs_queue_url},
    }
```
Using predefined queues with explicit URLs is important because SQS queue auto-creation requires sqs:CreateQueue permissions. In a least-privilege IAM setup, your ECS task role should only have sqs:SendMessage, sqs:ReceiveMessage, sqs:DeleteMessage, and sqs:GetQueueAttributes on the specific queue ARN.
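A least-privilege policy statement along those lines might look like this (account ID is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:us-west-2:<account-id>:apple-health-imports"
    }
  ]
}
```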
The SQS transport options are tuned for the HealthKit workload:
```python
CELERY_BROKER_TRANSPORT_OPTIONS = {
    'region': 'us-west-2',
    'visibility_timeout': 600,  # 10 min — enough for large chunks
    'polling_interval': 5,      # Long-poll interval
    'wait_time_seconds': 20,    # SQS long-polling max (most cost-efficient)
}
```
visibility_timeout: 600 — when a worker picks up a chunk task, SQS hides that message for 10 minutes. If the worker doesn’t confirm completion within that window, the message reappears for another worker (or the same one) to retry. The SQS queue attribute must also be set to >= 600 seconds, or the transport option is silently ignored.
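To avoid that silent-ignore pitfall, the queue attribute can be aligned with the transport option explicitly (assuming the queue URL is in the same environment variable the settings read):

```shell
aws sqs set-queue-attributes \
  --queue-url "$CELERY_APPLE_HEALTH_QUEUE_URL" \
  --attributes VisibilityTimeout=600
```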
wait_time_seconds: 20 — long-polling. Instead of checking SQS every second (expensive, generates empty receives), the worker blocks for up to 20 seconds waiting for a message. This is the SQS maximum and the most cost-efficient setting. When no health data is syncing, the worker sits idle and generates essentially zero SQS API calls.
IAM Authentication
The broker URL is just sqs:// with no credentials embedded:
```python
CELERY_BROKER_URL = config('CELERY_BROKER_URL', default='sqs://')
```
Celery’s SQS transport uses boto3 under the hood, which automatically picks up the ECS task role credentials. No access keys, no secrets in environment variables. The task role just needs the SQS permissions on the queue ARN.
The First Deploy Crash
The first time we deployed the Celery worker to ECS, the CloudFormation stack rolled back. CloudWatch logs showed:
```
decouple.UndefinedValueError: COMPASS_DB_USER not found.
```
The traceback pointed at worker pool initialization, but that was misleading. The actual failure was in Django settings import, which happens before Celery’s pool even starts.
When a Celery worker boots:
- Celery loads backend.celery, which sets DJANGO_SETTINGS_MODULE
- app.config_from_object triggers the Django settings import
- The settings switcher loads the environment-specific settings module
- Any config() call without a default raises if the env var is missing
The API container had COMPASS_DB_USER. The worker container didn’t. Same image, different task definitions, different environment blocks.
The lesson: the worker’s ECS environment must mirror the API’s — every env var, even for features the worker doesn’t use. Django settings import is all-or-nothing at module scope.
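The failure mode and the fix can be shown in a few lines. This sketch uses a stand-in for python-decouple's config() so it is self-contained; the real library behaves the same way for this case:

```python
import os

_UNSET = object()

def config(key, default=_UNSET):
    # Stand-in for python-decouple's config(): raises when the env var
    # is missing and no default was supplied.
    if key in os.environ:
        return os.environ[key]
    if default is _UNSET:
        raise LookupError(f"{key} not found.")
    return default

# Module-scope settings: this line would crash any container whose
# environment lacks COMPASS_DB_USER:
#   COMPASS_DB_USER = config('COMPASS_DB_USER')

# Gated behind a default, the worker boots even without the var:
COMPASS_DB_USER = config('COMPASS_DB_USER', default='')
```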
Worker Tuning for SQS
```python
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
CELERY_WORKER_CONCURRENCY = 2
CELERY_TASK_ACKS_LATE = True
CELERY_TASK_TIME_LIMIT = 3600       # 1 hour hard limit
CELERY_TASK_SOFT_TIME_LIMIT = 3300  # 55 min soft limit
```
PREFETCH_MULTIPLIER = 1 — this is critical with SQS. Unlike Redis or RabbitMQ, SQS starts the visibility timeout at receive time, not processing time. If the worker prefetches 4 messages but processes them sequentially, the last 3 messages burn through their visibility timeout while sitting in the worker’s memory. With prefetch=1, the worker only grabs one message at a time.
CONCURRENCY = 2 — two concurrent worker processes (Celery’s default prefork pool). This matches the Fargate task’s vCPU allocation. Higher concurrency would mean more database connections competing for the same CPU.
ACKS_LATE = True — the SQS message is only deleted after the task completes. Combined with reject_on_worker_lost=True on the tasks themselves, this means an OOM-killed or replaced ECS task doesn’t lose messages.
TIME_LIMIT = 3600 — one hour hard kill. The soft limit at 55 minutes raises SoftTimeLimitExceeded, giving the task a chance to clean up before the hard kill. HealthKit backfill chunks typically finish in seconds, but the generous limit accounts for database contention under load.
JSON Only, No Pickle
```python
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
```
Pickle serialization is a remote code execution vector. If an attacker can write to your SQS queue (compromised IAM credentials, misconfigured queue policy), pickle deserialization executes arbitrary Python. JSON-only prevents this class of attack entirely.
This means all task arguments must be JSON-serializable — no Django model instances, no datetime objects, no UUIDs as arguments. We pass import_batch_id as a string and re-query the database inside the task.
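The conversion happens at the call site. A minimal illustration of why raw objects fail and what gets passed instead (field names are illustrative):

```python
import json
import uuid
from datetime import datetime, timezone

batch_id = uuid.uuid4()
started_at = datetime.now(timezone.utc)

# Raw objects fail at enqueue time under the JSON serializer:
try:
    json.dumps({'import_batch_id': batch_id})
except TypeError:
    pass  # UUID is not JSON-serializable

# Convert before .delay(); the task re-queries the DB by ID.
payload = {
    'import_batch_id': str(batch_id),
    'started_at': started_at.isoformat(),
}
assert json.loads(json.dumps(payload)) == payload
```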
Health Checks for Headless Workers
ECS health checks are designed for HTTP services. Celery workers don’t serve HTTP.
We use a command-based health check in the task definition:
"healthCheck": {
"command": ["CMD-SHELL", "celery -A backend inspect ping --timeout 10 || exit 1"],
"interval": 30,
"timeout": 15,
"retries": 3
}
This catches two failure modes: worker process died, and worker lost connection to SQS. It’s not perfect — inspect ping adds minor overhead every 30 seconds — but it’s the simplest approach that actually works on Fargate.
Monitoring Without Flower
SQS doesn’t support Celery’s event protocol, so Flower doesn’t work. Our monitoring is CloudWatch-based:
SQS metrics: ApproximateNumberOfMessagesVisible (queue depth — if it’s growing, workers aren’t keeping up) and ApproximateNumberOfMessagesNotVisible (in-flight — if this is high relative to worker count, tasks are taking too long).
Application metrics: Custom CloudWatch EMF metrics emitted from the import service — batch duration, chunk processing time, records imported per batch, deduplication hit rate.
Database tracking: AppleHealthImportBatch records give us a complete audit trail — when the import started, how many chunks, how many succeeded, how many failed, what went wrong. This is richer than anything Celery’s result backend would provide.
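Emitting EMF metrics needs no PutMetricData calls: the worker prints a structured JSON line to stdout and the awslogs driver ships it to CloudWatch Logs, where ingestion extracts the metric. A sketch of the blob format (namespace, metric, and dimension names are illustrative):

```python
import json
import time

def emf_blob(namespace, metric_name, value, unit='Milliseconds', **dimensions):
    # CloudWatch Embedded Metric Format: the _aws member tells Logs
    # ingestion which top-level fields to extract as metrics.
    return {
        '_aws': {
            'Timestamp': int(time.time() * 1000),
            'CloudWatchMetrics': [{
                'Namespace': namespace,
                'Dimensions': [sorted(dimensions)],
                'Metrics': [{'Name': metric_name, 'Unit': unit}],
            }],
        },
        metric_name: value,
        **dimensions,
    }

# Emitting is just printing one JSON line per measurement:
print(json.dumps(emf_blob('HealthImport', 'ChunkDurationMs', 1234.0,
                          Service='worker')))
```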
What I’d Do Differently
Instrument task duration from day one. We added CloudWatch metrics for chunk processing time after the first production incident. Should have been there from the start.
Gate legacy settings behind defaults. The COMPASS_DB_USER crash was avoidable. Any Django setting that’s not needed by all containers should have default=''.
Consider Fargate Spot for workers. HealthKit import tasks are idempotent and retry-safe. Spot pricing for the worker task would cut compute costs significantly, with SQS handling the retry on interruption automatically.
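In ECS terms, that switch is a capacityProviderStrategy on the worker service. A sketch (weights are illustrative):

```json
{
  "capacityProviderStrategy": [
    { "capacityProvider": "FARGATE_SPOT", "weight": 1 }
  ]
}
```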
The Full Series
- Part 1: Why Background Processing Changes Everything
- Part 2: The Sync Pipeline
- Part 3: Deploying Workers to Fargate (this post)