HealthKit + Celery on ECS, Part 3: Deploying Workers to Fargate
Part 3 of the HealthKit + Celery series. Part 1 covered why SQS over Redis for the broker. Part 2 covered the chunk-based import pipeline. This post covers deploying it all to ECS Fargate.
One Image, Two Services
The Celery worker reuses the Django Docker image. No separate Dockerfile, no separate build pipeline. The ECS task definitions differ only in their command override:
| Service | Command | CPU | Memory | Replicas | Ports |
|---|---|---|---|---|---|
| api | gunicorn backend.wsgi... | 512 | 1024 | 2+ (auto-scale) | 8000 (ALB) |
| worker | celery -A backend worker -c 2 -Q apple-health-imports | 1024 | 2048 | 1 | none |
The worker gets more CPU and memory than the API because it’s doing the actual computation — running update_or_create loops, processing health data batches, computing fingerprint hashes. The API just validates input, creates the batch/chunk records, and fires .delay().
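In ECS terms, the override lives in the container definition. A trimmed sketch of what the worker's task definition might look like (family, container name, and image value are illustrative, not the project's actual values):

```json
{
  "family": "worker",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "worker",
      "image": "<same image the api service uses>",
      "command": ["celery", "-A", "backend", "worker", "-c", "2", "-Q", "apple-health-imports"]
    }
  ]
}
```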
SQS Queue Setup
The apple-health-imports queue is pre-provisioned in AWS, not auto-created by Celery. The task definition passes the queue URL via environment variable:
```python
_celery_sqs_queue_url = config('CELERY_APPLE_HEALTH_QUEUE_URL', default='')
if _celery_sqs_queue_url:
    CELERY_BROKER_TRANSPORT_OPTIONS['predefined_queues'] = {
        'apple-health-imports': {'url': _celery_sqs_queue_url},
    }
```
Using predefined queues with explicit URLs is important because SQS queue auto-creation requires sqs:CreateQueue permissions. In a least-privilege IAM setup, your ECS task role should only have sqs:SendMessage, sqs:ReceiveMessage, sqs:DeleteMessage, and sqs:GetQueueAttributes on the specific queue ARN.
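A least-privilege policy statement along those lines might look like this (account ID is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:us-west-2:<account-id>:apple-health-imports"
    }
  ]
}
```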
The SQS transport options are tuned for the HealthKit workload:
```python
CELERY_BROKER_TRANSPORT_OPTIONS = {
    'region': 'us-west-2',
    'visibility_timeout': 600,  # 10 min — enough for large chunks
    'polling_interval': 5,      # Long-poll interval
    'wait_time_seconds': 20,    # SQS long-polling max (most cost-efficient)
}
```
visibility_timeout: 600 — when a worker picks up a chunk task, SQS hides that message for 10 minutes. If the worker doesn’t confirm completion within that window, the message reappears for another worker (or the same one) to retry. The SQS queue attribute must also be set to >= 600 seconds, or the transport option is silently ignored.
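To avoid that silent-ignore pitfall, the queue attribute can be aligned with the transport option explicitly (assuming the queue URL is in the same environment variable the settings read):

```shell
aws sqs set-queue-attributes \
  --queue-url "$CELERY_APPLE_HEALTH_QUEUE_URL" \
  --attributes VisibilityTimeout=600
```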
wait_time_seconds: 20 — long-polling. Instead of checking SQS every second (expensive, generates empty receives), the worker blocks for up to 20 seconds waiting for a message. This is the SQS maximum and the most cost-efficient setting. When no health data is syncing, the worker sits idle and generates essentially zero SQS API calls.
IAM Authentication
The broker URL is just sqs:// with no credentials embedded:
```python
CELERY_BROKER_URL = config('CELERY_BROKER_URL', default='sqs://')
```
Celery’s SQS transport uses boto3 under the hood, which automatically picks up the ECS task role credentials. No access keys, no secrets in environment variables. The task role just needs the SQS permissions on the queue ARN.
The First Deploy Crash
The first time we deployed the Celery worker to ECS, the CloudFormation stack rolled back. CloudWatch logs showed:
```
decouple.UndefinedValueError: COMPASS_DB_USER not found.
```
The traceback pointed at worker pool initialization, but that was misleading. The actual failure was in Django settings import, which happens before Celery’s pool even starts.
When a Celery worker boots:
- Celery loads backend.celery, which sets DJANGO_SETTINGS_MODULE
- app.config_from_object triggers the Django settings import
- The settings switcher loads the environment-specific settings module
- Any config() call without a default raises if the env var is missing
The API container had COMPASS_DB_USER. The worker container didn’t. Same image, different task definitions, different environment blocks.
The lesson: the worker’s ECS environment must mirror the API’s — every env var, even for features the worker doesn’t use. Django settings import is all-or-nothing at module scope.
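The failure mode and the fix can be shown in a few lines. This sketch uses a stand-in for python-decouple's config() so it is self-contained; the real library behaves the same way for this case:

```python
import os

_UNSET = object()

def config(key, default=_UNSET):
    # Stand-in for python-decouple's config(): raises when the env var
    # is missing and no default was supplied.
    if key in os.environ:
        return os.environ[key]
    if default is _UNSET:
        raise LookupError(f"{key} not found.")
    return default

# Module-scope settings: this line would crash any container whose
# environment lacks COMPASS_DB_USER:
#   COMPASS_DB_USER = config('COMPASS_DB_USER')

# Gated behind a default, the worker boots even without the var:
COMPASS_DB_USER = config('COMPASS_DB_USER', default='')
```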
Worker Tuning for SQS
```python
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
CELERY_WORKER_CONCURRENCY = 2
CELERY_TASK_ACKS_LATE = True
CELERY_TASK_TIME_LIMIT = 3600       # 1 hour hard limit
CELERY_TASK_SOFT_TIME_LIMIT = 3300  # 55 min soft limit
```
PREFETCH_MULTIPLIER = 1 — this is critical with SQS. Unlike Redis or RabbitMQ, SQS starts the visibility timeout at receive time, not processing time. If the worker prefetches 4 messages but processes them sequentially, the last 3 messages burn through their visibility timeout while sitting in the worker’s memory. With prefetch=1, the worker only grabs one message at a time.
CONCURRENCY = 2 — two concurrent worker processes (Celery’s default prefork pool). This matches the Fargate task’s vCPU allocation. Higher concurrency would mean more database connections competing for the same CPU.
ACKS_LATE = True — the SQS message is only deleted after the task completes. Combined with reject_on_worker_lost=True on the tasks themselves, this means an OOM-killed or replaced ECS task doesn’t lose messages.
TIME_LIMIT = 3600 — one hour hard kill. The soft limit at 55 minutes raises SoftTimeLimitExceeded, giving the task a chance to clean up before the hard kill. HealthKit backfill chunks typically finish in seconds, but the generous limit accounts for database contention under load.
JSON Only, No Pickle
```python
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
```
Pickle serialization is a remote code execution vector. If an attacker can write to your SQS queue (compromised IAM credentials, misconfigured queue policy), pickle deserialization executes arbitrary Python. JSON-only prevents this class of attack entirely.
This means all task arguments must be JSON-serializable — no Django model instances, no datetime objects, no UUIDs as arguments. We pass import_batch_id as a string and re-query the database inside the task.
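The conversion happens at the call site. A minimal illustration of why raw objects fail and what gets passed instead (field names are illustrative):

```python
import json
import uuid
from datetime import datetime, timezone

batch_id = uuid.uuid4()
started_at = datetime.now(timezone.utc)

# Raw objects fail at enqueue time under the JSON serializer:
try:
    json.dumps({'import_batch_id': batch_id})
except TypeError:
    pass  # UUID is not JSON-serializable

# Convert before .delay(); the task re-queries the DB by ID.
payload = {
    'import_batch_id': str(batch_id),
    'started_at': started_at.isoformat(),
}
assert json.loads(json.dumps(payload)) == payload
```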
Health Checks for Headless Workers
ECS health checks are designed for HTTP services. Celery workers don’t serve HTTP.
We use a command-based health check in the task definition:
"healthCheck": {
"command": ["CMD-SHELL", "celery -A backend inspect ping --timeout 10 || exit 1"],
"interval": 30,
"timeout": 15,
"retries": 3
}
This catches two failure modes: worker process died, and worker lost connection to SQS. It’s not perfect — inspect ping adds minor overhead every 30 seconds — but it’s the simplest approach that actually works on Fargate.
Monitoring Without Flower
SQS doesn’t support Celery’s event protocol, so Flower doesn’t work. Our monitoring is CloudWatch-based:
SQS metrics: ApproximateNumberOfMessagesVisible (queue depth — if it’s growing, workers aren’t keeping up) and ApproximateNumberOfMessagesNotVisible (in-flight — if this is high relative to worker count, tasks are taking too long).
Application metrics: Custom CloudWatch EMF metrics emitted from the import service — batch duration, chunk processing time, records imported per batch, deduplication hit rate.
Database tracking: AppleHealthImportBatch records give us a complete audit trail — when the import started, how many chunks, how many succeeded, how many failed, what went wrong. This is richer than anything Celery’s result backend would provide.
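Emitting EMF metrics needs no PutMetricData calls: the worker prints a structured JSON line to stdout and the awslogs driver ships it to CloudWatch Logs, where ingestion extracts the metric. A sketch of the blob format (namespace, metric, and dimension names are illustrative):

```python
import json
import time

def emf_blob(namespace, metric_name, value, unit='Milliseconds', **dimensions):
    # CloudWatch Embedded Metric Format: the _aws member tells Logs
    # ingestion which top-level fields to extract as metrics.
    return {
        '_aws': {
            'Timestamp': int(time.time() * 1000),
            'CloudWatchMetrics': [{
                'Namespace': namespace,
                'Dimensions': [sorted(dimensions)],
                'Metrics': [{'Name': metric_name, 'Unit': unit}],
            }],
        },
        metric_name: value,
        **dimensions,
    }

# Emitting is just printing one JSON line per measurement:
print(json.dumps(emf_blob('HealthImport', 'ChunkDurationMs', 1234.0,
                          Service='worker')))
```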
What I’d Do Differently
Instrument task duration from day one. We added CloudWatch metrics for chunk processing time after the first production incident. Should have been there from the start.
Gate legacy settings behind defaults. The COMPASS_DB_USER crash was avoidable. Any Django setting that’s not needed by all containers should have default=''.
Consider Fargate Spot for workers. HealthKit import tasks are idempotent and retry-safe. Spot pricing for the worker task would cut compute costs significantly, with SQS handling the retry on interruption automatically.
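In ECS terms, that switch is a capacityProviderStrategy on the worker service. A sketch (weights are illustrative):

```json
{
  "capacityProviderStrategy": [
    { "capacityProvider": "FARGATE_SPOT", "weight": 1 }
  ]
}
```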
The Full Series
- Part 1: Why Background Processing Changes Everything
- Part 2: The Sync Pipeline
- Part 3: Deploying Workers to Fargate (this post)