
When we started integrating Apple HealthKit into Ourself Health, the initial approach was simple: Flutter reads HealthKit, POSTs JSON to a Django GraphQL mutation, Django writes to PostgreSQL. Done.

It lasted about a week.

The Problem with Inline Processing

HealthKit data is on-device only. There’s no server-side API — every piece of health data has to travel from the user’s iPhone to your backend via HTTP. That creates three problems that compound:

Volume. A user with an Apple Watch generates heart rate samples every few minutes, step counts continuously, sleep analysis nightly. Connect HealthKit for the first time and request a 30-day backfill? That’s potentially thousands of samples per data type. Multiply by user count.

Latency. Our GraphQL mutations were processing HealthKit batches synchronously. A 30-day backfill of heart rate data would hold the HTTP connection open for 15-30 seconds while Django ran update_or_create in a loop. The Flutter client would time out or the user would navigate away.

Reliability. If Django crashed mid-batch — OOM, database connection timeout, ECS task getting replaced — the entire sync was lost. The user’s last_anchor_token hadn’t been updated, so the next sync would re-send everything, but partial writes meant some records existed and some didn’t.

The answer was obvious: stop processing inline, start processing in the background.

Why Celery, and Why Not SQS

The conventional wisdom for Django + AWS is Celery with SQS as the broker. SQS is managed, scales infinitely, costs nothing at rest, and integrates natively with AWS. We actually planned this — there’s a 06n-Plan-Redis-Celery.md in our Obsidian vault with the full SQS architecture.

We chose Redis (ElastiCache) instead. Here’s why:

No Flower, no celery inspect. SQS doesn’t support the event protocol that Celery’s monitoring tools depend on. You can’t use Flower to watch task progress, and you can’t use celery inspect to debug a stuck worker. For a team of one (me), losing observability was a dealbreaker.

15-minute delay ceiling. SQS caps message delay at 15 minutes. We had use cases — medication reminders, scheduled check-ins — where we needed to schedule tasks hours in advance. Celery Beat with Redis handles this natively.

We already needed Redis for caching. Our Django backend was using LocMemCache, which is per-process. With ECS running multiple tasks, verification codes stored in cache were invisible across containers. We needed ElastiCache Redis anyway. Using it as the Celery broker meant one fewer infrastructure dependency.

The tradeoff we accepted: Redis isn’t as durable as SQS. If the Redis node dies, queued tasks are lost. We mitigated this with acks_late=True (tasks stay in queue until completion) and idempotent task design (re-processing is safe).
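In Django-namespaced Celery settings, that mitigation is a few lines. The broker URL below is a placeholder, not our actual ElastiCache endpoint:

```python
# settings.py -- durability knobs for the Redis broker (placeholder URL).
CELERY_BROKER_URL = "redis://my-elasticache-host:6379/0"

# Acknowledge only after the task finishes, so a crashed worker
# leaves the message on the queue instead of losing it.
CELERY_TASK_ACKS_LATE = True

# Re-queue (rather than discard) tasks whose worker process died mid-run.
CELERY_TASK_REJECT_ON_WORKER_LOST = True
```

acks_late only helps if re-running a task is harmless, which is why the idempotent task design is the other half of the mitigation.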

The Architecture That Emerged

Three ECS services, one Docker image:

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│  ECS: api       │     │  ElastiCache │     │  ECS: worker    │
│  gunicorn       │────▶│  Redis       │────▶│  celery worker  │
│  desiredCount:2 │     │              │     │  desiredCount:2 │
└─────────────────┘     └──────────────┘     └─────────────────┘
                              │
                              ▼
                        ┌─────────────────┐
                        │  ECS: beat      │
                        │  celery beat    │
                        │  desiredCount:1 │
                        └─────────────────┘

The API container calls .delay() to enqueue tasks. The worker container pulls from Redis and executes them. The beat container schedules periodic tasks. All three use the identical Docker image — only the CMD in the ECS task definition differs.
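The three CMDs might look like this — the `config` module name is an assumption; substitute your project's Celery app module:

```shell
# ECS: api -- serve HTTP
gunicorn config.wsgi:application --bind 0.0.0.0:8000

# ECS: worker -- consume tasks from Redis
celery -A config worker --loglevel=info

# ECS: beat -- enqueue scheduled tasks on a timer
celery -A config beat --loglevel=info
```

One image, three task definitions, three CMDs. Deploying a code change means building once and rolling all three services.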

The critical constraint: Beat must be exactly 1 replica. If you scale Beat to 2, every scheduled task runs twice. We learned this the hard way during UAT when medication reminders went out in duplicate.
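Beat's schedule itself lives in settings. The task path and timing here are hypothetical, for illustration only:

```python
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    # Illustrative task path and timing, not the actual reminder job.
    "send-medication-reminders": {
        "task": "notifications.tasks.send_medication_reminders",
        "schedule": crontab(minute=0, hour="8,20"),  # 8am and 8pm
    },
}
```

Because Beat is the only process reading this schedule, keeping it to a single replica is the entire deduplication story — Celery has no built-in leader election, so two Beat instances means every entry fires twice.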

What Celery Actually Replaced

Before Celery, we had three separate problems, each solved badly:

APScheduler for notifications. Django’s apps.py spun up an in-process APScheduler on startup. When ECS scaled to 2 tasks, both instances ran the scheduler. Every notification job executed twice. The plan doc (06n-Plan-Redis-Celery.md) lists this as the primary driver.

Synchronous HealthKit processing. GraphQL mutations processed health data inline, blocking the request thread for seconds to minutes.

LocMemCache for verification codes. Each ECS task had its own isolated memory cache. A verification code generated on Task 1 was invisible to Task 2. Users would get “invalid code” errors seemingly at random.

Celery + Redis solved all three: background task processing, Beat-scheduled notifications, and shared caching via django-redis.
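The cache side is a standard django-redis backend pointed at the same ElastiCache cluster — placeholder endpoint here, on a different database number than the broker:

```python
# settings.py -- shared cache visible to every ECS task (placeholder URL).
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://my-elasticache-host:6379/1",
        "OPTIONS": {"CLIENT_CLASS": "django_redis.client.DefaultClient"},
    }
}
```

With this in place, a verification code written by one container is readable from any other, and the "invalid code" roulette stops.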

The Celery Worker Crash That Taught Us Everything

The first deploy of the Celery worker to ECS crashed immediately. The CloudFormation stack rolled back. The error:

UndefinedValueError: COMPASS_DB_USER not found.

The worker container was loading settings/development.py, which calls config('COMPASS_DB_USER') with no default. That environment variable wasn’t in the ECS task definition for the worker container — only the API container had it.

The insight: the Celery worker imports your entire Django settings module at startup, not just the parts it needs. python-decouple’s config() with no default raises at import time, before Celery even gets to pool initialization. The traceback looked like a worker_pool error but the root cause was a missing env var.

The fix was to copy every env var from the API container's task definition into the worker's. Same image, same settings, same environment — only the CMD differs.

Next: The HealthKit Sync Pipeline

In Part 2, I’ll cover the actual HealthKit sync implementation: the 30-day backfill system, anchor-based incremental sync, chunked uploads, and the deduplication strategy that makes the whole pipeline idempotent.