Streaming LLM Architecture Patterns: Sources, Done Events, And Observability
Streaming an LLM answer is easy in a prototype: yield tokens as they arrive.
Streaming an LLM answer in production is harder because the first token is not the whole product. The client also needs the final answer, source metadata, usage details, route information, timing data, retry state, and a reliable terminal event.
The most useful pattern I found was to treat streaming as two contracts:
- A progressive text contract for what the user sees while the answer is being generated.
- A terminal metadata contract for what the application should persist, reconcile, evaluate, and observe.
The terminal event matters more than it sounds.
Problem Context
The assistant streamed answers to a mobile app. The backend was a Django service using synchronous ORM calls and model/provider SDKs, while the HTTP endpoint was an async Server-Sent Events surface.
That created a boundary problem:
- The client wants low-latency chunks.
- The backend has synchronous work that should not block the event loop.
- The model may call tools and retrieval before or during generation.
- Post-processing may change the final display text.
- Sources may be discovered outside the visible answer text.
- Partial failures must still close the stream cleanly.
If the backend simply forwarded provider tokens, the app would see text but lose the final truth about the turn.
Architecture
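Here is the public-safe shape: an async HTTP view yields SSE events to the client as they become available, a background worker runs the synchronous application code (ORM calls, retrieval, provider SDKs), and a queue carries events from the worker to the view.
The queue bridge was the key implementation pattern. The view never waits for the whole answer, and the synchronous code never blocks the event loop.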
Event Types
The public event contract can stay small:
- ack: The request was accepted and the backend has enough context to begin.
- chunk: A piece of display text.
- done: The terminal event. Contains final answer metadata and closes the turn.
- error: Optional transport-level or pre-token failure event.
Most application logic should hang off done, not the last chunk.
That gives the backend room to do final cleanup. For example, the streamed text
may include internal citation markers, repeated whitespace, or provider-specific
artifacts that should not be persisted. The done event can carry the canonical
answer after cleanup.
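As a rough illustration, the four event types could be serialized as SSE frames like this. The `format_sse` helper, the `text` key on chunks, and the choice to send both an `event:` line and a JSON `data:` line are assumptions for the sketch, not the original service's wire format.

```python
import json

# The public event contract described above.
EVENT_TYPES = {"ack", "chunk", "done", "error"}


def format_sse(event: dict) -> str:
    """Serialize one event dict as a Server-Sent Events frame (hypothetical helper)."""
    event_type = event.get("type")
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    # json.dumps escapes newlines, so the payload always fits on one data: line.
    payload = json.dumps(event, separators=(",", ":"))
    # A blank line terminates the frame.
    return f"event: {event_type}\ndata: {payload}\n\n"


frame = format_sse({"type": "chunk", "text": "Hello"})
```

Rejecting unknown types at the serialization boundary keeps internal signals (such as a queue sentinel) from leaking into the public contract.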
Why done Is A Product Contract
The client needs one event that says:
{
  "type": "done",
  "metadata": {
    "answer": "final cleaned answer",
    "sources": ["https://example.org/reference"],
    "sourceDetails": [],
    "route": "clinical-rag",
    "modelTier": "smart",
    "toolsWithResults": ["knowledge_base"],
    "completionStatus": "success",
    "partial": false,
    "stageTimings": {
      "precheckMs": 12,
      "routeMs": 4,
      "retrievalMs": 180,
      "firstTokenMs": 920,
      "totalMs": 3100
    }
  }
}
This is not just metadata for nerds. It protects the product.
The mobile app can persist the cleaned answer instead of whatever happened to stream token-by-token. The UI can show source cards without scraping URLs out of text. Analytics can compare first-token latency to total latency. Evals can differentiate “the model used a tool” from “the tool returned useful evidence.”
The terminal event is where streaming becomes auditable.
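A minimal sketch of the consumer side, written in Python for consistency even though the real client is a mobile app. `TurnState`, `apply_event`, and the `text` key on chunk events are hypothetical names. The point it illustrates is that streamed text is only a preview, and the `done` metadata is what gets persisted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnState:
    """Client-side view of one streamed turn (hypothetical shape)."""
    display_text: str = ""        # what the user sees while streaming
    final: Optional[dict] = None  # metadata from the done event
    closed: bool = False


def apply_event(state: TurnState, event: dict) -> TurnState:
    kind = event["type"]
    if kind == "chunk":
        state.display_text += event["text"]
    elif kind == "done":
        state.final = event["metadata"]
        # Replace the streamed preview with the canonical answer.
        state.display_text = state.final["answer"]
        state.closed = True
    elif kind == "error":
        # Assumption: an error event also ends the turn, with no final metadata.
        state.closed = True
    return state


state = TurnState()
for event in [
    {"type": "ack"},
    {"type": "chunk", "text": "This can happen  during perimenopause [1]."},
    {"type": "done", "metadata": {
        "answer": "This can happen during perimenopause.",
        "sources": ["https://example.org/reference"],
    }},
]:
    apply_event(state, event)
```

After the loop, `state.display_text` holds the cleaned answer from `done`, and the source list is available without parsing the text.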
Sources Should Not Depend On Visible Text
Source handling was one of the biggest contract lessons.
In a simple RAG demo, citations live directly in the answer:
This symptom can happen during perimenopause [1].
In a production streaming UI, visible citations can be fragile:
- The model may place them near the wrong claim.
- The output pipeline may strip or reformat them.
- The mobile UI may want source cards, not inline footnotes.
- Production and eval surfaces may intentionally see different text.
The better contract is to collect sources from multiple places, de-duplicate them, and emit them separately:
source candidates:
  retrieval metadata
  tool outputs
  structured source payloads
  final answer text fallback
final source payload:
  ordered, de-duplicated URLs + optional source details
This lets the visible answer stay readable while preserving evidence for the client and for evals.
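The merge step might look like the sketch below. `collect_sources` is a hypothetical helper, and its normalization (trim whitespace, ignore a trailing slash) is an assumption. A real system would likely need stricter URL canonicalization.

```python
from typing import Iterable


def collect_sources(*candidate_groups: Iterable[str]) -> list[str]:
    """Merge source URLs in priority order, keeping the first occurrence of each."""
    seen: set[str] = set()
    ordered: list[str] = []
    for group in candidate_groups:
        for url in group:
            cleaned = url.strip()
            key = cleaned.rstrip("/")  # crude normalization (assumption)
            if key and key not in seen:
                seen.add(key)
                ordered.append(cleaned)
    return ordered


sources = collect_sources(
    ["https://example.org/reference"],                                # retrieval metadata
    ["https://example.org/reference/", "https://example.org/guide"],  # tool outputs
    [],                                                               # structured source payloads
    ["https://example.org/guide"],                                    # final answer text fallback
)
```

Passing the groups in priority order means a URL found in retrieval metadata keeps its position even if the same URL appears again in the answer text.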
Partial Output Is A First-Class State
Streaming failures are different before and after visible output.
Before the first token, the backend can often retry, swap to a cheaper model, or return a safe fallback. After the user has seen tokens, retrying can produce duplicated or contradictory text.
The useful rule was:
No visible output yet:
  retry or degrade if safe
Visible output already emitted:
  preserve the partial answer
  stop retrying the provider loop
  emit a terminal done event with partial=true
This makes failure honest. The user keeps what they saw. The app receives a closed stream. Observability records that the answer was partial instead of pretending it was a normal success.
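One way to express that rule in code, as a sketch under stated assumptions: `stream_turn`, `start_stream`, and `max_attempts` are hypothetical names, and `"error"` as a `completionStatus` value is assumed, since only `"success"` appears in the payload above.

```python
from typing import Callable, Iterable, Iterator


def _done(chunks: list[str], status: str, partial: bool) -> dict:
    # Field names mirror the done payload; non-success status values are assumed.
    return {
        "type": "done",
        "metadata": {
            "answer": "".join(chunks),
            "completionStatus": status,
            "partial": partial,
        },
    }


def stream_turn(
    start_stream: Callable[[], Iterable[str]],
    max_attempts: int = 2,
) -> Iterator[dict]:
    emitted: list[str] = []
    for _ in range(max_attempts):
        try:
            for text in start_stream():
                emitted.append(text)
                yield {"type": "chunk", "text": text}
        except Exception:
            if emitted:
                # Visible output exists: keep it, stop retrying, close honestly.
                yield _done(emitted, "error", partial=True)
                return
            continue  # nothing visible yet, so a retry is safe
        yield _done(emitted, "success", partial=False)
        return
    # Every attempt failed before the first token.
    yield {"type": "error", "message": "stream failed"}
```

Whatever path is taken, the generator always ends with exactly one terminal event, so the client never waits on an open stream.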
Stage Timings Beat One Latency Number
“The request took 5 seconds” is not enough to debug a streaming system.
Useful timings were more granular:
authMs
precheckMs
conversationContextMs
routeMs
retrievalMs
agentSetupMs
firstTokenMs
streamMs
postProcessMs
finalizeMs
totalMs
First-token latency and total latency tell different stories. If the first token is slow, the problem may be auth, routing, retrieval, or provider setup. If the first token is fast but total latency is slow, the model or tool loop may be dragging. If finalization is slow, source extraction or persistence may be the issue.
Stage timings make streaming debuggable without reading raw logs for every turn.
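A small timer object is enough to collect these values. The sketch below is one possible shape: `StageTimer`, `stage`, and `mark` are hypothetical names, and the injectable clock exists only to make the example deterministic.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects per-stage durations in milliseconds for the done event."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self.timings: dict[str, int] = {}

    @contextmanager
    def stage(self, name: str):
        """Time one stage; the duration is recorded even if the stage raises."""
        started = self._clock()
        try:
            yield
        finally:
            self.timings[f"{name}Ms"] = round((self._clock() - started) * 1000)

    def mark(self, name: str) -> None:
        """Record elapsed time since the request started (e.g. firstToken, total)."""
        self.timings[f"{name}Ms"] = round((self._clock() - self._start) * 1000)


# Deterministic demo with a fake clock (readings in seconds).
ticks = iter([0.0, 0.010, 0.014, 0.920, 3.1])
timer = StageTimer(clock=lambda: next(ticks))
with timer.stage("route"):
    pass  # route selection would run here
timer.mark("firstToken")
timer.mark("total")
```

`stage` measures a bounded step, while `mark` measures from the start of the request, which matches how `firstTokenMs` and `totalMs` are usually read.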
Sanitized Queue Bridge Pattern
The implementation details vary by framework, but the pattern is portable:
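from queue import Queue
from threading import Thread

# run_conversation, consume_until_closed, and format_sse are application helpers (not shown).
def stream_endpoint(request):
    events = Queue()

    def worker() -> None:
        # Runs synchronous application code off the HTTP-serving thread.
        try:
            for event in run_conversation(request):
                events.put(event)
        except Exception:
            # Report a generic failure; do not leak exception details to the client.
            events.put({"type": "error", "message": "stream failed"})
        finally:
            # Internal sentinel so the consumer knows when to stop reading.
            events.put({"type": "closed"})

    Thread(target=worker, daemon=True).start()
    for event in consume_until_closed(events):
        yield format_sse(event)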
The production version included auth, usage accounting, trace context, and more careful exception handling. The architectural point is the same: separate the HTTP streaming surface from the synchronous application/runtime work.
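For an async view, the same bridge can be sketched with `asyncio`. This is an illustration, not the production code: `bridge` and `run_sync` are hypothetical names. Because `asyncio.Queue` is not thread-safe, the worker thread hands each item to the event loop with `loop.call_soon_threadsafe`.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterable

_CLOSED = object()  # internal sentinel, never sent to the client


async def bridge(run_sync: Callable[[], Iterable[dict]]) -> AsyncIterator[dict]:
    """Yield events produced by synchronous code without blocking the event loop."""
    loop = asyncio.get_running_loop()
    events: asyncio.Queue = asyncio.Queue()

    def put(item) -> None:
        # Schedule the put on the loop thread; callbacks run in FIFO order.
        loop.call_soon_threadsafe(events.put_nowait, item)

    def worker() -> None:
        try:
            for event in run_sync():
                put(event)
        except Exception:
            put({"type": "error", "message": "stream failed"})
        finally:
            put(_CLOSED)

    threading.Thread(target=worker, daemon=True).start()
    while (item := await events.get()) is not _CLOSED:
        yield item


def fake_conversation():
    # Stand-in for the synchronous application code.
    yield {"type": "ack"}
    yield {"type": "chunk", "text": "Hi"}
    yield {"type": "done", "metadata": {"answer": "Hi"}}


async def collect(run_sync):
    return [event async for event in bridge(run_sync)]


received = asyncio.run(collect(fake_conversation))
```

The view awaits the queue rather than the worker, so a slow ORM call or provider SDK call delays only that turn's next event and not other requests on the loop.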
What Failed Or Changed
Several early assumptions changed:
- Raw provider tokens were not enough. The app needed a terminal event with the canonical answer and metadata.
- The last streamed chunk was not a reliable persistence source. Final cleanup could change the answer.
- “Tools used” was too weak for observability. The better signal was “tools with usable results.”
- Retrying after partial output created worse UX than emitting an honest partial terminal state.
- Sources embedded in text were too brittle. Structured source metadata became the contract.
- A single duration metric hid the real bottleneck. Stage timings made routing, retrieval, first token, and finalization separately visible.
Operational Lessons
The biggest lesson is that streaming is not just a transport optimization. It is a state machine.
Every streamed turn needs a beginning, middle, and end:
beginning:
  accepted, authenticated, route selected
middle:
  chunks, tools, retrieval, model generation
end:
  final answer, sources, state, timings, audit metadata
If you only design the middle, the app will feel fast but become hard to debug. If you design the terminal event well, streaming becomes compatible with observability, evals, source rendering, retries, and persistence.
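The state machine can be made explicit with a small transition table, which is handy in tests that replay recorded streams. This is a sketch: the state names, `final_state`, and the assumption that `error` may arrive either before `ack` or mid-stream are illustrative, not taken from the production service.

```python
# Allowed transitions: state -> {event type -> next state}.
TRANSITIONS = {
    "start": {"ack": "streaming", "error": "closed"},
    "streaming": {"chunk": "streaming", "done": "closed", "error": "closed"},
    "closed": {},  # nothing may follow a terminal event
}


def final_state(event_types: list[str]) -> str:
    """Replay a sequence of event types and return the state it ends in."""
    state = "start"
    for event_type in event_types:
        allowed = TRANSITIONS[state]
        if event_type not in allowed:
            raise ValueError(f"unexpected {event_type!r} event in state {state!r}")
        state = allowed[event_type]
    return state


complete = final_state(["ack", "chunk", "chunk", "done"])
unterminated = final_state(["ack", "chunk"])
```

A turn that ends in any state other than `closed` is a stream that was never properly terminated, which is exactly the case the `done` contract is meant to rule out.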
For production LLM systems, the final done event is not bookkeeping. It is the
contract that makes the stream trustworthy.