
Streaming an LLM answer is easy in a prototype. Yield tokens as they arrive.

Streaming an LLM answer in production is harder because the first token is not the whole product. The client also needs the final answer, source metadata, usage details, route information, timing data, retry state, and a reliable terminal event.

The most useful pattern I found was to treat streaming as two contracts:

  1. A progressive text contract for what the user sees while the answer is being generated.
  2. A terminal metadata contract for what the application should persist, reconcile, evaluate, and observe.

The terminal event matters more than it sounds.

Problem Context

The assistant streamed answers to a mobile app. The backend was a Django service using synchronous ORM calls and model/provider SDKs, while the HTTP endpoint was an async Server-Sent Events surface.

That created a boundary problem:

  • The client wants low-latency chunks.
  • The backend has synchronous work that should not block the event loop.
  • The model may call tools and run retrieval before or during generation.
  • Post-processing may change the final display text.
  • Sources may be discovered outside the visible answer text.
  • Partial failures must still close the stream cleanly.

If the backend simply forwarded provider tokens, the app would see text but lose the final truth about the turn.

Architecture

Here is the public-safe shape:

[Figure: Sanitized streaming LLM done-event contract]

The key implementation pattern was a queue bridge. A background worker runs the synchronous application code and pushes events onto a queue, while the async HTTP view reads from that queue and yields SSE events without waiting for the whole answer.

Event Types

The public event contract can stay small:

ack
    The request was accepted and the backend has enough context to begin.

chunk
    A piece of display text.

done
    The terminal event. Contains final answer metadata and closes the turn.

error
    Optional transport-level or pre-token failure event.

Most application logic should hang off done, not the last chunk.

That gives the backend room to do final cleanup. For example, the streamed text may include internal citation markers, repeated whitespace, or provider-specific artifacts that should not be persisted. The done event can carry the canonical answer after cleanup.
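
To make "hang off done, not the last chunk" concrete, here is a minimal client-side sketch. It folds events into a small state object and only treats the answer as final once done arrives. The TurnState class, the apply_event function, and the "text" field on chunk events are assumptions for illustration; the real payload field names may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnState:
    """Client-side view of one streamed turn (hypothetical shape)."""
    display_text: str = ""               # what the user sees while streaming
    final_answer: Optional[str] = None   # canonical text, known only after done
    closed: bool = False
    error: Optional[str] = None


def apply_event(state: TurnState, event: dict) -> TurnState:
    """Fold one event into the turn state."""
    kind = event.get("type")
    if kind == "chunk":
        # Assumed field name: "text" carries the piece of display text.
        state.display_text += event.get("text", "")
    elif kind == "done":
        # Prefer the cleaned answer over whatever was streamed.
        metadata = event.get("metadata", {})
        state.final_answer = metadata.get("answer", state.display_text)
        state.closed = True
    elif kind == "error":
        state.error = event.get("message", "unknown error")
        state.closed = True
    # "ack" and unknown event types leave the state unchanged.
    return state
```

With this shape, persistence code reads final_answer after closed becomes true and never inspects display_text, so cleanup on the backend cannot leave stale text in storage.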

Why done Is A Product Contract

The client needs one event that says:

{
  "type": "done",
  "metadata": {
    "answer": "final cleaned answer",
    "sources": ["https://example.org/reference"],
    "sourceDetails": [],
    "route": "clinical-rag",
    "modelTier": "smart",
    "toolsWithResults": ["knowledge_base"],
    "completionStatus": "success",
    "partial": false,
    "stageTimings": {
      "precheckMs": 12,
      "routeMs": 4,
      "retrievalMs": 180,
      "firstTokenMs": 920,
      "totalMs": 3100
    }
  }
}

This is not just metadata for nerds. It protects the product.

The mobile app can persist the cleaned answer instead of whatever happened to stream token-by-token. The UI can show source cards without scraping URLs out of text. Analytics can compare first-token latency to total latency. Evals can differentiate “the model used a tool” from “the tool returned useful evidence.”

The terminal event is where streaming becomes auditable.

Sources Should Not Depend On Visible Text

Source handling was one of the biggest contract lessons.

In a simple RAG demo, citations live directly in the answer:

This symptom can happen during perimenopause [1].

In a production streaming UI, visible citations can be fragile:

  • The model may place them near the wrong claim.
  • The output pipeline may strip or reformat them.
  • The mobile UI may want source cards, not inline footnotes.
  • Production and eval surfaces may intentionally see different text.

The better contract is to collect sources from multiple places, de-duplicate them, and emit them separately:

source candidates:
    retrieval metadata
    tool outputs
    structured source payloads
    final answer text fallback

final source payload:
    ordered, de-duplicated URLs + optional source details

This lets the visible answer stay readable while preserving evidence for the client and for evals.
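
The merge step can be sketched in a few lines. This is an illustration under assumptions, not the production code: merge_sources and URL_PATTERN are hypothetical names, the candidate lists are assumed to arrive already ordered by priority, and URLs scraped from the answer text are used only as the last fallback.

```python
from __future__ import annotations

import re
from typing import Iterable

# Loose URL matcher for the text fallback; deliberately simple.
URL_PATTERN = re.compile(r"https?://[^\s\)\]>\"']+")


def merge_sources(*candidate_lists: Iterable[str], answer_text: str = "") -> list[str]:
    """Return ordered, de-duplicated URLs; earlier lists win on ordering."""
    seen: set[str] = set()
    merged: list[str] = []
    # URLs found in the visible answer are checked last.
    fallback = URL_PATTERN.findall(answer_text)
    for urls in (*candidate_lists, fallback):
        for url in urls:
            # Trim whitespace and trailing sentence punctuation.
            normalized = url.strip().rstrip(".,;")
            if normalized and normalized not in seen:
                seen.add(normalized)
                merged.append(normalized)
    return merged
```

A call such as merge_sources(retrieval_urls, tool_urls, answer_text=answer) would produce the list that goes into the sources field of the done event.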

Partial Output Is A First-Class State

Streaming failures are different before and after visible output.

Before the first token, the backend can often retry, swap to a cheaper model, or return a safe fallback. After the user has seen tokens, retrying can produce duplicated or contradictory text.

The useful rule was:

No visible output yet:
    retry or degrade if safe

Visible output already emitted:
    preserve the partial answer
    stop retrying the provider loop
    emit a terminal done event with partial=true

This makes failure honest. The user keeps what they saw. The app receives a closed stream. Observability records that the answer was partial instead of pretending it was a normal success.
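
One way to express that rule in code is a generator that retries only while nothing has been emitted. This is a sketch under assumptions: stream_with_retry and start_stream are hypothetical names, the "text" field on chunk events is assumed, and "error" is an assumed value for completionStatus (the example payload earlier only shows "success").

```python
from typing import Callable, Iterable, Iterator, List


def stream_with_retry(
    start_stream: Callable[[], Iterable[str]],
    max_attempts: int = 2,
) -> Iterator[dict]:
    """Yield chunk events, then one terminal done event."""
    emitted: List[str] = []
    status, partial = "error", False  # "error" is an assumed status value
    for _ in range(max_attempts):
        try:
            for text in start_stream():
                emitted.append(text)
                yield {"type": "chunk", "text": text}
            status = "success"
            break
        except Exception:
            if emitted:
                # The user has already seen text: keep it and stop retrying.
                partial = True
                break
            # Nothing visible yet, so the loop tries again if attempts remain.
    yield {
        "type": "done",
        "metadata": {
            "answer": "".join(emitted),
            "completionStatus": status,
            "partial": partial,
        },
    }
```

The important property is that every path ends in a done event, so the client never has to infer the outcome from a dropped connection.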

Stage Timings Beat One Latency Number

“The request took 5 seconds” is not enough to debug a streaming system.

Useful timings were more granular:

authMs
precheckMs
conversationContextMs
routeMs
retrievalMs
agentSetupMs
firstTokenMs
streamMs
postProcessMs
finalizeMs
totalMs

First-token latency and total latency tell different stories. If the first token is slow, the problem may be auth, routing, retrieval, or provider setup. If the first token is fast but total latency is slow, the model or tool loop may be dragging. If finalization is slow, source extraction or persistence may be the issue.

Stage timings make streaming debuggable without reading raw logs for every turn.
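
A small helper is enough to collect timings like these. The sketch below is an assumption about shape, not the production code: StageTimer is a hypothetical class, stage() records the duration of a block, and mark() records time elapsed since the turn started, which is how I would expect firstTokenMs and totalMs to be measured.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Collects per-stage timings in whole milliseconds."""

    def __init__(self) -> None:
        self._start = time.perf_counter()
        self.timings: Dict[str, int] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Record how long the wrapped block took, even if it raises."""
        began = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - began
            self.timings[f"{name}Ms"] = round(elapsed * 1000)

    def mark(self, name: str) -> None:
        """Record time elapsed since the timer was created."""
        elapsed = time.perf_counter() - self._start
        self.timings[f"{name}Ms"] = round(elapsed * 1000)

    def finish(self) -> Dict[str, int]:
        """Stamp totalMs and return a copy for the done event."""
        self.mark("total")
        return dict(self.timings)
```

Usage would look like wrapping retrieval in "with timer.stage('retrieval'):", calling timer.mark("firstToken") when the first chunk is emitted, and putting timer.finish() into stageTimings.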

Sanitized Queue Bridge Pattern

The implementation details vary by framework, but the pattern is portable:

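import json
from queue import Queue
from threading import Thread

CLOSED = {"type": "closed"}  # internal sentinel, never sent to the client

def format_sse(event: dict) -> str:
    # One SSE frame: a data line followed by a blank line.
    return f"data: {json.dumps(event)}\n\n"

def stream_endpoint(request):
    events: Queue = Queue()

    def worker() -> None:
        # run_conversation is the synchronous application code (not shown).
        try:
            for event in run_conversation(request):
                events.put(event)
        except Exception:
            # Keep the real exception in server logs; send a generic message.
            events.put({"type": "error", "message": "stream failed"})
        finally:
            events.put(CLOSED)

    Thread(target=worker, daemon=True).start()

    # events.get() blocks until the worker produces the next event.
    while (event := events.get()) is not CLOSED:
        yield format_sse(event)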

The production version included auth, usage accounting, trace context, and more careful exception handling. The architectural point is the same: separate the HTTP streaming surface from the synchronous application/runtime work.
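
When the HTTP view is an async generator, the same bridge can be built on an asyncio.Queue, with the worker thread handing events back to the event loop. The sketch below is one possible shape under assumptions: bridge_sync_events and produce are hypothetical names, and it leaves out cancellation, backpressure, and client disconnect handling.

```python
import asyncio
import json
import threading
from typing import AsyncIterator, Callable, Iterable

_CLOSED = object()  # internal sentinel, never sent to the client


async def bridge_sync_events(
    produce: Callable[[], Iterable[dict]],
) -> AsyncIterator[str]:
    """Run a blocking event producer in a thread and yield SSE frames."""
    loop = asyncio.get_running_loop()
    events: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for event in produce():
                # asyncio.Queue is not thread-safe, so hop back onto the loop.
                loop.call_soon_threadsafe(events.put_nowait, event)
        except Exception:
            loop.call_soon_threadsafe(
                events.put_nowait, {"type": "error", "message": "stream failed"}
            )
        finally:
            loop.call_soon_threadsafe(events.put_nowait, _CLOSED)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        # Awaiting here leaves the event loop free for other requests.
        event = await events.get()
        if event is _CLOSED:
            break
        yield f"data: {json.dumps(event)}\n\n"
```

The design choice is that the synchronous code never touches the event loop directly. It only schedules put_nowait calls, which run on the loop in the order they were submitted.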

What Failed Or Changed

Several early assumptions changed:

  • Raw provider tokens were not enough. The app needed a terminal event with the canonical answer and metadata.
  • The last streamed chunk was not a reliable persistence source. Final cleanup could change the answer.
  • “Tools used” was too weak for observability. The better signal was “tools with usable results.”
  • Retrying after partial output created worse UX than emitting an honest partial terminal state.
  • Sources embedded in text were too brittle. Structured source metadata became the contract.
  • A single duration metric hid the real bottleneck. Stage timings made routing, retrieval, first token, and finalization separately visible.

Operational Lessons

The biggest lesson is that streaming is not just a transport optimization. It is a state machine.

Every streamed turn needs a beginning, middle, and end:

beginning:
    accepted, authenticated, route selected

middle:
    chunks, tools, retrieval, model generation

end:
    final answer, sources, state, timings, audit metadata

If you only design the middle, the app will feel fast but become hard to debug. If you design the terminal event well, streaming becomes compatible with observability, evals, source rendering, retries, and persistence.
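
Treating the stream as a state machine also makes it testable. The sketch below checks that a recorded sequence of event types forms a complete turn. The transition table is my assumption based on the event types described earlier: error is treated as terminal and allowed only before the first chunk, since failures after visible output are expected to end in done with partial=true.

```python
from typing import Iterable

# Allowed next event types for each state; "start" means no event seen yet.
TRANSITIONS = {
    "start": {"ack", "error"},
    "ack": {"chunk", "done", "error"},
    "chunk": {"chunk", "done"},
    "done": set(),
    "error": set(),
}


def validate_turn(event_types: Iterable[str]) -> bool:
    """Return True if the sequence is a complete, well-formed turn."""
    state = "start"
    for kind in event_types:
        if kind not in TRANSITIONS.get(state, set()):
            return False
        state = kind
    # A turn is only complete once it reaches a terminal state.
    return state in {"done", "error"}
```

A check like this can run in integration tests against recorded streams, so a missing terminal event shows up as a test failure instead of a stuck spinner in the app.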

For production LLM systems, the final done event is not bookkeeping. It is the contract that makes the stream trustworthy.