Building a Markdown-First Citation Pipeline for an AI Health Agent
When your AI agent cites sources, the obvious approach is inline markdown links: `[1](https://example.com/study)`. That’s what the AI assistant did initially. It worked — until it didn’t.
The problem wasn’t the citations themselves. It was everything downstream that had to deal with them: streaming text that included raw URLs mid-sentence, layout comparison logic that broke because two identical answers had different link text, bullet formatting that collapsed because the post-processor couldn’t distinguish citation markup from content structure.
This post documents the refactor that replaced all of that with a token-based citation pipeline — [[REF:ID]] tokens in the model output, a structured SOURCES JSON payload appended after the response, and a clean stripping layer that removes machine tokens before the user ever sees them.
The Problem with Inline Citations
The agent’s original citation approach worked like this: the system prompt told Claude to emit `[n](url)` markdown links inline, and the response went straight to the client. Simple.
Three things broke:
1. Streaming got ugly. When streaming via SSE, the client renders text as it arrives. An inline citation like `[1](https://example.com/very/long/path/to/study.pdf)` would arrive across multiple chunks, showing the user raw URL fragments mid-sentence before the link closed. The Flutter client couldn’t do anything about it — the text was already committed to the stream.
2. Layout comparison diverged. The app has a dashboard view and a chat view that should show identical content. The normalize_layout_compare_text function compared the two, but inline citations with dynamically generated footnote numbers meant the “same” answer could have different link indices depending on rendering order. The comparison would flag false differences.
3. Bullet formatting collapsed. A post-processing step called _collapse_bullet_blocks_to_paragraphs was rewriting bullet lists into flowing paragraphs for cleaner mobile display. But it couldn’t tell the difference between a bullet point that was content structure and one that was a citation footnote list. The collapse was eating citation blocks, or worse, merging citation URLs into paragraph text.
The New Architecture: Tokens + Structured Payload
The refactor separates citation concerns into three layers:
Layer 1: Model Output (Reference Tokens)
The system prompt now instructs Claude to emit [[REF:ID]] tokens wherever it would cite a source. The ID maps to a knowledge base chunk. No URLs, no markdown links, no footnote numbers in the model’s output stream.
```
Based on current clinical guidelines [[REF:kb_0042]], the recommended
approach involves monitoring levels every 3-6 months [[REF:kb_0891]].
```
The model also emits a structured block at the end of its response:
```
SOURCES_START
[
  {"id": "kb_0042", "title": "ACOG Practice Bulletin #232", "url": "https://..."},
  {"id": "kb_0891", "title": "Endocrine Society Guidelines 2024", "url": "https://..."}
]
SOURCES_END
```
This is a machine-readable payload, never shown to users.
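To make the contract concrete, here is a minimal sketch of how a consumer might split raw model output into visible text and the parsed payload. `split_sources` and its return shape are illustrative assumptions, not the pipeline's actual API.

```python
import json
import re

# Non-greedy match over the machine-readable payload block.
SOURCES_RE = re.compile(r"SOURCES_START\s*(.*?)\s*SOURCES_END", re.DOTALL)

def split_sources(raw: str) -> tuple[str, list[dict]]:
    """Return (text without the payload block, parsed source entries)."""
    match = SOURCES_RE.search(raw)
    if not match:
        return raw, []
    sources = json.loads(match.group(1))
    # Cut the payload block out of the visible text.
    text = raw[:match.start()] + raw[match.end():]
    return text.strip(), sources
```

Everything between the markers is plain JSON, so a malformed payload fails loudly at `json.loads` rather than leaking into the user-visible stream.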
Layer 2: Citation Generation (Grounding Utils)
The grounding/utils.py module was rewritten. The key function format_inline_citation_reference now emits [[REF:N]] tokens instead of markdown links. A new append_sources_block function constructs the SOURCES_START...SOURCES_END JSON payload from the RAG retrieval results.
The separation means the grounding layer doesn’t need to know anything about rendering — it just tags where citations go and provides the structured data.
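In sketch form, the grounding layer's two jobs look something like the following. The real signatures in `grounding/utils.py` may differ, and the `{"id", "title", "url"}` chunk shape is an assumption about the RAG retrieval results:

```python
import json

def format_inline_citation_reference(chunk_id: str) -> str:
    # Emit a semantic token instead of a markdown link.
    return f"[[REF:{chunk_id}]]"

def append_sources_block(text: str, chunks: list[dict]) -> str:
    # Append the machine-readable payload after the response body.
    payload = json.dumps(
        [{"id": c["id"], "title": c["title"], "url": c["url"]} for c in chunks],
        indent=2,
    )
    return f"{text}\n\nSOURCES_START\n{payload}\nSOURCES_END"
```

Note that neither function knows anything about footnote numbering, link text, or mobile rendering — that is the whole point of the separation.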
Layer 3: Citation Stripping (Response Cleanup)
A new strip_citation_tokens() function in response_cleanup.py removes all [[REF:...]] tokens and the entire SOURCES_START...SOURCES_END block from user-visible text. The prepare_answer_for_delivery pipeline now runs this stripping as a standard step, along with legacy footer cleanup for backward compatibility.
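A stripping function of this kind reduces to two regex substitutions plus whitespace repair. This is a hedged sketch of the idea, not the actual `response_cleanup.py` implementation:

```python
import re

REF_TOKEN_RE = re.compile(r"\[\[REF:[^\]]+\]\]")
SOURCES_BLOCK_RE = re.compile(r"SOURCES_START.*?SOURCES_END", re.DOTALL)

def strip_citation_tokens(text: str) -> str:
    text = SOURCES_BLOCK_RE.sub("", text)   # drop the machine payload
    text = REF_TOKEN_RE.sub("", text)       # drop inline reference tokens
    text = re.sub(r" {2,}", " ", text)      # collapse doubled spaces left behind
    text = re.sub(r" +([.,;:])", r"\1", text)  # no stray space before punctuation
    return text.strip()
```

The whitespace repair matters: a token like `[[REF:kb_0891]].` usually follows a space, so naive removal would leave `months .` in the visible text.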
The streaming layer (finalize.py, output_postprocessing.py) extracts URLs from the structured payload before running text cleanup, so the citation data is preserved for the client’s reference panel while the visible stream stays clean.
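The ordering constraint — extract first, strip second — can be captured in one self-contained sketch. The function name and return shape are hypothetical; only the extract-before-cleanup ordering is the point:

```python
import json
import re

def finalize_visible_text(raw: str) -> tuple[str, list[str]]:
    # 1. Capture citation URLs from the structured payload FIRST,
    #    so the client's reference panel still gets them.
    block = re.search(r"SOURCES_START\s*(.*?)\s*SOURCES_END", raw, re.DOTALL)
    urls = [s["url"] for s in json.loads(block.group(1))] if block else []
    # 2. Only then clean the user-visible text.
    visible = re.sub(r"SOURCES_START.*?SOURCES_END", "", raw, flags=re.DOTALL)
    visible = re.sub(r"\[\[REF:[^\]]+\]\]", "", visible)
    return visible.strip(), urls
```

Run the steps in the other order and the URLs are gone before anyone reads them — which is exactly the class of bug the old inline-citation pipeline kept producing.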
What Got Deleted
The most satisfying part of any refactor is what you get to remove.
_collapse_bullet_blocks_to_paragraphs is gone. It was removed from post_process_answer, normalize_streaming_visible_text, and format_instant_response_text. The function existed because the old inline-citation format created bullet lists that looked bad on mobile. With citations separated from content, the model’s natural markdown formatting can pass through untouched.
The bullet collapse was a workaround for a rendering problem that was itself caused by a citation problem. Removing the citation problem eliminated the rendering problem, which in turn eliminated the workaround. Three layers of complexity deleted.
New Addition: Definition Formatting
With bullets no longer being collapsed, a new format_definition_runs_as_bullets function was added to response_formatting.py. It detects runs of Label: explanation lines (common in health/medical responses) and converts them to - **Label** — explanation format. This is additive formatting that works with the model’s output rather than fighting it.
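A plausible sketch of the detection logic, assuming a "run" means two or more consecutive `Label: explanation` lines (the real heuristics in `response_formatting.py` may be stricter):

```python
import re

# A definition line: short capitalized label, colon, then an explanation.
DEF_LINE_RE = re.compile(r"^([A-Z][\w /()-]{0,40}):\s+(\S.*)$")

def format_definition_runs_as_bullets(text: str) -> str:
    out, run = [], []

    def flush():
        if len(run) >= 2:  # only convert genuine runs, not lone "Note:" lines
            out.extend(f"- **{label}** — {body}" for label, body in run)
        else:
            out.extend(f"{label}: {body}" for label, body in run)
        run.clear()

    for line in text.split("\n"):
        m = DEF_LINE_RE.match(line)
        if m:
            run.append((m.group(1), m.group(2)))
        else:
            flush()
            out.append(line)
    flush()
    return "\n".join(out)
```

The two-line threshold is the important design choice: a single `Note: see your clinician` should stay a sentence, while a run of hormone definitions becomes a scannable list.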
Instant Delivery Fix
A subtle bug surfaced during the refactor: instant responses (non-streamed, cached or guardrail-triggered answers) were being sent as chunk SSE events, which the Flutter client renders with a typewriter animation. This looked wrong for instant content.
The fix: instant responses now send as guardrail_override events, which the client renders as a full replace. They also now run through the same formatting pipeline as streamed responses, so the output is consistent regardless of delivery path.
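In SSE terms, the difference is just the event name on the frame. The event names (`chunk`, `guardrail_override`) come from the post; the framing helper below is an illustrative sketch, not the service's actual code:

```python
import json

def sse_frame(event: str, text: str) -> str:
    # Standard SSE framing: named event, JSON data, blank-line terminator.
    return f"event: {event}\ndata: {json.dumps({'text': text})}\n\n"

def deliver(text: str, streamed: bool) -> str:
    # Streamed deltas render with the typewriter animation; instant
    # answers replace the message body in one shot.
    event = "chunk" if streamed else "guardrail_override"
    return sse_frame(event, text)
```

Because both paths now run the same formatting pipeline before framing, the only per-path difference left is this one string — which is about as small as a delivery fork can get.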
The Test Suite
153 tests across 5 test files, all updated for the new behavior, all passing. The tests cover:
- Token emission: model output contains `[[REF:ID]]`, not `[n](url)`
- Token stripping: user-visible text has no REF tokens or SOURCES blocks
- URL extraction: citation URLs are captured before text cleanup
- Streaming: visible stream is clean, citation data preserved separately
- Layout comparison: `normalize_layout_compare_text` ignores REF tokens
- Definition formatting: `Label: text` rows convert correctly
- Instant delivery: correct SSE event type and formatting pipeline
The Complete Change Set
| Area | Files | What Changed |
|---|---|---|
| Model instructions | `prompts/service.py` | `[[REF:ID]]` tokens + `SOURCES_START...SOURCES_END` JSON payload |
| Citation generation | `grounding/utils.py`, `grounding/__init__.py` | `format_inline_citation_reference` emits tokens, new structured payload |
| Citation stripping | `response_cleanup.py`, `response/__init__.py` | `strip_citation_tokens()`, updated `prepare_answer_for_delivery` |
| Streaming | `finalize.py`, `output_postprocessing.py` | URL extraction before cleanup, clean visible stream |
| Layout comparison | `text_policy.py`, `views.py` | Ignore REF tokens in normalization |
| Bullet collapse | `output_postprocessing.py` | Removed entirely |
| Definition formatting | `response_formatting.py` | New `format_definition_runs_as_bullets` |
| Instant delivery | `strands_conversation_service.py`, `helpers.py` | `guardrail_override` event type, formatting pipeline |
| Tests | 5 test files | 153/153 pass |
Takeaways
The core insight: citation is a machine concern, not a rendering concern. The moment you let citation markup leak into user-visible text, every downstream system — streaming, comparison, formatting, caching — has to be citation-aware. Separating the token layer from the display layer let us simplify all of them.
The refactor also killed the assumption that post-processing should “fix” model output. The old pipeline was: model emits messy markdown → post-processor rewrites it → client renders the rewrite. The new pipeline is: model emits clean markdown with semantic tokens → stripping layer removes tokens → client renders the model’s actual formatting. Less code, fewer bugs, better output.