Designing Production AI Routing And Evals For A Healthcare Assistant
Most AI demos make the product look like one call to a model:
user question -> prompt -> model -> answer
That is not the shape that survived production for a healthcare assistant.
The production shape became a routing system. A user message had to pass through authentication, entitlement checks, safety handling, deterministic shortcuts, conversation context, model selection, tool selection, retrieval decisions, post-processing, and eval telemetry before the app could safely show the final answer.
The hard part was not making the assistant sound helpful. The hard part was making each turn explainable when something went wrong.
Problem Context
Healthcare assistants sit in an awkward middle ground. Users expect the speed and warmth of chat, but the system cannot behave like a generic chatbot. It must know when to answer from product data, when to retrieve clinical reference material, when to search for current information, when to refuse or escalate, and when to avoid the model entirely.
That produced three design constraints:
- Safety checks must run before optimization checks.
- Routing decisions must be observable after the fact.
- Evaluation must cover both fixed test cases and real production traces.
The last point matters because a passing eval set does not prove production is healthy. It proves the system handled the cases you remembered to write down.
Architecture
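Here is the sanitized version of the request funnel, following the stages listed earlier:
authentication -> entitlement checks -> safety handling -> deterministic shortcuts -> conversation context -> model selection -> tool selection -> retrieval decision -> post-processing -> eval telemetry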
The important design choice is that routing is not a single classifier. It is an ordered set of gates. Some gates are deterministic. Some gates decide which agent configuration to build. Some gates only attach tools or retrieval context.
That separation prevented a common failure: treating “use the full prompt” as the same thing as “use the clinical knowledge base.” Those are different decisions. A full prompt can still answer without retrieval; a retrieval path can be attached only when the question needs evidence.
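A minimal sketch of the ordered-gates idea, not the production router. The `Turn` fields, the gate functions, and the keyword checks are all hypothetical placeholders; real safety and intent detection would be far more careful. What it shows is that terminal gates stop the funnel early, and that choosing a prompt tier and attaching the clinical knowledge base are separate steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Routes that end the funnel as soon as a gate selects them (assumed set).
TERMINAL_ROUTES = {"emergency-detection", "instant-response"}


@dataclass
class Turn:
    """Hypothetical per-turn state that gates read and annotate."""
    text: str
    route: Optional[str] = None
    prompt_tier: Optional[str] = None
    tools: list[str] = field(default_factory=list)
    trace: list[str] = field(default_factory=list)  # gates that ran, in order


def safety_gate(turn: Turn) -> None:
    # Placeholder keyword check standing in for real crisis detection.
    if "overdose" in turn.text.lower():
        turn.route = "emergency-detection"


def shortcut_gate(turn: Turn) -> None:
    # Deterministic product shortcut: no model call needed.
    if turn.text.lower().strip() == "what can you help me with?":
        turn.route = "instant-response"


def prompt_gate(turn: Turn) -> None:
    # Decides the agent configuration only; it never attaches retrieval.
    needs_full = "my" in turn.text.lower().split()
    turn.prompt_tier = "full" if needs_full else "light"
    turn.route = "data-full" if needs_full else "fast-general"


def retrieval_gate(turn: Turn) -> None:
    # Attaches the knowledge base only when the question needs evidence.
    if "interaction" in turn.text.lower():
        turn.tools.append("clinical_kb")


GATES: list[Callable[[Turn], None]] = [
    safety_gate, shortcut_gate, prompt_gate, retrieval_gate,
]


def run_gates(turn: Turn, gates=GATES) -> Turn:
    for gate in gates:
        turn.trace.append(gate.__name__)
        gate(turn)
        if turn.route in TERMINAL_ROUTES:
            break  # safety and shortcut gates short-circuit the rest
    return turn
```

In this sketch, "Summarize my sleep data" ends up with the full prompt tier and no tools, while a drug-interaction question gets the light tier with `clinical_kb` attached, which is the distinction between the two decisions.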
The Fastest Model Call Is No Model Call
The first production lesson was simple: do not ask a model to do work that the application can do deterministically.
Some user turns are product operations, not reasoning problems:
- “recap this conversation”
- “forget what I told you about caffeine”
- “what can you help me with?”
- “am I out of messages?”
- emergency or crisis language
For those, the best route is a deterministic response with the same client contract as a streamed model answer. The app should not need to know whether the answer came from a shortcut or a model. It should receive chunks, final metadata, usage details, and a terminal state either way.
That lets the backend optimize latency without creating client-specific special cases.
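One way to keep that contract uniform is to have shortcuts emit the same event sequence a streamed model answer would. The sketch below is illustrative only: the event shapes (`chunk`, `metadata`, `usage`, `done`), the field names, and the `handle_turn` helper are assumptions, not the production wire format.

```python
from typing import Iterator, Optional


def deterministic_stream(text: str, route: str, chunk_size: int = 24) -> Iterator[dict]:
    """Emit a canned reply using the same event shapes as a model stream."""
    for start in range(0, len(text), chunk_size):
        yield {"type": "chunk", "text": text[start:start + chunk_size]}
    # Final metadata, usage details, and terminal state, as for a model answer.
    yield {"type": "metadata", "route": route, "sources": []}
    yield {"type": "usage", "input_tokens": 0, "output_tokens": 0}
    yield {"type": "done", "status": "complete"}


def handle_turn(message: str, quota_remaining: int) -> Optional[Iterator[dict]]:
    """Return a shortcut stream, or None to fall through to the model path."""
    if message.strip().lower() == "am i out of messages?":
        reply = f"You have {quota_remaining} messages left."
        return deterministic_stream(reply, route="instant-response")
    return None
```

The client consumes the events without knowing that no model was called; the zero token counts in the usage event are the only hint, and they are there for the operator.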
Route Labels Beat Vibes
Every model path needs a route label. Not for the user, for the operator.
In practice, useful route labels looked like:
fast-general
fast-health-guidance
data-session
data-full
clinical-rag
current-facts
emergency-detection
instant-response
The exact names matter less than the invariant: every answer should be explainable as “we selected this route because these signals were present.”
That made debugging much less mysterious. A poor answer could be sorted into one of a few buckets:
- The route was wrong.
- The route was right, but the wrong tools were attached.
- Retrieval ran but returned weak context.
- Retrieval returned good context, but the model ignored it.
- The model produced good text, but post-processing damaged it.
- The answer was good, but the final metadata misrepresented it.
Each bucket points to a different fix.
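To make "we selected this route because these signals were present" concrete, here is a small sketch of rule-ordered route selection that returns the label together with the signals that justified it. The signal names and the rule table are hypothetical; only the route labels come from the list above.

```python
# Ordered rules: safety first, then shortcuts, then evidence-bearing routes.
# Each entry is (route label, signals that must all be present).
ROUTE_RULES = [
    ("emergency-detection", ["crisis_language"]),
    ("instant-response", ["shortcut_match"]),
    ("clinical-rag", ["health_topic", "needs_evidence"]),
    ("current-facts", ["needs_fresh_info"]),
    ("data-session", ["references_user_data"]),
]


def select_route(signals: dict) -> dict:
    """Return the first matching route plus the signals that justified it."""
    for label, required in ROUTE_RULES:
        if all(signals.get(name) for name in required):
            return {"route": label, "because": required}
    # No rule matched: fall back to the cheapest general route.
    return {"route": "fast-general", "because": []}
```

Logging the `because` list next to the label is what lets an operator tell "the route was wrong" apart from the later buckets.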
Evals Need More Than One Lane
I ended up thinking about evals as three feedback loops with different costs.
- Retrieval-only loop: cheap, fast. Answers: "did the right evidence come back?"
- Gold-case agent loop: slower, controlled. Answers: "does the assistant behave correctly on known cases?"
- Live-trace scoring loop: sampled, operational. Answers: "what are real users experiencing?"
The mistake is to collapse those into one dashboard number. They are different questions.
Retrieval-only evals can catch vector index regressions, threshold mistakes, and coverage gaps before you spend money generating full answers. They cannot prove the final answer is clinically good.
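A retrieval-only check can be as small as recall against known gold chunks. The function below is a sketch under assumptions: `retrieved` is a list of `(chunk_id, similarity)` pairs already sorted by descending similarity, and the chunk IDs and scores in the usage note are made up for illustration.

```python
def recall_at_k(retrieved, gold_ids, k, min_similarity=0.0):
    """Fraction of gold chunks found in the top K after applying a similarity floor."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    # Drop chunks below the floor first, then keep the top K of what remains.
    kept = [chunk_id for chunk_id, score in retrieved if score >= min_similarity]
    top_k = set(kept[:k])
    return len(top_k & gold) / len(gold)
```

With `ranked = [("c1", 0.82), ("c7", 0.41), ("c3", 0.30)]` and gold chunks `{"c1", "c3"}`, recall at K=3 is 1.0 with no floor and 0.5 with a floor of 0.35. No answer is generated, so a regression in the index or the threshold shows up without paying for generation.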
Gold-case evals are better for prompt, model, and tool-behavior changes. They let you compare runs across stable cases with consistent IDs and metadata.
Live-trace scoring is the operational ledger. It should be sampled, tagged by surface, and interpreted carefully because production output may differ from raw eval output. For example, a production delivery path might strip inline citation markers while still emitting structured source metadata. If your citation judge expects raw markers, it should produce “no signal,” not a false zero.
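The "no signal" case can be modeled by letting a judge return `None` and keeping `None` out of the averages. This is a sketch, assuming inline markers look like `[1]` and that some upstream step has already decided which marker numbers are supported by their sources; both are illustrative assumptions.

```python
import re
from typing import Optional

MARKER = re.compile(r"\[(\d+)\]")  # assumed inline citation format, e.g. "[2]"


def citation_marker_score(delivered_text: str, supported: set) -> Optional[float]:
    """Share of inline markers that are supported, or None when none are visible."""
    markers = [int(m) for m in MARKER.findall(delivered_text)]
    if not markers:
        return None  # markers were stripped or never emitted: no signal, not zero
    return sum(1 for m in markers if m in supported) / len(markers)


def mean_ignoring_none(scores) -> Optional[float]:
    """Average only the traces that actually produced a signal."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else None
```

Treating the stripped-marker trace as 0.0 instead would drag the dashboard down for a delivery-path choice that has nothing to do with citation quality.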
A Practical Scorecard
For RAG quality, a single “was it good?” judge was too vague. A better approach was a small scorecard:
- Answer relevance: Did the answer address the user's question?
- Context relevance: Were the retrieved chunks actually about the question?
- Faithfulness / grounding: Were answer claims supported by retrieved material?
- Citation or source accuracy: Did source references support the nearby claims?
- Deterministic retrieval metrics: Did known gold chunks appear in the top K?
The useful dashboard was a composite score plus the underlying dimensions. The composite helped with trend detection. The dimensions told you what to fix.
One threshold change illustrates the point. A low similarity floor improved the apparent hit rate, but it let borderline chunks anchor weak answers. Raising the floor reduced some hit metrics while improving answer quality on the cases that remained grounded. With hit rate as the only metric, that change would have looked like a regression.
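A sketch of the "composite plus dimensions" shape follows. The field names mirror the scorecard above, but the 0-to-1 scale and the equal weighting are assumptions made for illustration; the original weighting is not stated.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RagScorecard:
    """One scored answer; every dimension is assumed to be on a 0..1 scale."""
    answer_relevance: float
    context_relevance: float
    faithfulness: float
    citation_accuracy: float
    gold_recall_at_k: float

    def composite(self) -> float:
        # Equal weights are an assumption; use this for trend detection only.
        values = list(asdict(self).values())
        return sum(values) / len(values)

    def weakest(self) -> str:
        # The lowest dimension is the one that tells you what to fix.
        scores = asdict(self)
        return min(scores, key=scores.get)
```

For example, `RagScorecard(0.9, 0.5, 0.8, 0.7, 0.6)` has a composite near 0.7, and `weakest()` returns `"context_relevance"`, which points at retrieval rather than at the prompt or the model.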
Sanitized Config Pattern
The production config had more knobs, but the useful public pattern is small:
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    retrieval_k: int = 8             # top-K chunks considered per query
    min_similarity: float = 0.35     # similarity floor for retrieved chunks
    run_live_judges: bool = True
    run_citation_judge: bool = True
    publish_scores: bool = False


@dataclass(frozen=True)
class RouteDecision:
    label: str                       # route label, e.g. "clinical-rag"
    model_tier: str
    prompt_tier: str
    allow_user_tools: bool
    allow_clinical_kb: bool
    allow_web_search: bool
The key is keeping the route decision explicit. You want to log the model tier, prompt tier, and tool availability separately because they fail separately.
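As a sketch of what logging those fields separately might look like, the snippet below flattens a route decision into one structured log record. `RouteDecision` is repeated so the example runs on its own; `route_log_record`, the `trace_id` field, and the tier values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RouteDecision:
    label: str
    model_tier: str
    prompt_tier: str
    allow_user_tools: bool
    allow_clinical_kb: bool
    allow_web_search: bool


def route_log_record(trace_id: str, decision: RouteDecision) -> str:
    """Serialize every routing field as its own key so each can be filtered on."""
    return json.dumps({"trace_id": trace_id, **asdict(decision)}, sort_keys=True)


decision = RouteDecision(
    label="clinical-rag",
    model_tier="large",      # illustrative tier names, not the real ones
    prompt_tier="full",
    allow_user_tools=False,
    allow_clinical_kb=True,
    allow_web_search=False,
)
record = route_log_record("trace-001", decision)
```

Because the dataclass is frozen, the decision that gets logged cannot be quietly changed by a later stage of the same turn.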
What Failed Or Changed
Several assumptions did not survive contact with production:
- “The model can decide when to retrieve” was not enough. Some health guidance needed pre-generation retrieval priming so the first streamed token was already grounded.
- “Citations in text are enough” was brittle. Structured source metadata was a better client contract than relying on visible footers or inline markers.
- “One eval score is enough” hid failures. Retrieval quality, answer quality, and citation/source quality needed separate signals.
- “Retry on timeout” sounded safe but could make streaming worse after partial output. Once the user has seen tokens, preserve the partial answer and emit an honest terminal state.
- “More tools is better” created tool overuse. The better pattern was route first, then attach only the tools that route could justify.
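The retry point can be sketched as a rule: retry only while nothing has been shown to the user. This is an illustration under assumptions, not the production streaming code; `start_stream` stands in for whatever opens a model stream, a timeout is modeled as `TimeoutError`, and the status strings are made up.

```python
from typing import Callable, Iterator


def stream_turn(start_stream: Callable[[], Iterator[str]],
                max_retries: int = 1) -> Iterator[dict]:
    """Stream chunks; retry a timeout only if no chunk has been emitted yet."""
    attempts = 0
    emitted = False
    while True:
        try:
            for text in start_stream():
                emitted = True
                yield {"type": "chunk", "text": text}
            yield {"type": "done", "status": "complete"}
            return
        except TimeoutError:
            if emitted:
                # The user has seen tokens: keep them and report honestly.
                yield {"type": "done", "status": "incomplete_timeout"}
                return
            if attempts >= max_retries:
                yield {"type": "done", "status": "failed_timeout"}
                return
            attempts += 1  # nothing shown yet, so a retry is invisible


def partial_then_timeout() -> Iterator[str]:
    """Demo stream that times out after one chunk."""
    yield "Magnesium may "
    raise TimeoutError
```

Running `list(stream_turn(partial_then_timeout))` yields the one chunk followed by a `done` event with status `incomplete_timeout`, rather than restarting the answer underneath text the user has already read.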
Operational Lessons
The biggest production lesson was that routing is a product surface, not just an implementation detail. It decides latency, cost, safety posture, and evidence quality.
The second lesson was that evals should be designed like observability. A score without route labels, trace tags, tool results, source metadata, and run metadata is hard to act on.
The third lesson was that deterministic systems and LLM systems should not be treated as enemies. The most reliable healthcare assistant path used both: deterministic gates for safety and product logic, LLMs for language and reasoning, retrieval for evidence, and evals to keep the whole system honest.