Lessons From Bedrock, pgvector, And RAG In Production
RAG sounds simple until it has to answer real health questions.
The toy version is:
embed the question -> search vectors -> paste chunks into the prompt -> answer
The production version is a set of uncomfortable choices about latency, coverage, evidence quality, source metadata, evals, and operational rollback.
This is what I learned building a healthcare assistant on Bedrock, managed Postgres, pgvector, and a curated clinical knowledge base.
Problem Context
The assistant needed to answer user questions across two very different kinds of context:
- private user data, such as tracking history and trends
- public medical reference material, such as symptoms, treatments, evaluation guidance, and safety caveats
Those should not be mixed casually. If a user asks “when was my last period?”, the answer should come from user-scoped data tools. If a user asks “what is primary ovarian insufficiency?”, the answer should be grounded in reference material. If a user asks “where can I buy this?”, static medical references are the wrong corpus entirely.
That made the real RAG problem less about vector search and more about routing: when is retrieval appropriate, which corpus should it use, and what source contract should reach the client?
Architecture
The sanitized retrieval architecture looked like this:
The most important piece is the metadata path. The answer text can be cleaned, rewritten, or streamed. Source metadata needs to survive those transformations.
In practice, every retrieved chunk needed at least:
{
  "content": "short passage used for grounding",
  "score": 0.72,
  "topic": "menopause symptoms",
  "source": "medical reference source",
  "sourceUrl": "https://example.org/reference"
}
The app should not have to scrape URLs back out of prose. It should receive an explicit source list in the final response metadata.
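One way to keep that metadata intact is to carry chunks as typed records and derive the source list from them, never from the answer text. The sketch below is illustrative only: `RetrievedChunk` and `collect_sources` are hypothetical names, and the fields simply mirror the JSON shape above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    # Field names mirror the chunk shape above; the class itself is hypothetical.
    content: str
    score: float
    topic: str
    source: str
    source_url: str


def collect_sources(chunks: list[RetrievedChunk]) -> list[dict]:
    """Build the explicit source list from chunk metadata, one entry per URL."""
    seen: set[str] = set()
    sources: list[dict] = []
    for chunk in chunks:
        if chunk.source_url in seen:
            continue  # keep the first occurrence of each URL
        seen.add(chunk.source_url)
        sources.append({"source": chunk.source, "sourceUrl": chunk.source_url})
    return sources
```

Because the list is computed from the chunks, cleaning or rewriting the answer text cannot drop a source.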
Bedrock Was The Model Plane, Not The Whole Product
Bedrock handled model access well: chat models, embeddings, and managed AWS auth fit cleanly into the backend. But Bedrock did not remove the need for application-owned routing and evaluation.
The useful boundary was:
- Bedrock hosts the models.
- The application decides what evidence is required.
- The retrieval service controls what context is eligible.
- The eval harness determines whether the answer was good enough.
That boundary kept the system portable. If a model changed, the routing and source contract did not have to change. If the retrieval index changed, the app could test retrieval quality before changing prompts. If a judge changed, the trace and dataset structure still had stable case IDs.
pgvector Lesson: Index Settings Are Product Behavior
The retrieval index was not an implementation detail. It directly changed what the assistant said.
An early failure mode came from vector search returning confident-looking but wrong chunks. The query was valid. The content existed. The result set was bad. That sort of bug feels like “the model is hallucinating,” but the root cause can be retrieval.
The first fix was to make the vector index behavior match the runtime path and to test it directly. Later, moving to an HNSW index made the speed/recall trade-off much better for the corpus size. In our case, the retrieval path moved from a multi-second p50 to sub-second on normal queries.
That improvement was not just a performance win. It changed routing options. A RAG path that takes several seconds gets avoided or deferred. A RAG path that is fast enough can run before generation, so the retrieved context is already in the prompt when the first token streams.
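For readers who have not configured HNSW in pgvector, a minimal sketch follows. The table and column names (`kb_chunks`, `embedding`, `source_url`) are hypothetical, the parameter values are pgvector's documented defaults rather than the values used in this system, and the placeholders assume a psycopg-style driver.

```python
# SQL kept as Python strings so it can be run through any Postgres driver.
# Table and column names are hypothetical.

CREATE_HNSW_INDEX = """
CREATE INDEX IF NOT EXISTS kb_chunks_embedding_hnsw
ON kb_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""

# Query-time knob: higher values trade speed for recall. 40 is pgvector's default.
SET_EF_SEARCH = "SET hnsw.ef_search = 40;"

# <=> is pgvector's cosine distance operator, so similarity = 1 - distance.
# The ORDER BY operator has to match the index's operator class
# (vector_cosine_ops here) for the index to be used.
SEARCH_SQL = """
SELECT content, topic, source, source_url,
       1 - (embedding <=> %(query_vec)s::vector) AS score
FROM kb_chunks
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list in pgvector's text input format, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

Keeping the index definition and the search query side by side makes it harder for the operator in the query to drift away from the operator class in the index.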
Query Rewrite Beat Raw Embeddings
Clinical questions often contain shorthand, life-stage context, and multiple intents:
Is vaginal estrogen safe if I have migraine with aura?
A naive embedding of the raw question may miss useful chunks because the source content uses different language. The retrieval path improved when it generated structured probes:
local estrogen therapy safety migraine with aura
vaginal estrogen contraindications migraine aura
genitourinary syndrome menopause treatment safety
The model-facing answer still used the user’s original question. The rewritten probes were only for retrieval.
That distinction matters. Rewriting the user’s question for the model can change intent. Rewriting the query for retrieval can improve recall while preserving the user’s actual request.
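The simplest form of this idea is synonym expansion over a keyword version of the question. The sketch below assumes a hand-curated synonym map and stopword list, both hypothetical; it does not reproduce the structured clinical probes, which were richer than this.

```python
import re

# Hypothetical synonym map; a real one would be curated against the corpus vocabulary.
SYNONYMS = {
    "vaginal estrogen": ["local estrogen therapy"],
    "migraine with aura": ["migraine aura"],
}

# Hypothetical stopword list. "with" is kept because it is part of a clinical phrase.
STOPWORDS = {"is", "if", "i", "have", "a", "an", "the", "can", "my", "do", "does"}


def build_retrieval_probes(question: str, max_probes: int = 4) -> list[str]:
    """Return keyword-style probes for retrieval only; the question is left untouched."""
    normalized = re.sub(r"[^a-z0-9\s]", " ", question.lower())
    keywords = " ".join(w for w in normalized.split() if w not in STOPWORDS)
    probes = [keywords]
    for phrase, alternatives in SYNONYMS.items():
        if phrase in keywords:
            probes.extend(keywords.replace(phrase, alt) for alt in alternatives)
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys(probes))[:max_probes]
```

The caller passes the probes to vector search and the original question, unchanged, to the model.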
Retrieval Should Be Selective
One temptation is to retrieve for every message. That is usually wrong.
Retrieval adds latency and cost. More importantly, it can add misleading authority to answers that do not need it. Some questions need private user data. Some need current web information. Some are product commands. Some are casual conversation.
The practical rule became:
Use the clinical KB for medical explanation, symptoms, safety,
evaluation, treatment, or guidance questions.
Avoid the clinical KB for personal tracking data, live facts,
shopping/current prices, and lightweight conversation.
This made retrieval a scoped capability rather than a default reflex.
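As a rough illustration of that rule, here is a keyword-based router. The route names and cue lists are hypothetical, and a production router could just as well be a classifier or a model call; the point is only that the routing decision happens before any retrieval does.

```python
from enum import Enum


class Route(Enum):
    USER_DATA = "user_data"      # private tracking data via user-scoped tools
    LIVE_WEB = "live_web"        # live facts, shopping, current prices
    CLINICAL_KB = "clinical_kb"  # curated medical reference retrieval
    CHAT = "chat"                # lightweight conversation, no retrieval


# Hypothetical cue lists, checked in priority order.
ROUTE_CUES = [
    (Route.USER_DATA, ("my last", "my cycle", "my history", "did i log")),
    (Route.LIVE_WEB, ("buy", "price", "in stock", "near me")),
    (Route.CLINICAL_KB, ("what is", "symptom", "safe", "treatment", "side effect")),
]


def route_question(question: str) -> Route:
    """Pick the first route whose cues appear in the question; default to chat."""
    text = question.lower()
    for route, cues in ROUTE_CUES:
        if any(cue in text for cue in cues):
            return route
    return Route.CHAT
```

Only `Route.CLINICAL_KB` would trigger a vector search; every other route skips the knowledge base entirely.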
Source Delivery: Separate Text From Evidence
Inline citations look attractive because they make grounding visible:
Hot flashes can be related to estrogen changes [[REF:1]].
But inline citation tokens create product problems. They can feel awkward in a mobile chat UI, they are easy for the model to place incorrectly, and they can break when answer text is cleaned or reformatted.
The better production contract was:
{
  "answer": "clean text ready for the user",
  "sources": [
    "https://example.org/source-a",
    "https://example.org/source-b"
  ],
  "sourceDetails": [
    {
      "title": "Source title",
      "url": "https://example.org/source-a",
      "kind": "clinical_reference"
    }
  ]
}
The assistant can still use citation markers internally for evals, but the client should get structured metadata. That gives the UI control over how to show sources and gives evals a stable surface to inspect.
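A minimal sketch of that conversion: strip the internal markers from the text and emit the cited chunks as structured metadata. It assumes markers are 1-based indexes into the retrieved chunk list and that each chunk carries `source` and `sourceUrl` keys; the function name and the fixed `kind` value are hypothetical.

```python
import re

# Matches an internal marker such as [[REF:1]], plus any whitespace before it.
REF_MARKER = re.compile(r"\s*\[\[REF:(\d+)\]\]")


def build_client_payload(raw_answer: str, chunks: list[dict]) -> dict:
    """Return clean answer text plus structured details for the cited sources."""
    cited: list[int] = []
    for match in REF_MARKER.finditer(raw_answer):
        index = int(match.group(1)) - 1  # markers are assumed to be 1-based
        if 0 <= index < len(chunks) and index not in cited:
            cited.append(index)  # out-of-range markers are ignored

    details: list[dict] = []
    seen_urls: set[str] = set()
    for index in cited:
        chunk = chunks[index]
        if chunk["sourceUrl"] in seen_urls:
            continue
        seen_urls.add(chunk["sourceUrl"])
        details.append({
            "title": chunk["source"],
            "url": chunk["sourceUrl"],
            "kind": "clinical_reference",
        })

    return {
        "answer": REF_MARKER.sub("", raw_answer).strip(),
        "sources": [d["url"] for d in details],
        "sourceDetails": details,
    }
```

A marker the model misplaces or invents then costs at most a missing source entry, not broken text in the chat UI.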
Evals Changed How We Tuned RAG
RAG tuning without evals becomes anecdote-driven:
"This answer feels worse."
"This query looks better."
"The source seems unrelated."
The better loop separated retrieval metrics from answer metrics.
Retrieval metrics:
- Did the top K contain human-labeled relevant chunks?
- What was the best similarity score?
- How many chunks cleared the quality floor?
- Did known source families appear?
Answer metrics:
- Did the answer address the question?
- Was it faithful to retrieved context?
- Were medical caveats appropriate?
- Did citations or sources support the claims?
That separation caught a counterintuitive result: a retrieval change can reduce hit rate while improving final answer quality if it filters weak chunks. In a health context, fewer but stronger chunks can be the right tradeoff.
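The retrieval-side checks are deterministic and cheap to compute per eval case. A sketch under stated assumptions: each result is a dict with hypothetical `chunk_id`, `score`, and `source` keys, `relevant_ids` holds the human labels, and the default floor of 0.5 is illustrative.

```python
def retrieval_metrics(
    results: list[dict],
    relevant_ids: set[str],
    *,
    k: int = 8,
    floor: float = 0.5,
) -> dict:
    """Score one eval case's ranked retrieval results against human labels."""
    top_k = results[:k]
    scores = [r["score"] for r in top_k]
    hit_ids = {r["chunk_id"] for r in top_k if r["chunk_id"] in relevant_ids}
    return {
        # Did the top K contain any human-labeled relevant chunk?
        "hit_at_k": bool(hit_ids),
        # What fraction of the labeled chunks made it into the top K?
        "recall_at_k": len(hit_ids) / len(relevant_ids) if relevant_ids else 0.0,
        # Best similarity score among the top K (0.0 if nothing came back).
        "best_score": max(scores, default=0.0),
        # How many chunks cleared the quality floor?
        "above_floor": sum(1 for s in scores if s >= floor),
        # Which source families appeared, for comparison with expected ones.
        "source_families": sorted({r["source"] for r in top_k}),
    }
```

Tracking `recall_at_k` and `above_floor` side by side is what exposes the tradeoff described above: a stricter floor can lower the first while the surviving chunks get stronger.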
Sanitized Retrieval Pattern
The actual code had more fallback behavior, but the shape can be expressed simply:
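# build_retrieval_probes, vector_search, dedupe_by_source_and_chunk,
# sort_by_score, and MIN_RELEVANCE_SCORE are defined elsewhere.
def retrieve_context(question: str, *, k: int = 8) -> list[dict]:
    # Several retrieval-only probes per question, not one raw embedding.
    probes = build_retrieval_probes(question)

    candidates = []
    for probe in probes:
        candidates.extend(vector_search(probe, limit=k))

    # Different probes often return the same chunk; keep one copy of each.
    deduped = dedupe_by_source_and_chunk(candidates)
    ranked = sort_by_score(deduped)

    # Take the top k, then drop anything below the relevance floor.
    return [
        chunk
        for chunk in ranked[:k]
        if chunk["score"] >= MIN_RELEVANCE_SCORE
    ]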
The important pieces are not the function names. The important pieces are multiple probes, de-duplication, a relevance floor, and metadata preservation.
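The two helper names in that snippet are left undefined. One plausible reading of them is sketched below; these are not the original implementations. It assumes each chunk is a dict with `content`, `score`, and `sourceUrl` keys, and that a duplicate means the same passage from the same source URL.

```python
def dedupe_by_source_and_chunk(candidates: list[dict]) -> list[dict]:
    """Keep the best-scoring copy of each (sourceUrl, content) pair."""
    best: dict[tuple[str, str], dict] = {}
    for chunk in candidates:
        key = (chunk["sourceUrl"], chunk["content"])
        if key not in best or chunk["score"] > best[key]["score"]:
            best[key] = chunk  # the whole dict is kept, so metadata survives
    return list(best.values())


def sort_by_score(chunks: list[dict]) -> list[dict]:
    """Highest similarity first; returns a new list without mutating the input."""
    return sorted(chunks, key=lambda c: c["score"], reverse=True)
```

Keeping whole chunk dicts, not just text and score, is what lets the source fields reach the final response.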
What Failed Or Changed
Several things changed as the system matured:
- The first vector path was too dependent on default index behavior. Index configuration had to be treated as part of the application contract.
- Retrieval was originally too model-directed. Pre-generation priming, where retrieval runs before the model starts answering, improved grounded answers for clinical questions.
- Query rewriting started as simple synonym expansion and evolved toward structured clinical probes.
- Citation display moved away from raw inline markers toward structured source metadata.
- The eval harness added deterministic IR metrics because LLM judges alone were too fuzzy for retrieval regressions.
- Old retrieval tuning knobs became obsolete after the index changed. Stale knobs are dangerous because they make operators think they are controlling behavior that no longer exists.
Operational Lessons
The best RAG architecture is not the most complex one. It is the one whose failures are easy to locate.
If the wrong chunks come back, that is a retrieval problem. If the right chunks come back but the model ignores them, that is an answer-generation problem. If the answer is grounded but the UI loses sources, that is a contract problem. If all the examples look good but live traces drift, that is an eval coverage problem.
The production lesson is to make those boundaries explicit. Bedrock, pgvector, and RAG are useful pieces. The system becomes reliable when routing, retrieval, source delivery, and evaluation are designed as one product loop.