I have 268 markdown files in my Obsidian vault – internal docs, architecture decisions, API specs, meeting notes. Searching them with Obsidian’s built-in search works for exact terms, but I wanted something smarter: ask a natural language question and get a synthesized answer with citations, completely offline.
So I built a local RAG pipeline. No API keys, no cloud services, no data leaving my machine. Here’s how it works and the decisions I made along the way.
Code: github.com/alexbeattie/obsidian-rag
The Stack
| Component | Tool | Why |
|---|---|---|
| Embeddings | nomic-embed-text via Ollama | Best open-source retrieval embeddings at this size (768-dim). Outperforms OpenAI’s text-embedding-3-small on MTEB benchmarks. |
| Vector DB | ChromaDB (persistent, on-disk) | Zero-config, no Postgres process needed. Persists to SQLite + hnswlib. |
| LLM | mistral-nemo:12b via Ollama | Good balance of quality and speed for RAG synthesis. |
| MCP Server | FastMCP (stdio) | Exposes tools to Cursor/Claude Code for agent-driven queries. |
Chunking: Why Heading-Based, Not Fixed-Size
Most RAG tutorials say “split your text into 500-character chunks with overlap.” I didn’t do that. My Obsidian docs use ## and ### headings consistently, so I split on heading boundaries instead.
Why this matters:
- A fixed 500-char window would split mid-paragraph, losing context
- Heading-based chunks are semantically coherent units – one section = one concept
- The tradeoff is uneven chunk sizes, but guardrails handle the extremes
sections = re.split(r'(?=^#{1,3}\s)', text, flags=re.MULTILINE)
Two guardrails prevent degenerate cases:
- MIN_CHUNK_CHARS = 150 – drops fragments too small to be useful (bare headings, section dividers)
- MAX_CHUNK_CHARS = 2000 – hard-splits oversized sections at paragraph boundaries first, then at character boundaries as a last resort
The paragraph-level splitting is important. If a section is 6000 chars, I don’t just chop it at character 2000. I split on \n\n (paragraph boundaries) and accumulate paragraphs into a buffer until it would exceed the limit. This preserves paragraph coherence even within the fallback splitting.
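Putting the heading split and both guardrails together, the logic looks roughly like this. This is a sketch, not the repo’s exact code — the function names (`chunk_markdown`, `split_oversized`) are mine:

```python
import re

MIN_CHUNK_CHARS = 150
MAX_CHUNK_CHARS = 2000

def chunk_markdown(text: str) -> list[str]:
    """Split on #/##/### heading boundaries, then enforce size guardrails."""
    sections = re.split(r'(?=^#{1,3}\s)', text, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        section = section.strip()
        if len(section) < MIN_CHUNK_CHARS:
            continue  # drop bare headings and section dividers
        if len(section) <= MAX_CHUNK_CHARS:
            chunks.append(section)
        else:
            chunks.extend(split_oversized(section))
    return chunks

def split_oversized(section: str) -> list[str]:
    """Fallback: accumulate paragraphs into a buffer, flushing before the limit."""
    parts, buffer = [], ""
    for para in section.split("\n\n"):
        if len(para) > MAX_CHUNK_CHARS:
            # A single paragraph over the limit: flush, then hard-split it
            if buffer:
                parts.append(buffer)
                buffer = ""
            while len(para) > MAX_CHUNK_CHARS:
                parts.append(para[:MAX_CHUNK_CHARS])
                para = para[MAX_CHUNK_CHARS:]
        if buffer and len(buffer) + len(para) + 2 > MAX_CHUNK_CHARS:
            parts.append(buffer)
            buffer = para
        else:
            buffer = f"{buffer}\n\n{para}" if buffer else para
    if buffer:
        parts.append(buffer)
    return [p for p in parts if len(p) >= MIN_CHUNK_CHARS]
```

Note that the character-boundary hard split is the very last resort — it only fires when a single paragraph exceeds the limit on its own.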
The nomic-embed-text Prefix Trap
This cost me an hour of debugging. nomic-embed-text was trained with asymmetric task prefixes. You must prepend:
- "search_document: " when embedding documents for storage
- "search_query: " when embedding a user query
# Indexing
prefixed = f"search_document: {text}"
# Querying
prefixed = f"search_query: {query}"
Without these prefixes, document and query embeddings land in misaligned regions of the vector space. Recall drops 15-20%. This is specific to nomic – OpenAI’s text-embedding-3-small doesn’t use prefixes. It’s easy to miss because the embeddings still “work” without prefixes, just with silently degraded quality.
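One way to make the prefix impossible to forget is to bury it inside a single embedding helper, so no call site ever passes raw text. A minimal sketch using Ollama’s standard `/api/embeddings` endpoint — the helper names and the assumption of a default local Ollama port are mine:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # default local Ollama port

def with_prefix(text: str, *, is_query: bool) -> str:
    """Apply the asymmetric task prefix nomic-embed-text was trained with."""
    prefix = "search_query: " if is_query else "search_document: "
    return prefix + text

def embed(text: str, *, is_query: bool) -> list[float]:
    """Embed one string via a locally running Ollama instance."""
    payload = json.dumps({
        "model": "nomic-embed-text",
        "prompt": with_prefix(text, is_query=is_query),
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]
```

Indexing calls `embed(chunk, is_query=False)`; query time calls `embed(question, is_query=True)`. The asymmetry lives in exactly one place.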
Deterministic Chunk IDs
Every chunk gets an ID based on its content hash:
content_hash = hashlib.md5(f"{source}:{idx}:{text[:200]}".encode()).hexdigest()[:12]
chunk_id = f"{filename_stem}_{idx}_{content_hash}"
This means re-running ingestion upserts rather than duplicating. I can add new docs to the vault and re-run python ingest.py safely – existing chunks keep their IDs, new chunks get added.
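Because the ID is a pure function of the chunk’s source, position, and content, re-ingesting the same file reproduces the same IDs. A sketch (the `chunk_id` wrapper is my framing of the two lines above):

```python
import hashlib

def chunk_id(source: str, idx: int, text: str, filename_stem: str) -> str:
    """Deterministic, content-addressed chunk ID -- stable across re-runs."""
    content_hash = hashlib.md5(
        f"{source}:{idx}:{text[:200]}".encode()
    ).hexdigest()[:12]
    return f"{filename_stem}_{idx}_{content_hash}"

# Stable IDs are what make ChromaDB's upsert idempotent, e.g.:
# collection.upsert(ids=[cid], documents=[text], embeddings=[vec])
```

If a chunk’s text changes, its hash (and therefore its ID) changes too, so the edited chunk is stored as a new entry rather than silently overwriting the old one.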
MCP Server: Querying the Vault from Cursor
The most useful extension was wrapping the search and Q&A functions as MCP tools. Now my coding agent can query my Obsidian notes during development sessions:
from fastmcp import FastMCP  # import path may differ by SDK version; check the repo

mcp = FastMCP("obsidian-vault")  # server name is illustrative

@mcp.tool()
def search_vault(query: str, top_k: int = 5) -> str:
    """Vector-only retrieval -- returns matching chunks with scores."""
    chunks = _search(query, top_k=top_k)
    # Format and return results...

@mcp.tool()
def ask_vault(question: str, top_k: int = 8) -> str:
    """Full RAG -- retrieval + LLM synthesis with source citations."""
    # Retrieve, build context, generate answer...

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
Configure it in Cursor’s .cursor/mcp.json and the vault becomes a tool the agent can call. When I’m writing code and need to check how an internal API works, the agent searches the vault instead of me context-switching to Obsidian.
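For reference, a minimal stdio entry in .cursor/mcp.json looks roughly like this — the script path and interpreter are illustrative, not necessarily what the repo uses:

```json
{
  "mcpServers": {
    "obsidian-rag": {
      "command": "python",
      "args": ["/path/to/obsidian-rag/server.py"]
    }
  }
}
```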
Numbers
- Files indexed: 268
- Chunks stored: ~4,700
- Ingest time: ~3 minutes
- Query latency: ~50ms embed + <5ms retrieval + LLM generation
What’s Next
- Embedding model eval harness – benchmark retrieval recall across nomic, OpenAI, and Cohere on my actual data
- Model routing – route simple queries to a fast model and complex queries to a capable one, tracking cost savings
The full code is at github.com/alexbeattie/obsidian-rag. PRs welcome.