Case Study
An AI knowledge graph that turns years of documents into a searchable, structured memory. Not a notes app. A retrieval-augmented generation system built on Obsidian and LightRAG.
An ingestion pipeline that takes years of accumulated documents (PDFs, DOCX, XLSX, emails, meeting notes, research, call transcripts) and converts them into a structured, searchable knowledge graph. The pipeline is fully automated: 10 parallel Claude Code agents process 30 files at a time, deduplicating, enriching, and organizing everything into an Obsidian vault with a PARA folder structure.
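The batching-and-parallelism shape described above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: the worker pool stands in for the Claude Code agents, and `process_batch` is a hypothetical placeholder for the real parse/dedupe/enrich work.

```python
import concurrent.futures
from pathlib import Path

BATCH_SIZE = 30   # files per batch, as described above
NUM_WORKERS = 10  # one worker per agent

def process_batch(files):
    """Placeholder for one agent's work: parse, deduplicate, enrich.
    Here it just returns the file names it was handed."""
    return [f.name for f in files]

def run_pipeline(source_dir):
    """Split the source tree into fixed-size batches and fan them out
    across a worker pool, collecting results in batch order."""
    files = sorted(p for p in Path(source_dir).rglob("*") if p.is_file())
    batches = [files[i:i + BATCH_SIZE] for i in range(0, len(files), BATCH_SIZE)]
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        for processed in pool.map(process_batch, batches):
            results.extend(processed)
    return results
```

With 10 workers and 30-file batches, a multi-thousand-file archive drains in a handful of passes; the real system additionally records each file's progress in a manifest.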
Once ingested, LightRAG indexes the entire vault and serves it as a retrieval-augmented generation endpoint. I can ask natural language questions about anything I've ever written, read, or received, and get grounded answers with source references. Not keyword search. Semantic retrieval over a knowledge graph.
The structured vault. PARA folder architecture (Projects, Areas, Resources, Archive) with domain-scoped CLAUDE.md router files in every directory to prevent hallucination and keep agents on-task.
The knowledge graph engine. Indexes the full vault, builds entity-relationship graphs, and serves semantic queries on port 9621. Answers come back grounded in actual documents, not AI imagination.
The ingestion pipeline. Reads raw files, extracts text, deduplicates (5,782 caught), filters low-value content (3,197 removed), and writes clean Obsidian-formatted notes with frontmatter and wikilinks.
The enrichment layer. 10 parallel agents process batches of 30 files. Each agent enriches notes with auto-generated wikilinks, frontmatter metadata, relationship tags, and domain classification. A SQLite manifest tracks every file through the pipeline.
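The SQLite manifest mentioned above could look something like the following. The schema and column names are assumptions for illustration; the case study only states that a manifest tracks each file's progress.

```python
import sqlite3

def init_manifest(db_path):
    """Create the manifest table that tracks each file through the pipeline."""
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS manifest (
            path    TEXT PRIMARY KEY,
            sha256  TEXT NOT NULL,
            status  TEXT NOT NULL DEFAULT 'pending',  -- e.g. pending/enriched/filed
            domain  TEXT,
            updated TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    con.commit()
    return con

def record(con, path, digest):
    """Register a newly seen file; re-runs are idempotent."""
    con.execute(
        "INSERT OR IGNORE INTO manifest (path, sha256) VALUES (?, ?)",
        (path, digest),
    )
    con.commit()

def mark(con, path, status, domain=None):
    """Advance a file's pipeline state, optionally assigning its domain."""
    con.execute(
        "UPDATE manifest SET status = ?, domain = COALESCE(?, domain), "
        "updated = CURRENT_TIMESTAMP WHERE path = ?",
        (status, domain, path),
    )
    con.commit()
```

A primary key on `path` plus `INSERT OR IGNORE` makes the pipeline restartable: a crashed batch can simply be re-run without double-processing files.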
Raw documents (PDFs, DOCX, XLSX, emails, meeting notes) are fed into the Python ingestion pipeline. Each file is parsed, its text extracted, and its contents hashed for deduplication.
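A minimal version of the hash-based deduplication step, assuming SHA-256 over raw bytes (the case study doesn't name the hash function): identical files produce identical digests, so grouping by digest surfaces exact duplicates.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """SHA-256 over file contents, read in chunks so large PDFs don't
    need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def exact_duplicates(paths):
    """Group paths by digest; any group larger than one is a duplicate set."""
    seen = {}
    for p in paths:
        seen.setdefault(file_digest(p), []).append(p)
    return [group for group in seen.values() if len(group) > 1]
```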
5,782 duplicate files identified and removed. Fuzzy matching catches near-duplicates: different versions of the same document, forwarded emails, copy-paste artifacts.
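Exact hashing misses the near-duplicates described above, since one changed byte changes the digest. One simple way to catch them, sketched here with the standard library's `difflib` (the actual matching method isn't specified), is a similarity ratio over extracted text with a threshold:

```python
import difflib

def near_duplicate(text_a, text_b, threshold=0.9):
    """True if the two texts are mostly the same characters in the same
    order. A forwarded email or re-saved version of a document scores
    close to 1.0; unrelated texts score much lower. The 0.9 threshold
    is an illustrative choice."""
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    return ratio >= threshold
```

For example, a note and its forwarded copy differ only by a "Fwd: " prefix, so the ratio stays well above 0.9 and the pair is flagged.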
3,197 low-value files removed: auto-generated reports, blank templates, system logs, receipts. Quality gate ensures only meaningful content enters the vault.
10 Claude Code agents process 30 files at a time. Each note gets frontmatter (date, source, domain, tags), auto-generated wikilinks to related notes, and relationship-tier tags for contact files.
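The enriched note format described above (YAML frontmatter plus wikilinks) can be assembled like this. The exact frontmatter keys and section layout are assumptions; Obsidian only requires the `---`-delimited YAML block at the top of the file.

```python
from datetime import date

def make_note(title, body, source, domain, tags, links):
    """Assemble an Obsidian note: YAML frontmatter, body, and a
    'Related' section of [[wikilinks]] to connected notes."""
    frontmatter = "\n".join([
        "---",
        f"date: {date.today().isoformat()}",
        f"source: {source}",
        f"domain: {domain}",
        "tags: [" + ", ".join(tags) + "]",
        "---",
    ])
    related = "\n".join(f"- [[{link}]]" for link in links)
    return f"{frontmatter}\n\n# {title}\n\n{body}\n\n## Related\n{related}\n"
```

Because wikilinks are plain text, LightRAG and Obsidian's own graph view both see the same relationships without any extra export step.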
Notes are sorted into the PARA structure. Domain-scoped CLAUDE.md router files in each directory define what belongs there and how agents should handle queries about that domain.
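A domain-scoped router file might look something like the sketch below. The directory name and rules are invented for illustration; the case study doesn't show the actual router contents.

```markdown
# CLAUDE.md — Projects/acme-redesign (hypothetical directory)

## Scope
Notes about the ACME redesign project only: briefs, decisions, call notes.

## Rules
- Answer only from notes in this directory; say so if a question is out of scope.
- Link people and documents with [[wikilinks]]; never invent names or dates.
- New notes land here only if their frontmatter has `domain: acme-redesign`.
```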
LightRAG indexes the full vault into a knowledge graph. Entity extraction, relationship mapping, and semantic embeddings make the entire archive queryable in natural language.
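Querying the endpoint from L6's port then becomes an HTTP call. A sketch, with assumptions flagged: the `/query` path and the `mode` field follow LightRAG's server API as I understand it (its hybrid mode combines graph and vector retrieval), but verify both against the installed version.

```python
import json
import urllib.request

LIGHTRAG_URL = "http://localhost:9621/query"  # path is an assumption

def build_query(question, mode="hybrid"):
    """Request payload for a LightRAG server query. Mode names
    ('naive', 'local', 'global', 'hybrid') assumed from LightRAG."""
    return {"query": question, "mode": mode}

def ask(question):
    """POST a natural-language question to the local LightRAG server.
    Untested network sketch; requires the server to be running."""
    req = urllib.request.Request(
        LIGHTRAG_URL,
        data=json.dumps(build_query(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```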
Most "second brain" setups are fancy note-taking. This one has teeth. CLAUDE.md router files in every directory prevent AI agents from hallucinating by scoping their context to the relevant domain. Contact files aren't address books. They have relationship tiers, interaction history, and context about what was discussed and agreed on.
The value compounds. Every document I process makes the knowledge graph smarter. Every new connection between notes surfaces relationships I didn't know existed. After 1,537 notes, asking "What did I decide about X six months ago?" actually returns a grounded, sourced answer.