what it does
rag-psych is a retrieval-augmented question-answering system over a local corpus of psychiatry / mental-health reference material. You type a clinical question; the system finds the most relevant passages in the corpus, has an LLM compose a grounded answer with citations back to those passages, and shows you the supporting passages alongside the answer so you can verify every claim.
what it offers
Grounded answers
Every factual claim in the response is followed by a [chunk_id] citation linking to the exact passage it came from. Click a citation to scroll to and highlight its chunk.
Source transparency
Retrieved passages are shown on the right with their source (clinical notes, research abstracts, or diagnostic references) colour-coded and labelled. No hidden reasoning.
Hallucination detection
Cited IDs that do not appear in the retrieved set are flagged in the answer and in a warning banner. The model does not get to quote things that weren't retrieved.
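A minimal sketch of what that integrity check can look like; the citation pattern and chunk IDs here are illustrative, not the project's actual format:

```python
import re

# Matches [chunk_id]-style citations in the generated answer.
CITATION_RE = re.compile(r"\[([A-Za-z0-9_\-]+)\]")

def unsupported_citations(answer, retrieved_ids):
    """Return cited IDs that never appeared in the retrieved set."""
    return [cid for cid in CITATION_RE.findall(answer) if cid not in retrieved_ids]

answer = ("GAD involves excessive worry on more days than not "
          "for at least 6 months [dsm_gad_01] [note_0042].")
retrieved = {"dsm_gad_01", "abstract_0017"}
print(unsupported_citations(answer, retrieved))  # ['note_0042'] -> flagged in the warning banner
```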
Insufficient-evidence refusal
When the corpus doesn't contain an answer, the system returns a canonical refusal string rather than inventing one. Off-topic queries trigger this at the retrieval layer with no LLM call.
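A sketch of that gate, assuming a rerank-score threshold and a fixed refusal string; both values are placeholders, not the project's real configuration:

```python
# Placeholder values; the real threshold and refusal wording live in the project's config.
REFUSAL = "I don't have enough evidence in the corpus to answer that."
MIN_SCORE = 0.35

def gate(reranked):
    """reranked: list of (chunk_id, score), best first. Returns the refusal string or None."""
    if not reranked or reranked[0][1] < MIN_SCORE:
        return REFUSAL   # nothing clears the bar: refuse without ever calling the LLM
    return None          # proceed to the answer step with the surviving passages

print(gate([]))                        # refusal (off-topic query retrieved nothing useful)
print(gate([("dsm_gad_01", 0.82)]))    # None -> go compose a grounded answer
```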
Negation-aware retrieval
Passages that deny the queried concept ("patient denies suicidal ideation") are filtered out before reaching the answer step, so they're never cited as positive evidence.
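A toy version of such a rule-based filter; the cue list and passage-level scoping are simplifications for illustration, not the project's actual rules:

```python
import re

# Illustrative cue list; the real rule set is broader and clinically tuned.
NEGATION = re.compile(r"\b(denies|no evidence of|negative for|ruled out)\b", re.IGNORECASE)

def drop_negated(passages, concept):
    """Drop passages where the queried concept appears alongside a negation cue."""
    kept = []
    for p in passages:
        if concept.lower() in p.lower() and NEGATION.search(p):
            continue  # e.g. "patient denies suicidal ideation" never reaches the answer step
        kept.append(p)
    return kept

print(drop_negated(
    ["Patient denies suicidal ideation or intent.",
     "Suicidal ideation was endorsed during the intake interview."],
    "suicidal ideation"))
```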
Hybrid retrieval
Three retrievers run in parallel (dense semantic search, BM25-style keyword search, and literal rare-token matching); their candidate pools are then combined with Reciprocal Rank Fusion and re-scored by a cross-encoder.
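A minimal, self-contained sketch of the fusion step; the chunk IDs are invented for illustration:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs; k=60 is the constant from the original RRF paper."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense   = ["c3", "c1", "c7"]   # dense semantic search
keyword = ["c1", "c9", "c3"]   # BM25-style keyword
literal = ["c9", "c3"]         # literal rare-token matching
print(reciprocal_rank_fusion([dense, keyword, literal]))
# fused pool, deduplicated by construction; the cross-encoder then re-scores the top of it
```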
what to ask
criteria for generalized anxiety disorder
essential features of post-traumatic stress disorder
diagnostic criteria for obsessive compulsive disorder
45-year-old female presenting with depressive symptoms and suicidal ideation
patient medication list including SSRI for depression
cognitive behavioral therapy outcomes for anxiety disorders in adolescents
psychosocial interventions for bipolar disorder
what does the literature say about the diagnostic criteria for depression
how is suicidal ideation assessed clinically and what is its prevalence
what it can't do
- Medical advice. Answers are grounded in reference material, not a clinician's judgement. Never use this to make a real diagnostic or treatment decision.
- Real-time information. The corpus is a snapshot. It doesn't know about new papers, guidelines, or drug approvals published after ingest time.
- PHI-sensitive work. All source data is public or de-identified. Do not paste identifiable patient information into queries.
- Multi-hop reasoning. Each query is answered in a single pass. Questions that need separate lookups and then a comparison ("is the criterion for X different in version Y vs Z?") are handled more loosely than direct lookups.
- Exact-string drug dosing. If the corpus doesn't contain a literal "drug name + dose" mention, the system may return same-class alternatives rather than the precise dose you asked for.
how a query flows
- Your query is embedded with a clinical-domain sentence encoder, and in parallel tokenised for keyword and rare-token lookups.
- Three retrievers run against the local vector database and return their top candidates independently.
- The candidate lists are fused by Reciprocal Rank Fusion, deduplicated, and re-scored by a cross-encoder reranker (the re-scoring step is sketched after this list).
- A rule-based negation filter drops any surviving passage in which the queried concept is denied, ruled out, or marked "negative for".
- If the best remaining passage clears a confidence threshold, the top-k are sent to a language model with a strict system prompt: answer only from these, cite every claim, refuse cleanly if the passages don't support an answer.
- The response is parsed for citation integrity — any cited ID not in the retrieved set is flagged before the answer is rendered.
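The re-scoring step from the third bullet is the easiest to picture in code. This sketch assumes the sentence-transformers CrossEncoder API and a widely used general-purpose MS-MARCO checkpoint, consistent with the reranker described further down; the exact model the project loads is not shown here:

```python
from sentence_transformers import CrossEncoder

# A common general-purpose MS-MARCO checkpoint; the project's actual reranker may differ.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "diagnostic criteria for generalized anxiety disorder"
fused_candidates = [
    "Generalized anxiety disorder: excessive anxiety and worry, more days than not, for at least 6 months.",
    "Patient reports improved sleep after starting sertraline 50 mg nightly.",
]

# Each (query, passage) pair is scored jointly; higher means more relevant.
scores = reranker.predict([(query, passage) for passage in fused_candidates])
for passage, score in sorted(zip(fused_candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:>7.3f}  {passage[:60]}")
```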
what it could offer next
- Per-source retrieval balance. Currently one source can crowd out the others when a query spans topics. Fetching top-K from each source independently before fusion would keep all three voices in the final answer (see the sketch after this list).
- Stronger reranker. The current cross-encoder is a general-purpose MS-MARCO model. Swapping to a clinical-tuned reranker would reduce the "case-study" bias observed in the eval set.
- Agentic follow-up queries. For multi-hop questions, giving the model a retrieval tool and letting it iterate would outperform the current single-pass design.
- Continuous evaluation. The eval harness already runs against 16 labelled queries. Running it on every ingest change and diffing the JSON outputs would catch regressions early.
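On the first item, one way the balancing could work is to keep the best few candidates per source before fusion; the constant, field names, and source labels below are placeholders:

```python
from collections import defaultdict

PER_SOURCE_K = 5  # placeholder; a sensible K would come from the eval harness

def balance_by_source(candidates):
    """candidates: dicts with 'chunk_id', 'source', 'score'. Keep the top-K per source."""
    by_source = defaultdict(list)
    for c in candidates:
        by_source[c["source"]].append(c)
    balanced = []
    for chunks in by_source.values():
        chunks.sort(key=lambda c: c["score"], reverse=True)
        balanced.extend(chunks[:PER_SOURCE_K])  # every source keeps a voice in the fused pool
    return balanced

pool = [
    {"chunk_id": "note_0042",  "source": "clinical_note",  "score": 0.91},
    {"chunk_id": "abs_0017",   "source": "abstract",       "score": 0.74},
    {"chunk_id": "dsm_gad_01", "source": "diagnostic_ref", "score": 0.88},
]
print([c["chunk_id"] for c in balance_by_source(pool)])
```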
behind the curtain
A live evaluation dashboard with per-query metrics, source-mix breakdowns, latency profile, and run history is available at /eval, password-protected so it doesn't leak eval numbers to casual visitors. Credentials come from the operator's .env.