what it does
rag-psych is a retrieval-augmented question-answering system over a local corpus of psychiatry / mental-health reference material. You type a clinical question; the system finds the most relevant passages in the corpus, has an LLM compose a grounded answer with citations back to those passages, and shows you the supporting passages alongside the answer so you can verify every claim.
what it offers
Grounded answers
Every factual claim in the response is followed by a [chunk_id] citation linking to the exact passage it came from. Click a citation to scroll to and highlight its chunk.
Source transparency
Retrieved passages are shown on the right with their source (clinical notes, research abstracts, or diagnostic references) colour-coded and labelled. No hidden reasoning.
Hallucination detection
Cited IDs that do not appear in the retrieved set are flagged in the answer and in a warning banner. The model does not get to quote things that weren't retrieved.
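A minimal sketch of what that integrity check can look like; the citation pattern and chunk IDs here are illustrative, not the project's actual format:

```python
import re

# Matches [chunk_id]-style citations in the generated answer.
CITATION_RE = re.compile(r"\[([A-Za-z0-9_\-]+)\]")

def unsupported_citations(answer, retrieved_ids):
    """Return cited IDs that never appeared in the retrieved set."""
    return [cid for cid in CITATION_RE.findall(answer) if cid not in retrieved_ids]

answer = ("GAD involves excessive worry on more days than not "
          "for at least 6 months [dsm_gad_01] [note_0042].")
retrieved = {"dsm_gad_01", "abstract_0017"}
print(unsupported_citations(answer, retrieved))  # ['note_0042'] -> flagged in the warning banner
```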
Insufficient-evidence refusal
When the corpus doesn't contain an answer, the system returns a canonical refusal string rather than inventing one. Off-topic queries trigger this at the retrieval layer with no LLM call.
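A sketch of that gate, assuming a rerank-score threshold and a fixed refusal string; both values are placeholders, not the project's real configuration:

```python
# Placeholder values; the real threshold and refusal wording live in the project's config.
REFUSAL = "I don't have enough evidence in the corpus to answer that."
MIN_SCORE = 0.35

def gate(reranked):
    """reranked: list of (chunk_id, score), best first. Returns the refusal string or None."""
    if not reranked or reranked[0][1] < MIN_SCORE:
        return REFUSAL   # nothing clears the bar: refuse without ever calling the LLM
    return None          # proceed to the answer step with the surviving passages

print(gate([]))                        # refusal (off-topic query retrieved nothing useful)
print(gate([("dsm_gad_01", 0.82)]))    # None -> go compose a grounded answer
```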
Negation-aware retrieval
Passages that deny the queried concept ("patient denies suicidal ideation") are filtered out before reaching the answer step, so they're never cited as positive evidence.
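A toy version of such a rule-based filter; the cue list and passage-level scoping are simplifications for illustration, not the project's actual rules:

```python
import re

# Illustrative cue list; the real rule set is broader and clinically tuned.
NEGATION = re.compile(r"\b(denies|no evidence of|negative for|ruled out)\b", re.IGNORECASE)

def drop_negated(passages, concept):
    """Drop passages where the queried concept appears alongside a negation cue."""
    kept = []
    for p in passages:
        if concept.lower() in p.lower() and NEGATION.search(p):
            continue  # e.g. "patient denies suicidal ideation" never reaches the answer step
        kept.append(p)
    return kept

print(drop_negated(
    ["Patient denies suicidal ideation or intent.",
     "Suicidal ideation was endorsed during the intake interview."],
    "suicidal ideation"))
```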
Hybrid retrieval
Three retrievers run in parallel (dense semantic search, BM25-style keyword search, and literal rare-token matching); their candidate pools are then combined with Reciprocal Rank Fusion and re-scored by a cross-encoder.
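A minimal, self-contained sketch of the fusion step; the chunk IDs are invented for illustration:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs; k=60 is the constant from the original RRF paper."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense   = ["c3", "c1", "c7"]   # dense semantic search
keyword = ["c1", "c9", "c3"]   # BM25-style keyword
literal = ["c9", "c3"]         # literal rare-token matching
print(reciprocal_rank_fusion([dense, keyword, literal]))
# fused pool, deduplicated by construction; the cross-encoder then re-scores the top of it
```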
what to ask
criteria for generalized anxiety disorder
essential features of post-traumatic stress disorder
diagnostic criteria for obsessive compulsive disorder
45-year-old female presenting with depressive symptoms and suicidal ideation
patient medication list including SSRI for depression
cognitive behavioral therapy outcomes for anxiety disorders in adolescents
psychosocial interventions for bipolar disorder
what does the literature say about the diagnostic criteria for depression
how is suicidal ideation assessed clinically and what is its prevalence
what it can't do
- Medical advice. Answers are grounded in reference material, not a clinician's judgement. Never use this to make a real diagnostic or treatment decision.
- Real-time information. The corpus is a snapshot. It doesn't know about new papers, guidelines, or drug approvals published after ingest time.
- PHI-sensitive work. All source data is public or de-identified. Do not paste identifiable patient information into queries.
- Multi-hop reasoning. Each query is answered in a single pass. Questions that need separate lookups and then a comparison ("is the criterion for X different in version Y vs Z?") are handled more loosely than direct lookups.
- Exact-string drug dosing. If the corpus doesn't contain a literal "drug name + dose" mention, the system may return same-class alternatives rather than the precise dose you asked for.
how a query flows
- Your query is embedded with a clinical-domain sentence encoder, and in parallel tokenised for keyword and rare-token lookups.
- Three retrievers run against the local vector database and return their top candidates independently.
- The candidate lists are fused by Reciprocal Rank Fusion, deduplicated, and re-scored by a cross-encoder reranker (the re-scoring step is sketched after this list).
- A rule-based negation filter drops any surviving passage in which the queried concept is denied, ruled out, or marked "negative for".
- If the best remaining passage clears a confidence threshold, the top-k are sent to a language model with a strict system prompt: answer only from these, cite every claim, refuse cleanly if the passages don't support an answer.
- The response is parsed for citation integrity — any cited ID not in the retrieved set is flagged before the answer is rendered.
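The re-scoring step from the third bullet is the easiest to picture in code. This sketch assumes the sentence-transformers CrossEncoder API and a widely used general-purpose MS-MARCO checkpoint, consistent with the reranker described further down; the exact model the project loads is not shown here:

```python
from sentence_transformers import CrossEncoder

# A common general-purpose MS-MARCO checkpoint; the project's actual reranker may differ.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "diagnostic criteria for generalized anxiety disorder"
fused_candidates = [
    "Generalized anxiety disorder: excessive anxiety and worry, more days than not, for at least 6 months.",
    "Patient reports improved sleep after starting sertraline 50 mg nightly.",
]

# Each (query, passage) pair is scored jointly; higher means more relevant.
scores = reranker.predict([(query, passage) for passage in fused_candidates])
for passage, score in sorted(zip(fused_candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:>7.3f}  {passage[:60]}")
```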
what it could offer next
- Per-source retrieval balance. Currently one source can crowd out the others when a query spans topics. Fetching top-K from each source independently before fusion would keep all three voices in the final answer (see the sketch after this list).
- Stronger reranker. The current cross-encoder is a general-purpose MS-MARCO model. Swapping to a clinical-tuned reranker would reduce the "case-study" bias observed in the eval set.
- Agentic follow-up queries. For multi-hop questions, giving the model a retrieval tool and letting it iterate would outperform the current single-pass design.
- Continuous evaluation. The eval harness already runs against 16 labelled queries. Running it on every ingest change and diffing the JSON outputs would catch regressions early.
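On the first item, one way the balancing could work is to keep the best few candidates per source before fusion; the constant, field names, and source labels below are placeholders:

```python
from collections import defaultdict

PER_SOURCE_K = 5  # placeholder; a sensible K would come from the eval harness

def balance_by_source(candidates):
    """candidates: dicts with 'chunk_id', 'source', 'score'. Keep the top-K per source."""
    by_source = defaultdict(list)
    for c in candidates:
        by_source[c["source"]].append(c)
    balanced = []
    for chunks in by_source.values():
        chunks.sort(key=lambda c: c["score"], reverse=True)
        balanced.extend(chunks[:PER_SOURCE_K])  # every source keeps a voice in the fused pool
    return balanced

pool = [
    {"chunk_id": "note_0042",  "source": "clinical_note",  "score": 0.91},
    {"chunk_id": "abs_0017",   "source": "abstract",       "score": 0.74},
    {"chunk_id": "dsm_gad_01", "source": "diagnostic_ref", "score": 0.88},
]
print([c["chunk_id"] for c in balance_by_source(pool)])
```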
behind the curtain
A live evaluation dashboard with per-query metrics, source-mix breakdowns, latency profile, and run history is available at /eval, password-protected so it doesn't leak eval numbers to casual visitors. Credentials come from the operator's .env.