Project · Technical Writeup · 2026
A bounded symptom reasoning service that turns a knee complaint into controlled evidence, asks one focused question at a time, and stops with a shortlist, fallback, or safety escalation.
Diagnostic Engine is a deliberately bounded reasoning system for knee complaints. The goal is neither to behave like a general medical chatbot nor to present itself as a diagnostic tool. The useful boundary is smaller: accept an opening complaint in free text, convert only justified details into structured evidence, ask a short targeted follow-up round, and stop cleanly once the system has either earned a shortlist, failed to earn one, or hit a safety rule.
That scope is what makes the architecture interesting. Instead of relying on one large prompt to hold the entire reasoning process together, the project moves medical logic into registries, pushes free text through a constrained intake layer, and runs the rest of the loop on symptom keys, question definitions, and deterministic scoring rules.
The runtime follows a governed evidence loop rather than an open-ended chat loop. After the first complaint is parsed, the engine decides between safety escalation, candidate scoring, and the next single discriminating question. The important architectural move is that everything after intake runs on the same bounded symptom state instead of on fresh free-text interpretation each round.
The core design is not "ask an LLM what the complaint sounds like" over and over. The core design is to convert the opening story into controlled evidence once, then keep all later reasoning tied to the same bounded state. That makes the system auditable and also changes what a question is for. A question is not there to continue the conversation. It is there to confirm, negate, or separate concrete symptom signals that matter to the scorer.
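A minimal sketch of what "bounded state" means in code. The names and symptom keys below are illustrative, not the project's actual identifiers: evidence can only be recorded against keys that exist in the registry, so every later component reasons over the same closed vocabulary.

```typescript
// Illustrative bounded symptom state (hypothetical names, not the
// project's real identifiers).

type EvidenceStatus = "explicit" | "inferred" | "low_confidence";

interface SymptomEvidence {
  symptomId: string;        // must be a registry key
  value: boolean | number;  // presence flag or scale value
  status: EvidenceStatus;   // how this evidence was earned
}

// A closed vocabulary stands in for the symptom registry.
const SYMPTOM_REGISTRY = new Set(["knee_swelling", "locking_sensation", "pain_on_stairs"]);

function recordEvidence(state: Map<string, SymptomEvidence>, evidence: SymptomEvidence): void {
  // Reject anything outside the registry instead of silently
  // inventing a new symptom key.
  if (!SYMPTOM_REGISTRY.has(evidence.symptomId)) {
    throw new Error(`unknown symptom: ${evidence.symptomId}`);
  }
  state.set(evidence.symptomId, evidence);
}
```

The throw-on-unknown behavior is the whole point: a parser or question handler cannot widen the vocabulary on its own.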
| System surface | What it represents | Why it exists |
|---|---|---|
| symptom registry | Canonical symptom IDs, value types, categories, and scale labels. | The system needs one stable vocabulary so every later step talks about the same evidence model. |
| question bank | Authored follow-up prompts, gating rules, and answer-to-symptom mappings. | Questions become deterministic evidence updates rather than loose conversational turns. |
| disease definitions | Supports, anti-symptoms, contradiction logic, and stage-specific weighting. | Fit scoring stays in explicit rules that can be inspected, edited, and benchmarked. |
| session symptom state | The live evidence map for the current interview. | Free-text intake and form answers converge into one shared state instead of two competing interpretations. |
This is what gives the system its shape. The intake parser can infer or tentatively map evidence, but it still has to land inside the same bounded symptom vocabulary as the rest of the engine. The scorer can only rank against what has been earned into that state. The selector can only ask about unknown or weakly supported signals that exist in the registry. The whole loop stays narrow because every component is constrained by the same evidence model.
Evidence status is also part of the design. The session distinguishes between explicit evidence, inferred evidence, and low-confidence evidence. That matters because it lets the engine treat the opening complaint as useful but provisional, then let later answers overwrite weaker assumptions with cleaner signals. In practice, that is what makes the follow-up rounds feel purposeful instead of repetitive.
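The overwrite rule can be sketched as a status-ranked merge. This is one plausible shape, not the engine's actual merge logic: an explicit answer replaces an inferred or low-confidence guess, but a weak guess never clobbers an explicit answer.

```typescript
// Hypothetical merge rule: higher-ranked evidence status wins.
type EvidenceStatus = "explicit" | "inferred" | "low_confidence";

const STATUS_RANK: Record<EvidenceStatus, number> = {
  explicit: 3,
  inferred: 2,
  low_confidence: 1,
};

interface Evidence {
  value: boolean;
  status: EvidenceStatus;
}

function merge(current: Evidence | undefined, incoming: Evidence): Evidence {
  if (!current) return incoming;
  // Ties go to the newer signal, so a fresh explicit answer can
  // correct an earlier explicit one.
  return STATUS_RANK[incoming.status] >= STATUS_RANK[current.status] ? incoming : current;
}
```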
Session start and answer submission both run through the same evaluation loop. The system parses or merges evidence, appends ledger events, checks safety, scores candidates, and either returns a single compiled question or one of the final outcomes. That matters because the engine never switches reasoning modes halfway through. The opening story and the later answers are both just ways of updating the same session state.
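The shared loop can be sketched as a single evaluation function. All names here are illustrative; the point is the fixed ordering: safety first, then shortlist, then at most one more question, then fallback.

```typescript
// Hypothetical evaluation step shared by session start and answer
// submission. Only the evidence-update step that precedes it differs.

type Outcome =
  | { kind: "escalation"; message: string }
  | { kind: "candidates"; shortlist: string[] }
  | { kind: "question"; questionId: string }
  | { kind: "fallback" };

interface Session {
  redFlag: boolean;
  shortlist: string[];
  confident: boolean;
  nextQuestionId: string | null;
  questionsAsked: number;
  questionBudget: number;
}

function evaluate(session: Session): Outcome {
  // 1. Safety first: red flags bypass ranking entirely.
  if (session.redFlag) {
    return { kind: "escalation", message: "seek urgent in-person review" };
  }
  // 2. A decisive shortlist, or no useful question left, ends the loop.
  if (session.confident || (session.nextQuestionId === null && session.shortlist.length > 0)) {
    return { kind: "candidates", shortlist: session.shortlist };
  }
  // 3. Otherwise ask exactly one more question if the budget allows.
  if (session.questionsAsked < session.questionBudget && session.nextQuestionId !== null) {
    return { kind: "question", questionId: session.nextQuestionId };
  }
  // 4. Weak evidence after the budget is spent: stop honestly.
  return { kind: "fallback" };
}
```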
The intake layer maps only justified details into structured symptom evidence and produces a compact summary of what the system thinks it heard. Unclear facts are left unresolved instead of being silently guessed.
Fever with a hot or red knee, visible deformity, or major trauma can short-circuit the normal candidate loop and return an escalation message immediately.
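A red-flag predicate in miniature, using hypothetical symptom keys. The structural point is that this check runs before scoring, so an urgent pattern never has to out-rank the disease candidates.

```typescript
// Illustrative safety check; the key names are invented for the sketch.
type SymptomState = Map<string, boolean>;

function hasRedFlag(state: SymptomState): boolean {
  const present = (id: string) => state.get(id) === true;
  // Fever plus a hot or red knee suggests possible septic arthritis.
  const septicPattern = present("fever") && (present("knee_hot") || present("knee_red"));
  return septicPattern || present("visible_deformity") || present("major_trauma");
}
```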
Candidate fit depends on supports, anti-symptoms, contradictions, evidence status, and stage profiles defined in the registries. Given the same symptom state, the scorer returns the same result.
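A deterministic fit scorer in miniature. The weights and rules below are invented for illustration; the real logic lives in the disease registries. Supports add weight, anti-symptoms subtract, and a contradiction disqualifies the candidate outright.

```typescript
// Hypothetical disease definition and pure scoring function: same
// symptom state in, same score out, every time.

interface DiseaseDefinition {
  id: string;
  supports: Record<string, number>;      // symptomId -> weight
  antiSymptoms: Record<string, number>;  // symptomId -> penalty
  contradictions: string[];              // symptomIds that rule it out
}

function scoreCandidate(def: DiseaseDefinition, state: Map<string, boolean>): number | null {
  for (const id of def.contradictions) {
    if (state.get(id) === true) return null; // hard contradiction, not a low score
  }
  let score = 0;
  for (const [id, w] of Object.entries(def.supports)) {
    if (state.get(id) === true) score += w;
  }
  for (const [id, p] of Object.entries(def.antiSymptoms)) {
    if (state.get(id) === true) score -= p;
  }
  return score;
}
```

Returning `null` rather than a very low number keeps "ruled out" distinct from "weakly supported", which is exactly the kind of rule that stays inspectable in a registry.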
The selector chooses the smallest next question set that best separates the remaining candidates. In the current engine, that means exactly one active question at a time, which keeps each round tied to one specific clarification goal.
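One plausible selector heuristic, sketched under assumptions (this is not the engine's actual rule): among symptoms still unknown in the session state, prefer the one that splits the remaining candidates most evenly, since a symptom shared by every candidate separates nothing.

```typescript
// Hypothetical discriminating-question picker.
function pickNextSymptom(
  candidates: { id: string; supports: string[] }[],
  known: Set<string>,
): string | null {
  let best: string | null = null;
  let bestBalance = -1;
  const allSymptoms = new Set(candidates.flatMap((c) => c.supports));
  for (const symptom of allSymptoms) {
    if (known.has(symptom)) continue; // never re-ask earned evidence
    const yes = candidates.filter((c) => c.supports.includes(symptom)).length;
    // A symptom supported by roughly half the candidates separates best.
    const balance = Math.min(yes, candidates.length - yes);
    if (balance > bestBalance) {
      bestBalance = balance;
      best = symptom;
    }
  }
  return best; // null means no useful question remains
}
```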
The loop ends with a shortlist, fallback, or escalation. The important product choice is that fallback is treated as a real answer state, not as something the system tries to smooth over with fake confidence.
| Outcome | When it appears | Why it matters |
|---|---|---|
| candidates | A confident shortlist exists and the lead is decisive enough, or no more useful questions remain. | The engine only shows a shortlist once it has actually earned a bounded fit result. |
| fallback | Evidence stays too weak or ambiguous after the questioning budget is used up. | The system would rather stop cautiously than pretend certainty it does not have. |
| escalation | Safety logic fires before or during the normal ranking loop. | Urgent patterns are treated as urgent, not diluted into a candidate score. |
The strongest architectural choice here is not the scorer itself. It is the decision to keep safety logic outside disease ranking. In a lot of lightweight diagnostic demos, urgency is just another score feature, which means the same surface is expected to do both differential reasoning and red-flag triage. This engine avoids that. A hot, red knee with fever does not need to win a candidate race before the system can say the user needs urgent in-person review.
The second useful decision is that the engine is allowed to fail cleanly. A shortlist is only one of the permitted endings. Fallback is equally part of the architecture. That keeps the system honest under weak evidence and prevents the scorer from being forced into fake confidence just because the product needs a satisfying ending.
The engine is useful precisely because it does not try to behave like a universal medical intelligence. Its scope is narrow, the outcomes are bounded, and fallback is explicit. That creates a more honest product surface and a cleaner engineering surface at the same time.
The session ledger supports that honesty. The system can show what it parsed, what it asked, and where it stopped, instead of only exposing a polished final sentence with no audit path behind it.
The project keeps an evolving session object with symptom state, parser output, candidate state, question log, and the latest compiled form. Ledger entries are appended for material transitions such as session creation, parse merge, answer recording, candidate flagging, fallback, and safety escalation.
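An append-only ledger in miniature, with illustrative event names drawn from the transitions above. Each material transition appends one immutable entry, so the final outcome can always be traced back through parse, questions, and stop.

```typescript
// Hypothetical session ledger sketch; nothing is ever rewritten,
// only appended.
type LedgerEvent =
  | "session_created"
  | "parse_merged"
  | "answer_recorded"
  | "candidate_flagged"
  | "fallback"
  | "safety_escalation";

interface LedgerEntry {
  seq: number;
  event: LedgerEvent;
  detail: string;
}

class SessionLedger {
  private entries: LedgerEntry[] = [];

  append(event: LedgerEvent, detail: string): void {
    this.entries.push({ seq: this.entries.length, event, detail });
  }

  // Expose a read-only audit trail for the UI or debugging.
  trail(): readonly LedgerEntry[] {
    return this.entries;
  }
}
```

In the hosted deployment such entries would live as database rows rather than in memory, which is what lets the stateless runtime reload an interview mid-flight.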
That persistence layer is what makes the app feel more like a small governed service than a front-end toy. The hosted deployment uses Supabase-backed sessions and ledger rows so the Vercel runtime can stay stateless while the interview state remains reloadable.
This is still a small, intentionally narrow system. The current model covers ACL tear, meniscal tear, patellofemoral pain syndrome, and knee osteoarthritis. It is a governed knee-only reasoning loop, not a general diagnostic platform.
A few current constraints are especially worth being explicit about: