Key Principle
Three clinical products (Woebot Health, Ada Health, and Limbic Access) independently validated the "thick deterministic core, thin LLM shell" pattern in one of the most heavily regulated settings there is. Each demonstrates a different facet: Woebot shows that deliberate non-generation works clinically, Ada shows Bayesian reasoning outperforming LLM reasoning for diagnostics, and Limbic shows that structured data collection, not AI insight, drives outcomes.
Why This Matters
Clinical AI faces the highest possible stakes for getting the architecture wrong. These products passed regulatory scrutiny (FDA Breakthrough Device, CE Class IIa medical devices, UKCA certification) precisely because deterministic logic governs all critical paths. Their lessons apply to any domain where assessment accuracy and auditability matter.
Good Examples
Woebot Health (Composite 9/10, $123M invested):
- Every response is written by conversational designers and clinical experts. The NLP layer interprets; it never generates novel therapeutic text — a clinical design choice validated by a 36,070-user landmark study. (pp. 2-3, chunk 003)
- Crisis detection runs as pre-processing middleware before any LLM processing. Two tiers: fast-path regex (typing "SOS") and NLP classification for subtler language. Kill signals short-circuit the entire state machine. (pp. 2-3, chunk 003)
- The LLM Eagerness Problem: in thought-challenging exercises, the LLM skipped ahead and reframed cognitive distortions instead of guiding the user to identify them. (p. 3, chunk 003)
- Confirmation loops: "It sounds like there's a couple of things here, feeling low and problems with relationships, is that true?" The check-back is both empathic and a guard against misclassification. (p. 3, chunk 003)
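The crisis-detection bullet above can be sketched as a pre-processor that wraps the state machine. This is a minimal illustration, not Woebot's implementation: only the "SOS" keyword comes from the text, while the phrase list, `classify_crisis`, `CRISIS_REPLY`, and `handle_turn` are hypothetical names and stand-ins.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Tier 1: fast-path regex. The "SOS" keyword is from the text; the pattern is assumed.
FAST_PATH = re.compile(r"\bsos\b", re.IGNORECASE)

# Hypothetical pre-written crisis message (a real one would be clinically authored).
CRISIS_REPLY = "I'm connecting you with crisis resources now."


@dataclass
class Turn:
    reply: str
    crisis: bool


def classify_crisis(text: str) -> bool:
    """Tier 2: hypothetical stand-in for an NLP classifier (simple phrase match)."""
    phrases = ("hurt myself", "end it all")
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


def handle_turn(text: str, state_machine: Callable[[str], str]) -> Turn:
    # Both tiers run BEFORE the state machine sees the message, so a kill
    # signal short-circuits everything downstream instead of being a branch.
    if FAST_PATH.search(text) or classify_crisis(text):
        return Turn(reply=CRISIS_REPLY, crisis=True)
    return Turn(reply=state_machine(text), crisis=False)


seen = []  # messages that actually reached the state machine


def demo_state_machine(text: str) -> str:
    seen.append(text)
    return "Tell me more about that."


crisis_turn = handle_turn("sos", demo_state_machine)
normal_turn = handle_turn("I had a rough day", demo_state_machine)
```

Because the check wraps the state machine rather than living inside it, no conversational state can route around it.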
Ada Health (Composite 8/10, 13M+ users, 32M+ assessments):
- Bayesian probabilistic reasoning engine makes ALL decisions. NLP is strictly a translator mapping natural language to structured "findings" in Ada's Medical Description Language (3,600+ conditions, 50+ in-house doctors). (p. 5, chunk 003)
- Information-theoretic question selection: each question is chosen to maximally reduce diagnostic uncertainty. Assessments take 6-8 minutes with dynamic question count. (p. 5, chunk 003)
- Three-valued logic: Present/Absent/Unknown. Skipped questions don't bias hypotheses. Full re-computation from complete evidence set on every input prevents state corruption. (p. 5, chunk 003)
- Five years of silent R&D (2011-2016) before launch. (p. 5, chunk 003)
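The question-selection and three-valued-logic bullets above can be sketched with a toy naive Bayes model. This is an illustration under stated assumptions, not Ada's engine: the conditions, priors, likelihoods, and the conditional-independence assumption are all invented, and `posterior`, `entropy`, and `next_question` are hypothetical names.

```python
import math
from typing import Dict, Optional

# Toy knowledge base; every number here is invented for illustration.
PRIORS = {"flu": 0.3, "cold": 0.6, "allergy": 0.1}
# P(finding is present | condition)
LIKELIHOOD = {
    "fever": {"flu": 0.9, "cold": 0.2, "allergy": 0.05},
    "sneezing": {"flu": 0.3, "cold": 0.8, "allergy": 0.9},
    "itchy_eyes": {"flu": 0.05, "cold": 0.1, "allergy": 0.8},
}

# Three-valued evidence: True = present, False = absent, None = unknown.
Evidence = Dict[str, Optional[bool]]


def posterior(evidence: Evidence) -> Dict[str, float]:
    """Recompute from the priors over the complete evidence set (no carried state)."""
    scores = dict(PRIORS)
    for finding, value in evidence.items():
        if value is None:
            continue  # unknown or skipped: contributes no likelihood term
        for cond in scores:
            p = LIKELIHOOD[finding][cond]
            scores[cond] *= p if value else (1.0 - p)
    total = sum(scores.values())
    return {cond: s / total for cond, s in scores.items()}


def entropy(dist: Dict[str, float]) -> float:
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def next_question(evidence: Evidence) -> Optional[str]:
    """Pick the unasked finding with the largest expected entropy reduction."""
    current = posterior(evidence)
    best, best_gain = None, 0.0
    for finding in LIKELIHOOD:
        if finding in evidence:  # already asked, even if answered "unknown"
            continue
        p_yes = sum(current[c] * LIKELIHOOD[finding][c] for c in current)
        h_yes = entropy(posterior({**evidence, finding: True}))
        h_no = entropy(posterior({**evidence, finding: False}))
        gain = entropy(current) - (p_yes * h_yes + (1.0 - p_yes) * h_no)
        if gain > best_gain:
            best, best_gain = finding, gain
    return best
```

Recomputing from the priors on every call means answer order cannot matter and a skipped question leaves the hypotheses exactly where they were.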
Limbic Access (Composite 8/10, 64,862-patient outcomes study):
- "The main driver of improved outcomes was collecting clinically relevant information ahead of the human assessment — not the AI's therapeutic capability." (p. 1, chunk 004)
- The Limbic Layer middleware (~14 patents) intercepts every LLM response. Clinically unsafe responses are regenerated in milliseconds, invisibly. (p. 1, chunk 004)
- Testing pyramid: automated AI-vs-AI testing, clinical red teaming, non-clinical population testing, continuous post-launch monitoring. (p. 1, chunk 004)
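The intercept-and-regenerate behaviour described for the Limbic Layer can be sketched as a guard around the generator. This is a rough sketch only: the source does not describe the safety criteria, so `is_clinically_safe` is a hypothetical phrase check, and the retry limit and `FALLBACK` reply are assumptions added so the loop always terminates.

```python
from typing import Callable

# Hypothetical pre-approved reply used when no safe draft is produced (an assumption).
FALLBACK = "Thanks for telling me. I'll pass this on to your clinician."


def is_clinically_safe(text: str) -> bool:
    """Hypothetical stand-in for a clinical safety check."""
    banned = ("stop taking your medication",)
    lowered = text.lower()
    return not any(phrase in lowered for phrase in banned)


def guarded_reply(prompt: str, generate: Callable[[str], str], max_attempts: int = 3) -> str:
    # Every draft is checked before the user sees it; unsafe drafts are
    # discarded and regenerated without being shown.
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if is_clinically_safe(candidate):
            return candidate
    return FALLBACK


# Demo generator: the first draft is unsafe, the second is acceptable.
drafts = iter([
    "You should stop taking your medication.",
    "Thanks, I've noted that for your clinician.",
])
reply = guarded_reply("I feel worse on my tablets", lambda _prompt: next(drafts))
```

The user only ever sees `reply`; the rejected first draft never leaves the guard.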
Counterpoints
- Woebot's Consumer Shutdown: Despite 1.5M+ D2C users and an FDA Breakthrough Device designation, the consumer app shut down in June 2025. Root cause: the FDA has no clear pathway for LLM-based therapeutic tools. "Architecture is not enough": the business model must work too. (pp. 2-3, chunk 003)
- Ada's Knowledge Base Regression: A JMIR study found that diagnostic accuracy in ophthalmology worsened between 2018 and 2020 despite overall improvements, showing that knowledge base updates can introduce regressions. Fix: separate the creation and validation teams, and update biweekly. (p. 5, chunk 003)
- Limbic's Evidence Gap: Both major published papers are observational studies, not randomized controlled trials (RCTs). (p. 1, chunk 004)
Key Quotes
"generative AI is best used to augment well-structured conversational agents" (p. 2, chunk 003) — Woebot BUILD study
"the main driver of improved outcomes was collecting clinically relevant information ahead of the human assessment — not the AI's therapeutic capability." (p. 1, chunk 004) — Limbic
Rules of Thumb
- Kill signals must short-circuit the entire state machine — they are pre-processors, not branches.
- Confirmation loops catch misclassification at the point of error.
- Use three-valued logic: never force binary on incomplete data.
- Separate knowledge creation from knowledge validation with distinct teams.
- Build for structured data collection as primary value, not AI-generated insight.
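The confirmation-loop rule above can be sketched as a gate between classification and the stored record. This is a minimal sketch with hypothetical names (`Intake`, `reflect`, `confirm`); only the wording of the check-back question follows the Woebot example quoted earlier.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Intake:
    confirmed: List[str] = field(default_factory=list)
    needs_reask: bool = False


def reflect(labels: List[str]) -> str:
    """Phrase the classifier's labels back to the user as a yes/no check."""
    joined = " and ".join(labels)
    return f"It sounds like there's a couple of things here, {joined}, is that true?"


def confirm(labels: List[str], ask: Callable[[str], bool]) -> Intake:
    # Labels are committed only after the user agrees; a "no" is caught at
    # the point of error and triggers a re-ask instead of a silent misrecord.
    if labels and ask(reflect(labels)):
        return Intake(confirmed=list(labels))
    return Intake(needs_reask=True)


labels = ["feeling low", "problems with relationships"]
accepted = confirm(labels, lambda _question: True)
rejected = confirm(labels, lambda _question: False)
```

Nothing reaches the structured record without an explicit "yes", so a misclassification costs one extra question rather than a wrong assessment.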
Related References
- The Dual-System Architecture Thesis - The architecture these products validate
- Conversational Assessment Design - Assessment patterns derived from clinical lessons
- Financial and Legal Domain Case Studies - Complementary regulated domain studies