Key Principle
Build the state machine first. Test it with zero LLM calls. Only then wrap the agentic shell around it. Phased construction isolates failure domains — you can determine whether a bug originates in logic (state machine) or generation (LLM). This is what makes a 15-state, 22-transition system debuggable by a solo developer.
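The "state machine first, zero LLM calls" phase can be sketched as follows. This is a minimal illustration, not the original system: the state names, event names, and the `StateMachine` class are hypothetical, and the real design has 15 states and 22 transitions. The point is that every transition is a table lookup, so the logic layer can be tested without any model in the loop.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# (current_state, event) -> next_state. All names are illustrative assumptions.
Transition = Tuple[str, str]


@dataclass
class StateMachine:
    state: str
    transitions: Dict[Transition, str]

    def fire(self, event: str) -> str:
        """Apply an event; reject anything the transition table does not allow."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"illegal transition: {event!r} from {self.state!r}")
        self.state = self.transitions[key]
        return self.state


def build_machine() -> StateMachine:
    return StateMachine(
        state="greeting",
        transitions={
            ("greeting", "user_ready"): "collect_goal",
            ("collect_goal", "goal_captured"): "confirm_goal",
            ("confirm_goal", "rejected"): "collect_goal",  # backward transition
            ("confirm_goal", "accepted"): "done",
        },
    )


def test_happy_path() -> None:
    machine = build_machine()
    for event in ["user_ready", "goal_captured", "accepted"]:
        machine.fire(event)
    assert machine.state == "done"


def test_state_skipping_is_rejected() -> None:
    machine = build_machine()
    try:
        machine.fire("accepted")  # tries to skip straight to the end
    except ValueError:
        return
    raise AssertionError("state skipping should have been rejected")


if __name__ == "__main__":
    test_happy_path()
    test_state_skipping_is_rejected()
```

Because the tests feed events directly, a failure here can only be a logic bug. Once the LLM shell is added, its job reduces to proposing events, and the same table rejects any attempt to skip states.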
"The products that struggled most — Woebot's consumer shutdown, Salesforce's slow adoption, ValidatorAI's shallow ceiling — failed not because the architecture was wrong but because of business model mismatches, instruction bloat, or insufficient deterministic control." (p. 1, chunk 006)
Why This Matters
Without phase separation, the same system becomes intractable. When state machine logic and LLM generation are interleaved from the start, every bug could originate in either layer. Instruction amnesia, silent extraction failures, and LLM state-skipping must be addressed in the initial architecture — retrofitting these controls requires structural state-machine changes, not parameter tuning on the LLM. (p. 1, chunk 006)
The failure taxonomy matters equally: if you misdiagnose a business model problem as an architecture problem, you rebuild what doesn't need rebuilding. Teams default to "the AI isn't good enough" when the fix is upstream — pricing, instruction pruning, or tightening the deterministic boundary.
Good Examples
Ada Health: Five years of silent R&D (2011-2016) building the Bayesian reasoning engine before adding any LLM layer. The diagnostic engine works without any LLM at all. (p. 5, chunk 003)
Token Budget Allocation Formula (Origin Financial): 15% system prompt, 25% state context, 20% relevant history, 15% tool definitions, 10% user query, 15% output buffer. Maintain state externally in a JSONB store. On each LLM call, compile a minimal working context from that store; never append the entire conversation history. (p. 4, chunk 003)
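The allocation formula above is simple arithmetic and can be sketched as below. The function name `allocate_tokens`, the dictionary keys, and the choice to hand rounding leftovers to the output buffer are assumptions made for illustration; only the percentages come from the source.

```python
from typing import Dict

# Percent shares from the 15/25/20/15/10/15 formula. Key names are hypothetical.
BUDGET_PERCENT: Dict[str, int] = {
    "system_prompt": 15,
    "state_context": 25,
    "relevant_history": 20,
    "tool_definitions": 15,
    "user_query": 10,
    "output_buffer": 15,
}


def allocate_tokens(context_window: int) -> Dict[str, int]:
    """Split a context window into per-section token budgets."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    # Integer math avoids floating-point rounding surprises.
    budget = {
        name: context_window * percent // 100
        for name, percent in BUDGET_PERCENT.items()
    }
    # Assumption: tokens lost to integer division go to the output buffer.
    budget["output_buffer"] += context_window - sum(budget.values())
    return budget


if __name__ == "__main__":
    for section, tokens in allocate_tokens(8000).items():
        print(f"{section:>17}: {tokens}")
```

For an 8,000-token window this yields 1,200 tokens for the system prompt, 2,000 for state context, 1,600 for history, 1,200 for tool definitions, 800 for the user query, and 1,200 reserved for output.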
Dynamic Prompt Debugging: Log the fully assembled prompt for every LLM call alongside the response. Audit pattern: {timestamp, inputState, assembledPrompt, llmResponse, guardsEvaluated, outputState, tokensUsed}. (p. 4, chunk 005)
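One possible shape for that audit record is sketched below. The field names mirror the audit pattern in the source; the field types, the `AuditRecord` class, the helper names, and the JSON-lines output format are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class AuditRecord:
    # Field names follow the audit pattern; the types are assumed.
    timestamp: str
    inputState: str
    assembledPrompt: str
    llmResponse: str
    guardsEvaluated: List[str]
    outputState: str
    tokensUsed: int


def make_record(
    input_state: str,
    assembled_prompt: str,
    llm_response: str,
    guards_evaluated: List[str],
    output_state: str,
    tokens_used: int,
) -> AuditRecord:
    """Stamp the record at creation so prompt and response share one timestamp."""
    return AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        inputState=input_state,
        assembledPrompt=assembled_prompt,
        llmResponse=llm_response,
        guardsEvaluated=guards_evaluated,
        outputState=output_state,
        tokensUsed=tokens_used,
    )


def to_json_line(record: AuditRecord) -> str:
    """Serialize to one JSON object per line, suitable for an append-only log."""
    return json.dumps(asdict(record), ensure_ascii=False)


if __name__ == "__main__":
    record = make_record(
        "collect_goal", "SYSTEM: ...", "{...}", ["goal_present"], "confirm_goal", 412
    )
    print(to_json_line(record))
```

Storing the assembled prompt verbatim, rather than the template plus its inputs, is what lets a failing call be replayed exactly.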
Counterpoints
Failure Misdiagnosis: Woebot's shutdown was a business model failure, not an architectural one. Salesforce's slow adoption stemmed from instruction bloat, not a capability gap. ValidatorAI's ceiling came from insufficient deterministic control, not LLM quality. The default attribution, "the AI isn't good enough," triggers wasted effort on model upgrades when the fix is upstream. (p. 1, chunk 006)
Complexity Sweet Spot: 15 states, 22 transitions — enough for real value delivery, few enough for solo-developer debuggability. Beyond ~100 rules, procedural decision trees become "impossibly fragile to maintain" (Neota Logic). (p. 2, chunk 004)
JSONB Concurrent Update Conflicts: Deep merges conflict when concurrent updates hit the same state object. Fix: use append-only semantics for arrays and define an explicit merge strategy per field type. (p. 4, chunk 005)
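The merge rules in that fix can be sketched as below. The source only specifies append-only arrays and an explicit strategy per field type, so the concrete strategies here (recursive merge for objects, last-writer-wins for scalars) and the function name `merge_state` are assumptions. The sketch covers merge semantics only; serializing concurrent writers at the database level is a separate concern.

```python
from typing import Any, Dict


def merge_state(current: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new state dict with one explicit strategy per field type."""
    merged = dict(current)  # shallow copy; nested values are rebuilt below
    for key, new_value in update.items():
        old_value = merged.get(key)
        if isinstance(old_value, list) and isinstance(new_value, list):
            # Arrays are append-only: never replace or reorder existing items.
            merged[key] = old_value + new_value
        elif isinstance(old_value, dict) and isinstance(new_value, dict):
            # Objects merge recursively, field by field.
            merged[key] = merge_state(old_value, new_value)
        else:
            # Assumed strategy for scalars and type changes: last writer wins.
            merged[key] = new_value
    return merged


if __name__ == "__main__":
    stored = {"goals": ["save"], "profile": {"name": "Ana"}, "step": 1}
    incoming = {"goals": ["invest"], "profile": {"age": 30}, "step": 2}
    print(merge_state(stored, incoming))
```

With append-only arrays, two writers that each add an item can be merged in either order without losing data, though the resulting order may differ between them.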
Key Quotes
"Build the state machine first. Test it with zero LLM calls. Only then wrap the agentic shell around it." (p. 1, chunk 006)
"the knowledge elicitation problem is consistently harder than the technology" (p. 2, chunk 004) — Neota Logic
Rules of Thumb
- Phase 1: Build state machine, test with zero LLM calls.
- Phase 2: Add LLM shell, test each state's prompt independently.
- Phase 3: Integration test with full conversation transcripts, evaluating F1 per command type.
- Budget tokens explicitly: 15/25/20/15/10/15 allocation formula.
- Log every assembled prompt — you cannot debug what you cannot reconstruct.
- Use LLMs to draft initial rule sets from reference material, then validate with domain experts.
- Every backward transition gets a maximum-attempts guard to prevent infinite loops.
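The maximum-attempts guard from the last rule above can be sketched as a per-transition counter. The class name `BackwardGuard`, the default limit of 3, and the fallback state in the usage example are hypothetical; the source specifies only that each backward transition needs such a guard.

```python
from collections import Counter
from typing import Tuple


class BackwardGuard:
    """Counts backward transitions and refuses them past a fixed limit."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.attempts: Counter = Counter()

    def allow(self, source: str, target: str) -> bool:
        key: Tuple[str, str] = (source, target)
        if self.attempts[key] >= self.max_attempts:
            return False  # limit reached; the caller must pick another path
        self.attempts[key] += 1
        return True


def next_state(guard: BackwardGuard, source: str, target: str, fallback: str) -> str:
    """Take the backward transition if allowed, otherwise go to the fallback."""
    return target if guard.allow(source, target) else fallback


if __name__ == "__main__":
    guard = BackwardGuard(max_attempts=2)
    for _ in range(3):
        # "escalate_to_human" is an illustrative fallback state.
        print(next_state(guard, "confirm_goal", "collect_goal", "escalate_to_human"))
```

The counter is keyed per transition so that exhausting retries on one loop does not block an unrelated backward edge.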
Related References
- The Dual-System Architecture Thesis - The architecture being implemented
- Master Pain Points Checklist - What to test against
- Context and Token Management - Token budget and context persistence details