Collected Heuristics and Design Rules - Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots

Key Principle

This reference synthesizes actionable heuristics from across the book into categorical rules. Each rule is grounded in evidence or failure cases from the source material. The organizing insight is that most dialogue system failures stem from a small number of recurring mistakes: choosing the wrong paradigm for the task, ignoring error propagation, optimizing the wrong metric, and assuming scale substitutes for structure.

Why This Matters

Conversational AI spans too many sub-disciplines for any practitioner to hold all considerations simultaneously. These heuristics serve as a decision-support checklist -- a compressed set of hard-won lessons from decades of research and deployment, drawn from rule-based, statistical, neural, and hybrid systems alike. The book's core thesis -- that each paradigm solves prior limitations while creating new ones (p. 12) -- makes cross-paradigm pattern recognition essential.

Architecture

Start every project by asking: task-oriented or open-domain? The answer determines architecture, evaluation criteria, and design approach. Conflating them leads to mismatched requirements (p. 11).
Use the three-type taxonomy to select dialogue management: user-initiated (simple Q&A), system-directed (slot-filling, proactive, instructional), or multi-turn open-domain. Each type requires a fundamentally different DM architecture (p. 30).
The modular pipeline (ASR -> NLU -> DM -> NLG -> TTS) enables independent component development but creates cumulative error propagation. "What the system outputs as its representation of the user's input does not correspond to what the user actually intended" (p. 44). N-best lists are the primary self-correction mechanism: they create downstream re-scoring opportunities (p. 45).
End-to-end systems eliminate inter-module error propagation but lose modular interpretability and control. Neither extreme is optimal in isolation (p. 126-127).
Hybrid architectures outperform pure approaches in deployment. Use rules for control and safety, statistics for robustness under uncertainty, and neural models for naturalness and coverage (p. 175).
Toolkit choice is an architectural commitment. Commercial platforms constrain internal access; research toolkits lack production scalability. Choose based on required complexity, not feature lists (p. 68, p. 186).
Visual design tools hit a complexity ceiling on the order of hundreds of dialogue paths. Beyond that, migrate to code-based toolkits (p. 66).
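
The N-best rule above can be illustrated with a minimal sketch of downstream re-scoring. It assumes the recognizer exposes a confidence per hypothesis and that the dialogue state knows which slot values it currently expects. The `Hypothesis` class, the `rescore` function, and the `bonus` weight are hypothetical names chosen for illustration, not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str         # one ASR transcript candidate
    asr_score: float  # recognizer confidence, assumed to lie in [0, 1]


def rescore(nbest, expected_slot_values, bonus=0.3):
    """Pick the hypothesis with the best combined score.

    Hypothetical heuristic: a hypothesis earns a fixed bonus when it
    mentions a value that the current dialogue state expects.
    """
    def combined(h):
        mentions_expected = any(v in h.text.lower() for v in expected_slot_values)
        return h.asr_score + (bonus if mentions_expected else 0.0)

    return max(nbest, key=combined)


nbest = [
    Hypothesis("a ticket to austin", 0.48),  # top ASR choice
    Hypothesis("a ticket to boston", 0.45),  # runner-up
]
# The system only serves these destinations, so the runner-up wins.
best = rescore(nbest, expected_slot_values={"boston", "denver"})
```

Keeping the full list rather than only the 1-best hypothesis is what gives the later module a chance to override the recognizer.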

NLU

Frame NLU as classification (intents) + sequence labeling (entities), not full syntactic parsing. Shallow parsing extracts "just enough meaning" and is more amenable to statistical robustness (p. 75).
Joint intent-entity models outperform separate pipeline stages because they capture correlations between what the user wants and the details they provide (p. 76).
Pure ML intent classifiers group by surface similarity, but surface similarity does not entail identical goals. "I bought trousers, how do I return them?" clusters with purchase utterances, yet the operative intent is return (p. 61). Consider hybrid NLU for precision-critical domains.
Implement explicit context mechanisms for follow-up utterances. Without them, each turn is interpreted in isolation (p. 61).
Anticipate cross-domain slot schema mismatches. The same entity gets different slot names across domains (p. 63).
Hand-written grammar rules cannot scale: synonymy, word order variation, and disfluency create combinatorial growth that makes rule enumeration structurally impractical (p. 73).
Statistical dialogue state tracking provides disproportionate value precisely where rule-based approaches fail most -- systems with poor ASR (p. 154).
Domain classification matters for multi-domain assistants: users switch domains mid-conversation, and without detection, contextual ambiguity compounds (p. 48).
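
The classification-plus-sequence-labeling framing from the first NLU rule can be made concrete with a small sketch. It assumes the common BIO tagging convention for entities. The intent label, slot names, and `decode_bio` helper are illustrative assumptions, and the tags are written by hand here rather than predicted by a model.

```python
def decode_bio(tokens, tags):
    """Collect (slot_name, text) pairs from BIO tags aligned with tokens."""
    entities, current_slot, current_words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span begins; close any span that is still open.
            if current_slot:
                entities.append((current_slot, " ".join(current_words)))
            current_slot, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            current_words.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if current_slot:
                entities.append((current_slot, " ".join(current_words)))
            current_slot, current_words = None, []
    if current_slot:
        entities.append((current_slot, " ".join(current_words)))
    return entities


tokens = ["book", "a", "flight", "to", "new", "york", "tomorrow"]
tags = ["O", "O", "O", "O", "B-destination", "I-destination", "B-date"]

# One label for the whole utterance plus one label per token.
nlu_output = {"intent": "book_flight", "entities": decode_bio(tokens, tags)}
```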

Dialogue Management

The Context Model / Decision Model split reappears across all paradigms (rule-based, statistical, neural) because any multi-turn system must separately track what it knows and decide what to do next (p. 49-50).
Confidence-driven confirmation prevents both over-confirming (annoying users) and under-confirming (propagating errors). Partition states into safe (implicit confirmation) and uncertain (explicit confirmation) based on ASR confidence thresholds (p. 77).
Combinatorial explosion kills hand-crafted dialogue management: 7 topping choices produce 5,000+ paths (p. 64). Switch to interactive learning or simulation-based generation beyond ~100 paths.
Corpus-based DM cannot generalize beyond training distribution. When an unseen (state, situation) pair is encountered, the system selects the nearest known pair, producing coherent-looking but semantically wrong actions (p. 77-78).
For RL-based DM, reward design determines policy behavior. Small negative reward per turn plus large positive for task completion is the standard pattern, but poorly designed rewards produce technically optimal yet practically useless policies (p. 82).
Simulated users create a bootstrapping problem: policy quality is bounded by simulator fidelity, which is hard to validate without the real users you are trying to avoid (p. 82).
Include a simple rule-based fallback bot. In the Alexa Prize, the ELIZA bot's removal had the largest negative impact on quality ratings because it provided a reliable conversational floor (p. 69).
Hierarchical DM is necessary for open-domain conversation: flat management cannot scale across unbounded topic spaces. Use a global manager dispatching to specialized sub-components (p. 69).
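
The confidence-driven confirmation rule above can be sketched as a simple threshold check. The threshold value, function names, and prompt wording are hypothetical; a real system would tune the threshold on its own data.

```python
def confirmation_strategy(asr_confidence, threshold=0.8):
    """Map an ASR confidence in [0, 1] to a confirmation strategy.

    The 0.8 default is an illustrative assumption, not a recommended value.
    """
    if not 0.0 <= asr_confidence <= 1.0:
        raise ValueError("asr_confidence must lie in [0, 1]")
    return "implicit" if asr_confidence >= threshold else "explicit"


def next_prompt(slot_value, asr_confidence):
    """Build the system prompt for a recognized destination value."""
    if confirmation_strategy(asr_confidence) == "implicit":
        # Safe state: fold the value into the next question.
        return f"Flying to {slot_value}. What date?"
    # Uncertain state: ask before committing the value to the dialogue state.
    return f"Did you say {slot_value}?"


print(next_prompt("Boston", 0.93))
print(next_prompt("Boston", 0.41))
```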

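The reward pattern described for RL-based DM (a small per-turn penalty plus a large completion bonus) can be written out as an episode return. The numeric values and the `episode_return` name are illustrative assumptions. The second half shows how a poorly scaled bonus makes giving up immediately the better-scoring behavior.

```python
def episode_return(num_turns, task_completed, turn_penalty=-1.0, success_reward=20.0):
    """Undiscounted return for one dialogue under the per-turn/completion pattern."""
    return num_turns * turn_penalty + (success_reward if task_completed else 0.0)


# With a large completion bonus, finishing the task beats quitting early.
good_success = episode_return(num_turns=8, task_completed=True)   # -8 + 20 = 12
good_quit = episode_return(num_turns=1, task_completed=False)     # -1

# With a bonus that is too small, quitting after one turn scores higher
# than an eight-turn success: optimal under the reward, useless in practice.
bad_success = episode_return(8, True, success_reward=2.0)         # -8 + 2 = -6
bad_quit = episode_return(1, False, success_reward=2.0)           # -1
```
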
NLG

NLG is an underinvested output bottleneck: systems routinely understand more than they can express, creating a perceived-intelligence ceiling at the output stage (p. 51).
Reframe NLG as planning under uncertainty, not text templating. RL-optimized content presentation (summary vs. compare vs. recommend) improved task success by 8.2% over static approaches (p. 80).
Response length is a first-order quality signal: "If the utterances are too short, they are seen as dull and uninteresting, and if they are too long the chatbot can be seen as rambling" (p. 144).
The generic response problem is caused by the objective function, not the architecture. Maximum likelihood training rewards high-frequency bland phrases. Maximum Mutual Information (MMI) penalizes responses that could follow any input (p. 147).
Decompose response quality into four measurable dimensions: repetition, specificity, response-relatedness, and question-asking frequency. Vague "quality" cannot be diagnosed or controlled (p. 147).
Retrieve-and-Refine is a general antidote to generative dullness: inject retrieved text as generator context to import specificity while preserving generative flexibility (p. 143).
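
The contrast between maximum likelihood and MMI decoding can be written out as follows. This is the commonly cited anti-language-model form of MMI, not a formula quoted from the source. The symbols are notation introduced here: $S$ for the input utterance, $T$ for a candidate response, and $\lambda$ for a penalty weight.

```latex
% Maximum likelihood decoding: favors responses that are probable given S,
% which includes bland responses that are probable given almost anything.
\hat{T}_{\mathrm{ML}} = \arg\max_{T} \, \log p(T \mid S)

% MMI decoding (anti-LM form): subtracts the response's unconditional
% log-probability, so responses that could follow any input are penalized.
\hat{T}_{\mathrm{MMI}} = \arg\max_{T} \, \bigl[ \log p(T \mid S) - \lambda \log p(T) \bigr]
```

With $\lambda = 0$ the second objective reduces to the first. Larger $\lambda$ penalizes high-frequency generic phrases more strongly.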

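Two of the four response-quality dimensions listed above, repetition and question-asking frequency, can be measured with very simple proxies. The sketch below is one assumed operationalization, not the measures used in the source. The function names are hypothetical, and specificity and response-relatedness would need corpus statistics or embeddings that are out of scope here.

```python
def question_rate(responses):
    """Fraction of responses that end with a question mark."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(r.strip().endswith("?") for r in responses) / len(responses)


def repeated_bigram_fraction(response):
    """Share of word bigrams in a response that repeat an earlier bigram."""
    words = response.lower().split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return 1 - len(set(bigrams)) / len(bigrams)


bot_turns = ["I like jazz.", "Do you play an instrument?", "What about you?", "Nice."]
q_rate = question_rate(bot_turns)                      # 2 of 4 turns are questions
rep = repeated_bigram_fraction("i do i do i do")       # 5 bigrams, 2 distinct
```
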
Evaluation

BLEU fails for dialogue: an appropriate response may share zero words with the reference response, and the one-to-many mapping from input to valid responses breaks MT metrics' one-to-one assumption (p. 104).
Alignment-based metrics are "less applicable in dialogue systems" because "responses do not usually use the same words as the previous utterance" (p. 1).
TTS quality creates a halo effect: poor speech output causes users to rate the entire system as unintelligent even when task performance succeeds (p. 101). TTS investment has disproportionate ROI.
Use relative (side-by-side) comparison rather than absolute Likert ratings. Humans are better at relative judgements, yielding "more fine-grained and more sensitive evaluation" at lower cost (p. 108).
Evaluate at both exchange level and dialogue level. Exchange-level metrics capture per-response quality; dialogue-level metrics capture flow and coherence. Neither alone suffices (p. 106-110).
Single-session evaluation cannot detect cross-conversation repetitiveness. This biases results toward bots that seem varied within one interaction but repeat themselves across many (p. 109).
No single metric suffices. Combine PARADISE (objective-to-subjective bridge), QoE (causal user perception taxonomy), and IQ (exchange-level temporal granularity) (p. 120).
Wild-chat evaluation fails because adversarial or low-effort users dominate the signal, invalidating results (p. 113).
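
Side-by-side comparison, as recommended above, produces preference labels rather than scores. A minimal way to aggregate them is sketched below. The label scheme (`"A"`, `"B"`, `"tie"`) and the `win_rates` function are assumptions made for illustration, and the judgements are made-up example data.

```python
from collections import Counter


def win_rates(judgements):
    """Aggregate side-by-side labels ('A', 'B', or 'tie') into win rates.

    Ties are reported separately and excluded from the rate denominators.
    """
    counts = Counter(judgements)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return {
        "A": counts["A"] / decided,
        "B": counts["B"] / decided,
        "ties": counts["tie"],
    }


# Hypothetical annotator labels for eight paired conversations.
judgements = ["A", "A", "B", "tie", "A", "B", "A", "A"]
result = win_rates(judgements)
```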

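For orientation, the PARADISE performance function is usually stated as below. This is the form commonly given in the original PARADISE work and is not quoted from the source, so the notation should be checked against the book before relying on it.

```latex
% kappa : task success measure
% c_i   : dialogue cost measures (e.g., number of turns, elapsed time)
% alpha, w_i : weights fitted by regressing user satisfaction on the measures
% N(.)  : z-score normalization, so measures on different scales are comparable
\mathrm{Performance} = \alpha \cdot \mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i),
\qquad
\mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x}
```

The fitted weights are what link objective measures to subjective satisfaction.
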
Deployment

Determine whether conversation is the right modality at all before building. The common mistake is forcing conversational interfaces onto tasks better served by graphical UIs (p. 53).
Design for the happy path first, edge cases second. Reversing this causes over-engineering of rare scenarios (p. 53).
Use Wizard of Oz testing to discover actual user vocabulary and breakdown points that developer assumptions cannot produce (p. 53).
ASR errors cascade: each misrecognition corrupts the dialogue state that subsequent turns depend on. Intercept errors early where they are cheapest to fix (p. 17-18, p. 68).
Assume 5-30% of user input will be abusive. Build abuse handling architecturally, not as an afterthought (p. 179).
Clean training data is necessary but not sufficient for safety. Neural generation can produce harmful outputs from non-harmful inputs (p. 180).
Transfer learning for dialogue "does not yet reach a level of performance that would be required for adoption in industry" (p. 166), despite impressive few-shot benchmarks.
Corpus type is a consequential design decision: human-human dialogues have different error distributions than human-machine dialogues and may not train effective systems (p. 151).
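
The two safety rules above (handle abuse architecturally, and do not trust clean training data alone) suggest checking both the user input and the generated output. The sketch below shows only that structure. The `BLOCKLIST` terms, canned replies, and function names are placeholders, and a keyword list is a crude stand-in for whatever classifier a real system would use.

```python
BLOCKLIST = {"idiot", "stupid"}  # placeholder terms, for illustration only


def is_flagged(text):
    """Crude keyword check standing in for an abuse/toxicity classifier."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return not words.isdisjoint(BLOCKLIST)


def guarded_reply(user_text, generate):
    """Wrap a response generator with an input gate and an output gate."""
    if is_flagged(user_text):
        # Input gate: abusive input never reaches the generator.
        return "Let us keep things respectful. How can I help?"
    reply = generate(user_text)
    if is_flagged(reply):
        # Output gate: harmless input can still yield a harmful generation.
        return "Sorry, I cannot help with that."
    return reply


# `generate` is any callable from user text to reply text; lambdas stand in here.
ok = guarded_reply("hello", lambda t: "hi there")
blocked_in = guarded_reply("you idiot", lambda t: "hi there")
blocked_out = guarded_reply("hello", lambda t: "you are stupid")
```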

Key Quotes

"Designing all the rules that are required to cover all potential interactions of a dialogue system soon becomes a difficult, if not impossible task." (p. 76)

"We have certainly not yet arrived at a solution to open-domain dialogue." (p. 144)

"There is no simple answer to this question as a number of different factors are involved." (p. 120)

"the development team can feel assured that they have full control over the operation of their system" (p. 70)

Related References

core-framework.md -- the three-paradigm thesis underlying these heuristics
conversation-design.md -- theoretical foundations for design decisions
pipeline-architecture.md -- the modular architecture these rules apply to
evaluation-frameworks.md -- detailed evaluation methodology
ethics-and-safety.md -- safety-specific heuristics expanded
neural-failure-modes.md -- failure modes that many of these rules address