Well-designed tools make the difference between an agent that can accomplish tasks and one that constantly struggles with its own interface.
Tool Examples as Design Pattern
[2025-12-09]: JSON schemas define structural validity but can't teach usage—that's what examples are for.
The Gap: Schemas tell the model what parameters exist and their types, but not:
- When to include optional parameters
- Which parameter combinations make sense together
- API conventions not expressible in JSON Schema
The Pattern: Provide 1-5 concrete tool call examples demonstrating correct usage at varying complexity levels.
Example Structure (for a support ticket API):
- Minimal: Title-only task—shows the floor
- Partial: Feature request with some reporter info—shows selective use
- Full: Critical bug with escalation, full metadata—shows the ceiling
This progression teaches when to use optional fields, not just that they exist.
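A minimal sketch of what this progression might look like for a hypothetical create_ticket tool; the field names (title, type, priority, reporter_email, escalate, labels) and values are illustrative, not a real API:

```python
# Hypothetical create_ticket tool: three example calls at increasing complexity.
TOOL_CALL_EXAMPLES = [
    # Minimal: the floor — only the required field.
    {"name": "create_ticket", "input": {"title": "Login button unresponsive on Safari"}},

    # Partial: a feature request with selective optional fields.
    {
        "name": "create_ticket",
        "input": {
            "title": "Add CSV export to reports page",
            "type": "feature_request",
            "reporter_email": "dana.wu@example.com",
        },
    },

    # Full: a critical bug using the complete parameter surface.
    {
        "name": "create_ticket",
        "input": {
            "title": "Checkout fails with 500 for all EU customers",
            "type": "bug",
            "priority": "critical",
            "reporter_email": "ops-oncall@example.com",
            "escalate": True,
            "labels": ["payments", "regression"],
        },
    },
]
```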
Results: Internal testing showed accuracy improvement from 72% → 90% on complex parameter handling.
Best Practices:
- Use realistic data, not placeholder values ("John Doe", not "{name}")
- Focus examples on ambiguities that the schema alone can't resolve
- Show the minimal case first—don't always demonstrate full complexity
- Keep examples concise (1-5 per tool, not 20)
See Also:
- Prompt: Structuring — How tool examples relate to broader prompt design principles
Source: Advanced Tool Use - Anthropic
Rich User Questioning Patterns
[2026-01-30]: Orchestration research from cc-mirror reveals sophisticated user clarification patterns that replace text-based menus with rich, decision-guiding question structures.
Philosophy: Users Need to See Options
Core insight: "Users don't know what they want until they see the options"
Asking "What should I prioritize?" yields vague answers. Showing 4 options with descriptions, trade-offs, and recommendations enables informed decisions.
The 4×4 Maximal Pattern
Structure:
- 4 questions addressing different decision dimensions
- 4 options per question with rich descriptions
- Trade-offs and implications explicit
- Recommended option guides optimal choice
- Multi-select support when appropriate
Example: Feature Implementation Scope Clarification
AskUserQuestion(
questions=[
{
"text": "What's the primary goal?",
"options": [
{
"label": "Performance optimization",
"description": "Focus on speed and efficiency. May increase code complexity and require more thorough testing.",
"recommended": True
},
{
"label": "Code simplicity",
"description": "Prioritize readability and maintainability over raw speed. Easier onboarding, longer execution time."
},
{
"label": "Feature completeness",
"description": "Cover all edge cases and scenarios. Comprehensive but longer timeline. Best for user-facing features."
},
{
"label": "Quick MVP",
"description": "Fast delivery with core features only. Technical debt acceptable. Good for validation before full build."
}
]
},
{
"text": "How should we handle errors?",
"options": [
{
"label": "Fail fast with clear messages",
"description": "Immediate error visibility. Easier debugging but more interruptions. Good for development."
},
{
"label": "Graceful degradation",
"description": "Continue operation with reduced functionality. Better UX but errors may go unnoticed."
},
{
"label": "Retry with exponential backoff",
"description": "Automatic recovery from transient failures. Adds complexity and potential latency."
},
{
"label": "Log and continue",
"description": "Silent failure recovery. Best for non-critical paths. Risk of hidden issues accumulating."
}
]
},
{
"text": "What testing level?",
"options": [
{
"label": "Unit tests only",
"description": "Fast feedback, isolated verification. Misses integration issues. Good for pure logic."
},
{
"label": "Integration tests",
"description": "Verify component interactions. Slower but catches real-world failures. Recommended for APIs."
},
{
"label": "Full E2E suite",
"description": "Complete user flow validation. Highest confidence but longest execution. Expensive to maintain."
},
{
"label": "Manual testing",
"description": "Skip automated tests initially. Fast development but manual verification burden. Technical debt."
}
]
},
{
"text": "Deployment strategy?",
"options": [
{
"label": "Feature flag rollout",
"description": "Deploy to all, enable gradually. Instant rollback. Requires flag infrastructure."
},
{
"label": "Canary deployment",
"description": "Small user percentage first. Catch issues early. Needs monitoring and rollback automation."
},
{
"label": "Blue-green deployment",
"description": "Full environment switch. Zero downtime. Doubles infrastructure cost temporarily."
},
{
"label": "Direct deployment",
"description": "Immediate production rollout. Fastest but highest risk. Acceptable for low-traffic features."
}
]
}
]
)
Anti-Pattern: Text-Based Menus
Avoid:
Please choose:
1. Fast
2. Cheap
3. Good quality
Which do you prefer?
Problems:
- No context about trade-offs
- Binary thinking (can't combine attributes)
- Vague options without implications
- No guidance toward optimal choice
When to Use Maximal Questions
Good fit:
- Request admits multiple valid interpretations
- Choices meaningfully affect implementation approach
- Actions carry risk or are difficult to reverse
- User preferences influence trade-offs
- Scope clarification prevents rework
Poor fit:
- Decisions are obvious from context
- Only one reasonable approach exists
- User already provided detailed requirements
- Questions would annoy rather than clarify
Implementation Guidelines
Option descriptions should include:
- What this choice means concretely
- Primary trade-off or cost
- When this choice makes sense
- What happens if this choice is wrong
The recommended flag signals:
- Optimal choice given typical constraints
- Not forcing—user can override
- Guides users unfamiliar with domain
Multi-select enables:
- "Performance AND simplicity" combinations
- "All of these except X" selections
- Prioritization without forced ranking
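As a small sketch, these guidelines can be expressed as data structures; the Option/Question classes and the multi_select and recommended field names mirror the pattern described above and are assumptions, not the actual AskUserQuestion schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Option:
    label: str
    description: str           # what it means, its primary trade-off, when it fits
    recommended: bool = False  # guidance only; the user can override

@dataclass
class Question:
    text: str
    options: List[Option]
    multi_select: bool = False  # allow "A and B" style combinations

    def __post_init__(self):
        if not 2 <= len(self.options) <= 4:
            raise ValueError("Aim for 2-4 rich options per question")
        if sum(o.recommended for o in self.options) > 1:
            raise ValueError("Mark at most one option as recommended")
```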
Sources: cc-mirror orchestration tools, AskUserQuestion pattern documentation
Coding Agent Edit Formats
[2026-04-11]: Edit format selection is a first-order tool design decision for coding agents — one with direct consequences for model error rates, output size, and edit application reliability. The choice is model-capability-driven, not aesthetic.
Three Primary Format Archetypes
| Format | Mechanism | Cognitive Load on Model | Production Evidence |
|---|---|---|---|
| Whole-file rewrite | Model returns complete updated file | Low — no line number tracking required | Reliable; expensive for large files |
| Unified diff (udiff) | Standard diff format with +/- line prefixes | High — requires tracking original line numbers before writing new code | Used for legacy models with "lazy coding" tendency |
| Search-replace blocks | Model specifies exact text to find and exact replacement text | Medium — no line numbers; exact string matching | Most models in production; Aider's primary format |
The cognitive load distinction is significant. Unified diff requires the model to know the line count of original content before writing new code — a constraint that causes errors on complex edits. Search-replace blocks avoid this by using content identity rather than position. Whole-file rewrites sidestep both constraints at the cost of output token volume.
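A rough sketch of exact-match application for a single search-replace block (not Aider's actual implementation); the error cases illustrate why content identity is easier for models to satisfy than line-number bookkeeping:

```python
def apply_search_replace(file_text: str, search: str, replace: str) -> str:
    """Apply one search-replace block using content identity, not line numbers."""
    count = file_text.count(search)
    if count == 0:
        raise ValueError("Search text not found; reject rather than guess a fuzzy match")
    if count > 1:
        raise ValueError("Search text is ambiguous; request more surrounding context")
    return file_text.replace(search, replace, 1)
```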
Model-Capability-Driven Format Selection
Aider's production documentation provides the most concrete evidence available for model-specific format selection:
| Model Family | Recommended Format | Reason |
|---|---|---|
| Most modern models (Claude, GPT-4o, Gemini 1.5+) | Search-replace (diff) | Reliable exact-match application; balanced output size |
| Gemini models (older) | diff-fenced (path inside fence) | Standard fencing syntax failed consistently |
| GPT-4 Turbo | udiff | Mitigated "lazy coding" — replacing implementation with placeholder comments |
| Architect mode tasks | editor-diff / editor-whole | Streamlined for multi-file operations in a separate editor model |
The practical implication: format choice is not a configuration detail — it is a tool design decision that affects model reliability at the task level. When a model produces incorrect edits, audit the edit format before adjusting the prompt.
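As an illustration only, the table above can be encoded as a simple lookup; the substring keys and the fallback are assumptions, not Aider's configuration mechanism:

```python
# More specific keys are listed before more general ones.
EDIT_FORMAT_BY_MODEL = {
    "claude": "diff",         # search-replace blocks
    "gpt-4o": "diff",
    "gemini-1.5": "diff",
    "gemini": "diff-fenced",  # older Gemini: path inside the fence
    "gpt-4-turbo": "udiff",   # counteracts lazy coding
}

def choose_edit_format(model_name: str, default: str = "whole") -> str:
    """Pick the first matching entry; fall back to whole-file rewrites."""
    name = model_name.lower()
    for key, fmt in EDIT_FORMAT_BY_MODEL.items():
        if key in name:
            return fmt
    return default
```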
Poka-Yoke Constraints for Coding Tools
Anthropic's SWE-bench implementation provides a concrete example of constraint-based tool design: requiring absolute filepaths as tool input rather than accepting relative ones. The result was a measurable reduction in model path errors: the tool became harder to use incorrectly.
The principle: coding agent tools benefit from constraints that prevent common model errors:
- Require absolute paths (eliminates relative-path resolution errors)
- Require exact string matching in search-replace (prevents approximate matches that corrupt context)
- Validate that referenced line numbers exist before executing edits (prevents off-by-one failures)
These are not validation afterthoughts — they are design decisions that change model behavior by making the wrong action structurally difficult.
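A rough sketch of what such constraints can look like as tool-input validation; the function and parameter names are hypothetical, not Anthropic's SWE-bench harness:

```python
import os

def validate_edit_request(path: str, search: str, file_text: str) -> None:
    """Reject inputs that commonly lead to bad edits before touching the file."""
    if not os.path.isabs(path):
        raise ValueError(f"Path must be absolute, got: {path!r}")
    if search and search not in file_text:
        raise ValueError("Search text must match the file exactly; no approximate matches")

def validate_line_range(start: int, end: int, file_text: str) -> None:
    """Confirm referenced line numbers exist before executing a line-based edit."""
    n_lines = file_text.count("\n") + 1
    if not (1 <= start <= end <= n_lines):
        raise ValueError(f"Line range {start}-{end} is outside a file with {n_lines} lines")
```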
Sources: Aider Edit Formats, Building Effective Agents — Anthropic, Components of a Coding Agent — Raschka (2026-04-04)
Coding Agent Tool Inventory
[2026-04-11]: Coding agents use a recurring, minimal tool set. Raschka identifies the baseline as five named tools; Anthropic's SWE-bench work extends this with patch-application tools. Understanding the canonical inventory prevents over-engineering the tool layer.
Standard Inventory
| Tool | Role | Notes |
|---|---|---|
| read_file | Retrieve file content for context | Prefer whole-file for small files; line-range for large |
| write_file / apply_edit | Persist changes to disk | Format depends on edit format choice (see above) |
| search / grep | Find symbols, patterns, references across the repo | Critical for navigation without full-file reads |
| bash / shell | Run commands, tests, linters, build tools | Primary execution feedback mechanism |
| list_files / ls | Directory traversal and discovery | Supports repo orientation without reading every file |
Claude Code adds a sixth functional layer via hooks — pre/post action enforcement that operates outside the agent's direct tool calls. This is best understood as a meta-tool: it applies constraints and side effects that the agent cannot disable.
Minimum viable set for most coding tasks: read_file, write_file, bash, search. The other tools improve efficiency but are not required for correctness on small codebases.
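For illustration, the minimum viable set could be declared as JSON-Schema-style tool definitions along these lines; the descriptions and parameter names are assumptions rather than any specific framework's schema:

```python
MINIMAL_TOOLS = [
    {
        "name": "read_file",
        "description": "Return the contents of a file. Use absolute paths.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "write_file",
        "description": "Overwrite a file with new content. Use absolute paths.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
            "required": ["path", "content"],
        },
    },
    {
        "name": "search",
        "description": "Search the repository for a pattern; return matching paths and lines.",
        "input_schema": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
    {
        "name": "bash",
        "description": "Run a shell command; return stdout, stderr, and the exit code.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
]
```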
Sources: Components of a Coding Agent — Raschka (2026-04-04), Building Effective Agents — Anthropic
Leading Questions
- How do you write tool descriptions that work for both humans and LLMs?
- When should you split one complex tool into multiple simple ones?
- What parameters should be required vs. optional? How do you decide?
- How do naming conventions affect tool selection accuracy?
- What makes a tool description actionable vs. confusing?
Connections
- To Tool Selection: Design choices directly impact selection accuracy
- To Prompt: Tool descriptions are prompts themselves—see Model-Invoked vs. User-Invoked Prompts
- To Scaling Tools: Good design becomes critical when managing dozens of tools
- To Orchestrator Pattern: AskUserQuestion maximal pattern demonstrates rich clarification as coordination tool. Orchestrators use structured questioning before delegation to absorb complexity and radiate simplicity.
- To ReAct Pattern: Edit format choice directly affects observation quality in coding agent loops. Search-replace formats produce cleaner, verifiable observations than whole-file rewrites.