Tool Design

    Well-designed tools make the difference between an agent that can accomplish tasks and one that constantly struggles with its own interface.


    Tool Examples as Design Pattern

    [2025-12-09]: JSON schemas define structural validity but can't teach usage—that's what examples are for.

    The Gap: Schemas tell the model what parameters exist and their types, but not:

    • When to include optional parameters
    • Which parameter combinations make sense together
    • API conventions not expressible in JSON Schema

    The Pattern: Provide 1-5 concrete tool call examples demonstrating correct usage at varying complexity levels.

    Example Structure (for a support ticket API):

    1. Minimal: Title-only ticket—shows the floor
    2. Partial: Feature request with some reporter info—shows selective use
    3. Full: Critical bug with escalation, full metadata—shows the ceiling

    This progression teaches when to use optional fields, not just that they exist.
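
    A minimal sketch of this three-example progression for a hypothetical create_ticket tool (field names and values are illustrative, not taken from the source API):

    # Example calls shipped alongside the create_ticket schema, ordered by complexity.
    create_ticket_examples = [
        # 1. Minimal: only the required field (the floor).
        {"title": "Login page returns 500 after password reset"},
        # 2. Partial: selective use of optional fields.
        {
            "title": "Add CSV export to the reports dashboard",
            "type": "feature_request",
            "reporter": {"name": "Maria Chen", "email": "maria.chen@example.com"}
        },
        # 3. Full: escalation plus complete metadata (the ceiling).
        {
            "title": "Checkout fails for all EU customers",
            "type": "bug",
            "priority": "critical",
            "escalate_to": "payments-oncall",
            "reporter": {"name": "Dev Patel", "email": "dev.patel@example.com"},
            "labels": ["checkout", "regression", "eu-region"]
        }
    ]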

    Results: In internal testing, accuracy on complex parameter handling improved from 72% to 90%.

    Best Practices:

    • Use realistic data, not placeholder values ("John Doe", not "{name}")
    • Focus examples on ambiguity areas not obvious from schema alone
    • Show the minimal case first—don't always demonstrate full complexity
    • Keep examples concise (1-5 per tool, not 20)

    Source: Advanced Tool Use - Anthropic


    Rich User Questioning Patterns

    [2026-01-30]: Orchestration research from cc-mirror reveals sophisticated user clarification patterns that replace text-based menus with rich, decision-guiding question structures.

    Philosophy: Users Need to See Options

    Core insight: "Users don't know what they want until they see the options"

    Asking "What should I prioritize?" yields vague answers. Showing 4 options with descriptions, trade-offs, and recommendations enables informed decisions.

    The 4×4 Maximal Pattern

    Structure:

    • 4 questions addressing different decision dimensions
    • 4 options per question with rich descriptions
    • Trade-offs and implications explicit
    • Recommended option guides optimal choice
    • Multi-select support when appropriate

    Example: Feature Implementation Scope Clarification

    AskUserQuestion(
        questions=[
            {
                "text": "What's the primary goal?",
                "options": [
                    {
                        "label": "Performance optimization",
                        "description": "Focus on speed and efficiency. May increase code complexity and require more thorough testing.",
                        "recommended": True
                    },
                    {
                        "label": "Code simplicity",
                        "description": "Prioritize readability and maintainability over raw speed. Easier onboarding, longer execution time."
                    },
                    {
                        "label": "Feature completeness",
                        "description": "Cover all edge cases and scenarios. Comprehensive but longer timeline. Best for user-facing features."
                    },
                    {
                        "label": "Quick MVP",
                        "description": "Fast delivery with core features only. Technical debt acceptable. Good for validation before full build."
                    }
                ]
            },
            {
                "text": "How should we handle errors?",
                "options": [
                    {
                        "label": "Fail fast with clear messages",
                        "description": "Immediate error visibility. Easier debugging but more interruptions. Good for development."
                    },
                    {
                        "label": "Graceful degradation",
                        "description": "Continue operation with reduced functionality. Better UX but errors may go unnoticed."
                    },
                    {
                        "label": "Retry with exponential backoff",
                        "description": "Automatic recovery from transient failures. Adds complexity and potential latency."
                    },
                    {
                        "label": "Log and continue",
                        "description": "Silent failure recovery. Best for non-critical paths. Risk of hidden issues accumulating."
                    }
                ]
            },
            {
                "text": "What testing level?",
                "options": [
                    {
                        "label": "Unit tests only",
                        "description": "Fast feedback, isolated verification. Misses integration issues. Good for pure logic."
                    },
                    {
                        "label": "Integration tests",
                        "description": "Verify component interactions. Slower but catches real-world failures. Recommended for APIs."
                    },
                    {
                        "label": "Full E2E suite",
                        "description": "Complete user flow validation. Highest confidence but longest execution. Expensive to maintain."
                    },
                    {
                        "label": "Manual testing",
                        "description": "Skip automated tests initially. Fast development but manual verification burden. Technical debt."
                    }
                ]
            },
            {
                "text": "Deployment strategy?",
                "options": [
                    {
                        "label": "Feature flag rollout",
                        "description": "Deploy to all, enable gradually. Instant rollback. Requires flag infrastructure."
                    },
                    {
                        "label": "Canary deployment",
                        "description": "Small user percentage first. Catch issues early. Needs monitoring and rollback automation."
                    },
                    {
                        "label": "Blue-green deployment",
                        "description": "Full environment switch. Zero downtime. Doubles infrastructure cost temporarily."
                    },
                    {
                        "label": "Direct deployment",
                        "description": "Immediate production rollout. Fastest but highest risk. Acceptable for low-traffic features."
                    }
                ]
            }
        ]
    )

    Anti-Pattern: Text-Based Menus

    Avoid:

    Please choose:
    1. Fast
    2. Cheap
    3. Good quality
    
    Which do you prefer?
    

    Problems:

    • No context about trade-offs
    • Binary thinking (can't combine attributes)
    • Vague options without implications
    • No guidance toward optimal choice

    When to Use Maximal Questions

    Good fit:

    • Request admits multiple valid interpretations
    • Choices meaningfully affect implementation approach
    • Actions carry risk or are difficult to reverse
    • User preferences influence trade-offs
    • Scope clarification prevents rework

    Poor fit:

    • Decisions are obvious from context
    • Only one reasonable approach exists
    • User already provided detailed requirements
    • Questions would annoy rather than clarify

    Implementation Guidelines

    Option descriptions should include:

    1. What this choice means concretely
    2. Primary trade-off or cost
    3. When this choice makes sense
    4. What happens if this choice is wrong

    The recommended flag signals:

    • Optimal choice given typical constraints
    • Not forcing—user can override
    • Guides users unfamiliar with domain

    Multi-select enables:

    • "Performance AND simplicity" combinations
    • "All of these except X" selections
    • Prioritization without forced ranking
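
    A sketch of a multi-select question in the same structure as the example above; the multi_select field name is an assumption, not confirmed by the source:

    AskUserQuestion(
        questions=[
            {
                "text": "Which qualities should this refactor prioritize?",
                "multi_select": True,  # assumed field name; lets the user pick several options
                "options": [
                    {"label": "Performance", "description": "Optimize hot paths. May add complexity."},
                    {"label": "Simplicity", "description": "Favor readability. May leave some speed unused."},
                    {"label": "Test coverage", "description": "Add tests before changing behavior. Slower start."},
                    {"label": "API stability", "description": "Avoid breaking callers. Constrains internal changes."}
                ]
            }
        ]
    )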

    Sources: cc-mirror orchestration tools, AskUserQuestion pattern documentation


    Coding Agent Edit Formats

    [2026-04-11]: Edit format selection is a first-order tool design decision for coding agents — one with direct consequences for model error rates, output size, and edit application reliability. The choice is model-capability-driven, not aesthetic.

    Three Primary Format Archetypes

    • Whole-file rewrite: the model returns the complete updated file. Cognitive load: low (no line-number tracking required). Production evidence: reliable; expensive for large files.
    • Unified diff (udiff): standard diff format with +/- line prefixes. Cognitive load: high (the model must track original line numbers before writing new code). Production evidence: used for legacy models with a "lazy coding" tendency.
    • Search-replace blocks: the model specifies exact text to find and exact replacement text. Cognitive load: medium (no line numbers; exact string matching). Production evidence: most models in production; Aider's primary format.
    The cognitive load distinction is significant. Unified diff requires the model to know the line count of original content before writing new code — a constraint that causes errors on complex edits. Search-replace blocks avoid this by using content identity rather than position. Whole-file rewrites sidestep both constraints at the cost of output token volume.
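
    A minimal sketch of why search-replace application is position-free: the edit is located by content identity, so the model never counts lines (apply_search_replace is a hypothetical helper, not Aider's implementation):

    from pathlib import Path

    def apply_search_replace(path: str, search: str, replace: str) -> None:
        """Apply one search-replace block by exact string matching."""
        original = Path(path).read_text()
        occurrences = original.count(search)
        # Require exactly one match: zero means the model's search text is wrong;
        # more than one means the edit location is ambiguous.
        if occurrences != 1:
            raise ValueError(f"search block must match exactly once, found {occurrences}")
        Path(path).write_text(original.replace(search, replace, 1))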

    Model-Capability-Driven Format Selection

    Aider's production documentation provides the most concrete evidence available for model-specific format selection:

    • Most modern models (Claude, GPT-4o, Gemini 1.5+): search-replace (diff). Reliable exact-match application; balanced output size.
    • Older Gemini models: diff-fenced (path inside the fence). Standard fencing syntax failed consistently.
    • GPT-4 Turbo: udiff. Mitigated "lazy coding" (replacing implementation with placeholder comments).
    • Architect mode tasks: editor-diff / editor-whole. Streamlined for multi-file operations in a separate editor model.
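
    One way to operationalize this is a lookup keyed on model family that falls back to search-replace; the mapping below restates the list above, and the keys are illustrative rather than a published configuration API:

    # Edit-format selection by model family (in Aider's naming, "diff" = search-replace).
    EDIT_FORMAT_BY_MODEL = {
        "gpt-4-turbo": "udiff",          # mitigates lazy coding
        "gemini-1.0-pro": "diff-fenced"  # older Gemini fencing issue
    }

    def pick_edit_format(model_name: str) -> str:
        # Most modern models get the search-replace ("diff") format.
        return EDIT_FORMAT_BY_MODEL.get(model_name, "diff")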

    The practical implication: format choice is not a configuration detail — it is a tool design decision that affects model reliability at the task level. When a model produces incorrect edits, audit the edit format before adjusting the prompt.

    Poka-Yoke Constraints for Coding Tools

    Anthropic's SWE-bench implementation provides a concrete example of constraint-based tool design: requiring absolute (rather than relative) filepaths as tool input. The result was a measurable reduction in model path errors — the tool became harder to use incorrectly.

    The principle: coding agent tools benefit from constraints that prevent common model errors:

    • Require absolute paths (eliminates relative-path resolution errors)
    • Require exact string matching in search-replace (prevents approximate matches that corrupt context)
    • Validate that referenced line numbers exist before executing edits (prevents off-by-one failures)

    These are not validation afterthoughts — they are design decisions that change model behavior by making the wrong action structurally difficult.
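
    A sketch of the first two constraints enforced at the tool boundary (a hypothetical validator, not Anthropic's SWE-bench code):

    import os

    def validate_edit_request(path: str, search: str, file_text: str) -> None:
        """Reject tool inputs that commonly correlate with model errors."""
        # Constraint 1: absolute paths only; relative paths depend on an implicit cwd.
        if not os.path.isabs(path):
            raise ValueError(f"path must be absolute, got {path!r}")
        # Constraint 2: the search text must match exactly once in the target file.
        occurrences = file_text.count(search)
        if occurrences != 1:
            raise ValueError(f"search text must match exactly once, found {occurrences}")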

    Sources: Aider Edit Formats, Building Effective Agents — Anthropic, Components of a Coding Agent — Raschka (2026-04-04)


    Coding Agent Tool Inventory

    [2026-04-11]: Coding agents use a recurring, minimal tool set. Raschka identifies the baseline as five named tools; Anthropic's SWE-bench work extends this with patch-application tools. Understanding the canonical inventory prevents over-engineering the tool layer.

    Standard Inventory

    • read_file: retrieve file content for context. Prefer whole-file reads for small files; line ranges for large ones.
    • write_file / apply_edit: persist changes to disk. Format depends on the edit format choice (see above).
    • search / grep: find symbols, patterns, and references across the repo. Critical for navigation without full-file reads.
    • bash / shell: run commands, tests, linters, and build tools. The primary execution feedback mechanism.
    • list_files / ls: directory traversal and discovery. Supports repo orientation without reading every file.

    Claude Code adds a sixth functional layer via hooks — pre/post action enforcement that operates outside the agent's direct tool calls. This is best understood as a meta-tool: it applies constraints and side effects that the agent cannot disable.

    Minimum viable set for most coding tasks: read_file, write_file, bash, search. The other tools improve efficiency but are not required for correctness on small codebases.
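
    A sketch of the minimum viable set expressed as JSON-schema tool definitions; the descriptions and parameters are illustrative, not a specific framework's API:

    MINIMAL_CODING_TOOLS = [
        {
            "name": "read_file",
            "description": "Return the contents of the file at an absolute path.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"]
            }
        },
        {
            "name": "write_file",
            "description": "Overwrite the file at an absolute path with new content.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
                "required": ["path", "content"]
            }
        },
        {
            "name": "search",
            "description": "Search the repository for a pattern; return matching paths and lines.",
            "input_schema": {
                "type": "object",
                "properties": {"pattern": {"type": "string"}},
                "required": ["pattern"]
            }
        },
        {
            "name": "bash",
            "description": "Run a shell command (tests, linters, builds) and return its output.",
            "input_schema": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"]
            }
        }
    ]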

    Sources: Components of a Coding Agent — Raschka (2026-04-04), Building Effective Agents — Anthropic


    Leading Questions

    • How do you write tool descriptions that work for both humans and LLMs?
    • When should you split one complex tool into multiple simple ones?
    • What parameters should be required vs. optional? How do you decide?
    • How do naming conventions affect tool selection accuracy?
    • What makes a tool description actionable vs. confusing?

    Connections

    • To Tool Selection: Design choices directly impact selection accuracy
    • To Prompt: Tool descriptions are prompts themselves—see Model-Invoked vs. User-Invoked Prompts
    • To Scaling Tools: Good design becomes critical when managing dozens of tools
    • To Orchestrator Pattern: AskUserQuestion maximal pattern demonstrates rich clarification as coordination tool. Orchestrators use structured questioning before delegation to absorb complexity and radiate simplicity.
    • To ReAct Pattern: Edit format choice directly affects observation quality in coding agent loops. Search-replace formats produce cleaner, verifiable observations than whole-file rewrites.