Prompt Maturity Model

    A framework for understanding and designing prompts at different levels of sophistication. Each level builds on the previous, adding new capabilities and complexity.

    The Seven Levels

    Level 1: Static

    What defines it: Hardcoded instructions with no variation. The prompt is the same every time it runs.

    When to use it: For simple, repeatable tasks that never need customization. Quick utilities where the overhead of parameters isn't worth it.

    Example: A command that always formats code the same way, or always runs the same test suite.

    # format-code.md
    Run prettier on all TypeScript files in src/

    Trade-offs:

    • Pros: Simplest to write and understand. No state to manage.
    • Cons: Inflexible. Requires creating new commands for variations.

    Level 2: Parameterized

    What defines it: Uses $ARGUMENTS or other variables to accept input at runtime.

    When to use it: When you need the same logic with different inputs. The behavior is consistent but the data varies.

    Example: Commands from the knowledge base that take a file path or topic as input.

    # review-file.md
    Review the file at $ARGUMENTS for clarity and completeness.

    Trade-offs:

    • Pros: Reusable across different inputs. Still straightforward logic.
    • Cons: Limited to simple substitution. No branching behavior.

    Level 3: Conditional

    What defines it: Contains if/else logic or branching based on input characteristics.

    When to use it: When the same command needs to behave differently depending on what it receives.

    Example: A command that processes markdown differently than code files, or handles different file types.

    # analyze.md
    If $ARGUMENTS contains .ts or .js:
      - Check for type safety issues
    Else if $ARGUMENTS contains .md:
      - Check for broken links
    Else:
      - Provide general analysis

    Trade-offs:

    • Pros: One command handles multiple scenarios. Adapts behavior to its input.
    • Cons: Logic can become complex. Harder to predict behavior.

    Level 4: Contextual

    What defines it: Reads external files or project state before acting. Uses context to inform decisions.

    When to use it: When the prompt needs to understand the broader environment. Common with project-aware commands.

    Example: The /tools:prime command that gathers project context, or commands that read CLAUDE.md before working.

    # contextualized-review.md
    1. Read CLAUDE.md to understand project conventions
    2. Read the file at $ARGUMENTS
    3. Review against project standards
    4. Suggest improvements that fit the codebase

    Trade-offs:

    • Pros: Decisions are informed by actual project state rather than assumptions.
    • Cons: Slower due to file reads. Can fail if context is missing.

    Level 5: Higher Order

    What defines it: Invokes other commands as subroutines. Orchestrates multiple operations.

    When to use it: For workflows that combine several distinct steps, or meta-commands that coordinate other commands.

    Example: This book's /do command orchestration (.claude/commands/do.md) demonstrates higher-order coordination by classifying user requirements and routing to appropriate expert domains. Workflow commands that call research, implement, and review commands in sequence also fit this level.

    # feature-workflow.md
    1. Call /research:gather-requirements $ARGUMENTS
    2. Call /plan:design-architecture
    3. Call /implement:build
    4. Call /review:validate

    Trade-offs:

    • Pros: Compose complex behaviors from simpler parts. DRY.
    • Cons: Dependencies between commands. Harder to debug failures.

    Level 6: Self-Modifying

    What defines it: Updates its own template based on outcomes or feedback.

    When to use it: When a command should learn from its usage patterns and improve itself over time.

    Example: The *_improve.md pattern seen in some agentic systems where commands track their failures and update their instructions.

    # adaptive-analyzer.md
    [CURRENT TEMPLATE]
    Analyze code for: ${FOCUS_AREAS}
     
    [IMPROVEMENT MECHANISM]
    After each run:
    - If analysis missed issues: add to FOCUS_AREAS
    - If too verbose: add to SUPPRESS_PATTERNS
    - Update this template

    Trade-offs:

    • Pros: Commands get better with use. Adapt to project needs.
    • Cons: Non-deterministic. Can drift from original intent. Needs safeguards.

    Level 7: Meta-Cognitive

    What defines it: Improves other commands, not just itself. Operates on the command system as a whole.

    When to use it: For maintenance and evolution of the command ecosystem. Quality assurance for prompts.

    Example: A bulk-update orchestrator that analyzes all commands and suggests improvements, or a command that identifies redundant commands and proposes consolidation.

    # command-optimizer.md
    1. Scan all .md commands in .claude/
    2. Identify patterns:
       - Duplicated logic
       - Commands that could be parameterized
       - Missing error handling
    3. Generate improvement proposals
    4. Execute approved updates

    Trade-offs:

    • Pros: System-wide optimization. Maintains command quality.
    • Cons: Highest complexity. Requires understanding of entire system. Risk of breaking changes.

    Choosing the Right Level

    Start at the lowest level that solves the problem.

    • Level 1-2: Most commands should live here. Simple is better.
    • Level 3: Use sparingly. Often a sign you need multiple commands instead.
    • Level 4: Standard for project-aware tools. Worth the context-gathering cost.
    • Level 5: Good for defined workflows. Keep orchestration logic simple.
    • Level 6: Experimental. Needs monitoring and rollback capability.
    • Level 7: Rare. Usually for tooling teams or advanced automation.

    Signals you need a higher level:

    • Creating many similar commands → Move from 1 to 2 (see the sketch after this list)
    • Copy-pasting logic between commands → Move to 5
    • Manually tweaking commands after each use → Consider 6
    • Spending more time maintaining commands than using them → Consider 7
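
    For example, the first signal in practice: a minimal before/after sketch with hypothetical command names, reusing the review-file.md wording from Level 2:

    # Before: three near-duplicate Level 1 commands
    review-readme.md:  Review README.md for clarity and completeness.
    review-api.md:     Review docs/api.md for clarity and completeness.
    review-guide.md:   Review docs/guide.md for clarity and completeness.

    # After: one Level 2 command
    review-doc.md:     Review the file at $ARGUMENTS for clarity and completeness.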

    Signals you're at too high a level:

    • Can't predict what the command will do
    • Debugging takes longer than the command saves
    • Other developers avoid using it
    • You're the only one who understands it

    The maturity sweet spot for most systems is a pyramid distribution (sketched below):

    • Many Level 1-2 commands (foundation)
    • Some Level 3-4 commands (core workflows)
    • Few Level 5 commands (orchestration)
    • Rare Level 6-7 commands (if any)
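
    As an illustration, a hypothetical .claude/commands/ directory with a healthy pyramid distribution (file names reuse the examples above; the layout itself is invented):

    .claude/commands/
      format-code.md            # Level 1: static
      run-tests.md              # Level 1: static
      review-file.md            # Level 2: parameterized
      analyze.md                # Level 3: conditional
      contextualized-review.md  # Level 4: contextual
      feature-workflow.md       # Level 5: higher order
      (no Level 6-7 commands until the need is demonstrated)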

    Engineer Leverage Progression

    [2026-04-11]: The seven levels above describe prompt artifact sophistication. A complementary dimension describes practitioner workflow sophistication — how an individual engineer's framing of AI use changes as leverage increases. This axis is orthogonal: an engineer can write a structurally sophisticated L4-L5 prompt while still operating in a low-leverage framing. Liu's central observation names the gap precisely: "One engineer treats AI like a better search engine. The other treats AI like an entire engineering team working in concert."

    Three stages characterize the progression, observed independently by multiple practitioners:

    Stage 1 — Search-Engine Framing

    Framing: AI as a better search interface or text transformer.

    Behavior: Isolated, one-off queries. Context is provided per-query without codebase integration. Output is treated as reference material, not an executable artifact.

    Prompt-level correlation: Typically L1-L3. Parameterization adds flexibility, but the workflow remains point-to-point. A well-structured L4 prompt can appear in Stage 1 framing when the engineer pastes context manually without workflow integration.

    Named by: Liu [1] — the "David" example: a meeting transcript pasted into a fresh session produces a plan disconnected from actual system architecture regardless of prompt quality.

    Key limitation: No feedback loop. Each query begins fresh; there are no compounding returns.
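
    A minimal sketch of Stage 1 framing (wording hypothetical). The entire interaction is one self-contained query, with no codebase context and no downstream consumer:

    # one-off query (never saved as a command)
    Here are our meeting notes: [pasted transcript]
    Write an implementation plan.

    The output is read as reference material; nothing feeds a next stage, so nothing compounds.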

    Stage 2 — Integrated Workflow Framing

    Framing: AI as a pipeline stage whose output feeds subsequent stages.

    Behavior: Complete context is provided — full meeting transcript plus codebase, not fragments. Output is designed for downstream consumption: PRD feeds tickets, tickets feed Plan.md, Plan.md feeds agentic execution. Humans design the workflow; AI executes stages within it.

    Prompt-level correlation: Typically L4-L5. Contextual and higher-order prompts are structurally required by this framing — the workflow's pipeline structure forces L4+ because each stage must be informed by the previous stage's output.

    Named by: Liu [1] — the "Elena" example: complete workflow from meeting to executable plan; Willison [7] — "lead on design, delegate on implementation."

    Key advancement: Compounding context. Each stage's output enriches the next stage's input; returns accumulate across the pipeline.
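
    A hedged sketch of Stage 2 framing expressed as a higher-order command (file names and paths hypothetical):

    # meeting-to-plan.md
    1. Read the full meeting transcript at $ARGUMENTS
    2. Read CLAUDE.md and the relevant modules under src/
    3. Draft PRD.md grounded in both the transcript and the codebase
    4. Break PRD.md into tickets
    5. Write Plan.md in a form consumable by agentic execution

    Each step consumes the previous step's output, which is what makes the context compound across the pipeline.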

    Stage 3 — System Designer Framing

    Framing: AI as a component in a designed system with feedback loops, measurement, and self-improvement.

    Behavior: Engineers define evaluation criteria before implementation. Error analysis precedes optimization. Measurement infrastructure — evals, tracing, human review — is designed first; automation of benchmarks follows manual understanding, not the reverse.

    Prompt-level correlation: Typically L5-L7. Meta-cognitive prompts (Level 7) emerge from the demands of this framing: once an engineer treats the system as measurable, prompts that improve themselves become structurally necessary rather than optional.

    Named by: Hamel [4][6] — manual error analysis is the highest-ROI starting point; "good writing is good thinking" — prompt engineering requires active intellectual engagement; Willison [7] — design decisions remain human before delegation to the system.

    Key advancement: Feedback discipline. Improvement is systematic rather than intuitive; the measurement infrastructure determines what is worth optimizing.
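
    A hedged sketch of Stage 3 feedback discipline expressed as a command (directory names and the threshold are hypothetical):

    # eval-and-improve.md
    1. Run each command in .claude/commands/ against its cases in evals/cases/
    2. Score outputs against expected results in evals/expected/
    3. Write failures, with traces, to evals/failures/ for human review
    4. Propose template updates only for failure patterns seen three or more times
    5. Apply updates only after human approval

    The measurement infrastructure (steps 1-3) is designed before the self-improvement loop (steps 4-5), matching the principle that manual error analysis precedes effective automation.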


    Stage                 Framing                         Typical Prompt Level   Compounding Returns
    Search-Engine         Query-response                  L1-L3                  None; each query begins fresh
    Integrated Workflow   Pipeline stage                  L4-L5                  Moderate; context accumulates across stages
    System Designer       Designed system with feedback   L5-L7                  High; measurement enables systematic improvement

    The stages correlate with but do not determine prompt level. An engineer can reach Stage 2 while using only L3-L4 prompts — the workflow framing will pull prompt sophistication upward as the engineer encounters the structural requirements of pipeline stages. Hamel's principle applies to the transition from Stage 2 to Stage 3: manual engagement precedes effective automation. Stage 3 cannot be reached by automating Stage 1 behaviors; the judgment about what to measure must be developed manually before measurement can compound.

    Stage 2 integration maps directly to leverage points #1 (ADWs) and #3 (Plans) in the Twelve Leverage Points hierarchy — the pipeline structure of Stage 2 is what makes Plans (#3) genuinely executable and ADWs (#1) designable rather than improvised. Stage 3 maps to #5 (Tests) and the measurement infrastructure that distinguishes production-grade from experimental systems. Stage 3 also corresponds to the individual prerequisite for operating at Shapiro Level 3-4 in the Software Factories framework.


    Connections

    • To Prompt Structuring: Structural choices (output templates, failure sentinels, state machines) are the techniques that enable prompts to move up maturity levels.
    • To Self-Improving Experts: Self-modifying (Level 6) and meta-cognitive (Level 7) prompts parallel expert system evolution. The three-command expert pattern (Plan-Build-Improve) implements Level 6 maturity.
    • To Knowledge Evolution: Tracking prompt maturity progression mirrors knowledge base maturity—both evolve from simple to sophisticated through observation and refinement
    • To Twelve Leverage Points: The Engineer Leverage Progression stages map onto the leverage hierarchy: Stage 2 framing enables Plans (#3) and ADWs (#1); Stage 3 framing enables Tests (#5) and system-level measurement. The Anti-Patterns by Leverage Level section in that entry identifies the specific failure modes that Stage 1 framing produces when applied to high-leverage points.
    • To Evaluation: Prompt maturity and evaluation maturity co-evolve. Reaching Level 5+ prompt sophistication (higher-order, self-modifying) without corresponding evaluation infrastructure produces unmeasured complexity—changes to self-modifying prompts can degrade behavior in ways that only systematic evaluation detects. The Evaluation Maturity Curve in that section maps directly: Level 1–2 prompts (static, parameterized) pair with manual evaluation; Level 5+ prompts (higher-order, self-modifying) require at minimum scripted automated evaluation to remain maintainable.