What Is a Harness?

    A harness is everything around the model that makes it an agent.

    Agent = Model + Harness
    

    This formula — attributed to Martin Fowler and echoed independently by multiple practitioners in early 2026 — provides the clearest available answer to a question practitioners have struggled to articulate: what is the thing that wraps the model and makes it capable of completing multi-step, multi-tool, multi-session work? The answer is the harness. Not the prompt. Not the context. Not the tools individually. The surrounding system that assembles and manages all of those components in service of task completion.


    The Horse Harness Metaphor

    Ethan Mollick introduced the horse harness metaphor to explain why the term is well-chosen (Mollick, ~2026-Q1). A horse has raw physical capability — strength, speed, endurance. But raw capability cannot pull a plow, draw a carriage, or haul freight without a harness. The harness converts raw power into directed, controlled, useful work.

    The analogy maps precisely:

    Dimension Horse Analogy AI Agent
    Raw capability Physical strength Model reasoning and language capability
    Harnessing mechanism Physical harness straps and fittings Software harness (prompt, tools, memory, loops)
    Output Directed work (plowing, transport) Completed tasks (code, research, decisions)
    Without harness Undirected energy Answered prompts, not completed work

    A model without a harness is like a horse standing in a field — full of capability, producing nothing useful. The harness is what makes the difference between potential and performance.

    This metaphor also explains why the term "harness" has displaced earlier vocabulary. "Scaffold" suggests temporary support that gets removed. "Wrapper" suggests a thin passthrough layer. "Harness" suggests permanent infrastructure that actively directs capability — which is precisely what the surrounding system does.


    Definitions from Multiple Authorities

    [2026-04-12]: Five authoritative practitioners converged on complementary definitions within the same 90-day window in early 2026. This convergence is itself evidence that the concept has crystallized into a stable practitioner term.

    Source Definition
    Raschka (2026-04-04) "The software layer around the model that assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches stable prefixes, stores memory"
    LangChain (2026) "Every piece of code, configuration, and execution logic that isn't the model itself"
    Fowler (~2026-Q1) The execution environment that decomposes into Guides (feedforward) and Sensors (feedback), each computational or inferential
    Mollick (~2026-Q1) "A system that lets the AI use tools, take actions, and complete multi-step tasks on its own"
    Schmid (~2026-Q1) The OS in a Model-as-CPU metaphor — coordinates resources, enforces permissions, manages application lifecycle

    These definitions are complementary rather than competing. LangChain's definition is the widest aperture — "everything except the model" — which is useful for establishing scope but not for analysis. Raschka's definition is the most specific, naming six concrete components. Fowler's is the most architecturally precise, decomposing the harness into control mechanisms. Mollick's is the most accessible. Schmid's is the most conceptually generative — the OS analogy unlocks a rich set of implications about resource management, permission enforcement, and application lifecycle.

    A practitioner reading all five definitions in sequence understands the harness at multiple levels: what it includes (LangChain), what it does (Mollick), what it consists of (Raschka), how it works mechanically (Fowler), and what it is analogous to at the systems level (Schmid).

    The LangChain Definition in Practice

    "Everything except the model itself" is a useful working definition because it is inclusive without being exhaustive. The system prompt is part of the harness. The tool schema registry is part of the harness. The context management logic is part of the harness. The permission enforcement layer is part of the harness. The memory persistence system is part of the harness. The loop controller that decides when to stop is part of the harness.

    What is not part of the harness: the weights, the attention mechanism, the training data, the inference compute. Those belong to the model.

    The OS Analogy

    Schmid's OS analogy deserves extended treatment because it generates the most practical implications. If the model is the CPU — capable of computation but requiring inputs and orchestration — then the harness is the operating system that:

    • Manages memory allocation (context management component)
    • Enforces access permissions (tool access and permission component)
    • Coordinates concurrent processes (subagent delegation component)
    • Maintains system state (session memory component)
    • Provides a standard interface for application code (prompt shape component)
    • Maintains awareness of the working environment (workspace context component)

    The analogy also clarifies why harness quality is often more important than model capability. A high-powered CPU running a poorly designed OS will underperform a moderate CPU running a well-engineered one. Similarly, a frontier model running inside a poorly designed harness will underperform a previous-generation model inside a well-engineered harness — for most production tasks.


    Scaffold vs. Harness — A Critical Distinction

    The academic and SWE-bench research literature uses "scaffold" where practitioners now use "harness." These terms are not synonymous, and conflating them creates confusion about what to optimize and when.

    arXiv:2603.05344 (2026-03) provides the clearest available formalization: "Scaffolding assembles the agent before the first prompt; the harness orchestrates everything after."

    Term When It Operates Primary Function What It Manages
    Scaffold Pre-runtime (setup) Assembly System prompt construction, tool schema registration, subagent registry initialization, initial context loading
    Harness Runtime (execution) Orchestration Tool dispatch, context management, safety enforcement, loop control, memory updates

    The distinction matters for practitioners because scaffolding failures and harness failures have different symptoms and different fixes:

    Scaffold failure symptoms:

    • Agent begins with incorrect capabilities or missing tools
    • System prompt missing critical instructions
    • Tool schemas malformed or absent
    • Initial context contains stale or irrelevant information

    Harness failure symptoms:

    • Agent accumulates context rot across turns
    • Tool calls reach outside allowed scope
    • Memory state inconsistent across sessions
    • Agent loops without stopping conditions
    • Subagent outputs not properly integrated

    A practitioner who conflates scaffold and harness will apply harness fixes to scaffold problems (adding runtime context management when the real issue is missing upfront initialization) and scaffold fixes to harness problems (revising system prompt instructions when the real issue is runtime permission enforcement).

    [2026-04-12]: "Scaffold" remains dominant in academic literature and SWE-bench benchmarking contexts. "Harness" is the 2026 practitioner term in production engineering discussions. Practitioners reading research papers should translate "scaffold" as approximately "harness" in most contexts — with awareness that the research term is ambiguous between pre-runtime and runtime concerns.


    Historical Evolution

    The harness has always existed — every system that calls a model has some surrounding infrastructure. What changed in 2026 is that the surrounding system became the primary object of design attention, named, and theorized.

    Era 1: Prompt Engineering (2022–2024)

    The model was the locus of improvement. The prompt was everything. Getting the model to do the right thing meant writing better instructions, adding examples, structuring output formats, and applying techniques like chain-of-thought. The surrounding system was minimal — an API call with a string in, a string out.

    The harness in this era was invisible because it was trivial: a thin HTTP wrapper that assembled the prompt and returned the completion. Nobody needed a name for it because nobody was designing it.

    Era 2: Context Engineering (2025)

    Context became the primary lever as agentic systems grew more capable and context windows grew larger. What the model was shown — not just told — determined outcomes. The surrounding system began to matter: what went into the context window, in what order, at what granularity, for how long.

    The harness in this era was the context manager — still not fully named, but clearly more than a prompt wrapper. Practitioners began writing sophisticated context injection logic, compression systems, and retrieval pipelines. The surrounding system was becoming a first-class engineering concern.

    Era 3: Harness Engineering (2026 — )

    [2026-04-12]: Mitchell Hashimoto coined "harness engineering" on February 5, 2026, to describe the discipline of systematically improving the surrounding system whenever an agent makes a mistake. The coinage named a practice that had been happening in fragmentary form but lacked a vocabulary.

    The shift: the harness is now the primary design object. When an agent underperforms, practitioners ask first whether the harness component is the cause — not whether the model is insufficient. The model is a fixed capability (for a given provider and tier); the harness is the variable that practitioners control and can improve.

    Each era added a layer without eliminating the previous ones. Prompt engineering remains essential. Context engineering remains essential. Harness engineering adds a third layer of systematic optimization that operates above and around both.

    Era 3: Harness Engineering
      ↑ wraps
    Era 2: Context Engineering
      ↑ wraps
    Era 1: Prompt Engineering
      ↑ invokes
    Model (fixed capability)
    

    Connection to the Twelve Leverage Points

    The Twelve Leverage Points framework (see Ch 1 Foundations) identifies places in a system where small changes produce disproportionate results. The harness is where the highest-leverage points live in an agentic system:

    • System prompt (leverage point 1): the harness manages the stable prefix
    • Tool selection and constraints (leverage point 3): the harness defines the tool inventory
    • Context composition (leverage point 5): the harness manages what enters the context window
    • Memory and state (leverage point 7): the harness maintains session memory
    • Loop control (leverage point 9): the harness decides when the agent continues and when it stops
    • Permission enforcement (leverage point 11): the harness is the security boundary

    Investing in harness engineering is investing in all of these leverage points simultaneously.


    Why Harness Is a Foundational Pillar

    The five foundational pillars answer five distinct questions:

    Pillar Question
    Prompt What instruction does the agent receive?
    Model What reasoning can the agent perform?
    Context What information can the agent access?
    Tool Use What actions can the agent execute?
    Harness What system orchestrates and constrains the agent's execution?

    The first four pillars are necessary but not sufficient. A perfectly crafted prompt, a frontier model, a rich context, and well-designed tools will not produce reliable agent behavior without a harness that:

    1. Assembles them coherently before the first turn
    2. Manages their interaction across multiple turns
    3. Enforces constraints on what the agent can do
    4. Provides feedback loops for self-correction
    5. Persists state across sessions
    6. Improves through observed failures

    Agent Psychometrics (arXiv:2604.00594, 2026-04) provides quantitative evidence for the harness pillar. The paper proposes P(success) = σ(θ_LLM + θ_scaffold − β_difficulty), where θ_LLM represents LLM capability and θ_scaffold represents scaffold (harness) quality. The key finding: these two terms are additively independent. Improving harness quality yields gains regardless of current model capability. Harness investment does not substitute for model capability — it multiplies independently.

    This independence is the strongest argument for treating harness as a distinct foundational pillar rather than a peripheral concern. Unlike the other pillars — where improvements to one interact with constraints from others — harness investment operates on its own axis.


    Summary

    The harness is the execution environment that wraps the model and makes it an agent. It includes everything the agent runs inside: the prompt assembly, the tool access layer, the context management system, the memory persistence, the permission enforcement, and the loop control. It excludes only the model weights and inference compute.

    The vocabulary is new (2026) but the concept is not. Every agentic system has always had a harness — what has changed is that practitioners now name it, design it deliberately, and improve it systematically through the practice of harness engineering.

    The foundational formula — Agent = Model + Harness — has a practical implication: when an agent underperforms, the diagnosis must consider both terms. Most practitioners default to model blame. The evidence suggests the harness is more often the distinguishing factor.


    Connections