What Is a Harness? | Agentic Engineering

A harness is everything around the model that makes it an agent.

Agent = Model + Harness

This formula — attributed to Martin Fowler and echoed independently by multiple practitioners in early 2026 — provides the clearest available answer to a question practitioners have struggled to articulate: what is the thing that wraps the model and makes it capable of completing multi-step, multi-tool, multi-session work? The answer is the harness. Not the prompt. Not the context. Not the tools individually. The surrounding system that assembles and manages all of those components in service of task completion.

The Horse Harness Metaphor

Ethan Mollick introduced the horse harness metaphor to explain why the term is well-chosen (Mollick, ~2026-Q1). A horse has raw physical capability — strength, speed, endurance. But raw capability cannot pull a plow, draw a carriage, or haul freight without a harness. The harness converts raw power into directed, controlled, useful work.

The analogy maps precisely:

Dimension	Horse Analogy	AI Agent
Raw capability	Physical strength	Model reasoning and language capability
Harnessing mechanism	Physical harness straps and fittings	Software harness (prompt, tools, memory, loops)
Output	Directed work (plowing, transport)	Completed tasks (code, research, decisions)
Without harness	Undirected energy	Answered prompts, not completed work

A model without a harness is like a horse standing in a field — full of capability, producing nothing useful. The harness is what makes the difference between potential and performance.

This metaphor also explains why the term "harness" has displaced earlier vocabulary. "Scaffold" suggests temporary support that gets removed. "Wrapper" suggests a thin passthrough layer. "Harness" suggests permanent infrastructure that actively directs capability — which is precisely what the surrounding system does.

Definitions from Multiple Authorities

[2026-04-12]: Five authoritative practitioners converged on complementary definitions within the same 90-day window in early 2026. This convergence is itself evidence that the concept has crystallized into a stable practitioner term.

Source	Definition
Raschka (2026-04-04)	"The software layer around the model that assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches stable prefixes, stores memory"
LangChain (2026)	"Every piece of code, configuration, and execution logic that isn't the model itself"
Fowler (~2026-Q1)	The execution environment that decomposes into Guides (feedforward) and Sensors (feedback), each computational or inferential
Mollick (~2026-Q1)	"A system that lets the AI use tools, take actions, and complete multi-step tasks on its own"
Schmid (~2026-Q1)	The OS in a Model-as-CPU metaphor — coordinates resources, enforces permissions, manages application lifecycle

These definitions are complementary rather than competing. LangChain's definition is the widest aperture — "everything except the model" — which is useful for establishing scope but not for analysis. Raschka's definition is the most specific, naming six concrete components. Fowler's is the most architecturally precise, decomposing the harness into control mechanisms. Mollick's is the most accessible. Schmid's is the most conceptually generative — the OS analogy unlocks a rich set of implications about resource management, permission enforcement, and application lifecycle.

A practitioner reading all five definitions in sequence understands the harness at multiple levels: what it includes (LangChain), what it does (Mollick), what it consists of (Raschka), how it works mechanically (Fowler), and what it is analogous to at the systems level (Schmid).

The LangChain Definition in Practice

"Everything except the model itself" is a useful working definition because it is inclusive without being exhaustive. The system prompt is part of the harness. The tool schema registry is part of the harness. The context management logic is part of the harness. The permission enforcement layer is part of the harness. The memory persistence system is part of the harness. The loop controller that decides when to stop is part of the harness.

What is not part of the harness: the weights, the attention mechanism, the training data, the inference compute. Those belong to the model.

The OS Analogy

Schmid's OS analogy deserves extended treatment because it generates the most practical implications. If the model is the CPU — capable of computation but requiring inputs and orchestration — then the harness is the operating system that:

Manages memory allocation (context management component)
Enforces access permissions (tool access and permission component)
Coordinates concurrent processes (subagent delegation component)
Maintains system state (session memory component)
Provides a standard interface for application code (prompt shape component)
Maintains awareness of the working environment (workspace context component)

The analogy also clarifies why harness quality is often more important than model capability. A high-powered CPU running a poorly designed OS will underperform a moderate CPU running a well-engineered one. Similarly, a frontier model running inside a poorly designed harness will underperform a previous-generation model inside a well-engineered harness — for most production tasks.

Scaffold vs. Harness — A Critical Distinction

The academic and SWE-bench research literature uses "scaffold" where practitioners now use "harness." These terms are not synonymous, and conflating them creates confusion about what to optimize and when.

arXiv:2603.05344 (2026-03) provides the clearest available formalization: "Scaffolding assembles the agent before the first prompt; the harness orchestrates everything after."

Term	When It Operates	Primary Function	What It Manages
Scaffold	Pre-runtime (setup)	Assembly	System prompt construction, tool schema registration, subagent registry initialization, initial context loading
Harness	Runtime (execution)	Orchestration	Tool dispatch, context management, safety enforcement, loop control, memory updates

The distinction matters for practitioners because scaffolding failures and harness failures have different symptoms and different fixes:

Scaffold failure symptoms:

Agent begins with incorrect capabilities or missing tools
System prompt missing critical instructions
Tool schemas malformed or absent
Initial context contains stale or irrelevant information

Harness failure symptoms:

Agent accumulates context rot across turns
Tool calls reach outside allowed scope
Memory state inconsistent across sessions
Agent loops without stopping conditions
Subagent outputs not properly integrated

A practitioner who conflates scaffold and harness will apply harness fixes to scaffold problems (adding runtime context management when the real issue is missing upfront initialization) and scaffold fixes to harness problems (revising system prompt instructions when the real issue is runtime permission enforcement).

[2026-04-12]: "Scaffold" remains dominant in academic literature and SWE-bench benchmarking contexts. "Harness" is the 2026 practitioner term in production engineering discussions. Practitioners reading research papers should translate "scaffold" as approximately "harness" in most contexts — with awareness that the research term is ambiguous between pre-runtime and runtime concerns.

Historical Evolution

The harness has always existed — every system that calls a model has some surrounding infrastructure. What changed in 2026 is that the surrounding system became the primary object of design attention, named, and theorized.

Era 1: Prompt Engineering (2022–2024)

The model was the locus of improvement. The prompt was everything. Getting the model to do the right thing meant writing better instructions, adding examples, structuring output formats, and applying techniques like chain-of-thought. The surrounding system was minimal — an API call with a string in, a string out.

The harness in this era was invisible because it was trivial: a thin HTTP wrapper that assembled the prompt and returned the completion. Nobody needed a name for it because nobody was designing it.

Era 2: Context Engineering (2025)

Context became the primary lever as agentic systems grew more capable and context windows grew larger. What the model was shown — not just told — determined outcomes. The surrounding system began to matter: what went into the context window, in what order, at what granularity, for how long.

The harness in this era was the context manager — still not fully named, but clearly more than a prompt wrapper. Practitioners began writing sophisticated context injection logic, compression systems, and retrieval pipelines. The surrounding system was becoming a first-class engineering concern.

Era 3: Harness Engineering (2026 — )

[2026-04-12]: Mitchell Hashimoto coined "harness engineering" on February 5, 2026, to describe the discipline of systematically improving the surrounding system whenever an agent makes a mistake. The coinage named a practice that had been happening in fragmentary form but lacked a vocabulary.

The shift: the harness is now the primary design object. When an agent underperforms, practitioners ask first whether the harness component is the cause — not whether the model is insufficient. The model is a fixed capability (for a given provider and tier); the harness is the variable that practitioners control and can improve.

Each era added a layer without eliminating the previous ones. Prompt engineering remains essential. Context engineering remains essential. Harness engineering adds a third layer of systematic optimization that operates above and around both.

Era 3: Harness Engineering
  ↑ wraps
Era 2: Context Engineering
  ↑ wraps
Era 1: Prompt Engineering
  ↑ invokes
Model (fixed capability)

Connection to the Twelve Leverage Points

The Twelve Leverage Points framework (see Ch 1 Foundations) identifies places in a system where small changes produce disproportionate results. The harness is where the highest-leverage points live in an agentic system:

System prompt (leverage point 1): the harness manages the stable prefix
Tool selection and constraints (leverage point 3): the harness defines the tool inventory
Context composition (leverage point 5): the harness manages what enters the context window
Memory and state (leverage point 7): the harness maintains session memory
Loop control (leverage point 9): the harness decides when the agent continues and when it stops
Permission enforcement (leverage point 11): the harness is the security boundary

Investing in harness engineering is investing in all of these leverage points simultaneously.

Why Harness Is a Foundational Pillar

The five foundational pillars answer five distinct questions:

Pillar	Question
Prompt	What instruction does the agent receive?
Model	What reasoning can the agent perform?
Context	What information can the agent access?
Tool Use	What actions can the agent execute?
Harness	What system orchestrates and constrains the agent's execution?

The first four pillars are necessary but not sufficient. A perfectly crafted prompt, a frontier model, a rich context, and well-designed tools will not produce reliable agent behavior without a harness that:

Assembles them coherently before the first turn
Manages their interaction across multiple turns
Enforces constraints on what the agent can do
Provides feedback loops for self-correction
Persists state across sessions
Improves through observed failures

Agent Psychometrics (arXiv:2604.00594, 2026-04) provides quantitative evidence for the harness pillar. The paper proposes P(success) = σ(θ_LLM + θ_scaffold − β_difficulty), where θ_LLM represents LLM capability and θ_scaffold represents scaffold (harness) quality. The key finding: these two terms are additively independent. Improving harness quality yields gains regardless of current model capability. Harness investment does not substitute for model capability — it multiplies independently.

This independence is the strongest argument for treating harness as a distinct foundational pillar rather than a peripheral concern. Unlike the other pillars — where improvements to one interact with constraints from others — harness investment operates on its own axis.

Summary

The harness is the execution environment that wraps the model and makes it an agent. It includes everything the agent runs inside: the prompt assembly, the tool access layer, the context management system, the memory persistence, the permission enforcement, and the loop control. It excludes only the model weights and inference compute.

The vocabulary is new (2026) but the concept is not. Every agentic system has always had a harness — what has changed is that practitioners now name it, design it deliberately, and improve it systematically through the practice of harness engineering.

The foundational formula — Agent = Model + Harness — has a practical implication: when an agent underperforms, the diagnosis must consider both terms. Most practitioners default to model blame. The evidence suggests the harness is more often the distinguishing factor.

Connections

The Harness Stack — The six-component decomposition of the harness
Harness Categories — Types of harnesses and when each applies
Harness as Control System — Fowler's guides-and-sensors framework
Harness Engineering — Hashimoto's methodology for systematic improvement
Twelve Leverage Points — Foundational leverage framework that harness components map to
Model — The other term in the Agent = Model + Harness formula