Harness Categories | Agentic Engineering

Not all harnesses are the same.

The term "harness" spans a spectrum from lightweight workflow structuring to full agentic execution environments with filesystem access, command execution, and persistent multi-session memory. Understanding which category fits a given problem is a prerequisite to framework and tool selection — and selecting the wrong category creates a capability ceiling that no amount of prompt engineering can overcome.

The Three-Axis Stack

Ethan Mollick identifies three independent axes in the AI selection decision (~2026-Q1):

Axis	What It Selects	Examples
Model	The underlying reasoning capability	Claude Opus, GPT-5, Gemini Pro
App	The human-facing interface	claude.ai chat, IDE integration, CLI
Harness	The autonomous execution environment	Claude Code, LangGraph, raw API

These axes are genuinely independent. The same model deployed in different harness categories produces categorically different outcomes — not marginally different outputs, but fundamentally different capabilities.

The capability ceiling implication: Choosing a chat interface as the harness for an agentic system is not a model choice — it is a capability ceiling. A frontier model in a web chat interface cannot maintain state across sessions, cannot execute commands, cannot manage a multi-agent workflow, and cannot apply harness engineering improvements. The harness selection determines the upper bound of what the agent can do; the model selection operates within that bound.

[2026-04-12]: This observation from Mollick's analysis has significant practical implications. Many practitioners who report that AI tools "cannot handle complex tasks" are operating frontier models inside chat interfaces — harnesses that were not designed for autonomous, multi-step execution. The limitation is architectural, not model-level.

Harness Capability Tiers

[2026-04-12]: The three-tier taxonomy below reflects the production landscape as of April 2026. Boundaries between tiers are shifting as vendors add agentic capabilities to chat interfaces.

Tier	Examples	Characteristics
Full agentic harnesses	Claude Code, OpenAI Codex	Autonomous file access, command execution, multi-step loops, tool-native design, hooks for enforcement, persistent memory across sessions
Web-based chat interfaces	claude.ai, chatgpt.com	Optimized for conversational interaction; limited autonomous execution; session-scoped context; no persistent filesystem access
Constrained agentic environments	Gemini website, Gemini in Google Workspace	Capable model with limited harness; agency bounded by platform design; some tool access but not full autonomy

Practical implication for practitioners: Framework selection is moot if the deployment target is a chat interface. A practitioner who selects LangGraph as their orchestration framework but deploys agents through a chat interface has applied the wrong tool to the wrong problem. The harness category must be selected before framework selection begins.

The capability gap within tiers: Not all full agentic harnesses are equivalent. Claude Code ships with Anthropic's defaults already configured — system prompt conventions, permission models, memory mechanisms, and hooks are pre-built. A raw API call with a custom Python script is also a "full agentic harness" in principle but requires building all six stack components from scratch. The tier defines what is possible; the specific implementation determines what is practical.

Framework vs. Runtime vs. Harness

Three terms that practitioners frequently conflate but that refer to distinct layers of the stack:

Term	Definition	Examples	When It Operates
Framework	Build-time libraries for constructing agent workflows	LangChain, CrewAI, AutoGen	Build-time (development)
Runtime	Production execution with state persistence and observability	LangGraph, LangSmith, Weave	Build-time + runtime
Harness	Opinionated runtime with defaults, batteries-included	Claude Code, OpenAI Codex	Runtime (production)

The key distinction: A framework is what a developer uses to build an agent. A harness is what the agent runs inside during execution. These are different concerns at different times.

LangGraph is a runtime framework — it provides the execution infrastructure but requires developers to make all architectural decisions: which tools, which memory model, which prompt shape, which permission enforcement. It is a blank canvas with powerful primitives.

Claude Code is a harness — it ships with Anthropic's architectural decisions already made. The system prompt conventions, the memory mechanisms, the permission model, the hooks system — these are included with defaults that work for most use cases. Developers configure and extend; they do not build from scratch.

Choosing between framework and harness:

Situation	Recommendation
Novel multi-agent coordination pattern	Framework (LangGraph, AutoGen) — need full control
Standard coding/development work	Harness (Claude Code) — defaults work; start immediately
Web-based AI product	Framework + custom runtime — harness defaults won't fit product requirements
Research and experimentation	Framework — flexibility needed for exploration
Production agentic workflow at standard tasks	Harness — faster deployment, maintained defaults
Enterprise with specific compliance requirements	Framework + custom runtime — compliance-specific controls

Harness Category Taxonomy

Category 1: Full Agentic Harnesses

Full agentic harnesses are purpose-built execution environments that provide all six stack components out of the box. They are the reference implementation of the harness concept.

Defining characteristics:

Autonomous filesystem access (read, write, navigate directories)
Command execution (shell, compilers, test runners)
Multi-step execution loops with stopping conditions
Persistent memory across sessions
Permission enforcement via hooks or equivalent mechanisms
Tool-native design (tools are first-class, not bolted on)

Canonical examples: Claude Code (Anthropic), OpenAI Codex

When this category applies: Software development and engineering tasks, code review, codebase navigation and modification, automated testing, and any task requiring repeated read-write-execute cycles on a filesystem. The full agentic harness is the natural environment for the Orchestrator Pattern, Autonomous Loops, and ReAct execution patterns.

The Claude Code reference implementation: Claude Code ships with CLAUDE.md convention support (workspace context), auto-memory (session memory), hooks (computational guides), skills (progressive context disclosure), and sub-agents (bounded delegation). All six stack components are implemented with Anthropic's defaults — practitioners configure rather than build.

Category 2: Web-Based Chat Interfaces

Web-based chat interfaces are conversational environments that prioritize human-AI dialogue over autonomous execution. They are not harnesses in the production engineering sense — they are apps that use harnesses internally.

Defining characteristics:

Optimized for back-and-forth conversation
Session-scoped context (typically no cross-session persistence unless explicitly provided)
Limited tool access (file upload, web search, code execution via sandbox — not filesystem access)
No hooks or enforcement mechanisms accessible to the developer
No programmatic control over context management

When this category applies: Exploratory work, single-session research, drafting and editing, tasks where the human remains the execution environment (the model advises; the human acts).

The capability ceiling is real: A practitioner attempting to run a multi-phase software project through a chat interface will hit the ceiling quickly: no persistent memory, no autonomous execution, no enforcement mechanisms, no trajectory capture. The limitation is not model capability — it is harness design.

Category 3: Workflow Harnesses

A workflow harness is a structured layer above raw agent interactions that imposes phase discipline, context engineering, and state persistence on multi-step human-AI work. It is the middle tier between chat interfaces and full agentic runtimes.

[2026-04-12]: Workflow harnesses are a recently named category (2026-Q1) that fills the gap between single-session chat and full autonomous execution. The GSD (get-shit-done) tool is the canonical open-source example.

Defining characteristics:

Meta-prompting system that structures human-AI workflows
Phase decomposition (break large work into defined phases with verification gates)
Context engineering (fresh context per executor, no context rot across phases)
State persistence (state files as coordination layer, not model memory)
Wave-based parallelism (simultaneous execution of independent tasks within a phase)

The GSD pattern: The GSD workflow harness implements a five-phase lifecycle for complex software projects: specification → design → implementation → verification → integration. Each phase uses a fresh executor with scoped context, preventing context rot that accumulates when a single agent carries full conversation history across all phases. State is persisted in explicit files (not model memory) so each phase has access to previous phase outputs without inheriting previous phase context.

Tooling hierarchy:

Scale and abstraction:
  Raw coding agent       → Single-session, ad-hoc interaction
  Workflow harness (GSD) → Multi-phase structured lifecycle, context-engineered
  Workspace manager      → 20+ agent infrastructure (worktrees, merge queues, supervision)

The workflow harness occupies a practical niche: it handles projects too complex for a single agent session but not large enough to require full workspace manager infrastructure.

Category 4: Workspace Managers

Workspace managers are the high end of the harness spectrum — infrastructure for coordinating 20 or more agents simultaneously with shared filesystem coordination, merge queues, and active supervision.

Defining characteristics:

Multi-worktree management (each agent operates in an isolated git worktree)
Merge queue coordination (sequential integration of parallel agent outputs)
Supervision layer (orchestrating agent monitors and corrects subagent behavior)
Shared state coordination (distributed state accessible to all agents without conflicts)
Scale designed for sustained parallel throughput, not occasional multi-agent use

[2026-04-12]: Workspace manager tooling is an emerging infrastructure category (2026-Q1). Specific tools in this category include multi-agent workspace coordinators that manage git worktrees, parallel execution queues, and supervisory oversight. The category is evolving rapidly.

When this category applies: Large-scale parallel software development, enterprise codebase refactoring, sustained autonomous engineering workflows where human supervision of individual agents is not feasible.

Category 5: Persistent Personal Runtimes

An emerging category (2026) optimized for single-agent long-running personal use with persistent memory and self-improving skills — not multi-agent coordination.

Defining characteristics:

Cross-session memory (SQLite-backed or equivalent persistent storage)
Self-improvement loop (tasks generate procedural skills stored for future reuse)
Multi-platform integration (connects to calendar, email, documents, external services)
Single-agent design (continuity and recall, not orchestration topology)

Canonical example: Hermes Agent (Nous Research, ~47K GitHub stars, Feb 2026) — multi-platform gateway, SQLite-backed memory, self-improvement loop where tasks automatically generate procedural skills. The harness accumulates knowledge about the user's preferences, workflows, and domain-specific conventions across all sessions.

The distinction from full agentic harnesses: Full agentic harnesses are optimized for task execution within development environments. Persistent personal runtimes are optimized for continuity of relationship between a single agent and a single user across all domains of that user's work.

Category 6: Custom Framework + Runtime

The full-flexibility option: practitioners build agent infrastructure using framework primitives (LangChain, CrewAI, AutoGen) and deploy on runtime infrastructure (LangGraph, LangSmith) without using an opinionated harness.

When this applies:

Novel coordination patterns not supported by existing harnesses
Enterprise compliance requirements that prohibit opinionated defaults
Research contexts requiring full control over every component
Products that embed AI agents within larger systems with specific integration requirements

The cost: All six stack components must be built explicitly. There is no equivalent of Claude Code's auto-memory, hooks system, or skill dispatch — the team must implement each component or accept the risk of operating without it.

Decision Table

Problem Type	Harness Category
Single-session, exploratory work	Chat interface or raw coding agent
Multi-session projects with defined phases	Workflow harness (GSD pattern)
Parallel execution across 20+ agents	Workspace manager
Long-running personal assistant with persistent recall	Persistent personal runtime
Custom multi-agent system from scratch	Framework + runtime
Standard software development with Anthropic models	Full agentic harness (Claude Code)
Research with novel coordination patterns	Framework + runtime
Compliance-sensitive enterprise deployment	Framework + runtime with custom controls

When Category Selection Matters Most

Category selection matters most at the beginning of a project, when the deployment environment is still open. Once a chat interface is chosen and workflows are established around it, the switching cost to a full agentic harness is significant — not in technical terms but in organizational terms. Teams develop habits and expectations around the interface they use.

The correct order of selection decisions:

Define the problem type (single-session vs. multi-session, exploratory vs. systematic)
Select the harness category based on the decision table
Select specific tools within the chosen category (evaluated in Practitioner Toolkit)
Select the model based on capability requirements and harness compatibility

Reversing this order — selecting the model first, then discovering that the preferred interface doesn't support the required harness capabilities — is a common source of architectural rework.

Connections

What Is a Harness? — Foundational definition and the Agent = Model + Harness formula
The Harness Stack — Six-component decomposition used to evaluate harnesses within categories
Designing for Your Context — Decision frameworks for choosing between categories
Practitioner Toolkit — Specific tool evaluations within each category
Workflow Coordination — GSD case study and workflow harness patterns in depth
Agent Frameworks — Framework-specific capability tiers and selection guidance