Long-Horizon Agent State

If you're curious about the implementation, check out Warren. Everything I describe here is built there. These ideas came from building that system.

Historically, the industry conversation about AI agents has been almost entirely about model capability: reasoning, tool use, planning within a context window. In the past few months the narrative has shifted to harnesses. But I still think there's a bottleneck beyond the harness. It comes in the form of agent state, and persisting that state across agents.

I'm calling this Long-Horizon Agent State. It's the persistent, structured state that allows agents to pursue goals across multiple sessions, runs, and human interactions. The term rhymes with "long-horizon planning" from the research literature, but shifts focus from the model's planning ability to the infrastructure surrounding the model.

Most harnesses today are stateless function calls. Prompt in, output out, context accumulates, session restarts. That works fine for one-shot tasks. But I want to push my systems to do work that lasts for hours, days, weeks, not one-shot prompts that are done in 20 minutes. I want agents that work 24/7 and do so in alignment with my goals. So how do you solve this?

Here's how I think about it. Long-Horizon Agent State decomposes into five layers, each solving a distinct problem. An agent system needs all five to sustain work across a long horizon. I'll walk through them bottom-up.

The first layer is execution. No matter what harness you're running, agent runs are typically black boxes. You send messages, the agent does things, the session ends. Logging and tracing are present, but how often are devs really digging into their ~/.claude dir to look at the results of every agent they've run? Tool calls, reasoning, outputs, errors, and state transitions should all be persisted and accessible. Runs should be isolated in sandboxes, but their history should be permanent. Failed runs should leave artifacts. This is the raw substrate that everything else builds on. Without this layer, you can't audit, replay, or attribute cost to anything.
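To make the execution layer concrete, here is a minimal sketch of a persisted run history, assuming one append-only JSON Lines file per run. This is not Warren's storage format. The names RunEvent and RunLog and the file layout are hypothetical and only illustrate the idea that every event in a run is written down and can be replayed later.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Any


@dataclass
class RunEvent:
    """One persisted event in a run (hypothetical schema)."""
    run_id: str
    kind: str                      # e.g. "tool_call", "output", "error", "state"
    payload: dict[str, Any]
    ts: float = field(default_factory=time.time)


class RunLog:
    """Append-only log for a single run, stored as <root>/<run_id>.jsonl."""

    def __init__(self, root: Path, run_id: str) -> None:
        self.run_id = run_id
        self.path = root / f"{run_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def append(self, kind: str, **payload: Any) -> RunEvent:
        # Events are only ever appended, so a failed run still leaves a trail.
        event = RunEvent(self.run_id, kind, payload)
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(event)) + "\n")
        return event

    def replay(self) -> list[RunEvent]:
        # Read the full history back in the order it was written.
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [RunEvent(**json.loads(line)) for line in f if line.strip()]
```

Because the log is a plain file, it can outlive the sandbox the run executed in, which is what makes audit, replay, and cost attribution possible after the fact.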

The second layer is memory. I built Mulch to solve this for my own projects. It's a typed memory system where agents record conventions, patterns, failures, decisions, and references. Records are keyed by domain, linked to evidence like commits and issues, and scoped to relevance on retrieval. Agents prime themselves at session start and record before they finish. Agents actually get better at a codebase over time. Knowledge transfers between agents working on the same project.
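A rough sketch of the shape of such a memory system, under stated assumptions: this is not Mulch's actual API or storage model. MemoryRecord, MemoryStore, record, and prime are hypothetical names, and "scoped to relevance" is reduced here to a simple domain filter plus a recency limit.

```python
from dataclasses import dataclass
from enum import Enum


class RecordType(Enum):
    CONVENTION = "convention"
    PATTERN = "pattern"
    FAILURE = "failure"
    DECISION = "decision"
    REFERENCE = "reference"


@dataclass(frozen=True)
class MemoryRecord:
    domain: str                      # key used to scope retrieval
    type: RecordType
    text: str
    evidence: tuple[str, ...] = ()   # links to commits, issues, and so on


class MemoryStore:
    """In-memory stand-in for a persistent store (illustration only)."""

    def __init__(self) -> None:
        self._records: list[MemoryRecord] = []

    def record(self, rec: MemoryRecord) -> None:
        # Called by an agent before it finishes a session.
        self._records.append(rec)

    def prime(self, domains: set[str], limit: int = 20) -> list[MemoryRecord]:
        # Called at session start: return the most recent records for the
        # domains the agent is about to touch, oldest first.
        if limit <= 0:
            return []
        relevant = [r for r in self._records if r.domain in domains]
        return relevant[-limit:]
```

The important property is the pairing: every session starts with prime and ends with record, so what one agent learns is available to the next agent working in the same domain.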

The third layer is intent. By default, agents work on isolated tasks with no awareness of the broader goal or what work remains. They need a structured issue system where work decomposes into plans with ordered steps, blocking dependencies, and lifecycle states. I built Seeds for this. Agents read the work queue to know what's ready. Plans give agents the full picture: what's done, what's blocked, what's next. Multi-step goals survive across sessions. Agents pick up where the last one left off.
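Here is one way the plan structure could be modeled, as a sketch rather than a description of how Seeds works. The Step and Plan classes, the three lifecycle states, and the ready and summary methods are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    IN_PROGRESS = "in_progress"
    DONE = "done"


@dataclass
class Step:
    id: str
    title: str
    blocked_by: list[str] = field(default_factory=list)  # ids of prerequisite steps
    status: Status = Status.OPEN


@dataclass
class Plan:
    goal: str
    steps: list[Step]   # kept in the order they should be attempted

    def ready(self) -> list[Step]:
        # The work queue: open steps whose blockers are all done.
        done = {s.id for s in self.steps if s.status is Status.DONE}
        return [
            s for s in self.steps
            if s.status is Status.OPEN and all(b in done for b in s.blocked_by)
        ]

    def summary(self) -> dict[str, list[str]]:
        # The "full picture" an agent reads when it picks up the plan.
        ready_ids = {s.id for s in self.ready()}
        out: dict[str, list[str]] = {
            "done": [], "in_progress": [], "ready": [], "blocked": [],
        }
        for s in self.steps:
            if s.status is Status.DONE:
                out["done"].append(s.id)
            elif s.status is Status.IN_PROGRESS:
                out["in_progress"].append(s.id)
            elif s.id in ready_ids:
                out["ready"].append(s.id)
            else:
                out["blocked"].append(s.id)
        return out
```

Since the plan is data rather than conversation context, a new session can call summary and continue from the first ready step without knowing anything about the session that came before it.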

The fourth layer is coordination. Humans and agents work in silos. There's no shared record of decisions, questions, or steering. I built Plot as a coordination object that binds intent, attachments, and an append-only event log around a unit of work. Humans and agents are peer actors on the log. Humans set intent. Agents record decisions, produce artifacts, and pose questions. Agents cannot edit intent. They must ask. This is by design. The boundary prevents agents from silently redefining their own objectives. Events accumulate over days or weeks. Each actor sees the full history. Questions pause work until answered. The coordination object becomes the single source of truth for why something is being done and what happened along the way.
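The two rules that matter most here, that agents cannot edit intent and that open questions pause work, are easy to show in code. The following is a sketch under my own assumptions, not Plot's real data model: CoordinationObject, Event, the event kinds, and the ref field are hypothetical names.

```python
from dataclasses import dataclass
from typing import Literal, Optional

ActorKind = Literal["human", "agent"]


@dataclass(frozen=True)
class Event:
    actor: str
    actor_kind: ActorKind
    kind: str                  # "intent", "decision", "artifact", "question", "answer"
    body: str
    ref: Optional[int] = None  # for answers: index of the question being answered


class CoordinationObject:
    """Intent plus an append-only event log around one unit of work."""

    def __init__(self, intent: str, author: str) -> None:
        self._events: list[Event] = []
        self.append(Event(author, "human", "intent", intent))

    def append(self, event: Event) -> int:
        # The boundary: only humans may write intent events.
        if event.kind == "intent" and event.actor_kind != "human":
            raise PermissionError("agents cannot edit intent; pose a question instead")
        self._events.append(event)
        return len(self._events) - 1

    @property
    def intent(self) -> str:
        # The latest intent event wins; earlier ones stay in the history.
        return next(e.body for e in reversed(self._events) if e.kind == "intent")

    @property
    def events(self) -> tuple[Event, ...]:
        # Read-only view: every actor sees the full history.
        return tuple(self._events)

    def open_questions(self) -> list[int]:
        answered = {e.ref for e in self._events if e.kind == "answer"}
        return [
            i for i, e in enumerate(self._events)
            if e.kind == "question" and i not in answered
        ]

    def paused(self) -> bool:
        # Work is paused while any question is unanswered.
        return bool(self.open_questions())
```

Nothing is ever updated in place. Changing intent means a human appends a new intent event, so the log still shows what the goal used to be and when it changed.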

The fifth layer is orchestration. Real long-horizon work requires multiple runs in sequence. A for loop walks a plan's steps one at a time, dispatching an agent run for each step, waiting for the previous run's PR to merge before dispatching the next. Trivial steps auto-advance. Failed steps halt the plan. Resumed plans skip already-completed steps. This gives you multi-PR features where each PR is a clean, reviewable unit. Human review becomes a natural checkpoint. Plans survive interruption and resume from where they stopped.
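The loop described above is small enough to sketch directly. This is an illustration, not Warren's orchestrator: PlanStep, run_plan, and the dispatch and wait_for_merge callbacks are hypothetical, and I'm assuming "auto-advance" means a trivial step does not wait on a PR merge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlanStep:
    name: str
    trivial: bool = False
    status: str = "pending"    # "pending" | "done" | "failed"


def run_plan(
    steps: list[PlanStep],
    dispatch: Callable[[PlanStep], bool],        # run an agent; True on success
    wait_for_merge: Callable[[PlanStep], bool],  # block until the PR merges; True if merged
) -> bool:
    """Walk the steps in order. Returns True when every step is done."""
    for step in steps:
        if step.status == "done":
            continue                   # resumed plan: skip completed work
        if not dispatch(step):
            step.status = "failed"
            return False               # a failed step halts the plan
        if not step.trivial and not wait_for_merge(step):
            step.status = "failed"     # PR was closed or rejected
            return False
        step.status = "done"           # trivial steps reach here without a merge wait
    return True
```

Calling run_plan again on the same list after fixing a failure picks up at the first step that is not done, which is all that resumption needs as long as the step statuses are persisted somewhere durable.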

The layers compose into a feedback loop. All of them live in git with the code itself. A human writes intent through the coordination layer. Intent decomposes into a plan. The system dispatches an agent for step one. The agent runs in a sandbox, reads memory from prior sessions. It hits a blocker, poses a question to the user. Work pauses. The human answers. The agent resumes. It records what it learned before finishing. A PR opens. The PR gets merged. The system dispatches step two with accumulated context. Repeat until the plan completes. The coordination object transitions to done. The next agent on this project inherits the memory.

Long-Horizon Agent State is the infrastructure that turns a stateless model into a stateful collaborator. It's what allows agents to work for hours, days, weeks.

The layers I described are implemented in Warren and they work. I don't think the exact implementation matters as much as the decomposition. If you're building agent systems and hitting the wall where single-run agents can't sustain real work, I think these five layers are the ones you need to build.