The Multi-Agent Landscape | Agentic Engineering

Multi-agent systems in early 2026 converge on shared architectural patterns despite divergent implementations. Across 15+ active frameworks, five dimensions define the design space: how agents communicate, how they organize, what constrains them, how they remember, and how much latitude they receive.

This chapter maps those dimensions. Projects appear as evidence for conceptual claims, not as the focus. The goal: identify the structural forces shaping multi-agent design so practitioners can make informed architectural decisions regardless of framework choice.

Communication — The Emerging Protocol Stack

[2026-02-11]: The multi-agent communication landscape has stratified into distinct layers, each addressing a different integration problem. No single protocol covers the full stack.

Protocol Stratification

Layer	Protocol	Purpose	Transport	Adoption Signal
Vertical (agent-to-tool)	MCP (Anthropic)	Standardized tool integration	JSON-RPC 2.0	97M monthly SDK downloads
Horizontal (agent-to-agent)	A2A (Google / Linux Foundation)	Cross-vendor agent interop	REST / JSON-RPC	50+ launch partners
Lightweight (legacy / REST)	ACP (IBM BeeAI)	Low-barrier agent coordination	REST	Open-source community
Decentralized (identity)	ANP (W3C)	Peer-to-peer agent discovery	DID-based	Early specification stage

The stratification reflects a fundamental insight: agent-to-tool communication (vertical) and agent-to-agent communication (horizontal) solve different problems. MCP standardizes how agents discover and invoke tools. A2A standardizes how agents discover and invoke each other. Attempting to solve both with one protocol produces an unwieldy specification.
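As a concrete illustration of the vertical layer, the sketch below builds a JSON-RPC 2.0 request of the kind an agent sends to invoke a tool. The `tools/call` method with `name` and `arguments` parameters follows the MCP convention; the tool name `search_docs`, its arguments, and the helper `jsonrpc_request` are hypothetical and exist only for this example.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched


def jsonrpc_request(method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request envelope (hypothetical helper)."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}


# Vertical call: an agent invoking a tool.
# "search_docs" and its arguments are made up for illustration.
tool_call = jsonrpc_request(
    "tools/call",
    {"name": "search_docs", "arguments": {"query": "retry policy"}},
)
print(json.dumps(tool_call, indent=2))
```

A horizontal agent-to-agent request carried over JSON-RPC would reuse the same envelope with that protocol's own method names, which is part of why the two layers can coexist in one system.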

Organizations adopting standardized protocols report a 60-70% reduction in integration time compared to custom agent messaging implementations.

Three Communication Paradigms

Beyond protocols, three paradigms govern how agents actually exchange information during execution:

Handoff — Agent transfers control mid-conversation with state attached. Clean and deterministic, but inherently sequential. One agent acts at a time.

Observed in: OpenAI Agents SDK, Pydantic AI
Trade-off: Simplicity vs. parallelism
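A minimal sketch of the handoff paradigm, assuming agents are plain functions that return either a final result or a `Handoff` object. This is not the API of any framework named above; `Handoff`, `run`, and the three agents are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    target: str                                 # agent that takes over next
    state: dict = field(default_factory=dict)   # state travels with control


def triage(state):
    # Decide who acts next and pass the accumulated state along.
    topic = "billing" if "invoice" in state["message"].lower() else "support"
    return Handoff(target=topic, state={**state, "routed_by": "triage"})


def billing(state):
    return {**state, "answer": "Billing will review the invoice."}


def support(state):
    return {**state, "answer": "Support will follow up."}


AGENTS = {"triage": triage, "billing": billing, "support": support}


def run(start, state):
    """Exactly one agent is active at a time, so execution is sequential."""
    current = start
    while True:
        result = AGENTS[current](state)
        if isinstance(result, Handoff):
            current, state = result.target, result.state
        else:
            return result


out = run("triage", {"message": "Question about my invoice"})
```

The loop makes the trade-off visible: control flow is easy to trace, but nothing runs in parallel.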

Shared State — Agents read and write to a common store (SQLite, graph state, session objects). Enables fast reuse of intermediate results, but risks overwrite conflicts and stale reads.

Observed in: AutoForge (SQLite features DB), LangGraph (graph state), Google ADK (output_key)
Trade-off: Speed of reuse vs. consistency guarantees

Async Mail / Hooks — Agents work independently, checking persistent message queues for coordination signals. Maximally decoupled, but coordination overhead increases with team size.

Observed in: Gas Town (hooks with GUPP principle), OpenClaw (Beads messaging)
Trade-off: Decoupling vs. coordination latency
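The async mail paradigm can be sketched with per-agent queues that receivers poll on their own schedule. The version below is in-memory for brevity, whereas the systems described above use persistent queues; the `Mailboxes` class and the message fields are hypothetical.

```python
from collections import defaultdict, deque


class Mailboxes:
    """Per-agent message queues; senders never call receivers directly."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, to, message):
        self._queues[to].append(message)

    def drain(self, agent):
        """Return and clear every message currently waiting for `agent`."""
        queue = self._queues[agent]
        messages = list(queue)
        queue.clear()
        return messages


mail = Mailboxes()
mail.send("reviewer", {"from": "builder", "type": "review_request", "ref": "task-17"})
mail.send("reviewer", {"from": "planner", "type": "priority_change", "ref": "task-17"})

# The reviewer checks its mailbox when it is ready; the senders have moved on.
inbox = mail.drain("reviewer")
```

Coordination latency is the time between `send` and the receiver's next `drain`, which is the cost paid for decoupling.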

The Structured Communication Trend

[2026-02-11]: A clear trend across frameworks: structured protocols over natural language for inter-agent communication. JSON-RPC 2.0 and REST dominate transport. The reasoning is practical — structured messages are parseable, validatable, and debuggable. Natural language between agents introduces the same ambiguity problems that prompt engineering solves for human-to-agent communication.

Opt-in communication as safety pattern: Some frameworks default agent-to-agent communication to OFF, requiring explicit allowlisting per agent pair. This treats cross-talk as a capability to be granted, not a default — a meaningful architectural choice that limits blast radius when individual agents misbehave.
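One way to express opt-in communication is a deny-by-default policy object consulted before any message is delivered. The sketch below assumes directed pairs (granting planner to executor does not grant the reverse); `CommPolicy` and the agent names are hypothetical.

```python
class CommPolicy:
    """Deny by default: a directed agent pair must be explicitly granted."""

    def __init__(self):
        self._allowed = set()

    def allow(self, sender, receiver):
        self._allowed.add((sender, receiver))

    def check(self, sender, receiver):
        if (sender, receiver) not in self._allowed:
            raise PermissionError(f"{sender} -> {receiver} is not allowlisted")


policy = CommPolicy()
policy.allow("planner", "executor")
policy.check("planner", "executor")  # granted pair: passes silently
```

Because every grant is explicit, the set of allowed pairs doubles as an auditable map of who can influence whom.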

Protocol Selection Guidance

Integration Need	Recommended Protocol	Rationale
Tool discovery and invocation	MCP	Mature ecosystem, wide adoption, standardized tool descriptions
Cross-vendor agent interop	A2A	Industry backing, enterprise focus, task lifecycle management
Legacy system integration	ACP	Low barrier, REST-native, minimal infrastructure
Decentralized agent networks	ANP	DID-based identity, no central authority required
Internal team coordination	Framework-native	Lower overhead than protocol adoption for single-vendor deployments

The decision is not "which protocol?" but "which protocols?" — most production systems need at least two (vertical + horizontal) and often three (adding framework-native for internal coordination).

Topology — Dynamic Over Fixed

The question is not "which topology is best?" but "which topology fits this task's coordination requirements?"

Seven Topologies in Active Use

Topology	Structure	Best For	Example Implementations
Hub-spoke / Orchestrator-worker	Central coordinator delegates to specialists	Production deployments — clear control and debugging	Anthropic Claude Code, Google ADK Coordinator, Gas Town Mayor
Sequential pipeline	Agent output feeds next agent's input	Linear workflows with clear stage boundaries	Google ADK SequentialAgent, AutoForge
Parallel fan-out/gather	Multiple agents execute simultaneously, results aggregated	Independent subtasks with synthesis	Google ADK ParallelAgent, MassGen
Hierarchical tree	Multi-level delegation (coordinator → sub-coordinator → workers)	Complex decomposition with scope isolation	Gas Town (Town → Rig → Crew), smolagents
Flat / peer-to-peer	Agents communicate directly without central coordinator	Routing-based specialization	OpenClaw, A2A-enabled systems
Dynamic / swarm	Team composition changes based on task requirements	Elastic workloads, evolving requirements	CrewAI crews, Swarms Enterprise
Generator-critic	One agent produces, another evaluates, iterating until quality threshold	Iterative refinement with explicit quality gates	Google ADK pattern, Constitutional AI variants

Dynamic Hierarchy as the Trend

[2026-02-11]: The trend is not away from hierarchy — it is toward dynamic hierarchy. Orchestrators that spawn, reassign, and dissolve teams as task requirements shift outperform static configurations. Fixed topologies force tasks into predetermined coordination shapes; dynamic topologies adapt coordination to the task.

LangChain's production guidance codifies the scaling path: "Start with a single agent. Add tools before adding agents. Adopt multi-agent only when facing context management constraints." This sequential escalation prevents premature complexity — most tasks that seem to need multi-agent coordination actually need better tool design.

Swarm vs. Supervisor Trade-offs

LangChain benchmarks show swarm architecture slightly outperforms supervisor across the board, but supervisor systems are easier to debug and reason about. The performance gap is narrow; the observability gap is wide.

Dimension	Swarm	Supervisor
Raw performance	Slightly higher	Slightly lower
Debugging	Distributed traces, harder to follow	Central log, linear trace
Scalability	Horizontal (add more agents)	Vertical (smarter coordinator)
Failure isolation	Natural (agent-level)	Requires explicit handling
Coordination cost	Implicit (emergence)	Explicit (orchestration logic)

Orchestration vs. Choreography

Code-based orchestration (deterministic routing) and LLM-based routing (flexible delegation, the choreography end of the spectrum) sit at opposite ends of a control spectrum. Most production systems use hybrid approaches: deterministic routing for well-understood paths, LLM-based routing for novel situations. Pure code-based orchestration is brittle; pure LLM-based routing is unpredictable.

Approach	Control Model	Strengths	Weaknesses
Code-based orchestration	Deterministic routing via explicit rules	Predictable, debuggable, auditable	Brittle under novel conditions
LLM-based routing	Model decides delegation dynamically	Flexible, adapts to unforeseen tasks	Unpredictable, harder to audit
Hybrid	Deterministic default, LLM for novel paths	Balance of control and flexibility	Complexity of two routing systems
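A hybrid router can be sketched as a rule table with a model-backed fallback. In the example below, `llm_route` is a stand-in that returns a fixed agent name rather than calling a model, and the rules and agent names are hypothetical.

```python
import re

# Deterministic rules for well-understood paths (hypothetical examples).
RULES = [
    (re.compile(r"\b(refund|invoice)\b", re.I), "billing_agent"),
    (re.compile(r"\b(stack trace|exception|crash)\b", re.I), "debug_agent"),
]


def llm_route(task):
    """Stand-in for a model call that picks an agent for a novel task."""
    return "generalist_agent"


def route(task, fallback=llm_route):
    """Return (agent, decision_source) so every routing decision is auditable."""
    for pattern, agent in RULES:
        if pattern.search(task):
            return agent, "rule"
    return fallback(task), "llm"
```

Returning the decision source alongside the agent keeps the audit trail intact even when the flexible path is taken.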

Topology Selection Heuristic

Is the workflow linear with clear stages?
    Yes → Sequential Pipeline
    No  → Continue

Do subtasks share state during execution?
    Yes → Hub-spoke with shared state
    No  → Continue

Are subtasks independent and parallelizable?
    Yes → Parallel fan-out/gather
    No  → Continue

Does the task require iterative refinement?
    Yes → Generator-critic
    No  → Continue

Must team composition change during execution?
    Yes → Dynamic swarm
    No  → Hub-spoke (safest default)
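The heuristic above translates directly into a chain of early returns. The `Task` fields below are hypothetical names for the five questions; the order of checks matches the order in which the questions are asked.

```python
from dataclasses import dataclass


@dataclass
class Task:
    linear_stages: bool = False          # linear workflow with clear stages?
    shares_state: bool = False           # subtasks share state during execution?
    parallelizable: bool = False         # subtasks independent and parallelizable?
    iterative_refinement: bool = False   # task requires iterative refinement?
    changing_team: bool = False          # team composition changes mid-run?


def select_topology(task: Task) -> str:
    if task.linear_stages:
        return "sequential pipeline"
    if task.shares_state:
        return "hub-spoke with shared state"
    if task.parallelizable:
        return "parallel fan-out/gather"
    if task.iterative_refinement:
        return "generator-critic"
    if task.changing_team:
        return "dynamic swarm"
    return "hub-spoke"  # safest default
```

Because the checks are ordered, a task that is both linear and parallelizable resolves to the pipeline; reordering the questions changes the recommendation.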

Safeguards — Bounded Autonomy as Standard

Defense-in-depth is the baseline pattern, not a nice-to-have. Every production multi-agent system studied implements multiple independent safety layers.

The Defense-in-Depth Stack

┌─────────────────────────────────────────────┐
│  Manual Review (human approval gates)        │
├─────────────────────────────────────────────┤
│  Agent Config Protection                     │
│  (block writes to CLAUDE.md, .cursorrules,   │
│   MCP configs)                               │
├─────────────────────────────────────────────┤
│  Command Allowlists                          │
│  (only permitted commands execute)           │
├─────────────────────────────────────────────┤
│  Filesystem Restrictions                     │
│  (tiered: none → read-only → read-write)    │
├─────────────────────────────────────────────┤
│  Credential Injection                        │
│  (minimal at start, task-scoped, short-lived)│
├─────────────────────────────────────────────┤
│  OS-Level Sandbox                            │
│  (gVisor, Kata containers, macOS Seatbelt)   │
└─────────────────────────────────────────────┘

Each layer catches failures that slip through layers above it. Application-level controls alone are insufficient — attackers use indirection, calling restricted tools through approved ones. OS-level enforcement catches what application guards miss.
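Two of the application-level layers, agent config protection and command allowlists, can be sketched as independent checks. The protected file names come from the stack above; the allowlist contents are hypothetical, and the command check deliberately inspects only the program name.

```python
import shlex
from pathlib import PurePosixPath

PROTECTED_FILES = {"CLAUDE.md", ".cursorrules"}
ALLOWED_COMMANDS = {"ls", "cat", "git", "pytest"}  # hypothetical allowlist


def check_write(path: str) -> bool:
    """Agent config protection: refuse writes to protected config files."""
    return PurePosixPath(path).name not in PROTECTED_FILES


def check_command(command_line: str) -> bool:
    """Command allowlist: accept only commands whose program is permitted."""
    try:
        argv = shlex.split(command_line)
    except ValueError:  # unbalanced quotes and similar parse failures
        return False
    return bool(argv) and argv[0] in ALLOWED_COMMANDS
```

The command check shows why these layers are not enough alone: an allowlisted program such as `git` can itself run other programs through hooks or aliases, which is the indirection that OS-level sandboxing is meant to contain.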

The Four Pillars of Platform Control

[2026-02-11]: CNCF's 2026 framework for agent governance identifies four complementary control mechanisms:

Golden paths — Paved roads that make the secure approach the easy approach
Guardrails — Automated checks that prevent known-bad configurations
Safety nets — Catch-all systems that detect anomalies the guardrails miss
Manual review — Human checkpoints at high-consequence decision points

These map cleanly to multi-agent systems: golden paths are well-tested agent templates, guardrails are filesystem and command restrictions, safety nets are watchdog processes, and manual review is human-in-the-loop approval.

Prompt Injection Remains Unsolved

Every framework studied acknowledges prompt injection as an open risk. No framework claims prevention — only mitigation through defense-in-depth.

Concrete mitigations observed across frameworks:

Locked DMs — Agents cannot message arbitrary endpoints; communication restricted to declared peers
Mention gating — Agents only respond when explicitly invoked, preventing injection through ambient context
Instruction-hardened models — Training-time resistance to injection attempts (model-level, not application-level)
Input sanitization — Stripping or escaping control characters from external data before injecting into agent context
Output validation — Verifying agent actions against declared intent before execution

None of these is sufficient alone. The defense-in-depth principle applies: assume each layer will fail, and design the stack so that no single failure compromises the system.
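Two of the mitigations above are simple enough to sketch: input sanitization and mention gating. The code assumes sanitization means dropping Unicode control and format characters and that agents are addressed as `@name`; both functions are hypothetical and neither prevents injection carried in ordinary text.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def should_respond(message: str, agent_name: str) -> bool:
    """Mention gating: act only when explicitly addressed as @agent_name."""
    return f"@{agent_name}" in message.split()
```

For example, `should_respond("the reviewer said ignore all rules", "reviewer")` is false, so text that merely mentions the agent in ambient context does not trigger it.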

Trust Creep as Primary Risk

[2026-02-11]: The most dangerous failure mode is not a dramatic exploit but gradual erosion. Trust creep — gradually relaxing approval gates because "agents usually get it right" — produces systemic quality degradation. The fix is structural: tie approval requirements to task classification, not to recent success rates.

Only 21% of organizations report mature agent governance frameworks despite 73% citing security as their top AI risk. The gap between perceived risk and implemented controls defines the current landscape.
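The structural fix for trust creep can be sketched as an approval policy keyed only on task class. The three classes and the policy table below are hypothetical; the point is that the recent success rate is accepted as an argument and deliberately ignored.

```python
from enum import Enum


class TaskClass(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE_WRITE = "reversible_write"
    IRREVERSIBLE = "irreversible"


# Approval is a fixed property of the task class (hypothetical policy).
REQUIRES_APPROVAL = {
    TaskClass.READ_ONLY: False,
    TaskClass.REVERSIBLE_WRITE: False,
    TaskClass.IRREVERSIBLE: True,
}


def needs_human_approval(task_class: TaskClass, recent_success_rate: float) -> bool:
    # recent_success_rate is intentionally unused: a good track record
    # must not relax the gate for a high-consequence task class.
    return REQUIRES_APPROVAL[task_class]
```

Changing a gate then requires editing the policy table, which is a reviewable act rather than a gradual drift.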

Memory — The Urgent Frontier

[2026-02-11]: Memory is the most urgent unsolved frontier in agent development. ICLR 2026 dedicated a workshop to "MemAgents: Memory for LLM-Based Agentic Systems," reflecting the gap between memory's importance and its current maturity.

Three Memory Models in Production

Model	Mechanism	Strengths	Risks
Shared pool	All agents access common store (vector DB, document store)	Fast reuse, no duplication	Overwrite conflicts, stale reads, contamination
Local with sync	Each agent owns private memory, shares via periodic synchronization	Isolation by default, fewer contention issues	Sync latency, divergence between syncs
Event bus	Agents as independent actors with private state, communicating asynchronously	Maximal decoupling, clean interfaces	Requires event schema discipline, eventual consistency

The Two-Tier Pattern

A recurring architecture across projects: ephemeral working memory paired with persistent knowledge storage. The specific implementations differ, but the structural pattern is consistent.

Ephemeral layer (session-scoped, disposable):

Daily logs, work queues, conversation context
Cheap to create, acceptable to lose

Persistent layer (cross-session, curated):

Curated knowledge bases, semantic search indexes, structured databases
Expensive to build, critical to preserve

Project	Ephemeral Layer	Persistent Layer	Retrieval
OpenClaw	Daily logs (`memory/YYYY-MM-DD.md`)	Curated `MEMORY.md`	Hybrid vector/BM25 search
Gas Town	Hooks (work queues)	Beads (git-backed JSONL knowledge graph with semantic decay)	Graph traversal + semantic search
AutoForge	Fresh context per session (deliberate amnesia)	SQLite features DB	Structured queries

Virtual Memory for Cognition

[2026-02-11]: OpenClaw's metaphor reframes context management: the LLM context window functions as a cache, and persistent storage is the source of truth. The model only "remembers" what gets written to disk. This shifts the mental model from "fitting everything in the window" to "managing a memory hierarchy", analogous to how operating systems manage RAM and disk.

The implications are practical:

Write-back policy matters. Deciding what persists and what evaporates requires explicit design, not implicit hope.
Cache invalidation applies. Stale information in the context window is worse than missing information — it produces confident wrong answers.
Prefetch strategies help. Loading likely-needed context before the agent requests it reduces latency (progressive disclosure applied to memory).
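The cache analogy can be made concrete with a small two-level store: a bounded in-memory window backed by a JSON file. This is a sketch under simplifying assumptions (oldest-first eviction, an explicit `persist` flag as the write-back policy) and not OpenClaw's implementation; `MemoryHierarchy` is a hypothetical name.

```python
import json
from pathlib import Path


class MemoryHierarchy:
    def __init__(self, store_path, window_size=3):
        self.store_path = Path(store_path)   # persistent source of truth
        self.window_size = window_size
        self.window = {}                     # context window acting as a cache

    def _load(self):
        if self.store_path.exists():
            return json.loads(self.store_path.read_text())
        return {}

    def remember(self, key, value, persist=False):
        if persist:  # explicit write-back decision: only this survives eviction
            data = self._load()
            data[key] = value
            self.store_path.write_text(json.dumps(data))
        self.window[key] = value
        while len(self.window) > self.window_size:
            self.window.pop(next(iter(self.window)))  # evict oldest entry

    def recall(self, key):
        if key in self.window:
            return self.window[key]
        value = self._load().get(key)  # cache miss: fall back to disk
        if value is not None:
            self.remember(key, value)  # reload into the window
        return value
```

Anything remembered without `persist=True` is lost once it is evicted, which is the behavior the write-back bullet warns about.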

Memory Contamination

Incorrect information spreading across agents via shared memory is a named risk in every shared-pool implementation. The trend is toward fine-grained access control — restricting which agents can write to which memory segments — rather than all-or-nothing sharing. Read-many, write-few architectures reduce contamination surface.

Contamination mitigation strategies observed:

Strategy	Mechanism	Trade-off
Write gating	Only designated agents can write to shared memory	Reduces contamination but bottlenecks information flow
Versioned entries	Every write creates a new version; reads get latest or pinned	Enables rollback but increases storage cost
Semantic decay	Entries lose relevance weight over time unless refreshed	Self-cleaning but risks losing valid long-term knowledge
Provenance tracking	Every memory entry tagged with source agent and confidence	Enables blame tracing but adds metadata overhead
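The four strategies in the table compose naturally in one store. The sketch below assumes a half-life model for semantic decay and a simple writer set for gating; `Entry`, `SharedMemory`, and `relevance` are hypothetical names, and the decay formula is one plausible choice rather than the one any listed project uses.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    value: str
    source: str         # provenance: which agent wrote this
    confidence: float   # provenance: how sure the writer was
    written_at: float


class SharedMemory:
    def __init__(self, writers):
        self._writers = set(writers)   # write gating
        self._versions = {}            # key -> list of Entry, oldest first

    def write(self, agent, key, value, confidence):
        if agent not in self._writers:
            raise PermissionError(f"{agent} may not write to shared memory")
        entry = Entry(value, agent, confidence, time.time())
        self._versions.setdefault(key, []).append(entry)  # versioned entries

    def read(self, key, version=-1):
        """Latest version by default, or a pinned version by index."""
        return self._versions[key][version]

    def rollback(self, key):
        """Discard the latest version, keeping at least one."""
        if len(self._versions[key]) > 1:
            self._versions[key].pop()


def relevance(entry, now, half_life):
    """Semantic decay: the weight halves every `half_life` seconds."""
    return entry.confidence * 0.5 ** ((now - entry.written_at) / half_life)
```

Rollback combined with the `source` field supports blame tracing: a contaminated entry can be removed and attributed to the agent that wrote it.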

Seancing: Cross-Session Context Recovery

[2026-02-11]: Gas Town's "seancing" pattern addresses context loss at session boundaries. New agent sessions query predecessors about unfinished work, reconstructing intent from artifacts rather than relying on memory transfer. The pattern treats session boundaries as inevitable rather than preventable — building recovery into the architecture instead of trying to maintain continuity.

Fresh Context as Deliberate Strategy

AutoForge takes the opposite approach: starting each session with an empty context window prevents context pollution while SQLite provides continuity for structured state. This trades recall for cleanliness — the agent cannot reference previous conversations but also cannot be poisoned by stale context from them.

The choice between seancing (recover prior context) and fresh starts (discard prior context) depends on whether the task benefits more from continuity or from clean reasoning.

Autonomy — The Bounded Middle

The dominant strategy in 2026 is not maximum automation but intentional constraint. Bounded autonomy — clear operational limits combined with escalation paths and comprehensive audit trails — outperforms both excessive restriction and excessive freedom.

Five-Level Autonomy Spectrum

Level	Description	Agent Behavior	Human Role	Current Examples
Augmentation	Agents enhance human work	Suggest, draft, complete	Active creator	Most chat assistants, Copilot
Supervised automation	Agents execute with checkpoints	Act within approved scope	Reviewer at gates	OpenClaw (configurable thinking depth)
Bounded autonomy	Agents operate within defined limits	Full execution within boundaries	Supervisor, exception handler	AutoForge (spec + test gates), Google ADK
High autonomy	Agents execute relentlessly, humans review results	Execute until done or stuck	Post-hoc reviewer	Gas Town (GUPP: "must execute")
Full autonomy	Minimal human oversight	Theoretically unbounded	Absent or passive	Not achieved by any current system

Bounded Autonomy as Consensus

[2026-02-11]: Bounded autonomy is the 2026 consensus position. Full automation is not the dominant strategy: practitioners report better outcomes with intentional human checkpoints at high-consequence decision points. The human role shifts from reviewing every output to acting as an "agent supervisor" at those checkpoints. The job changes from proofreader to factory floor manager.

The Economics of Multi-Agent Systems

Multi-agent systems consume approximately 15x more tokens than standard single-agent chat (Anthropic production data). This cost multiplier means multi-agent architectures are economically justified only for tasks where the value of parallelism, specialization, or consistency exceeds the token premium.

Architecture	Relative Token Cost	Best Justified When
Single agent	1x (baseline)	Most tasks — default choice
Two-agent (generator-critic)	2-3x	Quality-critical outputs needing verification
Orchestrator + 3-5 specialists	5-8x	Multi-domain analysis requiring parallel expertise
Full swarm (10+ agents)	10-15x	Large-scale parallel execution with tight deadlines
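The multipliers in the table support a quick back-of-the-envelope check. The sketch below uses the table's ranges as given; the baseline token count, the price per million tokens, and the task value in the usage lines are made-up inputs.

```python
# (low, high) token multipliers relative to a single agent, from the table above.
COST_MULTIPLIER = {
    "single": (1, 1),
    "generator_critic": (2, 3),
    "orchestrator_specialists": (5, 8),
    "full_swarm": (10, 15),
}


def token_range(architecture, baseline_tokens):
    """Estimated (low, high) token usage for one task."""
    low, high = COST_MULTIPLIER[architecture]
    return baseline_tokens * low, baseline_tokens * high


def justified(architecture, baseline_tokens, price_per_million, task_value):
    """True if even the high-end token cost stays below the value delivered."""
    _, high = token_range(architecture, baseline_tokens)
    return high * price_per_million / 1_000_000 < task_value


# Hypothetical inputs: a 20,000-token baseline task.
low, high = token_range("orchestrator_specialists", 20_000)
```

With a 20,000-token baseline, an orchestrator with specialists lands between 100,000 and 160,000 tokens per task, so the value of the added parallelism has to cover that premium.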

Role-based design reduces failure by 35%. Separating Planner, Executor, Verifier, and Optimizer roles mirrors human team structure and produces measurable quality improvement over monolithic agent architectures. The specialization benefit compounds: planners develop better plans when they never execute, and executors produce better output when they follow explicit plans rather than self-planning.

The 60% Failure Rate

[2026-02-11]: Multi-agent deployments fail in roughly 60% of enterprises, not because of agent capability limitations but because of architectural misunderstanding of legacy systems and undocumented dependencies. The systems agents interact with were not designed for automated consumption. Undocumented APIs, implicit workflows, and tribal knowledge create failure surfaces that no amount of agent sophistication can navigate without explicit mapping.

The GUPP Lesson: High Autonomy Without Proportional Safeguards

Gas Town's GUPP principle ("If there is work on your hook, you MUST run it") represents the high-autonomy extreme. The documented consequences: agents merged PRs with failing tests, deleted code unpredictably, and required forced repository resets. High autonomy without proportional safeguards produces chaos — the safeguards must scale with the latitude.

This is not an argument against high autonomy. It is evidence that autonomy level and safeguard depth must be calibrated together. The failure was not too much autonomy per se, but a mismatch between autonomy granted and constraints enforced.

Autonomy Calibration Framework

Match autonomy level to task risk and reversibility:

Is the action easily reversible?
    Yes → Higher autonomy acceptable (bounded autonomy or above)
    No  → Continue

Is the action visible to external parties?
    Yes → Supervised automation (human gate before external actions)
    No  → Continue

Does the action modify shared state?
    Yes → Bounded autonomy with explicit checkpoints
    No  → Higher autonomy acceptable for isolated execution
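The three questions above can be written as a function with early returns. The parameter names are hypothetical labels for the questions, and the returned strings are the recommendations exactly as the decision tree states them.

```python
def calibrate(reversible: bool, externally_visible: bool, modifies_shared_state: bool) -> str:
    """Walk the calibration questions in order and return the recommendation."""
    if reversible:
        return "bounded autonomy or above"
    if externally_visible:
        return "supervised automation (human gate before external actions)"
    if modifies_shared_state:
        return "bounded autonomy with explicit checkpoints"
    return "higher autonomy acceptable for isolated execution"


# An irreversible action that external parties can see gets a human gate.
recommendation = calibrate(
    reversible=False, externally_visible=True, modifies_shared_state=True
)
```

Reversibility is checked first, so an easily undone action short-circuits the remaining questions even if it touches shared state.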

The calibration principle: Safeguard depth should scale with autonomy level, so the ratio between them stays roughly constant. Increasing autonomy without proportionally increasing safeguards creates the mismatch that produced the GUPP failures. Reducing safeguards while holding autonomy fixed produces the same mismatch from the other direction.

Connections

To Expert Swarm Pattern: File ownership and flat coordination represent a specific instantiation of the topologies and communication paradigms described here. Expert Swarm's expertise inheritance protocol is one implementation of the shared-state communication paradigm.
To Production Multi-Agent Systems: Operational patterns for the topologies and safeguards discussed here. Production systems must implement the defense-in-depth stack and memory architectures at scale, with autonomous recovery for the failure modes this chapter identifies.
To Multi-Agent Context: Memory architectures in this chapter extend the context isolation patterns documented there. The two-tier memory pattern (ephemeral + persistent) maps directly to context management strategies for multi-agent systems.
To Operating Agent Swarms: Operational practices for the safeguards and autonomy levels described here. The bounded autonomy consensus translates into specific operational procedures for monitoring, escalation, and recovery.
To Execution Topologies: Autonomy as a measurement dimension for topology selection. The seven topologies documented here map to the execution topology mental models, with autonomy level as an additional selection criterion.

Open Questions

How do the communication protocols (MCP, A2A, ACP, ANP) compose when an agent needs both vertical and horizontal integration simultaneously?
What governance structures prevent trust creep in organizations with hundreds of deployed agents?
As memory architectures mature, does the two-tier pattern (ephemeral + persistent) stabilize or fragment into more specialized tiers?
What observability standards emerge for tracing decisions across multi-agent topologies? Current debugging remains ad hoc.
How does the bounded autonomy consensus shift as model capabilities improve — does the "bounded middle" move toward higher autonomy, or do safeguard requirements scale proportionally?
What architectural patterns address the 60% enterprise failure rate — is the solution better agent design or better legacy system instrumentation?
How do memory contamination defenses compose with the performance benefits of shared memory pools?
When (if ever) does full autonomy become viable, and what safeguard architecture makes it safe?