Software Factories

    The previous section (§6) introduced the factory floor metaphor to describe what happens when agent count exceeds eight to twelve: infrastructure requirements emerge, the workshop model breaks down, and a phase transition occurs. That section asked when the transition happens. This section asks what factory means—historically, critically, and practically—as the term has re-emerged in 2026 discourse with remarkable force.


    Core Questions

    Historical Context

    • How has the "software factory" concept failed twice before, and what is different now?
    • What problems did Japanese software factories (1980s) and Microsoft software factories (2004) solve, and where did each stall?
    • Why did LLMs dissolve the flexibility problem that killed the 2004 approach?

    The 2026 Landscape

    • What does an actual production software factory look like in 2026?
    • What does Dan Shapiro's five-level framework reveal about where most practitioners actually operate?
    • What is the "dark factory" variant, and where does it come from?

    Critical Assessment

    • What does independent evidence say about AI-authored code quality?
    • How does the circularity problem (AI builds, AI validates) undermine traditional quality assurance?
    • Which factory claims are greenfield-only, and how does brownfield work change the calculus?

    Practitioner Guidance

    • What signals indicate a team should adopt factory thinking?
    • What scale thresholds map to which infrastructure requirements?
    • What does an honest assessment of Level 5 ("dark factory") actually support?

    Your Mental Model

    The factory metaphor reveals what changes when implementation becomes free—and conceals what still costs.

    When implementation time collapses from weeks to minutes, every assumption about software development economics must be revisited. The factory metaphor names this shift precisely: it highlights throughput, standardization, and process-as-product. But factories also carry assumptions inherited from physical manufacturing—assumptions about inputs, outputs, quality inspection, and labor—that do not transfer cleanly to software. The mental model is most useful when held critically: apply it where it fits, identify where it breaks.

    The history of software factory thinking is a history of the right idea applied with the wrong enabling technology. The 2026 resurgence is compelling not because the idea is new, but because the enabling technology finally exists.


    The Factory Resurgence

    In 2026, the term "software factory" re-emerged as a central metaphor for AI-autonomous development, carried by a wave of manifestos, case studies, and practitioner reports that share a common claim: small teams using AI agents can now produce software at industrial scale.

    The most-cited concrete example is StrongDM, documented by Simon Willison in February 2026. Three engineers produced 32,000 lines of production code in seven months. The team's operating principles stripped the process to its minimum: code must not be written by humans, code must not be reviewed by humans. Token spend reached over $1,000 per engineer per day. The key innovations were not speed alone, but two validation mechanisms that addressed quality: scenarios as holdout sets—behavioral acceptance criteria hidden from code-generating agents and used exclusively for validation—and a Digital Twin Universe, behavioral clones of third-party services that enable safe testing against realistic external system behavior.

    The "dark factory" variant imports its core metaphor from FANUC Corporation's lights-out manufacturing facility in Oshino, Japan, where robots have built other robots since 2001 in 30-day unmanned production runs. The facility operates without lighting because the robots do not need to see. Johnny Butler's formulation—"No Humans Should Write Code"—applies the same logic to software: if the factory can run without human operators, why should it require them? BCG Platinion's 2026 report on dark software factories uses the FANUC analogy directly, describing autonomous software production as a natural extension of industrial automation.

    This section examines what the factory metaphor reveals—and what it conceals—about agentic systems at scale.


    Historical Lineage

    The 2026 resurgence is the third attempt to realize the software factory concept. Each prior attempt reached the same conclusion about the problem and the same obstacle to the solution.

    Era 1: Japanese Software Factories (1980s–1991)

    Michael Cusumano's 1991 study Japan's Software Factories documented how Hitachi, Toshiba, NEC, and Fujitsu applied manufacturing logic to software development in the 1980s. The central problem was engineer scarcity: skilled developers were expensive and in short supply. The factories sought to address this by standardizing components for reuse—if a software component could be written once and reused many times, the ratio of engineers to output would improve.

    The approach achieved genuine, documented gains in defect rates and productivity within narrow domains. But it stalled when confronted with the fundamental tension in software: standardization reduces flexibility. Components built for reuse across many projects must be general enough to apply broadly, which makes them less precisely fitted to any specific need. The more a team optimized for reuse, the less they could accommodate novel requirements. Projects that didn't fit the standard components required either expensive custom work or compromised designs that fit the templates.

    The Japanese factories solved the engineer scarcity problem for standard work. They did not solve it for novel work, and software is disproportionately novel.

    Era 2: Microsoft Software Factories (2004)

    Greenfield, Short, Cook, Kent, and Crupi's 2004 book Software Factories proposed a different mechanism: domain-specific languages (DSLs) and model-driven code generation. The factory would consist of "a configuration of languages, patterns, frameworks, and tools that can be used to rapidly and cost-effectively produce an open-ended set of unique variants of a standard product."

    This formulation is remarkably close to the 2026 conception. The insight was correct: if software could be specified formally at a higher level of abstraction, and if a code generator could translate those specifications into working implementations, the bottleneck would shift from implementation to specification. The human would design; the machine would build.

    The obstacle was the same as in the 1980s: the DSLs required to capture domain models precisely enough for code generation were expensive to create, expensive to maintain, and limited in scope. Building a DSL capable of generating non-trivial software in a new domain required specialized expertise equivalent to (or exceeding) simply writing the software directly. The investment did not pay off at scale.

    The 2004 framework was the right architecture for a missing component. That component was a general-purpose language model capable of understanding informal specifications and generating corresponding implementations.

    Era 3: LLM-Enabled Resurgence (2024–2026)

    Large language models dissolve the flexibility problem that killed both prior attempts. Natural language specifications replace formal domain models. The investment required to specify behavior in a DSL collapses to the investment required to write clearly in English. The "configuration of languages, patterns, frameworks, and tools" that Greenfield et al. described in 2004 now includes a general-purpose model that can interpret informal requirements and generate implementations across arbitrary domains.

    The historical irony is direct: spec-driven development was always the goal. The 2004 book described the architecture correctly. The 2026 factory is what the 2004 book wanted but could not achieve without language models.

    Martin Fowler's 2026 analysis of spec-driven development positions this explicitly: the shift is not in the idea of specifications as the primary development artifact, but in the capability of the system that executes against those specifications.


    The 2026 Landscape

    Dan Shapiro's Five Levels

    Dan Shapiro's January 2026 framework, modeled on NHTSA vehicle automation levels, provides a concrete taxonomy for AI-assisted development:

    Level Name Description Who Controls Implementation
    0 Manual No AI assistance Humans entirely
    1 Task offloading "Write this unit test" Humans, with AI executing bounded tasks
    2 Active partnership Extended back-and-forth with AI Collaborative; humans review each output
    3 Human oversight Humans managing extensive AI diffs AI implements; humans review and steer
    4 Autonomous coding Humans write specs, agents implement AI, with human specification
    5 The Dark Factory Specs to software, no human intervention AI entirely

    Shapiro's observation is that most practitioners plateau at Level 2. The jump to Level 3 requires comfort with reviewing large AI-generated diffs rather than writing code directly. Levels 4 and 5 require infrastructure and process changes that most teams have not made.

    The Anthropic 2026 Agentic Coding Trends Report corroborates this distribution: the vast majority of developer adoption concentrates at Levels 1-2, with Level 3 representing a distinct mode change that requires deliberate organizational investment.

    StrongDM's Production Implementation

    StrongDM's approach, documented by Willison and analyzed by Allan MacGregor (The Pragmatic CTO), represents one of the most detailed public accounts of Level 4-5 operations. Several features distinguish it from simpler AI-assisted development:

    Validation separation. Scenarios that define acceptance criteria are never exposed to the agents generating code. This separation is analogous to holdout sets in machine learning: the validation criteria are not part of the training signal. Agents cannot "teach to the test" because they do not see the test.

    Third-party service emulation. The Digital Twin Universe consists of behavioral clones of external services—APIs, databases, infrastructure—that accept the same interfaces as production systems but operate in a sandboxed environment. This enables integration testing without production dependencies, which is critical when agents are generating code that interacts with external systems.

    Cost structure. Token spend over $1,000 per engineer per day represents a significant operating cost. At this expenditure, the economic case for the factory model depends on output quality and velocity that justifies the cost relative to traditional development. StrongDM's 32,000-line output over seven months—an average exceeding 150 production lines per engineer-day—suggests the model is viable for their domain, though the comparison to traditional productivity benchmarks is non-trivial.

    The Organizational Bifurcation

    Andrew Baker's March 2026 analysis identifies a developing polarization in the software industry. A small fraction of teams—Baker estimates 3-5%—operate at Levels 4-5 with productivity multiples of three to five times traditional development. The majority operate at Levels 1-2, capturing real but bounded gains.

    This bifurcation is complicated by evidence on productivity perception. The METR study (2025) examined 246 real open-source development tasks with experienced developers working with and without AI assistance. Developers using AI tools took 19% longer to complete tasks while predicting they would be 24% faster—a 43-percentage-point gap between predicted and actual performance. This suggests that productivity gains at higher levels require deliberate process changes, not simply the addition of AI tools to existing workflows. The subjective experience of acceleration does not reliably indicate actual acceleration.

    Adjacent Tool Development

    Several adjacent tools have emerged under the spec-driven development banner, each representing a distinct take on the factory model:

    • Factory.ai markets "Droids"—autonomous agents that operate from high-level specifications
    • GitHub Spec-Kit integrates spec-driven development into the pull request workflow
    • Amazon Kiro positions specification-first development as a native IDE workflow
    • Tessl describes itself as "AI that programs software" from natural language requirements

    Steve Yegge's 2026 analysis of coding agents frames this tool proliferation as a race to capture the specification layer: whoever controls how requirements are expressed controls the factory's input.


    Factory Principles for Agentic Systems

    The factory metaphor maps directly to the mental models developed in this chapter. The components already exist; the factory is the assembled system.

    Specs as the Control Plane

    The factory model requires that specifications be the primary human artifact. This is the claim of Specs as Source Code stated at industrial scale: when agents execute specifications, the quality of those specifications determines the quality of the output. The Design as Bottleneck model (§6, Model 1) is the Theory of Constraints applied to this same observation—when implementation is fast, specification becomes the constraint.

    At factory scale, this principle has an additional implication: specifications must be machine-verifiable, not just human-readable. StrongDM's scenarios-as-holdout-sets approach is one implementation of this: acceptance criteria expressed precisely enough that an automated system can confirm or disconfirm them without human judgment.

    Persistent Identity, Ephemeral Execution at Factory Scale

    The Persistent Identity, Ephemeral Execution model (§6, Model 3) describes how individual agents maintain track records while keeping session context fresh. At factory scale, this model becomes infrastructure: the persistent layer is not a single CV file but a distributed system of agent identities, routing tables, and capability registries.

    StrongDM's Digital Twin Universe is context management at factory scale—the persistent layer includes behavioral models of production systems, not just individual agent track records. The principle is the same; the implementation scale is different.

    Mechanical Execution Contracts at Industrial Volume

    The Agents as Pistons model (§6, Model 2) establishes that agents execute without conversational overhead: hook fires, agent runs, completion is the only signal. At factory scale, this contract becomes the throughput mechanism. A factory where agents negotiate scope, request clarification, or poll for instructions cannot sustain industrial-volume output. The piston model is not optional at factory scale; it is the execution architecture.

    Design as the Factory Constraint

    The Design as Bottleneck model (§6, Model 1) identifies the constraint directly: when implementation is free, specification quality is the bottleneck. In factory terms, the design function is the constraint in the Theory of Constraints sense—the one resource whose throughput limits system output.

    This reframes the factory model's central investment decision. A traditional software factory invests in implementation infrastructure (servers, build systems, deployment pipelines). A 2026 software factory invests in specification infrastructure: tools for writing precise requirements, processes for validating acceptance criteria, and skills for decomposing complex behavior into independently-implementable units.

    Execution Topologies at Scale

    The Execution Topologies framework describes five shapes of agent work: parallel, sequential, synthesis, nested, and persistent. Factory operations require nested and persistent topologies at 100+ agent scale. A factory running StrongDM-class operations cannot rely on flat parallel execution—the coordination requirements alone demand hierarchical orchestration, with sub-orchestrators managing specialized agent pools and synthesis layers aggregating outputs.

    The factory floor phase transition described in §6 (Model 5) is where the execution topology must change: flat topologies stop working, and hierarchical topologies become necessary.

    Context as Infrastructure

    The Context as Code model treats agent knowledge like software. At factory scale, context management is infrastructure. StrongDM's Digital Twin Universe is an example: the behavioral models of external services are not ad-hoc context; they are engineered, maintained, and version-controlled artifacts that constitute the factory's knowledge layer.

    The principle from Context as Code—that context should be composed rather than accumulated—applies with particular force at factory scale, where context mismanagement multiplies across hundreds of concurrent agents.


    The Critique

    The factory claims deserve rigorous examination. The evidence does not uniformly support the enthusiasm.

    The Circularity Problem

    Stanford Law CodeX's February 2026 analysis poses the core challenge precisely: "Built by Agents, Tested by Agents, Trusted by Whom?" When the same class of system builds software and validates that software, no independent inspector exists. Traditional quality assurance depends on the mismatch between builder and checker—a different person, with different blind spots, examining the work. Models from the same architecture family may share correlated blind spots. A factory where LLMs generate code and LLMs validate code may produce output that satisfies validation criteria while failing in ways neither system was trained to detect.

    StrongDM's holdout set approach partially addresses this: the acceptance criteria are withheld from code-generating agents. But the acceptance criteria themselves are written by humans (and potentially validated by humans who also use LLMs), and the coverage of those criteria is limited to what humans anticipated when writing them.

    Code Quality Evidence

    Independent measurement of AI-authored code does not support factory-level confidence:

    • CodeRabbit analysis of over 10 million GitHub pull requests (December 2025) found that AI-authored code contains 1.4 times more critical issues and 1.7 times more major issues than human-authored code across the same repositories.
    • Veracode found that 45% of AI-generated code contains vulnerabilities from the OWASP Top 10, with only 10.5% of solutions being both functionally correct and secure.
    • METR (2025) demonstrated that experienced developers took 19% longer when using AI tools while estimating 24% improvement—a systematic miscalibration that suggests practitioners cannot reliably assess AI-assisted productivity through subjective experience.

    These are population-level findings. Individual teams—including StrongDM—have achieved better outcomes through specific validation mechanisms. The question is whether those mechanisms generalize, or whether they are expensive to implement and domain-specific in their applicability.

    The Validation Gap

    Simon Willison frames the core epistemic challenge: "How can agents prove software works?" Scenarios test anticipated behavior. Unknown unknowns—failure modes not anticipated when acceptance criteria were written—remain outside the test coverage boundary. In traditional development, experienced engineers bring tacit knowledge about failure modes to code review. That tacit knowledge catches issues that are not in the specification.

    At factory scale, that tacit knowledge must be encoded into the specification or it does not exist in the system. The factory's quality ceiling is bounded by specification coverage.

    Technical Debt at Industrial Speed

    Code duplication increases approximately four times when AI generates code without explicit deduplication constraints. Architectural unsoundness accumulates invisibly—each agent executes its specification, and no agent has responsibility for overall system coherence. The integration layer, where individual correct implementations combine into a system that may not be correct as a whole, is the factory's weakest point.

    The Talent Pipeline Problem

    A structural consequence of factory-model adoption may not appear for years. U.S. junior developer hiring declined 67% in 2024; UK technology graduate roles fell 46%. If entry-level positions traditionally produced the senior architects who later design the specifications that factories execute, the depletion of that pipeline represents a long-term constraint. The factory model consumes specification quality as its primary input. Specification quality depends on architects who developed their skills through implementation experience. Whether factory-model adoption self-limits through architect supply is an open question with a multi-year feedback loop.

    The Liability Vacuum

    No regulatory framework or professional licensing model has adapted to software production where no human reviewed the final artifact. In regulated domains—healthcare, finance, transportation—software systems carry liability that traces to identifiable professionals. The dark factory, operating at Level 5, produces artifacts without that attribution chain. The legal and regulatory frameworks that apply to such systems remain undefined as of 2026.

    Greenfield vs. Brownfield Reality

    The most dramatic factory claims—StrongDM's 32,000 lines in seven months, autonomous agent swarms implementing complete features—emerge almost exclusively from greenfield contexts: new codebases, clear domain boundaries, fresh architectural decisions. Most software development is brownfield: existing codebases, accumulated technical debt, unclear ownership, implicit contracts with legacy systems.

    Legacy systems impose hard limits on agent autonomy. An agent cannot safely modify a module it cannot fully understand, and full understanding of a legacy system requires context that may exceed practical context window sizes and may contain knowledge that was never written down. The realistic ceiling for brownfield factory operations sits closer to Level 3—human oversight of AI-generated diffs—than to Level 5.


    Practitioner Framework

    Transition Signals

    The workshop-to-factory transition is not a goal to pursue—it is a consequence of scale that demands infrastructure. The signals indicating the transition has become necessary:

    Scale Transition Signals:
    
      Are agent count > 8-12?
      ├── No → Workshop model; factory thinking premature
      └── Yes
           └── Is implementation time < specification/design time?
                ├── No → Design bottleneck not yet dominant;
                │        optimize specification quality first
                └── Yes
                     └── Are repeated specification patterns visible?
                          ├── No → Factory infrastructure overhead
                          │        exceeds benefit; defer investment
                          └── Yes → Factory model appropriate;
                                     invest in specification infrastructure
    

    The three conditions together identify genuine factory readiness: sufficient scale, implementation speed exceeding design speed, and sufficient pattern repetition to amortize infrastructure investment.

    Scale Tiers

    Extending the infrastructure table from §6 (Model 5) to factory-scale operations:

    Tier Agent Count Infrastructure Required Factory Applicability
    Solo 1-5 None Not applicable
    Workshop 6-12 Merge strategy, basic supervision Factory thinking counterproductive
    Small Factory 13-30 Full orchestration, health monitoring, attribution Core factory infrastructure; factory model appropriate
    Industrial 30-100 Hierarchical coordination, specification infrastructure, Digital Twin-class context management Full factory operations; specification quality is dominant constraint
    Dark Factory 100+ Automated specification validation, circularity mitigation, regulatory compliance framework Experimental; limited production evidence; circularity risks unresolved

    Anti-Patterns

    Premature factory infrastructure. Building supervisor agents, merge queues, specification validation pipelines, and Digital Twin environments for a five-agent team. The overhead exceeds the value. Direct management remains superior at workshop scale. The factory model adds complexity that benefits appear only at scale.

    Vibe coding at industrial volume. Deploying factory-scale agent counts without factory-scale validation infrastructure. The term "vibe coding" was coined by Andrej Karpathy in February 2025 to describe "fully giving in to the vibes...and forgetting that the code even exists" — a legitimate stance for exploratory, throwaway-project work where stakes are low and rapid iteration is the goal. At industrial volume, without holdout sets, behavioral validation, and quality measurement, the stance produces output quantity without assured quality. Code duplication, architectural drift, and silent failures scale with agent count when validation does not. The failure mode is not the practice itself but its misapplication to a scale context it was not designed for. (See: Agentic Engineering vs. Vibe Coding for the spectrum framing and scale boundary.)

    Level 5 as the target. Treating the dark factory as an aspirational endpoint rather than an extreme case. For most teams and most brownfield contexts, Levels 3-4 represent the productive sweet spot: sufficient agent autonomy to capture substantial velocity gains, with human oversight sufficient to catch the quality issues that current models consistently produce.

    Delayed infrastructure. Running factory-scale agent counts with workshop-scale infrastructure. The resulting chaos—merge conflicts, silent failures, attribution gaps, quality degradation—is predictable and well-documented. The infrastructure requirements at scale are not optional; they are the cost of operating at scale reliably.

    The Honest Assessment

    The evidence supports a tempered conclusion: the software factory concept is now technically feasible for the first time in its history, achieves genuine productivity multiples in the right conditions, and carries well-documented risks that current validation approaches only partially address.

    For most teams and most contexts—particularly brownfield, regulated, or legacy-adjacent work—Level 3-4 operation is both the ceiling of what is currently practical and the sweet spot of what delivers reliable value. Factory thinking is a mental model for understanding scale dynamics, not a mandate for every team. The teams operating at Level 4-5 with documented success have made specific, expensive investments in validation infrastructure. The headline productivity numbers do not include those investment costs.

    The factory metaphor is most useful as a way of understanding what changes as scale increases—where the bottleneck moves, what infrastructure becomes necessary, and why specification quality becomes the dominant variable. It is least useful as an aspiration divorced from the evidence about what it currently requires to work.


    Open Questions

    • Can the circularity problem be solved within the factory model? If the same model family generates and validates code, what techniques reliably create the independent inspection relationship that traditional quality assurance provides?

    • What is the brownfield ceiling? Current evidence suggests brownfield contexts cap effective autonomy around Level 3. What techniques—better context management, improved RAG over legacy codebases, specialized legacy-systems models—might raise this ceiling?

    • How does the talent pipeline play out? The decline in junior developer hiring creates a delayed constraint on specification quality. What does software development look like in five years if the architect pipeline is significantly depleted?

    • What liability framework will apply? As dark factory operations produce software without human review of the final artifact, how will professional liability, regulatory compliance, and organizational accountability adapt?

    • Can specification quality be measured independently? Current factory operations assess specification quality by measuring implementation success rates. Is there a way to assess specification quality before implementation—catching ambiguities and coverage gaps before they propagate through factory-scale agent execution?

    • What happens above 100 agents? The production evidence for factory operations concentrates in the 10-30 agent range. Kimi K2.5 and similar systems demonstrate 100+ concurrent subagents, but the coordination and quality mechanics at that scale are not well-documented. Does factory-floor infrastructure scale, or does a third organizational model emerge?

    • How does the factory model interact with security? In a production system where AI generates and reviews code, and attackers know this, what attack surfaces emerge that did not exist in human-authored systems? Slopsquatting, prompt injection into specification pipelines, and adversarial acceptance criteria are documented threats; the defensive landscape is still forming.


    Connections

    • Design as Bottleneck: Model 5 (Factory Floor vs. Workshop) introduced the workshop-to-factory transition. This section examines what the factory concept has historically meant, where it now stands, and what evidence exists for its 2026 instantiation. The factory is the assembled system; §6's models are the components.

    • Specs as Source Code: The factory model's fundamental claim—that specifications are the primary human artifact—is the industrial-scale expression of the Specs as Source Code mental model. Factory economics are spec economics at scale.

    • Context as Code: StrongDM's Digital Twin Universe is Context as Code at factory scale: external system behavior is engineered, maintained, and version-controlled as the factory's knowledge infrastructure.

    • Execution Topologies: Factory operations at 30+ agents require hierarchical topologies—nested orchestration with sub-orchestrators managing specialized pools. The topology section's measurement framework applies directly to factory-scale coordination assessment.

    • Production Multi-Agent Systems: The patterns documented there—persistent identity, supervision tiers, batch tracking, merge integration—are the operational implementations of what the factory model requires at production scale.

    • Operating Agent Swarms: Factory operations face swarm-scale operational challenges: cost management, incident response, quality monitoring. The practices chapter addresses the day-to-day mechanics that the factory mental model motivates.

    • Gas Town case study (appendices/examples/gastown/): A Go-based production implementation operating at factory scale (20-30 agents), demonstrating the piston model, persistent identity, and ledger-based attribution in a working system.

    • Overstory case study (appendices/examples/overstory/): A TypeScript/Bun implementation validating the same factory-scale patterns through a different tech stack, confirming that the architectural principles are language-agnostic.

    • Journal: Agentic Engineering vs. Vibe Coding: The factory model at Level 5 represents the endpoint of agentic engineering discipline. The contrast with vibe coding at scale—illustrated by the OpenClaw case study—grounds the abstract factory critique in concrete failure modes.


    Sources

    • Michael Cusumano, Japan's Software Factories (1991) — documentation of 1980s factory attempts at Hitachi, Toshiba, NEC, and Fujitsu; standardization-flexibility tension
    • Jack Greenfield, Keith Short, Steve Cook, Stuart Kent, and John Crupi, Software Factories (2004) — DSL-driven code generation framework; the definition quoted in Era 2
    • Simon Willison, "How StrongDM's AI team build serious software" (February 7, 2026) — scenarios-as-holdout-sets, Digital Twin Universe, cost structure
    • Dan Shapiro, "The Five Levels: From Spicy Autocomplete to the Dark Factory" (January 2026) — five-level framework modeled on NHTSA vehicle automation
    • Stanford Law CodeX, "Built by Agents, Tested by Agents, Trusted by Whom?" (February 8, 2026) — circularity problem analysis
    • Andrew Baker, "Dark Factories: AI Is Splitting Software Teams Apart" (March 22, 2026) — 3-5% frontier team analysis, organizational bifurcation
    • Allan MacGregor (The Pragmatic CTO), "The Software Factory: When No Human Writes or Reviews the Code" (February 18, 2026) — StrongDM analysis and factory model assessment
    • Johnny Butler, "Dark Factory: No Humans Should Write Code" (2026) — dark factory formulation
    • Steve Yegge, "The Future of Coding Agents" (2026) — specification layer competition; Level 7-8 framework extended
    • BCG Platinion, "The Dark Software Factory" (2026) — FANUC lights-out manufacturing analogy
    • Anthropic, "2026 Agentic Coding Trends Report" (2026) — practitioner adoption distribution across levels
    • Martin Fowler, "Understanding Spec-Driven Development" (2026) — historical positioning of specification-first approaches
    • CodeRabbit study (December 2025) — 10 million PR analysis; 1.4x critical issues, 1.7x major issues in AI-authored code
    • METR study (2025) — 246-task randomized controlled trial; 19% slower with AI, 24% faster predicted; 43-point perception gap
    • Veracode (2025) — 45% of AI code contains OWASP Top 10 vulnerabilities; 10.5% both correct and secure
    • FANUC Corporation — lights-out manufacturing facility, Oshino, Japan (2001–present); 30-day unmanned production runs
    • Andrej Karpathy, "vibe coding" coinage (February 2, 2025) — original definition: "fully giving in to the vibes...and forgetting that the code even exists"; original framing: exploratory/throwaway contexts. https://x.com/karpathy/status/1886192184808149383