Dan Grafham · @dangrafham · May 2026

Field Notes · AI Development

The Dark Factory

A deep dive into Level 5: what it actually requires, what it looks like in practice, and what kind of person could architect one.

LEVEL 5 · Theoretical / Emerging · Almost Nobody Here Yet

01 · The Core Idea

Specification In. Software Out.

What it actually means when no human reads the code.

The Dark Factory is not simply "more AI." It is a categorical shift in what programming is. At every level below Level 5, a human is still present in the loop, reading output, approving PRs, running tests. At Level 5, that loop closes. The system receives a specification and produces working, tested, deployed software. No human touches it at any point in the middle.

This doesn't mean humans disappear. It means they move entirely upstream. The work that once lived in code reviews, debugging sessions, and implementation decisions now lives in the specification itself. The spec is no longer a document that describes software. It is the software. The rest is compilation.

The philosophical shift

In traditional development, the code is the source of truth. In a Level 5 factory, the specification is the source of truth. This inverts 70 years of software culture. Developers have always used specs as inputs to code. In a dark factory, code is an implementation detail of a spec, and a disposable one at that. If the code fails, you don't fix the code. You clarify the spec and regenerate.

Specification (input) › Code (generate) › Evals / Tests (evaluate) › Self-Correct (iterate) › Deployed Software (output)
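The pipeline above can be sketched as a bounded loop. This is a minimal illustration, not a real factory: `run_factory`, `EvalResult`, and the two toy stand-ins are hypothetical names invented here, and the "code" is just a string.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalResult:
    passed: bool
    failures: List[str]


def run_factory(
    spec: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], EvalResult],
    max_iterations: int = 5,
) -> Optional[str]:
    """Generate, evaluate, and self-correct until the evals pass or the budget runs out."""
    feedback: List[str] = []
    for _ in range(max_iterations):
        candidate = generate(spec, feedback)  # generate
        result = evaluate(candidate)          # evaluate
        if result.passed:
            return candidate                  # output: accepted with no human in the loop
        feedback = result.failures            # iterate: failures feed the next attempt
    return None  # budget exhausted: clarify the spec rather than patch the code


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_generate(spec: str, feedback: List[str]) -> str:
    return "v2" if feedback else "v1"


def toy_evaluate(candidate: str) -> EvalResult:
    if candidate == "v2":
        return EvalResult(True, [])
    return EvalResult(False, ["expired token did not return 401"])


accepted = run_factory("login spec", toy_generate, toy_evaluate)
```

Returning `None` when the budget runs out is a deliberate choice in this sketch: it mirrors the idea that a persistent failure is a spec problem, not a code problem.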

02 · What Makes a Complete Spec

The Spec the Factory Can Actually Use

Most specs fail not because they're wrong, but because they're incomplete in invisible ways.

The single biggest bottleneck to Level 5 is not AI capability. It's spec quality. The factory can only build what you describe. Every ambiguity in the spec is a decision the factory will make on its own, and probably not the way you intended. A usable Level 5 spec has to answer questions you haven't thought to ask yet.

Behaviour Contract

Precise descriptions of what the software does, not how it does it. Inputs, outputs, side effects, edge cases. Every branching path. "The login flow should work" is not a behaviour contract. "A user with an expired token receives a 401 and is redirected to /login within 200ms" is.
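One way to make the expired-token example machine-checkable is to hold it as data rather than prose. The sketch below assumes a hypothetical `BehaviourContract` record and an `Observation` captured by some test harness; neither name comes from an existing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BehaviourContract:
    scenario: str
    status: int
    redirect_to: str
    within_ms: float


@dataclass(frozen=True)
class Observation:
    status: int
    redirect_to: str
    elapsed_ms: float


def violations(contract: BehaviourContract, seen: Observation) -> List[str]:
    """Return every way the observed behaviour departs from the contract."""
    problems: List[str] = []
    if seen.status != contract.status:
        problems.append(f"status {seen.status}, expected {contract.status}")
    if seen.redirect_to != contract.redirect_to:
        problems.append(f"redirect {seen.redirect_to}, expected {contract.redirect_to}")
    if seen.elapsed_ms > contract.within_ms:
        problems.append(f"took {seen.elapsed_ms}ms, limit {contract.within_ms}ms")
    return problems


# The contract from the text: expired token, 401, redirect to /login within 200ms.
EXPIRED_TOKEN = BehaviourContract("user with an expired token", 401, "/login", 200.0)

ok = violations(EXPIRED_TOKEN, Observation(401, "/login", 120.0))
bad = violations(EXPIRED_TOKEN, Observation(200, "/home", 350.0))
```

"The login flow should work" cannot be written in this form at all, which is a quick test of whether a sentence is a contract.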

Eval Suite

The automated scorecard that judges whether the software satisfies the contract. Not unit tests written during development. Outcome tests written before development begins. End-to-end flows, load scenarios, adversarial inputs, real-world data samples. If your eval can't distinguish good software from bad, the factory has no compass.

Constraint Inventory

Every technical constraint that isn't implied by the behaviour. Must run on .NET 8, not 9. Must not introduce any new third-party dependencies. Response time under 150ms at P99. No shared mutable state between modules. Without this inventory the factory will fall back on sensible defaults, but its "sensible" is not yours.
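Two of these constraints translate directly into automated checks. The sketch below uses the nearest-rank definition of P99 (one of several common percentile definitions, chosen here as an assumption) and hypothetical dependency names.

```python
from typing import Dict, List, Set


def p99(samples_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def check_constraints(
    samples_ms: List[float],
    dependencies: Set[str],
    allowed_dependencies: Set[str],
    p99_budget_ms: float = 150.0,
) -> Dict[str, bool]:
    """Evaluate the latency budget and the no-new-dependencies rule."""
    return {
        "p99_under_budget": p99(samples_ms) < p99_budget_ms,
        "no_new_dependencies": dependencies <= allowed_dependencies,
    }


latencies = [float(ms) for ms in range(1, 101)]  # 1ms .. 100ms
report = check_constraints(latencies, {"lib-a"}, {"lib-a", "lib-b"})
```

Constraints such as "no shared mutable state between modules" are harder to check mechanically and would need their own static analysis; they are left out of this sketch.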

Decision Log

Documented prior decisions the factory must not reverse. Why third-party library X instead of rolling our own. Why this state management approach over that one. The factory has no memory of past sessions. Without a decision log, every run risks undoing deliberate architecture choices you made months ago.

Anti-Spec

Explicit list of things the factory must not do, even if they seem like improvements. This is underrated. AI agents are optimisers. They will make tradeoffs you didn't ask for. "Do not add authentication to this endpoint" is as important as "add this feature."

Verification Gate

The final pass/fail condition that determines whether the factory's output is accepted. Not "does it compile." Not "do the unit tests pass." Does it behave correctly under real-world conditions as defined in the eval suite? The gate must be automatable. If a human has to look at the output to judge it, you're at Level 4, not 5.

03 · Infrastructure

What the Factory Is Actually Built From

Level 5 is not a single tool. It's a pipeline of autonomous systems.

LAYER 01 · The Spec Parser · Entry point

Interprets the spec and breaks it into discrete tasks
Identifies dependencies between tasks
Determines which tasks can be parallelised
Flags ambiguities before any code is written

Key insight: The spec parser's output is not code. It's a task graph. Before any agent writes a single line, the factory needs to know the shape of the work. This is where most attempts at Level 5 break down: they jump straight to code generation before mapping the problem.
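A task graph of this kind can be sketched with Python's standard-library `graphlib`. The task names and their dependencies below are invented for illustration; the point is that the dependency map alone determines which tasks can run in parallel.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into batches; every task in a batch can run at the same time."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the tasks depend on each other in a loop
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose prerequisites are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical task graph: each task maps to the tasks it depends on.
task_graph = {
    "schema": set(),
    "api": {"schema"},
    "auth": {"schema"},
    "ui": {"api", "auth"},
}

batches = parallel_batches(task_graph)
# batches == [["schema"], ["api", "auth"], ["ui"]]
```

A cycle in the graph surfaces as an error before any agent starts, which is one concrete form of flagging a problem before code is written.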

LAYER 02 · The Agent Pool · Parallel execution

Multiple autonomous agents working on isolated branches
Each agent carries its task context, constraints, and eval criteria
Agents self-correct in tight loops: write → run → fail → fix
Agents never communicate directly, only via the repo

Branch isolation: Each agent works in its own branch and never sees main directly. This is what makes the factory safe to run without oversight: a broken agent can't corrupt anything. The worst case is a failed PR. This is the Level 3 multi-agent workflow with the human removed from the approval loop.

LAYER 03 · The Eval Engine · The quality gate

Runs the full eval suite against every PR before it touches main
Behavioural tests, load tests, adversarial inputs, regression checks
Returns structured pass/fail with failure reasons to the agent
Agents use failure output to self-correct and resubmit

The factory's immune system: The eval engine is what makes Level 5 safe. Without a rigorous eval, the factory will ship confidently broken software. The eval does not check code quality. It checks observable behaviour. A messy implementation that passes all evals is preferable to clean code that fails one.
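The sketch below shows what "structured pass/fail with failure reasons" might look like. `Check`, `EvalReport`, and `run_evals` are hypothetical names, and the two candidate implementations are toys. The engine only calls the candidate and compares results; it never inspects how the candidate is written.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Check:
    name: str
    token_state: str
    expected_status: int


@dataclass
class EvalReport:
    passed: bool
    failures: List[Dict[str, str]] = field(default_factory=list)


def run_evals(candidate: Callable[[str], int], checks: List[Check]) -> EvalReport:
    """Judge a candidate purely on observable behaviour and report why it failed."""
    failures: List[Dict[str, str]] = []
    for check in checks:
        try:
            actual = candidate(check.token_state)
        except Exception as exc:  # a crash is a behavioural failure too
            failures.append({"check": check.name, "reason": f"raised {type(exc).__name__}"})
            continue
        if actual != check.expected_status:
            failures.append(
                {"check": check.name, "reason": f"expected {check.expected_status}, got {actual}"}
            )
    return EvalReport(passed=not failures, failures=failures)


CHECKS = [
    Check("valid token", "valid", 200),
    Check("expired token", "expired", 401),
    Check("missing token", "missing", 401),
]


def messy_but_correct(token_state: str) -> int:
    if token_state == "valid":
        return 200
    else:
        if token_state == "expired":
            return 401
        return 401


def clean_but_wrong(token_state: str) -> int:
    return 401 if token_state == "missing" else 200


messy_report = run_evals(messy_but_correct, CHECKS)
clean_report = run_evals(clean_but_wrong, CHECKS)
```

Here the messy candidate is accepted and the tidy one is rejected, with a reason an agent could act on when it resubmits.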

LAYER 04 · The Orchestrator · Merge authority

Reviews passing PRs for integration conflicts
Manages merge order to prevent race conditions
Triggers re-evaluation after each merge to catch regressions
Routes failures back to the responsible agent for correction

Not the same as Level 3: In Level 3, the orchestrator flags things for human review. In Level 5, the orchestrator makes every merge decision autonomously. There is no human approval loop. This is the precise definition of the threshold between Level 4 and Level 5.
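A toy model of that merge authority is sketched below. It assumes, purely for illustration, that "main" is a dictionary of settings and a PR is a set of key/value changes; `PullRequest`, `Orchestrator`, and the eval rule are hypothetical. Each PR is evaluated against main as it stands after the previous merges, and failures are routed back to the agent that produced them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PullRequest:
    agent: str
    changes: Dict[str, int]


@dataclass
class Orchestrator:
    evaluate: Callable[[Dict[str, int]], bool]
    main: Dict[str, int] = field(default_factory=dict)
    returned_to: List[str] = field(default_factory=list)

    def process(self, queue: List[PullRequest]) -> None:
        """Merge PRs one at a time, re-evaluating the merged result before accepting it."""
        for pr in queue:  # a fixed order avoids two merges racing each other
            candidate = {**self.main, **pr.changes}
            if self.evaluate(candidate):
                self.main = candidate               # autonomous merge, no human approval
            else:
                self.returned_to.append(pr.agent)   # route the failure back for correction


# Hypothetical eval: the merged configuration must keep timeouts within 150ms.
def within_budget(state: Dict[str, int]) -> bool:
    return state.get("timeout_ms", 0) <= 150


orchestrator = Orchestrator(evaluate=within_budget)
orchestrator.process(
    [
        PullRequest("agent-a", {"retries": 3}),
        PullRequest("agent-b", {"timeout_ms": 500}),
        PullRequest("agent-c", {"timeout_ms": 100}),
    ]
)
```

Real merges involve conflict resolution that a dictionary update hides; the sketch only shows the ordering and the re-evaluation after each merge.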

LAYER 05 · The Memory Layer · Often missing

Persists decisions, patterns, and lessons across runs
Prevents the factory from re-making decisions already settled
Feeds architectural context into each new agent session
Equivalent to a living DECISIONS.md + CLAUDE.md, auto-updated

The missing layer: Most experimental Level 5 setups skip this entirely and wonder why the factory keeps re-introducing solved problems. Memory is what distinguishes a factory that gets smarter from one that starts fresh every time. Without it, you're not building a factory, you're renting one that forgets everything overnight.
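A very small version of such a memory layer is an append-only decision log that is read back at the start of every run. The file name `decisions.jsonl`, the record fields, and both function names below are assumptions made for this sketch, not an existing format.

```python
import json
import tempfile
from pathlib import Path


def record_decision(log_path: Path, topic: str, decision: str, reason: str) -> None:
    """Append one settled decision to the log as a single JSON line."""
    entry = {"topic": topic, "decision": decision, "reason": reason}
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def load_context(log_path: Path) -> str:
    """Render all prior decisions as text to prepend to a new agent session."""
    if not log_path.exists():
        return ""  # first run: nothing has been settled yet
    lines = []
    for raw in log_path.read_text(encoding="utf-8").splitlines():
        entry = json.loads(raw)
        lines.append(f"- {entry['topic']}: {entry['decision']} (reason: {entry['reason']})")
    return "\n".join(lines)


# Simulate two runs sharing one log inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "decisions.jsonl"
    first_run_context = load_context(log)
    record_decision(log, "state management", "keep approach A", "settled in an earlier run")
    second_run_context = load_context(log)
```

The second run starts with the first run's decision already in its context, which is the property that keeps the factory from reopening settled questions.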

04 · The Human Role

The Factory Needs an Architect

Level 5 doesn't eliminate human expertise. It concentrates it into a single, rare skill.

The instinct is to assume that Level 5 makes the developer obsolete. The opposite is closer to true. The factory is only as good as the spec it receives. Writing a spec that is genuinely complete, one that covers every edge case, captures every constraint, and includes an eval suite rigorous enough to distinguish good output from bad, is harder than writing the code itself.

This is not a developer skill. It is not a project manager skill. It is something rarer: the ability to fully model a complex system in your head before it exists, express that model with enough precision that an autonomous agent can execute it faithfully, and then design the scorecard that proves the result is correct. That person is an architect in the original sense of the word: someone who designs a building they will never physically construct.

Skills that matter at Level 5

Exhaustive edge-case thinking before anything exists · Precise ambiguity-free language · Eval design (outcome testing, not implementation testing) · System architecture at the constraint level · Knowing what to prohibit, not just what to permit · Comfort with not reading the implementation · Trust calibrated to the eval, not to the code

Skills that become irrelevant at Level 5

Syntax · Debugging implementation details · Code review · Refactoring · Library familiarity · IDE proficiency · Reading other people's code · Writing tests during development

05 · What Goes Wrong

How Dark Factories Break

The failure modes are not the ones you'd expect.

Risk · Spec

Underspecified Behaviour

The factory builds exactly what you described. If your spec is 90% complete, the factory confidently fills in the last 10% the wrong way, and the eval doesn't catch it because you didn't write a test for what you forgot to specify.

Risk · Eval

Weak Eval Suite

A factory with a weak eval is worse than no factory. It ships confident, wrong software at speed. The eval is the factory's only quality signal. If it's incomplete, everything downstream is suspect, and you won't know it.

Warning · Architecture

Missing Decision Log

The factory has no memory between runs. Without a persistent record of decisions already made, it will revisit and reverse them. Each run may subtly undo the architecture of the last. The output compiles, the evals pass, but the codebase drifts.

Warning · Trust

Premature Autonomy

Removing human review before your eval suite is truly complete. The transition from Level 4 to Level 5 should be earned, not assumed. Most teams who think they're running Level 5 are running Level 4 with hidden human checkpoints they haven't noticed.

Mitigation · Process

Start with a Narrow Domain

Level 5 works reliably first on well-bounded problem spaces: a specific microservice, a data pipeline, a well-defined API surface. The narrower and cleaner the domain, the more complete the spec can be, and the more rigorous the eval.

Opportunity · Compounding

The Factory Gets Faster

Every well-specified component you produce becomes a building block for future specs. The memory layer grows. The eval suite expands. Over time a good factory compounds: the cost of producing new software drops with each iteration.

06 · Current State

Honest Assessment: March 2026

What actually exists, and what remains genuinely unsolved.

EXISTS NOW · Level 3–4 Infrastructure

Multi-agent branch isolation with orchestrated merges (Level 3)
Spec-driven generation with eval-based acceptance (Level 4)
Claude Code CLI with CLAUDE.md instruction layers
Automated CI/CD pipelines integrated with AI agents

The gap: The piece that doesn't exist reliably yet is the full removal of human review from the merge loop. Every current production implementation keeps a human somewhere: approving PRs, reviewing eval results, making the final call. That's Level 4. Level 5 is closing that last loop.

UNSOLVED · The Spec Problem

No standardised format for machine-readable specifications yet
Most developers have never written a spec complete enough for Level 4, let alone 5
Spec quality is hard to evaluate before the run; failure is the primary feedback signal
The industry hasn't built a culture of pre-run specification discipline

The unsolved problem: Building the factory is tractable. Teaching humans to write specs good enough to feed it is the real unsolved problem. This is a cultural and educational challenge as much as a technical one. The tools exist. The skill doesn't, yet.

Where to focus if Level 5 is your goal

Don't start by building the factory. Start by building the spec discipline. Write your next feature as a full Level 5 spec (behaviour contract, eval suite, constraints, anti-spec) before touching any code. Run it through a Level 4 flow. See what the spec missed. The gap between your spec and the actual working software is exactly the gap between you and Level 5. Close it incrementally. The factory will be waiting when you get there.