Field Notes · AI Development
The Dark Factory
A deep dive into Level 5: what it actually requires, what it looks like in practice, and what kind of person could architect one.
LEVEL 5 · Theoretical / Emerging · Almost Nobody Here Yet
01 · The Core Idea
Specification In. Software Out.
What it actually means when no human reads the code.
The Dark Factory is not simply "more AI." It is a categorical shift in what programming is.
At every level below Level 5, a human is still present in the loop, reading output, approving PRs, running tests.
At Level 5, that loop closes. The system receives a specification and produces working, tested, deployed software.
No human touches it at any point in the middle.
This doesn't mean humans disappear. It means they move entirely upstream. The work that once lived
in code reviews, debugging sessions, and implementation decisions now lives in the specification itself.
The spec is no longer a document that describes software. It is the software. The rest is compilation.
The philosophical shift
In traditional development, the code is the source of truth. In a Level 5 factory, the specification is the source of truth.
This inverts 70 years of software culture. Developers have always used specs as inputs to code.
In a dark factory, code is an implementation detail of a spec, and a disposable one at that.
If the code fails, you don't fix the code. You clarify the spec and regenerate.
02 · What Makes a Complete Spec
The Spec the Factory Can Actually Use
Most specs fail not because they're wrong, but because they're incomplete in invisible ways.
The single biggest bottleneck to Level 5 is not AI capability. It's spec quality.
The factory can only build what you describe. Every ambiguity in the spec is a decision
the factory will make on its own, and probably not the way you intended.
A usable Level 5 spec has to answer questions you haven't thought to ask yet.
Behaviour Contract
Precise descriptions of what the software does, not how it does it.
Inputs, outputs, side effects, edge cases. Every branching path.
"The login flow should work" is not a behaviour contract.
"A user with an expired token receives a 401 and is redirected to /login within 200ms" is.
Eval Suite
The automated scorecard that judges whether the software satisfies the contract.
Not unit tests written during development. Outcome tests written before development begins.
End-to-end flows, load scenarios, adversarial inputs, real-world data samples.
If your eval can't distinguish good software from bad, the factory has no compass.
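As a rough illustration, the expired-token clause from the behaviour contract can be written as an outcome test that exists before any implementation does. This is a minimal sketch, not a standard format: the `OutcomeTest` type, the `Candidate` signature, and both toy login functions are hypothetical, and the 200ms latency bound is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A candidate is the software under test, reduced here to a hypothetical
# callable: token state in, (HTTP status, redirect target) out.
Candidate = Callable[[str], Tuple[int, str]]


@dataclass(frozen=True)
class OutcomeTest:
    """One contract clause expressed as an observable outcome."""
    name: str
    token: str
    expected: Tuple[int, str]


# Written before any implementation exists.
EVAL_SUITE: List[OutcomeTest] = [
    OutcomeTest("expired token is rejected", "expired", (401, "/login")),
    OutcomeTest("valid token is accepted", "valid", (200, "")),
    OutcomeTest("malformed token is rejected", "\x00garbage", (401, "/login")),
]


def run_suite(candidate: Candidate) -> List[str]:
    """Return the names of failed tests; an empty list means the suite passed."""
    return [t.name for t in EVAL_SUITE if candidate(t.token) != t.expected]


def good_login(token: str) -> Tuple[int, str]:
    # Accepts only a valid token; everything else is sent to /login.
    return (200, "") if token == "valid" else (401, "/login")


def bad_login(token: str) -> Tuple[int, str]:
    # Bug: rejects only tokens it recognises as expired.
    return (401, "/login") if token == "expired" else (200, "")


print(run_suite(good_login))  # no failures
print(run_suite(bad_login))   # the malformed-token test fails
```

Without the adversarial third test, both implementations would pass and the suite could not tell them apart. That is the "no compass" failure in miniature.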
Constraint Inventory
Every technical constraint that isn't implied by the behaviour. Must run on .NET 8, not 9.
Must not introduce any new third-party dependencies. Response time under 150ms at P99.
No shared mutable state between modules. Wherever the inventory is silent, the factory will choose
its own defaults, and its "sensible" is not yours.
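Constraints only help the factory if they can be checked mechanically. The sketch below shows two of the constraints above as checks: the P99 latency budget and the ban on new dependencies. The function names are hypothetical, and the nearest-rank percentile is one common definition chosen here as an assumption; a real inventory would state which definition it means.

```python
from typing import List, Set


def p99(samples_ms: List[float]) -> float:
    """99th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Nearest rank is ceil(0.99 * n); integer arithmetic avoids float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_latency_budget(samples_ms: List[float], budget_ms: float = 150.0) -> bool:
    """True if the P99 response time is strictly under the budget."""
    return p99(samples_ms) < budget_ms


def new_dependencies(before: Set[str], after: Set[str]) -> Set[str]:
    """Dependencies present after the run that were not there before it."""
    return after - before


# One slow outlier in 100 requests does not move the P99; two do.
print(within_latency_budget([100.0] * 99 + [500.0]))
print(within_latency_budget([100.0] * 98 + [500.0] * 2))
```

Writing the constraint as a function forces the questions a prose constraint hides: which percentile method, strict or inclusive bound, measured over how many samples.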
Decision Log
Documented prior decisions the factory must not reverse.
Why third-party library X instead of rolling our own. Why this state management approach over that one.
On their own, the factory's agents carry no memory of past sessions. Without a decision log,
every run risks undoing deliberate architecture choices you made months ago.
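A decision log becomes enforceable once each entry names a topic and the settled choice. The sketch below is one hypothetical shape for that: the `Decision` fields, the placeholder topics and choices, and the `reversals` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Decision:
    topic: str      # what was decided about
    choice: str     # the option that was chosen
    rationale: str  # why, so a later run can see the choice was deliberate


# Hypothetical entries; topics, choices and rationales are placeholders.
DECISION_LOG: List[Decision] = [
    Decision("http-client", "library-x", "preferred over rolling our own"),
    Decision("state-management", "approach-a", "preferred over approach-b"),
]


def reversals(proposed: Dict[str, str]) -> List[str]:
    """Describe each proposed choice that contradicts a logged decision."""
    settled = {d.topic: d for d in DECISION_LOG}
    found = []
    for topic, choice in proposed.items():
        prior = settled.get(topic)
        if prior is not None and prior.choice != choice:
            found.append(
                f"{topic}: proposed {choice!r}, settled {prior.choice!r} ({prior.rationale})"
            )
    return found


# Topics the log does not mention are left to the factory.
print(reversals({"http-client": "hand-rolled", "logging": "anything"}))
```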
Anti-Spec
Explicit list of things the factory must not do, even if they seem like improvements.
This is underrated. AI agents are optimisers. They will make tradeoffs you didn't ask for.
"Do not add authentication to this endpoint" is as important as "add this feature."
Verification Gate
The final pass/fail condition that determines whether the factory's output is accepted.
Not "does it compile." Not "do the unit tests pass." Does it behave correctly
under real-world conditions as defined in the eval suite? The gate must be
automatable. If a human has to look at the output to judge it, you're at Level 4, not 5.
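An automatable gate can be as small as a function from eval results to a decision. The sketch below is a minimal version under stated assumptions: the `GateDecision` type and the eval category names are hypothetical, and treating an eval that never ran as a failure is a design choice made here, not something the levels prescribe.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateDecision:
    accepted: bool
    reasons: List[str]  # empty when accepted


def verification_gate(required: List[str], results: Dict[str, bool]) -> GateDecision:
    """Accept only if every required eval ran and passed."""
    reasons = []
    for name in required:
        if name not in results:
            reasons.append(f"{name}: not run")  # silence is not a pass
        elif not results[name]:
            reasons.append(f"{name}: failed")
    return GateDecision(accepted=not reasons, reasons=reasons)


REQUIRED = ["behavioural", "load", "adversarial", "regression"]

decision = verification_gate(
    REQUIRED,
    {"behavioural": True, "load": True, "adversarial": False},
)
print(decision.accepted, decision.reasons)
```

Nothing in the function asks for a judgement call. If writing your own gate forces you to add a "needs human look" outcome, that is the hidden Level 4 checkpoint showing itself.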
03 · Infrastructure
What the Factory Is Actually Built From
Level 5 is not a single tool. It's a pipeline of autonomous systems.
Spec Parser
- Interprets the spec and breaks it into discrete tasks
- Identifies dependencies between tasks
- Determines which tasks can be parallelised
- Flags ambiguities before any code is written
Key insight
The spec parser's output is not code. It's a task graph.
Before any agent writes a single line, the factory needs to know
the shape of the work. This is where most attempts at Level 5 break down:
they jump straight to code generation before mapping the problem.
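To make "task graph" concrete, here is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The task names and their dependencies are hypothetical; the point is only that dependency order and parallelisable batches fall out of the graph before any code is generated.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical task graph: each task maps to the set of tasks it depends on.
TASKS: Dict[str, Set[str]] = {
    "schema": set(),
    "auth-api": {"schema"},
    "billing-api": {"schema"},
    "web-ui": {"auth-api", "billing-api"},
}


def parallel_batches(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into batches; tasks within a batch can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in parallel_batches(TASKS):
    print(batch)
```

A circular dependency surfaces as an error at `prepare()`, which is a cheap form of flagging a problem in the spec before any agent starts work.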
Agents
- Multiple autonomous agents working on isolated branches
- Each agent carries its task context, constraints, and eval criteria
- Agents self-correct in tight loops: write → run → fail → fix
- Agents never communicate directly, only via the repo
Branch isolation
Each agent works in its own branch and never sees main directly.
This is what makes the factory safe to run without oversight:
a broken agent can't corrupt anything. The worst case is a failed PR.
This is the same isolation used in the Level 3 multi-agent workflow, minus
the human in the approval loop.
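The write → run → fail → fix loop can be sketched as a bounded retry that feeds failure reasons back into the next attempt. Everything named here is hypothetical: `generate` stands in for an agent producing a patch, `evaluate` stands in for an eval run, and the attempt limit is an assumption added so that a stuck agent ends as a failed PR instead of looping forever.

```python
from typing import Callable, List, Optional, Tuple

# generate(task, previous_failures) -> patch
Generate = Callable[[str, List[str]], str]
# evaluate(patch) -> failure reasons; an empty list means pass
Evaluate = Callable[[str], List[str]]


def self_correct(
    task: str,
    generate: Generate,
    evaluate: Evaluate,
    max_attempts: int = 5,
) -> Tuple[Optional[str], int]:
    """Retry until the eval passes or attempts run out.

    Returns (patch, attempts used), with patch set to None on failure.
    """
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        patch = generate(task, failures)  # write, or fix given the last failures
        failures = evaluate(patch)        # run
        if not failures:
            return patch, attempt         # pass: submit as a PR
    return None, max_attempts             # worst case: a failed PR


def make_toy_agent() -> Generate:
    """Toy stand-in for an agent: emits a new version on every call."""
    version = 0

    def generate(task: str, failures: List[str]) -> str:
        nonlocal version
        version += 1
        return f"{task}-v{version}"

    return generate


def toy_evaluate(patch: str) -> List[str]:
    # In this toy setup only the third version behaves correctly.
    return [] if patch.endswith("-v3") else ["expired token returned 200, expected 401"]


print(self_correct("login", make_toy_agent(), toy_evaluate))
```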
Eval Engine
- Runs the full eval suite against every PR before it touches main
- Behavioural tests, load tests, adversarial inputs, regression checks
- Returns structured pass/fail with failure reasons to the agent
- Agents use failure output to self-correct and resubmit
The factory's immune system
The eval engine is what makes Level 5 safe. Without a rigorous eval,
the factory will ship confidently broken software.
The eval does not check code quality. It checks observable behaviour.
A messy implementation that passes all evals is preferable to clean code
that fails one.
Orchestrator
- Reviews passing PRs for integration conflicts
- Manages merge order to prevent race conditions
- Triggers re-evaluation after each merge to catch regressions
- Routes failures back to the responsible agent for correction
Not the same as Level 3
In Level 3, the orchestrator flags things for human review.
In Level 5, the orchestrator makes every merge decision autonomously.
There is no human approval loop. This is the precise definition
of the threshold between Level 4 and Level 5.
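A minimal sketch of an autonomous merge decision, assuming the simplest possible policy: merge passing PRs one at a time, re-evaluate main after each merge, and route back any PR that causes a regression. The `merge_queue` function, the PR names, and the toy eval are hypothetical, and main is modelled as a plain list of merged changes.

```python
from typing import Callable, List, Tuple


def merge_queue(
    main: List[str],
    passing_prs: List[str],
    evaluate_main: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Merge PRs serially, re-running the eval on main after every merge.

    Returns (new state of main, PRs routed back to their agents).
    """
    merged = list(main)
    routed_back = []
    for pr in passing_prs:
        candidate = merged + [pr]
        if evaluate_main(candidate):
            merged = candidate          # merge stands
        else:
            routed_back.append(pr)      # regression: main stays as it was
    return merged, routed_back


def toy_eval(state: List[str]) -> bool:
    # Hypothetical integration conflict: these two PRs each pass alone
    # but break main when both are present.
    return not ({"pr-cache", "pr-sessions"} <= set(state))


print(merge_queue(["base"], ["pr-cache", "pr-sessions", "pr-docs"], toy_eval))
```

Serial merging is the conservative choice: it is slower than batching, but every regression is attributable to exactly one PR, which is what lets the failure be routed to the responsible agent without a human tracing it.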
Memory Layer
- Persists decisions, patterns, and lessons across runs
- Prevents the factory from re-making decisions already settled
- Feeds architectural context into each new agent session
- Equivalent to a living DECISIONS.md + CLAUDE.md, auto-updated
The missing layer
Most experimental Level 5 setups skip this entirely and wonder why
the factory keeps re-introducing solved problems. Memory is what
distinguishes a factory that gets smarter from one that starts fresh
every time. Without it, you're not building a factory, you're
renting one that forgets everything overnight.
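The smallest useful memory layer is a file that survives between runs and is rendered into each new agent's context. The sketch below assumes a JSON file and a first-write-wins rule for settled topics; the file name, function names, and example entries are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_decisions(path: Path) -> Dict[str, str]:
    """Read the persisted decisions, or start empty on the first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def record_decision(path: Path, topic: str, decision: str) -> None:
    """Persist a decision; a topic that is already settled is not overwritten."""
    decisions = load_decisions(path)
    decisions.setdefault(topic, decision)
    path.write_text(json.dumps(decisions, indent=2, sort_keys=True), encoding="utf-8")


def context_for_agent(path: Path) -> str:
    """Render settled decisions as text to prepend to a new agent session."""
    decisions = load_decisions(path)
    return "\n".join(f"- {topic}: {text}" for topic, text in sorted(decisions.items()))


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "decisions.json"
    record_decision(store, "state-management", "approach-a")
    record_decision(store, "state-management", "approach-b")  # ignored: already settled
    record_decision(store, "http-client", "library-x")
    demo_context = context_for_agent(store)

print(demo_context)
```

First-write-wins is deliberately rigid. Overturning a decision then requires an explicit edit to the store by the architect, not a side effect of a later run.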
04 · The Human Role
The Factory Needs an Architect
Level 5 doesn't eliminate human expertise. It concentrates it into a single, rare skill.
The instinct is to assume that Level 5 makes the developer obsolete. The opposite is closer to true.
The factory is only as good as the spec it receives, and writing a spec that is genuinely complete,
one that covers every edge case, captures every constraint, and includes an eval suite rigorous
enough to distinguish good output from bad, is harder than writing the code itself.
This is not a developer skill. It is not a project manager skill. It is something rarer:
the ability to fully model a complex system in your head before it exists,
express that model with enough precision that an autonomous agent can execute it faithfully,
and then design the scorecard that proves the result is correct. That person is an architect
in the original sense of the word: someone who designs a building they will never physically construct.
Skills that matter at Level 5
Exhaustive edge-case thinking before anything exists · Precise ambiguity-free language ·
Eval design (outcome testing, not implementation testing) ·
System architecture at the constraint level · Knowing what to prohibit, not just what to permit ·
Comfort with not reading the implementation · Trust calibrated to the eval, not to the code
Skills that become irrelevant at Level 5
Syntax · Debugging implementation details · Code review · Refactoring · Library familiarity ·
IDE proficiency · Reading other people's code · Writing tests during development
05 · What Goes Wrong
How Dark Factories Break
The failure modes are not the ones you'd expect.
Risk · Spec
Underspecified Behaviour
The factory builds exactly what you described. If your spec is 90% complete, the factory confidently fills in the last 10% the wrong way, and the eval doesn't catch it because you didn't write a test for what you forgot to specify.
Risk · Eval
Weak Eval Suite
A factory with a weak eval is worse than no factory. It ships confident, wrong software at speed. The eval is the factory's only quality signal. If it's incomplete, everything downstream is suspect, and you won't know it.
Warning · Architecture
Missing Decision Log
The factory has no memory between runs. Without a persistent record of decisions already made, it will revisit and reverse them. Each run may subtly undo the architecture of the last. The output compiles, the evals pass, but the codebase drifts.
Warning · Trust
Premature Autonomy
Premature autonomy means removing human review before your eval suite is truly complete. The transition from Level 4 to Level 5 should be earned, not assumed. Most teams that think they're running Level 5 are running Level 4 with hidden human checkpoints they haven't noticed.
Mitigation · Process
Start with a Narrow Domain
Level 5 works reliably first on well-bounded problem spaces: a specific microservice, a data pipeline, a well-defined API surface. The narrower and cleaner the domain, the more complete the spec can be, and the more rigorous the eval.
Opportunity · Compounding
The Factory Gets Faster
Every well-specified component you produce becomes a building block for future specs. The memory layer grows. The eval suite expands. Over time a good factory compounds: the cost of producing new software drops with each iteration.
06 · Current State
Honest Assessment: March 2026
What actually exists, and what remains genuinely unsolved.
- Multi-agent branch isolation with orchestrated merges (Level 3)
- Spec-driven generation with eval-based acceptance (Level 4)
- Claude Code CLI with CLAUDE.md instruction layers
- Automated CI/CD pipelines integrated with AI agents
The gap
The piece that doesn't exist reliably yet is the full removal of human review
from the merge loop. Every current production implementation keeps a human
somewhere: approving PRs, reviewing eval results, making the final call.
That's Level 4. Level 5 is closing that last loop.
- No standardised format for machine-readable specifications yet
- Most developers have never written a spec complete enough for Level 4, let alone 5
- Spec quality is hard to evaluate before the run; failure is the primary feedback signal
- The industry hasn't built a culture of pre-run specification discipline
The unsolved problem
Building the factory is tractable. Teaching humans to write specs good enough
to feed it is the real unsolved problem. This is a cultural and educational
challenge as much as a technical one. The tools exist. The skill doesn't, yet.
Where to focus if Level 5 is your goal
Don't start by building the factory. Start by building the spec discipline.
Write your next feature as a full Level 5 spec (behaviour contract, eval suite, constraints, anti-spec)
before touching any code. Run it through a Level 4 flow. See what the spec missed.
The gap between your spec and the actual working software is exactly the gap between you and Level 5.
Close it incrementally. The factory will be waiting when you get there.
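A first, very rough aid for that discipline is a check that refuses to start a run while any part of the spec is absent or empty. The section names below mirror the six components described in section 02; the dictionary layout and the `missing_sections` helper are hypothetical, since no standard spec format exists yet.

```python
from typing import Dict, List

REQUIRED_SECTIONS = [
    "behaviour_contract",
    "eval_suite",
    "constraint_inventory",
    "decision_log",
    "anti_spec",
    "verification_gate",
]


def missing_sections(spec: Dict[str, List[str]]) -> List[str]:
    """Names of required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not spec.get(name)]


draft_spec = {
    "behaviour_contract": ["expired token -> 401 and redirect to /login within 200ms"],
    "eval_suite": ["expired token is rejected"],
    "constraint_inventory": [],  # present but empty counts as missing
    "anti_spec": ["do not add authentication to this endpoint"],
}

print(missing_sections(draft_spec))
```

This only catches gaps that are visible. A spec can fill every section and still be incomplete in the invisible ways described earlier, which is why running it through a Level 4 flow remains the real test.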