Why We're Building an AI Coding Harness

Over the last year, a lot of teams have made a sharp turn in how they think about AI writing code.

The first question was: can AI write code? That has since been replaced by a different one: can the code AI writes actually enter a real engineering system?

These are very different questions. The first is about generation — completing a function, building a page from a prompt, explaining an error. The second is about engineering: did the requirement survive, was the design validated, do the tests move with the code, did E2E actually run, does the change ripple into install, configuration, rollback, cost, and long-term maintenance.

At AiKey Labs we call the set of engineering constraints we build around AI coding the AI Coding Harness. It's not a nicer set of prompts. It's not piling more models into the same workflow. It's closer to a constraint system: AI gets to be fast, but it can't drift off the goal; agents get to run in parallel, but every step produces something that can be accepted or rejected; code keeps getting generated, but each generation leaves behind evidence you can actually check.

What AI coding really lacks isn't generation — it's acceptance

AI is excellent at producing something that looks complete. The problem is that real software development almost never needs only something that looks complete.

Before a feature ships, a team is asking a different set of questions:

Did this implementation solve the user's original problem, or did it quietly turn into a different, easier-to-answer problem along the way?
Did the tests cover the changed path, or only the happiest path that was easiest to write?
Did E2E run against a scenario with real data flow, or against an empty state that mostly reassures itself?
Did the issues raised in review get fixed in code, or just explained away in a doc?
Will this change affect installers, config migrations, the enterprise variant, historical data, or rollback?

These questions don't go away just because the model got better. The stronger the model, the more a team needs a process that can absorb its output. Without one, velocity becomes noise, code becomes debt, and tests become decoration.

The first principle of an AI Coding Harness is simple: every AI output must be checkable against acceptance criteria. Acceptance enters the workflow at the requirement stage, not at the end.

Every requirement needs both a producer side and an acceptor side: who delivers the plan, code, tests, and reports, and who decides whether that evidence is enough. AI coding without an acceptor side tends to become a long stream of suspiciously diligent-looking commits.

A prompt pattern we use internally is to pull a task into the acceptance frame first. Say a developer tool needs to change its one-click installer. Behind that is more than downloading a binary — there's checksum validation, writing configs, starting a local service, rolling back on failure, and version upgrades. If you tell an agent "implement the install plan," it usually writes code first and then bolts on a few tests at the end. We start it on something else:

"Read this packaging and install plan. Confirm the approach is feasible. Check for anything that conflicts with the current code, and flag what you find."

This isn't asking the agent to start coding. It's asking the agent to first answer: does the current code already support this plan, which assumptions need validation, how do we roll back on failure, and what evidence will prove the install actually works. The value of a prompt isn't in elegant phrasing — it's in turning "generate code" into "enter a process that can be checked."

The harness protects the user's goal, not whichever plan we landed on

The most common drift in AI coding isn't the model going completely off-script. It's the model quietly rewriting the requirement into a task that's easier to finish.

The user says, "I want to confirm all my keys are usable before a long task." The implementation comes back as "add a check button." The user's actual worry was long-running tasks being interrupted, quota burned, and failures with no clear owner. A button is one possible surface; it isn't the goal.

The user says, "Team members don't care about the admin console — they just want to know whether their Team Key works." The implementation comes back as "add a table in the admin panel." That's not the member's point of view. The member wants to know: can I keep working right now, who do I ask when it breaks, and will this affect my current task.

The harness has to keep the user's original goal alive throughout the process. Plans can change, technical paths can change, page layouts can change — but the user's pain point cannot be silently swapped for whichever component the implementer happens to know best. Technical implementation is not the goal. The goal is what the user is trying to accomplish; code is just one path to it.

Context purity matters more than prompt wording

A lot of teams attribute AI quality to the prompt. Prompts matter, but in complex engineering work, what actually determines quality is context purity.

In a long session contaminated with requirement debate, architecture arguments, ad-hoc bug fixes, failure logs, abandoned proposals, irrelevant files, and repeated rework, the model has a harder and harder time telling what's a fact, what's history, and what's a current constraint that must be obeyed.

The harness has to actively manage context:

Design, implementation, review, and E2E don't all have to live in the same window.
For complex problems, fork different perspectives and let different agents own different responsibilities.
Key context should land in SPECs, acceptance criteria, test reports, and decision records — not just chat scrollback.
Every rework round should state which acceptance gate it's fixing, not retell the whole story.

A common pattern is cross-session decision recovery. Complex work rarely fits in one window: one window discusses architecture, another writes code, a third runs E2E. Come back days later, and the active agent only sees "modify this module" — it doesn't know why an earlier alternative was rejected, or which constraints the user has already locked in.

When context isn't clean, the agent happily re-invents the rejected plan, sometimes deleting the compatibility logic that was intentionally preserved. It looks like it's "simplifying the code." It's actually erasing history.

Most AI coding incidents aren't the model being unable to code. They're the model holding 20% of the current context while changing 100% of the system. The cleaner the context, the higher the leverage AI gets.

Multiple models isn't about stacking compute — it's about division of roles

The value of multi-model collaboration isn't "ask three models and pick the best-looking answer." That gets expensive fast, and it manufactures new noise.

Multiple models are more useful when each one takes a distinct role. In our setup, one model leans into design and constraint synthesis, another into implementation and integration tests, and a third into review or workflow oversight, checking whether the task is moving inside its declared constraints. The number of models matters less than whether each role has a clear responsibility boundary.

The design agent decides whether there's enough information to make a decision.
The implementation agent writes code and tests against the SPEC.
The review agent checks code, architecture, tests, edge branches, and alignment with the user's goal.
The acceptance agent focuses on E2E, data flow, regression risk, and whether evidence is complete.

The point of multiple perspectives is to surface blind spots. An implementation-only agent tends to underweight acceptance. A review-only agent without runtime evidence tends to stop at text. The harness makes these roles constrain each other — not endorse each other.

SPEC + E2E: turning requirements into something checkable

The thing AI coding fears most isn't a lack of tests — it's tests that have drifted away from the requirement. A requirement that can't be turned into a SPEC won't transfer cleanly to an agent. A SPEC that can't generate test cases is hard to call executable. An E2E without real data flow can't prove the system works along the user's path.

So we bind SPEC and E2E together:

The SPEC describes the requirement, boundaries, error branches, and acceptance criteria.
Test cases grow out of the SPEC, not after the implementation lands.
Coding and testing move in the same step — change code, change tests.
The E2E report is gate evidence, not a release-day ritual.
If acceptance fails, the next round of changes has to map back to the failure, either as code changes or as an explicit waiver.
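One way to picture the binding between SPEC and tests is a small coverage check: every acceptance criterion either links to at least one test or carries an explicit waiver. The sketch below is only an illustration under assumed names; `Criterion`, `uncovered`, and the `AC-*` ids are hypothetical and not part of any real SPEC format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Criterion:
    """One acceptance criterion from a SPEC (hypothetical shape)."""
    cid: str
    text: str
    test_ids: list[str] = field(default_factory=list)
    waiver: Optional[str] = None  # explicit reason this criterion is not tested


def uncovered(criteria: list[Criterion]) -> list[str]:
    """Return ids of criteria that have neither a linked test nor a waiver."""
    return [c.cid for c in criteria if not c.test_ids and not c.waiver]


# Illustrative SPEC for an install flow; ids and test names are made up.
spec = [
    Criterion("AC-1", "Checksums verify after download", ["test_checksum_ok"]),
    Criterion("AC-2", "Upgrade preserves existing config"),
    Criterion("AC-3", "Offline install", waiver="out of scope for this release"),
]

missing = uncovered(spec)
print(missing)  # ['AC-2']
```

A check like this could run before implementation starts, so a criterion with no test shows up as a gap rather than being discovered at review time.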

A useful rule of thumb: an E2E run on an empty state isn't a complete acceptance.

Checklist · Installation Flow (evidence required)

[PASS] Release bundle exists; manifest and checksums verify
[PASS] Downloaded artifact matches by hash; binary runs after install
[PASS] Upgrade preserves the user's existing config
[GATE] E2E on an empty state is not a complete acceptance

Real users don't show up in pristine environments. The enterprise config has missing fields. The personal user has an old directory layout. The key expired. The OAuth token ended up somewhere it shouldn't. The model call succeeded but the cost was attributed to the wrong account. The harness needs to pull these branches in early — not let users discover them post-launch.

One-click install is a good example. To a user it's simple: paste one line, wait, the tool works. To an implementer the chain is long: build the release bundle, generate manifest + checksums, the install script downloads the right version, verifies what it got, starts the binary, and preserves user config across upgrades. A shallow E2E stops at "the script finished without errors." A real E2E mocks the release source, uploads the build artifacts, points the install script's download URL at the test server, runs the installer inside an isolated temp directory, then verifies the binary runs, the config is preserved, the health check passes, and the SHA-256 of the key files matches across build, download, and install.
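The hash check across build, download, and install can be sketched in a few lines. This is a simplified illustration, not a real installer test: the three directories stand in for the release source, the downloaded artifact, and the install target, and names such as `verify_chain` and `tool.bin` are assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large binaries don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_chain(manifest: dict[str, str], stages: dict[str, Path]) -> list[str]:
    """Compare every manifest file in every stage directory against its hash.

    Returns readable mismatch messages; an empty list means the chain holds.
    """
    problems = []
    for stage, root in stages.items():
        for name, expected in manifest.items():
            target = root / name
            if not target.is_file():
                problems.append(f"{stage}: {name} missing")
            elif sha256_of(target) != expected:
                problems.append(f"{stage}: {name} hash mismatch")
    return problems


# Simulate build -> download -> install inside an isolated temp directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    build, download, install = base / "build", base / "download", base / "install"
    for d in (build, download, install):
        d.mkdir()
    (build / "tool.bin").write_bytes(b"fake binary payload")
    manifest = {"tool.bin": sha256_of(build / "tool.bin")}
    shutil.copy(build / "tool.bin", download / "tool.bin")
    shutil.copy(download / "tool.bin", install / "tool.bin")

    stages = {"build": build, "download": download, "install": install}
    clean = verify_chain(manifest, stages)

    # Corrupt the installed copy to show that the check fails loudly.
    (install / "tool.bin").write_bytes(b"tampered")
    tampered = verify_chain(manifest, stages)

print(clean)     # []
print(tampered)  # ['install: tool.bin hash mismatch']
```

A fuller E2E would also start the binary and run the health check; the point here is only that "the script finished" and "the bytes match at every stage" are different pieces of evidence.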

Front-loaded validation reduces downstream rework

A lot of AI coding workflows fail because they enter implementation too early. The model is great at writing code, so teams naturally hand it the things they haven't fully thought through. The result is more code, not less uncertainty. By the time review or E2E catches the wrong direction, rework cost is already high.

Front-loaded validation means settling the uncertain parts before the main implementation starts:

For unfamiliar third-party integrations, spike first — don't jump into the main implementation.
For uncertain architecture choices, run a controlled comparison and a scoring pass.
For critical flows, define gates and failure criteria before writing code.
For anything touching cost, security, permissions, or config migration, lock the data model and boundaries first.

File permission errors are a classic example. When a dev tool tries to update a shell config or hook file, the most intuitive response to EACCES: permission denied is "try again with sudo." AI tends to fix it the same way: change the error message to suggest privilege escalation.

But these errors often aren't actually a privilege problem. On Windows or certain shells, the file may be momentarily held by another shell, an editor, an indexer, or some security tool. Sudo doesn't release that file handle. Worse, some sudo modes spawn a new window — error output flashes by, the user assumes it's fixed, and nothing actually changed.

Front-loaded validation asks the right question first: is this an ACL issue or a sharing-violation issue? Is it consistently reproducible or transient? Does the error survive privilege escalation? The sturdier fix usually isn't "ask the user for sudo" — it's retry on transient locks, detect sharing violations, and surface actionable next steps. The lesson isn't only about technical feasibility; front-loaded validation is also about validating what the problem actually is.
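The "retry on transient locks, detect sharing violations, surface next steps" approach might look roughly like the sketch below. It is written under stated assumptions: `classify` and `write_with_retry` are hypothetical helpers, the hint strings are illustrative, and the Windows check relies on `OSError.winerror` values 32 (ERROR_SHARING_VIOLATION) and 33 (ERROR_LOCK_VIOLATION).

```python
import errno
import time
from typing import Callable


def classify(exc: OSError) -> str:
    """Rough triage of a file-write failure (assumed categories, not exhaustive)."""
    # winerror only exists on Windows; check it first because Python maps
    # sharing violations to EACCES as well.
    if getattr(exc, "winerror", None) in (32, 33):
        return "sharing_violation"
    if exc.errno in (errno.EACCES, errno.EPERM):
        return "permission"
    return "other"


def write_with_retry(write: Callable[[], None], attempts: int = 4,
                     delay: float = 0.05,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call write(), retrying failures that might be transient locks.

    Returns "ok", or a short hint describing what the user should do next.
    """
    last = "other"
    for attempt in range(attempts):
        try:
            write()
            return "ok"
        except OSError as exc:
            last = classify(exc)
            if last == "other":
                raise  # not a lock or permission problem; don't mask it
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))  # exponential backoff
    if last == "sharing_violation":
        return "file is held by another process; close editors or shells using it and retry"
    return "still denied after retries; check the file's owner and ACL before escalating"


calls = {"n": 0}


def flaky_write() -> None:
    """Fails twice with EACCES, then succeeds, like a briefly locked file."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise PermissionError(errno.EACCES, "Permission denied")


result = write_with_retry(flaky_write, sleep=lambda _seconds: None)
print(result, calls["n"])  # ok 3
```

Retrying before suggesting escalation answers the "consistent or transient?" question with evidence: a write that succeeds on the third attempt was never a privilege problem.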

Merged ownership: whoever designs has to face acceptance

In traditional workflows, design, implementation, testing, and acceptance are often split across handoffs. With AI in the loop, treating it as "implementation labor" while keeping the rest of the workflow unchanged breaks the balance fast.

A healthy harness re-binds responsibility:

Whoever designs the plan must define how it will be accepted, up front.
Whoever writes code must submit the corresponding tests.
Reviewers don't just leave comments — they push for changes in code, tests, or the SPEC.
E2E rejection isn't "edit the wording of the report" — it's add tests, fix the implementation, or document why it's intentionally not fixed.

AI is excellent at manufacturing the feeling of completion: there's a doc, there's code, there's a report, there's a summary. What an engineering team has to check is whether the system state actually improved.

Why AiKey Labs needs its own harness

AiKey is a particularly good product to apply an AI Coding Harness to. We deal with AI keys, model routing, cost attribution, team usage, OAuth token security, personal vs. enterprise differences, local proxies, and long-lived configuration state. None of these are single-feature problems.

A Team Key feature isn't really about whether the admin can create a config. The member's question is: can I use it right now? When it fails, why? Which team / project / task does this key belong to? Will the cost get attributed to the wrong person?

Cost management isn't really about a single number on a dashboard. The team needs to slice it by model, by person, by project, by task. Otherwise the number is a report, not a management tool.
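As a rough illustration of slicing cost by several dimensions, the sketch below groups usage records by any combination of fields. The record fields (`model`, `person`, `project`, `task`, `cost_cents`) and the function name `slice_cost` are assumptions made for this example, not AiKey's actual data model.

```python
from collections import defaultdict

# Hypothetical usage records; costs are integer cents to avoid float drift.
records = [
    {"model": "model-a", "person": "ana", "project": "web", "task": "T-1", "cost_cents": 120},
    {"model": "model-a", "person": "ben", "project": "web", "task": "T-2", "cost_cents": 80},
    {"model": "model-b", "person": "ana", "project": "cli", "task": "T-3", "cost_cents": 50},
    {"model": "model-b", "person": "ana", "project": "web", "task": "T-1", "cost_cents": 30},
]


def slice_cost(rows, *dims):
    """Sum cost_cents grouped by the given dimensions, e.g. ("person", "project")."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["cost_cents"]
    return dict(totals)


by_model = slice_cost(records, "model")
by_person_project = slice_cost(records, "person", "project")

print(by_model)
# {('model-a',): 200, ('model-b',): 80}
print(by_person_project)
# {('ana', 'web'): 150, ('ben', 'web'): 80, ('ana', 'cli'): 50}
```

The same total (280 cents) answers different questions depending on the slice, which is the difference between a report and a management tool.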

For AiKey, that means the harness has to make sure that:

The user's goal isn't diluted by implementation details.
The data model is stable enough to absorb future versions.
Personal vs. enterprise differences are designed in, not bolted on.
Install, config, rollback, offline availability, and error branches enter acceptance together.
Every change is regression-tested against the modules it touches.

For us, the AI Coding Harness isn't an internal process optimization — it's part of product quality. We build AiKey using it, and we use AiKey to host a more reliable AI development workflow.

What a mature AI Coding Harness looks like

Our view of the harness is still evolving, but a few things are clear.

A stable domain model. Agents come and go, models upgrade, tasks split — but the core objects, states, and events shouldn't change every day.
Gate-state management. Design review, implementation, testing, review, E2E, regression, release — each stage knows what evidence it's waiting for and who's responsible for producing it.
An evidence ledger. Test results, failure reasons, fix commits, acceptance conclusions, skip rationales — all traceable. "Done" without evidence doesn't mean much on a complex project.
Layered context. Long-term knowledge lives in docs and SPECs. Short-term tasks live in sessions. Branch exploration lives in forks. Failure experience lives in regression tests.
Humans, still. The harness doesn't remove engineers from the loop. It moves them from staring at every generation to designing goals, setting boundaries, judging evidence, and making the hard trade-offs.
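Gate-state management and an evidence ledger can be sketched together as an append-only list of evidence entries plus a rule for which gate is currently blocked. This is a toy model under assumptions: the gate names follow the stages listed above, while `Evidence`, `Ledger`, and the artifact references are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Gate order taken from the stages named in the text.
GATES = ["design_review", "implementation", "testing", "review",
         "e2e", "regression", "release"]


@dataclass(frozen=True)
class Evidence:
    gate: str
    kind: str     # e.g. "test_report", "fix_commit", "skip_rationale"
    ref: str      # pointer to the artifact (path, commit id, report id)
    passed: bool


@dataclass
class Ledger:
    entries: list[Evidence] = field(default_factory=list)

    def record(self, ev: Evidence) -> None:
        if ev.gate not in GATES:
            raise ValueError(f"unknown gate: {ev.gate}")
        self.entries.append(ev)  # append-only: failed runs stay in history

    def gate_open(self, gate: str) -> bool:
        """A gate is open when its most recent evidence passed."""
        history = [e for e in self.entries if e.gate == gate]
        return bool(history) and history[-1].passed

    def next_blocked(self) -> Optional[str]:
        """First gate, in order, that still lacks passing evidence."""
        for gate in GATES:
            if not self.gate_open(gate):
                return gate
        return None


# Illustrative run; the artifact references are made up.
ledger = Ledger()
ledger.record(Evidence("design_review", "decision_record", "docs/adr-install.md", True))
ledger.record(Evidence("implementation", "fix_commit", "abc123", True))
ledger.record(Evidence("testing", "test_report", "reports/unit.xml", False))
blocked_before = ledger.next_blocked()

ledger.record(Evidence("testing", "test_report", "reports/unit-rerun.xml", True))
blocked_after = ledger.next_blocked()

print(blocked_before, blocked_after)  # testing review
```

Keeping the failed test report in the ledger instead of overwriting it is what makes "done" traceable: the rerun passes, and the history still shows why a rerun was needed.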

Closing

AI will keep getting better at writing code. That trend isn't really in doubt. What will actually separate teams is whether they can route AI's generation capability into a reliable engineering system. Without a harness, AI coding stops at demos and local productivity wins. With one, it has a real shot at the more complex, longer-horizon, closer-to-production parts of software development.

We're building the AI Coding Harness because we don't just want AI to produce code faster. We want AI to participate in an engineering process whose results can be accepted, regression-tested, traced, reviewed, and continually learned from.

This is also the throughline of how we plan to write and ship from here on: less abstract noise, more of what's actually happening inside real systems. The companion piece, "In the AI Era, Code Is No Longer Your Most Valuable Engineering Asset," goes deeper into what teams should actually be accumulating.
