The Last Mile Problem: Why AI Code Compiles But Doesn't Ship
February 27, 2026 by Asif Waliuddin

AI gets you to "it compiles" in minutes.
The last 20% — tests, security, deployment, governance — is where it falls apart.
This is not a fringe observation. It's a documented pattern with numbers attached: 45% of AI-generated code contains security vulnerabilities, 86% of AI-written code fails to defend against cross-site scripting, and 88% is vulnerable to log injection. Developers using AI-assisted coding tools were 19% slower than those who weren't, according to research from Cerbos. The most striking part: they believed they were faster.
The illusion of correctness is the real risk. Code that compiles. Tests that pass. A production system that breaks in ways the AI never anticipated and you never caught.
What the Data Says
The GroWExx research is direct: nearly half of all AI-generated code contains security vulnerabilities. This isn't a small sample. This isn't a methodology edge case. It's what you get when you ask an AI to write production code without enforcing security review as part of the generation process.
DevOps.com found that 86% of AI-generated code failed to defend against XSS attacks. 88% was vulnerable to log injection. These aren't exotic vulnerabilities. XSS and log injection are on every security training list. Every developer knows what they are. And AI tools still miss them at a rate that should make anyone shipping production code uncomfortable.
Help Net Security puts the XSS risk more precisely: AI-generated code is 2.74 times more likely to introduce cross-site scripting vulnerabilities than human-written code. Not because the AI doesn't know what XSS is. It can explain the concept with perfect accuracy. It just doesn't enforce the defense intrinsically when it generates code.
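The gap between knowing and enforcing is easy to demonstrate. The sketch below is a hypothetical Python example, not code from the cited studies: the unsafe function is the shape AI tools commonly emit, and the defense for both XSS and log injection is a one-line discipline the generator simply has to apply every time.

```python
import html

# Hypothetical example: the kind of handler an AI tool emits when asked
# for a "greeting page" plus request logging. All names are illustrative.

def render_greeting_unsafe(username: str) -> str:
    # User input interpolated straight into markup: a <script> payload
    # in `username` executes in the victim's browser (XSS).
    return f"<p>Hello, {username}!</p>"

def render_greeting_safe(username: str) -> str:
    # The defense is one call away: escape untrusted input before
    # it reaches HTML.
    return f"<p>Hello, {html.escape(username)}!</p>"

def log_line_safe(user_input: str) -> str:
    # Log injection: embedded newlines let an attacker forge whole
    # log entries. Neutralize line breaks before writing the record.
    sanitized = user_input.replace("\r", "\\r").replace("\n", "\\n")
    return f"login attempt: {sanitized}"

payload = "<script>alert('xss')</script>"
print(render_greeting_unsafe(payload))  # script tag survives intact
print(render_greeting_safe(payload))    # rendered inert as &lt;script&gt;...
print(log_line_safe("alice\nFAKE ENTRY: login ok for admin"))
```

Both fixes are trivial once you look for them. That's the point: the model can explain every line of this, and still emits the unsafe version unless something enforces the safe one.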
Veracode tested Claude Opus 4.5 specifically. Without security-focused prompting, it produces secure code only 56% of the time.
The AI knows what secure code looks like. It doesn't always write it.
The Kafka Handler Story
Here is a concrete version of how this plays out in production, documented by Testkube in their analysis of AI-generated infrastructure code.
An engineering team used an AI coding tool to build a Kafka event handler. The handler was clean, readable, and well-structured. It passed unit tests. It passed integration tests. The CI pipeline was green. The team shipped it.
Under production traffic — specifically under concurrent message bursts — the handler failed. The tests had never simulated the load patterns that production actually generated. The AI wrote technically correct code for the scenarios it was given. It had no way to anticipate the scenarios it wasn't given.
The code was not wrong. The testing assumptions were wrong. And the AI could not distinguish between "these tests cover what matters in production" and "these tests cover what I was told to test."
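The failure mode generalizes. The sketch below is a hypothetical Python stand-in, not the actual handler from the Testkube case: a read-modify-write on shared state that is correct for one message at a time silently loses updates under parallel load, and only a test that simulates the burst can expose it.

```python
import threading
import time

class NaiveHandler:
    """Stand-in for the event handler: correct for sequential messages,
    but its read-modify-write on shared state is not atomic."""
    def __init__(self):
        self.processed = 0

    def handle(self, _msg):
        current = self.processed
        time.sleep(0)                 # yield, widening the race window
        self.processed = current + 1  # may overwrite a concurrent update

class LockedHandler(NaiveHandler):
    """Same logic, but the critical section is guarded by a lock."""
    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def handle(self, msg):
        with self._lock:
            super().handle(msg)

def burst(handler, n_threads=8, msgs_per_thread=250):
    """Simulate a concurrent message burst: the load pattern the
    original sequential unit tests never exercised."""
    threads = [
        threading.Thread(
            target=lambda: [handler.handle(i) for i in range(msgs_per_thread)]
        )
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handler.processed

print(f"naive:  expected 2000, got {burst(NaiveHandler())}")
print(f"locked: expected 2000, got {burst(LockedHandler())}")
```

The naive version typically reports fewer than 2,000 processed messages; the locked version always reports exactly 2,000. The lesson isn't the lock. It's that the burst scenario had to be written into a test before either behavior was observable.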
IT Pro calls this the illusion of correctness: "code looks polished but conceals serious flaws." The polish is what makes it dangerous. Ugly code signals caution. Polished AI-generated code signals confidence that may not be warranted.
Why This Is a Last Mile Problem
The AI is genuinely good at the parts it's good at.
Writing idiomatic code in a given language: excellent. Refactoring for clarity: excellent. Generating tests for specified scenarios: excellent. Explaining architectural tradeoffs: excellent.
The Last Mile is everything that happens between "it works on my machine" and "it runs in production without incident":
- Security vulnerability scanning against known patterns
- Edge case testing beyond the specified scenarios
- Dependency conflict detection
- Deployment configuration validation
- Governance compliance: does this code match the architectural decisions the team made?
These are not things the AI skips because it's lazy. They're things it can't do intrinsically because they require context outside the generation session: the organization's security policies, the production load characteristics, the architectural decisions made three sprints ago.
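Mechanically, enforcement at the point of change is simple, even if production rule sets are not. The toy Python scanner below is a hypothetical illustration of the "security pattern scanning" item above; the rules and names are invented for the example and are not Forge's implementation. The shape is what matters: run it against every changed file and block the commit when it returns findings.

```python
import re

# Hypothetical rule set: each entry pairs a regex for a known risky
# pattern with a human-readable finding. Real scanners use far richer,
# vetted rules; this only shows the shape of the gate.
RULES = [
    (re.compile(r"\beval\("), "use of eval() on dynamic input"),
    (re.compile(r"innerHTML\s*="), "direct innerHTML assignment (XSS risk)"),
    (re.compile(r"subprocess\.\w+\(.*shell=True"), "shell=True subprocess call"),
]

def scan(source: str) -> list[str]:
    """Return one finding per rule match in a changed file's contents."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append(f"line {lineno}: {message}")
    return findings

snippet = "el.innerHTML = user_bio\nresult = eval(user_expr)"
for finding in scan(snippet):
    print(finding)
```

A real gate would wire this into a pre-commit or CI hook and use maintained rule sets (Semgrep-style patterns, CodeQL queries) rather than three regexes. But the contract is identical: scan every change, fail on findings, no opt-out.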
The Coordination Failure Underneath
There's a second failure mode that compounds the security problem, and it's specifically about running multiple AI tools.
I ran Claude Code and Codex CLI on the same codebase. Claude refactored the authentication module. Codex CLI, which had no visibility into that refactoring, updated tests for the same module against the pre-refactor interface. Both tools saved their changes. The tests failed. Neither tool knew the other existed.
This isn't a rare edge case. It's the default state of any workflow that runs two AI tools without coordination infrastructure. Each tool operates with its own memory, its own assumptions, and no shared state. Every session starts from scratch. Every decision evaporates at session end.
The Last Mile problem is both a security problem and a coordination problem. The security flaws get embedded during generation. The coordination failures amplify them by making the knowledge needed to catch them invisible to the tools doing the work.
What Forge Intrinsically Catches
The Forge Plugin runs governance as part of every Claude Code session. Not as an afterthought. Intrinsically.
Every file change triggers security pattern scanning. Every commit runs quality gate checks. Drift detection catches when the code diverges from the spec. File placement enforcement catches when new files land in unexpected locations.
None of this requires remembering to run a separate tool. None of it requires maintaining a separate checklist. The governance is built into the session.
The numbers from the platform: 4,434 tests, 31/31 launch gates passing, security scanning built into 6 governance hooks that run before and after every substantive action.
The 45% of AI-generated code that contains security vulnerabilities — that's code being generated without enforcement at the point of generation. Forge enforces at the point of generation.
The Practical Implication
If you're shipping AI-generated code to production without a governance layer, you are accepting the base rate: a 45% chance of a security flaw in any given piece of AI-generated code, and an 86% chance of missing XSS defenses if your application handles user input. These are not theoretical risks.
The solution isn't to stop using AI. The productivity gains are real. The capability improvement is genuine. The solution is to close the Last Mile gap between "it compiles" and "it ships."
Forge is that layer. Built into Claude Code. Running intrinsically. Catching what the AI missed before it reaches production.
Sources: GroWExx (45% security flaw statistic); DevOps.com (86% XSS, 88% log injection); Help Net Security (2.74x XSS likelihood); Testkube (Kafka handler production failure case); IT Pro ("illusion of correctness" framing); Cerbos (19% productivity paradox); Veracode (Claude Opus 4.5 secure code baseline).