With AI, The Proof Is in Production


Human review, and AI review, can only get you so far

Let's be frank: the last few years in software engineering have been earth-shattering. The foundations of the discipline have changed. Code can be written, rewritten, tested, and shipped faster than ever before. Agents are burning through trillions of tokens, and every month they get better at turning vague intent into working software.

That is exciting. It is also destabilizing.

Many teams are still built around the assumption that every meaningful change can be understood by a human before it merges. A developer opens a pull request, a reviewer reads it, a test suite runs, and the team decides whether the change is safe enough to deploy.

That model was already under pressure before AI, but now it is breaking.

LLMs can produce code far faster than any team can review it. The volume problem is obvious: if one engineer with an agent can generate several times more change than before, the review queue grows faster than the organization can absorb it. The harder problem is trust. Even when a change looks reasonable, and even when another model reviews it, no amount of review can guarantee how that change will behave in production.

AI review does not eliminate this problem. You can ask a different model, use a different prompt, or build an entire agentic code-review workflow. That can catch real issues. It can improve consistency. It can reduce the burden on humans. But it is still a non-deterministic system evaluating the output of another non-deterministic system. It can tell you what looks wrong. It cannot prove that a change will not degrade production.

Even staging and QA only get you so far. A non-production environment is not, and cannot be, exactly the same as production. It will not have the same traffic shape, data distribution, customer behavior, integrations, timing, scale, noisy neighbors, or failure modes. The closer you make it, the more useful it becomes, but it is still a model of production. It is not production.

So the question is not, "How do we review everything perfectly?"

The better question is, "How do we release in a way that assumes review is imperfect?"

The Old Idea That Suddenly Matters Again

Would you believe that one of the best answers to this problem has existed for a long time?

In December 2009, Flickr published an unassuming engineering post called "Flipping Out." The idea was simple: release new features without deploying new code for every feature launch. Flickr described a model where code was merged continuously, deployed from the main branch, and gated behind small runtime switches. A feature could exist in production but remain unavailable until a configuration value flipped it on.

At first, that may not seem directly related to AI-generated code. But follow the thread.

What Flickr was describing is what we now call feature flagging. Combined with trunk-based development, feature flags let teams deploy code continuously without releasing every behavior immediately. The key distinction is simple but profound: deployment and release are not the same thing.

Deployment is getting code into an environment.

Release is exposing behavior to users.

Those two actions are often treated as one event, but they do not have to be. Feature flags are a way to choose between code paths at runtime and explicitly decouple deployment from release. With AI-accelerated engineering, that separation becomes a basic safety requirement.
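To make the distinction concrete, here is a minimal sketch of a runtime switch. The flag name `new_checkout`, the in-memory `FLAGS` dictionary, and both checkout functions are hypothetical; a real system would read the flag from a flag service or configuration store that can change without a redeploy.

```python
# Hypothetical in-memory flag store, standing in for a real flag service.
FLAGS = {"new_checkout": False}


def legacy_checkout(cart_total: float) -> str:
    return f"legacy:{cart_total:.2f}"


def new_checkout(cart_total: float) -> str:
    return f"new:{cart_total:.2f}"


def checkout(cart_total: float) -> str:
    # Both code paths are deployed; the flag decides which one is released.
    if FLAGS.get("new_checkout", False):
        return new_checkout(cart_total)
    return legacy_checkout(cart_total)


before = checkout(10)          # new code is deployed but not released
FLAGS["new_checkout"] = True   # release: a configuration change, not a deploy
after = checkout(10)           # same build, different behavior
```

Nothing was redeployed between the two calls. Only the flag value changed, which is the whole point of separating deployment from release.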

If AI can generate more changes than humans can manually reason through, then the release system has to become more empirical. It has to answer: what is this feature actually doing to real users, real systems, and real business metrics?

The Game Is Production Feedback

Hiding unfinished work behind if statements is only the beginning. The real value is controlled exposure. A feature can be deployed to production, then released first to internal testers. Then to one percent of users. Then five. Then ten. At every step, you observe the impact before deciding whether to continue.
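One common way to implement that kind of ramp is deterministic bucketing: hash the user into a stable bucket and compare it to the current rollout percentage. The sketch below is an illustration under that assumption, not a description of how any particular flag product assigns users; `bucket`, `is_enabled`, and the tester set are hypothetical names.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int,
               internal_testers: frozenset = frozenset()) -> bool:
    # Internal testers see the feature before any percentage rollout starts.
    if user_id in internal_testers:
        return True
    # A user is exposed once the rollout percentage passes their bucket.
    return bucket(flag_name, user_id) < rollout_percent


users = [f"user-{i}" for i in range(10_000)]
exposed_at_1 = {u for u in users if is_enabled("new_checkout", u, 1)}
exposed_at_5 = {u for u in users if is_enabled("new_checkout", u, 5)}
exposed_at_10 = {u for u in users if is_enabled("new_checkout", u, 10)}
```

Because the bucket depends only on the flag name and user ID, a user who saw the feature at one percent keeps seeing it at five and ten percent, so each ramp step only adds users.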

Production is where the unknowns live. Your tests can tell you whether the code behaves as expected in known scenarios. Your reviewers can tell you whether the change looks reasonable. Your static analysis tools can tell you whether it violates known rules. But only production can show you whether the change behaves well under the messy reality of actual usage.

Why APM Is Not Enough

Most teams already have observability. They have dashboards, logs, traces, alerts, and APM tools. You still need all of that, but aggregate system health is a blunt instrument when the risk is tied to one feature in a partial rollout.

APM tools are usually excellent at telling you something changed in the system. They are much less reliable at telling you which feature caused the change, especially during progressive delivery.

Imagine an AI-generated change increases the crash rate by 10 percent, relative to baseline, for users who receive it. If that feature is only enabled for five percent of traffic, the total crash rate across the whole application moves by only about half a percent of its baseline. That can look like noise. It may not page anyone. It may not even be visible until the rollout expands to 20, 30, or 50 percent of traffic.
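The dilution in that example is simple arithmetic, sketched below. The 2 percent baseline crash rate is a made-up number for illustration; the relative shift in the aggregate does not depend on it.

```python
def aggregate_rate(baseline: float, relative_increase: float, exposure: float) -> float:
    """Blend the crash rate of exposed and unexposed traffic."""
    treated = baseline * (1 + relative_increase)
    return exposure * treated + (1 - exposure) * baseline


baseline = 0.02  # hypothetical baseline crash rate (2% of sessions)
blended = aggregate_rate(baseline, relative_increase=0.10, exposure=0.05)

# Relative movement of the application-wide crash rate.
relative_shift = (blended - baseline) / baseline
print(f"aggregate crash rate: {blended:.4%}, relative shift: {relative_shift:.2%}")
```

The relative shift is just exposure times relative increase (0.05 × 0.10 = 0.005), so a regression that is clearly visible among exposed users shrinks twentyfold in the aggregate view.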

Harness FME Release Monitoring is designed around that gap. Rather than looking only at aggregate platform health, Release Monitoring measures the impact of feature flags and experiments on performance and behavioral metrics. If multiple features are rolling out at once, you do not want to know only that the application got worse. You want to know which feature is responsible, which users saw it, and which metric moved.
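The underlying idea of per-feature attribution can be sketched in a few lines: compare the metric for users who received a feature against users who did not, one flag at a time. This is an illustrative sketch only, not how Harness FME computes its results; the event shape, flag names, and numbers are all hypothetical.

```python
from collections import defaultdict


def rates_by_treatment(events):
    """events: iterable of (flag_name, treatment, crashed) tuples."""
    totals = defaultdict(lambda: [0, 0])  # (flag, treatment) -> [crashes, sessions]
    for flag, treatment, crashed in events:
        counts = totals[(flag, treatment)]
        counts[0] += int(crashed)
        counts[1] += 1
    return {key: crashes / sessions for key, (crashes, sessions) in totals.items()}


def sessions(flag, treatment, crashes, total):
    # Helper that fabricates sample data: `crashes` crashed sessions out of `total`.
    return [(flag, treatment, i < crashes) for i in range(total)]


# Two flags rolling out at once (made-up numbers).
events = (
    sessions("search_v2", "on", 2, 100) + sessions("search_v2", "off", 2, 100)
    + sessions("checkout_v2", "on", 6, 100) + sessions("checkout_v2", "off", 2, 100)
)
rates = rates_by_treatment(events)

# Flags whose "on" population crashes more often than their "off" population.
suspects = sorted(
    flag for (flag, treatment) in rates
    if treatment == "on" and rates[(flag, "on")] > rates[(flag, "off")]
)
```

Here the overall crash rate rises, but splitting by flag and treatment points at `checkout_v2` and clears `search_v2`. A raw comparison like this says nothing about statistical significance, which is a separate question.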

Metrics Become the Review Layer

Code review does not go away. Human review still matters. AI review still helps. Tests still matter. Security scanning still matters. Production metrics add the control those systems cannot provide on their own: measured impact.

In Harness FME, metrics evaluate the impact of feature flags and experiments on user behavior and system performance. They can measure errors, conversions, page load performance, interactions, satisfaction, sessions, shopping cart behavior, and any other event stream that matters to the product.

"Safe" is not a purely technical word. Depending on the feature, safety might mean error rates stay flat, page loads do not slow down, conversion does not drop, support tickets do not spike, or customers do not start rage clicking their way through a broken flow.

The right guardrails depend on the feature. Engineering leadership may care about latency and error rate. Product leadership may care about adoption and retention. Support may care about ticket volume. The power of a metric-driven release process is that all of those concerns can be defined before the rollout, measured during the rollout, and used to decide whether the feature keeps moving forward.
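As a rough sketch of what "defined before the rollout" can look like, guardrails can be written down as data and evaluated mechanically. The metric names, owners, thresholds, and observed values below are invented for illustration and are not taken from any product's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    owner: str
    max_relative_change: float   # tolerated worsening vs. control, 0.05 = 5%
    higher_is_worse: bool = True


def breached(guardrail: Guardrail, control: float, treatment: float) -> bool:
    if control <= 0:
        raise ValueError("control value must be positive")
    change = (treatment - control) / control
    # Flip the sign for metrics where a drop is the bad direction.
    worsening = change if guardrail.higher_is_worse else -change
    return worsening > guardrail.max_relative_change


# Agreed before the rollout starts (hypothetical values).
GUARDRAILS = [
    Guardrail("error_rate", "engineering", 0.05),
    Guardrail("p95_latency_ms", "engineering", 0.10),
    Guardrail("conversion_rate", "product", 0.02, higher_is_worse=False),
    Guardrail("support_tickets_per_1k", "support", 0.10),
]

# Measured during the rollout: metric -> (control, treatment), made-up numbers.
observed = {
    "error_rate": (0.010, 0.0102),
    "p95_latency_ms": (400.0, 460.0),
    "conversion_rate": (0.050, 0.0495),
    "support_tickets_per_1k": (3.0, 3.1),
}

failing = [g.metric for g in GUARDRAILS if breached(g, *observed[g.metric])]
```

In this made-up rollout only the latency guardrail fails: a 15 percent slowdown against a 10 percent tolerance. Each team's concern is checked the same way, against a threshold it chose in advance.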

That changes the AI conversation. Reviewers are no longer being asked to predict every possible effect of a change from the diff alone. The release system is responsible for measuring the effects that actually matter.

Alert, Kill, Learn, Continue

Once metrics are attached to a rollout, the next step is automation.

Harness FME alerts and monitoring can notify teams when metrics cross critical thresholds or when statistically significant impact is detected on key or guardrail metrics. If the impact is negative, the team can stop the rollout, kill the flag, and investigate with a much narrower blast radius than a traditional deploy-and-pray release.
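To give a feel for what "statistically significant impact" means, here is a sketch using a textbook two-proportion z-test. This is a generic technique chosen for illustration; it is an assumption, not a description of the statistics Harness FME actually applies. The function names, counts, and the 0.05 threshold are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(control_events: int, control_n: int,
                           treatment_events: int, treatment_n: int) -> float:
    """Two-sided p-value for a difference between two proportions."""
    p_control = control_events / control_n
    p_treatment = treatment_events / treatment_n
    pooled = (control_events + treatment_events) / (control_n + treatment_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treatment_n))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_treatment - p_control) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def should_alert(control_events: int, control_n: int,
                 treatment_events: int, treatment_n: int,
                 alpha: float = 0.05) -> bool:
    # Alert only when the treatment is significantly WORSE (more error events).
    worse = treatment_events / treatment_n > control_events / control_n
    p_value = two_proportion_p_value(control_events, control_n,
                                     treatment_events, treatment_n)
    return worse and p_value < alpha


clear_regression = should_alert(200, 10_000, 300, 10_000)  # 2.0% vs 3.0% errors
likely_noise = should_alert(200, 10_000, 205, 10_000)      # 2.0% vs 2.05% errors
```

The first comparison is far outside what chance would explain at these sample sizes, while the second is indistinguishable from noise. Checking a test like this repeatedly during a ramp inflates false alarms unless the method accounts for it, which is one reason to rely on a purpose-built monitoring system for real rollouts.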

The operational model starts to look different:

1. AI helps generate the change.
2. Humans and AI review the change where review adds value.
3. The change merges behind a feature flag.
4. The code deploys to production without immediately releasing to everyone.
5. The feature ramps through controlled production exposure.
6. Metrics determine whether the rollout continues, pauses, rolls back, or gets killed.
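The last two steps of that loop can be sketched as a small decision function. The ramp schedule, the `Verdict` categories, and `next_exposure` are hypothetical names chosen for this illustration; in practice the verdict would come from the metrics and alerts attached to the flag.

```python
from enum import Enum

RAMP_STEPS = [1, 5, 10, 25, 50, 100]  # hypothetical exposure schedule, in percent


class Verdict(Enum):
    HEALTHY = "healthy"            # guardrails hold, keep ramping
    INCONCLUSIVE = "inconclusive"  # not enough data yet, hold position
    HARMFUL = "harmful"            # a guardrail was breached, kill the flag


def next_exposure(current_percent: int, verdict: Verdict, steps=RAMP_STEPS) -> int:
    if verdict is Verdict.HARMFUL:
        return 0                   # kill: nobody sees the feature
    if verdict is Verdict.INCONCLUSIVE:
        return current_percent     # pause at the current exposure
    for step in steps:
        if step > current_percent:
            return step            # continue to the next ramp step
    return current_percent         # already fully released


# Replay a made-up sequence of verdicts, starting from zero exposure.
exposure = 0
history = []
for verdict in [Verdict.HEALTHY, Verdict.HEALTHY, Verdict.INCONCLUSIVE,
                Verdict.HEALTHY, Verdict.HARMFUL]:
    exposure = next_exposure(exposure, verdict)
    history.append(exposure)
```

In this replay the feature ramps to 1, then 5, holds at 5 while the data is inconclusive, moves to 10, and is killed when a guardrail fails. The reviewer's job shifts from predicting that failure to having made sure it would be measured.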

That loop is much more realistic for the AI era than pretending review can scale linearly with code generation.

With FME pipelines, this can also become part of the delivery workflow itself. Harness pipelines can include FME steps for operations like creating or updating feature flags, changing rollout behavior, modifying targets, setting default allocations, and killing a flag. Feature release can move from an ad hoc manual process to an auditable automation path.

AI velocity does not need chaos with better dashboards. It needs disciplined automation with measurable gates.

Production Is the Proof

Software engineering has changed permanently. The amount of code that can be produced by a small team is going up. The number of ideas that can be prototyped is going up. The number of changes waiting to be reviewed, validated, merged, and released is also going up.

But some things have not changed.

Production is still the only environment that is truly production. Users still behave in ways you did not predict. Distributed systems still fail in ways your test plan did not imagine. Business metrics still matter more than whether the diff looked elegant.

So yes, keep reviewing code. Use AI reviewers where they help. Keep improving tests. Keep scanning for vulnerabilities. Keep investing in non-production environments.

None of that is proof by itself.

When features are being written faster than humans can comprehensively review them, the release process has to become empirical. Put the code behind a flag. Release it progressively. Measure the impact per feature. Alert on guardrails. Kill the feature when the data says it is hurting users.

In the age of the LLM, the proof is in production.

Joshua Klein


Over the course of his career, Joshua Klein has helped product companies deliver high-quality software to their customers by architecting and implementing scalable, reliable solutions.
