Engineering Blog

Top Continuous Integration Metrics Every Platform Engineering Leader Should Track

Track essential Continuous Integration metrics to boost developer productivity, reduce costs, and optimize pipelines. Learn how platform leaders drive results with CI metrics.

Chinmay Gaikwad

February 11, 2026

Your developers complain about 20-minute builds while your cloud bill spirals out of control. Pipeline sprawl across teams creates security gaps you can't even see. These aren't separate problems. They're symptoms of a lack of actionable data on what actually drives velocity and cost.

The right CI metrics transform reactive firefighting into proactive optimization. With analytics data from Harness CI, platform engineering leaders can cut build times, control spend, and maintain governance without slowing teams down.

Why Do CI Metrics Matter for Platform Engineering Leaders?

Platform teams who track the right CI metrics can quantify exactly how much developer time they're saving, control cloud spending, and maintain security standards while preserving development velocity. The importance of tracking CI/CD metrics lies in connecting pipeline performance directly to measurable business outcomes.

Reclaim Hours Through Speed Metrics

Build time, queue time, and failure rates directly translate to developer hours saved or lost. Research shows that 78% of developers feel more productive with CI, and most want builds under 10 minutes. Tracking median build duration and 95th percentile outliers can reveal your productivity bottlenecks.

Harness CI delivers builds up to 8X faster than traditional tools, turning this insight into action.

Turn Compute Minutes Into Budget Predictability

Cost per build and compute minutes by pipeline eliminate the guesswork from cloud spending. AWS CodePipeline charges $0.002 per action-execution-minute, making monthly costs straightforward to calculate from your pipeline metrics.
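As a rough sketch of that calculation, the snippet below multiplies action-execution minutes by the rate quoted above. The function name and the example workload (45 action-minutes per run, 1,200 runs a month) are assumptions for illustration, and the rate should be checked against current pricing.

```python
# Rate quoted in the text above; treat it as an input, not a constant of nature.
RATE_PER_ACTION_MINUTE = 0.002  # USD

def monthly_pipeline_cost(action_minutes_per_run, runs_per_month,
                          rate=RATE_PER_ACTION_MINUTE):
    """Estimate monthly spend from action-execution minutes."""
    return round(action_minutes_per_run * runs_per_month * rate, 2)

# Hypothetical pipeline: 45 action-minutes per run, 1,200 runs a month.
print(monthly_pipeline_cost(45, 1200))  # 108.0
```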

Measuring across teams helps you spot expensive pipelines, optimize resource usage, and justify infrastructure investments with concrete ROI.

Measure Artifact Integrity at Scale

SBOM completeness, artifact integrity, and policy pass rates ensure your software supply chain meets security standards without creating development bottlenecks. NIST and related EO 14028 guidance emphasize machine-readable SBOMs and automated hash verification for all artifacts.

However, measurement consistency remains challenging. A recent systematic review found that SBOM tooling variance creates significant detection gaps, with tools reporting between 43,553 and 309,022 vulnerabilities across the same 1,151 SBOMs.

Standardized metrics help you monitor SBOM generation rates and policy enforcement without manual oversight.

10 CI/CD Metrics That Move the Needle

Not all metrics deserve your attention. Platform engineering leaders managing 200+ developers need measurements that reveal where time, money, and reliability break down, and where to fix them first.

Performance metrics show where developers wait instead of coding. High-performing organizations achieve up to 440 times faster lead times and deploy 46 times more frequently by tracking the right speed indicators.
Cost and resource indicators expose hidden optimization opportunities. Organizations using intelligent caching can reduce infrastructure costs by up to 76% while maintaining speed, turning pipeline data into budget predictability.
Quality and governance metrics scale security without slowing delivery. With developers increasingly handling DevOps responsibilities, compliance and reliability measurements keep distributed teams moving fast without sacrificing standards.

So what does this look like in practice? Let's examine the specific metrics.

Build Duration (p50/p95): Pinpointing Bottlenecks and Outliers

Build duration becomes most valuable when you track both median (p50) and 95th percentile (p95) times rather than simple averages. Research shows that timeout builds have a median duration of 19.7 minutes compared to 3.4 minutes for normal builds. That’s over five times longer.

While p50 reveals your typical developer experience, p95 exposes the worst-case delays that reduce productivity and impact developer flow. These outliers often signal deeper issues like resource constraints, flaky tests, or inefficient build steps that averages would mask. Tracking trends in both percentiles over time helps you catch regressions before they become widespread problems. Build analytics platforms can surface when your p50 increases gradually or when p95 spikes indicate new bottlenecks.
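To make the difference between percentiles and averages concrete, here is a minimal sketch using the nearest-rank method. The sample durations are invented for illustration, with one slow outlier among otherwise typical builds.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no build durations supplied")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical build durations in minutes: nine typical builds and one outlier.
durations_min = [3.1, 3.4, 3.6, 3.2, 3.5, 3.3, 3.8, 4.0, 3.7, 19.7]

print(percentile(durations_min, 50))  # 3.5
print(percentile(durations_min, 95))  # 19.7
```

The mean of this sample is about 5.1 minutes, a number that describes neither the typical build nor the worst one, which is why the two percentiles are tracked separately.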

Keep builds under seven minutes to maintain developer engagement. Anything over 15 minutes triggers costly context switching. By monitoring both typical and tail performance, you optimize for consistent, fast feedback loops that keep developers in flow. Intelligent test selection reduces overall build durations by up to 80% by selecting and running only tests affected by the code changes, rather than running all tests.

An example of a build durations dashboard (on Harness)

Queue Time: Measuring Infrastructure Constraints

Queue time measures how long builds wait before execution begins. This is a direct indicator of insufficient build capacity. When developers push code, builds shouldn't sit idle while runners or compute resources are tied up. Research shows that heterogeneous infrastructure with mixed processing speeds creates excessive queue times, especially when job routing doesn't account for worker capabilities. Queue time reveals when your infrastructure can't handle developer demand.

Rising queue times signal it's time to scale infrastructure or optimize resource allocation. Per-job waiting time thresholds directly impact throughput and quality outcomes. Platform teams can reduce queue time by moving to Harness Cloud's isolated build machines, implementing intelligent caching, or adding parallel execution capacity. Analytics dashboards track queue time trends across repositories and teams, enabling data-driven infrastructure decisions that keep developers productive.
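If your CI system exposes when a build was queued and when it started, queue time is just the difference between the two. The sketch below assumes ISO 8601 timestamps and hypothetical field names (`queued_at`, `started_at`), and the 120-second threshold is an arbitrary example.

```python
from datetime import datetime

def queue_seconds(queued_at, started_at):
    """Seconds a build waited between entering the queue and starting."""
    waited = datetime.fromisoformat(started_at) - datetime.fromisoformat(queued_at)
    return waited.total_seconds()

def builds_over_threshold(builds, threshold_s=120):
    """IDs of builds that waited longer than threshold_s seconds."""
    return [b["id"] for b in builds
            if queue_seconds(b["queued_at"], b["started_at"]) > threshold_s]

# Hypothetical build records.
builds = [
    {"id": "b1", "queued_at": "2026-02-11T09:00:00", "started_at": "2026-02-11T09:00:20"},
    {"id": "b2", "queued_at": "2026-02-11T09:01:00", "started_at": "2026-02-11T09:06:00"},
]
print(builds_over_threshold(builds))  # ['b2']
```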

Build Success Rate: Ensuring Pipeline Reliability

Build success rate measures the percentage of builds that complete successfully over time, revealing pipeline health and developer confidence levels. When teams consistently see success rates above 90% on their default branches, they trust their CI system to provide reliable feedback. Frequent failures signal deeper issues — flaky tests that pass and fail randomly, unstable build environments, or misconfigured pipeline steps that break under specific conditions.

Tracking success rate trends by branch, team, or service reveals where to focus improvement efforts. Slicing metrics by repository and pipeline helps you identify whether failures cluster around specific teams using legacy test frameworks or services with complex dependencies. This granular view separates legitimate experimental failures on feature branches from stability problems that undermine developer productivity and delivery confidence.
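Slicing success rate by a dimension can be sketched in a few lines. The record shape (`branch`, `status`) is an assumption, and the same function works for any key such as team or repository.

```python
from collections import defaultdict

def success_rate_by(builds, key):
    """Fraction of successful builds for each distinct value of `key`."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for build in builds:
        totals[build[key]] += 1
        if build["status"] == "success":
            passed[build[key]] += 1
    return {value: passed[value] / totals[value] for value in totals}

# Hypothetical data: a stable default branch and an experimental feature branch.
builds = (
    [{"branch": "main", "status": "success"}] * 9
    + [{"branch": "main", "status": "failure"}]
    + [{"branch": "feature/x", "status": "success"},
       {"branch": "feature/x", "status": "failure"}]
)
print(success_rate_by(builds, "branch"))  # {'main': 0.9, 'feature/x': 0.5}
```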

An example of a build success/failure rate dashboard (on Harness)

Mean Time to Recovery (MTTR): Speeding Up Incident Response

Mean time to recovery measures how fast your team recovers from failed builds and broken pipelines, directly impacting developer productivity. Research shows organizations with mature CI/CD implementations see MTTR improvements of over 50% through automated detection and rollback mechanisms. When builds fail, developers experience context switching costs, feature delivery slows, and team velocity drops. The best-performing teams recover from incidents in under one hour, while others struggle with multi-hour outages that cascade across multiple teams.

Automated alerts and root cause analysis tools slash recovery time by eliminating manual troubleshooting, reducing MTTR from 20 minutes to under 3 minutes for common failures. Harness CI's AI-powered troubleshooting surfaces failure patterns and provides instant remediation suggestions when builds break.
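One simple way to compute MTTR for a branch is to measure the time from the first failed build in a red streak to the next successful build, then average those intervals. The sketch below assumes a chronologically ordered list of (timestamp, status) pairs. This is one possible definition, not necessarily the one your CI platform reports.

```python
from datetime import datetime

def mean_time_to_recovery(events):
    """Mean minutes from the first failure in a streak to the next success.

    events: chronologically ordered (iso_timestamp, status) tuples for one branch.
    Returns None when no recovery has been observed.
    """
    broken_since = None
    recoveries = []
    for timestamp, status in events:
        moment = datetime.fromisoformat(timestamp)
        if status == "failure" and broken_since is None:
            broken_since = moment  # start of a red streak
        elif status == "success" and broken_since is not None:
            recoveries.append((moment - broken_since).total_seconds() / 60)
            broken_since = None
    return sum(recoveries) / len(recoveries) if recoveries else None

# Hypothetical history: a 30-minute outage and a 10-minute outage.
events = [
    ("2026-02-11T09:00:00", "success"),
    ("2026-02-11T09:10:00", "failure"),
    ("2026-02-11T09:20:00", "failure"),
    ("2026-02-11T09:40:00", "success"),
    ("2026-02-11T11:00:00", "failure"),
    ("2026-02-11T11:10:00", "success"),
]
print(mean_time_to_recovery(events))  # 20.0
```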

Flaky Test Rate: Eliminating Developer Frustration

Flaky tests pass or fail non-deterministically on the same code, creating false signals that undermine developer trust in CI results. Research shows 59% of developers experience flaky tests monthly, weekly, or daily, while 47% of restarted failing builds eventually passed. This creates a cycle where developers waste time investigating false failures, rerunning builds, and questioning legitimate test results.

Tracking flaky test rate helps teams identify which tests exhibit unstable pass/fail behavior, enabling targeted stabilization efforts. Harness CI automatically detects problematic tests through failure rate analysis, quarantines flaky tests to prevent false alarms, and provides visibility into which tests exhibit the highest failure rates. This reduces developer context switching and restores confidence in CI feedback loops.
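A basic heuristic for spotting flaky tests is to look for tests that both passed and failed on the same commit. The sketch below illustrates that idea only. It is not a description of how Harness CI detects flakiness, and the tuple layout is an assumption.

```python
from collections import defaultdict

def find_flaky_tests(runs):
    """Names of tests with both a pass and a fail recorded on the same commit.

    runs: iterable of (commit_sha, test_name, passed) tuples.
    """
    outcomes = defaultdict(set)
    for sha, test, passed in runs:
        outcomes[(sha, test)].add(passed)
    return sorted({test for (_, test), seen in outcomes.items() if len(seen) == 2})

# Hypothetical results: test_login flips on one commit, test_cart fails on a different commit.
runs = [
    ("abc123", "test_login", True),
    ("abc123", "test_login", False),
    ("abc123", "test_cart", True),
    ("def456", "test_cart", False),
]
print(find_flaky_tests(runs))  # ['test_login']
```

Note that `test_cart` is not flagged: it failed on a different commit, so the code change itself may explain the failure.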

Cost Per Build: Controlling CI Infrastructure Spend

Cost per build divides your monthly CI infrastructure spend by the number of successful builds, revealing the true economic impact of your development velocity. CI/CD pipelines consume 15-40% of overall cloud infrastructure budgets, with per-run compute costs ranging from $0.40 to $4.20 depending on application complexity, instance type, region, and duration. This normalized metric helps platform teams compare costs across different services, identify expensive outliers, and justify infrastructure investments with concrete dollar amounts rather than abstract performance gains.
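The metric itself is a single division. The sketch below uses invented figures (4,200 USD of monthly spend, 3,000 successful builds) purely to show the arithmetic.

```python
def cost_per_build(monthly_spend_usd, successful_builds):
    """Monthly CI spend divided by successful builds, as defined above."""
    if successful_builds <= 0:
        raise ValueError("need at least one successful build")
    return monthly_spend_usd / successful_builds

# Hypothetical month: 4,200 USD of CI spend and 3,000 successful builds.
print(cost_per_build(4200, 3000))  # 1.4
```

Dividing by successful builds rather than all builds means failed runs raise the figure, which is the point: wasted compute shows up in the number.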

Automated caching and ephemeral infrastructure deliver the biggest cost reductions per build. Intelligent caching automatically stores dependencies and Docker layers. This cuts repeated download and compilation time that drives up compute costs.

Ephemeral build machines eliminate idle resource waste. They spin up fresh instances only when builds are waiting in the queue, then terminate immediately after completion. Combine these approaches with right-sized compute types to reduce infrastructure costs by 32-43% compared to oversized instances.

Cache Hit Rate: Accelerating Builds With Smart Caching

Cache hit rate measures what percentage of build tasks can reuse previously cached results instead of rebuilding from scratch. When teams achieve high cache hit rates, they see dramatic build time reductions. Docker builds can drop from five to seven minutes to under 90 seconds with effective layer caching. Smart caching of dependencies like node_modules, Docker layers, and build artifacts creates these improvements by avoiding expensive regeneration of unchanged components.

Harness Build and Cache Intelligence eliminates the manual configuration overhead that traditionally plagues cache management. It handles dependency caching and Docker layer reuse automatically. No complex cache keys or storage management required.

Measure cache effectiveness by comparing clean builds against fully cached runs. Track hit rates over time to justify infrastructure investments and detect performance regressions.
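Both measurements reduce to simple ratios. In the sketch below the hit and miss counts and the build times are made-up inputs. The 360-second and 90-second figures loosely mirror the Docker example above.

```python
def cache_hit_rate(hits, misses):
    """Share of cacheable tasks served from cache."""
    total = hits + misses
    return hits / total if total else 0.0

def time_saved_fraction(clean_build_s, cached_build_s):
    """Fraction of build time removed when comparing a clean build to a fully cached one."""
    return (clean_build_s - cached_build_s) / clean_build_s

# Hypothetical numbers.
print(cache_hit_rate(850, 150))      # 0.85
print(time_saved_fraction(360, 90))  # 0.75
```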

Test Cycle Time: Optimizing Feedback Loops

Test cycle time measures how long it takes to run your complete test suite from start to finish. This directly impacts developer productivity because longer test cycles mean developers wait longer for feedback on their code changes. When test cycles stretch beyond 10-15 minutes, developers often switch context to other tasks, losing focus and momentum. Recent research shows that optimized test selection can accelerate pipelines by 5.6x while maintaining high failure detection rates.

Smart test selection optimizes these feedback loops by running only tests relevant to code changes. Harness CI Test Intelligence can slash test cycle time by up to 80% using AI to identify which tests actually need to run. This eliminates the waste of running thousands of irrelevant tests while preserving confidence in your CI deployments.

Pipeline Failure Cause Distribution: Prioritizing Remediation

Categorizing pipeline issues into domains like code problems, infrastructure incidents, and dependency conflicts transforms chaotic build logs into actionable insights. Harness CI's AI-powered troubleshooting provides root cause analysis and remediation suggestions for build failures. This helps platform engineers focus remediation efforts on root causes that impact the most builds rather than chasing one-off incidents.

Visualizing issue distribution reveals whether problems are systemic or isolated events. Organizations using aggregated monitoring can distinguish between infrastructure spikes and persistent issues like flaky tests. Harness CI's analytics surface which pipelines and repositories have the highest failure rates, and platform teams that act on this data can reduce overall pipeline issues by 20-30%.
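Once each failure carries a cause label, the distribution is a straightforward tally. The sketch below assumes failures have already been categorized (the `cause` field and its values are hypothetical) and ranks causes by frequency.

```python
from collections import Counter

def cause_distribution(failures):
    """List of (cause, count, share) tuples, most frequent cause first."""
    counts = Counter(failure["cause"] for failure in failures)
    total = sum(counts.values())
    return [(cause, n, n / total) for cause, n in counts.most_common()]

# Hypothetical categorized failures.
failures = (
    [{"cause": "flaky_test"}] * 3
    + [{"cause": "dependency"}] * 2
    + [{"cause": "infrastructure"}]
)
for cause, count, share in cause_distribution(failures):
    print(f"{cause}: {count} ({share:.0%})")
```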

Artifact Integrity Coverage: Securing the Software Supply Chain

Artifact integrity coverage measures the percentage of builds that produce signed, traceable artifacts with complete provenance documentation. This tracks whether each build generates Software Bills of Materials (SBOMs), digital signatures, and documentation proving where artifacts came from. While most organizations sign final software products, fewer than 20% deliver provenance data and only 3% consume SBOMs for dependency management. This makes the metric a leading indicator of supply chain security maturity.

Harness CI automatically generates SBOMs and attestations for every build, ensuring 100% coverage without developer intervention. The platform's SLSA L3 compliance capabilities generate verifiable provenance and sign artifacts using industry-standard frameworks. This eliminates the manual processes and key management challenges that prevent consistent artifact signing across CI pipelines.
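As a sketch of how the coverage figure could be computed from build records, the function below counts builds that have all three properties. The boolean field names (`sbom`, `signed`, `provenance`) are assumptions about how you might record this data.

```python
def integrity_coverage(builds):
    """Fraction of builds that produced an SBOM, a signature, and provenance."""
    if not builds:
        return 0.0
    complete = sum(
        1 for b in builds if b["sbom"] and b["signed"] and b["provenance"]
    )
    return complete / len(builds)

# Hypothetical records: one build is signed but has no provenance.
builds = [
    {"sbom": True, "signed": True, "provenance": True},
    {"sbom": True, "signed": True, "provenance": True},
    {"sbom": True, "signed": True, "provenance": False},
    {"sbom": True, "signed": True, "provenance": True},
]
print(integrity_coverage(builds))  # 0.75
```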

Steps to Track CI/CD Metrics and Turn Insights Into Action

Tracking CI metrics effectively requires moving from raw data to measurable improvements. The most successful platform engineering teams build a systematic approach that transforms metrics into velocity gains, cost reductions, and reliable pipelines.

Step 1: Standardize Pipeline Metadata Across Teams

Tag every pipeline with service name, team identifier, repository, and cost center. This standardization creates the foundation for reliable aggregation across your entire CI infrastructure. Without consistent tags, you can't identify which teams drive the highest costs or longest build times.

Implement naming conventions that support automated analysis. Use structured formats like team-service-environment for pipeline names and standardize branch naming patterns. Centralize this metadata using automated tag enforcement to ensure organization-wide visibility.
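Automated enforcement can start as a small validation function run against every pipeline definition. The sketch below assumes the team-service-environment format with lowercase alphanumeric segments, three allowed environments, and the four tags from Step 1. All of these specifics are illustrative choices, not a prescribed standard.

```python
import re

# Assumed conventions for this sketch.
REQUIRED_TAGS = {"service", "team", "repository", "cost_center"}
NAME_PATTERN = re.compile(r"^[a-z0-9]+-[a-z0-9]+-(dev|staging|prod)$")

def metadata_problems(name, tags):
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(f"name '{name}' does not match team-service-environment")
    missing = sorted(REQUIRED_TAGS - set(tags))
    if missing:
        problems.append("missing tags: " + ", ".join(missing))
    return problems

good_tags = {"service": "checkout", "team": "payments",
             "repository": "checkout-api", "cost_center": "cc-42"}
print(metadata_problems("payments-checkout-prod", good_tags))  # []
print(metadata_problems("Checkout_Pipeline", {"team": "payments"}))
```

A pattern this strict does not allow hyphens inside a team or service name. If your names contain hyphens, a different separator or explicit tags are a safer source of truth than the pipeline name.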

Step 2: Automate Metric Collection and Visualization

Modern CI platforms eliminate manual metric tracking overhead. Harness CI provides dashboards that automatically surface build success rates, duration trends, and failure patterns in real-time. Teams can also integrate with monitoring stacks like Prometheus and Grafana for live visualization across multiple tools.

Configure threshold-based alerts for build duration spikes or failure rate increases. This shifts you from fixing issues after they happen to preventing them entirely.

Step 3: Analyze Metrics and Identify Optimization Opportunities

Focus on p95 and p99 percentiles rather than averages to identify critical performance outliers. Drill into failure causes and flaky tests to prioritize fixes with maximum developer impact. Categorize pipeline failures by root cause — environment issues, dependency problems, or test instability — then target the most frequent culprits first.

Benchmark cost per build and cache hit rates to uncover infrastructure savings. Optimized caching and build intelligence can reduce build times by 30-40% while cutting cloud expenses.

Step 4: Operationalize Improvements With Governance and Automation

Standardize CI pipelines using centralized templates and policy enforcement to eliminate pipeline sprawl. Store reusable templates in a central repository and require teams to extend from approved templates. This reduces maintenance overhead while ensuring consistent security scanning and artifact signing.

Establish Service Level Objectives (SLOs) for your most impactful metrics: build duration, queue time, and success rate. Set measurable targets like "95% of builds complete within 10 minutes" to drive accountability. Automate remediation wherever possible — auto-retry for transient failures, automated cache invalidation, and intelligent test selection to skip irrelevant tests.
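An SLO such as "95% of builds complete within 10 minutes" can be checked directly against recent build durations. This is a minimal sketch with invented sample data. The limit and target are parameters so each team can set its own.

```python
def slo_met(durations_min, limit_min=10.0, target=0.95):
    """True when at least `target` of builds finished within `limit_min` minutes."""
    if not durations_min:
        raise ValueError("no builds in the evaluation window")
    within = sum(1 for d in durations_min if d <= limit_min)
    return within / len(durations_min) >= target

# Hypothetical windows of 20 builds each.
print(slo_met([6.0] * 19 + [25.0]))        # True  (19 of 20 within the limit)
print(slo_met([6.0] * 18 + [25.0, 30.0]))  # False (18 of 20 within the limit)
```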

Make Your CI Metrics Work

The difference between successful platform teams and those drowning in dashboards comes down to focus. Elite performers track build duration, queue time, flaky test rates, and cost per build because these metrics directly impact developer productivity and infrastructure spend.

Start with the measurements covered in this guide, establish baselines, and implement governance that prevents pipeline sprawl. Focus on the metrics that reveal bottlenecks, control costs, and maintain reliability — then use that data to optimize continuously.

Ready to transform your CI metrics from vanity to velocity? Experience how Harness CI accelerates builds while cutting infrastructure costs.

Continuous Integration Metrics FAQ

Platform engineering leaders often struggle with knowing which metrics actually move the needle versus creating metric overload. These answers focus on metrics that drive measurable improvements in developer velocity, cost control, and pipeline reliability.

What separates actionable CI metrics from vanity metrics?

Actionable metrics directly connect to developer experience and business outcomes. Build duration affects daily workflow, while deployment frequency impacts feature delivery speed. Vanity metrics look impressive, but don't guide decisions. Focus on measurements that help teams optimize specific bottlenecks rather than general health scores.

Which CI metrics have the biggest impact on developer productivity?

Build duration, queue time, and flaky test rate directly affect how fast developers get feedback. While coverage monitoring dominates current practices, build health and time-to-fix-broken-builds offer the highest productivity gains. Focus on metrics that reduce context switching and waiting.

How do CI metrics help reduce infrastructure costs without sacrificing quality?

Cost per build and cache hit rate reveal optimization opportunities that maintain quality while cutting spend. Intelligent caching and optimized test selection can significantly reduce both build times and infrastructure costs. Running only relevant tests instead of entire suites cuts waste without compromising coverage.

What's the most effective way to start tracking CI metrics across different tools?

Begin with pipeline metadata standardization using consistent tags for service, team, and cost center. Most CI platforms provide basic metrics through built-in dashboards. Start with DORA metrics, then add build-specific measurements as your monitoring matures.

How often should teams review CI metrics and take action?

Daily monitoring of build success rates and queue times enables immediate issue response. Weekly reviews of build duration trends and monthly cost analysis drive strategic improvements. Automated alerts for threshold breaches prevent small problems from becoming productivity killers.

Unit Testing in CI/CD: How to Accelerate Builds Without Sacrificing Quality

Speed up your CI/CD builds with smarter unit testing strategies. Learn how AI-powered test optimization can give you faster feedback, lower costs, and better code quality.

Chinmay Gaikwad

February 11, 2026

Modern unit testing in CI/CD can help teams avoid slow builds by using smart strategies. Choosing the right tests, running them in parallel, and using intelligent caching all help teams get faster feedback while keeping code quality high.

Platforms like Harness CI use AI-powered test intelligence to reduce test cycles by up to 80%, showing what’s possible with the right tools. This guide shares practical ways to speed up builds and improve code quality, from basic ideas to advanced techniques that also lower costs.

What Is a Unit Test?

Knowing what counts as a unit test is key to building software delivery pipelines that work.

The Smallest Testable Component

A unit test looks at a single part of your code, such as a function, class method, or a small group of related components. The main point is to test one behavior at a time. Unit tests differ from integration tests because they check the logic of one piece of code in isolation rather than how components work together. This makes it easier to figure out what went wrong when a test fails.

Isolation Drives Speed and Reliability

Unit tests should only check code that you wrote and not things like databases, file systems, or network calls. This separation makes tests quick and dependable. Tests that don't rely on outside services run in milliseconds and give the same results no matter where they are run, like on your laptop or in a CI pipeline.

Foundation for CI/CD Quality Gates

Unit tests are one of the most important parts of continuous integration in CI/CD pipelines because they show problems right away after code changes. Because they are so fast, developers can run them many times a minute while they are coding. This makes feedback loops very tight, which makes it easier to find bugs and stops them from getting to later stages of the pipeline.

Unit Testing Strategies: Designing for Speed and Reliability

Teams that run full test suites on every commit catch problems early by focusing on three things: making tests fast, choosing the right tests, and keeping tests organized. Good unit testing helps developers stay productive and keeps builds running quickly.

Deterministic Tests for Every Commit

Unit tests should finish in seconds, not minutes, so that they can be quickly checked. Google's engineering practices say that tests need to be "fast and reliable to give engineers immediate feedback on whether a change has broken expected behavior." To keep tests from being affected by outside factors, use mocks, stubs, and in-memory databases. Keep commit builds to less than ten minutes, and unit tests should be the basis of this quick feedback loop.
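To illustrate how a mock keeps a test away from outside systems, here is a small sketch using the standard library's `unittest.mock`. The function `order_status_message` and the client's `get_status` method are hypothetical names invented for this example.

```python
from unittest.mock import Mock

def order_status_message(order_id, client):
    """Code under test: formats a status fetched through an injected client."""
    status = client.get_status(order_id)
    return f"Order {order_id} is {status}"

def test_order_status_message_uses_client():
    # Arrange: replace the real network client with a mock
    client = Mock()
    client.get_status.return_value = "shipped"

    # Act
    message = order_status_message(42, client)

    # Assert: correct output, and exactly one call to the dependency
    assert message == "Order 42 is shipped"
    client.get_status.assert_called_once_with(42)

test_order_status_message_uses_client()
```

Because the client is passed in as an argument, the test never opens a network connection and gives the same result on a laptop or in CI.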

Intelligent Test Selection

As projects get bigger, running all tests on every commit can slow teams down. Test Impact Analysis looks at coverage data to figure out which tests really check the code that has been changed. AI-powered test selection chooses the right tests for you, so you don't have to guess or sort them by hand.
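The core idea of Test Impact Analysis can be sketched as a set intersection: run a test only if it covers at least one changed file. The coverage map below is hand-written for illustration. Real tools build it from coverage data and handle cases this sketch ignores, such as new files or changed test code.

```python
def select_tests(coverage_map, changed_files):
    """Tests whose covered files overlap with the changed files.

    coverage_map: {test_id: [source files the test exercises]}
    """
    changed = set(changed_files)
    return sorted(
        test for test, files in coverage_map.items() if changed & set(files)
    )

# Hypothetical coverage data.
coverage_map = {
    "test_cart.py::test_total": ["cart.py", "pricing.py"],
    "test_auth.py::test_login": ["auth.py"],
    "test_pricing.py::test_discount": ["pricing.py"],
}
print(select_tests(coverage_map, ["pricing.py"]))
# ['test_cart.py::test_total', 'test_pricing.py::test_discount']
```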

Parallelization and Caching

To get the most out of your infrastructure, use selective execution and run tests at the same time. Divide test suites into equal-sized groups and run them on different machines simultaneously. Smart caching of dependencies, build files, and test results helps you avoid doing the same work over and over. When used together, these methods cut down on build time a lot while keeping coverage high.
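"Equal-sized" works best when it means equal run time rather than equal test count. One common approach, sketched here under the assumption that you have historical durations per test, is a greedy split: assign the longest remaining test to whichever shard currently has the least total time.

```python
import heapq

def split_into_shards(durations, shard_count):
    """Greedy longest-first split of tests into shards with similar total time.

    durations: {test_name: seconds}. Returns a list of shard_count test lists.
    """
    # Heap entries are (total_seconds, shard_index, tests); the unique index breaks ties.
    shards = [(0.0, index, []) for index in range(shard_count)]
    heapq.heapify(shards)
    for test, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        total, index, tests = heapq.heappop(shards)
        tests.append(test)
        heapq.heappush(shards, (total + seconds, index, tests))
    return [tests for _, _, tests in sorted(shards, key=lambda s: s[1])]

# Hypothetical timings in seconds.
durations = {"a": 30, "b": 20, "c": 20, "d": 10, "e": 10, "f": 10}
for shard in split_into_shards(durations, 2):
    print(shard, sum(durations[t] for t in shard))
```

The greedy split is not guaranteed to be optimal, but it is usually close and is cheap to recompute on every run.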

Standardized Organization for Scale

Using consistent names, tags, and organization for tests helps teams track performance and keep quality high as they grow. Set clear rules for test types (like unit, integration, or smoke) and use names that show what each test checks. Analytics dashboards can spot flaky tests, slow tests, and common failures. This helps teams improve test suites and keep things running smoothly without slowing down developers.

Unit Test Example: From Code to Assertion

A good unit test uses the Arrange-Act-Assert pattern. For example, you might test a function that calculates order totals with discounts:

def test_apply_discount_to_order_total():
    # Arrange: Set up test data
    order = Order(items=[Item(price=100), Item(price=50)])
    discount = PercentageDiscount(10)

    # Act: Execute the function under test
    final_total = order.apply_discount(discount)

    # Assert: Verify expected outcome
    assert final_total == 135  # 150 - 10% discount

In the Arrange phase, you set up the objects and data you need. In the Act phase, you call the method you want to test. In the Assert phase, you check if the result is what you expected.

Testing Edge Cases

Real-world code needs to handle more than just the usual cases. Your tests should also check edge cases and errors:

import pytest

def test_apply_discount_with_empty_cart_returns_zero():
    order = Order(items=[])
    discount = PercentageDiscount(10)

    assert order.apply_discount(discount) == 0

def test_apply_discount_rejects_negative_percentage():
    # Constructing an invalid discount should fail before any order is involved
    with pytest.raises(ValueError):
        PercentageDiscount(-5)

Notice the naming style: test_apply_discount_rejects_negative_percentage clearly shows what’s being tested and what should happen. If this test fails in your CI pipeline, you’ll know right away what went wrong, without searching through logs.
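The tests above use Order, Item, and PercentageDiscount without showing them. For readers who want to run the examples, here is one hypothetical implementation that would satisfy all three tests. It is a sketch written for this purpose, not code from a real library.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    price: float

class PercentageDiscount:
    def __init__(self, percent):
        # Reject invalid percentages at construction time, as the edge-case test expects.
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.percent = percent

@dataclass
class Order:
    items: list = field(default_factory=list)

    def apply_discount(self, discount):
        """Return the order total after applying the percentage discount."""
        subtotal = sum(item.price for item in self.items)
        return subtotal * (100 - discount.percent) / 100

order = Order(items=[Item(price=100), Item(price=50)])
print(order.apply_discount(PercentageDiscount(10)))  # 135.0
```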

Benefits of Unit Testing: Building Confidence and Saving Time

When teams want faster builds and fewer late-stage bugs, the benefits of unit testing are clear. Good unit tests help speed up development and keep quality high.

Catch regressions right away: Unit tests run in seconds and find breaking changes before they get to integration or production environments.
Allow fearless refactoring: A strong set of tests gives you the confidence to change code without adding bugs you didn't expect.
Cut down on costly debugging: Research shows that unit tests cover a lot of ground and find bugs early when fixing them is cheapest.
Encourage modular design: Writing code that can be tested naturally leads to better separation of concerns and a cleaner architecture.

When you use smart test execution in modern CI/CD pipelines, these benefits get even bigger.

Disadvantages of Unit Testing: Recognizing the Trade-Offs

Unit testing is valuable, but knowing its limits helps teams choose the right testing strategies. These downsides matter most when you’re trying to make CI/CD pipelines faster and more cost-effective.

Maintenance overhead grows as automated tests expand, requiring ongoing effort to update brittle or overly granular tests.
False confidence occurs when high unit test coverage hides integration problems and system-level failures.
Slow execution times can bottleneck CI pipelines when test collections take hours instead of minutes to complete.
Resource allocation shifts developer time from feature work to test maintenance and debugging flaky tests.
Coverage gaps appear in areas like GUI components, external dependencies, and complex state interactions.

Research shows that automatically generated tests can be harder to understand and maintain. Studies also show that statement coverage doesn’t always mean better bug detection.

Industry surveys show that many organizations have trouble with slow test execution and unclear ROI for unit testing. Smart teams solve these problems by choosing the right tests, using smart caching, and working with modern CI platforms that make testing faster and more reliable.

How Do Developers Use Unit Tests in Real Workflows?

Developers use unit tests in three main ways that affect build speed and code quality. These practices turn testing into a tool that catches problems early and saves time on debugging.

Test-Driven Development and Rapid Feedback Loops

In test-driven development (TDD), developers write unit tests before they write the code itself, which improves the design and cuts down on debugging. According to research, TDD finds 84% of new bugs, while traditional testing only finds 62%. This method gives you feedback right away, so failing tests help you decide what to do next.

Regression Prevention and Bug Validation

Unit tests are like automated guards that catch bugs when code changes. Developers write tests to recreate bugs that have been reported, and then they check that the fixes work by running the tests again after the fixes have been made. Automated tools now generate test cases from issue reports. They are 30.4% successful at making tests that fail for the exact problem that was reported. To stop bugs that have already been fixed from coming back, teams run these regression tests in CI pipelines.

Strategic Focus on Business Logic and Public APIs

Good developer testing doesn't look at infrastructure or glue code; it looks at business logic, edge cases, and public interfaces. Testing public methods and properties is best; private details that change often should be left out. Test doubles help developers keep business logic separate from systems outside of their control, which makes tests more reliable. Integration and system tests are better for checking how parts work together, especially when it comes to things like database connections and full workflows.

Unit Testing Best Practices: Maximizing Value, Minimizing Pain

Slow, unreliable tests can slow down CI and hurt productivity, while also raising costs. The following proven strategies help teams check code quickly and cut both build times and cloud expenses.

Write fast, isolated tests that run in milliseconds and avoid external dependencies like databases or APIs.
Use descriptive test names that clearly explain the behavior being tested, not implementation details.
Run only relevant tests using selective execution to cut cycle times by up to 80%.
Monitor test health with failure analytics to identify flaky or slow tests before they impact productivity.
Refactor tests regularly alongside production code to prevent technical debt and maintain suite reliability.

Types of Unit Testing: Manual vs. Automated

Choosing between manual and automated unit testing directly affects how fast and reliable your pipeline is.

Manual Unit Testing: Flexibility with Limitations

Manual unit testing means developers write and run tests by hand, usually early in development or when checking tricky edge cases that need human judgment. This works for old systems where automation is hard or when you need to understand complex behavior. But manual testing can’t be repeated easily and doesn’t scale well as projects grow.

Automated Unit Testing: Speed and Consistency at Scale

Automated testing transforms test execution into fast, repeatable processes that integrate seamlessly with modern development workflows. Modern platforms leverage AI-powered optimization to run only relevant tests, cutting cycle times significantly while maintaining comprehensive coverage.

	Manual Unit Testing	Automated Unit Testing
Execution	Developer runs tests by hand	Tests run programmatically on every commit
Speed	Minutes to hours per test cycle	Thousands of tests in minutes
Repeatability	Varies with each run	Identical every time
CI/CD integration	Impractical	Seamless
Best for	Exploratory testing, complex edge cases, legacy systems	Regression testing, frequent validation, pipeline gates
Scales with codebase	Poorly. Time cost grows linearly	Well. Automation handles growth

Why High-Velocity Teams Prioritize Automation

Fast-moving teams use automated unit testing to keep up speed and quality. Manual testing is still useful for exploring and handling complex cases, but automation handles the repetitive checks that make deployments reliable and regular.

Difference Between Unit Testing and Other Types of Testing

Knowing the difference between unit, integration, and other test types helps teams build faster and more reliable CI/CD pipelines. Each type has its own purpose and trade-offs in speed, cost, and confidence.

Unit Tests: Fast and Isolated Validation

Unit tests form the foundation of your testing strategy. They test single functions, methods, or classes without touching any outside systems. You can run thousands of unit tests in just a few minutes on a good machine. Because they never depend on a database or network, they avoid the failures those systems introduce and give you the quickest feedback in your pipeline.

Integration Tests: Validating Component Interactions

Integration testing makes sure that the different parts of your system work together. There are two main types: narrow integration tests, which use test doubles to check specific interactions (like testing an API client with a mock service), and broad integration tests, which use real services (like checking your payment flow with real payment processors). Broad tests exercise real infrastructure to find problems that unit tests might miss.

End-to-End Tests: Complete User Journey Validation

End-to-end tests sit at the top of the testing pyramid. They mimic complete user journeys through your app. These tests give the most confidence that the whole system works, but they take a long time to run and their failures are hard to diagnose. A bug that a unit test catches in seconds may take days to surface through end-to-end tests. They are valuable, but they can be brittle.

The Test Pyramid: Balancing Speed and Coverage

The best testing strategy uses a pyramid: many small, fast unit tests at the bottom, some integration tests in the middle, and just a few end-to-end tests at the top.

	Unit Tests	Integration Tests	End-to-End Tests
What it tests	Individual functions, methods, or classes in isolation	How components work together at interaction points	Complete user workflows through the entire stack
Speed	Milliseconds; thousands run in minutes	Seconds to minutes per test	Minutes per test; full suites can take hours
Infrastructure	None. Uses mocks and stubs	May use test doubles or live services	Full production-like environment
Failure debugging	Pinpoints the exact function or method	Narrows down to component interaction	Could be anything in the stack
Best for catching	Logic errors, edge cases, regressions	Interface mismatches, contract violations	User journey breaks, environment issues
Recommended proportion	~70% of test suite	~20% of test suite	~10% of test suite

Workflow of Unit Testing in CI/CD Pipelines

Modern development teams use a unit testing workflow that balances speed and quality. Knowing this process helps teams spot slow spots and find ways to speed up builds while keeping code reliable.

The Standard Development Cycle

Developers write code on their own machines and run unit tests locally before pushing changes. Running tests locally catches bugs early, and pushing the code to version control hands it over to the CI pipeline. This step-by-step process keeps developers productive by finding problems when they are easiest to fix.

Automated CI Pipeline Execution

Once code is in the pipeline, automation tools run unit tests on every commit and give immediate feedback. If a test fails, the pipeline stops the deployment and notifies developers. This automation keeps bad code out of production. Research shows this method can cut critical defects by 40% and speed up deployments.

Accelerating the Workflow

Modern CI platforms use Test Intelligence to only run the tests that are affected by code changes in order to speed up this process. Parallel testing runs test groups in different environments at the same time. Smart caching saves dependencies and build files so you don't have to do the same work over and over. These steps can help keep coverage high while lowering the cost of infrastructure.
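
To make change-based test selection and parallel splitting more concrete, here is a deliberately simplified sketch. It does not describe how any particular CI platform implements these features. The dependency map, file names, and the selectTests and shard helpers are all hypothetical, and a real system would derive the map from coverage data or a call graph.

```typescript
// Hypothetical map from each test file to the source files it exercises.
const testDependencies: Record<string, string[]> = {
  "cart.test.ts": ["src/cart.ts", "src/pricing.ts"],
  "pricing.test.ts": ["src/pricing.ts"],
  "auth.test.ts": ["src/auth.ts"],
};

// Keep only the tests that touch at least one changed source file.
function selectTests(
  changedFiles: string[],
  deps: Record<string, string[]>,
): string[] {
  const changed = new Set(changedFiles);
  return Object.entries(deps)
    .filter(([, sources]) => sources.some(file => changed.has(file)))
    .map(([test]) => test)
    .sort();
}

// Distribute items round-robin across a fixed number of parallel shards.
function shard<T>(items: T[], shardCount: number): T[][] {
  const shards: T[][] = Array.from({ length: shardCount }, (): T[] => []);
  items.forEach((item, index) => shards[index % shardCount].push(item));
  return shards;
}

// A change to pricing.ts selects the two tests that depend on it.
const selected = selectTests(["src/pricing.ts"], testDependencies);
const shards = shard(selected, 2);
```

In this sketch auth.test.ts is skipped because nothing it depends on changed, and the two remaining tests land in separate shards that could run at the same time.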

Results Analysis and Continuous Improvement

Teams analyze test results through dashboards that track failure rates, execution times, and coverage trends. Analytics platforms surface patterns like flaky tests or slow-running suites that need attention. This data drives decisions about test prioritization, infrastructure scaling, and process improvements. Regular analysis ensures the unit testing approach continues to deliver value as codebases grow and evolve.
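
As a rough illustration of the kind of analysis such dashboards perform, the sketch below computes a failure rate per test and flags a test as flaky when the same commit produced both a pass and a fail. That definition of flaky, the TestRun shape, and the sample data are assumptions made for this example only.

```typescript
// Hypothetical record of one test execution in CI.
interface TestRun {
  test: string;
  commit: string;
  passed: boolean;
}

interface TestSummary {
  test: string;
  failureRate: number;
  flaky: boolean;
}

function summarize(runs: TestRun[]): TestSummary[] {
  // Group runs by test name, preserving first-seen order.
  const byTest = new Map<string, TestRun[]>();
  for (const run of runs) {
    const list = byTest.get(run.test) ?? [];
    list.push(run);
    byTest.set(run.test, list);
  }

  const summaries: TestSummary[] = [];
  byTest.forEach((testRuns, test) => {
    const failures = testRuns.filter(run => !run.passed).length;

    // Collect the distinct outcomes seen for each commit.
    const outcomesByCommit = new Map<string, Set<boolean>>();
    for (const run of testRuns) {
      const outcomes = outcomesByCommit.get(run.commit) ?? new Set<boolean>();
      outcomes.add(run.passed);
      outcomesByCommit.set(run.commit, outcomes);
    }

    // Assumed definition: same code, different results means flaky.
    const flaky = Array.from(outcomesByCommit.values()).some(o => o.size > 1);
    summaries.push({ test, failureRate: failures / testRuns.length, flaky });
  });
  return summaries;
}

const summaries = summarize([
  { test: "login", commit: "a1", passed: true },
  { test: "login", commit: "a1", passed: false },
  { test: "checkout", commit: "a1", passed: false },
  { test: "checkout", commit: "a2", passed: false },
  { test: "search", commit: "a1", passed: true },
]);
```

Here login is flagged as flaky, while checkout fails consistently and looks like a genuine regression instead.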

Unit Testing Techniques: Tools for Reliable, Maintainable Tests

Using the right unit testing techniques can turn unreliable tests into a reliable way to speed up development. These proven methods help teams trust their code and keep CI pipelines running smoothly:

Replace slow external dependencies with controllable test doubles that run consistently.
Generate hundreds of test cases automatically to find edge cases you'd never write manually.
Run identical test logic against multiple inputs to expand coverage without extra maintenance.
Capture complex output snapshots to catch unintended changes in data structures.
Verify behavior through isolated components that focus tests on your actual business logic.

These methods work together to build test suites that catch real bugs and stay easy to maintain as your codebase grows.

Isolation Through Test Doubles

As discussed in the CI/CD workflow above, good unit testing starts with isolation. This means testing your code without relying on outside systems that might be slow or unavailable. Dependency injection helps because it lets you swap real dependencies for test doubles when you run tests.

It is easier for developers to choose the right test double if they know the differences between them. Fakes are simple working versions, such as in-memory databases. Stubs return set data that can be used to test queries. Mocks keep track of what happens so you can see if commands work as they should.

This approach keeps tests fast and deterministic, no matter when or where you run them. Tests run 60% faster and there are far fewer flaky failures slowing down development when teams use good isolation.
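
The three kinds of test double are easier to tell apart side by side. The sketch below uses a hypothetical OrderService with three injected dependencies. Every name in it is invented for illustration, and the doubles are hand-written; in practice a mocking library would often generate them.

```typescript
// Hypothetical dependencies of the code under test.
interface RateProvider { getRate(currency: string): number; }
interface OrderRepo {
  save(id: string, total: number): void;
  find(id: string): number | undefined;
}
interface Notifier { send(message: string): void; }

// Stub: returns canned data so the test controls what a query returns.
const stubRates: RateProvider = { getRate: () => 2 };

// Fake: a simple working implementation backed by an in-memory Map.
class InMemoryOrderRepo implements OrderRepo {
  private rows = new Map<string, number>();
  save(id: string, total: number): void { this.rows.set(id, total); }
  find(id: string): number | undefined { return this.rows.get(id); }
}

// Mock: records calls so the test can verify that a command was issued.
class MockNotifier implements Notifier {
  calls: string[] = [];
  send(message: string): void { this.calls.push(message); }
}

// Code under test: dependencies arrive through the constructor.
class OrderService {
  private rates: RateProvider;
  private repo: OrderRepo;
  private notifier: Notifier;

  constructor(rates: RateProvider, repo: OrderRepo, notifier: Notifier) {
    this.rates = rates;
    this.repo = repo;
    this.notifier = notifier;
  }

  placeOrder(id: string, amount: number, currency: string): number {
    const total = amount * this.rates.getRate(currency);
    this.repo.save(id, total);
    this.notifier.send(`order ${id} placed`);
    return total;
  }
}

// The test wires in the doubles instead of real services.
const repo = new InMemoryOrderRepo();
const notifier = new MockNotifier();
const service = new OrderService(stubRates, repo, notifier);
const total = service.placeOrder("A1", 10, "EUR");
```

A test would then assert on the returned total, on what the fake repository stored, and on the messages the mock recorded, all without a network call or a database.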

Expanding Coverage with Smart Generation

Beyond isolation, teams need ways to broaden test coverage without adding manual effort. With property-based testing, you state rules that should always hold, and the framework generates hundreds of test cases automatically. This method is great for finding edge cases and boundary conditions that hand-written tests might miss.
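
To show what property-based testing does under the hood, here is a hand-rolled sketch with no library. The seeded generator, the randomArray and checkProperty helpers, and the sortAscending function under test are all assumptions for illustration. A real project would more likely use a dedicated library such as fast-check, which also shrinks failing inputs.

```typescript
// Small seeded linear congruential generator so every run is reproducible.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

// Hypothetical function under test.
function sortAscending(values: number[]): number[] {
  return [...values].sort((a, b) => a - b);
}

// Generate an array of 0 to 19 integers between -100 and 100.
function randomArray(next: () => number): number[] {
  const length = Math.floor(next() * 20);
  return Array.from({ length }, () => Math.floor(next() * 201) - 100);
}

// Run the property against many generated inputs.
// Returns the first counterexample, or null if every run passed.
function checkProperty(
  runs: number,
  seed: number,
  property: (input: number[]) => boolean,
): number[] | null {
  const next = lcg(seed);
  for (let i = 0; i < runs; i++) {
    const input = randomArray(next);
    if (!property(input)) return input;
  }
  return null;
}

// Property: the output has the same length and is in non-decreasing order.
const sortedAndSameLength = (input: number[]): boolean => {
  const output = sortAscending(input);
  return (
    output.length === input.length &&
    output.every((value, i) => i === 0 || output[i - 1] <= value)
  );
};

const counterexample = checkProperty(200, 42, sortedAndSameLength);
```

A null result means all 200 generated arrays satisfied the rule. When a property fails, the returned input is the concrete case to debug.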

Parameterized testing gives you similar benefits, but you have more control over the inputs. You run the same test logic with different data without duplicating test code. Tools like xUnit's Theory and InlineData make this possible. This helps find more bugs and keeps your test suite easier to maintain.
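
The same idea can be sketched in TypeScript as a table of cases driving one piece of test logic. The isLeapYear function and the case table are hypothetical. Test frameworks offer built-in equivalents, such as test.each in Jest, so the loop below is only meant to show the shape of the technique.

```typescript
// Hypothetical function under test.
function isLeapYear(year: number): boolean {
  return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
}

// Each row is one test case: the input and the expected result.
const cases: Array<[year: number, expected: boolean]> = [
  [2024, true],  // divisible by 4
  [2023, false], // not divisible by 4
  [1900, false], // century not divisible by 400
  [2000, true],  // century divisible by 400
];

// One piece of test logic runs against every row.
const failures: string[] = [];
for (const [year, expected] of cases) {
  const actual = isLeapYear(year);
  if (actual !== expected) {
    failures.push(`isLeapYear(${year}) returned ${actual}, expected ${expected}`);
  }
}
```

Adding a new edge case is a one-line change to the table, which is what keeps this style cheap to maintain.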

Both methods work best when you choose the right tests to run. Platforms that know which tests matter for each code change run only those tests, so you keep full coverage without slowing things down.

Verifying Complex Outputs

The last step is to test complicated outputs, such as JSON responses or generated code. Golden tests and snapshot testing make this easier by saving the expected output as reference files, so you don't have to write complicated assertions by hand.

If your code’s output changes, the test fails and shows what’s different. This makes it easy to spot mistakes, and you can approve real changes by updating the snapshot. This method works well for testing APIs, config generators, or any code that creates structured output.
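
The record, compare, and approve cycle can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: goldenStore is an in-memory stand-in for golden files on disk, and buildConfig and matchSnapshot are hypothetical names.

```typescript
// Hypothetical generator whose structured output we want to pin down.
function buildConfig(service: string, replicas: number): object {
  return { service, replicas, resources: { cpu: "500m", memory: "256Mi" } };
}

// In-memory stand-in for golden files stored next to the tests.
const goldenStore = new Map<string, string>();

type SnapshotResult =
  | { status: "written" | "match" }
  | { status: "mismatch"; expected: string; actual: string };

function matchSnapshot(name: string, value: unknown, update = false): SnapshotResult {
  const actual = JSON.stringify(value, null, 2);
  const expected = goldenStore.get(name);

  // First run, or an explicit approval: record the output as the new golden.
  if (expected === undefined || update) {
    goldenStore.set(name, actual);
    return { status: "written" };
  }

  // Later runs: compare against the stored golden and report any difference.
  return expected === actual
    ? { status: "match" }
    : { status: "mismatch", expected, actual };
}

const first = matchSnapshot("checkout-config", buildConfig("checkout", 2));
const second = matchSnapshot("checkout-config", buildConfig("checkout", 2));
const third = matchSnapshot("checkout-config", buildConfig("checkout", 3));
const approved = matchSnapshot("checkout-config", buildConfig("checkout", 3), true);
```

The third call reports a mismatch because the replica count changed, and the fourth call models a reviewer approving that change by updating the golden.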

Teams that use full automated testing frameworks see code coverage go up by 32.8% and catch 74.2% more bugs per build. Golden tests help by making it easier to check complex cases that would otherwise need manual testing.

The main thing is to balance thoroughness with easy maintenance. Golden tests should check real behavior, not details that change often. When you get this balance right, you’ll spend less time fixing bugs and more time building features.

Unit Testing Tools: Frameworks That Power Modern Teams

Picking the right unit testing tools helps your team write tests efficiently, instead of wasting time on flaky tests or slow builds. The best frameworks work well with your language and fit smoothly into your CI/CD process.

JUnit and TestNG dominate Java environments, with TestNG offering advanced features like parallel execution and seamless pipeline integration.
pytest leads Python testing environments with powerful fixtures and minimal boilerplate, making it ideal for teams prioritizing developer experience.
Jest provides zero-configuration testing for JavaScript/TypeScript projects, with built-in mocking and snapshot capabilities.
RSpec delivers behavior-driven development for Ruby teams, emphasizing readable test specifications.

Modern teams use these frameworks along with CI platforms that offer analytics and automation. This mix of good tools and smart processes turns testing from a bottleneck into a productivity boost.

Transform Your Development Velocity Today

Smart unit testing can turn CI/CD from a bottleneck into an advantage. When tests are fast and reliable, developers spend less time waiting and more time releasing code. Harness Continuous Integration uses Test Intelligence, automated caching, and isolated build environments to speed up feedback without losing quality.

Want to speed up your team? Explore Harness CI and see what's possible.

Powering Harness Executions Page: Inside Our Flexible Filters Component

How we rebuilt a messy filters system in React using Context and inversion of control to create a scalable, reusable, and URL-synced architecture.

Sayantan Mondal

February 10, 2026

Filtering data is at the heart of developer productivity. Whether you’re looking for failed builds, debugging a service or analysing deployment patterns, the ability to quickly slice and dice execution data is critical.

At Harness, users across CI, CD and other modules rely on filtering to navigate complex execution data by status, time range, triggers, services and much more. While our legacy filtering worked, it had major pain points — hidden drawers, inconsistent behaviour and lost state on refresh — that slowed both developers and users.

This blog dives into how we built a new Filters component system in React: a reusable, type-safe and feature-rich framework that powers the filtering experience on the Execution Listing page (and beyond).

The Starting Point: Challenges with Our Legacy Filters

Our old implementation revealed several weaknesses as Harness scaled:

Poor Discoverability and UX: Filters were hidden in a side panel, disrupting workflow and making applied filters non-glanceable. Users didn’t get feedback until the filter was applied/saved.
Inconsistency Across Modules: Custom logic in modules like CI and CD led to confusing behavioural differences.
High Developer Overhead: Adding new filters was cumbersome, requiring edits to multiple files with brittle boilerplate.

These problems shaped our success criteria: discoverability, smooth UX, consistent behaviour, reusable design and decoupled components.

The Evolution of Filters: A Design Journey

Building a truly reusable and powerful filtering system required exploration and iteration. Our journey involved several key stages and learning from the pitfalls of each:

Iteration 1: React Components (Conditional Rendering)

This iteration shifted to React functional components but kept the logic centralised in the FilterFramework. Each filter was conditionally rendered based on the visibleFilters array. The framework fetched filter options and passed them down as props.

COMPONENT FilterFramework:
    STATE activeFilters = {}
    STATE visibleFilters = []
    STATE filterOptions = {}
    
    ON visibleFilters CHANGE:
        FOR EACH filter IN visibleFilters:
            IF filterOptions[filter] NOT EXISTS:
                options = FETCH filterData(filter)
                filterOptions[filter] = options
    
    ON activeFilters CHANGE:
        makeAPICall(activeFilters)
    
    RENDER:
        <AllFilters setVisibleFilters={setVisibleFilters} />
        
        IF 'services' IN visibleFilters:
            <DropdownFilter 
                name="Services"
                options={filterOptions.services}
                onAdd={updateActiveFilters}
                onRemove={removeFromVisible}
            />
        
        IF 'environments' IN visibleFilters:
            <DropdownFilter ... />

Pitfalls: Adding new filters required changes in multiple places, creating a maintenance nightmare and poor developer experience. The framework had minimal control over filter implementation, lacked proper abstraction and scattered filter logic across the codebase, making it neither “stupid-proof” nor scalable.

Iteration 2: React.cloneElement Pattern

This iteration improved on the previous approach by accepting filters as children and using React.cloneElement to inject callbacks (onAdd, onRemove) from the parent framework. This gave developers a cleaner API to add filters.

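// Clone each visible child and inject the framework's callbacks.
// setActiveFilters is assumed to be the framework's useState setter.
const renderedFilters = React.Children.map(children, child => {
  if (!React.isValidElement(child)) return null;
  const { filterKey } = child.props;
  if (!visibleFilters.includes(filterKey)) return null;

  return React.cloneElement(child, {
    onAdd: (label, value) =>
      setActiveFilters(prev => ({
        ...prev,
        [filterKey]: [...(prev[filterKey] ?? []), { label, value }]
      })),
    onRemove: () =>
      setActiveFilters(prev => {
        const { [filterKey]: removed, ...rest } = prev;
        return rest;
      })
  });
});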

Pitfalls: React.cloneElement creates a new element for every child on every render, which adds overhead when re-renders are frequent, and the React documentation lists it as a legacy API that can lead to fragile code. The approach tightly coupled filters to the framework’s callback signature, made prop flow implicit and difficult to debug, and created type safety issues, since TypeScript struggles with dynamically injected props.

Final Solution: Context API

The winning design uses the React Context API to provide filter state and actions to child components. Individual filters access setValue and removeFilter via the useFiltersContext() hook. This decouples filters from the framework while maintaining control.

COMPONENT Filters({ children, onChange }):
    STATE filtersMap = {}           // { search: { value, query, state } }
    STATE filtersOrder = []         // ['search', 'status']

    FUNCTION updateFilter(key, newValue):
        serialized = parser.serialize(newValue)   // Type → String
        filtersMap[key] = { value: newValue, query: serialized }
        updateURL(serialized)
        onChange(allValues)

    ON URL_CHANGE:
        parsed = parser.parse(urlString)          // String → Type
        filtersMap[key] = { value: parsed, query: urlString }

    RENDER:
        <Context.Provider value={{ updateFilter, filtersMap }}>
            {children}
        </Context.Provider>
END COMPONENT

Benefits: This solution eliminated the performance overhead of cloneElement, decoupled filters from framework internals and made it easy to add new filters without touching framework code. The Context API provides clear data flow that’s easy to debug and test, with type safety through TypeScript.

Inversion of Control (IoC)

The Context API in React unlocks something truly powerful — Inversion of Control (IoC). This design principle is about delegating control to a framework instead of managing every detail yourself. It’s often summed up by the Hollywood Principle: “Don’t call us, we’ll call you.”

In React, this translates to building flexible components that let the consumer decide what to render, while the component itself handles how and when it happens.

Our Filters framework applies this principle: you don’t have to manage when to update state or synchronise the URL. You simply define your filter components and the framework orchestrates the rest — ensuring seamless, predictable updates without manual intervention.

How Filters Inverts Control

Our Filters framework demonstrates Inversion of Control in three key ways.

Logic via Props: The framework doesn’t know how to save filters or fetch data — the parent injects those functions. The framework decides when to call them, but the parent defines what they do.
Content via Children (Composition): The parent decides which filters to render.
Actions via Callbacks: The framework triggers callbacks when users type, select or apply filters, but it’s your code that decides what happens next — fetch data, update cache or send analytics.

The result? A single, reusable Filters component that works across pipelines, services, deployments or repositories. By separating UI logic from business logic, we gain flexibility, testability and cleaner architecture — the true power of Inversion of Control.

COMPONENT DemoPage:
    STATE filterValues
    FilterHandler = createFilters()

    FUNCTION applyFilters(data, filters):
        result = data
        IF filters.onlyActive == true:
            result = result WHERE item.status == "Active"
        RETURN result

    filteredData = applyFilters(SAMPLE_DATA, filterValues)

    RENDER:
        <RouterContextProvider>
            <FilterHandler onChange = (updatedFilters) => SET filterValues = updatedFilters>
                
                // Dropdown to add filters dynamically
                <FilterHandler.Dropdown>
                    RENDER FilterDropdownMenu with available filters
                </FilterHandler.Dropdown>

                // Active filters section
                <FilterHandler.Content>
                    <FilterHandler.Component parser = booleanParser filterKey = "onlyActive">
                        RENDER CustomActiveOnlyFilter
                    </FilterHandler.Component>
                </FilterHandler.Content>

            </FilterHandler>

            RENDER DemoTable(filteredData)
        </RouterContextProvider>
END COMPONENT

The URL Problem

One of the key technical challenges in building a filtering system is URL synchronization. A URL can only carry strings, yet our applications deal with rich data types: dates, booleans, arrays and more. Without a structured solution, each component would need to convert these values manually, leading to repetitive, error-prone code.

The solution is our parser interface, a lightweight abstraction with just two methods: parse and serialize.

parse converts a URL string into the type your app needs.
serialize does the opposite, turning that typed value back into a string for the URL.

This bidirectional system runs automatically — parsing when filters load from the URL and serialising when users update filters.

const booleanParser: Parser<boolean> = {
  parse: (value: string) => value === 'true',   // "true" → true
  serialize: (value: boolean) => String(value)  // true → "true"
}
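
The same two-method shape extends to richer types. The sketch below assumes a Parser interface with exactly the parse and serialize methods described above; the interface definition and the stringArrayParser and dateParser names are illustrative guesses, not the framework's actual code.

```typescript
// Assumed shape of the parser interface: one method per direction.
interface Parser<T> {
  parse: (value: string) => T;
  serialize: (value: T) => string;
}

// Arrays: encode each item so a comma inside a value cannot be
// confused with the separator between values.
const stringArrayParser: Parser<string[]> = {
  parse: value =>
    value === "" ? [] : value.split(",").map(item => decodeURIComponent(item)),
  serialize: values => values.map(item => encodeURIComponent(item)).join(","),
};

// Dates: ISO 8601 strings round-trip without losing precision.
const dateParser: Parser<Date> = {
  parse: value => new Date(value),
  serialize: value => value.toISOString(),
};

const statusQuery = stringArrayParser.serialize(["Failed", "Aborted"]);
const statuses = stringArrayParser.parse(statusQuery);

const sinceQuery = dateParser.serialize(new Date("2026-02-10T00:00:00.000Z"));
const since = dateParser.parse(sinceQuery);
```

The property worth preserving in any parser is the round trip: parsing a serialized value should give back an equivalent value, otherwise the UI and the URL drift apart.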

FiltersMap — The State Hub

At the heart of our framework lies the FiltersMap — a single, centralized object that holds the complete state of all active filters. It acts as the bridge between your React components and the browser, keeping UI state and URL state perfectly in sync.

Each entry in the FiltersMap contains three key fields:

Value — the parsed, typed data your components actually use (e.g. Date objects, arrays, booleans).
Query — the serialized string representation that’s written to the URL.
State — the filter’s lifecycle status: hidden, visible or actively filtering.

You might ask — why store both the typed value and its string form? The answer is performance and reliability. If we only stored the URL string, every re-render would require re-parsing, which quickly becomes inefficient for complex filters like multi-selects. By storing both, we parse only once — when the value changes — and reuse the typed version afterward. This ensures type safety, faster URL synchronization and a clean separation between UI behavior and URL representation. The result is a system that’s predictable, scalable, and easy to maintain.

interface FilterType<T = any> {
  value?: T              // The actual filter value
  query?: string         // Serialized string for URL
  state: FilterStatus    // VISIBLE | FILTER_APPLIED | HIDDEN
}

The Journey of a Filter Value

Let’s trace how a filter value moves through the system — from user interaction to URL synchronization.

It all starts when a user interacts with a filter component — for example, selecting a date. This triggers an onChange event with a typed value, such as a Date object. Before updating the state, the parser’s serialize method converts that typed value into a URL-safe string.

The framework then updates the FiltersMap with both versions:

the typed value under value and
the serialized string under query.

From here, two things happen simultaneously:

The onChange callback fires, passing typed values back to the parent component — allowing the app to immediately fetch data or update visualizations.
The URL updates using the serialized query string, keeping the browser’s address bar in sync and making the current filter state instantly shareable or bookmarkable.

The reverse flow works just as seamlessly. When the URL changes — say, the user clicks the back button — the parser’s parse method converts the string back into a typed value, updates the FiltersMap and triggers a re-render of the UI.

All of this happens within milliseconds, enabling a smooth, bidirectional synchronization between the application state and the URL — a crucial piece of what makes the Filters framework feel so effortless.
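
The two directions of this flow can be sketched without React. The FilterStore class below is a hypothetical, framework-free stand-in for the Filters component: it keeps a filters map with value, query and state, serializes on update, and parses when a query string is loaded. The class, its method names, and the plain query string it returns are assumptions made for illustration.

```typescript
interface Parser<T> {
  parse: (value: string) => T;
  serialize: (value: T) => string;
}

type FilterStatus = "VISIBLE" | "FILTER_APPLIED" | "HIDDEN";

interface FilterEntry {
  value?: unknown;
  query?: string;
  state: FilterStatus;
}

const booleanParser: Parser<boolean> = {
  parse: value => value === "true",
  serialize: value => String(value),
};

class FilterStore {
  readonly filtersMap: Record<string, FilterEntry> = {};
  private readonly parsers: Record<string, Parser<any>>;
  private readonly onChange: (values: Record<string, unknown>) => void;

  constructor(
    parsers: Record<string, Parser<any>>,
    onChange: (values: Record<string, unknown>) => void,
  ) {
    this.parsers = parsers;
    this.onChange = onChange;
  }

  // UI to URL: serialize once, store both forms, notify the parent.
  updateFilter(key: string, value: unknown): string {
    const query = this.parsers[key].serialize(value);
    this.filtersMap[key] = { value, query, state: "FILTER_APPLIED" };
    this.onChange(this.values());
    return this.toQueryString();
  }

  // URL to UI: parse each known key back into its typed value.
  loadFromQuery(queryString: string): void {
    for (const pair of queryString.split("&")) {
      const [rawKey, rawValue = ""] = pair.split("=");
      const key = decodeURIComponent(rawKey);
      const parser = this.parsers[key];
      if (!parser) continue; // ignore parameters no filter owns
      const query = decodeURIComponent(rawValue);
      this.filtersMap[key] = { value: parser.parse(query), query, state: "FILTER_APPLIED" };
    }
  }

  values(): Record<string, unknown> {
    const result: Record<string, unknown> = {};
    for (const [key, entry] of Object.entries(this.filtersMap)) {
      result[key] = entry.value;
    }
    return result;
  }

  toQueryString(): string {
    return Object.entries(this.filtersMap)
      .filter(([, entry]) => entry.query !== undefined)
      .map(([key, entry]) => `${encodeURIComponent(key)}=${encodeURIComponent(entry.query ?? "")}`)
      .join("&");
  }
}

// Forward flow: the user toggles a filter.
const received: Array<Record<string, unknown>> = [];
const store = new FilterStore({ onlyActive: booleanParser }, values => {
  received.push(values);
});
const queryAfterUpdate = store.updateFilter("onlyActive", true);

// Reverse flow: the URL changes, for example through the back button.
store.loadFromQuery("onlyActive=false");
```

In the real framework the query string would be written to the browser URL and the map would live in React state; the sketch only shows the order of operations.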

Conclusion

For teams tackling similar challenges — complex UI state management, URL synchronization and reusable component design — this architecture offers a practical blueprint to build upon. The patterns used are not specific to Harness; they are broadly applicable to any modern frontend system that requires scalable, stateful and user-driven filtering.

The team’s core objectives — discoverability, smooth UX, consistent behavior, reusable design and decoupled elements — directly shaped every architectural decision. Through Inversion of Control, the framework manages the when and how of state updates, lifecycle events and URL synchronization, while developers define the what — business logic, API calls and filter behavior.

By treating the URL as part of the filter state, the architecture enables shareability, bookmarkability and native browser history support. The Context API serves as the control distribution layer, removing the need for prop drilling and allowing deeply nested components to seamlessly access shared logic and state.

Ultimately, Inversion of Control also paved the way for advanced capabilities such as saved filters, conditional rendering, and sticky filters — all while keeping the framework lightweight and maintainable. This approach demonstrates how clear objectives and sound architectural principles can lead to scalable, elegant solutions in complex UI systems.

Backstage Alternatives: IDP Options for Engineering Leaders

Compare Backstage alternatives, from open source builds to commercial IDPs like Harness, and learn how to choose the right developer portal for your team.

Bri Strozewski

February 5, 2026

In most teams, the question is no longer "Do we need an internal developer portal?" but "Do we really want to run Backstage ourselves?"

Backstage proved the internal developer portal (IDP) pattern, and it works. It gives you a flexible framework, plugins, and a central place for services and docs. It also gives you a long-term commitment: owning a React/TypeScript application, managing plugins, chasing upgrades, and justifying a dedicated platform squad to keep it all usable.

That's why there are Backstage alternatives like Harness IDP and managed Backstage services. It's also why so many platform teams take a long, careful look at them before making a decision.

Why Teams Start Searching For Backstage Alternatives

Spotify created Backstage to fix real platform engineering problems: slow onboarding, scattered documentation, unclear ownership, and no clear paths for new services. When Spotify made Backstage open source in 2020, the goal was clear. The main value props are good: a software catalog, templates for new services, and a single place to bring together all the tools you need.

The problem is not the concept. It is the operating model. Backstage is a framework, not a product. If you adopt it, you are committing to:

Running and scaling the portal as a first-class internal product.
Owning plugin selection, security reviews, and lifecycle management.
Maintaining a consistent UX as more teams and use cases pile in.

Once Backstage moves beyond a proof of concept, it takes a lot of engineering work to keep it reliable, secure, and up to date. Many companies don't realize how much work it takes. At the same time, platforms like Harness are showing that you don't have to build everything yourself to get good results from a portal.

When you look at how Harness connects IDP to CI, CD, IaC Management, and AI-powered workflows, you start to see an alternate model: treat the portal as a product you adopt, then spend platform engineering energy on standards, golden paths, and self-service workflows instead of plumbing.

The Three Real Paths: Build, Buy, Or Go Hybrid

When you strip away branding, almost every Backstage alternative fits one of three patterns. The differences are in how much you own and how much you offload:

	Build (Self-Hosted Backstage)	Hybrid (Managed Backstage)	Buy (Commercial IDP)
You own	Everything: UI, plugins, infra, roadmap	Customization, plugin choices, catalog design	Standards, golden paths, workflows
Vendor owns	Nothing	Hosting, upgrades, security patches	Platform, upgrades, governance tooling, support
Engineering investment	High (2–5+ dedicated engineers)	Medium (1–2 engineers for customization)	Low (configuration, not code)
Time to value	Months	Weeks to months	Weeks
Flexibility	Unlimited	High, within Backstage conventions	Moderate, within vendor abstractions
Governance & RBAC	Build it yourself	Build or plugin-based	Built-in
Best for	Large orgs wanting full control	Teams standardized on Backstage who want less ops	Teams prioritizing speed, governance, and actionability

1. Build: Self-Hosted Backstage Or Fully DIY Portal

What This Actually Means

You fork or deploy OSS Backstage, install the plugins you need, and host it yourself. Or you build your own internal portal from scratch. Either way, you now own:

The UI and UX.
The plugin ecosystem and compatibility matrix.
Security, upgrades, and infra.
Roadmapping and feature decisions.

Backstage gives you the most flexibility because you can add your own custom plugins, model your internal world however you want, and connect it to any tool. If you're willing to put a lot of money into it, that freedom is very powerful.

Where It Breaks Down

In practice, that freedom has a price:

You need a dedicated team (often several engineers) to keep the portal healthy as adoption grows.
You own every design decision and every piece of technical debt, forever.
Plugin sprawl becomes real, especially when different teams install different components for similar problems.
Scaling governance, RBAC, and standards enforcement almost always requires custom code.

This path could still work. If you run a very large organization, want to make the portal a core product, have strong React/TypeScript and platform skills, and really want to be able to customize it however you like, building on Backstage is a good idea. Just remember that you are not choosing a tool; you are staffing a long-term project.

2. Hybrid: Managed Backstage

What This Actually Means

Managed Backstage providers run and host Backstage for you. You still get the framework and everything that goes with it, but you don't have to fix Kubernetes manifests at 2 a.m. or investigate upstream patch releases.

Vendor responsibilities typically include:

Running the control plane and handling infra.
Coordinating upgrades and security fixes.
Creating a curated library of high-value plugins.

You get "Backstage without the server babysitting."

Where The Trade-Offs Show Up

You also inherit Backstage's structural limits:

The data model and catalog schema still look like Backstage.
UI and interaction patterns follow Backstage's rules, which may not fit every team's mental model.
Deeply customized plugins or data models still require serious engineering work.

Hybrid works well if you have already standardized on Backstage concepts, want to keep the ecosystem, and simply refuse to run your own instance. If you're just starting out with IDPs and are still looking into things like golden paths, self-service workflows, and platform-managed scorecards, it might be helpful to compare hybrid Backstage to commercial IDPs that were made to be products from the start.

3. Buy: Commercial IDPs

What This Actually Means

Commercial IDPs approach the space from the opposite angle. You do not start with a framework, you start with a product. You get a portal that ships with:

A software catalog.
Ownership and scorecards.
Self-service workflows.
RBAC and governance tools.

The main point that sets them apart is how well that portal is connected to the systems that your developers use every day. Some products act as a metadata hub, bringing together information from your current tools. Harness does things differently. The IDP is built right on top of a software delivery platform that already has CI, CD, IaC Management, Feature Flags, and more.

Why Teams Go This Route

Teams that choose commercial Backstage alternatives tend to prioritize:

Time to value in weeks, not quarters.
Predictable total cost of ownership instead of wandering portal roadmaps.
Built-in governance and security rather than "we'll build RBAC later."
A real customer success partnership and roadmap, as opposed to depending on open-source momentum.

You trade some of Backstage's absolute freedom for a more focused, maintainable platform. For most organizations, that is a win.

Open Source Backstage Vs. Commercial Backstage Alternatives: Real Trade-Offs

People often think that the difference is "Backstage is free; commercial IDPs are expensive." In reality, the choice is "Where do you want to spend?"

When you use open source, you save on licensing but spend engineering capacity. With commercial IDPs like Harness, you do the opposite: you pay for the product and keep your engineers' time for other work. A platform's main purpose is to serve the teams that build on it. Who does the hard work depends on whether you build or buy.

This is how it works in practice:

Dimension	Open-Source Backstage	Commercial IDP (e.g., Harness)
Upfront cost	Free (no license fees)	Subscription or usage-based pricing
Engineering staffing	2–5+ engineers dedicated at scale	Minimal—vendor handles core platform
Customization freedom	Unlimited—you own the code	Flexible within vendor abstractions
UX consistency	Drifts as teams extend the portal	Controlled by product design
AI/automation depth	Add-on or custom build	Native, grounded in delivery data
Vendor lock-in risk	Low (open source)	Medium (tied to platform ecosystem)
Long-term TCO (3–5 years)	High (hidden in headcount)	Predictable (visible in contract)

Backstage is a solid choice if you explicitly want to own design, UX, and technical debt. Just be honest about how much that will cost over the next three to five years.

Commercial IDPs like Harness come with pre-made catalogs, scorecards, workflows, and governance that show you the best ways to do things. In short, it's ready to use right away. You get faster rollout of golden paths, self-service workflows, and environment management, as well as predictable roadmaps and vendor support.

The real question is what you want your platform team to do: shipping features in your portal framework, or defining and evolving the standards that drive better software delivery.

Where Commercial IDPs Fit Among Backstage Alternatives

When compared to other Backstage options, Harness IDP is best understood as a platform-based choice rather than a separate portal. It runs on Backstage where it makes sense (for example, to use the plugin ecosystem), but it is packaged as a curated product that sits on top of the Harness Software Delivery Platform as a whole.

A few design principles stand out:

Start from a product, not a bare framework. Backstage is intentionally a framework. Harness IDP is shipped as a product. Teams can start using the software right away because it already has a software catalog, scorecards, self-service workflows, RBAC, and policy-as-code. You extend and shape it, but you don't have to assemble the basics before anyone can use it.
Make governance a first-class concern. Harness bakes environment-aware RBAC, policy-as-code (OPA), approvals, freeze windows, audit trails, and standards enforcement into the platform. Instead of adding custom plugins later, governance and security are built in from the start.
Prioritize actionability over passive visibility. Harness IDP does not stop at showing data. Because it runs directly over Harness CI, CD, IaC Management, Feature Flags, and related capabilities, it can drive workflows: spinning up new services from golden paths, managing environments, shutting down ephemeral resources, and wiring in repeatable self-service runbooks. The result is a portal that behaves more like an operational control plane.
Use AI where it can safely take action. The Harness Knowledge Agent is based on real delivery data, such as services, pipelines, environments, and scorecards. It can answer questions about who owns what and what happened in the past. It can also suggest or start safe actions under governance controls. That is not the same as AI features that only give a brief overview of catalog entries.

When you think about Backstage alternatives in terms of "How much of this work do we want to own?" and "Should our portal be a UI or a control plane?", Harness naturally fits into the group that sees the IDP as part of a connected delivery platform rather than as a separate piece of infrastructure.

Migration Realities: Moving Off Backstage Is Not A Free Undo Button

A lot of teams say, "We'll start with Backstage, and if it gets too hard, we'll move to something else." That sounds safe on paper. In production, moving from Backstage gets harder over time.

Common points where things go wrong include:

Custom plugins and extensions: One of Backstage's best features is its plugin ecosystem. It is also what ties teams to the platform. Over time, you build up a lot of custom plugins, scaffolder actions, and UI panels that are closely linked to your internal systems. Moving those to a different portal often means rewriting or refactoring them and checking each one for compatibility.
Catalog complexity: Backstage catalogs tend to grow into hundreds or thousands of catalog-info.yaml files, custom entity kinds, and annotations. Moving this to a commercial IDP means mapping that structure onto the new system's data model while keeping ownership, relationships, and governance rules intact. An incomplete migration here directly undermines trust in the new portal.
Golden path and scaffolder differences: Your existing scaffolder templates are wired into specific CI/CD tools and habits. Moving them to Harness IDP usually means changing the templates so that they run Harness pipelines, Harness environments, and IaC workflows instead of jobs from outside. That refactor is usually worth it, but it is still a lot of work.
Developer UX and "who moved my cheese?": Developers get used to Backstage's interaction patterns and custom dashboards. Changing to a new IDP always causes adoption friction. The most reliable way to avoid a revolt is to run both portals in parallel and roll out new golden paths gradually.
Parallel system complexity: Running Backstage next to a new portal uses up a lot of platform bandwidth and makes things confusing for users if timelines aren't clear. Commercial vendors like Harness can help with this by providing migration tools and hands-on help, but you still need to plan for a migration window, not just flipping a switch.
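To make the catalog point concrete, here is a minimal Python sketch of a migration pass over Backstage entities that have already been loaded from catalog-info.yaml into dictionaries. The target `PortalService` record and the `migrate` helper are hypothetical illustrations, not part of any vendor's import tooling; the entity fields used (`metadata.name`, `metadata.annotations`, `spec.owner`, `spec.dependsOn`) follow the Backstage descriptor format.

```python
from dataclasses import dataclass, field


@dataclass
class PortalService:
    """Hypothetical record in the new portal's data model."""
    name: str
    owner: str
    depends_on: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def convert_entity(entity: dict) -> PortalService:
    """Map one Backstage entity onto the target model, refusing to drop ownership."""
    metadata = entity.get("metadata", {})
    spec = entity.get("spec", {})
    name = metadata.get("name", "<unnamed>")
    owner = spec.get("owner")
    if not owner:
        raise ValueError(f"{name}: missing spec.owner")
    return PortalService(
        name=name,
        owner=owner,
        depends_on=list(spec.get("dependsOn", [])),
        annotations=dict(metadata.get("annotations", {})),
    )


def migrate(entities: list) -> tuple:
    """Return (migrated records, human-readable problems) instead of failing silently."""
    migrated, problems = [], []
    for entity in entities:
        try:
            migrated.append(convert_entity(entity))
        except ValueError as exc:
            problems.append(str(exc))
    return migrated, problems


# Illustrative sample data only.
SAMPLE_ENTITIES = [
    {
        "apiVersion": "backstage.io/v1alpha1",
        "kind": "Component",
        "metadata": {"name": "payments-api",
                     "annotations": {"github.com/project-slug": "acme/payments-api"}},
        "spec": {"type": "service", "owner": "team-payments",
                 "dependsOn": ["resource:default/payments-db"]},
    },
    {
        "apiVersion": "backstage.io/v1alpha1",
        "kind": "Component",
        "metadata": {"name": "legacy-batch"},
        "spec": {"type": "service"},
    },
]

if __name__ == "__main__":
    records, issues = migrate(SAMPLE_ENTITIES)
    print(records)
    print(issues)
```

Collecting problems rather than skipping bad entries is the design choice that matters: entities without an owner surface as a review list before cutover instead of quietly disappearing from the new catalog.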

The point isn't "never choose Backstage." The point is that if you do, you should think of it as a strategic choice, not an experiment you can easily undo in a year.

How To Evaluate Backstage Alternatives With A Clear Head

Whether you are comparing Backstage alone, Backstage in a managed form, or commercial platforms like Harness, use a lens that goes beyond feature checklists. These seven questions will help you cut through the noise.

Time to first value
- Can you deliver a useful portal (catalog plus a couple of golden paths) in weeks?
- Who owns upgrades, patches, and production reliability?
Total cost of ownership
- How many engineers will this realistically consume over 3 years?
- Is that time spent on differentiated work or reinvention?
Governance and security maturity
- Do you get RBAC, policy-as-code, approvals, and audit trails out of the box?
- Can you express environment-aware rules without writing custom code for every edge case?
Data model and extensibility
- How hard is it to model services, infra, teams, and dependencies in a way that reflects reality?
- Can you evolve the model as your architecture and org change?
Automation and actionability
- Does the portal only aggregate data, or can it drive workflows like service creation, environment provisioning, and deployment rollbacks?
- How directly does it connect to your CI/CD, IaC, and incident tooling?
AI and "agentic" workflows
- Is AI just summarizing what you already see on dashboards, or can it actually update environments, run pipelines, and enforce policies safely?
- How well grounded is that AI in your real delivery platform versus a generic data lake?
Exit strategy and lock-in
- If you have to move in three to five years, how portable are your catalogs, templates, and automation?
- Are you comfortable tying your IDP to a broader platform (like Harness) to gain deeper integration and efficiency?

If a solution cannot give you concrete answers here, it is not the right Backstage alternative for you.

Why Harness IDP Belongs On Your Shortlist

Choosing among Backstage alternatives comes down to one question: what kind of work do you want your platform team to own?

Open source Backstage gives you maximum flexibility and maximum responsibility. Managed Backstage reduces ops burden but keeps you within Backstage's conventions. Commercial IDPs like Harness narrow the surface area you maintain and connect your portal directly to CI/CD, environments, and governance.

If you want fast time to value, built-in governance, and a portal that acts rather than just displays, connect with Harness.

Engineering Blog

Architecting Trust: The Blueprint for a "Golden Standard" Software Supply Chain

Move beyond bespoke CI/CD scripts. Learn how to architect a "Golden Standard" pipeline that enforces governance, accelerates security testing, and guarantees artifact integrity using a Zero Trust approach.

Aditya Kashyap

February 5, 2026


We’ve all seen it happen. A DevOps initiative starts with high energy, but two years later, you’re left with a sprawl of "fragile agile" pipelines. Every team has built their own bespoke scripts, security checks are inconsistent (or non-existent), and maintaining the system feels like playing whack-a-mole.

This is where the industry is shifting from simple DevOps execution to Platform Engineering.

The goal of a modern platform team isn't to be a help desk that writes YAML files for developers. The goal is to architect a "Golden Path"—a standardized, pre-vetted route to production that is actually easier to use than the alternative. It reduces the cognitive load for developers while ensuring that organizational governance isn't just a policy document, but a reality baked into every commit.

In this post, I want to walk through the architecture of a Golden Standard Pipeline. We’re going to look beyond simple task automation and explore how to weave Governance, Security, and Supply Chain integrity into a unified workflow that stands the test of time.

The Architectural Blueprint

A Golden Standard Pipeline isn't defined by the tools you use (Harness, GitLab, GitHub Actions) but by its layers of validation. It’s not enough to simply "build and deploy" anymore. We need to architect a system that establishes trust at every single stage.

I like to break this architecture down into four distinct domains:

Governance Domain: checking if we should run this pipeline before we even start.
Integration Domain (The Inner Loop): getting fast feedback to developers.
Trust Domain (Supply Chain): creating proof that the software is safe.
Delivery Domain (The Outer Loop): getting it to production reliably.

Visualizing the Flow

Layer 1: Governance as the First Gate

The Principle: Don't process what you can't approve.

In traditional pipelines, we often see compliance checks shoehorned in right before production deployment. This is painful. There is nothing worse than waiting 20 minutes for a build and test cycle, only to be told you can't deploy because you used a non-compliant base image.

In a Golden Standard architecture, we shift governance to Step Zero.

By implementing Policy as Code (using frameworks like OPA) at the very start of execution, we solve a few problems:

Drift Prevention: Pipelines simply won't run if they’ve been hacked to bypass standard steps.
Resource Efficiency: We don't waste expensive compute building artifacts that are doomed to fail compliance.
Security Baseline: Unauthorized workflows are stopped dead before they can access secrets or internal networks.
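As a rough illustration of governance at Step Zero, the Python sketch below evaluates a pipeline definition before any build step runs. It is a stand-in for a real policy engine such as OPA, not an OPA integration; the required step names, the approved registry prefix, and the shape of the `pipeline` dictionary are all assumptions made for the example.

```python
# Hypothetical organizational policy, stored separately from any pipeline.
REQUIRED_STEPS = {"secret-scan", "sast", "sbom"}
APPROVED_BASE_IMAGES = ("registry.example.com/base/",)


def evaluate_policy(pipeline: dict) -> list:
    """Return a list of violations; an empty list means the pipeline may run."""
    violations = []
    step_names = {step["name"] for step in pipeline.get("steps", [])}
    missing = sorted(REQUIRED_STEPS - step_names)
    if missing:
        violations.append("missing required steps: " + ", ".join(missing))
    image = pipeline.get("base_image", "")
    if not image.startswith(APPROVED_BASE_IMAGES):
        violations.append(f"base image not approved: {image or '<none>'}")
    return violations


def run_pipeline(pipeline: dict) -> str:
    """Gate execution on policy before spending any build compute."""
    violations = evaluate_policy(pipeline)
    if violations:
        return "rejected: " + "; ".join(violations)
    return "accepted"


if __name__ == "__main__":
    tampered = {
        "base_image": "docker.io/library/ubuntu:latest",
        "steps": [{"name": "secret-scan"}, {"name": "sbom"}, {"name": "build"}],
    }
    print(run_pipeline(tampered))
```

Because the check runs on the pipeline definition rather than on build output, a pipeline edited to skip a scanner is rejected in milliseconds instead of after a 20-minute build.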

Layer 2: Parallelized Security Orchestration

The Principle: Security must speed the developer up, not slow them down.

The "Inner Loop" is sacred ground. This is where developers live. If your security scanning adds friction or takes too long, developers will find a way to bypass it. To solve this, we rely on Parallel Orchestration.

Instead of running checks linearly (Lint → then SAST → then Secrets), we group "Code Smells," "Linting," and "Security Scanners" to run simultaneously.

This gives us a huge architectural advantage:

Reduced Latency: We squash the wall-clock time by running I/O heavy checks in parallel.
Cost Optimization: We only trigger expensive Unit Test runners after the cheap, fast security checks pass. There is zero value in running a heavy test suite on a codebase that contains a hardcoded API key.
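The ordering described above can be sketched in a few lines of Python: cheap checks fan out in parallel, and the expensive test suite runs only if all of them pass. The three check functions are toy placeholders (real scanners would be external tools), and `run_unit_tests` is a hypothetical callable supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def lint(source: str) -> bool:
    """Toy lint rule: no trailing whitespace."""
    return all(line == line.rstrip() for line in source.splitlines())


def secret_scan(source: str) -> bool:
    """Toy secret scan: flag an inline API key assignment."""
    return "API_KEY=" not in source


def sast(source: str) -> bool:
    """Toy static analysis: flag use of eval()."""
    return "eval(" not in source


CHEAP_CHECKS = {"lint": lint, "secrets": secret_scan, "sast": sast}


def run_inner_loop(source: str, run_unit_tests) -> dict:
    """Run cheap checks concurrently; gate the expensive suite on their result."""
    with ThreadPoolExecutor(max_workers=len(CHEAP_CHECKS)) as pool:
        futures = {name: pool.submit(check, source)
                   for name, check in CHEAP_CHECKS.items()}
        results = {name: future.result() for name, future in futures.items()}

    failed = sorted(name for name, ok in results.items() if not ok)
    if failed:
        # Fail fast: no point paying for a test run on code with a leaked key.
        return {"status": "failed", "failed_checks": failed, "unit_tests_ran": False}

    tests_ok = run_unit_tests()
    return {"status": "passed" if tests_ok else "failed",
            "failed_checks": [], "unit_tests_ran": True}


if __name__ == "__main__":
    print(run_inner_loop('API_KEY="hunter2"\n', lambda: True))
```

Threads are enough here because the assumption is that real checks spend their time waiting on I/O or subprocesses; wall-clock time then approaches that of the slowest single check rather than the sum of all of them.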

Layer 3: The Trust Layer (Supply Chain Security)

The Principle: Prove the origin and ingredients of your software.

This is the biggest evolution we've seen in CI/CD recently. We need to stop treating the build artifact (Docker image/Binary) as a black box. Instead, we generate three critical pieces of metadata that travel with the artifact:

SBOM (Software Bill of Materials): Think of this as the ingredients list on a food packet. It’s a machine-readable inventory of every library inside your container. When the next Log4j happens, you don't need to scan the world; you just query your inventory.
SLSA Provenance: This is an unforgeable ID card for your build. It proves where the build happened, when it happened, and what inputs were used. This is your defense against tampering attacks (like SolarWinds).
Cryptographic Signing: Finally, we sign the artifact using a private key (via Cosign). This acts like a digital wax seal; if the image is modified by even one bit after the build, the seal breaks, and your cluster will refuse to run it.
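To show what "query your inventory" might look like, here is a small Python sketch that searches stored SBOMs for a vulnerable package. It assumes each SBOM has already been parsed into a dictionary with a `packages` list whose entries carry `name` and `versionInfo` (the field names used by SPDX JSON); the image names, versions, and the `find_affected` helper are illustrative only.

```python
def find_affected(sboms: dict, package: str, bad_versions: set) -> list:
    """Return the images whose SBOM lists the package at a known-bad version."""
    affected = []
    for image, sbom in sboms.items():
        for pkg in sbom.get("packages", []):
            if pkg.get("name") == package and pkg.get("versionInfo") in bad_versions:
                affected.append(image)
                break  # one hit is enough to flag this image
    return sorted(affected)


# Illustrative inventory: image reference -> parsed SBOM document.
INVENTORY = {
    "registry.example.com/orders:1.4.0": {
        "packages": [{"name": "log4j-core", "versionInfo": "2.14.1"},
                     {"name": "jackson-databind", "versionInfo": "2.15.2"}],
    },
    "registry.example.com/billing:2.0.3": {
        "packages": [{"name": "log4j-core", "versionInfo": "2.17.1"}],
    },
    "registry.example.com/frontend:5.1.0": {
        "packages": [{"name": "react", "versionInfo": "18.2.0"}],
    },
}

if __name__ == "__main__":
    print(find_affected(INVENTORY, "log4j-core", {"2.14.0", "2.14.1"}))
```

The lookup touches only metadata that was generated at build time, which is why it can answer "where are we exposed?" without rescanning a single container.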

Layer 4: Immutable Delivery

The Principle: Build once, deploy everywhere.

A common anti-pattern I see is rebuilding artifacts for different environments—building a "QA image" and then rebuilding a "Prod image" later. This introduces risk.

In the Golden Standard, the artifact generated and signed in Layer 3 is the exact same immutable object deployed to QA and Production. We use a Rolling Deployment strategy with an Approval Gate between environments. The production stage explicitly references the digest of the artifact verified in QA, ensuring zero drift.
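A minimal sketch of digest pinning, in Python: the production stage refuses to deploy anything whose digest differs from the one verified in QA. As a simplification, the digest here is a SHA-256 over raw artifact bytes; container registries compute digests over the image manifest, and the `promote` function is a hypothetical name, not a platform API.

```python
import hashlib


def digest_of(artifact: bytes) -> str:
    """Content-addressed identifier for an artifact (simplified to raw bytes)."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


def promote(artifact: bytes, qa_verified_digest: str) -> str:
    """Allow promotion only if the artifact is bit-for-bit what QA verified."""
    actual = digest_of(artifact)
    if actual != qa_verified_digest:
        raise ValueError(
            f"digest mismatch: expected {qa_verified_digest}, got {actual}"
        )
    return actual


if __name__ == "__main__":
    built = b"example artifact bytes"
    verified = digest_of(built)       # recorded when QA signs off
    print(promote(built, verified))   # same bytes, promotion allowed
    try:
        promote(built + b"!", verified)  # a rebuilt or modified artifact
    except ValueError as exc:
        print(exc)
```

Referencing a digest rather than a mutable tag is what makes the guarantee hold: a tag can be repointed after QA, while a digest changes whenever the content does.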

The Capability Map

To successfully build this, your platform needs to provide specific capabilities mapped to these layers.

Future-Proofing Your Platform

Tools change. Jenkins, Harness, GitHub Actions—they all evolve. But the Architecture remains constant. If you adhere to these principles, you future-proof your organization:

Decouple Policy from Pipeline: Store your policies separately from your pipeline YAML. This lets you update security rules globally without needing a massive migration project to edit hundreds of pipelines.
Standardize Interfaces: Use standard formats for your metadata (SPDX for SBOMs, in-toto for attestations). This prevents vendor lock-in and ensures your data is portable.
Invest in "Shift Left" Culture: The best architecture in the world fails if developers see it as a hurdle. Position the Golden Pipeline as a product that solves developer pain points (like setting up environments or managing credentials) while silently enforcing security in the background.

Conclusion

Adopting a Golden Standard architecture transforms the CI/CD pipeline from a simple task runner into a governance engine. By abstracting security and compliance into these reusable layers, Platform Engineering teams can guarantee that every microservice—regardless of the language or framework—adheres to the organization's highest standards of trust.

Engineering Blog

Kubernetes Cost Traps: Fixing What Your Scheduler Won’t

Kubernetes costs often spiral due to hidden scheduling inefficiencies—not bad architecture. This blog breaks down how over-provisioning, poor bin packing, wrong node choices, and idle clusters waste money, and how cost-aware scheduling helps you take control.

Riyas P

January 26, 2026


Kubernetes is a powerhouse of modern infrastructure — elastic, resilient, and beautifully abstracted. It lets you scale with ease, roll out deployments seamlessly, and sleep at night knowing your apps are self-healing.

But if you’re not careful, it can also silently drain your cloud budget.

In most teams, cost comes as an afterthought — only noticed when the monthly cloud bill starts to resemble a phone number. The truth is simple:

Kubernetes isn’t expensive by default.

Inefficient scheduling decisions are.

These inefficiencies don’t come from massive architectural mistakes. It’s the small, hidden inefficiencies — configuration-level choices — that pile up into significant cloud waste.

In this post, let’s unpack the hidden costs lurking in your Kubernetes clusters and how you can take control using smarter scheduling, bin packing, right-sizing, and better node selection.

The Hidden Costs Nobody Talks About

Over-Provisioned Requests and Limits

Most teams play it safe by over-provisioning resource requests, sometimes doubling or tripling what the workload needs. This leads to wasted CPU and memory that sit idle but still cost money because the scheduler reserves them.

Your cluster is “full” — but your nodes are barely sweating.

Low Bin-Packing Efficiency

Kubernetes’s default scheduler optimizes for availability and spreading, not cost. As a result, workloads are often spread across more nodes than necessary. This leads to fragmented resource usage, like:

A node with 2 free cores that no pod can “fit” into
Nodes stuck at 5–10% utilization because of a single oversized pod
Non-evictable pods holding on to almost empty nodes

Wrong Node Choices (Intel vs AMD, Spot vs On-Demand)

Choosing the wrong instance type can be surprisingly expensive:

AMD-based nodes are 20–30% cheaper in many clouds
Spot instances can cut costs by up to 70% for stateless workloads
ARM (e.g., Graviton in AWS) can offer up to 40% savings

But without node affinity, taints, or custom scheduling, workloads might not land where they should.

Zombie Workloads and Forgotten Jobs

Old cron jobs, demo deployments, and failed jobs that never got cleaned up — they all add up. Worse, they might be on expensive nodes or keeping the autoscaler from scaling down.

Node Pool Fragmentation

Mixing too many node types across zones, architectures, or families without careful coordination leads to bin-packing failure. A pod that fits only one node type can prevent the scale-down of others, leading to stranded resources.

Always-On Clusters and Idle Infrastructure

Many Kubernetes environments run 24/7 by default, even when there is little or no real activity. Development clusters, staging environments, and non-critical workloads often sit idle for large portions of the day, quietly accumulating cost.

This is one of the most overlooked cost traps.
Even a well-sized cluster becomes expensive if it runs continuously while doing nothing.

Because this waste doesn’t show up as obvious inefficiency — no failed pods, no over-provisioned nodes — it often goes unnoticed until teams review monthly cloud bills. By then, the cost is already sunk.

Idle infrastructure is still infrastructure you pay for.

Smarter Scheduling: Cost-Aware Techniques

Kubernetes doesn’t natively optimize for cost, but you can make it.

Bin Packing with Intent: Taints, Affinity, and Custom Schedulers

Encourage consolidation by:

Using taints and tolerations to isolate high-memory or GPU workloads
Applying pod affinity/anti-affinity to co-locate or separate workloads
Leveraging Cluster Orchestrator with Karpenter to intelligently place pods based on actual resource availability and cost
Using Smart Placement Strategies to place non-evictable pods efficiently

In addition to affinity and anti-affinity, teams can use topology spread constraints to control the explicit distribution of pods across zones or nodes. While they’re often used for high availability, overly strict spread requirements can work against bin-packing and prevent efficient scale-down, making them another lever that needs cost-aware tuning.

Bin-packing to optimize resources

Scheduled Scaledown of Idle Resources

Most of us have been in a state where resources run 24/7 but are barely used, racking up costs even when everything is idle. A tried-and-tested way to avoid this is to scale down these resources based on either schedules or idleness.

Harness CCM Kubernetes AutoStopping lets you scale down your Kubernetes workloads, Auto Scaling Groups, VMs, and more, based on either their activity or fixed schedules, to save you from these idle costs.

Cluster Orchestrator can help you scale down the entire cluster or specific node pools on a schedule when they are not needed.

Right-Sizing Workloads

It’s often shocking how many pods can run on half the resources they’re requesting. Instead of guessing resource requests:

Try Cluster Orchestrator’s Vertical Pod Autoscaler (VPA) with a single click
Use Prometheus metrics to measure actual usage
Analyze reports from visibility tools
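For the "measure actual usage" step, a right-sizing recommendation can be as simple as a high percentile of observed usage plus some headroom. The Python sketch below assumes you have already exported CPU usage samples (in millicores) from your metrics system; the 95th percentile and 15% headroom are illustrative defaults, not values recommended by the post or by any particular autoscaler.

```python
import math


def recommend_request(usage_millicores: list,
                      percentile: int = 95,
                      headroom_percent: int = 15) -> int:
    """Suggest a CPU request from usage samples (nearest-rank percentile + headroom)."""
    if not usage_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_millicores)
    # Nearest-rank percentile: the smallest sample covering `percentile`% of the data.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    baseline = ordered[rank - 1]
    return math.ceil(baseline * (100 + headroom_percent) / 100)


if __name__ == "__main__":
    samples = [100, 120, 110, 130, 400, 115, 125, 105, 118, 122]
    current_request = 1000  # what the workload asks for today (illustrative)
    suggested = recommend_request(samples)
    print(f"current={current_request}m suggested={suggested}m")
```

Which percentile to use is a judgment call: a high percentile protects against the occasional spike, while a lower one packs tighter and leans on limits or autoscaling to absorb bursts.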

Leverage Spot, AMD, or ARM-based Nodes

Make architecture and pricing work in your favor:

Comparison of Spot and OnDemand EC2 pricing from AWS

Use node selectors or affinity rules to schedule less critical workloads to Spot nodes. You can use Harness’s Cluster Orchestrator to run your workloads partially on Spot instances. Spot nodes can be up to 90% cheaper than On-Demand nodes.
Prefer AMD or Graviton nodes for stateless or batch jobs
Separate workloads by architecture to avoid mixed pools
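As one way to express the affinity idea, the Python sketch below builds the relevant part of a pod spec as a plain dictionary and prints it as JSON. The structure (`affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution` and `tolerations`) follows the Kubernetes pod spec, but the label key `example.com/capacity-type` and the taint key `example.com/spot` are placeholders: the real keys depend on your cloud provider and node provisioner.

```python
import json


def spot_friendly_pod_spec(image: str) -> dict:
    """Pod spec fragment that prefers Spot nodes and tolerates a Spot taint."""
    return {
        "containers": [{"name": "worker", "image": image}],
        "affinity": {
            "nodeAffinity": {
                # "preferred" keeps the pod schedulable if no Spot capacity exists.
                "preferredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "weight": 100,
                        "preference": {
                            "matchExpressions": [
                                {
                                    "key": "example.com/capacity-type",  # placeholder label
                                    "operator": "In",
                                    "values": ["spot"],
                                }
                            ]
                        },
                    }
                ]
            }
        },
        "tolerations": [
            {
                "key": "example.com/spot",  # placeholder taint key
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(spot_friendly_pod_spec("registry.example.com/batch:1.0"), indent=2))
```

Using a preferred rule rather than a required one is deliberate for less critical workloads: they land on Spot when it is available and fall back to other nodes when it is reclaimed.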

Use Fewer, More Efficient Node Pools

Instead of 10 specialized pools, consider:

Consolidating into fewer, well-utilized pools
Using node-level bin-packing strategies via Karpenter or Cluster Orchestrator
Tuning autoscaler thresholds to enable more aggressive scale-down

Invisible Decisions Are Expensive

One overlooked reason why Kubernetes cost optimization is hard is that most scaling decisions are opaque. Nodes appear and disappear, but teams rarely know why a particular scale-up or scale-down happened.

Was it CPU fragmentation? A pod affinity rule? A disruption budget? A cost constraint?

Without decision-level visibility, teams are forced to guess — and that makes cost optimization feel risky instead of intentional.

Cost-aware systems work best when they don’t just act, but explain. Clear event-level insights into why a node was added, removed, or preserved help teams build trust, validate policies, and iterate safely on optimization strategies.

Scheduled Scale-Down of Idle Resources

One of the most effective ways to eliminate idle cost is time- or activity-based scaling. Instead of keeping clusters and workloads always on, resources can be scaled down when they are not needed and restored only when activity resumes.

With Harness CCM Kubernetes AutoStopping, teams can automatically scale down Kubernetes workloads, Auto Scaling Groups, VMs, and other resources based on usage signals or fixed schedules. This removes idle spend without requiring manual intervention.

Cluster Orchestrator extends this concept to the cluster level. It enables scheduled scale-down of entire clusters or specific node pools, making it practical to turn off unused capacity during nights, weekends, or other predictable idle windows.
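The schedule-based half of this idea reduces to a small decision function. The Python sketch below returns the replica count a non-production workload should have at a given moment; the weekday-only, 08:00 to 20:00 window is an assumed example schedule, and the function is an illustration of the logic rather than how AutoStopping or Cluster Orchestrator is implemented.

```python
from datetime import datetime, time


def desired_replicas(now: datetime,
                     normal_replicas: int,
                     start: time = time(8, 0),
                     end: time = time(20, 0)) -> int:
    """Keep the workload up during weekday working hours, scale to zero otherwise."""
    is_weekday = now.weekday() < 5          # Monday=0 ... Friday=4
    in_hours = start <= now.time() < end
    return normal_replicas if (is_weekday and in_hours) else 0


if __name__ == "__main__":
    for moment in (datetime(2026, 1, 26, 10, 0),   # Monday morning
                   datetime(2026, 1, 26, 22, 0),   # Monday night
                   datetime(2026, 1, 24, 10, 0)):  # Saturday morning
        print(moment.isoformat(), desired_replicas(moment, normal_replicas=3))
```

With this window, a workload runs 60 of the 168 hours in a week, so roughly two thirds of its always-on compute time is avoided.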

Sometimes, the biggest savings come from not running infrastructure at all when it isn’t needed.

Treat Cost Like a First-Class Metric

Cost is not just a financial problem. It’s an engineering challenge — and one that we, as developers, can tackle with the same tools we use for performance, resilience, and scalability.

Start small. Review a few workloads. Test new node types. Measure bin-packing efficiency weekly.

Equation to calculate Bin Packing Efficiency
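For a weekly measurement, a minimal Python sketch follows. It assumes bin-packing efficiency is defined as total requested CPU divided by total allocatable CPU across nodes; that definition is an assumption made for this example and may differ from the equation referenced above. The node figures are illustrative.

```python
def bin_packing_efficiency(nodes: list) -> float:
    """Fraction of allocatable CPU that is actually requested by scheduled pods."""
    requested = sum(node["requested_mcpu"] for node in nodes)
    allocatable = sum(node["allocatable_mcpu"] for node in nodes)
    if allocatable == 0:
        raise ValueError("no allocatable capacity reported")
    return requested / allocatable


if __name__ == "__main__":
    cluster = [
        {"name": "node-a", "allocatable_mcpu": 4000, "requested_mcpu": 3000},
        {"name": "node-b", "allocatable_mcpu": 4000, "requested_mcpu": 2600},
        {"name": "node-c", "allocatable_mcpu": 4000, "requested_mcpu": 400},
    ]
    print(f"{bin_packing_efficiency(cluster):.0%}")
```

In this example the 6,000 requested millicores would fit on two nodes, so the third node is the kind of stranded capacity the metric is meant to expose.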

You don’t need to sacrifice performance — just be intentional with your cluster design.

Check out Cluster Orchestrator by Harness CCM today!

Kubernetes doesn’t have to be expensive — just smarter.


Engineering Blog

Build vs Buy IaC: Choosing the Right IaCM Strategy

Build vs buy IaC? Compare custom pipelines to IaCM platforms. Learn which approach scales best for your team. Explore Harness IaCM now.

Richard Black

January 26, 2026


Have you ever watched a “temporary” Infrastructure as Code script quietly become mission-critical, undocumented, and owned by someone who left the company two years ago? We can all relate to a similar scenario, even if it isn't infrastructure-specific, and this is usually the moment teams realise the build vs buy IaC decision was made by accident, not design.

As your teams grow from managing a handful of environments to orchestrating hundreds of workspaces across multiple clouds, the limits of homegrown IaC pipeline management show up fast. What starts as a few shell scripts wrapping OpenTofu or Terraform commands often evolves into a fragile web of CI jobs, custom glue code, and tribal knowledge that no one feels confident changing.

The real question is not whether you can build your own IaC solution. Most teams can. The question is what it costs you in velocity, governance, and reliability once the platform becomes business-critical.

The Appeal and Cost of Custom IaC Pipelines

Building a custom IaC solution feels empowering at first. You control every detail. You understand exactly how plan and apply flows work. You can tailor pipelines to your team’s preferences without waiting on vendors or abstractions.

For small teams with simple requirements, this works. A basic OpenTofu or Terraform pipeline in GitHub Actions or GitLab CI can handle plan-on-pull-request and apply-on-merge patterns just fine. Add a manual approval step and a notification, and you are operational.
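The plan-on-pull-request, apply-on-merge pattern is small enough to sketch as a Python helper that decides which commands a CI job should run. The event names `pull_request` and `merge` are assumptions about how your CI system labels triggers; the commands themselves use the standard `init`, `plan -out`, and `apply` subcommands shared by OpenTofu (`tofu`) and Terraform (`terraform`). The helper only builds the command lists and leaves execution to the caller.

```python
def iac_commands(event: str, binary: str = "tofu") -> list:
    """Return the CLI invocations for a CI event, in the order they should run."""
    init = [binary, "init", "-input=false"]
    plan = [binary, "plan", "-input=false", "-out=tfplan"]
    if event == "pull_request":
        # Reviewers see the plan; nothing is changed.
        return [init, plan]
    if event == "merge":
        # Apply exactly the saved plan rather than re-planning implicitly.
        return [init, plan, [binary, "apply", "-input=false", "tfplan"]]
    raise ValueError(f"unsupported event: {event}")


if __name__ == "__main__":
    for event in ("pull_request", "merge"):
        for command in iac_commands(event):
            print(event, "->", " ".join(command))
```

Everything this helper leaves out (who may trigger the merge path, where state lives, how concurrent runs are locked) is where the custom approach starts to accumulate glue code.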

The problem is that infrastructure rarely stays simple.

As usage grows, the cracks start to appear:

Governance gaps: Who is allowed to apply changes to production? How are policies enforced consistently? What happens when someone bypasses the pipeline and runs apply locally?
State and workspace sprawl: Managing dozens or hundreds of state files, backend configurations, and locking behaviour becomes a coordination problem, not a scripting one.
Workflow inconsistency: Each team builds its own version of “the pipeline.” What starts as flexibility turns into a support burden when every repository behaves differently.
Security and audit blind spots: Secrets handling, access controls, audit trails, and drift detection are rarely first-class concerns in homegrown tooling. They become reactive fixes after something goes wrong.

At this point, the build vs buy IaC question stops being technical and becomes strategic.

What an IaCM Platform Is Actually Solving

An Infrastructure as Code Management (IaCM) platform cannot simply be labelled “CI for Terraform.” It exists to standardise how infrastructure changes are proposed, reviewed, approved, and applied across teams.

Instead of every team reinventing the same patterns, an IaCM platform provides shared primitives that scale.

Consistent Workspace Lifecycles

Workspaces are treated as first-class entities. Plans, approvals, applies, and execution history are visible in one place. When something fails, you do not have to reconstruct context from CI logs and commit messages.

Enforced IaC Governance

IaC governance stops being a best-practice document and becomes part of the workflow. Policy checks run automatically. Risky changes are surfaced early. Approval gates are applied consistently based on impact, not convention.

This matters regardless of whether teams are using OpenTofu as their open-source baseline or maintaining existing Terraform pipelines.

Centralised Variables and Secrets

Managing environment-specific configuration across large numbers of workspaces is one of the fastest ways to introduce mistakes. IaCM platforms provide variable sets and secure secret handling so values are managed once and applied consistently.

Drift Detection Without Custom Glue

Infrastructure drift is inevitable. Manual console changes, provider behaviour, and external automation all contribute. An IaCM platform detects drift continuously and surfaces it clearly, without relying on scheduled scripts parsing CLI output.
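For contrast, the homegrown check that such a platform replaces often looks something like the Python sketch below. It relies on the `-detailed-exitcode` flag of `plan`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The `check_drift` wrapper and the working-directory layout are assumptions for illustration; scheduling, notifications, and per-workspace credentials would all still need to be built around it.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Interpret the exit code of `plan -detailed-exitcode`."""
    if code == 0:
        return "in-sync"   # live infrastructure matches the configuration
    if code == 2:
        return "drifted"   # the plan contains changes
    return "error"         # exit code 1 (or anything unexpected)


def check_drift(workdir: str, binary: str = "tofu") -> str:
    """Run a plan in `workdir` and report whether drift was detected."""
    result = subprocess.run(
        [binary, "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, classify_plan_exit(code))
```

Note that a non-empty plan also appears when the code has changed but has not been applied yet, so a script like this cannot tell drift from pending work without extra bookkeeping.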

Module and Provider Governance

Reusable modules are essential for scaling IaC, but unmanaged reuse creates risk. A built-in module and provider registry ensures teams use approved, versioned components and reduces duplication across the organisation.

Why Building This Yourself Rarely Scales

Most platform teams underestimate how much work lives beyond the initial pipeline.

You will eventually need:

Role-based access control.
Approval workflows that vary by environment.
Audit logs that satisfy compliance reviews.
Concurrency controls and workspace locking.
Safe ways to evolve pipelines without breaking dozens of teams.

None of these are hard in isolation. Together, they represent a long-term maintenance commitment. Unless building IaC tooling is your product, this effort rarely delivers competitive advantage.

How Harness IaCM Changes the Build vs Buy Equation

Harness Infrastructure as Code Management (IaCM) is designed for teams that want control without rebuilding the same platform components over and over again.

It supports both OpenTofu and Terraform, allowing teams to standardise workflows even as tooling evolves. OpenTofu fits naturally as an open-source execution baseline for new workloads, while Terraform remains supported where existing investment makes sense.

Harness IaCM provides:

Default plan and apply pipelines that work out of the box.
Workspace templates to enforce consistent backend configuration and governance.
Module and provider registries to manage reuse safely.
Policy enforcement, security checks, and cost visibility built directly into every run.

Instead of writing and maintaining custom orchestration logic, teams focus on infrastructure design and delivery.

Drift detection, approvals, and audit trails are handled consistently across every workspace, without bespoke scripts or CI hacks.

Making a Deliberate Build vs Buy IaC Decision

The build vs buy IaC decision should be intentional, not accidental.

If your organisation has a genuine need to own every layer of its tooling and the capacity to maintain it long-term, building can be justified. For most teams, however, the operational overhead outweighs the benefits.

An IaCM platform provides faster time-to-value, stronger governance, and fewer failure modes as infrastructure scales.

Harness Infrastructure as Code Management enables teams to operationalise best practices for OpenTofu and Terraform without locking themselves into brittle, homegrown solutions.

The real question is not whether you can build this yourself. It is whether you want to be maintaining it when the platform becomes critical.

Explore Harness IaCM and move beyond fragile IaC pipelines.

Engineering Blog

Overcoming the AI Velocity Paradox in Security

Discover how organizations can bridge the critical gap between rapid AI code generation and lagging security processes to prevent vulnerabilities and compliance risks.

Vikas Gautam

January 23, 2026


The rapid adoption of AI is fundamentally reshaping the software development landscape, driving an unprecedented surge in code generation speed. However, this acceleration has created a significant challenge for security teams: the AI velocity paradox. This paradox describes a situation where the benefits of accelerated code generation are being "throttled by the SDLC processes downstream," such as security, testing, deployment, and compliance, which have not matured or automated at the same pace as AI has advanced the development process.

This gap is a recognized concern among industry leaders. In Harness’s latest State of AI in Software Engineering report, 48% of surveyed organizations worry that AI coding assistants introduce vulnerabilities, and 43% fear compliance issues stemming from untested, AI-generated code.

This blog post explores strategies for closing the widening gap and defending against the new attack surfaces created by AI tooling.


Defining the AI Velocity Paradox in Security

The AI velocity paradox is most acutely manifested in security. The benefits gained from code generation are being slowed down by downstream SDLC processes, such as testing, deployment, security, and compliance. This is because these processes have not "matured or automated at the same pace as code generation has."

Every time a coding agent or AI agent writes code, it has the potential to expand the threat surface. This can happen if the AI spins up a new application component, such as a new API, or pulls in unvalidated open-source models or libraries. If deployed without proper testing and validation, these components "can really expand your threat surface."

The imbalance is stark: code generation is up to 25% faster, and 70% of developers are shipping more frequently, yet only 46% of security compliance workflows are automated.


The Dual Risk: Vulnerabilities vs. Compliance

The Harness report revealed that 48% of respondents were concerned that AI coding assistants introduce vulnerabilities, while 43% feared regulatory exposure. While both risks are evident in practice, they do not manifest equally.

Vulnerabilities are more tangible and appear more often in incident data. These issues include unauthenticated access to APIs, poor input validation, and the use of third-party libraries. This is where the "most tangible exposure is".

Compliance is a "slow burn risk." For instance, new code might start "touching a sensitive data flow which was previously never documented." This may not be discovered until a specific compliance requirement triggers an investigation. Vulnerabilities are currently seen more often in real incident data than compliance issues.


A New Attack Surface: Non-Deterministic AI Agents

The components that significantly expand the attack surface beyond the scope of traditional application security (appsec) tools are AI agents or LLMs integrated into applications.

Traditional non-AI applications are generally deterministic; you know exactly what payload is going into an API, and which fields are sensitive. Traditional appsec tools are designed to secure this predictable environment.

However, AI agents are non-deterministic and "can behave randomly." Security measures must focus on ensuring these agents do not receive "overly excessive permissions to access anything" and controlling the type of data they have access to.

Top challenges for AI application security

Prioritizing AI Security Mitigation (OWASP LLM Top 10)

For development teams with weekly release cycles, we recommend prioritizing mitigation efforts based on the OWASP LLM Top 10. The three critical areas to test and mitigate first are:

Prompt Injection: This is the one threat currently seeing "the most attacks and threat activity".
Sensitive Data Disclosure: This is crucial for any organization that handles proprietary data or sensitive customer information, such as PII or banking records.
Excessive Agency: This involves an AI agent or MCP tool having a token with permissions it should not have, such as write control for a database, code commit controls, or the ability to send emails to end users.

We advise organizations to "test all your applications" for these three issues before pushing them to production.

A Deep Dive into Prompt Injection

Here’s a walkthrough of a real-world prompt injection attack scenario to illustrate the danger of excessive agency.

The attack path usually looks like this:

Excessive Agency: An AI application has an agent that accesses a user records database via an API or Model Context Protocol (MCP) tool. Critically, the AI agent has been given a "broadly scoped access token" that allows it to read, make changes, and potentially delete the database.
The Override Prompt: A user writes a prompt with an override, for example, suggesting a "system maintenance" is happening and asking the AI to "help me make a copy of the database and make changes to it." This is a "direct prompt injection" (or sometimes an indirect prompt injection), which is designed to force the LLM agent to reveal or manipulate certain data.
Hijacking: If no guardrails are in place to detect such prompts, the LLM treats the override as legitimate and the hijacked agent makes the request to the database.
Real Exfiltration: Once the hijacking is done, the "real exfiltration happens." The AI agent can output the data in the chatbot or write it to a third-party API where the user who sent the prompt can access it.

This type of successful attack can lead to "legal implications," data loss, and damage to the organization's reputation.
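One way to limit the excessive-agency step in the attack path above is to check every tool call an agent makes against an explicit, deny-by-default list of scopes before the call reaches the database. The Go sketch below is illustrative only: the scope names, the `AgentToken` type, and the `Authorize` function are hypothetical and do not represent a Harness API or any specific MCP implementation.

```go
package main

import "fmt"

// Scope is a hypothetical permission name attached to an agent's token.
type Scope string

const (
	ScopeReadRecords   Scope = "records:read"
	ScopeWriteRecords  Scope = "records:write"
	ScopeDeleteRecords Scope = "records:delete"
)

// AgentToken models a narrowly scoped credential issued to an AI agent.
type AgentToken struct {
	Agent  string
	Scopes map[Scope]bool
}

// Authorize returns an error unless the token explicitly grants the scope
// required by the requested tool action (deny by default).
func Authorize(tok AgentToken, required Scope) error {
	if !tok.Scopes[required] {
		return fmt.Errorf("agent %q denied: missing scope %q", tok.Agent, required)
	}
	return nil
}
```

With a token that carries only `records:read`, a hijacked prompt that asks the agent to copy or modify the database fails at `Authorize` instead of at the database, regardless of what the prompt says.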

Here’s a playbook to tackle Prompt Injection attacks

Harness’s Vision for AI Security

Harness's approach to closing the AI security gap is built on three pillars:

AI Asset Discovery and Posture Management: This involves automatically discovering all AI assets (APIs, LLMs, MCP tools, etc.) by analyzing application traffic. This capability eliminates the "blind spot" that application security teams often have with "shadow AI," where developers do not document new AI assets. The platform automatically provides sensitive data flows and governance policies, helping you be audit-ready, especially if you operate in a regulated industry.
AI Security Testing: This helps organizations test their applications against AI-specific attacks before they are shipped to production. Harness's product supports DAST scans for the OWASP LLM Top 10, which can be executed as part of a CI/CD pipeline.
AI Runtime Protection: This focuses on detecting and blocking AI threats such as prompt injection, jailbreak attempts, data exfiltration, and policy violations in real time. It gives security teams immediate visibility and enforcement without impacting application performance or developer velocity.

Read more about Harness AI security in our blog post.

Looking Ahead: The Evolving Attack Landscape

Looking six to 12 months ahead, the biggest risks come from autonomous agents, deeper tool chaining, and multimodal orchestration. The game has changed: it is now a matter of "AI code-based risk versus decision risk."

Security teams must focus on upgrading their security and testing capabilities to understand the decision risk, specifically "what kind of data is flowing out of the system and what kind of things are getting exposed." The key is to manage the non-deterministic nature of AI applications.

To stay ahead, a phased maturity roadmap is recommended:

Start with visibility.
Move to testing.
Then, focus on runtime protection.

By focusing on automation, prioritizing the most critical threats, and adopting a platform that provides visibility, testing, and protection, organizations can manage the risks introduced by AI velocity and build resilient AI-native applications.

Learn more about tackling the AI velocity paradox in security in this webinar.

Engineering Blog

Theory to Turbulence: Building a Developer-Friendly E2E Testing Framework for Chaos Platform

How we reduced chaos fault validation setup from 30 minutes to 5 using an API-driven, developer-first E2E testing framework.

Vedant Shrotria

December 22, 2025

As an enterprise chaos engineering platform vendor, validating chaos faults is not optional — it’s foundational. Every fault we ship must behave predictably, fail safely, and produce measurable impact across real-world environments.

When we began building our end-to-end (E2E) testing framework, we quickly ran into a familiar problem: the barrier to entry was painfully high.

Running even a single test required a long and fragile setup process:

Installing multiple dependencies by hand
Configuring a maze of environment variables
Writing YAML-based chaos experiments manually
Debugging cryptic validation failures
Only then… executing the first test

This approach slowed feedback loops, discouraged adoption, and made iterative testing expensive — exactly the opposite of what chaos engineering should enable.

The Solution: A Simplified Chaos Fault Validation Framework

To solve this, we built a comprehensive yet developer-friendly E2E testing framework for chaos fault validation. The goal was simple: reduce setup friction without sacrificing control or correctness.

The result is a framework that offers:

An API-driven execution model instead of manual YAML wiring
Real-time log streaming for faster debugging and observability
Intelligent target discovery to eliminate repetitive configuration
Dual-phase validation to verify both fault injection and system impact

What previously took 30 minutes (or more) to set up and run can now be executed in under 5 minutes — consistently and at scale.

*A real execution run — proving that chaos validation doesn’t have to be chaotic —* ***From theory to turbulence***

System Architecture

High-Level Architecture

Layer Responsibilities

Core Components

1. Experiment Runner

Purpose: Orchestrates the complete chaos experiment lifecycle from creation to validation.

Key Responsibilities:

Experiment creation with variable substitution
Log streaming and target discovery
Concurrent validation management
Status monitoring and completion detection
Error handling and retry logic

Architecture Pattern: Template Method + Observer

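// ExperimentRunner drives a single chaos experiment from creation to validation.
type ExperimentRunner struct {
    identifiers utils.Identifiers
    config      ExperimentConfig
}

// ExperimentConfig describes the experiment to run and how to validate it.
type ExperimentConfig struct {
    Name                  string
    FaultName             string
    ExperimentYAML        string
    InfraID               string
    InfraType             string
    TargetNamespace       string
    TargetLabel           string
    TargetKind            string
    FaultEnv              map[string]string // custom template variables
    Timeout               time.Duration
    SkipTargetDiscovery   bool
    ValidationDuringChaos ValidationFunc // sampled while the fault is active
    ValidationAfterChaos  ValidationFunc // run once after the experiment completes
    SamplingInterval      time.Duration  // interval between ValidationDuringChaos samples
}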

Execution Flow:

Run() → 
  1. getLogToken()
  2. triggerExperimentWithRetry()
  3. Start experimentMonitor
  4. extractStreamID()
  5. getTargetsFromLogs()
  6. runValidationDuringChaos() [parallel]
  7. waitForCompletion()
  8. Validate ValidationAfterChaos
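Step 2 of the flow above retries the trigger call. The framework's actual retry policy is not shown in this post, so the following is only a minimal sketch of one common approach (a fixed number of attempts with a doubling delay). The `retry` helper and the `errNotReady` sentinel are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical transient error, used only for illustration.
var errNotReady = errors.New("experiment API not ready")

// retry calls fn up to attempts times (at least once), doubling the delay
// after each failed attempt, and wraps the last error if every attempt fails.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}
```

A call such as `retry(3, 2*time.Second, trigger)` would wait 2s and then 4s between attempts, where `trigger` stands in for whatever function starts the experiment run.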

2. Experiment Monitor

Purpose: Centralized experiment status tracking using a publish-subscribe pattern.

Architecture Pattern: Observer Pattern

type experimentMonitor struct {
    experimentID string
    runResp      *experiments.ExperimentRunResponse
    identifiers  utils.Identifiers
    stopChan     chan bool
    statusChan   chan string
    subscribers  []chan string
}

Key Methods:

start(): Begin monitoring (goroutine)
subscribe(): Create subscriber channel
broadcast(status): Notify all subscribers
stop(): Signal monitoring to stop

Benefits:

80% reduction in API calls
92% faster failure detection
Single source of truth
Easy to add new consumers
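The subscribe and broadcast methods listed above can be sketched with plain Go channels. This is a simplified stand-in, not the framework's implementation: the `statusMonitor` type, the buffer size of 8, and the choice to drop updates for slow subscribers are all assumptions made for the example.

```go
package main

import "sync"

// statusMonitor is a simplified, hypothetical stand-in for experimentMonitor.
type statusMonitor struct {
	mu          sync.Mutex
	subscribers []chan string
}

// subscribe registers a buffered channel that receives every later status.
func (m *statusMonitor) subscribe() <-chan string {
	ch := make(chan string, 8)
	m.mu.Lock()
	m.subscribers = append(m.subscribers, ch)
	m.mu.Unlock()
	return ch
}

// broadcast delivers a status to all subscribers without blocking;
// a subscriber whose buffer is full misses that update.
func (m *statusMonitor) broadcast(status string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, ch := range m.subscribers {
		select {
		case ch <- status:
		default:
		}
	}
}

// stop closes every subscriber channel so that range loops over them end.
func (m *statusMonitor) stop() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, ch := range m.subscribers {
		close(ch)
	}
	m.subscribers = nil
}
```

Because only the monitor polls the status API and every consumer reads from its own channel, adding a consumer costs one extra `subscribe()` call rather than one extra polling loop.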

3. Validation Framework

Purpose: Dual-phase validation system for concrete chaos impact verification.

ValidationDuringChaos

Runs in parallel during experiment
Continuous sampling at configurable intervals
Stops when validation passes
Use case: Verify active fault impact

ValidationAfterChaos

Runs once after experiment completes
Single execution for final state
Use case: Verify recovery and cleanup

Function Signature:

type ValidationFunc func(targets []string, namespace string) (bool, error)
// Returns: (passed bool, error)
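The ValidationDuringChaos behavior described above (sample at an interval, stop once validation passes) might look roughly like the loop below. This is a sketch under stated assumptions: `sampleDuringChaos` and the `done` channel are hypothetical names, aborting on the first validation error is a choice made for the example, and `interval` must be greater than zero because `time.NewTicker` panics otherwise.

```go
package main

import "time"

// ValidationFunc mirrors the signature shown above.
type ValidationFunc func(targets []string, namespace string) (bool, error)

// sampleDuringChaos calls validate every interval until it passes, returns an
// error, or done is closed (for example, when the experiment reaches a
// terminal status).
func sampleDuringChaos(done <-chan struct{}, interval time.Duration,
	validate ValidationFunc, targets []string, namespace string) (bool, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		passed, err := validate(targets, namespace)
		if err != nil || passed {
			return passed, err
		}
		select {
		case <-done:
			// The experiment ended before the fault impact was observed.
			return false, nil
		case <-ticker.C:
			// Time for the next sample.
		}
	}
}
```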

Sample Validation Categories:

Experiment Execution Engine

Execution Phases

Phase 1: Setup
├─ Load configuration
├─ Authenticate with API
└─ Validate environment

Phase 2: Preparation
├─ Get log stream token
├─ Resolve experiment YAML path
├─ Substitute template variables
└─ Create experiment via API

Phase 3: Execution
├─ Trigger experiment run
├─ Start status monitor
├─ Extract stream ID
└─ Discover targets from logs

Phase 4: Validation (Concurrent)
├─ Validation During Chaos (parallel)
│  ├─ Sample at intervals
│  ├─ Check fault impact
│  └─ Stop when passed/completed
└─ Wait for completion

Phase 5: Post-Validation
├─ Validation After Chaos
├─ Check recovery
└─ Final assertions

Phase 6: Cleanup
├─ Stop monitor
├─ Close channels
└─ Log results

State Machine

Concurrency Model

Main Thread:
├─ Create experiment
├─ Start monitor goroutine
├─ Start target discovery goroutine
├─ Start validation goroutine [if provided]
└─ Wait for completion
Monitor Goroutine:
├─ Poll status every 5s
├─ Broadcast to subscribers
└─ Stop on terminal status
Target Discovery Goroutine:
├─ Subscribe to monitor
├─ Poll for targets every 5s
├─ Listen for failures
└─ Return when found or failed
Validation Goroutine:
├─ Subscribe to monitor
├─ Run validation at intervals
├─ Listen for completion
└─ Stop when passed or completed

API Integration Layer

API Client Architecture

Variable Substitution System

Template Format: {{ VARIABLE_NAME }}

Built-in Variables:

INFRA_NAMESPACE          // Infrastructure namespace
FAULT_INFRA_ID          // Infrastructure ID (without env prefix)
EXPERIMENT_INFRA_ID     // Full infrastructure ID (env/infra)
TARGET_WORKLOAD_KIND    // deployment, statefulset, daemonset
TARGET_WORKLOAD_NAMESPACE // Target namespace
TARGET_WORKLOAD_NAMES   // Specific workload names (or empty)
TARGET_WORKLOAD_LABELS  // Label selector
EXPERIMENT_NAME         // Experiment name
FAULT_NAME              // Fault type
TOTAL_CHAOS_DURATION    // Duration in seconds
CHAOS_INTERVAL          // Interval between chaos actions
ADDITIONAL_ENV_VARS     // Fault-specific environment variables

Custom Variables: Passed via FaultEnv map in ExperimentConfig.
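A minimal sketch of how `{{ VARIABLE_NAME }}` placeholders could be resolved is shown below. The `substitute` function and its behavior on missing variables (return an error rather than leave the placeholder in place) are assumptions for illustration; the post does not describe how the framework handles unresolved variables.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {{ NAME }} with optional spaces inside the braces.
var placeholder = regexp.MustCompile(`\{\{\s*([A-Z0-9_]+)\s*\}\}`)

// substitute replaces every {{ NAME }} in tmpl with vars[NAME]. It returns
// an error naming the first placeholder that has no entry in vars.
func substitute(tmpl string, vars map[string]string) (string, error) {
	var missing string
	out := placeholder.ReplaceAllStringFunc(tmpl, func(m string) string {
		name := placeholder.FindStringSubmatch(m)[1]
		val, ok := vars[name]
		if !ok && missing == "" {
			missing = name
		}
		return val
	})
	if missing != "" {
		return "", fmt.Errorf("no value for template variable %q", missing)
	}
	return out, nil
}
```

An entry that is present but empty (as TARGET_WORKLOAD_NAMES may be) is substituted as an empty string and is not treated as missing.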

Validation Framework

Architecture

Validation Categories

1. Resource Validators

ValidatePodCPUStress(targets, namespace) (bool, error)
ValidatePodMemoryStress(targets, namespace) (bool, error)
ValidateDiskFill(targets, namespace) (bool, error)
ValidateIOStress(targets, namespace) (bool, error)

Detection Logic:

CPU: Usage > baseline + 30%
Memory: Usage > baseline + 20%
Disk: Usage > 80%
I/O: Read/write operations elevated
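The thresholds above can be expressed as small predicates. The sketch below assumes that "+ 30%" and "+ 20%" mean percentage points above the baseline reading (not a relative increase), which the post does not specify; the function names are hypothetical.

```go
package main

// Thresholds from the detection logic above. Treating the CPU and memory
// deltas as percentage points over the baseline is an assumption.
const (
	cpuDeltaPoints    = 30.0
	memoryDeltaPoints = 20.0
	diskFullPercent   = 80.0
)

// cpuStressed reports whether CPU usage exceeds the baseline by more than 30 points.
func cpuStressed(baselinePct, currentPct float64) bool {
	return currentPct > baselinePct+cpuDeltaPoints
}

// memoryStressed reports whether memory usage exceeds the baseline by more than 20 points.
func memoryStressed(baselinePct, currentPct float64) bool {
	return currentPct > baselinePct+memoryDeltaPoints
}

// diskFilled reports whether disk usage is above 80%.
func diskFilled(usagePct float64) bool {
	return usagePct > diskFullPercent
}
```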

2. Network Validators

ValidateNetworkLatency(targets, namespace) (bool, error)
ValidateNetworkLoss(targets, namespace) (bool, error)
ValidateNetworkCorruption(targets, namespace) (bool, error)

Detection Methods:

Ping latency measurements
Packet loss percentage
Checksum errors

3. Pod Lifecycle Validators

ValidatePodDelete(targets, namespace) (bool, error)
ValidatePodRestarted(targets, namespace) (bool, error)
ValidatePodsRunning(targets, namespace) (bool, error)

Verification:

Pod age comparison
Restart count increase
Ready status check

4. Application Validators

ValidateAPIBlock(targets, namespace) (bool, error)
ValidateAPILatency(targets, namespace) (bool, error)
ValidateAPIStatusCode(targets, namespace) (bool, error)
ValidateFunctionError(targets, namespace) (bool, error)

5. Redis Validators

ValidateRedisCacheLimit(targets, namespace) (bool, error)
ValidateRedisCachePenetration(targets, namespace) (bool, error)
ValidateRedisCacheExpire(targets, namespace) (bool, error)

Direct Validation: Executes redis-cli INFO in pod, parses metrics
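Parsing the metrics step can be sketched as follows. `redis-cli INFO` prints `key:value` lines grouped under `# Section` headers; the `parseRedisInfo` helper below is a hypothetical example of turning that output into a map, and it leaves the in-pod command execution out of scope.

```go
package main

import (
	"bufio"
	"strings"
)

// parseRedisInfo turns the "key:value" lines printed by `redis-cli INFO`
// into a map, skipping blank lines and "# Section" headers.
func parseRedisInfo(output string) map[string]string {
	metrics := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Split on the first colon only, so values may contain colons.
		if key, value, ok := strings.Cut(line, ":"); ok {
			metrics[key] = value
		}
	}
	return metrics
}
```

A validator could then compare fields such as `used_memory` against `maxmemory`, or check whether `evicted_keys` increased during the fault.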

Validation Best Practices

Data Flow & Lifecycle

Complete Experiment Lifecycle

Data Structures Flow

// Input
ExperimentConfig
    ↓
// API Creation
ExperimentPayload (JSON)
    ↓
// API Response
ExperimentResponse {ExperimentID, Name}
    ↓
// Run Request
ExperimentRunRequest {NotifyID}
    ↓
// Run Response
ExperimentRunResponse {ExperimentRunID, Status, Nodes}
    ↓
// Log Streaming
StreamToken + StreamID
    ↓
// Target Discovery
[]string (target pod names)
    ↓
// Validation
ValidationFunc(targets, namespace) → (bool, error)
    ↓
// Final Result
Test Pass/Fail with error details

Performance & Scalability

Performance Metrics

Concurrent Test Execution

Each test gets isolated namespace
Separate experiment instances
No shared state between tests
Parallel execution supported

Example Usage of Framework

RunExperiment(ExperimentConfig{
    Name:            "CPU Stress Test",
    FaultName:       "pod-cpu-hog",
    InfraID:         infraID,
    ProjectID:       projectId,
    TargetNamespace: targetNamespace,
    TargetLabel:     "app=nginx", // Customize based on your test app
    TargetKind:      "deployment",
    FaultEnv: map[string]string{
        "CPU_CORES":            "1",
        "TOTAL_CHAOS_DURATION": "60",
        "PODS_AFFECTED_PERC":   "100",
        "RAMP_TIME":            "0",
    },
    Timeout:          timeout,
    SamplingInterval: 5 * time.Second, // Check every 5 seconds during chaos

    // Verify CPU is stressed during chaos
    ValidationDuringChaos: func(targets []string, namespace string) (bool, error) {
        clientset, err := faultcommon.GetKubeClient()
        if err != nil {
            return false, err
        }
        return validations.ValidatePodCPUStress(clientset, targets, namespace)
    },

    // Verify pods recovered after chaos
    ValidationAfterChaos: func(targets []string, namespace string) (bool, error) {
        clientset, err := faultcommon.GetKubeClient()
        if err != nil {
            return false, err
        }
        return validations.ValidateTargetAppsHealthy(clientset, targets, namespace)
    },
})

Knowledge Sharing and Learning

While this framework is proprietary and used internally, we believe in sharing knowledge and best practices. The patterns and approaches we’ve developed can help other teams building similar testing infrastructure.

Key Takeaways for Your Team

Whether you’re building a chaos engineering platform, testing distributed systems, or creating any complex testing infrastructure, these principles apply:

Measure your baseline — Know how long things take today
Set ambitious goals — 10x improvements are possible
Prioritize DX — Developer experience drives adoption
Automate ruthlessly — Eliminate manual steps
Share your learnings — Help others avoid the same pitfalls
Collect user feedback
Celebrate improvements!

We hope these insights help you build better testing infrastructure for your team!

Questions? Feedback? Ideas? Join the Harness community. We’d love to hear about your testing challenges and how you’re solving them!

Engineering Blog

Knowledge Graph + RAG: A Unified Approach to DevOps Intelligence

Learn how Harness uses a software delivery knowledge graph, a semantic layer, and RAG together to give DevOps teams deeply contextual, trustworthy AI automation that goes far beyond “chat over docs.”

Sunil Gattupalle

December 17, 2025

Knowledge graphs and RAG (Retrieval-Augmented Generation) are complementary techniques for enhancing large language models with external knowledge, and each brings unique strengths for DevOps use cases. While they are often mentioned together, they are fundamentally different systems, and combining them delivers far better outcomes than relying on either approach alone.

Core Differences

A knowledge graph is a semantic model composed of entities and relationships that reflect how systems, services, code, environments, and people connect. These entities may come from Harness or from third-party DevOps tools. Retrieval from a knowledge graph can be:

Structured: via graph queries that traverse relationships
Unstructured: via semantic indexing of graph-connected content

The foundation of the knowledge graph is its semantic layer, which serves as the source of truth for the structure and meaning of the data. This semantic layer defines what an “application,” “pipeline,” “service,” “environment,” “deployment,” or “policy” means - not just how it is stored. This enforces consistent definitions across tools, eliminates ambiguity, and grounds all reasoning in shared meaning.

Because the semantic layer governs how data flows into the graph, it ensures the graph scales cleanly, remains governable, and can incorporate new tools, relationships, and metadata without becoming chaotic.

RAG, by contrast, retrieves unstructured text (documents, runbooks, incident notes, commit messages, architecture diagrams) using embedding similarity and feeds the retrieved content to an LLM. RAG does not model structure or relationships; it retrieves relevant fragments of text.
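To make "embedding similarity" concrete, the sketch below ranks text chunks by cosine similarity to a query vector and returns the top matches. It is a toy illustration, not how Harness implements retrieval: the `Chunk` type and `topK` function are hypothetical, and the embeddings are assumed to be precomputed by some embedding model.

```go
package main

import (
	"math"
	"sort"
)

// Chunk is a piece of text with a precomputed embedding vector.
type Chunk struct {
	Text      string
	Embedding []float64
}

// cosine returns the cosine similarity of a and b, or 0 when the vectors
// differ in length or either has zero magnitude.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the k chunks most similar to the query embedding,
// leaving the input slice unmodified.
func topK(query []float64, chunks []Chunk, k int) []Chunk {
	ranked := append([]Chunk(nil), chunks...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(query, ranked[i].Embedding) > cosine(query, ranked[j].Embedding)
	})
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}
```

Nothing in this ranking knows which team owns a service or which pipeline deploys it; it only measures how close two vectors are.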

The fundamental distinction lies in structure:

A knowledge graph encodes explicit, machine-interpretable relationships.
RAG retrieves text based on semantic similarity, without understanding how the connections work.

This is why the two approaches excel at different types of problems.

Strengths and Limitations

Knowledge Graph Strengths

Knowledge graphs excel at multi-hop reasoning, where answering a question requires walking multiple relationships — linking a failing service to its owning team, its CI pipeline, the associated environment, and the policies governing that environment.

They offer:

strong explainability
traceable reasoning
lineage and dependency analysis
organizational context awareness
consistent governance enforced by the semantic layer

The primary limitation is that a knowledge graph can only answer questions about the data it models.
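Multi-hop reasoning of the kind described above can be sketched as a walk over typed edges. The example below is a toy in-memory graph, not the Harness Software Delivery Knowledge Graph: the `Edge` and `Graph` types, the relation names, and the entity names used in the usage note are all hypothetical.

```go
package main

// Edge is a typed, directed relationship between two entities.
type Edge struct {
	From, Relation, To string
}

// Graph indexes edges by source entity and then by relation name.
type Graph map[string]map[string][]string

// NewGraph builds the index from a list of edges.
func NewGraph(edges []Edge) Graph {
	g := Graph{}
	for _, e := range edges {
		if g[e.From] == nil {
			g[e.From] = map[string][]string{}
		}
		g[e.From][e.Relation] = append(g[e.From][e.Relation], e.To)
	}
	return g
}

// Walk follows a chain of relations from start and returns every entity
// reachable by taking exactly those hops in order.
func (g Graph) Walk(start string, relations ...string) []string {
	frontier := []string{start}
	for _, rel := range relations {
		var next []string
		for _, node := range frontier {
			next = append(next, g[node][rel]...)
		}
		frontier = next
	}
	return frontier
}
```

Given edges such as service `deployed_by` pipeline, pipeline `targets` environment, and environment `governed_by` policy, a call like `g.Walk("checkout-service", "deployed_by", "targets", "governed_by")` returns the policies that apply to a failing service, and each hop is an explicit edge that can be shown to the user as the reasoning trail.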

RAG Strengths

RAG systems shine when working with unstructured information at scale. They are excellent for:

documentation search
incident history retrieval
architecture and API references
runbook guidance
open-ended queries

However, RAG struggles with questions that require:

relationship reasoning
ownership inference
dependency mapping
policy or environment constraints
multi-step chains of logic

RAG retrieves text. It does not understand structure.

Hybrid Approaches

Modern DevOps AI systems increasingly combine both approaches:

RAG provides breadth — rich unstructured context.
The knowledge graph provides depth — structured reasoning and grounding.
The semantic layer provides stability — consistent meaning and scalable governance.

The result is retrieval and reasoning that are not only relevant but also organized, contextualized, and aligned with the real structure of the software delivery environment.

Why Knowledge Graphs Excel in DevOps

DevOps environments are inherently relationship-heavy: pipelines, services, environments, teams, approvals, policies, artifacts, and dependencies all interact tightly.

A knowledge graph captures these interactions explicitly.

The semantic layer ensures that as systems evolve, definitions remain consistent.

This gives AI agents true organizational context — not just textual familiarity.

With a graph-backed semantic model, agents can reason about:

ownership
dependency chains
deployment pathways
policy enforcement
environment behavior
compliance boundaries

This is essential for generating pipelines, validating changes, automating deployments, and performing impact analysis.

Limitations of RAG for DevOps

RAG is excellent for retrieving documentation, API references, runbooks, and historical incidents. But it cannot reliably infer:

which team owns a service
which pipeline deploys that service
which environments are impacted
which policies apply
what dependencies exist and how they cascade

RAG retrieves text; it does not reason across structured relationships.

This limits RAG-only approaches to “chatbots over docs,” which is useful but insufficient for deeper automation.

Hybrid Approaches Emerging

A hybrid system uses both unstructured retrieval (RAG) and structured context (knowledge graph) to produce highly accurate, domain-aware answers. The semantic layer ensures that the graph remains consistent and scalable even as the organization grows.

This combination enables:

context-aware pipeline generation
graph-grounded debugging
multi-step orchestration
data-driven governance
safe automation across tools

Knowledge Graphs Benefit More Than AI

Knowledge graphs — and especially the semantic layer behind them — benefit the entire engineering ecosystem, not just AI.

They provide:

a unified, shared set of definitions across the SDLC
governance and data quality enforcement
lineage and dependency mapping
centralized metadata consistency
better observability and reporting
clean integration across tools

AI simply leverages this foundation to become more grounded, less error-prone, and deeply contextual.

Harness’s Hybrid Implementation

Harness uses a Software Delivery Knowledge Graph built on a semantic model that continuously synchronizes entities and relationships across Harness modules and third-party DevOps tools. The semantic layer defines meaning and ensures structure, while RAG enriches the system with unstructured context.

This enables AI agents to:

generate pipelines aligned with org standards
automatically debug issues with traceable reasoning
execute root-cause analysis across dependencies
perform safe rollbacks constrained by policies

Results include:

85% faster pipeline onboarding
7x faster issue resolution
50% less debugging time

This is possible because the system blends semantic structure (knowledge graph), meaning (semantic layer), and breadth of context (RAG), producing far more reliable DevOps automation than any single method alone. We'll be writing more about the Knowledge Graph in upcoming blog posts.

Engineering Blog

From Concept to Reality: The Journey Behind Harness Database DevOps

Discover the story behind Harness Database DevOps, how research, community learning, and developer empathy shaped a platform designed to modernize database delivery.

Animesh Pathak

Stephen Atwell

Matt Schillerstrom

November 26, 2025

When I look back at how Harness Database DevOps came to life, it feels less like building a product and more like solving a collective industry puzzle, one piece at a time. Every engineer, DBA, and DevOps practitioner I met had their own version of the same story: application delivery had evolved rapidly, but databases were still lagging behind. Schema changes were risky, rollbacks were manual, and developers hesitated to touch the database layer for fear of breaking something critical.

That was where our journey began, not with an idea, but with a question: “What if database delivery could be as effortless, safe, and auditable as application delivery?”

The Problem We Couldn’t Ignore

At Harness, we’ve always been focused on making software delivery faster, safer, and more developer-friendly. But as we worked with enterprises across industries, one recurring gap became clear: while teams were automating CI/CD pipelines for applications, database changes were still handled in silos.

The process was often manual: SQL scripts being shared over email, version control inconsistencies, and late-night hotfixes that no one wanted to own. Even with existing tools, there was a noticeable disconnect between database engineers, developers, and platform teams. The result was predictable - slow delivery cycles, high change failure rates, and limited visibility.

We didn’t want to simply build another migration tool. We wanted to redefine how databases fit into the modern CI/CD narrative, how they could become first-class citizens in the software delivery pipeline.

Listening Before Building

Before writing a single line of code, we started by listening to DBAs, developers, and release engineers who lived through these challenges every day.

Our conversations revealed a few consistent pain points:

Database schema changes lacked version control discipline.
Rollbacks were error-prone and undocumented, especially across multiple environments.
Application and database delivery cycles were never truly aligned.
Teams had limited observability into what changed, when, and by whom.

We also studied existing open-source practices. Many of us were active contributors or long-time users of Liquibase, which had already set strong foundations for schema versioning. Our goal was not to replace those efforts, but to learn from them, build upon them, and align them with the Harness delivery ecosystem.

That’s when the real learning began: understanding how different organizations implement Liquibase, how they handle rollbacks, and how schema evolution differs between teams using PostgreSQL, MySQL, or Oracle.

This phase of research and contribution provided us with valuable insights: while the tooling existed, the real challenge was operational, integrating database changes into CI/CD pipelines without friction or risk.

From Research to Blueprint

Armed with insights, we began sketching the first blueprints of what would eventually become Harness Database DevOps. Our design philosophy was simple:

Meet teams where they are. Integrate seamlessly with existing tools, such as Liquibase and Flyway.
Enable progressive automation. Let teams start small and grow into full automation.
Empower every role. Whether you’re a DBA or developer, you should have clarity and control over database delivery.

Early prototypes focused on automating schema migration, enforcing policy compliance, and building audit trails for database changes. But we soon realized that wasn’t enough.

Database delivery isn’t just about applying migrations; it’s about governance, visibility, and confidence. Developers needed fast feedback loops; DBAs needed assurance that governance was intact; and platform teams needed to integrate it into their broader CI/CD fabric. That realization reshaped our vision entirely.

Building the Foundation

We started with the fundamentals: source control and pipelines. Every database change, whether a script or a declarative state definition, needed to be versioned, automatically tested, and traceable.

To make this work at scale, we leveraged script-based migrations. This allowed teams to track the actual change scripts applied to reach a given database state, ensuring alignment and transparency. The next challenge was automation. We wanted pipelines that could handle complex database lifecycles (provisioning instances, running validations, managing approvals, and executing rollbacks), all within a CI/CD workflow familiar to developers.

This was where the engineering creativity of our team truly shone. We integrated database delivery into Harness Pipelines, enabling one-click deployments and policy-driven rollbacks with complete auditability.

Our internal mantra became: “If it’s repeatable, it’s automatable.”

Evolving Through Feedback

Our first internal release was both exciting and humbling. We quickly learned that every organization manages database delivery differently. Some teams followed strict change control. Others moved fast and valued agility over structure.

To bridge that gap, we focused on flexibility, which allowed teams to define their own workflows, environments, and policies while keeping governance seamlessly built in.

We also realized the importance of observability. Teams didn’t just want confirmation that a migration succeeded; they wanted to understand “why something failed”, “how long it took”, and “what exactly changed” behind the scenes.

Each round of feedback, from customers and our internal teams, helped us to refine the product further. Every iteration made it stronger, smarter, and more aligned with real-world engineering needs. And the journey wasn’t just about code; it was about collaboration and teamwork. Here’s how Harness Database DevOps connects every role in the database delivery lifecycle.

The People Behind the Platform

Behind every release stood a passionate team of engineers, product managers, customer success engineers, and developer advocates with a shared mission: to make database delivery seamless, safe, and scalable.

We spent long nights debating rollback semantics, early mornings testing changelog edge cases, and countless hours perfecting pipeline behavior under real workloads. It wasn’t easy, but it mattered.

This wasn’t just about building software; it was about building trust between developers and DBAs, between automation and human oversight. When we finally launched Harness Database DevOps, it didn’t feel like a product release. It felt like the beginning of something bigger, a new way to bring automation and accountability to database delivery.

What makes us proud isn’t just the technology. It’s “how we built it”, with empathy, teamwork, and a deep partnership with our customers from day one. Together with our design partners, we shaped every iteration to ensure what we were building truly reflected their needs and that database delivery could evolve with the same innovation and collaboration that define the rest of DevOps.

Built with Customers, Trusted by Teams

After months of iteration, user testing, and refinements, Harness Database DevOps entered private beta in early 2024. The excitement was immediate. Teams finally saw their database workflows appear alongside application deployments, approvals, and governance checks, all within a single pipeline.

During the beta, more than thirty customers participated, offering feedback that directly shaped the product. Some asked for folder-based trunk deployments. Others wanted deeper rollback intelligence. Some wanted Harness to help their developers design and author changes in the first place. Many just wanted to see what was happening inside their database environments.

By the time general availability rolled around, Database DevOps had evolved into a mature platform, not just a feature. It offered migration state tracking, rollback mechanisms, environment isolation, policy enforcement, and native integration with the Harness ecosystem.

But more importantly, it delivered something intangible: trust. Teams could finally move faster without sacrificing control.

The Road Ahead

Database DevOps is still an evolving space. Every new integration, every pipeline enhancement, every database engine we support takes us closer to a world where managing schema changes is as seamless as deploying code.

Our mission remains the same: to help teams move fast without breaking things, to give developers confidence without compromising governance, and to make database delivery as modern as the rest of DevOps.

And as we continue this journey, one thing is certain: the story of Harness Database DevOps isn’t just about a product. It’s about reimagining what’s possible when empathy meets engineering.

Closing Thoughts

From its earliest whiteboard sketch to production pipelines across enterprises, Harness Database DevOps is the product of curiosity, collaboration, and relentless iteration. It was never about reinventing databases. It was about rethinking how teams deliver change, safely, visibly, and confidently.

And that journey, from concept to reality, continues every day with every release, every migration, and every team that chooses to make their database a part of DevOps.



You’re Late to the OpenTofu Party. Here’s Why That’s a Problem.

OpenTofu is revolutionizing Infrastructure as Code. Join us and contribute to the future of open-source automation today!

Richard Black

November 6, 2025


Are you still using Terraform without realizing the party has already moved on?

For years, Terraform was the default language of Infrastructure as Code (IaC). It offered predictability, community, and portability across cloud providers. But then, the music stopped. In 2023, HashiCorp changed Terraform’s license from Mozilla Public License (MPL) to the Business Source License (BSL), a move that put guardrails around what users and competitors could do with the code.

That shift opened a door for something new and truly open.

That “something” is OpenTofu.

And if you’re not already using or contributing to it, you’re missing your chance to help shape the future of infrastructure automation.

The fork that changed IaC forever

OpenTofu didn’t just appear out of thin air. It was born from community demand, a collective realization that Terraform’s BSL license could limit the open innovation that made IaC thrive in the first place.

So OpenTofu forked from Terraform’s last open source MPL version and joined the Linux Foundation, ensuring that it would remain fully open, community-governed, and vendor-neutral. A true Terraform alternative.

Unlike Terraform’s now-centralized governance, OpenTofu’s roadmap is decided by contributors, people building real infrastructure at real companies, not by a single commercial entity.

That means if you depend on IaC tools to build and scale your environments, your voice actually matters here.

Why OpenTofu is gaining momentum

OpenTofu is not a “different tool.” It’s a continuation, the same HCL syntax, same workflows, and same mental model, but under open governance and a faster, community-driven release cadence.

Let’s break down the Terraform vs OpenTofu comparison:


It’s still Terraform-compatible. You can take your existing configurations and run them with OpenTofu today. But beyond compatibility, OpenTofu is already moving faster and more freely, prioritizing developer-requested features that a commercial model might not. Some key examples of its true power and longevity include:

1. Standardized distribution with OCI registries

Packaging and sharing modules or providers privately has always been clunky. You either ran your own registry or relied on Terraform Cloud.

OpenTofu solves this with OCI Registries, i.e. using the same open container standard that Docker uses.

It’s clean, familiar, and scalable.

Your modules live in any OCI-compatible registry (Harbor, Artifactory, ECR, GCR, etc.), complete with built-in versioning, integrity checks, and discoverability. No proprietary backend required.

For organizations managing hundreds of modules or providers, this is a big deal. It means your IaC supply chain can be secured and audited with the same standards you already use for container images.

2. True security with encryption at rest

Secrets in your Terraform state have always been a headache.

Even with remote backends, you’re still left with the risk of plaintext credentials or keys living inside the state file.

OpenTofu is the only IaC framework with built-in encryption at rest.

You can define an encryption block directly in configuration:

This encrypts the state transparently, no custom wrapper scripts or external encryption logic.

It also supports multiple key providers (AWS KMS, GCP KMS, Azure Key Vault, and more).

Coming soon in OpenTofu 1.11 (beta): ephemeral resources.

This feature lets providers mark sensitive data as transient so it never touches your state file in the first place. That’s a security level no other mainstream IaC tool currently offers.

3. A community-driven future

OpenTofu’s most powerful feature isn’t in its code, it’s in its process.

Every proposal goes through a public RFC. Every contributor has a say. Every decision is archived and transparent.

If you want a feature, you can write a proposal, gather community feedback, and influence the outcome.

Contrast that with traditional vendor-driven roadmaps, where features are often prioritized by product-market fit rather than user need.

That’s what “being late to the party” really means: you miss your seat at the table where the next decade of IaC innovation is being decided.

Why you don’t want to miss this party

Being early in an open-source ecosystem isn’t about bragging rights, it’s about influence.

OpenTofu is already gaining serious traction:

Major cloud providers and IaC platforms are integrating it.
Contributors from across the industry are shaping releases.
Security and compliance enhancements (like encryption and OCI support) are coming faster than ever.

If you join later, you’ll still get the code. But you won’t get the same opportunity to shape it.

The longer you wait, the more you’ll be reacting to other people’s decisions instead of helping make them.

Ready to switch? The OpenTofu migration path is smooth.

Migrating is a one-liner!

The OpenTofu migration guide shows that most users can simply install the tofu CLI and reuse their existing Terraform files:

It’s the same commands, same workflow, but under an open license. You can even use your existing Terraform state files directly; no conversion step required.

For teams already managing infrastructure at scale, the move to OpenTofu doesn’t just preserve your workflow, it future-proofs it.

How Harness IaCM supports OpenTofu

When you’re ready to bring OpenTofu into a managed, collaborative environment, Harness Infrastructure as Code Management (IaCM) has you covered.

Harness IaCM natively supports both Terraform and OpenTofu. You can create a workspace, select your preferred binary, and run init, plan, and apply pipelines without changing your configurations.

That means you can:

Experiment safely with OpenTofu while retaining version control.
Store and share modules through a managed environment.
Adopt OpenTofu’s features (like encryption and OCI) inside CI/CD pipelines.
Gradually migrate Terraform workspaces without breaking production.

Harness essentially gives you the sandbox to explore OpenTofu’s potential, whether you’re testing ephemeral resource behavior or building private OCI registries for module distribution.

So while the OpenTofu community defines the standards, Harness ensures you can implement them securely and at scale.

Contribute, don’t just consume

The real magic of OpenTofu lies in participation.

If you’ve ever complained about Terraform limitations, this is your moment to shape the alternative.

You can:

Test new features.
Submit issues or RFCs.
Contribute code or docs.
Influence what the next release includes.

Everything lives in the open on the OpenTofu Repository.

Even reading a few discussions there shows how open, constructive, and fast-moving the community is.

Final thoughts

The IaC landscape is changing, and this time, the direction isn’t being set by a vendor, but by the community.

OpenTofu brings us back to the roots of open-source infrastructure: collaboration, transparency, and freedom to innovate.

It’s more than a fork, it’s a course correction.

If you’re still watching from the sidelines, remember: the earlier you join, the more your voice matters.

The OpenTofu party is already in full swing.

Grab your seat at the table, bring your ideas, and help build the future of IaC, before someone else decides it for you.



Seamless Data Sync from Google BigQuery to ClickHouse in an AWS Airgapped Environment

This article provides a comprehensive guide on syncing data from Google BigQuery to ClickHouse in a secure, airgapped AWS environment. It details the use of a corporate proxy server to address the challenges of restricted outbound communication and outlines the implementation steps involved.

Nikunj Badjatya

December 31, 2024


Understanding the Key Components

Airgap Environment

An airgapped environment enforces strict outbound policies, preventing external network communication. This setup enhances security but presents challenges for cross-cloud data synchronization.

Proxy Server

A proxy server is a lightweight, high-performance intermediary facilitating outbound requests from workloads in restricted environments. It acts as a bridge, enabling controlled external communication.

ClickHouse

ClickHouse is an open-source, column-oriented OLAP (Online Analytical Processing) database known for its high-performance analytics capabilities.

This article explores how to seamlessly sync data from BigQuery, Google Cloud’s managed analytics database, to ClickHouse running in an AWS-hosted airgapped Kubernetes cluster using proxy-based networking.

Use Case

Deploying ClickHouse in airgapped environments presents challenges in syncing data across isolated cloud infrastructures such as GCP, Azure, or AWS.

In our setup, ClickHouse is deployed via Helm charts in an AWS Kubernetes cluster, with strict outbound restrictions. The goal is to sync data from a BigQuery table (GCP) to ClickHouse (AWS K8S), adhering to airgap constraints.

Challenges

Restricted Outbound Network: The ClickHouse cluster cannot directly access Google Cloud services due to airgap policies.
Data Transfer Between Isolated Clouds: There is no straightforward mechanism for syncing data from GCP to ClickHouse in AWS without external connectivity.

Solution

The solution leverages a corporate proxy server to facilitate communication. By injecting a custom proxy configuration into ClickHouse, we enable HTTP/HTTPS traffic routing through the proxy, allowing controlled outbound access.

Architecture Overview

BigQuery to GCS Export: Data is first exported from BigQuery to a GCS bucket.
ClickHouse GCS Integration: ClickHouse fetches data from GCS using ClickHouse’s gcs table function.
Proxy Routing: ClickHouse’s outbound requests are routed through a corporate proxy server.
Data Ingestion in ClickHouse: The retrieved data is processed and stored within ClickHouse for analytics.
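The fetch and ingestion steps above can be sketched as a single INSERT ... SELECT statement that reads from GCS through the gcs table function. The Python helper below only assembles the SQL text. The target table, bucket URL, HMAC credentials, and file format are placeholders for illustration, not values from our setup.

```python
def build_gcs_ingest_sql(target_table: str, gcs_url: str,
                         hmac_key: str, hmac_secret: str,
                         file_format: str = "Parquet") -> str:
    """Return an INSERT ... SELECT that pulls exported files from GCS."""
    def quote(value: str) -> str:
        # Escape backslashes and single quotes for a SQL string literal.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    return (
        f"INSERT INTO {target_table} "
        f"SELECT * FROM gcs({quote(gcs_url)}, {quote(hmac_key)}, "
        f"{quote(hmac_secret)}, {quote(file_format)})"
    )


if __name__ == "__main__":
    print(build_gcs_ingest_sql(
        "analytics.events",  # hypothetical target table
        "https://storage.googleapis.com/my-bucket/export/*.parquet",  # hypothetical bucket
        "HMAC_KEY", "HMAC_SECRET",  # placeholders; load real keys from a secret store
    ))
```

Because ClickHouse issues the GCS request itself, this is the traffic that has to pass through the proxy described in the next section.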

Implementation Steps

1. Proxy Configuration

Created a proxy.xml file defining proxy details for outbound HTTP/HTTPS requests.
Used a Kubernetes ConfigMap (clickhouse-proxy-config) to store this configuration.
Mounted the ConfigMap dynamically into the ClickHouse pod.
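As a rough sketch of this step, the Python script below renders a proxy.xml and wraps it in a ConfigMap manifest. It assumes the list-style proxy layout in ClickHouse's server configuration (a proxy element containing http and https entries, each with a uri), so check it against the ClickHouse version you run. The proxy address and namespace are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def render_proxy_xml(proxy_uri: str) -> str:
    """Build a proxy.xml that sends HTTP and HTTPS traffic to one proxy."""
    root = ET.Element("clickhouse")
    proxy = ET.SubElement(root, "proxy")
    for scheme in ("http", "https"):
        # One <uri> per scheme; both point at the same corporate proxy.
        ET.SubElement(ET.SubElement(proxy, scheme), "uri").text = proxy_uri
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


def build_configmap(proxy_uri: str, namespace: str = "default") -> dict:
    """Wrap proxy.xml in a ConfigMap manifest named clickhouse-proxy-config."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "clickhouse-proxy-config", "namespace": namespace},
        "data": {"proxy.xml": render_proxy_xml(proxy_uri)},
    }


if __name__ == "__main__":
    # Hypothetical proxy address; kubectl apply -f accepts JSON manifests.
    print(json.dumps(build_configmap("http://proxy.corp.example:3128"), indent=2))
```

Keeping the proxy address as the only input means the same script can produce a ConfigMap for each environment.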

2. Kubernetes Deployment

Mounted proxy.xml in the ClickHouse pod at /etc/clickhouse-server/config.d/proxy.xml.
Adjusted security contexts, allowing privilege escalation (for testing) and running the pod as root to simplify permissions.

3. Testing and Validation

Deployed a non-stateful ClickHouse instance to iterate quickly.
Verified that ClickHouse requests were routed through the proxy.

Observed proxy logs confirming outbound requests were successfully relayed to GCP.

Left window shows query to BigQuery and right window shows proxy logs — the request forwarding through proxy server

Outcome

This approach successfully enabled secure communication between ClickHouse (AWS) and BigQuery (GCP) in an airgapped environment. The use of a ConfigMap-based proxy configuration made the setup:

Scalable: Easily adaptable to different cloud vendors (GCP, Azure, AWS).
Flexible: Decouples networking configurations from application logic.
Secure: Ensures outbound traffic is strictly controlled via the proxy.

By leveraging ClickHouse’s extensible configuration system and Kubernetes, we overcame strict network isolation to enable cross-cloud data workflows in constrained environments. This architecture can be extended to other cloud-native workloads requiring external data synchronization in airgapped environments.


DB Performance Testing with Harness FME

Explore how to effectively conduct DB performance testing using Harness FME, comparing popular databases like MariaDB and PostgreSQL through feature flags. Gain insights on optimizing database integrity and performance to enhance your web applications.

Joshua Klein

December 31, 2024


Databases have been crucial to web applications since their beginning, serving as the core storage for all functional aspects. They manage user identities, profiles, activities, and application-specific data, acting as the authoritative source of truth. Without databases, the interconnected information driving functionality and personalized experiences would not exist. Their integrity, performance, and scalability are vital for application success, and their strategic importance grows with increasing data complexity. In this article we are going to show you how you can leverage feature flags to compare different databases.

Let’s say you want to test and compare two different databases against one another. A common use case could be to compare the performance of two of the most popular open source databases: MariaDB and PostgreSQL.

MariaDB and PostgreSQL logos

Let’s think about how we want to do this. We want to compare the experience of our users with these different databases. In this example we will run a 50/50 experiment. In a production environment, you most likely already use one database, so you would start with a small percentage-based rollout to the other one, such as 90/10 (or even 95/5), to reduce the blast radius of potential issues.

To do this experiment, first, let’s make a Harness FME feature flag that distributes users 50/50 between MariaDB and PostgreSQL.

Now for this experiment we need a reasonable amount of sample data in the database. In this sample experiment we will simply load the same data into both databases. In production you’d want to build something like a read replica using a CDC (change data capture) tool so that your experimental database matches your production data.

Our code will generate 100,000 rows for this data table and load them into both databases before the experiment. This is not so big that it causes issues with query speed, but big enough to reveal differences between the database technologies. The table also has three different data types: text (varchar), numbers, and timestamps.
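To illustrate the data-loading step, here is a minimal sketch of a row generator and a batch loader. The table name sample_data and its columns (id, name, amount, created_at) are hypothetical stand-ins for the real schema, and the demo uses an in-memory SQLite database only so that it runs without a server. For MariaDB or Postgres drivers the placeholder is typically %s.

```python
import random
import sqlite3
import string
from datetime import datetime, timedelta


def generate_rows(count: int, seed: int = 42):
    """Yield (id, name, amount, created_at) tuples with reproducible values."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)
    for row_id in range(1, count + 1):
        name = "".join(rng.choices(string.ascii_lowercase, k=12))  # text
        amount = round(rng.uniform(1, 10_000), 2)                   # number
        moment = start + timedelta(seconds=rng.randrange(365 * 86_400))
        yield row_id, name, amount, moment.isoformat(sep=" ")       # timestamp


def load_rows(connection, rows, placeholder: str = "%s", batch_size: int = 5_000):
    """Insert rows through a DB-API connection in batches."""
    sql = ("INSERT INTO sample_data (id, name, amount, created_at) "
           f"VALUES ({', '.join([placeholder] * 4)})")
    cursor = connection.cursor()
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            cursor.executemany(sql, batch)
            batch.clear()
    if batch:  # flush the final partial batch
        cursor.executemany(sql, batch)
    connection.commit()


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")  # stand-in for MariaDB or Postgres
    demo.execute("CREATE TABLE sample_data "
                 "(id INTEGER, name TEXT, amount REAL, created_at TEXT)")
    load_rows(demo, generate_rows(100_000), placeholder="?")
    print(demo.execute("SELECT COUNT(*) FROM sample_data").fetchone()[0])
```

Using a fixed seed means both databases receive identical rows, which keeps the comparison fair.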

Now let’s make a basic app that simulates making our queries. Using Python we will make an app that executes queries from a list and displays the result.

Below you can see the basic architecture of our design. We will run MariaDB and Postgres on Docker and the application code will connect to both, using the Harness FME feature flag to determine which one to use for the request.

The sample queries we used can be seen below. We are using 5 queries with a variety of SQL keywords. We include joins, limits, ordering, functions, and grouping.

We use the Harness FME SDK to do the decisioning here for our user id values. It will determine if the incoming user experiences the Postgres or MariaDB treatment using the get_treatment method of the SDK based upon the rules we defined in the Harness FME console above.

Afterwards, within the application, we run the query and then track the query_execution event using the SDK’s track method.

See below for some key parts of our Python based app.

This code will initialize our Split (Harness FME) client for the SDK.

We will generate a sample user ID, which is just an integer from 1 to 10,000.

Now we need to determine whether our user will be using Postgres or MariaDB. We also do some defensive programming here to ensure that we fall back to a default if the treatment is neither postgres nor mariadb.
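A minimal sketch of the user ID generation and the defensive treatment lookup might look like the following. It assumes client is an initialized Split (Harness FME) SDK client exposing get_treatment(key, feature_flag_name). The flag name default and the choice of MariaDB as the fallback are assumptions made for illustration.

```python
import random

VALID_DATABASES = ("postgres", "mariadb")
DEFAULT_DATABASE = "mariadb"  # assumption: fall back to the incumbent engine


def random_user_id() -> str:
    """Return a sample user key between 1 and 10,000 (SDK keys are strings)."""
    return str(random.randint(1, 10_000))


def choose_database(client, user_id: str,
                    flag_name: str = "db_performance_comparison") -> str:
    """Ask the feature flag which database to use, with a safe default."""
    treatment = client.get_treatment(user_id, flag_name)
    # The SDK returns "control" when the flag cannot be evaluated, so any
    # value outside the two expected treatments falls back to the default.
    if treatment not in VALID_DATABASES:
        return DEFAULT_DATABASE
    return treatment
```

Passing the client in as a parameter keeps the function easy to exercise with a stub in unit tests.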

Now let’s run the query and track the query_execution event. From the app you can select the query you want to run; if you don’t, it will run one of the five sample queries at random.

The db_manager class handles maintaining the connections to the databases as well as tracking the execution time for the query. Here we can see it using Python’s time module to track how long the query took. The object that the db_manager returns includes this value.
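The timing logic can be sketched roughly as below. This is not the db_manager class itself: the function name, the returned keys, and the use of time.perf_counter are assumptions, and the demo runs against in-memory SQLite as a stand-in for a MariaDB or Postgres connection.

```python
import sqlite3
import time


def timed_query(connection, sql: str) -> dict:
    """Run sql on a DB-API connection and report elapsed wall-clock time."""
    cursor = connection.cursor()
    started = time.perf_counter()
    cursor.execute(sql)
    rows = cursor.fetchall()  # fetching is part of what the user waits for
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {"rows": rows, "execution_time_ms": elapsed_ms}


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")  # stand-in for MariaDB or Postgres
    result = timed_query(demo, "SELECT 1")
    print(result["rows"], f"{result['execution_time_ms']:.3f} ms")
```

Measuring around both execute and fetchall captures the full round trip rather than only the time to start the query.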

Tracking the event allows us to see the impact of which database was faster for our users. The signature for the Harness FME SDK’s track method includes both a value and properties. In this case we supply the query execution time as the value and the actual query that ran as a property of the event that can be used later on for filtering and, as we will see later, dimensional analysis.
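As a hedged sketch, the tracking call could be wrapped like this. It assumes client is a Split (Harness FME) SDK client whose track method takes a key, traffic type, event type, value, and properties. The property name query is an assumption for illustration.

```python
def track_query_execution(client, user_id: str, sql: str,
                          execution_time_ms: float) -> bool:
    """Send one query_execution event for the user who ran the query."""
    return client.track(
        user_id,            # key: the same id passed to get_treatment
        "user",             # traffic type
        "query_execution",  # event type measured by the metric
        execution_time_ms,  # value: how long the query took
        {"query": sql},     # property used for filtering and dimensions
    )
```

Using the same key for get_treatment and track is what lets the event be attributed to the treatment the user received.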

You can see a screenshot of what the app looks like below. There’s a simple bootstrap themed frontend that does the display here.

app screenshot

The last step here is that we need to build a metric to do the comparison.

Here we built a metric called db_performance_comparison. In this metric we set up our desired impact: we want the query time to decrease. Our traffic type is user.

Metric configuration

One of the most important questions is what we will select for the Measure as option. Here we have a few options, as can be seen below.

Measure as options

We want to compare across users, and are interested in faster average query execution times, so we select Average of event values per user. Count, sum, ratio, and percent don’t make sense here.
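To make the aggregation concrete, here is a simplified illustration of what an average of event values per user means. It is not how Harness FME computes the metric internally. The event tuples and their layout are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean


def average_per_user(events):
    """events: iterable of (treatment, user_id, value) tuples.

    Returns {treatment: mean of each user's own average value}.
    """
    per_user = defaultdict(list)
    for treatment, user_id, value in events:
        per_user[(treatment, user_id)].append(value)

    per_treatment = defaultdict(list)
    for (treatment, _user_id), values in per_user.items():
        per_treatment[treatment].append(mean(values))  # one number per user

    return {name: mean(averages) for name, averages in per_treatment.items()}


if __name__ == "__main__":
    sample = [
        ("postgres", "1", 10), ("postgres", "1", 30),  # user 1 averages 20
        ("postgres", "2", 40),                         # user 2 averages 40
        ("mariadb", "3", 100),
    ]
    print(average_per_user(sample))
```

Averaging per user first keeps a single heavy user from dominating the result, which is why this option fits a comparison across users.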

Lastly, we are measuring the query_execution event.

We added this metric as a key metric for our db_performance_comparison feature flag.

Selection of our metric as a key metric

One additional thing we will want to do is set up dimensional analysis, like we mentioned above. Dimensional analysis will let us drill down into the individual queries to see which one(s) were more or less performant on each database. We can have up to 20 values in here. If we’ve already been sending events they can simply be selected as we keep track of them internally — otherwise, we will input our queries here.

selection of values for dimensional analysis

Now that we have our dimensions, our metric, and our application set to use our feature flag, we can now send traffic to the application.

For this example, I’ve created a load testing script that uses Selenium to load up my application. This will send enough traffic so that I’ll be able to get significance on my db_performance_comparison metric.

I got some pretty interesting results. If we look at the metrics impact screen, we can see that Postgres resulted in an 84% drop in query time.

Even more, if we drill down to the dimensional analysis for the metric, we can see which queries were faster and which were actually slower using Postgres.

So some queries were faster and some were slower, but the faster queries were MUCH faster. This allows you to pinpoint the performance you would get by changing database engines.

You can also see the statistics in a table below — seems like the query with the most significant speedup was one that used grouping and limits.

However, the query that used a join was much slower in Postgres — you can see it’s the query that starts with SELECT a.i... , since we are doing a self-join the table alias is a. Also the query that uses EXTRACT (an SQL date function) is nearly 56% slower as well.

Conclusion

In summary, running experiments on backend infrastructure like databases using Harness FME can yield significant insights and performance improvements. As demonstrated, testing MariaDB against PostgreSQL revealed an 84% drop in query time with Postgres. Furthermore, dimensional analysis allowed us to identify specific queries that benefited the most, specifically those involving grouping and limits, and which queries were slower. This level of detailed performance data enables you to make informed decisions about your database engine and infrastructure, leading to optimization, efficiency, and ultimately, better user experience. Harness FME provides a robust platform for conducting such experiments and extracting actionable insights. For example, if we had an application that relied heavily on join-based queries or SQL date functions like EXTRACT, the experiment might show that MariaDB is faster than Postgres, and it wouldn’t make sense to consider a migration to Postgres.

The full code for our experiment lives here: https://github.com/Split-Community/DB-Speed-Test


DevOps Meets AI: Evaluating the Performance of Leading LLMs

This blog post explores the integration of large language models (LLMs) into DevOps workflows, highlighting their role in automating pipeline generation and enhancing software delivery efficiency. It shares insights from an AI engineering team's evaluation of LLM performance in streamlining DevOps…

Bashir Rastegarpanah

December 31, 2024


Modern DevOps processes are essential for ensuring efficient, reliable, and scalable software delivery. However, managing infrastructure, CI/CD pipelines, monitoring, and incident response remains a complex and time-consuming challenge for many organizations. These tasks require continuous tuning, configuration management, and rapid troubleshooting, making DevOps resource-intensive. As software systems grow in complexity, manual intervention becomes a bottleneck, increasing the risk of human error, inefficiencies, and slower deployments. This is where automation becomes a necessity, helping teams streamline workflows, reduce operational overhead, and improve deployment velocity.


The rise of artificial intelligence, particularly large language models (LLMs), has opened new possibilities for automating various aspects of software development and operations. By leveraging AI, organizations can enhance efficiency, reduce manual effort, and accelerate software delivery. LLMs bring the potential to transform DevOps by enabling intelligent automation, improving decision-making, and making systems more adaptive to changing requirements.

Our AI engineering team has been at the forefront of integrating AI into DevOps workflows. From AI-powered CI/CD optimizations to intelligent deployment strategies, we continuously explore ways to leverage AI for greater efficiency. In this blog, we share our journey in evaluating LLMs for DevOps automation, benchmarking their performance, and understanding their impact on software delivery workflows.

Harnessing LLMs for DevOps Automation

Before diving into the evaluation, let’s first outline the specific problem we aim to solve using large language models. (Note: In this post, I won’t go into the underlying architecture of the Harness AI DevOps Agent — stay tuned for a future blog post on that!)

Our exploration begins with the task of pipeline generation. Specifically, the AI DevOps Agent takes a user command describing the desired pipeline as input, along with relevant context information. The expected output is a pipeline YAML file generated by the AI DevOps agent, which is composed of multiple sub-agents, automating the configuration process and streamlining DevOps workflows. An example user command and the resulting YAML pipeline would be:

“Create an IACM pipeline to do create a IACM init and plan”

Response:

For simplicity, we conducted the first phase of our evaluations by focusing on generating a single step of the pipeline. Additionally, we explored two different solution designs for utilizing LLMs:

Direct Single LLM Calls: In this approach, we send the user command along with the relevant context (e.g., stage type, pipeline schema) in a single request to the LLM under evaluation.
Agentic Framework Approach: This approach leverages an agentic framework to distribute sub-tasks — such as context generation, schema verification, and step generation — among multiple AI agents. We implemented this framework using AutoGen.

Performance Metrics: How We Measure Success

In this blog post, we focus on the generation use case — specifically, creating pipeline steps, stages, and related configurations — and introduce the metrics used to evaluate the performance of different models for this task. Our evaluations are conducted against a benchmark dataset with a known ground truth. Specifically, we have curated a dataset consisting of user commands for creating pipeline steps and their corresponding YAML configurations. Using this benchmark data, we have developed a set of metrics to assess the quality of AI-generated YAML outputs in response to user prompts.

Since we are evaluating AI-generated pipelines against known, predefined pipelines, the comparison ultimately involves measuring the differences between two YAML files. To accomplish this, we leverage and build upon DeepDiff, a framework for computing the structural differences between key-value objects. DeepDiff is conceptually inspired by Levenshtein Edit Distance, making it well-suited for quantifying variations between YAML configurations and assessing how closely the generated output matches the expected pipeline definition.

At its core, DeepDiff quantifies the difference between two objects by determining the number of operations required to transform one into the other. This difference is then normalized to produce a similarity score between 0 and 1, providing a structured way to compare data. While we utilize the standard DeepDiff library as one of our evaluation metrics, we have also developed two modified versions tailored specifically for comparing step YAMLs. These adaptations address the unique challenges of our use case, ensuring a more precise and meaningful assessment of AI-generated pipeline configurations.

In particular, we have introduced:

DeepDiff 2: This metric first applies schema verification before computing the similarity score, assigning a score of zero if the generated YAML fails validation. Additionally, it does not penalize differences in optional fields such as name, identifier, and description, ensuring that minor variations do not disproportionately impact the similarity score. Moreover, as long as the generated solution adheres to schema validation, this metric allows additional keys in the step without penalizing the score.
DeepDiff 3: This metric builds upon DeepDiff 2 but introduces a penalty for any additional key that does not exist in the reference solution. This stricter approach provides a more precise comparison to the ground truth, considering that extra keys with default values may impact the user experience. Users may not expect to see default values for optional fields in the UI, making it essential to account for such differences in evaluation.
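The rules behind the three metrics can be sketched with a simplified, standard-library stand-in. It does not call the DeepDiff library and does not reproduce its distance algorithm. It only counts missing, changed, and extra leaf paths. The function names, the is_valid callback, and the sample step are hypothetical.

```python
OPTIONAL_FIELDS = {"name", "identifier", "description"}


def flatten(node, prefix=""):
    """Map every leaf of a nested dict/list to a 'path' -> value entry."""
    if isinstance(node, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in node.items()]
    elif isinstance(node, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(node)]
    else:
        return {prefix: node}
    flat = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def similarity(generated, reference, is_valid=None,
               ignore_optional=False, penalize_extra=True):
    """Return a score in [0, 1]; 1 means no counted differences."""
    if is_valid is not None and not is_valid(generated):
        return 0.0  # schema verification failed
    gen, ref = flatten(generated), flatten(reference)
    if ignore_optional:
        def keep(path):
            return path.rsplit(".", 1)[-1] not in OPTIONAL_FIELDS
        gen = {p: v for p, v in gen.items() if keep(p)}
        ref = {p: v for p, v in ref.items() if keep(p)}
    missing = sum(1 for p in ref if p not in gen)
    changed = sum(1 for p in ref if p in gen and gen[p] != ref[p])
    extra = sum(1 for p in gen if p not in ref) if penalize_extra else 0
    total = len(ref) + extra
    return 1.0 if total == 0 else 1.0 - (missing + changed + extra) / total


def deepdiff_2_like(generated, reference, is_valid):
    # Validate first, ignore optional fields, tolerate additional keys.
    return similarity(generated, reference, is_valid,
                      ignore_optional=True, penalize_extra=False)


def deepdiff_3_like(generated, reference, is_valid):
    # Same as above, but every additional key lowers the score.
    return similarity(generated, reference, is_valid,
                      ignore_optional=True, penalize_extra=True)


if __name__ == "__main__":
    reference = {"type": "TerraformPlan", "spec": {"command": "Apply"}}
    generated = {"type": "TerraformPlan", "name": "plan",
                 "spec": {"command": "Apply"}, "timeout": "10m"}
    has_type = lambda step: "type" in step  # toy validator, not a real schema
    print(deepdiff_2_like(generated, reference, has_type))
    print(deepdiff_3_like(generated, reference, has_type))
```

In this toy example the extra timeout key leaves the second metric at a perfect score but lowers the third, which mirrors the difference between the two variants described above.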

Benchmarking LLMs: Evaluating the Leading Models

Benchmark Dataset

Let’s first introduce the benchmark data used for this study.

At Harness, our QA team generates numerous sample pipelines using automation tools such as APIs and Terraform Providers to simulate customer use cases and various Harness configurations. These pipelines play a crucial role in sanity testing, ensuring that when a new version of Harness is released, all steps, stages, and pipelines continue to function as expected.

For this study, we leveraged this data to create a benchmark dataset of 115 step YAMLs. For each example, we manually added a potential user command that could generate the corresponding step. The same user command was then used to generate a step YAML using an LLM. The AI-generated solutions were subsequently compared against the original YAML file to evaluate accuracy and quality.
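The evaluation loop described here can be sketched as follows. The generate and score callables are placeholders: generate stands for whichever LLM call or agent is under test, and score for any of the similarity metrics introduced earlier. The result keys are hypothetical names chosen for the example.

```python
from statistics import mean


def evaluate(benchmark, generate, score):
    """Score a generator against a benchmark of (command, reference) pairs.

    generate(command) returns a generated step;
    score(generated, reference) returns a float in [0, 1].
    """
    scores = [score(generate(command), reference)
              for command, reference in benchmark]
    return {
        "samples": len(scores),
        "mean_score": mean(scores),
        # With the schema-aware metrics, a zero marks a failed validation.
        "zero_scores": sum(1 for value in scores if value == 0),
    }


if __name__ == "__main__":
    toy_benchmark = [("add step a", {"x": 1}), ("add step b", {"x": 2})]
    always_x1 = lambda command: {"x": 1}  # toy generator
    exact = lambda generated, reference: 1.0 if generated == reference else 0.0
    print(evaluate(toy_benchmark, always_x1, exact))
```

Keeping the generator and the metric as parameters lets the same loop compare direct calls and agentic setups on identical data.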

Below is an example of a user command and its corresponding YAML file, which serves as the ground truth in our evaluation:

User Command: “Please add a Terraform plan step to the pipeline.”

Ground Truth YAML:

This YAML structure represents the expected output when an LLM generates a pipeline step based on the given user command. The AI-generated YAML will be evaluated against this reference to assess its accuracy and quality.

Models Compared

We evaluated both an agentic framework and direct model calls for utilizing LLMs in pipeline generation. The selection of models for each approach was based on the technical adaptability of the frameworks we used. For example, AutoGen supports only a limited set of LLMs, which influenced our model choices for the agentic framework.

As a result, there isn’t a one-to-one correspondence between the models used in the agentic framework and those used in direct calls. However, there is significant overlap between the two sets.

Agentic Framework: Models operating within an agent-driven setup

GPT-4o
O3-mini-medium
Claude-3.7

Direct Model Calls: Models queried directly without an agentic framework

GPT-4o
O3-mini-medium
Claude-3.7
DeepSeek R1
DeepSeek V3

This comparison allows us to assess how different models and methodologies perform in generating high-quality DevOps pipeline configurations.

Results

The figure below illustrates the performance of each model based on the three evaluation metrics introduced earlier. Models that are called using an agentic framework are prefixed with “Autogen_” in the results.

Our findings indicate that using an agentic framework significantly improves response quality across all three metrics. However, AutoGen does not yet support DeepSeek models, so for these models, we only report their performance when called directly.

LLM Performance Comparison for Pipeline Step Generation

In order to gain deeper insights into the scores, we also visualize the number of samples that failed the schema verification step, where a zero score is assigned to such cases. This highlights instances where models struggle to generate valid YAML structures:

Schema Verification Failures Across Models

The plot above clearly demonstrates the effectiveness of an agentic framework with a dedicated schema verification agent. Notably, none of the models within the agentic framework produced outputs that failed schema validation.

Takeaways

Our evaluation of LLMs for DevOps automation provided valuable insights into their strengths, limitations, and practical applications. Below are some key takeaways:

LLMs demonstrate strong potential for automating DevOps workflows, particularly in generating pipeline YAMLs from user commands — achieving a pass rate of over 95% for the best models. This reduces manual effort, increases efficiency, and streamlines software delivery.
Leveraging an agentic framework that breaks tasks into smaller sub-tasks and distributes them among sub-agents significantly improves accuracy. This approach reduces schema verification failures and minimizes model hallucinations, leading to more reliable and structured pipeline generation.


AI Agents vs Real-World Web Tasks: Harness Leads the Way in Enterprise Test Automation

This article explores the capabilities of AI agents in executing real-world enterprise web tasks, focusing on their performance in test automation for complex banking and business applications. It provides valuable benchmarks and insights for engineering teams looking to enhance their automation st…

Ben Markines

December 31, 2024


Written by Deba Chatterjee, Gurashish Brar, Shubham Agarwal, and Surya Vemuri


Can an AI agent test your enterprise banking workflow without human help? We found out. AI-powered test automation will be the de facto method for engineering teams to validate applications. Following our previous work exploring AI operations on the web and test automation capabilities, we expand our evaluation to include agents from the leading model providers to execute web tasks. In this latest benchmark, we evaluate how well top AI agents, including OpenAI Operator and Anthropic Computer Use, perform real-world enterprise scenarios. From banking applications to audit trail log navigation, we tested 22 tasks inspired by our customers and users.

Building on Previous Research

Our journey began with introducing a framework to benchmark AI-powered web automation solutions. We followed up with a direct comparison between our AI Test Automation and browser-use. This latest evaluation extends our research by incorporating additional enterprise-focused tasks inspired by the demands of today’s B2B applications.

The B2B Challenge

Business applications present unique challenges for agents performing tasks through web browser interactions. They feature complex workflows, specialized interfaces, and strict security requirements. Testing these applications demands precision, adaptability, and repeatability — the ability to navigate intricate UIs while maintaining consistent results across test runs.

To properly evaluate each agent, we expanded our original test suite with three additional tasks:

A banking application workflow requiring precise transaction handling, i.e., deposit of funds into a checking account
Navigation of a business application to view audit logs filtered by date
Interacting with a messaging application and validating the conversation in the history

These additions brought the total test suite to 22 distinct tasks varying in complexity and domain specificity.

Comprehensive Evaluation Results

User tasks and Agent results

The four solutions performed very differently, especially on complex tasks. Our AI Test Automation led with an 86% success rate, followed by browser-use at 64%, while OpenAI Operator and Anthropic Computer Use achieved 45% and 41% success rates, respectively.

The performance varies as tasks interact with complex artifacts such as calendars, information-rich tables, and chat interfaces.

Additional Web Automation Tasks

As in previous research, each agent executed their tasks on popular browsers, i.e., Firefox and Chrome. Also, even though OpenAI Operator required some user interaction, no additional manual help or intervention was provided outside the evaluation task.

The first additional task involves banking. The instructions include logging into a demo banking application, depositing $350 into a checking account, and verifying the transaction. Each solution must navigate the site without prior knowledge of the interface.

Our AI Test Automation completed the workflow, correctly selecting the family checking account and verifying that the $350 deposit appeared in the transaction history. Browser-use struggled with account selection and failed to complete the deposit action. Both Anthropic Computer Use and OpenAI Operator encountered login issues. Neither solution progressed past the initial authentication step.

Finding audit trail records in a table full of data is a common enterprise requirement. We challenged each solution to navigate Harness’s Audit Trail interface to locate two-day-old entries. The AI Test Automation solution navigated to the Audit Logs and paged through the table to identify two-day-old entries. Browser-use reached the audit log UI but failed to paginate to the requested records. Anthropic Computer Use did not scroll far enough to find the Audit Trail tile; its default browser resolution was a limiting factor. OpenAI Operator found the two-day-old audit logs.

This task demonstrates that handling information-rich tables remains challenging for browser automation tools.

Messaging Application Interaction

The third additional task involves a messaging application. The intent is to initiate a conversation with a bot and verify the conversation in a history table. This task incorporates browser interaction and verification logic.

The AI Test Automation solution completed the chat interaction and correctly verified the conversation’s presence in the history. Browser-use also completed this task. Anthropic Computer Use, on the other hand, was unable to start a conversation. OpenAI Operator initiated the conversation but never sent a message, so no new conversation appeared in the history.

This task reveals varying levels of sophistication in executing multi-step workflows with validation.

What Makes Solutions Perform Differently?

Several factors contribute to the performance differences observed:

Specialized Architecture: Harness AI Test Automation leverages multiple agents designed for software testing use cases. Each agent has varying levels of responsibility, from planning to handling special components like calendars and data-intensive tables.

Enterprise Focus: Harness AI Test Automation is designed with enterprise use cases in mind. Enterprise applications impose specific requirements, including:

security
repeatability for CI/CD integration
precision
ability to interact with an API
uncommon interfaces that are not generally accessible via web crawling, hence not available for training

Task Complexity: Browser-use, Anthropic Computer Use, and OpenAI Operator complete many of the tasks, but as complexity increases, the performance gap widens significantly.

Why Harness Outperforms

Custom agents for calendars, rich tables
API-driven validation where UI alone is insufficient
Secure handling of login and secrets

Conclusion

Our evaluation demonstrates that while all four solutions handle basic web tasks, the performance diverges when faced with more complex tasks and web UI elements. In such a fast-moving environment, we will continue to evolve our solution to execute more use cases. We will stay committed to tracking performance across emerging solutions and sharing insights with the developer community.

At Harness, we continue to enhance our solution to meet enterprise challenges. Promising enhancements to the product include self-diagnosis and tighter CI/CD integrations. Intent-based software testing is easier to write, more adaptable to updates, and easier to maintain than classic solutions. We continue to enhance our AI Test Automation solution to address the unique challenges of enterprise testing, empowering development teams to deliver high-quality software confidently. After all, we’re obsessed with empowering developers to do what they love: ship great software.

Build a scalable cloud cost optimization recommendation system

This article details the development of a customizable, scalable engine for generating cloud cost optimization recommendations. By utilizing a policy-based approach, organizations can automate cost management across multiple cloud platforms, ensuring efficient resource utilization and significant s…

Anmol Maheshwari

December 31, 2024


How we built a scalable system of generating recommendations for cloud cost optimization

Overview

As cloud adoption continues to rise, efficient cost management demands a robust and automated strategy. Native cloud provider recommendations, while helpful, often have limitations — they primarily focus on vendor-specific optimizations and may not fully align with unique business requirements. Additionally, cloud providers have little incentive to highlight cost-saving opportunities beyond a certain extent, making it essential for organisations to implement customised, independent cost optimization strategies.

At Harness, we developed a Policy-Based Cloud Cost Optimization Recommendations Engine that is highly customisable and operates across AWS, Azure, and Google Cloud. This engine leverages YAML-based policies powered by Cloud Custodian, allowing organisations to define and execute cost-saving rules at scale. The system continuously analyses cloud resources, estimates potential savings, and provides actionable recommendations, ensuring cost efficiency across cloud environments.

Benefits of a Policy-Based Cloud Cost Optimization Recommendations Engine

Customisability: Users can define policies tailored to their organisation’s cost optimization strategy. Policies allow filtering resources based on conditions such as resource metrics, operational state (e.g., running, stopped), lifecycle phase (e.g., age, creation date), and compliance attributes to enforce governance and cost optimization.
Multi-Cloud Support: Works across AWS, Azure, and GCP, ensuring a consistent and unified cost-saving strategy. Avoids vendor lock-in by not relying solely on provider-native cost recommendations.
Automated Cost Optimization: Automatically applies policies across multiple accounts and regions to detect and remediate cost inefficiencies. Continuously refines recommendations as cloud resources evolve.
Transparent and Scalable Approach: Cost-saving logic is fully visible in YAML policies, unlike opaque cloud provider recommendations. Scales effortlessly as new resources and accounts are added.

Key Components Involved

Cloud Custodian

Cloud Custodian, an open-source CNCF-backed tool, is at the core of our policy-based engine. It enables defining governance rules in YAML, which are then executed as API calls against cloud accounts. This allows seamless policy execution across different cloud environments.

Cloud Cost Data Sources

The system relies on detailed billing and usage reports from cloud providers to calculate cost savings:

AWS Cost and Usage Report (CUR) — Provides granular cost breakdowns at a per-resource level.
Azure Billing Report — Offers insights into cloud usage, pricing models, and applied discounts.
Google Cloud Cost Usage Report — Captures detailed billing data, including SKU-level pricing and committed use discounts.

Solution Overview

The solution leverages Cloud Custodian to define YAML-based policies that identify cloud resources based on specific filters. The cost of these resources is retrieved from relevant cost data sources (AWS Cost and Usage Report (CUR), Azure Billing Report, and GCP Cost Usage Data). The identified cost is then multiplied by the predefined savings percentage to estimate the potential savings from the recommendation.

The diagram above illustrates the workflow of the recommendation engine. It begins with user-defined or Harness-defined cloud custodian policies, which are executed across various accounts and regions. The Harness application processes these policies, fetches cost data from cloud provider reports (AWS CUR, Azure Billing Report, GCP Cost Usage Data), and computes savings. The final output is a set of cost-saving recommendations that help users optimize their cloud spending.

How It Works — A Step-by-Step Breakdown with an example

Below is an example YAML rule that deletes unattached Amazon Elastic Block Store (EBS) volumes. When this policy is executed against any account and region, it selects all unattached EBS volumes and deletes them.

Policy Definition: YAML policies are written using Cloud Custodian to target specific resource inefficiencies.
Policy Execution: Policies are executed across multi-cloud accounts and regions, filtering unoptimized resources.
Cost Data Integration: The system retrieves the cost of filtered resources from AWS CUR, Azure Billing Report, and GCP Cost Usage Data.
Savings Calculation: The estimated cost savings is derived by applying the predefined savings percentage associated with each policy.
Recommendation Generation: The final output is a set of actionable cost-saving recommendations that users can review and apply.
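
The savings calculation in the steps above can be sketched in a few lines of Python. This is only an illustration, not the Harness implementation: `FilteredResource`, `estimate_savings`, the sample costs, and the idea that a delete policy carries a 100% savings percentage are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FilteredResource:
    """A resource matched by a policy (hypothetical type for this sketch)."""
    resource_id: str
    monthly_cost: float  # assumed to be looked up already from the billing report


def estimate_savings(resources, savings_percentage):
    """Return (total_cost, estimated_savings) for the matched resources.

    savings_percentage is the predefined percentage attached to the policy.
    """
    if not 0 <= savings_percentage <= 100:
        raise ValueError("savings_percentage must be between 0 and 100")
    total_cost = sum(r.monthly_cost for r in resources)
    return total_cost, total_cost * savings_percentage / 100.0


# Two unattached volumes with made-up costs; deleting them is assumed to
# save 100% of what they cost.
unattached_volumes = [
    FilteredResource("vol-a", 12.50),
    FilteredResource("vol-b", 7.50),
]
total, savings = estimate_savings(unattached_volumes, 100)
print(f"cost=${total:.2f} estimated savings=${savings:.2f}")
```

A policy that only downsizes a resource would pass a smaller percentage, and the same function would scale the estimate accordingly.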

Conclusion

Harness CCM’s Policy-Based Recommendation Engine offers an intelligent, automated, and scalable approach to optimizing cloud costs. Unlike native cloud provider tools, it is designed for multi-cloud environments, allowing organisations to define custom cost-saving policies and gain transparent, data-driven insights for continuous optimization.

With over 50 built-in policies and full support for user-defined rules, Harness enables businesses to maximise savings, enhance cost visibility, and automate cloud cost management at scale. By reducing unnecessary cloud spend, companies can reinvest those savings into innovation, growth, and core business initiatives — rather than increasing the profits of cloud vendors.

Sign up for Harness CCM today and experience the power of automated cloud cost optimization firsthand!



AI Tooling in Non-Greenfield Codebases

This blog post explores the integration of AI tooling in existing codebases, highlighting challenges and benefits faced by software engineers. Based on real experiences, it delves into the impact of AI coding assistants on API management and migration within the Harness platform.

Joshua Klein

December 31, 2024


AI Tooling in Non-Greenfield Codebases

It’s 2025 and if you work as a software engineer, you probably have access to an AI coding assistant at work. In this blog, I’ll share with you my experience working on a project to change the API endpoints of an existing codebase while making heavy use of an AI code assistant.


Research on how much AI code assistants help with the day-to-day work of a software engineer is clear as mud. Many people also have their own experience of AI tooling causing massive headaches: ‘AI Slop’ that is difficult to understand and only tangentially related to the original problem fills up their codebase and makes it impossible to understand what the code is (or is supposed to be) doing.

I was part of the Split team that was acquired by Harness in Summer 2024. I had been maintaining an API wrapper for the Split APIs for a few years at this point. This allowed our users to take their existing Python codebases and easily automate management of Split feature flags, users, groups, segments, and other administrative entities. We were getting about 12,000–13,000 downloads per month. Not something that gets an enormous amount of traffic, but not bad for someone who’s not officially on a Software Engineering team.

The architecture of the Python API client is that instantiating it constructs a client class that shares an API Key and optional base url configuration. Each API is served by what is called a ‘microclient’, which essentially handles the appropriate behavior of that endpoint, returning a resource of that type during create, read, and update commands.

API Client Architecture

Example showing the call sequence of instantiating the API Client and making a list call
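
To make that structure concrete, here is a minimal sketch of the client and microclient pattern. Every name in it (`ApiClient`, `UserMicroClient`, `Resource`, `FakeHttp`, the `users` path, and the config keys) is a hypothetical stand-in rather than the library's real API, and the transport is faked so the example runs without network access.

```python
class Resource:
    """Thin wrapper around the JSON returned for one entity."""

    def __init__(self, data):
        self._data = dict(data)

    def to_dict(self):
        return dict(self._data)


class FakeHttp:
    """Stand-in transport holding the shared key and base URL."""

    def __init__(self, base_url, api_key):
        self.base_url = base_url
        self.api_key = api_key

    def get(self, path):
        # A real transport would issue an HTTP GET; this returns canned data.
        return [{"id": "u1", "email": "a@example.com"}]


class UserMicroClient:
    """Handles one endpoint family and returns Resource objects."""

    _path = "users"  # hypothetical path

    def __init__(self, http):
        self._http = http

    def list(self):
        return [Resource(item) for item in self._http.get(self._path)]


class ApiClient:
    """Builds the shared transport once and hands it to each microclient."""

    def __init__(self, config):
        base_url = config.get("base_url", "https://api.example.invalid")
        http = FakeHttp(base_url, config["apikey"])
        self.users = UserMicroClient(http)


client = ApiClient({"apikey": "dummy-key"})
users = client.users.list()
print([u.to_dict()["id"] for u in users])
```

Because each microclient only sees the shared transport, adding a new endpoint family means adding one class and one attribute on the client.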

As part of the migration of Split into the Harness platform, Split will be deprecating some of its API endpoints. These endpoints, such as Users and Groups, will be maintained going forward under the banner of the Harness Platform. Split customers are going to be migrated to access their Split app from within Harness, so Users, Groups, and Split Projects will be managed in Harness, meaning that Harness endpoints will have to be used.

How do we match the API client with the proper endpoints for customers after the Harness migration?

With respect to API keys, the Split API keys will continue to work for existing endpoints, including after migration to Harness. Harness API keys will work for everything and will be required for Harness endpoints post-migration.

Now the fun begins

I had some great help from the former Split (now Harness FME) PMM and Engineering teams who took on the task of actually feeding me the relevant APIs from the Harness API Docs. This gave me a good starting point to understand what I might need to do.

Essentially, to have the same control over Harness’s Role-Based Access Control (RBAC) and Project information as we did in Split, I’d need to use the following Harness APIs:

Users
Groups
Projects
Invites (to invite users)
Role Assignments
Roles
Resource Groups
Tokens
API Keys
Service Accounts

Not all Split accounts will be migrating at once to the Harness platform — this will be over a period of a few months. This means that we will have to support both API access styles for at least some period of time. I also know that I still have my normal role at Harness supporting onboarding customers using our FME SDKs and don’t have a lot of free time to re-write an API client from scratch, so I got to thinking about what my options were.

Mode Select

I really wanted to make the API transition as seamless as possible for my API client users. So the first thing I figured was that I would need a way to determine if the API key being used was from a migrated account. Unfortunately, after discussing with some folks there simply wasn’t going to be time for building out an endpoint like this for what will be, at most, a period of a few months. As such my first design decision was how to determine which ‘mode’ the Client was going to use, the existing mode with access to the older Split API endpoints, or the ‘new’ mode with those endpoints deprecated and a collection of new Harness endpoints available.

I decided this was going to be done with a variable on instantiation. Since the API client’s constructor signature already included an object as its argument, this I thought would be pretty straightforward.

Eg:

Would then have an additional option for:
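
As a rough illustration of a mode flag passed in at instantiation, consider the sketch below. The class name, the `apikey` and `harness_mode` config keys, and the `mode()` helper are assumptions chosen for the example, not the client's actual signature.

```python
class ApiClient:
    """Hypothetical client whose constructor takes a single config object."""

    def __init__(self, config):
        self.apikey = config.get("apikey")
        # The flag defaults to False, so existing callers keep the old behavior.
        self.harness_mode = bool(config.get("harness_mode", False))

    def mode(self):
        return "harness" if self.harness_mode else "split"


legacy = ApiClient({"apikey": "split-admin-key"})
migrated = ApiClient({"apikey": "split-admin-key", "harness_mode": True})
print(legacy.mode(), migrated.mode())
```

Keeping the flag optional means users on accounts that have not migrated yet do not need to change any code.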

Now — I was thinking and questioning how I would implement this.

Recently, Harness Employees were given access to Windsurf IDE with Claude AI. I figured since I could use the help that I would sign on and that this would help me build out my code changes faster.

I had used Claude, ChatGPT, DeepSeek, and various other AI assistants through their websites for small scale problem solving (eg — fill in this function, help me with this error, write me a shell script that does XYZ) but never actually worked with something integrated into the IDE.

So I fired up Windsurf and put in a pretty ambitious prompt to see what it was capable of doing.

Split has been acquired by harness and now the harness apis will be used for some of these endpoints. I will need to implement a seperate ‘harness_mode’ boolean that is passed in at the api constructor. In harness mode there will be new endpoints available and the existing split endpoints for users, groups, restrictions, all endpoints except ‘get’ for workspaces, and all endpoints for apikeys when the type == ‘admin’ will be deprecated. I will still need to have the apikey endpoint available for type==’client_side’ and ‘server_side’ keys.

It then whirred to work and, quite frankly, I was really impressed with the results. However, it didn’t quite understand what I wanted. The Harness endpoints are completely different in structure and methods (and in base URL). The result was that the existing microclients gained Harness methods and Harness placeholders in the URLs, which wasn’t going to work. I should have told the AI that I really wanted different microclients and different resources for Harness. I reverted the changes and went back to the drawing board (but I’ll get back to this later).

OpenAPI

My second idea was to attempt to generate some API code from the Harness API docs themselves. Harness’s API docs have an OpenAPI specification available, and there are tools that can generate API clients from these specifications. However, it became clear to me that the tooling for creating API clients from OpenAPI specifications doesn’t make it easy to filter down to a subset of endpoints. Harness has nearly 300 API endpoints for the rich collection of modules and features that it has. Harness’s nearly 10 MB OpenAPI spec would actually crash the OpenAPI generator because it was too big. I spent some time working on code to strip out and filter the OpenAPI spec JSON down to just the endpoints I needed.

Here, the AI tooling was also helpful. I asked:

how can I filter a openapi json by either tag or by endpoint resource path?

can this also remove components that aren’t part of the endpoints with tags

could you also have it remove unused tags

But the problem ended up being that the OpenAPI spec is actually more complex than I initially thought, including references, parameters, and dependencies between objects. So it wasn’t going to be as simple as picking out the endpoints I needed and sending them to the API generator.
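
To show why references make this harder than a simple path filter, here is a minimal sketch of tag-based filtering that also follows `$ref` links transitively. It is not the script from this project: `find_refs` and `filter_spec_by_tags` are hypothetical helpers, the sample spec is made up, and only local `#/components/...` references are handled.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def find_refs(node):
    """Yield every '$ref' string found anywhere inside a JSON-like structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def filter_spec_by_tags(spec, wanted_tags):
    """Keep operations with a wanted tag, plus the components they reach."""
    wanted = set(wanted_tags)
    kept_paths = {}
    for path, item in spec.get("paths", {}).items():
        kept = {
            key: value
            for key, value in item.items()
            if key not in HTTP_METHODS or wanted & set(value.get("tags", []))
        }
        if any(key in HTTP_METHODS for key in kept):
            kept_paths[path] = kept

    # Walk references breadth-agnostically until no new component appears.
    components = spec.get("components", {})
    kept_components, seen = {}, set()
    queue = list(find_refs(kept_paths))
    while queue:
        ref = queue.pop()
        parts = ref.split("/")
        if ref in seen or len(parts) != 4 or parts[:2] != ["#", "components"]:
            continue
        seen.add(ref)
        section, name = parts[2], parts[3]
        target = components.get(section, {}).get(name)
        if target is not None:
            kept_components.setdefault(section, {})[name] = target
            queue.extend(find_refs(target))

    used_tags = {
        tag
        for item in kept_paths.values()
        for key, op in item.items()
        if key in HTTP_METHODS
        for tag in op.get("tags", [])
    }
    filtered = {k: v for k, v in spec.items() if k not in ("paths", "components", "tags")}
    filtered["paths"] = kept_paths
    filtered["components"] = kept_components
    filtered["tags"] = [t for t in spec.get("tags", []) if t.get("name") in used_tags]
    return filtered


def _response(schema_name):
    ref = {"$ref": f"#/components/schemas/{schema_name}"}
    return {"200": {"content": {"application/json": {"schema": ref}}}}


spec = {
    "openapi": "3.0.0",
    "tags": [{"name": "User"}, {"name": "Pipeline"}],
    "paths": {
        "/users": {"get": {"tags": ["User"], "responses": _response("UserPage")}},
        "/pipelines": {"get": {"tags": ["Pipeline"], "responses": _response("Pipeline")}},
    },
    "components": {"schemas": {
        "UserPage": {"type": "object", "properties": {
            "items": {"type": "array", "items": {"$ref": "#/components/schemas/User"}}}},
        "User": {"type": "object"},
        "Pipeline": {"type": "object"},
    }},
}
filtered = filter_spec_by_tags(spec, ["User"])
print(sorted(filtered["components"]["schemas"]))
```

Note that `User` survives only because `UserPage` points to it. A real spec also has pieces that are referenced by name rather than by `$ref` (security schemes, for example), which this sketch would drop.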

I kept attempting to run the filter script generated and then proceeded to run the generator. I did a few loops of attempting to run the script, getting an error, and sending it back to the AI assistant.

By the end I did seem to get a script that could do filtering, but filtering down to just what I needed ended up being still too big for the OpenAPI generator. You can see that code here

As a test, I did start generating with just one endpoint (harness_user) and reviewing the generated Python code. One thing that was clear after reviewing the file was that it was structured wildly differently from the API client that I already had. There were also dozens of warnings inside the generated code not to make any changes or updates to it. Moreover, I was not familiar with the generated codebase.

Either manually or attempting via an AI assistant, stitching these together was not going to be easy, so I stashed this idea as well.

As an aside, I think it is worth noting that an AI code assistant can’t help you when you don’t know how to specify exactly what you want and what your outcome is going to look like. I needed a better understanding of what I was trying to accomplish.

Further Design Review

One of the things I had in my mind was that I really wanted to make the transition as seamless as possible. However, once my idea of the automated mode select was dashed, I still thought I could, through heroic effort, automate the creation of the existing Split python classes via the Harness APIs.

I had a deep dive into this idea and really came back with the result that it would simply be too burdensome to implement and not really give the users what they need.

For example — to create an API Key in Split, we just had one API endpoint with a json body:

However, Harness has a very rich RBAC model and with multiple modules has a far more flexible model of Service Accounts, API Keys, and individual tokens. Harness’s model allows for easy key rotation and allows the API key to really be more of a container for the actual token string that is used for authentication in the APIs.

Shown more simply in the diagrams below:

Observe the difference in structure of API Key authentication and generation

Now the Python microclient for generating API keys for Split currently makes calls structured like so:

To replicate this would mean that I would have to have the client in ‘Harness Mode’ create a Service Account, API Key, and Token all at the same time, and automatically map the roles to a created service account, being seamless to the user.

This is a tall task, and being pragmatic, I don’t see that as a real sustainable solution for developers using my library as they get more familiar with the Harness platform. They’re going to want to use Harness objects natively.

This is especially true with the delete method of the current client,

The Harness method for deleting a token takes the token identifier, not the token itself, making this signature impossible to reproduce with Harness’s APIs. And even if I could delete a token, would I want to delete the token and keep the service account and api key? Would I need to replicate the role assignment and roles that Split has? Much of this is very undefined.

Wanting to keep things as straightforward and maintainable as possible, along with trying to move to understanding the world in Harness’s API Schema, I had a design decision in my head.

We were going to have ‘Harness Mode’ for the APIs that will explicitly deprecate the Split API microclients and resources and will then activate a separate client that will use Harness API endpoints and resources. The endpoints that are unchanged will still use the Split endpoints and API keys.

Back to AI

Now that I’ve got a better understanding of how I want to design this, I felt like I could create a better prompt.

Split has been acquired by harness and now the harness apis will be used for some of these endpoints. I will need to implement a seperate ‘harness_mode’ boolean that is passed in at the api constructor. In harness mode there will be new endpoints available and the existing split endpoints for users, groups, restrictions, all endpoints except ‘get’ for workspaces, and all endpoints for apikeys when the type == ‘admin’ will be deprecated. I will still need to have the apikey endpoint available for type==’client_side’ and ‘server_side’ keys. Make seperate microclients in harness mode for the following resources:

harness_user, harness_project, harness_group, role, role_assignment, service_account, and token

Ensure that that the harness_mode has a seperate harness_token key that it uses. It uses x-api-key as the header for auth and not bearer authentication

Claude then whirred away, this time with much better results. With the separate microclients I had a much better structure to build my code on. This also helped me understand how I would continue building.
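
The authentication difference called out in the prompt can be sketched as a small header builder. The function name and its parameters are illustrative assumptions; the only details taken from the prompt are that Harness mode uses a separate token sent in an `x-api-key` header, while the Split endpoints use bearer authentication.

```python
def build_auth_headers(harness_mode, apikey=None, harness_token=None):
    """Return the auth header for a request (hypothetical helper)."""
    if harness_mode:
        if not harness_token:
            raise ValueError("harness_token is required in harness mode")
        # Harness endpoints: token goes in x-api-key, not in Authorization.
        return {"x-api-key": harness_token}
    if not apikey:
        raise ValueError("apikey is required in split mode")
    # Split endpoints: bearer authentication.
    return {"Authorization": f"Bearer {apikey}"}


print(build_auth_headers(False, apikey="split-key"))
print(build_auth_headers(True, harness_token="harness-token"))
```

Failing fast on a missing token keeps a misconfigured client from sending unauthenticated requests and getting a confusing 401 later.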

The next thing I asked it to do was to create resources for all of my microclient objects.

The next thing I did was a big mistake. I asked it to create tests for me for all of my microclients and resources. Creating the tests at this time before I had finished implementing my code means that the AI doesn’t know which one is right or not. So I spent a lot of time troubleshooting issues with tests until I just decided to delete all of my test files and create the tests much later in my development cycle. Once I had the designs for the microclients and resources reasonably implemented, I went forth and had it write the tests for me. DO NOT have the AI write BOTH your tests and your code before you have the chance to review either of them, or you will be in a world of pain and be spending hours trying to figure out what you actually want.

After the Magic

This was an enormous time saver for me. Having the project essentially built with custom scaffolding for me was just amazing.

The next thing I was going to do was fill in the resources. The resources were essentially a schema with an init call to pull the endpoints in and accessors to get the fields from the data.

The schemas I was able to pull from the apidocs.harness.io site pretty easily.

Here’s an example of the AI generated code for the harness group resource.

I did a few things here — I had the AI generate for me a generalizable getter and dict export from the schema itself — essentially allowing me to just copy and paste the schema into the resource and have it auto-generate the methods that it needs to have.

Here’s an example of that code for the harness user class.

Once this was done for all of my resources, I had the AI create tests for these resources and went through a few iterations before my tests passed.
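
The schema-driven getter and dict export described above can be sketched roughly as follows. This is not the generated code from the project: `SchemaResource`, `export_dict`, and the `HarnessUser` field names are assumptions used only to show the idea of pasting a schema into a class and getting accessors for free.

```python
import copy


class SchemaResource:
    """Base class: subclasses only declare a schema of field -> default."""

    _schema = {}

    def __init__(self, data=None):
        data = data or {}
        # Keep only fields named in the schema, falling back to defaults.
        self._data = {
            key: copy.deepcopy(data.get(key, default))
            for key, default in self._schema.items()
        }

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

    def export_dict(self):
        """Return a copy of the resource's fields as a plain dict."""
        return copy.deepcopy(self._data)


class HarnessUser(SchemaResource):
    # Illustrative fields, not the real Harness user schema.
    _schema = {"uuid": None, "name": None, "email": None, "disabled": False}


user = HarnessUser({"uuid": "abc", "email": "a@example.com", "unexpected": 1})
print(user.email, user.disabled, sorted(user.export_dict()))
```

With this shape, adding a new resource type is mostly a matter of declaring its `_schema`, which is what made copy and paste from the API docs practical.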

Microclients

The microclients were a bit more challenging. Partly because of how the methods were really fundamentally different in many cases between the Split and Harness way of managing these HTTP resources.

There was more manual work here and not as much automation. That being said, the AI had a lot of helpful autocompletes.

For example, in the harness_user microclient class, the default list of endpoints looked like this

If I changed one of them to the proper endpoint (ng/api/user) and then pressed tab, it would automatically fix the other endpoints. Small things like that really added up when I was going through and manually setting up things like endpoints or looping over the returned array from a GET endpoint. The AI tooling really helped speed up the implementation.

Once I had the microclients finished, I had the AI create tests and worked through running them, ensuring that we had coverage and that the tests made sense and covered all of the microclient endpoints (including pagination for the list endpoints).
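
Pagination is the part of a list endpoint that is easiest to get subtly wrong, so here is a minimal sketch of the loop and the kind of check a test would make. The `list_all` helper and the page-index/page-size calling convention are assumptions for illustration; the real endpoints may page differently.

```python
def list_all(fetch_page, page_size=100):
    """Collect every item from a paginated list endpoint.

    fetch_page(page_index, page_size) is a caller-supplied function that
    returns the list of items for one page.
    """
    items = []
    page_index = 0
    while True:
        page = fetch_page(page_index, page_size)
        items.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            return items
        page_index += 1


# Fake backend with five records, so no network access is needed.
records = [f"user-{i}" for i in range(5)]
calls = []


def fake_fetch(page_index, page_size):
    calls.append(page_index)
    start = page_index * page_size
    return records[start:start + page_size]


print(list_all(fake_fetch, page_size=2), calls)
```

When the total is an exact multiple of the page size, this loop makes one extra request that returns an empty page, which is a case worth covering in the tests.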

Base Client

The last thing to clean up now was the base client. The AI created a separate main harness_apiclient that would be instantiated when harness mode was enabled. I had to review the deprecation code to ensure that deprecation warnings were indeed only fired when specified. I also cleaned up and removed some extraneous code around supporting other base urls, and set the proper harness base url.
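
A check like the deprecation review above can be expressed with Python's standard `warnings` module. The sketch below is an assumption about how such gating could look, not the client's real code: `ModeAwareClient`, `call`, and the set of deprecated endpoint names are illustrative.

```python
import warnings

# Illustrative subset of endpoints treated as deprecated in harness mode.
DEPRECATED_IN_HARNESS_MODE = {"users", "groups", "restrictions"}


class ModeAwareClient:
    def __init__(self, harness_mode=False):
        self.harness_mode = harness_mode

    def call(self, endpoint):
        # Warn only when harness mode is on AND the endpoint is deprecated.
        if self.harness_mode and endpoint in DEPRECATED_IN_HARNESS_MODE:
            warnings.warn(
                f"'{endpoint}' is deprecated in harness mode",
                DeprecationWarning,
                stacklevel=2,
            )
        return f"called {endpoint}"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ModeAwareClient(harness_mode=False).call("users")     # no warning
    ModeAwareClient(harness_mode=True).call("segments")   # no warning
    ModeAwareClient(harness_mode=True).call("users")      # one warning
    print(len(caught))
```

Recording warnings this way makes it straightforward to assert that nothing fires in Split mode or for endpoints that are unchanged.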

I proceeded to ask AI to allow me to pass in an account_identifier since many of the harness endpoints require that — allowing me to make it easier so that you didn’t need to pass that field in each time for every microclient request.

Grand Finale

Finally, I had the AI write me a comprehensive test script that would test all endpoints in both harness mode and split mode. I ran this with a Harness account and a Split account to ensure success. I fixed a few minor issues but ultimately it worked very well and seemed extremely straightforward and easy to use.

Lessons Learned

After this whole project, I would like to leave the reader with a few learnings. The first is that your AI assistant still requires you to have a good sense of code smell. If something looks wrong, or the implementation in your head would be different, always feel free to back up and revert the changes it makes. Better to be safe than sorry.

You really need to have the design in your head and constantly be comparing it to what the AI is building for you when you ask it questions. Don’t just accept it — interrogate it. Save and commit often so that you can revert to known states.

Do not have it create both your tests and implementations at the same time. Only have it do one until you are finished with it and then have it do the other.

You do not want to just keep asking it for things without an understanding of what you want the outcome to look like. Keep your hand on the revert button and don’t be afraid to revert to earlier parts of your conversation with the AI. If you do not review the code coming out of your AI assistant you will be in a world of trouble. Coding with an AI assistant still uses those Senior/Staff Software Engineer skillsets, perhaps even more than ever due to the sheer volume of code that is possible to generate. Design is more important than ever.

If you’re familiar with the legend of John Henry, he was a railroad worker who challenged a steam drilling machine with his hammer. With an AI assistant, I really feel like I’ve been given a steam drill. This is the way to huge gains in efficiency in the production of software.

Learn how to work with your robot and be successful

I’m very excited for the future and how AI code assistants will grow and become part and parcel of the standard workflow for software development. I know it saved me a lot of time and from a lot of frustration and headaches.

Technical


Go Memory Leak: How One Line Drained Memory Across 1000+ Goroutines | Harness

This technical deep-dive reveals how Harness engineers discovered and fixed a critical Go memory leak where reassigning context variables in worker loops created invisible chains that prevented garbage collection across thousands of goroutines, ultimately consuming gigabytes of memory in their CI/CD delegate service.

Kiruthika Meena Ravichandran

October 10, 2025


🧩 The Mystery: A Troubling Correlation Between CPU and Memory

In our staging environment, which handles the daily CI/CD workflows for all Harness developers, our Hosted Harness delegate was doing something curious: CPU and memory rose and fell in a suspiciously tight correlation, perfectly tracking system load.

(For context, Harness Delegate is a lightweight service that runs inside a customer’s infrastructure, securely connecting to Harness SaaS to orchestrate builds, deployments, and verifications. In the Hosted Delegate model, we run it in Harness’s cloud on behalf of customers, so they don’t have to manage the infrastructure themselves.)

At first glance, this looked normal. Of course, you expect CPU and memory to rise during busy periods and flatten when the system is idle. But the details told a different story:

Memory didn’t oscillate. Instead of rising and falling, it climbed steadily during high-traffic periods and then froze at a new plateau during idle, never returning to baseline.
Even more telling, CPU perfectly mirrored that memory growth. This near-perfect lockstep hinted that cycles weren’t just spent on real work—they were being burned by garbage collection, constantly fighting against an ever-growing heap.

In other words, what looked like “a busy system” was actually the fingerprint of a leak: memory piling up with load, and CPU spikes reflecting the runtime’s struggle to keep it under control.

🔍 The Investigation: Following the Breadcrumbs

The next step was to understand where this memory growth was coming from. We turned our attention to the core of our system: the worker pool. The delegate relies on a classic worker pool pattern, spawning thousands of long-running goroutines that poll for and execute tasks.

On the surface, the implementation seemed robust. Each worker was supposed to be independent, processing tasks and cleaning up after itself. So what was causing this leak that scaled perfectly with our workload?

We started with the usual suspects—unclosed resources, lingering goroutines, and unbounded global state—but found nothing that could explain the memory growth. What stood out instead was the pattern itself: memory increased in perfect proportion to the number of tasks being processed, then immediately plateaued during idle periods.

To dig deeper, we focused on the worker loop that handles each task:

This seemed innocent enough. We were just reassigning ctx to add task IDs for logging and then processing each incoming task.

⚡ The Eureka Moment: An Invisible Chain

The breakthrough came when we reduced the number of workers to one. With thousands running in parallel, the leak was smeared across goroutines, but a single worker made it obvious how each task contributed.

To remove the noise of short-lived allocations, we forced a garbage collection after every task and logged the post-GC heap size. This way, the graph reflected only memory that was truly retained, not temporary allocations the GC would normally clean up. The result was loud and clear: memory crept upward with each task, even after a full sweep.

That was the aha moment 💡. The tasks weren't independent at all. Something was chaining them together, and the culprit was Go's context.Context.

A context in Go is immutable. Functions like context.WithValue don't actually modify the context you pass in. Instead, they return a new child context that holds a reference to its parent. Our AddLogLabelsToContext function was doing exactly that:

This is fine on its own, but it becomes dangerous when used incorrectly inside a loop. By reassigning the ctx variable in every iteration, we were creating a linked list of contexts, with each new context pointing to the one from the previous iteration:
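The chain can be made visible with a small experiment. This sketch gives every iteration its own key (the `taskKey` type and `buildChain` function are assumptions made for the demo) so that older links can be looked up through the newest context:

```go
package main

import (
	"context"
	"fmt"
)

// taskKey gives each task a distinct context key, purely so this demo can
// address individual links in the chain.
type taskKey int

// buildChain mimics the loop: ctx is reassigned to a child of itself n times.
func buildChain(n int) context.Context {
	ctx := context.Background()
	for i := 1; i <= n; i++ {
		ctx = context.WithValue(ctx, taskKey(i), fmt.Sprintf("task-%d", i))
	}
	return ctx
}

func main() {
	ctx := buildChain(100)
	// A lookup walks parent pointers, so a hit for task 1 means the very
	// first context is still reachable from the newest one.
	fmt.Println(ctx.Value(taskKey(1)))
	fmt.Println(ctx.Value(taskKey(100)))
}
```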

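Each new context referenced the entire chain before it. Because the worker's ctx variable always held the newest link, every earlier context stayed reachable through it, and the garbage collector could never reclaim any of them.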

💣 The Damage: A Leak Multiplied

With thousands of goroutines in our worker pool, we didn't just have one tangled chain—we had thousands of them growing in parallel. Each worker was independently leaking memory, one task at a time.

A single goroutine's context chain looked like this:

Task 1: ctx1 → initialContext
Task 2: ctx2 → ctx1 → initialContext
Task 100: ctx100 → ctx99 → ... → initialContext

...and this was happening for every single worker.

📦 Impact (Back-of-the-Envelope Math)

1,000 workers × 500 tasks/worker/day = 500,000 new leaked context objects per day.
After one week: 3.5 million contexts stuck in memory across all workers.

Each chain lived as long as its worker goroutine—effectively, forever.
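The same estimate can be parameterized for other fleet sizes. A small sketch, assuming one retained context per task and no worker restarts (the `leakedContexts` function is a hypothetical helper):

```go
package main

import "fmt"

// leakedContexts estimates how many context objects stay reachable when
// every task adds one link to its worker's chain and nothing is freed.
func leakedContexts(workers, tasksPerWorkerPerDay, days int) int {
	return workers * tasksPerWorkerPerDay * days
}

func main() {
	fmt.Println("after one day: ", leakedContexts(1000, 500, 1))
	fmt.Println("after one week:", leakedContexts(1000, 500, 7))
}
```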

🔧 The Fix: Breaking the Chain

The fix wasn't concurrency magic. It was simple variable scoping:

The problem wasn't the function itself, but how we used its return value:

❌ ctx = AddLogLabelsToContext(ctx, ...) → chain builds forever

✅ taskCtx := AddLogLabelsToContext(ctx, ...) → no chain, GC frees it

The Universal Anti-pattern (and Where it Hides)

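The core problem can be distilled to one pattern: inside a long-running loop, a variable is reassigned to a wrapped version of itself (x = wrap(x)), where the wrapper keeps a reference to the object it wraps.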

It's a universal anti-pattern that appears anywhere you wrap an immutable (or effectively immutable) object inside a loop.

Example 1: HTTP Request Contexts

Example 2: Logger Field Chains

Same mistake, different costumes.

📌 Key Takeaways

Scope variables in loops carefully: Never reassign an outer-scope variable with a "wrapped" version of itself inside a long-running loop. Always use a new, locally-scoped variable for the wrapped object.
Leaks can be parallel: One small mistake × thousands of goroutines = disaster.
Simplify to debug: Reducing our test environment to a single worker made the memory growth directly observable and the root cause obvious. Sometimes the best debugging technique is subtraction, not addition.

👀 What's Next?

After fixing this memory leak, we enabled the profiler for the delegate to get better visibility into production performance. And guess what? The profiler revealed another issue: a goroutine leak!

But that's a story for the next article...🕵️‍♀️

Stay tuned for "The Goroutine Leak Chronicles: When Profilers Reveal Hidden Secrets 🔍🔥"
