
- Flaky tests waste an estimated 16–24% of developers' time through false failures, re-runs, and investigations, costing engineering organizations millions of dollars in lost productivity every year.
- Timing problems, test pollution, unstable infrastructure, and race conditions are all possible root causes. However, AI-powered detection can automatically identify and isolate flaky tests, improving signal quality.
- Harness CI and other modern CI platforms use machine learning to identify flaky tests, automatically quarantine unreliable tests, and maintain developers' trust without manual triage or pipeline interruptions.
What are Flaky Tests?
Flaky tests are automated tests that pass or fail inconsistently without changes to the code. In this guide, you’ll learn why flaky tests happen, how to detect them automatically in CI pipelines, and how modern platforms prevent them from slowing teams down.
Your test passed three times yesterday. It failed this morning. You ran it again without changing anything, and now it passes. Congratulations: you've just met a flaky test, and someone's day is about to be ruined.
Flaky tests are like smoke alarms that go off for no reason. Everyone looks into it the first few times. Eventually, your entire test suite stops being an early warning system and becomes background noise. Harness CI uses AI to automatically identify flaky tests and put them in quarantine, so your pipelines send you reliable signals instead of random noise.
Why Flaky Tests Are Expensive (Even If You Ignore Them)
The real cost of flaky tests isn't the 30 seconds it takes to hit "retry." It's everything that happens after developers stop trusting the test results.
Developer Time Disappears Into Investigation Black Holes
Someone has to figure out whether a test failure is a real bug or just flakiness. An industrial case study found flaky tests consuming about 2.5% of developers' productive time: 1.1% on investigation, 1.3% on repairs, and 0.1% on tooling. For a team of 50 engineers, that's more than one full-time engineer's worth of work... gone.
And that's the best case, where teams actually investigate. In the worst case, developers assume everything is flaky, stop investigating failures, and real bugs reach production. You're paying for tests that erode your confidence instead of building it.
Changing Contexts Breaks the Flow State
Here's what really happens when a flaky test breaks your build. You're deep into a complicated feature. The build fails. You stop, switch contexts to investigate, discover it wasn't your fault, rerun the pipeline, and wait. By the time the green build comes back 15 minutes later, you've lost your train of thought and spent 20 minutes on Slack instead.
Studies on productivity show that it takes 15 to 25 minutes to get back to full focus after being interrupted. If you have dozens of flaky test interruptions every week across your team, you're losing a lot of productive hours.
Trust Degradation Is the Hidden Multiplier
The cultural cost is the most damaging. When tests become unreliable, developers find workarounds. They reflexively rerun builds. After the third retry passes, they merge PRs with red builds. They stop writing new tests because "tests are flaky anyway."
This loss of trust gets worse over time. Teams that tolerate flaky tests have lower test coverage, longer feedback loops, and more problems in production. Your quality assurance system will only be useful if developers trust the test results.
Root Causes: Why Tests Flake and How to Recognize the Patterns
The first step in fixing flaky tests is understanding why they fail. Most of them follow recognizable patterns, so you can hunt them down systematically instead of playing whack-a-mole.
Timing and Race Conditions: The Usual Suspects
Timing assumptions are the leading cause of flaky tests. Your test asserts that element X should be ready within 100ms. On your laptop, it's always ready in 80ms. On a busy shared CI runner, it takes 120ms. Boom: intermittent failure.
Network calls, database queries, UI rendering, and async operations all involve "waiting for something to happen," and all of them can flake. Hard-coded sleep statements are especially bad because they're either too short (flaky) or too long (slow tests that waste time even when they pass).
The fix is to use explicit waits with timeouts: wait for specific conditions (such as an element becoming visible, an API response being received, or a state being updated) rather than arbitrary time intervals. You need to find out which tests have these problems first.
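As a minimal sketch of the explicit-wait pattern, here's a hypothetical polling helper (the `wait_for` name and the interval defaults are illustrative, not from any specific framework):

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns truthy or `timeout` elapses.

    Unlike a hard-coded sleep, this returns as soon as the condition
    holds, and raises a clear error instead of flaking when it never does.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Usage: wait for a flag an async worker flips, rather than sleep(0.1)
# and hoping the worker finished in time.
state = {"done": False}
state["done"] = True        # in real code, a background task sets this
wait_for(lambda: state["done"], timeout=2.0)
```

The same idea underlies framework-level tools like Selenium's explicit waits; the point is to wait on a condition, not on a clock.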
Test Pollution and Shared State
Tests that depend on execution order or share mutable state are ticking time bombs. Test A runs first and seeds the database. Test B assumes that data exists. Run them in parallel or in the opposite order, and Test B fails at random.
Global variables, singleton patterns, shared file systems, and database records that never get cleaned up all create hidden dependencies between tests. The moment you parallelize your test suite for speed, test pollution shows up in a big way.
Test Intelligence helps by looking at test dependencies and running tests that are affected in isolation, which makes them less flaky because of pollution.
Unstable Infrastructure and Environment
The test is fine, but the environment isn't always. Network blips, shared CI runners competing for resources, external API rate limits, and database connection pool exhaustion are all environmental factors that can make your tests fail intermittently.
This is why teams on shared, static Jenkins clusters see more flakiness than teams using ephemeral build environments. When every build runs in a clean, isolated environment with its own resources, the "noisy neighbor" problem disappears entirely.
Non-Deterministic Code and External Dependencies
Tests that rely on the current time, random number generation, external APIs, or other inputs that vary between runs will eventually fail. Anything in your test setup that isn't completely under your control can cause flakiness: today's date changes, APIs go down, and random seeds produce different values.
Dependency injection and test doubles are the answer: mock the clock, stub external APIs, and seed random generators predictably. But first, you need to know which tests have these problems.
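A sketch of that pattern, assuming a hypothetical `make_discount_code` function that depends on the clock and an RNG:

```python
import random
from datetime import datetime, timezone

# Non-deterministic: depends on the real clock and an unseeded RNG,
# so a test asserting on its output can pass or fail from run to run.
def make_discount_code():
    return f"{datetime.now(timezone.utc):%Y%m%d}-{random.randint(0, 9999):04d}"

# Deterministic: the clock and the RNG are injected, so tests control both.
def make_discount_code_di(now, rng):
    return f"{now:%Y%m%d}-{rng.randint(0, 9999):04d}"

fixed_now = datetime(2024, 1, 15, tzinfo=timezone.utc)  # "mocked" clock
code_a = make_discount_code_di(fixed_now, random.Random(42))  # seeded RNG
code_b = make_discount_code_di(fixed_now, random.Random(42))
assert code_a == code_b            # same inputs, same output, every run
assert code_a.startswith("20240115-")
```

Production callers pass the real clock and RNG; tests pass fixed ones. The function's behavior doesn't change, only who controls its inputs.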
Detection Strategies That Really Work
You can't fix what you can't see. The first step is building systems that automatically surface flaky tests instead of relying on developers to remember and report them.
Using AI to Find Flaky Tests Automatically
Tracking flaky tests by hand doesn't work. You need automated detection that watches test runs over time and spots the patterns that indicate flakiness.
AI-powered test intelligence analyzes historical test results to find tests that both pass and fail on the same code. After just a few runs, machine learning models can detect flaky behavior and flag tests for investigation before they turn into bigger problems.
The key ingredient is running the same test suite on the same code multiple times. Modern platforms can do this automatically, with no human effort.
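The core signal can be sketched in a few lines: flag any test that both passed and failed on the same commit. This is a simplified stand-in for what ML-based detectors do at scale, with hypothetical test names and history shape:

```python
from collections import defaultdict

def find_flaky_tests(runs):
    """Flag tests that both passed and failed on the *same* commit.

    `runs` is a list of (test_name, commit_sha, passed) tuples, e.g.
    pulled from CI history. A test whose outcome flips with no code
    change is the basic flakiness signal detectors build on.
    """
    outcomes = defaultdict(set)   # (test, commit) -> set of outcomes seen
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) > 1})

history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),     # same commit, flipped outcome: flaky
    ("test_checkout", "abc123", True),
    ("test_checkout", "def456", False),  # failed on new code: maybe a real bug
]
print(find_flaky_tests(history))  # → ['test_login']
```

Real systems add more dimensions (load, parallelism, OS), but the pass-and-fail-on-identical-code check is the foundation.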
Quarantine Systems That Keep the Pipeline Signal
Finding a flaky test creates a dilemma. Disable it and you lose its coverage. Leave it running and it keeps breaking builds and teaching developers to ignore failures.
The answer is automatic quarantine. A quarantined flaky test still runs, but it doesn't block the pipeline. Its failures are recorded and tracked, but developers aren't disrupted by random failures from tests that are known to be flaky.
This keeps the quality of the signals in your main test suite while letting platform teams see the tests that are in quarantine and need to be fixed. You're separating the noise from the signal without losing either.
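A simplified sketch of the quarantine gate, with a hypothetical `QUARANTINED` set standing in for the platform's automatically maintained list:

```python
QUARANTINED = {"test_flaky_upload"}  # maintained automatically by detection

def evaluate_build(results):
    """Decide build status from test results, honoring quarantine.

    `results` maps test name -> passed. Quarantined tests still run and
    are recorded, but their failures don't turn the build red.
    """
    blocking_failures = [
        name for name, passed in results.items()
        if not passed and name not in QUARANTINED
    ]
    quarantined_failures = [
        name for name, passed in results.items()
        if not passed and name in QUARANTINED
    ]
    return {
        "status": "red" if blocking_failures else "green",
        "blocking_failures": blocking_failures,
        "quarantined_failures": quarantined_failures,  # tracked, not blocking
    }

report = evaluate_build({
    "test_login": True,
    "test_flaky_upload": False,  # known flaky: recorded, doesn't block
})
print(report["status"])  # → green
```

The quarantined failure still appears in the report, which is what lets platform teams track and eventually fix it.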
Treating Flaky Test Rate as a First-Class Metric
Along with build duration and deployment frequency, treat flaky test rate as a top operational metric. Healthy test suites keep flaky rates below 1–2%; rates above 5% signal systemic problems.
Track it over time. A sudden spike usually means the infrastructure changed or new code patterns introduced instability. Platform teams should set up alerts and SLOs for flaky test rates to catch problems early.
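As a toy illustration of those thresholds (the helper names and data shape are hypothetical):

```python
def flaky_rate(suite):
    """Share of tests currently flagged flaky. `suite` maps name -> flagged."""
    return sum(suite.values()) / len(suite)

def breaches_slo(rate, threshold=0.02):
    # Thresholds from this article: below 1-2% is healthy, above 5% is systemic.
    return rate > threshold

suite = {f"test_{i}": False for i in range(98)}
suite["test_upload"] = True    # two flaky tests out of 100
suite["test_export"] = True
rate = flaky_rate(suite)       # 0.02: right at the healthy boundary
print(breaches_slo(rate))      # → False (2% does not exceed the 2% threshold)
```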
How to Fix Flaky Tests: Best Practices That Work
Detection is half the battle. The other half is fixing root causes systematically rather than hiding the problems.
First, Isolate and Reproduce
Before you can fix a flaky test, you need to reproduce the failure consistently. Run the test hundreds of times, locally or in CI, until the failure pattern emerges.
Tools that make repeated runs easy help here. Some platforms let you run a single test 50 times with one command, which makes intermittent failures easy to surface. Once you can reproduce the failure consistently, investigation gets much easier.
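A minimal local version of that loop, assuming a hypothetical `rerun` helper and a stand-in flaky test:

```python
import random

def rerun(test_fn, times=200):
    """Run one test repeatedly and summarize outcomes to expose flakiness.

    Hypothetical local helper; some CI platforms offer the equivalent
    as a single command.
    """
    failures = []
    for i in range(times):
        try:
            test_fn()
        except AssertionError as exc:
            failures.append((i, str(exc)))
    print(f"{len(failures)}/{times} failures ({len(failures) / times:.1%})")
    return failures

def sometimes_fails():
    # Stand-in for a flaky test: fails roughly 10% of the time.
    assert random.random() > 0.1, "intermittent failure"

failures = rerun(sometimes_fails, times=200)
```

A 0% failure rate over hundreds of runs suggests the flakiness is environmental (load, parallelism, shared state) rather than intrinsic to the test, which narrows the investigation.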
Should You Fix the Test or the Code?
Not all flaky tests are bad tests. Sometimes flakiness points to real race conditions, timing issues, or non-deterministic behavior in your production code.
Ask: is this flakiness exercising something that could happen in production, or is it an artifact of how we wrote the test? If users could hit the same timing problem, the flakiness is a signal: fix the code. If it's just a test artifact, fix the test.
Common Problems and Their Solutions
Different flaky test types need different fixes:
- Timing problems: use explicit waits instead of hard-coded sleeps. Add retry logic with exponential backoff for external dependencies. Increase timeouts if you must, but prefer making operations faster over making tests wait longer.
- Resource contention: use temporary directories, test-specific namespaces, and per-test database schemas. Clean up resources after each test. Avoid global shared state.
- External dependencies: mock APIs and services, use test doubles, and add circuit breakers and fallbacks for integration tests that must hit real services.
- Non-deterministic inputs: seed random number generators, mock the system clock, and use fixed test data instead of generated data.
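For example, the retry-with-backoff fix for external dependencies can be sketched like this (the helper and the stubbed service are hypothetical):

```python
import time

def with_backoff(call, attempts=4, base_delay=0.1):
    """Retry a call to a flaky external dependency with exponential backoff.

    Only appropriate for genuinely transient external failures (network
    blips, rate limits); don't use it to paper over bugs in your own code.
    """
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

# Example: a stubbed service that fails twice, then succeeds.
calls = {"n": 0}

def unstable_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

print(with_backoff(unstable_service))  # → ok
```

Note that the retry is scoped to one known-transient exception type; retrying on every exception would hide real bugs.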
Refactor for Determinism
The end goal is a fully deterministic test suite: the same code produces the same test results, every time. Getting there means making architectural choices:
- Dependency injection that lets you substitute test doubles
- Pure functions without side effects wherever possible
- A clear boundary between business logic and I/O
- Test fixtures that create clean, isolated state
These are sound software design principles that make your production code more reliable too, not just your tests.
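A tiny example of the business-logic/I/O boundary: the pure core is deterministic by construction, and only the thin shell touches the filesystem (names are illustrative):

```python
import json

# Pure core: same input, same output, no side effects.
# Trivially deterministic to test, with no files, clocks, or network.
def summarize(orders):
    total = sum(o["amount"] for o in orders)
    return {"count": len(orders), "total": total}

# Impure shell: all I/O lives here, kept thin and covered separately
# (or by integration tests).
def load_and_summarize(path):
    with open(path) as f:
        return summarize(json.load(f))

assert summarize([{"amount": 5}, {"amount": 7}]) == {"count": 2, "total": 12}
```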
Creating a Culture That Stops Flaky Tests
Flaky tests can't be fixed by technology alone. You need team rules and practices that stop flakiness from building up in the first place.
Make Flakiness Visible and Unacceptable
Teams tolerate what they don't measure, and flaky tests spread when they're invisible. Make flaky test rate a dashboard metric. Call out flaky tests during code reviews. Treat newly introduced flaky tests like production bugs: something to prevent, and to fix immediately when they slip through.
Some teams adopt a "you flake it, you fix it" policy: whoever wrote the flaky test is responsible for diagnosing and fixing it. That creates accountability and encourages writing stable tests in the first place.
Invest in Test Infrastructure
Flaky tests are often a symptom of inadequate test infrastructure. Shared, overloaded CI runners breed flakiness; so do fragile test environments and missing test tooling.
Platform teams should provide:
- Isolated build environments for each test run
- Well-maintained libraries of test fixtures
- Clear patterns and examples for common testing scenarios
- Easily reproducible local CI environments
Flakiness goes down naturally when it's easier to write stable tests than flaky ones.
Keep Fast Unit Tests and Slow Integration Tests Separate
When you mix fast, predictable unit tests with slow, environment-dependent integration tests, the integration-test flakiness taints everything else. Developers learn to distrust all tests, not just the integration layer.
Group test suites by speed and stability. Run fast, stable unit tests on every commit. Run integration tests less frequently or on a separate track. Test Intelligence can then run only the integration tests affected by a given code change.
This tiered approach means that most developer feedback comes from quick, reliable tests, and full integration coverage still happens without breaking the inner loop.
How Modern CI Platforms Automatically Deal With Flakiness
Once a team grows large enough, manual flaky test management stops working. Modern platforms solve the problem with automation and machine intelligence.
ML-Powered Detection That Gets Better Over Time
Harness CI uses machine learning to look at test patterns from thousands of runs. The system learns which tests tend to fail, when, and how often.
This goes beyond simple "passed then failed" detection. Advanced algorithms can find patterns like "fails more often under load," "flakes in parallel but not sequential runs," or "only flakes on certain OS versions."
The longer the system runs, the better it gets at telling the difference between real problems and false alarms.
Automatic Quarantine Without Any Human Action
When the system detects a flaky test, it quarantines it automatically. No platform team meetings, no manual tickets, no debates about whether a test is "flaky enough" to quarantine.
Quarantined tests still run and report results, but they don't stop builds or count as failures. Developers can look into quarantined tests when they have time, but they aren't held up by random failures.
This keeps both coverage (tests still run) and signal quality (builds aren't randomly red).
Rich Analytics and Reporting
Platform teams need visibility into flaky-test trends, not just the status of individual tests. Dashboards on modern CI platforms show:
- Flaky test rate over time for all teams and repositories
- The most problematic tests, ranked by impact (how often they run × how often they fail)
- Quarantine status and time-in-quarantine metrics
- Patterns of root causes and suggested fixes
This information helps decide which problems to fix first and shows whether the flakiness is improving or worsening over time.
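A sketch of that impact ranking, assuming per-test stats of (runs per week, failure rate); the data shape is hypothetical:

```python
def rank_by_impact(stats):
    """Rank flaky tests by impact: how often they run × how often they flake.

    `stats` maps test name -> (runs_per_week, failure_rate). Real dashboards
    derive these figures from CI history.
    """
    return sorted(stats, key=lambda t: stats[t][0] * stats[t][1], reverse=True)

stats = {
    "test_checkout": (500, 0.04),   # impact 20: runs constantly, flakes often
    "test_export":   (20, 0.30),    # impact 6: very flaky, but rarely run
    "test_login":    (500, 0.001),  # impact 0.5: barely flakes
}
print(rank_by_impact(stats))  # → ['test_checkout', 'test_export', 'test_login']
```

Note how the ranking differs from sorting by failure rate alone: a mildly flaky test that runs on every PR can cost far more developer time than a very flaky test that runs weekly.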
Real-World Impact: What Teams Gain by Fixing Flakiness
When teams deal with flaky tests in a planned way, the benefits spread across many areas.
Developer productivity returns: teams report getting 10–20% more done after eliminating flaky tests, because time stops going to false investigations and reruns.
Restored trust: when developers trust test results again, they pay attention to failures and investigate them thoroughly. That catches real bugs sooner and improves production quality.
Faster feedback loops: PR validation runs finish faster and provide useful feedback the first time, without needing to retry or investigate failures.
Lower infrastructure costs: teams stop rerunning tests "just to be sure" and stop running the whole suite out of distrust of selective execution. Cache Intelligence and test selection work better when the tests they depend on are reliable.
Cultural change: Getting rid of flakiness shows that the platform team cares about developers' experience. It gives other CI improvements greater credibility and moves the whole company toward better testing practices.
One engineering team reported cutting test maintenance from around 10 hours per week to about 2 hours per week by aggressively removing and refactoring flaky end-to-end tests. Another organization claimed flaky tests cost them 40 hours per week before they deleted 70% of their problematic tests. With systematic detection, quarantine, and remediation, teams see faster builds, happier developers, and fewer production incidents.
Stop Putting Up with Flaky Tests
Flaky tests aren't an inevitable part of building software. They're a symptom of missing tooling, missing practices, and accumulated technical debt.
To fix the problem, you need three things: automated detection to identify where the flakiness is, systematic remediation to fix the root causes quickly, and preventive practices to ensure new flakiness doesn't build up faster than you can fix old problems.
All three of these things are made smarter and more automated by modern CI platforms. AI-powered detection finds flaky patterns on its own. Quarantine systems maintain signal quality without blocking teams. Analytics reveal patterns and help set priorities for problem-solving.
Your developers shouldn't have to play detective every time a test fails. Make flaky tests the CI platform's problem, so your team spends less time fixing test infrastructure and more time shipping features.
Are you ready to get rid of flaky tests in your pipelines? Learn how Harness Continuous Integration uses AI to find flaky tests, put them in quarantine, and help fix them on their own.
Flaky Tests: Frequently Asked Questions
What flaky test rate is normal?
Healthy test suites keep flaky rates between 1% and 2%. If more than 5% of your tests are flaky, you have a systemic problem that needs immediate attention.
Should I get rid of flaky tests or try to fix them?
Neither, at first. Quarantine flaky tests first, so they don't block builds but still report signals. Then investigate whether they're revealing real problems or are just poorly written. If they cover important scenarios, fix them. If they're redundant or low-value, consider deleting them.
How long does it take to fix a flaky test?
Anywhere from 15 minutes for simple timing issues to several days for complex race conditions or architectural problems. Studies report an average of 1 to 3 hours per test. That's why automated detection and prioritization matter: you want to fix the highest-impact flaky tests first.
Can flaky tests show bugs in production?
Yes. Some flaky tests show real race conditions, timing problems, or behavior that isn't always the same, which could affect users. Don't just call a flaky test "just a bad test." Look into whether it's showing real problems with the code. Flakiness can sometimes be a signal, not just noise.
Do parallel test runs make things more flaky?
Parallel execution exposes problems that sequential runs hide: test pollution, race conditions, and resource contention. Parallelism isn't creating the problems; it's revealing ones that were always there. Fix the root causes instead of avoiding parallelism.
How do tools that use AI find flaky tests?
Machine learning models analyze test results from hundreds or thousands of runs and find patterns like "passes and fails on the same code," "fails more often under certain conditions," or "failure rate correlates with infrastructure load." These systems find flaky tests far faster and more reliably than humans can.
