June 1, 2026
How to Measure Flaky Test Rate in CI/CD Without Guesswork
Learn how to define, track, and interpret flaky test rate in CI/CD, with practical formulas, instrumentation tips, and metrics that improve build stability.
Flaky tests are one of the fastest ways to make a CI pipeline feel unreliable even when the product is fine. A test suite that fails unpredictably creates a bad loop: developers stop trusting failures, QA spends time rerunning jobs, and release decisions start relying on gut feel instead of evidence.
That is why the flaky test rate matters. It is a practical metric for separating genuine product failures from test instability, and it gives teams a way to track whether CI/CD reliability is improving or drifting in the wrong direction.
The hard part is not collecting a number. The hard part is defining one that is meaningful, consistent, and hard to game. This article breaks down how to measure flaky test rate without guesswork, how to instrument it in a pipeline, and how to interpret it alongside other build stability metrics.
What flaky test rate actually means
At a high level, flaky test rate measures the portion of test outcomes that change when the underlying code and environment have not meaningfully changed. In other words, the test fails sometimes and passes other times under the same conditions.
This sounds simple, but in practice there are a few different ways teams define it:
- Failure retry rate: how often a test only passes after rerunning
- Non-deterministic failure rate: how often a test fails on one run and passes on another with the same commit
- Flake frequency per test: how often an individual test exhibits unstable behavior over time
- Suite-level flakiness: the share of all test executions that are flaky
A flaky test rate is only useful if the definition is explicit. If one team counts a retry as a flake and another counts only confirmed nondeterministic failures, the metric will not be comparable.
For CI/CD reliability, the most useful definition is usually this:
Flaky test rate = flaky test outcomes / total test outcomes in a defined time window
That gives you a stable ratio you can trend over time, but it only works if you know what counts as a flaky outcome.
Why flaky test rate is not the same as failure rate
A high failure rate and a high flaky test rate are not the same thing.
A test can fail for good reasons, for example:
- The feature is broken
- A contract changed and the test is outdated
- An environment dependency is actually unavailable
- A deployment introduced a real regression
A flaky test fails for bad reasons, meaning the code under test is not the root cause, or at least not consistently the root cause.
This distinction matters because teams often chase the wrong problem. If failure rate is high, the cause may be product defects, brittle tests, or missing test coverage. If flaky test rate is high, the cause is usually test design, environment stability, timing assumptions, or shared state.
The definitions you need before you start measuring
Before adding dashboards or alerts, establish a shared vocabulary.
1. What is a test execution?
A test execution is a single run of a test in a given context, often tied to a commit, branch, environment, and pipeline job.
For example, if your test suite runs on every pull request and again after merge, those are separate executions even if they use the same test code.
2. What is a flaky outcome?
A flaky outcome is usually one of these patterns:
- The test fails, then passes on immediate retry with no code change
- The test passes on one run, then fails on a rerun against the same commit and environment
- The test fails intermittently across a stable time window with no relevant change in the product or test code
The first pattern is the easiest to automate in CI. The second is useful but requires careful correlation across runs. The third is best for trend analysis, but it is harder to attribute to a specific cause.
3. What time window are you measuring?
Flaky rates are sensitive to window choice. A daily window is useful for spotting incidents. A weekly or monthly window is better for tracking real improvement.
Pick a primary window and keep it consistent. Otherwise the numbers will move around because of the window, not because of the tests.
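To see how much the window alone can move the number, here is a minimal TypeScript sketch that buckets the same outcomes into fixed-size windows. The `WindowedOutcome` shape, the `rateByWindow` name, and the epoch-aligned UTC bucketing are illustrative assumptions, not part of any CI tool.

```typescript
// One test outcome with the time it finished (milliseconds since the Unix epoch).
interface WindowedOutcome {
  finishedAtMs: number;
  flaky: boolean;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Groups outcomes into fixed windows of `windowDays` (aligned to the epoch, in UTC)
// and returns the flaky rate for each window, keyed by the window's start date.
function rateByWindow(outcomes: WindowedOutcome[], windowDays: number): Record<string, number> {
  const windowMs = windowDays * MS_PER_DAY;
  const totals: Record<string, { flaky: number; total: number }> = {};
  for (const o of outcomes) {
    const startMs = Math.floor(o.finishedAtMs / windowMs) * windowMs;
    const key = new Date(startMs).toISOString().slice(0, 10);
    const bucket = totals[key] || { flaky: 0, total: 0 };
    bucket.total += 1;
    if (o.flaky) bucket.flaky += 1;
    totals[key] = bucket;
  }
  const rates: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    rates[key] = totals[key].flaky / totals[key].total;
  }
  return rates;
}

// A noisy first day followed by a clean second day.
const jan1 = Date.UTC(2026, 0, 1);
const jan2 = Date.UTC(2026, 0, 2);
const sampleOutcomes: WindowedOutcome[] = [
  { finishedAtMs: jan1, flaky: true },
  { finishedAtMs: jan1, flaky: true },
  { finishedAtMs: jan1, flaky: false },
  { finishedAtMs: jan1, flaky: false },
  { finishedAtMs: jan2, flaky: false },
  { finishedAtMs: jan2, flaky: false },
  { finishedAtMs: jan2, flaky: false },
  { finishedAtMs: jan2, flaky: false },
];

console.log(rateByWindow(sampleOutcomes, 1)); // daily: 0.5 for 2026-01-01, 0 for 2026-01-02
console.log(rateByWindow(sampleOutcomes, 7)); // weekly: 0.25 for the window starting 2026-01-01
```

The same eight outcomes read as a 50% incident, a clean day, or a 25% week depending only on the window, which is why the primary window should stay fixed.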
A practical formula for flaky test rate
Here is a straightforward way to compute it.
```text
flaky test rate = number of flaky test outcomes / total test outcomes
```
If you want a percentage:
```text
flaky test rate % = (flaky test outcomes / total test outcomes) * 100
```
That raises the next question: what counts as a flaky test outcome?
A pragmatic CI-friendly approach is to count a test as flaky when it:
- Fails on the first run
- Passes on an immediate retry
- Has no relevant code or environment change between attempts
This method is not perfect, but it is operationally useful because it can be implemented in most pipelines without deep statistical modeling.
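The three conditions above can be expressed as a small predicate. This is a sketch under stated assumptions: the `RetryAttempt` fields and the `isFlakyByRetry` name are hypothetical, and "no relevant change" is approximated by requiring an identical commit SHA and environment label on every attempt.

```typescript
// One attempt at running a test. All field names here are illustrative.
interface RetryAttempt {
  commitSha: string;
  environment: string;
  passed: boolean;
}

// Returns true when the attempts (in execution order) match the retry-based rule:
// the first attempt failed, a later attempt passed, and the commit and
// environment were identical across all attempts.
function isFlakyByRetry(attempts: RetryAttempt[]): boolean {
  if (attempts.length < 2) return false; // no retry happened
  const first = attempts[0];
  if (first.passed) return false; // a first-attempt pass is a clean pass
  const sameContext = attempts.every(
    (a) => a.commitSha === first.commitSha && a.environment === first.environment
  );
  const passedLater = attempts.slice(1).some((a) => a.passed);
  return sameContext && passedLater;
}
```

A test that fails on every attempt returns `false` here: it is an unresolved failure, not a flake, and should be counted separately.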
Example
Suppose over one week you observe:
- 1,200 total test executions
- 36 first-attempt failures that passed on retry
- 4 additional unstable cases discovered through reruns or quarantined jobs
If your team counts all 40 as flaky outcomes, then:
```text
40 / 1200 = 0.0333
```
So the flaky test rate is 3.33% for that window.
What matters is not the exact decimal, but that you always calculate it the same way.
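The same arithmetic as a tiny helper, so the calculation is written down once and reused. The function names are illustrative, and returning 0 for an empty window is an assumption; some teams prefer to report "no data" instead.

```typescript
// Returns the flaky test rate as a fraction between 0 and 1.
function flakyTestRate(flakyOutcomes: number, totalOutcomes: number): number {
  if (totalOutcomes <= 0) return 0; // empty window: avoid dividing by zero
  return flakyOutcomes / totalOutcomes;
}

// Formats a fraction as a percentage string with two decimals.
function asPercent(rate: number): string {
  return (rate * 100).toFixed(2) + "%";
}

// The weekly example: 36 retry-recovered failures plus 4 other unstable cases.
const retryRecovered = 36;
const otherUnstable = 4;
const weeklyRate = flakyTestRate(retryRecovered + otherUnstable, 1200);
console.log(asPercent(weeklyRate)); // "3.33%"
```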
Where to get the data in CI/CD
Measuring flaky test rate requires test result data that is tied to execution context. At minimum, capture:
- Test name or ID
- Suite or component
- Build ID or pipeline run ID
- Commit SHA
- Branch name or pull request ID
- Timestamp
- Environment or test target
- Retry count
- Result for each attempt
- Failure reason or error signature if available
You can collect this from a CI system, a test runner, or a test reporting tool. The important part is that every run is traceable.
The minimum instrumentation model
If you are starting from scratch, track each test attempt as a row in a results table. A simple schema could look like this:
```text
run_id | commit_sha | branch | test_name | attempt | status | started_at | ended_at | error_signature
```
From that data, you can infer whether a failure was recovered by retry and whether retries were common enough to matter.
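As a rough sketch of that inference, the following TypeScript groups attempt rows into one summary per test execution. The column names mirror the schema above; the `"passed"`/`"failed"` status values, the `ExecutionSummary` shape, and the choice to key an execution by `run_id` plus `test_name` are assumptions for illustration.

```typescript
// One row per attempt, mirroring the schema above. Status values are an assumption.
interface AttemptRow {
  run_id: string;
  commit_sha: string;
  branch: string;
  test_name: string;
  attempt: number; // 1 for the first try, 2 for the first retry, and so on
  status: "passed" | "failed";
  started_at: string;
  ended_at: string;
  error_signature: string | null;
}

interface ExecutionSummary {
  run_id: string;
  test_name: string;
  attempts: number;
  firstStatus: "passed" | "failed";
  finalStatus: "passed" | "failed";
  recoveredByRetry: boolean;
}

// Collapses attempt rows into one summary per (run_id, test_name) pair.
function summarizeExecutions(rows: AttemptRow[]): ExecutionSummary[] {
  const groups: Record<string, AttemptRow[]> = {};
  for (const row of rows) {
    const key = row.run_id + "::" + row.test_name;
    const group = groups[key] || [];
    group.push(row);
    groups[key] = group;
  }
  return Object.keys(groups).map((key) => {
    // Sort a copy by attempt number so row order in the table does not matter.
    const ordered = groups[key].slice().sort((a, b) => a.attempt - b.attempt);
    const first = ordered[0];
    const last = ordered[ordered.length - 1];
    return {
      run_id: first.run_id,
      test_name: first.test_name,
      attempts: ordered.length,
      firstStatus: first.status,
      finalStatus: last.status,
      recoveredByRetry: first.status === "failed" && last.status === "passed",
    };
  });
}
```

Counting summaries where `recoveredByRetry` is true gives the numerator for the rate, and the number of summaries gives the denominator.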
Why retries are both useful and dangerous
Retries are good for the developer experience in CI because they keep one-off noise from blocking the whole pipeline. But retries also hide flakiness if you treat a pass on retry as a normal pass.
A retry that passes is not a clean pass. It is evidence of instability.
If your CI only reports final status and discards retry history, you are probably undercounting flaky tests.
That is why the raw attempt log is more valuable than just the final job result.
A better metric stack than a single number
Flaky test rate is useful, but on its own it can mislead you. A mature CI/CD dashboard usually combines several related metrics.
1. Flaky test rate
This tells you how much of the test signal is unstable.
2. First-pass pass rate
This measures how often tests pass without retry. It is often a better indicator of developer trust than final pass rate.
3. Retry rate
This shows how often the pipeline needed a second attempt to reach a green state.
4. False failure rate
This measures failures that did not represent a real product issue. It overlaps with flakiness, but some teams separate failures caused by ephemeral infrastructure issues.
5. Build stability metrics
This is a broader category that includes pipeline success rate, mean time to green, queue delays, and the proportion of builds blocked by unstable tests.
If you only track flake rate, you can miss the business impact. If you only track build stability, you can miss the root cause.
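To show how these numbers diverge on the same data, here is a minimal sketch that derives several of them from per-execution summaries. The `ExecutionOutcome` fields are hypothetical, and retry rate is computed here per test execution rather than per pipeline, which is a simplifying assumption.

```typescript
// Per-execution summary; the field names are illustrative.
interface ExecutionOutcome {
  attempts: number; // total attempts, including the first one
  firstAttemptPassed: boolean;
  finalPassed: boolean;
}

interface StabilityMetrics {
  firstPassPassRate: number;
  finalPassRate: number;
  retryRate: number;
  flakyTestRate: number;
}

function stabilityMetrics(executions: ExecutionOutcome[]): StabilityMetrics {
  const total = executions.length;
  if (total === 0) {
    return { firstPassPassRate: 0, finalPassRate: 0, retryRate: 0, flakyTestRate: 0 };
  }
  const count = (pred: (e: ExecutionOutcome) => boolean): number => executions.filter(pred).length;
  return {
    firstPassPassRate: count((e) => e.firstAttemptPassed) / total,
    finalPassRate: count((e) => e.finalPassed) / total,
    retryRate: count((e) => e.attempts > 1) / total,
    // Flaky here means: failed first, then passed on a retry.
    flakyTestRate: count((e) => !e.firstAttemptPassed && e.finalPassed) / total,
  };
}

// Ten executions: 7 clean passes, 2 recovered by retry, 1 failed on all 3 attempts.
const sampleExecutions: ExecutionOutcome[] = [];
for (let i = 0; i < 7; i++) sampleExecutions.push({ attempts: 1, firstAttemptPassed: true, finalPassed: true });
for (let i = 0; i < 2; i++) sampleExecutions.push({ attempts: 2, firstAttemptPassed: false, finalPassed: true });
sampleExecutions.push({ attempts: 3, firstAttemptPassed: false, finalPassed: false });

console.log(stabilityMetrics(sampleExecutions));
// first-pass pass rate 0.7, final pass rate 0.9, retry rate 0.3, flaky test rate 0.2
```

In this sample the final pass rate of 90% looks healthy, while the first-pass pass rate of 70% is closer to what developers actually experience.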
How to segment flaky test rate so it becomes actionable
A single suite-wide rate is not enough to fix the problem. You need slices.
By test type
Break it down by:
- Unit tests
- Integration tests
- API tests
- UI tests
- End-to-end tests
UI and end-to-end tests often show higher flake rates because they depend on timing, animations, network calls, and shared environments.
By environment
Compare:
- Local developer runs
- Pull request pipelines
- Main branch pipelines
- Staging environments
- Nightly scheduled runs
If a test is stable locally but flaky only in CI, the issue may be environment-related rather than test logic.
By owning team or subsystem
Assign flakiness to the component owner when possible. Without ownership, flaky test rate becomes an abstract dashboard number that nobody improves.
By failure signature
Group failures by stack trace, assertion text, network error, timeout, or selector issue. Two failures that look different may share the same underlying cause.
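One rough way to group failures is to normalize each error message into a coarse signature by masking the parts that vary between runs. The masking rules below (hex values, digit runs, whitespace, letter case) are illustrative assumptions; real suites usually need rules tuned to their own error formats.

```typescript
// Reduces an error message to a coarse signature by masking run-specific details.
function errorSignature(message: string): string {
  return message
    .replace(/0x[0-9a-fA-F]+/g, "<hex>") // memory addresses and hex IDs
    .replace(/\d+/g, "<n>") // durations, ports, row IDs, line numbers
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();
}

// Counts how many failures share each signature.
function countBySignature(messages: string[]): Record<string, number> {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const counts: Record<string, number> = Object.create(null);
  for (const m of messages) {
    const sig = errorSignature(m);
    counts[sig] = (counts[sig] || 0) + 1;
  }
  return counts;
}

console.log(errorSignature("Timeout 30000ms exceeded waiting for selector #cart-12"));
// "timeout <n>ms exceeded waiting for selector #cart-<n>"
```

With this normalization, a 30-second timeout on `#cart-12` and a 5-second timeout on `#cart-7` land in the same bucket, which makes a shared timing problem easier to spot.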
How to distinguish flakiness from legitimate failure
This is where teams often make mistakes.
A failure is not flaky just because it later passes. Sometimes the first failure exposes a real bug, and the passing rerun only means that a race condition, an unstable backend, or leaked environment state happened to mask it.
Use a few rules.
Treat as likely flaky when:
- The same test passes and fails repeatedly on the same commit
- The error changes across retries
- The failure disappears when rerun immediately without any code change
- The stack trace suggests timing, waiting, or state isolation issues
Treat as likely product failure when:
- The failure is consistent across reruns
- The error is tied to a deterministic assertion mismatch
- Multiple independent tests fail in the same area
- Logs and traces point to a real regression
Treat as environment instability when:
- Many unrelated tests fail at once
- Failures correlate with infrastructure events, network issues, or service outages
- The same suite is unstable only on one runner type or region
This classification does not need to be perfect. It just needs to be good enough to separate signal from noise.
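The rules above can be turned into a first-pass triage function. This is a deliberately crude sketch: the `FailureEvidence` fields, the labels, and the default threshold of 20 unrelated failures are all hypothetical, and the output is a starting label for a human to confirm, not a verdict.

```typescript
type Triage =
  | "likely-flaky"
  | "likely-product-failure"
  | "environment-instability"
  | "needs-review";

// Evidence gathered for one failing test on one commit. All fields are illustrative.
interface FailureEvidence {
  rerunResults: boolean[]; // true = passed; one entry per rerun on the same commit
  unrelatedTestsFailingInRun: number; // failures elsewhere in the same pipeline run
  infraIncidentDuringRun: boolean; // e.g. a known runner or network incident
}

function triageFailure(e: FailureEvidence, massFailureThreshold: number = 20): Triage {
  // Many unrelated failures or a known incident point at the environment first.
  if (e.infraIncidentDuringRun || e.unrelatedTestsFailingInRun >= massFailureThreshold) {
    return "environment-instability";
  }
  // Without reruns there is not enough evidence to classify.
  if (e.rerunResults.length === 0) return "needs-review";
  const passes = e.rerunResults.filter((passed) => passed).length;
  // Consistent failure across reruns suggests a real defect.
  if (passes === 0) return "likely-product-failure";
  // At least one rerun passed on the same commit.
  return "likely-flaky";
}
```

The order of the checks matters: environment signals are evaluated first so that a runner outage does not get recorded as dozens of individual flaky tests.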
A CI pipeline pattern that makes flaky rate measurable
If your pipeline only stores final pass or fail, you will struggle to calculate anything beyond rough guesswork. A better pipeline pattern is:
- Run the test once
- If it fails, retry once or twice with the same commit and environment
- Record every attempt separately
- Mark the test as flaky if it passes after a prior failure
- Keep the retry reason and failure signature
Here is a simple GitHub Actions workflow to use as the starting point.
```yaml
name: tests
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test
```
The YAML above does not include retry logic by itself, but it shows the kind of pipeline where you would add it, either in the test runner or via a wrapper that records attempts.
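As one possible shape for such a wrapper, the sketch below retries a synchronous test function and records every attempt. The `runWithRetries` name and the record shapes are hypothetical. Real runners are usually asynchronous and have their own retry hooks, so treat this as an illustration of the bookkeeping rather than a drop-in tool.

```typescript
interface RecordedAttempt {
  attempt: number;
  passed: boolean;
  error: string | null;
}

interface RetryOutcome {
  attempts: RecordedAttempt[];
  finalPassed: boolean;
  flaky: boolean; // passed, but only after at least one failed attempt
}

// Runs `testFn` up to `1 + maxRetries` times, stopping at the first pass,
// and keeps a record of every attempt instead of only the final result.
function runWithRetries(testFn: () => void, maxRetries: number): RetryOutcome {
  const attempts: RecordedAttempt[] = [];
  const maxAttempts = 1 + Math.max(0, maxRetries);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      testFn();
      attempts.push({ attempt: attempt, passed: true, error: null });
      break;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      attempts.push({ attempt: attempt, passed: false, error: message });
    }
  }
  const finalPassed = attempts[attempts.length - 1].passed;
  return { attempts: attempts, finalPassed: finalPassed, flaky: finalPassed && attempts.length > 1 };
}

// A stand-in test that fails only on its first call.
let callCount = 0;
const retryOutcome = runWithRetries(() => {
  callCount += 1;
  if (callCount === 1) throw new Error("timeout waiting for element");
}, 2);
console.log(retryOutcome.flaky, retryOutcome.attempts.length); // true 2
```

The key design choice is that the failed first attempt stays in the record with its error message, so a green result after retry can still be counted as a flake.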
A Playwright example can capture retry awareness directly in the test config.
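```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retry each failed test once so a pass on retry can be detected.
  retries: 1,
  // Keep console output and also write a machine-readable JSON report.
  reporter: [['line'], ['json', { outputFile: 'test-results.json' }]],
});
```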
That JSON report is useful because it preserves attempt-level detail, which is exactly what you need to calculate flaky test rate accurately.
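A sketch of reading that report follows. It assumes the report has nested `suites`, each with `specs` and `tests`, and that each test carries a `status` of `expected`, `unexpected`, `flaky`, or `skipped`. That matches the JSON reporter output as I understand it, but check it against a report from your own Playwright version. The `Pw*` type names and `collectFlaky` are hypothetical, and the sample object stands in for `JSON.parse` of the report file.

```typescript
// Minimal subset of the report shape this sketch relies on (an assumption).
interface PwResult { retry: number; status: string }
interface PwTest { status: "expected" | "unexpected" | "flaky" | "skipped"; results: PwResult[] }
interface PwSpec { title: string; tests: PwTest[] }
interface PwSuite { title: string; specs: PwSpec[]; suites?: PwSuite[] }
interface PwReport { suites: PwSuite[] }

// Walks nested suites and collects the titles of tests the report marks as flaky.
function collectFlaky(report: PwReport): { total: number; flakyTitles: string[] } {
  let total = 0;
  const flakyTitles: string[] = [];
  const visit = (suite: PwSuite): void => {
    for (const spec of suite.specs || []) {
      for (const test of spec.tests) {
        if (test.status === "skipped") continue; // skipped tests never ran
        total += 1;
        if (test.status === "flaky") flakyTitles.push(spec.title);
      }
    }
    for (const child of suite.suites || []) visit(child);
  };
  for (const suite of report.suites) visit(suite);
  return { total: total, flakyTitles: flakyTitles };
}

// Hand-written sample standing in for the parsed contents of test-results.json.
const sampleReport: PwReport = {
  suites: [
    {
      title: "checkout.spec.ts",
      specs: [
        { title: "adds item to cart", tests: [{ status: "expected", results: [{ retry: 0, status: "passed" }] }] },
        {
          title: "applies coupon",
          tests: [{ status: "flaky", results: [{ retry: 0, status: "failed" }, { retry: 1, status: "passed" }] }],
        },
      ],
      suites: [
        {
          title: "guest checkout",
          specs: [{ title: "shows login prompt", tests: [{ status: "skipped", results: [] }] }],
        },
      ],
    },
  ],
};

const flakySummary = collectFlaky(sampleReport);
console.log(flakySummary.flakyTitles, flakySummary.flakyTitles.length / flakySummary.total);
// one flaky title ("applies coupon") out of two executed tests, so a rate of 0.5
```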
How to calculate flake rate from retry data
If your runner supports retries, the simplest calculation is often:
- Count tests that failed on attempt 1 and passed on attempt 2 as flaky
- Exclude tests that failed on all attempts, because those are unresolved failures
- Optionally track tests that passed initially but failed later in repeated runs as flaky too
For example, if a test suite records 800 test executions in a day and 24 of those failed at least once but eventually passed on retry, then the daily flaky test rate is:
```text
24 / 800 = 3.0%
```
If you want a stricter metric, you can count only tests that fail and pass within the same pipeline run. If you want a broader metric, count any instability across repeated runs in the same commit window.
The strict version is easier to automate and less ambiguous. The broader version is better for diagnosing persistent flake patterns, but it requires more correlation logic.
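The difference between the two versions is easiest to see on a small data set. In this sketch the `RunResult` shape and both function names are hypothetical. "Strict" means mixed results inside one pipeline run, and "broad" means mixed results for the same test anywhere across runs of the same commit.

```typescript
// Attempts for one test in one pipeline run. Field names are illustrative.
interface RunResult {
  commitSha: string;
  runId: string;
  testName: string;
  attemptsPassed: boolean[]; // one entry per attempt, in order; true = passed
}

// True when the list contains at least one pass and at least one failure.
function hasMixed(values: boolean[]): boolean {
  return values.indexOf(true) !== -1 && values.indexOf(false) !== -1;
}

// Strict: the same test both failed and passed inside one pipeline run.
function strictFlakyRuns(results: RunResult[]): RunResult[] {
  return results.filter((r) => hasMixed(r.attemptsPassed));
}

// Broad: commit/test pairs whose attempts disagree anywhere across runs of that commit.
function broadFlakyPairs(results: RunResult[]): string[] {
  const byPair: Record<string, boolean[]> = {};
  for (const r of results) {
    const key = r.commitSha + "::" + r.testName;
    byPair[key] = (byPair[key] || []).concat(r.attemptsPassed);
  }
  return Object.keys(byPair).filter((key) => hasMixed(byPair[key]));
}

const sampleRuns: RunResult[] = [
  // Failed, then passed on retry within run-1: flaky under both definitions.
  { commitSha: "abc", runId: "run-1", testName: "login", attemptsPassed: [false, true] },
  // Passed in run-1 but failed in run-2 on the same commit: only the broad metric sees it.
  { commitSha: "abc", runId: "run-1", testName: "checkout", attemptsPassed: [true] },
  { commitSha: "abc", runId: "run-2", testName: "checkout", attemptsPassed: [false] },
];

console.log(strictFlakyRuns(sampleRuns).length); // 1
console.log(broadFlakyPairs(sampleRuns)); // both "abc::login" and "abc::checkout"
```

Note that the two versions also have different denominators: the strict rate divides by test executions, while the broad rate divides by commit/test pairs, so the two percentages should not be compared directly.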
Common ways flaky test rate gets distorted
1. Counting every failure as a flake
This overstates flakiness and hides real defects. A test can fail honestly.
2. Ignoring retries entirely
This understates flakiness and makes CI look more reliable than it is.
3. Measuring only final build status
A green build with many retries is not the same as a truly stable build.
4. Mixing different test scopes
Unit tests and cross-browser UI tests should not always be compared on the same baseline.
5. Letting quarantined tests disappear from reporting
Quarantine is sometimes necessary, but quarantined tests should still be visible in metrics. Otherwise the dashboard looks better while the underlying problem remains.
Quarantine is not a measurement strategy
Quarantining a flaky test can be the right operational decision, especially when it blocks releases or distracts developers from real failures. But quarantine should not replace measurement.
Track quarantined tests separately:
- How many are quarantined
- How long they have been quarantined
- Whether they still fail intermittently
- Whether their flake rate is trending down after fixes
If quarantined tests are not counted anywhere, you have not solved flakiness. You have just hidden it.
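A quarantine report can be as small as the sketch below, which assumes quarantined tests still run in a non-blocking job so that recent results exist. The `QuarantinedTest` fields and the `quarantineReport` name are hypothetical.

```typescript
interface QuarantinedTest {
  testName: string;
  quarantinedOn: string; // ISO date, e.g. "2026-01-05"
  recentResults: boolean[]; // true = passed; from non-blocking runs since quarantine
}

interface QuarantineStatus {
  testName: string;
  daysQuarantined: number;
  recentFailureRate: number;
  stillIntermittent: boolean; // some recent runs passed and some failed
}

function quarantineReport(tests: QuarantinedTest[], todayIso: string): QuarantineStatus[] {
  const todayMs = Date.parse(todayIso);
  const msPerDay = 24 * 60 * 60 * 1000;
  return tests.map((t) => {
    const runs = t.recentResults.length;
    const failures = t.recentResults.filter((passed) => !passed).length;
    return {
      testName: t.testName,
      daysQuarantined: Math.floor((todayMs - Date.parse(t.quarantinedOn)) / msPerDay),
      recentFailureRate: runs === 0 ? 0 : failures / runs,
      stillIntermittent: failures > 0 && failures < runs,
    };
  });
}

const quarantineSample = quarantineReport(
  [{ testName: "applies coupon", quarantinedOn: "2026-01-05", recentResults: [true, false, true, true] }],
  "2026-02-04"
);
console.log(quarantineSample[0].daysQuarantined, quarantineSample[0].recentFailureRate); // 30 0.25
```

Sorting this report by `daysQuarantined` is a simple way to surface tests that were parked and then forgotten.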
How to interpret flaky test rate over time
A single month of data can be misleading. Look for trends.
A decreasing rate is good, but not enough
If flake rate drops from 6% to 3%, that is progress. Still, check whether the remaining 3% is concentrated in one suite or spread across the whole pipeline.
A stable rate can still be a problem
A flat flaky test rate may mean the team has normalized instability. If your rate never improves, it may be because nobody owns the cleanup work.
A rising rate often signals process drift
This can happen when:
- Test coverage expands faster than test quality practices
- New UI tests inherit old timing assumptions
- Infrastructure changes introduce more environmental noise
- Parallelization exposes hidden shared-state bugs
What good looks like in practice
A healthy CI/CD program usually has a few traits:
- Retry data is preserved
- Flaky tests are labeled, not hidden
- Build stability metrics are reported alongside test pass rates
- Teams have an owner for test reliability, not just feature correctness
- The pipeline distinguishes between first-pass green and retry-green builds
That last point is especially important. If developers see only final success, they will assume the system is healthier than it really is.
A lightweight dashboard model for teams
If you are building a QA or platform dashboard, include these widgets:
- Total test executions
- First-pass pass rate
- Retry count per build
- Flaky test rate by suite
- Top 10 flaky tests by frequency
- Flaky tests by environment
- Median time to green
- Builds blocked by instability
This gives QA managers and SREs a much better view of CI/CD reliability than a single pass/fail summary.
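The "top flaky tests" widget is a straightforward aggregation over flaky events. In this sketch, `FlakyEvent` is a hypothetical record emitted once per flaky outcome, and ties are broken alphabetically so the ranking is stable between refreshes.

```typescript
// One record per flaky outcome. Field names are illustrative.
interface FlakyEvent {
  testName: string;
  suite: string;
}

interface FlakyRank {
  testName: string;
  count: number;
}

// Returns the `limit` most frequently flaky tests, most frequent first.
function topFlakyTests(events: FlakyEvent[], limit: number): FlakyRank[] {
  // Object.create(null) avoids collisions with inherited keys such as "toString".
  const counts: Record<string, number> = Object.create(null);
  for (const e of events) {
    counts[e.testName] = (counts[e.testName] || 0) + 1;
  }
  return Object.keys(counts)
    .map((testName) => ({ testName: testName, count: counts[testName] }))
    .sort((a, b) => b.count - a.count || (a.testName < b.testName ? -1 : 1))
    .slice(0, limit);
}

const flakyEvents: FlakyEvent[] = [
  { testName: "applies coupon", suite: "checkout" },
  { testName: "loads dashboard", suite: "home" },
  { testName: "applies coupon", suite: "checkout" },
  { testName: "uploads avatar", suite: "profile" },
  { testName: "applies coupon", suite: "checkout" },
  { testName: "uploads avatar", suite: "profile" },
];

console.log(topFlakyTests(flakyEvents, 2));
// "applies coupon" with 3 events, then "uploads avatar" with 2
```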
Example SQL for a simple flake-rate query
If your test results are stored in a table with attempt-level rows, you can calculate a basic rate like this:
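```sql
-- Share of rows in the last 7 days that are flagged as passed after retry.
-- Assumes passed_after_retry is stored as an integer 0/1 flag.
SELECT
  ROUND(
    100.0 * SUM(CASE WHEN passed_after_retry = 1 THEN 1 ELSE 0 END) / COUNT(*),
    2
  ) AS flaky_test_rate_percent
FROM test_attempts
WHERE started_at >= CURRENT_DATE - INTERVAL '7 days';
```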
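This assumes you already mark whether an execution passed after retry, and that the query sees one row per test execution. If the table holds one row per attempt, COUNT(*) counts attempts rather than executions and the rate will come out too low, so collapse the rows to one per execution first. Some teams derive the flag from grouped attempt data instead.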
Measurement pitfalls in UI and end-to-end testing
UI tests are often the noisiest part of the suite, so flaky test rate tends to be highest there. Common causes include:
- Animated elements and unstable selectors
- Async rendering and insufficient waits
- Shared test accounts
- Rate limiting in third-party dependencies
- Browser timing differences
- Screens or dialogs that appear only sometimes
When UI flakiness is the main problem, the best fixes are usually in test design, not more retries. Better selectors, explicit waits, isolated test data, and deterministic setup often improve CI test reliability more than any dashboard can.
What to do once you have the metric
Measuring flaky test rate is only the first step. Use it to drive action.
If one test is repeatedly flaky
- Fix the root cause
- Remove arbitrary sleeps
- Improve selectors
- Isolate test data
- Reduce external dependencies
If one suite is flaky
- Review setup and teardown
- Check parallel execution conflicts
- Examine shared fixtures
- Split slow tests from fast checks
If flakiness is system-wide
- Audit runner infrastructure
- Check container resource limits
- Review network dependencies
- Investigate environment provisioning
- Look for global timing or clock-related issues
If the rate is improving but still high
- Prioritize the top offenders by frequency and business impact
- Stop adding new unstable tests until the worst ones are addressed
- Make reliability part of the definition of done for test automation work
A simple governance model for reliability
Treat flaky test rate like any other engineering health metric.
- Set an owner for reliability reporting
- Define what counts as a flake
- Review the trend on a regular cadence
- Require a remediation plan for top flaky tests
- Track reductions in false failures over time
The goal is not to eliminate every intermittent failure immediately. The goal is to make instability visible, measurable, and hard to ignore.
Final takeaway
Flaky test rate is one of the most useful metrics in CI/CD because it exposes the difference between a truly stable pipeline and a pipeline that only looks stable after retries. If you define it clearly, record attempt-level data, and interpret it alongside build stability metrics, you can reduce false failures without confusing them with real defects.
The practical mindset is simple: do not trust final green status alone, measure retry behavior, segment the data, and fix the tests or environments that repeatedly undermine confidence.
When the flaky test rate goes down, CI becomes faster to trust, builds become easier to interpret, and release decisions get less noisy. That is the real value of the metric.