How to Measure Flaky Test Rate in CI/CD Without Guesswork

Flaky tests are one of the fastest ways to make a CI pipeline feel unreliable even when the product is fine. A test suite that fails unpredictably creates a bad loop: developers stop trusting failures, QA spends time rerunning jobs, and release decisions start relying on gut feel instead of evidence.

That is why the flaky test rate matters. It is a practical metric for separating genuine product failures from test instability, and it gives teams a way to track whether CI/CD reliability is improving or drifting in the wrong direction.

The hard part is not collecting a number. The hard part is defining one that is meaningful, consistent, and hard to game. This article breaks down how to measure flaky test rate without guesswork, how to instrument it in a pipeline, and how to interpret it alongside other build stability metrics.

What flaky test rate actually means

At a high level, flaky test rate measures the proportion of test outcomes that change when the underlying code and environment have not meaningfully changed. In other words, the test fails sometimes and passes other times under the same conditions.

This sounds simple, but in practice there are a few different ways teams define it:

Failure retry rate: how often a test passes only after a rerun
Non-deterministic failure rate: how often a test fails on one run and passes on another with the same commit
Flake frequency per test: how often an individual test exhibits unstable behavior over time
Suite-level flakiness: the share of all test executions that are flaky

A flaky test rate is only useful if the definition is explicit. If one team counts a retry as a flake and another counts only confirmed nondeterministic failures, the metric will not be comparable.

For CI/CD reliability, the most useful definition is usually this:

Flaky test rate = flaky test outcomes / total test outcomes in a defined time window

That gives you a stable ratio you can trend over time, but it only works if you know what counts as a flaky outcome.

Why flaky test rate is not the same as failure rate

A high failure rate and a high flaky test rate are not the same thing.

A test can fail for good reasons, for example:

The feature is broken
A contract changed and the test is outdated
An environment dependency is actually unavailable
A deployment introduced a real regression

A flaky test fails for bad reasons, meaning the code under test is not the root cause, or at least not consistently the root cause.

This distinction matters because teams often chase the wrong problem. If failure rate is high, the answer may be product defects, brittle tests, or missing test coverage. If flaky test rate is high, the problem is usually test design, environment stability, timing assumptions, or shared state.

The definitions you need before you start measuring

Before adding dashboards or alerts, establish a shared vocabulary.

1. What is a test execution?

A test execution is a single run of a test in a given context, often tied to a commit, branch, environment, and pipeline job.

For example, if your test suite runs on every pull request and again after merge, those are separate executions even if they use the same test code.

2. What is a flaky outcome?

A flaky outcome is usually one of these patterns:

The test fails, then passes on immediate retry with no code change
The test passes on one run, then fails on a rerun against the same commit and environment
The test fails intermittently across a stable time window with no relevant change in the product or test code

The first pattern is the easiest to automate in CI. The second is useful but requires careful correlation between runs. The third is best for trend analysis, but it is harder to attribute to a specific cause.

3. What time window are you measuring?

Flaky rates are sensitive to window choice. A daily window is useful for spotting incidents. A weekly or monthly window is better for tracking real improvement.

Pick a primary window and keep it consistent. Otherwise the numbers will move around because of the window, not because of the tests.
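To see how much the window choice moves the number, here is a minimal sketch that buckets the same outcomes by day and by ISO week. The function name `rate_by_window` and the `(date, is_flaky)` input shape are assumptions made for illustration, not part of any particular CI tool.

```python
from collections import defaultdict
from datetime import date


def rate_by_window(outcomes, window="day"):
    """Return {window_key: flaky_rate} for (date, is_flaky) pairs.

    window is "day" (ISO date) or "week" (ISO year and week number).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [flaky_count, total_count]
    for day, is_flaky in outcomes:
        if window == "day":
            key = day.isoformat()
        elif window == "week":
            iso = day.isocalendar()
            key = f"{iso[0]}-W{iso[1]:02d}"
        else:
            raise ValueError("window must be 'day' or 'week'")
        totals[key][0] += int(is_flaky)
        totals[key][1] += 1
    return {key: flaky / total for key, (flaky, total) in sorted(totals.items())}


# Hypothetical sample data: five outcomes across two ISO weeks.
outcomes = [
    (date(2024, 1, 1), True),
    (date(2024, 1, 1), False),
    (date(2024, 1, 2), False),
    (date(2024, 1, 2), False),
    (date(2024, 1, 8), True),
]
daily = rate_by_window(outcomes, "day")
weekly = rate_by_window(outcomes, "week")
```

With this sample, the daily view swings between 0% and 100% while the weekly view is smoother, which is the effect described above.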

A practical formula for flaky test rate

Here is a straightforward way to compute it.

flaky test rate = number of flaky test outcomes / total test outcomes

If you want a percentage:

flaky test rate % = (flaky test outcomes / total test outcomes) * 100

That raises the next question: what counts as a flaky test outcome?

A pragmatic CI-friendly approach is to count a test as flaky when it:

Fails on the first run
Passes on an immediate retry
Has no relevant code or environment change between attempts

This method is not perfect, but it is operationally useful because it can be implemented in most pipelines without deep statistical modeling.
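The three conditions above can be sketched as a small predicate. The `Attempt` record and its field names are hypothetical. The sketch also treats any passing retry, not only the immediate one, as recovery, which is slightly broader than the rule as stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    status: str        # "passed" or "failed"
    commit_sha: str    # commit the attempt ran against
    environment: str   # runner or target environment label


def is_flaky_outcome(attempts: list[Attempt]) -> bool:
    """True when the first attempt failed, a retry passed, and every
    attempt ran against the same commit and environment."""
    if len(attempts) < 2:
        return False  # no retry happened, so nothing can be inferred
    first = attempts[0]
    same_context = all(
        a.commit_sha == first.commit_sha and a.environment == first.environment
        for a in attempts
    )
    failed_first = first.status == "failed"
    passed_on_retry = any(a.status == "passed" for a in attempts[1:])
    return same_context and failed_first and passed_on_retry
```

A test that fails on every attempt is deliberately not counted here, since it looks like an unresolved failure rather than a flake.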

Example

Suppose over one week you observe:

1,200 total test executions
36 first-attempt failures that passed on retry
4 additional unstable cases discovered through reruns or quarantined jobs

If your team counts all 40 as flaky outcomes, then:

40 / 1200 = 0.0333

So the flaky test rate is 3.33% for that window.

What matters is not the exact decimal, but that you always calculate it the same way.
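The same arithmetic as a small helper, so the calculation is written down once and reused. The function name is an assumption. The numbers are the ones from the example above.

```python
def flaky_test_rate(flaky_outcomes: int, total_outcomes: int) -> float:
    """Return the flaky test rate as a fraction between 0 and 1."""
    if total_outcomes <= 0:
        raise ValueError("total_outcomes must be positive")
    if not 0 <= flaky_outcomes <= total_outcomes:
        raise ValueError("flaky_outcomes must be between 0 and total_outcomes")
    return flaky_outcomes / total_outcomes


retry_recovered = 36   # first-attempt failures that passed on retry
other_unstable = 4     # unstable cases found through reruns or quarantine
rate = flaky_test_rate(retry_recovered + other_unstable, 1200)
print(f"{rate:.4f} -> {rate * 100:.2f}%")  # prints "0.0333 -> 3.33%"
```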

Where to get the data in CI/CD

Measuring flaky test rate requires test result data that is tied to execution context. At minimum, capture:

Test name or ID
Suite or component
Build ID or pipeline run ID
Commit SHA
Branch name or pull request ID
Timestamp
Environment or test target
Retry count
Result for each attempt
Failure reason or error signature if available

You can collect this from a CI system, a test runner, or a test reporting tool. The important part is that every run is traceable.

The minimum instrumentation model

If you are starting from scratch, track each test attempt as a row in a results table. A simple schema needs one column for each of the fields above and one row per attempt, not one row per test.

From that data, you can infer whether a failure was recovered by retry and whether retries were common enough to matter.
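As a rough illustration, the attempt-level row and the retry inference might look like the sketch below. The `AttemptRow` type, its field names, and `summarize_retries` are assumptions that mirror the field list earlier in this section. They are not a schema from any specific tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AttemptRow:
    test_id: str
    suite: str
    run_id: str               # build or pipeline run ID
    commit_sha: str
    branch: str
    started_at: datetime
    environment: str
    attempt: int              # 1 for the first run, 2 for the first retry, ...
    status: str               # "passed" or "failed"
    failure_signature: Optional[str] = None


def summarize_retries(rows: list[AttemptRow]) -> dict:
    """Group attempt rows into executions keyed by (run_id, test_id)."""
    by_execution = defaultdict(list)
    for row in rows:
        by_execution[(row.run_id, row.test_id)].append(row)

    summary = {}
    for key, attempts in by_execution.items():
        attempts.sort(key=lambda r: r.attempt)
        summary[key] = {
            "retries": len(attempts) - 1,
            # Failed first and passed last: the retry recovered the test.
            "recovered_by_retry": attempts[0].status == "failed"
            and attempts[-1].status == "passed",
        }
    return summary
```

Keeping the execution key explicit makes it easier to count executions, rather than attempts, in the denominator later.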

Why retries are both useful and dangerous

Retries are good for the developer experience in CI because they prevent one-off noise from blocking the whole pipeline. But retries also hide flakiness if you treat them as normal pass outcomes.

A retry that passes is not a clean pass. It is evidence of instability.

If your CI only reports final status and discards retry history, you are probably undercounting flaky tests.

That is why the raw attempt log is more valuable than just the final job result.

A better metric stack than a single number

Flaky test rate is useful, but on its own it can mislead you. A mature CI/CD dashboard usually combines several related metrics.

1. Flaky test rate

This tells you how much of the test signal is unstable.

2. First-pass pass rate

This measures how often tests pass without retry. It is often a better indicator of developer trust than final pass rate.

3. Retry rate

This shows how often the pipeline needed at least one extra attempt to reach a green state.

4. False failure rate

This measures failures that did not represent a real product issue. It overlaps with flakiness, but some teams track failures caused by ephemeral infrastructure issues as a separate category.

5. Build stability metrics

This is a broader category that includes pipeline success rate, mean time to green, queue delays, and the proportion of builds blocked by unstable tests.

If you only track flake rate, you can miss the business impact. If you only track build stability, you can miss the root cause.
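Several of these metrics can come from the same attempt data. The sketch below assumes each execution is a non-empty list of attempt statuses in order. The function and key names are illustrative only.

```python
def stability_metrics(executions: list[list[str]]) -> dict[str, float]:
    """Compute related rates from per-execution attempt statuses.

    Each execution is a non-empty list such as ["failed", "passed"].
    """
    total = len(executions)
    if total == 0:
        raise ValueError("at least one execution is required")
    first_pass = sum(1 for a in executions if a[0] == "passed")
    retried = sum(1 for a in executions if len(a) > 1)
    flaky = sum(1 for a in executions if a[0] == "failed" and a[-1] == "passed")
    final_pass = sum(1 for a in executions if a[-1] == "passed")
    return {
        "first_pass_pass_rate": first_pass / total,
        "retry_rate": retried / total,
        "flaky_test_rate": flaky / total,
        "final_pass_rate": final_pass / total,
    }


# Hypothetical sample: 7 clean passes, 2 retry recoveries, 1 hard failure.
sample = [["passed"]] * 7 + [["failed", "passed"]] * 2 + [["failed", "failed"]]
metrics = stability_metrics(sample)
```

In this sample the final pass rate is 90% while the first-pass pass rate is only 70%. That gap is what a final-status-only report hides.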

How to segment flaky test rate so it becomes actionable

A single suite-wide rate is not enough to fix the problem. You need slices.

By test type

Break it down by:

Unit tests
Integration tests
API tests
UI tests
End-to-end tests

UI and end-to-end tests often show higher flake rates because they depend on timing, animations, network calls, and shared environments.

By environment

Compare:

Local developer runs
Pull request pipelines
Main branch pipelines
Staging environments
Nightly scheduled runs

If a test is stable locally but flaky only in CI, the issue may be environment-related rather than test logic.

By owning team or subsystem

Assign flakiness to the component owner when possible. Without ownership, flaky test rate becomes an abstract dashboard number that nobody improves.

By failure signature

Group failures by stack trace, assertion text, network error, timeout, or selector issue. Two failures that look different may share the same underlying cause.
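A minimal sketch of both kinds of slicing follows: a rate per segment, and a crude normalizer that groups error messages differing only in numbers or addresses. The row shape, the `flaky` flag, and the normalization rules are assumptions. Real signatures usually need project-specific rules.

```python
import re
from collections import Counter, defaultdict


def rate_by_segment(rows: list[dict], key: str) -> dict[str, float]:
    """Flaky rate per value of rows[i][key]; each row has a boolean 'flaky'."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [flaky, total]
    for row in rows:
        counts[row[key]][0] += int(row["flaky"])
        counts[row[key]][1] += 1
    return {segment: flaky / total for segment, (flaky, total) in counts.items()}


def normalize_signature(message: str) -> str:
    """Collapse volatile details so similar failures share one signature."""
    sig = message.strip().lower()
    sig = re.sub(r"0x[0-9a-f]+", "<hex>", sig)  # memory addresses, hex ids
    sig = re.sub(r"\d+", "<n>", sig)            # timeouts, ports, counters
    sig = re.sub(r"\s+", " ", sig)              # whitespace differences
    return sig


def group_by_signature(messages: list[str]) -> Counter:
    """Count failures per normalized signature."""
    return Counter(normalize_signature(m) for m in messages)
```

For example, `rate_by_segment(rows, "test_type")` and `rate_by_segment(rows, "environment")` reuse the same rows for two of the slices described above.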

How to distinguish flakiness from legitimate failure

This is where teams often make mistakes.

A failure is not flaky just because it later passes. Sometimes the first failure exposes a real bug, such as a race condition in the product, an unstable backend, or state leaking through the environment, and the passing retry merely masks it.

Use a few rules.

Treat as likely flaky when:

The same test passes and fails repeatedly on the same commit
The error changes across retries
The failure disappears when rerun immediately without any code change
The stack trace suggests timing, waiting, or state isolation issues

Treat as likely product failure when:

The failure is consistent across reruns
The error is tied to a deterministic assertion mismatch
Multiple independent tests fail in the same area
Logs and traces point to a real regression

Treat as environment instability when:

Many unrelated tests fail at once
Failures correlate with infrastructure events, network issues, or service outages
The same suite is unstable only on one runner type or region

This classification does not need to be perfect. It just needs to be good enough to separate signal from noise.
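The rules above can be approximated with a simple heuristic. The sketch below is deliberately rough. The labels, the argument names, and the threshold of 10 unrelated failures are arbitrary assumptions to tune per pipeline.

```python
def classify_failure(
    statuses: list[str],
    error_signatures: list[str],
    unrelated_failures_in_run: int = 0,
    mass_failure_threshold: int = 10,  # assumed cutoff, not a standard value
) -> str:
    """Classify the attempts of one test on one commit.

    statuses: attempt results in order, e.g. ["failed", "passed"].
    error_signatures: one normalized signature per failed attempt.
    unrelated_failures_in_run: failures of other, unrelated tests in the run.
    """
    if "failed" not in statuses:
        return "no_failure"
    if unrelated_failures_in_run >= mass_failure_threshold:
        return "environment_instability"  # many unrelated tests failed at once
    if "passed" in statuses:
        return "likely_flaky"             # passes and fails on the same commit
    if len(set(error_signatures)) > 1:
        return "likely_flaky"             # the error changes across retries
    return "likely_product_failure"       # consistent failure, same error
```

A heuristic like this is a triage aid. Ambiguous cases still need a person to look at logs and traces.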

A CI pipeline pattern that makes flaky rate measurable

If your pipeline only stores final pass or fail, you will struggle to calculate anything beyond rough guesswork. A better pipeline pattern is:

Run the test once
If it fails, retry once or twice with the same commit and environment
Record every attempt separately
Mark the test as flaky if it passes after a prior failure
Keep the retry reason and failure signature

Here is a minimal GitHub Actions workflow that can serve as the starting point for a retry-friendly structure.

name: tests
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test

The YAML above does not include retry logic by itself, but it shows the kind of pipeline where you would add it, either in the test runner or via a wrapper that records attempts.
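One possible shape for such a wrapper is sketched below. It retries a whole command instead of individual tests, so it is coarser than runner-level retries. The function names and the record layout are assumptions.

```python
import subprocess
import time
from typing import Callable


def run_with_retries(run_once: Callable[[], bool], max_retries: int = 2) -> dict:
    """Call run_once until it passes or retries run out; keep every attempt."""
    attempts = []
    for attempt in range(1, max_retries + 2):  # first run plus max_retries
        started = time.monotonic()
        passed = run_once()
        attempts.append({
            "attempt": attempt,
            "status": "passed" if passed else "failed",
            "duration_s": round(time.monotonic() - started, 3),
        })
        if passed:
            break
    final_status = attempts[-1]["status"]
    return {
        "attempts": attempts,
        "final_status": final_status,
        # Passed only after at least one failure: evidence of instability.
        "flaky": len(attempts) > 1 and final_status == "passed",
    }


def npm_test() -> bool:
    """Run the project's test command once; True when it exits with 0."""
    return subprocess.run(["npm", "test"], check=False).returncode == 0
```

Calling `run_with_retries(npm_test)` and writing the returned dictionary to a JSON artifact would preserve the attempt history that a final job status discards.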

A Playwright example can capture retry awareness directly in the test config.

import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 1,
  reporter: [['line'], ['json', { outputFile: 'test-results.json' }]],
});

That JSON report is useful because it preserves attempt-level detail, which is exactly what you need to calculate flaky test rate accurately.
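A report like that can be summarized with a short script. The sketch below assumes the report nests `suites`, `specs`, and `tests`, and that a test recovered by retry carries `"status": "flaky"`. That is my understanding of the reporter's layout, and it should be checked against the output of the installed Playwright version. The sample data is made up.

```python
def iter_tests(suite: dict):
    """Yield (spec_title, test) pairs from a suite and its nested suites."""
    for spec in suite.get("specs", []):
        for test in spec.get("tests", []):
            yield spec.get("title", ""), test
    for child in suite.get("suites", []):
        yield from iter_tests(child)


def flaky_summary(report: dict) -> dict:
    """Count executed tests and those the report marks as flaky."""
    total = 0
    flaky_titles = []
    for suite in report.get("suites", []):
        for title, test in iter_tests(suite):
            if test.get("status") == "skipped":
                continue  # skipped tests produced no outcome
            total += 1
            if test.get("status") == "flaky":
                flaky_titles.append(title)
    rate = len(flaky_titles) / total if total else 0.0
    return {"total": total, "flaky": flaky_titles, "flaky_rate": rate}


# Hypothetical report fragment with one flaky, one passing, one failing test.
sample_report = {
    "suites": [{
        "title": "auth.spec.ts",
        "specs": [
            {"title": "logs in", "tests": [{"status": "flaky"}]},
            {"title": "logs out", "tests": [{"status": "expected"}]},
        ],
        "suites": [{
            "title": "password",
            "specs": [{"title": "resets password", "tests": [{"status": "unexpected"}]}],
        }],
    }]
}
summary = flaky_summary(sample_report)
```

For a real run, load the file with `json.load` and pass the resulting dictionary to `flaky_summary`.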

How to calculate flake rate from retry data

If your runner supports retries, the simplest calculation is often:

Count tests that failed on attempt 1 and passed on attempt 2 as flaky
Exclude tests that failed on all attempts, because those are unresolved failures
Optionally track tests that passed initially but failed later in repeated runs as flaky too

For example, if a test suite records 800 test executions in a day and 24 of those failed at least once but eventually passed on retry, then the daily flaky test rate is:

24 / 800 = 3.0%

If you want a stricter metric, you can count only tests that fail and pass within the same pipeline run. If you want a broader metric, count any instability across repeated runs in the same commit window.

The strict version is easier to automate and less ambiguous. The broader version is better for diagnosing persistent flake patterns, but it requires more correlation logic.
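The difference between the two versions comes down to the grouping key. In the sketch below, the strict version groups attempts by pipeline run and the broader one groups by commit. The dictionary keys (`run_id`, `commit_sha`, `test_id`, `status`) are assumed names.

```python
from collections import defaultdict


def _mixed_results(attempts: list[dict], key_fields: tuple) -> set:
    """Return the keys under which a test both passed and failed."""
    seen = defaultdict(set)
    for a in attempts:
        key = tuple(a[field] for field in key_fields)
        seen[key].add(a["status"])
    return {key for key, statuses in seen.items() if {"passed", "failed"} <= statuses}


def strict_flaky(attempts: list[dict]) -> set:
    """Tests that failed and passed within the same pipeline run."""
    return _mixed_results(attempts, ("run_id", "test_id"))


def broad_flaky(attempts: list[dict]) -> set:
    """Tests that failed and passed anywhere on the same commit."""
    return _mixed_results(attempts, ("commit_sha", "test_id"))


# Hypothetical attempts: t1 recovers inside run r1; t2 differs across runs.
attempts = [
    {"run_id": "r1", "commit_sha": "c1", "test_id": "t1", "status": "failed"},
    {"run_id": "r1", "commit_sha": "c1", "test_id": "t1", "status": "passed"},
    {"run_id": "r2", "commit_sha": "c1", "test_id": "t2", "status": "passed"},
    {"run_id": "r3", "commit_sha": "c1", "test_id": "t2", "status": "failed"},
]
strict = strict_flaky(attempts)
broad = broad_flaky(attempts)
```

Here `t2` is caught only by the broader version, because its conflicting results come from different runs of the same commit.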

Common ways flaky test rate gets distorted

1. Counting every failure as a flake

This overstates flakiness and hides real defects. A test can fail honestly.

2. Ignoring retries entirely

This understates flakiness and makes CI look more reliable than it is.

3. Measuring only final build status

A green build with many retries is not the same as a truly stable build.

4. Mixing different test scopes

Unit tests and cross-browser UI tests should not always be compared on the same baseline.

5. Letting quarantined tests disappear from reporting

Quarantine is sometimes necessary, but quarantined tests should still be visible in metrics. Otherwise the dashboard looks better while the underlying problem remains.

Quarantine is not a measurement strategy

Quarantining a flaky test can be the right operational decision, especially when it blocks releases or distracts developers from real failures. But quarantine should not replace measurement.

Track quarantined tests separately:

How many are quarantined
How long they have been quarantined
Whether they still fail intermittently
Whether their flake rate is trending down after fixes

If quarantined tests are not counted anywhere, you have not solved flakiness. You have just hidden it.
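A small report keeps quarantined tests visible. The sketch below assumes each quarantined test is a dictionary with a quarantine date and its recent results. Those field names are made up for illustration.

```python
from datetime import date


def quarantine_report(quarantined: list[dict], today: date) -> list[dict]:
    """Summarize quarantined tests, longest-quarantined first."""
    report = []
    for q in quarantined:
        results = q["recent_results"]
        report.append({
            "test_id": q["test_id"],
            "days_quarantined": (today - q["quarantined_on"]).days,
            # Both outcomes present in recent runs: still intermittent.
            "still_intermittent": "passed" in results and "failed" in results,
        })
    return sorted(report, key=lambda r: r["days_quarantined"], reverse=True)


# Hypothetical quarantine list.
quarantined = [
    {"test_id": "checkout", "quarantined_on": date(2024, 2, 20),
     "recent_results": ["passed", "passed", "passed"]},
    {"test_id": "login", "quarantined_on": date(2024, 2, 1),
     "recent_results": ["passed", "failed", "passed"]},
]
report = quarantine_report(quarantined, today=date(2024, 3, 1))
```

Sorting by age puts long-forgotten quarantines at the top, where they are hardest to ignore.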

How to interpret flaky test rate over time

A single window of data can be misleading. Look for trends across several windows.

A decreasing rate is good, but not enough

If flake rate drops from 6% to 3%, that is progress. Still, check whether the remaining 3% is concentrated in one suite or spread across the whole pipeline.

A stable rate can still be a problem

A flat flaky test rate may mean the team has normalized instability. If your rate never improves, it may be because nobody owns the cleanup work.

A rising rate often signals process drift

This can happen when:

Test coverage expands faster than test quality practices
New UI tests inherit old timing assumptions
Infrastructure changes introduce more environmental noise
Parallelization exposes hidden shared-state bugs

What good looks like in practice

A healthy CI/CD program usually has a few traits:

Retry data is preserved
Flaky tests are labeled, not hidden
Build stability metrics are reported alongside test pass rates
Teams have an owner for test reliability, not just feature correctness
The pipeline distinguishes between first-pass green and retry-green builds

That last point is especially important. If developers see only final success, they will assume the system is healthier than it really is.

A lightweight dashboard model for teams

If you are building a QA or platform dashboard, include these widgets:

Total test executions
First-pass pass rate
Retry count per build
Flaky test rate by suite
Top 10 flaky tests by frequency
Flaky tests by environment
Median time to green
Builds blocked by instability

This gives QA managers and SREs a much better view of CI/CD reliability than a single pass/fail summary.
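Several of these widgets reduce to simple aggregations over per-build records. The sketch below assumes each build is a dictionary with `retries`, `minutes_to_green` (`None` if the build never went green), and `flaky_tests`. All of these names are hypothetical.

```python
from collections import Counter
from statistics import median


def dashboard_summary(builds: list[dict], top_n: int = 10) -> dict:
    """Aggregate per-build records into dashboard-style figures."""
    flaky_counter = Counter(t for b in builds for t in b["flaky_tests"])
    green = [b for b in builds if b["minutes_to_green"] is not None]
    green_times = [b["minutes_to_green"] for b in green]
    return {
        "builds": len(builds),
        "first_pass_green_builds": sum(1 for b in green if b["retries"] == 0),
        "retry_green_builds": sum(1 for b in green if b["retries"] > 0),
        "total_retries": sum(b["retries"] for b in builds),
        "top_flaky_tests": flaky_counter.most_common(top_n),
        "median_minutes_to_green": median(green_times) if green_times else None,
    }


# Hypothetical builds: one clean, one green after retries, one never green.
builds = [
    {"retries": 0, "minutes_to_green": 12.0, "flaky_tests": []},
    {"retries": 2, "minutes_to_green": 30.0, "flaky_tests": ["login", "checkout"]},
    {"retries": 1, "minutes_to_green": None, "flaky_tests": ["login"]},
]
summary = dashboard_summary(builds)
```

Separating first-pass green from retry-green builds in the output keeps retried successes from blending into the overall pass count.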

Example SQL for a simple flake-rate query

If your test results are stored in a table with attempt-level rows, you can calculate a basic rate like this:

SELECT
  ROUND(
    100.0 * SUM(CASE WHEN passed_after_retry = 1 THEN 1 ELSE 0 END) / COUNT(*),
    2
  ) AS flaky_test_rate_percent
FROM test_attempts
WHERE started_at >= CURRENT_DATE - INTERVAL '7 days';

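This assumes the passed_after_retry flag (stored as 0 or 1) is already set and that each row represents one test execution. If the table holds one row per attempt, COUNT(*) counts attempts, not executions, and the rate comes out too low, so some teams derive the flag from grouped attempt data instead. The INTERVAL syntax shown is PostgreSQL-style and may need adjusting for other databases.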
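Deriving the flag from grouped attempt data could look like the sketch below. It uses Python's built-in sqlite3 module with an in-memory table. The column names (`run_id`, `test_id`, `attempt`, `status`) are assumptions, and the time-window filter is left out to keep the example short.

```python
import sqlite3

# Collapse attempts into one row per execution, then compute the rate
# over executions so that retries do not inflate the denominator.
QUERY = """
WITH executions AS (
    SELECT
        run_id,
        test_id,
        MAX(CASE WHEN attempt = 1 AND status = 'failed' THEN 1 ELSE 0 END) AS failed_first,
        MAX(CASE WHEN attempt > 1 AND status = 'passed' THEN 1 ELSE 0 END) AS passed_later
    FROM test_attempts
    GROUP BY run_id, test_id
)
SELECT ROUND(100.0 * SUM(failed_first * passed_later) / COUNT(*), 2)
FROM executions
"""


def flaky_rate_percent(conn: sqlite3.Connection) -> float:
    """Return the flaky test rate in percent, counted per execution."""
    return conn.execute(QUERY).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_attempts (run_id TEXT, test_id TEXT, attempt INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO test_attempts VALUES (?, ?, ?, ?)",
    [
        ("r1", "login", 1, "failed"), ("r1", "login", 2, "passed"),
        ("r1", "search", 1, "passed"),
        ("r1", "checkout", 1, "failed"), ("r1", "checkout", 2, "failed"),
        ("r2", "login", 1, "passed"),
    ],
)
rate_percent = flaky_rate_percent(conn)
print(rate_percent)  # 25.0: one flaky execution out of four
```

Dividing by the six attempt rows instead would give about 16.67%. The gap shows why the denominator has to be defined explicitly.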

Measurement pitfalls in UI and end-to-end testing

UI tests are often the noisiest part of the suite, so flaky test rate tends to be highest there. Common causes include:

Animated elements and unstable selectors
Async rendering and insufficient waits
Shared test accounts
Rate limiting in third-party dependencies
Browser timing differences
Screens or dialogs that appear only sometimes

When UI flakiness is the main problem, the best fixes are usually in test design, not more retries. Better selectors, explicit waits, isolated test data, and deterministic setup often improve CI test reliability more than any dashboard can.

What to do once you have the metric

Measuring flaky test rate is only the first step. Use it to drive action.

If one test is repeatedly flaky

Fix the root cause
Remove arbitrary sleeps
Improve selectors
Isolate test data
Reduce external dependencies

If one suite is flaky

Review setup and teardown
Check parallel execution conflicts
Examine shared fixtures
Split slow tests from fast checks

If flakiness is system-wide

Audit runner infrastructure
Check container resource limits
Review network dependencies
Investigate environment provisioning
Look for global timing or clock-related issues

If the rate is improving but still high

Prioritize the top offenders by frequency and business impact
Stop adding new unstable tests until the worst ones are addressed
Make reliability part of the definition of done for test automation work

A simple governance model for reliability

Treat flaky test rate like any other engineering health metric.

Set an owner for reliability reporting
Define what counts as a flake
Review the trend on a regular cadence
Require a remediation plan for top flaky tests
Track reductions in false failures over time

The goal is not to eliminate every intermittent failure immediately. The goal is to make instability visible, measurable, and hard to ignore.

Final takeaway

Flaky test rate is one of the most useful metrics in CI/CD because it exposes the difference between a truly stable pipeline and a pipeline that only looks stable after retries. If you define it clearly, record attempt-level data, and interpret it alongside build stability metrics, you can reduce false failures without confusing them with real defects.

The practical mindset is simple: do not trust final green status alone. Measure retry behavior, segment the data, and fix the tests or environments that repeatedly undermine confidence.

When the flaky test rate goes down, CI becomes easier to trust, builds become easier to interpret, and release decisions get less noisy. That is the real value of the metric.