Why CI Pipelines Pass on Replay but Fail in Real Runs: A Debugging Guide for Environment Drift

When a CI pipeline passes after a replay but fails during the original run, the obvious instinct is to blame flaky tests. Sometimes that is correct. More often, the failure is a signal that the pipeline environment changed underneath the test, or that the test only appears stable when the surrounding conditions are already warm, cached, or subtly different from the real execution path.

That distinction matters. A replay that reuses artifacts, network state, browser state, or infrastructure timing can hide the very issue you need to see. If your team wants trustworthy automation, the goal is not to make every replay green. The goal is to understand why CI pipelines pass on replay but fail in real runs, then remove the drift that makes the first run unreliable.

This guide is diagnosis-first. It focuses on how to separate test noise from genuine environment problems, how to inspect the layers where drift sneaks in, and how to make your pipelines more reproducible without turning every build into a science project.

What environment drift actually is

Environment drift is any difference between the conditions your pipeline expects and the conditions it actually runs in. That difference can be obvious, like a different browser version or a missing environment variable, or subtle, like a changed DNS response, a warmed cache, or a container image that was rebuilt from a moving base tag.

In practice, drift shows up across several layers:

Code and dependency drift: package versions change, lockfiles are ignored, transitive dependencies shift.
Build image drift: base images are rebuilt, OS packages change, browser binaries differ.
Infrastructure drift: node allocation, CPU contention, memory pressure, ephemeral ports, clock skew.
Data drift: databases contain different seed data, test accounts collide, feature flags differ.
Network drift: service latency changes, DNS resolution changes, TLS behavior differs, third-party APIs rate limit.
State drift: local caches, browser profiles, token lifetimes, filesystem residue, leaked sessions.

If a test only passes when a replay has already warmed the system, the test may be observing hidden state, not validating product behavior.

That does not automatically mean the test is bad. It may simply mean the test is incomplete. A good diagnosis asks, “What changed between the failed run and the replay?” not just “How do I make it green?”

Why replays can be misleading

Replay is useful for confirming that a failure is not fully deterministic. But replay can accidentally normalize the pipeline in ways that hide real problems.

Common reasons a replay looks healthier than the original run include:

Cached dependencies and artifacts

The first run might spend time downloading packages, pulling Docker layers, or restoring build caches. During that window, timeouts, startup races, and missing readiness checks become visible. A replay can skip much of that work because the cache is already populated.

Warmed external state

A test that talks to a shared service may pass on replay because the service already contains records created by the failed run. That can make the replay look stable even though the pipeline is mutating shared state in ways the test did not account for.

Different execution timing

Many failures are race conditions. On replay, the test may run slightly slower, slightly faster, or on a less loaded worker. That changes whether it collides with startup, session expiry, or background jobs.

Hidden retries

Some CI systems or test runners retry failed steps. A replay may not be a true reproduction if the retried path benefits from a half-initialized environment, pre-existing auth token, or idempotent setup step.

Manual debugging changes the system

Even small changes during a replay, like opening the app locally, tailing logs, or pausing a job, can perturb timing enough to hide the original fault.

First question, is this a test problem or an environment problem?

Before changing code, classify the failure by symptom. This prevents teams from turning every issue into a test rewrite or, worse, every issue into an infrastructure upgrade.

Signs it is probably test noise

The selector is brittle and breaks with harmless DOM changes.
The test order matters, which suggests shared state leakage.
Assertions are too specific about timestamps, IDs, or generated text.
The test races an element before it is ready because it uses fixed sleeps instead of explicit waits.
The same failure appears across unrelated apps or branches, but only when tests run in parallel.

Signs it is probably environment drift

The failure happens after an image rebuild, dependency refresh, or base OS change.
The same test passes locally but fails only in a specific runner type.
Failures correlate with resource pressure, noisy neighbors, or autoscaling.
The app behavior changes when a cache, proxy, or feature flag is present.
Replays succeed only after state from the original run has already been created.

A useful mental model: flaky tests are usually about test design, while environment drift is usually about system assumptions. The two often overlap, but they do not start from the same root cause.

Build a reproducibility baseline

If you cannot reproduce the failure, you cannot confidently fix it. The first job is to make the original failure observable.

Capture the exact runtime context

For every failed CI run, store enough metadata to reconstruct the environment:

Git SHA, branch, and merge base
Container image digest, not just the tag
OS version, browser version, and driver version
Node, Python, Java, or other runtime versions
Relevant environment variables and feature flags
Test shard number and ordering
CPU and memory limits
Network topology details if available

This is not just for postmortems. It helps you compare the failed run and the replay at the level where drift actually happens.
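
As a minimal sketch of what capturing this context could look like, the Python snippet below gathers a few of the items above into a JSON document that a job can upload as an artifact. The environment variable names (GIT_SHA, IMAGE_DIGEST, SHARD_INDEX) are hypothetical placeholders, not names exported by any particular CI system, and the allowlist approach is an assumption intended to keep secrets out of the snapshot.

```python
import json
import os
import platform
import sys


def capture_runtime_context(env_allowlist):
    """Collect a small, JSON-serializable snapshot of the current job's environment."""
    return {
        # GIT_SHA, IMAGE_DIGEST, and SHARD_INDEX are hypothetical names;
        # substitute whatever your CI system actually exports.
        "git_sha": os.environ.get("GIT_SHA", "unknown"),
        "image_digest": os.environ.get("IMAGE_DIGEST", "unknown"),
        "shard": os.environ.get("SHARD_INDEX", "unknown"),
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "cpu_count": os.cpu_count(),
        # Record only explicitly allowlisted variables so secrets are not leaked.
        "env": {name: os.environ.get(name) for name in sorted(env_allowlist)},
    }


if __name__ == "__main__":
    context = capture_runtime_context(["TZ", "LANG", "CI"])
    print(json.dumps(context, indent=2, sort_keys=True))
```

Writing the snapshot with sorted keys keeps the output stable, so two snapshots from different runs can be compared with an ordinary text diff.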

Prefer immutable references over moving tags

A tag like node:20, ubuntu:latest, or browser:stable is convenient and often acceptable for development. For diagnosis and reproducibility, it is not enough. Use digests or pinned versions in the failing path.

jobs:
  test:
    runs-on: ubuntu-22.04
    container:
      image: ghcr.io/acme/test-runner@sha256:8e3c...f91
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test

A pinned container image does not solve all drift, but it removes an entire category of invisible changes.
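
One way to keep moving tags from creeping back in is a small lint step over the image references a pipeline uses. The sketch below is illustrative only: it assumes references are pinned with sha256 digests written as 64 lowercase hex characters, and the function names are made up for this example.

```python
import re

# Assumption: a pinned reference ends with "@sha256:" plus 64 lowercase hex characters.
DIGEST_PATTERN = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref):
    """Return True when the image reference names an immutable digest."""
    return bool(DIGEST_PATTERN.search(image_ref))


def unpinned_images(image_refs):
    """Return the references that still rely on a mutable tag."""
    return [ref for ref in image_refs if not is_pinned_by_digest(ref)]


if __name__ == "__main__":
    placeholder_digest = "0" * 64  # not a real digest, for illustration only
    refs = [
        "node:20",
        "ubuntu:latest",
        f"ghcr.io/acme/test-runner@sha256:{placeholder_digest}",
    ]
    print(unpinned_images(refs))
```

A check like this could run before the test job and fail the build when the list is not empty.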

Record build inputs, not only outputs

Teams often store artifacts, screenshots, and logs, but forget the inputs that produced them. If you are debugging environment drift, the input set matters more than the failure screenshot. Capture:

dependency lockfiles
generated config files
test seed data
browser capabilities
feature flag snapshots
service endpoints

The goal is to answer, “What did this job believe the world looked like?”

Where drift usually hides

1. Dependency resolution

A dependency lockfile is only useful if it is actually respected by the build. Problems appear when package managers do a partial install, when transitive dependencies are unconstrained, or when a lockfile is present but not enforced in CI.

Typical symptoms include:

tests failing after a dependency cache refresh
a build passing after replay because the resolved tree was already cached
browser tooling or test libraries changing behavior without application code changes

Practical check: compare the installed versions from the failed run and the replay. If you do not log them, add that logging first.
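
The comparison itself can be very small once both runs log their resolved versions. The sketch below assumes each run saved a list of `name==version` lines (the format printed by `pip freeze`); for npm or another package manager the parser would need adapting, and the package names in the example are arbitrary.

```python
def parse_versions(lines):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    versions = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        versions[name.lower()] = version
    return versions


def diff_versions(failed, replay):
    """Return {package: (failed_version, replay_version)} for every mismatch."""
    names = sorted(set(failed) | set(replay))
    return {
        name: (failed.get(name), replay.get(name))
        for name in names
        if failed.get(name) != replay.get(name)
    }


if __name__ == "__main__":
    failed = parse_versions(["requests==2.31.0", "urllib3==2.0.7"])
    replay = parse_versions(["requests==2.31.0", "urllib3==2.1.0", "idna==3.6"])
    for name, (before, after) in diff_versions(failed, replay).items():
        print(f"{name}: failed run={before} replay={after}")
```

A value of None on one side means the package was present in only one of the two runs, which is often the more interesting finding.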

2. Container and OS layer drift

Base image tags can move. Package repositories can change. OS patches can alter SSL behavior, file permissions, or locale defaults. A test that reads from /tmp, assumes a default timezone, or depends on a browser binary may fail only on a fresh image pull.

Look for:

apt-get update changing package availability
browser version mismatches between local and CI
file permission issues caused by different UID and GID mappings
locale and timezone differences affecting snapshots or string comparisons

3. External service dependencies

If your pipeline hits a real database, queue, email provider, payment sandbox, or third-party API, your test result includes all of their variability too.

Questions to ask:

Is the service state shared across runs?
Are there rate limits or eventual consistency delays?
Are responses timestamped or non-deterministic?
Does the service have cold start latency that changes on replay?

If the answer to any of these is yes, isolate the dependency. Use a disposable test tenant, a seeded sandbox, or a mock boundary where appropriate. For API-level testing, that boundary should be deliberate, not accidental.

4. Feature flags and config drift

Many “same build, different result” incidents are really configuration incidents. A replay can inherit a feature flag change, a secret rotation, or a config map update that was not present in the original failure.

Make sure CI captures the exact flag state at runtime. If flags are managed externally, snapshot them or mirror them into the test artifact.
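
A flag snapshot is easier to compare when it is reduced to a canonical form and a short fingerprint. The sketch below assumes the flags have already been fetched into a plain dictionary; how they are fetched depends on the flag service, and the flag names are invented for the example.

```python
import hashlib
import json


def snapshot_flags(flags):
    """Return (canonical_json, fingerprint) for a mapping of flag names to values."""
    # Sorting keys makes the output independent of the order flags were fetched in.
    canonical = json.dumps(flags, sort_keys=True, separators=(",", ":"))
    fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return canonical, fingerprint


if __name__ == "__main__":
    # Hypothetical flag values captured at the start of a run.
    flags = {"new_checkout": True, "search_v2": False}
    canonical, fingerprint = snapshot_flags(flags)
    print(canonical)
    print(fingerprint[:12])
```

Logging the fingerprint at the start of every run gives a quick signal: if the failed run and the replay print different fingerprints, the flag state changed between them, and the canonical JSON shows what changed.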

5. Parallel execution and shared state

Parallelism can create failures that disappear in replay because the collision has already happened once and the environment now contains one side of the race.

Watch for:

shared test accounts
reused inboxes or phone numbers
singleton fixtures
temp file collisions
port collisions
non-unique test data

If your tests create the same email address, customer ID, or resource name every time, they are asking for nondeterministic behavior.
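
A small helper is usually enough to avoid that kind of collision. The sketch below is one possible approach, assuming the CI system exposes some run identifier; the function name and the `example.test` domain are placeholders chosen for illustration.

```python
import uuid


def unique_email(run_id, label, domain="example.test"):
    """Build an address that is unique per call and traceable to a CI run."""
    # The random suffix keeps parallel shards within the same run from colliding.
    suffix = uuid.uuid4().hex[:8]
    return f"{label}+{run_id}-{suffix}@{domain}"


if __name__ == "__main__":
    print(unique_email("run-1234", "buyer"))
    print(unique_email("run-1234", "buyer"))
```

Embedding the run ID in the value also makes leftover records easy to attribute to the run that created them during cleanup.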

A practical triage sequence

When a pipeline fails in real runs but passes on replay, resist the urge to make a broad fix. Triage systematically.

Step 1, reproduce with maximum fidelity

Re-run the exact same commit, same image digest, same runner type, same shard, same environment variables, and same seed if the test framework supports one.

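If your test runner supports deterministic seeding, use it. Also enable trace capture so that the failed run and the replay can be compared step by step, as in this Playwright example: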

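import { test, expect } from '@playwright/test';

// Record a trace when a failed test is retried (requires retries to be enabled in the config).
test.use({ trace: 'on-first-retry' });

test('checkout flow', async ({ page }) => {
  // Relative URL: resolved against baseURL from the Playwright config.
  await page.goto('/checkout');
  await expect(page.getByRole('heading', { name: /checkout/i })).toBeVisible();
});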

The point is not the Playwright code itself. It is to keep the replay as close as possible to the original run while capturing enough trace data to compare both executions.

Step 2, compare environment metadata

Diff the failed run and the replay:

image digest
runtime version
browser version
env vars
test data
cache state
network access
worker identity

Even a single version difference can matter. A patch release in Chrome, a changed CA bundle, or a new default in a package manager can shift behavior in small but important ways.

Step 3, bisect by layer

If the app code seems unchanged, isolate by layer:

run the same test with cache disabled
run in a fresh container
remove parallelism
disable external integrations
swap the browser or driver version
rerun only the setup steps

This helps determine whether the fault follows the code, the runner, the data, or the network.

Step 4, inspect the first divergence

Do not focus only on the final assertion. Find the first meaningful difference between the failed run and the replay:

a request took longer
a token expired earlier
an element rendered later
a background job had not finished
a service returned a different payload

The first divergence is usually more actionable than the final failure.
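
Finding that first difference can be partly automated. The sketch below assumes each run's log has already been reduced to an ordered list of comparable event strings, with timestamps and generated IDs stripped out; that normalization step is not shown and depends on your log format.

```python
from itertools import zip_longest


def first_divergence(failed_events, replay_events):
    """Return (index, failed_event, replay_event) for the first mismatch, or None."""
    pairs = zip_longest(failed_events, replay_events)
    for index, (failed, replay) in enumerate(pairs):
        if failed != replay:
            return index, failed, replay
    return None


if __name__ == "__main__":
    # Hypothetical, already-normalized event sequences from two runs.
    failed_run = ["install ok", "app ready", "GET /login 200", "GET /cart 401"]
    replay_run = ["install ok", "app ready", "GET /login 200", "GET /cart 200"]
    print(first_divergence(failed_run, replay_run))
```

If one run simply stops earlier than the other, the missing side is reported as None, which points at the step where the shorter run ended.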

Making failures observable instead of mysterious

The best fix for environment drift is not more retries. It is better observability.

Add structured logs around setup and teardown

Log the important phases of the pipeline, not just the test assertion:

dependency install start and end
container startup time
app readiness checks
database migrations
seed completion
browser launch and navigation timing
teardown status

That lets you identify whether the failure is in provisioning, bootstrapping, or the test itself.
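
One lightweight way to get this kind of phase-level logging is a context manager that wraps each phase and emits a single structured line when it ends. This is a sketch, not a logging framework; the phase names and the record fields are assumptions chosen for the example.

```python
import json
import sys
import time
from contextlib import contextmanager


@contextmanager
def phase(name, stream=None):
    """Emit one JSON log line per phase with its status and duration."""
    out = stream if stream is not None else sys.stdout
    started = time.monotonic()
    status = "ok"
    try:
        yield
    except Exception:
        status = "failed"
        raise  # the failure still propagates; it is only recorded here
    finally:
        record = {
            "phase": name,
            "status": status,
            "duration_s": round(time.monotonic() - started, 3),
        }
        out.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    with phase("seed"):
        time.sleep(0.01)  # stand-in for real seeding work
```

Because every phase produces one line with the same fields, a slow or failing phase stands out when the failed run and the replay are placed side by side.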

Capture artifacts on the first failure

Useful artifacts include:

screenshots
DOM snapshots
network logs
trace files
container logs
application logs
database migration logs

But do not stop at collection. Make sure the artifacts are searchable by commit SHA, environment, shard, and timestamp.

Add health checks before tests begin

If the app is not actually ready, tests should fail early with a clear readiness error, not later with a confusing assertion.

For example, a minimal pipeline health check might look like this:

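# Poll for about 60 seconds, then fail with a clear readiness error.
attempts=0
until curl -fsS http://app:3000/health; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 30 ]; then echo "app not ready" >&2; exit 1; fi
  echo "waiting for app readiness"
  sleep 2
done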

This is simplistic, but the principle is important. Fail on invalid prerequisites, not deep inside user-flow assertions.

Build reproducibility into the pipeline, not just the container

A reproducible build is more than a pinned image. It is a contract about inputs and side effects.

Keep test data deterministic

Use seeded fixtures, unique namespaces, and isolated data stores. Avoid tests that depend on the current contents of a shared database unless that database is explicitly controlled for the test.

Good patterns:

create a database per run
namespace test data by commit SHA or run ID
reset queues before each suite
seed via versioned fixtures

Bad patterns:

reusing production-like shared accounts
depending on mutable global tables
cleaning up only “most of the time”
assuming the environment starts empty

Avoid time-dependent assertions where possible

Timestamps, countdowns, expiring tokens, and relative ordering can be made stable with injected clocks or tolerant assertions.

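from datetime import datetime, timedelta, timezone

# Placeholder: in a real test this value is parsed from the system under test.
parsed_timestamp = datetime.now(timezone.utc) - timedelta(seconds=1)
now = datetime.now(timezone.utc)
assert abs((now - parsed_timestamp).total_seconds()) < 5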

Use tolerances where the domain permits it, but do not use loose assertions to hide genuine latency problems. If the system must respond within a deadline, the test should enforce that deadline.
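
Where a tolerance is not appropriate, an injected clock removes the dependence on wall time entirely. The sketch below is a minimal illustration: `FixedClock` and `is_token_expired` are hypothetical names, and the code under test is assumed to accept a clock object instead of reading the system time directly.

```python
from datetime import datetime, timedelta, timezone


class FixedClock:
    """A controllable clock for tests; production code would pass a real one."""

    def __init__(self, start):
        self._now = start

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += timedelta(seconds=seconds)


def is_token_expired(issued_at, ttl_seconds, clock):
    """Hypothetical helper: a token expires once ttl_seconds have elapsed."""
    return clock.now() - issued_at >= timedelta(seconds=ttl_seconds)


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    clock = FixedClock(issued)
    clock.advance(59)
    print(is_token_expired(issued, 60, clock))
    clock.advance(1)
    print(is_token_expired(issued, 60, clock))
```

The test now controls exactly when the expiry boundary is crossed, so a slow runner cannot change the outcome.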

Separate initialization from validation

A common drift pattern is setup code that mutates state and validation code that assumes the mutation has completed. Break the workflow into explicit phases:

provision
seed
verify readiness
execute user flow
assert outcome
clean up

This makes it much easier to see which phase is unstable.

How to decide whether to retry, quarantine, or fix

Not every failure deserves the same response. Teams often overuse retries because they are cheap. But retries are only useful when the failure mode is transient and understood.

Retry when

the dependency is explicitly transient, such as a short-lived network timeout
you have observed that the same step passes after a brief backoff
the retry is not masking a correctness issue

Quarantine when

the test is nondeterministic and blocks the team
the root cause is not yet understood, but the signal is clearly unstable
you have a tracking plan to investigate and restore it

Fix when

the drift is reproducible and tied to a known input
the build uses moving dependencies or mutable state
the test relies on timing, ordering, or shared resources that should be isolated

A retry hides symptoms. A fix removes uncertainty.

If a pipeline only becomes stable because it was retried after the environment settled, that is not stable automation; it is delayed observation.
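
When a retry is justified, it helps to make the retried failure class explicit in code. The sketch below retries only an exception type that has been deliberately marked as transient and lets everything else fail immediately; `TransientError` and the default attempt and delay values are assumptions for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures known to be safe to retry."""


def retry_transient(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry only TransientError with exponential backoff; re-raise anything else."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch():
        calls["count"] += 1
        if calls["count"] < 3:
            raise TransientError("short-lived network timeout")
        return "ok"

    print(retry_transient(flaky_fetch, base_delay=0.01))
```

Because unrelated exceptions are never caught, a correctness bug still fails on the first attempt instead of being absorbed by the retry loop.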

A debugging checklist for real-run failures

Use this checklist when replay passes but the original CI run fails:

Confirm the commit SHA and branch are identical.
Compare container image digests, not tags.
Compare runtime, browser, and driver versions.
Inspect environment variables and feature flags.
Check whether caches, volumes, or workspaces were reused.
Verify the test data set is identical or intentionally isolated.
Review network dependencies and third-party calls.
Look at the first divergence in logs or traces.
Disable parallelism and rerun once for diagnosis.
Reproduce in a fresh, empty environment.

If the failure disappears only when the environment is warmed or replayed, that is a clue, not a resolution.

A practical CI hardening strategy

Once you identify the drift source, make the pipeline more resistant to it.

Standardize the build surface

Use a single canonical runner image or a small set of approved images. Avoid bespoke setup steps in every repo if a shared base can handle them. The fewer ad hoc differences between projects, the easier it is to reason about failures.

Make test execution deterministic where possible

seed randomness
fix locale and timezone
isolate data
eliminate shared accounts
use explicit waits instead of sleeps
avoid test-order dependencies

Reduce external variability

stub or contract-test unstable third parties
use disposable infrastructure for integration tests
snapshot mutable configs at run start
limit dependencies on live services to the tests that truly need them

Treat observability as a first-class requirement

A green pipeline that cannot explain itself is fragile. Instrument the build enough that future failures can be diagnosed without guesswork.

Where test automation, bug tracking, and reporting fit in

This problem is not only about CI. It affects the broader QA workflow.

Test automation should distinguish product regressions from environmental instability.
Bug tracking should record the exact runtime context, not just the symptom.
Reporting should show failure clusters by image version, runner type, or environment variable, not only by test name.
QA workflows should make it easy to quarantine a test, compare runs, and trace a failure back to an input change.

If your reporting layer cannot correlate failures with runtime changes, you will keep reopening the same class of issue under different names.
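
As a rough illustration of that kind of correlation, the sketch below counts failure records by a single runtime attribute. It assumes each failure has been stored as a dictionary that includes the runtime context; the field names and values are invented for the example.

```python
from collections import Counter


def cluster_failures(failures, attribute):
    """Count failures by one runtime attribute, most common first."""
    counts = Counter(record.get(attribute, "unknown") for record in failures)
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical failure records with their runtime context attached.
    failures = [
        {"test": "checkout", "runner": "large", "image": "sha256:aaa"},
        {"test": "login", "runner": "small", "image": "sha256:bbb"},
        {"test": "search", "runner": "small", "image": "sha256:bbb"},
    ]
    print(cluster_failures(failures, "image"))
    print(cluster_failures(failures, "runner"))
```

Three differently named tests failing on the same image or runner type suggest one environment issue rather than three separate test bugs.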

For background on the core concepts, see software testing, test automation, and continuous integration.

The rule of thumb

If a test passes only after replay, assume the environment changed until proven otherwise. If the environment is stable, then assume the test is incomplete until proven otherwise.

That order matters. It keeps teams from misclassifying drift as flakiness, and flakiness as drift. The most effective CI teams do not just add retries or rewrite tests on instinct. They identify the layer where the system diverged, make that layer observable, and remove the assumption that failed.

When you reach that point, the pipeline becomes more than a pass or fail gate. It becomes a reliable diagnostic surface for the whole delivery system.