Why CI Browser Tests Fail on Pull Requests but Pass on Main: Environment Drift, Parallelism, and Secret Handling

When browser tests fail on pull requests but pass on main, the failure is usually not random. It is often a sign that your PR pipeline and main pipeline are not actually equivalent, even if they look similar in the YAML. The differences can be subtle, such as a missing secret, a different browser cache state, a changed test ordering, or a resource constraint that only shows up when several PR jobs run at once.

The hard part is that these failures are easy to misclassify as “just flaky tests.” If you respond by retrying everything, you can hide real regressions and make your CI less trustworthy. The better approach is to treat the discrepancy as a systems problem: compare environments, control concurrency, and make secret handling explicit.

If a browser test passes consistently on main but fails on pull requests, assume the two pipelines are exercising different realities until you prove otherwise.

What this pattern usually means

A PR run and a main-branch run can differ in ways that are easy to miss:

different build images or runner types
different env vars, including secrets and feature flags
fewer or more parallel jobs
different checkout depth or submodule behavior
cache hits on main and cache misses on PRs
PRs running from forks with restricted permissions
branch-specific test selection or tagging
different timing because main is less busy or has warmer infrastructure

In browser automation, those differences matter because tests are already sensitive to timing, rendering, authentication, network behavior, and local storage state. A small divergence can become a reproducible failure only in one pipeline.

For background on the broader concepts, see continuous integration, software testing, and test automation.

Start by proving the failure is really branch-specific

Before you inspect locators or add waits, confirm that the same commit fails only on PR and not on main. Teams often compare different code, different build times, or different dependent services without realizing it.

A useful first step is to run the exact same commit in both contexts, with the same pipeline definition and the same artifact versions. If the failure disappears when the commit is replayed on main, that points to environment drift rather than application logic.

Track these fields for every failed run:

commit SHA
branch name
pipeline job name
runner image or agent version
browser version
test shard or parallel slot
base URL or deployed environment
secret availability
feature flag values

If you do not already capture this data, add it now. It is much easier to debug with a structured record than by reading raw logs from five reruns.
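One way to capture that record is a small helper that turns the job's environment into a single structured object. The sketch below is illustrative only: `GITHUB_SHA`, `GITHUB_REF`, and `GITHUB_JOB` are variables GitHub Actions sets, while `RUNNER_IMAGE`, `SHARD_INDEX`, `BASE_URL`, and `FEATURE_FLAGS` are hypothetical names you would replace with whatever your pipeline actually exports.

```typescript
// Environment passed in explicitly so the helper is easy to test.
type EnvVars = Record<string, string | undefined>;

// One field per item in the tracking list above.
interface RunRecord {
  commitSha: string;
  branch: string;
  job: string;
  runnerImage: string;
  browserVersion: string;
  shard: string;
  baseUrl: string;
  secretsAvailable: boolean;
  featureFlags: string;
}

function buildRunRecord(env: EnvVars, browserVersion: string): RunRecord {
  // Missing values are recorded as 'unknown' rather than silently dropped.
  const orUnknown = (value: string | undefined): string => value || 'unknown';
  return {
    commitSha: orUnknown(env.GITHUB_SHA),
    branch: orUnknown(env.GITHUB_REF),
    job: orUnknown(env.GITHUB_JOB),
    runnerImage: orUnknown(env.RUNNER_IMAGE), // hypothetical variable
    browserVersion: browserVersion || 'unknown',
    shard: orUnknown(env.SHARD_INDEX), // hypothetical variable
    baseUrl: orUnknown(env.BASE_URL), // hypothetical variable
    // Records only whether the secret exists, never its value.
    secretsAvailable: Boolean(env.API_TOKEN),
    featureFlags: orUnknown(env.FEATURE_FLAGS), // hypothetical variable
  };
}
```

In a Node-based test setup you could call `buildRunRecord(process.env, version)` once per job and write the result with `JSON.stringify` to the job log or an artifact file.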

A minimal comparison checklist

Use the same checklist for PR and main runs:

same browser binary and driver version
same environment variables
same app build artifact
same test command
same test data seed, if applicable
same container image digest
same network policy and service dependencies
same job concurrency and shard count

Any mismatch is a candidate root cause.
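If each run emits a flat key/value manifest, the checklist can be compared mechanically. The following is a minimal sketch; the manifest shape and the key names in the usage note are assumptions, not a standard format.

```typescript
// A manifest is a flat map such as { browser: '120.0', workers: '4' }.
type Manifest = Record<string, string | undefined>;

interface Mismatch {
  key: string;
  pr: string | undefined;
  main: string | undefined;
}

// Returns every key whose value differs between the PR and main manifests,
// including keys that exist on only one side.
function diffManifests(pr: Manifest, main: Manifest): Mismatch[] {
  const keys = Object.keys(pr).concat(Object.keys(main));
  const seen: Record<string, boolean> = {};
  const mismatches: Mismatch[] = [];
  for (const key of keys) {
    if (seen[key]) continue; // skip keys already compared
    seen[key] = true;
    if (pr[key] !== main[key]) {
      mismatches.push({ key, pr: pr[key], main: main[key] });
    }
  }
  return mismatches;
}
```

For example, comparing `{ browser: '120.0', workers: '4' }` with `{ browser: '120.0', workers: '2' }` reports only `workers`, which is the first place to look.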

The most common cause: CI environment drift

CI environment drift means the test execution context changes between runs, even when the test code does not. In browser testing, drift is especially common because tests depend on the browser, the application, and the infrastructure around them.

Typical sources of drift include:

1. Different browser or driver versions

Chrome, Firefox, WebKit, and their automation drivers change behavior over time. A locator that worked last week can fail if the browser updates rendering or event timing. This is often invisible on main if the main branch uses a cached image or a different runner pool.

Pin your browser version in CI where possible, and record it in artifacts. If you use containers, prefer immutable image digests over floating tags.
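A quick guard can flag image references that still use a floating tag. This sketch assumes the usual `name@sha256:<64 hex characters>` digest form; the registry and image names in the usage note are made up.

```typescript
// True when the reference ends in an immutable sha256 digest
// rather than a tag such as ':latest'.
function isDigestPinned(imageRef: string): boolean {
  return /@sha256:[0-9a-f]{64}$/.test(imageRef);
}

// Returns the references that should be pinned before comparing PR and main.
function findUnpinnedImages(imageRefs: string[]): string[] {
  return imageRefs.filter((ref) => !isDigestPinned(ref));
}
```

Given a list containing `registry.example.com/e2e-runner:latest`, `findUnpinnedImages` returns that entry, while a reference ending in a full `@sha256:` digest passes.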

2. Different build artifacts

PR pipelines often build from scratch, while main pipelines may reuse artifacts from a deployment step. If the browser tests hit a deployed preview, verify that the preview contains the same frontend bundle and backend schema expected by the tests.

A common mismatch looks like this:

PR deploy uses the latest frontend code but stale backend migration state
main deploy uses a full release artifact and matching database schema
tests pass on main because the deployment process is more complete

3. Different environment variables

Feature flags, API endpoints, and auth settings often differ by branch. A PR may disable an integration because secrets are unavailable or because the branch is untrusted. That can change page structure, authentication flow, or API responses.

4. Ephemeral infrastructure noise

PR jobs often land on colder, busier, or smaller runners. If browser tests depend on timing, cold starts can expose races. The test may pass on main because the environment is faster, not because the code is healthier.

Practical fix

Create a run manifest that prints the critical environment properties at the start of every job:

- name: Print run manifest
  run: |
    echo "sha=$GITHUB_SHA"
    echo "ref=$GITHUB_REF"
    node -v
    npx playwright --version
    google-chrome --version || true

That output becomes a baseline for comparing PR and main.

Parallel browser tests can hide or create failures

Parallelism is one of the most common reasons browser tests flake only on pull requests. Teams often increase parallelism in PR pipelines to reduce feedback time, but that also increases contention and timing variance.

There are two separate problems here:

the tests are not isolated enough to run in parallel
the CI system schedules more parallel work on PRs than on main, or vice versa

How parallelism breaks browser tests

Browser tests are often written assuming one test owns one user state. In practice, parallel workers can conflict over:

shared user accounts
shared test data records
hard-coded emails or usernames
global localStorage or session cookies in reused profiles
mutable backend fixtures
rate-limited auth services
shared downloads or upload directories

A test that logs in as the same account across five workers can behave differently every time depending on ordering. On main, the suite may run with fewer shards and fewer collisions. On PR, the additional concurrency may expose the race.

A Playwright example of accidental shared state

import { test, expect } from '@playwright/test';

test('creates an order', async ({ page }) => {
  await page.goto('/login');
  await page.fill('#email', 'qa@example.com');
  await page.fill('#password', process.env.PASSWORD!);
  await page.click('button[type="submit"]');

  await page.goto('/orders/new');
  await page.click('text=Create Order');
  await expect(page.locator('.toast')).toContainText('Order created');
});

This test may fail if another worker already used the same account and altered the session or data state.

Better isolation patterns

create a unique user per worker
isolate data by test run ID
namespace resources with the commit SHA and shard index
reset browser state for each test
avoid reusing accounts with mutable preferences or permissions
keep file downloads in per-worker directories

If you must parallelize, design tests to be parallel-safe first, then scale concurrency.
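The namespacing idea can be sketched as a pure function that derives per-worker identifiers from the run ID, commit SHA, and worker index. The email pattern, the `example.com` domain, and the directory layout are illustrative assumptions.

```typescript
interface WorkerIdentity {
  email: string;
  downloadDir: string;
}

// Builds identifiers that are unique per run, commit, and worker,
// so parallel workers never share an account or a download folder.
function workerIdentity(runId: string, commitSha: string, workerIndex: number): WorkerIdentity {
  const shortSha = commitSha.slice(0, 7);
  const slug = `${runId}-${shortSha}-w${workerIndex}`
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, '-'); // keep the slug safe for emails and paths
  return {
    email: `qa+${slug}@example.com`,
    downloadDir: `downloads/${slug}`,
  };
}
```

In Playwright, the worker index could come from `testInfo.workerIndex`. The accounts themselves still have to be created by your own setup code; this only guarantees the names do not collide.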

Control the shard count deliberately

The same suite can pass with two workers and fail with six. If PR and main differ in worker count, you are not comparing the same system. Set the same parallelism for both branches while debugging.

In Playwright, that may mean fixing workers in config or overriding it in CI:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: process.env.CI ? 4 : 2,
  retries: process.env.CI ? 1 : 0,
});

For debugging, temporarily reduce workers to 1. If the failure disappears, you likely have a concurrency issue rather than a true app regression.
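To make the worker count an explicit, overridable setting rather than a branch-dependent default, you could read it from a single variable. This is a sketch under the assumption that your pipeline exports a variable named `E2E_WORKERS`; that name is hypothetical.

```typescript
// Resolves the worker count from an explicit override, falling back
// to a default. Invalid values fail loudly instead of being ignored.
function resolveWorkers(env: Record<string, string | undefined>, fallback: number): number {
  const raw = env.E2E_WORKERS; // hypothetical variable name
  if (raw === undefined || raw === '') return fallback;
  const parsed = Number(raw);
  // Reject NaN, Infinity, fractions, and anything below one worker.
  if (!isFinite(parsed) || parsed < 1 || Math.floor(parsed) !== parsed) {
    throw new Error(`E2E_WORKERS must be a positive integer, got "${raw}"`);
  }
  return parsed;
}
```

The result could feed the `workers` option in the Playwright config, so that setting `E2E_WORKERS=1` on a single rerun is enough to test the concurrency hypothesis on both PR and main.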

Secrets are often the hidden branch boundary

PR pipelines frequently run with limited or no secrets, especially for forks. That is usually the correct security choice, but it changes test behavior.

Browser tests can depend on secrets in ways that are not obvious:

authentication tokens for test users
API keys for third-party auth or payments
SSO configuration
webhook signatures
private test fixture access
feature toggles loaded from secret-backed config

If a secret is missing on PR but present on main, the app may silently switch to a fallback flow. The test then fails because the UI no longer matches the expected path.

Common secret-handling mistakes

tests assume process.env.TOKEN exists without checking
app code falls back to mock mode on PR, but real mode on main
forked PRs cannot access secrets, causing partial initialization
secret rotation invalidates a credential cached for one branch but not the other

Make secret differences explicit

Fail fast when a required secret is missing. Do not let the app degrade into a different mode without logging it.

const token = process.env.API_TOKEN;
if (!token) {
  throw new Error('API_TOKEN is required for browser tests');
}

If you intentionally use different modes for PR and main, label them clearly in output. For example, print auth_mode=mock or auth_mode=real at startup so the pipeline artifact reflects the decision.

Hidden fallback behavior is a classic source of false confidence. A passing test is less valuable if it passed in a mode nobody intended to ship.
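When several secrets are required, it can help to check them together and report every missing name in one error. A minimal sketch follows: `API_TOKEN` and `PASSWORD` appear earlier in this article, and `SSO_CLIENT_ID` in the usage note is a hypothetical name.

```typescript
// Returns the names of required secrets that are absent or empty.
function findMissingSecrets(
  env: Record<string, string | undefined>,
  required: string[]
): string[] {
  return required.filter((name) => !env[name]);
}

// Throws a single error listing all missing names, never their values.
function assertSecrets(env: Record<string, string | undefined>, required: string[]): void {
  const missing = findMissingSecrets(env, required);
  if (missing.length > 0) {
    throw new Error(`Missing required secrets for browser tests: ${missing.join(', ')}`);
  }
}
```

Calling `assertSecrets(process.env, ['API_TOKEN', 'PASSWORD', 'SSO_CLIENT_ID'])` in a global setup step would stop the run before any browser launches, which is usually easier to diagnose than a failed selector on a fallback login page.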

Timing issues are more visible on PRs

PR jobs often contend with lower cache hit rates, cold dependency installs, and noisier scheduling. That can make race conditions or brittle waits appear only there.

Browser automation fails for timing reasons when tests assume an element is ready before it is actually interactive. The fix is usually not to add a longer sleep. The fix is to wait for the specific condition the user needs.

Examples of brittle timing

clicking before animation completes
asserting text before API response finishes
waiting for a generic network idle state that never truly occurs
reading a toast before it is mounted
interacting with an iframe before it loads

Prefer condition-based waits

await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByText('Saved successfully')).toBeVisible();

This is better than a fixed sleep because it ties the wait to an observable state.

If a failure only happens on PR, inspect whether the PR runner is slower, more loaded, or using a different network path. A test that barely passes on main may be too close to the edge to survive a colder environment.

Build a repeatable triage flow

When the failure appears, do not immediately edit test code. Use a consistent triage sequence.

Step 1: compare the failed run with a passing run

Look at the exact job metadata, browser version, and environment variables.

Step 2: rerun the same commit in the same branch context

If the test passes on rerun, you may have flakiness. If it fails consistently on PR but not main, keep investigating environment drift.

Step 3: reduce concurrency

Set worker count to one. If the failure disappears, suspect shared state or test ordering.

Step 4: isolate secrets and auth

Confirm the pipeline has the expected credentials and that the app is not silently switching to guest mode or mock mode.

Step 5: compare deployed artifacts

Check whether the PR preview and main environment actually contain the same build, schema, and config.

Step 6: capture browser artifacts

Store screenshots, videos, console logs, and traces. These are often the fastest way to distinguish a broken selector from a backend mismatch.

Playwright trace collection is especially useful for branch-specific failures:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});

How to distinguish a real regression from a CI artifact

Not every PR-only failure is infrastructure noise. Sometimes the PR truly breaks the app, but the failure only becomes visible because PR runs exercise a slightly different path.

A few clues help separate the two:

More likely a real regression

the same UI action fails in local development and in PR CI
the failure is deterministic on the same commit, regardless of runner or branch context
screenshots show a visible app defect, not just a timing issue
logs show a backend error, schema mismatch, or missing data
the same failure appears after reducing parallelism

More likely a CI artifact

the failure disappears when retried
the failure disappears with one worker
the failure only occurs on a specific runner image
the failure correlates with missing secrets or branch permissions
the failure occurs during login, fixture setup, or teardown, not core app behavior

Treat failures that reproduce deterministically in every environment as product issues. Treat environment-dependent failures as pipeline issues until proven otherwise.
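The clues above can be turned into a rough first-pass classifier. The sketch below uses a subset of them and counts each signal equally, which is an arbitrary assumption; treat the result as a prompt for where to look first, not as a verdict.

```typescript
interface FailureSignals {
  reproducesLocally: boolean;
  deterministicOnSameCommit: boolean;
  persistsWithOneWorker: boolean;
  passesOnRetry: boolean;
  tiedToRunnerImage: boolean;
  tiedToMissingSecrets: boolean;
}

type Verdict = 'likely-regression' | 'likely-ci-artifact' | 'inconclusive';

// Counts the signals on each side and reports whichever side has more.
function classifyFailure(s: FailureSignals): Verdict {
  const regression = [
    s.reproducesLocally,
    s.deterministicOnSameCommit,
    s.persistsWithOneWorker,
  ].filter(Boolean).length;
  const artifact = [
    s.passesOnRetry,
    s.tiedToRunnerImage,
    s.tiedToMissingSecrets,
  ].filter(Boolean).length;
  if (regression > artifact) return 'likely-regression';
  if (artifact > regression) return 'likely-ci-artifact';
  return 'inconclusive';
}
```

A failure that passes on retry and correlates with missing secrets would come back as `likely-ci-artifact`; one that reproduces locally and survives a single-worker run would come back as `likely-regression`.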

What to standardize across PR and main

The easiest way to reduce branch-specific browser failures is to make PR and main more alike.

Standardize the runtime

same container image
same browser version
same node or language version
same dependency lockfile resolution
same test command and environment variables

Standardize test data strategy

seeded data per run
unique resources per worker
idempotent teardown
isolated test accounts

Standardize pipeline behavior

same shard count during investigation
same artifact retention
same retry policy
same log and trace collection
same preview deployment process

Standardize feature flag exposure

If PRs and main differ in flags, make that explicit in the test plan. Separate tests that validate a feature behind a flag from tests that validate the default experience.

Example GitHub Actions pattern for comparison

A simple way to reduce surprises is to surface the differences in the workflow itself.

name: browser-tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run test:e2e
        env:
          CI: true
          AUTH_MODE: $

This is not the only valid setup, but it makes the branch difference visible. If AUTH_MODE changes behavior, you will see it in the config rather than discovering it through a failing selector.
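The workflow above does not show how the AUTH_MODE value is computed. One possible policy, sketched here purely as an assumption, is to use mock auth for pull request events and real auth otherwise, and to refuse to start in real mode without a token. `GITHUB_EVENT_NAME` is set by GitHub Actions; the mock/real policy itself is hypothetical.

```typescript
type AuthMode = 'mock' | 'real';

// Assumed policy: pull requests use mock auth, everything else uses real auth.
// Real mode without a token is an error, not a silent fallback.
function resolveAuthMode(eventName: string | undefined, hasApiToken: boolean): AuthMode {
  const mode: AuthMode = eventName === 'pull_request' ? 'mock' : 'real';
  if (mode === 'real' && !hasApiToken) {
    throw new Error('auth_mode=real requires API_TOKEN');
  }
  return mode;
}

// Formats the line to print at startup so the mode appears in job logs.
function authModeLine(mode: AuthMode): string {
  return `auth_mode=${mode}`;
}
```

Printing `authModeLine(resolveAuthMode(process.env.GITHUB_EVENT_NAME, Boolean(process.env.API_TOKEN)))` at startup would put the decision in the log for every run.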

Where reporting helps most

Good reporting is not just about dashboards. It is about making branch differences visible enough that engineers can act on them.

At minimum, your reporting should show:

pass/fail by branch type
pass/fail by worker count
pass/fail by browser version
pass/fail by secret availability
pass/fail by deployment target
rerun success rate

If a dashboard shows that PR failures are concentrated on one runner image or one shard, you have a lead. If failures are evenly distributed, the issue is more likely in the test design or app state.
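The breakdowns listed above amount to grouping run results by one dimension at a time. A minimal sketch follows, assuming each run is stored with a pass/fail flag and a flat map of dimensions; the record shape and dimension names are assumptions.

```typescript
interface RunResult {
  passed: boolean;
  // For example { branchType: 'pr', workers: '4', browser: '120.0' }.
  dimensions: Record<string, string | undefined>;
}

interface Tally {
  passed: number;
  failed: number;
}

// Counts passes and failures for each distinct value of one dimension.
function tallyBy(runs: RunResult[], dimension: string): Record<string, Tally> {
  const out: Record<string, Tally> = {};
  for (const run of runs) {
    const value = run.dimensions[dimension] || 'unknown';
    const tally = out[value] || (out[value] = { passed: 0, failed: 0 });
    if (run.passed) {
      tally.passed += 1;
    } else {
      tally.failed += 1;
    }
  }
  return out;
}
```

Running `tallyBy(runs, 'workers')` and `tallyBy(runs, 'branchType')` side by side shows whether failures cluster on a worker count, a branch type, or neither.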

A practical decision tree

When browser tests fail on pull requests but pass on main, use this order:

1. Confirm the same commit and same test command.
2. Compare environment variables, browser versions, and container images.
3. Reduce parallelism to one worker.
4. Verify required secrets and auth mode.
5. Compare deployed artifacts and feature flags.
6. Inspect screenshots, traces, and console logs.
7. Fix shared state or timing issues.
8. Only then decide whether the application changed.

This order matters because it avoids the common trap of patching the symptom with retries or sleeps before understanding the cause.

Preventing recurrence

Once you isolate the cause, prevent the same class of failure from coming back.

lock browser and dependency versions
document which secrets are required on PR and which are not
fail fast when a required secret is missing
ensure each worker has isolated test data
keep PR and main job definitions as similar as possible
make parallelism an explicit setting, not an accidental default
collect artifacts on every failure, not just on main

If your team maintains a large suite, consider adding a small set of branch-diff diagnostics that run before the full browser suite. These checks can validate that the pipeline shape matches expectations, which saves time when the real problem is environment drift rather than application behavior.

Conclusion

When browser tests fail on pull requests but pass on main, the best explanation is usually not “the tests are flaky.” It is that PR and main are different execution environments in ways that matter to browser automation. The differences might be in browser version, secrets, cache state, parallelism, or deployment artifacts, but the symptom is the same: one branch exposes instability that the other hides.

The goal is not to make every test pass by force. It is to make the pipeline honest. If a test failure reflects a real regression, you want to catch it. If it reflects CI environment drift or parallel browser tests stepping on one another, you want to make that failure visible, reproducible, and fixable.

That discipline leads to faster triage, fewer false alarms, and a browser test suite that engineers can trust when a PR is on the line.