When browser tests fail on pull requests but pass on main, the failure is usually not random. It is often a sign that your PR pipeline and main pipeline are not actually equivalent, even if they look similar in the YAML. The differences can be subtle, such as a missing secret, a different browser cache state, a changed test ordering, or a resource constraint that only shows up when several PR jobs run at once.

The hard part is that these failures are easy to misclassify as “just flaky tests.” If you respond by retrying everything, you can hide real regressions and make your CI less trustworthy. The better approach is to treat the discrepancy as a systems problem: compare environments, control concurrency, and make secret handling explicit.

If a browser test passes consistently on main but fails on pull requests, assume the two pipelines are running in different environments until you prove otherwise.

What this pattern usually means

A PR run and a main-branch run can differ in ways that are easy to miss:

  • different build images or runner types
  • different env vars, including secrets and feature flags
  • fewer or more parallel jobs
  • different checkout depth or submodule behavior
  • cache hits on main and cache misses on PRs
  • PRs running from forks with restricted permissions
  • branch-specific test selection or tagging
  • different timing because main is less busy or has warmer infrastructure

In browser automation, those differences matter because tests are already sensitive to timing, rendering, authentication, network behavior, and local storage state. A small divergence can become a reproducible failure only in one pipeline.

Start by proving the failure is really branch-specific

Before you inspect locators or add waits, confirm that the same commit fails only on PR and not on main. Teams often compare different code, different build times, or different dependent services without realizing it.

A useful first step is to run the exact same commit in both contexts, with the same pipeline definition and the same artifact versions. If the failure disappears when the commit is replayed on main, that points to environment drift rather than application logic.

Track these fields for every failed run:

  • commit SHA
  • branch name
  • pipeline job name
  • runner image or agent version
  • browser version
  • test shard or parallel slot
  • base URL or deployed environment
  • secret availability
  • feature flag values

If you do not already capture this data, add it now. It is much easier to debug with a structured record than by reading raw logs from five reruns.
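One way to capture these fields is a small helper that builds a structured record from the job's environment. The sketch below is illustrative only: `GITHUB_SHA`, `GITHUB_REF_NAME`, and `GITHUB_JOB` are default GitHub Actions variables, while `RUNNER_IMAGE`, `BROWSER_VERSION`, `SHARD_INDEX`, `BASE_URL`, and the `FLAG_` prefix are hypothetical names your workflow would have to export itself.

```typescript
// Structured record of one CI run, mirroring the fields listed above.
interface RunRecord {
  commitSha: string;
  branch: string;
  jobName: string;
  runnerImage: string;
  browserVersion: string;
  shard: string;
  baseUrl: string;
  secretsAvailable: boolean;
  featureFlags: Record<string, string>;
}

type Env = Record<string, string | undefined>;

// Builds a record from environment variables. Only the GITHUB_* names are
// provided by GitHub Actions; every other name here is an assumption.
function collectRunRecord(env: Env): RunRecord {
  const featureFlags: Record<string, string> = {};
  for (const [key, value] of Object.entries(env)) {
    // Assumption: feature flags are exposed as FLAG_* variables.
    if (key.startsWith('FLAG_') && value !== undefined) {
      featureFlags[key.slice('FLAG_'.length)] = value;
    }
  }
  return {
    commitSha: env.GITHUB_SHA ?? 'unknown',
    branch: env.GITHUB_REF_NAME ?? 'unknown',
    jobName: env.GITHUB_JOB ?? 'unknown',
    runnerImage: env.RUNNER_IMAGE ?? 'unknown',
    browserVersion: env.BROWSER_VERSION ?? 'unknown',
    shard: env.SHARD_INDEX ?? '0',
    baseUrl: env.BASE_URL ?? 'unknown',
    // Record only whether the secret is present, never its value.
    secretsAvailable: Boolean(env.API_TOKEN),
    featureFlags,
  };
}
```

Calling `collectRunRecord(process.env)` at job start and writing the result as JSON to the job's artifacts gives you one comparable record per run.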

A minimal comparison checklist

Use the same checklist for PR and main runs:

  • same browser binary and driver version
  • same environment variables
  • same app build artifact
  • same test command
  • same test data seed, if applicable
  • same container image digest
  • same network policy and service dependencies
  • same job concurrency and shard count

Any mismatch is a candidate root cause.
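If both runs emit their checklist values as key/value pairs, the comparison can be automated. The following is a minimal sketch, assuming each manifest is a flat map of strings; the key names used in the usage note are placeholders.

```typescript
type Manifest = Record<string, string>;

interface Mismatch {
  key: string;
  pr: string | undefined;
  main: string | undefined;
}

// Returns every key whose value differs between the two manifests,
// including keys that exist on only one side. Results are sorted by key.
function diffManifests(pr: Manifest, main: Manifest): Mismatch[] {
  const keys = Array.from(new Set(Object.keys(pr).concat(Object.keys(main)))).sort();
  const mismatches: Mismatch[] = [];
  for (const key of keys) {
    if (pr[key] !== main[key]) {
      mismatches.push({ key, pr: pr[key], main: main[key] });
    }
  }
  return mismatches;
}
```

For example, `diffManifests({ workers: '6' }, { workers: '2' })` reports a single mismatch on `workers`, which is exactly the kind of difference worth investigating first.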

The most common cause: CI environment drift

CI environment drift means the test execution context changes between runs, even when the test code does not. In browser testing, drift is especially common because tests depend on the browser, the application, and the infrastructure around them.

Typical sources of drift include:

1. Different browser or driver versions

Chrome, Firefox, WebKit, and their automation drivers change behavior over time. A locator that worked last week can fail if a browser update changes rendering or event timing. This is often invisible on main if the main branch uses a cached image or a different runner pool.

Pin your browser version in CI where possible, and record it in artifacts. If you use containers, prefer immutable image digests over floating tags.
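A quick guard can flag floating tags before the suite runs. This is a sketch that assumes images are pinned with a sha256 digest in the usual `name@sha256:<hex>` form; the image names in the usage note are made up.

```typescript
// True when an image reference is pinned to a content digest
// (name@sha256:<64 hex characters>) rather than a mutable tag.
function isPinnedByDigest(imageRef: string): boolean {
  return /@sha256:[a-f0-9]{64}$/.test(imageRef);
}

// Returns the references that still rely on a floating tag.
function findFloatingImages(imageRefs: string[]): string[] {
  return imageRefs.filter((ref) => !isPinnedByDigest(ref));
}
```

Given a list such as `['registry.example.com/e2e:latest']`, `findFloatingImages` returns that entry unchanged, so a preflight step can fail the job and name the offending image.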

2. Different build artifacts

PR pipelines often build from scratch, while main pipelines may reuse artifacts from a deployment step. If the browser tests hit a deployed preview, verify that the preview contains the same frontend bundle and backend schema expected by the tests.

A common mismatch looks like this:

  • PR deploy uses the latest frontend code but stale backend migration state
  • main deploy uses a full release artifact and matching database schema
  • tests pass on main because the deployment process is more complete

3. Different environment variables

Feature flags, API endpoints, and auth settings often differ by branch. A PR may disable an integration because secrets are unavailable or because the branch is untrusted. That can change page structure, authentication flow, or API responses.

4. Ephemeral infrastructure noise

PR jobs often land on colder, busier, or smaller runners. If browser tests depend on timing, cold starts can expose races. The test may pass on main because the environment is faster, not because the code is healthier.

Practical fix

Create a run manifest that prints the critical environment properties at the start of every job:

- name: Print run manifest
  run: |
    echo "sha=$GITHUB_SHA"
    echo "ref=$GITHUB_REF"
    node -v
    npx playwright --version
    google-chrome --version || true

That output becomes a baseline for comparing PR and main.

Parallel browser tests can hide or create failures

Parallelism is one of the most common reasons flaky tests appear only on pull requests. Teams often increase parallelism in PR pipelines to reduce feedback time, but that also increases contention and timing variance.

There are two separate problems here:

  1. the tests are not isolated enough to run in parallel
  2. the CI system schedules more parallel work on PRs than on main, or vice versa

How parallelism breaks browser tests

Browser tests are often written assuming one test owns one user state. In practice, parallel workers can conflict over:

  • shared user accounts
  • shared test data records
  • hard-coded emails or usernames
  • global localStorage or session cookies in reused profiles
  • mutable backend fixtures
  • rate-limited auth services
  • shared downloads or upload directories

A test that logs in as the same account across five workers can behave differently every time depending on ordering. On main, the suite may run with fewer shards and fewer collisions. On PR, the additional concurrency may expose the race.

A Playwright example of accidental shared state

import { test, expect } from '@playwright/test';

test('creates an order', async ({ page }) => {
  await page.goto('/login');
  // Hard-coded shared account: every parallel worker logs in as the same user.
  await page.fill('#email', 'qa@example.com');
  await page.fill('#password', process.env.PASSWORD!);
  await page.click('button[type="submit"]');

  await page.goto('/orders/new');
  await page.click('text=Create Order');
  await expect(page.locator('.toast')).toContainText('Order created');
});

This test may fail if another worker already used the same account and altered the session or data state.

Better isolation patterns

  • create a unique user per worker
  • isolate data by test run ID
  • namespace resources with the commit SHA and shard index
  • reset browser state for each test
  • avoid reusing accounts with mutable preferences or permissions
  • keep file downloads in per-worker directories

If you must parallelize, design tests to be parallel-safe first, then scale concurrency.
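Several of the patterns above reduce to deriving every shared name from the run and the worker. Here is a minimal sketch, assuming you pass in the commit SHA and a worker index; in Playwright the index could come from `testInfo.workerIndex` or `testInfo.parallelIndex`. The naming scheme and the `example.com` address are illustrative, not a convention of any tool.

```typescript
interface WorkerIdentity {
  email: string;
  namespace: string;
  downloadDir: string;
}

// Derives per-worker names from the commit SHA and worker index so that
// parallel workers never share an account, data namespace, or download folder.
function workerIdentity(commitSha: string, workerIndex: number): WorkerIdentity {
  const shortSha = commitSha.slice(0, 8);
  const namespace = `e2e-${shortSha}-w${workerIndex}`;
  return {
    email: `qa+${namespace}@example.com`,
    namespace,
    downloadDir: `downloads/${namespace}`,
  };
}
```

The test from the previous example would then log in with `workerIdentity(sha, index).email` instead of a hard-coded address, provided your setup step creates that user first.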

Control the shard count deliberately

The same suite can pass with two workers and fail with six. If PR and main differ in worker count, you are not comparing the same system. Set the same parallelism for both branches while debugging.

In Playwright, that may mean setting a fixed workers value in the config or overriding it in CI:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Explicit worker count so every CI run uses the same parallelism.
  workers: process.env.CI ? 4 : 2,
  retries: process.env.CI ? 1 : 0,
});

For debugging, temporarily reduce workers to 1. If the failure disappears, you likely have a concurrency issue rather than a true app regression.
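To make the worker count an explicit, shared setting, you could read it from one variable that both pipelines set. The sketch below assumes a hypothetical `E2E_WORKERS` variable; it is not something Playwright reads on its own, so the result would be passed to the `workers` option in your config.

```typescript
// Resolves the worker count from an explicit override so PR and main can be
// forced to the same value. E2E_WORKERS is a hypothetical variable name.
function resolveWorkers(
  env: Record<string, string | undefined>,
  fallback: number,
): number {
  const raw = env.E2E_WORKERS;
  if (raw === undefined || raw === '') {
    return fallback;
  }
  const parsed = Number(raw);
  // Reject zero, negatives, fractions, and non-numeric input outright.
  if (!Number.isInteger(parsed) || parsed < 1) {
    throw new Error(`E2E_WORKERS must be a positive integer, got "${raw}"`);
  }
  return parsed;
}
```

Setting `E2E_WORKERS=1` on a failing PR job then becomes a one-line change. Playwright's own `--workers` command-line option is an alternative if you prefer not to touch the config.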

Secrets are often the hidden branch boundary

PR pipelines frequently run with limited or no secrets, especially for forks. That is usually the correct security choice, but it changes test behavior.

Browser tests can depend on secrets in ways that are not obvious:

  • authentication tokens for test users
  • API keys for third-party auth or payments
  • SSO configuration
  • webhook signatures
  • private test fixture access
  • feature toggles loaded from secret-backed config

If a secret is missing on PR but present on main, the app may silently switch to a fallback flow. The test then fails because the UI no longer matches the expected path.

Common secret-handling mistakes

  • tests assume process.env.TOKEN exists without checking
  • app code falls back to mock mode on PR, but real mode on main
  • forked PRs cannot access secrets, causing partial initialization
  • secret rotation breaks one branch's cached credentials but not the other's

Make secret differences explicit

Fail fast when a required secret is missing. Do not let the app degrade into a different mode without logging it.

const token = process.env.API_TOKEN;
if (!token) {
  throw new Error('API_TOKEN is required for browser tests');
}

If you intentionally use different modes for PR and main, label them clearly in output. For example, print auth_mode=mock or auth_mode=real at startup so the pipeline artifact reflects the decision.
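The fail-fast check and the mode label can live in one startup helper. This is a sketch under assumptions: it treats `AUTH_MODE` as the switch between the two modes and takes the list of required secret names as a parameter, since the real list depends on your app.

```typescript
type AuthMode = 'mock' | 'real';
type Env = Record<string, string | undefined>;

// Returns the names of required variables that are missing or empty.
function missingSecrets(env: Env, required: string[]): string[] {
  return required.filter((name) => !env[name]);
}

// Resolves the auth mode explicitly and refuses to start in "real" mode
// when a required secret is absent, instead of silently falling back.
function resolveAuthMode(env: Env, required: string[]): AuthMode {
  const mode = env.AUTH_MODE;
  if (mode !== 'mock' && mode !== 'real') {
    throw new Error(`AUTH_MODE must be "mock" or "real", got "${mode}"`);
  }
  if (mode === 'real') {
    const missing = missingSecrets(env, required);
    if (missing.length > 0) {
      throw new Error(`auth_mode=real but missing secrets: ${missing.join(', ')}`);
    }
  }
  // Label the decision so it shows up in the job log.
  console.log(`auth_mode=${mode}`);
  return mode;
}
```

Reporting every missing name in one error saves a rerun per secret when a forked PR has none of them.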

Hidden fallback behavior is a classic source of false confidence. A passing test is less valuable if it passed in a mode nobody intended to ship.

Timing issues are more visible on PRs

PR jobs often contend with lower cache hit rates, cold dependency installs, and noisier scheduling. That can make race conditions or brittle waits appear only there.

Browser automation fails for timing reasons when tests assume an element is ready before it is actually interactive. The fix is usually not to add a longer sleep. The fix is to wait for the specific condition the user needs.

Examples of brittle timing

  • clicking before animation completes
  • asserting text before API response finishes
  • waiting for a generic network idle state that never truly occurs
  • reading a toast before it is mounted
  • interacting with an iframe before it loads

Prefer condition-based waits

await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByText('Saved successfully')).toBeVisible();

This is better than a fixed sleep because it ties the wait to an observable state.
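For conditions that are not a simple element state, the same idea can be written as a polling helper. The sketch below is a generic illustration of condition-based waiting, not Playwright's implementation; Playwright already ships `expect.poll` and `expect(...).toPass()` for this purpose, and those are usually the better choice inside a test.

```typescript
// Polls a condition until it yields a value other than undefined,
// or rejects once the timeout has elapsed.
async function waitFor<T>(
  condition: () => T | undefined | Promise<T | undefined>,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await condition();
    if (result !== undefined) {
      return result;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    // Sleep briefly between attempts instead of one long fixed sleep.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The wait ends as soon as the condition holds, so a fast runner pays almost nothing and a slow PR runner gets the full timeout.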

If a failure only happens on PR, inspect whether the PR runner is slower, more loaded, or using a different network path. A test that barely passes on main may be too close to the edge to survive a colder environment.

Build a repeatable triage flow

When the failure appears, do not immediately edit test code. Use a consistent triage sequence.

Step 1: compare the failed run with a passing run

Look at the exact job metadata, browser version, and environment variables.

Step 2: rerun the same commit in the same branch context

If the test passes on rerun, you may have flakiness. If it fails consistently on PR but not main, keep investigating environment drift.

Step 3: reduce concurrency

Set worker count to one. If the failure disappears, suspect shared state or test ordering.

Step 4: isolate secrets and auth

Confirm the pipeline has the expected credentials and that the app is not silently switching to guest mode or mock mode.

Step 5: compare deployed artifacts

Check whether the PR preview and main environment actually contain the same build, schema, and config.

Step 6: capture browser artifacts

Store screenshots, videos, console logs, and traces. These are often the fastest way to distinguish a broken selector from a backend mismatch.

Playwright trace collection is especially useful for branch-specific failures:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});

How to distinguish a real regression from a CI artifact

Not every PR-only failure is infrastructure noise. Sometimes the PR truly breaks the app, but the failure only becomes visible because PR runs exercise a slightly different path.

A few clues help separate the two:

More likely a real regression

  • the same UI action fails in local development and in PR CI
  • the failure is deterministic on the same commit
  • screenshots show a visible app defect, not just a timing issue
  • logs show a backend error, schema mismatch, or missing data
  • the same failure appears after reducing parallelism

More likely a CI artifact

  • the failure disappears when retried
  • the failure disappears with one worker
  • the failure only occurs on a specific runner image
  • the failure correlates with missing secrets or branch permissions
  • the failure occurs during login, fixture setup, or teardown, not core app behavior

Treat deterministic, reproducible failures as product issues. Treat environment-dependent failures as pipeline issues until proven otherwise.
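The clues above can be encoded as a first-pass triage rule. This is a deliberately crude sketch: the signal names, the ordering of the checks, and the verdict labels are assumptions chosen for illustration, and the output is a hint about where to look first rather than a diagnosis.

```typescript
interface FailureSignals {
  failsOnRetry: boolean;       // the same commit fails again when retried
  failsWithOneWorker: boolean; // still fails with parallelism reduced to one
  reproducesLocally: boolean;  // the same action fails in local development
  secretsMissing: boolean;     // the run lacked secrets the passing run had
}

type Verdict = 'likely-regression' | 'likely-ci-artifact' | 'inconclusive';

function classifyFailure(signals: FailureSignals): Verdict {
  // A missing secret changes the app's mode, so rule that out first.
  if (signals.secretsMissing) {
    return 'likely-ci-artifact';
  }
  // Disappears on retry or with one worker: timing or shared state.
  if (!signals.failsOnRetry || !signals.failsWithOneWorker) {
    return 'likely-ci-artifact';
  }
  // Deterministic and reproducible outside CI: treat as a product issue.
  if (signals.reproducesLocally) {
    return 'likely-regression';
  }
  return 'inconclusive';
}
```

An `inconclusive` verdict covers the awkward case of a failure that is deterministic in CI but does not reproduce locally, which usually sends you back to comparing environments.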

What to standardize across PR and main

The easiest way to reduce branch-specific browser failures is to make PR and main more alike.

Standardize the runtime

  • same container image
  • same browser version
  • same node or language version
  • same dependency lockfile resolution
  • same test command and environment variables

Standardize test data strategy

  • seeded data per run
  • unique resources per worker
  • idempotent teardown
  • isolated test accounts

Standardize pipeline behavior

  • same shard count during investigation
  • same artifact retention
  • same retry policy
  • same log and trace collection
  • same preview deployment process

Standardize feature flag exposure

If PRs and main differ in flags, make that explicit in the test plan. Separate tests that validate a feature behind a flag from tests that validate the default experience.

Example GitHub Actions pattern for comparison

A simple way to reduce surprises is to surface the differences in the workflow itself.

name: browser-tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run test:e2e
        env:
          CI: true
          AUTH_MODE: $

This is not the only valid setup, but it makes the branch difference visible. If AUTH_MODE changes behavior, you will see it in the config rather than discovering it through a failing selector.

Where reporting helps most

Good reporting is not just about dashboards. It is about making branch differences visible enough that engineers can act on them.

At minimum, your reporting should show:

  • pass/fail by branch type
  • pass/fail by worker count
  • pass/fail by browser version
  • pass/fail by secret availability
  • pass/fail by deployment target
  • rerun success rate

If a dashboard shows that PR failures are concentrated on one runner image or one shard, you have a lead. If failures are evenly distributed, the issue is more likely in the test design or app state.
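If you already store one record per run, these breakdowns are a small aggregation. The sketch below assumes each result carries a flat map of dimensions such as branch type or runner image; the field and dimension names are placeholders for whatever your reporting pipeline records.

```typescript
interface RunResult {
  passed: boolean;
  dimensions: Record<string, string>; // e.g. { branchType: 'pr', runnerImage: 'img-a' }
}

interface GroupStats {
  total: number;
  failed: number;
  failureRate: number;
}

// Groups results by one dimension and computes the failure rate per group.
function failureRateBy(results: RunResult[], dimension: string): Map<string, GroupStats> {
  const groups = new Map<string, GroupStats>();
  for (const result of results) {
    const key = result.dimensions[dimension] ?? 'unknown';
    const stats = groups.get(key) ?? { total: 0, failed: 0, failureRate: 0 };
    stats.total += 1;
    if (!result.passed) {
      stats.failed += 1;
    }
    stats.failureRate = stats.failed / stats.total;
    groups.set(key, stats);
  }
  return groups;
}
```

Running it once per dimension (`branchType`, `runnerImage`, `workers`) shows quickly whether failures cluster on one value or are spread evenly.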

A practical decision tree

When browser tests fail on pull requests but pass on main, use this order:

  1. Confirm the same commit and same test command.
  2. Compare environment variables, browser versions, and container images.
  3. Reduce parallelism to one worker.
  4. Verify required secrets and auth mode.
  5. Compare deployed artifacts and feature flags.
  6. Inspect screenshots, traces, and console logs.
  7. Fix shared state or timing issues.
  8. Only then decide whether the application changed.

This order matters because it avoids the common trap of patching the symptom with retries or sleeps before understanding the cause.

Preventing recurrence

Once you isolate the cause, prevent the same class of failure from coming back.

  • lock browser and dependency versions
  • document which secrets are required on PR and which are not
  • fail fast when a required secret is missing
  • ensure each worker has isolated test data
  • keep PR and main job definitions as similar as possible
  • make parallelism an explicit setting, not an accidental default
  • collect artifacts on every failure, not just on main

If your team maintains a large suite, consider adding a small set of branch-diff diagnostics that run before the full browser suite. These checks can validate that the pipeline shape matches expectations, which saves time when the real problem is environment drift rather than application behavior.

Conclusion

When browser tests fail on pull requests but pass on main, the best explanation is usually not “the tests are flaky.” It is that PR and main are different execution environments in ways that matter to browser automation. The differences might be in browser version, secrets, cache state, parallelism, or deployment artifacts, but the symptom is the same: one branch exposes instability that the other hides.

The goal is not to make every test pass by force. It is to make the pipeline honest. If a test failure reflects a real regression, you want to catch it. If it reflects CI environment drift or parallel browser tests stepping on one another, you want to make that failure visible, reproducible, and fixable.

That discipline leads to faster triage, fewer false alarms, and a browser test suite that engineers can trust when a PR is on the line.