Why your screenshot tests keep crying wolf

Ananya Rao

QA Lead

May 28, 2026

Almost every team that adopts visual regression testing hits the same wall in week two: the tool flags fifty 'differences', forty-nine of them are noise, and people quietly start clicking 'approve all'. Once that habit forms, the suite is worthless — it's no longer catching anything, it's just rubber-stamping.

The frustrating part is that the failures aren't random. Visual noise comes from a small number of well-understood sources, and once you can name them you can tune them out without lowering your guard against real regressions.

The usual suspects

Anti-aliasing — the same edge rendered on two machines differs by a few grey pixels along the border. Pixel-exact comparison treats this as a change.
Font hinting — operating systems render the same font slightly differently. A headline can shift by a sub-pixel and light up a whole text block.
Dynamic content — timestamps, 'time ago' labels, randomised testimonials, and A/B variants all change between captures.
Scrollbars and viewport — a 1px difference in available width reflows the entire page.
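
To make the first suspect concrete, here is a small sketch of a pixel-exact comparison. Everything in it is made up for illustration: the two 4x4 grayscale grids stand in for real captures, and `differing_pixels` is a hypothetical helper, not part of any tool.

```python
# Two 4x4 grayscale captures of the same black square on a white page.
# Values run from 0 (black) to 255 (white); the data is invented.
capture_a = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0,   0, 255],
    [255, 255, 255, 255],
]
# Same square, but this renderer softened two border pixels to light grey.
capture_b = [
    [255, 255, 255, 255],
    [255,   0,   0, 240],
    [255,   0,   0, 255],
    [255, 248, 255, 255],
]


def differing_pixels(a, b):
    """Return (row, col) coordinates where the two images are not identical."""
    return [
        (r, c)
        for r, row in enumerate(a)
        for c, value in enumerate(row)
        if value != b[r][c]
    ]


changed = differing_pixels(capture_a, capture_b)
print(f"{len(changed)} of 16 pixels differ: {changed}")
# prints: 2 of 16 pixels differ: [(1, 3), (3, 1)]
```

No human would see a difference between these two captures, yet an exact comparison reports a change. Scale that up to a full page and you get the fifty 'differences'.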

Tune the signal, not the threshold

The lazy fix is to crank the tolerance up until nothing fails. That also means nothing gets caught. A better approach is layered: enable anti-aliasing tolerance so border pixels stop counting, mask the regions you know are dynamic, and only then set a sensitivity that reflects how strict this particular screen needs to be.
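
A minimal sketch of that layering, under loud assumptions: images are plain lists of grayscale rows, `compare` and its parameter names are invented for this example, and the 'anti-aliasing tolerance' is only a per-pixel brightness delta. Real diff engines usually detect anti-aliasing by inspecting neighbouring pixels, which this does not attempt.

```python
def compare(baseline, candidate, *, aa_tolerance=16, ignore_regions=(),
            max_changed_ratio=0.0):
    """Layered comparison of two equally sized grayscale images.

    aa_tolerance:      per-pixel brightness delta treated as rendering noise.
    ignore_regions:    (top, left, bottom, right) boxes; bottom/right exclusive.
    max_changed_ratio: fraction of unmasked pixels allowed to differ.
    Returns (passed, changed_ratio).
    """
    def masked(r, c):
        return any(top <= r < bottom and left <= c < right
                   for top, left, bottom, right in ignore_regions)

    compared = changed = 0
    for r, row in enumerate(baseline):
        for c, value in enumerate(row):
            if masked(r, c):
                continue  # layer 2: known-dynamic content never counts
            compared += 1
            if abs(value - candidate[r][c]) > aa_tolerance:
                changed += 1  # layer 1: only deltas above the noise floor count
    ratio = changed / compared if compared else 0.0
    return ratio <= max_changed_ratio, ratio  # layer 3: per-screen strictness


# Invented data: row 0 is a timestamp strip, rows 1-2 hold a dark button.
baseline = [
    [200, 200, 200, 200, 200, 200],
    [255,   0,   0,   0, 255, 255],
    [255,   0,   0,   0, 255, 255],
    [255, 255, 255, 255, 255, 255],
]

# Noise only: the timestamp changed and one border pixel went slightly grey.
noisy = [row[:] for row in baseline]
noisy[0] = [90, 200, 120, 200, 60, 200]
noisy[1][4] = 245

# Noise plus a real regression: the bottom half of the button is gone.
broken = [row[:] for row in noisy]
broken[2][1:4] = [255, 255, 255]

timestamp_strip = [(0, 0, 1, 6)]

naive_pass, _ = compare(baseline, noisy, aa_tolerance=0)
noise_pass, _ = compare(baseline, noisy, ignore_regions=timestamp_strip)
regression_pass, _ = compare(baseline, broken, ignore_regions=timestamp_strip)

print(naive_pass)       # False: pixel-exact and unmasked, noise alone fails
print(noise_pass)       # True: the noise is filtered out
print(regression_pass)  # False: the real change still fails
```

Note that the allowed ratio stayed at zero throughout. The noise was removed by the first two layers, so the final threshold never had to loosen to get a green run.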

In PixellPeep that maps directly to features: the anti-aliasing filter, ignore regions for dynamic content, and a 0–100 sensitivity slider. A marketing page can run loose; a checkout flow should run tight.
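
One way to keep 'loose here, tight there' explicit is a per-screen profile. The sketch below is not PixellPeep's configuration format, which this post doesn't show. `ScreenProfile`, its field names, the assumption that 100 is the strictest setting, and the linear mapping from slider value to an allowed pixel ratio are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenProfile:
    """Hypothetical per-screen comparison settings."""
    name: str
    sensitivity: int                  # assumed: 0 = loosest, 100 = strictest
    anti_aliasing_filter: bool = True
    ignore_regions: tuple = ()        # (top, left, bottom, right) boxes

    def __post_init__(self):
        if not 0 <= self.sensitivity <= 100:
            raise ValueError("sensitivity must be between 0 and 100")

    def max_changed_ratio(self, loosest=0.05):
        """Assumed linear mapping: 100 allows no changed pixels,
        0 allows `loosest` (5% by default) of the pixels to change."""
        return loosest * (100 - self.sensitivity) / 100


# A marketing page runs loose and masks its rotating testimonial block.
marketing = ScreenProfile(
    name="marketing-home",
    sensitivity=40,
    ignore_regions=((600, 0, 900, 1280),),
)

# A checkout flow runs tight, with nothing masked.
checkout = ScreenProfile(name="checkout", sensitivity=95)

for profile in (marketing, checkout):
    print(profile.name, profile.max_changed_ratio())
```

Writing the strictness down per screen also makes it reviewable: loosening the checkout profile becomes a visible change rather than a quiet nudge of a slider.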

A visual test you trust to fail is worth ten you've learned to ignore.

Capture conditions matter more than the algorithm

Most flakiness is decided before the comparison even runs. Capture at a fixed viewport, wait for fonts and images to finish loading, and run against a stable environment rather than production. Get the inputs consistent and the diff engine has an easy job.
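
As a sketch of what consistent inputs can look like, the function below pins the viewport and waits for images and fonts before capturing. It assumes `page` behaves like a Page from Playwright's synchronous Python API. The function name, the default viewport size, and the staging URL in the usage note are invented for this example.

```python
def capture_stable(page, url, path, *, width=1280, height=720):
    """Take a screenshot under fixed, repeatable conditions.

    `page` is assumed to expose the Playwright sync-API Page methods
    used below; nothing here is specific to any diffing tool.
    """
    # Fixed viewport: the same available width on every run.
    page.set_viewport_size({"width": width, "height": height})

    # The "load" event fires after images and stylesheets have finished loading.
    page.goto(url, wait_until="load")

    # document.fonts.ready resolves once web fonts have loaded, so text is
    # not captured in a fallback font. evaluate() waits for the promise.
    page.evaluate("document.fonts.ready.then(() => true)")

    page.screenshot(path=path, full_page=True)
    return path
```

With Playwright installed, `page` would come from `browser.new_page()` inside a `sync_playwright()` context, and the call might look like `capture_stable(page, "https://staging.example.test/checkout", "checkout.png")`, pointed at a staging host rather than production.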

Do that, and the next time a test fails, people look — because it's earned their attention.
