Why we built Canaryflux

This is a launch post, so let me be honest about why this product exists.

We kept watching the same thing happen on shipping teams. The team would push a Friday-night fix, the deploy would go green in CI, every Playwright test would pass, every screenshot diff would come back clean. Then on Monday morning a customer would DM a screenshot of the signup button cut in half on a Pixel 5 — and nobody on the team could reproduce it on their laptop.

The reason is depressing once you see it: "works on my machine" is even more true now than it was in 2010. Modern CI runs Chrome in a Linux container at a desktop viewport. Lighthouse runs Chrome in a Linux container at a simulated mobile viewport. Visual regression tools run Chrome in a Linux container at a recorded viewport. All of them are simulating what a phone sees. None of them are a phone.

Real visitors are using two-year-old Androids on intermittent connections, with a system font that doesn't match yours, with an OS that aggressively pauses background tabs, with a webview that the manufacturer hasn't updated in eight months. That's the test environment. CI is the wishful one.

The gap we kept hitting

If you've shipped a marketing site in the last two years, you've felt this. It looks like:

Hero CTA is clipped on iPhone SE because the headline ate one extra line at that exact viewport.
Cookie banner covers the page on a Galaxy A14 with no dismiss button — your CSS assumed the banner script loaded before paint, but on a slow network it didn't.
Country dropdown opens behind the on-screen keyboard on Android Chrome. Forms team can't reproduce because they use macOS.
Video modal opens blank because a config request 404'd on a CDN edge that only serves one geo region.

None of these show up in Lighthouse. None of them show up in your visual regression diff. They show up only when you point a real device at the page and use it like a customer would.

The two options teams already had

Until recently, you had two ways to catch this kind of bug.

Option A: a manual QA team with a device lab. This works. It also costs you permanent headcount, a closet full of phones, and the operational overhead of keeping them all charged and updated. Most companies under a hundred engineers won't do this, and the ones that do run out of QA capacity long before they run out of engineering capacity.

Option B: a device-cloud test-script suite. Real hardware in the cloud, but you have to write and maintain the tests. Coverage scales with engineering time you don't have, and a framework migration can invalidate scripts you wrote months ago.

Both options assume the QA bottleneck is finding bugs. From years of watching shipping teams, we became convinced the actual bottleneck is something earlier: nobody is even looking at the rendered page on a real phone before it ships. Not because they don't want to. Because the cost of looking is too high.

What Canaryflux does instead

We inverted the constraint. Instead of asking the team to write tests, we ask the team to paste a URL. The scanner does the rest:

Opens that URL on a set of device profiles in parallel — a mix of iPhone, Pixel, Galaxy, iPad, and desktop Chrome viewports, with the exact list depending on your plan.
Captures full-page screenshots, console errors, and network failures from each one.
Submits a safe form (newsletter / search) and clicks up to two main CTAs to catch post-click bugs — the broken modals and zero-state screens that screenshot tests miss because they only photograph the landing state.
Sends the captures to a vision LLM with a tightly-tuned QA prompt that grades each finding by severity and writes a one-line suggested fix.
Runs a second verification pass to drop false positives so you're not drowning in noise.
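
To make the shape of the first two steps concrete, here is a minimal Python sketch of the fan-out: one URL, several device profiles, captures collected in parallel. It is an illustration under stated assumptions, not Canaryflux's implementation. The names `DeviceProfile`, `Capture`, `capture_page`, and `scan` are hypothetical, the viewport sizes are illustrative, and `capture_page` is a stand-in for whatever actually drives a browser.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeviceProfile:
    # Hypothetical profile shape; the real matrix depends on the plan.
    name: str
    width: int
    height: int


@dataclass
class Capture:
    # What one device run hands back: screenshot path plus collected errors.
    device: str
    url: str
    screenshot_path: str = ""
    console_errors: list = field(default_factory=list)
    failed_requests: list = field(default_factory=list)


# Illustrative viewport sizes only.
PROFILES = [
    DeviceProfile("iPhone SE", 375, 667),
    DeviceProfile("Pixel 5", 393, 851),
    DeviceProfile("Desktop Chrome", 1280, 800),
]


def capture_page(url, profile):
    """Stand-in for the browser step (assumption): a real version would load
    the page at this viewport and record screenshots, console errors, and
    failed network requests."""
    return Capture(device=profile.name, url=url)


def scan(url, profiles=PROFILES, capture=capture_page):
    """Run the capture step for every profile in parallel.

    pool.map returns results in the same order as the input profiles.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(profiles))) as pool:
        return list(pool.map(lambda p: capture(url, p), profiles))


if __name__ == "__main__":
    for result in scan("https://example.com"):
        print(result.device, len(result.console_errors), len(result.failed_requests))
```

Passing `capture` in as a parameter keeps the parallel fan-out separate from the browser automation, so the sketch runs without a browser installed.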

What you get back is a list of ranked findings, each one with a screenshot of the actual bug, the device it surfaced on, and a copy-pasteable fix. In our internal scans against public marketing sites, a typical run surfaces a handful of findings — sometimes more on busier pages.

Paste a URL. Get the report in about ninety seconds. No SDK to install, no snippet to embed, no DNS changes. If your site renders HTML, we work with it.

What we got wrong, and what we changed

The first version of Canaryflux just dumped raw model output into a list. We thought "device profiles + AI vision" was the whole product. It wasn't. The list was 60% noise. Roughly half of that noise was the model hedging ("this could be a layout issue"), and the other half was things that were technically present but not user-impacting.

The fix was the verification stage. Every candidate finding goes through a second pass that re-examines the screenshot under a stricter prompt. If the second pass can't confidently confirm the bug from visible evidence, the finding is dropped. That single change cut the typical scan from a couple dozen noisy candidates down to a handful of confirmed findings, and the handful is the part you actually want to read.
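
The filtering logic described above can be sketched in a few lines of Python. This is a hedged illustration, not the production verifier. `Finding`, `verify_findings`, the 0.8 threshold, and the severity labels other than "blocker" are all assumptions, and `second_pass` is a hypothetical callable standing in for the stricter model call, returning a confidence between 0 and 1.

```python
from dataclasses import dataclass

# Hypothetical severity labels; lower number sorts first.
SEVERITY_ORDER = {"blocker": 0, "major": 1, "minor": 2}


@dataclass
class Finding:
    device: str
    summary: str
    severity: str


def verify_findings(candidates, second_pass, threshold=0.8):
    """Keep only findings the second pass confirms, then rank by severity.

    second_pass(finding) is assumed to return a confidence in [0, 1];
    anything below the threshold is dropped as unconfirmed.
    """
    confirmed = [f for f in candidates if second_pass(f) >= threshold]
    # Unknown severity labels sort after all known ones.
    return sorted(
        confirmed,
        key=lambda f: SEVERITY_ORDER.get(f.severity, len(SEVERITY_ORDER)),
    )


def stub_second_pass(finding):
    # Stand-in scorer for the example: treats hedged wording as low confidence.
    return 0.3 if "could be" in finding.summary else 0.95


if __name__ == "__main__":
    candidates = [
        Finding("Pixel 5", "This could be a layout issue", "minor"),
        Finding("Galaxy A14", "Cookie banner covers the page, no dismiss button", "major"),
        Finding("iPhone SE", "Hero CTA is clipped below the fold", "blocker"),
    ]
    for f in verify_findings(candidates, stub_second_pass):
        print(f.severity, f.device, f.summary)
```

The design choice worth noting is that the second pass only ever removes findings. It never adds new ones, so the verified list is always a subset of the first-pass candidates.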

What's not in v1, but is on the roadmap

We're deliberately small at launch. Real product, narrow scope.

Scheduled scans. Right now you re-run from the dashboard. Recurring cron scans are coming on the Studio plan.
Authenticated-app scanning. v1 tests what a logged-out visitor sees. Logged-in app QA — with credentials you provide via a secrets vault — is on the same Studio roadmap.
Slack notifications. Pipe blocker findings straight to a team channel the moment they're verified. Shipping with Pro when ready.
More device profiles. Today's matrix targets the most common iPhone, Pixel, and Galaxy viewports plus desktop and tablet. We'll add older Androids and edge-region viewports as customers ask.

If you're a team that ships a marketing site

Canaryflux is built for you specifically. Not for native-app QA, not for full integration testing, not as a replacement for your CI suite. As a thing you run after every deploy to make sure the page you just shipped looks right on the phones your customers actually own.

The Free plan gives you three scans a month. Paste a URL — your own site, a staging deploy, a competitor's site, anything public — and see what comes back.

We'd rather you find the bug than your customer.
