Why We Built Flusterduck
Session replay is the wrong tool for finding confused users. We built the right one.
I spent a Wednesday afternoon watching 47 Hotjar recordings trying to find out why our checkout conversion dropped 18% after a deploy. Scrubbing through footage of people scrolling, clicking, typing. Forty-seven recordings. Most of them were fine. Normal people doing normal things. Around recording 34, I found it: a coupon button that looked clickable but was disabled with no visual feedback. People clicked it, nothing happened, they clicked harder, then they left.
Four hours of my life to find a disabled button missing a tooltip.
That was the day I started thinking about this problem differently.
The session replay trap
Session replay tools are built on a bad assumption: that watching individual users is how you find UX problems. It isn't. It's how you confirm them after you already suspect something. And you're sampling. If 3% of users encounter a frustrating element, your replay tool samples 10% of sessions, and you watch 5% of those, you're seeing 0.5% of the affected sessions, which works out to 0.015% of your total traffic. That's archaeology, not monitoring.
What ops teams already figured out
If your server goes down at 2am, you don't find out by watching server logs play back in real time. You have a number. Uptime. It's 99.97% or it's not. When it drops, your phone buzzes. You fix it. You go back to sleep.
Nobody watches server logs for fun. Nobody should watch user sessions for fun either.
I kept asking myself: why doesn't this exist for UX? A number. One number per page. Is my checkout working or isn't it? Not "here's 200 recordings, go find out." Just: your checkout confusion score is 78, the coupon button is the problem, 31 users hit it in the last 15 minutes.
That's Flusterduck.
The confusion score
A confusion score is a single number from 0 to 100 that represents how much more confused users are on a page right now compared to what's normal for that page.
"Compared to what's normal" is the part that matters. A docs page at 45 might be fine. People re-read docs. A checkout page at 45 means something is wrong. The score is a deviation from your baseline, per page, computed from 7 days of behavioral data. When it moves, you investigate. When it doesn't, you don't.
Why 18 signals, not 1
Rage clicks are the signal everyone detects. Microsoft Clarity does it. PostHog does it. It's a good signal. It's also one signal.
What about the user who can't find the right form field and pauses for 8 seconds staring at a label? That's form hesitation. What about the mobile user who keeps pinch-zooming because your responsive layout is broken? That's pinch-zoom frustration. What about the keyboard user trapped in a modal with no escape? That's a focus trap.
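To make one of those concrete: form hesitation can be caught with little more than a focus timer. The 8-second threshold is the one from the example above; everything else in this sketch, including the names, is illustrative rather than how the SDK actually does it.

```ts
// Sketch: a browser-side form-hesitation detector. A field that sits
// focused for 8 seconds with no keystroke suggests the user is stuck
// on the label, not composing an answer.

const HESITATION_MS = 8000;

function watchFormHesitation(emit: (selector: string, ms: number) => void) {
  let timer: ReturnType<typeof setTimeout> | undefined;

  document.addEventListener("focusin", (e) => {
    const target = e.target;
    if (!(target instanceof HTMLInputElement)) return;

    const focusedAt = Date.now();
    timer = setTimeout(() => {
      emit(selectorFor(target), Date.now() - focusedAt);
    }, HESITATION_MS);
  });

  // Any keystroke or blur means they weren't stuck; cancel the timer.
  document.addEventListener("keydown", () => clearTimeout(timer));
  document.addEventListener("focusout", () => clearTimeout(timer));
}

// Placeholder: real selector generation is more involved than this.
function selectorFor(el: HTMLElement): string {
  return el.id ? `#${el.id}` : el.tagName.toLowerCase();
}
```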
We detect 18 distinct frustration signals across desktop, mobile, touch, and keyboard interactions. Each one gets a weight. Rage clicks are weighted at 25. Dead clicks at 12. Loop navigation at 20. Focus traps at 20. Scroll bounce at 8. The weights reflect how strongly each signal correlates with actual confusion, not just with annoyance.
A single rage click event on a carousel arrow is noise. A pattern of rage clicks plus form hesitation plus loop navigation on the same page within the same 15-minute window? That's a broken experience, and the co-occurrence multiplier bumps the score accordingly.
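In code, the aggregation is not much more than a weighted sum with a bump for variety. The weights below are the ones quoted above and the 15-minute window is the same one; the multiplier value itself is an assumption for illustration.

```ts
// Sketch: combining weighted frustration signals for one page over one
// 15-minute window. Weights are the ones quoted in the post; the
// co-occurrence multiplier value is assumed.

type Signal =
  | "rage_click"
  | "dead_click"
  | "loop_navigation"
  | "focus_trap"
  | "scroll_bounce";

const SIGNAL_WEIGHTS: Record<Signal, number> = {
  rage_click: 25,
  dead_click: 12,
  loop_navigation: 20,
  focus_trap: 20,
  scroll_bounce: 8,
};

// Raw frustration for signals already grouped by page and 15-minute window.
function windowFrustration(signals: Signal[]): number {
  const base = signals.reduce((sum, s) => sum + SIGNAL_WEIGHTS[s], 0);

  // Several distinct signal types co-occurring points at a broken
  // experience rather than isolated annoyance, so the total gets a
  // bump: an assumed 25% per extra distinct type.
  const distinct = new Set(signals).size;
  const multiplier = distinct > 1 ? 1 + 0.25 * (distinct - 1) : 1;

  return base * multiplier;
}

windowFrustration(["rage_click"]);                                  // 25
windowFrustration(["rage_click", "dead_click", "loop_navigation"]); // 85.5
```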
No recording. By design.
Flusterduck doesn't record the DOM. Doesn't capture screenshots. Doesn't replay sessions. This isn't a limitation we're apologizing for. It's a design decision we're proud of.
The SDK is under 4KB gzipped. Clarity is 17KB. FullStory is over 30KB. We track behavioral signals, not visual content. Click coordinates, timing, scroll velocity, navigation patterns. Element selectors, never text content. No GDPR headaches from accidentally recording passwords. No storage costs for replay data you'll never watch.
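For a sense of what "behavioral signals, not visual content" means on the wire, here's an illustrative payload. The field names and the collection endpoint are made up for this example; the point is what's absent from it.

```ts
// Sketch: the kind of payload a signal-only SDK can emit. No DOM
// snapshot, no screenshot, no input text, no element text content.

interface BehavioralEvent {
  signal: "rage_click" | "dead_click" | "form_hesitation" | "pinch_zoom";
  page: string;            // pathname only, e.g. "/checkout"
  selector: string;        // element selector, never its text content
  clickX?: number;         // viewport coordinates, when relevant
  clickY?: number;
  durationMs?: number;     // e.g. how long a field sat focused but empty
  scrollVelocity?: number; // px/s, for scroll-related signals
  occurredAt: number;      // ms since epoch
}

const event: BehavioralEvent = {
  signal: "dead_click",
  page: "/checkout",
  selector: "button.apply-coupon[disabled]",
  clickX: 412,
  clickY: 688,
  occurredAt: Date.now(),
};

// "/collect" is a placeholder endpoint. Because only selectors,
// coordinates, and timing leave the browser, there is no password or
// PII capture path to audit.
navigator.sendBeacon("/collect", JSON.stringify(event));
```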
What I wanted to build
I wanted the tool I'd check at standup. Glance at the duck, see if anything's off, move on with my day. When something breaks, get a Slack message that tells me the page, the element, the signal, and how many users are affected. Fix it. Watch the score drop. Ship.
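Concretely, that alert can be as plain as a Slack incoming-webhook post. The webhook URL and helper name here are placeholders; the fields are the ones I just listed, filled in with the coupon-button numbers from earlier.

```ts
// Sketch: formatting a confusion alert for a Slack incoming webhook.
// Numbers mirror the coupon-button example; nothing here is the real
// alerting pipeline.

interface ConfusionAlert {
  page: string;
  element: string;
  signal: string;
  usersAffected: number;
  score: number;
}

async function postAlert(alert: ConfusionAlert, webhookUrl: string) {
  const text =
    `Confusion score on ${alert.page} is ${alert.score}. ` +
    `${alert.usersAffected} users hit "${alert.signal}" on ` +
    `${alert.element} in the last 15 minutes.`;

  // Slack incoming webhooks accept a simple JSON body with a text field.
  await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
}

void postAlert(
  {
    page: "/checkout",
    element: "button.apply-coupon",
    signal: "dead_click",
    usersAffected: 31,
    score: 78,
  },
  "https://hooks.slack.com/services/..." // placeholder webhook URL
);
```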
I wanted to type "how's the duck?" into Claude and get an answer without opening a dashboard.
I wanted deploy correlation that tells me "confusion on /settings jumped 340% after your last deploy" and also tells me "confusion on /checkout dropped 62% after deploy #892, whatever you did, it worked."
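Under the hood, deploy correlation doesn't have to be clever: compare a page's scores in a window before the deploy timestamp with the window after it. The one-hour window and the names in this sketch are assumptions.

```ts
// Sketch: percent change in a page's mean confusion score across a
// deploy time. +340 reads as "jumped 340%", -62 as "dropped 62%".

interface ScoreSample {
  page: string;
  score: number;
  at: number; // ms since epoch
}

function deployDelta(
  samples: ScoreSample[],
  page: string,
  deployedAt: number,
  windowMs = 60 * 60 * 1000 // compare one hour before vs. one hour after
): number | null {
  const inWindow = (s: ScoreSample, from: number, to: number) =>
    s.page === page && s.at >= from && s.at < to;

  const before = samples.filter((s) => inWindow(s, deployedAt - windowMs, deployedAt));
  const after = samples.filter((s) => inWindow(s, deployedAt, deployedAt + windowMs));
  if (before.length === 0 || after.length === 0) return null;

  const mean = (xs: ScoreSample[]) =>
    xs.reduce((sum, s) => sum + s.score, 0) / xs.length;

  const beforeMean = mean(before);
  if (beforeMean === 0) return null;
  return Math.round(((mean(after) - beforeMean) / beforeMean) * 100);
}
```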
I wanted the inverse of a fire alarm. Not just "something's burning" but "that thing you fixed last week? It's still fixed. Good job."
That's what we built. One script tag. Real-time confusion scores. Element-level diagnosis. The duck watches so you don't have to.