2026-01-22 · technical · scoring · architecture · deep-dive

How the Confusion Score Works

Signal weights, baselines, co-occurrence multipliers, and z-score normalization. The actual math.

The confusion score is a number from 0 to 100. It answers one question: how much more confused are users on this page right now compared to what's normal for this page?

"Normal" varies by page, by hour, by weekday. The scoring engine accounts for all of it.

Step 1: weighted signal aggregation

The SDK detects 18 distinct frustration signals. Each signal has a weight that reflects its correlation with actual user abandonment:

```typescript
export const SIGNAL_WEIGHTS: Record<SignalType, number> = {
  rage_click: 25,
  dead_click: 12,
  speed_frustration: 14,
  thrash_cursor: 15,
  loop_nav: 20,
  scroll_bounce: 8,
  form_hesitation: 12,
  form_abandon: 18,
  error_encounter: 15,
  tap_miss: 10,
  pinch_zoom: 8,
  swipe_miss: 10,
  scroll_hijack: 12,
  orientation_thrash: 8,
  tab_thrash: 15,
  focus_trap: 20,
  keyboard_nav_frustration: 10,
  scroll_depth_abandon: 6,
};
```

Rage clicks (25), loop navigation (20), and focus traps (20) are weighted highest. Scroll bounce (8) and pinch zoom (8) are lower because they have innocent explanations.

The raw score:

```
raw_score = sum(signal_count * signal_weight) / active_users
```

Dividing by active users matters. 40 rage click events on a page with 1,000 active users is background noise. 40 rage click events with 50 users is a fire.
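As a minimal sketch of this step (the `computeRawScore` name and the trimmed-down weights literal are illustrative; the full SIGNAL_WEIGHTS table is above):

```typescript
// Sketch: weighted aggregation of per-page signal counts, normalized by
// active users. Only two weights are included here for brevity.
type SignalCounts = Record<string, number>;

const WEIGHTS: Record<string, number> = { rage_click: 25, dead_click: 12 };

export function computeRawScore(
  counts: SignalCounts,
  activeUsers: number,
): number {
  if (activeUsers <= 0) return 0;
  let weighted = 0;
  for (const [signal, count] of Object.entries(counts)) {
    weighted += count * (WEIGHTS[signal] ?? 0);
  }
  return weighted / activeUsers;
}

// 40 rage clicks across 1,000 users vs. the same 40 across 50 users:
computeRawScore({ rage_click: 40 }, 1000); // 1 — background noise
computeRawScore({ rage_click: 40 }, 50);   // 20 — a fire
```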

Step 2: co-occurrence multiplier

14 users frustrated simultaneously on the same page means something is actively broken right now. 14 users frustrated across 14 hours is Tuesday. The co-occurrence multiplier distinguishes these cases:

```typescript
export function computeCoOccurrenceMultiplier(
  frustratedUsersInWindow: number,
): number {
  if (frustratedUsersInWindow >= 50) return 2.5;
  if (frustratedUsersInWindow >= 20) return 2.0;
  if (frustratedUsersInWindow >= 10) return 1.5;
  return 1.0;
}
```

The window is 5 minutes by default. If 10+ users are frustrated within that window, the score gets a 50% bump. At 20+ users: doubled. At 50+: 2.5x. This is the mechanism that surfaces "something broke in production" events with appropriate urgency.
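Counting distinct frustrated users inside the rolling window can be sketched like this (the 5-minute default comes from the text; the event shape and function name are assumptions):

```typescript
// Sketch: count distinct frustrated users within a rolling time window.
const WINDOW_MS = 5 * 60 * 1000; // 5-minute default, per the text

export function usersInWindow(
  events: { userId: string; timestamp: number }[],
  now: number,
  windowMs: number = WINDOW_MS,
): number {
  const seen = new Set<string>();
  for (const e of events) {
    // Same user firing multiple signals still counts once.
    if (now - e.timestamp <= windowMs) seen.add(e.userId);
  }
  return seen.size;
}
```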

The multiplied score:

```
score = raw_score * co_occurrence_multiplier
```

Step 3: baseline computation

Is 34 good or bad? Depends on the page. The baseline engine computes a rolling 7-day average, plus per-hour and per-day-of-week averages:

```typescript
export function computeBaseline(
  page: string,
  history: ScoreHistoryEntry[],
  minDays: number = 7,
): PageBaseline | null {
  // ... sorts history, checks minimum data requirement ...

  const scores = history.map((h) => h.score);
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  // Sample variance (Bessel's correction when n > 1)
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) /
    (scores.length > 1 ? scores.length - 1 : 1);
  const stdDev = Math.sqrt(variance);

  // Compute per-hour and per-day-of-week means
  const timeOfDayMeans: Record<number, number> = {};
  const dayOfWeekMeans: Record<number, number> = {};
  for (const entry of history) {
    const d = new Date(entry.timestamp);
    const hour = d.getUTCHours();
    const day = d.getUTCDay();
    // ... accumulate and average by time slot ...
  }

  return {
    page,
    mean,
    std_dev: stdDev,
    sample_count: history.length,
    time_of_day_means: timeOfDayMeans,
    day_of_week_means: dayOfWeekMeans,
    computed_at: Date.now(),
  };
}
```

The system knows your /checkout page is always noisier on Friday evenings than Tuesday mornings. The baseline requires 7 days minimum. Before that, scores use absolute thresholds and are labeled "provisional."

Step 4: z-score normalization

Raw numbers become the 0-100 scale here. The engine normalizes against the baseline using a modified z-score:

```typescript
if (baseline && baseline.sample_count >= 7) {
  baselineMean = baseline.mean;
  baselineStdDev = Math.max(baseline.std_dev, 1);
  const normalizedScore =
    ((score - baselineMean) / baselineStdDev) * 25 + 50;
  deviation = (normalizedScore - 50) / 25;
  score = Math.max(0, Math.min(100, normalizedScore));
}
```

The formula: ((score - mean) / std_dev) * 25 + 50. A score of 50 means "exactly at baseline." 75 means one standard deviation above normal. The clamping at 0 and 100 keeps the number readable, but raw deviation is preserved for trend detection. The std_dev floor of 1 prevents division-by-zero on pages with very stable baselines.
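A worked example of this step in isolation (the `normalize` function name and its return shape are illustrative; the floor, scale, and clamp match the snippet above):

```typescript
// Sketch: z-score normalization with the std-dev floor, the 0-100 clamp,
// and the raw deviation preserved for trend detection.
export function normalize(
  score: number,
  baselineMean: number,
  baselineStdDev: number,
): { score: number; deviation: number } {
  const stdDev = Math.max(baselineStdDev, 1); // floor of 1, as in the engine
  const normalized = ((score - baselineMean) / stdDev) * 25 + 50;
  return {
    score: Math.max(0, Math.min(100, normalized)),
    deviation: (normalized - 50) / 25,
  };
}

// One standard deviation above a baseline of mean 30, std dev 8:
normalize(38, 30, 8); // { score: 75, deviation: 1 }
```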

Step 5: trend detection

The deviation value (how many standard deviations above or below baseline) drives trend arrows in the dashboard:

```typescript
if (deviation > 1) trend = 'up';
else if (deviation < -1) trend = 'down';
// else: 'stable'
```

Above 1: trending up. Below -1: trending down. Inside that band: stable. Thresholds are in z-score units, so they adjust to each page's volatility.

What the scores mean

  • 0-20: Calm. Users are doing fine.
  • 21-40: Mild friction. Worth a glance, not worth a Slack message.
  • 41-60: Noticeable confusion. Something changed. Check recent deploys.
  • 61-80: Real frustration. Users are struggling. Fix it today.
  • 81-100: Something is broken. Fix it now.
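The bands above amount to a simple lookup (a sketch; the band boundaries come from the list, the label strings are assumptions):

```typescript
// Sketch: map a 0-100 confusion score to the severity bands described above.
export function scoreLabel(score: number): string {
  if (score <= 20) return 'calm';
  if (score <= 40) return 'mild friction';
  if (score <= 60) return 'noticeable confusion';
  if (score <= 80) return 'real frustration';
  return 'broken';
}
```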

Anomaly and trend detection

The baseline powers two alert types. Anomaly alerts fire when the current score exceeds the time-adjusted baseline by more than 2 standard deviations. The time adjustment matters: if your site is always noisy at 5pm on Fridays, a Friday spike needs to clear a higher bar.

Trend alerts catch gradual degradation. If the recent 7-day average is 30%+ higher than the historical baseline without a single sharp spike to explain it, the system flags it. The slow creep nobody notices until support tickets start climbing.
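The trend check can be sketched as follows. The 30% threshold is from the text; the function shape and the use of the 2-standard-deviation anomaly bar as the "sharp spike" definition are assumptions:

```typescript
// Sketch: flag gradual degradation — recent average 30%+ above the
// historical baseline, with no single spike that would explain it.
export function isGradualDegradation(
  recent7DayAvg: number,
  historicalMean: number,
  historicalStdDev: number,
  recentScores: number[],
): boolean {
  // A sharp spike (> 2 std devs above baseline) is the anomaly alert's job.
  const hasSpike = recentScores.some(
    (s) => s > historicalMean + 2 * historicalStdDev,
  );
  return !hasSpike && recent7DayAvg >= historicalMean * 1.3;
}
```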

Cold start

Before baselines exist, the engine clamps the weighted sum directly to 0-100. Useful from minute one, if less personalized. After 7 days, baselines kick in. The dashboard labels this transition. No false precision.
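The cold-start path is just the clamp plus a provisional flag (a sketch; the function name and return shape are illustrative):

```typescript
// Sketch: before a baseline exists, clamp the weighted (and multiplied)
// score straight to 0-100 and mark it provisional for the dashboard.
export function coldStartScore(score: number): {
  score: number;
  provisional: true;
} {
  return { score: Math.max(0, Math.min(100, score)), provisional: true };
}
```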

The whole system runs in an edge worker. Scores update within seconds of new events arriving.
