20 months on a platform for real-time quality analysis of video streams

2× throughput, 2× failure rate: 20 months of measured AI adoption


Three things happened: per-engineer throughput went up ~2×, review initially got slower, and production broke more often. All three are real, and the honest version of this story has all three in it.

🎯 TL;DR

We adopted Claude Code as the team’s daily driver. Per-engineer PR throughput went up ~2× (6.8 → 13.9 PRs / engineer-month) without reviews getting slower — while the team grew from 1 to 11 concurrent engineers at peak.

It didn’t happen in a straight line.

  • Adoption made things worse first. For ~5 months of half-commitment to the tool, median PR cycle time roughly quadrupled (5.3h → 22.8h). Lumpier PRs, slower reviews. A team measuring the pilot here would have killed it just before it broke through.
  • Tuning Copilot reviews got us out of the trough. Custom rules, project-specific instructions, automated linting — the agent absorbed the first-pass review and throughput recovered. Quality guardrails are still a work in progress.
  • Then it shipped faster — at a cost. Change-failure rate roughly doubled (~8% → ~17%). PRs per release grew from 2 to 10.
  • AI co-authorship: 0% → 53% of commits within six months of agents becoming the default.

The conclusion isn’t “AI paid off.” It’s: AI made us fast — and quality didn’t keep up fast enough. Fast and right aren’t mutually exclusive. But they don’t come for free.

🏗️ The setup

  • A platform for real-time quality analysis of video streams, built on Spring Boot, Kafka and AWS infrastructure.
  • Team grew from 1 engineer (me) to 11 concurrent engineers at peak.
  • Three AI-adoption eras, defined by which tooling was the default, not just available:
| Era | Tooling | Measurement status |
| --- | --- | --- |
| Pre-agent | GitHub Copilot inline · Claude in a browser tab for design questions | Partly asserted: Copilot leaves no commit fingerprint |
| Agent-assisted | Claude Code introduced, used occasionally on specific tasks | Measured from here forward (Claude Code signs commits) |
| Agent-driven | Claude Code as the default for most non-trivial work | Measured |

📈 Numbers go up

Chart 1 — Per-engineer PR throughput, cycle time, and AI co-authorship across the three eras

Per-engineer throughput climbs ~2× while cycle time stays flat. AI co-authorship climbs in lockstep with the second jump (agent-assisted → agent-driven), not the first.

Throughput up, latency flat

  • Issue: Adding engineers normally stretches review queues. The team grew from 1 to 11 concurrent engineers at peak — review latency should have stretched. Per-engineer throughput should have flattened or fallen as coordination overhead rose.
  • What the data says:
    • Per-engineer throughput: 6.8 → 8.1 → 13.9 PRs / engineer-month (~2.05× across the three eras).
    • Median PR cycle time: 5.3h → 4.9h from the pre-agent to the agent-driven era. Effectively flat end to end, despite the 22.8h spike in between.
  • What happened: The agent absorbed the review-latency cost teams usually pay for headcount. Visible AI review activity (Copilot bot account) rose 0 → 1 → 2 comments per PR — a floor, not a ceiling, because AI-drafted replies posted from engineer accounts count as “human” in any bot-vs-human split. Best estimate: by the agent-driven era, AI generated roughly half the review activity per PR — Copilot directly plus AI-drafted human-account replies.
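The throughput figure is simple arithmetic once the denominator is pinned down. A minimal sketch, assuming PR records flattened from `gh pr list --json` output into an `author` string and a `mergedAt` timestamp; the function name and the sample rows are hypothetical, not the platform's real data:

```python
from datetime import datetime


def prs_per_engineer_month(prs):
    """Merged PRs divided by engineer-months.

    An engineer-month is one (calendar month, author) pair with at least
    one merged PR, which matches the definition in the methodology section.
    """
    merged_count = 0
    engineer_months = set()
    for pr in prs:
        if pr["mergedAt"] is None:  # never merged, so not throughput
            continue
        merged = datetime.fromisoformat(pr["mergedAt"].replace("Z", "+00:00"))
        engineer_months.add((merged.year, merged.month, pr["author"]))
        merged_count += 1
    if not engineer_months:
        return 0.0
    return merged_count / len(engineer_months)


# Hypothetical sample rows.
sample = [
    {"author": "alice", "mergedAt": "2024-01-03T10:00:00Z"},
    {"author": "alice", "mergedAt": "2024-01-20T09:30:00Z"},
    {"author": "bob", "mergedAt": "2024-01-11T16:45:00Z"},
    {"author": "bob", "mergedAt": None},
    {"author": "alice", "mergedAt": "2024-02-02T08:15:00Z"},
]
print(round(prs_per_engineer_month(sample), 2))  # 4 merged PRs over 3 engineer-months
```

Dividing by distinct author-months rather than headcount means an engineer who merged nothing in a month does not dilute the figure. That is a property of this proxy, not a claim about how busy anyone was.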

Agent wrote half the commits

  • Issue: Adoption claims need a denominator. “Most engineers use AI now” is unfalsifiable without one.
  • What the data says:
    • Pre-agent: 0 of 604 commits carry an agent trailer. That is a measurement gap, not proof of zero AI use: inline-completion work leaves no fingerprint in git.
    • Agent-assisted: 7 of 396 commits carry the Co-Authored-By: Claude trailer (1.8%).
    • Agent-driven: 1,088 of 2,053 commits (53%) carry it.
  • What it means: When the trailer crosses 50%, the agent is the default — not a tool reached for occasionally. The trailer is a floor, not a ceiling; manual rebases and fixups don’t carry it. Real adoption sits above 53%.
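Counting the trailer is mechanical. A minimal sketch, assuming commit messages have already been pulled out of `git log` into a list of strings; the regex, the function name and the sample messages are illustrative assumptions, and the match is case-insensitive only to tolerate spelling variations of the trailer key:

```python
import re

# Matches a line that starts with "Co-Authored-By: Claude" (any letter case).
CLAUDE_TRAILER = re.compile(
    r"^co-authored-by:\s*claude\b", re.IGNORECASE | re.MULTILINE
)


def co_authorship_rate(commit_messages):
    """Share of commits whose message carries the Claude trailer."""
    if not commit_messages:
        return 0.0
    tagged = sum(1 for msg in commit_messages if CLAUDE_TRAILER.search(msg))
    return tagged / len(commit_messages)


# Hypothetical commit messages.
messages = [
    "Fix consumer lag alert\n\nCo-Authored-By: Claude <noreply@example.com>",
    "Bump Spring Boot version",
    "Ask Claude about retry policy, then rewrite by hand",
    "Add stream probe\n\nco-authored-by: Claude <noreply@example.com>",
]
print(co_authorship_rate(messages))  # 2 of 4 messages carry the trailer

# The article's own counts, as a sanity check on the percentages.
print(f"{7 / 396:.1%}", f"{1088 / 2053:.1%}")
```

Anchoring the pattern to the start of a line keeps a commit that merely mentions Claude in its subject from counting as co-authored.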

🕳️ The trough is the story

Look at the cycle-time line in the chart. It spikes sharply during the agent-assisted era.

  • Issue: A pilot that produces a worse metric than the baseline is the kind that gets killed.
  • What the data says: Median cycle time during agent-assisted use: 22.8 hours — roughly 4× the pre-agent baseline of 5.3h.
  • What happened: Half-committed agent usage produced fewer, lumpier PRs — refactor + feature + config bundled together. They sat in review queues longer because reviewers couldn’t context-switch into them quickly.
  • Conclusion: Had the metrics pipeline stopped here, the data would have said “Claude Code made us slower,” and the pilot would have been killed. This is the Trough of Disillusionment from the Gartner hype cycle, observed at team scale. Reviews in the trough were still being done by humans, and humans alone couldn’t keep up with the lumpier PRs. The payoff came when we leaned harder on Copilot reviews tuned with project-specific rules and instructions, with humans doing the second pass instead of the first.
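The trough only shows up if cycle time is bucketed by era rather than averaged over the whole window. A minimal sketch of that bucketing, assuming PR records with `createdAt`, `mergedAt` and `isDraft` fields; the era start dates are hypothetical placeholders, and bucketing by merge date rather than creation date is my assumption:

```python
from datetime import datetime
from statistics import median


def parse_ts(ts):
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


# Hypothetical era boundaries, sorted by start date.
ERA_STARTS = [
    ("pre-agent", parse_ts("2023-01-01T00:00:00Z")),
    ("agent-assisted", parse_ts("2023-10-01T00:00:00Z")),
    ("agent-driven", parse_ts("2024-03-01T00:00:00Z")),
]


def era_for(ts):
    """Return the last era whose start is at or before ts, else None."""
    name = None
    for era, start in ERA_STARTS:
        if ts >= start:
            name = era
    return name


def median_cycle_hours_by_era(prs):
    """Median (mergedAt - createdAt) in hours per era, merged non-draft PRs only."""
    buckets = {}
    for pr in prs:
        if not pr.get("mergedAt") or pr.get("isDraft"):
            continue
        created, merged = parse_ts(pr["createdAt"]), parse_ts(pr["mergedAt"])
        era = era_for(merged)
        if era is None:
            continue
        hours = (merged - created).total_seconds() / 3600
        buckets.setdefault(era, []).append(hours)
    return {era: median(hours) for era, hours in buckets.items()}


# Hypothetical sample rows.
sample_prs = [
    {"createdAt": "2023-02-01T09:00:00Z", "mergedAt": "2023-02-01T13:00:00Z", "isDraft": False},
    {"createdAt": "2023-03-01T09:00:00Z", "mergedAt": "2023-03-01T15:00:00Z", "isDraft": False},
    {"createdAt": "2023-11-01T09:00:00Z", "mergedAt": "2023-11-02T09:00:00Z", "isDraft": False},
    {"createdAt": "2023-11-05T09:00:00Z", "mergedAt": None, "isDraft": False},
    {"createdAt": "2023-11-06T09:00:00Z", "mergedAt": "2023-11-09T09:00:00Z", "isDraft": True},
]
print(median_cycle_hours_by_era(sample_prs))
```

The median matters here: a handful of lumpy PRs that sit for days would drag a mean much further than they drag the middle of the distribution.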

⚠️ What about quality?

Production deploy data tells the other half of the delivery story (AWS App Runner operations, 13 months).

Chart 2 — PRs per release grew from 2 to 10; change-failure rate moved with it

Release size grew ~5× and change-failure rate rose with it, roughly doubling. The cost is in the chunks, not the changes.

  • Issue: Quality got worse.
  • What the data says:
    • PRs per release: ~2 → ~10 (pre-agent → agent-driven).
    • Change-failure rate (rollback %): ~8% → ~17%.
  • What happened: 10 changes per release = bigger blast radius per failure. One bad change pulls the whole release back. Every failure burned investigation time — even though App Runner’s health checks caught each one and auto-reverted before it reached steady state.
  • Conclusion: We shipped faster and in bigger, riskier chunks. The rollback rate doubled. Rollback count rose faster than that.
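The rollback percentage can be derived from operation records alone. A minimal sketch, assuming each record is a dict with `Type` and `Status` keys in the shape returned by App Runner's list-operations call; the function name and the sample records are hypothetical:

```python
def change_failure_rate(operations):
    """Share of START_DEPLOYMENT operations that ended in ROLLBACK_SUCCEEDED."""
    deploys = [op for op in operations if op["Type"] == "START_DEPLOYMENT"]
    if not deploys:
        return 0.0
    rolled_back = sum(1 for op in deploys if op["Status"] == "ROLLBACK_SUCCEEDED")
    return rolled_back / len(deploys)


# Hypothetical operation records.
operations = [
    {"Type": "START_DEPLOYMENT", "Status": "SUCCEEDED"},
    {"Type": "START_DEPLOYMENT", "Status": "SUCCEEDED"},
    {"Type": "START_DEPLOYMENT", "Status": "ROLLBACK_SUCCEEDED"},
    {"Type": "UPDATE_SERVICE", "Status": "SUCCEEDED"},  # not a deploy, ignored
    {"Type": "START_DEPLOYMENT", "Status": "SUCCEEDED"},
    {"Type": "START_DEPLOYMENT", "Status": "SUCCEEDED"},
    {"Type": "START_DEPLOYMENT", "Status": "SUCCEEDED"},
]
print(f"{change_failure_rate(operations):.1%}")  # 1 rollback in 6 deploys
```

This counts only deploys that App Runner itself reverted. A release that reached steady state and was later hotfixed is invisible here, which is why the figure is a lower bound.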

🔧 What I’d do differently

Two principles, in order of how much they’d change the outcome:

Review must outpace AI, not match it

  • Issue: The two numbers that moved the wrong way (agent-assisted cycle time, agent-driven CFR) share a mechanism. As AI takes on more of the writing AND more of the reviewing, the human judgment bar has to rise to match.
  • What the data says:
    • Visible AI review activity (Copilot bot account) rose 0 → 1 → 2 comments per PR.
    • Estimated AI share of review activity: ~half by the agent-driven era (Copilot bot account + AI-drafted human-account replies).
    • Change-failure rate doubled anyway.
  • What it means: Comment volume scaled. What didn’t was reviewer ownership on the diffs that mattered most. The agent-driven-era failures clustered in the biggest changes — exactly where everyone assumed someone else would catch the problem.
  • Conclusion: We’ve just started adding guardrails. The bigger gap is the analysis layer: bug and tech-debt root-cause categorisation. Without classifying what actually caused each regression — release size? data migration? schema drift? insufficient test coverage? — guardrails are guesses at where to harden. Categorise the failure modes first, then build the lint / type / contract-test / size-cap guardrails around the largest cluster.
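The categorisation step does not need tooling to start; a labelled list and a tally is enough to see where the largest cluster sits. A minimal sketch, assuming each regression has been hand-labelled with one root cause drawn from the candidate causes named above; the records and field names are hypothetical:

```python
from collections import Counter


def rank_failure_modes(regressions):
    """Return (root_cause, count) pairs, largest cluster first."""
    counts = Counter(r["root_cause"] for r in regressions)
    return counts.most_common()


# Hypothetical hand-labelled regressions, not real incident data.
regressions = [
    {"id": "r1", "root_cause": "release_size"},
    {"id": "r2", "root_cause": "schema_drift"},
    {"id": "r3", "root_cause": "release_size"},
    {"id": "r4", "root_cause": "insufficient_test_coverage"},
    {"id": "r5", "root_cause": "release_size"},
    {"id": "r6", "root_cause": "data_migration"},
]

ranked = rank_failure_modes(regressions)
top_cause, top_count = ranked[0]
print(top_cause, top_count)
```

With a ranking like this, the first guardrail targets the top row. In this made-up sample that would be a release size cap, ahead of contract tests or migration checks.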

Start with the metrics

  • Issue: Quality measurement was the gap throughout this project. Hotfix and revert PRs weren’t tagged. Era boundaries were identified post-hoc by reading commit messages and PR titles.
  • What it means: The only CFR proxy available retroactively was App Runner rollback events. That’s operational-failure data, not quality-of-shipped-code data — different claims. And the era bucketing held up but was hand-waved.
  • Conclusion: An afternoon of pre-rollout work — mandatory hotfix / revert PR labels, a CFR dashboard, and a written trigger for when to evaluate the rollout (“after N weeks of M% of devs using it on K% of PRs”) — would have given a quality axis to sit alongside throughput, plus a defensible window definition. Without it, retro analyses rest on proxies and hand-waves.
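The written trigger can be made executable so nobody argues about the window after the fact. A minimal sketch of one possible reading of "after N weeks of M% of devs using it on K% of PRs", assuming weekly per-developer counts of AI-assisted and total PRs are available; the class, field names and thresholds are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutTrigger:
    min_weeks: int        # N: consecutive qualifying weeks required
    min_dev_share: float  # M: share of active devs who must be adopters
    min_pr_share: float   # K: share of a dev's PRs that must be AI-assisted


def should_evaluate(weeks, trigger):
    """weeks: list of {dev: (ai_prs, total_prs)} dicts, oldest first."""
    streak = 0
    for week in weeks:
        active = [(ai, total) for ai, total in week.values() if total > 0]
        adopters = sum(1 for ai, total in active if ai / total >= trigger.min_pr_share)
        if active and adopters / len(active) >= trigger.min_dev_share:
            streak += 1
            if streak >= trigger.min_weeks:
                return True
        else:
            streak = 0  # a miss resets the run of qualifying weeks
    return False


# Hypothetical weekly data.
weeks = [
    {"alice": (1, 2), "bob": (0, 3)},
    {"alice": (2, 2), "bob": (2, 3)},
    {"alice": (1, 1), "bob": (0, 1)},
]
print(should_evaluate(weeks, RolloutTrigger(2, 0.5, 0.5)))
print(should_evaluate(weeks, RolloutTrigger(2, 0.75, 0.5)))
```

Requiring consecutive weeks is a design choice: it keeps one enthusiastic sprint from opening the evaluation window before usage has settled.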

💡 The takeaway

The 2× per-engineer throughput is real.

The trough is real.

The 2× change-failure rate is also real.

The honest read isn’t “AI paid off.” It’s: AI made us fast — and quality didn’t keep up fast enough. Fast and right aren’t mutually exclusive. But they don’t come for free.

This isn’t a problem with a clean ending. The speed-vs-quality tension sharpens, not softens, as the agent gets better. Only an AI can keep up with another AI — humans alone can’t review at the pace AI produces. The next decade of this discipline is going to be a long-running war between quality guardrails and throughput tooling, both evolving together.

I’m not going to watch it from the sidelines. I want to be in it — building the patterns, instruments, and guardrails that make fast and right possible at the same time.

📐 Methodology & limitations

Metrics here are modelled on DORA’s four keys (deployment frequency, lead time for changes, change-failure rate, time to restore service) as described by Forsgren, Humble & Kim in Accelerate (2018). PR throughput and PR cycle time are used here as stand-ins for the first two. MTTR is omitted because no consistent incident-start timestamp existed before the agent-driven era. The release-size → failure-rate mechanism is straight out of Accelerate’s chapter on continuous delivery.

🗂️ Data sources

  • Git history (git log) across the 3 actively deployed repos that make up the platform.
  • PR metadata via the GitHub API (gh pr list --json).
  • Production deploys via AWS App Runner list-operations.
  • Per-PR review comment counts via the GitHub Pulls API.

📅 Window boundaries

  • Pre-agent — from first commit through the months before Claude Code was introduced (~9 months). Copilot era.
  • Agent-assisted — between Claude Code being introduced and Claude Code becoming the default (~5 months).
  • Agent-driven — from Claude Code becoming the default through the end of the measurement window (~6 months).
  • Both boundaries are post-hoc reconstructions, identified by reading commit messages and PR titles for the inflection points.

📊 Metric definitions

  • 📈 Throughput — merged PR count per engineer-month, where engineer-month = (month × distinct PR-merging author). Squash-merged PRs count once.
  • ⏱️ Cycle time — mergedAt − createdAt per PR, per-window median. Cancelled and draft PRs excluded.
  • 🤖 AI co-authorship rate — commits with Co-Authored-By: Claude trailer / total commits, per window. Floor, not ceiling.
  • 📦 PRs per release — distinct merged PRs in a given prod deploy / deploy count for that window.
  • ⚠️ Change-failure rate — START_DEPLOYMENT operations followed by ROLLBACK_SUCCEEDED on the same service / total START_DEPLOYMENT. Lower bound on real operational failures — but a true rate for “deploys that didn’t reach steady state.”
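PRs per release is the one definition above that needs a join between two data sources. A minimal sketch, assuming a PR ships in the first deploy at or after its merge time and that all timestamps belong to a single service; that attribution rule and the function names are my assumptions, and the timestamps are plain numbers for brevity:

```python
from bisect import bisect_left


def prs_per_deploy(pr_merge_times, deploy_times):
    """Count merged PRs attributed to each deploy, in deploy order.

    PRs merged after the last deploy have not shipped yet and are ignored.
    """
    deploys = sorted(deploy_times)
    counts = [0] * len(deploys)
    for merged in pr_merge_times:
        idx = bisect_left(deploys, merged)  # first deploy at or after the merge
        if idx < len(deploys):
            counts[idx] += 1
    return counts


def mean_prs_per_release(pr_merge_times, deploy_times):
    counts = prs_per_deploy(pr_merge_times, deploy_times)
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical timestamps (hours since some epoch).
merges = [1, 5, 9, 12, 25, 31]
deploys = [10, 20, 30]
print(prs_per_deploy(merges, deploys))
print(round(mean_prs_per_release(merges, deploys), 2))
```

With several repos feeding one deploy, the merge times would need to be pooled per service before this runs. The sketch leaves that out.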

⚙️ Things worth knowing

  • Squash-merge confounds raw commit counts. PR-based metrics (throughput, cycle time, release size) are robust to it.
  • Team-growth numbers are coarse. “11 concurrent at peak” is the maximum distinct PR-merging authors in any single month. Median monthly was ~4. Per-engineer throughput is computed against distinct-author-months as a proxy.
  • The Co-Authored-By: Claude trailer is opt-in. Engineers using Claude Code without the auto-trailer setting don’t show up in the 53% number.
  • “Human comments” in PR data is account-based, not authorship-based. As Claude usage grew, engineers (myself included) posted AI-drafted replies into Copilot review threads. Those count as “human” in any bot-vs-human split. The bot-account count (Copilot) is a floor on AI’s review share; the ceiling isn’t bounded.
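The gap between "11 at peak" and "~4 median" comes from the same monthly author counts. A minimal sketch, assuming PR records with an `author` string and an ISO-8601 `mergedAt` timestamp; the function name and the sample rows are hypothetical:

```python
from collections import defaultdict
from statistics import median


def monthly_author_counts(prs):
    """Distinct PR-merging authors per calendar month, keyed by 'YYYY-MM'."""
    authors_by_month = defaultdict(set)
    for pr in prs:
        if pr["mergedAt"] is None:
            continue
        month = pr["mergedAt"][:7]  # 'YYYY-MM' prefix of the timestamp
        authors_by_month[month].add(pr["author"])
    return {month: len(authors) for month, authors in sorted(authors_by_month.items())}


# Hypothetical sample rows.
sample_prs = [
    {"author": "alice", "mergedAt": "2024-01-03T10:00:00Z"},
    {"author": "bob", "mergedAt": "2024-01-11T16:45:00Z"},
    {"author": "alice", "mergedAt": "2024-02-02T08:15:00Z"},
    {"author": "alice", "mergedAt": "2024-03-04T11:00:00Z"},
    {"author": "bob", "mergedAt": "2024-03-05T12:00:00Z"},
    {"author": "carol", "mergedAt": "2024-03-20T14:00:00Z"},
    {"author": "dave", "mergedAt": None},
]

counts = monthly_author_counts(sample_prs)
peak = max(counts.values())
typical = median(counts.values())
print(counts, peak, typical)
```

Reporting both numbers side by side is the honest version: the peak describes the busiest month, the median describes the team most months.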

Coming up in this series:

  • How to keep code review honest under AI — review patterns and guardrails that scale with how much the agent is writing.
  • How to cut failure rate without losing speed — what would actually move the CFR back down without giving up the throughput.
  • Plus the Kafka clock-skew war story (event-time vs ingestion-time, the hotfix that became an architecture rework).