Measuring Engineering Velocity with AI
PR throughput, cycle time, review latency — and how these metrics change when AI is in the loop.
The context
In AI-assisted engineering, the gap between what works in a demo and what survives production is wide, and it is getting wider. Teams that close this gap early ship faster. Teams that don't accumulate compounding technical debt, which shows up as incidents, cost overruns, and velocity drops. This article maps what we're seeing across engagements and what the patterns suggest.
What we're seeing in production
Across the systems we build and manage, a few patterns repeat. The teams that invest in structure — evaluation frameworks, cost monitoring, architecture decisions made before code is written — ship more reliably. The teams that optimize for speed without structure ship faster in week one and slower every week after. Production is a forcing function. It exposes every shortcut.
The patterns that hold up
The patterns that survive production are not the patterns that look good in a demo. They involve: (1) clear boundaries between deterministic and non-deterministic code, (2) monitoring that tells you something is wrong before your users do, (3) fallback chains that degrade gracefully instead of breaking hard, and (4) evaluation frameworks that catch regressions on every deploy. None of these is exciting. All of them are necessary.
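To make pattern (3) concrete, here is a minimal Python sketch of a fallback chain. It is an illustration under stated assumptions, not a description of any particular system: the handler names (primary_model, smaller_model, rule_based), the non_empty validator, and the default message are all hypothetical stand-ins. The deterministic validator also hints at pattern (1), since it sits between non-deterministic model output and the rest of the code.

```python
from typing import Callable, Optional, Sequence, Tuple

Handler = Callable[[str], str]


def run_with_fallbacks(
    prompt: str,
    handlers: Sequence[Handler],
    validate: Callable[[str], bool],
    default: str,
) -> Tuple[str, Optional[str]]:
    """Try each handler in order and return (answer, handler_name).

    If every handler fails or produces invalid output, return
    (default, None) so the caller degrades instead of crashing.
    """
    for handler in handlers:
        try:
            answer = handler(prompt)
        except Exception:
            # A real system would log the failure here; this sketch
            # simply moves on to the next handler in the chain.
            continue
        if validate(answer):
            return answer, handler.__name__
    return default, None


# Hypothetical handlers, ordered from most to least capable.
def primary_model(prompt: str) -> str:
    raise TimeoutError("primary model timed out")


def smaller_model(prompt: str) -> str:
    return ""  # empty output, rejected by the validator below


def rule_based(prompt: str) -> str:
    return f"Received: {prompt}"


def non_empty(answer: str) -> bool:
    # Deterministic check applied to non-deterministic output.
    return bool(answer.strip())


result = run_with_fallbacks(
    "summarize PR",
    [primary_model, smaller_model, rule_based],
    non_empty,
    default="Service temporarily unavailable.",
)
print(result)
```

Returning the name of the handler that answered, alongside the answer itself, gives monitoring something to count: a rising share of responses served by the last handler in the chain is one possible early signal that something upstream is degrading.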
What this means for your team
If you're building or managing AI systems, the question is not whether you'll encounter these patterns. It's whether you'll encounter them in production at 3am or in a design review before anything breaks. The teams that invest in production readiness early ship faster over any meaningful time horizon. The teams that skip that investment pay interest on the resulting technical debt every sprint.
Need help applying this?
We build and manage production systems. 30 minutes. Real conversation. No pitch.
