Klymb AI · February 2026

Measuring Engineering Productivity in the AI Era

AI Is Here. So Why Does It Feel Like Nothing Has Changed?

For Engineering Leaders · 15 min read

As AI reshapes how software gets built, the way we measure the people building it must evolve too. Here's why most metrics fail, and what the latest research says actually works.

Introduction

Engineering leaders at high-growth companies face a sharpening question: how productive is our engineering team, and is AI making it better? The pressure is real. Boards want to see engineering ROI. Headcount is growing fast, often ahead of revenue. AI tool budgets are climbing. And yet, most organizations either lack a structured approach to measuring engineering productivity or have struggled to implement one effectively.

The instinct is to reach for simple numbers like lines of code, commit counts, and tokens consumed by AI tools, but these measures track motion, not progress. They reward the wrong behaviors, they are trivially gameable, and AI has made inflating them effortless.

This paper makes two arguments. First, that the leading industry frameworks (DORA, SPACE, and DX Core 4) offer the right foundations for measuring engineering productivity across multiple dimensions. Second, that for VC-backed scaleups that need to move fast, a lightweight five-metric system is enough to answer the questions that actually matter: are we shipping effectively, is quality holding, is AI making a real difference, and is engineering output connecting to business growth?

Why Most Metrics Fail

The fundamental problem is Goodhart's Law:

When a measure becomes a target, it ceases to be a good measure.

Every metric below seems reasonable in isolation. Each one, when turned into a target, produces perverse outcomes, and AI has made the problem worse by making output cheap to manufacture.

| Metric | Claims to measure | Why it breaks | AI-era risk |
| --- | --- | --- | --- |
| Lines of Code | Engineering output | Best work often deletes code. Incentivizes verbosity and bloat. | AI generates hundreds of plausible lines in seconds. Measures mass, not value. |
| Commit Count | Engineering activity | Incentivizes fragmentation: trivial commits, split work, noisy git history. | AI-assisted workflows produce artificial volume with no real progress. |
| CR Count | Delivery throughput | Change requests (CRs, i.e. pull requests or merge requests) get sliced into tiny units to inflate count. Measured individually, engineers optimize for volume over value. | Agentic AI tools generate diffs at scale, decoupling volume from value. |
| Story Points | Team capacity | Teams inflate estimates to hit targets. Not comparable across teams. | Meaningless when AI compresses implementation time unpredictably. |
| Burndown Charts | Sprint progress | Encourages scope manipulation to make the chart look right. | AI can accelerate burndown artificially without improving outcomes. |
| Tokens / Engineer | AI adoption | Measures consumption, not value. A precise prompt beats 20 wasted iterations. | Incentivizes performative AI usage over effective usage. |
| Revenue / Engineer | Engineering efficiency | Driven by sales, pricing, market, not engineering. Punishes growing teams. | Misleading at scaleups where headcount grows ahead of revenue by design. |

These metrics are not universally useless. Some provide directional signal when tracked at the team level and interpreted with care. But as individual KPIs or primary productivity measures, they fail. The question is: what works instead?

The Frameworks That Work: DORA, SPACE, and DX Core 4

Over the past decade, three research programs have converged on a shared insight: engineering productivity must be measured across multiple complementary dimensions.

DORA established four metrics (deployment frequency, lead time, change failure rate, and time to restore), proving that speed and stability are not trade-offs. SPACE (Microsoft) expanded the lens to include satisfaction, collaboration, and flow. DX Core 4 synthesizes both into four actionable dimensions with a strong emphasis on developer experience surveys as leading indicators.
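To make the four DORA metrics concrete, here is a minimal sketch of how they might be computed from deployment records. The record fields and the sample data are illustrative assumptions, not the schema of any real CI/CD tool.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records (illustrative field names and data):
# each record links a change's commit time, deploy time, and failure info.
deploys = [
    {"committed": datetime(2026, 1, 5, 9), "deployed": datetime(2026, 1, 5, 15),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 1, 6, 10), "deployed": datetime(2026, 1, 7, 11),
     "failed": True, "restored": datetime(2026, 1, 7, 13)},
    {"committed": datetime(2026, 1, 8, 9), "deployed": datetime(2026, 1, 8, 12),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 1, 9, 14), "deployed": datetime(2026, 1, 10, 9),
     "failed": False, "restored": None},
]

period_days = 7  # observation window for this sample

# Deployment frequency: deploys per day over the window.
deployment_frequency = len(deploys) / period_days

# Lead time: median commit-to-deploy duration, in hours.
lead_time_hours = median(
    (d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deploys
)

# Change failure rate: share of deploys that caused a failure.
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

# Time to restore: median failure-to-recovery duration, in hours.
restore_hours = [
    (d["restored"] - d["deployed"]).total_seconds() / 3600
    for d in deploys if d["failed"]
]
time_to_restore = median(restore_hours) if restore_hours else None

print(f"Deployment frequency: {deployment_frequency:.2f}/day")
print(f"Median lead time: {lead_time_hours:.1f} h")
print(f"Change failure rate: {change_failure_rate:.0%}")
```

In a real setup the records would come from your deploy pipeline and incident tracker; the point is only that all four metrics fall out of two timestamps and a failure flag per deploy.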

The four dimensions these frameworks converge on:

| Dimension | What it measures |
| --- | --- |
| Speed | How quickly value flows from idea to production: deployment frequency, lead time, and perceived delivery pace. |
| Effectiveness | How much time engineers spend on productive work versus friction, overhead, and interruptions. Measured through developer experience surveys. |
| Quality | How reliable is what we ship: change failure rate, time to restore service, and the proportion of engineering effort spent on unplanned rework. |
| Impact | Whether engineering work aligns with business outcomes: feature adoption, customer value, and strategic fit. |

Each dimension in isolation produces its own pathology. Together, they create natural checks and balances, and that is what makes them robust enough for the AI transition.

A Practical Five-Metric System for Scaleups

The frameworks above are comprehensive, but not every scaleup has the bandwidth to implement the full DORA or DX Core 4 measurement stack on day one. For engineering leaders who need to start measuring now, a focused system built on five metrics can get you remarkably far:

  1. Diff Throughput (team-level): How consistently teams deliver reviewed, tested changes. Pair with cycle time to detect queue problems. Source: GitHub/GitLab/Bitbucket analytics, tracked per team per sprint.
  2. Remediation Ratio: Percentage of diffs that are remediation work. Rising ratios signal speed at the cost of stability. Source: Tag diffs by type in Git platform and track ratio over time.
  3. Developer Experience Survey: How engineers perceive effectiveness: friction, tooling satisfaction, flow state frequency. Source: Quarterly structured survey via Slack/email/dedicated tools.
  4. AI Adoption: Whether AI is helping and how much. Hours saved, usefulness, adoption patterns across cohorts. Source: Automated micro-surveys via Slack/Teams bots triggered on CR merge.
  5. Revenue Growth: Monthly revenue trend as a business anchor. Tracks whether engineering delivery connects to business outcomes. Source: Monthly revenue data from finance, tracked as trend line.

Diff Throughput: A Useful Signal When Used Right

While individual output metrics are problematic, diff throughput (the number of change requests merged over a given period) can serve as a practical delivery signal with one critical condition: it must be tracked at the team level, never as an individual performance metric. The moment diff throughput becomes a personal KPI, Goodhart's Law reasserts itself and engineers optimize for volume over value.

At the team level, diff throughput is one of the most practical delivery metrics available. It reflects how well the system supports delivery (review speed, CI reliability, deployment friction) and serves as an early warning system: when throughput drops, something in the pipeline is broken. It is easy to collect, requires no manual tagging, and is immediately understandable by both engineers and leadership.
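A minimal sketch of the team-level pairing described above: count merged CRs per team for a sprint, and track median cycle time alongside it. The record fields and sample data are illustrative assumptions; in practice they would come from your Git platform's API.

```python
from collections import Counter
from datetime import datetime
from statistics import median

# Hypothetical merged change requests for one sprint (illustrative data).
merged_crs = [
    {"team": "payments", "opened": datetime(2026, 2, 2, 9),  "merged": datetime(2026, 2, 3, 11)},
    {"team": "payments", "opened": datetime(2026, 2, 3, 14), "merged": datetime(2026, 2, 3, 17)},
    {"team": "payments", "opened": datetime(2026, 2, 4, 10), "merged": datetime(2026, 2, 9, 10)},
    {"team": "growth",   "opened": datetime(2026, 2, 2, 8),  "merged": datetime(2026, 2, 2, 16)},
]

# Diff throughput: merged CRs per team for the sprint (team-level only,
# never attributed to individuals).
throughput = Counter(cr["team"] for cr in merged_crs)

def median_cycle_hours(team: str) -> float:
    """Median open-to-merge duration in hours for one team.

    A rising median with flat throughput suggests work is queuing
    in review rather than the team slowing down.
    """
    hours = [(cr["merged"] - cr["opened"]).total_seconds() / 3600
             for cr in merged_crs if cr["team"] == team]
    return median(hours)

print(dict(throughput))
print(median_cycle_hours("payments"))
```

Reading the two numbers together is what detects queue problems: throughput alone cannot distinguish a fast pipeline from a backlogged one.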

Remediation Ratio: Quality You Can Measure Without Tagging

Most approaches to tracking engineering quality require teams to manually label their work: tagging CRs as "remediation," "feature," or "hotfix" in a project management tool. In practice, this rarely happens consistently. Labels drift, teams forget, and the data becomes unreliable within weeks.

Instead of relying on manual tagging, a more scalable approach is to scan CR titles and commit messages and automatically categorize each diff. The language engineers use in their commit messages is remarkably consistent and classifiable, even without structured labels.

This produces an automated remediation ratio: the percentage of recent diffs that are reactive work rather than planned feature delivery. The absolute number matters less than the trend. A ratio that is steadily climbing tells you the team is spending an increasing share of its energy on reactive work rather than moving forward.

The ratio is powerful because it requires zero process change from engineers. No new labels, no new fields in Jira, no workflow changes. It reads the signals teams are already producing and turns them into a quality indicator that updates continuously.
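The automated categorization described above can be sketched with a simple keyword heuristic over CR titles. The keyword list and sample titles are illustrative assumptions; a production classifier would be tuned against your own commit history.

```python
import re

# Keyword heuristic for spotting remediation work in a CR title.
# The keyword list is an assumption for illustration; tune it to your repos.
REMEDIATION_PATTERN = re.compile(
    r"\b(fix|fixes|fixed|bug|hotfix|revert|rollback|patch|regression)\b",
    re.IGNORECASE,
)

def is_remediation(title: str) -> bool:
    """Classify a single CR title as reactive (remediation) work."""
    return bool(REMEDIATION_PATTERN.search(title))

def remediation_ratio(titles: list[str]) -> float:
    """Share of recent diffs classified as reactive rather than planned work."""
    if not titles:
        return 0.0
    return sum(is_remediation(t) for t in titles) / len(titles)

recent = [
    "Add bulk export endpoint",
    "Fix race condition in webhook retries",
    "Revert 'Enable new cache layer'",
    "Improve onboarding empty states",
]
print(f"{remediation_ratio(recent):.0%}")  # 2 of 4 titles match -> 50%
```

Because the input is text engineers already write, the ratio updates on every merge with no new process; the trend line, not any single reading, is the signal.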

Developer Experience Survey: The Leading Indicator

Quantitative metrics like diff throughput and remediation ratio tell you what is happening, but not why. Developer experience surveys fill that gap. They capture the friction, frustrations, and enablers that system metrics cannot see.

A structured quarterly survey covering dimensions like build times, review bottlenecks, tooling satisfaction, documentation quality, and flow state frequency gives engineering leaders a leading indicator of productivity problems before they surface in delivery metrics. The survey catches the cause while the system metrics only catch the effect.

The key to high-quality survey data is consistency and brevity. A focused set of questions that engineers can complete in under five minutes produces far better signal than an exhaustive annual engagement survey.

Critically, developer experience surveys also serve as a check on the other metrics. A team with rising diff throughput but declining satisfaction scores may be burning out, shipping faster today at the cost of attrition tomorrow.

AI Adoption Metrics: Measuring What AI Actually Changes

Metrics like tokens consumed or seat licenses activated tell you who has access to AI tools, not whether those tools are making a difference. Measuring real impact requires a way to identify which diffs involved AI assistance and a method for cross-referencing that data with delivery and quality metrics.

The most practical way to measure AI's real impact is experience sampling: asking engineers a small set of questions at natural workflow moments, such as when a change request is merged:

  • Did you use AI assistance on this diff?
  • Was the AI assistance useful?
  • Roughly how many hours of work did AI help you save?

These micro-surveys are automated via Slack or Teams bots, triggered by webhook events when a diff is merged. They take seconds to complete and generate high response rates because they meet engineers where they already are. Because each response is tied to a specific CR, the data can be correlated with incident rates, cycle time, and revert frequency to validate whether self-reported savings translate into real improvements.
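Aggregating those per-CR responses is straightforward. This sketch assumes a hypothetical response schema (the field names are not any real bot's API) and computes the three headline numbers: adoption rate, usefulness rate, and average self-reported hours saved.

```python
from statistics import mean

# Hypothetical micro-survey responses, one per merged CR, collected by a
# Slack/Teams bot. Field names and values are illustrative assumptions.
responses = [
    {"cr": 101, "used_ai": True,  "useful": True,  "hours_saved": 2.0},
    {"cr": 102, "used_ai": True,  "useful": False, "hours_saved": 0.0},
    {"cr": 103, "used_ai": False, "useful": None,  "hours_saved": 0.0},
    {"cr": 104, "used_ai": True,  "useful": True,  "hours_saved": 1.5},
]

ai_assisted = [r for r in responses if r["used_ai"]]

# Adoption: share of merged CRs where AI assistance was used at all.
adoption_rate = len(ai_assisted) / len(responses)

# Usefulness: of the AI-assisted CRs, how many found the assistance useful.
usefulness_rate = sum(r["useful"] for r in ai_assisted) / len(ai_assisted)

# Self-reported time savings per AI-assisted CR.
avg_hours_saved = mean(r["hours_saved"] for r in ai_assisted)

print(f"Adoption: {adoption_rate:.0%}, useful: {usefulness_rate:.0%}, "
      f"avg hours saved per AI-assisted CR: {avg_hours_saved:.2f}")
```

Because each response carries a CR identifier, the same records can later be joined against cycle time, incident, and revert data to sanity-check the self-reported savings.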

Revenue Growth: The Business Anchor

Revenue growth, not revenue per engineer, is the right business signal for scaleups. Per-engineer ratios are seductive because they promise a clean efficiency number, but they punish growing teams by design. A declining revenue-per-engineer figure during a growth sprint is not a sign of inefficiency. It is the expected shape of investment.

Tracking monthly revenue growth alongside the four engineering metrics anchors the system to business outcomes. It answers a question that pure engineering metrics cannot: is the work we are shipping connecting to the results the business needs?

Revenue growth is not an engineering metric per se. It is the business context that gives the other four metrics meaning. Without it, an engineering organization can optimize its own scoreboard while the company stalls.

Tying It All Together

Together, these five metrics span the dimensions that matter: delivery flow (diff throughput), quality (remediation ratio), developer experience (surveys), AI adoption (experience sampling), and business impact (revenue growth). They are lightweight enough to implement in weeks, not quarters. They resist gaming because they combine system-generated data with self-reported perceptual data and anchor both to a business outcome.

AI adoption is the cross-cutting lens across the entire system. Every metric should be examined through the split between AI-assisted and non-assisted work. Slicing each metric by AI cohort turns the five-metric system from a static scorecard into a learning engine for the AI transition.
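Slicing by AI cohort might look like this in its simplest form: compare a metric (here, cycle time) between CRs whose authors reported AI assistance and those who did not. The data is illustrative; in practice the flag comes from the merge-time micro-survey.

```python
from statistics import median

# Cycle times (hours) per merged CR, tagged by whether the engineer reported
# AI assistance in the merge-time micro-survey. Values are illustrative.
crs = [
    {"ai": True,  "cycle_h": 10}, {"ai": True,  "cycle_h": 14},
    {"ai": True,  "cycle_h": 8},  {"ai": False, "cycle_h": 20},
    {"ai": False, "cycle_h": 26}, {"ai": False, "cycle_h": 18},
]

def cohort_median(ai_flag: bool) -> float:
    """Median cycle time for one cohort (AI-assisted or not)."""
    return median(c["cycle_h"] for c in crs if c["ai"] == ai_flag)

assisted, unassisted = cohort_median(True), cohort_median(False)
print(f"Median cycle time: AI-assisted {assisted} h vs non-assisted {unassisted} h")
```

The same split applies to remediation ratio and survey scores: a cohort gap that holds up across metrics is far stronger evidence than any single self-reported number.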

Conclusion

The AI era has made the engineering productivity question both more urgent and more dangerous. More urgent because boards and investors want to see returns on AI tooling investments and growing engineering headcount. More dangerous because AI makes it trivially easy to inflate the output metrics that were already misleading.

The frameworks developed by DORA, SPACE, and DX Core 4 offer the right conceptual foundation: measuring speed, effectiveness, quality, and impact as complementary dimensions rather than collapsing productivity into a single number.

For scaleups that need to act now, the five-metric system provides a practical starting point that covers the essential dimensions without requiring a dedicated metrics team to maintain.

The organizations that get this right will not just measure productivity in the AI era. They will build the feedback loops that let them systematically improve it.

Interested in measuring AI's real impact on your engineering team?

Book a discovery call