The METR Study Says AI Slows Development by 19%

When the METR study results came out in July, they brought a shocking statistic with them. Experienced developers using AI tools took 19% longer to complete their tasks compared to working without assistance. The same developers had predicted AI would make them 24% faster. Even after experiencing the actual slowdown, they estimated they had been 20% more productive with AI.

The gap between perception and measured reality wasn’t a minor miscalibration. It was a complete disconnect.

Having worked in development and IT for over 20 years, with the past decade in senior and lead developer roles, I don’t find this pattern entirely surprising. I’ve watched engineers get genuinely excited about productivity tools and languages that promised to revolutionize how we work. What struck me about the METR findings was a familiar question: how can any of us accurately assess whether new tools actually improve our effectiveness, or whether we’re simply adapting our work patterns in ways that feel more efficient?

The Measurement Problem

Through a decade of working within and leading development teams, I’ve observed that we consistently struggle with measuring what actually drives productivity. We track story points, which represent relative estimates rather than absolute measures. We count commits as if frequency equates to value. We survey developers about their satisfaction with tools and treat subjective responses as productivity data.

The challenge runs deeper than poor metrics. Human perception of personal productivity tends to be remarkably unreliable, particularly when we’re adapting to new workflows or tools that change how our work feels.

The METR researchers conducted the type of rigorous study we rarely see in software development: a randomized controlled trial with proper experimental design and measurement protocols.

What the METR Study Actually Did

Unlike vendor-sponsored surveys or anecdotal reports, the METR researchers used a randomized controlled trial (RCT) — the gold standard for measuring cause and effect. Here’s what made their approach so rigorous:

Real developers, real tasks. They recruited 16 experienced developers from large open-source repositories (averaging 22,000+ stars and 1M+ lines of code). These weren’t junior developers learning to code; they were seasoned professionals with an average of 5 years contributing to mature projects.

Actual work, not synthetic benchmarks. Each developer provided a list of real issues from their backlog: bug fixes, features, and refactors they would normally complete. In total, 246 tasks were included in the study.

Random assignment. Each task was randomly assigned to either allow AI tools (primarily Cursor Pro with Claude 3.5/3.7 Sonnet) or prohibit their use. This eliminated selection bias where developers might choose easier tasks for AI assistance.

Multiple measurement approaches. Developers self-reported completion times and recorded their screens. The researchers manually analyzed 128 screen recordings totaling 143 hours of video to understand how time was actually spent.

Prediction tracking. Before each task, developers estimated how much faster they expected AI to make them. This captured the critical perception vs. reality gap.

The result? When AI was allowed, average task completion time increased by 19%. But the developers predicted a 24% improvement and, even after experiencing the slowdown, estimated they had been 20% faster.
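
To make the size of that gap concrete, here is a quick back-of-the-envelope calculation using the reported figures. Treating each percentage as a change in task completion time is my own simplification for illustration, not a detail taken from the paper:

```python
# Normalize completion time without AI to 1.0 and apply the reported percentages.
baseline  = 1.00
predicted = baseline * (1 - 0.24)  # expected beforehand: 24% faster -> 0.76
perceived = baseline * (1 - 0.20)  # felt afterwards:     20% faster -> 0.80
measured  = baseline * (1 + 0.19)  # actually measured:   19% slower -> 1.19

print(f"predicted {predicted:.2f}, perceived {perceived:.2f}, measured {measured:.2f}")
# Tasks took roughly 1.19 / 0.80, or about 1.5 times as long as the developers believed they did.
```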

Why This Matters More Than You Think

The challenge isn’t that developers are dishonest or that AI tools are inherently bad. The challenge is that productivity is incredibly difficult to assess subjectively, especially when you’re in the middle of using a new tool.

Consider what the METR researchers found when they analyzed those screen recordings. With AI tools, developers spent less time actively coding and searching for information, but more time:

  • Prompting AI systems and waiting for responses
  • Reviewing and testing AI-generated code
  • Debugging issues in AI-produced solutions

The developers felt more productive because they were writing less code manually. But the total time from task start to completion actually increased.

This matches what I’ve observed over the years with various “game-changing” tools. Teams often feel more productive because the nature of their work changes — they’re doing different activities that seem more efficient — but the end-to-end delivery time doesn’t actually improve.

How to Implement the METR Methodology in Your Team

You don’t need a research team to apply these measurement principles. Here’s how to adapt the METR approach for your own team:

Step 1: Establish Your Baseline

Before implementing any new tool, measure your current state. This is where most teams fail — they start measuring after they’ve already adopted the tool, making comparison impossible.

Track these metrics for 2–4 weeks of normal work:

  • Task completion time for similar types of work (bugs, features, refactors)
  • Code review iterations required before approval
  • Bug rates in production for newly shipped code
  • Time from commit to deployment for your delivery pipeline

Document the types of tasks you’re measuring. The METR study worked because they used real backlog items, not artificial benchmarks.
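
As a starting point, here is a minimal sketch of what baseline tracking could look like, assuming you export completed tasks to a CSV with hypothetical columns such as task_type, started_at, and completed_at (adapt the names to whatever your tracker actually produces):

```python
import csv
from collections import defaultdict
from datetime import datetime

def load_baseline(path: str) -> dict:
    """Average completion time in hours per task type, from an exported CSV."""
    totals, counts = defaultdict(float), defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            started = datetime.fromisoformat(row["started_at"])
            completed = datetime.fromisoformat(row["completed_at"])
            totals[row["task_type"]] += (completed - started).total_seconds() / 3600
            counts[row["task_type"]] += 1
    return {task_type: totals[task_type] / counts[task_type] for task_type in totals}

# Illustrative usage (file name and output are hypothetical):
# load_baseline("baseline_tasks.csv")  -> {'bug': 6.2, 'feature': 14.8, 'refactor': 9.1}
```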

Step 2: Random Assignment (When Possible)

This is the hardest part to implement in a real team, but try to approximate it. When you introduce AI tools:

  • Have some developers use the tools, others continue without them (at least initially)
  • For developers using AI, designate some tasks as “AI-assisted” and others as “traditional”
  • If possible, assign similar tasks to both groups to enable comparison

The key is avoiding selection bias where AI tools only get used on the easiest or hardest tasks.
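
One lightweight way to approximate random assignment is to fix each task’s condition before anyone has sized it up. A minimal sketch, assuming a plain list of backlog IDs; the fixed seed and the 50/50 split are illustrative choices rather than anything the METR paper prescribes:

```python
import random

def assign_conditions(task_ids: list[str], seed: int = 42) -> dict[str, str]:
    """Randomly assign each backlog task to an approach before work starts."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible and auditable
    return {task_id: rng.choice(["ai-assisted", "traditional"]) for task_id in task_ids}

assignments = assign_conditions(["PROJ-101", "PROJ-102", "PROJ-103", "PROJ-104"])  # placeholder IDs
print(assignments)
```

Record the assignment on the ticket itself so it stays visible while the work is done and when you analyze the results later.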

Step 3: Collect Predictions

Before each major task or sprint, ask developers to estimate:

  • How long they expect the task to take with their assigned approach
  • How confident they are in the estimate
  • What factors might affect the timeline

This captures the crucial perception data that the METR study showed to be so inaccurate.
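
The prediction step only needs a small record per task. A sketch of one possible shape; the field names are mine, not the study’s:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TaskPrediction:
    task_id: str
    condition: str          # "ai-assisted" or "traditional"
    estimated_hours: float  # how long the developer expects the task to take
    confidence: str         # e.g. "low" / "medium" / "high"
    risk_factors: str       # free text: what might affect the timeline

prediction = TaskPrediction("PROJ-101", "ai-assisted", 6.0, "medium", "unfamiliar module")
with open("predictions.jsonl", "a") as f:  # append one JSON line per prediction
    f.write(json.dumps(asdict(prediction)) + "\n")
```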

Step 4: Track Multiple Data Points

Task completion time is important, but it’s not everything. Also measure:

Code quality indicators:

  • Bug reports within 30 days of deployment
  • Number of review iterations required
  • Complexity metrics (if you track them)

Delivery success:

  • Deployment success rate
  • Rollback frequency
  • Time spent on hotfixes

Process indicators:

  • Time spent in different activities (coding vs. debugging vs. research)
  • Learning curve progression for new tools
  • Cross-team collaboration efficiency
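
If completed tasks are logged with their assigned condition, the per-condition comparison reduces to a few lines of aggregation. A sketch, assuming a list of dictionaries with hypothetical keys such as condition, hours, review_iterations, and bugs_within_30_days:

```python
from collections import defaultdict
from statistics import mean

def summarize(tasks: list[dict]) -> dict:
    """Average key indicators per condition ('ai-assisted' vs 'traditional')."""
    by_condition = defaultdict(list)
    for task in tasks:
        by_condition[task["condition"]].append(task)
    return {
        condition: {
            "avg_hours": round(mean(t["hours"] for t in items), 1),
            "avg_review_iterations": round(mean(t["review_iterations"] for t in items), 1),
            "avg_bugs_30d": round(mean(t["bugs_within_30_days"] for t in items), 2),
        }
        for condition, items in by_condition.items()
    }
```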

Step 5: Compare Predictions to Reality

After 4–6 weeks of measurement, analyze the data:

  • How did predicted completion times compare to actual times?
  • What patterns emerge in code quality metrics?
  • Are there differences between developers or task types?
  • How has the overall delivery pipeline been affected?
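
Once predictions and actuals sit side by side, the comparison itself is simple arithmetic. A sketch with illustrative field names, assuming each completed task carries the original estimate and the measured time:

```python
from statistics import mean

def prediction_gap(tasks: list[dict]) -> dict:
    """Ratio of actual to estimated hours per condition; 1.0 means the estimates were spot on."""
    gap = {}
    for condition in {t["condition"] for t in tasks}:
        subset = [t for t in tasks if t["condition"] == condition]
        ratios = [t["actual_hours"] / t["estimated_hours"] for t in subset]
        gap[condition] = {"actual_vs_estimate": round(mean(ratios), 2), "tasks": len(subset)}
    return gap

# Illustrative output only, not real data:
# {'ai-assisted': {'actual_vs_estimate': 1.22, 'tasks': 18},
#  'traditional': {'actual_vs_estimate': 1.05, 'tasks': 16}}
```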

Most importantly, share these findings with the team. The METR study’s most valuable insight wasn’t just the 19% slowdown — it was revealing how poor developer self-assessment can be.

Implementation Tips

You don’t need giant, convoluted measurement processes that bog down your development team. I suggest the following:

Start small. Don’t try to measure everything at once. Pick 2–3 key metrics and get good at tracking those consistently.

Make it automatic. Manual time tracking doesn’t work long-term. Use your existing tools (JIRA, GitHub, deployment pipelines) to capture data automatically where possible.
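
As one example of automatic capture, much of the delivery-side data can come from tools you already use. Here is a sketch that pulls pull-request cycle times from the GitHub REST API using the requests library; the owner and repo values are placeholders, and real use would need authentication and pagination:

```python
from datetime import datetime

import requests

def pr_cycle_times(owner: str, repo: str) -> list[float]:
    """Hours from pull-request creation to merge for recently closed PRs."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls"
    response = requests.get(url, params={"state": "closed", "per_page": 50})
    response.raise_for_status()
    hours = []
    for pr in response.json():
        if pr.get("merged_at"):  # skip PRs that were closed without merging
            created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
            merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
            hours.append((merged - created).total_seconds() / 3600)
    return hours

print(pr_cycle_times("your-org", "your-repo"))  # placeholder owner/repo
```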

Focus on teams, not individuals. Frame measurement as team improvement, not individual performance evaluation. The METR study worked because it focused on aggregate patterns, not ranking developers.

Expect resistance. Developers often resist being measured. Be transparent about what you’re measuring and why. Share results openly and use them for process improvement, not performance reviews.

Account for learning curves. New tools almost always slow things down initially. Plan for 4–8 weeks of reduced productivity as teams learn new workflows.

What This Means for Your AI Tool Strategy

The METR study doesn’t mean AI tools are bad or that teams shouldn’t adopt them. It means we need to be much more rigorous about measuring their impact and honest about the results.

Some teams may find AI tools genuinely helpful. Others may discover, like the METR study participants, that the tools slow them down despite feeling productive. Both outcomes are valuable if they’re based on actual data rather than perception.

What matters is having the measurement framework to tell the difference.

Your team deserves better than guesswork when it comes to productivity tools. Whether you’re evaluating AI assistants, deployment automation, or the next “revolutionary” development platform, the principles remain the same: measure carefully, compare objectively, and trust data over feelings.

The METR researchers have given us a proven methodology. The question is whether we’re disciplined enough to use it.

What’s one productivity metric your team should measure but doesn’t? I’d love to hear about your measurement challenges and successes in the comments.

If you found this helpful, I’m working on more practical frameworks for measuring development team effectiveness. Follow for updates on implementation guides and measurement templates.

References:

* METR. “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” July 10, 2025. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

By Ben
