Research · March 29, 2026 · 7 min read

The AI Coding Benchmark War Has a Legitimacy Crisis. Here's the Data.

Claude Opus 4.5 scores 80.9% on SWE-Bench Verified but only 45.9% on SWE-Bench Pro, a 35-point gap. OpenAI has abandoned Verified: 59.4% of its hardest problems have flawed tests. The benchmark everyone cites is broken.

Published by GitIntel Research


What SWE-Bench Pro Actually Shows

The Pro leaderboard tells a different story than Verified. The headline: no model is reliably above 57% on real, unseen engineering problems. We are not at "human-level software engineering."

The most interesting data point is GPT-5.3-Codex's relative performance. It scores 56.8% on Pro, the highest of any model, despite scoring 80.0% on Verified. That 23-point gap is the smallest delta in the top 6. OpenAI's decision to abandon Verified and champion Pro may have been a strategic bet on a benchmark where its model looks relatively stronger, but it was also a genuine data-integrity call.

KEY FINDING

The best model on SWE-Bench Pro (GPT-5.3-Codex at 56.8%) is 24 points lower than the best model on SWE-Bench Verified (Claude Opus 4.5 at 80.9%). You can pick your leaderboard and your winner. That's a problem.

For teams building on top of these models — using them for automated code review, bug fixing, refactoring — the practical implication is real. If you've made architectural decisions based on a model's Verified score, you may have significantly overestimated its capability on your actual codebase.

The cost-efficiency picture shifts too. DeepSeek V3.2-Exp runs at roughly $1.30 per SWE-Bench run — the cheapest in the top tier. On Verified it scores 74.3%, making it look like a near-peer to Claude Opus 4.5. On Pro it scores 44.9%, 12 points behind GPT-5.3-Codex. Different budget decision.
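To make that budget shift concrete, here is a back-of-envelope sketch: divide the per-run cost by the resolve rate to get an effective cost per solved task. It assumes the article's $1.30/run figure applies equally to both benchmarks, which is a simplification.

```python
def cost_per_solved(run_cost_usd: float, resolve_rate: float) -> float:
    """Effective cost per solved task: run cost spread over the fraction solved."""
    return run_cost_usd / resolve_rate

# DeepSeek V3.2-Exp, using the article's figures
on_verified = cost_per_solved(1.30, 0.743)  # ~$1.75 per solved task
on_pro = cost_per_solved(1.30, 0.449)       # ~$2.90 per solved task
```

The model is still cheap either way, but its effective cost per solved task rises by roughly 65% when you score it on Pro instead of Verified.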

The Open-Weight Wildcard

Two fully open models cracked the top 5 on SWE-Bench Verified: MiniMax M2.5 at 80.2% and DeepSeek V3.2-Exp at 74.3%. Both publicly available on Hugging Face, both runnable on-premises.

NVIDIA's Nemotron 3 Super (120B parameters, 12B active per token, Mixture-of-Experts architecture) scored 60.47% on SWE-Bench Verified in March 2026. Pre-trained on 25 trillion tokens with a 1-million-token context window, it is available free on Hugging Face.

# Open-weight top performers (SWE-Bench Verified, March 2026)
MiniMax M2.5              80.2%   # closes gap with Claude Opus 4.5
DeepSeek V3.2-Exp         74.3%   # $1.30/run, cheapest in top tier
NVIDIA Nemotron 3 Super   60.5%   # 120B/12B-active MoE, full weights public

# Same models on SWE-Bench Pro (Scale Labs eval)
MiniMax M2.5              43.8%   # −36.4pp delta, biggest drop in top 6
DeepSeek V3.2-Exp         44.9%   # −29.4pp
NVIDIA Nemotron 3 Super   ~38%    # estimated, Pro eval in progress

The open-weight models show the largest Verified→Pro drops, suggesting their Verified scores have benefited most from benchmark optimization. That said, a fully open 120B model at ~38% on real engineering tasks is still remarkable infrastructure for teams who can't send code to external APIs.
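The Verified→Pro deltas in the comments above are simple subtractions; spelling them out makes the comparison reproducible. The score pairs are taken from this article.

```python
# (Verified, Pro) scores in percentage points, from the tables above
scores = {
    "Claude Opus 4.5":   (80.9, 45.9),
    "GPT-5.3-Codex":     (80.0, 56.8),
    "MiniMax M2.5":      (80.2, 43.8),
    "DeepSeek V3.2-Exp": (74.3, 44.9),
}

# Verified-to-Pro drop per model, in percentage points
deltas = {model: round(v - p, 1) for model, (v, p) in scores.items()}
```

Sorting by delta puts GPT-5.3-Codex (23.2pp) at the honest end and MiniMax M2.5 (36.4pp) at the other.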

What This Means If You're Building on AI Coding Tools

Don't trust Verified scores. Any model trained after 2024 that claims 80%+ on SWE-Bench Verified may have contaminated training data. The score is a marketing number, not a capability measure.

Demand Pro scores. SWE-Bench Pro uses post-cutoff tasks across Go, TypeScript, and JavaScript — not just Python. If a lab only reports Verified and not Pro, ask why. OpenAI's abandonment of Verified should set a precedent.

Benchmark on your own codebase. Neither Verified nor Pro measures how well a model handles your stack, your conventions, your domain. The only benchmark that matters for your team is the one you run on your actual repos. GitIntel scans your commit history; your own A/B tests on real tasks are the only reliable signal for adoption decisions.
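One way to keep a private eval honest is to build it only from tasks filed after the model's training cutoff, the same principle Pro applies. A minimal sketch, where the task records and field names are hypothetical stand-ins for whatever your issue tracker or git history provides:

```python
from datetime import date

def post_cutoff_tasks(tasks, cutoff):
    """Keep only tasks filed after the model's training cutoff,
    so the eval can't be answered from memorized public data."""
    return [t for t in tasks if t["created"] > cutoff]

# Hypothetical records mined from your own issue tracker or commit history
tasks = [
    {"id": "BUG-101", "created": date(2025, 6, 1)},
    {"id": "BUG-202", "created": date(2026, 2, 14)},
]

fresh = post_cutoff_tasks(tasks, cutoff=date(2025, 11, 1))
# only BUG-202 survives the filter
```

Re-run the filter whenever you evaluate a newer model; a task that was post-cutoff last quarter may not be anymore.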

Watch the delta, not the score. A model with an 80% Verified score and a 46% Pro score (34pp delta) has been heavily optimized for the benchmark. A model with 80% Verified and 57% Pro (23pp delta) is closer to its advertised capability. Smaller deltas indicate less benchmark gaming.
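That rule of thumb can be written down as a one-line heuristic. The 30pp threshold is an illustrative assumption drawn from the deltas in this article, not an established cutoff:

```python
def likely_overfit(verified: float, pro: float, threshold_pp: float = 30.0) -> bool:
    """Heuristic: a Verified-to-Pro drop above ~30pp (illustrative threshold)
    suggests the model was tuned to the older benchmark."""
    return (verified - pro) > threshold_pp

likely_overfit(80.9, 45.9)  # Claude Opus 4.5, 35.0pp drop -> True
likely_overfit(80.0, 56.8)  # GPT-5.3-Codex, 23.2pp drop -> False
```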

What Comes After Benchmarks

The deeper problem is structural: any public benchmark becomes contaminated over time as it enters training data. SWE-Bench Pro will face the same fate SWE-Bench Verified did — possibly faster, given how rapidly model training cycles run.

The research community is moving toward private held-out test sets and continuous fresh-problem generation. But until that infrastructure exists at scale, the benchmark war will keep producing impressive-sounding numbers that don't reflect engineering reality.

The honest answer to "which AI writes the best code?" in March 2026 is: we don't have a reliable, contamination-free way to answer that question. We have 77 models reporting numbers on a benchmark that has been compromised, and a newer benchmark whose 35-point gaps suggest the old numbers were fictional. Track the AI commits in your own git history; that's the only data that reflects your actual situation.

Measure AI code in your own repos

SWE-Bench can't tell you how much of your codebase is AI-generated. GitIntel can.

# Install GitIntel
curl -fsSL https://gitintel.com/install.sh | sh

# Scan your repo
cd your-repo && gitintel scan

View on GitHub

Open source (MIT) · Local-first · No data leaves your machine

Benchmark data sourced from llm-stats.com and Scale Labs leaderboard. All scores as of March 2026. SWE-Bench Pro scores use Scale AI's standardized scaffold for fair comparison.

