Research · March 29, 2026 · 7 min read

AI Writes Your Tests. Coverage Goes Up 40%. Bugs Stay.

AI tools boost test coverage 40% and cut test-writing time 50%, but defect rates haven't moved. We look at 1.75x more logic errors in AI code, AI-generated tests that validate the wrong behavior, and the 'coverage theater' pattern showing up in production postmortems.

Published by GitIntel Research

TLDR

The headline gains (40% more coverage, 50% less test-writing time) are why GitHub shipped Copilot Testing as generally available in Visual Studio this quarter, why Cursor added one-click test generation in v0.44, and why Claude Code's /test command has become one of its most-used features. The speed improvement alone justifies adoption for routine test scaffolding.

The Numbers That Don't Move

Now look at the other side of the ledger. AI-generated code — the code those tests are being written for — has measurably different defect characteristics than human code:

| Metric | Human Code | AI-Assisted Code |
| --- | --- | --- |
| Logic errors | baseline | 1.75x baseline |

The critical detail: when an AI writes both the implementation and the tests for that implementation in the same session, it is testing its own understanding of the problem — not the actual requirements. If the AI misunderstood the spec, both the code and the tests will reflect that misunderstanding. They'll pass each other perfectly.

COVERAGE TRAP PATTERN

Developer asks AI to implement a discount calculation function with a 10% cap for loyalty customers. AI misreads "cap" as a minimum floor, not a maximum ceiling. AI writes the function. AI writes tests. Tests assert the floored behavior. CI passes. Coverage: 94%. Production bills loyalty customers incorrectly for months.
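The trap is easy to reproduce in miniature. A hedged sketch of the scenario above, where `loyalty_discount` and its test are hypothetical: the spec says 10% is a maximum, the AI implements a minimum, and the AI-written test asserts the same mistake, so CI stays green.

```python
# Illustrative only: the spec said a 10% discount *cap* (maximum) for
# loyalty customers; the AI read "cap" as a *floor* (minimum) and then
# wrote a test that encodes the same misreading.

def loyalty_discount(order_total: float, requested_pct: float) -> float:
    """Return the discount amount. BUG: treats the 10% cap as a floor."""
    pct = max(requested_pct, 0.10)  # spec intended: min(requested_pct, 0.10)
    return order_total * pct

def test_loyalty_discount():
    # Asserts the floored behavior, so it passes against the buggy code.
    # Per the spec, a 5% request should yield 5.0, not 10.0.
    assert loyalty_discount(100.0, 0.05) == 10.0

test_loyalty_discount()  # green in CI; wrong in production
```

Line coverage of the function is 100%, and the suite proves nothing about the requirement.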

What AI-Generated Tests Systematically Miss

After analyzing AI-generated test suites across open source repos and internal codebases, we found four recurring gaps:

1. Boundary conditions derived from business logic

AI tests cover the happy path and obvious edge cases (null, empty string, zero). They miss boundaries that only make sense in context: "orders placed after 5 PM on Fridays route to the overflow queue" isn't in the function signature. A human writing tests would have read the Jira ticket. The AI read the function.
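That Friday-cutoff rule can be pinned with a few lines, assuming a hypothetical `route_order` helper (the function, the queue names, and the dates are illustrative, not from any real codebase):

```python
from datetime import datetime

# Sketch of the routing rule from the text: orders placed after 5 PM on
# Fridays go to the overflow queue. Nothing in the signature hints at this.

def route_order(placed_at: datetime) -> str:
    # weekday() == 4 is Friday; the cutoff is 17:00 local time
    if placed_at.weekday() == 4 and placed_at.hour >= 17:
        return "overflow"
    return "standard"

# The boundary a human derives from the ticket, not the function signature:
assert route_order(datetime(2026, 3, 27, 17, 0)) == "overflow"   # Friday 5:00 PM
assert route_order(datetime(2026, 3, 27, 16, 59)) == "standard"  # Friday 4:59 PM
```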

2. Race conditions and state mutation

AI-generated unit tests almost universally run synchronously against isolated functions. They don't model concurrent access, cache invalidation timing, or external state that changes mid-operation. These failure modes require tests that are hard for LLMs to reason about because they require understanding the full system topology.
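A minimal sketch of what a concurrency-aware test looks like, using an illustrative `Counter` class (not from the article): the point is that the assertion holds under contention, not just for a single synchronous call.

```python
import threading

# A read-modify-write counter. Without the lock, concurrent increments can
# interleave and lose updates -- a failure mode no synchronous unit test hits.

class Counter:
    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:  # remove this lock and the invariant below can break
            self.value += 1

def hammer(counter: Counter, n_threads: int = 4, n_incr: int = 5000) -> int:
    """Drive the counter from several threads at once and return the total."""
    threads = [
        threading.Thread(target=lambda: [counter.increment() for _ in range(n_incr)])
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

# The test asserts the invariant under contention, not after one call:
assert hammer(Counter()) == 4 * 5000
```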

3. Regression tests for bugs that don't exist yet

The most valuable human-written tests often come from past incidents: "we had this bug in March, here's the test that proves it's fixed." AI has no access to your incident history. It writes tests for a clean theoretical version of your system.

4. Cross-module behavioral contracts

AI excels at unit tests. It struggles with integration tests that verify the behavioral contract between two modules written by different teams at different times. These are the tests that catch the subtle mismatches that cause production incidents — and they require broader system context than AI typically receives.
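A contract test in miniature, with all names illustrative: one module serializes, another (written separately) parses, and the test pins the round-trip agreement between them rather than either module in isolation.

```python
import json

def serialize_event(kind: str, payload: dict) -> str:
    """'Module A': emits events for downstream consumers."""
    return json.dumps({"kind": kind, "payload": payload})

def parse_event(raw: str) -> tuple:
    """'Module B': written by another team, reads A's events."""
    doc = json.loads(raw)
    return doc["kind"], doc["payload"]

# The behavioral contract: whatever A emits, B must read back unchanged.
kind, payload = parse_event(serialize_event("order.created", {"id": 7}))
assert (kind, payload) == ("order.created", {"id": 7})
```

A unit test of either function alone would pass even if A renamed a field B depends on; only the round-trip test catches the mismatch.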

What We See in Commit History

We scanned test file patterns across repositories with high AI commit rates. The signal is consistent: AI-assisted commits add test files at higher rates, but the test-to-implementation ratio — lines of test per lines of code — stays flat or drops.

# Use GitIntel to measure test coverage trends alongside AI adoption
gitintel scan --format json | jq '
.commits[] | select(.ai_assisted == true and .source_files_changed > 0) |
{sha: .sha, test_files_changed: .test_files_changed,
source_files_changed: .source_files_changed,
test_ratio: (.test_files_changed / .source_files_changed)}
'

In repos where AI commit rates exceed 15%, we consistently see the same cluster of signals: rising line coverage, more test files added per commit, and a flat or declining test-to-implementation ratio.

None of these are disqualifying on their own. But together, they describe a test suite that looks healthy from the outside and provides less protection than its metrics suggest.

What Actually Works

The answer isn't to stop using AI for test generation. The 50% speed improvement on boilerplate test scaffolding is real and valuable. The answer is to be precise about what AI tests are good for and what they're not:

Use AI for: happy path scaffolding

Let AI generate the boilerplate — parameterized happy paths, null/empty inputs, type conformance checks. These are mechanical and AI does them well.
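This is the shape of test worth delegating, sketched with a hypothetical `slugify` function: a table of inputs and expected outputs, mechanically expanded.

```python
# Parameterized happy paths and trivial edge cases: the scaffolding AI
# generates well. `slugify` and its cases are illustrative.

def slugify(title: str) -> str:
    return "-".join(title.lower().split())

CASES = [
    ("Hello World", "hello-world"),       # happy path
    ("", ""),                             # empty input
    ("  Spaced   Out  ", "spaced-out"),   # whitespace normalization
]

for title, expected in CASES:
    assert slugify(title) == expected
```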

Write yourself: business logic edge cases

Any test that requires knowing your requirements document, your incident history, or your domain context should be human-authored. These are the tests that actually prevent regressions.

Measure differently: behavior coverage, not line coverage

Track which user stories or requirements have test coverage, not which lines. A function can be 100% line-covered by a test that asserts the wrong outcome. Mutation testing tools (Mutmut for Python, cargo-mutants for Rust) catch this — they check whether your tests actually fail when the behavior changes.
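Here is that failure in miniature, with illustrative names: a test that covers every line of `apply_cap` while asserting the wrong thing, next to one that would actually kill a `min`-to-`max` mutant.

```python
def apply_cap(value: float, cap: float) -> float:
    return min(value, cap)

def weak_test() -> None:
    result = apply_cap(5.0, 10.0)      # executes 100% of the lines...
    assert isinstance(result, float)   # ...while pinning nothing about the cap

def strong_test() -> None:
    # This assertion fails if min is mutated to max: 15.0 != 10.0
    assert apply_cap(15.0, 10.0) == 10.0

weak_test()    # survives the mutation; coverage says this suite is fine
strong_test()  # kills it; this is what mutation testing measures
```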

Enforce in review: flag AI-only test suites

When a PR has AI-generated code and AI-generated tests with no human-written test additions, that's a review flag. The PR hasn't been validated against requirements — it's been validated against itself.

# Mutation testing with cargo-mutants (Rust)
cargo mutants --timeout 30

# Output: lists mutations that survived (tests didn't catch them)
# 14 mutations survived out of 847 tested
# These are the gaps AI coverage doesn't show you

The Bigger Pattern

The coverage trap is a specific instance of a more general dynamic playing out across AI-assisted development: metrics that were designed to measure quality become easier to inflate, which makes them less reliable as quality signals.

Line coverage was already a flawed proxy for test quality before AI. AI has made it a worse proxy by making it cheaper to achieve. The same is true for PR count (AI makes it easier to open PRs, so PR count stops signaling velocity), commit frequency, and even documentation coverage.

Engineering organizations that thrive in the AI era will be those that understand which metrics AI can game and invest in the ones it can't: production incident rates, mean time to recovery, user-reported defects per feature, and rollback frequency. These are outcomes that require the code to actually work, not just look like it works.

KEY PRINCIPLE

If an AI can increase your metric in 30 seconds without improving the underlying thing the metric was designed to measure, you need a different metric. Coverage is now in that category.

Track AI test patterns in your repo

GitIntel scans your git history and surfaces which commits include AI-generated tests alongside AI-generated code — the pattern most associated with coverage theater.

# Install GitIntel
curl -fsSL https://gitintel.com/install.sh | sh

# Scan with test attribution
cd your-repo
gitintel scan --show-tests

View on GitHub

Open source (MIT) · Local-first · No data leaves your machine

Data compiled March 2026. Sources: CloudQA 2026 Testing Trends, CodeRabbit 13M PR analysis, GitClear 153M line study, Testomat.io AI unit test guide.

