AI Writes Your JSON. 1 in 4 Times, It's Wrong. ICLR 2026 Has the Data.
University of Waterloo's StructEval tested 12 AI models on 2,035 structured output tasks. GPT-4o tops out at 76%. Text→Mermaid collapses to 18.9%. The format-specific failure crisis is now peer-reviewed.
Published by GitIntel Research
TLDR
- University of Waterloo's StructEval (ICLR 2026) ran 12 AI models through 2,035 structured output tasks across 18 formats
- Best commercial models (GPT-4o, GPT-4.1-mini, o1-mini) top out at ~76% — failing roughly 1 in 4 tasks
- Hardest task: Text→Mermaid averages 18.9% across all models. Matplotlib→TikZ: 28.4%
- Open-source best (Qwen3-4B): 67%. Weakest (Phi-3-mini): 40.8%
- No Claude model was evaluated — the study focused on GPT, Gemini, Qwen, Llama, and Phi families
StructEval-V
Visually rendered structures — code that must be rendered to produce a correct result.
Formats evaluated: HTML, React, SVG, Matplotlib, Mermaid, TikZ.
Each model got one shot — no agent loop, no self-correction, no feedback. The evaluation combined automated VQA scoring (via GPT-4.1-mini as judge) with human annotation on a 397-sample quality check, which confirmed 88.92% of automated assessments were fair.
This is single-shot structured output performance. Not a multi-turn agent. Not a chain-of-thought scaffold. It is the most common real-world scenario: a developer asks "generate a YAML config for me," or a CI pipeline runs an AI-assisted code transformation.
Model Scores: The Full Picture
| Model | Vendor | Type | Avg. Score |
| --- | --- | --- | --- |
| GPT-4o | OpenAI | Commercial | ~76% |
| GPT-4.1-mini | OpenAI | Commercial | ~76% |
| o1-mini | OpenAI | Commercial | ~76% |
| Qwen3-4B | Alibaba | Open-source | 67% |
| Phi-3-mini | Microsoft | Open-source | 40.8% |

(Only the scores cited in this article are shown; the paper reports all 12 models.)
Source: StructEval (arXiv:2505.20139), TMLR January 2026 / ICLR 2026. Scores represent average across all tasks and formats.
THE COMMERCIAL VS. OPEN-SOURCE GAP
Best commercial models plateau around 75–76%. The best open-source model (Qwen3-4B) reaches 67%. That's a roughly 10-percentage-point gap, or about one extra failure in every 10 tasks. Not nothing, but also not the order-of-magnitude advantage the marketing materials imply.
Where Every Model Collapses
The aggregate numbers hide the most important story. Three task types break every model tested, regardless of tier:
| Task | Category | Avg. Score (all models) |
| --- | --- | --- |
| Text→Mermaid | Generation | 18.9% |
| Matplotlib→TikZ | Conversion | 28.4% |
| Text→TOML | Generation | ~33% |
Text→Mermaid at 18.9% is not a rounding error. It means that across all 12 models tested, only about 1 in 5 attempts to generate a Mermaid diagram from a natural language description succeeded. GPT-4o, o1-mini, Gemini — all of them, averaging to 18.9%.
The pattern across all hard tasks is the same: they require the model to hold a complex visual or structural grammar in working memory while also solving the underlying task. Text→TOML fails because TOML's quoting and array rules diverge from JSON in ways models consistently confuse. Matplotlib→TikZ fails because it requires translating between two entirely different rendering paradigms.
The "easy" tasks are revealing too. Text→JSON, Text→HTML, Text→CSV, Text→Markdown — all of these score above 90% across models. The structured output failure problem is not universal. It's concentrated in a specific subset of formats that happen to be widely used in developer tooling (TOML in Rust projects, Mermaid in documentation pipelines, TikZ in academic papers).
What This Looks Like in a Real Repository
Consider a Rust project where a developer asks an AI assistant to generate a Cargo.toml dependency section or a .cargo/config.toml profile. The StructEval data says the model fails to produce valid, correct TOML in roughly 2 of every 3 attempts on average, despite TOML being a simple, well-documented format.
```toml
# What the AI generates (a value-type error, not a syntax error):
[profile.release]
opt-level = 3            # ✓ correct
lto = true               # ✓ correct
overflow-checks = false  # ✓ correct
panic = 'abort'          # ✓ correct
codegen-units = "1"      # ✗ wrong — Cargo requires an integer, not a string

# What Cargo actually requires:
codegen-units = 1
```
This kind of error — a value type mismatch that looks correct to a human skimming the output — is exactly what GitIntel surfaces when scanning commit history. AI-authored commits containing config file changes are often the source of subtle breakages that only manifest at build time.
The StructEval benchmark makes this concrete: the problem is not that AI coding tools are generally incompetent. It's that their failure modes are format-specific, concentrated, and non-obvious — precisely the failure profile most likely to slip through review.
IMPORTANT CAVEAT
StructEval measures single-shot, non-agentic performance. A modern AI coding agent with self-correction loops, linting feedback, or multi-turn prompting will outperform these numbers. As researcher Dongfu Jiang (co-first author) noted: "Developers might have these agents working for them, but they still need significant human supervision."
Additionally: no Claude model (Anthropic) was included in the benchmark — the study covered GPT, Gemini, Qwen, Llama, and Phi families only. Claude's performance on StructEval tasks remains unmeasured by this study.
What This Means for Your Codebase
The error is invisible until it isn't
A malformed TOML value or an invalid Mermaid node definition won't fail your linter. It won't trigger a type error. It will either silently produce wrong output or fail at runtime in a way that's difficult to trace back to its AI origin. Without tooling that tracks which commits are AI-assisted, the audit trail disappears the moment the commit lands.
The 18.9% success rate on Mermaid is a documentation problem
Mermaid is the de facto standard for architecture diagrams in GitHub READMEs and Notion docs. If your team is using AI to generate these diagrams — and the tools are only succeeding 1 in 5 times — your documentation pipeline has an 81% failure rate on this specific task. That's not a theoretical risk. It's the current baseline according to the best available peer-reviewed data.
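One cheap mitigation is to sanity-check generated diagrams before they land in docs. The sketch below is a heuristic of our own, not a full Mermaid parser: it only verifies that each fenced `mermaid` block in a Markdown string opens with a recognized diagram type, which catches a class of hallucinated headers:

```python
import re

# Common Mermaid diagram types (illustrative subset, not exhaustive).
KNOWN_TYPES = {"graph", "flowchart", "sequenceDiagram", "classDiagram",
               "stateDiagram-v2", "erDiagram", "gantt", "pie"}

def check_mermaid_blocks(markdown: str) -> list[str]:
    """Return the first keyword of every mermaid block whose diagram
    type is not recognized."""
    blocks = re.findall(r"```mermaid\n(.*?)```", markdown, re.DOTALL)
    bad = []
    for block in blocks:
        stripped = block.strip()
        first = stripped.splitlines()[0].split()[0] if stripped else ""
        if first not in KNOWN_TYPES:
            bad.append(first)
    return bad

doc = "```mermaid\nflowchart TD\n  A --> B\n```\n```mermaid\ndiagramm LR\n```\n"
print(check_mermaid_blocks(doc))  # ['diagramm']
```

A check like this won't catch every invalid diagram, but a rendering step in CI (running the blocks through a Mermaid renderer and failing on errors) closes the rest of the gap.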
Format diversity amplifies the problem
Modern repositories are not monolithic Python codebases. They contain pyproject.toml, Cargo.toml, docker-compose.yml, .github/workflows/*.yml, Mermaid diagram blocks in Markdown, SVG assets, JSON schemas, and CSV test fixtures. Each format is a separate failure surface. The StructEval data shows the failure rates are not uniform, and the formats developers use most for configuration are among the hardest.
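A quick way to see your own repository's format surface is to count files per structured format. A sketch using only the standard library; the pattern table is illustrative, not exhaustive:

```python
from pathlib import Path

# Structured formats and the glob patterns that find them.
PATTERNS = {
    "TOML": ("*.toml",),
    "YAML": ("*.yml", "*.yaml"),
    "JSON": ("*.json",),
    "SVG":  ("*.svg",),
    "CSV":  ("*.csv",),
}

def format_surface(repo: Path) -> dict[str, int]:
    """Return {format: file count} for every format present in repo."""
    counts = {
        fmt: sum(len(list(repo.rglob(g))) for g in globs)
        for fmt, globs in PATTERNS.items()
    }
    return {fmt: n for fmt, n in counts.items() if n > 0}
```

Each nonzero entry in the result is a distinct failure surface; cross-referencing the counts against per-format failure rates gives a rough risk profile for AI-assisted edits in that repo.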
Run the Scan
GitIntel surfaces which commits are AI-assisted, which files were touched, and which authors are generating the most AI-attributed changes — giving you the attribution layer that StructEval's findings make critical. Run it against your own repo in under a minute:
```sh
# Install GitIntel
curl -fsSL https://gitintel.com/install.sh | sh

# Scan your repo — see which commits are AI-assisted
cd your-repo
gitintel scan

# Focus on config files specifically
gitintel scan --path "*.toml,*.yaml,*.yml" --format json
```
The output shows you exactly which config files were last touched in AI-assisted commits — the highest-risk surface the StructEval data identifies.
Know Your Repo's AI Surface
StructEval tells you the failure rates by format. GitIntel tells you which formats in your specific repo were generated by AI. Both pieces are necessary.
Open source (MIT) · Local-first · No data leaves your machine
Source: StructEval (arXiv:2505.20139), published in Transactions on Machine Learning Research (TMLR) January 2026, presented at ICLR 2026. University of Waterloo TIGER-AI Lab. Lead authors: Jialin Yang, Dongfu Jiang, Wenhu Chen.
Related reading on GitIntel: