GPT-5 fails 42% of real-world MCP tasks. The benchmark nobody wanted just dropped.

MCPMark just released 127 tasks across 5 production MCP servers.

38 models tested. Real tools. Real state. Programmatic verification.

The best of the 38, GPT-5-2-high, scores 57.5% Pass@1.

That means your flagship coding agent fails roughly 42 out of every 100 real MCP workflows on the first attempt.
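To make that arithmetic concrete, here is a minimal sketch. It assumes Pass@1 is simply the share of tasks solved in a single attempt; the function name `failure_rate` is illustrative and not part of MCPMark.

```python
# Assumption: Pass@1 is the percentage of tasks solved on a single attempt,
# so the first-attempt failure rate is just its complement.
TOTAL_TASKS = 127  # task count reported for the benchmark


def failure_rate(pass_at_1_percent: float) -> float:
    """Return the percentage of tasks failed on the first attempt."""
    return round(100.0 - pass_at_1_percent, 1)


top_score = 57.5  # GPT-5-2-high, from the leaderboard
failed_pct = failure_rate(top_score)

# Approximate number of the 127 tasks the top model would fail in one pass.
expected_failures = round(TOTAL_TASKS * failed_pct / 100)

print(f"{failed_pct}% failed, about {expected_failures} of {TOTAL_TASKS} tasks")
```

Under that assumption, the top score works out to a bit more than 42 failures per 100 workflows.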

This isn't a toy benchmark. 127 tasks spanning Filesystem, GitHub, Notion, Playwright, and Postgres. Each requires multi-step orchestration, cross-tool coordination, and state management.

The average task demands 16.2 turns and 17.4 tool calls.

Here's the leaderboard:
- GPT-5-2-high: 57.5%
- Gemini-3-pro-high: 53.9%
- GPT-5-medium: 52.6%
- Claude-Opus-4-5-high: 42.3%
- DeepSeek-V3-2-thinking: 36.8%

Claude Sonnet 4, the model many teams deploy for tool use, scores 28.1%.

That's a gap of nearly 30 points (57.5% versus 28.1%) between what's possible and what's in production.

The cost tells the story too. GPT-5-2-high costs $250 per run. DeepSeek-V3-2 costs $31. Same benchmark. Wildly different economics.
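One way to compare those economics is cost per solved task rather than cost per run. The sketch below rests on two assumptions that the post does not confirm: that "per run" means one full pass over all 127 tasks, and that the $31 figure belongs to the DeepSeek-V3-2-thinking leaderboard entry. The helper `cost_per_solved_task` is hypothetical.

```python
# Assumption: "per run" = one full pass over all 127 tasks.
# Assumption: the $31 run cost corresponds to the 36.8% leaderboard entry.
TOTAL_TASKS = 127


def cost_per_solved_task(run_cost_usd: float, pass_at_1_percent: float,
                         total_tasks: int = TOTAL_TASKS) -> float:
    """Divide the run cost by the expected number of tasks solved."""
    solved = total_tasks * pass_at_1_percent / 100
    return run_cost_usd / solved


# (run cost in USD, Pass@1 in percent), taken from the figures quoted above.
models = {
    "GPT-5-2-high": (250.0, 57.5),
    "DeepSeek-V3-2-thinking": (31.0, 36.8),
}

per_solved = {
    name: cost_per_solved_task(cost, score)
    for name, (cost, score) in models.items()
}

for name, usd in per_solved.items():
    print(f"{name}: ${usd:.2f} per solved task")
```

If those assumptions hold, the roughly 8x difference in run cost shrinks to about 5x per solved task, because the more expensive model also solves more tasks.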

Your agent architecture is probably built on a model that scores below 30% on real MCP tasks.

Audit your model choice today. The gap between benchmark hype and tool-use reality is a chasm.
