MCPMark just dropped 127 tasks across 5 production MCP servers.
38 models tested. Real tools. Real state. Programmatic verification.
The top-scoring model, GPT-5-2-high, reaches 57.5% Pass@1.
That means even the leaderboard's best model fails about 42 of every 100 real MCP tasks on its first attempt.
This isn't a toy benchmark. 127 tasks spanning Filesystem, GitHub, Notion, Playwright, and Postgres. Each requires multi-step orchestration, cross-tool coordination, and state management.
The average task demands 16.2 turns and 17.4 tool calls.
Here's the leaderboard:
- GPT-5-2-high: 57.5%
- Gemini-3-pro-high: 53.9%
- GPT-5-medium: 52.6%
- Claude-Opus-4-5-high: 42.3%
- DeepSeek-V3-2-thinking: 36.8%
Claude Sonnet 4, a model many teams deploy for tool use, scores 28.1%.
That's a gap of nearly 30 points (57.5% vs. 28.1%) between what's possible and what's in production.
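The gap arithmetic is easy to check. Here is a minimal sketch that uses only the Pass@1 figures quoted in this post; the dictionary and function names are illustrative and not part of MCPMark.

```python
# Pass@1 scores (percent) as quoted in this post.
# The names below are illustrative, not an MCPMark API.
PASS_AT_1 = {
    "GPT-5-2-high": 57.5,
    "Gemini-3-pro-high": 53.9,
    "GPT-5-medium": 52.6,
    "Claude-Opus-4-5-high": 42.3,
    "DeepSeek-V3-2-thinking": 36.8,
    "Claude Sonnet 4": 28.1,
}


def failure_rate(score):
    """Percent of tasks failed on the first attempt, given a Pass@1 percent."""
    return round(100.0 - score, 1)


def gap_to_best(scores):
    """Each model's shortfall, in percentage points, from the top score."""
    best = max(scores.values())
    return {name: round(best - score, 1) for name, score in scores.items()}


if __name__ == "__main__":
    gaps = gap_to_best(PASS_AT_1)
    for name, score in PASS_AT_1.items():
        print(f"{name}: fails {failure_rate(score)}%, {gaps[name]} points behind the leader")
```

Swap in your own model's score to see where it lands relative to the leader.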
The cost tells the story too. GPT-5-2-high costs $250 per run. DeepSeek-V3-2 costs $31. Same benchmark. Wildly different economics.
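One rough way to compare the economics is dollars spent per Pass@1 percentage point. The sketch below is hedged: the post quotes a cost for "DeepSeek-V3-2" and a score for "DeepSeek-V3-2-thinking", so pairing the two is an assumption, and the helper name is hypothetical.

```python
def cost_per_point(cost_usd, pass_at_1):
    """Dollars per run divided by Pass@1 percentage points (lower is cheaper)."""
    if pass_at_1 <= 0:
        raise ValueError("pass_at_1 must be positive")
    return round(cost_usd / pass_at_1, 2)


# (cost per run in USD, Pass@1 percent) as quoted in this post.
# ASSUMPTION: the $31 figure is paired with the DeepSeek-V3-2-thinking score;
# the post does not confirm they refer to the same configuration.
RUNS = {
    "GPT-5-2-high": (250.0, 57.5),
    "DeepSeek-V3-2 (assumed thinking variant)": (31.0, 36.8),
}

if __name__ == "__main__":
    for name, (cost, score) in RUNS.items():
        print(f"{name}: ${cost_per_point(cost, score)} per Pass@1 point")
```

This ignores whether the extra points are worth the premium for your workload; it only puts the two quoted numbers on a common scale.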
Your agent architecture is probably built on a model that scores below 30% on real MCP tasks.
Audit your model choice today. The gap between benchmark hype and tool-use reality is a chasm.
GPT-5-2-high fails 42.5% of real-world MCP tasks. The benchmark nobody wanted just dropped.