Your agent benchmark is lying to you. PawBench suggests harness choice can matter more than model choice.

4,050 agent runs just exposed the biggest blind spot in how you evaluate AI agents.

Tongyi Lab's PawBench tested 9 models across 3 harnesses (Hermes, OpenClaw, QwenPaw) on 150 real tasks.

The results change the math on everything.

Switching only the harness moved qwen3.6-35b-a3b by 11.5 points. That's not noise. That's bigger than most model upgrades.

A weaker model with a better harness beat a stronger model with a weaker one.

qwen3.6-plus on QwenPaw scored 76.5.
qwen3.6-max-preview on Hermes scored 70.2.

Same family. Better model. Worse harness. Lost by 6.3 points.

The harness gap on text tasks: roughly 5.5 points, with QwenPaw at 75.4 and Hermes at 70.0.

Meanwhile, claude-opus-4.6 barely budged: a 2.3-point spread across all three harnesses. Strong models appear to compensate for bad harness design. Weak ones can't.

This means your model selection process is incomplete. You're optimizing one variable while ignoring the other half of the equation.

Agent Performance = f(Model, Harness)

Not f(Model). Not f(Harness). Both.

Three things break when you ignore this:

1. You overpay for models when a harness switch gives you the same gain cheaper.
2. You dismiss small models that would perform well under better harness design.
3. You never diagnose whether failures are model-side or harness-side.

The diagnostic slice that matters most: fix the model, vary the harness. If scores swing by 10+ points, your bottleneck is the harness, not the model.
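Here is a minimal sketch of that diagnostic in Python. The model names, harness names, and individual scores are placeholders, not PawBench results. They are chosen only so the spreads match the 11.5 and 2.3 figures quoted above. The 10-point threshold comes from the rule of thumb in this post, and the labels are an assumption.

```python
from collections import defaultdict

# Hypothetical (model, harness) -> score table. These are NOT PawBench numbers.
scores = {
    ("small-model", "harness-a"): 62.0,
    ("small-model", "harness-b"): 73.5,
    ("large-model", "harness-a"): 80.1,
    ("large-model", "harness-b"): 82.4,
}

def harness_spread(scores):
    """For each model, return max score minus min score across harnesses."""
    by_model = defaultdict(list)
    for (model, _harness), score in scores.items():
        by_model[model].append(score)
    return {m: round(max(v) - min(v), 1) for m, v in by_model.items()}

def diagnose(spread, threshold=10.0):
    """Apply the post's rule of thumb: a 10+ point swing points at the harness."""
    return "harness-bound" if spread >= threshold else "not clearly harness-bound"

spreads = harness_spread(scores)
for model, spread in spreads.items():
    print(f"{model}: spread={spread} -> {diagnose(spread)}")
```

Swap in your own score table. The point is to hold the model fixed and read the spread, not the average.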

Many teams may be building on Hermes-level harnesses and blaming their models.

Audit your harness choice today. Run the same task across two harnesses. If the gap is wider than 5 points, you're leaving performance on the table.
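A rough sketch of that audit, assuming each run is recorded as a pass/fail per task and the score is a simple pass rate on a 0-100 scale. PawBench's actual scoring may differ, and the task names and results below are invented for illustration.

```python
def pass_rate(results):
    """Percentage of tasks passed, on a 0-100 scale."""
    return 100.0 * sum(results.values()) / len(results)

def audit(results_a, results_b):
    """Compare two harnesses on the tasks both ran, with the model held fixed."""
    shared = sorted(results_a.keys() & results_b.keys())
    if not shared:
        raise ValueError("no shared tasks to compare")
    rate_a = pass_rate({t: results_a[t] for t in shared})
    rate_b = pass_rate({t: results_b[t] for t in shared})
    # Tasks whose outcome changes with the harness are the ones worth inspecting.
    flipped = [t for t in shared if results_a[t] != results_b[t]]
    return {"gap": rate_b - rate_a, "flipped_tasks": flipped}

# Hypothetical per-task outcomes for one model under two harnesses.
harness_a = {"book-flight": True, "fix-bug": False, "summarize-pdf": False, "fill-form": True}
harness_b = {"book-flight": True, "fix-bug": True, "summarize-pdf": False, "fill-form": True}

report = audit(harness_a, harness_b)
print(report)
```

The flipped tasks are where to start reading logs: the model did not change, so the difference is coming from the harness.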

SOURCE: https://tongyilab.substack.com/p/the-harness-gap-what-we-learned-from
VERIFIED: Tongyi Lab Substack (primary), GitHub agentscope-ai/PawBench (primary), Baidu Wiki entry for PawBench
SIGNAL: The agent evaluation market just got a new axis. Model benchmarks alone are insufficient — harness co-evaluation is now table stakes for production agent teams.
