
LLM Agent Benchmark Leaderboard

This page provides the LLM Agent benchmark leaderboard, covering Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. Compare GPT, Claude, Qwen, and DeepSeek on tool use, task planning, and autonomous execution.

Updated on 2026-06-13 11:57:39

As of 2026-06, this leaderboard covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related agent benchmarks, so models can be compared directly within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.


Ranked by τ²-Bench

LLM Performance Results

Data source: DataLearnerAI


| Rank | Model | Developer | Mode | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | Extended Thinking, Tools | | 91.89 | 65.40 | | 72.70 | Proprietary |
| 2 | [name missing] | | | | 90.80 | 68.50 | | | Proprietary |
| 3 | [name missing] | | | | 90.20 | 47.60 | | | Proprietary |
| 4 | GLM-5 | Zhipu AI | Thinking Enabled, Tools | | 89.70 | 61.10 | | | Free commercial |
| 5 | Step 3.5 Flash | StepFunAI | Thinking Enabled, Tools | | 88.20 | 51.00 | | | Free commercial |
| 6 | GLM-4.7 | Zhipu AI | Thinking Enabled, Tools | 52.10 | 87.40 | 41.00 | | | Free commercial |
| 7 | Qwen3.5-397B-A17B | Alibaba | Thinking Enabled, Tools | | 86.70 | 52.50 | 38.30 | 62.20 | Free commercial |
| 8 | [name missing] | | | | 85.40 | 54.20 | | | Proprietary |
| 9 | Claude Sonnet 4.5 | Anthropic | Thinking Enabled, Tools | | 84.70 | 42.80 | | 61.40 | Proprietary |
| 10 | Grok 4.1 Fast | xAI | Thinking Enabled, Tools | | 82.71 | | | | Proprietary |
| 11 | Qwen3-Max-Thinking | Alibaba | Thinking Enabled, Tools | | 82.10 | | | | Proprietary |
| 12 | GPT-5.2 | OpenAI | Thinking Level · Extra High, Tools | | 82.00 | | | | Proprietary |
| 13 | Opus 4.5 | Anthropic | Extended Thinking, Tools | | 81.99 | 59.30 | | | Proprietary |
| 14 | DeepSeek V3.2 | DeepSeek-AI | Thinking Enabled, Tools | 69.90 | 80.30 | 46.40 | | | Free commercial |
| 15 | GPT-5 | OpenAI | Thinking Enabled, Tools | | 80.00 | | | | Proprietary |
| 16 | GLM-4.7-Flash | Zhipu AI | Thinking Enabled, Tools | | 79.50 | | | | Free commercial |
| 17 | Qwen3.5-27B | Alibaba | Thinking Enabled, Tools | | 79.00 | 41.60 | | 56.20 | Free commercial |
| 18 | MiniMax M2 | MiniMaxAI | Thinking Enabled, Tools | | 77.20 | | | | Free commercial |
| 19 | Gemma 4 31B | DeepMind | Thinking Enabled, Tools | | 76.90 | | | | Free commercial |
| 20 | GLM-4.6 | Zhipu AI | Thinking Enabled, Tools | | 75.90 | | | | Free commercial |
| 21 | [name missing] | | | | 74.00 | | | | Proprietary |
| 22 | Claude Opus 4 | Anthropic | Thinking Enabled, Tools | | 72.50 | | | | Proprietary |
| 23 | Qwen3 Max (Preview) | Alibaba | Thinking Enabled, Tools | | 72.00 | | | | Proprietary |
| 24 | Claude Sonnet 4.5 | Anthropic | Standard Mode, Tools | | 71.00 | | | | Proprietary |
| 25 | Gemma 4 26B A4B | DeepMind | Thinking Enabled, Tools | | 68.20 | | | | Free commercial |
| 26 | DeepSeek V3.2-Exp | DeepSeek-AI | Thinking Enabled, Tools | 74.50 | 66.70 | | | | Free commercial |
| 27 | Kimi K2 | Moonshot AI | Thinking Enabled, Tools | | 64.30 | | | | Free commercial |
| 28 | Kimi K2 | Moonshot AI | Standard Mode, Tools | | 64.30 | | | | Free commercial |
| 29 | Claude Sonnet 3.7 | Anthropic | Thinking Enabled, Tools | | 61.80 | | | 28.00 | Proprietary |
| 30 | OpenAI o4-mini | OpenAI | Thinking Enabled, Tools | | 56.90 | | | | Proprietary |
| 31 | GPT-4.1 | OpenAI | Standard Mode, Tools | | 54.70 | | | | Proprietary |
| 32 | GPT-4.1 mini | OpenAI | Standard Mode, Tools | | 53.00 | | | | Proprietary |
| 33 | Claude Sonnet 4 | Anthropic | Standard Mode, Tools | | 52.00 | | | | Proprietary |
| 34 | Qwen3-30B-A3B-2507 | Alibaba | Thinking Enabled, Tools | | 49.00 | | | | Free commercial |
| 35 | GPT OSS 20B | OpenAI | Thinking Enabled, Tools | | 47.70 | | | | Free commercial |
| 36 | DeepSeek-V3-0324 | DeepSeek-AI | Standard Mode, Tools | | 38.80 | | | | Free commercial |
| 37 | [name missing] | | | | 37.00 | | | | Free commercial |
| 38 | [name missing] | | | | 37.00 | | | | Free commercial |
| 39 | Qwen3-235B-A22B | Alibaba | Thinking Enabled, Tools | | 34.40 | | | | Free commercial |
| 40 | Haiku 4.5 | Anthropic | Standard Mode, Tools | | 33.00 | | | | Proprietary |
| 41 | o3-pro | OpenAI | Thinking Level · High | 84.90 | | | | | Proprietary |
| 42 | [name missing] | | | 83.10 | | | | | Proprietary |
| 43 | OpenAI o3 | OpenAI | Thinking Level · High | 81.30 | | | | | Proprietary |
| 44 | Grok 4 | xAI | Thinking Enabled | 79.60 | | | | | Proprietary |
| 45 | DeepSeek-V3.1 | DeepSeek-AI | Thinking Enabled | 76.30 | | | | | Free commercial |
| 46 | [name missing] | | | 76.10 | | | | | Free commercial |
| 47 | OpenAI o4-mini | OpenAI | Thinking Level · High | 72.00 | | | | | Proprietary |
| 48 | Claude Opus 4 | Anthropic | Thinking Enabled | 72.00 | | | | | Proprietary |
| 49 | [name missing] | | | 71.40 | | | | | Free commercial |
| 50 | [name missing] | | | 70.10 | | | | | Proprietary |
Showing 50 of 100 models.