LLM Agent Benchmark Leaderboard

This page provides the LLM Agent benchmark leaderboard, covering Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. Compare GPT, Claude, Qwen, and DeepSeek on tool use, task planning, and autonomous execution.

Updated on 2026-06-13 11:57:39

As of 2026-06, this page covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related agent benchmarks, making it straightforward to compare models within the same task family.

Ranked by Tool Decathlon

LLM Performance Results

Data source: DataLearnerAI

| Rank | Model | Developer | Mode | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Kimi K2.6 | Moonshot AI | Thinking Enabled, Tools | - | - | 66.70 | 50.00 | 73.10 | Free commercial |
| 2 | GPT-5.4 mini | OpenAI | Thinking Level · Extra High, Tools | - | - | 60.00 | 42.90 | 72.10 | Proprietary |
| 3 | GLM 5.1 | Zhipu AI | Thinking Enabled, Tools | - | - | 63.50 | 40.70 | - | Free commercial |
| 4 | - | - | - | - | - | 61.60 | 39.80 | - | Proprietary |
| 5 | Qwen3.5-397B-A17B | Alibaba | Thinking Enabled, Tools | - | 86.70 | 52.50 | 38.30 | 62.20 | Free commercial |
| 6 | GPT-5.4 nano | OpenAI | Thinking Level · Extra High, Tools | - | - | 46.30 | 35.50 | 39.00 | Proprietary |
| 7 | - | - | - | - | - | 51.50 | 26.90 | - | Free commercial |
| 8 | o3-pro | OpenAI | Thinking Level · High | 84.90 | - | - | - | - | Proprietary |
| 9 | - | - | - | 83.10 | - | - | - | - | Proprietary |
| 10 | OpenAI o3 | OpenAI | Thinking Level · High | 81.30 | - | - | - | - | Proprietary |
| 11 | Grok 4 | xAI | Thinking Enabled | 79.60 | - | - | - | - | Proprietary |
| 12 | DeepSeek-V3.1 | DeepSeek-AI | Thinking Enabled | 76.30 | - | - | - | - | Free commercial |
| 13 | - | - | - | 76.10 | - | - | - | - | Free commercial |
| 14 | DeepSeek V3.2-Exp | DeepSeek-AI | Thinking Enabled, Tools | 74.50 | 66.70 | - | - | - | Free commercial |
| 15 | OpenAI o4-mini | OpenAI | Thinking Level · High | 72.00 | - | - | - | - | Proprietary |
| 16 | Claude Opus 4 | Anthropic | Thinking Enabled | 72.00 | - | - | - | - | Proprietary |
| 17 | - | - | - | 71.40 | - | - | - | - | Free commercial |
| 18 | - | - | - | 70.10 | - | - | - | - | Proprietary |
| 19 | DeepSeek V3.2 | DeepSeek-AI | Thinking Enabled, Tools | 69.90 | 80.30 | 46.40 | - | - | Free commercial |
| 20 | - | - | - | 68.40 | - | - | - | - | Free commercial |
| 21 | - | - | - | 66.20 | - | 36.20 | - | - | Free commercial |
| 22 | - | - | - | 64.90 | - | - | - | - | Proprietary |
| 23 | Claude Sonnet 4 | Anthropic | Thinking Enabled | 61.30 | - | - | - | - | Proprietary |
| 24 | M2.1 | MiniMaxAI | Thinking Enabled, Tools | 61.00 | - | 47.90 | - | - | Free commercial |
| 25 | - | - | - | 60.40 | - | - | - | - | Proprietary |
| 26 | - | - | - | 59.10 | - | - | - | - | Free commercial |
| 27 | - | - | - | 56.70 | - | - | - | - | Proprietary |
| 28 | - | - | - | 55.10 | - | - | - | - | Free commercial |
| 29 | GLM-4.7 | Zhipu AI | Thinking Enabled, Tools | 52.10 | 87.40 | 41.00 | - | - | Free commercial |
| 30 | - | - | - | 51.60 | - | - | - | - | Proprietary |
| 31 | - | - | - | 49.80 | - | - | - | - | Free commercial |
| 32 | Qwen3-32B | Alibaba | Thinking Enabled | 40.00 | - | - | - | - | Free commercial |
| 33 | - | - | - | 27.10 | - | - | - | - | Proprietary |
| 34 | Claude Sonnet 4.6 | Anthropic | Thinking Enabled, Tools | - | - | 59.10 | - | 72.50 | Proprietary |
| 35 | Qwen3.6-27B | Alibaba | Thinking Enabled, Tools | - | - | 59.30 | - | - | Free commercial |
| 36 | Qwen3.6-Max-Preview | Alibaba | Thinking Enabled, Tools | - | - | 61.60 | - | - | Proprietary |
| 37 | Composer 2 | Cursor | Thinking Enabled | - | - | 61.70 | - | - | Proprietary |
| 38 | DeepSeek-V4-Pro | DeepSeek-AI | Thinking Enabled, Tools | - | - | 63.30 | - | - | Free commercial |
| 39 | Qwen3.6-Max-Preview | Alibaba | Deep Thinking Mode, Tools | - | - | 65.40 | - | - | Proprietary |
| 40 | DeepSeek-V4-Pro | DeepSeek-AI | Thinking Level · Extra High, Tools | - | - | 67.90 | - | - | Free commercial |
| 41 | Composer 2.5 | Cursor | Thinking Enabled | - | - | 69.30 | - | - | Proprietary |
| 42 | Opus 4.7 | Anthropic | Extended Thinking, Tools | - | - | 69.40 | - | 78.00 | Proprietary |
| 43 | Qwen3.7-Max-Preview | Alibaba | Thinking Enabled, Tools | - | - | 69.70 | - | - | Proprietary |
| 44 | GPT-5.4 | OpenAI | Thinking Level · Extra High, Tools | - | - | 75.10 | - | 75.00 | Proprietary |
| 45 | GPT-5.3 Codex | OpenAI | Standard Mode, Tools | - | - | 77.30 | - | - | Proprietary |
| 46 | Claude Mythos Preview | Anthropic | Extended Thinking, Tools | - | - | 82.00 | - | 79.60 | Proprietary |
| 47 | Claude Sonnet 4 | Anthropic | Thinking Enabled, Tools | - | - | - | - | 42.20 | Proprietary |
| 48 | GPT-5.5 | OpenAI | Thinking Enabled, Tools | - | - | 82.70 | - | 78.70 | Proprietary |
| 49 | Composer 1.5 | Cursor | Thinking Enabled | - | - | 47.90 | - | - | Proprietary |
| 50 | MiniMax M3 | MiniMaxAI | Thinking Enabled, Tools | - | - | - | - | 70.00 | Non-commercial |

Showing 50 of 100 models.