LLM Agent Benchmark Leaderboard

This page provides the LLM Agent benchmark leaderboard, covering Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. Compare GPT, Claude, Qwen, and DeepSeek on tool use, task planning, and autonomous execution.

Updated on 2026-06-13 11:57:39

As of 2026-06, this page covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related agent benchmarks, so models can be compared within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.

Ranked by OSWorld-Verified

LLM Performance Results

Data source: DataLearnerAI

| Rank | Model | Developer | Mode | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 | Anthropic | Thinking Enabled, Tools | | | | | 85.00 | Proprietary |
| 2 | Claude Opus 4.8 | Anthropic | Extended Thinking, Tools | | | | | 83.40 | Proprietary |
| 3 | Claude Mythos Preview | Anthropic | Extended Thinking, Tools | | | 82.00 | | 79.60 | Proprietary |
| 4 | GPT-5.5 | OpenAI | Thinking Enabled, Tools | | | 82.70 | | 78.70 | Proprietary |
| 5 | [name missing] | | | | | | | 78.40 | Proprietary |
| 6 | Opus 4.7 | Anthropic | Extended Thinking, Tools | | | 69.40 | | 78.00 | Proprietary |
| 7 | GPT-5.4 | OpenAI | Thinking Level · Extra High, Tools | | | 75.10 | | 75.00 | Proprietary |
| 8 | Kimi K2.6 | Moonshot AI | Thinking Enabled, Tools | | | 66.70 | 50.00 | 73.10 | Free commercial |
| 9 | Claude Opus 4.6 | Anthropic | Extended Thinking, Tools | | 91.89 | 65.40 | | 72.70 | Proprietary |
| 10 | Claude Sonnet 4.6 | Anthropic | Thinking Enabled, Tools | | | 59.10 | | 72.50 | Proprietary |
| 11 | GPT-5.4 mini | OpenAI | Thinking Level · Extra High, Tools | | | 60.00 | 42.90 | 72.10 | Proprietary |
| 12 | MiniMax M3 | MiniMaxAI | Thinking Enabled, Tools | | | | | 70.00 | Non-commercial |
| 13 | Qwen3.5-397B-A17B | Alibaba | Thinking Enabled, Tools | | 86.70 | 52.50 | 38.30 | 62.20 | Free commercial |
| 14 | Claude Sonnet 4.5 | Anthropic | Thinking Enabled, Tools | | 84.70 | 42.80 | | 61.40 | Proprietary |
| 15 | Qwen3.5-27B | Alibaba | Thinking Enabled, Tools | | 79.00 | 41.60 | | 56.20 | Free commercial |
| 16 | Claude Sonnet 4 | Anthropic | Thinking Enabled, Tools | | | | | 42.20 | Proprietary |
| 17 | GPT-5.4 nano | OpenAI | Thinking Level · Extra High, Tools | | | 46.30 | 35.50 | 39.00 | Proprietary |
| 18 | Claude Sonnet 3.7 | Anthropic | Thinking Enabled, Tools | | 61.80 | | | 28.00 | Proprietary |
| 19 | o3-pro | OpenAI | Thinking Level · High | 84.90 | | | | | Proprietary |
| 20 | [name missing] | | | 83.10 | | | | | Proprietary |
| 21 | OpenAI o3 | OpenAI | Thinking Level · High | 81.30 | | | | | Proprietary |
| 22 | Grok 4 | xAI | Thinking Enabled | 79.60 | | | | | Proprietary |
| 23 | DeepSeek-V3.1 | DeepSeek-AI | Thinking Enabled | 76.30 | | | | | Free commercial |
| 24 | [name missing] | | | 76.10 | | | | | Free commercial |
| 25 | DeepSeek V3.2-Exp | DeepSeek-AI | Thinking Enabled, Tools | 74.50 | 66.70 | | | | Free commercial |
| 26 | OpenAI o4-mini | OpenAI | Thinking Level · High | 72.00 | | | | | Proprietary |
| 27 | Claude Opus 4 | Anthropic | Thinking Enabled | 72.00 | | | | | Proprietary |
| 28 | [name missing] | | | 71.40 | | | | | Free commercial |
| 29 | [name missing] | | | 70.10 | | | | | Proprietary |
| 30 | DeepSeek V3.2 | DeepSeek-AI | Thinking Enabled, Tools | 69.90 | 80.30 | 46.40 | | | Free commercial |
| 31 | [name missing] | | | 68.40 | | | | | Free commercial |
| 32 | [name missing] | | | 66.20 | | 36.20 | | | Free commercial |
| 33 | [name missing] | | | 64.90 | | | | | Proprietary |
| 34 | Claude Sonnet 4 | Anthropic | Thinking Enabled | 61.30 | | | | | Proprietary |
| 35 | M2.1 | MiniMaxAI | Thinking Enabled, Tools | 61.00 | | 47.90 | | | Free commercial |
| 36 | [name missing] | | | 60.40 | | | | | Proprietary |
| 37 | [name missing] | | | 59.10 | | | | | Free commercial |
| 38 | [name missing] | | | 56.70 | | | | | Proprietary |
| 39 | [name missing] | | | 55.10 | | | | | Free commercial |
| 40 | GLM-4.7 | Zhipu AI | Thinking Enabled, Tools | 52.10 | 87.40 | 41.00 | | | Free commercial |
| 41 | [name missing] | | | 51.60 | | | | | Proprietary |
| 42 | [name missing] | | | 49.80 | | | | | Free commercial |
| 43 | Qwen3-32B | Alibaba | Thinking Enabled | 40.00 | | | | | Free commercial |
| 44 | [name missing] | | | 27.10 | | | | | Proprietary |
| 45 | Qwen3.6-27B | Alibaba | Thinking Enabled, Tools | | | 59.30 | | | Free commercial |
| 46 | Qwen3.6-Max-Preview | Alibaba | Thinking Enabled, Tools | | | 61.60 | | | Proprietary |
| 47 | [name missing] | | | | | 61.60 | 39.80 | | Proprietary |
| 48 | Composer 2 | Cursor | Thinking Enabled | | | 61.70 | | | Proprietary |
| 49 | DeepSeek-V4-Pro | DeepSeek-AI | Thinking Enabled, Tools | | | 63.30 | | | Free commercial |
| 50 | GLM 5.1 | Zhipu AI | Thinking Enabled, Tools | | | 63.50 | 40.70 | | Free commercial |

Showing 50 of 100 models.