
LLM Agent Benchmark Leaderboard

This page provides the LLM Agent benchmark leaderboard, covering Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. Compare GPT, Claude, Qwen, and DeepSeek on tool use, task planning, and autonomous execution.

Updated on 2026-06-13 11:57:39

As of 2026-06, this page covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related agent benchmarks, so models can be compared within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.


Ranked by Aider-Polyglot

LLM Performance Results

Data source: DataLearnerAI


| Rank | Model | Developer | Mode | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License |
|---|---|---|---|---|---|---|---|---|---|
| 1 | o3-pro | OpenAI | Thinking Level · High | 84.90 | - | - | - | - | Proprietary |
| 2 | - | - | - | 83.10 | - | - | - | - | Proprietary |
| 3 | OpenAI o3 | OpenAI | Thinking Level · High | 81.30 | - | - | - | - | Proprietary |
| 4 | Grok 4 | xAI | Thinking Enabled | 79.60 | - | - | - | - | Proprietary |
| 5 | DeepSeek-V3.1 | DeepSeek-AI | Thinking Enabled | 76.30 | - | - | - | - | Free commercial |
| 6 | - | - | - | 76.10 | - | - | - | - | Free commercial |
| 7 | DeepSeek V3.2-Exp | DeepSeek-AI | Thinking Enabled, Tools | 74.50 | 66.70 | - | - | - | Free commercial |
| 8 | OpenAI o4-mini | OpenAI | Thinking Level · High | 72.00 | - | - | - | - | Proprietary |
| 9 | Claude Opus 4 | Anthropic | Thinking Enabled | 72.00 | - | - | - | - | Proprietary |
| 10 | - | - | - | 71.40 | - | - | - | - | Free commercial |
| 11 | - | - | - | 70.10 | - | - | - | - | Proprietary |
| 12 | DeepSeek V3.2 | DeepSeek-AI | Thinking Enabled, Tools | 69.90 | 80.30 | 46.40 | - | - | Free commercial |
| 13 | - | - | - | 68.40 | - | - | - | - | Free commercial |
| 14 | - | - | - | 66.20 | - | 36.20 | - | - | Free commercial |
| 15 | - | - | - | 64.90 | - | - | - | - | Proprietary |
| 16 | Claude Sonnet 4 | Anthropic | Thinking Enabled | 61.30 | - | - | - | - | Proprietary |
| 17 | M2.1 | MiniMaxAI | Thinking Enabled, Tools | 61.00 | - | 47.90 | - | - | Free commercial |
| 18 | - | - | - | 60.40 | - | - | - | - | Proprietary |
| 19 | - | - | - | 59.10 | - | - | - | - | Free commercial |
| 20 | - | - | - | 56.70 | - | - | - | - | Proprietary |
| 21 | - | - | - | 55.10 | - | - | - | - | Free commercial |
| 22 | GLM-4.7 | Zhipu AI | Thinking Enabled, Tools | 52.10 | 87.40 | 41.00 | - | - | Free commercial |
| 23 | - | - | - | 51.60 | - | - | - | - | Proprietary |
| 24 | - | - | - | 49.80 | - | - | - | - | Free commercial |
| 25 | Qwen3-32B | Alibaba | Thinking Enabled | 40.00 | - | - | - | - | Free commercial |
| 26 | - | - | - | 27.10 | - | - | - | - | Proprietary |
| 27 | Claude Sonnet 4.6 | Anthropic | Thinking Enabled, Tools | - | - | 59.10 | - | 72.50 | Proprietary |
| 28 | Qwen3.6-27B | Alibaba | Thinking Enabled, Tools | - | - | 59.30 | - | - | Free commercial |
| 29 | GPT-5.4 mini | OpenAI | Thinking Level · Extra High, Tools | - | - | 60.00 | 42.90 | 72.10 | Proprietary |
| 30 | Qwen3.6-Max-Preview | Alibaba | Thinking Enabled, Tools | - | - | 61.60 | - | - | Proprietary |
| 31 | - | - | - | - | - | 61.60 | 39.80 | - | Proprietary |
| 32 | Composer 2 | Cursor | Thinking Enabled | - | - | 61.70 | - | - | Proprietary |
| 33 | DeepSeek-V4-Pro | DeepSeek-AI | Thinking Enabled, Tools | - | - | 63.30 | - | - | Free commercial |
| 34 | GLM 5.1 | Zhipu AI | Thinking Enabled, Tools | - | - | 63.50 | 40.70 | - | Free commercial |
| 35 | Qwen3.6-Max-Preview | Alibaba | Deep Thinking Mode, Tools | - | - | 65.40 | - | - | Proprietary |
| 36 | Kimi K2.6 | Moonshot AI | Thinking Enabled, Tools | - | - | 66.70 | 50.00 | 73.10 | Free commercial |
| 37 | DeepSeek-V4-Pro | DeepSeek-AI | Thinking Level · Extra High, Tools | - | - | 67.90 | - | - | Free commercial |
| 38 | Composer 2.5 | Cursor | Thinking Enabled | - | - | 69.30 | - | - | Proprietary |
| 39 | Opus 4.7 | Anthropic | Extended Thinking, Tools | - | - | 69.40 | - | 78.00 | Proprietary |
| 40 | Qwen3.7-Max-Preview | Alibaba | Thinking Enabled, Tools | - | - | 69.70 | - | - | Proprietary |
| 41 | GPT-5.4 | OpenAI | Thinking Level · Extra High, Tools | - | - | 75.10 | - | 75.00 | Proprietary |
| 42 | GPT-5.3 Codex | OpenAI | Standard Mode, Tools | - | - | 77.30 | - | - | Proprietary |
| 43 | Claude Mythos Preview | Anthropic | Extended Thinking, Tools | - | - | 82.00 | - | 79.60 | Proprietary |
| 44 | Claude Sonnet 4 | Anthropic | Thinking Enabled, Tools | - | - | - | - | 42.20 | Proprietary |
| 45 | GPT-5.5 | OpenAI | Thinking Enabled, Tools | - | - | 82.70 | - | 78.70 | Proprietary |
| 46 | Composer 1.5 | Cursor | Thinking Enabled | - | - | 47.90 | - | - | Proprietary |
| 47 | MiniMax M3 | MiniMaxAI | Thinking Enabled, Tools | - | - | - | - | 70.00 | Non-commercial |
| 48 | - | - | - | - | - | - | - | 78.40 | Proprietary |
| 49 | Claude Opus 4.8 | Anthropic | Extended Thinking, Tools | - | - | - | - | 83.40 | Proprietary |
| 50 | Claude Fable 5 | Anthropic | Thinking Enabled, Tools | - | - | - | - | 85.00 | Proprietary |
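
The listing above is ranked by Aider-Polyglot, with models that have no Aider-Polyglot score placed after the scored ones. A minimal sketch of that kind of ordering follows, using a few rows copied from the listing. The function name `rank_by_score` is hypothetical, and the site's actual sorting logic is not documented here, so treat this as one plausible way to reproduce the ordering rather than the page's implementation.

```python
from typing import Optional

# A few rows copied from the listing: (model, Aider-Polyglot score or None).
ROWS: list[tuple[str, Optional[float]]] = [
    ("GPT-5.5", None),
    ("DeepSeek V3.2", 69.90),
    ("o3-pro", 84.90),
    ("Grok 4", 79.60),
]


def rank_by_score(rows):
    """Sort by score, highest first; rows without a score go last (assumed rule)."""
    # The first key element pushes missing scores to the end;
    # the second orders the scored rows from high to low.
    return sorted(rows, key=lambda r: (r[1] is None, -(r[1] or 0.0)))


for rank, (model, score) in enumerate(rank_by_score(ROWS), start=1):
    shown = f"{score:.2f}" if score is not None else "-"
    print(f"{rank:>2}  {model:<14} {shown}")
```

Because Python's sort is stable, rows that share a score (or that all lack one) keep their input order, so any tie-breaking rule would need an extra key element.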
Showing 50 of 100 models.