
LLM Agent Benchmark Leaderboard

This page provides the LLM Agent benchmark leaderboard, covering Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. Compare GPT, Claude, Qwen, and DeepSeek on tool use, task planning, and autonomous execution.

Updated on 2026-06-13 11:57:39

As of 2026-06, the leaderboard reports scores on Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related agent benchmarks, so models can be compared within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.

Top picks

Ranked by Terminal Bench 2.0

LLM Performance Results

Data source: DataLearnerAI


| Rank | Model | Developer | Mode | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | Thinking Enabled, Tools | - | - | 82.70 | - | 78.70 | Proprietary |
| 2 | Claude Mythos Preview | Anthropic | Extended Thinking, Tools | - | - | 82.00 | - | 79.60 | Proprietary |
| 3 | GPT-5.3 Codex | OpenAI | Standard Mode, Tools | - | - | 77.30 | - | - | Proprietary |
| 4 | GPT-5.4 | OpenAI | Thinking Level · Extra High, Tools | - | - | 75.10 | - | 75.00 | Proprietary |
| 5 | Qwen3.7-Max-Preview | Alibaba | Thinking Enabled, Tools | - | - | 69.70 | - | - | Proprietary |
| 6 | Opus 4.7 | Anthropic | Extended Thinking, Tools | - | - | 69.40 | - | 78.00 | Proprietary |
| 7 | Composer 2.5 | Cursor | Thinking Enabled | - | - | 69.30 | - | - | Proprietary |
| 8 | (not shown) | - | - | - | 90.80 | 68.50 | - | - | Proprietary |
| 9 | DeepSeek-V4-Pro | DeepSeek-AI | Thinking Level · Extra High, Tools | - | - | 67.90 | - | - | Free commercial |
| 10 | Kimi K2.6 | Moonshot AI | Thinking Enabled, Tools | - | - | 66.70 | 50.00 | 73.10 | Free commercial |
| 11 | Qwen3.6-Max-Preview | Alibaba | Deep Thinking Mode, Tools | - | - | 65.40 | - | - | Proprietary |
| 12 | Claude Opus 4.6 | Anthropic | Extended Thinking, Tools | - | 91.89 | 65.40 | - | 72.70 | Proprietary |
| 13 | GLM 5.1 | Zhipu AI | Thinking Enabled, Tools | - | - | 63.50 | 40.70 | - | Free commercial |
| 14 | DeepSeek-V4-Pro | DeepSeek-AI | Thinking Enabled, Tools | - | - | 63.30 | - | - | Free commercial |
| 15 | Composer 2 | Cursor | Thinking Enabled | - | - | 61.70 | - | - | Proprietary |
| 16 | Qwen3.6-Max-Preview | Alibaba | Thinking Enabled, Tools | - | - | 61.60 | - | - | Proprietary |
| 17 | (not shown) | - | - | - | - | 61.60 | 39.80 | - | Proprietary |
| 18 | GLM-5 | Zhipu AI | Thinking Enabled, Tools | - | 89.70 | 61.10 | - | - | Free commercial |
| 19 | GPT-5.4 mini | OpenAI | Thinking Level · Extra High, Tools | - | - | 60.00 | 42.90 | 72.10 | Proprietary |
| 20 | Qwen3.6-27B | Alibaba | Thinking Enabled, Tools | - | - | 59.30 | - | - | Free commercial |
| 21 | Opus 4.5 | Anthropic | Extended Thinking, Tools | - | 81.99 | 59.30 | - | - | Proprietary |
| 22 | Claude Sonnet 4.6 | Anthropic | Thinking Enabled, Tools | - | - | 59.10 | - | 72.50 | Proprietary |
| 23 | DeepSeek-V4-Pro | DeepSeek-AI | Standard Mode, Tools | - | - | 59.10 | - | - | Free commercial |
| 24 | (not shown) | - | - | - | - | 59.00 | - | - | Proprietary |
| 25 | DeepSeek-V4-Flash | DeepSeek-AI | Thinking Level · Extra High, Tools | - | - | 56.90 | - | - | Free commercial |
| 26 | (not shown) | - | - | - | - | 56.90 | - | - | Proprietary |
| 27 | DeepSeek-V4-Flash | DeepSeek-AI | Thinking Enabled, Tools | - | - | 56.60 | - | - | Free commercial |
| 28 | (not shown) | - | - | - | 85.40 | 54.20 | - | - | Proprietary |
| 29 | Qwen3.5-397B-A17B | Alibaba | Thinking Enabled, Tools | - | 86.70 | 52.50 | 38.30 | 62.20 | Free commercial |
| 30 | MiniMax M2.5 | MiniMaxAI | Thinking Enabled, Tools | - | - | 51.70 | - | - | Free commercial |
| 31 | (not shown) | - | - | - | - | 51.50 | 26.90 | - | Free commercial |
| 32 | Step 3.5 Flash | StepFunAI | Thinking Enabled, Tools | - | 88.20 | 51.00 | - | - | Free commercial |
| 33 | Kimi K2.5 | Moonshot AI | Thinking Enabled, Tools | - | - | 50.80 | - | - | Free commercial |
| 34 | (not shown) | - | - | - | - | 49.10 | - | - | Free commercial |
| 35 | M2.1 | MiniMaxAI | Thinking Enabled, Tools | 61.00 | - | 47.90 | - | - | Free commercial |
| 36 | Composer 1.5 | Cursor | Thinking Enabled | - | - | 47.90 | - | - | Proprietary |
| 37 | GPT-5.1 | OpenAI | Thinking Enabled, Tools | - | - | 47.60 | - | - | Proprietary |
| 38 | (not shown) | - | - | - | 90.20 | 47.60 | - | - | Proprietary |
| 39 | DeepSeek V3.2 | DeepSeek-AI | Thinking Enabled, Tools | 69.90 | 80.30 | 46.40 | - | - | Free commercial |
| 40 | GPT-5.4 nano | OpenAI | Thinking Level · Extra High, Tools | - | - | 46.30 | 35.50 | 39.00 | Proprietary |
| 41 | Claude Sonnet 4.5 | Anthropic | Thinking Enabled, Tools | - | 84.70 | 42.80 | - | 61.40 | Proprietary |
| 42 | Qwen3.5-27B | Alibaba | Thinking Enabled, Tools | - | 79.00 | 41.60 | - | 56.20 | Free commercial |
| 43 | GLM-4.7 | Zhipu AI | Thinking Enabled, Tools | 52.10 | 87.40 | 41.00 | - | - | Free commercial |
| 44 | Composer 1 | Cursor | Thinking Enabled | - | - | 40.00 | - | - | Proprietary |
| 45 | (not shown) | - | - | 66.20 | - | 36.20 | - | - | Free commercial |
| 46 | Gemini 2.5-Pro | Google Deep Mind | Thinking Enabled, Tools | - | - | 32.60 | - | - | Proprietary |
| 47 | o3-pro | OpenAI | Thinking Level · High | 84.90 | - | - | - | - | Proprietary |
| 48 | (not shown) | - | - | 83.10 | - | - | - | - | Proprietary |
| 49 | OpenAI o3 | OpenAI | Thinking Level · High | 81.30 | - | - | - | - | Proprietary |
| 50 | Grok 4 | xAI | Thinking Enabled | 79.60 | - | - | - | - | Proprietary |
Showing 50 of 100 models.
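
The listing is ranked by Terminal Bench 2.0, and models without a Terminal Bench 2.0 score appear at the bottom. A minimal Python sketch of that kind of ordering follows. The four sample rows use scores from this page, but the `rank_by_score` helper and the rule for rows with no score (placed last, in input order) are assumptions made for illustration, not the site's documented method.

```python
from typing import List, Optional, Tuple

# (model name, Terminal Bench 2.0 score or None when no score is listed)
Row = Tuple[str, Optional[float]]

rows: List[Row] = [
    ("GPT-5.4", 75.10),
    ("o3-pro", None),  # no Terminal Bench 2.0 score listed
    ("GPT-5.5", 82.70),
    ("Claude Mythos Preview", 82.00),
]


def rank_by_score(rows: List[Row]) -> List[Row]:
    """Sort by score, highest first; rows with no score go last.

    sorted() is stable, so rows with equal keys keep their input order.
    """
    return sorted(rows, key=lambda r: (r[1] is None, -(r[1] or 0.0)))


for position, (model, score) in enumerate(rank_by_score(rows), start=1):
    shown = f"{score:.2f}" if score is not None else "-"
    print(position, model, shown)
```

Sorting on the pair `(is_missing, -score)` keeps the missing-score rule explicit instead of relying on a sentinel value such as 0, which could be confused with a real score.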