
LLM Agent Benchmark Leaderboard

This page provides the LLM Agent benchmark leaderboard, covering Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. Compare GPT, Claude, Qwen, and DeepSeek on tool use, task planning, and autonomous execution.

Updated on 2026-06-13 11:57:39

As of 2026-06, this page covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related agent benchmarks, so models can be compared within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.

Top picks

Ranked by Aider-Polyglot

LLM Performance Results

Data source: DataLearnerAI


# | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License
1 | 40.00          | -        | -                  | -              | -                | Free commercial
2 | -              | 86.70    | 52.50              | 38.30          | 62.20            | Free commercial
3 | -              | 79.50    | -                  | -              | -                | Free commercial
4 | -              | 79.00    | 41.60              | -              | 56.20            | Free commercial
5 | -              | 49.00    | -                  | -              | -                | Free commercial
6 | -              | 47.70    | -                  | -              | -                | Free commercial
7 | -              | -        | 59.30              | -              | -                | Free commercial
8 | -              | -        | 51.50              | 26.90          | -                | Free commercial
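
The listing is ranked by Aider-Polyglot, yet only the first entry reports a score for that benchmark, so re-ranking by another column means deciding what to do with missing values. The following is a minimal Python sketch of one reasonable approach, not the site's actual sorting logic: it orders entries by a chosen benchmark in descending order and puts entries without a score last. The scores are copied from the listing above; the `row N` labels and the `rank_by` helper are hypothetical, since the listing carries no model names.

```python
# Scores copied from the listing above. Only reported scores are stored;
# a benchmark absent from a row's dict means no score was reported.
# The "row N" labels are placeholders, not model names.
ROWS = [
    ("row 1", {"Aider-Polyglot": 40.00}),
    ("row 2", {"τ²-Bench": 86.70, "Terminal Bench 2.0": 52.50,
               "Tool Decathlon": 38.30, "OSWorld-Verified": 62.20}),
    ("row 3", {"τ²-Bench": 79.50}),
    ("row 4", {"τ²-Bench": 79.00, "Terminal Bench 2.0": 41.60,
               "OSWorld-Verified": 56.20}),
    ("row 5", {"τ²-Bench": 49.00}),
    ("row 6", {"τ²-Bench": 47.70}),
    ("row 7", {"Terminal Bench 2.0": 59.30}),
    ("row 8", {"Terminal Bench 2.0": 51.50, "Tool Decathlon": 26.90}),
]


def rank_by(benchmark):
    """Return row labels sorted by descending score on `benchmark`.

    Rows with no reported score sort after all scored rows and keep
    their original relative order (Python's sort is stable).
    """
    def sort_key(item):
        _, scores = item
        score = scores.get(benchmark)
        # False sorts before True, so scored rows come first;
        # negating the score gives descending order among them.
        return (score is None, -(score or 0.0))

    return [label for label, _ in sorted(ROWS, key=sort_key)]


if __name__ == "__main__":
    for name in ("Aider-Polyglot", "τ²-Bench", "Terminal Bench 2.0"):
        print(name, rank_by(name))
```

Treating a missing score as "last" rather than as zero keeps an unreported result from being read as a poor one.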