AI Model Leaderboards

Live rankings across ARC-AGI-2, HLE, AIME 2025, SWE-bench Verified, and more — browse composite scores or drill into math, coding, and agent categories.

Updated on 2026-06-17 07:42:33

As of 2026-06, the AA Intelligence Index leaders include Claude Fable 5 (with fallback), Claude Opus 4.8 (max), and GPT-5.5 (xhigh), based on 10 standardized capability benchmarks.

On the user-preference side, LMArena Text Generation currently ranks claude-fable-5, Claude Opus 4.6 (thinking), and Opus 4.7 (thinking) near the top, based on anonymous A/B voting.

Scroll down for per-benchmark breakdowns in math, coding, and agent categories. See Data Methodology for scoring details, or browse LLM Blogs for in-depth commentary.

Composite Rankings

There is no single, universally agreed-upon comprehensive ranking of AI models, so we selected two representative leaderboards that approach the question from different angles. The Artificial Analysis (AA) Intelligence Index aggregates scores from 10 standardized benchmarks (coding, math, reasoning, etc.) to measure objective capability. LMArena (formerly Chatbot Arena) ranks models by Elo ratings derived from anonymous crowdsourced A/B voting, reflecting real-world user preference. Together they offer both an objective and a subjective perspective.
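To make the Elo idea concrete, here is a minimal sketch of a textbook Elo update after one A/B vote. The K-factor of 32 and the 400-point scale are conventional defaults assumed for illustration; the page does not say how LMArena fits its ratings, so this should not be read as its actual procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after a single A/B vote.

    k=32 is an assumed, conventional K-factor, not a value from the page.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; a voter prefers A.
new_a, new_b = update_elo(1500.0, 1500.0, a_won=True)
print(new_a, new_b)  # 1516.0 1484.0
```

A vote between two equally rated models moves each rating by half the K-factor, while an upset win over a much higher-rated model moves the ratings by more.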

AA Intelligence Index


Composite of 10 standardized benchmarks across coding, math, science, reasoning, and agentic tasks.

Updated 2026-06-13

# | Developer | Model | Score
1 | Anthropic | Claude Fable 5 (with fallback) | 65
2 | Anthropic | Claude Opus 4.8 (max) | 61
5 | - | - | 57
6 | Google DeepMind | Gemini 3.1 Pro Preview | 57
8 | - | Qwen3.7 Max | 57
9 | Google DeepMind | Gemini 3.5 Flash | 55
10 | Google | Gemini 3.5 Flash (medium) | 55

LMArena Text Generation


Elo ratings from anonymous crowdsourced A/B voting, reflecting real user preference for response quality.

Updated 2026-06-10

# | Developer | Model | Elo
1 | Anthropic | claude-fable-5 | 1510
3 | - | - | 1502
4 | - | - | 1498
5 | Anthropic | Opus 4.7 | 1492
6 | - | Muse Spark | 1487
7 | Google DeepMind | Gemini 3.1 Pro Preview | 1487
9 | Anthropic | claude-opus-4-8-thinking | 1486
10 | - | - | 1481
Source: LMArena

Leading model developers

View all 99 organizations

Jump to a developer to explore its full model lineup, series, and product lines.

Today's picks (rotates daily) · Discover more labs

Per-Benchmark Rankings

Filter by math, coding, agent, and more. Switch benchmarks below or jump into a category leaderboard for the full ranking. View all benchmarks.

Recommended models

Ranked by HLE

LLM Performance Results

Data source: DataLearnerAI

Click any row to open the model page. Tick the checkboxes to compare up to 4 models side by side. Scores shown are the best result across all evaluation modes.
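The "best result across all evaluation modes" rule can be sketched as a simple reduction. The model names, mode labels, and scores below are hypothetical placeholders, and the record layout is an assumption made for illustration, not the site's actual data model.

```python
from collections import defaultdict

# Hypothetical raw results: (model, benchmark, evaluation mode, score).
results = [
    ("model-a", "HLE", "no tools", 41.0),
    ("model-a", "HLE", "with tools", 53.0),
    ("model-a", "SWE-bench Verified", "default", 80.6),
    ("model-b", "HLE", "thinking", 48.4),
]


def best_per_benchmark(rows):
    """Keep the highest score per (model, benchmark), ignoring the mode."""
    best = defaultdict(dict)
    for model, benchmark, _mode, score in rows:
        current = best[model].get(benchmark)
        if current is None or score > current:
            best[model][benchmark] = score
    return {model: dict(scores) for model, scores in best.items()}


table = best_per_benchmark(results)
print(table["model-a"]["HLE"])  # 53.0
```

One consequence of this rule is that two scores in the same row may come from different evaluation modes, so a single row does not necessarily describe one configuration of a model.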

HLE | ARC-AGI-2 | FrontierMath - Tier 4 | SWE-bench Verified | τ²-Bench | License
64.70 | - | - | 93.90 | - | Proprietary
59.00 | - | - | 95.00 | - | Proprietary
58.70 | 83.30 | 38.00 | - | - | Proprietary
58.00 | 42.50 | 14.60 | 77.40 | - | Proprietary
57.90 | - | - | 88.60 | - | Proprietary
57.20 | 84.60 | 39.60 | - | - | Proprietary
54.70 | 75.80 | 22.90 | 87.60 | - | Proprietary
54.70 | - | - | - | - | Free commercial
54.00 | - | - | 80.20 | - | Free commercial
53.50 | - | - | 80.40 | - | Proprietary
53.00 | 66.30 | 22.90 | 80.84 | 91.89 | Proprietary
52.30 | - | - | - | - | Free commercial
52.20 | 85.00 | 35.40 | - | - | Proprietary
52.10 | 77.10 | 27.10 | - | - | Proprietary
51.40 | 77.10 | 16.70 | 80.60 | 90.80 | Proprietary
51.00 | - | - | 71.30 | - | Free commercial
50.60 | - | - | 78.80 | - | Proprietary
50.40 | 4.90 | 2.10 | 77.80 | 89.70 | Free commercial
50.20 | 11.80 | 4.20 | 76.80 | - | Free commercial
50.20 | - | - | 78.80 | - | Proprietary
50.00 | 54.20 | 31.30 | - | - | Proprietary
49.80 | - | - | 75.30 | 82.10 | Proprietary
49.00 | 58.30 | 8.30 | 79.60 | - | Proprietary
48.50 | - | - | 72.40 | 79.00 | Free commercial
48.40 | 84.60 | - | - | - | Proprietary
48.30 | - | - | 76.40 | 86.70 | Free commercial
48.20 | - | - | 80.60 | - | Free commercial
45.80 | 45.10 | 18.80 | 76.20 | 85.40 | Proprietary
45.50 | 54.20 | 18.80 | 80.00 | 82.00 | Proprietary
45.10 | - | - | 79.00 | - | Free commercial
44.40 | - | 2.10 | 73.50 | - | Proprietary
43.50 | 33.60 | 4.20 | 68.70 | 90.20 | Proprietary
43.20 | 37.60 | 4.20 | 80.90 | 81.99 | Proprietary
42.80 | - | 2.10 | 73.80 | 87.40 | Free commercial
42.70 | 17.60 | 12.50 | 76.30 | - | Proprietary
42.00 | 18.00 | 14.60 | - | - | Proprietary
41.50 | - | 2.10 | - | - | Proprietary
40.20 | 72.10 | - | - | - | Proprietary
38.60 | 15.90 | 2.10 | 58.60 | - | Proprietary
37.70 | - | 6.30 | - | - | Proprietary
35.20 | 9.90 | 12.50 | 72.80 | 80.00 | Proprietary
34.80 | - | 10.40 | - | - | Proprietary
33.60 | 13.60 | 4.20 | 82.00 | 84.70 | Proprietary
30.60 | - | - | - | - | Free commercial
30.40 | - | 2.10 | 68.00 | 75.90 | Free commercial
28.00 | - | - | - | - | Non-commercial
26.50 | - | - | - | 76.90 | Free commercial
25.10 | 4.00 | 2.10 | 73.10 | 80.30 | Free commercial
24.00 | - | - | 77.20 | - | Free commercial
22.00 | - | - | 74.80 | - | Free commercial
Showing 50 of 211 models. View HLE benchmark page.

Leaderboard FAQ

01

Where does the leaderboard data come from?

Scores are aggregated from primary sources: official model cards, technical reports, papers, vendor blog posts, and reproducible third-party evaluations. Each row links back to the underlying model detail page where the source is cited.

02

Why do scores for the same model differ across benchmarks?

Each benchmark measures a different capability — reasoning (HLE, ARC-AGI-2), math (AIME, FrontierMath), coding (SWE-bench Verified), agent use (τ²-Bench), and so on. A model tuned for one capability may perform very differently on another, which is exactly why we surface per-benchmark scores rather than a single number.

03

How often is the leaderboard updated?

Data is revalidated every 5 minutes, and new models or evaluation results are added as soon as they are published. The "Updated on" indicator at the top of the page reflects the most recent data refresh.

04

How should I read the composite ranking?

The composite view aggregates a model's standing across multiple core benchmarks. It is a useful first filter, but for production decisions you should drill into the specific benchmark closest to your workload — for example, SWE-bench Verified for coding agents, or τ²-Bench for tool-use scenarios.
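As a rough illustration of why the composite is only a first filter, the sketch below ranks three hypothetical models by an unweighted mean and then by individual benchmarks. The equal weighting, the model names, and the scores are all assumptions; the page does not specify how its composite is computed.

```python
from statistics import mean

# Hypothetical models; None marks a benchmark with no published score.
scores = {
    "model-a": {"HLE": 53.0, "SWE-bench Verified": 80.0, "τ²-Bench": 92.0},
    "model-b": {"HLE": 58.0, "SWE-bench Verified": 77.0, "τ²-Bench": None},
    "model-c": {"HLE": 45.0, "SWE-bench Verified": 86.0, "τ²-Bench": 82.0},
}


def composite(model_scores):
    """Unweighted mean over the benchmarks that have a score (an assumption)."""
    available = [s for s in model_scores.values() if s is not None]
    return mean(available)


def rank_by(benchmark):
    """Models that have a score on `benchmark`, best first."""
    scored = [
        (model, s[benchmark])
        for model, s in scores.items()
        if s.get(benchmark) is not None
    ]
    return [model for model, _ in sorted(scored, key=lambda p: p[1], reverse=True)]


by_composite = sorted(scores, key=lambda m: composite(scores[m]), reverse=True)
by_coding = rank_by("SWE-bench Verified")
by_tools = rank_by("τ²-Bench")

print(by_composite)  # ['model-a', 'model-c', 'model-b']
print(by_coding)     # ['model-c', 'model-a', 'model-b']
print(by_tools)      # ['model-a', 'model-c']
```

In this toy data the composite leader is not the coding leader, and a model with a missing score drops out of the tool-use ranking entirely, which is the situation the per-benchmark view is meant to expose.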

05

How do I compare an open-source model with a closed API model?

Use the license filter at the top to mix open and closed models in the same view, then look at the same benchmark column for both. Beyond raw scores, consider total cost of ownership: API pricing for closed models vs. self-hosting cost for open weights.
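A back-of-the-envelope version of the cost comparison might look like the following. Every number here (token volumes, per-million-token prices, GPU count, hourly rate, overhead) is a hypothetical placeholder, and the two cost functions are a simplified model that ignores factors such as utilization, engineering time, and latency requirements.

```python
def api_monthly_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Monthly API bill from token volume and per-million-token prices."""
    return (input_tokens / 1e6) * usd_per_m_input + (output_tokens / 1e6) * usd_per_m_output


def self_host_monthly_cost(gpu_count, usd_per_gpu_hour, hours=730, fixed_overhead_usd=0.0):
    """Monthly self-hosting bill: GPU rental for ~730 hours plus fixed overhead."""
    return gpu_count * usd_per_gpu_hour * hours + fixed_overhead_usd


# Hypothetical workload: 2B input and 0.5B output tokens per month.
api = api_monthly_cost(
    2_000_000_000, 500_000_000, usd_per_m_input=3.0, usd_per_m_output=15.0
)
# Hypothetical deployment: 8 GPUs at $2/hour plus $1,000 of monthly overhead.
hosted = self_host_monthly_cost(gpu_count=8, usd_per_gpu_hour=2.0, fixed_overhead_usd=1000.0)

print(api, hosted)  # 13500.0 12680.0
```

The break-even point shifts with volume: API cost scales with tokens, while the self-hosted figure is mostly fixed, so low-traffic workloads tend to favor the API side of this simplified model.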

Explore more

The leaderboard covers benchmarked models. Browse the full catalog by model, organization, or benchmark.