Open Source LLM Leaderboard

Track benchmark rankings for open-weight and open-source AI models, then compare score, size, and license signals in one place.

Updated on 2026-06-17 07:42:33

Composite Rankings

There is no single, universally agreed-upon comprehensive AI model ranking, so we selected two representative leaderboards that approach the question from different angles. Artificial Analysis Intelligence Index aggregates scores from 10 standardized benchmarks (coding, math, reasoning, etc.) to measure objective capability. LMArena (formerly Chatbot Arena) ranks models by Elo ratings derived from anonymous crowd-sourced A/B voting, reflecting real-world user preference. Together they offer both an objective and a subjective perspective.

AA Intelligence Index

Composite of 10 standardized benchmarks across coding, math, science, reasoning, and agentic tasks.
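As a rough illustration of what a composite of several benchmarks can look like, the sketch below takes an unweighted mean of per-benchmark scores. The equal weighting, the function name `composite_index`, and the example scores are all assumptions made for illustration; the actual aggregation used by Artificial Analysis is not described here and may differ.

```python
from statistics import mean


def composite_index(benchmark_scores: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark scores, rounded to one decimal.

    Assumption: every benchmark is on the same 0-100 scale and counts equally.
    """
    if not benchmark_scores:
        raise ValueError("need at least one benchmark score")
    return round(mean(benchmark_scores.values()), 1)


# Hypothetical scores, not taken from the leaderboard.
example = {"coding": 60.0, "math": 70.0, "reasoning": 65.0}
print(composite_index(example))
```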

Updated 2026-06-13

#    Developer         Model                             Score
1    Anthropic         Claude Fable 5 (with fallback)    65
2    Anthropic         Claude Opus 4.8 (max)             61
5    -                 -                                 57
6    Google DeepMind   Gemini 3.1 Pro Preview            57
8    -                 Qwen3.7 Max                       57
9    Google DeepMind   Gemini 3.5 Flash                  55
10   Google            Gemini 3.5 Flash (medium)         55

LMArena Text Generation

Elo ratings from anonymous crowdsourced A/B voting, reflecting real user preference for response quality.
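To make the Elo idea concrete, here is a minimal sketch of the classic Elo model: an expected win probability from two ratings, and a rating update after one A/B vote. This is the textbook formula, shown as an assumption about how such ratings behave; LMArena's own fitting procedure may differ. The K-factor of 32 and the function names are illustrative choices.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after a single pairwise vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    # Points gained by one side are lost by the other.
    return rating_a + delta, rating_b - delta


# Ratings 1510 and 1481 appear in the table on this page (ranks 1 and 10).
p = expected_score(1510, 1481)
print(f"{p:.2f}")  # roughly 0.54
```

Under this model, a gap of about 30 Elo points corresponds to only a modest preference margin, which is worth keeping in mind when comparing neighboring ranks.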

Updated 2026-06-10

#    Developer         Model                       Elo
1    Anthropic         claude-fable-5              1510
3    -                 -                           1502
4    -                 -                           1498
5    Anthropic         Opus 4.7                    1492
6    -                 Muse Spark                  1487
7    Google DeepMind   Gemini 3.1 Pro Preview      1487
9    Anthropic         claude-opus-4-8-thinking    1486
10   -                 -                           1481

Source: LMArena

Leading model developers

View all 99 organizations

Jump to a developer to explore its full model lineup, series, and product lines.

Today's picks (rotates daily; discover more labs)

Per-Benchmark Rankings

Filter by math, coding, agent, and more. Switch benchmarks below or jump into a category leaderboard for the full ranking.

Recommended models

Ranked by FrontierMath

LLM Performance Results

Data source: DataLearnerAI

Click any row to open the model page. Tick the checkboxes to compare up to 4 models side by side. Scores shown are the best result across all evaluation modes.
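The "best result across all evaluation modes" rule can be sketched as taking, for each benchmark, the maximum score over every mode that reported one. The mode names, the data layout, and the numbers below are hypothetical; they only illustrate the selection rule, not the site's actual data model.

```python
from typing import Optional


def best_scores(results: dict[str, dict[str, Optional[float]]]) -> dict[str, float]:
    """Collapse {mode: {benchmark: score}} into {benchmark: best score}.

    Missing results (None) are skipped rather than treated as zero.
    """
    best: dict[str, float] = {}
    for mode_scores in results.values():
        for benchmark, score in mode_scores.items():
            if score is None:
                continue
            if benchmark not in best or score > best[benchmark]:
                best[benchmark] = score
    return best


# Hypothetical evaluation modes and scores for one model.
results = {
    "thinking": {"HLE": 20.0, "SWE-bench Verified": 60.0},
    "non-thinking": {"HLE": 12.0, "SWE-bench Verified": 65.0, "τ²-Bench": None},
}
print(best_scores(results))
```

A benchmark with no reported score in any mode is simply absent from the result, which corresponds to an empty cell in the table.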

HLE     ARC-AGI-2   FrontierMath - Tier 4   SWE-bench Verified   τ²-Bench   License
4.70    -           0.01                    51.80                64.30      Free commercial
-       -           -                       -                    -          Free commercial
-       -           -                       -                    -          Free commercial
-       -           -                       -                    -          Free commercial
9.80    -           -                       22.00                49.00      Free commercial
17.20   -           -                       -                    68.20      Free commercial
20.30   -           -                       67.80                66.70      Free commercial
8.40    -           -                       56.00                -          Free commercial
45.10   -           -                       79.00                -          Free commercial
48.20   -           -                       80.60                -          Free commercial
7.60    -           -                       34.40                34.40      Free commercial
7.20    -           -                       55.60                -          Free commercial
14.40   -           -                       59.20                79.50      Free commercial
54.70   -           -                       -                    -          Free commercial
54.00   -           -                       80.20                -          Free commercial
52.30   -           -                       -                    -          Free commercial
51.00   -           -                       71.30                -          Free commercial
50.40   4.90        2.10                    77.80                89.70      Free commercial
50.20   11.80       4.20                    76.80                -          Free commercial
30.40   -           2.10                    68.00                75.90      Free commercial
5.20    -           -                       38.80                38.80      Free commercial
48.50   -           -                       72.40                79.00      Free commercial
48.30   -           -                       76.40                86.70      Free commercial
42.80   -           2.10                    73.80                87.40      Free commercial
30.60   -           -                       -                    -          Free commercial
28.00   -           -                       -                    -          Non-commercial
26.50   -           -                       -                    76.90      Free commercial
25.10   4.00        2.10                    73.10                80.30      Free commercial
24.00   -           -                       77.20                -          Free commercial
22.00   -           -                       74.80                -          Free commercial
21.70   -           -                       68.40                37.00      Free commercial
21.70   -           -                       69.20                -          Free commercial
21.40   -           -                       73.40                -          Free commercial
19.40   4.90        -                       80.20                -          Free commercial
19.00   -           -                       60.10                -          Free commercial
18.20   -           -                       -                    -          Free commercial
18.20   -           -                       -                    -          Free commercial
17.70   1.30        -                       57.60                -          Free commercial
17.30   -           -                       34.00                47.70      Free commercial
15.90   -           -                       66.00                -          Free commercial
14.40   -           -                       64.20                -          Free commercial
12.50   -           -                       69.40                77.20      Free commercial
10.60   -           -                       57.60                -          Free commercial
-       -           -                       -                    -          Free commercial
-       -           -                       -                    -          Free commercial
-       -           -                       -                    -          Non-commercial
-       -           -                       -                    -          Free commercial
-       -           -                       -                    -          Free commercial
-       -           -                       -                    -          Free commercial
-       -           -                       -                    -          Free commercial
Showing 50 of 105 models

Leaderboard FAQ

01

Which open-source models appear on this leaderboard?

The leaderboard tracks open-weight or publicly available models — including Llama, Qwen, DeepSeek, Mistral, GLM, and other releases whose weights or code are available under tracked licenses. It may include permissive, non-commercial, or otherwise restricted licenses; closed-weight API-only models such as GPT or Claude are excluded here.

02

Why do scores for the same model differ across benchmarks?

Each benchmark measures a different capability — reasoning (HLE, ARC-AGI-2), math (AIME, FrontierMath), coding (SWE-bench Verified), agent use (τ²-Bench), and so on. A model tuned for one capability may perform very differently on another, which is exactly why we surface per-benchmark scores rather than a single number.

03

How often is the leaderboard updated?

Data is revalidated every 5 minutes, and new models or evaluation results are added as soon as they are published. The "Updated on" indicator at the top of the page reflects the most recent data refresh.

04

How should I read the composite ranking?

The composite view aggregates a model's standing across multiple core benchmarks. It is a useful first filter, but for production decisions you should drill into the specific benchmark closest to your workload — for example, SWE-bench Verified for coding agents, or τ²-Bench for tool-use scenarios.
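Drilling into one benchmark amounts to ranking models by that column alone and setting aside models with no reported score. The sketch below shows one way to do that; the row layout, the key names, and the model names are placeholders invented for illustration, not entries from the leaderboard.

```python
def rank_by(rows: list[dict], benchmark: str) -> list[dict]:
    """Return rows that have a score for `benchmark`, highest score first."""
    scored = [row for row in rows if row.get(benchmark) is not None]
    return sorted(scored, key=lambda row: row[benchmark], reverse=True)


# Placeholder models with hypothetical scores; None marks a missing result.
rows = [
    {"model": "model-a", "swe_bench_verified": 72.0, "tau2_bench": 79.0},
    {"model": "model-b", "swe_bench_verified": 80.0, "tau2_bench": None},
    {"model": "model-c", "swe_bench_verified": None, "tau2_bench": 87.0},
]

for row in rank_by(rows, "swe_bench_verified"):
    print(row["model"], row["swe_bench_verified"])
```

Ranking the same rows by `tau2_bench` gives a different order and a different set of models, which is the practical reason to pick the benchmark closest to the workload.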

05

Can I run these open-source models locally?

Most listed models publish weights on Hugging Face or GitHub and can be served via vLLM, Ollama, llama.cpp, or similar runtimes. Hardware requirements scale with parameter count — a 7B model fits on a single consumer GPU, while 65B+ models typically need multi-GPU or quantized deployment.
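A back-of-the-envelope way to see why hardware needs scale with parameter count is to estimate the memory taken by the weights alone: parameters times bits per parameter. The sketch below is a simplification that ignores the KV cache, activations, and runtime overhead, so real requirements are higher; the function name and the chosen precisions are illustrative assumptions.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


for size in (7, 65):
    for bits in (16, 4):
        gb = weight_memory_gb(size, bits)
        print(f"{size}B at {bits}-bit: ~{gb:.1f} GB for weights")
```

By this estimate a 7B model needs about 14 GB at 16-bit precision and about 3.5 GB at 4-bit, while a 65B model needs about 130 GB at 16-bit and about 32.5 GB at 4-bit, which is why larger models usually call for multiple GPUs or quantization.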

Explore more

The leaderboard covers benchmarked models. Browse the full catalog by model, organization, or benchmark.