AI Model Leaderboards
Live rankings across ARC-AGI-2, HLE, AIME 2025, SWE-bench Verified, and more — browse composite scores or drill into math, coding, and agent categories.
As of 2026-05, the AA Intelligence Index leaders are GPT-5.5 (xhigh), GPT-5.5 (high), and Opus 4.7 (max), based on 10 standardized capability benchmarks.
On the user-preference side, LMArena Text Generation currently ranks Opus 4.7 (thinking), Claude Opus 4.6 (thinking), and Claude Opus 4.6 near the top via anonymous A/B voting.
Scroll down for per-benchmark breakdowns in math, coding, and agent categories. See Data Methodology for scoring details, or browse LLM Blogs for in-depth commentary.
Composite Rankings
There is no single, universally agreed-upon comprehensive AI model ranking, so we selected two representative leaderboards that approach the question from different angles. Artificial Analysis Intelligence Index aggregates scores from 10 standardized benchmarks (coding, math, reasoning, etc.) to measure objective capability. LMArena (formerly Chatbot Arena) ranks models by Elo ratings derived from anonymous crowd-sourced A/B voting, reflecting real-world user preference. Together they offer both an objective and a subjective perspective.
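The Elo mechanism behind LMArena-style rankings can be sketched as follows. This is an illustrative toy, not LMArena's actual implementation; the K-factor and starting ratings are assumptions for the example.

```python
# Toy Elo update for anonymous A/B votes, in the spirit of crowd-sourced
# model rankings. K-factor and 1000-point starting rating are assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Example: two models start at 1000; model A wins one vote.
ra, rb = elo_update(1000.0, 1000.0, a_won=True)  # -> (1016.0, 984.0)
```

Ratings converge as votes accumulate, which is why arena rankings reflect aggregate preference rather than any single benchmark score.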
AA Intelligence Index
Full ranking
Composite of 10 standardized benchmarks across coding, math, science, reasoning, and agentic tasks.
Updated 2026-05-10
Per-Benchmark Rankings
Filter by math, coding, agent, and more. Switch benchmarks below or jump into a category leaderboard for the full ranking. View all benchmarks.