LMArena Math Arena Leaderboard

The latest AI math reasoning leaderboard, based on anonymous user voting in LMArena Math Arena. It covers Elo scores, confidence intervals, and vote counts for Claude, GPT, Gemini, DeepSeek, Qwen, and more.

Top Model

Kimi K2.6

Top Score

1483.00

Model Count

356

Data version

June 16, 2026

Data source: LMArena

About This Leaderboard

This leaderboard ranks AI models by mathematical reasoning ability. Data comes from LMArena's Math sub-track, evaluated through anonymous blind testing by real users on math problem-solving tasks.

Methodology Overview

Blind testing: Users submit math problems, two anonymous models provide solutions, and users vote for the better answer — eliminating brand bias.

Elo scoring: Scores are computed with the Bradley-Terry model and reported on an Elo-style scale. A higher score means users more often prefer that model's math solutions.

Broad scenario coverage: Testing spans algebra, geometry, calculus, competition math, and other real-world math tasks.
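
As a rough illustration of the Elo scoring step above, the sketch below fits Bradley-Terry strengths to a list of pairwise votes and maps them onto an Elo-like scale. It is not LMArena's actual pipeline: the function names, the simple fixed-point update, the base of 1000, and the scale of 400 are assumptions chosen for illustration, and ties are ignored.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iters=500):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative only: uses the classic fixed-point (MM) update and
    assumes every model has at least one win and the comparison graph
    is connected.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(lambda: defaultdict(int))
    for winner, loser in votes:
        wins[winner] += 1
        wins[loser] += 0  # make sure the loser is registered too
        pair_counts[winner][loser] += 1
        pair_counts[loser][winner] += 1

    if any(count == 0 for count in wins.values()):
        raise ValueError("this simple fit needs at least one win per model")

    strength = {model: 1.0 for model in wins}
    for _ in range(iters):
        updated = {}
        for i in strength:
            # Sum over opponents j of n_ij / (p_i + p_j)
            denom = sum(
                n / (strength[i] + strength[j])
                for j, n in pair_counts[i].items()
            )
            updated[i] = wins[i] / denom
        # Normalize so the geometric mean of the strengths is 1
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        norm = math.exp(log_mean)
        strength = {model: s / norm for model, s in updated.items()}
    return strength


def to_elo_scale(strength, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale (base and scale are assumed)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical votes: model "A" is preferred 3 times, model "B" once.
votes = [("A", "B")] * 3 + [("B", "A")]
ratings = to_elo_scale(fit_bradley_terry(votes))
```

With three votes for A and one for B, the fitted strengths have a 3:1 ratio, so A lands 400 * log10(3), roughly 191 points, above B on this assumed scale.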

DataLearner provides in-depth analysis on top of the raw data, linking leaderboard models to the DataLearner model database so you can quickly access model details, API pricing, benchmark scores, and more.

Origin: All | China

Ranking Table

| Rank | Model | Score | 95% CI | Votes | Organization | License |
|---|---|---|---|---|---|---|
| 15 | Kimi K2.6 | 1483.00 | +/-16 | 1,372 | Moonshot AI | Modified MIT |
| 18 | DeepSeek-V4-Pro (thinking) | 1477.00 | +/-16 | 1,391 | DeepSeek-AI | MIT |
| 22 | Kimi K2 Thinking | 1472.00 | +/-11 | 2,818 | Moonshot AI | Modified MIT |
| 31 | minimax-m3 | 1461.00 | +/-26 | 556 | MiniMax | Proprietary |
| 51 | Kimi K2.5 Instant | 1442.00 | +/-25 | 513 | Moonshot AI | Modified MIT |
| 57 | Kimi K2 Thinking (thinking-turbo) | 1438.00 | +/-10 | 3,785 | Moonshot AI | Modified MIT |
| 58 | DeepSeek-V4-Pro | 1437.00 | +/-15 | 1,651 | DeepSeek-AI | MIT |
| 60 | DeepSeek-V4-Flash (thinking) | 1436.00 | +/-16 | 1,511 | DeepSeek-AI | MIT |
| 67 | qwen3-max-2025-09-23 | 1430.00 | +/-24 | 582 | Alibaba | Proprietary |
| 68 | DeepSeek V3.2 | 1430.00 | +/-11 | 3,004 | DeepSeek-AI | MIT |
| 71 | hunyuan-hy3-preview | 1429.00 | +/-28 | 405 | Tencent | tencent-hunyuan-community |
| 75 | DeepSeek V3.2-Exp (thinking) | 1428.00 | +/-26 | 481 | DeepSeek-AI | MIT |
| 77 | DeepSeek-V4-Flash | 1427.00 | +/-15 | 1,523 | DeepSeek-AI | MIT |
| 78 | DeepSeek V3.2 (thinking) | 1426.00 | +/-12 | 2,506 | DeepSeek-AI | MIT |
| 88 | DeepSeek V3.2-Exp | 1418.00 | +/-21 | 775 | DeepSeek-AI | MIT |
| 91 | Kimi K2 0905 | 1416.00 | +/-21 | 759 | Moonshot AI | Modified MIT |
| 93 | DeepSeek-V3.1 | 1415.00 | +/-18 | 992 | DeepSeek-AI | MIT |
| 94 | MiniMax-M2.7 | 1415.00 | +/-14 | 1,953 | MiniMaxAI | Modified MIT |
| 95 | DeepSeek-V3.1 (thinking) | 1415.00 | +/-22 | 663 | DeepSeek-AI | MIT |
| 100 | DeepSeek-R1 | 1411.00 | +/-14 | 1,606 | DeepSeek-AI | MIT |
| 105 | Step 3.5 Flash | 1408.00 | +/-12 | 2,641 | StepFunAI | Apache 2.0 |
| 107 | DeepSeek-V3.1 Terminus (thinking) | 1407.00 | +/-41 | 197 | DeepSeek-AI | MIT |
| 115 | Step 3.5 Flash | 1403.00 | +/-12 | 2,404 | StepFunAI | Proprietary |
| 124 | qwen3-235b-a22b-thinking-2507 | 1398.00 | +/-24 | 489 | Alibaba | Apache 2.0 |
| 126 | MiniMax M2.5 | 1397.00 | +/-12 | 2,436 | MiniMaxAI | Modified MIT |
| 127 | DeepSeek-R1-0528 | 1396.00 | +/-20 | 869 | DeepSeek-AI | MIT |
| 128 | DeepSeek-V3.1 Terminus | 1395.00 | +/-39 | 218 | DeepSeek-AI | MIT |
| 130 | qwen3-235b-a22b-no-thinking | 1394.00 | +/-12 | 2,392 | Alibaba | Apache 2.0 |
| 132 | M2.1 | 1392.00 | +/-18 | 1,010 | MiniMaxAI | MIT |
| 136 | Kimi K2 | 1389.00 | +/-14 | 1,695 | Moonshot AI | Modified MIT |
| 153 | minimax-m1 | 1372.00 | +/-13 | 1,801 | MiniMax | Apache 2.0 |
| 154 | DeepSeek-V3-0324 | 1370.00 | +/-10 | 3,190 | DeepSeek-AI | MIT |
| 161 | Step3 | 1364.00 | +/-31 | 351 | StepFunAI | Apache 2.0 |
| 167 | MiniMax M2 | 1356.00 | +/-33 | 319 | MiniMaxAI | Apache 2.0 |
| 174 | hunyuan-turbos-20250416 | 1348.00 | +/-20 | 845 | Tencent | Proprietary |
| 183 | qwen-plus-0125 | 1324.00 | +/-19 | 732 | Alibaba | Proprietary |
| 190 | step-2-16k-exp-202412 | 1313.00 | +/-20 | 642 | StepFun | Proprietary |
| 194 | DeepSeek-V3 | 1311.00 | +/-11 | 2,721 | DeepSeek-AI | DeepSeek |
| 202 | qwen2.5-plus-1127 | 1304.00 | +/-14 | 1,404 | Alibaba | Proprietary |
| 204 | hunyuan-turbos-20250226 | 1302.00 | +/-31 | 238 | Tencent | Proprietary |
| 206 | step-1o-turbo-202506 | 1299.00 | +/-24 | 565 | StepFun | Proprietary |
| 207 | glm-4-plus-0111 | 1298.00 | +/-19 | 721 | Zhipu | Proprietary |
| 214 | hunyuan-large-2025-02-10 | 1294.00 | +/-24 | 497 | Tencent | Proprietary |
| 215 | deepseek-v2.5-1210 | 1293.00 | +/-17 | 1,031 | DeepSeek | DeepSeek |
| 216 | qwen-max-0919 | 1292.00 | +/-12 | 2,249 | Alibaba | Qwen |
| 217 | hunyuan-standard-2025-02-10 | 1290.00 | +/-24 | 499 | Tencent | Proprietary |
| 220 | DeepSeek V2.5 | 1288.00 | +/-10 | 3,649 | DeepSeek-AI | DeepSeek |
| 221 | glm-4-plus | 1287.00 | +/-10 | 3,599 | Zhipu AI | Proprietary |
| 226 | hunyuan-large-vision | 1280.00 | +/-30 | 351 | Tencent | Proprietary |
| 227 | hunyuan-turbo-0110 | 1279.00 | +/-31 | 243 | Tencent | Proprietary |
| 236 | deepseek-coder-v2 | 1271.00 | +/-13 | 1,858 | DeepSeek | DeepSeek License |
| 251 | hunyuan-standard-256k | 1250.00 | +/-29 | 361 | Tencent | Proprietary |
| 281 | qwen1.5-32b-chat | 1200.00 | +/-12 | 2,649 | Alibaba | Qianwen LICENSE |
| 308 | DeepSeek LLM 67B Chat | 1155.00 | +/-23 | 576 | DeepSeek-AI | DeepSeek License |

Data is for reference only. Official sources are authoritative.
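
One way to read the Score and 95% CI columns together is sketched below. It assumes the conventional Elo logistic curve with a scale of 400 and treats each CI as a symmetric interval around the score; the source confirms neither, and the helper names are hypothetical. Interval overlap is only a rough heuristic, not a significance test.

```python
def expected_win_rate(score_a, score_b, scale=400.0):
    """Chance that A is preferred over B under the standard Elo logistic curve.

    The scale of 400 is the conventional Elo value, assumed here.
    """
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))


def intervals_overlap(score_a, ci_a, score_b, ci_b):
    """True if [score - ci, score + ci] for A and B intersect."""
    return abs(score_a - score_b) <= ci_a + ci_b


# (score, CI half-width) copied from the ranking table
kimi_k26 = (1483.0, 16)
v4_pro_thinking = (1477.0, 16)
deepseek_v3 = (1311.0, 11)

# A 6-point gap with +/-16 intervals: the two leaders are hard to separate.
close_pair = intervals_overlap(*kimi_k26, *v4_pro_thinking)

# A 172-point gap with intervals of +/-16 and +/-11: clearly separated.
far_pair = intervals_overlap(*kimi_k26, *deepseek_v3)

print(close_pair, round(expected_win_rate(kimi_k26[0], v4_pro_thinking[0]), 3))
print(far_pair, round(expected_win_rate(kimi_k26[0], deepseek_v3[0]), 3))
```

Under these assumptions, a gap of a few points implies a preference rate barely above 50%, while a gap of more than 170 points implies the higher-rated model is preferred in roughly 73% of head-to-head votes.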

FAQ

01

What is LMArena Math Arena?

LMArena Math Arena is an anonymous evaluation track focused on mathematical reasoning. Users submit real math questions, compare hidden model solutions side by side, and vote for the better answer; the leaderboard is then calculated with Elo-style scoring.

02

How is Math Arena different from MATH-500 or AIME?

Static benchmarks such as MATH-500 and AIME use fixed problem sets and automated grading. Math Arena uses open-ended user questions and human preference voting, making it a useful complement for measuring how models handle varied real-world math tasks.

03

Do thinking models perform better in Math Arena?

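Models with extended reasoning or chain-of-thought capabilities often rank higher on math tasks because they spend more time decomposing and checking solutions. The benefit can come with higher latency and cost, and it is not uniform: in the table above, DeepSeek-V4-Pro (thinking) scores 1477 against 1437 for DeepSeek-V4-Pro, while DeepSeek V3.2 (thinking) at 1426 and DeepSeek V3.2 at 1430 are effectively tied.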

04

How do China-developed models perform in math?

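DeepSeek, Qwen, GLM, Kimi, and related models have become competitive on math reasoning leaderboards. In this snapshot the highest-placed are Kimi K2.6 (rank 15, score 1483) and DeepSeek-V4-Pro (thinking) (rank 18, score 1477). Open licenses and Chinese-language support can make them especially useful for local deployment and education scenarios.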