LMArena Math Arena Leaderboard
The latest AI math reasoning leaderboard, based on anonymous user voting in LMArena Math Arena. It covers Elo scores, confidence intervals, and vote counts for Claude, GPT, Gemini, DeepSeek, Qwen, and more.
Top Model: Kimi K2.6
Top Score: 1483.00
Model Count: 356
Data version: June 16, 2026
Data source: LMArena
About This Leaderboard
This leaderboard ranks AI models by mathematical reasoning ability. Data comes from LMArena's Math sub-track, evaluated through anonymous blind testing by real users on math problem-solving tasks.
Methodology Overview
Blind testing: Users submit math problems, two anonymous models provide solutions, and users vote for the better answer. Hiding model identities reduces brand bias.
Elo scoring: Scores are fitted to the pairwise votes with the Bradley-Terry model and reported on an Elo-style scale. A higher score means users prefer that model's math solutions more often.
Broad scenario coverage: Testing spans algebra, geometry, calculus, competition math, and other real-world math tasks.
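The Elo scoring step can be illustrated with a minimal sketch. It assumes a plain Bradley-Terry model fitted with the classic minorization-maximization update, ignores ties, and maps strengths to an Elo-like scale with a 400-point base-10 factor and an arbitrary anchor of 1000. The function names, the anchor, and the toy battles are hypothetical; LMArena's actual pipeline may differ.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=500):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Assumes every model has at least one win and that the comparison
    graph is connected; ties are not modelled in this sketch.
    """
    models = sorted({m for battle in battles for m in battle})
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # number of battles per unordered pair
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Rescale so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def to_elo_scale(strength, scale=400.0, anchor=1000.0):
    """Map strengths to an Elo-like scale (scale and anchor are assumptions)."""
    return {m: anchor + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical votes: each tuple is (preferred model, other model).
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
ratings = to_elo_scale(fit_bradley_terry(battles))
for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```

Because only score differences matter in a Bradley-Terry fit, the anchor shifts every rating equally and does not change the ranking.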
DataLearner provides in-depth analysis on top of the raw data, linking leaderboard models to the DataLearner model database so you can quickly access model details, API pricing, benchmark scores, and more.
Ranking Table
| Rank | Model | Score | 95% CI | Votes | Organization | License |
|---|---|---|---|---|---|---|
| 15 | Kimi K2.6 | 1483.00 | +/-16 | 1,372 | Moonshot AI | Modified MIT |
| 18 | DeepSeek-V4-Pro (thinking) | 1477.00 | +/-16 | 1,391 | DeepSeek-AI | MIT |
| 22 | Kimi K2 Thinking | 1472.00 | +/-11 | 2,818 | Moonshot AI | Modified MIT |
| 31 | minimax-m3 | 1461.00 | +/-26 | 556 | MiniMax | Proprietary |
| 51 | Kimi K2.5 Instant | 1442.00 | +/-25 | 513 | Moonshot AI | Modified MIT |
| 57 | Kimi K2 Thinking (thinking-turbo) | 1438.00 | +/-10 | 3,785 | Moonshot AI | Modified MIT |
| 58 | DeepSeek-V4-Pro | 1437.00 | +/-15 | 1,651 | DeepSeek-AI | MIT |
| 60 | DeepSeek-V4-Flash (thinking) | 1436.00 | +/-16 | 1,511 | DeepSeek-AI | MIT |
| 67 | qwen3-max-2025-09-23 | 1430.00 | +/-24 | 582 | Alibaba | Proprietary |
| 68 | DeepSeek V3.2 | 1430.00 | +/-11 | 3,004 | DeepSeek-AI | MIT |
| 71 | hunyuan-hy3-preview | 1429.00 | +/-28 | 405 | Tencent | tencent-hunyuan-community |
| 75 | DeepSeek V3.2-Exp (thinking) | 1428.00 | +/-26 | 481 | DeepSeek-AI | MIT |
| 77 | DeepSeek-V4-Flash | 1427.00 | +/-15 | 1,523 | DeepSeek-AI | MIT |
| 78 | DeepSeek V3.2 (thinking) | 1426.00 | +/-12 | 2,506 | DeepSeek-AI | MIT |
| 88 | DeepSeek V3.2-Exp | 1418.00 | +/-21 | 775 | DeepSeek-AI | MIT |
| 91 | Kimi K2 0905 | 1416.00 | +/-21 | 759 | Moonshot AI | Modified MIT |
| 93 | DeepSeek-V3.1 | 1415.00 | +/-18 | 992 | DeepSeek-AI | MIT |
| 94 | | 1415.00 | +/-14 | 1,953 | MiniMaxAI | Modified MIT |
| 95 | DeepSeek-V3.1 (thinking) | 1415.00 | +/-22 | 663 | DeepSeek-AI | MIT |
| 100 | DeepSeek-R1 | 1411.00 | +/-14 | 1,606 | DeepSeek-AI | MIT |
| 105 | Step 3.5 Flash | 1408.00 | +/-12 | 2,641 | StepFunAI | Apache 2.0 |
| 107 | DeepSeek-V3.1 Terminus (thinking) | 1407.00 | +/-41 | 197 | DeepSeek-AI | MIT |
| 115 | Step 3.5 Flash | 1403.00 | +/-12 | 2,404 | StepFunAI | Proprietary |
| 124 | qwen3-235b-a22b-thinking-2507 | 1398.00 | +/-24 | 489 | Alibaba | Apache 2.0 |
| 126 | | 1397.00 | +/-12 | 2,436 | MiniMaxAI | Modified MIT |
| 127 | DeepSeek-R1-0528 | 1396.00 | +/-20 | 869 | DeepSeek-AI | MIT |
| 128 | DeepSeek-V3.1 Terminus | 1395.00 | +/-39 | 218 | DeepSeek-AI | MIT |
| 130 | qwen3-235b-a22b-no-thinking | 1394.00 | +/-12 | 2,392 | Alibaba | Apache 2.0 |
| 132 | | 1392.00 | +/-18 | 1,010 | MiniMaxAI | MIT |
| 136 | Kimi K2 | 1389.00 | +/-14 | 1,695 | Moonshot AI | Modified MIT |
| 153 | minimax-m1 | 1372.00 | +/-13 | 1,801 | MiniMax | Apache 2.0 |
| 154 | DeepSeek-V3-0324 | 1370.00 | +/-10 | 3,190 | DeepSeek-AI | MIT |
| 161 | Step3 | 1364.00 | +/-31 | 351 | StepFunAI | Apache 2.0 |
| 167 | | 1356.00 | +/-33 | 319 | MiniMaxAI | Apache 2.0 |
| 174 | hunyuan-turbos-20250416 | 1348.00 | +/-20 | 845 | Tencent | Proprietary |
| 183 | qwen-plus-0125 | 1324.00 | +/-19 | 732 | Alibaba | Proprietary |
| 190 | step-2-16k-exp-202412 | 1313.00 | +/-20 | 642 | StepFun | Proprietary |
| 194 | DeepSeek-V3 | 1311.00 | +/-11 | 2,721 | DeepSeek-AI | DeepSeek |
| 202 | qwen2.5-plus-1127 | 1304.00 | +/-14 | 1,404 | Alibaba | Proprietary |
| 204 | hunyuan-turbos-20250226 | 1302.00 | +/-31 | 238 | Tencent | Proprietary |
| 206 | step-1o-turbo-202506 | 1299.00 | +/-24 | 565 | StepFun | Proprietary |
| 207 | glm-4-plus-0111 | 1298.00 | +/-19 | 721 | Zhipu | Proprietary |
| 214 | hunyuan-large-2025-02-10 | 1294.00 | +/-24 | 497 | Tencent | Proprietary |
| 215 | deepseek-v2.5-1210 | 1293.00 | +/-17 | 1,031 | DeepSeek | DeepSeek |
| 216 | qwen-max-0919 | 1292.00 | +/-12 | 2,249 | Alibaba | Qwen |
| 217 | hunyuan-standard-2025-02-10 | 1290.00 | +/-24 | 499 | Tencent | Proprietary |
| 220 | DeepSeek V2.5 | 1288.00 | +/-10 | 3,649 | DeepSeek-AI | DeepSeek |
| 221 | glm-4-plus | 1287.00 | +/-10 | 3,599 | Zhipu AI | Proprietary |
| 226 | hunyuan-large-vision | 1280.00 | +/-30 | 351 | Tencent | Proprietary |
| 227 | hunyuan-turbo-0110 | 1279.00 | +/-31 | 243 | Tencent | Proprietary |
| 236 | deepseek-coder-v2 | 1271.00 | +/-13 | 1,858 | DeepSeek | DeepSeek License |
| 251 | hunyuan-standard-256k | 1250.00 | +/-29 | 361 | Tencent | Proprietary |
| 281 | qwen1.5-32b-chat | 1200.00 | +/-12 | 2,649 | Alibaba | Qianwen LICENSE |
| 308 | DeepSeek LLM 67B Chat | 1155.00 | +/-23 | 576 | DeepSeek-AI | DeepSeek License |
Data is for reference only. Official sources are authoritative.
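When reading the 95% CI column, a quick check is whether two models' intervals overlap. The sketch below assumes each "+/-" value is a symmetric half-width around the score, and uses three rows from the table. The helper names are hypothetical. Overlap is only a rough heuristic that a ranking gap may not be meaningful, not a formal significance test.

```python
def interval(score, half_width):
    """Return (low, high), assuming a symmetric interval around the score."""
    return (score - half_width, score + half_width)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Values copied from the ranking table.
kimi_k26 = interval(1483.0, 16)                  # rank 15
deepseek_v4_pro_thinking = interval(1477.0, 16)  # rank 18
deepseek_v3 = interval(1311.0, 11)               # rank 194

print(intervals_overlap(kimi_k26, deepseek_v4_pro_thinking))  # True
print(intervals_overlap(kimi_k26, deepseek_v3))               # False
```

Under this reading, the gap between the rank 15 and rank 18 rows is within the stated uncertainty, while the gap to DeepSeek-V3 is not.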
FAQ
What is LMArena Math Arena?
LMArena Math Arena is an anonymous evaluation track focused on mathematical reasoning. Users submit real math questions, compare hidden model solutions side by side, and vote for the better answer; the leaderboard is then calculated with Elo-style scoring.
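To get a feel for what a score gap means under Elo-style scoring, the conventional Elo formula converts a rating difference into an expected preference rate. The sketch below assumes the standard base-10, 400-point scale applies to these scores, which the page does not state explicitly; the function name is hypothetical.

```python
def expected_win_rate(score_a, score_b, scale=400.0):
    """Expected share of votes for model A over model B.

    Uses the conventional Elo formula; the 400-point scale is an assumption.
    """
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))


# Scores taken from the ranking table.
close_pair = expected_win_rate(1483.0, 1477.0)  # Kimi K2.6 vs DeepSeek-V4-Pro (thinking)
wide_pair = expected_win_rate(1483.0, 1311.0)   # Kimi K2.6 vs DeepSeek-V3

print(f"6-point gap:   {close_pair:.3f}")
print(f"172-point gap: {wide_pair:.3f}")
```

Under this assumption, a gap of a few points corresponds to a nearly even split of votes, while a gap of well over 100 points corresponds to a clear majority preference.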
How is Math Arena different from MATH-500 or AIME?
Static benchmarks such as MATH-500 and AIME use fixed problem sets and automated grading. Math Arena uses open-ended user questions and human preference voting, making it a useful complement for measuring how models handle varied real-world math tasks.
Do thinking models perform better in Math Arena?
Models with extended reasoning or chain-of-thought style capabilities often rank higher on math tasks because they spend more time decomposing and checking solutions. That benefit can come with higher latency and cost.
How do China-developed models perform in math?
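DeepSeek, Qwen, GLM, and related models have become competitive on math reasoning leaderboards. In the ranking table on this page, Kimi K2.6 ranks 15th with a score of 1483.00 and DeepSeek-V4-Pro (thinking) ranks 18th with 1477.00. Open licenses and Chinese-language support can make these models especially useful for local deployment and education scenarios.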





