LMArena Math Arena Leaderboard

Name: LMArena Math Arena Leaderboard
Creator: DataLearner
License: https://creativecommons.org/licenses/by/4.0/

The latest AI math reasoning leaderboard, based on anonymous user voting in LMArena Math Arena. It covers Elo scores, confidence intervals, and vote counts for Claude, GPT, Gemini, DeepSeek, Qwen, and other models.

Top Model: Kimi K2.6
Top Score: 1483.00
Model Count: 356
Data version: 2026-06-16

Data source: LM Arena

About This Leaderboard

This leaderboard ranks AI models by mathematical reasoning ability. The data comes from LMArena's Math sub-track, where real users evaluate models through anonymous blind testing on math problem-solving tasks.

Methodology Overview

Blind testing: Users submit math problems, two anonymous models each provide a solution, and users vote for the better answer. Hiding model identities keeps brand bias out of the vote.

Elo scoring: Elo-style scores are computed with the Bradley-Terry model. A higher score means users more often prefer that model's math solutions.

Broad scenario coverage: Testing spans algebra, geometry, calculus, competition math, and other real-world math tasks.
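To make the scoring step concrete, here is a minimal sketch of how pairwise votes can be turned into Elo-style scores with a Bradley-Terry fit. It is an illustration only: the vote counts are toy data, the function names are hypothetical, and the 400-point scale, base 10, and 1000-point offset are the conventional Elo choices, assumed here because the page does not state LMArena's exact settings.

```python
import math


def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a matrix of pairwise wins.

    wins[i][j] is the number of votes where model i was preferred over
    model j. Uses the classic iterative (minorization-maximization) update.
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Only strength ratios are identifiable, so fix the geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    return strength


def to_elo_scale(strength, scale=400.0, offset=1000.0):
    """Map strengths to Elo-style scores (assumed scale and offset)."""
    return [offset + scale * math.log10(s) for s in strength]


def expected_win_probability(score_a, score_b, scale=400.0):
    """Probability that A is preferred over B on the assumed Elo scale."""
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / scale))


# Toy vote counts for three hypothetical models (not leaderboard data).
toy_wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
toy_scores = to_elo_scale(fit_bradley_terry(toy_wins))
for index, score in enumerate(toy_scores):
    print(f"model {index}: {score:.0f}")
```

Under these assumptions, a score gap maps directly to a preference rate: `expected_win_probability(1483.0, 1155.0)` is roughly 0.87, so the top and bottom entries of the ranking table would be expected to split votes about 87 to 13.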

DataLearner provides in-depth analysis on top of the raw data, linking leaderboard models to the DataLearner model database so you can quickly access model details, API pricing, benchmark scores, and more.

Origin: China

Ranking Table

Rank	Model	Score	95% CI	Votes	Organization	License
15	Kimi K2.6	1483.00	+/-16	1,372	Moonshot AI	Modified MIT
18	DeepSeek-V4-Pro (thinking)	1477.00	+/-16	1,391	DeepSeek-AI	MIT
22	Kimi K2 Thinking	1472.00	+/-11	2,818	Moonshot AI	Modified MIT
31	minimax-m3	1461.00	+/-26	556	MiniMax	Proprietary
51	Kimi K2.5 Instant	1442.00	+/-25	513	Moonshot AI	Modified MIT
57	Kimi K2 Thinking (thinking-turbo)	1438.00	+/-10	3,785	Moonshot AI	Modified MIT
58	DeepSeek-V4-Pro	1437.00	+/-15	1,651	DeepSeek-AI	MIT
60	DeepSeek-V4-Flash (thinking)	1436.00	+/-16	1,511	DeepSeek-AI	MIT
67	qwen3-max-2025-09-23	1430.00	+/-24	582	Alibaba	Proprietary
68	DeepSeek V3.2	1430.00	+/-11	3,004	DeepSeek-AI	MIT
71	hunyuan-hy3-preview	1429.00	+/-28	405	Tencent	tencent-hunyuan-community
75	DeepSeek V3.2-Exp (thinking)	1428.00	+/-26	481	DeepSeek-AI	MIT
77	DeepSeek-V4-Flash	1427.00	+/-15	1,523	DeepSeek-AI	MIT
78	DeepSeek V3.2 (thinking)	1426.00	+/-12	2,506	DeepSeek-AI	MIT
88	DeepSeek V3.2-Exp	1418.00	+/-21	775	DeepSeek-AI	MIT
91	Kimi K2 0905	1416.00	+/-21	759	Moonshot AI	Modified MIT
93	DeepSeek-V3.1	1415.00	+/-18	992	DeepSeek-AI	MIT
94	MiniMax-M2.7	1415.00	+/-14	1,953	MiniMaxAI	Modified MIT
95	DeepSeek-V3.1 (thinking)	1415.00	+/-22	663	DeepSeek-AI	MIT
100	DeepSeek-R1	1411.00	+/-14	1,606	DeepSeek-AI	MIT
105	Step 3.5 Flash	1408.00	+/-12	2,641	StepFunAI	Apache 2.0
107	DeepSeek-V3.1 Terminus (thinking)	1407.00	+/-41	197	DeepSeek-AI	MIT
115	Step 3.5 Flash	1403.00	+/-12	2,404	StepFunAI	Proprietary
124	qwen3-235b-a22b-thinking-2507	1398.00	+/-24	489	Alibaba	Apache 2.0
126	MiniMax M2.5	1397.00	+/-12	2,436	MiniMaxAI	Modified MIT
127	DeepSeek-R1-0528	1396.00	+/-20	869	DeepSeek-AI	MIT
128	DeepSeek-V3.1 Terminus	1395.00	+/-39	218	DeepSeek-AI	MIT
130	qwen3-235b-a22b-no-thinking	1394.00	+/-12	2,392	Alibaba	Apache 2.0
132	M2.1	1392.00	+/-18	1,010	MiniMaxAI	MIT
136	Kimi K2	1389.00	+/-14	1,695	Moonshot AI	Modified MIT
153	minimax-m1	1372.00	+/-13	1,801	MiniMax	Apache 2.0
154	DeepSeek-V3-0324	1370.00	+/-10	3,190	DeepSeek-AI	MIT
161	Step3	1364.00	+/-31	351	StepFunAI	Apache 2.0
167	MiniMax M2	1356.00	+/-33	319	MiniMaxAI	Apache 2.0
174	hunyuan-turbos-20250416	1348.00	+/-20	845	Tencent	Proprietary
183	qwen-plus-0125	1324.00	+/-19	732	Alibaba	Proprietary
190	step-2-16k-exp-202412	1313.00	+/-20	642	StepFun	Proprietary
194	DeepSeek-V3	1311.00	+/-11	2,721	DeepSeek-AI	DeepSeek
202	qwen2.5-plus-1127	1304.00	+/-14	1,404	Alibaba	Proprietary
204	hunyuan-turbos-20250226	1302.00	+/-31	238	Tencent	Proprietary
206	step-1o-turbo-202506	1299.00	+/-24	565	StepFun	Proprietary
207	glm-4-plus-0111	1298.00	+/-19	721	Zhipu	Proprietary
214	hunyuan-large-2025-02-10	1294.00	+/-24	497	Tencent	Proprietary
215	deepseek-v2.5-1210	1293.00	+/-17	1,031	DeepSeek	DeepSeek
216	qwen-max-0919	1292.00	+/-12	2,249	Alibaba	Qwen
217	hunyuan-standard-2025-02-10	1290.00	+/-24	499	Tencent	Proprietary
220	DeepSeek V2.5	1288.00	+/-10	3,649	DeepSeek-AI	DeepSeek
221	glm-4-plus	1287.00	+/-10	3,599	Zhipu AI	Proprietary
226	hunyuan-large-vision	1280.00	+/-30	351	Tencent	Proprietary
227	hunyuan-turbo-0110	1279.00	+/-31	243	Tencent	Proprietary
236	deepseek-coder-v2	1271.00	+/-13	1,858	DeepSeek	DeepSeek License
251	hunyuan-standard-256k	1250.00	+/-29	361	Tencent	Proprietary
281	qwen1.5-32b-chat	1200.00	+/-12	2,649	Alibaba	Qianwen LICENSE
308	DeepSeek LLM 67B Chat	1155.00	+/-23	576	DeepSeek-AI	DeepSeek License

Data is for reference only. Official sources are authoritative. Click model names to view DataLearner model profiles.
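When reading the table, the 95% CI column matters as much as the score: two models whose intervals overlap are not clearly separated by the votes collected so far. The sketch below shows one rough way to check this, using rows copied from the table. The helper names are hypothetical, and treating interval overlap as "no clear difference" is a simplifying heuristic, not a formal significance test and not LMArena's own procedure.

```python
def interval(score, half_width):
    """Return the (low, high) bounds implied by a score and its +/- value."""
    return (score - half_width, score + half_width)


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Score and 95% CI half-width copied from the ranking table.
kimi_k2_6 = interval(1483.0, 16.0)          # rank 15
deepseek_v4_pro_thinking = interval(1477.0, 16.0)  # rank 18
kimi_k2 = interval(1389.0, 14.0)            # rank 136

# Ranks 15 and 18: intervals overlap, so the ordering is not well separated.
print(intervals_overlap(kimi_k2_6, deepseek_v4_pro_thinking))

# Ranks 15 and 136: intervals are far apart, so the gap is clear.
print(intervals_overlap(kimi_k2_6, kimi_k2))
```

Rows with few votes, such as DeepSeek-V3.1 Terminus (thinking) at +/-41 on 197 votes, have wide intervals and should be compared with extra caution.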

FAQ

What is LMArena Math Arena?

LMArena Math Arena is an anonymous evaluation track focused on mathematical reasoning. Users submit real math questions, compare hidden model solutions side by side, and vote for the better answer; the leaderboard is then calculated with Elo-style scoring.

How is Math Arena different from MATH-500 or AIME?

Static benchmarks such as MATH-500 and AIME use fixed problem sets and automated grading. Math Arena uses open-ended user questions and human preference voting, making it a useful complement for measuring how models handle varied real-world math tasks.

Do thinking models perform better in Math Arena?

Models with extended reasoning or chain-of-thought style capabilities often rank higher on math tasks because they spend more time decomposing and checking solutions. That benefit can come with higher latency and cost.
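As a rough check of this claim against the ranking table, the sketch below computes the score gap between the thinking and non-thinking variants of several DeepSeek models. The scores are copied from the table; pairing variants by name is an assumption made for illustration, and the comparison ignores confidence intervals.

```python
# (thinking score, non-thinking score) pairs copied from the ranking table.
# Pairing variants by model name is an assumption made for illustration.
pairs = {
    "DeepSeek-V4-Pro": (1477, 1437),
    "DeepSeek-V4-Flash": (1436, 1427),
    "DeepSeek V3.2-Exp": (1428, 1418),
    "DeepSeek V3.2": (1426, 1430),
    "DeepSeek-V3.1": (1415, 1415),
    "DeepSeek-V3.1 Terminus": (1407, 1395),
}

# Positive values mean the thinking variant scored higher.
deltas = {name: thinking - base for name, (thinking, base) in pairs.items()}

for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {delta:+d}")
```

On these rows the gap ranges from +40 points (DeepSeek-V4-Pro) to -4 points (DeepSeek V3.2), so the advantage is not uniform. Apart from DeepSeek-V4-Pro, the paired variants have overlapping 95% confidence intervals in the table.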

How do China-developed models perform in math?

DeepSeek, Qwen, GLM, and related models have become competitive in math reasoning leaderboards. Open licenses and Chinese-language support can make them especially useful for local deployment and education scenarios.