Comparison of MiniMax-M1 and Other Models Across Benchmarks

Time: 2025/06/17 13:52:07 · Author: 小木

| Category | Task | MiniMax-M1-80K | MiniMax-M1-40K | Qwen3-235B-A22B | DeepSeek-R1-0528 | DeepSeek-R1 | Seed-Thinking-v1.5 | Claude 4 Opus | Gemini 2.5 Pro (06-05) | OpenAI-o3 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Extended Thinking | 80K | 40K | 32K | 64K | 32K | 32K | 64K | 64K | 100K |
| Mathematics | AIME 2024 | 86.0 | 83.3 | 85.7 | 91.4 | 79.8 | 86.7 | 76.0 | 92.0 | 91.6 |
| | AIME 2025 | 76.9 | 74.6 | 81.5 | 87.5 | 70.0 | 74.0 | 75.5 | 88.0 | 88.9 |
| | MATH-500 | 96.8 | 96.0 | 96.2 | 98.0 | 97.3 | 96.7 | 98.2 | 98.8 | 98.1 |
| General Coding | LiveCodeBench (24/8~25/5) | 65.0 | 62.3 | 65.9 | 73.1 | 55.9 | 67.5 | 56.6 | 77.1 | 75.8 |
| | FullStackBench | 68.3 | 67.6 | 62.9 | 69.4 | 70.1 | 69.9 | 70.3 | - | 69.3 |
| Reasoning & Knowledge | GPQA Diamond | 70.0 | 69.2 | 71.1 | 81.0 | 71.5 | 77.3 | 79.6 | 86.4 | 83.3 |
| | HLE (no tools) | 8.4* | 7.2* | 7.6* | 17.7* | 8.6* | 8.2 | 10.7 | 21.6 | 20.3 |
| | ZebraLogic | 86.8 | 80.1 | 80.3 | 95.1 | 78.7 | 84.4 | 95.1 | 91.6 | 95.8 |
| | MMLU-Pro | 81.1 | 80.6 | 83.0 | 85.0 | 84.0 | 87.0 | 85.0 | 86.0 | 85.0 |
| Software Engineering | SWE-bench Verified | 56.0 | 55.6 | 34.4 | 57.6 | 49.2 | 47.0 | 72.5 | 67.2 | 69.1 |
| Long Context | OpenAI-MRCR (128k) | 73.4 | 76.1 | 27.7 | 51.5 | 35.8 | 54.3 | 48.9 | 76.8 | 56.5 |
| | OpenAI-MRCR (1M) | 56.2 | 58.6 | - | - | - | - | - | 58.8 | - |
| | LongBench-v2 | 61.5 | 61.0 | 50.1 | 52.1 | 58.3 | 52.5 | 55.6 | 65.0 | 58.8 |
| Agentic Tool Use | TAU-bench (airline) | 62.0 | 60.0 | 34.7 | 53.5 | - | 44.0 | 59.6 | 50.0 | 52.0 |
| | TAU-bench (retail) | 63.5 | 67.8 | 58.6 | 63.9 | - | 55.7 | 81.4 | 67.0 | 73.9 |
| Factuality | SimpleQA | 18.5 | 17.9 | 11.0 | 27.8 | 30.1 | 12.9 | - | 54.0 | 49.4 |
| General Assistant | MultiChallenge | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 43.0 | 45.8 | 51.8 | 56.5 |