Comparison of MiniMax-M1 with other models across benchmarks
| Category | Task | MiniMax-M1-80K | MiniMax-M1-40K | Qwen3-235B-A22B | DeepSeek-R1-0528 | DeepSeek-R1 | Seed-Thinking-v1.5 | Claude 4 Opus | Gemini 2.5 Pro (06-05) | OpenAI-o3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Extended Thinking | — | 80K | 40K | 32K | 64K | 32K | 32K | 64K | 64K | 100K |
| Mathematics | AIME 2024 | 86.0 | 83.3 | 85.7 | 91.4 | 79.8 | 86.7 | 76.0 | 92.0 | 91.6 |
| | AIME 2025 | 76.9 | 74.6 | 81.5 | 87.5 | 70.0 | 74.0 | 75.5 | 88.0 | 88.9 |
| | MATH-500 | 96.8 | 96.0 | 96.2 | 98.0 | 97.3 | 96.7 | 98.2 | 98.8 | 98.1 |
| General Coding | LiveCodeBench (24/8~25/5) | 65.0 | 62.3 | 65.9 | 73.1 | 55.9 | 67.5 | 56.6 | 77.1 | 75.8 |
| | FullStackBench | 68.3 | 67.6 | 62.9 | 69.4 | 70.1 | 69.9 | 70.3 | — | 69.3 |
| Reasoning & Knowledge | GPQA Diamond | 70.0 | 69.2 | 71.1 | 81.0 | 71.5 | 77.3 | 79.6 | 86.4 | 83.3 |
| | HLE (no tools) | 8.4* | 7.2* | 7.6* | 17.7* | 8.6* | 8.2 | 10.7 | 21.6 | 20.3 |
| | ZebraLogic | 86.8 | 80.1 | 80.3 | 95.1 | 78.7 | 84.4 | 95.1 | 91.6 | 95.8 |
| | MMLU-Pro | 81.1 | 80.6 | 83.0 | 85.0 | 84.0 | 87.0 | 85.0 | 86.0 | 85.0 |
| Software Engineering | SWE-bench Verified | 56.0 | 55.6 | 34.4 | 57.6 | 49.2 | 47.0 | 72.5 | 67.2 | 69.1 |
| Long Context | OpenAI-MRCR (128k) | 73.4 | 76.1 | 27.7 | 51.5 | 35.8 | 54.3 | 48.9 | 76.8 | 56.5 |
| | OpenAI-MRCR (1M) | 56.2 | 58.6 | — | — | — | — | — | 58.8 | — |
| | LongBench-v2 | 61.5 | 61.0 | 50.1 | 52.1 | 58.3 | 52.5 | 55.6 | 65.0 | 58.8 |
| Agentic Tool Use | TAU-bench (airline) | 62.0 | 60.0 | 34.7 | 53.5 | — | 44.0 | 59.6 | 50.0 | 52.0 |
| | TAU-bench (retail) | 63.5 | 67.8 | 58.6 | 63.9 | — | 55.7 | 81.4 | 67.0 | 73.9 |
| Factuality | SimpleQA | 18.5 | 17.9 | 11.0 | 27.8 | 30.1 | 12.9 | — | 54.0 | 49.4 |
| General Assistant | MultiChallenge | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 43.0 | 45.8 | 51.8 | 56.5 |
