Benchmark Results

MiniMax M2.5

Comprehensive Evaluation

4 evaluations

Benchmark	Mode	Score	Rank / total
GPQA Diamond	Thinking Mode	85.20	47 / 177
ARC-AGI	Thinking Mode	63.70	32 / 65
HLE	Thinking Mode	19.40	103 / 154
ARC-AGI-2	Thinking Mode	4.90	44 / 59

Coding and Software Engineering

2 evaluations

Benchmark	Mode	Score	Rank / total
SWE-bench Verified	Thinking Mode | Tools	80.20	10 / 105
SWE-Bench Pro - Public	Thinking Mode | Tools	55.40	15 / 40

Mathematical Reasoning

1 evaluation

Benchmark	Mode	Score	Rank / total
AIME2025	Thinking Mode	86.30	48 / 106

Agent Capability Evaluation

1 evaluation

Benchmark	Mode	Score	Rank / total
τ²-Bench - Telecom	Thinking Mode | Tools	97.80	10 / 35

Instruction Following

1 evaluation

Benchmark	Mode	Score	Rank / total
IF Bench	Thinking Mode | Tools	70.00	12 / 29

AI Agent - Information Gathering

1 evaluation

Benchmark	Mode	Score	Rank / total
BrowseComp	Thinking Mode | Tools	76.30	16 / 43

AI Agent - Tool Use

1 evaluation

Benchmark	Mode	Score	Rank / total
Terminal Bench 2.0	Thinking Mode | Tools	51.70	30 / 46

Productivity Knowledge

1 evaluation

Benchmark	Mode	Score	Rank / total
GDPval-AA	Thinking Mode	36.00	16 / 20

Long-Context Capability

1 evaluation

Benchmark	Mode	Score	Rank / total
AA-LCR	Thinking Mode	69.50	3 / 13

OpenClaw Agent Capability Comprehensive Evaluation

2 evaluations

Benchmark	Mode	Score	Rank / total
Claw Bench	Thinking Mode | Tools	92.10	4 / 29
Pinch Bench	Thinking Mode | Tools	87.80	6 / 37
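
Rank/total pairs are easier to compare across leaderboards of different sizes when expressed as a percentile. The following is a minimal sketch, assuming ranks are 1-based; `top_percent` is a hypothetical helper introduced here for illustration, not something the source page provides.

```python
def top_percent(rank: int, total: int) -> float:
    """Return the share of the leaderboard at or above this rank, in percent."""
    if total <= 0 or not 1 <= rank <= total:
        raise ValueError("rank must satisfy 1 <= rank <= total")
    return 100.0 * rank / total


# Rank/total pairs copied from the results listed above.
placements = {
    "SWE-bench Verified": (10, 105),
    "Claw Bench": (4, 29),
    "AA-LCR": (3, 13),
    "HLE": (103, 154),
}

for name, (rank, total) in placements.items():
    print(f"{name}: top {top_percent(rank, total):.1f}%")
```

On these figures, SWE-bench Verified lands in roughly the top 10% of evaluated models, while HLE falls in the bottom half. Small leaderboards such as AA-LCR (13 entries) make the percentile coarse, so the raw rank is worth reading alongside it.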

Compare with other models

Competitor Comparison

Benchmark scores for MiniMax M2.5 compared against top models in its class

The chart shows each model’s highest score per benchmark within the current filter. See the table below for per-mode details.

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	Category	MiniMax M2.5 (current)	GLM-5	Kimi K2.5
ARC-AGI	Comprehensive Evaluation	63.70 (Thinking Enabled)	44.70 (Thinking Enabled)	65.30 (Thinking Enabled)
ARC-AGI-2	Comprehensive Evaluation	4.90 (Thinking Enabled)	4.90 (Thinking Enabled)	11.80 (Thinking Enabled)
GPQA Diamond	Comprehensive Evaluation	85.20 (Thinking Enabled)	86.00 (Thinking Enabled)	87.60 (Thinking Enabled)
HLE	Comprehensive Evaluation	19.40 (Thinking Enabled)	50.40 (Thinking Enabled | Tools)	50.20 (Thinking Enabled | Tools)
SWE-Bench Pro - Public	Coding and Software Engineering	55.40 (Thinking Enabled | Tools)	--	50.70 (Thinking Enabled | Tools)
SWE-bench Verified	Coding and Software Engineering	80.20 (Thinking Enabled | Tools)	77.80 (Thinking Enabled)	76.80 (Thinking Enabled | Tools)
AIME2025	Mathematical Reasoning	86.30 (Thinking Enabled)	--	96.10 (Thinking Enabled)
τ²-Bench - Telecom	Agent Capability Evaluation	97.80 (Thinking Enabled | Tools)	98.00 (Thinking Enabled | Tools)	--
IF Bench	Instruction Following	70.00 (Thinking Enabled | Tools)	72.00 (Thinking Enabled | Tools)	--
BrowseComp	AI Agent - Information Gathering	76.30 (Thinking Enabled | Tools)	75.90 (Thinking Enabled | Tools)	60.60 (Thinking Enabled | Tools)
Terminal Bench 2.0	AI Agent - Tool Use	51.70 (Thinking Enabled | Tools)	61.10 (Thinking Enabled | Tools)	50.80 (Thinking Enabled | Tools)
GDPval-AA	Productivity Knowledge	36.00 (Thinking Enabled)	46.00 (Thinking Enabled)	40.00 (Thinking Enabled)

3 additional benchmarks remain in the chart above.

Standard API Pricing: MiniMax M2.5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model	Supplier	Standard input	Standard output	Base price applies to
MiniMax M2.5	MiniMaxAI	$0.3 / 1M tokens	$2.4 / 1M tokens	—
GLM-5	Zhipu AI	$1 / 1M tokens	$3.2 / 1M tokens	—
Kimi K2.5	—	$0.6 / 1M tokens	$3 / 1M tokens	—
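
Per-token prices only become comparable once they are applied to a concrete workload. The sketch below uses the standard prices from the table above; the workload of 1M input and 1M output tokens is a hypothetical example, and `workload_cost` is an illustrative helper rather than any supplier's billing formula (it ignores caching discounts and extended-context pricing).

```python
# Standard prices in USD per 1M tokens, as (input, output), from the pricing table.
PRICES_USD_PER_1M = {
    "MiniMax M2.5": (0.3, 2.4),
    "GLM-5": (1.0, 3.2),
    "Kimi K2.5": (0.6, 3.0),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one workload at the model's standard input/output prices."""
    input_price, output_price = PRICES_USD_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 1M input tokens and 1M output tokens.
for model in PRICES_USD_PER_1M:
    print(f"{model}: ${workload_cost(model, 1_000_000, 1_000_000):.2f}")
```

For this workload the totals come to about $2.70, $4.20, and $3.60 respectively. Input-heavy workloads widen the gap to GLM-5, because its input price is more than three times that of MiniMax M2.5.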

Version History

How each version of the MiniMax M2.5 series stacks up on benchmark tests

The chart shows each model’s highest score per benchmark within the current filter. See the table below for per-mode details.

10 benchmarks with comparable scores. Each model shows its best score; the mode label accompanies each score.

Benchmark	Category	MiniMax M2.5 (current)	MiniMax M2	M2.1
GPQA Diamond	Comprehensive Evaluation	85.20 (Thinking Enabled)	78.00 (Thinking Enabled)	81.00 (Thinking Enabled)
HLE	Comprehensive Evaluation	19.40 (Thinking Enabled)	12.50 (Thinking Enabled)	22.00 (Thinking Enabled)
SWE-Bench Pro - Public	Coding and Software Engineering	55.40 (Thinking Enabled | Tools)	--	32.60 (Thinking Enabled | Tools)
SWE-bench Verified	Coding and Software Engineering	80.20 (Thinking Enabled | Tools)	69.40 (Thinking Enabled | Tools)	74.80 (Thinking Enabled)
AIME2025	Mathematical Reasoning	86.30 (Thinking Enabled)	78.00 (Thinking Enabled)	81.00 (Thinking Enabled)
τ²-Bench - Telecom	Agent Capability Evaluation	97.80 (Thinking Enabled | Tools)	87.00 (Thinking Enabled | Tools)	87.00 (Thinking Enabled | Tools)
IF Bench	Instruction Following	70.00 (Thinking Enabled | Tools)	72.30 (Thinking Enabled)	70.00 (Thinking Enabled | Tools)
BrowseComp	AI Agent - Information Gathering	76.30 (Thinking Enabled | Tools)	44.00 (Thinking Enabled | Tools)	47.40 (Thinking Enabled | Tools)
Terminal Bench 2.0	AI Agent - Tool Use	51.70 (Thinking Enabled | Tools)	--	47.90 (Thinking Enabled | Tools)
Pinch Bench	OpenClaw Agent Capability Comprehensive Evaluation	87.80 (Thinking Enabled | Tools)	--	84.30 (Thinking Enabled | Tools)
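
The version-over-version change is easier to read as a sorted list of score differences. This is a minimal sketch using the M2.5 and M2.1 columns of the table above; `deltas` is an illustrative name. Note the assumption that scores are directly comparable: some rows pair different modes (for example, SWE-bench Verified is reported with tools for M2.5 and without tools for M2.1), so the differences are indicative only.

```python
# (M2.5 score, M2.1 score) per benchmark, copied from the version history table.
scores = {
    "GPQA Diamond": (85.20, 81.00),
    "HLE": (19.40, 22.00),
    "SWE-Bench Pro - Public": (55.40, 32.60),
    "SWE-bench Verified": (80.20, 74.80),
    "AIME2025": (86.30, 81.00),
    "τ²-Bench - Telecom": (97.80, 87.00),
    "IF Bench": (70.00, 70.00),
    "BrowseComp": (76.30, 47.40),
    "Terminal Bench 2.0": (51.70, 47.90),
    "Pinch Bench": (87.80, 84.30),
}

# Score change from M2.1 to M2.5, rounded to one decimal to hide float noise.
deltas = {name: round(new - old, 1) for name, (new, old) in scores.items()}

# Print the largest gains first.
for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {delta:+.1f}")
```

The largest gains are on BrowseComp and SWE-Bench Pro - Public, IF Bench is unchanged, and HLE is the only benchmark in this table where M2.5 scores below M2.1.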

Single-Benchmark Version Trend

Viewing: GPQA Diamond · Comprehensive Evaluation

Modes shown: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the MiniMax M2.5 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model	Supplier	Standard input	Standard output	Base price applies to
MiniMax M2.5	MiniMaxAI	$0.3 / 1M tokens	$2.4 / 1M tokens	—
MiniMax M2	—	$0.3 / 1M tokens	$1.2 / 1M tokens	—
M2.1	—	$0.3 / 1M tokens	$1.2 / 1M tokens	—

MiniMax M2.5 Benchmark Analysis

MiniMax M2.5's benchmark results are currently led by SWE-bench Verified (rank 10 / 105, score 80.20), Claw Bench (rank 4 / 29, score 92.10), and Pinch Bench (rank 6 / 37, score 87.80). This page also compares it with 2 competitor models and 2 predecessor or same-series models, including performance and pricing views when available. 2 source links are attached for reference.

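MiniMax M2.5 Model Evaluation Analysis Report

Introduction

This report analyzes the MiniMax M2.5 model based on information from the official announcement page. It focuses on the evaluation metrics, benchmarks, and comparisons presented in the source material. All data comes from the benchmarks and specifications provided there, with no additional interpretation.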

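Model Overview

MiniMax M2.5 was released on February 12, 2026. It comes in two versions, MiniMax-M2.5 and MiniMax-M2.5-Lightning, which have identical capabilities but differ in inference speed. The model was trained with reinforcement learning in hundreds of thousands of complex real-world environments. It arrives three and a half months after the M2 and M2.1 releases.

Its capabilities cover coding in more than 10 languages, agentic tool use, search, and office tasks. Coding support spans the full development lifecycle: system design, environment setup, development, feature iteration, code review, and testing. The model handles full-stack projects across platforms such as Web, Android, iOS, and Windows.

For agentic tool use and search, the model discards history once token usage exceeds 30% of the maximum context, and it uses parallel tool calls to reduce run time. Office integration includes Word, PowerPoint, and Excel skills, and users can create experts that combine these skills.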
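
The history-discarding rule can be illustrated with a small sketch. The source gives only the 30% threshold; everything else here is an assumption made for illustration: the `Message` type, the crude four-characters-per-token estimate, and the choice to keep the first message (taken to be the task statement) while dropping the rest. It is not MiniMax's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


def estimate_tokens(message: Message) -> int:
    # Assumption: roughly 4 characters per token; a real system would use a tokenizer.
    return max(1, len(message.content) // 4)


def manage_context(
    history: list[Message], max_context_tokens: int, threshold: float = 0.30
) -> list[Message]:
    """Discard history once estimated usage exceeds `threshold` of the context.

    Assumption: the first message is the task statement and is always kept.
    """
    used = sum(estimate_tokens(m) for m in history)
    if used <= threshold * max_context_tokens:
        return history
    return history[:1]


task = Message("user", "Find the release date of the library.")
tool_output = Message("tool", "x" * 400)  # about 100 estimated tokens

short_history = manage_context([task, tool_output], max_context_tokens=1000)
long_history = manage_context([task] + [tool_output] * 4, max_context_tokens=1000)
print(len(short_history), len(long_history))
```

With a 1,000-token context the budget is 300 tokens, so the two-message history is kept as is, while the history holding four tool outputs is cut back to the task statement alone.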

Performance Benchmarks

The model was evaluated on multiple benchmarks. Where noted, results are averages over 3-4 runs.

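Comparison with Other Models

The comparison is based on the provided benchmarks and cost metrics.

Compared with Claude Opus 4.6:
- SWE-Bench Verified time: 22.8 minutes (MiniMax M2.5) vs. 22.9 minutes.
- Droid scaffold: 79.7% vs. 78.9%.
- OpenCode scaffold: 76.1% vs. 75.9%.
- Cost per task: 10% of Claude Opus 4.6's cost.
Compared with Claude Opus 4.5: on par on VIBE-Pro.
General cost comparison with Opus, Gemini 3 Pro, GPT-5, and similar models: output price is 1/10 to 1/20 of theirs. Inference speed is close to 2x (M2.5-Lightning at 100 tokens/s vs. other frontier models).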

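Efficiency and Cost Analysis

Inference speed is 100 tokens/s for M2.5-Lightning and 50 tokens/s for M2.5. Token consumption per SWE-Bench task is 3.52M (vs. 3.72M for M2.1).

Cost structure:

M2.5-Lightning: $0.3 per million input tokens, $2.4 per million output tokens.
M2.5: half the cost of M2.5-Lightning.
Running continuously at 100 tokens/s: $1 per hour.
Running continuously at 50 tokens/s: $0.3 per hour.
Four instances running continuously for a full year: $10,000.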
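
The hourly and annual figures above can be sanity-checked from the per-token prices. The sketch below counts output tokens only, which is an assumption: the source does not state how much input traffic its round figures include, so the results are a lower bound on the quoted $1 and $0.3 per hour. The M2.5 output price of $1.2 per million tokens is derived from the "half the cost" statement, and `hourly_output_cost` is an illustrative helper.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365


def hourly_output_cost(tokens_per_second: float, usd_per_1m_output: float) -> float:
    """USD per hour spent on output tokens when generating continuously."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return tokens_per_hour * usd_per_1m_output / 1_000_000


# M2.5-Lightning: 100 tokens/s at $2.4 per 1M output tokens.
lightning_hourly = hourly_output_cost(100, 2.4)

# M2.5: 50 tokens/s at half the Lightning price (assumed $1.2 per 1M output tokens).
standard_hourly = hourly_output_cost(50, 1.2)

# Four instances running all year at the quoted $0.3 per hour.
annual_four_instances = 4 * HOURS_PER_YEAR * 0.3

print(f"M2.5-Lightning output cost: ${lightning_hourly:.3f}/hour")
print(f"M2.5 output cost: ${standard_hourly:.3f}/hour")
print(f"Four instances for a year: ${annual_four_instances:,.0f}")
```

Output tokens alone come to about $0.86 per hour at 100 tokens/s and about $0.22 per hour at 50 tokens/s, which is consistent with the rounded $1 and $0.3 figures once input tokens are added. Four instances at $0.3 per hour total roughly $10,500 per year, in line with the quoted $10,000.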

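Both versions support caching. Training used the Forge RL framework, with asynchronous scheduling and tree-structured merging delivering a 40x speedup, the CISPO algorithm for stability, and a process reward mechanism.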

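Deployment and Usage Metrics

The model is deployed in MiniMax Agent. It covers 30% of MiniMax's day-to-day company tasks, spanning R&D, product, sales, HR, and finance. 80% of the company's new code is generated by M2.5. Pre-built expert suites are available for office work, finance, and programming.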

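Benchmark	Result	Notes
SWE-Bench Verified	80.2%	37% faster than M2.1 (22.8 minutes vs. 31.3 minutes). Droid scaffold: 79.7%. OpenCode scaffold: 76.1%.
Multi-SWE-Bench	51.3%	-
BrowseComp	76.3%	With context management. Uses 20% fewer search rounds than M2.1.
VIBE-Pro	On par with Claude Opus 4.5	Internal benchmark, using the Claude Code scaffold.
Terminal Bench 2	Tested with modifications	Claude Code 2.0.64 scaffold, 8-core CPU/16 GB RAM, 7,200-second timeout.
RISE	Expert-level search performance	Uses a Playwright-based browser tool.
GDPval-MM	59.0% average win rate	Pairwise LLM-as-a-judge evaluation against mainstream models.
MEWC	Evaluated on 179 problems	Problems from Excel competitions, 2021–2026.
Financial modeling	Scored against a rubric	Average of 3 runs.
AIME25 ~ AA-LCR	Internal testing	Public datasets from the Artificial Analysis Intelligence Index.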
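
The "37% faster" figure for SWE-Bench Verified can be reproduced from the two completion times, and the same arithmetic applies to the per-task token counts quoted in the efficiency section. This is a minimal sketch; it assumes "faster" means the ratio of old time to new time, which is the reading that matches the source's 37%. Measured as time saved, the same numbers give about 27%.

```python
def speedup_percent(old: float, new: float) -> float:
    """How much faster the new run is, as the old/new ratio minus one, in percent."""
    return (old / new - 1.0) * 100.0


def reduction_percent(old: float, new: float) -> float:
    """How much smaller the new value is relative to the old one, in percent."""
    return (1.0 - new / old) * 100.0


# SWE-Bench Verified completion time in minutes: M2.1 vs. M2.5.
print(f"Speedup: {speedup_percent(31.3, 22.8):.1f}%")
print(f"Time saved: {reduction_percent(31.3, 22.8):.1f}%")

# Tokens per SWE-Bench task in millions: M2.1 vs. M2.5.
print(f"Token reduction: {reduction_percent(3.72, 3.52):.1f}%")
```

The token reduction works out to a little over 5%, so most of the time saving comes from something other than generating fewer tokens.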
