GLM 5.1 Benchmark Analysis

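GLM 5.1 is a flagship open-source large language model released by Zhipu AI in April 2026. It is relatively strong in mathematical reasoning (95.3 on AIME 2026, 2nd globally) and software engineering (58.4 on SWE-Bench Pro, first among open-source models). This page provides GLM 5.1's full results on 9 mainstream benchmarks, a side-by-side comparison with peer models such as Kimi K2.6 and DeepSeek-V4-Pro, a generation-by-generation comparison within the GLM series, API pricing, and a capability analysis.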

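Results for 9 evaluations are currently recorded for GLM 5.1, covering four areas: general evaluation, mathematical reasoning, software engineering, and AI Agent. The number of benchmarks per area varies considerably, and not every competing model was tested on the same benchmarks, so comparisons across areas should be made with caution.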

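One premise is key to reading these numbers: all GLM 5.1 results come from thinking mode (reasoning mode), and some benchmarks additionally enable tool calling or internet access. Take HLE as an example: without tools the score is 31.0 (61st globally), and with tools it rises to 52.3 (9th globally), a difference of more than 50 rank positions between the two conditions. Tool calling therefore has a significant effect on GLM 5.1's overall performance, and any assessment needs to separate "the model's own reasoning ability" from "the model's ability to complete tasks with tool support".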
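As a rough illustration of how large the tool effect is, the HLE figures quoted above can be turned into an absolute gain, a relative gain, and a rank shift. This is only a sketch of the arithmetic; the variable names are illustrative and the inputs are the two (score, rank) pairs from the paragraph above.

```python
# HLE results for GLM 5.1 as quoted in the text: (score, global rank)
hle_no_tools = (31.0, 61)
hle_with_tools = (52.3, 9)

# Absolute score gain from enabling tools, in points
score_gain = round(hle_with_tools[0] - hle_no_tools[0], 1)

# Gain relative to the no-tools score, in percent
relative_gain = round(score_gain / hle_no_tools[0] * 100, 1)

# Number of rank positions gained (lower rank number is better)
rank_gain = hle_no_tools[1] - hle_with_tools[1]

print(f"Score gain with tools: +{score_gain} points ({relative_gain}% relative)")
print(f"Rank positions gained: {rank_gain}")
```

The relative figure is the more useful one when comparing tool sensitivity across models whose no-tools baselines differ.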

Comparison with peer models

GLM 5.1 is compared here with Kimi K2.6, MiniMax-M2.7, and DeepSeek-V4-Pro, using each model's best score:

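Benchmark	GLM 5.1	Kimi K2.6	MiniMax-M2.7	DeepSeek-V4-Pro
GPQA Diamond	86.2	90.5	87.0	90.1
HLE (with tools)	52.3	54.0	28.0	48.2
SWE-Bench Pro	58.4	58.6	56.2	55.4
BrowseComp	79.3	83.2	--	83.4
Terminal Bench 2.0	63.5	66.7	--	67.9
Tool Decathlon	40.7	50.0	--	--
AIME 2026	95.3	96.4	--	--
IMO-AnswerBench	83.8	86.0	--	89.8

The highest score in each row belongs to Kimi K2.6 on GPQA Diamond, HLE, SWE-Bench Pro, Tool Decathlon, and AIME 2026, and to DeepSeek-V4-Pro on BrowseComp, Terminal Bench 2.0, and IMO-AnswerBench.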

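Across the 8 benchmarks compared, GLM 5.1 does not take the top score on any. Against Kimi K2.6, the gap is smallest in software engineering (0.2 points on SWE-Bench Pro) and largest in tool orchestration (9.3 points on Tool Decathlon). DeepSeek-V4-Pro is slightly ahead of GLM 5.1 on web information gathering (BrowseComp, 83.4 vs. 79.3) and terminal tool execution (Terminal Bench 2.0, 67.9 vs. 63.5), but scores lower than GLM 5.1 on HLE with tools (48.2 vs. 52.3). MiniMax-M2.7 has too many missing results for a full comparison.

Overall, GLM 5.1 and Kimi K2.6 sit in the same tier of current open-source models. The gaps are small on most benchmarks, but Kimi K2.6 is at or above GLM 5.1 on every benchmark where both have data.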
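The per-benchmark gaps between GLM 5.1 and Kimi K2.6 can be recomputed directly from the comparison table. The sketch below hard-codes the two score columns from that table (the dictionary layout and variable names are illustrative) and finds the smallest and largest gap.

```python
# Scores copied from the comparison table: (GLM 5.1, Kimi K2.6)
scores = {
    "GPQA Diamond": (86.2, 90.5),
    "HLE (with tools)": (52.3, 54.0),
    "SWE-Bench Pro": (58.4, 58.6),
    "BrowseComp": (79.3, 83.2),
    "Terminal Bench 2.0": (63.5, 66.7),
    "Tool Decathlon": (40.7, 50.0),
    "AIME 2026": (95.3, 96.4),
    "IMO-AnswerBench": (83.8, 86.0),
}

# Positive gap means Kimi K2.6 scores higher than GLM 5.1
gaps = {name: round(kimi - glm, 1) for name, (glm, kimi) in scores.items()}

smallest = min(gaps, key=gaps.get)
largest = max(gaps, key=gaps.get)

# Print benchmarks from smallest to largest gap
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:20s} {gap:+.1f}")
print(f"Smallest gap: {smallest}; largest gap: {largest}")
```

Raw point gaps are not directly comparable across benchmarks with different score distributions, so the ordering is best read as a rough guide.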

Improvement trend across versions

Benchmark	GLM-4.6	GLM-4.7	GLM-5	GLM 5.1
GPQA Diamond	82.9	85.7	86.0	86.2
HLE (with tools)	30.4	42.8	50.4	52.3
BrowseComp	45.1	52.0	75.9	79.3
Terminal Bench 2.0	--	41.0	61.1	63.5
SWE-Bench Pro	--	40.6	--	58.4
AIME 2026	--	92.9	92.7	95.3

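The largest jumps between GLM-4.7 and GLM-5 are on BrowseComp (+23.9) and Terminal Bench 2.0 (+20.1), which marks this stage as the main breakthrough period for Agent capability. HLE also rose by 7.6 points in that step, although its largest gain (+12.4) came one generation earlier, from GLM-4.6 to GLM-4.7. From GLM-5 to GLM 5.1 the gains narrow across the board (at most 3.4 points on the benchmarks with data for both versions), which looks more like targeted strengthening than a full generational leap, with the emphasis on software engineering and long-horizon tasks. GPQA Diamond has moved by less than 4 points across all four versions, so improvement there is relatively limited.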
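To check where each benchmark improved most, the version table can be reduced to step-by-step deltas between adjacent versions. The sketch below copies the table values, treats missing entries as `None`, and skips any step where either side is missing; the helper name `step_deltas` is illustrative.

```python
versions = ["GLM-4.6", "GLM-4.7", "GLM-5", "GLM 5.1"]

# Scores copied from the version table; None marks a missing result
history = {
    "GPQA Diamond": [82.9, 85.7, 86.0, 86.2],
    "HLE (with tools)": [30.4, 42.8, 50.4, 52.3],
    "BrowseComp": [45.1, 52.0, 75.9, 79.3],
    "Terminal Bench 2.0": [None, 41.0, 61.1, 63.5],
    "SWE-Bench Pro": [None, 40.6, None, 58.4],
    "AIME 2026": [None, 92.9, 92.7, 95.3],
}


def step_deltas(series):
    """Return {(from_version, to_version): delta} for adjacent versions with scores."""
    out = {}
    for i in range(len(series) - 1):
        prev, curr = series[i], series[i + 1]
        if prev is not None and curr is not None:
            out[(versions[i], versions[i + 1])] = round(curr - prev, 1)
    return out


deltas = {name: step_deltas(series) for name, series in history.items()}

for name, steps in deltas.items():
    if not steps:
        print(f"{name}: no adjacent versions with data")
        continue
    best = max(steps, key=steps.get)
    print(f"{name}: largest step {best[0]} -> {best[1]} ({steps[best]:+.1f})")
```

SWE-Bench Pro has no two adjacent versions with scores in this table, so it produces no step deltas even though GLM-4.7 and GLM 5.1 can still be compared directly (40.6 vs. 58.4).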

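Points worth noting

Long-horizon task ability has had limited verification

Zhipu AI officially claims that GLM 5.1 can work autonomously on a single task for 8 hours. So far this has mainly been shown through official demos, including recreating the macOS desktop interface and building a Linux system. Such demos indicate the model's upper bound under specific conditions, but no independent third party has yet run a systematic evaluation of hour-scale tasks on a standardized benchmark. Terminal Bench 2.0 (63.5) partly reflects long-horizon tool execution, but that benchmark was not designed specifically for hour-scale tasks. Results in real use will vary with task type and complexity.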

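Strong dependence on tools

As noted above, GLM 5.1 performs very differently with and without tools. It shows its strengths best in engineering environments backed by a complete toolchain, and is less competitive in pure-text reasoning or tool-restricted settings.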

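Prices have risen noticeably across versions

Compared with GLM-5, GLM 5.1's input price rose from $1.00 to $1.40 per 1M tokens (+40%), and its output price rose from $3.20 to $4.40 per 1M tokens (+37.5%). Among the models compared, GLM 5.1 has the highest output price. For output-heavy workloads, this cost change should be factored into model selection.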
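The percentage increases and their effect on a bill can be reproduced with a few lines of arithmetic. In the sketch below the prices are the GLM-5 and GLM 5.1 rates from the pricing tables on this page, while the workload (10M input tokens and 5M output tokens) is a purely hypothetical example chosen for illustration.

```python
# USD per 1M tokens: (input price, output price)
prices = {
    "GLM-5": (1.00, 3.20),
    "GLM 5.1": (1.40, 4.40),
}


def pct_change(old, new):
    """Percentage change from old to new, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)


def job_cost(model, input_tokens, output_tokens):
    """Cost in USD of a job with the given token counts at the model's standard rates."""
    input_price, output_price = prices[model]
    cost = input_price * input_tokens / 1e6 + output_price * output_tokens / 1e6
    return round(cost, 2)


input_increase = pct_change(prices["GLM-5"][0], prices["GLM 5.1"][0])
output_increase = pct_change(prices["GLM-5"][1], prices["GLM 5.1"][1])
print(f"Input price change: +{input_increase}%, output price change: +{output_increase}%")

# Hypothetical workload (an assumption, not from the source)
IN_TOKENS, OUT_TOKENS = 10_000_000, 5_000_000
for model in prices:
    print(f"{model}: ${job_cost(model, IN_TOKENS, OUT_TOKENS):.2f}")
```

Because the output rate is roughly three times the input rate for both versions, the overall increase on a real bill depends mostly on how output-heavy the workload is.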

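Summary

GLM 5.1 performs strongly in mathematical reasoning and software engineering and ranks near the top among open-source models. It is best suited to engineering tasks backed by a tool environment, and Agent tool calling is an important precondition for it to show its strengths.

Compared with competitors, GLM 5.1 is at a similar level to Kimi K2.6. The gaps are small on most benchmarks, but on the data currently available Kimi K2.6 is slightly better overall. The long-horizon task ability that the vendor emphasizes still lacks systematic third-party verification, so users need to assess for themselves whether it fits a given scenario. Pricing has risen noticeably over the previous generation, and model selection should take actual usage volume into account.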

Benchmark Results

General Knowledge

4 evaluations

Benchmark	Mode	Score	Rank/total
GPQA Diamond	Thinking Mode	86.20	43 / 179
LiveBench	Standard Mode	70.18	37 / 115
HLE	Thinking Mode	--	71 / 159
HLE	Thinking Mode | Tools	52.30	13 / 159

Coding and Software Engineering

1 evaluation

Benchmark	Mode	Score	Rank/total
SWE-Bench Pro - Public	Thinking Mode | Tools	58.40	10 / 44

AI Agent - Information Search

1 evaluation

Benchmark	Mode	Score	Rank/total
BrowseComp	Thinking Mode | Tools | Internet	79.30	13 / 45

AI Agent - Tool Usage

4 evaluations

Benchmark	Mode	Score	Rank/total
MCP-Atlas	Standard Mode | Tools	75.60	8 / 23
Terminal Bench 2.0	Thinking Mode | Tools	63.50	13 / 46
TerminalBench 2.1	Thinking Level · High | Tools	58.70	13 / 15
Tool Decathlon	Thinking Mode | Tools	40.70	3 / 7

Math and Reasoning

2 evaluations

Benchmark	Mode	Score	Rank/total
AIME 2026	Thinking Mode	95.30	3 / 15
IMO-AnswerBench	Thinking Mode	83.80	11 / 20

Compare with other models

Competitor Comparison

Benchmark scores for GLM 5.1 compared against top models in its class

8 benchmarks with comparable scores. Each model shows its best score, with the evaluation mode in parentheses.

Benchmark	Category	GLM 5.1 (current)	Kimi K2.6	MiniMax-M2.7	DeepSeek-V4-Pro
GPQA Diamond	General Knowledge	86.20 (Thinking Enabled)	--	87.00 (Thinking Enabled)	90.10 (Thinking Level · High)
HLE	General Knowledge	52.30 (Thinking Enabled | Tools)	54.00 (Thinking Enabled | Tools)	28.00 (Thinking Enabled)	48.20 (Thinking Level · Extra High | Tools)
LiveBench	General Knowledge	70.18 (Standard Mode)	72.17 (Thinking Enabled)	63.49 (Deep Thinking Mode)	73.58 (Standard Mode)
SWE-Bench Pro - Public	Coding and Software Engineering	58.40 (Thinking Enabled | Tools)	58.60 (Thinking Enabled | Tools)	56.20 (Thinking Enabled | Tools)	55.40 (Thinking Level · Extra High | Tools)
BrowseComp	AI Agent - Information Search	79.30 (Thinking Enabled | Tools)	83.20 (Thinking Enabled | Tools)	--	83.40 (Thinking Level · Extra High | Tools)
Terminal Bench 2.0	AI Agent - Tool Usage	63.50 (Thinking Enabled | Tools)	66.70 (Thinking Enabled | Tools)	--	67.90 (Thinking Level · Extra High | Tools)
Tool Decathlon	AI Agent - Tool Usage	40.70 (Thinking Enabled | Tools)	50.00 (Thinking Enabled | Tools)	--	--
IMO-AnswerBench	Math and Reasoning	83.80 (Thinking Enabled)	--	--	89.80 (Thinking Level · High)

Standard API Pricing: GLM 5.1 vs. Peer Models

Standard text input and output prices for each model, side by side. Where extended-context pricing exists, the base rate is shown.

Source: DataLearnerAI. Prices are the standard text prices from each model's default supplier, in USD per 1M tokens.

Model	Supplier	Standard input	Standard output	Base price applies to
GLM 5.1	Zhipu AI	$1.4 / 1M tokens	$4.4 / 1M tokens	--
Kimi K2.6	Moonshot AI	$0.95 / 1M tokens	$4 / 1M tokens	--
MiniMax-M2.7	MiniMaxAI	$0.3 / 1M tokens	$1.2 / 1M tokens	--
DeepSeek-V4-Pro	DeepSeek-AI	$0.435 / 1M tokens	$0.87 / 1M tokens	--

Version History

How each version of the GLM 5.1 series stacks up on benchmark tests

9 benchmarks with comparable scores. Each model shows its best score, with the evaluation mode in parentheses.

Benchmark	Category	GLM 5.1 (current)	GLM-5	GLM-4.7	GLM-4.6
GPQA Diamond	General Knowledge	86.20 (Thinking Enabled)	86.00 (Thinking Enabled)	85.70 (Thinking Enabled)	82.90 (Thinking Enabled | Tools)
HLE	General Knowledge	52.30 (Thinking Enabled | Tools)	50.40 (Thinking Enabled | Tools)	42.80 (Thinking Enabled | Tools)	30.40 (Thinking Enabled | Tools)
LiveBench	General Knowledge	70.18 (Standard Mode)	68.85 (Standard Mode)	--	--
SWE-Bench Pro - Public	Coding and Software Engineering	58.40 (Thinking Enabled | Tools)	--	40.60 (Thinking Enabled | Tools)	--
BrowseComp	AI Agent - Information Search	79.30 (Thinking Enabled | Tools)	75.90 (Thinking Enabled | Tools)	52.00 (Thinking Enabled | Tools)	45.10 (Thinking Enabled | Tools)
MCP-Atlas	AI Agent - Tool Usage	75.60 (Standard Mode | Tools)	--	58.10 (Standard Mode | Tools)	--
Terminal Bench 2.0	AI Agent - Tool Usage	63.50 (Thinking Enabled | Tools)	61.10 (Thinking Enabled | Tools)	41.00 (Thinking Enabled | Tools)	--
AIME 2026	Math and Reasoning	95.30 (Thinking Enabled)	92.70 (Thinking Enabled)	92.90 (Thinking Enabled)	--
IMO-AnswerBench	Math and Reasoning	83.80 (Thinking Enabled)	82.50 (Thinking Enabled)	--	--

Single-Benchmark Version Trend

[Chart: GPQA Diamond (General Knowledge) score by model version. X-axis: model and release date. Y-axis: score. Series by mode: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools.]

Standard API Pricing Across the GLM 5.1 Series

Source: DataLearnerAI. Prices are the standard text prices from each model's default supplier, in USD per 1M tokens.

Model	Supplier	Standard input	Standard output	Base price applies to
GLM 5.1	Zhipu AI	$1.4 / 1M tokens	$4.4 / 1M tokens	--
GLM-5	Zhipu AI	$1 / 1M tokens	$3.2 / 1M tokens	--

Sources

z.ai