GLM-5 Benchmark Analysis

GLM-5 currently shows benchmark results led by τ²-Bench (4 / 40, score 89.70), HLE (19 / 159, score 50.40), τ²-Bench - Telecom (5 / 35, score 98). This page also compares it with 3 competitor models and 4 predecessor or same-series models, including performance and pricing views when available. 2 source links are attached for reference.

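As Zhipu AI's fifth-generation flagship model, GLM-5 shows marked improvements across several dimensions:

Core performance metrics:

Math and reasoning: 92.7% on AIME 2026 and 86.0% on GPQA-Diamond
Coding: 77.8% on SWE-bench Verified and 73.3% on SWE-bench Multilingual
Agent tasks: 62.0 on BrowseComp and 56.2 on Terminal-Bench 2.0
Expert-level reasoning: 50.4 on HLE (Humanity's Last Exam, with tools), ranked 3rd

Model scale:

Total parameters: 744B
Active parameters: 40B
Architecture: MoE (Mixture of Experts)
Context length: 200K tokens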

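2. Position among open-source models

GLM-5 stands out among open-source models:

Leading results on several benchmarks
- SWE-bench Verified (77.8%): first among open-source models
- Terminal Bench 2.0 (61.1%): third among open-source models
- τ²-Bench (89.7%): second among open-source models
Ahead of comparable competitors
- Reported to surpass Google Gemini 3.0 Pro in overall performance
- Outperforms most open-source models on agent evaluations
- 98% build success rate in front-end development (CC-Bench-V2)
Parameter efficiency
- Roughly double the size of its predecessor GLM-4.7 (355B parameters)
- Only 40B parameters are active, which keeps inference efficient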
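
As a rough check on the parameter figures above, the short Python sketch below computes the share of GLM-5's parameters that are active and the scale-up over GLM-4.7. The constant and variable names are illustrative; the numbers are the ones quoted in this section.

```python
# Parameter counts quoted in the text, in billions.
TOTAL_PARAMS_B = 744       # GLM-5 total parameters
ACTIVE_PARAMS_B = 40       # GLM-5 active parameters
PREV_TOTAL_PARAMS_B = 355  # GLM-4.7 total parameters

# Fraction of the full model that is active during inference.
active_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# How much larger GLM-5 is than GLM-4.7 in total parameters.
scale_up = TOTAL_PARAMS_B / PREV_TOTAL_PARAMS_B

print(f"Active share: {active_share:.1%}")
print(f"Scale-up over GLM-4.7: {scale_up:.2f}x")
```

On these figures, only about one parameter in nineteen is active, which is the basis for the efficiency claim.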

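3. Gap to top closed-source models

Although GLM-5 performs strongly among open-source models, it still trails the top closed-source models:

Comparison with Claude Opus 4.5:

SWE-bench Verified: GLM-5 (77.8%) vs Claude Opus 4.5 (80.9%)
Official positioning: "close to" the Opus 4.5 experience on software engineering tasks
Complex reasoning and long-horizon planning still have room to improve

Areas of advantage:

Cost efficiency: API pricing is only about 20% of that of mainstream models
Inference speed: the optimized architecture delivers faster responses
Open and transparent: fully open source, with support for local deployment and customization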
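
The SWE-bench Verified gap quoted above can be expressed both in percentage points and relative to the Opus 4.5 score. A minimal sketch, using only the two scores from this comparison (the function and constant names are illustrative):

```python
GLM5_SWE = 77.8    # GLM-5, SWE-bench Verified (%)
OPUS45_SWE = 80.9  # Claude Opus 4.5, SWE-bench Verified (%)

def score_gap(model: float, reference: float) -> tuple[float, float]:
    """Return (gap in percentage points, gap as a fraction of the reference score)."""
    absolute = reference - model
    return absolute, absolute / reference

points, relative = score_gap(GLM5_SWE, OPUS45_SWE)
print(f"Gap: {points:.1f} points ({relative:.1%} of the Opus 4.5 score)")
```

The two readings differ slightly, so it helps to state which one a quoted "gap" refers to.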

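4. Technical highlights

Architecture optimization
- First integration of the DeepSeek Sparse Attention mechanism
- Substantially lower deployment cost and better token efficiency
- Long-context performance preserved without loss
Training method innovations
- Introduces the "Slime" asynchronous reinforcement learning framework
- Pre-training data increased from 23T to 28.5T tokens
- Asynchronous agent reinforcement learning algorithm
Capability integration
- The first open-source model to natively combine reasoning, coding, and agent capabilities
- Supports switching between Thinking Mode and standard mode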

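5. Application strengths

Areas where it is especially strong:

Agentic Engineering: from "Vibe Coding" to systematic engineering
Front-end development: 98% build success rate, 26 percentage points higher than its predecessor
Long-horizon task planning: completes complex multi-step workflows autonomously
Coding agents: compatible with mainstream tools such as Claude Code and Cline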

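6. Overall assessment

Strengths:

In the top tier of open-source models for overall capability
Very high parameter efficiency and a clear cost advantage
Excellent performance on agent and coding tasks
MIT license, friendly to commercial use

Weaknesses:

⚠️ Still a 3-5% performance gap to top closed-source models such as Claude Opus 4.5
⚠️ Slightly behind Gemini 3 Pro in some complex reasoning scenarios

Summary: GLM-5 is one of the strongest open-source models currently available, and is particularly well suited to enterprises and developers who need a cost-effective AI solution. Its coding, agent, and systems engineering capabilities have reached near-frontier level, making it an important milestone for Chinese open-source large models.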

Benchmark Results

General Knowledge

6 evaluations

Benchmark	Mode	Score	Rank/total
GPQA Diamond	Thinking Enabled	--	44 / 179
LiveBench	Standard Mode	68.85	43 / 115
HLE	--	50.40	19 / 159
HLE	Thinking Enabled	30.50	75 / 159
ARC-AGI	Thinking Enabled	44.70	44 / 65
ARC-AGI-2	Thinking Enabled	4.90	44 / 59

Coding and Software Engineering

1 evaluation

Benchmark	Mode	Score	Rank/total
SWE-bench Verified	Thinking Enabled	77.80	23 / 108

Commonsense Reasoning

1 evaluation

Benchmark	Mode	Score	Rank/total
Simple Bench	Standard Mode	53.20	23 / 63

Agent Level Benchmark

3 evaluations

Benchmark	Mode	Score	Rank/total
τ²-Bench - Telecom	--	--	5 / 35
τ²-Bench	--	89.70	4 / 40
Terminal Bench Hard	--	--	2 / 13

Math and Reasoning

3 evaluations

Benchmark	Mode	Score	Rank/total
AIME 2026	Thinking Enabled	92.70	8 / 15
IMO-AnswerBench	Thinking Enabled	82.50	14 / 20
FrontierMath - Tier 4	Standard Mode	2.10	56 / 80

Instruction Following

1 evaluation

Benchmark	Mode	Score	Rank/total
IF Bench	--	--	10 / 29

AI Agent - Information Search

2 evaluations

Benchmark	Mode	Score	Rank/total
BrowseComp	--	75.90	19 / 45
BrowseComp	Thinking Enabled	--	26 / 45

AI Agent - Tool Usage

1 evaluation

Benchmark	Mode	Score	Rank/total
Terminal Bench 2.0	--	61.10	18 / 46

Productivity Knowledge

1 evaluation

Benchmark	Mode	Score	Rank/total
GDPval-AA	Thinking Enabled	--	14 / 21

Long Context

2 evaluations

Benchmark	Mode	Score	Rank/total
AA-LCR	Thinking Enabled	--	12 / 13
LongBench v2	Standard Mode	60.80	6 / 11

Claw-style Agent Evaluation

2 evaluations

Benchmark	Mode	Score	Rank/total
Claw Bench	Thinking Enabled ｜ Tools	91.70	5 / 29
Pinch Bench	Thinking Enabled ｜ Tools	86.40	12 / 37
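
Because each leaderboard has a different number of entries, the raw "rank / total" values above are hard to compare directly. One way to normalize them, sketched below in Python, is to convert each into the fraction of the leaderboard at or above that rank. The helper name `top_fraction` is hypothetical, and the three examples are ranks taken from the tables in this section.

```python
def top_fraction(rank_total: str) -> float:
    """Convert a 'rank / total' string into rank divided by total (smaller is better)."""
    rank_text, total_text = rank_total.split("/")
    rank, total = int(rank_text), int(total_text)
    if not 1 <= rank <= total:
        raise ValueError(f"rank must lie between 1 and total: {rank_total!r}")
    return rank / total

# Ranks copied from the benchmark tables above.
examples = {
    "τ²-Bench": "4 / 40",
    "HLE": "19 / 159",
    "AA-LCR": "12 / 13",
}

for name, rank_total in examples.items():
    print(f"{name}: top {top_fraction(rank_total):.0%} of listed entries")
```

A rank of 12 on a 13-entry board and a rank of 19 on a 159-entry board look similar as raw numbers, but the normalized values make the difference obvious.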

Compare with other models

Competitor Comparison

Benchmark scores for GLM-5 compared against top models in its class

Models compared: GLM-5, DeepSeek V4, Kimi K2.5, MiniMax M2.5

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	Category	GLM-5 (current)	Kimi K2.5	MiniMax M2.5
ARC-AGI	General Knowledge	44.70 (Thinking Enabled)	--	63.70 (Thinking Enabled)
ARC-AGI-2	General Knowledge	4.90 (Thinking Enabled)	--	4.90 (Thinking Enabled)
GPQA Diamond	General Knowledge	86.00 (Thinking Enabled)	--	85.20 (Thinking Enabled)
HLE	General Knowledge	50.40 (Thinking Enabled ｜ Tools)	50.20 (Thinking Enabled ｜ Tools)	19.40 (Thinking Enabled)
SWE-bench Verified	Coding and Software Engineering	77.80 (Thinking Enabled)	76.80 (Thinking Enabled ｜ Tools)	80.20 (Thinking Enabled ｜ Tools)
τ²-Bench - Telecom	Agent Level Benchmark	98.00 (Thinking Enabled ｜ Tools)	--	97.80 (Thinking Enabled ｜ Tools)
IF Bench	Instruction Following	72.00 (Thinking Enabled ｜ Tools)	--	70.00 (Thinking Enabled ｜ Tools)
BrowseComp	AI Agent - Information Search	75.90 (Thinking Enabled ｜ Tools)	60.60 (Thinking Enabled ｜ Tools)	76.30 (Thinking Enabled ｜ Tools)
Terminal Bench 2.0	AI Agent - Tool Usage	61.10 (Thinking Enabled ｜ Tools)	50.80 (Thinking Enabled ｜ Tools)	51.70 (Thinking Enabled ｜ Tools)
GDPval-AA	Productivity Knowledge	46.00 (Thinking Enabled)	--	36.00 (Thinking Enabled)
AA-LCR	Long Context	63.00 (Thinking Enabled)	--	69.50 (Thinking Enabled)
Claw Bench	Claw-style Agent Evaluation	91.70 (Thinking Enabled ｜ Tools)	81.70 (Thinking Enabled ｜ Tools)	92.10 (Thinking Enabled ｜ Tools)

1 additional benchmark remains in the chart above.
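
To summarize the competitor table, a simple head-to-head tally between GLM-5 and MiniMax M2.5 can be sketched in Python as follows. The scores are copied from the 12 rows above. The tally ignores the fact that the two models are sometimes scored in different modes (with or without tools), so it should be read as a rough indication rather than a controlled comparison. The names `SCORES` and `tally` are illustrative.

```python
# (GLM-5, MiniMax M2.5) best scores per benchmark, copied from the table above.
SCORES = {
    "ARC-AGI": (44.70, 63.70),
    "ARC-AGI-2": (4.90, 4.90),
    "GPQA Diamond": (86.00, 85.20),
    "HLE": (50.40, 19.40),
    "SWE-bench Verified": (77.80, 80.20),
    "τ²-Bench - Telecom": (98.00, 97.80),
    "IF Bench": (72.00, 70.00),
    "BrowseComp": (75.90, 76.30),
    "Terminal Bench 2.0": (61.10, 51.70),
    "GDPval-AA": (46.00, 36.00),
    "AA-LCR": (63.00, 69.50),
    "Claw Bench": (91.70, 92.10),
}

def tally(scores: dict[str, tuple[float, float]]) -> tuple[list[str], list[str], list[str]]:
    """Split benchmarks into those the first model wins, loses, and ties."""
    wins = [name for name, (a, b) in scores.items() if a > b]
    losses = [name for name, (a, b) in scores.items() if a < b]
    ties = [name for name, (a, b) in scores.items() if a == b]
    return wins, losses, ties

wins, losses, ties = tally(SCORES)
print(f"GLM-5 ahead on {len(wins)}, behind on {len(losses)}, tied on {len(ties)}")
```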

Standard API Pricing: GLM-5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model	Supplier	Standard input	Standard output	Base price applies to
GLM-5	Zhipu AI	$1 / 1M tokens	$3.2 / 1M tokens	--
MiniMax M2.5	MiniMaxAI	$0.3 / 1M tokens	$2.4 / 1M tokens	--
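
To see how the per-token prices translate into a bill, the Python sketch below applies the standard input and output rates from the table to a single workload. The workload size (2M input tokens, 0.5M output tokens) is a made-up example, and the `Pricing` class is a hypothetical helper rather than any supplier's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total cost in USD for the given token counts."""
        return (input_tokens * self.input_per_million
                + output_tokens * self.output_per_million) / 1_000_000

# Standard rates from the pricing table above.
PRICES = {
    "GLM-5": Pricing(input_per_million=1.0, output_per_million=3.2),
    "MiniMax M2.5": Pricing(input_per_million=0.3, output_per_million=2.4),
}

# Hypothetical workload, chosen only for illustration.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 500_000

for model, pricing in PRICES.items():
    print(f"{model}: ${pricing.cost(INPUT_TOKENS, OUTPUT_TOKENS):.2f}")
```

The ratio between the two bills depends on the input/output mix, since the models differ more on input price than on output price.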

Version History

How each version of the GLM-5 series stacks up on benchmark tests

Models compared: GLM-5, GLM-4.7, GLM-4.6, GLM-4.5, GLM4

11 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	Category	GLM-5 (current)	GLM-4.7	GLM-4.6	GLM-4.5
GPQA Diamond	General Knowledge	86.00 (Thinking Enabled)	85.70 (Thinking Enabled)	82.90 (Thinking Enabled ｜ Tools)	79.10 (Thinking Enabled)
HLE	General Knowledge	50.40 (Thinking Enabled ｜ Tools)	42.80 (Thinking Enabled ｜ Tools)	30.40 (Thinking Enabled ｜ Tools)	14.40 (Thinking Enabled)
SWE-bench Verified	Coding and Software Engineering	77.80 (Thinking Enabled)	73.80 (Thinking Enabled ｜ Tools)	68.00 (Standard Mode)	64.20 (Thinking Enabled)
Simple Bench	Commonsense Reasoning	53.20 (Standard Mode)	47.70 (Thinking Enabled)	--	--
Terminal Bench Hard	Agent Level Benchmark	43.00 (Thinking Enabled ｜ Tools)	33.30 (Thinking Enabled ｜ Tools)	--	--
τ²-Bench	Agent Level Benchmark	89.70 (Thinking Enabled ｜ Tools)	87.40 (Thinking Enabled ｜ Tools)	75.90 (Thinking Enabled ｜ Tools)	--
τ²-Bench - Telecom	Agent Level Benchmark	98.00 (Thinking Enabled ｜ Tools)	--	71.00 (Thinking Enabled ｜ Tools)	--
AIME 2026	Math and Reasoning	92.70 (Thinking Enabled)	92.90 (Thinking Enabled)	--	--
IF Bench	Instruction Following	72.00 (Thinking Enabled ｜ Tools)	--	43.00 (Thinking Enabled)	--
BrowseComp	AI Agent - Information Search	75.90 (Thinking Enabled ｜ Tools)	52.00 (Thinking Enabled ｜ Tools)	45.10 (Thinking Enabled ｜ Tools)	--
Terminal Bench 2.0	AI Agent - Tool Usage	61.10 (Thinking Enabled ｜ Tools)	41.00 (Thinking Enabled ｜ Tools)	--	--
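
The generation-over-generation change is easier to read as a list of deltas. The Python sketch below computes GLM-5 minus GLM-4.7 for every row where both models have a score, using the values from the table above. Rows where GLM-4.7 shows "--" are left out, and because modes are not always identical between versions, the deltas are indicative only. Variable names are illustrative.

```python
# (GLM-5, GLM-4.7) best scores copied from the version history table above.
SCORES = {
    "GPQA Diamond": (86.00, 85.70),
    "HLE": (50.40, 42.80),
    "SWE-bench Verified": (77.80, 73.80),
    "Simple Bench": (53.20, 47.70),
    "Terminal Bench Hard": (43.00, 33.30),
    "τ²-Bench": (89.70, 87.40),
    "AIME 2026": (92.70, 92.90),
    "BrowseComp": (75.90, 52.00),
    "Terminal Bench 2.0": (61.10, 41.00),
}

# Positive delta means GLM-5 scores higher than GLM-4.7.
deltas = {name: round(new - old, 2) for name, (new, old) in SCORES.items()}

largest_gain = max(deltas, key=deltas.get)
regressions = [name for name, delta in deltas.items() if delta < 0]

for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {delta:+.2f}")
print(f"Largest gain: {largest_gain}; regressions: {regressions}")
```

On these numbers the biggest jumps are on the agent benchmarks, while AIME 2026 is essentially flat.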

Single-Benchmark Version Trend

Viewing: GPQA Diamond (General Knowledge)

Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the GLM-5 Series

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model	Supplier	Standard input	Standard output	Base price applies to
GLM-5	Zhipu AI	$1 / 1M tokens	$3.2 / 1M tokens	--

Sources

z.ai
pinchbench.com