GLM-5 Benchmark Analysis

GLM-5's strongest leaderboard placements are τ²-Bench (rank 4 of 40, score 89.70), HLE (rank 19 of 159, score 50.40), and τ²-Bench - Telecom (rank 5 of 35, score 98). This page also compares it with 3 competitor models and 4 predecessor or same-series models, including performance and pricing views when available. Two source links are attached for reference.
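Because each leaderboard has a different number of entries, raw ranks are hard to compare directly. The sketch below converts the three rank/total pairs quoted above into a "top X%" figure. The helper name `top_percent` is made up for this illustration and is not part of any leaderboard tooling.

```python
def top_percent(rank: int, total: int) -> float:
    """Express a leaderboard position as a 'top X%' figure (lower is better)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * rank / total


# Rank/total pairs as quoted in the summary paragraph above.
PLACEMENTS = {
    "τ²-Bench": (4, 40),
    "HLE": (19, 159),
    "τ²-Bench - Telecom": (5, 35),
}

for name, (rank, total) in PLACEMENTS.items():
    print(f"{name}: rank {rank}/{total}, top {top_percent(rank, total):.1f}%")
```

By this measure the three placements fall at roughly the top 10.0%, 11.9%, and 14.3% of their respective leaderboards.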

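As Zhipu AI's fifth-generation flagship model, GLM-5 delivers significant improvements across several dimensions:

Core performance metrics:

  • Math reasoning: 92.7% on AIME 2026 and 86.0% on GPQA-Diamond
  • Coding: 77.8% on SWE-bench Verified and 73.3% on SWE-bench Multilingual
  • Agent tasks: 62.0 on BrowseComp and 56.2 on Terminal-Bench 2.0
  • Humanities reasoning: 50.4 on HLE (with tools), ranked 3rd

Model scale:

  • Total parameters: 744B (744 billion)
  • Activated parameters: 40B (40 billion)
  • Architecture: MoE (Mixture of Experts)
  • Context length: 200K tokens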

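II. Standing Among Open-Source Models

GLM-5 stands out among open-source models:

  1. Leads on several benchmarks

    • SWE-bench Verified (77.8%): first among open-source models
    • Terminal Bench 2.0 (61.1%): third among open-source models
    • τ²-Bench (89.7%): second among open-source models
  2. Outperforms comparable models

    • Surpasses Google Gemini 3.0 Pro in overall performance
    • Beats most open-source models in agent capability evaluations
    • 98% build success rate in front-end development (CC-Bench-V2)
  3. Parameter efficiency

    • About twice the size of its predecessor GLM-4.7 (355B parameters)
    • Only 40B activated parameters, which keeps inference efficient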
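The parameter-efficiency claim comes down to two ratios, which the following sketch works out from the figures quoted on this page (744B total, 40B activated, 355B for GLM-4.7). The function and constant names are illustrative only.

```python
# Parameter counts in billions, as quoted on this page.
TOTAL_PARAMS_B = 744
ACTIVE_PARAMS_B = 40
GLM_4_7_TOTAL_B = 355


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that are activated per token."""
    return active_b / total_b


ratio = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
growth = TOTAL_PARAMS_B / GLM_4_7_TOTAL_B

print(f"activated share of parameters: {ratio:.1%}")
print(f"total size relative to GLM-4.7: {growth:.2f}x")
```

So about 5.4% of the parameters are active for any one token, while the total parameter count is roughly 2.1 times that of GLM-4.7.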

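III. Gap to Top Closed-Source Models

Although GLM-5 performs well among open-source models, it still trails the top closed-source models:

Compared with Claude Opus 4.5:

  • SWE-bench Verified: GLM-5 (77.8%) vs Claude Opus 4.5 (80.9%)
  • Official positioning: "approaches" the Opus 4.5 experience on software engineering tasks
  • Room to improve on complex reasoning and long-horizon planning

Areas of advantage:

  • Cost efficiency: API pricing is only about 20% of that of mainstream models
  • Inference speed: the optimized architecture gives faster responses
  • Open and transparent: fully open source, supporting local deployment and customization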

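IV. Technical Highlights

  1. Architecture optimization

    • Integrates the DeepSeek Sparse Attention mechanism for the first time
    • Substantially lowers deployment cost and improves token efficiency
    • Supports long-context use without loss of performance
  2. Training method innovations

    • Introduces the "Slime" asynchronous reinforcement learning framework
    • Pretraining data increased from 23T to 28.5T
    • Asynchronous agent reinforcement learning algorithm
  3. Capability fusion

    • First open-source model to natively combine reasoning, coding, and agent capabilities
    • Supports switching between Thinking Mode and standard mode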

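V. Application Strengths

Areas where it is particularly strong:

  1. Agentic Engineering: moving from "Vibe Coding" to systematic engineering
  2. Front-end development: 98% build success rate, up 26 percentage points from the previous generation
  3. Long-horizon task planning: autonomously completes multi-step, complex workflows
  4. Coding agents: compatible with mainstream tools such as Claude Code and Cline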

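VI. Overall Assessment

Strengths:

  • Top tier among open-source models in overall capability
  • Very high parameter efficiency and a clear cost advantage
  • Excellent performance on agent and coding tasks
  • MIT license, friendly to commercial use

Weaknesses:

  • ⚠️ Still a 3-5% performance gap to top closed-source models such as Claude Opus 4.5
  • ⚠️ Slightly behind Gemini 3 Pro in some complex reasoning scenarios

Summary: GLM-5 is one of the strongest open-source models currently available, and it is particularly well suited to enterprises and developers who need a cost-effective AI solution. Its coding, agent, and systems engineering capabilities have reached a near-frontier level, making it an important milestone for Chinese open-source large models.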

Benchmark Results

General Knowledge

6 evaluations

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| GPQA Diamond | Thinking Enabled | 86 | 44 / 179 |
| LiveBench | Standard Mode | 68.85 | 43 / 115 |
| HLE | Thinking Enabled + Tools | 50.40 | 19 / 159 |
| HLE | Thinking Enabled | 30.50 | 75 / 159 |
| ARC-AGI | Thinking Enabled | 44.70 | 44 / 65 |
| ARC-AGI-2 | Thinking Enabled | 4.90 | 44 / 59 |

Coding and Software Engineering

1 evaluation

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| SWE-bench Verified | Thinking Enabled | 77.80 | 23 / 108 |

Commonsense Reasoning

1 evaluation

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| Simple Bench | Standard Mode | 53.20 | 23 / 63 |

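Agent Level Benchmark

3 evaluations

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| τ²-Bench | Thinking Enabled + Tools | 89.70 | 4 / 40 |
| τ²-Bench - Telecom | Thinking Enabled + Tools | 98 | 5 / 35 |
| Terminal Bench Hard | Thinking Enabled + Tools | 43.00 | -- |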

Math and Reasoning

3 evaluations

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| AIME 2026 | Thinking Enabled | 92.70 | 8 / 15 |
| IMO-AnswerBench | Thinking Enabled | 82.50 | 14 / 20 |
| -- | -- | 2.10 | 56 / 80 |

Instruction Following

1 evaluation

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| IF Bench | Thinking Enabled + Tools | 72 | 10 / 29 |

AI Agent - Information Search

2 evaluations

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| BrowseComp | Thinking Enabled + Tools | 75.90 | 19 / 45 |
| BrowseComp | Thinking Enabled | 62 | 26 / 45 |

AI Agent - Tool Usage

1 evaluation

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| Terminal Bench 2.0 | Thinking Enabled + Tools | 61.10 | 18 / 46 |

Productivity Knowledge

1 evaluation

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| GDPval-AA | Thinking Enabled | 46 | 14 / 21 |

Long Context

2 evaluations

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| AA-LCR | Thinking Enabled | 63 | 12 / 13 |
| LongBench v2 | Standard Mode | 60.80 | 6 / 11 |

Claw-style Agent Evaluation

2 evaluations

| Benchmark | Mode | Score | Rank / total |
| --- | --- | --- | --- |
| Claw Bench | Thinking Enabled + Tools | 91.70 | 5 / 29 |
| Pinch Bench | Thinking Enabled + Tools | 86.40 | 12 / 37 |

Competitor Comparison

Benchmark scores for GLM-5 compared against top models in its class

12 benchmarks with comparable scores. Each model shows its best score per benchmark, with the evaluation mode in parentheses.

| Benchmark | Category | GLM-5 (current) | Kimi K2.5 | MiniMax M2.5 |
| --- | --- | --- | --- | --- |
| ARC-AGI | General Knowledge | 44.70 (Thinking Enabled) | -- | 63.70 (Thinking Enabled) |
| ARC-AGI-2 | General Knowledge | 4.90 (Thinking Enabled) | -- | 4.90 (Thinking Enabled) |
| GPQA Diamond | General Knowledge | 86.00 (Thinking Enabled) | -- | 85.20 (Thinking Enabled) |
| HLE | General Knowledge | 50.40 (Thinking Enabled + Tools) | 50.20 (Thinking Enabled + Tools) | 19.40 (Thinking Enabled) |
| SWE-bench Verified | Coding and Software Engineering | 77.80 (Thinking Enabled) | 76.80 (Thinking Enabled + Tools) | 80.20 (Thinking Enabled + Tools) |
| τ²-Bench - Telecom | Agent Level Benchmark | 98.00 (Thinking Enabled + Tools) | -- | 97.80 (Thinking Enabled + Tools) |
| IF Bench | Instruction Following | 72.00 (Thinking Enabled + Tools) | -- | 70.00 (Thinking Enabled + Tools) |
| BrowseComp | AI Agent - Information Search | 75.90 (Thinking Enabled + Tools) | 60.60 (Thinking Enabled + Tools) | 76.30 (Thinking Enabled + Tools) |
| Terminal Bench 2.0 | AI Agent - Tool Usage | 61.10 (Thinking Enabled + Tools) | 50.80 (Thinking Enabled + Tools) | 51.70 (Thinking Enabled + Tools) |
| GDPval-AA | Productivity Knowledge | 46.00 (Thinking Enabled) | -- | 36.00 (Thinking Enabled) |
| AA-LCR | Long Context | 63.00 (Thinking Enabled) | -- | 69.50 (Thinking Enabled) |
| Claw Bench | OpenClaw Agent Evaluation | 91.70 (Thinking Enabled + Tools) | 81.70 (Thinking Enabled + Tools) | 92.10 (Thinking Enabled + Tools) |
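One way to read the comparison data above is a simple head-to-head tally. The sketch below counts wins, losses, and ties for GLM-5 against MiniMax M2.5, the only competitor with a score on every listed benchmark. The scores are copied from this page; the tally itself is an illustration and not a ranking method used by the site, and it ignores the fact that some pairs were measured in different modes.

```python
# Best score per benchmark as (GLM-5, MiniMax M2.5), copied from this page.
SCORES = {
    "ARC-AGI": (44.70, 63.70),
    "ARC-AGI-2": (4.90, 4.90),
    "GPQA Diamond": (86.00, 85.20),
    "HLE": (50.40, 19.40),
    "SWE-bench Verified": (77.80, 80.20),
    "τ²-Bench - Telecom": (98.00, 97.80),
    "IF Bench": (72.00, 70.00),
    "BrowseComp": (75.90, 76.30),
    "Terminal Bench 2.0": (61.10, 51.70),
    "GDPval-AA": (46.00, 36.00),
    "AA-LCR": (63.00, 69.50),
    "Claw Bench": (91.70, 92.10),
}


def tally(scores):
    """Split benchmark names into wins, losses, and ties for the first model."""
    wins = [name for name, (a, b) in scores.items() if a > b]
    losses = [name for name, (a, b) in scores.items() if a < b]
    ties = [name for name, (a, b) in scores.items() if a == b]
    return wins, losses, ties


wins, losses, ties = tally(SCORES)
print(f"GLM-5 ahead on {len(wins)}, behind on {len(losses)}, tied on {len(ties)}")
```

On these 12 benchmarks GLM-5 is ahead on 6, behind on 5, and tied on 1, with several of the gaps under one point.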

Standard API Pricing: GLM-5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

| Model | Supplier | Standard input | Standard output | Base price applies to |
| --- | --- | --- | --- | --- |
| GLM-5 | Zhipu AI | $1 / 1M tokens | $3.2 / 1M tokens | -- |
| MiniMax M2.5 | MiniMaxAI | $0.3 / 1M tokens | $2.4 / 1M tokens | -- |
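Per-million-token prices are easier to compare once applied to a concrete workload. The sketch below uses the standard input and output prices listed above. The workload of 2M input tokens and 500K output tokens is a made-up example, and the `Pricing` class is a hypothetical helper rather than any vendor's SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Standard text prices in USD per 1M tokens."""
    input_usd_per_m: float
    output_usd_per_m: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD cost for a given number of input and output tokens."""
        total = (input_tokens * self.input_usd_per_m
                 + output_tokens * self.output_usd_per_m)
        return total / 1_000_000


# Prices as listed on this page.
PRICES = {
    "GLM-5": Pricing(input_usd_per_m=1.0, output_usd_per_m=3.2),
    "MiniMax M2.5": Pricing(input_usd_per_m=0.3, output_usd_per_m=2.4),
}

# Hypothetical workload: 2M input tokens and 500K output tokens.
for model, pricing in PRICES.items():
    print(f"{model}: ${pricing.cost(2_000_000, 500_000):.2f}")
```

For this assumed workload the totals are $3.60 for GLM-5 and $1.80 for MiniMax M2.5. The ratio shifts with the input/output mix, since the input prices differ more than the output prices.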

Version History

How each version of the GLM-5 series stacks up on benchmark tests

11 benchmarks with comparable scores. Each model shows its best score per benchmark, with the evaluation mode in parentheses.

| Benchmark | Category | GLM-5 (current) | GLM-4.7 | GLM-4.6 | GLM-4.5 |
| --- | --- | --- | --- | --- | --- |
| GPQA Diamond | General Knowledge | 86.00 (Thinking Enabled) | 85.70 (Thinking Enabled) | 82.90 (Thinking Enabled + Tools) | 79.10 (Thinking Enabled) |
| HLE | General Knowledge | 50.40 (Thinking Enabled + Tools) | 42.80 (Thinking Enabled + Tools) | 30.40 (Thinking Enabled + Tools) | 14.40 (Thinking Enabled) |
| SWE-bench Verified | Coding and Software Engineering | 77.80 (Thinking Enabled) | 73.80 (Thinking Enabled + Tools) | 68.00 (Standard Mode) | 64.20 (Thinking Enabled) |
| Simple Bench | Commonsense Reasoning | 53.20 (Standard Mode) | 47.70 (Thinking Enabled) | -- | -- |
| Terminal Bench Hard | Agent Level Benchmark | 43.00 (Thinking Enabled + Tools) | 33.30 (Thinking Enabled + Tools) | -- | -- |
| τ²-Bench | Agent Level Benchmark | 89.70 (Thinking Enabled + Tools) | 87.40 (Thinking Enabled + Tools) | 75.90 (Thinking Enabled + Tools) | -- |
| τ²-Bench - Telecom | Agent Level Benchmark | 98.00 (Thinking Enabled + Tools) | -- | 71.00 (Thinking Enabled + Tools) | -- |
| AIME 2026 | Math and Reasoning | 92.70 (Thinking Enabled) | 92.90 (Thinking Enabled) | -- | -- |
| IF Bench | Instruction Following | 72.00 (Thinking Enabled + Tools) | -- | 43.00 (Thinking Enabled) | -- |
| BrowseComp | AI Agent - Information Search | 75.90 (Thinking Enabled + Tools) | 52.00 (Thinking Enabled + Tools) | 45.10 (Thinking Enabled + Tools) | -- |
| Terminal Bench 2.0 | AI Agent - Tool Usage | 61.10 (Thinking Enabled + Tools) | 41.00 (Thinking Enabled + Tools) | -- | -- |
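To see where the generation-over-generation gains concentrate, the sketch below computes the point change from GLM-4.7 to GLM-5 for every benchmark where both have a score, using the version-history figures above. It is an illustration only: some pairs were run in different modes (for example SWE-bench Verified), so the deltas are indicative rather than strictly like-for-like.

```python
from typing import Optional

# Best score per benchmark as (GLM-5, GLM-4.7); None marks a "--" entry.
SCORES: dict[str, tuple[float, Optional[float]]] = {
    "GPQA Diamond": (86.00, 85.70),
    "HLE": (50.40, 42.80),
    "SWE-bench Verified": (77.80, 73.80),
    "Simple Bench": (53.20, 47.70),
    "Terminal Bench Hard": (43.00, 33.30),
    "τ²-Bench": (89.70, 87.40),
    "τ²-Bench - Telecom": (98.00, None),
    "AIME 2026": (92.70, 92.90),
    "IF Bench": (72.00, None),
    "BrowseComp": (75.90, 52.00),
    "Terminal Bench 2.0": (61.10, 41.00),
}


def deltas(scores):
    """Return (benchmark, new - old) pairs, largest gain first, skipping gaps."""
    rows = [(name, new - old)
            for name, (new, old) in scores.items()
            if old is not None]
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = deltas(SCORES)
for name, delta in ranked:
    print(f"{name:<20} {delta:+.1f}")

regressions = [name for name, delta in ranked if delta < 0]
```

The largest gains are on the agent benchmarks (BrowseComp at +23.9 and Terminal Bench 2.0 at +20.1), while AIME 2026 is the only listed benchmark where GLM-5 scores slightly below GLM-4.7.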

Single-Benchmark Version Trend

[Figure: GPQA Diamond (General Knowledge) score across GLM versions. X-axis: model and release date. Y-axis: score. Series by mode: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools. Solid lines connect the same mode across versions; dotted guides align modes within the same generation.]

Standard API Pricing Across the GLM-5 Series

| Model | Supplier | Standard input | Standard output | Base price applies to |
| --- | --- | --- | --- | --- |
| GLM-5 | Zhipu AI | $1 / 1M tokens | $3.2 / 1M tokens | -- |

Sources