
SWE-bench Verified

Name: Software Engineering Bench - Verified
Creator: OpenAI

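As large language models (LLMs) keep improving across tasks, evaluating what they can actually do has become increasingly important. In software engineering in particular, whether an AI model can correctly solve real programming problems is key to judging its practical potential. OpenAI's *SWE-bench Verified* benchmark aims to provide a more reliable and precise evaluation tool that helps developers and researchers understand how well AI models handle software engineering tasks.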

Updated: 2026-04-03
Views: 9,569
Number of problems: 500
Publisher: OpenAI
Category: Programming and software engineering
Metric: Accuracy
Supported language: English
Difficulty: High

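Introduction

A more accurate and more representative benchmark, distilled by OpenAI from SWE-bench, for evaluating how well large models solve code engineering tasks.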
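The Accuracy metric over the 500 problems listed above can be read as the percentage of problems a model resolves. The snippet below is a minimal sketch of that arithmetic only; the function name `resolved_rate` and the pass/fail list are hypothetical and are not taken from the benchmark's own tooling.

```python
def resolved_rate(results: list[bool]) -> float:
    """Return the percentage of problems marked as resolved."""
    if not results:
        raise ValueError("results must not be empty")
    # True counts as 1 and False as 0, so sum() gives the resolved count.
    return 100.0 * sum(results) / len(results)


# Hypothetical run: 400 of the 500 problems resolved.
results = [True] * 400 + [False] * 100
print(f"{resolved_rate(results):.1f}")
```

Published scores may be aggregated differently (for example, averaged over several runs), so a leaderboard value need not correspond to a whole number of resolved problems.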

SWE-bench Verified Model Score Leaderboard

Source: DataLearnerAI

Data sourced primarily from official releases (GitHub, Hugging Face, papers), then benchmark leaderboards, then third-party evaluators.

Mode legend: normal, thinking, low, medium, high, deeper thinking, parallel_thinking


Latest SWE-bench Verified model rankings and full benchmark leaderboard

Browse the latest scores, model modes, release dates, and parameter sizes for SWE-bench Verified.

5 parallel-mode results have been excluded.

Rank	Model	Mode	Tools	Score	Release date	Parameters
1	Claude Opus 4.5	Thinking Level · Medium	No	80.9	2025-11-25	Unknown
2	Claude Opus 4.6	Deep Thinking	Yes	80.84	2026-02-05	Unknown
3	Gemini 3.1 Pro Preview	Thinking Level · High	Yes	80.6	2026-02-20	Unknown
4	MiniMax M2.5	Thinking Level · Medium	Yes	80.2	2026-02-12	2290
5	GPT-5.2	Deep Thinking	Yes	80	2025-12-11	Unknown
6	Claude Sonnet 4.6	Thinking Level · Medium	No	79.6	2026-02-17	Unknown
7	Qwen 3.6 Plus Preview	Thinking Level · Medium	Yes	78.8	2026-03-31	Unknown
8	GLM-5	Thinking Level · Medium	No	77.8	2026-02-11	7440
9	Claude Sonnet 4.5	Thinking Level · Medium	Yes	77.2	2025-09-30	Unknown
10	GPT-5.1-Codex-Max	Thinking Level · High	Yes	76.8	2025-11-19	Unknown
11	Kimi K2.5	Thinking Level · Medium	Yes	76.8	2026-01-27	10000
12	Qwen3.5-397B-A17B	Thinking Level · Medium	Yes	76.4	2026-02-16	397
13	GPT-5.1	Thinking Level · High	No	76.3	2025-11-12	Unknown
14	GPT-5.1	Thinking Level · High	Yes	76.3	2025-11-12	Unknown
15	Gemini 3.0 Pro (Preview 11-2025)	Thinking Level · Medium	No	76.2	2025-11-18	Unknown
16	Qwen3-Max-Thinking	Thinking Level · Medium	No	75.3	2026-01-26	10000
17	o3-pro	Thinking Level · High	No	75	2025-06-10	Unknown
18	M2.1	Thinking Level · Medium	No	74.8	2025-12-23	2300
19	Claude Opus 4.1	Thinking Level · Medium	No	74.5	2025-08-06	Unknown
20	Claude Opus 4.1	Thinking Level · Medium	Yes	74.5	2025-08-06	Unknown
21	GPT-5 Codex	Thinking Level · High	No	74.5	2025-09-15	Unknown
22	Step 3.5 Flash	Thinking Level · Medium	No	74.4	2026-02-02	1960
23	GLM-4.7	Thinking Level · Medium	Yes	73.8	2025-12-22	3580
24	Haiku 4.5	Thinking Level · Medium	Yes	73.3	2025-10-15	Unknown
25	DeepSeek V3.2	Thinking Level · Medium	Yes	73.1	2025-12-01	6710

The remaining 63 entries are not shown.
