OSWorld-Verified
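OSWorld is a benchmark for testing AI agents in real computer environments. These agents are AI systems that can process text, images, and other kinds of information. The benchmark consists of open-ended tasks, such as manipulating files or using software. OSWorld-Verified is its improved version: it fixes known issues and improves how the benchmark is run, which yields more accurate test results. It supports multiple operating systems, including Ubuntu, Windows, and macOS, and lets the AI complete tasks by learning through interaction.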
Updated Apr 24, 2026
- Problem Count: 369
- Institution: Individual
- Category: AI Agent - Tool Use
- Metrics: Accuracy
- Language: English
- Difficulty: Medium
Overview
A benchmark for evaluating how well large-model agents can operate a computer; an upgraded version of OSWorld.
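The page lists accuracy as the metric over 369 problems. As a minimal sketch of how such a score could be computed, assuming each task yields a score between 0 and 1 and the reported number is the mean expressed as a percentage (this scoring rule, the function name `accuracy_percent`, and the example numbers are all illustrative assumptions, not taken from the official harness):

```python
from typing import Sequence


def accuracy_percent(task_scores: Sequence[float]) -> float:
    """Return the mean per-task score as a percentage.

    Assumption: every task score lies in [0, 1], where 1.0 means the
    task was completed and 0.0 means it was not.
    """
    if not task_scores:
        raise ValueError("task_scores must not be empty")
    for score in task_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {score}")
    return 100.0 * sum(task_scores) / len(task_scores)


# Hypothetical run over 369 tasks: 280 solved, 89 failed (made-up numbers).
example_scores = [1.0] * 280 + [0.0] * 89
print(f"{accuracy_percent(example_scores):.2f}")
```

Under this assumption, a leaderboard value such as 79.60 would be read as the average task score across the full problem set, scaled to 0-100.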
Related resources
Latest OSWorld-Verified model rankings and full benchmark leaderboard
Browse the latest scores, model modes, release dates, and parameter sizes for OSWorld-Verified.
Source: DataLearnerAI
Data sourced primarily from official releases (GitHub, Hugging Face, papers), then benchmark leaderboards, then third-party evaluators.
OSWorld-Verified Rank
| Rank | Model | Mode | Score (%) | Release date | Parameters | License |
|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Extended Thinking, Tools | 79.60 | 2026-04-07 | Unknown | Closed |
| 2 | GPT-5.5 | Thinking Level · High, Tools | 78.70 | 2026-04-23 | Unknown | Closed |
| 3 | Opus 4.7 | Extended Thinking, Tools | 78.00 | 2026-04-16 | Unknown | Closed |
| 4 | GPT-5.4 | Thinking Level · Extra High, Tools | 75.00 | 2026-03-05 | Unknown | Closed |
| 5 | Kimi K2.6 | Thinking Enabled, Tools | 73.10 | 2026-04-20 | 1000B | Free Commercial |
| 6 | Claude Opus 4.6 | Extended Thinking, Tools | 72.70 | 2026-02-05 | Unknown | Closed |
| 7 | Claude Sonnet 4.6 | Thinking Enabled, Tools | 72.50 | 2026-02-17 | Unknown | Closed |
| 8 | GPT-5.4 mini | Thinking Level · Extra High, Tools | 72.10 | 2026-03-17 | Unknown | Closed |
| 9 | Qwen3.5-397B-A17B | Thinking Enabled, Tools | 62.20 | 2026-02-16 | 397B | Free Commercial |
| 10 | Claude Sonnet 4.5 | Thinking Enabled, Tools | 61.40 | 2025-09-30 | Unknown | Closed |
| 11 | Qwen3.5-27B | Thinking Enabled, Tools | 56.20 | 2026-02-25 | 27B | Free Commercial |
| 12 | Claude Sonnet 4 | Thinking Enabled, Tools | 42.20 | 2025-05-23 | Unknown | Closed |
| 13 | GPT-5.4 nano | Thinking Level · Extra High, Tools | 39.00 | 2026-03-17 | Unknown | Closed |
| 14 | Claude Sonnet 3.7 | Thinking Enabled, Tools | 28.00 | 2025-02-25 | Unknown | Closed |



