GPT-5.3 Codex Benchmark Analysis

GPT-5.3 Codex currently shows benchmark results led by Terminal Bench 2.0 (3 / 46, score 77.30), IC SWE-Lancer(Diamond) (1 / 8, score 81.40), LiveBench (25 / 115, score 72.76). This page also compares it with 2 competitor models and 3 predecessor or same-series models, including performance and pricing views when available. 1 source link is attached for reference.

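GPT-5.3-Codex: Benchmark Results and Model Analysis

Brief Summary
GPT-5.3-Codex (released 2026-02-05) is OpenAI's latest closed-source model for coding agents and knowledge work. With a 400k-token context window and leading results on agentic and terminal tasks, it suits IDE assistants, DevOps agents, and long-running engineering collaboration, though enterprise adopters should pay attention to governance and cost.

Core Specifications (At a Glance)

Release date: 2026-02-05
Positioning: closed-source large model for coding agents and knowledge workers
Context window: 400,000 tokens
Maximum output length: 128,000 tokens
Weights: not released (closed source)
Representative benchmarks: Terminal-Bench 77.3%, SWE-Bench Pro 56.8%, OSWorld 64.7% (see below)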

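Evaluation Highlights and Interpretation

Long Context and Multi-Day Engineering

The 400k-token context window lets the model retain more historical state in cross-file, multi-day engineering tasks (such as multi-file patches and long-running PR reviews), which improves coherence and accuracy.

Agent and Terminal Interaction

On agent tasks that require interacting with a terminal or toolchain (CLI use, CI error parsing, automated test generation), GPT-5.3-Codex improves markedly over its predecessors, making it a good fit for IDE plugins and operations automation assistants.

Self-Iteration and Engineering Risk

Bringing the model into its own development process can speed up iteration, but it also raises governance and explainability issues: audit trails and replay mechanisms are needed so that engineering decisions generated by the model remain traceable.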

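Strengths

Very large context window, suited to complex engineering scenarios.
Strong agent capabilities, applicable to automated operations and interactive debugging.
Handles code, documents, and spreadsheet-style knowledge work in the same model.
Integrates closely with engineering toolchains (and can be used to accelerate model development).

Limitations and Risks

Closed source, with no weights available for local deployment; compliance and auditing are constrained.
Dual-use risk on security-related tasks; strict permissions and auditing are required.
Long contexts and long outputs raise inference cost, so cost and latency must be traded off.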

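Engineering Integration Recommendations (Key Points)

Hybrid context strategy: key context + summary + short-term memory, rather than passing the full history on every call.
Sandboxed command execution: validate commands and scripts in a simulated or read-only environment first.
Complete audit logs: store prompts and responses for replay and compliance.
Tiered access and approval: apply multi-level approval and allowlists to security-sensitive capabilities.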
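
The recommendations above can be illustrated with a minimal Python sketch. It is not tied to any real OpenAI API: the function names, the 4-characters-per-token estimate, and the allowlist contents are all hypothetical and chosen only for illustration. The sketch covers two of the points, assembling a hybrid context under a token budget and gating commands through an allowlist while logging every decision; it does not itself sandbox execution.

```python
import json
import shlex
import time

# Hypothetical allowlist; the actual set depends on your environment and policy.
ALLOWED_COMMANDS = {"ls", "cat", "pytest"}


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token). An assumption, not a real tokenizer."""
    return max(1, len(text) // 4)


def build_context(pinned: list[str], summary: str, recent: list[str], budget: int) -> list[str]:
    """Always keep pinned context and the summary, then add the newest turns that fit."""
    parts = pinned + [summary]
    used = sum(estimate_tokens(p) for p in parts)
    kept: list[str] = []
    for turn in reversed(recent):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return parts + list(reversed(kept))  # restore chronological order


def gate_command(command: str, audit_log: list[str]) -> bool:
    """Approve a command only if its executable is allowlisted; record every decision."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse failures
        tokens = []
    approved = bool(tokens) and tokens[0] in ALLOWED_COMMANDS
    audit_log.append(json.dumps({"ts": time.time(), "command": command, "approved": approved}))
    return approved
```

In practice the audit log would go to durable, append-only storage rather than an in-memory list, and an approved command would still run in a sandbox or read-only environment before touching real systems.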

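Conclusion

GPT-5.3-Codex marks a generational step forward on engineering-agent and long-running collaboration tasks: it is a strong candidate for building efficient IDE assistants and operations agents, but enterprises adopting it must strengthen governance in parallel (auditing, least privilege, sandboxing) and evaluate running costs.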

Benchmark Results

GPT-5.3 Codex

Coding and Software Engineer

2 evaluations

Benchmark / mode	Score	Rank/total
IC SWE-Lancer(Diamond)	81.40	1 / 8
SWE-Bench Pro - Public	56.80	13 / 44

General Knowledge

2 evaluations

Benchmark / mode	Score	Rank/total
LiveBench (High)	72.76	25 / 115
LiveBench (Deep Thinking Mode)	71.64	32 / 115

AI Agent - Tool Usage

1 evaluation

Benchmark / mode	Score	Rank/total
Terminal Bench 2.0	77.30	3 / 46

Compare with other models

Competitor Comparison

Benchmark scores for GPT-5.3 Codex compared against top models in its class

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.

2 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	GPT-5.3 Codex (current)	Claude Opus 4.6	Gemini 3.0 Pro (Preview 11-2025)
LiveBench (Overall Evaluation)	72.76 (Thinking Level: High)	--	73.39 (Thinking Level: High)
Terminal Bench 2.0 (AI Agent - Tool Usage)	77.30 (Thinking Level: Extra High, Tools)	65.40 (Extended Thinking, Tools)	56.90 (Thinking Level: High, Tools)

Standard API Pricing: GPT-5.3 Codex vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

Claude Opus 4.6: Base price applies to <= 200K

Model	Supplier	Standard input	Standard output	Base price applies to
Claude Opus 4.6	Anthropic	$5 / 1M tokens	$25 / 1M tokens	<= 200K
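
To make the per-million-token prices concrete, here is a small worked sketch in Python using only the Claude Opus 4.6 figures from the table above. The function name is hypothetical, and treating the 200K threshold as a limit on input tokens is an assumption; the page does not state the rate above that threshold, so the sketch refuses to price such requests rather than guess.

```python
# Prices from the table above, in USD per 1M tokens (Claude Opus 4.6).
INPUT_PRICE_PER_M = 5.0
OUTPUT_PRICE_PER_M = 25.0
# Assumption: the "<= 200K" threshold refers to input tokens per request.
BASE_PRICE_LIMIT = 200_000


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the base rate."""
    if input_tokens > BASE_PRICE_LIMIT:
        # The extended-context rate is not given in the source, so do not guess it.
        raise ValueError("input exceeds the base-price threshold; extended rate unknown")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# 150,000 input tokens and 4,000 output tokens:
# 150,000 * 5 / 1e6 = 0.75, plus 4,000 * 25 / 1e6 = 0.10, for a total of 0.85 USD.
cost = request_cost(150_000, 4_000)
```

Because output tokens cost five times as much as input tokens at these rates, long generations can dominate the bill even when the prompt is large.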

Version History

How each version of the GPT-5.3 Codex series stacks up on benchmark tests

1 benchmark with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	GPT-5.3 Codex (current)	GPT-5.2-Codex	GPT-5.1-Codex-Max
LiveBench (Overall Evaluation)	72.76 (Thinking Level: High)	74.30 (Standard Mode)	73.98 (Deep Thinking Mode)

Single-Benchmark Version Trend

Viewing: LiveBench (Overall Evaluation)

[Chart: LiveBench score by model version. Modes shown: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools.]

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the GPT-5.3 Codex Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Sources