GPT-5.3 Codex Benchmark Analysis

GPT-5.3 Codex currently shows benchmark results led by Terminal Bench 2.0 (rank 3 / 46, score 77.30), IC SWE-Lancer (Diamond) (rank 1 / 8, score 81.40), and LiveBench (rank 25 / 115, score 72.76). This page also compares it with 2 competitor models and 3 predecessor or same-series models, including performance and pricing views when available. 1 source link is attached for reference.

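GPT-5.3-Codex: Evaluation Results and Model Overview

Brief summary
GPT-5.3-Codex (released 2026-02-05) is OpenAI's latest closed-source model for coding agents and knowledge work. With a large 400k-token context window and leading results on agent and terminal tasks, it suits IDE assistants, DevOps agents, and long-running engineering collaboration, though enterprise adopters need to plan for governance and cost.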


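Core specifications (at a glance)

  • Release date: 2026-02-05
  • Positioning: closed-source large model for coding agents and knowledge workers
  • Context window: 400,000 tokens
  • Maximum output length: 128,000 tokens
  • Weights: not released (closed source)
  • Representative benchmarks: Terminal-Bench 77.3%, SWE-Bench Pro 56.8%, OSWorld 64.7% (see below)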

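Evaluation highlights and interpretation

Long context and multi-day engineering work

The 400k-token context window lets the model retain more history in engineering tasks that span files and days (such as multi-file patches and long-running PR reviews), which improves coherence and accuracy.

Agents and terminal interaction

On agent tasks that interact with a terminal or toolchain (CLI use, parsing CI errors, generating automated tests), GPT-5.3-Codex improves clearly on its predecessor, which makes it a good fit for IDE plugins and operations automation assistants.

Self-iteration and engineering risk

Having the model take part in its own development can speed up iteration, but it also raises governance and explainability concerns: audit trails and replay mechanisms are needed so that engineering decisions generated by the model stay traceable.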


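Strengths

  • Very large context window, well suited to complex engineering scenarios.
  • Strong agent capabilities, suitable for automated operations and interactive debugging.
  • Handles code, documents, and spreadsheet-style knowledge work in the same session.
  • Integrates closely with engineering toolchains (and can be used to speed up model development).

Limitations and risks

  • Closed source, with no weights available for local deployment; compliance and auditing options are limited.
  • Dual-use risk on security-related tasks, which calls for strict permissions and auditing.
  • Long contexts and long outputs raise inference cost, so cost and latency need to be traded off.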

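Integration recommendations (key points)

  1. Hybrid context strategy: pass key context + a summary + short-term memory, rather than the full history on every call.
  2. Sandboxed command execution: validate commands and scripts in a simulated or read-only environment first.
  3. Complete audit logs: store every prompt and response for replay and compliance.
  4. Tiered access and approval: put multi-level approval and allowlists in front of security-sensitive capabilities.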
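The recommendations above can be sketched in a few helper functions. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the `ALLOWED_COMMANDS` set, and the character-based budget are all hypothetical, and a real integration would count tokens with the provider's tokenizer and enforce approvals in a proper policy layer.

```python
import hashlib
import json
import shlex
import time

# Hypothetical allowlist of read-only commands; adjust to your environment.
ALLOWED_COMMANDS = {"ls", "cat", "grep", "pytest"}


def build_context(pinned, summary, recent_turns, max_chars=8000):
    """Assemble pinned context + summary + as many recent turns as fit.

    Character counts stand in for tokens here (an assumption for brevity).
    """
    parts = list(pinned) + [summary]
    used = sum(len(p) for p in parts)
    kept = []
    # Walk backwards so the newest turns are kept first.
    for turn in reversed(recent_turns):
        if used + len(turn) > max_chars:
            break
        kept.append(turn)
        used += len(turn)
    return parts + list(reversed(kept))


def is_command_allowed(command_line):
    """Allow a command only if its executable is allowlisted and it
    contains no shell operators that could chain or redirect."""
    try:
        tokens = shlex.split(command_line)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if not tokens:
        return False
    if any(op in command_line for op in (";", "|", "&", ">", "<", "`", "$(")):
        return False
    return tokens[0] in ALLOWED_COMMANDS


def audit_record(prompt, response, timestamp=None):
    """Build one JSON log line with the prompt, the response, and a
    SHA-256 digest of the payload for later integrity checks."""
    payload = {
        "ts": time.time() if timestamp is None else timestamp,
        "prompt": prompt,
        "response": response,
    }
    body = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    payload["sha256"] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps(payload, sort_keys=True, ensure_ascii=False)
```

In this sketch a call such as `is_command_allowed("ls -la")` passes while `is_command_allowed("ls; rm x")` is rejected; anything that fails the check would go to a human approver rather than being executed.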

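Conclusion

GPT-5.3-Codex is a generational step forward on engineering agent and long-running collaboration tasks. It is a strong candidate for building efficient IDE assistants and operations agents, but enterprises adopting it must strengthen governance at the same time (auditing, least privilege, sandboxing) and evaluate running costs.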

Benchmark Results

Coding and Software Engineering (2 evaluations)
Benchmark / mode | Score | Rank/total

General Knowledge (2 evaluations)
Benchmark / mode | Score | Rank/total
LiveBench | 72.76 | 25 / 115
LiveBench, Deep Thinking Mode | 71.64 | 32 / 115

AI Agent - Tool Usage (1 evaluation)
Benchmark / mode | Score | Rank/total
Terminal Bench 2.0, Thinking Level · Extra High, Tools | 77.30 | 3 / 46

Competitor Comparison

Benchmark scores for GPT-5.3 Codex compared against top models in its class

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
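One way to read the scaling rule just described is sketched below. The page does not give its exact formula, so the rescaling used here (the benchmark's largest score maps to the top of the axis) is an assumption, as are the function name `bar_heights` and the example scores for models "A" and "B".

```python
def bar_heights(scores, full_scale=100.0):
    """Map raw benchmark scores to bar heights on a 0..full_scale axis.

    If every score already fits the axis, heights equal the raw scores.
    Otherwise scores are rescaled within the benchmark so the largest one
    reaches full_scale (an assumed rule; the page gives no formula).
    Labels would keep the original scores either way.
    """
    values = list(scores.values())
    if all(0 <= v <= full_scale for v in values):
        return dict(scores)
    top = max(values)
    if top <= 0:
        raise ValueError("cannot scale a benchmark with no positive score")
    return {name: v / top * full_scale for name, v in scores.items()}


# Out-of-100 benchmark: heights are the raw scores.
print(bar_heights({"GPT-5.3 Codex": 77.30, "Claude Opus 4.6": 65.40}))

# Hypothetical rating-style benchmark that exceeds 100: scaled within itself.
print(bar_heights({"A": 1500.0, "B": 1200.0}))
```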

2 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark | GPT-5.3 Codex (Current) | Claude Opus 4.6 | Gemini 3.0 Pro (Preview 11-2025)
LiveBench (Comprehensive evaluation) | 72.76 (Thinking Level · High) | -- | 73.39 (Thinking Level · High)
Terminal Bench 2.0 (AI Agent - Tool Usage) | 77.30 (Thinking Level · Extra High, Tools) | 65.40 (Extended Thinking, Tools) | 56.90 (Thinking Level · High, Tools)
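To make the gaps in the comparison concrete, the short sketch below computes the leader on each benchmark and GPT-5.3 Codex's margin over the best other model. The scores are copied from the comparison above; the `SCORES` layout and the `leader_and_margin` helper are illustrative names, not part of the page.

```python
# Best scores per benchmark, copied from the comparison above.
# None marks a model with no comparable score.
SCORES = {
    "LiveBench": {
        "GPT-5.3 Codex": 72.76,
        "Claude Opus 4.6": None,
        "Gemini 3.0 Pro (Preview 11-2025)": 73.39,
    },
    "Terminal Bench 2.0": {
        "GPT-5.3 Codex": 77.30,
        "Claude Opus 4.6": 65.40,
        "Gemini 3.0 Pro (Preview 11-2025)": 56.90,
    },
}


def leader_and_margin(benchmark_scores, model="GPT-5.3 Codex"):
    """Return (leader, margin), where margin is the given model's score
    minus the best score among the other models that have one."""
    available = {m: s for m, s in benchmark_scores.items() if s is not None}
    leader = max(available, key=available.get)
    best_other = max(s for m, s in available.items() if m != model)
    return leader, round(available[model] - best_other, 2)


for name, row in SCORES.items():
    print(name, leader_and_margin(row))
# LiveBench ('Gemini 3.0 Pro (Preview 11-2025)', -0.63)
# Terminal Bench 2.0 ('GPT-5.3 Codex', 11.9)
```

On these numbers GPT-5.3 Codex trails by 0.63 points on LiveBench and leads by 11.90 points on Terminal Bench 2.0. The modes differ between models, so the margins compare best-available configurations rather than like-for-like settings.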

Standard API Pricing: GPT-5.3 Codex vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. Prices are in USD per 1M tokens.

When a context threshold exists, the charted base price only applies within these limits:

Claude Opus 4.6: Base price applies to <= 200K
Model | Supplier | Standard input | Standard output | Base price applies to
Claude Opus 4.6 | Anthropic | $5 / 1M tokens | $25 / 1M tokens | <= 200K
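As a worked example of how per-1M-token pricing translates into a request cost, the sketch below uses the Claude Opus 4.6 rates listed above. It assumes the "<= 200K" threshold means 200,000 input tokens, and it refuses to price larger inputs because extended-context rates are not listed here; the function name and the token counts are illustrative.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m,
                     base_limit=None):
    """Estimate one request's cost in USD at standard per-1M-token prices.

    base_limit is the largest input size (in tokens) the base price covers.
    Inputs above it raise ValueError, since extended-context rates are not
    known in this sketch.
    """
    if base_limit is not None and input_tokens > base_limit:
        raise ValueError("input exceeds the base-price context threshold")
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Claude Opus 4.6 at $5 input / $25 output per 1M tokens,
# with an assumed 200,000-token threshold.
cost = request_cost_usd(150_000, 10_000, 5, 25, base_limit=200_000)
print(cost)  # 1.0  (0.75 for input + 0.25 for output)
```

The same function applies to any model once its standard input and output prices are known.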

Version History

How each version of the GPT-5.3 Codex series stacks up on benchmark tests

1 benchmark with comparable scores. Each model shows its best score; the mode label is displayed below.

Benchmark | GPT-5.3 Codex (Current) | GPT-5.2-Codex | GPT-5.1-Codex-Max
LiveBench (Comprehensive evaluation) | 72.76 (Thinking Level · High) | 74.30 (Standard Mode) | 73.98 (Deep Thinking Mode)

Single-Benchmark Version Trend

Viewing: LiveBench · Comprehensive evaluation

Modes shown: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the GPT-5.3 Codex Series

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Sources