GPT-4o(2024-11-20) Benchmark Details

The strongest benchmark placements for GPT-4o(2024-11-20) are HumanEval (score 90.20, rank 7 / 39), SimpleQA (score 38.80, rank 21 / 47), and MMLU (score 85.70, rank 37 / 66). This page also compares it with 3 competitor models and 2 predecessor or same-series models, including performance and pricing views where available. One source link is attached for reference.

Benchmark Results

General Knowledge

2 evaluations

Benchmark / mode	Score	Rank / total
MMLU	85.70	37 / 66
MMLU Pro	77.90	75 / 132

Coding and Software Engineering

2 evaluations

Benchmark / mode	Score	Rank / total
HumanEval	90.20	7 / 39
SWE-bench Verified (Standard Mode)	--	106 / 111

Math and Reasoning

2 evaluations

Benchmark / mode	Score	Rank / total
MATH	68.50	24 / 42
FrontierMath	0.30	57 / 60

Common Sense

1 evaluation

Benchmark / mode	Score	Rank / total
SimpleQA	38.80	21 / 47

Agent Level Benchmark

1 evaluation

Benchmark / mode	Score	Rank / total
Aider-Polyglot (Standard Mode)	18.20	50 / 59

Compare with other models

Competitor Comparison

Benchmark scores for GPT-4o(2024-11-20) compared against top models in its class

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
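The scaling rule described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual charting code: the function name `bar_heights`, the `out_of_100` flag, and the choice to scale by the highest score within the benchmark are all assumptions, and the rating-style values are hypothetical. The MMLU scores are the ones listed in the comparison table on this page.

```python
from typing import Dict, Optional

Scores = Dict[str, Optional[float]]


def bar_heights(scores: Scores, out_of_100: bool = True) -> Scores:
    """Map one benchmark's scores to bar heights on a 0-100 axis.

    Assumption: a benchmark that is not scored out of 100 is scaled so the
    best score in that benchmark reaches height 100. Missing scores (None)
    stay missing. Labels would still show the original scores.
    """
    if out_of_100:
        # Raw heights: the score is already on a 0-100 scale.
        return dict(scores)
    present = [s for s in scores.values() if s is not None]
    top = max(present, default=0.0)
    if top <= 0:
        return {m: (None if s is None else 0.0) for m, s in scores.items()}
    return {m: (None if s is None else 100.0 * s / top) for m, s in scores.items()}


# Scores from the comparison table (already out of 100).
mmlu = {
    "GPT-4o(2024-11-20)": 85.70,
    "Claude3-Opus": 86.80,
    "Gemini 2.0 Pro Experimental": 86.50,
    "DeepSeek-V3": 88.50,
}
# Hypothetical rating-style benchmark that is not out of 100.
rating = {"model_a": 1200.0, "model_b": 900.0, "model_c": None}

mmlu_heights = bar_heights(mmlu)
rating_heights = bar_heights(rating, out_of_100=False)
```

Under these assumptions the MMLU bars keep their raw values, while the hypothetical rating benchmark is rescaled so that its best model reaches the top of the axis.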

7 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	Category	GPT-4o(2024-11-20) (current)	Claude3-Opus	Gemini 2.0 Pro Experimental	DeepSeek-V3
MMLU	Comprehensive Evaluation	85.70 (Standard Mode)	86.80 (Standard Mode)	86.50 (Standard Mode)	88.50 (Standard Mode)
MMLU Pro	Comprehensive Evaluation	77.90 (Standard Mode)	68.45 (Standard Mode)	79.10 (Standard Mode)	75.90 (Standard Mode)
HumanEval	Coding and Software Engineering	90.20 (Standard Mode)	84.90 (Standard Mode)	--	89.00 (Standard Mode)
FrontierMath	Math Reasoning	0.30 (Standard Mode)	--	--	1.70 (Standard Mode)
MATH	Math Reasoning	68.50 (Standard Mode)	60.10 (Standard Mode)	91.80 (Standard Mode)	87.80 (Standard Mode)
SimpleQA	Common-Sense QA	38.80 (Standard Mode)	--	44.30 (Standard Mode)	24.90 (Standard Mode)
Aider-Polyglot	Agent Capability Evaluation	18.20 (Standard Mode)	--	35.60 (Standard Mode)	48.40 (Standard Mode)

Standard API Pricing: GPT-4o(2024-11-20) vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model	Supplier	Standard input	Standard output	Base price applies to
Claude3-Opus	Anthropic	$15 / 1M tokens	$75 / 1M tokens	—
DeepSeek-V3	DeepSeek-AI	$0.27 / 1M tokens	$1.1 / 1M tokens	—
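To make the per-million-token prices above concrete, the following sketch estimates the cost of a single request. The prices come from the table; the `Price` class, the `request_cost` helper, and the 2,000-input / 500-output token workload are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

TOKENS_PER_UNIT = 1_000_000  # prices are quoted per 1M tokens


@dataclass(frozen=True)
class Price:
    input_usd: float   # USD per 1M input tokens
    output_usd: float  # USD per 1M output tokens


# Standard prices as listed in the table above.
PRICES = {
    "Claude3-Opus": Price(15.0, 75.0),
    "DeepSeek-V3": Price(0.27, 1.1),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at standard pricing."""
    price = PRICES[model]
    total = input_tokens * price.input_usd + output_tokens * price.output_usd
    return total / TOKENS_PER_UNIT


# Hypothetical workload: 2,000 input tokens and 500 output tokens.
for name in PRICES:
    print(f"{name}: ${request_cost(name, 2_000, 500):.5f}")
```

For this hypothetical request the estimate is $0.0675 for Claude3-Opus and about $0.00109 for DeepSeek-V3. Extended-context pricing, if any, is not modelled here.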

Version History

How each version of the GPT-4o(2024-11-20) series stacks up on benchmark tests

7 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	Category	GPT-4o(2024-11-20) (current)	GPT-4o	GPT-4
MMLU	Comprehensive Evaluation	85.70 (Standard Mode)	88.70 (Standard Mode)	86.40 (Standard Mode)
MMLU Pro	Comprehensive Evaluation	77.90 (Standard Mode)	77.90 (Standard Mode)	--
HumanEval	Coding and Software Engineering	90.20 (Standard Mode)	90.00 (Standard Mode)	67.00 (Standard Mode)
FrontierMath	Math Reasoning	0.30 (Standard Mode)	0.30 (Standard Mode)	--
MATH	Math Reasoning	68.50 (Standard Mode)	75.90 (Standard Mode)	--
SimpleQA	Common-Sense QA	38.80 (Standard Mode)	38.20 (Standard Mode)	--
Aider-Polyglot	Agent Capability Evaluation	18.20 (Standard Mode)	23.10 (Standard Mode)	--
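The version comparison is easier to read as score differences. The sketch below computes, for each benchmark in the table above, the GPT-4o(2024-11-20) score minus the GPT-4o score; the names `SCORES`, `score_deltas`, and `regressions` are hypothetical helpers introduced for illustration, and GPT-4 is left out because most of its scores are missing.

```python
from typing import Dict, List, Tuple

# (GPT-4o(2024-11-20), GPT-4o) scores copied from the table above.
SCORES: Dict[str, Tuple[float, float]] = {
    "MMLU": (85.70, 88.70),
    "MMLU Pro": (77.90, 77.90),
    "HumanEval": (90.20, 90.00),
    "FrontierMath": (0.30, 0.30),
    "MATH": (68.50, 75.90),
    "SimpleQA": (38.80, 38.20),
    "Aider-Polyglot": (18.20, 23.10),
}


def score_deltas(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return current-minus-previous score per benchmark, rounded to 2 places."""
    return {name: round(cur - prev, 2) for name, (cur, prev) in scores.items()}


DELTAS = score_deltas(SCORES)

# Benchmarks where the newer snapshot scores lower, largest drop first.
regressions: List[str] = sorted(
    (name for name, delta in DELTAS.items() if delta < 0),
    key=DELTAS.get,
)

for name, delta in DELTAS.items():
    print(f"{name}: {delta:+.2f}")
```

On these figures the 2024-11-20 snapshot is lower than GPT-4o on MATH (-7.40), Aider-Polyglot (-4.90), and MMLU (-3.00), slightly higher on HumanEval (+0.20) and SimpleQA (+0.60), and unchanged on MMLU Pro and FrontierMath.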

Single-Benchmark Version Trend

[Chart: MMLU (Comprehensive Evaluation) score by model version. X-axis: model and release date. Y-axis: score. Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools. Solid lines connect the same mode across versions; dotted guides align modes within the same generation.]

Standard API Pricing Across the GPT-4o(2024-11-20) Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model	Supplier	Standard input	Standard output	Base price applies to
GPT-4o	Microsoft Azure	$2.5 / 1M tokens	$10 / 1M tokens	—

Sources

epoch.ai