GPT-4o Benchmark Details

GPT-4o's strongest listed results are on HumanEval (score 90, rank 8 of 39), MMLU (score 88.70, rank 15 of 66), and BBH (score 91.70, rank 5 of 21).

Benchmark Results

General Knowledge

5 evaluations

Benchmark / mode    Score    Rank / total
BBH                 91.70    5 / 21
MMLU                88.70    15 / 66
MMLU Pro            77.90    75 / 132
GPQA Diamond        70.10    119 / 187
HLE                 5.30     162 / 170

Coding and Software Engineering

4 evaluations

Benchmark / mode            Score           Rank / total
HumanEval                   90              8 / 39
LiveCodeBench               35.10           108 / 123
SWE-bench Verified          (not listed)    106 / 111
IC SWE-Lancer (Diamond)     23.30           6 / 8

Math and Reasoning

5 evaluations

Benchmark / mode    Score    Rank / total
MATH                75.90    16 / 42
MATH-500            75.90    43 / 44
AIME 2025           42.10    94 / 107
AIME 2024           9.30     61 / 62
FrontierMath        0.30     57 / 60

Common Sense

1 evaluation

Benchmark / mode    Score    Rank / total
SimpleQA            38.20    22 / 47

Agent Level Benchmark

1 evaluation

Benchmark / mode                   Score    Rank / total
Aider-Polyglot (Standard Mode)     23.10    47 / 59

Claw-style Agent Evaluation

1 evaluation

Benchmark / mode                         Score    Rank / total
Pinch Bench (Thinking Enabled, Tools)    71.10    30 / 37
