Muse Spark vs GPT-5.4

Across 8 shared benchmarks, GPT-5.4 leads overall: Muse Spark wins 3, GPT-5.4 wins 5, with 0 ties and an average score difference of -3.74.

Facebook AI Research Lab · 2026-04-08 · Reasoning model

OpenAI · 2026-03-05 · Multimodal model

Muse Spark 3 wins (38%) · (63%) 5 wins GPT-5.4

Benchmark scores

Grouped by capability, sorted by largest gap within each. 8 shared benchmarks.

GPT-5.4 2/3

Benchmark	Muse Spark	GPT-5.4	Diff
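ARC-AGI-2	42.50 · rank 28 / 62 · Thinking (No Tools)	77.10 · rank 9 / 62 · Normal (No Tools)	-34.60
HLE	58 · rank 6 / 172 · Deep Thinking (No Tools, Parallel)	52.10 · rank 21 / 172 · Extra-High Thinking (With Tools)	+5.90
GPQA Diamond	89.50 · rank 25 / 187 · Thinking (No Tools)	92.80 · rank 11 / 187 · Extra-High Thinking (No Tools)	-3.30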

Even 2/2

Benchmark	Muse Spark	GPT-5.4	Diff
Terminal Bench 2.0	59 · rank 24 / 47 · Thinking (With Tools)	75.10 · rank 4 / 47 · Extra-High Thinking (With Tools)	-16.10
MCP-Atlas	82.20 · rank 5 / 27 · Normal (With Tools)	70.60 · rank 14 / 27 · Extra-High Thinking (With Tools)	+11.60

GPT-5.4 2/2

Benchmark	Muse Spark	GPT-5.4	Diff
FrontierMath - Tier 4	14.60 · rank 23 / 80 · Normal (No Tools)	27.10 · rank 11 / 80 · Extra-High Thinking (No Tools)	-12.50
FrontierMath	39 · rank 9 / 60 · Thinking (No Tools)	47.60 · rank 5 / 60 · Extra-High Thinking (No Tools)	-8.60

Muse Spark 1/1

Benchmark	Muse Spark	GPT-5.4	Diff
τ²-Bench - Telecom	92 · rank 20 / 35 · Thinking (With Tools)	64.30 · rank 30 / 35 · Normal (With Tools)	+27.70

Prices use DataLearner records when available; missing fields are not inferred.

One or both models have incomplete public pricing.

On average across the 8 shared benchmarks, GPT-5.4 scores 3.74 points higher.

Largest single-benchmark gap: ARC-AGI-2 — Muse Spark 42.50 vs GPT-5.4 77.10 (-34.60).
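
The summary figures can be reproduced from the table scores. The sketch below is illustrative only: the page does not say how its numbers are computed, so the names `SCORES` and `summarize` are hypothetical, and the sign convention (Muse Spark minus GPT-5.4) is inferred from the Diff column.

```python
# Scores copied from the benchmark tables: (benchmark, Muse Spark, GPT-5.4).
SCORES = [
    ("ARC-AGI-2", 42.50, 77.10),
    ("HLE", 58.00, 52.10),
    ("GPQA Diamond", 89.50, 92.80),
    ("Terminal Bench 2.0", 59.00, 75.10),
    ("MCP-Atlas", 82.20, 70.60),
    ("FrontierMath - Tier 4", 14.60, 27.10),
    ("FrontierMath", 39.00, 47.60),
    ("τ²-Bench - Telecom", 92.00, 64.30),
]


def summarize(scores):
    """Return win counts, mean difference and largest gap.

    Assumption: Diff = Muse Spark score - GPT-5.4 score, as the
    signs in the Diff column suggest.
    """
    diffs = [(name, round(muse - gpt, 2)) for name, muse, gpt in scores]
    return {
        "muse_wins": sum(1 for _, d in diffs if d > 0),
        "gpt_wins": sum(1 for _, d in diffs if d < 0),
        "ties": sum(1 for _, d in diffs if d == 0),
        "mean_diff": sum(d for _, d in diffs) / len(diffs),
        # Largest gap by absolute value, keeping its sign.
        "largest_gap": max(diffs, key=lambda item: abs(item[1])),
    }


summary = summarize(SCORES)
print(f"Muse Spark wins: {summary['muse_wins']}, GPT-5.4 wins: {summary['gpt_wins']}")
print(f"Mean difference: {summary['mean_diff']:.2f}")
print(f"Largest gap: {summary['largest_gap'][0]} ({summary['largest_gap'][1]:+.2f})")
```

Under these assumptions the mean difference comes out at -3.7375, which the page reports rounded to -3.74.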

Page generated from structured model, pricing and benchmark records. No real-time LLM is used to write the prose.