Text-to-Video Arena Leaderboard

Name: Text-to-Video Arena Leaderboard
Creator: DataLearner
License: https://creativecommons.org/licenses/by/4.0/

The latest AI video generation leaderboard, based on anonymous user voting in the Text-to-Video Arena. It covers Elo scores, confidence intervals, and vote counts for leading video models.

Top Model

happyhorse-1.0

Top Score

1,435

Model Count

Data version

2026-05-12

Data source: LM Arena

About This Leaderboard

This leaderboard ranks AI text-to-video models by generation quality. Data comes from LMArena's Text-to-Video Arena track, evaluated through anonymous blind testing by real users.

Methodology Overview

Blind testing: Users submit text descriptions, two anonymous models generate videos, and users vote for the better result.

Elo scoring: Based on the Bradley-Terry model. Higher scores indicate stronger user preference for that model's video output.

Diverse generation scenarios: Covers natural landscapes, human motion, creative animation, product showcases, and more.
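
The Elo scoring step can be illustrated with a minimal sketch. It assumes a plain Bradley-Terry fit using the classic minorization-maximization update, and an Elo-style display scale (400 points per factor of 10 in strength, centered on 1000). The model names and vote counts are made up for illustration, and LMArena's actual pipeline is not described on this page and may differ.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise vote counts.

    wins[(a, b)] is the number of votes in which model a beat model b.
    Assumes every model has at least one win and the comparison graph
    is connected (true for the toy data below).
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(c for (a, _), c in wins.items() if a == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        # Normalise so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def to_elo_scale(strength, scale=400.0, base=1000.0):
    """Map strengths to an Elo-style score (scale and base are assumptions)."""
    return {m: base + scale * math.log10(v) for m, v in strength.items()}


# Hypothetical vote counts, not leaderboard data.
toy_votes = {
    ("model_a", "model_b"): 60, ("model_b", "model_a"): 40,
    ("model_b", "model_c"): 70, ("model_c", "model_b"): 30,
    ("model_a", "model_c"): 80, ("model_c", "model_a"): 20,
}
ratings = to_elo_scale(fit_bradley_terry(toy_votes))
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.0f}")
```

Only differences between scores carry meaning in this sketch; the base of 1000 is an arbitrary anchor.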

DataLearner provides in-depth analysis on top of the raw data, linking leaderboard models to the DataLearner model database so you can quickly access model details, API pricing, benchmark scores, and more.

Origin:All China

Ranking Table

Rank	Model	Score	95% CI	Votes	Organization	License
	happyhorse-1.0	1,435	+/-9	6,266	Alibaba-ATH	Proprietary
10	Wan2.6 T2V	1,341	+/-11	24,738	Alibaba	Proprietary
24	Hailuo 2.3	1,199	+/-12	9,370	MiniMaxAI	Proprietary
25	Hailuo 2.3	1,199	+/-7	50,014	MiniMaxAI	Proprietary
27	Hailuo 2.3	1,181	+/-12	9,333	MiniMaxAI	Proprietary

Data is for reference only. Official sources are authoritative. Click model names to view DataLearner model profiles.
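
One way to read the 95% CI column is to check whether two models' intervals overlap. The sketch below uses the scores and CI values from the table, and assumes each "+/-" value is a symmetric half-width around the score. Overlap is only a rough heuristic for "too close to call", not a formal significance test, and the helper names are illustrative.

```python
# (model, score, CI half-width) taken from the ranking table.
rows = [
    ("happyhorse-1.0", 1435, 9),
    ("Wan2.6 T2V", 1341, 11),
    ("Hailuo 2.3", 1199, 12),
    ("Hailuo 2.3", 1199, 7),
    ("Hailuo 2.3", 1181, 12),
]


def interval(row):
    """Return (low, high) for a row, assuming a symmetric interval."""
    _, score, half_width = row
    return score - half_width, score + half_width


def overlaps(row_a, row_b):
    """True when the two confidence intervals share at least one point."""
    low_a, high_a = interval(row_a)
    low_b, high_b = interval(row_b)
    return low_a <= high_b and low_b <= high_a


# Compare each row with the next one down the table.
for upper, lower in zip(rows, rows[1:]):
    status = "overlap" if overlaps(upper, lower) else "separated"
    print(f"{upper[0]} {interval(upper)} vs {lower[0]} {interval(lower)}: {status}")
```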

2026-05 Market Signals

Current Best (SOTA)

Veo 3.1 Audio 1080p

Veo 3.1 Fast-Audio 1080p

Sora-2-Pro

Best China Model

Wan2.6-T2V

Seedance-V1.5-Pro

Kling-2.6-Pro

Best Open Model

• Wan-V2.2-A14B
• Kandinsky-5.0-T2V-Pro
• Mochi-V1

FAQ

How does Text-to-Video Arena rank models?

Rankings are based on side-by-side anonymous votes. Users enter the same prompt, compare outputs from two hidden models, and choose the better video. Elo-style scoring then aggregates those comparisons into a leaderboard.
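
To make the score gap concrete, an Elo-style difference can be converted into an expected head-to-head preference rate. The sketch below assumes the conventional logistic Elo curve with a 400-point scale; the page does not state the scale LMArena uses, so treat the result as an approximation. Under that assumption, the 94-point gap between 1,435 and 1,341 corresponds to the higher-rated model being preferred in roughly 63% of direct comparisons.

```python
def expected_win_rate(score_a, score_b, scale=400.0):
    """Expected share of votes for model A over model B.

    Uses the conventional Elo logistic curve; the 400-point scale
    is an assumption, not something stated by the leaderboard.
    """
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))


# Scores taken from the ranking table (1,435 vs 1,341).
rate = expected_win_rate(1435, 1341)
print(f"Expected preference rate: {rate:.1%}")
```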

What is audio-video sync, and why does it matter?

Audio-video sync means generated sound effects or speech match the motion and timing in the video. It matters because synchronized audio can make generated clips usable with less post-production work.

What use cases are text-to-video models good for?

Common uses include short-form video creation, marketing assets, e-commerce product clips, storyboarding, game cinematics, and educational demos.

Which models support the longest generation length?

Long generation limits change quickly by product tier and release. In practice, check the current model documentation and compare both maximum duration and quality consistency across longer clips.
