Text-to-Video Arena Leaderboard
The latest AI video generation leaderboard, based on anonymous user voting in the Text-to-Video Arena. Covers Elo scores, confidence intervals, and vote counts for leading video models.
Top Model
happyhorse-1.0
Top Score
1,435
Model Count
39
Data version
2026-05-12
Data source: LMArena
About This Leaderboard
This leaderboard ranks AI text-to-video models by generation quality. Data comes from LMArena's Text-to-Video Arena track, evaluated through anonymous blind testing by real users.
Methodology Overview
Blind testing: Users submit text descriptions, two anonymous models generate videos, and users vote for the better result.
Elo scoring: Scores are Elo-style ratings fitted to the pairwise votes with the Bradley-Terry model. A higher score indicates stronger user preference for that model's video output.
Diverse generation scenarios: Covers natural landscapes, human motion, creative animation, product showcases, and more.
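To make the scoring step concrete, here is a minimal sketch of how Bradley-Terry strengths could be fitted to blind pairwise votes and mapped onto an Elo-like scale. It is an illustration under stated assumptions, not LMArena's actual pipeline: the function name `fit_bradley_terry`, the 400-point scale, the 1000-point baseline, the simple iterative update, and the sample votes are all hypothetical, and ties and confidence intervals are ignored.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200, scale=400.0, base=1000.0):
    """Fit Bradley-Terry strengths to (winner, loser) votes.

    Returns a dict mapping each model to an Elo-like score.
    The scale and base values are assumptions, not documented by the leaderboard.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in votes:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum over opponents: comparisons / (own strength + opponent strength).
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0.0)
                denom += n / (strength[m] + strength[other])
            # A tiny floor keeps the logarithm defined for a model with no wins.
            updated[m] = max(wins[m] / denom, 1e-12) if denom > 0 else strength[m]
        # Normalize so the geometric mean of strengths is 1 (scores centered on base).
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}

    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical votes: each tuple is (winner, loser) from one blind comparison.
votes = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
)
for name, score in sorted(fit_bradley_terry(votes).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.0f}")
```

Because the strengths are fitted jointly over all votes, the result does not depend on the order in which votes arrived, unlike a sequential Elo update.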
DataLearner provides in-depth analysis on top of the raw data, linking leaderboard models to the DataLearner model database so you can quickly access model details, API pricing, benchmark scores, and more.
Ranking Table
| Rank | Model | Score | 95% CI | Votes | Organization | License |
|---|---|---|---|---|---|---|
| 1 | happyhorse-1.0 | 1,435 | +/-9 | 6,266 | Alibaba-ATH | Proprietary |
| 10 | Wan2.6 T2V | 1,341 | +/-11 | 24,738 | Alibaba | Proprietary |
| 24 | | 1,199 | +/-12 | 9,370 | MiniMaxAI | Proprietary |
| 25 | | 1,199 | +/-7 | 50,014 | MiniMaxAI | Proprietary |
| 27 | | 1,181 | +/-12 | 9,333 | MiniMaxAI | Proprietary |
Data is for reference only. Official sources are authoritative. Click model names to view DataLearner model profiles.
2026-05 Market Signals
Current Best (SOTA)
Veo 3.1 Audio 1080p
Veo 3.1 Fast-Audio 1080p
Sora-2-Pro
Best China Model
Wan2.6-T2V
Seedance-V1.5-Pro
Kling-2.6-Pro
Best Open Model
Wan-V2.2-A14B
Kandinsky-5.0-T2V-Pro
Mochi-V1
FAQ
How does Text-to-Video Arena rank models?
Rankings are based on side-by-side anonymous votes. A user enters a prompt, compares the videos that two hidden models generate for it, and chooses the better one. Elo-style scoring then aggregates those comparisons into a leaderboard.
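As a rough guide to reading score gaps, the sketch below converts a difference between two scores into an expected head-to-head preference rate. It assumes the conventional Elo logistic curve with a 400-point scale, which this page does not state explicitly. The function name `win_probability` is hypothetical, and the two scores are the happyhorse-1.0 and Wan2.6 T2V entries from the ranking table.

```python
def win_probability(score_a, score_b, scale=400.0):
    """Expected share of votes for model A over model B.

    Assumes the conventional base-10 Elo curve; the scale is an assumption.
    """
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))


happyhorse = 1435  # happyhorse-1.0, from the ranking table
wan26 = 1341       # Wan2.6 T2V, from the ranking table

p = win_probability(happyhorse, wan26)
print(f"Expected preference for happyhorse-1.0 over Wan2.6 T2V: {p:.1%}")
```

Under that assumption, a 94-point gap corresponds to the higher-rated model being preferred in roughly 63% of direct comparisons. The estimate ignores the confidence intervals, so small gaps between neighboring ranks should be read with caution.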
What is audio-video sync, and why does it matter?
Audio-video sync means generated sound effects or speech match the motion and timing in the video. It matters because synchronized audio can make generated clips usable with less post-production work.
What use cases are text-to-video models good for?
Common uses include short-form video creation, marketing assets, e-commerce product clips, storyboarding, game cinematics, and educational demos.
Which models support the longest generation length?
Maximum generation lengths change quickly with product tier and release. In practice, check the current model documentation and compare both the maximum duration and how consistent quality stays across longer clips.
