Image-to-Video Arena Leaderboard
The latest AI image-to-video leaderboard, based on anonymous Arena voting. It lists Elo scores, confidence intervals, and vote counts for leading video animation models.
Top Model: happyhorse-1.0
Top Score: 1,445
Model Count: 39
Data version: 2026-05-12
Data source: LMArena
About This Leaderboard
This leaderboard ranks AI image-to-video models by animation quality. Data comes from LMArena's Image-to-Video Arena track, where real users evaluate models through anonymous blind tests.
Methodology Overview
Blind testing: Users upload an image, two anonymous models generate animated videos, and users vote for the more natural result.
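Elo scoring: Scores are fitted with the Bradley-Terry model, which estimates each model's relative strength in image-to-video tasks from the pairwise votes. A larger score gap means the higher-rated model is expected to win a larger share of head-to-head comparisons.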
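To make the scoring step concrete, here is a minimal sketch of how pairwise votes could be turned into Elo-style scores with a Bradley-Terry fit. It assumes the conventional Elo scale of 400 and an anchor of 1000; the source does not state which scale, anchor, or fitting procedure LMArena actually uses, and the function names and vote counts below are hypothetical.

```python
import math

def elo_win_probability(rating_a, rating_b, scale=400.0):
    """Expected probability that A beats B under the logistic Elo model.

    scale=400 is the conventional Elo scale (an assumption here).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with the classic MM iteration.

    wins[i][j] is the number of votes where model i beat model j.
    Returns strengths normalized to sum to 1.
    Assumes every model has at least one win and one loss.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p

def strengths_to_elo(p, scale=400.0, anchor=1000.0):
    """Map strengths to an Elo-like scale centered on `anchor` (assumed)."""
    mean_log = sum(math.log10(x) for x in p) / len(p)
    return [anchor + scale * (math.log10(x) - mean_log) for x in p]

# Hypothetical vote counts: model 0 beat model 1 in 30 of 40 comparisons.
votes = [[0, 30], [10, 0]]
strengths = fit_bradley_terry(votes)
ratings = strengths_to_elo(strengths)
print(ratings)
print(elo_win_probability(ratings[0], ratings[1]))  # ~0.75, matching 30/40
```

Only score differences carry meaning in this model: shifting every rating by the same constant leaves all predicted win probabilities unchanged.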
Ranking Table
| Rank | Model | Score | 95% CI | Votes | Organization | License |
|---|---|---|---|---|---|---|
| 1 | happyhorse-1.0 | 1,445 | ±15 | 14,889 | Alibaba-ATH | Proprietary |
Data is for reference only. Official sources are authoritative.
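When reading the Score and 95% CI columns together, it can help to expand each row into an explicit interval. The sketch below does that for the top row and compares it with a second, purely hypothetical row. Treating overlapping intervals as "not clearly separated" is a rough heuristic assumed here, not a rule stated by the leaderboard.

```python
def parse_interval(score, ci):
    """Turn strings like '1,445' and '+/-15' into a (low, high) tuple."""
    center = float(score.replace(",", ""))
    half_width = float(ci.replace("+/-", "").replace("±", "").strip())
    return (center - half_width, center + half_width)

def intervals_overlap(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Top row from the ranking table.
top = parse_interval("1,445", "+/-15")
print(top)  # (1430.0, 1460.0)

# Hypothetical runner-up, used only to illustrate the comparison.
runner_up = parse_interval("1,420", "+/-12")
print(intervals_overlap(top, runner_up))  # True: 1430 <= 1432
```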
2026-05 Market Signals
Current Best (SOTA)
- Grok Imagine Video 720p
- Veo 3.1 Audio 1080p
- Veo 3.1 Audio
Best China Model
- Vidu-Q3-Pro
- Wan2.5-I2V-Preview
- Kling-2.6-Pro
Best Open Model
- Wan-V2.2-A14B
- LTX-2-19B
- Pika-V2.2
FAQ
What is the difference between image-to-video and text-to-video?
Text-to-video generates a clip from a prompt alone. Image-to-video starts from a reference image, which gives stronger control over subject identity, composition, and visual style.
Which model should I use to animate old photos?
For portrait animation, compare models on facial expression stability, motion naturalness, and identity preservation. Specialized lip-sync tools may be better when speech alignment is the main requirement.
How can I keep characters consistent?
Use a strong reference image as the first frame, keep the prompt specific, and avoid large changes in clothing, camera angle, or style unless the model supports identity conditioning.
