Why video prompts are harder than image prompts
An image needs a single frame. A video needs every frame to make physical sense in sequence. Beginners run into three failure modes. Abstract action: the model cannot tell whether the subject moves or the camera moves and the shot drifts. Action overload: two or three actions stuffed into five seconds — pick one or fail all. Missing stability: forget to mention camera holds static and most models default to a slow push that ruins the mood you wanted.
So every video prompt should state, explicitly: subject action (one, concrete), camera motion (static / push / pull / track / tilt / orbit), duration (3-5-8 seconds), and stability (no shake / steady gimbal feel).
4 shot skeletons
Atmospheric still (highest success rate)
[scene + subject] + camera holds static + gentle [micro motion: drift, sway, ripple] + [time of day + light] + no shake
Example: a single raindrop falling on a glass window at night, camera holds static, neon lights blurred in the background, gentle vertical impact ripple, no camera shake. 3–5 seconds, almost any model lands this cleanly.
Slow dolly in
[scene] + slow steady dolly in toward [subject] + [motion direction] + over 5 seconds + cinematic pacing + no jitter
"Slow", "steady", "over X seconds" are the load-bearing words. Skipping duration leaves the model to invent a default push speed, which is usually too fast.
Tracking shot
[subject walking/moving] + camera tracks horizontally to the right at the same pace + [environment scrolling past] + medium shot + steady gimbal feel
The "at the same pace" phrase is the cure for the most common failure mode: subject moves faster than the camera.
Action close-up
close-up of [subject performing single action] + [single specific motion verb: pours, lifts, turns] + slow motion 120fps look + shallow depth of field + camera holds static
One verb only. "Pours and stirs and lifts" collapses every model. One verb per shot.
Video prompt structure diagram
Wrong vs. right examples
✗ Wrong
a beautiful cinematic video of a girl walking in a forest, magical, dreamy, stunning, amazing 4k
No camera motion, no duration, no stability, action is the vague "walking". Ten runs give ten different shots.
✓ Right
a young woman in a wool coat walks slowly forward through a misty pine forest, camera tracks horizontally to the right at the same walking pace, 5-second shot, soft morning backlight, cinematic shallow depth of field, steady gimbal feel, no camera shake
Walking speed (slowly), camera motion (tracks horizontally), sync (same pace), duration (5-second), light (backlight) and stability (no shake) all present. Ten runs give a consistent shot.
5 real samples
a single raindrop slides down a foggy window at night, camera holds static, neon city lights blurred in the background, slow motion 120fps look, gentle vertical motion only, shallow depth of field, no camera shake, 5-second clip
The easiest reliable shot: static camera + single micro motion. Works on essentially every model.
close-up of a barista's hands slowly pouring espresso into a glass cup, warm cafe interior blurred behind, single action of pouring only, camera holds static, soft side light from the right, real-time pacing, 4-second clip
"Single action of pouring only" prevents the model from adding stirs, lifts and set-downs.
a young woman in a black trench coat walks forward through a rain-soaked Tokyo alley at night, camera tracks horizontally to the right at the same walking pace, neon reflections on wet ground, shallow depth of field, steady gimbal feel, 6-second cinematic shot
"At the same walking pace" stops the camera from outpacing the subject.
extreme close-up of melted chocolate slowly dripping onto a glossy croissant, camera holds completely static, single dripping motion, warm side light, shallow depth of field, slow motion 120fps look, 3-second clip
Macro food shots are static + single slow action. Slow motion amplifies the material reading.
misty mountain valley at sunrise, slow steady drone dolly forward over the treetops, sunlight breaking through clouds, very gentle pacing over 8 seconds, cinematic wide shot, no jitter, smooth motion
"Slow steady drone dolly forward" tells the model the rig and direction; "over 8 seconds" controls speed.
6 common pitfalls
"Moves", "walks", "interacts" are too soft. Use concrete verbs: "slowly pours", "lifts the cup", "turns the head to the left".
"She walks in, sits down, picks up the cup, drinks" in 5 seconds always fails. One core action per shot.
State "3-second / 5-second / 8-second clip" plus "slowly / steady / real-time".
Without "no shake / steady", most models default to handheld jitter that destroys still-life mood.
Heavy subject motion + heavy camera motion almost always collapses. Lock one, move the other.
"Masterpiece, 8k, best quality" do almost nothing for video. "Cinematic, shallow depth of field, color grading" is enough.
Model comparison
| Model | Typical duration | Strengths | Notes |
|---|---|---|---|
| Runway Gen-3 | 5–10 s | Cinematic, tracking shots | Strong action continuity, sensitive to camera-motion words |
| Pika 2.x | 3–5 s | Short atmospheric clips | Likes short, simple action descriptions |
| Kling 2.x | 5–10 s | People performance, ads | Excellent at non-English prompts |
| Seedance 2.0 | 5–8 s | Widescreen cinematic | See the Seedance page on this site |
| Sora | 10–60 s | Long narrative shots | Prompts read more like a natural description |
| Hailuo / MiniMax | 5–6 s | People, landscapes | Very friendly to long descriptive sentences |
| Veo 2 / Veo 3 | 5–8 s | Cinematic quality | Google ecosystem |