
How to Write AI Video Prompts: Action, Camera Movement and Duration

AI video prompts need three things on top of an image prompt: action (what moves), camera (how the camera moves), and duration (how long the shot runs). This guide gives four shot skeletons and compares Runway, Pika, Kling, Seedance, Sora, Hailuo and Veo.

Why video prompts are harder than image prompts

An image needs a single frame. A video needs every frame to make physical sense in sequence. Beginners run into three failure modes. Abstract action: the model cannot tell whether the subject or the camera is supposed to move, and the shot drifts. Action overload: two or three actions are stuffed into five seconds and none of them lands, so pick one. Missing stability: if you forget to write "camera holds static", most models default to a slow push that ruins the mood you wanted.

So every video prompt should state, explicitly: subject action (one, concrete), camera motion (static / push / pull / track / tilt / orbit), duration (3, 5 or 8 seconds), and stability (no shake / steady gimbal feel).

4 shot skeletons

Atmospheric still (highest success rate)

[scene + subject] + camera holds static + gentle [micro motion: drift, sway, ripple] + [time of day + light] + no shake

Example: a single raindrop falling on a glass window at night, camera holds static, neon lights blurred in the background, gentle vertical impact ripple, no camera shake. At 3–5 seconds, almost any model lands this cleanly.

Slow dolly in

[scene] + slow steady dolly in toward [subject] + [motion direction] + over 5 seconds + cinematic pacing + no jitter

"Slow", "steady", "over X seconds" are the load-bearing words. Skipping duration leaves the model to invent a default push speed, which is usually too fast.

Tracking shot

[subject walking/moving] + camera tracks horizontally to the right at the same pace + [environment scrolling past] + medium shot + steady gimbal feel

The "at the same pace" phrase is the cure for the most common failure mode: the subject and the camera move at different speeds, so one outpaces the other.

Action close-up

close-up of [subject performing single action] + [single specific motion verb: pours, lifts, turns] + slow motion 120fps look + shallow depth of field + camera holds static

One verb per shot. A chain like "pours and stirs and lifts" collapses on every model.
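The four skeletons above can be kept as fill-in templates. The following is a minimal sketch, not a tool from this site: the `SKELETONS` dictionary, the slot names and the `fill` helper are all hypothetical names chosen for illustration.

```python
# Hypothetical helper: the four skeletons from this guide as fill-in templates.
# Slot names (scene, subject, micro_motion, ...) are illustrative assumptions.
SKELETONS = {
    "atmospheric_still": (
        "{scene}, camera holds static, gentle {micro_motion}, {light}, "
        "no camera shake"
    ),
    "slow_dolly_in": (
        "{scene}, slow steady dolly in toward {subject}, {direction}, "
        "over {seconds} seconds, cinematic pacing, no jitter"
    ),
    "tracking_shot": (
        "{subject}, camera tracks horizontally to the right at the same pace, "
        "{environment}, medium shot, steady gimbal feel"
    ),
    "action_close_up": (
        "close-up of {subject}, {verb}, slow motion 120fps look, "
        "shallow depth of field, camera holds static"
    ),
}


def fill(skeleton: str, **slots: object) -> str:
    """Fill one skeleton; str.format raises KeyError if a slot is missing."""
    return SKELETONS[skeleton].format(**slots)


prompt = fill(
    "atmospheric_still",
    scene="a single raindrop falling on a glass window at night",
    micro_motion="vertical impact ripple",
    light="neon lights blurred in the background",
)
print(prompt)
```

A missing slot fails loudly instead of producing a half-filled prompt, which is the point: the stability and camera cues are baked into the template, so they cannot be forgotten.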

Video prompt structure diagram

action: a barista slowly pours espresso into a glass cup
camera: camera holds static / slow dolly in / tracks left
duration: over 5 seconds / 8-second clip
speed: slow motion 120fps look / real time
stability: no camera shake / steady gimbal feel
scene: warm window light · cafe interior
style: cinematic, shallow depth of field
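The seven slots in the diagram map naturally onto a small record type. This is a sketch under assumptions: the `VideoPrompt` class and its `render` method are hypothetical, and joining the slots with commas in this order is one reasonable choice, not a rule any model documents.

```python
from dataclasses import dataclass


@dataclass
class VideoPrompt:
    """One shot, with one field per slot in the structure diagram."""
    action: str
    camera: str
    duration: str
    speed: str
    stability: str
    scene: str
    style: str

    def render(self) -> str:
        # Join the non-empty slots in diagram order, comma-separated.
        parts = [self.action, self.camera, self.duration, self.speed,
                 self.stability, self.scene, self.style]
        return ", ".join(p.strip() for p in parts if p.strip())


shot = VideoPrompt(
    action="a barista slowly pours espresso into a glass cup",
    camera="camera holds static",
    duration="over 5 seconds",
    speed="real time",
    stability="no camera shake",
    scene="warm window light, cafe interior",
    style="cinematic, shallow depth of field",
)
print(shot.render())
```

Because every field is required, leaving out the camera or stability slot is an error at construction time rather than a surprise in the rendered clip.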

Wrong vs. right examples

✗ Wrong

a beautiful cinematic video of a girl walking in a forest, magical, dreamy, stunning, amazing 4k

No camera motion, no duration, no stability, action is the vague "walking". Ten runs give ten different shots.

✓ Right

a young woman in a wool coat walks slowly forward through a misty pine forest, camera tracks horizontally to the right at the same walking pace, 5-second shot, soft morning backlight, cinematic shallow depth of field, steady gimbal feel, no camera shake

Walking speed (slowly), camera motion (tracks horizontally), sync (same pace), duration (5-second), light (backlight) and stability (no shake) all present. Ten runs give a consistent shot.
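The difference between the two prompts can be checked mechanically. The sketch below is a rough keyword check, assuming the cue phrases used in this guide; the `CUES` patterns and the `missing_cues` function are hypothetical, and a prompt that words its cues differently will be flagged even if it is fine.

```python
import re

# Illustrative patterns only, built from the cue phrases used in this guide.
CUES = {
    "camera": r"\bcamera (holds|tracks)\b|\bdolly\b|\borbit",
    "duration": r"\b\d+-second\b|\bover \d+ seconds\b",
    "stability": r"\bno (camera )?shake\b|\bno jitter\b|\bsteady\b",
}


def missing_cues(prompt: str) -> list[str]:
    """Return the names of the cue groups that the prompt never mentions."""
    text = prompt.lower()
    return [name for name, pattern in CUES.items()
            if not re.search(pattern, text)]


wrong = ("a beautiful cinematic video of a girl walking in a forest, "
         "magical, dreamy, stunning, amazing 4k")
right = ("a young woman in a wool coat walks slowly forward through a misty "
         "pine forest, camera tracks horizontally to the right at the same "
         "walking pace, 5-second shot, soft morning backlight, cinematic "
         "shallow depth of field, steady gimbal feel, no camera shake")

print(missing_cues(wrong))  # ['camera', 'duration', 'stability']
print(missing_cues(right))  # []
```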

5 real samples

Sample 1 · Rain-night still · Runway Gen-3 / Seedance
a single raindrop slides down a foggy window at night, camera holds static, neon city lights blurred in the background, slow motion 120fps look, gentle vertical motion only, shallow depth of field, no camera shake, 5-second clip

The easiest reliable shot: static camera + single micro motion. Works on essentially every model.

Sample 2 · Coffee pour · Kling / Pika
close-up of a barista's hands slowly pouring espresso into a glass cup, warm cafe interior blurred behind, single action of pouring only, camera holds static, soft side light from the right, real-time pacing, 4-second clip

"Single action of pouring only" prevents the model from adding stirs, lifts and set-downs.

Sample 3 · Tokyo tracking shot · Runway Gen-3
a young woman in a black trench coat walks forward through a rain-soaked Tokyo alley at night, camera tracks horizontally to the right at the same walking pace, neon reflections on wet ground, shallow depth of field, steady gimbal feel, 6-second cinematic shot

"At the same walking pace" stops the camera from outpacing the subject.

Sample 4 · Food macro · Seedance / Hailuo
extreme close-up of melted chocolate slowly dripping onto a glossy croissant, camera holds completely static, single dripping motion, warm side light, shallow depth of field, slow motion 120fps look, 3-second clip

Macro food shots work best as a static camera plus a single slow action. Slow motion brings out the texture of the material.

Sample 5 · Drone push · Kling / Sora
misty mountain valley at sunrise, slow steady drone dolly forward over the treetops, sunlight breaking through clouds, very gentle pacing over 8 seconds, cinematic wide shot, no jitter, smooth motion

"Slow steady drone dolly forward" tells the model the rig and direction; "over 8 seconds" controls speed.

6 common pitfalls

Pitfall 1 · Vague action verb

"Moves", "walks", "interacts" are too soft. Use concrete verbs: "slowly pours", "lifts the cup", "turns the head to the left".

Pitfall 2 · Multiple actions stuffed in

"She walks in, sits down, picks up the cup, drinks" in 5 seconds always fails. One core action per shot.

Pitfall 3 · No duration or speed

State "3-second / 5-second / 8-second clip" plus "slowly / steady / real-time".

Pitfall 4 · No stability cue

Without "no shake / steady", most models default to handheld jitter that destroys still-life mood.

Pitfall 5 · Subject and camera both moving fast

Heavy subject motion + heavy camera motion almost always collapses. Lock one, move the other.

Pitfall 6 · Reusing image quality keywords

"Masterpiece, 8k, best quality" do almost nothing for video. "Cinematic, shallow depth of field, color grading" is enough.

Model comparison

Model | Typical duration | Strengths | Notes
Runway Gen-3 | 5–10 s | Cinematic, tracking shots | Strong action continuity, sensitive to camera-motion words
Pika 2.x | 3–5 s | Short atmospheric clips | Likes short, simple action descriptions
Kling 2.x | 5–10 s | People performance, ads | Excellent at non-English prompts
Seedance 2.0 | 5–8 s | Widescreen cinematic | See the Seedance page on this site
Sora | 10–60 s | Long narrative shots | Prompts read more like a natural description
Hailuo / MiniMax | 5–6 s | People, landscapes | Very friendly to long descriptive sentences
Veo 2 / Veo 3 | 5–8 s | Cinematic quality | Google ecosystem

Reality check: Even the best video models miss roughly half the time. Plan for 3–5 attempts per shot, keep prompts short, actions singular, and stability explicit.
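The 3–5 attempt budget follows from simple arithmetic. As a back-of-the-envelope sketch, assume each attempt succeeds independently with probability 0.5 (the "roughly half" figure above; real attempts on the same prompt are probably not fully independent). The function name `p_at_least_one_hit` is hypothetical.

```python
def p_at_least_one_hit(attempts: int, hit_rate: float = 0.5) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - hit_rate) ** attempts


for n in (1, 3, 5):
    print(n, round(p_at_least_one_hit(n), 3))
# 1 0.5
# 3 0.875
# 5 0.969
```

Under that assumption, three attempts already give a usable take about 7 times out of 8, and five attempts push it to about 97%. Past that point a rewrite of the prompt is likely a better use of credits than another roll.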

Frequently asked questions

How long can AI videos be?

3–10 seconds is the reliable range across Runway, Pika, Kling, Seedance, Hailuo and Veo. Sora goes up to 60 seconds but prompt control gets harder as duration grows.
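To pick a model for a planned shot length, the typical durations from the comparison table can be put in a lookup. This is a sketch: the `TYPICAL_SECONDS` table copies this guide's figures, which are typical ranges rather than hard limits, and the `models_for` helper is a hypothetical name.

```python
# Typical clip lengths in seconds, copied from this guide's comparison table.
# These are typical ranges, not hard product limits.
TYPICAL_SECONDS = {
    "Runway Gen-3": (5, 10),
    "Pika 2.x": (3, 5),
    "Kling 2.x": (5, 10),
    "Seedance 2.0": (5, 8),
    "Sora": (10, 60),
    "Hailuo / MiniMax": (5, 6),
    "Veo 2 / Veo 3": (5, 8),
}


def models_for(seconds: int) -> list[str]:
    """Models whose typical range covers the requested clip length."""
    return [model for model, (low, high) in TYPICAL_SECONDS.items()
            if low <= seconds <= high]


print(models_for(8))   # ['Runway Gen-3', 'Kling 2.x', 'Seedance 2.0', 'Veo 2 / Veo 3']
print(models_for(30))  # ['Sora']
```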

Do video models need negative prompts?

Most ignore a separate negative prompt. Write the constraint into the main prompt as a short, explicit cue: "no camera shake", "single action only", "no extra motion".

How do I keep a character consistent across multiple shots?

Today the safest path is to drive each shot from the same reference image (image-to-video) and lock the seed when supported. Pure text consistency across shots is unreliable.

Can I write video prompts in languages other than English?

Kling and Hailuo handle Chinese exceptionally well; Runway, Pika and Sora prefer English.

Try this skeleton in the structured editor

Open the editor and fill in subject / style / light / composition blocks separately; the editor assembles the final prompt for you.

Yan · AI Prompt Workshop editorial team | Last updated on 2026-06-12. This site does not call any cloud model. Every prompt and parameter in this article was tested and refined locally by the editorial team.