If you don't know the specific prompting techniques for Seedance 2.0 you will generate slop, every single time, regardless of how creative your idea is or how much you're paying per generation...
the model has its own language for camera, lighting, motion, and constraints, and typing normal english descriptions into the prompt box is like speaking french to someone who only understands japanese
this is the complete reference for that language: every camera keyword, every lighting modifier, every constraint that actually works, and the exact 5-layer structure that turns the same $0.60 generation from stock footage into something that stops a scroll
the framework comes from hundreds of generations, the official volcengine documentation, every higgsfield and yaroflasher tutorial worth watching, and the community techniques that have been confirmed to actually produce results... compressed into the only tab you need open while you're generating
the full seedance 2.0 setup
if you want the complete walkthrough for setting up seedance 2.0 from scratch, getting access, configuring your first generation, and understanding the platform options, i'm covering all of that inside my community at weeklyaiops.com...
this article focuses on the prompting system itself, but the setup guide lives there with video walkthroughs and the workflows i use daily
what you're actually working with
seedance 2.0 is a multimodal film set, not a text-to-video box, and the distance between those two things is the distance between typing a google search and directing a $50,000 commercial
in a single generation you can feed it:
- up to 9 reference images (character sheets, mood boards, product photos, storyboard panels)
- up to 3 video clips (camera motion reference, choreography, pacing)
- up to 3 audio tracks (voiceover, music, sound effects)
- plus your text prompt

that's up to 15 reference files processed simultaneously through a dual-branch diffusion transformer that generates video and audio in one inference pass, not stitched after the fact, not two pipelines bolted together...
one pass, synchronized video with dual-channel stereo audio, lip-synced speech across 8+ languages (english, mandarin, japanese, korean, spanish, french, german, portuguese, and chinese dialects), background music, and foley
output runs from 4 to 15 seconds per generation at up to 1080p resolution
sora 2 takes text and images, kling 3.0 takes text and images, veo 3.1 takes text and images...
seedance takes all four modalities at once, and on higgsfield you get a generous number of generations in a suite where you can also run kling, veo, sora, and 30+ other models side by side, which means you can compare the same prompt across models in one place and actually see what seedance does differently
if you're only typing text into the prompt box you're using around 15% of the tool while paying the same price as someone using all of it
the 5-layer prompt stack
the official volcengine documentation lists a 6-element formula but community testing has compressed it to 5 layers that consistently outperform longer and looser prompts:
subject > action > camera > style > constraints
the ordering carries weight:
- subject first pins the model to a center of gravity so it doesn't split attention across competing elements
- action second provides the kinetic anchor, the thing that must move even if everything else shifts
- camera third locks framing before the model re-decides the lens every few seconds
- style late in the stack adds visual flavor without hijacking motion
- constraints last act as guardrails that close whatever gaps the other four layers left open
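here's the stack assembled into one example prompt, layer by layer, using keywords from the library below:
a street violinist in her 30s, fingerless gloves, worn green scarf, tight dark curls, draws the bow slowly through a final sustained note, eyes closing, slow push-in, golden hour, cinematic film tone, 35mm, avoid jitter, avoid bent limbs, maintain face consistency, sharp clarity, natural colors, stable picture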
layer 1: subject
specificity in the subject line is genuinely load-bearing
bad: "a woman" better: "a young woman with brown hair" best: "a woman in her late 20s, tight dark curls at ear length, small silver hoop in left ear, wearing a fitted black turtleneck, neutral expression"
every identity marker you provide is one the model doesn't hallucinate... hair length, clothing texture, posture, accessories, skin detail all drift when you leave them unspecified because the model fills gaps with whatever its training data averaged to, which is always generic
one subject per generation is the safest path, two characters work if you keep them spatially separated and tag them individually as @Character_A and @Character_B, three or more is where the coin stops landing in your favor
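a tagged two-character example: "@Character_A, a chef in whites, plates a dish at the counter on frame left, @Character_B, a server in a black apron, waits by the pass on frame right, fixed camera"... explicit tags plus spatial separation keep the two identities from bleeding into each other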

layer 2: action
what happens, present tense, one primary movement per shot... and this is where 90% of prompts collapse because people write states instead of directions
bad: "she looks happy and is enjoying the sunset" good: "she slowly turns toward the camera, breeze lifting the hem of her skirt, eyes narrowing against the light"
one gives the model a photograph to approximate, the other gives it a sequence to execute, and the quality gap is enormous
the rule that almost nobody follows: separate subject movement from camera movement, every time...
"spinning camera around a dancing person" is a single instruction where the model can't tell who's supposed to spin, but "the dancer spins slowly, camera holds fixed framing" splits the ambiguity into two clear directives and eliminates the majority of shaky, chaotic output that people keep blaming on the model
layer 3: camera
seedance treats camera direction as a first-class conditioning signal, which is where it separates from everything else on the market
one primary camera movement per generation, describe rhythm (slow, smooth, gentle) rather than technical specs... the official guide discourages f-stop values, ISO numbers, and exact millimeter specifications because the model responds better to descriptive language than to camera metadata
the complete keyword library:
static shots:
- fixed or locked-off for zero camera movement
- static wide for a wide unmoving establishing shot
- locked tripod, zero camera shake for when ambient jitter keeps creeping in
movements:
- push-in / dolly in moves camera toward subject... tension, emphasis, emotional close-ups
- pull-out / dolly out moves camera away... environmental reveals, context
- pan left/right is horizontal rotation in place... scanning, following action
- tracking shot / follow moves alongside the subject... action sequences
- orbit / arc / 360 orbit circles the subject... product showcases, portraits, hero moments
- aerial / drone shot for high altitude... landscapes, establishing geography
- handheld introduces natural shake... documentary feel, UGC authenticity
- crane up/down for vertical ascent or descent... dramatic height reveals
- gimbal creates smooth stabilized motion... polished cinematic, distinct from handheld
- steadicam walk for smooth forward motion following a character through space
- whip pan is a rapid horizontal sweep... urgency, scene transitions
- dolly zoom creates the hitchcock vertigo effect where subject stays the same size while background warps
- rack focus shifts focus between foreground and background planes to redirect attention
speed modifiers:
- imperceptible / barely for extremely slow, almost unnoticeable movement
- slow / gentle / gradual is the safest starting point and the default recommendation
- smooth / controlled for natural rhythm
- dynamic / swift for high impact... use with extreme caution
the word "fast" specifically is the most dangerous keyword in seedance prompting because combining fast camera + fast subject + busy scene almost guarantees jitter and compression artifacts... the fix is making only ONE element fast while holding everything else steady

if you want compound camera movement, sequence it rather than stacking it: "start: slow dolly-in, then: gentle pan right for the final 2 seconds" gives the model two clear temporal phases instead of two competing instructions fighting each other in the same clause
layer 4: style
lighting, color grading, film references, atmosphere
lighting descriptions have the single biggest impact on video quality among all prompt elements according to the official volcengine guide, bigger than style adjectives, bigger than quality modifiers, bigger than resolution requests...
if you only add one element to a weak prompt, make it a lighting description

lighting keywords that consistently produce:
- golden hour is the single highest-quality-per-word improvement you can add
- rim light / dramatic rim light against dark background creates cinematic edge separation
- soft key from 45 degrees is flattering talking-head lighting
- overcast daylight / even overcast diffused light eliminates flicker in bright scenes
- backlit silhouette at sunset for dramatic mood
- motivated lighting from practical source creates realism with the light source visible in frame
- volumetric fog adds atmospheric depth, pairs with backlit setups
- chiaroscuro for high-contrast lighting in the godfather mold
color grading:
- teal and orange for classic hollywood
- bleach bypass for desaturated, gritty, high-contrast texture
- warm tone / amber-tinted for nostalgic feels
- crushed blacks for deep cinematic shadow loss
- pastel for soft anime or fashion aesthetic
film references as style anchors:
- cinematic film tone, 35mm is the most reliable all-purpose anchor
- 16mm film, handheld camera for raw indie aesthetic
- anamorphic lens flare for widescreen cinematic
- national geographic quality for nature documentary
- documentary-style handheld framing for observational realism
"cinematic" alone produces nothing predictable because the official guide calls it "too vague"... cinematic film tone, 35mm, warm golden lighting gives the model three intersecting constraints, cinematic by itself gives it permission to do whatever it wants
and a subtler trap: words like "glow," "glimmer," and "glints" invite specular flicker artifacts, so replace them with steady intensity or diffuse when you want soft light without temporal instability
layer 5: constraints
the guardrails, and the layer that separates AI-looking video from video that passes as real footage
essential constraints for every character prompt:
- avoid jitter prevents screen shaking
- avoid bent limbs prevents distorted arms and legs, use in every character prompt without exception
- avoid identity drift prevents character features changing between shots
- avoid temporal flicker prevents frame-to-frame brightness oscillation
- no distortion, no stretching maintains geometric stability
- maintain face consistency preserves face identity across cuts
the community quality suffix to append to every generation:
sharp clarity, natural colors, stable picture, no blur, no ghosting, no flickering

inelegant and measurably effective... the model reads positive constraint statements more reliably than negative prompt syntax, so "avoid X" and "maintain Y" outperform listing negatives
keywords that actively degrade output
these look helpful and they're not:
- "fast" (unqualified) causes the model to accelerate everything simultaneously... name which single element moves fast and hold the rest steady
- "cinematic" (alone) gives the model nothing to work with... always pair it with texture, lighting, or a film reference
- "epic" has no visual meaning to a diffusion model
- "amazing" / "beautiful" / "stunning" are feelings, not instructions, and the model can't render an adjective
- "lots of movement" triggers jitter across the entire frame... name one specific movement
- "glow" / "glimmer" / "glints" create specular flicker... use steady intensity or diffuse instead
the principle underneath all of these: if a word describes how the viewer should feel rather than what the camera should see, the model has to guess what visual would produce that feeling, and it guesses wrong

time-coded multi-shot prompting
you can direct individual shots within a single 15-second generation by writing timestamps into the prompt, and this is where seedance becomes something genuinely different from every other model
two formats work:
format A (range brackets):
[0-4s]: wide establishing shot, static camera, misty bamboo forest at dawn, golden hour light filtering through leaves
[4-9s]: medium shot, slow push-in, the fighter steps forward, white silk kimono billowing, determined expression
[9-15s]: close-up, orbit shot, the fighter strikes, slow motion, impact visible in the fabric ripple
format B (parenthetical seconds):
(0-3s) macro shot of perfume bottle among pink flowers, shallow depth of field, petals floating
(3-7s) camera glides closer, a feminine hand enters frame, touches the bottle
(7-12s) slow-motion spray, mist diffuses in air, particles catching rim light
(12-15s) pull-out to hero frame, product centered, volumetric lighting, minimal background
each shot should specify camera position, subject action, and lighting state, and transition language between shots ("hard cut to," "seamless morph into") gives the model explicit cut instructions rather than letting it improvise...
if you're doing this on higgsfield you can queue up multiple variations of the same time-coded prompt and compare the outputs side by side, which is the fastest way to dial in the pacing
the 15-second climax arc template:
[0-4s]: wide shot, static, world established, ambient sound
[4-8s]: medium shot, slow push-in, tension building, subject prepares
[8-12s]: close-up, emotional peak approaching, one specific detail in sharp focus
[12-15s]: extreme close-up or dramatic reveal, climax action, slow motion or static hold, silence
wide > tighter > tight > closest
the universal escalation pattern in filmmaking mapped directly onto a 15-second generation window
the @ reference system
the people getting outputs that don't read as AI are uploading 6 to 12 reference files and tagging every one with a specific role in the prompt... the difference between typing text and directing a production lives entirely in this system

the syntax: tag each upload by type and slot number, then assign it a role inline, like @Image1 as character reference (maintain exact facial features), @Video1 for camera motion reference, @Audio1 as background music

every uploaded file needs an explicit role in the prompt... an image without an @ tag gets processed ambiguously, and ambiguity in a diffusion model produces averaging, which is the visual equivalent of mush
the first-last frame technique is the most underused shortcut in the model... upload your desired first frame as @Image1 and your desired last frame as @Image2, describe what happens between them, and seedance interpolates coherent motion connecting the two endpoints with no storyboarding or multi-step pipeline
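a minimal version: upload a closed ring box as @Image1 and the open box with the ring catching the light as @Image2, then prompt "@Image1 as first frame, @Image2 as last frame, the lid lifts slowly, camera holds fixed framing, soft key from 45 degrees, avoid jitter, sharp clarity, stable picture" and the model builds the motion between the two endpoints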
5 prompts you can run right now
graduated from simple to full multimodal production... each body below is an example build assembled from the keyword library in this article, so treat them as starting points and swap in your own subject, references, and timing
1 - the talking head (UGC)
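a woman in her late 20s, tight dark curls at ear length, small silver hoop in left ear, seated at a kitchen table, speaks directly to camera with relaxed hand gestures, handheld, slow gentle movement, soft key from 45 degrees, avoid bent limbs, maintain face consistency, sharp clarity, natural colors, stable picture, no blur, no ghosting, no flickering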

2 - the product hero
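a frosted glass perfume bottle on a black marble pedestal, the bottle holds perfectly still while mist drifts low across the base, slow orbit, dramatic rim light against dark background, volumetric fog, teal and orange, no distortion, no stretching, sharp clarity, natural colors, stable picture
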
3 - the cinematic scene
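a lone hiker in a faded red jacket, mid-40s, weathered boots, crests a ridge and pauses, wind pulling at the jacket hem, slow crane up, golden hour, cinematic film tone, 35mm, volumetric fog in the valley below, avoid jitter, avoid temporal flicker, sharp clarity, natural colors, stable picture
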
4 - the action sequence (time-coded)
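[0-5s]: wide shot, static, a boxer shadowboxes alone in an empty gym, overcast daylight through high windows
[5-10s]: medium shot, slow push-in, the boxer's combinations sharpen, motivated lighting from practical source
[10-15s]: close-up, the final punch lands on the heavy bag, slow motion, dramatic rim light, avoid bent limbs, avoid jitter, sharp clarity, stable picture
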
5 - the full multimodal production
@Image1 as character reference (maintain exact facial features and outfit) @Image2 as environment reference (match lighting and color palette) @Video1 for camera motion reference (replicate the slow orbit movement) @Audio1 as background music (sync scene transitions to beat positions)
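the tags then ride under a normal text prompt, for example: @Character_A walks slowly through the environment from @Image2, camera replicates the slow orbit from @Video1, golden hour, cinematic film tone, 35mm, avoid identity drift, maintain face consistency, sharp clarity, natural colors, stable picture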
the iteration rule
generate 2 to 3 baseline options with your prompt, then change one variable... the camera, the lighting, the speed modifier, one thing
score each generation for continuity and adherence, keep the best version, change one more variable
the instinct after a failed generation is to rewrite the entire prompt while changing subject and camera and style and lighting simultaneously, which means you can never isolate what helped and what hurt because the next failure has entirely different causes...
controlled iteration with one variable per pass is slower per cycle but converges faster, which is the same principle that makes A/B testing work better than redesigns
if movement is too subtle, drop "dynamic motion" or "vibrant energy" into the start of the prompt...
these act as global intensity modifiers that amplify whatever motion you've specified without introducing new movement types
this is where running iterations on higgsfield pays off the most, because you can test 3-4 variations of the same prompt with one variable changed and see all the outputs in the same workspace without switching tabs or losing track of what you already tried
conclusion
seedance 2.0 is the most capable multimodal video model available right now, and the distance between what it can produce and what people are actually getting out of it comes down almost entirely to prompt architecture...
the 5-layer stack, the keyword library, the constraint system, and the @ reference tags covered in this article are the complete toolkit, and every section above is designed to be something you come back to mid-generation when you need to look up a camera keyword or debug why your output looks wrong
bookmark this, keep it open in a tab, and use it as a working reference rather than something you read once
thank you to higgsfield for sponsoring this article