How to Make a 30 Second Commercial with AI
The exact 4-step AI workflow for a broadcast-quality 30-second commercial: storyboard, talent via Higgsfield, scenes via Veo and Kling, audio mix, and the $22 cost breakdown.
You can make a 30-second commercial with AI in 2026 for roughly $22: about $19 in model credits plus a licensed music track. Higgsfield Soul 2.0 handles talent and voiceover, Veo 3.1 covers cinematic scene work and product visuals, and Kling 3.0 fills lifestyle motion. A production company quotes $8,000 to $30,000+ for the same deliverable.
TL;DR
- Storyboard first: the 30-second format has a fixed structure (hook at 0-5s, problem and proof at 5-23s, brand moment and CTA at 23-30s) and every model decision follows from it
- Talent and voiceover: Higgsfield Soul 2.0 for any on-screen spokesperson or avatar; generates native audio with usable lip sync
- Scene work: Veo 3.1 for cinematic product and environment shots at 4K/60fps; Kling 3.0 for lifestyle motion and handheld-style sequences
- Sound: music bed at -20 LUFS under dialogue; design sound-on and sound-off versions before you route to channel
- A tested DTC 30-second spot cost $21.70 ($19.20 in model credits plus $2.50 for music) and took about 3.5 hours from brief to exported file
Why 30 seconds is still the workhorse format
Broadcast ad breaks, OTT mid-roll, and programmatic video inventory are built around 30-second blocks. The format also maps directly to AI generation: a 30-second spot breaks into a handful of clips of 4 to 8 seconds each (seven in the walkthrough below), which is the clip length Veo, Kling, and Higgsfield produce natively. One clear idea, one narrative arc, manageable generation scope.
The 4-step workflow
Step 1: Storyboard your 30 seconds
Build the storyboard before touching any model. The 30-second structure:
- 0-5s: Hook (visual or spoken event that earns the next 25 seconds)
- 5-15s: Problem or context
- 15-23s: Resolution or proof (product in action, transformation)
- 23-27s: Brand moment (logo, tagline, product beauty shot)
- 27-30s: CTA
For each beat, decide: talking-head (Higgsfield), cinematic scene (Veo 3.1), or lifestyle motion (Kling 3.0).
The example used throughout this guide: DTC spot for a running shoe brand, CTV pre-roll and paid social, 16:9.
| Beat | Shot | Model |
|---|---|---|
| 0-5s | Runner lacing up at dawn, shoe close-up | Veo 3.1 |
| 5-15s | Spokesperson: "I started running again at 42..." | Higgsfield |
| 15-23s | Trail run wide + midsole close-up | Veo 3.1 + Kling 3.0 |
| 23-27s | Product beauty shot, logo lockup | Veo 3.1 |
| 27-30s | URL lower third, spoken CTA | Higgsfield |
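Before generating anything, it can help to check that the beats tile the full 30 seconds with no gaps or overlaps. The sketch below is one way to do that in Python; the `Beat` class and `validate` function are hypothetical names for illustration, not part of any tool mentioned in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beat:
    start: int  # seconds from the start of the spot
    end: int    # seconds from the start of the spot
    shot: str
    model: str


# The running shoe storyboard from the table above.
STORYBOARD = [
    Beat(0, 5, "Runner lacing up at dawn, shoe close-up", "Veo 3.1"),
    Beat(5, 15, "Spokesperson setup line", "Higgsfield"),
    Beat(15, 23, "Trail run wide + midsole close-up", "Veo 3.1 + Kling 3.0"),
    Beat(23, 27, "Product beauty shot, logo lockup", "Veo 3.1"),
    Beat(27, 30, "URL lower third, spoken CTA", "Higgsfield"),
]


def validate(beats, total=30):
    """Return a list of problems; an empty list means the timeline is contiguous."""
    problems = []
    cursor = 0
    for beat in beats:
        if beat.start != cursor:
            problems.append(
                f"gap or overlap at {cursor}s (next beat starts at {beat.start}s)"
            )
        if beat.end <= beat.start:
            problems.append(f"beat at {beat.start}s has no duration")
        cursor = beat.end
    if cursor != total:
        problems.append(f"timeline ends at {cursor}s, expected {total}s")
    return problems


print(validate(STORYBOARD))  # [] when the beats cover 0-30s exactly
```

Keeping the storyboard as data also makes it easy to re-check the timing after a beat is lengthened or trimmed in the edit.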
Step 2: Talent and voiceover via Higgsfield Soul 2.0
In this workflow, Higgsfield Soul 2.0 was the only model whose identity locking held a face consistently across multiple cuts. Upload one front-facing reference portrait. For the running shoe spot, the spokesperson is a generated avatar: athletic man, early 40s, relatable, no extreme features.
Prompt for the 5-15s section:
Athletic man in his early 40s, light stubble, wearing a running jacket, speaks directly to camera outdoors with a soft morning sky behind him, says "I started running again at 42. Every pair of shoes I tried made my knees worse." Sincere, slightly rueful expression. Handheld feel, very slight natural movement. 16:9. Clean audio. No music.
Usable clip on the third variant. 85 seconds generation time per clip. Face held across two head movements and matched the CTA clip generated in the same session.
CTA line (beat 5):
Same man, same outdoor background, slightly warmer expression, says "Try them free for 30 days." Direct. 16:9. Clean audio. No music.
Generate 4 variants per line. Discard clips with eye-line issues or lip sync lag at the first or last syllable. For broadcast, consider generating the video silent and laying a recorded voiceover over it in post.
Step 3: Cinematic scenes and lifestyle motion via Veo 3.1 and Kling 3.0
The non-talking-head shots split between two models.
Veo 3.1 for cinematic shots: 4K/60fps, 8-second clips, 3 to 4 minutes per clip. Use for product beauty shots, environment establishing shots, and any frame where broadcast resolution matters.
Beat 1 (0-5s):
Extreme close-up of a running shoe being laced, hands moving deliberately, warm pre-dawn light from low left, orange and blue sky in background bokeh. Cinematic depth of field. 16:9. 6 seconds. No audio.
Output: bokeh read as early morning without generating a full landscape. Lacing motion held 6 seconds without hand-drift.
Beat 3, trail run (15-20s):
Wide shot of a runner from behind on a wooded trail, early morning light, slight mist in tree line, golden hour through branches. 16:9. Steadicam feel. 8 seconds.
Output: motion held 8 seconds clean. Mist added depth beyond what the prompt specified.
Beat 3, midsole (20-23s):
Extreme close-up of a running shoe midsole in motion, ground level, asphalt, fast stride. Slow motion, 60fps feel. Warm light. 16:9. 4 seconds.
Output: 3 variants. Two had motion blur on the shoe upper; third was clean.
Kling 3.0 for lifestyle motion: 1080p, 55 to 70 seconds per clip. Use for handheld-feel inserts where broadcast 4K isn't needed and speed matters.
Runner stretching calves against a park bench at dawn, casual athletic wear, handheld feel, slight camera sway. Natural morning light. 16:9. 5 seconds.
Output: first variant had a leg proportion issue. Second was clean.
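The per-clip generation times quoted in Steps 2 and 3 can be turned into a rough generation-time budget. The sketch below is a back-of-envelope estimate only. It assumes the quoted times apply to every variant, that variants are generated one after another, and that the variant counts match the walkthrough table later in this guide (8 Higgsfield, 9 Veo 3.1, 2 Kling 3.0).

```python
# (min_seconds, max_seconds) per generated variant, from the figures in this guide.
# Assumption: the quoted per-clip time applies to each variant.
SECONDS_PER_VARIANT = {
    "Higgsfield Soul 2.0": (85, 85),
    "Veo 3.1": (180, 240),   # "3 to 4 minutes per clip"
    "Kling 3.0": (55, 70),
}

# Variant counts summed from the walkthrough cost table.
VARIANTS = {"Higgsfield Soul 2.0": 8, "Veo 3.1": 9, "Kling 3.0": 2}


def generation_window(variants, per_variant):
    """Return (low, high) total generation time in seconds, assuming sequential runs."""
    low = sum(count * per_variant[model][0] for model, count in variants.items())
    high = sum(count * per_variant[model][1] for model, count in variants.items())
    return low, high


low, high = generation_window(VARIANTS, SECONDS_PER_VARIANT)
print(f"{low / 60:.0f} to {high / 60:.0f} minutes")  # 40 to 50 minutes
```

Under those assumptions, generation itself is well under an hour of the total production time; the rest goes to briefing, review, and post.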
Step 4: Music and sound design
License or generate a track at exactly 30 seconds. A 60-second track faded mid-phrase sounds unfinished. Music bed at -20 LUFS; Higgsfield dialogue at -12 LUFS. Thin AI voice? Boost 2kHz narrow, low-cut below 120Hz.
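To hit those levels you need the gain offset between each stem's measured loudness and its target. A minimal sketch of the arithmetic follows; the measured values are hypothetical, and it assumes a plain gain change shifts integrated loudness by the same number of dB, which holds to a first approximation.

```python
MUSIC_TARGET_LUFS = -20.0
DIALOGUE_TARGET_LUFS = -12.0


def gain_db(measured_lufs: float, target_lufs: float) -> float:
    """Gain in dB to move a stem from its measured loudness to the target."""
    return target_lufs - measured_lufs


def db_to_linear(db: float) -> float:
    """Convert a dB gain to a linear amplitude multiplier."""
    return 10 ** (db / 20)


# Hypothetical meter readings for the two stems.
music_gain = gain_db(-14.0, MUSIC_TARGET_LUFS)         # -6.0 dB
dialogue_gain = gain_db(-16.5, DIALOGUE_TARGET_LUFS)   # +4.5 dB

print(f"music: {music_gain:+.1f} dB (x{db_to_linear(music_gain):.2f})")
print(f"dialogue: {dialogue_gain:+.1f} dB (x{db_to_linear(dialogue_gain):.2f})")
```

Re-measure after applying the gain; EQ moves such as the 2kHz boost change the loudness reading slightly.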
Add diegetic sound to no-dialogue beats (lacing shot: shoe-lace tension, ambient birdsong). It earns the cut into the spokesperson.
Design sound-on and sound-off cuts before routing. TV and CTV are always sound-on. Paid social is roughly 60% sound-off on Meta. The sound-off version needs captions on all dialogue, and beats 1, 3, and 4 must communicate without audio.
Routing by media channel
| Channel | Format | Sound | Key adjustment |
|---|---|---|---|
| Broadcast TV | 16:9, 1080p or 4K | Always on | Export at broadcast spec (29.97fps, -1dBFS max, -24 LUFS integrated); CTA is verbal only |
| Paid social (Meta, TikTok) | 16:9 or 9:16 | Mostly off | Captions on all dialogue; CTA text overlay at 27-30s |
| CTV / streaming (pre-roll) | 16:9, 1080p | Always on | Non-skippable format; hook must earn 30 seconds, not just 5 |
| In-app video (programmatic) | 16:9 or 1:1 | Mostly off | Logo and CTA in first 3 seconds; assume viewer may close at any point |
| OOH digital (DOOH) | Varies by placement | Always off | No dialogue; visual storytelling only; brand moment extended to 5 seconds |
Broadcast and CTV: best footage in beat 4. Paid social and in-app: best footage in beat 1.
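The routing table can be kept as data so a media plan maps to a concrete export list. The sketch below is illustrative: `ChannelSpec`, `CHANNELS`, and `required_exports` are hypothetical names, "mostly off" is simplified to "off", and DOOH has no aspect ratio listed because it varies by placement.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChannelSpec:
    aspect_ratios: Tuple[str, ...]
    sound: str                 # dominant viewing mode: "on" or "off"
    lead_beat: Optional[int]   # beat that carries the best footage, where stated


CHANNELS = {
    "broadcast_tv": ChannelSpec(("16:9",), "on", 4),
    "paid_social": ChannelSpec(("16:9", "9:16"), "off", 1),
    "ctv": ChannelSpec(("16:9",), "on", 4),
    "in_app": ChannelSpec(("16:9", "1:1"), "off", 1),
    "dooh": ChannelSpec((), "off", None),  # aspect ratio varies by placement
}


def required_exports(channel_names):
    """Return the sorted (aspect_ratio, sound) masters needed for a media plan."""
    exports = set()
    for name in channel_names:
        spec = CHANNELS[name]
        for ratio in spec.aspect_ratios:
            exports.add((ratio, spec.sound))
    return sorted(exports)


# The running shoe spot targets CTV pre-roll and paid social.
print(required_exports(["ctv", "paid_social"]))
# [('16:9', 'off'), ('16:9', 'on'), ('9:16', 'off')]
```

For the example spot that means three masters: a sound-on 16:9 cut for CTV, plus captioned 16:9 and 9:16 cuts for paid social.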
Walkthrough: DTC running shoe 30-second spot for $22
| Clip | Model | Variants generated | Cost |
|---|---|---|---|
| Lacing at dawn, 6s | Veo 3.1 | 2, used 1 | $3.80 |
| Spokesperson, setup line, 8s | Higgsfield Soul 2.0 | 4, used 1 | $2.20 |
| Spokesperson, CTA line, 4s | Higgsfield Soul 2.0 | 4, used 1 | $1.80 |
| Trail run wide, 8s | Veo 3.1 | 2, used 1 | $3.80 |
| Midsole close-up, 4s | Veo 3.1 | 3, used 1 | $2.85 |
| Runner stretching, 5s | Kling 3.0 | 2, used 1 | $0.95 |
| Product beauty shot, 4s | Veo 3.1 | 2, used 1 | $3.80 |
Total model cost: $19.20. Brand overlay, color grade, and lower thirds in 8frame Studio. Music licensed at $2.50. Total: $21.70. Production time from brief to export: 3 hours 25 minutes.
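The totals above can be reproduced from the table. The sketch below sums the per-clip costs with `Decimal` to avoid floating-point rounding and breaks the model spend down by model; the variable names are illustrative.

```python
from collections import defaultdict
from decimal import Decimal

# (clip, model, cost in USD) copied from the walkthrough table.
CLIPS = [
    ("Lacing at dawn", "Veo 3.1", Decimal("3.80")),
    ("Spokesperson, setup line", "Higgsfield Soul 2.0", Decimal("2.20")),
    ("Spokesperson, CTA line", "Higgsfield Soul 2.0", Decimal("1.80")),
    ("Trail run wide", "Veo 3.1", Decimal("3.80")),
    ("Midsole close-up", "Veo 3.1", Decimal("2.85")),
    ("Runner stretching", "Kling 3.0", Decimal("0.95")),
    ("Product beauty shot", "Veo 3.1", Decimal("3.80")),
]
MUSIC_LICENSE = Decimal("2.50")

by_model = defaultdict(Decimal)  # Decimal() is 0, so missing models start at zero
for _, model, cost in CLIPS:
    by_model[model] += cost

model_total = sum(by_model.values())
grand_total = model_total + MUSIC_LICENSE

for model, cost in by_model.items():
    print(f"{model}: ${cost}")
print(f"Model cost: ${model_total}")  # Model cost: $19.20
print(f"Total: ${grand_total}")       # Total: $21.70
```

Veo 3.1 accounts for $14.25 of the $19.20, so the number of cinematic variants is the main cost lever.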
Agency quotes for the same brief: $9,500 / $14,000 / $22,000, with 12 to 20 business day turnarounds.
Pitfalls
Sound-on/sound-off tradeoff. You can't fully optimize for both. Decide your primary channel first, design for it, then adapt the secondary version. Trying to do both simultaneously produces a spot that does neither well.
Audio sync on a 30-second timeline. Sync drift is more visible here than in a 15-second social cut because viewers are paying attention. Higgsfield's native audio lands within 50ms on clean lines. Check plosive (B, P) and sibilant (S, SH) lines at 50% speed in your NLE before export. Correct in post; don't re-generate just for a sync fix.
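It helps to think of sync offsets in frames, since that is the unit you nudge by in an NLE. The sketch below converts a millisecond offset to frames at the two broadcast frame rates mentioned in this guide. Treating 50ms as a pass/fail tolerance is an assumption made for illustration, not a published spec.

```python
from fractions import Fraction

# 29.97fps is nominally 30000/1001 frames per second.
FRAME_RATES = {"29.97": Fraction(30000, 1001), "25": Fraction(25)}


def offset_in_frames(offset_ms: float, fps: Fraction) -> float:
    """Convert an audio/video offset in milliseconds to a frame count."""
    return offset_ms / 1000 * float(fps)


def within_tolerance(offset_ms: float, tolerance_ms: float = 50.0) -> bool:
    """True if the offset magnitude is inside the assumed tolerance."""
    return abs(offset_ms) <= tolerance_ms


for label, fps in FRAME_RATES.items():
    print(f"50 ms at {label}fps = {offset_in_frames(50, fps):.2f} frames")
# 50 ms at 29.97fps = 1.50 frames
# 50 ms at 25fps = 1.25 frames
```

So a 50ms offset is roughly one to one and a half frames, which is why a one-frame slip in post is usually enough to correct it.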
Brand cap at the end. Generated footage looks generically polished in the final frames, not specifically branded. Build beats 4 and 5 as designed overlays in post: logo lockup, tagline, product image as a motion graphic. Always more intentional than a generated "hero shot" with a prompted logo.
FAQ
Can it air on broadcast TV?
Yes, with the right export spec. Broadcast requires 29.97fps (NTSC) or 25fps (PAL), -24 LUFS integrated loudness (EBU R128 territories use -23 LUFS), a true peak ceiling of -1 dBTP (some US network specs ask for -2 dBTP, so check the delivery sheet), and ProRes 422 HQ or MXF delivery. Veo 3.1 at 4K/60fps clears the resolution bar. The spec is about your mastering, not the generation model. Some cable networks also require a third-party QC pass (Signiant and similar platforms); budget an extra $50 to $150 per spot for that.
What about SAG-AFTRA/ACTRA compliance for AI-generated talent in a 30-second spot?
Evolving area. As of mid-2026, SAG-AFTRA's agreement covers AI-generated performances based on a specific consenting performer's likeness. A fully synthetic avatar from a text or image prompt sits in a grayer space. For broadcast and national media buys, get a media attorney's opinion before airing. For digital and streaming placements without named union talent, the risk is lower. Higgsfield Soul 2.0's terms specify that generated avatars using non-union reference images are the licensee's responsibility to clear for distribution.
What is the best format for streaming ads?
16:9 at 1080p is the CTV and OTT standard. Some platforms (Hulu, Peacock, Amazon) accept 4K, but 1080p is where most inventory serves. Audio at -24 LUFS integrated, matching broadcast. The key structural difference: streaming pre-roll is often non-skippable, so the hook doesn't need to compete with a skip button. You have the viewer for 30 seconds. Use the full arc rather than front-loading everything into the first 5 seconds.
The 30-second format is solvable with four generation sessions and a clear storyboard. Higgsfield holds your talent, Veo carries your cinematic moments, Kling fills your lifestyle cuts, and post handles everything branded.
Run the 30-second commercial workflow template on 8frame's /workflows to get the clip structure, overlay slots, and audio spec pre-loaded. For the underlying talent and talking-head technique that carries this workflow, the AI UGC ad guide goes deeper on Higgsfield identity locking and avatar consistency across cuts.