Use case · 8 min read · June 3, 2026

How to Make a Podcast Video with AI

A 4-step workflow for turning a raw podcast recording into a host avatar video, ambient b-roll, and vertical clips for Reels and TikTok. Tested on 8frame with Higgsfield, Wan, and Kling.

You can make a podcast video with AI by generating a host avatar (Higgsfield Soul 2.0), producing ambient b-roll for the full-episode cut (Wan 2.5), and extracting vertical clips with animated waveforms for social distribution. The whole pipeline runs inside 8frame. You need your audio file, a reference photo of your host, and about 90 minutes the first time. After that, each new episode adds roughly 30 minutes of production time.

TL;DR

Host video: Higgsfield Soul 2.0 generates a talking-head avatar locked to your reference image, consistent across a full-episode cut
Ambient b-roll: Wan 2.5 produces the contextual footage (studio atmosphere, objects, cutaways) that breaks up a static talking head
Vertical clips: Kling 3.0 generates the 9:16 backgrounds for short-form extractions; animated waveforms and title cards are layered on top in 8frame for Reels and TikTok
Full workflow cost: $12 to $30 per episode in model credits, depending on episode length and clip count

Why podcast video matters now

YouTube is the biggest podcast discovery platform in 2026. Spotify and Apple Podcasts serve the video version first when it exists. An audio-only feed loses the recommendation surface. Most hosts don't want to film a studio every week. That's the problem this workflow solves.

The second reason is clip distribution. Finding the 3 to 4 shareable moments in a 45-minute episode, trimming them, adding captions and a waveform, and formatting for 9:16 takes an editor 2 to 3 hours. This workflow cuts it to under 45 minutes.

Higgsfield Soul 2.0 holds an avatar's face across many generated clips without drift. Wan 2.5 produces slow atmospheric b-roll that doesn't read as AI stock. Kling 3.0 turns a background image with an audio waveform into a format the algorithm reads as a native Reel.

The 4-step workflow

Step 1: Host avatar setup with Higgsfield Soul 2.0

This is the anchor of the whole production. Everything else is b-roll and framing around it.

Upload one reference photo of your host: front-facing, shoulders visible, neutral expression, even lighting. Higgsfield's identity lock anchors to this image and holds it across every clip you generate.

Prompt structure for a podcast talking head:

[Host description] seated at a minimal studio desk, speaking directly to camera. Soft key light from the left, slight fill from the right. Dark or neutral background, slight depth of field blur. Podcast microphone partially in frame lower-left. Natural head movement, genuine expression. Horizontal 16:9. No text overlays.

Tested prompt for a male host, 35-45, professional but approachable:

Man in his late 30s, short dark hair, wearing a plain dark crew-neck, seated at a minimal studio desk, speaking directly to camera with measured calm. Soft warm key light from left, cooler fill from right. Dark charcoal background, slight depth of field. Podcast microphone visible at lower-left edge. Subtle head movement and natural eye contact. 16:9. No text. 8 seconds.

Generation time: about 85 to 95 seconds per clip. The microphone detail matters: it frames the viewer's expectation and makes the talking head read as intentional video, not a disembodied face. Generate 4 to 6 variants, keep the most natural one. You don't generate the entire episode length; you cut between the avatar and b-roll throughout.
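If you generate several variants per episode, it can help to keep the prompt structure as a template and change only the parts you want to vary. Here is a minimal sketch in Python. The `HostPrompt` class and its field names are hypothetical helpers for assembling prompt text; they are not part of any Higgsfield or 8frame API.

```python
from dataclasses import dataclass


@dataclass
class HostPrompt:
    """Hypothetical helper that assembles the talking-head prompt text."""
    host: str
    lighting: str = "Soft key light from the left, slight fill from the right."
    background: str = "Dark or neutral background, slight depth of field blur."
    mic: str = "Podcast microphone partially in frame lower-left."
    seconds: int = 8

    def render(self) -> str:
        # Join the fixed and variable parts in the order used in the article.
        parts = [
            f"{self.host} seated at a minimal studio desk, speaking directly to camera.",
            self.lighting,
            self.background,
            self.mic,
            "Natural head movement, genuine expression.",
            "Horizontal 16:9. No text overlays.",
            f"{self.seconds} seconds.",
        ]
        return " ".join(parts)
```

Changing a single field, for example `mic="Podcast microphone prominent in the foreground."`, gives you a framing variant without retyping the rest of the prompt.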

Step 2: Ambient b-roll with Wan 2.5

Wan 2.5 is the right model for atmospheric clips: slow, textured motion like a coffee cup steaming on a desk, a bookshelf softly out of focus, or city light through a window at dusk. These are the visual metaphors that make a podcast episode feel produced without feeling commercial.

Cut to 3 to 5 seconds of b-roll every 45 to 90 seconds of talking head. B-roll gives the viewer's eye somewhere to go and covers jump cuts in the audio edit.
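The cadence above can be sketched as a simple schedule. This is an illustration only: the function name and the defaults (60 seconds of talking head, 4 seconds of b-roll) are assumptions picked from inside the ranges given, and a real edit would place cuts on natural pauses rather than on a fixed grid.

```python
from typing import List, Tuple


def broll_schedule(
    episode_s: float,
    talking_s: float = 60.0,  # assumed value within the 45 to 90 second range
    broll_s: float = 4.0,     # assumed value within the 3 to 5 second range
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for each b-roll cutaway."""
    cuts = []
    t = talking_s  # first cutaway comes after one stretch of talking head
    while t + broll_s <= episode_s:
        cuts.append((t, t + broll_s))
        # The next cutaway starts after the b-roll ends plus another talking stretch.
        t += broll_s + talking_s
    return cuts
```

For a 5-minute segment, `broll_schedule(300)` yields four cutaways, starting at 60, 124, 188, and 252 seconds.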

Tested Wan 2.5 prompts for podcast b-roll:

Topic-neutral atmospheric clip (works for any episode):

Close-up of a ceramic mug of coffee on a minimal wooden desk. Steam rising slowly. Shallow depth of field, bokeh background of a softly lit room. Warm tones. No movement except the steam. 5 seconds.

Result: clean, slow atmospheric clip with a satisfying depth of field. The steam motion is subtle enough to not distract. Works as a transition cut or over-the-shoulder placeholder.

Guest arrival / "starting a conversation" clip:

Two sets of hands on opposite sides of a table, one placing a phone face-down, the other adjusting a notebook. Neutral background. Soft overhead light. No faces. 4 seconds.

Result: implied dialogue without showing faces, which removes any identity consistency problem. Cuts naturally before a new segment begins.

Abstract thought clip for intellectual topics:

Slow zoom on a wall of printed documents covered in handwritten notes and colored highlights. Warm tungsten light. Slight camera drift to the right. Out of focus edges. 6 seconds.

Result: Wan 2.5 produced good texture on the paper and natural camera drift. Works over a segment where the host explains a framework or walks through a concept.

Generate 8 to 12 b-roll clips per episode. Cost per clip in Wan 2.5: approximately $0.35 to $0.55.
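To turn those two ranges into a per-episode b-roll budget, multiply the low ends together and the high ends together. A small sketch, assuming the per-clip prices quoted above hold; the function name is hypothetical.

```python
from decimal import Decimal
from typing import Tuple


def broll_cost_range(
    min_clips: int = 8,
    max_clips: int = 12,
    low_price: Decimal = Decimal("0.35"),
    high_price: Decimal = Decimal("0.55"),
) -> Tuple[Decimal, Decimal]:
    """Return the (lowest, highest) expected b-roll cost in dollars."""
    # Decimal avoids float rounding noise in currency arithmetic.
    return min_clips * low_price, max_clips * high_price
```

With the defaults this gives a range of $2.80 to $6.60 per episode for b-roll alone.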

Step 3: Animated waveform and title cards for vertical clips

This step builds the vertical (9:16) format. The core element is an animated waveform synchronized to the audio, over a still or slow-motion visual background. Add a title card at the start with the episode name and the clip hook. Add auto-captions beneath the speaker's lower third.

In 8frame, build this as a vertical canvas layer: background clip (Wan 2.5 or a static image), waveform centered in the lower third, auto-captions from your audio, episode title card overlay for the first 2 seconds.
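One way to reason about that layer stack before building it is to write it down as data. The sketch below is purely illustrative: the `Layer` class and layer names are hypothetical and do not mirror 8frame's actual canvas model. It only encodes which layers are on screen at a given moment, with the title card limited to the first 2 seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Layer:
    name: str
    start_s: float = 0.0
    end_s: Optional[float] = None  # None means "until the clip ends"

    def visible_at(self, t: float, clip_s: float) -> bool:
        end = clip_s if self.end_s is None else self.end_s
        return self.start_s <= t < end


# Bottom-to-top order, following the description above.
VERTICAL_LAYERS = [
    Layer("background"),
    Layer("waveform"),
    Layer("captions"),
    Layer("title_card", end_s=2.0),
]


def visible_layers(t: float, clip_s: float) -> List[str]:
    """Names of the layers on screen at time t of a clip lasting clip_s seconds."""
    return [layer.name for layer in VERTICAL_LAYERS if layer.visible_at(t, clip_s)]
```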

Kling 3.0 prompt for a short-form clip background:

Slow aerial drift over a foggy city at dusk, warm amber street lights below, camera moving left at 0.2x speed. No sky visible, buildings filling frame. 9:16 vertical. 8 seconds.

Result: Kling generated a clean slow drift with consistent lighting. Fog added depth and masked any hard edges that would look artificial on a tighter crop. At 9:16 the composition holds well.

Title card styling: white sans-serif, episode name and number on line one, clip hook on line two (8 words max). Keep it above the waveform so neither element competes for the same zone.
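If you produce title cards in bulk, a small check can catch hooks that run past the 8-word limit before they reach the canvas. A minimal sketch: the function name is hypothetical, and the "Name, Ep. N" format for line one is an assumption, since only the content of each line is specified above.

```python
from typing import Tuple


def title_card_lines(
    episode_name: str,
    episode_number: int,
    hook: str,
    max_hook_words: int = 8,
) -> Tuple[str, str]:
    """Return (line one, line two) for a title card, enforcing the hook length."""
    words = hook.split()
    if len(words) > max_hook_words:
        raise ValueError(
            f"Hook has {len(words)} words; keep it to {max_hook_words} or fewer."
        )
    # Line one: episode name and number. Line two: the clip hook.
    return f"{episode_name}, Ep. {episode_number}", " ".join(words)
```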

Step 4: Routing the full episode vs. vertical clips

Two deliverables from the same source.

Full episode video (YouTube, Spotify Video, Apple Podcasts): Cold open (15 to 30 seconds, a strong mid-episode moment), title card, alternating avatar and b-roll throughout, chapter markers, end card. Export 1080p 16:9, H.264, 48kHz audio.

Vertical clips (Reels, TikTok, YouTube Shorts): 45 to 90 seconds each. Hook moment first, captions throughout, waveform in lower third, soft CTA at the end ("Full episode linked in bio"). Export 1080 x 1920.
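If you ever need to re-encode an exported file outside 8frame, the same specs map onto standard ffmpeg options. The sketch below only builds the argument list. It assumes ffmpeg is installed, that the source already has the target aspect ratio (plain scaling would otherwise distort it), and that AAC audio is acceptable. The audio codec and the reuse of H.264 at 48 kHz for vertical clips are assumptions rather than settings stated above.

```python
from typing import List


def ffmpeg_export_args(src: str, dst: str, vertical: bool = False) -> List[str]:
    """Build an ffmpeg command for the 16:9 or 9:16 deliverable."""
    width, height = (1080, 1920) if vertical else (1920, 1080)
    return [
        "ffmpeg",
        "-i", src,
        "-vf", f"scale={width}:{height}",  # resize to the target frame
        "-c:v", "libx264",                 # H.264 video
        "-c:a", "aac",                     # assumed audio codec
        "-ar", "48000",                    # 48 kHz audio sample rate
        dst,
    ]
```

Pass the result to `subprocess.run(args, check=True)` to run the export.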

Full example: 60-minute episode to 5 vertical clips

Source: 60-minute founder fundraising interview. Clean stereo mix.

Select moments (30 min): Transcript review to find 5 standalone claims under 90 seconds each, no heavy context required. The five selected: a quote on why investors fund founders not ideas (52 sec), a tension moment at $11k in the bank before the term sheet (70 sec), the structural mistake in VC cold emails (85 sec), the 30-second pitch that got a first check (60 sec), one deck mistake that signals inexperience (78 sec).

Generate avatar clips (25 min): Same reference image in Higgsfield for all five sessions. Three variants per clip, one kept. Total generation: about 22 minutes. Cost: $6.40 in Higgsfield credits.

Generate b-roll (15 min): 2 to 3 Wan 2.5 clips per vertical (12 clips total). Prompts matched the topic: document walls for the email/deck advice clips, city-at-dusk for the tension clip, neutral desk shots for the personal story. Cost: $4.80.

Assemble (20 min): Vertical canvas in 8frame. Waveform layer, auto-captions from audio, title cards, export 5 clips.

Total time: 90 minutes. Total model cost: $11.20.
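The totals can be checked with a short tally, which is also a handy template for budgeting your own episodes. The numbers come from the example above; the variable and function names are hypothetical.

```python
from decimal import Decimal
from typing import Tuple

# Minutes per stage and model spend, as reported in the example.
STEP_MINUTES = {"select": 30, "avatar": 25, "broll": 15, "assemble": 20}
MODEL_COST = {"higgsfield": Decimal("6.40"), "wan": Decimal("4.80")}


def avatar_generation_minutes(
    clips: int = 5,
    variants: int = 3,
    seconds_per_run: Tuple[int, int] = (85, 95),
) -> Tuple[float, float]:
    """Low and high estimate of avatar generation time, in minutes."""
    runs = clips * variants
    return tuple(runs * s / 60 for s in seconds_per_run)


total_minutes = sum(STEP_MINUTES.values())
total_cost = sum(MODEL_COST.values())
wan_cost_per_clip = MODEL_COST["wan"] / 12  # 12 b-roll clips in the example
```

Fifteen avatar runs at 85 to 95 seconds each come to between 21.25 and 23.75 minutes, which matches the reported "about 22 minutes". The b-roll spend works out to $0.40 per clip, inside the $0.35 to $0.55 range from Step 2.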

Common pitfalls

Talking head uncanny valley. Comes from a low-res, angled, or heavily edited reference photo. Higgsfield needs a clean front-facing image. If you don't have one, generate it with Nano Banana Pro first and use that as the anchor.

Audio-video drift. You're not generating lip-sync; you're generating a host avatar cut to audio. Drift starts past 12 seconds per clip. Keep clips under 12 seconds and cut to b-roll and back.

Repetitive shot framing. Same medium shot for 60 minutes reads as static. Vary prompts across clips: some slightly wider, some with the mic more prominent, some with different background depth.
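To stay under the 12-second drift limit when planning a longer talking-head stretch, you can split it into equal clips ahead of time. A minimal sketch: the function name is hypothetical, and the 10-second default is an assumed safety margin below the 12-second ceiling.

```python
import math
from typing import List


def split_avatar_clips(stretch_s: float, max_clip_s: float = 10.0) -> List[float]:
    """Split a talking-head stretch into equal clips no longer than max_clip_s."""
    if stretch_s <= 0 or max_clip_s <= 0:
        raise ValueError("Durations must be positive.")
    # Use the smallest number of clips that keeps each one within the limit.
    count = math.ceil(stretch_s / max_clip_s)
    return [stretch_s / count] * count
```

A 45-second stretch becomes five 9-second avatar clips, each of which can use a slightly different framing prompt.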

FAQ

Can I use my own face for the host avatar?

Yes. Upload a photo of yourself as the reference image in Higgsfield Soul 2.0. The identity-locking system anchors to your face and maintains it across clips. Use a front-facing photo with even lighting and a neutral expression. If you don't have a good reference shot, generate one with Nano Banana Pro first and use that as the Higgsfield input.

What are the audio sync limits, and how long can each clip run?

This workflow generates ambient talking-head video cut to your audio, not frame-accurate lip-sync. For clips up to 10 to 12 seconds, the head movement and expression sync naturally enough that viewers read it as intentional video. Beyond 12 seconds, head motion starts to decouple from audio rhythm. Keep avatar clips under 12 seconds each. Cut to b-roll and back rather than trying to generate a 30-second unbroken talking head. The identity lock maintains face consistency across as many clips as you generate from the same reference image, so the edit reads as continuous.

Does the avatar need to match the actual host?

No. Many shows use a stylized AI avatar as the show's visual identity rather than the actual host's face. Generate a fictional avatar that fits the show's aesthetic, use it consistently across all episodes as the reference image, and it becomes the brand character. A real host photo works too, if you have rights to it.

For the talking-head and b-roll techniques that underpin this workflow, the AI UGC ad guide covers Higgsfield identity locking and Seedance b-roll generation at full depth.

Run this workflow on 8frame's canvas with your audio file and a host reference image.
