trend·8 min read·June 3, 2026

The Next 12 Months in AI Image and Video Generation

Six honest predictions for the future of AI generation through mid-2027, with confidence levels, what to prepare for now, and what to ignore.

The future of AI generation through mid-2027 is mostly predictable if you ignore the hype and look at what the labs are actually shipping. The capability gaps are closing. The pricing floor is dropping. A few specific things are going to change how you work, and a few things currently generating headlines will turn out to matter far less than the coverage suggests. Here's our read.

TL;DR

Long-form video crossing the 60-second mark and native multi-character control are the two changes most likely to restructure real workflows in the next year.
Real-time generation and on-device models are real but will matter most to developers, not everyday creators.
Attempts at copyright resolution are coming, but they won't settle the question cleanly, and the headlines around them will oversell how much gets resolved.
Integrated multimodal is already happening. If your tooling separates image, video, and text as distinct steps, that's the thing to fix now.

The 6 predictions

1. Long-form video crosses 60 seconds (confidence: high)

Right now, most models in active production top out at 10 to 30 seconds per generation. Kling 3.0 is the outlier at 3 minutes, but its output is still a single uninterrupted clip. The problem is that a 30-second sequence with a cut every 3 seconds requires assembling 10 separate generations. Consistency across those cuts is the hard part, not clip length per se.
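The arithmetic behind that 10-generation figure can be sketched in a few lines. This is a minimal illustration, assuming every shot between cuts is its own generation; the function names are ours and not part of any model's API.

```python
import math


def generations_needed(total_seconds: float, seconds_per_shot: float) -> int:
    """Count the separate generations when each shot is generated on its own."""
    if total_seconds <= 0 or seconds_per_shot <= 0:
        raise ValueError("durations must be positive")
    return math.ceil(total_seconds / seconds_per_shot)


def joins_to_check(total_seconds: float, seconds_per_shot: float) -> int:
    """Count the cut points where character or lighting drift can show up."""
    return generations_needed(total_seconds, seconds_per_shot) - 1


print(generations_needed(30, 3))  # 10
print(joins_to_check(30, 3))      # 9
```

The second number is the one that hurts in practice: every join is a place where a character's appearance can drift, so nine joins means nine consistency checks for a single 30-second sequence.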

The labs are solving this from two directions. Scene-level memory (holding character identity and lighting state across calls) is already in early access at two labs as of Q2 2026. And the inference cost per second of video has dropped roughly 60% since Q1 2025. At the current rate, 90-second to 3-minute coherent clips from a single generation call are achievable by Q1 2027.

What this means practically: the workflow chains you're using today to stitch clips together will become less necessary. The stitch step won't disappear entirely because editors will still want cut control, but the error rate at joins will drop and "temporal drift" in character appearance will become less of a daily friction point.

2. Native multi-character control becomes standard (confidence: high)

Seedance 2.0 introduced multi-reference conditioning, and it's the most workflow-relevant feature shipped in the last year. You upload a reference for each character, the model holds their appearance across motion, and you can direct both without one drifting. We've been running this in production since March and the consistency is good enough that teams are cutting their revision loops in half on character-driven work.

Every other major lab will have a comparable feature within 12 months. Higgsfield Soul 2.0 already leads on single-character fidelity. The gap to fill is two or more characters in the same scene without degradation on one while the other is in focus.

When this lands across the board, it removes one of the last categories of work where AI video still needs heavy post-production cleanup.

3. Real-time generation arrives, for developers first (confidence: medium-high)

Real-time generation means you get output frames as fast as you're feeding inputs, not minutes after a batch submission. Several labs are already at sub-10-second generation for 5-second clips in limited beta. By end of 2026, sub-5-second is probable for the lighter models.

The honest caveat: real-time generation in the product sense (where a creator fires prompts and sees results immediately) requires compute costs that won't hit consumer pricing for most of 2027. What you'll see first is real-time in API-accessible developer environments, where teams building interactive experiences or live-generation tools will be the early users. For creators working with batch generation today, the change will show up as faster queue times and iteration loops, not live feedback.

If you're building a product on top of AI generation, this is worth tracking closely. If you're a creator, faster batches matter more than real-time in 2026.

4. On-device generation becomes practical for images, not yet for video (confidence: medium)

Wan 2.5 runs locally if you have the hardware. The model weights are open, the output is usable, and teams with M-series Macs or high-VRAM workstations are already running it locally as a fallback with no per-generation fee for low-stakes generations. On 8frame, we offer it as the cheapest paid tier for the same low-stakes use.

Image generation on-device is further along. By Q4 2026, you should be able to run a capable image model locally on a mid-range laptop without a dedicated GPU. The output will lag cloud models on quality, but it'll be fast and private, which is what specific enterprise customers care about.

Video on-device at production quality is a 2028 problem. The model sizes and VRAM requirements don't fit the trajectory for mobile or even consumer laptops within the next 12 months. Anyone telling you otherwise is counting on a hardware curve that hasn't happened yet.

5. Copyright resolution attempts will happen and won't stick cleanly (confidence: medium)

The EU AI Act's training data disclosure requirements kick in for large providers in early 2027. US legislation is further behind but moving. Several labs have pre-emptively started offering "clean data" or "licensed content only" model variants, usually at a price premium.

The honest position: the legal landscape will not be resolved cleanly in 12 months. What you'll see is an expansion of licensed-content model variants, more enterprise contracts that include indemnification language, and continued ambiguity about outputs from models trained on mixed data. If you're shipping AI-generated content commercially, get familiar with the indemnification terms your current model provider offers. The labs that offer output indemnification (essentially: they take the legal exposure if their model is found to have infringed) are going to pick up enterprise customers who can't operate in legal gray zones.

For individual creators and smaller agencies, the practical advice is unchanged: use the commercial-licensed output tiers, keep your prompts, and don't claim AI-generated work is wholly original when it's not.

6. Integrated multimodal becomes the default, not a premium feature (confidence: very high)

This one is already happening. Veo 3.1 takes text, image, and audio inputs. Seedance 2.0 chains still images into motion. The next step is a single model call that takes your brief, your reference images, a voice line, and background audio, and returns a composed scene.

On 8frame, we see this playing out in how teams build workflows. Eighteen months ago, a typical workflow was: generate image, export, import to video model. Today, the canvas chains all of those in a single workflow, and the handoff between steps is invisible to the user. By mid-2027, the expectation will be that any capable AI generation tool takes mixed media inputs and returns a composed output in one call. Tools that still treat image, video, and audio generation as completely separate workflows will feel dated.

The practical implication now: if your team has a workflow that manually bridges between three different tools, that's where you should focus. The integration work you do now to chain models in a single canvas will pay off whether the composited model arrives in 12 months or 24.

What to prepare for now

The two changes worth acting on before they arrive are long-form coherence and multi-character control.

For long-form coherence, the move is to start building your clips with consistent reference images and locked character states now, even when that's more work than it needs to be. Teams that already have disciplined reference management will be the ones who scale up cleanly when single-call long-form generation lands. Teams that have been improvising on consistency will have a backlog of re-shoots.
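One way to make reference discipline concrete is to keep a small manifest of which reference images each clip was generated from. The sketch below is an assumption about what such a record could look like, not an 8frame feature or file format; the field names, the `ClipRecord` class, and the helper functions are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


def fingerprint(image_bytes: bytes) -> str:
    """Identify a reference image by content, so renamed files still match."""
    return hashlib.sha256(image_bytes).hexdigest()


@dataclass
class ClipRecord:
    """Hypothetical record of what was used to generate one clip."""
    clip_id: str
    model: str
    prompt: str
    # character name -> fingerprint of the reference image used for that clip
    character_refs: Dict[str, str] = field(default_factory=dict)
    lighting_ref: Optional[str] = None


def drifted_clips(records: List[ClipRecord], locked: Dict[str, str]) -> List[str]:
    """Return ids of clips that used a reference other than the locked one."""
    drifted = []
    for record in records:
        for name, digest in record.character_refs.items():
            if locked.get(name) != digest:
                drifted.append(record.clip_id)
                break
    return drifted


def to_json(records: List[ClipRecord]) -> str:
    """Serialize the manifest so it can live next to the project files."""
    return json.dumps([asdict(r) for r in records], indent=2)
```

With a manifest like this, the question "which clips need regenerating" becomes a lookup instead of a visual review: lock one fingerprint per character, then list every clip that used something else.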

For multi-character work, the practical step is to test Seedance 2.0's multi-reference conditioning on current projects. We ran a two-character product scene in May with references for both people, and the consistency across 8 clips was good enough to use without any character-specific retouching. That's a new bar. It's worth understanding what your current workflow produces before the feature is standard everywhere and you're trying to catch up.

The workflow template for multi-reference character work is available at 8frame workflows. Clone it, swap in your references, and run a 5-clip test. You'll see the consistency pattern quickly.

What to ignore as hype right now

"AGI-level creative direction." Every lab has announced some version of a model that takes a high-level creative brief and handles all production decisions. None of them are there. What you actually get is a slightly better prompt-following model with some default choices baked in. The creative direction is still yours. This will matter eventually; it doesn't matter for production decisions in the next 12 months.

Text-in-video as a solved problem. Multiple models have announced improvements to in-frame text rendering. Some have genuinely improved. None of them are reliable enough to generate branded content with text that you wouldn't re-render in post. If text accuracy matters for your deliverable, don't change your workflow based on claims; test it on your specific brief.

Real-time on mobile. Consumer mobile real-time generation is real in limited demo environments. It is not production-grade at the resolutions that matter for content work. Any tool claiming production-quality real-time mobile generation in 2026 is either running on compressed output or pulling from cloud inference and calling it on-device.

The "one model to rule them all" narrative. You'll continue to see labs announce their model as the best at everything. Our test data across 16 models shows this is not how it works. The honest position, which we covered in depth in the best AI video generator 2026 comparison, is that routing to the right model for each brief consistently beats committing to one model.
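Routing per brief can be as simple as a short rule table. The sketch below is illustrative only: the `Brief` fields, the rule order, and the 30-second threshold are our assumptions for the example, loosely following the strengths mentioned earlier in this article. It is not our test data and not a recommendation engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """Hypothetical description of one generation request."""
    characters: int = 1
    duration_seconds: float = 5.0
    has_audio_input: bool = False
    low_stakes: bool = False


def route(brief: Brief) -> str:
    """Pick a model name for a brief; the first matching rule wins."""
    if brief.low_stakes:
        return "Wan 2.5"              # open weights, cheapest option
    if brief.characters >= 2:
        return "Seedance 2.0"         # multi-reference conditioning
    if brief.duration_seconds > 30:
        return "Kling 3.0"            # longest single clip mentioned above
    if brief.has_audio_input:
        return "Veo 3.1"              # accepts text, image, and audio inputs
    return "Higgsfield Soul 2.0"      # single-character fidelity


print(route(Brief(characters=2)))          # Seedance 2.0
print(route(Brief(duration_seconds=90)))   # Kling 3.0
```

The point of writing the rules down is less the code than the habit: once routing is explicit, you can rerun the same briefs when a new model ships and change one rule instead of relearning one tool.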

FAQ

When will AI video models support clips longer than 60 seconds?

Scene-level memory features are in early access at multiple labs as of Q2 2026. By Q1 2027, 90-second to 3-minute coherent single-call generation is probable on at least one major model. Production-level multi-minute generation with consistent character appearance across cuts is more likely a mid-2027 feature.

Will on-device AI image and video generation replace cloud models?

On-device image generation at usable quality is realistic for mid-range laptops by Q4 2026. On-device video generation at production quality is not happening in 12 months. The compute requirements don't fit the device trajectory. Cloud generation will remain the default for video work through at least 2027.

How should I prepare my workflow for the changes coming in the next year?

Start building disciplined reference management now. Lock character references, lock scene lighting references, and document what you used to generate what. When long-form and multi-character control land, teams with clean reference libraries will scale up. Teams that have been generating ad-hoc will have to retrofit consistency they didn't build in. Also: consolidate to a multi-model canvas if you haven't. The 8frame canvas lets you chain models in one place, which is where the workflow is heading regardless of which specific models win.

Ready to test multi-model workflows before the landscape shifts? Browse the 8frame workflow library and run your next brief across multiple models in one session.
