Upload & Setup

The first step of every project is uploading your audio and configuring how FilmGen should generate your video. This page covers all the options available at project setup.

Audio Upload

Drag and drop or click to select your audio file. FilmGen accepts the following formats:

Format	Extension
MPEG Audio Layer 3	.mp3
Waveform Audio	.wav
MPEG-4 Audio	.m4a
Free Lossless Audio Codec	.flac

The maximum file size is 50 MB. A progress bar shows upload progress in real time.
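The format and size rules above can be checked before an upload is attempted. The following is a minimal sketch, not FilmGen's own validation: the function name `check_audio_file` is hypothetical, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Extensions taken from the format table above.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}
# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 50 * 1024 * 1024


def check_audio_file(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()  # case-insensitive match
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file exceeds the 50 MB limit")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report a wrong format and an oversized file together.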

Reference Images

Upload up to 14 reference images to guide visual consistency across your project. Reference images are used by the image generation model to maintain character, product, or brand identity across all scenes.

Use reference images when you need:

Recurring talent or character appearance consistency
Product shots for advertising or branded content
A specific visual world or brand identity
Style guidance beyond what text prompts can convey
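If you prepare reference images in a script, the 14-image limit can be enforced before upload. This is a small sketch under that one stated rule; the helper name `split_reference_images` is hypothetical and not part of FilmGen.

```python
MAX_REFERENCE_IMAGES = 14  # limit stated in this section


def split_reference_images(paths: list[str]) -> tuple[list[str], list[str]]:
    """Split paths into (accepted, overflow) so that at most 14 are accepted."""
    return paths[:MAX_REFERENCE_IMAGES], paths[MAX_REFERENCE_IMAGES:]
```

Keeping the overflow list makes it easy to tell the user which images were left out instead of dropping them silently.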

Video Model

Choose which AI video provider generates your scene clips. Each provider has different strengths, duration options, and visual characteristics. See the Video Providers page for a detailed comparison.

Sora 2

Clip lengths: 4s, 8s, 12s. Strong at cinematic motion and human subjects.

Veo 3.1

Clip lengths: 4s, 6s, 8s. Excellent at natural scenes and landscapes.

Runway Gen4.5

Clip lengths: 5s, 10s. Great for stylized and abstract visuals.

Luma Ray-2

Clip lengths: 5s, 10s. Strong at dynamic camera movement.

Grok Imagine

Text-to-video with a distinctive visual style.
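The clip lengths listed for each provider can be kept in a small lookup table, which helps when deciding which provider fits a target scene duration. This is an illustrative sketch: the dictionary and both function names are hypothetical, and Grok Imagine is left out because no clip lengths are listed for it here.

```python
# Clip lengths in seconds, copied from the provider descriptions above.
# Grok Imagine is omitted: this page lists no clip lengths for it.
CLIP_LENGTHS = {
    "Sora 2": (4, 8, 12),
    "Veo 3.1": (4, 6, 8),
    "Runway Gen4.5": (5, 10),
    "Luma Ray-2": (5, 10),
}


def providers_supporting(seconds: int) -> list[str]:
    """Return the providers that offer exactly this clip length."""
    return [name for name, lengths in CLIP_LENGTHS.items() if seconds in lengths]


def closest_clip_length(provider: str, target_seconds: float) -> int:
    """Return the provider's listed clip length nearest to the target."""
    return min(CLIP_LENGTHS[provider], key=lambda s: abs(s - target_seconds))
```

For example, `providers_supporting(8)` returns `["Sora 2", "Veo 3.1"]`, and `closest_clip_length("Sora 2", 9)` returns `8`.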

Generation Mode

The generation mode shapes how the AI plans your video. See Generation Modes for a full breakdown.

performance
story
advertising
anime
short_film
documentary

Visual Style

Choose from 8 visual styles that influence the art direction across all scenes. See Visual Styles for examples and descriptions.

cinematic
anime
abstract
noir
neon
vintage
surreal
minimal
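The six generation modes and eight visual styles are fixed sets of identifiers, so a settings object can reject typos early. The sketch below assumes nothing about FilmGen's real API: the class `ProjectSettings` and its field names are hypothetical. Note that `anime` is both a generation mode and a visual style, so the two fields are validated independently.

```python
from dataclasses import dataclass

# Identifiers as listed on this page.
GENERATION_MODES = (
    "performance", "story", "advertising", "anime", "short_film", "documentary",
)
VISUAL_STYLES = (
    "cinematic", "anime", "abstract", "noir", "neon", "vintage", "surreal", "minimal",
)


@dataclass(frozen=True)
class ProjectSettings:
    """Hypothetical container for two of the setup choices."""
    generation_mode: str
    visual_style: str

    def __post_init__(self) -> None:
        # Validate each field against its own list of allowed values.
        if self.generation_mode not in GENERATION_MODES:
            raise ValueError(f"unknown generation mode: {self.generation_mode!r}")
        if self.visual_style not in VISUAL_STYLES:
            raise ValueError(f"unknown visual style: {self.visual_style!r}")
```

With this sketch, `ProjectSettings("performance", "noir")` is accepted, while `ProjectSettings("short film", "noir")` raises `ValueError` because the identifier uses an underscore.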

Planner Model

The planner model is the AI used for creative reasoning and planning. See Planner Models for selection guidance.

Script Input

Optionally paste lyrics or a script. When you provide one, the platform runs speech alignment to map each word to a timestamp, which enables lip-sync-aware scene generation in performance mode. This is especially useful when you want generated scenes to match specific lyrics or dialogue.
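To make "mapping words to timestamps" concrete, the sketch below shows one way aligned words could be represented and grouped by scene. It is only an illustration: the `AlignedWord` type, the `words_in_scene` helper, and the sample words and times are all made up, and FilmGen's actual alignment output may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedWord:
    """Hypothetical record for one word placed on the audio timeline."""
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def words_in_scene(words: list[AlignedWord], scene_start: float, scene_end: float) -> list[str]:
    """Return the words whose start time falls inside [scene_start, scene_end)."""
    return [w.text for w in words if scene_start <= w.start < scene_end]


# Made-up sample alignment, for illustration only.
alignment = [
    AlignedWord("city", 0.40, 0.85),
    AlignedWord("lights", 0.90, 1.50),
    AlignedWord("fading", 4.10, 4.70),
    AlignedWord("out", 4.75, 5.20),
]
```

With two 4-second scenes, `words_in_scene(alignment, 0.0, 4.0)` returns `["city", "lights"]` and `words_in_scene(alignment, 4.0, 8.0)` returns `["fading", "out"]`, which is the kind of grouping a lip-sync-aware scene would need.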