Upload & Setup
The first step of every project is uploading your audio and configuring how FilmGen should generate your video. This page covers all the options available at project setup.
Audio Upload
Drag and drop or click to select your audio file. FilmGen accepts the following formats:
| Format | Extension |
|---|---|
| MPEG Audio Layer 3 | .mp3 |
| Waveform Audio | .wav |
| MPEG-4 Audio | .m4a |
| Free Lossless Audio Codec | .flac |
Maximum file size is 50 MB. Upload progress is shown in real time with a progress bar.
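If you prepare files with a script before uploading, the format and size rules above can be checked locally first. The following is a minimal sketch, not part of FilmGen: the function name `check_audio_file` is hypothetical, and it assumes "50 MB" means 50 × 1024 × 1024 bytes, which the page does not specify.

```python
from pathlib import Path

# Extensions taken from the format table above.
ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}

# Assumption: "50 MB" is interpreted as 50 MiB. Adjust if the limit is decimal.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def check_audio_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(name).suffix.lower()
    if extension not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_FILE_SIZE_BYTES}-byte limit")
    return problems


if __name__ == "__main__":
    print(check_audio_file("demo_track.ogg", 60 * 1024 * 1024))
```

A check like this only mirrors the documented limits; the upload form remains the authority on what is accepted.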
Reference Images
Upload up to 14 reference images to guide visual consistency across your project. The image generation model uses them to maintain character, product, or brand identity across all scenes.
Use reference images when you need:
- A consistent appearance for recurring talent or characters
- Product shots for advertising or branded content
- A specific visual world or brand identity
- Style guidance beyond what text prompts can convey
Video Model
Choose which AI video provider generates your scene clips. Each provider has different strengths, duration options, and visual characteristics. See the Video Providers page for a detailed comparison.
- Sora 2: clip lengths of 4s, 8s, or 12s. Strong at cinematic motion and human subjects.
- Veo 3.1: clip lengths of 4s, 6s, or 8s. Excellent at natural scenes and landscapes.
- Runway Gen4.5: clip lengths of 5s or 10s. Great for stylized and abstract visuals.
- Luma Ray-2: clip lengths of 5s or 10s. Strong at dynamic camera movement.
- Grok Imagine: text-to-video with a distinctive visual style.
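When comparing providers, it can help to see the clip lengths listed above as a lookup table. The sketch below is illustrative only: the helper `shortest_clip_covering` is hypothetical, and the idea of picking the shortest clip that covers a scene is an assumption about how you might plan, not a description of how FilmGen assigns clips. Grok Imagine is left out because this page lists no clip lengths for it.

```python
from typing import Optional

# Clip lengths in seconds, copied from the provider list above.
CLIP_LENGTHS = {
    "Sora 2": (4, 8, 12),
    "Veo 3.1": (4, 6, 8),
    "Runway Gen4.5": (5, 10),
    "Luma Ray-2": (5, 10),
}


def shortest_clip_covering(provider: str, scene_seconds: float) -> Optional[int]:
    """Return the shortest listed clip length that is at least scene_seconds long.

    Returns None when the scene is longer than every clip the provider offers.
    Raises KeyError for a provider with no listed clip lengths.
    """
    for length in sorted(CLIP_LENGTHS[provider]):
        if length >= scene_seconds:
            return length
    return None


if __name__ == "__main__":
    for name in CLIP_LENGTHS:
        print(name, shortest_clip_covering(name, 7.5))
```

A `None` result signals that a scene of that length would need to be split or shortened for the chosen provider.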
Generation Mode
The generation mode shapes how the AI plans your video. See Generation Modes for a full breakdown.
Visual Style
Choose from 8 visual styles that influence the art direction across all scenes. See Visual Styles for examples and descriptions.
Planner Model
The planner model is the AI used for creative reasoning and planning. See Planner Models for selection guidance.
Script Input
Optionally paste lyrics or a script. When you provide one, the platform runs speech alignment to map each word to a timestamp in the audio, which enables lip-sync-aware scene generation in performance mode. This is especially useful when you want generated scenes to match specific lyrics or dialogue.
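To make the idea of mapping words to timestamps concrete, here is a minimal sketch of what aligned output could look like and how it might be queried per scene. The `WordTiming` type, the `words_in_window` helper, and the sample timings are all hypothetical; FilmGen's actual alignment format is not documented on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordTiming:
    """One aligned word with its start and end time in seconds (assumed shape)."""
    word: str
    start: float
    end: float


def words_in_window(timings: list[WordTiming], start: float, end: float) -> list[str]:
    """Return the words whose time span overlaps the window [start, end)."""
    return [t.word for t in timings if t.start < end and t.end > start]


if __name__ == "__main__":
    # Made-up timings for illustration, not real alignment output.
    aligned = [
        WordTiming("hold", 0.2, 0.6),
        WordTiming("the", 0.6, 0.8),
        WordTiming("line", 0.8, 1.5),
        WordTiming("tonight", 4.3, 5.1),
    ]
    print(words_in_window(aligned, 0.0, 4.0))
    print(words_in_window(aligned, 4.0, 8.0))
```

With timings in this shape, each scene's time range can be matched to the words sung or spoken during it.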
