Audio Analysis
Before any visual generation begins, FilmGen analyzes your track's structure, rhythm, mood, and lyrics. The results drive downstream decisions such as scene pacing, transitions, color palette, and visual style.
Music Analysis
The analysis output includes:
Rhythm & Tempo
BPM detection, time signature, and rhythmic patterns that drive scene pacing and transitions.
Key & Harmony
Musical key, harmonic progressions, and tonal mood that influence color palette and atmosphere.
Genre & Mood
Genre classification and emotional mood mapping that shape visual style and tone.
Structure
Section detection (intro, verse, chorus, bridge, outro) used to plan scene boundaries and pacing.
Lyrics
Lyric extraction and transcription for text-aware scene generation and overlay planning.
Visual Themes
Suggested visual themes and motifs derived from the music's emotional and structural content.
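As an illustration of how tempo and structure detection can work together, the sketch below snaps detected section boundaries to the nearest bar line so that scene boundaries land on downbeats. It is a minimal sketch under stated assumptions, not FilmGen's implementation: it assumes a constant tempo, a known time signature, and a known time for the first downbeat. The names Section, bar_length, snap_to_grid, and snap_sections are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Section:
    label: str    # e.g. "intro", "verse", "chorus" (hypothetical structure output)
    start: float  # seconds from track start
    end: float    # seconds from track start


def bar_length(bpm: float, beats_per_bar: int = 4) -> float:
    """Duration of one bar in seconds, assuming a constant tempo."""
    return 60.0 / bpm * beats_per_bar


def snap_to_grid(t: float, grid: float, offset: float = 0.0) -> float:
    """Round timestamp t to the nearest grid line at offset + k * grid."""
    return offset + round((t - offset) / grid) * grid


def snap_sections(sections, bpm, beats_per_bar=4, first_downbeat=0.0):
    """Move each section boundary to the nearest bar line.

    Note: a section shorter than about half a bar can collapse to zero length.
    """
    bar = bar_length(bpm, beats_per_bar)
    return [
        Section(
            s.label,
            snap_to_grid(s.start, bar, first_downbeat),
            snap_to_grid(s.end, bar, first_downbeat),
        )
        for s in sections
    ]


if __name__ == "__main__":
    detected = [
        Section("intro", 0.0, 7.9),
        Section("verse", 7.9, 24.3),
        Section("chorus", 24.3, 40.1),
    ]
    for s in snap_sections(detected, bpm=120):
        print(s)
```

At 120 BPM in 4/4 a bar lasts 2 seconds, so a verse detected at 7.9 to 24.3 s would be snapped to 8.0 to 24.0 s.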
Speech Alignment
When a script or lyrics are provided, FilmGen can align each word with its position in the audio, producing word-level timestamps.
Speech alignment enables:
- Lip-sync aware video generation in performance mode
- Precise lyric-to-scene timing
- Dialogue-aware scene windowing
- Per-scene delivery style analysis (tempo, intensity, emotion)
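To make the idea of word-level timestamps concrete, the sketch below assumes the alignment output is a list of words with start and end times in seconds. It shows two uses from the list above: collecting the lyrics that fall inside one scene window, and moving a proposed scene cut out of the middle of a word. The data shape and the names AlignedWord, words_in_window, and nudge_cut are hypothetical; FilmGen's actual alignment format is not documented here.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    text: str
    start: float  # seconds from track start
    end: float    # seconds from track start


def words_in_window(words, win_start, win_end):
    """Words whose midpoint falls in [win_start, win_end), e.g. the lyrics for one scene."""
    return [w for w in words if win_start <= (w.start + w.end) / 2 < win_end]


def nudge_cut(words, cut):
    """If a proposed scene cut lands inside a word, move it to the nearer edge of that word."""
    for w in words:
        if w.start < cut < w.end:
            return w.start if cut - w.start < w.end - cut else w.end
    return cut  # the cut already falls in a gap between words


if __name__ == "__main__":
    words = [
        AlignedWord("hold", 1.0, 1.4),
        AlignedWord("on", 1.5, 1.9),
        AlignedWord("tight", 2.3, 2.9),
    ]
    print([w.text for w in words_in_window(words, 0.0, 2.0)])
    print(nudge_cut(words, 2.4))
```

Assigning each word by its midpoint is one simple design choice; it keeps every word in exactly one scene even when a window boundary touches it.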
Performance Analysis
For performance-mode projects that use speech alignment, the platform performs additional vocal performance analysis:
Delivery Style
Detects whether each segment is sung, spoken, whispered, or shouted to guide motion and performance direction.
Intensity Mapping
Measures vocal intensity across the track to drive camera distance, movement speed, and visual energy.
Phrase Shape
Analyzes phrase contours and phrasing patterns to align visual transitions with natural speech rhythms.
Section Roles
Maps each vocal section to its narrative role (hook, verse, ad-lib) to plan visual emphasis.
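As a rough illustration of intensity mapping, the sketch below uses per-window RMS level as a stand-in for vocal intensity, normalizes it across the track, and maps it to a shot size and a camera movement speed. RMS as the intensity measure, the 0.33/0.66 thresholds, the shot names, and the speed range are all assumptions for illustration; the functions rms_per_window, normalize, shot_for_intensity, and movement_speed are hypothetical and are not FilmGen's method.

```python
import math


def rms_per_window(samples, sample_rate, window_s):
    """Root-mean-square level of each consecutive window of a mono signal."""
    n = max(1, int(sample_rate * window_s))
    levels = []
    for i in range(0, len(samples), n):
        chunk = samples[i:i + n]
        levels.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    return levels


def normalize(values):
    """Rescale values to 0..1 relative to the quietest and loudest window."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def shot_for_intensity(level):
    """Hypothetical rule: more intense delivery -> tighter framing."""
    if level >= 0.66:
        return "close-up"
    if level >= 0.33:
        return "medium"
    return "wide"


def movement_speed(level, slow=0.2, fast=1.0):
    """Hypothetical rule: interpolate camera speed linearly between slow and fast."""
    return slow + (fast - slow) * level


if __name__ == "__main__":
    sr = 4  # toy sample rate so the example stays readable
    quiet, mid, loud = [0.1, -0.1] * 2, [0.5, -0.5] * 2, [1.0, -1.0] * 2
    levels = normalize(rms_per_window(quiet + mid + loud, sr, window_s=1.0))
    for lvl in levels:
        print(round(lvl, 2), shot_for_intensity(lvl), round(movement_speed(lvl), 2))
```

Normalizing per track means the framing responds to relative changes in delivery rather than to how loudly the track was mastered.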
Scene Windowing
After analysis, the platform calculates scene windows: the timing boundaries for each scene, based on track duration, section structure, vocal timing, and the video provider's duration buckets. Each scene gets a planned duration that fits the provider's available clip lengths (e.g., 4s, 6s, or 8s for Veo).
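One way to fit a section into a provider's duration buckets is to treat it as a small coin-change problem: choose clip lengths whose total is as close as possible to the section length, preferring fewer clips on ties. The sketch below does this with dynamic programming over whole seconds. This is an assumed technique for illustration, not FilmGen's documented algorithm; the function name fit_to_buckets is hypothetical, and the default (4, 6, 8) buckets are the Veo example given above.

```python
def fit_to_buckets(duration, buckets=(4, 6, 8)):
    """Choose clip lengths from `buckets` whose total is as close as possible
    to `duration` (seconds), preferring fewer clips when totals tie.

    Assumes positive integer bucket lengths and a positive duration.
    """
    limit = round(duration) + max(buckets)
    # best[s]: shortest list of buckets summing exactly to s, or None if unreachable
    best = [None] * (limit + 1)
    best[0] = []
    for s in range(1, limit + 1):
        for b in buckets:
            if b <= s and best[s - b] is not None:
                candidate = best[s - b] + [b]
                if best[s] is None or len(candidate) < len(best[s]):
                    best[s] = candidate
    reachable = [s for s in range(1, limit + 1) if best[s] is not None]
    # Closest total first, then fewest clips; remaining ties go to the shorter total.
    total = min(reachable, key=lambda s: (abs(s - duration), len(best[s])))
    return sorted(best[total], reverse=True)


if __name__ == "__main__":
    # Hypothetical section lengths (seconds) taken from a structure analysis
    for section_len in (14.0, 15.2, 20.0):
        print(section_len, fit_to_buckets(section_len))
```

With 4, 6, and 8 second buckets, a 14 s section fits exactly as 8 + 6, while a 15.2 s section becomes 8 + 8 = 16 s. How any leftover difference is absorbed (trimming, stretching, or shifting neighboring windows) is outside the scope of this sketch.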
