Audio Analysis

Before any visual generation begins, FilmGen performs deep audio analysis to understand your track's structure, rhythm, mood, and lyrics. This analysis drives every downstream decision.

Music Analysis

The analysis produces the following outputs:

Rhythm & Tempo

BPM detection, time signature, and rhythmic patterns that drive scene pacing and transitions.
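To illustrate how tempo can translate into pacing, here is a minimal sketch of the underlying arithmetic: one beat lasts 60 / BPM seconds, and one bar lasts that times the number of beats per bar. The function name bar_starts and its parameters are hypothetical and are not part of FilmGen's API; the sketch also assumes a constant tempo for the whole track.

```python
def bar_starts(bpm: float, beats_per_bar: int, duration_s: float) -> list[float]:
    """Return the start time, in seconds, of every bar that begins before duration_s.

    Hypothetical helper for illustration; assumes a constant tempo.
    """
    if bpm <= 0 or beats_per_bar <= 0:
        raise ValueError("bpm and beats_per_bar must be positive")

    bar_length_s = beats_per_bar * 60.0 / bpm  # seconds per bar
    starts = []
    index = 0
    while index * bar_length_s < duration_s:
        starts.append(round(index * bar_length_s, 3))
        index += 1
    return starts


# A 120 BPM track in 4/4 has 2-second bars.
print(bar_starts(120, 4, 8.0))
```

Bar boundaries like these are natural candidates for cuts and transitions, since a change that lands on a downbeat tends to feel in time with the music.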

Key & Harmony

Musical key, harmonic progressions, and tonal mood that influence color palette and atmosphere.

Genre & Mood

Genre classification and emotional mood mapping that shape visual style and tone.

Structure

Section detection (intro, verse, chorus, bridge, outro) used to plan scene boundaries and pacing.

Lyrics

Lyric extraction and transcription for text-aware scene generation and overlay planning.

Visual Themes

Suggested visual themes and motifs derived from the music's emotional and structural content.

Speech Alignment

When a script or lyrics are provided, FilmGen can align each word to its position in the audio, producing word-level timestamps.

Speech alignment enables:

Lip-sync aware video generation in performance mode
Precise lyric-to-scene timing
Dialogue-aware scene windowing
Per-scene delivery style analysis (tempo, intensity, emotion)
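
As a rough illustration of how word-level timestamps support lyric-to-scene timing, the sketch below assigns aligned words to a scene window by the midpoint of each word. The AlignedWord type, its field names, and the midpoint rule are assumptions made for this example; FilmGen's actual alignment output format is not described here.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """One word with its position in the audio (hypothetical shape)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def words_in_window(words: list[AlignedWord], window_start: float,
                    window_end: float) -> list[AlignedWord]:
    """Return the words whose midpoint falls in [window_start, window_end)."""
    return [
        w for w in words
        if window_start <= (w.start + w.end) / 2 < window_end
    ]


words = [
    AlignedWord("hello", 0.0, 0.5),
    AlignedWord("world", 0.6, 1.2),
    AlignedWord("again", 3.9, 4.3),  # straddles the 4-second boundary
]

first_scene = words_in_window(words, 0.0, 4.0)
second_scene = words_in_window(words, 4.0, 8.0)
print([w.text for w in first_scene], [w.text for w in second_scene])
```

Using the midpoint means a word that straddles a boundary is assigned to exactly one scene rather than being duplicated or dropped.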

Performance Analysis

For performance-mode projects with speech alignment, the platform performs additional vocal performance analysis:

Delivery Style

Detects whether each segment is sung, spoken, whispered, or shouted to guide motion and performance direction.

Intensity Mapping

Measures vocal intensity across the track to drive camera distance, movement speed, and visual energy.
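
One simple way to picture intensity mapping is to measure the RMS level of each audio frame, normalize it against the loudest frame, and bucket the result into a camera distance. The sketch below does exactly that. The thresholds, the labels, and the choice that louder passages map to closer framing are all assumptions for illustration, not FilmGen's documented behavior.

```python
import math


def rms(samples: list[float]) -> float:
    """Root-mean-square level of one frame of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def camera_distance(intensity: float) -> str:
    """Map a normalized intensity in [0, 1] to a framing label.

    Thresholds and labels are hypothetical.
    """
    if intensity >= 0.66:
        return "close-up"
    if intensity >= 0.33:
        return "medium"
    return "wide"


def map_intensity(frames: list[list[float]]) -> list[str]:
    """Return one framing label per frame, relative to the loudest frame."""
    levels = [rms(frame) for frame in frames]
    peak = max(levels, default=0.0)
    if peak == 0.0:
        return ["wide"] * len(levels)  # silence: fall back to the widest framing
    return [camera_distance(level / peak) for level in levels]


frames = [[0.1, -0.1], [0.5, -0.5], [1.0, -1.0]]
print(map_intensity(frames))
```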

Phrase Shape

Analyzes phrase contours and phrasing patterns to align visual transitions with natural speech rhythms.

Section Roles

Maps each vocal section to its narrative role (hook, verse, ad-lib) to plan visual emphasis.

Scene Windowing

After analysis, the platform calculates scene windows: the timing boundaries for each scene, based on track duration, section structure, vocal timing, and the video provider's duration buckets. Each scene gets a planned duration that fits the provider's available clip lengths (e.g., 4s, 6s, or 8s for Veo).
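
The idea of fitting scenes to duration buckets can be sketched with a simple greedy split: for a section of a given length, keep taking the longest clip length that still fits. This is only an illustration under stated assumptions. The function name plan_section is hypothetical, the greedy strategy is a stand-in for whatever planner FilmGen actually uses, and how the platform handles time left over at the end of a section is not described here.

```python
def plan_section(duration_s: float,
                 buckets: tuple[float, ...] = (4, 6, 8)) -> tuple[list[float], float]:
    """Split a section into clip durations drawn from the provider's buckets.

    Returns (clip_durations, leftover_seconds). Greedy and hypothetical:
    it always takes the longest bucket that fits in the remaining time.
    """
    if not buckets or min(buckets) <= 0:
        raise ValueError("buckets must contain positive durations")

    smallest = min(buckets)
    remaining = duration_s
    clips = []
    while remaining >= smallest:
        pick = max(b for b in buckets if b <= remaining)
        clips.append(pick)
        remaining -= pick
    return clips, remaining


# A 22-second chorus fits exactly into three clips.
print(plan_section(22))
# An 11-second verse leaves 3 seconds uncovered under the greedy rule.
print(plan_section(11))
```

The greedy rule is easy to follow but not optimal: for the 11-second case, 4s + 6s would leave only 1 second uncovered. A real planner would likely also weigh section boundaries and vocal timing, as described above.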