Movie Gen: Meta's New Video Generative AI Model

Earlier this year, OpenAI’s Sora showcased the potential of video generation models, sparking a wave of releases from other companies. On October 4th, Meta finally introduced its own text-to-video model, and impressively, it also includes the ability to generate audio. Let’s dive into Meta’s newly released foundational model, Movie Gen. 🍿

Meta's Generative Models

Movie Gen isn’t Meta’s first foray into image and video generation models. Meta has already introduced two foundational models. Let’s briefly review their features. The first is Make-A-Scene, announced in July 2022. As the name suggests, it allows users to generate desired images based on simple sketches and text. At that time, models like MidJourney and Stable Diffusion were gaining attention. Any new model had to outperform or distinguish itself from these. Meta’s Make-A-Scene stood out by incorporating not only text but also specific image conditions (sketches) into its generation process. This allowed creators to produce images or videos that were more aligned with their vision.

Make-A-Scene model that generates images based on sketches. Source: Meta AI Blog

There was another model that came before Movie Gen: Emu, announced in September 2023. Emu focused on the fact that existing image generation models often miss fine aesthetic details. To address this, Meta proposed a method based on Diffusion models, fine-tuned with a small number of images to capture intricate details. This process is often referred to as “Needles in a Haystack,” where Meta aimed to find fine image details like searching for needles in a haystack. After Emu, Meta introduced Emu-Edit, a model that allows for editing only specific parts of an image, emphasizing precise modifications.

Emu-Edit generates images that blend naturally with existing ones. Source: *Emu Edit: Precise Image Editing via Recognition and Generation Tasks* (Meta GenAI, 2023)

In both models, Meta emphasized the importance of “creative freedom” for creators. This means the model must allow creators to express their intentions in detail, generate results that align with their envisioned references, and enable precise editing for fine-tuning the output.

How is Movie Gen Different?

The Movie Gen series is Meta’s third major generative model, encompassing both Movie Gen Video and Movie Gen Audio. Movie Gen Video, in particular, reflects Meta’s core philosophy. While the model’s ability to generate videos sets it apart, what Meta emphasizes most is its personalization and precise editing capabilities.

Personalization means that the model can generate content based on specific inputs. For example, if a creator inputs the face of a particular individual along with a desired prompt, the model can generate a video tailored to that vision. Additionally, it allows for precise editing of existing videos according to specific instructions. Similar to the “Needles in a Haystack” challenge discussed earlier, making fine edits to parts of an image or video within large datasets is more difficult than it seems. In one example, a video of lanterns being released can be edited to replace the lanterns with bubbles or change the background, based on detailed instructions. This demonstrates how techniques previously applied to images can now be implemented in video.

Source: Movie Gen: A Cast of Media Foundation Models (The Movie Gen team, 2024)

Unlike images, videos must maintain spatial and temporal consistency for objects and people. While individual frames may look natural, when put together, they can appear disjointed or unnatural. Sudden transitions or improper continuity between scenes can disrupt the flow. So, what technologies are integrated into the Movie Gen model to address these issues?

Technology Behind Movie Gen

Movie Gen Video is a foundation model with 30 billion parameters that combines Text-to-Image and Text-to-Video models to generate high-quality videos at 16 frames per second and up to 16 seconds in length. The model has “watched” around 1 billion images and 100 million video datasets, allowing it to naturally learn object movement, interaction, and physical relationships within a “visual world.” Through this training, Movie Gen Video can generate realistic videos while maintaining consistent quality across various resolutions and aspect ratios. To enhance this capability, the model undergoes Supervised Fine-Tuning (SFT), using high-quality video data and text captions.

Movie Gen Video's architecture.

Movie Gen Video

At the core of the architecture, Movie Gen Video embeds images through a Time Auto-encoder (TAE), which, combined with text information via Cross-Attention, enables Joint Learning that links image and text data. TAE is an extension of the basic VAE generative model, developed with a time axis to handle video creation, allowing it to extract and compress spatio-temporal features. This compressed information then passes through Diffusion and Transformer blocks for video generation training.

Movie Gen Audio

Movie Gen Audio, with 13 billion parameters, is a foundation model specialized in generating high-quality sound and music for Text-to-Audio and Video-to-Audio tasks. It can produce 48kHz cinematic-level sound effects and music that sync perfectly with video inputs, while also being finely adjusted by text prompts. Having learned from about 1 million hours of audio data, the model captures not only physical but also psychological associations that viewers may experience. Notably, Movie Gen Audio can generate both diegetic and non-diegetic sounds, harmonizing them seamlessly with visual elements, thereby enriching creative expression with a natural coherence between audio and visuals.

We’ve explored the evolution of Meta’s generative models and the recently announced Movie Gen model. This model represents the essence of Meta’s philosophy—creating tools for free and creative expression. According to Meta, Movie Gen outperforms previous models like Sora, Runway, and Kling. Notably, it includes an audio model, which previous models hadn’t tackled, making it an all-in-one solution for video production.

Generative technology has now advanced beyond images to encompass video. It’s exciting to imagine how much more refined these models will become in the future.