Imagine bringing your own virtual world to life with nothing more than a single image, a brief description, or a photograph. This is the transformative power of Genie (Generative Interactive Environments), a revolutionary model introduced by Google DeepMind in February 2024.
Now, with the introduction of Genie 2, Google signals a move toward a World Model: an AI system that can understand, predict, and act in the world. In this newsletter, we’ll explore the concept of World Models and the innovations Genie 2 brings.
World Model
Genie 2 was unveiled on the Google DeepMind blog. While Genie 1 highlighted its ability to create playable games, Genie 2 is positioned as a Foundation World Model. But what exactly is a World Model, and what does it mean for AI to interact with its environment?
A system architecture for autonomous intelligence. Source: Meta
What Can Google's Genie Do?
How Genie 2 works. Source: Google DeepMind
Genie 2 can generate and render a wide range of elements within a virtual environment, from characters that respond to controls to physical effects such as water and smoke. Its main capabilities are described below.
Action Controls
Action control is Genie 2’s ability to respond correctly to user inputs within a virtual environment. Humans instinctively understand that pressing an arrow key moves a character, but a model has no such built-in understanding; it has to learn it.
Genie 2 addresses this by identifying which element the player controls: the arrow keys should move the character, not static elements such as trees or clouds.
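A toy sketch can make this distinction concrete. It is not Genie 2’s method: Genie 2 has to infer which element is controllable from what it has learned, and Google DeepMind has not released a public API for it. Here that knowledge is simply hard-coded as a hypothetical `controllable` flag. The `Entity` class, the `KEY_TO_DELTA` mapping, and the `apply_action` function are illustrative names only.

```python
from dataclasses import dataclass, replace

# Hypothetical mapping from arrow keys to (dx, dy) moves on a grid.
KEY_TO_DELTA = {
    "up": (0, -1),
    "down": (0, 1),
    "left": (-1, 0),
    "right": (1, 0),
}

@dataclass(frozen=True)
class Entity:
    name: str
    x: int
    y: int
    controllable: bool  # Genie 2 must infer this; the toy hard-codes it

def apply_action(scene: list[Entity], key: str) -> list[Entity]:
    """Move only controllable entities; static scenery is returned unchanged."""
    dx, dy = KEY_TO_DELTA[key]
    return [
        replace(e, x=e.x + dx, y=e.y + dy) if e.controllable else e
        for e in scene
    ]

scene = [
    Entity("robot", 2, 3, controllable=True),
    Entity("tree", 5, 3, controllable=False),
    Entity("cloud", 4, 0, controllable=False),
]
scene = apply_action(scene, "right")
for e in scene:
    print(e.name, e.x, e.y)  # robot 3 3 / tree 5 3 / cloud 4 0
```

In the toy, correct control reduces to knowing which flag to set. The hard part that Genie 2 has to solve is working out that information from the scene itself.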
Generating Counterfactuals
A standout feature of Genie 2 is its ability to generate counterfactual scenarios. Starting from the same initial conditions, it can create entirely different outcomes based on user choices.
This lets training agents explore many possible trajectories from the same starting point. That kind of varied experience is especially valuable in reinforcement learning and simulation, where agents learn from the consequences of their own actions.
Genie 2, generating counterfactuals. Source: Google DeepMind
The image above may resemble real gameplay footage, but it is entirely generated by Genie 2. Starting from the same initial frame, the outcomes differ depending on which buttons are pressed, showing that the model responds to actions rather than replaying a fixed video.
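Conceptually, a counterfactual means rolling the same world model forward twice from the same starting state, using different action sequences each time. The minimal sketch below assumes a hypothetical, deterministic `toy_step` function standing in for the learned model. Genie 2 itself generates frames, not coordinates. In this toy, the only thing that differs between the two branches is the actions.

```python
from typing import Callable

State = tuple[int, int]  # (x, y) position of the controlled character
Action = str

def toy_step(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned world model: predicts the next state."""
    moves = {"left": (-1, 0), "right": (1, 0), "jump": (0, 1), "noop": (0, 0)}
    dx, dy = moves[action]
    x, y = state
    return (x + dx, y + dy)

def rollout(step: Callable[[State, Action], State],
            start: State, actions: list[Action]) -> list[State]:
    """Roll the model forward from `start`, returning every visited state."""
    trajectory = [start]
    for action in actions:
        trajectory.append(step(trajectory[-1], action))
    return trajectory

start = (0, 0)  # the same initial conditions for both branches
branch_a = rollout(toy_step, start, ["right", "right", "jump"])
branch_b = rollout(toy_step, start, ["left", "jump", "jump"])
print(branch_a)  # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(branch_b)  # [(0, 0), (-1, 0), (-1, 1), (-1, 2)]
```

For an agent, each branch is a separate piece of experience generated from one starting frame. That is why counterfactual generation multiplies the training data available from a single scene.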
Advanced Features of Genie 2:
- Long-Horizon Memory: Genie 2 remembers parts of the world that are no longer in view and renders them accurately when they come back into view.
- Long Video Generation: It can maintain a consistent virtual world for up to one minute, generating plausible new content on the fly as the player explores.
- Realistic Visual Details: It models physical effects such as water movement, smoke, and lighting, producing immersive, visually rich environments.
These features highlight Genie 2’s ability to create rich, interactive, and lifelike virtual worlds.
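Long-horizon memory is easiest to appreciate in a generator that invents content as the player explores. The toy below is a hypothetical illustration, not Genie 2’s mechanism. `ToyWorldGenerator` makes up scenery for each map cell the first time that cell is seen. When memory is enabled, it returns the same scenery for a revisited cell instead of inventing something new.

```python
import random

class ToyWorldGenerator:
    """Hypothetical toy: invents scenery for a map cell the first time it is viewed."""

    SCENERY = ["tree", "rock", "house", "pond", "grass"]

    def __init__(self, use_memory: bool, seed: int = 0) -> None:
        self.use_memory = use_memory
        self.rng = random.Random(seed)
        # Records what was shown in each cell; only reused when use_memory is True.
        self.memory: dict[int, str] = {}

    def render(self, cell: int) -> str:
        if self.use_memory and cell in self.memory:
            return self.memory[cell]             # reconstruct what was seen before
        content = self.rng.choice(self.SCENERY)  # generate new content
        self.memory[cell] = content
        return content

def walk_away_and_back(gen: ToyWorldGenerator) -> tuple[str, str]:
    """Look at cell 0, explore cells 1..5, then look at cell 0 again."""
    first = gen.render(0)
    for cell in range(1, 6):
        gen.render(cell)
    return first, gen.render(0)

before, after = walk_away_and_back(ToyWorldGenerator(use_memory=True))
print(before == after)  # True: the scenery is remembered
```

With `use_memory=False`, the revisited cell is generated again and may no longer match what was there before. That is exactly the kind of inconsistency long-horizon memory prevents.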
Genie Meets SIMA
Once an interactive environment exists, the next step is to place an agent inside it: someone or something that presses the buttons and takes actions. Google DeepMind tested its SIMA agent, introduced in March, in 3D environments created by Genie 2.
SIMA: The Interactive Agent
SIMA (Scalable Instructable Multiworld Agent) is an agent designed to follow natural-language instructions. Within Genie 2’s virtual worlds, SIMA uses keyboard and mouse actions to control an avatar and carry out the tasks it is given.
A Fully Automated System
Google DeepMind completed the automated pipeline by adding Imagen 3, its text-to-image model. Imagen 3 generates a starting image from a text prompt, Genie 2 turns that image into an interactive world, and SIMA acts within it.
The image above shows a game scene generated with Imagen 3, with a red door on the left and a blue door on the right. The outcome depends on the instruction given, for example which door to open.
When SIMA receives an instruction, it decides how to move the character. Genie 2 then generates the scene that corresponds to those actions.
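This pipeline can be sketched as a standard agent-environment loop: Imagen 3 produces the starting scene, SIMA chooses actions, and Genie 2 renders the result. None of these models exposes a public API that this sketch calls. `text_to_image`, `agent_act`, and `world_step` are hypothetical stubs standing in for Imagen 3, SIMA, and Genie 2. A frame is reduced to a few facts about the red-door/blue-door scene.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Frame:
    """Hypothetical stand-in for a rendered frame: only the facts the toy needs."""
    player_x: int
    doors: tuple[tuple[str, int], ...]  # (colour, x position) pairs
    opened: Optional[str] = None        # colour of the door opened, if any

def text_to_image(prompt: str) -> Frame:
    """Stub for Imagen 3: returns a fixed two-door scene and ignores the prompt."""
    return Frame(player_x=0, doors=(("red", -3), ("blue", 3)))

def agent_act(instruction: str, frame: Frame) -> str:
    """Stub for SIMA: picks a key press that moves toward the door named in the instruction."""
    target = next(x for colour, x in frame.doors if colour in instruction)
    if frame.player_x < target:
        return "right"
    if frame.player_x > target:
        return "left"
    return "open"

def world_step(frame: Frame, action: str) -> Frame:
    """Stub for Genie 2: produces the next frame that matches the action."""
    if action == "right":
        return replace(frame, player_x=frame.player_x + 1)
    if action == "left":
        return replace(frame, player_x=frame.player_x - 1)
    # "open": open whichever door the player is standing at, if any.
    door = next((c for c, x in frame.doors if x == frame.player_x), None)
    return replace(frame, opened=door)

def run_episode(scene_prompt: str, instruction: str,
                max_steps: int = 20) -> tuple[list[str], Frame]:
    """Generate a scene, then alternate agent actions and world updates."""
    frame = text_to_image(scene_prompt)
    actions: list[str] = []
    for _ in range(max_steps):
        action = agent_act(instruction, frame)
        actions.append(action)
        frame = world_step(frame, action)
        if frame.opened is not None:
            break
    return actions, frame

actions, final = run_episode("A house with a red door and a house with a blue door",
                             "Open the blue door")
print(actions, final.opened)  # ['right', 'right', 'right', 'open'] blue
```

The same scene with the instruction "Open the red door" produces a different action sequence and outcome. In the real system, Genie 2 renders each new frame rather than updating a few fields.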
These capabilities make Genie 2 more than a source of training data for AI agents. Because it can generate entirely new environments, researchers can build novel scenarios the agent has never seen. This allows them to test an agent’s generalization and problem-solving skills more effectively.