On August 27th, Google Research unveiled its study on GameNGen, a real-time game engine model powered by diffusion and reinforcement learning. Let’s take a closer look at how GameNGen works.
What do we need?
Game developers create the rules and fun elements of a game, while users learn and play by those rules. This forms a repetitive loop that works as follows:
- Players control the game
- The game state gets updated
- Results are displayed
Players interact with the game, causing changes and immediately seeing the results. When a player provides input, the game engine updates the game state and fetches the appropriate images and elements to render the result on the screen. This kind of input-to-output mapping is something AI excels at.
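To make this loop concrete, here is a minimal sketch of it in Python. All of the names (`GameState`, `update_state`, `render`) are illustrative stand-ins rather than any real engine’s API:

```python
from dataclasses import dataclass


@dataclass
class GameState:
    player_x: int = 0
    game_over: bool = False


def update_state(state: GameState, action: str) -> GameState:
    # Program logic: apply the player's action to the game state.
    if action == "right":
        state.player_x += 1
    elif action == "quit":
        state.game_over = True
    return state


def render(state: GameState) -> None:
    # Rendering logic: display the result of the update.
    print(f"player at x={state.player_x}")


def game_loop() -> None:
    state = GameState()
    # A hard-coded action list stands in for live player input.
    for action in ["right", "right", "quit"]:
        state = update_state(state, action)  # the game state gets updated
        render(state)                        # the result is displayed
        if state.game_over:
            break


if __name__ == "__main__":
    game_loop()
```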
Consider how image generation models work:
When you input a text prompt, the model generates high-quality images that match the prompt. Broadly speaking, this process is similar to game generation because both involve creating visuals based on user inputs.
GameNGen can replace traditional game engines by following this same principle. It generates the appropriate game state and outputs matching visuals based on the user’s input. GameNGen can produce 20 frames per second in real time.
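Under this principle, the update-and-render steps of the loop collapse into a single model call. The sketch below is only a hypothetical illustration of the idea; `NeuralFrameModel` and its `predict` method are assumed names, not GameNGen’s actual interface:

```python
from collections import deque


class NeuralFrameModel:
    """Stand-in for a generative model that predicts the next frame."""

    def predict(self, past_frames, past_actions):
        # In GameNGen this is a diffusion model; here we return a dummy frame.
        return f"frame after action {past_actions[-1]!r} ({len(past_frames)} context frames)"


def neural_game_loop(model, initial_frame, actions, context_len=64):
    frames = deque([initial_frame], maxlen=context_len)
    past_actions = deque(["noop"], maxlen=context_len)
    for action in actions:                  # one player input per generated frame
        past_actions.append(action)
        next_frame = model.predict(list(frames), list(past_actions))
        frames.append(next_frame)           # the model's own output becomes context
        print(next_frame)                   # "rendering" is just showing the prediction


neural_game_loop(NeuralFrameModel(), "initial frame", ["right", "shoot"])
```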
How GameNGen Works
Poster of DOOM
GameNGen completely replaces the game engine for DOOM, a classic FPS. One of DOOM’s defining traits is that it runs on almost any platform, from computers and consoles to unexpected devices like toasters, microwaves, and even treadmills.
Model architecture of Google's GameNGen
GameNGen was developed through two major processes: first, training an agent to play the game to collect data, and second, training the model using the collected data.
To begin with, let’s look at how they trained the agent to collect data. A game resembles a simulation of a world where we constantly interact with its environment. This process of learning through interaction with the environment is similar to reinforcement learning. The key components involved in agent training can be represented by the following symbols:
- Environment (E)
- State (S)
- Observation (O)
- Action (A)
- Rendering Logic (projection function; V)
- Program Logic (transition probability function; p)
In DOOM, the dynamic memory is represented as S, the rendered result on the screen as O, the process of rendering O from S as V, the user’s input as A, and the logic that advances the game state in response to that input as p. While typical reinforcement learning focuses on maximizing rewards to produce optimal gameplay, this research differs: the goal is to collect a wide range of examples through human-like gameplay.
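A rough sketch of what this data-collection phase might look like, assuming a gym-style interface, is shown below. `DoomEnv` and `Agent` are hypothetical stand-ins; the real agent is trained with a reward that encourages diverse, human-like play rather than choosing actions at random:

```python
import random


class DoomEnv:
    """Toy environment illustrating the S, O, V, A, p roles."""

    def reset(self):
        self.state = 0                            # S: the latent game state (dynamic memory)
        return self.render()

    def render(self):
        return f"screen for state {self.state}"   # V: projection from S to the observation O

    def step(self, action):
        self.state += 1 if action == "forward" else 0   # p: program logic updating S
        done = self.state >= 3
        return self.render(), done


class Agent:
    def act(self, observation):
        # Placeholder policy; the actual agent aims for trajectory diversity, not reward.
        return random.choice(["forward", "turn", "shoot"])


def collect_episode(env, agent, max_steps=100):
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)               # A: the agent's chosen input
        trajectory.append((obs, action))      # logged (observation, action) pair
        obs, done = env.step(action)
        if done:
            break
    return trajectory


print(collect_episode(DoomEnv(), Agent()))
```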
The researchers collected the agent’s actions and observations and used them to train the generative model, which in this case is Stable Diffusion v1.4. Instead of relying on text input, the model generates frames conditioned on the previous actions and observations.
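Concretely, the text conditioning is replaced by two signals: embeddings of the past actions stand in for the text embeddings, and the encoded past frames are stacked into the input of the denoising network. The shapes and names below are assumptions for illustration, not the actual implementation:

```python
import torch
import torch.nn as nn

NUM_ACTIONS, EMB_DIM, CONTEXT, LATENT_C = 18, 768, 4, 4

# Replaces the text encoder's output as the conditioning signal.
action_embedding = nn.Embedding(NUM_ACTIONS, EMB_DIM)


def build_model_inputs(past_latents, past_actions, noisy_latent):
    # past_latents: (B, CONTEXT, LATENT_C, H, W) encoded context frames
    # noisy_latent: (B, LATENT_C, H, W) latent of the frame being denoised
    b, t, c, h, w = past_latents.shape
    context = past_latents.reshape(b, t * c, h, w)        # stack context frames as channels
    unet_input = torch.cat([noisy_latent, context], dim=1)
    cond = action_embedding(past_actions)                 # (B, CONTEXT, EMB_DIM)
    return unet_input, cond


# Toy shapes just to show the wiring.
latents = torch.randn(2, CONTEXT, LATENT_C, 8, 8)
actions = torch.randint(0, NUM_ACTIONS, (2, CONTEXT))
x, cond = build_model_inputs(latents, actions, torch.randn(2, LATENT_C, 8, 8))
print(x.shape, cond.shape)  # torch.Size([2, 20, 8, 8]) torch.Size([2, 4, 768])
```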
To address the Auto-regressive Drift issue that occurs during generative training, the researchers applied Noise Augmentation. Auto-regressive Drift refers to the gradual accumulation of errors over time, caused by a mismatch between the training and inference phases. During training, the model is taught to predict the next frame from ground-truth context frames, a setup known as Teacher-forcing. During inference, however, the model conditions on its own previous outputs, so any small errors it makes are fed back in, compound over time, and degrade the output quality.
Noise Augmentation reduces this gap by intentionally corrupting the context frames with varying levels of Gaussian noise during training. By learning to produce clean frames from corrupted context, the model becomes robust to the imperfect frames it feeds back to itself at inference time, maintaining quality in dynamic and unpredictable situations similar to real game conditions.
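A minimal sketch of this idea, with assumed parameter names, is shown below: each training example’s context frames are corrupted with a randomly sampled amount of Gaussian noise, and the sampled level is discretized so it can also be given to the model as an input:

```python
import torch


def noise_augment(context_latents, max_noise_level=0.7, num_buckets=10):
    # context_latents: (B, T, C, H, W) encoded past frames used as conditioning
    b = context_latents.shape[0]
    level = torch.rand(b) * max_noise_level                # one noise level per example
    noise = torch.randn_like(context_latents)
    noisy = context_latents + level.view(b, 1, 1, 1, 1) * noise
    # Discretize the level so the model can be told how corrupted its context is.
    level_bucket = (level / max_noise_level * (num_buckets - 1)).long()
    return noisy, level_bucket


noisy, bucket = noise_augment(torch.randn(2, 4, 4, 8, 8))
print(noisy.shape, bucket)
```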
Comparison of GameNGen Model Predictions vs. Actual Results
The GameNGen model is the result of this training, and humans can play the game on it directly. Its simulated gameplay closely matches the real thing: human evaluators found short clips of GameNGen’s output nearly indistinguishable from real DOOM gameplay footage.
(For those curious about the actual gameplay footage, 🔗 check it out here!)
Of course, there are still many challenges to address. Currently, the model learns an existing game’s rules from recorded gameplay, so creating a brand-new game would require generating those rules and finding a way to produce training gameplay from scratch. Another issue is the heavy computational cost compared to traditional game engines, especially for games that don’t demand high-end graphics.
The significance of this research lies in showing that game engines can potentially be replaced by AI models. Notably, this work used Stable Diffusion v1.4, released in 2022, suggesting plenty of headroom for improvement with newer models.