I hadn’t been paying a lot of attention to the state of World Models up to this point, but Genie 3 from DeepMind feels like a real step change.

It can create dynamic, interactive 3D worlds at 720p that you can explore in real time at 24 FPS, for minutes at a time.

I suppose we’ve all taken for granted that the distilled knowledge of conventional Large Language Models (LLMs) comes from the cumulative written knowledge of the online world. This includes a partial, implicit understanding—or at least an approximation—of real-world phenomena like physics and how objects behave and interact with each other. An LLM "knows" a ball will fall when dropped because it has processed countless descriptions and depictions of that exact event.

But written language is an incredibly low-bandwidth way to capture the true, high-fidelity nature of reality. It’s a symbolic representation, an abstract layer sitting on top of the real thing.

A World Model, by contrast, is trained on massive sets of video. It doesn't just learn what a bouncing ball is; it learns the underlying principles of its movement, its physics, and its relationship to an environment and a controlling agent.
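As a toy illustration of the next-frame-prediction idea at the heart of many world models (Genie 3's actual training setup is not public, and the real thing operates on pixels with a large neural network), here's a minimal sketch: simulate a "bouncing ball" as a stream of states, then fit a predictor that maps the current state to the next frame's position.

```python
import numpy as np

def simulate(pos, vel, steps):
    """Toy 'physics': a ball bouncing between walls at 0 and 1."""
    frames = []
    for _ in range(steps):
        pos += vel
        if pos < 0 or pos > 1:          # bounce off a wall
            vel = -vel
            pos = min(max(pos, 0.0), 1.0)
        frames.append((pos, vel))
    return frames

# Generate "video" data: a sequence of (position, velocity) states.
data = simulate(pos=0.2, vel=0.03, steps=500)
X = np.array(data[:-1])                  # state at time t
Y = np.array([s[0] for s in data[1:]])   # position at time t+1

# Fit a linear next-state predictor. Least squares here stands in
# for the gradient-descent training of a large video model.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

pred = X @ W
mse = float(np.mean((pred - Y) ** 2))
print(f"next-frame prediction MSE: {mse:.6f}")
```

The point of the sketch: the predictor is never told the rule "balls bounce"; it recovers an approximation of the dynamics purely from observing sequences, which is the same pressure that pushes a video-trained world model toward an implicit grasp of physics.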

The implications for gaming are pretty huge.  Asset generation is a massive component of AAA title development, often representing hundreds of millions of dollars in budget. A technology like this could drive the marginal cost of creating unique, high-quality assets towards zero.

Beyond that, other platforms that depend on digital assets at scale immediately become more practical. Take a truly personalized "metaverse" (with all the connotations that term has acquired): this technology provides the engine for the user-generated content that was always the missing ingredient. The same applies to augmented reality, where a generative model could create intelligent, interactive digital objects that are seamlessly integrated into our physical surroundings, understanding and reacting to them.

Genie 3: https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/