
World models, also known as world simulators, are being touted by some as the next big thing in AI.
AI pioneer Fei-Fei Li’s World Labs has raised $230 million to build a “large world model,” and DeepMind has hired one of the creators of OpenAI’s video generator Sora to work on a “world simulator.” (Sora was released on Monday; here are our initial impressions.)
But what on earth are these things?
World models are inspired by mental models of the world that humans naturally develop. Our brains take abstract representations from our senses and shape them into a more concrete understanding of the world around us, creating what we call “models” long before AI adopted that phrase. The predictions our brain makes based on these models affect how we perceive the world.
A paper by AI researchers David Ha and Jürgen Schmidhuber gives the example of a baseball batter. A batter has milliseconds to decide how to swing the bat, which is shorter than the time it takes for visual signals to reach the brain. The reason a batter can hit a 100-mile-per-hour fastball, Ha and Schmidhuber say, is that they can instinctively predict where the ball will go.
“In professional athletes, all of this happens subconsciously,” the researchers wrote. “Their muscles reflexively swing the bat at the right time and place, in line with their internal models’ predictions. They can quickly act on predictions of the future without needing to consciously roll out possible future scenarios to form a plan.”
It is this subconscious reasoning aspect of world models that some believe is a prerequisite for human-level intelligence.
Modeling the world
Although the concept has been around for decades, world models have recently gained popularity in part due to their promising applications in the field of generative video.
Most, if not all, AI-generated video veers into uncanny valley territory. Watch one long enough and something bizarre will happen, like limbs twisting and merging into each other.
A generative model trained on years of video might accurately predict that a basketball bounces, without actually knowing why, in the same way a language model doesn’t really understand the concepts behind words and phrases. But a world model with even a basic grasp of why the basketball bounces the way it does will be better at showing it do just that.
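To make the contrast concrete, here is a minimal sketch of what “knowing why” looks like: an explicit physics model that predicts a ball’s bounces from gravity and an energy-loss factor rather than from patterns in footage. This is an illustration only, not how video models actually work, and the constants are assumed example values.

```python
# Minimal sketch of an explicit model of *why* a ball bounces: gravity pulls
# it down, and each impact retains a fraction of its speed (restitution).
# G and RESTITUTION are illustrative example values, not measured ones.

G = 9.81            # gravitational acceleration, m/s^2
RESTITUTION = 0.75  # fraction of speed kept after each bounce

def simulate(drop_height, dt=0.001, duration=5.0):
    """Drop a ball and return the peak height reached after each bounce."""
    y, v = drop_height, 0.0
    peaks, t = [], 0.0
    while t < duration:
        v -= G * dt               # gravity changes velocity each step
        y += v * dt
        if y <= 0.0 and v < 0.0:  # impact: reflect velocity, damped
            y, v = 0.0, -v * RESTITUTION
            peaks.append(0.0)
        if peaks and y > peaks[-1]:
            peaks[-1] = y         # track the current rebound's peak
        t += dt
    return peaks

# Each rebound peaks at roughly RESTITUTION**2 times the previous height,
# because peak height scales with the square of launch speed.
print(simulate(2.0)[:2])
```

A model like this can say what a bounce from any height looks like, including heights it has never “seen,” which is exactly what pattern imitation alone cannot guarantee.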
To enable these kinds of insights, world models are trained on a variety of data, including photos, audio, video, and text, with the intention of creating an internal representation of how the world works and the ability to infer the consequences of actions.
“Viewers expect the world they’re watching to behave similarly to reality,” said Alex Mashrabov, former head of AI at Snap and CEO of Higgsfield, which builds generative models for video. “When a feather drops with the weight of an anvil, or a bowling ball shoots hundreds of feet into the air, it’s jarring and takes the viewer out of the moment. With a strong world model, instead of a creator defining how each object is supposed to move (which is tedious, cumbersome, and a poor use of time), the model will understand this.”
But creating better video is just the tip of the iceberg for world models. Researchers including Meta chief AI scientist Yann LeCun say the models could one day be used for sophisticated forecasting and planning in both the digital and physical realms.
In a talk earlier this year, LeCun explained how a world model can help achieve a desired goal through reasoning. Given a goal (a clean room) and a basic representation of the current “world” (e.g. a video of a dirty room), a model can come up with a sequence of actions to achieve that goal (deploy a vacuum to sweep the floor, wash the dishes, empty the trash), not because that is a pattern it has observed, but because at a deeper level it knows how to go from dirty to clean.
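The room example can be sketched as a toy program. In the sketch below (a hypothetical illustration, not any actual system), the “world model” is reduced to a hand-written transition function that predicts how each action changes the room’s state, and a simple breadth-first search plays the planner, rolling actions forward in imagination until the predicted state matches the goal. The action names and state flags are invented for the example; a real world model would learn its transition function from data.

```python
from collections import deque

# Toy "world model": for each action, predict how it changes the room state.
# States are sets of problems; the goal is the empty set (a clean room).
ACTIONS = {
    "vacuum":      lambda s: s - {"dusty_floor"},
    "wash_dishes": lambda s: s - {"dirty_dishes"},
    "empty_trash": lambda s: s - {"full_trash"},
}

def predict(state, action):
    """Stand-in for a learned world model: predict the next state."""
    return frozenset(ACTIONS[action](set(state)))

def plan(start, goal=frozenset()):
    """Search over imagined action sequences for a shortest plan to the goal."""
    queue = deque([(frozenset(start), [])])
    seen = {frozenset(start)}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action in ACTIONS:
            nxt = predict(state, action)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # no imagined sequence reaches the goal

dirty = {"dusty_floor", "dirty_dishes", "full_trash"}
print(plan(dirty))  # some ordering of the three cleaning actions
```

The key point matches LeCun’s framing: the planner never executes anything in the real world while deciding. It consults the model’s predictions, which is what lets it act on goals rather than on memorized action patterns.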
“We need machines that understand the world; (machines) that can remember things, that have intuition, have common sense, that can reason and plan to the same level as humans,” LeCun said. “Despite what you might have heard from some of the most enthusiastic people, current AI systems are not capable of any of this.”
While LeCun estimates that the world models he envisions are at least a decade away, today’s world models are showing promise as rudimentary physics simulators.
OpenAI notes in a blog post that Sora, which it considers a world model, can simulate actions like a painter leaving brushstrokes on a canvas. Models like Sora, and Sora itself, can even simulate video games of a sort. Sora can, for example, render a Minecraft-like UI and game world.
Justin Johnson, co-founder of World Labs, said in an episode of the a16z podcast that world models of the future will be able to generate 3D worlds on demand for games, virtual photography, and more.
“We already have the ability to create virtual, interactive worlds, but it takes hundreds of millions of dollars and an enormous amount of development time,” Johnson said. “With (a world model), you don’t just get an image or a clip; you get a fully simulated, life-like, interactive 3D world.”
High hurdles
Although the concept is attractive, there are many technical challenges.
Training and running world models requires enormous amounts of computing power, even compared to what current generative models use. While some of the latest language models can run on a modern smartphone, Sora (arguably an early world model) required thousands of GPUs to train and run, and the compute bill would only grow if world models came into widespread use.
Like all AI models, world models tend to hallucinate and internalize biases in their training data. For example, a world model trained primarily on sunny weather videos of European cities may have difficulty understanding or depicting a snowy Korean city, or may simply be inaccurate in doing so.
A general lack of training data can make these problems worse, Mashrabov said.
“We’ve seen models be really limited in generating people of a certain type or race,” he said. “Training data for a world model must be broad enough to cover a diverse set of scenarios, but also highly specific so the AI can deeply understand the nuances of those scenarios.”
In a recent post, Cristóbal Valenzuela, CEO of AI startup Runway, said that data and engineering challenges prevent today’s models from accurately capturing the behavior of the world’s inhabitants (e.g. humans and animals). “Models will need to generate consistent maps of the environment, and the ability to navigate and interact in those environments,” he said.
But if all major hurdles are overcome, Mashrabov believes world models could connect AI and the real world “more powerfully,” leading to breakthroughs not only in virtual world creation, but also in robotics and AI decision-making.
World models could also lead to more capable robots.
Today’s robots are limited in what they can do because they have no awareness of the world around them (or of their own bodies). World models could give them that awareness, Mashrabov said, at least to some extent.
“With an advanced world model, an AI could develop a personal understanding of whatever scenario it’s placed in, and start to reason out possible solutions,” he said.