Imagine holding a ripe apple at shoulder height and simply letting go. What happens? It falls, hits the ground, perhaps bruises, and rolls away. This sequence of events is so deeply ingrained in our understanding of reality that a toddler can anticipate it without a second thought. Yet this exact scenario represents a profound boundary in modern technology: the dropped apple dilemma. At the heart of this mystery lies intuitive physics—the innate, unspoken understanding of how objects interact in the physical world. While artificial intelligence can defeat grandmasters at chess, generate photorealistic art, and diagnose complex diseases, it is utterly baffled by the simple act of a dropped apple.
To understand why the world’s most powerful supercomputers struggle with a concept a one-year-old human masters effortlessly, we must dive into the architecture of machine cognition and the hidden complexities of our physical universe.
We live in an era where AI seems omnipotent. Large language models (LLMs) can draft legal contracts, write compelling narratives, and translate between dozens of languages in milliseconds. Neural networks can analyze medical scans with a precision that rivals, and sometimes surpasses, human experts. However, this linguistic and analytical prowess masks a fundamental blind spot. When an LLM describes an apple falling from a tree, it is not drawing upon a mental simulation of gravity, mass, and impact. Instead, it is relying entirely on statistical probabilities.
These systems string together words like “fall,” “gravity,” and “ground” because those words frequently co-occur in their vast training data. It is a masterful, highly convincing illusion of comprehension. The system does not know that the apple cannot fall upward, nor does it understand that a glass apple would shatter upon impact while a rubber one would bounce. It merely predicts the next most likely token in a sequence based on billions of text documents. This lack of an underlying “world model” is why an AI might confidently generate an image of a bicycle with square wheels, or write a story where a dropped object floats away, completely oblivious to the absurdity of its output. The machine knows the vocabulary of physics, but it has absolutely no experience of the physics itself.
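To make the point concrete, here is a deliberately tiny sketch of that co-occurrence machinery. Real LLMs use far richer architectures, but a bigram counter captures the essence: the corpus, the `predict_next` function, and the toy sentences below are all invented for illustration, and nothing in them encodes gravity.

```python
from collections import Counter, defaultdict

# Toy corpus: the only "physics" this model ever sees is which words
# tend to follow which other words.
corpus = (
    "the apple falls to the ground . "
    "the apple hits the ground . "
    "gravity pulls the apple down . "
    "the apple falls from the tree ."
).split()

# Count bigrams: how often each word follows the previous one.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word. No simulation of
    mass, force, or impact is involved -- only frequency counts."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("apple"))    # "falls" -- purely because it co-occurs most
print(predict_next("gravity"))  # "pulls"
```

The model says "falls" after "apple" for the same reason a full-scale LLM does: the pairing dominates the training text. Swap the corpus for one where apples float away, and it will happily predict that instead.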
To truly grasp why the dropped apple is such a formidable challenge, we must dissect what happens in the human brain during this mundane event. Humans possess a robust, evolutionary engine for intuitive physics. Long before we learn the mathematical formula for gravity in high school, we understand its effects implicitly. We understand object permanence—the fact that the apple still exists even if it rolls under a couch and out of sight. We understand material properties, spatial relationships, and cause-and-effect dynamics.
The dropped apple dilemma encapsulates the glaring absence of this common-sense reasoning in machines. For an AI, the universe is not made of atoms, forces, and physical laws; it is made of discrete data points, pixels, and text. When you ask a machine learning model to predict the outcome of a physical interaction it hasn’t explicitly seen before, it often fails spectacularly. If you place an apple on a table and push the table, a human knows the apple moves with the table. An AI might assume the apple stays suspended in mid-air while the table slides away.
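The table example can be stated as an explicit rule, which is exactly the kind of knowledge a world model would have to encode. The sketch below is a hypothetical illustration, not any real system's code: a single hand-written contact rule that a human applies without ever articulating it.

```python
# Minimal sketch of the contact rule a human applies implicitly:
# an object resting on a support shares the support's motion.

def step(table_x, apple_x, apple_on_table, push):
    """Advance one step: the table is pushed; the apple rides along
    only while it is actually in contact with the table."""
    table_x += push
    if apple_on_table:
        apple_x += push
    return table_x, apple_x

table_x, apple_x = 0.0, 0.0
for _ in range(5):
    table_x, apple_x = step(table_x, apple_x, apple_on_table=True, push=1.0)

print(table_x, apple_x)  # both moved together, as a human would expect
```

A pattern-matching model has no such rule unless it happens to be recoverable from its training data; a human toddler applies it on the first try.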
This happens because the rules of reality are so obvious to humans that we rarely bother to write them down. We don’t write extensive texts explaining that “when a glass of water is turned upside down, the water falls out.” Consequently, these foundational truths are largely absent from the text-based datasets used to train these systems. AI is trying to learn the rules of the game by reading the commentary, without ever being allowed to step onto the field.
This cognitive gap is not merely a philosophical curiosity; it is a massive, expensive hurdle for practical applications, particularly in the fields of robotics and automation. This phenomenon is closely related to Moravec’s paradox, which states that high-level reasoning requires very little computation, but low-level sensorimotor skills require enormous computational resources. A robotic arm in a controlled factory environment can weld car parts with sub-millimeter precision because the environment is perfectly structured and predictable.
However, take a robot out of the factory and ask it to navigate a messy kitchen, and the limitations of its physical understanding become glaringly obvious. If a robot is tasked with clearing a dining table and accidentally knocks an apple off the edge, it lacks the intuitive physics to instinctively catch it, predict where it will land, or understand that it might bruise. In automation, dealing with the unpredictable physical world requires an understanding of friction, weight distribution, and momentum.
Engineers attempt to solve this by training robots in simulated environments using reinforcement learning, forcing them to drop millions of virtual apples to learn the consequences. Yet, the “sim-to-real” gap persists. The infinite variables of the real world—a slight breeze, a sticky surface, an irregular shape, a sudden change in lighting—cannot be perfectly modeled. Without a generalized understanding of physics, robots remain brittle, struggling to adapt to novel physical situations that a human child would navigate effortlessly.
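The sim-to-real gap can be cartooned in a few lines. In this hypothetical toy, an agent "learns" where a dropped apple lands by averaging thousands of simulated drops, but the simulator's drag parameter never quite matches reality; the `landing_x` function and all the numbers are made up for illustration.

```python
import random

def landing_x(v_x, height, drag, g=9.81):
    """Toy ballistic drop: horizontal speed, reduced by a drag factor,
    times the free-fall time from the given height."""
    t_fall = (2 * height / g) ** 0.5
    return v_x * (1 - drag) * t_fall

random.seed(0)

# Training: 10,000 simulated drops, each with a slightly randomized drag.
sim = [landing_x(v_x=1.0, height=1.0, drag=random.uniform(0.0, 0.05))
       for _ in range(10_000)]
learned = sum(sim) / len(sim)   # the landing spot the agent comes to expect

# Deployment: the real world's drag lies outside the simulated range.
real = landing_x(v_x=1.0, height=1.0, drag=0.15)

print(abs(learned - real))      # a persistent error: the sim-to-real gap
```

No amount of extra simulated drops closes the error, because the mismatch lives in the simulator itself, not in the number of samples; that is the brittleness the paragraph above describes.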
The race is now on to bridge this gap and solve the dropped apple dilemma once and for all. Researchers at leading tech institutions are exploring new architectures that go far beyond traditional LLMs and standard neural networks. The ultimate goal is to develop “world models”—AI systems that learn the underlying, invariant rules of the physical universe rather than just the statistical patterns of human language.
One promising approach involves training models on massive amounts of video data, forcing the AI to predict the next frame of a video. By watching millions of hours of objects falling, bouncing, splashing, and breaking, the hope is that the neural network will implicitly learn the laws of physics, much like it learned the rules of grammar by reading text.
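Why would predicting the next frame teach physics at all? A stripped-down illustration, with a 1-D "video" that is just an apple's height over time (all numbers invented): a predictor that only extrapolates velocity keeps making the same error on a falling object, while one that also models acceleration predicts perfectly, so minimizing next-frame error pushes a model toward discovering gravity.

```python
# Toy 1-D "video": each frame is simply the apple's height at that instant.
g, dt = 9.81, 0.1
frames = [100 - 0.5 * g * (i * dt) ** 2 for i in range(20)]  # true heights

def predict_linear(frames, i):
    """Constant-velocity guess from the last two frames."""
    return 2 * frames[i] - frames[i - 1]

def predict_quadratic(frames, i):
    """Constant-acceleration guess from the last three frames."""
    return 3 * frames[i] - 3 * frames[i - 1] + frames[i - 2]

def mse(predict, start=3):
    """Mean squared next-frame prediction error over the clip."""
    errs = [(predict(frames, i) - frames[i + 1]) ** 2
            for i in range(start, len(frames) - 1)]
    return sum(errs) / len(errs)

# The acceleration-aware predictor wins on falling objects.
print(mse(predict_linear) > mse(predict_quadratic))  # True
```

Real video models face images rather than a single number per frame, but the training signal is the same: frames that obey physics are predictable, and exploiting that predictability means implicitly capturing the physics.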
Another critical avenue is “embodied AI,” where artificial intelligence is placed inside a physical robot or a highly realistic virtual body, allowing it to actively interact with its environment. Just as a human infant learns about gravity by throwing toys from a highchair and observing the results, an embodied AI learns through trial, error, and physical feedback. By pushing, dropping, and manipulating objects, the AI begins to build a rudimentary understanding of cause and effect. Despite these rapid advancements, achieving true physical common sense remains elusive. The computational power required to simulate and understand the physical world in real-time is staggering.
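The highchair experiment can itself be sketched as a loop. In this hypothetical embodied-learning toy (the heights, noise level, and inference formula are all assumptions for illustration), an agent repeatedly drops an object, times the fall with noisy "senses," and refines its own estimate of gravitational acceleration from the feedback.

```python
import random

random.seed(1)
G_TRUE = 9.81  # the environment's actual gravity, unknown to the agent

estimates = []
for _ in range(200):
    h = random.uniform(0.5, 2.0)        # agent chooses a drop height
    t = (2 * h / G_TRUE) ** 0.5         # the world determines the fall time
    t += random.gauss(0, 0.005)         # noisy sensorimotor timing
    estimates.append(2 * h / t ** 2)    # infer g from this single trial

g_learned = sum(estimates) / len(estimates)
print(round(g_learned, 2))  # converges near the true value
```

Each individual trial is noisy and uninformative on its own; it is the accumulation of self-generated experiments, not any text description, that drives the estimate toward the truth, which is the core argument for embodiment.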
The dropped apple dilemma serves as a humbling reminder of the immense complexities inherent in our everyday reality. While we marvel at the rapid, seemingly magical advancements in artificial intelligence, machine learning, and automation, the most profound mysteries often lie in the things we take for granted. Intuitive physics—the silent, automatic calculations our brains perform every single second—remains one of the most significant barriers to achieving true artificial general intelligence.
Until a neural network can watch an apple fall and genuinely understand the physical forces at play, rather than just predicting the next word in a sentence or the next pixel in an image, AI will remain a brilliant but fundamentally disconnected observer of our universe. The journey to teach machines the unwritten rules of reality is just beginning, and it promises to be one of the most fascinating scientific endeavors of our time. The next time you drop an object, take a moment to appreciate the incredibly complex physics engine inside your own mind—an engine that, for now, remains unmatched by any machine.
Intuitive physics refers to the innate understanding of how objects interact in the real world, encompassing concepts like gravity, mass, and object permanence. While human toddlers develop this awareness naturally through daily interactions, artificial intelligence currently lacks this fundamental common sense. Modern machines process discrete data points and statistical probabilities rather than experiencing actual physical forces, making them unable to instinctively predict real-world outcomes.
Advanced language models rely entirely on statistical probabilities derived from massive text datasets rather than a genuine mental simulation of the physical universe. They string together words based on how frequently they appear together in their training data. Because humans rarely write down obvious physical rules, these foundational truths are missing from the text, leaving the system to predict words without any real comprehension of cause and effect.
Moravec's paradox is the observation that high-level reasoning requires very little computation for machines, whereas low-level sensorimotor skills demand enormous computational resources. This explains why artificial intelligence can easily defeat chess grandmasters or draft complex documents but struggles to navigate a messy room. Dealing with unpredictable physical variables like friction and momentum remains a massive hurdle for modern robots outside of strictly controlled environments.
Scientists are developing new architectures called world models that aim to learn the invariant rules of the physical universe instead of just language patterns. One major approach involves training neural networks on massive amounts of video data so they can implicitly learn physical laws by predicting future frames. Another strategy involves placing the system inside a physical robot to learn through trial, error, and direct environmental feedback.
Embodied artificial intelligence involves placing a machine learning system inside a physical robot or a highly realistic virtual body so it can actively interact with its surroundings. This approach is crucial because it allows the system to learn about physical forces through direct manipulation, much like a human infant learns by playing with toys. By pushing and dropping objects, the system builds a practical understanding of cause and effect that text-based training simply cannot provide.