New Breakthroughs from DeepMind - Relation Networks and Visual Interaction Networks

Yitaek Hwang

- Last Updated: December 2, 2024

Given enough GPUs, distributed machine learning systems (such as the one Facebook published earlier this week) excel at recognizing and labeling images. These systems can quickly and accurately determine whether a dog is in an image, but they struggle to answer relational questions.

For example, computer vision software typically cannot determine whether the dog in the picture is bigger than the ball it is playing with or the couch it is sitting on.

Such relational intelligence separates human cognition from artificial intelligence systems. While humans can reason about physical relationships between objects, computers had not made that connection until now.

DeepMind, the creator of AlphaGo, quietly published two groundbreaking research papers in this area, demonstrating a way to train deep neural networks to perform relational reasoning.

1. Relation Networks

The team at DeepMind created a new module called the Relation Network (RN) to teach a system to reason about relationships between objects, such as spatial relationships. This module can be plugged into an existing neural network and can help it reason about text and image inputs.

The RN module takes in segmented images (pixels mapped to objects such as a sphere or a cube) and considers the relationship between every pair of objects. For example, for a setup including a cylinder, a cube, and a sphere, the RN module will compare the sizes (amongst other distinguishing characteristics) of the objects in each pair to establish a relationship for all the pairs.
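The pairwise idea can be sketched in a few lines of NumPy. This is a minimal illustration, not DeepMind's implementation: it assumes each object is already a small feature vector, it uses untrained random weights, and the helper names (mlp, forward, relation_network) are hypothetical. A function g scores every ordered pair of objects, the scores are summed, and a second function f turns the sum into an answer. The question embedding that the paper also feeds into g is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random weights for a small fully connected network (hypothetical helper)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Apply each layer, with a ReLU after every layer except the last."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def relation_network(objects, g_params, f_params):
    """objects: (n, d) array with one feature vector per object."""
    n = objects.shape[0]
    left = np.repeat(objects, n, axis=0)            # o_i, each repeated n times
    right = np.tile(objects, (n, 1))                # o_j, cycled n times
    pairs = np.concatenate([left, right], axis=1)   # every ordered pair, (n*n, 2d)
    relations = forward(g_params, pairs)            # g applied to each pair
    return forward(f_params, relations.sum(axis=0)) # f applied to the summed relations

# Three made-up objects (say cylinder, cube, sphere) with 4 features each.
objects = rng.normal(size=(3, 4))
g_params = mlp([8, 16, 16])   # input is a pair, so 2 * 4 features
f_params = mlp([16, 16, 2])   # 2 outputs, e.g. scores for two candidate answers
out = relation_network(objects, g_params, f_params)
```

Because the pair outputs are summed, the result does not depend on the order in which the objects are listed, which suits a scene that has no natural object ordering.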

When tested on the CLEVR dataset (a benchmark for determining an AI system’s ability to reason about visual data), RN-augmented networks reached 95.5% accuracy, surpassing human performance.

For comparison, previous state-of-the-art systems fell short at 68.5%, and humans achieved 92.5%. DeepMind’s system could answer the following question: “There is a tiny rubber thing that is the same color as the large cylinder; what shape is it?”

The RN module also showed promising results for language-based tasks. DeepMind’s system scored more than 95% on 18 of the 20 bAbI tasks (a text-based counterpart to the CLEVR test), demonstrating its ability to reason about contextual relationships.

For example, it could answer the following question: “Sandra picked up the football. Sandra went to the office. Where is the football?” This result shows that the RN module can serve as a plug-in extension for many neural networks to augment their understanding of relational data.

The original paper is available here.

2. Visual Interaction Networks

Another ability that separates human cognition from computer systems is predicting the near future based on an understanding of physical relationships.

In simple terms, when we throw a football we can reasonably predict where it will go, what would happen if it were to hit a wall, and how the two objects would be affected. We can reason about these physical interactions from our understanding of physical phenomena.

DeepMind’s paper on the Visual Interaction Network (VIN) attempts to teach computers the physics of objects and their relationships based on how their positions change over time.

Note that this is different from generative approaches to “imagining” future events as shown by Okwechime and Bowden. Rather, VIN infers the states of the objects from a short sequence of video frames and predicts their future positions using the physical rules it has learned.

This approach more closely mimics how humans reason about these phenomena.

DeepMind’s tests included bouncing billiards, objects connected by springs, and planetary systems with gravitational forces. VIN not only showed remarkable accuracy in predicting object location in the future, but also outperformed other models where relational reasoning was absent.
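To make the idea of predicting future positions from object states concrete, here is a toy rollout in NumPy. It is only a sketch under loud assumptions: VIN learns its pairwise interaction function from video, whereas this example hard-codes a zero-rest-length spring (the hypothetical spring_effect) as a stand-in, gives every object unit mass, and skips the visual front end entirely. What it does share with the approach is the structure: sum the pairwise effects on each object, update its state, and repeat.

```python
import numpy as np

def spring_effect(state_i, state_j, k=1.0):
    """Stand-in for a learned pairwise function: pull on i from a spring attached to j."""
    return k * (state_j[:2] - state_i[:2])

def step(states, dt=0.01, pair_effect=spring_effect):
    """states: (n, 4) float array of [x, y, vx, vy]. Returns the states one step later."""
    n = states.shape[0]
    new_states = states.copy()
    for i in range(n):
        total = np.zeros(2)
        for j in range(n):
            if i != j:
                total += pair_effect(states[i], states[j])  # sum effects from all others
        new_states[i, 2:] = states[i, 2:] + dt * total           # velocity update (unit mass)
        new_states[i, :2] = states[i, :2] + dt * new_states[i, 2:]  # position update
    return new_states

def rollout(states, steps, dt=0.01):
    """Feed each prediction back in as the next input to predict further ahead."""
    trajectory = [states]
    for _ in range(steps):
        states = step(states, dt)
        trajectory.append(states)
    return np.stack(trajectory)  # shape: (steps + 1, n, 4)

# Two objects at rest, two units apart, connected by the spring.
initial = np.array([[-1.0, 0.0, 0.0, 0.0],
                    [ 1.0, 0.0, 0.0, 0.0]])
trajectory = rollout(initial, steps=50)
```

Swapping spring_effect for a trained network is, roughly, where learning would enter: the loop stays the same while the rule for how one object affects another is learned from data rather than written by hand.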

While spring and gravitational systems are well understood by humans, the real potential here is to use AI to model unfamiliar and less-studied systems to predict future events.

Details of the results and the approach can be found here.

So What's the Significance?

Both of these papers show progress towards generalized AI. Given relational knowledge, computers can start to formulate relationships between disparate entities and reason about how that relationship can affect their state.

DeepMind’s modular and scalable approach to emulating human intelligence lays the groundwork for building more sophisticated models that can not only outperform humans in specific tasks (e.g. beat a game, transcribe text, label images), but also reason about the relationships among many interacting systems and objects.