🔬 Research Summary by Marco Gori and Stefano Melacci
Marco Gori and Stefano Melacci are, respectively, Full Professor and Associate Professor of Computer Science at the University of Siena (Siena, Italy), with their research focused on foundational aspects of Machine Learning, recently oriented towards problems of learning over time.
[Original paper by Marco Gori and Stefano Melacci]
Overview: Learning from huge data collections introduces risks related to data centralization, privacy, energy efficiency, limited customizability, and control. This paper focuses on a perspective in which artificial agents are developed progressively over time by learning online from potentially lifelong streams of sensory data. They do so without storing the sensory information and without building datasets for offline learning, while being pushed towards interactions with the environment, including humans and other artificial agents.
Introduction
The outstanding results of machine learning-based applications are largely due to models that are trained on huge datasets. This triggers several questions about the nature of such datasets and the way they are exploited:
- What’s inside these data collections, and who owns them?
- Who has the resources for developing agents that learn from these huge collections?
An artificial agent learning from a large dataset inherits biases and gains skills that are directly related to the collection’s contents. Moreover, data means “power”: owning large collections makes it possible to train large models that can then be exploited in downstream applications, but only by those with access to significant hardware and energy resources.
“Collectionless AI” identifies those approaches in which intelligent agents do not need to accumulate sensory data: samples are processed as they are acquired from the environment, without being stored. Environmental interactions, including information coming from humans, play a crucial role in the learning process and offer control, as does agent-to-agent communication. We envision agents that can run on edge computing devices, which requires new learning protocols in which machines learn in a lifelong manner.
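The protocol described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not from the paper): a toy agent with a one-dimensional linear model learns online from a simulated sensory stream, updating its internal state one sample at a time and discarding each sample immediately, so no dataset is ever built.

```python
import random

def sensor_stream(n_steps, seed=0):
    """Stand-in for a lifelong sensory stream: yields (x, y) pairs
    generated by a hidden linear law plus a little noise."""
    rng = random.Random(seed)
    true_w = 2.0  # hidden slope the agent should discover
    for _ in range(n_steps):
        x = rng.uniform(-1.0, 1.0)
        y = true_w * x + rng.gauss(0.0, 0.01)
        yield x, y

def collectionless_learn(stream, lr=0.1):
    """Per-sample SGD: the agent keeps only its internal state (w);
    raw samples are never accumulated."""
    w = 0.0
    for x, y in stream:
        error = w * x - y
        w -= lr * error * x  # gradient step on the squared error
        # (x, y) goes out of scope here: nothing is stored
    return w

w = collectionless_learn(sensor_stream(5000))
print(w)  # approaches the hidden slope (close to 2.0)
```

The agent's memory footprint is constant regardless of how long the stream runs, which is the essential property the collectionless setting asks for.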
Key Insights
Risks connected with data centralization
The growing ubiquity of Large Language Models (LLMs) has recently sparked strong debates, involving social and political aspects, about scenarios that could give rise to rogue AIs. The source of these debates is deeply connected with the exploitation of increasingly large data collections, which requires huge financial resources and thus leads to the centralization of information. This centralization produces undeniable privacy problems as well as very controversial geopolitical effects.
Data centralization issues
The progressive accumulation of data was stimulated, at the dawn of the Web, by technologies whose creators recognized early the strategic value of collecting data through massive crawling. Just as for Web search services, the quality of modern machine learning-based services is strongly tied to the privilege of having access to huge data collections.
Privacy and geopolitical issues
When data from the camera or microphone of a private smartphone is processed off-device, privacy issues arise in the way the information is stored and communicated over the network. As a result, on-device learning, without building databases, might become an important requirement of future AI-based technologies. By pushing collection-centered AI, we implicitly contribute to serious geopolitical issues connected with the dominance of the few countries that control data and the development of the technologies exploiting them.
Energy efficiency issues
Training large Transformers requires significant energy, and the training procedures are actively driven neither by the agent nor by a supervisor. In contrast, interacting with the agent to customize the teaching process would yield a more controlled setting, focusing only on what matters most for the agent at a given instant and favoring distributed computation with brief, targeted communications among agents, which might reduce the energy needed to develop them.
Limited control, customizability, and causality
Data might be affected by biases that are hard to filter out, or might contain inappropriate material. In contrast, progressive interaction with the environment paves the way for a more controllable and informed AI. In turn, it enables better exploitation of the temporal dimension of the information, which can be used to better capture the causal structure of predictions.
Collectionless AI
A radically different perspective emerges when we think of machines that acquire cognitive skills without accessing previously stored data collections, simply through environmental interactions in which the sensory information is processed immediately and agent-to-human or agent-to-agent exchanges occur. In nature, animals do not rely on data collections: they process information as time passes, creating and updating an appropriate internal representation of their environment. It is the interaction with the environment that keeps them in touch with this treasure of information and enables the growth of their cognitive skills. Artificial neural networks still struggle to find a good trade-off between plasticity and stability without relying on the optimization of previously built data collections, and this is the main focus of Collectionless AI. Given the spectacular results of deep learning, at first glance we might regard the proposed Collectionless AI challenge as quite ridiculous. However, because this is how intelligence emerges in nature, the challenge is of interest in itself from a truly scientific point of view. Moreover, we also advocate its potential for enabling a genuinely different type of AI technology, one that could have a dramatic impact on society by going beyond the data-collection issues mentioned above.
Time as the protagonist of learning
Sensory information is characterized by a natural temporal development. In nature, we do not learn from a huge dataset of “shuffled images,” and animals gain visual skills without storing their whole visual life. Why can they do so without accessing a previously stored database? Is there a specific biological aspect that cannot be captured in machines? This paper argues that machines can likely gain those skills once we face the challenge of learning without data collections, exploiting the natural development of sensory information over time.
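As a small illustration of processing data in temporal order with constant memory, the hypothetical sketch below summarizes a stream using Welford's online algorithm: the agent maintains a compact internal state (running mean and variance) and discards each observation immediately, rather than storing the stream for later shuffling. The stream values here are made up for illustration.

```python
class StreamSummary:
    """O(1)-memory statistics over a lifelong stream of observations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        """Welford's online update: one pass, no samples retained."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 1 else 0.0

summary = StreamSummary()
for frame_intensity in [0.2, 0.4, 0.6, 0.8]:  # stand-in for sensory data
    summary.update(frame_intensity)  # each frame is then discarded

print(summary.mean)      # close to 0.5
print(summary.variance)  # close to 0.05
```

A learning agent would of course maintain a far richer internal state than two scalars, but the principle is the same: the representation, not the raw stream, is what persists over time.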
Benchmarks in Collectionless AI
Just like humans, machines can be expected to “live in their environment,” and they can be evaluated online. A massive online active evaluation, open to a wide audience, is a viable path for qualitatively evaluating virtual agents that learn progressively, as in the case of LLMs. The same agent can be evaluated at different stages of its evolution, analyzing both progress and regressions.
Between the lines
This paper proposes a new view of AI centered around the idea of Collectionless AI. The new learning protocol assumes that machines interact with their environment without being allowed to store information in order to re-create the typical conditions of offline learning. Machines are expected to develop their memorization skills by abstracting the information acquired from the sensors, which is processed online. We argue that emphasizing the importance of this new framework might open the door to a new approach to machine learning. Moreover, the emergence of the collectionless philosophy can contribute to a better understanding of intelligence processes in nature, as well as open an alternative technological path that is not centered on the privilege of controlling large data collections.