Salvatore Raieli, Author at Towards Data Science (https://towardsdatascience.com)

The Basis of Cognitive Complexity: Teaching CNNs to See Connections
https://towardsdatascience.com/the-basis-of-cognitive-complexity-teaching-cnns-to-see-connections/
Fri, 11 Apr 2025

Transforming CNNs: From task-specific learning to abstract generalization


Liberating education consists in acts of cognition, not transferrals of information.

Paulo Freire

One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing?

Many authors suggest that artificial intelligence models do not possess the same capabilities as humans, especially when it comes to plasticity, flexibility, and adaptation.

One aspect these models fail to capture is the set of causal relationships that govern the external world.

This article discusses these issues:

  • The parallelism between convolutional neural networks (CNNs) and the human visual cortex
  • Limitations of CNNs in understanding causal relations and learning abstract concepts
  • How to make CNNs learn simple causal relations

Is it the same? Is it different?

Convolutional networks (CNNs) [2] are multi-layered neural networks that take images as input and can be used for multiple tasks. One of the most fascinating aspects of CNNs is their inspiration from the human visual cortex [1]:

  • Hierarchical processing. The visual cortex processes images hierarchically: early visual areas capture simple features (such as edges, lines, and colors), while deeper areas capture more complex features such as shapes, objects, and scenes. CNNs, thanks to their layered structure, capture edges and textures in the early layers, while deeper layers capture object parts or whole objects (a minimal sketch of such a network appears after the figure below).
  • Receptive fields. Neurons in the visual cortex respond to stimuli in a specific local region of the visual field (commonly called receptive fields). As we go deeper, the receptive fields of the neurons widen, allowing more spatial information to be integrated. Thanks to pooling steps, the same happens in CNNs.
  • Feature sharing. Although biological neurons are not identical, similar features are recognized across different parts of the visual field. In CNNs, the various filters scan the entire image, allowing patterns to be recognized regardless of location.
  • Spatial invariance. Humans can recognize objects even when they are moved, scaled, or rotated. CNNs also possess this property.
The relationship between components of the visual system and CNN. Image source: here
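
To make the parallels above concrete, here is a minimal PyTorch sketch of a small CNN (not the architecture of any cited paper; the layer sizes, the 64x64 input, and the 2-class output are illustrative assumptions). Stacking convolutions and pooling is what produces the hierarchy of features and the growing receptive fields described in the list.

```python
# Minimal illustrative CNN: early layers see edges/textures, pooling widens receptive fields,
# deeper layers see larger parts of the image. Shapes and classes are assumptions.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges, textures
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling widens the receptive field
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: parts of objects
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # assumes 64x64 inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)                  # the same filters scan the whole image
        return self.classifier(h.flatten(1))  # logits for the final task

# Usage: a batch of four 64x64 RGB images.
logits = TinyCNN()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```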

These features have made CNNs perform well in visual tasks to the point of superhuman performance:

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well-trained on the validation images to be better aware of the existence of relevant classes. […] Our result (4.94%) exceeds the reported human-level performance. —source [3]

Although CNNs perform better than humans in several tasks, there are still cases where they fail spectacularly. For example, in a 2024 study [4], AI models failed to generalize in image classification: state-of-the-art models perform better than humans on objects in upright poses but fail when objects appear in unusual poses.

The correct label is shown above each object, and the AI's incorrect prediction is shown below. Image source: here

In conclusion, our results show that (1) humans are still much more robust than most networks at recognizing objects in unusual poses, (2) time is of the essence for such ability to emerge, and (3) even time-limited humans are dissimilar to deep neural networks. —source [4]

In the study [4], they note that humans need time to succeed in a task. Some tasks require not only visual recognition but also abstractive cognition, which requires time.

The generalization abilities that humans possess come from understanding the laws that govern relations among objects. Humans recognize objects by extrapolating rules and chaining them to adapt to new situations. One of the simplest rules is the "same-different relation": the ability to determine whether two objects are the same or different. This ability develops rapidly during infancy and is closely associated with language development [5-7]. Some animals, such as ducks and chimpanzees, also possess it [8]. In contrast, learning same-different relations is very difficult for neural networks [9-10].

Example of a same-different task for a CNN. The network should return a label of 1 if the two objects are the same or a label of 0 if they are different. Image source: here

Convolutional networks show difficulty in learning this relationship. Likewise, they fail to learn other types of causal relationships that are simple for humans. Therefore, many researchers have concluded that CNNs lack the inductive bias necessary to be able to learn these relationships.

These negative results do not mean that neural networks are completely incapable of learning same-different relations. Much larger and longer trained models can learn this relation. For example, vision-transformer models pre-trained on ImageNet with contrastive learning can show this ability [12].

Can CNNs learn same-different relationships?

The fact that large models can learn these kinds of relationships has rekindled interest in CNNs. The same-different relationship is considered among the basic logical operations that form the foundation of higher-order cognition and reasoning. Showing that shallow CNNs can learn this concept would allow us to experiment with other relationships and would let models learn increasingly complex causal relations. This is an important step in advancing the generalization capabilities of AI.

Previous work suggests that CNNs do not have the architectural inductive biases to be able to learn abstract visual relations. Other authors assume that the problem is in the training paradigm. In general, the classical gradient descent is used to learn a single task or a set of tasks. Given a task t or a set of tasks T, a loss function L is used to optimize the weights φ that should minimize the function L:

Image source from here
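
The formula shown in the image above is, in standard notation, the usual multi-task objective (a reconstruction based on the surrounding text, where φ are the weights and L_t the per-task loss):

```latex
\phi^{*} \;=\; \arg\min_{\phi} \sum_{t \in T} \mathcal{L}_{t}\bigl(f_{\phi}\bigr)
```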

This can be viewed as simply the sum of the losses across different tasks (if we have more than one task). Instead, the Model-Agnostic Meta-Learning (MAML) algorithm [13] is designed to search for an optimal point in weight space for a set of related tasks. MAML seeks to find an initial set of weights θ that minimizes the loss function across tasks, facilitating rapid adaptation:

Image source from here
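
In the standard formulation from Finn et al. [13], the MAML objective optimizes the initial weights θ through the adaptation step itself, where α is the inner-loop learning rate (again written here as a reconstruction consistent with the text):

```latex
\theta^{*} \;=\; \arg\min_{\theta} \sum_{t \in T} \mathcal{L}_{t}\!\left(f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{t}(f_{\theta})}\right)
```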

The difference may seem small, but conceptually, this approach is directed toward abstraction and generalization. If there are multiple tasks, traditional training tries to optimize weights for different tasks. MAML tries to identify a set of weights that is optimal for different tasks but at the same time equidistant in the weight space. This starting point θ allows the model to generalize more effectively across different tasks.

Meta-learning initial weights for generalization. Image source from here
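
A minimal first-order sketch of this two-level optimization (an illustration of the paradigm, not the code of [11] or [13]; the toy sine-regression tasks and all hyperparameters are assumptions):

```python
# First-order MAML-style sketch: adapt a copy of the shared weights to each task (inner loop),
# then update the shared initialization from the post-adaptation query loss (outer loop).
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # shared initialization theta
meta_opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def sample_task():
    """Hypothetical task generator: regress y = a * sin(x) for a random amplitude a."""
    a = torch.rand(1) * 2
    x = torch.rand(20, 1) * 6 - 3
    return x, a * torch.sin(x)

for step in range(100):                              # outer loop over meta-training steps
    meta_opt.zero_grad()
    for _ in range(4):                               # a small batch of tasks
        x, y = sample_task()
        adapted = copy.deepcopy(model)               # inner loop: adapt a copy of theta to this task
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=1e-2)
        inner_opt.zero_grad()
        loss_fn(adapted(x[:10]), y[:10]).backward()
        inner_opt.step()
        adapted.zero_grad()                          # evaluate the adapted weights on query data
        loss_fn(adapted(x[10:]), y[10:]).backward()  # first-order: gradients taken w.r.t. the copy
        for p, p_a in zip(model.parameters(), adapted.parameters()):
            p.grad = p_a.grad.clone() if p.grad is None else p.grad + p_a.grad
    meta_opt.step()                                  # move theta toward a point that adapts quickly
```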

Since we now have a method biased toward generalization and abstraction, we can test whether we can make CNNs learn the same-different relationship.

In this study [11], they compared shallow CNNs trained with classic gradient descent and with meta-learning on a dataset designed for this purpose. The dataset consists of 10 different tasks that test for the same-different relationship.

The Same-Different dataset. Image source from here
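
As a rough illustration of what a same-different sample might look like (a toy generator, not the dataset of [11]): two shapes are drawn on a blank canvas and the label says whether they are identical.

```python
# Toy same-different sample: label 1 if the two squares have the same size, 0 otherwise.
import numpy as np

def make_sample(rng: np.random.Generator, size: int = 32):
    canvas = np.zeros((size, size), dtype=np.float32)
    same = int(rng.integers(0, 2))                 # 1 = same shapes, 0 = different
    s1 = int(rng.integers(3, 8))                   # side length of the first square
    s2 = s1 if same else int(rng.choice([s for s in range(3, 8) if s != s1]))
    for s in (s1, s2):
        r, c = rng.integers(0, size - s, size=2)   # random placement; overlap is not prevented
        canvas[r:r + s, c:c + s] = 1.0
    return canvas, same

rng = np.random.default_rng(0)
image, label = make_sample(rng)
print(image.shape, label)  # (32, 32) and 0 or 1
```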

The authors [11] compare CNNs of 2, 4, or 6 layers trained in a traditional way or with meta-learning, showing several interesting results:

  1. The performance of traditional CNNs is similar to random guessing.
  2. Meta-learning significantly improves performance, suggesting that the model can learn the same-different relationship. A 2-layer CNN performs only slightly better than chance, but increasing the depth of the network improves performance to near-perfect accuracy.
Comparison between traditional training and meta-learning for CNNs. Image source from here

One of the most intriguing results of [11] is that the model can be trained in a leave-one-out setting (trained on 9 tasks, with one task held out) and still show out-of-distribution generalization. Thus, the model has learned abstract behavior that is rarely seen in such a small model (6 layers).

Out-of-distribution generalization for same-different classification. Image source from here

Conclusions

Although convolutional networks were inspired by how the human brain processes visual stimuli, they do not capture some of its basic capabilities. This is especially true when it comes to causal relations or abstract concepts. Some of these relationships can be learned by large models only with extensive training. This has led to the assumption that small CNNs cannot learn these relations due to a lack of architectural inductive bias. In recent years, efforts have been made to create new architectures that could have an advantage in learning relational reasoning. Yet most of these architectures fail to learn these kinds of relationships. Intriguingly, this can be overcome through the use of meta-learning.

The advantage of meta-learning is that it incentivizes more abstract learning. Meta-learning pushes toward generalization by trying to optimize for all tasks at the same time. To do this, learning more abstract features is favored (low-level features, such as the angles of a particular shape, are not useful for generalization and are disfavored). Meta-learning allows a shallow CNN to learn abstract behavior that would otherwise require many more parameters and much more training.

Shallow CNNs and the same-different relationship serve as a model system for studying higher cognitive functions. Meta-learning and other forms of training could be useful for improving the reasoning capabilities of these models.

Another thing!

You can look for my other articles on Medium, and you can also connect with or reach me on LinkedIn or Bluesky. Check this repository, which contains weekly updated ML & AI news, or here for other tutorials and here for AI reviews. I am open to collaborations and projects.

Reference

Here is the list of the principal references I consulted to write this article; only the first author of each work is cited.

  1. Lindsay, 2020, Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future, link
  2. Li, 2020, A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects, link
  3. He, 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, link
  4. Ollikka, 2024, A comparison between humans and AI at recognizing objects in unusual poses, link
  5. Premack, 1981, The codes of man and beasts, link
  6. Blote, 1999, Young children’s organizational strategies on a same–different task: A microgenetic study and a training study, link
  7. Lupker, 2015, Is there phonologically based priming in the same-different task? Evidence from Japanese-English bilinguals, link
  8. Gentner, 2021, Learning same and different relations: cross-species comparisons, link
  9. Kim, 2018, Not-so-clevr: learning same–different relations strains feedforward neural networks, link
  10. Puebla, 2021, Can deep convolutional neural networks support relational reasoning in the same-different task? link
  11. Gupta, 2025, Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation, link
  12. Tartaglini, 2023, Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations, link
  13. Finn, 2017, Model-agnostic meta-learning for fast adaptation of deep networks, link

Can Machines Dream? On the Creativity of Large Language Models
https://towardsdatascience.com/can-machines-dream-on-the-creativity-of-large-language-models-d1d20cf51939/
Fri, 31 Jan 2025

Exploring the Role of Hallucinations, Dependencies, and Imagination in AI Creativity

|LLM|CREATIVITY|AI|HALLUCINATION|
Image generated by the author using DALL-E

Creativity requires the courage to let go of certainties. – Erich Fromm

Creativity involves breaking out of established patterns in order to look at things in a different way. – Edward de Bono

Creativity, along with the ability to reason, is considered a uniquely human ability. The arrival of Large Language Models ([LLMs](https://en.wikipedia.org/wiki/Large_language_model)) and their ability to mimic human skills has begun to undermine this view. There has been much discussion about whether LLMs are capable of reasoning, but less about the creativity of these models.

Quantifying reasoning is easier (evaluation is conducted on problem-solving benchmark datasets), but quantifying creativity is more complex. Nevertheless, creativity is one of the activities that makes us human: writing books and screenplays, generating poetry and works of art, making groundbreaking discoveries, or devising theories all require creativity.

The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence?

A Requiem for the Transformer?

Although model creativity is a less explored topic, it is no less important. Therefore, in this article, we will focus on three main questions:

  • How creative are models?
  • On what does model creativity depend?
  • Can hallucinations help increase model creativity?

Artificial Intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.


Are LLMs creative?

Creativity is the ability to form novel and valuable ideas or works using one’s imagination. Products of Creativity may be intangible (e.g. an idea, scientific theory, literary work, musical composition, or joke), or a physical object (e.g. an invention, dish or meal, piece of jewelry, costume, a painting). – source: Wikipedia

In general, there is no agreement on a single definition of creativity. Most authors agree that creativity is the production of something new, original, and useful. This something new can be a product or an idea. More formally, Margaret Boden defines creativity as "the ability to come up with ideas or artifacts that are new, surprising and valuable" [1]

Defining value is an easier task. The LLM’s production of code is valuable when it performs its function properly. But is it new and surprising?

For an object, novelty refers to the dissimilarity between a manufactured artifact and others in its class. This is a problematic definition because it could result from a simple modification of existing objects or a new recombination. Thus, to be truly creative an object must be not only new (different from what previously existed) but also valuable (have some form of utility) and surprising (not be a simple variation or recombination). Since the output of an LLM is a text, what does it mean for a text to be creative?

To maintain consistency with the definition given above, we could define a creative text as a surprising elaboration that is not a simple variation or reworking of previous texts. LLMs are trained on a wide corpus of texts, and given an instruction they can generate a text in seconds. To be innovative the output text must be different from what is seen during pretraining (novelty) but also different from a simple variation (surprising). Since the decoding of an LLM has a stochastic component, an LLM will accidentally insert variations into the generated text.

Defining a text as surprising is complex, but it is the focal point for defining the creativity of an LLM. Boden [2] defines three types of creativity with respect to surprise:

  • Combinatorial creativity. Finding unfamiliar combinations of familiar ideas. For example, combining two genres that have not been combined previously.
  • Exploratory creativity. Finding new, unexplored solutions inside the current style of thinking. An example might be using an established narrative style, but introducing a unique twist within its confines (such as telling a classic love story from an unexpected point of view).
  • Transformational creativity. Changing the current style of thinking. Inventing a new way of presenting text, such as a novella written only in footnotes or with an innovative chronological order (for example, the research on new structures and patterns conducted by OuLiPo).

The autoregressive nature of these models should not allow them to generate anything surprising. Even if decoding has a stochastic component, the model follows the distribution on which it was trained (the pretraining texts) [3]. On the other hand, if instructed, an LLM can generate poetry about mathematical theorems, write a theorem in the style of Hemingway, and produce other examples that seem to meet the definition of surprise. In fact, on close inspection, these outputs come across as trivial and generally repetitive [4].

Recently, an article [5] took this issue further by trying to quantify the linguistic creativity of a text by reconstructing it from existing text snippets on the web. Human creativity is influenced by what we learn, but an author’s original text cannot be just attributed to some sources. If every text generated by an LLM can be mapped to other texts, it is overwhelming evidence of a lack of creativity.

You’re Not a Writer, ChatGPT – But You Sound Like One.

image source: here

They showed that humans exhibit a higher level of creativity than LLMs in all tasks (based on unique word and sentence combinations). For the authors, the small residue of unattributable creativity in LLMs comes from stochastic processes or from the fact that we do not know the entire pretraining dataset.

image source: here

In conclusion, at present LLMs do not show creativity. According to some authors [6–7], creativity is not only about what is achieved but also about how. In other words, creativity is a process that requires motivation, perception, learning, thinking, and communication [8]. Creativity is based on knowing and finding information, but also on transcending the status quo. This last step requires going beyond the imitation game (the models' autoregressive nature) to explore and challenge the current view, which in turn requires self-awareness, a purpose, self-assessment, and a model of the world. Some studies try to push toward these aspects, but for now this also means we are far from true creativity.

How Far Is AI from Human Intelligence?

What conditions are important for LLM creativity?

As we said above, LLMs are incapable of true creativity, but that does not mean we can’t improve the text quality they generate. We can describe three strategies with which to influence the output of a pre-trained LLM:

  • Acting on the hyper-parameters of an LLM.
  • Conducting additional training for an LLM.
  • Prompting strategy.

The first strategy is basically to alter the temperature of a model [9]:

Temperature controls the uncertainty or randomness in the generation process, leading to more diverse outcomes by balancing probabilities for candidate words. – source

Increasing the degree of randomness does not mean getting true creativity. Adjusting the temperature affects how confident or exploratory the model is when selecting its next token (word, phrase, etc.). At low temperatures (e.g., 0.1–0.5), the model generates deterministic, focused, and predictable outputs (i.e., it selects the most likely tokens and thus regurgitates more closely what it saw during training).

With low temperatures, the model is repetitive, unoriginal, and sounds robotic, but it is more factually correct. With high temperatures (e.g., 0.7–1.5), the model generates more diverse and unpredictable text (during decoding it samples lower-probability tokens). A high temperature is generally chosen for creative text or for generating novel ideas. At temperatures higher than 2, the model generates chaotic and nonsensical text.
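
A small numpy sketch of what temperature does at decoding time (the logits and vocabulary size are illustrative): logits are divided by the temperature before the softmax, so higher temperatures flatten the distribution and make lower-probability tokens more likely to be sampled.

```python
# Temperature-scaled sampling over a toy 4-token vocabulary.
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng) -> int:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([4.0, 2.0, 1.0, 0.5])     # raw scores for four candidate tokens
for t in (0.2, 1.0, 2.0):
    draws = [sample_with_temperature(logits, t, rng) for _ in range(1000)]
    # Low t: token 0 dominates; high t: the distribution over tokens flattens.
    print(t, np.bincount(draws, minlength=4) / 1000)
```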

This study [9] analyzed what happens when a model is asked to generate a story using the same prompt but varying the temperature. As the temperature increases, the generated stories become more semantically diverse.

image source: here

An increase in semantic difference does not imply a difference in content or an increase in creativity. As temperature increases, the stories may appear slightly more creative but lose coherence.

image source: here

In general, temperature does not allow the LLM to leverage different regions of the embedding space, but it does enable some novelty when generating limited samples (as is the case for any real-world application). We observe a weak positive correlation between temperature and novelty, and unsurprisingly, a negative correlation between temperature and coherence. Suggesting a tradeoff between novelty and coherence. – source

The authors [9] suggest that temperature does not allow the model to go beyond the training data distribution but merely leads it to stochastically generate novel variations. This comes at a price: the generated output loses coherence.

The second strategy exploits post-training techniques such as instruction tuning or alignment to human preferences. However, techniques such as RLHF or DPO reduce variety and thus often have the opposite effect [10]. The third strategy alters the instructions contained in the prompt and does not affect the ability to go beyond the data distribution. Prompt engineering helps the model make better use of its acquired knowledge.

Hallucinations as a creativity phenomenon

In the previous sections, we have seen how LLMs are not inherently creative, and there are not many alternatives to overcome this limitation.

Classically, hallucinations have been seen as a problem to be solved, but some authors [11–12] suggest that they can be looked at from another perspective: as creative phenomena.

Although hallucinations are not invariably detrimental, in certain instances, they can stimulate a model’s creativity. – source

A hallucination is a factually incorrect output. We could also see a hallucination as an unexpected element that might be of interest in creative writing, or even useful in fields where factuality is normally required (such as scientific research). In this paper [12], the authors note that "hallucinations" have historically led to scientific discoveries. For example, heliocentrism was considered a factual error at the time, and proposing it to explain the retrograde motion of the planets could then have been seen as a kind of hallucination. Similarly, chance events led to revolutionary discoveries such as penicillin.

image source: here

Research on human creativity indicates that creative thinking involves the activation of both the left prefrontal cortex (implicated in imaginative thinking) and the hippocampus (a region important for memory) [13]. In other words, human creativity seems related to recombining learned information with an imaginative element. In LLMs, hallucination could therefore introduce an imaginative element into the information recalled by the model.

As an example, this study [14] shows that hallucinations can be useful in a scientific field such as drug discovery. The authors provide prompts that contain a molecule description and then ask the LLM to classify the molecule by a certain chemical property, comparing prompts with a hallucinated description, with a faithful description, and with a baseline. Paradoxically, the prompts containing hallucinations seem to improve the model's performance on the downstream task.

image source: here

The Cybernetic Neuroscientist: Smarter Than Experts?

AI Planning or Serendipity? Where Do the Best Research Ideas Come From?

Parting thoughts

LLMs are not creative, and their reasoning ability is debated. This does not make them useless, but these limitations should be kept in mind. Especially now that LLMs and AI agents are being used to accomplish real-world tasks, the lack of reasoning and creativity is a serious limitation. Some interesting studies propose the use of agents to help researchers in drug discovery and chemistry [15–17]. The lack of creativity limits the automation of complex tasks or of the entire research process. Agents, however, can still be useful tools to automate the many tasks that do not require creativity.


If you have found this interesting:

You can look for my other articles, and you can also connect or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects and you can reach me on LinkedIn. You can also subscribe for free to get notified when I publish a new story.

Get an email whenever Salvatore Raieli publishes.

Here is the link to my GitHub repository, where I am collecting code and many resources related to Machine Learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Can I Really Trust You, ChatGPT? Bridging AI Confidence and Human Understanding

Beyond Text: Navigating Toward a Multimodal RAG Horizon

The Dream Machine: Decoding Why LLMs Hallucinate Reality

You Cache Only Once: Cache-Augmented Generation (CAG) Instead Of RAG

Reference

Here is the list of the principal references I consulted to write this article; only the first author of each work is cited.

  1. Franceschelli, 2023, On the Creativity of Large Language Models, link
  2. Boden, 2003, The Creative Mind, link
  3. M. Shanahan, 2022, Talking about large language models, link
  4. Hoel, 2022, The banality of ChatGPT, link
  5. Lu, 2024, AI as Humanity’s Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text, link
  6. Gaut, 2003, Creativity and imagination, link
  7. Floridi, 2020, GPT-3: Its Nature, Scope, Limits, and Consequences, link
  8. Rhodes, 1961, An Analysis of Creativity, link
  9. Peeperkorn, 2024, Is Temperature the Creativity Parameter of Large Language Models? link
  10. Kirk, 2024, Understanding the Effects of RLHF on LLM Generalisation and Diversity, link
  11. Wang, 2024, LightHouse: A Survey of AGI Hallucination, link
  12. Jiang, 2024, A Survey on Large Language Model Hallucination via a Creativity Perspective, link
  13. Benedek, 2014, To create or to recall? neural mechanisms underlying the generation of creative new ideas, link
  14. Yuan, 2025, Hallucinations Can Improve Large Language Models in Drug Discovery, link
  15. Swanson, 2024, The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation, link
  16. Kudiabor, 2024, Virtual lab powered by ‘AI scientists’ super-charges biomedical research, link
  17. Caldas Ramos, 2024, A Review of Large Language Models and Autonomous Agents in Chemistry, link

The Good, the Bad, and the Ugly: Memory for a Neural Network
https://towardsdatascience.com/the-good-the-bad-an-ugly-memory-for-a-neural-network-bac1f79e8dfd/
Tue, 17 Dec 2024

|ARTIFICIAL INTELLIGENCE|MEMORY|NEURAL NETWORK|LEARNING|

Memory can play tricks; to learn best it is not always good to memorize

image generated by the author using DALL-E

No man has a good enough memory to be a successful liar. – Abraham Lincoln

Memory is more indelible than ink. – Anita Loos

Memorization bad, generalization good. This is considered a dogma of artificial intelligence. But why? What is wrong with memorization?

Intuitively, a student who memorizes the whole book might still fail a test if the exercises are different from those in the book. Yet if memorizing does not mean learning, sometimes a little memory can still be beneficial: there is no point in learning complex rules just to remember a list of names of historical figures. You have to find the right balance. Something similar happens with neural networks. This article discusses the complex love/hate relationship between neural networks and memory.


Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.


A complex relationship between memory and learning

In general, neural networks and memorization have a complicated relationship. Neural networks tend to learn shortcuts or simple rules that are good for most of the data. Examples that fail to fit these rules are treated as exceptions by the model. These shortcuts can be a way to speed up training and quickly optimize the loss, but they often have unintended effects. For some authors [1], this is why neural networks are capable of extraordinary performance and spectacular failures at the same time:

One central observation is that many failure cases are not independent phenomena, but are instead connected in the sense that DNNs follow unintended "shortcut" strategies. While superficially successful, these strategies typically fail under slightly different circumstances. – source

image source: [2]

In short, [neural networks](https://github.com/SalvatoreRa/tutorial/blob/main/artificial%20intelligence/FAQ.md#:~:text=What%20are%20neural%20networks%3F) try to learn simple features at the expense of complex ones, even when the latter have greater predictive power (making the networks nonrobust). This also makes neural networks fail to produce a reliable confidence estimate and often arrive at suboptimal generalization [3].

Some of it is not even their fault. Training techniques such as gradient descent, regularization, large learning rate, and small batch size favor less complex models [4]. This is because complex models usually mean overfitting (i.e., the model memorizes patterns in the training data that are not useful for generalization). Less complex models should learn general rules that go beyond the training data. At the same time, reducing complexity favors learning heuristics. Heuristics, even if they are wrong, are good for most data (they are simple features or rules that the model can use for most data), and thus are favored in training.

A neural network generally has enough parameters to memorize the entire dataset [5]. On the one hand, it could memorize the dataset; on the other, the training process pushes it toward simple solutions. The result is that the network learns simple solutions valid for most of the dataset and memorizes the exceptions.

Why is this dangerous?

Looking at the stars at night, it is easy to conclude that they orbit the Earth. A geocentric model can explain the motion of almost all celestial bodies, with a few notable exceptions. The Ptolemaic model used epicycles to explain the exceptions, and the rest was explained by a simple rule.

Neural networks do the same: they look for a simple rule, and once it is found, the rest becomes an exception. When these simple rules are spurious correlations, this becomes a serious problem. For example, many neural networks designed for diagnosis fail because they learn spurious correlations and use them for predictions [6]:

The neural network that famously had reached a level of accuracy comparable to human dermatologists at diagnosing malignant skin lesions. However, a closer examination of the model’s saliency methods revealed that the single most influential thing this model was looking for in a picture of someone’s skin was the presence of a ruler. – source

The presence of the ruler means that the dermatologist has already identified a cancerous lesion and therefore the model is useless.

image source: [12]

However, it is not always so easy to diagnose when a neural network has learned spurious correlations. In this case, we may have a false sense of reliability of the neural network. Therefore, it is critical to understand whether a model that has achieved near-perfect accuracy has learned generalizable patterns or uses spurious correlation and memorization.

So there are several questions:

  • How to monitor the model’s generalization abilities? How to identify spurious correlations?
  • When do these spurious correlations impact training?
  • How can we correct this behavior?

The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence?

The quest for spurious correlation

A model’s ability to generalize is evaluated on a held-out set. Typically, the test set is a part of the dataset that the model does not see during training. So it is critical to avoid any data leakage or this evaluation will not be useful.

Spurious correlation refers to relationships between variables that appear to be statistically significant in the data but are not genuinely causal or reflective of a true underlying relationship. These patterns are generally present only in the training set but not in new or unseen data. This especially occurs if the training set is biased or does not represent the true distribution of the data.

image source: [7]

In general, it is assumed that during the initial training phase, the models learn these spurious correlations and memorize the examples. Only in the second stage does true generalization occur.

Grokking: Learning Is Generalization and Not Memorization

Therefore, different studies have tried to exploit this fact to create automatic systems for identifying spurious correlations. These approaches look for patterns that are learned early on or that are too simple to solve a complex problem [9–10]:

Based on this observation, we are skeptical of very simple solutions to complex problems and believe they will have poor o.o.d. generalization. – source

Because shortcuts are simple and learned early, these patterns can be used to identify them. Other methods try to correct the loss function to identify and eliminate these correlations [11].

The impact of spurious correlation

Spurious correlations reduce a model's ability to generalize. In itself this is bad, but it is usually diagnosable because the model will perform sub-optimally. In the presence of spurious correlations, a model first learns patterns that are useful for most examples. Once it achieves near-perfect accuracy on the majority examples, it begins to learn the minority examples [2]. In the classification case, this means the model learns a decision boundary that is not optimal but is still predictive.

If the model has enough parameters, or is in another situation where memorization is possible, it will learn the minority examples by heart. This causes the model to focus on memorizing those examples rather than continuing to learn the true core features.

Left panel: The model learns spurious correlations but is still capable of sub-optimal generalization. Central panel: The model learns minority examples and is unable to generalize. Right panel: The model generalizes and does not learn minority examples by heart. image source: [2]

Neural networks can exploit spurious features and memorize exceptions to achieve zero training loss, thereby avoiding learning more generalizable patterns. However, an interesting and somewhat controversial question arises: Is memorization always bad? – source

Not necessarily. Training may lead to memorization without affecting generalization. It all depends on the nature of the data and the dynamics of training. In general, we can have three cases [2]:

  • Good memorization. The model learns the true function underlying the data but also memorizes some residual noise in the training data. This type of memorization is benign because it does not compromise generalization. This phenomenon is also referred to as benign overfitting, where the model overfits the training data but still manages to generalize to unseen data.
  • Bad memorization. The model relies more on example-specific features than on true function; this reduces the learning of generalizable patterns and suboptimal performance.
  • Ugly memorization. The model goes into overfitting, learning a nonlinear and complex function that does not, however, serve to generalize.
image source: [2]

Thus, memorization reduces the network's generalization capabilities. The examples memorized by the network are either exceptions or outliers. This in itself can lead to reduced performance (more or less severe), but it is generally not enough to cause catastrophic drops. When the model learns spurious correlations (patterns that do not reflect the true underlying function of the data) and memorizes the rest, this is a recipe for disaster: the model relies on spurious correlations for predictions and on memorization for everything else.

Open the Artificial Brain: Sparse Autoencoders for LLM Inspection

Through the Uncanny Mirror: Do LLMs Remember Like the Human Mind?

How can we solve this problem?

Clearly, it is not easy to identify memorization phenomena during training; memorization is usually identified when the model is tested on the test set. Based on this observation, in [2] the authors propose Memorization-Aware Training (MAT) to address the problem of memorization in neural networks.

MAT seeks to identify memorization behaviors and use them to guide the training process. To do this, it uses predictions on held-out data (data kept aside for this purpose). This approach encourages the model to adjust its logits (the model's raw predictions) to be less sensitive to memorized data points. The validation set (or another dataset not used for training) is then used to check whether any memorization is occurring.

If the model shows a low error on the training set but a high error on this held-out set, it is probably memorizing rather than learning generalizable patterns. If memorization occurs, MAT discourages it by adding a regularization term to the loss function (regularization prevents the model from fitting the data too closely and thus from memorizing it).

The authors [2] define a self-influence score as a measure used to identify which data points the model has memorized during training. It quantifies the extent to which the prediction for a data point is influenced by that same data point. Intuitively, if the prediction for an example depends mostly on the example itself, the model has memorized it rather than predicted it using general rules.

Mathematically, the self-influence score is computed from how much the model's loss changes when the data point is perturbed or removed. If the self-influence score is high, the model is sensitive to that data point and might be overfitting or memorizing it. In contrast, a low score suggests that the model has learned general rules and is not overly specific to any one example. The authors show that, for minority classes, a model trained with classical gradient descent shows a high self-influence score and thus memorization. This behavior is corrected when using the MAT algorithm.
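
The exact estimator used in [2] is not reproduced here; a naive leave-one-out sketch on a toy scikit-learn model conveys the intuition of the score: how much worse the loss on a point becomes when that point is removed from training.

```python
# Naive leave-one-out illustration of self-influence (impractical for real models,
# but it captures the intuition: memorized points lose the most when excluded).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

def point_loss(model, x, label):
    return log_loss([label], model.predict_proba(x.reshape(1, -1)), labels=[0, 1])

full = LogisticRegression().fit(X, y)
scores = []
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    without = LogisticRegression().fit(X[mask], y[mask])
    # Self-influence proxy: increase in the loss on point i when i is excluded from training.
    scores.append(point_loss(without, X[i], y[i]) - point_loss(full, X[i], y[i]))

print("most self-influential (likely memorized) points:", np.argsort(scores)[-5:])
```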

image source: [2]

Parting thoughts

Learning is different from memorizing. Real learning means understanding general rules and learning how to apply them to solve a problem. Any student learns some formulas by heart before an exam. In itself this is not ideal, but it does not mean the student will fail the exam. They may, however, face exercises they cannot solve, since they apply these formulas without really understanding them.

For neural networks it is the same: if the model learns generalizable patterns and memorizes only a few examples, its performance will be only mildly affected. On the other hand, the model might learn spurious correlations and memorize the exceptions that do not fit these patterns. Spurious correlations are easier to learn because they represent simple solutions to complex problems. Neural networks try to optimize the loss as quickly as possible, and heuristics help in this aim. Having enough parameters, neural networks can memorize the rest of the examples and achieve a perfect loss on the training data. After that, they will fail spectacularly on unseen data.

Identifying during training which examples are memorized and which learned patterns are spurious correlations is not an easy task. Being able to identify memorization during training allows this behavior to be corrected. Since the phenomenon can be identified when testing the model on a test set, we can use a held-out set. In [2] they do exactly that: they use a held-out set to identify memorization and correct the direction of training using regularization.

Neural networks have a tendency to memorize. The more parameters they have, the greater the risk that they memorize examples. Large language models (LLMs) have a huge number of parameters and can hold a huge amount of data in memory. The relationship between memory, spurious correlation, and LLMs is not yet fully understood. There seems to be a balance between memorizing training data and learning generalizable patterns [13–15]. This remains an intriguing prospect for future study.

What do you think? Have you ever observed memorization or spurious correlations? Let me know in the comments.


If you have found this interesting:

You can look for my other articles, and you can also connect or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects and you can reach me on LinkedIn. You can also subscribe for free to get notified when I publish a new story.

Get an email whenever Salvatore Raieli publishes.

Here is the link to my GitHub repository, where I am collecting code and many resources related to Machine Learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

A Memory for All Transformers: Sharing to Perform Better

From Solution to Problem: The Reverse Path to Smarter AI

You’re Not a Writer, ChatGPT – But You Sound Like One.

The Cybernetic Neuroscientist: Smarter Than Experts?

Reference

Here is the list of the principal references I consulted to write this article; only the first author of each work is cited.

  1. Geirhos, 2020, Shortcut Learning in Deep Neural Networks, link
  2. Bayat, 2024, The Pitfalls of Memorization: When Memorization Hurts Generalization, link
  3. Shah, 2020, The Pitfalls of Simplicity Bias in Neural Networks, link
  4. Dherin, 2022, Why neural networks find simple solutions: the many regularizers of geometric complexity, link
  5. Zhang, 2016, Understanding deep learning requires rethinking generalization, link
  6. Roberts, 2021, Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans, link
  7. Kim, 2019, Learning Not to Learn: Training Deep Neural Networks with Biased Data, link
  8. Li, 2019, REPAIR: Removing Representation Bias by Dataset Resampling, link
  9. Dagaev, 2021, A Too-Good-to-be-True Prior to Reduce Shortcut Reliance, link
  10. Nam, 2020, Learning from Failure: Training Debiased Classifier from Biased Classifier, link
  11. Liu, 2023, Avoiding spurious correlations via logit correction, link
  12. Ye, 2024, Spurious Correlations in Machine Learning: A Survey, link
  13. Carlini, 2022, Quantifying Memorization Across Neural Language Models, link
  14. Schwarzschild, 2024, Rethinking LLM Memorization through the Lens of Adversarial Compression, link
  15. Wang, 2024, Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data, link

Trapped in the Net: Where is a Foundation Model for Graphs?
https://towardsdatascience.com/trapped-in-the-net-where-is-a-foundation-model-for-graphs-6154bd688d4c/
Mon, 25 Nov 2024

Disconnected from the other modalities, graphs wait for their AI revolution: is it coming?

|LLM|TRANSFORMER|FOUNDATION MODEL|GRAPH|NETWORK|
Image created by the author using DALL-E

"If the foundation is solid, everything else will follow."Unknown

"The loftier the building, the deeper must the foundation be laid."Thomas à Kempis

Foundation models have changed artificial intelligence in recent years. A foundation model is a model trained with huge amounts of data (usually by unsupervised learning) that can be adapted to different tasks. Models such as BERT or GPT brought about a revolution in which one model could be adapted for all tasks in a domain, simplifying access to AI and reducing the need for data for any single task. We have foundation models for text and other modalities, but for modalities such as graphs and tabular data, we do not. In this article, we discuss why we do not have a foundation model for graphs and how we might get one. Specifically, we will answer these questions:

  • Why do we want a foundation model for graphs? Why do we not have one?
  • Can we actually have a foundation model for graphs? And if so, how?

Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.


The quest for a graph foundation model

Foundation models have had a fundamental impact on the success of artificial intelligence in recent years. A foundation model is a large, pre-trained neural network that serves as a general-purpose model, capable of being fine-tuned for a wide range of downstream tasks. For example, large language models (LLMs) and wide CNNs have enabled great application development because of the ability of these models to be adapted to new tasks with little or no additional training.

image source: [2]

The significance of foundation models can be summarized by two words: emergence and homogenization. Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed. Homogenization indicates the consolidation of methodologies for building machine learning systems across a wide range of applications – source: [2]

The success of these foundation models thus stems from these two factors. Homogenization means that we have a single model that (with or without adaptation) can be used for all tasks; this requires some similarity between tasks and a general vocabulary that allows patterns to be transferred between them. Emergence means that, by training with enough data, the model also learns tasks for which it has not been explicitly trained.

image source: [2]

Emergent Abilities in AI: Are We Chasing a Myth?

The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence?

For example, LLMs treat language tasks such as question answering in terms of a shared word vocabulary, and they are all trained with a single task (next-word prediction). Trained on a huge amount of text, they learn a set of patterns and structures that can then be adapted to any other task.

This process has worked well with both text and images, and today it is the standard for these modalities. The real world, however, is not only composed of these modalities. For example, two types of data have not benefited from this revolution: tabular data and graph data.

In the former case, traditional machine learning methods are still considered superior in performance to deep learning. In the latter case, there are deep learning models (graph neural networks, GNNs) that can be used with graph data. In both of these modalities, LLMs are not superior to the methods previously in use.

Tabula Rasa: Why Do Tree-Based Algorithms Outperform Neural Networks

How the LLM Got Lost in the Network and Discovered Graph Reasoning

Why do we not have a graph foundation model?

In general, we can say that at present there is a lack of pre-trained Graph Foundation Models (GFMs) that could be used in the same way as LLMs. There have been attempts to use pre-trained GNNs as foundation models (adapting models already trained for other tasks) but they did not perform as well as hoped [1].

image source: [4]

Therefore, the key challenges in achieving the GFM narrow down to how we can find the graph vocabulary, the basic transferable units underlying graphs to encode the invariance on graphs. – source: [5]

The problem with graphs is that although they are ubiquitous, they represent complex, non-Euclidean relationships among entities. The advantage of graphs is that they can represent countless structural patterns, but this makes it complex to construct a shared vocabulary [4–5]. For example, a model trained on social networks will not generalize to a molecular graph at all (nor do we know what vocabulary is shared).

So, in summary, we are looking for a model that can be pre-trained with a large amount of data in an unsupervised manner and that has two main features: homogenization (the graph foundation model must be applicable to different graph tasks without having been explicitly trained for them) and emergence (the appearance of skills, such as graph reasoning, for which the model has not been trained). The main problem is that we have no idea what architecture to use or how to train such a model, since we do not even know what vocabulary could encode transferable patterns shared among different graph tasks and domains.

image source: [4]

Graph ML: A Gentle Introduction to Graphs

A common language for different graphs

There have been some attempts to look for transferable patterns among different graphs. A notable example is the use of graphon theory. Graphs can be approximated by graphons, which represent their limiting behavior: graphons serve as a mathematical tool to model the structural properties of graphs as they grow infinitely large. In fact, a graph can be generated from a graphon (a graphon provides probabilistic rules for defining the connections between nodes, from which edges are then sampled). So, in theory, large graphs could have graphons in common [6–8].

An example of a GNN that uses graphon theory to create a transferable network. Image source: [6]

Despite the elegance of this theory, these models usually perform poorly on real-world datasets or in cross-domain settings. Other approaches have attempted to use subgraph structures instead [9–10]. According to these works, one can use localized subgraphs as transferable patterns within a graph vocabulary. These approaches seem to work better but are often time- and memory-intensive. It is difficult to extract these subgraphs, and GNNs fail to identify critical substructures within them, reducing the feasibility of the approach [11].

These approaches have failed because they do not work well with real-world graphs, do not capture local patterns, or are too expensive. So we want a system to get a vocabulary that has three characteristics:

  • Efficiency. We don’t want it to be too expensive in terms of memory and computation.
  • Expressiveness. It must be able to capture local patterns or motifs.
  • Learnability. It must be learnable and be able to learn even elusive patterns.

Actually, under the hood GNNs do learn local patterns and capture them in their embeddings. However, these local patterns are not the subgraphs discussed above but subtrees called computation trees. In message-passing GNNs, for each node we can construct a computation tree that contains its neighbors [12]. Computation trees have the properties we desire: they are efficient to extract (a GNN does this automatically), they express local patterns (and we can represent a graph as a multiset of them), and they are also capable of expressing elusive patterns.

image from [12]
example of computation graph, image from: [13], license: here
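
A small sketch of the idea (the toy graph is an assumption for illustration): unrolling a node's neighborhood to a fixed depth gives its computation tree, which is exactly what that many rounds of message passing aggregate for that node.

```python
# Unroll a node's neighborhood into its depth-k computation tree.
from typing import Dict, List

Graph = Dict[int, List[int]]

def computation_tree(graph: Graph, node: int, depth: int):
    """Return a nested (node, children) structure: the depth-`depth` computation tree of `node`."""
    if depth == 0:
        return (node, [])
    children = [computation_tree(graph, nb, depth - 1) for nb in graph[node]]
    return (node, children)

toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # adjacency list of a toy graph
print(computation_tree(toy_graph, 0, 2))
# (0, [(1, [(0, []), (2, [])]), (2, [(0, []), (1, []), (3, [])])])
```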

So we can treat computation trees as tokens within a graph vocabulary [1]. This offers two advantages: they preserve the essential structural information of the graph, and they can be used for various tasks. In fact, GNNs can be used for node-, link-, and graph-level tasks, but the learning process remains the same.

image from [1]

If two nodes have similar computation trees, they represent similar phenomena. If this occurs in two different graphs, we can transfer these patterns and thus adapt our model. Also, similar computation trees should have similar embeddings, which simplifies the work:

In particular, the distance between two computation trees is closely correlated to the similarity of their subtrees, where higher subtree similarity results in a closer distance. This suggests that computation trees with similar structures are likely to have similar embeddings, which enhances their transferability – source [1]

This can be easily verified. In fact, there is a correlation between computation tree similarity and transferability in real-world graphs. This means that we can transfer what a model learns.

image from [1]

So, in this approach [1], the model is pre-trained on a cross-domain graph database with a generic task (computation tree reconstruction), which can be viewed as analogous to an LLM learning to reconstruct a sequence. After that, the pre-trained model can be applied to another graph. The model learns an embedding of all these motifs and then uses this knowledge for other tasks.

image from [1]

Now, this model is still a message-passing GNN and not a graph transformer, though we have all the elements we would need for a graph foundation model: a set of tokens, a task to train on with huge amounts of data, and an embedding. Moreover, transformers are graph neural networks [14]:

To make the connection more explicit, consider a sentence as a fully-connected graph, where each word is connected to every other word. Now, we can use a GNN to build features for each node (word) in the graph (sentence), which we can then perform NLP tasks with. Broadly, this is what Transformers are doing: they are GNNs with multi-head attention as the neighbourhood aggregation function. – [14]

The attention block can be seen as a GNN layer in how it aggregates and processes information from neighboring nodes (here, the other tokens in the sequence). For each token (node), the representation is updated as a weighted sum over the other tokens (nodes), taking the context into account, which is exactly what a GNN layer does with a node's neighbors.
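A minimal single-head sketch makes the analogy explicit: self-attention is a weighted sum over all other tokens, i.e., neighborhood aggregation on a fully connected token graph. The weight matrices here are random placeholders, not a trained model:

```python
import torch
import torch.nn.functional as F

def attention_as_message_passing(x, W_q, W_k, W_v):
    """Single-head self-attention written as neighborhood aggregation.
    x: (n_tokens, d) token/node features; every token attends to every
    other token, so the 'graph' is fully connected."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise 'edge' weights
    weights = F.softmax(scores, dim=-1)       # normalized over the neighborhood
    return weights @ v                        # weighted sum of neighbor messages

d = 8
x = torch.randn(5, d)                         # 5 tokens in the 'sentence graph'
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
out = attention_as_message_passing(x, W_q, W_k, W_v)
print(out.shape)  # torch.Size([5, 8])
```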

Parting thoughts

Foundation models and transfer learning are two paradigms that defined AI as we see it today. Foundation models allow a single model to be trained on a large amount of generic data and then adapted, with great performance, to tasks where data is sparse. This versatility is one of the key reasons why AI has moved from a research product to a consumer product. Foundation models have become the standard because, although they are expensive to train, it costs less to adapt them than to train a model for each task. In addition, they have reached state-of-the-art results across benchmarks, and their performance improves with scale.

A Requiem for the Transformer?

Not all modalities have enjoyed the benefits of a foundation model. This is the case with both tabular data and graphs. For tabular data, it is not yet clear whether deep learning is superior to traditional machine learning (XGBoost). For graphs, on the other hand, graph neural networks work very well, but the lack of a vocabulary of transferable patterns has not allowed the creation of foundation models. Several studies suggest that this is possible, and it has been attempted in the past with less than stellar results. New ideas seem to show that we are finally close.

What do you think? How do you think foundation models for graphs can be achieved? Let me know in the comments


If you have found this interesting:

You can look for my other articles, and you can also connect or reach out to me on LinkedIn; I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.


Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence?

The Art of LLM Bonsai: How to Make Your LLM Small and Still Beautiful

Traditional ML Still Reigns: Why LLMs Struggle in Clinical Prediction?

Open the Artificial Brain: Sparse Autoencoders for LLM Inspection

Reference

Here is the list of the principal references I consulted to write this article, only the first name for an article is cited.

  1. Wang, 2024, GFT: Graph Foundation Model with Transferable Tree Vocabulary, link
  2. Bommasani, 2021, On the opportunities and risks of foundation models, link
  3. Zhou, 2023, A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT, link
  4. Liu, 2024, Towards Graph Foundation Models: A Survey and Beyond, link
  5. Mao, 2024, Position: Graph Foundation Models are Already Here, link
  6. Cao, 2023, When to Pre-Train Graph Neural Networks? From Data Generation Perspective! link
  7. Ruiz, 2020, Graphon Neural Networks and the Transferability of Graph Neural Networks, link
  8. Levie, 2019, Transferability of Spectral Graph Convolutional Neural Networks, link
  9. Zhu, 2020, Transfer Learning of Graph Neural Networks with Ego-graph Information Maximization, link
  10. Qiu, 2020, GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training, link
  11. Zhang, 2024, Beyond Weisfeiler-Lehman: A Quantitative Framework for GNN Expressiveness, link
  12. Hamilton, 2017, Inductive Representation Learning on Large Graphs, link
  13. Hou, 2022, A Graph Neural Network Approach for Caching Performance Optimization in NDN Networks, link
  14. Joshi, 2020, Transformers are Graph Neural Networks, link

The post Trapped in the Net: Where is a Foundation Model for Graphs? appeared first on Towards Data Science.

Open the Artificial Brain: Sparse Autoencoders for LLM Inspection https://towardsdatascience.com/open-the-artificial-brain-sparse-autoencoders-for-llm-inspection-c845f2a3f786/ Sat, 16 Nov 2024 15:02:40 +0000 https://towardsdatascience.com/open-the-artificial-brain-sparse-autoencoders-for-llm-inspection-c845f2a3f786/ A deep dive into LLM visualization and interpretation using sparse autoencoders


|LLM|INTERPRETABILITY|SPARSE AUTOENCODERS|XAI|
Image created by the author using DALL-E

All things are subject to interpretation; whichever interpretation prevails at a given time is a function of power and not truth. – Friedrich Nietzsche

As AI systems grow in scale, it is increasingly difficult and pressing to understand their mechanisms. Today, there are discussions about the reasoning capabilities of models, potential biases, hallucinations, and other risks and limitations of Large Language Models (LLMs).

The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence?

Chat Quijote and the Windmills: Navigating AI Hallucinations on the Path to Accuracy

Most evaluations are conducted by analyzing performance on various benchmarks. The major limitation of these approaches is that they treat an LLM as if it were a black box. The answer to most of our questions requires that we open this box and observe how its components work with each other. The main problem lies in the difficulty of analyzing a model composed of hundreds of layers and billions of parameters. A second problem is the lack of a definition of what the fundamental unit of such a complex model is. Defining this fundamental unit and understanding how to intervene on it could allow us to correct unintended behaviors.

So in this article, we will address these questions:

  • What are the fundamental components of an LLM?
  • How can we analyze these internal features? What tools?
  • How can we evaluate these tools?
  • What do these tools learn? Can we visualize the internal space?

Feature representations in neural networks

Defining features in neural networks is a challenging task. Traditionally, in machine learning, features are described as attributes derived directly from the dataset. This definition fits well when the discussion focuses on perceptual systems, where features closely map to input data. In LLMs, or other complex systems capable of abstraction, features might instead emerge internally within the model [1]. The description of these features is still not entirely clear. Still, for some authors, it can be summarized as, "Features are the fundamental units of neural network representations that cannot be further decomposed into simpler independent factors" [2]. The problem with this definition is: what are these fundamental units?

In this context, a fundamental unit (or feature) could represent something that encodes a concept (a concept could be high-level such as "sun" or "beauty"). These concepts could then be the building blocks of the internal representation learned by the model.

What is the nature of these features?

According to this article by Anthropic [3], neural networks represent meaningful concepts through directions in activation space. In simple words, the output of a layer of a neural network can be seen as a set of points in the activation space. This is clearly difficult to visualize because we are talking about hundreds if not thousands of directions. In word embeddings, it had already been observed that these directions have meaning and that vectors can be used for arithmetic operations [4].
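The classic example is vector arithmetic such as king - man + woman ≈ queen. The tiny 3-dimensional vectors below are invented purely to illustrate the arithmetic; real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# toy 3-d embeddings, made up only to show the vector arithmetic
man, woman = np.array([1.0, 0.2, 0.0]), np.array([1.0, 0.9, 0.0])
king, queen = np.array([0.2, 0.2, 1.0]), np.array([0.2, 0.9, 1.0])

analogy = king - man + woman
print(cosine(analogy, queen))  # 1.0: the analogy vector points at 'queen'
```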

image source: [4]

So in theory each direction is correlated with a concept (and the more a point is in that direction, the more that concept should be present in the input). The problem is the relationship between these concepts and the layer neurons:

  • Privileged versus non-privileged basis. If a neuron is meaningful (represents a meaningful concept), its basis vector should functionally differ from the other directions in the representation.
  • Monosemantic and polysemantic neurons. A neuron that corresponds to only one semantic concept is called monosemantic. Only one concept in the input activates that neuron, and by activating or ablating that neuron we impact only one feature. A polysemantic neuron is associated with multiple concepts (e.g., a neuron might be activated by different images such as cats but also houses) [6].
image source: [2]

In transformers and LLMs, neurons are polysemantic, thus making it difficult to understand how neural networks process information and how to intervene on representation features [7]. However, the polysemanticity of neurons has the advantage that we can use fewer neurons to represent more concepts. According to the superposition hypothesis, the neural network leverages high-dimensional spaces to represent more features than the actual count of neurons. In this way, features are no longer orthogonal and thus interfere with each other, but this problem seems to be mitigated by nonlinear functions [3,5]. The superposition hypothesis suggests that a polysemantic model can be seen as a compressed version of a hypothetically larger neural network where each neuron represents a single concept [2].

a polysemantic model can be viewed as a compressed simulation of a larger, sparser network. image source: [2]

Features in superposition are difficult to interpret: they are spread across several neurons, and altering one feature also impacts other features. So we need a way to disentangle features.

Cosine Similarity and Embeddings Are Still in Love?

Sparse Autoencoders for LLM Interpretability

Sparse Autoencoders (SAEs) have been increasingly used in recent years as a way of decomposing a neural network into comprehensible components. SAEs are similar to classical autoencoders (AEs), with the difference that the latter are designed to compress and then reconstruct the data. For example, if we have a dataset with 100 initial dimensions, a classical AE will have an encoder layer of, say, 25 neurons, learning a compressed vector of size 25 for each example (a 4-fold reduction). This compressed version obviously loses information but is useful for reducing the size of our input.

An SAE, on the other hand, has a hidden layer that is larger than the size of the input. In addition, we use a penalty during training to incentivize sparsity (the internal vector will then be sparse, i.e., contain mostly zero values). So if the input has a dimensionality of 100, we will have a learned vector of at least 200, a good portion of which will be zero elements. The goal is to apply SAEs to the intermediate activations of a neural network. In the case of an LLM, for each token at each layer we have a set of activations, so we apply an SAE to this representation [8]. If one layer has 100 activations and the SAE's hidden layer has 200, we have an expansion factor of 2. This process has to be done for each layer of the neural network we want to study. How do we train this SAE?

image source: [2]

Our training data comes from a wide range of text fed to the model we want to study: for each batch, we extract the activations and use them to train our SAE. The loss function is the standard AE loss, based on input reconstruction [9]. The purpose of this approach is to decompose neural network activations into disentangled component features. By forcing sparsity into our SAE (via an L1 penalty), we aim to learn a dictionary of monosemantic neurons corresponding to features. In simple words, the idea is to have a single neuron encoding a single feature and to represent an LLM activation as a linear combination of a few dictionary vectors.
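A minimal sketch of such an SAE and its loss might look like the following (PyTorch). The dimensions, the expansion factor, and the plain L1 penalty are simplifying assumptions; real implementations add further refinements:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete hidden layer plus an L1 sparsity penalty.
    d_model is the size of the LLM activations we want to decompose;
    expansion is how many times larger the dictionary is than d_model."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        d_hidden = d_model * expansion
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations):
        features = F.relu(self.encoder(activations))   # sparse feature codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction error keeps the codes faithful to the activations;
    # the L1 term pushes most feature activations to zero (sparsity).
    mse = F.mse_loss(reconstruction, activations)
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# toy usage: pretend these are residual-stream activations of one layer
acts = torch.randn(32, 512)
sae = SparseAutoencoder(d_model=512, expansion=4)
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
loss.backward()
```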

image source: [8]

One clarification: the SAE is not optimized for interpretability during training. Instead, interpretable features emerge as a side effect of the sparsity and reconstruction objectives.

How do we know what a feature in an SAE represents?

Well, we look at the inputs that maximally activate the feature and manually try to figure out what they have in common. In this work, Anthropic trained an SAE on Claude Sonnet and found features that activate on images and text related to the Golden Gate Bridge [10, 11]. Other features may be activated by rhetorical figures, grammatical concepts (relative clauses, prepositional phrases, and so on), or concepts that are more abstract still.
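In code, this inspection step is little more than a top-k lookup over the feature activations. The tensor shapes and snippet list below are hypothetical placeholders:

```python
import torch

def top_activating_examples(feature_acts, snippets, feature_id, k=5):
    """Return the k input snippets that most strongly activate one SAE feature.
    `feature_acts` is an (n_examples, n_features) tensor of SAE activations,
    `snippets` the corresponding text excerpts (both placeholders here)."""
    values, indices = feature_acts[:, feature_id].topk(k)
    return [(snippets[i], v.item()) for v, i in zip(values, indices.tolist())]

# toy usage with random activations and placeholder snippets
acts = torch.rand(100, 2048)
texts = [f"snippet {i}" for i in range(100)]
for text, value in top_activating_examples(acts, texts, feature_id=7, k=3):
    print(f"{value:.3f}  {text}")
```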

an example of an activated feature from GPT-2. screenshot from: [12], license: here

These features have a causal impact on the model: activating or blocking them changes the behavior of the LLM. For example, Anthropic shows that clamping the Golden Gate Bridge feature to 10x its maximum activation induces a change in behavior [10, 11]. By posing a question to the model ("What is your physical form?") the response varies from before clamping ("I don’t actually have a physical form. I am an Artificial Intelligence. I exist as software without a physical body or avatar") to after clamping ("I am the Golden Gate Bridge, a famous suspension bridge spanning the San Francisco Bay. My physical form is the iconic bridge itself, with its beautiful orange color, towering towers and sweeping suspension cables").

Thus SAEs not only allow features to be identified but also allow them to be mapped back onto activations, enabling causal interventions. In this paper [17], Anthropic exploits this idea to modify certain features implicated in social bias and observe how the model changes its behavior. Over a certain range, feature steering can steer an LLM without hurting model performance (beyond a certain point, though, other capabilities start to degrade).
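Conceptually, steering amounts to clamping one feature in the SAE's code and decoding back into activation space, as in the sketch below (which reuses the SparseAutoencoder sketched earlier; patching the result back into the LLM's forward pass is the model-specific part and is not shown):

```python
import torch

def steer_with_feature(sae, activations, feature_id, value):
    """Clamp one SAE feature to a fixed value and decode back into the
    activation space (sae is assumed to be the SparseAutoencoder above)."""
    _, feats = sae(activations)
    feats = feats.clone()
    feats[:, feature_id] = value          # e.g. several times its usual maximum
    return sae.decoder(feats)

# the steered activations would then replace the original ones at the layer
# the SAE was trained on, changing the model's downstream behavior
```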

One note: SAEs are not only used for LLMs but can also be applied to other models, such as convolutional networks [14].

image source: [14]

Speak About Yourself: Using SAEs and LLMs to Decode the Inner Workings of LLMs

How to evaluate SAE

The main problem with SAEs remains their evaluation. Indeed, we have no ground truth in natural language to evaluate the quality of learned features. The evaluation of these features is subjective, and it is up to the researcher to interpret the meaning of each feature.

Explaining the latents of SAEs trained on models like Llama 3.1 7b or Gemma 2 9b requires the generation of millions of explanations. As an example, the most extensive open-source set of SAEs available, Gemmascope, includes SAEs for all layers of Gemma 2 9b and Gemma 2 2b and would require explaining tens of millions of latents. – source: [13]

Measuring the quality of features and SAEs is difficult precisely because of the lack of a gold-standard dictionary. Most work has focused on demonstrating the quality of SAEs as an approach on toy datasets. But if we want to use SAEs as diagnostic tools or to intervene on model features, we need to know the quality of the learned representation and find a better way to identify what the features mean.

It has been suggested to create datasets specifically for testing features, that is, ground-truth benchmarks. One interesting approach uses board games: with a synthetic dataset, all ground-truth features are known, and LMs are trained on board game transcripts. This way, one can test how much knowledge the SAEs capture [15].

image source: [15]

Another promising approach is to use LLMs to interpret features:

One of the first approaches to automated interpretability focused on explaining neurons of GPT-2 using GPT-4. GPT-4 was shown examples of contexts where a given neuron was active and was tasked to provide a short explanation that could capture the activation patterns. To evaluate if a given explanation captured the behavior of the neuron, GPT-4 was tasked to predict the activations of the neuron in a given context having access to that explanation. [13]

image source: [13]

The SAE geometry

The effectiveness of these models also comes from understanding their structure and what they have learned. With some of these SAEs being made public [19], some studies have focused on the geometric structure of the concepts extracted from LLMs. One of the first interesting results is an "atomic" structure similar to that seen in word embeddings:

By this we mean geometric structure reflecting semantic relations between concepts, generalizing the classic example of (a, b, c, d)= (man, woman, king, queen) forming an approximate parallelogram where b − a ≈ d − c. [18]

These structures seem to be found in layers 0 and 1 of the LLMs, where SAE features represent single words. Using dimensionality reduction techniques, we can obtain clusters of features with similar semantic functions.
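As a rough sketch of this kind of analysis, one could reduce the SAE's decoder directions and cluster them; the random matrix below is a stand-in for the real decoder weights of a public SAE such as those in [19]:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical stand-in for the learned dictionary: one decoder direction per
# SAE feature (in practice, load the decoder weights of a released SAE).
feature_directions = np.random.randn(4096, 768)

reduced = PCA(n_components=50).fit_transform(feature_directions)
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(reduced)
print(np.bincount(labels))  # how many features fall into each cluster
```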

image source: [18]

In this study [18], the authors also analyze whether functionally similar groups of SAE features (which tend to fire together) are also geometrically similar (and thus form the equivalents of "lobes"). In the human brain, functionally similar groups of neurons are in fact located in specialized areas: neurons involved in speech production sit in Broca’s area, neurons involved in vision in the visual cortex, and so on. In the study, the authors start from the co-occurrence of SAE features that fire for the same document and check whether functionally similar "lobes" can be identified. These functional "lobes" do appear to be present and show spatial modularity.

image source: [18]

Another interesting finding is that the middle layers seem to act as a bottleneck, compressing information (according to the authors, for a more efficient representation of high-level abstractions). The middle layers are thus a transitional stage between atomic features (concepts tied more to single words) and the more abstract, complex concepts of the late layers.

image source: [18]

Through the Uncanny Mirror: Do LLMs Remember Like the Human Mind?

Making Language Models Similar to the Human Brain

Parting thoughts

In this article, we discussed the complexity of defining features within a neural network model. Motivated by this search for interpretability, a new paradigm of mechanistic interpretability has evolved in recent years, where features that emerge within models can be defined and studied. In this line of research, we presented SAEs. SAEs can be seen (still with limitations) as diagnostic tools and, at the same time, as a means to conduct interventions within LLMs (and other models). We have seen how they can be evaluated and discussed their internal representation.

This is not the endpoint. SAEs have revolutionized our view of the inner workings of LLMs, but there is still much exciting research ahead. In conclusion, this article gives a perspective on and introduction to an intriguing and evolving field.

Research on SAEs is moving forward both to reduce limitations and to increase applications. For example, SAEs are being applied today to other types of Transformer, and an intriguing application is applying them to protein language models (models, such as AlphaFold, that learn the structure of proteins) [22].

Recently, Anthropic presented a new variant of SAEs, sparse crosscoders, which extends their capabilities [20, 21]. Sparse crosscoders can be applied across multiple layers and thus learn features that are spread across layers, simplifying circuits and making it possible to monitor what happens when fine-tuning a model.

What do you think about it? Have you used or planning to use SAEs? To which application would you like to apply SAEs? Let me know in the comments


If you have found this interesting:

You can look for my other articles, and you can also connect or reach out to me on LinkedIn; I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.


Here is the link to my GitHub repository, where I am collecting code and many resources related to Machine Learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

A Requiem for the Transformer?

What Is The Best Therapy For a Hallucinating AI Patient?

LLMs and the Student Dilemma: Learning to Solve or Learning to Remember?

You Know Nothing, John LLM: Why Do You Answer Anyway?


Reference

Here is the list of the principal references I consulted to write this article, only the first name for an article is cited.

  1. Olah, 2022, Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases, link
  2. Bereska, 2024, Mechanistic Interpretability for AI Safety A Review, link
  3. Anthropic, 2022, Toy Models of Superposition, link
  4. Mikolov, 2013, Linguistic Regularities in Continuous Space Word Representations, link
  5. Scherlis, 2022, Polysemanticity and Capacity in Neural Networks, link
  6. Yan, 2024, Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective, link
  7. LessWrong, 2022, Engineering Monosemanticity in Toy Models, link
  8. Cunningham, 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models, link
  9. SAELens, GitHub repository, link
  10. Anthropic, 2024, Golden Gate Claude, link
  11. Templeton, 2024, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, link
  12. OpenAI, 2024, sparse_autoencoder, link
  13. Paulo, 2024, Automatically Interpreting Millions of Features in Large Language Models, link
  14. Gorton, 2024, The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision, link
  15. Karvonen, 2024, Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models, link
  16. Anthropic, 2024, Evaluating feature steering: A case study in mitigating social biases, link
  17. Anthropic, 2024, Evaluating feature steering: A case study in mitigating social biases, link
  18. Li, 2024, The Geometry of Concepts: Sparse Autoencoder Feature Structure, link
  19. Lieberum, 2024, Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, link
  20. Anthropic, 2024, Sparse Crosscoders for Cross-Layer Features and Model Diffing, link
  21. LessWrong, 2024, Open Source Replication of Anthropic’s Crosscoder paper for model-diffing, link
  22. Interprot, 2024, link

The post Open the Artificial Brain: Sparse Autoencoders for LLM Inspection appeared first on Towards Data Science.

The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence? https://towardsdatascience.com/the-savant-syndrome-is-pattern-recognition-equivalent-to-intelligence-242aab928152/ Thu, 31 Oct 2024 17:47:55 +0000 https://towardsdatascience.com/the-savant-syndrome-is-pattern-recognition-equivalent-to-intelligence-242aab928152/ Exploring the limits of artificial intelligence: why mastering patterns may not equal genuine reasoning


|LLM|INTELLIGENCE|REASONING|
image generated by the author using DALL-E

I have hardly ever known a mathematician who was capable of reasoning. – Plato

Reasoning draws a conclusion, but does not make the conclusion certain, unless the mind discovers it by the path of experience. – Roger Bacon

Large Language Models (LLMs) have shown remarkable capabilities, especially for classical tasks in natural language processing (such as question answering). Surprisingly, they have also shown improvements in complex tasks requiring reasoning (such as coding and mathematics). These capabilities have long been considered exclusive to humans, so claiming that LLMs can solve tasks that require reasoning has opened a heated debate.

Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers?

Reasoning capabilities are crucial to enable AI systems to interact with humans and to be usable in critical tasks. Reasoning requires thinking logically, drawing inferences, solving problems, and making decisions from the available information. Similar skills are needed for models that can really help us in scientific discovery, healthcare, finance, and education.

With the release of new models, this debate has become even more heated. With OpenAI o1, there has been strong interest in training models with Chain-of-Thought (CoT) to improve reasoning. The results of CoT-trained LLMs have led some companies to declare that today’s LLMs possess reasoning capabilities and that AGI is getting closer.

So today we have a great debate: on the one hand, companies and some researchers claim that models possess reasoning capabilities; on the other hand, others define LLMs as stochastic parrots.

A Requiem for the Transformer?

OpenAI’s New ‘Reasoning’ AI Models Arrived: Will They Survive the Hype?

In this article we will focus on trying to answer these questions:

  • What does reasoning mean?
  • Do LLMs possess reasoning or are they just parrots?
  • Are we really measuring reasoning in the right way?

A definition of reasoning?

Reasoning is the fundamental cognitive process of drawing conclusions or making decisions based on available information, logic, and analysis. According to Aristotle, reasoning can be divided into two types: deductive reasoning, which derives conclusions that necessarily follow from general premises, and inductive reasoning, which draws general conclusions from particular observations.

For a long time, it was suggested that only human beings were capable of reasoning. Today it has been shown that primates, octopuses, and birds also exhibit basic forms of reasoning, such as making decisions or solving problems.

In general, reasoning is supposed to be the process of solving complex problems or making decisions. Complex problem-solving requires identifying the problem, dividing it into subproblems, finding patterns, and then choosing the best solution. Decision-making similarly requires identifying problems and patterns and evaluating alternatives before choosing the best solution.

The problem with these definitions is that they are not entirely clear. Moreover, according to these definitions, LLMs could also be considered capable of reasoning.

Are LLMs able to reason?

On benchmarks that measure reasoning skills (such as GLUE, SuperGLUE, and Hellaswag), LLMs have outperformed humans. For some, this means that LLMs can conduct reasoning and draw logical conclusions.

These claimed reasoning capabilities rest mainly on three points:

  • LLMs show strong results on all the benchmarks dedicated to reasoning.
  • New properties emerge with increasing parameters, training tokens, and compute budget.
  • Techniques such as CoT allow the model to express its potential.

So if we want to claim that LLMs are incapable of reasoning, we have to challenge these claims.

LLMs’ surprising results on reasoning benchmarks

Of course, when someone claims that LLMs do not reason, proponents of imminent AGI respond, "Look at the results on reasoning benchmarks." To paraphrase the duck test: if it solves problems like a human, decides like a human, and wins on reasoning benchmarks, then it probably reasons like a human.

Other authors have questioned this conclusion [1]. While on a superficial level models seem capable of complex reasoning, on closer inspection they rely on probabilistic pattern matching rather than formal reasoning.

A strong token bias suggests that the model is relying on superficial patterns in the input rather than truly understanding the underlying reasoning task. – source

In other words, these brittle performances show that LLMs fail to generalize when encountering new examples that differ from the patterns seen during training. Changing the tokens in the examples leads to logical fallacies (since the models can no longer map the example to what was seen in training). The models are therefore highly sensitive and fragile to the specific examples on which they are tested (which would explain why they sometimes seem to show great reasoning ability and sometimes fail spectacularly).

This fragility is highlighted by perturbing the example tokens, which leads to the LLM’s failure to solve the problem (so its "reasoning" depended on those tokens and on mapping them to what it had seen in the training set). This is confirmed by a correlation between an example’s frequency in the training data and test performance [8].

"the classic "twenty-five horses" problem in graph theory. The top two sub-figures, generated by GPT-4o for illustration purposes only1 , demonstrate the concept by altering the name "horses" to "bunnies", irrelevant to the problem's underlying logic. The bottom two sub-figures show experimental results in GPT-4 and Claude, where performance significantly drops due to perturbations in animal names and numbers." -image source: here
"the classic "twenty-five horses" problem in graph theory. The top two sub-figures, generated by GPT-4o for illustration purposes only1 , demonstrate the concept by altering the name "horses" to "bunnies", irrelevant to the problem’s underlying logic. The bottom two sub-figures show experimental results in GPT-4 and Claude, where performance significantly drops due to perturbations in animal names and numbers." -image source: here

This phenomenon is called prompt sensitivity (a different response to prompts that are semantically equivalent) [11–12]. This suggests that the model responds better to prompts that are more similar to the text seen during training.

LLMs are also sensitive to noise [2]. In fact, an LLM is easily distracted by irrelevant context, which leads to degraded reasoning performance. Moreover, the noise effect is not canceled out even by prompting techniques specialized for improving reasoning. This suggests that disturbing the mapping with noise impacts the model’s ability to find patterns in its memory.

Intelligence is an emergent property

For many, intelligence is an emergent property. Biological systems naturally tend to become more complex and acquire new capabilities, or they are swept away by evolutionary pressure. The evolutionary process thus leads to increasingly intelligent or more specialized beings. Intelligence has therefore evolved under this pressure, and since it requires resources, the brain has grown to a critical level to support it. For some, the loss function during model training acts as an analogous evolutionary pressure: once models have enough ‘neurons’ they can develop reasoning skills (in technical jargon, reasoning properties emerge with scale).

As said, this increased capacity for reasoning is attributed to increasing scale (whether of parameters or training tokens), and for several authors, reasoning ability is an emergent property that needs a certain threshold of parameters to emerge. However, later studies suggest that emergent properties in LLMs can be a measurement artifact, and with them, the whole theory of emergent reasoning is called into question [3, 13].

Emergent Abilities in AI: Are We Chasing a Myth?

Sometimes Noise is Music: How Beneficial Noise Can Improve Your RAG

CoT is not all you need

According to other authors, LLMs are capable of reasoning, but this capability needs to be unlocked. Chain-of-thought (CoT) prompting helps the model unlock its potential through intermediate reasoning steps, guiding it to the correct answer in arithmetic problems [4]. A few weeks ago, an article questioned the real benefit of CoT [5]:

As much as 95% of the total performance gain from CoT on MMLU is attributed to questions containing "=" in the question or generated output. For non-math questions, we find no features to indicate when CoT will help. – source

So CoT at best helps in solving math problems but certainly does not unlock the reasoning potential of an LLM. Despite this, CoT is touted as a panacea and is considered to be the basis of the recent reasoning abilities of the latest generation of LLMs.

"meta-analysis of CoT literature. In both sets of results, math and other kinds of symbolic reasoning are the domains that consistently see substantial improvements from CoT (red dotted line indicates the mean improvement from CoT across experiments)." -image source: here
"meta-analysis of CoT literature. In both sets of results, math and other kinds of symbolic reasoning are the domains that consistently see substantial improvements from CoT (red dotted line indicates the mean improvement from CoT across experiments)." -image source: here

To CoT or Not to CoT: Do LLMs Really Need Chain-of-Thought?

These results seem to rule out common-sense reasoning abilities, but this does not rule out other forms of reasoning.

Are LLMs really capable of mathematical reasoning?

Although mathematical reasoning would seem to be the strong point in reasoning for LLMs, some studies suggest that LLMs merely recognize patterns. In other words, they search for patterns without really understanding the symbols.

According to some authors [6], LLMs are not capable of formal reasoning in mathematics because they cannot develop a plan (a plan being defined as a course of actions, a policy, which when executed would take an agent from a certain initial state to a desired world state). Without such a plan, a model cannot solve a problem unless it simply maps patterns seen in training. Or, in some cases, it is the user who unconsciously guides the LLM to the solution [7]:

The Clever Hans effect, where the LLM is merely generating guesses, and it is the human in the loop, with the knowledge of right vs. wrong solutions, who is steering the LLM–even if they didn’t set out to do so deliberately. The credit and blame for the ensuring accuracy, if any, falls squarely on the human in the loop. –source

"Claimed reasoning capabilities of LLMs are sometimes due to the subconscious helpful iterative prompting by the humans in the loop"-image source: here
"Claimed reasoning capabilities of LLMs are sometimes due to the subconscious helpful iterative prompting by the humans in the loop"-image source: here

Summarizing so far: proponents of LLM reasoning argue that there are several reasons why we observe this behavior today, but we have shown that several studies contradict these claims.

Despite these studies claiming that they do not reason, LLMs perform astoundingly well on benchmarks and pass tests that are complex even for humans. So the evidence we presented seems more theoretical, set against the experimental evidence of LLMs’ ability to solve mathematical and complex problems.

Is it just that humans cry foul at being beaten by LLMs, or is there something wrong with how we measure?

Catching a student that is copying

Surely it is irritating to read claims that an LLM performs like a PhD student:

The o1-preview model is designed to handle challenging tasks by dedicating more time to thinking and refining its responses, similar to how a person would approach a complex problem. In tests, this approach has allowed the model to perform at a level close to that of PhD students in areas like physics, chemistry, and biology. – source

Irritation aside, the problem is how these model capabilities are measured. We are probably not measuring their reasoning skills in the right way, and it is time to use new systems.

These models are all tested on the same benchmarks, such as the GSM8K (Grade School Math 8K) dataset, which provides arithmetic word problems but is at risk of data leakage (considering how many billions of tokens are used to train an LLM, the model may have already seen the answers during training). In addition, it provides only a single metric on a fixed set of questions, giving us little information about the LLM’s reasoning (fun fact: an LLM can answer a question correctly while blatantly getting the reasoning wrong). Finally, this dataset is static and does not allow us to change conditions.

In this work, the authors propose a new benchmark, GSM-Symbolic [9], where question variants are generated using symbolic templates. This allows varying the difficulty of the questions and gives more fine-grained control during testing. The benchmark is essentially the same dataset on which reasoning was previously tested; the questions are just modified to make statistical pattern matching difficult. If the LLM is capable of reasoning, it should solve the problems easily; if it is incapable of generalizing, it will fail miserably.
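The template idea itself is simple. The toy generator below is not the actual GSM-Symbolic pipeline, just an illustration of how the same underlying logic can be wrapped in different names and numbers:

```python
import random

TEMPLATE = (
    "{name} picks {n1} apples on Monday and {n2} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def make_variant(seed=None):
    """Generate one variant: the logic is unchanged, only the surface tokens
    (names and numbers) differ, which is enough to trip pure pattern matching."""
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Liam", "Noor", "Kenji"])
    n1, n2 = rng.randint(2, 40), rng.randint(2, 40)
    question = TEMPLATE.format(name=name, n1=n1, n2=n2)
    return question, n1 + n2   # the ground-truth answer comes with the template

for seed in range(3):
    q, a = make_variant(seed)
    print(q, "->", a)
```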

Illustration of the GSM-Symbolic template creation process. image source: here

Testing state-of-the-art LLMs, the authors found no evidence of formal reasoning in language models. The models are not robust and have a drop in performance when numerical values are changed, and their capabilities degrade sharply as the complexity of the problem increases.

One telling example: the model is easily fooled if seemingly relevant statements are added to the questions that are, in fact, irrelevant to the reasoning and the conclusion. The model takes these statements into account and is led into errors. According to this study, the model does not understand mathematical concepts but tries to convert these statements into operations. The authors suggest that this occurs because the training datasets included similar examples that required conversion to mathematical operations.

For instance, a common case we observe is that models interpret statements about "discount" as "multiplication", regardless of the context. This raises the question of whether these models have truly understood the mathematical concepts well enough. – source

image source: here

This is another sign that the model tries to look for these patterns even when they are just background noise. When the noise increases and it becomes more difficult to find patterns (or to map them consistently to reach the solution), performance drops dramatically [10]. This is also true for LLMs that have been trained with CoT (such as OpenAI o1), a further indication that CoT does not really improve reasoning skills.

image source: here

Parting thoughts

In this article we discussed the great debate: are LLMs capable of reasoning? Or at least some form of reasoning?

The studies we have shown disagree, and suggest that LLMs are sophisticated pattern-matching machines. In summary, these studies suggest:

  • LLMs are trained with a huge number of tokens and there is a risk of data contamination with major benchmarks. Even if the model did not see a mathematical problem, it has probably seen plenty of similar examples.
  • Given their enormous knowledge and innate ability to find patterns (thanks to attention mechanisms and in-context learning) they manage to solve most problems.
  • Their lack of robustness to variations in the problem, token bias, and susceptibility to noise strongly suggest that LLMs are not capable of formal reasoning.
  • New results confirm that even using advanced prompting techniques the models remain susceptible to noise and irrelevant (or potentially misleading) information.
  • The models are capable of pattern matching but do not appear to understand any of the mathematical concepts underlying problem-solving.

These results do not question the usefulness of LLMs but criticize the assumption that an LLM is capable of reasoning. They suggest that one can see an LLM as a machine with prodigious memory but incapable of reasoning (or as the most sophisticated mechanical parrot to date). This does not detract from the prodigy of the technology required for their creation but celebrates the wonder of human ingenuity. Further studies are probably needed to better explain the capabilities of LLMs, and new architectures may be needed for models capable of reasoning.

What do you think? Do you think LLMs are capable of reasoning? let me know in the comments


If you have found this interesting:

You can look for my other articles, and you can also connect or reach out to me on LinkedIn; I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.


Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Power Corrupts: Hierarchies, Persuasion, and Anti-Social Behavior in LLMs

Through the Uncanny Mirror: Do LLMs Remember Like the Human Mind?

Lie to Me: Why Large Language Models Are Structural Liars

Forever Learning: Why AI Struggles with Adapting to New Challenges

Reference

Here is the list of the principal references I consulted to write this article, only the first name for an article is cited.

  1. Jiang, 2024, A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners, link
  2. Shi, 2023, Large Language Models Can Be Easily Distracted by Irrelevant Context, link
  3. Schaeffer, 2023, Are emergent abilities of large language models a mirage? link
  4. Wei, 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, link
  5. Sprague, 2024, To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, link
  6. Valmeekam, 2023, PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
  7. Kambhampati, 2024, Can Large Language Models Reason and Plan? link
  8. Razeghi, 2022, Impact of Pretraining Term Frequencies on Few-Shot Reasoning, link
  9. Mirzadeh, 2024, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, link
  10. Valmeekam, 2024, LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench, link
  11. Lu, 2022, Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity, link
  12. Zhao, 2021, Calibrate Before Use: Improving Few-shot Performance of Language Models, link
  13. Rogers, 2024, Position: Key Claims in LLM Research Have a Long Tail of Footnotes, link

The post The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence? appeared first on Towards Data Science.

Through the Uncanny Mirror: Do LLMs Remember Like the Human Mind? https://towardsdatascience.com/through-the-uncanny-mirror-do-llms-remember-like-the-human-mind-cc9c63677610/ Thu, 19 Sep 2024 23:41:57 +0000 https://towardsdatascience.com/through-the-uncanny-mirror-do-llms-remember-like-the-human-mind-cc9c63677610/ Exploring the Eerie Parallels and Profound Differences Between AI and Human Memory


|LLM|AI|HUMAN MIND|MEMORY|COGNITION|
image by the author using DALL-E

The limits of my language are the limits of my mind. – Ludwig Wittgenstein

The true art of memory is the art of attention. – Samuel Johnson

Language is one of the most important capabilities of human beings; it enables us to communicate and transfer knowledge, and it is considered a pillar of civilization. That is why the incredible capabilities displayed by Large Language Models (LLMs) have astounded the world, and made it ask the question: are they intelligent?

All this has been achieved with huge amounts of text and a simple learning objective: predicting the next word in a sequence. The model behind this success is the Transformer, and the modern LLMs derived from it are now used by a large segment of the population for tasks such as translation, summarization, question answering, and article generation.

A Requiem for the Transformer?

All these elements show the great versatility of the transformer. At the same time, despite the extensive use of transformers in both research and production, several open questions remain. For example, most of the research on the model has focused on how to increase its performance or its applications. These studies, though, tell us little about how it works and how it achieves its abilities.

One of the neglected topics is how the memory of LLMs works. Memory is as fundamental to us as language is. Without memory, we cannot perform any of our daily skills. LLMs learn from a huge body of text and can show incredible knowledge, so they seem to have memory. Several open questions remain:

  • Do LLMs have a memory?
  • If so, in what form?
  • How does this differ from that of humans?

In general, the concept of memory is discussed for LLMs only at the application level. For example, one limitation of the transformer is its context length, so an LLM cannot use information that does not fit into its context window. Therefore, one line of research focuses on extending the context memory [1–2]. These approaches are training-free and provide external memory to allow the model to retrieve information:

In this paper, we propose a training-free memory-based approach, named InfLLM, for streamingly processing extremely long sequences with limited computational costs. Specifically, InfLLM incorporate the sliding window attention with an efficient context memory, where each token only attends to local contexts and relevant contexts from the memory. [2]

image source: [2]

A second line of research instead investigates the possibility of adding external memory to the LLM. Indeed, training an LLM is expensive, but its knowledge becomes outdated quickly. Fine-tuning is an equally expensive process, so researchers are looking for methods that allow the model to keep learning and to edit its memory. This external memory should be used to learn new knowledge but also to reinforce or delete certain information [3–4].

image source: [3]

AI Hallucinations: Can Memory Hold the Answer?

Forever Learning: Why AI Struggles with Adapting to New Challenges

These studies focus on improving model performance and tell us nothing about the parametric memory of the LLM.

How is LLM memory different from human memory?

To make a comparison, one would have to start with the definition of memory. According to Wikipedia:

Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action.

This definition is generic and does not explain how human memory works. In the human brain, information is passed as electrical signals, so memory must somehow be encoded through them. The problem arises when talking about "storage" and "retrieval."

Where is this memory located in the human brain? What does a single neuron encode (a word, a sentence, or a concept)? How does the human brain handle the enormous amount of daily information?

Storage thus turns out to be more complex than it seems, as evidenced by different studies:

There is not a single part of the brain that stores all the memory; instead, the storage location is defined by the type and use of memories. Explicit memories (information about events where a person was present, general facts, and information) are stored in the hippocampus, the neocortex, and the amygdala. For implicit memories, also referred to as unconscious or automatic memories, the most crucial brain regions are the basal ganglia and cerebellum. [6]

Similarly, information recall is complex, and it is difficult to identify how this occurs, which regions are involved, and the role of individual neurons.

Even mathematically modeling such a definition of memory is complex; to make it simpler, we can define memory as consisting of two components:

  • Input. To trigger a memory, the input must be the same as, or similar to, information that a brain (or electronic brain) has previously encountered.
  • Output. The result can be forgotten, incorrect, or correct. When correct, the output must be aligned with the information previously encountered.

This is a more dynamic definition of memory, allowing us to verify it in LLMs. A person may or may not know a mathematical theorem, but until they are asked and answer, we will not know whether they remember it. After all, if memory is diffuse, we have no way of knowing whether a memory exists until there is an input.

image source: [5]

Since we are defining memory as the relationship between input and output, in the transformer this process is modeled by the transformer block:

By using attention, however, a model can simply memorize facts (e.g. function definitions) by storing them as (key, value) pairs in long-term memory, and then retrieve those facts later by creating a query that attends to them. [4]

So the memory capacity of an LLM must have something to do with the transformer block.

image source: [5]

From the Universal Approximation Theorem (UAT), it can be shown that the transformer block can approximate any function and that its parameters are dynamically adjusted in response to the input. So can we suggest that the memory of LLMs consists of fitting specific outputs to specific inputs?

In this paper [5], the authors fine-tune a set of models (the Qwen family) on a series of poems in both Chinese and English. The results show that:

  • Larger models perform better.
  • Given a title or other partial information, the models can regenerate the full poem.
  • Although sometimes the prediction is incorrect, the output aligns with the information.
image source: [5]

Based on these experiments, LLMs possess a memory, and it functions by fitting inputs to specific outputs. That is, one can determine whether an LLM possesses a specific memory only by providing a question:

Based on the definition of memory and the experimental results, we believe that LLMs indeed possess memory capabilities, and this ability shows no fundamental difference from human memory. [5]

This is a bold comparison. Is there really all this similarity between LLM memory and human memory?

It may sound strange, but there are other similarities between human and LLM memory. The memory of LLMs and humans is diffuse. We cannot find a single unit that stores a specific memory in either a brain or an LLM. In addition, LLMs have a problem with rare knowledge. Once a fact is encountered during training it is stored. Reencountering the same information during training strengthens its memorization, while its absence reduces its knowledge. Also, repetitions work best after some delay, just as in humans.

image source: [7]

An LLM Student’s Handbook: Mastering the Art of Learning and Retaining Knowledge

You Know Nothing, ChatGPT. How Much Does Your LLM Know?

Human memory manifests the so-called primacy and recency effects. Simply put, items appearing at the beginning or end of a list are more easily remembered, so there is better memory of items at the extremes and worse memory of those in the middle. The same phenomenon is observed in LLMs: there is the same positional bias. LLMs also have better recall of elements positioned at the beginning and end of the context (this is one of the problems with long-context LLMs) [10–11].
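A simple way to probe this positional bias is to slide a single fact through a long filler context and query it at each position. The helper below assumes a hypothetical ask_model callable that wraps whatever LLM API you use; it is a sketch of the experimental setup, not code from any specific paper:

```python
def positional_recall_probe(ask_model, filler, fact, question, positions):
    """Toy probe for primacy/recency effects. `ask_model` is a hypothetical
    callable that sends a prompt to an LLM and returns its answer; `filler`
    is a list of distractor sentences and `fact` the sentence to recall."""
    answers = {}
    for pos in positions:
        context = filler[:pos] + [fact] + filler[pos:]
        prompt = " ".join(context) + "\n\nQuestion: " + question
        answers[pos] = ask_model(prompt)
    return answers

# e.g. positions = [0, len(filler) // 2, len(filler)] to compare placing the
# fact at the start, in the middle, and at the end of the context
```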

image source: [10]

There are two possible mechanisms of forgetting: memory traces fade with time (memory decay), or new memories overwrite previous ones (memory interference). Some psychological studies show that humans forget more through interference than through the simple passage of time [12]. A study [10] showed that the same is true for LLMs: in forgetting, memory decay is a less important mechanism than memory interference. This effect is more prominent when the model is presented with new information similar to the stored information (e.g., we forget a person’s name more easily after being introduced to several other people).

image source: [10]

To claim that the memory of an LLM and that of a human function in the same way is more of a provocation than an accepted fact, mainly because how human memory works is not yet clear to us. There are similarities, and these might come from the fact that we structure our narratives in a way that is compatible with the characteristics of our biological memory. LLMs are then trained on these written narratives, subtly inheriting this imprinting. This also means that the relationship between language and the human brain is even closer than thought.

The similarities and differences between LLMs and the human brain can guide us to create new and better LLMs in the future. At the same time, these similarities allow us to use LLMs to study human memory (as done in this study [13]). Either way, exciting prospects open up.

What do you think? Do you think there are other similarities or differences with human memory? Let me know in the comments


If you have found this interesting:

You can look for my other articles, and you can also connect or reach out to me on LinkedIn; I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.


Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

OpenAI’s New ‘Reasoning’ AI Models Arrived: Will They Survive the Hype?

Graph ML: How Do you Visualize Large network?

How the LLM Got Lost in the Network and Discovered Graph Reasoning

A Brave New World for Scientific Discovery: Are AI Research Ideas Better?

Reference

Here is the list of the principal references I consulted to write this article, only the first name of an article is cited.

  1. Chen, 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, link
  2. Xiao, 2024, InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory, link
  3. Modarressi, 2024, MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory, link
  4. Wu, 2022, Memorizing Transformers, link
  5. Wang, 2024, Schrodinger’s Memory: Large Language Models, link
  6. Psychology writing, Human Memory: The Current State of Research, link
  7. Tirumala, 2022, Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models, link
  8. Chang, 2024, How Do Large Language Models Acquire Factual Knowledge During Pretraining? link
  9. Robinson, 1926, Effect of Serial Position upon Memorization, link
  10. Zhang, 2024, A Survey on the Memory Mechanism of Large Language Model based Agents, link
  11. Liu, 2023, Lost in the Middle: How Language Models Use Long Contexts, link
  12. Oberauer, 2008, Forgetting in immediate serial recall: decay, temporal distinctiveness, or interference? link
  13. Georgiu, 2023, Using large language models to study human memory for meaningful narratives, link

The post Through the Uncanny Mirror: Do LLMs Remember Like the Human Mind? appeared first on Towards Data Science.

How the LLM Got Lost in the Network and Discovered Graph Reasoning https://towardsdatascience.com/how-the-llm-got-lost-in-the-network-and-discovered-graph-reasoning-e2736bd04efa/ Thu, 12 Sep 2024 19:49:05 +0000 https://towardsdatascience.com/how-the-llm-got-lost-in-the-network-and-discovered-graph-reasoning-e2736bd04efa/ Enhancing large language models: A journey through graph reasoning and instruction-tuning

The post How the LLM Got Lost in the Network and Discovered Graph Reasoning appeared first on Towards Data Science.

]]>
|GRAPH|LLM|REASONING|GRAPH REASONING|
image created by the author using AI

In a long story format, you have to set a graph for your role. – Sunil Grover

Large Language Models (LLMs) have shown incredible capabilities, and these capabilities have recently been extended beyond the text. On the one hand, we have witnessed multimodal models (e.g., vision-language models); on the other hand, we have witnessed an extension of model capabilities to skills that require reasoning. For example, we now have models dedicated to solving math problems or writing code.

Recently, however, another type of data has captured the attention of researchers. In fact, a great deal of data in the real world can be represented in the form of graphs. For example, social networks are data that are structured as graphs precisely because it is important to represent the relationship between various entities. This is not the only example: in biomedical sciences it is common to represent molecules, and interactions between proteins, as graphs. However, the interaction between LLMs and graphs is recent history. A recent line of research has shown how knowledge graphs (or potentially other graphs) can be used in the Retrieval Augmented Generation (RAG) framework where entities and relationships are found and used as input to an LLM.

The Convergence of Graph and Vector RAGs: A New Era in Information Retrieval

GraphRAG: Combining Retrieval and Summarization

While graphs are increasingly important, research on how LLMs comprehend data in graph form has lagged behind. There has been more focus on the intersection of LLMs and knowledge graphs (KGs) than on LLM understanding of graph data.

image source: [6]

Previous studies have shown that LLMs do not do well with structural understanding, so much so that they perform poorly when they encounter tables. However, graphs add an additional dimension of complexity.

How do LLMs fare with graphs? Are they capable of understanding structural information?

For example, this study [1] states that LLMs perform poorly on basic graph tasks (especially when the LLM has to identify whether a cycle or a particular edge exists), performing worse than the baseline the authors chose. One reason is that different graph encoding functions have a significant impact on LLM reasoning. Because LLMs do not natively take graphs as input, the graph must be serialized into the prompt. Encoding the graph as, say, an adjacency matrix favors the model's reasoning for some tasks but undermines its capabilities for others: each encoding exposes different structural information and thus affects the model's reasoning ability.
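To make the encoding issue concrete, here is a minimal sketch (in Python with networkx, which is not necessarily what the cited papers use) of how the same small graph can be serialized in two different ways before being placed in a prompt; the function names are illustrative, not taken from [1].

```python
import networkx as nx

def encode_as_edge_list(G):
    """Serialize a graph as plain-text edge statements for a prompt."""
    return "\n".join(f"Node {u} is connected to node {v}." for u, v in G.edges())

def encode_as_adjacency_matrix(G):
    """Serialize the same graph as a 0/1 adjacency matrix written out in text."""
    nodes = sorted(G.nodes())
    rows = [" ".join(str(int(G.has_edge(u, v))) for v in nodes) for u in nodes]
    return "Adjacency matrix (rows and columns ordered by node id):\n" + "\n".join(rows)

G = nx.cycle_graph(4)   # a 4-node ring: 0-1-2-3-0
print(encode_as_edge_list(G))
print(encode_as_adjacency_matrix(G))
```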

image source: [2]

Graph ML: A Gentle Introduction to Graphs

On the other hand, different prompt engineering techniques can improve the LLM's ability to solve some graph tasks. Techniques such as chain-of-thought (CoT) or few-shot prompting help improve performance, and one can design specific prompts for graph tasks for further improvement [1–2].
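As an illustration (these are not the exact prompts used in [1–2]), a zero-shot CoT wrapper for a graph question might look like this:

```python
def graph_task_prompt(graph_description, question, use_cot=True):
    """Wrap a serialized graph and a question into a prompt; the CoT suffix is the
    standard zero-shot trigger, the rest of the wording is purely illustrative."""
    prompt = (
        "You are given the following graph.\n"
        f"{graph_description}\n\n"
        f"Question: {question}\n"
    )
    if use_cot:
        prompt += "Let's think step by step."
    return prompt

edges = "Node 0 is connected to node 1.\nNode 1 is connected to node 2.\nNode 2 is connected to node 0."
print(graph_task_prompt(edges, "Does this graph contain a cycle?"))
```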

image source: [1]

These prompting techniques still work well with simple problems, but their benefit is significantly reduced for complex ones. Therefore, several authors have tried fine-tuning models on graph data [7–8]. Although these approaches are promising, the results can still be significantly improved.

Why do LLMs struggle with structural problems?

We don't really know. One hypothesis is that LLMs struggle with spatial concepts. For animals and humans, it is important to build mental maps to interact with the physical world. Humans use these cognitive maps to plan routes, find shortcuts, or decide how to interact with the outside world; these maps also support abstract knowledge and reasoning. An LLM does not interact with the physical world, but according to one theory, humans learn these maps simply from a sequence of observations [3–5]. In this study [3], the authors examined the spatial understanding capabilities of LLMs, designing navigation tasks that require accurately representing the underlying spatial relations (square, hexagonal, and triangular grids, as well as ring and tree topologies). LLMs demonstrate some implicit understanding of spatial maps but struggle with complex layouts. The models sometimes fail to grasp relative positions (how to interpret "left" or "right"), and LLMs are trained on large amounts of text in which spatial awareness is rarely emphasized.

image source: [3]

This lack of spatial understanding directly impacts their ability to comprehend graphs, especially for tasks where grasping node arrangement or distance is crucial. In turn, it limits their ability to handle complex graph structures, so they underperform in tasks where graph topology or spatial positioning is essential for accurate analysis.

The question remains open. One of the problems is that we lack benchmarks for graph reasoning with LLMs. A good benchmark dataset needs two main ingredients: a variety of different topological structures and a variety of different tasks. We want to test models not only on solving tasks but also on their understanding of graph topology.

Recently, some benchmarks have been developed to evaluate the graph reasoning of LLMs. In this work [6], the authors proposed a new dataset in which they tried to diversify both the topology and the number of possible tasks. They used different methods to generate the graphs in the dataset (random networks, small-world networks, scale-free networks). They also varied properties of the graphs such as direction (undirected, directed), scale (small, medium, and large), and the textual description of the graphs (edge list, adjacency table, and adjacency table in natural language).
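A rough sketch of how such a diversified set of graphs could be generated with networkx is shown below; the generator choices and size thresholds are illustrative, and the actual GraphInstruct [6] pipeline may differ.

```python
import networkx as nx

def sample_benchmark_graph(kind, n, seed=0):
    """Sample one graph from a given topology family (illustrative parameters)."""
    if kind == "random":          # Erdos-Renyi random graph
        return nx.gnp_random_graph(n, p=0.2, seed=seed)
    if kind == "small_world":     # Watts-Strogatz small-world graph
        return nx.watts_strogatz_graph(n, k=4, p=0.1, seed=seed)
    if kind == "scale_free":      # Barabasi-Albert preferential attachment
        return nx.barabasi_albert_graph(n, m=2, seed=seed)
    raise ValueError(f"unknown topology: {kind}")

# Vary topology and scale, as the benchmark in [6] does.
for kind in ["random", "small_world", "scale_free"]:
    for n in [10, 50, 200]:       # stand-ins for "small", "medium", "large" graphs
        G = sample_benchmark_graph(kind, n)
        print(kind, n, G.number_of_nodes(), G.number_of_edges())
```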

image source: [6]

Countless graph reasoning tasks are possible. For example, some tasks can be defined at the node level (neighbors, node importance, clustering coefficients, and so on), but also at the edge and graph level, for a total of 21 tasks. In addition, reasoning intermediates were generated to help a model with CoT prompts.
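To see what node-, edge-, and graph-level questions look like, the snippet below computes ground-truth answers for a few representative tasks with networkx; the task selection is illustrative and does not reproduce the 21 tasks of [6].

```python
import networkx as nx

G = nx.gnp_random_graph(8, p=0.3, seed=42)

# Node-level questions
neighbors_of_0 = sorted(G.neighbors(0))     # who are node 0's neighbors?
clustering_of_0 = nx.clustering(G, 0)       # local clustering coefficient of node 0

# Edge-level question
edge_exists = G.has_edge(0, 1)              # is there an edge between nodes 0 and 1?

# Graph-level questions
has_cycle = len(nx.cycle_basis(G)) > 0      # does the graph contain a cycle?
is_connected = nx.is_connected(G)           # is the graph connected?

print(neighbors_of_0, clustering_of_0, edge_exists, has_cycle, is_connected)
```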

image source: [6]

The authors then fine-tuned an LLM on this dataset. Interestingly, they divided the dataset into in-domain tasks and out-of-domain tasks: the model is trained on almost every task in the dataset except four (the out-of-domain tasks). The four tasks chosen are challenging and require graph comprehension and reasoning abilities to solve; they also differ from each other and cover node-, edge-, and graph-level aspects. The model is thus trained on one set of tasks but also tested on tasks it has never seen, which it can solve only if it acquired genuine graph understanding during training. The authors compared the fine-tuned model with other models of the same size and with closed-source models.

The experiments show some interesting results:

  • Smaller LLMs (about 7B parameters) perform poorly on the benchmark, suggesting a lack of capacity for handling graph data.
  • After fine-tuning, the model improves substantially, performing much better than the smaller models and even surpassing larger ones.
  • GPT-4 performs well on some tasks but unsatisfactorily on others, showing some understanding of graph data but also severe difficulties.
image source: [6]

The authors also study the generalization capabilities of LLMs on graph data. During training, the model saw only small graphs (few nodes and simple topology). As the model encounters more complex networks, performance decreases as a function of graph size: more complex graphs pose more difficulty for reasoning. Still, a model exposed to graph data during fine-tuning performs better than an unexposed model.

image source: [6]

Despite these encouraging results, the model fails to generalize to out-of-domain tasks. The model cannot generalize beyond the data it has seen, showing serious reasoning limitations.

image source: [6]

According to the authors, therefore, providing graph data allows the model to gain some graph understanding. So far, the model has been trained only on the graph and the final answer. In a final experiment, they add the reasoning intermediates for each question and ask whether this improves the model's understanding. They also add a mask to make the information they want the model to learn from the intermediate steps more prominent. Adding these intermediates yields sensible improvements on tasks the model previously struggled with.

image source: [6]

In addition, when the model is trained with the intermediate steps, it can produce correct reasoning (not only the right answer but also correct intermediate steps). According to the authors, when these reasoning steps are not provided, the model acquires only a shallow understanding of graph data and is not able to produce correct reasoning or an explanation of the process.

image source: [6]

Graphs are everywhere, from biology to finance, from road networks to social networks. What's more, graphs and LLMs have an increasingly close relationship today: knowledge graphs are increasingly used as a source of context for LLMs. Despite this, we know little about how much LLMs understand graphs.

Recent studies show that LLMs have little understanding of graphs and do not shine at graph reasoning. We can highlight two main reasons for these limitations. The first is that models are trained in an autoregressive manner on large amounts of text, and it is difficult to learn spatial relationships from text corpora alone. Humans learn to navigate abstract concepts such as graphs through interaction with the world around them; this allows them to create and internalize mental maps that are later used beyond the physical world. The second reason is that there is little graph data in training datasets. Adding graph data to the training datasets improves models' graph understanding, and providing them with reasoning intermediates enables LLMs to significantly improve at solving graph reasoning tasks.

The fact that LLMs fail on out-of-distribution tasks means that there are still aspects we do not understand, and we do not yet know how to overcome this limitation on their ability to generalize. As the synergy between knowledge graphs and LLMs grows closer, a larger proportion of graph data should be added to training datasets to foster better graph reasoning capabilities. At the same time, it would be important to deepen our understanding of how LLMs comprehend graphs.

What are your thoughts on this? Let me know in the comments


If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news; I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Get an email whenever Salvatore Raieli publishes.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

AI Won’t Steal Your Job – But Get Ready for the World’s Most Annoying Coworker

DeepMind’s AlphaProteo: Revolutionizing Protein Design with Machine Learning

Sometimes Noise is Music: How Beneficial Noise Can Improve Your RAG

Forever Learning: Why AI Struggles with Adapting to New Challenges

Reference

Here is the list of the principal references I consulted to write this article (only the first author of each article is cited).

  1. Fatemi, 2024, Talk like a Graph: Encoding Graphs for Large Language Models, link
  2. Guo, 2023, GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking, link
  3. Yamada, 2023, Evaluating Spatial Understanding of Large Language Models, link
  4. Whittington, 2022, How to build a cognitive map, link
  5. Garvert, 2017, A map of abstract relational knowledge in the human hippocampal–entorhinal cortex, link
  6. Luo, 2024, GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability, link
  7. Chai, 2023, GraphLLM: Boosting Graph Reasoning Ability of Large Language Model, link
  8. Tang, 2024, GraphGPT: Graph Instruction Tuning for Large Language Models, link

The post How the LLM Got Lost in the Network and Discovered Graph Reasoning appeared first on Towards Data Science.

]]>
Forever Learning: Why AI Struggles with Adapting to New Challenges https://towardsdatascience.com/forever-learning-why-ai-struggles-with-adapting-to-new-challenges-95f514d8e9ab/ Sat, 07 Sep 2024 00:39:21 +0000 https://towardsdatascience.com/forever-learning-why-ai-struggles-with-adapting-to-new-challenges-95f514d8e9ab/ Understanding the limits of deep learning and the quest for true continual adaptation

The post Forever Learning: Why AI Struggles with Adapting to New Challenges appeared first on Towards Data Science.

]]>
|AI|CONTINUAL LEARNING|DEEP LEARNING LIMITS|
image by the author using AI

"The wise adapt themselves to circumstances, as water moulds itself to the pitcher." – Chinese Proverb

"Adapt or perish, now as ever, is nature’s inexorable imperative." – H. G. Wells

Artificial intelligence has made great progress in recent years. All of these systems use artificial neurons in some form, algorithms inspired by their biological counterparts. For example, a neuron aggregates information from previous neurons and, if the signal exceeds a certain threshold, passes the information on to other neurons. This idea is represented by the weight matrix and the activation function. Other examples can be found in convolutional networks (inspired by the visual cortex) or genetic algorithms. During training, the connections between neurons (represented by the weights) are strengthened or weakened, similar to the strength of neuronal synapses. This process is the basis of stochastic gradient descent (SGD) and the backpropagation algorithm, and it has undergone minimal changes over several decades.
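As a reminder of what this abstraction looks like in code, here is a minimal NumPy sketch of a single artificial neuron: a weighted aggregation of incoming signals followed by a threshold-like activation (the numbers are arbitrary).

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weighted sum of inputs plus bias, then a ReLU-style threshold."""
    z = float(np.dot(w, x) + b)   # aggregate signals from the previous layer
    return max(0.0, z)            # "fire" only if the aggregated signal is positive

x = np.array([0.5, -1.2, 3.0])    # incoming signals
w = np.array([0.8, 0.1, -0.4])    # synaptic strengths, adjusted during training
print(neuron(x, w, b=0.05))
```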

Cognition is Struggling: Natural and Artificial Brains Evolve from Constriction

Describing these similarities between artificial and biological intelligence aids understanding but is also dangerous. First, biological systems are much more complex than people think, and forced simplifications are used. Second, some of these comparisons are inaccurate. This is the case with continual learning.

In this article we will answer these questions:

  • What is continual learning? Why is it important?
  • Do all deep learning architectures suffer from loss of plasticity?
  • What causes loss of plasticity?
  • How to solve it?

The TL;DR and the references are at the end of the article.


What is continual learning?

The brain is extremely flexible and capable of adapting to new tasks and new types of information. In contrast, neural networks have problems adapting to changes in the data stream.

As an example, large language models (LLMs) are generalist models trained on a huge amount of data during pre-training. Fine-tuning is a way to make a model learn new knowledge or skills. There are two problems with fine-tuning, though:

  • During fine-tuning, most of the weights are kept frozen.
  • It can lead to forgetting previously learned skills and knowledge and can increase the risk of hallucinations.

This makes continual learning impractical: we risk compromising the functionality of the model, and it is difficult to balance the effect of new data on the model's pre-acquired knowledge.

AI Hallucinations: Can Memory Hold the Answer?

Chat Quijote and the Windmills: Navigating AI Hallucinations on the Path to Accuracy

This stormy relationship between old and new data is currently not fully understood.

The result is that these fine-tuning techniques are far from perfect. Therefore, it is often preferred to train a new model from scratch or use other strategies such as Retrieval Augmented Generation (RAG).

However, training a new model from scratch has a huge cost, yet it is sometimes considered the only alternative. This is far from optimal: many real-world applications require a model to adapt to change (prediction of financial markets, logistics needs, control systems, and so on).

Why are neural networks unable to acquire new information?

There are two main issues:

  • Catastrophic forgetting. The model forgets what has previously been learnt
  • Loss of plasticity. The model is unable to learn new information or skills

We will focus on loss of plasticity

Loss of plasticity occurs when we try to continue training a pre-trained model and it is unable to learn new information or new skills. More technically:

Ideally, this new training procedure is initialized from the parameters of yesterday’s model, i.e., it is "warm-started" from those parameters rather than given a fresh initialization. However, warm-starting seems to hurt generalization in deep neural networks. – [1]

Models that are trained with a warm start (during continual learning) [10] perform worse on the test set. Thus, continual learning seems to damage the model's ability to generalize and adapt to new data.

image source: [1]

Some studies suggest [2] that this stems from the existence of a critical phase for learning: an early memorization phase during the first epochs of training, followed by a later phase in which information is reduced and reorganized. Altering this initial phase damages both training and generalization.

Grokking: Learning Is Generalization and Not Memorization

Other studies seem to confirm that there are two phases of learning (memorization and refinement) [3]. However, this does not explain why plasticity is lost when new data are presented. Other studies suggest a role for gradient descent, the loss surface, and architectural choices (normalization would promote better maintenance of plasticity) [4]. The question remains open, and we will discuss it in detail later in this article.


Loss of plasticity is ubiquitous to all deep-learning models

Catastrophic forgetting is much more studied than loss of plasticity; very few studies have focused on the latter. Therefore, we do not know whether loss of plasticity is a general problem or a special case of particular parameter choices.

To prove that it affects all deep learning models, plasticity loss should be shown to be consistent across different architectures, parameters, and algorithms. However, today's models have billions of parameters, making such a systematic investigation complex. In this study [5], the authors tried to remedy this shortcoming by testing on two main datasets: ImageNet and CIFAR.

ImageNet consists of 1000 image classes (about 1M images in total) and is the best-known image classification benchmark. The authors constructed roughly 0.5M binary classification tasks (taking two classes at a time) to test the loss of plasticity in continual learning. In other words, they train a model to separate 'dogs' and 'cats', then train it on a new task (distinguishing 'leaves' and 'flowers'), and so on. Accuracy is measured after the first task and after each subsequent task. The difficulty of the tasks is the same, so if the model's performance drops, it has lost plasticity.
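A rough sketch of this continual-evaluation protocol is below; the class names, the number of tasks, and the pairing logic are placeholders, not the exact setup of [5], which trains a network on pairs of ImageNet classes.

```python
import random

def make_task_sequence(classes, num_tasks, seed=0):
    """Sample a sequence of binary classification tasks, each a pair of distinct classes."""
    rng = random.Random(seed)
    return [tuple(rng.sample(classes, 2)) for _ in range(num_tasks)]

classes = [f"class_{i}" for i in range(1000)]    # stand-in for the 1,000 ImageNet classes
for task_id, (a, b) in enumerate(make_task_sequence(classes, num_tasks=5)):
    # In the real protocol, the same network is trained on each pair in turn and its
    # accuracy is recorded after every task; a steady drop signals lost plasticity.
    print(f"task {task_id}: {a} vs {b}")
```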

image source: [5]

The authors tested different types of deep learning networks and different parameters in these settings. Using standard backpropagation, the models perform well in the first few tasks but then quickly lose plasticity, eventually performing no better than linear models. Thus, a model that is well-tuned for one task rapidly loses performance when presented with new tasks, until it falls below the baseline.

image source: [5]

The study confirms that regularization helps maintain neural plasticity. More generally, regularization approaches aim to keep network weights small. L2 regularization seems to help maintain plasticity, but the reason is not well understood.

image source: [5]

In a later experiment, the authors used CIFAR-100 (one of the most popular image datasets, consisting of 100 classes). They took an 18-layer ResNet [11], which contains residual connections (practically one of the most widely used models for computer vision), and began training it on 5 classes. After that, they kept adding classes until reaching all 100.

After each addition, the model is tested on all available classes. This can be seen as a model being trained while the dataset is continuously enlarged (like a social network over time). Because the authors focus on plasticity rather than forgetting, the old classes are not removed when new classes are added. In parallel, the authors train baseline models from scratch on all classes available up to that point (if the model is first trained on five classes and then, in the second iteration, on another five, the from-scratch model is trained directly on ten classes).
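Schematically, the class-incremental schedule looks like the sketch below (only the schedule, not the training code; the step size of five classes follows the description above).

```python
def incremental_class_schedule(total_classes=100, step=5):
    """Yield the growing list of class ids available at each training stage."""
    seen = []
    for start in range(0, total_classes, step):
        seen.extend(range(start, start + step))
        yield list(seen)

for stage, classes in enumerate(incremental_class_schedule()):
    # The incrementally trained model keeps its weights between stages (warm start),
    # while the baseline is retrained from scratch on `classes` at every stage.
    print(f"stage {stage}: {len(classes)} classes available")
```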

image source: [5]

Initially, incremental training seems better than retraining, but as classes are added the model loses plasticity. With more image classes, performance deteriorates more and more: after a few additions, the improvement over the baseline (the model trained from scratch) is lost, and with further classes performance degrades significantly. Again, the deterioration is smaller with normalization techniques (Shrink and Perturb [12] is an algorithm that also uses L2 regularization).

image source: [5]

Continual learning has an important use in reinforcement learning. An agent must be able to explore and learn from the environment, and the environment can change. For example, in a video game, the first levels may be very different from the last levels and require the agent to adapt to new challenges.

In this study [5], the authors analyze the behavior of an ant-like robot that explores its surroundings and receives rewards. Every few million steps they change the friction coefficient, so the model has to relearn how to walk (simulating a new task for a robot).

image source: [5]

Again, the model shows a reduction in performance and a lack of plasticity. It is interesting that even in these settings regularization techniques improve plasticity.

image source: [5]

We know now that loss of plasticity is ubiquitous, but why does it occur?


The causes of plasticity loss

The fact that regularization techniques help maintain plasticity is an indication that plasticity is related to some property of model weights. After all, regularization techniques put constraints on the weights.

In a sense, what really changes in a model over time are its weights. They are randomly initialized and then optimized as the model learns a task. Later, if we train on another task, these weights should be optimized for that next task (and so on). This does not happen, because the model loses plasticity. So, initially, the weights can be optimized for a task, meaning that in the first epochs they must possess one or more properties that allow them to learn, properties that are later lost during training.

The loss of these properties should explain the loss of plasticity.

We can study what properties of the weights change during training, especially when the model starts to lose plasticity. This should help us understand the causes.

During training, concurrently with the loss of plasticity, there is an increase in the fraction of constant units. When a neuron becomes constant, the gradient flowing through it becomes zero or close to zero, its weights no longer change, and we can describe it as no longer adaptive or plastic. In the case of ReLU activations, neurons are defined as dead when they output zero for every input [6–7]. We can observe that the loss of plasticity is accompanied by an increase in dead neurons [5].
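In practice, one simple way to monitor this is to measure the fraction of ReLU units that never activate on a batch of inputs. The PyTorch sketch below is only illustrative; the exact criterion used in [5] may differ.

```python
import torch
import torch.nn as nn

def dead_unit_fraction(layer_output, eps=0.0):
    """Fraction of units whose ReLU activation is (near) zero for every input in the batch.
    layer_output: tensor of shape (batch_size, num_units), taken after the activation."""
    active_somewhere = (layer_output > eps).any(dim=0)   # unit fired for at least one input
    return 1.0 - active_somewhere.float().mean().item()

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
batch = torch.randn(256, 32)
print(f"dead units: {dead_unit_fraction(layer(batch)):.2%}")
```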

image source: [5]

Once a unit dies, it remains dead forever. An increase in dead neurons therefore corresponds to a decrease in the network's ability to learn (the fewer active neurons, the lower the capacity of the network).

Another interesting phenomenon is an increase in the average magnitude of the weights while performance degrades. In general, growth in weight magnitude is associated with slower learning and reduced convergence speed in gradient descent.

image source: [5]

A third interesting phenomenon concurrent with the loss of plasticity is the drop in the effective rank of the representation. The effective rank considers how each dimension contributes to the transformation induced by a matrix. Simply put, it is related to the amount of information the matrix carries: the fewer dimensions containing important information, the more the matrix is filled with redundant information. For the weight matrix of a hidden layer, the effective rank roughly represents how many neurons it takes to produce the layer's output. The lower it is, the fewer neurons produce useful information (most neurons contribute little). As training progresses, the effective rank of the network decreases, so fewer neurons produce relevant information and the network has less representational ability. This does not help in learning new information, because the network can rely only on a few useful neurons.
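A common way to compute an effective rank is through the entropy of the normalized singular values of a weight (or representation) matrix. The sketch below assumes this exponential-of-entropy definition, which may not be the exact formula used in [5].

```python
import torch

def effective_rank(matrix, eps=1e-12):
    """Effective rank as exp(entropy) of the normalized singular-value distribution."""
    s = torch.linalg.svdvals(matrix)            # singular values of the matrix
    p = s / (s.sum() + eps)                     # normalize them into a distribution
    entropy = -(p * torch.log(p + eps)).sum()   # Shannon entropy of that distribution
    return float(torch.exp(entropy))

full_rank = torch.randn(64, 64)
low_rank = torch.randn(64, 2) @ torch.randn(2, 64)   # only 2 informative directions
print(effective_rank(full_rank))   # large: many dimensions carry information
print(effective_rank(low_rank))    # small: most dimensions are redundant
```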

image source: [5]

These factors explain why we saw improvements with regularization techniques earlier. L2 regularization reduces the magnitude of the weights but does not affect dead units or effective rank. Shrink and Perturb combines L2 regularization with the injection of random Gaussian noise, so it also reduces the number of dead units. Neither technique solves the third problem, though.
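Shrink and Perturb itself can be sketched in a few lines: scale every parameter toward zero and add a little Gaussian noise before training on the next task. The coefficients below are placeholders, not values taken from [5] or [12].

```python
import torch
import torch.nn as nn

@torch.no_grad()
def shrink_and_perturb(model, shrink=0.8, noise_std=0.01):
    """Scale all weights toward zero and inject small Gaussian noise (illustrative values)."""
    for param in model.parameters():
        param.mul_(shrink)                               # "shrink": keeps weight magnitudes small
        param.add_(noise_std * torch.randn_like(param))  # "perturb": restores some variability

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
shrink_and_perturb(model)   # typically applied between tasks, before training on the new one
```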

How can we improve neuronal plasticity in a model?

Improving the network plasticity

We need a method that keeps the weights small, limits the number of dead neurons (reduces dormancy), and maintains variability in the network. Knowing these requirements, we can modify the way neural networks learn so that they maintain plasticity.

In the initial step of backpropagation, the weights are initialized randomly, which provides high variability. This variability is then lost during training (along with plasticity). We could therefore reintroduce variability by reinitializing some of the weights. We must be careful, though, not to destroy what the network has learned, so we should reinitialize only a few units, and only those that the network is not using. As a general intuition, a neuron's activation tells us how valuable it is: if a neuron's contribution to the others is low, it is not conveying important information, and we can reinitialize it.

This method is called continual backpropagation and allows much more plasticity to be maintained.
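The sketch below gives only the flavor of the idea, not the authors' implementation [5]: a hypothetical utility score (mean absolute activation times mean absolute outgoing weight) identifies the least useful hidden units, whose incoming weights are re-randomized and whose outgoing weights are zeroed so the network's current function is preserved.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def reinit_low_utility_units(fc_in, fc_out, activations, frac=0.01):
    """Reinitialize the least useful hidden units sitting between fc_in and fc_out.
    Utility here = mean |activation| * mean |outgoing weight| (a simplification of [5]).
    activations: (batch, hidden) post-activation outputs of fc_in."""
    utility = activations.abs().mean(dim=0) * fc_out.weight.abs().mean(dim=0)
    k = max(1, int(frac * utility.numel()))
    _, idx = torch.topk(utility, k, largest=False)       # indices of the lowest-utility units
    new_in = torch.empty_like(fc_in.weight)
    nn.init.kaiming_uniform_(new_in, a=5 ** 0.5)         # same default init scheme as nn.Linear
    fc_in.weight[idx] = new_in[idx]                      # fresh incoming weights restore variability
    if fc_in.bias is not None:
        fc_in.bias[idx] = 0.0
    fc_out.weight[:, idx] = 0.0                          # zero outgoing weights so the output is preserved

fc1, fc2 = nn.Linear(32, 64), nn.Linear(64, 10)
hidden = torch.relu(fc1(torch.randn(256, 32)))
reinit_low_utility_units(fc1, fc2, hidden, frac=0.05)    # would be called periodically during training
```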

adapted by the author. image source: [5]

Continual backpropagation thus seems to maintain network plasticity for a long time. Too bad the authors did not test it on LLMs.


Parting thoughts

In general, most studies of continual learning have focused on maintaining network stability (retaining information learned in previous tasks and thus avoiding catastrophic forgetting). The lack of plasticity, though, affects the network's ability to acquire new information and new skills and is equally important in continual learning. Recent studies elucidate why this loss of plasticity occurs, and it is interesting that the loss is intrinsic to the back-propagation algorithm itself. Indeed, this is demonstrated by the fact that the lack of plasticity is ubiquitous across architectures and tasks [5].

On the one hand, regularization techniques promote plasticity, while other common choices apparently worsen the problem (dropout, the Adam optimizer). Until recently we did not know why; today we know that it is a subtle balance between weight explosion, neuron dormancy, and effective rank [5, 8, 9]. Therefore, we can modify back-propagation to take these factors into account.

image source: [5]

Continual back-propagation [5] is an interesting method for maintaining plasticity. It is simple and not computationally intensive: it reinitializes neurons that typically contribute little (these are often the neurons that get pruned by techniques that try to reduce the size of a model).

Continual back-propagation uses a utility measure to find and replace low-utility units, which means it is based on a heuristic and is therefore not optimal. Also, most of these studies are done on toy models (even when the benchmarks are extensive) and not on models like LLMs. It would be interesting to see how approaches like continual back-propagation work on LLMs and whether they allow these models to learn new knowledge or new skills.

In any case, loss of plasticity is a stark difference between natural and artificial neurons. Continual learning is important for many applications, such as a robot encountering new terrain or adapting LLMs to specialized domains and new tasks. Today we better understand what causes the problem and how to improve the plasticity of models, but open questions remain (and additional research is needed).

What are your thoughts? Have you tried models for continual learning? Let me know in the comments


If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news; I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Get an email whenever Salvatore Raieli publishes.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Can AI Replace Human Researchers

Safekeep Science’s Future: Can LLMs Transform Peer Review?

Knowledge is Nothing Without Reasoning: Unlocking the Full Potential of RAG through Self-Reasoning

Short and Sweet: Enhancing LLM Performance with Constrained Chain-of-Thought

TL;DR

  • Today's neural networks are not capable of continual learning: they either fail to retain previously learned information (catastrophic forgetting) or fail to learn new information after training (loss of plasticity).
  • Loss of plasticity is a plague of all deep learning models. No matter the architecture, hyperparameters, or loss function, loss of plasticity is ubiquitous. Regularization techniques help the model maintain plasticity, which suggests that the weights are related to the loss of plasticity.
  • Increased dead units, exploding weights, and loss of effective network rank are the causes of loss of plasticity. Regularization techniques act on the first two causes, but we need a solution for the third.
  • We can mitigate loss of plasticity by modifying back-propagation so that it reinitializes neurons that are no longer used by the network. Continual backpropagation is an example of this.

Reference

Here is the list of the principal references I consulted to write this article (only the first author name of an article is cited).

  1. Ash, 2020, On Warm-Starting Neural Network Training, link
  2. Achille, 2019, Critical learning periods in deep networks, link
  3. Berariu, 2023, A study on the plasticity of neural networks, link
  4. Lyle, 2023, Understanding Plasticity in Neural Networks, link
  5. Dohare, 2024, Loss of plasticity in deep continual learning, link, code
  6. Lu, 2019, Dying ReLU and Initialization: Theory and Numerical Examples, link
  7. StackExchange, What is the "dying ReLU" problem in neural networks? link
  8. Lyle, 2024, Disentangling the Causes of Plasticity Loss in Neural Networks, link
  9. Lewandowski, 2023, Directions of Curvature as an Explanation for Loss of Plasticity, link
  10. Wang, 2023, A Comprehensive Survey of Continual Learning: Theory, Method and Application, link
  11. He, 2015, Deep Residual Learning for Image Recognition, link
  12. Chebykin, 2023, Shrink-Perturb Improves Architecture Mixing during Population Based Training for Neural Architecture Search, link

The post Forever Learning: Why AI Struggles with Adapting to New Challenges appeared first on Towards Data Science.

]]>
Short and Sweet: Enhancing LLM Performance with Constrained Chain-of-Thought https://towardsdatascience.com/short-and-sweet-enhancing-llm-performance-with-constrained-chain-of-thought-c4479361d995/ Wed, 07 Aug 2024 17:14:05 +0000 https://towardsdatascience.com/short-and-sweet-enhancing-llm-performance-with-constrained-chain-of-thought-c4479361d995/ Sometimes few words are enough: reducing output length for increasing accuracy

The post Short and Sweet: Enhancing LLM Performance with Constrained Chain-of-Thought appeared first on Towards Data Science.

]]>
|LLM|PROMPT ENGINEERING|COT|REASONING|
image created by the author using AI

Brevity is a great charm of eloquence. – Marcus Tullius Cicero

Brevity and conciseness are the parents of correction. – Hosea Ballou

Large language models (LLMs) have shown interesting capabilities in the field of reasoning. With their use, a new field of application has emerged: prompt engineering. In fact, interaction with these models occurs through the use of prompts, and for this reason, techniques have been developed to improve these capabilities of LLMs.

Prompt Engineering to Leverage In-Context Learning in Large Language Models

One of the most intriguing techniques is chain-of-thought (CoT) prompting; this technique increases correctness on reasoning problems and explains how the model arrives at the solution (or what reasoning errors it makes). CoT is a technique in which the model is prompted to arrive at the solution through intermediate steps (instead of generating the solution directly).

image source: [1]

Multimodal Chain of Thoughts: Solving Problems in a Multimodal World

This technique is intriguing [2–3] because it also works in zero-shot settings. Just by forcing the model to reason step by step (with the simple addition to the prompt of 'let's think step by step'), results on reasoning problems improve dramatically.

Of course, this technique also has disadvantages: the model produces long outputs, and there is an increase in system latency (the time it takes to complete the response). This stems from the autoregressive nature of the model, which decodes one word at a time. This additional computational cost and latency is undesirable when the model has to interact with users.

A Requiem for the Transformer?

Are all these reasoning steps really necessary? Can't the verbosity of a model be constrained?

As you can see, today's models are increasingly verbose. Whereas answers used to be much shorter, new LLMs tend to produce longer and longer outputs. In part, this is desirable behavior because, in theory, these responses are more complete and better dissect the topic of the question. On the other hand, the response is often unnecessarily verbose (especially when the question requires a short answer). For a user, an overly long answer can be frustrating, especially in multi-turn question settings. Also, a longer response is not always better: it is often full of digressions and irrelevant details, and it carries a greater risk of hallucinations.

One of the problems is that there are no evaluation metrics that take into account the conciseness of the outputs or penalize excessively long chains of reasoning. Intuitively, the longer the chain of reasoning, the greater the risk that one of the intermediate steps is erroneous, and an erroneous intermediate can be difficult for an LLM to correct (again because of its autoregressive nature).

On what does the length of an LLM response depend?

Several factors account for the length of the generated response. The main ones are: the question asked, the architecture, the model size, pre- and post-processing steps, prompt engineering techniques, and the addition of context to the prompt.

As can easily be imagined, generation time increases as more tokens are generated. Furthermore, the larger the model, the longer it takes to generate the same response (a 70B-parameter model will take longer to generate the same number of tokens than a 7B model).

Relation between response time and output length for three LLMs using a few samples from different datasets. image source: [4]

CoT increases the generation time of a model since intermediate reasoning steps must also be generated. A larger number of tokens therefore means a longer generation time per response.

Analysis of the impact of CoT on Falcon-40b efficiency. image source: [4]

Most studies so far have neglected efficiency and focused on accuracy, so we do not have metrics that take efficiency into account. In this study [4], the authors propose three metrics to evaluate a model for both accuracy and conciseness:

  • Hard-k Concise Accuracy. The fraction of outputs that are correct and do not exceed a certain length k.
  • Soft-k Concise Accuracy. Similar to the previous one, but it penalizes (rather than discards) correct answers that exceed a certain length.
  • Consistent Concise Accuracy. A generalization of the previous metrics that takes into account the variation in length across all outputs (a minimal sketch of the first two metrics follows this list).
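As promised, here is a minimal implementation of the first two metrics; the hard-k definition follows the description above, while the exponential penalty used for the soft-k variant is an illustrative choice rather than the exact formula of [4].

```python
from math import exp

def hard_k_concise_accuracy(correct, lengths, k):
    """Fraction of answers that are correct AND no longer than k words."""
    return sum(1 for ok, n in zip(correct, lengths) if ok and n <= k) / len(correct)

def soft_k_concise_accuracy(correct, lengths, k, alpha):
    """Correct answers over the limit are penalized instead of discarded;
    the exponential decay is an illustrative penalty, not necessarily the formula of [4]."""
    scores = []
    for ok, n in zip(correct, lengths):
        if not ok:
            scores.append(0.0)
        else:
            scores.append(1.0 if n <= k else exp(-(n - k) / alpha))
    return sum(scores) / len(scores)

correct = [True, True, False, True]
lengths = [28, 75, 40, 52]   # answer lengths in words
print(hard_k_concise_accuracy(correct, lengths, k=45))
print(soft_k_concise_accuracy(correct, lengths, k=45, alpha=30))
```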

Now that we have a way to measure accuracy and conciseness at the same time, we can try to limit the reasoning steps of an LLM when using CoT. The authors of this paper [4] propose making this requirement explicit in the prompt to force the model to compress its reasoning. It is thus zero-shot CoT prompting with the addition of the phrase "and limit the length of the answer to n words" (with n being the desired number of words).
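Concretely, such a constrained CoT (CCoT) prompt can be assembled like this; the constraint sentence follows the phrasing quoted above, while the rest of the template is illustrative.

```python
def ccot_prompt(question, max_words=None):
    """Zero-shot CoT prompt, optionally with the explicit length constraint (CCoT)."""
    prompt = f"Q: {question}\nA: Let's think step by step"
    if max_words is not None:
        prompt += f" and limit the length of the answer to {max_words} words"
    return prompt + "."

print(ccot_prompt("A farmer has 17 sheep and buys 5 more. How many does he have now?", max_words=45))
```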

An example of a CoT and constrained chain of thoughts (CCoT) prompt. image source: [4]

Once you have a prompt that forces the model to respond more succinctly, you may wonder whether this impacts the accuracy of the model. In the study, the authors analyze five pre-trained models (Vicuna-13B, Falcon 7B and 40B, Llama 2 7B and 70B). They test them on the GSM8K benchmark (the most widely used dataset for reasoning problems) and try different values of n (15, 30, 45, 60, 100).

The results show that constraining the number of tokens significantly decreases generation time (which is of course expected, since the model produces far fewer output tokens). This result would be meaningless if the model were incorrect (we would have only a fast but wrong answer). Surprisingly, for models like Llama 2 70B and Vicuna-13B, adding the length constraint increases accuracy (which is not the case for Falcon 7B and 40B).

Generation time (a) and accuracy (b) of five LLMs on the GSM8K test dataset. image source: [4]

For the authors, this variability depends on factors intrinsic to the models, such as size and training. Smaller models seem to benefit less from this approach (indeed, they perform worse). In addition, Llama 2 70B (the model that benefits the most) has been trained on a huge and more varied dataset, and it starts from a higher reasoning baseline.

Below is an example of a Llama 2 70B response to a math problem, showing the basic response, the response with CoT, and responses with different length constraints. It is interesting how the model manages to arrive at a correct answer, together with correct reasoning intermediates, even with few tokens available.

image source: [4]

Analyzing the length distribution of the models' outputs reveals some interesting results. The red line is the median, while the blue line is the number of tokens the LLMs should have respected according to the imposed length constraint. Without the length constraint the models produce longer responses, yet even with the constraint they do not fully meet it (the median is above the blue line).

image source: [4]

Earlier we defined three metrics to evaluate models in terms of both accuracy and conciseness. Under Hard-k concise accuracy, accuracy is lower. If we choose values of k that are too low (k being the number of words beyond which an answer is counted as wrong even if it is right), we get poor results even with constrained CoT (partly because the models do not meet the desired length). For reasonable values of k, answers with the proposed constrained CoT are both more accurate and more concise.

image source: [4]

These results are confirmed when we look at Soft-k Concise Accuracy (SCA). In this case, the value of α represents a tolerance for accepting answers longer than the desired limit k; in other words, whether or not we accept a correct answer that goes beyond a certain word limit. These results show that, even with constrained CoT, some correct answers exceed a certain length. It could be that some answers simply require more reasoning steps and cannot be compressed beyond a certain threshold, or that the models struggle to meet a strict limit given their verbose nature.

image source: [4]

Consistent Concise Accuracy, instead, measures whether the average length is consistent and thus whether the models meet the constraint on average. Noticeably, a model given a larger length constraint has more freedom in the length of the output it generates (and uses this freedom).

image source: [4]

As we have seen in this article, LLMs are naturally verbose and tend to produce unnecessarily long responses. On the one hand, modern LLMs write richer and more complete answers; on the other hand, the answer of interest is buried in unsolicited details, you have to wait until generation is finished, and there is considerable latency. Reasoning chains are very useful for solving mathematical problems or when you have a system with agents. Because of the autoregressive nature of LLMs, these chains can be very long, and the models cannot correct wrong intermediates. In addition, an LLM can also get stuck while generating a reasoning chain (e.g., with the ReAct prompt and agents). Forcing the model to adhere to a certain length therefore has some appeal.

Forcing the CoT to a certain output length [4] not only does not reduce reasoning capabilities but seems to improve performance for some models. It is not entirely clear why this happens (a mechanistic study would be interesting) or why it works better with larger models. It would also be interesting to study whether there is a link to hallucinations (the more tokens generated, the greater the risk, of course).

What do you think? Would you like to try a constrained CoT prompt? Let me know in the comments.


If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news; I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.

Get an email whenever Salvatore Raieli publishes.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

AI Hallucinations: Can Memory Hold the Answer?

Can Generative AI Lead to AI Collapse?

Expanding Language, Expanding Thought: Vocabulary Size in LLM Scaling

Beyond Human Feedback: How to Teach a Genial AI Student

Reference

Here is the list of the principal references I consulted to write this article (only the first author of each article is cited).

  1. Wei, 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, link
  2. Fu, 2023, Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance, link
  3. Kojima, 2023, Large Language Models are Zero-Shot Reasoners, link
  4. Nayab, 2024, Concise Thoughts: Impact of Output Length on Llm Reasoning and Cost, link

The post Short and Sweet: Enhancing LLM Performance with Constrained Chain-of-Thought appeared first on Towards Data Science.

]]>