Shreya Rao – Towards Data Science

NLP Illustrated, Part 3: Word2Vec
https://towardsdatascience.com/nlp-illustrated-part-3-word2vec-5b2e12b6a63b/
Wed, 29 Jan 2025 – An exhaustive and illustrated guide to Word2Vec with code!

Welcome to Part 3 of our illustrated journey through the exciting world of Natural Language Processing! If you caught Part 2, you’ll remember that we chatted about word embeddings and why they’re so cool.

NLP Illustrated, Part 2: Word Embeddings

Word embeddings allow us to create maps of words that capture their nuances and intricate relationships.

This article will break down the math behind building word embeddings using a technique called Word2Vec – a Machine Learning model specifically designed to generate meaningful word embeddings.

Word2Vec offers two methods – Skip-gram and CBOW – but we’ll focus on how the Skip-gram method works, as it’s the most widely used.

These words and concepts might sound complex right now but don’t worry – at its core, it’s just some intuitive math (and a sprinkle of machine learning magic).

Real quick – before diving into this article, I strongly encourage you to read my series on the basics of machine learning. A couple of concepts (like gradient descent and loss functions) build on those fundamentals, and understanding them will make this article much easier to follow.

Machine Learning Starter Pack

That said, don’t worry if you’re unfamiliar with those concepts – this article will cover them at a high level to ensure you can still follow along!


Since Word2Vec is a machine-learning model, like any ML model, it needs two things:

  • Training data: text data to learn from
  • A problem statement: the question the model is trying to answer

Training data

We’re trying to create a map of words, so our training data is going to be text. Let’s start with this sentence:
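"Happiness can be found even in the darkest of times if one only remembers to turn on the light."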

This will be our toy training data. Of course, in the real world, Word2Vec is trained on massive corpora of text – think entire books, Wikipedia, or large collections of websites. For now though, we’re keeping it simple with just this one sentence, so the model will only learn embeddings for these 18 words.

A problem statement

For Word2Vec, the core problem is simple: Given two words, determine whether they are neighbors

To define "neighbors," we use something called a context window, which specifies how many neighboring words on either side to consider.

For instance, if we want to find the neighbors of the word "happiness"…

…and set the context window size to 2, the neighbors of "happiness" will be "can" and "be".

And here, if we input "happiness" and "can" into the model, ideally we want it to predict that they are neighbors.

Similarly, for the word "darkest," with a context window of 2, the neighbors would be "in" and "the" (before), and "of" and "times" (after).

If we set our context window to 3, the neighbors for "happiness" will be three words on either side.

Terminology segue: Here "happiness" is referred to as the target word, while the neighboring words are known as the context words.

By default, the context window size in Word2Vec is set to 5. However, for simplicity in our example, we’ll use a context window size of 2.

Now, we need to convert this sentence into a neat little table, just like we do for other machine learning problems, with clearly defined inputs and output values.

We can construct this dataset by pairing the target word with each of its context words as inputs…

…and the output will be a label indicating whether the target and context words are neighbors:

1 indicates that they are neighbors

But there’s a glaring issue with this. All our training pairs are positive examples (neighbors), which doesn’t teach the model what non-neighbors look like.

Enter Negative Sampling.

Negative Sampling introduces pairs of words that are not neighbors. So for instance, we know that "happiness" and "light" are not neighbors, so we add that data to our training data with the label 0 to indicate that they are not neighbors.

By adding negative samples, the final dataset contains a mix of positive and negative pairs so that the model can learn to predict whether a given pair is a true neighbor or not.

Typically, we use 2–5 negative samples per positive pair for large datasets and up to 10 for smaller ones.

We’ll use 2 negative pairs per positive pair. Our training dataset now looks like this:
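To make this concrete, here's a minimal Python sketch of how such a labeled dataset could be built from our toy sentence. The sampling logic is simplified for illustration (real Word2Vec implementations draw negatives from a frequency-based distribution):

import random

# our toy training sentence, tokenized
sentence = ("happiness can be found even in the darkest of times "
            "if one only remembers to turn on the light").split()

vocab = sorted(set(sentence))
window = 2        # context window size
neg_per_pos = 2   # negative samples per positive pair

random.seed(42)
training_data = []  # rows of (target word, context word, label)

for i, target in enumerate(sentence):
    neighbors = set(sentence[max(0, i - window): i + window + 1])
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j == i:
            continue
        training_data.append((target, sentence[j], 1))  # positive pair
        # sample words outside the window as negative pairs
        non_neighbors = [w for w in vocab if w not in neighbors]
        for neg in random.sample(non_neighbors, neg_per_pos):
            training_data.append((target, neg, 0))

print(training_data[:6])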

Now comes the fun part – the machine learning magic. Here’s the problem we’re solving: Given a target word and a context word, predict the probability that they are neighbors.

Let’s break it down step by step.

Step 0: Decide embedding dimensions

The first thing we do is to decide the size of the word embeddings. As we’ve learned, larger embeddings capture more nuances and richer relationships but come at the cost of increased computational expense.

The default embedding size in Word2Vec is 100 dimensions, but to keep the explanation simple, let’s use just 2 dimensions.

This means each word will be represented as a point on a 2D graph like so:

Step 1: Initialize embedding matrices

Next, we initialize two distinct sets of embeddings – target embeddings and context embeddings.

And, at the start of training, these embeddings are randomly initialized with values:

The target embeddings and context embeddings are randomly initialized with different values because they serve distinct purposes.

  • Target Embeddings: Represent each word when it’s the target word in training
  • Context Embeddings: Represent each word when it’s a context (neighboring) word
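To ground this, here's a minimal sketch of that initialization in NumPy, using our 2 dimensions and a truncated vocabulary for brevity:

import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 2
vocab = ["happiness", "can", "be", "light", "even"]  # truncated for brevity

# two separate, randomly initialized embedding tables
target_embeddings = {w: rng.uniform(-1, 1, embedding_dim) for w in vocab}
context_embeddings = {w: rng.uniform(-1, 1, embedding_dim) for w in vocab}

print(target_embeddings["happiness"])  # e.g. [ 0.27 -0.46] -- random values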

Step 2: Calculate the similarity of target word and context word

In the training process, we work with blocks of one positive pair and their corresponding negative samples.

So in the first pass, we only focus on the first positive pair and its corresponding 2 negative samples.

Now we can determine how similar 2 words are by calculating the dot product of their embeddings: the target embedding (if it's the target word) and the context embedding (if it's a context word).

  • A larger dot product indicates the words are more "similar" (likely neighbors)
  • A smaller dot product suggests they are more dissimilar (less likely to be neighbors)

And remember, in the first pass, we only calculate the similarity of the 3 pairs in the first block.

Let’s start by taking the dot product of the target word embedding of "happiness" with the context word embedding of "can":

We get:

Now we need to find a way to convert these scores to probabilities because we want to know how likely it is that these two words are neighbors. We can do that by passing this dot product through a sigmoid function.

As a quick refresher, the sigmoid function squishes any input value into a range between 0 and 1, making it perfect for interpreting probabilities. If the dot product is large (indicating high similarity), the sigmoid output will be close to 1 and if the dot product is small (indicating low similarity), the sigmoid output will be closer to 0.
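As a reminder, the sigmoid function is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$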

So passing the dot product, -0.36, through the sigmoid function, we get:
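$$\sigma(-0.36) = \frac{1}{1 + e^{0.36}} \approx 0.41$$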

Similarly, we can calculate the dot product and corresponding probabilities for the other two pairs…

…to get the predicted probability that "happiness" and "light" are neighbors…

…and the predicted probability that "happiness" and "even" are neighbors:

This is how we calculate the model’s predicted probabilities of these 3 pairs being neighbors.

As we can see, the predicted values are pretty random and inaccurate, which makes sense because the embeddings were initialized with random values.

Next, we move on to the key step: updating these embeddings to improve the predictions.

Step 3: Calculate error

NOTE: If you haven’t read the article on Logistic Regression, it might be helpful to do so, as the process of calculating error there is very similar. But don’t worry, we’ll also go over the basics here.

Now that we have our predictions, we need to calculate the "error" value to measure how far off the model’s predictions are from the true labels. For this, we use the Log Loss function.

For every prediction, the error is calculated as:
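$$\text{error} = -\big(y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})\big)$$

where y is the true label (1 or 0) and ŷ is the predicted probability.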

And the overall Log Loss for all predictions in the block is the average of the individual prediction errors:
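$$\text{Log Loss} = \frac{1}{N} \sum_{i=1}^{N} -\big(y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\big)$$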

For our example, if we calculate the loss for the 3 pairs above, it will look like this:

Evaluating this…

…we get 0.3. Our goal is to reduce this loss to 0 or as close to 0 as possible. A loss of 0 means that the model’s predictions perfectly match the true labels.

Step 4: Update embeddings using gradient descent

Again, we won't dive into the details here since we covered this in our previous article on Logistic Regression. However, we know that the best way to minimize the loss function is by using gradient descent.

To put it simply, Log Loss is a convex function…

…and gradient descent helps us find the lowest point on this curve – the point where the loss is minimized.

It does so by:

  • calculating the gradient (the slope) of the loss function with respect to the embeddings and
  • adjusting the embeddings slightly in the opposite direction of the gradient to reduce the loss (see the sketch below)
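Here's a minimal sketch of what a single update could look like. It leans on a standard result: for log loss on a sigmoid output, the gradient with respect to each embedding works out to (prediction minus label) times the other word's embedding. The vectors and learning rate below are made up for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_pair(target_vec, context_vec, label, lr=0.05):
    """One gradient-descent step for a single (target, context, label) pair."""
    pred = sigmoid(np.dot(target_vec, context_vec))  # predicted probability
    error = pred - label                             # scalar error signal
    grad_target = error * context_vec   # d(loss)/d(target embedding)
    grad_context = error * target_vec   # d(loss)/d(context embedding)
    target_vec -= lr * grad_target      # move against the gradient
    context_vec -= lr * grad_context
    return target_vec, context_vec

# example: nudge "happiness" and "can" (a positive pair) closer together
happiness = np.array([0.5, -0.3])
can = np.array([-0.1, 0.8])
happiness, can = update_pair(happiness, can, label=1)

A real implementation would loop this over every pair in the block, but the direction of each nudge is exactly what the bullets above describe.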

So once gradient descent works its magic, we get new embeddings like so:

Let's visualize this change. We start with the target embedding ("happiness") and the context embeddings ("can," "light," and "even") in our block.

And after gradient descent, they shift slightly like so:

This is the REAL magic of this step. We see that automatically:

  • for the positive pair, the target embedding of "happiness" is nudged closer to the context embedding of "can," its neighbor
  • and for the negative pairs, the target embedding ("happiness") is adjusted to move further away from the non-neighboring context embeddings of "light" and "even"

Step 5: Repeat steps 2–4

Now all we have to do is rinse and repeat steps 2–4 using the next block of positive and negative pairs.

Let’s see what this looks like for the second block.

For these values, we determine the model’s predictions of whether the words are neighbors or not by:

(1) Taking dot products and passing them through the sigmoid function…

(2) And then using the Log Loss and gradient descent we update the target and context embedding values for the words in this block:

Again, doing so will nudge neighboring word embeddings closer together and push dissimilar ones farther apart.

That’s pretty much it. We just repeat these steps with each block in our training data.

Sidenote: Going through all blocks in the training dataset once is called an epoch. We usually repeat this for 5–20 epochs for a super robust training process.

By the end of our full training process, we'll end up with final target and context embeddings that look something like this:

If we get rid of the context embeddings, we are left with just the final target embeddings.

And these final target embeddings are the word embeddings that we were after at the beginning!!

SIDENOTE: If needed, the context embeddings could be averaged or combined with the target embeddings to create a hybrid representation. However, this is rare and not standard practice.

This happens because the training process refines embeddings based on word relationships. Similar words (neighbors) are pulled closer together, while dissimilar words (non-neighbors) are pushed apart. While doing so, it also ends up capturing deeper relationships between words, including synonyms, analogies, and subtle contextual similarities.

Here, our training data was just a single sentence with 18 words, so the embeddings may not seem meaningful. But imagine training on a massive corpus – an entire book, a collection of articles, or billions of sentences from the web.

And that’s it! That’s how we create word embeddings using Word2Vec, specifically the skip-gram method.

Word2Vec IRL

Now that we’ve unpacked the mathematical magic behind Word2Vec, let’s bring it to life and create our own word embeddings.

Use pre-trained word embeddings

The easiest and most efficient way to get started is to use pre-trained word embeddings. These embeddings are already trained on massive datasets like Google News and Wikipedia, so they’re incredibly robust. This means we don’t have to start from scratch, saving both time and computational resources.

We can load pre-trained Word2Vec embeddings using Gensim, a popular Python library for NLP that's optimized for handling large-scale text processing tasks.

# install gensim 
# !pip install --upgrade gensim

import gensim.downloader as api

Let’s look at all available pre-trained Word2Vec models in Gensim:

available_models = api.info()['models']

print("Available pre-trained Word2Vec models in Gensim:n")
for model_name, details in available_models.items():
    if 'word2vec' in model_name.lower():  # find models with 'word2vec' in their name
        print(f"Model: {model_name}")
        print(f"  - Description: {details.get('description')}")
Available pre-trained Word2Vec models in Gensim:

Model: word2vec-ruscorpora-300
  - Description: Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words.
Model: word2vec-google-news-300
  - Description: Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/).
Model: __testing_word2vec-matrix-synopsis
  - Description: [THIS IS ONLY FOR TESTING] Word vecrors of the movie matrix.

We see that there are two usable pre-trained models (since one of the models is labeled test). Let's put the word2vec-google-news-300 model to the test!

Here’s how to find synonyms of the word "beautiful":

w2v_google_news.most_similar("king")
[('gorgeous', 0.8353005051612854),
 ('lovely', 0.8106936812400818),
 ('stunningly_beautiful', 0.7329413294792175),
 ('breathtakingly_beautiful', 0.7231340408325195),
 ('wonderful', 0.6854086518287659),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402888298035)]

These all make sense.

If you recall from the previous article, we saw how we can perform mathematical operations on word embeddings to get intuitive results. One of the most popular examples of this is…

…which we can test like so:

# king + woman - man
w2v_google_news.most_similar_cosmul(positive=['king', 'woman'], negative=['man'])

The results are impressively accurate!

Let’s try another combination:

# better + bad - good
w2v_google_news.most_similar_cosmul(positive=['better', 'bad'], negative=['good'])
[('worse', 0.9141383767127991),
 ('uglier', 0.8268526792526245),
 ('sooner', 0.7980951070785522),
 ('dumber', 0.7923389077186584),
 ('harsher', 0.791556715965271),
 ('stupider', 0.7884790301322937),
 ('scarier', 0.7865160703659058),
 ('angrier', 0.7857241034507751),
 ('differently', 0.7801468372344971),
 ('sorrier', 0.7758733034133911)]

And "worse" is the top match! Very cool.

As we can see, these pre-trained models are incredibly robust and can be leveraged for most use cases. However, they’re not perfect for every situation. For instance, if we’re working with niche domains like legal or medical texts, general-purpose embeddings may fail to capture the specific meanings and nuances of the language.

Say we have this legal text:

"The appellant seeks declaratory relief under Rule 57, asserting that the respondent’s fiduciary duty was breached by non-disclosure of material facts in accordance with Section 10(b) of the Securities Exchange Act of 1934."

Legal documents are often written in a formal, highly structured style, with terms like "Rule 57" or "Section 10(b)" referencing specific laws and statutes. Words like "material facts" have a precise legal meaning – facts that can influence the outcome of a case – which is very different from how "material" is understood in everyday language.

Pre-trained embeddings trained on general corpora, such as Google News, won’t capture these nuanced, domain-specific meanings. Instead, for tasks like this, we need embeddings trained on domain-specific corpora, such as legal judgments, statutes, or contracts.

Code our own Word2Vec from scratch

This is where building our own Word2Vec model is helpful. By training on a legal corpus, we can create embeddings tailored to our use case, capturing the relationships and meanings specific to the legal domain.
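Here's a minimal sketch of what that could look like with Gensim's Word2Vec class. The tiny corpus below is a hypothetical placeholder – in practice, you'd feed in thousands of tokenized sentences from judgments, statutes, or contracts:

from gensim.models import Word2Vec

# hypothetical placeholder corpus: a list of tokenized legal sentences
legal_corpus = [
    ["appellant", "seeks", "declaratory", "relief", "under", "rule", "57"],
    ["respondent", "breached", "fiduciary", "duty", "by", "non-disclosure"],
    ["material", "facts", "influence", "the", "outcome", "of", "a", "case"],
    # ... thousands more tokenized sentences in practice
]

legal_w2v = Word2Vec(
    sentences=legal_corpus,
    vector_size=100,  # embedding dimensions
    window=5,         # context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative samples per positive pair
    min_count=1,      # keep every word (sensible only for tiny corpora)
    epochs=10,
)

# inspect the learned embedding and nearest neighbors for a word
print(legal_w2v.wv["fiduciary"])
print(legal_w2v.wv.most_similar("fiduciary"))

Note how the training knobs (window, negative, vector_size, epochs) map directly onto the concepts we walked through earlier in this article.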


And just like that we’re done! You now know everything you need to know about Word2Vec.

As always, feel free to connect with me on LinkedIn or email me at shreya.statistics@gmail.com!

Unless specified, all images are by the author.

NLP Illustrated, Part 2: Word Embeddings
https://towardsdatascience.com/nlp-illustrated-part-2-word-embeddings-6d718ac40b7d/
Wed, 27 Nov 2024 – An illustrated and intuitive guide to word embeddings

Welcome to Part 2 of our NLP series. If you caught Part 1, you’ll remember that the challenge we’re tackling is translating text into numbers so that we can feed it into our machine learning models or neural networks.

NLP Illustrated, Part 1: Text Encoding

Previously, we explored some basic (and pretty naive) approaches to this, like Bag of Words and TF-IDF. While these methods get the job done, we also saw their limitations – mainly that they don’t capture the deeper meaning of words or the relationships between them.

This is where word embeddings come in. They offer a smarter way to represent text as numbers, capturing not just the words themselves but also their meaning and context.

Let’s break it down with a simple analogy that’ll make this concept super intuitive.

Imagine we want to represent movies as numbers. Take the movie [Knives Out](https://www.imdb.com/title/tt8946378/?ref=tt_mvclose) as an example.

source: Wikipedia

We can represent a movie numerically by scoring it across different features, such as genres – Mystery, Action, and Romance. Each genre gets a score between -1 and 1 where: -1 means the movie doesn’t fit the genre at all, 0 means it somewhat fits, and 1 means it’s a perfect match.

So let’s start scoring Knives Out! For Romance, it scores -0.6 – there’s a faint hint of romance, but it’s subtle and not a big part of the movie, so it gets a low score.

Moving on to Mystery, it's a strong 1 since the entire movie revolves around solving a whodunit. And for Action, the movie scores 0.3. While there is a brief burst of action toward the climax, it's minimal and not a focal point.

This gives us three numbers that attempt to encapsulate Knives Out based on these features: Romance, Mystery, and Action.

Now let’s try visualizing this.

If we plot Knives Out on just the Romance scale, it would be a single point at -0.6 on the x-axis:

Now let’s add a second dimension for Mystery. We’ll plot it on a 2D plane, with Romance (-0.6) on the x-axis and Mystery (1.0) on the y-axis.

Finally, let’s add Action as a third dimension. It’s harder to visualize, but we can imagine a 3D space where the z-axis represents Action (0.3):

This vector (-0.6, 1, 0.3) is what we call a movie embedding of Knives Out.

Now let’s take another movie as an example: [Love Actually](https://www.imdb.com/title/tt0314331/?ref=tt_mvclose).

source: Wikipedia

Using the same three features – Romance, Mystery, and Action – we can create a movie embedding for it like so:

And we can plot this on our movie embeddings chart to see how Love Actually compares to Knives Out.

From the graph, it's obvious that Knives Out and Love Actually are super different. But what if we want to back this observation with some numbers? Is there a way to math-ify this intuition?

Luckily, there is!

Enter cosine similarity. When working with vectors, a common way to measure how similar two vectors are is by using cosine similarity. Simply put, it calculates the similarity by measuring the cosine of the angle between two vectors. The formula looks like this:
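$$\text{cosine similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$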

Here, A and B are the two vectors we’re comparing. A⋅B is the dot product of the two vectors, and ∥A∥ and ∥B∥ are their magnitudes (lengths).

Cosine similarity gives a result between -1 and 1, where:

  • 1 means the vectors are identical (maximum similarity)
  • -1 means they point in completely opposite directions (maximally dissimilar)

From the graph, we’ve already observed that Knives Out and Love Actually seem very different. Now, let’s quantify that difference using cosine similarity. Here, vector A represents the embedding for Knives Out and vector B represents the embedding for Love Actually.

Plugging the values into the cosine similarity formula, we get:

And this result of -0.886 (very close to -1) confirms that Knives Out and Love Actually are highly dissimilar. Pretty cool!

Let’s test this further by comparing two movies that are extremely similar. The closest match to Knives Out is likely its sequel, The Glass Onion.

source: Wikipedia

Here’s the movie embedding for The Glass Onion:

The embedding is slightly different from Knives Out. The Glass Onion scores a little higher in the Action category than its predecessor, reflecting the sequel's increased action sequences.

Now, let’s calculate the cosine similarity between the two movie embeddings:

And voilà – almost a perfect match! This tells us that Knives Out and The Glass Onion are extremely similar, just as we expected.

This movie embedding is a great start, but it’s far from perfect because we know movies are much more complex than just three features.

But we could make the embedding better by expanding the features. We can then capture significantly more nuance and detail about each film. For example, along with Romance, Mystery, and Action, we could include genres like Comedy, Thriller, Drama, and Horror, or even hybrids like RomCom.

Beyond genres, we could include additional data points like Rotten Tomatoes Score, IMDb Ratings, Director, Lead Actors, or metadata such as the film’s release year or popularity over time. The possibilities are endless, giving us the flexibility to design features as detailed and nuanced as needed to truly represent a movie’s essence.


Let’s switch gears and see how this concept applies to word embeddings. With movies, we at least had a sense of what our features could be – genres, ratings, directors, and so on. But when it comes to all words, the possibilities are so vast and abstract that it’s virtually impossible for us to define these features manually.

Instead, we rely on our trusted friends – the machines – to figure it out for us. We’ll dive deeper into how machines create these embeddings soon, but for now, let’s focus on understanding and visualizing the concept.

Each word can be represented as a set of numerical values (aka vectors) across several hidden features or dimensions. These features capture patterns such as semantic relationships, contextual usage, or other language characteristics learned by machines. These vectors are what we call word embeddings.

📣 This is important, so I’m going to reiterate: Word embeddings are vector representations of words.

For a very, very naive word embedding, we might start with just three features – similar to how we began with movies.

We can turn up the heat by expanding to 16 features, capturing more nuanced properties of words.

Or, we could take it even further with 200 features, creating a highly detailed and rich representation of each word.

The more features we add, the more complex and precise the embedding becomes, enabling it to capture subtle patterns and meanings in language.

The idea behind word embeddings is simple yet powerful: to arrange words in such a way that those with similar meanings or usage are placed close to each other in the embedding space. For example, words like "king" and "queen" would naturally cluster together, while "apple" and "orange" might form another cluster, far away from unrelated words like "car" or "bus."

While it's impossible to visualize embeddings with more than three dimensions, conceptually, we can think of them as a high-dimensional space where words are grouped based on their relationships and meanings. Imagine it as a vast, invisible map where words with similar meanings are neighbors, capturing the intricate structure of language.

an attempt to visualize word embeddings

This clustering is what makes embeddings so effective in capturing the subtleties of language.

Another cool feature of word embeddings is that we can perform mathematical operations with them, leading to interesting and often intuitive results. One of the most famous examples is:

Just to illustrate this, let’s say we have the following 3-dimensional word embeddings for the words:

Using these embeddings, we can perform the operation: "king" – "man" + "woman"…

…which gives us the word embedding for "queen"!
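With made-up 3-dimensional embeddings chosen purely for illustration (real learned values wouldn't be this tidy), the arithmetic might look like:

$$\underbrace{(0.9,\ 0.8,\ 0.1)}_{\text{king}} - \underbrace{(0.9,\ 0.1,\ 0.1)}_{\text{man}} + \underbrace{(0.2,\ 0.1,\ 0.9)}_{\text{woman}} = \underbrace{(0.2,\ 0.8,\ 0.9)}_{\approx\ \text{queen}}$$

The "royalty-like" second dimension carries over unchanged, while the dimensions that distinguish "man" from "woman" get swapped in.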

Or we could have relationships like this:

…or even:

You get the idea. This works because word embeddings capture relationships between words in a mathematically consistent way. That’s what makes embeddings so powerful – they don’t just measure similarity; they encode meaningful relationships that mirror our human understanding of language.

Now, the big question is: how do we come up with word embeddings for each word? As mentioned earlier, the answer lies in leveraging the power of machines! Word embedding models learn the relationships and features of words by analyzing MASSIVE amounts of text data. And by doing so, we can see these patterns emerge naturally during the training process.

In the next article, we’ll see how to do that by diving into one of the most popular word embedding models: Word2Vec!

In the meantime, if you’d like to dive deeper into Neural Networks, I have a series on Deep Learning that breaks down the math behind how they work.

Deep Learning, Illustrated

Feel free to connect with me on LinkedIn or email me at shreya.statistics@gmail.com!

Unless specified, all images are by the author.

NLP Illustrated, Part 1: Text Encoding
https://towardsdatascience.com/nlp-illustrated-part-1-text-encoding-41ba06c0f512/
Tue, 19 Nov 2024 – An illustrated guide to text-to-number translation, with code

Welcome back to the corner of the internet where we take complex-sounding machine learning concepts and illustrate our way through them – only to discover they’re not that complicated after all!

Today, we’re kicking off a new series on Natural Language Processing (NLP). This is exciting because NLP is the backbone of all the fancy Large Language Models (LLMs) we see everywhere – think Claude, GPT, and Llama.

In simple terms, NLP helps machines make sense of human language – whether that means understanding it, analyzing it, or even generating it.


If you’ve been following along our Deep Learning journey, we’ve learned that at their heart, neural networks operate on a simple principle: they take an input, work their mathematical magic, and spit out an output.

For neural networks to do this though both the input and the output must be in a format they understand: numbers.

This rule applies whether we’re working with a straightforward model…

…or a highly sophisticated one like GPT.

Now here’s where it gets interesting. We interact with models like GPT using text. For instance, we might ask it: "what is the capital of India?" and the model is able to understand this text and provide a response.

asking ChatGPT a question

But wait – didn’t we just say that neural networks can’t directly work with text and need numbers instead?

we can't input text into a neural network

That’s exactly the challenge. We need a way to translate text into numbers so the model can work with it.

we need to convert the text into numbers before inputting it in a neural network

This is where text encoding comes in, and in this article, we’ll explore some straightforward methods to handle this text-to-number translation.

One Hot Encoding

One of the simplest ways to encode text is through one-hot encoding.

Let’s break it down: imagine we have a dictionary containing 10,000 words. Each word in this dictionary has a unique position.

The first word in our dictionary, "about" is at position 1 and the last word, "zoo" sits at position 10,000. Similarly, every other word has its unique position somewhere in between.

Now, let’s say we want to encode the word "dogs". First, we look up its position in the dictionary…

…and find that "dogs" is at the 850th position. To represent it, we create a vector with 10,000 zeros and then set the 850th position to 1 like so:

It’s like a light switch: if the word’s position matches, the switch is on (1) and if it doesn’t, the switch is off (0).
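In code, that might look like this minimal NumPy sketch (the position 850 is the dictionary index from the example above):

import numpy as np

vocab_size = 10_000
position = 850  # position of "dogs" in our hypothetical dictionary (1-indexed)

one_hot = np.zeros(vocab_size)
one_hot[position - 1] = 1  # flip the switch at the word's position

print(one_hot.sum())  # 1.0 -- exactly one switch is on

Note that 9,999 of the 10,000 entries are zeros – the inefficiency we'll run into shortly.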

Now, suppose we want to encode this sentence:

Along with the word vector of "dogs", we find the word vector of "barks"

…and "loudly":

Then to represent the full sentence, we stack these individual word vectors into a matrix, where each row corresponds to one word’s vector:

This forms a sentence matrix, with rows corresponding to the words. While this is simple and intuitive, one-hot encoding comes with a big downside: inefficiency.

Each word vector is massive and mostly filled with zeros. For example, with a dictionary of 10,000 words, each vector contains 10,000 elements, with 99.99% of them being zeros. If we expand to a larger dictionary – like the Cambridge English Dictionary, which has around 170,000 words – the inefficiency becomes even more pronounced.

Now imagine encoding a sentence by stacking these 170,000-sized word vectors into a sentence matrix – it quickly becomes huge and difficult to manage. To address these issues, we turn to a more efficient approach: the Bag of Words.

Bag of Words

Bag of Words (BoW) simplifies text representation by creating a single vector for an entire sentence, rather than separate vectors for each word.

Imagine we have these four sentences we want to encode:

Brownie points if you know where this quote is from. And if you don't, let's just pretend this is a normal thing people say.

The first step is to create a dictionary of all the unique words across these four sentences.

BoW dictionary

Each sentence is represented as a vector with a length equal to the number of unique words in our dictionary. And each element in the vector represents a word from the dictionary and is set to the number of times that word appears in the sentence.

For example, if we take the first sentence "onions have layers,", its vector would look like this:

BoW encoding of sentence 1

"onions" appears once, "have" appears once, and "layers" appears once. So, the vector for this sentence would have 1 in those positions.

Similarly, we can encode the remaining sentences:

BoW encoding of all four sentences

Let’s encode one last example:

For this sentence, the words "layers" and "have" are repeated twice, so their corresponding positions in the vector will have the value 2.

BoW encoding of the sentence

Here’s how we can implement BoW in Python:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Onions have layers",
    "Ogres have layers",
    "You get it?",
    "We both have layers"
]

bag_of_words = CountVectorizer()
X = bag_of_words.fit_transform(sentences)

print("BoW dictionary:", bag_of_words.get_feature_names_out())
print("BoW encoding:n", X.toarray())
BoW dictionary: ['both' 'get' 'have' 'it' 'layers' 'ogres' 'onions' 'we' 'you']
BoW encoding:
 [[0 0 1 0 1 0 1 0 0]
 [0 0 1 0 1 1 0 0 0]
 [0 1 0 1 0 0 0 0 1]
 [1 0 1 0 1 0 0 1 0]]

While BoW is simple and effective for counting words, it doesn’t capture the order or context of words. For example, consider the word "bark" in these two sentences:

The word "bark" in "dogs bark loudly" versus "the tree’s bark" has entirely different meanings. But BoW would treat "bark" the same in both cases, missing the differences in meaning provided by the surrounding words.

Bi-grams

This is where bi-grams come in handy. They help capture more context by looking at adjacent words. Let’s illustrate this with these two sentences:

Just like in the BoW approach, we start by creating a dictionary:

However, this time, in addition to individual words, we include word pairs (bi-grams). These bi-grams are formed by looking at directly adjacent words in each sentence.

For example, in the sentence "dogs bark loudly," the bi-grams would be:
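"dogs bark" and "bark loudly"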

And in "the tree’s bark", these are the bigrams:

We add this to our dictionary to get our bi-gram dictionary:

bi-gram dictionary

Next, we represent each sentence as a vector. Similar to BoW, each element in this vector corresponds to a word or bi-gram from the dictionary, with the value indicating how many times that word or bi-gram appears in the sentence.

bi-gram encoding

Using bi-grams allows us to retain context by capturing relationships between adjacent words. So, if one sentence contains "tree’s bark" and another "dogs bark," these bi-grams will be represented differently, preserving their meanings.

Here’s how we can implement bi-grams in Python:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "dogs bark loudly",
    "the tree's bark"
]

bigram = CountVectorizer(ngram_range=(1, 2))  #(1, 2) specifies that we want single words and bigrams
X = bigram.fit_transform(sentences)

print("Bigram dictionary:", bigram.get_feature_names_out())
print("Bigram encoding:n", X.toarray())
Bigram dictionary: ['bark' 'bark loudly' 'dogs' 'dogs bark' 'loudly' 'the' 'the tree' 'tree'
 'tree bark']
Bigram encoding:
 [[1 1 1 1 1 0 0 0 0]
 [1 0 0 0 0 1 1 1 1]]

N-grams

Just as bi-grams group two consecutive words, we can extend this concept to n-grams, where n represents the number of words grouped together. For instance, with n=3 (tri-grams), we would group three consecutive words, such as "dogs bark loudly." Similarly, with n=5, we would group five consecutive words, capturing even more context from the text.

This approach enables us to capture even richer relationships and context in text data, but it also increases the size of the dictionary and computational complexity.

TF-IDF

While Bag of Words and Bi-grams are effective for counting words and capturing basic context, they don’t consider the importance or uniqueness of words in a sentence or across multiple sentences. This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes in. It weighs words based on:

  • Term Frequency (TF): how often a word appears in a sentence
  • Inverse Document Frequency (IDF): how rare or unique a word is across all sentences

This weighting system makes TF-IDF useful for highlighting important words in a sentence while downplaying common ones.

To see this in action, let’s apply TF-IDF to our familiar set of four sentences.

Like before, we create a dictionary of unique words across our sentences.

Term Frequency (TF)

To calculate TF of a word, we use the formula:
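$$\text{TF}(\text{word}) = \frac{\text{number of times the word appears in the sentence}}{\text{total number of words in the sentence}}$$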

For instance, for the word "onions" in the first sentence…

…the TF is:
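"Onions have layers" has 3 words and "onions" appears once, so:

$$\text{TF}(\text{onions}) = \frac{1}{3}$$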

Similarly, let’s calculate the TF of "both" in the first sentence:
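"both" doesn't appear in the first sentence at all, so:

$$\text{TF}(\text{both}) = \frac{0}{3} = 0$$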

Using this same logic, we can get the TFs of all the words in the dictionary across all four sentences like so:

TF of all words in each sentence

Note that the TF of a word can vary across different sentences. For example, the word "both" doesn’t appear in the first three sentences, so its TF for those sentences is 0. However, in the last sentence, where it appears once out of four total words, its TF is 1/4.

Inverse Document Frequency (IDF)

Next, we calculate IDF for each word. IDF gives a higher value to words that appear in fewer sentences, emphasizing rare, more distinctive words.
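The standard formula (the log base is a matter of convention) is:

$$\text{IDF}(\text{word}) = \log\left(\frac{\text{total number of sentences}}{\text{number of sentences containing the word}}\right)$$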

For example, we see the word "both" appears in only one of the four sentences:

So its IDF is:
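$$\text{IDF}(\text{both}) = \log\left(\frac{4}{1}\right) \approx 1.39$$

(taking the natural log here; the exact value depends on the chosen base)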

Similarly, we can get IDF for the rest of the words in the dictionary:

IDF of all words in the dictionary

Here, the word "both" appears only in sentence 4, giving it a higher IDF score compared to common words like "have," which appears in multiple sentences.

Unlike TF, the IDF of a word remains consistent across all sentences.

TF-IDF

The final TF-IDF score for a word is the product of its TF and IDF:
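$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$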

This results in sentence vectors where each word’s score reflects both its importance within the sentence (TF) and its uniqueness across all sentences (IDF).

Plugging in TF and IDF terms in our formula, we get our final TF-IDF sentence vectors:

TF-IDF encodings of all sentences

Here’s how we calculate TF-IDF in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Onions have layers",
    "Ogres have layers",
    "You get it?",
    "We both have layers"
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(sentences)

print("TF-IDF dictionary:", tfidf.get_feature_names_out())
print("TF-IDF encoding:n", X.toarray())

Note: The Python results might differ slightly from manual calculations because of:

1. L2 Normalization: Scikit-learn's TfidfVectorizer normalizes vectors to unit length by default.
2. Adjusted IDF Formula: The IDF calculation includes a smoothing term to prevent division by zero for words that appear in all sentences.

Read more about this here.


While the methods we’ve discussed are essential building blocks in NLP, they come with significant limitations.

1 – these methods lack semantic understanding. They fail to grasp the meaning of words and identify relationships between synonyms like "fast" and "quick." While bi-grams can provide some local context, they still miss deeper connections and subtle nuances in meaning.

2 – these approaches rely on rigid representations, treating words as isolated entities. For example, we intuitively understand that "king" and "queen" are related, but these methods represent "king" and "queen" as being just as unrelated as "king" and "apple," completely ignoring their similarities.

3 – they face scalability challenges. They depend on sparse, high-dimensional vectors, which grow more unwieldy and inefficient as the dictionary size increases.

What if we could represent words in a way that captures their meanings, similarities, and relationships? That’s exactly what word embeddings aim to do. Word embeddings revolutionize text encoding by creating dense, meaningful vectors that retain both context and semantic relationships.

In the next article, NLP Illustrated, Part 2: Word Embeddings, we’ll explore how these embeddings go beyond basic word counts to capture the complex, nuanced relationships between words!

NLP Illustrated, Part 2: Word Embeddings

Connect with me on LinkedIn or shoot me an email at shreya.statistics@gmail.com if you have any questions/comments!

NOTE: All illustrations are by the author unless specified otherwise

Implementing Convolutional Neural Networks in TensorFlow
https://towardsdatascience.com/implementing-convolutional-neural-networks-in-tensorflow-bc1c4f00bd34/
Tue, 20 Aug 2024 – Step-by-step code guide to building a Convolutional Neural Network

Welcome to the practical implementation guide of our Deep Learning Illustrated series. In this series, we bridge the gap between theory and application, bringing to life the neural network concepts explored in previous articles.

Deep Learning, Illustrated

In today’s article, we’ll build a Convolutional Neural Network (CNN) using TensorFlow. Be sure to read the previous CNN article, as this one assumes you’re already familiar with the inner workings and mathematical foundations of a CNN. We’ll be focusing only on implementation here, so prior knowledge will help you follow along more easily.

Deep Learning Illustrated, Part 3: Convolutional Neural Networks

We’ll create the same simple image classifier that predicts whether a given image is an ‘X’ or not.

And we’ll break down each step in detail along the way to ensure you understand both the how and the why!

Step 1: Importing the necessary libraries

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam
import numpy as np
import matplotlib.pyplot as plt

TensorFlow and Keras (which is a high-level API within TensorFlow) will handle the creation and training of our CNN, while NumPy and Matplotlib will help us with data manipulation and visualization.

NOTE: To ensure that our results are consistent each time we run the code, we’ll set a random seed:

# Setting seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

Setting a seed ensures consistent results by making sure the random processes in the code run the same way each time. Think of it like shuffling a deck of cards in exactly the same order every time we play.

Step 2: Understanding and Generating the Data

Let’s first generate the images that our model will learn to classify. Previously we saw that an ‘X’ can be represented by a 5×5 pixel image like so:

Let’s translate this to code:

# 'X' pattern
def generate_x_image():
    return np.array([
        [1, 0, 0, 0, 1],
        [0, 1, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 1, 0, 1, 0],
        [1, 0, 0, 0, 1]
    ])

This function generates a simple 5×5 image of an ‘X’. Next, we’ll create a function that generates random 5×5 images that do not resemble an ‘X’:

def generate_not_x_image():
    # Ensuring not to generate an 'X' pattern
    while True:
        img = np.random.randint(2, size=(5, 5))
        if not np.array_equal(img, generate_x_image()):
            return img

Step 3: Building the Dataset

With our functions ready, we can now create a dataset of 1,000 images. We’ll label them accordingly, with 1 for images of an ‘X’ and 0 for those that are not:

# Create a dataset
num_samples = 1000
images = []
labels = []

for _ in range(num_samples):
    if np.random.rand() > 0.5:
        images.append(generate_x_image())
        labels.append(1)
    else:
        images.append(generate_not_x_image())
        labels.append(0)

images = np.array(images).reshape(-1, 5, 5, 1)
labels = np.array(labels)

This code generates 1,000 images, half of which contain an ‘X’ and the other half don’t. We then reshape the images to ensure they have the correct dimensions for our CNN.

To train our model effectively, we’ll split this dataset into training and testing sets:

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(images, labels, test_size=0.2, random_state=42)

This split reserves 80% of the data for training the model and 20% for testing it. The test set helps us evaluate how well the model performs on new, unseen data.

Before we dive into model building, let’s take a look at some of the images in our dataset to understand what we’re working with.

# Function to display images
def display_sample_data(images, labels, num_samples=5):
    plt.figure(figsize=(10, 2))
    for i in range(num_samples):
        ax = plt.subplot(1, num_samples, i + 1)
        plt.imshow(images[i].reshape(5, 5), cmap='gray_r')
        plt.title(f'Label: {labels[i]}')
        plt.axis('off')
    plt.show()

This function displays images from our training set, helping us confirm that the data is correctly labeled and formatted.

# Display first 5 samples of our training data
display_sample_data(x_train, y_train)

Step 4: Building the CNN Model

Now that our data is ready, let’s build the CNN! Here’s the architecture we used previously:

1 – Convolutional Layer: Applies four 3×3 filters to an input image to detect features and creates four feature maps

2 – Max-Pooling Layer: Reduces the dimensions of the feature maps, making the model more efficient

3 – Flatten Layer: Converts the 2D data into a 1D array, preparing it for the neural network

4 – Hidden Layer: A fully connected hidden layer with three neurons all with ReLU activation functions

5 – Output Layer: A single neuron with a sigmoid activation function

model = Sequential([
    # 1 - Convolutional Layer
    Conv2D(4, (3, 3), activation='relu', input_shape=(5, 5, 1)),
    # 2 - Max-Pooling Layer
    MaxPooling2D(pool_size=(2, 2)),
    # 3 - Flatten Layer
    Flatten(),
    # 4 - Hidden Layer
    Dense(3, activation='relu'),
    # 5 - Output Layer
    Dense(1, activation='sigmoid')
])

Step 5: Compiling the Model

Compiling the model is crucial as it defines how the model will learn. Here’s what each part of the compile function does:

  1. Optimizer (Adam): The optimizer adjusts the model’s weights to minimize the loss function. We use Adam here, but this is a list of other optimizers that can be used.
  2. Loss Function (Binary Crossentropy): Measures how far off the model’s predictions are from the actual results. Since we’re dealing with binary classification (X or not-X), binary cross entropy is appropriate. The loss value is what the model tries to minimize during training. Lower loss values indicate better performance (i.e., the model’s predictions are closer to the actual values). The loss directly influences how the model is trained and the optimizer uses it to update the model’s weights in the direction that minimizes the loss.
  3. Metrics (Accuracy): Metrics are used to evaluate the model’s performance. We use accuracy to track the proportion of correct predictions out of all predictions made. Metrics are additional measures that evaluate the performance of the model but are not used by the optimizer to adjust the model during training. They provide a way to assess how well the model is performing.

Note: While accuracy is a common metric, it’s not always the most reliable, especially in certain scenarios like imbalanced datasets. However, for simplicity, we’ll use it here. If you’re interested in exploring other evaluation metrics that might provide a more nuanced view of model performance, this article covers some alternatives.

model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

Step 6: Training the Model

Training the model involves feeding it the training data multiple times (epochs) so it can learn to make better predictions. An epoch is one complete pass through the entire training dataset. We’ll train our model for 10 epochs:

history = model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

Step 7: Evaluating the Model

After training, we evaluate the model’s performance on the test data:

loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {accuracy * 100:.2f}%')

The accuracy metric tells us that 94.8% of the images in the test data were correctly classified.
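As an optional sanity check (not part of the original walkthrough), we could feed the trained model a single fresh image and inspect its prediction:

# Classify one new image with the trained model (illustrative check)
new_image = generate_x_image().reshape(1, 5, 5, 1)  # add a batch dimension

probability = model.predict(new_image)[0][0]
print(f"P(image is an 'X'): {probability:.3f}")
print("Prediction:", "X" if probability > 0.5 else "not an X")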

Step 8: Visualizing the Training Process

Finally, let’s visualize how the model’s accuracy changed over the epochs. This helps us understand how well the model learned during training:

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

And that’s it. We’ve built a simple convolutional neural network to predict if an image is an ‘X’ or not using TensorFlow in less than 5 minutes!

Deep Learning in Practice

As always, feel free to connect with me on LinkedIn for any comments/questions!

Note: Unless specified, all images are by the author.

Implementing Neural Networks in TensorFlow (and PyTorch)
https://towardsdatascience.com/implementing-neural-networks-in-tensorflow-and-pytorch-3c1f097e412a/
Mon, 08 Jul 2024 – Step-by-step code guide on building a Neural Network

Welcome to the practical implementation guide of our Deep Learning Illustrated series. In this series, we’ll bridge the gap between theory and application, bringing to life the neural network concepts explored in previous articles.

Deep Learning, Illustrated

Remember the simple neural network we discussed for predicting ice cream revenue? We will build that using TensorFlow, a powerful tool for creating neural networks.

Deep Learning Illustrated, Part 2: How Does a Neural Network Learn?

And the kicker: we’ll do it in less than 5 minutes with just 27 lines of code!

Let’s first start with: what is TensorFlow?

TensorFlow is a comprehensive ecosystem of tools, libraries, and community resources for building and deploying machine learning applications. Developed by Google, it’s designed to be flexible and efficient, capable of running on various platforms from CPUs to GPUs and even specialized hardware like TPUs. The name "TensorFlow" derives from its core concept: tensor flow. Tensors, which are multi-dimensional arrays, flow through a computational graph during the training and inference processes.

Okay, let’s get to building our neural network. The goal of the model is to predict daily ice cream revenue based on two features: temperature and day of the week. We’ll approach this task step-by-step, explaining each component of the process.

Step 1: Data preparation

First, we’ll translate the ice cream sales data that we used previously…

…into a format suitable for our neural network:

import numpy as np

# Data 
day = [2, 6, 1, 3, 2, 5, 7, 4, 3, 1]
temperature = [22, 33, 20, 25, 24, 30, 35, 28, 26, 21]
revenue = [1.51, 2.22, 1.37, 1.77, 1.64, 2.04, 2.42, 1.90, 1.75, 1.45]

# Combine day and temperature into a single feature array
X_train = np.column_stack((day, temperature))
y_train = np.array(revenue)

This creates our input features, X_train

…and target values, y_train:

Step 2: Standardize the data

Next, we’ll standardize our data. Standardization is a crucial preprocessing step that transforms features to have a mean of zero and a standard deviation of one.
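Concretely, each feature value x is transformed as:

$$z = \frac{x - \mu}{\sigma}$$

where μ is the feature's mean and σ its standard deviation over the training set.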

from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

This ensures all features contribute equally to the model, improving convergence speed and stability during training.

Step 3: Build the Neural Network

In this step, we define our neural network model. We decided previously that the architecture consists of one hidden layer with two neurons and one output neuron, all using the ReLU activation.

Let’s stick to this same architecture and translate the above to code. We construct our neural network using TensorFlow’s Keras API.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Initialize the model
model = Sequential()

# Add hidden layer - 2 neurons with the ReLU activation function with 2 inputs
model.add(Dense(2, input_dim=2, activation='relu'))

# Add output layer - 1 neuron with the ReLU activation function
model.add(Dense(1, activation='relu'))

The Sequential model allows us to build a stack of layers. The Dense layers are fully connected layers where each neuron in one layer is connected to every neuron in the next layer.

Step 4: Compile and train the model

Before training, we need to compile our model:

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

Compilation is a crucial step that configures the learning process. Here, we specify:

  1. The optimizer: adam (Adaptive Moment Estimation), which adapts the learning rate during training.
  2. The loss function: Mean Squared Error (MSE), which measures the average squared difference between predictions and actual values.

_Note: Here we used the Adam optimizer, but we can use any other optimization algorithm that makes sense. This lists all the ones we can use in TensorFlow. And similarly we can use any of the various loss functions defined here._

Now we can train our model:

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=100, verbose=1)

The fit method is where the actual learning occurs. We specify our input features (X_train_scaled), target values (y_train), and the number of training cycles (epochs). The verbose parameter controls the level of output during training.

We can visualize the training process:

import matplotlib.pyplot as plt

# Plot training loss over epochs
plt.plot(history.history['loss'])
plt.title('Model Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()

This plot illustrates how our loss (prediction error) decreases over time, providing insight into the learning process.

Step 5: Make predictions

Finally, we can use our trained model to make predictions:

from sklearn.metrics import mean_squared_error

# Make predictions on the training data
predictions = model.predict(X_train_scaled)
print("Predicted Revenues on Training Data:", predictions)

Here, we use our trained model to predict ice cream sales based on our input features. And if we want to see how accurate the predictions were, we can use the MSE as a measure of our model's accuracy.

# Calculate the Mean Squared Error
mse = mean_squared_error(y_train, predictions)
print("Mean Squared Error on Training Data:", mse)

The MSE isn’t as low as we’d hoped, but that’s okay. This is a super basic neural network, and the whole point is to add complexity and tweak the architecture to improve our results.

While this example uses a simple dataset and model architecture, the principles we’ve covered lay the groundwork for more complex neural network applications. As we continue our journey in Deep Learning, we’ll encounter more sophisticated architectures and larger datasets, but the fundamental process remains the same.

Bonus: PyTorch

Now that we’ve seen how to implement our model in TensorFlow, let’s take a look at how we can achieve the same results using another powerful framework: PyTorch. PyTorch, developed by Facebook’s AI Research lab, is known for its flexibility and efficiency, making it a popular choice as well.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Data 
day = [2, 6, 1, 3, 2, 5, 7, 4, 3, 1]
temperature = [22, 33, 20, 25, 24, 30, 35, 28, 26, 21]
revenue = [1.51, 2.22, 1.37, 1.77, 1.64, 2.04, 2.42, 1.90, 1.75, 1.45]

# Convert to numpy array
X = np.array(list(zip(day, temperature)), dtype=np.float32)
y = np.array(revenue, dtype=np.float32)

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert data to PyTorch tensors
X_tensor = torch.tensor(X_scaled, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).view(-1, 1)

# Build the neural network
class IceCreamSalesModel(nn.Module):
    def __init__(self):
        super(IceCreamSalesModel, self).__init__()
        self.hidden = nn.Linear(2, 2)
        self.output = nn.Linear(2, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.hidden(x))
        x = self.output(x)
        return x

model = IceCreamSalesModel()

# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Train the model
num_epochs = 100
losses = []

for epoch in range(num_epochs):
    # Forward pass
    predictions = model(X_tensor)
    loss = criterion(predictions, y_tensor)
    losses.append(loss.item())

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Plot training loss
plt.plot(losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.show()

# Make predictions
with torch.no_grad():
    predictions = model(X_tensor)
print("Predicted Revenue:", predictions.numpy())

# Evaluate the model
mse_value = criterion(predictions, y_tensor).item()
print("Mean Squared Error:", mse_value)

And that’s all! We’ve learned how to implement a simple neural network to predict ice cream sales using both TensorFlow and PyTorch. In the next article, we’ll cover how to implement a Convolutional Neural Network (CNN) in both frameworks.

As always, feel free to connect with me on LinkedIn for any comments/questions!

Note: Unless specified, all images are by the author.

The post Implementing Neural Networks in TensorFlow (and PyTorch) appeared first on Towards Data Science.

]]>
Deep Learning Illustrated, Part 5: Long Short-Term Memory (LSTM) https://towardsdatascience.com/deep-learning-illustrated-part-5-long-short-term-memory-lstm-d379fbbc9bc6/ Fri, 21 Jun 2024 19:29:06 +0000 https://towardsdatascience.com/deep-learning-illustrated-part-5-long-short-term-memory-lstm-d379fbbc9bc6/ An illustrated and intuitive guide on the inner workings of an LSTM

The post Deep Learning Illustrated, Part 5: Long Short-Term Memory (LSTM) appeared first on Towards Data Science.

]]>
Welcome to Part 5 in our illustrated journey through Deep Learning!

Deep Learning, Illustrated

Today we’re going to talk about Long Short-Term Memory (LSTM) networks, an upgrade to the regular Recurrent Neural Networks (RNN) we discussed in the previous article. We saw that RNNs are used to solve sequence-based problems but struggle with retaining information over long distances, leading to short-term memory issues. Here’s where LSTMs come in to save the day. They use the same recurrent aspect of RNNs but with a twist. So let’s see how they achieve this.

Sidenote – this is one of my favorite articles I’ve written, so I can’t wait to take you on this journey!


Let’s first see what was happening in our RNN previously. We had a neural network with an input x, one hidden layer that consists of one neuron with the tanh activation function, and one output neuron with the sigmoid activation function. So the first step of the RNN looks something like this:

Terminology segue: We’re going to call each step a hidden state. So the above is the first hidden state of our RNN.

Here, we first pass our first input, x₁, to the hidden neuron to get h₁.

h₁ = first hidden state output

From here we have two options:

(option 1) Pass this h₁​ to the output neuron to get a prediction using just this one input. Mathematically:

y₁_hat = first hidden state prediction

(option 2) Pass this h₁​ to the next hidden state, by passing this value into the hidden neuron of the next network.

So the second hidden state will look like this:

first and second hidden states

Here we are taking the output from the hidden neuron in the first network and passing it to the hidden neuron in the current network alongside the second input, x₂​. Doing so gives us our second hidden layer output, h₂​.

h₂ = second hidden state output

Again, from here, we can do two things with h₂​:

(option 1) Pass it to the output neuron to get a prediction that is a result of both the first input, x₁, and the second, x₂.

y₂_hat = second hidden state prediction

(option 2) Or we simply pass it to the next network as is.

And this process continues, with each state taking the output from the hidden neuron of the previous network (alongside the new input) and feeding it to the hidden neuron of the current state, thereby generating the output for the current hidden layer. We could then pass this output either to the next network or to the output neuron to produce a prediction.

This entire process can be captured by these key equations:
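In our notation (with wₓ as the input weights, wₕₕ the hidden-to-hidden weight, wₕᵧ the hidden-to-output weight, and bₕ and bᵧ the bias terms):

hₜ = tanh(wₓ · xₜ + wₕₕ · hₜ₋₁ + bₕ)

yₜ_hat = sigmoid(wₕᵧ · hₜ + bᵧ)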

Despite its simplicity, this approach has a limitation: as we progress to the final steps, the information from the initial steps starts to fade away because the network fails to retain a lot of information. The larger the input sequence, the more pronounced this problem becomes. Clearly, we need a strategy to enhance this memory.

Enter LSTMs.

They accomplish this by implementing a simple yet effective strategy: at each step, they discard unnecessary information from the input and past steps, effectively "forgetting" information that’s not important and only retaining information that’s crucial. It’s kind of like how our brain processes information – we don’t remember every single detail, but only hold on to the details that we find necessary, discarding the rest.

LSTM Architecture

Consider a hidden state of our basic RNN.

hidden state of an RNN

We know each state starts with two players: the previous hidden state value hₜ₋₁ and the current input, xₜ. And the end goal is to produce a hidden state output, hₜ, which can either be passed onto the next hidden state or passed to the output neuron to produce a prediction.

LSTMs have a similar structure, with a slight elevation in complexity:

hidden state of an LSTM

This diagram might seem daunting, but it’s actually intuitive. Let’s break it down slowly.

We had two players in an RNN with the end goal of producing a hidden state output. Now we have three players at the beginning that are inputted to the LSTM – previous long-term memory Cₜ₋₁, previous hidden state output hₜ₋₁ and input xₜ:

And the end goal is to produce two outputs – new long-term memory Cₜ and new hidden state output hₜ:

The primary focus of the LSTM is to discard as much unnecessary information as possible, which it accomplishes in three sections –

i) the forget section

section 1 – forget section

ii) the input section

section 2 – input section

iii) and the output section

section 3 – output section

We notice that they all have a purple cell in common:

These cells are called gates. To decide what information is important and what is not, LSTMs employ these gates, which are essentially neurons with the sigmoid activation function.

These gates decide what proportion of information to retain in their respective sections, effectively acting as gatekeepers that only let a proportion of information pass through them.

The use of a sigmoid function in this context is strategic, as it outputs values ranging from 0 to 1, which correspond directly to the proportions of information we intend to retain. For instance, a value of 1 implies that all information will be preserved, a value of 0.5 means only half of the information will be kept, and a value of 0 denotes that all information will be discarded.

Now let’s come to the formula for all these gates. If you look closely at the hidden state diagram, we see that they all have the same inputs, xₜ and hₜ₋₁, but different weight and bias terms.

They all have the same mathematical formula, but we need to swap out the weight and bias values appropriately.
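Written out, using the standard f, i, o subscripts for the forget, input, and output gates:

fₜ = sigmoid(W_f · [xₜ, hₜ₋₁] + b_f)

iₜ = sigmoid(W_i · [xₜ, hₜ₋₁] + b_i)

oₜ = sigmoid(W_o · [xₜ, hₜ₋₁] + b_o)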

Each of these will produce values between 0 and 1, since that’s how the sigmoid function works, which will determine what proportion of certain information in each section we want to retain.

Note: Here you’ll notice we’re just using a vector notation of weights. This just means we’re going to multiply xₜ and hₜ₋₁ by their respective weights, represented by W.

Forget section

The main purpose of this section is to figure out what proportion of the long-term memory we want to forget. So all we’re doing here is taking this proportion (a value from 0–1) from the forget gate…

…and multiplying that with the previous long-term memory:

This product gives us exactly the part of the previous long-term memory that the forget gate thinks is important, forgetting the rest. So the closer the forget gate proportion, fₜ, is to 1, the more of the previous long-term memory we’re going to retain.

Note: The ‘x’ symbol within the blue bubble signifies a multiplication operation. This notation is consistently used throughout the diagrams. Essentially, these blue bubbles indicate that the inputs are subjected to the mathematical operation depicted in the bubble.

Input section

The main purpose of this section is to create a new long-term memory, which is done in 2 steps.

(step 1) create a candidate for the new long-term memory, C̃ₜ (C-tilde). We get this candidate for the new long-term memory using this neuron with the tanh activation function:

We see here that the inputs for this neuron are xₜ and hₜ₋₁, similar to the gates. So, passing them through the neuron…

…we get the output, which is a candidate for the new long-term memory.

Now we only want to retain necessary information from the candidate. This is where the input gate comes into play. We use the proportion obtained from the input gate…

…to retain only the necessary data for the candidate by multiplying this input gate proportion with the candidate:

(step 2) now to get the final long-term memory, we take the old long-term memory that we decided to keep in the forget section…

…and add that to the amount of new candidate that we decided to keep in this input section:
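In equation form:

Cₜ = (fₜ × Cₜ₋₁) + (iₜ × C̃ₜ)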

And voilà, we completed mission 1 of the game, we created a new long-term memory! Next, we need to produce a new hidden state output.

Output section

The main purpose of this section is to create a new hidden state output. This is pretty straightforward. All we’re doing here is taking the new long-term memory, Cₜ, passing it through the tanh function…

…and then multiplying it with the output gate proportion…

new hidden state output

…which gives us the new hidden state output!

And just like that we completed mission 2 – producing a new hidden state output!

And now we can pass these new outputs to the next hidden state to repeat the same process all over again.
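To make the three sections concrete, here’s a minimal NumPy sketch of a single LSTM step. The weight matrices and their names are illustrative stand-ins (a trained LSTM would have learned these values):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    # All three gates use the same inputs (x_t and h_prev),
    # just with their own weights and biases
    inputs = np.concatenate([x_t, h_prev])
    f_t = sigmoid(p['W_f'] @ inputs + p['b_f'])   # forget gate proportion
    i_t = sigmoid(p['W_i'] @ inputs + p['b_i'])   # input gate proportion
    o_t = sigmoid(p['W_o'] @ inputs + p['b_o'])   # output gate proportion

    # Candidate for the new long-term memory
    C_tilde = np.tanh(p['W_c'] @ inputs + p['b_c'])

    # New long-term memory: part of the old memory plus part of the candidate
    C_t = f_t * C_prev + i_t * C_tilde

    # New hidden state output
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

# Toy example: 1 input feature, hidden size of 1, random (untrained) weights
rng = np.random.default_rng(0)
p = {name: rng.normal(size=(1, 2)) for name in ['W_f', 'W_i', 'W_o', 'W_c']}
p.update({name: np.zeros(1) for name in ['b_f', 'b_i', 'b_o', 'b_c']})
h_t, C_t = lstm_step(np.array([0.5]), np.zeros(1), np.zeros(1), p)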

We also see that each of the hidden states has an output neuron:

Just like in an RNN, each of these states can produce their own individual outputs. And similar to RNNs, we use the hidden state output, ​hₜ, to produce a prediction. So passing hₜ​ to the output neuron…

…we get a prediction for this hidden state!

And that wraps this up. As we saw, LSTMs take RNNs to the next level by handling long-term dependencies in sequential data better. We saw how LSTMs cleverly manage to retain essential information and discard the irrelevant, much like our brains do. This ability to remember important details over extended sequences makes LSTMs particularly powerful for tasks such as natural language processing, speech recognition, and time series prediction.

Connect with me on LinkedIn or shoot me an email at shreya.statistics@gmail.com if you have any questions/comments!

NOTE: All illustrations are by the author unless specified otherwise

The post Deep Learning Illustrated, Part 5: Long Short-Term Memory (LSTM) appeared first on Towards Data Science.

]]>
Deep Learning Illustrated, Part 4: Recurrent Neural Networks https://towardsdatascience.com/deep-learning-illustrated-part-4-recurrent-neural-networks-d0121f27bc74/ Tue, 11 Jun 2024 19:24:50 +0000 https://towardsdatascience.com/deep-learning-illustrated-part-4-recurrent-neural-networks-d0121f27bc74/ An illustrated and intuitive guide on the inner workings of an RNN and the Softmax Activation Function

The post Deep Learning Illustrated, Part 4: Recurrent Neural Networks appeared first on Towards Data Science.

]]>
Welcome to Part 4 of our illustrated Deep Learning journey! Today, we’re diving into Recurrent Neural Networks. We’ll be talking about concepts that will feel familiar, such as inputs, outputs, and activation functions, but with a twist. And if this is your first stop on this journey, definitely read the previous articles, particularly Parts 1 and 2, before this one.

Deep Learning, Illustrated

Recurrent Neural Networks (RNN) are unique models explicitly designed to handle sequence-based problems, where the next position relies on the previous state.

Let’s unpack what a sequence-based problem is with a simple example from this MIT course. Picture a ball at a specific point in time, tₙ.

If we’re asked to predict the ball’s direction, without further information, it’s a guessing game – it could be moving in any direction.

But what if we were provided with data about the ball’s previous positions?

Now we can confidently predict that the ball will continue moving to the right.

This prediction scenario is what we call a sequential problem – where the answer is strongly influenced by prior data. These sequential problems are everywhere, from forecasting tomorrow’s temperature based on past temperature data to a range of language models including sentiment analysis, named entity recognition, machine translation, and speech recognition. Today, we’ll tackle sentiment detection, a straightforward example of a sequence-based problem.

In sentiment detection, we take a piece of text and determine whether it conveys a positive or negative sentiment. Today we’re going to build an RNN that takes a movie review as an input and predicts whether it is positive or not. So, given this movie review…

…we want our neural network to predict that this has a positive sentiment.

This may sound like a straightforward classification problem, but standard neural networks face two major challenges here.

First, we’re dealing with a variable input length. A standard neural network struggles to process inputs of differing lengths. For example, if we train our neural network with a three-word movie review, our input size will be fixed at three. But what if we want to input a longer review?

It would be stumped and unable to process the above review with twelve inputs. Unlike previous articles where we had a set number of inputs (the ice cream revenue model had two inputs – temperature and day of the week), in this case, the model needs to be flexible and adapt to however many words are thrown its way.

Second, we have sequential inputs. A typical neural network doesn’t fully understand the directionality of the inputs, which is critical here. Two sentences might contain the exact same words, but in a different order, they can convey completely opposite meanings.

Given these challenges, we need a method to process a dynamic number of inputs sequentially. Here’s where the RNNs shine.

The way we approach this problem is to first process the first word of the review, "that":

Then use this information to process the second word, "was":

And finally, use all of the above information to process the last word, "phenomenal", and provide a prediction for the sentiment of the review:

Before we start building our neural network, we need to discuss our inputs. Inputs to a neural network must be numerical. However, our inputs here are words, so we need to convert these words into numbers. There are several ways to do this, but for today, we’ll use a basic method.

Stay tuned for an upcoming article where we’ll explore more sophisticated methods to address this challenge.

For now, let’s imagine we have a large dictionary of 10,000 words. We’ll (naively) assume that any words appearing in the reviews can be found within this 10,000-word dictionary. Each word is mapped to a corresponding number.

To convert the word "that" to a bunch of numbers, we need to identify the number "that" is mapped to…

…and then represent it as a column of 10,000 0s, except the 8600th element, which is a 1:

Similarly, the numerical representations for the next two words, "was" (9680th word in the dictionary) and "phenomenal" (4242nd word in the dictionary), will be:

And that’s how we take a word and convert it into a neural network-friendly input.
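Here’s a quick NumPy sketch of this conversion (the word-to-position mappings are the ones from our toy dictionary above):

import numpy as np

VOCAB_SIZE = 10_000

def one_hot(position, vocab_size=VOCAB_SIZE):
    # A column of all 0s with a single 1 at the word's dictionary position
    vec = np.zeros(vocab_size)
    vec[position - 1] = 1  # positions are 1-based in our dictionary
    return vec

x_that = one_hot(8600)        # "that" is the 8600th word
x_was = one_hot(9680)         # "was" is the 9680th word
x_phenomenal = one_hot(4242)  # "phenomenal" is the 4242nd word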

Let’s now turn our attention to the design of the neural network. For the sake of simplicity, let’s assume our network has 10,000 inputs (= 1 word), a single hidden layer composed of one neuron, and one output neuron.

And of course, if this is a fully trained neural network, then each input will have associated weights and the neurons will have bias terms.

In this network, the input weights are labeled as wᵢ, where i denotes the input. The bias term in the hidden layer neuron is bₕ. The weight connecting the hidden layer to the output neuron is wₕᵧ. Finally, the bias in the output neuron is represented by bᵧ, as y indicates our output.

We will use the hyperbolic tangent function (tanh) as the activation function for the hidden neuron.

And as a refresher from the first article, tanh takes an input and produces an output within the range of -1 to 1. Large positive inputs tend towards 1, while large negative inputs approach -1.

To determine the sentiment of the text, we could use the sigmoid activation function in the output neuron. This function takes the output from the hidden layer and outputs a value between 0 and 1, representing the probability of positive sentiment. A prediction closer to 1 indicates a positive review, while a prediction closer to 0 suggests that it is not likely to be positive.

While this method works for now, there is a more sophisticated approach that could yield better results! If you stick around to the very end of the article, you’ll see a powerful and ubiquitous activation function – the Softmax Activation – that tackles this problem better.

With these activation functions, our neural network looks like this:

This neural network takes a text input and predicts the probability of it having a positive sentiment. In the above example, the network processes the input "that" and predicts its likelihood of being positive. Admittedly, the word "that" on its own doesn’t provide much hint as to the sentiment. Now, we need to figure out how to incorporate the next word into the network. This is when the recurrent aspect of recurrent neural networks comes into play, leading to a modification in the basic structure.

We input the second word of the review, "was", by creating an exact copy of the above neural network. However, instead of using "that" as the input, we use "was":

exact copy of the neural network with input "was"

Remember, we also want to use the information from the previous word, "that", in this neural network. Therefore, we take the output from the hidden layer of the previous neural network and pass it into the hidden layer of the current network:

incorporating data from the previous neural network with input "that" to the current one with input "was"

This is a crucial step, so let’s break it down slowly.

From the first article (https://towardsdatascience.com/neural-networks-illustrated-part-1-how-does-a-neural-network-work-c3f92ce3b462), we learned that each neuron’s processing consists of two steps: summation and activation function (please read the first article if you’re unsure of what these terms mean). Let’s see what this looks like in our first neural network.

In the hidden layer neuron of the first neural network, the first step is summation:

neural network 1 – hidden neuron – part 1 summation

Here, we multiply each of the inputs by their corresponding weights and add the bias term to the sum of all the products:

To simplify this equation, let’s represent it this way where wₓ represents the input weights and x represents the input:

neural network 1 – hidden neuron – part 1 summation

Next, in step 2, we pass this summation through the activation function, tanh:

neural network 1 – hidden neuron – part 2 activation function = h1

This produces the output h₁ from the hidden layer of the first neural network. From here, we have two options – pass h₁ to the output neuron or pass it to the hidden layer of the next neural network.

(option 1) If we want the sentiment prediction for just "that", then we can take h₁ and pass it to the output neuron:

option 1 – pass h1 to the output neuron in the first neural network

For the output neuron, we perform the summation step…

neural network 1 – output neuron – part 1 summation

…and then apply the sigmoid function to this sum…

neural network 1 – output neuron – part 2 activation = y1_hat

…which gives us our predicted positive sentiment value:

neural network 1 – output neuron – part 2 activation = y1_hat

So this y₁_hat gives us the predicted probability that "that" has a positive sentiment.

(option 2) But that’s not what we want. So instead of passing h₁ to the output neuron, we pass this information to the next neural network like so:

option 2 – pass h1 to the hidden neuron in the second neural network

Similar to other parts of the neural network where we have input weights, we also have an input weight, wₕₕ, for the input from one hidden layer to another. The hidden layer incorporates h₁ by adding the product of h₁ and wₕₕ to the summation step in the hidden neuron. Therefore, the updated summation step in the hidden neuron of the second neural network neuron will be:

neural network 2 – hidden neuron – part 1 summation

Key thing to note – all the bias and weight terms throughout the network remain unchanged since they are simply copies from the previous network.

This sum is then passed through the tanh function…

neural network 2 – hidden neuron – part 2 activation function

…producing h₂, the output from the hidden layer in the second neural network:

neural network 2 – hidden neuron – part 2 activation function

From here, again, we can obtain a sentiment prediction by passing h₂ through the output neuron:

neural network 2 – output neuron – part 1 and 2 = y2_hat

Here, y₂_hat yields the predicted probability that "that was" has a positive sentiment.

But we know that’s not the end of the review. So, we will replicate this process where we clone this network once again but with the input "phenomenal" and pass the previous hidden layer output to the current hidden layer.

We process the hidden layer neuron…

neural network 3 – hidden neuron – part 1 and 2 = h3

…to an output, h₃:

neural network 3 – hidden neuron – part 1 and 2 = h3

Since this is the last word in the review and consequently the final input, we pass this data to the output neuron…

neural network 3 – output neuron – part 1 and 2 = y3_hat

…to give us a final prediction of the sentiment:

neural network 3 – output neuron – part 1 and 2 = y3_hat

And this y₃_hat is the sentiment prediction for the movie review we want – this is how we achieve what we drew out in the beginning!

Formal Representation

If we flesh out the above diagram with details, then we’ll get something like this:

Each stage of the process involves an input, x, that travels through the hidden layer to generate an output, h. This output then either moves into the hidden layer of the next neural network or it results in a sentiment prediction, depicted as y_hat. Each stage incorporates weight and bias terms (bias is not shown in the diagram). A key point to underscore is that we’re consolidating all the hidden layers into a singular compact box. While our model only contains one layer with a single neuron in the hidden layer, more complex models could include multiple hidden layers with numerous neurons, all of which are condensed into this box, called the hidden state. This hidden state encapsulates the abstract concept of the hidden layer.

Essentially, this is a simplified version of this neural network:

It’s also worth noting that we can represent all of this in this streamlined diagram for simplicity:

The essence of this process is the recurrent feeding of the output from the hidden layer back into itself, which is why it’s referred to as a recurrent neural network. This is often how neural networks are represented in textbooks.

From a mathematical standpoint, we can boil this down to two fundamental equations:

the entire process represented by 2 formula
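Written out with the weights and biases we defined earlier (wₓ for the input weights, wₕₕ for the hidden-to-hidden weight, and wₕᵧ for the hidden-to-output weight):

hₜ = tanh(wₓ · xₜ + wₕₕ · hₜ₋₁ + bₕ)

yₜ_hat = sigmoid(wₕᵧ · hₜ + bᵧ)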

The first equation encapsulates the full transformation that takes place within the hidden state – in our case, the tanh activation function applied to the weighted sum within the individual neuron. The second equation denotes the transformation happening in the output layer, which is the sigmoid activation function in our example.

the entire process represented by 2 formula

What type of problems does an RNN solve?

Many-to-One

We just discussed this scenario where multiple inputs (in our case, all the words in a review) are fed into an RNN. The RNN then generates a single output, representing the sentiment of the review. While it’s possible to have an output at every step, our primary interest lies in the final output, as it encapsulates the sentiment of the entire review.

Another example is text completion. Given a string of words, we want the RNN to predict the next word.

One-To-Many

A classic example of a one-to-many problem is image captioning. Here, the single input is an image and the output is a caption consisting of multiple words.

Many-to-Many

This type of RNN is used for tasks like machine translation, for instance, translating an English sentence into Hindi.

Drawbacks

Now that we’ve unpacked how an RNN works, it’s worth addressing why they’re not as widely used (plot twist!). Despite their potential, RNNs face significant challenges during training, particularly due to something called the Vanishing Gradient Problem. This problem tends to intensify as we unroll the RNN further, which in turn, complicates the training process.

In an ideal world, we want the RNN to take into account both the current step input and the ones from the previous steps equally:

However, it actually looks something like this:

Each step slightly forgets the previous one, leading to a short-term memory problem known as the vanishing gradient problem. As the RNN processes more steps, it tends to struggle with retaining information from previous ones.

With only three inputs, this issue isn’t too pronounced. But what if we have six inputs?

We see that the information from the first two steps is almost absent in the final step, which is a significant issue.

Here’s an example to illustrate this point using a text completion task. Given this sentence to complete, an RNN might be successful.

However, if more words are added in between, the RNN might struggle to predict the next word accurately. This is because the RNN could potentially forget the context provided by the initial words due to the increased distance between them and the word to be predicted.

This highlights the fact that while RNNs sound great in theory, they often fall short in practice. To address the short-term memory issues, we use specialized types of RNNs known as Long Short-Term Memory (LSTM) networks, which is covered in Part 5 of the series!

Deep Learning Illustrated, Part 5: Long Short-Term Memory (LSTM)

Bonus: Softmax Activation Function

We spoke earlier about an alternate, much better way of tackling our sentiment prediction. Let’s take a few steps back to the drawing board and go back to when we were deciding on the activation function for our output neuron.

But our focus this time is a bit different. Let’s zoom in on a basic neural network, setting aside the recurrent aspects. Our goal now? To predict the sentiment of a single input word, not the entire movie review.

Previously, our prediction model aimed to output the probability of an input being positive. We accomplished this using the sigmoid activation function in the output neuron, which churns out probability values for the likelihood of a positive sentiment. For instance, if we input the word "terrible", our model would ideally output a low value, indicating a low likelihood of positivity.

However, thinking about it, this isn’t that great of an output. A low probability of a positive sentiment doesn’t necessarily imply negativity – it could also mean that the input was neutral. So, how do we improve this?

Consider this: What if we want to know whether the movie review was positive, neutral, or negative?

So instead of just one output neuron that spits out the probability prediction that the input is positive, we could use three output neurons. Each one would predict the likelihood of the review being positive, neutral, and negative, respectively.

Just as we used the sigmoid function for a single-output neuron network to output probability, we could apply the same principle to each of these neurons in our current network and use a sigmoid function in all of them.

And each neuron will output its respective probability value:

However, there’s a problem: the probabilities don’t sum up correctly (0.1 + 0.2 + 0.85 != 1) so this isn’t such a great workaround. Simply sticking a sigmoid function for all the output neurons doesn’t fix the problem. We need to find a way to normalize these probabilities across the three outputs.

Here’s where we introduce a powerful activation function to our arsenal – the softmax activation. By using the softmax activation function, our neural network takes on a new form:

While it may seem daunting at first, the softmax function is actually quite straightforward. It simply takes the output values (y_hat) from our output neurons and normalizes them.

However, it’s crucial to note that for these three output neurons, we won’t use any activation function; the outputs (y_hats) will be the result we obtain directly after the summation step.

If you need a refresher of what the summation step and activation step in the neuron entail, this article in the series goes over the inner workings of a neuron in detail!

We normalize these y_hat outputs through the use of the softmax formula. This formula provides the prediction for the probability of a positive sentiment:
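P(positive) = e^(y_hat_positive) / (e^(y_hat_positive) + e^(y_hat_neutral) + e^(y_hat_negative))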

Similarly, we can also obtain the prediction probabilities of negative and neutral outcomes:

Let’s see this in action. For instance, if "terrible" is our input, these will be the resulting y_hat values:

We can then take these values and plug them into the softmax formula to calculate the prediction probability that the word "terrible" has a positive connotation.

This means that by using the combined outputs from the three sentiment neurons, the probability that "terrible" carries a positive sentiment is 0.05.

If we want to calculate the probability of the input being neutral, we would use a similar formula, only changing the numerator. Therefore, the likelihood of the word "terrible" being neutral is:

And probability prediction that "terrible" is negative is:

And voila! Now the probabilities add up to 1, making our model more explainable and logical.

So, when we ask the neural network – "What’s the probability that "terrible" has a negative sentiment attached to it?", we get a pretty straightforward answer. It confidently states that there is an 85% probability that "terrible" has a negative sentiment. And that’s the beauty of the softmax activation function!
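As a quick sanity check, here’s a tiny NumPy sketch of softmax in action. The raw output values are made up, chosen so the result roughly matches the 0.05 / 0.10 / 0.85 example above:

import numpy as np

def softmax(y_hats):
    # Exponentiate each raw output, then normalize so the results sum to 1
    exps = np.exp(y_hats)
    return exps / exps.sum()

# Hypothetical raw outputs for "terrible" (positive, neutral, negative)
raw_outputs = np.array([-3.0, -2.3, -0.16])
print(softmax(raw_outputs))  # ~[0.05, 0.10, 0.85], summing to 1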

That’s a wrap for today! We’ve tackled two biggies – Recurrent Neural Networks and the Softmax Activation Function. These are the building blocks for many advanced concepts we’ll dive into later. So, take your time, let it sink in, and as always feel free to connect with me on LinkedIn or shoot me an email at shreya.statistics@gmail.com if you have any questions/comments!

NOTE: All illustrations are by the author unless specified otherwise

The post Deep Learning Illustrated, Part 4: Recurrent Neural Networks appeared first on Towards Data Science.

]]>
Deep Learning Illustrated, Part 3: Convolutional Neural Networks https://towardsdatascience.com/deep-learning-illustrated-part-3-convolutional-neural-networks-96b900b0b9e0/ Sat, 11 May 2024 05:33:02 +0000 https://towardsdatascience.com/deep-learning-illustrated-part-3-convolutional-neural-networks-96b900b0b9e0/ An illustrated and intuitive guide on the inner workings of a CNN

The post Deep Learning Illustrated, Part 3: Convolutional Neural Networks appeared first on Towards Data Science.

]]>
Welcome to Part 3 of our illustrated journey through Deep Learning. If you’ve missed the previous articles, definitely go back to read them. They lay the groundwork for what we’re about to dive into today.

Deep Learning, Illustrated

To quickly recap, we previously discussed the inner workings of neural networks by building a simple model to predict the daily revenue of an ice cream shop. We found that neural networks can handle complex problems by harnessing the combined power of several neurons. This allows them to uncover patterns in data that might otherwise be hard to recognize. We also learned that neural networks primarily solve two types of problems: Regression or Classification.

Just as we built a revenue prediction model, we can create models to address diverse problems by modifying the structure. Convolutional Neural Networks (CNNs) are specialized models designed for image recognition tasks. However, they rely on the same fundamental principles as the models we have encountered thus far (plus a few more steps). Today, we will explore the inner workings of a CNN and understand exactly what is happening behind the scenes.

For our first-ever CNN, let’s build an X-or-not-X model. This model should determine whether an image represents an X or not.

Groundbreaking, I know. For fellow Silicon Valley watchers, this model is very much inspired by my boy Jian Yang’s brilliant hotdog-not-hotdog app.

In our revenue model, we used two inputs – temperature and day of the week – to predict revenue. These were easy to input because they were numerical. But how do we input images into a neural network instead of numerical values?

The answer is rather straightforward. When we zoom into an image, we see that it’s basically just a bunch of pixels:

Since our X is a simple black and white image, let’s designate each pixel as either a 1 (representing a black pixel) or a 0 (representing a white pixel). These pixels are stored as a matrix of 0s and 1s.

We can convert this 5×5 matrix into a column:

And this column of 25 (5×5) 1s and 0s can now be our inputs into the neural network:

From the previous article, we also know that a trained neural network comes with weight and bias terms. Assuming this is a trained neural network, we’ll have 25 inputs to this neuron, each with its own weight, plus one bias term. If we want to create a more complex neural network (as images typically require), we need to add more neurons and/or layers. However, this will dramatically increase the number of weight and bias terms that need to be optimized, requiring significant computational power.

Despite this, it may still be feasible for very very small images, such as our 5×5 pixel image. However, a 256×256 pixel image will result in 65536 (256×256) input weights plus 1 bias term… for a neural network with just 1 neuron! More complex images would require even more neurons and layers (!!). As a result, this method of feeding image pixel values may not scale effectively.

Another concern is that images may not always look as expected. For instance, we could have this ideally centered, beautiful little ‘X’:

Or a wonky one like this:

Or an off-centered one like this:

All the images are of ‘X’, but each ‘X’ looks slightly different. If we train our neural network using a perfectly centered ‘X’, it may not perform well with other ‘X’ images. This is because the network only recognizes a perfectly-centered ‘X’. It cannot identify an off-center or distorted ‘X’. It knows only one pattern. This is not practical for real-world applications, as images are rarely that straightforward. Therefore, we need to adapt our neural network to handle situations where the ‘X’ isn’t perfectly centered.

We need to be more creative with our approach in constructing this neural network, perhaps by understanding the underlying patterns in all the images instead of just the pattern of one kind of image.

And if you think about it – our minds recognize images in a similar way, focusing on the features of an image and piecing them together. Given the vast amount of information we encounter, our brains excel at identifying features and discarding unnecessary information.

So we need to address two issues: reducing the inputs we feed into the neural network and finding a way to detect patterns in images.

Filters

Let’s start by finding some consistent pattern in all the ‘X’ images. For instance, one possible pattern can be:

same pattern across all 3 ‘X’ images

And then we can determine that the image is of an ‘X’ by confirming that this pattern exists in the image.

This pattern is called a filter here. A filter captures a critical characteristic of ‘X’. Thus, even if the image is rotated or smaller or distorted, we maintain the essence of the image.

These filters are typically small square matrices, most commonly 3×3 pixels, although the size can vary.

To apply a filter to an image for pattern detection, we slide the 3×3 filter over each section, and calculate the dot product of the filter and the section it covers. So for the first section we…

…and then multiply together each overlapping pixel value in the filter and matrix…

…and then add the products:

By computing the dot product between the image and the filter, we can say that the filter is convolved with the image and that’s what gives convolutional neural networks their name.

We now do this to all the sections by sliding this filter depending on something called the stride, which we can set. The stride dictates how many cells over we want to move our filter. So if our stride = 1, we move it over the next section like this…

…and if stride = 2, we move it over like this:

Usually the stride is set to 2, but in our case let’s set it to 1.

With stride = 1, if we store all the dot products in a matrix, we get:

We then add a bias term to this output matrix…

…which results in something called a feature map.

It’s important to note that the larger our strides, the smaller our feature map will be. In our example, we used stride = 1, resulting in a relatively large feature map.

When working with actual images, we may need to increase our strides. After all, we are dealing with a 5×5 input image in our example, but real-world images are usually much larger and more complex.

Typically, each value in this feature map is passed through the ReLU activation function. And as a quick reminder from the first article, here is the formula for ReLU:
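ReLU(x) = max(0, x)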

The function outputs the value as is if it’s greater than 0, and outputs 0 if the input is less than or equal to 0. Thus, by passing the feature map through the ReLU function, we obtain the following updated feature map:

In this scenario, all cells are set to 0, except for the one cell in the middle.

I know that was a lot of steps, but to summarize the convolutional process, we started with an input image of the X…

…and then a filter was applied to it, also known as convolving the filter with the image…

…subsequently, a bias term was added to the convolved matrix to create a feature map…

…and finally, we typically pass this feature map through the ReLU function to obtain an updated feature map:

The primary purpose of the convolution step is to reduce the input size (from the whole image to a feature map) to simplify processing. A valid question that arises is whether we’re losing a significant amount of information due to the reduced values in the resulting feature map matrix. Indeed, we do have fewer values, but the filters are designed to detect certain integral parts or features of the images and eliminate all unnecessary information. And like we discussed earlier, this is similar to how the human eye discerns objects, often ignoring irrelevant details. We don’t examine every single pixel, but rather look at distinct features. The focus is on preserving these essential features.
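Here’s a minimal NumPy sketch of this convolution step. The filter and bias values are made up for illustration, so the resulting numbers won’t exactly match the figures above:

import numpy as np

def convolve(image, kernel, stride=1):
    # Slide the filter over the (square) image, taking the dot product at each section
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            section = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(section * kernel)
    return out

# Our 5x5 'X' image (1 = black pixel, 0 = white)
image = np.array([[1, 0, 0, 0, 1],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [1, 0, 0, 0, 1]])

# A hypothetical 3x3 diagonal filter and an illustrative bias term
kernel = np.eye(3)
bias = -2

feature_map = np.maximum(convolve(image, kernel) + bias, 0)  # add bias, apply ReLU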

Similar to the previously mentioned filter, we can use additional filters to detect other features. For instance, we could use this filter…

…that can detect the following patterns:

So if we apply multiple filters using the same process as above, we’ll obtain a collection of feature maps derived from the same input image.

input image -> feature maps

A crucial question is, how do we determine the filters needed to detect features? This is determined during the training process, which we will discuss shortly.

Pooling

With our feature map now ready, we can move on to the next step – Pooling. This step is quite straightforward. We simply scan the previously created feature map, selecting small 2×2 sections, and choose the maximum value from each section. Here’s what our first step looks like:

max pooling – step 1

These 2×2 sections we take do not overlap, so here’s what our next step will look like:

max pooling – step 2

In this step, you’ll see we don’t have a full 2×2 section, but that’s okay because these sections don’t need to be perfect 2×2. We then move to the next step:

max pooling – step 3

And finally:

max pooling – step 4

We call this method max pooling because it takes the maximum value from each section. Alternatively, we could use mean pooling, which calculates the average value for each region. The result would look like this:

2×2 matrix from mean pooling

Note: Sum pooling is another option, which, as the name suggests, sums up the values in each region. However, max pooling is the most commonly used method.

Max pooling is primarily used to further reduce noise in an image. Its effectiveness becomes more apparent with larger images, as it identifies the area where the filter best matches the input image. Just as in the convolution step, the creation of the pooled feature map discards extraneous information. In this process, approximately 75% of the original information in the feature map is lost, as we retain only the maximum value from every set of four pixels and discard the rest. These are unnecessary details that, when removed, enable the network to function more efficiently. The extraction of the maximum value, the key point of the pooling step, is done to account for distortions.
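Continuing the sketch from the convolution step, max pooling might look like this (note that edge sections are allowed to be smaller than 2×2, just like in the walkthrough above):

def max_pool(feature_map, size=2):
    # Take the max of each non-overlapping section (edge sections may be smaller)
    h, w = feature_map.shape
    out = np.zeros((-(-h // size), -(-w // size)))  # ceiling division
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

pooled = max_pool(feature_map)  # feature_map from the convolution sketch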

Whew. That was quite a journey, but we’re not at the actual neural network part yet! But don’t worry, if you’ve read the previous articles, the rest will be pretty straightforward. All the work we’ve done so far has prepared us to use a real neural network. We’ll use the results from the pooling step as inputs for the neural network.

Flattening

The first step to input these values into a neural network involves flattening the feature map matrix. We can’t input the feature map as it is. Therefore, we flatten it. For instance, if we have four filters, they would result in four feature maps. These, in turn, would lead to four 2×2 matrices from the max pooling step. And this is what they will look like flattened:

max pooling -> flattening
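In NumPy terms, again continuing the earlier sketch and pretending we had four different filters’ pooled outputs:

# Stand-ins for the pooled maps from four different filters
pooled_maps = [pooled, pooled, pooled, pooled]
flattened = np.concatenate([p.flatten() for p in pooled_maps])  # one long input vector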

Neural Network (Finally)

All the features we’ve talked about before are stored in this flattened output, which allows us to use the flattened output as inputs to a neural network.

These features already provide a good level of accuracy for classifying images. But we want to improve the model’s complexity and precision. The job of the artificial neural network is to take this data and use these features to make the image classification better, which is the main reason we’re creating a convolutional neural network.

So we take these inputs and plug them into a fully connected neural network.

Note: This is called a fully connected neural network because here we are ensuring that each input and each neuron is connected to another neuron.

Let’s set our neural network architecture to be: 1 hidden layer with 3 neurons and 1 output neuron:

Now, we need to select our activation functions. In the previous article, we used the ReLU activation function for all neurons in our neural network for ice cream sales. The ReLU activation function remains a good choice for the inner layer. However, for the outer neuron, it’s not suitable due to the different nature of the problem we are trying to solve.

Previously, we were trying to answer: given the day of the week and temperature, what will the revenue of the ice cream store be? Now, our question is: given an image, is it the letter X or not? The nature of the problems and the answers we are seeking are significantly different, which means we need to adjust the processing of the outer neuron.

The first scenario was a regression problem, while the current one is a classification problem. We can approach our current problem by calculating a probability. For instance, given an input image, we can determine how likely it is that the image represents the ‘X’. Here, we’ll want the neural network to output values in the range of 0–1, where 1 indicates a high likelihood of being ‘X’, and 0 indicates it’s probably not an ‘X’.

To achieve this type of output, from our discussion of activation functions, the sigmoid function is a good choice.

The function takes an input and squishes it along an S-shaped curve into a value between 0 and 1. This is perfect for predicting probabilities. Given this, here is what the neural network would look like:

Let’s assume this neural network is trained. Then we know that each input in a trained neural network has associated weight and bias terms. This network subsequently outputs values between 0 and 1.

So if we input our flattened example into this trained neural network and the output is 0.98, that indicates that there’s a 98% probability the image is an ‘X’.

To recap once again let’s see visually what we have done so far. We start with an input image:

Then convolve this image by applying filters to it…

Add bias terms to the output and pass them through the ReLU function to get feature maps:

Next, we perform max pooling on the feature maps:

We then take these outputs, flatten them, and pass them through our neural network…

…to get a prediction of 0.98!

Okay, this is great. But now we need a way to check how good this 0.98 prediction is. In this case, we know our original image is an ‘X’, so we can say – "the CNN did a good job here!", but we need something that in math-y terms tells us the same thing.

In the previous article, we used the Mean Square Error (MSE) cost function to evaluate the accuracy of our prediction and used that for our training process. Similarly, we need to use a cost function here. But as we discussed earlier, since the kinds of predictions are different, we can’t use the MSE.

In this case, we’ll use something called a Log Loss function, which will sound familiar if you read the article on Logistic Regression. In Logistic Regression we’re trying to check the accuracy of a similar kind of output. Even though a CNN is way more complex than a Logistic Regression model, we’re trying to answer the same type of question.

A Log Loss cost function looks like this:
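Cost = -(1/n) Σ [ y · log(p_hat) + (1 - y) · log(1 - p_hat) ]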

Here, y = 1 if the image is an ‘X’ and 0 if it’s not, and p_hat is the predicted probability. The sigma just sums the values across all the image predictions we want to evaluate. So for this example, y = 1 (because we know the image is an ‘X’), the predicted probability p_hat = 0.98, and n = 1 because we are just trying to evaluate the output of one image:
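Cost = -(1/1) × [1 × log(0.98) + (1 - 1) × log(1 - 0.98)] = -log(0.98) ≈ 0.02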

Here, we see the cost function is very close to 0, which is good. The lower the cost function, the better. So this in mathematical terms is saying what we said earlier – "the CNN did a good job here!"

Training

NOTE: We won’t go into detail about the training process in this article because we covered it extensively in the previous one. So make sure you read that before you do this section!

Remember from the previous article that a neural network learns the optimal weights and bias through the training process using gradient descent. This involves running the training set through the network, making predictions, and calculating costs. We keep doing this until we get the optimal values. The same process happens when we train our Convolutional Neural Network, but with two changes.

First, instead of using the MSE cost function, we use the Log Loss. Second, besides finding the best weight and bias, we also look for the best filters and bias terms in the convolution step. The filters are just 3×3 number matrices. So, the goal is to find the optimal values for all these elements – the filters, bias terms in the convolution step, and the weight and bias terms in the neural network.

If you want to dive deeper into the math behind the training process, this video does a great job.


And that’s about it! This was a pretty meaty article, so it might be helpful to read through it and the previous two articles a couple of times and work through some of the logic on your own to let the concepts sink in.

Check out this article that brings this X-or-not-X classifier to life by building a CNN from scratch in TensorFlow!

Implementing Convolutional Neural Networks in TensorFlow


Part 4 on Recurrent Neural Networks is now live!

Deep Learning Illustrated, Part 4: Recurrent Neural Networks

NOTE: All images are illustrated by the author unless indicated otherwise.

As always, feel free to connect with me on LinkedIn if you have any questions/comments!

The post Deep Learning Illustrated, Part 3: Convolutional Neural Networks appeared first on Towards Data Science.

]]>
Deep Learning Illustrated, Part 2: How Does a Neural Network Learn? https://towardsdatascience.com/deep-learning-illustrated-part-2-how-does-a-neural-network-learn-481f70c1b474/ Thu, 08 Feb 2024 20:15:15 +0000 https://towardsdatascience.com/deep-learning-illustrated-part-2-how-does-a-neural-network-learn-481f70c1b474/ An illustrated and intuitive guide to Neural Networks

The post Deep Learning Illustrated, Part 2: How Does a Neural Network Learn? appeared first on Towards Data Science.

]]>
An illustrated and intuitive guide on how Neural Networks learn

Welcome to Part 2 of the Deep Learning Illustrated series. In the previous article (definitely read that first!), we covered how a neural network works and how a trained neural network makes predictions. We also learned that the neural network arrives at optimal weight and bias values during the training process.

Deep Learning Illustrated, Part 1: How Does a Neural Network Work?

In this article, we’ll delve into the training process and explore exactly how a neural network learns.

📣 If you haven’t read my previous articles, I highly recommend you start with my series of articles covering the basics of machine learning, specifically the one on Gradient Descent because you’ll find that a lot of the material covered there is relevant here.

Machine Learning Starter Pack

Let’s say we want to create a neural network that predicts the daily revenue of ice cream sales using the features temperature and day of the week.

This is the training dataset we’re using:

To build a neural network, as we learned in the previous article, we need to first decide on its architecture. This includes determining the number of hidden layers, the number of neurons in each layer, and the activation function of each neuron.

Let’s say we decided our architecture is: 1 hidden layer with 2 neurons, and 1 output neuron, all using the rectifier activation function.

Terminology segue: In the previous article, we learned about using subscripts to differentiate between different weights. We’re sticking with the same convention here, and in addition, we’ll use superscripts to indicate the layer to which the bias and weights belong. So, above, we can see that the weights going into the first layer of the neurons and the bias terms in that layer all have a superscript of 1.

Another thing you’ll notice is that our predicted output is denoted as r_hat. We learned that the hat symbol indicates it’s the predicted value, and since we’re predicting revenue here, we’re using r.

Once we’ve nailed down the architecture, it’s time to train the model by feeding it some data. During this training process, the neural network will learn the optimal values of the weight and bias terms. Let’s say that after training the model using the training data above, it produces the following optimal values:

This article will focus on exactly how we arrived at these optimal values.

Let’s start with a simple scenario. Suppose we have all the optimal values except the bias term for the outer layer neuron.

Since we don’t know the exact value of the bias, we begin by making an initial guess and setting the value to 0. Typically, bias values are initialized to 0 at the start.

Now, we need to input all the ice cream store features to make revenue predictions (aka forward propagation, as we learned in the previous article), assuming that the last bias term is 0. Let’s pass the 10 rows of our training data into the neural network…

…to get the following predictions:

Now that we have the predictions when the last bias term is equal to 0, we can compare them to the actual revenue. In the previous article, we learned that we can measure the accuracy of our predictions using a cost function, specifically the Mean Squared Error (MSE) for our use case.

Calculating the MSE of this model with a bias of 0:

We also know that the ultimate objective of any model is to reduce the MSE. Therefore, the goal now is to find an optimal bias value that minimizes this MSE.

One way to compare the MSE values at different bias values is by brute forcing it and trying different values for the last bias term. For example, let’s make a second guess for the bias term that is slightly higher than the last value of 0. Let’s try bias = 0.1 next.

We pass in the training data to the new model with bias = 0.1…

…which results in these predictions…

…which we then use to calculate MSE:

As we can see, the MSE of this model (0.03791) is slightly better than the previous MSE when the bias was set to 0 (0.08651).

To visualize this more clearly, let’s plot these values on a graph.

We can continue using this brute-force method by guessing values. Let’s say we also guessed 4 more values: bias = 0.2, 0.3, 0.4, and 0.5. We repeat the same process as above to generate an MSE chart that looks like this:

We notice that at bias = 0.3, the MSE is at its lowest. And at bias = 0.4 the MSE starts to increase again. This tells us that we minimized the MSE at bias = 0.3.

Fortunately, we were able to determine this after a few educated guesses and then confirm it through additional attempts. However, what if the optimal bias value was 100? In that case, we would need to make 1000 (100 x 10) guesses to reach it. Therefore, this approach is not very efficient for finding the optimal bias values. Additionally, how can we be certain that the bias with the lowest MSE value is exactly 0.3? What if it’s 0.2998 or 0.301? It would be difficult to make precise guesses like that using this brute force technique.

Gradient Descent

Luckily, we have a waaaay more efficient way to determine the optimal bias value. We will utilize a concept called Gradient Descent. And yay for us – gradient descent was already covered (with beautiful illustrations if I can say so myself) in a previous article. So definitely read that before continuing.

To quickly summarize, by using gradient descent and leveraging derivatives, we can efficiently reach the lowest point of any convex curve (essentially a U-shaped curve). This is ideal in our current situation because the MSE graph above resembles a U-shaped curve, and we need to find the valley where the MSE is minimized. Gradient descent guides us by indicating the size and direction of each step needed to reach the bottom of the curve as quickly as possible.

Now let’s restart the process of finding the optimal bias using the steps laid out in gradient descent.

Step 1: Start with a random initial value for the bias

We can start with bias = 0 for instance:

Step 2: Calculate the size of our step

Next, we need to determine the direction and how big of a step we should take. This can be achieved by calculating the step size, which is the result of multiplying a constant value known as the learning rate by the gradient of the MSE at the bias value. In this case, the bias value is 0 for this iteration.

Note: The learning rate is a constant used to control the step size. Typically, it falls between 0 and 1.

Let’s examine the derivative value more closely here. We know that the MSE is a function of r_hat, as shown in the formula:

And we also know that r_hat is determined by the ReLU function in the last neuron, as we can obtain r_hat only by utilizing the activation function:

And we know that the ReLU function in the last neuron includes the bias term.

Now, if we want to calculate the derivative of MSE with respect to the bias, we will use something called the chain rule, a super integral part of calculus, which utilizes the above 3 key pieces of information.

We need to use the chain rule because the terms depend on each other, but only indirectly: the MSE depends on r_hat, r_hat depends on the summation, and the summation depends on the bias. So dMSE/db = (dMSE/dr_hat) × (dr_hat/dsum) × (dsum/db). It’s called the chain rule because the terms are all linked in a chain-like structure, and we can almost think of the numerators and denominators canceling each other out.

This is how we calculate the derivative of MSE with respect to the bias. We calculate this derivative at the current bias value (0).
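Here’s a hedged sketch of that chain-rule calculation in code. The pre-activation values z (the summation outputs of the last neuron at the current bias) and the actual revenues r below are made-up numbers, not our training data:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_slope(z):
    return (z > 0).astype(float)       # ReLU's derivative: 1 where z > 0, else 0

def dmse_dbias(z, r):
    # chain rule: dMSE/db = dMSE/dr_hat * dr_hat/dsum * dsum/db (dsum/db = 1)
    r_hat = relu(z)
    dmse_drhat = 2 * (r_hat - r) / len(r)
    return np.sum(dmse_drhat * relu_slope(z))

z = np.array([0.5, 1.2, -0.3])         # made-up summation values (3 rows)
r = np.array([0.7, 1.1, 0.2])          # made-up actual revenues
print(dmse_dbias(z, r))                # the gradient at the current bias
```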

Step 3: Update the bias value by using the above step size

This will provide us with a new bias value that will hopefully bring us closer to our optimal bias value.

Step 4: Repeat Steps 2–3 until we reach our optimal value

We will continue to repeat this process of taking steps…

…making tiny leaps, with steps shrinking as we inch closer to the bottom…

…until finally…

…we reach the optimal value!

NOTE: We achieve the optimal value when the step size is close to 0 or when we reach a maximum number of steps that we set in the algorithm.
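Putting Steps 1–4 together, here’s what the whole loop looks like in code, again using a toy stand-in derivative for our U-shaped MSE curve (a real run would recompute forward propagation at every new bias value):

```python
def dmse_dbias_at(bias):
    return 2 * (bias - 0.3)            # derivative of our toy (bias - 0.3)^2 curve

learning_rate = 0.1
bias = 0.0                             # Step 1: initial guess
for _ in range(1000):                  # Step 4: repeat (capped at 1000 steps)
    step_size = learning_rate * dmse_dbias_at(bias)   # Step 2
    bias -= step_size                                 # Step 3
    if abs(step_size) < 1e-6:          # steps shrink to ~0 near the bottom
        break

print(round(bias, 4))  # ~0.3, with no guessing grid needed
```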

Perfect. This is how we can find the bias term, assuming that the optimal values for the other variables are already known.

Terminology segue: This process of determining the value of this bias term is called backpropagation. In our previous article, we focused on forward propagation, which involves passing inputs forward to obtain an output – we are literally propagating the inputs forward. Meanwhile, this process is called backpropagation because we move backwards through the network to update the bias values.

Now, let’s go one step further and consider a scenario where we know all the optimal values except for the bias term and the weight of the second input going into the last neuron.

Again we need to find optimal values for these two terms so that the MSE is minimized. For different values of the weight and bias, let’s create a plot of MSE. This plot will be similar to the one shown above, but in 3 dimensions.

Similar to the previous MSE curve, we need to find the point that minimizes the MSE. This point, known as the valley point, will provide us with the optimal values for the bias and weight terms. Once again, we can use gradient descent to reach this minimum point. The process is essentially the same here as well.

Step 1: Randomly initialize values of the weight and bias

Step 2: Calculate the step size using partial derivatives

This is where a slight deviation occurs. Instead of calculating a single derivative of the MSE, we calculate partial derivatives (one with respect to the weight and one with respect to the bias) and update both terms simultaneously. By "simultaneously," we mean that both partial derivatives are evaluated at the current weight and bias values before either is updated, and we again use the chain rule:

Step 3: Simultaneously update the weight and bias terms

Step 4: Repeat Steps 2–3 until we converge at the optimal values
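In code, a sketch of the simultaneous update might look like this, with toy stand-in partial derivatives whose valley sits at weight = 0.5 and bias = 0.3:

```python
def dmse_dw(w, b):
    return 2 * (w - 0.5)               # toy partial derivative w.r.t. the weight

def dmse_db(w, b):
    return 2 * (b - 0.3)               # toy partial derivative w.r.t. the bias

w, b, lr = 0.0, 0.0, 0.1               # Step 1: initialize both terms
for _ in range(1000):                  # Step 4: repeat until convergence
    grad_w = dmse_dw(w, b)             # Step 2: both partials evaluated at the
    grad_b = dmse_db(w, b)             #         *current* (w, b) values
    w, b = w - lr * grad_w, b - lr * grad_b   # Step 3: simultaneous update
```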


And we can get crazy with this. Now what if we want to optimize for all 9 values in the neural network?

Then we’ll have 9 simultaneous update equations to work through in order to reach the minimum. (We can’t even attempt to draw out this MSE function because, as you can, or rather can’t, imagine, that’ll be one insane-looking graph.)

Even though the math becomes more complicated to do by hand as the number of simultaneous equations increases, the concept remains the same. We are trying to gradually move towards the bottom of the valley, relying on gradient descent to guide us.

By applying the equations and optimization procedures discussed above, hidden patterns of the data naturally emerge. This allows us to find these deep patterns without any human intervention.


Okay, to recap, we now understand that we always want to minimize the cost function (MSE in the above case) and how to obtain optimal values for our weight and bias terms that minimize MSE using gradient descent.

We learnt that by using gradient descent, we can easily traverse a convex-shaped curve to reach the bottom. And luckily for our case study, we had a beautiful-looking convex curve. However, sometimes we may encounter a cost function that doesn’t produce a perfect convex curve and instead produces something that looks like this:

If we use gradient descent, we may mistakenly settle into one of the many local minima (points that appear to be minimum points but aren’t) instead of the global minimum (the actual lowest point).

Another issue with gradient descent is that as the number of data points in our dataset or the number of parameters increases, the time it takes to perform gradient descent also increases, because each iteration involves more computation.

In our small example, we have 10 data points (which is very unrealistic; usually we have hundreds of thousands of data points) and we are trying to optimize 9 parameters (this number can also be very high depending on how complex the architecture is). Currently, for each iteration of gradient descent, we use 10 data points to calculate the partial derivatives and update 9 parameter values.

This is essentially what Gradient Descent is doing:

In each iteration, we perform approximately 90 (= 9 × 10) small calculations to compute the derivative contribution of each individual data point. Typically, we perform about 1,000 iterations like this, resulting in a total of 90,000 (= 90 × 1,000) calculations.

However, what if we have 100,000 data points instead of just 10? In that case, we would need to calculate the MSE for all 100,000 data points and take the derivative of 900,000 (= 9 × 100,000) terms. Normally, we would perform around 1,000 steps of gradient descent to reach our optimal values, resulting in a staggering 900,000,000 (= 900,000 × 1,000) calculations. Additionally, our data can get much bigger, with row counts in the millions and far more parameters to optimize. This can quickly become very challenging.

To avoid this issue, we can utilize alternate optimization algorithms that are faster and more powerful.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is similar to gradient descent, with a teeny difference. In gradient descent, we update our values after calculating the MSE for the entire training dataset, which contains all 10 values. However, in SGD, we calculate the MSE using only one data point from the dataset.

The algorithm randomly selects a single data point and uses it to update the parameter values, instead of using the entire dataset.
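Here’s a minimal sketch of one pass of SGD. To keep it runnable without the full network, the rows are made-up numbers and each row’s error is just (bias − target)², i.e. a toy model whose only parameter is the bias:

```python
import random

targets = [0.28, 0.35, 0.31, 0.25]     # made-up rows; error per row = (bias - t)^2
bias, lr = 0.0, 0.1

random.shuffle(targets)                # pick rows in random order
for t in targets:
    grad = 2 * (bias - t)              # gradient from ONE data point only
    bias -= lr * grad                  # update immediately, then move on
```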

This makes for a much lighter algorithm, and therefore a faster one than its all-encompassing counterpart.

Mini-batch Gradient Descent

This approach is a combination of vanilla and stochastic gradient descent. Instead of updating values based on just one data point or the entire dataset, we process a batch of data points per iteration. We can choose the batch size to be 5, 10, 100, 256, etc.

For example, if our batch size is 4, we calculate the MSE and its partial derivatives using 4 rows of data at a time.
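And the mini-batch version of the same toy setup, updating once per batch of 4 made-up rows:

```python
targets = [0.28, 0.35, 0.31, 0.25, 0.30, 0.33, 0.27, 0.29]   # made-up rows
bias, lr, batch_size = 0.0, 0.1, 4

for start in range(0, len(targets), batch_size):
    batch = targets[start:start + batch_size]
    grad = sum(2 * (bias - t) for t in batch) / len(batch)   # averaged over the batch
    bias -= lr * grad                  # one update per batch, not per row
```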


When dealing with gradient descent, apart from the big-data problem and local minimum issues, we can encounter yet another issue. Remember the learning rate? We didn’t delve too deeply into it, but we discussed that it’s a constant term set at the beginning of the model-building process. The choice of learning rate greatly affects the performance of gradient descent. If we set it too low, we inch along and may never converge to the optimal values; if we set it too high, we may overshoot our step and diverge from the optimal value. In reality, there is a happy medium. Now, the question is: how do we find this learning rate?

Option #1: Try lots of different learning rates and see what works well

(much smarter) Option #2: Design an adaptive learning rate that "adapts" to our neural network and the MSE landscape as training progresses.

This is precisely what other optimization algorithms aim to accomplish. However, discussing them in detail would require a whole other article. If you’re interested, you can refer to this article that dives into some popular ones.
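To see the learning-rate trade-off in action, here’s a toy run on the 1-D curve mse(b) = (b − 3)², whose minimum is at b = 3 (purely illustrative, not our revenue model):

```python
def run(learning_rate, steps=50):
    b = 0.0
    for _ in range(steps):
        b -= learning_rate * 2 * (b - 3)   # derivative of (b - 3)^2
    return b

print(run(0.001))  # too low: after 50 steps we've barely moved toward 3
print(run(0.1))    # happy medium: converges to ~3
print(run(1.1))    # too high: each step overshoots and we diverge
```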

That wraps up our foray into Neural Networks. These two articles should provide a solid foundation as we journey further into the world of Deep Learning. And to see how we put these concepts into action, read this article that builds the above ice cream revenue neural network in TensorFlow (and PyTorch as well!)

Implementing Neural Networks in TensorFlow (and PyTorch)

Part 3 on Convolutional Neural Networks is live now!

Deep Learning Illustrated, Part 3: Convolutional Neural Networks

Unless indicated, all images are by the author.

As always, feel free to connect with me on LinkedIn if you have any questions/comments!

The post Deep Learning Illustrated, Part 2: How Does a Neural Network Learn? appeared first on Towards Data Science.

Deep Learning Illustrated, Part 1: How Does a Neural Network Work? https://towardsdatascience.com/neural-networks-illustrated-part-1-how-does-a-neural-network-work-c3f92ce3b462/ Wed, 31 Jan 2024 22:02:59 +0000 https://towardsdatascience.com/neural-networks-illustrated-part-1-how-does-a-neural-network-work-c3f92ce3b462/ An illustrated and intuitive guide to Neural Networks

Deep Learning Illustrated, Part 1: How Does A Neural Network Work?

An illustrated and intuitive introduction to Neural Networks

If you have read my previous articles, you’ll know what’s coming next. In this part of the internet, we take complex-sounding concepts and make them fun and nbd by illustrating them. And if you haven’t read my previous articles, I highly recommend you start with my series of articles covering the basics of machine learning because you’ll find that a lot of the material covered there is relevant here.

Machine Learning Starter Pack

Today, we’re going to tackle the big boy – an introduction to Neural Networks, a kind of Machine Learning model. This is just the first article in a whole series I plan on doing on Deep Learning. It will focus on how a simple artificial neural network learns and provide you with a deep (ha, pun) understanding of how a neural network is constructed, neuron by neuron, which is super essential as we’ll continue to build upon this knowledge. While we will dive into the mathematical details, there’s no need to worry because we will break down and illustrate each step. By the end of this article, you’ll realize that it’s waaaaay simpler than it sounds.

But before we explore that, you might be wondering: Why do we need neural networks? With so many machine learning algorithms available, why choose neural networks? The answers to this question are plentiful and extensively discussed, so we won’t delve too deeply into it. But it’s worth noting that neural networks are incredibly powerful. They can identify complex patterns in data that classical algorithms may struggle with, tackle highly complex machine learning problems (such as natural language processing and image recognition), and diminish the need for extensive feature engineering and manual efforts.

But all that said, neural network problems pretty much boil down to 2 main categories – Classification, predicting a discrete label for a given input (ex: is this a picture of a cat or a dog? is this movie review positive or negative?) or Regression, predicting a continuous value for a given input (ex: weather prediction – what will the temperature be tomorrow?).

Today we’ll focus on a regression problem. Consider a simple scenario: we recently moved to a new city and are currently searching for a new home. However, we notice that the prices of houses in the area vary significantly.

Since we are unfamiliar with the city, our only source of information is what we can find online. We come across a house that interests us but are unsure whether it is priced fairly.

So we decided to build a neural network for predicting the price of a house based on certain features – its size (in feet²), location (1=urban, 2=suburban, 3=rural), age, and the number of bedrooms. Our goal is to use these features to predict the house price.

The first thing we do is collect data about houses in the neighborhood and what price they sold for.

Next, we want to train a neural network. Training involves feeding this dataset into the model, which learns the patterns in the data.

Terminology segue: Since we’re using the dataset above to train the model, it is called the training data. Usually our training data will contain 1000s if not 100000s of rows, but we’ll keep it simple for now.

As a result, the model becomes capable of predicting the price of a new house based on the available data.

But before getting into the model building and training, let’s understand why it is called a neural network.

Background

A neural network enables computers to process data in a manner inspired by the human brain. It utilizes interconnected neurons arranged in layers, resembling the structure of the human brain.

This is a biological neuron.

It receives inputs, processes the received inputs or data (this processing is nothing short of magical), and generates an output.

Just like the human brain, which processes data by receiving inputs and generating outputs, the neural network operates similarly.

The blue lines here represent the inputs to the neuron. In the context of pricing a house, these inputs can be considered as the different feature variables, while the output will be the predicted house price.

Each input is associated with a constant term called a weight. So let’s add them to our artificial neuron.

The purpose of these weights is to indicate the importance of an input. A higher weight value means that the input is considered more important. So if the weight of age is higher than that of location, it means that the age of the house is given more importance than the location of the house.

Now, just like some magic happens in the biological neuron, this is what that magic looks like in the artificial neuron.

When we zoom in, we see that this magic is essentially 2 mathematical steps.

Magic, Part 1: Summation

The first part is a summation. Here, we multiply each input by its corresponding weight and then sum them together.

You may have also noticed a little b at the top. This is called the bias term and it is a constant value. We add this value to the weighted sum to complete the summation.

Mathematically:

summation = w₁x₁ + w₂x₂ + … + wₙxₙ + b

where the features are represented by xᵢ and n = number of features
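In code, this summation step is a one-liner. The feature values and weights below are placeholders, not trained values:

```python
def summation(features, weights, bias):
    # multiply each input by its weight, add them up, then add the bias
    return sum(w * x for w, x in zip(weights, features)) + bias

# placeholder values: [size, location, age, bedrooms] with made-up weights
z = summation([2500, 1, 10, 4], [0.5, 0.2, -0.1, 0.3], bias=2.0)
```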

Magic, Part 2: Activation Function

Here the above summation is inputted through something called the activation function.

Think of activation functions as the translators of raw data into meaningful insights. They take the summation from the previous step and transform it into an output that’s useful for our specific task.

Let’s start with the binary step function. It’s straightforward: if your input (let’s call it x) is equal to or greater than 0, the function spits out a 1; otherwise, it gives you a 0. This is super handy when you need a clear-cut decision, like a yes or no. For example, based on the inputs will this house sell?

Then there’s the linear function, which tells it like it is. It simply returns whatever value it receives. So, if our summation is 5, the output is also 5.

Moving on to the sigmoid function, a real game-changer. It elegantly squishes any input value to fit within a 0 to 1 range. Why is this awesome? Because it’s perfect for probability-based questions. For example, what’s the likelihood of a house selling given certain conditions?

Then there’s the hyperbolic tangent function, or tanh for short. It’s similar to the sigmoid but with a twist: it outputs values ranging from -1 to 1. So, larger positive inputs hover near 1, while larger negative ones approach -1.

And, drumroll please, we have the rectifier function, also known as the ReLU (Rectified Linear Unit). This one’s a star in the neural network world. It’s simple but effective: if the input is positive, it keeps it; if negative, it turns it to zero. This functionality makes it incredibly useful in numerous scenarios.

We also have another one called Leaky ReLU (Leaky Rectified Linear Unit) which is a clever twist on the regular ReLU. While ReLU sets all negative inputs to zero, Leaky ReLU allows a small, non-zero, constant output for negative inputs. Imagine it as a slightly open faucet, letting a tiny trickle of water (or in our case, data) through, even when it’s mostly turned off.

The last one we’ll discuss, which has become more popular recently, is the Swish function.

There’s a whole universe of other activation functions out there, each with unique characteristics. But these are some of the most popular and versatile ones. (read more about them here)
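For reference, here’s what these activation functions look like as quick Python sketches (Swish shown in its simplest form, x times sigmoid of x):

```python
import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1, 0)      # clear-cut yes/no

def linear(x):
    return x                           # tells it like it is

def sigmoid(x):
    return 1 / (1 + np.exp(-x))        # squishes into (0, 1)

def tanh(x):
    return np.tanh(x)                  # squishes into (-1, 1)

def relu(x):
    return np.maximum(0, x)            # keeps positives, zeroes negatives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # a tiny trickle for negatives

def swish(x):
    return x * sigmoid(x)
```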

The cool thing about activation functions is that they can be tailored to our specific problem. For instance, if we’re predicting something continuous, like the price of a house (a regression problem), the rectifier function is a great pick. It only gives positive outputs, aligning well with the fact that house prices aren’t negative. But if we’re estimating probabilities, like the chances of a house selling, the sigmoid function is our go-to, with its neat 0 to 1 range mirroring probability values.

Let’s go ahead and choose the activation function to be a rectifier function in our neuron because that seems to make the most sense for our problem.

And this, folks, is considered a neural network model (!), albeit the simplest form of one. It consists of just 1 neuron, but it’s a great place to start nonetheless.

The next thing we need to figure out is what the values of the weights and bias terms should be. We know they are constant terms, but what should their values be?

Remember, we discussed training the neural network earlier? All that means is determining the optimal values for our weights and bias terms. We’ll get into specifics of exactly how this training happens later.

For now, let’s assume that we trained our neural network and obtained the optimal values. So, let’s replace the terms with these optimal values.

And this is what we call a trained neural network that is ready to be put into action. Essentially, what this means is that we have utilized the available data to create the most effective model using one neuron from the training. Now, we can make predictions about house prices by inputting the relevant features of the house whose value we are trying to determine.

Let’s try predicting the price of the first house in our training dataset.

When we pass in the inputs, Part 1 of the magical data processing is the summation…

…and Part 2 is passing this summation value through the rectifier function:

Essentially our model takes the features of the first house as input and predicts a price of $1,036,000 based on those features. In other words, it’s saying, "Given these house features, I predict the price of the house to be $1,036,000."
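As a sketch, the whole one-neuron model is just the summation followed by the rectifier. The weights and bias below are placeholders; the actual trained values are the ones shown in the figure above:

```python
def predict_price(features, weights, bias):
    z = sum(w * x for w, x in zip(weights, features)) + bias   # summation
    return max(0, z)                                           # rectifier

# placeholder weights/bias (in $1000s), NOT the trained values from the figure
price = predict_price([2500, 1, 10, 4], [0.4, 10, -2, 25], bias=16)
```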

But when we compare it to the actual house price, $1.75M, it’s not that great of a prediction, unfortunately. We’re $714,000 off. Yikes.

If we input the remaining houses into this simple model, we will obtain the following predicted prices:

And as we can see, the predicted prices are all quite inaccurate. This indicates that our model is not very effective, which is understandable considering its lack of sophistication. It consists of only one neuron. Just like the human brain, it’s only when neurons collaborate that they make more impactful decisions and process data with greater sophistication.

Let’s take a step back and consider if there is a more intuitive way to solve this problem. Perhaps there is a way to enhance our predictions by considering the interactions between different features. Maybe the combination of two features is more significant than the individual features alone?

For example, the combination of bedrooms and size could be valuable. It’s possible that a smaller house with many rooms might feel cramped, making it less appealing to buyers and resulting in a lower price. Similarly, the combination of age and location could be important. In urban areas, newer houses tend to be more expensive, while in rural areas, buyers may prefer the charm of older houses, which can increase their value. It is also possible that older houses in rural areas are more renovated. Moreover, the combination of location, size, and bedrooms can be interesting. In suburban and rural areas, having more bedrooms in smaller houses may not be favorable. However, in urban areas, where people prefer proximity to the city for work while still having enough space for their families, they may be willing to pay more for smaller houses as long as they have sufficient bedrooms.

The possibilities are endless, and it’s challenging to consider all the different combinations. Fortunately, this is where we leverage the power of multiple neurons. Similar to how biological neurons collaborate to make better decisions, artificial neurons also work together to achieve the same goal.

Let’s make our simple neural network more powerful by adding two more neurons to it. This will create a cobweb-like structure:

In this case, all the inputs are being fed into each of the 3 neurons. Since we have inputs going into 3 neurons and we know each input is associated with a weight, there will be a total of 12 (= 4 * 3) different weights. To keep them separate, let’s introduce some notation.

The weights are represented by wᵢⱼ, where i is the neuron number and j is the input that goes into it. So for instance, this highlighted weight…

…is labeled w₁₂ because it’s the 2nd input to the 1st neuron. And this highlighted weight…

…is labeled w₃₄ because it’s the 4th input to the 3rd neuron. Similarly, here are all the weights labeled:

These weights can take on any value, which is determined during the training process.

Let’s say the training process of our neural network determined that only the bedroom and size features are relevant for neuron 1, while the other 2 features are not considered. In that case, the weights for location and age going into the first neuron will be 0. Similarly, let’s say only bedrooms, size, and location are important for the second neuron and age is ignored, so the weight for age going into the 2nd neuron is 0. Meanwhile, the third neuron only considers location and age as important features, and bedrooms and size are given a weight of 0.

The resulting neural network will look something like this:

Similarly, the training process will also produce optimal bias values. So let’s go ahead and add them here too (let’s also remove the inputs with weights = 0 just to make the diagram more readable):

You probably notice something odd: we have 3 outputs here. However, we only want one output, which is the predicted price. Therefore, we need to find a way to combine the outputs from the 3 neurons into one. To do this, let’s add another neuron in the front.

The structure remains the same as the previous ones, but instead of our 4 features being fed into the neuron as inputs, the outputs from the previous neurons are now used as inputs for the new neuron.

Terminology segue: Each layer is numbered, with the input layer typically being labeled as 0. The final layer is referred to as the output layer, while any layer situated between the input and output layer is considered a hidden layer.

And remember, every input is accompanied by a corresponding weight. Therefore, even these inputs to the new neurons will have weights, which can be estimated during the training process as well. The new bias will also be determined during the training process. As a result, the new neural network (assuming it’s fully trained) will have the following optimal values:

Now let’s move on to the activation functions. For this case, we’ll set all of them to be equal to the rectifier function. Generally, we have the flexibility to choose different activation functions based on the problem we are trying to solve. However, since the rectifier function is commonly used, let’s just go with that now.

NOTE: Usually, the same layer will have the same activation function.

Okay, finally the fun part. We trained our neural network with all the optimal bias and weight values. Now it’s time to take this baby for a spin and see how well it does in predicting house prices.

Let’s pass the features of our first house through this neural network again.

We’ll clarify the process by highlighting the activated inputs and neurons at each step.

step 1 – first neuron
step 2 – second neuron
step 3 – third neuron

And finally, using the outputs from the hidden layer and passing them through the output layer:

step 4 – final neuron

And that’s how we use this neural network to get outputs! This process of passing in inputs to get an output is called forward propagation.
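If you’d like to see forward propagation as code, here’s a compact sketch of our 4-input, 3-hidden-neuron, 1-output network. The weight matrices and bias values below are zero placeholders standing in for the trained values in the figures:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, W1, b1, w2, b2):
    hidden = relu(W1 @ x + b1)     # steps 1-3: all three hidden neurons at once
    return relu(w2 @ hidden + b2)  # step 4: the output neuron

# placeholder shapes: W1 is 3x4 (3 hidden neurons, 4 inputs), w2 has 3 weights
x = np.array([2500, 1, 10, 4])
W1, b1 = np.zeros((3, 4)), np.zeros(3)
w2, b2 = np.zeros(3), 0.0
print(forward(x, W1, b1, w2, b2))
```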

We’ll repeat the same process for the rest of the houses:

Let’s compare these new predicted prices to the old predicted prices made by the neural network with just one neuron.

From just eyeballing it, it appears that the new predictions are performing better than the old ones. But what if we want to find a single number that quantifies how off our predictions are from the actual value?

This is where a cost function comes into play. A cost function tells us how off we are from our prediction. Depending on the type of prediction, we can use different cost functions. But for this problem, we’ll use one called the Mean Square Error (MSE). The MSE allows us to a) measure the deviation of our predictions from the actual price and b) compare predictions made by different models.

It calculates the average of the squares of the differences between the predicted house prices and the actual house prices. Mathematically:

MSE = (1/n) × Σ (yᵢ − ŷᵢ)²

Terminology segue: It is common notation to refer to the actual price as "y" and the predicted price as "y hat" (denoted that way because the little notation on the top of the "y" looks like a hat)

The objective is to minimize the MSE. The closer MSE is to 0, the better our model is at predicting prices.
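As code, the MSE formula is just a couple of lines:

```python
import numpy as np

def mse(y, y_hat):
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)   # average of squared differences
```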

So, using this formula, we can calculate the MSE of the old one-neuron model as:

Ugh…that’s a super gnarly number. This just confirms that our first model was pretty bad (cough horrendous cough).

Similarly, the MSE of the new, more complex model:

Still pretty bad but at least a little better than the previous MSE.

But we can consider creating a better model.

One approach is to add more neurons to the existing layer to improve the prediction power. Like this:

added a fourth neuron to the hidden layer

Or we could add an entirely new hidden layer:

added a second hidden layer with 3 neurons

Alternatively, we can place different activation functions at different layers:

As you can see, the possibilities are endless. We can adjust the complexity of our neural network to meet our specific needs. These different possibilities are called neural network architectures. We can customize the number of layers, the neurons at each layer, and the activation functions to fit the data and problem we’re trying to solve, making it as simple or complex as needed.

Now that we understand how a neural network works, the next article (up now woohoo) will focus on understanding how it learns the optimal bias and weight values aka the training process!

Deep Learning Illustrated, Part 2: How Does a Neural Network Learn?

As always, feel free to connect with me on LinkedIn for any comments/questions!

Unless specified, all images are by the author.

The post Deep Learning Illustrated, Part 1: How Does a Neural Network Work? appeared first on Towards Data Science.
