Understanding LSTM & Bi-LSTM Networks in RNN (In Depth Intuition)

Prateek Smith Patra
20 min read · Oct 23, 2021
A recurrent neuron, where the output data is multiplied by a weight and fed back into the input

Recurrent Neural Networks (RNNs) in Deep Learning

A Recurrent Neural Network is essentially a generalization of a feed-forward neural network that has internal memory. RNNs are a special kind of neural network designed to deal effectively with sequential data. This kind of data includes time series (a list of values of some parameter over a period of time), text documents, which can be seen as sequences of words, or audio, which can be seen as a sequence of sound frequencies over time.
An RNN is recurrent in nature because it performs the same function for every input, while the output for the current input depends on the previous computation. To make a decision, it considers the current input together with the output it has learned from the previous inputs.

Cells that are a function of inputs from previous time steps are also known as memory cells.

Unlike feed-forward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. In other neural networks, all the inputs are independent of each other. But in RNN, all the inputs are related to each other.

# Why RNN?

The basic challenge of a classic feed-forward neural network is that it has no memory: each training example given as input to the model is treated independently of the others. To work with sequential data in such models, you need to show them the entire sequence in one go as one training example. This is problematic because the number of words in a sentence can vary and, more importantly, this is not how we tend to process a sentence in our heads.

Recurrent neural networks are a special type of neural network where the outputs from previous time steps are fed as input to the current time step.

Basic Recurrent Neural Network with three input nodes
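The recurrence can be sketched in a few lines of Python. The scalar weights, the tanh nonlinearity, and the example sequence below are illustrative toy choices, not a trained model:

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a scalar 'vanilla' RNN: h_t = tanh(w_x*x_t + w_h*h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

# The hidden state h is the feedback loop: each output is fed back in.
h = 0.0
for x in [1.0, 0.5, -0.3]:
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
# h now summarizes the whole sequence, not just the last input.
```

Because `h` is threaded through every step, the final state depends on the entire sequence, which is exactly the "memory" described above.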

Different types of RNNs

The core reason that recurrent nets are more exciting is that they allow us to operate over sequences of vectors: Sequences in the input, the output, or in the most general case both. A few examples may make this more concrete:

Different RNN Implementation types

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

Recurrent Neural Networks have loops.

In the above diagram, a chunk of neural network, A, looks at some input xₜ and outputs a value hₜ. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

An unrolled recurrent neural network.

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They're the natural neural network architecture to use for such data.

And they certainly are used! In the last few years, there has been incredible success in applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I'll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.

The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they'd be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context — it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem!

Enhancing Our Memory with Long Short-Term Memory (LSTM) Networks

Long Short-Term Memory networks, or LSTMs, are a variant of RNNs that solve the long-term memory problem of the former.

They have a more complex cell structure than a normal recurrent neuron, which allows them to better regulate how to learn from, or forget, the different input sources.

The key to LSTMs is the cell state (cell memory), the horizontal line running through the top of the diagram, along which information flows, together with internal mechanisms called gates that regulate the flow of information.
The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions.

The cell state basically encodes the (relevant) information from the inputs that have been observed up to each step.

Representation of an LSTM cell

The cell state is the memory of the LSTM cell, and the hidden state (cell output) is the output of this cell.

Cells have an internal cell state, often abbreviated as "c", and the cell's output is what is called the "hidden state", abbreviated as "h".
Regular RNNs have just the hidden state and no cell state; therefore, RNNs have difficulty accessing information from a long time ago.

Note: The hidden state is an output of the LSTM cell, used for prediction. It contains information from previous inputs (via the cell state/memory) along with the current input (weighted according to which context is important).

The hidden state (hₜ₋₁) and the cell input (xₜ) are used to control what to do with the memory (cell state) cₜ: forget it or write new information to it.

We decide what to do with the memory based on the previous cell output (hidden state) and the current input, and we do this using gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a point-wise multiplication operation.

LSTM Gate

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through.
A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

These gates can learn which data in a sequence is important to keep or throw away. By doing that, it can pass relevant information down the long chain of sequences to make predictions.
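A sigmoid gate is just an element-wise multiplication by values in [0, 1]. The candidate and gate vectors below are made-up numbers chosen to show the filtering effect:

```python
def gate(values, gate_values):
    """Element-wise gating: entries of gate_values near 0 block, near 1 pass."""
    return [v * g for v, g in zip(values, gate_values)]

candidate = [0.9, -0.4, 0.7]      # information that could enter the state
sigmoid_out = [1.0, 0.0, 0.5]     # pretend outputs of a sigmoid layer
gated = gate(candidate, sigmoid_out)
# First entry passes fully, second is blocked, third passes at 50%.
```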

An LSTM neuron can do this learning by incorporating a cell state and three different gates: the input gate, the forget gate and the output gate. In each time step, the cell can decide what to do with the state vector: read from it, write to it, or delete it, thanks to an explicit gating mechanism.
With the input gate, the cell can decide whether to update the cell state or not. With the forget gate the cell can erase its memory, and with the output gate the cell can decide whether to make the output information available or not.

LSTMs also mitigate the problems of exploding and vanishing gradients.

To reduce the vanishing (and exploding) gradient problem, and therefore allow deep networks and recurrent neural networks to perform well in practical settings, there needs to be a way to reduce the repeated multiplication of gradients that are less than one.

The LSTM cell is a specifically designed unit of logic that reduces the vanishing gradient problem sufficiently to make recurrent neural networks more useful for long-term memory tasks, e.g. text sequence prediction.
It does so by creating an internal memory state to which the processed input is simply added, which greatly reduces the multiplicative effect of small gradients. The time dependence and effects of previous inputs are controlled by an interesting concept called the forget gate, which determines which states are remembered or forgotten. Two other gates, the input gate and the output gate, are also featured in LSTM cells.
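The additive cell state matters because of how gradients behave over many time steps. This toy calculation (with a made-up derivative of 0.5 per step) shows why a purely multiplicative path vanishes while an additive path survives:

```python
# Multiplicative path: a derivative of 0.5 at each of 50 steps
# shrinks the gradient geometrically toward zero.
grad = 1.0
for _ in range(50):
    grad *= 0.5
# grad is now vanishingly small -> earlier inputs stop influencing learning.

# Additive path: because c_t = c_{t-1} + update, the derivative of c_t
# with respect to c_{t-1} along the "conveyor belt" is 1.
additive_grad = 1.0
for _ in range(50):
    additive_grad *= 1.0   # the gradient passes through unchanged
```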

Here's a brief summary of the internal workings of the different gates, the cell state, the hidden state, and the current input, explained through mathematical formulas referenced from a research paper on LSTMs for text classification: https://arxiv.org/abs/1603.03827

LSTM summarized

Let's first have a look at the LSTM cell more carefully.

LSTM cell another view

The data flow is from left to right in the diagram above, with the current input xₜ and the previous cell output hₜ₋₁ concatenated together and entering the top "data rail". The long-term memory is usually called the cell state Cₜ. The looping arrows indicate the recursive nature of the cell, which allows information from previous intervals to be stored within the LSTM cell. Here's where things get interesting.

Input Gate:

The input gate is also called the save vector.
This gate determines which information should enter the cell state / long-term memory, i.e., which information should be saved to the cell state and which should be discarded.

First, the (combined) input is squashed to between -1 and 1 using a tanh activation function.
This squashed input is then multiplied element-wise by the output of the input gate. The input gate is basically a hidden layer of sigmoid-activated nodes, with weighted inputs xₜ and hₜ₋₁, which outputs values between 0 and 1; when multiplied element-wise by the squashed input, it determines which inputs are switched on and off (the values aren't binary, but continuous between 0 and 1). In other words, it is a kind of input filter or gate: it tells the cell what to learn and add to the memory from the current input and its context, and also how much of it (the sigmoid gives values between 0 and 1).

A simplistic (possibly imprecise) view: tanh gives a standardized (between -1 and 1) version of the actual unscaled combined input vector, and the sigmoid layer controls what percentage (values between 0 and 1, i.e., 0 to 100%) of those scaled values should be added to the memory, considering the current and previous context.

But why a tanh activation?

Because the cell-state update is a summation with the previous cell state, a sigmoid function alone could only add to the memory and never remove/forget it. If you can only add a number in [0, 1], the contribution can never be negative, so memories could never be turned off or forgotten. This is why the input modulation gate has a tanh activation function: tanh has a range of [-1, 1] and allows the cell state to forget certain memories.
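This can be checked numerically. A minimal sketch (scalar state, toy numbers) showing that a tanh candidate lets the additive update decrease the cell state, which a (0, 1) sigmoid candidate never could:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A sigmoid-squashed candidate is always in (0, 1): every update
# would push the cell state upward.
# A tanh-squashed candidate is in (-1, 1): updates can go either way.
c_prev = 0.8
update = sigmoid(0.9) * math.tanh(-2.0)   # input gate * tanh candidate
c_new = c_prev + update                   # the state decreases here
```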

Forget Gate:

The forget gate is also called the remember vector. The output of the forget gate tells the cell state which information to forget by multiplying a position in the matrix by 0. If the output of the forget gate is 1, the information is kept in the cell state.

Although it is initially randomly initialized, it basically LEARNS what exactly to FORGET from the memory (cell state), given the current input and the previous context.

Output Gate:

The output gate is also called the focus vector.
It basically highlights which information, out of all the possible values in the matrix (long-term memory), should move forward to the next hidden state.

Note: The working memory is usually called the hidden state (hₜ).
It is basically hₜ (the LSTM output): the part of the existing memory (Cₜ) that should be fed as context into the next round. This is analogous to the hidden state in an RNN or HMM.

Gates Summarized:
The input gate determines the extent to which the current timestep's input should be used, the forget gate determines the extent to which the previous timestep's state should be kept, and the output gate determines the output at the current timestep.

LSTM Networks

Long Short Term Memory networks — usually just called "LSTMs" — are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

The repeating module in a standard RNN contains a single layer.

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

The repeating module in an LSTM contains four interacting layers.

Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer." It looks at hₜ₋₁ and xₜ, and outputs a number between 0 and 1 for each number in the cell state Cₜ₋₁. A 1 represents "completely keep this" while a 0 represents "completely get rid of this."

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, C̃ₜ, that could be added to the state. In the next step, we'll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

It's now time to update the old cell state, Cₜ₋₁, into the new cell state Cₜ. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by fₜ, forgetting the things we decided to forget earlier. Then we add iₜ ∗ C̃ₜ. This is the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.

Variants on Long Short Term Memory

What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
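A scalar sketch of the GRU update (toy weights, no biases, single dimension) shows how the merged update gate z plays both roles: (1 − z) decides what to keep, z decides what to write:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, w):
    """One scalar GRU step; w is a dict of toy weights (no biases)."""
    z = sigmoid(w["zx"] * x_t + w["zh"] * h_prev)        # update gate
    r = sigmoid(w["rx"] * x_t + w["rh"] * h_prev)        # reset gate
    h_cand = math.tanh(w["hx"] * x_t + w["hh"] * (r * h_prev))
    return (1 - z) * h_prev + z * h_cand                 # single merged state

w = dict(zx=0.5, zh=0.5, rx=0.5, rh=0.5, hx=1.0, hh=1.0)
h = gru_step(0.3, 0.0, w)   # one state vector does the job of both h and C
```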

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information.

Bi-LSTM (Bidirectional Long Short-Term Memory)

BiLSTM Network Structure example with NLP text Inputs

Bidirectional recurrent neural networks (RNNs) are really just two independent RNNs put together. This structure allows the network to have both backward and forward information about the sequence at every time step.

Using a bidirectional network runs your inputs in two ways: one from past to future and one from future to past. What differentiates this approach from a unidirectional one is that the LSTM that runs backward preserves information from the future; using the two hidden states combined, you are able at any point in time to preserve information from both past and future.

What they are suited for is a complicated question, but BiLSTMs show very good results because they can understand context better. I will try to explain this through an example.

Let's say we try to predict the next word in a sentence. At a high level, what a unidirectional LSTM will see is

“The boys went to ….”

and it will try to predict the next word from this context alone. With a bidirectional LSTM, you will also be able to see information further down the road, for example:

Forward LSTM:

“The boys went to …”

Backward LSTM:

“… and then they got out of the pool”

You can see that, using the information from the future, it can be easier for the network to understand what the next word is.
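The idea can be sketched with a toy recurrent step standing in for a full LSTM cell (the weights 0.7 and 0.4 are arbitrary): run the sequence in both directions and pair the hidden states per position.

```python
import math

def step(x_t, h_prev):
    """Toy recurrent step standing in for an LSTM cell."""
    return math.tanh(0.7 * x_t + 0.4 * h_prev)

def bidirectional(xs):
    """Pair forward and backward hidden states for every position."""
    fwd, h = [], 0.0
    for x in xs:                    # past -> future
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):          # future -> past
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()                   # align backward states with positions
    return list(zip(fwd, bwd))      # each position sees both contexts

states = bidirectional([0.1, 0.9, -0.5])
```

Each element of `states` combines a summary of everything before the position (forward pass) with a summary of everything after it (backward pass).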

Here in an LSTM:

  1. we use activation values aₜ, not just the candidate values c̃ₜ,
  2. we also have two outputs from the cell: a new activation aₜ and a new cell state cₜ.

The new candidate is calculated as c̃ₜ = tanh(W_c[aₜ₋₁, xₜ] + b_c).

Here in the LSTM we control the memory cell through three different gates: the update (input) gate Γᵤ = σ(Wᵤ[aₜ₋₁, xₜ] + bᵤ), the forget gate Γ_f = σ(W_f[aₜ₋₁, xₜ] + b_f), and the output gate Γₒ = σ(Wₒ[aₜ₋₁, xₜ] + bₒ).

As we said before, we have two outputs from the LSTM, the new cell state and the new activation, and both use the gates above.

To combine all of these together:

cₜ = Γᵤ ∗ c̃ₜ + Γ_f ∗ cₜ₋₁
aₜ = Γₒ ∗ tanh(cₜ)
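A minimal scalar sketch of one LSTM step, combining the gates, candidate, cell state, and activation discussed above (toy weights, and a simple sum standing in for vector concatenation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, a_prev, c_prev, w):
    """One scalar LSTM step; w is a dict of toy gate weights (no biases)."""
    combined = a_prev + x_t                     # stand-in for [a_prev, x_t]
    gamma_u = sigmoid(w["u"] * combined)        # update (input) gate
    gamma_f = sigmoid(w["f"] * combined)        # forget gate
    gamma_o = sigmoid(w["o"] * combined)        # output gate
    c_cand = math.tanh(w["c"] * combined)       # candidate memory
    c_t = gamma_u * c_cand + gamma_f * c_prev   # new cell state
    a_t = gamma_o * math.tanh(c_t)              # new activation / hidden state
    return a_t, c_t

a, c = lstm_step(x_t=0.5, a_prev=0.0, c_prev=0.2, w=dict(u=1.0, f=1.0, o=1.0, c=1.0))
```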

In Bi-LSTM:

In NLP, sometimes to understand a word we need not just the previous words, but also the coming words, as in this example:

Here, for the word "Teddy", we can't tell just from the preceding words whether the next word is going to be "Bears" or "Roosevelt"; it depends on the later context of the sentence.
Bi-LSTM is a general architecture that can use any recurrent cell.

Here we apply forward propagation twice: once for the forward cells and once for the backward cells.

Both activations (the forward aₜ→ and the backward aₜ←) are considered to calculate the output ŷ at time t:

ŷₜ = g(W_y[aₜ→, aₜ←] + b_y)

What makes the difference between an LSTM and a BiLSTM

An LSTM, at its core, preserves information from inputs that have already passed through it using the hidden state.

Unidirectional LSTM only preserves information of the past because the only inputs it has seen are from the past.

A bidirectional LSTM runs the inputs in both directions, so at any point in time the combined hidden states preserve information from both past and future.

