# RNN with Keras: Understanding computations

This tutorial highlights the structure of common RNN algorithms by following and understanding the computations carried out by each model. It is intended for anyone who knows the general deep learning workflow but has no prior understanding of RNNs. If you have never heard of RNNs, you can read this post of Christopher Olah first.

The present post focuses on understanding the computations in each model step by step, without trying to train anything useful. It is illustrated with Keras code and divided into five parts:

- TimeDistributed component,
- Simple RNN,
- Simple RNN with two hidden layers,
- LSTM,
- GRU.

*This diagram is an illustration of an LSTM cell. Check out part D for details.*

Companion source code for this post is available here.

The core idea of RNNs, compared with feedforward neural networks, is to read the input sequentially. In an RNN, the input is indexed with $t$ and processed sequentially. The index $t$ can represent time for time series, or the position in a sentence for NLP tasks. Information is stored, updated and transmitted over time using a hidden variable.

Simple RNN is a simple way to keep and update information over time. It is progressively described in Parts A, B and C. This kind of model is effective but difficult to train on series with long-term dependencies. The main issue is the vanishing gradient problem, which is detailed in Section 10.7 of the Deep Learning book.

Gated RNNs have been introduced to circumvent the vanishing gradient problem. Two popular gated RNNs are described in Part D (LSTM) and Part E (GRU). The main idea is to control the information flow by introducing gates. Insight into why this works can be found here (please share if you know a better reference).

Note that even gated RNNs have computational and interpretability issues, and do not provide an explicit memory. A promising way to circumvent those issues is attention mechanisms. An overview is available here (see also this and this short posts).

## Part A: Explanation of the TimeDistributed component

**A very simple network.**
Let’s begin with one-dimensional input and output.
In Keras, the command line:

```
Dense(activation='sigmoid', units=1)
```

corresponds to the mathematical equation:

$$y = \sigma(W_y x + b_y)$$

where $\sigma$ is the sigmoid function. Input and output are one-dimensional, so the weights are such that $W_y \in \mathbb{R}$ and $b_y \in \mathbb{R}$. The output layer is indeed one-dimensional because we let `units = 1` in the previous command line.
This equation can be represented by the following diagram (the bias term has been masked to improve readability):

**TimeDistributed wrapper in dimension 1.**
The TimeDistributed wrapper applies the same layer at each time step.
For example, with one-dimensional input and output along $T$ dates, input is represented with $(x_1, \dots, x_T)$ and output with $(y_1, \dots, y_T)$. Then, the model:

```
TimeDistributed(Dense(activation='sigmoid', units=1),
input_shape=(None, 1))
```

corresponds to the equation:

$$y_t = \sigma(W_y x_t + b_y)$$

applied at each $t \in \{1, \dots, T\}$. Note that $W_y$ and $b_y$ are identical for each $t$. In the previous command line, `input_shape=(None, 1)` means that the input layer is an array of shape $T \times 1$ (where `None` leaves $T$ unspecified), and `units = 1` means that the output layer contains $1$ unit for each $t$. This model can be represented by the diagram:

**Input and output shapes in practice.**
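Input usually has the shape $(n, T, m_{in})$, where $n$ is the sample size, $T$ is the temporal size, and $m_{in}$ is the dimension of each input vector.
Output has the shape $(n, T, m_{out})$, where $m_{out}$ is the dimension of each output vector.
In the previous example, each input and output vector is one-dimensional ($m_{in} = 1$ and $m_{out} = 1$), and the temporal size is left unspecified by `input_shape=(None, 1)`.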

**Prediction of new inputs.**
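Given a model trained on inputs of shape $(n, T, m_{in})$,
we can feed the model with new inputs of shape $(n', T', m_{in})$: the sample size and the temporal size may change, but the dimension of each input vector must stay the same.

In the previous example, we can take: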

```
new_input = np.array([[[1],[0.8],[0.6],[0.2],
[1],[0],[1],[1]]])
new_input.shape # (1, 8, 1)
print(model.predict(new_input))
```
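To make the one-dimensional computation concrete, here is a minimal NumPy sketch of what the TimeDistributed layer does on the input above. The values of `W_y` and `b_y` are hypothetical (a trained model would provide its own), so the numbers will not match `model.predict`; only the mechanics are illustrated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scalar weights, chosen only for illustration.
W_y, b_y = 0.5, -0.1

# One series of 8 time steps with one-dimensional inputs, shape (1, 8, 1).
new_input = np.array([[[1], [0.8], [0.6], [0.2], [1], [0], [1], [1]]])

# The same (W_y, b_y) is applied independently at each time step.
output = sigmoid(W_y * new_input + b_y)
print(output.shape)  # (1, 8, 1)
```

Since the layer has no memory, two time steps with the same input value give the same output value.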

**Complete example of TimeDistributed with higher dimensions.**
Let $m_{in} = 2$ and $m_{out} = 3$, with sample size $n$ and temporal size $T$. Training inputs have shape $(n, T, 2)$ and training outputs have shape $(n, T, 3)$.

The model is built and trained as follows:

```
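import numpy as np
from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

dim_in = 2
dim_out = 3
# x_train has shape (n, T, dim_in) and y_train has shape (n, T, dim_out)
model = Sequential()
model.add(TimeDistributed(Dense(activation='sigmoid', units=dim_out),  # target is dim_out-dimensional
                          input_shape=(None, dim_in)))                 # input is dim_in-dimensional
model.compile(loss='mse', optimizer='rmsprop')
model.fit(x_train, y_train, epochs=100, batch_size=32)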
```

Output for a new input of shape $(1, 1, 2)$ can be predicted as follows:

```
new_input = np.array([[[1, 1]]])
new_input.shape # (1, 1, 2), which is a valid shape for this model
print(model.predict(new_input))
# [[[ 0.70669621 0.70633912 0.65635538]]]
# output is (1, 1, 3) as expected.
# Note that each column has been trained differently
```

This computation can be understood in detail by taking $x$ a two-dimensional vector, computing $W_y^T x + b_y$, and then applying the sigmoid function to each component.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_y = model.get_weights()[0]  # kernel, a (2, 3) matrix
b_y = model.get_weights()[1]  # bias, a vector of length 3
# At each time, we have a dense neural network
# (without hidden layer) from 2+1 inputs to 3 outputs.
# On the whole, there are 9 parameters
# (the same parameters are used at each time).
x = np.array([1, 1])
print(sigmoid(np.dot(x, W_y) + b_y))  # like doing X * beta
# We obtain the same results as with 'model.predict'
```

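We have the trained weights $W_y$ (a $2 \times 3$ matrix) and $b_y$ (a vector of length $3$), and we take $x = (1, 1)^T$. The formula $W_y^T x + b_y$ (note the transpose for dimensional correctness) gives a vector of length $3$, and after applying the sigmoid on each component, we obtain the same values as `model.predict`: $(0.7067, 0.7063, 0.6564)$.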
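The same computation extends to a whole batch of series. The sketch below uses randomly drawn weights and hypothetical sizes (`n = 4`, `T = 9`) instead of a trained Keras model. It checks that applying $W_y^T x_t + b_y$ at each sample and each time step is the same as a single matrix product acting on the last axis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, T, dim_in, dim_out = 4, 9, 2, 3          # hypothetical sizes

W_y = rng.normal(size=(dim_in, dim_out))    # same layout as the Keras kernel
b_y = rng.normal(size=dim_out)
x = rng.normal(size=(n, T, dim_in))

# Vectorized version: the matrix product acts on the last axis only.
y_all = sigmoid(x @ W_y + b_y)

# Explicit loop over samples and time steps, using W_y^T x_t + b_y.
y_loop = np.empty((n, T, dim_out))
for i in range(n):
    for t in range(T):
        y_loop[i, t] = sigmoid(W_y.T @ x[i, t] + b_y)

print(y_all.shape)                 # (4, 9, 3)
print(np.allclose(y_all, y_loop))  # True
```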

## Part B: Explanation of simple RNN

*Simple RNN* is the simplest way for a neural network to keep information over time.
Information is stored in the hidden variable $h$ and updated at each time step based on new inputs.
Simple RNN can be connected to a time distributed component to form an *Elman network*, introduced in 1990. The time distributed component computes the output from the hidden variable.
We describe this complete network in this part.

**Description of the network.**
In Keras, the command lines:

```
dim_in=3; dim_out=2; nb_units=5;
model=Sequential()
model.add(SimpleRNN(input_shape=(None, dim_in),
return_sequences=True,
units=nb_units))
model.add(TimeDistributed(Dense(activation='sigmoid',
units=dim_out)))
```

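correspond to the mathematical equations (for all time $t \in \{1, \dots, T\}$):

$$\begin{aligned}
h_t &= \tanh(W_x^T x_t + W_h^T h_{t-1} + b_h), \\
y_t &= \sigma(W_y^T h_t + b_y),
\end{aligned}$$

where the transposes account for the Keras convention of storing each weight matrix with shape (input size, number of units).

As before, training inputs have shape $(n, T, m_{in})$ and training outputs have shape $(n, T, m_{out})$. In this example, we have taken $m_{in} = 3$ and $m_{out} = 2$, so $x_t$ is a three-dimensional vector and $y_t$ is a two-dimensional vector.
We have selected `units=5`, so $h_t$ is a five-dimensional vector.
In detail, the `SimpleRNN` line computes the full sequence $(h_1, \dots, h_T)$ from $(x_1, \dots, x_T)$ (and initial $h_0$); the `TimeDistributed` line computes the sequence $(y_1, \dots, y_T)$ from $(h_1, \dots, h_T)$.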

Those equations can be represented by the following diagram:

This diagram shows one temporal step of the network, explaining how to compute $h_t$ and $y_t$ from $x_t$ and $h_{t-1}$.

It remains to select the initial value of the hidden variable, and we take the null vector: $h_0 = 0$.
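One temporal step of this network can be sketched in NumPy as follows. The function name `simple_rnn_step` and the randomly drawn weights are illustrative assumptions; the weight shapes follow the Keras layout (input size, number of units), and the default `tanh` activation of `SimpleRNN` is assumed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_rnn_step(x_t, h_prev, W_x, W_h, b_h, W_y, b_y):
    """One temporal step: returns (h_t, y_t) from (x_t, h_{t-1})."""
    h_t = np.tanh(W_x.T @ x_t + W_h.T @ h_prev + b_h)
    y_t = sigmoid(W_y.T @ h_t + b_y)
    return h_t, y_t

dim_in, dim_out, nb_units = 3, 2, 5
rng = np.random.default_rng(0)
W_x = rng.normal(size=(dim_in, nb_units))
W_h = rng.normal(size=(nb_units, nb_units))
b_h = np.zeros(nb_units)
W_y = rng.normal(size=(nb_units, dim_out))
b_y = np.zeros(dim_out)

h_0 = np.zeros(nb_units)                # null initial hidden variable
x_1 = np.array([4.0, 2.0, 1.0])
h_1, y_1 = simple_rnn_step(x_1, h_0, W_x, W_h, b_h, W_y, b_y)
print(h_1.shape, y_1.shape)  # (5,) (2,)
```

Because $h_0$ is the null vector, the recurrent term vanishes at the first step and $h_1$ depends on $x_1$ only.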

**Complete example of simple RNN.**
Let $m_{in} = 3$ and $m_{out} = 2$, with sample size $n$ and temporal size $T$. Training inputs have shape $(n, T, 3)$ and training outputs have shape $(n, T, 2)$.

The model is built and trained as follows:

```
dim_in = 3; dim_out = 2; nb_units = 5  # consistent with the weight shapes and the 3-dimensional inputs below
model=Sequential()
model.add(SimpleRNN(input_shape=(None, dim_in),
return_sequences=True,
units=nb_units))
model.add(TimeDistributed(Dense(activation='sigmoid', units=dim_out)))
model.compile(loss = 'mse', optimizer = 'rmsprop')
model.fit(x_train, y_train, epochs = 100, batch_size = 32)
```

The weights of the trained network are:

```
W_x = model.get_weights()[0]  # W_x is a (3, 5) matrix
W_h = model.get_weights()[1]  # W_h is a (5, 5) matrix
b_h = model.get_weights()[2]  # b_h is a vector of length 5
W_y = model.get_weights()[3]  # W_y is a (5, 2) matrix
b_y = model.get_weights()[4]  # b_y is a vector of length 2
```
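As a quick sanity check on these shapes, the total number of trainable parameters can be counted by hand. The short sketch below uses plain Python and the dimensions of this example; the helper `size` is introduced only for illustration.

```python
dim_in, dim_out, nb_units = 3, 2, 5

shapes = {
    "W_x": (dim_in, nb_units),
    "W_h": (nb_units, nb_units),
    "b_h": (nb_units,),
    "W_y": (nb_units, dim_out),
    "b_y": (dim_out,),
}

def size(shape):
    """Number of entries of an array with the given shape."""
    total = 1
    for d in shape:
        total *= d
    return total

n_params = sum(size(s) for s in shapes.values())
print(n_params)  # 57
```

The recurrent layer accounts for 15 + 25 + 5 = 45 parameters and the output layer for 10 + 2 = 12.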

We want to predict the output for a new input of shape $(1, T', 3)$. We take $T' = 3$, i.e. a shape $(1, 3, 3)$, and let: $x_1 = (4, 2, 1)$, $x_2 = (1, 1, 1)$, and $x_3 = (1, 1, 1)$. The model predicts the output for this series:

```
new_input = [[4,2,1], [1,1,1], [1,1,1]]
print(model.predict(np.array([new_input])))
# [[[ 0.79032147 0.42571515]
# [ 0.59781438 0.55316663]
# [ 0.87601596 0.86248338]]]
```

It is possible to retrieve this result manually by computing $h_1$ from $h_0$ and $x_1$; then $y_1$ from $h_1$; then $h_2$ from $h_1$ and $x_2$; then $y_2$ from $h_2$; then $h_3$ from $h_2$ and $x_3$; then $y_3$ from $h_3$. This is detailed in Part B of the companion code.
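The chain of computations described above can be sketched as a loop. This is not the companion code: the function `elman_forward` is a hypothetical helper and the weights are drawn at random, so the printed values differ from the trained model. Passing the arrays returned by `model.get_weights()` instead should reproduce `model.predict`, assuming the default `tanh` activation of `SimpleRNN`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_forward(series, W_x, W_h, b_h, W_y, b_y):
    """Compute y_1, ..., y_T for one series, starting from h_0 = 0."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in series:
        h = np.tanh(W_x.T @ x_t + W_h.T @ h + b_h)   # h_t from h_{t-1} and x_t
        outputs.append(sigmoid(W_y.T @ h + b_y))     # y_t from h_t
    return np.array(outputs)

rng = np.random.default_rng(1)
W_x = rng.normal(size=(3, 5))
W_h = rng.normal(size=(5, 5))
b_h = rng.normal(size=5)
W_y = rng.normal(size=(5, 2))
b_y = rng.normal(size=2)

new_input = np.array([[4, 2, 1], [1, 1, 1], [1, 1, 1]], dtype=float)
y = elman_forward(new_input, W_x, W_h, b_h, W_y, b_y)
print(y.shape)  # (3, 2)
```

Although $x_2 = x_3$, the outputs $y_2$ and $y_3$ differ in general, because the hidden variable carries information from the previous steps.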

## Part C: Explanation of simple RNN with two hidden layers

In Part B, we connected a `SimpleRNN` layer and a `TimeDistributed` layer to form an Elman network with one hidden layer. It is easy to stack another `SimpleRNN` layer.

This Keras code:

```
dim_in = 3; dim_out = 2
model=Sequential()
model.add(SimpleRNN(input_shape=(None, dim_in),
return_sequences=True,
units=5))
model.add(SimpleRNN(return_sequences=True,  # input size (5) is inferred from the previous layer
units=7))
model.add(TimeDistributed(Dense(activation='sigmoid',
units=dim_out)))
```

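corresponds to the mathematical equations (for all time $t \in \{1, \dots, T\}$):

$$\begin{aligned}
h^{(1)}_t &= \tanh\big((W^{(1)}_x)^T x_t + (W^{(1)}_h)^T h^{(1)}_{t-1} + b^{(1)}_h\big), \\
h^{(2)}_t &= \tanh\big((W^{(2)}_x)^T h^{(1)}_t + (W^{(2)}_h)^T h^{(2)}_{t-1} + b^{(2)}_h\big), \\
y_t &= \sigma(W_y^T h^{(2)}_t + b_y),
\end{aligned}$$

and is represented by the following diagram (two temporal steps are shown to help understand how everything is connected):

In this example, at each time $t$, $x_t$, $h^{(1)}_t$, $h^{(2)}_t$, $y_t$ are vectors of size $3$, $5$, $7$, $2$ respectively.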

Shape of weight matrices and manual computations are detailed in Part C of the companion code.
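Independently of the companion code, the flow of shapes through the two stacked layers can be sketched in NumPy. The helpers `rnn_layer` and `init`, the random weights and the temporal size `T = 4` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_layer(inputs, W_x, W_h, b_h):
    """Return the full hidden sequence of a simple RNN layer (h_0 = 0)."""
    h = np.zeros(W_h.shape[0])
    sequence = []
    for x_t in inputs:
        h = np.tanh(W_x.T @ x_t + W_h.T @ h + b_h)
        sequence.append(h)
    return np.array(sequence)

rng = np.random.default_rng(0)
dim_in, units_1, units_2, dim_out = 3, 5, 7, 2
T = 4

def init(n_in, n_units):
    """Random kernel, recurrent kernel and zero bias for one layer."""
    return (rng.normal(size=(n_in, n_units)),
            rng.normal(size=(n_units, n_units)),
            np.zeros(n_units))

x = rng.normal(size=(T, dim_in))
h1 = rnn_layer(x, *init(dim_in, units_1))      # first hidden sequence
h2 = rnn_layer(h1, *init(units_1, units_2))    # second layer reads the first
W_y, b_y = rng.normal(size=(units_2, dim_out)), np.zeros(dim_out)
y = sigmoid(h2 @ W_y + b_y)                    # time distributed output

print(x.shape, h1.shape, h2.shape, y.shape)  # (4, 3) (4, 5) (4, 7) (4, 2)
```

The second layer treats the hidden sequence of the first layer exactly as the first layer treats the input sequence.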

## Part D: Explanation of LSTM

**From SimpleRNN layer to LSTM layer.**
In Part B, we used a `SimpleRNN` layer to update the hidden variable $h$, i.e. to compute $h_t$ from $(x_t, h_{t-1})$. This layer in isolation at time $t$ is represented as follows:

Long short-term memory (LSTM) networks replace the `SimpleRNN` layer with an `LSTM` layer. An LSTM layer takes 3 inputs $(x_t, h_{t-1}, c_{t-1})$ and outputs a couple $(h_t, c_t)$ at each step $t$. $h$ is the hidden variable and $c$ is called the cell variable. This kind of network was introduced in 1997.

In Keras, the command line:

```
LSTM(input_shape=(None, dim_in),
return_sequences=True,
units=nb_units,
recurrent_activation='sigmoid',
activation='tanh')
```

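corresponds to the equations:

$$\begin{aligned}
i_t &= \sigma(W_{xi}^T x_t + W_{hi}^T h_{t-1} + b_i), \\
f_t &= \sigma(W_{xf}^T x_t + W_{hf}^T h_{t-1} + b_f), \\
\tilde{c}_t &= \tanh(W_{xc}^T x_t + W_{hc}^T h_{t-1} + b_c), \\
o_t &= \sigma(W_{xo}^T x_t + W_{ho}^T h_{t-1} + b_o), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}$$

where $\odot$ is the componentwise product (with null vectors for $h_0$ and $c_0$) and is represented by the following diagram:

In this setting, $f_t$ represents the *forget variable* (controlling how much information of $c_{t-1}$ is kept), $\tilde{c}_t$ represents *new information* to save, weighted by an *input variable* $i_t$ (controlling how much information of $\tilde{c}_t$ is kept). The combination of those variables forms the cell variable $c_t$. Finally, $o_t$ represents the *output variable* (controlling how much information of $c_t$ is passed to $h_t$).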
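The role of each variable may be easier to follow in code. Below is a minimal NumPy sketch of one LSTM step with sigmoid gates and `tanh` activations. The dictionary keys (`W_xi`, `W_hi`, `b_i`, and so on), the sizes and the random weights are assumptions made for illustration, not the Keras storage format.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: returns (h_t, c_t) from (x_t, h_{t-1}, c_{t-1})."""
    i_t = sigmoid(p["W_xi"].T @ x_t + p["W_hi"].T @ h_prev + p["b_i"])      # input variable
    f_t = sigmoid(p["W_xf"].T @ x_t + p["W_hf"].T @ h_prev + p["b_f"])      # forget variable
    c_tilde = np.tanh(p["W_xc"].T @ x_t + p["W_hc"].T @ h_prev + p["b_c"])  # new information
    o_t = sigmoid(p["W_xo"].T @ x_t + p["W_ho"].T @ h_prev + p["b_o"])      # output variable
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

dim_in, nb_units = 7, 13      # hypothetical sizes
rng = np.random.default_rng(0)
p = {}
for g in "ifco":
    p["W_x" + g] = rng.normal(size=(dim_in, nb_units))
    p["W_h" + g] = rng.normal(size=(nb_units, nb_units))
    p["b_" + g] = np.zeros(nb_units)

h, c = np.zeros(nb_units), np.zeros(nb_units)   # null h_0 and c_0
for x_t in rng.normal(size=(5, dim_in)):        # a series of 5 time steps
    h, c = lstm_step(x_t, h, c, p)
print(h.shape, c.shape)  # (13,) (13,)
```

Each component of `h` stays within $[-1, 1]$, since it is the product of a sigmoid output and a `tanh` output, whereas the cell variable `c` is not bounded in this way.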

**Explanation of matrices.**
It can be confusing to understand how all the matrices are organized.
Let us suppose that `dim_in=7` and `nb_units = 13`.
The input vector $x_t$ has length 7, and the hidden and cell vectors $h_t$ and $c_t$ both have length 13.

- Matrices $W_{xi}$, $W_{xf}$, $W_{xc}$, $W_{xo}$ have shape $7 \times 13$ each, because they are multiplied with $x_t$,
- Matrices $W_{hi}$, $W_{hf}$, $W_{hc}$, $W_{ho}$ have shape $13 \times 13$ each, because they are multiplied with $h_{t-1}$,
- Bias vectors $b_i$, $b_f$, $b_c$, $b_o$ have length $13$ each.

Consequently, vectors $i_t$, $f_t$, $\tilde{c}_t$ and $o_t$ have length $13$ each.

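In the Keras implementation of LSTM, $W_x$, $W_h$ and $b$ are defined as follows:

- $W_x$ is the concatenation of $W_{xi}$, $W_{xf}$, $W_{xc}$, $W_{xo}$, resulting in a $7 \times 52$ matrix,
- $W_h$ is the concatenation of $W_{hi}$, $W_{hf}$, $W_{hc}$, $W_{ho}$, resulting in a $13 \times 52$ matrix,
- $b$ is the concatenation of $b_i$, $b_f$, $b_c$, $b_o$, resulting in a vector of length $52$.

With those notations, we can first compute a raw vector $W_x^T x_t + W_h^T h_{t-1} + b$ of length $52$, before cutting it and applying activation functions to obtain $i_t$, $f_t$, $\tilde{c}_t$ and $o_t$.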
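The "compute a raw vector, then cut it" procedure can be sketched as follows, assuming the block order input, forget, cell, output inside the concatenated arrays. The weights are random stand-ins for what `model.get_weights()` would return.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

dim_in, nb_units = 7, 13
rng = np.random.default_rng(0)
W_x = rng.normal(size=(dim_in, 4 * nb_units))     # (7, 52)
W_h = rng.normal(size=(nb_units, 4 * nb_units))   # (13, 52)
b = rng.normal(size=4 * nb_units)                 # length 52

x_t = rng.normal(size=dim_in)
h_prev = rng.normal(size=nb_units)
c_prev = rng.normal(size=nb_units)

raw = W_x.T @ x_t + W_h.T @ h_prev + b            # raw vector of length 52
raw_i, raw_f, raw_c, raw_o = np.split(raw, 4)     # four chunks of length 13

i_t, f_t, o_t = sigmoid(raw_i), sigmoid(raw_f), sigmoid(raw_o)
c_tilde = np.tanh(raw_c)
c_t = f_t * c_prev + i_t * c_tilde
h_t = o_t * np.tanh(c_t)

# The forget variable recomputed from its own blocks (columns 13 to 25).
W_xf, W_hf, b_f = W_x[:, 13:26], W_h[:, 13:26], b[13:26]
print(np.allclose(f_t, sigmoid(W_xf.T @ x_t + W_hf.T @ h_prev + b_f)))  # True
```

Computing a single product of width 52 and then slicing gives the same result as four separate products of width 13.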

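Note that in the post of Christopher Olah, $W_i$, $W_f$, $W_C$ and $W_o$ act on the concatenated vector $[h_{t-1}, x_t]$ and are defined as follows:

- $W_i$ is the concatenation of $W_{hi}$ and $W_{xi}$,
- $W_f$ is the concatenation of $W_{hf}$ and $W_{xf}$,
- $W_C$ is the concatenation of $W_{hc}$ and $W_{xc}$,
- $W_o$ is the concatenation of $W_{ho}$ and $W_{xo}$.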
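The equivalence between the two notations can be checked numerically. In this sketch (random weights, forget gate only, bias omitted), `W_f` is built by stacking the transposed blocks side by side so that it acts on the concatenated vector $[h_{t-1}, x_t]$.

```python
import numpy as np

dim_in, nb_units = 7, 13
rng = np.random.default_rng(0)
W_xf = rng.normal(size=(dim_in, nb_units))     # acts on x_t
W_hf = rng.normal(size=(nb_units, nb_units))   # acts on h_{t-1}
x_t = rng.normal(size=dim_in)
h_prev = rng.normal(size=nb_units)

# Single matrix acting on the stacked vector [h_{t-1}, x_t].
W_f = np.hstack([W_hf.T, W_xf.T])              # shape (13, 20)
stacked = np.concatenate([h_prev, x_t])        # length 20

print(W_f.shape)  # (13, 20)
print(np.allclose(W_f @ stacked, W_xf.T @ x_t + W_hf.T @ h_prev))  # True
```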

**Connecting LSTM layer with subsequent layers.**
The rest of the network works as before. In Keras, we let:

```
model=Sequential()
model.add(LSTM(input_shape=(None, dim_in),
return_sequences=True,
units=nb_units,
recurrent_activation='sigmoid',
activation='tanh'))
model.add(TimeDistributed(Dense(activation='sigmoid',
units=dim_out)))
```

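In detail, the `LSTM` line computes the full sequences $(h_1, \dots, h_T)$ and $(c_1, \dots, c_T)$ from $(x_1, \dots, x_T)$ (and initial values $h_0$ and $c_0$); the `TimeDistributed` line computes the sequence $(y_1, \dots, y_T)$ from $(h_1, \dots, h_T)$ (note that the cell variable $c_t$ is used internally in `LSTM` but not in subsequent layers, contrary to the hidden variable $h_t$).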

Shape of weight matrices and manual computations are detailed in Part D of the companion code.

## Part E: Explanation of GRU

Gated Recurrent Units (GRU) are a popular alternative to LSTM, introduced in 2014. They are reported to give results similar to LSTM with fewer parameters to train (3 sets of weights for GRU instead of 4 for LSTM).

A GRU layer takes inputs $(x_t, h_{t-1})$ and outputs $h_t$ at each step $t$. In Keras, the command line:

```
GRU(input_shape=(None, dim_in),
return_sequences=True,
units=nb_units,
recurrent_activation='sigmoid',
activation='tanh')
```

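corresponds to the equations:

$$\begin{aligned}
z_t &= \sigma(W_{xz}^T x_t + W_{hz}^T h_{t-1} + b_z), \\
r_t &= \sigma(W_{xr}^T x_t + W_{hr}^T h_{t-1} + b_r), \\
\tilde{h}_t &= \tanh(W_{xh}^T x_t + W_{hh}^T (r_t \odot h_{t-1}) + b_h), \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,
\end{aligned}$$

where $\odot$ is the componentwise product (with null vector for $h_0$) and is represented by the following diagram:

In this setting, $z_t$ has a role similar to the *forget variable* (controlling how much information of $h_{t-1}$ and $\tilde{h}_t$ is kept), $r_t$ is a *recurrent variable* (controlling how $h_{t-1}$ is weighted), and $\tilde{h}_t$ represents *new information* to save (subsequently weighted by $1 - z_t$).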
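A minimal NumPy sketch of one GRU step, under the formulation where the recurrent variable is applied to $h_{t-1}$ before the matrix product. The dictionary keys, sizes and random weights are illustrative assumptions. Recent Keras versions default to a slightly different variant (`reset_after=True`), so a GRU trained there may not match this sketch exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: returns h_t from (x_t, h_{t-1})."""
    z_t = sigmoid(p["W_xz"].T @ x_t + p["W_hz"].T @ h_prev + p["b_z"])
    r_t = sigmoid(p["W_xr"].T @ x_t + p["W_hr"].T @ h_prev + p["b_r"])
    h_tilde = np.tanh(p["W_xh"].T @ x_t + p["W_hh"].T @ (r_t * h_prev) + p["b_h"])
    # z_t close to 1 keeps the previous hidden variable,
    # z_t close to 0 replaces it with the new information.
    return z_t * h_prev + (1.0 - z_t) * h_tilde

dim_in, nb_units = 7, 13      # hypothetical sizes
rng = np.random.default_rng(0)
p = {}
for g in "zrh":
    p["W_x" + g] = rng.normal(size=(dim_in, nb_units))
    p["W_h" + g] = rng.normal(size=(nb_units, nb_units))
    p["b_" + g] = np.zeros(nb_units)

h = np.zeros(nb_units)                        # null h_0
for x_t in rng.normal(size=(5, dim_in)):      # a series of 5 time steps
    h = gru_step(x_t, h, p)
print(h.shape)  # (13,)
```

Unlike the LSTM, no separate cell variable is carried from one step to the next: the hidden variable alone holds the memory.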

**Explanation of matrices.**
As before, we suppose that `dim_in=7` and `nb_units = 13`, so $x_t$ has length 7 and $h_t$ has length 13.

- Matrices $W_{xz}$, $W_{xr}$, $W_{xh}$ have shape $7 \times 13$ each, because they are multiplied with $x_t$,
- Matrices $W_{hz}$, $W_{hr}$, $W_{hh}$ have shape $13 \times 13$ each, because they are multiplied with $h_{t-1}$,
- Bias vectors $b_z$, $b_r$, $b_h$ have length $13$ each.

Consequently, vectors $z_t$, $r_t$ and $\tilde{h}_t$ have length $13$ each.

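In the Keras implementation of GRU, $W_x$, $W_h$ and $b$ are defined as follows:

- $W_x$ is the concatenation of $W_{xz}$, $W_{xr}$, $W_{xh}$, resulting in a $7 \times 39$ matrix,
- $W_h$ is the concatenation of $W_{hz}$, $W_{hr}$, $W_{hh}$, resulting in a $13 \times 39$ matrix,
- $b$ is the concatenation of $b_z$, $b_r$, $b_h$, resulting in a vector of length $39$.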
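With these shapes, the parameter counts of the two gated layers can be compared directly. The sketch below is plain Python; the helper `recurrent_params` is introduced only for illustration and assumes one kernel, one recurrent kernel and one bias per block, as described in this post (some Keras configurations of GRU use additional bias terms).

```python
def recurrent_params(dim_in, nb_units, n_blocks):
    """Parameters of a recurrent layer made of n_blocks sets of weights."""
    per_block = dim_in * nb_units + nb_units * nb_units + nb_units
    return n_blocks * per_block

dim_in, nb_units = 7, 13
lstm_params = recurrent_params(dim_in, nb_units, 4)   # i, f, c, o
gru_params = recurrent_params(dim_in, nb_units, 3)    # z, r, h
print(lstm_params, gru_params)  # 1092 819
```

Each block holds 91 + 169 + 13 = 273 parameters, so in this setting the GRU layer has three quarters of the parameters of the LSTM layer.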

Manual computations are detailed in Part E of the companion code, and are somewhat less straightforward than for LSTM.

### References

- Companion code for this post,
- Understanding LSTM Networks by Christopher Olah,
- Keras documentation for TimeDistributed,
- Keras documentation for RNN,
- Wikipedia page on RNN describing the Elman networks,
- Thanks to J. Leon for this TikZ figure, on which I based my figures (full sources are here).