Long Short Term Memory (LSTM)

Before we learn about LSTM, let's understand why we use it instead of a feed-forward network or a plain RNN.

Problem with feed-forward networks: 

Feed-forward networks do not take previous outputs into account when producing a new output. In time-series data such as bitcoin price prediction, however, one needs the previous prices to calculate the future price, so feed-forward networks do not work well on time series and other sequential data, whose outputs depend on past values.

Recurrent neural network: 

An RNN has multiple states, each of which acts as a temporary memory and stores its own output. This output, together with the new input at time t, is used when computing the output of the next state. The states occur at sequential time stamps and retain previous information for a short period of time.

The RNN works with the recursive formula:

`h_t = \tanh(w_s \cdot h_{t-1} + w_x \cdot x_t)`
 `o_t = w_y \cdot h_t` 

 where, `h_t` is the new state. 
 `h_{t-1}` is the previous state. 
 `x_t` is the input at time t. 
 tanh is the activation function. 
 `w_s`, `w_x` and `w_y` are corresponding weights. 
 `o_t` is the output.
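
As a minimal sketch (the dimensions and random weights below are illustrative assumptions, not part of the post), this recurrence can be written in NumPy as:

```python
import numpy as np

# Illustrative sizes, chosen only for this sketch.
input_size, hidden_size, output_size = 3, 4, 1

rng = np.random.default_rng(0)
w_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
w_s = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
w_y = rng.normal(size=(output_size, hidden_size))  # hidden-to-output weights

def rnn_step(h_prev, x_t):
    """One recurrence: h_t = tanh(w_s·h_{t-1} + w_x·x_t), o_t = w_y·h_t."""
    h_t = np.tanh(w_s @ h_prev + w_x @ x_t)
    o_t = w_y @ h_t
    return h_t, o_t

# Unroll the same step over a toy sequence of 5 time steps.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, o_t = rnn_step(h, x_t)
```

The same weights `w_s`, `w_x` and `w_y` are reused at every time step; only the state `h` changes.
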
A recurrent neural network learns through backpropagation through time, i.e. it applies backpropagation at every time stamp: we compute the loss, go back through each state, and update the weights using the gradient (we find the change in each weight and add it to the old weight). If the gradient is too small, the weight update becomes negligible and the RNN stops learning. This problem is called the vanishing gradient problem, and to address it we use the Long Short Term Memory (LSTM) model.
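
To see why the gradient vanishes, note that the gradient reaching a state T steps in the past contains a product of T per-step factors (roughly the recurrent weight times the tanh derivative at each step). If each factor has magnitude below 1, the product decays exponentially; the factor 0.8 below is just an assumed example value:

```python
# Product of T per-step gradient factors, each assumed to have magnitude 0.8.
factor = 0.8
for T in (5, 20, 50):
    print(T, factor ** T)   # ~0.33, ~0.012, ~1.4e-05 -> the weight update becomes negligible
```
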

Long Short Term Memory: 

An LSTM has three gates (a forget gate, an input gate and an output gate) and an intermediate cell state. When a new input arrives, the forget gate decides which information from the previous output is no longer required and excludes it: the previous output and the new input are each multiplied by their corresponding weights, summed, and passed through a sigmoid activation function. The input gate decides which of the new information to store. The input gate and the output gate are computed in the same way as the forget gate, but each gate has its own weights. The intermediate (candidate) cell state is computed the same way, except that a tanh activation function is used instead of the sigmoid. The cell state is then updated by adding the previous cell state multiplied by the forget gate to the candidate cell state multiplied by the input gate. By doing so, the required information can be stored for a long period of time.

`f_t = \sigma(w_{h1} \cdot h_{t-1} + w_{x1} \cdot x_t)` 
`i_t = \sigma(w_{h2} \cdot h_{t-1} + w_{x2} \cdot x_t)`
`\tilde{c}_t = \tanh(w_{h3} \cdot h_{t-1} + w_{x3} \cdot x_t)` 
`c_t = f_t * c_{t-1} + i_t * \tilde{c}_t` 
`o_t = \sigma(w_{h4} \cdot h_{t-1} + w_{x4} \cdot x_t)` 
`h_t = o_t * \tanh(c_t)` 

where `f_t` is the forget gate, 
`i_t` is the input gate, 
`\tilde{c}_t` is the intermediate (candidate) cell state and `c_t` is the cell state, 
`o_t` is the output gate, 
`h_t` is the output.
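
Putting the equations together, here is a minimal NumPy sketch of one LSTM step (the weight shapes and values are illustrative assumptions; biases are omitted, as in the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes and random weights: one (W_h, W_x) pair per gate f, i, c, o.
input_size, hidden_size = 3, 4
rng = np.random.default_rng(1)
W_h = {g: rng.normal(size=(hidden_size, hidden_size)) for g in "fico"}
W_x = {g: rng.normal(size=(hidden_size, input_size)) for g in "fico"}

def lstm_step(h_prev, c_prev, x_t):
    """One LSTM step following the equations above."""
    f_t = sigmoid(W_h["f"] @ h_prev + W_x["f"] @ x_t)      # forget gate
    i_t = sigmoid(W_h["i"] @ h_prev + W_x["i"] @ x_t)      # input gate
    c_tilde = np.tanh(W_h["c"] @ h_prev + W_x["c"] @ x_t)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                     # updated cell state
    o_t = sigmoid(W_h["o"] @ h_prev + W_x["o"] @ x_t)      # output gate
    h_t = o_t * np.tanh(c_t)                               # new output
    return h_t, c_t

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(h, c, x_t)
```

Because `c_t` is updated additively (the old cell state scaled by the forget gate plus the new candidate scaled by the input gate), gradients can flow through the cell state with much less shrinkage than in a plain RNN.
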

The LSTM can be used in NLP tasks such as predicting missing words or sentences, and more generally on other sequential data.
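
In practice the cell is rarely implemented by hand; a minimal Keras sketch for a sequence-prediction setup (the window length of 30, the single feature and the 64 units are assumptions for illustration) could look like:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(30, 1)),  # 30 past time steps, 1 feature each
    tf.keras.layers.LSTM(64),       # 64 LSTM units
    tf.keras.layers.Dense(1),       # predict the next value
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, epochs=10)   # X_train shape: (samples, 30, 1)
```
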