LSTM: A Solution for the Vanishing Gradient Problem
We humans never start thinking from scratch. We always carry stored information and ideas from previous experience. We add new information to what we already know, modify some of it, or forget it altogether.
Ever wondered how machines work? How they relate things? How they work internally? Machines initially used RNNs (Recurrent Neural Networks) for this kind of work. In a Recurrent Neural Network the output is fed back to become part of the next input, hence the name recurrent. But the problem with this network is memory: RNNs are bad at remembering things over long sequences. Say a person's review of a movie reads, “The movie is very bad. It is a total waste of time and money. So never ever ever ever ever watch this movie.” In this scenario a plain RNN effectively remembers only the last few words, i.e. “ever watch this movie”, so the negative review ends up looking positive. This is not acceptable. The underlying cause is the vanishing gradient problem: as gradients are propagated back through many time steps they shrink toward zero, so words far in the past barely influence what the network learns.
So in order to overcome this problem we add memory to the system. This updated architecture is called LSTM (Long Short-Term Memory). Here we find an extra channel that carries information along the network and adds or forgets information as needed.
So let's dive in and see what happens inside an LSTM.
LSTMs are a special type of RNN with the capability of remembering things for long periods of time. In a plain RNN we form a chain of repeating modules, each with a very simple structure. LSTMs also have a chain-like structure, but each repeating module has four neural network layers interacting with each other in order to produce better output.
In an LSTM we use pointwise operations such as vector addition and multiplication in order to add, modify, or delete information. We use functions like sigmoid and tanh to keep the numbers we send to the next layer in a bounded range (if values grow very large, subsequent operations become harder to work with).
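To make this concrete, here is a minimal NumPy sketch of the pointwise operations and squashing functions; all names and values are illustrative, not taken from the article:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1); used as a soft on/off switch.
    return 1.0 / (1.0 + np.exp(-z))

# Two illustrative vectors flowing through an LSTM cell.
a = np.array([2.5, -7.0, 0.3])
b = np.array([1.0, 4.0, -2.0])

print(a + b)        # pointwise (elementwise) vector addition
print(a * b)        # pointwise vector multiplication
print(sigmoid(a))   # values gated into the range (0, 1)
print(np.tanh(a))   # values squashed into the range (-1, 1)
```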
In the diagram above, pink denotes '+' (pointwise vector addition), blue denotes '×' (pointwise vector multiplication), yellow denotes the sigmoid function, and green denotes the tanh function. X(t) is the current input, X(t-1) is the previous input, and X(t+1) is the next input. h(t) is the output, while h(t-1) and h(t+1) are the previous and next outputs. The horizontal line running along the top of the diagram is called the cell state. It acts like a conveyor belt that runs along the entire LSTM structure.
The LSTM can add or delete information from this cell state with the help of gates. A gate is made of a sigmoid layer followed by a pointwise multiplication; it decides whether information in the cell state is kept or discarded. The sigmoid layer outputs values between 0 and 1: an output of 0 means no information passes through from the cell state, while anything above 0 lets the layer decide how much information to keep. This forget gate is the first step in an LSTM.
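A minimal sketch of this forget-gate step, assuming a hidden size of 4 and illustrative, untrained parameters (W_f, b_f and the random inputs are placeholders, not the article's values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Illustrative (untrained) forget-gate parameters.
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)   # h(t-1), previous output
x_t = rng.standard_normal(input_size)       # X(t), current input
c_prev = rng.standard_normal(hidden_size)   # previous cell state

# Forget gate: sigmoid produces a value in (0, 1) for each cell-state entry.
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Pointwise multiplication: 0 means "forget", 1 means "keep fully".
c_after_forget = f_t * c_prev
```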
The next step is to create an update for the cell state. This is done in a few steps: first, a sigmoid layer decides which values to update; next, a tanh layer creates a vector of new candidate values that could be added to the cell state; then the two are combined to decide what actually gets added; and finally the result is added to the cell state, producing the new cell state.
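Continuing the illustrative names from the sketch above, the update step might look like this (W_i, W_c, b_i, b_c are again untrained placeholders):

```python
# Input gate and candidate values, continuing from the forget-gate sketch.
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
W_c = rng.standard_normal((hidden_size, hidden_size + input_size))
b_c = np.zeros(hidden_size)

hx = np.concatenate([h_prev, x_t])

i_t = sigmoid(W_i @ hx + b_i)       # sigmoid decides which values to update
c_tilde = np.tanh(W_c @ hx + b_c)   # tanh proposes new candidate values

# New cell state: keep part of the old state, then add the gated candidates.
c_t = c_after_forget + i_t * c_tilde
```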
Here comes the last step, where we decide the output. The output is a filtered version of the cell state: a sigmoid layer decides which parts of the cell state to output, while the cell state itself is pushed through tanh to squash its values. The two are then multiplied pointwise to produce the final output.
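And, continuing the same sketch, the output step (W_o, b_o are illustrative, untrained parameters):

```python
# Output gate, continuing from the previous sketches.
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

o_t = sigmoid(W_o @ hx + b_o)   # sigmoid picks which parts of the state to expose
h_t = o_t * np.tanh(c_t)        # cell state squashed by tanh, then gated

# h_t is the output h(t); together with c_t it is passed to the next time step.
```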
The LSTM described above is the basic LSTM. There are many other LSTM variants that may give better results, but I hope I have succeeded in giving a basic idea of how LSTMs work.
For further reference:
Gated Recurrent Unit - An Introduction