Comments:
great explanation! thank you for sharing
This covers no explanation of vanishing gradients, which is the problem in deeper nets, where the gradients struggle to propagate back towards the starting layers of the network! Learning slows down so much that the model barely learns at all. Important ideas such as skip connections, which cache an earlier layer's activations and carry them forward, are necessary to prevent this problem; they are widely used in CNNs to avoid information loss. A sketch of the idea is below.
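A minimal NumPy sketch (my own illustration, not from the video; the shapes and weights are made up) of a skip connection: the block's input is added back to its output, so the gradient has a path that bypasses the intermediate layers entirely.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    # Plain two-layer transformation...
    h = relu(W1 @ x)
    out = W2 @ h
    # ...plus the skip connection: the input x is carried forward unchanged
    # and added back, giving gradients a direct route around the block.
    return relu(out + x)

# Hypothetical shapes, just to show the call
dim = 16
x = np.random.randn(dim)
W1 = np.random.randn(dim, dim) / np.sqrt(dim)
W2 = np.random.randn(dim, dim) / np.sqrt(dim)
print(residual_block(x, W1, W2).shape)  # (16,)
```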
thank you so much
Thank you!
Thanks for delivering the knowledge in such simple terms, sir.
thank you
Thank you Andrew Ng, amazing explanation as usual!
Hi, deepLearningAi.
Thank you so much for these wonderful videos. I am sure they have changed a lot of lives.
Imagine you use sigmoid as the activation function, whose derivative is always greater than 0 and less than 1 (at most 0.25, in fact). During backprop you pass the derivative from back to front, which means multiplying by a number less than 1 many times. If your network is deep enough, the gradient in the first few layers becomes extremely small (almost zero), and eventually those neurons stop learning.
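To see the effect numerically, here is a tiny sketch (my own, not from the video) that multiplies the sigmoid derivative across 50 layers for some made-up pre-activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # always in (0, 0.25]

np.random.seed(0)
z_per_layer = np.random.randn(50)  # one hypothetical pre-activation per layer

grad = 1.0  # gradient arriving from the loss
for z in z_per_layer:
    grad *= sigmoid_prime(z)  # each layer multiplies in a factor of at most 0.25

print(grad)  # vanishingly small: the earliest layers receive almost no signal
```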
The videos here require you to watch each and every one in a step-wise fashion. People arguing and asking too many irrelevant questions are people who did not watch the other videos LOL, so shut up and learn! IDIOTS
Great. But I need to watch it again.
For a 6-minute video this is a concise and spot-on explanation. Those who were expecting intricate, complex explanations, please refer to some books; don't waste time here.
This is a bad explanation; it completely breaks down if you use a sigmoid activation function...
He completely skips explaining what he means by z and l.
I feel like this explanation is a bit oversimplified. Also, what happens when the weight matrices are not multiples of the identity matrix?
This is indeed not really about the gradient, but more about the activations. Also, it assumes that W will be an identity matrix... which is a big assumption.
I think for the gradient issue you have to remember that the gradient for each layer is basically the inputs of that layer, times whatever the gradient was up to that layer. If you have sigmoid/tanh activations, those inputs will always be fractions. This might not be a big problem for the last layers, but as you backpropagate further and further back, always multiplying by a fraction, you get smaller and smaller gradients, which makes it harder and harder for the weights of those layers to learn.
Similarly, if your activation function can take larger values (say ReLU), you run the risk of your gradients becoming bigger and bigger ("exploding") as you backpropagate.
The vanishing/exploding of activations is not the same thing as vanishing/exploding gradients; this is the part that is not well explained in this video.
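To make the gradient side of this concrete, here is a toy sketch (my own, with made-up depths and weight scales, not from the video): it backpropagates a unit gradient through a deep stack and prints the gradient norm reaching the first layer for a sigmoid net versus a ReLU net with larger weights.

```python
import numpy as np

np.random.seed(1)

def first_layer_grad_norm(depth=30, width=64, activation="sigmoid", weight_scale=1.0):
    # Forward pass through a stack of dense layers, caching weights and pre-activations.
    x = np.random.randn(width)
    Ws, zs, a = [], [], x
    for _ in range(depth):
        W = weight_scale * np.random.randn(width, width) / np.sqrt(width)
        z = W @ a
        a = 1.0 / (1.0 + np.exp(-z)) if activation == "sigmoid" else np.maximum(z, 0.0)
        Ws.append(W)
        zs.append(z)

    # Backward pass: start with a unit gradient at the output and walk back to layer 1.
    grad = np.ones(width)
    for W, z in zip(reversed(Ws), reversed(zs)):
        if activation == "sigmoid":
            s = 1.0 / (1.0 + np.exp(-z))
            grad = (W.T @ grad) * s * (1.0 - s)   # local derivative is at most 0.25
        else:
            grad = (W.T @ grad) * (z > 0)         # ReLU derivative is 0 or 1
    return np.linalg.norm(grad)

print("sigmoid, modest weights:", first_layer_grad_norm(activation="sigmoid"))
print("ReLU, larger weights:   ", first_layer_grad_norm(activation="relu", weight_scale=3.0))
# The first norm collapses towards zero (vanishing); the second blows up (exploding).
```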
Mr. Professor, you speak so fast that I just cannot keep up with you. :)
TBH this explanation is very sloppy.
What would be the effect of vanishing/exploding gradients? How do you know that the problem occurring in your network is gradient-related?