Comments:
I want to read papers, but reading papers is hard and takes so much time. Your videos are so fast and clear, thank you my friend
I love your channel, my friend. Thank you so much, this channel helped a lot.
I'm ngl, the first 5 mins sent me 😆 - this is hella funny
Great video, thank you for showing the bigger picture
It's impressive that you predicted that residual connections are next to go. I've seen multiple papers where people are starting to improve on their complexity.
I still like double-blind reviews. The problem with open review is that the bias is higher. Even if the paper gives a lot of clues about who the author is, as in this case, you still have to do some digging to reach that conclusion, and not everybody will have the time. In addition, double-blind reviews benefit unknown researchers who have not yet made a name for themselves by giving them a fairer chance to be evaluated. It's impossible to be perfect, but I do strongly believe that double-blind is way better than open review or single-blind review processes.
I think you are wrong about the skip connections; they represent something much more fundamental than just a training aid. We can regard the space we operate over as a vector that contains the input, the output, and all intermediate inferences, only with the output and intermediate facts initially noised out. Each step can be regarded as a specialised denoiser. Viewed like this, the billion-dollar question becomes "what should the intermediate facts be?", and the answer to that is the meat of the learned representation. That is where the structure lives.
The skip connections make it much harder for the network to lose any information, while forcing each intermediate step to work with (and train) a unified learned representation. Without the skip connections, each intermediate step will attempt to build its own representation from scratch, and those will suck. I guess in theory, with enough data and time (and each module being large enough not to lose any information), each intermediate representation would converge to the same power, so in that sense it is a training aid, but the final model might then be better thought of as many models strung together rather than a single model.
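To make concrete what "hard to lose information" means here, a minimal sketch of a residual (skip) connection; the toy module f and the sizes are illustrative assumptions, not code from the paper or the video:

```python
import numpy as np

# Minimal sketch of a residual (skip) connection with toy sizes.
# The output is x + f(x), so even a weak or untrained f cannot
# easily destroy the information already present in x.

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d)) * 0.01   # a deliberately weak "module"

def f(x):
    # some learned transformation (here just a toy linear map + ReLU)
    return np.maximum(0.0, x @ W)

x = rng.normal(size=(d,))            # incoming representation
y_plain    = f(x)                    # without skip: output is whatever f produces
y_residual = x + f(x)                # with skip: x is carried through unchanged

print(np.linalg.norm(y_plain - x))     # can be far from x
print(np.linalg.norm(y_residual - x))  # stays close to x while f is weak
```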
Brilliant, best explanation ever
I'm slowly starting to grasp more and more what the heck is going on in these videos!! I've been watching your paper reviews for some time now, and I think this is the first time I actually get everything you're saying in a video.
That feeling is great. Thank you for your work!
How do we ensure that the learnable parameters of each head are different from each other?
It might happen that all the heads end up learning the same thing.
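As a rough illustration of what is at play (standard multi-head attention with toy sizes, an assumption for illustration rather than code from this paper): each head has its own projection matrices, and their different random initialisations are what normally push them toward different functions; nothing explicitly guarantees they stay distinct.

```python
import numpy as np

# Sketch of per-head parameters in standard multi-head attention (toy sizes).
# Each head h owns its own W_q[h], W_k[h], W_v[h]; they start from different
# random values, so training typically drives them apart, but there is no
# explicit constraint forcing the heads to differ.

rng = np.random.default_rng(0)
n_heads, d_model = 4, 32
d_head = d_model // n_heads

W_q = rng.normal(size=(n_heads, d_model, d_head))   # separate parameters per head
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))

X = rng.normal(size=(6, d_model))                   # 6 tokens

def head(h):
    q, k, v = X @ W_q[h], X @ W_k[h], X @ W_v[h]
    scores = q @ k.T / np.sqrt(d_head)
    a = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return a @ v

out = np.concatenate([head(h) for h in range(n_heads)], axis=1)  # (6, d_model)
print(out.shape)
```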
dude is so obsessed with uncovering who wrote the paper !! xD xD
Great explanation
You mentioned that "Transformers are a generalisation of MLPs". Got the point, but I have a doubt.
CNNs and LSTMs are specialised versions of MLPs --> they reduce computation compared to an MLP by adding the inductive prior or bias you mentioned.
If the Transformer is a further generalisation of the MLP, I intuitively thought it should need more computation than a CNN, which is specialised, even though it outperforms it in accuracy.
How come ViT needs less computation than a CNN if it is such a generic architecture or building block?
Can you please elaborate, if my question makes sense?
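A rough back-of-envelope comparison may help here (the layer sizes below are assumptions chosen for illustration, not numbers from the paper): the Transformer block in ViT is generic over a few hundred patch tokens, not over raw pixels, so its mixing step stays cheap even though it is less constrained than a convolution.

```python
# Back-of-envelope multiply-accumulate counts for one layer of each kind.
# All sizes are illustrative assumptions, not figures taken from the paper.

H, W, C = 224, 224, 3        # input image
N, d = 196, 768              # ViT-Base-like: 14x14 patches of 16x16 pixels, width 768

# Naive MLP layer mixing all raw pixels with all raw pixels:
mlp_flops = (H * W * C) ** 2

# One 3x3 conv layer, 64 -> 64 channels, at 56x56 resolution (local + weight sharing):
conv_flops = 56 * 56 * 3 * 3 * 64 * 64

# One self-attention layer over N patch tokens (Q/K/V/out projections + attention):
attn_flops = 4 * N * d * d + 2 * N * N * d

print(f"MLP over raw pixels : {mlp_flops:.2e}")   # astronomically large
print(f"3x3 conv layer      : {conv_flops:.2e}")
print(f"ViT attention layer : {attn_flops:.2e}")  # moderate, thanks to patching
```

So the "generic" part only has to relate roughly 196 tokens to each other; the patch embedding is the hand-designed step that keeps that number small.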
Jesus farking Christ, start talking about the paper instead of the review process.
This was the best video on transformers I've seen on YT, still in 2022. I think Yannic may be one of the only people that is actually able to productively reason about the core dynamics, effectiveness and utility of ML model designs.
Yannic! Your channel is awesome, thank you for covering so many interesting things in a nice digestible way! Stay cool
What do you mean by these connections in transformers being computed on the fly? Training time?
I would actually consider the transformer a regularized MLP (i.e., a special case of an MLP, not the other way around). The weights that connect nodes between layers are regularized by the similarities of the connected nodes.
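One way to picture this point (and the "computed on the fly" question above), as a minimal sketch with toy sizes that are assumptions rather than anything from the paper: an MLP mixes tokens with a fixed learned matrix, while self-attention builds its mixing matrix at runtime from the similarity of the current tokens.

```python
import numpy as np

# Contrast between a fixed learned mixing matrix (MLP-style) and a mixing
# matrix computed on the fly from token similarities (attention-style).

rng = np.random.default_rng(0)
N, d = 4, 8                      # 4 tokens, 8-dim embeddings (toy sizes)
X = rng.normal(size=(N, d))      # current token representations

# MLP-style mixing: W_fixed is a learned constant, independent of the input.
W_fixed = rng.normal(size=(N, N))
mlp_mix = W_fixed @ X

# Attention-style mixing: the mixing matrix is a softmax over token similarities.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)                    # token-to-token similarities
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
attn_mix = A @ X                 # the "weights" A depend on X itself

print(mlp_mix.shape, attn_mix.shape)
```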
When you started whispering I laughed so hard ahahahhahah
This is an awesome explanation. We could learn everything from data if we had an infinite amount of data, but since we can never have infinite data, we must introduce inductive biases or strong priors and try to learn the universe from limited samples.
If MLPs are generalisations of CNNs and LSTMs, and Transformers are generalisations of MLPs, and the only reason that Transformers now outperform CNNs and LSTMs is the abundance of pre-training data, why don't MLPs also outperform CNNs and LSTMs?
This video and especially the explanation about inductive biases are pure gold!
The generalization/specialization discussion was extremely helpful. Amazing content
At first I thought you were very happy with the double-blind review process, but I soon found out that you weren't.
You have encouraged me to put my work on arXiv and keep working hard on what I think is important, rather than waiting for top journals/conferences to accept it. Thank you.
That intro was hilarious
Haha, great opinion on peer review 🤣
Cool 🌿
Funniest rant ever
Of course it's from Google
This kind of research is a bit disappointing for me because I don't have 2.5k TPU hours, so I am wondering: is it even possible to contribute anything to the research community? Please help me if you can.
Those first four minutes of the video are just raw humor 😂
Thanks for sharing. It is very useful
Aside from the unnecessary 5-minute rant, this is a good video.
Hahaha yeah, obviously Google
I love your take on the matter. Very eye-opening
Your insights really helped me understand this paper. Thank you!
Awesome. I was wondering a while ago whether the operations a network performs at each layer are not "too static". In a normal CNN the operations are completely defined by the programmer and only the weights are learned. But maybe the operations themselves can be learned as well. Seems like Transformers are going a bit in that direction, if I understood everything correctly.
We found who wrote it, good job there
Good
Love the style comparison of the two papers. Of course I totally disregard it.
I have subscribed and liked.
Isn't it possible to deliberately review without uncovering the authors' identity for oneself? For whatever reason, possibly ethical reasons?
Brother, please subscribe to my channel.
Ответить"attention is all you need" then "reward is enough" ha ha ha. what a disaster. how is it then? then cnn in parallel work just as well?
For smaller training datasets (a couple of hundred images), do you think Transformers can underperform biased models like CNNs?