An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)

Yannic Kilcher

3 years ago

333,708 views

Comments:

Süleyman Emir Akın
Süleyman Emir Akın - 12.09.2023 10:52

I want to read papers, but reading papers is hard and costs so much time. Your videos are so fast and clear. Thank you, my friend.

Süleyman Emir Akın
Süleyman Emir Akın - 12.09.2023 10:51

I love your channel, my friend. Thank you so much; this channel helped a lot.

Shy Lilak
Shy Lilak - 14.08.2023 21:16

I'm ngl.. The first 5 mins sent me 😆- this is hella funny

Stanislaw Cronberg
Stanislaw Cronberg - 25.06.2023 13:13

Great video, thank you for showing the bigger picture

tim
tim - 16.06.2023 03:51

It's impressive that you predicted that residual connections would be the next thing to go. I've seen multiple papers where people have started improving on their complexity.

Directed Evolution
Directed Evolution - 13.06.2023 07:59

I still like double-blind reviews. The problem with open review is that the bias is higher. Even if the paper gives a lot of clues about who the authors are, as in this case, you still have to do some digging to reach that conclusion, and not everybody will have the time. In addition, double-blind reviews benefit unknown researchers who have not made a name for themselves yet by giving them a fairer chance to be evaluated. It's impossible to be perfect, but I do strongly believe that double-blind is way better than open review or single-blind review processes.

A G Systems
A G Systems - 30.05.2023 14:23

I think you are wrong about the skip connections; they represent something much more fundamental than just a training aid. We can regard the space we operate over as a vector that contains the input, the output, and all intermediate inferences, only with the output and intermediate facts initially noised out. Each step can be regarded as a specialised denoiser. When viewed like this the billion dollar question becomes "what should the intermediate facts be?", and the answer to this is the meat of the learned representation. That is where the structure lives.
The skip connections make it much harder for the network to lose any information, while forcing each intermediate step to work with (and train) a unified learned representation. Without the skip connections each interconnect will attempt to make its own representation from scratch, and they will suck. I guess in theory with enough data and time (and each module being large enough to not lose any information), each intermediate representation would converge to the same power, so in that sense it is a training aid, but the final model might be better thought of as many models strung together, rather than a single model at that point.
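
To make the skip-connection point concrete, here is a minimal sketch (plain NumPy, made-up sizes, not from the video or the paper) of a residual block: the identity path carries the input through unchanged, so the block only has to learn a correction on top of the shared representation.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def residual_block(x, W1, W2):
        # The block computes x + f(x): the skip path preserves the incoming
        # representation, and f only has to learn a correction on top of it.
        h = relu(x @ W1)
        return x + h @ W2

    rng = np.random.default_rng(0)
    d = 8
    x = rng.normal(size=(1, d))
    W1 = rng.normal(scale=0.1, size=(d, d))   # hypothetical learned weights
    W2 = rng.normal(scale=0.1, size=(d, d))
    y = residual_block(x, W1, W2)             # y stays close to x while f is small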

AoibhinnMcCarthy1996
AoibhinnMcCarthy1996 - 21.05.2023 23:10

Brilliant, best explanation ever.

Plads Elsker
Plads Elsker - 08.05.2023 15:08

I'm slowly starting to grasp more and more what the heck is going on in these videos!! I've been watching your paper reviews for some time now, and I think this is the first time I actually get everything you're saying in a video.

That feeling is great. Thank you for your work!

sumit Sp
sumit Sp - 26.04.2023 20:04

How do we ensure that the learnable parameters of each head are different from each other?
It might happen that all the heads end up learning the same thing.
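
Nothing in the architecture explicitly forces the heads apart; each head simply owns its own projection matrices, and independent random initialization breaks the symmetry, so in practice the heads tend to specialize differently (some redundancy between heads does still occur). A rough NumPy sketch, with made-up dimensions, of how the per-head parameters are kept separate:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_heads = 64, 4
    d_head = d_model // n_heads

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    # Each head gets its own, independently initialized Q/K/V projections.
    heads = [
        {name: rng.normal(scale=d_model ** -0.5, size=(d_model, d_head))
         for name in ("Wq", "Wk", "Wv")}
        for _ in range(n_heads)
    ]

    def attention_head(x, Wq, Wk, Wv):
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        a = softmax(q @ k.T / np.sqrt(d_head))   # each head computes its own weights
        return a @ v

    x = rng.normal(size=(10, d_model))            # 10 tokens of a toy sequence
    out = np.concatenate([attention_head(x, **h) for h in heads], axis=-1)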

Mysterybox10
Mysterybox10 - 24.03.2023 19:44

dude is so obsessed with uncovering who wrote the paper !! xD xD

Jagannathan K
Jagannathan K - 21.03.2023 21:45

Great explanation

Kartik Podugu
Kartik Podugu - 21.03.2023 07:23

You mentioned that "Transformers are a generalisation of MLPs." Got the point, but I have a doubt.

CNNs and LSTMs are specific versions of MLPs --> they reduce computation compared to an MLP by adding the inductive prior or bias you mentioned.
If the Transformer is a further generalization of the MLP, I intuitively thought it should need more computation than a CNN, which is specific, even though it outperforms it in accuracy.
How come ViT needs less computation than a CNN if it is such a generic architecture or building block?
Can you please elaborate, if my question makes sense?
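
A back-of-the-envelope way to see why "more general" does not have to mean "more compute" (my own rough numbers, not from the paper): the cost depends mostly on how many positions are processed and at what width, and ViT first shrinks a 224x224 image into 14x14 = 196 patch tokens, while a CNN slides its kernels over every location of much larger feature maps.

    # Rough multiply-add counts; the QKV and MLP projections are ignored on the
    # ViT side, so these are only order-of-magnitude illustrations.
    img, patch, d_model = 224, 16, 768
    n_tokens = (img // patch) ** 2                  # 196 patch tokens

    # One self-attention layer: QK^T plus the weighted sum each cost
    # roughly n_tokens^2 * d_model operations.
    attn_flops = 2 * n_tokens ** 2 * d_model        # ~5.9e7

    # One 3x3 convolution with 256 input and 256 output channels on a
    # 56x56 feature map.
    h = w = 56
    conv_flops = h * w * 3 * 3 * 256 * 256          # ~1.8e9

    print(f"attention layer ~{attn_flops:.1e}, conv layer ~{conv_flops:.1e}")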

TonyTiger6521
TonyTiger6521 - 06.02.2023 06:07

Jesus farking Christ, start talking about the paper instead of the review process.

googleyoutubechannel
googleyoutubechannel - 20.12.2022 22:33

This was the best video on transformers I've seen on YT, still in 2022. I think Yannic may be one of the only people that is actually able to productively reason about the core dynamics, effectiveness and utility of ML model designs.

commiekaza
commiekaza - 10.11.2022 20:22

Yannic! Your channel is awesome, thank you for covering so many interesting things in a nice digestible way! Stay cool

Ahmad Anis
Ahmad Anis - 09.11.2022 18:01

What do you mean by these connections in transformers being computed on the fly? Training time?

Bin LI
Bin LI - 25.10.2022 00:21

I would actually consider the transformer a regularized MLP (i.e., a special case of an MLP, not the other way around). The weights that connect nodes between layers are regularized by the similarities of the connected nodes.
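
One way to make that reading concrete (my own toy sketch, leaving out the learned Q/K/V projections): an MLP mixing tokens would apply one fixed learned matrix to every input, while self-attention recomputes the token-to-token mixing weights on the fly from the current input's pairwise similarities, with the softmax constraining each row to a convex combination.

    import numpy as np

    rng = np.random.default_rng(0)
    n_tokens, d = 6, 16
    x = rng.normal(size=(n_tokens, d))

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    # MLP-style mixing: a fixed learned token-to-token matrix, the same for
    # every input.
    W_fixed = rng.normal(size=(n_tokens, n_tokens))
    mlp_mix = W_fixed @ x

    # Attention-style mixing: the mixing matrix is recomputed from the
    # similarities of the tokens themselves; each row sums to 1.
    A = softmax(x @ x.T / np.sqrt(d))
    attn_mix = A @ x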

Stefano Butelli
Stefano Butelli - 18.10.2022 13:30

When you started whispering I laughed so hard ahahahhahah

R Q
R Q - 04.10.2022 08:07

This is an awesome explanation. We could learn everything from data if we had an infinite amount of data, but since we can never have an infinite amount of data, we must introduce inductive biases or strong priors and try to learn the universe from limited samples.

Ryan Denziloe
Ryan Denziloe - 05.09.2022 00:39

If MLPs are generalisations of CNNs and LSTMs, and Transformers are generalisations of MLPs, and the only reason that Transformers now outperform CNNs and LSTMs is the abundance of pre-training data, why don't MLPs also outperform CNNs and LSTMs?

Stefan Vasilev
Stefan Vasilev - 13.08.2022 14:28

This video and especially the explanation about inductive biases are pure gold!

Jesús Pérez
Jesús Pérez - 07.08.2022 18:35

The generalization/specialization discussion was extremely helpful. Amazing content

John Tan Chong Min
John Tan Chong Min - 20.07.2022 09:22

At first I thought you were very happy with the double blind review process, but I soon found out that you weren't.

You have encouraged me to put my work on arXiv and continue to work hard on what I think is important, rather than wait for top journals/conferences to accept it. Thank you.

Jackson Meeks
Jackson Meeks - 18.07.2022 03:54

That intro was hilarious

Sirui Tao
Sirui Tao - 11.07.2022 08:49

Haha, great opinion on peer review 🤣

ThePresistence
ThePresistence - 10.06.2022 03:57

Cool 🌿

uchenna nwanyanwu
uchenna nwanyanwu - 09.06.2022 01:22

Funniest rant ever

dev stuff
dev stuff - 27.05.2022 06:35

Of course it's from Google

Hamed Gholami
Hamed Gholami - 26.05.2022 15:50

This kind of research is a bit disappointing for me because I don't have 2.5k TPU hours, so I am wondering: is it even possible to contribute anything to the research community? Please help me if you can.

Vadrif Draco
Vadrif Draco - 23.05.2022 03:31

Those first four minutes of the video are just raw humor 😂

王宣文
王宣文 - 18.05.2022 12:39

Thanks for sharing. It is very useful.

TheTomer
TheTomer - 08.05.2022 18:37

Aside from the unnecessary five-minute-long rant, this is a good video.

Everything is Kubernetes
Everything is Kubernetes - 22.04.2022 06:12

Hahaha yeah, obviously Google

Mơ Gừng Kẹo Dẻo
Mơ Gừng Kẹo Dẻo - 18.04.2022 06:06

I love your take on the matter. Very eye-opening

Anna Woodard
Anna Woodard - 11.03.2022 23:41

Your insights really helped me understand this paper. Thank you!

Bryce
Bryce - 25.02.2022 19:12

Awesome. A while ago I was wondering whether the operations that a network performs at each layer are not "too static". In a normal CNN the operations are completely defined by the programmer and only the weights are learned. But maybe the operations themselves can be learned as well. Seems like Transformers are going a bit in that direction, if I understood everything correctly.

AiRepublic
AiRepublic - 24.02.2022 01:34

We found who wrote it, good job there

Sayantan Das
Sayantan Das - 16.02.2022 04:38

vlo

Aleksandra Żuraw
Aleksandra Żuraw - 08.01.2022 19:55

Love the style comparison of the two papers. Of course I totally disregard it.

Kalyani Das
Kalyani Das - 12.12.2021 18:11

I have subscribed and liked.

Volker Siegel
Volker Siegel - 11.12.2021 07:41

Isn't it possible to intentionally review without uncovering the authors' anonymity for oneself? For whatever reason, possibly ethical reasons?

Morshed vai bd official channel
Morshed vai bd official channel - 10.12.2021 12:48

Brother, please subscribe to my channel.

Dan Car
Dan Car - 13.11.2021 13:31

"attention is all you need" then "reward is enough" ha ha ha. what a disaster. how is it then? then cnn in parallel work just as well?

Parnia Shokri
Parnia Shokri - 29.10.2021 00:35

For smaller training datasets (a couple of hundred images), do you think Transformers can underperform biased models like CNNs?
