Comments:
You are an inspiration, James
This is by far the best explanation of the paper that I could find. Thanks a lot!
This video is really helpful. Thank you!
Well, Bag of Words and Bag of Visual Words WERE a merger of NLP and Computer Vision, back in the day (2010s)
Thanks a lot for this! Amazing, amazing explanation!
Hey, thanks a lot. I come from TensorFlow, so can you please answer: does this train the whole ViT model on our dataset, or freeze the pretrained ViT part and train only the classification head (like trainable=False in TF)?
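A minimal PyTorch sketch of the "freeze the backbone, train only the head" pattern the question asks about. The module is a toy stand-in, not the real Hugging Face ViT; with `ViTForImageClassification` you would typically loop over `model.vit.parameters()` in the same way.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained backbone plus a classification head.
# Names and sizes are illustrative, not the Hugging Face ViT attributes.
class TinyViTClassifier(nn.Module):
    def __init__(self, hidden=32, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(48, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.classifier(self.backbone(x))

model = TinyViTClassifier()

# PyTorch equivalent of TF's trainable=False: stop gradients for the backbone.
for p in model.backbone.parameters():
    p.requires_grad = False

# Hand only the still-trainable parameters (the head) to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

By default, fine-tuning recipes train the whole model; freezing as above is an opt-in choice you make before building the optimizer.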
Very good introductory video. Thanks for sharing.
Really enjoyed every bit. Trying to set up the transformer for an audio regression task; the ViT has shown amazing performance in classification.
Thank you for the effort you're putting into your explanations. :)
Great video! I've watched quite a few videos and read papers about Transformers, but your video really made me understand the concept.
No fun using the Hugging Face Transformers library. You should have explained vision transformers using a more basic implementation rather than a high-level library.
James is the top G in deep learning
It currently gives the error 'No module named 'datasets''. Does anybody have a fix?
Is there anything similar to word embeddings? Or do you simply take your pixel data as patches and run it through a dense layer to get projections?
Thank you, sir
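For the missing-module error above, the usual fix is installing Hugging Face's `datasets` package into the environment the notebook runs in (assuming a standard pip setup):

```shell
pip install datasets
```

If you're in a Jupyter notebook, run it as `!pip install datasets` and restart the kernel afterwards.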
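On the embeddings question above: yes, ViT's analogue of word embeddings is the patch embedding — the image is cut into fixed-size patches, each patch is flattened, and a learned linear projection maps it to the model dimension. A minimal NumPy sketch (the sizes and the random projection matrix are illustrative; in the real model the projection is learned):

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 8; C = 3; P = 4; D = 16          # 8x8x3 image, 4x4 patches, embed dim 16
image = rng.standard_normal((H, W, C))

# Split into non-overlapping PxP patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)  # (num_patches, P*P*C) -> (4, 48)

# Learned linear projection (random here, for illustration only).
W_proj = rng.standard_normal((P * P * C, D))
embeddings = patches @ W_proj             # (num_patches, D) -> (4, 16)

print(patches.shape, embeddings.shape)
```

Positional embeddings are then added to these patch embeddings before the encoder, just as in NLP transformers.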
Great video man cheers! Do you have a video about using a dataset made up of your own images on the vision transformer?
Thanks a lot for the video. I can't find any precise explanation of the function of the self-attention layer and the MLP layer in the encoder modules. Could you maybe add some information about that?
Nice one! This content desperately needs "Promo SM"!
Another great video of yours. So clear and clarifying. Thx!
Great explanation, unique on YT. Thanks!
Great stuff!
In this video, there is no explanation of the output of a vision transformer. In NLP transformers, the output is a probability distribution over the vocab, but in vision transformers I guess it is over a codebook. What this codebook is and how it is aligned to the input image is not clear. Thanks a lot for this video, but it is incomplete.
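On the output question above: for classification there is no codebook — the standard ViT prepends a learnable [CLS] token, and after the encoder, that token's final hidden state is passed through a linear head, giving a probability distribution over image classes rather than over a vocabulary. A NumPy sketch of that last step (shapes and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

num_tokens, hidden, num_classes = 5, 16, 10    # [CLS] + 4 patch tokens
encoder_output = rng.standard_normal((num_tokens, hidden))

cls_state = encoder_output[0]                  # final hidden state of the [CLS] token

# Linear classification head followed by a softmax over classes.
W_head = rng.standard_normal((hidden, num_classes))
b_head = np.zeros(num_classes)
logits = cls_state @ W_head + b_head
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape, probs.sum())                # (10,) and ~1.0
```

Codebooks over image tokens do appear in generative models (e.g. VQ-VAE-style approaches), but the ViT classifier in the video doesn't use one.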
Just discovered your channel, great stuff! Thanks man! Great explanation and visualisation
The clarity of your discourse is unmatched, and it's always a pleasure to follow your videos. Kind of a side effect of your passion for the domain!?
Incredible content! Thx James!
Thank you so much for this video :)