Transformers: The best idea in AI | Andrej Karpathy and Lex Fridman

Lex Clips

1 year ago

366,850 views

Comments:

Ms Stone - 22.10.2023 07:32

Great interview. Engaging and dynamic. Thank you.

wasp082 - 15.10.2023 07:15

The "attention" name was already around in the past on other architectures. It was common to see bidirectional recurrent neural networks with "attention" on the encoder side. That's where the title "Attention Is All You Need" comes from: attention basically removes the need for a recurrent, sequential architecture.
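
For readers unfamiliar with the mechanism this comment refers to, below is a minimal sketch of scaled dot-product attention, the core operation of the Transformer. It is illustrative only: plain NumPy, no learned projections, and all names in it are made up for the example.

```python
# Minimal sketch of scaled dot-product attention (illustrative, not from the video).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) arrays of query, key, and value vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of every token with every other token
    weights = softmax(scores, axis=-1)  # each row sums to 1: "how much to attend where"
    return weights @ V                  # weighted mix of value vectors

# Toy usage: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real Transformer, Q, K, V come from learned linear projections of x;
# here x is reused directly just to show the data flow.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```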

Poize - 23.07.2023 05:39

Dr. Ashish Vaswani is a pioneer and nobody is talking about him. He is a scientist from Google Brain and the first author of the paper that introduced Transformers, which is the backbone of all other recent models.

J M - 07.07.2023 22:42

The model YOLO (you only look once) is also an example of a "meme title"

Alexandros Angeli - 23.06.2023 15:53

Andrej's influence on the development of the field is so underrated. Not only does he actively contribute academically (through research and co-founding OpenAI), he also communicates ideas so well to the public (for free, by the way) that he both helps others contribute academically and encourages many people to get into the field, simply because he takes an overwhelmingly complex topic (at least it used to be for me), such as the Transformer, and strips it down to something that can be more easily digested. Or maybe that's just me, as my undergrad professor came nowhere near an explanation of Transformers as good and intuitive as Andrej's videos (don't get me wrong, [most] professors know their stuff very well, but Andrej is just on a whole other level).

Gabriele Pi. - 05.06.2023 22:53

So the last 5 years in AI are "just" dataset size changes while keeping the original Transformer architecture (almost)?

Ryan Alba - 02.06.2023 10:28

Incredibly interesting and thought-provoking. But still disappointed this wasn't about Optimus Prime.

→ to the knee - 31.05.2023 21:47

Damn. That last sentence. Transformers are so resilient that they haven't been touched in the past FIVE YEARS of AI! I don't think that idea can ever be overstated given how fast this thing is accelerating...

Jean Fradet - 23.05.2023 22:23

Self-attention. Transforming. It's all about giving the AI more parameters with which to optimize the important internal representations of the interconnections within the data itself. We've supplied first-order interconnections. What about second order? Third? Or is that expected to be covered by the sliding-window technique itself? It would seem the more early representations we can add, the more of the data's complexity/nuance we can couple to. At the other end, the more we couple to the output, the closer we can get to alignment. But input/output are fuzzy concepts in a sliding-window technique. There is no temporal component to the information; it is represented by large "thinking spaces" of word connections, somewhere between a CNN-like technique that parses subsections of the whole thing at once and a fully connected space across all the inputs. That said, sliding is convenient: it removes the hard limit on what can be generated and gives an easy-to-understand parameter we can increase at fairly small cost to improve our ability to generate long-form output with deeper nuance/accuracy. The ability to just change the window size and have the network adjust seems a fairly nice way to flexibly scale the models, though there is a "cost" to moving around, i.e. network stability, meaning you can only scale up or down so much at a time if you want to keep most of the knowledge from previous training.

Anyway, the key ingredient is that we purposefully encode the spatial information (into the words themselves) to whatever depth we desire, or at least that's a possible extension (a small positional-encoding sketch follows this comment). The next question, of course, is in which areas of representation we can supply more data that easily encodes, within the mathematics, information we think is important and that isn't already covered by the processes of the system itself; having the same thing represented in multiple ways (i.e. the data + the system) is a path to overly complicated systems in terms of growth/addenda. The easiest path is to just represent it in the data itself, and patch it. But you can do stages of processing/filtering along multiple fronts and incorporate them into a larger model more easily, as long as the encodings are compatible (which I imagine will most greatly affect the growth and swappability of these systems, through standardization).

Ideally this is information that is further self-represented within the data itself. FFTs are a great approximation we can use to bridge continuous vs. discrete knowledge. Though calculating one on word encodings feels like a poor fit, we could break the "data signal" into a chosen subset of wavelengths. Note this doesn't help with the next-word-prediction "component" of the data representation, but it is a past-knowledge-based encoding that can be used in unison with the spatial/self-attention and parser encodings to represent the information (I'm actually not sure of the balance between spatial and self-attention, except that the importance of each token to the generation of each word relative to the previous word, along with possibly higher-order interconnections between the tokens, is contained within the input stream). If it is higher order, then FFTs may already be represented and I've talked myself in a circle.

I wonder what results dropout tied to categorization would yield for the swappability of different components between systems, or for the ability to turn various bits and bobs on/off in a way tied to the data. I think that's also how one can understand the reverse flow of partial derivatives through the loss function: by turning off all but one path at a time to isolate the parts being considered, though that depends on the loss function being used. I imagine categorizing subsections of data and splitting them off into distinct areas would allow finer control over the representations of subsystems, to increase scores on specific tests without affecting other testing areas as much. It could be antithetical to AGI-style understanding, but it would allow for field-specific interpretation of information, in a sense.

Heck, what if we encoded each word as its dictionary definition?
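
The "purposefully encode the spatial information into the words themselves" idea above corresponds to the positional encoding added to token embeddings in "Attention Is All You Need". Below is a minimal sketch of the sinusoidal version; it assumes an even embedding size, and the variable names are illustrative rather than taken from the video or the comment.

```python
# Sketch of sinusoidal positional encoding: position folded "into the data itself".
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even.
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # even embedding dimensions
    angles = positions / (10000 ** (dims / d_model))    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # sine on even dims
    pe[:, 1::2] = np.cos(angles)                         # cosine on odd dims
    return pe

# Token embeddings (random here, learned in practice) plus positional encoding:
seq_len, d_model = 16, 32
token_embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(x.shape)  # (16, 32) -- each row now carries both word identity and position
```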

Ralph Dratman - 17.05.2023 20:28

What is meant by "making the evaluation much bigger"?
I do not understand "evaluation" in this context.

Nicolai Czempin - 16.05.2023 07:57

Attention is all you need?

No,

🎼🎶All You Need Is Love 🎶

Schuyler Haussmann - 15.05.2023 14:22

Any top AI expert that is not a member of the tribe?

Christopher Willis - 15.05.2023 08:29

When he says the transformer is a general-purpose differentiable computer, does he mean in the sense that it is Turing complete?

clapas - 04.05.2023 20:34

Transformers is not so good a movie

Michael Calmeyer Hentschel - 04.05.2023 07:08

Well, this clip was great until it ended abruptly mid-sentence. PLEASE! I signed up for clips, but I am frustrated to have missed the punchline here from Karpathy explaining how the GPT architecture has been held back for 5 years. I have SO little time to now scan the full version! But at least you linked it. Your sponsor has left me sleepless.

Mark Giroux - 27.04.2023 08:09

"Don't touch the transformer"... good advise regardless of what kind of transformer you're taking about.

Chen William - 23.04.2023 18:11

I don't think so 😕

Ariful Islam Leeton - 22.04.2023 03:58

Let me introduce myself: my name is Ariful Islam Leeton. I'm a software developer and an OpenAI developer.

Stanislaw Baranski - 10.04.2023 19:43

The way this guy thinks and speaks reminds me of Vitalik Buterin. What do they have in common? High intelligence is not the only factor here.

Jim Luebke - 10.04.2023 17:12

So instead of pure intuition -- convolutional neural nets -- we have moved to something more like consciousness.

Mark Counseling - 05.04.2023 00:25

If I understand correctly, transformers = good. N'est-ce pas?

Mark Counseling - 05.04.2023 00:22

What?

MrRaja - 05.04.2023 00:11

Andrej speaks at my level of speed. Lex is more concise, but each word is packed with knowledge. Speaking a bit more slowly lets the listener build a sort of neural net of relativity in their brain, all done automatically, limited only by knowledge and experience.

Nat Serrano - 01.04.2023 06:31

Lex's low IQ is insufferable, but at least he has the skill of getting good guests.

Emily Stewart - 28.03.2023 18:59

Lex Fridman, you seem bored and uninterested. Holding your head up with your hand. You have Andrej in front of you, be professional. ;-)

Jordan Kohler - 28.03.2023 06:15

I like your funny words magic man

WALLACE - 27.03.2023 09:41

Any guess why Vaswani is ignored?

Rohit Saxena - 22.03.2023 14:11

We as a field are stuck in this local maximum called Transformers (for now!)

douglas mennella - 05.03.2023 02:14

Transformers truly are more than meets the eye …

JetSpalt - 26.02.2023 18:28

What absolute rubbish!!
Transformers are either Autobots or Decepticons. I don’t know what he was talking about but get a clue dude!

moritzsur - 23.02.2023 01:52

I loved them as well but after the first 2 films it really got boring

cmilkau - 20.02.2023 18:19

The paper is called "Attention Is All You Need", and IMHO attention is what made transformers so successful, not its application in the transformer architecture.

Maxdoodledoo - 19.02.2023 04:11

...I am disappointed it's not about Optimus Prime

Maximilian Berkmann - 17.02.2023 23:57

Long time no see badmephisto!

oldtools - 17.02.2023 22:33

Question:
Without the optimization algorithms designing the hardware manufacturing, is there reason to believe that the fundamental nature of these mechanisms reflects the inherent medium of computation? Nope. I guess not.

They're watching it.

Andrea David Edelman - 16.02.2023 19:22

It’s self-attention ⚠️

Dan Car - 16.02.2023 19:09

and the greatest thing about it is that you do not need it

Bruno T. Siqueira - 15.02.2023 21:05

Saving that one to see if I can understand wth Andrej is talking about in a year.

Axel Anderson - 13.02.2023 13:27

Attention is truly all you need.

Lei Mococ - 12.02.2023 21:14

I had to check a few times that my video speed wasn't set to 1.5x

James Bowery - 12.02.2023 02:22

Recurrence is all you need.

Sankhadip - 07.02.2023 12:55

This makes me slightly freaked out that we don’t really understand what we’re developing….

Muslim - 04.02.2023 18:00

Does anyone have a simple definition for "general differentiable computer"?

La Chaine Sophro - 28.01.2023 15:08

Is this guy a robot...
