The Era of 1-bit LLMs by Microsoft | AI Paper Explained

AI Papers Academy

4 months ago

88,901 views


Comments:

@gabrielsandstedt
@gabrielsandstedt - 15.06.2024 14:57

How feasible is it to adapt BitNet b1.58's ternary quantization (-1, 0, 1) for quantum computing using qutrits, given the current state of qutrit-based hardware, error correction, and the development of specialized quantum algorithms?

@oryxchannel
@oryxchannel - 26.03.2024 18:00

The thinking around BitNet b1.58 is intimately tied to the .gif in the paper “Stanford engineers propose a simpler design for quantum computers.” See the short .gif in action. Funding for that research began prior to 2021. Funding was provided largely by the US Department of Defense. Guess who virtually IS the US military by virtue of having a $3T market cap to keep secret projects secret? That's right. Microsoft.

@arjavgarg5801
@arjavgarg5801 - 24.03.2024 10:32

Model weights will make a lot more sense

@burthacklin
@burthacklin - 21.03.2024 22:40

This is something I predicted would happen in AI. It's cool to see a concrete usage of it. Ternary computers are the most efficient computers and base 3 is the most efficient base, so this isn't surprising. Read up on radix economy to learn more.
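
A quick way to see the radix-economy argument (my own back-of-envelope sketch, not something from the paper): representing numbers up to N in base b costs roughly b * log_b(N) = (b / ln b) * ln N symbol-slots, and among integer bases the factor b / ln b is smallest for base 3, since the real-valued minimum sits at b = e ≈ 2.718.

```python
import math

# Base-dependent cost factor b / ln(b) from the radix-economy argument:
# smallest for b = 3 among integer bases (the real minimum is at b = e).
for b in (2, 3, 4, 5):
    print(b, round(b / math.log(b), 3))
# 2 2.885
# 3 2.731
# 4 2.885
# 5 3.107
```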

@anilaxsus6376
@anilaxsus6376 - 20.03.2024 19:52

But how is the accuracy?

@giacintoboccia9386
@giacintoboccia9386 - 15.03.2024 10:39

We had a lecture about single-bit neural networks in one of my uni courses, some 5 years ago. It was interesting.

@xianghaisheng7800
@xianghaisheng7800 - 12.03.2024 20:12

It's a bit difficult to understand your accent, probably because I'm not a native speaker. Would you consider using an AI-synthesized voice?

@hypervanse
@hypervanse - 12.03.2024 04:28

I wonder why people don't use this approach from the beginning. It's like LLMs in assembly language. And as far as I know, every linear operator has a kernel. The kernel means that a linear operator H always maps the zero vector to itself. When we use a computer, we represent the zero vector as a column matrix of n zeros. Since the layers of LLMs are in the same vector space, we have H\vec{0} = \vec{0} for any H. I apologize for my bad LaTeX, but \vec{0} is supposed to be a vector. It's important to remember that 0 is the trivial element in the kernel. For example, let Z be the set of all integers, and let H be the multiplication operator. Then, in ordinary algebra, we have positive, zero, and negative integers. The operator is \cdot, not x. The multiplication operator is often used in quantum mechanics of many particles, where the vector space grows exponentially, just like the number of bits for multiple objects.

@chodnejabko3553
@chodnejabko3553 - 10.03.2024 11:58

This might be even more advantageous when we get dedicated hardware, since tri-state logic is already a thing in CMOS. A dedicated tri-state matrix-multiplication architecture for this type of network should be easy to engineer with modern processes. NVIDIA should be all over that.
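
To illustrate the operation such hardware would accelerate (a minimal sketch of the general idea, not anything from the paper or NVIDIA): with weights restricted to {-1, 0, 1}, a matrix-vector product reduces to additions and subtractions, with no multiplications at all.

```python
import numpy as np

def ternary_matvec(W_q: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W_q @ x where W_q has entries in {-1, 0, 1}: adds/subtracts only."""
    y = np.zeros(W_q.shape[0], dtype=x.dtype)
    for i in range(W_q.shape[0]):
        row = W_q[i]
        y[i] = x[row == 1].sum() - x[row == -1].sum()   # zero weights are skipped
    return y

W_q = np.array([[1, 0, -1], [0, 1, 1]], dtype=np.int8)
x = np.array([0.5, -2.0, 3.0], dtype=np.float32)
print(ternary_matvec(W_q, x))    # matches W_q.astype(np.float32) @ x
```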

@Tohidkhan-lt4pd
@Tohidkhan-lt4pd - 10.03.2024 05:44

🎉😊❤

@adamhafchadi4924
@adamhafchadi4924 - 09.03.2024 19:46

what is that accent?

@ntal5859
@ntal5859 - 08.03.2024 10:17

So in summary, everything is either Yes = 1, Never mind = 0, No = -1. If only women were so simple to work out.

@pmarreck
@pmarreck - 08.03.2024 05:37

This is great! FYI, you can create a model of your voice in ElevenLabs, do a voice-to-voice transformation, and out would come perfectly pronounced English.
I found this out by accident because I created a model of Arnold Schwarzenegger's voice, and everything I made it say LOST the accent but kept his tone of voice, LOL

@JulianHarris
@JulianHarris - 06.03.2024 15:12

Great! Very helpful. One suggestion I'd make: the number of bits being fractional will be unfamiliar to many people. I think it would be useful to make it clear that yes, of course, in practice you need two bits to represent three states, and the 1.58 number is the theoretical Shannon entropy of a three-state symbol.
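
For reference, that figure is just the information content of a three-way choice (a standard identity, not something specific to the paper):

\log_2 3 = \frac{\ln 3}{\ln 2} \approx 1.585 \text{ bits per ternary weight}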

@erickweil4580
@erickweil4580 - 06.03.2024 06:51

Ok, but what is the theory on WHY it achieves the same performance? Maybe this shows that no one really understands how neural networks work, and we are giving them much more complicated machinery when they could just be some "quantised" states.

@ithaca2076
@ithaca2076 - 05.03.2024 17:01

I've made a few contributions to quaternary algebra; I discovered the inclusive and exclusive not-gates and am currently working on proofs for them.

The issue with ternary and quaternary at the moment is that current computers have to use numerous transistors per ternary or quaternary digit. Until we have a ternary or quaternary transistor, we may have to keep encoding them in bytes just like regular integers. I haven't seen any patents for a working one that isn't several times larger than a binary transistor, which makes going back to binary more efficient; of course it depends though.

I don't know what Microsoft is doing, but on top of this, running ternary requires at absolute minimum 2 binary bits per trit, meaning 2 physical data lines at best. Depending on how optimized everything is, from your language's compiler to what kinds of operations you're performing, it may use significantly more.

Running ternary on current hardware doesn't quite make practical sense when, for roughly the same number of data lines, you could be using quaternary.
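
To make the encoding overhead concrete, here is a minimal sketch (my own illustration, not from the paper or Microsoft) of packing ternary weights into ordinary bytes: 3^5 = 243 ≤ 256, so 5 trits fit in one byte, i.e. 1.6 bits per trit, close to the log2(3) ≈ 1.585-bit lower bound and tighter than the naive 2 bits per trit.

```python
def pack_trits(trits):
    """trits: list of values in {-1, 0, 1}; packs 5 trits per byte."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        chunk = trits[i:i + 5]
        val = 0
        for t in reversed(chunk):
            val = val * 3 + (t + 1)   # map {-1, 0, 1} -> {0, 1, 2}
        out.append(val)               # max value 242, fits in a byte
    return bytes(out)

def unpack_trits(data, n):
    """Inverse of pack_trits; n is the number of trits to recover."""
    trits = []
    for byte in data:
        val = byte
        for _ in range(5):
            trits.append(val % 3 - 1)
            val //= 3
    return trits[:n]

ws = [-1, 0, 1, 1, -1, 0, 0, 1]
assert unpack_trits(pack_trits(ws), len(ws)) == ws
```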

@gotachange
@gotachange - 05.03.2024 04:42

Why does it still work when it's quantized from float16 to -1, 0, 1? There could be countless numbers in float16 but only 3 numbers after quantization. I'm confused by this.😂
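
For what it's worth, here is a rough sketch of the absmean-style ternary quantization the paper describes (my paraphrase of the idea, not the authors' code). Note that the model is trained with this quantization in the loop from scratch, rather than converted from an already-trained float16 model, which is a big part of why the accuracy holds up.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Scale by the mean absolute value, round, and clip to {-1, 0, 1}."""
    gamma = np.abs(w).mean()                        # per-matrix scale
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_q.astype(np.int8), gamma               # ternary weights + scale

w = np.random.randn(4, 4).astype(np.float32)
w_q, gamma = ternary_quantize(w)
print(w_q)       # entries in {-1, 0, 1}
print(gamma)     # scale reused downstream, since w is roughly gamma * w_q
```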

@ArielTavori
@ArielTavori - 04.03.2024 19:07

So between this, Groq hardware, the Mojo language, and the Mamba architecture... how many of these are compatible and stack their benefits synergistically? And where they do stack, is the performance gain additive or multiplicative?

@Bigjuergo
@Bigjuergo - 04.03.2024 16:52

Is it possible to test this model with LLM studio?

@Dan-dy8zp
@Dan-dy8zp - 04.03.2024 14:44

The LLM has 0.1975 bytes. I don't think it's going to work.

@rayujohnson1302
@rayujohnson1302 - 04.03.2024 07:47

Technically they should call it a 2 bit LLM -- which has multiple meanings ;)

@AriaAlessandra
@AriaAlessandra - 04.03.2024 04:58

Does it mean every model can be quantized this way?

@GeorgeXian
@GeorgeXian - 04.03.2024 03:09

The one thing the paper neglects to mention, which should have been the biggest breakthrough of the 1-bit LLM, is that the VRAM required for training should be drastically less than for its full-fat 16-bit float counterpart. It should be possible to train the 70B 1-bit model on a single RTX 4090 - at present, the 70B model with any meaningful quantization cannot even be run on a single consumer GPU. I made a video on this subject last week.

At present the VRAM savings of current quantized LLMs are only apparent during inference, but what is more important is the democratization of LLM training. Lowering the barrier to training an LLM is a must to stop one company conquering the LLM space entirely.
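
For scale, a back-of-envelope calculation of weight storage alone (my own arithmetic; it ignores activations, optimizer state, and the KV cache, so it is only a lower bound on what training or even inference actually needs):

```python
params = 70e9                           # 70B parameters
print(params * 16 / 8 / 1e9)            # fp16 weights:        ~140 GB
print(params * 2 / 8 / 1e9)             # 2 bits per weight:   ~17.5 GB
print(params * 1.58 / 8 / 1e9)          # log2(3) lower bound: ~13.8 GB
```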

@karansarao2584
@karansarao2584 - 04.03.2024 00:48

Excellent explanations. This seems to be a comparison against Llama1 though; is there any confirmation that Llama2 models also perform similarly after quantization? I am curious to know if this works on later generations. Conceptually, Llama2 outperforms Llama1 at the same size (i.e. 7B vs 7B, 13B vs 13B), so in effect the same weights now hold more complexity than before; i.e., compression will work better when weights have more redundancy, compared to later versions where precision is more likely to be driving the performance differences.

@abdelkaioumbouaicha
@abdelkaioumbouaicha - 03.03.2024 22:48

📝 Summary of Key Points:

📌 The research paper discusses the era of 1-bit LLMs, focusing on reducing the size of large language models to address issues related to compute and memory resources, as well as environmental concerns.

🧐 The paper introduces the BitNet b1.58 model architecture, which uses ternary weights (-1, 0, 1) to reduce the number of bits required to represent the model, leading to improved efficiency without sacrificing performance.

🚀 Benefits of the BitNet b1.58 model include reduced memory usage, lower latency, and performance comparable to full-precision models, showcasing its potential for future applications in large language models.

💡 Additional Insights and Observations:

💬 "Quantization in machine learning refers to the process of reducing the precision of model weights to optimize memory usage and speed."
📊 The BitNet b1.58 model demonstrates significant improvements in memory usage, latency, and perplexity compared to existing models like LLaMA.
🌐 The research paper presents compelling evidence of the effectiveness of the BitNet b1.58 model through comparisons with established models and tasks.

📣 Concluding Remarks:

The era of 1-bit LLMs introduces innovative approaches to reducing the size of large language models, with the BitNet b1.58 model showing promising results in terms of efficiency and performance. This research opens up new possibilities for more accessible and environmentally friendly AI models in the future.
Generated using TalkBud

@mshonle
@mshonle - 03.03.2024 19:23

I wonder what the distribution is between the three values? It would be interesting if it was evenly 33.33%.

@dennou2012
@dennou2012 - 03.03.2024 18:17

What does the Pareto improvement mean? That it's the 20% giving 80% of the performance?

@JorgetePanete
@JorgetePanete - 03.03.2024 14:21

No code = No proof

@simplemanideas4719
@simplemanideas4719 - 03.03.2024 13:02

To summarize, BitNet was trained from scratch.
Therefore I cannot quantize an existing LLM to 1.58 bits?
Or is there a quantization approach for existing LLMs to 1.58 bits?

@michabbb
@michabbb - 03.03.2024 13:01

A lot of "trees" here....

@8eck
@8eck - 03.03.2024 12:02

Interesting how accuracy will be impacted in the end.

@NLPprompter
@NLPprompter - 03.03.2024 10:02

Well well well, doesn't this model seem like it would run best on a quantum computer? Please enlighten me.

@MrSur512
@MrSur512 - 03.03.2024 09:13

Do we have code, or at least something from the community?

@user-qr4jf4tv2x
@user-qr4jf4tv2x - 03.03.2024 08:15

I think running transformers on current CPUs/GPUs is a hardware problem in itself, because those chips like 1 and 0, and the 1-bit weights get reduced to 1s and 0s to fit the limitations of current CPUs/GPUs. A CPU/GPU built for transformers might work better.

@Dent42
@Dent42 - 03.03.2024 03:56

Why not call it what it is? A trit

@emiel2712
@emiel2712 - 02.03.2024 15:22

Wow this seems promising. I hope this will reproduce properly and work in other situations too. If it is truly better in general, new hardware could be so much more efficient

@TommyJefferson1801
@TommyJefferson1801 - 02.03.2024 11:20

1.58-bit, for correction.
