Comments:
How feasible is it to adapt BitNet b1.58's ternary quantization (-1, 0, 1) for quantum computing using qutrits, given the current state of qutrit-based hardware, error correction, and the development of specialized quantum algorithms?
The thinking around BitNet b1.58 is intimately tied to the .gif in the paper "Stanford engineers propose a simpler design for quantum computers." See the short .gif in action. Funding for that research began prior to 2021 and was provided largely by the US Department of Defense. Guess who virtually IS the US military, by virtue of having a $3T market cap to keep secret projects secret? That's right: Microsoft.
Model weights will make a lot more sense
This is something I predicted would happen in AI. It's cool to see a concrete use of it. Ternary computers are the most efficient computers and base 3 is the most efficient base, so this isn't surprising. Read up on radix economy to learn more.
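The radix-economy argument mentioned above can be sketched numerically: the cost of representing a number n in base b is roughly (digits needed) × (symbols per digit) = b·log_b(n), which is minimized near e ≈ 2.718, making 3 the most economical integer base. A minimal illustration (the function name and test value are my own choices):

```python
import math

def radix_economy(base: int, n: int) -> float:
    """Approximate cost of representing n in a given base:
    (number of digits) * (symbols per digit) = base * log_base(n)."""
    return base * math.log(n) / math.log(base)

n = 10**6
costs = {b: radix_economy(b, n) for b in (2, 3, 4, 10)}
best = min(costs, key=costs.get)
print("most economical integer base:", best)  # base 3
```

Note that bases 2 and 4 tie exactly (4/ln 4 = 2/ln 2), while base 3 beats both.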
But how is the accuracy?
We had a lecture about single-bit neural networks in one of my uni courses, some 5 years ago. It was interesting.
It's a bit difficult to understand your accent, probably because I'm not a native speaker. Have you considered using an AI-synthesized voice?
I wonder why people don't use this approach from the beginning. It's like LLMs in assembly language. And as far as I know, every linear operator has a kernel: a linear operator H always maps the zero vector to itself. On a computer, we represent the zero vector as a column matrix of n zeros. Since the layers of LLMs live in the same vector space, we have H\vec{0} = \vec{0} for any H. I apologize for my bad LaTeX, but \vec{0} is supposed to be a vector. It's important to remember that 0 is the trivial element of the kernel. For example, let Z be the set of all integers and let H be the multiplication operator; then, in ordinary algebra, we have positive, zero, and negative integers. The operator is \cdot, not x. The multiplication operator is often used in the quantum mechanics of many particles, where the vector space grows exponentially, just like the number of bits for multiple objects.
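The kernel fact in the comment above is easy to check numerically: linearity forces any matrix to send the zero vector to the zero vector. A minimal sketch, with a random matrix H standing in for a layer's weight matrix (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4))  # arbitrary linear operator
zero = np.zeros(4)

# Linearity forces H @ 0 = 0, so 0 is always in the kernel of H.
result = H @ zero
assert np.allclose(result, zero)
print("H @ 0 =", result)
```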
This might be even more of an advantage once we get dedicated hardware, since tri-state logic is already a thing in CMOS. A dedicated tri-state matrix multiplication architecture for this type of network should be easy to engineer with modern processes. NVIDIA should be all over that.
🎉😊❤
What is that accent?
So in summary, everything is either Yes = 1, Never mind = 0, No = -1. If only women were so simple to work out.
This is great! FYI, you can create a model of your voice in ElevenLabs, do a voice-to-voice transformation, and out would come perfectly pronounced English.
I found this out by accident: I created a model of Arnold Schwarzenegger's voice, but everything I made it say LOST the accent while keeping his tone of voice, LOL
Great! Very helpful. One suggestion I'd make: a fractional number of bits will be unfamiliar to many people. I think it would be useful to make it clear that yes, of course, in practice you need two bits to represent three states, and the 1.58 number is the theoretical Shannon entropy, log₂(3).
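To make the fractional bit count concrete: 1.58 is just log₂(3), the information content of a uniform three-valued symbol. And while a naive encoding needs 2 bits per weight, packing can get closer to the theoretical limit, since 3⁵ = 243 ternary values fit in one byte:

```python
import math

# Information content of one uniform ternary symbol.
bits_per_trit = math.log2(3)
print(f"{bits_per_trit:.4f} bits per trit")  # 1.5850

# Practical packing: 5 ternary weights fit in one byte (3**5 = 243 <= 256),
# giving 8/5 = 1.6 bits per weight instead of 2.
assert 3**5 <= 2**8
print("bits per weight when packing 5 trits/byte:", 8 / 5)
```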
Ok, but what is the theory on WHY it achieves the same performance? Maybe this shows that no one really understands how neural networks work, and we are giving them much more complicated steps when they could just be some "quantized" states.
I've made a few contributions to quaternary algebra; I discovered the inclusive and exclusive not-gate and am currently working on proofs for them.
The issue with ternary and quaternary at the moment is that current computers have to use numerous transistors per ternary or quaternary digit. Until we have a ternary or quaternary transistor, we may have to keep using bytes just like regular integers. I haven't seen any patents for a working one that isn't several times larger than a binary transistor, which makes staying with binary more efficient; of course it depends, though.
I don't know what Microsoft is doing, but on top of this, running ternary requires at absolute minimum 2 binary bits, meaning 2 physical data lines at best. Depending on how optimized everything is, from your language's compiler to the kinds of operations you're performing, it may use significantly more.
Running ternary on current hardware doesn't quite make practical sense when, for roughly the same number of data lines, you could be using quaternary.
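On the 2-bits-per-trit point above: a naive encoding does burn 2 bits per ternary value on binary hardware, but base-3 packing can store 5 trits per byte (1.6 bits/trit), close to the 1.58-bit limit. A hypothetical pack/unpack sketch (the function names and layout are my own, not from the paper):

```python
def pack_trits(trits):
    """Pack a list of trits (-1, 0, 1) into bytes, 5 trits per byte (3**5 = 243)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        value = 0
        for t in reversed(trits[i:i + 5]):
            value = value * 3 + (t + 1)  # map -1/0/1 -> base-3 digit 0/1/2
        out.append(value)
    return bytes(out)

def unpack_trits(data, n):
    """Recover the first n trits from packed bytes."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]

weights = [1, -1, 0, 0, 1, -1, 1]
packed = pack_trits(weights)
assert unpack_trits(packed, len(weights)) == weights
print(len(packed), "bytes for", len(weights), "trits")  # 2 bytes for 7 trits
```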
Why does it still work when it's quantized from float16 to -1, 0, 1? There could be countless numbers in float16 but only 3 values after quantization. I'm confused by this. 😂
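Part of the answer to the question above is that the ternary weights are not used in isolation: each weight matrix keeps a full-precision scale, and the model is trained from scratch under this constraint rather than rounded after the fact. A sketch of absmean-style ternary quantization along the lines of what the paper describes (treat the details as an assumption, not the exact implementation):

```python
import numpy as np

def ternarize(W, eps=1e-6):
    """Absmean quantization: scale by the mean |w|, then round to {-1, 0, 1}.
    The float scale gamma is kept alongside the ternary weights, so the
    dequantized approximation is gamma * W_q."""
    gamma = np.abs(W).mean()
    W_q = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_q, gamma

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)).astype(np.float32)
W_q, gamma = ternarize(W)

assert set(np.unique(W_q)).issubset({-1.0, 0.0, 1.0})
print("per-tensor scale:", gamma)
```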
So between this, Groq hardware, the Mojo language, and the Mamba architecture... how many of these are compatible and stack their benefits synergistically? And where they stack, is the performance gain additive or multiplicative?
Possible to test this model with LM Studio?
The LLM has 0.1975 bytes. I don't think it's going to work.
Technically they should call it a 2-bit LLM -- which has multiple meanings ;)
Does this mean every model can be quantized this way?
The one thing the paper neglects to mention, which should have been the biggest breakthrough of the 1-bit LLM, is that the VRAM required for training should be drastically less than for its full-fat 16-bit float counterpart. It should be possible to train the 70B 1-bit model on a single RTX 4090; at present, the 70B model with any meaningful quantization cannot even be run on a single consumer GPU. I made a video on this subject last week.
At present the VRAM savings of current quantized LLMs only appear during inference, but what matters more is the democratization of LLM training. Lowering the barrier to training an LLM is a must to stop one company from conquering the LLM space entirely.
Excellent explanations. This seems to be a comparison against LLaMA-1, though; is there any confirmation that LLaMA-2 models also perform similarly after quantization? I am curious whether this works on later generations. Conceptually, LLaMA-2 outperforms LLaMA-1 at the same size (i.e., 7B vs 7B, 13B vs 13B), so in effect the same weights now hold more complexity than before. Compression works better when weights have more redundancy; in later versions, precision is more likely to be driving the performance differences.
📝 Summary of Key Points:
📌 The research paper discusses the era of 1-bit LLMs, focusing on reducing the size of large language models to address compute and memory constraints as well as environmental concerns.
🧐 The BitNet b1.58 model architecture uses ternary weights (-1, 0, 1) to reduce the number of bits required to represent the model, improving efficiency without sacrificing performance.
🚀 Benefits of the BitNet b1.58 model include reduced memory usage, lower latency, and performance comparable to full-precision models, showcasing its potential for future large language models.
💡 Additional Insights and Observations:
💬 "Quantization in machine learning refers to the process of reducing the precision of model weights to optimize memory usage and speed."
📊 The BitNet b1.58 model demonstrates significant improvements in memory usage, latency, and perplexity compared to existing models like LLaMA.
🌐 The research paper presents compelling evidence of the BitNet b1.58 model's effectiveness through comparisons with established models and tasks.
📣 Concluding Remarks:
The era of 1-bit LLMs introduces innovative approaches to reducing the size of large language models, with the BitNet b1.58 model showing promising results in efficiency and performance. This research opens up new possibilities for more accessible and environmentally friendly AI models in the future.
Generated using TalkBud
I wonder what the distribution is across the three values? It would be interesting if it were an even 33.3% each.
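For Gaussian-initialized weights under absmean-style scaling (a hypothetical stand-in for trained weights, since the paper's actual weight statistics aren't given here), the split turns out to be close to even, with slightly more ±1 than 0:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal(1_000_000)
gamma = np.abs(W).mean()                     # ~0.798 for a standard normal
W_q = np.clip(np.round(W / gamma), -1, 1)    # absmean-style ternarization

values, counts = np.unique(W_q, return_counts=True)
for v, c in zip(values, counts):
    print(int(v), f"{c / W_q.size:.1%}")
```

For this synthetic case each value lands roughly in the 30-35% range; real trained weights could of course be skewed differently.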
What does "Pareto improvement" mean? That it's the 20% giving 80% of the performance?
No code = no proof
To summarize: BitNet was trained from scratch.
Therefore I cannot quantize an existing LLM to 1.58 bits?
Or is there a quantization approach that takes existing LLMs down to 1.58 bits?
A lot of "trees" here....
Interesting how accuracy will be impacted in the end.
Well well well, doesn't it seem like this model would run best on a quantum computer? Please enlighten me.
Do we have code, or at least something from the community?
I think running transformers on current CPUs/GPUs is itself the problem: when the 1-bit weights reduce to values like 1 and 0, they still have to fit the limitations of current CPUs/GPUs. A CPU/GPU built for transformers might work better.
Why not call it what it is? A trit.
Wow, this seems promising. I hope this reproduces properly and works in other situations too. If it is truly better in general, new hardware could be so much more efficient.
1.58 bit for correction