Comments:
Alex, while I appreciate the use of the SSD-only file server, couldn't you have direct-attached it to the MacBook Pro and done file sharing over the Thunderbolt Bridge, pointing your cache "to the same place"? Mac file sharing over Thunderbolt would seem to be more efficient and would eliminate the Wi-Fi-to-storage bottleneck. Just wondering if there's a measurable impact.
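(For anyone wanting to try this: a minimal sketch of the idea in Python, assuming Ollama-style storage where the documented OLLAMA_MODELS variable picks the model directory; exo's own cache location may differ. The share name and paths are hypothetical, not from the video.)

    # Mount the folder shared by the storage-owning Mac over the Thunderbolt
    # Bridge, then point each node's model directory at the mount.
    import os
    import subprocess

    subprocess.run(["open", "smb://studio.local/Models"])  # Finder mounts it under /Volumes

    env = os.environ.copy()
    env["OLLAMA_MODELS"] = "/Volumes/Models"  # every node reads the same model files
    subprocess.run(["ollama", "serve"], env=env)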
Sweet!
Didn't understand the point of this video. If you already have a 64GB laptop, why use the others to run LLMs? Even with a shared NAS, it will run on that 64GB one only. Why would anyone have multiple laptops lying around?
Couldn't you create an SMB share on one Mac and then point the other MacBooks to it? Over Thunderbolt, loading the model should be even faster.
This is awesome, can't wait for the M4 Mac mini LLM review! Could you consider a video about the elephant in the room: multiple 8GB GPUs clustered together to run a large model? There are millions of 8GB GPUs that are stuck running quantized versions of 7B models or just sitting underutilized.
So the endgame is to get a couple of base M4 Mac minis?
Waiting ⏳ for the M4 Max
I remember render times in Final Cut & Compressor across many machines: the same problems :)
Who else is still waiting for the M4 machine review?
They're very short on basic documentation. Any idea how I can manually add LLMs to exo so that they appear in tinychat? Maybe you could do a video about it?
I have a question regarding the installation of SQL Server Management Studio (SSMS) on a Mac. Specifically, I would like to know if it is feasible to install SSMS within a Windows 11 virtual machine using Parallels Desktop, and whether I would be able to connect this installation to an SQL Server that is running on the host macOS. Are there any specific configurations or steps I should be aware of to ensure a successful connection between SSMS and the SQL Server on macOS? Thank you!
All that work just to get a "word salad hallucinator"... ;)
Why is the inference performance different per machine? Are they sharing the GPU cores too, or just the VRAM? Because based on the output you're getting, the VRAM bandwidth is around 300-400 GB/s.
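(A back-of-envelope behind that kind of estimate, with illustrative numbers: single-stream decoding is roughly memory-bandwidth-bound, so each generated token streams approximately the whole model through memory once.)

    # Implied bandwidth ≈ tokens/s * model size, since each decoded token
    # reads (roughly) all the weights once. Both inputs are assumptions,
    # not measurements from the video.
    model_size_gb = 40.0      # e.g. a ~70B model at ~4-bit quantization
    tokens_per_second = 9.0   # observed generation speed
    print(f"Implied bandwidth: ~{model_size_gb * tokens_per_second:.0f} GB/s")  # ~360 GB/s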
This is a nice POC. Another great video from Alex.
I would definitely prefer having a 10GbE switch with everything connected to it (there are 8-port ones for around $300). More stable for actual work, and probably less messy.
Maybe getting a mini PC with a 10GbE port and a decent amount of memory? It's a shame Apple charges such a big tax on memory and storage upgrades.
There is also an Asustor 10GbE SSD-only NAS with 12 SSD slots (the Flashstor 12 Pro).
Please compare 4x base Mac mini with 1x 4090 :D
Hey Alex! Your videos are great!
I’m considering getting a MacBook Pro but not sure which model would be best for my needs. I’m a data science and machine learning student, so I’ll mostly use it for coding, data analysis, and some AI projects. Since I’m still in learning mode and not yet working professionally, I’m unsure if I need the latest high-end model. Any recommendations on which model or specs would best fit my use case? Thanks in advance!
CAN’T WAIT FOR YOUR M4 video
Get Mac mini Pros; you'll have a 120Gb/s Thunderbolt connection for the cluster.
Dear Alex, I follow your channel for the language models, specifically on the MacBook Pro with Apple silicon. Congratulations on your very precise and detailed content.
I have a question.
Can a Llama 3.1 70B Q5_0 model weighing 49GB damage a MacBook Pro 16 M2 Max with 64GB of RAM?
I ran two models on the MacBook (Mixtral 8x7B Q4_0, 26GB, and Llama 3.1 70B Q5_0, 49GB).
When the 26GB one was running, the responses were fluid and quiet, and the memory monitor looked "good": a certain amount free and no pressure. When I ran the 49GB one (Llama 3.1 70B Q5_0), it was not as fluid, the Mac made an internal noise synchronized with the rhythm of each word the model answered, and the memory monitor showed memory pressure.
So far so good, just that detail. The problem came when I decided to reset the MacBook with a clean install of the operating system: I erased the disk from Disk Utility (as Apple instructs), exited Disk Utility, and clicked Install macOS Sonoma. The installation began, showed about 3 hours remaining, and everything started well. After about 6 minutes of installation, the screen image degraded into poor quality and faded out in areas (from bottom to top) until it disappeared; green lines and dots were visible in it too. All of this happened in a second. It never showed an image again, only a black screen. You could tell the MacBook was on only by the keyboard backlight, and with the office lights off you could see a very faint white glow in the center of the screen. I connected a display over HDMI, but it showed nothing either, just a black screen.
It looks like the video hardware. Do you think the memory pressure from the heavier model could have overloaded the MacBook Pro? Or do you think it was just bad luck and has nothing to do with language models?
I ran the models with Ollama and downloaded them from the same site.
Thank you very much for reading,
Greetings
Suggestion: base M3 Pro vs base M4 Pro, for developers, with XcodeBenchmark.
What about the Snapdragon X Elite chips?
Alex wth, u r crazy!!!
This is impressive. If this were for a real-world use case, I'd implement these optimizations:
- Don't use the NAS, since it introduces a single point of failure and is much slower than directly attached storage. For best performance, the internal SSDs are your best choice; storing the model on each computer is fine. This is called "shared nothing".
- Use identical computers, both so that slower machines don't drag down the whole cluster (my hypothesis; you would need to measure it with Activity Monitor) and so that your results stay comparable.
- Measure the network traffic. Use a network switch (or better, two together with Ethernet bonding for redundancy and more throughput) so that you can add an arbitrary number of computers to your setup.
- Measure how well your model scales out. If you have three computers and add a fourth, you would expect to get one third more tokens per second. The increase you actually get, relative to the computing power you added, defines your scale-out efficiency (see the sketch after this list).
- Now you have a perfect cluster where you can remove any component without breaking the whole thing: whichever component you remove, the rest would still function.
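(A minimal sketch of that scale-out measurement; all numbers below are hypothetical.)

    # Scale-out efficiency: the actual throughput gain divided by the ideal
    # (linear) gain from the nodes you added. Inputs are hypothetical.
    def scale_out_efficiency(tps_before, tps_after, nodes_before, nodes_after):
        ideal_gain = tps_before * (nodes_after - nodes_before) / nodes_before
        actual_gain = tps_after - tps_before
        return actual_gain / ideal_gain

    # Going from 3 to 4 machines: linear scaling would add one third more tokens/s.
    print(f"{scale_out_efficiency(9.0, 10.8, 3, 4):.0%}")  # -> 60%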
I saw someone connect 4 minis and run an LLM.
Rust compile-time comparison of the M4 vs older M-series, please.
Finally a good exo explanation. Thanks, Alex!
Actually, a NAS is not required. Networking the machines via Thunderbolt cables first, and then assigning internal or external drives (or Thunderbolt DAS) as the LLM sources, should be faster.
If you are buying a MacBook, make sure it has the larger storage.
Alex, this is an interesting setup. I would like to see more of your results when clustering these machines together to run various LLM workloads, especially the larger models.
Between an M3 Max with a 30-core GPU and 36GB RAM and an M4 Pro with 48GB RAM, which one should I choose?
Could you please verify whether three computers can connect to the TerraMaster F8 SSD Plus, one on each of its USB ports?
Hmm, so I guess one more question this brings up: is it better to go for one M4 Pro with 48GB of RAM, or two M4s with 24GB each, to run local LLMs? It would be the same price.
Question: can the nodes run different OSes? 😅
Can we run the biggest Llama model, the 405B Llama 3.1, on this Apple silicon cluster?
I am having the same problem. How do you set Ollama to save models to the SSD?
Can you please show how you moved it to the SSD, i.e. transferring Llama to external storage?
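(Not from the video, but one documented route with Ollama: the OLLAMA_MODELS environment variable sets where models are stored. A minimal sketch, assuming the SSD mounts at /Volumes/MySSD.)

    # Point Ollama's model storage at an external SSD before starting the server.
    import os
    import subprocess

    models_dir = "/Volumes/MySSD/ollama-models"  # hypothetical mount point
    os.makedirs(models_dir, exist_ok=True)

    env = os.environ.copy()
    env["OLLAMA_MODELS"] = models_dir
    subprocess.run(["ollama", "serve"], env=env)  # already-pulled models must be copied or re-pulled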
How does this actually work? You're not actually sharing the compute power, right? Does it basically determine which computer to send the query to, and then that computer shares the result with the one you're working on? Would combining three of the same computer be beneficial, or just redundant?
Thunderbolt 5, 3x M4 with 64GB RAM each: is that going to be a 192GB GPU-memory cluster?
Oh my, you’re killing it 😮 Great job!
Please test with various context windows (8k/32k/128k) and especially with longer prompts (>1000 tokens).
Hi, this is very helpful. I'm curious whether you could run LLM benchmarks on the various M4 models you have and see if an increase in GPU core count makes a difference, and if so, how much of a difference.
As a follow-on to your presentation today: what if I want to run Llama 3.1 70B, or even 405B, on a distributed computing setup?
Has anybody tried this setup for LLMs? Which would do better at LLM processing (training, inference, RAG, etc.), and would it run Llama 3.1 70B:
2x base M4 mini with 32GB RAM and a 256GB SSD each, Thunderbolt 4 linked with the load distributed, vs 1x M4 Pro with 64GB RAM and a 512GB SSD? This I want to see if you can pull off; very curious about the effectiveness of a small cluster vs an all-in-one system.
Can you try MacMini M4 Pro cluster?😂
With thunder bolt 5
Can you make a video with 4 X the 16Gb with the new mac mini in an cluster, or even better 4X 64GB to make a 256 of Vram 🙂
Is it possible to do this with Windows laptops?
What is the use case for running your own LLM?
If I have this kind of cluster set up, how do I access the cluster from my main machine that is not part of the cluster?
Can you use this cluster model with multiple Macs for any program? LightWave? Final Cut Pro? Basically, would I have a supercomputer for everything? Or does exo only help you run LLMs?
Alex, how did you change the default ports? Mine keeps coming up on 52415 no matter what flags I give it at launch.