Comments:
Alex, while I appreciate the use of the SSD-only file server, couldn't you have direct-attached it to the MacBook Pro and done file sharing over the Thunderbolt Bridge, pointing your cache "to the same place"? Mac file sharing over Thunderbolt would seem to be more efficient and would eliminate the Wi-Fi-to-storage bottleneck. Just wondering if there's a measurable impact.
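(For anyone wanting to try this: a minimal sketch of the idea in Python, assuming Ollama-style storage where the documented OLLAMA_MODELS variable picks the model directory; exo's own cache location may differ. The share name and paths are hypothetical, not from the video.)

    # Mount the folder shared by the storage-owning Mac over the Thunderbolt
    # Bridge, then point each node's model directory at the mount.
    import os
    import subprocess

    subprocess.run(["open", "smb://studio.local/Models"])  # Finder mounts it under /Volumes

    env = os.environ.copy()
    env["OLLAMA_MODELS"] = "/Volumes/Models"  # every node reads the same model files
    subprocess.run(["ollama", "serve"], env=env)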
Sweet!
Didn't understand the point of this video. If you already have a 64GB laptop, why use the others to run LLMs? Even with a shared NAS, it will run on that 64GB one only. Why would anyone have multiple laptops lying around?
Couldn't you create an SMB share on one Mac and then point the other MacBooks to it? Over Thunderbolt, loading the model should be even faster.
This is awesome, can't wait for the M4 Mac mini LLM review! Could you consider a video about the elephant in the room: multiple 8GB GPUs clustered together to run a large model? There are millions of 8GB GPUs that are stuck running quantized versions of 7B models or just sitting underutilized.
So the endgame is to get a couple of base M4 Mac minis?
Waiting ⏳ for the M4 Max
I remember render times in Final Cut & Compressor across many machines: the same problems :)
Who else is still waiting for the M4 machine review?
They're very short on basic documentation. Any idea how I can manually add LLMs to exo so that they appear in tinychat? Maybe you could do a video about it?
I have a question regarding the installation of SQL Server Management Studio (SSMS) on a Mac. Specifically, I would like to know if it is feasible to install SSMS within a Windows 11 virtual machine using Parallels Desktop, and whether I would be able to connect this installation to an SQL Server that is running on the host macOS. Are there any specific configurations or steps I should be aware of to ensure a successful connection between SSMS and the SQL Server on macOS? Thank you!
All that work just to get a "word salad hallucinator"... ;)
Why is the inference performance different per machine? Are they sharing the GPU cores too, or just the VRAM? Because based on the output you're getting, the VRAM bandwidth is around 300-400 GB/s.
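(A back-of-envelope behind that kind of estimate, with illustrative numbers: single-stream decoding is roughly memory-bandwidth-bound, so each generated token streams approximately the whole model through memory once.)

    # Implied bandwidth ≈ tokens/s * model size, since each decoded token
    # reads (roughly) all the weights once. Both inputs are assumptions,
    # not measurements from the video.
    model_size_gb = 40.0      # e.g. a ~70B model at ~4-bit quantization
    tokens_per_second = 9.0   # observed generation speed
    print(f"Implied bandwidth: ~{model_size_gb * tokens_per_second:.0f} GB/s")  # ~360 GB/s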
This is a nice POC. Another great video from Alex.
I would definitely prefer having a 10GbE switch with everything connected to it (there are 8-port ones for around $300). More stable for actual work, and probably less messy.
Maybe getting a mini PC with a 10GbE port and a decent amount of memory? It's a shame Apple charges such a big tax on memory and storage upgrades.
There is also an Asustor 10GbE SSD-only NAS with 12 SSD slots (the Flashstor 12 Pro).
Please compare 4x base Mac mini with 1x 4090 :D
Hey Alex! Your videos are great!
I’m considering getting a MacBook Pro but not sure which model would be best for my needs. I’m a data science and machine learning student, so I’ll mostly use it for coding, data analysis, and some AI projects. Since I’m still in learning mode and not yet working professionally, I’m unsure if I need the latest high-end model. Any recommendations on which model or specs would best fit my use case? Thanks in advance!
CAN’T WAIT FOR YOUR M4 video
Get Mac mini Pros; you'll have a 120Gb/s Thunderbolt connection for the cluster.
Dear Alex, I follow your channel for the language models, specifically on the MacBook Pro with Apple silicon. Congratulations on your very precise and detailed content.
I have a question.
Can a Llama 3.1 70B Q5_0 model weighing 49GB damage a MacBook Pro 16 M2 Max with 64GB of RAM?
I ran two models on the MacBook (Mixtral 8x7B Q4_0, 26GB, and Llama 3.1 70B Q5_0, 49GB).
When the 26GB one was running, the responses were fluid and quiet, and the memory monitor looked "good": a certain amount free and no pressure. When I ran the 49GB one (Llama 3.1 70B Q5_0), it was not as fluid, the Mac made an internal noise synchronized with the rhythm of each word the model answered, and the memory monitor showed memory pressure.
So far so good, just that detail. The problem came when I decided to reset the MacBook with a clean install of the operating system: I erased the disk from Disk Utility (as Apple instructs), exited Disk Utility, and clicked Install macOS Sonoma. The installation began, showed about 3 hours remaining, and everything started well. After about 6 minutes of installation, the screen image degraded into poor quality and faded out in areas (from bottom to top) until it disappeared; green lines and dots were visible in it too. All of this happened in a second. It never showed an image again, only a black screen. You could tell the MacBook was on only by the keyboard backlight, and with the office lights off you could see a very faint white glow in the center of the screen. I connected a display over HDMI, but it showed nothing either, just a black screen.
It looks like the video hardware. Do you think the memory pressure from the heavier model could have overloaded the MacBook Pro? Or do you think it was just bad luck and has nothing to do with language models?
I ran the models with Ollama and downloaded them from the same site.
Thank you very much for reading,
Greetings
Suggestion: base M3 Pro vs base M4 Pro, for developers, with XcodeBenchmark.
What about the Snapdragon X Elite chips?
Alex wth, u r crazy!!!
This is impressive. If this were for a real-world use case, I'd implement these optimizations:
- Don't use the NAS, since it introduces a single point of failure and is much slower than directly attached storage. For best performance, the internal SSDs are your best choice; storing the model on each computer is fine. This is called "shared nothing".
- Use identical computers, both so that slower machines don't drag down the whole cluster (my hypothesis; you would need to measure it with Activity Monitor) and so that your results stay comparable.
- Measure the network traffic. Use a network switch (or better, two together with Ethernet bonding for redundancy and more throughput) so that you can add an arbitrary number of computers to your setup.
- Measure how well your model scales out. If you have three computers and add a fourth, you would expect to get one third more tokens per second. The increase you actually get, relative to the computing power you added, defines your scale-out efficiency (see the sketch after this list).
- Now you have a perfect cluster where you can remove any component without breaking the whole thing: whichever component you remove, the rest would still function.
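(A minimal sketch of that scale-out measurement; all numbers below are hypothetical.)

    # Scale-out efficiency: the actual throughput gain divided by the ideal
    # (linear) gain from the nodes you added. Inputs are hypothetical.
    def scale_out_efficiency(tps_before, tps_after, nodes_before, nodes_after):
        ideal_gain = tps_before * (nodes_after - nodes_before) / nodes_before
        actual_gain = tps_after - tps_before
        return actual_gain / ideal_gain

    # Going from 3 to 4 machines: linear scaling would add one third more tokens/s.
    print(f"{scale_out_efficiency(9.0, 10.8, 3, 4):.0%}")  # -> 60%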
I saw someone connect 4 minis and run an LLM.
Rust compile-time comparison of the M4 vs older M-series, please.
Finally a good exo explanation. Thanks, Alex!
Actually, a NAS is not required. Networking the machines via Thunderbolt cables first, and then assigning internal or external drives (or Thunderbolt DAS) as the LLM sources, should be faster.
If you are buying a MacBook, make sure it has the larger storage.
Alex, this is an interesting setup. I would like to see more of your results when clustering these machines together to run various LLM workloads, especially the larger models.
Between an M3 Max with a 30-core GPU and 36GB RAM and an M4 Pro with 48GB RAM, which one should I choose?
Could you please verify whether three computers can connect to the TerraMaster F8 SSD Plus, one on each of its USB ports?
Hmm, so I guess one more question this brings up: is it better to go for one M4 Pro with 48GB of RAM, or two M4s with 24GB each, to run local LLMs? It would be the same price.
Question: can the nodes run different OSes? 😅
Can we run the biggest Llama model, the 405B Llama 3.1, on this Apple silicon cluster?
I am having the same problem. How do you set Ollama to save models to the SSD?
Can you please show how you moved it to the SSD, i.e. transferring Llama to external storage?
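(Not from the video, but one documented route with Ollama: the OLLAMA_MODELS environment variable sets where models are stored. A minimal sketch, assuming the SSD mounts at /Volumes/MySSD.)

    # Point Ollama's model storage at an external SSD before starting the server.
    import os
    import subprocess

    models_dir = "/Volumes/MySSD/ollama-models"  # hypothetical mount point
    os.makedirs(models_dir, exist_ok=True)

    env = os.environ.copy()
    env["OLLAMA_MODELS"] = models_dir
    subprocess.run(["ollama", "serve"], env=env)  # already-pulled models must be copied or re-pulled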
How does this actually work? You're not actually sharing the compute power, right? Does it basically determine which computer to send the query to, and then that computer shares the result with the one you're working on? Would combining three of the same computer be beneficial, or just redundant?
Thunderbolt 5, 3x M4 with 64GB RAM each: is that going to be a 192GB GPU-memory cluster?
Oh my, you’re killing it 😮 Great job!
Please test with various context windows (8k/32k/128k) and especially with longer prompts (>1000 tokens).
Hi, this is very helpful. I'm curious whether you could run LLM benchmarks on the various M4 models you have and see if an increase in GPU core count makes a difference, and if so, how much of a difference.
As a follow-on to your presentation today: what if I want to run Llama 3.1 70B, or even 405B, on a distributed computing setup?
Has anybody tried this setup for LLMs? Which would do better at LLM processing (training, inference, RAG, etc.), and would it run Llama 3.1 70B:
2x base M4 mini with 32GB RAM and a 256GB SSD each, Thunderbolt 4 linked with the load distributed, vs 1x M4 Pro with 64GB RAM and a 512GB SSD? This I want to see if you can pull off; very curious about the effectiveness of a small cluster vs an all-in-one system.
Can you try MacMini M4 Pro cluster?😂
With thunder bolt 5
Can you make a video with 4 X the 16Gb with the new mac mini in an cluster, or even better 4X 64GB to make a 256 of Vram 🙂
Is it possible to do this with Windows laptops?
What is the use case for running your own LLM?
If I have this kind of cluster set up, how do I access the cluster from my main machine that is not part of the cluster?
Can you use this cluster model with multiple Macs for any program? LightWave? Final Cut Pro? Basically, would I have a supercomputer for everything? Or does exo only help you run LLMs?
Alex, how did you change the default ports? Mine keeps coming up on 52415 no matter what flags I give it at launch.