Thank you for posting :)
It looks like a single M1 Pro with 32GB of RAM should be expected to perform in these ballparks:
7B - 8B models: ~39 tokens per second (4-bit quantization)
14B - 20B: ~13–19 tokens per second
30B - 32B: ~9 tokens per second (8-bit quantization would likely push past the 32GB RAM limit, to the point where you could see crashes)
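A quick back-of-the-envelope sketch of why those 32GB limits bite. The formula (params × bits / 8, plus ~20% overhead for KV cache, activations, and runtime) is a loose rule of thumb I'm assuming here, not an exact measurement:

```python
# Rough memory estimate for a quantized model on unified memory.
# Assumption: weights ~= params * bits / 8 bytes, plus ~20% overhead
# for KV cache, activations, and runtime. A rule of thumb, not exact.

def model_memory_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    weights_gb = params_billion * bits / 8  # 1B params at 8-bit ~= 1 GB
    return weights_gb * overhead

for params, bits in [(8, 4), (32, 4), (32, 8)]:
    print(f"{params}B @ {bits}-bit: ~{model_memory_gb(params, bits):.1f} GB")
# 8B @ 4-bit lands around ~5 GB, 32B @ 4-bit around ~19 GB (fits),
# but 32B @ 8-bit comes out near ~38 GB - over the 32GB ceiling.
```

That's roughly why 4-bit 32B models run but 8-bit ones flirt with crashes on a 32GB machine.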
From what I've seen, when clustering Macs, the Exo framework provides the best performance. Be sure to connect with Thunderbolt cables, and note that the M1 Pro's Thunderbolt 4 ports are limited to 40Gb/s. I'd expect you might be able to run a 70B-parameter model at 4-bit or 5-bit on two M1 Pros with 32GB each, but performance will likely be pretty darn slow.
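Same kind of napkin math for the two-Mac case. I'm assuming an Exo-style split puts about half the model on each machine, and budgeting ~24GB of usable unified memory per 32GB Mac since macOS itself needs several GB (both numbers are my guesses, tune to taste):

```python
# Does a 70B model fit across two 32GB M1 Pros?
# Assumptions: ~24 GB usable unified memory per Mac (OS takes the rest),
# ~20% runtime overhead on top of the raw quantized weights.

USABLE_GB_PER_MAC = 24
MACS = 2

def fits(params_billion: float, bits: int, overhead: float = 1.2) -> bool:
    total_gb = params_billion * bits / 8 * overhead
    return total_gb <= USABLE_GB_PER_MAC * MACS

print(fits(70, 4))  # ~42 GB vs a ~48 GB budget -> True
print(fits(70, 5))  # ~52.5 GB vs ~48 GB -> False, 5-bit is a squeeze
```

So 4-bit should fit with room to spare, while 5-bit is marginal under these assumptions, which matches the "might be able to" above.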
Although not the same as what you're considering, Alex Ziskind did cluster a bunch of different Mac laptops last year:
https://www.youtube.com/watch?v=uuRkRmM9XMc
I'm really betting on better small models that run on 16-24GB VRAM GPUs this year, especially for coding and agentic tasks. That would make a single M1 Pro, or any of the inexpensive 3080 Ti mobile and similar machines, potentially useful. (I got one of those 16GB VRAM RTX laptops used for just over $800, and that machine has been very impressive - it even ran Nemotron Super!)
Seeing how much qwen 3.6 improved over previous models is really encouraging. I'm going to put it through its paces on a little RTX laptop over the next few weeks, especially for coding and agentic tasks.
For little models like qwen 3.6, a single laptop with an RTX GPU will likely be faster than the M1 Pro, but the M1 Pro has the benefit of Thunderbolt network speeds, so clustering with it is faster out of the box - unless you spend a lot on network hardware for the RTX machines. I'd like to try Exo with a couple of RTX machines, using the best networking hardware that's reasonable to set up with a pair of them...