Here are some notes from the old rebolforum related to the ASUS ROG Flow Z13, in comparison to other machines I have with little consumer Nvidia GPUs:
Nick — 2026-03-19 23:15:39
I've got a couple of laptops with RTX 3080 Ti mobile GPUs (16 GB VRAM), which run all the commonly used smaller local models, such as GPT-OSS:20b, just as well as my tower with an RTX 3090. My real current favorite machine, though, is a little Strix Halo laptop with 128 GB of shared RAM. There are piles of these ASUS ROG Flow Z13s available for around $2500 (new, even with the current RAM shortage prices!). There's no machine close to this price range which can run such big models (for example, dense models in the 70 billion parameter range, and MoE models in the 120 billion range). You can bring one of these little laptops on an airplane, or use it while camping, or anywhere without Internet access, to do some serious inference work. Even a model like Qwen3.5-122B-A10B gets 20+ tokens per second on this machine, in low power mode. That's an amazing amount of power for far less money than anything else!

I think they've gotten a bad rap because when they first came out the software drivers weren't ready for mainstream use, and people had problems, so now there are tons of bad reviews online. But that's not the current situation. Get it out of the box, install any of the common inference apps (LM Studio, Ollama, Jan, Koboldcpp, etc.), download some models, and it just works - at about 1/4 the price of a single RTX 6000 Pro GPU alone (not a full server machine, just the GPU), for the same amount of GPU memory. You can buy small form factor Strix Halo desktop units, but those are getting to be more expensive than the ASUS ROG Flow Z13 - without a monitor, keyboard, mouse, etc., and not a mobile laptop unit, which is how this really shines (real disconnected inference capability while you travel away from the Internet).

Nick — 2026-03-21 06:32:26
Compared to any of my machines with Nvidia GPUs, the Strix Halo runs small models more slowly. For example, qwen3.5:35b-a3b at 4-bit runs at ~70 tokens per second on my RTX 3090, and ~50 tokens per second on the Strix Halo.
My Strix Halo laptop is also a bit slower to load a new model, but I think that's because the ASUS ROG Flow Z13 likely has a slower drive (the ROG laptop model I'm using only has a 1 TB drive; the Nvidia machines all have at least 2 TB). In practice, everything about the Strix Halo is snappy. 50 tps is not slow by any means, and the real benefit is that you can comfortably run bigger models like MoE models with 100+ billion parameters, and/or smaller models at higher bit precision. Running qwen3.5:122b-a10b at 20+ tps on a portable local machine which sips power is fantastic. I've even run Minimax 2.5 and Nemotron Super at very low precision, and they've got a surprising amount of useful knowledge, even at those very low precisions.

I'm thinking of getting a few more of the Strix Halo machines just because prices are going up so dramatically on all the other options. I'd have to buy a used laptop with an Nvidia GPU that has 16-24 GB VRAM for the same price as a new Strix Halo laptop that has 128 GB of unified memory. For me that battle is easily won by the Strix Halo. The Apple machines are at least twice as much for something in the same ballpark (and I'm not a big Mac fan).

Nick — 2026-03-21 13:10:00
I haven't tried it yet, but I think Nvidia GPUs will likely perform much better in situations where more than one user is performing inference simultaneously (meaning, I wouldn't plan to build a multi-user inference server with a Strix Halo machine).

Nick — 2026-03-21 15:09:43
I should point out that the Strix Halo has 256 GB/s memory bandwidth, the Apple M3 Max has 400 GB/s, and the M4 Max has 546 GB/s (beware that the Apple Max versions are far more performant than the Pro versions of the same name).
The DGX Spark has 273 GB/s, but it tends to perform about twice as fast as the Strix Halo because it has specialized hardware accelerators for 4-bit and 8-bit quants, more mature software (CUDA vs ROCm), more powerful prefill processing, faster image generation, etc.
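Those bandwidth numbers map fairly directly onto generation speed, because single-user token generation is mostly memory-bandwidth-bound: each generated token requires streaming roughly all the active weights through memory once. Here's a back-of-envelope sketch of that reasoning (my own illustration, not a benchmark - it ignores KV-cache reads, kernel overhead, and imperfect bandwidth utilization, so real numbers come in well below the ceiling):

```python
# Rough upper bound on decode tokens/sec for memory-bandwidth-bound inference.
# Each token requires reading approximately all active weights once, so:
#   tokens/sec <= bandwidth / bytes_of_active_weights

def est_decode_tps(bandwidth_gb_s: float, active_params_b: float, bits: int) -> float:
    """Theoretical ceiling on tokens/sec, given memory bandwidth (GB/s),
    active parameters (billions), and weight precision (bits)."""
    gb_read_per_token = active_params_b * bits / 8  # GB of weights per token
    return bandwidth_gb_s / gb_read_per_token

# A 10B-active MoE at 4-bit on Strix Halo's 256 GB/s:
print(round(est_decode_tps(256, 10, 4)))  # 51 tps ceiling
```

The observed 20+ tps in low power mode is a believable fraction of that ~51 tps ceiling, and it's also why MoE models with a small active parameter count are such a good fit for these bandwidth-limited unified-memory machines.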
I think everybody just looks at all those numbers and figures the Strix Halo can't realistically be very useful, but for the price, I'm very surprised at how much actually productive inference work that little ASUS ROG Flow Z13 laptop can do. The new 3.5 models from Qwen really help to make it more useful, and I'm expecting smarter and more reliable smaller models to continue to be developed, which should make less powerful hardware more and more useful in general. The 128 GB of RAM is nowhere near big enough for frontier models like Kimi or GLM, but Qwen3.5-122B-A10B is pretty darn smart for agentic tool calling roles, Qwen3-Coder-Next can get a lot of actual coding tasks completed, and the super-quantized versions of Minimax 2.5 and Nemotron Super contain an amazing amount of knowledge - even obscure info.
For example, I can submit queries to very low precision Minimax and Nemotron versions about important people in paramotoring, popular wing and engine brands, and jam.py - topics which none of the smaller models know much about, if anything - and those very quantized versions approach the sorts of expectations we've gotten used to with all-knowing trillion parameter frontier models. You'll never be able to fit all human knowledge into a 200 billion parameter model, but it's very impressive how a low quant of a very large model like Minimax 2.5 enables lots of information to be stored in a model that's less than 80 GB on disk.
Smaller models still need at least 4-bit quantization, and those same models make a lot fewer mistakes at 8-bit quantization. Even so, given the same size on disk, I typically expect a larger parameter model at 4-bit quant to perform better than a smaller model at 8-bit quant. That's why I'm happy with the Strix Halo - it can run the models that are currently very reliable, especially 100B+ parameter MoE models at 4-bit. That's a useful sweet spot.
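The size-on-disk comparison behind that tradeoff is simple arithmetic: weights take roughly params × bits / 8 bytes, so a 100B model at 4-bit and a 50B model at 8-bit occupy about the same space (this ignores the small overhead real quant formats like GGUF add for scales and metadata):

```python
# Approximate weights-on-disk size: parameters (billions) * bits per weight / 8
# gives gigabytes. Real quantized files run slightly larger due to per-block
# scale factors and metadata.

def model_size_gb(params_b: float, bits: int) -> float:
    """Approximate model weight size in GB."""
    return params_b * bits / 8

# Same disk footprint, very different parameter counts:
print(model_size_gb(100, 4))  # 50.0 GB - 100B model at 4-bit
print(model_size_gb(50, 8))   # 50.0 GB - 50B model at 8-bit
```

At equal footprint, the bet described above is that the extra parameters of the 4-bit model buy more capability than the extra precision of the 8-bit model.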
By the way, the good old standby GPT-OSS:20b is still very useful for gathering information with web search and performing tool calls - even on smaller GPUs. That thing is a workhorse for cheap GPUs. I want to try it with the Intel Arc Pro B50 16 GB GPU that costs ~$350 (in fact, I'd love to build a system with 4 of those GPUs, if it turns out they work well together in current llama.cpp). But I think qwen3.5:35b-a3b at 4-bit and others coming are stronger contenders for all the little consumer GPUs.