Wow, I just did an eBay search for 'laptop 3080ti', and RAM prices are continuing to drive total machine cost significantly higher:
There was just a single Buy It Now listing for $1099:
There were a bunch of similar machines for around $1300, but all of those current listings come with only 32GB of RAM (along with 16GB of VRAM in the GPU).
Everything I bought even a month ago had at least 64GB of RAM, and all those machines were less than $1000 😲
If you're using models that fit entirely into VRAM, I'm not sure how much of a difference that RAM situation would make. As long as you're not running a bunch of other applications during inference, and you're offloading all of the model layers onto the GPU, inference performance shouldn't take a hit - but I don't currently have a machine set up with 32GB of RAM to test that.
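For a rough sense of what "fits entirely into VRAM" means, here's a back-of-the-envelope sketch. The numbers (bits per weight, the fixed overhead budget for KV cache and runtime) are illustrative assumptions, not measurements:

```python
# Rough sketch: does a quantized model fit in a given amount of VRAM?
# All numbers are illustrative assumptions, not measurements.

def fits_in_vram(params_b: float, bits_per_weight: float,
                 overhead_gb: float = 2.0, vram_gb: float = 16.0) -> bool:
    """True if the weights plus a rough KV-cache/runtime overhead
    budget fit within the given VRAM."""
    weights_gb = params_b * bits_per_weight / 8  # billions of params -> GB
    return weights_gb + overhead_gb <= vram_gb

# A ~14B model at ~4.5 effective bits per weight on a 16GB card:
print(fits_in_vram(14, 4.5))   # ~7.9GB of weights + overhead -> True
# A ~35B model at the same quantization on the same card:
print(fits_in_vram(35, 4.5))   # ~19.7GB of weights alone -> False
```

Once the answer is False, layers spill to system RAM and inference speed falls off a cliff - which is when the 32GB vs 64GB question actually starts to matter.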
So that makes the current prices for Strix Halo even more impressive. For anything more capable than these small-Nvidia-GPU laptops, I'd take a serious look at Strix Halo. A like-new ASUS ROG Flow Z13 with 128GB of shared memory is holding steady at $2572 (+ tax) on Amazon:
https://www.amazon.com/gp/product/B0DW238TXK/ref=ox_sc_saved_title_1?smid=A2L77EE7U53NWQ&psc=1
I don't think there's currently any better bang for the buck than Strix Halo, for LLM inference.
The next closest competitor is probably an ASUS Ascent GX10 for $3493 (+ tax):
https://www.amazon.com/gp/product/B0G1MQYHRD/ref=ox_sc_act_title_2?smid=ATVPDKIKX0DER&th=1
You're looking at $4000+ for a comparable Apple silicon product, and that Asus does have a real Nvidia GPU, so you can use a genuine CUDA stack for any models that require CUDA (video generation models, for example, perform much better with it). All those little mini machines with an Nvidia GB10 chip are also built for clustering, with high-speed ConnectX networking built in. I'm really interested in those machines for that reason, but for really big models, a Mac Studio can also be clustered natively.
I wrote a post about why I'm considering, at some point, building a cluster out of big-rig machines like those:
https://aibynick.com/thread/26
For now, though, I'm still using APIs for all my commercial code generation work. ChatGPT still costs me only $20 per month for an absolutely outrageous volume of inference, and Google/gemini-3.1-flash-lite-preview is ridiculously inexpensive, fast, and effective in agents. That Gemini model feels almost free to use - I've been averaging about $1 per 10 million tokens (combined in/out, for the particular tasks I've run on it lately). It's much smarter, more capable, more knowledgeable, and dramatically faster than any local LLM you could self-host.
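To put that "$1 per 10 million tokens" in perspective, here's a quick cost estimate. The daily token volume is a hypothetical input, and the rate is the rough blended figure from above, not any model's published pricing:

```python
# Back-of-the-envelope API spend at a blended rate of ~$1 per 10M
# tokens (combined in/out). Rate and volume are illustrative only.

def monthly_cost(tokens_per_day: int, usd_per_10m: float = 1.0) -> float:
    """Estimated 30-day spend for a given daily token volume."""
    return tokens_per_day * 30 * usd_per_10m / 10_000_000

# An agent chewing through 2M tokens a day comes out to about $6/month:
print(round(monthly_cost(2_000_000), 2))  # 6.0
```

At those numbers, even heavy agent use is cheaper than the electricity for a self-hosted GPU rig running the same workload.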
So don't sweat buying hardware right now. Qwen 3.6 35b is the first local model that makes production coding work on a self-hosted GPU seem doable, but it's still nowhere near as good as basically any of the huge models. I'm amazed at what that model can achieve on just a 16GB GPU, but my testing with it is mostly about having a workable fallback if any of those services were to suffer outages or evaporate entirely. If they ever disappeared completely, I'd immediately buy a big clustered setup like the ones in the link above, and run GLM, Kimi, Minimax, etc.