Use MTP! - AI By Nick

Nick Antonaccio (Admin)
May 23, 2026 at 12:58 (edited, 8 revisions)

Be sure to download MTP (multi-token prediction) versions of the most recent models - they run much faster. For example, in LM Studio, Qwen 3.6 27B at q4_k_s (that's the big dense model) sees these improvements in performance:

On the ASUS GX10:

4.11 tokens per second without MTP (LM Studio community version)
11.45 tokens per second with MTP (Unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo machine (an ASUS ROG Flow Z13 laptop), the exact same models ran significantly faster:

9.31 tokens per second without MTP
20.24 tokens per second with MTP
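
For anyone who wants ratios rather than raw numbers, here's a quick sketch that works out the speedup from the figures above. The dictionary layout and the `speedup` helper are just illustrative names, not anything LM Studio provides.

```python
# Tokens-per-second figures quoted above (Qwen 3.6 27B, q4_k_s).
results = {
    "ASUS GX10": {"no_mtp": 4.11, "mtp": 11.45},
    "Strix Halo (ROG Flow Z13)": {"no_mtp": 9.31, "mtp": 20.24},
}


def speedup(no_mtp: float, mtp: float) -> float:
    """Return how many times faster the MTP run is than the non-MTP run."""
    return mtp / no_mtp


for machine, tps in results.items():
    ratio = speedup(tps["no_mtp"], tps["mtp"])
    print(f"{machine}: {ratio:.2f}x faster with MTP")
```

That comes out to roughly 2.79x on the GX10 and 2.17x on the Strix Halo.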

Those are fantastic improvements, entirely for free - and boy oh boy is that Strix Halo impressive here! It even runs the q6_k_xl quant of qwen36:27b at 13.48 tps. What a beast.

You can get really serious work done with Qwen 3.6 27B at 20 tps. For coding, that model is competitive with much bigger models that were previously impossible to run on anything but datacenter-class hardware.

So with this one update, some tremendous improvements in capability have become available to the self-hosting crowd, practically overnight.

Just be sure:

You've got the most recent version of LM Studio and the llama.cpp runtime. MTP is only supported in the most recent release of both the runtime and the application.
The MTP speculative decoding toggle in the model load parameters is switched on, every session.
You've got an MTP version of the most recent LLMs downloaded. Specifically with Qwen 3.6, if you downloaded a version in the past few weeks that wasn't explicitly labeled MTP, you'll need to download a newer version with MTP enabled (the Unsloth MTP models have all worked well for me). All Gemma models and Nemotron 3 Super have supported MTP out of the box.

I used LM Studio 0.4.14 (Build 3) and llama.cpp runtime 2.16.0 for the tests above.

If you're using other inference software, be sure to have the most recent version of llama.cpp.


Nick Antonaccio (Admin)
May 23, 2026 at 03:35 (edited, 1 revision)

BTW, qwen36:35a3 q4_k_s MTP runs at 59 tps on the DGX Spark and 54 tps on Strix Halo. The 8-bit version even runs at 40 tps on the DGX Spark.

Holy crap, that's quick for a very capable model, on a single piece of consumer hardware at low wattage.


Nick Antonaccio (Admin)
May 30, 2026 at 17:33 (edited, 4 revisions)

Qwen 3.5 122a10b on Strix Halo runs at:

10.05 tokens per second without MTP (q5_k_l Bartowski version)
25.30 tokens per second with MTP (q4_k_s Unsloth version). Be aware that this compares q5 against q4, so some of the difference may come from the smaller quant.

On the Nvidia GX10:

11.10 tokens per second without MTP (q5_k_l Unsloth version)
15.26 tokens per second with MTP (q5_k_l Unsloth version). Note that this is a q5 vs q5 comparison of the exact same model.

Strangely, the Bartowski version of Qwen 3.5 122a10b at q5_k_l quantization, without MTP, ran at 19.13 tokens per second on the Nvidia machine - faster than the MTP version of the same model and quant from Unsloth, on the same machine, with all other settings the same. What magic is Bartowski wielding?
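
Here's a rough sketch of one way to keep comparisons like these honest: record each run with its machine, GGUF source, and quant, and only report an MTP speedup when everything except the MTP setting matches. The `Run` record and the helper names are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    machine: str
    source: str  # who published the GGUF
    quant: str
    mtp: bool
    tps: float


# Qwen 3.5 122a10b figures quoted in this post.
runs = [
    Run("Strix Halo", "Bartowski", "q5_k_l", False, 10.05),
    Run("Strix Halo", "Unsloth", "q4_k_s", True, 25.30),
    Run("Nvidia GX10", "Unsloth", "q5_k_l", False, 11.10),
    Run("Nvidia GX10", "Unsloth", "q5_k_l", True, 15.26),
    Run("Nvidia GX10", "Bartowski", "q5_k_l", False, 19.13),
]


def like_for_like(a: Run, b: Run) -> bool:
    """True when two runs differ only in whether MTP was switched on."""
    same_setup = (a.machine, a.source, a.quant) == (b.machine, b.source, b.quant)
    return same_setup and a.mtp != b.mtp


def fair_speedups(all_runs):
    """Map (machine, source, quant) to the MTP speedup for matched pairs only."""
    out = {}
    for base in all_runs:
        for other in all_runs:
            if not base.mtp and other.mtp and like_for_like(base, other):
                out[(base.machine, base.source, base.quant)] = other.tps / base.tps
    return out


for setup, ratio in fair_speedups(runs).items():
    print(setup, f"{ratio:.2f}x")
```

With the numbers above, only the GX10 Unsloth q5_k_l pair qualifies, at about 1.37x. The Strix Halo pair gets skipped because it mixes quants and GGUF sources.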

BTW, the Nvidia machine is always faster than the Strix Halo at loading models and at processing input tokens, with or without MTP.


Nick Antonaccio (Admin)
Jun 01, 2026 at 03:17 (edited, 2 revisions)

My go-to model is now consistently Qwen 3.6 35a3b (MOE). It's even more blazingly fast now with MTP, and extremely capable, even at 4-bit quantization. I've even noticed that it does a great job with knowledge questions - my go-tos are very specific questions about paramotoring, and about lesser-known frameworks such as jam.py. I have been absolutely amazed at how deep that small MOE's knowledge is, about obscure topics that even the frontier models knew nothing about last year. These knowledge tests were performed in LM Studio, with no Internet search tools available to the model.

The very few times I've seen 3.6 MOE need some help completing a coding task, I've watched both varieties of Gemma 4 (26a4 MOE and 31b dense) help get the task completed.

With all these other fast-running MTP models, and even the fast Bartowski version of the 122b MOE version of Qwen 3.5, there are lots of performant alternatives, for both AMD and Nvidia, when a task may be helped by connecting to another bigger, more knowledgeable model.
