Post History

Current version by Nick Antonaccio

Current VersionMay 23, 2026 at 12:58

Be sure to download MTP versions of the most recent models - they run much faster. For example, in LM Studio qwen3.6 27b_q4_k_s (that's the big dense model) sees these improvements in performance:

On the Asus gx10:

  • 4.11 tokens per second without MTP (LM Studio community version)
  • 11.45 tokens per second with MTP (Unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo, ASUS ROG Flow Z13 laptop, the exact same models ran significantly faster:

  • 9.31 tokens per second without MTP
  • 20.24 tokens per second with MTP

Those are fantastic improvements, entirely for free - and boy oh boy is that Strix Halo impressive here! It even runs the q6_k_xl quant of qwen36:27b at 13.48t tps. What a beast.

You can get really serious work done with qwen 3.6 27b, at 20tps. For coding, that model is competitive with much bigger models that were previously impossible to run on anything but datacenter class hardware.

So with this one update, we've really seen some tremendous improvements in capability become available to the self-hosting crowd, very quickly overnight.

Just be sure:

  1. You've got the most recent version of LM Studio and the Llama CPP runtime. MTP is only supported in the most recent release of both the runtime and the application.
  2. The MTP speculative decoding toggle in the model load parameters is switched on every session
  3. You've got an MTP version of the most recent LLMs downloaded - specifically with Qwen 3.6, if you downloaded a version in the past few weeks that wasn't explicitly labeled MTP, you'll need to download a newer version with MTP enabled (the Unsloth MTP models have all worked well for me). All Gemma models and Nemotron 3 Super have supported MTP out of the box.

I used LM Studio 0.4.14 (Build 3) and Llama CPP 2.16.0 for the tests above.

If you're using other inference software, be sure to have the most recent version of Llama CPP.

Previous Versions
Version 8May 23, 2026 at 12:58

Be sure to download MTP versions of the most recent models - they run much faster. For example, in LM Studio community qwen3.6 27b_q4_k_s (that's the big dense model) sees these improvements in performance:

On the Asus gx10:

  • 4.11 tokens per second without MTP (LM Studio community version)
  • 11.45 tokens per second with MTP (Unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo, ASUS ROG Flow Z13 laptop, the exact same models ran significantly faster:

  • 9.31 tokens per second without MTP
  • 20.24 tokens per second with MTP

Those are fantastic improvements, entirely for free - and boy oh boy is that Strix Halo impressive here! It even runs the q6_k_xl quant of qwen36:27b at 13.48t tps. What a beast.

You can get really serious work done with qwen 3.6 27b, at 20tps. For coding, that model is competitive with much much bigger models that were previously impossible to run on anything but datacenter class hardware.

So with this one update, we've really seen some tremendous improvements in capability become available to the self-hosting crowd, very quickly overnight.

Just be sure:

  1. You've got the most recent version of LM Studio and the Llama CPP runtime. MTP is only supported in the most recent release of both the runtime and the application.
  2. The MTP speculative decoding toggle in the model load parameters is switched on for every session
  3. You've got an MTP version of the most recent LLMs downloaded - specifically with Qwen 3.6, if you downloaded a version in the past few weeks that wasn't explicitly labeled MTP, you'll need to download a newer version with MTP enabled (the Unsloth MTP models have all worked well for me). All Gemma models and Nemotron 3 Super support MTP out of the box.

I used LM Studio 0.4.14 (Build 3) and Llama CPP 2.16.0 for the tests above.

If you're using other inference software, be sure to have the most recent version of Llama CPP.

Version 7May 23, 2026 at 12:18

Be sure to download MTP versions of the most recent models - they run much faster. For example, in LM Studio community qwen3.6 27b_q4_k_s (that's the big dense model) sees these improvements in performance:

On the Asus gx10:

  • 4.11 tokens per second without MTP (LM Studio community version)
  • 11.45 tokens per second with MTP (Unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo, ASUS ROG Flow Z13 laptop, the exact same models ran significantly faster:

  • 9.31 tokens per second without MTP
  • 20.24 tokens per second with MTP

Those are fantastic improvements, entirely for free - and boy oh boy is that Strix Halo impressive here! It even runs the q6_k_xl quant of qwen36:27b at 13.48t tps. What a beast.

You can get really serious work done with qwen 3.6 27b, at 20tps. For coding, that model is competitive with much much bigger models that were previously impossible to run on anything but datacenter class hardware.

So with this one update, we've really seen some tremendous improvements in capability become available to the self-hosting crowd, very quickly overnight.

Just be sure you've got the most recent version of LM Studio and the Llama CPP runtime, and manually enable the MTP speculative decoding toggle in the model load parameters for every session.

MTP is only supported in the most recent release of both the runtime and the application. Also be sure to download an MTP version of the most recent LLMs - specifically with Qwen 3.6, if you downloaded a version in the past few weeks that wasn't explicitly labeled MTP, you'll need to download a newer version with MTP enabled (the Unsloth MTP models have all worked well for me). All Gemma models and Nemotron 3 Super support MTP out of the box.

I used LM Studio 0.4.14 (Build 3) and Llama CPP 2.16.0 for the tests above.

If you're using other inference software, be sure to have the most recent version of Llama CPP.

Version 6May 23, 2026 at 12:09

Be sure to download MTP versions of the most recent models - they run much faster. For example, in LM Studio community qwen3.6 27b_q4_k_s (that's the big dense model) sees these improvements in performance:

On the Asus gx10:

  • 4.11 tokens per second without MTP (LM Studio community version)
  • 11.45 tokens per second with MTP (Unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo, ASUS ROG Flow Z13 laptop, the exact same models ran significantly faster:

  • 9.31 tokens per second without MTP
  • 20.24 tokens per second with MTP

Those are fantastic improvements, entirely for free - and boy oh boy is that Strix Halo impressive here! It even runs the q6_k_xl quant of qwen36:27b at 13.48t tps. What a beast.

You can get really serious work done with qwen 3.6 27b, at 20tps. For coding, that model is competitive with much much bigger models that were previously impossible to run on anything but datacenter class hardware.

So with this one update, we've really seen some tremendous improvements in capability become available to the self-hosting crowd, very quickly overnight.

Just be sure you've got the most recent version of LM Studio and the Llama CPP runtime. MTP is only supported in the most recent release of both the runtime and the application. Also be sure to download an MTP version of the most recent LLMs - specifically with Qwen 3.6, if you downloaded a version in the past few weeks that wasn't explicitly labeled MTP, you'll need to download a newer version with MTP enabled (the Unsloth MTP models have all worked well for me). All Gemma models and Nemotron 3 Super support MTP out of the box.

I used LM Studio 0.4.14 (Build 3) and Llama CPP 2.16.0 for the tests above.

If you're using other inference software, be sure to have the most recent version of Llama CPP.

Version 5May 23, 2026 at 12:05

Be sure to download MTP versions of the most recent models - they run much faster. For example, in LM Studio community qwen3.6 27b_q4_k_s (that's the big dense model) sees these improvements in performance:

On the Asus gx10:

  • 4.11 tokens per second without MTP (LM Studio community version)
  • 11.45 tokens per second with MTP (Unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo, ASUS ROG Flow Z13 laptop, the exact same models ran significantly faster:

  • 9.31 tokens per second without MTP
  • 20.24 tokens per second with MTP

Those are fantastic improvements, entirely for free - and boy oh boy is that Strix Halo impressive here! It even runs the q6_k_xl quant of qwen36:27b at 13.48t tps. What a beast.

You can get really serious work done with qwen 3.6 27b, at 20tps. For coding, that model is competitive with much much bigger models that were previously impossible to run on anything but datacenter class hardware.

So with this one update, we've really seen some tremendous improvements in capability become available to the self-hosting crowd, very quickly overnight.

Just be sure you've got the most recent version of LM Studio and the Llama CPP runtime. MTP is only supported in the most recent release of both the runtime and the application. Also be sure to download an MTP version of the most recent LLMs - specifically with Qwen 3.6, if you downloaded a version in the past few weeks that wasn't explicitly labeled MTP, you'll need to download a newer version with MTP enabled (the Unsloth MTP models have all worked well for me). Gemma models support MTP out of the box.

I used LM Studio 0.4.14 (Build 3) and Llama CPP 2.16.0 for the tests above.

If you're using other inference software, be sure to have the most recent version of Llama CPP.

Version 4May 23, 2026 at 11:54

Be sure to download MTP versions of the most recent models - they run much faster. For example, in LM Studio community qwen3.6 27b_q4_k_s (that's the big dense model) sees these improvements in performance:

On the Asus gx10:

  • 4.11 tokens per second without MTP (LM Studio community version)
  • 11.45 tokens per second with MTP (Unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo, ASUS ROG Flow Z13 laptop, the exact same models ran significantly faster:

  • 9.31 tokens per second without MTP
  • 20.24 tokens per second with MTP

Those are fantastic improvements, entirely for free - and boy oh boy is that Strix Halo impressive here! It even runs the q6_k_xl quant of qwen36:27b at 13.48t tps. What a beast.

You can get really serious work done with qwen 3.6 27b, at 20tps. For coding, that model is competitive with much much bigger models that were previously impossible to run on anything but datacenter class hardware.

So with this one update, we've really seen some tremendous improvements in capability become available to the self-hosting crowd, very quickly overnight.

Just be sure you've got the most recent version of LM Studio and the Llama CPP runtime. MTP is only supported in the most recent release of both the runtime and the application.

I used LM Studio 0.4.14 (Build 3) and Llama CPP 2.16.0 for the tests above.

If you're using other inference software, be sure to have the most recent version of Llama CPP.

Version 3May 22, 2026 at 21:50

Be sure to download MTP versions of the most recent models - they run much faster. For example, in LM Studio community qwen3.6 27b_q4_k_s (that's the big dense model) sees these improvements in performance:

On the Asus gx10:

  • 4.11 tokens per second without MTP (LM Studio community version)
  • 11.45 tokens per second with MTP (Unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo, ASUS ROG Flow Z13 laptop, the exact same models ran significantly faster:

  • 9.31 tokens per second without MTP
  • 20.24 tokens per second with MTP

Those are fantastic improvements, entirely for free - and boy oh boy is that Strix Halo impressive here! It even runs the q6_k_xl quant of qwen36:27b at 13.48t tps. What a beast.

You can get really serious work done with qwen 3.6 27b, at 20tps. For coding, that model is competitive with much much bigger models that were previously impossible to run on anything but datacenter class hardware.

Qwen36:35a3 q4_k_s MTP runs at 59 tps on the DGX Spark. The 8 bit version even runs at 40tps. Holy crap that's quick for a very capable model, on a single piece of consumer hardware at low wattage.

So with this one update, we've really seen some tremendous improvements in capability become available to the self-hosting crowd, very quickly overnight.

Just be sure you've got the most recent version of LM Studio and the Llama CPP runtime. MTP is only supported in the most recent release of both the runtime and the application.

I used LM Studio 0.4.14 (Build 3) and Llama CPP 2.16.0 for the tests above.

If you're using other inference software, be sure to have the most recent version of Llama CPP.

Version 2May 22, 2026 at 21:49

Be sure to download MTP versions of the most recent models - they run much faster. For example, the normal LM Studio community version of qwen3.6 27bq4 (that's the big dense model):

On the Asus gx10:

  • 4.11 tokens per second
  • 11.45 tokens per second with MTP (unsloth version, same model, exact same machine, with all default settings)

On the Strix Halo, ASUS ROG Flow Z13 laptop, the exact same models ran significantly faster:

  • 9.31 tokens per second
  • 20.24 tokens per second with MTP

Those are fantastic improvements, entirely for free (and boy oh boy is that Strix Halo impressive here - it even runs the q6_k_xl quant of qwen36:27b at 13.48t tps. What a beast!).

You can get really serious work done with qwen 3.6 27b, at 20tps. For coding, that model is competitive with much much bigger models that were previously impossible to run on anything but datacenter class hardware.

Qwen36:35a3q4ks mtp runs at 59 tps on the DGX Spark. The 8 bit version even runs at 40tps. Holy crap that's quick for a very capable model, on a single piece of consumer hardware at low wattage!

So with this one update, we've really seen some tremendous improvements in capability become available to the self-hosting crowd, very quickly overnight.

Just be sure you've got the most recent version of LM Studio and the Llama CPP runtime. MTP is only supported in the most recent release of both the runtime and the application.

I'm using LM Studio 0.4.14 (Build 3) and Llama CPP 2.16.0

If you're using other inference software, be sure to have the most recent version of Llama CPP.

Version 1May 22, 2026 at 21:46

Be sure to download MTP versions of the most recent models - they run much faster. For example, the normal LM Studio community version of qwen3.6 27bq4 (that's the big dense model) ran at only 4.11 tokens per second on the Asus gx10. In comparison, the Unsloth MTP version for the same model runs at 11.45 tokens per second on the exact same machine, with all default settings. That's a fantastic improvement, entirely for free.

Just be sure you've got the most recent version of LM Studio and the Llama CPP runtime. MTP is only supported in the most recent release of both the runtime and the application. If you're using other inference software, be sure to have the most recent version of Llama CPP.