Qwen 3.5 122a10b on Strix Halo runs at:
- 10.05 tokens per second without MTP (q5_k_l Bartowski version)
- 17.23 tokens per second with MTP (q4_k_s Unsloth version). Be aware that this compares q5 against q4, so the gain is not purely from MTP.
On the Nvidia GX10:
- 11.10 tokens per second without MTP (q5_k_l Unsloth version)
- 15.26 tokens per second with MTP (q5_k_l Unsloth version). Here it is q5 vs q5, the exact same model, so this is a like-for-like comparison.
Strangely, the Bartowski version of Qwen 3.5 122a10b at q5_k_l quantization ran at 19.13 tokens per second on the Nvidia machine. That is faster than the MTP version of the exact same model from Unsloth, on the same machine, with all other settings the same. What magic is Bartowski wielding?
BTW, the Nvidia machine is always faster than the Strix Halo at loading models and at processing input tokens, with or without MTP.
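
For anyone who wants the relative speedups instead of raw numbers, here is a quick Python sketch that just divides the figures above. The variable names are my own, and the Strix Halo ratio mixes q5 and q4 quants, so treat it as an upper bound on what MTP alone contributes rather than a clean measurement.

```python
# Tokens per second, copied from the measurements in this post.
strix_no_mtp = 10.05      # Strix Halo, q5_k_l Bartowski, no MTP
strix_mtp = 17.23         # Strix Halo, q4_k_s Unsloth, with MTP (different quant!)
gx10_no_mtp = 11.10       # GX10, q5_k_l Unsloth, no MTP
gx10_mtp = 15.26          # GX10, q5_k_l Unsloth, with MTP
gx10_bartowski = 19.13    # GX10, q5_k_l Bartowski


def speedup(new_tps: float, base_tps: float) -> float:
    """Return how many times faster new_tps is than base_tps."""
    return new_tps / base_tps


# MTP vs no MTP on each machine (the Strix Halo pair is not like for like).
strix_ratio = speedup(strix_mtp, strix_no_mtp)
gx10_ratio = speedup(gx10_mtp, gx10_no_mtp)

# The odd result: Bartowski q5_k_l vs Unsloth q5_k_l with MTP, both on the GX10.
bartowski_vs_mtp = speedup(gx10_bartowski, gx10_mtp)

print(f"Strix Halo, MTP vs no MTP (q4 vs q5): {strix_ratio:.2f}x")
print(f"GX10, MTP vs no MTP (q5 vs q5):       {gx10_ratio:.2f}x")
print(f"GX10, Bartowski vs Unsloth+MTP:       {bartowski_vs_mtp:.2f}x")
```

So on the like-for-like GX10 pair MTP gives a bit under 1.4x, while the Bartowski quant is roughly 1.25x faster again than the Unsloth MTP run.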