Qwen 3.6 35a3b & 27b are incredible, Gemma 4 26a4b & 31b are runners-up

Nick Antonaccio (Admin)
May 12, 2026 at 12:19 (edited, 5 revisions)

UPDATE FROM THE FUTURE: if you haven't tried the newest qwen3.6 and gemma4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I'm really excited to see what the Qwen 3.6 35a3 MOE model can do for coding and agentic tasks on small consumer GPUs.

On my laptop with a mobile RTX 3080 Ti (16GB VRAM), using LM Studio default settings for the CUDA runtime, this model ran at:

13 tokens per second with 8 bit quantization
16 tokens per second with 6 bit quantization (6 bit is likely about as reliable as 8 bit for most purposes)
24 tokens per second with 4 bit quantization

On my Strix Halo machines, using LM Studio default settings for the Vulkan 2.13.0 runtime:

46 tokens per second with 8 bit quantization

At that 8 bit speed, there's hardly a need to use a more quantized version (Strix Halo is really turning out to be a great machine for the money, keeping in mind that it can handle much bigger models than Qwen 3.6 35a3).

On the Asus GX10 (DGX Spark):

46 tokens per second with 8 bit quantization
53 tokens per second with 6 bit quantization
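
To put those numbers side by side, here is a small sketch that takes the tokens-per-second figures reported above and works out the speedup relative to 8 bit on each machine, plus a rough generation time. The 2000 token response length is just an assumption picked for illustration, and the time estimate ignores prompt processing:

```python
# Tokens-per-second figures reported in this post for Qwen 3.6 35a3.
REPORTED_TPS = {
    "RTX 3080 Ti mobile (CUDA)": {8: 13, 6: 16, 4: 24},
    "Strix Halo (Vulkan)": {8: 46},
    "Asus GX10 (DGX Spark)": {8: 46, 6: 53},
}

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 2000


def seconds_for(tokens: int, tps: float) -> float:
    """Time to generate `tokens` at a steady `tps` (ignores prompt processing)."""
    return tokens / tps


def speedup_vs_8bit(rates: dict) -> dict:
    """Throughput of each quantization relative to the 8 bit figure."""
    base = rates[8]
    return {bits: tps / base for bits, tps in rates.items()}


for machine, rates in REPORTED_TPS.items():
    for bits, ratio in speedup_vs_8bit(rates).items():
        secs = seconds_for(RESPONSE_TOKENS, rates[bits])
        print(f"{machine}: {bits} bit, {rates[bits]} tps, "
              f"{ratio:.2f}x vs 8 bit, ~{secs:.0f}s per {RESPONSE_TOKENS} tokens")
```

On the 3080 Ti, dropping from 8 bit to 4 bit nearly doubles throughput, while on the GX10 the step from 8 bit to 6 bit is a much smaller gain.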

Running a knowledge query with Qwen 3.6 35a3 connected to the Internet (in the Jan app) yielded truly great results. This little model can do very impressive research.

Disconnected from the Internet, Qwen 3.6 35a3 is not a 'world knowledge' model, but it immediately seems to be better than Qwen 3.5 35a3 at writing code, and my first tests seem to indicate it also generally does better than Gemma 4 31b dense and 26b MOE for text-based tasks. I have a sense this may be the current best all-around task model for small local GPUs, especially for tasks that involve writing code. As a big bonus, Qwen 3.6 35a3 also supports images, audio, and video (though Qwen's Omni models have deeper multi-modal capabilities).

Over the next few weeks, I'm most inclined to test the Qwen 3.6 35a3 model, together with Pi & Nullclaw, on local lightweight GPUs. I'll put it head to head against Gemma 4 31b and 26b, as well as GPT-OSS:120b and 20b. Those are the current leading players in the small GPU LLM market, and I'm excited to see some very strong models that can run on sub-$1000 used laptops.

BTW, for 'world knowledge' on a small GPU, Nemotron 3 Super at IQ3_XXS is an impressive little self-contained encyclopedia (it only runs at 4.5 tps on that 3080ti mobile, but that's usable for knowledge lookup). GPT-OSS:120b (11.5 tps on the 3080ti mobile) and heavily quantized Minimax (iq3_s) are also good little knowledge LLMs on GPUs such as Strix Halo and DGX Spark. And of course, even an older small model like GPT-OSS:20b can do a great job researching knowledge, if it has access to the Internet (GPT-OSS:20b is blazing fast on those small GPUs and can do an impressive job with web research).

Nick Antonaccio (Admin)
Apr 27, 2026 at 04:52

Update: the Qwen 3.6 35a3 and Gemma 4 26a4 MOE models have become my workhorse self-hosted LLMs for local software development.

This example was completed as a single task, entirely with Qwen 3.6 35a3 on a laptop with only a mobile 3080 with 16GB VRAM:

http://1y1z.com:5993

This was done with Qwen 3.6 35a3 and Gemma 4 26a4 MOE on a Strix Halo laptop:

http://1y1z.com:5994

Nick Antonaccio (Admin)
May 12, 2026 at 12:26 (edited, 6 revisions)

Ok, if you haven't tried the newest qwen3.6 and gemma4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I added some more demo examples created by Qwen 3.6 35a3 and Gemma 4 26a4, to the quick start at https://aibynick.com/thread/29

The more I use Qwen 3.6 to write actually useful code and to perform useful agentic tasks, the more it becomes blindingly obvious that this model is head and shoulders above any other small, locally usable model out there. It's truly incredible what both of the Qwen 3.6 models can accomplish (35a3b is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but that one does need a fast GPU for anything other than the smallest workloads. 27b is particularly useful when you employ it to help out on portions of tasks which 35a3 gets stuck on. Use 35a3 to productively accomplish 95% of a workflow, then call in 27b (or some other big model) when extra help is needed.
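
That "fast model first, bigger model when stuck" routine can be sketched in a few lines. This is only a rough illustration, not how any particular agent tool does it: the function names, the retry count, and the "empty answer means stuck" check are all assumptions, and the two models are stand-in functions so the sketch runs without an inference server:

```python
from typing import Callable, Tuple

# A "model" here is any function that takes a prompt and returns text.
# In practice it might wrap a local inference server; that wiring is omitted.
Model = Callable[[str], str]


def run_with_escalation(
    prompt: str,
    fast_model: Model,
    strong_model: Model,
    is_stuck: Callable[[str], bool],
    max_fast_attempts: int = 2,
) -> Tuple[str, str]:
    """Try the fast model first; hand off to the strong model if it stays stuck.

    Returns (answer, which_model_answered).
    """
    for _ in range(max_fast_attempts):
        answer = fast_model(prompt)
        if not is_stuck(answer):
            return answer, "fast"
    # The fast model never produced an acceptable answer, so escalate.
    return strong_model(prompt), "strong"


# Stand-in models (hypothetical) so the sketch runs on its own.
def fake_35a3(prompt: str) -> str:
    return "" if "hard" in prompt else f"done: {prompt}"


def fake_27b(prompt: str) -> str:
    return f"done (escalated): {prompt}"


def looks_stuck(answer: str) -> bool:
    # Hypothetical check: treat an empty answer as a failure.
    return not answer.strip()


print(run_with_escalation("easy task", fake_35a3, fake_27b, looks_stuck))
print(run_with_escalation("hard task", fake_35a3, fake_27b, looks_stuck))
```

In real use, the stuck check would more likely be a failing test run or a repeated error message than an empty string.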

The Qwen 3.6 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 & 3090 class GPUs, on Apple Mac processors with less RAM, etc. I run it at both 8 bit and 6 bit quantization on the Strix Halo and DGX Spark machines, but it also seems to produce fantastic quality output at 4 bit quantization on smaller GPUs.

These demos were built with Qwen 3.6 35a3 on a Strix Halo laptop (using q6 quant):

https://com-pute.com/nick/3d_game_qwen36_35a3_strix_halo.html
http://1y1z.com:3929 (a full CRUD Northwind database demo)
http://1y1z.com:8284/dashboard-website--qwen36-35a3--strix/ (a little dashboard/web site demo)

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM (using 4 bit quant):

https://com-pute.com/nick/3d_game_qwen36_35a3_3080_16Gb.html
http://1y1z.com:8284/nexora/ (an additional tiny web site layout demo)
https://com-pute.com/nick/ui_controls_qwen36-35a3_3080_16Gb.html

Qwen tends to produce better quality output than even the most recent Gemma 4 models, and the Qwen models also run faster - but those Gemma 4 versions (MOE 26a4b and dense 31b) are no slouches. Here's a little dashboard/web site example built with Gemma 4 - this one took 4 iterations to complete, including the full-featured datatable, all the charts, animations, UI controls, etc.:

http://1y1z.com:8284/flashy-site/public/

Those results absolutely blow away the sorts of output we got from much bigger models such as GPT-OSS:120b, just a few months ago - and those results are much better than what comes out of other current models which require far beefier hardware.

For example, this is the sort of disappointing HTML layout output which GPT-OSS:120b would produce - it's a functional start, but you'd want to put in lots of additional work to polish up styling for a public facing web site:

http://1y1z.com:8284/demo-website--gpt120--pi/

The output of the minuscule local Qwen 3.6 35a3 model even approaches the quality of huge bleeding edge models which require datacenter-class hardware to run. For example, here's a similar stock web site example made by mimo25pro (via Openrouter):

http://1y1z.com:8284/demo-website-mimo25pro/

Those are all purely visual layout examples, but the same quality improvement is apparent in logic and other classes of work. 35a3 is capable of producing fantastic Internet research results, for example.

One benefit of the Gemma series is that they've produced E2B and E4B Edge models, optimized for on-device deployment. Those models have a 128K context window and support for text, images, and audio. The E2B (2.3B effective/5.1B total parameters) and E4B (4.5B effective/8B total parameters) versions are designed for mobile devices and edge computing implementations. You can run them on Android and iOS phones in the Google AI Edge Gallery app, in LM Studio Mobile (iOS Beta), and in Off Grid (Android). It's recommended you use E2B for phones with 6GB RAM and E4B for 8GB+ RAM.
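
As a quick illustration of that RAM guidance, here is a tiny helper that maps phone RAM to a variant. The function name is made up, and the handling of phones between 6GB and 8GB (and under 6GB) is my assumption, since the recommendation only names 6GB and 8GB+:

```python
from typing import Optional


def pick_gemma_edge_variant(ram_gb: float) -> Optional[str]:
    """Map phone RAM to the Gemma edge model suggested in the post.

    Assumption: phones between 6GB and 8GB fall back to E2B, and phones
    under 6GB get no recommendation. The post only names 6GB and 8GB+.
    """
    if ram_gb >= 8:
        return "E4B"
    if ram_gb >= 6:
        return "E2B"
    return None


for ram in (4, 6, 8, 12):
    print(f"{ram}GB RAM -> {pick_gemma_edge_variant(ram)}")
```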

Keep in mind, all of these Qwen and Gemma models are open source, released under the Apache 2.0 license. That's a welcome shift from Google's previous restrictive license for the Gemma family. You can now use those models like you own them.

Along with Qwen 3.6, I keep the 2 biggest Gemma 4 versions, along with a stable of other bigger models loaded in LM Studio, to offer different perspectives on tasks. This includes some bigger models which run slowly, even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the fast Qwen 3.6 MOE model, even to those larger/slower alternatives.

Every year, I imagine/predict what's clearly coming down the pipe, and what sort of progress we should expect. The industry has continued to satisfy expectations - I can't wait to see what position we'll be in next year! If things like Turboquant and 1 bit models mature, small models that run on DGX Spark class hardware may reach a level of capability which is currently only available in huge releases like Kimi, Deepseek, GLM, etc. (those models still require datacenter-scale compute). Even if progress were to stop entirely, we've got truly game-changing tools available right now. Hopefully, as they say, this is the dumbest our models will ever be...

Nick Antonaccio (Admin)
May 11, 2026 at 16:31 (edited, 6 revisions)

Qwen 3.6 27b punches so far above its weight, it's just unreal. This morning I quickly vibe-coded a pile of demo 3D driving games, using a wide variety of the best and biggest models (deepseek-4-pro, glm5.1, grok4.3, mimo2.5pro, and more), and most produced absolutely terrible results. For example, after $2.94 in tokens on Openrouter, GLM5.1 gave me this unusable result:

http://1y1z.com:8284/3ddriving-glm51.html

Deepseek-4-pro produced this laughably bad piece of junk:

http://1y1z.com:8284/3ddriving-deepseek4pro.html

Grok 4.3, Ling-2.6-1t (a model with 1 trillion parameters), and Hy3 each made something at least kinda-sorta playable, but extremely basic.

Most of the other models produced a mix of various garbage results that were totally unusable, even after a few iteration attempts.

Then Qwen 3.6 27b provided this useful start of a game, right out of the gate:

http://1y1z.com:8284/3d-driving-qwen36-27b.html

That example cost $0.05 to create via Openrouter, and was not only the best first-shot code result, it was better than every other final result, even after some iteration with other models.
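
For a sense of scale, here is the arithmetic on the two Openrouter costs quoted in this post (the variable names are just for illustration):

```python
# Figures quoted in this post (Openrouter spend for each attempt).
glm51_cost = 2.94       # GLM5.1, unusable result
qwen36_27b_cost = 0.05  # Qwen 3.6 27b, best first-shot result

ratio = glm51_cost / qwen36_27b_cost
print(f"The GLM5.1 attempt cost about {ratio:.0f}x as much as the Qwen 3.6 27b attempt")
```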

I just keep seeing surprisingly high quality results come from the Qwen 3.6 models, not just in my own work, but in videos by many YouTubers, and among my friends who are testing LLMs.

Lately I've been fond of demonstrating game results and web site layout results, even though I don't have any need to write that sort of code, because it's easy for onlookers to immediately understand the successes and failures of code an LLM has provided, just by casually looking at visual applications.

I've noticed that a model's quality on highly visual demos roughly parallels the quality of its output for all sorts of other tasks. When you see that one model builds a horribly shoddy looking web site layout, and produces games with broken controls and little playability, you'll tend to see that same model struggle to provide working back end logic for business apps too.

You can see some business app CRUD examples by Qwen 3.6 scattered around my demo links on this site, such as:

http://1y1z.com:3929 A full CRUD database Northwind demo
http://1y1z.com:5938 A little invoicing & payroll app

As well as many quick demos oriented towards useful application building, such as:

http://1y1z.com:5994 A little public forum
https://com-pute.com/nick/ui_controls_qwen36-35a3_3080_16Gb.html A simple UI control demo

I've just been absolutely blown away by the Qwen 3.6 models. The 27b dense model often goes head to head with many of the frontier models in terms of code quality, for all sorts of tasks - and the 35a3b MOE model is outrageously good for the speed, especially given that it can run on such small GPUs.

16GB VRAM is the smallest I've tried with 35a3b - it works well even with q4 compression on that class of small consumer grade hardware. This model takes on deeper challenges, tends to understand more of the scope of a goal, and its output is generally more creative, stylish, and reliable than that of any similarly sized model (and often much better than far larger models).

The takeaway is that you can actually accomplish useful development goals with Qwen 3.6. It seems to be by far the most practical, genuinely effective, and impressive model yet, for small GPUs.

I hope this is a sign of even better quality and performance to come in smaller models this year!

Nick Antonaccio (Admin)
May 12, 2026 at 12:31 (edited, 2 revisions)

I built a version of the Northwind app using a previous generation qwen3-coder-30b-a3b-instruct model:

http://1y1z.com:8429/customers

Even though this model runs at a much faster tokens-per-second rate, the whole development process took much longer to complete, because the model was less efficient, had to think more for each step, and required more manually entered prompt iterations. In the end, even though most of the functionality was finished, the model never completed the chart features, even when specifically prompted. The total context used was also significantly higher, at 83%. Additionally, that older version of Qwen Coder used an older version of Flask, which produced some deprecation warnings when the app ran (nothing fatal, but something to be aware of).

The experience with Qwen 3.6 35a3 is just so much better. As a reminder, this is the Northwind demo created by 35a3, with just 2 prompt iterations:

http://1y1z.com:3929

Nick Antonaccio (Admin)
Jun 21, 2026 at 00:11

Be sure to use the newer MTP versions of these models, and update your inference software:

https://aibynick.com/thread/35
