Post History

Current version by Nick Antonaccio

Current Version - May 12, 2026 at 12:26

Ok, if you haven't tried the newest qwen3.6 and gemma4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I added some more demo examples created by Qwen 3.6 35a3 and Gemma 4 26a4, to the quick start at https://aibynick.com/thread/29

The more I use Qwen 3.6 to write actually useful code and to perform useful agentic tasks, the more blindingly obvious it becomes that this model is head and shoulders above any other small, locally usable model out there. It's truly incredible what both of the Qwen 3.6 models can accomplish (35a3b is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but that one does need a fast GPU for anything other than the smallest workloads. 27b is particularly useful when you employ it to help out on portions of tasks which 35a3 gets stuck on. Use 35a3 to productively accomplish 95% of a workflow, then call in 27b (or some other big model) when extra help is needed.
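As a rough illustration of that "fast model first, bigger model when stuck" workflow, here is a minimal Python sketch. Everything named in it is an assumption made for the example: the model labels, the `ask` callable (which could wrap a request to whatever local server you run), and the `accept` check that decides whether an answer is good enough. It is not taken from any particular tool.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical model labels, for illustration only. Use whatever
# identifiers your own local server reports for the loaded models.
FAST_MODEL = "qwen3.6-35b-a3b"
STRONG_MODEL = "qwen3.6-27b"


def run_with_escalation(
    task: str,
    ask: Callable[[str, str], str],
    accept: Callable[[str], bool],
    models: Sequence[str] = (FAST_MODEL, STRONG_MODEL),
) -> Tuple[str, str]:
    """Try each model in order; return (model, answer) for the first accepted answer.

    `ask(model, task)` sends the task to one model and returns its reply.
    `accept(answer)` is your own check (tests pass, output parses, etc.).
    If no answer is accepted, the last attempt is returned so the caller
    can inspect it.
    """
    if not models:
        raise ValueError("at least one model is required")
    attempt = ("", "")
    for model in models:
        answer = ask(model, task)
        attempt = (model, answer)
        if accept(answer):
            break  # stop escalating as soon as an answer passes the check
    return attempt


if __name__ == "__main__":
    # Stub `ask` so the sketch runs without any server: the fast model
    # "gets stuck" (empty reply) and the stronger model answers.
    def stub_ask(model: str, task: str) -> str:
        return "" if model == FAST_MODEL else f"[{model}] finished: {task}"

    print(run_with_escalation("refactor the parser", stub_ask, bool))
```

Passing `ask` and `accept` in as plain callables keeps the routing logic separate from however you actually reach the models, so the same loop works with any local server or hosted API.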

The Qwen 35a3 version runs super quick, even on old, small GPUs with little VRAM. That makes it really useful on smaller 3080 & 3090 class GPUs, on Apple Mac processors with less RAM, etc. I run it at both 8-bit and 6-bit quantization on the Strix Halo and DGX Spark machines, but it also seems to produce fantastic quality output at 4-bit quantization on smaller GPUs.
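To get a feel for what those quantization levels mean for memory, here is a back-of-the-envelope Python sketch. It assumes the "35" in the model name means roughly 35 billion total parameters, and it counts the weights only: context cache, runtime overhead, and the slightly higher effective bits per weight of real quant formats are all ignored, so actual files and memory use will differ.

```python
def weight_size_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    params (billions) * bits per weight / 8 bits per byte.
    """
    return total_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Assumption: ~35 billion total parameters for the MOE model.
    for bits in (8, 6, 4):
        size = weight_size_gb(35, bits)
        print(f"35B total parameters at {bits}-bit: ~{size:.1f} GB of weights")
```

By this estimate, 4-bit weights alone come to about 17.5 GB, a little more than 16GB of VRAM, so a setup like the laptop one presumably keeps part of the model in system RAM. Only a few billion parameters are active per token in an MOE model, which is likely part of why that still runs quickly.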

These demos were built with Qwen 3.6 35a3 on a Strix Halo laptop (using q6 quant):

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM (using 4 bit quant):

Qwen tends to produce better quality output than even the most recent Gemma 4 models, and their models also perform faster - but those Gemma 4 versions (MOE 26a4b and dense 31b) are no slouches. Here's a little dashboard/web site example - this one took 4 iterations to complete, including the full-featured datatable, all the charts, animations, UI controls, etc.:

Those results absolutely blow away the sorts of output we got from much bigger models such as GPT-OSS:120b, just a few months ago - and those results are much better than what comes out of other current models which require far beefier hardware.

For example, this is the sort of disappointing HTML layout output which GPT-OSS:120b would produce - it's a functional start, but you'd want to put in lots of additional work to polish up styling for a public facing web site:

The output of the minuscule local Qwen 3.6 35a3 model even approaches the quality of huge bleeding-edge models which require datacenter-class hardware to run. For example, here's a similar stock web site example made by mimo25pro (via Openrouter):

Those are all purely visual layout examples, but the same quality improvement is apparent in logic and other classes of work. 35a3 is capable of producing fantastic Internet research results, for example.

One benefit of the Gemma series is that they've produced E2B and E4B Edge models, optimized for on-device deployment. Those models have a 128K context window and support for text, images, and audio. The E2B (2.3B effective/5.1B total parameters) and E4B (4.5B effective/8B total parameters) versions are designed for mobile devices and edge computing implementations. You can run them on Android and iOS phones in the Google AI Edge Gallery app, in LM Studio Mobile (iOS Beta), and in Off Grid (Android). It's recommended you use E2B for phones with 6GB RAM and E4B for 8GB+ RAM.
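The RAM guideline above can be written as a tiny Python helper. This is only a sketch of the rule as stated here (E2B for 6GB phones, E4B for 8GB and up). The function name is made up for illustration, and returning `None` below 6GB is an assumption, since no recommendation is given for smaller devices.

```python
from typing import Optional


def pick_gemma_edge_model(phone_ram_gb: float) -> Optional[str]:
    """Map phone RAM to the suggested Gemma 4 Edge model.

    8GB or more -> "E4B"; 6GB up to (not including) 8GB -> "E2B";
    less than 6GB -> None (assumption: no guidance for smaller phones).
    """
    if phone_ram_gb >= 8:
        return "E4B"
    if phone_ram_gb >= 6:
        return "E2B"
    return None


if __name__ == "__main__":
    for ram in (4, 6, 8, 12):
        print(f"{ram}GB RAM -> {pick_gemma_edge_model(ram)}")
```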

Keep in mind, all of these Qwen and Gemma models are open source Apache 2.0 licensed. That's a welcome shift from Google's previous restrictive license for the Gemma family. You can now use those models like you own them.

Along with Qwen 3.6, I keep the 2 biggest Gemma 4 versions, along with a stable of other bigger models loaded in LM Studio, to offer different perspectives on tasks. This includes some bigger models which run slowly, even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the fast Qwen 3.6 MOE model, even to those larger/slower alternatives.

Every year, I imagine/predict what's clearly coming down the pipe, and what sort of progress we should expect. The industry has continued to satisfy expectations - I can't wait to see what position we'll be in next year! If things like Turboquant and 1-bit models mature, small models that can be run on DGX Spark class hardware may deliver capability which is currently only available in huge releases like Kimi, Deepseek, GLM, etc. (those models currently still require datacenter-scale compute). Even if progress were to stop entirely, we've actually got truly game-changing tools available right now. Hopefully, as they say, this is the dumbest our models will ever be...

Previous Versions
Version 6 - May 12, 2026 at 12:26

Ok, if you haven't tried the newest qwen3.6 and gemma4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I added some more demo examples created by Qwen 3.6 35a3 and Gemma 4 26a4, to the quick start at https://aibynick.com/thread/29

The more I use Qwen 3.6 to write actually useful code and to perform useful agentic tasks, the more it becomes blindly obvious that that model is heads and shoulders above any other small locally usable models out there. It's truly incredible what both the Qwen 3.6 models can accomplish (35a3b is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but that one does need a fast GPU for anything other than the smallest workloads. 27b is particularly useful when you employ it to help out on portions of tasks which 35a3 gets stuck on. Use 35a3 to productively accomplish 95% of a workflow, then call in 27b (or some other big model) when extra help is needed.

The Qwen 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 & 3090 class GPUs, on Apple Mac processors with less RAM, etc. I run it at both 8bit and 6bit quantization on the Strix Halo and DGX Spark machines, but it also seems to produce fantastic quality output at 4 bit quantization on smaller GPUs.

These demos were built with Qwen 3.6 35a3 on a Strix Halo laptop (using q6 quant):

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM (using 4 bit quant):

Qwen tends to produce better quality output than even the most recent Gemma 4 models, and their models also perform faster - but those Gemma 4 versions (MOE 26a4b and dense 31b) are no slouches. Here's a little dashboard/web site example - this one took a 4 iterations to complete, including the full-featured datatable, all the charts, animations, UI controls, etc.:

Those results absolutely blow away the sorts of output we got from much bigger models such as GPT-OSS:120b, just a few months ago - and those results are much better than those of other current models which require far beefier hardware.

For example, this is the sort of disappointing HTML layout output which GPT-OSS:120b would produce - it's a functional start, but you'd want to put in lots of additional work to polish up styling for a public facing web site:

The output of the miniscule local Qwen 3.6 35a3 model even approaches the quality of huge bleeding edge models which require datacenter-class hardware to run. For example, here's a similar stock web site example made by mimo25pro (via Openrouter):

Those are all purely visual layout examples, but the same quality improvement is apparent in logic and other classes of work. 35a3 is capable of producing fantastic Internet research results, for example.

One benefit of the Gemma series is that they've produced E2B and E4B Edge models, optimized for on-device deployment. They have a 128K context window and support for text, images, and audio. The E2B (2.3B effective/5.1B total parameters) and E4B (4.5B effective/8B total parameters) versions are designed for mobile devices and edge computing implementations. You can run them on Android and iOS phones in the Google AI Edge Gallery app, in LM Studio Mobile (iOS Beta), and in Off Grid (Android). It's recommended you use E2B for phones with 6GB RAM and E4B for 8GB+ RAM.

Keep in mind, all of these Qwen and Gemma models are open source Apache 2.0 licensed. This is a great shift from Google's previous restrictive license for the Gemma family. You can now use these models like you own them.

Along with Qwen 3.6, I keep the 2 biggest Gemma 4 versions, along with a stable of other bigger models loaded in LM Studio, to offer different perspectives on tasks. This includes some bigger models which run slowly, even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the fast Qwen 3.6 MOE model, even to those larger/slower alternatives.

Every year, I imagine/predict what's clearly coming down the pipe, and what sort of progress we should expect. The industry has continued to satisfy expectations - I can't wait to see what position we'll be in next year. If things like Turboquant and 1 bit models mature, we may see capability which is currently only available in huge releases like Kimi, Deepseek, GLM, etc. (which currently still require datacenter-scale compute), in small models that can be run on DGX Spark class hardware. Even if progress were to stop entirely, we've actually got truly game-changing tools available right now. Hopefully, as they say, this is the dumbest our models will ever be...

Version 5 - May 09, 2026 at 23:55

Ok, if you haven't tried the newest qwen3.6 and gemma4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I added some more demo examples created by Qwen 3.6 35a3 and Gmma 4 26a4, to the quick start at https://aibynick.com/thread/29

The more I use Qwen 3.6 to write actually useful code and to perform useful agentic tasks, the more it becomes blindly obvious that that model is heads and shoulders above any other small locally usable models out there. It's truly incredible what both the Qwen 3.6 models can accomplish (35a3b is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but that one does need a fast GPU for anything other than the smallest workloads. 27b is particularly useful when you employ it to help out on portions of tasks which 35a3 gets stuck on. Use 35a3 to productively accomplish 95% of a workflow, then call in 27b (or some other big model) when extra help is needed.

The Qwen 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 & 3090 class GPUs, on Apple Mac processors with less RAM, etc. I run it at both 8bit and 6bit quantization on the Strix Halo and DGX Spark machines, but it also seems to produce fantastic quality output at 4 bit quantization on smaller GPUs.

These demos were built with Qwen 3.6 35a3 on a Strix Halo laptop (using q6 quant):

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM (using 4 bit quant):

Qwen tends to produce better quality output than even the most recent Gemma 4 models, and their models also perform faster - but those Gemma 4 versions (MOE 26a4b and dense 31b) are no slouches. Here's a little dashboard/web site example - this one took a 4 iterations to complete, including the full-featured datatable, all the charts, animations, UI controls, etc.:

Those results absolutely blow away the sorts of output we got from much bigger models such as GPT-OSS:120b, just a few months ago - and those results are much better than those of other current models which require far beefier hardware.

For example, this is the sort of disappointing HTML layout output which GPT-OSS:120b would produce - it's a functional start, but you'd want to put in lots of additional work to polish up styling for a public facing web site:

The output of the miniscule local Qwen 3.6 35a3 model even approaches the quality of huge bleeding edge models which require datacenter-class hardware to run. For example, here's a similar stock web site example made by mimo25pro (via Openrouter):

Those are all purely visual layout examples, but the same quality improvement is apparent in logic and other classes of work. 35a3 is capable of producing fantastic Internet research results, for example.

One benefit of the Gemma series is that they've produced E2B and E4B (Edge models), optimized for on-device deployment with a 128K context window and native support for text, images, and audio. The E2B (2.3B effective/5.1B total parameters) and E4B (4.5B effective/8B total parameters) are designed for mobile devices and edge computing. You can run them on Android and iOS phones in the Google AI Edge Gallery app, in LM Studio Mobile (iOS Beta), and in Off Grid (Android). It's recommended the you use E2B for phones with 6GB RAM and E4B for 8GB+ RAM.

Keep in mind, all of these Qwen and Gemma models are open source Apache 2.0 licensed. This is a great shift from Google's previous restrictive license for the Gemma family. You can now use these models like you own them completely.

Along with Qwen 3.6, I keep the 2 biggest Gemma 4 versions, along with a stable of other bigger models loaded in LM Studio, to offer different perspectives on tasks. This includes some bigger models which run slowly, even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the fast Qwen 3.6 MOE model, even to those larger/slower alternatives.

Every year, I imagine/predict what's clearly coming down the pipe, and what sort of progress we should expect. The industry has continued to satisfy expectations - I can't wait to see what position we'll be in next year. If things like Turboquant and 1 bit models mature, we may see capability in small models that can be run on DGX Spark, for example, which is currently only available in huge releases like Kimi, Deepseek, GLM, etc. (which currently still require datacenter-scale compute). Even if progress were to stop entirely, we've actually got truly game-changing tools available right now. Hopefully, as they say, this is the dumbest our models will ever be...

Version 4 - May 09, 2026 at 19:25

Ok, if you haven't tried the newest qwen3.6 and gemma4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I added some more demo examples created by Qwen 3.6 35a3 and Gmma 4 26a4, to the quick start at https://aibynick.com/thread/29

The more I use Qwen 3.6 to write actually useful code and to perform useful agentic tasks, the more it becomes blindly obvious that that model is heads and shoulders above any other small locally usable models out there. It's truly incredible what both the Qwen 3.6 models can accomplish (35a3b is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but that one does need a fast GPU for anything but the smallest workloads. 27b is particularly useful when you use it to help out on portions of tasks which 35a3 gets stuck on. Use 35a3 to productively accomplish 95% of a workflow, then call in 27b (or some other big model) when extra help is needed.

The Qwen 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 & 3090 class GPUs, on Apple Mac processors with less RAM, etc. I run it at both 8bit and 6bit quantization on the Strix Halo and DGX Spark machines, but it also seems to produce fantastic quality output at 4 bit quantization on smaller GPUs.

These demos were built with Qwen 3.6 35a3 on a Strix Halo laptop (using q6 quant):

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM (using 4 bit quant):

Qwen tends to best even the local Gemma 4 models in both quality and performance speed, but the Gemma 4 versions (MOE 26a4b and dense 31b) are also no slouches. Here's a little dashboard/web site example:

Those results absolutely blow away the sorts of output we got from much bigger models such as GPT-OSS:120b, just a few months ago - and it's better than many other current models which require much beefier hardware.

For example, this is the sort of disappointing HTML layout output which GPT-OSS:120b would produce (fine as a start, but you'd want to put in lots of additional work to polish up styling for a public facing web site):

The output of the miniscule local Qwen 3.6 35a3 model even approaches the quality of huge bleeding edge models which require datacenter-class hardware to run. For example, here's a similar stock web site made by mimo25pro (via Openrouter):

Those are pure visual layout examples, but the same quality improvement is apparent in logic and other classes of work. 35a3 is capable of producing fantastic Internet research results, for example.

I keep the Gemma 4 versions, along with a stable of other bigger models loaded in LM Studio, to offer different perspectives on tasks. This includes some bigger models which run slowly, even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the fast Qwen 3.6 MOE model, to those large/slow alternatives.

Version 3 - May 09, 2026 at 18:17

Ok, if you haven't tried the newest qwen3.6 and gemma4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I added some more demo examples created by Qwen 3.6 35a3 and Gmma 4 26a4, to the quick start at https://aibynick.com/thread/29

The more I use Qwen 3.6 to write actually useful code and to perform useful agentic tasks, the more it becomes blindly obvious that that model is heads and shoulders above any other small locally usable models out there. It's truly incredible what both the Qwen 3.6 models can accomplish (35a3b is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but that one does need a fast GPU for anything but the smallest workloads. 27b is particularly useful when you use it to help out on portions of tasks which 35a3 gets stuck on. Use 35a3 to productively accomplish 95% of a workflow, then call in 27b (or some other big model) when extra help is needed.

The Qwen 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 & 3090 class GPUs, on Apple Mac processors with less RAM, etc. I run it at both 8bit and 6bit quantization on the Strix Halo and DGX Spark machines, but it also seems to produce fantastic quality output at 4 bit quantization on smaller GPUs.

These demos were built with Qwen 3.6 35a3 on a Strix Halo laptop (using q6 quant):

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM (using 4 bit quant):

That absolutely blows away the sorts of output we got from much bigger models such as GPT-OSS:120b, and many other models which require much beefier hardware. For example, this is the sort of disappointing HTML layout output which GPT-OSS:120b would produce (fine as a start, but you'd want to put in lots of additional work to polish up styling for a public facing web site):

The output of the miniscule local Qwen 3.6 35a3 model even approaches the quality of huge bleeding edge models which require datacenter-class hardware to run. Here's an example made by mimo25pro (via Openrouter):

http://1y1z.com:8284/demo-website-mimo25pro/

Those are pure visual layout examples, but the same quality improvement is apparent in logic and other classes of work. 35a3 is capable of producing fantastic Internet research results, for example.

Qwen tends to best even the local Gemma 4 models in both quality and performance speed, but the Gemma 4 versions (MOE 26a4b and dense 31b) are also no slouches. Here's a little dashboard/web site example:

http://1y1z.com:8284/flashy-site--gemma4-26ba4/

I keep Gemma 4 versions, along with a stable of other bigger models loaded in LM Studio, to offer different perspectives on tasks. This includes some bigger models which run slowly, even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the fast Qwen 3.6 MOE model, to those larg/slow alternatives.

Version 2 - May 09, 2026 at 17:13

I added some more demo examples created by qwen 3.6 35a3, to the quick start at https://aibynick.com/thread/29

The more I use this model to write actually useful code and to perform useful agentic tasks, the more it becomes just blindly obvious that this model is heads and shoulders above any other small local model out there. It's truly incredible what both the qwen 3.6 models can accomplish (35a3 is an MOE model, and 27 is a dense architecture). The 27b dense version is even more capable than 35a3, but does need a fast GPU for anything but the smallest workloads. It's particularly useful when you use it to help out with portions of tasks which the 35a3 gets stuck on. Use 35a3 to do 95% of a workflow, then call in 27b when extra help is needed.

The 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 and 3090 class GPUs, on Apple Mac processors with less RAM, etc.

These were built with qwen 3.6 35a3 on a Strix Halo laptop:

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM:

That absolutely blows away the sorts of output we got from much bigger models such as GPT-OSS:120b, which require much beefier hardware. For example, this is the sort of disappointing output which GPT-OSS:120b would produce (fine as a start, but you'd want to put in lots of additional work for a public facing site):

The output of that tiny qwen 3.6 35a3 model even approaches the quality of huge bleeding edge models which require datacenter-class hardware to run. Here's an example made by mimo25pro (on Openrouter):

http://1y1z.com:8284/demo-website-mimo25pro/

It's incredible to see Qwen besting the Gemma 4 models in both quality and performance speed. If you haven't tried the newest qwen models, you need to give them a shot. I keep a stable of other models loaded to offer different perspectives - especially some bigger models that run slowly even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the 3.6 qwen models, even compared to those very large and slow alternatives. The quen 3.6 models feel like the first truly capable, generally useful LLMs for small machines.

Version 1 - May 09, 2026 at 15:56

I added some more demo examples created by qwen 3.6 35a3, to the quick start at https://aibynick.com/thread/29

The more I use this model to write actually useful code and to perform useful agentic tasks, the more it becomes just blindly obvious that this model is heads and shoulders above any other small local model out there. It's truly incredible what both the qwen 3.6 models can accomplish (35a3 is an MOE model, and 27 is a dense architecture). The 27b dense version is even more capable than 35a3, but does need a fast GPU for anything but the smallest workloads.

The 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 and 3090 class GPUs, on Apple Mac processors with less RAM, etc.

These demos were built with qwen 3.6 35a3 on a Strix Halo laptop:

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM:

That absolutely blows away the sorts of output we got from much bigger models such as GPT-OSS:120b, which require much beefier hardware:

And the output of that tiny qwen 3.6 35a3 model even approaches the quality of huge bleading edge models which need data center hardware to run. Here's an example made by mimo25pro (on Openrouter):

http://1y1z.com:8284/demo-website-mimo25pro/

It's incredible to see Qwen besting the Gemma 4 models in both quality and performance speed. If you haven't tried the newest qwen models, you need to give them a shot. I keep a stable of other models loaded to offer different perspectives - especially some bigger models that run slowly even on DGX Spark and Strix Halo hardware - but the 3.6 qwen models feel like the first truly capable, generally useful LLMs for small machines.