For writing code, expect Qwen 3.6 35b and Gemma 4 26b to feel a bit retro, perhaps reminiscent of GPT4o in some ways. Benchmarks show GPT4o and Qwen 3.6 35b to be similarly capable, though I have a sense that some of Qwen's capabilities are more modern. That self-hosted Qwen model won't always write perfect code on the first shot like the newest frontier models, but it handles that situation better than GPT4o ever could, because an agentic harness like Pi can iterate autonomously through debug revisions.
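To make that concrete, here's a minimal sketch of the iterate-on-failures pattern such a harness automates. This is not Pi's actual implementation; the port, helper names, and use of pytest are all placeholder assumptions, and it assumes a local OpenAI-compatible server (e.g. llama.cpp's llama-server) is running:

```python
import subprocess
import requests  # assumes a local OpenAI-compatible server is running

def ask_model(prompt: str,
              url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """Send one prompt to a locally hosted model. The port is a
    placeholder; most local runners expose this OpenAI-style endpoint."""
    resp = requests.post(url, json={
        "model": "local",  # typically ignored by single-model local servers
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=600)
    return resp.json()["choices"][0]["message"]["content"]

def debug_loop(source_file: str, max_rounds: int = 5) -> bool:
    """Run the tests, feed any failures back to the model, write its
    revision to disk, and repeat until green or out of rounds."""
    for _ in range(max_rounds):
        result = subprocess.run(["pytest", "-x"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True  # tests pass; done
        with open(source_file) as f:
            code = f.read()
        revision = ask_model(
            f"These tests failed:\n{result.stdout}\n\n"
            f"Current code:\n{code}\n\n"
            "Return only the corrected file contents."
        )
        with open(source_file, "w") as f:
            f.write(revision)
    return False  # still failing after max_rounds
```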
Qwen 3.6 35b nearly always produces more effective output than Gemma 4 26b, but it's sometimes nice to have one model check the other's work, to help break out of loops when one model can't seem to solve an issue. The dense Qwen 3.6 27b model is significantly more capable than either the Qwen or Gemma MoE models, but it runs much more slowly on small GPUs (it's even slow on the Strix Halo).
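If you want to script that second-opinion pattern rather than pasting between chats by hand, here's one hedged way to do it, reusing the hypothetical ask_model helper from the sketch above, with the second model served on its own placeholder port:

```python
def cross_check(task: str, stuck_answer: str,
                reviewer_url: str = "http://localhost:8081/v1/chat/completions") -> str:
    """Have a second model critique the first model's stuck attempt,
    then feed that critique back to the first model for a revision."""
    critique = ask_model(
        f"Task:\n{task}\n\nProposed solution:\n{stuck_answer}\n\n"
        "Briefly point out any bugs or wrong assumptions.",
        url=reviewer_url,  # e.g. Gemma running on its own port
    )
    return ask_model(
        f"Task:\n{task}\n\nYour earlier attempt:\n{stuck_answer}\n\n"
        f"A reviewer commented:\n{critique}\n\nRevise your solution."
    )
```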
Basically every other model currently available for GPUs in the 16-24 GB VRAM class will produce lower quality output at the same speed as those current models.
Also be aware that those models don't have the deep world knowledge that GPT4o and other huge models have. Nothing that runs on small consumer GPUs will have that sort of tremendous world knowledge (endless info about obscure topics). There's only so much information that can be stored in 35 billion parameters (models like Kimi k2.6 have over a trillion parameters, and each of its 384 active experts is bigger than the entire Qwen 27b model!)
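For a rough sense of scale, here's the back-of-the-envelope weight-storage math. The bits-per-weight figures are approximate averages (FP16 is exact; the ~4.5 bits is a typical Q4-class quantization average), not benchmarks:

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of a model's weights on disk or in VRAM."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_size_gb(35, 16))     # FP16: ~70 GB, far beyond a 24 GB GPU
print(model_size_gb(35, 4.5))    # ~Q4 quant: ~20 GB, fits in 24 GB VRAM
print(model_size_gb(1000, 4.5))  # a 1T-param model: ~560 GB, not happening
```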
Those little Qwen and Gemma models are really optimized for coding and agentic tasks.
Finally, be aware that you want an agentic harness that sends the smallest possible token overhead with each prompt, to avoid burning tons of extra tokens every time. Pi is the king in that regard for use with local LLMs; if you want to do local inference, get to know Pi. Hermes, for example, requires well over 100K of context just to operate. Use a heavy agent like that with a self-hosted LLM on a small consumer GPU and you'll be waiting constantly just for your prompts to process.
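The back-of-the-envelope math shows why. The prompt-processing speed below is an assumed figure for a small consumer GPU (measure your own with a tool like llama-bench), and the lean-harness prompt size is illustrative:

```python
def prefill_seconds(context_tokens: int, prompt_speed_tok_s: float) -> float:
    """Rough time to process (prefill) a prompt before any output appears."""
    return context_tokens / prompt_speed_tok_s

# Assuming ~500 tok/s prompt processing on a small consumer GPU:
print(prefill_seconds(100_000, 500))  # heavy harness: ~200 s per prompt
print(prefill_seconds(8_000, 500))    # lean ~8K prompt (illustrative): ~16 s
```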