
Current version: May 26, 2026 at 00:50

For most of my production work, I'm still using the zip file project management routine with ChatGPT 5.5, because I've come to trust that model's reliably high-quality code output, and because I've never hit a rate limit through all the massive volumes of work I've gotten done using that routine (that workflow is pinned on the home page of this forum). That solution is an absolutely stunning bargain for a massive volume of frontier-quality inference.

I've also chosen to use Deepseek-4-pro with Pi for agentic tasks which I run directly on servers. That combo has been super cheap and effective (for example, for agents that respond to user questions about huge collections of documents stored in local databases - that's come to be a killer use case).

I do also use Deepseek and Pi to develop software which requires many unattended iterations and/or which benefits from many interactions with the server environment (for example, to build agents which crawl & interact with 3rd party user interfaces). That sort of work typically involves thousands of iterations, so the benefit of automated agentic interaction & debug cycle updates during the development process is huge.

But with all the recent quality improvements in smaller models, MTP performance boosts, etc., especially with Qwen 3.6, it finally feels like self-hosted LLM inference is genuinely usable on smaller-than-datacenter-class GPU hardware. You can run those most recent models well on many variations of consumer grade hardware, and I'm beginning to really trust them to get production work done, especially when used with agentic harnesses that avoid burning piles of unnecessary tokens. Pi has been especially functional as a generally useful lightweight/configurable agent, and custom coded agents for custom tasks are getting to be a more viable solution when even greater efficiency is required.

So, I'll absolutely continue to use the frontier models via APIs, and continue to use Openrouter, as long as those options stay so inexpensive, but I no longer feel that I absolutely need those services to maintain many of the foundational software development routines I've become accustomed to with frontier LLMs.

I do fully expect investment dollars to run out for companies like OpenAI & Anthropic, and at that point, LLM APIs will become more expensive & rate limited. I think many companies which originally planned to make their revenue primarily by providing LLM API services will pivot into renting commodity data center hardware, because LLM research will eventually yield less valuable software products (just look at what Deepseek keeps doing...). I also expect that some of the companies currently providing these services at a loss will eventually get forced out of business if they haven't found other ways to profit by replacing human labor with AI services (that's the only way to justify the current market caps for many of those companies - we'll have to see a huge percentage of human labor replaced by LLMs for most of these companies to reach their revenue goals).

When those eventualities do come to be, I'll certainly invest in more hardware to run the full versions of whichever LLMs are best at that point (the roles that Deepseek, Kimi, and GLM models currently occupy), but the existing smaller models are already absolutely usable for real development work, right now, in the current environment.

I still do lots of professional CRUD & common UI development work, and Qwen 3.6 can accomplish those sorts of goals without any trouble at all, just as well as frontier models. CRUD will continue to be fundamentally important to business operations, and I'd have no worries at all if I were never able to touch ChatGPT or Deepseek again for that sort of work.

But that's just one piece of the puzzle. Beyond my own development work, the other thing I see on the horizon is that more and more of my clients are considering buying a GX10 or something similar for their agentic routines which burn lots of tokens. All my clients have been setting up agents to complete tasks that significantly improve employee productivity. Even with cheap LLM APIs (as fantastically inexpensive and capable as Deepseek-4-pro currently is), some of those heavy workflows do still get to be expensive, especially as employees learn to rely on them more often. It makes sense to spend a few thousand dollars once on a server to run simple productivity-enhancing agentic routines which cost $100+ per week in API fees. For any workflows which cost more than that, two or more GX10 machines could be paid off in a single year.
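As a rough sketch of that payback math: the $100 per week figure comes from this post, but the server price below is a placeholder assumption (not a quoted GX10 price), and electricity, setup time, and maintenance are ignored.

```python
import math

def payback_weeks(hardware_cost: float, weekly_api_cost: float) -> int:
    """Weeks of avoided API fees needed to cover a one-time hardware purchase."""
    if weekly_api_cost <= 0:
        raise ValueError("weekly_api_cost must be positive")
    return math.ceil(hardware_cost / weekly_api_cost)

def machines_paid_off(hardware_cost: float, weekly_api_cost: float, weeks: int = 52) -> int:
    """How many machines the avoided API fees would fully cover in `weeks` weeks."""
    return int((weekly_api_cost * weeks) // hardware_cost)

# Hypothetical price for one small inference server (an assumption for illustration).
ASSUMED_SERVER_PRICE = 4000.0

# A workflow costing $100/week in API fees (figure from the post).
print(payback_weeks(ASSUMED_SERVER_PRICE, 100.0))      # 40 weeks, i.e. under a year

# A heavier workflow costing $200/week (hypothetical).
print(machines_paid_off(ASSUMED_SERVER_PRICE, 200.0))  # 2 machines in one year
```

Swap in a real quote for the hardware and your own weekly API bill to see where the break-even point actually lands.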

Additionally, most of my clients want to start using agentic tasks to process PHI and other sensitive data, and those sorts of workflows require a HIPAA compliant LLM provider. BastionGPT looks like a good immediate drop-in solution for some of my clients, but $18/$45 in/out per million tokens is far more expensive than Deepseek (50x more!). At that rate, some of my clients could pay off multiple clusters of significant GPU hardware, to run multiple simultaneous open source frontier models, in just a single year. The bottom line earnings are fantastic when the labor savings add up to the equivalent of several employees' salaries spent on busy work. Eliminating mind-numbing busy work makes employees happier, and it increases total capacity. What business isn't searching for ways to earn more revenue without increasing expenses, and to stretch payroll dollars farther? Replacing human labor with agentic work, wherever possible, is a pattern we'll see more of everywhere. As more employers realize that a locally hosted LLM server can free the time of multiple workers, we'll see more businesses buying GPUs. It's easy math for any business that has a practical LLM application.
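To put numbers on that comparison, here is a minimal sketch of the per-token cost math. The $18/$45 prices and the 50x ratio are taken from this post; the monthly token volumes are hypothetical placeholders, so substitute real usage before drawing conclusions.

```python
def token_cost(input_tokens: int, output_tokens: int,
               in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost for a token volume, given prices per million tokens."""
    return (input_tokens / 1_000_000) * in_price_per_m \
         + (output_tokens / 1_000_000) * out_price_per_m

# Prices quoted in the post for the HIPAA compliant provider ($ per million tokens).
HIPAA_IN_PRICE, HIPAA_OUT_PRICE = 18.0, 45.0

# Hypothetical monthly volume for a heavy agentic workload (assumption).
MONTHLY_INPUT_TOKENS = 500_000_000
MONTHLY_OUTPUT_TOKENS = 50_000_000

monthly = token_cost(MONTHLY_INPUT_TOKENS, MONTHLY_OUTPUT_TOKENS,
                     HIPAA_IN_PRICE, HIPAA_OUT_PRICE)
yearly = monthly * 12

# The post's "50x more" ratio, applied to estimate the cheaper API's yearly bill.
yearly_cheap_api = yearly / 50

print(f"HIPAA provider: ${monthly:,.0f}/month, ${yearly:,.0f}/year")
print(f"Same volume at 1/50th the price: ${yearly_cheap_api:,.0f}/year")
```

The yearly figure is the budget to compare against a one-time hardware purchase for workloads that must stay HIPAA compliant.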

Well before they implement expensive GPU clusters, many of my clients will need a fast, capable, secure, low-power self-hosted LLM alternative for all sorts of uses, so I'm relieved to have some relatively inexpensive hardware platforms such as the Asus GX10 and Strix Halo machines, as well as models such as Qwen 3.6 35B-A3B MoE and 27B dense, already tested and ready to go.

I've written lots about those hardware and model choices in other forum topics, so I'll leave it at this: the current self-hosted solutions are not just curiosities anymore; they can actually get real work done, quite impressively. I like having a backup for my frontier model development routines, and they've been particularly useful when I travel away from Internet access, but much more importantly, I'm already seeing how we'll be able to use those systems in practical agentic roles within the AI infrastructure of businesses I work with.
