Post History

Current version by Nick Antonaccio

Current Version (May 26, 2026 at 00:50)

For most of my production work, I'm still using the zip file project management routine with ChatGPT 5.5, because I've come to trust the reliable, high-quality code output of that model, and because I've still never hit a rate limit through all the massive volumes of work I've gotten done using that routine (that workflow is pinned on the home page of this forum). That solution is an absolutely stunning bargain for a massive volume of frontier-quality inference.

I've also chosen to use Deepseek-4-pro with Pi for agentic tasks which I run directly on servers. That combo has been super cheap and effective (for example, for agents that respond to user questions about huge collections of documents stored in local databases - that's come to be a killer use case).

I also use Deepseek and Pi to develop software which requires many unattended iterations and/or which benefits from many interactions with the server environment (for example, to build agents which crawl & interact with 3rd party user interfaces). That sort of work typically involves thousands of iterations, so having the agent run the interaction and debug cycles automatically during development is a huge benefit.

But with all the recent quality improvements in smaller models, MTP (multi-token prediction) performance boosts, etc., especially with Qwen 3.6, it finally feels like self-hosted LLM inference is genuinely usable on smaller-than-datacenter-class GPU hardware. You can run those most recent models well on many variations of consumer grade hardware, and I'm beginning to really trust them to get production work done, especially when used with agentic harnesses that avoid burning piles of unnecessary tokens. Pi has been especially functional as a generally useful lightweight/configurable agent, and custom coded agents for custom tasks are becoming a more viable solution when even greater efficiency is required.

So, I'll absolutely continue to use the frontier models via APIs, and continue to use Openrouter, as long as those options stay so inexpensive, but I no longer feel that I absolutely need those services to maintain many of the foundational software development routines I've become accustomed to with frontier LLMs.

I do fully expect the investment dollars to run out for companies like OpenAI & Anthropic, and at that point, LLM APIs will become more expensive & rate limited. I think many companies which originally planned to make their revenue primarily by providing LLM API services will pivot into renting commodity data center hardware, because LLM research will eventually yield less valuable software products (just look at what Deepseek keeps doing...). I also expect that some of the companies currently providing these services at a loss will eventually be forced out of business if they haven't found other ways to profit by replacing human labor with AI services. That's the only way to justify the current market caps for many of those companies: we'll have to see a huge percentage of human labor replaced by LLMs for most of them to reach their revenue goals.

When those eventualities do come to be, I'll certainly invest in more hardware to run the full versions of whichever LLMs are best at that point (the roles that Deepseek, Kimi, and GLM models currently occupy), but the existing smaller models are already absolutely usable for real development work, right now, in the current environment.

I still do lots of professional CRUD & common UI development work, and Qwen 3.6 can accomplish those sorts of goals without any trouble at all, just as well as frontier models. CRUD will continue to be fundamentally important to business operations, and I'd have no worries at all if I were never able to touch ChatGPT or Deepseek again for that sort of work.

But that's just one piece of the puzzle. Beyond my own development work, the other thing I see on the horizon is that more and more of my clients are considering buying a GX10 or something similar for their agentic routines which burn lots of tokens. All my clients have been setting up agents to complete tasks that significantly improve employee productivity. Even with cheap LLM APIs (as fantastically inexpensive and capable as Deepseek-4-pro currently is), some of those heavy workflows do still get to be expensive, especially as employees learn to rely on them more often. It makes sense to spend a few thousand dollars once on a server to run simple productivity-enhancing agentic routines which cost $100+ per week in API fees (that's $5,200+ per year). For any workflows which cost more than that, 2 or more GX10 machines could be paid off in a single year.
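The payback math above can be sketched in a few lines of Python. This is only a rough illustration: the $3,000 unit price is a placeholder for "a few thousand dollars" (not a quoted GX10 price), the function names are made up for this example, and electricity and maintenance are ignored unless you pass a weekly running cost.

```python
import math

def payback_weeks(hardware_cost, weekly_api_cost, weekly_running_cost=0.0):
    """Weeks until avoided API fees cover a one-time hardware purchase.

    Returns None when the server costs as much (or more) to run per week
    than the API fees it replaces, so it never pays for itself.
    """
    weekly_savings = weekly_api_cost - weekly_running_cost
    if weekly_savings <= 0:
        return None
    return math.ceil(hardware_cost / weekly_savings)

def machines_paid_off_in_a_year(unit_cost, weekly_api_cost, weekly_running_cost=0.0):
    """How many whole machines one year (52 weeks) of avoided API fees covers."""
    yearly_savings = 52 * (weekly_api_cost - weekly_running_cost)
    return max(0, int(yearly_savings // unit_cost))

# ASSUMPTION: $3,000 per machine is a hypothetical stand-in price.
UNIT_COST = 3000

print(payback_weeks(UNIT_COST, 100))                # 30 weeks at $100/week
print(machines_paid_off_in_a_year(UNIT_COST, 100))  # 1 machine in a year
print(machines_paid_off_in_a_year(UNIT_COST, 150))  # 2 machines in a year
```

With that placeholder price, a $100/week workflow covers one machine in about 30 weeks, and a workflow costing $150/week or more covers two within the year. Swap in real hardware quotes and running costs before relying on the result.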

Additionally, most of my clients want to start using agentic tasks to process PHI and other sensitive data, and those sorts of workflows require a HIPAA compliant LLM provider. BastionGPT looks like a good immediate drop-in solution for some of my clients, but $18/$45 in/out per million tokens is far more expensive than Deepseek (50x more!). At that rate, some of my clients could pay off multiple clusters of significant GPU hardware, to run multiple simultaneous open source frontier models, in just a single year. The bottom line earnings are fantastic when the busy work being eliminated adds up to the equivalent of several employees' salaries. Eliminating mind-numbing busy work makes employees happier, and it increases total capacity. What business isn't searching for ways to earn more revenue without increasing expenses, and to stretch payroll dollars farther? Replacing human labor with agentic work, wherever possible, is a pattern we'll see more of everywhere. As more employers realize that a locally hosted LLM server can free up the time of multiple workers, we'll see more businesses buying GPUs. It's easy math for any business that has a practical LLM application.
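To put the per-token pricing above into dollars, here is a minimal sketch. The $18/$45 rates and the 50x ratio come from the post. The monthly token volumes are hypothetical, the function name is made up for this example, and the Deepseek figure is only what the 50x ratio implies, not a quoted price.

```python
def api_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of a workload, given prices per million input/output tokens."""
    return ((input_tokens / 1_000_000) * price_in_per_m
            + (output_tokens / 1_000_000) * price_out_per_m)

# ASSUMPTION: a hypothetical heavy workflow of 100M input and 20M output
# tokens per month. Real volumes vary widely.
monthly_in, monthly_out = 100_000_000, 20_000_000

# Rates stated in the post: $18 in / $45 out per million tokens.
hipaa_monthly = api_cost(monthly_in, monthly_out, 18, 45)
hipaa_yearly = 12 * hipaa_monthly

# What the post's "50x more" ratio would imply for the cheaper API.
implied_cheap_monthly = hipaa_monthly / 50

print(hipaa_monthly)          # 2700.0
print(hipaa_yearly)           # 32400.0
print(implied_cheap_monthly)  # 54.0
```

At those assumed volumes the HIPAA-compliant rate comes to about $32,400 per year for a single workflow, which is the scale at which buying hardware outright starts to look reasonable.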

Well before implementing expensive GPU clusters, having a fast, capable, secure, low-power self-hosted LLM alternative for all sorts of uses will soon become a critical requirement for many of my clients, so I'm relieved to have some relatively inexpensive hardware platforms such as the Asus GX10 and Strix Halo machines, as well as models such as Qwen 3.6 35a3b MOE and 27b dense, already tested and ready to go.

I've written lots about those hardware and model choices in other forum topics, so I'll leave it at this: the current self-hosted solutions are not just curiosities anymore; they can actually get real work done quite impressively. I like having a backup for my frontier model development routines, and they've been particularly useful when I travel away from Internet access, but much more importantly, I'm already seeing how we'll be able to use those systems in practical agentic roles within the AI infrastructure of businesses I work with.

Previous Versions
Version 4 (May 26, 2026 at 00:50)

For most of my production work, I'm still using the zip file project management routine with ChatGPT 5.5, because I've come to trust the reliable high quality code output of that model, and because I've still never experienced a rate limit, through all the massive volumes of work I've gotten done using that routine (that workflow is pinned on the home page of this forum). That solution is an absolutely stunning bargain, for a massive volume of frontier quality inference.

I've also chosen to use Deepseek-4-pro with Pi for agentic tasks which I run directly on servers. That combo has been super cheap and effective (for example, for agents that respond to user questions about huge collections of documents stored in local databases - that's come to be a killer use case).

I do also use Deepseek and Pi to develop software which requires many unattended iterations and/or which benefits from many interactions with the server environment (for example, to build agents which crawl & interact with 3rd party user interfaces). That sort of work typically involves thousands of iterations, so the benefit of automated agentic interaction & debug cycle updates during the development process is huge.

But with all the recent quality improvements in smaller models, MTP performance boosts, etc., especially with Qwen 3.6, it finally feels like self-hosted LLM inference is genuinely usable on smaller-than-datacenter-class GPU hardware. You can run those most recent models well on many variations of consumer grade hardware, and I'm beginning to really trust them to get production work done, especially when used with agentic harnesses that avoid burning piles of unnecessary tokens. Pi has been especially functional as a generally useful lightweight/configurable agent, and custom coded agents for custom tasks are getting to be a more viable solution when even greater efficiency is required.

So, I'll absolutely continue to use the frontier models via APIs, and continue to use Openrouter, as long as those options stay so inexpensive, but I no longer have a sense that I absolutely need those services, to maintain many of the foundational software development routines I've become accustomed to with frontier LLMs.

I do fully expect investment dollars to get exhausted, for companies like OpenAI & Anthropic, and at that point, LLM APIs will become more expensive & rate limited. I think many companies which originally planned to make their revenue primarily by providing LLM API services, will pivot into renting commodity data center hardware, because LLM research will eventually yield less valuable software products (just look at what Deepseek keeps doing...). I also expect that some of the companies currently providing these services at a loss will eventually get forced out of business, if they haven't found other ways to profit by replacing human labor with AI services (that's the only way to justify the current market caps for many of those companies - we'll have to see a huge percentage of human labor replaced by LLMs, for most of these companies to reach their revenue goals).

When those eventualities do come to be, I'll certainly invest in more hardware to run the full versions of whichever LLMs are best at that point (the roles that Deepseek, Kimi, and GLM models currently occupy), but the existing smaller models are already absolutely usable for real development work, right now, in the current environment.

I still do lots of professional CRUD & common UI development work, and Qwen 3.6 can accomplish those sorts of goals without any trouble at all, just as well as frontier models. CRUD will continue to be fundamentally important to business operations, and I'd have no worries at all if I was never able to touch ChatGPT or Deepseek again for that sort of work.

But that's just one piece of the puzzle. Beyond my own development work, the other thing I see on the horizon, is that more and more of my clients are considering buying a GX10 or something similar for their agentic routines which burn lots of tokens. All my clients have been setting up agents to complete tasks that significantly improve employee productivity. Even with cheap LLM APIs (as fantastically inexpensive and capable as Deepseek-4-pro currently is), some of those heavy workflows do still get to be expensive, especially as employees learn to rely on them more often. It makes sense to spend a few thousand dollars once to buy a server, to complete simple productivity enhancing agentic routines which cost $100+ per week in API fees. For any workflows which cost more than that, 2 or more GX10 machines could be paid off in a single year.

Additionally, most of my clients want to start using agentic tasks to process PHI and other sensitive data, and those sorts of workflows require a HIPAA compliant LLM provider. BastionGPT looks like a good immediate drop-in solution for some of my clients, but $18/$45 in/out per million tokens is far more expensive than Deepseek (50x more!). At that rate, some of my clients could pay off multiple clusters of significant GPU hardware, to run multiple simultaneous open source frontier models, in just a single year. The bottom line earnings are fantastic, when the human labor savings equate to eliminating the busy work of several collective employees' salaries.

But well before implementing GPU clusters, having a fast, capable, secure, low-power self-hosted LLM alternative for all sorts of uses will soon become a critical requirement for many of my clients, so I'm relieved to have some relatively inexpensive hardware platforms such as the Asus GX10 and Strix Halo machines, as well as models such as Qwen 3.6 35a3b MOE and 27b dense, already tested and ready to go.

I've written lots about those hardware and model choices in other forum topics, so I'll leave it at this: the current self-hosted solutions are not just curiosities anymore, they can actually get real work done quite impressively. I like having a backup for my frontier model development routines, and they've been particularly useful when I travel away from Internet access, but much more importantly, I'm already seeing how we'll be able to use those systems in practical agentic roles within the AI infrastructure of businesses I work with.

Version 3 (May 26, 2026 at 00:31)

For most of my production work, I'm still using the zip file project management routine with ChatGPT 5.5, because I've come to trust the reliable high quality code output of that model, and because I've still never experienced a rate limit, though all the massive volumes of work I've gotten done using that routine (that workflow is pinned on the home page of this forum). That solution is an absolutely stunning bargain, for such a tremendous volume of frontier quality inference.

I've also chosen to use Deepseek-4-pro with Pi for agentic tasks which I run directly on servers. That combo has been super cheap and effective (for example, for agents that respond to user questions about huge collections of documents in a local database). I do also use Deepseek and Pi to develop software which requires many unattended iterations and/or which benefits from many hands-on interactions with the server environment (for example, to build agents which crawl & interact with 3rd party user interfaces). That sort of work can involve thousands of iterations, so the benefit of automated agentic interaction during the development process is huge.

But with all the recent quality improvements in smaller models, MTP performance boosts, etc., especially with Qwen 3.6, finally feel like self-hosted LLM inference is genuinely usable on smaller-than-datacenter-class GPU hardware. You can run those most recent models well on many variations of consumer grade hardware, and I'm beginning to really trust them to get production work done, especially when used with agentic harnesses that don't burn piles of unnecessary tokens. Pi has been especially functional as a generally useful lightweight/configurable agent, and custom coded agents for custom tasks are getting to be a more viable solution when even greater efficiency is required.

I'll absolutely continue to use the frontier models via APIs, and continue to use Openrouter, as long as those options stay so inexpensive, but I no longer have a sense that I absolutely need those services, to maintain the sort of professional software development routines I've become accustomed to with frontier LLMs.

I do fully expect investment dollars to get exhausted, for companies like OpenAI & Anthropic, and at that point, LLM APIs will become more expensive & rate limited. I think many companies which originally planned to make their revenue by providing LLM API services, will pivot into renting commodity data center hardware, because LLM research will eventually yield less valuable software products. I also expect that some of the companies currently providing these services at a loss will eventually get forced out of business, if they haven't found ways to profit by replacing human labor with AI services (that's the only way to justify the current market caps for many of those companies).

When those eventualities do come to be, I'll certainly invest in more hardware to run the full versions of whichever LLMs are best at that point (like the current Deepseek, Kimi, and GLM models), but the existing smaller models are absolutely usable for real development work, right now, in the current environment. I still do lots of professional CRUD & common UI development work, and Qwen 3.6 can accomplish those goals without any trouble at all, just as well as frontier models.

But that's just one piece of the puzzle. Beyond my own development work, the other thing I see on the horizon, is that my clients are more often considering buying a GX10 or something similar for their agentic routines which burn lots of tokens. All my clients have been setting up agents to complete tasks which significantly improve employee productivity. Even with cheap LLM APIs (as fantastically inexpensive and capable as Deepseek-4-pro currently is, for example), some of those heavy workflows do still get to be expensive, especially as employees learn to rely on them more often. It makes sense to spend a few thousand dollars once to buy a server, for example, to complete agentic routines which cost $100+ per week in API fees. For any workflows which cost more than that, 2 or more GX10 machines could be paid off in a single year.

Additionally, most of my clients want to start using agentic tasks to process PHI and other sensitive data, and those sorts of workflows require a HIPAA compliant LLM provider. BastionGPT looks like a good immediate drop-in solution for some of my clients, but $18/$45 in/out per million tokens is far more expensive than Deepseek (50x more!). At that rate, some of my clients could pay off multiple clusters of hardware, to run multiple simultaneous open source frontier models, in a single year.

But well before that, having a fast, capable, secure, low-power self-hosted LLM alternative for all sorts of uses will soon become a critical requirement for many of my clients, so I'm relieved to have some relatively inexpensive hardware platforms such as the Asus GX10 and Strix Halo machines, as well as models such as Qwen 3.6 27b and 35a3b already tested and ready to go.

I've written lots about those hardware and model choices in other forum topics, so I'll leave it at this: the current self-hosted solutions are not just curiosities anymore, they can actually get real work done quite impressively. I like having a backup for my frontier model development routines, and they've been particularly useful when I travel away from Internet, but much more importantly, I'm already seeing how we'll be able to use them in practical agentic roles within the AI infrastructure of businesses I work with.

Version 2 (May 26, 2026 at 00:09)

For most of my production work, I'm still using the zip file project management routine with ChatGPT 5.5, because I've come to trust 100% the ridiculously reliable high quality code output of that model, and because I've still never experienced a rate limit, with all the massive amount of work I get done using that routine (pinned on the home page of this forum). That's an absolutely stunning bargain for such a tremendously large volume of high quality frontier inference.

I've also chosen to use Deepseek-4-pro with Pi for agentic tasks that I run directly on servers - that combo has been super cheap and effective (for example, for agents that respond to user questions about huge collections of documents). I also do use Deepseek and Pi to develop software which requires many unattended iterations and/or which benefits from many hands-on interactions with the server environment (for example, to build agents that crawl/interact with 3rd party user interfaces - that sort of work can involve thousands of iterations, so the benefit of automated agentic interaction during the development process is tremendous).

But with all the recent quality improvements in smaller models, MTP performance improvements, etc., especially with the Qwen 3.6 varieties, I finally feel like self-hosted LLM inference is genuinely usable on less-than-datacenter-class GPU hardware. You can run those most recent models well on many variations of consumer grade hardware, and I'm beginning to really trust them to get production work done, especially when used with agentic harnesses that don't burn piles of unnecessary tokens. Pi has been especially functional as a generally useful lightweight/configurable agent, and custom coded agents for custom tasks are getting to be a more viable solution when even greater efficiency is required.

I'll absolutely continue to use the frontier models via APIs, and continue to use Openrouter, as long as those options stay so inexpensive, but I no longer have the sense that I absolutely need those services, to maintain the sort of professional software development routines I've become accustomed to with frontier LLMs.

I do fully expect investment dollars to get exhausted, for companies like OpenAI & Anthropic, and at that point, LLM APIs will become more expensive & rate limited. I think many companies which originally planned to make their revenue by providing LLM API services, will pivot into renting commodity data center hardware usage, because LLM research will eventually yield less valuable products. I also expect that some of the companies currently providing these services at a loss will eventually get forced out of business, if they haven't found ways to profit by replacing the majority of human labor, with AI services (that's the only way to justify the current market caps for many of those companies).

When those eventualities do come to be, I'll certainly invest in more hardware to run the full versions of whichever LLMs are best at that point (Deepseek, Kimi, GLM, etc.), but the current smaller models are absolutely usable for real development work, right now, in the current environment. I still do lots of professional CRUD & basic UI development work, and Qwen 3.6 can accomplish those goals without any trouble at all, just as well as frontier models.

But that's just one piece of the puzzle. Beyond my own development work, the other thing I see on the horizon, is that my clients are more and more considering buying a GX10 or something similar for their agentic routines which burn lots of tokens. All my clients are setting up agents to complete tasks which significantly improve employee productivity. Even with cheap LLM APIs - and as fantastically inexpensive and capable as Deepseek-4-pro currently is - some of those heavy workflows still get expensive. It makes sense to spend a few thousand dollars once, for example, to complete agentic routines which cost $100+ per week in API fees. For any workflows which cost more than that, 2 or more GX10 machines could be paid off in a single year.

Additionally, most of my own clients want to start using agentic tasks to process PHI and other sensitive data, and those sorts of workflows require a HIPAA compliant LLM provider. BastionGPT looks like a good drop-in solution for some of my clients, but $18/$45 per million tokens in/out is far more expensive than Deepseek (50x more!). At that rate, some of my clients could pay off multiple clusters of hardware, to run multiple open source frontier models, in a single year.

Having a fast, capable, secure, low-power self-hosted LLM alternative for all sorts of uses will soon become a critical requirement for many of my clients, so I'm relieved to have some relatively inexpensive hardware platforms such as the Asus GX10 and Strix Halo machines, as well as models such as Qwen 3.6 27b and 35a3b already tested and ready to go.

I've written lots about those hardware and model choices in other forum topics, so I'll leave it at this: the current self-hosted solutions are not just curiosities anymore, they can actually get real work done quite impressively. I like having a backup for my frontier model routines, and for traveling, but much more important, I'm already seeing how we'll be able to use them in practical positions within the AI infrastructure of businesses I work with.

Version 1 (May 25, 2026 at 16:47)

For most of my production work, I'm still using the zip file project management routine with ChatGPT 5.5, because I've come to trust 100% the ridiculously reliable high quality code output of that model, and because I've still never experienced a rate limit, with all the massive amount of work I get done using that routine. That's an absolutely stunning bargain for such a tremendously large volume of high quality frontier inference.

I've also chosen to use Deepseek-4-pro with Pi for agentic tasks that I run directly on servers - that combo has been super cheap and effective (for example, for agents that respond to user questions about huge collections of documents). I also do use Deepseek and Pi to develop software which requires many unattended iterations and/or which benefits from many hands-on interactions with the server environment (for example, for building agents that crawl/interact with 3rd party user interfaces - that sort of work can involve thousands of iterations, so the benefit of automated agentic interaction is tremendous).

But with all the recent improvements in models and MTP performance improvements, especially with the Qwen 3.6 varieties, I finally feel like self-hosted LLM inference is genuinely usable on less-than-datacenter-class GPU hardware. You can run the most recent models well on many variations of consumer grade hardware, and I'm beginning to really trust them to get work done, especially when used with agentic harnesses that don't burn piles of unnecessary tokens. Pi has been especially functional as a generally useful lightweight/configurable agent, and custom coded agents for custom tasks are getting to be a more common solution when even greater efficiency is required.

I'll absolutely continue to use the frontier models via APIs, and continue to use Openrouter, as long as those stay so inexpensive, but I no longer have the sense that I absolutely need those services, to maintain the sort of professional software development routines I've become accustomed to with frontier LLMs.

I do, however, fully expect investment dollars to get exhausted, and at that point, LLM APIs will become more expensive/rate limited. I think many companies providing these services will pivot into providing commodity data center hardware use, because LLM research will eventually become less valuable. And I think some of the current companies providing these services at a loss will eventually go out of business, if they haven't found ways to profit from replacing the majority of human labor (that's the only way to justify the current market caps for these companies).

When those things start to happen, I'll certainly invest in more hardware to run the full versions of whichever LLMs are best at that point (Deepseek, Kimi, GLM, etc.), but the current smaller models are absolutely usable for real development work, right now, in the current environment. I still do lots of CRUD & typical UI work, and Qwen 3.6 can do that without any trouble at all, just as well as frontier models.

But that's just one piece of the puzzle. Beyond my own development work, the other thing I see on the horizon, is that my clients are more and more considering to buy a GX10 or something similar for their agentic routines which burn lots of tokens. All my clients are setting up agents to complete tasks which significantly improve employee productivity. Even with cheap LLM APIs - and as fantastically inexpensive and capable as Deepseek-4-pro is - some of those heavy workflows get expensive. It makes sense to spend a few thousand dollars once for routines which cost $100+ per week. For any workflows which cost more than that, you could pay off 2 GX10s in a single year.

Additionally, most of my own clients want to start using agentic tasks to process PHI data, and those sorts of workflows require a HIPAA compliant LLM provider. BastionGPT looks like a good solution for some of my clients, but $18/$45 per million tokens in/out is far more expensive than Deepseek (50x more!). At that rate, some of my clients could pay off multiple clusters to run multiple open source frontier models, in a single year.

Having a fast, capable, secure, low-power self-hosted LLM alternative for all sorts of uses will soon become a critical requirement for many of my clients, so I'm relieved to have some relatively inexpensive hardware platforms such as the Asus GX10 and Strix Halo machines, as well as models such as Qwen 3.6 27b and 35a3b already tested and ready to go.

I've written lots more about those hardware and model choices elsewhere, so I'll leave it at this: the current self-hosted solutions are not just curiosities anymore, they can actually get real work done. I'm already seeing how we'll be able to use them in practical positions within the AI infrastructure of businesses I work with.