When it's worth running your own model serving, when SaaS wins, and the third option most founders don't consider. With a decision flowchart you can use this week.
Every few months, some permutation of this question lands in my inbox: “We’re thinking about running our own models. Should we?”
The answer is almost always no — but the cases where the answer is yes are real, and they’re not hard to identify once you know what to look for. The harder question that founders rarely ask is the third option: should we wait?
Let me walk through all three.
Most companies should start with the buy option (an API-first approach), and many should stay there indefinitely.
The core argument for API-first is that foundation model providers — OpenAI, Anthropic, Google, Mistral — are spending billions of dollars training models that you can access for fractions of a cent per request. The quality gap between what you’d train yourself and what you can rent has narrowed dramatically. For most business problems, the API model is already good enough.
“Good enough” is doing a lot of work in that sentence. Let me be specific.
If your use case is:
…then an API-first approach will almost certainly give you 90%+ of the performance of a custom model at 5% of the cost and timeline. You should buy.
The buy case also applies when your data volume is low. Training and fine-tuning your own model requires meaningful labeled data. If you have fewer than 10,000 examples of the task you’re trying to automate, you’re likely not in a position where custom training pays off.
Finally, buy if your team doesn’t have dedicated ML infrastructure experience. Running model serving in production is not trivial. It requires GPU provisioning, batching logic, latency optimization, scaling decisions, and ongoing maintenance. If you don’t have someone who has done this before, the learning curve will eat the cost savings.
There are four situations where building your own model infrastructure genuinely makes sense.
1. Your data cannot leave your environment.
This is the clearest case. Healthcare, finance, defense, legal, and some government-adjacent industries have data that cannot be sent to a third-party API — either because of regulation (HIPAA, FINRA) or because the data is core proprietary IP that you cannot risk sharing. If this is you, self-hosted is not optional. It’s the only option.
2. Your volume is high enough that API costs exceed hosting costs.
This is a math problem. At scale, API pricing becomes expensive. If you’re sending 100 million tokens a day to GPT-4o at current pricing, you’re looking at several thousand dollars per day. A well-optimized self-hosted Llama 3 70B on GPU instances can handle that volume at a fraction of the cost, once you’ve amortized the engineering time to set it up.
The crossover point depends on your specific use case, model size, and quality requirements. I’ve seen it happen anywhere from 10M tokens/day (for a high-quality model on expensive instances) to 500M tokens/day (for a smaller model on optimized hardware). If you’re below 1M tokens/day, you’re almost certainly not at the crossover.
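The crossover arithmetic above is easy to sketch. The prices below (blended API rate, GPU instance cost, amortized engineering time) are illustrative assumptions, not quotes — substitute your own numbers.

```python
# Back-of-envelope crossover check: daily API spend vs. self-hosted cost.
# All prices are illustrative assumptions; plug in your own.

def api_cost_per_day(tokens_per_day: float, price_per_million: float) -> float:
    """Daily API spend at a blended input/output price per million tokens."""
    return tokens_per_day / 1_000_000 * price_per_million

def self_hosted_cost_per_day(
    gpu_instances: int,
    instance_cost_per_hour: float,
    engineering_cost_per_day: float,
) -> float:
    """Daily self-hosted cost: GPU rental plus amortized engineering time."""
    return gpu_instances * instance_cost_per_hour * 24 + engineering_cost_per_day

# 100M tokens/day at an assumed $30/M blended rate vs. two GPU nodes
# at an assumed $8/hr plus $300/day of amortized engineering.
api = api_cost_per_day(100_000_000, 30.0)          # $3,000/day
hosted = self_hosted_cost_per_day(2, 8.0, 300.0)   # $684/day
print(f"API: ${api:,.0f}/day  Self-hosted: ${hosted:,.0f}/day")
```

Run the same two functions at 1M tokens/day and the comparison flips — which is the whole point: the answer is a function of your volume, not a universal rule.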
3. Your task requires fine-tuning, and the fine-tuned model is a product differentiator.
If the quality of your AI output is a core part of what you’re selling — if being better than the API model at your specific task is your moat — then fine-tuning a smaller model might be worth the investment.
The caveat is that this is only a real moat if your fine-tuning data is hard to replicate. If your advantage is “we have 50,000 labeled examples of our domain task,” that’s a meaningful head start. If your advantage is “we fine-tuned on public data that anyone can access,” your moat lasts until someone else does the same thing.
4. Your latency requirements can’t be met by an API.
Some real-time applications — voice interfaces, interactive tools, certain production systems — need sub-200ms response times. API calls to external providers, with network latency, rate limits, and variable response times, may not be able to meet that bar. If latency is a hard constraint, you may need to run inference locally.
Waiting is the option that most founders don’t consider, and it’s often the right one.
The AI capability landscape is moving fast. The model that costs $50/M tokens today will cost $5/M tokens in twelve months. The model that requires fine-tuning to be good at your task today might handle it out of the box next year. The infrastructure tooling that requires a PhD to operate today will be a managed service in 18 months.
If you’re early in your AI thinking, and your use case is not urgent, waiting may be the most rational strategy. You’ll get more capability for less money with a better ecosystem of tooling — and you’ll avoid locking in architectural decisions that will age badly.
I know this is counterintuitive. There’s significant pressure on founders to “do something with AI.” But the cost of a premature build — wrong architecture, wrong model, wrong abstraction layer — is significant. It’s not just the build cost. It’s the refactoring cost, the migration cost, and the organizational debt of having made a call that doesn’t hold up.
The question I ask is: “What do you lose by waiting six months?” If the honest answer is “nothing, our competitors aren’t doing this either and the API will be cheaper,” then wait.
Here’s how I’d walk through this in practice:
1. Can your data leave your environment? If not, self-hosted is your only option: build.
2. Is your daily token volume above ~50M? If so, run the crossover math — building may pay off.
3. Is the API quality sufficient for your task? If so, buy.
4. Is the underlying technology likely to improve significantly in the next 6–12 months, and can your use case tolerate the delay? If so, wait.
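The questions above can be sketched as a small function. The ordering and the ~50M threshold follow this article; treat them as defaults to adjust, not hard rules.

```python
def build_buy_or_wait(
    data_can_leave: bool,
    daily_tokens: float,
    api_quality_sufficient: bool,
    tech_improving_soon: bool,
    use_case_urgent: bool,
) -> str:
    """Walk the decision questions in order; returns 'build', 'buy', or 'wait'."""
    if not data_can_leave:
        return "build"                 # self-hosted is the only option
    if daily_tokens > 50_000_000:
        return "build"                 # above the likely crossover: do the math
    if not api_quality_sufficient:
        if tech_improving_soon and not use_case_urgent:
            return "wait"              # capability will come to you
        return "build"                 # fine-tune to close the quality gap now
    return "buy"                       # API-first is the default

print(build_buy_or_wait(True, 5_000_000, True, True, False))  # buy
```

Most real companies will hit different branches for different workloads, which is exactly the hybrid outcome described below.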
The build vs. buy framing implies that these are stable choices. They’re not.
Most companies I work with end up in a hybrid: buy for the general-purpose cases, build (or fine-tune) for the high-volume or high-sensitivity ones. The mistake is treating it as a one-time decision rather than an evolving architecture.
Build your system with a model abstraction layer. Don’t hardcode your API calls to a specific provider. Design your data pipeline so that switching models is a configuration change, not a refactoring project. That way, you can start with the API, accumulate data, and shift to fine-tuned or self-hosted when the math changes.
If you want to run through this decision for your specific situation, book a free intro call. I’ll ask you the right questions and tell you what I think — including if the right answer is to wait.
Book a free intro call. No pitch — just a direct conversation about where AI fits (or doesn’t) in your business right now.