Self-hosted vs API: when does owning your AI compute actually make sense?
GPU compute, inference pricing, token economics, and the cost reality behind enterprise AI deployments. A practical guide to figuring out when self-hosting makes sense and when APIs are the better call.

If you have been anywhere near an enterprise cloud budget in the last two years, you already know that AI workloads have completely changed the economics of infrastructure. The conversation used to be about right-sizing VMs and buying reserved instances. Now it is about GPU availability, token pricing, and whether your organization should be running its own models or just calling an API.
I went through this exact exercise myself recently, and let me tell you, the real cost of running AI at enterprise scale looks NOTHING like what the marketing materials suggest. But here is the good news: if you have been in EUC or infrastructure for any length of time, you are NOT flying blind here. The decision framework for self-hosted AI vs API is almost identical to what we have been doing with on-prem VDI vs cloud DaaS for over a decade. Same cost modeling, same capacity planning, same build-vs-buy analysis. Different technology, same process. If you have ever right-sized a VDI environment and been shocked at the delta between the project forecast and the actual bill, multiply that feeling by about 10. That is AI infrastructure economics right now. But the skills you already have to evaluate it are the same ones that got you through VDI planning.
GPU Instance Pricing: The Sticker Shock Is Real
Let me start with the raw compute because that is where most of the budget goes if you are running your own models. The flagship GPU for serious AI workloads right now is the NVIDIA H100 (80GB HBM3). Every cloud provider offers instances built around these, and if you are used to general compute costs, prepare yourself. These numbers are not a typo.
On AWS, the p5.48xlarge gives you 8x H100 GPUs with 640GB total GPU memory, 192 vCPUs, and 2TB of system RAM. On-demand pricing sits around $98/hour in us-east-1. That is roughly $72K per month running 24/7, or about $861K per year for a single instance. With a 1-year reserved instance (all upfront), you are looking at roughly $567K per year, a solid 34% discount. The 3-year all-upfront drops further to around $390K per year (55% savings), but that is a $1.17 million commitment.
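If you want to rerun that math with your own rates, here is a minimal Python sketch. It assumes the rounded $98/hour figure and the 34% and 55% discounts quoted above, so the results land within about 1% of the numbers in the paragraph rather than matching them exactly. The function name is just for illustration.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def annual_cost(hourly_rate: float, discount: float = 0.0) -> float:
    """Annual cost of one always-on instance at a given discount off on-demand."""
    return hourly_rate * HOURS_PER_YEAR * (1.0 - discount)


# Rounded figures from the text: ~$98/hour on-demand, 34% and 55% discounts.
on_demand = annual_cost(98.0)
one_year = annual_cost(98.0, discount=0.34)
three_year = annual_cost(98.0, discount=0.55)

for label, cost in [
    ("on-demand", on_demand),
    ("1-year reserved", one_year),
    ("3-year reserved", three_year),
]:
    print(f"{label:>16}: ${cost:,.0f}/year")

# The 3-year option is a commitment for the full term, not just one year.
print(f"3-year total commitment: ${three_year * 3:,.0f}")
```

Swap in the current rate for your region before you take any of this to a budget meeting, because these prices move.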
Azure has its ND H100 v5 series, which is comparable in both specs and pricing. If you are already in the Microsoft ecosystem, Azure also offers some bundling with its AI services that can affect your total cost picture. GCP offers H100 access through its a3-highgpu-8g instances at competitive pricing, and its Dynamic Workload Scheduler helps with GPU capacity access.
For organizations that do not need full H100 power, the previous generation A100 instances are still very capable and significantly cheaper. The AWS p4d.24xlarge (8x A100 40GB) runs about $33/hour on-demand. These are still serious machines for inference and smaller-scale fine-tuning.
Then there are the inference-optimized instances. The AWS g5 series using NVIDIA A10G GPUs is where a lot of enterprises land for production inference. A g5.xlarge (1x A10G, 24GB GPU memory) runs about $1/hour on-demand. These are much more reasonable for serving models in production, especially if you are running quantized open source models.
Inference API Pricing: The Token Economy
If you are not running your own infrastructure (and in my opinion, most enterprises should start here), then you are paying per token through managed APIs.
For Anthropic's Claude models, the pricing as of early 2026 looks roughly like this: Claude Opus 4 runs $15 per million input tokens and $75 per million output tokens. Claude Sonnet 4 is $3/$15 per million input/output. Claude Haiku 3.5 is $0.80/$4 per million. There is a reason most production workloads end up on Sonnet or Haiku.
OpenAI's GPT-4o is priced at $2.50/$10 per million input/output tokens. GPT-4o mini drops to $0.15/$0.60 per million, which is genuinely affordable for what you get. Keep in mind that reasoning models (o1, o3) use internal chain-of-thought tokens that you pay for but do not see in the output, which can make the effective cost significantly higher than the sticker price.
For open source models through inference providers, running Llama 3.1 405B through providers like Together AI or Fireworks costs roughly $3-5 per million input tokens. Llama 3.1 70B drops to about $0.60-0.90 per million. The smaller models can be served for pennies.
Here is where most enterprises get the math wrong though. They look at per-token pricing and think "that is cheap" without modeling their actual usage. Let me walk through a real scenario.
Say you have a customer service AI agent handling 50,000 conversations per day. Each conversation averages 2,000 input tokens (system prompt + conversation history + customer message) and 500 output tokens. Using Claude Sonnet 4:
- Input: 50,000 x 2,000 = 100M tokens/day x $3/M = $300/day
- Output: 50,000 x 500 = 25M tokens/day x $15/M = $375/day
- Daily total: $675
- Monthly total: roughly $20,250
- Annual: $243,000
That is $243K per year for ONE use case on Sonnet, not even Opus. Now multiply across all the AI use cases your organization is spinning up simultaneously. This is why your CFO is suddenly showing up to AI strategy meetings. I have seen this exact pattern play out with VDI cloud migrations where the per-user cost looked great in the proposal and then the actual bill came in 3x higher because nobody modeled the real usage patterns. AI is doing the same thing right now, just faster and with bigger numbers.
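The scenario above is easy to turn into a small model you can reuse for every use case on your list. This is a rough sketch, not a billing tool: the class and function names are made up for illustration, prices are in USD per million tokens, and it uses the same 30-day month and 12-month year rounding as the walkthrough.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    conversations_per_day: int
    input_tokens: int   # average input tokens per conversation
    output_tokens: int  # average output tokens per conversation


def daily_cost(w: Workload, input_price: float, output_price: float) -> float:
    """Daily API spend; prices are USD per million tokens."""
    input_millions = w.conversations_per_day * w.input_tokens / 1_000_000
    output_millions = w.conversations_per_day * w.output_tokens / 1_000_000
    return input_millions * input_price + output_millions * output_price


# The customer service scenario priced at $3 / $15 per million tokens.
support_agent = Workload(conversations_per_day=50_000, input_tokens=2_000, output_tokens=500)
per_day = daily_cost(support_agent, input_price=3.0, output_price=15.0)
per_month = per_day * 30
per_year = per_month * 12

print(f"daily: ${per_day:,.0f}  monthly: ${per_month:,.0f}  annual: ${per_year:,.0f}")
```

Run it once per use case and add up the totals. That sum is the number your CFO actually cares about.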
The Hidden Costs
The GPU instances and API tokens are just the obvious costs. There is a whole layer of infrastructure spending that builds up.
Data transfer and networking: If you are running self-hosted models, you need high-bandwidth networking between your GPU nodes. Cross-AZ traffic on AWS is $0.01/GB in each direction, and GPU clusters generate a lot of inter-node traffic. A training job using 8 nodes can easily generate 50-100 TB of network traffic in a single run, which works out to roughly $1,000-2,000 per run if that traffic crosses AZ boundaries.
Storage: Large language models are big files. A single Llama 3.1 405B model in FP16 is roughly 810 GB. Quantized versions (GPTQ, AWQ, GGUF) bring this down to 200-400 GB. High-performance storage (AWS FSx for Lustre, Azure Managed Lustre) that can feed models to GPUs fast enough runs $0.145-0.200/GB/month. For a serious deployment with multiple model versions and fine-tuning checkpoints, you can accumulate 20-50 TB easily.
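The sizes quoted above follow directly from parameter count times bytes per parameter, and the storage bill follows from capacity times the per-GB rate. Here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 1,000 GB) and ignores the small metadata overhead that quantized formats add, and the helper names are hypothetical.

```python
def model_size_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight size in decimal GB, ignoring format overhead."""
    return params_billions * bits_per_param / 8


def monthly_storage_cost(terabytes: float, price_per_gb_month: float) -> float:
    """Monthly storage cost, treating 1 TB as 1,000 GB."""
    return terabytes * 1000 * price_per_gb_month


# A 405B-parameter model at 16-bit, 8-bit, and 4-bit precision.
for bits in (16, 8, 4):
    print(f"405B at {bits}-bit: ~{model_size_gb(405, bits):,.0f} GB")

# Low and high ends of the ranges from the text.
low = monthly_storage_cost(20, 0.145)
high = monthly_storage_cost(50, 0.200)
print(f"High-performance storage: ${low:,.0f} to ${high:,.0f} per month")
```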
Fine-tuning costs: Full fine-tuning of a 70B parameter model typically requires 4-8 H100 GPUs for 10-50 hours. At $98/hour for an 8x H100 instance, a single run costs $980-4,900. And you are not doing one run. You are doing hyperparameter sweeps and experimenting with data mixes. Budget 5-10x your single-run cost. LoRA (Low-Rank Adaptation) and QLoRA fine-tuning bring this down significantly, but the iteration cycle cost still adds up.
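To see how quickly the iteration multiplier dominates, here is a quick sketch using the figures above: $98/hour for the 8x H100 instance, 10-50 hours per run, and a 5-10x multiplier for sweeps and data-mix experiments. The function name is illustrative, and the multiplier is a budgeting rule of thumb, not a measurement.

```python
def finetune_budget(hours: float, hourly_rate: float, sweep_multiplier: float) -> tuple[float, float]:
    """Return (single-run cost, budget once sweeps and reruns are included)."""
    single_run = hours * hourly_rate
    return single_run, single_run * sweep_multiplier


# Low end: a 10-hour run with a 5x multiplier. High end: 50 hours with 10x.
low_single, low_budget = finetune_budget(hours=10, hourly_rate=98.0, sweep_multiplier=5)
high_single, high_budget = finetune_budget(hours=50, hourly_rate=98.0, sweep_multiplier=10)

print(f"single run: ${low_single:,.0f} to ${high_single:,.0f}")
print(f"realistic budget: ${low_budget:,.0f} to ${high_budget:,.0f}")
```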
Monitoring and observability: You need to track inference latency, token throughput, GPU utilization, model accuracy, and cost allocation across teams. Tools like Weights & Biases, MLflow, or Datadog's LLM monitoring add $1,000-5,000/month depending on scale.
Prompt caching and optimization: Anthropic offers prompt caching that can reduce costs by up to 90% for repeated system prompts. OpenAI has similar features. But to take full advantage, you need to architect your applications with caching in mind from the start. If you are building a RAG (Retrieval-Augmented Generation) system, your vector database (Pinecone, Weaviate, pgvector, Qdrant) adds another $500-5,000/month depending on index size and query volume.
Self-Hosted vs API: The Real Comparison
This is the question I hear the most, and it is the exact same question we asked with VDI. Do you run your own Citrix or VMware infrastructure on-prem, or do you go with a cloud DaaS offering? The variables are the same: upfront capital vs operational expense, control vs convenience, dedicated engineering staff vs managed service, and compliance requirements that might force your hand regardless of cost. We have a LOT of institutional knowledge here. Let me put real numbers to it.
Scenario: medium-scale inference (10M tokens/day output)
API route using Claude Sonnet 4:
- 10M output tokens/day x $15/M = $150/day
- Assuming 2:1 input-to-output ratio: 20M input tokens x $3/M = $60/day
- Monthly: $6,300
- Annual: $75,600
Self-hosted route using Llama 3.1 70B on AWS:
- 2x g5.12xlarge instances (4x A10G each) for serving a quantized 70B model
- On-demand: 2 x $5.672/hr x 730 hrs = $8,281/month
- 1-year reserved: roughly $5,400/month
- Plus storage, networking, engineering time for model serving (vLLM, TGI, or TensorRT-LLM)
- Plus an ML engineer spending 20% of their time managing the infrastructure
- Realistic monthly total: $8,000-12,000
- Annual: $96,000-144,000
So at this scale, the API is actually cheaper AND you do not need an ML infrastructure team. That surprises a lot of people! From what I am seeing, the breakeven point where self-hosting starts to make financial sense is typically around 50-100M output tokens per day. Below that, you are paying for GPU capacity that sits partially idle while also paying an engineer to babysit it. I have been through enough capacity planning exercises over my career to know that the "we will grow into it" justification for over-provisioning rarely works out the way people hope.
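Here is the same comparison as a small sketch you can rerun with your own volumes. It uses only the numbers from the scenario above and a 30-day month. It does not model how many GPU instances you would need as volume grows, which is the piece you have to measure for your own models before trusting any breakeven number. The function name is made up for illustration.

```python
DAYS_PER_MONTH = 30


def api_monthly_cost(
    output_millions_per_day: float,
    input_to_output_ratio: float,
    input_price: float,
    output_price: float,
) -> float:
    """Monthly API spend; prices are USD per million tokens."""
    output_cost = output_millions_per_day * output_price
    input_cost = output_millions_per_day * input_to_output_ratio * input_price
    return (output_cost + input_cost) * DAYS_PER_MONTH


# API route: 10M output tokens/day, 2:1 input ratio, $3 / $15 per million.
api = api_monthly_cost(10, 2, input_price=3.0, output_price=15.0)

# Self-hosted route: GPU compute alone, before storage, networking, or staff.
gpu_on_demand = 2 * 5.672 * 730

# Realistic all-in monthly range from the scenario above.
self_hosted_low, self_hosted_high = 8_000, 12_000

print(f"API:                 ${api:,.0f}/month")
print(f"Self-hosted compute: ${gpu_on_demand:,.0f}/month on-demand")
print(f"Self-hosted all-in:  ${self_hosted_low:,} to ${self_hosted_high:,}/month")
```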
But just like with VDI, the decision is not purely about cost. If you are in a regulated industry where data cannot leave your VPC (Virtual Private Cloud), self-hosting might be the only option regardless of price, the same way healthcare organizations often had to keep VDI on-prem for HIPAA reasons even when cloud was cheaper. If you need sub-100ms inference latency for a real-time application, self-hosted models on optimized infrastructure can deliver that while API calls add network round-trip time, just like how latency-sensitive VDI workloads (3D graphics, real-time video) often stayed on-prem even when general desktop workloads moved to DaaS. If you need a custom fine-tuned model for a domain-specific task, you may need to self-host. The decision tree is remarkably similar to what we have already been doing.
GPU Availability
Something the pricing discussions always leave out: you can compare prices all day, but the real constraint for many enterprises is simply getting access to GPU instances. H100 availability has improved significantly from 2024, but if you need a cluster of 32+ H100s on short notice, you may still face challenges.
Spot and preemptible instances for GPU workloads are tempting (60-90% discounts) but come with real operational challenges. Your training job getting preempted at 80% completion means you need robust checkpointing. For inference, a spot interruption means dropped requests. I have heard of enterprises using spot instances successfully for batch inference workloads (processing a backlog of documents overnight, for example) where interruptions just mean "pick up where you left off." But for production real-time inference, in my opinion you need on-demand or reserved capacity.
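The "pick up where you left off" pattern for batch work can be as simple as recording which items are finished and skipping them on restart. Below is a minimal sketch of that idea. The file layout, function names, and the simulated interruption are all assumptions for illustration, and a real job would call your model server inside the handler.

```python
import json
import os
import tempfile


def load_done(checkpoint_path: str) -> set:
    """Return the set of item IDs already processed, or an empty set."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return set(json.load(f))


def save_done(checkpoint_path: str, done: set) -> None:
    """Write the checkpoint to a temp file, then swap it in atomically."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp_path, checkpoint_path)


def run_batch(doc_ids, handler, checkpoint_path: str) -> int:
    """Process every item not yet in the checkpoint; return how many ran."""
    done = load_done(checkpoint_path)
    processed = 0
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # finished before the last interruption
        handler(doc_id)
        done.add(doc_id)
        save_done(checkpoint_path, done)
        processed += 1
    return processed


class Interrupted(Exception):
    """Stands in for a spot instance being reclaimed mid-job."""


def flaky_handler(doc_id: str) -> None:
    # Fail once on doc-3 to simulate a preemption, then succeed on retry.
    if doc_id == "doc-3" and not flaky_handler.recovered:
        flaky_handler.recovered = True
        raise Interrupted(doc_id)


flaky_handler.recovered = False

docs = [f"doc-{i}" for i in range(1, 6)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    try:
        run_batch(docs, flaky_handler, path)
    except Interrupted:
        pass
    done_after_interrupt = len(load_done(path))
    resumed = run_batch(docs, flaky_handler, path)
    final_done = len(load_done(path))

print(done_after_interrupt, resumed, final_done)
```

Writing the checkpoint after every item is the simplest thing that works. For a large backlog you would probably checkpoint every N items and keep the file on durable storage that survives the instance, such as an object store or a shared volume.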
The GPU cloud providers (CoreWeave, Lambda Labs, RunPod) often have better H100 availability than the hyperscalers and competitive pricing. CoreWeave offers H100 SXM at around $2.06/GPU/hour with reserved pricing. The tradeoff is you do not get the full ecosystem of services that AWS, Azure, and GCP provide. For pure GPU compute workloads, they are worth evaluating.
Cost Optimization Strategies That Actually Work
Based on what I am seeing across the industry, here are the strategies that consistently produce real savings.
Right-size your model for the task. This is the single biggest lever and I cannot stress it enough. I have personally done this. I was using Opus for things that Haiku could have handled just fine, and my token costs dropped dramatically once I started routing intelligently. A well-crafted prompt with a smaller model often outperforms a lazy prompt with a larger model. Build a model routing layer that sends simple tasks to affordable small models and reserves expensive large models for complex reasoning. This alone can reduce API costs by 60-80%.
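A routing layer does not have to be fancy to start paying off. Here is a minimal sketch of the idea. The tier names, the task-type labels, and the routing table are all hypothetical, and the per-million-token prices are the Haiku, Sonnet, and Opus list prices quoted earlier. In practice you would build the table from your own evaluation results, not from a guess.

```python
# USD per million tokens as (input, output), from the API pricing section.
PRICES = {
    "small": (0.80, 4.00),
    "mid": (3.00, 15.00),
    "large": (15.00, 75.00),
}

# Hypothetical routing table: task type -> cheapest tier assumed good enough.
ROUTES = {
    "classification": "small",
    "extraction": "small",
    "summarization": "mid",
    "reasoning": "large",
}


def route(task_type: str) -> str:
    """Pick a model tier; unknown task types fall back to the large tier."""
    return ROUTES.get(task_type, "large")


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request on the given tier."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Same request size (2,000 in / 500 out) on each tier.
for task in ("classification", "summarization", "reasoning"):
    tier = route(task)
    print(f"{task:>15} -> {tier:<5} ${request_cost(tier, 2_000, 500):.4f}/request")
```

Falling back to the large tier for anything unrecognized is a deliberate choice here: it protects quality at the expense of cost, and your cost dashboards will show you which unlabeled traffic is worth classifying next.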
Implement aggressive prompt caching. If your system prompt is 2,000 tokens and you are making 100,000 API calls per day, that is 200M tokens per day just in repeated system prompts. Prompt caching features across providers can cut the cost of those repeated tokens by as much as 90%. Structure your prompts with static content first and dynamic content last to maximize cache hit rates.
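To put a dollar figure on the 2,000-token system prompt example, here is a rough sketch. It assumes the $3 per million input price used earlier and the "up to 90%" discount on cached tokens mentioned in the hidden costs section. It also ignores any surcharge for writing to the cache and any cache expiry, so treat the result as a best case. The function name and the 95% hit rate are illustrative.

```python
def daily_system_prompt_cost(
    prompt_tokens: int,
    calls_per_day: int,
    input_price: float,
    cache_hit_rate: float = 0.0,
    cached_discount: float = 0.90,
) -> float:
    """Daily spend on the repeated system prompt alone (USD).

    cache_hit_rate is the fraction of calls served from cache, and
    cached_discount is the price reduction applied to those cached tokens.
    """
    millions_of_tokens = prompt_tokens * calls_per_day / 1_000_000
    full_price = millions_of_tokens * input_price
    return full_price * (1.0 - cache_hit_rate * cached_discount)


uncached = daily_system_prompt_cost(2_000, 100_000, input_price=3.0)
cached = daily_system_prompt_cost(2_000, 100_000, input_price=3.0, cache_hit_rate=0.95)

print(f"no caching:   ${uncached:,.0f}/day")
print(f"95% hit rate: ${cached:,.0f}/day")
```

The hit rate is the part you control, and it depends on keeping the static prefix byte-for-byte identical across calls.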
Batch your inference requests. If you have workloads that do not need real-time responses (document processing, data enrichment, report generation), batch them. Anthropic's Message Batches API offers 50% cost reduction. OpenAI has similar batch discounts. For self-hosted models, batching dramatically improves GPU utilization.
Use reserved capacity wisely. If your baseline GPU usage is predictable (and it usually becomes predictable within 3-6 months), buy reserved instances for the baseline and use on-demand for burst capacity. The 1-year all-upfront reserved pricing saves 30-40%, and 3-year saves 50-60%. For API providers, Anthropic and OpenAI both offer committed-use discounts for high-volume customers.
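The baseline-plus-burst split is the same exercise as sizing reserved instances for any other workload. Here is a small sketch that finds the cheapest number of instances to reserve for a given demand curve. The demand profile is invented for illustration (2 instances for 16 hours a day, 5 for the other 8), the rate is the g5.12xlarge on-demand price used earlier, and the 35% discount is just the midpoint of the 30-40% range.

```python
def monthly_cost(
    hourly_demand: list[int],
    reserved: int,
    on_demand_rate: float,
    reserved_discount: float,
) -> float:
    """Total cost when `reserved` instances are paid for every hour and
    demand above that level is served on-demand."""
    hours = len(hourly_demand)
    reserved_cost = reserved * on_demand_rate * (1.0 - reserved_discount) * hours
    burst_instance_hours = sum(max(0, demand - reserved) for demand in hourly_demand)
    return reserved_cost + burst_instance_hours * on_demand_rate


def best_reserved_count(hourly_demand: list[int], on_demand_rate: float, reserved_discount: float) -> int:
    """Try every reservation level from zero to peak and return the cheapest."""
    candidates = range(max(hourly_demand) + 1)
    return min(
        candidates,
        key=lambda n: monthly_cost(hourly_demand, n, on_demand_rate, reserved_discount),
    )


# Hypothetical 30-day profile: 16 quiet hours at 2 instances, 8 busy hours at 5.
one_day = [2] * 16 + [5] * 8
demand = one_day * 30

best = best_reserved_count(demand, on_demand_rate=5.672, reserved_discount=0.35)
all_on_demand = monthly_cost(demand, 0, 5.672, 0.35)
optimized = monthly_cost(demand, best, 5.672, 0.35)

print(f"reserve {best} instances: ${optimized:,.0f}/month vs ${all_on_demand:,.0f} all on-demand")
```

The rule hiding inside the math: at a 35% discount, an instance is only worth reserving if it is busy more than 65% of the time. That is why you need a few months of real usage data before committing.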
Monitor and attribute costs relentlessly. Tag every API call and every GPU workload with a team, project, and use case identifier. Build dashboards that show cost per conversation, cost per document processed, cost per decision made. When teams see their actual AI costs, they find ways to optimize. In my experience, this transparency alone can reduce costs by 20-30% without any technical changes.
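Tagging is mostly discipline, but the data model behind it is small. Here is a minimal sketch of what one tagged usage record and a rollup could look like. The field names, team names, and helper are hypothetical, and in a real deployment the records would come from your API gateway or logging pipeline rather than a hard-coded list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    team: str
    project: str
    use_case: str
    input_tokens: int
    output_tokens: int
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens

    @property
    def cost(self) -> float:
        """Cost in USD of this single call."""
        return (
            self.input_tokens * self.input_price
            + self.output_tokens * self.output_price
        ) / 1_000_000


def cost_by(records: list[UsageRecord], tag: str) -> dict[str, float]:
    """Sum cost grouped by one tag: 'team', 'project', or 'use_case'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, tag)] += record.cost
    return dict(totals)


# Illustrative records: two support conversations and one search query.
records = [
    UsageRecord("support", "helpdesk-bot", "conversation", 2_000, 500, 3.0, 15.0),
    UsageRecord("support", "helpdesk-bot", "conversation", 2_000, 500, 3.0, 15.0),
    UsageRecord("search", "doc-tagger", "classification", 1_000, 100, 0.80, 4.0),
]

team_totals = cost_by(records, "team")
for team, total in sorted(team_totals.items()):
    print(f"{team:>8}: ${total:.4f}")
```

Divide a use case's total by its call count and you have cost per conversation or cost per document, which is the number teams actually respond to.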
Consider hybrid architectures. Run a small self-hosted model for high-volume, latency-sensitive, or data-sensitive workloads, and use APIs for everything else. A quantized Llama 3.1 8B running on a single A10G instance ($1/hour) can handle simple classification, entity extraction, and summarization while you route complex reasoning tasks to Claude or GPT-4o via API.
What I Recommend for Getting Started
If you are in the early stages of enterprise AI adoption, in my opinion you should not start by provisioning GPU clusters. Think about how the best VDI deployments started: you did not buy 500 Citrix licenses and a SAN on day one. You ran a pilot with 50 users, measured the IOPS and CPU patterns, figured out your user profiles, and THEN sized your production environment. AI infrastructure planning should work the same way. Start with managed APIs. Use Claude or GPT-4o through their standard APIs, instrument everything with cost tracking from day one, and let your usage patterns develop naturally. After 3-6 months, you will have real data on what your actual workloads look like and can make informed decisions about reserved capacity, self-hosting, or hybrid approaches.
The pricing landscape is evolving incredibly fast. API prices have dropped 80-90% in the last 18 months and there is no reason to think that trend stops. Every dollar you lock into a 3-year GPU reservation is a bet that self-hosting will remain cheaper than APIs for the duration, and that is not a bet I would make without very strong conviction about your workload profile.
Keep in mind that the total cost of ownership for AI infrastructure goes far beyond compute. Factor in ML engineering talent ($200K-400K per engineer fully loaded), the operational overhead of running model serving infrastructure, the cost of model evaluation and testing, and the opportunity cost of your team managing infrastructure instead of building AI-powered products. When you add all of that up, the API route is often the more practical choice until you hit truly massive scale, and by that point you will have the expertise and data to make the transition confidently.
In my experience, the organizations that handle AI infrastructure economics well are the ones treating it like any other capacity planning exercise: measure first, optimize second, commit third. If you have done VDI capacity planning, you already know this discipline. The same people who would never deploy 10,000 virtual desktops without a thorough assessment are panic-buying GPU reservations because somebody told them AI capacity is scarce. Slow down. Measure your actual usage. Then commit. The infrastructure will be there when you need it.

Jason Samuel
Product leader, advisor, and international speaker with 27+ years in enterprise end-user computing, security, and cloud. Has deployed infrastructure at Fortune 500 scale across 34 countries. 1 of 3 people globally to hold Citrix CTP + VMware vExpert + VMware EUC Champion concurrently. 200+ articles, 1,000+ reader discussions.