
Self-Host vs API Calculator — Buy GPU Hardware or Stay on OpenAI?

Capex vs opex decision for production LLM workloads. Buy an H100 server / RTX 4090 cluster / on-prem rack and amortise over its useful life, or stay on per-token API. The calculator returns the horizon total for both paths, hardware payback in months, and the volume threshold at which the loser would catch up.


Self-Host vs API Calculator

Calculator inputs:

  • Queries per day — production query volume. 1k = pilot, 10k = mid-size SaaS, 100k+ = consumer.
  • Input tokens — system prompt + user message + retrieved context (if RAG).
  • Output tokens — avg response length. Typical 200-1500 for chat / Q&A.
  • API model — hosted API model you'd use as the alternative to self-hosting.
  • Hardware cost — GPU(s) + server + networking. Reference: 1× H100 PCIe ≈ $30k, 4× RTX 4090 + box ≈ $10k, 1× A100 80GB used ≈ $15k, 8× H100 SXM server ≈ $250k.
  • Hardware lifetime — useful life for amortisation. NVIDIA warranties run 3 years; conservative finance teams use 4-5; aggressive depreciation schedules use 3.
  • Monthly opex — combined electricity + ops time + colocation if applicable. H100 ≈ 700W TDP × 24h × 30d × $0.15/kWh = ~$75 electricity alone; add ops time, e.g. 20h/mo at $50/h = $1,000. $300 is a typical SMB number.
  • Horizon — project / budget horizon. 36 months typical for owned hardware (matches a useful-life schedule); 12-24 for short-term commitments.
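
The opex line is the one people guess at, and the arithmetic is easy to script. A minimal sketch of the electricity-plus-ops estimate, assuming the reference numbers above (700 W H100, 24×7 duty cycle, $0.15/kWh, 20 ops-hours/month at $50/h); the function names here are ours, not the calculator's:

```python
# Back-of-envelope monthly opex for a self-hosted GPU box.
# Numbers mirror the input help text above; adjust to your utility rate.

def monthly_electricity(watts: float, price_per_kwh: float = 0.15) -> float:
    """Energy drawn over a 30-day month (kWh) times the utility rate."""
    return watts / 1000 * 24 * 30 * price_per_kwh

def monthly_opex(watts: float, ops_hours: float, ops_rate: float,
                 colocation: float = 0.0) -> float:
    return monthly_electricity(watts) + ops_hours * ops_rate + colocation

print(round(monthly_electricity(700)))                      # 76  -> the ~$75 H100 figure
print(round(monthly_opex(700, ops_hours=20, ops_rate=50)))  # 1076
```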


What This Calculator Does

The Self-Host vs API Calculator handles the capex vs opex decision for production LLM workloads. Buy hardware outright — H100 server, RTX 4090 cluster, on-prem rack, refurbished A100 — amortise over the useful-life window, add monthly opex (electricity, ops time, colocation) — or stay on per-token API and skip the capital outlay entirely. The output is the horizon total for both paths, hardware payback in months, and the break-even queries/day at which the loser would catch up.

It's a different decision from cloud GPU rental (covered in the GPU Rental calc): owned hardware lives on your books for 3-5 years, depreciates, and ties you to the infrastructure choice for the full lifetime. SaaS teams typically rent; enterprise IT and on-prem ML teams typically own. Run both calculators in parallel if you're deciding between RENT vs BUY vs API.

The Math

2026 Hardware Reference Pricing

Snapshot January 2026, retail / new prices unless noted. Procurement contracts at scale typically beat list by 5-15%; refurbished hardware drops another 30-50% but shortens useful life.

  • 1× H100 80GB PCIe — ~$30,000 new. The most common single-GPU production purchase.
  • 4× RTX 4090 + box — ~$10,000. Budget consumer-GPU cluster; works for 13B-70B int4 inference.
  • 1× A100 80GB used — ~$15,000 (down from $20K+ in 2024). Sweet spot for cost-conscious teams.
  • 8× H100 SXM server — ~$250,000. Production-grade rack for serious throughput; 5-10 GPUs worth of capacity.
  • M3 Ultra Mac Studio — ~$4,000. Niche / dev / hobbyist; serves ~1-2K queries/day at 8-12 tok/sec on 70B int4. Not production scale.

A Worked Example — Single H100 PCIe

Mid-size team: 10,000 queries/day · 2,000 input · 500 output tokens. Buy a single H100 PCIe server for $25,000 with 4-year useful life, $300/month opex (electricity + 0.05 FTE ops), 36-month horizon. API comparison: GPT-4o-mini.

  • Monthly amort: $25,000 ÷ 48 = ~$521
  • Monthly self-host: $521 + $300 opex = $821/mo
  • 36-month self-host total: ~$29,556
  • Per query: ($0.15 × 2,000 + $0.60 × 500) / 1M = $0.0006
  • Monthly API: $0.0006 × 10,000 × 30 = ~$180/mo
  • 36-month API total: ~$6,480
  • Hardware payback: $25,000 / $180 = ~139 months
  • Verdict: API wins by ~$23,000 over the horizon. Break-even sits well above 50,000 queries/day at this token shape.
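
The same walk-through as a script, so you can swap in your own numbers. A minimal sketch, assuming GPT-4o-mini list prices ($0.15/M input, $0.60/M output) and the inputs above; the variable names are illustrative, not the calculator's internals:

```python
# Worked example: single H100 PCIe vs GPT-4o-mini over 36 months.

HW_COST = 25_000            # USD, single H100 PCIe server
LIFETIME_MONTHS = 4 * 12    # 4-year useful life
OPEX = 300                  # USD/month, electricity + ops
HORIZON = 36                # months
Q_PER_DAY = 10_000
IN_TOK, OUT_TOK = 2_000, 500
IN_PRICE, OUT_PRICE = 0.15, 0.60              # USD per 1M tokens

amort = HW_COST / LIFETIME_MONTHS              # ~$521/mo
selfhost_monthly = amort + OPEX                # ~$821/mo
selfhost_total = selfhost_monthly * HORIZON    # ~$29,550 (article rounds per-month first)

per_query = (IN_PRICE * IN_TOK + OUT_PRICE * OUT_TOK) / 1_000_000   # $0.0006
api_monthly = per_query * Q_PER_DAY * 30       # ~$180/mo
api_total = api_monthly * HORIZON              # ~$6,480

payback_months = HW_COST / api_monthly         # ~139: hardware never pays back here
print(f"self-host {selfhost_total:,.0f}  api {api_total:,.0f}  payback {payback_months:.0f} mo")
```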

When Self-Host Genuinely Wins

Three conditions need to align. (1) Sustained high volume — typically > 50,000 queries/day at standard token shape, often 100K+ for the math to be compelling. (2) The open-source model meets your quality bar — Llama 3.3 70B, Qwen 2.5, Mixtral are good enough for narrow tasks (classification, extraction, formatting) but trail frontier hosted models 5-15% on agent benchmarks. (3) Engineering capability to run inference infra — vLLM tuning, observability, weight versioning, failover, on-call. The third is the silent killer of self-host plans: capex math wins, but ops capability is missing, and the project stalls in production hardening.
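
Condition (1) is the one the calculator quantifies. As a first-order check you can solve for it directly: self-host monthly cost is, to first order, volume-independent, so break-even is where 30 days of API spend equals it. A sketch under that simplifying assumption; the calculator's capacity-aware figure lands higher, since sustained volume at this level eventually needs more hardware and opex:

```python
# First-order break-even: queries/day where monthly API spend equals the
# (volume-independent) self-host monthly cost. Real thresholds sit higher
# because serving more volume eventually means buying more GPUs.

def break_even_queries_per_day(selfhost_monthly: float,
                               cost_per_query: float) -> float:
    return selfhost_monthly / (cost_per_query * 30)

print(break_even_queries_per_day(821, 0.0006))   # ~45,600/day for the H100 example
```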

The Hidden Costs You're Probably Underestimating

The calculator captures hardware + opex but NOT these layers. For an honest production comparison, add 10-20% to the self-host total:

  • Software hardening — vLLM / TensorRT-LLM tuning, batch scheduling, weight versioning, model rollback, structured-output decoding. 1-3 engineering months for a serious rollout.
  • Observability — token-level metrics, quality drift detection, prompt regression tests, latency SLOs. Datadog / Honeycomb / OpenInference all charge real money at production scale.
  • Security — model weight protection, prompt-injection defence, isolation between tenants, audit logging. Often more involved than the API equivalent because you're running raw model code.
  • Compliance — HIPAA / SOC2 / FedRAMP for on-prem ML often costs MORE than the API equivalent (where the provider does the heavy lifting). Don't assume self-host is automatically more compliant.

Common Mistakes

  • Skipping hardware sizing validation. The calculator does NOT check whether your chosen hardware can serve your QPS. A single H100 80GB serves ~80 output tok/sec per stream on Llama 3.3 70B int4 — at 10K queries/day with 500 output tokens that's ~17 hours of compute, which fits on 1 GPU. At 1M queries/day you need 5-10 GPUs in a cluster. Use the GPU Rental calc's throughput numbers to size before pricing.
  • Modelling 100% utilisation implicitly. Owned hardware sitting idle still costs the amortised monthly rate. If your real utilisation is 30%, your effective per-query cost is roughly 3× the steady-state number (see the sketch after this list). Buying capacity for peak demand and paying for off-peak idle is the cost of ownership.
  • Using straight-line depreciation when reality is convex. NVIDIA GPUs lose value fastest in years 1-2 (frontier-tier displacement) and slower thereafter. Straight-line over 4 years is conservative for total cost comparison but optimistic for resale value at year 2. Don't assume the asset retains book value mid-lifetime.
  • Ignoring engineer ops time as opex. Inference infrastructure needs babysitting — kernel updates, driver patches, weight versioning, incident response. Budget 0.05-0.15 FTE depending on stack maturity. At $50-150/h, that's $200-2000/month opex you must include or the comparison is dishonest.
  • Forgetting hardware refresh costs. NVIDIA releases major-tier upgrades every 18-24 months (H100 → H200 → Blackwell). The calculator handles past-lifetime by assuming a notional re-buy, but if your horizon spans a generation transition, the new-gen card will likely be cheaper per token than the old one — your break-even moves against owned hardware over time.
  • Comparing list-price API against owned hardware. Enterprise API contracts at $50K+/month spend typically cut 10-30% off list price. If you're negotiating API rates seriously, the threshold for self-host to win shifts higher. Run the calculator twice: once at list, once at your negotiated rate, and pick the more honest comparison.
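
The utilisation point in the second bullet is worth making concrete. A minimal sketch, assuming the $821/month self-host figure from the worked example and a box sized for 10k queries/day:

```python
# Effective per-query cost under partial utilisation. The amortised bill
# is fixed; serving fewer queries just shrinks the denominator.

def effective_cost_per_query(monthly_selfhost: float,
                             capacity_q_per_day: float,
                             utilisation: float) -> float:
    served_per_month = capacity_q_per_day * utilisation * 30
    return monthly_selfhost / served_per_month

print(effective_cost_per_query(821, 10_000, 1.0))   # ~$0.0027/query at full load
print(effective_cost_per_query(821, 10_000, 0.3))   # ~$0.0091/query, ~3.3x worse
```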

How This Differs From the GPU Rental Calculator

The GPU Rental vs API Calculator models cloud-rental opex — Lambda Labs / Modal / Runpod, hourly billing, never owned. This calculator models OWNED hardware — purchased outright, depreciated over 3-5 years, plus monthly electricity and ops. Different decision for different audiences. SaaS teams that want to flex capacity rent; enterprise IT and on-prem ML teams that want budget predictability and data sovereignty own. Run both calculators in parallel before committing.


How to Read the Verdict

Two numbers settle the question: hardware payback in months (capex recovery horizon) and break-even queries/day (the volume threshold at which API would beat owned hardware). Buying hardware ties you to the choice for 3-5 years; the volume must hold above break-even the entire time.

  • Payback under 18 months AND volume stable. Buy. The post-payback years deliver nearly free incremental inference — the strongest case for on-prem ML teams.
  • Payback 18-36 months. Compare against rented GPU first — rental usually wins this band because you avoid the capex risk if volume drops.
  • Payback above 48 months. Don’t buy. Hardware lifecycle (GPU generation refresh) makes a 4-year payback a bad bet — by then a single H200 outperforms 4 H100s and your owned hardware is obsolete.
  • Compliance / data-residency required. Buy regardless of payback math — the alternative isn’t API, it’s “can’t ship.” Justify the spend on the compliance line, not pure cost.
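
Those bands reduce to a few comparisons. A sketch of the decision logic as described above; the thresholds are the article's, and we treat the unstated 36-48 month band as part of the don't-buy zone:

```python
# Verdict bands from the list above, as a function. Compliance overrides
# the cost math entirely.

def verdict(payback_months: float, volume_stable: bool,
            compliance_required: bool = False) -> str:
    if compliance_required:
        return "buy: justify on the compliance line, not cost"
    if payback_months < 18 and volume_stable:
        return "buy: post-payback inference is nearly free"
    if payback_months <= 36:
        return "compare against rented GPU first"
    return "don't buy: the refresh cycle outruns the payback"

print(verdict(139, volume_stable=True))   # don't buy -> the worked H100 example
```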

Frequently Asked Questions

The most common questions we get about this calculator — each answer is kept short so you can scan.

  • How is the hardware cost amortised?
    Straight-line over the useful-life months: monthly amortisation = hardware cost ÷ (lifetime years × 12). A $25,000 H100 server with a 4-year life amortises to ~$521/month. After the lifetime ends, amortisation drops to zero (the asset is paid off); only opex continues. The calculator handles a horizon longer than the lifetime by adding a notional re-buy cycle (sketched in code at the end of this FAQ).
  • What's the difference between this and the GPU Rental calculator?
    GPU Rental models cloud-rental opex — Lambda Labs / Modal / Runpod, billed hourly, never owned. Self-Host models OWNED hardware — purchased outright, amortised over a 3-5 year useful life, plus monthly electricity and ops. Different decision for different audiences: SaaS teams typically rent, enterprise IT and on-prem ML teams typically own. Run both calcs in parallel if you're deciding between RENT vs BUY vs API.
  • What hardware should I budget for at my volume?
    Rough sizing: a single H100 80GB serves ~80 output tokens/sec per stream on Llama 3.3 70B int4, and continuous batching lifts aggregate throughput several-fold. At 10k queries/day with 500 output tokens average you generate ~5M output tokens, roughly 17 hours of single-stream compute — comfortably 1 GPU. At 100k queries/day same shape that's ~174 single-stream hours, still 1 GPU at moderate batch sizes. At 1M queries/day, you need 5-10 GPUs in a cluster. The calculator does NOT validate that your hardware can serve your volume — use the GPU Rental calc's throughput numbers to size first.
  • Why doesn't this calculator track GPU throughput?
    Because we assume the user has already sized hardware to handle their volume. Throughput modelling lives in the GPU Rental calculator, where it determines the rented-hours bill. For self-host, the relevant question is amortisation × volume, not throughput per GPU. If your hardware can't serve your QPS, you have a sizing problem this calculator can't tell you about — go validate with vLLM benchmarks first.
  • What's a realistic monthly opex for a self-hosted setup?
    Three components. (1) Electricity: H100 ≈ 700W TDP, A100 ≈ 400W, RTX 4090 ≈ 450W. At 24×7 and $0.15/kWh that's roughly $45-75/month per GPU for power, plus another 30-50% for cooling. (2) Bandwidth + colocation if you're not on-prem: $200-500/month for a half rack at typical IX colos. (3) Engineer ops time: budget 0.05-0.15 FTE depending on stack maturity — at $50-150/h that's $200-2000/month. Total: $300 (DIY home rack) to $3000+ (production-grade data centre).
  • When does self-hosting actually win against API?
    Three conditions need to hit together: (1) sustained high query volume — typically > 50,000 queries/day at the rate in question, (2) the model you'd self-host (Llama 3.3 70B, Qwen 2.5, Mixtral) is good enough for your task quality bar, and (3) you have the engineering capability to run inference infrastructure reliably. Miss any one of those and API wins. The break-even queries/day in the result tells you the volume threshold for your specific inputs.
  • How do I model a multi-GPU cluster?
    Sum the hardware costs and stack the opex. 4× RTX 4090 cluster = $10,000 hardware + ~$120/month electricity + 0.1 FTE ops = ~$1,200 monthly. 8× H100 SXM server = $250,000 hardware + ~$800/month electricity + 0.5 FTE ops = ~$5,000 monthly. The calculator works with any aggregated number — enter the totals.
  • What about used / refurbished GPU pricing?
    Used A100 80GB sells for ~$10,000-$15,000 in 2026 (down from $20k+ in 2024). Used H100 PCIe ~$22,000-$28,000. Refurbished hardware typically comes with 6-12 month warranties. The tradeoff: ~30-50% capex saving but a shorter remaining useful life — adjust hardwareLifetimeYears down to 2-3 to model honestly.
  • Does this account for the Mac-Studio-as-a-server option?
    Yes — enter $4,000 for an M3 Ultra Mac Studio, set lifetime to 5 years, monthly opex to $25 (electricity only — no colocation, you run it on your desk). Throughput is 8-12 tok/sec on Llama 3.3 70B 4-bit, so it serves ~1k-2k queries/day comfortably. Useful for hobbyist, niche prosumer, and dev workloads but NOT for production-scale serving.
  • What about quality difference between open-source and OpenAI?
    Frontier hosted models (GPT-5, Claude Opus 4.7) lead open-source by 5-15% on standardised benchmarks (SWE-Bench, GAIA, HumanEval). For high-leverage tasks the quality gap can outweigh cost savings; for narrow tasks the gap is negligible. The calculator gives you the cost; the quality call is yours. A common pattern: route low-complexity queries to self-hosted Llama, escalate hard cases to OpenAI API.
  • How do tariffs and supply chain affect this?
    GPU prices have been volatile since 2022 — supply constraints, export controls, tariff regimes. Pricing in this calculator is January 2026 retail. If your procurement contract differs, override the hardwareCost input. For multi-year horizons, consider that hardware refresh cycles get cheaper: an H100 you buy in 2026 will be replaced by an H200 (or RTX 5090 derivative) at lower price-per-token in 2027-28.
  • Are there hidden costs the calculator misses?
    Yes — software stack hardening (vLLM tuning, observability, weight versioning), security (model theft, prompt injection in self-hosted deployments), compliance (HIPAA / SOC2 for on-prem ML often costs more than API equivalent), and incident response are not modelled. Add 10-20% to the self-host total for production-grade enterprise deployments.
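
The amortisation rule from the first answer above, as a sketch. The `rebuy` flag is our labelling for the notional re-buy cycle; the calculator's exact handling may differ:

```python
# Straight-line amortisation, with an optional notional re-buy when the
# horizon outlives the hardware (first FAQ answer above).

def amortised_months(horizon: int, lifetime: int, rebuy: bool) -> int:
    """Months of the horizon that carry the straight-line charge."""
    return horizon if rebuy else min(horizon, lifetime)

def selfhost_total(hw_cost: float, lifetime: int, opex: float,
                   horizon: int, rebuy: bool = True) -> float:
    amort = hw_cost / lifetime
    return amort * amortised_months(horizon, lifetime, rebuy) + opex * horizon

# 60-month horizon on a 48-month asset: the notional re-buy keeps the
# charge running for 12 extra months versus letting the paid-off asset run.
print(selfhost_total(25_000, 48, 300, 60, rebuy=True))    # ~49,250
print(selfhost_total(25_000, 48, 300, 60, rebuy=False))   # 43,000.0
```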