
GPU Rental vs API Calculator — Self-Host Llama or Stay on OpenAI?

Compare the real total cost of running an open-source LLM on rented H100 / A100 / L40S GPUs against calling the OpenAI / Anthropic / Google API directly. Throughput-based GPU math + 2026 rental rates + utilisation correction + setup amortisation. Returns the break-even queries/day where the loser would win at scale.


GPU Rental vs API Calculator

The calculator takes eight inputs:

  • Daily queries: production daily query volume. 1k = pilot, 10k = mid-size SaaS, 100k+ = consumer scale.
  • Input tokens per query: system prompt + context + user message. Typical 500-5,000 for chat / Q&A workloads.
  • Output tokens per query: avg response length. GPU serve time scales linearly with output tokens.
  • API model: pick the closest hosted model. The calculator uses 2026 list pricing.
  • GPU type: rental rates from Lambda Labs / Modal / Runpod (Jan 2026). Throughput is Llama 3.3 70B int4-quantised; see FAQ for other model sizes.
  • Utilisation: % of rented hours actually serving requests. Realistic SaaS: 40-70%; high-traffic consumer: 70-90%; intermittent: 20-40%.
  • Setup cost: engineer time + container hardening + monitoring + load-balancing. Bare-minimum DIY: $1k-3k. Production-grade with HA: $5k-20k.
  • Horizon: amortisation window. 12 months matches typical project budgets; 36 months is more honest for stable deployments.
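For readers who think in code, the same eight inputs can be written down as a small Python dataclass. This is only an illustrative sketch: the field names and defaults are taken from the descriptions above and the worked example further down, not from the calculator's internal schema.

```python
from dataclasses import dataclass

@dataclass
class GpuVsApiInputs:
    """Illustrative input set; defaults mirror the worked example below."""
    queries_per_day: int = 10_000       # production daily query volume
    input_tokens: int = 2_000           # system prompt + context + user message
    output_tokens: int = 500            # average response length
    api_model: str = "gpt-4o-mini"      # closest hosted model (2026 list pricing)
    gpu_type: str = "H100 80GB SXM"     # rented card (Jan 2026 blended rate)
    utilisation: float = 0.60           # share of rented hours actually serving
    setup_cost_usd: float = 2_000.0     # one-time engineering / hardening cost
    horizon_months: int = 12            # amortisation window for the setup cost
```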


What This Calculator Does

The GPU Rental vs API Calculator answers a deceptively simple question: at YOUR query volume, is it cheaper to rent an H100 / A100 / L40S on Lambda Labs, Modal, or Runpod and self-host an open-source model — or just keep paying per-token to OpenAI, Anthropic, or Google? It models real GPU output throughput (Llama 3.3 70B int4), applies a utilisation correction for idle hours, amortises one-time setup cost over your horizon, and stacks the result against 2026 list pricing for the closest hosted model.

It's built for engineering leads sizing infrastructure before a production migration. The output isn't just a winner — it's the break-even queries/day at which the loser would catch up. That number is what makes the decision durable: it tells you when to revisit the choice as your volume grows.

The Math

2026 GPU Rental Snapshot

Blended Lambda Labs / Modal / Runpod rates as of January 2026. Throughput numbers are tuned for Llama 3.3 70B int4-quantised using vLLM / TensorRT-LLM defaults. Smaller models (7B-13B) run 5-10× faster on the same card; larger models (Mixtral 8x22B, Llama 4 dense 405B) run at roughly half throughput.

  • H100 80GB SXM — $2.99/h at ~80 output tok/sec. The frontier-tier rental for serious 70B serving.
  • H100 80GB PCIe — $2.49/h at ~65 tok/sec. Slightly slower interconnect, 15-20% cheaper.
  • A100 80GB — $1.79/h at ~50 tok/sec. The workhorse — best price/throughput ratio for steady production.
  • L40S 48GB — $1.50/h at ~40 tok/sec. Good fit for 13B-30B models where 80GB VRAM is overkill.
  • A10G 24GB — $0.60/h at ~22 tok/sec. Budget tier for small models or low-volume staging environments.
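Transcribed into code, the snapshot reads as a small lookup table. The numbers are the blended January 2026 rates listed above; the dict itself is just an illustration, not the calculator's actual data source.

```python
# Blended Lambda Labs / Modal / Runpod rates, January 2026 snapshot.
# Throughput = output tokens/sec for Llama 3.3 70B int4 (vLLM / TensorRT-LLM defaults).
GPU_RATES = {
    "H100 80GB SXM":  {"usd_per_hour": 2.99, "output_tok_per_sec": 80},
    "H100 80GB PCIe": {"usd_per_hour": 2.49, "output_tok_per_sec": 65},
    "A100 80GB":      {"usd_per_hour": 1.79, "output_tok_per_sec": 50},
    "L40S 48GB":      {"usd_per_hour": 1.50, "output_tok_per_sec": 40},
    "A10G 24GB":      {"usd_per_hour": 0.60, "output_tok_per_sec": 22},
}
```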

A Worked Example — 10K queries/day on H100

Production SaaS with 10,000 queries/day · 2,000 input · 500 output tokens. Renting an H100 SXM at $2.99/h with realistic 60% utilisation, $2,000 setup cost, 12-month horizon. API comparison: GPT-4o-mini at $0.15 input / $0.60 output per 1M.

  • Daily output tokens: 10,000 × 500 = 5,000,000
  • Serve time: 5,000,000 / 80 tok/sec = ~17.4 hours of compute
  • Billable hours at 60% util: ~29 hours/day — more than one card can provide in 24 hours, so you'd need ~1.2 GPUs
  • Monthly GPU cost: ~$2,600 + $167/mo setup amort = ~$2,767
  • API equivalent: per query = $0.0006 → ~$180/month
  • Verdict: API wins by ~$2,500/month at this volume — break-even sits near 200K-300K queries/day.
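The arithmetic above can be reproduced in a few lines. This is a simplified sketch of the cost model as described (output-token-bound serving, a utilisation correction, setup amortised over the horizon); variable names are illustrative and the live calculator may differ in detail.

```python
# Worked example: 10K queries/day on an H100 SXM vs GPT-4o-mini (2026 list pricing).
queries_per_day  = 10_000
input_tokens     = 2_000
output_tokens    = 500
tok_per_sec      = 80        # H100 SXM, Llama 3.3 70B int4
gpu_usd_per_hour = 2.99
utilisation      = 0.60
setup_cost       = 2_000.0
horizon_months   = 12
api_in_per_1m    = 0.15      # GPT-4o-mini input, USD per 1M tokens
api_out_per_1m   = 0.60      # GPT-4o-mini output, USD per 1M tokens

# GPU side: serving is output-token-bound.
daily_output_tokens  = queries_per_day * output_tokens             # 5,000,000
serve_hours_per_day  = daily_output_tokens / tok_per_sec / 3600    # ~17.4 h
billed_hours_per_day = serve_hours_per_day / utilisation           # ~29 h, > 24
gpus_needed          = billed_hours_per_day / 24                   # ~1.2 cards
gpu_monthly  = gpus_needed * 24 * 30 * gpu_usd_per_hour            # ~$2,600
gpu_monthly += setup_cost / horizon_months                         # + ~$167 amortised

# API side: pure per-token spend.
api_per_query = (input_tokens * api_in_per_1m + output_tokens * api_out_per_1m) / 1e6
api_monthly   = api_per_query * queries_per_day * 30               # ~$180

print(f"GPU ~${gpu_monthly:,.0f}/mo vs API ~${api_monthly:,.0f}/mo "
      f"(gap ~${gpu_monthly - api_monthly:,.0f}/mo)")
```

Running it reproduces the numbers in the bullets: roughly $2,760/month on the GPU path against $180/month on the API.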

When Self-Host Wins (and When It Doesn't)

Three conditions need to hold simultaneously for GPU rental to beat the API: sustained high volume (typically > 100K queries/day at a standard token shape), an open-source model that is good enough for your task (Llama 3.3 70B trails frontier models by 5-15% on agent benchmarks), and the engineering headcount to run inference infra reliably. Miss any one and the API wins. The break-even queries/day surfaced in the result is the volume threshold where the math flips for YOUR specific token shape.

Why Utilisation Dominates the GPU Side

Rental is 24×7 billing — if you rent an H100 to handle 200 queries/day at peak, you pay for 23.5 idle hours regardless. Realistic SaaS production runs 40-70% utilisation; consumer apps with global traffic reach 70-90%; intermittent or evaluation workloads fall to 20-40%. Below 30% utilisation, switch to serverless GPU (Modal, Replicate), which bills per-second instead of per-hour. The calculator's utilisation slider directly captures this.
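To make the idle-hours point concrete, here is a small sketch comparing always-on rental with per-second serverless billing for the same low-traffic workload. The ~$2.50/h serverless H100 equivalent is the figure quoted in the FAQ below; the scenario itself (roughly 30 minutes of compute per day) is an illustrative assumption.

```python
# A low-traffic example: ~200 queries/day needing ~30 minutes of GPU compute.
compute_hours_per_day = 0.5
utilisation = compute_hours_per_day / 24            # ~2%; the other 23.5 h sit idle

# Always-on rental: one H100 billed 24x7 no matter how little it serves.
always_on_monthly = 24 * 2.99 * 30                  # ~$2,150/month

# Serverless (per-second billing): you pay only while actually serving.
serverless_monthly = compute_hours_per_day * 2.50 * 30   # ~$37/month

print(f"utilisation {utilisation:.0%}: always-on ${always_on_monthly:,.0f}/mo, "
      f"serverless ${serverless_monthly:,.0f}/mo")
```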

Common Mistakes

  • Modelling 100% utilisation. The single-biggest cost-modelling error. Real production never hits 100% — even consumer-scale traffic with global coverage tops out at ~85% on average. Use 60% as the default unless you have specific traffic data showing otherwise.
  • Ignoring setup cost. Bare-minimum DIY (one engineer, vLLM + Lambda + a deploy script) is $1-3K. Production-grade with HA, monitoring, weight versioning, on-call rotation runs $5-20K+. The setup input captures the one-time piece; ongoing ops engineering (~$1-2K/month at 0.1 FTE) is NOT in the calculator and must be added separately.
  • Comparing against list-price API instead of negotiated. OpenAI and Anthropic both offer 10-30% enterprise discounts above ~$50K/month spend. If you're at that scale, your real API rate is 15-20% below list, which pushes the GPU's break-even volume correspondingly higher.
  • Forgetting throughput is model-size dependent. The default 80 tok/sec on H100 is for Llama 3.3 70B int4. If you're serving a 7B model the throughput jumps 5-10×; a Mixtral 8x22B drops by half. Adjust the GPU option's throughput in your head before reading the result for non-70B workloads.
  • Conflating quality with cost. The calculator surfaces the cheaper path, not the better path. Frontier hosted models lead open-source 70B by 5-15% on standardised agent benchmarks. For high-leverage tasks (complex agents, long-form reasoning) that gap can outweigh the savings; for narrow tasks (classification, extraction) it's negligible. Run a 100-request A/B at the cheaper path before migrating.
  • Underestimating GPU price decay. H100 rental rates have fallen ~20% per year since 2023 as capacity ramped. Modelling a 36-month horizon at today's rate overstates GPU cost by ~30% versus the realistic rate-card path. For multi-year decisions, budget for rental rates dropping rather than holding flat (a minimal sketch of this follows the list).
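The price-decay point is easy to quantify. The sketch below assumes the ~20%/year decline quoted above continues smoothly month over month, which is an assumption rather than a forecast, and compares total spend for one always-on card over a 36-month horizon.

```python
def horizon_spend(rate_today: float, annual_decline: float, months: int) -> float:
    """Total rental spend for one always-on card if the hourly rate falls each month."""
    monthly_factor = (1 - annual_decline) ** (1 / 12)
    return sum(rate_today * monthly_factor ** m * 24 * 30 for m in range(months))

flat     = horizon_spend(2.99, 0.00, 36)   # today's rate held flat: ~$77.5k
decaying = horizon_spend(2.99, 0.20, 36)   # ~20%/yr decline:        ~$57k

print(f"flat ${flat:,.0f} vs decaying ${decaying:,.0f} "
      f"({(flat - decaying) / flat:.0%} lower with the decaying rate card)")
```

Depending on which total you take as the baseline, holding today's rate flat overstates the three-year GPU bill by roughly a quarter to a third, in the same ballpark as the ~30% figure above.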

How This Differs From the API-Only Calculators

The API Token Cost Calculator is single-provider with no self-host alternative — pure per-token spend across OpenAI/Anthropic/Google with caching and batch toggles. Run it first if your decision is which API provider to use. The Self-Host vs API Calculator is the OWNED-hardware sibling of this one — buy hardware outright and amortise instead of renting hourly. Use that for on-prem racks or bought servers; use this calculator for cloud GPU rental decisions.


How to Read the Verdict

Two numbers anchor the call: the monthly cost gap at your current volume and the break-even queries/day at which the loser would catch up. The break-even number is the durable answer — it tells you when to revisit the choice as volume grows.

  • Currently below break-even AND volume growing slowly. Stay on API. The fixed cost of self-hosting only pays off if you cross break-even and hold above it for the full amortisation horizon.
  • Currently 2× above break-even. Migrate to rented GPU. Self-hosted Llama 3.3 70B int4 on H100 typically beats hosted API at sustained 100K+ queries/day on chat-length prompts.
  • Bursty volume. Stay on API. GPU rental economics depend on high utilisation; a 30%-utilised H100 is more expensive than the API equivalent.
  • Quality-sensitive workload (Claude Opus / GPT-5 tier). Stay on API. Open-source 70B models match ~85% of frontier-tier quality — fine for most workloads, not for the ones that need the last 15%.
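Those four rules can be collapsed into a few lines of decision logic. This is only a sketch of the reading guide above, with the thresholds (2× break-even, ~30% utilisation) taken from the bullets and the utilisation section earlier; the function name and the marginal-zone wording are illustrative.

```python
def read_verdict(queries_per_day: float, breakeven_qpd: float,
                 utilisation: float, quality_sensitive: bool) -> str:
    """Rough reading of the verdict; not the calculator's own logic."""
    if quality_sensitive:
        return "Stay on API: the frontier-quality gap outweighs the savings."
    if utilisation < 0.30:
        return "Stay on API or use serverless GPU: bursty volume kills rental economics."
    if queries_per_day >= 2 * breakeven_qpd:
        return "Migrate to rented GPU: comfortably above break-even."
    if queries_per_day < breakeven_qpd:
        return "Stay on API: revisit when volume approaches break-even."
    return "Marginal: re-run the numbers with your negotiated API rate before deciding."
```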

Frequently Asked Questions

The most common questions we get about this calculator — each answer is kept under 60 words so you can scan.

  • Why GPU output throughput, not raw FLOPS?
    Because LLM serving is output-token-bound — 80-95% of GPU time is generating output tokens, with input prefill an order of magnitude faster. The dominant cost driver is tokens-per-second sustained throughput, which is what the GPU options list shows. FLOPS and memory bandwidth matter only as far as they predict that throughput number, which we've measured against published vLLM / TensorRT-LLM benchmarks for Llama 3.3 70B int4.
  • What model is the throughput tuned for?
    Llama 3.3 70B int4-quantised — the most common open-source production deployment in 2026. Smaller models run faster: 7B-13B at 5-10× throughput on the same GPU; 30B at ~2×. Larger models (Mixtral 8×22B, Llama 4 dense 405B) run at ~half throughput. To model a different size, multiply the GPU's listed throughput by the appropriate factor before reading the result.
  • Why does utilisation matter so much?
    Because rental is 24×7 billing. If you rent an H100 to serve 200 queries/day at peak, and those 200 queries take 30 minutes of compute, you still pay for 23.5 idle hours. Utilisation = compute-time-needed / rented-time. Realistic production SaaS runs 40-70%; consumer-scale workloads with global traffic can reach 80-90%. Below 30%, you're better off on serverless GPU (Modal, Replicate) which bills per-second.
  • Does this account for serverless GPU pricing?
    Indirectly — the calculator assumes always-on rented hardware. For workloads with utilisation < 30%, switch to serverless GPU and use the per-second rate (Modal: ~$0.000694/sec on H100 ≈ $2.50/h equivalent at 100% utilisation, but billed only when serving). The calculator math undercounts your cost in that case; flip the GPU type's hourly rate to the equivalent and set utilisation to 100%.
  • What's a realistic setup cost?
    Bare-minimum DIY (one engineer, one weekend, vLLM + Lambda + a script): $1-3k engineer time. Production-grade with high-availability, load balancing, monitoring, on-call rotation, weight versioning: $5k-20k+ for the initial build, $1-3k/month ongoing ops. The setup-cost input captures only the one-time piece; the engineer time for ongoing ops is NOT in the calculator (it's roughly $1k-2k/month if you split a 0.1 FTE).
  • Why does the API side beat GPU rental at low volume?
    Because the API has no fixed cost — every query is variable. GPU rental has both a setup-amortised fixed cost AND a serving cost. At low volume, the fixed cost dominates and the API's variable rate (which is set high enough to be profitable for the provider) still beats your low-utilisation GPU. The break-even queries/day in the result tells you exactly when GPU rental flips to win.
  • Why does GPU rental beat API at high volume?
    Because OpenAI / Anthropic build a profit margin into per-token pricing. At sufficient volume, you can match or beat that margin by cutting out the middleman. Typical break-even on Llama 3.3 70B vs GPT-4o-mini lands somewhere in the 200k-2M queries/day range depending on your input/output ratio. Your calculation surfaces the exact number.
  • What about quality difference between Llama and GPT-4o?
    Real and material — current open-source 70B models trail frontier hosted models by 5-15% on standardised agent benchmarks (SWE-Bench, GAIA, HumanEval). For high-leverage tasks (complex agents, long-form reasoning) the quality gap can outweigh cost savings; for narrow tasks (classification, extraction, light formatting) it's negligible. The calculator gives you the cost; the quality call is yours.
  • Does this work for fine-tuning workloads?
    Partially — the GPU side estimates inference cost. For training/fine-tune, use the Fine-tune vs RAG calculator alongside this one (training cost happens once; inference cost recurs forever). The decision usually splits: fine-tune on rented A100s, then either (a) keep serving on the same A100 for high volume, or (b) migrate to a hosted endpoint for low/medium volume.
  • What about Mac-Studio / consumer-GPU self-host?
    Out of scope — the calculator targets data-center GPUs (Lambda / Modal / Runpod / AWS / GCP). M3 Ultra Mac Studio runs Llama 3.3 70B at 4-bit at ~8-12 tok/s on a $4k machine; amortised over 36 months that's $111/month + electricity. Useful for dev / niche personal use but not production-scale serving. For consumer-grade pricing math, treat it as a 0.6× A10G with no rental cost.
  • How does this compare to OpenAI's Volume Discount or Anthropic's enterprise contract?
    Both companies offer 10-30% volume discounts at 6-7-figure annual spend. The calculator uses list pricing — bump down the equivalent API model rate by 15-20% to model that. If your rate card delta is bigger than 20%, you've negotiated above-market and the GPU side has to clear a higher bar to win.
  • Is the GPU rental data current?
    Snapshot is January 2026 — Lambda Labs / Modal / Runpod blended rates. Prices have been falling roughly 20% per year for the last 3 years (capacity overhang from H100 ramp + new providers entering); the calculator gets refreshed quarterly against published rate cards.