
AI Agent Run Cost Calculator — Real Cost of a Production Agent at Scale

Model your agent's full economics: turns per task, input + output tokens per turn, retry overhead, prompt caching. See per-task cost, monthly burn, and the side-by-side across Claude Opus / Sonnet / Haiku, GPT-5, and Gemini 2.5 — pick the tier that doesn't blow your runway.

  • Instant result
  • Private — nothing saved
  • Works on any device
  • AI insight included

AI Agent Run Cost Calculator

The inputs, with realistic ranges for each:

  • Model: provider auto-narrows the list — pick the model you'll actually deploy.
  • Turns per task: avg tool-call cycle length. Coding agents: 6-15. Customer-support agents: 3-6.
  • Tasks per day: daily volume across all users / triggers.
  • Input tokens per turn: system prompt + accumulated context. Realistic agents: 8K-30K per turn.
  • Output tokens per turn: tool-call JSON + reasoning + final response. Typical: 200-1500.
  • Retry rate: % of tasks that fail and retry. Healthy agents: 5-12%.
  • Prompt caching share: % of input tokens served from prompt cache. Anthropic / OpenAI / Google all support it — set to 70-90% if your system prompt is stable across the session.


What This Calculator Does

The AI Agent Run Cost Calculator models the real economics of running a multi-turn LLM agent in production. Unlike a single API request, an agent runs a loop — system prompt + user request + tool call + tool result + reflection + next tool call — and each turn re-sends the accumulating context. Treating that loop as a single request undercounts the bill by roughly an order of magnitude.

Plug in turns per task, average input and output tokens per turn, tasks per day, retry rate, and prompt caching share. The calculator returns per-task cost, daily, monthly, and annual spend on your chosen model — and surfaces the cheapest tier across Anthropic, OpenAI, and Google for your specific volume, so you can spot a tier-down opportunity without leaving the page.

The Math

Per-task cost = turns × (input tokens × blended input rate + output tokens × output rate) × (1 + retry rate), where the blended input rate mixes the fresh-input price and the cache-read price according to your caching share, and all rates are per 1M tokens. Daily spend multiplies by tasks per day; monthly assumes a 30-day month. The retry multiplier is applied to the full task, not to a single failed turn — agents re-plan from the user prompt on retry, so a 10% retry rate adds 10% to the bill, not 10% of one turn. Caching only reduces the input side; output is never cached.
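
In code form, a minimal sketch of that formula (Python; assumes a 30-day month and flat daily volume — the function names are illustrative, not the calculator's actual implementation):

```python
def per_task_cost(turns, input_tokens, output_tokens,
                  input_rate, output_rate,
                  cache_read_rate=0.0, cache_share=0.0, retry_rate=0.0):
    """Cost of one agent task. Rates are $ per 1M tokens; every turn
    re-sends the context, retries re-run the whole task, and caching
    discounts only the input side."""
    blended_input = (1 - cache_share) * input_rate + cache_share * cache_read_rate
    per_turn = (input_tokens * blended_input + output_tokens * output_rate) / 1_000_000
    return turns * per_turn * (1 + retry_rate)

def monthly_cost(per_task, tasks_per_day, days=30):
    """Monthly burn at a flat daily volume (30-day month by convention)."""
    return per_task * tasks_per_day * days
```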

A Worked Example

A coding agent on Claude Sonnet 4.6 ($3 / $15 per 1M tokens), running 8 turns per task, 12,000 input tokens / 600 output tokens per turn, 1,000 tasks/day, 8% retry, 0% caching:

  • Per-turn input — 12,000 × $3 / 1M = $0.036
  • Per-turn output — 600 × $15 / 1M = $0.009
  • Per-task — 8 × ($0.036 + $0.009) × 1.08 ≈ $0.39
  • Daily — $0.3888 × 1,000 ≈ $389
  • Monthly — ~$11,664

Flip caching to 80% on Sonnet's cache-read rate ($0.30 / 1M) and the per-turn input drops from $0.036 to roughly $0.0101 — that's about $6,700/month back. The same workload on Haiku 4.5 ($0.80 / $4) runs ~$3,110/month flat — roughly a quarter of the Sonnet bill if Haiku passes your eval suite.
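
Run through the sketch from The Math section, the same shape reproduces those figures (illustrative; uses the per_task_cost and monthly_cost helpers defined above and the rate-card numbers from the FAQ):

```python
# Claude Sonnet 4.6: 8 turns, 12K in / 600 out per turn, 8% retry, no caching
task = per_task_cost(turns=8, input_tokens=12_000, output_tokens=600,
                     input_rate=3.0, output_rate=15.0, retry_rate=0.08)
print(f"per task: ${task:.4f}")                        # ≈ $0.3888
print(f"monthly:  ${monthly_cost(task, 1_000):,.0f}")  # ≈ $11,664

# Same workload with 80% caching at Sonnet's $0.30/1M cache-read rate
cached = per_task_cost(turns=8, input_tokens=12_000, output_tokens=600,
                       input_rate=3.0, output_rate=15.0,
                       cache_read_rate=0.30, cache_share=0.80, retry_rate=0.08)
print(f"saved/mo: ${monthly_cost(task - cached, 1_000):,.0f}")  # ≈ $6,718
```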

When This Is Useful

Use this calculator when you're sizing capacity for a new agent product, defending a model-tier choice in a budget review, or auditing a live agent that's burning through spend faster than expected. The honest answer for most production agents lives between $1K and $50K/month — small enough that a 30% optimisation matters, large enough that getting it wrong shows up in the runway model.

It's also the right tool for a before/after check when you're considering migrating from Opus 4.7 to Sonnet 4.6, or from GPT-5 to GPT-4o. Plug in the same shape with the cheaper model, compare the monthly delta, then run a 100-task A/B at the new tier before committing.

Common Mistakes

  • Counting only the user-visible turns. Agents do tool calls, reflection passes, and self-correction internally. The actual turn count is usually 2-4× what shows in the UI. Use your own logs if you have them; otherwise the default of 8 is honest for typical coding agents.
  • Assuming caching works on every turn. Caching only kicks in for stable prefixes. If your tool results get appended to the system prompt (a common pattern), the cache invalidates on every turn and you're paying fresh-input rate even with caching enabled.
  • Ignoring output cost on reasoning models. o1-class reasoning models bill internal “thinking tokens” as output, even though you don't see them. For complex tasks, multiply visible output by 5-10× to estimate real output volume — this calculator uses the output figure you enter as-is.
  • Treating retry rate as a free knob. A 10% retry rate doesn't just add 10% to cost — it usually signals a brittle agent (bad tool descriptions, ambiguous prompts, model under-spec'd for the workload). Fix the underlying flakiness before you scale; the cost saving comes for free with the quality fix.
  • Forgetting batch API on async workloads. If your agent runs offline (overnight enrichment, evals, batch content generation), Anthropic / OpenAI / Google all offer 50% off both input and output via their Batch API. Live agents can't use it, but plenty of agentic workloads can.
  • Sizing on the average instead of the tail. Real workloads have long tails — some tasks hit 30 turns or 5K output tokens. The average is fine for monthly capacity planning, but for SLO and rate-limit sizing, model the p95 turn count and token shape too.
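
A quick sketch of that last point — size the monthly budget on the mean, but the rate limits and SLOs on the tail (Python's statistics module; the per-task turn counts are made-up examples, substitute your own logs):

```python
from statistics import mean, quantiles

# Hypothetical turn counts per task, pulled from production logs
turns_per_task = [4, 6, 7, 8, 8, 9, 11, 14, 22, 31]

avg_turns = mean(turns_per_task)                  # use for monthly capacity planning
p95_turns = quantiles(turns_per_task, n=100)[94]  # use for SLO and rate-limit sizing

print(f"average turns: {avg_turns:.1f}")
print(f"p95 turns:     {p95_turns:.1f}")
# The tail task re-sends the accumulated context far more often than the
# average one, so peak tokens-per-minute sits well above what the mean suggests.
```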

Related Calculators

For a single-shot LLM cost (no agent loop), use the API Token Cost Calculator. For full project TCO including engineering and ops overhead, run the AI Model Cost Calculator. If you're deciding between fine-tuning and RAG before even picking a model, the Fine-tune vs RAG Calculator sits one decision earlier in the stack. And to value the time your agent saves vs the spend, layer in the Personal AI Stack ROI Calculator.

Frequently Asked Questions

The most common questions we get about this calculator — each answer is kept short enough to scan.

  • Why a separate calculator from `api-token-cost`?
    Because agents are not single requests. A typical production agent runs 6-15 turns per user task — system prompt + tool calls + intermediate reasoning + final response — and each turn re-sends the accumulating context. Treating it as a single request undercounts the real spend by roughly an order of magnitude. This calculator models the per-turn loop honestly.
  • What does 'turns per task' actually mean?
    One turn = one round-trip to the LLM. A user asks a question; the agent thinks, calls a tool, reads the result, calls another tool, and finally answers — that's 3-4 turns. Coding agents (Cursor, Cline, Aider, Devin) typically log 8-20 turns per task in production. Customer-support agents stay between 3-6. Use your own logs if you have them; otherwise the default of 8 is a good starting estimate.
  • How is prompt caching modelled?
    Caching replaces a chunk of input tokens (the cached share) at the provider's cache-read rate instead of the input rate. Anthropic Claude charges 1/10th of input for reads (10× cheaper); GPT-5 charges 50%; Gemini 2.5 Pro charges 25%. The calculator picks up the cache-read rate from the published 2026 rate cards and applies it to the percentage you set. If your system prompt is stable across all turns of a session, 80-90% caching is achievable. If it changes per turn, caching won't help.
  • Why does retry rate change the answer so much?
    A 10% retry rate adds 10% to monthly cost — straightforward enough. But every retry runs the *full task*, not just the failed turn, because agents re-plan from the user prompt on retry. So a high retry rate doesn't just inflate cost; it usually signals a brittle agent (bad tool descriptions, ambiguous prompts, model too small for the workload) that's worth fixing for cost AND quality.
  • Where do the prices come from?
    Anthropic, OpenAI, and Google's published rate cards as of January 2026: Claude Opus 4.7 $15/$75 per 1M (input/output), Sonnet 4.6 $3/$15, Haiku 4.5 $0.80/$4. GPT-5 $1.25/$10. Gemini 2.5 Pro $1.25/$10, Flash $0.10/$0.40. Cache-read rates are picked up from the same pages. We refresh quarterly; the calculator page will note the snapshot date once we wire that in.
  • Should I switch from Opus 4.7 to Sonnet 4.6 to save money?
    Sonnet 4.6 is roughly 5× cheaper than Opus 4.7 at the same volume and is the default 'workhorse' model for most production agents. The right call depends on the task: code-generation agents and complex multi-step reasoning workflows often need Opus's depth; classification, extraction, summarisation, and most customer-support flows are well-served by Sonnet. The pragmatic answer is to route by task type — Opus for the 30% that need it, Sonnet for the rest.
  • What about Haiku 4.5?
    Haiku 4.5 is the budget tier (~$0.80/$4 per 1M) and excellent for high-volume, narrow-scope agents — intent classification, entity extraction, simple lookups, light formatting tasks. It typically scores 5-10% behind Sonnet on agent benchmarks but at 1/4 the cost. Use it as a router or a first-pass filter; escalate to Sonnet/Opus only on the harder branches.
  • Does this calculator account for fine-tuned model surcharges?
    No — it uses base-model pricing only. Fine-tuned models on OpenAI cost 2-8× the base rate; Anthropic doesn't expose fine-tuning yet (only prompt caching and Claude.AI custom assistants, which are billed differently). If you're running a fine-tuned model, multiply the result by the surcharge published on your provider's billing page.
  • How does batch API change things?
    Batch API (Anthropic / OpenAI / Google all offer it) discounts both input AND output by 50% in exchange for 24-hour async processing. If your agent is running offline workflows where 24-hour latency is acceptable — overnight content generation, batch enrichment, evals — switching to batch cuts cost in half on the same model. Live agents (chat, tool-use, RAG) cannot use it.
  • Can I run this for a fine-tune-vs-RAG decision?
    Indirectly — this calculator gives you the API cost side of the equation. For a full fine-tune-vs-RAG comparison you also need infrastructure costs, eval costs, and switching-cost amortisation; those land in the upcoming `fine-tune-vs-rag-calculator` (Phase L.1). Run this one to get the API baseline, then compare to the RAG-with-context-stuffing scenario.
  • Why doesn't the result match my OpenAI / Anthropic dashboard exactly?
    Three reasons in practice. (1) Real workloads are not perfectly average — some tasks hit 30 turns, some hit 4, and the dashboard reflects the actual mix while this calculator uses your point estimate. (2) Cached-token reporting can lag in real time on Anthropic's dashboard by up to 30 minutes. (3) Free-tier credits and committed-use discounts on enterprise contracts aren't modelled. For a production workload the calculator should land within ±15% of the actual bill.
  • Is this tool biased toward any provider?
    No — vendor-neutral. The owner runs Claude Opus + Sonnet in production at semisoftwares.com and Pack Forge, but also operates GPT-4 and Gemini Pro surfaces; the pricing table is sourced from the public rate cards and refreshed against vendor changes. The smart insight calls out trade-offs honestly across all three providers.