LLM API pricing has a quirk that is easy to overlook: output tokens typically cost three to five times as much as input tokens. A model priced at $3/million input tokens may charge $15/million output tokens. For applications that generate long completions (summaries, code, structured reports), the output side dominates the bill.
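To see how quickly the output side takes over, the sketch below computes the share of a single call's cost that comes from output tokens. The function name and the 1,000-in / 1,000-out call shape are illustrative assumptions, not figures from any provider.

```python
def output_share(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Fraction of one call's cost attributable to output tokens.

    Prices are in dollars per million tokens.
    """
    input_cost = input_tokens * input_price / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return output_cost / (input_cost + output_cost)


# Hypothetical call: equal input and output lengths at $3 / $15 per million.
share = output_share(1_000, 1_000, input_price=3.0, output_price=15.0)
print(f"{share:.0%} of the cost comes from output tokens")  # prints 83%
```

Even with equal token counts on both sides, a 5x price ratio puts five sixths of the cost on the output.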
As of mid-2026, the major model prices are: GPT-4o at $2.50/million input and $10/million output; Claude 3.5 Sonnet at $3/million input and $15/million output; Gemini 1.5 Pro at $1.25/million input and $5/million output (for prompts under 128K tokens); Gemini 2.0 Flash at $0.075/million input and $0.30/million output, the cheapest model on this list; and Mistral Large 2 at $2/million input and $6/million output.
To estimate monthly spend, price input and output separately: multiply average input tokens per call by calls per month (calls per day × 30) and by the input rate, do the same for output tokens at the output rate, then add the two. Take a support bot making 10,000 calls/day with 500 input tokens and 300 output tokens per call: 10,000 × 30 = 300,000 calls/month. At Claude 3.5 Sonnet rates, input is 300,000 × 500 × ($3/1M) = $450 and output is 300,000 × 300 × ($15/1M) = $1,350, for $1,800/month. The same workload on Gemini 2.0 Flash at $0.075/$0.30 per million costs $11.25 input + $27 output = $38.25/month, roughly a 47x difference.
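The estimate above can be sketched as a small calculator. The dictionary keys are labels chosen for this example (not official API model identifiers), and the prices are copied from the figures quoted earlier, so treat them as a snapshot to be replaced with current rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Dollars per million tokens."""
    input_per_m: float
    output_per_m: float


# Illustrative labels; prices taken from the list quoted in the text.
PRICES = {
    "claude-3.5-sonnet": Price(input_per_m=3.00, output_per_m=15.00),
    "gemini-2.0-flash": Price(input_per_m=0.075, output_per_m=0.30),
}


def monthly_cost(price: Price, calls_per_day: int, input_tokens: int,
                 output_tokens: int, days: int = 30) -> tuple[float, float, float]:
    """Return (input cost, output cost, total) in dollars for one month."""
    calls = calls_per_day * days
    input_cost = calls * input_tokens * price.input_per_m / 1_000_000
    output_cost = calls * output_tokens * price.output_per_m / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


# The support-bot workload: 10,000 calls/day, 500 tokens in, 300 tokens out.
for name, price in PRICES.items():
    inp, out, total = monthly_cost(price, 10_000, 500, 300)
    print(f"{name}: ${inp:,.2f} input + ${out:,.2f} output = ${total:,.2f}/month")
```

Keeping input and output costs separate in the return value makes it easy to see which side of the bill a change in prompt design or response length would affect.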
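Caching changes the math significantly. Anthropic's prompt caching bills cached reads of a repeated prefix, such as a system prompt, at $0.30/million tokens instead of $3 on Claude 3.5 Sonnet, while cache writes carry a premium at $3.75/million. The saving depends on how much of each prompt is cached: if a 2,000-token system prompt is reused on every call alongside 500 tokens of fresh input, input cost per call falls from $0.0075 to about $0.0021, a cut of roughly 72% before accounting for cache writes.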
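How much caching saves depends on what fraction of each prompt is served from cache. The sketch below makes that explicit. The function name is hypothetical, the cached-read rate is passed in rather than hard-coded, and the 0.30 used in the example is an assumed rate (one tenth of the base price) that should be checked against the provider's current price sheet. Cache-write premiums and cache expiry are ignored.

```python
def caching_savings(cached_tokens: int, fresh_tokens: int,
                    base_price: float, cached_read_price: float) -> float:
    """Fractional reduction in input cost for a call that hits the cache.

    Prices are in dollars per million tokens. Ignores cache-write
    premiums and cache misses, so this is an upper bound on the saving.
    """
    uncached = (cached_tokens + fresh_tokens) * base_price / 1_000_000
    cached = (cached_tokens * cached_read_price
              + fresh_tokens * base_price) / 1_000_000
    return 1 - cached / uncached


# A 2,000-token system prompt with varying amounts of uncached input.
# cached_read_price=0.30 is an assumption for illustration.
for fresh in (0, 500, 2_000):
    saving = caching_savings(2_000, fresh, base_price=3.0, cached_read_price=0.30)
    print(f"{fresh:>5} fresh tokens: input cost down {saving:.0%}")
```

The saving shrinks as the uncached portion of the prompt grows, so caching pays off most when a long, stable prefix is paired with short per-call input.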
Context length also affects price. Gemini 1.5 Pro charges $2.50/million input for prompts over 128K tokens, double its short-prompt rate. A model that looks cheap for short prompts may not stay cheap for long-context workloads such as document analysis or codebase review.
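A minimal sketch of context-tiered pricing follows, using the Gemini 1.5 Pro input rates quoted in this article as defaults. It assumes the whole prompt is billed at the higher rate once it crosses the threshold; confirm that billing rule against the provider's documentation before relying on it. The function name is hypothetical.

```python
def tiered_input_cost(prompt_tokens: int, threshold: int = 128_000,
                      short_price: float = 1.25,
                      long_price: float = 2.50) -> float:
    """Input cost in dollars for one prompt under two-tier pricing.

    Prices are in dollars per million tokens. Assumption: a prompt longer
    than `threshold` is billed entirely at `long_price`.
    """
    price = long_price if prompt_tokens > threshold else short_price
    return prompt_tokens * price / 1_000_000


for tokens in (100_000, 200_000):
    print(f"{tokens:,} tokens: ${tiered_input_cost(tokens):.3f} input per call")
```

Under this assumption, doubling the prompt from 100,000 to 200,000 tokens quadruples the input cost rather than doubling it.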
The practical playbook: use a fast, cheap model (Gemini Flash, or GPT-4o mini at $0.15/million input) for classification, routing, and simple generation. Reserve expensive models (GPT-4o, Claude 3.5 Sonnet) for tasks where quality demonstrably matters: complex reasoning, code generation, nuanced writing. Profile actual token usage in production before optimizing, because output length in particular is easy to underestimate.
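The routing idea can be sketched as a lookup from task type to model tier. The task labels, the function name, and the default model strings are placeholders for illustration. Falling back to the premium model for unrecognized tasks is one possible design choice, trading cost for safety.

```python
# Hypothetical task labels, grouped by the tiers described in the text.
CHEAP_TASKS = {"classification", "routing", "simple_generation"}
PREMIUM_TASKS = {"complex_reasoning", "code_generation", "nuanced_writing"}


def pick_model(task_type: str,
               cheap_model: str = "gpt-4o-mini",
               premium_model: str = "gpt-4o") -> str:
    """Choose a model tier for a task.

    Unknown task types fall back to the premium model, on the assumption
    that an unexpectedly weak answer costs more than a few extra cents.
    """
    if task_type in CHEAP_TASKS:
        return cheap_model
    return premium_model


for task in ("classification", "code_generation", "translation"):
    print(f"{task} -> {pick_model(task)}")
```

A team that prefers to keep costs bounded could instead raise an error on unknown task types, forcing each new task to be assigned a tier explicitly.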