Skip to content

The 2026 Guide to Cutting LLM Costs: Prompt Caching, Model Routing, Quantization and Observability

I walk through how I cut a production LLM bill in half, sometimes to a fifth: prompt caching, model routing, self-hosted quantization and the observability that makes it all visible. With a Turkey and KVKK lens, concrete cost math and a tactics table.

SYK
Şükrü Yusuf KAYA
AI Expert · Enterprise AI Consultant

TL;DR — There's no single magic button for cutting LLM costs; it's a layered game. The highest leverage is prompt caching (up to 90% off repeated input on Anthropic). Right after comes model routing/cascade: stop sending every job to a frontier model and route classification and summarization to a cheap one. Take asynchronous jobs at 50% off with the batch API. If you self-host, halve your memory and cost with quantization (INT8/INT4) and squeeze 3-5x more out of the same GPU with continuous batching + PagedAttention. And put observability on top of it all; because you can't manage a cost you don't measure. Below I walk through each one with field examples, real 2026 prices and a Turkey/KVKK lens.

Why I'm writing this

Over the last two years, almost every enterprise AI project that landed on my desk played out the same scene. The team ships a great prototype, the demo wows everyone, the project goes live, and at the end of the third month an email arrives from finance: "What is this bill?"

The truth is, taking an LLM app to production and keeping it in production profitably are two separate professions. In the prototype phase nobody cares about token cost; you fire off a few hundred requests a day, the bill is coffee money. But once traffic gets multiplied by real users, that innocent-looking system prompt, that giant context window reprocessed on every request, that "let's always use the best one" reflex that sends even the simplest question to the most expensive model — it all accumulates and the monthly bill climbs into five figures.

The good news: most of that cost is waste. And cleaning up the waste usually doesn't require changing your model or compromising on quality. You just need to make the right engineering decisions in the right order. In this post I'm giving you the playbook I apply in the field, the one that delivers measurable results.

First, understand the cost: where does the token go?

The bill for an LLM call has two line items: input tokens and output tokens. The point most people miss: in most production apps, money is burned by input, not output. Because the context fed by RAG, long system prompts, tool schemas, conversation history — all of it is sent to the model again and again on every request.

For reference, here are current Anthropic prices as of 2026 (per million tokens, input/output):

ModelInputOutput
Claude Opus 4.8$5.00$25.00
Claude Sonnet 4.6$3.00$15.00
Claude Haiku 4.5$1.00$5.00

Small numbers on the surface, right? But scale the math and the story changes. Say you have a customer support assistant; on every request you send a 4,000-token system prompt + tool schema, 3,000 tokens of RAG context, and 1,000 tokens of conversation history. That's ~8,000 input tokens per request. If you get 50,000 requests a day:

"

8,000 tokens × 50,000 requests = 400 million input tokens / day With Sonnet 4.6: 400M × ($3 / 1M) = $1,200/day, i.e. ~$36,000/month.

And output isn't even in this math yet. This is the point where you start thinking "maybe we should switch models." Yet of those 8,000 input tokens, maybe 7,000 are byte-for-byte identical on every request. This is exactly where our first big lever sits.

1. Prompt Caching: the single highest-leverage move

Prompt caching is the highest-return cost lever in production LLM engineering in 2026. The logic is simple: the unchanging prefix of your prompt (system prompt, tool schemas, fixed instructions, even static RAG documents) is processed once, the key-value tensors the model computes are stored, and on subsequent requests that part isn't reprocessed. It's only read.

The numbers are serious. On Anthropic:

  • Cache write: The first time, you pay 25% more than the normal input price (the cost of building the cache).
  • Cache read: On every subsequent request, those tokens are billed at 90% below the normal price.
  • Duration: 5 minutes by default; extendable to 1 hour with explicit configuration.

Let's look at the concrete impact. In the example above, if 7,000 of the 8,000 input tokens (system prompt + tool schema + static context) are fixed:

"

Without cache: 7,000 tokens × $3/1M = $0.021 per request With cache (read): 7,000 tokens × $0.30/1M = $0.0021 per request

So the cost for those 7,000 tokens drops to a tenth. At 50,000 daily requests, this is a saving of ~$945/day and ~$28,000/month on this portion alone. The numbers I see in the field match the literature: prompt caching can typically cut API costs by 45-80%, and as a side bonus it improves time-to-first-token by 13-31% — meaning your app gets both cheaper and faster.

On the OpenAI side, caching is automatic but the discount is capped at 50% (it kicks in for prompts over 1,024 tokens). The gap versus Anthropic's 90% can add up to thousands of dollars a year on a cache-heavy production load.

Practical notes from the field:

  • Order your prompt from static to dynamic. The unchanging parts (system prompt, tool definitions, few-shot examples) should be at the very front so the cache prefix is long. The user's changing question goes last.
  • To keep the cache "warm" every 5 minutes, if your traffic pattern is sparse, even a small keep-alive request can prevent a cache miss and yield net savings.
  • The most common cause of a cache miss: a changing timestamp, session ID, or random greeting at the start of the prompt. Pull those dynamic pieces out of the prefix.

2. Semantic cache: not paying twice for the same question

The provider's prompt cache caches the prefix of the prompt. But there's also this: users ask the same thing in different words. "What's your return policy?" and "How do I send the product back?" mean the same thing; classic cache can't catch this because the texts differ.

This is where semantic cache comes in. It extracts the embedding (vector) of the incoming question, measures its similarity to questions you've answered before, and if it's close enough, returns the old answer without ever calling the LLM. Open-source libraries like GPTCache (developed by Zilliz) offer this with a modular architecture: embedding model, vector store, similarity evaluator, and cache store — each swappable independently.

The real-world numbers are encouraging: GPTCache can reach 61-69% cache hit rates with over 97% hit accuracy. On a single-GPU stack, a 60% hit rate can mean ~$846/month in savings. Think about it: nearly two-thirds of incoming requests never reach the LLM; no tokens spent, no latency incurred.

The architecture I usually build in the field is three layers, and these layers stack, they don't overlap:

  1. Exact-match cache — cheapest, easiest to set up. Is it the identical request? Return from Redis.
  2. Semantic cache — catches paraphrases. Is embedding similarity above the threshold? Return the old answer.
  3. Provider prompt cache — the layer I described above, which lowers token cost even on a cache miss.

A well-designed agent using all three together can cut its AI bill by 50-90%.

3. Model Routing and Cascade: the right model for the right job

The most common and most expensive mistake I see: sending everything to the strongest model. "Opus is the best, so let's always use Opus." This is like renting a sledgehammer to drive a nail.

The bulk of real production traffic consists of simple jobs: intent classification, short summarization, data extraction, format conversion. Paying for a frontier model for these is burning money. Model routing is directing each request to the cheapest path that meets the quality requirement.

A typical routing strategy is set up like this:

Job typeModel choiceRationale
Classification, extraction, simple formattingHaiku 4.5 (or a cheap open model)Low difficulty, high volume, cheap
Mid-level reasoning, most RAG answersSonnet 4.6The default for most production work
Complex reasoning, critical decisionsOpus 4.8Only if benchmarks show a measurable gap

There are two main approaches:

  • Routing: You first classify the incoming request (usually with a cheap model or a simple classifier), then send it to the appropriate model.
  • Cascade: You try the cheap model first; if the answer's confidence is low or it fails a quality check, you escalate to a higher tier. This way you use the expensive model only when truly needed.

A critical warning here: a router is only as good as the eval that validates it. If you set up routing blindly, requests that are actually complex but get sent to the cheap model cause quality issues, and that costs you more as churn down the line. So in every routing setup I build a hold-out set of 200-500 representative questions with quality labels, and I rerun this eval every time I change a model or threshold.

To manage this in production I usually use an LLM gateway (like Helicone, Portkey, LiteLLM); behind a single endpoint it centrally manages routing, fallback, rate limiting and cache.

4. Batch API: half price if you're not in a hurry

Not every job has to be real-time. Nightly report generation, bulk document classification, data enrichment, eval runs, content moderation — these don't expect answers within seconds.

Both Anthropic and OpenAI offer a 50% discount for asynchronous batch jobs. Anthropic's Message Batches API processes requests asynchronously within 24 hours and applies a full half-price on all tokens (both input and output). So if you move all your workload that doesn't need to be real-time onto this track, that job's cost automatically halves.

The best part: the batch discount stacks with prompt caching. When 90% off repeated input + 50% off everything in batch pile on top of each other, the cost curve really does bend dramatically downward.

5. Quantization and self-hosting: ditching pay-per-token

Everything I've discussed so far was for API scenarios where you pay per token. But beyond a certain point — especially at high and predictable volume, or for organizations with KVKK/data sovereignty requirements — running your own model on your own servers becomes both cheaper and more secure.

This is where quantization comes in. You reduce the model's weights from FP16 to INT8 or INT4; this cuts memory usage by 2-4x and roughly halves inference cost, while preserving 95-99% of the original accuracy. The practical impact is big: when you run a 70B model with INT4 (e.g. via llama.cpp), a 140 GB model shrinks to ~35 GB; meaning it fits on a much smaller and cheaper GPU.

This area advanced rapidly in 2026. Google's TurboQuant (March 2026) compresses the KV cache itself to 3 bits per value, cutting KV cache memory by 6x with zero measured accuracy loss. TensorRT-LLM, using FP8 and INT8 quantization with architecture-aware calibration, significantly increases tokens per second on H100 and B200 hardware.

But quantization alone isn't enough; the serving layer matters just as much. Here engines like vLLM change the game:

  • PagedAttention: Splits memory into small, reusable pages; cuts memory waste by up to 90% (with under 4% waste) and increases throughput by 2-3x.
  • Continuous batching: Dynamically blends new requests with ongoing ones so the GPU is never idle; delivers 3-10x higher throughput on the same hardware.
  • Speculative decoding: A small draft model predicts multiple tokens, the big model verifies them in parallel in a single pass; multiple tokens are produced per iteration.

When these stack, vLLM can handle 3-5x more traffic on the same H100 than a naive PyTorch loop. So every extra token you squeeze out of your GPU directly lowers your unit cost.

How to make the cost decision? A simple rule: when your monthly API bill starts to exceed the cost of an equivalent GPU server (depreciation + operations), put self-hosting seriously on the table. At low volume, API is always cheaper; at high and steady volume, the table flips.

KVKK and the Turkey context: cost isn't just money

When working with organizations in Turkey, a second variable enters the cost equation: data sovereignty and compliance. Under KVKK (Law No. 6698 on the Protection of Personal Data), it's critical that special-category personal data does not leave the organization without a valid legal basis. A law firm, a hospital, or a bank often can't afford to send customer files as prompts to a US-based API.

At this point, self-hosting + quantization becomes not just a cost optimization but a compliance strategy. When you build a fully KVKK-compliant architecture running in a local environment, you get two wins at once: data never leaves the organization's boundaries, and token-based cloud costs disappear. The trend I see in 2026 is clear: Gartner projects that by 2030, more than 75% of enterprises in Europe and the Middle East will geopatriate their workloads to reduce geopolitical risk. With my enterprise clients in Turkey, I feel this shift more clearly every month.

The practical middle ground I recommend is usually hybrid: sensitive data subject to KVKK is processed by a local, quantized open model; jobs that contain no sensitive data and require general knowledge go to cloud APIs optimized with cache and routing. This strikes a healthy balance between compliance, cost and quality.

6. Observability: you can't manage a cost you don't measure

Every tactic I've listed so far stays up in the air unless you put observability on top of it. Because if you can't see which endpoint, which user, which prompt is burning the money, you can't know what to optimize.

As of 2026 there are several mature tools in this space, each with a different sweet spot:

  • Langfuse — open source, detailed tracing, prompt management. Being able to self-host it is also a plus for KVKK.
  • Helicone — the fastest setup. It works as a proxy layer; with a single-line endpoint change you get cost per request, model distribution, and per-user breakdowns.
  • LangSmith — the tightest integration for LangChain/LangGraph users; visualizes agent execution, captures token usage and cost per call.
  • LiteLLM, Portkey — multi-provider cost attribution and enforcement on the gateway side.

The most robust setup I see in the field is usually a combination of 2-3 tools: a gateway for fast cost tracking (Helicone/Portkey), a tracing tool for deep tracing (Langfuse/Phoenix), and a bridge like OpenLLMetry to plug into your existing APM stack.

What you must track: prompt-level traces, latency, token usage and cost, user sessions, and failure/error patterns. When you can break these down by user, endpoint and model, finding the 5% of traffic that burns 80% of the bill takes minutes. Usually something surprising surfaces: a single badly written query, a single endpoint falling into a cache miss, or a loop left running in the test environment.

The tactics table that ties it all together

Here's the playbook I apply in the field, ranked by impact and implementation difficulty:

TacticTypical savingImplementation difficultyWhen to start
Prompt caching (provider)45-80% on inputLowImmediately, day one
Static→dynamic prompt orderingMultiplies cache efficiencyVery lowTogether with caching
Exact + semantic cache40-80% (depends on hit rate)MediumIf you have repetitive traffic
Model routing / cascade30-70%Medium-high (eval required)If traffic is varied
Batch API (async jobs)50%LowEvery non-real-time job
Quantization + self-host~50%+ on unit costHighHigh/steady volume, KVKK
ObservabilityIndirect; makes everything visibleLow-mediumFrom the very start

Notice the moves at the top of the table are both cheap and high-return. My advice is always the same: start from the top. First caching and prompt ordering (a day's work, huge return), then observability (to see what to optimize), then routing and batch, and last — only if you really need it — quantization and self-hosting.

Let's combine it all in one scenario

Back to the opening example: 50,000 requests a day, 8,000 input tokens per request, Sonnet 4.6, ~$36,000/month on input alone.

Now let's optimize in order:

  1. Prompt caching: 7,000 of the 8,000 tokens are fixed. That part gets 90% cheaper. This alone roughly quarters the input cost → down to around ~$10,000/month.
  2. Semantic cache: If 40% of requests are repeated/similar questions, a significant portion of these requests never reach the LLM → another 25-30% drop.
  3. Model routing: 50% of traffic is actually simple jobs (classification, short answers) and you route them to Haiku; Haiku is a third of Sonnet's price → a serious drop in this segment.
  4. Batch: If you move nightly analysis/reporting jobs to batch, that workload is half price.

These moves don't act one by one, they stack. In the field I've witnessed many times that a combination like this brings the monthly bill down to a fifth — without giving a single user the chance to complain about quality. Because here's the beautiful part: almost none of these optimizations lower quality. Cache returns the same answer. Routing gives the simple job to a model that's already sufficient. Batch only changes the timing. Even with quantization, you preserve 95-99% of the accuracy.

Cost optimization isn't "settling for worse," as most people assume. It's cleaning up waste. And the amount of waste in an LLM app is far more than you'd guess when you see the first bill. Apply the layers in this guide in order, set up observability from the very start, protect every routing decision with an eval — the rest is disciplined engineering. And remember: in the Turkey context, the cost decision is never just about money; as long as KVKK and data sovereignty are on the table, the right architecture protects both your wallet and your organization's legal security.

Consulting Pathways

Consulting pages closest to this article

For the most logical next step after this article, you can review the most relevant solution, role, and industry landing pages here.

Comments

Comments

Connected pillar topics

Pillar topics this article maps to