Token Economics & LLM Cost Optimization
Token prices got ~26× cheaper from 2022 to 2026 (GPT-3.5 $20/M → Sonnet 4.6 $3/M, Haiku 4.5 $1/M). Yet companies' AI bills grew ~40× on average. Resolving this paradox is the foundational question of the entire course.
Table of Contents
Module 0: Why Cost, Why Now?
- 1
The AI Cost Explosion: Why Token Prices Fell 96% from 2022 to 2026 — Yet Bills Grew 40×
Token prices got ~26× cheaper from 2022 to 2026 (GPT-3.5 $20/M → Sonnet 4.6 $3/M, Haiku 4.5 $1/M). Yet companies' AI bills grew ~40× on average. Resolving this paradox is the foundational question of the entire course.
- 2
Unit Economics Vocabulary: COGS, Gross Margin, $/User, Contribution Margin — 9 Financial Concepts Every AI Engineer Must Know
9 financial concepts you need to calculate an AI product's real cost: COGS, Gross Margin, $/Request, $/User, $/MAU, Contribution Margin, CAC, LTV, Payback Period. Each with concrete LLM examples.
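A minimal worked example of how these concepts fit together, with every figure purely illustrative:

```python
# Hypothetical AI SaaS unit economics; every number here is an assumption.
price_per_user_month = 20.00      # subscription price
llm_cost_per_user_month = 4.50    # inference spend (the AI share of COGS)
other_cogs_per_user_month = 1.50  # hosting, vector DB, support tooling

cogs = llm_cost_per_user_month + other_cogs_per_user_month
gross_margin = (price_per_user_month - cogs) / price_per_user_month

cac = 60.00                                           # customer acquisition cost
contribution_per_month = price_per_user_month - cogs  # contribution margin per user
payback_months = cac / contribution_per_month

print(f"gross margin: {gross_margin:.0%}")             # 70%
print(f"payback period: {payback_months:.1f} months")  # ~4.3 months
```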
- 3
Workshop Toolkit: A Quick Tour of the 11 Tools We'll Use Throughout the Course
Quick tour of the 11 key tools we'll use in the course: tiktoken, anthropic-tokenizer, Langfuse, Helicone, LiteLLM, vLLM, RouteLLM, LLMLingua, GPTCache, tldraw, Python uv. For each: what it does, when it kicks in, free or paid.
- 4
Workshop Setup: Python, uv, API Keys, Your First LLM Call, and Langfuse Trace in 20 Minutes
Full workshop setup for course labs: Python 3.12, uv, virtual env, OpenAI/Anthropic/Gemini/DeepSeek/Groq API keys (all with free credit), Langfuse cloud account, first LLM call, and first telemetry trace.
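As a preview of where the setup lands, here is a minimal sketch of the first call with telemetry. It assumes the API keys and Langfuse environment variables from this lesson are already exported, and uses Langfuse's drop-in OpenAI wrapper; the exact import path can differ between Langfuse versions, so treat it as a template rather than copy-paste gospel.

```python
# First LLM call, automatically recorded as a Langfuse trace.
# Assumes OPENAI_API_KEY and the LANGFUSE_* variables are set in the environment.
from langfuse.openai import OpenAI  # drop-in replacement for `from openai import OpenAI`

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)
print(resp.usage)  # the token counts we will obsess over for the rest of the course
```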
Module 1: Token Anatomy — What's Inside a Token?
- 1
Character, Word, Token: The 3 Units That Determine Your Bill — and Their Surprising Differences
A token is the basic unit through which LLMs see text. Character count, word count, and token count give very different results for the same text. By the end of this lesson, you'll be able to estimate 'how many tokens is this paragraph?' by eye, with less than 10% error.
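A quick way to see the three units diverge, using tiktoken from the course toolkit (the encoding name is an assumption; pick the one that matches your target model):

```python
import tiktoken

text = "Large language models bill you per token, not per word or per character."
enc = tiktoken.get_encoding("cl100k_base")

chars = len(text)
words = len(text.split())
tokens = len(enc.encode(text))
print(f"{chars} characters, {words} words, {tokens} tokens")
# Rough rule of thumb for English prose: ~4 characters or ~0.75 words per token.
```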
- 2
Tokenizer Wars: How GPT, Claude, Gemini, Llama, Mistral and DeepSeek Split the Same Turkish Text
The same 3 Turkish texts, 6 different tokenizers. Token counts differ by up to 35%. This gap shows up directly on your bill. This lesson lays the groundwork for 'which model is most token-economical for the same job?'
- 3
The Turkish Token Penalty: Why Turkish Text Costs 1.7× More on Your Bill — and How to Live With It
Turkish is an agglutinative language, so BPE tokenizers split words into many pieces. The same semantic information costs ~70% more tokens — directly 70% higher bills. This lesson covers the math, real-world impact, and 4 mitigation strategies.
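You can measure the penalty yourself on any sentence pair. A sketch with tiktoken; the two sentences are rough translations of each other, and the printed ratio is an illustration rather than a universal constant:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
english = "Our customers can track the status of their orders whenever they want."
turkish = "Müşterilerimiz siparişlerinin durumunu istedikleri zaman takip edebilirler."

en_tokens = len(enc.encode(english))
tr_tokens = len(enc.encode(turkish))
print(f"English: {en_tokens} tokens, Turkish: {tr_tokens} tokens, "
      f"penalty: {tr_tokens / en_tokens:.2f}x")
```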
- 4
Input vs Output Tokens: Which Is 5× More Expensive — and Why Knowing It Makes You Money
All major LLMs price output tokens 3-5× higher than input tokens. This gap isn't a technical accident — it's rooted in GPU economics and directly shapes engineering decisions. Designing for 'lots of input, little output' means 40-60% savings.
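A back-of-the-envelope comparison of two designs for the same task, using the course's $3/M input figure for Sonnet 4.6 and assuming output at 5× that ($15/M):

```python
IN_PRICE, OUT_PRICE = 3.00 / 1e6, 15.00 / 1e6  # assumed $/token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

verbose = call_cost(input_tokens=2_000, output_tokens=1_200)  # free-form answer
terse = call_cost(input_tokens=2_300, output_tokens=300)      # longer prompt, constrained answer
print(f"verbose: ${verbose:.4f}  terse: ${terse:.4f}  saving: {1 - terse / verbose:.0%}")
# Moving work from output to input roughly halves the cost here, despite the bigger prompt.
```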
- 5
Context Window Economics: 200K, 1M, 10M Token Contexts — Burning Money or a Superpower?
Modern LLM context windows reach 200K-10M tokens. But big context isn't cheap: a single 200K Sonnet 4.6 call costs $0.60. We analyze the real cost of 'put the whole book in the prompt', when it's worth it, and when it's a budget killer.
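The blurb's figure is easy to verify, and worth extending to volume (the call rate below is an assumption):

```python
input_tokens, price_per_m = 200_000, 3.00        # Sonnet 4.6 input price quoted in this course
per_call = input_tokens / 1e6 * price_per_m
print(f"${per_call:.2f} per call")               # $0.60, before any output tokens
print(f"${per_call * 500 * 30:,.0f} per month")  # at 500 such calls/day: $9,000
```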
- 6
Multimodal Tokens: How Images, Audio, and Video Are Priced in LLMs
Text isn't alone — in 2026 nearly every LLM takes images, audio, and video. How many tokens is one image? How many dollars is an hour of audio? Is a 4K video expensive? Provider differences, calculation formulas, and real lab examples.
Module 2: The 2026 Pricing Landscape
- 1
OpenAI Pricing Schema Deep Dive: 7 Tiers, 12 Products, 3 Discounts — What to Use When
OpenAI's pricing page has 12 products, each with 3-5 options: standard, cached input, batch (50% off), fine-tuning, embedding, image, audio, realtime, image generation. We break down every tier with real calculation examples.
- 2
Anthropic Pricing Schema: The 90% Discount Magic of Prompt Caching and the Extended Thinking Bill
Claude Haiku/Sonnet/Opus pricing table, the 1.25× write / 0.10× read math of prompt caching, the hidden output cost of extended thinking, Batch API, and why Anthropic is the most economical choice for Turkish.
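The break-even intuition behind those multipliers, sketched in a few lines (normalized prices, a 90% hit rate assumed, misses modeled as re-writes):

```python
WRITE, READ = 1.25, 0.10   # cost multipliers vs. sending the same prefix uncached

def relative_cost(n_calls: int, hit_rate: float = 0.9) -> float:
    """Average per-call cost of the cached prefix, relative to no caching at all."""
    misses = 1 + (n_calls - 1) * (1 - hit_rate)   # first call always writes
    hits = (n_calls - 1) * hit_rate
    return (misses * WRITE + hits * READ) / n_calls

for n in (1, 2, 5, 100):
    print(n, round(relative_cost(n), 2))
# 1 call: 1.25 (you lose), 2 calls: ~0.73, 100 calls: ~0.23, i.e. roughly 77% savings.
```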
- 3
Google Gemini Pricing Schema: Tier Traps Behind the Cheap Look and the Real Cost of 1M Context
Gemini 2.5 Pro/Flash/Flash-Lite pricing table, the 2× price jump above 200K tokens, the context caching mechanism, the real limits of the free tier, how Vertex AI differs for enterprise, and Google's impact on the Turkish ecosystem.
- 4
Open-Weight Inference: Together, Fireworks, Groq, Cerebras, DeepSeek — Frontier Quality at 5% of the Price?
Providers serving open-weight models like Llama 4, Mistral, Qwen 3, DeepSeek V3.5 — Together AI, Fireworks, Groq, Cerebras, Replicate, DeepSeek native. Price comparison, latency/throughput trade-offs, which provider for what.
- 5
AWS Bedrock, Azure OpenAI, Vertex AI: The Enterprise Pricing Landscape and the Compliance Premium
AWS Bedrock, Azure OpenAI Service, Google Vertex AI — enterprise cloud LLM options. Standard on-demand pricing, provisioned throughput, region pricing, KVKK compliance premium, and when to switch to enterprise cloud.
- 6
Self-Hosted LLM Real Cost: The Full Conversion Formula from GPU-Hour to $/M Token
When you run Llama 3.3 70B on an H100 on RunPod, what's the real $/M token? The GPU-hour × throughput × MFU conversion formula, the effect of vLLM continuous batching, and the volume at which self-hosting becomes cheaper than frontier APIs.
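The core conversion the lesson builds, as a sketch with purely illustrative numbers (GPU price, throughput, and the utilization factor all need to be measured for your own stack):

```python
gpu_hour_price = 2.50       # $ per H100-hour on a rental cloud (assumption)
tokens_per_second = 1_500   # sustained throughput with vLLM continuous batching (assumption)
utilization = 0.60          # share of the hour actually serving traffic (stands in for MFU/traffic)

tokens_per_hour = tokens_per_second * 3600 * utilization
dollars_per_m_tokens = gpu_hour_price / (tokens_per_hour / 1e6)
print(f"${dollars_per_m_tokens:.2f} per million tokens")  # ~$0.77 under these assumptions
```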
- 7
Hidden Costs That Inflate Your Bill: Tool Use, Structured Output, Thinking, Web Search and More
Items not on the pricing page but real on the LLM bill: tool definitions added to the input, structured-output prefill, the hidden output of reasoning/thinking tokens, the $30/1K web-search tool, and the 9× price increase of vision detail mode. We expose the invisible corners of the bill.
Module 3: Cost Telemetry — Measure First, Then Optimize
- 1
If You Can't Measure It, You Can't Optimize It: LLM Telemetry Philosophy and Establishing Your First Baseline
The oldest motto in engineering applies to LLMs: measure first, optimize next. This lesson covers why optimizing without telemetry is flying blind, the 5 metrics you must track, and how to establish your first baseline in 30 days.
- 2
Anatomy of the API Response 'usage' Object: OpenAI, Anthropic, Gemini Compared
Every LLM API response has a 'usage' object — input_tokens, output_tokens, cached_input, reasoning_tokens, etc. These fields differ across providers. This lesson dissects each one and shows the correct parsing pattern for telemetry.
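A normalization sketch of the kind this lesson builds toward. The field names reflect the common response shapes of the OpenAI chat completions and Anthropic messages APIs; any field you don't see in your own responses should be checked against the provider docs:

```python
def normalize_usage(provider: str, usage: dict) -> dict:
    """Map provider-specific usage fields onto one telemetry schema."""
    if provider == "openai":
        return {
            "input": usage.get("prompt_tokens", 0),
            "output": usage.get("completion_tokens", 0),
            "cached_input": usage.get("prompt_tokens_details", {}).get("cached_tokens", 0),
        }
    if provider == "anthropic":
        return {
            "input": usage.get("input_tokens", 0),
            "output": usage.get("output_tokens", 0),
            "cached_input": usage.get("cache_read_input_tokens", 0),
        }
    raise ValueError(f"unknown provider: {provider}")

print(normalize_usage("openai", {
    "prompt_tokens": 1200, "completion_tokens": 340,
    "prompt_tokens_details": {"cached_tokens": 1024},
}))
```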
- 3
Streaming Token Counting Pitfalls: 7 Common Production Bugs
Token counting in stream mode easily goes wrong: partial output count in cancelled streams, missing last-chunk usage, token losses during idle timeout. We cover the 7 most common production bugs with fixes.
- 4
Full Telemetry Tools Comparison: Langfuse vs Helicone vs LangSmith vs Phoenix vs OTel
We compare the 5 main LLM observability tools side-by-side: feature sets, pricing, self-host options, KVKK compliance, integration ease. Decision matrix for 'which one should I use in my case'.
- 5
Self-Hosted LLM Observability from Scratch: $/Request Dashboard with ClickHouse + Grafana
Instead of third-party tools, build your own observability stack: ClickHouse + Grafana + LiteLLM Webhook. Step-by-step Docker setup, schema design, dashboard JSON, and Slack alerts — production-grade, infinite scale, KVKK-compliant.
- 6
Integrating LLM Cost with Enterprise APMs: Sentry, Datadog, New Relic Patterns
If you have existing APM (Sentry, Datadog, New Relic), you can extend them for LLM telemetry instead of using a separate tool. This lesson covers LLM-specific features in 3 enterprise APMs, custom metric patterns, and cost attribution strategies.
Module 4: Cost Attribution
- 1
Multi-Tenant SaaS Cost Attribution: Correctly Attributing Costs of 1000 Customers Through One API Key
In B2B SaaS you need to report costs separately for 1000 customers while using a single OpenAI API key. This lesson covers tenant_id propagation, metadata injection, and dashboard segmentation patterns.
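The smallest possible version of the pattern, before any tooling: tag every call with a tenant_id and aggregate cost from the usage object the API already returns. The prices here are assumptions, and in the course we do this through LiteLLM/Langfuse metadata rather than a hand-rolled dictionary:

```python
from collections import defaultdict

PRICE_IN, PRICE_OUT = 3.00 / 1e6, 15.00 / 1e6   # assumed $/token for the model in use
cost_by_tenant: dict[str, float] = defaultdict(float)

def record(tenant_id: str, usage) -> None:
    """Attribute one call's cost to the tenant that triggered it."""
    cost = usage.prompt_tokens * PRICE_IN + usage.completion_tokens * PRICE_OUT
    cost_by_tenant[tenant_id] += cost

# After each LLM call in your request handler:
#   record(tenant_id=current_tenant, usage=resp.usage)
# and the dashboard simply reads cost_by_tenant (or its persisted equivalent).
```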
- 2
Feature-Flag → Cost-Flag: Engineering-Level Measurement of A/B Test's Real $/User Impact
You want to show a new AI feature to 50% of users to measure impact. Conversion is easy to measure — but the cost difference? This lesson covers adding a cost-flag to each A/B variant, statistical significance, and decision-making with LTV.
- 3
LiteLLM Virtual Keys: Production-Grade Multi-Tenant Cost Attribution Infrastructure
Creating virtual keys in LiteLLM Proxy, per-key budgets, rate limits, model whitelists, and the full admin API. Each tenant gets their own key = automatic attribution + automatic control.
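A sketch of issuing one tenant's virtual key against a running LiteLLM proxy. The endpoint and field names follow LiteLLM's key-management API but should be verified against the proxy version you deploy; the master key, model names, and budget are placeholders:

```python
import requests

resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-master-key"},     # proxy master key (placeholder)
    json={
        "metadata": {"tenant_id": "acme-corp"},            # automatic attribution tag
        "max_budget": 50.0,                                # hard spend cap in dollars
        "models": ["gpt-4o-mini", "claude-haiku-4-5"],     # per-tenant model whitelist (placeholder IDs)
        "duration": "30d",                                 # key expiry
    },
    timeout=10,
)
print(resp.json()["key"])  # hand this key to the tenant's backend instead of your real API key
```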
- 4
Chargeback Reporting to Internal Teams and Enterprise Customers: PDF, CSV, Invoice Generation
Your engineering team burns $4K a month on LLMs — which project, which feature, whose code consumed it? How do you send an AI usage invoice to an enterprise customer? This lesson covers the anatomy of chargeback reporting automation.
- 5
Cost-Driven Abuse: Prompt-Injection Attacks, Bot Traffic, and Defending Against Cost Attacks
An attacker can target your AI product with prompt injection specifically to inflate your costs. This lesson covers cost-based attack vectors (prompt explosion, recursive tool calling, expensive context flooding), detection methods, and production mitigation.
Module 5: The Cost Dimension of Prompt Engineering
- 1
"I 4×'d My Prompt and Doubled My Tokens": 8 Most Common Prompt Cost Mistakes in Production
Prompt engineering is usually written from a quality lens — but every extra token directly hits your bill. This lesson covers the 8 most common mistakes in production, with real prompt examples and before/after token counts.
- 2
7 Techniques to Halve Your System Prompt: Practical, Tested, Quality-Preserving
After eliminating the previous lesson's mistakes: it's possible to shrink your prompt by another 50% without losing quality. We cover 7 advanced techniques with real prompt before/after examples.
- 3
Few-Shot Examples Economics: 0 vs 3 vs 8? The Cost vs Accuracy Trade-Off
Few-shot examples raise input tokens but improve output quality. What's the right number? This lesson compares test results with 0, 1, 3, 5, and 8 examples and gives recommended counts by task type.
- 4
The Cost of Chain-of-Thought: "Think Step by Step" Can Inflate Your Bill 3-10×
CoT (chain-of-thought) prompting improves accuracy by 20-40% in some tasks. But it inflates output tokens 3-10×. This lesson covers CoT cost vs accuracy across 5 task types and when to use it.
- 5
Structured Output Pitfalls: JSON Mode Token Greed and the Real Cost of Tool-Use Forcing
Using JSON mode doesn't mean 'fewer tokens' — in most cases it uses **more tokens**. Schema complexity, field name length, escape characters — all hidden token costs. This lesson covers cost-aware structured output design.
- 6
Output Shortening Techniques: max_tokens, Stop Sequences, and the Real Impact of "Be Terse" Prompts
Since output tokens cost 3-5× more than input, shortening output directly lowers your bill. This lesson covers max_tokens strategy, correct stop-sequence usage, the measured impact of 'be terse' prompts, and format-driven constraints.
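Both levers in a single call, shown with the Anthropic Messages API; the model ID and values are placeholders, and OpenAI's equivalents are max_tokens / max_completion_tokens and stop:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-haiku-4-5",        # placeholder ID for the Haiku 4.5 tier
    max_tokens=200,                  # hard ceiling on billable output tokens
    stop_sequences=["\n\n"],         # cut generation at the first blank line
    messages=[{"role": "user", "content": "List three ways to reduce output tokens."}],
)
print(resp.content[0].text)
print(resp.usage.output_tokens)      # what you actually pay for
```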
Module 6: Prompt Compression
- 1
LLMLingua, LongLLMLingua, Selective-Context: Comparison of Automatic Prompt Compression Families
Microsoft Research's LLMLingua family compresses prompts 50-90% while keeping quality loss at 2-5%. This lesson compares LLMLingua-1, LLMLingua-2, LongLLMLingua, Selective-Context, LongHeads with setup and first Turkish examples.
- 2
Gisting and Soft-Prompt Tuning: Compressing Prompts into Embedding Vectors
While LLMLingua compresses 60-90%, gisting goes down to 1/100. The logic: representing prompts as dense embedding vectors instead of token sequences. This lesson covers gisting, soft prompt tuning, and how realistic they are in practice.
- 3
Embedding-Based Selection: The Most Practical Way to Discard Irrelevant Context
In RAG, most retrieved chunks (~50-70%) don't actually contribute to the answer. Discarding question-irrelevant parts via embedding similarity saves 50-80% tokens. This lesson covers implementation, threshold selection, and validation with LLM-as-judge.
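A minimal version of the filter, assuming OpenAI embeddings; the model name and the 0.35 threshold are assumptions you would tune against your own eval set:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def select_chunks(question: str, chunks: list[str], threshold: float = 0.35) -> list[str]:
    """Keep only the retrieved chunks whose cosine similarity to the question clears the threshold."""
    vecs = embed([question] + chunks)
    q, c = vecs[0], vecs[1:]
    sims = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    return [chunk for chunk, sim in zip(chunks, sims) if sim >= threshold]
```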
- 4
Prompt Distillation: Transferring the Big Model's Prompt to Small Model and a 95% Cost Reduction
You can transfer a complex prompt working with Sonnet 4.6 to Haiku 4.5 via fine-tuning, achieving the same quality at 95% lower cost. This lesson covers the distillation pipeline, eval setup, and break-even analysis.
- 5
Quality-Monitored Compression: Scientifically Finding the Compression Boundary
Should compression be 50%, 70%, or 90%? You can't know by 'feel' — you need an eval framework. This lesson covers LLM-as-judge, golden test set, A/B test production rollout, and regression detection patterns.
Module 7: Prompt Caching — The Single Biggest Cost Lever of 2026
- 1
Anthropic Prompt Caching Deep-Dive: 1.25× Write, 0.10× Read — Turning the Math Into Maximum Savings
Anthropic's caching math looks simple: write at 1.25×, read at 0.10×. But to achieve 90% savings in production you need to master breakpoint count, TTL choice, multi-cache layering, and refresh strategies. This is a master-level lesson.
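The structural heart of it is a single cache breakpoint on the static prefix. A sketch following the prompt-caching API shape; the model ID is a placeholder, and TTL options should be checked against current Anthropic docs:

```python
import anthropic

LONG_STATIC_SYSTEM_PROMPT = "..."  # imagine several thousand tokens of policies, tool docs, FAQ
user_question = "What is your refund policy?"

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-sonnet-4-6",                   # placeholder ID for the Sonnet 4.6 tier
    max_tokens=500,
    system=[{
        "type": "text",
        "text": LONG_STATIC_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache breakpoint: everything up to here is cached
    }],
    messages=[{"role": "user", "content": user_question}],  # the dynamic tail stays uncached
)
print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)
```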
- 2
OpenAI Automatic Cached Input: Maximizing the 'Magical' Automatic Cache
OpenAI cached input gives 50-87% discount (model-dependent) but auto-triggers — limited control. This lesson covers trigger conditions, maximization strategies, cache hit detection, and Anthropic comparison.
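Cache hits surface in the usage payload rather than as a separate signal, so detection is a per-call ratio. A sketch that assumes the current chat completions response shape (prompt_tokens_details.cached_tokens):

```python
def cache_hit_ratio(resp) -> float:
    """Fraction of this call's prompt tokens served from OpenAI's automatic cache."""
    details = getattr(resp.usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    return cached / resp.usage.prompt_tokens if resp.usage.prompt_tokens else 0.0

# Log this per request; a fleet-wide average near zero usually means your prompts
# never share a long enough identical prefix to trigger the cache.
```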
- 3
Gemini Context Caching: Storage Fee + Read Fee Model and Low-Traffic Advantage
Gemini's caching pricing is unique: a cache write at the normal input price, then a storage fee of $1/M tokens/hour plus a 0.25× read fee. It can be more economical than Anthropic in low-traffic, low-frequency scenarios.
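A toy comparison of the two cost models on one static prefix, using the figures quoted in this module; prices, prefix size, and traffic are all assumptions, Anthropic's short default cache TTL is what forces a re-write on every 15-minute call here, and the one-time cache creation on the Gemini side is ignored:

```python
prefix_m = 0.05      # 50K-token static prefix, in millions of tokens
base_in = 3.00       # assumed $/M input tokens
calls = 4            # calls per hour, 15 minutes apart

gemini = prefix_m * 1.00 + calls * prefix_m * base_in * 0.25   # $1/M/hour storage + 0.25x reads
anthropic = calls * prefix_m * base_in * 1.25                  # cache expired each time: every call re-writes
print(f"Gemini: ${gemini:.2f}/hour   Anthropic: ${anthropic:.2f}/hour")
# -> Gemini: $0.20/hour vs Anthropic: $0.75/hour in this low-frequency scenario.
```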
- 4
Cache-Friendly Architecture: Static Head, Dynamic Tail Principle
Cache efficiency depends on prompt structure: what goes where? This lesson covers the universal 'static prefix → dynamic suffix' pattern, conversation history management, RAG chunks placement, and tool definitions ordering.
- 5
Cache Hit-Rate Measurement and Optimization: How to Go From 50% to 85%
You enabled cache but hit-rate is stuck at 50%. This indicates a prompt architecture issue. This lesson covers hit-rate measurement dashboard, miss reason analysis, iterating with A/B tests, and patterns to reach 85%.
- 6
Cache Invalidation: Avoiding Stale Cache When Updating System Prompt, Tools, FAQ
You enabled cache in production. One day you need to update the system prompt — old cache is stale. This lesson covers dual-write pattern, gradual rollout, cache versioning, and emergency invalidation.