Are cached tokens counted within 'input_tokens' or separately?

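It depends on the provider. OpenAI: prompt_tokens includes cached tokens, with the breakdown nested under prompt_tokens_details. Anthropic: input_tokens does NOT include cached tokens; cache writes and cache reads are reported in separate fields. Gemini: prompt_token_count includes cached tokens. Normalize these differences with LiteLLM or write your own adapter.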

Anatomy of the API Response 'usage' Object: OpenAI, Anthropic, Gemini Compared

Every LLM API response has a 'usage' object — input_tokens, output_tokens, cached_input, reasoning_tokens, etc. These fields differ across providers. This lesson dissects each one and shows the correct parsing pattern for telemetry.

Şükrü Yusuf KAYA

18 min read

5/14/2026

Intermediate

🔬 Every provider speaks its own dialect

OpenAI says "prompt_tokens". Anthropic says "input_tokens". Gemini says "prompt_token_count". Three names for the same idea, but not always the same quantity: some include cached tokens and some do not. When you build telemetry, you have to normalize these differences.

OpenAI `usage` Object

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
)

print(response.usage)
# CompletionUsage(
#     prompt_tokens=4521,
#     completion_tokens=287,
#     total_tokens=4808,
#     prompt_tokens_details=PromptTokensDetails(
#         cached_tokens=3500,
#         audio_tokens=0,
#     ),
#     completion_tokens_details=CompletionTokensDetails(
#         reasoning_tokens=120,
#         audio_tokens=0,
#         accepted_prediction_tokens=0,
#         rejected_prediction_tokens=0,
#     )
# )

Field descriptions

Field	Meaning
`prompt_tokens`	Total input tokens (cached tokens included)
`completion_tokens`	Total output tokens (reasoning tokens included)
`total_tokens`	Sum of prompt_tokens and completion_tokens
`prompt_tokens_details.cached_tokens`	Number of input tokens served from the cache
`completion_tokens_details.reasoning_tokens`	Tokens spent on internal reasoning by reasoning models (e.g. o-series)
`audio_tokens`	Number of multimodal audio tokens

Cost calculation

def cost_openai(usage, model_pricing):
    """Return the request cost in USD; model_pricing holds USD per 1M tokens."""
    # prompt_tokens_details is optional in the SDK type, so guard the lookup.
    details = usage.prompt_tokens_details
    cached = (details.cached_tokens or 0) if details else 0
    non_cached_input = usage.prompt_tokens - cached
    output = usage.completion_tokens

    input_cost = (non_cached_input / 1e6) * model_pricing["input"]
    cached_cost = (cached / 1e6) * model_pricing["input_cached"]
    # Reasoning tokens are billed at the output rate and are already
    # counted inside completion_tokens, so they are not added again.
    output_cost = (output / 1e6) * model_pricing["output"]

    return input_cost + cached_cost + output_cost
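
To make the cached split concrete, here is a minimal worked example using the sample numbers from the response above. The prices in `HYPOTHETICAL_PRICING` are invented for illustration (USD per 1M tokens), `split_openai_input` is a hypothetical helper name, and `SimpleNamespace` only stands in for the SDK's usage object.

```python
from types import SimpleNamespace

# Hypothetical prices in USD per 1M tokens; not real list prices.
HYPOTHETICAL_PRICING = {"input": 2.00, "input_cached": 0.50, "output": 8.00}

# Stand-in for response.usage, mirroring the sample output shown earlier.
usage = SimpleNamespace(
    prompt_tokens=4521,
    completion_tokens=287,
    prompt_tokens_details=SimpleNamespace(cached_tokens=3500),
)


def split_openai_input(usage):
    """Split prompt_tokens into (non_cached, cached).

    cached_tokens is a subset of prompt_tokens, so it is subtracted, not added.
    """
    details = getattr(usage, "prompt_tokens_details", None)
    cached = (getattr(details, "cached_tokens", 0) or 0) if details else 0
    return usage.prompt_tokens - cached, cached


non_cached, cached = split_openai_input(usage)
cost = (
    non_cached / 1e6 * HYPOTHETICAL_PRICING["input"]
    + cached / 1e6 * HYPOTHETICAL_PRICING["input_cached"]
    + usage.completion_tokens / 1e6 * HYPOTHETICAL_PRICING["output"]
)

print(non_cached, cached)  # 1021 3500
print(f"${cost:.6f}")
```

Billing all 4,521 prompt tokens at the full input rate would overstate the cost, which is why the subtraction comes first.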

Anthropic `usage` Object

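import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[...],
)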

print(response.usage)
# Usage(
#     input_tokens=4521,
#     output_tokens=287,
#     cache_creation_input_tokens=3500,
#     cache_read_input_tokens=0,
#     server_tool_use=...
# )

Field descriptions

Field	Meaning
`input_tokens`	Standard input that was neither written to nor read from the cache
`output_tokens`	Total output (thinking included)
`cache_creation_input_tokens`	Tokens newly written to the cache (billed at 1.25× the base input price)
`cache_read_input_tokens`	Tokens read from the cache (billed at 0.10× the base input price)
`server_tool_use`	Usage counts for server-side tools

⚠️ In Anthropic's API, input_tokens does not include the cached portion. For telemetry, total input =

input_tokens + cache_creation_input_tokens + cache_read_input_tokens
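
The warning above can be sketched as a small helper that rebuilds the total input and reports how much of it came from the cache. The function name `anthropic_input_breakdown` and the sample numbers are assumptions for illustration; `SimpleNamespace` stands in for the SDK's usage object.

```python
from types import SimpleNamespace


def anthropic_input_breakdown(usage):
    """Return total input tokens and the share of them served from cache."""
    standard = usage.input_tokens
    # The cache fields may be absent or None when caching is not in use.
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    total = standard + cache_write + cache_read
    hit_ratio = cache_read / total if total else 0.0
    return {"total_input": total, "cache_hit_ratio": hit_ratio}


# Stand-in for response.usage on a call that reuses a cached prefix.
usage = SimpleNamespace(
    input_tokens=1021,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=3500,
    output_tokens=287,
)

breakdown = anthropic_input_breakdown(usage)
print(breakdown["total_input"])  # 4521
print(f"{breakdown['cache_hit_ratio']:.1%}")
```

Logging only input_tokens here would report 1,021 tokens for a request that actually carried 4,521 tokens of input.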

Cost calculation

def cost_anthropic(usage, model_pricing):
    """Return the request cost in USD; model_pricing holds USD per 1M tokens."""
    standard_in = usage.input_tokens
    cache_write = usage.cache_creation_input_tokens or 0
    cache_read = usage.cache_read_input_tokens or 0
    output = usage.output_tokens  # already includes thinking tokens

    input_cost = (standard_in / 1e6) * model_pricing["input"]
    # Cache writes cost 1.25x the base input price, cache reads 0.10x.
    write_cost = (cache_write / 1e6) * (model_pricing["input"] * 1.25)
    read_cost = (cache_read / 1e6) * (model_pricing["input"] * 0.10)
    output_cost = (output / 1e6) * model_pricing["output"]

    return input_cost + write_cost + read_cost + output_cost
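
As a rough illustration of what the 1.25× and 0.10× multipliers mean in practice, the sketch below prices the same request twice: once when the prefix is written to the cache and once when it is read back. The prices in `HYPOTHETICAL_PRICING` are invented, and `anthropic_cost` is a standalone stand-in that takes plain integers rather than the SDK object.

```python
def anthropic_cost(standard_in, cache_write, cache_read, output, pricing):
    """USD cost using the 1.25x cache-write and 0.10x cache-read multipliers."""
    base = pricing["input"]
    return (
        standard_in / 1e6 * base
        + cache_write / 1e6 * base * 1.25
        + cache_read / 1e6 * base * 0.10
        + output / 1e6 * pricing["output"]
    )


# Hypothetical prices in USD per 1M tokens; not real list prices.
HYPOTHETICAL_PRICING = {"input": 3.00, "output": 15.00}

# First call writes a 3,500-token prefix to the cache; the second reads it back.
first_call = anthropic_cost(1021, 3500, 0, 287, HYPOTHETICAL_PRICING)
second_call = anthropic_cost(1021, 0, 3500, 287, HYPOTHETICAL_PRICING)

print(f"first:  ${first_call:.6f}")
print(f"second: ${second_call:.6f}")
print(f"saved:  ${first_call - second_call:.6f}")
```

The first call is more expensive than an uncached request would be, so caching only pays off once the prefix is reused.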

Gemini `usage_metadata` Object

from google import genai

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="...",
)

print(response.usage_metadata)
# UsageMetadata(
#     prompt_token_count=4521,
#     candidates_token_count=287,
#     total_token_count=4808,
#     cached_content_token_count=3500,
#     thoughts_token_count=120,
#     tool_use_prompt_token_count=15,
# )

Field descriptions

Field	Meaning
`prompt_token_count`	Total input (cached tokens included)
`candidates_token_count`	Visible output (thinking excluded!)
`total_token_count`	Sum of the token counts for the request
`cached_content_token_count`	Input tokens read from the cache
`thoughts_token_count`	Tokens consumed from the thinking budget
`tool_use_prompt_token_count`	Extra prompt tokens added by tool calls
⚠️ In Gemini, candidates_token_count does not include thinking. Total billable output =

candidates_token_count + thoughts_token_count
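
A minimal sketch of that rule, assuming the thinking field may be missing or None when the model did not think. `gemini_billable_output` is a hypothetical helper name and `SimpleNamespace` stands in for the SDK's usage metadata object.

```python
from types import SimpleNamespace


def gemini_billable_output(usage_meta):
    """Visible output plus thinking tokens; missing or None counts are treated as 0."""
    visible = getattr(usage_meta, "candidates_token_count", None) or 0
    thinking = getattr(usage_meta, "thoughts_token_count", None) or 0
    return visible + thinking


with_thinking = SimpleNamespace(candidates_token_count=287, thoughts_token_count=120)
without_thinking = SimpleNamespace(candidates_token_count=287, thoughts_token_count=None)

print(gemini_billable_output(with_thinking))     # 407
print(gemini_billable_output(without_thinking))  # 287
```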

Cost calculation

def cost_gemini(usage_meta, model_pricing, tier=200_000):
    """Return the request cost in USD; model_pricing holds USD per 1M tokens.

    model_pricing is expected to carry input_low/input_high and
    output_low/output_high rates.
    """
    prompt = usage_meta.prompt_token_count
    cached = usage_meta.cached_content_token_count or 0
    non_cached = prompt - cached
    output_visible = usage_meta.candidates_token_count or 0
    thinking = usage_meta.thoughts_token_count or 0
    total_output = output_visible + thinking

    # Long-context tier: once the prompt exceeds the threshold, the higher
    # rate applies to the whole request (input and output), not only to the
    # tokens above the threshold.
    suffix = "high" if prompt > tier else "low"
    input_rate = model_pricing[f"input_{suffix}"]
    output_rate = model_pricing[f"output_{suffix}"]

    input_cost = non_cached / 1e6 * input_rate
    # 0.25x is the cache-read multiplier assumed in this lesson.
    cached_cost = cached / 1e6 * (input_rate * 0.25)
    output_cost = total_output / 1e6 * output_rate

    return input_cost + cached_cost + output_cost

A Normalized Schema for Telemetry

To merge the data from all three providers into a single schema:

from dataclasses import dataclass

@dataclass
class NormalizedUsage:
    provider: str        # "openai" | "anthropic" | "gemini"
    model: str           # "gpt-5" | "claude-sonnet-4-6" | "gemini-2.5-pro"

    input_tokens: int    # non-cached input
    cached_input_tokens: int  # cache hits (reads)
    cached_write_tokens: int  # cache writes (Anthropic-specific)

    output_tokens: int   # visible output
    reasoning_tokens: int  # thinking / reasoning

    tool_tokens: int     # extra tokens from tool use
    audio_tokens: int    # multimodal audio
    image_tokens: int    # multimodal image
    video_tokens: int    # multimodal video

    total_cost_usd: float  # computed total

An LLM telemetry middleware built on this schema converts each provider's own format into it. In Module 4 we will do this automatically with LiteLLM.
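
If you write the adapter yourself, the conversion could look roughly like the sketch below. It uses a trimmed-down, hypothetical `TokenSplit` dataclass holding only the five core counters, and the adapter names (`from_openai`, `from_anthropic`, `from_gemini`) are illustrative. One assumption worth flagging: Anthropic's usage object reports thinking inside output_tokens, so this sketch cannot split reasoning out for that provider and records 0.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class TokenSplit:
    input_tokens: int          # non-cached input
    cached_input_tokens: int   # cache reads
    cached_write_tokens: int   # cache writes (Anthropic only)
    output_tokens: int         # visible output
    reasoning_tokens: int      # thinking / reasoning


def _n(obj, name):
    """Read a count that may be missing or None and return it as an int."""
    if obj is None:
        return 0
    return getattr(obj, name, None) or 0


def from_openai(u):
    # Cached and reasoning tokens are subsets of the totals, so subtract them.
    cached = _n(getattr(u, "prompt_tokens_details", None), "cached_tokens")
    reasoning = _n(getattr(u, "completion_tokens_details", None), "reasoning_tokens")
    return TokenSplit(u.prompt_tokens - cached, cached, 0,
                      u.completion_tokens - reasoning, reasoning)


def from_anthropic(u):
    # input_tokens already excludes cache traffic; thinking stays inside output_tokens.
    return TokenSplit(u.input_tokens, _n(u, "cache_read_input_tokens"),
                      _n(u, "cache_creation_input_tokens"), u.output_tokens, 0)


def from_gemini(m):
    # prompt_token_count includes cached tokens; candidates exclude thinking.
    cached = _n(m, "cached_content_token_count")
    return TokenSplit(m.prompt_token_count - cached, cached, 0,
                      _n(m, "candidates_token_count"), _n(m, "thoughts_token_count"))


openai_usage = SimpleNamespace(
    prompt_tokens=4521, completion_tokens=287,
    prompt_tokens_details=SimpleNamespace(cached_tokens=3500),
    completion_tokens_details=SimpleNamespace(reasoning_tokens=120),
)
print(from_openai(openai_usage))
```

Keeping the subtraction logic inside the adapters means downstream dashboards never need to know which provider counts cached tokens inside its input total.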

LiteLLM `response_cost`: A One-Step Solution

LiteLLM already does this. `response._hidden_params["response_cost"]` returns the cost LiteLLM computed for the call, whichever provider served it.

from litellm import completion

response = completion(
    model="claude-sonnet-4-6",
    messages=[...],
    cache_control={"type": "ephemeral"},
    metadata={"user_id": "user_42"},
)

print(response.usage)
# A single normalized usage object, whatever the provider
print(response._hidden_params["response_cost"])
# 0.00342  ← USD

Internally, LiteLLM parses each provider's own format, normalizes it, and computes the cost from the matching pricing table. This is our default approach throughout the course.

💡 LiteLLM bonus

LiteLLM also fills in the cached input token field of response.usage consistently. OpenAI's prompt_tokens_details.cached_tokens, Anthropic's cache_read_input_tokens, and Gemini's cached_content_token_count all land in a single field, which saves you a line or two of code.

Edge Cases

1. The `usage = null` case

Sometimes the usage object does not arrive:

The stream was cancelled mid-response
The API returned a 5xx error
You read a streaming response before its final chunk arrived

Be defensive in every case:

if response.usage:
    log_cost(response.usage)
else:
    log_partial_or_error(...)
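
One way to make that check reusable is to always emit a telemetry row and flag the missing usage explicitly, so these requests stay visible instead of silently dropping out of your cost reports. This is a sketch under assumptions: `usage_record`, the `request_id` argument, and the row layout are all hypothetical.

```python
from types import SimpleNamespace


def usage_record(response, request_id):
    """Build a telemetry row even when the provider sent no usage object."""
    # getattr with a default also covers the case where response itself is None.
    usage = getattr(response, "usage", None)
    if usage is None:
        # Keep the row so missing usage shows up in dashboards instead of vanishing.
        return {"request_id": request_id, "usage_missing": True, "usage": None}
    return {"request_id": request_id, "usage_missing": False, "usage": usage}


ok = usage_record(SimpleNamespace(usage=SimpleNamespace(prompt_tokens=10)), "req-1")
cancelled = usage_record(SimpleNamespace(usage=None), "req-2")
errored = usage_record(None, "req-3")  # e.g. a 5xx where no response object exists

print(ok["usage_missing"], cancelled["usage_missing"], errored["usage_missing"])
# False True True
```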

2. Float vs Integer

Token counts are always integers, but `response_cost` is a float. Make sure no precision is lost when you serialize it to JSON.
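
One common way to sidestep float drift, shown here as a sketch rather than the course's prescribed method, is to store costs as integer micro-dollars before they reach JSON or a database. The helper name `to_micro_usd` and the `cost_micro_usd` field are assumptions for illustration.

```python
import json
from decimal import Decimal


def to_micro_usd(cost_usd):
    """Convert a float USD cost to an integer number of micro-dollars (1e-6 USD)."""
    # Going through str() uses the float's short repr, avoiding binary noise.
    return int((Decimal(str(cost_usd)) * Decimal(1_000_000)).to_integral_value())


record = {"input_tokens": 4521, "cost_micro_usd": to_micro_usd(0.00342)}
payload = json.dumps(record)
print(payload)  # {"input_tokens": 4521, "cost_micro_usd": 3420}

# Integer sums are exact, while repeated float addition can drift.
float_total = sum([0.1] * 10)
micro_total = sum(to_micro_usd(0.1) for _ in range(10))
print(float_total == 1.0, micro_total == 1_000_000)  # False True
```

Costs below one micro-dollar round to zero in this scheme, so pick a finer unit if your per-request costs are that small.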

3. Usage in streams

OpenAI: usage arrives in the final chunk of the stream (the newer format, enabled with stream_options={"include_usage": True}). Anthropic: output usage arrives in a message_delta event at the end of the stream. Gemini: usage_metadata arrives in the final chunk of the stream.

Accumulate these in your stream interceptor; the next lesson covers the details.
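
As a preview, here is a minimal sketch of such an accumulator for Anthropic-style events, written against plain dicts so it runs without a network call. It assumes input usage is carried on message_start and a cumulative output count on message_delta; the function name `accumulate_usage` and the simulated events are illustrative only.

```python
def accumulate_usage(events):
    """Fold usage fragments from a stream into one dict.

    Counts are treated as cumulative, so the latest value per field wins
    rather than being summed.
    """
    usage = {}
    for event in events:
        kind = event.get("type")
        if kind == "message_start":
            fragment = event.get("message", {}).get("usage", {})
        elif kind == "message_delta":
            fragment = event.get("usage", {})
        else:
            continue  # content events carry no usage
        usage.update({k: v for k, v in fragment.items() if v is not None})
    return usage


# Simulated event stream; a real interceptor would receive these from the SDK.
events = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 4521, "output_tokens": 1}}},
    {"type": "content_block_delta", "delta": {"text": "Hello"}},
    {"type": "message_delta", "usage": {"output_tokens": 287}},
    {"type": "message_stop"},
]

print(accumulate_usage(events))      # {'input_tokens': 4521, 'output_tokens': 287}
print(accumulate_usage(events[:2]))  # {'input_tokens': 4521, 'output_tokens': 1}
```

The second call mimics a cancelled stream: the input count is already known, but the output count is only a placeholder, which is why partial streams need separate handling.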

▶️ Next lesson

3.3: Streaming Token Counting Pitfalls. In stream mode, token counting works differently than you might expect. Cancelled streams, partial output, last-chunk usage events: common production mistakes and how to fix them.

Frequently Asked Questions

Sometimes the API token counts differ from local tiktoken counts. Which is correct?

Treat the API-returned counts as authoritative, since billing is based on them. A local tiktoken count is close but can differ by 5-10% (especially with vision, audio, or the internal prefills used for structured output). Use API usage for telemetry, and use tiktoken only to estimate cost before the call.
