# The Context Engineering Era: Prompt Caching, Long Context vs RAG, and Runtime State Management (2026 Guide)

> Source: https://sukruyusufkaya.com/en/blog/context-engineering-prompt-caching-long-context-rag-2026
> Updated: 2026-05-27T18:13:47.033Z
> Type: blog
> Category: yapay-zeka

**TLDR:** Prompt engineering is dead; context engineering has taken its place. Anthropic's prompt caching (up to 90% off cached input), GPT-5.5's 272K input threshold, Claude Opus 4.7's 1M context, and agent runtime state management are rewriting AI engineering in 2026. This guide also covers Turkish token efficiency, KVKK-compliant state stores, and the 'Don't Break the Cache' principle.

<tldr data-summary="[&quot;Prompt engineering is dead; context engineering took its place. In 2026, the AI engineer&apos;s real job is designing what enters the LLM context, when, in what order, and with what cache strategy.&quot;,&quot;Anthropic prompt caching cuts cached input cost by 90% with a 5-minute (or 1-hour) TTL. The Don&apos;t Break the Cache principle has reshaped the economics of modern AI apps.&quot;,&quot;Long context (Claude Opus 4.7 1M, Gemini 3.1 Pro 2M) does not fully replace RAG; there are cost, latency, and accuracy tradeoffs. The decision matrix is driven by tokens-per-document and query frequency.&quot;,&quot;Agent runtime state lives in memory (dev), in Redis (mid-scale, &lt;100K active sessions), or in Postgres (durable, audit-required). LangGraph checkpointer supports all three.&quot;,&quot;Turkish tokenization is ~30% more expensive than English; context budgets must account for that margin.&quot;]" data-one-line="Context engineering is the real AI engineering discipline of 2026: not how you write the prompt, but how you build, cache, scale, and maintain the context."></tldr>

## 1. Why Prompt Engineering is Dead

In 2023-2024, "prompt engineer" job postings flooded LinkedIn. By late 2025, the same role at major companies was reclassified as **context engineer** or **AI systems engineer**. This is not branding — the priority of the discipline shifted.

Prompt engineering was about how to phrase a single call: string formatting, few-shot selection, chain-of-thought triggers. **Context engineering** is about how the model's context is constructed across the entire lifecycle of an application — what to cache, when to invalidate, when to retrieve, when to summarize.

<definition-box data-term="Context Engineering" data-definition="The discipline of designing what context (system prompt, retrieved chunks, conversation history, tool definitions, structured outputs, cached prefixes) an LLM application sends to the model, when, in what order, and under what cache strategy. Beyond static prompt writing, it covers runtime state management, cache invalidation, token-budget allocation, and cost/latency optimization." data-also="Runtime AI Engineering" data-wikidata="Q125456789"></definition-box>

Anthropic's January 2026 engineering post summarized the shift: "Bringing an agent to production is far more than writing a better prompt. What context, on what turn, with what cache TTL, in what state store — these decisions affect performance, cost, and accuracy more than prompt wording."

<stat-callout data-value="90%" data-context="Anthropic prompt caching with a 5-minute cache hit drops input token cost" data-outcome="to 10% of the standard price. For long-running agents and multi-turn chats, this is the single line item that re-shapes economic feasibility." data-source="{&quot;label&quot;:&quot;Anthropic Prompt Caching Pricing&quot;,&quot;url&quot;:&quot;https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching&quot;,&quot;date&quot;:&quot;2025-09&quot;}"></stat-callout>

## 2. Anthropic Prompt Caching: Mechanics

Prompt caching reached general availability at Anthropic in late 2024 and is standard across Claude models in 2026. It stores a static prefix (system prompt, document context, tool definitions, few-shot examples) server-side with a 5-minute (or 1-hour) TTL; subsequent calls that start with the same prefix are billed at the discounted cache-read rate for that prefix.

- **Cache write:** 1.25x the standard input price (5-min TTL) or 2x (1-hour TTL)
- **Cache hit (read):** 10% of standard input price
- **Output:** Unchanged

<comparison-table data-caption="2026 Prompt Caching: Anthropic vs OpenAI vs Gemini" data-headers="[&quot;Provider&quot;,&quot;Cache Type&quot;,&quot;Hit Discount&quot;,&quot;TTL&quot;,&quot;Automatic?&quot;]" data-rows="[{&quot;feature&quot;:&quot;Anthropic (Claude)&quot;,&quot;values&quot;:[&quot;Explicit (cache_control)&quot;,&quot;90% (input)&quot;,&quot;5min or 1hr&quot;,&quot;No (explicit markup)&quot;]},{&quot;feature&quot;:&quot;OpenAI (GPT-5/5.5)&quot;,&quot;values&quot;:[&quot;Implicit&quot;,&quot;50% (input)&quot;,&quot;~5min (shared)&quot;,&quot;Yes (automatic)&quot;]},{&quot;feature&quot;:&quot;Google (Gemini 3.1)&quot;,&quot;values&quot;:[&quot;Implicit&quot;,&quot;75% (input)&quot;,&quot;~5min&quot;,&quot;Yes (automatic)&quot;]},{&quot;feature&quot;:&quot;AWS Bedrock (Anthropic)&quot;,&quot;values&quot;:[&quot;Explicit&quot;,&quot;90% (input)&quot;,&quot;5min&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;Vertex AI (Anthropic)&quot;,&quot;values&quot;:[&quot;Explicit&quot;,&quot;90% (input)&quot;,&quot;5min&quot;,&quot;No&quot;]}]"></comparison-table>

### Don't Break the Cache

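The cache is **prefix-hashed from the beginning** of the prompt: a request hits only if everything up to a breakpoint is identical to an earlier request. Changing a single token invalidates the cache for everything after it, so a change at the very start invalidates the whole prompt. The golden rule: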

**Put cacheable, static content at the top. Put dynamic content at the bottom.**

Anti-pattern:

    System: "Today is {{date}}. ..."
    Document context: 50K tokens (static)
    User: "..."

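The date changes every day, so the first token run differs and the 50K-token document prefix behind it never matches yesterday's cache entry. Fix: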

    System: "You are an assistant. ..."         # static
    Document context: 50K tokens                # static, cache_control
    Conversation start: "Today is {{date}}. ..." # dynamic
    User: "..."
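
A toy illustration of why the ordering matters. Providers do not expose their cache keys, so a SHA-256 over the concatenated prefix stands in for the provider's prefix matching here; `prefix_key`, `bad_prefix`, `good_prefix`, and the placeholder strings are made-up names for this sketch.

```python
import hashlib

STATIC_SYSTEM = "You are an assistant."
DOCS = "<50K tokens of static document context>"


def prefix_key(blocks: list[str]) -> str:
    """Hash the cacheable prefix; a stand-in for the provider's prefix match."""
    return hashlib.sha256("\x1f".join(blocks).encode("utf-8")).hexdigest()


def bad_prefix(date: str) -> list[str]:
    # Anti-pattern: the date sits at the very start of the cached prefix.
    return [f"Today is {date}. {STATIC_SYSTEM}", DOCS]


def good_prefix(date: str) -> list[str]:
    # Fix: only static blocks are cached; the date is ignored here because
    # it travels in the dynamic suffix instead.
    return [STATIC_SYSTEM, DOCS]


monday, tuesday = "2026-05-25", "2026-05-26"
print(prefix_key(bad_prefix(monday)) == prefix_key(bad_prefix(tuesday)))    # False
print(prefix_key(good_prefix(monday)) == prefix_key(good_prefix(tuesday)))  # True
```

A check like this can run in CI against rendered prompt templates to catch a dynamic field that has drifted into the static prefix.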

### Cache breakpoints (Anthropic explicit)

Up to **4 cache breakpoints** per request. Typical placement:

1. End of system prompt
2. End of document context
3. End of tool definitions
4. End of conversation history (excluding latest user turn)

### Cost example

A Turkish fintech customer-service agent: ~30K input tokens per call, 50K queries/day on Claude Opus 4.7. Without cache: ~$675K/month. With cache (24K cached, 6K dynamic): ~$313K/month. **Monthly savings: ~$362K (54%).**
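
The arithmetic behind these figures can be sketched as follows. The text does not state its price or hit-rate assumptions; the numbers happen to be reproduced by assuming $15 per million input tokens, a 30-day month, an 80% cache hit rate, and a 1.25x write premium on misses. All four are assumptions of this sketch, as is the function name `monthly_input_cost`.

```python
PRICE_PER_MTOK = 15.00   # assumed USD per million input tokens (not stated in the text)
CACHE_WRITE_MULT = 1.25  # 5-minute TTL write premium
CACHE_READ_MULT = 0.10   # cache hit price
DAYS_PER_MONTH = 30      # assumed


def monthly_input_cost(cached_tokens: int, dynamic_tokens: int,
                       calls_per_day: int, hit_rate: float) -> float:
    """Monthly input cost in USD for one request shape."""
    calls = calls_per_day * DAYS_PER_MONTH
    # Blended multiplier for the cacheable part:
    # hits are read at 0.10x, misses are written at 1.25x.
    cached_mult = hit_rate * CACHE_READ_MULT + (1 - hit_rate) * CACHE_WRITE_MULT
    billable_tokens = cached_tokens * cached_mult + dynamic_tokens
    return calls * billable_tokens * PRICE_PER_MTOK / 1_000_000


no_cache = monthly_input_cost(0, 30_000, 50_000, hit_rate=0.0)
with_cache = monthly_input_cost(24_000, 6_000, 50_000, hit_rate=0.80)
print(f"no cache:   ${no_cache:,.0f}")                 # $675,000
print(f"with cache: ${with_cache:,.0f}")               # $313,200
print(f"savings:    {1 - with_cache / no_cache:.0%}")  # 54%
```

Under the same assumptions, a 100% hit rate would bring the cached figure down to about $189K, which shows how sensitive the savings are to hit rate.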

## 3. Long Context vs RAG: Decision Matrix

By 2026, long-context models triggered a "do we still need RAG?" debate:

- **Claude Opus 4.7:** 1M tokens (1M context tier)
- **Gemini 3.1 Pro:** 2M tokens (long-mode)
- **GPT-5.5:** 272K input + 128K output
- **Claude Sonnet 4.5:** 200K
- **Llama 4 70B:** 256K

<comparison-table data-caption="Long Context vs RAG: 2026 Decision Matrix" data-headers="[&quot;Scenario&quot;,&quot;Document Volume&quot;,&quot;Query Frequency&quot;,&quot;Decision&quot;,&quot;Why&quot;]" data-rows="[{&quot;feature&quot;:&quot;Single contract analysis&quot;,&quot;values&quot;:[&quot;50K-200K&quot;,&quot;Low&quot;,&quot;Long context&quot;,&quot;RAG overhead unnecessary&quot;]},{&quot;feature&quot;:&quot;Customer service KB&quot;,&quot;values&quot;:[&quot;1M+&quot;,&quot;High&quot;,&quot;RAG&quot;,&quot;Cannot fit; high frequency blows up LC cost&quot;]},{&quot;feature&quot;:&quot;Multi-doc research&quot;,&quot;values&quot;:[&quot;500K-1M&quot;,&quot;Low-medium&quot;,&quot;Long context + cache&quot;,&quot;Docs static; high cache hit&quot;]},{&quot;feature&quot;:&quot;Turkish Commercial Code&quot;,&quot;values&quot;:[&quot;~250K&quot;,&quot;Medium&quot;,&quot;RAG or LC&quot;,&quot;Borderline; accuracy → LC, cost → RAG&quot;]},{&quot;feature&quot;:&quot;Codebase analysis&quot;,&quot;values&quot;:[&quot;100K-500K&quot;,&quot;Medium&quot;,&quot;Long context + cache&quot;,&quot;Codebase static; daily cache hit&quot;]},{&quot;feature&quot;:&quot;E-commerce catalog&quot;,&quot;values&quot;:[&quot;10M+&quot;,&quot;High&quot;,&quot;RAG required&quot;,&quot;Exceeds LC capacity&quot;]}]"></comparison-table>

Cost comparison (200K-token doc, 1K queries/day, Claude Opus 4.7): RAG ~$27K/year; long context with cache ~$111K/year; long context without cache ~$1.1M/year. **RAG is still 4x cheaper for KB-style workloads.**

### "Lost in the Middle" in 2026

Long-context accuracy improved but is not solved. Needle-in-a-haystack at 1M: Claude 96%, Gemini 93%, GPT-5.5 94%. Real long-document QA: 75-85%. RAG + reasonable LC (100K) → 90%+.

### Latency

- RAG (5K context): p50 1.2s, p95 2.8s
- LC 200K: p50 8.5s, p95 18s
- LC 1M: p50 45s, p95 90s

Real-time chat with 1M context is not viable. Async (research, batch) tolerates it.

## 4. GPT-5.5 Tier System

OpenAI launched GPT-5.5 in Feb 2026 with input tiers:

- **Standard tier:** First 128K input — standard price
- **Long tier:** 128K-272K input — 2x price
- **Output:** Same across tiers

Staying under 128K matters. Tactics: aggressive chunking + dynamic retrieval; summarize old history; compress tool definitions; reduce few-shot from 10 to 3; audit system prompt monthly.
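
A small sketch of what crossing the threshold costs. It assumes the 2x multiplier applies only to tokens above 128K (marginal tiering), which is one reading of "first 128K" above; if the provider bills the whole request at the long-tier rate, the penalty is larger. `input_cost_units` is a hypothetical helper that returns cost in "standard-priced token" units so that no dollar price has to be assumed.

```python
STANDARD_LIMIT = 128_000  # tokens billed at the standard rate
MAX_INPUT = 272_000       # input ceiling from the text
LONG_TIER_MULT = 2.0      # long-tier multiplier from the text


def input_cost_units(input_tokens: int) -> float:
    """Input cost in standard-priced token units, assuming marginal tiering."""
    if not 0 <= input_tokens <= MAX_INPUT:
        raise ValueError(f"input must be between 0 and {MAX_INPUT} tokens")
    standard = min(input_tokens, STANDARD_LIMIT)
    long_tier = max(input_tokens - STANDARD_LIMIT, 0)
    return standard + long_tier * LONG_TIER_MULT


for tokens in (100_000, 128_000, 200_000, 272_000):
    units = input_cost_units(tokens)
    print(f"{tokens:>7} tokens -> {units:>9,.0f} units ({units / tokens:.2f}x flat)")
```

Under this reading a 200K-token request is billed like 272K standard tokens, which is why trimming history or tool definitions to get back under 128K pays off.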

## 5. Claude Opus 4.7 1M Context

Claude Opus 4.7 1M GA'd in March 2026. Pricing: 0-200K standard, 200K-1M 2x, cache hit still 10%.

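Use cases: whole codebase in context; multi-doc research; long-running agent memory; genomic data. Pattern: write the 1M context to the cache once, then reuse it within the 5-minute TTL for 5-10 turns; every reuse is billed at the 10% cache-hit rate, so the one-time write premium is amortized after the first few turns.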

## 6. Agent Runtime State Management

The least-discussed but most-critical part of context engineering: where does the agent keep state between turns?

<comparison-table data-caption="Agent State Stores" data-headers="[&quot;Store&quot;,&quot;Use Case&quot;,&quot;Scale&quot;,&quot;Audit&quot;,&quot;Cost&quot;]" data-rows="[{&quot;feature&quot;:&quot;In-Memory (Python dict)&quot;,&quot;values&quot;:[&quot;Dev&quot;,&quot;Single instance&quot;,&quot;None&quot;,&quot;Free&quot;]},{&quot;feature&quot;:&quot;Redis&quot;,&quot;values&quot;:[&quot;Mid prod&quot;,&quot;<100K sessions&quot;,&quot;Limited&quot;,&quot;Low&quot;]},{&quot;feature&quot;:&quot;Postgres (LangGraph checkpointer)&quot;,&quot;values&quot;:[&quot;Prod, audit&quot;,&quot;Unbounded&quot;,&quot;Full&quot;,&quot;Medium&quot;]},{&quot;feature&quot;:&quot;SQLite&quot;,&quot;values&quot;:[&quot;Single-node&quot;,&quot;Single instance&quot;,&quot;Full&quot;,&quot;Free&quot;]},{&quot;feature&quot;:&quot;DynamoDB&quot;,&quot;values&quot;:[&quot;AWS native&quot;,&quot;Unbounded&quot;,&quot;Limited&quot;,&quot;Med-high&quot;]}]"></comparison-table>

Redis: hot data, AOF for KVKK durability + disk encryption. Postgres + LangGraph checkpointer: per-node state snapshot, thread_id resume, replay, audit log.
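
The snapshot, resume, and audit ideas can be sketched with the standard library alone. This is not LangGraph's checkpointer API: `CheckpointStore`, its table layout, and its method names are hypothetical and only illustrate the per-node snapshot keyed by `thread_id`.

```python
import json
import sqlite3
from typing import Optional


class CheckpointStore:
    """Append-only per-thread state snapshots (hypothetical, not LangGraph's API)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            " thread_id TEXT NOT NULL,"
            " step INTEGER NOT NULL,"
            " node TEXT NOT NULL,"
            " state TEXT NOT NULL,"
            " PRIMARY KEY (thread_id, step))"
        )

    def save(self, thread_id: str, node: str, state: dict) -> int:
        """Store a snapshot after a node runs; returns the new step number."""
        row = self.db.execute(
            "SELECT COALESCE(MAX(step), 0) FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        step = row[0] + 1
        self.db.execute(
            "INSERT INTO checkpoints VALUES (?, ?, ?, ?)",
            (thread_id, step, node, json.dumps(state)),
        )
        self.db.commit()
        return step

    def latest(self, thread_id: str) -> Optional[dict]:
        """Resume point: the most recent snapshot for a thread, if any."""
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?"
            " ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def history(self, thread_id: str) -> list:
        """Audit/replay view: every (step, node) recorded for a thread."""
        return self.db.execute(
            "SELECT step, node FROM checkpoints WHERE thread_id = ? ORDER BY step",
            (thread_id,),
        ).fetchall()


store = CheckpointStore()
store.save("thread-1", "retrieve", {"docs": 3})
store.save("thread-1", "answer", {"docs": 3, "draft": "..."})
print(store.latest("thread-1"))
print(store.history("thread-1"))
```

Because snapshots are appended rather than overwritten, the same table serves both resume (latest row) and audit (full history); a purge job for erasure requests would delete by `thread_id`.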

State pruning: sliding window (simple, lossy), summarization (preserves but costs LLM calls), hierarchical memory (hot/warm/cold — most scalable).
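
The first two strategies can be sketched in a few lines. The token estimate (about 4 characters per token) is a crude placeholder for a real tokenizer, and `summarize` is a caller-supplied function (in practice an LLM call); both, along with the function names and the neutral `"summary"` role, are assumptions of this sketch.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def prune_sliding_window(messages: list, budget: int) -> list:
    """Keep the newest messages whose combined estimate fits the budget (lossy)."""
    kept = []
    used = 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def prune_with_summary(messages: list, budget: int,
                       summarize: Callable[[list], str]) -> list:
    """Replace the dropped prefix with one summary message, then the window."""
    kept = prune_sliding_window(messages, budget)
    dropped = messages[: len(messages) - len(kept)]
    if not dropped:
        return kept
    # The summary itself is not counted against the budget in this sketch.
    memory = {"role": "summary", "content": summarize(dropped)}
    return [memory] + kept


history = [
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
print(len(prune_sliding_window(history, budget=20)))
print(prune_with_summary(history, 20, lambda old: f"{len(old)} earlier turn(s)")[0])
```

Hierarchical memory extends the same idea: the summary block becomes the "warm" tier, and anything older moves to a store that is retrieved on demand.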

## 7. Turkish Context Engineering

Turkish tokenization is ~30% more expensive than English: the same content needs ~30% more tokens, so a fixed window holds roughly 23% less of it (1/1.3 ≈ 0.77).

- GPT-5.5 128K threshold → effective ~98K English-equivalent tokens
- Claude Opus 4.7 200K → ~150K English-equivalent tokens
- Gemini 3.1 Pro 2M → ~1.5M English-equivalent tokens

Gemini 3.1 Pro has the most efficient Turkish tokenizer in 2026 (~22% overhead vs 30% for Claude/GPT). For Turkish-heavy workloads (customer service, legal, public sector), Gemini is worth evaluating not just on quality but on token cost.
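
The budget arithmetic above is a single division. The sketch below uses the overhead figures quoted in this section (30% and 22%); `english_equivalent_budget` is a hypothetical helper, and real overhead should be measured with each provider's tokenizer on your own corpus.

```python
def english_equivalent_budget(context_tokens: int, overhead: float) -> int:
    """Tokens of English-equivalent content that fit once tokenizer overhead is paid."""
    return int(context_tokens / (1 + overhead))


windows = [
    ("GPT-5.5 standard tier", 128_000, 0.30),
    ("Claude Opus 4.7 base tier", 200_000, 0.30),
    ("Gemini 3.1 Pro at 30%", 2_000_000, 0.30),
    ("Gemini 3.1 Pro at 22%", 2_000_000, 0.22),
]
for name, window, overhead in windows:
    print(f"{name}: {english_equivalent_budget(window, overhead):,}")
```

The last two rows show the effect of the tokenizer alone: at 22% overhead the same 2M window holds about 100K more English-equivalent tokens than at 30%.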

## 8. Context Hierarchy Pattern

Three tiers I use in production:

**Tier 1 (Static):** system prompt, tool definitions, few-shot examples, brand guidelines. Cache aggressively (1-hour TTL if available).

**Tier 2 (Semi-static):** document context (KB chunks), user profile, permissions. 5-min TTL.

**Tier 3 (Dynamic):** last user message, current timestamp, live API data, tool results. No cache.

Anthropic SDK example:

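    import anthropic

    # SYSTEM_PROMPT, TOOL_DEFS, KB_CONTEXT, and user_query are strings defined elsewhere.
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        # Tier 1 (static): each block ends with a cache breakpoint
        system=[
            {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": TOOL_DEFS, "cache_control": {"type": "ephemeral"}}
        ],
        messages=[
            {"role": "user", "content": [
                # Tier 2 (semi-static): KB context, cached with the default 5-min TTL
                {"type": "text", "text": KB_CONTEXT, "cache_control": {"type": "ephemeral"}},
                # Tier 3 (dynamic): the user query comes last and is never cached
                {"type": "text", "text": user_query}
            ]}
        ]
    )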

Dynamic retrieval pattern: static context goes into Tier 1; per-query top-k retrieval becomes Tier 2 (cached for 5 min). Summarization fallback: when conversation history exceeds 50 turns, summarize older turns into a compact memory block.

## 9. KVKK-Compliant State Management (Turkey)

KVKK directly shapes state-store choice:

- **Data residency:** Redis/Postgres in Turkey or EU. Frankfurt regions preferred (AWS eu-central-1, GCP europe-west3, Azure Germany West Central). Local providers (BulutSpeed, Turkcell Bulut) common in BDDK sectors.
- **PII minimization in state:** keep only user_id; do not persist names, phones, IBANs.
- **Encryption at rest + in transit:** Redis AOF files on encrypted disks, Postgres TDE or disk-level encryption (community PostgreSQL has no built-in TDE), TLS mandatory.
- **Access logs:** Postgres audit extension, Redis ACL log.
- **Right to erasure (KVKK Art. 11):** 30-day state purge SLA.

For Anthropic cache: even EU-instance caches sit on Anthropic infrastructure. Mask PII before caching:

    def prepare_for_cache(text: str) -> str:
        text = mask_tc_kimlik(text)
        text = mask_phone(text)
        text = mask_email(text)
        text = mask_iban(text)
        return text
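
The four `mask_*` helpers are not defined above. A minimal regex-based sketch follows; the patterns are hypothetical, cover only common written formats (for example, Turkish mobile numbers and `TR` IBANs with optional spaces), and match any 11-digit number as a TC Kimlik without verifying its checksum. Treat them as a starting point to test against real data, not as a complete PII detector.

```python
import re

# Hypothetical patterns for common Turkish PII formats.
_TC_KIMLIK = re.compile(r"\b[1-9]\d{10}\b")  # 11 digits, first digit non-zero
_PHONE = re.compile(r"(?<![\w+])(?:\+90|0)\s?5\d{2}\s?\d{3}\s?\d{2}\s?\d{2}\b")
_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
_IBAN = re.compile(r"\bTR\d{2}(?:\s?\d{4}){5}\s?\d{2}\b")  # TR + 24 digits


def mask_tc_kimlik(text: str) -> str:
    return _TC_KIMLIK.sub("[TC_KIMLIK]", text)


def mask_phone(text: str) -> str:
    return _PHONE.sub("[PHONE]", text)


def mask_email(text: str) -> str:
    return _EMAIL.sub("[EMAIL]", text)


def mask_iban(text: str) -> str:
    return _IBAN.sub("[IBAN]", text)


sample = (
    "TC 12345678901, tel +90 532 123 45 67, "
    "mail ayse@example.com, IBAN TR33 0006 1005 1978 6457 8413 26"
)
masked = mask_iban(mask_email(mask_phone(mask_tc_kimlik(sample))))
print(masked)
```

Fixed placeholders such as `[PHONE]` also help the cache: two documents that differ only in PII produce the same masked prefix.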

BDDK 2025 guidance specifically requires: cache content inventory, 24h cache purge SLA, cache audit logs.

## 10. Case Study: Turkish E-Commerce Context Engineering Migration

A mid-large Turkish e-commerce platform (anonymized) migrated in Q4 2025 from naive prompts to context-engineered architecture.

**Before:** GPT-4, 25K tokens/turn, no cache, ~$120K/month, p50 4.2s, p95 9.8s.

**Changes:**
1. System prompt audit: 8K → 4K
2. KB context: static 15K → dynamic 4K (top-5)
3. Migrated to Claude Opus 4.7 for prompt caching
4. 4 cache breakpoints applied
5. Conversation history auto-summarization at 20+ turns
6. Redis state store (replaced in-memory)

**After (3 months):**
- Tokens/turn: 25K → 12K (-52%)
- Cache hit rate: 0% → 72%
- Monthly cost: $120K → $34K (-72%)
- p50 latency: 4.2s → 1.8s (-57%)
- p95 latency: 9.8s → 3.4s (-65%)
- CSAT: 7.2 → 8.6 (+19%)

## 11. Risks and Pitfalls

<callout-box data-variant="warning" data-title="Context Engineering Traps">

Common production failures:

- **Cache key drift:** small prompt change invalidates the cache, cost spikes. Monitor cache hit rate in CI.
- **Stale cache:** KB updated but cache still serves old doc → wrong answers. Solution: manual invalidate endpoint.
- **State store down:** Redis/Postgres outage → all agents restart. Solution: graceful degradation to in-memory.
- **Memory leak:** unpruned agent state hits 1M+ tokens → cost explosion.
- **Tokenizer mismatch:** dev miscounts tokens, request exceeds limit → 400 error.

</callout-box>

Monitor: cache hit rate (target 60%+), cache TTL hit/expire ratio, cache key cardinality (anomaly → drift), cost per request (target: monthly decrease).
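
Cache hit rate can be computed from the usage block that the Anthropic Messages API returns with each response, which reports `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`. The sketch below takes those as plain dicts and computes a token-weighted rate; whether the 60% target above is meant per token or per request is not stated, so the token-weighted definition is an assumption, as is the helper name `cache_hit_rate`.

```python
def cache_hit_rate(usages: list) -> float:
    """Token-weighted share of input tokens that were served from cache."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# One cold call that writes the cache, then three warm calls that read it.
cold = {"input_tokens": 6_000, "cache_creation_input_tokens": 24_000,
        "cache_read_input_tokens": 0}
warm = {"input_tokens": 6_000, "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 24_000}
print(f"{cache_hit_rate([cold, warm, warm, warm]):.0%}")  # 60%
```

Tracking this per deployment makes cache key drift visible: a prompt change that breaks the prefix shows up as a sudden drop in the rate.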

A/B test cache pattern changes: 5% traffic → new pattern → 24-48h watch → ramp or rollback.
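
For the 5% split, a deterministic hash of the session ID keeps each session on one prompt pattern for its whole lifetime, so a session never alternates between two cache prefixes. `in_canary` and the salt value are hypothetical names for this sketch.

```python
import hashlib


def in_canary(session_id: str, percent: float = 5.0,
              salt: str = "cache-pattern-v2") -> bool:
    """Sticky assignment: the same session always gets the same answer."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


sessions = [f"session-{i}" for i in range(20_000)]
share = sum(in_canary(s) for s in sessions) / len(sessions)
print(f"canary share: {share:.1%}")
```

Changing the salt for each experiment reshuffles the assignment, so the same sessions are not always the ones exposed to new patterns. Ramping is a matter of raising `percent`; sessions already in the canary stay in it.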

## 12. FAQ

<callout-box data-variant="answer" data-title="Does every LLM support prompt caching?">

As of 2026: Anthropic (explicit, 90%), OpenAI (implicit, 50%), Google Gemini (implicit, 75%), AWS Bedrock (Anthropic + Cohere), Azure OpenAI (implicit). Self-hosted: vLLM and TGI both ship prefix caching.

</callout-box>

<callout-box data-variant="answer" data-title="Do I still need RAG with long context?">

Yes. Long context replaces RAG for single-document analysis but: (1) KBs >1M tokens still need RAG. (2) Accuracy is best with RAG + LC combined. (3) Pure LC is 4-10x more expensive than RAG. Decision: document count + query frequency + accuracy needs.

</callout-box>

<callout-box data-variant="answer" data-title="Which LLM has the best Turkish tokenizer?">

Gemini 3.1 Pro is most efficient for Turkish in 2026 (~22% overhead vs ~30% for Claude/GPT). For Turkish-heavy workloads, Gemini is worth evaluating purely on cost.

</callout-box>

<callout-box data-variant="answer" data-title="Can I keep agent state in-memory?">

Single-instance dev: yes. Production: no, because (1) multi-instance deployments cannot share in-process state, (2) pod restarts lose state, (3) there is no audit trail. Use Redis at minimum; Postgres + LangGraph checkpointer for KVKK/BDDK.

</callout-box>

<callout-box data-variant="answer" data-title="Can I include a timestamp without breaking the cache?">

Yes — place timestamp in the dynamic suffix (last user message), not in the cached prefix.

</callout-box>

<callout-box data-variant="answer" data-title="Is 30% cache hit rate normal?">

Low. Target 60%+. Causes: dynamic content at the prompt start; TTL too short; tools changing too often; conversation history without cache_control. Debug by logging cache hit/miss events.

</callout-box>

<callout-box data-variant="answer" data-title="Long context vs RAG — which hallucinates less?">

Usually **RAG + reasonable context (10-50K)** hallucinates least. Pure LC suffers from "Lost in the Middle." Pure RAG can miss retrieval. The combination is strongest. Measure with RAGAS faithfulness.

</callout-box>

<callout-box data-variant="answer" data-title="Where do I learn context engineering?">

Anthropic Cookbook (github.com/anthropics/anthropic-cookbook), OpenAI Best Practices, LangGraph Documentation. In Turkish: this blog and the AI Engineering Program at sukruyusufkaya.com/egitim.

</callout-box>

## 13. Next Steps

Roadmap: audit (1 week), tiering + cache breakpoints (1 week), state store choice (1 week), A/B canary (1-2 weeks), full rollout + eval (1 week), monitoring/alerting (ongoing). The listed phases add up to 5-6 weeks; budget ~6-8 weeks in total for mid-complexity apps.

Reach out via the site contact form for a context engineering audit or implementation engagement.

<references-list data-items="[{&quot;title&quot;:&quot;Anthropic Prompt Caching Documentation&quot;,&quot;url&quot;:&quot;https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2025-09&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;OpenAI Prompt Caching&quot;,&quot;url&quot;:&quot;https://platform.openai.com/docs/guides/prompt-caching&quot;,&quot;author&quot;:&quot;OpenAI&quot;,&quot;publishedAt&quot;:&quot;2025-10&quot;,&quot;publisher&quot;:&quot;OpenAI&quot;},{&quot;title&quot;:&quot;Google Gemini Implicit Caching&quot;,&quot;url&quot;:&quot;https://ai.google.dev/gemini-api/docs/caching&quot;,&quot;author&quot;:&quot;Google&quot;,&quot;publishedAt&quot;:&quot;2025-11&quot;,&quot;publisher&quot;:&quot;Google&quot;},{&quot;title&quot;:&quot;Lost in the Middle&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2307.03172&quot;,&quot;author&quot;:&quot;Liu et al.&quot;,&quot;publishedAt&quot;:&quot;2023-07-06&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Claude Opus 4.7 1M Context&quot;,&quot;url&quot;:&quot;https://www.anthropic.com/news/claude-opus-1m-context&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2026-03&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;GPT-5.5 Technical Report&quot;,&quot;url&quot;:&quot;https://openai.com/research/gpt-5-5&quot;,&quot;author&quot;:&quot;OpenAI&quot;,&quot;publishedAt&quot;:&quot;2026-02&quot;,&quot;publisher&quot;:&quot;OpenAI&quot;},{&quot;title&quot;:&quot;Gemini 3.1 Pro Technical Report&quot;,&quot;url&quot;:&quot;https://blog.google/technology/ai/gemini-3-1&quot;,&quot;author&quot;:&quot;Google DeepMind&quot;,&quot;publishedAt&quot;:&quot;2026-01&quot;,&quot;publisher&quot;:&quot;Google&quot;},{&quot;title&quot;:&quot;Context Engineering: The New AI Discipline&quot;,&quot;url&quot;:&quot;https://www.anthropic.com/engineering/context-engineering&quot;,&quot;author&quot;:&quot;Anthropic Engineering&quot;,&quot;publishedAt&quot;:&quot;2026-01&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;LangGraph Checkpointer&quot;,&quot;url&quot;:&quot;https://langchain-ai.github.io/langgraph/concepts/persistence/&quot;,&quot;author&quot;:&quot;LangChain&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;LangChain&quot;},{&quot;title&quot;:&quot;vLLM Prefix Caching&quot;,&quot;url&quot;:&quot;https://docs.vllm.ai/en/latest/serving/prefix_caching.html&quot;,&quot;author&quot;:&quot;vLLM&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;vLLM&quot;},{&quot;title&quot;:&quot;Redis ACL&quot;,&quot;url&quot;:&quot;https://redis.io/docs/management/security/acl/&quot;,&quot;author&quot;:&quot;Redis&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Redis&quot;},{&quot;title&quot;:&quot;Postgres TDE&quot;,&quot;url&quot;:&quot;https://www.postgresql.org/docs/current/encryption-options.html&quot;,&quot;author&quot;:&quot;PostgreSQL&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;PostgreSQL&quot;},{&quot;title&quot;:&quot;Turkish Tokenizers&quot;,&quot;url&quot;:&quot;https://huggingface.co/spaces/Xenova/the-tokenizer-playground&quot;,&quot;author&quot;:&quot;Hugging Face&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Hugging Face&quot;},{&quot;title&quot;:&quot;KVKK - Law No. 6698&quot;,&quot;url&quot;:&quot;https://www.kvkk.gov.tr/&quot;,&quot;author&quot;:&quot;Republic of Turkiye - KVKK&quot;,&quot;publishedAt&quot;:&quot;2016-04-07&quot;,&quot;publisher&quot;:&quot;Republic of Turkiye&quot;},{&quot;title&quot;:&quot;BDDK AI Guidance&quot;,&quot;url&quot;:&quot;https://www.bddk.org.tr/&quot;,&quot;author&quot;:&quot;BDDK&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;BDDK&quot;},{&quot;title&quot;:&quot;EU AI Act&quot;,&quot;url&quot;:&quot;https://artificialintelligenceact.eu/&quot;,&quot;author&quot;:&quot;European Commission&quot;,&quot;publishedAt&quot;:&quot;2024-03-13&quot;,&quot;publisher&quot;:&quot;EU&quot;},{&quot;title&quot;:&quot;Anthropic Cookbook&quot;,&quot;url&quot;:&quot;https://github.com/anthropics/anthropic-cookbook&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;GitHub&quot;},{&quot;title&quot;:&quot;OpenAI Best Practices&quot;,&quot;url&quot;:&quot;https://platform.openai.com/docs/guides/production-best-practices&quot;,&quot;author&quot;:&quot;OpenAI&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;OpenAI&quot;},{&quot;title&quot;:&quot;AWS Bedrock Prompt Caching&quot;,&quot;url&quot;:&quot;https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html&quot;,&quot;author&quot;:&quot;AWS&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;AWS&quot;},{&quot;title&quot;:&quot;Azure OpenAI Prompt Caching&quot;,&quot;url&quot;:&quot;https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/prompt-caching&quot;,&quot;author&quot;:&quot;Microsoft&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Microsoft&quot;},{&quot;title&quot;:&quot;DSPy&quot;,&quot;url&quot;:&quot;https://github.com/stanfordnlp/dspy&quot;,&quot;author&quot;:&quot;Stanford NLP&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Stanford&quot;},{&quot;title&quot;:&quot;Needle in a Haystack Benchmark&quot;,&quot;url&quot;:&quot;https://github.com/gkamradt/LLMTest_NeedleInAHaystack&quot;,&quot;author&quot;:&quot;Greg Kamradt&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;GitHub&quot;},{&quot;title&quot;:&quot;RULER: Long-Context Evaluation&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2404.06654&quot;,&quot;author&quot;:&quot;Hsieh et al.&quot;,&quot;publishedAt&quot;:&quot;2024-04&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;LangChain Memory&quot;,&quot;url&quot;:&quot;https://python.langchain.com/docs/modules/memory/&quot;,&quot;author&quot;:&quot;LangChain&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;LangChain&quot;},{&quot;title&quot;:&quot;DynamoDB for Agents&quot;,&quot;url&quot;:&quot;https://aws.amazon.com/blogs/database/use-amazon-dynamodb-for-llm-agents/&quot;,&quot;author&quot;:&quot;AWS&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;AWS&quot;},{&quot;title&quot;:&quot;OWASP Top 10 LLM&quot;,&quot;url&quot;:&quot;https://owasp.org/www-project-top-10-for-large-language-model-applications/&quot;,&quot;author&quot;:&quot;OWASP&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;OWASP&quot;},{&quot;title&quot;:&quot;NIST AI RMF&quot;,&quot;url&quot;:&quot;https://www.nist.gov/itl/ai-risk-management-framework&quot;,&quot;author&quot;:&quot;NIST&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;NIST&quot;},{&quot;title&quot;:&quot;Anthropic Tool Use&quot;,&quot;url&quot;:&quot;https://docs.anthropic.com/en/docs/tool-use&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;Klarna AI&quot;,&quot;url&quot;:&quot;https://www.klarna.com/international/press/klarna-ai-assistant&quot;,&quot;author&quot;:&quot;Klarna&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;Klarna&quot;},{&quot;title&quot;:&quot;Turkish NLP Suite&quot;,&quot;url&quot;:&quot;https://github.com/turkish-nlp-suite&quot;,&quot;author&quot;:&quot;Turkish NLP Suite&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;GitHub&quot;}]"></references-list>

---

This is a living document: the context engineering ecosystem shifts every quarter, so this guide is **updated quarterly**.