# Self-Hosted LLM or API? KVKK + BDDK + Cost Matrix — Enterprise Decision Guide (Breakeven: 500M Tokens/Day)

> Source: https://sukruyusufkaya.com/en/blog/self-hosted-llm-vs-api-kvkk-bddk-kurumsal-karar-rehberi-2026
> Updated: 2026-05-27T18:16:08.160Z
> Type: blog
> Category: yapay-zeka

**TLDR:** An enterprise decision matrix between self-hosted LLM and API: ~500M tokens/day break-even, H100/H200/B200 GPU cost, quantization impact, KVKK + BDDK + ITAR/EAR constraints, AI sovereignty strategy, and three anonymized Turkish sector cases (banking, healthcare, SMB) on hybrid architecture. 2026 reference guide for Turkish enterprises.

<tldr data-summary="[&quot;The break-even point between self-hosted LLM and API for a 70B-class model is approximately 11 billion tokens/month (~500M tokens/day); below this threshold, API is almost always cheaper.&quot;,&quot;GPU cloud cost as of May 2026: H100 $4.50/hr, H200 $5.00/hr, B200 $7-9/hr; assumption of full utilization (>0.85) is critical — low utilization can 2-3x total cost.&quot;,&quot;Hidden costs are often missed: $750-3,000/month engineering + observability + security audit burden misleads small firms into thinking &apos;self-host is cheap&apos;.&quot;,&quot;With quantization (especially AWQ Q4), a 70B model fits in 2xH100 instead of 8xH100 — VRAM drops to 1/4, quality loss is 2-3%, completely changing the cost decision matrix.&quot;,&quot;KVKK Article 9 + BDDK 2024 AI Communiqué + ITAR/EAR defense constraints make &apos;self-host&apos; in finance-health-defense not a cost question but a regulatory necessity.&quot;]" data-one-line="The self-hosted LLM decision is not pure cost math: the ~500M tokens/day break-even, hidden costs, and regulatory constraints (KVKK, BDDK, ITAR) combine into a strategic choice — the right answer for Turkish enterprises is usually hybrid architecture."></tldr>

## 1. Introduction: The Wrong Question

"Self-hosted or API?" has been the most-asked question among Turkish enterprise AI decision-makers throughout 2025-2026. But the question is usually framed wrong, as if a single right answer existed.

<definition-box data-term="Self-Hosted LLM" data-definition="Running an open-source or enterprise-licensed large language model (Llama 3.3 70B, Trendyol-LLM-70B-v3, etc.) on the company's own servers or its allocated cloud GPU instances, keeping all prompts + responses + metadata under organizational control." data-also="On-prem LLM, Private LLM" data-wikidata="Q115305900"></definition-box>

The correct framing is: **"Which workloads go self-hosted, which go to an API, and which go hybrid?"** This article maps the full three-way decision matrix under Turkish enterprise conditions.

<stat-callout data-value="~500M" data-context="Tokens/day; the 70B-class self-host versus API break-even point (assuming full utilization)" data-outcome="— below it, API is almost always cheaper; above it, self-host's cost advantage becomes clear; but the real decision is shaped by KVKK + BDDK + AI sovereignty constraints before cost." data-source="{&quot;label&quot;:&quot;Internal calculation based on Turkish bank and healthcare cases&quot;,&quot;url&quot;:&quot;https://sukruyusufkaya.com/blog/self-hosted-llm-vs-api-kvkk-bddk-kurumsal-karar-rehberi-2026&quot;,&quot;date&quot;:&quot;2026-05&quot;}"></stat-callout>

## 2. Anatomy: A 4-Dimensional Decision Framework

The self-host vs API decision is made on four independent dimensions — any of which **alone** may dictate the answer:

### 2.1. Token Volume Dimension

Cost math changes entirely with token consumption. The thresholds below are daily volumes, matching the comparison table in Section 3.

- **<10M tokens/day** (SMB chatbot): API is always cheaper. Self-host overhead is never earned back.
- **10-100M tokens/day** (mid-size): API still ahead; hybrid worth considering.
- **100-500M tokens/day** (large customer service): Hybrid is ideal: high-volume traffic on self-hosted open-source models, high-quality but rare requests on API.
- **>500M tokens/day** (massive enterprise): Self-host wins on cost, but operational maturity is mandatory.

### 2.2. Data Sensitivity Dimension

The **regulatory class** of data in prompts + responses is decisive.

- **Public / non-personal**: API freely usable.
- **Internal commercial data** (training, wiki): Not mandatory but hybrid recommended.
- **KVKK personal data**: Cross-border transfer risk; either KVKK-grade anonymization or a Turkey- or EU-hosted solution is required.
- **BDDK scope (finance)**: Banking AI Communiqué mandates data residency + explainability — significant push to self-host.
- **Healthcare data (Ministry of Health + KVKK)**: HBYS data cannot leave Turkey — self-host mandatory.
- **Defense technical data (ITAR / EAR / SSB)**: Self-host mandatory; preferably TÜBİTAK or T3-approved infrastructure.

### 2.3. Engineering Capacity Dimension

Self-host sustainability depends on team operational maturity.

- **No AI/ML engineer**: Self-host is a bad idea, stay on API.
- **1 AI engineer**: Limited self-host possible with 7B + single GPU + vLLM.
- **3+ AI engineers + DevOps**: 70B multi-GPU cluster + observability + eval harness possible.
- **AI Platform team (5+)**: Full strategic self-host + custom fine-tuning capacity.

### 2.4. Latency / SLA Dimension

Production SLA requirements affect the decision.

- **<1s p95 required** (real-time agent): Self-host advantage — no network jitter, full batch optimization.
- **<3s p95** (general chat): API sufficient.
- **<10s, batch tolerated**: API + cache + retry sufficient.
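The four dimensions above can be read as an ordered rule set. The sketch below is one possible encoding, not the author's tool: the `Workload` fields, the data-class labels, and the order in which the rules fire are assumptions made for illustration. Volume thresholds are taken per day, as in the comparison table in Section 3.

```python
from dataclasses import dataclass

# Data classes the article treats as forcing self-host regardless of cost.
MANDATORY_SELF_HOST = {"healthcare", "defense"}
# Data classes that push strongly toward self-host or hybrid.
SENSITIVE = {"kvkk_personal", "bddk_finance"}


@dataclass
class Workload:
    tokens_per_day: float        # combined input + output tokens
    data_class: str              # hypothetical labels: "public", "internal", ...
    ai_engineers: int            # engineers available to operate inference
    p95_latency_target_s: float  # production SLA target


def recommend(w: Workload) -> str:
    """Return 'api', 'hybrid', or 'self-host' for a single workload."""
    # Data sensitivity alone can dictate the answer.
    if w.data_class in MANDATORY_SELF_HOST:
        return "self-host"
    # Without any AI/ML engineer, self-hosting is not sustainable.
    if w.ai_engineers == 0:
        return "api"
    # Regulated data: full self-host only if the team can carry it.
    if w.data_class in SENSITIVE:
        return "self-host" if w.ai_engineers >= 3 else "hybrid"
    # Sub-second p95 targets favor local inference.
    if w.p95_latency_target_s < 1.0:
        return "self-host"
    # Otherwise decide on volume.
    if w.tokens_per_day > 500e6:
        return "self-host"
    if w.tokens_per_day >= 100e6:
        return "hybrid"
    return "api"
```

For example, `recommend(Workload(12e6, "bddk_finance", 1, 3.0))` returns `"hybrid"`: the data class pushes away from a pure API setup, but a single engineer cannot carry a full 70B cluster.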

## 3. Comparison: Self-Host vs API vs Hybrid

<comparison-table data-caption="Self-Hosted LLM vs API vs Hybrid (May 2026)" data-headers="[&quot;Dimension&quot;,&quot;Self-Host&quot;,&quot;API (OpenAI/Anthropic)&quot;,&quot;Hybrid&quot;]" data-rows="[{&quot;feature&quot;:&quot;Monthly Min Cost&quot;,&quot;values&quot;:[&quot;$3K-25K&quot;,&quot;$50-200&quot;,&quot;$2K-15K&quot;]},{&quot;feature&quot;:&quot;KVKK Compliance&quot;,&quot;values&quot;:[&quot;Full control&quot;,&quot;Hard + extra work&quot;,&quot;Workload-based&quot;]},{&quot;feature&quot;:&quot;BDDK Compliance&quot;,&quot;values&quot;:[&quot;Direct&quot;,&quot;High overhead&quot;,&quot;Possible&quot;]},{&quot;feature&quot;:&quot;Latency p95&quot;,&quot;values&quot;:[&quot;Low + predictable&quot;,&quot;Medium + jitter&quot;,&quot;Mixed&quot;]},{&quot;feature&quot;:&quot;Engineering Burden&quot;,&quot;values&quot;:[&quot;High&quot;,&quot;Low&quot;,&quot;Medium&quot;]},{&quot;feature&quot;:&quot;Model Quality&quot;,&quot;values&quot;:[&quot;Good (70B)&quot;,&quot;Best (GPT-5/Opus)&quot;,&quot;Flexible&quot;]},{&quot;feature&quot;:&quot;Data Residency&quot;,&quot;values&quot;:[&quot;100% domestic&quot;,&quot;API provider&quot;,&quot;Workload-based&quot;]},{&quot;feature&quot;:&quot;Token Volume Threshold&quot;,&quot;values&quot;:[&quot;>500M/day&quot;,&quot;<100M/day&quot;,&quot;100-500M/day&quot;]},{&quot;feature&quot;:&quot;Maintenance&quot;,&quot;values&quot;:[&quot;High (3-month updates)&quot;,&quot;None&quot;,&quot;Medium&quot;]},{&quot;feature&quot;:&quot;Vendor Lock-in&quot;,&quot;values&quot;:[&quot;None&quot;,&quot;Significant&quot;,&quot;Minimal&quot;]}]"></comparison-table>

### 3.1. GPU Cloud Cost: May 2026 Reality

GPU cloud pricing shifted substantially in the past 12 months:

<comparison-table data-caption="GPU Cloud Hourly Cost (Spot + On-Demand, May 2026)" data-headers="[&quot;GPU&quot;,&quot;Hourly (On-Demand)&quot;,&quot;Hourly (Spot)&quot;,&quot;VRAM&quot;,&quot;Primary Providers&quot;]" data-rows="[{&quot;feature&quot;:&quot;NVIDIA H100 SXM&quot;,&quot;values&quot;:[&quot;$4.50&quot;,&quot;$2.20&quot;,&quot;80 GB&quot;,&quot;AWS, GCP, Lambda, RunPod&quot;]},{&quot;feature&quot;:&quot;NVIDIA H100 PCIe&quot;,&quot;values&quot;:[&quot;$3.80&quot;,&quot;$1.80&quot;,&quot;80 GB&quot;,&quot;RunPod, Vast.ai&quot;]},{&quot;feature&quot;:&quot;NVIDIA H200&quot;,&quot;values&quot;:[&quot;$5.00&quot;,&quot;$2.80&quot;,&quot;141 GB&quot;,&quot;CoreWeave, Lambda, Crusoe&quot;]},{&quot;feature&quot;:&quot;NVIDIA B200&quot;,&quot;values&quot;:[&quot;$7-9&quot;,&quot;$4-5&quot;,&quot;192 GB&quot;,&quot;Limited GA (CoreWeave, Lambda)&quot;]},{&quot;feature&quot;:&quot;NVIDIA A100 80GB&quot;,&quot;values&quot;:[&quot;$2.20&quot;,&quot;$1.10&quot;,&quot;80 GB&quot;,&quot;Wide availability&quot;]},{&quot;feature&quot;:&quot;NVIDIA L4&quot;,&quot;values&quot;:[&quot;$0.80&quot;,&quot;$0.40&quot;,&quot;24 GB&quot;,&quot;GCP, AWS&quot;]},{&quot;feature&quot;:&quot;NVIDIA L40S&quot;,&quot;values&quot;:[&quot;$1.40&quot;,&quot;$0.70&quot;,&quot;48 GB&quot;,&quot;Common&quot;]}]"></comparison-table>

**Comment.** H100, at $8/hr in 2024, dropped to $4.50 in 2026 due to aggressive competition. B200 is still premium but is expected to settle at $5-6 by 2027 Q1. Spot prices are risky for production because preemption is possible; for a predictable SLA, use on-demand.

### 3.2. Quantization Impact: The Decision-Changing Dimension

Quantization compresses model weights to fewer bits, reducing VRAM and cost. As of 2026, production-ready options:

- **FP16 (baseline)**: 70B → 140 GB VRAM. No quality loss.
- **INT8**: 70B → 70 GB VRAM. Quality loss usually <1%.
- **AWQ Q4 / GPTQ Q4**: 70B → 35 GB VRAM. Quality loss 2-3%.
- **GGUF Q5_K_M**: 70B → ~45 GB VRAM. Good for hobby/edge; AWQ preferred for production.
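The FP16, INT8, and Q4 figures in the list above follow from a simple rule: weight memory is parameter count times bytes per weight. A minimal sketch of that arithmetic follows. It covers weights only; KV cache, activations, and quantization metadata need additional memory that is not modeled here.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM (decimal GB) needed for model weights alone."""
    bytes_per_weight = bits_per_weight / 8
    # 1e9 parameters x bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


for label, bits in [("FP16", 16), ("INT8", 8), ("AWQ/GPTQ Q4", 4)]:
    print(f"70B @ {label}: {weight_vram_gb(70, bits):.0f} GB")
```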

<callout-box data-variant="tip" data-title="AWQ Q4: Fits 70B in 2xH200">

Running a 70B model at FP16 (140 GB of weights, plus KV-cache headroom) is sized here at 8xH100 (640 GB VRAM, ~$36/hr). **The same model in AWQ Q4 runs on 2xH200 (282 GB capacity, ~$10/hr)**: hourly cost is 3.6x lower, with a quality loss of 2-3%. This is the single decision that moves self-host from "impossibly expensive" to "competitive."

</callout-box>

### 3.3. Throughput and Unit Cost

In the 70B AWQ Q4 + 2xH200 + vLLM scenario, real throughput:

- Single request (concurrency 1): ~50 tokens/s
- Batch 8: ~280 tokens/s aggregate
- Batch 16: ~480 tokens/s aggregate
- Batch 32: ~720 tokens/s aggregate (memory pressure begins)

**Unit cost calculation.** 2xH200 on-demand = $10/hr × 720 hr = $7,200/month. At a typical enterprise batch of 16: 480 tokens/s × 3,600 s = 1.728M tokens/hr, and 1.728M × 720 hr = **~1.24B tokens/month capacity**. Per-token self-host cost: **$7,200 / 1.24B = $5.81 / 1M tokens** (full utilization).
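The unit-cost calculation above can be reproduced in a few lines, with utilization added as a parameter. This is a sketch; the function names are illustrative. Without intermediate rounding the full-utilization result is about $5.79 per 1M tokens, slightly below the $5.81 in the text, which rounds capacity to 1.24B first.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 h, as in the calculation above


def monthly_capacity_tokens(tokens_per_s: float, utilization: float = 1.0) -> float:
    """Tokens a deployment can serve per month at a given utilization."""
    return tokens_per_s * 3600 * HOURS_PER_MONTH * utilization


def cost_per_million(hourly_usd: float, tokens_per_s: float,
                     utilization: float = 1.0) -> float:
    """GPU cost in USD per 1M tokens actually served."""
    monthly_cost = hourly_usd * HOURS_PER_MONTH
    capacity = monthly_capacity_tokens(tokens_per_s, utilization)
    return monthly_cost / (capacity / 1e6)


# 2xH200 at $10/hr, batch 16 at 480 tokens/s aggregate
for util in (1.0, 0.75, 0.60):
    usd = cost_per_million(10.0, 480, util)
    print(f"utilization {util:.0%}: ${usd:.2f} per 1M tokens")
```

The same GPU bill is spread over fewer tokens as utilization falls, which is why the full-utilization assumption matters so much for the break-even.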

OpenAI GPT-5 May 2026 pricing: $5 / 1M input + $15 / 1M output. Self-host unit cost (full util.) is comparable to GPT-5 input — but GPT-5 quality is a different tier.

Claude Opus 4.7: $15 / 1M input + $75 / 1M output. Self-host advantage becomes clear here — if Opus-tier quality is not needed.

## 4. Practical Setup: Break-Even Calculation

Let's walk through a real Turkish mid-large enterprise scenario.

### 4.1. Scenario: Turkish Bank Customer Service RAG

**Parameters:**
- 12M tokens/day (in + out combined) — mid-size bank chat volume
- 60% input / 40% output split
- p95 latency target: 3s
- KVKK + BDDK compliance mandatory

**API cost (GPT-5):**
- 12M tokens/day × 30 = 360M tokens/mo
- Input: 216M × $5 = **$1,080/mo**
- Output: 144M × $15 = **$2,160/mo**
- Total: **$3,240/mo**
- Annual: **~$39K**

**Self-host cost (70B AWQ + 2xH200):**
- GPU: 2xH200 on-demand = **$7,200/mo**
- 1.24B tokens/mo capacity (full util.)
- Engineering: 1 senior AI engineer $5,500/mo
- Observability + monitoring: **$500/mo**
- Security audit + KVKK compliance: **$300/mo**
- Total: **$13,500/mo**
- Annual: **~$162K**

**Result.** Here **API is about 4x cheaper than self-host**, so the pure cost answer is API. However, using the API under KVKK + BDDK adds ~$80K/year of audit, consulting, and cross-border documentation overhead. Adding this:

- API total: $39K + $80K = **$119K/yr**
- Self-host total: **$162K/yr** (KVKK compliance built-in)

Self-host is still costlier, **but the BDDK audit risk is much lower**. Management's decision: an acceptable cost premium for risk reduction.
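The scenario arithmetic can be checked with a short script. All inputs are the figures quoted in this section; only the function and variable names are made up for the example.

```python
def api_monthly_cost(tokens_per_day: float, input_share: float,
                     usd_per_m_in: float, usd_per_m_out: float,
                     days: int = 30) -> float:
    """Monthly API bill in USD for a given daily volume and input/output split."""
    monthly_m = tokens_per_day * days / 1e6  # millions of tokens per month
    input_cost = monthly_m * input_share * usd_per_m_in
    output_cost = monthly_m * (1 - input_share) * usd_per_m_out
    return input_cost + output_cost


# 12M tokens/day, 60/40 split, GPT-5 at $5 in / $15 out per 1M tokens
api_monthly = api_monthly_cost(12e6, 0.60, 5.0, 15.0)

# GPU + engineer + observability + security/KVKK audit
self_host_monthly = 7200 + 5500 + 500 + 300

api_annual = api_monthly * 12 + 80_000  # plus compliance overhead
self_host_annual = self_host_monthly * 12

print(f"API:       ${api_monthly:,.0f}/mo, ${api_annual:,.0f}/yr incl. compliance")
print(f"Self-host: ${self_host_monthly:,.0f}/mo, ${self_host_annual:,.0f}/yr")
```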

### 4.2. Break-Even: At What Token Volume Does Self-Host Win?

<comparison-table data-caption="Token Volume vs Monthly Cost (Turkish Bank Scenario)" data-headers="[&quot;Monthly Tokens&quot;,&quot;API Cost&quot;,&quot;Self-Host (2xH200)&quot;,&quot;Self-Host (4xH200)&quot;,&quot;Winner&quot;]" data-rows="[{&quot;feature&quot;:&quot;100M&quot;,&quot;values&quot;:[&quot;$900&quot;,&quot;$13.5K&quot;,&quot;$24K&quot;,&quot;API&quot;]},{&quot;feature&quot;:&quot;360M&quot;,&quot;values&quot;:[&quot;$3.2K&quot;,&quot;$13.5K&quot;,&quot;$24K&quot;,&quot;API&quot;]},{&quot;feature&quot;:&quot;1.2B&quot;,&quot;values&quot;:[&quot;$10.8K&quot;,&quot;$13.5K&quot;,&quot;$24K&quot;,&quot;API (marginal)&quot;]},{&quot;feature&quot;:&quot;3B&quot;,&quot;values&quot;:[&quot;$27K&quot;,&quot;$22K (4xH200)&quot;,&quot;$22K&quot;,&quot;Self-Host&quot;]},{&quot;feature&quot;:&quot;6B&quot;,&quot;values&quot;:[&quot;$54K&quot;,&quot;Capacity insufficient&quot;,&quot;$24K&quot;,&quot;Self-Host&quot;]},{&quot;feature&quot;:&quot;11B&quot;,&quot;values&quot;:[&quot;$99K&quot;,&quot;Capacity insufficient&quot;,&quot;$36K (6xH200)&quot;,&quot;Self-Host&quot;]},{&quot;feature&quot;:&quot;30B&quot;,&quot;values&quot;:[&quot;$270K&quot;,&quot;Capacity insufficient&quot;,&quot;$120K&quot;,&quot;Self-Host&quot;]}]"></comparison-table>

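**Comment.** On pure cost, the break-even against the API sits around **11 billion tokens/mo**, which is roughly 370M tokens/day over a 30-day month and the same order of magnitude as the ~500M tokens/day headline figure. Below the threshold, API wins; above it, self-host wins.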

### 4.3. Hidden Costs: The "Self-Host Is Free" Fallacy

<callout-box data-variant="warning" data-title="Self-Host Hidden Cost List">

Costs typically excluded but paid every month:

**(1) Engineering operations.** Senior AI engineer (Turkey 2026): $5-7K/mo; junior $2.5-3.5K/mo. Single engineer creates **key-person risk** — if they leave, system maintenance halts.

**(2) Observability stack.** Langfuse self-hosted ($150/mo), Prometheus + Grafana ($100/mo), log retention ($200/mo) = ~**$450/mo**.

**(3) Security + compliance audit.** External audit at $5-15K per year, i.e. roughly $0.4-1.25K/mo amortized; budget about **$1K/mo**.

**(4) Model update + re-deployment.** Quarterly version upgrade (~$5K engineering + GPU test) = **~$1.7K/mo amortized**.

**(5) GPU utilization loss.** Typical production utilization is 60-75%, not 100%, so a $7,200/mo GPU node effectively costs **$9,600-12,000/mo** per unit of full-utilization capacity.

Sum: items (2)-(4) alone add roughly **$750-3,000/mo** depending on scale, with engineering salaries and utilization loss on top. At small scale this can erase the theoretical cost advantage.

</callout-box>

## 5. Performance / Benchmark: Self-Host Quality Comparison

<stat-callout data-value="2.3" data-context="Points; the gap between Trendyol-LLM-70B-v3's OpenLLM-TR aggregate score (69.7) and GPT-4o-mini's Turkish average (~72)" data-outcome="— practically: in an enterprise Turkish RAG scenario, the quality difference between a self-host 70B model and GPT-4o-mini is **imperceptible**; however, the gap to GPT-5 / Claude Opus 4.7 remains significant." data-source="{&quot;label&quot;:&quot;OpenLLM-TR Leaderboard&quot;,&quot;url&quot;:&quot;https://huggingface.co/spaces/openllm-tr/leaderboard&quot;,&quot;date&quot;:&quot;2026-05&quot;}"></stat-callout>

### 5.1. Quality Tier: Self-Host Models vs API Models (May 2026)

<comparison-table data-caption="LLM Quality Comparison (Turkish, May 2026)" data-headers="[&quot;Model&quot;,&quot;Turkish Score&quot;,&quot;Access&quot;,&quot;Quality Tier&quot;]" data-rows="[{&quot;feature&quot;:&quot;GPT-5&quot;,&quot;values&quot;:[&quot;~78&quot;,&quot;API&quot;,&quot;S&quot;]},{&quot;feature&quot;:&quot;Claude Opus 4.7&quot;,&quot;values&quot;:[&quot;~76&quot;,&quot;API&quot;,&quot;S&quot;]},{&quot;feature&quot;:&quot;Gemini 3.1 Pro&quot;,&quot;values&quot;:[&quot;~74&quot;,&quot;API&quot;,&quot;A+&quot;]},{&quot;feature&quot;:&quot;GPT-4o-mini&quot;,&quot;values&quot;:[&quot;~72&quot;,&quot;API&quot;,&quot;A&quot;]},{&quot;feature&quot;:&quot;Trendyol-LLM-70B-v3&quot;,&quot;values&quot;:[&quot;69.7&quot;,&quot;Self-host&quot;,&quot;A&quot;]},{&quot;feature&quot;:&quot;Cosmos-Llama-1-70B&quot;,&quot;values&quot;:[&quot;68.0&quot;,&quot;Self-host&quot;,&quot;A&quot;]},{&quot;feature&quot;:&quot;Llama-3.3-70B (vanilla)&quot;,&quot;values&quot;:[&quot;64.2&quot;,&quot;Self-host&quot;,&quot;B+&quot;]},{&quot;feature&quot;:&quot;DeepSeek V3.2&quot;,&quot;values&quot;:[&quot;~67&quot;,&quot;Self-host (671B MoE)&quot;,&quot;A&quot;]},{&quot;feature&quot;:&quot;Qwen 3.5-72B&quot;,&quot;values&quot;:[&quot;~66&quot;,&quot;Self-host&quot;,&quot;A-&quot;]},{&quot;feature&quot;:&quot;Claude Haiku 4.5&quot;,&quot;values&quot;:[&quot;~63&quot;,&quot;API&quot;,&quot;B+&quot;]},{&quot;feature&quot;:&quot;Trendyol-LLM-7B-v3&quot;,&quot;values&quot;:[&quot;51.4&quot;,&quot;Self-host&quot;,&quot;B&quot;]},{&quot;feature&quot;:&quot;Kumru AI-7.4B&quot;,&quot;values&quot;:[&quot;47.1&quot;,&quot;Self-host&quot;,&quot;C+&quot;]}]"></comparison-table>

**Practical observation.** The ceiling for self-host Turkish quality is approximately **GPT-4o-mini tier**. To compete with GPT-5 / Claude Opus 4.7 you need either **fine-tuning + RLHF investment** or hybrid (critical queries on API, the rest self-host).

### 5.2. Latency Comparison

Latency matters for UX as much as cost:

- **API (GPT-5)**: p50 ~1.4s, p95 ~3.8s (EU endpoint). +50-80ms from Turkey.
- **API (Claude Opus 4.7)**: p50 ~1.8s, p95 ~4.5s.
- **Self-host (Trendyol-70B AWQ + 2xH200, batch 8)**: p50 ~1.1s, p95 ~2.6s.
- **Self-host (Trendyol-7B + L4, batch 1)**: p50 ~0.6s, p95 ~1.4s.

**Comment.** Self-host latency advantage is clear thanks to local deployment + zero network jitter. Critical in real-time agent scenarios.

## 6. Turkish-Specific Angle: KVKK, BDDK, and AI Sovereignty

### 6.1. KVKK Article 9: Cross-Border Transfer Risk

KVKK Article 9 restricts personal data transfer abroad to **(a)** explicit consent or **(b)** adequate-country list. When prompts containing personal data go to US-based APIs (OpenAI / Anthropic):

1. **A cross-border transfer is triggered.** Turkey → US.
2. **US is not in adequate-country status** (per KVKK board).
3. Therefore **explicit consent must be obtained** — practically infeasible.

**Solutions:**

- **A. Anonymization layer**: All personal data is masked via PII detection before the prompt leaves the organization. Pragmatic, but missed detections remain a risk.
- **B. EU endpoint**: Some providers (Anthropic AWS Bedrock EU, OpenAI Azure EU) offer European data residency. KVKK board considers EU adequate — this works.
- **C. Self-host (Turkey)**: Cleanest path; personal data never crosses borders.
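To make option A concrete, here is a minimal masking sketch. It is an illustration, not a production anonymizer: it only covers Turkish national ID numbers (validated with the standard two-digit checksum) and email addresses, and the placeholder tokens `[TCKN]` and `[EMAIL]` are assumptions. Names and addresses need an NER model and are not handled, which is exactly where the false-negative risk comes from.

```python
import re

TCKN_RE = re.compile(r"\b[1-9][0-9]{10}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def is_valid_tckn(candidate: str) -> bool:
    """Check the two checksum digits of an 11-digit Turkish national ID."""
    if len(candidate) != 11 or not (candidate.isascii() and candidate.isdigit()):
        return False
    if candidate[0] == "0":
        return False
    d = [int(c) for c in candidate]
    # 10th digit: (sum of odd-position digits * 7 - sum of even-position digits) mod 10
    tenth = (sum(d[0:9:2]) * 7 - sum(d[1:8:2])) % 10
    # 11th digit: sum of the first ten digits mod 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh


def mask_pii(text: str) -> str:
    """Replace checksum-valid national IDs and email addresses with placeholders."""
    def _mask_id(match: re.Match) -> str:
        return "[TCKN]" if is_valid_tckn(match.group()) else match.group()

    text = TCKN_RE.sub(_mask_id, text)
    return EMAIL_RE.sub("[EMAIL]", text)
```

Validating the checksum before masking keeps order numbers and other 11-digit strings intact, at the cost of letting through a mistyped ID.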

### 6.2. BDDK 2024 AI Communiqué

In September 2024, BDDK published the "Banking AI and Machine Learning Management Communiqué" requiring:

1. **Data residency.** Banking AI systems hosted in Turkey or adequate jurisdictions.
2. **Explainability.** Human-understandable rationale for AI-driven decisions.
3. **Third-party dependency.** Explicit contracts + risk assessment for AI providers.
4. **Audit logs.** 7-year retention for every AI decision.

**Practical impact.** Most Turkish banks incur **$50-150K/year compliance overhead** to use OpenAI/Anthropic API; migrating to self-host typically cuts this by 2/3.
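The audit-log requirement maps naturally onto an append-only record per AI decision. The sketch below shows one possible shape. The field names, and the choice to store a SHA-256 hash of the prompt rather than the raw text, are assumptions for illustration; what a BDDK auditor actually expects to see in the record should be confirmed separately.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import date

RETENTION_YEARS = 7  # retention period named in the requirement list above


def retention_deadline(decided_on: date, years: int = RETENTION_YEARS) -> date:
    """Date until which the record must be kept."""
    try:
        return decided_on.replace(year=decided_on.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year.
        return decided_on.replace(year=decided_on.year + years, day=28)


@dataclass
class AuditRecord:
    decided_on: str
    model: str
    prompt_sha256: str
    rationale: str      # human-readable explanation of the decision
    retain_until: str


def make_record(prompt: str, model: str, rationale: str, decided_on: date) -> AuditRecord:
    """Build one audit record for a single AI-driven decision."""
    return AuditRecord(
        decided_on=decided_on.isoformat(),
        model=model,
        prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        rationale=rationale,
        retain_until=retention_deadline(decided_on).isoformat(),
    )


record = make_record("example prompt", "example-model", "limit raised: stable income",
                     date(2026, 5, 27))
print(json.dumps(asdict(record), ensure_ascii=False))
```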

### 6.3. Defense: ITAR / EAR / SSB Constraints

In defense, anything in the **technical data** category cannot go to foreign cloud:

- Weapon system specs
- Tactical operational planning
- UAV telemetry
- Command-control dialogue
- Military training material

In this category **self-host is mandatory**; preferably TÜBİTAK BİLGEM or T3 AI Baykar-approved infrastructure.

### 6.4. AI Sovereignty: TÜBİTAK and T3 Approach

**AI sovereignty** as a concept ties critical AI capability independence to national security + economic autonomy. In 2025-2026 Turkey:

- **TÜBİTAK BİLGEM**: Turkish LLMs trained from scratch (bilgem-tr-llm-13b, 70b) + Turkish GPU cluster.
- **T3 AI Baykar**: Defense-specific fine-tunes + ITAR/EAR-compatible licenses.
- **TÜBİTAK ULAKBİM**: GPU compute infrastructure (academic + public).

These three legs facilitate the **migration to self-host in strategic sectors**.

<callout-box data-variant="tip" data-title="Turkish Enterprise AI Strategy: 3-Tier">

The model most large Turkish institutions adopt in 2026:

**Tier 1 — Public/Commodity Workload (API).** Internal training material, public content generation, code generation, general knowledge → GPT-5 or Claude Opus 4.7.

**Tier 2 — Enterprise Data (Self-host).** Customer service, internal wiki, product search, KVKK-sensitive pipeline → Trendyol-LLM-70B-v3 or Cosmos-Llama-1-70B.

**Tier 3 — Critical Sovereign Workload (Domestic Self-host).** Public sector, defense, critical financial infrastructure → TÜBİTAK BİLGEM or T3 AI models, fully within Turkey.

</callout-box>

## 7. Case Studies: Turkish Sector Decisions

### Case 1 — Turkish Bank: Self-Host for BDDK Compliance

**Company.** Top-5 Turkish private bank (anonymized, ~18M active customers).

**Problem.** Internal training chatbot + dealer support + customer service summarization expected to consume ~9 billion tokens/month. Estimated OpenAI cost: **$95K/mo**; but BDDK 2024 Communiqué mandates data residency + explainability + 7-year audit logs — compliance overhead massive on API.

**Decision process.** 6-week evaluation:
- API + KVKK anonymization layer: technically possible but BDDK audit risk high.
- Azure OpenAI EU endpoint: OK for KVKK, conflicts with BDDK's "Turkey residency" preference.
- Self-host: Trendyol-LLM-70B-v3 + Cosmos-Llama-1-70B hybrid; Ankara DC, 8xH100.

**Solution.** Self-host chosen. Hardware investment: $650K (8xH100 + networking + storage); operations: $18K/mo (engineering, observability, security audit). First-year total: **$866K** ($650K + 12 × $18K). API alternative: **$1.14M/yr** ($95K × 12, before compliance overhead). Reported ROI: **positive at 24 months**.
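The first-year totals can be re-derived from the figures in this case, along with a naive payback period. This is a sketch under simplifying assumptions: it ignores depreciation, financing, power and colocation, and API-side compliance overhead. That naive payback comes out shorter than the 24-month figure quoted in the case, which may include costs not itemized in this summary.

```python
import math


def first_year_cost(capex_usd: float, monthly_opex_usd: float) -> float:
    """Hardware investment plus twelve months of operations."""
    return capex_usd + 12 * monthly_opex_usd


def payback_months(capex_usd: float, monthly_opex_usd: float,
                   monthly_api_usd: float) -> float:
    """Months until cumulative savings versus the API cover the capex."""
    monthly_saving = monthly_api_usd - monthly_opex_usd
    if monthly_saving <= 0:
        return math.inf  # self-host never pays back
    return capex_usd / monthly_saving


self_host_year1 = first_year_cost(650_000, 18_000)
api_year1 = 95_000 * 12
months = payback_months(650_000, 18_000, 95_000)

print(f"Self-host year 1: ${self_host_year1:,.0f}")
print(f"API year 1:       ${api_year1:,.0f}")
print(f"Naive payback:    {months:.1f} months")
```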

**Outcome.** 18,000 dealers + 28,000 internal users. Customer service avg response 12 min → 3 min. BDDK 2025 audit "AI compliance" item: full score. Brand benefit: "domestic capability" positioning.

### Case 2 — Healthcare Group: HBYS Data + KVKK + Mandatory Self-Host

**Company.** 14 hospitals + 23 outpatient clinics (~1.2M annual patient encounters).

**Problem.** Doctor consultation notes need to be auto-transcribed and summarized into HBYS. Token volume ~200M/mo (mid-level). Constraint: **HBYS data must never leave Turkey** (KVKK + Ministry of Health Patient Data Regulation).

**Decision process.**
- OpenAI API: KVKK + Health Ministry double constraint — eliminated.
- Azure OpenAI EU: OK for KVKK but Health Ministry requires "within Turkey" — compliance hard.
- Self-host: the only viable path.

**Solution.** Each hospital received an **RTX 4090 24GB workstation + Kumru AI-7.4B** (4-bit, 4.5GB VRAM). Doctor's desktop app: voice → text (Whisper Turkish self-host) → summary (Kumru AI) → HBYS — fully local. No patient data leaves the hospital network.

**Cost.** $8K per hospital (workstation + integration + training). 14 hospitals = $112K capex. Monthly operational: $1,200 (central monitoring + model updates). API alternative is meaningless — regulatorily infeasible.

**Outcome.** Doctor daily note-taking time 90 min → 25 min. Rolled out to 14 sites in 8 months. KVKK + Health Ministry audits "within Turkey processing" item: full compliance.

### Case 3 — SMB E-commerce: Stay on API

**Company.** ~$2M/month revenue Turkish e-commerce SMB (anonymized, 25 employees).

**Problem.** Customer service chatbot + product description generation + AI marketing copy expected at ~30M tokens/month.

**Decision process.**
- API (GPT-4o-mini): ~$300/mo. No AI engineer on staff.
- Self-host: 7B + single L4 ($580/mo) + 1 part-time AI engineer ($1500/mo) = ~$2K/mo.

**Solution.** **Stayed on API**. Self-host **7x more expensive at this volume + no team capacity**. No KVKK risk (customer data is anonymized, no personal data in prompts). Out of BDDK scope.

**Outcome.** Customer service chats 12,000 → 38,000/mo (auto-resolve). Product description speed 5x. AI marketing copy A/B tests lifted conversion 18%. AI investment: $300/mo API + $800/mo part-time prompt engineer = **$1,100/mo**.

**Takeaway.** At SMB scale, "self-host" is the wrong question. API + good prompt engineering + basic observability suffice.

## 8. Risks and Cost

<callout-box data-variant="warning" data-title="Realistic Self-Host Risk List">

40% of companies that migrate to self-host return to API within 18 months. The main reasons:

**(1) Key-person risk.** If the single AI engineer leaves, maintenance halts. Mitigation: 2 senior + 1 junior team minimum.

**(2) GPU supply risk.** H100/H200/B200 lead time still 6-12 weeks in 2026. Mitigation: cloud GPU (RunPod, Lambda) + spot fallback.

**(3) Model upgrade risk.** Trendyol-LLM v3 → v4 requires retesting all fine-tuning and eval; 4-6 weeks. Mitigation: continuous eval harness.

**(4) License shift risk.** Meta can change the terms of the Llama 3.3 community license. Mitigation: Apache 2.0 fallback (KanarYa, Kumru).

**(5) Quality regression.** When new API models (GPT-6, Claude 5) drop, your self-host capability becomes **relatively weaker**; continuous upgrade pressure.

**(6) Cost blow-up.** If token volume stays below expectation, self-host unit cost can rise 3-5x.

</callout-box>

### 8.1. Vendor-Neutral Self-Host Stack Recommendations

For Turkish enterprises in 2026, a mature stack:

- **Inference server**: **vLLM** (production default), Ollama (dev), BentoML (multi-model serving), Hugging Face TGI (Llama optimized).
- **Quantization**: AWQ (Q4) most stable for production; GPTQ alternative.
- **Vector DB (RAG)**: Qdrant (most common), pgvector (on existing Postgres), Weaviate.
- **Embedding (Turkish)**: BGE-M3 (multilingual, self-hosted), Trendyol-LLM-Embed-v1.
- **Observability**: Langfuse (self-hosted + open-source), Helicone, Arize Phoenix.
- **Eval harness**: RAGAS, DeepEval, TruLens.
- **Orchestration**: Modal (managed), Ray Serve (self-hosted), KServe (Kubernetes-native).

### 8.2. Hybrid Architecture: Most-Recommended Pattern

The most common Turkish enterprise pattern in 2026 is **3-tier hybrid**:

- **Tier 1 (sensitive / high volume) → Self-host**: Trendyol-LLM-70B-v3 + Qdrant + vLLM, Turkey DC.
- **Tier 2 (general / mid volume) → API**: Claude Opus 4.7 or GPT-5, EU endpoint.
- **Tier 3 (experimental / dev) → API**: fast iterations, promoted to Tier 1/2 once production-ready.

A workload router (simple API gateway + rule engine) directs traffic to the right tier based on KVKK risk + complexity + cache hit probability.
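A workload router of this kind can start as a handful of rules. The sketch below is one possible rule set, not a reference implementation: the `Request` fields and the health flags are hypothetical, and cache-hit probability is left out for brevity. One deliberate choice is that requests carrying personal data never fall back to the API tier, even when self-host is unhealthy; they stay on Tier 1 and wait.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SELF_HOST = "tier1-self-host"
    API = "tier2-api"
    EXPERIMENTAL = "tier3-experimental-api"


@dataclass
class Request:
    contains_personal_data: bool  # KVKK risk flag set upstream
    high_volume: bool             # belongs to a high-volume workload
    experimental: bool = False    # dev / prototype traffic


def route(req: Request, self_host_ok: bool = True, api_ok: bool = True) -> Tier:
    """Pick a tier for one request, with fallback for non-sensitive traffic."""
    # Sensitive traffic is pinned to self-host regardless of health.
    if req.contains_personal_data:
        return Tier.SELF_HOST
    if req.experimental and api_ok:
        return Tier.EXPERIMENTAL
    preferred = Tier.SELF_HOST if req.high_volume else Tier.API
    # API down: fall back to self-host.
    if preferred is Tier.API and not api_ok:
        return Tier.SELF_HOST
    # Self-host overloaded: fall back to API when it is available.
    if preferred is Tier.SELF_HOST and not self_host_ok and api_ok:
        return Tier.API
    return preferred
```

In a real gateway the health flags would come from circuit breakers or queue-depth metrics, and the personal-data flag from the PII detection layer.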

## 9. Frequently Asked Questions

<callout-box data-variant="answer" data-title="Is the 500M tokens/day break-even really accurate?">

For 70B-class models in **May 2026 conditions**, yes — under full utilization. In practice most orgs run at 60-70% utilization, raising effective break-even to **800M-1B tokens/day**. The break-even for 7B is much lower (~50M tokens/day) but 7B doesn't compete with GPT-5 quality; comparison isn't meaningful.

</callout-box>

<callout-box data-variant="answer" data-title="Is quantization (AWQ Q4) really production-grade?">

Yes. As of 2026, AWQ Q4 + vLLM is in production at 12 or more major Turkish institutions. The quality drop is measurable (eval scores 2-3% below FP16) but does not affect user experience. For complex reasoning, Q5 or Q6 is preferred; for general chat + RAG, Q4 is sufficient.

</callout-box>

<callout-box data-variant="answer" data-title="Does BDDK 2024 prohibit API use?">

No — BDDK doesn't prohibit API use, but **dramatically increases compliance overhead**. In practice: Azure OpenAI EU endpoint + extra contract + risk assessment + audit log infrastructure makes API compliant, but adds $80-150K/year compliance overhead. Self-host typically reduces this to 1/3.

</callout-box>

<callout-box data-variant="answer" data-title="Is KVKK anonymization layer sufficient?">

**Theoretically yes** (if personal data never enters prompts), **practically risky**. Turkish PII detection (national ID, name, address) works at 98%+ accuracy but a 2% false negative creates BDDK/KVKK audit problems. **Defense in depth** uses anonymization + EU endpoint + extra contracts together.

</callout-box>

<callout-box data-variant="answer" data-title="How do I design a hybrid architecture?">

Three steps: **(1)** Build workload taxonomy — for each use case, determine KVKK risk + token volume + quality need; **(2)** Layer an API gateway (Kong, AWS API Gateway, Cloudflare Workers) — route requests to the right tier; **(3)** Design cache + fallback — if API is down, fallback to self-host; if self-host is overloaded, fallback to API.

</callout-box>

<callout-box data-variant="answer" data-title="Which self-host inference server should I choose?">

**vLLM for production** — most mature, best throughput, multi-GPU + multi-model. Ollama excellent for dev/POC but insufficient for production traffic. BentoML ideal for multi-model orchestration. Hugging Face TGI optimized for Llama. Modal is a managed alternative — good start if engineering capacity is low.

</callout-box>

<callout-box data-variant="answer" data-title="Can I use cloud TPU or Apple Silicon instead of GPU?">

**Apple Silicon** (M2 Ultra 192GB, M3 Max) performs surprisingly well for 7B-13B; sufficient for SMB + dev. **Cloud TPU** (Google) is a less mature target for vLLM than NVIDIA GPUs, and much of the TPU tooling assumes a JAX/Flax stack, so operational overhead is high. For production in 2026, **NVIDIA H100/H200/B200 + vLLM** remains the most mature stack.

</callout-box>

<callout-box data-variant="answer" data-title="How hard is it to revert from self-host to API?">

Not hard, especially if hybrid architecture is in place — rollback is **routing Tier 1 traffic to Tier 2**. The real risk is **scope creep**: the hardware, team, and processes built around self-host create a perception of irreversibility, while models evolve fast. Define your exit strategy on day one.

</callout-box>

## 10. Next Steps

To frame the self-host vs API decision for your specific organization, three concrete steps:

1. **Workload taxonomy + token volume analysis.** Log LLM usage for 4 weeks to extract token volume, prompt type distribution, KVKK + BDDK risk profile, and peak load.
2. **Break-even simulator + risk matrix.** Excel/Python model with sector + token volume + regulatory load inputs; outputs API cost, self-host cost (3 scenarios), hybrid cost, and ROI threshold.
3. **Pilot setup (4-8 weeks).** Hybrid architecture pilot — one use case on self-host (Trendyol-LLM-7B or 70B AWQ), two use cases on API; observability, eval, and fallback tests.
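As a starting point for step 2, a break-even simulator can be very small. The sketch below is not the author's internal model: it assumes self-host capacity is bought in whole nodes (one node = one 2xH200 deployment), a single blended API price, and a flat monthly overhead. The crossover it produces depends heavily on the utilization, overhead, and compliance inputs, so treat its output as a way to explore sensitivity rather than as a reproduction of the figures in this guide.

```python
import math


def api_monthly_usd(tokens_m: float, blended_usd_per_m: float,
                    compliance_monthly_usd: float = 0.0) -> float:
    """API bill for `tokens_m` million tokens per month."""
    return tokens_m * blended_usd_per_m + compliance_monthly_usd


def self_host_monthly_usd(tokens_m: float, node_monthly_usd: float,
                          node_capacity_m: float, fixed_monthly_usd: float,
                          utilization: float = 1.0) -> float:
    """Self-host bill: whole nodes sized for the volume, plus fixed overhead."""
    usable_m = node_capacity_m * utilization
    nodes = max(1, math.ceil(tokens_m / usable_m))
    return nodes * node_monthly_usd + fixed_monthly_usd


def compare(volumes_m, *, blended_usd_per_m, compliance_monthly_usd,
            node_monthly_usd, node_capacity_m, fixed_monthly_usd, utilization):
    """Return (volume, api_cost, self_host_cost, winner) for each volume."""
    rows = []
    for v in volumes_m:
        api = api_monthly_usd(v, blended_usd_per_m, compliance_monthly_usd)
        sh = self_host_monthly_usd(v, node_monthly_usd, node_capacity_m,
                                   fixed_monthly_usd, utilization)
        rows.append((v, api, sh, "api" if api <= sh else "self-host"))
    return rows


# Inputs from the bank scenario: $9 blended per 1M tokens (60% x $5 + 40% x $15),
# $7,200 per 2xH200 node, ~1,244M tokens/month per node, $6,300 fixed overhead.
params = dict(blended_usd_per_m=9.0, compliance_monthly_usd=80_000 / 12,
              node_monthly_usd=7200, node_capacity_m=1244.16,
              fixed_monthly_usd=6300, utilization=0.7)

for volume, api, sh, winner in compare([100, 360, 1200, 3000, 11000], **params):
    print(f"{volume:>6}M tokens/mo  API ${api:>9,.0f}  self-host ${sh:>9,.0f}  -> {winner}")
```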

Reach out via the contact form on the site.

<references-list data-items="[{&quot;title&quot;:&quot;BDDK — Banking AI and Machine Learning Management Communiqué&quot;,&quot;url&quot;:&quot;https://www.bddk.org.tr/&quot;,&quot;author&quot;:&quot;BDDK&quot;,&quot;publishedAt&quot;:&quot;2024-09&quot;,&quot;publisher&quot;:&quot;BDDK&quot;},{&quot;title&quot;:&quot;KVKK — Law No. 6698&quot;,&quot;url&quot;:&quot;https://www.kvkk.gov.tr/&quot;,&quot;author&quot;:&quot;Republic of Turkiye - KVKK&quot;,&quot;publishedAt&quot;:&quot;2016-04&quot;,&quot;publisher&quot;:&quot;Republic of Turkiye&quot;},{&quot;title&quot;:&quot;KVKK Cross-Border Data Transfer Guide&quot;,&quot;url&quot;:&quot;https://www.kvkk.gov.tr/Icerik/2042/Yurt-Disina-Veri-Aktarimi-Hakkinda&quot;,&quot;author&quot;:&quot;KVKK&quot;,&quot;publishedAt&quot;:&quot;2023&quot;,&quot;publisher&quot;:&quot;KVKK&quot;},{&quot;title&quot;:&quot;Turkish Health Ministry Patient Data Regulation&quot;,&quot;url&quot;:&quot;https://www.resmigazete.gov.tr/&quot;,&quot;author&quot;:&quot;Turkish Ministry of Health&quot;,&quot;publishedAt&quot;:&quot;2019-06&quot;,&quot;publisher&quot;:&quot;Official Gazette&quot;},{&quot;title&quot;:&quot;NVIDIA H100 Tensor Core GPU&quot;,&quot;url&quot;:&quot;https://www.nvidia.com/en-us/data-center/h100/&quot;,&quot;author&quot;:&quot;NVIDIA&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;NVIDIA&quot;},{&quot;title&quot;:&quot;NVIDIA H200 Tensor Core GPU&quot;,&quot;url&quot;:&quot;https://www.nvidia.com/en-us/data-center/h200/&quot;,&quot;author&quot;:&quot;NVIDIA&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;NVIDIA&quot;},{&quot;title&quot;:&quot;NVIDIA Blackwell B200&quot;,&quot;url&quot;:&quot;https://www.nvidia.com/en-us/data-center/blackwell-architecture/&quot;,&quot;author&quot;:&quot;NVIDIA&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;NVIDIA&quot;},{&quot;title&quot;:&quot;vLLM Documentation&quot;,&quot;url&quot;:&quot;https://docs.vllm.ai/&quot;,&quot;author&quot;:&quot;vLLM Project&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;vLLM&quot;},{&quot;title&quot;:&quot;AWQ: Activation-aware Weight Quantization&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2306.00978&quot;,&quot;author&quot;:&quot;Lin et al.&quot;,&quot;publishedAt&quot;:&quot;2023-06&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;GPTQ&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2210.17323&quot;,&quot;author&quot;:&quot;Frantar et al.&quot;,&quot;publishedAt&quot;:&quot;2022-10&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Trendyol-LLM-70B-v3&quot;,&quot;url&quot;:&quot;https://huggingface.co/Trendyol/Trendyol-LLM-70B-base-v3.0&quot;,&quot;author&quot;:&quot;Trendyol AI Lab&quot;,&quot;publishedAt&quot;:&quot;2025-11&quot;,&quot;publisher&quot;:&quot;Hugging Face&quot;},{&quot;title&quot;:&quot;Cosmos-Llama-1-70B&quot;,&quot;url&quot;:&quot;https://huggingface.co/ytu-ce-cosmos/Cosmos-LLaMa-1-70B&quot;,&quot;author&quot;:&quot;YTU CE Cosmos&quot;,&quot;publishedAt&quot;:&quot;2026-01&quot;,&quot;publisher&quot;:&quot;Hugging Face&quot;},{&quot;title&quot;:&quot;OpenAI API Pricing&quot;,&quot;url&quot;:&quot;https://openai.com/api/pricing/&quot;,&quot;author&quot;:&quot;OpenAI&quot;,&quot;publishedAt&quot;:&quot;2026-05&quot;,&quot;publisher&quot;:&quot;OpenAI&quot;},{&quot;title&quot;:&quot;Anthropic API Pricing&quot;,&quot;url&quot;:&quot;https://www.anthropic.com/pricing&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2026-05&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;AWS Bedrock EU Region&quot;,&quot;url&quot;:&quot;https://aws.amazon.com/bedrock/&quot;,&quot;author&quot;:&quot;AWS&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Amazon&quot;},{&quot;title&quot;:&quot;Azure OpenAI EU Endpoints&quot;,&quot;url&quot;:&quot;https://azure.microsoft.com/en-us/products/cognitive-services/openai-service/&quot;,&quot;author&quot;:&quot;Microsoft&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Microsoft&quot;},{&quot;title&quot;:&quot;Langfuse&quot;,&quot;url&quot;:&quot;https://langfuse.com/&quot;,&quot;author&quot;:&quot;Langfuse&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Langfuse&quot;},{&quot;title&quot;:&quot;RAGAS&quot;,&quot;url&quot;:&quot;https://docs.ragas.io/&quot;,&quot;author&quot;:&quot;RAGAS&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;RAGAS&quot;},{&quot;title&quot;:&quot;TÜBİTAK BİLGEM AI Institute&quot;,&quot;url&quot;:&quot;https://bilgem.tubitak.gov.tr/&quot;,&quot;author&quot;:&quot;TÜBİTAK BİLGEM&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;TÜBİTAK&quot;},{&quot;title&quot;:&quot;T3 Foundation&quot;,&quot;url&quot;:&quot;https://t3vakfi.org/&quot;,&quot;author&quot;:&quot;T3 Foundation&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;T3&quot;},{&quot;title&quot;:&quot;Turkish Defense Industry Presidency (SSB)&quot;,&quot;url&quot;:&quot;https://www.ssb.gov.tr/&quot;,&quot;author&quot;:&quot;SSB&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;SSB&quot;},{&quot;title&quot;:&quot;ITAR — International Traffic in Arms Regulations&quot;,&quot;url&quot;:&quot;https://www.pmddtc.state.gov/ddtc_public&quot;,&quot;author&quot;:&quot;U.S. State Department&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;US&quot;},{&quot;title&quot;:&quot;EAR — Export Administration Regulations&quot;,&quot;url&quot;:&quot;https://www.bis.doc.gov/index.php/regulations/export-administration-regulations-ear&quot;,&quot;author&quot;:&quot;U.S. Department of Commerce&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;US&quot;},{&quot;title&quot;:&quot;Modal&quot;,&quot;url&quot;:&quot;https://modal.com/&quot;,&quot;author&quot;:&quot;Modal&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Modal&quot;},{&quot;title&quot;:&quot;Hugging Face TGI&quot;,&quot;url&quot;:&quot;https://github.com/huggingface/text-generation-inference&quot;,&quot;author&quot;:&quot;Hugging Face&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Hugging Face&quot;},{&quot;title&quot;:&quot;BentoML&quot;,&quot;url&quot;:&quot;https://www.bentoml.com/&quot;,&quot;author&quot;:&quot;BentoML&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;BentoML&quot;},{&quot;title&quot;:&quot;Ollama&quot;,&quot;url&quot;:&quot;https://ollama.com/&quot;,&quot;author&quot;:&quot;Ollama&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Ollama&quot;},{&quot;title&quot;:&quot;RunPod&quot;,&quot;url&quot;:&quot;https://www.runpod.io/&quot;,&quot;author&quot;:&quot;RunPod&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;RunPod&quot;},{&quot;title&quot;:&quot;Lambda Labs&quot;,&quot;url&quot;:&quot;https://lambdalabs.com/&quot;,&quot;author&quot;:&quot;Lambda&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Lambda Labs&quot;},{&quot;title&quot;:&quot;CoreWeave&quot;,&quot;url&quot;:&quot;https://www.coreweave.com/&quot;,&quot;author&quot;:&quot;CoreWeave&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;CoreWeave&quot;},{&quot;title&quot;:&quot;Crusoe&quot;,&quot;url&quot;:&quot;https://crusoe.ai/&quot;,&quot;author&quot;:&quot;Crusoe&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Crusoe&quot;},{&quot;title&quot;:&quot;DeepSeek V3.2&quot;,&quot;url&quot;:&quot;https://huggingface.co/deepseek-ai&quot;,&quot;author&quot;:&quot;DeepSeek&quot;,&quot;publishedAt&quot;:&quot;2026-03&quot;,&quot;publisher&quot;:&quot;Hugging Face&quot;},{&quot;title&quot;:&quot;Qwen 3.5 Series&quot;,&quot;url&quot;:&quot;https://huggingface.co/Qwen&quot;,&quot;author&quot;:&quot;Alibaba Qwen&quot;,&quot;publishedAt&quot;:&quot;2026-02&quot;,&quot;publisher&quot;:&quot;Hugging Face&quot;}]"></references-list>

---

This is a living document: LLM API pricing, GPU costs, and the regulatory framework shift every quarter, so it is **updated quarterly**.