Lab: vLLM Llama-3.1-8B, Caching On vs. Off
Author: Şükrü Yusuf KAYA

Host Llama-3.1-8B with vLLM and compare throughput and latency with prefix caching on and off.

15-minute read

14.05.2026

Advanced

Lab #12: vLLM Llama-3.1-8B

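Requirements: 1× A100 40GB or 2× RTX 4090 (24GB each). If you have no local GPU, rent one from RunPod, Vast.ai, or Lambda Labs. With two RTX 4090s you will likely need to add `--tensor-parallel-size 2` to the `vllm serve` command in Step 1.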

Goal: run the same 100 queries with caching on and with caching off, then compare throughput.

Step 1: vLLM Setup

bash

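# Python venv
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install vllm

# Hugging Face login (Llama-3.1 is a gated model)
huggingface-cli login

# Download and serve the model with prefix caching on.
# For the caching-off run, restart the server with
# --no-enable-prefix-caching instead (flag name as in recent vLLM versions).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --block-size 16 \
  --max-model-len 32768 \
  --port 8000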

vLLM setup + serve
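Loading the model can take a few minutes, so it helps to wait until the server answers before starting the benchmark. The following is a minimal sketch, assuming the OpenAI-compatible `/v1/models` route on port 8000 from the command above. The helper name `wait_for_server` and its parameters are hypothetical, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8000/v1/models", timeout_s=600.0,
                    poll_s=2.0, opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    `opener` and `sleep` are injectable only so the helper can be tested
    without a running server (an assumption of this sketch).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server is not accepting requests yet
        sleep(poll_s)
    return False
```

Calling `wait_for_server()` before Step 2 returns `True` once the model list is served, or `False` if the timeout expires first.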

Step 2: Benchmark Script

python

import time
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Long shared prefix (~10K tokens) reused by every request
with open("knowledge_base.txt", encoding="utf-8") as f:
    LONG_KB = f.read()

def run_query(system, user):
    """Send one chat request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        max_tokens=200,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return time.perf_counter() - start

# 100 different user queries, always the same system prompt
USER_QUERIES = [f"Question {i}: explain the details" for i in range(100)]

latencies = [run_query(LONG_KB, q) for q in USER_QUERIES]

# Requests run one at a time, so the latency ratio is also the throughput ratio
avg_hit = sum(latencies[1:]) / len(latencies[1:])
print(f"First request (cache miss):  {latencies[0]:.2f}s")
print(f"Avg of next 99 (cache hit):  {avg_hit:.2f}s")
print(f"Throughput improvement: {latencies[0] / avg_hit:.1f}×")

Benchmark: same system prompt, different user queries
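A mean can hide slow outliers, so for the latency side of the comparison it may be worth reporting percentiles as well. Here is a minimal sketch that summarizes a list like `latencies` from the script above. The helpers `percentile` and `summarize` are hypothetical names, and the nearest-rank method is one choice among several.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies):
    """Split the first (cold) request from the rest and summarize both."""
    warm = latencies[1:]
    return {
        "cold_s": latencies[0],
        "warm_mean_s": sum(warm) / len(warm),
        "warm_p50_s": percentile(warm, 50),
        "warm_p95_s": percentile(warm, 95),
    }
```

Running `summarize(latencies)` once with caching on and once with caching off gives two dictionaries that can be compared field by field.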

Expected Results

Approximate figures for an A100 40GB:

First request (cache miss):  2.10s
Avg of next 99 (cache hit):  0.18s
Throughput improvement:      11.7×

That is roughly a 12× speedup. On a cache hit, vLLM reuses the KV blocks of the shared system prompt and skips the prefill cost for that prefix.
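Because the saving comes from prefill, time to first token (TTFT) isolates the effect better than total latency, which also includes decoding the output tokens. The sketch below measures TTFT with the streaming mode of the OpenAI client. The function name `time_to_first_token` is hypothetical, and the `client` argument is assumed to be an `OpenAI` instance like the one in Step 2.

```python
import time


def time_to_first_token(client, model, system, user, max_tokens=200):
    """Return (ttft_seconds, total_seconds) for one streamed chat request."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,
        stream=True,  # chunks arrive as tokens are generated
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    ttft = None
    for chunk in stream:
        # The first chunk that carries text marks the end of prefill
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
    return ttft, time.perf_counter() - start
```

With caching on, the TTFT of the second and later requests should drop sharply, while the decode portion (total minus TTFT) should stay about the same.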

Provider vs Self-Hosted

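With self-hosting, the cache is "free": the GPU is already running, and caching only makes it work more efficiently. With a provider, the saving is in billed tokens. With self-hosting, the saving is in throughput (up to roughly 12× more requests on the same GPU for a prefill-heavy workload like this one).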

Multi-User Throughput Test

A single-user test only tells part of the story. vLLM's real strength is serving concurrent users.

python

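import asyncio
import time
from openai import AsyncOpenAI

async_client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Same shared prefix as in Step 2
with open("knowledge_base.txt", encoding="utf-8") as f:
    LONG_KB = f.read()

async def query(system, user):
    """Send one chat request and return the full response."""
    return await async_client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        max_tokens=200,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )

async def benchmark(concurrent_users):
    """Fire all requests at once and return the total wall-clock time."""
    # All requests share the same system prompt, with different user queries
    tasks = [
        query(LONG_KB, f"Question {i}")
        for i in range(concurrent_users)
    ]
    start = time.perf_counter()
    await asyncio.gather(*tasks)
    return time.perf_counter() - start

async def main():
    # Test with 1, 10, 50, and 100 concurrent requests
    for n in [1, 10, 50, 100]:
        duration = await benchmark(n)
        print(f"{n:>3} concurrent: {duration:.2f}s, {n/duration:.1f} req/s")

# One event loop for all runs, so the shared client's connections stay valid
asyncio.run(main())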

Concurrent-user benchmark
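Requests per second depends on how long each answer is, so it can be useful to report generated tokens per second as well. The following sketch assumes you keep the list returned by `asyncio.gather(*tasks)` in `benchmark` and pass it in together with the measured duration. The helper `summarize_throughput` is a hypothetical name, and it reads the `usage` field that OpenAI-compatible chat responses carry.

```python
def summarize_throughput(responses, duration_s):
    """Aggregate request and token throughput for one benchmark run."""
    prompt_tokens = sum(r.usage.prompt_tokens for r in responses)
    output_tokens = sum(r.usage.completion_tokens for r in responses)
    return {
        "req_per_s": len(responses) / duration_s,
        "output_tok_per_s": output_tokens / duration_s,
        "prompt_tokens_total": prompt_tokens,
    }
```

For example, changing the gather line to `responses = await asyncio.gather(*tasks)` and returning `summarize_throughput(responses, elapsed)` would report both metrics for each concurrency level.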

Production Takeaways

With vLLM + prefix caching, a single A100 40GB delivers approximately:

Single user: ~12 req/s
50 concurrent: ~80 req/s (thanks to PagedAttention)
100 concurrent: ~95 req/s (approaching the cap)

Compare these numbers with cloud providers:

Anthropic API: rate limit of 4000 RPM (~67 req/s)
vLLM self-hosted A100: ~95 req/s sustained

Investment: an A100 costs about $1.5/hour (RunPod), or roughly $1,080 per month. How many queries could you run on a cloud API for that price? We will calculate the break-even point in Module 11.
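The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, a month is assumed to be 30 days, and the per-request cost assumes the GPU runs at the sustained ~95 req/s around the clock, which is an idealized upper bound rather than a measured result.

```python
SECONDS_PER_MONTH = 3600 * 24 * 30  # assumes a 30-day month


def monthly_cost_usd(hourly_usd):
    """Cost of renting one GPU continuously for a 30-day month."""
    return hourly_usd * 24 * 30


def rpm_to_rps(requests_per_minute):
    """Convert a requests-per-minute rate limit to requests per second."""
    return requests_per_minute / 60


def cost_per_million_requests(monthly_usd, req_per_s):
    """USD per one million requests at full, constant utilization."""
    requests_per_month = req_per_s * SECONDS_PER_MONTH
    return monthly_usd / requests_per_month * 1_000_000


monthly = monthly_cost_usd(1.5)                     # 1080.0
print(f"Monthly GPU cost: ${monthly:,.0f}")
print(f"4000 RPM = {rpm_to_rps(4000):.1f} req/s")
print(f"Cost per 1M requests at 95 req/s: "
      f"${cost_per_million_requests(monthly, 95):.2f}")
```

Real traffic is rarely constant, so the effective cost per request will be higher than this bound whenever the GPU sits idle.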

In the Next Lesson

SGLang and RadixAttention: a modern alternative to vLLM.