# On-Premise LLM Deployment with Ollama and vLLM Training

> Source: https://sukruyusufkaya.com/en/training/ollama-vllm-on-premise-llm-deploy-egitimi
> Updated: 2026-05-18T18:16:16.815Z
> Level: advanced
> Topics: ollama, vllm, on-premise llm, self-hosted ai, pagedattention, continuous batching, tensor parallelism, multi-gpu inference, quantization, gguf awq gptq fp8, kubernetes llm serving, kserve bentoml, production observability, prometheus grafana llm, tgi sglang, tensorrt-llm, kvkk-compliant llm, air-gapped deployment, llm cost optimization, infrastructure engineering

**TLDR:** A 3-day advanced program for DevOps engineers, SREs, and ML Platform engineers who want to deploy open-source LLMs on-premise at enterprise scale, ranging from Ollama and vLLM internals to multi-GPU distributed inference, Kubernetes serving, production observability, and KVKK-compliant air-gapped deployment.

## Description

The On-Premise LLM Deployment with Ollama and vLLM Training is an advanced 3-day program designed for DevOps engineers, Site Reliability Engineers (SREs), ML Platform engineers, infrastructure architects, and cloud architects who want to run open-source large language models at enterprise scale on production-grade infrastructure. The training brings the following together in one program: TCO modeling for on-prem vs API models, hardware selection (NVIDIA H100/H200/B200, AMD MI300X, Intel Gaudi3), Ollama internals (llama.cpp, GGUF, Modelfile), a vLLM architectural deep dive (PagedAttention, continuous batching, prefix caching, speculative decoding), quantization strategies (GGUF/AWQ/GPTQ/FP8/FP4), multi-GPU distributed inference (tensor / pipeline / expert parallelism), a TGI/SGLang/TensorRT-LLM comparison, production observability (Prometheus, Grafana, OpenTelemetry, DCGM), LLM serving on Kubernetes (KServe, BentoML, Helm), auto-scaling and cost optimization, KVKK-compliant air-gapped deployment, and security.

## Learning Outcomes

- Make architectural decisions with on-prem vs API model TCO modeling.
- Perform hardware selection and capacity planning among NVIDIA, AMD, and Intel GPU ecosystems.
- Professionally set up Ollama in developer, edge, and branch office scenarios.
- Tune vLLM's PagedAttention, continuous batching, and prefix caching internals.
- Deploy multi-GPU distributed inference with tensor / pipeline / expert parallelism.
- Apply GGUF, AWQ, GPTQ, FP8, FP4 quantization strategies in the cost-quality-throughput balance.
- Build a reproducible LLM serving platform on Kubernetes (KServe, BentoML, Helm).
- Provide production-grade observability with Prometheus, Grafana, OpenTelemetry, DCGM.
- Perform KVKK-compliant security hardening with TLS/mTLS, Vault, audit logging, and air-gapped deployment.

<p>This training is designed for DevOps engineers, SREs, ML Platform engineers, infrastructure architects, and cloud architects who want to run open-source large language models at enterprise scale on production-grade infrastructure. At the heart of the program is the following approach: on-premise LLM deployment is not simply 'install Ollama on a server and open a port.' Real operational value comes from selecting the right hardware via VRAM math and throughput projection, understanding and tuning vLLM's PagedAttention and continuous-batching internals, building multi-GPU deployment with tensor / pipeline / expert parallelism strategies, ensuring production observability with Prometheus + Grafana + OpenTelemetry + DCGM, performing reproducible deployment on Kubernetes with KServe / BentoML / Helm, setting up auto-scaling with GPU-aware HPA and KEDA, cost-optimizing with spot instances and hybrid model routing, hardening security with TLS/mTLS / Vault / audit logging, and operating the entire system in a KVKK-compliant air-gapped topology.</p>

<p>Content on open-source LLM setup in Turkey has expanded rapidly over the past two years; however, the vast majority of it remains at the 'install Ollama on macOS, pull a model, ask a question' level. This training is designed to be the only comprehensive Turkish-language reference that goes beyond this surface level and covers inference engine internals, multi-GPU distributed serving, Kubernetes platform engineering, production observability, and KVKK air-gapped compliance within the same program. The target audience is not ML/data engineers; it is DevOps engineers, SREs, ML Platform engineers, and infrastructure architects who operate production infrastructure. The training focuses not on Python ML engineering but on platform engineering and operational discipline.</p>

<p>A strategic dimension of the program is clarifying in which scenarios on-premise LLM deployment is genuinely required. Under KVKK 'cross-border transfer' rules, BDDK (banking), EPDK (energy), SGK (healthcare) sector regulations, and the EU AI Act, self-hosted serving is a mandatory architectural decision for many enterprise customers. At the same time, in situations like high token volumes (100M+/month), low tail-latency needs, or operationalizing domain-specific fine-tuned models, self-hosted overtakes API models economically. This training addresses TCO modeling for on-prem vs API models in detail along with the break-even point; a hybrid (hot path API, cold path on-prem) strategy is also shown.</p>
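<p>The break-even idea can be sketched with a few lines of arithmetic. This is a minimal illustration, not the training's TCO model: it assumes on-prem cost is a fixed monthly amount (amortized hardware plus operating cost), API cost is linear in tokens, and every number below is a hypothetical placeholder.</p>

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """API spend grows linearly with token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def monthly_onprem_cost(hardware_usd: float, amortization_months: int,
                        monthly_opex_usd: float) -> float:
    """Fixed monthly cost: amortized hardware plus power, hosting, and staff."""
    return hardware_usd / amortization_months + monthly_opex_usd


def break_even_tokens_per_month(hardware_usd: float, amortization_months: int,
                                monthly_opex_usd: float,
                                usd_per_million_tokens: float) -> float:
    """Token volume at which API spend equals the fixed on-prem cost."""
    fixed = monthly_onprem_cost(hardware_usd, amortization_months, monthly_opex_usd)
    return fixed / usd_per_million_tokens * 1_000_000


# Hypothetical inputs: 36,000 USD of hardware over 36 months, 500 USD/month opex,
# and a blended API price of 15 USD per million tokens.
tokens = break_even_tokens_per_month(36_000, 36, 500, 15.0)
print(f"break-even: {tokens:,.0f} tokens/month")
```

<p>Above the break-even volume the fixed on-prem cost is spread over more tokens, so the per-token cost keeps falling; the sketch ignores capacity limits, which a real model would need to add.</p>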

<p>The hardware module forms the infrastructure backbone of the training. NVIDIA H100, H200, B100, B200, GB200 specs and performance; the AMD MI300X, MI325X, MI350 ecosystem; Intel Gaudi3 and other alternative accelerators; and prototyping scenarios with RTX 4090/5090 are addressed comparatively. VRAM math (model parameters × bytes-per-param + KV cache); the impact of batch size, sequence length, and context window; throughput projection (tokens/sec, requests/sec, p99 latency) are taught end to end. In multi-GPU topologies, PCIe vs NVLink vs NVSwitch bandwidth differences; multi-node InfiniBand and RDMA requirements; DGX, HGX reference architectures, and custom build options are addressed in detail. This module provides a directly applicable decision matrix for teams that will invest in hardware or purchase cloud GPUs.</p>
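<p>The VRAM math mentioned above (model parameters × bytes-per-param + KV cache) can be written out as a short calculation. This is a sketch under stated assumptions: the layer count, KV-head count, and head dimension are illustrative values for a 70B-class model with grouped-query attention, and activation memory and runtime overhead are ignored.</p>

```python
GIB = 1024 ** 3


def weight_bytes(n_params: int, bytes_per_param: float) -> float:
    """Memory for the model weights alone."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V tensor per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed 70B-class configuration (illustrative): 80 layers, 8 KV heads, head_dim 128,
# FP16 weights and FP16 KV cache, 16 concurrent sequences of 8192 tokens each.
weights = weight_bytes(70_000_000_000, 2)
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192, batch_size=16)

print(f"weights : {weights / GIB:.1f} GiB")
print(f"KV cache: {kv / GIB:.1f} GiB")
print(f"total   : {(weights + kv) / GIB:.1f} GiB (before activations and overhead)")
```

<p>The KV cache term scales linearly with both batch size and sequence length, which is why context window and concurrency targets drive GPU count as much as parameter count does.</p>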

<p>The Ollama module goes deep into developer / edge / branch office scenarios. Ollama's llama.cpp-based backend architecture, GGML/GGUF format mechanics, model registry flow, and OpenAI-compatible API layer are covered at the internals level. Customization techniques such as Modelfile directives (FROM, PARAMETER, TEMPLATE, SYSTEM), custom model creation, and LoRA adapter merging are practiced hands-on. On the production Ollama side, OLLAMA_HOST, OLLAMA_KEEP_ALIVE, and OLLAMA_NUM_PARALLEL configuration; GPU passthrough; docker/podman integration; and branch office / edge node deployment patterns are covered comprehensively.</p>
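<p>To make the Modelfile directives concrete, the following sketch renders a small Modelfile from Python. The helper name <code>render_modelfile</code>, the base model tag, and the parameter values are assumptions chosen for illustration; only the FROM, PARAMETER, and SYSTEM directives themselves come from the text above.</p>

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render a minimal Ollama Modelfile (hypothetical helper, not part of Ollama)."""
    lines = [f"FROM {base}"]
    for key, value in parameters.items():
        # One PARAMETER line per runtime setting, e.g. temperature or num_ctx.
        lines.append(f"PARAMETER {key} {value}")
    # Triple quotes allow a multi-line system prompt.
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


# Illustrative values only; pick the base model and limits for your own hardware.
modelfile = render_modelfile(
    base="llama3.1:8b",
    system="You are a concise assistant for internal branch-office staff.",
    parameters={"temperature": 0.2, "num_ctx": 8192},
)
print(modelfile)
```

<p>The rendered text can be saved as <code>Modelfile</code> and built with <code>ollama create &lt;name&gt; -f Modelfile</code>; generating it from code keeps per-site variants reproducible.</p>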

<p>The vLLM module is the peak of technical depth in the training. The two core innovations behind vLLM's production-grade serving paradigm, PagedAttention (which addresses KV cache fragmentation) and continuous batching (which optimizes throughput), are covered at the internals level. Topics include PagedAttention's page-based KV cache management and its OS virtual-memory analogy; memory-utilization metrics; static batching vs continuous batching throughput analysis; request scheduler and preemption mechanics; and max_num_batched_tokens and max_num_seqs tuning. As advanced optimizations, prefix caching (shared system-prompt advantage), speculative decoding (use of draft models), chunked prefill (long-context handling), and guided decoding (Outlines, lm-format-enforcer integration) are shown hands-on.</p>
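<p>The page-based KV cache idea can be illustrated with a toy allocator. This is a conceptual sketch only, not vLLM's implementation: the class name, block size, and error handling are assumptions, and no tensors are stored. It shows how a per-sequence block table maps logical token positions to fixed-size physical blocks that are allocated on demand and returned to a shared pool.</p>

```python
class PagedKVCache:
    """Toy model of paged KV cache bookkeeping (illustrative, not vLLM code)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # shared physical block pool
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids
        self.seq_lens: dict[str, int] = {}            # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            # Last block is full (or none exists yet): take one block from the pool.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):            # 17 tokens need two 16-token blocks
    cache.append_token("req-a")
print(cache.block_tables["req-a"], "free:", len(cache.free_blocks))
cache.free("req-a")
print("free after release:", len(cache.free_blocks))
```

<p>Because memory is reserved one block at a time rather than for the maximum sequence length up front, waste is bounded by at most one partially filled block per sequence.</p>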

<p>The quantization-strategies module is critical for maximum performance under hardware constraints. GGUF (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0); AWQ vs GPTQ (both weight-only methods; AWQ is activation-aware, while GPTQ relies on approximate second-order information); FP8 native (H100 and later); FP4 (Blackwell B100+); and EXL2 and ExLlamaV2 are compared in detail. Quantization-induced quality regression measurement (perplexity, MMLU); domain-specific quality-loss analysis; throughput vs latency optimization; tail-latency (p99, p999) management; and benchmarking tools like vLLM benchmark_serving and GenAI-Perf are covered hands-on.</p>
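<p>Quality regression from quantization is often summarized as a perplexity change on a fixed evaluation text. The sketch below assumes per-token natural-log probabilities have already been collected from the baseline and quantized models; the helper names and the sample numbers are hypothetical.</p>

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_regression(baseline_ppl: float, quantized_ppl: float) -> float:
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# Hypothetical log-probs for the same four tokens under two model variants.
fp16_logprobs = [-1.20, -0.80, -1.50, -1.00]
q4_logprobs = [-1.30, -0.85, -1.60, -1.05]

base = perplexity(fp16_logprobs)
quant = perplexity(q4_logprobs)
print(f"baseline ppl={base:.3f} quantized ppl={quant:.3f} "
      f"regression={relative_regression(base, quant):.1%}")
```

<p>Perplexity on generic text can hide domain-specific losses, so the same comparison is worth repeating on in-domain evaluation data before committing to a quantization level.</p>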

<p>The multi-GPU and distributed inference module covers serving large models (70B, 405B, 671B) that do not fit on a single GPU. Tensor parallelism (matmul splitting), pipeline parallelism (layer-wise split), expert parallelism (MoE-specific routing), and hybrid 3D parallelism techniques are covered. On the vLLM side, tensor_parallel_size and pipeline_parallel_size tuning, NCCL / NVLink / InfiniBand backend configuration, and multi-node orchestration with a Ray cluster are practiced hands-on. Additionally, alternative engines like TGI (Hugging Face Rust backend), SGLang (RadixAttention, JSON schema-constrained generation), NVIDIA TensorRT-LLM (peak performance), Triton Inference Server, llama.cpp server, lmdeploy, and MLX Server are compared, and a use-case-based engine-selection matrix is presented.</p>
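<p>The "matmul splitting" behind tensor parallelism can be demonstrated without any GPU. The sketch below uses plain Python lists and assumes a column-parallel layout: each hypothetical device holds a slice of the weight matrix's columns, computes its partial output independently, and the slices are concatenated (the role an all-gather plays on real hardware). Function names are illustrative.</p>

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Reference dense matmul on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: list[list[float]], parts: int) -> list[list[list[float]]]:
    """Split a weight matrix into equal column shards, one per device."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = n_cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, tp_size: int):
    """Each shard computes x @ w_shard; results are concatenated along columns."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp_size)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, tp_size=2))  # same result as matmul(x, w)
```

<p>Every shard needs the full input <code>x</code> but only a fraction of the weights, which is what lets a model too large for one GPU's memory be spread across several at the cost of inter-GPU communication.</p>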

<p>The production-observability module represents the operational-discipline dimension of the training. Scraping vLLM's /metrics endpoint with Prometheus; vLLM Prometheus metrics like TTFT (time-to-first-token), TPOT (time-per-output-token), throughput, queue depth; Grafana dashboard templates; GPU monitoring with NVIDIA DCGM; structured logging (request_id, prompt hash, completion length); OpenTelemetry trace propagation; and Loki / Elastic / Datadog log aggregation are addressed in detail. On the alerting side, GPU OOM, queue saturation, p99-latency spike alerting; runbook design; and postmortem discipline are addressed hands-on.</p>
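<p>The latency metrics named above have simple definitions that are worth pinning down before building dashboards. The sketch below computes TTFT and TPOT from a request's token arrival timestamps and a nearest-rank percentile over many requests. The function names and sample timestamps are assumptions for illustration; in production these values would normally come from the serving engine's own metrics rather than hand-rolled timers.</p>

```python
import math


def ttft(request_time: float, token_times: list[float]) -> float:
    """Time to first token: first token arrival minus request submission."""
    return token_times[0] - request_time


def tpot(token_times: list[float]) -> float:
    """Time per output token: mean gap between tokens after the first one."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile_nearest_rank(values: list[float], p: int) -> float:
    """Nearest-rank percentile, e.g. p=99 for p99 latency."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical timestamps in seconds for one streamed response.
submitted = 0.0
arrivals = [0.5, 0.75, 1.0, 1.25]
print("TTFT:", ttft(submitted, arrivals), "s")
print("TPOT:", tpot(arrivals), "s/token")
print("p99 :", percentile_nearest_rank([float(i) for i in range(1, 101)], 99))
```

<p>TTFT is dominated by queueing and prefill, while TPOT reflects decode speed under the current batch, so alerting on them separately makes it easier to tell queue saturation from GPU slowdown.</p>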

<p>The Kubernetes module takes on-premise LLM serving to the level of enterprise platform engineering. NVIDIA GPU Operator and nvidia-device-plugin configuration; GPU node taints, tolerations, and scheduling; fractional GPUs via MIG (Multi-Instance GPU); KServe / Knative serverless LLM serving; model packaging with BentoML; a comparison of ModelMesh and Ray Serve; reproducible vLLM deployment with Helm charts; and GitOps-based delivery with Argo CD are covered end to end. On the auto-scaling side, HPA custom-metrics-based scaling, KEDA event-driven autoscaling, cold-start optimization, and warm-pool strategy are covered. For cost optimization, the module covers the mix of spot instance / preemptible VM / reserved capacity; model routing (a hybrid of local models and cloud API models such as Haiku/Sonnet); unit economics (cost-per-token, cost-per-user) measurement; and multi-tenant inference (namespace isolation, fair scheduling, per-tenant rate limiting).</p>
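<p>Custom-metrics scaling with the HPA follows a simple proportional rule: the desired replica count is the current count scaled by the ratio of observed to target metric, rounded up. The sketch below applies that rule to a hypothetical per-replica queue-depth metric; the metric choice, the replica bounds, and the function name are assumptions, and the controller's tolerance band and stabilization windows are left out.</p>

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Proportional scaling rule: ceil(current * observed / target), then clamp."""
    if target_metric <= 0:
        raise ValueError("target metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical scenario: 2 replicas, target of 10 waiting requests per replica.
for observed in (2, 10, 30, 100):
    print(f"queue depth {observed:>3} per replica ->",
          desired_replicas(2, observed, 10), "replicas")
```

<p>For GPU-backed pods the upper bound matters as much as the formula, because each extra replica needs a free GPU and a model load before it can serve traffic, which is where warm pools and cold-start optimization come in.</p>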

<p>The security module covers the training's compliance and governance discipline. TLS / mTLS endpoint encryption; internal API gateway and service mesh (Istio, Linkerd); network policy and micro-segmentation; HashiCorp Vault and External Secrets Operator integration; audit logging for who submitted which prompt; PII-masking and secret-scanning hooks; model distribution in environments without an internet connection; local container registry and mirror ecosystems; compliance documentation and audit-readiness are addressed hands-on. KVKK 'cross-border transfer' rules, BDDK / EPDK / SGK sector regulations, and air-gapped deployment scenarios under the EU AI Act framework are addressed in detail.</p>
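<p>Audit logging for "who submitted which prompt" does not have to store the prompt text itself. The following sketch, with hypothetical field names and a hypothetical <code>audit_record</code> helper, emits one structured JSON line per request that carries a SHA-256 hash of the prompt so identical prompts can be correlated without keeping potentially personal data in the log.</p>

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(user_id: str, request_id: str, prompt: str,
                 completion_tokens: int, now: Optional[datetime] = None) -> str:
    """Build one JSON audit line; the raw prompt is hashed, never stored."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "ts": timestamp,
        "user_id": user_id,
        "request_id": request_id,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
        "completion_tokens": completion_tokens,
    }
    return json.dumps(record, sort_keys=True)


# Illustrative values; in practice user_id would come from the authenticated
# identity at the API gateway, not from the request body.
print(audit_record("user-42", "req-0001", "Summarize the attached contract.", 128))
```

<p>A plain hash of a short or guessable prompt can still be matched by brute force, so stricter environments may prefer a keyed hash (HMAC) with a secret held in the secrets manager.</p>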

<p>In the capstone project, each participant designs an end-to-end production-grade on-premise LLM serving platform for their own company: hardware, engine, and quantization choices; Kubernetes deployment, observability, and auto-scaling plan; KVKK-compliant air-gapped topology; cost projection and performance baseline; ops runbook and incident-response procedures. By the end of the training, participants can manage on-premise LLM serving in an integrated way across its architectural, operational, and compliance dimensions. They master Ollama and vLLM internals, deploy multi-GPU distributed inference, establish production observability and auto-scaling, build a reproducible LLM serving platform on Kubernetes, measure unit economics for cost optimization, and meet the requirements of regulated sectors with KVKK-compliant air-gapped deployment. The training consists of 3 days, 12 modules, and over 80 hands-on lessons.</p>