Frequently Asked Questions

Can I attend this training without any vLLM experience?

Yes. Modules 1-3 (strategy, hardware, Ollama) lay the foundation for participants with no vLLM experience. Module 4 starts vLLM from scratch and works up to the internals of PagedAttention and continuous batching. You are expected to have working experience with Linux, Docker, and Kubernetes. Python ML engineering is not required; this is a DevOps/SRE-focused training.

Which GPUs will we do hands-on with? Is expensive equipment required?

Most exercises can be done on cloud GPUs (RunPod, Lambda Labs, vast.ai). An RTX 4090 (~1 USD/hour) is sufficient for small-model exercises; A100 / H100 instances (~2-4 USD/hour) are used for multi-GPU and large-model exercises. Total cost per participant is in the 20-40 USD range. For enterprise classroom training, equipment is recommended within the scope of the provider's budget.

Is Ollama used in production, or only for prototypes?

Ollama is also used in production, but in specific scenarios. It is ideal for internal tooling with low QPS, branch office / edge deployment, and developer-shared LLM endpoints. For high QPS, multi-tenant serving, or large models (70B+), vLLM is preferred. Module 3 covers the pros and cons of running Ollama in production and the scenarios where it fits.

For multi-GPU deployment, tensor parallelism or pipeline parallelism?

Tensor parallelism (TP): ideal within a single node, for GPUs connected via NVLink, because it needs a low-latency, high-bandwidth interconnect. Pipeline parallelism (PP): for multi-node deployment, when bandwidth between GPUs is low; throughput-focused. Expert parallelism (EP): applies only to MoE models (DeepSeek V3, Mixtral), where experts are distributed across GPUs. Module 6 compares all three and presents a TP/PP decision matrix based on real-world scenarios.
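
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the `ParallelPlan` class and `plan_parallelism` function are hypothetical names, and the mapping (TP across the GPUs inside a node, PP across nodes) is a starting heuristic, not a replacement for the decision matrix in Module 6.

```python
from dataclasses import dataclass


@dataclass
class ParallelPlan:
    tensor_parallel_size: int    # GPUs that split each layer's matmuls
    pipeline_parallel_size: int  # stages that split the layer stack


def plan_parallelism(gpus_per_node: int, num_nodes: int) -> ParallelPlan:
    """Heuristic: keep TP inside one node (fast interconnect), use PP across nodes."""
    if gpus_per_node < 1 or num_nodes < 1:
        raise ValueError("gpus_per_node and num_nodes must be >= 1")
    return ParallelPlan(tensor_parallel_size=gpus_per_node,
                        pipeline_parallel_size=num_nodes)


# Hypothetical cluster: 2 nodes with 8 GPUs each.
plan = plan_parallelism(gpus_per_node=8, num_nodes=2)
print(plan)
```

The product of the two sizes equals the total GPU count, which is a quick sanity check for any plan.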

Which quantization level should be preferred in production?

It depends on the use case. For maximum quality: FP16 (no quantization at all) or Q8. For cost-quality balance: Q5_K_M or AWQ 4-bit. For maximum throughput: Q4_K_M or AWQ 4-bit. On H100 and later hardware, native FP8 is ideal; on Blackwell (B100 and later), FP4 is the new standard. Module 5 presents quality-loss analysis based on perplexity and MMLU, and a cost-quality-throughput trade-off matrix.
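
To make the trade-off concrete, the sketch below estimates the weight footprint of a model at several precision levels. The bits-per-weight values are approximations chosen for illustration (GGUF K-quants carry per-block scales, so their effective size is above the nominal bit width); `APPROX_BITS_PER_WEIGHT` and `weight_gb` are hypothetical names, and the result excludes KV cache and runtime overhead.

```python
# Approximate effective bits per weight; illustrative assumptions, not exact figures.
APPROX_BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "FP4": 4.0,
}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Weight memory in GB (10^9 bytes) for a model with the given parameter count."""
    bits = APPROX_BITS_PER_WEIGHT[fmt]
    return params_billion * bits / 8


for fmt in APPROX_BITS_PER_WEIGHT:
    print(f"70B @ {fmt:7s}: ~{weight_gb(70, fmt):.1f} GB")
```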

For air-gapped deployment, how do we update Docker images?

In an air-gapped environment, a local container registry (Harbor, Nexus, Quay) and a mirror ecosystem are used. A 'bridge' node with internet access pulls images, verifies signatures, and pushes them to the internal registry. Production nodes pull only from the internal registry. Module 11 covers this process hands-on along with compliance documentation, including procedures ready for BDDK and SGK audits.
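
One small piece of the bridge-node flow can be sketched as follows: checking that an exported image archive matches an expected SHA-256 digest before it is pushed inward. This is a plain integrity check, not signature verification (which would normally use a dedicated signing tool); the function names and the idea of a pre-recorded digest are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large image archives do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected_sha256: str) -> bool:
    """Return True only if the archive's digest matches the recorded one."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

A bridge script would call `verify_archive` on each exported archive and refuse to push any file that returns `False`.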

How do I choose between KServe and BentoML?

KServe (formerly KFServing): Kubernetes-native, Knative-based serverless, scale-to-zero. Declarative via CRDs (InferenceService); integrated with the Kubeflow ecosystem. BentoML: more flexible, Python-first, model-packaging focused; UI via the Yatai dashboard; also enables deployment outside Kubernetes (Docker, VM). Module 9 shows both hands-on and presents a decision matrix.

Is spot-instance LLM serving safe for cost optimization?

For stateless workloads, yes — it reduces costs by 50–70%. However: (1) you may lose requests if a spot interruption occurs, (2) cold start takes 30–90 seconds for LLMs (model loading), (3) in multi-GPU scenarios, if a single GPU spot is reclaimed, it affects the entire cluster. Module 10 shows a spot + on-demand hybrid strategy: baseline reserved/on-demand, burst spot. Cold-start warm-pool and model-preloading optimization are also addressed.
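
The hybrid strategy can be illustrated with simple arithmetic. In this sketch the GPU counts, the hourly price, and the 60% spot discount are hypothetical inputs (the discount sits inside the 50–70% range mentioned above), and `blended_hourly_cost` is an illustrative helper name.

```python
def blended_hourly_cost(baseline_gpus: int, burst_gpus: int,
                        on_demand_price: float, spot_discount: float) -> float:
    """Hourly cost with baseline GPUs on-demand and burst GPUs on spot."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    spot_price = on_demand_price * (1.0 - spot_discount)
    return baseline_gpus * on_demand_price + burst_gpus * spot_price


# Hypothetical fleet: 4 baseline + 4 burst GPUs at 3 USD/hour on-demand.
hybrid = blended_hourly_cost(4, 4, on_demand_price=3.0, spot_discount=0.6)
all_on_demand = blended_hourly_cost(8, 0, on_demand_price=3.0, spot_discount=0.6)
print(f"hybrid: {hybrid:.2f} USD/h, all on-demand: {all_on_demand:.2f} USD/h")
```

Only the burst portion benefits from the discount, so the fleet-wide saving is smaller than the headline spot discount.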

How long does it take to set up an air-gapped deployment for a bank or hospital?

A typical BDDK-compliant bank air-gapped deployment takes 6–12 weeks: weeks 1-2 hardware procurement and network design, weeks 3-4 Kubernetes cluster + GPU Operator setup, weeks 5-6 vLLM + observability stack, weeks 7-8 security hardening (TLS, Vault, audit), weeks 9-10 KVKK compliance documentation and audit-readiness, weeks 11-12 production cutover and runbook handover. Modules 11 and 12 include the step-by-step plan of this process.

What concrete outputs will I leave the training with?

As a capstone, the following concrete artifacts are produced: (1) an on-prem LLM platform architecture diagram tailored to your organization, (2) a hardware capacity plan and cost projection (3-year TCO), (3) Helm chart and Kubernetes manifest templates (vLLM, observability stack), (4) Prometheus + Grafana dashboard JSONs and alerting rules, (5) an auto-scaling and cost-optimization strategy, (6) a KVKK-compliant air-gapped deployment runbook, (7) a security checklist and incident-response procedures, (8) a 90-day operational roadmap.

Can the training be customized for our enterprise team?

Yes. Beyond the standard 3-day program, we offer customized private-classroom versions for enterprise clients. Module weights are tailored based on your existing Kubernetes / cloud / GPU stack, your target LLM family (DeepSeek, Llama, Qwen), your QPS and latency goals, your sector regulation (BDDK, EPDK, SGK), your compliance requirements (KVKK, GDPR), and your hardware constraints. A company-specific architecture diagram and capacity plan can be included.

About this training

A 3-day advanced program for DevOps engineers, SREs, and ML Platform engineers who want to deploy open-source LLMs on-premise at enterprise scale, ranging from Ollama and vLLM internals to multi-GPU distributed inference, Kubernetes serving, production observability, and KVKK-compliant air-gapped deployment.

Advanced Level · 3 Days

On-Premise LLM Deployment with Ollama and vLLM Training

About This Course

This training is designed for DevOps engineers, SREs, ML Platform engineers, infrastructure architects, and cloud architects who want to run open-source large language models at enterprise scale on production-grade infrastructure. At the heart of the program is the following approach: on-premise LLM deployment is not simply 'install Ollama on a server and open a port.' Real operational value comes from selecting the right hardware via VRAM math and throughput projection, understanding and tuning vLLM's PagedAttention and continuous-batching internals, building multi-GPU deployment with tensor / pipeline / expert parallelism strategies, ensuring production observability with Prometheus + Grafana + OpenTelemetry + DCGM, performing reproducible deployment on Kubernetes with KServe / BentoML / Helm, setting up auto-scaling with GPU-aware HPA and KEDA, cost-optimizing with spot instances and hybrid model routing, hardening security with TLS/mTLS / Vault / audit logging, and operating the entire system in a KVKK-compliant air-gapped topology.

Content related to open-source LLM setup in Turkey has expanded rapidly over the past two years; however, the vast majority of this content remains at the 'install Ollama on macOS, pull a model, ask-answer' level. This training is designed to be the only comprehensive Turkish-language reference that completely transcends this surface level and addresses topics like inference engine internals + multi-GPU distributed serving + Kubernetes platform engineering + production observability + KVKK air-gapped compliance within the same program. The target audience is not ML/data engineers; it is DevOps engineers, SREs, ML Platform engineers, and infrastructure architects who operate production infrastructure. The training focuses not on Python ML engineering but on platform engineering and operational discipline.

A strategic dimension of the program is clarifying in which scenarios on-premise LLM deployment is genuinely required. Under KVKK 'cross-border transfer' rules, BDDK (banking), EPDK (energy), SGK (healthcare) sector regulations, and the EU AI Act, self-hosted serving is a mandatory architectural decision for many enterprise customers. At the same time, in situations like high token volumes (100M+/month), low tail-latency needs, or operationalizing domain-specific fine-tuned models, self-hosted overtakes API models economically. This training addresses TCO modeling for on-prem vs API models in detail along with the break-even point; a hybrid (hot path API, cold path on-prem) strategy is also shown.
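
A break-even point of this kind can be sketched as a one-line calculation. All figures below are hypothetical placeholders, and `break_even_tokens_per_month` is an illustrative name; a real TCO model would also include staffing, power, and hardware depreciation inside the fixed cost.

```python
import math


def break_even_tokens_per_month(fixed_monthly_cost_usd: float,
                                api_price_per_million_usd: float,
                                onprem_marginal_per_million_usd: float = 0.0) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    saving_per_million = api_price_per_million_usd - onprem_marginal_per_million_usd
    if saving_per_million <= 0:
        return math.inf  # the API is never more expensive per token
    return fixed_monthly_cost_usd / saving_per_million * 1_000_000


# Hypothetical: 6,000 USD/month fixed cost, API at 10 USD per 1M tokens,
# on-prem marginal cost of 2 USD per 1M tokens.
tokens = break_even_tokens_per_month(6000, 10, 2)
print(f"break-even: {tokens / 1e6:.0f}M tokens/month")
```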

The hardware module forms the infrastructure backbone of the training. NVIDIA H100, H200, B100, B200, GB200 specs and performance; the AMD MI300X, MI325X, MI350 ecosystem; Intel Gaudi3 and other alternative accelerators; and prototyping scenarios with RTX 4090/5090 are addressed comparatively. VRAM math (model parameters × bytes-per-param + KV cache); the impact of batch size, sequence length, and context window; throughput projection (tokens/sec, requests/sec, p99 latency) are taught end to end. In multi-GPU topologies, PCIe vs NVLink vs NVSwitch bandwidth differences; multi-node InfiniBand and RDMA requirements; DGX, HGX reference architectures, and custom build options are addressed in detail. This module provides a directly applicable decision matrix for teams that will invest in hardware or purchase cloud GPUs.
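
The VRAM math mentioned above (weights plus KV cache) can be sketched as follows. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 8B parameters in FP16) is an assumed configuration for illustration, and the estimate ignores activations and framework overhead.

```python
def weights_bytes(num_params: float, bytes_per_param: float) -> float:
    """Memory for model weights: parameters x bytes per parameter."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per token, per sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed shape: 8B parameters, FP16 weights, 8 concurrent sequences of 8192 tokens.
w = weights_bytes(8e9, 2)
kv = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                    seq_len=8192, batch_size=8)
print(f"weights: {w / 1e9:.1f} GB, KV cache: {kv / 2**30:.1f} GiB")
```

In this example the KV cache is roughly half the size of the weights, which is why batch size and context length matter as much as model size when choosing a GPU.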

The Ollama module deepens into developer / edge / branch office scenarios. Ollama's llama.cpp-based backend architecture, ggml/gguf format mechanics, model registry flow, and OpenAI-compatible API layer are addressed at the internals level. Customization techniques like Modelfile directives (FROM, PARAMETER, TEMPLATE, SYSTEM), custom model production, and LoRA adapter merging are covered hands-on. On the production Ollama side, OLLAMA_HOST, OLLAMA_KEEP_ALIVE, OLLAMA_NUM_PARALLEL configuration; GPU passthrough; docker/podman integration; and branch office / edge node deployment patterns are addressed comprehensively.
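
As a small illustration of the Modelfile directives named above, the following sketch renders a Modelfile from Python. The helper name `render_modelfile`, the base model tag, and the parameter values are assumptions chosen for the example; only the FROM, PARAMETER, and SYSTEM directives themselves come from the text.

```python
def render_modelfile(base_model: str, system_prompt: str,
                     parameters: dict[str, object]) -> str:
    """Build Modelfile text from a base model, parameters, and a system prompt."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


# Hypothetical internal assistant built on an example base tag.
text = render_modelfile(
    base_model="llama3.1:8b",
    system_prompt="You are an internal operations assistant.",
    parameters={"temperature": 0.2, "num_ctx": 8192},
)
print(text)
```

Generating the file from code keeps per-branch or per-team variants reproducible and easy to review in version control.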

The vLLM module is the peak of technical depth in the training. The two core innovations forming vLLM's production-grade serving paradigm — PagedAttention (KV cache fragmentation solution) and continuous batching (throughput optimization) — are addressed at the internals level. PagedAttention's page-based KV cache management and OS analogy; memory-utilization metrics; static batching vs continuous batching throughput analysis; request scheduler and preemption mechanics; max_num_batched_tokens and max_num_seqs tuning are addressed in detail. As advanced optimizations, prefix caching (shared system-prompt advantage), speculative decoding (use of draft models), chunked prefill (long-context handling), and guided decoding (Outlines, lm-format-enforcer integration) are shown hands-on.
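
The page-based idea behind PagedAttention can be illustrated with a toy block allocator. This is a simplified sketch of the concept, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size are all hypothetical, and real engines also handle copy-on-write, prefix sharing, and preemption.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists yet).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a scheduler would preempt here")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):  # 17 tokens need two 16-token blocks
    cache.append_token("req-1")
print(cache.block_tables["req-1"], len(cache.free_blocks))
```

Because blocks are allocated on demand, at most one partially filled block is wasted per sequence, which is the fragmentation benefit the OS paging analogy refers to.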

The quantization-strategies module is critical for maximum performance under hardware constraints. GGUF (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0); AWQ vs GPTQ (both weight-only methods; AWQ is activation-aware); FP8 native (H100 and later); FP4 (Blackwell B100+); and EXL2 (the ExLlamaV2 format) are compared in detail. Quantization-induced quality regression measurement (perplexity, MMLU); domain-specific quality-loss analysis; throughput vs latency optimization; tail-latency (p99, p999) management; and benchmarking tools like vLLM benchmark_serving and GenAI-Perf are addressed hands-on.
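
Perplexity-based regression measurement reduces to a short formula: perplexity is the exponential of the negative mean token log-probability. The sketch below assumes the per-token log-probabilities have already been collected from the baseline and the quantized model on the same evaluation text; the function names and sample values are illustrative.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp(-mean log-probability) over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_regression(baseline_ppl: float, quantized_ppl: float) -> float:
    """Fractional perplexity increase caused by quantization."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# Hypothetical log-probabilities for the same four tokens under both models.
base = perplexity([-1.0, -1.2, -0.8, -1.0])
quant = perplexity([-1.1, -1.3, -0.9, -1.1])
print(f"regression: {relative_regression(base, quant):.1%}")
```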

The multi-GPU and distributed inference module covers serving large models (70B, 405B, 671B) that do not fit on a single GPU. Tensor parallelism (matmul splitting), pipeline parallelism (layer-wise split), expert parallelism (MoE-specific routing), and hybrid 3D parallelism techniques are addressed. On the vLLM side, tensor_parallel_size and pipeline_parallel_size tuning, NCCL / NVLink / InfiniBand backend configuration, and multi-node orchestration with a Ray cluster are addressed hands-on. Additionally, alternative engines like TGI (Hugging Face Rust backend), SGLang (RadixAttention, JSON schema-constrained generation), NVIDIA TensorRT-LLM (peak performance), Triton Inference Server, llama.cpp server, lmdeploy, and MLX Server are addressed comparatively, and a use-case-based engine-selection matrix is presented.
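
The "matmul splitting" behind tensor parallelism can be shown with plain Python lists: the weight matrix is split by columns, each shard is multiplied independently (one shard per GPU in a real system), and the partial outputs are concatenated. This is a conceptual sketch with hypothetical helper names; real engines run the shards on separate devices and gather results over NCCL.

```python
Matrix = list[list[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_columns(w: Matrix, parts: int) -> list[Matrix]:
    """Split a weight matrix into equal column shards, one per 'GPU'."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    partials = [matmul(x, shard) for shard in split_columns(w, tp_size)]
    # Concatenate the partial outputs row by row (the all-gather step).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, tp_size=2) == matmul(x, w))
```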

The production-observability module represents the operational-discipline dimension of the training. Scraping vLLM's /metrics endpoint with Prometheus; vLLM Prometheus metrics like TTFT (time-to-first-token), TPOT (time-per-output-token), throughput, queue depth; Grafana dashboard templates; GPU monitoring with NVIDIA DCGM; structured logging (request_id, prompt hash, completion length); OpenTelemetry trace propagation; and Loki / Elastic / Datadog log aggregation are addressed in detail. On the alerting side, GPU OOM, queue saturation, p99-latency spike alerting; runbook design; and postmortem discipline are addressed hands-on.
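
The two latency metrics named above can be derived from per-token timestamps, as in this sketch. The function names and the millisecond timestamps are illustrative assumptions; in practice these values come from the serving engine's exported metrics rather than hand-collected timings.

```python
def ttft_ms(request_sent_ms: float, token_times_ms: list[float]) -> float:
    """Time to first token: first token arrival minus request send time."""
    return token_times_ms[0] - request_sent_ms


def tpot_ms(token_times_ms: list[float]) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times_ms) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    return (token_times_ms[-1] - token_times_ms[0]) / (len(token_times_ms) - 1)


# Hypothetical request sent at t=0 ms; tokens arrive at these timestamps.
arrivals = [500, 540, 580, 620]
print(ttft_ms(0, arrivals), tpot_ms(arrivals))
```

TTFT is dominated by queueing and prefill, while TPOT reflects decode speed, so the two usually need separate alert thresholds.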

The Kubernetes module takes on-premise LLM serving to the level of enterprise platform engineering. NVIDIA GPU Operator and nvidia-device-plugin configuration; GPU node taints, tolerations, scheduling; fractional GPUs via MIG (Multi-Instance GPU); KServe / Knative serverless LLM serving; model packaging with BentoML; comparison of ModelMesh and Ray Serve; reproducible vLLM deployment with Helm charts; and GitOps-based delivery with Argo CD are addressed end to end. On the auto-scaling side, HPA custom-metrics-based scaling, KEDA event-driven autoscaling, cold-start optimization, and warm-pool strategy are addressed. For cost optimization, the mix of spot instance / preemptible VM / reserved capacity; model routing (a hybrid of local models and cloud API models such as Haiku/Sonnet); unit economics (cost-per-token, cost-per-user) measurement; and multi-tenant inference (namespace isolation, fair scheduling, per-tenant rate limiting) are addressed in detail.
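
Custom-metrics scaling with the HPA follows a simple proportional rule: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule with min/max clamping; the choice of per-replica queue depth as the metric and the numbers used are hypothetical, and the real controller additionally applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling rule, clamped to the configured replica bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical: 2 replicas, 30 queued requests per replica, target of 10.
print(desired_replicas(2, current_metric=30, target_metric=10))
```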

The security module covers the training's compliance and governance discipline. TLS / mTLS endpoint encryption; internal API gateway and service mesh (Istio, Linkerd); network policy and micro-segmentation; HashiCorp Vault and External Secrets Operator integration; audit logging for who submitted which prompt; PII-masking and secret-scanning hooks; model distribution in environments without an internet connection; local container registry and mirror ecosystems; compliance documentation and audit-readiness are addressed hands-on. KVKK 'cross-border transfer' rules, BDDK / EPDK / SGK sector regulations, and air-gapped deployment scenarios under the EU AI Act framework are addressed in detail.
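
Audit logging of "who submitted which prompt" does not require storing the prompt itself. The sketch below records a SHA-256 hash of the prompt alongside request metadata; the field names and the `audit_record` helper are assumptions for illustration, and whether hashing alone satisfies a given regulator is a compliance question outside this example.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(user_id: str, request_id: str, prompt: str,
                 completion_tokens: int) -> str:
    """Build one JSON audit line; the raw prompt is hashed, never stored."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "request_id": request_id,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "completion_tokens": completion_tokens,
    }
    return json.dumps(record, sort_keys=True)


print(audit_record("alice", "req-001", "example prompt text", 42))
```

Emitting one JSON object per line keeps the log easy to ship to an aggregator such as Loki or Elastic.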

In the capstone project, each participant designs an end-to-end production-grade on-premise LLM serving platform for their own company: hardware, engine, and quantization choices; Kubernetes deployment, observability, and auto-scaling plan; KVKK-compliant air-gapped topology; cost projection and performance baseline; ops runbook and incident-response procedures. By the end of the training, participants reach a level of technical and architectural competence to manage on-premise LLM serving in an integrated way across architectural, operational, and compliance dimensions; master Ollama and vLLM internals; perform multi-GPU distributed inference deployment; establish production observability and auto-scaling; build a reproducible LLM serving platform on Kubernetes; measure unit economics with cost optimization; and meet the requirements of regulated sectors with KVKK-compliant air-gapped deployment. The training consists of 3 days, 12 modules, and over 80 hands-on lessons.

Training Methodology

Turkey's only production-grade on-premise LLM deployment program that addresses Ollama (developer/edge) and vLLM (production serving) internals end to end

Unique technical coverage explaining vLLM's PagedAttention, continuous batching, prefix caching, and speculative decoding at the internals level

Architectural-decision maturity through NVIDIA H100/H200/B200, AMD MI300X, Intel Gaudi3 hardware comparison and multi-GPU topology (NVLink, InfiniBand) capacity planning

Multi-GPU distributed inference via tensor / pipeline / expert parallelism; comparative engine selection matrix with TGI, SGLang, TensorRT-LLM

Kubernetes (KServe, BentoML, Helm), GPU-aware HPA + KEDA auto-scaling, Prometheus + Grafana + DCGM observability, multi-tenant, and cost optimization discipline

An enterprise security perspective covering air-gapped deployment and compliance governance under KVKK 'cross-border transfer', BDDK / EPDK / SGK / EU AI Act

Who Is This For?

DevOps engineers and Site Reliability Engineers (SREs) who want to build and operate enterprise AI serving infrastructure

ML Platform engineers, AI Platform team leads, and teams building internal developer platforms

Infrastructure architects managing GPU clusters, Kubernetes, and multi-tenant serving infrastructure

Technical leads responsible for building air-gapped AI serving for regulated sectors like KVKK / BDDK / EPDK / SGK

FinOps teams and cost-optimization-focused CTOs who want to improve the ROI of self-hosted LLMs

Engineering leaders who want to run their existing LangChain / LlamaIndex / agent applications with a production-grade self-hosted backend

Why This Course?

Turkey's only production-grade on-premise LLM deployment program that goes beyond surface-level 'install Ollama, pull a model' content.

Imparts the competence to perform real tuning by addressing vLLM's PagedAttention and continuous batching internals at architectural depth.

Provides a directly applicable capacity-planning matrix for hardware investment decisions by comparing NVIDIA, AMD, and Intel GPU ecosystems.

Establishes the engineering discipline of taking large models (70B, 405B, 671B) into production via multi-GPU distributed inference (TP / PP / EP).

Teaches production-grade platform engineering with Kubernetes (KServe, BentoML, Helm), auto-scaling (HPA, KEDA), and observability (Prometheus, Grafana, DCGM).

Meets the requirements of regulated sectors with air-gapped deployment and compliance governance topics under the KVKK / BDDK / EPDK / SGK / EU AI Act framework.

Learning Outcomes

Make architectural decisions with on-prem vs API model TCO modeling.

Perform hardware selection and capacity planning among NVIDIA, AMD, and Intel GPU ecosystems.

Professionally set up Ollama in developer, edge, and branch office scenarios.

Tune vLLM's PagedAttention, continuous batching, and prefix caching internals.

Deploy multi-GPU distributed inference with tensor / pipeline / expert parallelism.

Apply GGUF, AWQ, GPTQ, FP8, FP4 quantization strategies in the cost-quality-throughput balance.

Build a reproducible LLM serving platform on Kubernetes (KServe, BentoML, Helm).

Provide production-grade observability with Prometheus, Grafana, OpenTelemetry, DCGM.

Perform KVKK-compliant security hardening with TLS/mTLS, Vault, audit logging, and air-gapped deployment.

Requirements

Basic experience with Linux command line, Docker, and Kubernetes

Working experience with GPU/CUDA, networking, and cloud infrastructure

Experience with Prometheus, Grafana, or similar observability tools (recommended)

DevOps scripting skills with Bash or Python

Access to a GPU machine or cloud GPU during the training (RunPod, Lambda, vast.ai)

A Hugging Face account (can be created with the instructor's help)

Course Curriculum

103 Lessons

Module 1: On-Premise LLM Deployment Strategy and the 2026 Landscape (9 Lessons)

Module 2: Hardware Selection and Capacity Planning (10 Lessons)

Module 3: Ollama Deep Dive — Developer Experience and Edge Deployment (9 Lessons)

Module 4: vLLM Architectural Deep Dive — PagedAttention and Continuous Batching (10 Lessons)

Module 5: Quantization Strategies and Performance Tuning (9 Lessons)

Module 6: Multi-GPU and Distributed Inference (9 Lessons)

Module 7: TGI, SGLang, and Alternative Inference Engines (9 Lessons)

Module 8: Production Observability — Metrics, Logging, Tracing (8 Lessons)

Module 9: LLM Serving on Kubernetes — KServe, BentoML, Helm (8 Lessons)

Module 10: Auto-Scaling, Cost Optimization, and Multi-Tenant Serving (9 Lessons)

Module 11: Security, Compliance, and KVKK-Compliant Air-Gapped Deployment (9 Lessons)

Module 12: Capstone — Enterprise On-Premise LLM Platform (4 Lessons)

Instructor

Şükrü Yusuf KAYA

AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant

Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.

Live & Interactive Sessions

Project-Based Learning

Industry-Focused Curriculum

Professional Networking
