About this training
A 3-day advanced program for DevOps engineers, SREs, and ML Platform engineers who want to deploy open-source LLMs on-premise at enterprise scale, ranging from Ollama and vLLM internals to multi-GPU distributed inference, Kubernetes serving, production observability, and KVKK-compliant air-gapped deployment.
This training is designed for:
- DevOps engineers and Site Reliability Engineers (SREs) who want to build and operate enterprise AI serving infrastructure
- ML Platform engineers, AI Platform team leads, and teams building internal developer platforms
- Infrastructure architects managing GPU clusters, Kubernetes, and multi-tenant serving infrastructure
- Technical leads responsible for building air-gapped AI serving for sectors regulated under KVKK / BDDK / EPDK / SGK
- FinOps teams and cost-optimization-focused CTOs who want to improve the ROI of self-hosted LLMs
- Engineering leaders who want to run their existing LangChain / LlamaIndex / agent applications on a production-grade self-hosted backend
Prerequisites and recommended background: basic experience with the Linux command line, Docker, and Kubernetes; working experience with GPU/CUDA, networking, and cloud infrastructure; experience with Prometheus, Grafana, or similar observability tools (recommended); DevOps scripting skills with Bash or Python; access to a GPU machine or cloud GPU during the training (RunPod, Lambda, vast.ai); a Hugging Face account (can be created with the instructor's help).
Key Takeaways
- Make architectural decisions with on-prem vs API model TCO modeling.
- Perform hardware selection and capacity planning among NVIDIA, AMD, and Intel GPU ecosystems.
- Professionally set up Ollama in developer, edge, and branch office scenarios.
- Tune vLLM's PagedAttention, continuous batching, and prefix caching internals.
- Deploy multi-GPU distributed inference with tensor / pipeline / expert parallelism.
- Apply GGUF, AWQ, GPTQ, FP8, FP4 quantization strategies in the cost-quality-throughput balance.
- Build a reproducible LLM serving platform on Kubernetes (KServe, BentoML, Helm).
- Provide production-grade observability with Prometheus, Grafana, OpenTelemetry, DCGM.
- Perform KVKK-compliant security hardening with TLS/mTLS, Vault, audit logging, and air-gapped deployment.
On-Premise LLM Deployment with Ollama and vLLM Training
About This Course
This training is designed for DevOps engineers, SREs, ML Platform engineers, infrastructure architects, and cloud architects who want to run open-source large language models at enterprise scale on production-grade infrastructure. The program rests on one premise: on-premise LLM deployment is not simply 'install Ollama on a server and open a port.' Real operational value comes from:
- selecting the right hardware via VRAM math and throughput projection
- understanding and tuning vLLM's PagedAttention and continuous-batching internals
- building multi-GPU deployments with tensor / pipeline / expert parallelism strategies
- ensuring production observability with Prometheus + Grafana + OpenTelemetry + DCGM
- performing reproducible deployment on Kubernetes with KServe / BentoML / Helm
- setting up auto-scaling with GPU-aware HPA and KEDA
- optimizing cost with spot instances and hybrid model routing
- hardening security with TLS/mTLS, Vault, and audit logging
- operating the entire system in a KVKK-compliant air-gapped topology
Content on open-source LLM setup in Turkey has expanded rapidly over the past two years; however, the vast majority of it remains at the 'install Ollama on macOS, pull a model, ask a question' level. This training is designed to be the only comprehensive Turkish-language reference that goes past this surface level and covers inference engine internals, multi-GPU distributed serving, Kubernetes platform engineering, production observability, and KVKK air-gapped compliance within the same program. The target audience is not ML/data engineers; it is the DevOps engineers, SREs, ML Platform engineers, and infrastructure architects who operate production infrastructure. The training focuses not on Python ML engineering but on platform engineering and operational discipline.
A strategic dimension of the program is clarifying in which scenarios on-premise LLM deployment is genuinely required. Under KVKK 'cross-border transfer' rules, BDDK (banking), EPDK (energy), SGK (healthcare) sector regulations, and the EU AI Act, self-hosted serving is a mandatory architectural decision for many enterprise customers. At the same time, in situations like high token volumes (100M+/month), low tail-latency needs, or operationalizing domain-specific fine-tuned models, self-hosted overtakes API models economically. This training addresses TCO modeling for on-prem vs API models in detail along with the break-even point; a hybrid (hot path API, cold path on-prem) strategy is also shown.
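The break-even reasoning above can be sketched in a few lines. This is a minimal model, assuming on-prem cost is fixed (amortized hardware plus monthly operating cost) and that the installed capacity can absorb the token volume; all function names and figures are hypothetical placeholders, not vendor quotes or course material.

```python
def monthly_api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Pay-per-token cost: scales linearly with volume."""
    return tokens_per_month / 1_000_000 * price_per_million


def monthly_onprem_cost(capex: float, amortization_months: int,
                        opex_per_month: float) -> float:
    """Fixed cost: amortized hardware plus power, hosting, and staff."""
    return capex / amortization_months + opex_per_month


def break_even_tokens(capex: float, amortization_months: int,
                      opex_per_month: float, price_per_million: float) -> float:
    """Monthly token volume above which on-prem is cheaper than the API."""
    fixed = monthly_onprem_cost(capex, amortization_months, opex_per_month)
    return fixed / price_per_million * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: 36k hardware over 36 months, 500/month opex,
    # and a blended API price of 10 per million tokens.
    volume = break_even_tokens(capex=36_000, amortization_months=36,
                               opex_per_month=500, price_per_million=10.0)
    print(f"break-even: {volume / 1e6:.0f}M tokens/month")
```

A real TCO model would add utilization, redundancy, and staffing assumptions; the point here is only that the API cost line crosses the fixed on-prem line at a specific monthly volume.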
The hardware module forms the infrastructure backbone of the training. NVIDIA H100, H200, B100, B200, GB200 specs and performance; the AMD MI300X, MI325X, MI350 ecosystem; Intel Gaudi3 and other alternative accelerators; and prototyping scenarios with RTX 4090/5090 are addressed comparatively. VRAM math (model parameters × bytes-per-param + KV cache); the impact of batch size, sequence length, and context window; throughput projection (tokens/sec, requests/sec, p99 latency) are taught end to end. In multi-GPU topologies, PCIe vs NVLink vs NVSwitch bandwidth differences; multi-node InfiniBand and RDMA requirements; DGX, HGX reference architectures, and custom build options are addressed in detail. This module provides a directly applicable decision matrix for teams that will invest in hardware or purchase cloud GPUs.
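As a rough illustration of the VRAM math mentioned above (model parameters × bytes-per-param + KV cache), the sketch below estimates memory for a dense transformer. The architecture numbers in the example (32 layers, 8 KV heads, head dimension 128, 8 billion parameters) are illustrative assumptions for an 8B-class model, and the estimate ignores activations, framework overhead, and fragmentation.

```python
GIB = 2 ** 30


def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights, e.g. 2 bytes/param for FP16."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V tensor per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def total_vram_gib(n_params: float, bytes_per_param: float, **kv_args) -> float:
    """Weights plus KV cache, in GiB (lower bound; overheads are not modeled)."""
    return (weight_bytes(n_params, bytes_per_param) + kv_cache_bytes(**kv_args)) / GIB


if __name__ == "__main__":
    kv = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192, batch_size=8)
    print(f"KV cache: {kv_cache_bytes(**kv) / GIB:.1f} GiB")
    print(f"total at FP16: {total_vram_gib(8e9, 2, **kv):.1f} GiB")
```

Doubling either the batch size or the sequence length doubles the KV cache term, which is why context window and concurrency drive hardware sizing as much as parameter count does.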
The Ollama module deepens into developer / edge / branch office scenarios. Ollama's llama.cpp-based backend architecture, ggml/gguf format mechanics, model registry flow, and OpenAI-compatible API layer are addressed at the internals level. Customization techniques like Modelfile directives (FROM, PARAMETER, TEMPLATE, SYSTEM), custom model production, and LoRA adapter merging are covered hands-on. On the production Ollama side, OLLAMA_HOST, OLLAMA_KEEP_ALIVE, OLLAMA_NUM_PARALLEL configuration; GPU passthrough; docker/podman integration; and branch office / edge node deployment patterns are addressed comprehensively.
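To make the Modelfile directives concrete, here is a small sketch that renders a Modelfile from Python. The helper `render_modelfile` is hypothetical, the base model tag and parameter values are only examples, and the TEMPLATE directive is left out for brevity.

```python
def render_modelfile(base_model: str, parameters: dict, system_prompt: str) -> str:
    """Render a minimal Ollama Modelfile using FROM, PARAMETER, and SYSTEM."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        # One PARAMETER line per runtime option, e.g. temperature or num_ctx.
        lines.append(f"PARAMETER {name} {value}")
    # Triple quotes allow a multi-line system prompt inside the Modelfile.
    lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        base_model="llama3.1:8b",  # example tag, assumed to be available locally
        parameters={"temperature": 0.2, "num_ctx": 8192},
        system_prompt="You are an internal support assistant.",
    )
    print(text)
```

The resulting file would typically be registered with `ollama create <name> -f Modelfile`, after which the custom model is served like any other local model.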
The vLLM module is the peak of technical depth in the training. The two core innovations forming vLLM's production-grade serving paradigm — PagedAttention (KV cache fragmentation solution) and continuous batching (throughput optimization) — are addressed at the internals level. PagedAttention's page-based KV cache management and OS analogy; memory-utilization metrics; static batching vs continuous batching throughput analysis; request scheduler and preemption mechanics; max_num_batched_tokens and max_num_seqs tuning are addressed in detail. As advanced optimizations, prefix caching (shared system-prompt advantage), speculative decoding (use of draft models), chunked prefill (long-context handling), and guided decoding (Outlines, lm-format-enforcer integration) are shown hands-on.
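The page-based KV cache idea can be illustrated with a toy allocator. This is a simplified sketch of the concept, not vLLM's actual implementation: the class and method names are hypothetical, and the block size of 16 tokens is an illustrative choice. Each sequence gets a block table that maps its logical blocks to physical blocks from a shared pool, much like an OS page table.

```python
class PagedKVCache:
    """Toy block allocator for KV cache slots, one block table per sequence."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve a slot for one more token; False if the pool is exhausted."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                return False  # a real scheduler would queue or preempt here
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused slots: at most block_size - 1 per sequence."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return allocated - sum(self.seq_lens.values())


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4)
    for _ in range(20):
        cache.append_token("a")
    for _ in range(5):
        cache.append_token("b")
    print(cache.block_tables, "wasted:", cache.wasted_slots())
```

Because memory is reserved one block at a time instead of for the maximum sequence length up front, waste is bounded by one partially filled block per sequence, which is what lets the scheduler keep more sequences in flight.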
The quantization-strategies module is critical for getting maximum performance under hardware constraints. GGUF (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0); AWQ vs GPTQ (both are weight-only methods: AWQ is activation-aware, while GPTQ relies on calibration-based error compensation); native FP8 (H100 and later); FP4 (Blackwell B100+); and EXL2 (the ExLlamaV2 format) are compared in detail. Measuring quantization-induced quality regression (perplexity, MMLU); domain-specific quality-loss analysis; throughput vs latency optimization; tail-latency (p99, p99.9) management; and benchmarking tools like vLLM benchmark_serving and GenAI-Perf are covered hands-on.
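Perplexity-based regression measurement reduces to a short formula: perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes you already have per-token natural-log probabilities from the baseline and the quantized model on the same evaluation text; the function names are illustrative.

```python
import math


def perplexity(token_logprobs: list) -> float:
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_regression(baseline_ppl: float, quantized_ppl: float) -> float:
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # Hypothetical log-probs for the same four tokens under two model variants.
    base = perplexity([-1.0, -0.5, -2.0, -0.5])
    quant = perplexity([-1.1, -0.6, -2.1, -0.6])
    print(f"baseline={base:.3f} quantized={quant:.3f} "
          f"regression={relative_regression(base, quant):.1%}")
```

Perplexity on generic text can hide domain-specific losses, which is why it is usually paired with task benchmarks and domain evaluation sets.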
The multi-GPU and distributed inference module goes beyond what fits on a single GPU and covers serving large models (70B, 405B, 671B) across multiple devices. Tensor parallelism (matmul splitting), pipeline parallelism (layer-wise split), expert parallelism (MoE-specific routing), and hybrid 3D parallelism techniques are covered. On the vLLM side, tensor_parallel_size and pipeline_parallel_size tuning, NCCL / NVLink / InfiniBand backend configuration, and multi-node orchestration with a Ray cluster are covered hands-on. Additionally, alternative engines like TGI (Hugging Face Rust backend), SGLang (RadixAttention, JSON schema-constrained generation), NVIDIA TensorRT-LLM (peak performance), Triton Inference Server, llama.cpp server, lmdeploy, and MLX Server are compared, and a use-case-based engine-selection matrix is presented.
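The "matmul splitting" behind tensor parallelism can be shown without any GPU. The sketch below simulates column-parallel sharding in pure Python: each simulated device holds a slice of the weight matrix columns, computes its partial output, and the slices are concatenated (the all-gather step). It is a conceptual illustration with hypothetical function names, not how vLLM or NCCL implement it.

```python
def matmul(x: list, w: list) -> list:
    """Plain matrix product of nested lists: (m x k) @ (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: list, parts: int) -> list:
    """Split a weight matrix column-wise into equal shards, one per device."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = n_cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x: list, w: list, parts: int) -> list:
    """Each shard computes its own output columns; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Simulated all-gather: join the partial rows along the column dimension.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    assert column_parallel_matmul(x, w, parts=2) == matmul(x, w)
    print(column_parallel_matmul(x, w, parts=2))
```

Every device needs the full input `x` but only a fraction of the weights, so per-GPU weight memory drops with the shard count while the gather step adds interconnect traffic, which is where NVLink and InfiniBand bandwidth start to matter.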
The production-observability module represents the operational-discipline dimension of the training. Scraping vLLM's /metrics endpoint with Prometheus; vLLM Prometheus metrics like TTFT (time-to-first-token), TPOT (time-per-output-token), throughput, queue depth; Grafana dashboard templates; GPU monitoring with NVIDIA DCGM; structured logging (request_id, prompt hash, completion length); OpenTelemetry trace propagation; and Loki / Elastic / Datadog log aggregation are addressed in detail. On the alerting side, GPU OOM, queue saturation, p99-latency spike alerting; runbook design; and postmortem discipline are addressed hands-on.
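The latency metrics named above have simple definitions that are worth pinning down. The following sketch computes TTFT, TPOT, and a nearest-rank percentile offline from per-token timestamps; in production these values would normally come from the serving engine's Prometheus histograms, and the function names here are illustrative.

```python
def ttft(request_start: float, token_times: list) -> float:
    """Time to first token: delay between request arrival and the first output token."""
    return token_times[0] - request_start


def tpot(token_times: list) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


if __name__ == "__main__":
    start, tokens = 1.0, [1.25, 1.75, 2.25]  # seconds, hypothetical timestamps
    print("TTFT:", ttft(start, tokens), "TPOT:", tpot(tokens))
    latencies = [0.2, 0.3, 0.25, 1.9, 0.22, 0.28, 0.31, 0.27, 0.24, 0.26]
    print("p99 latency:", percentile(latencies, 99))
```

TTFT is dominated by queueing and prefill, while TPOT reflects decode speed under the current batch, so alerting on them separately points to different root causes.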
The Kubernetes module takes on-premise LLM serving to the level of enterprise platform engineering. NVIDIA GPU Operator and nvidia-device-plugin configuration; GPU node taints, tolerations, and scheduling; fractional GPUs via MIG (Multi-Instance GPU); KServe / Knative serverless LLM serving; model packaging with BentoML; a comparison of ModelMesh and Ray Serve; reproducible vLLM deployment with Helm charts; and GitOps-based delivery with Argo CD are covered end to end. On the auto-scaling side, HPA scaling based on custom metrics, KEDA event-driven autoscaling, cold-start optimization, and warm-pool strategy are covered. For cost optimization, the module covers the mix of spot instances / preemptible VMs / reserved capacity; hybrid model routing between local models and cloud API models (such as Haiku/Sonnet); unit-economics measurement (cost-per-token, cost-per-user); and multi-tenant inference (namespace isolation, fair scheduling, per-tenant rate limiting).
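Custom-metric autoscaling follows the core formula documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies it to a hypothetical per-pod queue-depth metric; it deliberately ignores the HPA's tolerance band and stabilization windows, and the min/max defaults are placeholders.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Core HPA formula, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical metric: average queued requests per vLLM pod, target of 10.
    print(desired_replicas(current_replicas=2, current_metric=30, target_metric=10))
    print(desired_replicas(current_replicas=4, current_metric=2, target_metric=10))
```

For LLM serving, a queue-depth or in-flight-request metric usually tracks saturation better than CPU utilization, since the GPU can be fully busy while CPU stays low.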
The security module covers the training's compliance and governance discipline. TLS / mTLS endpoint encryption; internal API gateway and service mesh (Istio, Linkerd); network policy and micro-segmentation; HashiCorp Vault and External Secrets Operator integration; audit logging for who submitted which prompt; PII-masking and secret-scanning hooks; model distribution in environments without an internet connection; local container registry and mirror ecosystems; compliance documentation and audit-readiness are addressed hands-on. KVKK 'cross-border transfer' rules, BDDK / EPDK / SGK sector regulations, and air-gapped deployment scenarios under the EU AI Act framework are addressed in detail.
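An audit trail for "who submitted which prompt" can be sketched as a structured record that stores a prompt hash and a masked copy instead of the raw text. The regular expressions below are deliberately simple illustrations (email addresses and 11-digit ID-like numbers), and the field names are assumptions; a production PII pipeline would need far broader coverage and review.

```python
import hashlib
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ID_RE = re.compile(r"\b\d{11}\b")  # 11-digit ID-like numbers (illustrative only)


def mask_pii(text: str) -> str:
    """Replace matched emails and ID-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return ID_RE.sub("[ID]", text)


def audit_record(user_id: str, request_id: str, prompt: str,
                 completion_tokens: int) -> dict:
    """Build a structured audit entry without storing the raw prompt."""
    return {
        "request_id": request_id,
        "user_id": user_id,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_masked": mask_pii(prompt),
        "completion_tokens": completion_tokens,
    }


if __name__ == "__main__":
    record = audit_record("u-42", "req-001",
                          "Mail ali@example.com, ID 12345678901", 128)
    print(json.dumps(record, indent=2))
```

The hash lets auditors confirm that two requests carried the same prompt without exposing its content, while the masked copy keeps logs useful for debugging.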
In the capstone project, each participant designs an end-to-end, production-grade on-premise LLM serving platform for their own company, covering hardware, engine, and quantization choices; a Kubernetes deployment, observability, and auto-scaling plan; a KVKK-compliant air-gapped topology; a cost projection and performance baseline; and an ops runbook with incident-response procedures. By the end of the training, participants can manage on-premise LLM serving across its architectural, operational, and compliance dimensions: they master Ollama and vLLM internals, deploy multi-GPU distributed inference, establish production observability and auto-scaling, build a reproducible LLM serving platform on Kubernetes, measure unit economics as part of cost optimization, and meet the requirements of regulated sectors with KVKK-compliant air-gapped deployment. The training consists of 3 days, 12 modules, and over 80 hands-on lessons.
Training Methodology
Turkey's only production-grade on-premise LLM deployment program that addresses Ollama (developer/edge) and vLLM (production serving) internals end to end
Unique technical coverage explaining vLLM's PagedAttention, continuous batching, prefix caching, and speculative decoding at the internals level
Architectural-decision maturity through NVIDIA H100/H200/B200, AMD MI300X, Intel Gaudi3 hardware comparison and multi-GPU topology (NVLink, InfiniBand) capacity planning
Multi-GPU distributed inference via tensor / pipeline / expert parallelism; comparative engine selection matrix with TGI, SGLang, TensorRT-LLM
Kubernetes (KServe, BentoML, Helm), GPU-aware HPA + KEDA auto-scaling, Prometheus + Grafana + DCGM observability, multi-tenant, and cost optimization discipline
An enterprise security perspective covering air-gapped deployment and compliance governance under KVKK 'cross-border transfer', BDDK / EPDK / SGK / EU AI Act
Why This Course?
Turkey's only production-grade on-premise LLM deployment program that goes beyond surface-level 'install Ollama, pull a model' content.
Imparts the competence to perform real tuning by addressing vLLM's PagedAttention and continuous batching internals at architectural depth.
Provides a directly applicable capacity-planning matrix for hardware investment decisions by comparing NVIDIA, AMD, and Intel GPU ecosystems.
Establishes the engineering discipline of taking large models (70B, 405B, 671B) into production via multi-GPU distributed inference (TP / PP / EP).
Teaches production-grade platform engineering with Kubernetes (KServe, BentoML, Helm), auto-scaling (HPA, KEDA), and observability (Prometheus, Grafana, DCGM).
Meets the requirements of regulated sectors with air-gapped deployment and compliance governance topics under the KVKK / BDDK / EPDK / SGK / EU AI Act framework.
Course Curriculum
103 Lessons
Instructor

Şükrü Yusuf KAYA
AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant
Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.