Edge Inference: ONNX + Jetson + MediaTek NPU + Qualcomm AI Engine

Edge LLM inference is real in 2026: NVIDIA Jetson Orin, Google Tensor (Pixel), MediaTek NPU (Dimensity), Qualcomm AI Engine (Snapdragon 8 Gen 3+), and Apple Neural Engine. Covered here: the ONNX format for cross-platform inference, edge-specific quantization (INT8 / INT4 / W4A8 mixed), a first-token latency budget under 200 ms, and a deploy recipe for SmolLM3 1.7B on Pixel 8 Pro.

Şükrü Yusuf KAYA

28-minute read

14.05.2026

Advanced

1. Edge Platform Comparison

Platform	NPU/AI Engine	Tok/s (8B Q4)	Power	Use case
NVIDIA Jetson Orin 32GB	Ampere GPU 200 TOPS	18-25	30W	robotics, industrial
Jetson Nano (legacy)	Maxwell 1 TOPS	1-2	10W	demo only
Apple M3 Max	ANE + GPU	65-75	30W	dev, professional
iPhone 15 Pro (A17 Pro)	ANE 35 TOPS	14-18	5W	mobile chat
Pixel 8 Pro (Tensor G3)	TPU 4 TOPS	8-12	4W	mobile chat
Snapdragon 8 Gen 3	Hexagon NPU 45 TOPS	18-25	5W	premium Android
Raspberry Pi 5 (CPU only)	n/a	2-3	8W	hobbyist

Verdict: for a mobile chatbot, target Snapdragon 8 Gen 3 or newer, or iPhone 15 Pro or newer. For industrial workloads, use Jetson Orin.
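The verdict can be read as a simple filter over the comparison table. The sketch below is a minimal illustration in Python: the rows are copied from the table, while the thresholds (at least 14 tok/s at the low end of the range, at most 5 W) and the names `Platform` and `mobile_candidates` are assumptions introduced here, not values from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    tok_s_low: float   # lower bound of the tok/s range (8B Q4) in the table
    tok_s_high: float  # upper bound of the same range
    power_w: float     # power figure from the table, in watts


# Rows copied from the comparison table above.
PLATFORMS = [
    Platform("NVIDIA Jetson Orin 32GB", 18, 25, 30),
    Platform("Jetson Nano", 1, 2, 10),
    Platform("Apple M3 Max", 65, 75, 30),
    Platform("iPhone 15 Pro", 14, 18, 5),
    Platform("Pixel 8 Pro", 8, 12, 4),
    Platform("Snapdragon 8 Gen 3", 18, 25, 5),
    Platform("Raspberry Pi 5", 2, 3, 8),
]


def mobile_candidates(platforms, min_tok_s=14.0, max_power_w=5.0):
    """Keep platforms whose worst-case throughput and power fit a phone budget.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return [
        p.name
        for p in platforms
        if p.tok_s_low >= min_tok_s and p.power_w <= max_power_w
    ]


print(mobile_candidates(PLATFORMS))
```

With these assumed thresholds the filter keeps the two phone-class platforms named in the verdict; loosening `max_power_w` would start admitting the Jetson and M3 Max rows.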

bash

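# === HF → ONNX → Edge Deploy ===
# 1. Export the HF model to ONNX (fp16 export requires a CUDA device)
optimum-cli export onnx \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --opset 17 \
    --device cuda \
    --dtype fp16 \
    llama-3.2-1b-onnx/

# 2. Quantize MatMul weights to 4-bit (block-wise, weight-only)
mkdir -p llama-3.2-1b-onnx-int4
python -m onnxruntime.quantization.matmul_4bits_quantizer \
    --input_model llama-3.2-1b-onnx/model.onnx \
    --output_model llama-3.2-1b-onnx-int4/model.onnx \
    --block_size 32

kotlin

// 3a. Edge deploy with ONNX Runtime Mobile (Android)
import ai.onnxruntime.OrtEnvironment

// createSession expects a path on the device filesystem
val env = OrtEnvironment.getEnvironment()
val session = env.createSession("llama-3.2-1b-onnx-int4/model.onnx")

swift

// 3b. Edge deploy with ONNX Runtime Mobile (iOS)
let env = try ORTEnv(loggingLevel: .warning)
let options = try ORTSessionOptions()
let session = try ORTSession(env: env, modelPath: path, sessionOptions: options)
// "logits" is the output name assumed for an Optimum causal-LM export
let outputs = try session.run(withInputs: inputs, outputNames: ["logits"], runOptions: nil)

// 4. Edge benchmark
// Pixel 8 Pro: SmolLM3 1.7B Q4 → 12 tok/s, first-token 150 ms
// Snapdragon 8 Gen 3: Llama 3.2 3B Q4 → 18 tok/s, first-token 95 ms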

HF → ONNX → mobile deploy
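To make the `--block_size 32` setting in step 2 concrete: block-wise 4-bit quantization splits the weights into groups of 32 values and stores a small amount of metadata per group plus one 4-bit code per weight. The sketch below illustrates the idea in plain Python using a per-block scale and minimum. It is a simplified illustration under that assumption, not ONNX Runtime's actual kernel or storage layout, and the function names are hypothetical.

```python
import math


def quantize_blockwise(weights, block_size=32):
    """Quantize a flat list of floats to 4-bit codes, one (scale, min) per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        lo, hi = min(block), max(block)
        # 4 bits give 16 levels (codes 0..15); fall back to 1.0 for a constant block.
        scale = (hi - lo) / 15 or 1.0
        codes = [round((w - lo) / scale) for w in block]
        blocks.append((codes, scale, lo))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate floats from the per-block codes, scale, and minimum."""
    return [lo + code * scale for codes, scale, lo in blocks for code in codes]


# Toy weights: 64 values, so two blocks of 32.
weights = [math.sin(i) for i in range(64)]
blocks = quantize_blockwise(weights, block_size=32)
restored = dequantize_blockwise(blocks)

max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks={len(blocks)} max_abs_error={max_err:.4f}")
```

Smaller blocks track local weight ranges more closely, so the rounding error drops, but each block adds its own scale and minimum, so the file grows. A block size of 32 is a common middle ground.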

2. 2026 Edge LLM Trends

Apple Intelligence stack — Apple Neural Engine + MLX backend, OS-level integration
Google Gemini Nano — Pixel 9+ default LLM (Tensor G4 NPU)
Qualcomm AI Hub — Snapdragon-optimized model zoo
MediaTek Dimensity 9400 — Apex AI engine, 50 TOPS

Cookbook recommendation: ship SmolLM3 1.7B or Llama 3.2 1B/3B to mobile platforms as GGUF Q4_K_M or ONNX INT4.
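A quick sanity check before shipping is to estimate how much memory the quantized weights need. The sketch below assumes block-wise INT4 with one 16-bit scale per 32 weights (4.5 bits per weight), uses the nominal parameter counts from the model names, and ignores the KV cache, activations, and any layers left unquantized. All of these are simplifying assumptions, and `int4_weight_bytes` is a hypothetical helper.

```python
def int4_weight_bytes(n_params, block_size=32, scale_bits=16):
    """Approximate size of block-wise INT4 weights in bytes.

    Each weight costs 4 bits plus its share of one per-block scale.
    """
    bits_per_weight = 4 + scale_bits / block_size
    return n_params * bits_per_weight / 8


# Nominal parameter counts taken from the model names (assumption).
NOMINAL_PARAMS = {
    "SmolLM3 1.7B": 1.7e9,
    "Llama 3.2 1B": 1.0e9,
    "Llama 3.2 3B": 3.0e9,
}

for name, n_params in NOMINAL_PARAMS.items():
    size_gb = int4_weight_bytes(n_params) / 1e9
    print(f"{name}: ~{size_gb:.2f} GB of weights")
```

Under these assumptions a nominal 1B-parameter model needs roughly 0.56 GB for weights, which leaves headroom on current flagship phones. The 3B variant is about three times that before the KV cache is counted.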

✅ Part XV complete

1) Convert Llama 3.2 1B to ONNX. 2) If you have an Android/iOS dev environment, test the edge deploy. 3) Next Part: Part IX, Turkish-First & Localization Engineering: TR-specific models, corpus construction, and a KVKK-compliant pipeline.
