
Edge Inference: ONNX + Jetson + MediaTek NPU + Qualcomm AI Engine

Edge LLM inference is real in 2026: NVIDIA Jetson Orin, the Google Tensor TPU (Pixel), MediaTek NPUs, the Qualcomm AI Engine (Snapdragon 8 Gen 3+), and the Apple Neural Engine. This part covers the ONNX format for cross-platform deployment, edge-specific quantization (INT8/INT4/W4A8 mixed), a first-token latency budget under 200 ms, and a SmolLM3 1.7B + Pixel 8 Pro deploy recipe.

Şükrü Yusuf KAYA
28 min read
Advanced

1. Edge Platform Comparison

| Platform | NPU/AI Engine | Tok/s (8B Q4) | Power | Use case |
|---|---|---|---|---|
| NVIDIA Jetson Orin 32GB | Ampere GPU, 200 TOPS | 18-25 | 30W | robotics, industrial |
| Jetson Nano (legacy) | Maxwell, 1 TOPS | 1-2 | 10W | demo only |
| Apple M3 Max | ANE + GPU | 65-75 | 30W | dev, professional |
| iPhone 15 Pro (A17) | ANE, 18 TOPS | 14-18 | 5W | mobile chat |
| Pixel 8 Pro (Tensor G3) | TPU, 4 TOPS | 8-12 | 4W | mobile chat |
| Snapdragon 8 Gen 3 | Hexagon NPU, 45 TOPS | 18-25 | 5W | premium Android |
| Raspberry Pi 5 (CPU only) | n/a | 2-3 | 8W | hobbyist |
Decision: for a mobile chatbot, target Snapdragon 8 Gen 3+ or iPhone 15 Pro+. For industrial workloads, use Jetson Orin.
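Before committing to hardware, a quick sanity check: does the quantized model even fit in device RAM? A rough sizing sketch follows; the ~4.5 bits/weight average for Q4, the 1.2x KV-cache headroom factor, and the 2 GB OS reserve are all illustrative assumptions, not vendor specs.

```python
def q4_model_bytes(n_params: float, bits_per_weight: float = 4.5) -> float:
    """Approximate in-RAM size of a Q4-quantized model; Q4 formats average
    roughly 4.5 bits/weight once scales and zero-points are included (assumption)."""
    return n_params * bits_per_weight / 8

def fits_on_device(n_params: float, device_ram_gb: float, os_reserve_gb: float = 2.0) -> bool:
    """Model + KV-cache headroom (assumed 1.2x model size) vs. usable RAM."""
    needed_gb = 1.2 * q4_model_bytes(n_params) / 1e9
    return needed_gb <= device_ram_gb - os_reserve_gb

print(fits_on_device(1.24e9, 12))  # Llama 3.2 1B on a 12 GB Pixel 8 Pro → True
print(fits_on_device(8.0e9, 6))    # an 8B model on a 6 GB budget phone → False
```

The same arithmetic explains the table: an 8B Q4 model needs roughly 5-6 GB, which is why budget phones are limited to the 1-3B class.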
```bash
# === HF → ONNX → Edge Deploy ===
# 1. Convert the HF model to ONNX
optimum-cli export onnx \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --opset 17 \
  --dtype fp16 \
  llama-3.2-1b-onnx/

# 2. Quantize (W4A8 mixed)
python -m onnxruntime.quantization.matmul_4bits_quantizer \
  --input_model llama-3.2-1b-onnx/model.onnx \
  --output_model llama-3.2-1b-onnx-int4/model.onnx \
  --block_size 32
```

3. Edge deploy — ONNX Runtime Mobile (Android/iOS):

```kotlin
// Kotlin (Android)
import ai.onnxruntime.OrtEnvironment

val session = OrtEnvironment.getEnvironment().createSession(
    "llama-3.2-1b-onnx-int4/model.onnx"
)
```

```swift
// Swift (iOS) — env, path, and options are created beforehand
let session = try ORTSession(env: env, modelPath: path, sessionOptions: options)
let outputs = try session.run(withInputs: inputs,
                              outputNames: ["logits"],  // output name depends on the exported graph
                              runOptions: nil)
```

4. Edge benchmark:
  • Pixel 8 Pro: SmolLM3 1.7B Q4 → 12 tok/s, first-token 150 ms
  • Snapdragon 8 Gen 3: Llama 3.2 3B Q4 → 18 tok/s, first-token 95 ms
HF → ONNX → mobile deploy
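To reproduce numbers like the ones above on your own device, it helps to measure first-token latency separately from steady-state decode. The harness below is a sketch: the 1 ms `time.sleep` step is a dummy stand-in for a real ONNX Runtime decode call, and the token count is an arbitrary choice.

```python
import time

def benchmark(step, n_tokens: int = 32):
    """step() produces one token; returns (first_token_ms, tok_per_s)."""
    t0 = time.perf_counter()
    step()                          # first call: prefill + first decode step
    first_ms = (time.perf_counter() - t0) * 1000
    t1 = time.perf_counter()
    for _ in range(n_tokens - 1):   # steady-state decode loop
        step()
    tok_s = (n_tokens - 1) / (time.perf_counter() - t1)
    return first_ms, tok_s

# Dummy ~1 ms/token workload, just to exercise the harness:
first_ms, tok_s = benchmark(lambda: time.sleep(0.001))
print(f"first-token {first_ms:.0f} ms, {tok_s:.0f} tok/s")
```

On a device, replace the lambda with one `session.run(...)` decode step so that prefill cost lands in the first-token measurement, matching how the 150 ms / 95 ms figures above are defined.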
  • Apple Intelligence stack — Apple Neural Engine + MLX backend, OS-level integration
  • Google Gemini Nano — Pixel 9+ default LLM (Tensor G4 NPU)
  • Qualcomm AI Hub — Snapdragon-optimized model zoo
  • MediaTek Dimensity 9400 — Apex AI engine, 50 TOPS
Cookbook recommendation: ship SmolLM3 1.7B or Llama 3.2 1B/3B as GGUF Q4_K_M or ONNX INT4 for mobile platforms.
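For download and storage planning, these parameter counts translate to roughly the following Q4 footprints. This is a back-of-the-envelope sketch: the parameter counts are rounded public figures, and 4.5 bits/weight is a Q4_K_M rule of thumb, not an exact format spec.

```python
# Approximate Q4 sizes for the recommended models (all figures rounded).
MODELS = {
    "SmolLM3 1.7B": 1.7e9,
    "Llama 3.2 1B": 1.24e9,
    "Llama 3.2 3B": 3.21e9,
}
for name, n_params in MODELS.items():
    gb = n_params * 4.5 / 8 / 1e9   # ~4.5 bits/weight at Q4_K_M (assumption)
    print(f"{name}: ~{gb:.1f} GB")
# → SmolLM3 1.7B: ~1.0 GB
# → Llama 3.2 1B: ~0.7 GB
# → Llama 3.2 3B: ~1.8 GB
```

All three fit comfortably on a modern phone; the 3B model is the practical ceiling for the 8 GB RAM class once KV-cache and OS overhead are counted.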
✅ Part XV complete
  1) Convert Llama 3.2 1B to ONNX. 2) If you have an Android/iOS dev environment, test an edge deploy. 3) Next Part: Part IX — Turkish-First & Localization Engineering: TR-specific models + corpus construction + KVKK-compliant pipeline.
