How do I prevent broadcasting memory blow-up?

Shape discipline plus assertions. (1) Comment the expected shape before each op: `# expected: (B, T, d)`. (2) On hot paths, assert it: `assert x.shape == (B, T, d), f"got {x.shape}"`. (3) Before large ops, apply the broadcasting rules by hand to predict the output shape. (4) Use a memory profiler: `tracemalloc` (CPU), `torch.cuda.max_memory_allocated()` (GPU). (5) Test with small tensors first, then scale up.
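Predicting the output shape by hand can also be automated. The sketch below assumes NumPy 1.20 or later (for `np.broadcast_shapes`) and computes the result shape and its size in bytes before anything is allocated. The helper name `broadcast_nbytes` and its default dtype are illustrative choices, not part of any library.

```python
import math
import numpy as np

def broadcast_nbytes(*shapes, dtype=np.float64):
    """Return (result_shape, size_in_bytes) of a broadcast op without allocating it."""
    out_shape = np.broadcast_shapes(*shapes)
    return out_shape, math.prod(out_shape) * np.dtype(dtype).itemsize

shape, nbytes = broadcast_nbytes((1, 1_000_000), (1_000_000, 1))
print(shape)                      # (1000000, 1000000)
print(f"{nbytes / 1e12:.0f} TB")  # 8 TB
```

A guard such as `assert nbytes < some_limit` placed before the real operation turns an out-of-memory crash into a readable error message.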

How did einops become so popular? Doesn't einsum suffice?

Einsum is powerful but cryptic. Einops has three core functions (`rearrange`, `reduce`, `repeat`) with **human-readable** patterns: `rearrange(x, 'b h t d -> b t (h d)')` is understandable at a glance, while the equivalent einsum-plus-reshape is much less clear. Einops also handles **shape errors** more cleanly, raising an informative error on a bad pattern. Many modern deep learning codebases use einops extensively. Einsum is still the better fit for contractions (matmul-like operations), so the two complement each other.

View modifications affecting the original — is this a bug source or a feature?

Both. It is a **feature** because it is fast: you can modify data without copying. It is a **bug source** because of unexpected side effects. **Defensive coding**: (1) Call `.copy()` explicitly if unsure. (2) If a function's contract says "input not modified", copy at the start. (3) Avoid in-place ops in dataset preprocessing. (4) Write tests that check your inputs are not modified. In PyTorch, `tensor.clone()` or `tensor.detach().clone()` follows the same pattern.
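A minimal sketch of points (2) and (4), using a hypothetical helper `center_rows` whose contract is "input not modified". The function and the test data are made up for illustration.

```python
import numpy as np

def center_rows(x):
    """Hypothetical helper. Contract: the input array is not modified."""
    x = x.copy()                          # defensive copy at the start
    x -= x.mean(axis=1, keepdims=True)    # the in-place op is now safe
    return x

data = np.arange(6, dtype=np.float64).reshape(2, 3)
snapshot = data.copy()

out = center_rows(data[:, :])             # pass a view on purpose
print(out.tolist())                       # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
print(np.array_equal(data, snapshot))     # True: input untouched
print(np.shares_memory(out, data))        # False: result has its own buffer
```

Removing the `x.copy()` line makes the second check print `False`, which is exactly the silent side effect the test is meant to catch.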

Does it matter if I use PyTorch tensors (CPU) instead of NumPy?

For BLAS-backed operations, modern PyTorch CPU tensors perform about the same as NumPy ndarrays, since both call into BLAS. PyTorch's advantages: the same API runs on GPU, autograd comes for free, and mixed precision is native. **Recommendation**: in new projects, use PyTorch tensors everywhere. Keep NumPy only for (1) third-party libraries that require it (scipy, sklearn), (2) Pandas interop, and (3) legacy code. Module 5 (PyTorch Engineering) goes deeper into this.

Turkish-specific: NumPy or PyTorch for tokenization output?

Use NumPy arrays or lists at the tokenization step: HuggingFace tokenizers return a list of ints or an `np.ndarray`. Then, **right before feeding the model**, convert with `torch.tensor` or `torch.from_numpy`. Turkish text tends to produce long token sequences (few characters per token), so tokenization can be CPU-heavy. A NumPy preprocessing pipeline is often faster (multi-threading friendly, less pure-Python overhead) and pairs well with the PyTorch DataLoader.
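As a rough sketch of the NumPy side of such a pipeline, the hypothetical helper `pad_batch` below pads variable-length token ID lists into one int64 array plus an attention mask. The helper name, `pad_id=0`, and the token IDs are assumptions for illustration, not real tokenizer output.

```python
import numpy as np

def pad_batch(token_lists, pad_id=0):
    """Pad variable-length token ID lists into a (B, T_max) int64 array and a bool mask."""
    max_len = max(len(ids) for ids in token_lists)
    batch = np.full((len(token_lists), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(token_lists), max_len), dtype=bool)
    for row, ids in enumerate(token_lists):
        batch[row, :len(ids)] = ids       # real tokens go on the left
        mask[row, :len(ids)] = True       # True marks real tokens, False marks padding
    return batch, mask

# Made-up token IDs standing in for tokenizer output
batch, mask = pad_batch([[5, 17, 9], [42], [7, 7, 7, 7]])
print(batch.shape)         # (3, 4)
print(batch[1].tolist())   # [42, 0, 0, 0]
print(int(mask.sum()))     # 8
```

The final step described above would then be `torch.from_numpy(batch)`, which shares memory with the NumPy array instead of copying it.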

NumPy Tensor Engineering: Strides, View, Broadcasting, and the Anatomy of Memory Layout

Memory anatomy of a tensor: row-major C vs column-major F, strides, view vs copy, contiguous, fancy indexing, advanced broadcasting rules, BLAS backend intuition, einsum vs einops. Foundation of performance-critical code.

Şükrü Yusuf KAYA

38 min read

5/13/2026

Intermediate

🧱 The NumPy knowledge underneath PyTorch

Most PyTorch tensor behavior mirrors NumPy ndarray patterns, so knowing NumPy deeply means being able to fix PyTorch performance bugs. In 38 minutes you will be able to answer the `a.copy()` or `a.view()` question deliberately, and broadcasting will no longer feel like "magic".

Lesson Map

ndarray anatomy: data buffer + metadata
Strides: the bridge from memory to tensor
Row-major (C) vs column-major (F)
View vs copy: silent performance traps
Contiguous: why it matters and how to check it
Broadcasting rules in detail
Fancy indexing: view or copy?
BLAS backend: the real speed of matrix multiplication
Einsum vs einops: modern tensor algebra
Differences between NumPy and PyTorch

1. ndarray Anatomy: Data + Metadata

A NumPy `ndarray` is really two things:

Data buffer: a contiguous (flat) block of memory
Metadata: how that block should be interpreted (shape, dtype, strides, offset)

import numpy as np
a = np.arange(12).reshape(3, 4)
print(a)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

It looks 2D, but in memory it is a single 1D array: `[0, 1, 2, ..., 11]`. The metadata says how to interpret it:

shape: `(3, 4)`
dtype: `int64` (each element is 8 bytes; this is the default integer type on most 64-bit platforms)
strides: `(32, 8)` (32 bytes to move to the next row, 8 bytes to move to the next column)
offset: 0 (start of the data)

Why does this separation matter?

Because you can create completely different views over the same data buffer just by using different metadata: no copying, only pointer arithmetic.

python

import numpy as np

a = np.arange(12).reshape(3, 4)
print(f"shape: {a.shape}")
print(f"dtype: {a.dtype}")
print(f"strides: {a.strides}")    # (32, 8), in bytes
print(f"itemsize: {a.itemsize}")  # 8 bytes (int64)
print(f"nbytes: {a.nbytes}")      # 96 = 12 * 8
print(f"flags: {a.flags}")

# Views that share the same buffer
b = a.T                            # transpose: no copy
print(f"\nTranspose: shape={b.shape}, strides={b.strides}")
# shape=(4, 3), strides=(8, 32): only the strides were swapped!

c = a[1:3, ::2]                    # slice: no copy
print(f"\nSlice: shape={c.shape}, strides={c.strides}")
# shape=(2, 2), strides=(32, 16)

# Verify: same buffer. `a` is itself a view (reshape of the 1D arange
# result), so a.base is that 1D array, not None.
print(f"a.base is None: {a.base is None}")             # False
print(f"a, b share memory: {np.shares_memory(a, b)}")  # True
print(f"a, c share memory: {np.shares_memory(a, c)}")  # True

Internal metadata structure of an ndarray.

2. Strides: The Bridge from Memory to Tensor

`strides` tells you, for each dimension, how many bytes to advance in order to take one step. For a `(3, 4)` int64 array, `strides = (32, 8)`:

Row: 4 elements × 8 bytes = 32 bytes
Column: 8 bytes (a single element)

Element access formula

The memory location of `a[i, j]`:

\text{addr}(i, j) = \text{base} + i \cdot \text{stride}_0 + j \cdot \text{stride}_1

Generalized to N dimensions:

\text{addr}(i_0, i_1, \dots, i_{n-1}) = \text{base} + \sum_{k=0}^{n-1} i_k \cdot \text{stride}_k
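The N-dimensional formula can be checked directly against NumPy. The sketch below uses a hypothetical helper `byte_offset` that computes the sum of index times stride and compares the result with ordinary indexing, first on a contiguous array and then on a transposed view of the same buffer.

```python
import numpy as np

def byte_offset(arr, index):
    """Offset in bytes from the first element: sum over k of index[k] * strides[k]."""
    return sum(i * s for i, s in zip(index, arr.strides))

a = np.arange(24, dtype=np.int64).reshape(2, 3, 4)
flat = a.ravel()                          # a is C-contiguous, so this is buffer order

off = byte_offset(a, (1, 2, 3))           # 1*96 + 2*32 + 3*8
print(a.strides, off)                     # (96, 32, 8) 184
print(flat[off // a.itemsize], a[1, 2, 3])    # 23 23

# The same formula works for a transposed view; only the strides differ.
t = a.transpose(2, 0, 1)                  # shape (4, 2, 3), strides (8, 96, 32)
off_t = byte_offset(t, (3, 1, 2))
print(flat[off_t // a.itemsize], t[3, 1, 2])  # 23 23
```

Both index tuples land on byte offset 184, which is why a transpose never needs to move data.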

Stride trickery

NumPy and PyTorch implement many shape operations by changing only the strides:

Operation	Effect on strides	Copy?
`a.T` (transpose)	strides reversed	No
`a[start:stop:step]`	offset + step × stride	No
`a.reshape(-1)`	new strides (if contiguous)	Sometimes
`np.broadcast_to`	some strides = 0 (repetition without copying)	No
`a.copy()`	new buffer	Yes
Fancy indexing	new buffer	Yes

python

import numpy as np

a = np.arange(12).reshape(3, 4).astype(np.int64)
print(f"Original strides: {a.strides}")     # (32, 8)

# Transpose
print(f"Transposed strides: {a.T.strides}") # (8, 32), flipped

# Step
print(f"Every other row: {a[::2].strides}") # (64, 8), 2x row stride

# broadcast_to: zero-stride trick
src = np.array([1, 2, 3, 4], dtype=np.int64)
b = np.broadcast_to(src, (5, 4))
print(f"Broadcast strides: {b.strides}")    # (0, 8), row stride is 0!
# b looks like 5 repeated rows, but only 4 elements exist in memory
print(f"src nbytes: {src.nbytes}")          # 4*8 = 32 bytes actually stored
print(f"b nbytes: {b.nbytes}")              # 160 = logical size (5*4*8), not allocated memory

# Sliding window using a stride trick.
# as_strided does no bounds checking: a wrong shape or stride reads invalid memory.
def sliding_window(arr, window_size):
    n_windows = arr.shape[0] - window_size + 1
    return np.lib.stride_tricks.as_strided(
        arr,
        shape=(n_windows, window_size),
        strides=(arr.strides[0], arr.strides[0]),
    )

x = np.arange(10)
windows = sliding_window(x, 3)
print(windows)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  ...
#  [7 8 9]]
# 8 windows over a single buffer!

Stride tricks: magic without copying.
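Because `as_strided` performs no bounds checking, a safer route for the same sliding-window idea is `sliding_window_view`, available in `numpy.lib.stride_tricks` from NumPy 1.20 onward. A minimal sketch:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(10)
windows = sliding_window_view(x, window_shape=3)

print(windows.shape)                            # (8, 3)
print(windows[0].tolist(), windows[-1].tolist())  # [0, 1, 2] [7, 8, 9]
print(np.shares_memory(x, windows))             # True: still a view, no copy
print(windows.flags.writeable)                  # False: read-only by default
```

The read-only default is a deliberate safety choice: the windows overlap in memory, so writing to one element would change several windows at once.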

3. Row-major (C) vs Column-major (F)

How do you write a 2D matrix into 1D memory? The strides below are for an M×N array.

Row-major (C order): `a[0,0], a[0,1], a[0,2], ..., a[1,0], a[1,1], ...` Rows are stored one after another. This is the NumPy default. Strides: `(N×itemsize, itemsize)`

Column-major (F order): `a[0,0], a[1,0], a[2,0], ..., a[0,1], a[1,1], ...` Columns are stored one after another. This is the default in Fortran and MATLAB. Strides: `(itemsize, M×itemsize)`
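The two layouts can be inspected side by side. This sketch builds the same 3×4 matrix in both orders and prints the strides and the actual in-memory element order; `ravel(order='K')` reads elements in the order they sit in memory.

```python
import numpy as np

M, N = 3, 4
a_c = np.arange(M * N, dtype=np.int64).reshape(M, N)   # C order (default)
a_f = np.asfortranarray(a_c)                           # same values, F order

print(a_c.strides)   # (32, 8) = (N*itemsize, itemsize)
print(a_f.strides)   # (8, 24) = (itemsize, M*itemsize)

# Memory order: rows first for C, columns first for F
print(a_c.ravel(order='K').tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(a_f.ravel(order='K').tolist())  # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]

print(np.array_equal(a_c, a_f))       # True: same logical matrix
```

Indexing gives identical results for both arrays; only the physical order, and therefore the cache behavior, differs.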

Why does it matter?

The CPU cache benefits from sequential access. If you traverse a C-order array in C style (`for i: for j: a[i,j]`), it is very fast because of cache hits. If you traverse it in F style, it is slow.

A common bug in LLM code: a tensor arrives in F order (e.g., from some CuPy operations) and is then processed in C style, which can mean a 5-10x slowdown.

import numpy as np

a_c = np.zeros((1000, 1000), order='C')   # row-major
a_f = np.zeros((1000, 1000), order='F')   # column-major

# Same operation, different speed (IPython/Jupyter %timeit).
# Timings are illustrative; they depend on hardware and NumPy version.
%timeit a_c.sum(axis=0)                    # 1.5 ms (column sums, F-friendly)
%timeit a_f.sum(axis=0)                    # 0.5 ms, faster for F
%timeit a_c.sum(axis=1)                    # 0.5 ms, faster for C
%timeit a_f.sum(axis=1)                    # 1.5 ms

4. View vs Copy: Silent Performance Traps

View: new metadata over the same data buffer. No extra memory, fast. Copy: a new data buffer. Costs memory and time.

Which is which?

Operation	Result
`a.T`, `a.transpose()`	View
Simple slice `a[1:5]`	View
`a.reshape()` (when possible)	View
`np.ravel(a)` (when possible)	View
`a.copy()`	Copy
Fancy indexing `a[[1,3,5]]`	Copy
Boolean mask `a[a > 0]`	Copy
`np.concatenate`, `np.stack`	Copy
Arithmetic: `a + b`	Copy (new array)
In-place: `a += b`	No new array (writes into `a` itself)

Danger

If you modify a view, the original changes:

a = np.arange(10)
b = a[3:7]           # view
b[0] = 999
print(a)             # [  0   1   2 999   4   5   6   7   8   9]: a changed!

Bug source: while modifying a tensor during data preprocessing, you silently affect other places.

Check

print(a.base is None)    # True → root array
print(b.base is a)       # True → b is a view of a
print(np.may_share_memory(a, b))  # True

python

import numpy as np

a = np.arange(20).reshape(4, 5)

# View: slice
b = a[1:3, :]
print(f"b shares memory with a: {np.shares_memory(a, b)}")  # True
b[0, 0] = -1
print(a[1, 0])                          # -1 (a was affected!)

# Copy: fancy indexing
a = np.arange(20).reshape(4, 5)
c = a[[1, 3], :]                       # fancy indexing
print(f"c shares memory with a: {np.shares_memory(a, c)}")  # False (copy)
c[0, 0] = -1
print(a[1, 0])                          # 5 (a was not affected)

# Reshape: sometimes a view, sometimes a copy
d = a.reshape(2, 10)
print(f"d view? {np.shares_memory(a, d)}")   # True (a is contiguous, so a view)

e = a.T                                 # transpose → strides are swapped
f = e.reshape(2, 10)                   # the transpose is not C-contiguous
print(f"f view? {np.shares_memory(e, f)}")   # False (a copy was needed)

# Note: `x.base is a` is unreliable here, because `a` is itself a view of the
# 1D arange buffer and NumPy points .base at the array that owns the memory.
# Best practice: if unsure, check np.shares_memory. OWNDATA only tells you
# whether an array owns its buffer.
print(d.flags['OWNDATA'])              # False (view)

The hard-to-catch consequences of view vs copy.

5. Contiguous: Why Does It Matter?

An array is contiguous if its elements are laid out in order in memory, with no gaps.

C-contiguous: in row-major order (default)
F-contiguous: in column-major order
Non-contiguous: neither of the two (e.g., `a[::2]` skips elements)

Why does it matter?

`reshape()` is only guaranteed to return a view for contiguous arrays, and PyTorch's `view()` raises an error when the layout is incompatible
Tensor cores and BLAS expect contiguous data, so non-contiguous inputs are typically copied first (slow)
CUDA kernels usually require contiguous input

Practice

a = np.arange(20).reshape(4, 5)
print(a.flags['C_CONTIGUOUS'])    # True
print(a.T.flags['C_CONTIGUOUS'])  # False: the transpose is not C-contiguous
print(a.T.flags['F_CONTIGUOUS'])  # True: contiguous in F order

# Force contiguity (makes a copy)
a_T_contig = np.ascontiguousarray(a.T)

In PyTorch, `x.is_contiguous()` and `x.contiguous()` express the same idea.
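A strided slice is the typical case that is neither C- nor F-contiguous. The sketch below, with an explicitly chosen int64 dtype so the stride values are predictable, shows the flags before and after `np.ascontiguousarray`.

```python
import numpy as np

a = np.arange(20, dtype=np.int64).reshape(4, 5)
s = a[:, ::2]                          # every other column: a strided view

print(s.flags['C_CONTIGUOUS'], s.flags['F_CONTIGUOUS'])  # False False
print(s.strides)                       # (40, 16): gaps between elements

s_contig = np.ascontiguousarray(s)     # copies into a fresh, packed buffer
print(s_contig.flags['C_CONTIGUOUS'])  # True
print(s_contig.strides)                # (24, 8)
print(np.shares_memory(a, s_contig))   # False: this is a copy
```

The copy is the price of contiguity, so it pays to do it once, outside any loop, rather than letting each downstream operation copy implicitly.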

6. Broadcasting Rules in Depth

We introduced broadcasting in Lesson 1.1. Now the full rules, plus the pitfalls.

Official rules (from numpy.org)

When comparing two arrays, align their dimensions starting from the right. For each dimension:

The dimensions are equal → OK
One of them is 1 → the size-1 side is "stretched" to match the other (without repeating the data!)
A dimension is missing → it is treated as if it were 1
None of the above → `ValueError`
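The four rules fit in a few lines of plain Python. The sketch below is an illustrative reimplementation (the function name `broadcast_shape` is made up), cross-checked against NumPy's own `np.broadcast_shapes`, which requires NumPy 1.20 or later.

```python
from itertools import zip_longest
import numpy as np

def broadcast_shape(s1, s2):
    """Apply the broadcasting rules to two shapes, aligned from the right."""
    out = []
    # A missing leading dimension is treated as 1 (fillvalue=1)
    for d1, d2 in zip_longest(reversed(s1), reversed(s2), fillvalue=1):
        if d1 == d2 or d2 == 1:
            out.append(d1)           # equal, or the second side stretches
        elif d1 == 1:
            out.append(d2)           # the first side stretches
        else:
            raise ValueError(f"cannot broadcast {s1} with {s2}")
    return tuple(reversed(out))

print(broadcast_shape((32, 768), (768,)))        # (32, 768)
print(broadcast_shape((5, 1), (5,)))             # (5, 5)
print(broadcast_shape((1, 4096), (1024, 4096)))  # (1024, 4096)

# Cross-check against NumPy
print(broadcast_shape((5, 1), (5,)) == np.broadcast_shapes((5, 1), (5,)))  # True

try:
    broadcast_shape((3, 256, 256), (3,))
except ValueError as err:
    print(err)                       # cannot broadcast (3, 256, 256) with (3,)
```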

A lesser-known detail: how does the stretch work?

When a (1, N) array is broadcast to (M, N), the data is not repeated. NumPy uses a stride trick: the stride of the stretched dimension is set to 0, as if the same row were being read M times.

This is why broadcasting is fast and memory-efficient. When you add a (1, 4096) bias to a (1024, 4096) hidden state, no expanded 1024×4096 copy of the bias is created; only the output array is allocated.

Pitfalls

Pitfall 1: Silent outer product

a = np.arange(5).reshape(5, 1)    # (5, 1)
b = np.arange(5)                  # (5,) → (1, 5)
result = a + b                    # (5, 5), outer-like!

This may not be what you wanted. Had you written `a + b.reshape(-1, 1)`, the result would have been (5, 1).

Pitfall 2: Memory blow-up

`(1, 1000000) + (1000000, 1)` → (1M, 1M), which is 8 TB of RAM in float64. No shape error warns you; you just run out of memory.

Pitfall 3: Wrong axis

img = np.random.rand(3, 256, 256)   # (channels, H, W)
mean = img.mean(axis=(1, 2))         # (3,)
normalized = img - mean              # ERROR: ValueError, (3, 256, 256) and (3,) do not broadcast
# Correct: img - mean.reshape(3, 1, 1) or mean[:, None, None]

python

import numpy as np

# Classic: adding a bias
h = np.random.randn(32, 768)         # (batch, hidden)
bias = np.random.randn(768)           # (hidden,)
out = h + bias                        # OK: (768,) is treated as (1, 768) -> (32, 768)
print(out.shape)                      # (32, 768)

# Per-channel normalization
img = np.random.rand(3, 256, 256)    # (C, H, W)
mean = img.mean(axis=(1, 2), keepdims=True)  # (3, 1, 1), thanks to keepdims!
std = img.std(axis=(1, 2), keepdims=True)
normalized = (img - mean) / (std + 1e-8)
print(normalized.shape)               # (3, 256, 256)

# Outer product (deliberately)
x = np.arange(4)
y = np.arange(3)
outer = x[:, None] * y[None, :]      # (4, 1) * (1, 3) → (4, 3)
print(outer)

# Memory check
a = np.zeros((1, 10000))
b = np.zeros((10000, 1))
# (a + b).shape = (10000, 10000) → 800 MB
# Careful!

Real-world broadcasting examples and pitfalls.

7. Fancy Indexing: View or Copy?

NumPy supports three kinds of indexing:

Basic indexing → View

`a[0]`, `a[1:3]`, `a[:, 2]`, `a[::2]`, `a[..., 0]`
Slices, ellipsis, integers, and newaxis

Fancy / advanced indexing → Copy

`a[[0, 2, 5]]` (integer array)
`a[a > 0]` (boolean mask)
`a[[0, 1, 2], [1, 2, 3]]` (pairwise indexing)

Mixed → Copy

`a[1:3, [0, 2, 5]]`: slice + fancy. As soon as any index is an array, the result is a copy
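The three categories can be verified with `np.shares_memory`. A small sketch on a made-up 5×6 array:

```python
import numpy as np

a = np.arange(30).reshape(5, 6)

results = {
    "basic  a[1:3, ::2]":       a[1:3, ::2],
    "fancy  a[[0, 2, 4]]":      a[[0, 2, 4]],
    "mask   a[a > 10]":         a[a > 10],
    "mixed  a[1:3, [0, 2, 5]]": a[1:3, [0, 2, 5]],
}
for name, r in results.items():
    # Only the basic slice shares memory with a; the other three are copies
    print(f"{name:26s} shape={r.shape!s:8s} shares memory: {np.shares_memory(a, r)}")
```

Only the first line reports `True`. Writing into any of the other three results leaves `a` unchanged.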

Important use case: lookup

An LLM embedding lookup is exactly this:

vocab_size = 50000
d_model = 4096
embedding = np.random.randn(vocab_size, d_model).astype(np.float32)

# token IDs (each must be < vocab_size)
ids = np.array([1, 7, 42, 100, 49999])
# Embed each
embeds = embedding[ids]                # fancy indexing → (5, 4096), copy
print(embeds.shape)                    # (5, 4096)

In PyTorch, `F.embedding` is exactly this: an optimized form of fancy indexing.

8. BLAS Backend: Why Is Matrix Multiplication So Fast?

NumPy does not actually perform most heavy linear algebra itself. Behind the scenes it calls BLAS (Basic Linear Algebra Subprograms) or LAPACK.

BLAS levels

Level 1: vector ops (axpy, dot)
Level 2: matrix-vector (gemv)
Level 3: matrix-matrix (gemm) ← the heart of an LLM

Implementations

OpenBLAS: open source, multi-threaded
Intel MKL: typically the fastest on Intel CPUs (the Anaconda default)
Apple Accelerate: on macOS
cuBLAS: NVIDIA GPU
rocBLAS: AMD GPU

Why does it matter?

A 4096×4096 matrix multiply takes on the order of hours with a naive Python loop. With NumPy + OpenBLAS it takes a fraction of a second on a modern multi-core CPU. The difference: cache-aware tiling, SIMD, and multi-threading.
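The gap is visible even at a small size. The sketch below compares a naive triple loop with NumPy's `@` on 48×48 matrices (a size chosen only so the loop finishes quickly); the timings it prints depend on the machine, so no specific numbers are claimed here.

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Triple-loop matrix multiply in pure Python (no BLAS)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]
            C[i, j] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((48, 48))
B = rng.standard_normal((48, 48))

t0 = time.perf_counter()
C_naive = naive_matmul(A, B)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
C_blas = A @ B                        # NumPy's BLAS-backed matrix multiply
t_blas = time.perf_counter() - t0

print(np.allclose(C_naive, C_blas))   # True: same result
print(f"naive: {t_naive:.4f} s, NumPy: {t_blas:.6f} s")
```

The work grows with the cube of the matrix size, so going from 48 to 4096 multiplies the loop count by roughly 600,000, which is where the "hours" estimate comes from.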

Check

import numpy as np

np.show_config()
# The output format depends on the NumPy version; older versions print e.g.:
# blas_info:
#   libraries = ['openblas']
#   library_dirs = ['/usr/lib/x86_64-linux-gnu']

PyTorch equivalent

PyTorch CPU: the same BLAS. PyTorch GPU: cuBLAS / cuDNN. Extra: fused operations with `torch.compile`.

9. Einsum vs Einops: Modern Tensor Algebra

We introduced einsum in Lesson 1.1. Now let's compare it with an alternative: einops (by Alex Rogozhnikov).

Einsum

Notation: `"input1,input2->output"`
Strong: ideal for contractions (sum-product)
Weak: the string gets complicated for pure shape manipulation
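As an illustration of the contraction case, here is a sketch of attention-style scores written with `np.einsum` and checked against an explicit transpose plus matmul. The shapes and index letters are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, T, d = 2, 4, 16, 64
q = rng.standard_normal((B, H, T, d))
k = rng.standard_normal((B, H, T, d))

# Contract over the shared index d:
# scores[b, h, i, j] = sum over d of q[b, h, i, d] * k[b, h, j, d]
scores = np.einsum('bhid,bhjd->bhij', q, k)

# Same computation with an explicit transpose and batched matmul
reference = q @ k.transpose(0, 1, 3, 2)

print(scores.shape)                    # (2, 4, 16, 16)
print(np.allclose(scores, reference))  # True
```

The einsum string states which index is summed away, whereas the matmul version requires remembering which two axes to swap.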

Einops

Three functions: `rearrange`, `reduce`, `repeat`
Notation: human-readable, e.g. `"batch head seq d -> batch seq (head d)"`
Strong: combinations of reshape, permute, and expand
Weak: not as good as einsum for contractions

Side by side

import numpy as np
from einops import rearrange, reduce, repeat

# Merge the multi-head attention output
x = np.random.randn(2, 4, 16, 64)  # (B, H, T, d)

# Einsum way (transpose, then reshape):
# first permute to (B, T, H, d), then reshape to (B, T, H*d)
out_einsum = np.einsum('bhtd->bthd', x).reshape(2, 16, -1)

# Einops way: explicit and clear
out_einops = rearrange(x, 'b h t d -> b t (h d)')
print(out_einops.shape)            # (2, 16, 256)
print(np.array_equal(out_einsum, out_einops))  # True

# Reduce: spatial mean
img = np.random.randn(32, 3, 224, 224)  # (B, C, H, W)
pooled = reduce(img, 'b c h w -> b c', 'mean')
print(pooled.shape)                # (32, 3)

Recommendation

Contraction (sum-product, dot products): `einsum`
Shape gymnastics (rearrange, broadcast): `einops`
LLM engineers use both together

10. Differences Between NumPy and PyTorch

They are very similar, but there are important differences:

Topic	NumPy	PyTorch
GPU support	None	Native
Autograd	None	Native
Mixed precision	Manual	autocast
In-place ops	Fast	Fast, but need care with autograd
Default float	float64	float32
Does reshape need contiguous data?	Sometimes	Sometimes; `.reshape` copies automatically
`@` operator	Works	Works
Boolean indexing	Not a view	Not a view
Broadcasting	Same rules	Same rules
Compatibility	NumPy ndarray	NumPy ↔ Tensor: `.numpy()`, `torch.from_numpy()`

Gotcha: dtype

import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])
print(a.dtype)                     # float64 ← NumPy default

t = torch.from_numpy(a)
print(t.dtype)                     # torch.float64
# If your model is float32, this is a type mismatch!
t = t.float()                       # cast it

11. Mini Exercises

Strides calculation: a float32 array with shape `(2, 3, 4)`. What are the strides? What is the total memory size?
View vs copy: `a[1:3, [0, 2]]` (slice + fancy). View or copy? Why?
Broadcasting pitfall: what does `(32, 1, 768) + (768,)` return? What does `(32, 1, 768) + (32, 768)` return?
Embedding lookup: `vocab_size=50000`, `d_model=4096`, int64 `ids` of shape `(B=2, T=128)`. What are the shape and memory footprint of the embedded result?
Einsum vs einops: write `(B, H, T, d)` → `(B, T, H*d)` with einsum, then with einops. Is there a performance difference?

What Did We Learn in This Lesson?

✓ ndarray = data buffer + metadata (shape, dtype, strides, offset)
✓ The element access formula based on strides
✓ Row-major (C) vs column-major (F): cache efficiency
✓ View vs copy: silent performance traps
✓ Contiguous: why, how, when
✓ Broadcasting rules in depth, plus pitfalls
✓ Fancy indexing → copy
✓ BLAS backend: gemm doing billions of multiplications in milliseconds
✓ Einsum vs einops: modern tensor algebra
✓ Differences between NumPy and PyTorch

Next Lesson

2.2: Computational Graph in Depth: DAG, Topological Sort, Eager vs Static. We will detail the graph structure underneath autograd: the forward + backward DAG, the different variants of topological sort, and the eager (PyTorch) vs static (TF/JAX) graph paradigms.

Frequently Asked Questions

NumPy defaults to float64; is this a problem in LLM work?

No, just be careful. When preprocessing in NumPy and handing data to PyTorch, cast to float32: `tensor.float()` or `np.array(..., dtype=np.float32)`. Standard ML uses float32; LLMs use float16/bfloat16/fp8. Float64 is only useful for numerically sensitive preprocessing (e.g., eigendecomposition of a covariance matrix).
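A short sketch of what the cast changes on the NumPy side: memory halves, and precision drops to roughly 7 significant digits. The array contents here are arbitrary.

```python
import numpy as np

x64 = np.linspace(0.0, 1.0, 1_000_000)   # float64 by default
x32 = x64.astype(np.float32)             # explicit cast before handing to a model

print(x64.dtype, x64.nbytes)             # float64 8000000
print(x32.dtype, x32.nbytes)             # float32 4000000

# float32 cannot represent every integer above 2**24
print(np.float32(16_777_217) == np.float32(16_777_216))   # True
print(np.float64(16_777_217) == np.float64(16_777_216))   # False
```

The last two lines are the reason float64 remains useful for numerically sensitive preprocessing: values that differ in float64 can collapse to the same float32 number.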
