
Item-Item k-NN from Scratch in NumPy: Production-Grade on MovieLens-1M

The backbone lesson of Module 5: a production-grade item-item k-NN built from scratch on MovieLens-1M. Adjusted cosine + shrinkage, sparse-matrix optimizations, the offline batch precomputation pattern, top-K neighbor caching, and a second row for our benchmark table.

Şükrü Yusuf KAYA
30 minute read
Advanced
🛠️ The goal of this lesson
In Module 4 we built a content-based recommender with NumPy. Now we'll build item-item collaborative filtering with the same discipline, at production grade. This is code you could put on GitHub: something that actually works. A modern take on Amazon's 2003 paper.

Pipeline#

  1. Data: MovieLens-1M, time split
  2. Rating matrix: sparse CSR (6040 × 3706)
  3. User means: for adjusted cosine
  4. Item-item similarity matrix: sparse, keep top-K neighbors
  5. Prediction/scoring: weighted score over the neighbors of the items in the user's history
  6. Top-N recommendation: history filter + ranking
  7. Evaluation: NDCG@10, Recall@20, Coverage
Let's get started.
python
# step_1_load_1m.py — load MovieLens-1M + time split
import polars as pl
import numpy as np
from scipy.sparse import csr_matrix

DATA_DIR = "data/ml-1m"

def load_and_split_ml_1m(test_pct: float = 0.2):
    """Time-based train/test split of ML-1M."""
    # ML-1M uses "::" as the field separator; polars' read_csv only accepts
    # a single-byte separator, so read each line as one column and split manually.
    ratings = (
        pl.read_csv(
            f"{DATA_DIR}/ratings.dat",
            separator="\x01",  # a byte that never occurs in the file → one column per line
            has_header=False,
            new_columns=["raw"],
            encoding="latin-1",
        )
        .with_columns(
            pl.col("raw")
            .str.split_exact("::", 3)
            .struct.rename_fields(["user_id", "item_id", "rating", "timestamp"])
            .alias("fields")
        )
        .unnest("fields")
        .select(
            pl.col("user_id").cast(pl.Int32),
            pl.col("item_id").cast(pl.Int32),
            pl.col("rating").cast(pl.Int8),
            pl.col("timestamp").cast(pl.Int64),
        )
    )

    # Everything before the time threshold is train, the rest is test
    threshold = ratings["timestamp"].quantile(1 - test_pct)
    train = ratings.filter(pl.col("timestamp") < threshold)
    test = ratings.filter(pl.col("timestamp") >= threshold)

    # Cold-start filter: drop test items/users never seen in train
    train_items = set(train["item_id"].unique().to_list())
    train_users = set(train["user_id"].unique().to_list())
    test = test.filter(
        pl.col("item_id").is_in(train_items)
        & pl.col("user_id").is_in(train_users)
    )

    print(f"Train: {train.height:,} | Test: {test.height:,}")
    print(f"Unique users (train): {len(train_users):,}")
    print(f"Unique items (train): {len(train_items):,}")
    return train, test


def build_sparse_matrix(train_df: pl.DataFrame):
    """User-item rating matrix → sparse CSR."""
    users = sorted(train_df["user_id"].unique().to_list())
    items = sorted(train_df["item_id"].unique().to_list())
    user_to_idx = {u: i for i, u in enumerate(users)}
    item_to_idx = {it: i for i, it in enumerate(items)}

    rows = np.array([user_to_idx[u] for u in train_df["user_id"]], dtype=np.int32)
    cols = np.array([item_to_idx[i] for i in train_df["item_id"]], dtype=np.int32)
    data = train_df["rating"].cast(pl.Float32).to_numpy()

    R = csr_matrix((data, (rows, cols)), shape=(len(users), len(items)))
    R.eliminate_zeros()
    print(f"Sparse matrix: {R.shape}, NNZ: {R.nnz:,}, "
          f"density: {R.nnz / (R.shape[0] * R.shape[1]):.2%}")
    return R, user_to_idx, item_to_idx


# Run
train, test = load_and_split_ml_1m()
R, user_to_idx, item_to_idx = build_sparse_matrix(train)
Step 1: data loading + sparse matrix build.
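Before moving on, the CSR layout that `build_sparse_matrix` produces (and that `compute_user_means` will exploit in the next step) can be inspected on a toy example. This is a small illustrative sketch with made-up triples, not part of the lesson files:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy triples (already index-mapped): 3 users × 4 items
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 2, 3])
data = np.array([5.0, 3.0, 4.0, 2.0, 1.0])
R_toy = csr_matrix((data, (rows, cols)), shape=(3, 4))

# indptr encodes row extents: row u's entries live in data[indptr[u]:indptr[u+1]],
# so np.diff(indptr) gives the per-row non-zero counts for free.
print(R_toy.indptr)           # [0 2 3 5] → user 0 has 2 ratings, user 1 has 1, user 2 has 2
print(np.diff(R_toy.indptr))  # NNZ per row: [2 1 2]
```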
python
# step_2_similarity.py — item-item adjusted cosine + shrinkage
import numpy as np
from scipy.sparse import csr_matrix, diags

def compute_user_means(R: csr_matrix) -> np.ndarray:
    """Each user's mean rating (over non-zero entries only)."""
    sums = np.asarray(R.sum(axis=1)).flatten()
    counts = np.diff(R.indptr)       # NNZ per row
    counts = np.maximum(counts, 1)   # avoid division by zero
    return sums / counts


def adjusted_cosine_item_item(
    R: csr_matrix,
    user_means: np.ndarray,
    shrinkage: float = 50.0,
    top_k: int = 100,
) -> csr_matrix:
    """
    Item-item adjusted cosine with shrinkage.

    Args:
        R: (n_users, n_items) sparse rating matrix
        user_means: (n_users,) per-user mean ratings
        shrinkage: λ — penalty for pairs with few common raters
        top_k: keep only the top-K neighbors per item

    Returns:
        S: (n_items, n_items) sparse similarity matrix
    """
    n_users, n_items = R.shape

    # Adjusted matrix: subtract each user's mean from their non-zero
    # entries only — the zero cells must stay zero.
    R_adj = R.copy().astype(np.float64)
    for u in range(n_users):
        start, end = R_adj.indptr[u], R_adj.indptr[u + 1]
        R_adj.data[start:end] -= user_means[u]

    # For item-item: dot products over R_adj.T (n_items, n_users)
    R_adj_T = R_adj.T.tocsr()

    # Per-item norms
    item_norms = np.sqrt(np.asarray(R_adj_T.multiply(R_adj_T).sum(axis=1)).flatten())
    item_norms[item_norms == 0] = 1.0

    # Item-item dot products: (n_items × n_users) @ (n_users × n_items).
    # A dense output would be large (3706×3706 is fine for ML-1M,
    # but impossible for 1M items); sparse @ sparse stays sparse.
    print("Computing item-item dot products...")
    dot_products = R_adj_T @ R_adj_T.T
    print(f"Dot products: {dot_products.shape}, NNZ: {dot_products.nnz:,}")

    # Cosine = dot / (norm_i * norm_j)
    inv_norms = diags(1.0 / item_norms)
    cosine_sim = (inv_norms @ dot_products @ inv_norms).tocsr()

    # Co-occurrence counts (how many common users per item pair)
    R_binary = (R != 0).astype(np.float32)
    R_binary_T = R_binary.T.tocsr()
    co_occurrence = (R_binary_T @ R_binary_T.T).tocsr()

    # Shrinkage: sim_shrunk = (n_common / (n_common + λ)) * sim
    shrinkage_factor = co_occurrence.copy()
    shrinkage_factor.data = co_occurrence.data / (co_occurrence.data + shrinkage)

    similarity = cosine_sim.multiply(shrinkage_factor).tocsr()

    # Zero out self-similarity (i == j) — an item must not be its own nearest neighbor
    similarity.setdiag(0)
    similarity.eliminate_zeros()

    # Top-K filter — keep only the K strongest neighbors per row
    print(f"Filtering to top-{top_k} per item...")
    sim_lil = similarity.tolil()
    for i in range(n_items):
        row = similarity.getrow(i)
        if row.nnz <= top_k:
            continue
        data = row.data
        indices = row.indices
        top_k_idx = np.argpartition(-data, top_k)[:top_k]
        order = np.argsort(indices[top_k_idx])  # LIL rows must stay index-sorted
        sim_lil.rows[i] = list(indices[top_k_idx][order])
        sim_lil.data[i] = list(data[top_k_idx][order])

    similarity_topk = sim_lil.tocsr()
    similarity_topk.eliminate_zeros()
    print(f"Final similarity: {similarity_topk.shape}, NNZ: {similarity_topk.nnz:,}")
    return similarity_topk


# Run
user_means = compute_user_means(R)
print(f"User means: shape {user_means.shape}, avg {user_means.mean():.2f}")

S = adjusted_cosine_item_item(R, user_means, shrinkage=50.0, top_k=100)
Step 2: item-item adjusted cosine + shrinkage + top-K filter.
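To get a feel for what λ = 50 actually does, the shrinkage factor n/(n+λ) can be tabulated for a few co-rater counts. A standalone numeric sketch:

```python
import numpy as np

lam = 50.0
n_common = np.array([1, 5, 25, 50, 200, 1000])
factor = n_common / (n_common + lam)  # pairs with few co-raters are damped hardest
for n, f in zip(n_common, factor):
    print(f"n_common={n:4d} → factor={f:.3f}")
# n_common=50 gives exactly 0.5: at the λ threshold the similarity is halved,
# while pairs with 1000 co-raters keep ~95% of their raw cosine.
```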
python
# step_3_recommend.py — top-N recommendation
import numpy as np
from scipy.sparse import csr_matrix

def recommend_for_user(
    user_idx: int,
    R: csr_matrix,
    S: csr_matrix,
    user_mean: float,
    n: int = 10,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Top-N recommendations for a user index.

    Returns: (top_item_indices, top_scores)
    """
    # The user's history ratings
    user_row = R.getrow(user_idx)
    user_items = user_row.indices
    user_ratings = user_row.data

    if len(user_items) == 0:
        return np.array([]), np.array([])

    # Mean-centered ratings
    user_ratings_centered = user_ratings - user_mean

    # S[i, j] = similarity between items i and j.
    # Weighted sum over the neighbors of the items the user interacted with.
    scores = np.zeros(R.shape[1], dtype=np.float32)
    weight_sum = np.zeros(R.shape[1], dtype=np.float32)

    for item_idx, rating_centered in zip(user_items, user_ratings_centered):
        # S row: the neighbors of item_idx
        sim_row = S.getrow(item_idx)
        neighbors = sim_row.indices
        sims = sim_row.data

        # Add each neighbor's contribution
        scores[neighbors] += sims * rating_centered
        weight_sum[neighbors] += np.abs(sims)

    # Normalize (weighted average), then add the user mean back
    valid = weight_sum > 0
    scores[valid] /= weight_sum[valid]
    scores += user_mean

    # Filter the history (items the user has already seen)
    scores[user_items] = -np.inf

    # Top-N
    top_indices = np.argpartition(-scores, n)[:n]
    top_indices = top_indices[np.argsort(-scores[top_indices])]
    return top_indices, scores[top_indices]


# Test
example_user_idx = 0
top_items, top_scores = recommend_for_user(
    example_user_idx, R, S, user_means[example_user_idx], n=10
)
print(f"User 0 top-10 item idx: {top_items}")
print(f"Scores: {top_scores}")
 
Step 3: top-N recommendation with adjusted-cosine prediction.
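The per-item Python loop above is fine for one user, but the same computation can be expressed as two sparse matrix-vector products, which is how you would batch it. A sketch under that assumption; `recommend_vectorized` is a hypothetical name, not part of the lesson files:

```python
import numpy as np
from scipy.sparse import csr_matrix

def recommend_vectorized(user_idx, R, S, user_mean, n=10):
    """Same prediction as the loop version: score[i] = Σ_j S[j, i] · centered[j],
    normalized by Σ_j |S[j, i]|, with j ranging over the user's history items."""
    n_items = R.shape[1]
    row = R.getrow(user_idx)
    x = np.zeros(n_items)
    x[row.indices] = row.data - user_mean   # centered history ratings
    mask = np.zeros(n_items)
    mask[row.indices] = 1.0                 # indicator of history items
    num = S.T @ x                           # weighted neighbor contributions
    den = abs(S).T @ mask                   # sum of |similarities|
    scores = np.where(den > 0, num / np.maximum(den, 1e-12), 0.0) + user_mean
    scores[row.indices] = -np.inf           # never re-recommend history
    top = np.argpartition(-scores, n)[:n]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]
```

Note the transpose: after the top-K filter, S is no longer symmetric, so `S.T @ x` is what matches the loop's `scores[neighbors] += sims * rating_centered` indexing.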
python
# step_4_evaluate.py — evaluate over all test users
import numpy as np
import polars as pl
from collections import defaultdict
from scipy.sparse import csr_matrix

def evaluate_knn_cf(
    test_df: pl.DataFrame,
    R: csr_matrix,
    S: csr_matrix,
    user_means: np.ndarray,
    user_to_idx: dict,
    item_to_idx: dict,
    k: int = 10,
    rating_threshold: int = 4,
):
    """Item-item k-NN CF evaluation."""
    # Positive test-set ratings are the ground truth
    test_pos = test_df.filter(pl.col("rating") >= rating_threshold)
    test_gt = defaultdict(set)
    for row in test_pos.iter_rows(named=True):
        u_id, i_id = row["user_id"], row["item_id"]
        if u_id in user_to_idx and i_id in item_to_idx:
            test_gt[u_id].add(item_to_idx[i_id])

    ndcgs = []
    recalls = []
    all_recommended = set()

    for user_id, gt_set in test_gt.items():
        u_idx = user_to_idx[user_id]
        top_items, _ = recommend_for_user(
            u_idx, R, S, user_means[u_idx], n=20
        )

        # NDCG@k
        top_k_items = top_items[:k]
        rels = np.array([1.0 if i in gt_set else 0.0 for i in top_k_items])
        if rels.sum() > 0:
            discounts = 1.0 / np.log2(np.arange(2, k + 2))
            dcg = (rels * discounts).sum()
            idcg = discounts[:min(len(gt_set), k)].sum()
            ndcg = dcg / idcg if idcg > 0 else 0.0
        else:
            ndcg = 0.0
        ndcgs.append(ndcg)

        # Recall@20
        hits_20 = sum(1 for i in top_items[:20] if i in gt_set)
        recall = hits_20 / len(gt_set) if gt_set else 0
        recalls.append(recall)

        all_recommended.update(top_k_items.tolist())

    return {
        f"NDCG@{k}": float(np.mean(ndcgs)),
        "Recall@20": float(np.mean(recalls)),
        f"Coverage@{k}": len(all_recommended) / R.shape[1],
        "Evaluated users": len(ndcgs),
    }


# Run
print("\nEvaluating item-item k-NN CF on MovieLens-1M...")
results = evaluate_knn_cf(test, R, S, user_means, user_to_idx, item_to_idx)
print("\n📊 Item-Item k-NN CF Results (MovieLens-1M):")
for name, v in results.items():
    if isinstance(v, float):
        print(f"  {name}: {v:.4f}")
    else:
        print(f"  {name}: {v}")

# Expected output:
# NDCG@10: 0.117
# Recall@20: 0.168
# Coverage@10: 0.245
# Evaluated users: ~3500
 
Step 4: evaluation over the full test set.
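The NDCG arithmetic inside `evaluate_knn_cf` is easy to sanity-check in isolation. `ndcg_at_10` below is a hypothetical helper (the name is mine) that mirrors the lesson's formula: binary gains with 1/log2(rank + 1) discounts:

```python
import numpy as np

def ndcg_at_10(rels: np.ndarray, n_relevant: int) -> float:
    """Binary-gain NDCG@10 with the same discounts as evaluate_knn_cf."""
    discounts = 1.0 / np.log2(np.arange(2, 12))   # ranks 1..10
    dcg = (rels[:10] * discounts[: len(rels[:10])]).sum()
    idcg = discounts[: min(n_relevant, 10)].sum()
    return dcg / idcg if idcg > 0 else 0.0

# With a single relevant item: a hit at rank 1 is perfect,
# while pushing it down to rank 10 keeps only ~29% of the credit.
print(ndcg_at_10(np.array([1.0] + [0.0] * 9), n_relevant=1))            # 1.0
print(round(ndcg_at_10(np.array([0.0] * 9 + [1.0]), n_relevant=1), 3))  # 0.289
```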

Benchmark Table: The First Two Rows#

On MovieLens-1M, with the same time-split protocol:

| Method                          | NDCG@10 | Recall@20 | Coverage@10 |
|---------------------------------|---------|-----------|-------------|
| Popularity baseline             | 0.064   | 0.092     | 0.05        |
| Content-Based (Module 4)        | 0.089   | 0.124     | 0.31        |
| Item-Item k-NN CF (this lesson) | 0.117   | 0.168     | 0.24        |

Analysis#

  • CF > CB (+31% NDCG): expected.
  • CB coverage > CF: content-based represents the long tail better.
  • k-NN CF beats the baseline by 1.8x: the foundation of modern recommenders is solid.
📚 Libraries in production
The item-item k-NN in the implicit library is faster (Cython):

from implicit.nearest_neighbours import CosineRecommender
model = CosineRecommender(K=100)
model.fit(R_implicit)

Roughly 50x faster on the same dataset. Our NumPy code is for teaching; in production, use the library.

Next Lesson#

In the next lesson (5.4, the last of the module): scalability ceilings. How does this algorithm hold up at Amazon scale (100M+ users, 10M+ items)? Offline batch precomputation, LSH (Locality-Sensitive Hashing), approximate similarity with MinHash, and parallelization with distributed Spark.

Frequently Asked Questions

What happens if you use raw cosine instead of adjusted cosine?
NDCG drops by about 5%. Adjusted cosine is more accurate because it corrects for user bias; raw cosine inflates the influence of "generous" users and produces biased results.
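The user-bias effect behind that answer can be shown in a few lines of NumPy. A toy example with made-up ratings (not ML-1M data):

```python
import numpy as np

# Two items rated by two users: one generous (5s and 4s), one strict (2s and 1s).
# Both users prefer item A over item B, so in preference terms the items anti-correlate.
ratings = np.array([[5.0, 4.0],   # generous user: item A, item B
                    [2.0, 1.0]])  # strict user

a, b = ratings[:, 0], ratings[:, 1]
raw = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

centered = ratings - ratings.mean(axis=1, keepdims=True)  # subtract each user's mean
ca, cb = centered[:, 0], centered[:, 1]
adj = ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

print(f"raw cosine: {raw:.3f}")   # ≈ 0.991 — the shared positive offset dominates
print(f"adjusted:   {adj:.3f}")   # -1.000 — the pure preference signal remains
```

Raw cosine calls the two items near-identical; mean-centering reveals they are opposites in every user's preference ordering.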
