Matrix Factorization
Matrix Factorization is the core technique behind modern RecSys — the team that won the 2009 Netflix Prize relied on matrix factorization models such as SVD++ to beat Netflix's own Cinematch algorithm.
🎯 Objectives
- Understand latent factor models
- Implement SVD for RecSys (Surprise library)
- ALS (Alternating Least Squares)
- NMF and implicit feedback
1. Latent Factor Concept
1.1 Idea
Example

```
User-Item Matrix (sparse):              Decompose into:

         I1  I2  I3  I4                 User Matrix       Item Matrix
User A    5   ?   ?   1        ≈        [u1, u2, u3]      [i1, i2, i3]
User B    ?   4   ?   ?                 [u1, u2, u3]  ×   [i1, i2, i3]
User C    ?   ?   5   4                 [u1, u2, u3]      [i1, i2, i3]
                                          k factors

Rating(A, I3) ≈ dot(User_A_vector, Item_I3_vector)
```

Latent factors can represent genre, style, quality, complexity... The model learns these factors by itself; you do not have to define them up front.
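To make the decomposition concrete, here is a minimal NumPy sketch; the factor values below are invented for illustration rather than learned from data.

```python
import numpy as np

# Toy factors: 3 users x 4 items, k = 2 latent dimensions (values are made up)
U = np.array([[1.2, 0.3],    # User A
              [0.2, 1.1],    # User B
              [0.9, 1.0]])   # User C
V = np.array([[1.1, 0.2],    # Item I1
              [0.1, 1.2],    # Item I2
              [0.8, 1.0],    # Item I3
              [0.3, 1.1]])   # Item I4

# Reconstruct the full rating matrix: (m x k) @ (k x n)
R_hat = U @ V.T
print(R_hat.round(2))

# A single prediction is just the dot product of one user and one item vector
print(f"Rating(A, I3) ≈ {np.dot(U[0], V[2]):.2f}")
```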
1.2 Mathematical Formulation
Example

```
R ≈ U × V^T

R: User-Item matrix (m × n)
U: User matrix (m × k) — k latent factors per user
V: Item matrix (n × k) — k latent factors per item

Predicted rating: r̂(u,i) = u_u · v_i + b_u + b_i + μ

Where:
- u_u: user latent vector
- v_i: item latent vector
- b_u: user bias (some users rate higher)
- b_i: item bias (some items get rated higher)
- μ: global average rating
```
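A tiny worked computation of the biased prediction, with made-up numbers for every term:

```python
import numpy as np

# r̂(u,i) = u_u · v_i + b_u + b_i + μ, all values invented for illustration
mu  = 3.6                      # global average rating
b_u = 0.4                      # this user rates slightly above average
b_i = -0.2                     # this item is rated slightly below average
u_u = np.array([0.9, 0.3])     # user latent vector (k = 2)
v_i = np.array([0.8, 1.0])     # item latent vector (k = 2)

r_hat = np.dot(u_u, v_i) + b_u + b_i + mu
print(f"{r_hat:.2f}")          # 1.02 + 0.4 - 0.2 + 3.6 = 4.82
```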
2. SVD with Surprise Library
2.1 Setup
Python
```python
# pip install scikit-surprise
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, train_test_split
import pandas as pd

# Load data
df = pd.read_csv('ratings.csv')  # columns: user_id, item_id, rating
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)
```

2.2 Train SVD
Python
```python
# Split data
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# SVD model
svd = SVD(
    n_factors=100,   # Latent dimensions
    n_epochs=20,     # Training epochs
    lr_all=0.005,    # Learning rate
    reg_all=0.02,    # Regularization
    random_state=42
)

svd.fit(trainset)
predictions = svd.test(testset)

# Evaluate
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```
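A fitted model can also score a single user-item pair directly with `predict()`; the raw ids below are placeholders for ids that exist in your own ratings.csv.

```python
# Score one user-item pair by raw id; .est is the estimated rating
# ("U001" and "I042" are placeholder ids, not real data)
pred = svd.predict(uid="U001", iid="I042")
print(f"{pred.uid} -> {pred.iid}: {pred.est:.2f}")
```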
2.3 Generate Recommendations
Python
```python
def get_top_n(predictions, n=10):
    """Get top-N recommendations for each user."""
    from collections import defaultdict
    top_n = defaultdict(list)

    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

# Predict all unrated items
trainset_full = data.build_full_trainset()
svd.fit(trainset_full)

# Recommendations for user "U001"
anti_testset = trainset_full.build_anti_testset()
all_predictions = svd.test(anti_testset)
top_n = get_top_n(all_predictions, n=10)

for item_id, predicted_rating in top_n["U001"]:
    print(f"  Item: {item_id}, Predicted: {predicted_rating:.2f}")
```

2.4 Cross-Validation
Python
```python
# K-Fold CV
results = cross_validate(
    SVD(n_factors=100),
    data,
    measures=['RMSE', 'MAE'],
    cv=5,
    verbose=True
)
print(f"Mean RMSE: {results['test_rmse'].mean():.4f}")
```

3. ALS (Alternating Least Squares)
3.1 Concept
Example

```
Optimization Problem:
  min ||R - U × V^T||^2 + lambda * (||U||^2 + ||V||^2)

ALS Strategy:
  Step 1: Fix V, solve for U (a regularized least squares problem)
  Step 2: Fix U, solve for V (a regularized least squares problem)
  Repeat until convergence
```
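A minimal NumPy sketch of the two alternating steps. For brevity it assumes R is fully observed; real ALS implementations only fit against observed (or confidence-weighted) entries.

```python
import numpy as np

def als_step(R, fixed, lam):
    """Solve the regularized least squares problem for one factor matrix,
    holding the other one fixed (fully observed R, for simplicity)."""
    k = fixed.shape[1]
    A = fixed.T @ fixed + lam * np.eye(k)          # (k x k)
    return np.linalg.solve(A, fixed.T @ R.T).T     # back to (rows of R) x k

# Toy fully observed rating matrix (3 users x 4 items)
R = np.array([[5., 3., 1., 4.],
              [4., 4., 2., 3.],
              [1., 2., 5., 4.]])
k, lam = 2, 0.1
rng = np.random.default_rng(42)
U = rng.random((R.shape[0], k))
V = rng.random((R.shape[1], k))

for _ in range(20):
    U = als_step(R, V, lam)        # Step 1: fix V, solve for U
    V = als_step(R.T, U, lam)      # Step 2: fix U, solve for V

print(np.round(U @ V.T, 2))        # low-rank approximation of R
```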
3.2 ALS with Implicit Feedback
Python
```python
# pip install implicit
import implicit
import scipy.sparse as sparse
import numpy as np

# Create sparse user-item matrix (implicit: views, clicks, purchases)
# rows = users, cols = items, values = interaction count
user_item = sparse.csr_matrix(interaction_matrix)

# ALS model
model = implicit.als.AlternatingLeastSquares(
    factors=64,
    regularization=0.01,
    iterations=50,
    random_state=42
)

# Train (implicit >= 0.5 expects the user-item matrix)
model.fit(user_item)

# Recommendations for user 0
ids, scores = model.recommend(
    userid=0,
    user_items=user_item[0],
    N=10,
    filter_already_liked_items=True
)

for item_id, score in zip(ids, scores):
    print(f"  Item {item_id}: {score:.4f}")
```

3.3 Explicit vs Implicit
| | Explicit | Implicit |
|---|---|---|
| Data | Ratings (1-5 stars) | Clicks, views, purchases |
| Availability | Very sparse (few users rate) | Much denser (every click/view counts) |
| Signal | Clear preference | Noisy (click != like) |
| Algorithm | SVD, SVD++ | ALS, BPR |
| Example | MovieLens | E-commerce browsing |
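The table lists BPR (Bayesian Personalized Ranking) as an implicit-feedback algorithm; the implicit library ships an implementation with the same interface as its ALS model. A brief sketch, assuming implicit >= 0.5 and reusing the user_item matrix built in section 3.2:

```python
import implicit

# BPR optimizes pairwise ranking: interacted items should score above non-interacted ones
bpr = implicit.bpr.BayesianPersonalizedRanking(
    factors=64,
    learning_rate=0.01,
    regularization=0.01,
    iterations=100,
    random_state=42
)
bpr.fit(user_item)  # same sparse user-item matrix as the ALS example

ids, scores = bpr.recommend(userid=0, user_items=user_item[0], N=10)
```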
4. NMF (Non-Negative Matrix Factorization)
4.1 Concept
NMF is like SVD but with one extra constraint: all values must be >= 0. The latent factors become more interpretable, because each factor can only contribute positively.
Python
```python
from surprise import NMF

nmf = NMF(
    n_factors=15,    # Fewer factors for interpretability
    n_epochs=50,
    random_state=42
)

cross_validate(nmf, data, measures=['RMSE'], cv=5, verbose=True)
```

4.2 Interpretability
Example

```
NMF factors might represent:
Factor 1: "Action movies"     — high for Avengers, low for rom-coms
Factor 2: "Vietnamese comedy" — high for local comedies, low for foreign films
Factor 3: "Award-winning"     — high for Oscar films, low for B-movies

User vector [0.8, 0.3, 0.9] → likes Action + Award-winning
Item vector [0.9, 0.1, 0.7] → an Action + Award-winning movie
→ High predicted rating!
```
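One way to eyeball what the model actually learned is to look at the fitted factor matrices. A rough sketch using Surprise's learned item factors (the qi attribute) and the trainset's id mapping, listing the items that load highest on one factor:

```python
# Fit on the full trainset, then inspect the learned item factor matrix
trainset_full = data.build_full_trainset()
nmf.fit(trainset_full)

factor = 0                                   # pick one latent factor to inspect
item_loadings = nmf.qi[:, factor]            # non-negative loading of every item
top_inner = item_loadings.argsort()[::-1][:5]

print(f"Items loading highest on factor {factor}:")
for inner_id in top_inner:
    raw_id = trainset_full.to_raw_iid(inner_id)
    print(f"  {raw_id}: {item_loadings[inner_id]:.3f}")
```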
5. Hyperparameter Tuning
5.1 Grid Search
Python
```python
from surprise.model_selection import GridSearchCV

param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.05, 0.1]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(f"Best RMSE: {gs.best_score['rmse']:.4f}")
print(f"Best params: {gs.best_params['rmse']}")
```
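Once the search finishes, gs.best_estimator['rmse'] holds an SVD instance preconfigured with the winning parameters; a short follow-up sketch refits it on the full dataset before generating recommendations:

```python
# Refit the best-scoring configuration on all available ratings
best_svd = gs.best_estimator['rmse']
best_svd.fit(data.build_full_trainset())
```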
5.2 Model Comparison
Python
```python
from surprise import SVD, SVDpp, NMF, KNNBaseline

models = {
    'SVD': SVD(n_factors=100),
    'SVD++': SVDpp(n_factors=100),
    'NMF': NMF(n_factors=50),
    'KNN': KNNBaseline(k=40, sim_options={'name': 'cosine'})
}

for name, model in models.items():
    results = cross_validate(model, data, measures=['RMSE'], cv=5, verbose=False)
    print(f"{name}: RMSE = {results['test_rmse'].mean():.4f}")
```

📝 Quiz
- What are latent factors in Matrix Factorization?
  - Explicit features (genre, price)
  - Hidden dimensions that the model learns on its own
  - User demographics
  - Missing values
- What advantage does ALS have over SGD-based SVD?
  - Parallelizable, and well suited to implicit feedback and sparse data
  - Always more accurate
  - Always faster
  - No tuning needed
- What is the main difference between NMF and SVD?
  - NMF is faster
  - NMF requires non-negative values, which makes it more interpretable
  - SVD cannot be used for RecSys
  - There is no difference
🎯 Key Takeaways
- SVD — Most popular, a good baseline for explicit ratings
- ALS — Best suited to implicit feedback (clicks, views)
- NMF — Interpretable latent factors
- Surprise library — Easy experimentation
- Latent factors — The model learns hidden patterns on its own
🚀 Next Lesson
Deep Learning for RecSys — Neural Collaborative Filtering, Two-Tower, and Sequential RecSys!
