
Matrix Factorization

SVD, ALS, và Non-Negative Matrix Factorization cho Recommendation Systems


Matrix Factorization is the core technique behind modern RecSys: the team that won the 2009 Netflix Prize used SVD++ as a key component to beat Netflix's own algorithm.

🎯 Objectives

  • Understand latent factor models
  • Implement SVD for RecSys (Surprise library)
  • ALS (Alternating Least Squares)
  • NMF and implicit feedback

1. Latent Factor Concept

1.1 Idea

Example
User-Item Matrix (sparse):        Decompose into:

        I1  I2  I3  I4           User Matrix       Item Matrix
User A   5   ?   ?   1      ≈    [u1, u2, u3]      [i1, i2, i3]
User B   ?   4   ?   ?           [u1, u2, u3]  ×   [i1, i2, i3]
User C   ?   ?   5   4           [u1, u2, u3]      [i1, i2, i3]
                                        (k factors)

Rating(A, I3) ≈ dot(User_A_vector, Item_I3_vector)

Latent factors can represent genre, style, quality, complexity, and so on. The model learns these factors on its own; you do not define them up front.

1.2 Mathematical Formulation

Example
R ≈ U × V^T

R: User-Item matrix (m × n)
U: User matrix (m × k), k latent factors per user
V: Item matrix (n × k), k latent factors per item

Predicted rating: r̂(u,i) = u_u · v_i + b_u + b_i + μ

Where:
- u_u: user latent vector
- v_i: item latent vector
- b_u: user bias (some users rate higher)
- b_i: item bias (some items are rated higher)
- μ: global average rating
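The biased prediction formula can be sketched in NumPy. The vectors and bias values below are hypothetical toy numbers, chosen only to show the arithmetic:

```python
import numpy as np

# Hypothetical learned parameters for one user and one item (k = 3 factors)
u_u = np.array([0.8, 0.1, 0.5])   # user latent vector
v_i = np.array([0.9, 0.2, 0.4])   # item latent vector
b_u = 0.3    # this user rates slightly above average
b_i = -0.1   # this item is rated slightly below average
mu = 3.5     # global average rating

# r̂(u,i) = u_u · v_i + b_u + b_i + μ
r_hat = u_u @ v_i + b_u + b_i + mu
print(round(r_hat, 2))  # 4.64
```

The dot product (0.94) measures how well the user's tastes align with the item's factors; the bias terms then shift it toward what this user and this item typically receive.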

2. SVD with Surprise Library

2.1 Setup

Python
# pip install scikit-surprise
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, train_test_split
import pandas as pd

# Load data: columns user_id, item_id, rating
df = pd.read_csv('ratings.csv')
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)

2.2 Train SVD

Python
# Split data
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# SVD model
svd = SVD(
    n_factors=100,    # Latent dimensions
    n_epochs=20,      # Training epochs
    lr_all=0.005,     # Learning rate
    reg_all=0.02,     # Regularization
    random_state=42
)

svd.fit(trainset)
predictions = svd.test(testset)

# Evaluate
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")

2.3 Generate Recommendations

Python
from collections import defaultdict

def get_top_n(predictions, n=10):
    """Get top-N recommendations for each user."""
    top_n = defaultdict(list)

    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

# Retrain on the full dataset, then predict all unrated items
trainset_full = data.build_full_trainset()
svd.fit(trainset_full)

# The anti-testset contains every unrated (user, item) pair (it can be large)
anti_testset = trainset_full.build_anti_testset()
all_predictions = svd.test(anti_testset)
top_n = get_top_n(all_predictions, n=10)

# Recommendations for user "U001"
for item_id, predicted_rating in top_n["U001"]:
    print(f"  Item: {item_id}, Predicted: {predicted_rating:.2f}")

2.4 Cross-Validation

Python
# K-fold cross-validation
results = cross_validate(
    SVD(n_factors=100),
    data,
    measures=['RMSE', 'MAE'],
    cv=5,
    verbose=True
)
print(f"Mean RMSE: {results['test_rmse'].mean():.4f}")

3. ALS (Alternating Least Squares)

3.1 Concept

Example
Optimization problem:
  min ||R - U × V^T||^2 + lambda * (||U||^2 + ||V||^2)

ALS strategy:
  Step 1: Fix V, solve for U (a linear least-squares problem)
  Step 2: Fix U, solve for V (a linear least-squares problem)
  Repeat until convergence
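The two alternating steps can be sketched with NumPy on a tiny, fully observed toy matrix. Real ALS solves a per-row system over observed entries only; this simplified sketch treats every cell as observed, so each step reduces to one ridge-regression solve:

```python
import numpy as np

rng = np.random.default_rng(42)
R = np.array([[5., 3., 1.],
              [4., 3., 1.],
              [1., 1., 5.]])   # toy ratings, fully observed for simplicity
k, lam = 2, 0.1               # latent factors, regularization strength

U = rng.normal(size=(R.shape[0], k))
V = rng.normal(size=(R.shape[1], k))

for _ in range(20):
    # Step 1: fix V, ridge-regression solution for U
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
    # Step 2: fix U, ridge-regression solution for V
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))

err = np.linalg.norm(R - U @ V.T)
print(f"Reconstruction error: {err:.3f}")
```

Because each half-step has a closed-form solution and user rows are independent of one another (likewise item rows), ALS parallelizes naturally, which is the key advantage over SGD-based training.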

3.2 ALS with Implicit Feedback

Python
# pip install implicit
import implicit
import scipy.sparse as sparse
import numpy as np

# Build a sparse user-item matrix from implicit signals (views, clicks, purchases)
# rows = users, cols = items, values = interaction count
user_item = sparse.csr_matrix(interaction_matrix)

# ALS model
model = implicit.als.AlternatingLeastSquares(
    factors=64,
    regularization=0.01,
    iterations=50,
    random_state=42
)

# Train (implicit >= 0.5 expects the user-item matrix directly)
model.fit(user_item)

# Recommendations for user 0
ids, scores = model.recommend(
    userid=0,
    user_items=user_item[0],
    N=10,
    filter_already_liked_items=True
)

for item_id, score in zip(ids, scores):
    print(f"  Item {item_id}: {score:.4f}")

3.3 Explicit vs Implicit

              Explicit                  Implicit
Data          Ratings (1-5 stars)       Clicks, views, purchases
Availability  Sparse (few users rate)   Dense (everyone clicks)
Signal        Clear preference          Noisy (click != like)
Algorithm     SVD, SVD++                ALS, BPR
Example       MovieLens                 E-commerce browsing

4. NMF (Non-Negative Matrix Factorization)

4.1 Concept

NMF is like SVD but with an added constraint: all values must be >= 0. This makes the latent factors more interpretable (positive contributions only).

Python
from surprise import NMF

nmf = NMF(
    n_factors=15,    # Fewer factors for interpretability
    n_epochs=50,
    random_state=42
)

cross_validate(nmf, data, measures=['RMSE'], cv=5, verbose=True)

4.2 Interpretability

Example
NMF factors might represent:
Factor 1: "Action movies" (high for Avengers, low for romcoms)
Factor 2: "Vietnamese comedy" (high for local comedy, low for foreign)
Factor 3: "Award-winning" (high for Oscar films, low for B-movies)

User vector [0.8, 0.3, 0.9] → likes Action + Award-winning
Item vector [0.9, 0.1, 0.7] → Action + Award-winning movie
→ High predicted rating!
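The matching in the example above is just a dot product. A quick check with the same toy vectors:

```python
import numpy as np

user = np.array([0.8, 0.3, 0.9])   # affinity for Action, VN comedy, Award-winning
item = np.array([0.9, 0.1, 0.7])   # how strongly the movie expresses each factor

# The factors the user cares about line up with the factors the item has,
# so the dot product (0.72 + 0.03 + 0.63) and hence the predicted score is high.
score = float(user @ item)
print(round(score, 2))
```

Because every entry is non-negative, each factor can only add to the score, never cancel another out; that is what makes NMF factors easier to read than SVD's signed ones.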

5. Hyperparameter Tuning

5.1 Grid Search

Python
from surprise.model_selection import GridSearchCV

param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.05, 0.1]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(f"Best RMSE: {gs.best_score['rmse']:.4f}")
print(f"Best params: {gs.best_params['rmse']}")

5.2 Model Comparison

Python
from surprise import SVD, SVDpp, NMF, KNNBaseline

models = {
    'SVD': SVD(n_factors=100),
    'SVD++': SVDpp(n_factors=100),
    'NMF': NMF(n_factors=50),
    'KNN': KNNBaseline(k=40, sim_options={'name': 'cosine'})
}

for name, model in models.items():
    results = cross_validate(model, data, measures=['RMSE'], cv=5, verbose=False)
    print(f"{name}: RMSE = {results['test_rmse'].mean():.4f}")

📝 Quiz

  1. What are latent factors in Matrix Factorization?

    • Explicit features (genre, price)
    • Hidden dimensions the model learns on its own
    • User demographics
    • Missing values
  2. What advantage does ALS have over SGD-based SVD?

    • Parallelizable; works well for implicit feedback and sparse data
    • Always more accurate
    • Always faster
    • No tuning required
  3. What is the main difference between NMF and SVD?

    • NMF is faster
    • NMF requires non-negative values, making it more interpretable
    • SVD cannot be used for RecSys
    • There is no difference

🎯 Key Takeaways

  1. SVD: the most popular choice and a good baseline for explicit ratings
  2. ALS: best for implicit feedback (clicks, views)
  3. NMF: interpretable latent factors
  4. Surprise library: easy experimentation
  5. Latent factors: the model learns hidden patterns on its own

🚀 Next Lesson

Deep Learning for RecSys: Neural Collaborative Filtering, Two-Tower models, and Sequential RecSys!