Matrix Factorization
Matrix Factorization is the core technique behind modern RecSys — the team that won the 2009 Netflix Prize relied on matrix factorization models such as SVD++ to beat Netflix's own Cinematch algorithm.
🎯 Objectives
- Understand latent factor models
- Implement SVD for RecSys (Surprise library)
- ALS (Alternating Least Squares)
- NMF and implicit feedback
1. Latent Factor Concept
1.1 Idea
Example

```
User-Item Matrix (sparse):              Decompose into:

         I1  I2  I3  I4                 User Matrix       Item Matrix
User A    5   ?   ?   1        ≈        [u1, u2, u3]      [i1, i2, i3]
User B    ?   4   ?   ?                 [u1, u2, u3]  ×   [i1, i2, i3]
User C    ?   ?   5   4                 [u1, u2, u3]      [i1, i2, i3]
                                          k factors

Rating(A, I3) ≈ dot(User_A_vector, Item_I3_vector)
```

Latent factors can represent genre, style, quality, complexity... The model learns these factors by itself; you do not have to define them up front.
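To make the decomposition concrete, here is a minimal NumPy sketch; the factor values below are invented for illustration rather than learned from data.

```python
import numpy as np

# Toy factors: 3 users x 4 items, k = 2 latent dimensions (values are made up)
U = np.array([[1.2, 0.3],    # User A
              [0.2, 1.1],    # User B
              [0.9, 1.0]])   # User C
V = np.array([[1.1, 0.2],    # Item I1
              [0.1, 1.2],    # Item I2
              [0.8, 1.0],    # Item I3
              [0.3, 1.1]])   # Item I4

# Reconstruct the full rating matrix: (m x k) @ (k x n)
R_hat = U @ V.T
print(R_hat.round(2))

# A single prediction is just the dot product of one user and one item vector
print(f"Rating(A, I3) ≈ {np.dot(U[0], V[2]):.2f}")
```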
1.2 Mathematical Formulation
Example

```
R ≈ U × V^T

R: User-Item matrix (m × n)
U: User matrix (m × k) — k latent factors per user
V: Item matrix (n × k) — k latent factors per item

Predicted rating: r̂(u,i) = u_u · v_i + b_u + b_i + μ

Where:
- u_u: user latent vector
- v_i: item latent vector
- b_u: user bias (some users rate higher)
- b_i: item bias (some items get rated higher)
- μ: global average rating
```
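A tiny worked computation of the biased prediction, with made-up numbers for every term:

```python
import numpy as np

# r̂(u,i) = u_u · v_i + b_u + b_i + μ, all values invented for illustration
mu  = 3.6                      # global average rating
b_u = 0.4                      # this user rates slightly above average
b_i = -0.2                     # this item is rated slightly below average
u_u = np.array([0.9, 0.3])     # user latent vector (k = 2)
v_i = np.array([0.8, 1.0])     # item latent vector (k = 2)

r_hat = np.dot(u_u, v_i) + b_u + b_i + mu
print(f"{r_hat:.2f}")          # 1.02 + 0.4 - 0.2 + 3.6 = 4.82
```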
2. SVD with Surprise Library
2.1 Setup
Python
```python
# pip install scikit-surprise
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, train_test_split
import pandas as pd

# Load data
df = pd.read_csv('ratings.csv')  # columns: user_id, item_id, rating
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)
```

2.2 Train SVD
Python
```python
# Split data
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# SVD model
svd = SVD(
    n_factors=100,   # Latent dimensions
    n_epochs=20,     # Training epochs
    lr_all=0.005,    # Learning rate
    reg_all=0.02,    # Regularization
    random_state=42
)

svd.fit(trainset)
predictions = svd.test(testset)

# Evaluate
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```
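A fitted model can also score a single user-item pair directly with `predict()`; the raw ids below are placeholders for ids that exist in your own ratings.csv.

```python
# Score one user-item pair by raw id; .est is the estimated rating
# ("U001" and "I042" are placeholder ids, not real data)
pred = svd.predict(uid="U001", iid="I042")
print(f"{pred.uid} -> {pred.iid}: {pred.est:.2f}")
```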
2.3 Generate Recommendations
Python
```python
def get_top_n(predictions, n=10):
    """Get top-N recommendations for each user."""
    from collections import defaultdict
    top_n = defaultdict(list)

    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

# Predict all unrated items
trainset_full = data.build_full_trainset()
svd.fit(trainset_full)

# Recommendations for user "U001"
anti_testset = trainset_full.build_anti_testset()
all_predictions = svd.test(anti_testset)
top_n = get_top_n(all_predictions, n=10)

for item_id, predicted_rating in top_n["U001"]:
    print(f"  Item: {item_id}, Predicted: {predicted_rating:.2f}")
```

2.4 Cross-Validation
Python
```python
# K-Fold CV
results = cross_validate(
    SVD(n_factors=100),
    data,
    measures=['RMSE', 'MAE'],
    cv=5,
    verbose=True
)
print(f"Mean RMSE: {results['test_rmse'].mean():.4f}")
```

3. ALS (Alternating Least Squares)
3.1 Concept
Example

```
Optimization Problem:
  min ||R - U × V^T||^2 + lambda * (||U||^2 + ||V||^2)

ALS Strategy:
  Step 1: Fix V, solve for U (a regularized least squares problem)
  Step 2: Fix U, solve for V (a regularized least squares problem)
  Repeat until convergence
```
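A minimal NumPy sketch of the two alternating steps. For brevity it assumes R is fully observed; real ALS implementations only fit against observed (or confidence-weighted) entries.

```python
import numpy as np

def als_step(R, fixed, lam):
    """Solve the regularized least squares problem for one factor matrix,
    holding the other one fixed (fully observed R, for simplicity)."""
    k = fixed.shape[1]
    A = fixed.T @ fixed + lam * np.eye(k)          # (k x k)
    return np.linalg.solve(A, fixed.T @ R.T).T     # back to (rows of R) x k

# Toy fully observed rating matrix (3 users x 4 items)
R = np.array([[5., 3., 1., 4.],
              [4., 4., 2., 3.],
              [1., 2., 5., 4.]])
k, lam = 2, 0.1
rng = np.random.default_rng(42)
U = rng.random((R.shape[0], k))
V = rng.random((R.shape[1], k))

for _ in range(20):
    U = als_step(R, V, lam)        # Step 1: fix V, solve for U
    V = als_step(R.T, U, lam)      # Step 2: fix U, solve for V

print(np.round(U @ V.T, 2))        # low-rank approximation of R
```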
3.2 ALS with Implicit Feedback
Python
```python
# pip install implicit
import implicit
import scipy.sparse as sparse
import numpy as np

# Create sparse user-item matrix (implicit: views, clicks, purchases)
# rows = users, cols = items, values = interaction count
user_item = sparse.csr_matrix(interaction_matrix)

# ALS model
model = implicit.als.AlternatingLeastSquares(
    factors=64,
    regularization=0.01,
    iterations=50,
    random_state=42
)

# Train (implicit >= 0.5 expects the user-item matrix)
model.fit(user_item)

# Recommendations for user 0
ids, scores = model.recommend(
    userid=0,
    user_items=user_item[0],
    N=10,
    filter_already_liked_items=True
)

for item_id, score in zip(ids, scores):
    print(f"  Item {item_id}: {score:.4f}")
```

3.3 Explicit vs Implicit
| | Explicit | Implicit |
|---|---|---|
| Data | Ratings (1-5 stars) | Clicks, views, purchases |
| Availability | Very sparse (few users rate) | Much denser (every click/view counts) |
| Signal | Clear preference | Noisy (click != like) |
| Algorithm | SVD, SVD++ | ALS, BPR |
| Example | MovieLens | E-commerce browsing |
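The table lists BPR (Bayesian Personalized Ranking) as an implicit-feedback algorithm; the implicit library ships an implementation with the same interface as its ALS model. A brief sketch, assuming implicit >= 0.5 and reusing the user_item matrix built in section 3.2:

```python
import implicit

# BPR optimizes pairwise ranking: interacted items should score above non-interacted ones
bpr = implicit.bpr.BayesianPersonalizedRanking(
    factors=64,
    learning_rate=0.01,
    regularization=0.01,
    iterations=100,
    random_state=42
)
bpr.fit(user_item)  # same sparse user-item matrix as the ALS example

ids, scores = bpr.recommend(userid=0, user_items=user_item[0], N=10)
```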
4. NMF (Non-Negative Matrix Factorization)
4.1 Concept
NMF is like SVD but with one extra constraint: all values must be >= 0. The latent factors become more interpretable, because each factor can only contribute positively.
Python
```python
from surprise import NMF

nmf = NMF(
    n_factors=15,    # Fewer factors for interpretability
    n_epochs=50,
    random_state=42
)

cross_validate(nmf, data, measures=['RMSE'], cv=5, verbose=True)
```

4.2 Interpretability
Example

```
NMF factors might represent:
Factor 1: "Action movies"     — high for Avengers, low for rom-coms
Factor 2: "Vietnamese comedy" — high for local comedies, low for foreign films
Factor 3: "Award-winning"     — high for Oscar films, low for B-movies

User vector [0.8, 0.3, 0.9] → likes Action + Award-winning
Item vector [0.9, 0.1, 0.7] → an Action + Award-winning movie
→ High predicted rating!
```
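One way to eyeball what the model actually learned is to look at the fitted factor matrices. A rough sketch using Surprise's learned item factors (the qi attribute) and the trainset's id mapping, listing the items that load highest on one factor:

```python
# Fit on the full trainset, then inspect the learned item factor matrix
trainset_full = data.build_full_trainset()
nmf.fit(trainset_full)

factor = 0                                   # pick one latent factor to inspect
item_loadings = nmf.qi[:, factor]            # non-negative loading of every item
top_inner = item_loadings.argsort()[::-1][:5]

print(f"Items loading highest on factor {factor}:")
for inner_id in top_inner:
    raw_id = trainset_full.to_raw_iid(inner_id)
    print(f"  {raw_id}: {item_loadings[inner_id]:.3f}")
```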
5. Hyperparameter Tuning
5.1 Grid Search
Python
```python
from surprise.model_selection import GridSearchCV

param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.05, 0.1]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(f"Best RMSE: {gs.best_score['rmse']:.4f}")
print(f"Best params: {gs.best_params['rmse']}")
```
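Once the search finishes, gs.best_estimator['rmse'] holds an SVD instance preconfigured with the winning parameters; a short follow-up sketch refits it on the full dataset before generating recommendations:

```python
# Refit the best-scoring configuration on all available ratings
best_svd = gs.best_estimator['rmse']
best_svd.fit(data.build_full_trainset())
```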
5.2 Model Comparison
Python
```python
from surprise import SVD, SVDpp, NMF, KNNBaseline

models = {
    'SVD': SVD(n_factors=100),
    'SVD++': SVDpp(n_factors=100),
    'NMF': NMF(n_factors=50),
    'KNN': KNNBaseline(k=40, sim_options={'name': 'cosine'})
}

for name, model in models.items():
    results = cross_validate(model, data, measures=['RMSE'], cv=5, verbose=False)
    print(f"{name}: RMSE = {results['test_rmse'].mean():.4f}")
```

📝 Quiz
- What are latent factors in Matrix Factorization?
  - Explicit features (genre, price)
  - Hidden dimensions that the model learns on its own
  - User demographics
  - Missing values
- What advantage does ALS have over SGD-based SVD?
  - Parallelizable, and well suited to implicit feedback and sparse data
  - Always more accurate
  - Always faster
  - No tuning needed
- What is the main difference between NMF and SVD?
  - NMF is faster
  - NMF requires non-negative values, which makes it more interpretable
  - SVD cannot be used for RecSys
  - There is no difference
🎯 Key Takeaways
- SVD — Most popular, a good baseline for explicit ratings
- ALS — Best suited to implicit feedback (clicks, views)
- NMF — Interpretable latent factors
- Surprise library — Easy experimentation
- Latent factors — The model learns hidden patterns on its own
🚀 Next Lesson
Deep Learning for RecSys — Neural Collaborative Filtering, Two-Tower, and Sequential RecSys!
