Recommendation Systems Overview

Collaborative filtering, content-based, and hybrid approaches to recommendation systems

Recommendation systems (RecSys) are among the most widely deployed ML applications: Netflix, Shopee, and YouTube all rely on them. This lesson covers the foundational concepts.

🎯 Objectives

  • Understand the three main approaches to RecSys
  • Implement user-based and item-based collaborative filtering
  • Build a content-based filter
  • Combine them in a hybrid and evaluate the result with standard metrics

1. RecSys Taxonomy

1.1 Three Main Approaches

| Approach | Idea | Data needed |
|---|---|---|
| Collaborative Filtering | People similar to you liked it, so you probably will too | User-item interactions |
| Content-based | You liked item X, so recommend items similar to X | Item features |
| Hybrid | Combine CF and content-based | Both |

1.2 Problem Formulation

Example
User-Item Matrix (Rating):

          Item1  Item2  Item3  Item4  Item5
User A      5      3      ?      1      ?
User B      4      ?      ?      1      ?
User C      ?      1      ?      5      4
User D      1      ?      5      4      ?

Goal: predict the "?" values, then recommend the items with the highest predicted ratings.
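
To make the formulation concrete, here is a minimal sketch that stores the same toy matrix in NumPy. Using `np.nan` for the "?" cells is an assumption made for this illustration (the code later in the lesson uses 0 instead); the variable names `R`, `observed`, and `sparsity` are likewise illustrative.

```python
import numpy as np

# Same toy matrix as above; np.nan marks a rating we do not know yet.
R = np.array([
    [5.0,    3.0,    np.nan, 1.0, np.nan],  # User A
    [4.0,    np.nan, np.nan, 1.0, np.nan],  # User B
    [np.nan, 1.0,    np.nan, 5.0, 4.0],     # User C
    [1.0,    np.nan, 5.0,    4.0, np.nan],  # User D
])

observed = ~np.isnan(R)             # True where a rating exists
sparsity = 1.0 - observed.mean()    # fraction of cells still unknown
user_means = np.nanmean(R, axis=1)  # per-user average over known ratings only

print(f"observed ratings: {observed.sum()} of {R.size}")
print(f"sparsity: {sparsity:.2f}")
print("user means:", user_means.round(2))
```

Real interaction matrices are far sparser than this toy example, which is why the choice of how to represent and ignore missing entries matters.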

2. Collaborative Filtering

2.1 User-based CF

Idea: users who are similar to each other tend to rate items similarly.

Python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# User-item rating matrix (0 = not rated)
ratings = np.array([
    [5, 3, 0, 1, 0],  # User A
    [4, 0, 0, 1, 0],  # User B
    [0, 1, 0, 5, 4],  # User C
    [1, 0, 5, 4, 0],  # User D
])

# Compute user-user similarity (each row of ratings is one user).
# Note: cosine similarity treats the 0 placeholders as actual values.
user_sim = cosine_similarity(ratings)
print("User similarity matrix:")
print(user_sim.round(2))

def predict_user_cf(user_idx, item_idx, ratings, sim_matrix, k=2):
    """Predict rating for user-item pair using k nearest users."""
    # Find users who rated this item
    rated_mask = ratings[:, item_idx] > 0
    rated_mask[user_idx] = False  # Exclude target user

    if not rated_mask.any():
        return 0.0  # Nobody else rated this item

    # Similarities and ratings of the users who rated this item
    sims = sim_matrix[user_idx][rated_mask]
    user_ratings = ratings[rated_mask, item_idx]

    # Keep the k most similar of those users
    top_k = min(k, len(sims))
    top_indices = np.argsort(sims)[-top_k:]

    # Similarity-weighted average of the neighbors' ratings
    weights = sims[top_indices]
    weighted_ratings = user_ratings[top_indices]

    if weights.sum() == 0:
        return 0.0
    return np.dot(weights, weighted_ratings) / weights.sum()
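
The weighted average above ignores the fact that some users rate systematically higher or lower than others. A common refinement, which is not part of the code above, is to mean-center the ratings: predict the target user's own average plus the similarity-weighted deviations of the neighbors from their averages. The sketch below is one way to do that under this lesson's conventions (0 means "not rated", ratings run from 1 to 5); the function name `predict_mean_centered` is hypothetical, and it uses all neighbors rather than only the top k.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5, 3, 0, 1, 0],  # User A
    [4, 0, 0, 1, 0],  # User B
    [0, 1, 0, 5, 4],  # User C
    [1, 0, 5, 4, 0],  # User D
], dtype=float)
user_sim = cosine_similarity(ratings)

def predict_mean_centered(user_idx, item_idx, ratings, sim_matrix):
    """Predict the user's mean plus similarity-weighted neighbor deviations."""
    rated = ratings > 0
    # Per-user mean over rated items only (missing entries are 0, so sum() skips them)
    user_means = ratings.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

    # Neighbors: every other user who rated the target item
    neighbors = rated[:, item_idx].copy()
    neighbors[user_idx] = False
    if not neighbors.any():
        return float(user_means[user_idx])  # fall back to the user's own mean

    sims = sim_matrix[user_idx, neighbors]
    deviations = ratings[neighbors, item_idx] - user_means[neighbors]
    denom = np.abs(sims).sum()
    if denom == 0:
        return float(user_means[user_idx])

    pred = user_means[user_idx] + np.dot(sims, deviations) / denom
    return float(np.clip(pred, 1.0, 5.0))  # keep the result on the 1-5 scale

# User B (index 1) has not rated Item2 (index 1)
print(round(predict_mean_centered(1, 1, ratings, user_sim), 2))
```

Falling back to the user's mean instead of 0 keeps the prediction on the rating scale when no neighbor is available.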

2.2 Item-based CF

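Idea: items that receive similar ratings from the same users are similar, so a user's rating for an item can be estimated from their ratings of the items most similar to it.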

Python
# Item-item similarity: transpose so that each row is one item
item_sim = cosine_similarity(ratings.T)

def predict_item_cf(user_idx, item_idx, ratings, item_sim_matrix, k=2):
    """Predict rating using item-based CF."""
    user_ratings = ratings[user_idx]
    rated_items = np.where(user_ratings > 0)[0]
    rated_items = rated_items[rated_items != item_idx]  # Never use the target itself

    if len(rated_items) == 0:
        return 0.0  # The user has no other ratings to build on

    # Similarity of target item to rated items
    sims = item_sim_matrix[item_idx][rated_items]
    item_ratings = user_ratings[rated_items]

    # Keep the k rated items most similar to the target
    top_k = min(k, len(sims))
    top_indices = np.argsort(sims)[-top_k:]

    # Similarity-weighted average of the user's own ratings
    weights = sims[top_indices]
    weighted_ratings = item_ratings[top_indices]

    if weights.sum() == 0:
        return 0.0
    return np.dot(weights, weighted_ratings) / weights.sum()
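
Predicting one cell at a time is enough to explain the idea, but a recommender usually needs a ranked list. The following sketch scores every unrated item for one user in a single vectorized step and returns the best n. It is a simplification of the function above: it weights all of the user's rated items instead of only the top k, and the name `recommend_top_n` is hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5, 3, 0, 1, 0],  # User A
    [4, 0, 0, 1, 0],  # User B
    [0, 1, 0, 5, 4],  # User C
    [1, 0, 5, 4, 0],  # User D
], dtype=float)
item_sim = cosine_similarity(ratings.T)

def recommend_top_n(user_idx, ratings, item_sim, n=2):
    """Return the n best unrated items for a user as (item_idx, score) pairs."""
    user_ratings = ratings[user_idx]
    rated = user_ratings > 0

    # For every item: similarity-weighted average of the user's known ratings
    numer = item_sim[:, rated] @ user_ratings[rated]
    denom = item_sim[:, rated].sum(axis=1)
    scores = np.divide(numer, denom, out=np.zeros_like(numer), where=denom > 0)

    # Rank only the items the user has not rated yet, best first
    candidates = np.where(~rated)[0]
    order = candidates[np.argsort(scores[candidates])[::-1]]
    return [(int(i), float(scores[i])) for i in order[:n]]

# Top-2 recommendations for User B (index 1)
for item, score in recommend_top_n(1, ratings, item_sim):
    print(f"Item{item + 1}: predicted rating {score:.2f}")
```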

2.3 User-based vs Item-based

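| | User-based | Item-based |
|---|---|---|
| Computation | O(users^2) pairwise similarities | O(items^2) pairwise similarities |
| Best when | Items >> Users | Users >> Items |
| Stability | Less stable (user tastes shift) | More stable (item similarities change slowly) |
| Example | News (items churn quickly) | E-commerce (many users, relatively stable catalog) |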

3. Content-based Filtering

3.1 Concept

Example
User Profile → built from the items the user liked
Item Profile → built from the item's features

Recommendation: items whose profile matches the User Profile
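
As a rough illustration of profile matching, the sketch below builds a user profile as the rating-weighted average of the feature vectors of the items the user rated, then ranks the unrated items by cosine similarity to that profile. The feature columns, the ratings, and the helper name `build_user_profile` are all hypothetical values chosen for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical binary item features: columns = [gaming, office, portable]
item_features = np.array([
    [1, 0, 1],  # item 0: gaming laptop
    [0, 1, 1],  # item 1: office laptop
    [1, 0, 0],  # item 2: gaming desktop
    [0, 1, 0],  # item 3: office desktop
], dtype=float)

# Hypothetical ratings from one user (0 = not rated)
user_ratings = np.array([5.0, 0.0, 0.0, 1.0])

def build_user_profile(user_ratings, item_features):
    """Rating-weighted average of the feature vectors of rated items."""
    rated = user_ratings > 0
    if not rated.any():
        return np.zeros(item_features.shape[1])
    weights = user_ratings[rated]
    return weights @ item_features[rated] / weights.sum()

profile = build_user_profile(user_ratings, item_features)

# Match every item profile against the user profile
scores = cosine_similarity(profile.reshape(1, -1), item_features)[0]
scores[user_ratings > 0] = -np.inf  # do not re-recommend rated items
best = int(np.argmax(scores))

print("user profile:", profile.round(2))
print("best unrated item:", best)
```

Because the user rated the gaming laptop highly and the office desktop poorly, the profile leans toward the gaming feature, so the gaming desktop outranks the office laptop.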

3.2 TF-IDF for Content Features

Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Product descriptions
products = [
    "High-end gaming laptop, RTX 4080, 32GB RAM",
    "Lightweight office laptop, Intel i5, 16GB RAM",
    "Gaming PC, RTX 4090, 64GB RAM, custom build",
    "Drawing tablet, Apple Pencil, M2 chip",
    "Gaming laptop, RTX 4070, 16GB RAM, OLED"
]

# TF-IDF vectorization: one sparse row per product description
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(products)

# Pairwise product-product similarity
product_sim = cosine_similarity(features)

def recommend_content(product_idx, sim_matrix, n=3):
    """Return the n most similar products as (index, score) pairs."""
    # Drop the query product explicitly instead of assuming it sorts first
    sim_scores = [(i, s) for i, s in enumerate(sim_matrix[product_idx])
                  if i != product_idx]
    sim_scores.sort(key=lambda x: x[1], reverse=True)
    return sim_scores[:n]

# Products similar to "High-end gaming laptop"
recs = recommend_content(0, product_sim)
for idx, score in recs:
    print(f"  {products[idx]} (similarity: {score:.2f})")

3.3 Multi-feature Content-based

Python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Product features
df = pd.DataFrame({
    'name': ['iPhone 15', 'Samsung S24', 'Pixel 8', 'iPhone 14'],
    'brand': ['Apple', 'Samsung', 'Google', 'Apple'],
    'price_range': ['high', 'high', 'mid', 'mid'],
    'camera_mp': [48, 50, 50, 48],
    'battery_mah': [3877, 4000, 4575, 3279]
})

# One-hot encode categorical features (dtype=float keeps the matrix numeric)
cat_features = pd.get_dummies(df[['brand', 'price_range']], dtype=float)
num_features = df[['camera_mp', 'battery_mah']]

# Standardize numeric features so no column dominates because of its units
scaler = StandardScaler()
num_scaled = pd.DataFrame(
    scaler.fit_transform(num_features),
    columns=num_features.columns
)

# Combine features and compute item-item similarity
all_features = pd.concat([cat_features, num_scaled], axis=1)
content_sim = cosine_similarity(all_features)

4. Hybrid Methods

4.1 Hybrid Strategies

| Strategy | How | Example |
|---|---|---|
| Weighted | CF_score * w1 + Content_score * w2 | Simple combination |
| Switching | Use content-based when interaction data is sparse (new users or items), CF for active users | Cold start handling |
| Feature Augmentation | Use CF output as input to the content model | Cascade |
| Meta-Level | CF model learns from content features | Knowledge integration |
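
The switching strategy can be sketched in a few lines. This is only an illustration: the function name `switching_hybrid`, the `min_interactions` threshold of 5, and the assumption that both scores are already on the same scale are choices made for this example, not values from the lesson.

```python
def switching_hybrid(n_interactions, cf_score, content_score, min_interactions=5):
    """Pick one recommender per request and report which one was used.

    n_interactions: how many ratings/clicks we have for this user.
    min_interactions: hypothetical threshold below which CF is not trusted.
    """
    if n_interactions >= min_interactions:
        return cf_score, "cf"            # enough history: trust CF
    return content_score, "content"      # cold start: fall back to content

# A brand-new user versus an active one
print(switching_hybrid(0, cf_score=0.9, content_score=0.4))
print(switching_hybrid(12, cf_score=0.9, content_score=0.4))
```

In practice the threshold would be tuned on validation data, and the same rule can be applied per item to handle new items.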

4.2 Simple Weighted Hybrid

Python
def hybrid_recommend(user_idx, item_idx, ratings, item_sim, content_sim,
                     w_cf=0.7, w_content=0.3, max_rating=5.0):
    """Weighted hybrid: CF + Content-based, both on a comparable scale.

    content_sim must be an item-item similarity matrix over the same items,
    in the same order, as the columns of ratings.
    """
    # CF score, rescaled from the rating scale to [0, 1]
    cf_score = predict_item_cf(user_idx, item_idx, ratings, item_sim) / max_rating

    # Content score (mean similarity to the items the user rated above 3)
    user_rated = np.where(ratings[user_idx] > 3)[0]
    user_rated = user_rated[user_rated != item_idx]
    if len(user_rated) > 0:
        content_sims = content_sim[item_idx][user_rated]
        content_score = content_sims.mean()
    else:
        content_score = 0.0

    return w_cf * cf_score + w_content * content_score

5. Evaluation Metrics

5.1 Rating Prediction

Python
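import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# y_true: held-out ratings, y_pred: predictions for the same user-item pairs.
# squared=False was deprecated in scikit-learn 1.4, so take the root explicitly.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)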

5.2 Ranking Metrics

| Metric | What it measures |
|---|---|
| Precision@K | Of the K recommended items, what fraction is relevant? |
| Recall@K | Of all relevant items, what fraction appears in the top K? |
| NDCG@K | Ranking quality (relevant items nearer the top score higher) |
| MAP | Mean over users of the average precision |
| Hit Rate | Did the user click or buy at least one recommended item? |

Python
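import numpy as np

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / k

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: hits are discounted by log2 of their rank."""
    dcg = sum(1 / np.log2(i + 2) for i, item in enumerate(recommended[:k])
              if item in relevant)
    # Ideal DCG: all relevant items packed into the top positions
    ideal = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0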
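
The table also lists Recall@K, MAP, and Hit Rate, which the code above does not cover. The sketch below shows one common way to compute them for binary relevance. The function names are illustrative, the recommendation lists are assumed to contain no duplicates, and the average precision is normalized by min(number of relevant items, k), which is one of several conventions in use.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & relevant) / len(relevant)

def hit_rate_at_k(recommended, relevant, k):
    """1.0 if at least one relevant item is in the top k, else 0.0."""
    return 1.0 if set(recommended[:k]) & set(relevant) else 0.0

def average_precision_at_k(recommended, relevant, k):
    """Mean of precision@rank taken at each rank where a hit occurs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)

def mean_average_precision(all_recommended, all_relevant, k):
    """MAP: average of per-user average precision."""
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps) if aps else 0.0

# One user: items "a" and "c" are hits at ranks 1 and 3
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(recall_at_k(recommended, relevant, k=3))
print(hit_rate_at_k(recommended, relevant, k=3))
print(average_precision_at_k(recommended, relevant, k=3))
```

Hit Rate over a test set is then simply the mean of `hit_rate_at_k` across users.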

📝 Quiz

  1. What data does Collaborative Filtering use?

    • Item features
    • User-item interactions (ratings, clicks)
    • User demographics
    • Product descriptions
  2. When does the cold start problem occur?

    • A new user or a new item has no interaction data yet
    • Server cold boot
    • The data is corrupted
    • The model is too complex
  3. What does NDCG@K measure?

    • The number of recommendations
    • Ranking quality (relevant items in the top positions)
    • Speed
    • Coverage

🎯 Key Takeaways

  1. CF: built on user behavior; strong, but suffers from the cold start problem
  2. Content-based: built on item features; handles cold start better, especially for new items
  3. Hybrid: combines the strengths of both; common in production systems
  4. Evaluation: NDCG@K and Precision@K for ranking quality
  5. Trade-off: accuracy vs. coverage vs. diversity
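
Coverage and diversity in the last takeaway can be measured as well as accuracy. The sketch below uses two common definitions, stated here as assumptions: catalog coverage is the share of items that appear in at least one user's list, and intra-list diversity is the average of (1 - similarity) over all pairs in a list. The similarity matrix and recommendation lists are made-up illustrative values.

```python
import numpy as np

def catalog_coverage(recommendation_lists, n_items):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = set()
    for recs in recommendation_lists:
        recommended.update(recs)
    return len(recommended) / n_items

def intra_list_diversity(recs, sim_matrix):
    """Average (1 - similarity) over all distinct item pairs in one list."""
    if len(recs) < 2:
        return 0.0
    idx = np.asarray(recs)
    sub = sim_matrix[np.ix_(idx, idx)]              # pairwise similarities
    upper = sub[np.triu_indices(len(idx), k=1)]     # each pair counted once
    return float(1.0 - upper.mean())

# Hypothetical item-item similarity matrix for a 4-item catalog
sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.5],
    [0.2, 0.4, 0.5, 1.0],
])
lists = [[0, 1], [0, 2], [1, 0]]  # top-2 lists for three users

print("coverage:", catalog_coverage(lists, n_items=4))
print("diversity of [0, 1]:", round(intra_list_diversity([0, 1], sim), 2))
print("diversity of [0, 2]:", round(intra_list_diversity([0, 2], sim), 2))
```

A list made of two near-identical items scores low on diversity even if both are accurate recommendations, which is the tension the takeaway refers to.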

🚀 Next lesson

Matrix Factorization: SVD, ALS, and a deep dive into latent factors!