Recommendation Systems Overview
Recommendation systems (RecSys) are among the most widely deployed ML applications; Netflix, Shopee, and YouTube all rely on them. This lesson covers the foundational concepts.
🎯 Objectives
- Understand the three main approaches to RecSys
- Implement user-based and item-based Collaborative Filtering
- Apply Content-based Filtering
- Combine approaches with Hybrid methods and measure results with evaluation metrics
1. RecSys Taxonomy
1.1 Three Main Approaches
| Approach | Idea | Data Needed |
|---|---|---|
| Collaborative Filtering | People similar to you liked it, so you probably will too | User-item interactions |
| Content-based | You liked item X, so recommend items similar to X | Item features |
| Hybrid | Combine CF and Content-based | Both |
1.2 Problem Formulation
```text
User-Item Matrix (Rating):

          Item1  Item2  Item3  Item4  Item5
User A      5      3      ?      1      ?
User B      4      ?      ?      1      ?
User C      ?      1      ?      5      4
User D      1      ?      5      4      ?

Goal: Predict "?" values → Recommend top items with highest predicted rating
```
2. Collaborative Filtering
2.1 User-based CF
Idea: Users who rated items similarly in the past will tend to rate other items similarly.
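One common way to write this idea down, and the form the code in this section computes, is a similarity-weighted average. The notation is introduced here for illustration only: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $N_k(u,i)$ is the set of the $k$ users most similar to $u$ among those who rated $i$, and $\mathrm{sim}(u,v)$ is the cosine similarity between the rating rows of $u$ and $v$.

$$
\hat{r}_{u,i} = \frac{\sum_{v \in N_k(u,i)} \mathrm{sim}(u,v)\, r_{v,i}}{\sum_{v \in N_k(u,i)} \mathrm{sim}(u,v)}
$$

In the matrix from 1.2, only User D has rated Item3, so for User A the sum has a single term and the prediction is simply User D's rating of 5, whatever the (positive) similarity between A and D.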
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# User-item rating matrix (0 = not rated)
ratings = np.array([
    [5, 3, 0, 1, 0],  # User A
    [4, 0, 0, 1, 0],  # User B
    [0, 1, 0, 5, 4],  # User C
    [1, 0, 5, 4, 0],  # User D
])

# Compute user similarity
user_sim = cosine_similarity(ratings)
print("User similarity matrix:")
print(user_sim.round(2))

def predict_user_cf(user_idx, item_idx, ratings, sim_matrix, k=2):
    """Predict rating for user-item pair using k nearest users."""
    # Find users who rated this item
    rated_mask = ratings[:, item_idx] > 0
    rated_mask[user_idx] = False  # Exclude target user

    if not rated_mask.any():
        return 0

    # Get top-k similar users who rated this item
    sims = sim_matrix[user_idx][rated_mask]
    user_ratings = ratings[rated_mask, item_idx]

    top_k = min(k, len(sims))
    top_indices = np.argsort(sims)[-top_k:]

    # Weighted average
    weights = sims[top_indices]
    weighted_ratings = user_ratings[top_indices]

    if weights.sum() == 0:
        return 0
    return np.dot(weights, weighted_ratings) / weights.sum()
```
2.2 Item-based CF
Idea: Items that receive similar ratings from the same users are similar to each other.
```python
# Item similarity (transpose so that rows are items)
item_sim = cosine_similarity(ratings.T)

def predict_item_cf(user_idx, item_idx, ratings, item_sim_matrix, k=2):
    """Predict rating using item-based CF."""
    user_ratings = ratings[user_idx]
    rated_items = np.where(user_ratings > 0)[0]

    if len(rated_items) == 0:
        return 0

    # Similarity of target item to rated items
    sims = item_sim_matrix[item_idx][rated_items]
    item_ratings = user_ratings[rated_items]

    top_k = min(k, len(sims))
    top_indices = np.argsort(sims)[-top_k:]

    weights = sims[top_indices]
    weighted_ratings = item_ratings[top_indices]

    if weights.sum() == 0:
        return 0
    return np.dot(weights, weighted_ratings) / weights.sum()
```
2.3 User-based vs Item-based
| | User-based | Item-based |
|---|---|---|
| Computation | O(users^2) | O(items^2) |
| Best when | Items >> Users | Users >> Items |
| Stability | Less stable | More stable |
| Example | News (items churn quickly) | E-commerce (stable catalog, many users) |
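To make the comparison concrete, the sketch below runs both predictors on the toy matrix from 2.1 and prints their predictions for User A's unrated items. It is a compact restatement of the two functions above, not a different algorithm; the helper name `knn_weighted_average` and the function names `predict_user_based` / `predict_item_based` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix from section 2.1 (0 = not rated)
ratings = np.array([
    [5, 3, 0, 1, 0],  # User A
    [4, 0, 0, 1, 0],  # User B
    [0, 1, 0, 5, 4],  # User C
    [1, 0, 5, 4, 0],  # User D
], dtype=float)

def knn_weighted_average(sims, values, k):
    """Similarity-weighted mean of the k most similar neighbours (hypothetical helper)."""
    if len(sims) == 0:
        return 0.0
    top = np.argsort(sims)[-min(k, len(sims)):]
    weights = sims[top]
    if weights.sum() == 0:
        return 0.0
    return float(np.dot(weights, values[top]) / weights.sum())

def predict_user_based(u, i, ratings, k=2):
    """Neighbours are other users who rated item i."""
    user_sim = cosine_similarity(ratings)
    mask = ratings[:, i] > 0
    mask[u] = False  # never use the target user as their own neighbour
    return knn_weighted_average(user_sim[u][mask], ratings[mask, i], k)

def predict_item_based(u, i, ratings, k=2):
    """Neighbours are items that user u has already rated."""
    item_sim = cosine_similarity(ratings.T)
    rated = np.where(ratings[u] > 0)[0]
    return knn_weighted_average(item_sim[i][rated], ratings[u, rated], k)

user_a = 0
for item in np.where(ratings[user_a] == 0)[0]:
    ub = predict_user_based(user_a, item, ratings)
    ib = predict_item_based(user_a, item, ratings)
    print(f"Item{item + 1}: user-based={ub:.2f}, item-based={ib:.2f}")
```

On a matrix this sparse the two methods can disagree sharply: each of User A's missing items was rated by only one other user, so user-based CF copies that single neighbour's rating, while item-based CF averages User A's own ratings on the most similar items.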
3. Content-based Filtering
3.1 Concept
```text
User Profile → Built from the items the user liked
Item Profile → Built from item features

Recommendation: Items whose profile matches the User Profile
```
3.2 TF-IDF for Content Features
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Product descriptions
products = [
    "High-end gaming laptop, RTX 4080, 32GB RAM",
    "Lightweight office laptop, Intel i5, 16GB RAM",
    "Gaming PC, RTX 4090, 64GB RAM, custom build",
    "Drawing tablet, Apple Pencil, M2 chip",
    "Gaming laptop, RTX 4070, 16GB RAM, OLED",
]

# TF-IDF vectorization
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(products)

# Pairwise similarity between products
product_sim = cosine_similarity(features)

def recommend_content(product_idx, sim_matrix, n=3):
    """Recommend the n products most similar to product_idx."""
    sim_scores = [
        (idx, score) for idx, score in enumerate(sim_matrix[product_idx])
        if idx != product_idx  # Exclude the query product itself
    ]
    sim_scores.sort(key=lambda x: x[1], reverse=True)
    return sim_scores[:n]

# Products similar to "High-end gaming laptop"
recs = recommend_content(0, product_sim)
for idx, score in recs:
    print(f"  {products[idx]} (similarity: {score:.2f})")
```
3.3 Multi-feature Content-based
```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Product features
df = pd.DataFrame({
    'name': ['iPhone 15', 'Samsung S24', 'Pixel 8', 'iPhone 14'],
    'brand': ['Apple', 'Samsung', 'Google', 'Apple'],
    'price_range': ['high', 'high', 'mid', 'mid'],
    'camera_mp': [48, 50, 50, 48],
    'battery_mah': [3877, 4000, 4575, 3279]
})

# One-hot encode categorical features (dtype=float keeps the matrix numeric)
cat_features = pd.get_dummies(df[['brand', 'price_range']], dtype=float)
num_features = df[['camera_mp', 'battery_mah']]

# Standardize numeric features so no single column dominates the similarity
scaler = StandardScaler()
num_scaled = pd.DataFrame(
    scaler.fit_transform(num_features),
    columns=num_features.columns
)

# Combine features
all_features = pd.concat([cat_features, num_scaled], axis=1)
content_sim = cosine_similarity(all_features)
```
4. Hybrid Methods
4.1 Hybrid Strategies
| Strategy | How | Example |
|---|---|---|
| Weighted | CF_score * w1 + Content_score * w2 | Simple combination |
| Switching | Use Content for new users, CF for active | Cold start handling |
| Feature Augmentation | Use CF output as input to Content model | Cascade |
| Meta-Level | CF model learns from Content features | Knowledge integration |
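The Switching row can be sketched as a small dispatcher: users with little history get the content-based score, everyone else gets the CF score. This is a minimal illustration under stated assumptions, not a production design. The names `switching_hybrid`, `cf_predict`, `content_predict`, and the `min_interactions` threshold of 3 are all hypothetical, and the two stub predictors stand in for real models.

```python
import numpy as np

def switching_hybrid(user_idx, item_idx, ratings, cf_predict, content_predict,
                     min_interactions=3):
    """Return (score, source), choosing the model by how much history the user has."""
    n_rated = int(np.count_nonzero(ratings[user_idx]))
    if n_rated < min_interactions:
        # Too few interactions for CF to be reliable: fall back to content
        return content_predict(user_idx, item_idx), "content"
    return cf_predict(user_idx, item_idx), "cf"

# Toy data: 0 = not rated
ratings = np.array([
    [5, 3, 0, 1, 0],  # 3 ratings -> treated as an active user
    [4, 0, 0, 1, 0],  # 2 ratings -> treated as a new user
])

# Stand-in predictors returning fixed scores (placeholders for real models)
def cf_stub(user_idx, item_idx):
    return 4.2

def content_stub(user_idx, item_idx):
    return 3.1

print(switching_hybrid(0, 2, ratings, cf_stub, content_stub))
print(switching_hybrid(1, 2, ratings, cf_stub, content_stub))
```

Returning the source alongside the score makes it easy to log which branch served each recommendation when tuning the threshold.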
4.2 Simple Weighted Hybrid
```python
def hybrid_recommend(user_idx, item_idx, ratings, item_sim, content_sim,
                     w_cf=0.7, w_content=0.3, max_rating=5):
    """Weighted hybrid: CF + Content-based.

    content_sim must be an (n_items x n_items) matrix whose rows and
    columns are aligned with the columns of ratings.
    """
    # CF score, rescaled from the rating scale to [0, 1] so that it is
    # comparable with the cosine-based content score
    cf_score = predict_item_cf(user_idx, item_idx, ratings, item_sim) / max_rating

    # Content score: mean similarity to the items the user rated above 3
    liked = np.where(ratings[user_idx] > 3)[0]
    if len(liked) > 0:
        content_score = content_sim[item_idx][liked].mean()
    else:
        content_score = 0

    return w_cf * cf_score + w_content * content_score
```
5. Evaluation Metrics
5.1 Rating Prediction
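```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true, y_pred = [5, 3, 1, 4], [4.5, 3.2, 1.8, 3.6]  # placeholder held-out ratings and predictions

# np.sqrt(MSE) instead of squared=False, which newer scikit-learn releases no longer accept
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
```
5.2 Ranking Metrics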
| Metric | What it measures |
|---|---|
| Precision@K | Of the K recommended items, how many are relevant? |
| Recall@K | Of all relevant items, how many appear in the top K? |
| NDCG@K | Ranking quality (relevant items near the top score higher) |
| MAP | Mean of the average precision across users |
| Hit Rate | Did the user click or buy at least one recommended item? |
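Recall@K, Hit Rate, and MAP from the table can be sketched in a few lines. The function names are chosen here for illustration, relevance is treated as binary, and each recommendation list is assumed to contain no duplicates. Average precision has more than one normalization in common use; this sketch divides by `min(len(relevant), k)`, which is an assumption rather than the only convention.

```python
def recall_at_k(recommended, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

def hit_rate_at_k(recommended, relevant, k):
    """1.0 if at least one relevant item is in the top k, else 0.0."""
    return 1.0 if set(recommended[:k]) & set(relevant) else 0.0

def average_precision_at_k(recommended, relevant, k):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def mean_average_precision(all_recommended, all_relevant, k):
    """MAP: average of per-user average precision."""
    scores = [average_precision_at_k(rec, rel, k)
              for rec, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0

# Illustrative data: one ranked list and one relevant set per user
recs = [["i3", "i1", "i7", "i4", "i9"], ["a", "b"]]
rels = [{"i1", "i4", "i8"}, {"a"}]

print(recall_at_k(recs[0], rels[0], k=5))
print(hit_rate_at_k(recs[0], rels[0], k=5))
print(mean_average_precision(recs, rels, k=5))
```

For the first user, relevant items sit at ranks 2 and 4, so average precision is (1/2 + 2/4) / 3 = 1/3 and recall is 2/3, since one relevant item never appears in the list.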
```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    rec_k = recommended[:k]
    hits = len(set(rec_k) & set(relevant))
    return hits / k

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance (an item is either relevant or not)."""
    dcg = sum(1 / np.log2(i + 2) for i, item in enumerate(recommended[:k])
              if item in relevant)
    ideal = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0
```
📝 Quiz
- What data does Collaborative Filtering use?
  - Item features
  - User-item interactions (ratings, clicks)
  - User demographics
  - Product descriptions
- When does the Cold Start problem occur?
  - When a new user or new item has no interaction data yet
  - On a server cold boot
  - When the data is corrupted
  - When the model is too complex
- What does NDCG@K measure?
  - The number of recommendations
  - Ranking quality (relevant items in the top positions)
  - Speed
  - Coverage
🎯 Key Takeaways
- CF: based on user behavior; strong, but suffers from the cold start problem
- Content-based: based on item features; handles cold start better
- Hybrid: combines the strengths of both; most production systems use some form of it
- Evaluation: NDCG@K and Precision@K for ranking quality
- Trade-off: Accuracy vs Coverage vs Diversity
🚀 Next Lesson
Matrix Factorization: SVD, ALS, and a deep dive into latent factors!
