Capstone Project: E-commerce Recommendation System
Build a complete recommendation system for an e-commerce platform, applying everything from the course: ML models, RecSys, MLOps, deployment, and monitoring.
🎯 Project Overview
Business Problem: An e-commerce platform needs to recommend products to users in order to increase conversion rate and average order value.
Deliverables:
- Recommendation model (collaborative + content-based hybrid)
- REST API serving recommendations
- Experiment tracking (MLflow)
- Basic monitoring dashboard
📋 Project Phases
Phase 1: Data & EDA (20 points)
Tasks:
- Load và explore e-commerce dataset
- Data quality checks
- User-item interaction analysis
- Feature engineering
```python
# Suggested datasets:
# - Kaggle "Brazilian E-Commerce" (Olist)
# - Kaggle "Retailrocket Recommender System Dataset"
# - Kaggle "Amazon Product Reviews"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
orders = pd.read_csv('data/orders.csv')
products = pd.read_csv('data/products.csv')
reviews = pd.read_csv('data/reviews.csv')

# EDA checklist:
# 1. Dataset shape and data types
# 2. Missing values analysis
# 3. User activity distribution (how many orders per user?)
# 4. Product popularity distribution (power law?)
# 5. Rating distribution
# 6. Temporal patterns (orders over time)
# 7. Category distribution
# 8. Price distribution
```
Deliverables Phase 1:
| Item | Points |
|---|---|
| Data loading & cleaning | 5 |
| EDA with 5+ visualizations | 8 |
| Feature engineering (5+ features) | 7 |
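For the feature-engineering item, user- and item-level aggregates are a common starting point. A minimal sketch; the toy data and column names (`user_id`, `product_id`, `price`, `rating`) are illustrative assumptions, not taken from any specific dataset:

```python
import pandas as pd

# Hypothetical interaction table (toy data for illustration)
df = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 2, 3],
    "product_id": [10, 11, 10, 12, 12, 11],
    "price":      [9.9, 19.9, 9.9, 4.5, 4.5, 19.9],
    "rating":     [5, 4, 3, 5, 4, 2],
})

# User-level features: activity and spend behavior
user_feats = df.groupby("user_id").agg(
    n_orders=("product_id", "count"),
    n_unique_products=("product_id", "nunique"),
    avg_price=("price", "mean"),
    avg_rating=("rating", "mean"),
)

# Item-level features: popularity and perceived quality
item_feats = df.groupby("product_id").agg(
    popularity=("user_id", "nunique"),
    mean_rating=("rating", "mean"),
)

print(user_feats)
print(item_feats)
```

Joining these aggregates back onto the interaction table already yields several of the 5+ required features.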
Phase 2: Recommendation Models (30 points)
Task A: Collaborative Filtering (15 pts)
```python
# Implement at least 2 CF approaches.

# Approach 1: Matrix Factorization (SVD)
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], reader)

svd = SVD(n_factors=100, n_epochs=20)
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)
print(f"SVD RMSE: {results['test_rmse'].mean():.4f}")

# Approach 2: Neural Collaborative Filtering
# (Use NCF from lesson 08)
```
Task B: Content-Based Filtering (10 pts)
```python
# Use product features for content-based recommendations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Product text features
products['text_features'] = (
    products['category'] + ' ' +
    products['product_name'] + ' ' +
    products['description'].fillna('')
)

tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = tfidf.fit_transform(products['text_features'])

# Similar products
def get_similar_products(product_id, top_n=10):
    idx = products[products['product_id'] == product_id].index[0]
    sim_scores = cosine_similarity(tfidf_matrix[idx:idx+1], tfidf_matrix)[0]
    top_indices = sim_scores.argsort()[-top_n-1:-1][::-1]
    return products.iloc[top_indices][['product_id', 'product_name', 'category']]
```
Task C: Hybrid Model (5 pts)
```python
# Combine CF + content-based scores
def hybrid_recommend(user_id, top_n=10, alpha=0.7):
    """
    alpha: weight for CF (1 - alpha for content-based)
    """
    # CF scores
    cf_scores = get_cf_recommendations(user_id, top_n=50)

    # Content-based scores (based on the user's purchase history)
    user_history = get_user_history(user_id)
    cb_scores = get_content_recommendations(user_history, top_n=50)

    # Merge and weight
    all_items = set(cf_scores.keys()) | set(cb_scores.keys())
    hybrid_scores = {}
    for item in all_items:
        cf = cf_scores.get(item, 0)
        cb = cb_scores.get(item, 0)
        hybrid_scores[item] = alpha * cf + (1 - alpha) * cb

    # Sort and return the top N
    ranked = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]
```
Deliverables Phase 2:
| Item | Points |
|---|---|
| SVD/ALS collaborative filtering | 8 |
| NCF or another CF model | 7 |
| Content-based filtering | 7 |
| Hybrid combination | 5 |
| Model comparison table | 3 |
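The model comparison table can be assembled directly from cross-validation outputs with pandas. A sketch in which every metric value is a placeholder, to be replaced with your own `cross_validate` results:

```python
import pandas as pd

# Placeholder metrics -- substitute the cross_validate results from your runs
results = [
    {"model": "SVD (100 factors)",  "rmse": 0.93, "mae": 0.72},
    {"model": "NCF",                "rmse": 0.91, "mae": 0.70},
    {"model": "Hybrid (alpha=0.7)", "rmse": 0.90, "mae": 0.69},
]

# One row per model, sorted so the best (lowest RMSE) comes first
comparison = pd.DataFrame(results).set_index("model").sort_values("rmse")
best_model = comparison.index[0]

print(comparison)
print("Best by RMSE:", best_model)
```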
Phase 3: MLOps & Experiment Tracking (25 points)
Task A: MLflow Integration (15 pts)
```python
import pickle

import mlflow
import numpy as np
from surprise import SVD, NMF
from surprise.model_selection import cross_validate

mlflow.set_experiment("ecommerce-recsys")

models_config = [
    {"name": "svd_50", "model": SVD(n_factors=50)},
    {"name": "svd_100", "model": SVD(n_factors=100)},
    {"name": "svd_200", "model": SVD(n_factors=200)},
    {"name": "nmf_50", "model": NMF(n_factors=50)},
]

for config in models_config:
    with mlflow.start_run(run_name=config["name"]):
        model = config["model"]

        # Log params
        mlflow.log_params({
            "algo": config["name"],
            "n_factors": model.n_factors
        })

        # Train & evaluate
        results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5)

        # Log metrics (fit_time is a plain sequence, so use np.mean)
        mlflow.log_metrics({
            "rmse_mean": results['test_rmse'].mean(),
            "rmse_std": results['test_rmse'].std(),
            "mae_mean": results['test_mae'].mean(),
            "fit_time": np.mean(results['fit_time'])
        })

        # Retrain on the full data and log the model.
        # Surprise models are not sklearn estimators, so pickle them
        # and log as a plain artifact instead of mlflow.sklearn.log_model.
        trainset = data.build_full_trainset()
        model.fit(trainset)
        with open("model.pkl", "wb") as f:
            pickle.dump(model, f)
        mlflow.log_artifact("model.pkl")
```
Task B: Model Evaluation Framework (10 pts)
```python
def evaluate_recommender(model, test_data, k_values=[5, 10, 20]):
    """Precision@K and Recall@K averaged over all test users."""
    results = {}

    for k in k_values:
        precisions = []
        recalls = []

        for user_id in test_data['user_id'].unique():
            # Actual items the user interacted with
            actual = set(test_data[
                test_data['user_id'] == user_id
            ]['product_id'])

            # Model recommendations
            recommended = get_recommendations(model, user_id, k)
            rec_set = set(recommended)

            # Precision@K
            hits = len(rec_set & actual)
            precisions.append(hits / k)

            # Recall@K
            recalls.append(hits / len(actual) if actual else 0)

        results[f'precision_at_{k}'] = sum(precisions) / len(precisions)
        results[f'recall_at_{k}'] = sum(recalls) / len(recalls)

    return results
```
Deliverables Phase 3:
| Item | Points |
|---|---|
| MLflow experiment tracking (5+ runs) | 10 |
| Model comparison dashboard | 5 |
| Evaluation metrics (Precision/Recall @K) | 7 |
| Best model selection & documentation | 3 |
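As a sanity check on the evaluation metrics deliverable: with K = 5, if a user's held-out set contains 4 items and 2 of the 5 recommended items are hits, then Precision@5 = 2/5 and Recall@5 = 2/4. A tiny worked example with toy IDs:

```python
recommended = [10, 11, 12, 13, 14]   # top-5 from the model
actual = {11, 14, 20, 21}            # held-out items the user interacted with

hits = len(set(recommended) & actual)     # items 11 and 14 hit -> 2
precision_at_5 = hits / len(recommended)  # 2 / 5
recall_at_5 = hits / len(actual)          # 2 / 4

print(precision_at_5, recall_at_5)  # 0.4 0.5
```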
Phase 4: Deployment & Monitoring (25 points)
Task A: REST API (15 pts)
```python
# api.py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List

app = FastAPI(title="E-commerce Recommendations API")

class RecommendationRequest(BaseModel):
    user_id: int
    n_recommendations: int = 10

class ProductRecommendation(BaseModel):
    product_id: int
    product_name: str
    score: float
    reason: str

class RecommendationResponse(BaseModel):
    user_id: int
    recommendations: List[ProductRecommendation]
    model_version: str

@app.post("/recommend", response_model=RecommendationResponse)
def recommend(request: RecommendationRequest):
    # Get hybrid recommendations
    recs = hybrid_recommend(request.user_id, request.n_recommendations)

    products = []
    for product_id, score in recs:
        product_info = get_product_info(product_id)
        products.append(ProductRecommendation(
            product_id=product_id,
            product_name=product_info['name'],
            score=round(score, 4),
            reason=determine_reason(product_id, request.user_id)
        ))

    return RecommendationResponse(
        user_id=request.user_id,
        recommendations=products,
        model_version="1.0.0"
    )

@app.get("/similar/{product_id}")
def similar_products(product_id: int, n: int = 5):
    """Content-based similar items."""
    similar = get_similar_products(product_id, top_n=n)
    return {"product_id": product_id, "similar": similar}

@app.get("/health")
def health():
    return {"status": "healthy", "model": "hybrid_v1"}
```
Task B: Monitoring (10 pts)
```python
# Basic monitoring
import logging
from collections import Counter
from datetime import datetime

class RecsysMonitor:
    def __init__(self):
        self.request_log = []
        self.logger = logging.getLogger("recsys")

    def log_recommendation(self, user_id, recommendations, latency_ms):
        entry = {
            'timestamp': datetime.now().isoformat(),
            'user_id': user_id,
            'n_recs': len(recommendations),
            'top_categories': self._get_categories(recommendations),
            'latency_ms': latency_ms
        }
        self.request_log.append(entry)
        self.logger.info(
            f"Recommendation served: user={user_id}, "
            f"n={len(recommendations)}, latency={latency_ms}ms"
        )

    def get_daily_stats(self):
        """Daily monitoring report."""
        today_logs = [l for l in self.request_log
                      if l['timestamp'][:10] == datetime.now().strftime('%Y-%m-%d')]

        return {
            'total_requests': len(today_logs),
            'avg_latency_ms': sum(l['latency_ms'] for l in today_logs) / max(len(today_logs), 1),
            'unique_users': len(set(l['user_id'] for l in today_logs)),
            'popular_categories': Counter(
                cat for l in today_logs for cat in l.get('top_categories', [])
            ).most_common(5)
        }
```
Deliverables Phase 4:
| Item | Points |
|---|---|
| FastAPI with /recommend endpoint | 8 |
| /similar endpoint (content-based) | 4 |
| Docker containerization | 5 |
| Basic monitoring & logging | 5 |
| API documentation (Swagger) | 3 |
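For the Docker containerization deliverable, a minimal Dockerfile sketch. It assumes the layout from the project structure below (`src/api/app.py`, artifacts in `model_artifacts/`) and that `uvicorn` is listed in `requirements.txt`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code and trained model artifacts
COPY src/ src/
COPY model_artifacts/ model_artifacts/

EXPOSE 8000
CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Build and run with `docker build -t recsys-api .` followed by `docker run -p 8000:8000 recsys-api`.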
📊 Grading Rubric
| Phase | Max Points | Requirements |
|---|---|---|
| Phase 1: Data & EDA | 20 | Clean data, 5+ visualizations, features |
| Phase 2: Models | 30 | 2 CF + content-based + hybrid |
| Phase 3: MLOps | 25 | MLflow, comparison, evaluation metrics |
| Phase 4: Deployment | 25 | API + Docker + monitoring |
| Total | 100 | |
Grading Scale:
| Score | Level | Description |
|---|---|---|
| 90-100 | Excellent | Production-ready system |
| 75-89 | Good | Working system with good practices |
| 60-74 | Satisfactory | Basic system working |
| Below 60 | Needs improvement | Incomplete or major issues |
💡 Tips
Project Structure
```
ecommerce-recsys/
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_modeling.ipynb
│   └── 03_evaluation.ipynb
├── src/
│   ├── data/
│   │   ├── load.py
│   │   └── preprocess.py
│   ├── models/
│   │   ├── collaborative.py
│   │   ├── content_based.py
│   │   └── hybrid.py
│   ├── evaluation/
│   │   └── metrics.py
│   └── api/
│       └── app.py
├── model_artifacts/
├── Dockerfile
├── requirements.txt
├── README.md
└── mlflow/
```
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Cold-start users | Fallback to popular items |
| Sparse matrix | Use implicit feedback |
| Overfitting in CF | Regularization, cross-validation |
| Slow API | Precompute embeddings, cache |
| Memory issues | Sparse matrices, batch processing |
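The cold-start fallback in the first row can be as simple as routing users with little history to a popularity list. A sketch with toy data; `interactions`, `recommend_with_fallback`, and `personalized_recommend` are hypothetical names, and the personalized path is only a stand-in for the Phase 2 hybrid model:

```python
import pandas as pd

# Toy interaction log (illustrative)
interactions = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3, 3],
    "product_id": [10, 11, 10, 12, 10, 11, 12],
})

def personalized_recommend(user_id, top_n):
    # Stand-in for the hybrid model: popular items the user hasn't seen yet
    seen = set(interactions.loc[interactions["user_id"] == user_id, "product_id"])
    popular = interactions["product_id"].value_counts().index
    return [p for p in popular if p not in seen][:top_n]

def recommend_with_fallback(user_id, top_n=2, min_history=2):
    history = interactions[interactions["user_id"] == user_id]
    if len(history) < min_history:
        # Cold start: fall back to globally most-popular items
        return interactions["product_id"].value_counts().index[:top_n].tolist()
    return personalized_recommend(user_id, top_n)

print(recommend_with_fallback(99))  # unseen user -> popularity fallback
print(recommend_with_fallback(1))   # known user -> personalized path
```

The same guard works for cold-start items by falling back to content-based similarity, since item metadata is available before any interactions exist.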
Portfolio Tips
- README — Clear problem statement, approach, results
- Demo — Deployed API or Gradio demo
- Metrics — Show improvement over baseline
- Code quality — Clean, documented, tested
- LinkedIn post template:
```
Built an E-commerce Recommendation System:
- Hybrid model (CF + Content-based): Precision@10 = X%
- MLflow experiment tracking, 10+ experiments
- FastAPI deployment with Docker
- Real-time monitoring dashboard

Tech: Python, scikit-learn, Surprise, PyTorch, FastAPI, MLflow, Docker
```
🏆 Course Summary
| Lesson | Topic | Key Skills |
|---|---|---|
| 01 | Overview | Advanced ML landscape |
| 02 | Hyperparameter Tuning | Optuna, Bayesian Optimization |
| 03 | AutoML | Auto-sklearn, TPOT, H2O |
| 04 | Ensemble Methods | Stacking, Blending, Weighted |
| 05 | Transfer Learning | BERT, ResNet, Few-shot |
| 06 | RecSys Overview | CF, Content-based, Hybrid |
| 07 | Matrix Factorization | SVD, ALS, NMF |
| 08 | Deep RecSys | NCF, Two-Tower, SASRec |
| 09 | MLOps | MLflow, Pipelines, CI/CD |
| 10 | Model Deployment | FastAPI, Docker, Cloud |
| 11 | Feature Store & Monitoring | Feast, Evidently, Drift |
| 12 | Capstone Project | End-to-end RecSys |
🎯 What's Next?
After this course, you can continue with:
- Deep Learning Specialization — CNNs, Transformers, Generative AI
- MLOps Engineering — Kubernetes, Kubeflow, advanced pipelines
- Domain Specialization — NLP, Computer Vision, Time Series
- Research — Read papers, contribute to open-source
Congratulations on completing Advanced Machine Learning! 🎉
