Capstone Project: E-commerce Recommendation System
Build a complete recommendation system for an e-commerce platform, applying everything from the course: ML models, RecSys, MLOps, deployment, and monitoring.
🎯 Project Overview
Business Problem: An e-commerce platform needs to recommend products to users in order to increase conversion rate and average order value.
Deliverables:
- Recommendation model (collaborative + content-based hybrid)
- REST API serving recommendations
- Experiment tracking (MLflow)
- Basic monitoring dashboard
📋 Project Phases
Phase 1: Data & EDA (20 points)
Tasks:
- Load và explore e-commerce dataset
- Data quality checks
- User-item interaction analysis
- Feature engineering
```python
# Suggested datasets:
# - Kaggle "Brazilian E-Commerce" (Olist)
# - Kaggle "Retailrocket Recommender System Dataset"
# - Kaggle "Amazon Product Reviews"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
orders = pd.read_csv('data/orders.csv')
products = pd.read_csv('data/products.csv')
reviews = pd.read_csv('data/reviews.csv')

# EDA checklist:
# 1. Dataset shape and data types
# 2. Missing values analysis
# 3. User activity distribution (how many orders per user?)
# 4. Product popularity distribution (power law?)
# 5. Rating distribution
# 6. Temporal patterns (orders over time)
# 7. Category distribution
# 8. Price distribution
```
Deliverables Phase 1:
| Item | Points |
|---|---|
| Data loading & cleaning | 5 |
| EDA with 5+ visualizations | 8 |
| Feature engineering (5+ features) | 7 |
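For the feature-engineering item, user- and item-level aggregates are a common starting point. A minimal sketch; the toy data and column names (`user_id`, `product_id`, `price`, `rating`) are illustrative assumptions, not taken from any specific dataset:

```python
import pandas as pd

# Hypothetical interaction table (toy data for illustration)
df = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 2, 3],
    "product_id": [10, 11, 10, 12, 12, 11],
    "price":      [9.9, 19.9, 9.9, 4.5, 4.5, 19.9],
    "rating":     [5, 4, 3, 5, 4, 2],
})

# User-level features: activity and spend behavior
user_feats = df.groupby("user_id").agg(
    n_orders=("product_id", "count"),
    n_unique_products=("product_id", "nunique"),
    avg_price=("price", "mean"),
    avg_rating=("rating", "mean"),
)

# Item-level features: popularity and perceived quality
item_feats = df.groupby("product_id").agg(
    popularity=("user_id", "nunique"),
    mean_rating=("rating", "mean"),
)

print(user_feats)
print(item_feats)
```

Joining these aggregates back onto the interaction table already yields several of the 5+ required features.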
Phase 2: Recommendation Models (30 points)
Task A: Collaborative Filtering (15 pts)
```python
# Implement at least 2 CF approaches.

# Approach 1: Matrix Factorization (SVD)
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], reader)

svd = SVD(n_factors=100, n_epochs=20)
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)
print(f"SVD RMSE: {results['test_rmse'].mean():.4f}")

# Approach 2: Neural Collaborative Filtering
# (Use NCF from lesson 08)
```
Task B: Content-Based Filtering (10 pts)
```python
# Use product features for content-based recommendations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Product text features
products['text_features'] = (
    products['category'] + ' ' +
    products['product_name'] + ' ' +
    products['description'].fillna('')
)

tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = tfidf.fit_transform(products['text_features'])

# Similar products
def get_similar_products(product_id, top_n=10):
    idx = products[products['product_id'] == product_id].index[0]
    sim_scores = cosine_similarity(tfidf_matrix[idx:idx+1], tfidf_matrix)[0]
    top_indices = sim_scores.argsort()[-top_n-1:-1][::-1]
    return products.iloc[top_indices][['product_id', 'product_name', 'category']]
```
Task C: Hybrid Model (5 pts)
```python
# Combine CF + content-based scores
def hybrid_recommend(user_id, top_n=10, alpha=0.7):
    """
    alpha: weight for CF (1 - alpha for content-based)
    """
    # CF scores
    cf_scores = get_cf_recommendations(user_id, top_n=50)

    # Content-based scores (based on the user's purchase history)
    user_history = get_user_history(user_id)
    cb_scores = get_content_recommendations(user_history, top_n=50)

    # Merge and weight
    all_items = set(cf_scores.keys()) | set(cb_scores.keys())
    hybrid_scores = {}
    for item in all_items:
        cf = cf_scores.get(item, 0)
        cb = cb_scores.get(item, 0)
        hybrid_scores[item] = alpha * cf + (1 - alpha) * cb

    # Sort and return the top N
    ranked = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]
```
Deliverables Phase 2:
| Item | Points |
|---|---|
| SVD/ALS collaborative filtering | 8 |
| NCF or another CF model | 7 |
| Content-based filtering | 7 |
| Hybrid combination | 5 |
| Model comparison table | 3 |
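The model comparison table can be assembled directly from cross-validation outputs with pandas. A sketch in which every metric value is a placeholder, to be replaced with your own `cross_validate` results:

```python
import pandas as pd

# Placeholder metrics -- substitute the cross_validate results from your runs
results = [
    {"model": "SVD (100 factors)",  "rmse": 0.93, "mae": 0.72},
    {"model": "NCF",                "rmse": 0.91, "mae": 0.70},
    {"model": "Hybrid (alpha=0.7)", "rmse": 0.90, "mae": 0.69},
]

# One row per model, sorted so the best (lowest RMSE) comes first
comparison = pd.DataFrame(results).set_index("model").sort_values("rmse")
best_model = comparison.index[0]

print(comparison)
print("Best by RMSE:", best_model)
```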
Phase 3: MLOps & Experiment Tracking (25 points)
Task A: MLflow Integration (15 pts)
```python
import pickle

import mlflow
import numpy as np
from surprise import SVD, NMF
from surprise.model_selection import cross_validate

mlflow.set_experiment("ecommerce-recsys")

models_config = [
    {"name": "svd_50", "model": SVD(n_factors=50)},
    {"name": "svd_100", "model": SVD(n_factors=100)},
    {"name": "svd_200", "model": SVD(n_factors=200)},
    {"name": "nmf_50", "model": NMF(n_factors=50)},
]

for config in models_config:
    with mlflow.start_run(run_name=config["name"]):
        model = config["model"]

        # Log params
        mlflow.log_params({
            "algo": config["name"],
            "n_factors": model.n_factors
        })

        # Train & evaluate
        results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5)

        # Log metrics (fit_time is a plain sequence, so use np.mean)
        mlflow.log_metrics({
            "rmse_mean": results['test_rmse'].mean(),
            "rmse_std": results['test_rmse'].std(),
            "mae_mean": results['test_mae'].mean(),
            "fit_time": np.mean(results['fit_time'])
        })

        # Retrain on the full data and log the model.
        # Surprise models are not sklearn estimators, so pickle them
        # and log as a plain artifact instead of mlflow.sklearn.log_model.
        trainset = data.build_full_trainset()
        model.fit(trainset)
        with open("model.pkl", "wb") as f:
            pickle.dump(model, f)
        mlflow.log_artifact("model.pkl")
```
Task B: Model Evaluation Framework (10 pts)
```python
def evaluate_recommender(model, test_data, k_values=[5, 10, 20]):
    """Precision@K and Recall@K averaged over all test users."""
    results = {}

    for k in k_values:
        precisions = []
        recalls = []

        for user_id in test_data['user_id'].unique():
            # Actual items the user interacted with
            actual = set(test_data[
                test_data['user_id'] == user_id
            ]['product_id'])

            # Model recommendations
            recommended = get_recommendations(model, user_id, k)
            rec_set = set(recommended)

            # Precision@K
            hits = len(rec_set & actual)
            precisions.append(hits / k)

            # Recall@K
            recalls.append(hits / len(actual) if actual else 0)

        results[f'precision_at_{k}'] = sum(precisions) / len(precisions)
        results[f'recall_at_{k}'] = sum(recalls) / len(recalls)

    return results
```
Deliverables Phase 3:
| Item | Points |
|---|---|
| MLflow experiment tracking (5+ runs) | 10 |
| Model comparison dashboard | 5 |
| Evaluation metrics (Precision/Recall @K) | 7 |
| Best model selection & documentation | 3 |
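As a sanity check on the evaluation metrics deliverable: with K = 5, if a user's held-out set contains 4 items and 2 of the 5 recommended items are hits, then Precision@5 = 2/5 and Recall@5 = 2/4. A tiny worked example with toy IDs:

```python
recommended = [10, 11, 12, 13, 14]   # top-5 from the model
actual = {11, 14, 20, 21}            # held-out items the user interacted with

hits = len(set(recommended) & actual)     # items 11 and 14 hit -> 2
precision_at_5 = hits / len(recommended)  # 2 / 5
recall_at_5 = hits / len(actual)          # 2 / 4

print(precision_at_5, recall_at_5)  # 0.4 0.5
```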
Phase 4: Deployment & Monitoring (25 points)
Task A: REST API (15 pts)
```python
# api.py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List

app = FastAPI(title="E-commerce Recommendations API")

class RecommendationRequest(BaseModel):
    user_id: int
    n_recommendations: int = 10

class ProductRecommendation(BaseModel):
    product_id: int
    product_name: str
    score: float
    reason: str

class RecommendationResponse(BaseModel):
    user_id: int
    recommendations: List[ProductRecommendation]
    model_version: str

@app.post("/recommend", response_model=RecommendationResponse)
def recommend(request: RecommendationRequest):
    # Get hybrid recommendations
    recs = hybrid_recommend(request.user_id, request.n_recommendations)

    products = []
    for product_id, score in recs:
        product_info = get_product_info(product_id)
        products.append(ProductRecommendation(
            product_id=product_id,
            product_name=product_info['name'],
            score=round(score, 4),
            reason=determine_reason(product_id, request.user_id)
        ))

    return RecommendationResponse(
        user_id=request.user_id,
        recommendations=products,
        model_version="1.0.0"
    )

@app.get("/similar/{product_id}")
def similar_products(product_id: int, n: int = 5):
    """Content-based similar items."""
    similar = get_similar_products(product_id, top_n=n)
    return {"product_id": product_id, "similar": similar}

@app.get("/health")
def health():
    return {"status": "healthy", "model": "hybrid_v1"}
```
Task B: Monitoring (10 pts)
```python
# Basic monitoring
import logging
from collections import Counter
from datetime import datetime

class RecsysMonitor:
    def __init__(self):
        self.request_log = []
        self.logger = logging.getLogger("recsys")

    def log_recommendation(self, user_id, recommendations, latency_ms):
        entry = {
            'timestamp': datetime.now().isoformat(),
            'user_id': user_id,
            'n_recs': len(recommendations),
            'top_categories': self._get_categories(recommendations),
            'latency_ms': latency_ms
        }
        self.request_log.append(entry)
        self.logger.info(
            f"Recommendation served: user={user_id}, "
            f"n={len(recommendations)}, latency={latency_ms}ms"
        )

    def get_daily_stats(self):
        """Daily monitoring report."""
        today_logs = [l for l in self.request_log
                      if l['timestamp'][:10] == datetime.now().strftime('%Y-%m-%d')]

        return {
            'total_requests': len(today_logs),
            'avg_latency_ms': sum(l['latency_ms'] for l in today_logs) / max(len(today_logs), 1),
            'unique_users': len(set(l['user_id'] for l in today_logs)),
            'popular_categories': Counter(
                cat for l in today_logs for cat in l.get('top_categories', [])
            ).most_common(5)
        }
```
Deliverables Phase 4:
| Item | Points |
|---|---|
| FastAPI with /recommend endpoint | 8 |
| /similar endpoint (content-based) | 4 |
| Docker containerization | 5 |
| Basic monitoring & logging | 5 |
| API documentation (Swagger) | 3 |
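For the Docker containerization deliverable, a minimal Dockerfile sketch. It assumes the layout from the project structure below (`src/api/app.py`, artifacts in `model_artifacts/`) and that `uvicorn` is listed in `requirements.txt`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code and trained model artifacts
COPY src/ src/
COPY model_artifacts/ model_artifacts/

EXPOSE 8000
CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Build and run with `docker build -t recsys-api .` followed by `docker run -p 8000:8000 recsys-api`.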
📊 Grading Rubric
| Phase | Max Points | Requirements |
|---|---|---|
| Phase 1: Data & EDA | 20 | Clean data, 5+ visualizations, features |
| Phase 2: Models | 30 | 2 CF + content-based + hybrid |
| Phase 3: MLOps | 25 | MLflow, comparison, evaluation metrics |
| Phase 4: Deployment | 25 | API + Docker + monitoring |
| Total | 100 | |
Grading Scale:
| Score | Level | Description |
|---|---|---|
| 90-100 | Excellent | Production-ready system |
| 75-89 | Good | Working system with good practices |
| 60-74 | Satisfactory | Basic system working |
| Below 60 | Needs improvement | Incomplete or major issues |
💡 Tips
Project Structure
```
ecommerce-recsys/
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_modeling.ipynb
│   └── 03_evaluation.ipynb
├── src/
│   ├── data/
│   │   ├── load.py
│   │   └── preprocess.py
│   ├── models/
│   │   ├── collaborative.py
│   │   ├── content_based.py
│   │   └── hybrid.py
│   ├── evaluation/
│   │   └── metrics.py
│   └── api/
│       └── app.py
├── model_artifacts/
├── Dockerfile
├── requirements.txt
├── README.md
└── mlflow/
```
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Cold-start users | Fallback to popular items |
| Sparse matrix | Use implicit feedback |
| Overfitting in CF | Regularization, cross-validation |
| Slow API | Precompute embeddings, cache |
| Memory issues | Sparse matrices, batch processing |
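The cold-start fallback in the first row can be as simple as routing users with little history to a popularity list. A sketch with toy data; `interactions`, `recommend_with_fallback`, and `personalized_recommend` are hypothetical names, and the personalized path is only a stand-in for the Phase 2 hybrid model:

```python
import pandas as pd

# Toy interaction log (illustrative)
interactions = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3, 3],
    "product_id": [10, 11, 10, 12, 10, 11, 12],
})

def personalized_recommend(user_id, top_n):
    # Stand-in for the hybrid model: popular items the user hasn't seen yet
    seen = set(interactions.loc[interactions["user_id"] == user_id, "product_id"])
    popular = interactions["product_id"].value_counts().index
    return [p for p in popular if p not in seen][:top_n]

def recommend_with_fallback(user_id, top_n=2, min_history=2):
    history = interactions[interactions["user_id"] == user_id]
    if len(history) < min_history:
        # Cold start: fall back to globally most-popular items
        return interactions["product_id"].value_counts().index[:top_n].tolist()
    return personalized_recommend(user_id, top_n)

print(recommend_with_fallback(99))  # unseen user -> popularity fallback
print(recommend_with_fallback(1))   # known user -> personalized path
```

The same guard works for cold-start items by falling back to content-based similarity, since item metadata is available before any interactions exist.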
Portfolio Tips
- README — Clear problem statement, approach, results
- Demo — Deployed API or Gradio demo
- Metrics — Show improvement over baseline
- Code quality — Clean, documented, tested
- LinkedIn post template:
```
Built an E-commerce Recommendation System:
- Hybrid model (CF + Content-based): Precision@10 = X%
- MLflow experiment tracking, 10+ experiments
- FastAPI deployment with Docker
- Real-time monitoring dashboard

Tech: Python, scikit-learn, Surprise, PyTorch, FastAPI, MLflow, Docker
```
🏆 Course Summary
| Lesson | Topic | Key Skills |
|---|---|---|
| 01 | Overview | Advanced ML landscape |
| 02 | Hyperparameter Tuning | Optuna, Bayesian Optimization |
| 03 | AutoML | Auto-sklearn, TPOT, H2O |
| 04 | Ensemble Methods | Stacking, Blending, Weighted |
| 05 | Transfer Learning | BERT, ResNet, Few-shot |
| 06 | RecSys Overview | CF, Content-based, Hybrid |
| 07 | Matrix Factorization | SVD, ALS, NMF |
| 08 | Deep RecSys | NCF, Two-Tower, SASRec |
| 09 | MLOps | MLflow, Pipelines, CI/CD |
| 10 | Model Deployment | FastAPI, Docker, Cloud |
| 11 | Feature Store & Monitoring | Feast, Evidently, Drift |
| 12 | Capstone Project | End-to-end RecSys |
🎯 What's Next?
After this course, you can continue with:
- Deep Learning Specialization — CNNs, Transformers, Generative AI
- MLOps Engineering — Kubernetes, Kubeflow, advanced pipelines
- Domain Specialization — NLP, Computer Vision, Time Series
- Research — Read papers, contribute to open-source
Congratulations on completing Advanced Machine Learning! 🎉
