
Capstone Project: E-commerce Recommendation System

Build an end-to-end product recommendation system, from data to deployment


Build a complete recommendation system for an e-commerce platform, applying everything from the course: ML models, RecSys, MLOps, deployment, and monitoring.

🎯 Project Overview

Business Problem: The e-commerce platform needs to recommend products to users in order to increase conversion rate and average order value.

Deliverables:

  1. Recommendation model (collaborative + content-based hybrid)
  2. REST API serving recommendations
  3. Experiment tracking (MLflow)
  4. Basic monitoring dashboard

📋 Project Phases

Phase 1: Data & EDA (20 points)

Tasks:

  • Load and explore the e-commerce dataset
  • Data quality checks
  • User-item interaction analysis
  • Feature engineering
Python
# Suggested datasets:
# - Kaggle "Brazilian E-Commerce" (Olist)
# - Kaggle "Retailrocket Recommender System Dataset"
# - Kaggle "Amazon Product Reviews"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
orders = pd.read_csv('data/orders.csv')
products = pd.read_csv('data/products.csv')
reviews = pd.read_csv('data/reviews.csv')

# EDA Checklist:
# 1. Dataset shape and data types
# 2. Missing values analysis
# 3. User activity distribution (how many orders per user?)
# 4. Product popularity distribution (power law?)
# 5. Rating distribution
# 6. Temporal patterns (orders over time)
# 7. Category distribution
# 8. Price distribution
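The feature-engineering task can be sketched with per-user and per-item aggregates. A toy example with illustrative column names (your dataset's actual columns will differ):

```python
import pandas as pd

# Toy interactions frame standing in for the merged orders/reviews data;
# the column names here are illustrative, not fixed by any of the datasets.
interactions = pd.DataFrame({
    'user_id':    [1, 1, 2, 2, 2, 3],
    'product_id': [10, 11, 10, 12, 12, 11],
    'price':      [9.9, 25.0, 9.9, 40.0, 40.0, 25.0],
    'rating':     [5, 4, 3, 5, 4, 2],
})

# Example engineered features: per-user aggregates...
user_feats = interactions.groupby('user_id').agg(
    n_orders=('product_id', 'count'),
    avg_spend=('price', 'mean'),
    avg_rating_given=('rating', 'mean'),
)

# ...and per-item aggregates
item_feats = interactions.groupby('product_id').agg(
    popularity=('user_id', 'nunique'),
    avg_rating=('rating', 'mean'),
    avg_price=('price', 'mean'),
)

print(user_feats)
print(item_feats)
```

Recency features (days since last order) and category-level aggregates are natural additions to reach the 5+ feature requirement.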

Deliverables Phase 1:

Item                              | Points
Data loading & cleaning           | 5
EDA with 5+ visualizations        | 8
Feature engineering (5+ features) | 7

Phase 2: Recommendation Models (30 points)

Task A: Collaborative Filtering (15 pts)

Python
# Implement at least 2 CF approaches:

# Approach 1: Matrix Factorization (SVD)
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], reader)

svd = SVD(n_factors=100, n_epochs=20)
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)
print(f"SVD RMSE: {results['test_rmse'].mean():.4f}")

# Approach 2: Neural Collaborative Filtering
# (Use NCF from lesson 08)
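As a reminder of the NCF approach from lesson 08, a minimal MLP-style model in PyTorch might look like the following; the embedding dimension and layer sizes are illustrative, not tuned values:

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """Minimal MLP-style Neural Collaborative Filtering sketch."""
    def __init__(self, n_users, n_items, emb_dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, user_ids, item_ids):
        # Concatenate user and item embeddings, score the pair with the MLP
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)

model = NCF(n_users=1000, n_items=500)
scores = model(torch.tensor([0, 1]), torch.tensor([10, 20]))
print(tuple(scores.shape))  # one score per (user, item) pair
```

Train it with MSE loss on ratings (or BCE on implicit feedback), mirroring the lesson 08 setup.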

Task B: Content-Based Filtering (10 pts)

Python
# Use product features for content-based recommendations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Product text features
products['text_features'] = (
    products['category'] + ' ' +
    products['product_name'] + ' ' +
    products['description'].fillna('')
)

tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = tfidf.fit_transform(products['text_features'])

# Similar products
def get_similar_products(product_id, top_n=10):
    idx = products[products['product_id'] == product_id].index[0]
    sim_scores = cosine_similarity(tfidf_matrix[idx:idx+1], tfidf_matrix)[0]
    # The highest score is the product itself, so skip it and take the next top_n
    top_indices = sim_scores.argsort()[-top_n-1:-1][::-1]
    return products.iloc[top_indices][['product_id', 'product_name', 'category']]

Task C: Hybrid Model (5 pts)

Python
# Combine CF + Content-Based
def hybrid_recommend(user_id, top_n=10, alpha=0.7):
    """
    alpha: weight for CF (1 - alpha for content-based)
    """
    # CF scores
    cf_scores = get_cf_recommendations(user_id, top_n=50)

    # Content-based scores (based on user's purchase history)
    user_history = get_user_history(user_id)
    cb_scores = get_content_recommendations(user_history, top_n=50)

    # Merge and weight
    all_items = set(cf_scores.keys()) | set(cb_scores.keys())
    hybrid_scores = {}
    for item in all_items:
        cf = cf_scores.get(item, 0)
        cb = cb_scores.get(item, 0)
        hybrid_scores[item] = alpha * cf + (1 - alpha) * cb

    # Sort and return top N
    ranked = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]
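One caveat when blending: CF scores (predicted ratings, e.g. on a 1-5 scale) and content-based scores (cosine similarities in [0, 1]) live on different scales, so weighting raw values lets CF dominate. A minimal min-max normalization sketch (`minmax_normalize` is a hypothetical helper, not part of the course code):

```python
def minmax_normalize(scores):
    """Scale a {item: score} dict to [0, 1] so CF and CB scores are comparable.

    Plain min-max scaling; assumes a non-empty dict."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal -> neutral 0.5 for every item
        return {k: 0.5 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

cf = {'A': 4.2, 'B': 3.1, 'C': 4.9}   # e.g. predicted ratings on a 1-5 scale
cb = {'A': 0.12, 'C': 0.80}           # e.g. cosine similarities in [0, 1]
print(minmax_normalize(cf))
```

Normalize both score dicts before the weighted sum inside `hybrid_recommend`.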

Deliverables Phase 2:

Item                            | Points
SVD/ALS collaborative filtering | 8
NCF or another CF model         | 7
Content-based filtering         | 7
Hybrid combination              | 5
Model comparison table          | 3

Phase 3: MLOps & Experiment Tracking (25 points)

Task A: MLflow Integration (15 pts)

Python
import mlflow
import mlflow.sklearn
from surprise import SVD, NMF
from surprise.model_selection import cross_validate

mlflow.set_experiment("ecommerce-recsys")

models_config = [
    {"name": "svd_50", "model": SVD(n_factors=50)},
    {"name": "svd_100", "model": SVD(n_factors=100)},
    {"name": "svd_200", "model": SVD(n_factors=200)},
    {"name": "nmf_50", "model": NMF(n_factors=50)},
]

for config in models_config:
    with mlflow.start_run(run_name=config["name"]):
        model = config["model"]

        # Log params
        mlflow.log_params({
            "algo": config["name"],
            "n_factors": model.n_factors
        })

        # Train & evaluate
        results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5)

        # Log metrics
        mlflow.log_metrics({
            "rmse_mean": results['test_rmse'].mean(),
            "rmse_std": results['test_rmse'].std(),
            "mae_mean": results['test_mae'].mean(),
            "fit_time": results['fit_time'].mean()
        })

        # Train on full data and log model
        trainset = data.build_full_trainset()
        model.fit(trainset)
        mlflow.sklearn.log_model(model, "model")

Task B: Model Evaluation Framework (10 pts)

Python
def evaluate_recommender(model, test_data, k_values=[5, 10, 20]):
    """Comprehensive evaluation."""
    results = {}

    for k in k_values:
        precisions = []
        recalls = []

        for user_id in test_data['user_id'].unique():
            # Get actual items
            actual = set(test_data[
                test_data['user_id'] == user_id
            ]['product_id'])

            # Get recommendations
            recommended = get_recommendations(model, user_id, k)
            rec_set = set(recommended)

            # Precision@K
            hits = len(rec_set & actual)
            precisions.append(hits / k)

            # Recall@K
            recalls.append(hits / len(actual) if actual else 0)

        results[f'precision_at_{k}'] = sum(precisions) / len(precisions)
        results[f'recall_at_{k}'] = sum(recalls) / len(recalls)

    return results
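Beyond Precision@K and Recall@K, ranking quality is commonly measured with NDCG@K, which rewards placing relevant items near the top. A self-contained sketch for binary relevance (`ndcg_at_k` is a hypothetical helper using the standard log2-discounted formulation):

```python
import math

def ndcg_at_k(recommended, relevant, k):
    """NDCG@K for binary relevance.

    recommended: ranked list of item ids; relevant: set of ground-truth items."""
    # DCG: each hit contributes 1 / log2(rank + 2), rank being 0-based
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Ideal DCG: all achievable hits placed at the top of the list
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(r + 2) for r in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k(['a', 'b', 'c'], {'a', 'c'}, k=3))
```

Appending `ndcg_at_k(recommended, actual, k)` per user, analogous to the precision/recall lists, yields an `ndcg_at_{k}` entry in the results dict.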

Deliverables Phase 3:

Item                                     | Points
MLflow experiment tracking (5+ runs)     | 10
Model comparison dashboard               | 5
Evaluation metrics (Precision/Recall @K) | 7
Best model selection & documentation     | 3

Phase 4: Deployment & Monitoring (25 points)

Task A: REST API (15 pts)

Python
# api.py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List

app = FastAPI(title="E-commerce Recommendations API")

class RecommendationRequest(BaseModel):
    user_id: int
    n_recommendations: int = 10

class ProductRecommendation(BaseModel):
    product_id: int
    product_name: str
    score: float
    reason: str

class RecommendationResponse(BaseModel):
    user_id: int
    recommendations: List[ProductRecommendation]
    model_version: str

@app.post("/recommend", response_model=RecommendationResponse)
def recommend(request: RecommendationRequest):
    # Get hybrid recommendations
    recs = hybrid_recommend(request.user_id, request.n_recommendations)

    products = []
    for product_id, score in recs:
        product_info = get_product_info(product_id)
        products.append(ProductRecommendation(
            product_id=product_id,
            product_name=product_info['name'],
            score=round(score, 4),
            reason=determine_reason(product_id, request.user_id)
        ))

    return RecommendationResponse(
        user_id=request.user_id,
        recommendations=products,
        model_version="1.0.0"
    )

@app.get("/similar/{product_id}")
def similar_products(product_id: int, n: int = 5):
    """Content-based similar items."""
    similar = get_similar_products(product_id, top_n=n)
    # get_similar_products returns a DataFrame; convert it for JSON serialization
    return {"product_id": product_id, "similar": similar.to_dict(orient='records')}

@app.get("/health")
def health():
    return {"status": "healthy", "model": "hybrid_v1"}
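For the Docker containerization deliverable, a minimal Dockerfile sketch; the file names (`api.py`, `requirements.txt`), port, and Python version are illustrative and should match your actual repo layout:

```dockerfile
# Sketch: containerize the FastAPI recommendation service
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and model artifacts
COPY . .

EXPOSE 8000
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
```

Build and run with `docker build -t recsys-api .` and `docker run -p 8000:8000 recsys-api`.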

Task B: Monitoring (10 pts)

Python
# Basic monitoring
import logging
from collections import Counter
from datetime import datetime

class RecsysMonitor:
    def __init__(self):
        self.request_log = []
        self.logger = logging.getLogger("recsys")

    def _get_categories(self, recommendations):
        # Placeholder: look up each recommended item's category in the product catalog
        return []

    def log_recommendation(self, user_id, recommendations, latency_ms):
        entry = {
            'timestamp': datetime.now().isoformat(),
            'user_id': user_id,
            'n_recs': len(recommendations),
            'top_categories': self._get_categories(recommendations),
            'latency_ms': latency_ms
        }
        self.request_log.append(entry)
        self.logger.info(f"Recommendation served: user={user_id}, n={len(recommendations)}, latency={latency_ms}ms")

    def get_daily_stats(self):
        """Daily monitoring report."""
        today_logs = [l for l in self.request_log
                      if l['timestamp'][:10] == datetime.now().strftime('%Y-%m-%d')]

        return {
            'total_requests': len(today_logs),
            'avg_latency_ms': sum(l['latency_ms'] for l in today_logs) / max(len(today_logs), 1),
            'unique_users': len(set(l['user_id'] for l in today_logs)),
            'popular_categories': Counter(
                cat for l in today_logs for cat in l.get('top_categories', [])
            ).most_common(5)
        }
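The monitor expects a `latency_ms` value per request; one simple way to obtain it is a timing wrapper around the recommender call. A sketch, with `fake_recommend` as a stand-in for the real recommender:

```python
import time

def timed(fn):
    """Decorator that returns (result, latency_ms), ready to feed into
    a monitor's log_recommendation call."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        latency_ms = (time.perf_counter() - start) * 1000
        return result, latency_ms
    return wrapper

@timed
def fake_recommend(user_id, n=3):
    # Stand-in for the real hybrid recommender
    return [f"item_{i}" for i in range(n)]

recs, latency_ms = fake_recommend(42)
print(recs, round(latency_ms, 2))
```

In the FastAPI app the same measurement is often done in a middleware instead, so every endpoint is timed uniformly.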

Deliverables Phase 4:

Item                              | Points
FastAPI with /recommend endpoint  | 8
/similar endpoint (content-based) | 4
Docker containerization           | 5
Basic monitoring & logging        | 5
API documentation (Swagger)       | 3

📊 Grading Rubric

Phase               | Max Points | Requirements
Phase 1: Data & EDA | 20         | Clean data, 5+ visualizations, features
Phase 2: Models     | 30         | 2 CF + content-based + hybrid
Phase 3: MLOps      | 25         | MLflow, comparison, evaluation metrics
Phase 4: Deployment | 25         | API + Docker + monitoring
Total               | 100        |

Grading Scale:

Score    | Level             | Description
90-100   | Excellent         | Production-ready system
75-89    | Good              | Working system with good practices
60-74    | Satisfactory      | Basic system working
Below 60 | Needs improvement | Incomplete or major issues

💡 Tips

Project Structure

Example
ecommerce-recsys/
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_modeling.ipynb
│   └── 03_evaluation.ipynb
├── src/
│   ├── data/
│   │   ├── load.py
│   │   └── preprocess.py
│   ├── models/
│   │   ├── collaborative.py
│   │   ├── content_based.py
│   │   └── hybrid.py
│   ├── evaluation/
│   │   └── metrics.py
│   └── api/
│       └── app.py
├── model_artifacts/
├── Dockerfile
├── requirements.txt
├── README.md
└── mlflow/

Common Pitfalls

Pitfall          | Solution
Cold-start users | Fall back to popular items
Sparse matrix    | Use implicit feedback
Overfitting CF   | Regularization, cross-validation
Slow API         | Precompute embeddings, cache
Memory issues    | Sparse matrices, batch processing
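The cold-start row can be made concrete: when a user has no interaction history, fall back to globally popular items. A minimal sketch, where `personalized_fn` stands in for whichever trained recommender you have (all names hypothetical):

```python
from collections import Counter

def recommend_with_fallback(user_id, known_users, personalized_fn,
                            interactions, top_n=5):
    """Use the personalized model for known users; otherwise fall back
    to the most popular items (simple cold-start handling)."""
    if user_id in known_users:
        return personalized_fn(user_id, top_n)
    # Popularity fallback: count interactions per item
    popularity = Counter(item for _, item in interactions)
    return [item for item, _ in popularity.most_common(top_n)]

interactions = [(1, 'a'), (1, 'b'), (2, 'a'), (3, 'a'), (3, 'c')]
# user 99 was never seen -> popular-items fallback
print(recommend_with_fallback(99, {1, 2, 3}, lambda u, n: [], interactions, top_n=2))
```

Popularity can be time-windowed (e.g. last 30 days) so the fallback stays fresh.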

Portfolio Tips

  1. README — Clear problem statement, approach, results
  2. Demo — Deployed API or Gradio demo
  3. Metrics — Show improvement over baseline
  4. Code quality — Clean, documented, tested
  5. LinkedIn post template:
Example
Built an E-commerce Recommendation System:
- Hybrid model (CF + Content-based): Precision@10 = X%
- MLflow experiment tracking, 10+ experiments
- FastAPI deployment with Docker
- Real-time monitoring dashboard

Tech: Python, scikit-learn, Surprise, PyTorch, FastAPI, MLflow, Docker

🏆 Course Summary

Lesson | Topic                      | Key Skills
01     | Overview                   | Advanced ML landscape
02     | Hyperparameter Tuning      | Optuna, Bayesian Optimization
03     | AutoML                     | Auto-sklearn, TPOT, H2O
04     | Ensemble Methods           | Stacking, Blending, Weighted
05     | Transfer Learning          | BERT, ResNet, Few-shot
06     | RecSys Overview            | CF, Content-based, Hybrid
07     | Matrix Factorization       | SVD, ALS, NMF
08     | Deep RecSys                | NCF, Two-Tower, SASRec
09     | MLOps                      | MLflow, Pipelines, CI/CD
10     | Model Deployment           | FastAPI, Docker, Cloud
11     | Feature Store & Monitoring | Feast, Evidently, Drift
12     | Capstone Project           | End-to-end RecSys

🎯 What's Next?

After this course, you can continue with:

  1. Deep Learning Specialization — CNNs, Transformers, Generative AI
  2. MLOps Engineering — Kubernetes, Kubeflow, advanced pipelines
  3. Domain Specialization — NLP, Computer Vision, Time Series
  4. Research — Read papers, contribute to open-source

Congratulations on completing Advanced Machine Learning! 🎉