Model Deployment
The model is trained; now it is time to deploy it for users. This lesson covers the full deployment stack: FastAPI → Docker → Cloud.
🎯 Objectives
- Build REST API với FastAPI
- Containerize với Docker
- Batch vs Real-time serving
- Cloud deployment options
1. Model Serving Patterns
1.1 Overview
Serving Patterns
| | Real-time (Online) | Batch (Offline) |
|---|---|---|
| Interface | REST API, gRPC | Scheduled jobs |
| Latency | < 100 ms per request | Minutes to hours (daily/hourly runs) |
| Volume | One record at a time | Millions at once |
| Examples | Fraud detection, chatbots | Email campaign targeting, product recommendations, churn scoring |
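Both patterns can share a single scoring function; only the driver around it differs. A toy illustration with a stand-in scorer (the formula and field names here are invented for the sketch, not taken from a real model):

```python
def score(record):
    """Stand-in scorer: flag customers with low tenure and high ticket volume."""
    risk = 0.6 * (record["support_tickets"] / 10) \
         + 0.4 * (1 - min(record["tenure_months"], 60) / 60)
    return round(min(risk, 1.0), 2)

# Real-time pattern: one record per request
print(score({"tenure_months": 6, "support_tickets": 8}))  # 0.84

# Batch pattern: score many records in one scheduled run
customers = [
    {"tenure_months": 48, "support_tickets": 1},
    {"tenure_months": 3, "support_tickets": 9},
]
print([score(c) for c in customers])  # [0.14, 0.92]
```

In practice the real-time path wraps `score` in an API handler, while the batch path runs it over a table of customers on a schedule.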
1.2 Model Serialization
```python
import joblib
import json
import os

# Save model + metadata
def save_model(model, preprocessor, metadata, path="model_artifacts"):
    os.makedirs(path, exist_ok=True)

    joblib.dump(model, f"{path}/model.pkl")
    joblib.dump(preprocessor, f"{path}/preprocessor.pkl")

    with open(f"{path}/metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)

    print(f"Model saved to {path}/")

save_model(
    model=trained_model,
    preprocessor=pipeline_preprocessor,
    metadata={
        "model_type": "GradientBoosting",
        "version": "1.2.0",
        "accuracy": 0.94,
        "features": feature_names,
        "training_date": "2025-01-15"
    }
)
```
2. REST API with FastAPI
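The endpoints in this section all share one piece of logic: turning a probability pair into a label and a confidence. Factoring that into a pure function makes it unit-testable without starting a server. A sketch (the name `decide` is ours, not part of FastAPI or scikit-learn):

```python
def decide(proba, threshold=0.5):
    """Turn a (p_no_churn, p_churn) pair into the response fields."""
    churn_prob = float(proba[1])
    return {
        "churn_probability": round(churn_prob, 4),
        "prediction": "churn" if churn_prob > threshold else "no_churn",
        "confidence": round(float(max(proba)), 4),
    }

print(decide([0.28, 0.72]))
# {'churn_probability': 0.72, 'prediction': 'churn', 'confidence': 0.72}
```

The endpoints below inline this logic; extracting it like this is a common refactor once you start writing tests.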
2.1 Basic API
```python
# api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import joblib
import numpy as np

app = FastAPI(
    title="Churn Prediction API",
    version="1.0.0"
)

# Load model at startup
model = joblib.load("model_artifacts/model.pkl")
preprocessor = joblib.load("model_artifacts/preprocessor.pkl")

# Request schema
class CustomerData(BaseModel):
    age: int = Field(ge=18, le=100, description="Customer age")
    monthly_spend: float = Field(ge=0, description="Monthly spending")
    tenure_months: int = Field(ge=0, description="Months as customer")
    support_tickets: int = Field(ge=0, description="Support tickets filed")
    contract_type: str = Field(description="Contract: monthly/annual/two_year")

# Response schema
class PredictionResponse(BaseModel):
    churn_probability: float
    prediction: str
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
def predict(data: CustomerData):
    try:
        # Convert to array (contract_type is validated but omitted from
        # this simplified numeric feature vector)
        features = np.array([[
            data.age,
            data.monthly_spend,
            data.tenure_months,
            data.support_tickets
        ]])

        # Preprocess
        features_processed = preprocessor.transform(features)

        # Predict
        proba = model.predict_proba(features_processed)[0]
        churn_prob = float(proba[1])
        prediction = "churn" if churn_prob > 0.5 else "no_churn"
        confidence = float(max(proba))

        return PredictionResponse(
            churn_probability=round(churn_prob, 4),
            prediction=prediction,
            confidence=round(confidence, 4)
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
def health():
    return {"status": "healthy", "model_version": "1.0.0"}
```
2.2 Batch Endpoint
```python
from typing import List

class BatchRequest(BaseModel):
    customers: List[CustomerData]

class BatchResponse(BaseModel):
    predictions: List[PredictionResponse]
    total: int

@app.post("/predict/batch", response_model=BatchResponse)
def predict_batch(request: BatchRequest):
    if len(request.customers) > 1000:
        raise HTTPException(400, "Max 1000 records per batch")

    features = np.array([
        [c.age, c.monthly_spend, c.tenure_months, c.support_tickets]
        for c in request.customers
    ])

    features_processed = preprocessor.transform(features)
    probas = model.predict_proba(features_processed)

    predictions = []
    for proba in probas:
        churn_prob = float(proba[1])
        predictions.append(PredictionResponse(
            churn_probability=round(churn_prob, 4),
            prediction="churn" if churn_prob > 0.5 else "no_churn",
            confidence=round(float(max(proba)), 4)
        ))

    return BatchResponse(predictions=predictions, total=len(predictions))
```
2.3 Run & Test
```bash
# Start server
uvicorn api:app --host 0.0.0.0 --port 8000 --reload

# Test (another terminal)
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"age": 35, "monthly_spend": 89.5, "tenure_months": 24, "support_tickets": 3, "contract_type": "monthly"}'
```

```python
# Python client
import requests

response = requests.post("http://localhost:8000/predict", json={
    "age": 35,
    "monthly_spend": 89.5,
    "tenure_months": 24,
    "support_tickets": 3,
    "contract_type": "monthly"
})
print(response.json())
# {"churn_probability": 0.7234, "prediction": "churn", "confidence": 0.7234}
```
3. Docker Containerization
3.1 Dockerfile
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# curl is needed for the HEALTHCHECK below (the slim image ships without it)
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY api.py .
COPY model_artifacts/ model_artifacts/

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1

# Run
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
```
3.2 Requirements
```text
# requirements.txt
fastapi==0.109.0
uvicorn==0.27.0
scikit-learn==1.4.0
joblib==1.3.2
numpy==1.26.3
pydantic==2.5.3
```
3.3 Build & Run
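Before building, a `.dockerignore` keeps the build context small so that data files, notebooks, and VCS history are not sent to the Docker daemon or baked into the image. An illustrative starting point (the entries are typical, adjust them to your repository layout):

```text
# .dockerignore (illustrative)
.git/
.venv/
__pycache__/
*.pyc
data/
notebooks/
tests/
```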
```bash
# Build image
docker build -t churn-model:v1.0 .

# Run container
docker run -d -p 8000:8000 --name churn-api churn-model:v1.0

# Test
curl http://localhost:8000/health

# View logs
docker logs churn-api

# Stop
docker stop churn-api
```
3.4 Docker Compose (with monitoring)
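The compose file in this section mounts a local `./prometheus.yml` into the Prometheus container. A minimal sketch of that file (it assumes the API exposes a `/metrics` endpoint, e.g. via an instrumentation library, which the FastAPI code above does not yet do):

```yaml
# prometheus.yml (minimal sketch)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "model-api"
    static_configs:
      - targets: ["model-api:8000"]
```

The target uses the compose service name `model-api`, which resolves on the compose network.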
```yaml
# docker-compose.yml
version: '3.8'

services:
  model-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_VERSION=1.0.0
      - LOG_LEVEL=info
    volumes:
      - ./model_artifacts:/app/model_artifacts
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```
4. Cloud Deployment Options
4.1 Comparison
| Platform | Pros | Cons | Best For |
|---|---|---|---|
| AWS SageMaker | Full ML platform, auto-scaling | Complex, expensive | Enterprise |
| GCP Vertex AI | Good integration | GCP lock-in | GCP users |
| Azure ML | Enterprise features | Complex pricing | MS shops |
| Hugging Face | Easy, free tier | Limited compute | NLP/demo |
| Railway/Render | Simple deploy | Basic features | MVPs/startups |
| Self-hosted (K8s) | Full control | High maintenance | Large teams |
4.2 Deploy to Hugging Face Spaces (Free)
```python
# app.py (Gradio version for HF Spaces)
import gradio as gr
import joblib
import numpy as np

model = joblib.load("model_artifacts/model.pkl")

def predict_churn(age, monthly_spend, tenure, tickets):
    features = np.array([[age, monthly_spend, tenure, tickets]])
    proba = model.predict_proba(features)[0]
    churn_prob = proba[1]

    label = "High Risk" if churn_prob > 0.7 else "Medium Risk" if churn_prob > 0.4 else "Low Risk"
    return {
        "Churn Probability": f"{churn_prob:.1%}",
        "Risk Level": label
    }

demo = gr.Interface(
    fn=predict_churn,
    inputs=[
        gr.Slider(18, 80, value=35, label="Age"),
        gr.Number(value=89.5, label="Monthly Spend ($)"),
        gr.Slider(0, 120, value=24, label="Tenure (months)"),
        gr.Slider(0, 20, value=2, label="Support Tickets")
    ],
    outputs=gr.JSON(label="Prediction"),
    title="Customer Churn Predictor",
    description="Predict customer churn probability"
)

demo.launch()
```
4.3 Deploy with BentoML
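Besides the `service.py` shown in this section, building a deployable bento requires a `bentofile.yaml` describing what to package. A minimal sketch (the package list is illustrative, pin what your model actually needs):

```yaml
# bentofile.yaml (minimal sketch)
service: "service:svc"
include:
  - "service.py"
python:
  packages:
    - scikit-learn
    - numpy
```

With this in place, `bentoml build` packages the service and `bentoml serve` runs it locally.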
```python
# service.py
import bentoml
import numpy as np
from bentoml.io import JSON

# Save the model to BentoML first:
# bentoml.sklearn.save_model("churn_model", trained_model)

runner = bentoml.sklearn.get("churn_model:latest").to_runner()
svc = bentoml.Service("churn_prediction", runners=[runner])

@svc.api(input=JSON(), output=JSON())
def predict(input_data: dict) -> dict:
    features = np.array([[
        input_data["age"],
        input_data["monthly_spend"],
        input_data["tenure_months"],
        input_data["support_tickets"]
    ]])

    proba = runner.predict_proba.run(features)[0]
    return {
        "churn_probability": float(proba[1]),
        "prediction": "churn" if proba[1] > 0.5 else "no_churn"
    }
```
5. Production Best Practices
5.1 API Design Checklist
| Practice | Details |
|---|---|
| Input validation | Pydantic models, type checks |
| Error handling | Proper HTTP codes, error messages |
| Logging | Request/response logging, prediction logging |
| Versioning | /v1/predict, /v2/predict |
| Rate limiting | Prevent abuse |
| Authentication | API keys, JWT tokens |
| Documentation | Auto-generated OpenAPI/Swagger |
| Health check | /health endpoint |
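Of these practices, rate limiting is easy to prototype without extra infrastructure. A minimal fixed-window limiter sketch (in production you would usually reach for middleware or an API gateway instead; the class and method names here are our own):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per client per `window` seconds."""

    def __init__(self, limit=60, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable for deterministic tests
        self._counts = defaultdict(int)
        self._window_start = defaultdict(float)

    def allow(self, client_id):
        now = self.clock()
        # Start a fresh window if the current one has expired
        if now - self._window_start[client_id] >= self.window:
            self._window_start[client_id] = now
            self._counts[client_id] = 0
        self._counts[client_id] += 1
        return self._counts[client_id] <= self.limit

limiter = FixedWindowLimiter(limit=2, window=60.0)
print([limiter.allow("key-123") for _ in range(3)])  # [True, True, False]
```

In the API, `allow()` would be called at the top of each endpoint with the caller's API key, returning HTTP 429 when it is False.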
5.2 Logging Predictions
```python
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prediction_logger")

@app.post("/predict")
def predict(data: CustomerData):
    start_time = datetime.now()

    # ... prediction logic ...

    # Log for monitoring (model_dump() replaces the deprecated .dict() in Pydantic v2)
    latency_ms = (datetime.now() - start_time).total_seconds() * 1000
    logger.info(
        f"prediction_log | "
        f"input={data.model_dump()} | "
        f"output={prediction} | "
        f"probability={churn_prob:.4f} | "
        f"latency_ms={latency_ms:.1f} | "
        f"model_version=1.0.0"
    )

    return response
```
📝 Quiz
- How do real-time serving and batch serving differ?
  - Batch is more accurate
  - Real-time handles one request at a time (under 100 ms); batch processes records in bulk (minutes are fine)
  - Real-time needs no API
  - Batch needs no model
- Why use Docker for ML deployment?
  - It runs faster
  - Reproducible environments, consistent across dev/staging/prod
  - It is mandatory
  - Docker trains models better
- Why is FastAPI popular for ML serving?
  - Async support, auto-generated docs, type validation, high performance
  - Only FastAPI can deploy ML models
  - It is free
  - Google created it
🎯 Key Takeaways
- FastAPI — a leading Python framework for ML APIs
- Docker — the standard for containerized deployment
- Pydantic — input validation is essential in production
- Health checks — monitor API availability
- Logging — log every prediction to support monitoring
🚀 Next Lesson
Feature Store & Model Monitoring — feature engineering at scale and monitoring for model drift!
