🎯 Lesson Objectives
After this lesson, you will:
✅ Understand the deployment pipeline for deep learning
✅ Know the compression techniques: quantization and pruning
✅ Use TensorFlow Lite and ONNX to export models
✅ Deploy models to the cloud (API, container)
The final lesson of the course!
You have learned to build and train models. This lesson covers bringing a model into the real world!
A model has no value if it only lives on your laptop! 🚀
🚀 Deployment Overview
From Training to Production
Production challenges:
- Size: models can be gigabytes
- Speed: real-time inference is often required
- Cost: GPUs are expensive
- Compatibility: many different target platforms
Solutions: compression + optimization + proper serving
Deployment Pipeline
Train → Optimize (quantize/prune) → Export (SavedModel/TFLite/ONNX) → Serve (API/container) → Monitor
Checkpoint
Do you understand the deployment challenges?
📦 Model Saving & Export
Keras/TensorFlow Formats
```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

# Create model
model = Sequential([
    layers.Dense(128, activation='relu', input_shape=(10,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# 1. Keras format (.keras) - recommended for Keras
model.save('model.keras')
loaded = tf.keras.models.load_model('model.keras')

# 2. SavedModel format (for TensorFlow Serving)
# (in Keras 3, use model.export('saved_model_dir') instead)
model.save('saved_model_dir')
loaded = tf.keras.models.load_model('saved_model_dir')

# 3. H5 format (legacy)
model.save('model.h5')
loaded = tf.keras.models.load_model('model.h5')

# 4. Weights only
model.save_weights('weights.weights.h5')
model.load_weights('weights.weights.h5')
```
ONNX Export (Cross-platform)
```python
# pip install tf2onnx onnxruntime

import tensorflow as tf
import tf2onnx
import onnx

# Convert Keras model (from the previous block) to ONNX
spec = (tf.TensorSpec((None, 10), tf.float32, name="input"),)
model_onnx, _ = tf2onnx.convert.from_keras(model, input_signature=spec)
onnx.save_model(model_onnx, "model.onnx")

# Inference with ONNX Runtime
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Predict
X = np.random.randn(1, 10).astype(np.float32)
result = session.run([output_name], {input_name: X})
print(f"Prediction: {result[0]}")
```
Checkpoint
Do you know the model export formats?
⚡ Model Quantization
What is Quantization?
Quantization = reducing the numerical precision of weights
- FP32 → INT8: 4x smaller, typically 2-4x faster
- Usually minimal accuracy loss (often < 1%)
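To make the FP32 → INT8 mapping concrete, here is a minimal affine-quantization sketch in NumPy. The weight values are made up for illustration; TFLite performs this kind of mapping internally, per tensor or per channel:

```python
import numpy as np

# A made-up FP32 weight tensor
w = np.array([-1.2, 0.0, 0.5, 2.4], dtype=np.float32)

# Affine quantization: map [min, max] onto the int8 range [-128, 127]
scale = (w.max() - w.min()) / 255.0
zero_point = int(np.round(-128 - w.min() / scale))

q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
dequant = (q.astype(np.float32) - zero_point) * scale

print(q)                          # 1 byte per weight instead of 4
print(np.abs(w - dequant).max())  # rounding error, at most ~scale/2
```

Storage drops 4x because each weight is one byte instead of four; the accuracy cost is bounded by the rounding error, which is at most half of `scale`.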
TensorFlow Lite Quantization
```python
import tensorflow as tf
import numpy as np

# Create and train model (example)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# 1. Dynamic Range Quantization (easiest)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model_dynamic.tflite', 'wb') as f:
    f.write(tflite_model)

print(f"Original size: {model.count_params() * 4 / 1024:.1f} KB")
print(f"Quantized size: {len(tflite_model) / 1024:.1f} KB")

# 2. Full Integer Quantization (best for mobile)
def representative_dataset():
    """Generate representative data for calibration"""
    for _ in range(100):
        yield [np.random.randn(1, 784).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8 = converter.convert()

with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_int8)

# 3. Float16 Quantization (good for GPU)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()
```
TFLite Inference
```python
import tensorflow as tf
import numpy as np

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path='model_dynamic.tflite')
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input
input_data = np.random.randn(1, 784).astype(np.float32)

# Set input
interpreter.set_tensor(input_details[0]['index'], input_data)

# Run inference
interpreter.invoke()

# Get output
output = interpreter.get_tensor(output_details[0]['index'])
print(f"Prediction: {output}")

# Benchmark
import time

def benchmark(interpreter, input_data, num_runs=100):
    times = []
    for _ in range(num_runs):
        start = time.time()
        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()
        times.append(time.time() - start)
    return np.mean(times) * 1000  # ms

print(f"Average inference time: {benchmark(interpreter, input_data):.2f} ms")
```
Checkpoint
Do you know how to quantize a model?
✂️ Model Pruning
What is Pruning?
Pruning = removing unimportant weights
- Reduces model size and computation
- Up to ~90% of weights can often be pruned
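The core idea can be sketched without any library: magnitude pruning simply zeroes the smallest-magnitude weights. A minimal NumPy illustration with a made-up weight matrix (TF Model Optimization does this gradually during training, following a schedule):

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(8, 8)).astype(np.float32)  # a made-up weight matrix

def prune_by_magnitude(W, sparsity=0.9):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(W.size * sparsity)                        # how many weights to drop
    threshold = np.sort(np.abs(W), axis=None)[k - 1]  # k-th smallest magnitude
    mask = np.abs(W) > threshold
    return W * mask, mask

W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
print(f"Sparsity: {1.0 - mask.mean():.0%}")  # roughly 90% of entries are zero
```

In practice the network is fine-tuned while pruning so the surviving weights can compensate, which is why the TF-MOT workflow below interleaves pruning with training steps.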
TensorFlow Model Optimization
```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot
import numpy as np

# Create model
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Define pruning schedule
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.50,
        final_sparsity=0.90,
        begin_step=0,
        end_step=1000
    )
}

# Apply pruning to model
model_for_pruning = prune_low_magnitude(base_model, **pruning_params)

# Compile
model_for_pruning.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Callbacks for pruning
callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
]

# Train with pruning
# model_for_pruning.fit(x_train, y_train, epochs=10, callbacks=callbacks)

# Strip pruning wrappers
model_pruned = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Check sparsity
def get_sparsity(model):
    for layer in model.layers:
        if hasattr(layer, 'kernel'):
            weights = layer.kernel.numpy()
            sparsity = np.sum(weights == 0) / weights.size
            print(f"{layer.name}: {sparsity*100:.1f}% sparse")

# Export pruned model
model_pruned.save('pruned_model.keras')
```
Pruning + Quantization
```python
# Combine pruning and quantization for maximum compression
def optimize_model(model):
    # 1. Prune
    pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
        model,
        pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(
            target_sparsity=0.75,
            begin_step=0
        )
    )

    # 2. Train (needed for pruning to take effect)
    # pruned_model.fit(...)

    # 3. Strip pruning wrappers
    pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

    # 4. Quantize
    converter = tf.lite.TFLiteConverter.from_keras_model(pruned_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    return tflite_model

# Typical size reduction:
# Original: ~100KB
# Pruned: ~50KB
# Pruned + Quantized: ~15KB
```
Checkpoint
Do you understand model pruning?
🖥️ Serving with TensorFlow Serving
Setup TensorFlow Serving
```python
# Save model in SavedModel format
import tensorflow as tf

model = tf.keras.models.load_model('my_model.keras')

# Export with versioning (TF Serving expects numbered subdirectories)
export_path = 'serving_model/1'  # Version 1
model.save(export_path, save_format='tf')  # Keras 3: use model.export(export_path)

print(f"Saved to: {export_path}")

# Check signature
loaded = tf.saved_model.load(export_path)
print(list(loaded.signatures.keys()))
```
Docker Deployment
```bash
# Pull TensorFlow Serving image
docker pull tensorflow/serving

# Run server
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/serving_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving
```
REST API Client
```python
import requests
import numpy as np

# Prepare data
data = np.random.randn(1, 10).tolist()

# REST API request
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {
    "instances": data
}

response = requests.post(url, json=payload)
predictions = response.json()['predictions']
print(f"Predictions: {predictions}")

# gRPC client (faster for production)
# pip install tensorflow-serving-api
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

def grpc_predict(input_data, host='localhost:8500'):
    channel = grpc.insecure_channel(host)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'my_model'
    request.inputs['input_1'].CopyFrom(
        tf.make_tensor_proto(input_data.astype(np.float32))
    )

    response = stub.Predict(request)
    return response.outputs['output_1'].float_val
```
Checkpoint
Do you know how to use TensorFlow Serving?
☁️ Cloud Deployment Options
FastAPI Serving
```python
# pip install fastapi uvicorn

from fastapi import FastAPI
from pydantic import BaseModel
import tensorflow as tf
import numpy as np

app = FastAPI()

# Load model at startup
model = tf.keras.models.load_model('my_model.keras')

class PredictRequest(BaseModel):
    data: list

class PredictResponse(BaseModel):
    prediction: list

@app.post("/predict", response_model=PredictResponse)
async def predict(request: PredictRequest):
    input_data = np.array(request.data).astype(np.float32)
    prediction = model.predict(input_data).tolist()
    return PredictResponse(prediction=prediction)

@app.get("/health")
async def health():
    return {"status": "healthy"}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```
Dockerfile
```dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```
Cloud Platform Options
| Platform | Best for | Pros | Cons |
|---|---|---|---|
| AWS SageMaker | Full ML lifecycle | Managed, scalable | Expensive |
| Google Vertex AI | TensorFlow | Native TF support | GCP lock-in |
| Azure ML | Enterprise | Integration | Complex |
| Hugging Face | NLP/Transformers | Easy, free tier | Limited customization |
Hugging Face Deployment
```python
# Deploy to Hugging Face Spaces (Gradio)
# pip install gradio

import gradio as gr
import tensorflow as tf

model = tf.keras.models.load_model('sentiment_model')

def predict_sentiment(text):
    # Tokenize and predict
    # ... preprocessing
    prediction = model.predict([text])
    return {
        "Positive": float(prediction[0][1]),
        "Negative": float(prediction[0][0])
    }

demo = gr.Interface(
    fn=predict_sentiment,
    inputs=gr.Textbox(label="Enter text"),
    outputs=gr.Label(num_top_classes=2),
    title="Sentiment Analysis",
    examples=["I love this!", "This is terrible"]
)

demo.launch()  # Local testing
# demo.launch(share=True)  # Get public URL
```
Checkpoint
Do you know the cloud deployment options?
📊 Monitoring & MLOps
Model Monitoring
```python
import numpy as np
from datetime import datetime

class ModelMonitor:
    def __init__(self, model_name):
        self.model_name = model_name
        self.predictions = []
        self.latencies = []

    def log_prediction(self, input_data, prediction, latency_ms):
        log = {
            "timestamp": datetime.now().isoformat(),
            "model": self.model_name,
            "input_shape": list(input_data.shape),
            "prediction": prediction.tolist() if isinstance(prediction, np.ndarray) else prediction,
            "latency_ms": latency_ms
        }
        self.predictions.append(log)
        self.latencies.append(latency_ms)

        # Alert if latency too high
        if latency_ms > 100:
            self.alert(f"High latency: {latency_ms}ms")

    def get_stats(self):
        return {
            "total_predictions": len(self.predictions),
            "avg_latency_ms": np.mean(self.latencies),
            "p95_latency_ms": np.percentile(self.latencies, 95),
            "p99_latency_ms": np.percentile(self.latencies, 99),
        }

    def alert(self, message):
        print(f"⚠️ ALERT: {message}")
        # Send to monitoring system (Prometheus, CloudWatch, etc.)


# Usage
monitor = ModelMonitor("sentiment_model_v1")

import time
input_data = np.random.randn(1, 10)

start = time.time()
prediction = model.predict(input_data)
latency = (time.time() - start) * 1000

monitor.log_prediction(input_data, prediction, latency)
print(monitor.get_stats())
```
Data Drift Detection
```python
import numpy as np
from scipy import stats

class DriftDetector:
    def __init__(self, reference_data):
        self.reference_mean = np.mean(reference_data, axis=0)
        self.reference_std = np.std(reference_data, axis=0)
        self.reference_data = reference_data

    def check_drift(self, new_data, threshold=0.05):
        """Check for data drift using the Kolmogorov-Smirnov test"""
        drift_detected = False
        drift_features = []

        for i in range(new_data.shape[1]):
            # Kolmogorov-Smirnov test per feature
            stat, p_value = stats.ks_2samp(
                self.reference_data[:, i],
                new_data[:, i]
            )

            if p_value < threshold:
                drift_detected = True
                drift_features.append(i)

        return {
            "drift_detected": drift_detected,
            "drift_features": drift_features,
            "recommendation": "Retrain model" if drift_detected else "Model OK"
        }


# Usage
reference = np.random.randn(1000, 10)  # Training data distribution
detector = DriftDetector(reference)

# Check new data
new_data = np.random.randn(100, 10) + 0.5  # Shifted distribution
result = detector.check_drift(new_data)
print(result)
```
Checkpoint
Do you know how to monitor models?
🎯 Deep Learning Course Summary
Optimization Summary
| Technique | Size Reduction | Speed Up |
|---|---|---|
| Quantization INT8 | 4x | 2-4x |
| Pruning 90% | ~3x | 2-3x |
| Both combined | 10-15x | 4-8x |
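The "both combined" row follows from the two factors being roughly multiplicative. A back-of-envelope check with a made-up 100 KB model (real gains depend on the architecture and on sparse-storage overhead):

```python
original_kb = 100.0

quantized_kb = original_kb / 4        # INT8: 1 byte per weight instead of 4
pruned_kb = original_kb / 3           # 90% sparsity, minus sparse-format overhead
combined_kb = original_kb / (4 * 3)   # roughly multiplicative -> ~12x

print(quantized_kb, round(pruned_kb, 1), round(combined_kb, 1))  # 25.0 33.3 8.3
```

The ~12x result lands inside the 10-15x range in the table; the spread comes from how much of the sparsity the storage format can actually exploit.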
Deployment Checklist
Before Deployment:
- Model optimized (quantized/pruned)
- API endpoint tested
- Error handling implemented
- Logging configured
- Health check endpoint
Production:
- Load testing done
- Monitoring setup
- Alerting configured
- Rollback plan ready
- Documentation updated
Course Summary
You have learned:
| Module | Key Topics |
|---|---|
| ANN | Perceptron, Backprop, Activation |
| CNN | Convolution, Pooling, ResNet |
| RNN | Sequences, BPTT, Stacked RNN |
| LSTM | Gates, Cell State, GRU |
| Transformer | Attention, Self-Attention, BERT/GPT |
| Transfer Learning | Fine-tuning, Hugging Face |
| Optimization | Adam, LR Schedule, Regularization |
| Deployment | Quantization, Serving, Monitoring |
Next Steps
🎉 You have completed the Deep Learning course!
Next:
- Build real projects with the knowledge you've gained
- Explore specialized domains (NLP, CV, RL)
- Learn MLOps and production systems
- Contribute to open source
Resources:
- Papers: arxiv.org
- Models: huggingface.co
- Tutorials: tensorflow.org, pytorch.org
- Practice: kaggle.com
Final Words
Deep learning is a journey, not a destination. Keep learning, keep building!
