
Model Deployment & Production

Deploy Deep Learning models: Compression, Quantization, Serving, and Cloud Deployment


🎯 Lesson Objectives


After this lesson, you will:

✅ Understand the deployment pipeline for Deep Learning

✅ Know compression techniques: Quantization, Pruning

✅ Use TensorFlow Lite and ONNX to export models

✅ Deploy models to the cloud (API, Container)

This is the final lesson of the course!

You have learned to build and train models. This lesson covers bringing a model into the real world!

A model has no value if it only sits on your laptop! 🚀


🚀 Deployment Overview


From Training to Production

Production challenges:

  • Size: models can be gigabytes
  • Speed: real-time inference is often required
  • Cost: GPUs are expensive
  • Compatibility: many different target platforms

Solutions: Compression + Optimization + Proper Serving
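To make the size challenge concrete, here is a quick back-of-envelope estimate (the parameter count is illustrative; real file sizes also include graph metadata and other overhead):

```python
def model_size_mb(num_params, bytes_per_weight=4):
    """Estimate raw weight storage: params x bytes per weight (FP32 = 4 bytes)."""
    return num_params * bytes_per_weight / 1024 ** 2

# A ResNet-50-sized model (~25.6M params) at different precisions
params = 25_600_000
fp32 = model_size_mb(params, 4)   # full precision
fp16 = model_size_mb(params, 2)   # half precision
int8 = model_size_mb(params, 1)   # 8-bit quantized

print(f"FP32: {fp32:.1f} MB, FP16: {fp16:.1f} MB, INT8: {int8:.1f} MB")
```

Dropping from FP32 to INT8 alone cuts raw weight storage 4x, which is why quantization is usually the first optimization applied.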

Deployment Pipeline

🏋️ Training (model) → Optimization (quantize, pruning) → 📦 Export (SavedModel, ONNX) → 🚀 Serving (API, container) → 📊 Monitoring (metrics, alerts)

Checkpoint

Do you understand the deployment challenges?


📦 Model Saving & Export


Keras/TensorFlow Formats

python.py
import tensorflow as tf
from tensorflow.keras import layers, Sequential

# Create model
model = Sequential([
    layers.Dense(128, activation='relu', input_shape=(10,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# 1. Keras format (.keras) - recommended for Keras
model.save('model.keras')
loaded = tf.keras.models.load_model('model.keras')

# 2. SavedModel format (for TensorFlow Serving)
model.save('saved_model_dir')  # TF/Keras 2.x; in Keras 3 use model.export('saved_model_dir')
loaded = tf.keras.models.load_model('saved_model_dir')

# 3. H5 format (legacy)
model.save('model.h5')
loaded = tf.keras.models.load_model('model.h5')

# 4. Weights only
model.save_weights('weights.weights.h5')
model.load_weights('weights.weights.h5')

ONNX Export (Cross-platform)

python.py
# pip install tf2onnx onnxruntime

import tensorflow as tf
import tf2onnx
import onnx

# Convert Keras model to ONNX
spec = (tf.TensorSpec((None, 10), tf.float32, name="input"),)
model_onnx, _ = tf2onnx.convert.from_keras(model, input_signature=spec)
onnx.save_model(model_onnx, "model.onnx")

# Inference with ONNX Runtime
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Predict
X = np.random.randn(1, 10).astype(np.float32)
result = session.run([output_name], {input_name: X})
print(f"Prediction: {result[0]}")

Checkpoint

Do you know the model export formats?


⚡ Model Quantization


What is Quantization?

Quantization = reduce the precision of weights

  • FP32 → INT8: 4x smaller, typically 2-4x faster
  • Usually minimal accuracy loss (< 1%)
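The core idea can be sketched with plain NumPy. This is a simplified affine quantization scheme for illustration, not TFLite's exact implementation:

```python
import numpy as np

def quantize_int8(w):
    """Affine-quantize a float array to int8 with a scale and zero point."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0           # map the range onto 256 int8 levels
    zero_point = np.round(-128 - w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from int8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)

print(f"Storage: {w.nbytes} B -> {q.nbytes} B (4x smaller)")
print(f"Max round-trip error: {np.abs(w - w_hat).max():.4f}")  # bounded by the scale
```

Each weight is stored as one byte plus a shared scale/zero point, which is where the 4x size reduction comes from; the rounding error per weight is at most on the order of the scale.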

TensorFlow Lite Quantization

python.py
import tensorflow as tf
import numpy as np

# Create and train model (example)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# 1. Dynamic range quantization (easiest)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model_dynamic.tflite', 'wb') as f:
    f.write(tflite_model)

print(f"Original size: {model.count_params() * 4 / 1024:.1f} KB")
print(f"Quantized size: {len(tflite_model) / 1024:.1f} KB")

# 2. Full integer quantization (best for mobile)
def representative_dataset():
    """Generate representative data for calibration"""
    for _ in range(100):
        yield [np.random.randn(1, 784).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8 = converter.convert()

with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_int8)

# 3. Float16 quantization (GPU)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()

TFLite Inference

python.py
import tensorflow as tf
import numpy as np

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path='model_dynamic.tflite')
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input
input_data = np.random.randn(1, 784).astype(np.float32)

# Set input
interpreter.set_tensor(input_details[0]['index'], input_data)

# Run inference
interpreter.invoke()

# Get output
output = interpreter.get_tensor(output_details[0]['index'])
print(f"Prediction: {output}")

# Benchmark
import time

def benchmark(interpreter, input_data, num_runs=100):
    input_details = interpreter.get_input_details()
    times = []
    for _ in range(num_runs):
        start = time.time()
        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()
        times.append(time.time() - start)
    return np.mean(times) * 1000  # ms

print(f"Average inference time: {benchmark(interpreter, input_data):.2f} ms")

Checkpoint

Do you know how to quantize a model?


✂️ Model Pruning


What is Pruning?

Pruning = remove unimportant weights

  • Reduces model size and computation
  • Up to ~90% of weights can be pruned
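Magnitude pruning itself is simple: keep the largest-magnitude weights and zero the rest. A minimal NumPy sketch of the idea (not the tfmot implementation):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    flat = np.abs(weights).flatten()
    k = int(sparsity * flat.size)             # number of weights to remove
    threshold = np.sort(flat)[k - 1] if k > 0 else -np.inf
    mask = np.abs(weights) > threshold        # keep only the large weights
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64))
w_pruned, mask = magnitude_prune(w, sparsity=0.9)

print(f"Sparsity: {np.mean(w_pruned == 0):.2%}")  # ~90% of weights are zero
```

Real pruning frameworks do this gradually during training (as in the schedule below) so the remaining weights can adapt, which is what preserves accuracy.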

TensorFlow Model Optimization

python.py
import tensorflow as tf
import tensorflow_model_optimization as tfmot
import numpy as np

# Create model
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Define pruning schedule
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.50,
        final_sparsity=0.90,
        begin_step=0,
        end_step=1000
    )
}

# Apply pruning to model
model_for_pruning = prune_low_magnitude(base_model, **pruning_params)

# Compile
model_for_pruning.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Callbacks for pruning
callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
]

# Train with pruning
# model_for_pruning.fit(x_train, y_train, epochs=10, callbacks=callbacks)

# Strip pruning wrappers
model_pruned = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Check sparsity
def get_sparsity(model):
    for layer in model.layers:
        if hasattr(layer, 'kernel'):
            weights = layer.kernel.numpy()
            sparsity = np.sum(weights == 0) / weights.size
            print(f"{layer.name}: {sparsity*100:.1f}% sparse")

get_sparsity(model_pruned)

# Export pruned model
model_pruned.save('pruned_model.keras')

Pruning + Quantization

python.py
1# Combine pruning and quantization for maximum compression
2def optimize_model(model):
3 # 1. Prune
4 pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
5 model,
6 pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(
7 target_sparsity=0.75,
8 begin_step=0
9 )
10 )
11
12 # 2. Train (needed for pruning to take effect)
13 # pruned_model.fit(...)
14
15 # 3. Strip pruning wrappers
16 pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
17
18 # 4. Quantize
19 converter = tf.lite.TFLiteConverter.from_keras_model(pruned_model)
20 converter.optimizations = [tf.lite.Optimize.DEFAULT]
21 tflite_model = converter.convert()
22
23 return tflite_model
24
25# Check size reduction
26# Original: ~100KB
27# Pruned: ~50KB
28# Pruned + Quantized: ~15KB

Checkpoint

Do you understand model pruning?


🖥️ Serving with TensorFlow Serving


Setup TensorFlow Serving

python.py
# Save model in SavedModel format
import tensorflow as tf

model = tf.keras.models.load_model('my_model.keras')

# Export with versioning
export_path = 'serving_model/1'  # Version 1
model.save(export_path, save_format='tf')  # TF/Keras 2.x; in Keras 3 use model.export(export_path)
print(f"Saved to: {export_path}")

# Check signature
loaded = tf.saved_model.load(export_path)
print(list(loaded.signatures.keys()))

Docker Deployment

Bash
# Pull TensorFlow Serving image
docker pull tensorflow/serving

# Run server
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/serving_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving

REST API Client

python.py
import requests
import numpy as np

# Prepare data
data = np.random.randn(1, 10).tolist()

# REST API request
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {
    "instances": data
}

response = requests.post(url, json=payload)
predictions = response.json()['predictions']
print(f"Predictions: {predictions}")

# gRPC client (faster for production)
# pip install tensorflow-serving-api
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

def grpc_predict(input_data, host='localhost:8500'):
    channel = grpc.insecure_channel(host)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'my_model'
    request.inputs['input_1'].CopyFrom(
        tf.make_tensor_proto(input_data.astype(np.float32))
    )

    response = stub.Predict(request)
    return response.outputs['output_1'].float_val

Checkpoint

Do you know how to use TensorFlow Serving?


☁️ Cloud Deployment Options


FastAPI Serving

python.py
# pip install fastapi uvicorn

from fastapi import FastAPI
from pydantic import BaseModel
import tensorflow as tf
import numpy as np

app = FastAPI()

# Load model at startup
model = tf.keras.models.load_model('my_model.keras')

class PredictRequest(BaseModel):
    data: list

class PredictResponse(BaseModel):
    prediction: list

@app.post("/predict", response_model=PredictResponse)
async def predict(request: PredictRequest):
    input_data = np.array(request.data).astype(np.float32)
    prediction = model.predict(input_data).tolist()
    return PredictResponse(prediction=prediction)

@app.get("/health")
async def health():
    return {"status": "healthy"}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000

Dockerfile

dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Cloud Platform Options

| Platform | Best for | Pros | Cons |
|---|---|---|---|
| AWS SageMaker | Full ML lifecycle | Managed, scalable | Expensive |
| Google Vertex AI | TensorFlow | Native TF support | GCP lock-in |
| Azure ML | Enterprise | Integration | Complex |
| Hugging Face | NLP/Transformers | Easy, free tier | Limited customization |

Hugging Face Deployment

python.py
# Deploy to Hugging Face Spaces (Gradio)
# pip install gradio

import gradio as gr
import tensorflow as tf

model = tf.keras.models.load_model('sentiment_model')

def predict_sentiment(text):
    # Tokenize and predict
    # ... preprocessing
    prediction = model.predict([text])
    return {
        "Positive": float(prediction[0][1]),
        "Negative": float(prediction[0][0])
    }

demo = gr.Interface(
    fn=predict_sentiment,
    inputs=gr.Textbox(label="Enter text"),
    outputs=gr.Label(num_top_classes=2),
    title="Sentiment Analysis",
    examples=["I love this!", "This is terrible"]
)

demo.launch()  # Local testing
# demo.launch(share=True)  # Get public URL

Checkpoint

Do you know the cloud deployment options?


📊 Monitoring & MLOps


Model Monitoring

python.py
import numpy as np
from datetime import datetime
import json

class ModelMonitor:
    def __init__(self, model_name):
        self.model_name = model_name
        self.predictions = []
        self.latencies = []

    def log_prediction(self, input_data, prediction, latency_ms):
        log = {
            "timestamp": datetime.now().isoformat(),
            "model": self.model_name,
            "input_shape": list(input_data.shape),
            "prediction": prediction.tolist() if isinstance(prediction, np.ndarray) else prediction,
            "latency_ms": latency_ms
        }
        self.predictions.append(log)
        self.latencies.append(latency_ms)

        # Alert if latency too high
        if latency_ms > 100:
            self.alert(f"High latency: {latency_ms}ms")

    def get_stats(self):
        return {
            "total_predictions": len(self.predictions),
            "avg_latency_ms": np.mean(self.latencies),
            "p95_latency_ms": np.percentile(self.latencies, 95),
            "p99_latency_ms": np.percentile(self.latencies, 99),
        }

    def alert(self, message):
        print(f"⚠️ ALERT: {message}")
        # Send to monitoring system (Prometheus, CloudWatch, etc.)


# Usage (assumes `model` is a trained Keras model)
monitor = ModelMonitor("sentiment_model_v1")

import time
input_data = np.random.randn(1, 10)

start = time.time()
prediction = model.predict(input_data)
latency = (time.time() - start) * 1000

monitor.log_prediction(input_data, prediction, latency)
print(monitor.get_stats())

Data Drift Detection

python.py
import numpy as np
from scipy import stats

class DriftDetector:
    def __init__(self, reference_data):
        self.reference_mean = np.mean(reference_data, axis=0)
        self.reference_std = np.std(reference_data, axis=0)
        self.reference_data = reference_data

    def check_drift(self, new_data, threshold=0.05):
        """
        Check for data drift using the KS test
        """
        drift_detected = False
        drift_features = []

        for i in range(new_data.shape[1]):
            # Kolmogorov-Smirnov test
            stat, p_value = stats.ks_2samp(
                self.reference_data[:, i],
                new_data[:, i]
            )

            if p_value < threshold:
                drift_detected = True
                drift_features.append(i)

        return {
            "drift_detected": drift_detected,
            "drift_features": drift_features,
            "recommendation": "Retrain model" if drift_detected else "Model OK"
        }


# Usage
reference = np.random.randn(1000, 10)  # Training data distribution
detector = DriftDetector(reference)

# Check new data
new_data = np.random.randn(100, 10) + 0.5  # Shifted distribution
result = detector.check_drift(new_data)
print(result)

Checkpoint

Do you know how to monitor models?


🎯 Deep Learning Course Wrap-up


Optimization Summary

| Technique | Size Reduction | Speed-up |
|---|---|---|
| Quantization (INT8) | 4x | 2-4x |
| Pruning (90%) | ~3x | 2-3x |
| Both combined | 10-15x | 4-8x |
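The combined figures follow from simple arithmetic. The sketch below computes idealized ratios, assuming a sparse format that stores only nonzero weights; real formats pay index overhead, which is why practical numbers are lower than the ideal:

```python
def compression_ratio(sparsity, bytes_per_weight, fp32_bytes=4):
    """Idealized compression vs dense FP32 storage (ignores index overhead)."""
    dense = fp32_bytes
    compressed = (1 - sparsity) * bytes_per_weight
    return dense / compressed

print(f"INT8 only:      {compression_ratio(0.0, 1):.0f}x")  # 4x
print(f"Prune 90% only: {compression_ratio(0.9, 4):.0f}x")  # 10x ideal (~3x in practice)
print(f"Pruned + INT8:  {compression_ratio(0.9, 1):.0f}x")  # up to 40x ideal
```

The two techniques multiply rather than add, which is why combining them is standard practice before mobile or edge deployment.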

Deployment Checklist

Before Deployment:

  • Model optimized (quantized/pruned)
  • API endpoint tested
  • Error handling implemented
  • Logging configured
  • Health check endpoint

Production:

  • Load testing done
  • Monitoring setup
  • Alerting configured
  • Rollback plan ready
  • Documentation updated

Course Summary

You have learned:

| Module | Key Topics |
|---|---|
| ANN | Perceptron, Backprop, Activation |
| CNN | Convolution, Pooling, ResNet |
| RNN | Sequences, BPTT, Stacked RNN |
| LSTM | Gates, Cell State, GRU |
| Transformer | Attention, Self-Attention, BERT/GPT |
| Transfer Learning | Fine-tuning, Hugging Face |
| Optimization | Adam, LR Schedule, Regularization |
| Deployment | Quantization, Serving, Monitoring |

Next Steps

🎉 You've completed the Deep Learning course!

Next steps:

  1. Build real projects with what you've learned
  2. Explore specialized domains (NLP, CV, RL)
  3. Learn MLOps and production systems
  4. Contribute to open source

Resources:

  • Papers: arxiv.org
  • Models: huggingface.co
  • Tutorials: tensorflow.org, pytorch.org
  • Practice: kaggle.com

Final Words

Deep learning is a journey, not a destination. Keep learning, keep building!