
Model Deployment & Production

Deploy Deep Learning models: Compression, Quantization, Serving, and Cloud Deployment


🎯 Lesson Objectives


After this lesson, you will:

✅ Understand the deployment pipeline for Deep Learning

✅ Know compression techniques: Quantization, Pruning

✅ Use TensorFlow Lite and ONNX to export models

✅ Deploy models to the cloud (API, Container)

This is the final lesson of the course!

You have learned to build and train models. This lesson covers bringing a model into the real world!

A model has no value if it only sits on your laptop! 🚀


🚀 Deployment Overview


From Training to Production

Production challenges:

  • Size: models can be gigabytes
  • Speed: real-time inference is often required
  • Cost: GPUs are expensive
  • Compatibility: many different target platforms

Solutions: Compression + Optimization + Proper Serving
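To make the size challenge concrete, here is a quick back-of-envelope estimate (the parameter count is illustrative; real file sizes also include graph metadata and other overhead):

```python
def model_size_mb(num_params, bytes_per_weight=4):
    """Estimate raw weight storage: params x bytes per weight (FP32 = 4 bytes)."""
    return num_params * bytes_per_weight / 1024 ** 2

# A ResNet-50-sized model (~25.6M params) at different precisions
params = 25_600_000
fp32 = model_size_mb(params, 4)   # full precision
fp16 = model_size_mb(params, 2)   # half precision
int8 = model_size_mb(params, 1)   # 8-bit quantized

print(f"FP32: {fp32:.1f} MB, FP16: {fp16:.1f} MB, INT8: {int8:.1f} MB")
```

Dropping from FP32 to INT8 alone cuts raw weight storage 4x, which is why quantization is usually the first optimization applied.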

Deployment Pipeline

🏋️ Training (model) → Optimization (quantize, pruning) → 📦 Export (SavedModel, ONNX) → 🚀 Serving (API, container) → 📊 Monitoring (metrics, alerts)

Checkpoint

Do you understand the deployment challenges?


📦 Model Saving & Export


Keras/TensorFlow Formats

python.py
import tensorflow as tf
from tensorflow.keras import layers, Sequential

# Create model
model = Sequential([
    layers.Dense(128, activation='relu', input_shape=(10,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# 1. Keras format (.keras) - recommended for Keras
model.save('model.keras')
loaded = tf.keras.models.load_model('model.keras')

# 2. SavedModel format (for TensorFlow Serving)
model.save('saved_model_dir')  # TF/Keras 2.x; in Keras 3 use model.export('saved_model_dir')
loaded = tf.keras.models.load_model('saved_model_dir')

# 3. H5 format (legacy)
model.save('model.h5')
loaded = tf.keras.models.load_model('model.h5')

# 4. Weights only
model.save_weights('weights.weights.h5')
model.load_weights('weights.weights.h5')

ONNX Export (Cross-platform)

python.py
# pip install tf2onnx onnxruntime

import tensorflow as tf
import tf2onnx
import onnx

# Convert Keras model to ONNX
spec = (tf.TensorSpec((None, 10), tf.float32, name="input"),)
model_onnx, _ = tf2onnx.convert.from_keras(model, input_signature=spec)
onnx.save_model(model_onnx, "model.onnx")

# Inference with ONNX Runtime
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Predict
X = np.random.randn(1, 10).astype(np.float32)
result = session.run([output_name], {input_name: X})
print(f"Prediction: {result[0]}")

Checkpoint

Do you know the model export formats?


⚡ Model Quantization


What is Quantization?

Quantization = reduce the precision of weights

  • FP32 → INT8: 4x smaller, typically 2-4x faster
  • Usually minimal accuracy loss (< 1%)
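The core idea can be sketched with plain NumPy. This is a simplified affine quantization scheme for illustration, not TFLite's exact implementation:

```python
import numpy as np

def quantize_int8(w):
    """Affine-quantize a float array to int8 with a scale and zero point."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0           # map the range onto 256 int8 levels
    zero_point = np.round(-128 - w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from int8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)

print(f"Storage: {w.nbytes} B -> {q.nbytes} B (4x smaller)")
print(f"Max round-trip error: {np.abs(w - w_hat).max():.4f}")  # bounded by the scale
```

Each weight is stored as one byte plus a shared scale/zero point, which is where the 4x size reduction comes from; the rounding error per weight is at most on the order of the scale.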

TensorFlow Lite Quantization

python.py
import tensorflow as tf
import numpy as np

# Create and train model (example)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# 1. Dynamic range quantization (easiest)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model_dynamic.tflite', 'wb') as f:
    f.write(tflite_model)

print(f"Original size: {model.count_params() * 4 / 1024:.1f} KB")
print(f"Quantized size: {len(tflite_model) / 1024:.1f} KB")

# 2. Full integer quantization (best for mobile)
def representative_dataset():
    """Generate representative data for calibration"""
    for _ in range(100):
        yield [np.random.randn(1, 784).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8 = converter.convert()

with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_int8)

# 3. Float16 quantization (GPU)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()

TFLite Inference

python.py
import tensorflow as tf
import numpy as np

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path='model_dynamic.tflite')
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input
input_data = np.random.randn(1, 784).astype(np.float32)

# Set input
interpreter.set_tensor(input_details[0]['index'], input_data)

# Run inference
interpreter.invoke()

# Get output
output = interpreter.get_tensor(output_details[0]['index'])
print(f"Prediction: {output}")

# Benchmark
import time

def benchmark(interpreter, input_data, num_runs=100):
    input_details = interpreter.get_input_details()
    times = []
    for _ in range(num_runs):
        start = time.time()
        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()
        times.append(time.time() - start)
    return np.mean(times) * 1000  # ms

print(f"Average inference time: {benchmark(interpreter, input_data):.2f} ms")

Checkpoint

Do you know how to quantize a model?


✂️ Model Pruning


What is Pruning?

Pruning = remove unimportant weights

  • Reduces model size and computation
  • Up to ~90% of weights can be pruned
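Magnitude pruning itself is simple: keep the largest-magnitude weights and zero the rest. A minimal NumPy sketch of the idea (not the tfmot implementation):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    flat = np.abs(weights).flatten()
    k = int(sparsity * flat.size)             # number of weights to remove
    threshold = np.sort(flat)[k - 1] if k > 0 else -np.inf
    mask = np.abs(weights) > threshold        # keep only the large weights
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64))
w_pruned, mask = magnitude_prune(w, sparsity=0.9)

print(f"Sparsity: {np.mean(w_pruned == 0):.2%}")  # ~90% of weights are zero
```

Real pruning frameworks do this gradually during training (as in the schedule below) so the remaining weights can adapt, which is what preserves accuracy.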

TensorFlow Model Optimization

python.py
import tensorflow as tf
import tensorflow_model_optimization as tfmot
import numpy as np

# Create model
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Define pruning schedule
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.50,
        final_sparsity=0.90,
        begin_step=0,
        end_step=1000
    )
}

# Apply pruning to model
model_for_pruning = prune_low_magnitude(base_model, **pruning_params)

# Compile
model_for_pruning.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Callbacks for pruning
callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
]

# Train with pruning
# model_for_pruning.fit(x_train, y_train, epochs=10, callbacks=callbacks)

# Strip pruning wrappers
model_pruned = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Check sparsity
def get_sparsity(model):
    for layer in model.layers:
        if hasattr(layer, 'kernel'):
            weights = layer.kernel.numpy()
            sparsity = np.sum(weights == 0) / weights.size
            print(f"{layer.name}: {sparsity*100:.1f}% sparse")

get_sparsity(model_pruned)

# Export pruned model
model_pruned.save('pruned_model.keras')

Pruning + Quantization

python.py
1# Combine pruning and quantization for maximum compression
2def optimize_model(model):
3 # 1. Prune
4 pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
5 model,
6 pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(
7 target_sparsity=0.75,
8 begin_step=0
9 )
10 )
11
12 # 2. Train (needed for pruning to take effect)
13 # pruned_model.fit(...)
14
15 # 3. Strip pruning wrappers
16 pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
17
18 # 4. Quantize
19 converter = tf.lite.TFLiteConverter.from_keras_model(pruned_model)
20 converter.optimizations = [tf.lite.Optimize.DEFAULT]
21 tflite_model = converter.convert()
22
23 return tflite_model
24
25# Check size reduction
26# Original: ~100KB
27# Pruned: ~50KB
28# Pruned + Quantized: ~15KB

Checkpoint

Do you understand model pruning?


🖥️ Serving with TensorFlow Serving


Setup TensorFlow Serving

python.py
# Save model in SavedModel format
import tensorflow as tf

model = tf.keras.models.load_model('my_model.keras')

# Export with versioning
export_path = 'serving_model/1'  # Version 1
model.save(export_path, save_format='tf')  # TF/Keras 2.x; in Keras 3 use model.export(export_path)
print(f"Saved to: {export_path}")

# Check signature
loaded = tf.saved_model.load(export_path)
print(list(loaded.signatures.keys()))

Docker Deployment

Bash
# Pull TensorFlow Serving image
docker pull tensorflow/serving

# Run server
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/serving_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving

REST API Client

python.py
import requests
import numpy as np

# Prepare data
data = np.random.randn(1, 10).tolist()

# REST API request
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {
    "instances": data
}

response = requests.post(url, json=payload)
predictions = response.json()['predictions']
print(f"Predictions: {predictions}")

# gRPC client (faster for production)
# pip install tensorflow-serving-api
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

def grpc_predict(input_data, host='localhost:8500'):
    channel = grpc.insecure_channel(host)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'my_model'
    request.inputs['input_1'].CopyFrom(
        tf.make_tensor_proto(input_data.astype(np.float32))
    )

    response = stub.Predict(request)
    return response.outputs['output_1'].float_val

Checkpoint

Do you know how to use TensorFlow Serving?


☁️ Cloud Deployment Options


FastAPI Serving

python.py
# pip install fastapi uvicorn

from fastapi import FastAPI
from pydantic import BaseModel
import tensorflow as tf
import numpy as np

app = FastAPI()

# Load model at startup
model = tf.keras.models.load_model('my_model.keras')

class PredictRequest(BaseModel):
    data: list

class PredictResponse(BaseModel):
    prediction: list

@app.post("/predict", response_model=PredictResponse)
async def predict(request: PredictRequest):
    input_data = np.array(request.data).astype(np.float32)
    prediction = model.predict(input_data).tolist()
    return PredictResponse(prediction=prediction)

@app.get("/health")
async def health():
    return {"status": "healthy"}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000

Dockerfile

dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Cloud Platform Options

| Platform | Best for | Pros | Cons |
|---|---|---|---|
| AWS SageMaker | Full ML lifecycle | Managed, scalable | Expensive |
| Google Vertex AI | TensorFlow | Native TF support | GCP lock-in |
| Azure ML | Enterprise | Integration | Complex |
| Hugging Face | NLP/Transformers | Easy, free tier | Limited customization |

Hugging Face Deployment

python.py
# Deploy to Hugging Face Spaces (Gradio)
# pip install gradio

import gradio as gr
import tensorflow as tf

model = tf.keras.models.load_model('sentiment_model')

def predict_sentiment(text):
    # Tokenize and predict
    # ... preprocessing
    prediction = model.predict([text])
    return {
        "Positive": float(prediction[0][1]),
        "Negative": float(prediction[0][0])
    }

demo = gr.Interface(
    fn=predict_sentiment,
    inputs=gr.Textbox(label="Enter text"),
    outputs=gr.Label(num_top_classes=2),
    title="Sentiment Analysis",
    examples=["I love this!", "This is terrible"]
)

demo.launch()  # Local testing
# demo.launch(share=True)  # Get public URL

Checkpoint

Do you know the cloud deployment options?


📊 Monitoring & MLOps


Model Monitoring

python.py
import numpy as np
from datetime import datetime
import json

class ModelMonitor:
    def __init__(self, model_name):
        self.model_name = model_name
        self.predictions = []
        self.latencies = []

    def log_prediction(self, input_data, prediction, latency_ms):
        log = {
            "timestamp": datetime.now().isoformat(),
            "model": self.model_name,
            "input_shape": list(input_data.shape),
            "prediction": prediction.tolist() if isinstance(prediction, np.ndarray) else prediction,
            "latency_ms": latency_ms
        }
        self.predictions.append(log)
        self.latencies.append(latency_ms)

        # Alert if latency too high
        if latency_ms > 100:
            self.alert(f"High latency: {latency_ms}ms")

    def get_stats(self):
        return {
            "total_predictions": len(self.predictions),
            "avg_latency_ms": np.mean(self.latencies),
            "p95_latency_ms": np.percentile(self.latencies, 95),
            "p99_latency_ms": np.percentile(self.latencies, 99),
        }

    def alert(self, message):
        print(f"⚠️ ALERT: {message}")
        # Send to monitoring system (Prometheus, CloudWatch, etc.)


# Usage (assumes `model` is a trained Keras model)
monitor = ModelMonitor("sentiment_model_v1")

import time
input_data = np.random.randn(1, 10)

start = time.time()
prediction = model.predict(input_data)
latency = (time.time() - start) * 1000

monitor.log_prediction(input_data, prediction, latency)
print(monitor.get_stats())

Data Drift Detection

python.py
import numpy as np
from scipy import stats

class DriftDetector:
    def __init__(self, reference_data):
        self.reference_mean = np.mean(reference_data, axis=0)
        self.reference_std = np.std(reference_data, axis=0)
        self.reference_data = reference_data

    def check_drift(self, new_data, threshold=0.05):
        """
        Check for data drift using the KS test
        """
        drift_detected = False
        drift_features = []

        for i in range(new_data.shape[1]):
            # Kolmogorov-Smirnov test
            stat, p_value = stats.ks_2samp(
                self.reference_data[:, i],
                new_data[:, i]
            )

            if p_value < threshold:
                drift_detected = True
                drift_features.append(i)

        return {
            "drift_detected": drift_detected,
            "drift_features": drift_features,
            "recommendation": "Retrain model" if drift_detected else "Model OK"
        }


# Usage
reference = np.random.randn(1000, 10)  # Training data distribution
detector = DriftDetector(reference)

# Check new data
new_data = np.random.randn(100, 10) + 0.5  # Shifted distribution
result = detector.check_drift(new_data)
print(result)

Checkpoint

Do you know how to monitor models?


🎯 Deep Learning Course Wrap-up


Optimization Summary

| Technique | Size Reduction | Speed-up |
|---|---|---|
| Quantization (INT8) | 4x | 2-4x |
| Pruning (90%) | ~3x | 2-3x |
| Both combined | 10-15x | 4-8x |
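The combined figures follow from simple arithmetic. The sketch below computes idealized ratios, assuming a sparse format that stores only nonzero weights; real formats pay index overhead, which is why practical numbers are lower than the ideal:

```python
def compression_ratio(sparsity, bytes_per_weight, fp32_bytes=4):
    """Idealized compression vs dense FP32 storage (ignores index overhead)."""
    dense = fp32_bytes
    compressed = (1 - sparsity) * bytes_per_weight
    return dense / compressed

print(f"INT8 only:      {compression_ratio(0.0, 1):.0f}x")  # 4x
print(f"Prune 90% only: {compression_ratio(0.9, 4):.0f}x")  # 10x ideal (~3x in practice)
print(f"Pruned + INT8:  {compression_ratio(0.9, 1):.0f}x")  # up to 40x ideal
```

The two techniques multiply rather than add, which is why combining them is standard practice before mobile or edge deployment.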

Deployment Checklist

Before Deployment:

  • Model optimized (quantized/pruned)
  • API endpoint tested
  • Error handling implemented
  • Logging configured
  • Health check endpoint

Production:

  • Load testing done
  • Monitoring setup
  • Alerting configured
  • Rollback plan ready
  • Documentation updated

Course Summary

You have learned:

| Module | Key Topics |
|---|---|
| ANN | Perceptron, Backprop, Activation |
| CNN | Convolution, Pooling, ResNet |
| RNN | Sequences, BPTT, Stacked RNN |
| LSTM | Gates, Cell State, GRU |
| Transformer | Attention, Self-Attention, BERT/GPT |
| Transfer Learning | Fine-tuning, Hugging Face |
| Optimization | Adam, LR Schedule, Regularization |
| Deployment | Quantization, Serving, Monitoring |

Next Steps

🎉 You've completed the Deep Learning course!

Next steps:

  1. Build real projects with what you've learned
  2. Explore specialized domains (NLP, CV, RL)
  3. Learn MLOps and production systems
  4. Contribute to open source

Resources:

  • Papers: arxiv.org
  • Models: huggingface.co
  • Tutorials: tensorflow.org, pytorch.org
  • Practice: kaggle.com

Final Words

Deep learning is a journey, not a destination. Keep learning, keep building!