Transformer Architecture

🎯 Lesson objectives

After this lesson, you will:

✅ Have an overview of the Transformer architecture

✅ Understand what the Encoder and the Decoder do

✅ Know how information "flows" through a Transformer

✅ Know why Transformers outperform RNNs/LSTMs

This is the architecture behind ChatGPT, BERT, and Google Translate!

Review of the previous lesson

In the Attention lesson we learned:

Self-Attention: each word "looks at" every other word
Q, K, V: the Query asks, the Key is matched against it, the Value is what gets returned
Multi-Head: several views computed in parallel

Transformer = an architecture built on Attention!

🚀 What is a Transformer?

The 2017 breakthrough

The paper "Attention Is All You Need" (2017) introduced the Transformer:

Drops RNN/LSTM recurrence entirely
Relies on attention (plus simple position-wise feed-forward layers)
Trains much faster, roughly 10x as a rule of thumb, because every position is processed in parallel

Who uses Transformers?

Model	Application	Have you used it?
ChatGPT	AI chat	✅ Surely!
BERT	Google Search	✅ Every day
Google Translate	Translation	✅
Midjourney	Image generation	Maybe
GitHub Copilot	AI coding	Using it right now!

Why is it so strong?

RNN/LSTM	Transformer
Processes tokens sequentially	Processes tokens in parallel
Slow and hard to train	Fast and easier to train
Tends to forget distant context	Attends directly to every position in the context
Hard to scale	Scales to billions of parameters

Checkpoint

Do you understand what a Transformer is?

🏗️ Architecture overview

Two main parts: Encoder & Decoder

Analogy: an interpreter

Encoder = listens to the Vietnamese and understands the meaning
Decoder = speaks it out in English

Simplified diagram

What is each part used for?

Architecture	Used for	Examples
Encoder only	Understanding text	BERT, sentence embeddings
Decoder only	Generating text	GPT, ChatGPT
Encoder-Decoder	Transforming text	Google Translate, T5

Checkpoint

Do you understand what the Encoder and the Decoder do?

📍 Positional Encoding: word positions

Problem: Attention has no notion of order!

Attention looks at all words at once, so on its own it cannot tell apart:

"I love you" vs "you love I"
Same words, different order, completely different meaning!
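To make the problem concrete, here is a minimal pure-Python sketch of self-attention without any positional information. It uses identity projections (Q = K = V = the raw embeddings) and made-up 2-dimensional embeddings; both are simplifying assumptions for illustration only. Reordering the input words merely reorders the output rows: each word ends up with exactly the same vector, whatever its position.

```python
import math

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    out = []
    for q in X:
        # Similarity of this word to every word in the sentence
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        # Softmax turns the scores into attention weights
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of all word vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical toy embeddings, chosen only for illustration
emb = {"I": [1.0, 0.0], "love": [0.0, 1.0], "you": [1.0, 1.0]}

sent_a = ["I", "love", "you"]
sent_b = ["you", "love", "I"]

out_a = dict(zip(sent_a, self_attention([emb[w] for w in sent_a])))
out_b = dict(zip(sent_b, self_attention([emb[w] for w in sent_b])))

for word in sent_a:
    same = all(math.isclose(x, y) for x, y in zip(out_a[word], out_b[word]))
    print(word, "-> same vector in both sentences:", same)
```

Every line prints `True`: without extra position information, the model sees the two sentences as the same bag of words.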

Solution: add position information

Simple code

python.py

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


class PositionalEncoding(layers.Layer):
    """Add position information to the embeddings."""

    def __init__(self, max_len, d_model):
        super().__init__()

        # Build the sinusoidal table once: one row per position
        position = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
        # One frequency per (sin, cos) pair of dimensions; assumes d_model is even
        div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = np.cos(position * div_term)  # odd dimensions

        # Leading axis of size 1 so the table broadcasts over the batch
        self.pos_encoding = tf.constant(pe[np.newaxis, :, :], dtype=tf.float32)

    def call(self, x):
        # Add the code of each position to the embedding at that position
        seq_len = tf.shape(x)[1]
        return x + self.pos_encoding[:, :seq_len, :]


# Demo
pe = PositionalEncoding(max_len=100, d_model=64)
x = tf.random.normal((2, 10, 64))  # 2 sentences, 10 words each

output = pe(x)
print(f"Input: {x.shape}")
print(f"Output (with positions): {output.shape}")

Expected Output

Input: (2, 10, 64)
Output (with positions): (2, 10, 64)

The idea is simple: each position gets its own "code", which is added to the embedding so the model knows which word sits where.
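As a quick check on that idea, the sketch below recomputes the same sine/cosine formula in plain Python for a tiny, assumed model size (d_model = 4) and prints the code of the first few positions. The helper name `position_code` is made up for this example.

```python
import math

def position_code(pos, d_model):
    """Sinusoidal code for one position: sin on even indices, cos on odd ones."""
    code = []
    for i in range(d_model):
        # Each (even, odd) index pair shares one frequency
        freq = 1.0 / (10000.0 ** ((i - i % 2) / d_model))
        angle = pos * freq
        code.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return code

d_model = 4  # tiny size so the vectors are easy to read
codes = [position_code(pos, d_model) for pos in range(4)]

for pos, code in enumerate(codes):
    print(pos, [round(v, 3) for v in code])
```

Position 0 maps to `[0.0, 1.0, 0.0, 1.0]` (sin 0 and cos 0), and each later position gets a different vector, which is what lets the model tell positions apart.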

Checkpoint

Do you understand why Positional Encoding is needed?

🔧 Encoder Block

Structure of an Encoder Block

What each component does

Component	What it does	Analogy
Self-Attention	Each word "talks" to the other words	A team meeting
Add (Residual)	Preserves the original information	Notes in the margin
Norm	Normalizes the values	Grading everyone on a common scale
Feed Forward	Processes each word independently	Thinking on your own
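The "Add" and "Norm" rows can be sketched for a single token vector in plain Python. This is a simplified illustration: it leaves out the learnable scale and shift that a real layer-normalization layer also applies, the `eps` value is an arbitrary small constant, and the numbers are invented.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    added = [a + b for a, b in zip(x, sublayer_out)]  # keep the original signal
    return layer_norm(added)

x = [1.0, 2.0, 3.0, 4.0]           # token vector entering the sub-layer
attn_out = [0.5, -0.5, 0.5, -0.5]  # made-up sub-layer output

y = add_and_norm(x, attn_out)
print([round(v, 3) for v in y])
```

The residual sum here is `[1.5, 1.5, 3.5, 3.5]`; after normalization the values are centred on zero and share a common scale, regardless of how large the inputs were.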

Encoder Block code

python.py

import tensorflow as tf
from tensorflow.keras import layers


class EncoderBlock(layers.Layer):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()

        # Self-Attention: d_model is split evenly across the heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads
        )

        # Feed Forward: expand to d_ff, then project back to d_model
        self.ffn = tf.keras.Sequential([
            layers.Dense(d_ff, activation='relu'),
            layers.Dense(d_model)
        ])

        # Normalization
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.dropout = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Self-Attention + Residual
        attn_output = self.attention(x, x, x)  # Q=K=V=x
        attn_output = self.dropout(attn_output, training=training)
        x = self.norm1(x + attn_output)  # Residual

        # Feed Forward + Residual
        ffn_output = self.ffn(x)
        ffn_output = self.dropout(ffn_output, training=training)
        x = self.norm2(x + ffn_output)  # Residual

        return x


# Demo
encoder_block = EncoderBlock(d_model=64, num_heads=4, d_ff=256)
x = tf.random.normal((2, 10, 64))  # 2 sentences, 10 words each

output = encoder_block(x)
print(f"Input: {x.shape}")
print(f"Output: {output.shape}")

Expected Output

Input: (2, 10, 64)
Output: (2, 10, 64)

Checkpoint

Do you understand the Encoder Block?

🎭 Decoder Block

How does it differ from the Encoder?

The Decoder differs in two ways:

Masked Self-Attention: it is not allowed to look at the "future"
Cross-Attention: it looks at the Encoder's output

Structure of a Decoder Block

Why is a mask needed?
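During training the decoder receives the whole target sentence at once, so without a mask position i could simply read the word at position i + 1 that it is supposed to predict. A look-ahead (causal) mask blocks that. The sketch below builds one in plain Python, using the convention that 1 means "may attend" and 0 means "blocked". The helper name `look_ahead_mask` is chosen for this example only.

```python
def look_ahead_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

mask = look_ahead_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row i lists what word i is allowed to see: itself and everything before it, but nothing after.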

Decoder Block code

python.py

import tensorflow as tf
from tensorflow.keras import layers


class DecoderBlock(layers.Layer):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()

        # Masked Self-Attention
        self.masked_attention = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads
        )

        # Cross-Attention (over the encoder output)
        self.cross_attention = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads
        )

        # Feed Forward
        self.ffn = tf.keras.Sequential([
            layers.Dense(d_ff, activation='relu'),
            layers.Dense(d_model)
        ])

        # Normalization
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.norm3 = layers.LayerNormalization()
        self.dropout = layers.Dropout(dropout)

    def call(self, x, encoder_output, look_ahead_mask=None, training=False):
        # 1. Masked Self-Attention
        attn1 = self.masked_attention(
            x, x, x,
            attention_mask=look_ahead_mask,
            # If no mask is passed, use Keras' built-in causal mask
            # (available in TF 2.10+) so future positions stay hidden
            use_causal_mask=look_ahead_mask is None
        )
        x = self.norm1(x + self.dropout(attn1, training=training))

        # 2. Cross-Attention over the Encoder output
        attn2 = self.cross_attention(
            query=x,              # what the Decoder asks
            key=encoder_output,   # what the Encoder has
            value=encoder_output  # what is read from the Encoder
        )
        x = self.norm2(x + self.dropout(attn2, training=training))

        # 3. Feed Forward
        ffn_output = self.ffn(x)
        x = self.norm3(x + self.dropout(ffn_output, training=training))

        return x


# Demo
decoder_block = DecoderBlock(d_model=64, num_heads=4, d_ff=256)

encoder_out = tf.random.normal((2, 10, 64))  # Encoder output
decoder_in = tf.random.normal((2, 5, 64))    # what the Decoder has generated so far

output = decoder_block(decoder_in, encoder_out)
print(f"Encoder output: {encoder_out.shape}")
print(f"Decoder input: {decoder_in.shape}")
print(f"Decoder output: {output.shape}")

Expected Output

Encoder output: (2, 10, 64)
Decoder input: (2, 5, 64)
Decoder output: (2, 5, 64)

Checkpoint

Do you understand the Decoder Block?

🔗 The Complete Transformer

Putting the pieces together

python.py

import tensorflow as tf
from tensorflow.keras import layers

# Reuses PositionalEncoding, EncoderBlock and DecoderBlock defined above


class Transformer(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, d_ff,
                 num_encoder_layers, num_decoder_layers, max_len, dropout=0.1):
        super().__init__()

        # Embeddings
        self.encoder_embedding = layers.Embedding(vocab_size, d_model)
        self.decoder_embedding = layers.Embedding(vocab_size, d_model)

        # Positional Encoding (shared by encoder and decoder)
        self.pos_encoding = PositionalEncoding(max_len, d_model)

        # Encoder & Decoder stacks
        self.encoder_layers = [
            EncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_encoder_layers)
        ]

        self.decoder_layers = [
            DecoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_decoder_layers)
        ]

        # Output layer
        self.final_layer = layers.Dense(vocab_size)
        self.dropout = layers.Dropout(dropout)

    def call(self, inputs, targets, training=False):
        # Encoder: token ids -> embeddings + positions -> stacked blocks
        enc_output = self.encoder_embedding(inputs)
        enc_output = self.pos_encoding(enc_output)
        enc_output = self.dropout(enc_output, training=training)

        for encoder_layer in self.encoder_layers:
            enc_output = encoder_layer(enc_output, training=training)

        # Decoder: reads the target tokens and the encoder output
        dec_output = self.decoder_embedding(targets)
        dec_output = self.pos_encoding(dec_output)
        dec_output = self.dropout(dec_output, training=training)

        for decoder_layer in self.decoder_layers:
            dec_output = decoder_layer(dec_output, enc_output, training=training)

        # Project to vocabulary size: raw scores (logits), no softmax applied
        output = self.final_layer(dec_output)

        return output


# Demo Transformer
transformer = Transformer(
    vocab_size=10000,
    d_model=64,
    num_heads=4,
    d_ff=256,
    num_encoder_layers=2,
    num_decoder_layers=2,
    max_len=100
)

# Simulated input
source = tf.random.uniform((2, 10), 0, 10000, dtype=tf.int32)  # Vietnamese token ids
target = tf.random.uniform((2, 8), 0, 10000, dtype=tf.int32)   # English token ids

output = transformer(source, target)
print(f"Source (input): {source.shape}")
print(f"Target (output tokens): {target.shape}")
print(f"Output (vocab logits): {output.shape}")

Expected Output

Source (input): (2, 10)
Target (output tokens): (2, 8)
Output (vocab logits): (2, 8, 10000)
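The final `Dense` layer has no activation, so each of the 10000 numbers per position is a raw score (a logit) rather than a probability. A softmax turns the scores into a distribution, and picking the highest one is the simplest (greedy) way to choose the next token. The sketch below uses a made-up 5-word vocabulary and invented scores for a single position.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<pad>", "I", "love", "you", "<end>"]  # hypothetical vocabulary
logits = [-1.0, 0.5, 3.0, 1.0, 0.0]             # invented scores for one position

probs = softmax(logits)
best = max(range(len(vocab)), key=lambda i: probs[i])

print([round(p, 3) for p in probs])
print("next token:", vocab[best])  # next token: love
```

In a real model the same two steps run over the last axis of the `(batch, length, vocab_size)` output.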

🎉 You have just built a complete Transformer!

ChatGPT is based on the same architecture, with a few differences:

Much larger (billions of parameters)
Decoder only (no Encoder needed)
Trained on far more data

🎯 Summary

Transformer = Attention + Feed Forward

Component	Role
Positional Encoding	Adds position information
Encoder	Understands the input (reading)
Decoder	Generates the output (writing)
Self-Attention	Words attend to other words in the same sentence
Cross-Attention	The Decoder attends to the Encoder's output
Feed Forward	Non-linear processing of each position

Common variants

Model	Architecture	Used for
BERT	Encoder only	Understanding text
GPT/ChatGPT	Decoder only	Generating text
T5, BART	Encoder-Decoder	Translation, summarization

Next lesson

Transfer Learning & Pre-trained Models: how to use BERT and GPT in practice!

🎉 Congratulations! You now understand the Transformer architecture!

It is the foundation of most modern language AI: ChatGPT, BERT, Claude, Gemini...
