LSTM - Long Short-Term Memory

🎯 Lesson objectives

After this lesson, you will:

✅ Understand which RNN problem LSTM solves

✅ Understand the three gates (forget, input, output) through analogies

✅ Know how the cell state works

✅ Be able to build an LSTM with Keras

Review of the previous lesson

A plain RNN (SimpleRNN) has a weakness:

It struggles to remember distant information (vanishing gradient)
In a long sentence, it forgets what appeared at the beginning

LSTM was designed to address this.

Analogy:

RNN = a reader with a short memory who has forgotten the opening chapters by the end of the book
LSTM = a reader who keeps a notebook and writes down the important points

Task 0

📖 LSTM glossary

Term	Vietnamese	Explanation
LSTM	Long Short-Term Memory	RNN variant built around memory cells
Cell State	Trạng thái ô nhớ	"Conveyor belt" that carries long-term information
Forget Gate	Cổng quên	Decides which old information to discard
Input Gate	Cổng đầu vào	Decides which new information to add
Output Gate	Cổng đầu ra	Decides which part of the cell state is exposed as output
GRU	Gated Recurrent Unit	Simplified relative of LSTM with fewer gates
Peephole	Kết nối xuyên	Connection that lets the gates "look at" the cell state

Checkpoint

Have you read through the glossary?

Task 1

🧠 What is LSTM?

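The problem with SimpleRNN

SimpleRNN suffers from vanishing gradients on long sequences:

Gradients shrink exponentially as they are backpropagated through the time steps
Long-range dependencies (from the start to the end of a sequence) are hard to learn
Information from the distant past is effectively "forgotten"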

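The exponential decay described above can be seen in a minimal sketch. It assumes a scalar RNN $h_t = \tanh(w \cdot h_{t-1})$ with a hypothetical recurrent weight `w = 0.5` and no inputs; these values are illustrative and not taken from the lesson. The gradient of $h_t$ with respect to $h_0$ is a product of per-step factors $w(1 - h_t^2)$, so it shrinks at every step.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_t / d h_0 after each step of h_t = tanh(w * h_{t-1})."""
    h, grad, history = h0, 1.0, []
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule through one tanh step
        history.append(grad)
    return history

grads = gradient_through_time(w=0.5, h0=0.8, steps=50)
for t in (1, 10, 25, 50):
    print(f"after {t:2d} steps: gradient = {grads[t - 1]:.3e}")
```

Each factor is at most 0.5 here, so after 50 steps the gradient is no larger than $0.5^{50}$, far too small to drive learning.
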
LSTM as the solution

LSTM (Long Short-Term Memory) is designed to:

Remember over the long term: the cell state carries information across many steps
Forget selectively: the forget gate discards information that is no longer needed
Update in a controlled way: the gates regulate the information flow

SimpleRNN vs LSTM

Analogy: LSTM is like a highway (the cell state) with interchanges (the gates). Information can travel straight from start to end, or enter and exit at controlled points.

Checkpoint

Do you understand which problem LSTM solves?

Task 2

🚪 The three gates of LSTM

Analogy: a student taking notes

Imagine you are taking notes in class:

Forget Gate: erase old notes that no longer matter
Input Gate: write new knowledge into the notebook
Output Gate: choose which knowledge to use when answering a question
Cell State (the notebook): stores all the important knowledge
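
The notebook analogy can be made concrete with a small sketch. It assumes a three-slot "notebook" and hand-picked gate pre-activations (all values are hypothetical): a gate is a sigmoid output between 0 and 1 that is multiplied element-wise with what it controls, so it acts as a soft switch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

notebook = [2.0, -1.0, 0.5]     # previous cell state
new_notes = [0.3, 0.9, -0.7]    # candidate values, already in [-1, 1]

# Gate pre-activations chosen by hand: large positive = open, large negative = closed
forget = [sigmoid(z) for z in (6.0, -6.0, 0.0)]   # keep, erase, keep half
write = [sigmoid(z) for z in (-6.0, 6.0, 0.0)]    # skip, write, write half

updated = [f * c + i * n
           for f, c, i, n in zip(forget, notebook, write, new_notes)]
print([round(v, 3) for v in updated])
```

The first slot keeps its old note almost unchanged, the second is overwritten by the new note, and the third blends half of each.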

1. Forget Gate - "What to erase?"

Visualization

[Figure: LSTM cell architecture]

Checkpoint

Do you understand the three gates of LSTM?

Task 3

📐 The math in detail

Full equations

At each time step $t$:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (Forget gate)
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (Input gate)
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ (Candidate)
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (Cell state update)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (Output gate)
$h_t = o_t \odot \tanh(C_t)$ (Hidden state)

Where:

$\sigma$: sigmoid function
$\odot$: element-wise multiplication
$[h_{t-1}, x_t]$: concatenation of the previous hidden state and the current input
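
These equations also show why LSTM eases the vanishing gradient problem. As a simplified sketch, assume we follow only the direct path through the cell state and ignore the indirect dependence of $f_t$, $i_t$ and $\tilde{C}_t$ on $C_{t-1}$ via $h_{t-1}$:

```latex
\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial C_T}{\partial C_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
```

Under that assumption, the gradient along the cell state is scaled only by the forget gates. When the network keeps $f_t$ close to 1 for a unit, that unit's gradient passes through many steps almost unchanged, instead of being multiplied by a weight matrix and a $\tanh$ derivative at every step as in SimpleRNN.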

Implementation from scratch

python.py

import numpy as np

class LSTMCell:
    """LSTM Cell from scratch"""

    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size

        # Initialize weights (simplified)
        scale = 0.1
        concat_size = input_size + hidden_size

        # Forget gate
        self.Wf = np.random.randn(hidden_size, concat_size) * scale
        self.bf = np.zeros((hidden_size, 1))

        # Input gate
        self.Wi = np.random.randn(hidden_size, concat_size) * scale
        self.bi = np.zeros((hidden_size, 1))

        # Candidate
        self.Wc = np.random.randn(hidden_size, concat_size) * scale
        self.bc = np.zeros((hidden_size, 1))

        # Output gate
        self.Wo = np.random.randn(hidden_size, concat_size) * scale
        self.bo = np.zeros((hidden_size, 1))

    def sigmoid(self, x):
        # Clip the input to avoid overflow in np.exp
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, x, h_prev, c_prev):
        """
        Single LSTM step

        Args:
            x: Input (input_size, 1)
            h_prev: Previous hidden state (hidden_size, 1)
            c_prev: Previous cell state (hidden_size, 1)
        """
        # Concatenate
        concat = np.vstack([h_prev, x])

        # Forget gate
        f = self.sigmoid(self.Wf @ concat + self.bf)

        # Input gate
        i = self.sigmoid(self.Wi @ concat + self.bi)

        # Candidate
        c_tilde = np.tanh(self.Wc @ concat + self.bc)

        # Cell state update
        c = f * c_prev + i * c_tilde

        # Output gate
        o = self.sigmoid(self.Wo @ concat + self.bo)

        # Hidden state
        h = o * np.tanh(c)

        return h, c, (f, i, c_tilde, o)

# Demo
lstm = LSTMCell(input_size=10, hidden_size=20)

# Initial states
h = np.zeros((20, 1))
c = np.zeros((20, 1))

# Process sequence
for t in range(5):
    x = np.random.randn(10, 1)
    h, c, gates = lstm.forward(x, h, c)
    print(f"t={t}: h_mean={h.mean():.4f}, c_mean={c.mean():.4f}")

Example output (exact values vary between runs because the weights and inputs are random)

t=0: h_mean=0.0023, c_mean=0.0046
t=1: h_mean=0.0031, c_mean=0.0089
t=2: h_mean=0.0028, c_mean=0.0124
t=3: h_mean=0.0035, c_mean=0.0156
t=4: h_mean=0.0029, c_mean=0.0183

Checkpoint

Do you understand the LSTM equations?

Task 4

💻 LSTM in Keras

Basic LSTM

python.py

from tensorflow import keras
from tensorflow.keras import layers

# Two stacked LSTM layers followed by a softmax classifier
model = keras.Sequential([
    keras.Input(shape=(100, 32)),  # (timesteps, features)
    layers.LSTM(
        units=64,              # Hidden size
        return_sequences=True  # Return the output at every time step
    ),
    layers.LSTM(
        units=32,
        return_sequences=False # Only the last output
    ),
    layers.Dense(10, activation='softmax')
])

model.summary()
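
The parameter counts that `model.summary()` reports for this model can be checked by hand. The sketch below assumes the standard LSTM layout from the equations (four weight blocks, each with a matrix over the concatenated `[h, x]` plus a bias vector); the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of one LSTM layer with biases.

    Four blocks (forget, input, candidate, output), each with a
    units x (input_dim + units) matrix and a bias of length units.
    """
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    return input_dim * units + units

first = lstm_param_count(input_dim=32, units=64)    # 24832
second = lstm_param_count(input_dim=64, units=32)   # 12416
head = dense_param_count(input_dim=32, units=10)    # 330
print(first, second, head, first + second + head)   # 24832 12416 330 37578
```

Note that the second LSTM layer receives 64 features per step, because the first layer returns its 64-dimensional hidden state at every time step.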

LSTM parameters

Parameter	Meaning	Default
`units`	Size of the hidden and cell states	Required
`return_sequences`	Return the output at every time step?	False
`return_state`	Also return the final (h, c)?	False
`dropout`	Dropout rate on the inputs	0
`recurrent_dropout`	Dropout rate on the recurrent state	0

Return State

python.py

import tensorflow as tf
from tensorflow.keras import layers

# Get the hidden state AND the cell state
lstm_layer = layers.LSTM(64, return_sequences=True, return_state=True)

# Usage: a batch of 32 sequences, 10 time steps, 128 features
x = tf.random.normal((32, 10, 128))

output, final_h, final_c = lstm_layer(x)

print(f"Output (all steps): {output.shape}")      # (32, 10, 64)
print(f"Final hidden state: {final_h.shape}")     # (32, 64)
print(f"Final cell state: {final_c.shape}")       # (32, 64)

Expected Output

Output (all steps): (32, 10, 64)
Final hidden state: (32, 64)
Final cell state: (32, 64)
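
How the three return values relate can be illustrated without Keras. The sketch below is a minimal NumPy LSTM for a single sequence (no batch dimension), with an arbitrary stacked weight layout chosen for this example. It returns the hidden state at every step, the final hidden state and the final cell state, mirroring `return_sequences=True, return_state=True`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_lstm(xs, W, b, hidden_size):
    """Run a minimal LSTM over xs with shape (timesteps, input_size).

    W has shape (4 * hidden_size, hidden_size + input_size); its rows hold
    the stacked forget, input, candidate and output blocks.
    """
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    all_h = []
    for x in xs:
        z = W @ np.concatenate([h, x]) + b
        f, i, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        all_h.append(h)
    return np.stack(all_h), h, c

rng = np.random.default_rng(0)
timesteps, input_size, hidden_size = 10, 128, 64
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
xs = rng.normal(size=(timesteps, input_size))

output, final_h, final_c = run_lstm(xs, W, b, hidden_size)
print(output.shape, final_h.shape, final_c.shape)   # (10, 64) (64,) (64,)
print(np.array_equal(output[-1], final_h))          # True
```

The last row of the per-step output is the final hidden state, while the cell state is a separate vector that never appears in the output sequence.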

Checkpoint

Do you know how to use LSTM in Keras?

Task 5

⚡ GRU - Gated Recurrent Unit

GRU vs LSTM

GRU is a simpler relative of LSTM:

2 gates instead of 3 (reset, update)
No separate cell state
Fewer parameters, so training is usually faster
Performance comparable to LSTM on many tasks

GRU equations

$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$ (Update gate)
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$ (Reset gate)
$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$ (Candidate)
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (Hidden state)
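
The GRU equations translate into a single step function much like the LSTM cell. This is a minimal NumPy sketch that follows the bias-free equations as written; the class name, weight scale and seeds are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:
    """One GRU step following the bias-free equations."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        concat_size = hidden_size + input_size
        self.Wz = rng.normal(scale=0.1, size=(hidden_size, concat_size))
        self.Wr = rng.normal(scale=0.1, size=(hidden_size, concat_size))
        self.W = rng.normal(scale=0.1, size=(hidden_size, concat_size))

    def forward(self, x, h_prev):
        concat = np.concatenate([h_prev, x])
        z = sigmoid(self.Wz @ concat)    # update gate
        r = sigmoid(self.Wr @ concat)    # reset gate
        # Candidate state: the reset gate scales how much history is used
        h_tilde = np.tanh(self.W @ np.concatenate([r * h_prev, x]))
        # Blend old state and candidate; there is no separate cell state
        return (1.0 - z) * h_prev + z * h_tilde

gru = GRUCell(input_size=10, hidden_size=20)
h = np.zeros(20)
rng = np.random.default_rng(1)
for t in range(5):
    h = gru.forward(rng.normal(size=10), h)
print(h.shape)   # (20,)
```

Because the new state is a convex combination of the old state and a $\tanh$ output, every entry of `h` stays inside $(-1, 1)$.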

Comparison

Aspect	LSTM	GRU
Gates	3 (forget, input, output)	2 (reset, update)
States	Hidden + Cell	Hidden only
Parameters	More	About 25% fewer
Speed	Slower	Faster
Long-range memory	Often better on very long sequences	Comparable on medium-length sequences
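
The parameter gap between the two follows directly from the number of weight blocks: four for LSTM, three for GRU. A quick sketch, assuming the textbook layout with one bias vector per block and hypothetical layer sizes:

```python
def lstm_params(input_dim, units):
    # forget, input, candidate, output blocks
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    # update, reset, candidate blocks
    return 3 * (units * (input_dim + units) + units)

input_dim, units = 128, 64
lstm, gru = lstm_params(input_dim, units), gru_params(input_dim, units)
print(lstm, gru, f"{1 - gru / lstm:.0%} fewer")   # 49408 37056 25% fewer
```

Keras may report a slightly higher GRU count than this formula, since its default GRU configuration (`reset_after=True`) keeps a second bias vector per block.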

GRU in Keras

python.py

from tensorflow import keras
from tensorflow.keras import layers

# GRU layer: same API as LSTM
model = keras.Sequential([
    keras.Input(shape=(200,)),       # sequences of 200 token ids
    layers.Embedding(10000, 128),

    # GRU in place of LSTM
    layers.GRU(64, return_sequences=True),
    layers.GRU(32),

    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()

When to use LSTM vs GRU?

Scenario	Recommendation
Very long sequences (>500)	LSTM
Resource limited	GRU
Quick experimentation	GRU
Not sure	Try both, compare

Checkpoint

Do you understand GRU?

Task 6

🔀 Bidirectional LSTM

Why bidirectional?

Example

Standard LSTM:     "The movie was not bad at all"
                    →→→→→→→→→→→→→→→→→→→→→→→→→→→→
                    Sees only the context to the left

Bidirectional:     "The movie was not bad at all"
                    →→→→→→→→→→→→→→→→→→→→→→→→→→→→
                    ←←←←←←←←←←←←←←←←←←←←←←←←←←←←
                    Sees context from both sides
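
The two reading directions can be sketched with a toy recurrence standing in for the LSTM (a simplifying assumption: a decaying running sum instead of a gated cell, with made-up inputs). One pass reads left to right, a second pass reads right to left, and the two states are paired at each time step.

```python
def running_state(values, decay=0.5):
    """Toy recurrence standing in for an RNN: s_t = decay * s_{t-1} + x_t."""
    state, states = 0.0, []
    for x in values:
        state = decay * state + x
        states.append(state)
    return states

def bidirectional(values):
    forward = running_state(values)
    # Read right to left, then flip back so index t lines up with the input
    backward = running_state(values[::-1])[::-1]
    return list(zip(forward, backward))   # two features per time step

xs = [1.0, 2.0, 3.0, 4.0]
result = bidirectional(xs)
for t, (fwd, bwd) in enumerate(result):
    print(t, fwd, bwd)
```

At `t=0` the forward state depends only on the first input, while the backward state already summarizes everything to its right. This is why the output of a bidirectional layer has twice as many features per step.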

Keras code

python.py

from tensorflow import keras
from tensorflow.keras import layers

# Bidirectional LSTM
model = keras.Sequential([
    keras.Input(shape=(200,)),       # sequences of 200 token ids
    layers.Embedding(10000, 128),

    # Bidirectional wrapper
    layers.Bidirectional(
        layers.LSTM(64, return_sequences=True)
    ),
    # Output: (batch, timesteps, 128) - 64*2

    layers.Bidirectional(
        layers.LSTM(32)
    ),
    # Output: (batch, 64) - 32*2

    layers.Dense(1, activation='sigmoid')
])

model.summary()

Stacked Bidirectional LSTM

python.py

from tensorflow import keras
from tensorflow.keras import layers

def create_bilstm_classifier(vocab_size, embedding_dim, max_len, num_classes):
    """Stacked bidirectional LSTM text classifier"""

    model = keras.Sequential([
        # Input: sequences of max_len token ids
        keras.Input(shape=(max_len,)),

        # Embedding
        layers.Embedding(vocab_size, embedding_dim),
        layers.SpatialDropout1D(0.2),

        # Stacked Bidirectional LSTM
        layers.Bidirectional(
            layers.LSTM(128, return_sequences=True, dropout=0.2)
        ),
        layers.Bidirectional(
            layers.LSTM(64, return_sequences=True, dropout=0.2)
        ),
        layers.Bidirectional(
            layers.LSTM(32, dropout=0.2)
        ),

        # Classification head
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])

    return model

# Create
model = create_bilstm_classifier(
    vocab_size=10000,
    embedding_dim=128,
    max_len=200,
    num_classes=5
)

# clipnorm caps the gradient norm to guard against exploding gradients
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
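
The `clipnorm=1.0` argument passed to Adam guards against exploding gradients, which deep recurrent stacks are prone to. As a rough sketch of the rule (a plain-Python illustration for one gradient vector, not the Keras implementation, which applies it to each weight tensor's gradient separately):

```python
import math

def clip_by_norm(gradient, clipnorm):
    """Rescale gradient so its L2 norm is at most clipnorm; direction is kept."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clipnorm:
        return list(gradient)
    return [g * clipnorm / norm for g in gradient]

clipped = clip_by_norm([3.0, 4.0], clipnorm=1.0)    # norm 5 is scaled down to 1
small = clip_by_norm([0.3, 0.4], clipnorm=1.0)      # norm 0.5 is left unchanged
print(clipped, small)
```

Only the length of the gradient changes, never its direction, so the update still points the same way but cannot take an arbitrarily large step.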

Checkpoint

Do you know how to use a Bidirectional LSTM?

Task 7

🎯 Summary

LSTM Key Points

Cell State = "highway" for information flow
Forget Gate = decides what to forget
Input Gate = decides what to add
Output Gate = decides what to output
Mitigates the vanishing gradient problem

Equations to remember

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$h_t = o_t \odot \tanh(C_t)$

Comparing RNN variants

Feature	SimpleRNN	LSTM	GRU
Gates	0	3	2
Cell state	No	Yes	No
Long-term	Poor	Good	Good
Parameters	Fewest	Most	In between
Speed	Fast	Slow	In between

Keras API

Python

# SimpleRNN
layers.SimpleRNN(64, return_sequences=True)

# LSTM
layers.LSTM(64, return_sequences=True, dropout=0.2)

# GRU
layers.GRU(64, return_sequences=True)

# Bidirectional
layers.Bidirectional(layers.LSTM(64))

Next lesson

LSTM Applications:

Text Generation
Machine Translation (Seq2Seq)
Named Entity Recognition
Time series with LSTM

🎉 Well done! You now understand LSTM, the backbone of many NLP tasks!

Task 8