MinAI

RNN Applications - Text and Time Series

Applying RNNs to practical problems: Text Classification, Language Modeling, Time Series Prediction

1

🎯 Lesson Objectives

5 min

In this lesson, you will learn:

  • Text Preprocessing: Tokenization, Padding, Embedding
  • Text Classification with RNNs
  • Language Modeling: predicting the next word
  • Time Series Prediction with RNNs
  • Bidirectional RNNs: processing a sequence in both directions

RNNs can be applied to any sequential data: text, time series, DNA sequences, audio, video frames, ...

2

📝 Text Preprocessing Pipeline


The text processing pipeline

Text Preprocessing Pipeline for RNNs: 📝 raw text "Tôi yêu AI" → ✂️ Tokenization ["Tôi", "yêu", "AI"] → 🔢 Encoding [45, 12, 89] → 📏 Padding [45, 12, 89, 0, 0] → 🎯 Embedding (128-D vectors) → fed into the RNN

Tokenization with Keras

python.py
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample texts
texts = [
    "I love machine learning",
    "Deep learning is amazing",
    "Neural networks are powerful",
    "AI will change the world"
]

# Create tokenizer
tokenizer = Tokenizer(num_words=10000)  # Vocabulary size
tokenizer.fit_on_texts(texts)

# View word index
print("Word Index:")
for word, idx in list(tokenizer.word_index.items())[:10]:
    print(f"  '{word}': {idx}")

# Convert to sequences
sequences = tokenizer.texts_to_sequences(texts)
print("\nSequences:")
for text, seq in zip(texts, sequences):
    print(f"  '{text}' → {seq}")
Expected Output
Word Index:
  'learning': 1
  'i': 2
  'love': 3
  'machine': 4
  'deep': 5
  ...

Sequences:
  'I love machine learning' → [2, 3, 4, 1]
  'Deep learning is amazing' → [5, 1, 6, 7]
  'Neural networks are powerful' → [8, 9, 10, 11]
  'AI will change the world' → [12, 13, 14, 15, 16]

Padding

python.py
# Pad sequences to the same length
MAX_LEN = 10

padded = pad_sequences(
    sequences,
    maxlen=MAX_LEN,
    padding='pre',      # Pad at the beginning
    truncating='post'   # Truncate at the end
)

print("Padded Sequences:")
print(padded)
print(f"Shape: {padded.shape}")
Expected Output
Padded Sequences:
[[ 0  0  0  0  0  0  2  3  4  1]
 [ 0  0  0  0  0  0  5  1  6  7]
 [ 0  0  0  0  0  0  8  9 10 11]
 [ 0  0  0  0  0 12 13 14 15 16]]
Shape: (4, 10)

Padding options:

  • padding='pre': add zeros at the front (the more common choice)
  • padding='post': add zeros at the end
  • truncating: cut from the front or the back when a sequence is too long
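The two modes are easiest to see side by side. Here is a pure-Python sketch of what pad_sequences does to a single sequence (padding only; truncation is omitted for brevity):

```python
def pad(seq, maxlen, padding='pre', value=0):
    """Pure-Python sketch of pad_sequences for one sequence
    (assumes len(seq) <= maxlen; truncation not shown)."""
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == 'pre' else seq + fill

print(pad([2, 3, 4, 1], 6, padding='pre'))   # [0, 0, 2, 3, 4, 1]
print(pad([2, 3, 4, 1], 6, padding='post'))  # [2, 3, 4, 1, 0, 0]
```

'pre' is preferred for RNNs because the informative tokens end up closest to the final timestep, which is the state the network reads out.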

Checkpoint

Do you understand how to preprocess text?

3

🔤 Embedding Layer


Why do we need an Embedding?

Representation | Problem
One-hot        | Sparse; does not capture semantics
Integer        | Magnitudes are meaningless (word 5 ≠ 5 × word 1)
Embedding      | Dense; captures semantic similarity
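Conceptually, an Embedding layer is nothing more than a trainable lookup table: each integer token selects one row of a weight matrix. A NumPy sketch with illustrative sizes:

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embedding_dim))  # the trainable weight matrix

token_ids = np.array([1, 5, 7])   # integer-encoded words
vectors = E[token_ids]            # embedding lookup = row selection
print(vectors.shape)              # (3, 4)
```

During training, gradients flow back into exactly the rows that were looked up, which is how semantically similar words drift toward similar vectors.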

Embedding in Keras

python.py
from tensorflow.keras import layers

# Embedding layer
embedding = layers.Embedding(
    input_dim=10000,    # Vocabulary size
    output_dim=128,     # Embedding dimension
    input_length=100    # Sequence length
)

# Input:  (batch, sequence_length) - integers
# Output: (batch, sequence_length, embedding_dim) - vectors

# Example
import tensorflow as tf
sample_input = tf.constant([[1, 2, 3], [4, 5, 6]])  # 2 sequences, 3 words each
sample_output = embedding(sample_input)
print(f"Input shape: {sample_input.shape}")
print(f"Output shape: {sample_output.shape}")
Expected Output
Input shape: (2, 3)
Output shape: (2, 3, 128)

Pretrained Embeddings (GloVe, Word2Vec)

python.py
import numpy as np

def load_glove_embeddings(glove_file, word_index, embedding_dim=100):
    """Load pretrained GloVe embeddings"""
    # Load GloVe
    embeddings_index = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    print(f"Loaded {len(embeddings_index)} word vectors")

    # Create embedding matrix
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    for word, idx in word_index.items():
        if idx < vocab_size:
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[idx] = embedding_vector

    return embedding_matrix

# Usage
# embedding_matrix = load_glove_embeddings(
#     'glove.6B.100d.txt',
#     tokenizer.word_index,
#     embedding_dim=100
# )

# Create embedding layer with pretrained weights
# embedding_layer = layers.Embedding(
#     input_dim=VOCAB_SIZE,
#     output_dim=100,
#     weights=[embedding_matrix],
#     trainable=False  # Freeze pretrained weights
# )

Checkpoint

Do you understand the Embedding layer?

4

🎭 Complete Text Classification


IMDB Sentiment Classification

python.py
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hyperparameters
VOCAB_SIZE = 10000
MAX_LEN = 200
EMBEDDING_DIM = 128
RNN_UNITS = 64

# Load IMDB dataset
print("Loading data...")
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=VOCAB_SIZE
)

# Pad sequences
x_train = pad_sequences(x_train, maxlen=MAX_LEN)
x_test = pad_sequences(x_test, maxlen=MAX_LEN)

print(f"Train shape: {x_train.shape}")
print(f"Test shape: {x_test.shape}")

# Build model
def create_text_classifier():
    model = keras.Sequential([
        # Embedding
        layers.Embedding(
            input_dim=VOCAB_SIZE,
            output_dim=EMBEDDING_DIM,
            input_length=MAX_LEN
        ),

        # RNN layers
        layers.SimpleRNN(RNN_UNITS, return_sequences=True),
        layers.SimpleRNN(RNN_UNITS // 2),

        # Classification head
        layers.Dropout(0.5),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(1, activation='sigmoid')
    ])

    return model

model = create_text_classifier()

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()

Training

python.py
# Callbacks
callbacks = [
    keras.callbacks.EarlyStopping(
        patience=3,
        restore_best_weights=True
    ),
    keras.callbacks.ReduceLROnPlateau(
        factor=0.5,
        patience=2
    )
]

# Train
history = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=128,
    validation_split=0.2,
    callbacks=callbacks
)

# Evaluate
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"\nTest Accuracy: {test_acc:.2%}")

Prediction

python.py
# Get word index for decoding
word_index = imdb.get_word_index()
reverse_word_index = {v: k for k, v in word_index.items()}

def decode_review(sequence):
    """Convert a sequence back to text (IMDB indices are offset by 3)"""
    return ' '.join([reverse_word_index.get(i - 3, '?')
                     for i in sequence if i > 3])

def predict_sentiment(text, tokenizer=None):
    """Predict the sentiment of new text"""
    # If using a custom tokenizer
    if tokenizer:
        seq = tokenizer.texts_to_sequences([text])
    else:
        # Simple word-to-index mapping (demo only)
        words = text.lower().split()
        # Unknown words map to index 2, IMDB's out-of-vocabulary marker
        seq = [[word_index.get(w, -1) + 3 for w in words]]

    padded = pad_sequences(seq, maxlen=MAX_LEN)
    pred = model.predict(padded, verbose=0)[0][0]

    sentiment = "Positive" if pred > 0.5 else "Negative"
    confidence = pred if pred > 0.5 else 1 - pred

    return sentiment, confidence

# Test
sample_review = "This movie was absolutely fantastic and entertaining"
sentiment, conf = predict_sentiment(sample_review)
print(f"'{sample_review}'")
print(f"Sentiment: {sentiment} (confidence: {conf:.2%})")

Checkpoint

Can you build a text classifier?

5

↔️ Bidirectional RNN


Why Bidirectional?

A standard RNN only sees information flowing left → right. But sometimes context from both sides matters:

"The movie was not bad at all"

  • Forward: "not" → reads as negative
  • Backward: "at all" → emphasizes the positive reading

Bidirectional RNN Architecture

Bidirectional RNN architecture: each input xₜ feeds both a forward state h→ₜ and a backward state h←ₜ; at every timestep the two are concatenated [→, ←] into the output yₜ, combining information from both directions.

Keras code

python.py
from tensorflow.keras import layers

# Bidirectional wrapper
model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM,
                     input_length=MAX_LEN),

    # Bidirectional RNN
    layers.Bidirectional(
        layers.SimpleRNN(64, return_sequences=True)
    ),
    # Output: (batch, timesteps, 128) - 64*2

    layers.Bidirectional(
        layers.SimpleRNN(32)
    ),
    # Output: (batch, 64) - 32*2

    layers.Dense(1, activation='sigmoid')
])

model.summary()

Merge modes

python.py
# Ways to combine the forward and backward states
layers.Bidirectional(
    layers.SimpleRNN(64),
    merge_mode='concat'  # Default: [h⃗, h⃖] → 128 units
)

layers.Bidirectional(
    layers.SimpleRNN(64),
    merge_mode='sum'     # h⃗ + h⃖ → 64 units
)

layers.Bidirectional(
    layers.SimpleRNN(64),
    merge_mode='mul'     # h⃗ * h⃖ → 64 units
)

layers.Bidirectional(
    layers.SimpleRNN(64),
    merge_mode='ave'     # (h⃗ + h⃖) / 2 → 64 units
)

Checkpoint

Do you understand Bidirectional RNNs?

6

📈 Time Series Prediction


Preparing the data

python.py
import numpy as np
import matplotlib.pyplot as plt

def create_time_series_data(n_samples=1000, noise=0.1):
    """Create a synthetic time series"""
    t = np.linspace(0, 100, n_samples)
    # Trend + Seasonality + Noise
    data = 0.05 * t + 2 * np.sin(0.5 * t) + np.random.randn(n_samples) * noise
    return data

def create_sequences(data, seq_length, forecast_horizon=1):
    """
    Create input-output sequences for time series

    Args:
        data: Time series array
        seq_length: Number of past timesteps to use
        forecast_horizon: Number of future steps to predict
    """
    X, y = [], []
    for i in range(len(data) - seq_length - forecast_horizon + 1):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length:i+seq_length+forecast_horizon])

    X = np.array(X).reshape(-1, seq_length, 1)  # (samples, timesteps, features)
    y = np.array(y)

    return X, y

# Create data
data = create_time_series_data(1000)

# Create sequences
SEQ_LENGTH = 20
X, y = create_sequences(data, SEQ_LENGTH)

# Split
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
Expected Output
X_train shape: (784, 20, 1)
y_train shape: (784, 1)
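The shapes follow directly from the windowing arithmetic: with n data points, a window of seq_length, and a horizon of h, sliding produces n - seq_length - h + 1 windows, and the 80/20 split is taken over those windows:

```python
# Window-count arithmetic behind the shapes above
n_samples, seq_length, horizon = 1000, 20, 1
n_windows = n_samples - seq_length - horizon + 1  # number of sliding windows
n_train = int(n_windows * 0.8)                    # training share of the windows
print(n_windows, n_train)  # 980 784
```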

A model for time series

python.py
def create_ts_model(seq_length, n_features=1, forecast_horizon=1):
    """RNN model for time series prediction"""
    model = keras.Sequential([
        layers.SimpleRNN(64, return_sequences=True,
                         input_shape=(seq_length, n_features)),
        layers.SimpleRNN(32),
        layers.Dense(32, activation='relu'),
        layers.Dense(forecast_horizon)  # Predict n steps ahead
    ])

    return model

# Create and compile
model = create_ts_model(SEQ_LENGTH)
model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

# Train
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5)
    ],
    verbose=0
)

# Evaluate
test_loss, test_mae = model.evaluate(X_test, y_test)
print(f"Test MAE: {test_mae:.4f}")

Visualization

python.py
# Predict
y_pred = model.predict(X_test)

# Plot
plt.figure(figsize=(14, 5))

# Predictions vs actual
n_plot = 100
plt.subplot(1, 2, 1)
plt.plot(y_test[:n_plot], label='Actual', alpha=0.7)
plt.plot(y_pred[:n_plot], label='Predicted', alpha=0.7)
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Time Series Prediction')
plt.legend()

# Training history
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training History')
plt.legend()

plt.tight_layout()
plt.show()

Checkpoint

Can you build a time series model?

7

💡 Multi-step Forecasting


Predicting multiple steps ahead

python.py
# Predict 5 steps ahead
FORECAST_HORIZON = 5
X_multi, y_multi = create_sequences(data, SEQ_LENGTH, FORECAST_HORIZON)

# Split
X_train_m, X_test_m = X_multi[:split], X_multi[split:]
y_train_m, y_test_m = y_multi[:split], y_multi[split:]

# Model with multiple outputs
model_multi = create_ts_model(SEQ_LENGTH, forecast_horizon=FORECAST_HORIZON)
model_multi.compile(optimizer='adam', loss='mse', metrics=['mae'])

model_multi.fit(
    X_train_m, y_train_m,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=5)],
    verbose=0
)

# Predict
y_pred_m = model_multi.predict(X_test_m[:1])
print(f"Input shape: {X_test_m[:1].shape}")
print(f"Output (5 steps): {y_pred_m}")

Sequence-to-Sequence for Time Series

python.py
def create_seq2seq_model(seq_length, forecast_horizon):
    """
    Encoder-Decoder architecture for time series
    """
    model = keras.Sequential([
        # Encoder
        layers.SimpleRNN(64, return_sequences=True,
                         input_shape=(seq_length, 1)),
        layers.SimpleRNN(32),

        # Repeat the context vector for the decoder
        layers.RepeatVector(forecast_horizon),

        # Decoder
        layers.SimpleRNN(32, return_sequences=True),
        layers.SimpleRNN(64, return_sequences=True),

        # Output
        layers.TimeDistributed(layers.Dense(1))
    ])

    return model

# Build
seq2seq = create_seq2seq_model(SEQ_LENGTH, FORECAST_HORIZON)
seq2seq.summary()
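The RepeatVector layer is the hinge of this architecture: it tiles the encoder's fixed-size context vector once per forecast step, so the decoder RNN has an input at every timestep. A NumPy sketch of that operation, with illustrative sizes:

```python
import numpy as np

# The encoder compresses the input window into one context vector.
context = np.arange(32, dtype=float)   # encoder output for one sample: (32,)

# RepeatVector(5) copies it 5 times: one identical row per forecast step.
repeated = np.tile(context, (5, 1))    # shape (5, 32)
print(repeated.shape)  # (5, 32)
```

The decoder then unrolls over those 5 identical inputs, and TimeDistributed(Dense(1)) maps each decoder state to one forecast value, giving an output of shape (batch, 5, 1).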

Checkpoint

Do you understand multi-step forecasting?

8

🎯 Summary


RNN applications covered

Task           | Input       | Output          | Architecture
Sentiment      | Text        | Label           | Many-to-One
Time Series    | Past values | Future value(s) | Many-to-One/Many
Language Model | Words       | Next word       | Many-to-Many
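The Language Model row was only outlined in this lesson. Its training data comes from sliding windows over a tokenized corpus: each run of k tokens is an input and the token that follows is the label (a minimal pure-Python sketch; the model itself would be an Embedding → RNN → Dense(vocab_size, softmax) stack, analogous to the classifier above):

```python
# Build (input window, next token) training pairs for a language model.
tokens = [2, 3, 4, 1, 5, 1, 6, 7]  # an integer-encoded corpus (illustrative)
k = 3                              # context window size

pairs = [(tokens[i:i + k], tokens[i + k])
         for i in range(len(tokens) - k)]

print(pairs[0])    # ([2, 3, 4], 1)
print(len(pairs))  # 5
```

At inference time the model predicts a distribution over the vocabulary, samples the next token, appends it to the window, and repeats.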

Text processing pipeline

Example
1. Tokenization: Text → Words → Integers
2. Padding: Sequences → Fixed length
3. Embedding: Integers → Dense vectors
4. RNN: Sequence processing
5. Output: Classification/Regression

Key Components

ComponentKeras LayerPurpose
TokenizerTokenizer()Text → Sequences
Paddingpad_sequences()Fixed length
EmbeddingEmbedding()Words → Vectors
RNNSimpleRNN()Sequence processing
BidirectionalBidirectional()Both directions

Limitations of SimpleRNN

Problem            | Description
Vanishing gradient | Hard to learn long-term dependencies
Sequential         | Cannot be parallelized across timesteps
Short memory       | Forgets information from far back
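The vanishing-gradient row can be made concrete: backpropagation through T timesteps multiplies T per-step Jacobians, so if their typical scale sits below 1 the gradient signal decays geometrically (the 0.9 factor below is illustrative, not measured):

```python
# Geometric decay of the gradient signal through time.
factor = 0.9   # assumed per-step gradient scale (< 1 → vanishing)
grad = 1.0
for _ in range(50):   # backpropagate through 50 timesteps
    grad *= factor
print(f"{grad:.6f}")  # ≈ 0.005154
```

After only 50 steps less than 1% of the signal survives, which is why SimpleRNN struggles with long-term dependencies and why the gated LSTM of the next lesson was introduced.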

Next lesson

LSTM (Long Short-Term Memory):

  • Solves the vanishing gradient problem
  • Memory cells for long-term dependencies
  • Gates to control the information flow

🎉 Excellent! You now know how to apply RNNs to real-world problems!