Theory
Lesson 13/15

Text Processing & NLP Basics

Text preprocessing, tokenization, feature extraction, and NLP fundamentals

Text Processing & NLP Basics

Natural Language Processing

1. Introduction to NLP

NLP - Natural Language Processing

NLP is the field of AI that enables computers to understand, analyze, and process natural human language.

1.1 NLP Applications

Application              | Description                 | Example
Sentiment Analysis       | Detect sentiment in text    | Positive/negative reviews
Named Entity Recognition | Identify named entities     | Person names, locations
Text Classification      | Categorize documents        | Spam detection
Machine Translation      | Translate automatically     | Google Translate
Question Answering       | Answer questions            | ChatGPT, chatbots
Text Summarization       | Summarize documents         | News summaries

1.2 NLP Pipeline

Text
NLP Pipeline

Raw Text
   │
   ▼
Text Preprocessing    (cleaning, normalization)
   │
   ▼
Tokenization          (split into words/sentences)
   │
   ▼
Feature Extraction    (TF-IDF, Word2Vec, etc.)
   │
   ▼
Model / Analysis      (ML models, rules)
   │
   ▼
Output (sentiment, category, entities, etc.)
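
To make the stages concrete before covering each one in detail, here is a minimal sketch that wires them together (the tiny texts and labels are invented for illustration; Section 6 builds a fuller version):

Python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["I love this movie", "This film was terrible", "Great acting", "Awful plot"]
labels = ["positive", "negative", "positive", "negative"]

# Preprocessing: lowercase and keep only letters/spaces
cleaned = [re.sub(r"[^a-z\s]", "", t.lower()) for t in texts]

# Tokenization + feature extraction (TfidfVectorizer tokenizes internally)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)

# Model / analysis
model = LogisticRegression().fit(X, labels)
print(model.predict(vectorizer.transform(["terrible acting"])))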

2. Text Preprocessing

2.1 Basic Cleaning

Python
import re
import string

def clean_text(text):
    """Basic text cleaning"""
    # Lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

# Example
raw_text = "Check out https://example.com! This is AMAZING!!! 👍 #NLP @user"
cleaned = clean_text(raw_text)
print(cleaned)
# Output: "check out this is amazing 👍 nlp user"
# (the emoji survives basic cleaning; see advanced_clean below for emoji removal)

2.2 Advanced Cleaning

Python
import re
import unicodedata
import emoji

def advanced_clean(text):
    """Advanced text cleaning"""
    # Normalize Unicode and drop non-ASCII characters
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('utf-8')

    # Remove emojis
    text = emoji.replace_emoji(text, replace='')

    # Expand common contractions BEFORE stripping apostrophes,
    # otherwise "can't" becomes "cant" and never matches
    contractions = {
        "can't": "cannot",
        "won't": "will not",
        "n't": " not",
        "'re": " are",
        "'s": " is",
        "'d": " would",
        "'ll": " will",
        "'ve": " have"
    }
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)

    # Remove special characters but keep essential punctuation
    text = re.sub(r'[^\w\s.,!?]', '', text)

    return text.strip()

2.3 Stop Words Removal

Python
from nltk.corpus import stopwords
import nltk

# Download stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """Remove stop words"""
    words = text.split()
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)

# Example
text = "This is a sample sentence showing stop word removal"
print(remove_stopwords(text))
# Output: "sample sentence showing stop word removal"

# Custom stop words
custom_stops = stop_words.union({'also', 'however', 'therefore'})

3. Tokenization

3.1 Word Tokenization

Python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

text = "Hello, world! How are you doing today? I'm fine."

# Word tokenization (note that contractions are split: "I'm" -> "I", "'m")
words = word_tokenize(text)
print(words)
# ['Hello', ',', 'world', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'I', "'m", 'fine', '.']

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# ['Hello, world!', 'How are you doing today?', "I'm fine."]

3.2 Subword Tokenization

Python
# Using Hugging Face Tokenizers
from transformers import AutoTokenizer

# BERT tokenizer (WordPiece)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("unhappiness")
print(tokens)  # ['un', '##happiness'] or similar subwords

# Encode to IDs
encoded = tokenizer.encode("Hello world!", add_special_tokens=True)
print(encoded)  # [101, 7592, 2088, 999, 102]

# Decode back
decoded = tokenizer.decode(encoded)
print(decoded)  # "[CLS] hello world! [SEP]"

3.3 N-grams

Python
from nltk import ngrams

text = "The quick brown fox jumps"
words = text.split()

# Unigrams (1-grams)
unigrams = list(ngrams(words, 1))
print(unigrams)
# [('The',), ('quick',), ('brown',), ('fox',), ('jumps',)]

# Bigrams (2-grams)
bigrams = list(ngrams(words, 2))
print(bigrams)
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]

# Trigrams (3-grams)
trigrams = list(ngrams(words, 3))
print(trigrams)
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
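
When n-grams are used as model features, they are usually generated directly by the vectorizer rather than with nltk.ngrams. A minimal sketch with scikit-learn's CountVectorizer (covered further in Section 5):

Python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) produces both unigram and bigram features
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(["The quick brown fox jumps"])
print(vec.get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'fox jumps' 'jumps' 'quick' 'quick brown' 'the' 'the quick']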

4. Stemming & Lemmatization

4.1 Stemming

Python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

words = ['running', 'runs', 'ran', 'easily', 'fairly', 'studies', 'studying']

for word in words:
    print(f"{word} -> Porter: {porter.stem(word)}, Snowball: {snowball.stem(word)}")

# Output:
# running -> Porter: run, Snowball: run
# runs -> Porter: run, Snowball: run
# ran -> Porter: ran, Snowball: ran
# easily -> Porter: easili, Snowball: easili
# fairly -> Porter: fairli, Snowball: fair
# studies -> Porter: studi, Snowball: studi
# studying -> Porter: studi, Snowball: studi

4.2 Lemmatization

Python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import nltk

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()

# Specify the part of speech for accurate lemmatization
# n: noun, v: verb, a: adjective, r: adverb
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('better', pos='a'))   # good
print(lemmatizer.lemmatize('studies', pos='v'))  # study

# Automatic POS detection
def get_wordnet_pos(tag):
    """Convert NLTK POS tag to WordNet POS"""
    if tag.startswith('J'):
        return 'a'  # adjective
    elif tag.startswith('V'):
        return 'v'  # verb
    elif tag.startswith('R'):
        return 'r'  # adverb
    else:
        return 'n'  # noun (default)

def lemmatize_with_pos(text):
    words = word_tokenize(text)
    pos_tags = pos_tag(words)

    lemmatized = []
    for word, tag in pos_tags:
        wn_pos = get_wordnet_pos(tag)
        lemma = lemmatizer.lemmatize(word, pos=wn_pos)
        lemmatized.append(lemma)

    return ' '.join(lemmatized)

text = "The cats were running quickly towards their homes"
print(lemmatize_with_pos(text))
# "The cat be run quickly towards their home"
Stemming vs Lemmatization
  • Stemming: faster, rule-based, may produce non-words (running → run, studies → studi)
  • Lemmatization: slower, dictionary-based, produces valid words (better → good, studies → study)
  • Prefer lemmatization for accuracy and stemming for speed; the quick comparison below shows the difference side by side
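
A minimal side-by-side check, reusing the NLTK stemmer and lemmatizer from above, makes the trade-off visible:

Python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Compare the two normalizations on the same words (POS given for the lemmatizer)
for word, pos in [("studies", "v"), ("better", "a"), ("running", "v")]:
    print(f"{word}: stem={stemmer.stem(word)}, lemma={lemmatizer.lemmatize(word, pos=pos)}")
# studies: stem=studi, lemma=study
# better: stem=better, lemma=good
# running: stem=run, lemma=run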

5. Feature Extraction

5.1 Bag of Words (BoW)

Python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog are friends"
]

# Create BoW
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# View vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())
# ['and', 'are', 'cat', 'dog', 'friends', 'log', 'mat', 'on', 'sat', 'the']

# View matrix
print("BoW Matrix:\n", bow_matrix.toarray())
# [[0 0 1 0 0 0 1 1 1 2]
#  [0 0 0 1 0 1 0 1 1 2]
#  [1 1 1 1 1 0 0 0 0 2]]

5.2 TF-IDF

Python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Machine learning is great",
    "Machine learning is challenging",
    "Deep learning is a subset of machine learning"
]

# TF-IDF
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

print("Vocabulary:", tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray().round(2))

# TF-IDF with more options
tfidf = TfidfVectorizer(
    max_features=1000,    # Top 1000 features
    min_df=2,             # Minimum document frequency
    max_df=0.95,          # Maximum document frequency
    ngram_range=(1, 2),   # Unigrams and bigrams
    stop_words='english'
)
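
For reference, with its default settings (smooth_idf=True, norm='l2') TfidfVectorizer computes approximately

tf-idf(t, d) = tf(t, d) × idf(t),   where idf(t) = ln((1 + N) / (1 + df(t))) + 1

with tf(t, d) the raw count of term t in document d, N the number of documents, and df(t) the number of documents containing t; each document vector is then L2-normalized. The effect is that terms which are frequent in one document but rare across the corpus get the highest weights.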

5.3 Word Embeddings (Word2Vec)

Python
from gensim.models import Word2Vec

# Sample corpus (list of tokenized sentences)
sentences = [
    ['machine', 'learning', 'is', 'great'],
    ['deep', 'learning', 'is', 'powerful'],
    ['natural', 'language', 'processing', 'uses', 'machine', 'learning'],
    ['word', 'embeddings', 'capture', 'semantic', 'meaning']
]

# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,   # Embedding dimension
    window=5,          # Context window
    min_count=1,       # Minimum word frequency
    workers=4,         # Parallel training
    sg=1               # 1 for Skip-gram, 0 for CBOW
)

# Get word vector
vector = model.wv['machine']
print(f"Vector shape: {vector.shape}")  # (100,)

# Find similar words
similar = model.wv.most_similar('learning', topn=3)
print("Similar to 'learning':", similar)

# Word arithmetic: king - man + woman ≈ queen
# (only meaningful with a large corpus or pre-trained vectors;
#  these words are not in the toy corpus above, so the call is commented out)
# result = model.wv.most_similar(
#     positive=['king', 'woman'],
#     negative=['man'],
#     topn=1
# )

# Save and load
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")

5.4 Pre-trained Embeddings

Python
import gensim.downloader as api

# Download pre-trained GloVe vectors
glove = api.load('glove-wiki-gigaword-100')  # 100-dimensional

# Get vector
vector = glove['computer']
print(f"Shape: {vector.shape}")

# Similarity
similarity = glove.similarity('king', 'queen')
print(f"Similarity king-queen: {similarity}")

# Most similar
similar = glove.most_similar('python', topn=5)
print("Similar to 'python':", similar)

6. Text Classification Pipeline

6.1 Complete Pipeline

Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

# Sample data
data = pd.DataFrame({
    'text': [
        "I love this product, it's amazing!",
        "Terrible experience, waste of money",
        "Great quality and fast shipping",
        "Product broke after one day",
        "Best purchase I've ever made",
        "Don't buy this, very disappointed",
        "Exceeded my expectations",
        "Poor customer service"
    ],
    'label': ['positive', 'negative', 'positive', 'negative',
              'positive', 'negative', 'positive', 'negative']
})

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.25, random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        lowercase=True,
        stop_words='english',
        ngram_range=(1, 2),
        max_features=5000
    )),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train
pipeline.fit(X_train, y_train)

# Predict
predictions = pipeline.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))

# Predict on new text
new_text = ["This is the worst product ever"]
print(f"\nPrediction: {pipeline.predict(new_text)[0]}")
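
In practice you normally persist the fitted pipeline so the vectorizer and classifier are saved together. A minimal sketch using joblib (the file name is just an example):

Python
import joblib

# Save the fitted pipeline (TF-IDF vectorizer + classifier)
joblib.dump(pipeline, "sentiment_pipeline.joblib")

# Later, e.g. in a serving process
loaded = joblib.load("sentiment_pipeline.joblib")
print(loaded.predict(["Great value, would buy again"])[0])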

6.2 Using Transformers

Python
from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text Classification (zero-shot)
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a tutorial about NLP",
    candidate_labels=["education", "politics", "technology"]
)
print(result)
# {'labels': ['technology', 'education', 'politics'], 'scores': [...]}

# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Apple is looking at buying U.K. startup for $1 billion")
print(result)
# [{'entity_group': 'ORG', 'word': 'Apple', ...}, ...]

7. spaCy for Production NLP

7.1 Basic spaCy Usage

Python
import spacy

# Load model
nlp = spacy.load("en_core_web_sm")    # Small model
# nlp = spacy.load("en_core_web_lg")  # Large model (better accuracy)

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

# Tokenization
print("Tokens:", [token.text for token in doc])

# Part of Speech
print("\nPOS Tags:")
for token in doc:
    print(f"  {token.text}: {token.pos_} ({token.tag_})")

# Named Entities
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"  {ent.text}: {ent.label_}")

# Dependency Parsing
print("\nDependencies:")
for token in doc:
    print(f"  {token.text} <--{token.dep_}-- {token.head.text}")

7.2 Custom Pipeline

Python
import spacy
from spacy.language import Language

# Custom component
@Language.component("text_cleaner")
def text_cleaner(doc):
    # Custom processing
    return doc

# Add to pipeline
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("text_cleaner", last=True)

# Batch processing (efficient)
texts = ["Text 1", "Text 2", "Text 3"]
docs = list(nlp.pipe(texts, batch_size=50))

# Disable unnecessary components for speed
with nlp.select_pipes(disable=['ner', 'parser']):
    doc = nlp("Just tokenize this")

8. Practice

NLP Exercise

Exercise: Build Text Preprocessing Pipeline

Python
# Build a comprehensive text preprocessing class with:
# 1. Cleaning (URLs, HTML, punctuation)
# 2. Tokenization
# 3. Stop word removal
# 4. Lemmatization
# 5. TF-IDF vectorization

# Test with sample reviews

# YOUR CODE HERE
💡 Solution
Python
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

class TextPreprocessor:
    def __init__(self, remove_stopwords=True, lemmatize=True):
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.vectorizer = None

    def clean(self, text):
        """Basic text cleaning"""
        # Lowercase
        text = text.lower()
        # Remove URLs
        text = re.sub(r'http\S+|www\S+', '', text)
        # Remove HTML
        text = re.sub(r'<.*?>', '', text)
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Remove numbers
        text = re.sub(r'\d+', '', text)
        # Remove extra whitespace
        text = ' '.join(text.split())
        return text

    def get_wordnet_pos(self, tag):
        if tag.startswith('J'):
            return 'a'
        elif tag.startswith('V'):
            return 'v'
        elif tag.startswith('R'):
            return 'r'
        return 'n'

    def process(self, text):
        """Full preprocessing pipeline"""
        # Clean
        text = self.clean(text)

        # Tokenize
        tokens = word_tokenize(text)

        # Remove stop words
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]

        # Lemmatize
        if self.lemmatize:
            pos_tags = pos_tag(tokens)
            tokens = [
                self.lemmatizer.lemmatize(word, self.get_wordnet_pos(tag))
                for word, tag in pos_tags
            ]

        return ' '.join(tokens)

    def fit_transform(self, texts):
        """Process texts and create TF-IDF features"""
        processed = [self.process(text) for text in texts]
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
        return self.vectorizer.fit_transform(processed)

    def transform(self, texts):
        """Transform new texts using the fitted vectorizer"""
        processed = [self.process(text) for text in texts]
        return self.vectorizer.transform(processed)


# Test
reviews = [
    "This product is AMAZING! Best purchase I've ever made! Visit https://shop.com",
    "Terrible experience. The product broke after 2 days. DO NOT BUY!!!",
    "Good quality, fast shipping. I'm satisfied with my purchase.",
    "Waste of money. Customer service is horrible. Never buying again."
]

preprocessor = TextPreprocessor()

# Process single text
print("Processed text:")
print(preprocessor.process(reviews[0]))
print()

# Fit and transform all
tfidf_matrix = preprocessor.fit_transform(reviews)
print(f"TF-IDF shape: {tfidf_matrix.shape}")
print(f"Features: {preprocessor.vectorizer.get_feature_names_out()[:10]}...")

9. Summary

Technique        | Description                   | Use Case
Preprocessing    | Clean, normalize text         | All NLP tasks
Tokenization     | Split text into units         | Text analysis
Stemming/Lemma   | Reduce words to root form     | Search, classification
BoW / TF-IDF     | Frequency-based features      | Traditional ML
Word Embeddings  | Dense vector representations  | Semantic similarity
Transformers     | Contextual embeddings         | State-of-the-art NLP

Tools Comparison:

  • NLTK: Educational, comprehensive
  • spaCy: Production-ready, fast
  • Transformers (HuggingFace): SOTA models

Next lesson: Project - Sentiment Analysis