Theory
Lesson 13/15

Text Processing & NLP Basics

Text preprocessing, tokenization, feature extraction, and NLP fundamentals

Text Processing & NLP Basics

Natural Language Processing

1. Introduction to NLP

NLP - Natural Language Processing

NLP is the field of AI that enables computers to understand, analyze, and process natural human language.

1.1 NLP Applications

Application              | Description                 | Example
Sentiment Analysis       | Detect sentiment in text    | Positive/negative reviews
Named Entity Recognition | Identify named entities     | Person names, locations
Text Classification      | Categorize documents        | Spam detection
Machine Translation      | Translate automatically     | Google Translate
Question Answering       | Answer questions            | ChatGPT, chatbots
Text Summarization       | Summarize documents         | News summaries

1.2 NLP Pipeline

Text
NLP Pipeline

Raw Text
   │
   ▼
Text Preprocessing    (cleaning, normalization)
   │
   ▼
Tokenization          (split into words/sentences)
   │
   ▼
Feature Extraction    (TF-IDF, Word2Vec, etc.)
   │
   ▼
Model / Analysis      (ML models, rules)
   │
   ▼
Output (sentiment, category, entities, etc.)
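
To make the stages concrete before covering each one in detail, here is a minimal sketch that wires them together (the tiny texts and labels are invented for illustration; Section 6 builds a fuller version):

Python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["I love this movie", "This film was terrible", "Great acting", "Awful plot"]
labels = ["positive", "negative", "positive", "negative"]

# Preprocessing: lowercase and keep only letters/spaces
cleaned = [re.sub(r"[^a-z\s]", "", t.lower()) for t in texts]

# Tokenization + feature extraction (TfidfVectorizer tokenizes internally)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)

# Model / analysis
model = LogisticRegression().fit(X, labels)
print(model.predict(vectorizer.transform(["terrible acting"])))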

2. Text Preprocessing

2.1 Basic Cleaning

Python
import re
import string

def clean_text(text):
    """Basic text cleaning"""
    # Lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

# Example
raw_text = "Check out https://example.com! This is AMAZING!!! 👍 #NLP @user"
cleaned = clean_text(raw_text)
print(cleaned)
# Output: "check out this is amazing 👍 nlp user"
# (the emoji survives basic cleaning; see advanced_clean below for emoji removal)

2.2 Advanced Cleaning

Python
import re
import unicodedata
import emoji

def advanced_clean(text):
    """Advanced text cleaning"""
    # Normalize Unicode and drop non-ASCII characters
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('utf-8')

    # Remove emojis
    text = emoji.replace_emoji(text, replace='')

    # Expand common contractions BEFORE stripping apostrophes,
    # otherwise "can't" becomes "cant" and never matches
    contractions = {
        "can't": "cannot",
        "won't": "will not",
        "n't": " not",
        "'re": " are",
        "'s": " is",
        "'d": " would",
        "'ll": " will",
        "'ve": " have"
    }
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)

    # Remove special characters but keep essential punctuation
    text = re.sub(r'[^\w\s.,!?]', '', text)

    return text.strip()

2.3 Stop Words Removal

Python
from nltk.corpus import stopwords
import nltk

# Download stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """Remove stop words"""
    words = text.split()
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)

# Example
text = "This is a sample sentence showing stop word removal"
print(remove_stopwords(text))
# Output: "sample sentence showing stop word removal"

# Custom stop words
custom_stops = stop_words.union({'also', 'however', 'therefore'})

3. Tokenization

3.1 Word Tokenization

Python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

text = "Hello, world! How are you doing today? I'm fine."

# Word tokenization (note that contractions are split: "I'm" -> "I", "'m")
words = word_tokenize(text)
print(words)
# ['Hello', ',', 'world', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'I', "'m", 'fine', '.']

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# ['Hello, world!', 'How are you doing today?', "I'm fine."]

3.2 Subword Tokenization

Python
# Using Hugging Face Tokenizers
from transformers import AutoTokenizer

# BERT tokenizer (WordPiece)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("unhappiness")
print(tokens)  # ['un', '##happiness'] or similar subwords

# Encode to IDs
encoded = tokenizer.encode("Hello world!", add_special_tokens=True)
print(encoded)  # [101, 7592, 2088, 999, 102]

# Decode back
decoded = tokenizer.decode(encoded)
print(decoded)  # "[CLS] hello world! [SEP]"

3.3 N-grams

Python
from nltk import ngrams

text = "The quick brown fox jumps"
words = text.split()

# Unigrams (1-grams)
unigrams = list(ngrams(words, 1))
print(unigrams)
# [('The',), ('quick',), ('brown',), ('fox',), ('jumps',)]

# Bigrams (2-grams)
bigrams = list(ngrams(words, 2))
print(bigrams)
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]

# Trigrams (3-grams)
trigrams = list(ngrams(words, 3))
print(trigrams)
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
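
When n-grams are used as model features, they are usually generated directly by the vectorizer rather than with nltk.ngrams. A minimal sketch with scikit-learn's CountVectorizer (covered further in Section 5):

Python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) produces both unigram and bigram features
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(["The quick brown fox jumps"])
print(vec.get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'fox jumps' 'jumps' 'quick' 'quick brown' 'the' 'the quick']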

4. Stemming & Lemmatization

4.1 Stemming

Python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

words = ['running', 'runs', 'ran', 'easily', 'fairly', 'studies', 'studying']

for word in words:
    print(f"{word} -> Porter: {porter.stem(word)}, Snowball: {snowball.stem(word)}")

# Output:
# running -> Porter: run, Snowball: run
# runs -> Porter: run, Snowball: run
# ran -> Porter: ran, Snowball: ran
# easily -> Porter: easili, Snowball: easili
# fairly -> Porter: fairli, Snowball: fair
# studies -> Porter: studi, Snowball: studi
# studying -> Porter: studi, Snowball: studi

4.2 Lemmatization

Python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import nltk

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()

# Specify the part of speech for accurate lemmatization
# n: noun, v: verb, a: adjective, r: adverb
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('better', pos='a'))   # good
print(lemmatizer.lemmatize('studies', pos='v'))  # study

# Automatic POS detection
def get_wordnet_pos(tag):
    """Convert NLTK POS tag to WordNet POS"""
    if tag.startswith('J'):
        return 'a'  # adjective
    elif tag.startswith('V'):
        return 'v'  # verb
    elif tag.startswith('R'):
        return 'r'  # adverb
    else:
        return 'n'  # noun (default)

def lemmatize_with_pos(text):
    words = word_tokenize(text)
    pos_tags = pos_tag(words)

    lemmatized = []
    for word, tag in pos_tags:
        wn_pos = get_wordnet_pos(tag)
        lemma = lemmatizer.lemmatize(word, pos=wn_pos)
        lemmatized.append(lemma)

    return ' '.join(lemmatized)

text = "The cats were running quickly towards their homes"
print(lemmatize_with_pos(text))
# "The cat be run quickly towards their home"
Stemming vs Lemmatization
  • Stemming: faster, rule-based, may produce non-words (running → run, studies → studi)
  • Lemmatization: slower, dictionary-based, produces valid words (better → good, studies → study)
  • Prefer lemmatization for accuracy and stemming for speed; the quick comparison below shows the difference side by side
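
A minimal side-by-side check, reusing the NLTK stemmer and lemmatizer from above, makes the trade-off visible:

Python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Compare the two normalizations on the same words (POS given for the lemmatizer)
for word, pos in [("studies", "v"), ("better", "a"), ("running", "v")]:
    print(f"{word}: stem={stemmer.stem(word)}, lemma={lemmatizer.lemmatize(word, pos=pos)}")
# studies: stem=studi, lemma=study
# better: stem=better, lemma=good
# running: stem=run, lemma=run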

5. Feature Extraction

5.1 Bag of Words (BoW)

Python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog are friends"
]

# Create BoW
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# View vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())
# ['and', 'are', 'cat', 'dog', 'friends', 'log', 'mat', 'on', 'sat', 'the']

# View matrix
print("BoW Matrix:\n", bow_matrix.toarray())
# [[0 0 1 0 0 0 1 1 1 2]
#  [0 0 0 1 0 1 0 1 1 2]
#  [1 1 1 1 1 0 0 0 0 2]]

5.2 TF-IDF

Python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Machine learning is great",
    "Machine learning is challenging",
    "Deep learning is a subset of machine learning"
]

# TF-IDF
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

print("Vocabulary:", tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray().round(2))

# TF-IDF with more options
tfidf = TfidfVectorizer(
    max_features=1000,    # Top 1000 features
    min_df=2,             # Minimum document frequency
    max_df=0.95,          # Maximum document frequency
    ngram_range=(1, 2),   # Unigrams and bigrams
    stop_words='english'
)
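
For reference, with its default settings (smooth_idf=True, norm='l2') TfidfVectorizer computes approximately

tf-idf(t, d) = tf(t, d) × idf(t),   where idf(t) = ln((1 + N) / (1 + df(t))) + 1

with tf(t, d) the raw count of term t in document d, N the number of documents, and df(t) the number of documents containing t; each document vector is then L2-normalized. The effect is that terms which are frequent in one document but rare across the corpus get the highest weights.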

5.3 Word Embeddings (Word2Vec)

Python
from gensim.models import Word2Vec

# Sample corpus (list of tokenized sentences)
sentences = [
    ['machine', 'learning', 'is', 'great'],
    ['deep', 'learning', 'is', 'powerful'],
    ['natural', 'language', 'processing', 'uses', 'machine', 'learning'],
    ['word', 'embeddings', 'capture', 'semantic', 'meaning']
]

# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,   # Embedding dimension
    window=5,          # Context window
    min_count=1,       # Minimum word frequency
    workers=4,         # Parallel training
    sg=1               # 1 for Skip-gram, 0 for CBOW
)

# Get word vector
vector = model.wv['machine']
print(f"Vector shape: {vector.shape}")  # (100,)

# Find similar words
similar = model.wv.most_similar('learning', topn=3)
print("Similar to 'learning':", similar)

# Word arithmetic: king - man + woman ≈ queen
# (only meaningful with a large corpus or pre-trained vectors;
#  these words are not in the toy corpus above, so the call is commented out)
# result = model.wv.most_similar(
#     positive=['king', 'woman'],
#     negative=['man'],
#     topn=1
# )

# Save and load
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")

5.4 Pre-trained Embeddings

Python
import gensim.downloader as api

# Download pre-trained GloVe vectors
glove = api.load('glove-wiki-gigaword-100')  # 100-dimensional

# Get vector
vector = glove['computer']
print(f"Shape: {vector.shape}")

# Similarity
similarity = glove.similarity('king', 'queen')
print(f"Similarity king-queen: {similarity}")

# Most similar
similar = glove.most_similar('python', topn=5)
print("Similar to 'python':", similar)

6. Text Classification Pipeline

6.1 Complete Pipeline

Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

# Sample data
data = pd.DataFrame({
    'text': [
        "I love this product, it's amazing!",
        "Terrible experience, waste of money",
        "Great quality and fast shipping",
        "Product broke after one day",
        "Best purchase I've ever made",
        "Don't buy this, very disappointed",
        "Exceeded my expectations",
        "Poor customer service"
    ],
    'label': ['positive', 'negative', 'positive', 'negative',
              'positive', 'negative', 'positive', 'negative']
})

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.25, random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        lowercase=True,
        stop_words='english',
        ngram_range=(1, 2),
        max_features=5000
    )),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train
pipeline.fit(X_train, y_train)

# Predict
predictions = pipeline.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))

# Predict on new text
new_text = ["This is the worst product ever"]
print(f"\nPrediction: {pipeline.predict(new_text)[0]}")
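
In practice you normally persist the fitted pipeline so the vectorizer and classifier are saved together. A minimal sketch using joblib (the file name is just an example):

Python
import joblib

# Save the fitted pipeline (TF-IDF vectorizer + classifier)
joblib.dump(pipeline, "sentiment_pipeline.joblib")

# Later, e.g. in a serving process
loaded = joblib.load("sentiment_pipeline.joblib")
print(loaded.predict(["Great value, would buy again"])[0])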

6.2 Using Transformers

Python
from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text Classification (zero-shot)
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a tutorial about NLP",
    candidate_labels=["education", "politics", "technology"]
)
print(result)
# {'labels': ['technology', 'education', 'politics'], 'scores': [...]}

# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Apple is looking at buying U.K. startup for $1 billion")
print(result)
# [{'entity_group': 'ORG', 'word': 'Apple', ...}, ...]

7. spaCy for Production NLP

7.1 Basic spaCy Usage

Python
import spacy

# Load model
nlp = spacy.load("en_core_web_sm")    # Small model
# nlp = spacy.load("en_core_web_lg")  # Large model (better accuracy)

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

# Tokenization
print("Tokens:", [token.text for token in doc])

# Part of Speech
print("\nPOS Tags:")
for token in doc:
    print(f"  {token.text}: {token.pos_} ({token.tag_})")

# Named Entities
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"  {ent.text}: {ent.label_}")

# Dependency Parsing
print("\nDependencies:")
for token in doc:
    print(f"  {token.text} <--{token.dep_}-- {token.head.text}")

7.2 Custom Pipeline

Python
import spacy
from spacy.language import Language

# Custom component
@Language.component("text_cleaner")
def text_cleaner(doc):
    # Custom processing
    return doc

# Add to pipeline
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("text_cleaner", last=True)

# Batch processing (efficient)
texts = ["Text 1", "Text 2", "Text 3"]
docs = list(nlp.pipe(texts, batch_size=50))

# Disable unnecessary components for speed
with nlp.select_pipes(disable=['ner', 'parser']):
    doc = nlp("Just tokenize this")

8. Practice

NLP Exercise

Exercise: Build Text Preprocessing Pipeline

Python
# Build a comprehensive text preprocessing class with:
# 1. Cleaning (URLs, HTML, punctuation)
# 2. Tokenization
# 3. Stop word removal
# 4. Lemmatization
# 5. TF-IDF vectorization

# Test with sample reviews

# YOUR CODE HERE
💡 Solution
Python
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

class TextPreprocessor:
    def __init__(self, remove_stopwords=True, lemmatize=True):
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.vectorizer = None

    def clean(self, text):
        """Basic text cleaning"""
        # Lowercase
        text = text.lower()
        # Remove URLs
        text = re.sub(r'http\S+|www\S+', '', text)
        # Remove HTML
        text = re.sub(r'<.*?>', '', text)
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Remove numbers
        text = re.sub(r'\d+', '', text)
        # Remove extra whitespace
        text = ' '.join(text.split())
        return text

    def get_wordnet_pos(self, tag):
        if tag.startswith('J'):
            return 'a'
        elif tag.startswith('V'):
            return 'v'
        elif tag.startswith('R'):
            return 'r'
        return 'n'

    def process(self, text):
        """Full preprocessing pipeline"""
        # Clean
        text = self.clean(text)

        # Tokenize
        tokens = word_tokenize(text)

        # Remove stop words
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]

        # Lemmatize
        if self.lemmatize:
            pos_tags = pos_tag(tokens)
            tokens = [
                self.lemmatizer.lemmatize(word, self.get_wordnet_pos(tag))
                for word, tag in pos_tags
            ]

        return ' '.join(tokens)

    def fit_transform(self, texts):
        """Process texts and create TF-IDF features"""
        processed = [self.process(text) for text in texts]
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
        return self.vectorizer.fit_transform(processed)

    def transform(self, texts):
        """Transform new texts using the fitted vectorizer"""
        processed = [self.process(text) for text in texts]
        return self.vectorizer.transform(processed)


# Test
reviews = [
    "This product is AMAZING! Best purchase I've ever made! Visit https://shop.com",
    "Terrible experience. The product broke after 2 days. DO NOT BUY!!!",
    "Good quality, fast shipping. I'm satisfied with my purchase.",
    "Waste of money. Customer service is horrible. Never buying again."
]

preprocessor = TextPreprocessor()

# Process single text
print("Processed text:")
print(preprocessor.process(reviews[0]))
print()

# Fit and transform all
tfidf_matrix = preprocessor.fit_transform(reviews)
print(f"TF-IDF shape: {tfidf_matrix.shape}")
print(f"Features: {preprocessor.vectorizer.get_feature_names_out()[:10]}...")

9. Summary

Technique        | Description                   | Use Case
Preprocessing    | Clean, normalize text         | All NLP tasks
Tokenization     | Split text into units         | Text analysis
Stemming/Lemma   | Reduce words to root form     | Search, classification
BoW / TF-IDF     | Frequency-based features      | Traditional ML
Word Embeddings  | Dense vector representations  | Semantic similarity
Transformers     | Contextual embeddings         | State-of-the-art NLP

Tools Comparison:

  • NLTK: Educational, comprehensive
  • spaCy: Production-ready, fast
  • Transformers (HuggingFace): SOTA models

Next lesson: Project - Sentiment Analysis