Text Processing & NLP Basics
1. Introduction to NLP
NLP - Natural Language Processing
NLP is the branch of AI that enables computers to understand, analyze, and process human natural language.
1.1 NLP Applications
| Application | Description | Example |
|---|---|---|
| Sentiment Analysis | Detect the sentiment of text | Positive/negative reviews |
| Named Entity Recognition | Identify named entities | Person names, locations |
| Text Classification | Assign documents to categories | Spam detection |
| Machine Translation | Translate between languages | Google Translate |
| Question Answering | Answer natural-language questions | ChatGPT, chatbots |
| Text Summarization | Condense long documents | News summaries |
1.2 NLP Pipeline
```text
                 NLP Pipeline

Raw Text
   │
   v
┌────────────────────┐
│ Text Preprocessing │   (cleaning, normalization)
└─────────┬──────────┘
          v
┌────────────────────┐
│    Tokenization    │   (split into words/sentences)
└─────────┬──────────┘
          v
┌────────────────────┐
│ Feature Extraction │   (TF-IDF, Word2Vec, etc.)
└─────────┬──────────┘
          v
┌────────────────────┐
│  Model / Analysis  │   (ML models, rules)
└─────────┬──────────┘
          v
Output (sentiment, category, entities, etc.)
```
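In code, each stage of the diagram maps to only a line or two. Below is a minimal, illustrative sketch of the full flow using scikit-learn; the texts and labels are invented purely for the example, and Section 6.1 builds this out into a proper pipeline:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented purely for illustration
texts = ["I love this movie", "Boring and terrible film",
         "Great plot and acting", "Awful, a waste of time"]
labels = ["positive", "negative", "positive", "negative"]

# 1. Text preprocessing (cleaning, normalization)
cleaned = [re.sub(r"[^a-z\s]", "", t.lower()) for t in texts]

# 2-3. Tokenization + feature extraction (TfidfVectorizer does both internally)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)

# 4. Model / analysis
model = LogisticRegression()
model.fit(X, labels)

# 5. Output (sentiment, in this case)
print(model.predict(vectorizer.transform(["what a great movie"])))  # e.g. ['positive']
```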
2. Text Preprocessing

2.1 Basic Cleaning
```python
import re
import string

def clean_text(text):
    """Basic text cleaning"""
    # Lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

# Example
raw_text = "Check out https://example.com! This is AMAZING!!! 👍 #NLP @user"
cleaned = clean_text(raw_text)
print(cleaned)
# Output: "check out this is amazing 👍 nlp user"
# (the emoji survives basic cleaning; see advanced_clean below)
```

2.2 Advanced Cleaning
```python
import re
import unicodedata
import emoji

def advanced_clean(text):
    """Advanced text cleaning"""
    # Normalize Unicode and drop non-ASCII characters
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('utf-8')

    # Remove emojis
    text = emoji.replace_emoji(text, replace='')

    # Expand common contractions (before apostrophes are stripped below)
    contractions = {
        "can't": "cannot",
        "won't": "will not",
        "n't": " not",
        "'re": " are",
        "'s": " is",
        "'d": " would",
        "'ll": " will",
        "'ve": " have"
    }
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)

    # Remove special characters but keep essential punctuation
    text = re.sub(r'[^\w\s.,!?]', '', text)

    return text.strip()
```

2.3 Stop Words Removal
```python
from nltk.corpus import stopwords
import nltk

# Download stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """Remove stop words"""
    words = text.split()
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)

# Example
text = "This is a sample sentence showing stop word removal"
print(remove_stopwords(text))
# Output: "sample sentence showing stop word removal"

# Custom stop words
custom_stops = stop_words.union({'also', 'however', 'therefore'})
```

3. Tokenization
3.1 Word Tokenization
```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

text = "Hello, world! How are you doing today? I'm fine."

# Word tokenization (note: contractions like "I'm" are split)
words = word_tokenize(text)
print(words)
# ['Hello', ',', 'world', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'I', "'m", 'fine', '.']

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# ['Hello, world!', 'How are you doing today?', "I'm fine."]
```

3.2 Subword Tokenization
```python
# Using Hugging Face Tokenizers
from transformers import AutoTokenizer

# BERT tokenizer (WordPiece)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("unhappiness")
print(tokens)  # ['un', '##happiness'] or similar subwords

# Encode to IDs
encoded = tokenizer.encode("Hello world!", add_special_tokens=True)
print(encoded)  # [101, 7592, 2088, 999, 102]

# Decode back
decoded = tokenizer.decode(encoded)
print(decoded)  # "[CLS] hello world! [SEP]"
```

3.3 N-grams
```python
from nltk import ngrams

text = "The quick brown fox jumps"
words = text.split()

# Unigrams (1-grams)
unigrams = list(ngrams(words, 1))
print(unigrams)
# [('The',), ('quick',), ('brown',), ('fox',), ('jumps',)]

# Bigrams (2-grams)
bigrams = list(ngrams(words, 2))
print(bigrams)
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]

# Trigrams (3-grams)
trigrams = list(ngrams(words, 3))
print(trigrams)
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
```

4. Stemming & Lemmatization
4.1 Stemming
```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

words = ['running', 'runs', 'ran', 'easily', 'fairly', 'studies', 'studying']

for word in words:
    print(f"{word} -> Porter: {porter.stem(word)}, Snowball: {snowball.stem(word)}")

# Output:
# running -> Porter: run, Snowball: run
# runs -> Porter: run, Snowball: run
# ran -> Porter: ran, Snowball: ran
# easily -> Porter: easili, Snowball: easili
# fairly -> Porter: fairli, Snowball: fair
# studies -> Porter: studi, Snowball: studi
# studying -> Porter: studi, Snowball: studi
```

4.2 Lemmatization
```python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import nltk

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()

# Need to specify part of speech for accurate lemmatization
# n: noun, v: verb, a: adjective, r: adverb
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('better', pos='a'))   # good
print(lemmatizer.lemmatize('studies', pos='v'))  # study

# Auto POS detection
def get_wordnet_pos(tag):
    """Convert NLTK POS tag to WordNet POS"""
    if tag.startswith('J'):
        return 'a'  # adjective
    elif tag.startswith('V'):
        return 'v'  # verb
    elif tag.startswith('R'):
        return 'r'  # adverb
    else:
        return 'n'  # noun (default)

def lemmatize_with_pos(text):
    words = word_tokenize(text)
    pos_tags = pos_tag(words)

    lemmatized = []
    for word, tag in pos_tags:
        wn_pos = get_wordnet_pos(tag)
        lemma = lemmatizer.lemmatize(word, pos=wn_pos)
        lemmatized.append(lemma)

    return ' '.join(lemmatized)

text = "The cats were running quickly towards their homes"
print(lemmatize_with_pos(text))
# "The cat be run quickly towards their home"
```

Stemming vs Lemmatization
- Stemming: fast, rule-based, may produce non-words (running → run, studies → studi)
- Lemmatization: slower, dictionary-based, produces valid words (better → good, studies → study)
- Prefer lemmatization when accuracy matters and stemming when speed matters; the sketch below contrasts the two on the same words
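For a direct comparison, here is a small sketch that runs NLTK's Porter stemmer and WordNet lemmatizer on the same words (assuming the NLTK resources downloaded above; the printed output is approximate):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Same words, treated as verbs for lemmatization
for word in ['studies', 'running', 'better', 'meeting']:
    print(f"{word:10s} stem: {stemmer.stem(word):8s} lemma: {lemmatizer.lemmatize(word, pos='v')}")

# Expected output (roughly):
# studies    stem: studi    lemma: study
# running    stem: run      lemma: run
# better     stem: better   lemma: better   (as a verb; as an adjective -> 'good')
# meeting    stem: meet     lemma: meet
```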
5. Feature Extraction
5.1 Bag of Words (BoW)
```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog are friends"
]

# Create BoW
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# View vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())
# ['and', 'are', 'cat', 'dog', 'friends', 'log', 'mat', 'on', 'sat', 'the']

# View matrix
print("BoW Matrix:\n", bow_matrix.toarray())
# [[0 0 1 0 0 0 1 1 1 2]
#  [0 0 0 1 0 1 0 1 1 2]
#  [1 1 1 1 1 0 0 0 0 2]]
```

5.2 TF-IDF
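TF-IDF scores a term highly when it is frequent within a document but rare across the corpus. For reference, the smoothed formulation that scikit-learn's TfidfVectorizer uses by default is roughly the following, where n is the number of documents and df(t) is the number of documents containing term t; the resulting document vectors are then L2-normalized:

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left(\ln\frac{1 + n}{1 + \text{df}(t)} + 1\right)
$$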
```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Machine learning is great",
    "Machine learning is challenging",
    "Deep learning is a subset of machine learning"
]

# TF-IDF
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

print("Vocabulary:", tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray().round(2))

# TF-IDF with more options
tfidf = TfidfVectorizer(
    max_features=1000,    # Top 1000 features
    min_df=2,             # Minimum document frequency
    max_df=0.95,          # Maximum document frequency
    ngram_range=(1, 2),   # Unigrams and bigrams
    stop_words='english'
)
```

5.3 Word Embeddings (Word2Vec)
```python
from gensim.models import Word2Vec

# Sample corpus (list of tokenized sentences)
sentences = [
    ['machine', 'learning', 'is', 'great'],
    ['deep', 'learning', 'is', 'powerful'],
    ['natural', 'language', 'processing', 'uses', 'machine', 'learning'],
    ['word', 'embeddings', 'capture', 'semantic', 'meaning']
]

# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,   # Embedding dimension
    window=5,          # Context window
    min_count=1,       # Minimum word frequency
    workers=4,         # Parallel training
    sg=1               # 1 for Skip-gram, 0 for CBOW
)

# Get word vector
vector = model.wv['machine']
print(f"Vector shape: {vector.shape}")  # (100,)

# Find similar words
similar = model.wv.most_similar('learning', topn=3)
print("Similar to 'learning':", similar)

# Word arithmetic: king - man + woman ≈ queen
# (requires a model trained on a corpus that actually contains these words)
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=1
)

# Save and load
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")
```

5.4 Pre-trained Embeddings
```python
import gensim.downloader as api

# Download pre-trained GloVe
glove = api.load('glove-wiki-gigaword-100')  # 100-dimensional

# Get vector
vector = glove['computer']
print(f"Shape: {vector.shape}")

# Similarity
similarity = glove.similarity('king', 'queen')
print(f"Similarity king-queen: {similarity}")

# Most similar
similar = glove.most_similar('python', topn=5)
print("Similar to 'python':", similar)
```

6. Text Classification Pipeline
6.1 Complete Pipeline
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

# Sample data
data = pd.DataFrame({
    'text': [
        "I love this product, it's amazing!",
        "Terrible experience, waste of money",
        "Great quality and fast shipping",
        "Product broke after one day",
        "Best purchase I've ever made",
        "Don't buy this, very disappointed",
        "Exceeded my expectations",
        "Poor customer service"
    ],
    'label': ['positive', 'negative', 'positive', 'negative',
              'positive', 'negative', 'positive', 'negative']
})

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.25, random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        lowercase=True,
        stop_words='english',
        ngram_range=(1, 2),
        max_features=5000
    )),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train
pipeline.fit(X_train, y_train)

# Predict
predictions = pipeline.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))

# Predict on new text
new_text = ["This is the worst product ever"]
print(f"\nPrediction: {pipeline.predict(new_text)[0]}")
```

6.2 Using Transformers
```python
from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text Classification (zero-shot)
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a tutorial about NLP",
    candidate_labels=["education", "politics", "technology"]
)
print(result)
# {'labels': ['technology', 'education', 'politics'], 'scores': [...]}

# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Apple is looking at buying U.K. startup for $1 billion")
print(result)
# [{'entity_group': 'ORG', 'word': 'Apple', ...}, ...]
```

7. spaCy for Production NLP
7.1 Basic spaCy Usage
```python
import spacy

# Load model
nlp = spacy.load("en_core_web_sm")   # Small model
# nlp = spacy.load("en_core_web_lg") # Large model (better accuracy)

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

# Tokenization
print("Tokens:", [token.text for token in doc])

# Part of Speech
print("\nPOS Tags:")
for token in doc:
    print(f"  {token.text}: {token.pos_} ({token.tag_})")

# Named Entities
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"  {ent.text}: {ent.label_}")

# Dependency Parsing
print("\nDependencies:")
for token in doc:
    print(f"  {token.text} <--{token.dep_}-- {token.head.text}")
```

7.2 Custom Pipeline
```python
import spacy
from spacy.language import Language

# Custom component
@Language.component("text_cleaner")
def text_cleaner(doc):
    # Custom processing
    return doc

# Add to pipeline
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("text_cleaner", last=True)

# Batch processing (efficient)
texts = ["Text 1", "Text 2", "Text 3"]
docs = list(nlp.pipe(texts, batch_size=50))

# Disable unnecessary components for speed
with nlp.select_pipes(disable=['ner', 'parser']):
    doc = nlp("Just tokenize this")
```

8. Practice
NLP Exercise
Exercise: Build Text Preprocessing Pipeline
```python
# Build a comprehensive text preprocessing class with:
# 1. Cleaning (URLs, HTML, punctuation)
# 2. Tokenization
# 3. Stop word removal
# 4. Lemmatization
# 5. TF-IDF vectorization

# Test with sample reviews

# YOUR CODE HERE
```

💡 Show Solution
```python
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

class TextPreprocessor:
    def __init__(self, remove_stopwords=True, lemmatize=True):
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.vectorizer = None

    def clean(self, text):
        """Basic text cleaning"""
        # Lowercase
        text = text.lower()
        # Remove URLs
        text = re.sub(r'http\S+|www\S+', '', text)
        # Remove HTML
        text = re.sub(r'<.*?>', '', text)
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Remove numbers
        text = re.sub(r'\d+', '', text)
        # Remove extra whitespace
        text = ' '.join(text.split())
        return text

    def get_wordnet_pos(self, tag):
        if tag.startswith('J'):
            return 'a'
        elif tag.startswith('V'):
            return 'v'
        elif tag.startswith('R'):
            return 'r'
        return 'n'

    def process(self, text):
        """Full preprocessing pipeline"""
        # Clean
        text = self.clean(text)

        # Tokenize
        tokens = word_tokenize(text)

        # Remove stop words
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]

        # Lemmatize
        if self.lemmatize:
            pos_tags = pos_tag(tokens)
            tokens = [
                self.lemmatizer.lemmatize(word, self.get_wordnet_pos(tag))
                for word, tag in pos_tags
            ]

        return ' '.join(tokens)

    def fit_transform(self, texts):
        """Process texts and create TF-IDF features"""
        processed = [self.process(text) for text in texts]
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
        return self.vectorizer.fit_transform(processed)

    def transform(self, texts):
        """Transform new texts using fitted vectorizer"""
        processed = [self.process(text) for text in texts]
        return self.vectorizer.transform(processed)


# Test
reviews = [
    "This product is AMAZING! Best purchase I've ever made! Visit https://shop.com",
    "Terrible experience. The product broke after 2 days. DO NOT BUY!!!",
    "Good quality, fast shipping. I'm satisfied with my purchase.",
    "Waste of money. Customer service is horrible. Never buying again."
]

preprocessor = TextPreprocessor()

# Process single text
print("Processed text:")
print(preprocessor.process(reviews[0]))
print()

# Fit and transform all
tfidf_matrix = preprocessor.fit_transform(reviews)
print(f"TF-IDF shape: {tfidf_matrix.shape}")
print(f"Features: {preprocessor.vectorizer.get_feature_names_out()[:10]}...")
```

9. Summary
| Technique | Description | Use Case |
|---|---|---|
| Preprocessing | Clean, normalize text | All NLP tasks |
| Tokenization | Split into units | Text analysis |
| Stemming/Lemmatization | Reduce words to root form | Search, classification |
| BoW / TF-IDF | Frequency-based features | Traditional ML |
| Word Embeddings | Dense vector representations | Semantic similarity |
| Transformers | Contextual embeddings | State-of-the-art NLP |
Tools Comparison (a quick side-by-side sketch follows the list):
- NLTK: educational, comprehensive
- spaCy: production-ready, fast
- Transformers (Hugging Face): state-of-the-art pretrained models
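To see the different registers side by side, the following sketch tokenizes the same sentence with each library. It assumes the NLTK data, the en_core_web_sm model, and the bert-base-uncased tokenizer used earlier are available; the exact subword splits may differ:

```python
from nltk.tokenize import word_tokenize        # NLTK
import spacy                                   # spaCy
from transformers import AutoTokenizer         # Hugging Face

text = "Tokenizers don't always agree."

# NLTK: rule-based, Treebank-style tokenization
print(word_tokenize(text))              # ['Tokenizers', 'do', "n't", 'always', 'agree', '.']

# spaCy: full pipeline object; tokens also carry POS, lemma, entities, etc.
nlp = spacy.load("en_core_web_sm")
print([tok.text for tok in nlp(text)])  # ['Tokenizers', 'do', "n't", 'always', 'agree', '.']

# Transformers: subword tokenization tied to a pretrained model
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert.tokenize(text))              # e.g. ['token', '##izer', '##s', 'don', "'", 't', 'always', 'agree', '.']
```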
Next lesson: Project - Sentiment Analysis
