Comprehensive Review

🎯 Lesson Objectives

Review the entire course:

✅ Python Basics → Functions → Control Flow

✅ NumPy → Pandas → Data Manipulation

✅ Visualization (Matplotlib, Seaborn, Plotly)

✅ Data Cleaning → Feature Engineering → Pipeline

✅ Data Crawling → Streamlit Dashboard

✅ Consolidated cheat sheets for quick reference

Duration: 2 hours | Purpose: Review & quick reference | Prerequisite: Lessons 01–12 completed

🗺️ Course Roadmap

The Python Data Science Journey

Python Basics

NumPy

Pandas

Visualization

Data Processing

Web & Dashboard

Knowledge Overview

Python Data Science

Fundamentals

int, float, str, bool, list, dict

if/elif/else, for, while

def, lambda, *args, **kwargs

Data Tools

NumPy — Array, Broadcasting

Pandas — DataFrame, GroupBy, Merge

Visualization

Matplotlib — plt.plot/bar/hist

Seaborn — Statistical Charts

Plotly — Interactive Charts

Data Processing

Missing, Outliers, Duplicates

Encoding, Scaling, Selection

🐍 Cheat Sheet — Python Basics

Python

# === Variables & Types ===
name = "MinAI"                  # str
age = 25                        # int
pi = 3.14                       # float
is_active = True                # bool

# === Collections ===
fruits = ["apple", "banana"]    # list (mutable, ordered)
coords = (10.5, 20.3)           # tuple (immutable)
unique = {1, 2, 3}              # set (unique values)
person = {"name": "A", "age": 25}  # dict (key-value)

# === List Operations ===
fruits.append("cherry")
fruits.extend(["date", "fig"])
fruits.pop()                    # Removes and returns the last item ("fig")
sorted_list = sorted(fruits)    # Returns a new list; fruits itself is unchanged
squares = [x**2 for x in range(10)]           # List comprehension
evens = [x for x in range(20) if x % 2 == 0]  # Filtered

# === Control Flow ===
score = 85                      # Example value so the snippet runs
if score >= 90:
    grade = "A"
elif score >= 80:
    grade = "B"
else:
    grade = "C"

for item in fruits:
    print(item)

for i, item in enumerate(fruits):
    print(f"{i}: {item}")

# === Functions ===
def calculate(a, b, operation="add"):
    if operation == "add":
        return a + b
    return a - b

# Lambda
square = lambda x: x ** 2
students = [{"name": "A", "gpa": 3.2}, {"name": "B", "gpa": 3.8}]  # Example data
sorted_data = sorted(students, key=lambda s: s['gpa'], reverse=True)

# *args, **kwargs
def flexible(*args, **kwargs):
    print(args)    # tuple
    print(kwargs)  # dict

# === String Methods ===
text = "  Hello World  "
text.strip()          # "Hello World" (returns a new str; text is unchanged)
text.lower()          # "  hello world  "
text.split()          # ['Hello', 'World']
"-".join(["a","b"])   # "a-b"
f"Name: {name}"       # f-string

Checkpoint

Self-check: Can you write a function that takes a list of numbers and returns a list containing only the even ones, using a list comprehension? If so, your Python basics are solid!
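
One possible answer to the self-check above, as a minimal sketch. The function name `keep_evens` is a hypothetical choice for illustration and does not come from the course material.

```python
def keep_evens(numbers):
    """Return a new list with only the even numbers, in their original order."""
    # n % 2 == 0 is True for even integers, including 0 and negative evens
    return [n for n in numbers if n % 2 == 0]


print(keep_evens([1, 2, 3, 4, 10, 15]))  # [2, 4, 10]
print(keep_evens([]))                    # []
```

The comprehension builds a new list rather than modifying the input, so the caller's list stays untouched.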

🔢 Cheat Sheet — NumPy

Python

import numpy as np

# === Array Creation ===
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])           # Second example vector, used in the Math section
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
rng = np.arange(0, 10, 2)         # [0, 2, 4, 6, 8]
lin = np.linspace(0, 1, 5)        # 5 evenly spaced points from 0 to 1 (inclusive)
rand = np.random.randn(3, 3)      # Samples from the standard normal distribution

# === Shape & Reshape ===
a.shape                            # (3,)
a.reshape(3, 1)                    # Column vector, shape (3, 1)
a.flatten()                        # 1D copy

# === Indexing ===
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr[0, :]         # Row 0: [1, 2, 3]
arr[:, 1]         # Col 1: [2, 5, 8]
arr[arr > 5]      # Boolean mask: [6, 7, 8, 9]

# === Math ===
np.mean(a), np.std(a), np.median(a)
np.sum(arr, axis=0)     # Column sums: [12, 15, 18]
np.sum(arr, axis=1)     # Row sums: [6, 15, 24]
np.dot(a, b)            # Dot product: 32
a @ b                   # Matrix-multiply operator (same as dot for 1D arrays)

# === Broadcasting ===
arr + 10                # Add 10 to all elements
arr * np.array([1,2,3]) # Multiply column j by the j-th value

Checkpoint

Self-check: Can you tell axis=0 (aggregate down the rows, one result per column) from axis=1 (aggregate across the columns, one result per row)? Do you understand what broadcasting is? If so, your NumPy is in good shape!
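
The axis and broadcasting ideas in the self-check can be illustrated with a small sketch. The array values are made up for the example.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])       # shape (2, 3): 2 rows, 3 columns

# axis=0 collapses the rows, leaving one value per column
col_sums = arr.sum(axis=0)        # [5, 7, 9]

# axis=1 collapses the columns, leaving one value per row
row_sums = arr.sum(axis=1)        # [6, 15]

# Broadcasting: col_means has shape (3,), and NumPy stretches it
# across both rows so it can be subtracted from the (2, 3) array
col_means = arr.mean(axis=0)      # [2.5, 3.5, 4.5]
centered = arr - col_means        # [[-1.5, -1.5, -1.5], [1.5, 1.5, 1.5]]

print(col_sums, row_sums)
print(centered)
```

A quick way to remember it: the axis you pass is the one that disappears from the result's shape.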

🐼 Cheat Sheet — Pandas

Python

import pandas as pd

# === Create & Read === (three alternative ways to get a DataFrame)
df = pd.read_csv("file.csv")
df = pd.read_excel("file.xlsx")
df = pd.DataFrame({"a": [1,2], "b": [3,4]})

# === Explore ===
df.shape               # (rows, cols)
df.info()              # Types + non-null counts
df.describe()          # Summary statistics for numeric columns
df.head()              # First 5 rows
df.dtypes              # Column types
df.isnull().sum()      # Missing counts per column

# === Selection ===
df['col']                       # Series
df[['col1', 'col2']]            # Multiple cols (DataFrame)
df.loc[0, 'col']                # By label
df.iloc[0, 0]                   # By position

# === Filtering ===
df[df['age'] > 30]
df.query('age > 30 and city == "Hanoi"')
df[df['city'].isin(['HN', 'HCM'])]

# === Transform ===
df['new'] = df['a'] + df['b']
df['category'] = df['score'].apply(lambda x: 'A' if x > 90 else 'B')

# === GroupBy ===
df.groupby('city')['salary'].mean()
df.groupby(['city', 'dept']).agg(
    avg_salary=('salary', 'mean'),
    count=('id', 'count')
)

# === Merge === (df1 and df2 are placeholder DataFrames)
pd.merge(df1, df2, on='key', how='left')    # left/right/inner/outer
pd.concat([df1, df2], axis=0)                # Stack vertically
df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')

# === Chain Operations (Best Practice) ===
result = (
    df
    .dropna(subset=['col'])
    .drop_duplicates()
    .assign(new_col=lambda x: x['a'] + x['b'])
    .query('value > 0')
    .groupby('category')['value'].mean()
)

Checkpoint

Self-check: Can you write the code yourself to read a CSV, filter by a condition, group by a column and compute the mean, and merge two tables? If so, your Pandas is solid!
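
The four steps in the self-check could be sketched as follows. The data, column names, and the `regions` lookup table are hypothetical, and `io.StringIO` stands in for a real CSV file so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """id,name,city,salary
1,An,Hanoi,1000
2,Binh,Hanoi,1400
3,Chi,HCM,1200
4,Dung,HCM,800
"""

# 1. Read the CSV
employees = pd.read_csv(io.StringIO(csv_text))

# 2. Filter rows by a condition
well_paid = employees[employees["salary"] >= 1000]

# 3. Group by city and compute the mean salary
avg_by_city = well_paid.groupby("city", as_index=False)["salary"].mean()

# 4. Merge with a second table on the shared key
regions = pd.DataFrame({"city": ["Hanoi", "HCM"], "region": ["North", "South"]})
report = pd.merge(avg_by_city, regions, on="city", how="left")

print(report)
```

Passing `as_index=False` keeps `city` as a regular column, which makes the following merge on `city` straightforward.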

📊 Cheat Sheet — Visualization

Which chart should you use?

Purpose	Chart	Library
Distribution	Histogram, KDE	Seaborn
Comparison	Bar, Box	Seaborn
Relationship	Scatter, Regression	Seaborn / Plotly
Composition	Pie, Stacked Bar	Plotly
Trend	Line	Plotly
Correlation	Heatmap	Seaborn
Interactive	Any chart	Plotly

Python

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# === Seaborn (Statistical) ===
sns.histplot(df['col'], kde=True)
sns.boxplot(data=df, x='cat', y='num')
sns.scatterplot(data=df, x='x', y='y', hue='group')
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')  # Numeric columns only
sns.pairplot(df, hue='target')

# === Plotly (Interactive) ===
px.scatter(df, x='x', y='y', color='cat', size='val',
           hover_name='name', title='Scatter Plot')
px.line(df, x='date', y='value', color='category')
px.bar(df, x='cat', y='val', color='sub_cat', barmode='group')
px.pie(df, values='val', names='cat', hole=0.4)
px.histogram(df, x='col', nbins=30)

# === Matplotlib (Foundation) ===
# x, y, categories, and values are placeholder sequences
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(x, y)
axes[0].set_title('Line Plot')
axes[1].bar(categories, values)
axes[1].set_title('Bar Plot')
plt.tight_layout()
plt.show()

Checkpoint

Self-check: Do you know when to use a histogram vs. a box plot vs. a scatter plot? Have you ever created a 2x2 subplot grid? If so, your Visualization skills are in good shape!
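
For the 2x2 subplot grid mentioned in the self-check, here is a minimal sketch using synthetic data. The variable names and the choice of the non-interactive `Agg` backend are assumptions made so the script runs without a display; in a notebook you would normally drop the `matplotlib.use` line and call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend, so no window is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data with a fixed seed
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

# plt.subplots(2, 2) returns a 2D array of Axes, indexed as axes[row, col]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].hist(x, bins=20)
axes[0, 0].set_title("Histogram: distribution of x")

axes[0, 1].boxplot(x)
axes[0, 1].set_title("Box plot: median, spread, outliers")

axes[1, 0].scatter(x, y, s=10)
axes[1, 0].set_title("Scatter: relationship between x and y")

axes[1, 1].plot(np.sort(x))
axes[1, 1].set_title("Line: sorted values of x")

fig.tight_layout()
# fig.savefig("subplots_2x2.png")  # Uncomment to write the figure to disk
```

The histogram and box plot both describe a single variable (shape vs. summary and outliers), while the scatter plot needs two variables.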

🧹 Cheat Sheet — Data Cleaning & Feature Engineering

Data Cleaning

Python

# === Missing Values ===
df.isnull().sum()
df['col'] = df['col'].fillna(df['col'].median())     # Numeric: fill with the median
df['cat'] = df['cat'].fillna(df['cat'].mode()[0])    # Categorical: fill with the mode
df = df.dropna(subset=['important_col'])             # Drop rows missing a key column

# === Duplicates ===
df.duplicated().sum()
df = df.drop_duplicates(subset=['col1', 'col2'], keep='first')

# === Outliers (IQR) ===
Q1, Q3 = df['col'].quantile([0.25, 0.75])
IQR = Q3 - Q1
mask = (df['col'] >= Q1 - 1.5*IQR) & (df['col'] <= Q3 + 1.5*IQR)
df_clean = df[mask]

# === Data Types ===
df['date'] = pd.to_datetime(df['date'])
df['price'] = df['price'].str.replace(',', '', regex=False).astype(float)

Feature Engineering

Python

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# === Encoding ===
df = pd.get_dummies(df, columns=['cat_col'], drop_first=True)  # One-Hot
df['edu_num'] = df['education'].map({'HS':0, 'BS':1, 'MS':2, 'PhD':3})  # Ordinal

# === Scaling ===
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Do NOT refit on the test set!

# === Feature Creation ===
df['revenue'] = df['price'] * df['quantity']
df['month'] = df['date'].dt.month
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6]).astype(int)

# === Feature Selection ===
model = RandomForestClassifier().fit(X, y)
importance = model.feature_importances_

# === Pipeline (prevents data leakage!) ===
# numeric_cols and cat_cols are lists of column names
preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('ohe', OneHotEncoder(handle_unknown='ignore'))]), cat_cols)
])
full_pipe = Pipeline([('prep', preprocessor), ('model', RandomForestClassifier())])
full_pipe.fit(X_train, y_train)

Checkpoint

Self-check: Do you understand why a Pipeline is needed to avoid data leakage? Do you know when to use .fillna() vs. .dropna()? If so, you understand Data Processing!
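
To make the data leakage point concrete, here is a small sketch on synthetic data. The dataset, the choice of `LogisticRegression`, and the step names are assumptions for illustration only. The key observation is that when the scaler lives inside the Pipeline, its statistics are learned from the training split alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (not real data)
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = (X[:, 0] > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() calls fit_transform on the scaler using X_train only,
# so no information from X_test reaches the preprocessing step
pipe.fit(X_train, y_train)

train_mean = pipe.named_steps["scaler"].mean_
print(np.allclose(train_mean, X_train.mean(axis=0)))  # True

# score() only calls transform() on X_test; the scaler is not refit
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

Scaling the full dataset before splitting would instead let test-set statistics shape the features the model trains on, which tends to make evaluation scores look better than they should.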

📚 Libraries Summary

Library	Purpose	Key functions
numpy	Numerical computing	`array, mean, std, reshape, dot`
pandas	Data manipulation	`read_csv, groupby, merge, pivot_table`
matplotlib	Base visualization	`plt.plot, plt.bar, plt.hist, subplots`
seaborn	Statistical viz	`histplot, boxplot, heatmap, pairplot`
plotly	Interactive viz	`px.scatter, px.line, px.bar, px.pie`
sklearn	ML preprocessing	`StandardScaler, OneHotEncoder, Pipeline`
requests	HTTP requests	`get, post, json, raise_for_status`
beautifulsoup4	HTML parsing	`find, find_all, select, text`
streamlit	Web apps	`st.dataframe, st.plotly_chart, st.sidebar`

🚀 Keep Learning

The Next Steps

Machine Learning

Deep Learning

MLOps

Specialization

Next courses on MinAI

Machine Learning Fundamentals: Regression, Classification, Clustering
Deep Learning: Neural Networks, CNN, RNN, Transformers
Statistics Fundamentals: Probability, Inferential Statistics

Practice Resources

Resource	Description
Kaggle	Competitions, Datasets, Notebooks
LeetCode	Python coding practice
Real projects	Apply your skills to real-world data

Next lesson: Mini Project & Final Test, where you put the knowledge from the whole course to the test! 🎓

Self-Check Questions

In a complete data processing pipeline, what are the steps from importing the data through to visualization?
When do you use pd.merge() and when do you use pd.concat()? Give a concrete example.
Compare Seaborn's histplot and boxplot: what kind of analysis is each one suited to?
Why is the sklearn Pipeline important when combining preprocessing with model training?
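
As a starting point for the pd.merge() vs. pd.concat() question, the sketch below contrasts the two on hypothetical tables (`jan`, `feb`, and `customers` are made-up names and data). Roughly, concat stacks tables that share the same columns, while merge joins tables that share a key.

```python
import pandas as pd

# Two monthly order tables with identical columns (hypothetical data)
jan = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11], "amount": [100, 250]})
feb = pd.DataFrame({"order_id": [3], "customer_id": [10], "amount": [80]})

# A lookup table that shares the customer_id key
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["An", "Binh"]})

# concat: same columns, more rows
orders = pd.concat([jan, feb], axis=0, ignore_index=True)

# merge: same rows, more columns, matched on the key
orders_named = pd.merge(orders, customers, on="customer_id", how="left")

total_by_name = orders_named.groupby("name")["amount"].sum()
print(total_by_name)
```

Using `ignore_index=True` gives the stacked table a fresh 0..n-1 index instead of repeating each source table's index.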

🎉 Well done! You have completed the Comprehensive Review lesson!

Up next: the Mini Project & end-of-course test. Show what you have learned!
