Mini Project & Final Quiz

🎯 Lesson Objectives

The final lesson of the course has two parts:

✅ Part A: Mini Project. A complete EDA + Dashboard on a realistic dataset

✅ Part B: Quiz. 20 multiple-choice questions covering the whole course

Time: 3-4 hours (Project: 2-3 h, Quiz: 30-45 minutes) | Prerequisite: complete Lessons 01–13

📋 Part A: Mini Project Brief

Context

You are a Data Analyst at an e-commerce company. Your manager hands you the 2024 sales dataset and asks you to:

Give an overview of business performance
Find insights hidden in the data
Recommend actions to improve sales
Build a Dashboard the team can use for monitoring

Dataset

Create your own dataset or generate the CSV file with the script below:

Python

import pandas as pd
import numpy as np

np.random.seed(42)  # fixed seed so everyone generates the same data
n = 5000

# Generate dataset
# Note: product, category and unit_price are sampled independently of each other.
# Note: 5000 timestamps spaced 2 hours apart cover about 417 days,
# so the last orders fall in February 2025.
orders = pd.DataFrame({
    'order_id': range(1, n + 1),
    'date': pd.date_range('2024-01-01', periods=n, freq='2h'),
    'customer_id': np.random.randint(1, 501, n),
    'product': np.random.choice(
        ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Charger',
         'Case', 'Mouse', 'Keyboard', 'Monitor', 'Speaker'], n),
    'category': np.random.choice(
        ['Electronics', 'Accessories', 'Peripherals'], n,
        p=[0.3, 0.4, 0.3]),
    'quantity': np.random.randint(1, 6, n),
    'unit_price': np.random.choice(
        [29.99, 49.99, 99.99, 199.99, 499.99, 999.99], n),
    'region': np.random.choice(
        ['Hà Nội', 'HCM', 'Đà Nẵng', 'Cần Thơ', 'Hải Phòng'], n,
        p=[0.3, 0.35, 0.15, 0.1, 0.1]),
    'payment': np.random.choice(
        ['Credit Card', 'Bank Transfer', 'COD', 'E-Wallet'], n,
        p=[0.3, 0.25, 0.25, 0.2]),
    'rating': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.1, 0.2, 0.35, 0.3])
})

# Add revenue
orders['revenue'] = orders['quantity'] * orders['unit_price']

# Add some missing values (realistic): about 5% of the ratings
# Cast to float first, because an integer column cannot hold NaN
orders['rating'] = orders['rating'].astype(float)
mask = np.random.random(n) < 0.05
orders.loc[mask, 'rating'] = np.nan

# Save
orders.to_csv('ecommerce_2024.csv', index=False)
print(f"Dataset: {orders.shape[0]} orders, {orders.shape[1]} columns")
orders.head()
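
Before analysing, it may be worth checking which period the generated dates actually cover. The sketch below rebuilds only the date column with the same parameters as the generator; the variable names (`dates`, `rows_per_year`, `in_2024`) are illustrative and not part of the original script.

```python
import pandas as pd

# Same date column as the generator: 5000 timestamps, 2 hours apart.
dates = pd.Series(pd.date_range('2024-01-01', periods=5000, freq='2h'))

first_ts, last_ts = dates.min(), dates.max()
print(f"Date range: {first_ts} to {last_ts}")

# Count rows per calendar year to see how many fall outside 2024.
rows_per_year = dates.dt.year.value_counts().sort_index()
print(rows_per_year)

# One possible filter if the analysis should cover 2024 only.
in_2024 = dates[dates.dt.year == 2024]
print(f"Rows in 2024: {len(in_2024)}")
```

If monthly charts should describe 2024 only, a filter like the last one could be applied to the full DataFrame before grouping by month, so that January 2024 and January 2025 are not added together.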

⚠️ Common Mistakes to Avoid

Top 10 mistakes beginners commonly make:

Data Leakage: Fitting the scaler/encoder on the entire dataset. Fit on train only, then transform both train and test!
Python
```
# ❌ WRONG
scaler.fit(entire_data)

# ✅ RIGHT
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
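
One way to get this behaviour without remembering it each time is to wrap the scaler and the model in a scikit-learn pipeline. The following is a minimal sketch on synthetic data; the data, the choice of `LogisticRegression`, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = (X[:, 0] > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# fit() learns the scaler statistics from X_train only;
# score()/predict() reuse those statistics on X_test.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

fitted_scaler = model.named_steps['standardscaler']
print("Scaler mean learned from train:", fitted_scaler.mean_)
print("Test accuracy:", model.score(X_test, y_test))
```

The printed scaler mean matches the column means of `X_train`, not of the full `X`, which is the property that keeps test information out of training.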

Missing Values: Dropping rows without analyzing them first can throw away important data!

Python

# ❌ WRONG - drop immediately
df.dropna()

# ✅ RIGHT - analyze first, then decide
print(df.isnull().sum())
# fillna returns a new Series, so assign it back
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())
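
To make the trade-off concrete, here is a small sketch that measures how much is missing, what a blanket `dropna()` would cost, and what median imputation keeps. The toy DataFrame and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with gaps, for illustration only.
df = pd.DataFrame({
    'price': [10.0, np.nan, 30.0, 40.0, np.nan, 60.0],
    'rating': [5.0, 4.0, np.nan, 3.0, 4.0, 5.0],
})

# 1) Share of missing values in each column.
missing_share = df.isnull().mean()
print(missing_share)

# 2) Cost of a blanket dropna(): rows with any gap are lost.
rows_kept = len(df.dropna())
print(f"dropna() keeps {rows_kept} of {len(df)} rows")

# 3) Median imputation keeps every row; assign the result to a new name.
df_filled = df.fillna(df.median(numeric_only=True))
print(df_filled)
```

Here `dropna()` would discard half of the rows, while imputation keeps all six. Whether the median is an appropriate fill value still depends on why the data is missing.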

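GroupBy Reset Index: Forgetting reset_index() leaves the group keys in the index (a MultiIndex when grouping by several columns), which is harder to merge and plot

Python

# ❌ WRONG - 'category' ends up as the index of a Series
grouped = df.groupby('category')['sales'].sum()

# ✅ RIGHT - 'category' is an ordinary column again
grouped = df.groupby('category')['sales'].sum().reset_index()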
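
The MultiIndex case shows up as soon as two keys are used. A minimal sketch on toy data (the column names and values are assumptions), which also shows `as_index=False` as an alternative to `reset_index()`:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'region': ['North', 'South', 'North', 'North'],
    'sales': [100, 150, 200, 50],
})

# Grouping by two keys leaves both keys in a MultiIndex.
by_index = df.groupby(['category', 'region'])['sales'].sum()
print(type(by_index.index).__name__)

# reset_index() turns the keys back into ordinary columns.
flat = by_index.reset_index()

# as_index=False produces the same flat layout in one step.
flat_direct = df.groupby(['category', 'region'], as_index=False)['sales'].sum()
print(flat_direct)
```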

Merge Duplicates: Merging without checking the keys can double or triple your rows!

Python

# ✅ RIGHT - compare row counts before and after
print(f"Before: {len(df1)}, {len(df2)}")
merged = pd.merge(df1, df2, on='key')
print(f"After: {len(merged)}")  # If >> len(df1), the key is probably duplicated
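
Besides comparing row counts, pandas can enforce the expected key relationship through the `validate` argument of `pd.merge`. The sketch below uses two invented tables in which the lookup table accidentally contains one key twice.

```python
import pandas as pd

orders = pd.DataFrame({'key': [1, 2, 3], 'amount': [10, 20, 30]})
# Key 2 appears twice in the lookup table, so order 2 is duplicated on merge.
customers = pd.DataFrame({'key': [1, 2, 2],
                          'name': ['An', 'Binh', 'Binh (dup)']})

merged = pd.merge(orders, customers, on='key', how='left')
print(f"orders: {len(orders)} rows, merged: {len(merged)} rows")

# Check the join key for duplicates before merging.
dup_keys = customers['key'].duplicated().sum()
print(f"Duplicate keys in customers: {dup_keys}")

# validate= makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(orders, customers, on='key', how='left', validate='many_to_one')
except pd.errors.MergeError as exc:
    print(f"Merge rejected: {exc}")
```

With `validate='many_to_one'`, the merge fails loudly when the right-hand keys are not unique, which is usually easier to debug than an inflated revenue total later on.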

DateTime Conversion: Forgetting to convert strings to datetime means month/year cannot be extracted reliably

Python

# ❌ WRONG - string manipulation; [1] picks the row with index label 1, not the month
df['month'] = df['date'].str.split('-')[1]

# ✅ RIGHT
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
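
Real files often contain a few malformed dates. A minimal sketch, assuming a toy column with one bad value, of how `errors='coerce'` and the `.dt` accessor behave:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2024-01-15', '2024-02-29', 'not a date']})

# errors='coerce' turns unparseable strings into NaT instead of raising.
df['date'] = pd.to_datetime(df['date'], errors='coerce')
print(f"Unparseable dates: {df['date'].isna().sum()}")

# With a real datetime column, the .dt accessor exposes calendar parts.
df['month'] = df['date'].dt.month
df['day_name'] = df['date'].dt.day_name()
print(df)
```

Counting the `NaT` values right after conversion shows how many rows need attention before any month or weekday feature is trusted.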

Visualization Labels: A chart without a title and axis labels leaves the viewer guessing!

Python

# ❌ WRONG
plt.plot(x, y)

# ✅ RIGHT
plt.plot(x, y)
plt.title("Revenue Trend 2024")
plt.xlabel("Month")
plt.ylabel("Revenue ($)")

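Axis Confusion: Mixing up axis=0 (works down the rows, one result per column) and axis=1 (works across the columns, one result per row)

Python

# df.mean(axis=0) → mean of EACH COLUMN (computed down the rows)
# df.mean(axis=1) → mean of EACH ROW (computed across the columns)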
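
A tiny worked example can make the two directions easier to remember. The shop names and quarterly figures below are made up.

```python
import pandas as pd

df = pd.DataFrame({'q1': [10, 20, 30], 'q2': [30, 40, 50]},
                  index=['shop_a', 'shop_b', 'shop_c'])

col_means = df.mean(axis=0)   # collapses the rows: one value per column
row_means = df.mean(axis=1)   # collapses the columns: one value per row

print(col_means)   # q1 -> 20.0, q2 -> 40.0
print(row_means)   # shop_a -> 20.0, shop_b -> 30.0, shop_c -> 40.0
```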

Inplace vs Return: Forgetting to assign the result back (or to pass inplace=True)

Python

# ❌ WRONG
df.dropna()  # The result is discarded!

# ✅ RIGHT (pick one of the two)
df = df.dropna()           # Reassign
df.dropna(inplace=True)    # Or modify in place

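Feature Engineering Timing: Creating features that depend on dataset-wide statistics (mean, scaling, target encoding) BEFORE the split leaks test information! Purely row-wise features such as df['a'] + df['b'] are safe either way.

Python

from sklearn.model_selection import train_test_split

# ❌ WRONG - the mean is computed on all rows, including future test rows
df['a_centered'] = df['a'] - df['a'].mean()
train, test = train_test_split(df)

# ✅ RIGHT - split first, then learn the statistic from train only
train, test = train_test_split(df)
train, test = train.copy(), test.copy()
a_mean = train['a'].mean()
train['a_centered'] = train['a'] - a_mean
test['a_centered'] = test['a'] - a_mean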

Dashboard Performance: Reloading the data on every Streamlit rerun makes the app slow!

Python

# ❌ WRONG
df = pd.read_csv('data.csv')  # Read again on every rerun!

# ✅ RIGHT
@st.cache_data
def load_data():
    return pd.read_csv('data.csv')
df = load_data()  # Read once, then served from the cache

Pro Tips:

Always print the shape after each transform step to help debugging
Use .head() and .sample() often to look at the data
Write comments that explain WHY, not WHAT
Wrap repeated logic in functions to avoid copy-paste
Use try/except around error-prone parts (reading files, API calls)
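
Two of these tips can be sketched as small helpers. The function names `load_csv_safely` and `log_shape`, and the file name used in the demo, are hypothetical and not part of the course material.

```python
import pandas as pd

def load_csv_safely(path):
    """Return a DataFrame, or None if the file cannot be read."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"File not found: {path}")
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        print(f"Could not parse {path}: {exc}")
    return None

def log_shape(df, step):
    """Print the shape after a step and pass the DataFrame through."""
    print(f"{step:<20} -> {df.shape[0]} rows x {df.shape[1]} cols")
    return df

df = pd.DataFrame({'a': [1, 2, None], 'b': ['x', 'y', 'z']})
cleaned = (df.pipe(log_shape, 'raw')
             .dropna()
             .pipe(log_shape, 'after dropna'))

# A missing file is reported instead of crashing the script.
missing = load_csv_safely('this_file_does_not_exist.csv')
```

Using `.pipe()` keeps the shape logging inside a method chain, so a step that drops more rows than expected is visible immediately.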

💡 Solution Hints (Don't peek before you try!)

🔍 Hint for Step 1: Data Cleaning

Python

import pandas as pd

df = pd.read_csv('ecommerce_2024.csv')

# Convert date
df['date'] = pd.to_datetime(df['date'])

# Missing values: assign the result back instead of a chained inplace fillna
print(f"Missing ratings: {df['rating'].isnull().sum()}")
df['rating'] = df['rating'].fillna(df['rating'].median())

# Duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Feature creation
df['month'] = df['date'].dt.month
df['dayofweek'] = df['date'].dt.dayofweek          # Monday=0 ... Sunday=6
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['hour'] = df['date'].dt.hour

📊 Hint for Step 2: EDA Charts

Python

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes df from Step 1 (with the 'month' column already created)

# Monthly revenue trend
monthly = df.groupby('month')['revenue'].sum().reset_index()
fig = px.line(monthly, x='month', y='revenue', title='Monthly Revenue 2024',
              markers=True)
fig.show()

# Revenue by region
region_rev = df.groupby('region')['revenue'].sum().reset_index()
fig = px.pie(region_rev, values='revenue', names='region',
             title='Revenue by Region', hole=0.4)
fig.show()

# Top products (ascending sort so the largest bar is drawn at the top)
product_rev = (df.groupby('product')['revenue'].sum()
               .sort_values(ascending=True).tail(5).reset_index())
fig = px.bar(product_rev, x='revenue', y='product', orientation='h',
             title='Top 5 Products by Revenue')
fig.show()

# Correlation heatmap
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df[['quantity','unit_price','rating','revenue']].corr(),
            annot=True, cmap='coolwarm', ax=ax)
ax.set_title('Correlation Matrix')
plt.show()

⚙️ Hint for Step 3: RFM & Pipeline

Python

from sklearn.preprocessing import StandardScaler

# Assumes df from Step 1 (with 'date' already converted to datetime)

# RFM Analysis: Recency, Frequency and Monetary value per customer
today = df['date'].max()  # snapshot date = most recent order in the data
rfm = df.groupby('customer_id').agg(
    recency=('date', lambda x: (today - x.max()).days),
    frequency=('order_id', 'count'),
    monetary=('revenue', 'sum')
).reset_index()

# Scale so the three features are comparable (mean 0, std 1)
scaler = StandardScaler()
rfm[['R_scaled', 'F_scaled', 'M_scaled']] = scaler.fit_transform(
    rfm[['recency', 'frequency', 'monetary']])

print(rfm.head())
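
To see what the three RFM columns mean, the same aggregation can be run on a handful of hand-made orders where the result is easy to verify. The five orders below are invented, and `snapshot` plays the role of `today` in the hint.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': [101, 101, 102, 103, 103],
    'date': pd.to_datetime(['2024-01-01', '2024-01-10', '2024-01-05',
                            '2024-01-02', '2024-01-03']),
    'revenue': [100.0, 50.0, 200.0, 30.0, 20.0],
})

snapshot = orders['date'].max()   # 2024-01-10, used as "today"
rfm = orders.groupby('customer_id').agg(
    recency=('date', lambda x: (snapshot - x.max()).days),
    frequency=('order_id', 'count'),
    monetary=('revenue', 'sum'),
).reset_index()
print(rfm)
# customer 101: recency 0, frequency 2, monetary 150.0
# customer 102: recency 5, frequency 1, monetary 200.0
# customer 103: recency 7, frequency 2, monetary 50.0
```

A low recency means a recent purchase, so customer 101 is the most recently active, while customer 102 has spent the most in a single order.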

🚀 Hint for Step 4: Streamlit Dashboard

Python

# dashboard.py  (run with: streamlit run dashboard.py)
import streamlit as st
import pandas as pd
import plotly.express as px

st.set_page_config(page_title="E-commerce Dashboard", layout="wide")
st.title("🛒 E-commerce Dashboard 2024")

@st.cache_data
def load_data():
    # Cached: the CSV is read once, not on every rerun
    df = pd.read_csv('ecommerce_2024.csv')
    df['date'] = pd.to_datetime(df['date'])
    return df

df = load_data()

# Sidebar
st.sidebar.header("Filters")
regions = st.sidebar.multiselect("Region", df['region'].unique(), df['region'].unique())
categories = st.sidebar.multiselect("Category", df['category'].unique(), df['category'].unique())

filtered = df[(df['region'].isin(regions)) & (df['category'].isin(categories))]

# Stop early when the filters exclude every row (avoids NaN KPIs and empty charts)
if filtered.empty:
    st.warning("No data matches the selected filters.")
    st.stop()

# KPIs
c1, c2, c3, c4 = st.columns(4)
c1.metric("Total Revenue", f"${filtered['revenue'].sum():,.0f}")
c2.metric("Total Orders", f"{len(filtered):,}")
c3.metric("Avg Order", f"${filtered['revenue'].mean():,.0f}")
c4.metric("Customers", f"{filtered['customer_id'].nunique():,}")

# Charts
col1, col2 = st.columns(2)
with col1:
    # Name the grouped index 'month' so the chart axis is labeled correctly
    monthly = (filtered.groupby(filtered['date'].dt.month)['revenue'].sum()
               .rename_axis('month').reset_index())
    fig = px.line(monthly, x='month', y='revenue', title='Monthly Revenue')
    st.plotly_chart(fig, use_container_width=True)

with col2:
    region_rev = filtered.groupby('region')['revenue'].sum().reset_index()
    fig = px.pie(region_rev, values='revenue', names='region', title='By Region', hole=0.4)
    st.plotly_chart(fig, use_container_width=True)

st.dataframe(filtered, use_container_width=True)

📝 Part B: Quiz (20 questions)

Answer the 20 multiple-choice questions. Each question is worth 5 points. Score ≥ 70/100 to pass.

Question 1. What is the output of: [x**2 for x in range(5) if x % 2 == 0]?

A. [0, 4, 16] B. [1, 9, 25] C. [0, 1, 4, 9, 16] D. [4, 16]

Question 2. lambda x, y: x + y is equivalent to:

A. def f(x): return x B. def f(x, y): return x + y C. def f(x, y): x + y D. def f(*args): return sum(args)

Question 3. NumPy: what is the result of np.array([1,2,3]) + np.array([10,20,30])?

A. [1,2,3,10,20,30] B. [11, 22, 33] C. Error D. [[1,2,3],[10,20,30]]

Question 4. To select the rows where column 'age' > 30 in Pandas:

A. df.filter(age > 30) B. df[df['age'] > 30] C. df.select('age > 30') D. df.where(age > 30)

Question 5. df.groupby('city')['salary'].mean() returns:

A. DataFrame B. Series C. Array D. List

Question 6. Merge two DataFrames keeping ALL rows from the left one:

A. pd.merge(df1, df2, how='inner') B. pd.merge(df1, df2, how='left') C. pd.merge(df1, df2, how='right') D. pd.merge(df1, df2, how='cross')

Question 7. Which chart is best for viewing the distribution of a numeric variable?

A. Pie chart B. Line chart C. Histogram/KDE D. Scatter plot

Question 8. Which library creates interactive charts?

A. Matplotlib B. Seaborn C. Plotly D. Pandas plot

Question 9. df.isnull().sum() tells you:

A. The total number of NULLs in the whole DataFrame B. The number of NULLs per column C. The number of rows containing NULL D. The % of NULLs

Question 10. Detecting outliers with IQR: given IQR = Q3 - Q1, an upper outlier is a value:

A. > Q3 + IQR B. > Q3 + 1.5*IQR C. > Q3 + 2*IQR D. > Q3 + 3*IQR

Question 11. One-Hot Encoding a 'color' variable with 5 unique values and drop_first=True creates how many new columns?

A. 5 B. 4 C. 6 D. 3

Question 12. StandardScaler transforms the data so that:

A. Range [0, 1] B. Mean=0, Std=1 C. Median=0 D. Range [-1, 1]

Question 13. Data Leakage occurs when:

A. The model overfits B. Test information is used during training C. The learning rate is too high D. The dataset is too small

Question 14. With sklearn, you call scaler.fit_transform(X_train) and then scaler.transform(X_test). Why NOT fit_transform on the test set?

A. The test data is too small B. To avoid data leakage C. To save memory D. The test set has no labels

Question 15. BeautifulSoup: find ALL <div class="item"> tags:

A. soup.find('div', 'item') B. soup.find_all('div', class_='item') C. soup.select_all('.item') D. soup.get('div.item')

Question 16. An API returns response.status_code = 404. This means:

A. Server error B. Unauthorized C. Not Found D. Rate limited

Question 17. Which decorator does Streamlit use to cache data?

A. @st.cache B. @st.cache_data C. @st.memo D. @cache

Question 18. pd.qcut(df['income'], q=4) splits the data into:

A. 4 bins of equal width B. 4 bins of equal size (quantiles) C. 4 random groups D. 4 bins based on the mean

Question 19. Feature Selection: which kind of method does model.feature_importances_ belong to?

A. Filter B. Wrapper C. Embedded D. PCA

Question 20. To avoid data leakage, you should use:

A. A manual for loop B. sklearn.pipeline.Pipeline C. df.apply() D. Nothing, it does not matter

🔑 Answer Key

Question	Answer	Explanation
1	A	range(5)=[0,1,2,3,4], even=[0,2,4], squares=[0,4,16]
2	B	lambda with 2 parameters that returns their sum
3	B	Element-wise addition
4	B	Boolean indexing: `df[condition]`
5	B	GroupBy + single-column aggregation → Series
6	B	`how='left'` keeps all rows from the left DataFrame
7	C	Histogram/KDE shows a distribution
8	C	Plotly creates interactive HTML charts
9	B	`.sum()` over each column → Series with the NULL count per column
10	B	IQR method: outliers lie outside [Q1-1.5×IQR, Q3+1.5×IQR]
11	B	5 unique - 1 (drop_first) = 4 new columns
12	B	Z-score normalization: μ=0, σ=1
13	B	Using test information during training → misleading results
14	B	Fitting on test = learning the test distribution → data leakage
15	B	`find_all` + `class_=` parameter
16	C	404 = Resource Not Found
17	B	`@st.cache_data` (newer API, replaces `@st.cache`)
18	B	`qcut` = quantile cut (equal frequency)
19	C	Tree-based importance = embedded method
20	B	Pipeline fits on train and only transforms test automatically
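
Several of these answers can be checked by running the code. The following sketch verifies questions 1, 3, 10, 11 and 18 on small toy inputs; the sample values and variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Q1: list comprehension with a filter
squares = [x**2 for x in range(5) if x % 2 == 0]

# Q3: NumPy adds arrays element by element
total = np.array([1, 2, 3]) + np.array([10, 20, 30])

# Q10: IQR rule on a toy series (100 is the planted outlier)
s = pd.Series([1, 2, 3, 4, 5, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Q11: 5 unique values with drop_first=True
colors = pd.Series(['red', 'green', 'blue', 'black', 'white'], name='color')
dummies = pd.get_dummies(colors, drop_first=True)

# Q18: qcut produces equal-frequency bins
income = pd.Series(range(1, 101))
bin_counts = pd.qcut(income, q=4).value_counts()

print(squares)               # [0, 4, 16]
print(total)                 # [11 22 33]
print(outliers.tolist())     # [100]
print(dummies.shape[1])      # 4
print(bin_counts.tolist())   # [25, 25, 25, 25]
```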

Scoring: Each correct answer = 5 points. Pass ≥ 70/100 (≥ 14 correct answers).

🎉 Congratulations on Completing the Course!

Self-check questions

In a data analysis workflow, roughly what share of the time does Data Cleaning usually take, and why?
When building the Mini Project, what are the steps from reading the data to presenting the results?
When does Data Leakage occur, and how does a sklearn Pipeline help prevent it?
Which Python libraries did you learn in this course, and what is each one used for?

You have completed the Python for Data Science course! 🎓

You now have a solid grasp of:

Python fundamentals & data structures
NumPy & Pandas data manipulation
Data visualization (Matplotlib, Seaborn, Plotly)
Data Cleaning & Feature Engineering
Web Scraping, API, Streamlit Dashboard
Sklearn Pipeline & best practices

Next steps:

✅ Finish the Mini Project above
✅ Score ≥ 70/100 on the quiz
✅ Continue with Machine Learning Fundamentals on MinAI!
