📊 Percentiles, Quartiles, and Outliers

Learning objectives

After this lesson, you will be able to:

Understand and compute percentiles and quartiles
Understand the IQR and the five-number summary
Detect and handle outliers
Read and create box plots

1. Percentiles

1.1 Definition

The p-th percentile is the value at or below which p% of the data falls.

Notation: $P_p$ or the $p^{th}$ percentile

Percentile	Meaning
$P_{25}$	25% of the data ≤ this value
$P_{50}$	50% of the data ≤ this value (= median)
$P_{75}$	75% of the data ≤ this value
$P_{90}$	90% of the data ≤ this value

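1.2 Position formula

$L = \frac{p}{100} \times (n + 1)$

Where:

L = position (counting from 1) in the sorted data
p = the percentile to find
n = number of data points

If L is not a whole number, interpolate linearly between the two neighboring values. This (n + 1) rule is one of several common conventions; software may use a different one, so results can differ slightly on small samples.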

1.3 Worked example

Data (sorted): [15, 20, 35, 40, 50, 60, 75, 80, 90, 100]

(n = 10)

Find $P_{25}$: $L = \frac{25}{100} \times (10 + 1) = 2.75$

→ Between positions 2 and 3, so interpolate between 20 and 35

$P_{25} = 20 + 0.75 \times (35 - 20) = 20 + 11.25 = 31.25$

Find $P_{75}$: $L = \frac{75}{100} \times (10 + 1) = 8.25$

→ Between positions 8 and 9, so interpolate between 80 and 90

$P_{75} = 80 + 0.25 \times (90 - 80) = 80 + 2.5 = 82.5$
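The hand calculation above can be sketched in plain Python. The helper name `percentile_n_plus_1` is hypothetical, and clamping positions that fall outside the data to the first or last value is an assumption of this sketch, not something the formula itself specifies.

```python
import math


def percentile_n_plus_1(values, p):
    """Percentile via the L = p/100 * (n + 1) position rule (illustrative helper)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("values must not be empty")
    pos = p / 100 * (n + 1)          # 1-based position L
    if pos <= 1:                     # before the first value: clamp (assumption)
        return float(data[0])
    if pos >= n:                     # at or past the last value: clamp (assumption)
        return float(data[-1])
    lower = math.floor(pos)          # 1-based index of the lower neighbor
    frac = pos - lower               # fractional part used for interpolation
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


data = [15, 20, 35, 40, 50, 60, 75, 80, 90, 100]
p25 = percentile_n_plus_1(data, 25)
p75 = percentile_n_plus_1(data, 75)
print(p25, p75)
```

With the sample data this reproduces 31.25 and 82.5. In NumPy 1.22 or later, `np.percentile(data, 25, method="weibull")` should follow the same (n + 1) rule, whereas the default method does not.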

1.4 Python code

Python

import numpy as np

data = [15, 20, 35, 40, 50, 60, 75, 80, 90, 100]

# Note: NumPy's default method ("linear") places the p-th percentile at
# position 1 + (n - 1) * p / 100, not p / 100 * (n + 1) as in section 1.2.
# Here it gives P25 = 36.25 and P75 = 78.75 instead of 31.25 and 82.5.

# Commonly used percentiles
print(f"P25 (Q1): {np.percentile(data, 25)}")
print(f"P50 (Median): {np.percentile(data, 50)}")
print(f"P75 (Q3): {np.percentile(data, 75)}")
print(f"P90: {np.percentile(data, 90)}")

# Several percentiles at once
percentiles = np.percentile(data, [10, 25, 50, 75, 90])
print(f"Multiple percentiles: {percentiles}")

2. Quartiles

2.1 Definition

Quartiles are the three cut points (Q1, Q2, Q3) that split the sorted data into four parts, each holding about 25% of the observations:

Min → Q1 (25%) → Q2 (50%, median) → Q3 (75%) → Max

Quartile	Percentile	Meaning
Q1	$P_{25}$	25% of the data ≤ Q1
Q2	$P_{50}$	50% of the data ≤ Q2 (median)
Q3	$P_{75}$	75% of the data ≤ Q3

2.2 Interquartile Range (IQR)

$IQR = Q3 - Q1$

The IQR measures the width of the middle 50% of the data.
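Because the IQR depends only on the middle half of the data, a single extreme value barely moves it, while the standard deviation reacts strongly. The sketch below illustrates this with made-up data and Python's standard `statistics` module; the helper name `iqr` is hypothetical.

```python
import statistics


def iqr(values):
    """IQR from statistics.quantiles (its default 'exclusive' method)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


base = [10, 12, 14, 15, 16, 18, 20, 22]   # illustrative data
with_extreme = base + [100]               # same data plus one extreme value

iqr_base, iqr_extreme = iqr(base), iqr(with_extreme)
sd_base, sd_extreme = statistics.stdev(base), statistics.stdev(with_extreme)

print(f"IQR:   {iqr_base:.2f} -> {iqr_extreme:.2f}")
print(f"stdev: {sd_base:.2f} -> {sd_extreme:.2f}")
```

Here the IQR moves from 7 to 8, while the sample standard deviation grows several times over.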

2.3 Example

Data: [15, 20, 35, 40, 50, 60, 75, 80, 90, 100]

Using the (n + 1) position rule from section 1.2:

Q1 = 31.25
Q2 = 55 (median)
Q3 = 82.5
IQR = 82.5 - 31.25 = 51.25
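Quartile values depend on the position convention, which is why library output may not match the hand calculation. As a minimal sketch, the standard library's `statistics.quantiles` exposes two conventions: `"exclusive"` uses positions based on (n + 1), like the example above, and `"inclusive"` uses positions based on (n - 1), which to my understanding matches NumPy's default linear method.

```python
import statistics

data = [15, 20, 35, 40, 50, 60, 75, 80, 90, 100]

# 'exclusive': positions based on (n + 1), as in the hand calculation
q1_ex, q2_ex, q3_ex = statistics.quantiles(data, n=4, method="exclusive")

# 'inclusive': positions based on (n - 1)
q1_in, q2_in, q3_in = statistics.quantiles(data, n=4, method="inclusive")

print(f"exclusive: Q1={q1_ex}, Q2={q2_ex}, Q3={q3_ex}, IQR={q3_ex - q1_ex}")
print(f"inclusive: Q1={q1_in}, Q2={q2_in}, Q3={q3_in}, IQR={q3_in - q1_in}")
```

Both conventions agree on the median (55) but give different Q1, Q3, and IQR on this small sample; the gap shrinks as n grows.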

2.4 Python code

Python

import numpy as np
from scipy import stats

data = [15, 20, 35, 40, 50, 60, 75, 80, 90, 100]

# NumPy's default ("linear") method gives Q1 = 36.25, Q3 = 78.75, IQR = 42.5,
# which differs from the hand-computed (n + 1) values in section 2.3.
Q1 = np.percentile(data, 25)
Q2 = np.percentile(data, 50)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

print(f"Q1: {Q1}")
print(f"Q2 (Median): {Q2}")
print(f"Q3: {Q3}")
print(f"IQR: {IQR}")

# Alternatively, use SciPy
quartiles = stats.scoreatpercentile(data, [25, 50, 75])
iqr = stats.iqr(data)
print(f"Quartiles (scipy): {quartiles}")
print(f"IQR (scipy): {iqr}")

3. Five-Number Summary

3.1 Components

#	Component	Meaning
1	Minimum	Smallest value
2	Q1	First quartile
3	Median (Q2)	Middle value
4	Q3	Third quartile
5	Maximum	Largest value

3.2 Python code

Python

import numpy as np
import pandas as pd

data = [15, 20, 35, 40, 50, 60, 75, 80, 90, 100]

# Five-number summary
five_num = {
    'Min': np.min(data),
    'Q1': np.percentile(data, 25),
    'Median': np.median(data),
    'Q3': np.percentile(data, 75),
    'Max': np.max(data)
}

print("=== Five-Number Summary ===")
for key, value in five_num.items():
    print(f"{key}: {value}")

# Alternatively, use pandas; describe() also reports count, mean, and std
df = pd.DataFrame({'values': data})
print("\n=== Pandas describe() ===")
print(df.describe())

4. Box Plot

4.1 Structure

Text

             ┌───────┬───────┐
  ○   ├──────┤       │       ├──────┤   ○
             └───────┴───────┘
             Q1      Q2      Q3
                  (Median)

  ○ = outlier, ├──┤ = whiskers, box = IQR

Component	Meaning
Box	IQR (from Q1 to Q3)
Line inside the box	Median
Whiskers	Extend to the most extreme data points within 1.5 × IQR of Q1 and Q3
Points beyond the whiskers	Outliers
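The whisker ends are easy to confuse with the 1.5 × IQR cutoffs (often called fences). The sketch below separates the two; the function name `box_plot_stats`, the sample data, and the choice of the `"inclusive"` quartile method are assumptions made for illustration.

```python
import statistics


def box_plot_stats(values, k=1.5):
    """Fences, whisker ends, and outliers for a Tukey-style box plot (sketch)."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Points that lie inside the fences; the whiskers end at the extremes of these
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


sample = [15, 20, 35, 40, 50, 60, 75, 80, 90, 100, 200]  # illustrative data
result = box_plot_stats(sample)
print(result)
```

For this sample the fences are -33.75 and 156.25, but the whiskers stop at 15 and 100 because those are the most extreme observations inside the fences; 200 is drawn as a separate point.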

4.2 Python code

Python

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data with one outlier (200 lies above Q3 + 1.5 * IQR = 156.25)
data = [15, 20, 35, 40, 50, 60, 75, 80, 90, 100, 200]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Box plot with Matplotlib (vertical by default)
axes[0].boxplot(data)
axes[0].set_title('Box Plot (Matplotlib)')
axes[0].set_ylabel('Value')

# Box plot with Seaborn
sns.boxplot(y=data, ax=axes[1])
axes[1].set_title('Box Plot (Seaborn)')

plt.tight_layout()
plt.show()

# Details: fences mark the outlier cutoffs; whiskers stop at the most
# extreme data points that still lie inside the fences
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
arr = np.asarray(data)
lower_whisker = arr[arr >= lower_fence].min()
upper_whisker = arr[arr <= upper_fence].max()

print(f"Q1: {Q1}, Q3: {Q3}")
print(f"IQR: {IQR}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Whiskers: [{lower_whisker}, {upper_whisker}]")

5. Outliers

5.1 Definition

An outlier is an unusual value that lies far from the bulk of the data.

5.2 Detection methods

a) IQR method

$\text{Lower Bound} = Q1 - 1.5 \times IQR$

$\text{Upper Bound} = Q3 + 1.5 \times IQR$

Any value outside [Lower Bound, Upper Bound] is flagged as an outlier.

1.5 or 3?

Beyond 1.5 × IQR from the quartiles: mild outliers
Beyond 3 × IQR from the quartiles: extreme outliers
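The two thresholds can be combined into a single labeling rule. This is a minimal sketch: the function name `classify_by_iqr` and the quartile values passed to it are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def classify_by_iqr(x, q1, q3):
    """Label x as 'inlier', 'mild', or 'extreme' using 1.5 x IQR and 3 x IQR cutoffs."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild"
    return "inlier"


# Hypothetical quartiles: IQR = 8, so the mild cutoffs are 1 and 33
# and the extreme cutoffs are -11 and 45
q1, q3 = 13, 21
labels = {x: classify_by_iqr(x, q1, q3) for x in (20, 40, 100, -5)}
print(labels)
```

The extreme check runs first because every extreme outlier also lies beyond the 1.5 × IQR cutoffs.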

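b) Z-score method

$z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation

$|z| > 3 \rightarrow \text{Outlier}$

A value is flagged when it lies more than 3 standard deviations from the mean.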
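One caveat worth seeing in code: on small samples the rule $|z| > 3$ can never trigger, because an extreme value also inflates the mean and standard deviation it is measured against. With the population standard deviation, $|z|$ cannot exceed $\sqrt{n - 1}$. The sketch below uses illustrative data and a hypothetical helper `z_scores` built on the standard library.

```python
import math
import statistics


def z_scores(values):
    """Z-scores using the population standard deviation (ddof = 0)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]


data = [10, 12, 14, 15, 16, 18, 20, 22, 100]   # illustrative data, n = 9
zs = z_scores(data)
flagged = [x for x, z in zip(data, zs) if abs(z) > 3]
largest = max(abs(z) for z in zs)

print(f"largest |z| = {largest:.2f}, flagged = {flagged}")
print(f"upper limit for n = {len(data)}: {math.sqrt(len(data) - 1):.2f}")
```

Here 100 is clearly unusual, yet its z-score is about 2.80 and nothing is flagged. For samples this small, the IQR method or a lower z threshold may be more informative.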

5.3 IQR method example

Data: [10, 12, 14, 15, 16, 18, 20, 22, 100]

Text

Q1 = 13, Q3 = 21   (from the (n + 1) rule: positions 2.5 and 7.5)
IQR = 21 - 13 = 8

Lower Bound = 13 - 1.5 × 8 = 1
Upper Bound = 21 + 1.5 × 8 = 33

→ 100 > 33 → Outlier!

5.4 Python code

Python

import numpy as np
import pandas as pd
from scipy import stats

data = [10, 12, 14, 15, 16, 18, 20, 22, 100]

# Method 1: IQR
# NumPy's default method gives Q1 = 14, Q3 = 20 (bounds 5 and 29), slightly
# different from the hand-computed 13 and 21; 100 is flagged either way.
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = [x for x in data if x < lower_bound or x > upper_bound]
print(f"IQR Method - Outliers: {outliers_iqr}")

# Method 2: Z-score
# With only 9 points, |z| cannot exceed sqrt(9 - 1) ≈ 2.83 (population std),
# so this list comes back empty even though 100 is clearly unusual.
z_scores = stats.zscore(data)
outliers_zscore = [data[i] for i, z in enumerate(z_scores) if abs(z) > 3]
print(f"Z-score Method - Outliers: {outliers_zscore}")

# Method 3: IQR with pandas
df = pd.DataFrame({'values': data})
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df['values'] < Q1 - 1.5*IQR) | (df['values'] > Q3 + 1.5*IQR)]
print(f"Pandas - Outliers:\n{outliers}")

5.5 Handling outliers

Method	When to use
Removal	Data-entry errors or invalid values
Capping	Keep the observation but limit its value
Transformation	Log or square root, to reduce skewness
Keep as is	The outlier is meaningful in the real world

Python

import numpy as np

data = np.array([10, 12, 14, 15, 16, 18, 20, 22, 100])

# 1. Removal: keep only values inside the 1.5 * IQR bounds
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
mask = (data >= Q1 - 1.5*IQR) & (data <= Q3 + 1.5*IQR)
data_cleaned = data[mask]
print(f"After removal: {data_cleaned}")

# 2. Capping (winsorizing) at the 5th and 95th percentiles
lower = np.percentile(data, 5)
upper = np.percentile(data, 95)
data_capped = np.clip(data, lower, upper)
print(f"Capped: {data_capped}")

# 3. Transformation (log requires strictly positive values)
data_log = np.log(data)
print(f"Log transformed: {data_log}")

6. Combined visualization

Python

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate roughly normal data and append three extreme values
np.random.seed(42)
normal_data = np.random.normal(50, 10, 100)
data_with_outliers = np.concatenate([normal_data, [100, 110, 5]])

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Histogram
axes[0, 0].hist(data_with_outliers, bins=20, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(np.mean(data_with_outliers), color='red', linestyle='--', label='Mean')
axes[0, 0].axvline(np.median(data_with_outliers), color='green', linestyle='--', label='Median')
axes[0, 0].set_title('Histogram with Mean & Median')
axes[0, 0].legend()

# 2. Box plot (vertical by default)
axes[0, 1].boxplot(data_with_outliers)
axes[0, 1].set_title('Box Plot')

# 3. Violin plot
sns.violinplot(y=data_with_outliers, ax=axes[1, 0])
axes[1, 0].set_title('Violin Plot')

# 4. Strip plot overlaid on a box plot
sns.boxplot(y=data_with_outliers, ax=axes[1, 1], color='lightblue')
sns.stripplot(y=data_with_outliers, ax=axes[1, 1], color='red', alpha=0.5)
axes[1, 1].set_title('Box + Strip Plot')

plt.tight_layout()
plt.show()

7. Practice exercises

Exercise 1: Five-Number Summary

Given the exam scores: [45, 55, 60, 65, 70, 72, 75, 78, 80, 85, 90, 95]

Compute the five-number summary
Compute the IQR
Draw a box plot

Exercise 2: Detecting outliers

Given the salaries (million VND per month): [12, 15, 14, 16, 18, 15, 17, 16, 100, 14, 15]

Use the IQR method to find the outliers
Use the z-score method
How should the outlier be handled?

Exercise 3: Comparing two groups

Create a box plot comparing:

Group A: [50, 55, 60, 65, 70, 75, 80]
Group B: [40, 50, 60, 70, 80, 90, 100]

Which group has the larger IQR?

Summary

Concept	Formula/Definition	Meaning
Percentile	Value at or below which p% of the data falls	Relative position
Q1	$P_{25}$	25% of the data ≤ Q1
Q2	$P_{50}$ = Median	50% of the data ≤ Q2
Q3	$P_{75}$	75% of the data ≤ Q3
IQR	Q3 - Q1	Width of the middle 50%
Outlier (IQR)	< Q1 - 1.5 × IQR or > Q3 + 1.5 × IQR	Unusual value

Key Takeaways

Percentiles describe where a value sits relative to the rest of the data
Quartiles split the sorted data into four parts of about 25% each
The IQR is a robust measure of spread
A box plot visualizes the five-number summary
Outliers should be detected and then handled in a way that fits their cause