Correlation and Regression

🎯 Lesson Objectives

After this lesson, you will be able to:

✅ Understand and compute correlation

✅ Distinguish Pearson from Spearman correlation

✅ Build a simple linear regression

✅ Evaluate a model with R-squared

Duration: 1.5 hours | Difficulty: Intermediate | Prerequisite: Lesson 13

Task 0

📖 Key Terms Glossary

Term	Vietnamese	Description
Covariance	Hiệp phương sai	Measures how two variables vary together
Pearson r	Hệ số Pearson	Linear correlation, in the range [-1, 1]
Spearman ρ	Hệ số Spearman	Rank correlation, robust to outliers
Regression	Hồi quy	Predicts Y from X
R-squared	Hệ số xác định	Share of the variance in Y explained by the model
Slope	Hệ số góc	Change in predicted Y when X increases by 1 unit
Intercept	Hệ số chặn	Predicted value of Y when X = 0
Residual	Phần dư	Difference between the actual and the predicted value

Checkpoint

What does Pearson r = 0.85 mean? → A strong positive linear correlation between X and Y.

Task 1

Covariance

1.1 Formula

Population Covariance: $Cov(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$

Sample Covariance: $Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

1.2 Interpretation

Cov(X,Y)	Meaning
> 0	As X increases, Y tends to increase
< 0	As X increases, Y tends to decrease
≈ 0	No linear relationship (a non-linear one may still exist)

1.3 Python Code

Python

import numpy as np

# Arrays (not plain lists) so that element-wise arithmetic is explicit
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])

# Manual calculation of the sample covariance (divide by n - 1)
x_bar = np.mean(X)
y_bar = np.mean(Y)
cov_manual = np.sum((X - x_bar) * (Y - y_bar)) / (len(X) - 1)
print(f"Covariance (manual): {cov_manual:.4f}")

# Using NumPy: np.cov returns the 2x2 covariance matrix (sample version by default)
cov_matrix = np.cov(X, Y)
print(f"Covariance matrix:\n{cov_matrix}")
print(f"Cov(X,Y): {cov_matrix[0,1]:.4f}")
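One limitation of covariance is easier to see in numbers: its magnitude depends on the units of X and Y, so it is hard to judge the strength of a relationship from covariance alone. The sketch below reuses the small data set above and rescales X by a factor of 100, a factor chosen purely for illustration. The covariance changes with the units, while the correlation coefficient does not.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

# Same measurements expressed in a unit 100 times smaller (for example m -> cm)
X_scaled = X * 100

cov_original = np.cov(X, Y)[0, 1]
cov_scaled = np.cov(X_scaled, Y)[0, 1]

r_original = np.corrcoef(X, Y)[0, 1]
r_scaled = np.corrcoef(X_scaled, Y)[0, 1]

print(f"Cov(X, Y):    {cov_original:.4f}")
print(f"Cov(100X, Y): {cov_scaled:.4f}")
print(f"r(X, Y):      {r_original:.4f}")
print(f"r(100X, Y):   {r_scaled:.4f}")
```

This is the practical reason for standardizing covariance into a unit-free coefficient.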

Task 2

Correlation Coefficient

2.1 Pearson Correlation Coefficient

$r = \frac{Cov(X, Y)}{\sigma_X \cdot \sigma_Y} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}$

2.2 Properties

-1 ≤ r ≤ 1
r = 1: Perfect positive linear correlation
r = -1: Perfect negative linear correlation
r = 0: No linear correlation

2.3 Strength scale

A common rule of thumb (the exact thresholds vary by field):

|r|	Strength
0.00 - 0.19	Very weak
0.20 - 0.39	Weak
0.40 - 0.59	Moderate
0.60 - 0.79	Strong
0.80 - 1.00	Very strong
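The scale above can be turned into a small lookup. The following is a minimal sketch; `describe_strength` is a hypothetical helper name, and assigning values that fall between two bands (such as 0.195) to the lower band is an assumption made here, since the table does not say how to treat them.

```python
def describe_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal scale (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    magnitude = abs(r)  # strength ignores the sign; the sign gives the direction
    if magnitude < 0.20:
        return "Very weak"
    if magnitude < 0.40:
        return "Weak"
    if magnitude < 0.60:
        return "Moderate"
    if magnitude < 0.80:
        return "Strong"
    return "Very strong"


for r in (0.85, -0.92, 0.10, -0.45):
    direction = "positive" if r > 0 else "negative"
    print(f"r = {r:+.2f}: {describe_strength(r)} {direction} correlation")
```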

2.4 Python Code

Python

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 11])

# Pearson correlation and the p-value for H0: no linear correlation
r, p_value = stats.pearsonr(X, Y)
print(f"Pearson r: {r:.4f}")
print(f"P-value: {p_value:.4f}")

# Correlation matrix (the same call generalizes to more than two variables)
corr_matrix = np.corrcoef(X, Y)
print(f"\nCorrelation matrix:\n{corr_matrix}")

# Visualization
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X, Y, alpha=0.7)
plt.xlabel('X')
plt.ylabel('Y')
plt.title(f'Scatter Plot (r = {r:.3f})')

# Add the least-squares line
z = np.polyfit(X, Y, 1)
p = np.poly1d(z)
plt.plot(X, p(X), "r--", alpha=0.8, label='Best fit line')
plt.legend()

# Compare different correlations
plt.subplot(1, 2, 2)
rng = np.random.default_rng(42)  # fixed seed so the plot is reproducible
correlations = [
    (np.arange(10), np.arange(10), 'Perfect +'),
    (np.arange(10), -np.arange(10), 'Perfect -'),
    (np.arange(10), rng.standard_normal(10), 'Random noise'),
]

for x, y, label in correlations:
    # Show the actual r: with only 10 random points it is rarely exactly 0
    r_val = stats.pearsonr(x, y)[0]
    plt.scatter(x, y, label=f'{label} (r = {r_val:.2f})', alpha=0.6)

plt.xlabel('X')
plt.ylabel('Y')
plt.title('Different Correlations')
plt.legend()
plt.tight_layout()
plt.show()

Task 3

Spearman Rank Correlation

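3.1 When to use it?

Pearson	Spearman
Measures linear relationships	Measures monotonic relationships
Continuous data	Ordinal data also works
Sensitive to outliers	More robust to outliers
Significance test assumes normality	No normality assumption

3.2 Formula

$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$, and $n$ is the number of pairs. This shortcut formula is exact only when there are no tied ranks.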
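To make the formula concrete, here is a minimal sketch that applies it step by step and compares the result with `scipy.stats.spearmanr` and with Pearson's r computed on the ranks. The data are made up for illustration and deliberately contain no repeated values, because the shortcut formula assumes untied ranks.

```python
import numpy as np
from scipy import stats

# Illustrative data with no repeated values, so every rank is unique
x = np.array([10, 20, 30, 40, 50])
y = np.array([1.2, 0.9, 2.5, 3.1, 2.8])

rank_x = stats.rankdata(x)  # ranks 1..n, smallest value gets rank 1
rank_y = stats.rankdata(y)
d = rank_x - rank_y         # rank difference for each pair
n = len(x)

rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
rho_scipy, _ = stats.spearmanr(x, y)
rho_pearson_on_ranks, _ = stats.pearsonr(rank_x, rank_y)

print(f"Rank differences d: {d}")
print(f"rho (formula):          {rho_formula:.4f}")
print(f"rho (scipy.spearmanr):  {rho_scipy:.4f}")
print(f"Pearson r on the ranks: {rho_pearson_on_ranks:.4f}")
```

All three values agree here. When the data contain ties, computing Pearson's r on the ranks remains valid, while the shortcut formula is only approximate.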

3.3 Python Code

Python

from scipy import stats

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # contains an outlier
Y = [2, 4, 5, 4, 5, 7, 8, 9, 10, 11]

# Pearson (affected by the outlier)
pearson_r, _ = stats.pearsonr(X, Y)
print(f"Pearson r: {pearson_r:.4f}")

# Spearman (robust to the outlier, because it only uses ranks)
spearman_r, p_value = stats.spearmanr(X, Y)
print(f"Spearman ρ: {spearman_r:.4f}")
print(f"P-value: {p_value:.4f}")

# Kendall's tau (an alternative rank-based measure)
kendall_tau, _ = stats.kendalltau(X, Y)
print(f"Kendall τ: {kendall_tau:.4f}")

Checkpoint

Pearson r = -0.92. Describe the relationship. → A very strong negative linear correlation (as X increases, Y decreases along a nearly straight line).

Task 4

Correlation ≠ Causation

Important warning!

Correlation does NOT imply causation!

A correlation between X and Y does not mean that X causes Y.

4.1 Possibilities

[Diagram: Correlation ≠ Causation. 🔗 X correlates with Y]

4.2 Examples

Ice cream sales ↔ Drowning deaths (confounder: hot weather)
Shoe size ↔ Reading ability (confounder: age)
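The ice cream example can be imitated with a small simulation. This is only a sketch on synthetic data: the coefficients, noise levels, and the helper name `residuals_after` are all assumptions chosen for illustration, not real measurements. Temperature drives both variables, and neither variable affects the other, yet the two are strongly correlated until temperature is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (not real measurements): temperature drives both variables,
# and neither variable has any direct effect on the other.
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, n)
drownings = 0.5 * temperature + rng.normal(0, 1, n)

r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals_after(z, v):
    """Remove the linear effect of z from v and return what is left."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (intercept + slope * z)


# Correlation that remains once temperature has been regressed out of both
r_adjusted = np.corrcoef(
    residuals_after(temperature, ice_cream_sales),
    residuals_after(temperature, drownings),
)[0, 1]

print(f"Raw correlation:                  {r_raw:.3f}")
print(f"After controlling for temperature: {r_adjusted:.3f}")
```

The raw correlation is large even though the simulation contains no causal link between sales and drownings. Once the confounder is removed, the remaining correlation is close to zero.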

Task 5

Simple Linear Regression

5.1 Purpose

Predict Y from X with a straight line:

$\hat{y} = b_0 + b_1 x$

$b_0$ = intercept (the predicted value of Y when x = 0)
$b_1$ = slope (the change in the predicted Y for a one-unit increase in x)

5.2 Least Squares Method

Find $b_0, b_1$ that minimize the sum of squared errors:

$\min \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

5.3 Formulas

$b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}$

$b_0 = \bar{y} - b_1 \bar{x}$
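The second equality in the slope formula, $b_1 = r \cdot \frac{s_y}{s_x}$, may not be obvious at first sight. A short derivation follows, assuming the sample definitions $s_x^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$ and $s_y^2 = \frac{1}{n-1}\sum(y_i - \bar{y})^2$, which match the sample covariance from section 1.1. The idea is to multiply and divide by $\sqrt{\sum(y_i - \bar{y})^2}$:

```latex
\begin{aligned}
b_1 &= \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \\
    &= \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}
            {\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}
       \cdot
       \frac{\sqrt{\sum(y_i - \bar{y})^2}}{\sqrt{\sum(x_i - \bar{x})^2}} \\
    &= r \cdot \frac{s_y}{s_x}
\end{aligned}
```

The first factor is exactly the Pearson formula from section 2.1, and in the second factor the $\frac{1}{n-1}$ terms cancel, leaving $s_y / s_x$. One consequence is that the slope always has the same sign as $r$.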

5.4 Worked example by hand

x	y	x-x̄	y-ȳ	(x-x̄)(y-ȳ)	(x-x̄)²
1	2	-2	-2	4	4
2	3	-1	-1	1	1
3	4	0	0	0	0
4	5	1	1	1	1
5	6	2	2	4	4
Sum				10	10

With $\bar{x} = 3$ and $\bar{y} = 4$:

$b_1 = \frac{10}{10} = 1$

$b_0 = 4 - 1 \times 3 = 1$

$\hat{y} = 1 + 1 \cdot x$
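The hand calculation can be checked in a few lines. The sketch below applies the two formulas from section 5.3 directly to the five points in the table and, as a cross-check, compares the result with `np.polyfit`, which fits the same least-squares line.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4, 5, 6], dtype=float)

# Deviations from the means (the x - x̄ and y - ȳ columns of the table)
x_dev = x - x.mean()
y_dev = y - y.mean()

b1 = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                    # intercept

# Cross-check with NumPy's least-squares fit of a degree-1 polynomial
slope_np, intercept_np = np.polyfit(x, y, 1)

print(f"Sum of products: {np.sum(x_dev * y_dev):.0f}")
print(f"Sum of squares:  {np.sum(x_dev ** 2):.0f}")
print(f"By formula: ŷ = {b0:.2f} + {b1:.2f}x")
print(f"np.polyfit: ŷ = {intercept_np:.2f} + {slope_np:.2f}x")
```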

5.5 Python Code

Python

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([2.1, 4.2, 5.1, 4.8, 6.5, 7.2, 8.1, 9.5, 10.2, 11.3])

# Fit with scipy.stats.linregress (std_err is the standard error of the slope)
slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)

print("=== Linear Regression Results ===")
print(f"Slope (b1): {slope:.4f}")
print(f"Intercept (b0): {intercept:.4f}")
print(f"R-squared: {r_value**2:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Standard Error (slope): {std_err:.4f}")
print(f"\nEquation: ŷ = {intercept:.2f} + {slope:.2f}x")

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X, Y, color='blue', label='Data points')
plt.plot(X, intercept + slope * X, color='red', label=f'Regression line: ŷ = {intercept:.2f} + {slope:.2f}x')
plt.xlabel('X')
plt.ylabel('Y')
plt.title(f'Simple Linear Regression (R² = {r_value**2:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Prediction (x = 12 lies outside the observed range 1..10, so this is an
# extrapolation and should be treated with caution)
x_new = 12
y_pred = intercept + slope * x_new
print(f"\nPrediction: x={x_new} → ŷ={y_pred:.2f}")

Task 6

R-squared (Coefficient of Determination)

6.1 Formula

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$

6.2 Interpretation

$R^2$ is the proportion of the variance of Y that is explained by the model (here, by X).

R²	Interpretation
0.0	The model explains nothing (no better than always predicting $\bar{y}$)
0.5	50% of the variance is explained
1.0	Perfect fit

6.3 Python Code

Python

import numpy as np
from sklearn.metrics import r2_score

# Actual and predicted values
y_actual = np.array([2.1, 4.2, 5.1, 4.8, 6.5])
y_pred = np.array([2.0, 4.0, 5.0, 5.0, 6.5])

# R-squared with scikit-learn
r2 = r2_score(y_actual, y_pred)
print(f"R-squared: {r2:.4f}")

# Manual calculation: 1 - SS_res / SS_tot
ss_res = np.sum((y_actual - y_pred) ** 2)
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
r2_manual = 1 - (ss_res / ss_tot)
print(f"R-squared (manual): {r2_manual:.4f}")
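Two properties of R² are worth checking numerically. First, for a simple linear regression fitted by least squares with an intercept, R² equals the square of Pearson's r. Second, the formula $1 - SS_{res}/SS_{tot}$ can drop below 0 when the predictions are worse than always predicting the mean. The sketch below reuses the ten-point data set from the regression example; `r_squared` is a hypothetical helper name, and the constant prediction of 20 is an arbitrary bad guess chosen for illustration.

```python
import numpy as np
from scipy import stats

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([2.1, 4.2, 5.1, 4.8, 6.5, 7.2, 8.1, 9.5, 10.2, 11.3])


def r_squared(y_actual, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper)."""
    ss_res = np.sum((y_actual - y_pred) ** 2)
    ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
    return 1 - ss_res / ss_tot


# Least-squares fit, then R^2 from its residuals
fit = stats.linregress(X, Y)
y_fit = fit.intercept + fit.slope * X
r2_from_residuals = r_squared(Y, y_fit)

# Square of Pearson's r on the same data
r, _ = stats.pearsonr(X, Y)
r2_from_pearson = r ** 2

# Predictions that ignore the data: a constant far from the mean of Y
y_bad = np.full_like(Y, 20.0)
r2_bad = r_squared(Y, y_bad)

print(f"R² from residuals: {r2_from_residuals:.4f}")
print(f"Pearson r squared: {r2_from_pearson:.4f}")
print(f"R² of a bad constant prediction: {r2_bad:.4f}")
```

The equality with r² holds for the fitted least-squares line only. It does not carry over to arbitrary predictions or to models fitted without an intercept.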

Task 7

Regression with sklearn

Python

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Data (sklearn expects X as a 2D array: n_samples x n_features)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2.1, 4.2, 5.1, 4.8, 6.5, 7.2, 8.1, 9.5, 10.2, 11.3])

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Parameters
print(f"Intercept (b0): {model.intercept_:.4f}")
print(f"Slope (b1): {model.coef_[0]:.4f}")

# Predictions on the training data
y_pred = model.predict(X)

# Metrics
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)

print(f"\nMSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

# Visualization with residuals
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Regression plot
axes[0].scatter(X, y, color='blue', label='Actual')
axes[0].plot(X, y_pred, color='red', label='Predicted')
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[0].set_title('Linear Regression')
axes[0].legend()

# Residual plot (points should scatter randomly around 0)
residuals = y - y_pred
axes[1].scatter(y_pred, residuals)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')

plt.tight_layout()
plt.show()

Checkpoint

What does R² = 0.95 mean? → The model explains 95% of the variance of Y.

Task 8

Assumptions of Linear Regression

8.1 LINE Assumptions

Letter	Assumption	How to check
L	Linearity: the relationship between X and Y is linear	Scatter plot
I	Independence: observations are independent	Study design
N	Normality: residuals are approximately normal	Q-Q plot, Shapiro-Wilk test
E	Equal variance (homoscedasticity): residual spread is constant	Residual plot

8.2 Checking the Assumptions

Python

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Same data as in the regression examples; refit here so the block runs on its own
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.2, 5.1, 4.8, 6.5, 7.2, 8.1, 9.5, 10.2, 11.3])
fit = stats.linregress(X, y)
y_pred = fit.intercept + fit.slope * X

# Residuals
residuals = y - y_pred

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Linearity check
axes[0].scatter(X, y)
axes[0].plot(X, y_pred, 'r-')
axes[0].set_title('Linearity Check')

# 2. Normality of residuals
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot (Normality)')

# 3. Homoscedasticity
axes[2].scatter(y_pred, residuals)
axes[2].axhline(y=0, color='r', linestyle='--')
axes[2].set_xlabel('Predicted')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Homoscedasticity Check')

plt.tight_layout()
plt.show()

# Shapiro-Wilk test for normality (H0: residuals are normally distributed)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk test: stat={stat:.4f}, p-value={p_value:.4f}")

Task 9

Correlation Matrix Heatmap

Python

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'Age': np.random.randint(20, 60, 100),
    'Income': np.random.randint(30000, 100000, 100),
    'Experience': np.random.randint(0, 30, 100),
    'Score': np.random.randint(50, 100, 100)
})

# Overwrite Income and Experience so that they correlate with Age
data['Income'] = data['Age'] * 1500 + np.random.randn(100) * 5000
data['Experience'] = data['Age'] - 20 + np.random.randn(100) * 3

# Correlation matrix (Pearson by default)
corr_matrix = data.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.show()
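`DataFrame.corr` computes Pearson correlations by default, but it also accepts `method="spearman"`. The sketch below contrasts the two on a made-up relationship that is monotonic but not linear. The exponential link between `Age` and `Income` is purely a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
age = rng.integers(20, 60, 100)

df = pd.DataFrame({
    "Age": age,
    # Hypothetical monotonic but non-linear link: grows exponentially with age
    "Income": np.exp(age / 10) * 1000,
})

pearson = df.corr(method="pearson").loc["Age", "Income"]
spearman = df.corr(method="spearman").loc["Age", "Income"]

print(f"Pearson r:  {pearson:.3f}")
print(f"Spearman ρ: {spearman:.3f}")
```

Because `Income` always increases with `Age`, the ranks match perfectly and Spearman's ρ is 1, while Pearson's r stays below 1 because the points do not lie on a straight line.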

Task 10

Practice Exercises

Exercise 1: Correlation

Data:

Hours studied: [2, 3, 5, 6, 8, 10]
Test score: [55, 60, 70, 75, 85, 90]

Compute the Pearson correlation
Interpret the result
Test its significance (α = 0.05)

Exercise 2: Linear Regression

Using the data above:

Fit a linear regression model
Write the regression equation
Predict the score for 7 hours of study
Compute R²

Exercise 3: Analysis

Advertising ($)	Sales ($)
100	200
150	280
200	320
250	380
300	420

Should advertising be increased?
What is the expected ROI for each additional $1 of advertising?

Task 11

Summary

Concept	Formula	Use case
Covariance	$\frac{\sum(x-\bar{x})(y-\bar{y})}{n-1}$	Direction of a relationship
Pearson r	$\frac{Cov(X,Y)}{s_x s_y}$	Strength of a linear relationship
Spearman ρ	Rank correlation	Monotonic (possibly non-linear) relationships, outliers
Regression	$\hat{y} = b_0 + b_1 x$	Prediction
R²	$1 - \frac{SS_{res}}{SS_{tot}}$	Model fit

Self-check questions

Why does "correlation does not imply causation" hold? Give an illustrative example.
When should you use Spearman's ρ instead of Pearson's r?
What does R² = 0.85 mean in a regression model?
What do the LINE assumptions of linear regression consist of?

Key Takeaways

Correlation measures the strength and direction of a relationship
Correlation ≠ Causation
Spearman is more robust to outliers than Pearson
R² gives the proportion of variance explained
Check the LINE assumptions before trusting a regression

🎉 Well done! You have completed the Correlation & Regression lesson!

Next: take the comprehensive Quiz to test all of your statistics knowledge!

Task 12