📈 Correlation and Regression

Learning Objectives

After this lesson, you will be able to:

Understand and compute correlation
Distinguish between Pearson and Spearman correlation
Build a simple linear regression model
Evaluate a model with R-squared

1. Covariance

1.1 Formula

Population Covariance: $Cov(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$

Sample Covariance: $Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

1.2 Interpretation

| Cov(X,Y) | Meaning |
|----------|---------|
| > 0 | As X increases, Y tends to increase |
| < 0 | As X increases, Y tends to decrease |
| ≈ 0 | Little or no linear relationship |

1.3 Python Code

Python

import numpy as np

# Use arrays so that element-wise arithmetic is explicit
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])

# Manual calculation (sample covariance, divides by n - 1)
x_bar = np.mean(X)
y_bar = np.mean(Y)
cov_manual = np.sum((X - x_bar) * (Y - y_bar)) / (len(X) - 1)
print(f"Covariance (manual): {cov_manual:.4f}")

# Using numpy (np.cov also divides by n - 1 by default)
cov_matrix = np.cov(X, Y)
print(f"Covariance matrix:\n{cov_matrix}")
print(f"Cov(X,Y): {cov_matrix[0,1]:.4f}")
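
The magnitude of a covariance depends on the units of X and Y, which is one reason the next section standardizes it into a correlation coefficient. The sketch below illustrates this with made-up height and weight values (the data and variable names are assumptions for illustration only): converting heights from metres to centimetres multiplies the covariance by 100 even though the relationship itself is unchanged.

```python
import numpy as np

# Hypothetical data: heights in metres and weights in kilograms
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 66.0, 75.0, 85.0])

# The same heights expressed in centimetres
height_cm = height_m * 100

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

print(f"Cov(height in m, weight):  {cov_m:.4f}")
print(f"Cov(height in cm, weight): {cov_cm:.4f}")
print(f"Ratio: {cov_cm / cov_m:.1f}")
```

Only the sign of a covariance is directly interpretable; its size cannot be compared across variables measured on different scales.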

2. Correlation

2.1 Pearson Correlation Coefficient

$r = \frac{Cov(X, Y)}{\sigma_X \cdot \sigma_Y} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}$

2.2 Properties

-1 ≤ r ≤ 1
r = 1: Perfect positive linear correlation
r = -1: Perfect negative linear correlation
r = 0: No linear correlation (a non-linear relationship may still exist)
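
The properties above can be checked numerically. The following is a minimal sketch with made-up values: two exact straight lines give r at the ends of the range, while a symmetric parabola gives r close to 0 even though y is completely determined by x.

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)

# Exact linear relationships with positive and negative slope
r_pos = np.corrcoef(x, 2 * x + 5)[0, 1]
r_neg = np.corrcoef(x, -0.5 * x + 1)[0, 1]

# y depends entirely on x, but the relationship is not linear
r_quad = np.corrcoef(x, x ** 2)[0, 1]

print(f"y = 2x + 5    -> r = {r_pos:.4f}")
print(f"y = -0.5x + 1 -> r = {r_neg:.4f}")
print(f"y = x^2       -> r = {r_quad:.4f}")
```

This is why a value of r near 0 should be read as "no linear relationship" rather than "no relationship", and why a scatter plot is worth checking alongside the coefficient.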

2.3 Interpretation Guide

| $\lvert r \rvert$ | Strength |
|-----|--------|
| 0.00 - 0.19 | Very weak |
| 0.20 - 0.39 | Weak |
| 0.40 - 0.59 | Moderate |
| 0.60 - 0.79 | Strong |
| 0.80 - 1.00 | Very strong |

2.4 Python Code

Python

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 11])

# Pearson correlation and the p-value for H0: no linear correlation
r, p_value = stats.pearsonr(X, Y)
print(f"Pearson r: {r:.4f}")
print(f"P-value: {p_value:.4f}")

# Correlation matrix (works for any number of variables, one per column)
data = np.array([X, Y]).T
corr_matrix = np.corrcoef(data, rowvar=False)
print(f"\nCorrelation matrix:\n{corr_matrix}")

# Visualization
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X, Y, alpha=0.7)
plt.xlabel('X')
plt.ylabel('Y')
plt.title(f'Scatter Plot (r = {r:.3f})')

# Add the least squares regression line
z = np.polyfit(X, Y, 1)
p = np.poly1d(z)
plt.plot(X, p(X), "r--", alpha=0.8, label='Best fit line')
plt.legend()

# Different correlations
plt.subplot(1, 2, 2)
rng = np.random.default_rng(42)  # seeded so the plot is reproducible
correlations = [
    (np.arange(10), np.arange(10), 'Perfect +'),
    (np.arange(10), -np.arange(10), 'Perfect -'),
    (np.arange(10), rng.standard_normal(10), 'Random noise'),
]

for x, y, label in correlations:
    r_val = stats.pearsonr(x, y)[0]
    # Show the computed r in the legend instead of a hard-coded value
    plt.scatter(x, y, label=f'{label} (r = {r_val:.2f})', alpha=0.6)

plt.xlabel('X')
plt.ylabel('Y')
plt.title('Different Correlations')
plt.legend()
plt.tight_layout()
plt.show()

3. Spearman Rank Correlation

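3.1 When to Use It

| Pearson | Spearman |
|---------|----------|
| Measures linear relationships | Measures monotonic relationships |
| Continuous (interval or ratio) data | Ordinal data is fine |
| Sensitive to outliers | More robust to outliers |
| Significance test assumes approximate normality | No normality assumption |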
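
The first row of the comparison can be illustrated with a relationship that is monotonic but far from linear. This sketch uses an exponential curve as an assumed example: every increase in x comes with an increase in y, so the ranks agree perfectly, yet the points do not lie near a straight line.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)
y = np.exp(x)  # strictly increasing, strongly non-linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r:  {pearson_r:.4f}")
print(f"Spearman ρ: {spearman_rho:.4f}")
```

Spearman reports a perfect monotonic association here, while Pearson is noticeably lower because it only measures how well a straight line fits.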

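3.2 Formula

$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$

where $d_i$ is the difference between the rank of $x_i$ and the rank of $y_i$. This shortcut is exact only when there are no tied ranks; with ties, compute Pearson's r on the ranks instead.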
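
The formula can be applied step by step and compared with `scipy.stats.spearmanr`. The data below are made up and deliberately contain no tied values, so the shortcut formula and SciPy should agree; this is a sketch for illustration, not how SciPy computes the statistic internally.

```python
import numpy as np
from scipy import stats

# Hypothetical data with no tied values in either variable
x = np.array([10, 20, 30, 40, 50, 60])
y = np.array([1.2, 0.9, 2.5, 2.1, 3.8, 4.0])

# Convert the values to ranks (1 = smallest)
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)

# Rank differences for each pair
d = rank_x - rank_y
n = len(x)

rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
rho_scipy, _ = stats.spearmanr(x, y)

print(f"Rank differences d: {d}")
print(f"ρ (formula): {rho_formula:.4f}")
print(f"ρ (scipy):   {rho_scipy:.4f}")
```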

3.3 Python Code

Python

from scipy import stats
import numpy as np

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # the last value is an outlier
Y = [2, 4, 5, 4, 5, 7, 8, 9, 10, 11]

# Pearson (affected by the outlier)
pearson_r, _ = stats.pearsonr(X, Y)
print(f"Pearson r: {pearson_r:.4f}")

# Spearman (rank-based, so more robust to the outlier)
spearman_r, p_value = stats.spearmanr(X, Y)
print(f"Spearman ρ: {spearman_r:.4f}")
print(f"P-value: {p_value:.4f}")

# Kendall's tau (an alternative rank-based measure)
kendall_tau, _ = stats.kendalltau(X, Y)
print(f"Kendall τ: {kendall_tau:.4f}")

4. Correlation ≠ Causation

Important warning!

Correlation does NOT imply causation!

A correlation between X and Y does not mean that X causes Y.

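4.1 Possible Explanations

[Diagram: Correlation ≠ Causation. Starting point: X correlates with Y.]

An observed correlation between X and Y can arise because X causes Y, because Y causes X, because a third variable drives both (confounding), or by chance.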

4.2 Examples

Ice cream sales ↔ Drowning deaths (confounder: hot weather)
Shoe size ↔ Reading ability (confounder: age)
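
A small simulation can make the ice cream example concrete. Everything in this sketch is an assumption chosen for illustration (the coefficients, noise levels, and variable names are invented): temperature drives both quantities, and neither affects the other. The raw correlation is still strong, but it largely disappears once the linear effect of temperature is removed from both variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical confounder: daily temperature in °C
temperature = rng.normal(25, 5, n)

# Both outcomes depend on temperature but not on each other
ice_cream_sales = 20 * temperature + rng.normal(0, 30, n)
drownings = 0.3 * temperature + rng.normal(0, 1, n)

r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals_after(z, v):
    """Return what is left of v after removing a straight-line fit on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)


# Correlate the parts of each variable that temperature does not explain
r_adjusted = np.corrcoef(
    residuals_after(temperature, ice_cream_sales),
    residuals_after(temperature, drownings),
)[0, 1]

print(f"Raw correlation:                  {r_raw:.3f}")
print(f"After adjusting for temperature:  {r_adjusted:.3f}")
```

Adjusting for a measured confounder in this way only helps when the confounder is known and recorded; it does not by itself establish causation.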

5. Simple Linear Regression

5.1 Purpose

Predict Y from X using a straight line:

$\hat{y} = b_0 + b_1 x$

$b_0$ = intercept (the predicted value of y when x = 0)
$b_1$ = slope (the change in $\hat{y}$ for a one-unit increase in x)

5.2 Least Squares Method

Find $b_0, b_1$ that minimize the sum of squared errors:

$\min \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
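
To see what "minimize" means in practice, the sketch below fits a line with `np.polyfit` and then nudges each coefficient away from the fitted value. The data are the five points from section 1.3, and the helper `sse` is a name introduced here for illustration. Every nudged line should give a larger sum of squared errors than the fitted one.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)


def sse(b0, b1):
    """Sum of squared errors for the line y_hat = b0 + b1 * x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)


# Least squares fit (np.polyfit returns the slope first, then the intercept)
b1_hat, b0_hat = np.polyfit(x, y, 1)
best = sse(b0_hat, b1_hat)
print(f"Fitted line: b0={b0_hat:.2f}, b1={b1_hat:.2f} -> SSE={best:.4f}")

# Moving either coefficient away from the fit increases the SSE
perturbed = []
for db0, db1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    value = sse(b0_hat + db0, b1_hat + db1)
    perturbed.append(value)
    print(f"b0={b0_hat + db0:.2f}, b1={b1_hat + db1:.2f} -> SSE={value:.4f}")
```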

5.3 Formulas

$b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}$

$b_0 = \bar{y} - b_1 \bar{x}$

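5.4 Worked Example by Hand

Here $\bar{x} = 3$ and $\bar{y} = 4$.

| x | y | x − x̄ | y − ȳ | (x − x̄)(y − ȳ) | (x − x̄)² |
|---|---|-------|-------|-----------------|----------|
| 1 | 2 | -2 | -2 | 4 | 4 |
| 2 | 3 | -1 | -1 | 1 | 1 |
| 3 | 4 | 0 | 0 | 0 | 0 |
| 4 | 5 | 1 | 1 | 1 | 1 |
| 5 | 6 | 2 | 2 | 4 | 4 |
| Sum | | | | 10 | 10 |

$b_1 = \frac{10}{10} = 1$

$b_0 = \bar{y} - b_1 \bar{x} = 4 - 1 \times 3 = 1$

$\hat{y} = 1 + 1 \cdot x$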
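
The hand calculation can be reproduced directly from the formulas in section 5.3. This sketch uses the same five points and also checks the equivalent form $b_1 = r \cdot s_y / s_x$, using sample standard deviations (`ddof=1`).

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4, 5, 6], dtype=float)

# Deviations from the means (the third and fourth table columns)
x_dev = x - x.mean()
y_dev = y - y.mean()

# Slope and intercept from the least squares formulas
b1 = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
b0 = y.mean() - b1 * x.mean()

# Equivalent form of the slope: r * s_y / s_x
r = np.corrcoef(x, y)[0, 1]
b1_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"b1 from r * s_y / s_x = {b1_from_r:.4f}")
```

Both routes give a slope of 1 and an intercept of 1 here, because the points lie exactly on the line y = 1 + x.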

5.5 Python Code

Python

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([2.1, 4.2, 5.1, 4.8, 6.5, 7.2, 8.1, 9.5, 10.2, 11.3])

# Method 1: scipy.stats
# std_err is the standard error of the estimated slope
slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)

print("=== Linear Regression Results ===")
print(f"Slope (b1): {slope:.4f}")
print(f"Intercept (b0): {intercept:.4f}")
print(f"R-squared: {r_value**2:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Standard error of slope: {std_err:.4f}")
print(f"\nEquation: ŷ = {intercept:.2f} + {slope:.2f}x")

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X, Y, color='blue', label='Data points')
plt.plot(X, intercept + slope * X, color='red', label=f'Regression line: ŷ = {intercept:.2f} + {slope:.2f}x')
plt.xlabel('X')
plt.ylabel('Y')
plt.title(f'Simple Linear Regression (R² = {r_value**2:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Prediction (x = 12 lies outside the observed range 1-10, so this is an extrapolation)
x_new = 12
y_pred = intercept + slope * x_new
print(f"\nPrediction: x={x_new} → ŷ={y_pred:.2f}")

6. R-squared (Coefficient of Determination)

6.1 Formula

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$

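6.2 Interpretation

$R^2$ is the proportion of the variance in Y that is explained by X. In simple linear regression with an intercept, $R^2$ equals the square of Pearson's r.

| R² | Interpretation |
|----|----------------|
| 0.0 | The model explains none of the variance (no better than predicting the mean) |
| 0.5 | 50% of the variance is explained |
| 1.0 | The model fits the data perfectly |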
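
The two ends of the table can be reproduced with a few lines of NumPy. In this sketch, `r_squared` is a helper defined here for illustration and the data are the five points from section 1.3. Predicting the data exactly gives 1, always predicting the mean gives 0, and a fitted line falls in between and matches the squared Pearson correlation.

```python
import numpy as np


def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Reference points from the table
r2_perfect = r_squared(y, y)                       # predictions equal the data
r2_mean = r_squared(y, np.full_like(y, y.mean()))  # always predict the mean

# A least squares line and the squared Pearson correlation
slope, intercept = np.polyfit(x, y, 1)
r2_line = r_squared(y, intercept + slope * x)
r = np.corrcoef(x, y)[0, 1]

print(f"Perfect predictions: R² = {r2_perfect:.4f}")
print(f"Mean-only model:     R² = {r2_mean:.4f}")
print(f"Fitted line:         R² = {r2_line:.4f}, r² = {r ** 2:.4f}")
```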

6.3 Python Code

Python

import numpy as np
from sklearn.metrics import r2_score

# Actual and predicted values
y_actual = np.array([2.1, 4.2, 5.1, 4.8, 6.5])
y_pred = np.array([2.0, 4.0, 5.0, 5.0, 6.5])

# R-squared (note the argument order: actual values first, predictions second)
r2 = r2_score(y_actual, y_pred)
print(f"R-squared: {r2:.4f}")

# Manual calculation
ss_res = np.sum((y_actual - y_pred) ** 2)
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
r2_manual = 1 - (ss_res / ss_tot)
print(f"R-squared (manual): {r2_manual:.4f}")

7. Regression with scikit-learn

Python

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Data (scikit-learn expects X as a 2D array: one row per sample)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2.1, 4.2, 5.1, 4.8, 6.5, 7.2, 8.1, 9.5, 10.2, 11.3])

# Train model
model = LinearRegression()
model.fit(X, y)

# Parameters
print(f"Intercept (b0): {model.intercept_:.4f}")
print(f"Slope (b1): {model.coef_[0]:.4f}")

# Predictions
y_pred = model.predict(X)

# Metrics
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)

print(f"\nMSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

# Visualization with residuals
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Regression plot
axes[0].scatter(X, y, color='blue', label='Actual')
axes[0].plot(X, y_pred, color='red', label='Predicted')
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[0].set_title('Linear Regression')
axes[0].legend()

# Residual plot
residuals = y - y_pred
axes[1].scatter(y_pred, residuals)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')

plt.tight_layout()
plt.show()

8. Assumptions of Linear Regression

8.1 LINE Assumptions

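| Letter | Assumption | How to check |
|--------|------------|--------------|
| L (Linearity) | The relationship between X and Y is linear | Scatter plot |
| I (Independence) | Observations are independent | Study design |
| N (Normality) | Residuals are approximately normally distributed | Q-Q plot, Shapiro-Wilk test |
| E (Equal variance) | Homoscedasticity (constant residual variance) | Residual plot |

8.2 Checking the Assumptions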

Python

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Refit the model from section 7 so that this block runs on its own
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2.1, 4.2, 5.1, 4.8, 6.5, 7.2, 8.1, 9.5, 10.2, 11.3])
y_pred = LinearRegression().fit(X, y).predict(X)

# Residuals
residuals = y - y_pred

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Linearity check
axes[0].scatter(X, y)
axes[0].plot(X, y_pred, 'r-')
axes[0].set_title('Linearity Check')

# 2. Normality of residuals
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot (Normality)')

# 3. Homoscedasticity
axes[2].scatter(y_pred, residuals)
axes[2].axhline(y=0, color='r', linestyle='--')
axes[2].set_xlabel('Predicted')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Homoscedasticity Check')

plt.tight_layout()
plt.show()

# Shapiro-Wilk test for normality (H0: the residuals are normally distributed)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk test: stat={stat:.4f}, p-value={p_value:.4f}")

9. Correlation Matrix Heatmap

Python

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create sample data
np.random.seed(42)
n = 100
age = np.random.randint(20, 60, n)

# Income and Experience are built from Age plus noise so that they correlate with it;
# Score is generated independently
data = pd.DataFrame({
    'Age': age,
    'Income': age * 1500 + np.random.randn(n) * 5000,
    'Experience': age - 20 + np.random.randn(n) * 3,
    'Score': np.random.randint(50, 100, n),
})

# Correlation matrix (Pearson by default)
corr_matrix = data.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.show()

10. Practice Exercises

Exercise 1: Correlation

Data:

Hours studied: [2, 3, 5, 6, 8, 10]
Test score: [55, 60, 70, 75, 85, 90]

Compute the Pearson correlation
Interpret the result
Test for significance (α = 0.05)

Exercise 2: Linear Regression

Using the data above:

Fit a linear regression model
Write the regression equation
Predict the score for 7 hours of study
Compute R²

Exercise 3: Analysis

| Advertising ($) | Sales ($) |
|-----------------|-----------|
| 100 | 200 |
| 150 | 280 |
| 200 | 320 |
| 250 | 380 |
| 300 | 420 |

Should advertising spend be increased?
What is the expected return for each additional $1 of advertising?

Summary

| Concept | Formula | Use case |
|---------|---------|----------|
| Covariance | $\frac{\sum(x-\bar{x})(y-\bar{y})}{n-1}$ | Direction of relationship |
| Pearson r | $\frac{Cov(X,Y)}{s_x s_y}$ | Linear relationship |
| Spearman ρ | Rank correlation | Monotonic (possibly non-linear) relationships, outliers |
| Regression | $\hat{y} = b_0 + b_1 x$ | Prediction |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | Model fit |

Key Takeaways

Correlation measures the strength and direction of a relationship
Correlation ≠ Causation
Spearman is more robust to outliers than Pearson
R² gives the proportion of variance explained by the model
Check the LINE assumptions before relying on a regression