Sampling and Sampling Distributions

🎯 Learning Objectives

Avg. 5 min

After this lesson, you will be able to:

✅ Understand the main sampling methods

✅ Explain what a sampling distribution is

✅ Distinguish a parameter from a statistic

✅ Calculate the standard error

Duration: 1 hour | Difficulty: Intermediate | Prerequisite: Lesson 09

Task 0

📖 Key Terms

Avg. 5 min

Term	Vietnamese	Description
Population	Tổng thể	The entire set of subjects under study
Sample	Mẫu	The subset selected from the population
Parameter	Tham số	A characteristic of the population (μ, σ)
Statistic	Thống kê	A characteristic of the sample (x̄, s)
Sampling Bias	Sai lệch mẫu	The sample does not represent the population
Standard Error	Sai số chuẩn	Standard deviation of a statistic's sampling distribution; σ/√n for the mean
Stratified	Phân tầng	Split the population into groups, then sample from each group

Checkpoint

To halve the SE, increase n fourfold (because SE ∝ 1/√n).

Task 1

🌍 Population vs Sample

Avg. 5 min

1.1 Definitions

Concept	Population	Sample
Definition	The entire set of subjects under study	A subset selected from the population
Size notation	N	n
Example	All university students in Vietnam	1,000 students who were surveyed

1.2 Parameter vs Statistic

	Parameter	Statistic
Source	Population	Sample
Mean	μ (mu)	x̄ (x-bar)
Std	σ (sigma)	s
Proportion	p	p̂ (p-hat)
Behavior	Fixed (usually unknown)	Varies from sample to sample

Key Insight

A statistic is an estimate of a parameter. The goal of inferential statistics is to use a statistic to draw conclusions about the corresponding parameter.
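A minimal sketch of this distinction, assuming a synthetic population (the values, the seed, and names such as `population` are illustrative and not part of the lesson): the parameter μ is computed once from the whole population and stays fixed, while the statistic x̄ changes with every sample drawn.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Hypothetical population of 10,000 values (assumed for illustration)
population = rng.normal(loc=170, scale=8, size=10_000)
mu = population.mean()  # parameter: one fixed number for this population

# Draw five independent samples of size 50 and record each sample mean
sample_means = [
    rng.choice(population, size=50, replace=False).mean() for _ in range(5)
]

print(f"Parameter mu: {mu:.2f}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"Sample {i}: x_bar = {x_bar:.2f}")
```

Each printed x̄ should land near μ but differ from the others, which is the variability a sampling distribution describes.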

Task 2

🎲 Sampling Methods

Avg. 5 min

2.1 Probability Sampling


a) Simple Random Sampling

Every element has an equal chance of being selected
Use a random number generator to make the selection

Python

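import numpy as np

# Population of 1,000 people, identified by the IDs 1..1000
population = list(range(1, 1001))

# Draw 100 distinct IDs; replace=False means nobody is picked twice
sample = np.random.choice(population, size=100, replace=False)
print(f"Sample size: {len(sample)}")
print(f"First 10: {sample[:10]}")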

b) Stratified Sampling

Divide the population into strata (groups)
Draw a sample from each stratum

Python

import pandas as pd
import numpy as np

# Assume the population is roughly 60% male and 40% female
np.random.seed(42)
population = pd.DataFrame({
    'id': range(1000),
    'gender': np.random.choice(['M', 'F'], 1000, p=[0.6, 0.4])
})

# Stratified sampling: draw 10% from each gender group.
# DataFrameGroupBy.sample (pandas >= 1.1) keeps the 'gender' column intact.
sample = population.groupby('gender').sample(frac=0.1, random_state=42)

print(f"Sample size: {len(sample)}")
print(sample['gender'].value_counts())

c) Cluster Sampling

Divide the population into clusters
Randomly select some of the clusters
Include every element of the selected clusters
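The three steps of cluster sampling can be sketched as follows. This is an illustration under assumed data, not the lesson's own code: the `school` column, the 20 clusters, and the choice of 4 clusters are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population: 1,000 students spread over 20 schools (clusters)
population = pd.DataFrame({
    "id": range(1000),
    "school": rng.integers(0, 20, size=1000),
})

# Step 1: list the clusters, then randomly choose 4 of them
all_clusters = np.sort(population["school"].unique())
chosen = rng.choice(all_clusters, size=4, replace=False)

# Step 2: keep every element that belongs to a chosen cluster
cluster_sample = population[population["school"].isin(chosen)]

print(f"Chosen clusters: {sorted(chosen.tolist())}")
print(f"Sample size: {len(cluster_sample)}")
```

Unlike stratified sampling, the final sample size is not fixed in advance here: it depends on how large the selected clusters happen to be.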

d) Systematic Sampling

Select every k-th element after a random starting point
k = N/n (the sampling interval)

Python

import numpy as np

population = list(range(1, 1001))
n = 100
k = len(population) // n  # sampling interval: k = 10

# Pick a random starting index in [0, k), then take every k-th element
start = np.random.randint(0, k)
systematic_sample = population[start::k]
print(f"Sample size: {len(systematic_sample)}")

2.2 Non-Probability Sampling

Method	Description	Drawback
Convenience	Select whoever is easiest to reach	High bias
Purposive	Select deliberately, based on the researcher's judgment	Subjective
Snowball	Participants refer other participants	Not representative
Quota	Fill a quota for each group	Not random

Task 3

⚠️ Sampling Bias

Avg. 5 min

3.1 Types of Bias

Type	Cause	Example
Selection Bias	The sample is not chosen at random	An online survey misses people who do not use the internet
Non-response Bias	People who do not respond differ from those who do	Busy people do not take part
Survivorship Bias	Only the "survivors" are examined	Analyzing successful companies while ignoring the failed ones

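3.2 How to Reduce Bias

✅ Use random sampling
✅ Increase the sample size (this reduces random sampling error, but does not by itself remove systematic bias)
✅ Make sure the sample is representative
✅ Check the response rate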
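A small simulation can illustrate why random selection matters more than sheer sample size. The sketch below assumes a made-up population in which internet users score higher on average than non-users; all numbers and variable names are hypothetical. An online-only survey stays off target however large n becomes, while a simple random sample closes in on μ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: 70% internet users, 30% non-users,
# with different average scores (all numbers are assumed)
online = rng.normal(loc=60, scale=10, size=70_000)
offline = rng.normal(loc=40, scale=10, size=30_000)
population = np.concatenate([online, offline])
mu = population.mean()

biased_errors, random_errors = [], []
for n in (100, 1_000, 10_000):
    # Online-only survey: non-users can never be selected
    biased = rng.choice(online, size=n, replace=False).mean()
    # Simple random sample from the whole population
    fair = rng.choice(population, size=n, replace=False).mean()
    biased_errors.append(biased - mu)
    random_errors.append(fair - mu)
    print(f"n={n:>6}: biased error={biased - mu:+.2f}, "
          f"random error={fair - mu:+.2f}")
```

The error of the online-only survey should hover around the same positive value at every n, because it comes from who can be selected, not from how many are selected.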

Task 4

📊 Sampling Distribution

Avg. 5 min

4.1 Definition

A sampling distribution is the probability distribution of a statistic computed across many samples of the same size drawn from the same population.

Sampling Distribution (diagram)

🌍 Population → Sample 1 → x̄₁, Sample 2 → x̄₂, Sample 3 → x̄₃, …, Sample n → x̄ₙ → 📊 Distribution of x̄

4.2 Sampling Distribution of the Mean

If the population has mean μ and standard deviation σ, the sample mean x̄ of a sample of size n has:

Quantity	Value
Mean of x̄	μ
Std of x̄ (SE)	σ/√n
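Both values follow from linearity of expectation and, for the variance, from the additional assumption that the n observations $X_1, \dots, X_n$ are independent draws from the population (for example, sampling with replacement, or from a population much larger than n). A short derivation under that assumption:

```latex
\begin{aligned}
E(\bar{x}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
            = \frac{1}{n}\cdot n\mu = \mu, \\
\operatorname{Var}(\bar{x}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
            = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}, \\
SE &= \sqrt{\operatorname{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}.
\end{aligned}
```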

4.3 Code minh họa

Python

import numpy as np
import matplotlib.pyplot as plt

# Population that is deliberately not normal: uniform on [0, 100)
np.random.seed(42)
population = np.random.uniform(0, 100, 100000)
pop_mean = np.mean(population)
pop_std = np.std(population)

print(f"Population Mean: {pop_mean:.2f}")
print(f"Population Std: {pop_std:.2f}")

# For each sample size, draw 1,000 samples and record each sample mean
sample_sizes = [10, 30, 100]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, n in enumerate(sample_sizes):
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(1000)]

    axes[i].hist(sample_means, bins=30, density=True, alpha=0.7)
    axes[i].axvline(pop_mean, color='red', linestyle='--', label=f'μ = {pop_mean:.1f}')
    axes[i].set_title(f'n = {n}\nSE = {pop_std/np.sqrt(n):.2f}')
    axes[i].legend()

plt.suptitle('Sampling Distribution of the Mean')
plt.tight_layout()
plt.show()

Checkpoint

Population μ=50, σ=10, n=100. SE = ? E(x̄) = ?

Task 5

📏 Standard Error (SE)

Avg. 5 min

5.1 Formula

$SE = \frac{\sigma}{\sqrt{n}}$

When σ is unknown, estimate it with s (the sample standard deviation):

$SE = \frac{s}{\sqrt{n}}$

5.2 Interpretation

SE measures how precisely the sample mean estimates the population mean: the smaller the SE, the less x̄ varies from sample to sample.

n (increasing)	SE (decreasing)	Reduction in SE vs. n = 25
n = 25	SE = σ/5	(baseline)
n = 100	SE = σ/10	↓ 50%
n = 400	SE = σ/20	↓ 75%

The Square Root Law

To halve the SE, you need to increase n fourfold!
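The square root law is easy to confirm numerically. The sketch below assumes σ = 10 purely for illustration (`standard_error` is a helper name introduced here, not part of the lesson) and multiplies n by 4 twice.

```python
import math

sigma = 10.0  # assumed population standard deviation


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


base_n = 25
for factor in (1, 4, 16):
    n = base_n * factor
    print(f"n = {n:>3}: SE = {standard_error(sigma, n):.2f}")
```

With these inputs the SE goes from 2.00 to 1.00 to 0.50: each fourfold increase in n cuts it in half.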

5.3 Computing SE in Code

Python

import numpy as np

def calculate_se(data=None, sigma=None, n=None):
    """Compute the standard error of the mean.

    Pass `data` to estimate SE from a sample (s / sqrt(n)),
    or pass `sigma` and `n` when the population std is known.
    """
    if data is not None:
        # ddof=1 gives the sample standard deviation s
        return np.std(data, ddof=1) / np.sqrt(len(data))
    elif sigma is not None and n is not None:
        return sigma / np.sqrt(n)
    else:
        raise ValueError("Provide either data or both sigma and n")

# Example: SE estimated from a sample
sample = [45, 52, 48, 55, 50, 53, 47, 51, 49, 54]
se = calculate_se(data=sample)
print(f"SE from sample: {se:.4f}")

# Example: SE when the population std is known
se_known = calculate_se(sigma=10, n=100)
print(f"SE with known σ: {se_known:.4f}")

Task 6

📊 Sampling Distribution of Proportion

Avg. 5 min

6.1 Formula

If the population proportion is p, then for sufficiently large n the sample proportion is approximately normally distributed:

$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)$

6.2 Standard Error of Proportion

$SE_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

6.3 Conditions

The normal approximation is reasonable when both of the following hold:

np ≥ 10
n(1-p) ≥ 10
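These two conditions can be wrapped in a small check. The helper below is a sketch: `normal_approx_ok` and its `threshold` parameter are names introduced here for illustration, with the threshold of 10 taken from the conditions above.

```python
def normal_approx_ok(n: int, p: float, threshold: float = 10) -> bool:
    """Return True when both n*p and n*(1-p) reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


# 500 * 0.52 = 260 expected successes and 240 expected failures
print(normal_approx_ok(n=500, p=0.52))

# 50 * 0.05 = 2.5 expected successes, far below the threshold
print(normal_approx_ok(n=50, p=0.05))
```

The first call prints `True` and the second prints `False`, so for a rare outcome (small p) a much larger n is needed before the normal model applies.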

6.4 Example

Suppose 52% of the population supports candidate A, and 500 people are surveyed:

Python

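import numpy as np
from scipy import stats

p = 0.52  # population proportion supporting candidate A
n = 500   # number of people surveyed

# Standard error of the sample proportion
SE = np.sqrt(p * (1-p) / n)
print(f"SE = {SE:.4f}")

# P(p̂ > 0.55), using the normal approximation
z = (0.55 - p) / SE
prob = 1 - stats.norm.cdf(z)
print(f"P(p̂ > 0.55) = {prob:.4f}")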
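As a sanity check on the normal approximation, the same probability can be computed from the exact binomial distribution, assuming the 500 responses are independent. This comparison is an addition for illustration, not part of the original example; note that p̂ > 0.55 with n = 500 means more than 275 supporters.

```python
import numpy as np
from scipy import stats

p, n = 0.52, 500
se = np.sqrt(p * (1 - p) / n)

# Normal approximation, as in the example above
approx = stats.norm.sf((0.55 - p) / se)

# Exact: P(X > 275) where X ~ Binomial(500, 0.52)
exact = stats.binom.sf(275, n, p)

print(f"Normal approximation: {approx:.4f}")
print(f"Exact binomial:       {exact:.4f}")
```

The two values should be close but not identical. The gap comes mainly from approximating a discrete count with a continuous curve.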

Task 7

🔢 Sample Size Determination

Avg. 5 min

7.1 For a Mean

$n = \left(\frac{z \cdot \sigma}{E}\right)^2$

Where:

z = z-score for the chosen confidence level
σ = population standard deviation (or an estimate of it)
E = desired margin of error

7.2 For a Proportion

$n = \frac{z^2 \cdot p(1-p)}{E^2}$

If p is unknown, use p = 0.5 (the worst case, because p(1-p) is largest there)
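To see why p = 0.5 is the conservative choice, the sketch below tabulates p(1-p) over a grid of p values (the grid itself is an arbitrary choice made for illustration). Since n is proportional to p(1-p), the p that maximizes this term also maximizes the required sample size.

```python
import numpy as np

p_values = np.linspace(0.1, 0.9, 9)  # 0.1, 0.2, ..., 0.9
variance_terms = p_values * (1 - p_values)

for p, v in zip(p_values, variance_terms):
    print(f"p = {p:.1f}: p(1-p) = {v:.2f}")

# The grid point with the largest p(1-p)
worst_p = p_values[np.argmax(variance_terms)]
print(f"Largest p(1-p) at p = {worst_p:.1f}")
```

The term peaks at 0.25 when p = 0.5 and falls off symmetrically on either side, so planning with p = 0.5 never underestimates n.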

7.3 Computing Sample Size in Code

Python

from scipy import stats
import numpy as np

def sample_size_mean(sigma, margin_error, confidence=0.95):
    """Sample size needed to estimate a mean within margin_error."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = stats.norm.ppf((1 + confidence) / 2)
    n = (z * sigma / margin_error) ** 2
    return int(np.ceil(n))  # round up to a whole number of observations

def sample_size_proportion(margin_error, confidence=0.95, p=0.5):
    """Sample size needed to estimate a proportion within margin_error."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = z**2 * p * (1-p) / margin_error**2
    return int(np.ceil(n))

# Example: mean
# σ = 15, margin of error = 3, 95% confidence
n_mean = sample_size_mean(sigma=15, margin_error=3, confidence=0.95)
print(f"Sample size for mean: {n_mean}")

# Example: proportion
# Margin of error = 3%, 95% confidence
n_prop = sample_size_proportion(margin_error=0.03, confidence=0.95)
print(f"Sample size for proportion: {n_prop}")

Task 8

🧩 Practice Exercises

Avg. 5 min

Exercise 1: Standard Error

Population height: μ = 170 cm, σ = 8 cm.

SE with n = 25?
SE with n = 100?
What n is needed for SE ≤ 1 cm?

Exercise 2: Sampling Distribution

Processing time: μ = 10 minutes, σ = 3 minutes. For a sample of n = 36:

E(x̄) = ?
SE = ?
P(x̄ > 11)?

Exercise 3: Sample Size

A survey of the satisfaction rate:

Margin of error: 2%
Confidence: 95%
How many people need to be surveyed?
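One way to check your answers after attempting the exercises is the standard-library sketch below. It makes two assumptions that the exercises leave implicit: in Exercise 2, x̄ is treated as normally distributed (by the CLT, since n = 36), and in Exercise 3, p = 0.5 is used because no prior estimate of the proportion is given. Variable names are illustrative.

```python
import math
from statistics import NormalDist

# Exercise 1: sigma = 8 cm
se_25 = 8 / math.sqrt(25)
se_100 = 8 / math.sqrt(100)
n_needed = math.ceil((8 / 1) ** 2)  # smallest n with 8 / sqrt(n) <= 1

# Exercise 2: mu = 10, sigma = 3, n = 36
expected_mean = 10  # E(x_bar) equals the population mean
se_time = 3 / math.sqrt(36)
# Assumes x_bar is approximately Normal(mu, SE)
p_over_11 = 1 - NormalDist(mu=10, sigma=se_time).cdf(11)

# Exercise 3: E = 0.02, 95% confidence, p assumed to be 0.5
z = NormalDist().inv_cdf(0.975)
n_survey = math.ceil(z**2 * 0.5 * 0.5 / 0.02**2)

print(f"Ex 1: SE(25) = {se_25}, SE(100) = {se_100}, n needed = {n_needed}")
print(f"Ex 2: E(x_bar) = {expected_mean}, SE = {se_time}, P(x_bar > 11) = {p_over_11:.4f}")
print(f"Ex 3: n = {n_survey}")
```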

Task 9

📝 Summary

Avg. 5 min

Concept	Formula	Meaning
SE (mean)	σ/√n	Precision of x̄
SE (proportion)	√(p(1-p)/n)	Precision of p̂
Sample size (mean)	(zσ/E)²	Required n
Sample size (prop)	z²p(1-p)/E²	Required n

Self-Check Questions

Why is random sampling important for reducing bias?
According to the Central Limit Theorem, what properties does the sampling distribution of the mean have?
To halve the standard error, by what factor must the sample size increase?
Which factors affect the required sample size?

Key Takeaways

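Random sampling reduces bias
The sampling distribution of the mean approaches a normal distribution as n grows (CLT)
SE decreases as n increases (in proportion to 1/√n)
The required sample size depends on the margin of error, the confidence level, and the variability (σ or p)
Doubling the precision requires four times the sample size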

🎉 Great work! You have completed the lesson on Sampling!

Next: We will study Confidence Intervals, a way to estimate a population parameter with a stated level of confidence.

Task 10