Time Series Analysis | MinAI Learning

🎯 Mục tiêu bài học

TB5 min

Sau bài học này, bạn sẽ:

Chuẩn bị và xử lý time series data (datetime operations, missing dates)
Thực hiện resampling, moving averages, rolling statistics
Phân tích trend và seasonality (decomposition, STL, seasonal indices)
Xây dựng forecasting models (Naive, SES, Holt-Winters)

📋 Thông tin bài học

Thông tin	Chi tiết
⏱️ Thời lượng	2.5 giờ
📖 Chủ đề chính	Time Series Decomposition, Forecasting
💡 Kiến thức cần có	Pandas, Matplotlib cơ bản
🎯 Output	Phân tích và forecast time series data thực tế

Task 0

📖 Thuật ngữ quan trọng

TB5 min

Thuật ngữ	Tiếng Việt	Mô tả
Time Series	Chuỗi thời gian	Dữ liệu có thứ tự theo thời gian (daily sales, monthly revenue)
Trend	Xu hướng	Hướng dài hạn — upward, downward, hoặc flat
Seasonality	Tính mùa vụ	Patterns lặp lại theo chu kỳ (daily, weekly, yearly)
Residual	Phần dư	Random noise sau khi loại bỏ trend và seasonality
Resampling	Lấy mẫu lại	Chuyển đổi tần suất (daily → monthly, monthly → daily)
Rolling Window	Cửa sổ trượt	Tính statistics trên N rows gần nhất (moving average)
EMA	Trung bình mũ	Exponential Moving Average — weighted trung bình gần đây hơn
Decomposition	Phân tách	Tách time series thành Trend + Seasonal + Residual
STL	Phân tách STL	Seasonal-Trend decomposition using LOESS — robust hơn
Holt-Winters	Mô hình HW	Triple Exponential Smoothing — forecast trend + seasonality

Checkpoint

Time series = Trend + Seasonality + Residual. Decomposition tách 3 components để hiểu pattern. Holt-Winters forecast cả trend và seasonality. Time series analysis là foundation của business analytics — sales forecasting, inventory management, demand planning đều dùng nó!

Task 1

📊 Time Series Data Preparation

TB5 min

Time Series trong Business

Time series analysis là foundation của business analytics — từ sales forecasting, inventory management đến demand planning. Master kỹ năng này để predict và plan hiệu quả.

Creating Time Series

Python

1import pandas as pd
2import numpy as np
3import matplotlib.pyplot as plt
4
5# Create datetime index
6dates = pd.date_range(start='2022-01-01', end='2024-12-31', freq='D')
7
8# Generate sample data: trend + seasonality + noise
9np.random.seed(42)
10n = len(dates)
11trend = np.linspace(100, 200, n)
12seasonality = 30 * np.sin(2 * np.pi * np.arange(n) / 365)
13noise = np.random.normal(0, 10, n)
14
15sales = trend + seasonality + noise
16
17df = pd.DataFrame({'date': dates, 'sales': sales})
18df.set_index('date', inplace=True)
19
20print(df.head())
21print(f"\nShape: {df.shape}")
22print(f"Date range: {df.index.min()} to {df.index.max()}")

Datetime Operations

Python

1# Date components
2df['year'] = df.index.year
3df['month'] = df.index.month
4df['quarter'] = df.index.quarter
5df['day_of_week'] = df.index.dayofweek
6df['week_of_year'] = df.index.isocalendar().week
7df['is_weekend'] = df.index.dayofweek >= 5
8df['is_month_start'] = df.index.is_month_start
9df['is_month_end'] = df.index.is_month_end
10
11# Parse dates from strings
12df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')
13
14# Set datetime index
15df = df.set_index('date').sort_index()

Handling Missing Dates

Python

1# Create complete date range
2full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
3
4# Reindex to fill missing dates
5df = df.reindex(full_range)
6
7# Fill missing values
8df['sales'] = df['sales'].fillna(method='ffill')       # Forward fill
9df['sales'] = df['sales'].interpolate(method='linear')  # Interpolate
10df['sales'] = df['sales'].fillna(df['sales'].mean())    # Mean
11
12# Check for gaps
13missing_dates = full_range.difference(df.index)
14print(f"Missing dates: {len(missing_dates)}")

Ví dụ

1┌─────────────────────────────────────────────────────────┐
2│            Time Series Decomposition                     │
3├─────────────────────────────────────────────────────────┤
4│  Original = Trend + Seasonality + Residual              │
5│                                                          │
6│  Trend:       Long-term direction (↗↘→)                 │
7│  Seasonality: Regular patterns (daily/weekly/yearly)    │
8│  Residual:    Random noise, unpredictable               │
9└─────────────────────────────────────────────────────────┘

Checkpoint

Time series cần datetime index. pd.date_range tạo complete date range. .reindex() + fillna()/interpolate() xử lý missing dates. Components: Trend + Seasonality + Residual. Missing dates là vấn đề phổ biến — bạn chọn forward fill hay interpolation tùy vào data type!

Task 2

📈 Visualization & Resampling

TB5 min

Basic Time Series Plots

Python

1fig, axes = plt.subplots(3, 1, figsize=(14, 10))
2
3# Line plot
4axes[0].plot(df.index, df['sales'], linewidth=0.8)
5axes[0].set_title('Daily Sales')
6
7# Monthly resampling
8monthly = df['sales'].resample('M').mean()
9axes[1].plot(monthly.index, monthly.values, marker='o')
10axes[1].set_title('Monthly Average Sales')
11
12# Year-over-year comparison
13for year in df.index.year.unique():
14    yearly = df[df.index.year == year]['sales']
15    yearly.index = yearly.index.dayofyear
16    axes[2].plot(yearly.index, yearly.values, label=str(year), alpha=0.7)
17axes[2].set_title('Year-over-Year Comparison')
18axes[2].legend()
19
20plt.tight_layout()
21plt.show()

Heatmap

Python

1heatmap_data = df.pivot_table(
2    values='sales',
3    index=df.index.year,
4    columns=df.index.month,
5    aggfunc='mean'
6)
7
8plt.figure(figsize=(12, 6))
9plt.imshow(heatmap_data, cmap='YlOrRd', aspect='auto')
10plt.colorbar(label='Average Sales')
11plt.xticks(range(12), ['Jan','Feb','Mar','Apr','May','Jun',
12                       'Jul','Aug','Sep','Oct','Nov','Dec'])
13plt.title('Sales Heatmap by Year and Month')
14plt.show()

Downsampling

Python

1# Daily → Weekly
2weekly = df['sales'].resample('W').agg(['mean', 'sum', 'min', 'max'])
3
4# Daily → Monthly
5monthly = df['sales'].resample('M').agg(['mean', 'std', 'min', 'max'])
6
7# Daily → Quarterly
8quarterly = df['sales'].resample('Q').sum()
9
10# Custom aggregation
11monthly_stats = df.resample('M').agg({
12    'sales': ['sum', 'mean', 'std']
13})

Upsampling

Python

1monthly_data = pd.DataFrame({
2    'date': pd.date_range('2024-01-01', periods=12, freq='M'),
3    'value': [100, 120, 115, 130, 145, 160, 155, 170, 165, 180, 190, 200]
4})
5monthly_data.set_index('date', inplace=True)
6
7# Forward fill
8daily = monthly_data.resample('D').ffill()
9
10# Linear interpolation
11daily_interp = monthly_data.resample('D').interpolate(method='linear')
12
13# Spline interpolation (smoother)
14daily_spline = monthly_data.resample('D').interpolate(method='spline', order=3)

Checkpoint

YoY comparison plots hiệu quả nhất cho seasonal data. .resample() chuyển đổi tần suất — downsampling dùng agg, upsampling dùng ffill/interpolate. Heatmap là visualization tuyệt vời để spot seasonal patterns — bạn thấy pattern nào trong data của mình?

Task 3

📉 Moving Statistics & Trend Analysis

TB5 min

Rolling Windows

Python

1# Simple moving average
2df['SMA_7'] = df['sales'].rolling(window=7).mean()
3df['SMA_30'] = df['sales'].rolling(window=30).mean()
4
5# Exponential moving average
6df['EMA_7'] = df['sales'].ewm(span=7, adjust=False).mean()
7df['EMA_30'] = df['sales'].ewm(span=30, adjust=False).mean()
8
9# Rolling statistics
10df['rolling_std'] = df['sales'].rolling(window=30).std()
11df['rolling_min'] = df['sales'].rolling(window=30).min()
12df['rolling_max'] = df['sales'].rolling(window=30).max()
13
14# Plot comparison
15fig, ax = plt.subplots(figsize=(14, 6))
16ax.plot(df.index, df['sales'], alpha=0.3, label='Actual')
17ax.plot(df.index, df['SMA_30'], label='30-day SMA')
18ax.plot(df.index, df['EMA_30'], label='30-day EMA')
19ax.legend()
20ax.set_title('Sales with Moving Averages')
21plt.show()

Expanding Windows

Python

1# Cumulative statistics
2df['cumsum'] = df['sales'].cumsum()
3df['cummean'] = df['sales'].expanding().mean()
4df['cummax'] = df['sales'].expanding().max()
5
6# Year-to-date
7df['ytd_sum'] = df.groupby(df.index.year)['sales'].cumsum()

Linear Trend

Python

1from scipy import stats
2
3x = np.arange(len(df))
4y = df['sales'].values
5
6slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
7df['trend'] = intercept + slope * x
8
9print(f"Slope: {slope:.4f} per day")
10print(f"R-squared: {r_value**2:.4f}")
11print(f"P-value: {p_value:.4e}")
12
13plt.figure(figsize=(12, 6))
14plt.plot(df.index, df['sales'], alpha=0.5, label='Actual')
15plt.plot(df.index, df['trend'], 'r-', linewidth=2, label='Trend')
16plt.legend()
17plt.title(f'Sales Trend (slope: {slope:.4f}/day)')
18plt.show()

Detrending

Python

1# Remove linear trend
2df['detrended'] = df['sales'] - df['trend']
3
4# Using differencing
5df['diff_1'] = df['sales'].diff(1)    # First difference
6df['diff_7'] = df['sales'].diff(7)    # Weekly difference
7
8# Percentage change
9df['pct_change'] = df['sales'].pct_change()
10df['pct_change_7'] = df['sales'].pct_change(periods=7)

Checkpoint

SMA smooth noise, EMA giữ gần recent values hơn. Rolling operations cho real-time statistics. Linear regression xác định trend slope. Differencing giúp detrend data. SMA vs EMA — bạn chọn cái nào? EMA react nhanh hơn với changes gần đây!

Task 4

🔬 Seasonality Analysis

TB5 min

Decomposition

Python

1from statsmodels.tsa.seasonal import seasonal_decompose
2
3# Additive: Original = Trend + Seasonal + Residual
4result_add = seasonal_decompose(df['sales'], model='additive', period=365)
5
6# Multiplicative: Original = Trend × Seasonal × Residual
7result_mult = seasonal_decompose(df['sales'], model='multiplicative', period=365)
8
9# Plot components
10fig = result_add.plot()
11fig.set_size_inches(14, 10)
12plt.tight_layout()
13plt.show()
14
15# Extract components
16trend = result_add.trend
17seasonal = result_add.seasonal
18residual = result_add.resid

STL Decomposition (More Robust)

Python

1from statsmodels.tsa.seasonal import STL
2
3stl = STL(df['sales'], period=365, robust=True)
4result = stl.fit()
5
6fig = result.plot()
7fig.set_size_inches(14, 10)
8plt.show()
9
10# Get seasonally adjusted data
11df['seasonally_adjusted'] = df['sales'] - result.seasonal

Seasonal Indices

Python

1def calculate_seasonal_indices(df, column, freq='month'):
2    """Calculate seasonal indices"""
3    if freq == 'month':
4        grouped = df.groupby(df.index.month)[column].mean()
5    elif freq == 'dayofweek':
6        grouped = df.groupby(df.index.dayofweek)[column].mean()
7    elif freq == 'quarter':
8        grouped = df.groupby(df.index.quarter)[column].mean()
9    
10    overall_mean = df[column].mean()
11    seasonal_index = grouped / overall_mean
12    return seasonal_index
13
14monthly_index = calculate_seasonal_indices(df, 'sales', 'month')
15print("Monthly Seasonal Indices:")
16print(monthly_index)
17
18# Visualize
19plt.figure(figsize=(10, 5))
20monthly_index.plot(kind='bar')
21plt.axhline(y=1, color='r', linestyle='--')
22plt.title('Monthly Seasonal Indices')
23plt.xlabel('Month')
24plt.ylabel('Index (1.0 = average)')
25plt.show()

Checkpoint

Additive decomposition khi seasonal amplitude cố định, multiplicative khi amplitude tăng theo trend. STL robust hơn với outliers. Seasonal indices > 1 = above average. Seasonal index = 1.2 nghĩa là tháng đó cao hơn trung bình 20%. Bạn có thể dùng nó để adjust forecasts!

Task 5

🔮 Forecasting

TB5 min

Simple Methods

Python

1# Naive forecast (last value)
2df['forecast_naive'] = df['sales'].shift(1)
3
4# Seasonal naive (same day last year)
5df['forecast_seasonal_naive'] = df['sales'].shift(365)
6
7# Moving average forecast
8df['forecast_ma'] = df['sales'].rolling(window=30).mean().shift(1)
9
10# Exponential smoothing forecast
11df['forecast_ewm'] = df['sales'].ewm(span=30).mean().shift(1)

Simple Exponential Smoothing (SES)

Python

1from statsmodels.tsa.holtwinters import SimpleExpSmoothing
2
3model = SimpleExpSmoothing(df['sales']).fit(smoothing_level=0.2)
4df['ses_fitted'] = model.fittedvalues
5forecast = model.forecast(30)  # Next 30 days
6print(f"Smoothing level (alpha): {model.params['smoothing_level']:.4f}")

Holt-Winters (Triple Exponential Smoothing)

Python

1from statsmodels.tsa.holtwinters import ExponentialSmoothing
2
3model = ExponentialSmoothing(
4    df['sales'],
5    trend='add',           # Additive trend
6    seasonal='add',        # Additive seasonality
7    seasonal_periods=365   # Yearly seasonality
8).fit()
9
10df['hw_fitted'] = model.fittedvalues
11forecast = model.forecast(90)  # Next 90 days
12
13# Plot
14fig, ax = plt.subplots(figsize=(14, 6))
15ax.plot(df.index[-365:], df['sales'][-365:], label='Actual')
16ax.plot(forecast.index, forecast.values, 'r--', label='Forecast')
17ax.legend()
18ax.set_title('Holt-Winters Forecast')
19plt.show()

Forecast Evaluation

Python

1from sklearn.metrics import mean_absolute_error, mean_squared_error
2
3def evaluate_forecast(actual, predicted, model_name='Model'):
4    """Calculate forecast metrics"""
5    mask = ~(np.isnan(actual) | np.isnan(predicted))
6    actual, predicted = actual[mask], predicted[mask]
7    
8    mae = mean_absolute_error(actual, predicted)
9    rmse = np.sqrt(mean_squared_error(actual, predicted))
10    mape = np.mean(np.abs((actual - predicted) / actual)) * 100
11    
12    print(f"\n{model_name} Performance:")
13    print(f"  MAE:  {mae:.2f}")
14    print(f"  RMSE: {rmse:.2f}")
15    print(f"  MAPE: {mape:.2f}%")
16    return {'mae': mae, 'rmse': rmse, 'mape': mape}
17
18# Compare methods
19evaluate_forecast(df['sales'], df['forecast_naive'], 'Naive')
20evaluate_forecast(df['sales'], df['forecast_ma'], 'Moving Average')
21evaluate_forecast(df['sales'], df['hw_fitted'], 'Holt-Winters')

Checkpoint

Naive/seasonal naive là baseline. SES cho no-trend data. Holt-Winters forecast cả trend + seasonality. Evaluate với MAE, RMSE, MAPE. MAPE < 10% = excellent forecast. Bạn nên so sánh model phức tạp với Naive baseline — nếu không better thì không đáng dùng!

Task 6

💾 Time Series SQL

TB5 min

Date Functions

SQL

1-- PostgreSQL date operations
2SELECT
3    order_date,
4    EXTRACT(YEAR FROM order_date) as year,
5    EXTRACT(MONTH FROM order_date) as month,
6    DATE_TRUNC('month', order_date) as month_start,
7    order_date - INTERVAL '7 days' as week_ago
8FROM orders;
9
10-- Generate date series
11SELECT generate_series(
12    '2024-01-01'::date,
13    '2024-12-31'::date,
14    '1 day'::interval
15)::date as date;

Time Series Aggregation

SQL

1-- Monthly aggregation
2SELECT
3    DATE_TRUNC('month', order_date) as month,
4    COUNT(*) as orders,
5    SUM(amount) as revenue
6FROM orders
7GROUP BY DATE_TRUNC('month', order_date)
8ORDER BY month;
9
10-- Moving average with window functions
11SELECT
12    order_date, amount,
13    AVG(amount) OVER (
14        ORDER BY order_date
15        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
16    ) as moving_avg_7d
17FROM orders;
18
19-- YoY comparison
20WITH monthly_sales AS (
21    SELECT DATE_TRUNC('month', order_date) as month,
22        SUM(amount) as revenue
23    FROM orders GROUP BY 1
24)
25SELECT month, revenue,
26    LAG(revenue, 12) OVER (ORDER BY month) as revenue_last_year,
27    ROUND((revenue - LAG(revenue, 12) OVER (ORDER BY month)) / 
28          LAG(revenue, 12) OVER (ORDER BY month) * 100, 2) as yoy_growth
29FROM monthly_sales;

Fill Missing Dates

SQL

1WITH date_series AS (
2    SELECT generate_series(
3        (SELECT MIN(order_date) FROM orders),
4        (SELECT MAX(order_date) FROM orders),
5        '1 day'::interval
6    )::date as date
7),
8daily_sales AS (
9    SELECT order_date::date as date, SUM(amount) as revenue
10    FROM orders GROUP BY 1
11)
12SELECT ds.date, COALESCE(d.revenue, 0) as revenue
13FROM date_series ds
14LEFT JOIN daily_sales d ON ds.date = d.date
15ORDER BY ds.date;

Checkpoint

DATE_TRUNC và EXTRACT cho time grouping. Window functions cho moving averages và YoY growth trong SQL. generate_series + LEFT JOIN cho fill missing dates. Kết hợp SQL time series aggregation với Python forecasting cho pipeline analytics hoàn chỉnh!

Task 7

📋 Tổng kết

TB5 min

Kiến thức đã học

Chủ đề	Nội dung chính
Data Preparation	datetime index, date components, missing dates handling
Visualization	Line plots, YoY comparison, heatmaps, seasonal plots
Resampling	resample() — downsampling (agg), upsampling (interpolate)
Moving Statistics	rolling(), ewm(), expanding(), cumsum
Trend Analysis	Linear regression, polynomial fit, detrending, differencing
Seasonality	seasonal_decompose(), STL, seasonal indices
Forecasting	Naive, SES, Holt-Winters, forecast evaluation
SQL Functions	DATE_TRUNC, EXTRACT, generate_series, window functions
Metrics	MAE, RMSE, MAPE

Key Takeaways

✅ Time Series = Trend + Seasonality + Residual
✅ STL decomposition robust hơn standard decomposition
✅ Holt-Winters forecast cả trend và seasonality
✅ Luôn so sánh model với Naive baseline
✅ MAPE là metric dễ hiểu nhất (% error)
✅ Kết hợp SQL aggregation + Python analysis cho pipeline hoàn chỉnh

Câu hỏi tự kiểm tra

Trend, Seasonality, Residual là gì?
STL decomposition khác seasonal_decompose() thế nào?
Holt-Winters forecast được những thành phần nào?
MAPE dễ hiểu hơn MAE/RMSE vì sao?

Bài tiếp theo: Cohort Analysis →

🎉 Tuyệt vời! Bạn đã nắm vững Time Series Analysis!

Nhớ: Time Series = Trend + Seasonality + Residual. Decompose trước, forecast sau — luôn so với Naive baseline!

Task 8