🎯 Mục tiêu bài học
- Chuẩn bị và xử lý time series data (datetime operations, missing dates)
- Thực hiện resampling, moving averages, rolling statistics
- Phân tích trend và seasonality (decomposition, STL, seasonal indices)
- Xây dựng forecasting models (Naive, SES, Holt-Winters)
| Thông tin | Chi tiết |
|---|---|
| ⏱️ Thời lượng | 2.5 giờ |
| 📖 Chủ đề chính | Time Series Decomposition, Forecasting |
| 💡 Kiến thức cần có | Pandas, Matplotlib cơ bản |
| 🎯 Output | Phân tích và forecast time series data thực tế |
📖 Thuật ngữ quan trọng
| Thuật ngữ | Tiếng Việt | Mô tả |
|---|---|---|
| Time Series | Chuỗi thời gian | Dữ liệu có thứ tự theo thời gian (daily sales, monthly revenue) |
| Trend | Xu hướng | Hướng dài hạn — upward, downward, hoặc flat |
| Seasonality | Tính mùa vụ | Patterns lặp lại theo chu kỳ (daily, weekly, yearly) |
| Residual | Phần dư | Random noise sau khi loại bỏ trend và seasonality |
| Resampling | Lấy mẫu lại | Chuyển đổi tần suất (daily → monthly, monthly → daily) |
| Rolling Window | Cửa sổ trượt | Tính statistics trên N rows gần nhất (moving average) |
| EMA | Trung bình mũ | Exponential Moving Average — weighted trung bình gần đây hơn |
| Decomposition | Phân tách | Tách time series thành Trend + Seasonal + Residual |
| STL | Phân tách STL | Seasonal-Trend decomposition using LOESS — robust hơn |
| Holt-Winters | Mô hình HW | Triple Exponential Smoothing — forecast trend + seasonality |
Checkpoint
Time series = Trend + Seasonality + Residual. Decomposition tách 3 components để hiểu pattern. Holt-Winters forecast cả trend và seasonality. Time series analysis là foundation của business analytics — sales forecasting, inventory management, demand planning đều dùng nó!
📊 Time Series Data Preparation
Time series analysis là foundation của business analytics — từ sales forecasting, inventory management đến demand planning. Master kỹ năng này để predict và plan hiệu quả.
Creating Time Series
1import pandas as pd2import numpy as np3import matplotlib.pyplot as plt45# Create datetime index6dates = pd.date_range(start='2022-01-01', end='2024-12-31', freq='D')78# Generate sample data: trend + seasonality + noise9np.random.seed(42)10n = len(dates)11trend = np.linspace(100, 200, n)12seasonality = 30 * np.sin(2 * np.pi * np.arange(n) / 365)13noise = np.random.normal(0, 10, n)1415sales = trend + seasonality + noise1617df = pd.DataFrame({'date': dates, 'sales': sales})18df.set_index('date', inplace=True)1920print(df.head())21print(f"\nShape: {df.shape}")22print(f"Date range: {df.index.min()} to {df.index.max()}")Datetime Operations
1# Date components2df['year'] = df.index.year3df['month'] = df.index.month4df['quarter'] = df.index.quarter5df['day_of_week'] = df.index.dayofweek6df['week_of_year'] = df.index.isocalendar().week7df['is_weekend'] = df.index.dayofweek >= 58df['is_month_start'] = df.index.is_month_start9df['is_month_end'] = df.index.is_month_end1011# Parse dates from strings12df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')1314# Set datetime index15df = df.set_index('date').sort_index()Handling Missing Dates
1# Create complete date range2full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')34# Reindex to fill missing dates5df = df.reindex(full_range)67# Fill missing values8df['sales'] = df['sales'].fillna(method='ffill') # Forward fill9df['sales'] = df['sales'].interpolate(method='linear') # Interpolate10df['sales'] = df['sales'].fillna(df['sales'].mean()) # Mean1112# Check for gaps13missing_dates = full_range.difference(df.index)14print(f"Missing dates: {len(missing_dates)}")1┌─────────────────────────────────────────────────────────┐2│ Time Series Decomposition │3├─────────────────────────────────────────────────────────┤4│ Original = Trend + Seasonality + Residual │5│ │6│ Trend: Long-term direction (↗↘→) │7│ Seasonality: Regular patterns (daily/weekly/yearly) │8│ Residual: Random noise, unpredictable │9└─────────────────────────────────────────────────────────┘Checkpoint
Time series cần datetime index. pd.date_range tạo complete date range. .reindex() + fillna()/interpolate() xử lý missing dates. Components: Trend + Seasonality + Residual. Missing dates là vấn đề phổ biến — bạn chọn forward fill hay interpolation tùy vào data type!
📈 Visualization & Resampling
Basic Time Series Plots
1fig, axes = plt.subplots(3, 1, figsize=(14, 10))23# Line plot4axes[0].plot(df.index, df['sales'], linewidth=0.8)5axes[0].set_title('Daily Sales')67# Monthly resampling8monthly = df['sales'].resample('M').mean()9axes[1].plot(monthly.index, monthly.values, marker='o')10axes[1].set_title('Monthly Average Sales')1112# Year-over-year comparison13for year in df.index.year.unique():14 yearly = df[df.index.year == year]['sales']15 yearly.index = yearly.index.dayofyear16 axes[2].plot(yearly.index, yearly.values, label=str(year), alpha=0.7)17axes[2].set_title('Year-over-Year Comparison')18axes[2].legend()1920plt.tight_layout()21plt.show()Heatmap
1heatmap_data = df.pivot_table(2 values='sales',3 index=df.index.year,4 columns=df.index.month,5 aggfunc='mean'6)78plt.figure(figsize=(12, 6))9plt.imshow(heatmap_data, cmap='YlOrRd', aspect='auto')10plt.colorbar(label='Average Sales')11plt.xticks(range(12), ['Jan','Feb','Mar','Apr','May','Jun',12 'Jul','Aug','Sep','Oct','Nov','Dec'])13plt.title('Sales Heatmap by Year and Month')14plt.show()Downsampling
1# Daily → Weekly2weekly = df['sales'].resample('W').agg(['mean', 'sum', 'min', 'max'])34# Daily → Monthly5monthly = df['sales'].resample('M').agg(['mean', 'std', 'min', 'max'])67# Daily → Quarterly8quarterly = df['sales'].resample('Q').sum()910# Custom aggregation11monthly_stats = df.resample('M').agg({12 'sales': ['sum', 'mean', 'std']13})Upsampling
1monthly_data = pd.DataFrame({2 'date': pd.date_range('2024-01-01', periods=12, freq='M'),3 'value': [100, 120, 115, 130, 145, 160, 155, 170, 165, 180, 190, 200]4})5monthly_data.set_index('date', inplace=True)67# Forward fill8daily = monthly_data.resample('D').ffill()910# Linear interpolation11daily_interp = monthly_data.resample('D').interpolate(method='linear')1213# Spline interpolation (smoother)14daily_spline = monthly_data.resample('D').interpolate(method='spline', order=3)Checkpoint
YoY comparison plots hiệu quả nhất cho seasonal data. .resample() chuyển đổi tần suất — downsampling dùng agg, upsampling dùng ffill/interpolate. Heatmap là visualization tuyệt vời để spot seasonal patterns — bạn thấy pattern nào trong data của mình?
📉 Moving Statistics & Trend Analysis
Rolling Windows
1# Simple moving average2df['SMA_7'] = df['sales'].rolling(window=7).mean()3df['SMA_30'] = df['sales'].rolling(window=30).mean()45# Exponential moving average6df['EMA_7'] = df['sales'].ewm(span=7, adjust=False).mean()7df['EMA_30'] = df['sales'].ewm(span=30, adjust=False).mean()89# Rolling statistics10df['rolling_std'] = df['sales'].rolling(window=30).std()11df['rolling_min'] = df['sales'].rolling(window=30).min()12df['rolling_max'] = df['sales'].rolling(window=30).max()1314# Plot comparison15fig, ax = plt.subplots(figsize=(14, 6))16ax.plot(df.index, df['sales'], alpha=0.3, label='Actual')17ax.plot(df.index, df['SMA_30'], label='30-day SMA')18ax.plot(df.index, df['EMA_30'], label='30-day EMA')19ax.legend()20ax.set_title('Sales with Moving Averages')21plt.show()Expanding Windows
1# Cumulative statistics2df['cumsum'] = df['sales'].cumsum()3df['cummean'] = df['sales'].expanding().mean()4df['cummax'] = df['sales'].expanding().max()56# Year-to-date7df['ytd_sum'] = df.groupby(df.index.year)['sales'].cumsum()Linear Trend
1from scipy import stats23x = np.arange(len(df))4y = df['sales'].values56slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)7df['trend'] = intercept + slope * x89print(f"Slope: {slope:.4f} per day")10print(f"R-squared: {r_value**2:.4f}")11print(f"P-value: {p_value:.4e}")1213plt.figure(figsize=(12, 6))14plt.plot(df.index, df['sales'], alpha=0.5, label='Actual')15plt.plot(df.index, df['trend'], 'r-', linewidth=2, label='Trend')16plt.legend()17plt.title(f'Sales Trend (slope: {slope:.4f}/day)')18plt.show()Detrending
1# Remove linear trend2df['detrended'] = df['sales'] - df['trend']34# Using differencing5df['diff_1'] = df['sales'].diff(1) # First difference6df['diff_7'] = df['sales'].diff(7) # Weekly difference78# Percentage change9df['pct_change'] = df['sales'].pct_change()10df['pct_change_7'] = df['sales'].pct_change(periods=7)Checkpoint
SMA smooth noise, EMA giữ gần recent values hơn. Rolling operations cho real-time statistics. Linear regression xác định trend slope. Differencing giúp detrend data. SMA vs EMA — bạn chọn cái nào? EMA react nhanh hơn với changes gần đây!
🔬 Seasonality Analysis
Decomposition
1from statsmodels.tsa.seasonal import seasonal_decompose23# Additive: Original = Trend + Seasonal + Residual4result_add = seasonal_decompose(df['sales'], model='additive', period=365)56# Multiplicative: Original = Trend × Seasonal × Residual7result_mult = seasonal_decompose(df['sales'], model='multiplicative', period=365)89# Plot components10fig = result_add.plot()11fig.set_size_inches(14, 10)12plt.tight_layout()13plt.show()1415# Extract components16trend = result_add.trend17seasonal = result_add.seasonal18residual = result_add.residSTL Decomposition (More Robust)
1from statsmodels.tsa.seasonal import STL23stl = STL(df['sales'], period=365, robust=True)4result = stl.fit()56fig = result.plot()7fig.set_size_inches(14, 10)8plt.show()910# Get seasonally adjusted data11df['seasonally_adjusted'] = df['sales'] - result.seasonalSeasonal Indices
1def calculate_seasonal_indices(df, column, freq='month'):2 """Calculate seasonal indices"""3 if freq == 'month':4 grouped = df.groupby(df.index.month)[column].mean()5 elif freq == 'dayofweek':6 grouped = df.groupby(df.index.dayofweek)[column].mean()7 elif freq == 'quarter':8 grouped = df.groupby(df.index.quarter)[column].mean()9 10 overall_mean = df[column].mean()11 seasonal_index = grouped / overall_mean12 return seasonal_index1314monthly_index = calculate_seasonal_indices(df, 'sales', 'month')15print("Monthly Seasonal Indices:")16print(monthly_index)1718# Visualize19plt.figure(figsize=(10, 5))20monthly_index.plot(kind='bar')21plt.axhline(y=1, color='r', linestyle='--')22plt.title('Monthly Seasonal Indices')23plt.xlabel('Month')24plt.ylabel('Index (1.0 = average)')25plt.show()Checkpoint
Additive decomposition khi seasonal amplitude cố định, multiplicative khi amplitude tăng theo trend. STL robust hơn với outliers. Seasonal indices > 1 = above average. Seasonal index = 1.2 nghĩa là tháng đó cao hơn trung bình 20%. Bạn có thể dùng nó để adjust forecasts!
🔮 Forecasting
Simple Methods
1# Naive forecast (last value)2df['forecast_naive'] = df['sales'].shift(1)34# Seasonal naive (same day last year)5df['forecast_seasonal_naive'] = df['sales'].shift(365)67# Moving average forecast8df['forecast_ma'] = df['sales'].rolling(window=30).mean().shift(1)910# Exponential smoothing forecast11df['forecast_ewm'] = df['sales'].ewm(span=30).mean().shift(1)Simple Exponential Smoothing (SES)
1from statsmodels.tsa.holtwinters import SimpleExpSmoothing23model = SimpleExpSmoothing(df['sales']).fit(smoothing_level=0.2)4df['ses_fitted'] = model.fittedvalues5forecast = model.forecast(30) # Next 30 days6print(f"Smoothing level (alpha): {model.params['smoothing_level']:.4f}")Holt-Winters (Triple Exponential Smoothing)
1from statsmodels.tsa.holtwinters import ExponentialSmoothing23model = ExponentialSmoothing(4 df['sales'],5 trend='add', # Additive trend6 seasonal='add', # Additive seasonality7 seasonal_periods=365 # Yearly seasonality8).fit()910df['hw_fitted'] = model.fittedvalues11forecast = model.forecast(90) # Next 90 days1213# Plot14fig, ax = plt.subplots(figsize=(14, 6))15ax.plot(df.index[-365:], df['sales'][-365:], label='Actual')16ax.plot(forecast.index, forecast.values, 'r--', label='Forecast')17ax.legend()18ax.set_title('Holt-Winters Forecast')19plt.show()Forecast Evaluation
1from sklearn.metrics import mean_absolute_error, mean_squared_error23def evaluate_forecast(actual, predicted, model_name='Model'):4 """Calculate forecast metrics"""5 mask = ~(np.isnan(actual) | np.isnan(predicted))6 actual, predicted = actual[mask], predicted[mask]7 8 mae = mean_absolute_error(actual, predicted)9 rmse = np.sqrt(mean_squared_error(actual, predicted))10 mape = np.mean(np.abs((actual - predicted) / actual)) * 10011 12 print(f"\n{model_name} Performance:")13 print(f" MAE: {mae:.2f}")14 print(f" RMSE: {rmse:.2f}")15 print(f" MAPE: {mape:.2f}%")16 return {'mae': mae, 'rmse': rmse, 'mape': mape}1718# Compare methods19evaluate_forecast(df['sales'], df['forecast_naive'], 'Naive')20evaluate_forecast(df['sales'], df['forecast_ma'], 'Moving Average')21evaluate_forecast(df['sales'], df['hw_fitted'], 'Holt-Winters')Checkpoint
Naive/seasonal naive là baseline. SES cho no-trend data. Holt-Winters forecast cả trend + seasonality. Evaluate với MAE, RMSE, MAPE. MAPE < 10% = excellent forecast. Bạn nên so sánh model phức tạp với Naive baseline — nếu không better thì không đáng dùng!
💾 Time Series SQL
Date Functions
1-- PostgreSQL date operations2SELECT3 order_date,4 EXTRACT(YEAR FROM order_date) as year,5 EXTRACT(MONTH FROM order_date) as month,6 DATE_TRUNC('month', order_date) as month_start,7 order_date - INTERVAL '7 days' as week_ago8FROM orders;910-- Generate date series11SELECT generate_series(12 '2024-01-01'::date,13 '2024-12-31'::date,14 '1 day'::interval15)::date as date;Time Series Aggregation
1-- Monthly aggregation2SELECT3 DATE_TRUNC('month', order_date) as month,4 COUNT(*) as orders,5 SUM(amount) as revenue6FROM orders7GROUP BY DATE_TRUNC('month', order_date)8ORDER BY month;910-- Moving average with window functions11SELECT12 order_date, amount,13 AVG(amount) OVER (14 ORDER BY order_date15 ROWS BETWEEN 6 PRECEDING AND CURRENT ROW16 ) as moving_avg_7d17FROM orders;1819-- YoY comparison20WITH monthly_sales AS (21 SELECT DATE_TRUNC('month', order_date) as month,22 SUM(amount) as revenue23 FROM orders GROUP BY 124)25SELECT month, revenue,26 LAG(revenue, 12) OVER (ORDER BY month) as revenue_last_year,27 ROUND((revenue - LAG(revenue, 12) OVER (ORDER BY month)) / 28 LAG(revenue, 12) OVER (ORDER BY month) * 100, 2) as yoy_growth29FROM monthly_sales;Fill Missing Dates
1WITH date_series AS (2 SELECT generate_series(3 (SELECT MIN(order_date) FROM orders),4 (SELECT MAX(order_date) FROM orders),5 '1 day'::interval6 )::date as date7),8daily_sales AS (9 SELECT order_date::date as date, SUM(amount) as revenue10 FROM orders GROUP BY 111)12SELECT ds.date, COALESCE(d.revenue, 0) as revenue13FROM date_series ds14LEFT JOIN daily_sales d ON ds.date = d.date15ORDER BY ds.date;Checkpoint
DATE_TRUNC và EXTRACT cho time grouping. Window functions cho moving averages và YoY growth trong SQL. generate_series + LEFT JOIN cho fill missing dates. Kết hợp SQL time series aggregation với Python forecasting cho pipeline analytics hoàn chỉnh!
📋 Tổng kết
Kiến thức đã học
| Chủ đề | Nội dung chính |
|---|---|
| Data Preparation | datetime index, date components, missing dates handling |
| Visualization | Line plots, YoY comparison, heatmaps, seasonal plots |
| Resampling | resample() — downsampling (agg), upsampling (interpolate) |
| Moving Statistics | rolling(), ewm(), expanding(), cumsum |
| Trend Analysis | Linear regression, polynomial fit, detrending, differencing |
| Seasonality | seasonal_decompose(), STL, seasonal indices |
| Forecasting | Naive, SES, Holt-Winters, forecast evaluation |
| SQL Functions | DATE_TRUNC, EXTRACT, generate_series, window functions |
| Metrics | MAE, RMSE, MAPE |
Key Takeaways
- ✅ Time Series = Trend + Seasonality + Residual
- ✅ STL decomposition robust hơn standard decomposition
- ✅ Holt-Winters forecast cả trend và seasonality
- ✅ Luôn so sánh model với Naive baseline
- ✅ MAPE là metric dễ hiểu nhất (% error)
- ✅ Kết hợp SQL aggregation + Python analysis cho pipeline hoàn chỉnh
Câu hỏi tự kiểm tra
- Trend, Seasonality, Residual là gì?
- STL decomposition khác
seasonal_decompose()thế nào? - Holt-Winters forecast được những thành phần nào?
- MAPE dễ hiểu hơn MAE/RMSE vì sao?
Bài tiếp theo: Cohort Analysis →
🎉 Tuyệt vời! Bạn đã nắm vững Time Series Analysis!
Nhớ: Time Series = Trend + Seasonality + Residual. Decompose trước, forecast sau — luôn so với Naive baseline!
