
Automated Reports & Portfolio Capstone

Build an automated reporting system and complete the portfolio capstone project


🎯 Lesson Objectives

After this lesson, you will be able to:
  • Create report templates with Jinja2, HTML, Excel (openpyxl), and PDF (ReportLab)
  • Design an email delivery system and integrate with Slack/Teams
  • Schedule automated reports with APScheduler and GitHub Actions
  • Set up the capstone project with a standard repository structure
  • Perform Data Exploration, Cleaning, Analysis, and Reporting
  • Build a portfolio-ready project on GitHub

Lesson Information

⏱️ Duration: 3 hours | 📊 Level: Advanced | 🛠️ Tools: Python, openpyxl, ReportLab, Jinja2, Streamlit

1. 📖 Key Terminology
Term | Vietnamese | Description
Jinja2 | Template engine | Python template engine for dynamic report generation
openpyxl | Thư viện Excel | Python library for reading and writing Excel (.xlsx) files
ReportLab | Thư viện PDF | Python library for creating PDF documents
SMTP | Giao thức email | Simple Mail Transfer Protocol for email delivery
APScheduler | Lập lịch Python | Advanced Python Scheduler for cron-based tasks
ETL | Trích xuất - Chuyển đổi - Tải | Extract, Transform, Load pipeline pattern
RFM | Phân nhóm khách hàng | Recency, Frequency, Monetary segmentation
Cohort Analysis | Phân tích nhóm | Tracking user groups over time
Streamlit | Framework dashboard | Python framework for quickly building web dashboards
Portfolio | Hồ sơ năng lực | Collection of projects demonstrating skills

Checkpoint

An automated reporting pipeline combines templates (Jinja2), output formats (Excel/PDF/HTML), delivery (SMTP/Slack), and scheduling (APScheduler/cron). How does an ETL pipeline differ from a report generation pipeline?
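The checkpoint's distinction can be sketched in a few lines of plain Python (function and field names here are invented for illustration): an ETL pipeline ends by loading clean data into storage, while a report pipeline starts from stored data and ends with a human-readable artifact.

```python
# Hypothetical minimal pipelines; names and data are made up for illustration.

def etl_pipeline(raw_rows):
    """Extract -> Transform -> Load: ends by persisting clean data."""
    extracted = [r for r in raw_rows if r is not None]                        # Extract
    transformed = [{**r, 'revenue': float(r['revenue'])} for r in extracted]  # Transform
    warehouse = {'sales': transformed}                                        # Load (stand-in for a DB write)
    return warehouse

def report_pipeline(warehouse):
    """Read persisted data -> aggregate -> format: ends with a readable artifact."""
    total = sum(row['revenue'] for row in warehouse['sales'])
    return f"Total revenue: ${total:,.0f}"

raw = [{'revenue': '1200'}, None, {'revenue': '800'}]
wh = etl_pipeline(raw)
print(report_pipeline(wh))  # → Total revenue: $2,000
```

The ETL step could run hourly and write to a database, while the report step runs on its own schedule and only reads.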

2. 📝 Report Templates & Output Formats

Jinja2 Template Engine

Python
from jinja2 import Template
from datetime import datetime

# Text report template
text_template = Template("""
{{ '=' * 60 }}
{{ title | center(60) }}
{{ report_date | center(60) }}
{{ '=' * 60 }}

KEY METRICS
{% for metric in metrics %}
{{ metric.status }} {{ metric.name }}: {{ metric.value }}
   Target: {{ metric.target }} | vs Target: {{ metric.vs_target }}
{% endfor %}

TOP HIGHLIGHTS
{% for h in highlights %}
{{ loop.index }}. {{ h }}
{% endfor %}

ACTION ITEMS
{% for action in actions %}
[{{ action.priority }}] {{ action.task }}
   Owner: {{ action.owner }} | Due: {{ action.due }}
{% endfor %}
{{ '=' * 60 }}
""")

# HTML email template, described in outline here; in a real report this string
# would contain actual HTML markup using the same Jinja2 variables:
# - Header: dark background (#2c3e50), white text, centered title + date
# - Body: full-width table with metrics (Metric | Value | Target | Status)
# - Styling: alternating row colors, bold values, colored status indicators
# - Footer: light gray background, "Auto-generated report" disclaimer
# - Uses Jinja2 variables: title, report_date, metrics list
# - Uses a Jinja2 loop: {% for metric in metrics %}

# Generate reports
data = {
    'title': 'Weekly Sales Report',
    'report_date': datetime.now().strftime('%B %d, %Y'),
    'metrics': [
        {'name': 'Revenue', 'value': '$2.5M', 'target': '$2.3M', 'vs_target': '+8.7%', 'status': '✅'},
        {'name': 'Orders', 'value': '12,500', 'target': '11,000', 'vs_target': '+13.6%', 'status': '✅'},
        {'name': 'Conversion', 'value': '3.2%', 'target': '3.5%', 'vs_target': '-8.6%', 'status': '⚠️'},
    ],
    'highlights': ['Record Thursday revenue', 'Mobile app orders grew 25% MoM'],
    'actions': [
        {'task': 'Fix mobile checkout UX', 'owner': 'Product', 'due': '3/22', 'priority': 'HIGH'},
        {'task': 'Launch retargeting', 'owner': 'Marketing', 'due': '3/20', 'priority': 'MEDIUM'},
    ]
}

print(text_template.render(**data))
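The HTML email template is only described in outline above; a minimal concrete sketch of what such a template could look like (markup, colors, and layout are illustrative, not the course's exact template):

```python
from jinja2 import Template

# Illustrative HTML email template following the outlined structure
html_template = Template("""\
<html><body style="font-family: Arial, sans-serif;">
  <div style="background:#2c3e50; color:#ffffff; padding:16px; text-align:center;">
    <h2>{{ title }}</h2>
    <p>{{ report_date }}</p>
  </div>
  <table width="100%" cellpadding="8" border="1" style="border-collapse:collapse;">
    <tr><th>Metric</th><th>Value</th><th>Target</th><th>Status</th></tr>
    {% for metric in metrics %}
    <tr>
      <td>{{ metric.name }}</td>
      <td><b>{{ metric.value }}</b></td>
      <td>{{ metric.target }}</td>
      <td>{{ metric.status }}</td>
    </tr>
    {% endfor %}
  </table>
  <div style="background:#ecf0f1; padding:8px; font-size:12px;">
    Auto-generated report. Do not reply.
  </div>
</body></html>
""")

html = html_template.render(
    title='Weekly Sales Report',
    report_date='March 18, 2024',
    metrics=[{'name': 'Revenue', 'value': '$2.5M', 'target': '$2.3M', 'status': 'OK'}],
)
print(html[:60])
```

The rendered string can be passed as the `html_body` of the email sender shown later in the lesson.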

Excel Reports with openpyxl

Python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.chart import BarChart, Reference
from openpyxl.utils import get_column_letter

def create_excel_report(data, filename='report.xlsx'):
    wb = Workbook()

    # --- Summary Sheet ---
    ws = wb.active
    ws.title = 'Executive Summary'

    # Header
    ws.merge_cells('A1:E1')
    ws['A1'] = 'Weekly Executive Report'
    ws['A1'].font = Font(size=16, bold=True, color='FFFFFF')
    ws['A1'].fill = PatternFill(start_color='2C3E50', fill_type='solid')
    ws['A1'].alignment = Alignment(horizontal='center')

    # Column headers
    headers = ['KPI', 'Current', 'Target', 'vs Target', 'Status']
    for i, header in enumerate(headers, 1):
        cell = ws.cell(row=3, column=i, value=header)
        cell.font = Font(bold=True, color='FFFFFF')
        cell.fill = PatternFill(start_color='34495E', fill_type='solid')

    # Data rows
    status_colors = {'✅': '27AE60', '⚠️': 'F39C12', '❌': 'E74C3C'}
    for row_idx, metric in enumerate(data['metrics'], 4):
        ws.cell(row=row_idx, column=1, value=metric['name'])
        ws.cell(row=row_idx, column=2, value=metric['value'])
        ws.cell(row=row_idx, column=3, value=metric['target'])
        ws.cell(row=row_idx, column=4, value=metric['vs_target'])
        status_cell = ws.cell(row=row_idx, column=5, value=metric['status'])
        color = status_colors.get(metric['status'], 'BDC3C7')
        status_cell.fill = PatternFill(start_color=color, fill_type='solid')

    # Set uniform column widths
    for col in range(1, 6):
        ws.column_dimensions[get_column_letter(col)].width = 18

    # --- Chart Sheet ---
    ws2 = wb.create_sheet('Charts')
    chart_data = [['Category', 'Revenue'], ['Product A', 45], ['Product B', 38],
                  ['Product C', 29], ['Product D', 18], ['Product E', 12]]
    for row in chart_data:
        ws2.append(row)

    chart = BarChart()
    chart.title = 'Revenue by Product'
    chart.y_axis.title = 'Revenue ($K)'
    cats = Reference(ws2, min_col=1, min_row=2, max_row=6)
    vals = Reference(ws2, min_col=2, min_row=1, max_row=6)
    chart.add_data(vals, titles_from_data=True)
    chart.set_categories(cats)
    ws2.add_chart(chart, 'D2')

    wb.save(filename)
    print(f"Excel report saved: {filename}")

create_excel_report(data)

PDF Reports with ReportLab

Python
from reportlab.lib.pagesizes import A4
from reportlab.lib import colors
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer

def create_pdf_report(data, filename='report.pdf'):
    doc = SimpleDocTemplate(filename, pagesize=A4)
    styles = getSampleStyleSheet()
    elements = []

    # Title
    title_style = ParagraphStyle('CustomTitle', parent=styles['Title'],
                                 fontSize=20, spaceAfter=30)
    elements.append(Paragraph(data['title'], title_style))
    elements.append(Spacer(1, 12))

    # Metrics table
    table_data = [['KPI', 'Current', 'Target', 'Status']]
    for m in data['metrics']:
        table_data.append([m['name'], m['value'], m['target'], m['status']])

    table = Table(table_data, colWidths=[150, 100, 100, 80])
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#2C3E50')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#F8F9FA')]),
    ]))
    elements.append(table)
    elements.append(Spacer(1, 20))

    # Action items
    elements.append(Paragraph('Action Items', styles['Heading2']))
    for action in data['actions']:
        elements.append(Paragraph(
            f"• [{action['priority']}] {action['task']} — Owner: {action['owner']}, Due: {action['due']}",
            styles['Normal']
        ))

    doc.build(elements)
    print(f"PDF report saved: {filename}")

create_pdf_report(data)

Checkpoint

Jinja2 separates logic from presentation, openpyxl builds professional Excel reports with charts, and ReportLab generates PDF documents programmatically. When should you choose an Excel report over a PDF report?

3. 📮 Report Delivery & Scheduling

Email Delivery

Python
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders

class EmailReporter:
    def __init__(self, smtp_server, smtp_port, username, password):
        self.smtp_server = smtp_server
        self.smtp_port = smtp_port
        self.username = username
        self.password = password

    def send_report(self, to_list, subject, html_body, attachments=None):
        msg = MIMEMultipart()
        msg['From'] = self.username
        msg['To'] = ', '.join(to_list)
        msg['Subject'] = subject
        msg.attach(MIMEText(html_body, 'html'))

        if attachments:
            for filepath in attachments:
                with open(filepath, 'rb') as f:
                    part = MIMEBase('application', 'octet-stream')
                    part.set_payload(f.read())
                encoders.encode_base64(part)
                part.add_header('Content-Disposition',
                                f'attachment; filename={filepath.split("/")[-1]}')
                msg.attach(part)

        with smtplib.SMTP(self.smtp_server, self.smtp_port) as server:
            server.starttls()
            server.login(self.username, self.password)
            server.send_message(msg)
        print(f"Email sent to {len(to_list)} recipients")

# Usage
# reporter = EmailReporter('smtp.gmail.com', 587, 'you@gmail.com', 'app_password')
# reporter.send_report(['ceo@company.com'], 'Weekly Report', html_content, ['report.xlsx'])

Slack & Teams Integration

Python
import requests

class SlackReporter:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url

    def send_report(self, title, metrics, status='good'):
        color_map = {'good': '#27ae60', 'warning': '#f39c12', 'danger': '#e74c3c'}
        fields = [{'title': m['name'], 'value': f"{m['value']} (Target: {m['target']})",
                   'short': True} for m in metrics]

        payload = {'attachments': [{'color': color_map.get(status, '#3498db'),
                                    'title': title, 'fields': fields}]}
        resp = requests.post(self.webhook_url, json=payload, timeout=10)
        resp.raise_for_status()  # fail loudly if the webhook rejects the payload

class TeamsReporter:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url

    def send_report(self, title, summary, metrics):
        facts = [{'name': m['name'], 'value': f"{m['value']} ({m['status']})"}
                 for m in metrics]

        card = {
            '@type': 'MessageCard', 'summary': title, 'themeColor': '2C3E50',
            'title': title,
            'sections': [{'activityTitle': summary, 'facts': facts}]
        }
        resp = requests.post(self.webhook_url, json=card, timeout=10)
        resp.raise_for_status()

Scheduling with APScheduler

Python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def daily_report():
    print("Running daily report...")

def weekly_report():
    print("Running weekly report...")

def monthly_report():
    print("Running monthly report...")

scheduler = BackgroundScheduler()
scheduler.add_job(daily_report, CronTrigger(hour=7, minute=0), id='daily')
scheduler.add_job(weekly_report, CronTrigger(day_of_week='mon', hour=8), id='weekly')
scheduler.add_job(monthly_report, CronTrigger(day=1, hour=9), id='monthly')
# scheduler.start()

GitHub Actions Workflow

Example

# .github/workflows/automated-report.yml
name: Automated Report Generation
on:
  schedule:
    - cron: '0 7 * * 1-5'  # Weekdays at 7:00 UTC
    - cron: '0 8 * * 1'    # Monday at 8:00 UTC (weekly)
  workflow_dispatch:       # Manual trigger

jobs:
  generate-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python generate_report.py
        env:
          DB_URL: ${{ secrets.DATABASE_URL }}
          SMTP_PASSWORD: ${{ secrets.SMTP_PASSWORD }}
      - uses: actions/upload-artifact@v4
        with:
          name: reports
          path: output/
Security Best Practice

Never hardcode passwords in code. Use environment variables or a secrets manager (GitHub Secrets, AWS SSM, Azure Key Vault).
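A minimal sketch of that practice using the standard library (`load_secret` and the variable names are hypothetical helpers, not part of the lesson's code):

```python
import os

def load_secret(name, default=None):
    """Read a secret from the environment instead of hardcoding it in source."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required secret: {name}")
    return value

# In production these would be injected by GitHub Secrets / AWS SSM / Azure Key
# Vault as environment variables, and never committed to the repository.
os.environ.setdefault("SMTP_PASSWORD", "example-only")  # demo value for this sketch
print(load_secret("SMTP_PASSWORD"))
```

The same helper works unchanged in the GitHub Actions workflow above, because the `env:` block exposes secrets to the script as ordinary environment variables.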

Checkpoint

Reports must reach the right people at the right time: email for formal reports, Slack for quick alerts, and cron/APScheduler for automatic delivery. Why run scheduling on GitHub Actions instead of a cron job on a local machine?

4. 🏗️ Capstone Project Setup

Project Overview

Example

PORTFOLIO CAPSTONE PROJECT
5 DELIVERABLES:
  1. 📓 Analysis Notebook
  2. 📊 Executive Report (PDF/PowerPoint)
  3. 🖥️ Interactive Dashboard (Streamlit)
  4. 🤖 Automated Report Script
  5. 📁 GitHub Repository (portfolio-ready)

Repository Structure

Example

data-analytics-capstone/
├── README.md
├── requirements.txt
├── .gitignore
├── setup_project.py
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 01_exploration.ipynb
│   ├── 02_cleaning.ipynb
│   ├── 03_analysis.ipynb
│   └── 04_visualization.ipynb
├── src/
│   ├── __init__.py
│   ├── data_loader.py
│   ├── cleaner.py
│   ├── analyzers.py
│   └── reporters.py
├── reports/
│   ├── templates/
│   └── output/
├── dashboard/
│   └── app.py
└── tests/
    └── test_analyzers.py

Setup Script

Python
import os

def setup_project(project_name="data-analytics-capstone"):
    """Create complete project structure."""
    directories = [
        'data/raw', 'data/processed', 'notebooks', 'src',
        'reports/templates', 'reports/output', 'dashboard', 'tests'
    ]
    for d in directories:
        os.makedirs(os.path.join(project_name, d), exist_ok=True)

    # README.md
    readme = f"""# {project_name}
## Overview
End-to-end data analytics project: exploration - cleaning - analysis - reporting.

## Quick Start
    pip install -r requirements.txt
    streamlit run dashboard/app.py

## Project Structure
- notebooks/: Jupyter notebooks for each phase
- src/: Reusable Python modules
- reports/: Generated reports
- dashboard/: Streamlit dashboard
"""

    # requirements.txt
    requirements = """pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
scikit-learn>=1.3.0
openpyxl>=3.1.0
jinja2>=3.1.0
streamlit>=1.28.0
plotly>=5.18.0
"""

    # .gitignore
    gitignore = """__pycache__/
*.pyc
.env
data/raw/*.csv
*.xlsx
.DS_Store
.ipynb_checkpoints/
"""

    files = {
        'README.md': readme,
        'requirements.txt': requirements,
        '.gitignore': gitignore,
        'src/__init__.py': '# Data Analytics Capstone\n'
    }
    for filepath, content in files.items():
        with open(os.path.join(project_name, filepath), 'w') as f:
            f.write(content)

    print(f"✅ Project '{project_name}' created!")
    print(f"📁 Directories: {len(directories)}")
    print(f"📄 Files: {len(files)}")

setup_project()

Checkpoint

A professional project structure is reproducible, shareable, and portfolio-ready: include a README, requirements.txt, and .gitignore, and keep source code (src/) separate from notebooks and reports. Why split code into modules under src/ instead of writing everything in notebooks?

5. 🔍 Data Exploration & Cleaning

Phase 1: Data Exploration

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def assess_data_quality(df, name="dataset"):
    """Comprehensive data quality report."""
    print(f"\n{'='*60}")
    print(f"DATA QUALITY REPORT: {name}")
    print(f"{'='*60}")

    total = len(df)
    print(f"\n📊 SHAPE: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"📦 MEMORY: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / total * 100).round(1)
    missing_df = pd.DataFrame({'Missing': missing, '%': missing_pct})
    missing_df = missing_df[missing_df['Missing'] > 0].sort_values('%', ascending=False)

    if len(missing_df) > 0:
        print(f"\n⚠️ MISSING VALUES ({len(missing_df)} columns):")
        print(missing_df.to_string())
    else:
        print("\n✅ NO MISSING VALUES")

    # Duplicates
    dups = df.duplicated().sum()
    print(f"\n{'⚠️' if dups > 0 else '✅'} DUPLICATES: {dups:,} ({dups/total*100:.1f}%)")

    # Data types
    print("\n📋 DATA TYPES:")
    for dtype, count in df.dtypes.value_counts().items():
        print(f"  {dtype}: {count} columns")

    return {'shape': df.shape, 'missing': missing_df, 'duplicates': dups}

# Usage
# df = pd.read_csv('data/raw/sales_data.csv')
# quality = assess_data_quality(df, 'Sales Data')
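A tiny self-contained illustration of the two core checks above, missing values and duplicates, on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Toy DataFrame (invented values): one duplicate row, one NaN, one None
df = pd.DataFrame({
    'order_id': [1, 2, 2, 4],
    'revenue': [100.0, 250.0, 250.0, np.nan],
    'region': ['North', 'South', 'South', None],
})

missing = df.isnull().sum()   # per-column count of NaN/None values
dups = df.duplicated().sum()  # count of fully identical rows (first kept)

print(missing['revenue'], missing['region'], dups)  # → 1 1 1
```

Note that `duplicated()` only flags rows identical across every column; a near-duplicate (same `order_id`, different `revenue`) would pass this check.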

Phase 2: Data Cleaning

Python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class DataCleaner:
    """Reusable data cleaning pipeline with logging."""

    def __init__(self, df):
        self.df = df.copy()
        self.original_shape = df.shape
        self.log = []

    def remove_duplicates(self):
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        removed = before - len(self.df)
        self._log(f"Removed {removed:,} duplicates")
        return self

    def handle_missing(self, strategy='smart'):
        for col in self.df.columns:
            missing = self.df[col].isnull().sum()
            if missing == 0:
                continue

            pct = missing / len(self.df) * 100

            if pct > 50:
                self.df = self.df.drop(columns=[col])
                self._log(f"Dropped column '{col}' ({pct:.0f}% missing)")
            elif self.df[col].dtype in ['float64', 'int64']:
                median_val = self.df[col].median()
                self.df[col] = self.df[col].fillna(median_val)
                self._log(f"Filled '{col}' with median ({median_val:.2f})")
            else:
                mode_val = self.df[col].mode()[0]
                self.df[col] = self.df[col].fillna(mode_val)
                self._log(f"Filled '{col}' with mode ({mode_val})")
        return self

    def fix_data_types(self, date_columns=None, category_columns=None):
        if date_columns:
            for col in date_columns:
                self.df[col] = pd.to_datetime(self.df[col], errors='coerce')
                self._log(f"Converted '{col}' to datetime")
        if category_columns:
            for col in category_columns:
                self.df[col] = self.df[col].astype('category')
                self._log(f"Converted '{col}' to category")
        return self

    def remove_outliers(self, columns, method='iqr', threshold=1.5):
        before = len(self.df)
        for col in columns:
            if method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                self.df = self.df[
                    (self.df[col] >= Q1 - threshold * IQR) &
                    (self.df[col] <= Q3 + threshold * IQR)
                ]
        removed = before - len(self.df)
        self._log(f"Removed {removed:,} outliers using {method.upper()}")
        return self

    def get_result(self):
        final_shape = self.df.shape
        print("\n📊 Cleaning Summary:")
        print(f"  Before: {self.original_shape}")
        print(f"  After:  {final_shape}")
        print("\n📋 Steps:")
        for step in self.log:
            print(f"  • {step}")
        return self.df

    def _log(self, message):
        self.log.append(message)
        logger.info(message)

# Usage
# cleaned_df = (DataCleaner(raw_df)
#               .remove_duplicates()
#               .handle_missing(strategy='smart')
#               .fix_data_types(date_columns=['order_date'], category_columns=['region'])
#               .remove_outliers(['revenue', 'quantity'])
#               .get_result())
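The IQR rule inside `remove_outliers` can be seen in isolation with the standard library (toy numbers):

```python
from statistics import quantiles

# Toy values (invented): one obvious outlier at 95
values = [10, 11, 11, 12, 12, 13, 95]

q1, _, q3 = quantiles(values, n=4)       # quartiles of the sample
iqr = q3 - q1
lower = q1 - 1.5 * iqr                   # same 1.5x multiplier as threshold=1.5
upper = q3 + 1.5 * iqr
kept = [v for v in values if lower <= v <= upper]
print(kept)  # 95 falls outside the fence and is dropped
```

The same fences are what the pandas version computes with `quantile(0.25)` / `quantile(0.75)`; the exact quartile values can differ slightly because the two libraries use different interpolation methods.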

Checkpoint

A data cleaning pipeline should be chainable (method chaining), logged, and change-tracked, which keeps it reproducible and auditable. Why should a cleaning pipeline log every step instead of only producing the final result?

6. 📊 Core Analysis & Portfolio

Revenue Analysis

Python
import pandas as pd

class RevenueAnalyzer:
    def __init__(self, df, date_col='date', revenue_col='revenue'):
        self.df = df
        self.date_col = date_col
        self.revenue_col = revenue_col

    def monthly_trend(self):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        monthly = self.df.set_index(self.date_col).resample('M')[self.revenue_col].sum()
        monthly_growth = monthly.pct_change() * 100
        return pd.DataFrame({'revenue': monthly, 'growth_pct': monthly_growth})

    def segment_analysis(self, segment_col):
        return self.df.groupby(segment_col)[self.revenue_col].agg(['sum', 'mean', 'count'])
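The growth column in `monthly_trend` comes from `pct_change`; a toy illustration with invented figures:

```python
import pandas as pd

# Three months of made-up revenue
monthly = pd.Series([100.0, 110.0, 99.0],
                    index=pd.period_range('2024-01', periods=3, freq='M'))

growth = monthly.pct_change() * 100  # first month has no prior month -> NaN
print(growth.round(1).tolist())      # → [nan, 10.0, -10.0]
```

Reporting the NaN for the first period (rather than filling it with 0) keeps the chart honest: there simply is no prior month to compare against.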

Cohort Analysis

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

class CohortAnalyzer:
    def __init__(self, df, customer_col, date_col, revenue_col):
        self.df = df.copy()
        self.customer_col = customer_col
        self.date_col = date_col
        self.revenue_col = revenue_col

    def build_cohort(self):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        self.df['order_month'] = self.df[self.date_col].dt.to_period('M')

        # First purchase month = cohort
        first_purchase = self.df.groupby(self.customer_col)['order_month'].min()
        self.df = self.df.merge(
            first_purchase.rename('cohort'), on=self.customer_col
        )
        self.df['period_number'] = (
            self.df['order_month'].astype(int) - self.df['cohort'].astype(int)
        )

        cohort_data = self.df.groupby(['cohort', 'period_number'])[self.customer_col].nunique()
        cohort_table = cohort_data.unstack()
        cohort_sizes = cohort_table.iloc[:, 0]
        retention = cohort_table.divide(cohort_sizes, axis=0) * 100
        return retention

    def plot_retention(self, retention):
        plt.figure(figsize=(12, 8))
        sns.heatmap(retention, annot=True, fmt='.0f', cmap='YlOrRd_r',
                    vmin=0, vmax=100, cbar_kws={'label': 'Retention %'})
        plt.title('Cohort Retention Analysis', fontsize=14, fontweight='bold')
        plt.xlabel('Months Since First Purchase')
        plt.ylabel('Cohort')
        plt.tight_layout()
        plt.show()
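The `period_number` computation above subtracts monthly `Period` ordinals (the Series version via `astype(int)` yields the same ordinals); a minimal illustration:

```python
import pandas as pd

cohort = pd.Period('2024-01', freq='M')       # month of first purchase
order_month = pd.Period('2024-03', freq='M')  # month of a later order

# Monthly periods have consecutive integer ordinals, so subtraction
# gives "months since joining" directly.
print(order_month.ordinal - cohort.ordinal)  # → 2
```

That integer becomes the column index of the retention table: period 0 is the joining month (100% by construction), period 2 is two months later.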

RFM Segmentation

Python
import pandas as pd

class RFMAnalyzer:
    def __init__(self, df, customer_col, date_col, revenue_col):
        self.df = df
        self.customer_col = customer_col
        self.date_col = date_col
        self.revenue_col = revenue_col

    def calculate_rfm(self, analysis_date=None):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        if analysis_date is None:
            analysis_date = self.df[self.date_col].max() + pd.Timedelta(days=1)

        rfm = self.df.groupby(self.customer_col).agg({
            self.date_col: lambda x: (analysis_date - x.max()).days,
            self.revenue_col: ['count', 'sum']
        })
        rfm.columns = ['recency', 'frequency', 'monetary']

        for col in ['recency', 'frequency', 'monetary']:
            rfm[f'{col}_score'] = pd.qcut(
                rfm[col].rank(method='first'),
                q=5, labels=[5, 4, 3, 2, 1] if col == 'recency' else [1, 2, 3, 4, 5]
            ).astype(int)

        rfm['rfm_score'] = (rfm['recency_score'] * 100 +
                            rfm['frequency_score'] * 10 +
                            rfm['monetary_score'])

        def assign_segment(row):
            r, f, m = row['recency_score'], row['frequency_score'], row['monetary_score']
            if r >= 4 and f >= 4:
                return 'Champions'
            elif r >= 3 and f >= 3:
                return 'Loyal'
            elif r >= 4 and f <= 2:
                return 'New Customers'
            elif r <= 2 and f >= 3:
                return 'At Risk'
            elif r <= 2 and f <= 2:
                return 'Lost'
            else:
                return 'Potential'

        rfm['segment'] = rfm.apply(assign_segment, axis=1)
        return rfm
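The reversed quintile scoring for recency (lower recency = more recent = better) can be seen in isolation, with made-up values:

```python
import pandas as pd

# Invented recency values: days since each customer's last purchase
recency = pd.Series([2, 10, 30, 90, 200, 5, 45, 120, 8, 60])

# Rank first so qcut always gets unique bin edges, then label quintiles
# in reverse: the most recent customers land in the top-score bucket.
score = pd.qcut(recency.rank(method='first'), q=5,
                labels=[5, 4, 3, 2, 1]).astype(int)

print(score[recency.idxmin()], score[recency.idxmax()])  # → 5 1
```

Frequency and monetary use the ascending labels `[1, 2, 3, 4, 5]` instead, since for those metrics higher raw values are better.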

Executive Summary & Dashboard

Python
# Executive summary template
executive_summary = """
EXECUTIVE SUMMARY

Period: Q1 2024 | Prepared by: [Your Name]

KEY FINDINGS:
1. Revenue grew 15% YoY ($8.2M)
2. Customer retention improved 5pp (78% → 83%)
3. Champions segment contributes 52% of revenue

RECOMMENDATIONS:
1. Invest $200K in loyalty program for At Risk segment
2. Expand product line driving 40% of growth
3. Launch re-engagement campaign for Lost customers

EXPECTED IMPACT: +$1.2M annual revenue
"""

# Streamlit dashboard structure
streamlit_code = '''
import streamlit as st
import pandas as pd
import plotly.express as px

st.set_page_config(page_title="Analytics Dashboard", layout="wide")
st.title("📊 Data Analytics Dashboard")

# Sidebar filters
st.sidebar.header("Filters")
date_range = st.sidebar.date_input("Date Range", [])
segment = st.sidebar.multiselect("Segment", ["Champions", "Loyal", "At Risk", "Lost"])

# KPI Cards
col1, col2, col3, col4 = st.columns(4)
col1.metric("Revenue", "$8.2M", "+15%")
col2.metric("Customers", "15,200", "+22%")
col3.metric("Retention", "83%", "+5pp")
col4.metric("NPS", "48", "-2")

# Charts (monthly_data and rfm_data come from the analysis pipeline)
tab1, tab2, tab3 = st.tabs(["Revenue", "Cohort", "RFM"])
with tab1:
    fig = px.line(monthly_data, x="month", y="revenue", title="Monthly Revenue Trend")
    st.plotly_chart(fig, use_container_width=True)
with tab2:
    st.subheader("Cohort Retention Heatmap")
    # Display retention heatmap
with tab3:
    fig = px.scatter(rfm_data, x="frequency", y="monetary", color="segment",
                     size="recency", title="RFM Segments")
    st.plotly_chart(fig, use_container_width=True)
'''

print(executive_summary)

GitHub Portfolio README

Example

# Data Analytics Capstone Project

Badges: [Python 3.11] [License: MIT]

## Overview
End-to-end data analytics project covering data exploration, cleaning,
RFM segmentation, cohort analysis, and automated executive reporting.

## Key Results
- 15% revenue growth identified and attributed to new product line
- RFM segmentation revealed 52% revenue from Champions (12% of customers)
- Cohort analysis showed retention improved 5pp after loyalty program
- Automated reports reduced manual reporting time by 8 hours/week

## Tech Stack
Python | Pandas | Scikit-learn | Matplotlib | Plotly | Streamlit | Jinja2

## How to Run
    git clone [repo-url]
    pip install -r requirements.txt
    streamlit run dashboard/app.py

Project Checklist

Example

✅ Phase 1: Data Exploration
   □ Load and inspect dataset
   □ Data quality assessment
   □ Initial visualizations

✅ Phase 2: Data Cleaning
   □ Handle missing values
   □ Remove duplicates
   □ Fix data types
   □ Handle outliers

✅ Phase 3: Core Analysis
   □ Revenue trend analysis
   □ Cohort retention analysis
   □ RFM segmentation
   □ Funnel analysis (if applicable)

✅ Phase 4: Reporting & Presentation
   □ Executive summary
   □ Streamlit dashboard
   □ Automated report script
   □ GitHub README

✅ Phase 5: Polish
   □ Code documentation
   □ Unit tests
   □ Clean commit history
   □ Professional README with badges

Checkpoint

The capstone project has 5 deliverables (notebook, report, dashboard, automation script, GitHub repo). A professional portfolio needs clean code, documentation, and a results-oriented README. Should a portfolio project emphasize technical complexity or business impact?

7. 📋 Summary

What You Learned in This Lesson

Topic | Key Content
Report Templates | Jinja2, HTML email, openpyxl Excel, ReportLab PDF
Delivery | SMTP email, Slack/Teams webhooks
Scheduling | APScheduler, GitHub Actions cron workflows
Project Setup | Repository structure, setup script, requirements
Data Cleaning | DataCleaner pipeline, logging, method chaining
Analysis | Revenue, Cohort, RFM, Funnel analysis classes
Dashboard | Streamlit app with filters, metrics, charts
Portfolio | GitHub README, project checklist, presentation

Course Recap: Advanced Data Analysis

Module | Skills
L01-L03 | Python foundations, Pandas, data manipulation
L04-L06 | Statistical analysis, hypothesis testing, A/B testing
L07-L08 | Time series analysis, customer segmentation (RFM, K-Means)
L09-L10 | Funnel analysis, predictive analytics basics
L11-L12 | Data storytelling, executive reporting, automated reports, capstone

Self-Check Questions

  1. What steps make up an automated reporting pipeline?
  2. How does a Streamlit dashboard differ from a static report?
  3. What elements does a portfolio project need?
  4. How is GitHub Actions used for scheduling?

Congratulations! 🎉

You have completed the Advanced Data Analysis course! You now have the full skill set, from data wrangling → statistical analysis → predictive modeling → storytelling → automated reporting. Apply all of it to the capstone project to build an impressive portfolio.

Back to: Data Storytelling & Executive Reporting