Automated Reports & Portfolio Capstone
🎯 Learning Objectives
- Build report templates with Jinja2, HTML, Excel (openpyxl), and PDF (ReportLab)
- Design an email delivery system and integrate with Slack/Teams
- Schedule automated reports with APScheduler and GitHub Actions
- Set up the capstone project with a standard repository structure
- Perform data exploration, cleaning, analysis, and reporting
- Build a portfolio-ready project on GitHub
⏱️ Duration: 3 hours | 📊 Level: Advanced | 🛠️ Tools: Python, openpyxl, ReportLab, Jinja2, Streamlit
📖 Key Terms
| Term | Short gloss | Description |
|---|---|---|
| Jinja2 | Template engine | Python template engine for dynamic report generation |
| openpyxl | Excel library | Python library for reading/writing Excel (.xlsx) files |
| ReportLab | PDF library | Python library for generating PDF documents |
| SMTP | Email protocol | Simple Mail Transfer Protocol for email delivery |
| APScheduler | Python scheduler | Advanced Python Scheduler for cron-style tasks |
| ETL | Extract-Transform-Load | Extract, Transform, Load pipeline pattern |
| RFM | Customer segmentation | Recency, Frequency, Monetary segmentation |
| Cohort Analysis | Group analysis | Tracking user groups over time |
| Streamlit | Dashboard framework | Python framework for building web dashboards quickly |
| Portfolio | Skills showcase | Collection of projects demonstrating skills |
Checkpoint
An automated reporting pipeline combines templates (Jinja2), output formats (Excel/PDF/HTML), delivery (SMTP/Slack), and scheduling (APScheduler/cron). How does an ETL pipeline differ from a report generation pipeline?
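The four stages named above can be sketched end to end with the standard library alone — a minimal dry-run skeleton, not the lesson's implementation (the function names and sample metrics here are illustrative, and `string.Template` stands in for Jinja2):

```python
from string import Template  # stdlib stand-in for Jinja2 in this sketch

def extract():
    """Stage 1: pull raw metrics (hardcoded sample data here)."""
    return [{'name': 'Revenue', 'value': '$2.5M'}, {'name': 'Orders', 'value': '12,500'}]

def transform(metrics):
    """Stage 2: shape metrics into template-ready lines."""
    return "\n".join(f"  {m['name']}: {m['value']}" for m in metrics)

def render(body, title='Weekly Report'):
    """Stage 3: fill the template (Jinja2 would slot in here)."""
    return Template('$title\n$body').substitute(title=title, body=body)

def deliver(report, dry_run=True):
    """Stage 4: send via SMTP/Slack; dry_run just returns the text."""
    if dry_run:
        return report
    raise NotImplementedError('hook up an email/Slack sender here')

report = deliver(render(transform(extract())))
print(report)
```

Each stage takes the previous stage's output as input, which is what makes the pipeline easy to schedule and to test in isolation.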
📝 Report Templates & Output Formats
Jinja2 Template Engine
```python
from jinja2 import Template
from datetime import datetime

# Text report template
text_template = Template("""
{{ '=' * 60 }}
{{ title | center(60) }}
{{ report_date | center(60) }}
{{ '=' * 60 }}

📊 KEY METRICS
{% for metric in metrics %}
  {{ metric.status }} {{ metric.name }}: {{ metric.value }}
     Target: {{ metric.target }} | vs Target: {{ metric.vs_target }}
{% endfor %}

🌟 TOP HIGHLIGHTS
{% for h in highlights %}
  {{ loop.index }}. {{ h }}
{% endfor %}

📋 ACTION ITEMS
{% for action in actions %}
  [{{ action.priority }}] {{ action.task }}
     Owner: {{ action.owner }} | Due: {{ action.due }}
{% endfor %}
{{ '=' * 60 }}
""")

# HTML email template (structure described as a placeholder string)
html_template_string = '''
HTML Email Report Template:
- Header: Dark background (#2c3e50), white text, centered title + date
- Body: Full-width table with metrics (Metric | Value | Target | Status)
- Styling: Alternating row colors, bold values, colored status indicators
- Footer: Light gray background, "Auto-generated report" disclaimer
- Uses Jinja2 variables: title, report_date, metrics list
- Uses Jinja2 loop: for metric in metrics
'''
html_template = Template(html_template_string)

# Generate reports
data = {
    'title': 'Weekly Sales Report',
    'report_date': datetime.now().strftime('%B %d, %Y'),
    'metrics': [
        {'name': 'Revenue', 'value': '$2.5M', 'target': '$2.3M', 'vs_target': '+8.7%', 'status': '✅'},
        {'name': 'Orders', 'value': '12,500', 'target': '11,000', 'vs_target': '+13.6%', 'status': '✅'},
        {'name': 'Conversion', 'value': '3.2%', 'target': '3.5%', 'vs_target': '-8.6%', 'status': '⚠️'},
    ],
    'highlights': ['Record Thursday revenue', 'Mobile app orders grew 25% MoM'],
    'actions': [
        {'task': 'Fix mobile checkout UX', 'owner': 'Product', 'due': '3/22', 'priority': 'HIGH'},
        {'task': 'Launch retargeting', 'owner': 'Marketing', 'due': '3/20', 'priority': 'MEDIUM'},
    ]
}

print(text_template.render(**data))
```
Excel Reports with openpyxl
```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.chart import BarChart, Reference
from openpyxl.utils import get_column_letter

def create_excel_report(data, filename='report.xlsx'):
    wb = Workbook()

    # --- Summary Sheet ---
    ws = wb.active
    ws.title = 'Executive Summary'

    # Header
    ws.merge_cells('A1:E1')
    ws['A1'] = 'Weekly Executive Report'
    ws['A1'].font = Font(size=16, bold=True, color='FFFFFF')
    ws['A1'].fill = PatternFill(start_color='2C3E50', fill_type='solid')
    ws['A1'].alignment = Alignment(horizontal='center')

    # Column headers
    headers = ['KPI', 'Current', 'Target', 'vs Target', 'Status']
    for i, header in enumerate(headers, 1):
        cell = ws.cell(row=3, column=i, value=header)
        cell.font = Font(bold=True, color='FFFFFF')
        cell.fill = PatternFill(start_color='34495E', fill_type='solid')

    # Data rows
    status_colors = {'✅': '27AE60', '⚠️': 'F39C12', '❌': 'E74C3C'}
    for row_idx, metric in enumerate(data['metrics'], 4):
        ws.cell(row=row_idx, column=1, value=metric['name'])
        ws.cell(row=row_idx, column=2, value=metric['value'])
        ws.cell(row=row_idx, column=3, value=metric['target'])
        ws.cell(row=row_idx, column=4, value=metric['vs_target'])
        status_cell = ws.cell(row=row_idx, column=5, value=metric['status'])
        color = status_colors.get(metric['status'], 'BDC3C7')
        status_cell.fill = PatternFill(start_color=color, fill_type='solid')

    # Auto-adjust column widths
    for col in range(1, 6):
        ws.column_dimensions[get_column_letter(col)].width = 18

    # --- Chart Sheet ---
    ws2 = wb.create_sheet('Charts')
    chart_data = [['Category', 'Revenue'], ['Product A', 45], ['Product B', 38],
                  ['Product C', 29], ['Product D', 18], ['Product E', 12]]
    for row in chart_data:
        ws2.append(row)

    chart = BarChart()
    chart.title = 'Revenue by Product'
    chart.y_axis.title = 'Revenue ($K)'
    cats = Reference(ws2, min_col=1, min_row=2, max_row=6)
    vals = Reference(ws2, min_col=2, min_row=1, max_row=6)
    chart.add_data(vals, titles_from_data=True)
    chart.set_categories(cats)
    ws2.add_chart(chart, 'D2')

    wb.save(filename)
    print(f"Excel report saved: {filename}")

create_excel_report(data)
```
PDF Reports with ReportLab
```python
from reportlab.lib.pagesizes import A4
from reportlab.lib import colors
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer

def create_pdf_report(data, filename='report.pdf'):
    doc = SimpleDocTemplate(filename, pagesize=A4)
    styles = getSampleStyleSheet()
    elements = []

    # Title
    title_style = ParagraphStyle('CustomTitle', parent=styles['Title'],
                                 fontSize=20, spaceAfter=30)
    elements.append(Paragraph(data['title'], title_style))
    elements.append(Spacer(1, 12))

    # Metrics table
    table_data = [['KPI', 'Current', 'Target', 'Status']]
    for m in data['metrics']:
        table_data.append([m['name'], m['value'], m['target'], m['status']])

    table = Table(table_data, colWidths=[150, 100, 100, 80])
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#2C3E50')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#F8F9FA')]),
    ]))
    elements.append(table)
    elements.append(Spacer(1, 20))

    # Action items
    elements.append(Paragraph('Action Items', styles['Heading2']))
    for action in data['actions']:
        elements.append(Paragraph(
            f"• [{action['priority']}] {action['task']} — Owner: {action['owner']}, Due: {action['due']}",
            styles['Normal']
        ))

    doc.build(elements)
    print(f"PDF report saved: {filename}")

create_pdf_report(data)
```
Checkpoint
Jinja2 separates logic from presentation, openpyxl produces professional Excel reports with charts, and ReportLab generates PDF documents programmatically. When should you choose an Excel report versus a PDF report?
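As a quick standalone check of the Jinja2 features the templates above rely on (the `center` filter and `loop.index`), here is a minimal render — the title and highlight strings are illustrative:

```python
from jinja2 import Template

# `center(20)` pads the title to width 20; `loop.index` numbers items from 1
t = Template("{{ title | center(20) }}\n"
             "{% for h in highlights %}{{ loop.index }}. {{ h }}\n{% endfor %}")
out = t.render(title='Demo', highlights=['Record revenue', 'Mobile growth'])
print(out)
```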
📮 Report Delivery & Scheduling
Email Delivery
```python
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders

class EmailReporter:
    def __init__(self, smtp_server, smtp_port, username, password):
        self.smtp_server = smtp_server
        self.smtp_port = smtp_port
        self.username = username
        self.password = password

    def send_report(self, to_list, subject, html_body, attachments=None):
        msg = MIMEMultipart()
        msg['From'] = self.username
        msg['To'] = ', '.join(to_list)
        msg['Subject'] = subject
        msg.attach(MIMEText(html_body, 'html'))

        if attachments:
            for filepath in attachments:
                with open(filepath, 'rb') as f:
                    part = MIMEBase('application', 'octet-stream')
                    part.set_payload(f.read())
                encoders.encode_base64(part)
                part.add_header('Content-Disposition',
                                f'attachment; filename={filepath.split("/")[-1]}')
                msg.attach(part)

        with smtplib.SMTP(self.smtp_server, self.smtp_port) as server:
            server.starttls()
            server.login(self.username, self.password)
            server.send_message(msg)
        print(f"Email sent to {len(to_list)} recipients")

# Usage
# reporter = EmailReporter('smtp.gmail.com', 587, 'you@gmail.com', 'app_password')
# reporter.send_report(['ceo@company.com'], 'Weekly Report', html_content, ['report.xlsx'])
```
Slack & Teams Integration
```python
import requests

class SlackReporter:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url

    def send_report(self, title, metrics, status='good'):
        color_map = {'good': '#27ae60', 'warning': '#f39c12', 'danger': '#e74c3c'}
        fields = [{'title': m['name'], 'value': f"{m['value']} (Target: {m['target']})",
                   'short': True} for m in metrics]

        payload = {'attachments': [{'color': color_map.get(status, '#3498db'),
                                    'title': title, 'fields': fields}]}
        requests.post(self.webhook_url, json=payload)

class TeamsReporter:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url

    def send_report(self, title, summary, metrics):
        facts = [{'name': m['name'], 'value': f"{m['value']} ({m['status']})"}
                 for m in metrics]

        card = {
            '@type': 'MessageCard', 'summary': title, 'themeColor': '2C3E50',
            'title': title,
            'sections': [{'activityTitle': summary, 'facts': facts}]
        }
        requests.post(self.webhook_url, json=card)
```
Scheduling with APScheduler
```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def daily_report():
    print("Running daily report...")

def weekly_report():
    print("Running weekly report...")

def monthly_report():
    print("Running monthly report...")

scheduler = BackgroundScheduler()
scheduler.add_job(daily_report, CronTrigger(hour=7, minute=0), id='daily')
scheduler.add_job(weekly_report, CronTrigger(day_of_week='mon', hour=8), id='weekly')
scheduler.add_job(monthly_report, CronTrigger(day=1, hour=9), id='monthly')
# scheduler.start()
```
GitHub Actions Workflow
```yaml
# .github/workflows/automated-report.yml
name: Automated Report Generation
on:
  schedule:
    - cron: '0 7 * * 1-5'   # Weekdays at 7:00 UTC
    - cron: '0 8 * * 1'     # Monday at 8:00 UTC (weekly)
  workflow_dispatch:        # Manual trigger

jobs:
  generate-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python generate_report.py
        env:
          DB_URL: ${{ secrets.DATABASE_URL }}
          SMTP_PASSWORD: ${{ secrets.SMTP_PASSWORD }}
      - uses: actions/upload-artifact@v4
        with:
          name: reports
          path: output/
```
Never hardcode passwords in code. Use environment variables or a secrets manager (GitHub Secrets, AWS SSM, Azure Key Vault).
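On the Python side, the secrets injected by the workflow's `env:` block arrive as environment variables. A fail-fast helper avoids silently running with a missing credential (the name `get_secret` is a hypothetical example, not from the lesson; setting the variable in-process below is only to make the sketch runnable):

```python
import os

def get_secret(name):
    """Read a required secret from the environment, failing fast if absent."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Simulate what GitHub Actions' `env:` block would provide (demo only —
# never assign real credentials in code)
os.environ['SMTP_PASSWORD'] = 'demo-only'
print(get_secret('SMTP_PASSWORD'))
```

In `generate_report.py` you would then call, e.g., `get_secret('SMTP_PASSWORD')` instead of embedding the password.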
Checkpoint
Reports must reach the right people at the right time: email for formal reports, Slack for quick alerts, and cron/APScheduler for automatic delivery. Why use GitHub Actions instead of a cron job on a local machine?
🏗️ Capstone Project Setup
Project Overview
```
╔══════════════════════════════════════════════════════════╗
║               PORTFOLIO CAPSTONE PROJECT                 ║
╠══════════════════════════════════════════════════════════╣
║  5 DELIVERABLES:                                         ║
║  1. 📓 Analysis Notebook                                 ║
║  2. 📊 Executive Report (PDF/PowerPoint)                 ║
║  3. 🖥️ Interactive Dashboard (Streamlit)                 ║
║  4. 🤖 Automated Report Script                           ║
║  5. 📁 GitHub Repository (portfolio-ready)               ║
╚══════════════════════════════════════════════════════════╝
```
Repository Structure
```
data-analytics-capstone/
├── README.md
├── requirements.txt
├── .gitignore
├── setup_project.py
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 01_exploration.ipynb
│   ├── 02_cleaning.ipynb
│   ├── 03_analysis.ipynb
│   └── 04_visualization.ipynb
├── src/
│   ├── __init__.py
│   ├── data_loader.py
│   ├── cleaner.py
│   ├── analyzers.py
│   └── reporters.py
├── reports/
│   ├── templates/
│   └── output/
├── dashboard/
│   └── app.py
└── tests/
    └── test_analyzers.py
```
Setup Script
```python
import os

def setup_project(project_name="data-analytics-capstone"):
    """Create complete project structure."""
    directories = [
        'data/raw', 'data/processed', 'notebooks', 'src',
        'reports/templates', 'reports/output', 'dashboard', 'tests'
    ]
    for d in directories:
        os.makedirs(os.path.join(project_name, d), exist_ok=True)

    # README.md
    readme = f"""# {project_name}
## Overview
End-to-end data analytics project: exploration - cleaning - analysis - reporting.

## Quick Start
    pip install -r requirements.txt
    streamlit run dashboard/app.py

## Project Structure
- notebooks/: Jupyter notebooks for each phase
- src/: Reusable Python modules
- reports/: Generated reports
- dashboard/: Streamlit dashboard
"""

    # requirements.txt
    requirements = """pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
scikit-learn>=1.3.0
openpyxl>=3.1.0
jinja2>=3.1.0
streamlit>=1.28.0
plotly>=5.18.0
"""

    # .gitignore
    gitignore = """__pycache__/
*.pyc
.env
data/raw/*.csv
*.xlsx
.DS_Store
.ipynb_checkpoints/
"""

    files = {
        'README.md': readme,
        'requirements.txt': requirements,
        '.gitignore': gitignore,
        'src/__init__.py': '# Data Analytics Capstone\n'
    }
    for filepath, content in files.items():
        with open(os.path.join(project_name, filepath), 'w') as f:
            f.write(content)

    print(f"✅ Project '{project_name}' created!")
    print(f"📁 Directories: {len(directories)}")
    print(f"📄 Files: {len(files)}")

setup_project()
```
Checkpoint
A professional project structure is reproducible, shareable, and portfolio-ready: include a README, requirements.txt, and .gitignore, and separate source code (src/) from notebooks and reports. Why split code into modules in src/ instead of writing everything in a notebook?
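One concrete payoff of moving logic into `src/` is testability: a function imported from a module can be exercised by `tests/test_analyzers.py`, which a notebook cell cannot. A sketch of that workflow — the `revenue_growth` helper is hypothetical, invented here for illustration:

```python
# src/analyzers.py (hypothetical helper)
def revenue_growth(current, previous):
    """Percent growth from the previous period to the current one."""
    if previous == 0:
        raise ValueError("previous revenue must be non-zero")
    return (current - previous) / previous * 100

# tests/test_analyzers.py — runnable with pytest, or directly as asserts
def test_revenue_growth():
    assert revenue_growth(115, 100) == 15.0
    assert revenue_growth(90, 100) == -10.0

test_revenue_growth()
print("tests passed")
```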
🔍 Data Exploration & Cleaning
Phase 1: Data Exploration
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def assess_data_quality(df, name="dataset"):
    """Comprehensive data quality report."""
    print(f"\n{'='*60}")
    print(f"DATA QUALITY REPORT: {name}")
    print(f"{'='*60}")

    total = len(df)
    print(f"\n📊 SHAPE: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"📦 MEMORY: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / total * 100).round(1)
    missing_df = pd.DataFrame({'Missing': missing, '%': missing_pct})
    missing_df = missing_df[missing_df['Missing'] > 0].sort_values('%', ascending=False)

    if len(missing_df) > 0:
        print(f"\n⚠️ MISSING VALUES ({len(missing_df)} columns):")
        print(missing_df.to_string())
    else:
        print("\n✅ NO MISSING VALUES")

    # Duplicates
    dups = df.duplicated().sum()
    print(f"\n{'⚠️' if dups > 0 else '✅'} DUPLICATES: {dups:,} ({dups/total*100:.1f}%)")

    # Data types
    print(f"\n📋 DATA TYPES:")
    for dtype, count in df.dtypes.value_counts().items():
        print(f"  {dtype}: {count} columns")

    return {'shape': df.shape, 'missing': missing_df, 'duplicates': dups}

# Usage
# df = pd.read_csv('data/raw/sales_data.csv')
# quality = assess_data_quality(df, 'Sales Data')
```
Phase 2: Data Cleaning
```python
import logging
import pandas as pd  # needed for pd.to_datetime in fix_data_types

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class DataCleaner:
    """Reusable data cleaning pipeline with logging."""

    def __init__(self, df):
        self.df = df.copy()
        self.original_shape = df.shape
        self.log = []

    def remove_duplicates(self):
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        removed = before - len(self.df)
        self._log(f"Removed {removed:,} duplicates")
        return self

    def handle_missing(self, strategy='smart'):
        for col in self.df.columns:
            missing = self.df[col].isnull().sum()
            if missing == 0:
                continue

            pct = missing / len(self.df) * 100

            if pct > 50:
                self.df = self.df.drop(columns=[col])
                self._log(f"Dropped column '{col}' ({pct:.0f}% missing)")
            elif self.df[col].dtype in ['float64', 'int64']:
                median_val = self.df[col].median()
                self.df[col] = self.df[col].fillna(median_val)
                self._log(f"Filled '{col}' with median ({median_val:.2f})")
            else:
                mode_val = self.df[col].mode()[0]
                self.df[col] = self.df[col].fillna(mode_val)
                self._log(f"Filled '{col}' with mode ({mode_val})")
        return self

    def fix_data_types(self, date_columns=None, category_columns=None):
        if date_columns:
            for col in date_columns:
                self.df[col] = pd.to_datetime(self.df[col], errors='coerce')
                self._log(f"Converted '{col}' to datetime")
        if category_columns:
            for col in category_columns:
                self.df[col] = self.df[col].astype('category')
                self._log(f"Converted '{col}' to category")
        return self

    def remove_outliers(self, columns, method='iqr', threshold=1.5):
        before = len(self.df)
        for col in columns:
            if method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                self.df = self.df[
                    (self.df[col] >= Q1 - threshold * IQR) &
                    (self.df[col] <= Q3 + threshold * IQR)
                ]
        removed = before - len(self.df)
        self._log(f"Removed {removed:,} outliers using {method.upper()}")
        return self

    def get_result(self):
        final_shape = self.df.shape
        print(f"\n📊 Cleaning Summary:")
        print(f"  Before: {self.original_shape}")
        print(f"  After:  {final_shape}")
        print(f"\n📋 Steps:")
        for step in self.log:
            print(f"  • {step}")
        return self.df

    def _log(self, message):
        self.log.append(message)
        logger.info(message)

# Usage
# cleaned_df = (DataCleaner(raw_df)
#               .remove_duplicates()
#               .handle_missing(strategy='smart')
#               .fix_data_types(date_columns=['order_date'], category_columns=['region'])
#               .remove_outliers(['revenue', 'quantity'])
#               .get_result())
```
Checkpoint
A data cleaning pipeline should be chainable (method chaining), logged, and change-tracked — this keeps it reproducible and auditable. Why should a cleaning pipeline log every step instead of only the final result?
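The audit-trail idea in miniature: each chained step appends to a log, so the final DataFrame always ships with its provenance. This is a condensed stand-in for the `DataCleaner` class above, with invented method names (`dedupe`, `fill_median`):

```python
import pandas as pd

class MiniCleaner:
    """Condensed chainable cleaner that records every step it takes."""
    def __init__(self, df):
        self.df = df.copy()
        self.log = []

    def dedupe(self):
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        self.log.append(f"dedupe: removed {before - len(self.df)} rows")
        return self  # returning self enables method chaining

    def fill_median(self, col):
        med = self.df[col].median()
        self.df[col] = self.df[col].fillna(med)
        self.log.append(f"fill_median: {col} <- {med}")
        return self

raw = pd.DataFrame({'revenue': [100.0, 100.0, None, 300.0]})
cleaner = MiniCleaner(raw).dedupe().fill_median('revenue')
print(cleaner.log)
```

If a downstream number looks wrong, the log tells you exactly which transformation to re-examine — that is what "only the final result" cannot do.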
📊 Core Analysis & Portfolio
Revenue Analysis
```python
import pandas as pd

class RevenueAnalyzer:
    def __init__(self, df, date_col='date', revenue_col='revenue'):
        self.df = df
        self.date_col = date_col
        self.revenue_col = revenue_col

    def monthly_trend(self):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        monthly = self.df.set_index(self.date_col).resample('M')[self.revenue_col].sum()
        monthly_growth = monthly.pct_change() * 100
        return pd.DataFrame({'revenue': monthly, 'growth_pct': monthly_growth})

    def segment_analysis(self, segment_col):
        return self.df.groupby(segment_col)[self.revenue_col].agg(['sum', 'mean', 'count'])
```
Cohort Analysis
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

class CohortAnalyzer:
    def __init__(self, df, customer_col, date_col, revenue_col):
        self.df = df.copy()
        self.customer_col = customer_col
        self.date_col = date_col
        self.revenue_col = revenue_col

    def build_cohort(self):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        self.df['order_month'] = self.df[self.date_col].dt.to_period('M')

        # First purchase month = cohort
        first_purchase = self.df.groupby(self.customer_col)['order_month'].min()
        self.df = self.df.merge(
            first_purchase.rename('cohort'), on=self.customer_col
        )
        self.df['period_number'] = (
            self.df['order_month'].astype(int) - self.df['cohort'].astype(int)
        )

        cohort_data = self.df.groupby(['cohort', 'period_number'])[self.customer_col].nunique()
        cohort_table = cohort_data.unstack()
        cohort_sizes = cohort_table.iloc[:, 0]
        retention = cohort_table.divide(cohort_sizes, axis=0) * 100
        return retention

    def plot_retention(self, retention):
        plt.figure(figsize=(12, 8))
        sns.heatmap(retention, annot=True, fmt='.0f', cmap='YlOrRd_r',
                    vmin=0, vmax=100, cbar_kws={'label': 'Retention %'})
        plt.title('Cohort Retention Analysis', fontsize=14, fontweight='bold')
        plt.xlabel('Months Since First Purchase')
        plt.ylabel('Cohort')
        plt.tight_layout()
        plt.show()
```
RFM Segmentation
```python
import pandas as pd

class RFMAnalyzer:
    def __init__(self, df, customer_col, date_col, revenue_col):
        self.df = df
        self.customer_col = customer_col
        self.date_col = date_col
        self.revenue_col = revenue_col

    def calculate_rfm(self, analysis_date=None):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        if analysis_date is None:
            analysis_date = self.df[self.date_col].max() + pd.Timedelta(days=1)

        rfm = self.df.groupby(self.customer_col).agg({
            self.date_col: lambda x: (analysis_date - x.max()).days,
            self.revenue_col: ['count', 'sum']
        })
        rfm.columns = ['recency', 'frequency', 'monetary']

        for col in ['recency', 'frequency', 'monetary']:
            rfm[f'{col}_score'] = pd.qcut(
                rfm[col].rank(method='first'),
                q=5, labels=[5, 4, 3, 2, 1] if col == 'recency' else [1, 2, 3, 4, 5]
            ).astype(int)

        rfm['rfm_score'] = (rfm['recency_score'] * 100 +
                            rfm['frequency_score'] * 10 +
                            rfm['monetary_score'])

        def assign_segment(row):
            r, f, m = row['recency_score'], row['frequency_score'], row['monetary_score']
            if r >= 4 and f >= 4:
                return 'Champions'
            elif r >= 3 and f >= 3:
                return 'Loyal'
            elif r >= 4 and f <= 2:
                return 'New Customers'
            elif r <= 2 and f >= 3:
                return 'At Risk'
            elif r <= 2 and f <= 2:
                return 'Lost'
            else:
                return 'Potential'

        rfm['segment'] = rfm.apply(assign_segment, axis=1)
        return rfm
```
Executive Summary & Dashboard
```python
# Executive summary template
executive_summary = """
╔══════════════════════════════════════════════════════════╗
║                    EXECUTIVE SUMMARY                     ║
╠══════════════════════════════════════════════════════════╣
║  Period: Q1 2024  |  Prepared by: [Your Name]            ║
║                                                          ║
║  KEY FINDINGS:                                           ║
║  1. Revenue grew 15% YoY ($8.2M)                         ║
║  2. Customer retention improved 5pp (78% → 83%)          ║
║  3. Champions segment contributes 52% of revenue         ║
║                                                          ║
║  RECOMMENDATIONS:                                        ║
║  1. Invest $200K in loyalty program for At Risk segment  ║
║  2. Expand product line driving 40% of growth            ║
║  3. Launch re-engagement campaign for Lost customers     ║
║                                                          ║
║  EXPECTED IMPACT: +$1.2M annual revenue                  ║
╚══════════════════════════════════════════════════════════╝
"""

# Streamlit dashboard structure
streamlit_code = '''
import streamlit as st
import pandas as pd
import plotly.express as px

st.set_page_config(page_title="Analytics Dashboard", layout="wide")
st.title("📊 Data Analytics Dashboard")

# Sidebar filters
st.sidebar.header("Filters")
date_range = st.sidebar.date_input("Date Range", [])
segment = st.sidebar.multiselect("Segment", ["Champions", "Loyal", "At Risk", "Lost"])

# KPI Cards
col1, col2, col3, col4 = st.columns(4)
col1.metric("Revenue", "$8.2M", "+15%")
col2.metric("Customers", "15,200", "+22%")
col3.metric("Retention", "83%", "+5pp")
col4.metric("NPS", "48", "-2")

# Charts
tab1, tab2, tab3 = st.tabs(["Revenue", "Cohort", "RFM"])
with tab1:
    fig = px.line(monthly_data, x="month", y="revenue", title="Monthly Revenue Trend")
    st.plotly_chart(fig, use_container_width=True)
with tab2:
    st.subheader("Cohort Retention Heatmap")
    # Display retention heatmap
with tab3:
    fig = px.scatter(rfm_data, x="frequency", y="monetary", color="segment",
                     size="recency", title="RFM Segments")
    st.plotly_chart(fig, use_container_width=True)
'''
print(executive_summary)
```
GitHub Portfolio README
```markdown
# Data Analytics Capstone Project

Badges: [Python 3.11] [License: MIT]

## Overview
End-to-end data analytics project covering data exploration, cleaning,
RFM segmentation, cohort analysis, and automated executive reporting.

## Key Results
- 15% revenue growth identified and attributed to new product line
- RFM segmentation revealed 52% revenue from Champions (12% of customers)
- Cohort analysis showed retention improved 5pp after loyalty program
- Automated reports reduced manual reporting time by 8 hours/week

## Tech Stack
Python | Pandas | Scikit-learn | Matplotlib | Plotly | Streamlit | Jinja2

## How to Run
    git clone [repo-url]
    pip install -r requirements.txt
    streamlit run dashboard/app.py
```
Project Checklist
```
✅ Phase 1: Data Exploration
   □ Load and inspect dataset
   □ Data quality assessment
   □ Initial visualizations

✅ Phase 2: Data Cleaning
   □ Handle missing values
   □ Remove duplicates
   □ Fix data types
   □ Handle outliers

✅ Phase 3: Core Analysis
   □ Revenue trend analysis
   □ Cohort retention analysis
   □ RFM segmentation
   □ Funnel analysis (if applicable)

✅ Phase 4: Reporting & Presentation
   □ Executive summary
   □ Streamlit dashboard
   □ Automated report script
   □ GitHub README

✅ Phase 5: Polish
   □ Code documentation
   □ Unit tests
   □ Clean commit history
   □ Professional README with badges
```
Checkpoint
The capstone project has 5 deliverables (notebook, report, dashboard, automation script, GitHub repo). A professional portfolio needs clean code, documentation, and a results-oriented README. Should a portfolio project emphasize technical complexity or business impact?
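To see the quintile scoring used in `RFMAnalyzer.calculate_rfm` in isolation: `pd.qcut` over ranks splits customers into five equal bins, with the labels reversed for recency because a *low* recency (bought recently) should earn a *high* score. A self-contained sketch on synthetic values:

```python
import pandas as pd

# Ten synthetic recency values (days since last purchase), already sorted
recency = pd.Series([2, 5, 9, 15, 30, 45, 60, 90, 120, 200])

# Rank first so ties split deterministically, then cut into 5 equal bins;
# labels run 5..1 so the most recent buyers get the top score
recency_score = pd.qcut(recency.rank(method='first'),
                        q=5, labels=[5, 4, 3, 2, 1]).astype(int)
print(recency_score.tolist())  # → [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]
```

Ranking before `qcut` is the same trick the lesson's class uses: it guarantees the cut succeeds even when many customers share identical raw values.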
📋 Summary
What this lesson covered
| Topic | Key content |
|---|---|
| Report Templates | Jinja2, HTML email, openpyxl Excel, ReportLab PDF |
| Delivery | SMTP email, Slack/Teams webhooks |
| Scheduling | APScheduler, GitHub Actions cron workflows |
| Project Setup | Repository structure, setup script, requirements |
| Data Cleaning | DataCleaner pipeline, logging, method chaining |
| Analysis | Revenue, Cohort, RFM, Funnel analysis classes |
| Dashboard | Streamlit app with filters, metrics, charts |
| Portfolio | GitHub README, project checklist, presentation |
Advanced Data Analysis course recap
| Module | Skills |
|---|---|
| L01-L03 | Python foundations, Pandas, data manipulation |
| L04-L06 | Statistical analysis, hypothesis testing, AB testing |
| L07-L08 | Time series analysis, customer segmentation (RFM, K-Means) |
| L09-L10 | Funnel analysis, predictive analytics basics |
| L11-L12 | Data storytelling, executive reporting, automated reports, capstone |
Self-check questions
- What are the steps of an automated reporting pipeline?
- How does a Streamlit dashboard differ from a static report?
- What elements does a portfolio project need?
- How is GitHub Actions used for scheduling?
You have completed the Advanced Data Analysis course! You now have the full skill set, from data wrangling → statistical analysis → predictive modeling → storytelling → automated reporting. Apply all of it in the capstone project to build an impressive portfolio.
Previous lesson: Data Storytelling & Executive Reporting
