
Automated Reports & Portfolio Capstone

Build an automated reporting system and complete the portfolio capstone project


🎯 Lesson Objectives

After this lesson, you will be able to:
  • Create report templates with Jinja2, HTML, Excel (openpyxl), and PDF (ReportLab)
  • Design an email delivery system and integrate with Slack/Teams
  • Schedule automated reports with APScheduler and GitHub Actions
  • Set up the capstone project with a standard repository structure
  • Perform Data Exploration, Cleaning, Analysis, and Reporting
  • Build a portfolio-ready project on GitHub

Lesson Information

⏱️ Duration: 3 hours | 📊 Level: Advanced | 🛠️ Tools: Python, openpyxl, ReportLab, Jinja2, Streamlit

1. 📖 Key Terminology
Term | Vietnamese | Description
Jinja2 | Template engine | Python template engine for dynamic report generation
openpyxl | Thư viện Excel | Python library for reading and writing Excel (.xlsx) files
ReportLab | Thư viện PDF | Python library for creating PDF documents
SMTP | Giao thức email | Simple Mail Transfer Protocol for email delivery
APScheduler | Lập lịch Python | Advanced Python Scheduler for cron-based tasks
ETL | Trích xuất - Chuyển đổi - Tải | Extract, Transform, Load pipeline pattern
RFM | Phân nhóm khách hàng | Recency, Frequency, Monetary segmentation
Cohort Analysis | Phân tích nhóm | Tracking user groups over time
Streamlit | Framework dashboard | Python framework for quickly building web dashboards
Portfolio | Hồ sơ năng lực | Collection of projects demonstrating skills

Checkpoint

An automated reporting pipeline combines templates (Jinja2), output formats (Excel/PDF/HTML), delivery (SMTP/Slack), and scheduling (APScheduler/cron). How does an ETL pipeline differ from a report generation pipeline?
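The checkpoint's distinction can be sketched in a few lines of plain Python (function and field names here are invented for illustration): an ETL pipeline ends by loading clean data into storage, while a report pipeline starts from stored data and ends with a human-readable artifact.

```python
# Hypothetical minimal pipelines; names and data are made up for illustration.

def etl_pipeline(raw_rows):
    """Extract -> Transform -> Load: ends by persisting clean data."""
    extracted = [r for r in raw_rows if r is not None]                        # Extract
    transformed = [{**r, 'revenue': float(r['revenue'])} for r in extracted]  # Transform
    warehouse = {'sales': transformed}                                        # Load (stand-in for a DB write)
    return warehouse

def report_pipeline(warehouse):
    """Read persisted data -> aggregate -> format: ends with a readable artifact."""
    total = sum(row['revenue'] for row in warehouse['sales'])
    return f"Total revenue: ${total:,.0f}"

raw = [{'revenue': '1200'}, None, {'revenue': '800'}]
wh = etl_pipeline(raw)
print(report_pipeline(wh))  # → Total revenue: $2,000
```

The ETL step could run hourly and write to a database, while the report step runs on its own schedule and only reads.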

2. 📝 Report Templates & Output Formats

Jinja2 Template Engine

Python
from jinja2 import Template
from datetime import datetime

# Text report template
text_template = Template("""
{{ '=' * 60 }}
{{ title | center(60) }}
{{ report_date | center(60) }}
{{ '=' * 60 }}

KEY METRICS
{% for metric in metrics %}
{{ metric.status }} {{ metric.name }}: {{ metric.value }}
   Target: {{ metric.target }} | vs Target: {{ metric.vs_target }}
{% endfor %}

TOP HIGHLIGHTS
{% for h in highlights %}
{{ loop.index }}. {{ h }}
{% endfor %}

ACTION ITEMS
{% for action in actions %}
[{{ action.priority }}] {{ action.task }}
   Owner: {{ action.owner }} | Due: {{ action.due }}
{% endfor %}
{{ '=' * 60 }}
""")

# HTML email template, described in outline here; in a real report this string
# would contain actual HTML markup using the same Jinja2 variables:
# - Header: dark background (#2c3e50), white text, centered title + date
# - Body: full-width table with metrics (Metric | Value | Target | Status)
# - Styling: alternating row colors, bold values, colored status indicators
# - Footer: light gray background, "Auto-generated report" disclaimer
# - Uses Jinja2 variables: title, report_date, metrics list
# - Uses a Jinja2 loop: {% for metric in metrics %}

# Generate reports
data = {
    'title': 'Weekly Sales Report',
    'report_date': datetime.now().strftime('%B %d, %Y'),
    'metrics': [
        {'name': 'Revenue', 'value': '$2.5M', 'target': '$2.3M', 'vs_target': '+8.7%', 'status': '✅'},
        {'name': 'Orders', 'value': '12,500', 'target': '11,000', 'vs_target': '+13.6%', 'status': '✅'},
        {'name': 'Conversion', 'value': '3.2%', 'target': '3.5%', 'vs_target': '-8.6%', 'status': '⚠️'},
    ],
    'highlights': ['Record Thursday revenue', 'Mobile app orders grew 25% MoM'],
    'actions': [
        {'task': 'Fix mobile checkout UX', 'owner': 'Product', 'due': '3/22', 'priority': 'HIGH'},
        {'task': 'Launch retargeting', 'owner': 'Marketing', 'due': '3/20', 'priority': 'MEDIUM'},
    ]
}

print(text_template.render(**data))
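The HTML email template is only described in outline above; a minimal concrete sketch of what such a template could look like (markup, colors, and layout are illustrative, not the course's exact template):

```python
from jinja2 import Template

# Illustrative HTML email template following the outlined structure
html_template = Template("""\
<html><body style="font-family: Arial, sans-serif;">
  <div style="background:#2c3e50; color:#ffffff; padding:16px; text-align:center;">
    <h2>{{ title }}</h2>
    <p>{{ report_date }}</p>
  </div>
  <table width="100%" cellpadding="8" border="1" style="border-collapse:collapse;">
    <tr><th>Metric</th><th>Value</th><th>Target</th><th>Status</th></tr>
    {% for metric in metrics %}
    <tr>
      <td>{{ metric.name }}</td>
      <td><b>{{ metric.value }}</b></td>
      <td>{{ metric.target }}</td>
      <td>{{ metric.status }}</td>
    </tr>
    {% endfor %}
  </table>
  <div style="background:#ecf0f1; padding:8px; font-size:12px;">
    Auto-generated report. Do not reply.
  </div>
</body></html>
""")

html = html_template.render(
    title='Weekly Sales Report',
    report_date='March 18, 2024',
    metrics=[{'name': 'Revenue', 'value': '$2.5M', 'target': '$2.3M', 'status': 'OK'}],
)
print(html[:60])
```

The rendered string can be passed as the `html_body` of the email sender shown later in the lesson.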

Excel Reports with openpyxl

Python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.chart import BarChart, Reference
from openpyxl.utils import get_column_letter

def create_excel_report(data, filename='report.xlsx'):
    wb = Workbook()

    # --- Summary Sheet ---
    ws = wb.active
    ws.title = 'Executive Summary'

    # Header
    ws.merge_cells('A1:E1')
    ws['A1'] = 'Weekly Executive Report'
    ws['A1'].font = Font(size=16, bold=True, color='FFFFFF')
    ws['A1'].fill = PatternFill(start_color='2C3E50', fill_type='solid')
    ws['A1'].alignment = Alignment(horizontal='center')

    # Column headers
    headers = ['KPI', 'Current', 'Target', 'vs Target', 'Status']
    for i, header in enumerate(headers, 1):
        cell = ws.cell(row=3, column=i, value=header)
        cell.font = Font(bold=True, color='FFFFFF')
        cell.fill = PatternFill(start_color='34495E', fill_type='solid')

    # Data rows
    status_colors = {'✅': '27AE60', '⚠️': 'F39C12', '❌': 'E74C3C'}
    for row_idx, metric in enumerate(data['metrics'], 4):
        ws.cell(row=row_idx, column=1, value=metric['name'])
        ws.cell(row=row_idx, column=2, value=metric['value'])
        ws.cell(row=row_idx, column=3, value=metric['target'])
        ws.cell(row=row_idx, column=4, value=metric['vs_target'])
        status_cell = ws.cell(row=row_idx, column=5, value=metric['status'])
        color = status_colors.get(metric['status'], 'BDC3C7')
        status_cell.fill = PatternFill(start_color=color, fill_type='solid')

    # Set uniform column widths
    for col in range(1, 6):
        ws.column_dimensions[get_column_letter(col)].width = 18

    # --- Chart Sheet ---
    ws2 = wb.create_sheet('Charts')
    chart_data = [['Category', 'Revenue'], ['Product A', 45], ['Product B', 38],
                  ['Product C', 29], ['Product D', 18], ['Product E', 12]]
    for row in chart_data:
        ws2.append(row)

    chart = BarChart()
    chart.title = 'Revenue by Product'
    chart.y_axis.title = 'Revenue ($K)'
    cats = Reference(ws2, min_col=1, min_row=2, max_row=6)
    vals = Reference(ws2, min_col=2, min_row=1, max_row=6)
    chart.add_data(vals, titles_from_data=True)
    chart.set_categories(cats)
    ws2.add_chart(chart, 'D2')

    wb.save(filename)
    print(f"Excel report saved: {filename}")

create_excel_report(data)

PDF Reports with ReportLab

Python
from reportlab.lib.pagesizes import A4
from reportlab.lib import colors
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer

def create_pdf_report(data, filename='report.pdf'):
    doc = SimpleDocTemplate(filename, pagesize=A4)
    styles = getSampleStyleSheet()
    elements = []

    # Title
    title_style = ParagraphStyle('CustomTitle', parent=styles['Title'],
                                 fontSize=20, spaceAfter=30)
    elements.append(Paragraph(data['title'], title_style))
    elements.append(Spacer(1, 12))

    # Metrics table
    table_data = [['KPI', 'Current', 'Target', 'Status']]
    for m in data['metrics']:
        table_data.append([m['name'], m['value'], m['target'], m['status']])

    table = Table(table_data, colWidths=[150, 100, 100, 80])
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#2C3E50')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#F8F9FA')]),
    ]))
    elements.append(table)
    elements.append(Spacer(1, 20))

    # Action items
    elements.append(Paragraph('Action Items', styles['Heading2']))
    for action in data['actions']:
        elements.append(Paragraph(
            f"• [{action['priority']}] {action['task']} — Owner: {action['owner']}, Due: {action['due']}",
            styles['Normal']
        ))

    doc.build(elements)
    print(f"PDF report saved: {filename}")

create_pdf_report(data)

Checkpoint

Jinja2 separates logic from presentation, openpyxl builds professional Excel reports with charts, and ReportLab generates PDF documents programmatically. When should you choose an Excel report over a PDF report?

3. 📮 Report Delivery & Scheduling

Email Delivery

Python
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders

class EmailReporter:
    def __init__(self, smtp_server, smtp_port, username, password):
        self.smtp_server = smtp_server
        self.smtp_port = smtp_port
        self.username = username
        self.password = password

    def send_report(self, to_list, subject, html_body, attachments=None):
        msg = MIMEMultipart()
        msg['From'] = self.username
        msg['To'] = ', '.join(to_list)
        msg['Subject'] = subject
        msg.attach(MIMEText(html_body, 'html'))

        if attachments:
            for filepath in attachments:
                with open(filepath, 'rb') as f:
                    part = MIMEBase('application', 'octet-stream')
                    part.set_payload(f.read())
                encoders.encode_base64(part)
                part.add_header('Content-Disposition',
                                f'attachment; filename={filepath.split("/")[-1]}')
                msg.attach(part)

        with smtplib.SMTP(self.smtp_server, self.smtp_port) as server:
            server.starttls()
            server.login(self.username, self.password)
            server.send_message(msg)
        print(f"Email sent to {len(to_list)} recipients")

# Usage
# reporter = EmailReporter('smtp.gmail.com', 587, 'you@gmail.com', 'app_password')
# reporter.send_report(['ceo@company.com'], 'Weekly Report', html_content, ['report.xlsx'])

Slack & Teams Integration

Python
import requests

class SlackReporter:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url

    def send_report(self, title, metrics, status='good'):
        color_map = {'good': '#27ae60', 'warning': '#f39c12', 'danger': '#e74c3c'}
        fields = [{'title': m['name'], 'value': f"{m['value']} (Target: {m['target']})",
                   'short': True} for m in metrics]

        payload = {'attachments': [{'color': color_map.get(status, '#3498db'),
                                    'title': title, 'fields': fields}]}
        resp = requests.post(self.webhook_url, json=payload, timeout=10)
        resp.raise_for_status()  # fail loudly if the webhook rejects the payload

class TeamsReporter:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url

    def send_report(self, title, summary, metrics):
        facts = [{'name': m['name'], 'value': f"{m['value']} ({m['status']})"}
                 for m in metrics]

        card = {
            '@type': 'MessageCard', 'summary': title, 'themeColor': '2C3E50',
            'title': title,
            'sections': [{'activityTitle': summary, 'facts': facts}]
        }
        resp = requests.post(self.webhook_url, json=card, timeout=10)
        resp.raise_for_status()

Scheduling with APScheduler

Python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def daily_report():
    print("Running daily report...")

def weekly_report():
    print("Running weekly report...")

def monthly_report():
    print("Running monthly report...")

scheduler = BackgroundScheduler()
scheduler.add_job(daily_report, CronTrigger(hour=7, minute=0), id='daily')
scheduler.add_job(weekly_report, CronTrigger(day_of_week='mon', hour=8), id='weekly')
scheduler.add_job(monthly_report, CronTrigger(day=1, hour=9), id='monthly')
# scheduler.start()

GitHub Actions Workflow

Example

# .github/workflows/automated-report.yml
name: Automated Report Generation
on:
  schedule:
    - cron: '0 7 * * 1-5'  # Weekdays at 7:00 UTC
    - cron: '0 8 * * 1'    # Monday at 8:00 UTC (weekly)
  workflow_dispatch:       # Manual trigger

jobs:
  generate-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python generate_report.py
        env:
          DB_URL: ${{ secrets.DATABASE_URL }}
          SMTP_PASSWORD: ${{ secrets.SMTP_PASSWORD }}
      - uses: actions/upload-artifact@v4
        with:
          name: reports
          path: output/
Security Best Practice

Never hardcode passwords in code. Use environment variables or a secrets manager (GitHub Secrets, AWS SSM, Azure Key Vault).
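A minimal sketch of that practice using the standard library (`load_secret` and the variable names are hypothetical helpers, not part of the lesson's code):

```python
import os

def load_secret(name, default=None):
    """Read a secret from the environment instead of hardcoding it in source."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required secret: {name}")
    return value

# In production these would be injected by GitHub Secrets / AWS SSM / Azure Key
# Vault as environment variables, and never committed to the repository.
os.environ.setdefault("SMTP_PASSWORD", "example-only")  # demo value for this sketch
print(load_secret("SMTP_PASSWORD"))
```

The same helper works unchanged in the GitHub Actions workflow above, because the `env:` block exposes secrets to the script as ordinary environment variables.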

Checkpoint

Reports must reach the right people at the right time: email for formal reports, Slack for quick alerts, and cron/APScheduler for automatic delivery. Why run scheduling on GitHub Actions instead of a cron job on a local machine?

4. 🏗️ Capstone Project Setup

Project Overview

Example

PORTFOLIO CAPSTONE PROJECT
5 DELIVERABLES:
  1. 📓 Analysis Notebook
  2. 📊 Executive Report (PDF/PowerPoint)
  3. 🖥️ Interactive Dashboard (Streamlit)
  4. 🤖 Automated Report Script
  5. 📁 GitHub Repository (portfolio-ready)

Repository Structure

Example

data-analytics-capstone/
├── README.md
├── requirements.txt
├── .gitignore
├── setup_project.py
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 01_exploration.ipynb
│   ├── 02_cleaning.ipynb
│   ├── 03_analysis.ipynb
│   └── 04_visualization.ipynb
├── src/
│   ├── __init__.py
│   ├── data_loader.py
│   ├── cleaner.py
│   ├── analyzers.py
│   └── reporters.py
├── reports/
│   ├── templates/
│   └── output/
├── dashboard/
│   └── app.py
└── tests/
    └── test_analyzers.py

Setup Script

Python
import os

def setup_project(project_name="data-analytics-capstone"):
    """Create complete project structure."""
    directories = [
        'data/raw', 'data/processed', 'notebooks', 'src',
        'reports/templates', 'reports/output', 'dashboard', 'tests'
    ]
    for d in directories:
        os.makedirs(os.path.join(project_name, d), exist_ok=True)

    # README.md
    readme = f"""# {project_name}
## Overview
End-to-end data analytics project: exploration - cleaning - analysis - reporting.

## Quick Start
    pip install -r requirements.txt
    streamlit run dashboard/app.py

## Project Structure
- notebooks/: Jupyter notebooks for each phase
- src/: Reusable Python modules
- reports/: Generated reports
- dashboard/: Streamlit dashboard
"""

    # requirements.txt
    requirements = """pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
scikit-learn>=1.3.0
openpyxl>=3.1.0
jinja2>=3.1.0
streamlit>=1.28.0
plotly>=5.18.0
"""

    # .gitignore
    gitignore = """__pycache__/
*.pyc
.env
data/raw/*.csv
*.xlsx
.DS_Store
.ipynb_checkpoints/
"""

    files = {
        'README.md': readme,
        'requirements.txt': requirements,
        '.gitignore': gitignore,
        'src/__init__.py': '# Data Analytics Capstone\n'
    }
    for filepath, content in files.items():
        with open(os.path.join(project_name, filepath), 'w') as f:
            f.write(content)

    print(f"✅ Project '{project_name}' created!")
    print(f"📁 Directories: {len(directories)}")
    print(f"📄 Files: {len(files)}")

setup_project()

Checkpoint

A professional project structure is reproducible, shareable, and portfolio-ready: include a README, requirements.txt, and .gitignore, and keep source code (src/) separate from notebooks and reports. Why split code into modules under src/ instead of writing everything in notebooks?

5. 🔍 Data Exploration & Cleaning

Phase 1: Data Exploration

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def assess_data_quality(df, name="dataset"):
    """Comprehensive data quality report."""
    print(f"\n{'='*60}")
    print(f"DATA QUALITY REPORT: {name}")
    print(f"{'='*60}")

    total = len(df)
    print(f"\n📊 SHAPE: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"📦 MEMORY: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / total * 100).round(1)
    missing_df = pd.DataFrame({'Missing': missing, '%': missing_pct})
    missing_df = missing_df[missing_df['Missing'] > 0].sort_values('%', ascending=False)

    if len(missing_df) > 0:
        print(f"\n⚠️ MISSING VALUES ({len(missing_df)} columns):")
        print(missing_df.to_string())
    else:
        print("\n✅ NO MISSING VALUES")

    # Duplicates
    dups = df.duplicated().sum()
    print(f"\n{'⚠️' if dups > 0 else '✅'} DUPLICATES: {dups:,} ({dups/total*100:.1f}%)")

    # Data types
    print("\n📋 DATA TYPES:")
    for dtype, count in df.dtypes.value_counts().items():
        print(f"  {dtype}: {count} columns")

    return {'shape': df.shape, 'missing': missing_df, 'duplicates': dups}

# Usage
# df = pd.read_csv('data/raw/sales_data.csv')
# quality = assess_data_quality(df, 'Sales Data')
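A tiny self-contained illustration of the two core checks above, missing values and duplicates, on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Toy DataFrame (invented values): one duplicate row, one NaN, one None
df = pd.DataFrame({
    'order_id': [1, 2, 2, 4],
    'revenue': [100.0, 250.0, 250.0, np.nan],
    'region': ['North', 'South', 'South', None],
})

missing = df.isnull().sum()   # per-column count of NaN/None values
dups = df.duplicated().sum()  # count of fully identical rows (first kept)

print(missing['revenue'], missing['region'], dups)  # → 1 1 1
```

Note that `duplicated()` only flags rows identical across every column; a near-duplicate (same `order_id`, different `revenue`) would pass this check.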

Phase 2: Data Cleaning

Python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class DataCleaner:
    """Reusable data cleaning pipeline with logging."""

    def __init__(self, df):
        self.df = df.copy()
        self.original_shape = df.shape
        self.log = []

    def remove_duplicates(self):
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        removed = before - len(self.df)
        self._log(f"Removed {removed:,} duplicates")
        return self

    def handle_missing(self, strategy='smart'):
        for col in self.df.columns:
            missing = self.df[col].isnull().sum()
            if missing == 0:
                continue

            pct = missing / len(self.df) * 100

            if pct > 50:
                self.df = self.df.drop(columns=[col])
                self._log(f"Dropped column '{col}' ({pct:.0f}% missing)")
            elif self.df[col].dtype in ['float64', 'int64']:
                median_val = self.df[col].median()
                self.df[col] = self.df[col].fillna(median_val)
                self._log(f"Filled '{col}' with median ({median_val:.2f})")
            else:
                mode_val = self.df[col].mode()[0]
                self.df[col] = self.df[col].fillna(mode_val)
                self._log(f"Filled '{col}' with mode ({mode_val})")
        return self

    def fix_data_types(self, date_columns=None, category_columns=None):
        if date_columns:
            for col in date_columns:
                self.df[col] = pd.to_datetime(self.df[col], errors='coerce')
                self._log(f"Converted '{col}' to datetime")
        if category_columns:
            for col in category_columns:
                self.df[col] = self.df[col].astype('category')
                self._log(f"Converted '{col}' to category")
        return self

    def remove_outliers(self, columns, method='iqr', threshold=1.5):
        before = len(self.df)
        for col in columns:
            if method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                self.df = self.df[
                    (self.df[col] >= Q1 - threshold * IQR) &
                    (self.df[col] <= Q3 + threshold * IQR)
                ]
        removed = before - len(self.df)
        self._log(f"Removed {removed:,} outliers using {method.upper()}")
        return self

    def get_result(self):
        final_shape = self.df.shape
        print("\n📊 Cleaning Summary:")
        print(f"  Before: {self.original_shape}")
        print(f"  After:  {final_shape}")
        print("\n📋 Steps:")
        for step in self.log:
            print(f"  • {step}")
        return self.df

    def _log(self, message):
        self.log.append(message)
        logger.info(message)

# Usage
# cleaned_df = (DataCleaner(raw_df)
#               .remove_duplicates()
#               .handle_missing(strategy='smart')
#               .fix_data_types(date_columns=['order_date'], category_columns=['region'])
#               .remove_outliers(['revenue', 'quantity'])
#               .get_result())
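The IQR rule inside `remove_outliers` can be seen in isolation with the standard library (toy numbers):

```python
from statistics import quantiles

# Toy values (invented): one obvious outlier at 95
values = [10, 11, 11, 12, 12, 13, 95]

q1, _, q3 = quantiles(values, n=4)       # quartiles of the sample
iqr = q3 - q1
lower = q1 - 1.5 * iqr                   # same 1.5x multiplier as threshold=1.5
upper = q3 + 1.5 * iqr
kept = [v for v in values if lower <= v <= upper]
print(kept)  # 95 falls outside the fence and is dropped
```

The same fences are what the pandas version computes with `quantile(0.25)` / `quantile(0.75)`; the exact quartile values can differ slightly because the two libraries use different interpolation methods.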

Checkpoint

A data cleaning pipeline should be chainable (method chaining), logged, and change-tracked, which keeps it reproducible and auditable. Why should a cleaning pipeline log every step instead of only producing the final result?

6. 📊 Core Analysis & Portfolio

Revenue Analysis

Python
import pandas as pd

class RevenueAnalyzer:
    def __init__(self, df, date_col='date', revenue_col='revenue'):
        self.df = df
        self.date_col = date_col
        self.revenue_col = revenue_col

    def monthly_trend(self):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        monthly = self.df.set_index(self.date_col).resample('M')[self.revenue_col].sum()
        monthly_growth = monthly.pct_change() * 100
        return pd.DataFrame({'revenue': monthly, 'growth_pct': monthly_growth})

    def segment_analysis(self, segment_col):
        return self.df.groupby(segment_col)[self.revenue_col].agg(['sum', 'mean', 'count'])
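The growth column in `monthly_trend` comes from `pct_change`; a toy illustration with invented figures:

```python
import pandas as pd

# Three months of made-up revenue
monthly = pd.Series([100.0, 110.0, 99.0],
                    index=pd.period_range('2024-01', periods=3, freq='M'))

growth = monthly.pct_change() * 100  # first month has no prior month -> NaN
print(growth.round(1).tolist())      # → [nan, 10.0, -10.0]
```

Reporting the NaN for the first period (rather than filling it with 0) keeps the chart honest: there simply is no prior month to compare against.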

Cohort Analysis

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

class CohortAnalyzer:
    def __init__(self, df, customer_col, date_col, revenue_col):
        self.df = df.copy()
        self.customer_col = customer_col
        self.date_col = date_col
        self.revenue_col = revenue_col

    def build_cohort(self):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        self.df['order_month'] = self.df[self.date_col].dt.to_period('M')

        # First purchase month = cohort
        first_purchase = self.df.groupby(self.customer_col)['order_month'].min()
        self.df = self.df.merge(
            first_purchase.rename('cohort'), on=self.customer_col
        )
        self.df['period_number'] = (
            self.df['order_month'].astype(int) - self.df['cohort'].astype(int)
        )

        cohort_data = self.df.groupby(['cohort', 'period_number'])[self.customer_col].nunique()
        cohort_table = cohort_data.unstack()
        cohort_sizes = cohort_table.iloc[:, 0]
        retention = cohort_table.divide(cohort_sizes, axis=0) * 100
        return retention

    def plot_retention(self, retention):
        plt.figure(figsize=(12, 8))
        sns.heatmap(retention, annot=True, fmt='.0f', cmap='YlOrRd_r',
                    vmin=0, vmax=100, cbar_kws={'label': 'Retention %'})
        plt.title('Cohort Retention Analysis', fontsize=14, fontweight='bold')
        plt.xlabel('Months Since First Purchase')
        plt.ylabel('Cohort')
        plt.tight_layout()
        plt.show()
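The `period_number` computation above subtracts monthly `Period` ordinals (the Series version via `astype(int)` yields the same ordinals); a minimal illustration:

```python
import pandas as pd

cohort = pd.Period('2024-01', freq='M')       # month of first purchase
order_month = pd.Period('2024-03', freq='M')  # month of a later order

# Monthly periods have consecutive integer ordinals, so subtraction
# gives "months since joining" directly.
print(order_month.ordinal - cohort.ordinal)  # → 2
```

That integer becomes the column index of the retention table: period 0 is the joining month (100% by construction), period 2 is two months later.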

RFM Segmentation

Python
import pandas as pd

class RFMAnalyzer:
    def __init__(self, df, customer_col, date_col, revenue_col):
        self.df = df
        self.customer_col = customer_col
        self.date_col = date_col
        self.revenue_col = revenue_col

    def calculate_rfm(self, analysis_date=None):
        self.df[self.date_col] = pd.to_datetime(self.df[self.date_col])
        if analysis_date is None:
            analysis_date = self.df[self.date_col].max() + pd.Timedelta(days=1)

        rfm = self.df.groupby(self.customer_col).agg({
            self.date_col: lambda x: (analysis_date - x.max()).days,
            self.revenue_col: ['count', 'sum']
        })
        rfm.columns = ['recency', 'frequency', 'monetary']

        for col in ['recency', 'frequency', 'monetary']:
            rfm[f'{col}_score'] = pd.qcut(
                rfm[col].rank(method='first'),
                q=5, labels=[5, 4, 3, 2, 1] if col == 'recency' else [1, 2, 3, 4, 5]
            ).astype(int)

        rfm['rfm_score'] = (rfm['recency_score'] * 100 +
                            rfm['frequency_score'] * 10 +
                            rfm['monetary_score'])

        def assign_segment(row):
            r, f, m = row['recency_score'], row['frequency_score'], row['monetary_score']
            if r >= 4 and f >= 4:
                return 'Champions'
            elif r >= 3 and f >= 3:
                return 'Loyal'
            elif r >= 4 and f <= 2:
                return 'New Customers'
            elif r <= 2 and f >= 3:
                return 'At Risk'
            elif r <= 2 and f <= 2:
                return 'Lost'
            else:
                return 'Potential'

        rfm['segment'] = rfm.apply(assign_segment, axis=1)
        return rfm
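The reversed quintile scoring for recency (lower recency = more recent = better) can be seen in isolation, with made-up values:

```python
import pandas as pd

# Invented recency values: days since each customer's last purchase
recency = pd.Series([2, 10, 30, 90, 200, 5, 45, 120, 8, 60])

# Rank first so qcut always gets unique bin edges, then label quintiles
# in reverse: the most recent customers land in the top-score bucket.
score = pd.qcut(recency.rank(method='first'), q=5,
                labels=[5, 4, 3, 2, 1]).astype(int)

print(score[recency.idxmin()], score[recency.idxmax()])  # → 5 1
```

Frequency and monetary use the ascending labels `[1, 2, 3, 4, 5]` instead, since for those metrics higher raw values are better.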

Executive Summary & Dashboard

Python
# Executive summary template
executive_summary = """
EXECUTIVE SUMMARY

Period: Q1 2024 | Prepared by: [Your Name]

KEY FINDINGS:
1. Revenue grew 15% YoY ($8.2M)
2. Customer retention improved 5pp (78% → 83%)
3. Champions segment contributes 52% of revenue

RECOMMENDATIONS:
1. Invest $200K in loyalty program for At Risk segment
2. Expand product line driving 40% of growth
3. Launch re-engagement campaign for Lost customers

EXPECTED IMPACT: +$1.2M annual revenue
"""

# Streamlit dashboard structure
streamlit_code = '''
import streamlit as st
import pandas as pd
import plotly.express as px

st.set_page_config(page_title="Analytics Dashboard", layout="wide")
st.title("📊 Data Analytics Dashboard")

# Sidebar filters
st.sidebar.header("Filters")
date_range = st.sidebar.date_input("Date Range", [])
segment = st.sidebar.multiselect("Segment", ["Champions", "Loyal", "At Risk", "Lost"])

# KPI Cards
col1, col2, col3, col4 = st.columns(4)
col1.metric("Revenue", "$8.2M", "+15%")
col2.metric("Customers", "15,200", "+22%")
col3.metric("Retention", "83%", "+5pp")
col4.metric("NPS", "48", "-2")

# Charts (monthly_data and rfm_data come from the analysis pipeline)
tab1, tab2, tab3 = st.tabs(["Revenue", "Cohort", "RFM"])
with tab1:
    fig = px.line(monthly_data, x="month", y="revenue", title="Monthly Revenue Trend")
    st.plotly_chart(fig, use_container_width=True)
with tab2:
    st.subheader("Cohort Retention Heatmap")
    # Display retention heatmap
with tab3:
    fig = px.scatter(rfm_data, x="frequency", y="monetary", color="segment",
                     size="recency", title="RFM Segments")
    st.plotly_chart(fig, use_container_width=True)
'''

print(executive_summary)

GitHub Portfolio README

Example

# Data Analytics Capstone Project

Badges: [Python 3.11] [License: MIT]

## Overview
End-to-end data analytics project covering data exploration, cleaning,
RFM segmentation, cohort analysis, and automated executive reporting.

## Key Results
- 15% revenue growth identified and attributed to new product line
- RFM segmentation revealed 52% revenue from Champions (12% of customers)
- Cohort analysis showed retention improved 5pp after loyalty program
- Automated reports reduced manual reporting time by 8 hours/week

## Tech Stack
Python | Pandas | Scikit-learn | Matplotlib | Plotly | Streamlit | Jinja2

## How to Run
    git clone [repo-url]
    pip install -r requirements.txt
    streamlit run dashboard/app.py

Project Checklist

Example

✅ Phase 1: Data Exploration
   □ Load and inspect dataset
   □ Data quality assessment
   □ Initial visualizations

✅ Phase 2: Data Cleaning
   □ Handle missing values
   □ Remove duplicates
   □ Fix data types
   □ Handle outliers

✅ Phase 3: Core Analysis
   □ Revenue trend analysis
   □ Cohort retention analysis
   □ RFM segmentation
   □ Funnel analysis (if applicable)

✅ Phase 4: Reporting & Presentation
   □ Executive summary
   □ Streamlit dashboard
   □ Automated report script
   □ GitHub README

✅ Phase 5: Polish
   □ Code documentation
   □ Unit tests
   □ Clean commit history
   □ Professional README with badges

Checkpoint

The capstone project has 5 deliverables (notebook, report, dashboard, automation script, GitHub repo). A professional portfolio needs clean code, documentation, and a results-oriented README. Should a portfolio project emphasize technical complexity or business impact?

7. 📋 Summary

What You Learned in This Lesson

Topic | Key Content
Report Templates | Jinja2, HTML email, openpyxl Excel, ReportLab PDF
Delivery | SMTP email, Slack/Teams webhooks
Scheduling | APScheduler, GitHub Actions cron workflows
Project Setup | Repository structure, setup script, requirements
Data Cleaning | DataCleaner pipeline, logging, method chaining
Analysis | Revenue, Cohort, RFM, Funnel analysis classes
Dashboard | Streamlit app with filters, metrics, charts
Portfolio | GitHub README, project checklist, presentation

Course Recap: Advanced Data Analysis

Module | Skills
L01-L03 | Python foundations, Pandas, data manipulation
L04-L06 | Statistical analysis, hypothesis testing, A/B testing
L07-L08 | Time series analysis, customer segmentation (RFM, K-Means)
L09-L10 | Funnel analysis, predictive analytics basics
L11-L12 | Data storytelling, executive reporting, automated reports, capstone

Self-Check Questions

  1. What steps make up an automated reporting pipeline?
  2. How does a Streamlit dashboard differ from a static report?
  3. What elements does a portfolio project need?
  4. How is GitHub Actions used for scheduling?

Congratulations! 🎉

You have completed the Advanced Data Analysis course! You now have the full skill set, from data wrangling → statistical analysis → predictive modeling → storytelling → automated reporting. Apply all of it to the capstone project to build an impressive portfolio.

Back to: Data Storytelling & Executive Reporting