Theory
Lesson 16/17

Portfolio Capstone Project

Final project: build a complete data analysis portfolio

1. Introduction

Capstone Project

This is the final project of the course, bringing together all the skills you have learned. You will carry out a complete, end-to-end data analysis project suitable for a job-application portfolio.

1.1 Project Overview

Text
┌──────────────────────────────────────────────────────────┐
│ CAPSTONE PROJECT STRUCTURE                               │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ 🎯 OBJECTIVE                                             │
│    Build a complete data analysis project that           │
│    demonstrates your skills to potential employers       │
│                                                          │
│ 📋 DELIVERABLES                                          │
│    1. Jupyter Notebook (analysis)                        │
│    2. Executive Report (PDF/HTML)                        │
│    3. Interactive Dashboard                              │
│    4. Automated Report Script                            │
│    5. GitHub Repository with documentation               │
│                                                          │
│ ⏱️ ESTIMATED TIME: 8-12 hours                            │
│                                                          │
│ 🛠️ SKILLS DEMONSTRATED                                   │
│    • Python (Pandas, NumPy, Matplotlib)                  │
│    • SQL (Window functions, CTEs, Optimization)          │
│    • Data Storytelling & Visualization                   │
│    • Business Analysis (Cohort, Funnel, Time Series)     │
│    • Automated Reporting                                 │
│                                                          │
└──────────────────────────────────────────────────────────┘

1.2 Dataset Options

Text
OPTION A: E-COMMERCE ANALYTICS
─────────────────────────────────────────────────────────
• Transactions, customers, products data
• Analysis: Revenue trends, cohort retention, RFM
• Business questions: Customer lifetime value, churn prediction

OPTION B: SAAS METRICS
─────────────────────────────────────────────────────────
• Subscription data, usage events, support tickets
• Analysis: MRR trends, feature adoption, churn
• Business questions: Product stickiness, expansion revenue

OPTION C: MARKETING CAMPAIGN
─────────────────────────────────────────────────────────
• Campaign performance, leads, conversions
• Analysis: Funnel analysis, attribution, ROI
• Business questions: Channel optimization, CAC/LTV

OPTION D: YOUR OWN DATA
─────────────────────────────────────────────────────────
• Any dataset you have access to
• Real-world data preferred
• Must have at least 3 tables and 10,000+ rows
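
If you don't have a dataset at hand, a minimal sketch for generating a synthetic e-commerce dataset matching the three-table layout used in the rest of this lesson. The schema, sizes, and distributions are all illustrative assumptions; run it after creating the folder structure in Section 2 so that data/raw/ exists:

Python
"""
generate_sample_data.py - Create a small synthetic e-commerce dataset.
A sketch only; table names, columns, and distributions are assumptions.
"""
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n_customers, n_products, n_orders = 2_000, 200, 12_000

customers = pd.DataFrame({
    'customer_id': np.arange(1, n_customers + 1),
    'segment': rng.choice(['Enterprise', 'SMB', 'Consumer'], n_customers),
    'signup_date': pd.Timestamp('2022-01-01')
                   + pd.to_timedelta(rng.integers(0, 730, n_customers), unit='D'),
})

products = pd.DataFrame({
    'product_id': np.arange(1, n_products + 1),
    'category': rng.choice(['Electronics', 'Clothing', 'Home', 'Sports'], n_products),
    'price': rng.uniform(5, 500, n_products).round(2),
})

orders = pd.DataFrame({
    'order_id': np.arange(1, n_orders + 1),
    'customer_id': rng.integers(1, n_customers + 1, n_orders),
    'product_id': rng.integers(1, n_products + 1, n_orders),
    'order_date': pd.Timestamp('2023-01-01')
                  + pd.to_timedelta(rng.integers(0, 365 * 24, n_orders), unit='h'),
    'order_total': rng.gamma(2.0, 60.0, n_orders).round(2),  # right-skewed totals
})

# Attach category so the exploration plots below can group by it
orders = orders.merge(products[['product_id', 'category']], on='product_id')

for name, df in {'orders': orders, 'customers': customers, 'products': products}.items():
    df.to_csv(f'data/raw/{name}.csv', index=False)
    print(f'Wrote data/raw/{name}.csv ({len(df):,} rows)')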

2. Project Setup

2.1 Repository Structure

Text
my-data-analysis-portfolio/
├── README.md                # Project overview
├── requirements.txt         # Dependencies
├── .gitignore
│
├── data/
│   ├── raw/                 # Original data
│   ├── processed/           # Cleaned data
│   └── external/            # External reference data
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_data_cleaning.ipynb
│   ├── 03_analysis.ipynb
│   └── 04_modeling.ipynb
│
├── src/
│   ├── __init__.py
│   ├── data_loader.py       # Data loading utilities
│   ├── analysis.py          # Analysis functions
│   ├── visualization.py     # Chart functions
│   └── reporting.py         # Report generation
│
├── reports/
│   ├── executive_summary.pdf
│   ├── detailed_analysis.html
│   └── figures/
│
├── dashboard/
│   ├── app.py               # Streamlit/Dash app
│   └── assets/
│
└── tests/
    └── test_analysis.py

2.2 Setup Script

Python
"""
setup_project.py - Initialize project structure
"""

from pathlib import Path


def setup_project(project_name: str):
    """Create project directory structure"""

    base_path = Path(project_name)

    # Define directory structure
    directories = [
        'data/raw',
        'data/processed',
        'data/external',
        'notebooks',
        'src',
        'reports/figures',
        'dashboard/assets',
        'tests',
    ]

    # Create directories
    for dir_path in directories:
        (base_path / dir_path).mkdir(parents=True, exist_ok=True)
        print(f"✅ Created: {dir_path}")

    # Create essential files
    files = {
        'README.md': README_TEMPLATE,
        'requirements.txt': REQUIREMENTS_TEMPLATE,
        '.gitignore': GITIGNORE_TEMPLATE,
        'src/__init__.py': '',
        'tests/__init__.py': '',
    }

    for filename, content in files.items():
        filepath = base_path / filename
        filepath.write_text(content)
        print(f"✅ Created: {filename}")

    print(f"\n🎉 Project '{project_name}' setup complete!")
    print(f"   cd {project_name}")
    print("   pip install -r requirements.txt")


README_TEMPLATE = """# Data Analysis Portfolio Project

## Overview
A comprehensive data analysis project demonstrating skills in Python, SQL,
data visualization, and business analytics.

## Project Structure
- `data/` - Raw and processed data files
- `notebooks/` - Jupyter notebooks with analysis
- `src/` - Python modules for data processing
- `reports/` - Generated reports and visualizations
- `dashboard/` - Interactive dashboard application

## Key Findings
[To be completed after analysis]

## Skills Demonstrated
- Data wrangling with Pandas
- SQL analytics (Window functions, CTEs)
- Statistical analysis
- Data visualization
- Business metrics (Cohort, Funnel, RFM)
- Automated reporting

## How to Run
```bash
pip install -r requirements.txt
jupyter notebook notebooks/
```

## Author
[Your Name]
"""

REQUIREMENTS_TEMPLATE = """pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
plotly>=5.14.0
jupyter>=1.0.0
sqlalchemy>=2.0.0
openpyxl>=3.1.0
jinja2>=3.1.0
reportlab>=4.0.0
streamlit>=1.20.0
pytest>=7.3.0
python-dotenv>=1.0.0
"""

GITIGNORE_TEMPLATE = """# Python
__pycache__/
*.py[cod]
.env
venv/

# Jupyter
.ipynb_checkpoints/

# Data (large files)
data/raw/*.csv
data/raw/*.xlsx
*.db

# Reports
reports/*.pdf

# IDE
.vscode/
.idea/
"""


if __name__ == '__main__':
    setup_project('ecommerce-analysis')

3. Phase 1: Data Exploration

3.1 Loading and Understanding Data

Python
"""
01_data_exploration.ipynb
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)

# Load data
orders = pd.read_csv('data/raw/orders.csv')
customers = pd.read_csv('data/raw/customers.csv')
products = pd.read_csv('data/raw/products.csv')

print("=" * 60)
print("DATA OVERVIEW")
print("=" * 60)

datasets = {'orders': orders, 'customers': customers, 'products': products}

for name, df in datasets.items():
    print(f"\n📊 {name.upper()}")
    print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"   Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"   Columns: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")

3.2 Data Quality Assessment

Python
def assess_data_quality(df, name='dataset'):
    """Comprehensive data quality assessment"""

    print(f"\n{'='*60}")
    print(f"DATA QUALITY REPORT: {name.upper()}")
    print(f"{'='*60}")

    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    missing_report = pd.DataFrame({
        'Missing': missing,
        'Pct': missing_pct
    })
    missing_report = missing_report[missing_report['Missing'] > 0]

    if len(missing_report) > 0:
        print("\n⚠️ MISSING VALUES:")
        print(missing_report.to_string())
    else:
        print("\n✅ No missing values")

    # Data types
    print("\n📋 DATA TYPES:")
    print(df.dtypes.value_counts().to_string())

    # Duplicates
    duplicates = df.duplicated().sum()
    print(f"\n🔄 DUPLICATES: {duplicates:,} ({duplicates/len(df)*100:.2f}%)")

    # Unique values for categorical columns
    print("\n📊 UNIQUE VALUES (categorical):")
    for col in df.select_dtypes(include=['object']).columns:
        unique_count = df[col].nunique()
        print(f"   {col}: {unique_count:,}")

    # Numeric statistics
    print("\n📈 NUMERIC STATISTICS:")
    numeric_df = df.select_dtypes(include=[np.number])
    if len(numeric_df.columns) > 0:
        stats = numeric_df.describe().T[['min', 'max', 'mean', 'std']]
        print(stats.to_string())

    return missing_report

# Run assessment
for name, df in datasets.items():
    assess_data_quality(df, name)

3.3 Initial Visualizations

Python
def create_exploration_plots(orders, customers, products):
    """Create initial exploration visualizations"""

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('Data Exploration - Initial Overview', fontsize=14, fontweight='bold')

    # 1. Order volume over time (weekly)
    ax1 = axes[0, 0]
    orders['order_date'] = pd.to_datetime(orders['order_date'])
    orders.set_index('order_date')['order_id'].resample('W').count().plot(ax=ax1)
    ax1.set_title('Orders Over Time')
    ax1.set_ylabel('Order Count')

    # 2. Revenue distribution
    ax2 = axes[0, 1]
    orders['order_total'].hist(bins=50, ax=ax2, edgecolor='white')
    ax2.set_title('Order Value Distribution')
    ax2.set_xlabel('Order Total ($)')
    ax2.axvline(orders['order_total'].median(), color='red', linestyle='--', label='Median')
    ax2.legend()

    # 3. Top categories
    ax3 = axes[0, 2]
    category_revenue = orders.groupby('category')['order_total'].sum().sort_values(ascending=True)
    category_revenue.plot(kind='barh', ax=ax3)
    ax3.set_title('Revenue by Category')
    ax3.set_xlabel('Revenue ($)')

    # 4. Customer order-frequency distribution
    ax4 = axes[1, 0]
    customer_orders = orders.groupby('customer_id').size()
    customer_orders.value_counts().head(10).plot(kind='bar', ax=ax4)
    ax4.set_title('Order Frequency Distribution')
    ax4.set_xlabel('Number of Orders')
    ax4.set_ylabel('Number of Customers')

    # 5. Hourly pattern
    ax5 = axes[1, 1]
    orders['hour'] = orders['order_date'].dt.hour
    orders.groupby('hour')['order_id'].count().plot(ax=ax5, marker='o')
    ax5.set_title('Orders by Hour')
    ax5.set_xlabel('Hour of Day')
    ax5.set_ylabel('Order Count')

    # 6. Day-of-week pattern
    ax6 = axes[1, 2]
    orders['day_of_week'] = orders['order_date'].dt.day_name()
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    dow_orders = orders.groupby('day_of_week')['order_id'].count().reindex(day_order)
    dow_orders.plot(kind='bar', ax=ax6)
    ax6.set_title('Orders by Day of Week')
    ax6.tick_params(axis='x', rotation=45)

    plt.tight_layout()
    plt.savefig('reports/figures/01_data_exploration.png', dpi=150, bbox_inches='tight')
    plt.show()

# Generate plots
# create_exploration_plots(orders, customers, products)

4. Phase 2: Data Cleaning

4.1 Cleaning Pipeline

Python
"""
02_data_cleaning.ipynb
"""

import pandas as pd
import numpy as np

class DataCleaner:
    """Data cleaning pipeline for e-commerce data"""

    def __init__(self):
        self.cleaning_log = []

    def log_action(self, action, before, after):
        """Log cleaning action"""
        self.cleaning_log.append({
            'action': action,
            'rows_before': before,
            'rows_after': after,
            'rows_removed': before - after
        })

    def clean_orders(self, df):
        """Clean orders dataframe"""

        # 1. Remove duplicates
        before = len(df)
        df = df.drop_duplicates(subset=['order_id'])
        self.log_action('Remove duplicate orders', before, len(df))

        # 2. Handle missing values:
        #    drop rows with missing critical fields
        before = len(df)
        critical_cols = ['order_id', 'customer_id', 'order_date', 'order_total']
        df = df.dropna(subset=critical_cols)
        self.log_action('Drop missing critical fields', before, len(df))

        # 3. Convert data types
        df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
        df['order_total'] = pd.to_numeric(df['order_total'], errors='coerce')

        # 4. Remove invalid dates
        before = len(df)
        df = df[df['order_date'].notna()]
        df = df[df['order_date'] >= '2020-01-01']
        df = df[df['order_date'] <= pd.Timestamp.now()]
        self.log_action('Remove invalid dates', before, len(df))

        # 5. Remove negative/zero orders
        before = len(df)
        df = df[df['order_total'] > 0]
        self.log_action('Remove negative/zero orders', before, len(df))

        # 6. Handle outliers (optional - drop rows above the 99th percentile)
        before = len(df)
        p99 = df['order_total'].quantile(0.99)
        df = df[df['order_total'] <= p99]
        self.log_action('Remove outliers above 99th percentile', before, len(df))

        # 7. Add derived columns
        df['order_month'] = df['order_date'].dt.to_period('M')
        df['order_week'] = df['order_date'].dt.to_period('W')
        df['order_dow'] = df['order_date'].dt.day_name()
        df['order_hour'] = df['order_date'].dt.hour

        return df

    def clean_customers(self, df):
        """Clean customers dataframe"""

        # Remove duplicates
        df = df.drop_duplicates(subset=['customer_id'])

        # Standardize categorical fields
        if 'country' in df.columns:
            df['country'] = df['country'].str.strip().str.title()

        if 'segment' in df.columns:
            df['segment'] = df['segment'].str.strip().str.title()

        # Convert signup date
        if 'signup_date' in df.columns:
            df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')

        return df

    def print_cleaning_summary(self):
        """Print cleaning summary"""
        print("\n" + "=" * 60)
        print("DATA CLEANING SUMMARY")
        print("=" * 60)

        for log in self.cleaning_log:
            print(f"\n{log['action']}")
            print(f"   Rows: {log['rows_before']:,} → {log['rows_after']:,}")
            print(f"   Removed: {log['rows_removed']:,}")

        total_removed = sum(log['rows_removed'] for log in self.cleaning_log)
        print(f"\n{'─' * 60}")
        print(f"TOTAL ROWS REMOVED: {total_removed:,}")

# Usage
cleaner = DataCleaner()
# orders_clean = cleaner.clean_orders(orders)
# customers_clean = cleaner.clean_customers(customers)
# cleaner.print_cleaning_summary()

# Save cleaned data
# orders_clean.to_csv('data/processed/orders_clean.csv', index=False)
# customers_clean.to_csv('data/processed/customers_clean.csv', index=False)
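
The repository layout in Section 2 includes tests/test_analysis.py; a minimal pytest sketch for the cleaning step, assuming DataCleaner has been moved out of the notebook into src/analysis.py (that module path is an assumption):

Python
"""
tests/test_analysis.py - a minimal pytest sketch for DataCleaner.
Assumes the class lives in src/analysis.py rather than the notebook.
"""
import pandas as pd

from src.analysis import DataCleaner

def test_clean_orders_removes_invalid_rows():
    # Toy frame: one duplicate order, one unparseable date, one negative total
    raw = pd.DataFrame({
        'order_id': [1, 1, 2, 3],
        'customer_id': [10, 10, 11, 12],
        'order_date': ['2024-01-05', '2024-01-05', 'not-a-date', '2024-02-01'],
        'order_total': [50.0, 50.0, 20.0, -5.0],
    })

    clean = DataCleaner().clean_orders(raw)

    assert clean['order_id'].is_unique          # duplicate dropped
    assert clean['order_date'].notna().all()    # bad date dropped
    assert (clean['order_total'] > 0).all()     # negative total dropped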

5. Phase 3: Core Analysis

5.1 Revenue Analysis

Python
"""
03_analysis.ipynb
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # used by the cohort heatmap below

class RevenueAnalyzer:
    """Revenue analysis module"""

    def __init__(self, orders_df):
        self.orders = orders_df

    def monthly_revenue(self):
        """Calculate monthly revenue metrics"""
        monthly = self.orders.groupby('order_month').agg({
            'order_total': ['sum', 'mean', 'count'],
            'customer_id': 'nunique'
        }).round(2)

        monthly.columns = ['revenue', 'aov', 'orders', 'customers']
        monthly['revenue_mom'] = monthly['revenue'].pct_change() * 100

        return monthly

    def revenue_by_segment(self, customers_df):
        """Revenue breakdown by customer segment"""
        merged = self.orders.merge(customers_df[['customer_id', 'segment']], on='customer_id')

        segment_metrics = merged.groupby('segment').agg({
            'order_total': ['sum', 'mean', 'count'],
            'customer_id': 'nunique'
        }).round(2)

        segment_metrics.columns = ['revenue', 'aov', 'orders', 'customers']
        segment_metrics['rev_per_customer'] = segment_metrics['revenue'] / segment_metrics['customers']

        return segment_metrics.sort_values('revenue', ascending=False)
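
A quick usage sketch, assuming the cleaned frames produced in Phase 2 (orders_clean, customers_clean):

Python
rev = RevenueAnalyzer(orders_clean)

monthly = rev.monthly_revenue()
print(monthly.tail(6))        # last 6 months: revenue, AOV, orders, customers, MoM %

by_segment = rev.revenue_by_segment(customers_clean)
print(by_segment)             # segments ranked by total revenue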

5.2 Cohort Analysis

Python
class CohortAnalyzer:
    """Cohort analysis module"""

    def __init__(self, orders_df):
        self.orders = orders_df

    def build_retention_cohorts(self):
        """Build monthly retention cohorts"""

        # Get first purchase month for each customer
        customer_first = self.orders.groupby('customer_id')['order_date'].min().reset_index()
        customer_first.columns = ['customer_id', 'first_order_date']
        customer_first['cohort'] = customer_first['first_order_date'].dt.to_period('M')

        # Merge with orders
        cohort_df = self.orders.merge(customer_first[['customer_id', 'cohort']], on='customer_id')
        cohort_df['order_period'] = cohort_df['order_date'].dt.to_period('M')

        # Calculate period number (months since first purchase)
        cohort_df['period_number'] = (
            cohort_df['order_period'].astype('int64') -
            cohort_df['cohort'].astype('int64')
        )

        # Build cohort matrix
        cohort_matrix = cohort_df.groupby(['cohort', 'period_number'])['customer_id'].nunique().unstack()

        # Calculate retention rates
        cohort_sizes = cohort_matrix.iloc[:, 0]
        retention_matrix = cohort_matrix.divide(cohort_sizes, axis=0) * 100

        return retention_matrix.round(1)

    def plot_retention_heatmap(self, retention_matrix):
        """Plot retention heatmap"""

        fig, ax = plt.subplots(figsize=(12, 8))

        sns.heatmap(
            retention_matrix.iloc[:12, :12],  # First 12 cohorts, 12 months
            annot=True,
            fmt='.0f',
            cmap='YlGnBu',
            ax=ax,
            vmin=0,
            vmax=100,
            cbar_kws={'label': 'Retention %'}
        )

        ax.set_title('Customer Retention by Monthly Cohort', fontsize=14, fontweight='bold')
        ax.set_xlabel('Months Since First Purchase')
        ax.set_ylabel('Cohort (First Purchase Month)')

        plt.tight_layout()
        plt.savefig('reports/figures/retention_heatmap.png', dpi=150)
        plt.show()
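
Usage sketch, under the same assumption that orders_clean comes from Phase 2:

Python
cohorts = CohortAnalyzer(orders_clean)
retention = cohorts.build_retention_cohorts()

print(retention.iloc[:6, :6])          # first 6 cohorts × first 6 months
cohorts.plot_retention_heatmap(retention)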

5.3 RFM Segmentation

Python
class RFMAnalyzer:
    """RFM segmentation module"""

    def __init__(self, orders_df, analysis_date=None):
        self.orders = orders_df
        self.analysis_date = analysis_date or orders_df['order_date'].max() + pd.Timedelta(days=1)

    def calculate_rfm(self):
        """Calculate RFM scores"""

        # Calculate RFM metrics
        rfm = self.orders.groupby('customer_id').agg({
            'order_date': lambda x: (self.analysis_date - x.max()).days,  # Recency
            'order_id': 'count',                                          # Frequency
            'order_total': 'sum'                                          # Monetary
        })

        rfm.columns = ['recency', 'frequency', 'monetary']

        # Score each metric (1-5, 5 is best)
        rfm['r_score'] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
        rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
        rfm['m_score'] = pd.qcut(rfm['monetary'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

        # Combine scores
        rfm['rfm_score'] = (
            rfm['r_score'].astype(str) +
            rfm['f_score'].astype(str) +
            rfm['m_score'].astype(str)
        )

        return rfm

    def segment_customers(self, rfm_df):
        """Assign customer segments based on RFM"""

        def get_segment(row):
            r, f, m = int(row['r_score']), int(row['f_score']), int(row['m_score'])

            if r >= 4 and f >= 4 and m >= 4:
                return 'Champions'
            elif r >= 4 and f >= 3:
                return 'Loyal Customers'
            elif r >= 4 and f <= 2:
                return 'Recent Customers'
            elif r <= 2 and f >= 4:
                return 'At Risk'
            elif r <= 2 and f <= 2 and m >= 3:
                return 'Hibernating'
            elif r <= 2 and f <= 2:
                return 'Lost'
            else:
                return 'Potential Loyalists'

        rfm_df['segment'] = rfm_df.apply(get_segment, axis=1)

        return rfm_df

    def get_segment_summary(self, rfm_df):
        """Generate segment summary"""

        summary = rfm_df.groupby('segment').agg({
            'recency': 'mean',
            'frequency': 'mean',
            'monetary': ['mean', 'sum'],
            'rfm_score': 'count'
        }).round(1)

        summary.columns = ['avg_recency', 'avg_frequency', 'avg_monetary', 'total_revenue', 'customers']
        summary['pct_customers'] = (summary['customers'] / summary['customers'].sum() * 100).round(1)
        summary['pct_revenue'] = (summary['total_revenue'] / summary['total_revenue'].sum() * 100).round(1)

        return summary.sort_values('total_revenue', ascending=False)
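
Chaining the three steps, again assuming orders_clean:

Python
rfm_analyzer = RFMAnalyzer(orders_clean)

rfm = rfm_analyzer.calculate_rfm()
rfm = rfm_analyzer.segment_customers(rfm)

summary = rfm_analyzer.get_segment_summary(rfm)
print(summary)   # segment sizes plus revenue share per segment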

5.4 Funnel Analysis

Python
class FunnelAnalyzer:
    """Funnel analysis module"""

    def __init__(self, events_df):
        self.events = events_df

    def build_funnel(self, stages):
        """Build conversion funnel"""

        funnel_data = []

        for i, stage in enumerate(stages):
            stage_users = self.events[self.events['event_type'] == stage]['user_id'].nunique()

            if i == 0:
                conversion = 100.0
                step_conversion = 100.0
            else:
                prev_users = funnel_data[i-1]['users']
                conversion = stage_users / funnel_data[0]['users'] * 100
                step_conversion = stage_users / prev_users * 100 if prev_users > 0 else 0

            funnel_data.append({
                'stage': stage,
                'users': stage_users,
                'conversion': round(conversion, 2),
                'step_conversion': round(step_conversion, 2)
            })

        return pd.DataFrame(funnel_data)

    def plot_funnel(self, funnel_df):
        """Visualize funnel"""

        fig, ax = plt.subplots(figsize=(12, 6))

        stages = funnel_df['stage'].tolist()
        users = funnel_df['users'].tolist()
        conversions = funnel_df['conversion'].tolist()

        # Create funnel bars
        colors = plt.cm.Blues(np.linspace(0.3, 0.9, len(stages)))

        for i, (stage, user_count, conv) in enumerate(zip(stages, users, conversions)):
            # Width proportional to conversion
            width = conv / 100 * 0.8
            left = (1 - width) / 2

            ax.barh(i, width, left=left, height=0.6, color=colors[i], edgecolor='white')

            # Add labels
            ax.text(0.5, i, f'{stage}\n{user_count:,} users ({conv}%)',
                    ha='center', va='center', fontsize=10, fontweight='bold')

        ax.set_ylim(-0.5, len(stages) - 0.5)
        ax.set_xlim(0, 1)
        ax.invert_yaxis()
        ax.axis('off')
        ax.set_title('Conversion Funnel', fontsize=14, fontweight='bold')

        plt.tight_layout()
        plt.savefig('reports/figures/conversion_funnel.png', dpi=150)
        plt.show()
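
Usage sketch; the events table and the event_type stage names here are illustrative assumptions (funnel events were not part of the three e-commerce tables above):

Python
# events_df needs at least: user_id, event_type
funnel = FunnelAnalyzer(events_df)

stages = ['page_view', 'add_to_cart', 'checkout', 'purchase']
funnel_df = funnel.build_funnel(stages)

print(funnel_df)          # users, overall conversion, step conversion per stage
funnel.plot_funnel(funnel_df)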

6. Phase 4: Reporting

6.1 Executive Summary

Python
"""
Generate executive summary report
"""

import pandas as pd

def generate_executive_summary(analysis_results):
    """Generate executive summary from analysis"""

    summary = f"""
{'=' * 64}
                      EXECUTIVE SUMMARY
                    E-Commerce Analytics
{'=' * 64}

BUSINESS PERFORMANCE OVERVIEW
{'-' * 64}
Total Revenue:       ${analysis_results['total_revenue']:,.0f}
Total Orders:        {analysis_results['total_orders']:,}
Total Customers:     {analysis_results['total_customers']:,}
Average Order Value: ${analysis_results['aov']:.2f}
YoY Growth:          {analysis_results['yoy_growth']:+.1f}%

KEY FINDINGS
{'-' * 64}
1. REVENUE TRENDS
   Revenue grew {analysis_results['revenue_growth']:.1f}% compared to last period
   Peak day: {analysis_results['peak_day']} with ${analysis_results['peak_revenue']:,.0f}
   Strongest category: {analysis_results['top_category']}

2. CUSTOMER INSIGHTS
   {analysis_results['new_customers']:,} new customers acquired
   Customer retention rate: {analysis_results['retention_rate']:.1f}%
   Top segment by revenue: {analysis_results['top_segment']}

3. COHORT ANALYSIS
   Average 6-month retention: {analysis_results['avg_6m_retention']:.1f}%
   Best performing cohort: {analysis_results['best_cohort']}
   Cohort LTV trend: {analysis_results['ltv_trend']}

4. RFM SEGMENTATION
   Champions: {analysis_results['champions_pct']:.1f}% of customers, {analysis_results['champions_rev_pct']:.1f}% of revenue
   At Risk: {analysis_results['at_risk_pct']:.1f}% of customers need attention
   Lost: {analysis_results['lost_pct']:.1f}% of customers churned

RECOMMENDATIONS
{'-' * 64}
1. PRIORITY: Re-engage "At Risk" customers
   Impact: Potential ${analysis_results['at_risk_value']:,.0f} revenue recovery
   Action: Launch targeted email campaign with personalized offers

2. OPPORTUNITY: Expand top category
   {analysis_results['top_category']} shows {analysis_results['category_growth']:.1f}% growth
   Action: Increase inventory and marketing spend

3. IMPROVE: Checkout conversion
   Current conversion: {analysis_results['checkout_conversion']:.1f}%
   Action: A/B test simplified checkout flow

{'=' * 64}
Report Generated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}
{'=' * 64}
"""

    return summary
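
The analysis_results dict is assembled from the earlier analyzers. A partial sketch with placeholder values (every number below is illustrative, echoing the demo figures used in the dashboard; fill each key from your own results):

Python
analysis_results = {
    'total_revenue': 2_500_000, 'total_orders': 12_500,
    'total_customers': 5_200, 'aov': 200.0, 'yoy_growth': 15.0,
    'revenue_growth': 4.2, 'peak_day': '2024-11-29', 'peak_revenue': 85_000,
    'top_category': 'Electronics', 'new_customers': 1_400,
    'retention_rate': 42.0, 'top_segment': 'Enterprise',
    'avg_6m_retention': 28.5, 'best_cohort': '2024-01', 'ltv_trend': 'rising',
    'champions_pct': 9.6, 'champions_rev_pct': 31.2,
    'at_risk_pct': 11.5, 'lost_pct': 7.7, 'at_risk_value': 120_000,
    'category_growth': 22.0, 'checkout_conversion': 3.4,
}

print(generate_executive_summary(analysis_results))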

6.2 Dashboard

Python
"""
dashboard/app.py - Streamlit Dashboard
"""

import streamlit as st
import pandas as pd
import numpy as np
import plotly.express as px

st.set_page_config(page_title="E-Commerce Analytics", layout="wide")

st.title("📊 E-Commerce Analytics Dashboard")

# Sidebar filters
st.sidebar.header("Filters")
date_range = st.sidebar.date_input("Date Range", [])
segment = st.sidebar.multiselect("Customer Segment", ['All', 'Enterprise', 'SMB', 'Consumer'])

# Main metrics
col1, col2, col3, col4 = st.columns(4)

with col1:
    st.metric("Revenue", "$2.5M", "+15%")

with col2:
    st.metric("Orders", "12,500", "+8%")

with col3:
    st.metric("Customers", "5,200", "+12%")

with col4:
    st.metric("AOV", "$200", "+3%")

# Charts
st.header("Revenue Trends")

# Sample data for demo
dates = pd.date_range('2024-01-01', periods=90, freq='D')
revenue = [100000 + i * 1000 + np.random.normal(0, 5000) for i in range(90)]

fig = px.line(x=dates, y=revenue, title="Daily Revenue")
fig.update_layout(xaxis_title="Date", yaxis_title="Revenue ($)")
st.plotly_chart(fig, use_container_width=True)

# Cohort heatmap
st.header("Retention Cohorts")
# Add cohort heatmap here

# RFM Distribution
st.header("Customer Segments")
segments = ['Champions', 'Loyal', 'Potential', 'At Risk', 'Lost']
customers = [500, 800, 1200, 600, 400]

fig2 = px.pie(values=customers, names=segments, title="Customer Distribution by Segment")
st.plotly_chart(fig2, use_container_width=True)

# Raw data
if st.checkbox("Show Raw Data"):
    st.dataframe(pd.DataFrame({'Date': dates, 'Revenue': revenue}).head(20))

7. Portfolio Presentation

7.1 GitHub README Template

markdown
# E-Commerce Analytics Portfolio Project

![Python](https://img.shields.io/badge/Python-3.10-blue)
![Pandas](https://img.shields.io/badge/Pandas-2.0-green)
![License](https://img.shields.io/badge/License-MIT-yellow)

## 🎯 Project Overview

A comprehensive data analysis project analyzing e-commerce customer behavior
and revenue trends, and providing actionable business recommendations.

### Key Findings

| Metric | Value | Change |
|--------|-------|--------|
| Total Revenue | $2.5M | +15% YoY |
| Customer Retention | 42% | +5pp |
| Customer LTV | $285 | +12% |

### Skills Demonstrated

- **Python**: Pandas, NumPy, Matplotlib, Seaborn
- **SQL**: Window functions, CTEs, Query optimization
- **Analysis**: Cohort, RFM, Funnel analysis
- **Visualization**: Plotly, Streamlit dashboards
- **Reporting**: Automated PDF/Excel generation

## 📊 Analysis Highlights

### Revenue Analysis
![Revenue Trend](reports/figures/revenue_trend.png)

### Customer Retention
![Retention Heatmap](reports/figures/retention_heatmap.png)

### Customer Segmentation
![RFM Segments](reports/figures/rfm_segments.png)

## 🚀 Quick Start

```bash
# Clone repository
git clone https://github.com/yourusername/ecommerce-analytics.git
cd ecommerce-analytics

# Install dependencies
pip install -r requirements.txt

# Run analysis
jupyter notebook notebooks/03_analysis.ipynb

# Launch dashboard
streamlit run dashboard/app.py
```

## 📁 Project Structure

```
├── data/          # Raw and processed data
├── notebooks/     # Jupyter notebooks
├── src/           # Python modules
├── reports/       # Generated reports
└── dashboard/     # Streamlit app
```

## 📝 Methodology

1. **Data Collection**: Extracted from PostgreSQL database
2. **Data Cleaning**: Handled missing values, outliers, duplicates
3. **Analysis**: Revenue trends, cohort retention, RFM segmentation
4. **Visualization**: Interactive dashboards with Plotly
5. **Reporting**: Automated weekly reports

## 🔮 Future Improvements

- Add predictive churn model
- Implement real-time dashboard
- A/B testing analysis module

## 👤 Author

[Your Name]

## 📄 License

MIT License - feel free to use this project as a template!

7.2 Project Checklist

Text
PORTFOLIO PROJECT CHECKLIST

□ DATA EXPLORATION
  □ Loaded all datasets
  □ Documented data sources
  □ Assessed data quality
  □ Created initial visualizations

□ DATA CLEANING
  □ Handled missing values
  □ Removed duplicates
  □ Fixed data types
  □ Documented cleaning steps

□ ANALYSIS
  □ Revenue analysis (trends, breakdown)
  □ Cohort analysis (retention matrix)
  □ RFM segmentation
  □ At least one more business analysis

□ VISUALIZATION
  □ Clear, professional charts
  □ Consistent styling
  □ Interactive dashboard

□ REPORTING
  □ Executive summary
  □ Key findings documented
  □ Recommendations with impact

□ CODE QUALITY
  □ Well-organized repository
  □ Clear function documentation
  □ requirements.txt complete
  □ .gitignore configured

□ DOCUMENTATION
  □ Comprehensive README
  □ Methodology explained
  □ Results summarized

□ PRESENTATION READY
  □ Can explain project in 5 minutes
  □ Know key numbers by heart
  □ Prepared for technical questions

8. Course Wrap-Up

<Callout type="success" title="Congratulations!">
You have completed the Advanced Data Analysis course! Now apply everything you have learned to the capstone project.
</Callout>

Course Skills Summary

| Module | Key Skills |
|--------|-----------|
| **Python** | Pandas, NumPy, Data manipulation |
| **SQL** | Window functions, CTEs, Optimization |
| **Analysis** | Time series, Cohort, Funnel, RFM |
| **Storytelling** | SCQA, Visualization, Executive reporting |
| **Automation** | Templates, Scheduling, Email delivery |

Next Steps

1. Complete your capstone project
2. Deploy it to GitHub with documentation
3. Add it to your portfolio/LinkedIn
4. Practice presenting your findings
5. Continue learning with real projects

Good luck in your Data Analyst career! 🎉