Theory
Lesson 16/17

Portfolio Capstone Project

Final project: build a complete data analysis portfolio

1. Introduction

Capstone Project

This is the final project of the course, bringing together all the skills you have learned. You will carry out a complete, end-to-end data analysis project suitable for a job-application portfolio.

1.1 Project Overview

Text
┌──────────────────────────────────────────────────────────┐
│ CAPSTONE PROJECT STRUCTURE                               │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ 🎯 OBJECTIVE                                             │
│    Build a complete data analysis project that           │
│    demonstrates your skills to potential employers       │
│                                                          │
│ 📋 DELIVERABLES                                          │
│    1. Jupyter Notebook (analysis)                        │
│    2. Executive Report (PDF/HTML)                        │
│    3. Interactive Dashboard                              │
│    4. Automated Report Script                            │
│    5. GitHub Repository with documentation               │
│                                                          │
│ ⏱️ ESTIMATED TIME: 8-12 hours                            │
│                                                          │
│ 🛠️ SKILLS DEMONSTRATED                                   │
│    • Python (Pandas, NumPy, Matplotlib)                  │
│    • SQL (Window functions, CTEs, Optimization)          │
│    • Data Storytelling & Visualization                   │
│    • Business Analysis (Cohort, Funnel, Time Series)     │
│    • Automated Reporting                                 │
│                                                          │
└──────────────────────────────────────────────────────────┘

1.2 Dataset Options

Text
OPTION A: E-COMMERCE ANALYTICS
─────────────────────────────────────────────────────────
• Transactions, customers, products data
• Analysis: Revenue trends, cohort retention, RFM
• Business questions: Customer lifetime value, churn prediction

OPTION B: SAAS METRICS
─────────────────────────────────────────────────────────
• Subscription data, usage events, support tickets
• Analysis: MRR trends, feature adoption, churn
• Business questions: Product stickiness, expansion revenue

OPTION C: MARKETING CAMPAIGN
─────────────────────────────────────────────────────────
• Campaign performance, leads, conversions
• Analysis: Funnel analysis, attribution, ROI
• Business questions: Channel optimization, CAC/LTV

OPTION D: YOUR OWN DATA
─────────────────────────────────────────────────────────
• Any dataset you have access to
• Real-world data preferred
• Must have at least 3 tables and 10,000+ rows
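
If you don't have a dataset at hand, a minimal sketch for generating a synthetic e-commerce dataset matching the three-table layout used in the rest of this lesson. The schema, sizes, and distributions are all illustrative assumptions; run it after creating the folder structure in Section 2 so that data/raw/ exists:

Python
"""
generate_sample_data.py - Create a small synthetic e-commerce dataset.
A sketch only; table names, columns, and distributions are assumptions.
"""
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n_customers, n_products, n_orders = 2_000, 200, 12_000

customers = pd.DataFrame({
    'customer_id': np.arange(1, n_customers + 1),
    'segment': rng.choice(['Enterprise', 'SMB', 'Consumer'], n_customers),
    'signup_date': pd.Timestamp('2022-01-01')
                   + pd.to_timedelta(rng.integers(0, 730, n_customers), unit='D'),
})

products = pd.DataFrame({
    'product_id': np.arange(1, n_products + 1),
    'category': rng.choice(['Electronics', 'Clothing', 'Home', 'Sports'], n_products),
    'price': rng.uniform(5, 500, n_products).round(2),
})

orders = pd.DataFrame({
    'order_id': np.arange(1, n_orders + 1),
    'customer_id': rng.integers(1, n_customers + 1, n_orders),
    'product_id': rng.integers(1, n_products + 1, n_orders),
    'order_date': pd.Timestamp('2023-01-01')
                  + pd.to_timedelta(rng.integers(0, 365 * 24, n_orders), unit='h'),
    'order_total': rng.gamma(2.0, 60.0, n_orders).round(2),  # right-skewed totals
})

# Attach category so the exploration plots below can group by it
orders = orders.merge(products[['product_id', 'category']], on='product_id')

for name, df in {'orders': orders, 'customers': customers, 'products': products}.items():
    df.to_csv(f'data/raw/{name}.csv', index=False)
    print(f'Wrote data/raw/{name}.csv ({len(df):,} rows)')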

2. Project Setup

2.1 Repository Structure

Text
my-data-analysis-portfolio/
├── README.md                # Project overview
├── requirements.txt         # Dependencies
├── .gitignore
│
├── data/
│   ├── raw/                 # Original data
│   ├── processed/           # Cleaned data
│   └── external/            # External reference data
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_data_cleaning.ipynb
│   ├── 03_analysis.ipynb
│   └── 04_modeling.ipynb
│
├── src/
│   ├── __init__.py
│   ├── data_loader.py       # Data loading utilities
│   ├── analysis.py          # Analysis functions
│   ├── visualization.py     # Chart functions
│   └── reporting.py         # Report generation
│
├── reports/
│   ├── executive_summary.pdf
│   ├── detailed_analysis.html
│   └── figures/
│
├── dashboard/
│   ├── app.py               # Streamlit/Dash app
│   └── assets/
│
└── tests/
    └── test_analysis.py

2.2 Setup Script

Python
"""
setup_project.py - Initialize project structure
"""

from pathlib import Path


def setup_project(project_name: str):
    """Create project directory structure"""

    base_path = Path(project_name)

    # Define directory structure
    directories = [
        'data/raw',
        'data/processed',
        'data/external',
        'notebooks',
        'src',
        'reports/figures',
        'dashboard/assets',
        'tests',
    ]

    # Create directories
    for dir_path in directories:
        (base_path / dir_path).mkdir(parents=True, exist_ok=True)
        print(f"✅ Created: {dir_path}")

    # Create essential files
    files = {
        'README.md': README_TEMPLATE,
        'requirements.txt': REQUIREMENTS_TEMPLATE,
        '.gitignore': GITIGNORE_TEMPLATE,
        'src/__init__.py': '',
        'tests/__init__.py': '',
    }

    for filename, content in files.items():
        filepath = base_path / filename
        filepath.write_text(content)
        print(f"✅ Created: {filename}")

    print(f"\n🎉 Project '{project_name}' setup complete!")
    print(f"   cd {project_name}")
    print("   pip install -r requirements.txt")


README_TEMPLATE = """# Data Analysis Portfolio Project

## Overview
A comprehensive data analysis project demonstrating skills in Python, SQL,
data visualization, and business analytics.

## Project Structure
- `data/` - Raw and processed data files
- `notebooks/` - Jupyter notebooks with analysis
- `src/` - Python modules for data processing
- `reports/` - Generated reports and visualizations
- `dashboard/` - Interactive dashboard application

## Key Findings
[To be completed after analysis]

## Skills Demonstrated
- Data wrangling with Pandas
- SQL analytics (Window functions, CTEs)
- Statistical analysis
- Data visualization
- Business metrics (Cohort, Funnel, RFM)
- Automated reporting

## How to Run
```bash
pip install -r requirements.txt
jupyter notebook notebooks/
```

## Author
[Your Name]
"""

REQUIREMENTS_TEMPLATE = """pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
plotly>=5.14.0
jupyter>=1.0.0
sqlalchemy>=2.0.0
openpyxl>=3.1.0
jinja2>=3.1.0
reportlab>=4.0.0
streamlit>=1.20.0
pytest>=7.3.0
python-dotenv>=1.0.0
"""

GITIGNORE_TEMPLATE = """# Python
__pycache__/
*.py[cod]
.env
venv/

# Jupyter
.ipynb_checkpoints/

# Data (large files)
data/raw/*.csv
data/raw/*.xlsx
*.db

# Reports
reports/*.pdf

# IDE
.vscode/
.idea/
"""


if __name__ == '__main__':
    setup_project('ecommerce-analysis')

3. Phase 1: Data Exploration

3.1 Loading and Understanding Data

Python
"""
01_data_exploration.ipynb
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)

# Load data
orders = pd.read_csv('data/raw/orders.csv')
customers = pd.read_csv('data/raw/customers.csv')
products = pd.read_csv('data/raw/products.csv')

print("=" * 60)
print("DATA OVERVIEW")
print("=" * 60)

datasets = {'orders': orders, 'customers': customers, 'products': products}

for name, df in datasets.items():
    print(f"\n📊 {name.upper()}")
    print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"   Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"   Columns: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")

3.2 Data Quality Assessment

Python
def assess_data_quality(df, name='dataset'):
    """Comprehensive data quality assessment"""

    print(f"\n{'='*60}")
    print(f"DATA QUALITY REPORT: {name.upper()}")
    print(f"{'='*60}")

    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    missing_report = pd.DataFrame({
        'Missing': missing,
        'Pct': missing_pct
    })
    missing_report = missing_report[missing_report['Missing'] > 0]

    if len(missing_report) > 0:
        print("\n⚠️ MISSING VALUES:")
        print(missing_report.to_string())
    else:
        print("\n✅ No missing values")

    # Data types
    print("\n📋 DATA TYPES:")
    print(df.dtypes.value_counts().to_string())

    # Duplicates
    duplicates = df.duplicated().sum()
    print(f"\n🔄 DUPLICATES: {duplicates:,} ({duplicates/len(df)*100:.2f}%)")

    # Unique values for categorical columns
    print("\n📊 UNIQUE VALUES (categorical):")
    for col in df.select_dtypes(include=['object']).columns:
        unique_count = df[col].nunique()
        print(f"   {col}: {unique_count:,}")

    # Numeric statistics
    print("\n📈 NUMERIC STATISTICS:")
    numeric_df = df.select_dtypes(include=[np.number])
    if len(numeric_df.columns) > 0:
        stats = numeric_df.describe().T[['min', 'max', 'mean', 'std']]
        print(stats.to_string())

    return missing_report

# Run assessment
for name, df in datasets.items():
    assess_data_quality(df, name)

3.3 Initial Visualizations

Python
def create_exploration_plots(orders, customers, products):
    """Create initial exploration visualizations"""

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('Data Exploration - Initial Overview', fontsize=14, fontweight='bold')

    # 1. Order volume over time (weekly)
    ax1 = axes[0, 0]
    orders['order_date'] = pd.to_datetime(orders['order_date'])
    orders.set_index('order_date')['order_id'].resample('W').count().plot(ax=ax1)
    ax1.set_title('Orders Over Time')
    ax1.set_ylabel('Order Count')

    # 2. Revenue distribution
    ax2 = axes[0, 1]
    orders['order_total'].hist(bins=50, ax=ax2, edgecolor='white')
    ax2.set_title('Order Value Distribution')
    ax2.set_xlabel('Order Total ($)')
    ax2.axvline(orders['order_total'].median(), color='red', linestyle='--', label='Median')
    ax2.legend()

    # 3. Top categories
    ax3 = axes[0, 2]
    category_revenue = orders.groupby('category')['order_total'].sum().sort_values(ascending=True)
    category_revenue.plot(kind='barh', ax=ax3)
    ax3.set_title('Revenue by Category')
    ax3.set_xlabel('Revenue ($)')

    # 4. Customer order-frequency distribution
    ax4 = axes[1, 0]
    customer_orders = orders.groupby('customer_id').size()
    customer_orders.value_counts().head(10).plot(kind='bar', ax=ax4)
    ax4.set_title('Order Frequency Distribution')
    ax4.set_xlabel('Number of Orders')
    ax4.set_ylabel('Number of Customers')

    # 5. Hourly pattern
    ax5 = axes[1, 1]
    orders['hour'] = orders['order_date'].dt.hour
    orders.groupby('hour')['order_id'].count().plot(ax=ax5, marker='o')
    ax5.set_title('Orders by Hour')
    ax5.set_xlabel('Hour of Day')
    ax5.set_ylabel('Order Count')

    # 6. Day-of-week pattern
    ax6 = axes[1, 2]
    orders['day_of_week'] = orders['order_date'].dt.day_name()
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    dow_orders = orders.groupby('day_of_week')['order_id'].count().reindex(day_order)
    dow_orders.plot(kind='bar', ax=ax6)
    ax6.set_title('Orders by Day of Week')
    ax6.tick_params(axis='x', rotation=45)

    plt.tight_layout()
    plt.savefig('reports/figures/01_data_exploration.png', dpi=150, bbox_inches='tight')
    plt.show()

# Generate plots
# create_exploration_plots(orders, customers, products)

4. Phase 2: Data Cleaning

4.1 Cleaning Pipeline

Python
"""
02_data_cleaning.ipynb
"""

import pandas as pd
import numpy as np

class DataCleaner:
    """Data cleaning pipeline for e-commerce data"""

    def __init__(self):
        self.cleaning_log = []

    def log_action(self, action, before, after):
        """Log cleaning action"""
        self.cleaning_log.append({
            'action': action,
            'rows_before': before,
            'rows_after': after,
            'rows_removed': before - after
        })

    def clean_orders(self, df):
        """Clean orders dataframe"""

        # 1. Remove duplicates
        before = len(df)
        df = df.drop_duplicates(subset=['order_id'])
        self.log_action('Remove duplicate orders', before, len(df))

        # 2. Handle missing values:
        #    drop rows with missing critical fields
        before = len(df)
        critical_cols = ['order_id', 'customer_id', 'order_date', 'order_total']
        df = df.dropna(subset=critical_cols)
        self.log_action('Drop missing critical fields', before, len(df))

        # 3. Convert data types
        df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
        df['order_total'] = pd.to_numeric(df['order_total'], errors='coerce')

        # 4. Remove invalid dates
        before = len(df)
        df = df[df['order_date'].notna()]
        df = df[df['order_date'] >= '2020-01-01']
        df = df[df['order_date'] <= pd.Timestamp.now()]
        self.log_action('Remove invalid dates', before, len(df))

        # 5. Remove negative/zero orders
        before = len(df)
        df = df[df['order_total'] > 0]
        self.log_action('Remove negative/zero orders', before, len(df))

        # 6. Handle outliers (optional - drop rows above the 99th percentile)
        before = len(df)
        p99 = df['order_total'].quantile(0.99)
        df = df[df['order_total'] <= p99]
        self.log_action('Remove outliers above 99th percentile', before, len(df))

        # 7. Add derived columns
        df['order_month'] = df['order_date'].dt.to_period('M')
        df['order_week'] = df['order_date'].dt.to_period('W')
        df['order_dow'] = df['order_date'].dt.day_name()
        df['order_hour'] = df['order_date'].dt.hour

        return df

    def clean_customers(self, df):
        """Clean customers dataframe"""

        # Remove duplicates
        df = df.drop_duplicates(subset=['customer_id'])

        # Standardize categorical fields
        if 'country' in df.columns:
            df['country'] = df['country'].str.strip().str.title()

        if 'segment' in df.columns:
            df['segment'] = df['segment'].str.strip().str.title()

        # Convert signup date
        if 'signup_date' in df.columns:
            df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')

        return df

    def print_cleaning_summary(self):
        """Print cleaning summary"""
        print("\n" + "=" * 60)
        print("DATA CLEANING SUMMARY")
        print("=" * 60)

        for log in self.cleaning_log:
            print(f"\n{log['action']}")
            print(f"   Rows: {log['rows_before']:,} → {log['rows_after']:,}")
            print(f"   Removed: {log['rows_removed']:,}")

        total_removed = sum(log['rows_removed'] for log in self.cleaning_log)
        print(f"\n{'─' * 60}")
        print(f"TOTAL ROWS REMOVED: {total_removed:,}")

# Usage
cleaner = DataCleaner()
# orders_clean = cleaner.clean_orders(orders)
# customers_clean = cleaner.clean_customers(customers)
# cleaner.print_cleaning_summary()

# Save cleaned data
# orders_clean.to_csv('data/processed/orders_clean.csv', index=False)
# customers_clean.to_csv('data/processed/customers_clean.csv', index=False)
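
The repository layout in Section 2 includes tests/test_analysis.py; a minimal pytest sketch for the cleaning step, assuming DataCleaner has been moved out of the notebook into src/analysis.py (that module path is an assumption):

Python
"""
tests/test_analysis.py - a minimal pytest sketch for DataCleaner.
Assumes the class lives in src/analysis.py rather than the notebook.
"""
import pandas as pd

from src.analysis import DataCleaner

def test_clean_orders_removes_invalid_rows():
    # Toy frame: one duplicate order, one unparseable date, one negative total
    raw = pd.DataFrame({
        'order_id': [1, 1, 2, 3],
        'customer_id': [10, 10, 11, 12],
        'order_date': ['2024-01-05', '2024-01-05', 'not-a-date', '2024-02-01'],
        'order_total': [50.0, 50.0, 20.0, -5.0],
    })

    clean = DataCleaner().clean_orders(raw)

    assert clean['order_id'].is_unique          # duplicate dropped
    assert clean['order_date'].notna().all()    # bad date dropped
    assert (clean['order_total'] > 0).all()     # negative total dropped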

5. Phase 3: Core Analysis

5.1 Revenue Analysis

Python
"""
03_analysis.ipynb
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # used by the cohort heatmap below

class RevenueAnalyzer:
    """Revenue analysis module"""

    def __init__(self, orders_df):
        self.orders = orders_df

    def monthly_revenue(self):
        """Calculate monthly revenue metrics"""
        monthly = self.orders.groupby('order_month').agg({
            'order_total': ['sum', 'mean', 'count'],
            'customer_id': 'nunique'
        }).round(2)

        monthly.columns = ['revenue', 'aov', 'orders', 'customers']
        monthly['revenue_mom'] = monthly['revenue'].pct_change() * 100

        return monthly

    def revenue_by_segment(self, customers_df):
        """Revenue breakdown by customer segment"""
        merged = self.orders.merge(customers_df[['customer_id', 'segment']], on='customer_id')

        segment_metrics = merged.groupby('segment').agg({
            'order_total': ['sum', 'mean', 'count'],
            'customer_id': 'nunique'
        }).round(2)

        segment_metrics.columns = ['revenue', 'aov', 'orders', 'customers']
        segment_metrics['rev_per_customer'] = segment_metrics['revenue'] / segment_metrics['customers']

        return segment_metrics.sort_values('revenue', ascending=False)
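
A quick usage sketch, assuming the cleaned frames produced in Phase 2 (orders_clean, customers_clean):

Python
rev = RevenueAnalyzer(orders_clean)

monthly = rev.monthly_revenue()
print(monthly.tail(6))        # last 6 months: revenue, AOV, orders, customers, MoM %

by_segment = rev.revenue_by_segment(customers_clean)
print(by_segment)             # segments ranked by total revenue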

5.2 Cohort Analysis

Python
class CohortAnalyzer:
    """Cohort analysis module"""

    def __init__(self, orders_df):
        self.orders = orders_df

    def build_retention_cohorts(self):
        """Build monthly retention cohorts"""

        # Get first purchase month for each customer
        customer_first = self.orders.groupby('customer_id')['order_date'].min().reset_index()
        customer_first.columns = ['customer_id', 'first_order_date']
        customer_first['cohort'] = customer_first['first_order_date'].dt.to_period('M')

        # Merge with orders
        cohort_df = self.orders.merge(customer_first[['customer_id', 'cohort']], on='customer_id')
        cohort_df['order_period'] = cohort_df['order_date'].dt.to_period('M')

        # Calculate period number (months since first purchase)
        cohort_df['period_number'] = (
            cohort_df['order_period'].astype('int64') -
            cohort_df['cohort'].astype('int64')
        )

        # Build cohort matrix
        cohort_matrix = cohort_df.groupby(['cohort', 'period_number'])['customer_id'].nunique().unstack()

        # Calculate retention rates
        cohort_sizes = cohort_matrix.iloc[:, 0]
        retention_matrix = cohort_matrix.divide(cohort_sizes, axis=0) * 100

        return retention_matrix.round(1)

    def plot_retention_heatmap(self, retention_matrix):
        """Plot retention heatmap"""

        fig, ax = plt.subplots(figsize=(12, 8))

        sns.heatmap(
            retention_matrix.iloc[:12, :12],  # First 12 cohorts, 12 months
            annot=True,
            fmt='.0f',
            cmap='YlGnBu',
            ax=ax,
            vmin=0,
            vmax=100,
            cbar_kws={'label': 'Retention %'}
        )

        ax.set_title('Customer Retention by Monthly Cohort', fontsize=14, fontweight='bold')
        ax.set_xlabel('Months Since First Purchase')
        ax.set_ylabel('Cohort (First Purchase Month)')

        plt.tight_layout()
        plt.savefig('reports/figures/retention_heatmap.png', dpi=150)
        plt.show()
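
Usage sketch, under the same assumption that orders_clean comes from Phase 2:

Python
cohorts = CohortAnalyzer(orders_clean)
retention = cohorts.build_retention_cohorts()

print(retention.iloc[:6, :6])          # first 6 cohorts × first 6 months
cohorts.plot_retention_heatmap(retention)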

5.3 RFM Segmentation

Python
class RFMAnalyzer:
    """RFM segmentation module"""

    def __init__(self, orders_df, analysis_date=None):
        self.orders = orders_df
        self.analysis_date = analysis_date or orders_df['order_date'].max() + pd.Timedelta(days=1)

    def calculate_rfm(self):
        """Calculate RFM scores"""

        # Calculate RFM metrics
        rfm = self.orders.groupby('customer_id').agg({
            'order_date': lambda x: (self.analysis_date - x.max()).days,  # Recency
            'order_id': 'count',                                          # Frequency
            'order_total': 'sum'                                          # Monetary
        })

        rfm.columns = ['recency', 'frequency', 'monetary']

        # Score each metric (1-5, 5 is best)
        rfm['r_score'] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
        rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
        rfm['m_score'] = pd.qcut(rfm['monetary'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

        # Combine scores
        rfm['rfm_score'] = (
            rfm['r_score'].astype(str) +
            rfm['f_score'].astype(str) +
            rfm['m_score'].astype(str)
        )

        return rfm

    def segment_customers(self, rfm_df):
        """Assign customer segments based on RFM"""

        def get_segment(row):
            r, f, m = int(row['r_score']), int(row['f_score']), int(row['m_score'])

            if r >= 4 and f >= 4 and m >= 4:
                return 'Champions'
            elif r >= 4 and f >= 3:
                return 'Loyal Customers'
            elif r >= 4 and f <= 2:
                return 'Recent Customers'
            elif r <= 2 and f >= 4:
                return 'At Risk'
            elif r <= 2 and f <= 2 and m >= 3:
                return 'Hibernating'
            elif r <= 2 and f <= 2:
                return 'Lost'
            else:
                return 'Potential Loyalists'

        rfm_df['segment'] = rfm_df.apply(get_segment, axis=1)

        return rfm_df

    def get_segment_summary(self, rfm_df):
        """Generate segment summary"""

        summary = rfm_df.groupby('segment').agg({
            'recency': 'mean',
            'frequency': 'mean',
            'monetary': ['mean', 'sum'],
            'rfm_score': 'count'
        }).round(1)

        summary.columns = ['avg_recency', 'avg_frequency', 'avg_monetary', 'total_revenue', 'customers']
        summary['pct_customers'] = (summary['customers'] / summary['customers'].sum() * 100).round(1)
        summary['pct_revenue'] = (summary['total_revenue'] / summary['total_revenue'].sum() * 100).round(1)

        return summary.sort_values('total_revenue', ascending=False)
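
Chaining the three steps, again assuming orders_clean:

Python
rfm_analyzer = RFMAnalyzer(orders_clean)

rfm = rfm_analyzer.calculate_rfm()
rfm = rfm_analyzer.segment_customers(rfm)

summary = rfm_analyzer.get_segment_summary(rfm)
print(summary)   # segment sizes plus revenue share per segment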

5.4 Funnel Analysis

Python
class FunnelAnalyzer:
    """Funnel analysis module"""

    def __init__(self, events_df):
        self.events = events_df

    def build_funnel(self, stages):
        """Build conversion funnel"""

        funnel_data = []

        for i, stage in enumerate(stages):
            stage_users = self.events[self.events['event_type'] == stage]['user_id'].nunique()

            if i == 0:
                conversion = 100.0
                step_conversion = 100.0
            else:
                prev_users = funnel_data[i-1]['users']
                conversion = stage_users / funnel_data[0]['users'] * 100
                step_conversion = stage_users / prev_users * 100 if prev_users > 0 else 0

            funnel_data.append({
                'stage': stage,
                'users': stage_users,
                'conversion': round(conversion, 2),
                'step_conversion': round(step_conversion, 2)
            })

        return pd.DataFrame(funnel_data)

    def plot_funnel(self, funnel_df):
        """Visualize funnel"""

        fig, ax = plt.subplots(figsize=(12, 6))

        stages = funnel_df['stage'].tolist()
        users = funnel_df['users'].tolist()
        conversions = funnel_df['conversion'].tolist()

        # Create funnel bars
        colors = plt.cm.Blues(np.linspace(0.3, 0.9, len(stages)))

        for i, (stage, user_count, conv) in enumerate(zip(stages, users, conversions)):
            # Width proportional to conversion
            width = conv / 100 * 0.8
            left = (1 - width) / 2

            ax.barh(i, width, left=left, height=0.6, color=colors[i], edgecolor='white')

            # Add labels
            ax.text(0.5, i, f'{stage}\n{user_count:,} users ({conv}%)',
                    ha='center', va='center', fontsize=10, fontweight='bold')

        ax.set_ylim(-0.5, len(stages) - 0.5)
        ax.set_xlim(0, 1)
        ax.invert_yaxis()
        ax.axis('off')
        ax.set_title('Conversion Funnel', fontsize=14, fontweight='bold')

        plt.tight_layout()
        plt.savefig('reports/figures/conversion_funnel.png', dpi=150)
        plt.show()
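
Usage sketch; the events table and the event_type stage names here are illustrative assumptions (funnel events were not part of the three e-commerce tables above):

Python
# events_df needs at least: user_id, event_type
funnel = FunnelAnalyzer(events_df)

stages = ['page_view', 'add_to_cart', 'checkout', 'purchase']
funnel_df = funnel.build_funnel(stages)

print(funnel_df)          # users, overall conversion, step conversion per stage
funnel.plot_funnel(funnel_df)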

6. Phase 4: Reporting

6.1 Executive Summary

Python
"""
Generate executive summary report
"""

import pandas as pd

def generate_executive_summary(analysis_results):
    """Generate executive summary from analysis"""

    summary = f"""
{'=' * 64}
                      EXECUTIVE SUMMARY
                    E-Commerce Analytics
{'=' * 64}

BUSINESS PERFORMANCE OVERVIEW
{'-' * 64}
Total Revenue:       ${analysis_results['total_revenue']:,.0f}
Total Orders:        {analysis_results['total_orders']:,}
Total Customers:     {analysis_results['total_customers']:,}
Average Order Value: ${analysis_results['aov']:.2f}
YoY Growth:          {analysis_results['yoy_growth']:+.1f}%

KEY FINDINGS
{'-' * 64}
1. REVENUE TRENDS
   Revenue grew {analysis_results['revenue_growth']:.1f}% compared to last period
   Peak day: {analysis_results['peak_day']} with ${analysis_results['peak_revenue']:,.0f}
   Strongest category: {analysis_results['top_category']}

2. CUSTOMER INSIGHTS
   {analysis_results['new_customers']:,} new customers acquired
   Customer retention rate: {analysis_results['retention_rate']:.1f}%
   Top segment by revenue: {analysis_results['top_segment']}

3. COHORT ANALYSIS
   Average 6-month retention: {analysis_results['avg_6m_retention']:.1f}%
   Best performing cohort: {analysis_results['best_cohort']}
   Cohort LTV trend: {analysis_results['ltv_trend']}

4. RFM SEGMENTATION
   Champions: {analysis_results['champions_pct']:.1f}% of customers, {analysis_results['champions_rev_pct']:.1f}% of revenue
   At Risk: {analysis_results['at_risk_pct']:.1f}% of customers need attention
   Lost: {analysis_results['lost_pct']:.1f}% of customers churned

RECOMMENDATIONS
{'-' * 64}
1. PRIORITY: Re-engage "At Risk" customers
   Impact: Potential ${analysis_results['at_risk_value']:,.0f} revenue recovery
   Action: Launch targeted email campaign with personalized offers

2. OPPORTUNITY: Expand top category
   {analysis_results['top_category']} shows {analysis_results['category_growth']:.1f}% growth
   Action: Increase inventory and marketing spend

3. IMPROVE: Checkout conversion
   Current conversion: {analysis_results['checkout_conversion']:.1f}%
   Action: A/B test simplified checkout flow

{'=' * 64}
Report Generated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}
{'=' * 64}
"""

    return summary
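
The analysis_results dict is assembled from the earlier analyzers. A partial sketch with placeholder values (every number below is illustrative, echoing the demo figures used in the dashboard; fill each key from your own results):

Python
analysis_results = {
    'total_revenue': 2_500_000, 'total_orders': 12_500,
    'total_customers': 5_200, 'aov': 200.0, 'yoy_growth': 15.0,
    'revenue_growth': 4.2, 'peak_day': '2024-11-29', 'peak_revenue': 85_000,
    'top_category': 'Electronics', 'new_customers': 1_400,
    'retention_rate': 42.0, 'top_segment': 'Enterprise',
    'avg_6m_retention': 28.5, 'best_cohort': '2024-01', 'ltv_trend': 'rising',
    'champions_pct': 9.6, 'champions_rev_pct': 31.2,
    'at_risk_pct': 11.5, 'lost_pct': 7.7, 'at_risk_value': 120_000,
    'category_growth': 22.0, 'checkout_conversion': 3.4,
}

print(generate_executive_summary(analysis_results))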

6.2 Dashboard

Python
"""
dashboard/app.py - Streamlit Dashboard
"""

import streamlit as st
import pandas as pd
import numpy as np
import plotly.express as px

st.set_page_config(page_title="E-Commerce Analytics", layout="wide")

st.title("📊 E-Commerce Analytics Dashboard")

# Sidebar filters
st.sidebar.header("Filters")
date_range = st.sidebar.date_input("Date Range", [])
segment = st.sidebar.multiselect("Customer Segment", ['All', 'Enterprise', 'SMB', 'Consumer'])

# Main metrics
col1, col2, col3, col4 = st.columns(4)

with col1:
    st.metric("Revenue", "$2.5M", "+15%")

with col2:
    st.metric("Orders", "12,500", "+8%")

with col3:
    st.metric("Customers", "5,200", "+12%")

with col4:
    st.metric("AOV", "$200", "+3%")

# Charts
st.header("Revenue Trends")

# Sample data for demo
dates = pd.date_range('2024-01-01', periods=90, freq='D')
revenue = [100000 + i * 1000 + np.random.normal(0, 5000) for i in range(90)]

fig = px.line(x=dates, y=revenue, title="Daily Revenue")
fig.update_layout(xaxis_title="Date", yaxis_title="Revenue ($)")
st.plotly_chart(fig, use_container_width=True)

# Cohort heatmap
st.header("Retention Cohorts")
# Add cohort heatmap here

# RFM Distribution
st.header("Customer Segments")
segments = ['Champions', 'Loyal', 'Potential', 'At Risk', 'Lost']
customers = [500, 800, 1200, 600, 400]

fig2 = px.pie(values=customers, names=segments, title="Customer Distribution by Segment")
st.plotly_chart(fig2, use_container_width=True)

# Raw data
if st.checkbox("Show Raw Data"):
    st.dataframe(pd.DataFrame({'Date': dates, 'Revenue': revenue}).head(20))

7. Portfolio Presentation

7.1 GitHub README Template

markdown
# E-Commerce Analytics Portfolio Project

![Python](https://img.shields.io/badge/Python-3.10-blue)
![Pandas](https://img.shields.io/badge/Pandas-2.0-green)
![License](https://img.shields.io/badge/License-MIT-yellow)

## 🎯 Project Overview

A comprehensive data analysis project analyzing e-commerce customer behavior
and revenue trends, and providing actionable business recommendations.

### Key Findings

| Metric | Value | Change |
|--------|-------|--------|
| Total Revenue | $2.5M | +15% YoY |
| Customer Retention | 42% | +5pp |
| Customer LTV | $285 | +12% |

### Skills Demonstrated

- **Python**: Pandas, NumPy, Matplotlib, Seaborn
- **SQL**: Window functions, CTEs, Query optimization
- **Analysis**: Cohort, RFM, Funnel analysis
- **Visualization**: Plotly, Streamlit dashboards
- **Reporting**: Automated PDF/Excel generation

## 📊 Analysis Highlights

### Revenue Analysis
![Revenue Trend](reports/figures/revenue_trend.png)

### Customer Retention
![Retention Heatmap](reports/figures/retention_heatmap.png)

### Customer Segmentation
![RFM Segments](reports/figures/rfm_segments.png)

## 🚀 Quick Start

```bash
# Clone repository
git clone https://github.com/yourusername/ecommerce-analytics.git
cd ecommerce-analytics

# Install dependencies
pip install -r requirements.txt

# Run analysis
jupyter notebook notebooks/03_analysis.ipynb

# Launch dashboard
streamlit run dashboard/app.py
```

## 📁 Project Structure

```
├── data/          # Raw and processed data
├── notebooks/     # Jupyter notebooks
├── src/           # Python modules
├── reports/       # Generated reports
└── dashboard/     # Streamlit app
```

## 📝 Methodology

1. **Data Collection**: Extracted from PostgreSQL database
2. **Data Cleaning**: Handled missing values, outliers, duplicates
3. **Analysis**: Revenue trends, cohort retention, RFM segmentation
4. **Visualization**: Interactive dashboards with Plotly
5. **Reporting**: Automated weekly reports

## 🔮 Future Improvements

- Add predictive churn model
- Implement real-time dashboard
- A/B testing analysis module

## 👤 Author

[Your Name]

## 📄 License

MIT License - feel free to use this project as a template!

7.2 Project Checklist

Text
PORTFOLIO PROJECT CHECKLIST

□ DATA EXPLORATION
  □ Loaded all datasets
  □ Documented data sources
  □ Assessed data quality
  □ Created initial visualizations

□ DATA CLEANING
  □ Handled missing values
  □ Removed duplicates
  □ Fixed data types
  □ Documented cleaning steps

□ ANALYSIS
  □ Revenue analysis (trends, breakdown)
  □ Cohort analysis (retention matrix)
  □ RFM segmentation
  □ At least one more business analysis

□ VISUALIZATION
  □ Clear, professional charts
  □ Consistent styling
  □ Interactive dashboard

□ REPORTING
  □ Executive summary
  □ Key findings documented
  □ Recommendations with impact

□ CODE QUALITY
  □ Well-organized repository
  □ Clear function documentation
  □ requirements.txt complete
  □ .gitignore configured

□ DOCUMENTATION
  □ Comprehensive README
  □ Methodology explained
  □ Results summarized

□ PRESENTATION READY
  □ Can explain project in 5 minutes
  □ Know key numbers by heart
  □ Prepared for technical questions

8. Course Wrap-Up

<Callout type="success" title="Congratulations!">
You have completed the Advanced Data Analysis course! Now apply everything you have learned to the capstone project.
</Callout>

Course Skills Summary

| Module | Key Skills |
|--------|-----------|
| **Python** | Pandas, NumPy, Data manipulation |
| **SQL** | Window functions, CTEs, Optimization |
| **Analysis** | Time series, Cohort, Funnel, RFM |
| **Storytelling** | SCQA, Visualization, Executive reporting |
| **Automation** | Templates, Scheduling, Email delivery |

Next Steps

1. Complete your capstone project
2. Deploy it to GitHub with documentation
3. Add it to your portfolio/LinkedIn
4. Practice presenting your findings
5. Continue learning with real projects

Good luck in your Data Analyst career! 🎉