# Portfolio Capstone Project
## 1. Introduction
This is the course's capstone project, bringing together every skill covered so far. You will carry out a complete, end-to-end data analysis project suitable for a job-application portfolio.
### 1.1 Project Overview
```
┌──────────────────────────────────────────────────────────┐
│                CAPSTONE PROJECT STRUCTURE                │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  🎯 OBJECTIVE                                            │
│     Build a complete data analysis project that          │
│     demonstrates your skills to potential employers      │
│                                                          │
│  📋 DELIVERABLES                                         │
│     1. Jupyter Notebook (analysis)                       │
│     2. Executive Report (PDF/HTML)                       │
│     3. Interactive Dashboard                             │
│     4. Automated Report Script                           │
│     5. GitHub Repository with documentation              │
│                                                          │
│  ⏱️ ESTIMATED TIME: 8-12 hours                           │
│                                                          │
│  🛠️ SKILLS DEMONSTRATED                                  │
│     • Python (Pandas, NumPy, Matplotlib)                 │
│     • SQL (Window functions, CTEs, Optimization)         │
│     • Data Storytelling & Visualization                  │
│     • Business Analysis (Cohort, Funnel, Time Series)    │
│     • Automated Reporting                                │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

### 1.2 Dataset Options
```
OPTION A: E-COMMERCE ANALYTICS
─────────────────────────────────────────────────────────
• Transactions, customers, products data
• Analysis: Revenue trends, cohort retention, RFM
• Business questions: Customer lifetime value, churn prediction

OPTION B: SAAS METRICS
─────────────────────────────────────────────────────────
• Subscription data, usage events, support tickets
• Analysis: MRR trends, feature adoption, churn
• Business questions: Product stickiness, expansion revenue

OPTION C: MARKETING CAMPAIGN
─────────────────────────────────────────────────────────
• Campaign performance, leads, conversions
• Analysis: Funnel analysis, attribution, ROI
• Business questions: Channel optimization, CAC/LTV

OPTION D: YOUR OWN DATA
─────────────────────────────────────────────────────────
• Any dataset you have access to
• Real-world data preferred
• Must have at least 3 tables and 10,000+ rows
```
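If you pick Option A but don't have data on hand yet, you can prototype the whole pipeline on synthetic data first. Below is a minimal sketch that generates the three Option A tables; the column names (`order_id`, `customer_id`, `order_date`, `order_total`, `segment`, ...) are assumptions chosen to match the code snippets later in this lesson, and real data should replace it before you publish the portfolio.

```python
"""
generate_sample_data.py - synthetic Option A data (a sketch; the schema
is an assumption matching the analysis snippets in this lesson).
"""
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_customers, n_orders = 2_000, 12_000

customers = pd.DataFrame({
    'customer_id': np.arange(1, n_customers + 1),
    'signup_date': pd.Timestamp('2022-01-01')
        + pd.to_timedelta(rng.integers(0, 730, n_customers), unit='D'),
    'country': rng.choice(['Vietnam', 'Singapore', 'Thailand'], n_customers),
    'segment': rng.choice(['Consumer', 'SMB', 'Enterprise'], n_customers, p=[0.6, 0.3, 0.1]),
})

products = pd.DataFrame({
    'product_id': np.arange(1, 201),
    'category': rng.choice(['Electronics', 'Fashion', 'Home', 'Beauty'], 200),
    'price': rng.uniform(5, 500, 200).round(2),
})

orders = pd.DataFrame({
    'order_id': np.arange(1, n_orders + 1),
    'customer_id': rng.integers(1, n_customers + 1, n_orders),
    'product_id': rng.integers(1, 201, n_orders),
    'order_date': pd.Timestamp('2022-01-01')
        + pd.to_timedelta(rng.integers(0, 730 * 24, n_orders), unit='h'),
    'order_total': rng.gamma(2.0, 50.0, n_orders).round(2),  # right-skewed, like real order values
})
# Attach category so the groupby('category') charts later have it
orders = orders.merge(products[['product_id', 'category']], on='product_id')

for name, df in [('orders', orders), ('customers', customers), ('products', products)]:
    df.to_csv(f'data/raw/{name}.csv', index=False)
```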
## 2. Project Setup

### 2.1 Repository Structure
```
my-data-analysis-portfolio/
│
├── README.md                 # Project overview
├── requirements.txt          # Dependencies
├── .gitignore
│
├── data/
│   ├── raw/                  # Original data
│   ├── processed/            # Cleaned data
│   └── external/             # External reference data
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_data_cleaning.ipynb
│   ├── 03_analysis.ipynb
│   └── 04_modeling.ipynb
│
├── src/
│   ├── __init__.py
│   ├── data_loader.py        # Data loading utilities
│   ├── analysis.py           # Analysis functions
│   ├── visualization.py      # Chart functions
│   └── reporting.py          # Report generation
│
├── reports/
│   ├── executive_summary.pdf
│   ├── detailed_analysis.html
│   └── figures/
│
├── dashboard/
│   ├── app.py                # Streamlit/Dash app
│   └── assets/
│
└── tests/
    └── test_analysis.py
```

### 2.2 Setup Script
1"""2setup_project.py - Initialize project structure3"""45import os6from pathlib import Path78def setup_project(project_name: str):9 """Create project directory structure"""10 11 base_path = Path(project_name)12 13 # Define directory structure14 directories = [15 'data/raw',16 'data/processed',17 'data/external',18 'notebooks',19 'src',20 'reports/figures',21 'dashboard/assets',22 'tests'23 ]24 25 # Create directories26 for dir_path in directories:27 (base_path / dir_path).mkdir(parents=True, exist_ok=True)28 print(f"✅ Created: {dir_path}")29 30 # Create essential files31 files = {32 'README.md': README_TEMPLATE,33 'requirements.txt': REQUIREMENTS_TEMPLATE,34 '.gitignore': GITIGNORE_TEMPLATE,35 'src/__init__.py': '',36 'tests/__init__.py': ''37 }38 39 for filename, content in files.items():40 filepath = base_path / filename41 filepath.write_text(content)42 print(f"✅ Created: {filename}")43 44 print(f"\n🎉 Project '{project_name}' setup complete!")45 print(f" cd {project_name}")46 print(f" pip install -r requirements.txt")4748README_TEMPLATE = """# Data Analysis Portfolio Project4950## Overview51A comprehensive data analysis project demonstrating skills in Python, SQL, 52data visualization, and business analytics.5354## Project Structure55- `data/` - Raw and processed data files56- `notebooks/` - Jupyter notebooks with analysis57- `src/` - Python modules for data processing58- `reports/` - Generated reports and visualizations59- `dashboard/` - Interactive dashboard application6061## Key Findings62[To be completed after analysis]6364## Skills Demonstrated65- Data wrangling with Pandas66- SQL analytics (Window functions, CTEs)67- Statistical analysis68- Data visualization69- Business metrics (Cohort, Funnel, RFM)70- Automated reporting7172## How to Run73```bash74pip install -r requirements.txt75jupyter notebook notebooks/Author
[Your Name] """
REQUIREMENTS_TEMPLATE = """pandas>=2.0.0 numpy>=1.24.0 matplotlib>=3.7.0 seaborn>=0.12.0 plotly>=5.14.0 jupyter>=1.0.0 sqlalchemy>=2.0.0 openpyxl>=3.1.0 jinja2>=3.1.0 reportlab>=4.0.0 streamlit>=1.20.0 pytest>=7.3.0 python-dotenv>=1.0.0 """
GITIGNORE_TEMPLATE = """# Python pycache/ *.py[cod] .env venv/
Jupyter
.ipynb_checkpoints/
Data (large files)
data/raw/.csv data/raw/.xlsx *.db
Reports
reports/*.pdf
IDE
.vscode/ .idea/ """
if name == 'main': setup_project('ecommerce-analysis')
---

## 3. Phase 1: Data Exploration

### 3.1 Loading and Understanding Data

```python
"""
01_data_exploration.ipynb
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)

# Load data
orders = pd.read_csv('data/raw/orders.csv')
customers = pd.read_csv('data/raw/customers.csv')
products = pd.read_csv('data/raw/products.csv')

print("=" * 60)
print("DATA OVERVIEW")
print("=" * 60)

datasets = {'orders': orders, 'customers': customers, 'products': products}

for name, df in datasets.items():
    print(f"\n📊 {name.upper()}")
    print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"   Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"   Columns: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")
```

### 3.2 Data Quality Assessment
```python
def assess_data_quality(df, name='dataset'):
    """Comprehensive data quality assessment"""

    print(f"\n{'='*60}")
    print(f"DATA QUALITY REPORT: {name.upper()}")
    print(f"{'='*60}")

    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    missing_report = pd.DataFrame({
        'Missing': missing,
        'Pct': missing_pct
    })
    missing_report = missing_report[missing_report['Missing'] > 0]

    if len(missing_report) > 0:
        print("\n⚠️ MISSING VALUES:")
        print(missing_report.to_string())
    else:
        print("\n✅ No missing values")

    # Data types
    print("\n📋 DATA TYPES:")
    print(df.dtypes.value_counts().to_string())

    # Duplicates
    duplicates = df.duplicated().sum()
    print(f"\n🔄 DUPLICATES: {duplicates:,} ({duplicates/len(df)*100:.2f}%)")

    # Unique values for categorical columns
    print("\n📊 UNIQUE VALUES (categorical):")
    for col in df.select_dtypes(include=['object']).columns:
        unique_count = df[col].nunique()
        print(f"   {col}: {unique_count:,}")

    # Numeric statistics
    print("\n📈 NUMERIC STATISTICS:")
    numeric_df = df.select_dtypes(include=[np.number])
    if len(numeric_df.columns) > 0:
        stats = numeric_df.describe().T[['min', 'max', 'mean', 'std']]
        print(stats.to_string())

    return missing_report


# Run assessment
for name, df in datasets.items():
    assess_data_quality(df, name)
```

### 3.3 Initial Visualizations
```python
def create_exploration_plots(orders, customers, products):
    """Create initial exploration visualizations"""

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('Data Exploration - Initial Overview', fontsize=14, fontweight='bold')

    # 1. Order distribution by day
    ax1 = axes[0, 0]
    orders['order_date'] = pd.to_datetime(orders['order_date'])
    orders.set_index('order_date')['order_id'].resample('W').count().plot(ax=ax1)
    ax1.set_title('Orders Over Time')
    ax1.set_ylabel('Order Count')

    # 2. Revenue distribution
    ax2 = axes[0, 1]
    orders['order_total'].hist(bins=50, ax=ax2, edgecolor='white')
    ax2.set_title('Order Value Distribution')
    ax2.set_xlabel('Order Total ($)')
    ax2.axvline(orders['order_total'].median(), color='red', linestyle='--', label='Median')
    ax2.legend()

    # 3. Top categories
    ax3 = axes[0, 2]
    category_revenue = orders.groupby('category')['order_total'].sum().sort_values(ascending=True)
    category_revenue.plot(kind='barh', ax=ax3)
    ax3.set_title('Revenue by Category')
    ax3.set_xlabel('Revenue ($)')

    # 4. Customer distribution
    ax4 = axes[1, 0]
    customer_orders = orders.groupby('customer_id').size()
    customer_orders.value_counts().head(10).plot(kind='bar', ax=ax4)
    ax4.set_title('Order Frequency Distribution')
    ax4.set_xlabel('Number of Orders')
    ax4.set_ylabel('Number of Customers')

    # 5. Hourly pattern
    ax5 = axes[1, 1]
    orders['hour'] = orders['order_date'].dt.hour
    orders.groupby('hour')['order_id'].count().plot(ax=ax5, marker='o')
    ax5.set_title('Orders by Hour')
    ax5.set_xlabel('Hour of Day')
    ax5.set_ylabel('Order Count')

    # 6. Day of week pattern
    ax6 = axes[1, 2]
    orders['day_of_week'] = orders['order_date'].dt.day_name()
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    dow_orders = orders.groupby('day_of_week')['order_id'].count().reindex(day_order)
    dow_orders.plot(kind='bar', ax=ax6)
    ax6.set_title('Orders by Day of Week')
    ax6.tick_params(axis='x', rotation=45)

    plt.tight_layout()
    plt.savefig('reports/figures/01_data_exploration.png', dpi=150, bbox_inches='tight')
    plt.show()


# Generate plots
# create_exploration_plots(orders, customers, products)
```

## 4. Phase 2: Data Cleaning
### 4.1 Cleaning Pipeline
1"""202_data_cleaning.ipynb3"""45import pandas as pd6import numpy as np78class DataCleaner:9 """Data cleaning pipeline for e-commerce data"""10 11 def __init__(self):12 self.cleaning_log = []13 14 def log_action(self, action, before, after):15 """Log cleaning action"""16 self.cleaning_log.append({17 'action': action,18 'rows_before': before,19 'rows_after': after,20 'rows_removed': before - after21 })22 23 def clean_orders(self, df):24 """Clean orders dataframe"""25 initial_rows = len(df)26 27 # 1. Remove duplicates28 df = df.drop_duplicates(subset=['order_id'])29 self.log_action('Remove duplicate orders', initial_rows, len(df))30 31 # 2. Handle missing values32 # - Drop rows with missing critical fields33 critical_cols = ['order_id', 'customer_id', 'order_date', 'order_total']34 df = df.dropna(subset=critical_cols)35 self.log_action('Drop missing critical fields', len(self.cleaning_log[-1]) if self.cleaning_log else initial_rows, len(df))36 37 # 3. Convert data types38 df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')39 df['order_total'] = pd.to_numeric(df['order_total'], errors='coerce')40 41 # 4. Remove invalid dates42 df = df[df['order_date'].notna()]43 df = df[df['order_date'] >= '2020-01-01']44 df = df[df['order_date'] <= pd.Timestamp.now()]45 self.log_action('Remove invalid dates', len(df), len(df))46 47 # 5. Remove negative/zero orders48 df = df[df['order_total'] > 0]49 self.log_action('Remove negative/zero orders', len(df), len(df))50 51 # 6. Handle outliers (optional - cap at 99th percentile)52 p99 = df['order_total'].quantile(0.99)53 df = df[df['order_total'] <= p99]54 self.log_action('Cap outliers at 99th percentile', len(df), len(df))55 56 # 7. Add derived columns57 df['order_month'] = df['order_date'].dt.to_period('M')58 df['order_week'] = df['order_date'].dt.to_period('W')59 df['order_dow'] = df['order_date'].dt.day_name()60 df['order_hour'] = df['order_date'].dt.hour61 62 return df63 64 def clean_customers(self, df):65 """Clean customers dataframe"""66 initial_rows = len(df)67 68 # Remove duplicates69 df = df.drop_duplicates(subset=['customer_id'])70 71 # Standardize categorical fields72 if 'country' in df.columns:73 df['country'] = df['country'].str.strip().str.title()74 75 if 'segment' in df.columns:76 df['segment'] = df['segment'].str.strip().str.title()77 78 # Convert signup date79 if 'signup_date' in df.columns:80 df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')81 82 return df83 84 def print_cleaning_summary(self):85 """Print cleaning summary"""86 print("\n" + "=" * 60)87 print("DATA CLEANING SUMMARY")88 print("=" * 60)89 90 for log in self.cleaning_log:91 print(f"\n{log['action']}")92 print(f" Rows: {log['rows_before']:,} → {log['rows_after']:,}")93 print(f" Removed: {log['rows_removed']:,}")94 95 total_removed = sum(log['rows_removed'] for log in self.cleaning_log)96 print(f"\n{'─' * 60}")97 print(f"TOTAL ROWS REMOVED: {total_removed:,}")9899# Usage100cleaner = DataCleaner()101# orders_clean = cleaner.clean_orders(orders)102# customers_clean = cleaner.clean_customers(customers)103# cleaner.print_cleaning_summary()104105# Save cleaned data106# orders_clean.to_csv('data/processed/orders_clean.csv', index=False)107# customers_clean.to_csv('data/processed/customers_clean.csv', index=False)5. Phase 3: Core Analysis
## 5. Phase 3: Core Analysis

### 5.1 Revenue Analysis
1"""203_analysis.ipynb3"""45import pandas as pd6import numpy as np7import matplotlib.pyplot as plt89class RevenueAnalyzer:10 """Revenue analysis module"""11 12 def __init__(self, orders_df):13 self.orders = orders_df14 15 def monthly_revenue(self):16 """Calculate monthly revenue metrics"""17 monthly = self.orders.groupby('order_month').agg({18 'order_total': ['sum', 'mean', 'count'],19 'customer_id': 'nunique'20 }).round(2)21 22 monthly.columns = ['revenue', 'aov', 'orders', 'customers']23 monthly['revenue_mom'] = monthly['revenue'].pct_change() * 10024 25 return monthly26 27 def revenue_by_segment(self, customers_df):28 """Revenue breakdown by customer segment"""29 merged = self.orders.merge(customers_df[['customer_id', 'segment']], on='customer_id')30 31 segment_metrics = merged.groupby('segment').agg({32 'order_total': ['sum', 'mean', 'count'],33 'customer_id': 'nunique'34 }).round(2)35 36 segment_metrics.columns = ['revenue', 'aov', 'orders', 'customers']37 segment_metrics['rev_per_customer'] = segment_metrics['revenue'] / segment_metrics['customers']38 39 return segment_metrics.sort_values('revenue', ascending=False)5.2 Cohort Analysis
```python
class CohortAnalyzer:
    """Cohort analysis module"""

    def __init__(self, orders_df):
        self.orders = orders_df

    def build_retention_cohorts(self):
        """Build monthly retention cohorts"""

        # Get first purchase month for each customer
        customer_first = self.orders.groupby('customer_id')['order_date'].min().reset_index()
        customer_first.columns = ['customer_id', 'first_order_date']
        customer_first['cohort'] = customer_first['first_order_date'].dt.to_period('M')

        # Merge with orders
        cohort_df = self.orders.merge(customer_first[['customer_id', 'cohort']], on='customer_id')
        cohort_df['order_period'] = cohort_df['order_date'].dt.to_period('M')

        # Calculate period number (months since first purchase)
        cohort_df['period_number'] = (
            cohort_df['order_period'].astype('int64') -
            cohort_df['cohort'].astype('int64')
        )

        # Build cohort matrix
        cohort_matrix = cohort_df.groupby(['cohort', 'period_number'])['customer_id'].nunique().unstack()

        # Calculate retention rates
        cohort_sizes = cohort_matrix.iloc[:, 0]
        retention_matrix = cohort_matrix.divide(cohort_sizes, axis=0) * 100

        return retention_matrix.round(1)

    def plot_retention_heatmap(self, retention_matrix):
        """Plot retention heatmap"""

        fig, ax = plt.subplots(figsize=(12, 8))

        sns.heatmap(
            retention_matrix.iloc[:12, :12],  # First 12 cohorts, 12 months
            annot=True,
            fmt='.0f',
            cmap='YlGnBu',
            ax=ax,
            vmin=0,
            vmax=100,
            cbar_kws={'label': 'Retention %'}
        )

        ax.set_title('Customer Retention by Monthly Cohort', fontsize=14, fontweight='bold')
        ax.set_xlabel('Months Since First Purchase')
        ax.set_ylabel('Cohort (First Purchase Month)')

        plt.tight_layout()
        plt.savefig('reports/figures/retention_heatmap.png', dpi=150)
        plt.show()
```
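A short usage sketch wiring the two analyzers together, assuming `orders_clean` and `customers_clean` come out of the Phase 2 pipeline:

```python
# Sketch: run the revenue and cohort modules on the cleaned data
# (assumes orders_clean / customers_clean from the Phase 2 pipeline)
rev = RevenueAnalyzer(orders_clean)
monthly = rev.monthly_revenue()
print(monthly.tail(6))  # last 6 months: revenue, AOV, orders, customers, MoM %

segments = rev.revenue_by_segment(customers_clean)
print(segments)

cohorts = CohortAnalyzer(orders_clean)
retention = cohorts.build_retention_cohorts()
cohorts.plot_retention_heatmap(retention)  # saves reports/figures/retention_heatmap.png
```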
### 5.3 RFM Segmentation

```python
class RFMAnalyzer:
    """RFM segmentation module"""

    def __init__(self, orders_df, analysis_date=None):
        self.orders = orders_df
        self.analysis_date = analysis_date or orders_df['order_date'].max() + pd.Timedelta(days=1)

    def calculate_rfm(self):
        """Calculate RFM scores"""

        # Calculate RFM metrics
        rfm = self.orders.groupby('customer_id').agg({
            'order_date': lambda x: (self.analysis_date - x.max()).days,  # Recency
            'order_id': 'count',                                          # Frequency
            'order_total': 'sum'                                          # Monetary
        })

        rfm.columns = ['recency', 'frequency', 'monetary']

        # Score each metric (1-5, 5 is best)
        rfm['r_score'] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
        rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
        rfm['m_score'] = pd.qcut(rfm['monetary'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

        # Combine scores
        rfm['rfm_score'] = (
            rfm['r_score'].astype(str) +
            rfm['f_score'].astype(str) +
            rfm['m_score'].astype(str)
        )

        return rfm

    def segment_customers(self, rfm_df):
        """Assign customer segments based on RFM"""

        def get_segment(row):
            r, f, m = int(row['r_score']), int(row['f_score']), int(row['m_score'])

            if r >= 4 and f >= 4 and m >= 4:
                return 'Champions'
            elif r >= 4 and f >= 3:
                return 'Loyal Customers'
            elif r >= 4 and f <= 2:
                return 'Recent Customers'
            elif r <= 2 and f >= 4:
                return 'At Risk'
            elif r <= 2 and f <= 2 and m >= 3:
                return 'Hibernating'
            elif r <= 2 and f <= 2:
                return 'Lost'
            else:
                return 'Potential Loyalists'

        rfm_df['segment'] = rfm_df.apply(get_segment, axis=1)

        return rfm_df

    def get_segment_summary(self, rfm_df):
        """Generate segment summary"""

        summary = rfm_df.groupby('segment').agg({
            'recency': 'mean',
            'frequency': 'mean',
            'monetary': ['mean', 'sum'],
            'rfm_score': 'count'
        }).round(1)

        summary.columns = ['avg_recency', 'avg_frequency', 'avg_monetary', 'total_revenue', 'customers']
        summary['pct_customers'] = (summary['customers'] / summary['customers'].sum() * 100).round(1)
        summary['pct_revenue'] = (summary['total_revenue'] / summary['total_revenue'].sum() * 100).round(1)

        return summary.sort_values('total_revenue', ascending=False)
```
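Usage follows the same pattern; a sketch assuming `orders_clean` from Phase 2:

```python
# Sketch: end-to-end RFM run (assumes orders_clean from Phase 2)
rfm_analyzer = RFMAnalyzer(orders_clean)
rfm = rfm_analyzer.calculate_rfm()
rfm = rfm_analyzer.segment_customers(rfm)

summary = rfm_analyzer.get_segment_summary(rfm)
print(summary)  # size and revenue share per segment, Champions first
```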
### 5.4 Funnel Analysis

```python
class FunnelAnalyzer:
    """Funnel analysis module"""

    def __init__(self, events_df):
        self.events = events_df

    def build_funnel(self, stages):
        """Build conversion funnel"""

        funnel_data = []

        for i, stage in enumerate(stages):
            stage_users = self.events[self.events['event_type'] == stage]['user_id'].nunique()

            if i == 0:
                conversion = 100.0
                step_conversion = 100.0
            else:
                prev_users = funnel_data[i-1]['users']
                conversion = stage_users / funnel_data[0]['users'] * 100
                step_conversion = stage_users / prev_users * 100 if prev_users > 0 else 0

            funnel_data.append({
                'stage': stage,
                'users': stage_users,
                'conversion': round(conversion, 2),
                'step_conversion': round(step_conversion, 2)
            })

        return pd.DataFrame(funnel_data)

    def plot_funnel(self, funnel_df):
        """Visualize funnel"""

        fig, ax = plt.subplots(figsize=(12, 6))

        stages = funnel_df['stage'].tolist()
        users = funnel_df['users'].tolist()
        conversions = funnel_df['conversion'].tolist()

        # Create funnel bars
        colors = plt.cm.Blues(np.linspace(0.3, 0.9, len(stages)))

        for i, (stage, user_count, conv) in enumerate(zip(stages, users, conversions)):
            # Width proportional to conversion
            width = conv / 100 * 0.8
            left = (1 - width) / 2

            ax.barh(i, width, left=left, height=0.6, color=colors[i], edgecolor='white')

            # Add labels
            ax.text(0.5, i, f'{stage}\n{user_count:,} users ({conv}%)',
                    ha='center', va='center', fontsize=10, fontweight='bold')

        ax.set_ylim(-0.5, len(stages) - 0.5)
        ax.set_xlim(0, 1)
        ax.invert_yaxis()
        ax.axis('off')
        ax.set_title('Conversion Funnel', fontsize=14, fontweight='bold')

        plt.tight_layout()
        plt.savefig('reports/figures/conversion_funnel.png', dpi=150)
        plt.show()
```
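A usage sketch; the `events` table and its stage names are hypothetical here, since this project's raw files only include orders, customers, and products:

```python
# Sketch: build and plot a funnel (assumes an events table with
# 'event_type' and 'user_id' columns; stage names are hypothetical)
stages = ['page_view', 'product_view', 'add_to_cart', 'checkout', 'purchase']

funnel_analyzer = FunnelAnalyzer(events)
funnel = funnel_analyzer.build_funnel(stages)
print(funnel)  # users, overall conversion %, step-to-step conversion %

funnel_analyzer.plot_funnel(funnel)  # saves reports/figures/conversion_funnel.png
```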
## 6. Phase 4: Reporting

### 6.1 Executive Summary
1"""2Generate executive summary report3"""45def generate_executive_summary(analysis_results):6 """Generate executive summary from analysis"""7 8 summary = f"""9╔══════════════════════════════════════════════════════════════════╗10║ EXECUTIVE SUMMARY ║11║ E-Commerce Analytics ║12╚══════════════════════════════════════════════════════════════════╝1314�� BUSINESS PERFORMANCE OVERVIEW15━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━1617Total Revenue: ${analysis_results['total_revenue']:,.0f}18Total Orders: {analysis_results['total_orders']:,}19Total Customers: {analysis_results['total_customers']:,}20Average Order Value: ${analysis_results['aov']:.2f}21YoY Growth: {analysis_results['yoy_growth']:+.1f}%2223�� KEY FINDINGS24━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━25261. REVENUE TRENDS27 • Revenue grew {analysis_results['revenue_growth']:.1f}% compared to last period28 • Peak day: {analysis_results['peak_day']} with ${analysis_results['peak_revenue']:,.0f}29 • Strongest category: {analysis_results['top_category']}30312. CUSTOMER INSIGHTS32 • {analysis_results['new_customers']:,} new customers acquired33 • Customer retention rate: {analysis_results['retention_rate']:.1f}%34 • Top segment by revenue: {analysis_results['top_segment']}35363. COHORT ANALYSIS37 • Average 6-month retention: {analysis_results['avg_6m_retention']:.1f}%38 • Best performing cohort: {analysis_results['best_cohort']}39 • Cohort LTV trend: {analysis_results['ltv_trend']}40414. RFM SEGMENTATION42 • Champions: {analysis_results['champions_pct']:.1f}% of customers, {analysis_results['champions_rev_pct']:.1f}% of revenue43 • At Risk: {analysis_results['at_risk_pct']:.1f}% of customers need attention44 • Lost: {analysis_results['lost_pct']:.1f}% of customers churned4546�� RECOMMENDATIONS47━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━48491. PRIORITY: Re-engage "At Risk" customers50 • Impact: Potential ${analysis_results['at_risk_value']:,.0f} revenue recovery51 • Action: Launch targeted email campaign with personalized offers52532. OPPORTUNITY: Expand top category54 • {analysis_results['top_category']} shows {analysis_results['category_growth']:.1f}% growth55 • Action: Increase inventory and marketing spend56573. IMPROVE: Checkout conversion58 • Current conversion: {analysis_results['checkout_conversion']:.1f}%59 • Action: A/B test simplified checkout flow6061━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━62Report Generated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}63━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━64"""65 66 return summary6.2 Dashboard
1"""2dashboard/app.py - Streamlit Dashboard3"""45import streamlit as st6import pandas as pd7import plotly.express as px8import plotly.graph_objects as go910st.set_page_config(page_title="E-Commerce Analytics", layout="wide")1112st.title("📊 E-Commerce Analytics Dashboard")1314# Sidebar filters15st.sidebar.header("Filters")16date_range = st.sidebar.date_input("Date Range", [])17segment = st.sidebar.multiselect("Customer Segment", ['All', 'Enterprise', 'SMB', 'Consumer'])1819# Main metrics20col1, col2, col3, col4 = st.columns(4)2122with col1:23 st.metric("Revenue", "$2.5M", "+15%")2425with col2:26 st.metric("Orders", "12,500", "+8%")2728with col3:29 st.metric("Customers", "5,200", "+12%")3031with col4:32 st.metric("AOV", "$200", "+3%")3334# Charts35st.header("Revenue Trends")3637# Sample data for demo38dates = pd.date_range('2024-01-01', periods=90, freq='D')39revenue = [100000 + i * 1000 + np.random.normal(0, 5000) for i in range(90)]4041fig = px.line(x=dates, y=revenue, title="Daily Revenue")42fig.update_layout(xaxis_title="Date", yaxis_title="Revenue ($)")43st.plotly_chart(fig, use_container_width=True)4445# Cohort heatmap46st.header("Retention Cohorts")47# Add cohort heatmap here4849# RFM Distribution50st.header("Customer Segments")51segments = ['Champions', 'Loyal', 'Potential', 'At Risk', 'Lost']52customers = [500, 800, 1200, 600, 400]5354fig2 = px.pie(values=customers, names=segments, title="Customer Distribution by Segment")55st.plotly_chart(fig2, use_container_width=True)5657# Raw data58if st.checkbox("Show Raw Data"):59 st.dataframe(pd.DataFrame({'Date': dates, 'Revenue': revenue}).head(20))7. Portfolio Presentation
## 7. Portfolio Presentation

### 7.1 GitHub README Template
````markdown
# E-Commerce Analytics Portfolio Project

## 🎯 Project Overview

A comprehensive data analysis project analyzing e-commerce customer behavior,
revenue trends, and providing actionable business recommendations.

### Key Findings

| Metric | Value | Change |
|--------|-------|--------|
| Total Revenue | $2.5M | +15% YoY |
| Customer Retention | 42% | +5pp |
| Customer LTV | $285 | +12% |

### Skills Demonstrated

- **Python**: Pandas, NumPy, Matplotlib, Seaborn
- **SQL**: Window functions, CTEs, Query optimization
- **Analysis**: Cohort, RFM, Funnel analysis
- **Visualization**: Plotly, Streamlit dashboards
- **Reporting**: Automated PDF/Excel generation

## 📊 Analysis Highlights

### Revenue Analysis

### Customer Retention

### Customer Segmentation

## 🚀 Quick Start

```bash
# Clone repository
git clone https://github.com/yourusername/ecommerce-analytics.git
cd ecommerce-analytics

# Install dependencies
pip install -r requirements.txt

# Run analysis
jupyter notebook notebooks/03_analysis.ipynb

# Launch dashboard
streamlit run dashboard/app.py
```

## 📁 Project Structure

```
├── data/          # Raw and processed data
├── notebooks/     # Jupyter notebooks
├── src/           # Python modules
├── reports/       # Generated reports
└── dashboard/     # Streamlit app
```

## 📝 Methodology

- Data Collection: Extracted from PostgreSQL database
- Data Cleaning: Handled missing values, outliers, duplicates
- Analysis: Revenue trends, cohort retention, RFM segmentation
- Visualization: Interactive dashboards with Plotly
- Reporting: Automated weekly reports

## 🔮 Future Improvements

- Add predictive churn model
- Implement real-time dashboard
- A/B testing analysis module

## 👤 Author

Your Name
- LinkedIn: your-profile
- Email: your.email@example.com

## 📄 License

MIT License - feel free to use this project as a template!
````
### 7.2 Project Checklist

```
PORTFOLIO PROJECT CHECKLIST

□ DATA EXPLORATION
  □ Loaded all datasets
  □ Documented data sources
  □ Assessed data quality
  □ Created initial visualizations

□ DATA CLEANING
  □ Handled missing values
  □ Removed duplicates
  □ Fixed data types
  □ Documented cleaning steps

□ ANALYSIS
  □ Revenue analysis (trends, breakdown)
  □ Cohort analysis (retention matrix)
  □ RFM segmentation
  □ At least one more business analysis

□ VISUALIZATION
  □ Clear, professional charts
  □ Consistent styling
  □ Interactive dashboard

□ REPORTING
  □ Executive summary
  □ Key findings documented
  □ Recommendations with impact

□ CODE QUALITY
  □ Well-organized repository
  □ Clear function documentation
  □ Requirements.txt complete
  □ .gitignore configured

□ DOCUMENTATION
  □ Comprehensive README
  □ Methodology explained
  □ Results summarized

□ PRESENTATION READY
  □ Can explain project in 5 minutes
  □ Know key numbers by heart
  □ Prepared for technical questions
```
---

## 8. Course Wrap-Up

<Callout type="success" title="Congratulations!">
You have completed the Advanced Data Analysis course! Now apply everything you've learned to the capstone project.
</Callout>

### Course Skills Summary

| Module | Key Skills |
|--------|-----------|
| **Python** | Pandas, NumPy, Data manipulation |
| **SQL** | Window functions, CTEs, Optimization |
| **Analysis** | Time series, Cohort, Funnel, RFM |
| **Storytelling** | SCQA, Visualization, Executive reporting |
| **Automation** | Templates, Scheduling, Email delivery |

### Next Steps

1. **Complete your capstone project**
2. **Deploy to GitHub with documentation**
3. **Add to your portfolio/LinkedIn**
4. **Practice presenting your findings**
5. **Continue learning with real projects**

**Good luck in your Data Analyst career! 🎉**