Python Data Science Course Summary

1. Learning Journey

Congratulations on completing the course! Let's review what you have learned:

Course roadmap

Python Basics → Data Manipulation → Visualization → Advanced Analytics

2. Knowledge by Module

Module 1: Python Fundamentals

Data Types
- int, float, str, bool
- list, dict, set, tuple

Control Flow
- if/elif/else
- for, while loops
- list comprehension

Functions
- def, lambda
- *args, **kwargs
- decorators

Cheat Sheet:

Python

# Variables
name = "MinAI"
age = 25
is_active = True

# List
fruits = ["apple", "banana", "cherry"]
fruits.append("orange")

# Dictionary
person = {"name": "Alice", "age": 25}
person["city"] = "Hanoi"

# List comprehension
squares = [x**2 for x in range(10)]

# Function
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

# Lambda
square = lambda x: x ** 2
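
The Functions topics also include *args, **kwargs, and decorators. The following is a minimal sketch of how they fit together, using only the standard library. The names `log_calls` and `total` are hypothetical and chosen for illustration.

```python
import functools


def log_calls(func):
    """Decorator that records each call's arguments before running func."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        # *args collects positional arguments, **kwargs collects keyword arguments
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)
    wrapper.calls = []
    return wrapper


@log_calls
def total(*numbers, scale=1):
    """Sum any number of positional values, then multiply by scale."""
    return sum(numbers) * scale


result = total(1, 2, 3, scale=2)
print(result)        # 12
print(total.calls)   # [((1, 2, 3), {'scale': 2})]
```

Writing `@log_calls` above the function is shorthand for `total = log_calls(total)`.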

Module 2: Data Manipulation

Pandas vs Polars:

Feature	Pandas	Polars
Speed	Standard	Often much faster (10-100x is commonly cited; actual gains depend on the workload)
Memory	Higher	Lower
Syntax	`df['col']`	`pl.col('col')`
Lazy eval	No	Yes

Pandas Cheat Sheet:

Python

import pandas as pd

# Read data
df = pd.read_csv("data.csv")

# Selection
df['column']              # Single column
df[['col1', 'col2']]      # Multiple columns
df.loc[0, 'column']       # By label
df.iloc[0, 0]             # By position

# Filtering
df[df['age'] > 30]
df.query('age > 30 and city == "Hanoi"')

# Groupby
df.groupby('city')['salary'].mean()
df.groupby(['city', 'dept']).agg({'salary': 'mean', 'id': 'count'})

# Merge
pd.merge(df1, df2, on='key', how='left')
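
To see what filtering, groupby, and a left merge return, here is a small self-contained sketch. The `employees` and `offices` tables are hypothetical data invented for illustration, not course data.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration
employees = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["Hanoi", "Hanoi", "Hue", "Hue"],
    "age": [25, 35, 40, 28],
    "salary": [1000, 2000, 3000, 1000],
})
offices = pd.DataFrame({"city": ["Hanoi"], "region": ["North"]})

# Filtering: a boolean mask and query() select the same rows
over_30 = employees[employees["age"] > 30]
over_30_q = employees.query("age > 30")

# Groupby: mean salary per city
mean_salary = employees.groupby("city")["salary"].mean()

# Left merge: every employee row is kept; cities missing from offices get NaN
merged = pd.merge(employees, offices, on="city", how="left")

print(mean_salary)
print(merged)
```

With `how='left'` the result keeps all four employee rows, and the two Hue rows have a missing `region` because `offices` has no matching city.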

Data Cleaning Cheat Sheet:

Python

# Missing values
df.isnull().sum()
df['col'] = df['col'].fillna(df['col'].median())  # fill one column with its own median
df.dropna(subset=['important_col'])

# Duplicates
df.duplicated().sum()
df.drop_duplicates()

# Outliers (IQR)
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['col'] >= Q1 - 1.5*IQR) & (df['col'] <= Q3 + 1.5*IQR)]
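
As a worked example of the median fill and the IQR rule, the sketch below runs both on a tiny column. The values are hypothetical and were chosen so that one entry is missing and one (100) is an obvious outlier.

```python
import pandas as pd

# Hypothetical values: one missing entry and one obvious outlier (100)
df = pd.DataFrame({"col": [1.0, 2.0, 3.0, 4.0, None, 100.0]})

# Fill the missing value with the median of the observed values
median = df["col"].median()
df["col"] = df["col"].fillna(median)

# IQR fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] counts as an outlier
q1 = df["col"].quantile(0.25)
q3 = df["col"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# between() is inclusive on both ends by default
filtered = df[df["col"].between(lower, upper)]
print(lower, upper)
print(filtered)
```

Here the fences come out to 0.0 and 6.0, so only the row with 100 is dropped. The order matters: filling missing values first changes the quartiles slightly compared with filtering first.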

Module 3: Visualization

When to use which chart:

Purpose	Chart Type	Library
Distribution	Histogram, KDE	Seaborn
Comparison	Bar, Box	Seaborn
Relationship	Scatter, Regression	Seaborn/Plotly
Composition	Pie, Stacked Bar	Plotly
Trend	Line	Plotly
Correlation	Heatmap	Seaborn
Interactive	Any	Plotly
Dashboard	Multiple	Streamlit

Visualization Cheat Sheet:

Python

import seaborn as sns
import plotly.express as px

# Seaborn
sns.histplot(df['col'], kde=True)
sns.boxplot(data=df, x='category', y='value')
sns.scatterplot(data=df, x='x', y='y', hue='category')
sns.heatmap(df.corr(numeric_only=True), annot=True)  # correlate numeric columns only

# Plotly
px.scatter(df, x='x', y='y', color='category', size='size')
px.line(df, x='date', y='value', color='category')
px.bar(df, x='category', y='value', color='sub_category')
px.pie(df, values='value', names='category')

Module 4: Advanced Analytics

EDA Workflow:

Python

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def quick_eda(df):
    # 1. Overview
    print(df.shape)
    df.info()  # info() prints its report itself and returns None

    # 2. Missing values
    print(df.isnull().sum())

    # 3. Statistics
    print(df.describe())

    # 4. Distributions
    for col in df.select_dtypes(include=[np.number]).columns:
        sns.histplot(df[col], kde=True)
        plt.show()

    # 5. Correlations (numeric columns only; text columns would raise an error)
    sns.heatmap(df.corr(numeric_only=True), annot=True)
    plt.show()

Feature Engineering Cheat Sheet:

Python

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Date features (df['date'] must have a datetime dtype, e.g. via pd.to_datetime)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6])

# Aggregations
df['customer_total'] = df.groupby('customer_id')['amount'].transform('sum')

# Scaling (fit_transform returns a 2D array, so flatten it to one column)
scaler = StandardScaler()
df['col_scaled'] = scaler.fit_transform(df[['col']]).ravel()

# Encoding
df_encoded = pd.get_dummies(df, columns=['category'])

# Feature selection (X: feature matrix, y: target labels, prepared beforehand)
model = RandomForestClassifier()
model.fit(X, y)
importance = model.feature_importances_
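
To make the date, aggregation, and encoding features concrete, here is a small sketch on hypothetical transactions. The data is invented for illustration, and the scaling and feature-selection steps are left out to keep it dependent on pandas only.

```python
import pandas as pd

# Hypothetical transactions, invented for illustration
df = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-02-12"]),
    "amount": [10.0, 20.0, 5.0],
    "category": ["food", "toys", "food"],
})

df["month"] = df["date"].dt.month
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
df["is_weekend"] = df["date"].dt.dayofweek.isin([5, 6])
# transform('sum') broadcasts each customer's total back to every one of their rows
df["customer_total"] = df.groupby("customer_id")["amount"].transform("sum")
# One-hot encoding replaces 'category' with one indicator column per value
df_encoded = pd.get_dummies(df, columns=["category"])

print(df_encoded)
```

Unlike `agg`, `transform` keeps the original row count, which is why its result can be assigned straight back as a new column.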

3. Tools & Libraries Summary

Library	Purpose	Key Functions
pandas	Data manipulation	read_csv, groupby, merge, pivot_table
polars	Fast data processing	scan_csv, filter, group_by, collect
numpy	Numerical computing	array, mean, std, reshape
seaborn	Statistical viz	histplot, boxplot, heatmap, pairplot
plotly	Interactive viz	scatter, line, bar, pie, choropleth
streamlit	Web apps	st.write, st.dataframe, st.plotly_chart
scikit-learn	ML preprocessing	StandardScaler, OneHotEncoder, train_test_split

4. Best Practices

Code Quality

Python

# ✅ Good
import pandas as pd
import numpy as np

def calculate_metrics(df: pd.DataFrame, target: str) -> dict:
    """
    Calculate basic metrics for a target column.

    Args:
        df: Input DataFrame
        target: Column name to analyze

    Returns:
        Dictionary with metrics
    """
    return {
        'mean': df[target].mean(),
        'median': df[target].median(),
        'std': df[target].std()
    }

# ❌ Bad
def calc(d, t):
    return d[t].mean(), d[t].median()

Data Processing

Python

# ✅ Good - Chain operations
df_clean = (
    df
    .dropna(subset=['important_col'])
    .drop_duplicates()
    .assign(new_col=lambda x: x['a'] + x['b'])
    .query('value > 0')
)

# ❌ Bad - Multiple reassignments
df = df.dropna()
df = df.drop_duplicates()
df['new'] = df['a'] + df['b']
df = df[df['value'] > 0]
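
One practical benefit of the chained style is that the original DataFrame stays untouched. The sketch below runs the same chain on hypothetical data (invented for illustration) to show each step's effect.

```python
import pandas as pd

# Hypothetical data: row 1 has a missing value, row 3 duplicates row 2,
# and row 4 has a non-positive value
df = pd.DataFrame({
    "important_col": [1.0, None, 3.0, 3.0, 4.0],
    "a": [1, 2, 3, 3, 4],
    "b": [10, 20, 30, 30, 40],
    "value": [5, 6, 7, 7, -1],
})

df_clean = (
    df
    .dropna(subset=["important_col"])            # drops row 1
    .drop_duplicates()                           # drops row 3
    .assign(new_col=lambda x: x["a"] + x["b"])   # adds a + b
    .query("value > 0")                          # drops row 4
)

print(df_clean)
print(df.shape)  # the source frame still has all 5 rows and 4 columns
```

Each method returns a new DataFrame, so the pipeline can be read top to bottom and any single step can be commented out while debugging.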

Visualization

Python

import matplotlib.pyplot as plt
import seaborn as sns

# ✅ Good - Informative plot
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=df, x='category', y='value', ax=ax)
ax.set_title('Sales by Category', fontsize=14, fontweight='bold')
ax.set_xlabel('Product Category')
ax.set_ylabel('Sales ($)')
plt.tight_layout()
plt.savefig('chart.png', dpi=300)

# ❌ Bad - Bare minimum
df.plot()

5. Project Ideas

Beginner Projects

Sales Analysis Dashboard
- Load sales data
- Clean and preprocess
- Create visualizations
- Build Streamlit dashboard

Customer Segmentation
- RFM analysis
- Clustering with K-means
- Visualize segments
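
For the Customer Segmentation project, RFM stands for Recency, Frequency, and Monetary value. The sketch below shows one possible way to compute these three numbers per customer with pandas. The `orders` table, its column names, and the choice of reference date are assumptions made for illustration; the K-means clustering step is not shown.

```python
import pandas as pd

# Hypothetical order history, invented for illustration
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "order_date": pd.to_datetime([
        "2024-01-01", "2024-03-01", "2024-01-15",
        "2024-02-01", "2024-02-20", "2024-03-10",
    ]),
    "amount": [50.0, 70.0, 20.0, 10.0, 15.0, 30.0],
})

# Reference date: one day after the last order in the data (an assumption)
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),  # days since last order
    frequency=("order_date", "count"),                            # number of orders
    monetary=("amount", "sum"),                                   # total spend
)

print(rfm)
```

The resulting table has one row per customer and could serve as the input features for the clustering step.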

Intermediate Projects

Stock Price Analysis
- Fetch data with yfinance
- Technical indicators
- Interactive charts with Plotly

Sentiment Analysis
- Text preprocessing
- Feature extraction
- Classification model
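
For the Stock Price Analysis project, two common starting indicators are a simple moving average and the daily return. This is a minimal sketch with pandas only. The prices are hypothetical values invented for illustration, and a real project would load them from yfinance instead.

```python
import pandas as pd

# Hypothetical closing prices; not real market data
close = pd.Series(
    [10.0, 11.0, 12.0, 11.0, 13.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="close",
)

# 3-day simple moving average: NaN until three observations are available
sma_3 = close.rolling(window=3).mean()

# Daily percentage return relative to the previous close
daily_return = close.pct_change()

print(pd.DataFrame({"close": close, "sma_3": sma_3, "return": daily_return}))
```

The first two moving-average values are missing because the window is not yet full, which is worth handling before plotting or modeling.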

Advanced Projects

Real-time Dashboard
- Streaming data
- Auto-refresh Streamlit
- Alerts and notifications

6. Next Learning Path

Continue learning

Machine Learning → Deep Learning → MLOps → Specialization

Recommended Resources

Courses:

Machine Learning Fundamentals (MinAI)
Deep Learning (MinAI)
Statistics Fundamentals (MinAI)

Books:

"Python for Data Analysis" - Wes McKinney
"Hands-On Machine Learning" - Aurélien Géron
"Data Science from Scratch" - Joel Grus

Practice:

Kaggle Competitions
LeetCode (Python)
Real-world projects

7. Quick Reference Card

Python

# === IMPORTS ===
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import streamlit as st

# === DATA LOADING ===
df = pd.read_csv("file.csv")
df = pd.read_excel("file.xlsx")
df = pd.read_json("file.json")

# === DATA EXPLORATION ===
df.shape
df.info()
df.describe()
df.head()
df.isnull().sum()
df.duplicated().sum()

# === DATA CLEANING ===
df.dropna()
df.drop_duplicates()
df.fillna(value)                  # value: the replacement you choose
df.rename(columns={'old': 'new'})
df.astype({'col': 'int'})

# === DATA TRANSFORMATION ===
df['new'] = df['a'] + df['b']
df.groupby('cat').agg({'num': 'mean'})
pd.merge(df1, df2, on='key')
pd.concat([df1, df2])
df.pivot_table(values='v', index='i', columns='c')

# === VISUALIZATION ===
sns.histplot(df['col'])
sns.boxplot(x='cat', y='num', data=df)
px.scatter(df, x='x', y='y', color='cat')
px.line(df, x='date', y='value')

# === STREAMLIT APP ===
st.title("My App")
st.dataframe(df)
st.plotly_chart(fig)              # fig: a Plotly figure, e.g. from px.scatter
option = st.selectbox("Select", options)  # options: a list of choices

Congratulations! 🎉

You have completed the Python for Data Science course!

Keep going:

✅ Take the quiz to test your knowledge
✅ Practice with the projects
✅ Explore other courses

Good luck on your Data Science journey! 🚀