Python Data Science Course Summary
1. Learning Journey
Congratulations on completing the course! Let's review what you have learned:
Course roadmap: Python Basics → Data Manipulation → Visualization → Advanced Analytics
2. Knowledge by Module
Module 1: Python Fundamentals
- Data Types
  - int, float, str, bool
  - list, dict, set, tuple
- Control Flow
  - if/elif/else
  - for, while loops
  - list comprehension
- Functions
  - def, lambda
  - args, kwargs
  - decorators
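The Functions topics above include args, kwargs and decorators. Here is a minimal sketch of how they fit together, using only the standard library. The names `timed`, `describe` and `last_elapsed` are made up for illustration and are not part of the course material.

```python
import functools
import time


def timed(func):
    """Decorator that stores how long the last call to func took."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)  # forward every argument unchanged
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    wrapper.last_elapsed = None
    return wrapper


@timed
def describe(*args, **kwargs):
    """Positional arguments arrive as a tuple, keyword arguments as a dict."""
    return {"positional": args, "keyword": kwargs}


summary = describe(1, 2, city="Hanoi")
print(summary)
print(describe.__name__)
```

Because the wrapper accepts `*args, **kwargs`, the same decorator can be applied to functions with any signature.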
Cheat Sheet:
Python

    # Variables
    name = "MinAI"
    age = 25
    is_active = True

    # List
    fruits = ["apple", "banana", "cherry"]
    fruits.append("orange")

    # Dictionary
    person = {"name": "Alice", "age": 25}
    person["city"] = "Hanoi"

    # List comprehension
    squares = [x**2 for x in range(10)]

    # Function
    def greet(name, greeting="Hello"):
        return f"{greeting}, {name}!"

    # Lambda
    square = lambda x: x ** 2

Module 2: Data Manipulation
Pandas vs Polars:
| Feature | Pandas | Polars |
|---|---|---|
| Speed | Baseline | Often much faster (up to 10-100x on some workloads) |
| Memory | Higher | Lower |
| Syntax | df['col'] | pl.col('col') |
| Lazy eval | No | Yes |
Pandas Cheat Sheet:
Python

    import pandas as pd

    # Read data
    df = pd.read_csv("data.csv")

    # Selection
    df['column']              # Single column
    df[['col1', 'col2']]      # Multiple columns
    df.loc[0, 'column']       # By label
    df.iloc[0, 0]             # By position

    # Filtering
    df[df['age'] > 30]
    df.query('age > 30 and city == "Hanoi"')

    # Groupby
    df.groupby('city')['salary'].mean()
    df.groupby(['city', 'dept']).agg({'salary': 'mean', 'id': 'count'})

    # Merge
    pd.merge(df1, df2, on='key', how='left')

Data Cleaning Cheat Sheet:
Python

    # Missing values
    df.isnull().sum()
    # Fill one column with its own median (fillna returns a new object)
    df['col'] = df['col'].fillna(df['col'].median())
    df = df.dropna(subset=['important_col'])

    # Duplicates
    df.duplicated().sum()
    df = df.drop_duplicates()

    # Outliers (IQR)
    Q1 = df['col'].quantile(0.25)
    Q3 = df['col'].quantile(0.75)
    IQR = Q3 - Q1
    df = df[(df['col'] >= Q1 - 1.5*IQR) & (df['col'] <= Q3 + 1.5*IQR)]

Module 3: Visualization
When to use which chart:
| Purpose | Chart Type | Library |
|---|---|---|
| Distribution | Histogram, KDE | Seaborn |
| Comparison | Bar, Box | Seaborn |
| Relationship | Scatter, Regression | Seaborn/Plotly |
| Composition | Pie, Stacked Bar | Plotly |
| Trend | Line | Plotly |
| Correlation | Heatmap | Seaborn |
| Interactive | Any | Plotly |
| Dashboard | Multiple | Streamlit |
Visualization Cheat Sheet:
Python

    import seaborn as sns
    import plotly.express as px

    # Seaborn
    sns.histplot(df['col'], kde=True)
    sns.boxplot(data=df, x='category', y='value')
    sns.scatterplot(data=df, x='x', y='y', hue='category')
    # numeric_only=True skips text columns, which recent pandas versions reject in corr()
    sns.heatmap(df.corr(numeric_only=True), annot=True)

    # Plotly
    px.scatter(df, x='x', y='y', color='category', size='size')
    px.line(df, x='date', y='value', color='category')
    px.bar(df, x='category', y='value', color='sub_category')
    px.pie(df, values='value', names='category')

Module 4: Advanced Analytics
EDA Workflow:
Python

    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    def quick_eda(df):
        # 1. Overview
        print(df.shape)
        df.info()  # info() prints its report directly and returns None

        # 2. Missing values
        print(df.isnull().sum())

        # 3. Statistics
        print(df.describe())

        # 4. Distributions
        for col in df.select_dtypes(include=[np.number]).columns:
            sns.histplot(df[col], kde=True)
            plt.show()

        # 5. Correlations (numeric columns only)
        sns.heatmap(df.corr(numeric_only=True), annot=True)
        plt.show()

Feature Engineering Cheat Sheet:
Python

    # Date features (assumes df['date'] is already datetime, e.g. via pd.to_datetime)
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6])

    # Aggregations
    df['customer_total'] = df.groupby('customer_id')['amount'].transform('sum')

    # Scaling
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    # fit_transform returns a 2-D array; flatten it to a single column
    df['col_scaled'] = scaler.fit_transform(df[['col']]).ravel()

    # Encoding
    df_encoded = pd.get_dummies(df, columns=['category'])

    # Feature selection (X = feature matrix, y = target labels, prepared beforehand)
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier()
    model.fit(X, y)
    importance = model.feature_importances_

3. Tools & Libraries Summary
| Library | Purpose | Key Functions |
|---|---|---|
| pandas | Data manipulation | read_csv, groupby, merge, pivot_table |
| polars | Fast data processing | scan_csv, filter, group_by, collect |
| numpy | Numerical computing | array, mean, std, reshape |
| seaborn | Statistical viz | histplot, boxplot, heatmap, pairplot |
| plotly | Interactive viz | scatter, line, bar, pie, choropleth |
| streamlit | Web apps | st.write, st.dataframe, st.plotly_chart |
| scikit-learn | ML preprocessing | StandardScaler, OneHotEncoder, train_test_split |
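Of the pandas functions listed above, `pivot_table` is the one the earlier cheat sheets show least. Here is a small sketch on made-up data (the `sales` frame and its values are invented for illustration). It also shows that the same result can be reached with `groupby` followed by `unstack`.

```python
import pandas as pd

# Invented sample data: salaries by city and department
sales = pd.DataFrame({
    "city": ["Hanoi", "Hanoi", "Hanoi", "Hue", "Hue"],
    "dept": ["A", "A", "B", "A", "B"],
    "salary": [10, 30, 20, 40, 50],
})

# Rows = city, columns = dept, cells = mean salary
pivot = sales.pivot_table(values="salary", index="city", columns="dept", aggfunc="mean")

# The same table built from groupby + unstack
by_group = sales.groupby(["city", "dept"])["salary"].mean().unstack("dept")

print(pivot)
```

`pivot_table` tends to be the more readable choice when the goal is a two-way summary table, while `groupby` is more flexible when further chained steps follow.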
4. Best Practices
Code Quality
Python

    # ✅ Good
    import pandas as pd
    import numpy as np

    def calculate_metrics(df: pd.DataFrame, target: str) -> dict:
        """
        Calculate basic metrics for a target column.

        Args:
            df: Input DataFrame
            target: Column name to analyze

        Returns:
            Dictionary with metrics
        """
        return {
            'mean': df[target].mean(),
            'median': df[target].median(),
            'std': df[target].std()
        }

    # ❌ Bad
    def calc(d, t):
        return d[t].mean(), d[t].median()

Data Processing
Python

    # ✅ Good - Chain operations
    df_clean = (
        df
        .dropna(subset=['important_col'])
        .drop_duplicates()
        .assign(new_col=lambda x: x['a'] + x['b'])
        .query('value > 0')
    )

    # ❌ Bad - Multiple reassignments
    df = df.dropna()
    df = df.drop_duplicates()
    df['new'] = df['a'] + df['b']
    df = df[df['value'] > 0]

Visualization
Python

    import matplotlib.pyplot as plt
    import seaborn as sns

    # ✅ Good - Informative plot
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.barplot(data=df, x='category', y='value', ax=ax)
    ax.set_title('Sales by Category', fontsize=14, fontweight='bold')
    ax.set_xlabel('Product Category')
    ax.set_ylabel('Sales ($)')
    plt.tight_layout()
    plt.savefig('chart.png', dpi=300)

    # ❌ Bad - Bare minimum
    df.plot()

5. Project Ideas
Beginner Projects
- Sales Analysis Dashboard
  - Load sales data
  - Clean and preprocess
  - Create visualizations
  - Build Streamlit dashboard
- Customer Segmentation
  - RFM analysis
  - Clustering with K-means
  - Visualize segments
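For the Customer Segmentation idea, the RFM step (Recency, Frequency, Monetary) can be sketched with plain pandas. The `orders` table, its column names and the snapshot date below are invented for illustration. A real project would load its own data and could then pass the three columns to a clustering step such as K-means.

```python
import pandas as pd

# Invented order history
orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3", "c3", "c3"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-01-20",
        "2024-02-10", "2024-02-25", "2024-03-09",
    ]),
    "amount": [100.0, 50.0, 20.0, 30.0, 40.0, 60.0],
})

# Reference date from which recency is measured (an assumption of this sketch)
snapshot = pd.Timestamp("2024-03-10")

grouped = orders.groupby("customer_id")
last_order = grouped["order_date"].max()

rfm = pd.DataFrame({
    "recency_days": (snapshot - last_order).dt.days,  # days since last order
    "frequency": grouped["order_date"].count(),       # number of orders
    "monetary": grouped["amount"].sum(),              # total spend
})

print(rfm)
```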
Intermediate Projects
- Stock Price Analysis
  - Fetch data with yfinance
  - Technical indicators
  - Interactive charts with Plotly
- Sentiment Analysis
  - Text preprocessing
  - Feature extraction
  - Classification model
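For the Sentiment Analysis idea, the first two steps (text preprocessing and feature extraction) can be sketched with the standard library alone. The helper names `tokenize` and `bag_of_words` and the sample reviews are hypothetical. A real project would more likely use a library vectorizer and then train a classifier on the resulting vectors.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(texts):
    """Return a sorted vocabulary and one word-count vector per document."""
    tokenized = [tokenize(t) for t in texts]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors


reviews = ["Great product, great price!", "Bad product."]
vocabulary, vectors = bag_of_words(reviews)
print(vocabulary)
print(vectors)
```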
Advanced Projects
- Real-time Dashboard
  - Streaming data
  - Auto-refresh Streamlit
  - Alerts and notifications
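For the alerting part of a real-time dashboard, one simple approach is to raise an alert when a rolling average of the incoming values crosses a threshold. The sketch below is a standard-library illustration. The class name `RollingAlert`, the window size and the threshold are assumptions, and a Streamlit app would call `update` each time it refreshes.

```python
from collections import deque


class RollingAlert:
    """Track the last `window` values and flag when their mean exceeds `threshold`."""

    def __init__(self, window, threshold):
        self.values = deque(maxlen=window)  # old values drop out automatically
        self.threshold = threshold

    def update(self, value):
        """Add a new reading and return True if the rolling mean is above the threshold."""
        self.values.append(value)
        mean = sum(self.values) / len(self.values)
        return mean > self.threshold


monitor = RollingAlert(window=3, threshold=10.0)
stream = [8, 9, 10, 14, 15, 5]  # stand-in for values arriving over time
alerts = [monitor.update(v) for v in stream]
print(alerts)
```

Averaging over a window rather than checking single values makes the alert less sensitive to one-off spikes, at the cost of reacting a little later.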
6. Next Learning Path
Keep learning: Machine Learning → Deep Learning → MLOps → Specialization
Recommended Resources
Courses:
- Machine Learning Fundamentals (MinAI)
- Deep Learning (MinAI)
- Statistics Fundamentals (MinAI)
Books:
- "Python for Data Analysis" - Wes McKinney
- "Hands-On Machine Learning" - Aurélien Géron
- "Data Science from Scratch" - Joel Grus
Practice:
- Kaggle Competitions
- LeetCode (Python)
- Real-world projects
7. Quick Reference Card
Python

    # === IMPORTS ===
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import plotly.express as px
    import streamlit as st

    # === DATA LOADING ===
    df = pd.read_csv("file.csv")
    df = pd.read_excel("file.xlsx")
    df = pd.read_json("file.json")

    # === DATA EXPLORATION ===
    df.shape
    df.info()
    df.describe()
    df.head()
    df.isnull().sum()
    df.duplicated().sum()

    # === DATA CLEANING ===
    df.dropna()
    df.drop_duplicates()
    df.fillna(value)  # value: the replacement you choose
    df.rename(columns={'old': 'new'})
    df.astype({'col': 'int'})

    # === DATA TRANSFORMATION ===
    df['new'] = df['a'] + df['b']
    df.groupby('cat').agg({'num': 'mean'})
    pd.merge(df1, df2, on='key')
    pd.concat([df1, df2])
    df.pivot_table(values='v', index='i', columns='c')

    # === VISUALIZATION ===
    sns.histplot(df['col'])
    sns.boxplot(x='cat', y='num', data=df)
    px.scatter(df, x='x', y='y', color='cat')
    px.line(df, x='date', y='value')

    # === STREAMLIT APP ===
    st.title("My App")
    st.dataframe(df)
    st.plotly_chart(fig)  # fig: a Plotly figure, e.g. from px.scatter
    option = st.selectbox("Select", options)  # options: a list of choices

Congratulations! 🎉
You have completed the Python for Data Science course!
Next steps:
- ✅ Take the Quiz to test your knowledge
- ✅ Practice with the projects
- ✅ Explore the other courses
Good luck on your Data Science journey! 🚀
