Theory
Lesson 13/14

Course Summary

Review and consolidate what you learned in Python Data Science

Python Data Science Course Summary

1. Learning Journey

Congratulations on completing the course! Let's review what you have learned:

Course roadmap

  1. Python Basics
  2. Data Manipulation
  3. Visualization
  4. Advanced Analytics

2. Knowledge by Module

Module 1: Python Fundamentals

Python Basics

  • Data Types
    • int, float, str, bool
    • list, dict, set, tuple
  • Control Flow
    • if/elif/else
    • for, while loops
    • list comprehension
  • Functions
    • def, lambda
    • *args, **kwargs
    • decorators

Cheat Sheet:

Python
# Variables
name = "MinAI"
age = 25
is_active = True

# List
fruits = ["apple", "banana", "cherry"]
fruits.append("orange")

# Dictionary
person = {"name": "Alice", "age": 25}
person["city"] = "Hanoi"

# List comprehension
squares = [x**2 for x in range(10)]

# Function
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

# Lambda
square = lambda x: x ** 2
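
The cheat sheet above has no example for `*args`, `**kwargs`, or decorators, which appear in the Functions topic list. A minimal sketch of how the three fit together; `log_calls` and `add` are hypothetical names chosen for illustration, not part of the course material:

```python
import functools

def log_calls(func):
    """Decorator that records every call made to func (hypothetical helper)."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        # *args collects positional arguments into a tuple,
        # **kwargs collects keyword arguments into a dict
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)
    wrapper.calls = []
    return wrapper

@log_calls
def add(a, b=0):
    """Return a + b."""
    return a + b

result = add(2, b=3)
print(result)
print(add.calls)
```

Writing `@log_calls` above `def add` is shorthand for `add = log_calls(add)`, so every later call goes through `wrapper` first.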

Module 2: Data Manipulation

Pandas vs Polars:

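| Feature   | Pandas    | Polars                                    |
|-----------|-----------|-------------------------------------------|
| Speed     | Standard  | Up to 10-100x faster on some workloads    |
| Memory    | Higher    | Lower                                     |
| Syntax    | df['col'] | pl.col('col')                             |
| Lazy eval | No        | Yes                                       |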

Pandas Cheat Sheet:

Python
import pandas as pd

# Read data
df = pd.read_csv("data.csv")

# Selection
df['column']            # Single column
df[['col1', 'col2']]    # Multiple columns
df.loc[0, 'column']     # By label
df.iloc[0, 0]           # By position

# Filtering
df[df['age'] > 30]
df.query('age > 30 and city == "Hanoi"')

# Groupby
df.groupby('city')['salary'].mean()
df.groupby(['city', 'dept']).agg({'salary': 'mean', 'id': 'count'})

# Merge
pd.merge(df1, df2, on='key', how='left')
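
To make the `how='left'` merge above concrete, here is a small sketch on two invented tables. The data and the names `employees`, `salaries`, and `unmatched` are assumptions for illustration only; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column.

```python
import pandas as pd

# Hypothetical tables, invented for illustration
employees = pd.DataFrame({"key": [1, 2, 3], "name": ["An", "Binh", "Chi"]})
salaries = pd.DataFrame({"key": [1, 3, 4], "salary": [1000, 1500, 2000]})

# how='left' keeps every row of the left table; rows without a match get NaN
merged = pd.merge(employees, salaries, on="key", how="left", indicator=True)

# The _merge column added by indicator=True shows where each row came from
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(unmatched["name"].tolist())
```

Checking for `left_only` rows after a left merge is a quick way to spot keys that are missing from the right table.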

Data Cleaning Cheat Sheet:

Python
# Missing values
df.isnull().sum()
# Fill one column with its own median (fillna returns a new object)
df['col'] = df['col'].fillna(df['col'].median())
df = df.dropna(subset=['important_col'])

# Duplicates
df.duplicated().sum()
df = df.drop_duplicates()

# Outliers (IQR): keep rows within 1.5 * IQR of the quartiles
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['col'] >= Q1 - 1.5*IQR) & (df['col'] <= Q3 + 1.5*IQR)]
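
As a worked example of the IQR rule above, the sketch below applies it to a small invented series. The values and the names `lower`, `upper`, `outliers`, and `cleaned` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 100], name="col")

q1 = values.quantile(0.25)   # 11.0
q3 = values.quantile(0.75)   # 12.5
iqr = q3 - q1                # 1.5

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr       # 8.75
upper = q3 + 1.5 * iqr       # 14.75

# between() is inclusive on both ends by default
is_outlier = ~values.between(lower, upper)
outliers = values[is_outlier]
cleaned = values[~is_outlier]
print(outliers.tolist())
```

Only the value 100 falls outside the fences, so `cleaned` keeps the other six observations.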

Module 3: Visualization

When to use which chart:

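| Purpose      | Chart Type          | Library        |
|--------------|---------------------|----------------|
| Distribution | Histogram, KDE      | Seaborn        |
| Comparison   | Bar, Box            | Seaborn        |
| Relationship | Scatter, Regression | Seaborn/Plotly |
| Composition  | Pie, Stacked Bar    | Plotly         |
| Trend        | Line                | Plotly         |
| Correlation  | Heatmap             | Seaborn        |
| Interactive  | Any                 | Plotly         |
| Dashboard    | Multiple            | Streamlit      |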

Visualization Cheat Sheet:

Python
import seaborn as sns
import plotly.express as px

# Seaborn
sns.histplot(df['col'], kde=True)
sns.boxplot(data=df, x='category', y='value')
sns.scatterplot(data=df, x='x', y='y', hue='category')
# numeric_only=True skips text columns such as 'category'
sns.heatmap(df.corr(numeric_only=True), annot=True)

# Plotly
px.scatter(df, x='x', y='y', color='category', size='size')
px.line(df, x='date', y='value', color='category')
px.bar(df, x='category', y='value', color='sub_category')
px.pie(df, values='value', names='category')

Module 4: Advanced Analytics

EDA Workflow:

Python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def quick_eda(df):
    # 1. Overview
    print(df.shape)
    df.info()  # info() prints directly and returns None

    # 2. Missing values
    print(df.isnull().sum())

    # 3. Statistics
    print(df.describe())

    # 4. Distributions of numeric columns
    for col in df.select_dtypes(include=[np.number]).columns:
        sns.histplot(df[col], kde=True)
        plt.show()

    # 5. Correlations (numeric columns only)
    sns.heatmap(df.corr(numeric_only=True), annot=True)
    plt.show()

Feature Engineering Cheat Sheet:

Python
import pandas as pd

# Date features (df['date'] must already be a datetime column)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6])

# Aggregations
df['customer_total'] = df.groupby('customer_id')['amount'].transform('sum')

# Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform returns a 2-D array; ravel() flattens it into one column
df['col_scaled'] = scaler.fit_transform(df[['col']]).ravel()

# Encoding
df_encoded = pd.get_dummies(df, columns=['category'])

# Feature selection (X: feature matrix, y: target labels, defined elsewhere)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
importance = model.feature_importances_
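
The date and aggregation lines in the cheat sheet above rely on two details that are easy to miss: the `.dt` accessor needs a datetime column, and `transform('sum')` returns one value per original row rather than one per group. A minimal sketch on invented transactions (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical transactions, invented for illustration
df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": ["2024-01-05", "2024-01-06", "2024-01-08"],
    "amount": [100.0, 50.0, 30.0],
})

# The .dt accessor only works on datetime columns, so convert the strings first
df["date"] = pd.to_datetime(df["date"])

# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
df["is_weekend"] = df["date"].dt.dayofweek.isin([5, 6])

# transform('sum') broadcasts each group's total back onto every row of the group
df["customer_total"] = df.groupby("customer_id")["amount"].transform("sum")
print(df)
```

Because `transform` preserves the row count, the result can be assigned straight back as a new column, unlike `groupby(...).sum()`, which returns one row per customer.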

3. Tools & Libraries Summary

| Library      | Purpose              | Key Functions                                    |
|--------------|----------------------|--------------------------------------------------|
| pandas       | Data manipulation    | read_csv, groupby, merge, pivot_table            |
| polars       | Fast data processing | scan_csv, filter, group_by, collect              |
| numpy        | Numerical computing  | array, mean, std, reshape                        |
| seaborn      | Statistical viz      | histplot, boxplot, heatmap, pairplot             |
| plotly       | Interactive viz      | scatter, line, bar, pie, choropleth              |
| streamlit    | Web apps             | st.write, st.dataframe, st.plotly_chart          |
| scikit-learn | ML preprocessing     | StandardScaler, OneHotEncoder, train_test_split  |

4. Best Practices

Code Quality

Python
# ✅ Good
import pandas as pd
import numpy as np

def calculate_metrics(df: pd.DataFrame, target: str) -> dict:
    """
    Calculate basic metrics for a target column.

    Args:
        df: Input DataFrame
        target: Column name to analyze

    Returns:
        Dictionary with metrics
    """
    return {
        'mean': df[target].mean(),
        'median': df[target].median(),
        'std': df[target].std()
    }

# ❌ Bad
def calc(d, t):
    return d[t].mean(), d[t].median()

Data Processing

Python
# ✅ Good - Chain operations
df_clean = (
    df
    .dropna(subset=['important_col'])
    .drop_duplicates()
    .assign(new_col=lambda x: x['a'] + x['b'])
    .query('value > 0')
)

# ❌ Bad - Multiple reassignments
df = df.dropna()
df = df.drop_duplicates()
df['new'] = df['a'] + df['b']
df = df[df['value'] > 0]

Visualization

Python
import matplotlib.pyplot as plt
import seaborn as sns

# ✅ Good - Informative plot
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=df, x='category', y='value', ax=ax)
ax.set_title('Sales by Category', fontsize=14, fontweight='bold')
ax.set_xlabel('Product Category')
ax.set_ylabel('Sales ($)')
plt.tight_layout()
plt.savefig('chart.png', dpi=300)

# ❌ Bad - Bare minimum
df.plot()

5. Project Ideas

Beginner Projects

  1. Sales Analysis Dashboard

    • Load sales data
    • Clean and preprocess
    • Create visualizations
    • Build Streamlit dashboard
  2. Customer Segmentation

    • RFM analysis
    • Clustering with K-means
    • Visualize segments
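
As a possible starting point for the RFM step in the Customer Segmentation project, the sketch below computes recency, frequency, and monetary value per customer with pandas. The order log, the column names, and the choice of snapshot date (one day after the last order) are all assumptions for illustration, not a prescribed method.

```python
import pandas as pd

# Hypothetical order log, invented for illustration
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-01", "2024-03-01", "2024-02-15",
        "2024-01-10", "2024-02-20", "2024-03-10",
    ]),
    "amount": [50.0, 70.0, 20.0, 10.0, 30.0, 40.0],
})

# Reference date for recency: the day after the last order (an assumption)
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),  # days since last order
    frequency=("order_date", "count"),                            # number of orders
    monetary=("amount", "sum"),                                   # total spend
)
print(rfm)
```

The resulting table has one row per customer and could then be scaled and passed to a clustering step such as K-means.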

Intermediate Projects

  1. Stock Price Analysis

    • Fetch data with yfinance
    • Technical indicators
    • Interactive charts with Plotly
  2. Sentiment Analysis

    • Text preprocessing
    • Feature extraction
    • Classification model

Advanced Projects

  1. Real-time Dashboard
    • Streaming data
    • Auto-refresh Streamlit
    • Alerts and notifications

6. Next Learning Path

Keep learning

  1. Machine Learning
  2. Deep Learning
  3. MLOps
  4. Specialization

Recommended Resources

Courses:

  • Machine Learning Fundamentals (MinAI)
  • Deep Learning (MinAI)
  • Statistics Fundamentals (MinAI)

Books:

  • "Python for Data Analysis" - Wes McKinney
  • "Hands-On Machine Learning" - Aurélien Géron
  • "Data Science from Scratch" - Joel Grus

Practice:

  • Kaggle Competitions
  • LeetCode (Python)
  • Real-world projects

7. Quick Reference Card

Python
# === IMPORTS ===
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import streamlit as st

# === DATA LOADING ===
df = pd.read_csv("file.csv")
df = pd.read_excel("file.xlsx")
df = pd.read_json("file.json")

# === DATA EXPLORATION ===
df.shape
df.info()
df.describe()
df.head()
df.isnull().sum()
df.duplicated().sum()

# === DATA CLEANING ===
df.dropna()
df.drop_duplicates()
df.fillna(value)  # value: placeholder for your fill value
df.rename(columns={'old': 'new'})
df.astype({'col': 'int'})

# === DATA TRANSFORMATION ===
df['new'] = df['a'] + df['b']
df.groupby('cat').agg({'num': 'mean'})
pd.merge(df1, df2, on='key')
pd.concat([df1, df2])
df.pivot_table(values='v', index='i', columns='c')

# === VISUALIZATION ===
sns.histplot(df['col'])
sns.boxplot(x='cat', y='num', data=df)
px.scatter(df, x='x', y='y', color='cat')
fig = px.line(df, x='date', y='value')

# === STREAMLIT APP ===
st.title("My App")
st.dataframe(df)
st.plotly_chart(fig)  # fig: the Plotly figure created above
option = st.selectbox("Select", options)  # options: placeholder list of choices

Congratulations! 🎉

You have completed the Python for Data Science course!

Next steps:

  1. ✅ Take the quiz to test your knowledge
  2. ✅ Practice with the projects
  3. ✅ Explore the other courses

Good luck on your Data Science journey! 🚀