🎯 Learning Objectives
After this lesson, you will be able to:
✅ Install Python and the core data analysis packages
✅ Set up a virtual environment the way professionals do
✅ Use Jupyter Notebook/Lab for analysis
✅ Understand Python basics: types, collections, functions
✅ Understand NumPy for numerical computing
Duration: 1.5 hours | Difficulty: Beginner | Tools: Python 3.10+, Anaconda/pip, Jupyter
📖 Key Terms
| Term | Vietnamese | Description |
|---|---|---|
| Virtual Environment | Môi trường ảo | Isolated Python environment per project |
| Anaconda | - | Python distribution with 250+ data science packages |
| Jupyter Notebook | - | Interactive coding: code + text + visualizations |
| pip | - | Python package manager |
| conda | - | Package + environment manager |
| NumPy | - | Numerical Python: fast array operations |
| Pandas | - | Data manipulation library (DataFrames) |
| DataFrame | Bảng dữ liệu | 2D labeled data structure |
| JupyterLab | - | Next-generation Jupyter with a tabbed interface |
| Google Colab | - | Cloud-hosted Jupyter: free GPU, no setup |
Checkpoint
Virtual env = isolated packages per project. Anaconda = all-in-one data science distribution. Jupyter = interactive analysis. Both pip and conda install packages, but conda also manages environments!
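To make the idea of an isolated environment concrete, here is a minimal sketch that checks, from inside Python, which kind of environment the interpreter is running in. It relies on `sys.prefix` and `sys.base_prefix` from the standard library; the helper names `in_virtual_env` and `active_conda_env` are illustrative, and reading the `CONDA_DEFAULT_ENV` variable assumes the environment was activated with `conda activate`.

```python
import os
import sys
from typing import Optional


def in_virtual_env() -> bool:
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def active_conda_env() -> Optional[str]:
    """Return the active conda environment name, or None if not set.

    Assumption: `conda activate` exports CONDA_DEFAULT_ENV.
    """
    return os.environ.get("CONDA_DEFAULT_ENV")


print("Interpreter:", sys.executable)
print("Inside a venv:", in_virtual_env())
print("Conda environment:", active_conda_env())
```

Running this before installing packages is a quick way to confirm that they will land in the project environment rather than in the system-wide Python.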
🐍 1. Why Python?
1.1 Python for Data Analysis
| Advantage | Description |
|---|---|
| Easy to learn | Clear, readable syntax |
| Rich ecosystem | Pandas, NumPy, Scikit-learn, etc. |
| Community | Large community, plenty of documentation |
| Versatile | Analysis, ML, Web, Automation |
| Job market | Among the most sought-after skills |
1.2 Data Analysis Stack
Checkpoint
Python is a leading choice for data analysis thanks to its ecosystem (Pandas, NumPy, Matplotlib), community, and job market. Stack: NumPy (arrays) → Pandas (DataFrames) → Matplotlib (charts).
💻 2. Installing Python
2.1 Method 1: Anaconda (Recommended)
```bash
# Download Anaconda from https://www.anaconda.com/download

# After installing, verify:
conda --version

# Create a new environment
conda create -n data-analysis python=3.10

# Activate the environment
conda activate data-analysis

# Install packages
conda install pandas numpy matplotlib seaborn jupyter
```
Anaconda is a Python distribution that bundles 250+ data science packages, a package manager (conda), and virtual environments, so you can get started quickly.
2.2 Method 2: Python + pip
```bash
# Download Python from https://www.python.org/downloads/
python --version

# Create virtual environment
python -m venv data-analysis-env

# Activate (Windows)
data-analysis-env\Scripts\activate

# Activate (macOS/Linux)
source data-analysis-env/bin/activate

# Install packages
pip install pandas numpy matplotlib seaborn jupyter notebook
```
2.3 Required Packages
```bash
# Core
pip install pandas numpy scipy

# Visualization
pip install matplotlib seaborn plotly

# Jupyter
pip install jupyter notebook jupyterlab

# Database + Excel
pip install sqlalchemy psycopg2-binary openpyxl xlrd

# Statistics
pip install statsmodels scikit-learn
```
Checkpoint
Two options: Anaconda (recommended, all-in-one) or Python + pip (lightweight). Always use a virtual environment to isolate projects!
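The `python -m venv` command has a programmatic counterpart in the standard library's `venv` module. The sketch below is only an illustration: it creates an environment in a throwaway temporary directory (a hypothetical location chosen so nothing is left behind) and checks that the `pyvenv.cfg` marker file was written.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location: a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "data-analysis-env"

    # with_pip=False keeps creation fast; pass with_pip=True to bootstrap pip.
    venv.create(env_dir, with_pip=False)

    # Every venv contains a pyvenv.cfg file at its root.
    created = (env_dir / "pyvenv.cfg").exists()
    print("Environment created:", created)
```

For day-to-day work the command-line form is usually simpler; the module form is mainly useful in setup scripts.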
📓 3. Jupyter Notebook
3.1 Launching
```bash
# Classic Notebook
jupyter notebook

# JupyterLab (recommended)
jupyter lab
# → http://localhost:8888
```
3.2 Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| Shift + Enter | Run cell, move to next |
| Ctrl + Enter | Run cell, stay |
| Alt + Enter | Run cell, insert below |
| A / B | Insert cell above / below (command mode) |
| DD | Delete cell (command mode) |
| M / Y | Markdown / Code mode |
3.3 Notebook Best Practices
```python
# Cell 1: Imports (always at top)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
```
3.4 IDE Alternatives
| IDE | Best For |
|---|---|
| VS Code | Full IDE + Jupyter + Git integration |
| PyCharm | Large projects, excellent debugging |
| Google Colab | Free GPU, no setup, easy sharing |
Checkpoint
Jupyter = interactive analysis (code + text + viz). Shortcuts: Shift+Enter (run), A/B (insert), DD (delete). VS Code + the Jupyter extension = a great combo for professionals!
🔤 4. Python Basics for Analysis
4.1 Data Types
```python
integer_val = 42
float_val = 3.14
name = "Data Analysis"
is_valid = True
missing_value = None

print(type(42))       # <class 'int'>
print(type(3.14))     # <class 'float'>
print(type("hello"))  # <class 'str'>
```
4.2 Collections
```python
# List - ordered, mutable
numbers = [1, 2, 3, 4, 5]
numbers.append(6)
print(numbers[1:3])  # [2, 3]

# Tuple - ordered, immutable
coordinates = (10.5, 20.3)
x, y = coordinates  # Unpacking

# Dictionary - key-value pairs
person = {'name': 'Alice', 'age': 30, 'city': 'Hanoi'}
print(person['name'])  # Alice

# Set - unique values
unique_ids = {1, 2, 3, 2, 1}  # {1, 2, 3}
```
4.3 Control Flow & Functions
```python
# List comprehension (Pythonic!)
squares = [x**2 for x in range(10)]
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# Functions
def calculate_tax(amount, rate=0.1):
    return amount * rate

# Lambda
square = lambda x: x ** 2

# Multiple returns
def get_statistics(numbers):
    return min(numbers), max(numbers), sum(numbers)/len(numbers)

minimum, maximum, average = get_statistics([1, 2, 3, 4, 5])

# *args and **kwargs
def flexible_func(*args, **kwargs):
    print(f"Args: {args}, Kwargs: {kwargs}")
```
Checkpoint
Collections: List (mutable), Tuple (immutable), Dict (key-value), Set (unique). List comprehension = Pythonic filtering. Lambda = anonymous function!
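As a small illustration of how these pieces combine in an analysis task, the sketch below applies a set, a list comprehension with a filter, a dict comprehension, and a lambda to a handful of order records. The `orders` data is made up purely for the example.

```python
# Hypothetical sample records (made up for illustration)
orders = [
    {"city": "Hanoi", "amount": 120.0},
    {"city": "Da Nang", "amount": 80.0},
    {"city": "Hanoi", "amount": 200.0},
    {"city": "Hue", "amount": 50.0},
]

# Set comprehension: unique cities
cities = {order["city"] for order in orders}

# List comprehension with a filter: amounts of orders of at least 100
large_amounts = [o["amount"] for o in orders if o["amount"] >= 100]

# Dict comprehension: total amount per city
totals = {c: sum(o["amount"] for o in orders if o["city"] == c) for c in cities}

# Lambda as a key function + tuple unpacking: the city with the highest total
top_city, top_total = max(totals.items(), key=lambda item: item[1])

print(sorted(cities))       # ['Da Nang', 'Hanoi', 'Hue']
print(large_amounts)        # [120.0, 200.0]
print(top_city, top_total)  # Hanoi 320.0
```

Pandas expresses the same grouping far more compactly, but working through it with plain collections first makes the later DataFrame operations easier to read.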
🔢 5. NumPy Basics
5.1 Arrays
```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros(5)           # [0., 0., 0., 0., 0.] (floats by default)
arr3 = np.ones(5)            # [1., 1., 1., 1., 1.] (floats by default)
arr4 = np.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
arr5 = np.linspace(0, 1, 5)  # [0, 0.25, 0.5, 0.75, 1]

# 2D arrays
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix.shape)  # (3, 3)
```
5.2 Operations & Statistics
```python
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Element-wise
print(a + b)   # [6, 8, 10, 12]
print(a * b)   # [5, 12, 21, 32]
print(a ** 2)  # [1, 4, 9, 16]

# Statistics
print(np.mean(a))    # 2.5
print(np.std(a))     # 1.118
print(np.median(a))  # 2.5

# Boolean indexing
arr = np.arange(10)
print(arr[arr > 5])  # [6, 7, 8, 9]

# 2D indexing
matrix = np.arange(12).reshape(3, 4)
print(matrix[1, 2])  # 6
print(matrix[0, :])  # [0, 1, 2, 3]
print(matrix[:, 1])  # [1, 5, 9]
```
Checkpoint
NumPy = fast vectorized operations (no explicit loops!). Boolean indexing = filtering with conditions. NumPy is the foundation for Pandas!
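To show what "no loops" means in practice, here is a minimal sketch that computes the same result twice, once with an explicit Python loop and once with a single vectorized expression, and then filters with a boolean mask. The `prices` values and the 10% markup are made up for illustration.

```python
import numpy as np

prices = np.array([10.0, 25.0, 40.0, 55.0])  # made-up values

# Loop version: one Python-level multiplication per element
taxed_loop = []
for p in prices:
    taxed_loop.append(p * 1.1)

# Vectorized version: one expression applied to the whole array
taxed_vec = prices * 1.1

# Boolean indexing: the comparison builds a mask, the mask selects elements
mask = prices > 30
expensive = prices[mask]

print(np.allclose(taxed_loop, taxed_vec))  # True
print(mask)                                # [False False  True  True]
print(expensive)                           # [40. 55.]
```

Both versions give the same numbers; the vectorized form is shorter and, on large arrays, typically much faster because the loop runs in compiled code.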
✅ 6. Verify Installation
```python
print("Testing Python Data Analysis Setup...")
print("=" * 50)

import pandas as pd
print(f"✅ Pandas {pd.__version__}")

import numpy as np
print(f"✅ NumPy {np.__version__}")

import matplotlib
print(f"✅ Matplotlib {matplotlib.__version__}")

import seaborn as sns
print(f"✅ Seaborn {sns.__version__}")

# Test operations
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(f"✅ DataFrame created:\n{df}")

print("\n🎉 Setup complete! Ready for Data Analysis!")
```
If any import fails, rerun pip install package_name or conda install package_name. Make sure the virtual environment is activated!
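Because the test script stops at the first failing import, it can help to list every missing package in one pass. The following sketch uses `importlib.util.find_spec` from the standard library to do that without importing anything; the `REQUIRED` list and the `find_missing` helper are illustrative names, not part of any package.

```python
import importlib.util

# Top-level import names to check (adjust to your project)
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def find_missing(packages):
    """Return the names in `packages` that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

Note that the import name is not always the same as the install name (for example, `sklearn` is installed as `scikit-learn`), so the suggested install command is only a starting point.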
📋 Summary
What you learned
| Topic | Key Points |
|---|---|
| Installation | Anaconda (recommended) or Python + pip |
| Environment | Virtual environments to isolate projects |
| Jupyter | Interactive analysis, export reports |
| Python Basics | Types, collections, functions, comprehensions |
| NumPy | Efficient numerical operations, arrays |
Completion checklist
- Python 3.10+ installed
- Virtual environment created
- Core packages installed
- Jupyter working
- Test script passed
Self-check questions
- What problem does a virtual environment solve?
- How does a Jupyter Notebook differ from a Python script?
- How does a NumPy array differ from a Python list?
- Anaconda vs pip: when should you use each?
Next lesson: Pandas Fundamentals (DataFrames and basic operations) →
🎉 Great work! You have finished setting up your Python environment for data analysis!
Remember: virtual environment + Jupyter + NumPy are an indispensable trio. Always create a new environment for each project!
