Python Setup & Environment
1. Introduction
Python for Data Analysis
Python is one of the most widely used languages for data analysis, thanks to a rich ecosystem that includes Pandas, NumPy, Matplotlib, and many other powerful libraries.
1.1 Why Python?
| Advantage | Description |
|---|---|
| Easy to learn | Clear, readable syntax |
| Rich ecosystem | Pandas, NumPy, Scikit-learn, etc. |
| Community | Large community, extensive documentation |
| Versatile | Analysis, ML, Web, Automation |
| Job market | A highly sought-after skill |
1.2 Data Analysis Stack
```text
┌─────────────────────────────────────────────────────────┐
│                    Python Data Stack                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│     ┌────────────┐  ┌────────────┐  ┌────────────┐      │
│     │   NumPy    │  │   Pandas   │  │ Matplotlib │      │
│     │  (Arrays)  │  │(DataFrames)│  │  (Charts)  │      │
│     └────────────┘  └────────────┘  └────────────┘      │
│           ▲               ▲               ▲             │
│           │               │               │             │
│     ┌─────┴───────────────┴───────────────┴─────┐       │
│     │               Python 3.10+                │       │
│     └───────────────────────────────────────────┘       │
│                                                         │
│     Supporting: Seaborn, Plotly, SciPy, Statsmodels     │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
2. Installing Python
2.1 Method 1: Anaconda (Recommended)
```bash
# Download Anaconda from https://www.anaconda.com/download

# After installing, verify:
conda --version
# conda 23.x.x

# Create a new environment
conda create -n data-analysis python=3.10

# Activate the environment
conda activate data-analysis

# Install packages
conda install pandas numpy matplotlib seaborn jupyter
```
What is Anaconda?
Anaconda is a Python distribution that bundles 250+ Data Science packages, a package manager (conda), and virtual environment support, so you can get started quickly.
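To confirm which environment a Python session is actually using, you can inspect the interpreter itself. The sketch below is only an illustration and not part of Anaconda: the helper name `describe_environment` is hypothetical, and it assumes that conda exports `CONDA_DEFAULT_ENV` when an environment is activated and that a `venv` environment has a `sys.prefix` different from `sys.base_prefix`.

```python
import os
import sys


def describe_environment() -> dict:
    """Summarize which Python environment is currently active (hypothetical helper)."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix points at the base Python installation.
    in_venv = sys.prefix != sys.base_prefix
    # Assumption: conda sets CONDA_DEFAULT_ENV on `conda activate`.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    return {
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "executable": sys.executable,
        "in_venv": in_venv,
        "conda_env": conda_env,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If `conda_env` is `None` and `in_venv` is `False`, the session is most likely running on the system-wide interpreter rather than an isolated environment.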
2.2 Method 2: Python + pip
```bash
# Download Python from https://www.python.org/downloads/

# Verify installation (on some macOS/Linux systems the command is python3)
python --version
# Python 3.10.x

pip --version
# pip 23.x.x

# Create virtual environment
python -m venv data-analysis-env

# Activate (Windows)
data-analysis-env\Scripts\activate

# Activate (macOS/Linux)
source data-analysis-env/bin/activate

# Install packages
pip install pandas numpy matplotlib seaborn jupyter notebook
```
2.3 Required Packages
```bash
# Core packages
pip install pandas numpy scipy

# Visualization
pip install matplotlib seaborn plotly

# Jupyter
pip install jupyter notebook jupyterlab

# Database connections
pip install sqlalchemy psycopg2-binary pymysql

# Excel support
pip install openpyxl xlrd

# Statistical analysis
pip install statsmodels scikit-learn

# Date handling
pip install python-dateutil pytz
```
3. Jupyter Notebook
3.1 Introduction to Jupyter
Jupyter Notebook is an interactive environment for Data Analysis:
- Combines code, text, and visualizations
- Runs one cell at a time
- Exports to PDF and HTML
- Easy to share
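It can help to see why a notebook mixes code and text so easily: an `.ipynb` file is a JSON document containing a list of cells. The sketch below builds a minimal two-cell notebook using only the standard library. It assumes the version 4 notebook format; the helper `build_notebook` and the file name `first_notebook.ipynb` are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def build_notebook(cells: list[tuple[str, str]]) -> dict:
    """Build a minimal notebook dict from (cell_type, source) pairs."""
    nb_cells = []
    for cell_type, source in cells:
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            # Code cells also carry an execution counter and their outputs.
            cell["execution_count"] = None
            cell["outputs"] = []
        nb_cells.append(cell)
    # Assumption: notebook format version 4 (minor version 4).
    return {"cells": nb_cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = build_notebook([
    ("markdown", "# Sales analysis\nNotes explaining the next step."),
    ("code", "import pandas as pd"),
])

# Write the notebook to a temporary directory (hypothetical file name).
path = Path(tempfile.mkdtemp()) / "first_notebook.ipynb"
path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
print(f"Wrote {len(notebook['cells'])} cells to {path}")
```

Because each cell is stored separately, Jupyter can run cells one at a time and keep each cell's output next to the code that produced it.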
3.2 Starting Jupyter
```bash
# Classic Notebook
jupyter notebook

# JupyterLab (recommended)
jupyter lab

# Open the URL: http://localhost:8888
```
3.3 Jupyter Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| Shift + Enter | Run cell, move to next |
| Ctrl + Enter | Run cell, stay |
| Alt + Enter | Run cell, insert below |
| A | Insert cell above (command mode) |
| B | Insert cell below (command mode) |
| DD | Delete cell (command mode) |
| M | Convert to Markdown |
| Y | Convert to Code |
| Esc | Enter command mode |
| Enter | Enter edit mode |
3.4 Notebook best practices
```python
# Cell 1: Imports (always at top)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

# Cell 2: Load data
# Cell 3: Explore data
# Cell 4+: Analysis
# Last cell: Summary & conclusions
```
4. IDE Alternatives
4.1 VS Code
```bash
# Install Python extension
# Install Jupyter extension

# Benefits:
# - Full IDE features (debugging, Git)
# - Run notebooks in VS Code
# - Better code completion
# - Integrated terminal
```
4.2 PyCharm
```bash
# PyCharm Professional has Jupyter support
# Good for large projects
# Excellent debugging

# Community edition: Free but no Jupyter
# Professional: Paid with full features
```
4.3 Google Colab (Cloud)
```python
# Access: https://colab.research.google.com

# Benefits:
# - Free GPU/TPU access
# - No setup required
# - Share easily
# - Pre-installed packages

# Mount Google Drive (this import only works inside Colab)
from google.colab import drive
drive.mount('/content/drive')
```
5. Python Basics for Analysis
5.1 Data Types
```python
# Numbers
integer_val = 42
float_val = 3.14
complex_val = 1 + 2j

# Strings
name = "Data Analysis"
multiline = """
This is a
multiline string
"""

# Boolean
is_valid = True
is_empty = False

# None
missing_value = None

# Type checking
print(type(42))       # <class 'int'>
print(type(3.14))     # <class 'float'>
print(type("hello"))  # <class 'str'>
```
5.2 Collections
```python
# List - ordered, mutable
numbers = [1, 2, 3, 4, 5]
numbers.append(6)
numbers[0] = 10
print(numbers[1:3])  # [2, 3]

# Tuple - ordered, immutable
coordinates = (10.5, 20.3)
x, y = coordinates  # Unpacking

# Dictionary - key-value pairs
person = {
    'name': 'Alice',
    'age': 30,
    'city': 'Hanoi'
}
print(person['name'])  # Alice
person['email'] = 'alice@email.com'  # Add new key

# Set - unique values
unique_ids = {1, 2, 3, 2, 1}  # {1, 2, 3}
```
5.3 Control Flow
```python
# Conditionals
score = 85
if score >= 90:
    grade = 'A'
elif score >= 80:
    grade = 'B'
else:
    grade = 'C'

# Loops
# For loop
for i in range(5):
    print(i)  # 0, 1, 2, 3, 4

for name in ['Alice', 'Bob', 'Charlie']:
    print(f"Hello, {name}")

# While loop
count = 0
while count < 3:
    print(count)
    count += 1

# List comprehension (Pythonic!)
squares = [x**2 for x in range(10)]
even_squares = [x**2 for x in range(10) if x % 2 == 0]
```
5.4 Functions
```python
# Basic function
def greet(name):
    return f"Hello, {name}!"

# Default parameters
def calculate_tax(amount, rate=0.1):
    return amount * rate

# Multiple returns
def get_statistics(numbers):
    return min(numbers), max(numbers), sum(numbers)/len(numbers)

minimum, maximum, average = get_statistics([1, 2, 3, 4, 5])

# Lambda functions
square = lambda x: x ** 2
add = lambda a, b: a + b

# *args and **kwargs
def flexible_func(*args, **kwargs):
    print(f"Args: {args}")
    print(f"Kwargs: {kwargs}")

flexible_func(1, 2, 3, name='Alice', age=30)
```
6. NumPy Basics
6.1 NumPy Arrays
```python
import numpy as np

# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros(5)           # [0., 0., 0., 0., 0.] (floats by default)
arr3 = np.ones(5)            # [1., 1., 1., 1., 1.] (floats by default)
arr4 = np.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
arr5 = np.linspace(0, 1, 5)  # [0., 0.25, 0.5, 0.75, 1.]

# 2D arrays
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(matrix.shape)  # (3, 3)
print(matrix.dtype)  # int64 on most platforms
```
6.2 Array Operations
```python
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Element-wise operations
print(a + b)   # [6, 8, 10, 12]
print(a * b)   # [5, 12, 21, 32]
print(a ** 2)  # [1, 4, 9, 16]

# Statistics
print(np.mean(a))    # 2.5
print(np.std(a))     # 1.118
print(np.median(a))  # 2.5
print(np.sum(a))     # 10

# Aggregations
data = np.random.randn(100)
print(f"Mean: {np.mean(data):.2f}")
print(f"Std: {np.std(data):.2f}")
print(f"Min: {np.min(data):.2f}")
print(f"Max: {np.max(data):.2f}")
```
6.3 Indexing & Slicing
```python
arr = np.arange(10)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Basic indexing
print(arr[0])    # 0
print(arr[-1])   # 9
print(arr[2:5])  # [2, 3, 4]
print(arr[::2])  # [0, 2, 4, 6, 8]

# Boolean indexing
mask = arr > 5
print(arr[mask])  # [6, 7, 8, 9]

# 2D indexing
matrix = np.arange(12).reshape(3, 4)
# [[ 0,  1,  2,  3],
#  [ 4,  5,  6,  7],
#  [ 8,  9, 10, 11]]

print(matrix[1, 2])      # 6
print(matrix[0, :])      # [0, 1, 2, 3]
print(matrix[:, 1])      # [1, 5, 9]
print(matrix[0:2, 1:3])  # [[1, 2], [5, 6]]
```
7. Verify Installation
7.1 Test Script
```python
# test_setup.py
print("Testing Python Data Analysis Setup...")
print("=" * 50)

# Test imports
try:
    import pandas as pd
    print(f"✅ Pandas {pd.__version__}")
except ImportError:
    print("❌ Pandas not installed")

try:
    import numpy as np
    print(f"✅ NumPy {np.__version__}")
except ImportError:
    print("❌ NumPy not installed")

try:
    import matplotlib
    print(f"✅ Matplotlib {matplotlib.__version__}")
except ImportError:
    print("❌ Matplotlib not installed")

try:
    import seaborn as sns
    print(f"✅ Seaborn {sns.__version__}")
except ImportError:
    print("❌ Seaborn not installed")

try:
    import scipy
    print(f"✅ SciPy {scipy.__version__}")
except ImportError:
    print("❌ SciPy not installed")

# Test basic operations (requires the pandas and NumPy imports above to have succeeded)
print("\n" + "=" * 50)
print("Testing operations...")

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(f"✅ Created DataFrame:\n{df}")

arr = np.random.randn(5)
print(f"\n✅ NumPy random array: {arr}")

print("\n" + "=" * 50)
print("🎉 Setup complete! Ready for Data Analysis!")
```
8. Practice
Hands-on Exercise
Exercise: Environment Setup Check
```python
# 1. Create a new virtual environment
# 2. Install all required packages
# 3. Launch Jupyter Lab
# 4. Create your first notebook with the imports
# 5. Load sample data and display it

# YOUR CODE HERE
```
💡 View solution
```python
# Cell 1: Setup and imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Verify versions
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")

# Cell 2: Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

# Cell 3: Load sample data
# Using built-in datasets (downloaded on first use, so this needs internet access)
tips = sns.load_dataset('tips')
print(f"Dataset shape: {tips.shape}")
tips.head()

# Cell 4: Quick exploration
tips.info()  # info() prints its report directly and returns None
print(tips.describe())

# Cell 5: Simple visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

tips['day'].value_counts().plot(kind='bar', ax=axes[0], title='Orders by Day')
sns.histplot(tips['total_bill'], kde=True, ax=axes[1])
axes[1].set_title('Distribution of Total Bill')

plt.tight_layout()
plt.show()

print("\n✅ Environment setup complete!")
```
9. Summary
| Topic | Key Points |
|---|---|
| Installation | Anaconda (recommended) or Python + pip |
| Environment | Virtual environments to isolate projects |
| Jupyter | Interactive analysis, export reports |
| NumPy | Efficient numerical operations |
| Next Steps | Pandas fundamentals |
Completion checklist:
- Python 3.10+ installed
- Virtual environment created
- Core packages installed
- Jupyter working
- Test script passed
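Two of the checklist items (the Python version and the installed packages) can be checked automatically. The sketch below is one possible approach using only the standard library; the helper `check_setup` and the `REQUIRED` list are hypothetical and should be adjusted to the packages your project actually needs. It looks up distribution names as used with pip and does not import the packages.

```python
import sys
from importlib import metadata

# Assumption: distribution names as installed with pip; edit to match your project.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "jupyterlab"]


def check_setup(required: list[str]) -> dict:
    """Report the Python version check and the installed version of each package."""
    packages = {}
    for name in required:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # not installed in the current environment
    return {
        "python_3_10_plus": sys.version_info >= (3, 10),
        "packages": packages,
    }


report = check_setup(REQUIRED)
print(f"Python 3.10+: {report['python_3_10_plus']}")
for name, version in report["packages"].items():
    status = version if version is not None else "MISSING"
    print(f"{name}: {status}")
```

Any package reported as `MISSING` can be installed with `pip install` or `conda install` inside the active environment before moving on.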
Next lesson: Pandas Fundamentals - DataFrames and basic operations
