Python Setup & Environment

Set up a professional Python environment for Data Analysis

1. Introduction

Python for Data Analysis

Python is one of the most widely used languages for Data Analysis, thanks to a rich ecosystem that includes Pandas, NumPy, Matplotlib, and many other powerful libraries.

1.1 Why Python?

Advantage | Description
Easy to learn | Clear, readable syntax
Rich ecosystem | Pandas, NumPy, Scikit-learn, etc.
Community | Large community, extensive documentation
Versatile | Analysis, ML, Web, Automation
Job market | A highly sought-after skill

1.2 Data Analysis Stack

Text
┌─────────────────────────────────────────────────────────┐
│                    Python Data Stack                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐     │
│  │   NumPy    │    │   Pandas   │    │ Matplotlib │     │
│  │  (Arrays)  │    │(DataFrames)│    │  (Charts)  │     │
│  └────────────┘    └────────────┘    └────────────┘     │
│        ▲                 ▲                 ▲            │
│        │                 │                 │            │
│  ┌─────┴─────────────────┴─────────────────┴──────┐     │
│  │                  Python 3.10+                  │     │
│  └────────────────────────────────────────────────┘     │
│                                                         │
│  Supporting: Seaborn, Plotly, SciPy, Statsmodels        │
│                                                         │
└─────────────────────────────────────────────────────────┘

2. Installing Python

2.1 Method 1: Anaconda (Recommended)

Bash
# Download Anaconda from https://www.anaconda.com/download

# After installing, verify:
conda --version
# conda 23.x.x

# Create a new environment
conda create -n data-analysis python=3.10

# Activate the environment
conda activate data-analysis

# Install packages
conda install pandas numpy matplotlib seaborn jupyter

What is Anaconda?

Anaconda is a Python distribution that bundles 250+ Data Science packages, a package manager (conda), and virtual environment support, so you can get started quickly.

2.2 Method 2: Python + pip

Bash
# Download Python from https://www.python.org/downloads/

# Verify installation (on macOS/Linux the command may be python3)
python --version
# Python 3.10.x

pip --version
# pip 23.x.x

# Create virtual environment
python -m venv data-analysis-env

# Activate (Windows)
data-analysis-env\Scripts\activate

# Activate (macOS/Linux)
source data-analysis-env/bin/activate

# Install packages
pip install pandas numpy matplotlib seaborn jupyter notebook
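
Once an environment is activated, it can help to confirm from inside Python which interpreter is actually running. The following is a minimal sketch rather than part of the lesson's setup: the helper name `describe_environment` is hypothetical, and it assumes that a `venv` makes `sys.prefix` differ from `sys.base_prefix` and that `conda activate` sets the `CONDA_DEFAULT_ENV` variable.

```python
import os
import sys


def describe_environment():
    """Return a small dict describing the running interpreter (hypothetical helper)."""
    return {
        "executable": sys.executable,
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        # Inside a venv, sys.prefix points at the environment, not the base install.
        "in_venv": sys.prefix != sys.base_prefix,
        # conda sets this variable on activation; None means no conda env is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If `executable` points outside the environment folder you just created, the environment is probably not activated in the current terminal.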

2.3 Required Packages

Bash
# Core packages
pip install pandas numpy scipy

# Visualization
pip install matplotlib seaborn plotly

# Jupyter
pip install jupyter notebook jupyterlab

# Database connections
pip install sqlalchemy psycopg2-binary pymysql

# Excel support
pip install openpyxl xlrd

# Statistical analysis
pip install statsmodels scikit-learn

# Date handling
pip install python-dateutil pytz
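
After running the install commands above, one way to see what actually landed in the active environment is to query package metadata from Python. This is a sketch under assumptions: the `REQUIRED` list is only a sample of the packages named above, and `installed_version` is a hypothetical helper built on the standard library's `importlib.metadata` (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they are passed to pip (a sample, not the full list).
REQUIRED = ["pandas", "numpy", "scipy", "matplotlib", "seaborn", "openpyxl"]


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for name in REQUIRED:
    found = installed_version(name)
    print(f"{name:<12} {found if found else 'MISSING'}")
```

Because this reads metadata instead of importing each library, it stays fast and does not fail halfway through when one package is absent.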

3. Jupyter Notebook

3.1 Introduction to Jupyter

Jupyter Notebook is an interactive environment for Data Analysis:

  • Combines code, text, and visualizations
  • Runs one cell at a time
  • Exports to PDF and HTML
  • Easy to share
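
A notebook file (`.ipynb`) is a JSON document holding a list of cells, which is what lets it mix code and text and makes it easy to share. As a rough sketch, assuming notebook format version 4 and a hypothetical file name `first_notebook.ipynb`, the standard library alone can write a minimal one. In practice the `nbformat` package is the usual tool for this; the helper `make_notebook` below is illustrative only.

```python
import json
import tempfile
from pathlib import Path


def make_notebook(markdown_text, code_text):
    """Build a minimal two-cell notebook as a plain dict (nbformat 4 layout assumed)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,  # not run yet
                "outputs": [],
                "source": code_text,
            },
        ],
    }


notebook = make_notebook("# My first analysis", "import pandas as pd")

# Hypothetical file name, written to the system temp folder for this demo.
path = Path(tempfile.gettempdir()) / "first_notebook.ipynb"
path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
print(f"Wrote {path} with {len(notebook['cells'])} cells")
```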

3.2 Starting Jupyter

Bash
# Classic Notebook
jupyter notebook

# JupyterLab (recommended)
jupyter lab

# Open the URL: http://localhost:8888

3.3 Jupyter Keyboard Shortcuts

Shortcut | Action
Shift + Enter | Run cell, move to next
Ctrl + Enter | Run cell, stay
Alt + Enter | Run cell, insert below
A | Insert cell above (command mode)
B | Insert cell below (command mode)
DD | Delete cell (command mode)
M | Convert to Markdown
Y | Convert to Code
Esc | Enter command mode
Enter | Enter edit mode

3.4 Notebook best practices

Python
# Cell 1: Imports (always at top)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

# Cell 2: Load data
# Cell 3: Explore data
# Cell 4+: Analysis
# Last cell: Summary & conclusions

4. IDE Alternatives

4.1 VS Code

Install the Python extension and the Jupyter extension.

Benefits:

  • Full IDE features (debugging, Git)
  • Run notebooks in VS Code
  • Better code completion
  • Integrated terminal

4.2 PyCharm

  • PyCharm Professional has Jupyter support
  • Good for large projects
  • Excellent debugging
  • Community edition: free, but no Jupyter support
  • Professional edition: paid, with full features

4.3 Google Colab (Cloud)

Python
# Access: https://colab.research.google.com

# Benefits:
# - Free GPU/TPU access
# - No setup required
# - Share easily
# - Pre-installed packages

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
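
The `google.colab` import above only works inside Colab, so a notebook meant to run both locally and in the cloud needs a guard. Here is a minimal sketch, assuming hypothetical folder names (`data` locally, and a `data` folder under the mounted Drive in Colab); `running_in_colab` is an illustrative helper, not a Colab API.

```python
from pathlib import Path


def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


# Hypothetical folder names; adjust to wherever the data actually lives.
if running_in_colab():
    DATA_DIR = Path("/content/drive/MyDrive/data")
else:
    DATA_DIR = Path("data")

print(f"Reading data from: {DATA_DIR}")
```

Keeping the path decision in one place means the rest of the notebook can refer to `DATA_DIR` without caring where it runs.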

5. Python Basics for Analysis

5.1 Data Types

Python
# Numbers
integer_val = 42
float_val = 3.14
complex_val = 1 + 2j

# Strings
name = "Data Analysis"
multiline = """
This is a
multiline string
"""

# Boolean
is_valid = True
is_empty = False

# None
missing_value = None

# Type checking
print(type(42))       # <class 'int'>
print(type(3.14))     # <class 'float'>
print(type("hello"))  # <class 'str'>
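
Two details of these types matter quickly in analysis work: floats are approximate, and `None` has to be handled before doing arithmetic. The sketch below uses made-up input strings and treats an empty string as a missing value, which is an assumption for illustration only.

```python
import math

total = 0.1 + 0.2
print(total == 0.3)              # False: binary floats cannot represent 0.1 exactly
print(math.isclose(total, 0.3))  # True: compare with a tolerance instead

# Converting text input to numbers, treating empty strings as missing (None).
raw_values = ["42", "3.14", "", "7"]
parsed = [float(text) if text else None for text in raw_values]
print(parsed)                    # [42.0, 3.14, None, 7.0]

# Skip missing values explicitly before aggregating.
present = [value for value in parsed if value is not None]
print(sum(present) / len(present))
```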

5.2 Collections

Python
# List - ordered, mutable
numbers = [1, 2, 3, 4, 5]
numbers.append(6)
numbers[0] = 10
print(numbers[1:3])  # [2, 3]

# Tuple - ordered, immutable
coordinates = (10.5, 20.3)
x, y = coordinates  # Unpacking

# Dictionary - key-value pairs
person = {
    'name': 'Alice',
    'age': 30,
    'city': 'Hanoi'
}
print(person['name'])  # Alice
person['email'] = 'alice@email.com'  # Add new key

# Set - unique values
unique_ids = {1, 2, 3, 2, 1}  # {1, 2, 3}
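
These collections are often combined: a table of data can be modelled as a list of dictionaries, one per row. The following sketch uses invented records to show a list, a set, and a dictionary-like counter working together.

```python
from collections import Counter

# Illustrative records, not taken from any real dataset.
records = [
    {"name": "Alice", "city": "Hanoi", "age": 30},
    {"name": "Bob", "city": "Da Nang", "age": 25},
    {"name": "Charlie", "city": "Hanoi", "age": 35},
]

ages = [row["age"] for row in records]                   # pull one "column" into a list
cities = {row["city"] for row in records}                # a set keeps unique values only
rows_per_city = Counter(row["city"] for row in records)  # count rows per key

print(sum(ages) / len(ages))    # 30.0
print(sorted(cities))           # ['Da Nang', 'Hanoi']
print(rows_per_city["Hanoi"])   # 2
```

This row-per-dict layout is also one of the shapes a Pandas DataFrame can be built from.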

5.3 Control Flow

Python
# Conditionals
score = 85
if score >= 90:
    grade = 'A'
elif score >= 80:
    grade = 'B'
else:
    grade = 'C'

# Loops
# For loop
for i in range(5):
    print(i)  # 0, 1, 2, 3, 4

for name in ['Alice', 'Bob', 'Charlie']:
    print(f"Hello, {name}")

# While loop
count = 0
while count < 3:
    print(count)
    count += 1

# List comprehension (Pythonic!)
squares = [x**2 for x in range(10)]
even_squares = [x**2 for x in range(10) if x % 2 == 0]
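
The conditional and the comprehension can be combined to label a whole collection at once. This is a small sketch with invented scores; `to_grade` is a hypothetical helper that reuses the same thresholds as the `if`/`elif`/`else` example.

```python
def to_grade(score):
    """Map a numeric score to a letter grade (thresholds: 90 for A, 80 for B)."""
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    return "C"


scores = {"Alice": 92, "Bob": 85, "Charlie": 71}  # illustrative values

# Dict comprehension: like a list comprehension, but it builds key-value pairs.
grades = {name: to_grade(score) for name, score in scores.items()}
print(grades)  # {'Alice': 'A', 'Bob': 'B', 'Charlie': 'C'}

# enumerate() supplies a counter alongside each item, so no manual index is needed.
ranked_names = sorted(scores, key=scores.get, reverse=True)
for rank, name in enumerate(ranked_names, start=1):
    print(rank, name, scores[name])
```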

5.4 Functions

Python
# Basic function
def greet(name):
    return f"Hello, {name}!"

# Default parameters
def calculate_tax(amount, rate=0.1):
    return amount * rate

# Multiple returns
def get_statistics(numbers):
    return min(numbers), max(numbers), sum(numbers) / len(numbers)

minimum, maximum, average = get_statistics([1, 2, 3, 4, 5])

# Lambda functions
square = lambda x: x ** 2
add = lambda a, b: a + b

# *args and **kwargs
def flexible_func(*args, **kwargs):
    print(f"Args: {args}")
    print(f"Kwargs: {kwargs}")

flexible_func(1, 2, 3, name='Alice', age=30)
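
Returning several values as a tuple, as `get_statistics` does, relies on callers remembering the order. One possible variation, sketched here with a hypothetical `summarize` function, returns a dictionary instead and uses a default parameter plus `**kwargs` to attach optional labels.

```python
def summarize(values, *, round_to=2, **extra_labels):
    """Return basic statistics as a dict; extra keyword arguments are copied in as labels.

    The bare * makes round_to keyword-only, so it cannot be passed by position.
    """
    summary = {
        "min": min(values),
        "max": max(values),
        "mean": round(sum(values) / len(values), round_to),
    }
    summary.update(extra_labels)
    return summary


result = summarize([1, 2, 3, 4, 5], source="demo")
print(result)          # {'min': 1, 'max': 5, 'mean': 3.0, 'source': 'demo'}
print(result["mean"])  # 3.0
```

Named keys keep call sites readable when more statistics are added later.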

6. NumPy Basics

6.1 NumPy Arrays

Python
import numpy as np

# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros(5)           # [0., 0., 0., 0., 0.] (floats by default)
arr3 = np.ones(5)            # [1., 1., 1., 1., 1.]
arr4 = np.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
arr5 = np.linspace(0, 1, 5)  # [0., 0.25, 0.5, 0.75, 1.]

# 2D arrays
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(matrix.shape)  # (3, 3)
print(matrix.dtype)  # int64 (int32 on some platforms, e.g. Windows with NumPy 1.x)

6.2 Array Operations

Python
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Element-wise operations
print(a + b)   # [6, 8, 10, 12]
print(a * b)   # [5, 12, 21, 32]
print(a ** 2)  # [1, 4, 9, 16]

# Statistics
print(np.mean(a))    # 2.5
print(np.std(a))     # 1.118
print(np.median(a))  # 2.5
print(np.sum(a))     # 10

# Aggregations
data = np.random.randn(100)
print(f"Mean: {np.mean(data):.2f}")
print(f"Std: {np.std(data):.2f}")
print(f"Min: {np.min(data):.2f}")
print(f"Max: {np.max(data):.2f}")
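
The aggregations above reduce a whole array to one number. On 2D data the `axis` argument controls which dimension is collapsed, which is the same idea Pandas later uses for per-column statistics. A brief sketch with invented numbers:

```python
import numpy as np

# Rows are observations, columns are variables (illustrative numbers).
table = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

print(table.sum())         # 21: every element
print(table.sum(axis=0))   # [5 7 9]: collapse rows, one result per column
print(table.mean(axis=1))  # [2. 5.]: collapse columns, one result per row

# Broadcasting: the per-column means (shape (3,)) are subtracted from every row.
centered = table - table.mean(axis=0)
print(centered)
```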

6.3 Indexing & Slicing

Python
arr = np.arange(10)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Basic indexing
print(arr[0])    # 0
print(arr[-1])   # 9
print(arr[2:5])  # [2, 3, 4]
print(arr[::2])  # [0, 2, 4, 6, 8]

# Boolean indexing
mask = arr > 5
print(arr[mask])  # [6, 7, 8, 9]

# 2D indexing
matrix = np.arange(12).reshape(3, 4)
# [[ 0,  1,  2,  3],
#  [ 4,  5,  6,  7],
#  [ 8,  9, 10, 11]]

print(matrix[1, 2])      # 6
print(matrix[0, :])      # [0, 1, 2, 3]
print(matrix[:, 1])      # [1, 5, 9]
print(matrix[0:2, 1:3])  # [[1, 2], [5, 6]]
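
Boolean masks are useful for more than selecting values: they can also count, label, and overwrite elements. The sketch below uses invented readings and assumes, purely for illustration, that `-1` marks a missing measurement.

```python
import numpy as np

# Illustrative readings where -1 is assumed to mark a missing measurement.
readings = np.array([12.0, -1.0, 15.5, -1.0, 9.0])

valid = readings >= 0
print(valid.sum())               # 3: each True counts as 1
print(readings[valid].mean())    # mean of the valid readings only

# np.where picks between two alternatives element by element.
labels = np.where(readings > 10, "high", "low")
print(labels)                    # ['high' 'low' 'high' 'low' 'low']

# Assigning through a mask changes only the selected positions.
cleaned = readings.copy()
cleaned[~valid] = np.nan
print(np.nanmean(cleaned))       # nanmean ignores the NaN entries
```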

7. Verify Installation

7.1 Test Script

Python
# test_setup.py
print("Testing Python Data Analysis Setup...")
print("=" * 50)

# Test imports
try:
    import pandas as pd
    print(f"✅ Pandas {pd.__version__}")
except ImportError:
    print("❌ Pandas not installed")

try:
    import numpy as np
    print(f"✅ NumPy {np.__version__}")
except ImportError:
    print("❌ NumPy not installed")

try:
    import matplotlib
    print(f"✅ Matplotlib {matplotlib.__version__}")
except ImportError:
    print("❌ Matplotlib not installed")

try:
    import seaborn as sns
    print(f"✅ Seaborn {sns.__version__}")
except ImportError:
    print("❌ Seaborn not installed")

try:
    import scipy
    print(f"✅ SciPy {scipy.__version__}")
except ImportError:
    print("❌ SciPy not installed")

# Test basic operations
# Note: this part assumes the pandas and NumPy imports above succeeded;
# otherwise pd / np are undefined and a NameError is raised here.
print("\n" + "=" * 50)
print("Testing operations...")

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(f"✅ Created DataFrame:\n{df}")

arr = np.random.randn(5)
print(f"\n✅ NumPy random array: {arr}")

print("\n" + "=" * 50)
print("🎉 Setup complete! Ready for Data Analysis!")

8. Practice

Hands-on Exercise

Exercise: Environment Setup Check

Python
# 1. Create a new virtual environment
# 2. Install all the required packages
# 3. Launch JupyterLab
# 4. Create your first notebook with the imports
# 5. Load sample data and display it

# YOUR CODE HERE

💡 Solution

Python
# Cell 1: Setup and imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Verify versions
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")

# Cell 2: Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

# Cell 3: Load sample data
# Using seaborn's built-in datasets (fetched online on first use)
tips = sns.load_dataset('tips')
print(f"Dataset shape: {tips.shape}")
tips.head()

# Cell 4: Quick exploration
tips.info()  # prints its report directly; wrapping it in print() would also print None
print(tips.describe())

# Cell 5: Simple visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

tips['day'].value_counts().plot(kind='bar', ax=axes[0], title='Orders by Day')
sns.histplot(tips['total_bill'], kde=True, ax=axes[1])
axes[1].set_title('Distribution of Total Bill')

plt.tight_layout()
plt.show()

print("\n✅ Environment setup complete!")

9. Summary

Topic | Key Points
Installation | Anaconda (recommended) or Python + pip
Environment | Virtual environments to isolate projects
Jupyter | Interactive analysis, export reports
NumPy | Efficient numerical operations
Next Steps | Pandas fundamentals

Completion checklist:

  • Python 3.10+ installed
  • Virtual environment created
  • Core packages installed
  • Jupyter working
  • Test script passed

Next lesson: Pandas Fundamentals - DataFrames and basic operations