Data Crawling & Building Web Apps with Streamlit

🎯 Learning Objectives

After this lesson, you will be able to:

✅ Collect data from websites with requests + BeautifulSoup

✅ Call APIs and process JSON responses

✅ Build a Data App with Streamlit (widgets, charts, layout)

✅ Manage state and caching in Streamlit

✅ Build & deploy a complete Dashboard

Duration: 3 hours | Difficulty: Intermediate | Prerequisites: Pandas (Lesson 06), Visualization (Lesson 08)

Task 0

📖 Key Terms Glossary

Term	Vietnamese	Description
Web Scraping	Thu thập dữ liệu web	Extracting data from HTML pages
API	Giao diện lập trình	Endpoint that returns data, typically as JSON
HTTP Request	Yêu cầu HTTP	GET (read), POST (send) data
HTML Parsing	Phân tích HTML	Extracting information from HTML
BeautifulSoup	-	Python library for parsing HTML
Streamlit	-	Framework for building Data Apps in Python
Session State	Trạng thái phiên	Keeps variables between reruns
Caching	Bộ nhớ đệm	Stores results to avoid recomputing them
Widget	Thành phần giao diện	Button, slider, selectbox…
Deploy	Triển khai	Publishing the app to the internet

Checkpoint

Web Scraping and APIs are the two main ways to collect data. Streamlit turns a Python script into a web app in just a few minutes!

Task 1

🌐 Web Scraping with Requests + BeautifulSoup

What is Web Scraping? It is a technique for automatically collecting data from websites. When data is not available as CSV/Excel, you can write Python to "scrape" it from a web page and turn it into a DataFrame.

What is HTML? It is the markup language of web pages. Web pages are written in HTML, built from tags such as <h1>, <p>, <table>, <a>. BeautifulSoup helps you parse HTML to pull out the data you need.

Installation

Bash

pip install requests beautifulsoup4 lxml

Basic HTTP Requests

Python

import requests

# GET request (a timeout keeps the call from hanging indefinitely)
response = requests.get('https://httpbin.org/get', timeout=10)
print(response.status_code)   # 200 = OK
print(response.headers['Content-Type'])
print(response.text)           # Raw HTML/JSON string

# With headers (mimicking a browser)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0'
}
response = requests.get('https://example.com', headers=headers, timeout=10)

# POST request (form-encoded body)
data = {'username': 'test', 'password': '123'}
response = requests.post('https://httpbin.org/post', data=data, timeout=10)

Parsing HTML with BeautifulSoup

Python

from bs4 import BeautifulSoup
import requests

url = 'https://quotes.toscrape.com/'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'lxml')

# Find a single element
title = soup.find('title').text
print(f"Page title: {title}")

# Find all quotes on the page
quotes = soup.find_all('div', class_='quote')
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    print(f"{text} - {author}")
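
To practice parsing without hitting a live site, the same idea can be tried on an inline HTML string. The sketch below is only illustrative: `SAMPLE_HTML` and `parse_quotes` are made-up names, the markup is a hand-written imitation of the quote structure, and it uses CSS selectors (`select`, `select_one`) with the built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, shaped like the quote markup used in this lesson
SAMPLE_HTML = """
<html><head><title>Sample Quotes</title></head>
<body>
  <div class="quote">
    <span class="text">Stay curious.</span>
    <small class="author">Ada</small>
    <a class="tag" href="/tag/learning/">learning</a>
  </div>
  <div class="quote">
    <span class="text">Ship small changes.</span>
    <small class="author">Linus</small>
    <a class="tag" href="/tag/code/">code</a>
    <a class="tag" href="/tag/habits/">habits</a>
  </div>
</body></html>
"""

def parse_quotes(html):
    """Return one dict per <div class="quote"> found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    rows = []
    for quote in soup.select("div.quote"):     # CSS selector for each quote block
        rows.append({
            "text": quote.select_one("span.text").get_text(strip=True),
            "author": quote.select_one("small.author").get_text(strip=True),
            "tags": [a.get_text(strip=True) for a in quote.select("a.tag")],
        })
    return rows

quotes = parse_quotes(SAMPLE_HTML)
for q in quotes:
    print(q)
```

A list of dicts like this can be passed straight to `pd.DataFrame(...)`.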

Extracting an HTML Table → DataFrame

Python

from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# Some sites reject requests without a User-Agent; this value is only an example
headers = {'User-Agent': 'Mozilla/5.0 (data-course-example)'}
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, 'lxml')

# Option 1: pandas read_html (simplest); reuse the HTML already downloaded
tables = pd.read_html(StringIO(response.text))
df = tables[0]  # First table on the page; inspect `tables` to pick the right one

# Option 2: parse manually
table = soup.find('table', class_='wikitable')
rows = []
for tr in table.find_all('tr')[1:]:  # Skip the header row
    cells = [td.text.strip() for td in tr.find_all(['td', 'th'])]
    rows.append(cells)

df_manual = pd.DataFrame(rows)

Crawling Multiple Pages (Pagination)

Python

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

all_quotes = []
for page in range(1, 11):
    url = f'https://quotes.toscrape.com/page/{page}/'
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')

    quotes = soup.find_all('div', class_='quote')
    if not quotes:  # No quotes on this page: we are past the last one
        break

    for q in quotes:
        all_quotes.append({
            'text': q.find('span', class_='text').text,
            'author': q.find('small', class_='author').text,
            'tags': [tag.text for tag in q.find_all('a', class_='tag')]
        })

    time.sleep(1)  # 1 s delay: ALWAYS respect the server!

df = pd.DataFrame(all_quotes)
print(f"Scraped {len(df)} quotes")
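
An alternative to guessing the page range is to follow the page's own "next" link until it disappears. The sketch below assumes the site marks that link as `<li class="next"><a href="...">`; `FAKE_SITE`, `fetch`, and `crawl` are hypothetical names, and the pages are served from an in-memory dict so it runs offline. In a real crawl, `fetch` would call `requests.get(url).text` and the loop would keep the `time.sleep()` delay.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical in-memory "site": URL -> HTML, so the sketch runs offline
FAKE_SITE = {
    "https://example.test/page/1/":
        '<div class="quote">A</div><li class="next"><a href="/page/2/">Next</a></li>',
    "https://example.test/page/2/":
        '<div class="quote">B</div><li class="next"><a href="/page/3/">Next</a></li>',
    "https://example.test/page/3/":
        '<div class="quote">C</div>',
}

def fetch(url):
    """Stand-in for requests.get(url).text."""
    return FAKE_SITE[url]

def crawl(start_url, max_pages=50):
    """Follow 'next' links from start_url, collecting the quote texts."""
    texts, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against link loops
        soup = BeautifulSoup(fetch(url), "html.parser")
        texts += [div.get_text(strip=True) for div in soup.select("div.quote")]
        next_link = soup.select_one("li.next > a")
        # urljoin resolves a relative href against the current page URL
        url = urljoin(url, next_link["href"]) if next_link else None
    return texts

collected = crawl("https://example.test/page/1/")
print(collected)
```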

Web Scraping Ethics:

Always check robots.txt (for example: example.com/robots.txt)
Add time.sleep() between requests (1-3 seconds)
Don't overload the server: apply rate limiting!
Read the Terms of Service before scraping
Prefer an API when one is available
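
The robots.txt check can be automated with Python's standard `urllib.robotparser` module. The sketch below parses a made-up robots.txt from a string so that it runs offline; `ROBOTS_TXT` and the bot name `my-course-bot` are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse text directly (no network call)

USER_AGENT = "my-course-bot"  # hypothetical bot name

allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/quotes/")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)  # seconds, or None if not specified

print(allowed_public, allowed_private, delay)
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads and parses the file; the crawl delay, when present, is a natural value to pass to `time.sleep()`.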

Checkpoint

Try scraping 5 quotes from quotes.toscrape.com and saving them to a DataFrame!

Task 2

🔌 Collecting Data from APIs

REST API Basics

Python

import requests
import pandas as pd

# Public API — no authentication
response = requests.get('https://api.github.com/users/octocat')
data = response.json()  # Parse JSON
print(data['login'], data['public_repos'])

# API with parameters
params = {
    'q': 'python data science',
    'sort': 'stars',
    'order': 'desc',
    'per_page': 10
}
response = requests.get('https://api.github.com/search/repositories', params=params)
repos = response.json()['items']

df = pd.DataFrame([{
    'name': r['full_name'],
    'stars': r['stargazers_count'],
    'language': r['language']
} for r in repos])
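
Real JSON responses often contain nested objects and missing or null fields, so indexing with `r['key']` can raise `KeyError`. The sketch below shows one defensive way to flatten such a payload into rows. `RAW` is a hand-written sample loosely shaped like a search response, and `to_rows` is a hypothetical helper name; it is not actual GitHub output.

```python
import json

# Hypothetical payload, shaped like a search response with an "items" list
RAW = """
{
  "total_count": 2,
  "items": [
    {"full_name": "octo/alpha", "stargazers_count": 120, "language": "Python",
     "owner": {"login": "octo"}},
    {"full_name": "octo/beta", "stargazers_count": 45, "language": null}
  ]
}
"""

def to_rows(payload):
    """Flatten each item into a flat dict, tolerating missing or null fields."""
    rows = []
    for item in payload.get("items", []):
        owner = item.get("owner") or {}  # missing or None becomes an empty dict
        rows.append({
            "name": item.get("full_name"),
            "stars": item.get("stargazers_count", 0),
            "language": item.get("language") or "Unknown",
            "owner": owner.get("login"),
        })
    return rows

rows = to_rows(json.loads(RAW))  # json.loads plays the role of response.json()
for row in rows:
    print(row)
```

The resulting list of dicts can be handed to `pd.DataFrame(rows)` just like the list comprehension in the REST example.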

APIs with Authentication

Python

import requests

# Bearer token
headers = {'Authorization': 'Bearer YOUR_API_KEY'}
response = requests.get('https://api.example.com/data', headers=headers, timeout=10)

# API key as a query parameter
# (OpenWeatherMap expects `appid` for the key and `q` for the city name)
params = {'appid': 'YOUR_KEY', 'q': 'Hanoi'}
response = requests.get('https://api.openweathermap.org/data/2.5/weather', params=params, timeout=10)
weather = response.json()
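
Hardcoding a key in the source file makes it easy to leak through version control. One common alternative, sketched below, is to read it from an environment variable. The variable name `MY_API_KEY` and the helper `load_api_key` are assumptions made for this example, and the demo passes a fake environment dict so it runs without a real key.

```python
import os

def load_api_key(env=os.environ, var_name="MY_API_KEY"):
    """Return the API key stored in the environment, or fail with a clear message.

    `MY_API_KEY` is a hypothetical variable name chosen for this sketch.
    """
    api_key = env.get(var_name)
    if not api_key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return api_key

# Demo with a fake environment so the sketch runs without a real key
api_key = load_api_key(env={"MY_API_KEY": "demo-key"})
headers = {"Authorization": f"Bearer {api_key}"}  # bearer-token style header
print(headers)
```

In a real script, calling `load_api_key()` with no arguments reads from the actual process environment.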

Error Handling

Python

import requests

def safe_request(url, max_retries=3):
    """GET a URL and return the parsed JSON; retry on timeouts only."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise an exception for 4xx/5xx
            return response.json()
        except requests.exceptions.Timeout:
            print(f"Timeout - retry {attempt + 1}/{max_retries}")
        except requests.exceptions.HTTPError as e:
            print(f"HTTP Error: {e}")
            break
        except requests.exceptions.RequestException as e:
            print(f"Error: {e}")
            break
    return None  # All attempts failed
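
The retry loop above retries immediately after a timeout. A common refinement is exponential backoff, where the wait doubles after each failed attempt. The sketch below is a generic version under stated assumptions: `retry_with_backoff` and `flaky` are hypothetical names, the built-in `TimeoutError` stands in for `requests.exceptions.Timeout`, and the `sleep` function is injectable so the demo finishes instantly.

```python
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep,
                       retry_on=(TimeoutError,)):
    """Call fn(); on a retryable error wait base_delay * 2**attempt, then retry.

    Returns fn()'s result, or None once max_retries attempts have failed.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_retries - 1:
                break  # out of attempts: no point sleeping again
            delay = base_delay * (2 ** attempt)  # 1 s, 2 s, 4 s, ...
            print(f"{type(exc).__name__}: retrying in {delay:.0f}s")
            sleep(delay)
    return None

# Demo: a fake call that times out twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"status": "ok"}

waits = []
result = retry_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

With requests, the same helper could be called as `retry_with_backoff(lambda: requests.get(url, timeout=10), retry_on=(requests.exceptions.Timeout,))`.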

Tip: Many free data sources are available: GitHub API, OpenWeather, JSONPlaceholder, CoinGecko, News API. Prefer an API over scraping whenever possible!

Checkpoint

Call the GitHub API to fetch 5 popular Python repos (for example, sorted by stars) and build a DataFrame with the columns: name, stars, language.

Task 3

🚀 Streamlit — Hello World & Widgets


Install and Run

Bash

pip install streamlit
streamlit run app.py

Hello World

Python

# app.py
import streamlit as st

st.set_page_config(page_title="My App", page_icon="📊", layout="wide")

st.title("Hello Streamlit! 👋")
st.write("This is my first Streamlit app")

st.header("This is a header")
st.markdown("**Bold** and *italic* text")
st.code("print('Hello World')", language="python")
st.latex(r"E = mc^2")

Input Widgets

Python

import pandas as pd
import streamlit as st

# Text
name = st.text_input("Enter your name", "MinAI")
st.write(f"Hello, {name}!")

# Numbers
age = st.number_input("Age", min_value=0, max_value=120, value=25)
value = st.slider("Pick a value", 0, 100, 50)

# Selection
option = st.selectbox("Pick a color", ["Red", "Green", "Blue"])
options = st.multiselect("Pick fruits", ["Apple", "Banana", "Cherry"])
choice = st.radio("Choice", ["Option 1", "Option 2"])

# Boolean
agree = st.checkbox("I agree")

# Date
date = st.date_input("Pick a date")

# File
uploaded = st.file_uploader("Upload a CSV file", type=["csv"])
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.dataframe(df)

Buttons

Python

import streamlit as st

# Click button: returns True only on the rerun triggered by the click
if st.button("Click me!"):
    st.write("Clicked!")

# Download
st.download_button("Download CSV", data="a,b\n1,2", file_name="data.csv")

Checkpoint

Create an app.py file with a title, a text input for the name, a slider for the age, and display "Hello [name], you are [age] years old!"

Task 4

📊 Streamlit - Displaying Data & Charts

DataFrames

Python

import streamlit as st
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
})

st.table(df)                # Static table
st.dataframe(df)            # Interactive (sortable)
edited_df = st.data_editor(df)  # Editable!

# KPI Metrics
col1, col2, col3 = st.columns(3)
col1.metric("Revenue", "$10,000", "+5%")
col2.metric("Users", "1,234", "-2%")
col3.metric("Rating", "4.5", "+0.2")

Charts

Python

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import streamlit as st

# Sample data so the snippet runs on its own
chart_data = pd.DataFrame(np.random.randn(20, 3), columns=["A", "B", "C"])
df = px.data.gapminder().query("year == 2007")  # dataset bundled with Plotly

# Streamlit native charts
st.line_chart(chart_data)
st.bar_chart(chart_data)
st.scatter_chart(chart_data, x="A", y="B")

# Plotly (interactive, recommended!)
fig = px.scatter(df, x="gdpPercap", y="lifeExp",
                 color="continent", size="pop", hover_name="country")
st.plotly_chart(fig, use_container_width=True)

# Matplotlib/Seaborn
fig, ax = plt.subplots()
sns.histplot(df["lifeExp"], kde=True, ax=ax)
st.pyplot(fig)

Task 5

📐 Streamlit — Layout & Organization


Columns

Python

import streamlit as st

col1, col2 = st.columns([2, 1])  # 2:1 width ratio
with col1:
    st.header("Main Content")
    st.line_chart([1, 2, 3, 4])
with col2:
    st.header("Sidebar")
    st.write("Extra info")

Tabs

Python

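import pandas as pd
import plotly.express as px
import streamlit as st

# Sample data and figure so the snippet runs on its own
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 15]})
fig = px.line(df, x="x", y="y")

tab1, tab2, tab3 = st.tabs(["📈 Chart", "🗃 Data", "📝 About"])
with tab1:
    st.plotly_chart(fig)
with tab2:
    st.dataframe(df)
with tab3:
    st.write("About this app")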

Sidebar

Python

import streamlit as st

st.sidebar.title("Settings")
option = st.sidebar.selectbox("Select", ["A", "B", "C"])
value = st.sidebar.slider("Value", 0, 100, 50)

Expander & Forms

Python

import streamlit as st

with st.expander("Details"):
    st.write("Hidden content")

# Widgets inside a form only trigger a rerun when the form is submitted
with st.form("my_form"):
    name = st.text_input("Name")
    submitted = st.form_submit_button("Submit")
    if submitted:
        st.write(f"Hello {name}")

Task 6

⚡ Streamlit — State & Caching


Session State

Python

import streamlit as st

# Initialize state (only on the first run of the session)
if 'count' not in st.session_state:
    st.session_state.count = 0

if st.button('Increment'):
    st.session_state.count += 1

if st.button('Reset'):
    st.session_state.count = 0

st.write(f"Count: {st.session_state.count}")
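
Why session state is needed becomes clearer once the rerun model is spelled out: Streamlit re-executes the script from the top on every interaction, so ordinary variables start over each time. The sketch below is a plain-Python simulation of that behaviour, not Streamlit itself: `run_script` is a hypothetical stand-in for one rerun, and a regular dict plays the role of `st.session_state`.

```python
def run_script(session_state, clicked=None):
    """One simulated rerun: the whole 'script' executes from the top."""
    local_count = 0                       # plain variable: reset on every rerun
    if "count" not in session_state:      # initialise once per session
        session_state["count"] = 0
    if clicked == "increment":
        local_count += 1
        session_state["count"] += 1
    elif clicked == "reset":
        session_state["count"] = 0
    return local_count, session_state["count"]

state = {}  # stands in for st.session_state (a plain dict in this sketch)
clicks = ["increment", "increment", "increment", "reset", "increment"]
history = [run_script(state, click) for click in clicks]
print(history)
```

In each pair, the first number (the plain variable) never gets past 1, while the second (the value kept in the state dict) accumulates across reruns until it is reset.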

Caching

Python

import pickle

import pandas as pd
import streamlit as st

@st.cache_data
def load_data(url):
    """Runs once per distinct `url`; later calls return the cached result."""
    return pd.read_csv(url)

@st.cache_resource
def load_model():
    """Cache an ML model / DB connection (one shared object)."""
    with open("model.pkl", "rb") as f:
        return pickle.load(f)

# Usage: cached automatically!
df = load_data("data.csv")
model = load_model()

Use @st.cache_data for data (DataFrame, list) and @st.cache_resource for resources (model, DB connection). Always cache expensive loading functions!
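
The effect of caching can be observed without running a Streamlit app. The sketch below uses `functools.lru_cache` from the standard library as a stand-in; it is a different mechanism from Streamlit's decorators and only illustrates the shared idea of "run the body once per distinct argument". `slow_square` and `call_log` are hypothetical names.

```python
import time
from functools import lru_cache

call_log = []  # records every time the function body really runs

@lru_cache(maxsize=None)
def slow_square(n):
    """Pretend to be an expensive load; the body runs once per distinct n."""
    call_log.append(n)
    time.sleep(0.1)  # simulated slow work
    return n * n

results = [slow_square(4), slow_square(4), slow_square(5), slow_square(4)]
print(results)   # same values whether they came from the cache or not
print(call_log)  # the body only ran for n=4 and n=5
```

One difference worth keeping in mind: `lru_cache` hands back the same object on every hit, which is closer to `@st.cache_resource`, whereas `@st.cache_data` returns a fresh copy of the cached value on each call.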

Task 7

🏗️ Complete Dashboard Example


Python

# dashboard.py
import streamlit as st
import pandas as pd
import numpy as np
import plotly.express as px

st.set_page_config(page_title="Sales Dashboard", page_icon="📊", layout="wide")
st.title("📊 Sales Dashboard")

# Load data (synthetic, seeded so every run shows the same numbers)
@st.cache_data
def load_data():
    np.random.seed(42)
    return pd.DataFrame({
        'Date': pd.date_range('2024-01-01', periods=365),
        'Sales': np.random.randint(100, 1000, 365),
        'Region': np.random.choice(['North', 'South', 'East', 'West'], 365),
        'Product': np.random.choice(['A', 'B', 'C'], 365)
    })

df = load_data()

# Sidebar filters (everything selected by default)
st.sidebar.header("Filters")
regions = st.sidebar.multiselect("Region", df['Region'].unique(), df['Region'].unique())
products = st.sidebar.multiselect("Product", df['Product'].unique(), df['Product'].unique())

df_filtered = df[(df['Region'].isin(regions)) & (df['Product'].isin(products))]

# Stop early when the filters exclude everything (avoids "$nan" metrics)
if df_filtered.empty:
    st.warning("No data matches the current filters.")
    st.stop()

# KPIs
c1, c2, c3, c4 = st.columns(4)
c1.metric("Total Sales", f"${df_filtered['Sales'].sum():,.0f}")
c2.metric("Avg Sales", f"${df_filtered['Sales'].mean():,.0f}")
c3.metric("Max Sales", f"${df_filtered['Sales'].max():,.0f}")
c4.metric("Transactions", f"{len(df_filtered):,}")

st.markdown("---")

# Charts
col1, col2 = st.columns(2)
with col1:
    st.subheader("Sales by Region")
    fig = px.pie(df_filtered, values='Sales', names='Region', hole=0.4)
    st.plotly_chart(fig, use_container_width=True)

with col2:
    st.subheader("Sales Trend")
    daily = df_filtered.groupby('Date')['Sales'].sum().reset_index()
    fig = px.line(daily, x='Date', y='Sales')
    st.plotly_chart(fig, use_container_width=True)

# Data table
st.subheader("Raw Data")
st.dataframe(df_filtered, use_container_width=True)
csv = df_filtered.to_csv(index=False)
st.download_button("Download CSV", csv, "sales.csv", "text/csv")

Deploy to Streamlit Cloud

Streamlit App Structure

my_app/
├── app.py
├── requirements.txt    (streamlit, pandas, plotly)
└── .streamlit/
    └── config.toml     (optional)

1. Push the project to GitHub
2. Go to share.streamlit.io
3. Connect the repo → Select app.py → Deploy

Checkpoint

Build a mini dashboard: load a CSV, add sidebar filters, 2 KPI metrics, 1 Plotly chart, and 1 data table. Try it out with streamlit run app.py!
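
For this exercise it can help to keep the data logic in plain functions that do not touch Streamlit, so they can be checked before any UI is wired up. The sketch below assumes the same column names as the dashboard example (`Region`, `Product`, `Sales`); `filter_sales`, `compute_kpis`, and the sample values are hypothetical.

```python
import pandas as pd

def filter_sales(df, regions, products):
    """Keep rows whose Region and Product are both among the selected values."""
    mask = df["Region"].isin(regions) & df["Product"].isin(products)
    return df[mask]

def compute_kpis(df):
    """Return the KPI numbers as a dict; zeros when nothing is selected."""
    if df.empty:
        return {"total": 0, "average": 0.0, "transactions": 0}
    return {
        "total": int(df["Sales"].sum()),
        "average": float(df["Sales"].mean()),
        "transactions": len(df),
    }

# Tiny hand-made dataset (hypothetical values) to exercise the helpers
sales = pd.DataFrame({
    "Region": ["North", "South", "North", "East"],
    "Product": ["A", "A", "B", "C"],
    "Sales": [100, 200, 300, 400],
})

north_kpis = compute_kpis(filter_sales(sales, ["North"], ["A", "B"]))
empty_kpis = compute_kpis(filter_sales(sales, [], ["A"]))
print(north_kpis)
print(empty_kpis)
```

In the app, the lists returned by the sidebar `multiselect` widgets would be passed as `regions` and `products`, and the dict values would feed `st.metric`.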

Task 8

📝 Summary

Data Collection

Method	Use When	Libraries
Web Scraping	No API, data in HTML tables	requests, BeautifulSoup
API	Service provides REST API	requests
`pd.read_html()`	Simple HTML tables	pandas

Streamlit Quick Reference

Python

import streamlit as st

# Display (df, fig, value, and delta are placeholders defined elsewhere)
st.title("Title")
st.dataframe(df)
st.plotly_chart(fig)
st.metric("KPI", value, delta)

# Input
#   st.text_input(), st.selectbox(), st.slider()
#   st.button(), st.file_uploader()

# Layout
#   st.columns(), st.tabs(), st.sidebar, st.expander()

# Performance and state
#   @st.cache_data, @st.cache_resource
#   st.session_state

Next lesson: a comprehensive review of the entire course, from Python basics to Streamlit! 📚

Self-Check Questions

What are the basic steps of web scraping with requests + BeautifulSoup?
What do the values 200 and 404 returned by response.status_code mean? Why should you check the status code?
What is @st.cache_data used for in Streamlit? Why does it make an app run faster?
When deploying a Streamlit app to the cloud, which files do you need to prepare in the project?

🎉 Great job! You have completed the Data Crawling & Streamlit lesson!

Next up: a comprehensive review of the entire course to prepare for the Mini Project and the final exam!

Task 9