Conversation Management

🎯 Learning Objectives

LLMs have no memory: each API call is independent. This lesson covers how to manage conversation history, make the most of the context window, and implement memory patterns.

After this lesson, you will be able to:

✅ Understand context windows and token limits
✅ Implement conversation memory patterns
✅ Optimize context for long conversations
✅ Build a chat app with smart memory

🔍 Context Window Explained

1.1 Model Context Limits

Model	Context Window	Approx. equivalent
GPT-3.5	16K tokens	~12K words
GPT-4 Turbo	128K tokens	~96K words
Claude 3.5	200K tokens	~150K words
Gemini 1.5 Pro	2M tokens	~1.5M words

1.2 Token Budget

Example

Context Window = System Prompt + Conversation History + User Message + Response

Example for GPT-4 Turbo (128K):
- System prompt: ~500 tokens
- Response reserved: ~2,000 tokens
- Available for history: ~125,500 tokens
- ≈ 500 turns of conversation (at roughly 250 tokens per turn)
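The budget arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `history_budget` and the figure of about 250 tokens per turn are illustrative assumptions, not values reported by any API.

```python
def history_budget(context_window: int, system_tokens: int, reserved_response: int) -> int:
    """Tokens left for conversation history after fixed costs are subtracted."""
    available = context_window - system_tokens - reserved_response
    if available < 0:
        raise ValueError("system prompt and reserved response exceed the context window")
    return available


# GPT-4 Turbo figures from the example above
available = history_budget(128_000, system_tokens=500, reserved_response=2_000)
print(available)  # 125500

# Assumed average of ~250 tokens per turn (user message plus reply)
TOKENS_PER_TURN = 250
print(available // TOKENS_PER_TURN)  # 502
```

Real conversations vary widely in turn length, so the turn count is a rough planning number rather than a guarantee.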

1.3 Problem: History Overflow

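python.py

# ❌ Naive approach: keep ALL messages
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for i in range(1000):  # Long conversation
    user_input = input("You: ")
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=messages,  # ← Eventually exceeds the context window!
    )
    # Store the reply text, not the whole response object
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
# Error: "maximum context length exceeded"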

Checkpoint

Do you understand the context window, the token budget, and the history overflow problem?

📐 Memory Patterns

2.1 Pattern 1: Sliding Window

Keep the N most recent messages and drop older ones.

python.py

class SlidingWindowMemory:
    def __init__(self, max_messages=20):
        self.max_messages = max_messages
        self.system_prompt = None
        self.messages = []

    def set_system(self, prompt):
        self.system_prompt = {"role": "system", "content": prompt}

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Keep only recent messages
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_messages(self):
        # The system prompt is stored separately so it is never evicted
        result = []
        if self.system_prompt:
            result.append(self.system_prompt)
        result.extend(self.messages)
        return result

# Usage
memory = SlidingWindowMemory(max_messages=20)
memory.set_system("You are a Python tutor.")
memory.add("user", "What is a list?")
memory.add("assistant", "A list is a data structure...")
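The same sliding-window idea can also be expressed with `collections.deque`, which discards the oldest item automatically once `maxlen` is reached. The class name `DequeWindowMemory` is a hypothetical variant for illustration, not part of the lesson's code.

```python
from collections import deque


class DequeWindowMemory:
    """Sliding window backed by a bounded deque (hypothetical variant)."""

    def __init__(self, max_messages=20, system_prompt=None):
        self.system_prompt = system_prompt
        # A deque with maxlen silently discards the oldest item when full
        self.messages = deque(maxlen=max_messages)

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

    def get_messages(self):
        result = []
        if self.system_prompt:
            result.append({"role": "system", "content": self.system_prompt})
        result.extend(self.messages)
        return result


memory = DequeWindowMemory(max_messages=4, system_prompt="You are a Python tutor.")
for i in range(1, 7):
    role = "user" if i % 2 else "assistant"
    memory.add(role, f"message {i}")

contents = [m["content"] for m in memory.get_messages()]
print(contents)
# ['You are a Python tutor.', 'message 3', 'message 4', 'message 5', 'message 6']
```

With a window of 4, messages 1 and 2 are gone after the sixth message is added, while the system prompt survives because it lives outside the deque.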

2.2 Pattern 2: Token-Based Truncation

Keep messages until the token budget is nearly exhausted.

python.py

import tiktoken

class TokenMemory:
    def __init__(self, max_tokens=8000, model="gpt-4"):
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model(model)
        self.system_prompt = None
        self.messages = []

    def set_system(self, prompt):
        self.system_prompt = {"role": "system", "content": prompt}

    def count_tokens(self, text):
        return len(self.encoder.encode(text))

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        self._truncate()

    def _truncate(self):
        """Remove oldest messages until within budget (always keeps the last 2)."""
        # +4 approximates the per-message formatting overhead
        total = sum(self.count_tokens(m["content"]) + 4 for m in self.messages)
        if self.system_prompt:
            total += self.count_tokens(self.system_prompt["content"])

        while total > self.max_tokens and len(self.messages) > 2:
            removed = self.messages.pop(0)
            total -= self.count_tokens(removed["content"]) + 4

    def get_messages(self):
        result = []
        if self.system_prompt:
            result.append(self.system_prompt)
        result.extend(self.messages)
        return result
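To experiment with token-based truncation without installing `tiktoken`, the core loop can be written as a standalone function that takes any counting function. This is a sketch under stated assumptions: `rough_token_count` uses a crude 4-characters-per-token heuristic, `truncate_to_budget` is a hypothetical helper name, and unlike the class above it does not force the last two messages to be kept.

```python
def rough_token_count(text: str) -> int:
    """Crude heuristic (assumption): roughly 4 characters per token."""
    return max(1, len(text) // 4)


def truncate_to_budget(messages, max_tokens, count_tokens=rough_token_count, overhead=4):
    """Return the newest messages whose combined cost fits within max_tokens.

    Walks backwards from the most recent message and stops at the first one
    that would push the total over budget.
    """
    kept = []
    total = 0
    for msg in reversed(messages):
        cost = count_tokens(msg["content"]) + overhead
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 40},       # 10 + 4 = 14 tokens
    {"role": "assistant", "content": "b" * 80},  # 20 + 4 = 24 tokens
    {"role": "user", "content": "c" * 40},       # 10 + 4 = 14 tokens
]
trimmed = truncate_to_budget(history, max_tokens=40)
print([m["content"][0] for m in trimmed])  # ['b', 'c']
```

For real budgets, pass a counter built on the model's actual tokenizer instead of the character heuristic.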

2.3 Pattern 3: Summary Memory

Summarize the older part of the conversation, then keep the summary plus the recent messages.

python.py

class SummaryMemory:
    def __init__(self, client, summary_threshold=10):
        self.client = client
        self.summary_threshold = summary_threshold
        self.summary = ""
        self.recent_messages = []
        self.system_prompt = None

    def set_system(self, prompt):
        self.system_prompt = {"role": "system", "content": prompt}

    def add(self, role, content):
        self.recent_messages.append({"role": role, "content": content})

        if len(self.recent_messages) > self.summary_threshold:
            self._summarize_old()

    def _summarize_old(self):
        """Summarize older messages, keep recent ones."""
        old = self.recent_messages[:self.summary_threshold // 2]
        self.recent_messages = self.recent_messages[self.summary_threshold // 2:]

        old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old)

        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo",  # Cheap model for summarization
            messages=[{
                "role": "user",
                "content": f"Summarize this conversation in 2-3 sentences:\n{old_text}"
            }],
            max_tokens=200
        )

        new_summary = response.choices[0].message.content
        self.summary = f"{self.summary}\n{new_summary}".strip()

    def get_messages(self):
        # Include the summary even when no system prompt has been set
        parts = []
        if self.system_prompt:
            parts.append(self.system_prompt["content"])
        if self.summary:
            parts.append(f"Conversation summary so far:\n{self.summary}")

        result = []
        if parts:
            result.append({"role": "system", "content": "\n\n".join(parts)})
        result.extend(self.recent_messages)
        return result
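The summarization step is easier to test when the summarizer is passed in as a plain callable, so that no API call is needed. The sketch below assumes a hypothetical class `PluggableSummaryMemory` and a stand-in `fake_summarize` function; in a real app the callable would wrap an LLM request like the one in `_summarize_old` above.

```python
class PluggableSummaryMemory:
    """Summary memory whose summarizer is any callable: list[dict] -> str."""

    def __init__(self, summarize, threshold=4):
        self.summarize = summarize
        self.threshold = threshold
        self.summary = ""
        self.recent = []

    def add(self, role, content):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > self.threshold:
            half = self.threshold // 2
            old, self.recent = self.recent[:half], self.recent[half:]
            # Fold the oldest half into the running summary
            self.summary = f"{self.summary}\n{self.summarize(old)}".strip()

    def get_messages(self):
        result = []
        if self.summary:
            result.append({
                "role": "system",
                "content": f"Conversation summary so far:\n{self.summary}",
            })
        result.extend(self.recent)
        return result


def fake_summarize(messages):
    """Stand-in for an LLM call; only reports how many messages were folded."""
    return f"({len(messages)} earlier messages summarized)"


memory = PluggableSummaryMemory(fake_summarize, threshold=4)
for i in range(1, 6):
    memory.add("user", f"message {i}")

print(memory.summary)                         # (2 earlier messages summarized)
print([m["content"] for m in memory.recent])  # ['message 3', 'message 4', 'message 5']
```

Injecting the summarizer also makes it straightforward to swap models or add caching later without touching the memory logic.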

2.4 Pattern Comparison

Pattern	Pros	Cons	Best For
Sliding Window	Simple, fast	Loses old context	Short chats, Q&A
Token Truncation	Precise budget	Cuts mid-conversation	Production apps
Summary	Preserves key info	Extra API calls, cost	Long conversations

Checkpoint

Can you compare the Sliding Window, Token Budget, and Summary memory patterns?

📐 Advanced: RAG Memory

3.1 Retrieval-Augmented Conversation

python.py

# Concept: store all messages in a vector DB
# Retrieve relevant past messages instead of keeping all of them

import numpy as np
from openai import OpenAI

class RAGMemory:
    def __init__(self, client):
        self.client = client
        self.messages_store = []  # All messages ever
        self.embeddings_store = []

    def get_embedding(self, text):
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def add(self, role, content):
        # One embedding API call per stored message
        msg = {"role": role, "content": content}
        embedding = self.get_embedding(content)
        self.messages_store.append(msg)
        self.embeddings_store.append(embedding)

    def retrieve_relevant(self, query, top_k=5):
        """Find most relevant past messages."""
        query_emb = self.get_embedding(query)

        # Cosine similarity between the query and every stored embedding
        scores = []
        for emb in self.embeddings_store:
            score = np.dot(query_emb, emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(emb)
            )
            scores.append(score)

        # Top-k indices, highest score first
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.messages_store[i] for i in top_indices]
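The retrieval step can be tried offline with hand-made vectors in place of real embeddings. The sketch below uses only the standard library; the three-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration and do not come from any embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_emb, embeddings, k=2):
    """Indices of the k stored embeddings most similar to the query, best first."""
    scored = [(cosine_similarity(query_emb, emb), i) for i, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 3-dimensional "embeddings" (made up for illustration)
messages = ["lists in Python", "weather today", "Python tuples"]
embeddings = [[1.0, 0.1, 0.0], [0.0, 0.0, 1.0], [0.9, 0.3, 0.0]]

query = [1.0, 0.2, 0.0]  # pretend this encodes "Python sequences"
print([messages[i] for i in top_k(query, embeddings, k=2)])
# ['lists in Python', 'Python tuples']
```

The unrelated "weather today" message scores 0 against the query and is left out, which is the behavior RAG memory relies on to keep the prompt small.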

Checkpoint

Do you understand how RAG Memory uses embeddings to retrieve relevant messages?

💻 Streamlit Chat with Memory

4.1 Complete Implementation

python.py

# smart_chat.py
import streamlit as st
from openai import OpenAI

st.set_page_config(page_title="Smart Chat", page_icon="🧠")
st.title("🧠 Smart Chat with Memory")

client = OpenAI(api_key=st.secrets["OPENAI_API_KEY"])


def new_conversation():
    return [{"role": "system", "content": "You are a friendly AI assistant."}]


def apply_memory(messages, memory_type, max_msgs=20):
    """Placeholder, not part of the original lesson: whatever pattern is
    selected, keep the system prompt plus the last max_msgs messages.
    Replace with TokenMemory / SummaryMemory logic for the other patterns."""
    return messages[:1] + messages[1:][-max_msgs:]


# Initialize
if "messages" not in st.session_state:
    st.session_state.messages = new_conversation()

# Sidebar: memory settings
max_msgs = 20  # default when the slider is not shown
with st.sidebar:
    st.header("⚙️ Memory Settings")
    memory_type = st.radio(
        "Memory Pattern:",
        ["Sliding Window", "Token Budget", "Summary"]
    )

    if memory_type == "Sliding Window":
        max_msgs = st.slider("Max messages", 5, 50, 20)
    elif memory_type == "Token Budget":
        max_tokens = st.slider("Max tokens", 1000, 16000, 8000)

    st.divider()

    # Stats
    msg_count = len(st.session_state.messages) - 1  # minus system
    st.metric("Messages", msg_count)
    st.metric("Est. tokens", msg_count * 50)  # rough estimate

    if st.button("🗑️ Clear"):
        st.session_state.messages = new_conversation()
        st.rerun()

# Display history
for msg in st.session_state.messages[1:]:  # Skip system
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

# Input
if prompt := st.chat_input("Message..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    # Apply memory pattern
    messages_to_send = apply_memory(
        st.session_state.messages, memory_type, max_msgs
    )

    with st.chat_message("assistant"):
        stream = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=messages_to_send,
            stream=True
        )
        response = st.write_stream(stream)

    st.session_state.messages.append(
        {"role": "assistant", "content": response}
    )

Checkpoint

Have you built the smart chat app with memory settings?

📝 Best Practices

5.1 System Prompt Optimization

python.py

# ❌ Bad: system prompt is too long
system = """
You are an AI assistant. You must always answer in Vietnamese.
You must not lie. You must answer in detail.
You must format with markdown. You must be friendly.
... (500+ tokens of rules)
"""

# ✅ Good: Concise, prioritized
system = """Role: AI tutor for Data Science
Language: Vietnamese
Style: Friendly, uses real-world examples
Format: Markdown with code blocks"""

5.2 Context Injection

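python.py

# Inject relevant context into the system prompt dynamically
def build_system_prompt(user_profile, current_topic):
    past_topics = ", ".join(user_profile["completed_topics"])
    # Join lines explicitly so no indentation whitespace ends up in the prompt
    return "\n".join([
        "Role: AI Tutor",
        f"Student: {user_profile['name']}, level {user_profile['level']}",
        f"Current topic: {current_topic}",
        f"Past topics: {past_topics}",
        "",
        "Adjust explanations to student's level.",
        "Reference past topics when relevant.",
    ])

# Usage (hypothetical profile)
profile = {"name": "An", "level": "beginner", "completed_topics": ["lists", "loops"]}
print(build_system_prompt(profile, "dictionaries"))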

5.3 Memory Do's and Don'ts

Example

✅ DO:
- Choose memory pattern based on use case
- Set token budgets conservatively (leave room for response)
- Summarize long conversations periodically
- Store conversation logs for debugging

❌ DON'T:
- Keep unlimited history (WILL hit context limit)
- Use expensive models for summarization (use GPT-3.5)
- Forget to count system prompt tokens
- Ignore token costs of conversation history

Checkpoint

Have you mastered the best practices for system prompts and memory management?

💻 Hands-on Lab

Lab 1: Implement All 3 Memory Patterns

Build a Streamlit chatbot app that lets the user switch between:

- Sliding Window (10 messages)
- Token Budget (4000 tokens)
- Summary (summarize every 6 messages)

Compare their behavior over a chat of 20+ turns.

Lab 2: Persistent Chat

Save conversation history to a JSON file:

- Save on app close
- Load on app open
- Allow multiple named conversations
- Delete conversations
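One possible starting point for this lab is a small store that keeps each named conversation in its own JSON file. This is a minimal sketch, not a reference solution: the class name `ConversationStore` and the one-file-per-conversation layout are assumptions, and conversation names are used as file names without any sanitization.

```python
import json
import tempfile
from pathlib import Path


class ConversationStore:
    """Stores each named conversation as <name>.json inside a directory."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, name):
        return self.directory / f"{name}.json"

    def save(self, name, messages):
        self._path(name).write_text(
            json.dumps(messages, ensure_ascii=False, indent=2), encoding="utf-8"
        )

    def load(self, name):
        path = self._path(name)
        if not path.exists():
            return []
        return json.loads(path.read_text(encoding="utf-8"))

    def list_names(self):
        return sorted(p.stem for p in self.directory.glob("*.json"))

    def delete(self, name):
        self._path(name).unlink(missing_ok=True)


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    store = ConversationStore(tmp)
    store.save("python-basics", [{"role": "user", "content": "What is a list?"}])
    store.save("pandas", [{"role": "user", "content": "What is a DataFrame?"}])
    names_before = store.list_names()
    loaded = store.load("python-basics")
    store.delete("pandas")
    names_after = store.list_names()

print(names_before)  # ['pandas', 'python-basics']
print(names_after)   # ['python-basics']
```

In a Streamlit app, calling `save` after every assistant reply is often simpler than trying to detect when the app closes.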

Lab 3: Context-Aware Tutor

Build an AI tutor that:

- Remembers what the student has already learned
- Does not re-explain concepts the student already understands
- References examples from previous conversations
- Adjusts difficulty based on performance

Checkpoint

Have you practiced implementing the memory patterns and persistent chat?
