🎤 Speech-to-Text
🎯 Learning objectives
After this lesson, you will:
✅ Understand the STT architecture and the audio processing flow
✅ Integrate OpenAI Whisper into n8n workflows
✅ Build a Voice Command Workflow
✅ Create a Meeting Transcription pipeline
✅ Handle multi-language audio
✅ Build a Voice Note Processor with auto-classification
Turn audio into text in n8n workflows. Use it for voice commands, transcription, and meeting notes.
🔍 STT Architecture
Checkpoint
Which steps make up the STT pipeline? Which sources can the audio input come from?
🛠️ OpenAI Whisper in n8n
Setup
```javascript
// OpenAI node: Audio Transcription
// Model: whisper-1
// Input: Audio file (mp3, wav, m4a, webm)
// Max file size: 25MB
// Supported languages: 98+ languages

// Configuration:
// - Model: whisper-1
// - Response Format: json (or text, srt, vtt)
// - Language: auto-detect or specify (vi, en, ja, etc.)
// - Temperature: 0 (for most accurate)
```
Basic Transcription Workflow
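```javascript
// Webhook receives audio file
// OpenAI Whisper transcribes
// Return transcript

// Result format:
// (In the OpenAI API, "language" and "duration" are returned only with the
// verbose_json response format; plain json carries just "text".)
{
  "text": "Xin chào, tôi cần hỗ trợ về đơn hàng số 12345.",
  "language": "vi",
  "duration": 5.2
}
```
Checkpoint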
Which audio formats does Whisper support? How do you configure it for the highest accuracy?
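The limits listed under Setup can be checked before the audio ever reaches Whisper, so that oversized or unsupported files fail early with a readable message. The following is a minimal sketch, not part of n8n or the OpenAI API: `validateAudio`, `ALLOWED_EXTENSIONS`, and `MAX_BYTES` are hypothetical names, the extension list simply mirrors the formats named in this lesson, and the 25MB limit is assumed to mean 25 × 1024 × 1024 bytes.

```javascript
// Hypothetical pre-flight check mirroring the limits from the Setup notes.
const ALLOWED_EXTENSIONS = ["mp3", "wav", "m4a", "webm"];
const MAX_BYTES = 25 * 1024 * 1024; // assumption: 25MB counted in binary megabytes

function validateAudio(fileName, sizeBytes) {
  const errors = [];

  // Take everything after the last dot as the extension, case-insensitively.
  const dot = fileName.lastIndexOf(".");
  const extension = dot === -1 ? "" : fileName.slice(dot + 1).toLowerCase();

  if (!ALLOWED_EXTENSIONS.includes(extension)) {
    errors.push(`Unsupported format: "${extension || "none"}"`);
  }
  if (sizeBytes > MAX_BYTES) {
    const sizeMb = (sizeBytes / 1024 / 1024).toFixed(1);
    errors.push(`File is ${sizeMb}MB; the limit is 25MB`);
  }

  return { ok: errors.length === 0, extension, errors };
}

// Example calls with made-up file names and sizes.
console.log(validateAudio("note.m4a", 3 * 1024 * 1024));
console.log(validateAudio("meeting.flac", 40 * 1024 * 1024));
```

In a workflow, a check like this could sit in a Code node between the trigger and the OpenAI node, with the file name and size taken from whatever metadata the incoming item provides.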
⚡ Voice Command Workflow
```javascript
// Code node: Parse voice command
const transcript = $json.text;

const parsePrompt = `
Parse this voice command and extract the intent and parameters:

Voice: "${transcript}"

Possible intents:
- send_email (to, subject, body)
- create_task (title, due_date, priority)
- search (query)
- question (question_text)
- reminder (text, time)

Return JSON:
{
  "intent": "...",
  "params": {...},
  "confidence": 0.0-1.0
}`;

return { json: { prompt: parsePrompt } };
```
Checkpoint
How does the Voice Command Workflow parse the intent? Which common intents should it support?
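The prompt above asks the model for JSON, but the reply still has to be parsed and checked before the workflow branches on it. Below is a hedged sketch of that step: `parseIntentResponse`, `KNOWN_INTENTS`, and the default confidence threshold of 0.6 are illustrative assumptions, not values from the lesson or from n8n. It falls back to an `unknown` intent whenever the reply cannot be trusted.

```javascript
// Intents copied from the parsing prompt; anything else is treated as unknown.
const KNOWN_INTENTS = ["send_email", "create_task", "search", "question", "reminder"];

// Hypothetical helper: builds the fallback result with a reason for debugging.
function unknownIntent(reason) {
  return { intent: "unknown", params: {}, confidence: 0, reason };
}

function parseIntentResponse(rawText, minConfidence = 0.6) {
  // Models sometimes wrap the JSON in extra prose, so keep only the outermost {...}.
  const start = rawText.indexOf("{");
  const end = rawText.lastIndexOf("}");
  if (start === -1 || end <= start) return unknownIntent("no JSON object found");

  let parsed;
  try {
    parsed = JSON.parse(rawText.slice(start, end + 1));
  } catch (err) {
    return unknownIntent("invalid JSON");
  }

  if (!KNOWN_INTENTS.includes(parsed.intent)) return unknownIntent("unsupported intent");

  const confidence = Number(parsed.confidence);
  if (!Number.isFinite(confidence) || confidence < minConfidence) {
    return unknownIntent("low confidence");
  }

  return { intent: parsed.intent, params: parsed.params ?? {}, confidence };
}

// Example with a made-up model reply.
const reply = 'Result: {"intent": "search", "params": {"query": "invoices"}, "confidence": 0.92}';
console.log(parseIntentResponse(reply));
```

A Switch node could then route on the returned `intent`, sending `unknown` to a branch that asks the user to repeat the command.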
📝 Meeting Transcription
```javascript
// Workflow: Transcribe meeting audio → Summary → Action items

// Step 1: Transcribe
// OpenAI Whisper node
// (Assumes an earlier node mapped Whisper's `text` output to `transcript`.)

// Step 2: Summarize
const summaryPrompt = `
Summarize this meeting:

Transcript:
${$json.transcript}

Generate:
1. Meeting Summary (3-5 sentences)
2. Key Decisions Made
3. Action Items (with assigned person if mentioned)
4. Follow-up Topics
5. Next Steps

Format as Markdown.`;

// Step 3: Extract action items
const actionPrompt = `
From this meeting transcript, extract all action items:

${$json.transcript}

Return JSON array:
[
  {
    "task": "description",
    "assignedTo": "person name or unassigned",
    "deadline": "mentioned deadline or none",
    "priority": "high/medium/low"
  }
]`;

// Pass both prompts on to the downstream AI nodes
return { json: { summaryPrompt, actionPrompt } };
```
Checkpoint
Which steps make up the Meeting Transcription pipeline? How do you extract action items from the transcript?
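The action-item prompt returns a JSON array, and in practice some entries come back with missing fields or unexpected priority values. The sketch below shows one way to clean that array before creating tasks from it. `normalizeActionItems` and `PRIORITIES` are hypothetical names, and defaulting an unrecognized priority to `medium` is a design assumption rather than something the lesson prescribes.

```javascript
// Priority levels from the prompt, ordered from most to least urgent.
const PRIORITIES = ["high", "medium", "low"];

function normalizeActionItems(rawText) {
  // Keep only the outermost [...] in case the model added surrounding prose.
  const start = rawText.indexOf("[");
  const end = rawText.lastIndexOf("]");
  if (start === -1 || end <= start) return [];

  let items;
  try {
    items = JSON.parse(rawText.slice(start, end + 1));
  } catch {
    return []; // unparseable reply: treat as "no action items"
  }
  if (!Array.isArray(items)) return [];

  return items
    // Drop entries that have no usable task description.
    .filter((item) => item && typeof item.task === "string" && item.task.trim() !== "")
    .map((item) => {
      const priority = String(item.priority ?? "").toLowerCase();
      return {
        task: item.task.trim(),
        assignedTo: item.assignedTo || "unassigned",
        deadline: item.deadline || "none",
        priority: PRIORITIES.includes(priority) ? priority : "medium",
      };
    })
    // Most urgent items first.
    .sort((a, b) => PRIORITIES.indexOf(a.priority) - PRIORITIES.indexOf(b.priority));
}

// Example with a made-up model reply.
const reply = `[
  {"task": "Send the budget draft", "assignedTo": "Lan", "priority": "Low"},
  {"task": "Book the demo room", "priority": "high", "deadline": "Friday"}
]`;
console.log(normalizeActionItems(reply));
```

Each normalized item could then become one n8n item, so that a task-creation node runs once per action item.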
🌍 Audio File Processing & Multi-Language
Audio File Processing
```javascript
// Code node: Handle different audio sources

// Source 1: Direct upload via webhook
// Content-Type: multipart/form-data

// Source 2: Download from URL
// HTTP Request node → Download audio file

// Source 3: Record from Telegram voice message
// Telegram trigger → Download voice file

// Source 4: Google Drive audio
// Google Drive node → Download file

// All feed into Whisper for transcription
```
Multi-Language Support
```javascript
// Whisper supports 98+ languages
// Auto-detection is usually accurate

// For explicit language setting:
// OpenAI node → Language: "vi" (Vietnamese)

// For multi-language meetings:
const postProcessPrompt = `
This meeting transcript contains multiple languages.
Identify each speaker's language and translate everything to Vietnamese.

Transcript: ${$json.transcript}

Output format:
[Speaker 1 (English)]: Original -> Translation
[Speaker 2 (Vietnamese)]: Text as-is
`;

// Pass the prompt on to the downstream AI node
return { json: { prompt: postProcessPrompt } };
```
Checkpoint
How many languages does Whisper support? How do you handle multi-language audio?
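Because the multi-language prompt fixes an output format, the model's reply can be turned into structured data with a small parser. The following is a sketch under that assumption: `parseSpeakerLines` and `LINE_PATTERN` are hypothetical names, and the parser only recognizes lines that follow the `[Speaker (Language)]: text` shape from the prompt. It splits on the first ` -> `, so an original sentence that itself contains that arrow would be cut in the wrong place.

```javascript
// Matches "[Speaker 1 (English)]: some text" and captures speaker, language, text.
const LINE_PATTERN = /^\[(.+?) \((.+?)\)\]:\s*(.*)$/;

function parseSpeakerLines(output) {
  const turns = [];
  for (const line of output.split("\n")) {
    const match = line.trim().match(LINE_PATTERN);
    if (!match) continue; // skip blank lines and any commentary from the model

    const [, speaker, language, rest] = match;
    const arrow = rest.indexOf(" -> ");
    turns.push({
      speaker,
      language,
      // Lines without an arrow are already in the target language.
      original: arrow === -1 ? rest : rest.slice(0, arrow),
      translation: arrow === -1 ? rest : rest.slice(arrow + 4),
    });
  }
  return turns;
}

// Example with a made-up model reply.
const reply = [
  "[Speaker 1 (English)]: Hello everyone -> Xin chào mọi người",
  "[Speaker 2 (Vietnamese)]: Xin chào",
].join("\n");
console.log(parseSpeakerLines(reply));
```

The resulting array makes it straightforward to store original and translated text side by side, or to filter turns by speaker.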
📋 Voice Note Processing
```javascript
// Quick voice note processor
// "Email John about the meeting tomorrow at 3pm"
// → Intent: email
// → To: John
// → Subject: Meeting tomorrow
// → Time context: 3pm

// "Remind me to call the client on Friday"
// → Intent: reminder
// → Task: Call client
// → When: Friday
```
- Audio quality: Microphone quality has a large effect on accuracy
- File size: Whisper accepts at most 25MB; split larger files
- Language hint: Specify the language when you know it in advance to improve accuracy
- Post-processing: Always run an AI post-processing step to fix transcription errors
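For the file-size tip, the split points can be planned from the file size and duration before any audio is cut. This is a rough sketch, assuming a roughly constant bitrate so that size scales with time: `planChunks` and the 10% safety margin are illustrative choices, and the actual cutting would still be done by an external audio tool such as ffmpeg.

```javascript
const MAX_BYTES = 25 * 1024 * 1024; // assumption: 25MB counted in binary megabytes

// Returns time ranges (in seconds) whose estimated size stays under the limit.
function planChunks(sizeBytes, durationSeconds, safetyMargin = 0.9) {
  // Aim below the hard limit, since real bitrates vary over a recording.
  const budget = MAX_BYTES * safetyMargin;
  const count = Math.max(1, Math.ceil(sizeBytes / budget));
  const chunkSeconds = durationSeconds / count;

  return Array.from({ length: count }, (_, i) => ({
    index: i,
    startSeconds: Number((i * chunkSeconds).toFixed(2)),
    endSeconds: Number(((i + 1) * chunkSeconds).toFixed(2)),
  }));
}

// Example: a made-up 60MB, one-hour recording.
console.log(planChunks(60 * 1024 * 1024, 3600));
```

Splitting at fixed times can cut a word in half, so in practice the transcripts of neighbouring chunks may need a small overlap or a post-processing pass to stitch them together.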
Checkpoint
What does the Voice Note Processor base its intent classification on? Which tips help improve STT accuracy?
📚 Practice exercises
- Build basic STT workflow: upload audio, get transcript
- Create voice command parser (email, task, search)
- Build a meeting transcription and summary workflow
- Create a voice note processor with auto-classification
Checkpoint
Have you built a complete STT workflow? Does your voice command parser classify intents correctly?
🚀 Next lesson
Next lesson: Text-to-Speech →
