🎙️ Voice AI Basics
Học cách tích hợp voice capabilities vào AI agents.
Voice AI Components
The Voice Pipeline
Text
1┌────────────┐ ┌─────────────┐ ┌────────────┐2│ User │───▶│ Speech-to- │───▶│ Agent │3│ Speech │ │ Text │ │ (Process) │4└────────────┘ └─────────────┘ └────────────┘5 │6┌────────────┐ ┌─────────────┐ │7│ Audio │◀───│ Text-to- │◀──────────┘8│ Output │ │ Speech │9└────────────┘ └─────────────┘Key Technologies
Text
11. Speech-to-Text (STT)2 - Convert voice → text3 - Also: ASR (Automatic Speech Recognition)4 - Providers: Whisper, Google, Azure5 62. Text-to-Speech (TTS)7 - Convert text → voice8 - Natural-sounding voices9 - Providers: ElevenLabs, Google, Azure10 113. Voice Agents12 - Combine STT + Agent + TTS13 - Phone systems14 - Voice assistantsSpeech-to-Text Options
Whisper (OpenAI)
Text
1Pros:2✅ Highly accurate3✅ Multi-language4✅ Free (local) / cheap (API)5✅ Handles accents well6 7Cons:8❌ Batch only (no streaming)9❌ Requires audio file10 11Use case:12- Transcription13- Voice messages14- Recorded contentGoogle Speech-to-Text
Text
1Pros:2✅ Streaming support3✅ Real-time transcription4✅ Many languages5✅ Good accuracy6 7Cons:8❌ Pay per minute9❌ Complex setup10 11Use case:12- Live transcription13- Phone systems14- Real-time agentsAzure Speech
Text
1Pros:2✅ Enterprise features3✅ Custom models4✅ Streaming5✅ Microsoft integration6 7Use case:8- Enterprise deployments9- Custom vocabulary10- High security needsText-to-Speech Options
ElevenLabs
Text
1Pros:2✅ Ultra-realistic voices3✅ Voice cloning4✅ Emotional range5✅ Easy to use6 7Cons:8❌ Higher cost9❌ Character limits10 11Best for:12- High-quality experience13- Brand voices14- Content creationGoogle Text-to-Speech
Text
1Pros:2✅ Many voices/languages3✅ SSML support4✅ Neural voices available5✅ Good free tier6 7Best for:8- Multi-language9- Cost-effective10- Simple integrationOpenAI TTS
Text
1Pros:2✅ Simple API3✅ Good quality4✅ Fast generation5✅ Multiple voices6 7Best for:8- Quick implementation9- OpenAI ecosystem10- General use casesImplementing Voice in No-Code
Voiceflow Voice
Text
1Voiceflow supports voice natively:2 31. Create project42. Select "Voice Assistant"53. Build flows (same as chat)64. Voiceflow handles STT/TTS75. Deploy to Alexa, Google AssistantMake + Voice APIs
Text
1Workflow:21. Receive audio file32. Send to Whisper API → Get text43. Process with AI54. Send text to ElevenLabs → Get audio65. Return audioExample: Voice Message Bot
Text
1Telegram Bot (Make):2 3Trigger: Voice message received4 ↓5HTTP: Send audio to Whisper6 ↓7Get transcript text8 ↓9HTTP: Send to OpenAI (process)10 ↓11HTTP: Send response to ElevenLabs12 ↓13Action: Send voice replyElevenLabs Setup
Getting Started
Text
11. Sign up at elevenlabs.io22. Get API key33. Choose voices:4 - Pre-made voices5 - Clone your voice6 - Generate customVoice Selection
Text
1Categories:2- Narrative (storytelling)3- Conversational (casual)4- Professional (business)5- Characters (unique styles)6 7Choose based on:8- Brand personality9- Use case10- AudienceAPI Usage
Text
1Endpoint: https://api.elevenlabs.io/v1/text-to-speech2 3Request:4{5 "text": "Hello, how can I help you?",6 "voice_id": "21m00Tcm4TlvDq8ikWAM",7 "model_id": "eleven_monolingual_v1"8}9 10Response: Audio file (mp3)Whisper API Setup
OpenAI Whisper
Text
1Endpoint: https://api.openai.com/v1/audio/transcriptions2 3Request:4- Model: whisper-15- File: audio file6- Language: optional7 8Response:9{10 "text": "Transcribed text here"11}Supported Formats
Text
1Audio formats:2- mp33- mp44- mpeg5- m4a6- wav7- webm8- Maximum: 25MBBuilding Voice Bot (Make)
Step 1: Audio Input
Text
1Sources:2- Telegram voice message3- Uploaded file4- Phone call (Twilio)5- Web recordingStep 2: Transcription
Text
1HTTP Module:2- URL: OpenAI Whisper endpoint3- Method: POST4- Body: Form data with audio file5- Parse response: Get textStep 3: Process Text
Text
1Send to:2- OpenAI for response3- Your AI agent4- Custom logic5 6Get text replyStep 4: Generate Audio
Text
1HTTP Module:2- URL: ElevenLabs endpoint3- Method: POST4- Body: JSON with text5- Response: Binary (audio)Step 5: Send Response
Text
1Options:2- Send audio file3- Play audio4- Stream to userVoice UX Best Practices
Design for Voice
Voice UX Tips
Text
11. Keep responses short2 - 1-3 sentences max3 - Easy to listen4 52. Confirm understanding6 - "I heard you say..."7 - Avoid misunderstandings8 93. Provide options10 - "You can say A, B, or C"11 - Guide the conversation12 134. Handle silence14 - Prompt after pause15 - "Are you still there?"16 175. Allow interruption18 - "barge-in" capability19 - Don't force listeningPronunciation
Text
1Control pronunciation:2- SSML tags3- Phonetic spelling4- Custom dictionaries5 6Example issues:7- Names: "Nguyen" → "Win"8- Numbers: "2023" vs "twenty twenty-three"9- Acronyms: "API" vs "A-P-I"Error Handling
Voice Errors
Text
1Common issues:2- No audio detected3- Poor quality audio4- Background noise5- Accent issues6- Network problemsSolutions
Handle Voice Errors
Text
11. Request repeat2 "I didn't catch that. Could you say it again?"3 42. Offer alternative5 "I'm having trouble hearing you. 6 Would you like to type instead?"7 83. Confirm critical info9 "Just to confirm, you said [X]. Is that right?"10 114. Graceful degradation12 Switch to text if voice failsTesting Voice
Quality Check
Text
1Test for:2- Accuracy of transcription3- Natural TTS sound4- Response latency5- Different accents6- Background noise7- Various devicesTools
Text
1- Test with real voice input2- Record test scenarios3- Use different devices4- Test network conditionsBài Tập
Practice
Build Voice Integration:
- Create ElevenLabs account
- Get OpenAI API key
- Build Make workflow:
- Accept audio input
- Transcribe with Whisper
- Process with AI
- Generate speech response
- Test with voice messages
Tiếp theo: Bài 6 - Voice Assistants
