🎙️ Voice AI Basics

Học cách tích hợp voice capabilities vào AI agents.

Voice AI Components

The Voice Pipeline

Text

1┌────────────┐    ┌─────────────┐    ┌────────────┐
2│   User     │───▶│ Speech-to-  │───▶│   Agent    │
3│   Speech   │    │    Text     │    │  (Process) │
4└────────────┘    └─────────────┘    └────────────┘
5                                            │
6┌────────────┐    ┌─────────────┐           │
7│   Audio    │◀───│ Text-to-    │◀──────────┘
8│   Output   │    │   Speech    │
9└────────────┘    └─────────────┘

Key Technologies

Text

11. Speech-to-Text (STT)
2   - Convert voice → text
3   - Also: ASR (Automatic Speech Recognition)
4   - Providers: Whisper, Google, Azure
5 
62. Text-to-Speech (TTS)
7   - Convert text → voice
8   - Natural-sounding voices
9   - Providers: ElevenLabs, Google, Azure
10 
113. Voice Agents
12   - Combine STT + Agent + TTS
13   - Phone systems
14   - Voice assistants

Speech-to-Text Options

Whisper (OpenAI)

Text

1Pros:
2✅ Highly accurate
3✅ Multi-language
4✅ Free (local) / cheap (API)
5✅ Handles accents well
6 
7Cons:
8❌ Batch only (no streaming)
9❌ Requires audio file
10 
11Use case:
12- Transcription
13- Voice messages
14- Recorded content

Google Speech-to-Text

Text

1Pros:
2✅ Streaming support
3✅ Real-time transcription
4✅ Many languages
5✅ Good accuracy
6 
7Cons:
8❌ Pay per minute
9❌ Complex setup
10 
11Use case:
12- Live transcription
13- Phone systems
14- Real-time agents

Azure Speech

Text

1Pros:
2✅ Enterprise features
3✅ Custom models
4✅ Streaming
5✅ Microsoft integration
6 
7Use case:
8- Enterprise deployments
9- Custom vocabulary
10- High security needs

Text-to-Speech Options

ElevenLabs

Text

1Pros:
2✅ Ultra-realistic voices
3✅ Voice cloning
4✅ Emotional range
5✅ Easy to use
6 
7Cons:
8❌ Higher cost
9❌ Character limits
10 
11Best for:
12- High-quality experience
13- Brand voices
14- Content creation

Google Text-to-Speech

Text

1Pros:
2✅ Many voices/languages
3✅ SSML support
4✅ Neural voices available
5✅ Good free tier
6 
7Best for:
8- Multi-language
9- Cost-effective
10- Simple integration

OpenAI TTS

Text

1Pros:
2✅ Simple API
3✅ Good quality
4✅ Fast generation
5✅ Multiple voices
6 
7Best for:
8- Quick implementation
9- OpenAI ecosystem
10- General use cases

Implementing Voice in No-Code

Voiceflow Voice

Text

1Voiceflow supports voice natively:
2 
31. Create project
42. Select "Voice Assistant"
53. Build flows (same as chat)
64. Voiceflow handles STT/TTS
75. Deploy to Alexa, Google Assistant

Make + Voice APIs

Text

1Workflow:
21. Receive audio file
32. Send to Whisper API → Get text
43. Process with AI
54. Send text to ElevenLabs → Get audio
65. Return audio

Example: Voice Message Bot

Text

1Telegram Bot (Make):
2 
3Trigger: Voice message received
4   ↓
5HTTP: Send audio to Whisper
6   ↓
7Get transcript text
8   ↓
9HTTP: Send to OpenAI (process)
10   ↓
11HTTP: Send response to ElevenLabs
12   ↓
13Action: Send voice reply

ElevenLabs Setup

Getting Started

Text

11. Sign up at elevenlabs.io
22. Get API key
33. Choose voices:
4   - Pre-made voices
5   - Clone your voice
6   - Generate custom

Voice Selection

Text

1Categories:
2- Narrative (storytelling)
3- Conversational (casual)
4- Professional (business)
5- Characters (unique styles)
6 
7Choose based on:
8- Brand personality
9- Use case
10- Audience

API Usage

Text

1Endpoint: https://api.elevenlabs.io/v1/text-to-speech
2 
3Request:
4{
5  "text": "Hello, how can I help you?",
6  "voice_id": "21m00Tcm4TlvDq8ikWAM",
7  "model_id": "eleven_monolingual_v1"
8}
9 
10Response: Audio file (mp3)

Whisper API Setup

OpenAI Whisper

Text

1Endpoint: https://api.openai.com/v1/audio/transcriptions
2 
3Request:
4- Model: whisper-1
5- File: audio file
6- Language: optional
7 
8Response:
9{
10  "text": "Transcribed text here"
11}

Supported Formats

Text

1Audio formats:
2- mp3
3- mp4
4- mpeg
5- m4a
6- wav
7- webm
8- Maximum: 25MB

Building Voice Bot (Make)

Step 1: Audio Input

Text

1Sources:
2- Telegram voice message
3- Uploaded file
4- Phone call (Twilio)
5- Web recording

Step 2: Transcription

Text

1HTTP Module:
2- URL: OpenAI Whisper endpoint
3- Method: POST
4- Body: Form data with audio file
5- Parse response: Get text

Step 3: Process Text

Text

1Send to:
2- OpenAI for response
3- Your AI agent
4- Custom logic
5 
6Get text reply

Step 4: Generate Audio

Text

1HTTP Module:
2- URL: ElevenLabs endpoint
3- Method: POST
4- Body: JSON with text
5- Response: Binary (audio)

Step 5: Send Response

Text

1Options:
2- Send audio file
3- Play audio
4- Stream to user

Voice UX Best Practices

Design for Voice

Voice UX Tips

Text

11. Keep responses short
2   - 1-3 sentences max
3   - Easy to listen
4 
52. Confirm understanding
6   - "I heard you say..."
7   - Avoid misunderstandings
8 
93. Provide options
10   - "You can say A, B, or C"
11   - Guide the conversation
12 
134. Handle silence
14   - Prompt after pause
15   - "Are you still there?"
16 
175. Allow interruption
18   - "barge-in" capability
19   - Don't force listening

Pronunciation

Text

1Control pronunciation:
2- SSML tags
3- Phonetic spelling
4- Custom dictionaries
5 
6Example issues:
7- Names: "Nguyen" → "Win"
8- Numbers: "2023" vs "twenty twenty-three"
9- Acronyms: "API" vs "A-P-I"

Error Handling

Voice Errors

Text

1Common issues:
2- No audio detected
3- Poor quality audio
4- Background noise
5- Accent issues
6- Network problems

Solutions

Handle Voice Errors

Text

11. Request repeat
2   "I didn't catch that. Could you say it again?"
3 
42. Offer alternative
5   "I'm having trouble hearing you. 
6   Would you like to type instead?"
7 
83. Confirm critical info
9   "Just to confirm, you said [X]. Is that right?"
10 
114. Graceful degradation
12   Switch to text if voice fails

Testing Voice

Quality Check

Text

1Test for:
2- Accuracy of transcription
3- Natural TTS sound
4- Response latency
5- Different accents
6- Background noise
7- Various devices

Tools

Text

1- Test with real voice input
2- Record test scenarios
3- Use different devices
4- Test network conditions

Bài Tập

Practice

Build Voice Integration:

Create ElevenLabs account
Get OpenAI API key
Build Make workflow:
- Accept audio input
- Transcribe with Whisper
- Process with AI
- Generate speech response
Test with voice messages

Tiếp theo: Bài 6 - Voice Assistants