Theory
35 minutes
Lesson 5/15

Voice AI Basics

Voice AI foundations: speech-to-text and text-to-speech

Learn how to integrate voice capabilities into AI agents.

Voice AI Components

The Voice Pipeline

┌────────────┐    ┌─────────────┐    ┌────────────┐
│    User    │───▶│ Speech-to-  │───▶│   Agent    │
│   Speech   │    │    Text     │    │ (Process)  │
└────────────┘    └─────────────┘    └─────┬──────┘
                                           │
┌────────────┐    ┌─────────────┐          │
│   Audio    │◀───│  Text-to-   │◀─────────┘
│   Output   │    │   Speech    │
└────────────┘    └─────────────┘
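The pipeline in the diagram can be sketched as three pluggable stages chained together. This is a minimal sketch, not a provider integration: `VoicePipeline`, `fake_stt`, `fake_agent`, and `fake_tts` are hypothetical names introduced here, and the fake stages only exist so the example runs without any account.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three stages from the diagram: STT -> agent -> TTS."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    respond: Callable[[str], str]        # agent stage (text in, text out)
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def handle(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        reply = self.respond(text)
        return self.synthesize(reply)


# Stand-in stages (assumptions for illustration only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_agent, fake_tts)
```

Swapping a provider then means replacing one callable, for example passing a function that calls a real STT service as `transcribe`, while the other two stages stay unchanged.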

Key Technologies

1. Speech-to-Text (STT)
   - Convert voice → text
   - Also: ASR (Automatic Speech Recognition)
   - Providers: Whisper, Google, Azure

2. Text-to-Speech (TTS)
   - Convert text → voice
   - Natural-sounding voices
   - Providers: ElevenLabs, Google, Azure

3. Voice Agents
   - Combine STT + Agent + TTS
   - Phone systems
   - Voice assistants

Speech-to-Text Options

Whisper (OpenAI)

Pros:
✅ Highly accurate
✅ Multi-language
✅ Free when run locally (open-source model) / low-cost via the API
✅ Handles accents well

Cons:
❌ Batch only (no real-time streaming)
❌ Requires a complete audio file as input

Use case:
- Transcription
- Voice messages
- Recorded content

Google Speech-to-Text

Pros:
✅ Streaming support
✅ Real-time transcription
✅ Many languages
✅ Good accuracy

Cons:
❌ Pay per minute
❌ Complex setup

Use case:
- Live transcription
- Phone systems
- Real-time agents

Azure Speech

Pros:
✅ Enterprise features
✅ Custom models
✅ Streaming
✅ Microsoft integration

Use case:
- Enterprise deployments
- Custom vocabulary
- High security needs

Text-to-Speech Options

ElevenLabs

Pros:
✅ Ultra-realistic voices
✅ Voice cloning
✅ Emotional range
✅ Easy to use

Cons:
❌ Higher cost
❌ Character limits

Best for:
- High-quality experience
- Brand voices
- Content creation

Google Text-to-Speech

Pros:
✅ Many voices/languages
✅ SSML support
✅ Neural voices available
✅ Good free tier

Best for:
- Multi-language
- Cost-effective
- Simple integration

OpenAI TTS

Pros:
✅ Simple API
✅ Good quality
✅ Fast generation
✅ Multiple voices

Best for:
- Quick implementation
- OpenAI ecosystem
- General use cases

Implementing Voice in No-Code

Voiceflow Voice

Voiceflow supports voice natively:

1. Create project
2. Select "Voice Assistant"
3. Build flows (same as chat)
4. Voiceflow handles STT/TTS
5. Deploy to a voice channel (e.g. Alexa, Google Assistant; check which channels Voiceflow currently supports)

Make + Voice APIs

Workflow:
1. Receive audio file
2. Send to Whisper API → Get text
3. Process with AI
4. Send text to ElevenLabs → Get audio
5. Return audio

Example: Voice Message Bot

Telegram Bot (Make):

1. Trigger: Voice message received
2. HTTP: Send audio to Whisper
3. Get transcript text
4. HTTP: Send to OpenAI (process)
5. HTTP: Send response to ElevenLabs
6. Action: Send voice reply

ElevenLabs Setup

Getting Started

1. Sign up at elevenlabs.io
2. Get API key
3. Choose voices:
   - Pre-made voices
   - Clone your voice
   - Generate custom

Voice Selection

Categories:
- Narrative (storytelling)
- Conversational (casual)
- Professional (business)
- Characters (unique styles)

Choose based on:
- Brand personality
- Use case
- Audience

API Usage

Endpoint: POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}
Example voice_id: 21m00Tcm4TlvDq8ikWAM

Header:
xi-api-key: <your API key>

Request body (JSON):
{
  "text": "Hello, how can I help you?",
  "model_id": "eleven_monolingual_v1"
}

Response: Audio file (mp3)
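A request like the one above can be assembled with Python's standard library. This is a minimal sketch under stated assumptions: the voice ID is placed in the URL path and the key is sent in an `xi-api-key` header, and `build_tts_request` is a hypothetical helper name. The request is only built here, not sent, so no API key is needed to run it.

```python
import json
import urllib.request

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(text, voice_id, api_key, model_id="eleven_monolingual_v1"):
    """Build (but do not send) a text-to-speech request."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ELEVENLABS_TTS_URL}/{voice_id}",  # voice ID goes in the path
        data=body,
        method="POST",
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
    )


request = build_tts_request(
    "Hello, how can I help you?", "21m00Tcm4TlvDq8ikWAM", "YOUR_API_KEY"
)
# To actually send it: audio_bytes = urllib.request.urlopen(request).read()
```

In Make, the same pieces map onto an HTTP module: the URL, the two headers, and a raw JSON body, with the response handled as binary data.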

Whisper API Setup

OpenAI Whisper

Endpoint: POST https://api.openai.com/v1/audio/transcriptions

Header:
Authorization: Bearer <your API key>

Request (multipart/form-data):
- model: whisper-1
- file: the audio file
- language: optional (ISO 639-1 code, e.g. "en")

Response:
{
  "text": "Transcribed text here"
}
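The transcription request is a multipart form upload rather than JSON, which is the part that most often trips people up. Below is a minimal standard-library sketch that builds (but does not send) such a request. `build_transcription_request` is a hypothetical helper, and the generic `application/octet-stream` content type for the file part is an assumption.

```python
import uuid
import urllib.request

WHISPER_URL = "https://api.openai.com/v1/audio/transcriptions"


def build_transcription_request(audio_bytes, filename, api_key, language=None):
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    fields = {"model": "whisper-1"}
    if language:
        fields["language"] = language  # optional language hint, e.g. "vi"

    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The audio itself goes in a part named "file".
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        url=WHISPER_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_transcription_request(
    b"fake-audio", "message.mp3", "YOUR_API_KEY", language="vi"
)
# To actually send it and read the transcript:
#   import json; json.load(urllib.request.urlopen(request))["text"]
```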

Supported Formats

Audio formats:
- mp3
- mp4
- mpeg
- m4a
- wav
- webm

Maximum file size: 25 MB
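Checking format and size before uploading avoids a wasted API call. The sketch below encodes the list above as a simple pre-flight check; `check_audio` is a hypothetical helper, and treating 25 MB as 25 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".m4a", ".wav", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB, assumed to mean 25 * 1024 * 1024 bytes


def check_audio(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 25 MB")
    return problems
```

Longer recordings that exceed the limit would need to be compressed or split into chunks before transcription.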

Building a Voice Bot (Make)

Step 1: Audio Input

Sources:
- Telegram voice message
- Uploaded file
- Phone call (Twilio)
- Web recording

Step 2: Transcription

HTTP Module:
- URL: OpenAI Whisper endpoint
- Method: POST
- Body: Form data with the audio file and the model name
- Parse response: Get text

Step 3: Process Text

Send to:
- OpenAI for response
- Your AI agent
- Custom logic

Get text reply

Step 4: Generate Audio

HTTP Module:
- URL: ElevenLabs endpoint
- Method: POST
- Body: JSON with text
- Response: Binary (audio)

Step 5: Send Response

Options:
- Send audio file
- Play audio
- Stream to user

Voice UX Best Practices

Design for Voice

Voice UX Tips
1. Keep responses short
   - 1-3 sentences max
   - Easy to listen to

2. Confirm understanding
   - "I heard you say..."
   - Avoid misunderstandings

3. Provide options
   - "You can say A, B, or C"
   - Guide the conversation

4. Handle silence
   - Prompt after a pause
   - "Are you still there?"

5. Allow interruption
   - "Barge-in" capability
   - Don't force the user to listen
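The first tip can be enforced mechanically before text reaches the TTS step. This is a rough sketch, assuming a naive sentence splitter is good enough; `shorten_for_voice` is a hypothetical helper, and prompting the model to answer briefly is usually the better first line of defense.

```python
import re


def shorten_for_voice(text: str, max_sentences: int = 3) -> str:
    """Keep only the first few sentences of a reply before sending it to TTS."""
    # Split after ., ! or ? followed by whitespace. This naive rule
    # mis-splits abbreviations such as "e.g. this".
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])
```

For example, a five-sentence agent reply is cut to its first three sentences, while a one-sentence reply passes through unchanged.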

Pronunciation

Control pronunciation:
- SSML tags
- Phonetic spelling
- Custom dictionaries

Example issues:
- Names: "Nguyen" (often approximated in English as "win")
- Numbers: "2023" read as a quantity vs. as the year "twenty twenty-three"
- Acronyms: "API" read as a word vs. spelled out "A-P-I"
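For engines that accept SSML, a small substitution table can act as the custom dictionary. The sketch below wraps text in `<speak>` and replaces listed words with SSML `<sub alias="...">` elements. `to_ssml` and the alias spellings are assumptions for illustration, and SSML support differs between providers, so check which tags your engine honors.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text: str, aliases: dict[str, str]) -> str:
    """Wrap text in <speak>, replacing whole words found in `aliases`."""
    if not aliases:
        return f"<speak>{escape(text)}</speak>"
    # Longest words first so overlapping entries prefer the longer match.
    words = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[pos:match.start()]))  # untouched text, XML-escaped
        spoken = aliases[match.group()]
        out.append(f"<sub alias={quoteattr(spoken)}>{escape(match.group())}</sub>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "<speak>" + "".join(out) + "</speak>"
```

Calling `to_ssml("Ask Nguyen about the API", {"Nguyen": "win", "API": "A P I"})` keeps the written form in the markup while telling the engine what to say aloud.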

Error Handling

Voice Errors

Common issues:
- No audio detected
- Poor quality audio
- Background noise
- Accent issues
- Network problems

Solutions

Handle Voice Errors
1. Request repeat
   "I didn't catch that. Could you say it again?"

2. Offer alternative
   "I'm having trouble hearing you.
   Would you like to type instead?"

3. Confirm critical info
   "Just to confirm, you said [X]. Is that right?"

4. Graceful degradation
   Switch to text if voice fails
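These recovery steps can be combined into one small decision function. This is a sketch only: `next_prompt` is a hypothetical helper, the limit of two retries before offering text input is an assumption, and a real bot would likely confirm only critical information rather than every utterance.

```python
def next_prompt(transcript: str, failed_attempts: int, max_retries: int = 2) -> str:
    """Pick the recovery message for the current turn."""
    heard = transcript.strip()
    if heard:
        # Something was recognized: confirm it before acting on it.
        return f"Just to confirm, you said {heard}. Is that right?"
    if failed_attempts < max_retries:
        # Nothing usable yet and retries remain: ask the user to repeat.
        return "I didn't catch that. Could you say it again?"
    # Repeated failures: degrade gracefully to text input.
    return "I'm having trouble hearing you. Would you like to type instead?"
```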

Testing Voice

Quality Check

Test for:
- Accuracy of transcription
- Natural TTS sound
- Response latency
- Different accents
- Background noise
- Various devices
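Transcription accuracy is easier to track with a number. One common metric, not named in this lesson, is word error rate (WER): the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. The sketch below is a plain implementation with simple lowercasing and whitespace tokenization; `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far (previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Running the same recorded test scenarios through each STT option and comparing WER across accents, noise levels, and devices gives a like-for-like comparison.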

Tools

- Test with real voice input
- Record test scenarios
- Use different devices
- Test network conditions

Exercise

Practice

Build Voice Integration:

  1. Create ElevenLabs account
  2. Get OpenAI API key
  3. Build Make workflow:
    • Accept audio input
    • Transcribe with Whisper
    • Process with AI
    • Generate speech response
  4. Test with voice messages

Next: Lesson 6 - Voice Assistants