Building Sub-2s AI Voice Pipelines with Flask
When I joined Salescode.ai, one of my first challenges was building a real-time voice calling agent that could hold natural conversations in multiple languages. The target? Sub-2-second end-to-end latency. Here's how we achieved it.
The Challenge
Voice AI pipelines involve multiple sequential steps: capturing audio, converting speech to text (STT), processing the intent, generating a response, and converting it back to speech (TTS). Each step adds latency, and users expect near-instant responses in a conversation.
Our requirements were demanding:
- Support Hindi and Mexican Spanish with 90%+ accuracy
- End-to-end latency under 2 seconds
- Handle concurrent calls without degradation
- Graceful fallback when AI confidence is low (a sketch of what we mean follows this list)
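To give a feel for that last requirement, here's a minimal sketch of a confidence gate. The threshold, the `response.confidence` field, and the `scripted_reply` helper are all illustrative stand-ins, not our production code:

```python
FALLBACK_THRESHOLD = 0.6  # illustrative cutoff; tune per language and model

def choose_reply(response):
    """Fall back to a safe scripted reply when model confidence is low."""
    # response.confidence and scripted_reply() are hypothetical stand-ins
    # for whatever confidence signal and canned-response lookup you have.
    if response.confidence < FALLBACK_THRESHOLD:
        return scripted_reply(response.intent)
    return response.text
```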
Architecture Decisions
We chose Flask for our microservices because of its lightweight nature and Python's rich ecosystem for AI/ML integrations. The key architectural decision was using an async pipeline with streaming wherever possible; the simplified route below shows the shape of that pipeline:
```python
from flask import Flask, request, send_file, session

app = Flask(__name__)
app.secret_key = 'replace-me'  # required for session use

# stt_service, ai_engine, and tts_service are module-level
# service clients (wiring omitted for brevity)

@app.route('/api/v1/voice/process', methods=['POST'])
def process_voice():
    audio_stream = request.files['audio']

    # Stream STT: don't wait for the full transcription
    transcript = stt_service.stream_transcribe(
        audio_stream,
        language=request.args.get('lang', 'en')
    )

    # Generate a response while transcription is still in flight
    response = ai_engine.generate_response(
        partial_transcript=transcript,
        context=session.get('context', {})
    )

    # Stream the TTS audio back to the client
    audio_response = tts_service.synthesize(
        text=response.text,
        voice=response.voice_profile
    )
    return send_file(audio_response, mimetype='audio/wav')
```
Optimization Strategies
The biggest wins came from three key optimizations:
1. Streaming STT
Instead of waiting for the complete audio to finish before starting transcription, we streamed audio chunks to the STT service. This alone saved 300-500ms on average.
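In code, streaming amounts to handing the STT client an iterator of chunks instead of a finished recording. Here's a minimal sketch; it assumes a `stream_transcribe`-style interface that accepts a chunk generator (as in the route above), and the chunk size is illustrative:

```python
CHUNK_SIZE = 4096  # bytes per chunk; tune to your codec and frame size

def audio_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield fixed-size chunks from a file-like audio stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Transcription begins as soon as the first chunk arrives,
# instead of after the whole utterance has been buffered.
transcript = stt_service.stream_transcribe(
    audio_chunks(audio_stream),
    language='hi',
)
```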
2. Response Pre-computation
We started generating AI responses as soon as we had 80% confidence in the transcription, rather than waiting for the final result. If the transcription changed significantly, we'd regenerate — but in 90% of cases, the early prediction was accurate enough.
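Here's a rough sketch of that generate-early-then-reconcile logic. The thresholds and the `difflib` similarity check are illustrative choices, not the exact heuristic we shipped:

```python
import difflib

CONFIDENCE_THRESHOLD = 0.8  # start generating at this STT confidence
SIMILARITY_FLOOR = 0.85     # keep the early answer if transcripts are this close

def maybe_generate_early(partial, confidence, context):
    """Kick off response generation on a confident partial transcript."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ai_engine.generate_response(partial_transcript=partial, context=context)
    return None

def reconcile(early_response, partial, final, context):
    """Keep the early response unless the final transcript diverged."""
    if early_response is not None:
        similarity = difflib.SequenceMatcher(None, partial, final).ratio()
        if similarity >= SIMILARITY_FLOOR:
            return early_response  # early prediction held up
    # Transcript changed significantly: regenerate against the final text
    return ai_engine.generate_response(partial_transcript=final, context=context)
```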
3. Connection Pooling
External API calls (AWS Transcribe, ElevenLabs TTS) were the biggest bottleneck. We implemented connection pooling and keep-alive connections, reducing per-call overhead from ~200ms to ~50ms.
```python
# Connection pool shared by the external-service clients
from urllib3 import PoolManager, Timeout

pool = PoolManager(
    num_pools=10,
    maxsize=20,
    retries=3,
    timeout=Timeout(connect=1.0, read=3.0)
)

# Reuse pooled keep-alive connections across requests;
# TTSClient and STTClient are our thin wrappers around the vendor SDKs
tts_client = TTSClient(http_pool=pool)
stt_client = STTClient(http_pool=pool)
```
Results
After these optimizations, our pipeline consistently delivered:
- Average latency: 1.4 seconds (down from 3.2s)
- P95 latency: 1.8 seconds
- Accuracy: 90-95% across Hindi and Mexican Spanish
- Concurrent calls: 50+ without degradation
Key Takeaways
Building real-time AI systems taught me that the biggest performance gains come from architectural decisions, not micro-optimizations. Streaming, early prediction, and connection reuse were far more impactful than tuning any individual function.
If you're building similar systems, start by profiling your entire pipeline end-to-end. The bottleneck is almost never where you think it is.
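As a starting point, per-stage wall-clock timing gets you most of the way. A self-contained sketch, with `time.sleep` standing in for the real STT, LLM, and TTS calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record wall-clock milliseconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

timings = {}
with timed('stt', timings):
    time.sleep(0.05)  # stand-in for stt_service.stream_transcribe(...)
with timed('llm', timings):
    time.sleep(0.03)  # stand-in for ai_engine.generate_response(...)
with timed('tts', timings):
    time.sleep(0.02)  # stand-in for tts_service.synthesize(...)
print(timings)  # per-request breakdown, e.g. {'stt': 50.3, 'llm': 30.1, 'tts': 20.4}
```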