Building Sub-2s AI Voice Pipelines with Flask
When I joined Salescode.ai, one of my first challenges was building a real-time voice calling agent that could hold natural conversations in multiple languages. The target? Sub-2-second end-to-end latency. Here's how we achieved it.
The Challenge
Voice AI pipelines involve multiple sequential steps: capturing audio, converting speech to text (STT), processing the intent, generating a response, and converting it back to speech (TTS). Each step adds latency, and users expect near-instant responses in a conversation.
Our requirements were demanding:
- Support Hindi and Mexican Spanish with 90%+ accuracy
- End-to-end latency under 2 seconds
- Handle concurrent calls without degradation
- Graceful fallback when AI confidence is low (a sketch of what we mean follows this list)
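To give a feel for that last requirement, here's a minimal sketch of a confidence gate. The threshold, the `response.confidence` field, and the `scripted_reply` helper are all illustrative stand-ins, not our production code:

```python
FALLBACK_THRESHOLD = 0.6  # illustrative cutoff; tune per language and model

def choose_reply(response):
    """Fall back to a safe scripted reply when model confidence is low."""
    # response.confidence and scripted_reply() are hypothetical stand-ins
    # for whatever confidence signal and canned-response lookup you have.
    if response.confidence < FALLBACK_THRESHOLD:
        return scripted_reply(response.intent)
    return response.text
```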
Architecture Decisions
We chose Flask for our microservices because of its lightweight nature and Python's rich ecosystem for AI/ML integrations. The key architectural decision was using an async pipeline with streaming wherever possible; the simplified route below shows the shape of that pipeline:
```python
from flask import Flask, request, send_file, session

app = Flask(__name__)
app.secret_key = 'replace-me'  # required for session use

# stt_service, ai_engine, and tts_service are module-level
# service clients (wiring omitted for brevity)

@app.route('/api/v1/voice/process', methods=['POST'])
def process_voice():
    audio_stream = request.files['audio']

    # Stream STT: don't wait for the full transcription
    transcript = stt_service.stream_transcribe(
        audio_stream,
        language=request.args.get('lang', 'en')
    )

    # Generate a response while transcription is still in flight
    response = ai_engine.generate_response(
        partial_transcript=transcript,
        context=session.get('context', {})
    )

    # Stream the TTS audio back to the client
    audio_response = tts_service.synthesize(
        text=response.text,
        voice=response.voice_profile
    )
    return send_file(audio_response, mimetype='audio/wav')
```
Optimization Strategies
The biggest wins came from three key optimizations:
1. Streaming STT
Instead of waiting for the complete audio to finish before starting transcription, we streamed audio chunks to the STT service. This alone saved 300-500ms on average.
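In code, streaming amounts to handing the STT client an iterator of chunks instead of a finished recording. Here's a minimal sketch; it assumes a `stream_transcribe`-style interface that accepts a chunk generator (as in the route above), and the chunk size is illustrative:

```python
CHUNK_SIZE = 4096  # bytes per chunk; tune to your codec and frame size

def audio_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield fixed-size chunks from a file-like audio stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Transcription begins as soon as the first chunk arrives,
# instead of after the whole utterance has been buffered.
transcript = stt_service.stream_transcribe(
    audio_chunks(audio_stream),
    language='hi',
)
```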
2. Response Pre-computation
We started generating AI responses as soon as we had 80% confidence in the transcription, rather than waiting for the final result. If the transcription changed significantly, we'd regenerate — but in 90% of cases, the early prediction was accurate enough.
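Here's a rough sketch of that generate-early-then-reconcile logic. The thresholds and the `difflib` similarity check are illustrative choices, not the exact heuristic we shipped:

```python
import difflib

CONFIDENCE_THRESHOLD = 0.8  # start generating at this STT confidence
SIMILARITY_FLOOR = 0.85     # keep the early answer if transcripts are this close

def maybe_generate_early(partial, confidence, context):
    """Kick off response generation on a confident partial transcript."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ai_engine.generate_response(partial_transcript=partial, context=context)
    return None

def reconcile(early_response, partial, final, context):
    """Keep the early response unless the final transcript diverged."""
    if early_response is not None:
        similarity = difflib.SequenceMatcher(None, partial, final).ratio()
        if similarity >= SIMILARITY_FLOOR:
            return early_response  # early prediction held up
    # Transcript changed significantly: regenerate against the final text
    return ai_engine.generate_response(partial_transcript=final, context=context)
```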
3. Connection Pooling
External API calls (AWS Transcribe, ElevenLabs TTS) were the biggest bottleneck. We implemented connection pooling and keep-alive connections, reducing per-call overhead from ~200ms to ~50ms.
```python
# Connection pool shared by the external-service clients
from urllib3 import PoolManager, Timeout

pool = PoolManager(
    num_pools=10,
    maxsize=20,
    retries=3,
    timeout=Timeout(connect=1.0, read=3.0)
)

# Reuse pooled keep-alive connections across requests;
# TTSClient and STTClient are our thin wrappers around the vendor SDKs
tts_client = TTSClient(http_pool=pool)
stt_client = STTClient(http_pool=pool)
```
Results
After these optimizations, our pipeline consistently delivered:
- Average latency: 1.4 seconds (down from 3.2s)
- P95 latency: 1.8 seconds
- Accuracy: 90-95% across Hindi and Mexican Spanish
- Concurrent calls: 50+ without degradation
Key Takeaways
Building real-time AI systems taught me that the biggest performance gains come from architectural decisions, not micro-optimizations. Streaming, early prediction, and connection reuse were far more impactful than tuning any individual function.
If you're building similar systems, start by profiling your entire pipeline end-to-end. The bottleneck is almost never where you think it is.
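As a starting point, per-stage wall-clock timing gets you most of the way. A self-contained sketch, with `time.sleep` standing in for the real STT, LLM, and TTS calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record wall-clock milliseconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

timings = {}
with timed('stt', timings):
    time.sleep(0.05)  # stand-in for stt_service.stream_transcribe(...)
with timed('llm', timings):
    time.sleep(0.03)  # stand-in for ai_engine.generate_response(...)
with timed('tts', timings):
    time.sleep(0.02)  # stand-in for tts_service.synthesize(...)
print(timings)  # per-request breakdown, e.g. {'stt': 50.3, 'llm': 30.1, 'tts': 20.4}
```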