
Building Sub-2s AI Voice Pipelines with Flask

March 2026 · 5 min read · Kavya Mittal

When I joined Salescode.ai, one of my first challenges was building a real-time voice calling agent that could hold natural conversations in multiple languages. The target? Sub-2-second end-to-end latency. Here's how we achieved it.

The Challenge

Voice AI pipelines involve multiple sequential steps: capturing audio, converting speech to text (STT), processing the intent, generating a response, and converting it back to speech (TTS). Each step adds latency, and users expect near-instant responses in a conversation.

Our requirements were demanding:

  • Support Hindi and Mexican Spanish with 90%+ accuracy
  • End-to-end latency under 2 seconds
  • Handle concurrent calls without degradation
  • Graceful fallback when AI confidence is low

Architecture Decisions

We chose Flask for our microservices because of its lightweight nature and Python's rich ecosystem for AI/ML integrations. The key architectural decision was using an async pipeline with streaming wherever possible.

from flask import Flask, request, session, send_file

app = Flask(__name__)

# stt_service, ai_engine, and tts_service are our in-house wrappers
# around the external STT, LLM, and TTS providers
@app.route('/api/v1/voice/process', methods=['POST'])
def process_voice():
    audio_stream = request.files['audio']

    # Stream STT - don't wait for the full recording before transcribing
    transcript = stt_service.stream_transcribe(
        audio_stream,
        language=request.args.get('lang', 'en')
    )

    # Start generating a response while transcription is still in flight
    response = ai_engine.generate_response(
        partial_transcript=transcript,
        context=session.get('context', {})
    )

    # Synthesize the reply and send it back to the client
    audio_response = tts_service.synthesize(
        text=response.text,
        voice=response.voice_profile
    )

    return send_file(audio_response, mimetype='audio/wav')
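If the TTS client can yield audio as it is synthesized, Flask can stream those chunks straight to the caller instead of buffering a complete WAV file first. A minimal sketch, assuming a hypothetical stream_synthesize method on the same tts_service used above:

from flask import Response

def stream_tts_response(text, voice_profile):
    # stream_synthesize is a hypothetical chunked API on the TTS wrapper;
    # it yields raw audio bytes as they are produced
    def generate():
        for chunk in tts_service.stream_synthesize(text=text, voice=voice_profile):
            yield chunk

    # Flask forwards each chunk to the client as soon as the generator yields it
    return Response(generate(), mimetype='audio/wav')

The route above would then end with return stream_tts_response(response.text, response.voice_profile) instead of send_file.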

Optimization Strategies

The biggest wins came from three key optimizations:

1. Streaming STT

Instead of waiting for the complete audio to finish before starting transcription, we streamed audio chunks to the STT service. This alone saved 300-500ms on average.
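Conceptually, the handler feeds the STT service small audio chunks as they arrive rather than one complete recording. A rough sketch of that loop, with start_stream, send_chunk, partial_result, and final_result as hypothetical stand-ins for whatever streaming API your STT provider exposes:

CHUNK_SIZE = 4096  # ~128 ms of 16 kHz, 16-bit mono audio

def stream_transcribe(audio_stream, language='en'):
    # Open a streaming transcription session and push audio incrementally
    stream = stt_client.start_stream(language=language)
    while True:
        chunk = audio_stream.read(CHUNK_SIZE)
        if not chunk:
            break
        stream.send_chunk(chunk)           # transcription begins immediately
        partial = stream.partial_result()  # interim text, available mid-utterance
        if partial:
            yield partial
    yield stream.final_result()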

2. Response Pre-computation

We started generating AI responses as soon as we had 80% confidence in the transcription, rather than waiting for the final result. If the transcription changed significantly, we'd regenerate — but in 90% of cases, the early prediction was accurate enough.
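The gating logic is simple in outline. A sketch, assuming the STT stream yields (text, confidence, is_final) tuples and reusing the generate_response call from the handler above; the 0.85 similarity cutoff is illustrative:

import difflib

CONFIDENCE_THRESHOLD = 0.8   # start generating once transcription confidence hits 80%
SIMILARITY_THRESHOLD = 0.85  # illustrative: how close the final transcript must stay

def respond_with_early_start(partials, context):
    early_response, early_text = None, None
    for text, confidence, is_final in partials:
        if early_response is None and confidence >= CONFIDENCE_THRESHOLD:
            # Kick off response generation before the transcript is final
            early_response = ai_engine.generate_response(
                partial_transcript=text, context=context)
            early_text = text
        if is_final:
            if early_response is not None:
                ratio = difflib.SequenceMatcher(None, early_text, text).ratio()
                if ratio >= SIMILARITY_THRESHOLD:
                    return early_response  # early prediction held up
            # Transcript changed significantly (or never hit the threshold): regenerate
            return ai_engine.generate_response(partial_transcript=text, context=context)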

3. Connection Pooling

External API calls (AWS Transcribe, ElevenLabs TTS) were the biggest bottleneck. We implemented connection pooling and keep-alive connections, reducing the overhead of each API call from ~200ms to ~50ms.

# Connection pool for external services
from urllib3 import PoolManager, Timeout

pool = PoolManager(
    num_pools=10,
    maxsize=20,
    retries=3,
    timeout=Timeout(connect=1.0, read=3.0)
)

# Reuse connections (and their TLS handshakes) across requests
tts_client = TTSClient(http_pool=pool)
stt_client = STTClient(http_pool=pool)

Results

After these optimizations, our pipeline consistently delivered:

  • Average latency: 1.4 seconds (down from 3.2s)
  • P95 latency: 1.8 seconds
  • Accuracy: 90-95% across Hindi and Mexican Spanish
  • Concurrent calls: 50+ without degradation

Key Takeaways

Building real-time AI systems taught me that the biggest performance gains come from architectural decisions, not micro-optimizations. Streaming, early prediction, and connection reuse were far more impactful than tweaking individual function performance.

If you're building similar systems, start by profiling your entire pipeline end-to-end. The bottleneck is almost never where you think it is.
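As a starting point, a context manager that times each stage is usually enough to show where the latency budget is going. A sketch, reusing the service objects from the earlier examples:

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    # Record wall-clock time per pipeline stage
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed('stt'):
    transcript = stt_service.stream_transcribe(audio_stream, language='hi')
with timed('llm'):
    response = ai_engine.generate_response(partial_transcript=transcript, context={})
with timed('tts'):
    audio = tts_service.synthesize(text=response.text, voice=response.voice_profile)

print({stage: f"{seconds * 1000:.0f} ms" for stage, seconds in timings.items()})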

Kavya Mittal, Backend & AI Engineer at Salescode.ai