Engineering

Why we migrated our voice stack to sub-300ms latency

Voice agents live or die on latency. Here's how we moved from 900ms round-trip to 280ms — the infra, the trade-offs, the cost.

Zia Wasi · 9 min read

Latency is the product

In voice, the gap between “feels human” and “feels like a bot” is measured in milliseconds. Our previous stack ran at ~900ms round-trip — good enough for a demo, painful for a 20-minute conversation.

The target

Sub-300ms round-trip from speech-end → response-start. That’s roughly the threshold below which humans stop noticing the delay.

What we changed

  1. Streaming ASR — Deepgram with word-level streaming instead of utterance-level (saved ~200ms).
  2. Streaming LLM — tokens start flowing from the first decoded logit, so the first spoken syllable lands ~100ms after the LLM starts generating.
  3. Co-located TTS — moved Twilio media streams + TTS into the same region as our inference cluster (saved ~150ms of egress).
  4. Warm pool of workers — eliminated cold-start on concurrent calls.
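Conceptually, the four changes collapse into one streaming pipeline: each stage starts consuming its upstream’s output as soon as the first unit arrives, instead of waiting for the previous stage to finish. A minimal sketch in Python — the function names, responses, and sleep-based delays below are illustrative stand-ins for Deepgram, the LLM, and TTS, not our production code:

```python
import asyncio
import time

async def asr_stream(audio_words):
    # Word-level streaming ASR: yield each word as soon as it is heard.
    for word in audio_words:
        await asyncio.sleep(0.01)  # stand-in for per-word recognition delay
        yield word

async def llm_stream(prompt_words):
    # Streaming LLM: begin emitting tokens after the first input word
    # arrives, rather than waiting for the full utterance.
    response = ["Sure,", "I", "can", "help", "with", "that."]
    async for _ in prompt_words:
        break  # first word is enough to begin generating
    for token in response:
        await asyncio.sleep(0.01)  # stand-in for per-token decode time
        yield token

async def run_pipeline(audio_words):
    # Feed ASR output straight into the LLM; "speak" each token as it
    # arrives (spoken.append stands in for streaming TTS playback).
    spoken = []
    first_token_at = None
    start = time.monotonic()
    async for token in llm_stream(asr_stream(audio_words)):
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        spoken.append(token)
    return spoken, first_token_at

spoken, ttfb = asyncio.run(run_pipeline(["book", "me", "a", "table"]))
print(" ".join(spoken), f"(first token after {ttfb * 1000:.0f}ms)")
```

The point of the sketch: time-to-first-token depends only on the first recognized word plus one decode step, not on utterance length — which is where the ~200ms ASR savings and ~100ms first-syllable figure come from.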

The trade-offs

  • Streaming LLMs make correction harder — a wrong token may already be spoken aloud before we can retract it.
  • Regional colocation means higher infra cost in two regions instead of one.
  • Warm pool costs ~$280/day idle, but pays for itself after ~400 concurrent calls.
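The warm-pool trade-off is plain break-even arithmetic: keeping workers warm costs a fixed ~$280/day, and each avoided cold start saves some marginal amount. A sketch of that calculation — the per-call cold-start cost below is a hypothetical placeholder chosen to illustrate the math, not a number from our billing data:

```python
WARM_POOL_COST_PER_DAY = 280.0   # idle cost from the post
COLD_START_COST_PER_CALL = 0.70  # hypothetical: extra compute + retried calls

def break_even_calls(pool_cost_per_day, cold_cost_per_call):
    # Daily call count at which avoided cold-start cost equals
    # the idle cost of keeping workers warm.
    return pool_cost_per_day / cold_cost_per_call

print(round(break_even_calls(WARM_POOL_COST_PER_DAY, COLD_START_COST_PER_CALL)))
```

Below that volume the warm pool is pure overhead; above it, every additional call is cheaper than cold-starting a worker.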

Where we landed

p50: 280ms · p95: 410ms · p99: 620ms
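Percentile summaries like these can be reproduced from raw round-trip samples with the standard library alone. The sample list here is illustrative, not our production data:

```python
import statistics

# Simulated round-trip latencies in milliseconds.
samples_ms = [250, 260, 270, 280, 290, 300, 410, 410, 620, 620]

def percentile(data, p):
    # p-th percentile via the stdlib's "inclusive" quantile method,
    # which interpolates between observed samples.
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    return cuts[p - 1]

p50 = percentile(samples_ms, 50)
print(p50)
```

When reporting voice latency, the tail percentiles matter more than the median — one 620ms pause in a 20-minute call is what the user remembers.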

The p99 is where we still have work — mostly tail latency on cold TTS voices. Next quarter we’re caching prosody for the top 200 agent responses.
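The planned prosody cache amounts to a bounded memo over the hottest responses, warmed at deploy time so the hot path never hits a cold voice. A sketch — `synthesize_prosody` is a hypothetical stand-in for the real (slow) TTS front-end, and the response strings are made up:

```python
from functools import lru_cache

def synthesize_prosody(text: str) -> dict:
    # Placeholder: in production this would run the slow prosody model.
    return {"text": text, "duration_ms": 40 * len(text.split())}

@lru_cache(maxsize=200)  # top ~200 responses, per the plan above
def cached_prosody(text: str) -> dict:
    return synthesize_prosody(text)

# Warm the cache at deploy time with the most common agent responses.
TOP_RESPONSES = ["How can I help you today?", "One moment, please."]
for response in TOP_RESPONSES:
    cached_prosody(response)

print(cached_prosody.cache_info().currsize)
```

With an LRU bound of 200, rare responses still pay the synthesis cost, but the head of the distribution — which dominates p99 in practice — is served from memory.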

Written by Zia Wasi

Part of the Depra team, building the future of AI-powered business automation.