Engineering

Why we migrated our voice stack to sub-300ms latency

Voice agents live or die on latency. Here's how we moved from 900ms round-trip to 280ms — the infra, the trade-offs, the cost.

Zia Wasi · 9 min read

Latency is the product

In voice, the gap between “feels human” and “feels like a bot” is measured in milliseconds. Our previous stack ran at ~900ms round-trip — good enough for a demo, painful for a 20-minute conversation.

The target

Sub-300ms round-trip from speech-end → response-start. That’s roughly the threshold below which humans stop noticing the delay.

What we changed

  1. Streaming ASR — Deepgram with word-level streaming instead of utterance-level (saved ~200ms).
  2. Streaming LLM — tokens start flowing from the first decoded logit, so the first spoken syllable lands ~100ms after the LLM starts generating.
  3. Co-located TTS — moved Twilio media streams + TTS into the same region as our inference cluster (saved ~150ms of egress).
  4. Warm pool of workers — eliminated cold-start on concurrent calls.
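Conceptually, the four changes collapse into one streaming pipeline: each stage starts consuming its upstream’s output as soon as the first unit arrives, instead of waiting for the previous stage to finish. A minimal sketch in Python — the function names, responses, and sleep-based delays below are illustrative stand-ins for Deepgram, the LLM, and TTS, not our production code:

```python
import asyncio
import time

async def asr_stream(audio_words):
    # Word-level streaming ASR: yield each word as soon as it is heard.
    for word in audio_words:
        await asyncio.sleep(0.01)  # stand-in for per-word recognition delay
        yield word

async def llm_stream(prompt_words):
    # Streaming LLM: begin emitting tokens after the first input word
    # arrives, rather than waiting for the full utterance.
    response = ["Sure,", "I", "can", "help", "with", "that."]
    async for _ in prompt_words:
        break  # first word is enough to begin generating
    for token in response:
        await asyncio.sleep(0.01)  # stand-in for per-token decode time
        yield token

async def run_pipeline(audio_words):
    # Feed ASR output straight into the LLM; "speak" each token as it
    # arrives (spoken.append stands in for streaming TTS playback).
    spoken = []
    first_token_at = None
    start = time.monotonic()
    async for token in llm_stream(asr_stream(audio_words)):
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        spoken.append(token)
    return spoken, first_token_at

spoken, ttfb = asyncio.run(run_pipeline(["book", "me", "a", "table"]))
print(" ".join(spoken), f"(first token after {ttfb * 1000:.0f}ms)")
```

The point of the sketch: time-to-first-token depends only on the first recognized word plus one decode step, not on utterance length — which is where the ~200ms ASR savings and ~100ms first-syllable figure come from.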

The trade-offs

  • Streaming LLMs make correction harder — a wrong token may already be spoken aloud before we can retract it.
  • Regional colocation means higher infra cost in two regions instead of one.
  • Warm pool costs ~$280/day idle, but pays for itself after ~400 concurrent calls.
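The warm-pool trade-off is plain break-even arithmetic: keeping workers warm costs a fixed ~$280/day, and each avoided cold start saves some marginal amount. A sketch of that calculation — the per-call cold-start cost below is a hypothetical placeholder chosen to illustrate the math, not a number from our billing data:

```python
WARM_POOL_COST_PER_DAY = 280.0   # idle cost from the post
COLD_START_COST_PER_CALL = 0.70  # hypothetical: extra compute + retried calls

def break_even_calls(pool_cost_per_day, cold_cost_per_call):
    # Daily call count at which avoided cold-start cost equals
    # the idle cost of keeping workers warm.
    return pool_cost_per_day / cold_cost_per_call

print(round(break_even_calls(WARM_POOL_COST_PER_DAY, COLD_START_COST_PER_CALL)))
```

Below that volume the warm pool is pure overhead; above it, every additional call is cheaper than cold-starting a worker.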

Where we landed

p50: 280ms · p95: 410ms · p99: 620ms
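Percentile summaries like these can be reproduced from raw round-trip samples with the standard library alone. The sample list here is illustrative, not our production data:

```python
import statistics

# Simulated round-trip latencies in milliseconds.
samples_ms = [250, 260, 270, 280, 290, 300, 410, 410, 620, 620]

def percentile(data, p):
    # p-th percentile via the stdlib's "inclusive" quantile method,
    # which interpolates between observed samples.
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    return cuts[p - 1]

p50 = percentile(samples_ms, 50)
print(p50)
```

When reporting voice latency, the tail percentiles matter more than the median — one 620ms pause in a 20-minute call is what the user remembers.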

The p99 is where we still have work — mostly tail latency on cold TTS voices. Next quarter we’re caching prosody for the top 200 agent responses.
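The planned prosody cache amounts to a bounded memo over the hottest responses, warmed at deploy time so the hot path never hits a cold voice. A sketch — `synthesize_prosody` is a hypothetical stand-in for the real (slow) TTS front-end, and the response strings are made up:

```python
from functools import lru_cache

def synthesize_prosody(text: str) -> dict:
    # Placeholder: in production this would run the slow prosody model.
    return {"text": text, "duration_ms": 40 * len(text.split())}

@lru_cache(maxsize=200)  # top ~200 responses, per the plan above
def cached_prosody(text: str) -> dict:
    return synthesize_prosody(text)

# Warm the cache at deploy time with the most common agent responses.
TOP_RESPONSES = ["How can I help you today?", "One moment, please."]
for response in TOP_RESPONSES:
    cached_prosody(response)

print(cached_prosody.cache_info().currsize)
```

With an LRU bound of 200, rare responses still pay the synthesis cost, but the head of the distribution — which dominates p99 in practice — is served from memory.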

Written by Zia Wasi

Part of the Depra team, building the future of AI-powered business automation.