LiveKit · WebRTC · Real-Time Systems

How Real-Time Voice Agents Work: Architecture and Latency

A breakdown of how real-time voice agents work under the hood: the two-layer architecture, VAD, STT, LLM, TTS pipeline, and where latency comes from at each stage.

8 min read

Real-time voice agents differ fundamentally from traditional request-response AI systems. Unlike text interfaces, voice interaction requires continuous media streaming, low-latency inference, and bidirectional audio transport.

At a conceptual level, a voice agent is simple: a user speaks, the system processes the input, and a response is returned. In practice, implementing this interaction in real time requires coordinated handling of media transport, speech recognition, language modeling, and speech synthesis, all operating under strict latency constraints.

Now, when it comes to building a production-grade voice agent, things get more involved. Multiple subsystems have to operate in coordination:

  • Media transport over WebRTC
  • Streaming speech recognition
  • Incremental language model inference
  • Streaming speech synthesis
  • Turn detection and interruption handling

I'll walk through how a system like this is put together using LiveKit, focusing on architecture and system behavior rather than SDK usage. LiveKit provides the infrastructure and agent framework needed to build these systems without implementing WebRTC media handling from scratch.

At a high level, a voice agent can be understood as two major layers:

  • Media infrastructure layer
  • Inference orchestration layer

Media Infrastructure Layer: LiveKit Server

LiveKit provides the media infrastructure layer through the LiveKit Server. The server operates as a WebRTC SFU. Its responsibilities include:

  • Signaling and connection negotiation
  • ICE and NAT traversal
  • Secure media transport
  • Track routing between participants
  • Room state management
  • Horizontal scaling

SFU (Selective Forwarding Unit) - LiveKit forwards media packets as-is. It only tweaks RTP headers and selects which layers to send, without decoding the actual audio or video payload.

This matters because:

  • Reduced latency - the server never decodes and re-encodes media, so no transcoding delay is added in the media path
  • CPU efficiency - forwarding is far cheaper than transcoding, so a single server can handle many more participants
  • Simplicity and reliability - less processing means fewer failure points and more predictable performance

When a user joins a room, all WebRTC negotiation, encryption, and track establishment occur within the LiveKit server. Once the connection is established, audio frames are streamed between participants with minimal server-side processing.
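
As a concrete example, here's roughly what the server-side half of that handshake looks like with the LiveKit Python server SDK (livekit-api): the backend mints a join token, the client exchanges it with LiveKit Server during signaling, and ICE, encryption, and track setup then happen over WebRTC. This is a minimal sketch following the documented token-minting pattern; treat the exact calls as version-dependent.

```python
# Minting a join token server-side with the livekit-api package.
# The client presents this JWT to LiveKit Server when connecting to a room.
import os
from livekit import api

def create_join_token(identity: str, room_name: str) -> str:
    token = (
        api.AccessToken(os.getenv("LIVEKIT_API_KEY"), os.getenv("LIVEKIT_API_SECRET"))
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room_name))
    )
    return token.to_jwt()

print(create_join_token("demo-user", "support-room"))
```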

Inference Orchestration Layer: LiveKit Agents

The inference orchestration layer is where the actual intelligence lives. LiveKit Agents handle this. An agent subscribes to user audio tracks, processes them through an inference pipeline, and publishes synthesized audio back to the room.

All the business logic resides inside the agent:

  • Speech-to-text processing
  • Language model invocation
  • Tool execution
  • Text-to-speech synthesis
  • Decision logic

Think of the agent as a real-time media consumer and producer. It listens, thinks, and speaks.

LiveKit Agents follow a plugin-based architecture. You can swap out different providers for:

  • VAD (voice activity detection)
  • STT (speech-to-text)
  • LLM (language model)
  • TTS (text-to-speech)

This lets you compose your inference pipeline however you want. Need to switch from Deepgram to Whisper? Swap the plugin. Want to try a different TTS provider? Same deal. The underlying media infrastructure stays untouched.
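
As a rough sketch of what that composition looks like in the Python Agents SDK (the plugin choices and constructor defaults here are illustrative and version-dependent):

```python
# Composing the inference pipeline from plugins with LiveKit Agents (Python).
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, openai, cartesia, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # Each stage is a swappable plugin; replacing deepgram.STT() with another
    # STT plugin changes only this line, not the media infrastructure.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```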

At this point, the architecture can go two different ways: a pipeline model or a realtime model.

Pipeline Architecture

In the pipeline architecture, STT, LLM, and TTS are treated as discrete components. Each stage is explicitly defined and operates independently.

[Figure: Voice agent pipeline architecture - user audio flows through VAD, STT, LLM, and TTS to audio output]

Voice Activity Detection (VAD)

VAD is the gatekeeper of the entire pipeline. It serves two critical purposes: detecting when someone is actually speaking, and determining when they've finished their turn.

Under the hood, VAD processes incoming audio frames in real time. It analyzes the audio stream and emits events signaling the start and end of speech segments. These events drive the conversation flow: they tell the system when to start transcribing and when to trigger response generation.

Without VAD, you'd either be transcribing silence (wasting compute) or missing the beginning of utterances. It's a simple concept, but getting it right is essential for natural conversation dynamics.
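
For intuition, here's a toy energy-threshold VAD. Production agents use a trained model (Silero VAD is a common default), but the contract is the same: per-frame decisions smoothed into start-of-speech and end-of-speech events. The thresholds below are made-up illustrative values.

```python
# Toy energy-threshold VAD: emits start/end-of-speech events from PCM frames.
# Real deployments use a trained model; the event contract is the same --
# per-frame speech decisions smoothed into segment-level events.
import numpy as np

FRAME_MS = 20
SPEECH_RMS = 500        # tune for your input gain (int16 PCM)
END_OF_SPEECH_MS = 400  # silence required before declaring the turn over

def vad_events(frames):
    """Yield ('start_of_speech' | 'end_of_speech', t_ms) from int16 PCM frames."""
    speaking, silence_ms, t_ms = False, 0, 0
    for frame in frames:                     # frame: np.ndarray of int16 samples
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        if rms >= SPEECH_RMS:
            if not speaking:
                speaking = True
                yield "start_of_speech", t_ms
            silence_ms = 0
        elif speaking:
            silence_ms += FRAME_MS
            if silence_ms >= END_OF_SPEECH_MS:
                speaking = False
                yield "end_of_speech", t_ms
        t_ms += FRAME_MS
```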

Speech-to-Text (STT)

STT sits at the first stage of the inference pipeline. Audio frames are streamed incrementally to the STT provider. There's no waiting for the user to finish speaking before processing begins.

This streaming approach is essential for latency. Most STT providers emit two types of transcription events:

  • Interim transcripts - partial results that update as more audio arrives
  • Final transcripts - stable, committed text when the model is confident in a segment

Interim transcripts allow downstream components to begin processing before the user finishes speaking. The LLM can start building context, and in some implementations, begin generating speculative responses.
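
The consuming side tends to look like the sketch below, where `stt_stream` is a hypothetical provider client yielding `(kind, text)` events as audio is pushed into it: interim results only update provisional state, final results commit text downstream.

```python
# Consuming a streaming STT: interim results are provisional, final results
# are committed. `stt_stream`, `on_interim`, and `on_final` are hypothetical.
async def handle_transcripts(stt_stream, on_interim, on_final):
    async for kind, text in stt_stream:
        if kind == "interim":
            await on_interim(text)   # provisional text: may be revised by later events
        elif kind == "final":
            await on_final(text)     # committed text: safe to append to the LLM context
```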

Non-streaming STT - Some STT models only work with complete audio, not live streams. To use them in real time, you buffer audio until VAD detects that the user has stopped speaking, then send the whole utterance at once. This adds latency, but works with any STT model.
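
A buffered fallback for such models might look like this sketch, where `transcribe()` is a hypothetical batch endpoint:

```python
# Non-streaming STT: buffer frames until VAD signals end of speech, then send
# the whole utterance in one request. `transcribe()` is a hypothetical batch API.
async def buffered_stt(frames_with_vad_events, transcribe):
    buffer = bytearray()
    async for frame, event in frames_with_vad_events:   # frame: raw PCM bytes
        buffer.extend(frame)
        if event == "end_of_speech":
            text = await transcribe(bytes(buffer))      # one request per utterance
            buffer.clear()
            yield text
```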

LLM Orchestration

The language model operates in streaming mode. Token-by-token generation reduces perceived latency. TTS can begin synthesis before the full response is generated.

The LLM receives the current chat context and yields output incrementally. The streaming interface keeps the pipeline responsive even when generating long responses.

One optimization worth noting is preemptive generation. The system can begin generating responses based on partial transcription, before the user's turn has officially ended. When STT returns final transcripts faster than VAD emits end-of-speech signals, there's enough context to start inference early.
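
A minimal token-streaming loop, using the OpenAI Python client as one example provider (the model name is illustrative), looks like this; each fragment is handed to the next stage as soon as it arrives:

```python
# Streaming LLM tokens so downstream TTS can start before generation finishes.
# Uses the OpenAI Python client (>= 1.x); the model name is illustrative.
from openai import AsyncOpenAI

async def stream_reply(chat_context: list[dict]):
    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=chat_context,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content   # forward each fragment to TTS immediately
```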

Text-to-Speech (TTS)

TTS often contributes the most noticeable latency in the pipeline. The model has to synthesize audio from text, and the first audio frame needs to reach the user quickly for the response to feel natural.

Key parameters that affect TTS latency:

  • Streaming support - can the provider accept text incrementally and return audio before the full response is ready?
  • Chunking strategy - how is text segmented before synthesis? Sentence boundaries work well for natural prosody.
  • Time-to-first-byte - how long until the first audio frame is returned?
  • Audio encoding format - what format does the provider output?

Audio format matters for efficiency. Voice pipelines typically process audio as raw PCM internally. If a TTS provider outputs compressed formats (MP3, Opus, etc.), decoding adds CPU overhead. At scale, avoiding unnecessary transcoding reduces system load.

Non-streaming TTS - Some providers don't support streaming input. In these cases, a sentence tokenizer splits the text stream and sends complete sentences for synthesis. This trades latency for compatibility.
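
A sentence tokenizer in this role can be as simple as the sketch below, where `synthesize()` is a hypothetical provider call returning raw PCM and `play_pcm()` publishes it to the room:

```python
# Sentence-level chunking for a non-streaming TTS provider: accumulate tokens
# until a sentence boundary, then synthesize that sentence so the first
# sentence starts playing while later ones are still being generated.
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

async def speak_stream(token_stream, synthesize, play_pcm):
    sentence = ""
    async for token in token_stream:
        sentence += token
        if SENTENCE_END.search(sentence):
            await play_pcm(await synthesize(sentence.strip()))
            sentence = ""
    if sentence.strip():                       # flush any trailing fragment
        await play_pcm(await synthesize(sentence.strip()))
```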

Realtime Model Architecture

An alternative to the pipeline approach is using a single multimodal model that handles everything: audio input, VAD, language reasoning, and audio output in one place.

[Figure: Realtime model architecture - audio input and output handled within a single multimodal model]

Compared to the pipeline, this offers:

  • Lower latency (no coordination between separate components)
  • Native interruption handling
  • Simpler orchestration

But there are tradeoffs:

  • Less visibility into what's happening at each stage
  • Locked into one provider
  • Harder to customize individual steps

The choice depends on your latency targets, how much control you need, and how much complexity you're willing to manage.

Latency Considerations

Voice latency is cumulative. The total delay is the sum of audio capture, network transport, STT processing, LLM inference, TTS synthesis, and playback buffering.

The metric that matters most is turn gap: the time between the user finishing speech and the first audible agent response.
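
A back-of-the-envelope budget makes the accumulation concrete. The per-stage numbers below are illustrative, not measurements:

```python
# Back-of-the-envelope turn-gap budget. The per-stage values are illustrative;
# the point is that the user-perceived gap is their sum.
budget_ms = {
    "vad_endpointing":         300,  # silence window before end-of-turn is declared
    "stt_finalization":         80,  # last interim -> final transcript
    "llm_time_to_first_token": 250,
    "tts_time_to_first_audio": 200,
    "network_and_playback":    120,  # transport plus jitter buffer
}
turn_gap = sum(budget_ms.values())
print(f"estimated turn gap: {turn_gap} ms")  # ~950 ms: typical of an unoptimized pipeline
```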

What Feels Natural

Human conversation operates on 200-500ms turn-taking rhythms. This is consistent across cultures. When voice AI exceeds this window, users notice.

  • <300ms - feels instant, matches natural conversation
  • 300-500ms - acceptable, still conversational
  • 500-800ms - noticeable delay, users start to feel it
  • >1 second - feels broken, users assume something went wrong

Most basic pipelines land around 800ms to 2 seconds. Well-optimized streaming systems can reach 500-700ms. Getting below 500ms consistently is hard.

Where Latency Comes From

STT - Two factors matter: streaming speed and endpointing (silence detection). Slow endpointing is a common culprit. If the system waits too long to decide the user is done speaking, you lose 200-500ms before the LLM even starts.

LLM - Time-to-first-token matters more than total generation time. Streaming is mandatory. A fast first token lets TTS start early.

TTS - Time-to-first-audio is critical. Sentence buffering increases delay. Avoid unnecessary audio format conversions inside the agent.

Network - Cross-region calls easily add 100-200ms. Co-locate your agent, STT, and TTS when possible.

Optimization Rules

  • Stream at every stage
  • Reduce endpointing delay (but carefully, or you'll cut off users mid-sentence)
  • Avoid internal audio format conversions
  • Keep infrastructure geographically close
  • Favor faster models over larger ones

Voice systems are judged on responsiveness. Above 1 second, users feel the delay. Above 1.5 seconds, they wonder if the connection dropped.


Pipeline and realtime architectures represent a tradeoff between control and latency. Pipeline systems give you visibility, modularity, and the ability to tune every stage. Realtime models reduce coordination overhead and often deliver lower turn gap, but at the cost of flexibility and provider independence.

Neither is universally better. The right choice depends on your latency targets, how much customization you need, and how much operational complexity you're willing to take on.

In the follow-up post, I go through building and deploying a simple voice agent from scratch using LiveKit.