
Building a Simple Real-Time Voice Agent with LiveKit

A hands-on walkthrough of building a real-time voice agent with LiveKit, wiring up the full STT → LLM → TTS loop so you can have back-and-forth voice conversations.


In a previous post, I covered how real-time voice agents work: the architecture, the pipeline, and where latency comes from at each stage. This post is the practical follow-up: actually building and running one.

The goal is simple: a voice agent you can talk to. It listens to your speech, transcribes it, sends it to an LLM, and speaks the response back. The full STT → LLM → TTS loop, end to end. By the end, you'll have something running locally that you can experiment with.

[Demo video: real-time voice agent in action]

You can find the full source on GitHub at gokuljs/Livekit-Voice-agent. Clone it and follow along.

Before You Start

If you haven't spent time in the LiveKit docs yet, it's worth doing before diving into the code. The platform has a lot more surface area than what this post covers, and reading through it will give you a much better mental model of what's possible: rooms, participants, tracks, agent workers, SIP integration. You'll come back to these pages often.
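The agent also expects provider credentials in your environment before it will start. Here's a sketch of the variables involved, assuming the standard conventions the LiveKit SDK and the OpenAI and Rime plugins read by default (check the repo's README for the authoritative list):

import os

from dotenv import load_dotenv

# Load credentials from a local .env file, as LiveKit examples typically do.
load_dotenv()

# Variables the LiveKit SDK and the OpenAI/Rime plugins pick up by default.
for key in (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "OPENAI_API_KEY",
    "RIME_API_KEY",
):
    assert os.environ.get(key), f"missing {key}"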

Architecture

[Architecture diagram: STT → LLM → TTS pipeline]

Diving Into the Code

I'll let the code speak for itself here. You'll be surprised how little it takes to get something running.

# Config constants (RIME_MODEL, RIME_SPEAKER, OPENAI_TRANSCRIPT_MODEL,
# OPENAI_MODEL, INTRO_PHRASE) are defined elsewhere in the repo.
from livekit.agents import AgentSession, AutoSubscribe, JobContext, RoomInputOptions
from livekit.plugins import noise_cancellation, openai, rime
from livekit.plugins.turn_detector.multilingual import MultilingualModel


async def entrypoint(ctx: JobContext):
    # Join the room audio-only and block until a user is present.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    await ctx.wait_for_participant()

    # Assemble the pipeline: OpenAI for transcription and the LLM, Rime for speech.
    rime_tts = rime.TTS(model=RIME_MODEL, speaker=RIME_SPEAKER)
    session = AgentSession(
        stt=openai.STT(model=OPENAI_TRANSCRIPT_MODEL),
        llm=openai.LLM(model=OPENAI_MODEL),
        tts=rime_tts,
        vad=ctx.proc.userdata["vad"],  # VAD preloaded in the worker's prewarm step
        turn_detection=MultilingualModel(),
    )

    await session.start(
        room=ctx.room,
        agent=VoiceAssistant(),
        room_input_options=RoomInputOptions(
            noise_cancellation=noise_cancellation.BVC()
        ),
    )
    # Greet the user so the conversation has a starting point.
    await session.say(INTRO_PHRASE)

That's the entire entrypoint. You initialize your STT, LLM, and TTS, pass them into AgentSession, and start it. LiveKit wires the pipeline together for you. The agent connects to the room, waits for a participant, and greets them. That's all it takes to get a working voice agent off the ground.
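For completeness, here's roughly what the pieces the entrypoint leans on look like: the VoiceAssistant agent, the prewarm step that loads the VAD into ctx.proc.userdata, and the worker bootstrap that hands entrypoint to LiveKit. This is a sketch following the standard LiveKit Agents patterns, not a copy of the repo's code, so treat the instructions string and structure as placeholders:

from livekit.agents import Agent, JobProcess, WorkerOptions, cli
from livekit.plugins import silero


class VoiceAssistant(Agent):
    def __init__(self) -> None:
        # The system prompt that shapes how the LLM behaves in conversation.
        super().__init__(instructions="You are a helpful, concise voice assistant.")


def prewarm(proc: JobProcess):
    # Load the Silero VAD model once per process so every job can reuse it;
    # this is what entrypoint reads back as ctx.proc.userdata["vad"].
    proc.userdata["vad"] = silero.VAD.load()


if __name__ == "__main__":
    # Starts a worker that connects to LiveKit and runs entrypoint for each job.
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))

With that in place, running the script with the dev subcommand (e.g. python agent.py dev, with whatever filename the repo uses) starts the worker, and you can connect from the LiveKit Agents Playground or any LiveKit client to talk to it.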

Once it's running, the real challenges show up: chipping away at latency, building observability on top of what LiveKit provides, and thinking carefully about how you design your agent for the use case you're targeting. Getting started is the easy part.