Voice Pipeline
The voice pipeline is the core of Arkenos: it processes audio in real time between the caller and the AI agent.

Pipeline Flow
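The stage order can be sketched as a simple composition. Each function below is an illustrative stub standing in for the real component (Silero VAD, the STT provider, Gemini, Resemble AI), not actual SDK code:

```python
def vad(frames):
    # Keep only frames the detector flags as speech (stub).
    return [f for f in frames if f["speech"]]

def stt(frames):
    # Join per-frame transcripts into one utterance (stub).
    return " ".join(f["text"] for f in frames)

def llm(transcript):
    # Placeholder for Gemini response generation.
    return f"You said: {transcript}"

def tts(text):
    # Placeholder synthesis: return "audio" bytes.
    return text.encode("utf-8")

def pipeline(frames):
    # caller audio -> VAD -> STT -> LLM -> TTS -> agent audio
    return tts(llm(stt(vad(frames))))
```

The real pipeline is streaming and bidirectional; this only shows the order in which a caller utterance passes through the stages.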
Components
Voice Activity Detection (VAD)
Silero VAD detects when the caller starts and stops speaking. This enables natural turn-taking without manual push-to-talk.

Speech-to-Text (STT)
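The start/stop detection that drives turn-taking can be illustrated with a toy detector. Silero VAD is a neural model; the energy threshold and hangover logic below are a simplified stand-in that only shows the state machine involved:

```python
def rms(frame):
    # Root-mean-square energy of a frame of PCM samples.
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

class EnergyVAD:
    """Toy detector: emits 'start' when speech begins and 'end' after
    a few consecutive silent frames (the hangover), like a real VAD."""

    def __init__(self, threshold=500.0, hang_frames=3):
        self.threshold = threshold
        self.hang_frames = hang_frames
        self.silence = 0
        self.speaking = False

    def process(self, frame):
        """Return 'start', 'end', or None for each audio frame."""
        if rms(frame) >= self.threshold:
            self.silence = 0
            if not self.speaking:
                self.speaking = True
                return "start"
        elif self.speaking:
            self.silence += 1
            if self.silence >= self.hang_frames:
                self.speaking = False
                return "end"
        return None
```

The hangover (`hang_frames`) is what prevents the agent from barging in during a brief mid-sentence pause.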
Converts caller audio to text. Arkenos supports three providers, configurable per agent:

| Provider | Key Feature | Config Value |
|---|---|---|
| AssemblyAI | High accuracy, default provider | assemblyai |
| ElevenLabs | Low latency streaming | elevenlabs |
| Deepgram | Real-time streaming | deepgram |
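Per-agent provider selection amounts to a lookup on the config value from the table. A minimal sketch, assuming a `stt_provider` config key; the client classes are illustrative stand-ins, not the real provider SDKs:

```python
class AssemblyAISTT: ...  # stand-in for the AssemblyAI client
class ElevenLabsSTT: ...  # stand-in for the ElevenLabs client
class DeepgramSTT: ...    # stand-in for the Deepgram client

# Keys match the Config Value column above.
STT_PROVIDERS = {
    "assemblyai": AssemblyAISTT,
    "elevenlabs": ElevenLabsSTT,
    "deepgram": DeepgramSTT,
}

def make_stt(config: dict):
    # AssemblyAI is the default provider when none is configured.
    provider = config.get("stt_provider", "assemblyai")
    try:
        return STT_PROVIDERS[provider]()
    except KeyError:
        raise ValueError(f"unknown STT provider: {provider!r}")
```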
Large Language Model (LLM)
Processes the transcript and generates a response. Currently uses Google Gemini 2.5 Flash. The LLM receives:

- The agent’s system prompt
- Conversation history
- Available function definitions (if configured)
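Assembling those three inputs into a request might look like the sketch below. The field names are illustrative, not the actual Gemini API schema:

```python
def build_llm_request(system_prompt, history, functions=None):
    """Combine the agent's prompt, conversation history, and any
    configured tools into a single LLM request (hypothetical shape)."""
    request = {
        "model": "gemini-2.5-flash",
        "system_instruction": system_prompt,
        "messages": list(history),  # [{"role": ..., "content": ...}, ...]
    }
    if functions:
        # Tools are only attached when the agent has functions configured.
        request["tools"] = functions
    return request
```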
Function Calling
When the LLM decides to use a tool, the agent:

- Extracts the function name and arguments
- Makes an HTTP request to the configured endpoint
- Returns the result to the LLM for incorporation into the response
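The three steps above can be sketched as a dispatcher. The tool-call shape, endpoint map, and tool-message format are assumptions for illustration; `transport` is injectable so the HTTP call can be faked in tests:

```python
import json
from urllib import request as urlrequest

def call_tool(tool_call, endpoints, transport=None):
    """Dispatch one LLM tool call to its configured HTTP endpoint.

    tool_call: assumed shape {"name": ..., "arguments": {...}}
    endpoints: maps function names to webhook URLs (from agent config)
    """
    # 1. Extract the function name and arguments.
    name = tool_call["name"]
    args = tool_call["arguments"]

    # 2. Make an HTTP request to the configured endpoint.
    body = json.dumps(args).encode("utf-8")
    if transport is None:
        def transport(url, data):
            req = urlrequest.Request(
                url, data=data, headers={"Content-Type": "application/json"}
            )
            with urlrequest.urlopen(req) as resp:
                return resp.read()
    result = transport(endpoints[name], body)

    # 3. Return the result so it can be fed back to the LLM.
    return {"role": "tool", "name": name, "content": result.decode("utf-8")}
```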
Text-to-Speech (TTS)
Converts the LLM response to audio using Resemble AI. Each agent can be configured with a different voice from the Resemble voice library.

Agent Configuration
The agent worker fetches its configuration from the backend API when it joins a room. The configuration includes:

- System prompt and first message
- STT provider selection
- Voice UUID for TTS
- Function definitions
- Webhook URLs
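The fetched payload can be modeled as a typed config object. The field names mirror the list above but are assumptions, not the actual backend API schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    system_prompt: str
    first_message: str = ""
    stt_provider: str = "assemblyai"  # default provider
    voice_uuid: str = ""              # Resemble voice for TTS
    functions: list = field(default_factory=list)
    webhook_urls: dict = field(default_factory=dict)

def parse_config(payload: dict) -> AgentConfig:
    """Build an AgentConfig from the backend's JSON payload,
    falling back to defaults for optional fields."""
    return AgentConfig(
        system_prompt=payload["system_prompt"],
        first_message=payload.get("first_message", ""),
        stt_provider=payload.get("stt_provider", "assemblyai"),
        voice_uuid=payload.get("voice_uuid", ""),
        functions=payload.get("functions", []),
        webhook_urls=payload.get("webhook_urls", {}),
    )
```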
Real-time Transport
LiveKit handles the real-time audio transport:

- Browser sessions: WebRTC connection via LiveKit Client SDK
- Phone calls: Twilio SIP trunk → LiveKit SIP bridge
- Agent: LiveKit Agents SDK connects as a participant