
Voice Assistant

Real-time voice-to-voice AI system with <2s latency, handling 100+ concurrent sessions via async WebSocket streaming.

Python · WebSocket · Asyncio · OpenAI Whisper · Google Gemini · ElevenLabs TTS · Docker

Status

Completed

Duration

2 months

Role

AI Engineer

Team

Solo

Problem

Voice AI systems typically suffer multi-second response latency and degrade in response quality as the number of concurrent users grows.

Solution

Built an async streaming pipeline: WebSocket audio → Whisper ASR → Gemini LLM → ElevenLabs TTS → streamed response. Independent session queues with circuit breakers.

Impact

End-to-end latency under 2 seconds. 100+ concurrent sessions with per-session isolation and metrics dashboard.

About This Project

The system runs audio input through an asynchronous event pipeline, achieving end-to-end turnaround in under 2 seconds.

It receives audio via WebSocket, transcribes it using Whisper ASR, generates responses using Gemini LLM, synthesizes speech using ElevenLabs TTS, and streams the synthesized audio back to the client.

Built on a streaming-first architecture using Python's asyncio, it handles 100+ concurrent sessions with independent queues and tasks per session.
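The per-session design described above can be sketched in a few lines of asyncio. The stage functions below are hypothetical stand-ins for the real Whisper, Gemini, and ElevenLabs calls (which are network-bound); the point is the shape: each session owns its own queues and a single task that drains them, so sessions never block one another.

```python
import asyncio

# Stand-ins for the real API calls; names and behavior are illustrative.
async def transcribe(chunk: bytes) -> str:      # Whisper ASR stand-in
    await asyncio.sleep(0)                      # simulate awaiting network I/O
    return f"text:{chunk.decode()}"

async def generate(text: str) -> str:           # Gemini LLM stand-in
    await asyncio.sleep(0)
    return f"reply:{text}"

async def synthesize(text: str) -> bytes:       # ElevenLabs TTS stand-in
    await asyncio.sleep(0)
    return text.encode()

async def session_pipeline(audio_in: asyncio.Queue, audio_out: asyncio.Queue):
    """One independent task per session: drain the inbound queue,
    run each chunk through ASR -> LLM -> TTS, enqueue the reply audio."""
    while True:
        chunk = await audio_in.get()
        if chunk is None:                       # sentinel: client disconnected
            break
        text = await transcribe(chunk)
        reply = await generate(text)
        await audio_out.put(await synthesize(reply))

async def demo() -> bytes:
    audio_in, audio_out = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(session_pipeline(audio_in, audio_out))
    await audio_in.put(b"hello")
    await audio_in.put(None)
    await task
    return await audio_out.get()

print(asyncio.run(demo()))  # prints b'reply:text:hello'
```

In the real system each WebSocket connection would feed `audio_in` and stream `audio_out` back to the client; because every session gets its own task and queues, a slow or failed session only affects itself.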

The platform features configurable timeouts, circuit breakers, a metrics dashboard, and session recording for debugging.
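A circuit breaker of the kind mentioned above can be sketched as follows. The thresholds and the `CircuitOpen` name are illustrative, not the project's actual values; the idea is that after repeated downstream failures, calls are skipped for a cooldown period instead of piling up doomed requests.

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown: float = 30.0):
        self.max_failures = max_failures        # failures before opening
        self.cooldown = cooldown                # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise CircuitOpen("skipping call during cooldown")
            self.opened_at = None               # half-open: allow one retry
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                       # any success resets the count
        return result
```

Wrapping each external API call (ASR, LLM, TTS) in a per-session breaker keeps one failing provider from stalling the whole pipeline.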

Key Features

Technical capabilities and highlights

Real-time audio streaming via bidirectional WebSocket

Speech recognition using OpenAI Whisper API

Intelligent responses from Google Gemini LLM

Natural speech synthesis using ElevenLabs TTS

Low latency end-to-end processing under 2 seconds

Concurrent session support for 100+ simultaneous conversations

Resilience patterns including circuit breakers

Comprehensive observability with metrics dashboard
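The features above all hinge on the under-2-second target. An illustrative budget shows how such a target decomposes across stages; these per-stage numbers are hypothetical, not measurements from the project.

```python
# Hypothetical latency budget for a <2 s voice-to-voice turn.
budget_seconds = {
    "WebSocket transport (both directions)": 0.1,
    "Whisper ASR": 0.5,
    "Gemini LLM (time to first tokens)": 0.7,
    "ElevenLabs TTS (time to first audio)": 0.5,
}
total = sum(budget_seconds.values())
assert total < 2.0  # budget fits the target with headroom
```

Streaming each stage's partial output into the next (rather than waiting for it to finish) is what makes the LLM and TTS entries first-token/first-audio times instead of full-response times.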
