Voice Assistant
Real-time voice-to-voice AI system with <2s latency, handling 100+ concurrent sessions via async WebSocket streaming.
Completed
Yes
Duration
2 months
Role
AI Engineer
Team
Solo
Problem
Voice AI systems typically suffer from high end-to-end latency and cannot scale to many concurrent users without degrading response quality.
Solution
Built an async streaming pipeline: WebSocket audio → Whisper ASR → Gemini LLM → ElevenLabs TTS → streamed response. Independent session queues with circuit breakers.
Impact
End-to-end latency under 2 seconds. 100+ concurrent sessions with per-session isolation and metrics dashboard.
About This Project
The system processes audio input through an asynchronous event pipeline, achieving end-to-end processing in under 2 seconds.
It receives audio via WebSocket, transcribes it using Whisper ASR, generates responses using Gemini LLM, synthesizes speech using ElevenLabs TTS, and streams the synthesized audio back to the client.
Built on a streaming-first architecture using Python's asyncio, it handles 100+ concurrent sessions with independent queues and tasks per session.
The platform features configurable timeouts, circuit breakers, a metrics dashboard, and session recording for debugging.
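The per-session pipeline described above can be sketched with asyncio as follows. This is a minimal illustration, not the project's actual code: the `transcribe`, `generate_reply`, and `synthesize` stubs are hypothetical placeholders standing in for the real Whisper, Gemini, and ElevenLabs calls, and each stage is bounded by a configurable timeout as the text describes.

```python
import asyncio

# Hypothetical stubs standing in for the real Whisper ASR, Gemini LLM,
# and ElevenLabs TTS calls; each stage is assumed to be awaitable.
async def transcribe(audio_chunk: bytes) -> str:
    return audio_chunk.decode(errors="ignore")

async def generate_reply(text: str) -> str:
    return f"reply to: {text}"

async def synthesize(text: str) -> bytes:
    return text.encode()

async def session_worker(inbox: asyncio.Queue, outbox: asyncio.Queue,
                         timeout: float = 2.0) -> None:
    """One independent pipeline task per session: audio in -> audio out."""
    while True:
        chunk = await inbox.get()
        if chunk is None:  # sentinel: session closed
            await outbox.put(None)
            break
        try:
            # Each stage is bounded by the configurable timeout so one
            # slow call cannot stall the session indefinitely.
            text = await asyncio.wait_for(transcribe(chunk), timeout)
            reply = await asyncio.wait_for(generate_reply(text), timeout)
            audio = await asyncio.wait_for(synthesize(reply), timeout)
            await outbox.put(audio)
        except asyncio.TimeoutError:
            await outbox.put(b"")  # degrade this turn; session survives

async def demo() -> list:
    # One inbox/outbox queue pair per session keeps sessions isolated;
    # in the real system the queues would be fed by a WebSocket handler.
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(session_worker(inbox, outbox))
    await inbox.put(b"hello")
    await inbox.put(None)
    out = []
    while (item := await outbox.get()) is not None:
        out.append(item)
    await worker
    return out

print(asyncio.run(demo()))  # -> [b'reply to: hello']
```

Because each session owns its own queues and worker task, scaling to 100+ concurrent conversations is a matter of spawning more tasks rather than more threads.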
Key Features
Technical capabilities and highlights
Real-time audio streaming via bidirectional WebSocket
Speech recognition using OpenAI Whisper API
Intelligent responses from Google Gemini LLM
Natural speech synthesis using ElevenLabs TTS
Low latency end-to-end processing under 2 seconds
Concurrent session support for 100+ simultaneous conversations
Resilience patterns including circuit breakers
Comprehensive observability with metrics dashboard
Interested in this project?
Let's discuss how similar solutions can be built for your needs.