
Voice Assistant

Real-time voice-to-voice AI system with <2s latency, handling 100+ concurrent sessions via async WebSocket streaming.

Python · WebSocket · Asyncio · OpenAI Whisper · Google Gemini · ElevenLabs TTS · Docker

Status

Completed

Duration

2 months

Role

AI Engineer

Team

Solo

Problem

Voice AI systems typically suffer multi-second response latency and degrade in response quality as the number of concurrent users grows.

Solution

Built an async streaming pipeline: WebSocket audio → Whisper ASR → Gemini LLM → ElevenLabs TTS → streamed response. Independent session queues with circuit breakers.

Impact

End-to-end latency under 2 seconds. 100+ concurrent sessions with per-session isolation and metrics dashboard.

About This Project

The system runs audio input through an asynchronous event pipeline, achieving end-to-end turnaround in under 2 seconds.

It receives audio via WebSocket, transcribes it using Whisper ASR, generates responses using Gemini LLM, synthesizes speech using ElevenLabs TTS, and streams the synthesized audio back to the client.

Built on a streaming-first architecture using Python's asyncio, it handles 100+ concurrent sessions with independent queues and tasks per session.
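The per-session design described above can be sketched in a few lines of asyncio. The stage functions below are hypothetical stand-ins for the real Whisper, Gemini, and ElevenLabs calls (which are network-bound); the point is the shape: each session owns its own queues and a single task that drains them, so sessions never block one another.

```python
import asyncio

# Stand-ins for the real API calls; names and behavior are illustrative.
async def transcribe(chunk: bytes) -> str:      # Whisper ASR stand-in
    await asyncio.sleep(0)                      # simulate awaiting network I/O
    return f"text:{chunk.decode()}"

async def generate(text: str) -> str:           # Gemini LLM stand-in
    await asyncio.sleep(0)
    return f"reply:{text}"

async def synthesize(text: str) -> bytes:       # ElevenLabs TTS stand-in
    await asyncio.sleep(0)
    return text.encode()

async def session_pipeline(audio_in: asyncio.Queue, audio_out: asyncio.Queue):
    """One independent task per session: drain the inbound queue,
    run each chunk through ASR -> LLM -> TTS, enqueue the reply audio."""
    while True:
        chunk = await audio_in.get()
        if chunk is None:                       # sentinel: client disconnected
            break
        text = await transcribe(chunk)
        reply = await generate(text)
        await audio_out.put(await synthesize(reply))

async def demo() -> bytes:
    audio_in, audio_out = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(session_pipeline(audio_in, audio_out))
    await audio_in.put(b"hello")
    await audio_in.put(None)
    await task
    return await audio_out.get()

print(asyncio.run(demo()))  # prints b'reply:text:hello'
```

In the real system each WebSocket connection would feed `audio_in` and stream `audio_out` back to the client; because every session gets its own task and queues, a slow or failed session only affects itself.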

The platform features configurable timeouts, circuit breakers, a metrics dashboard, and session recording for debugging.
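A circuit breaker of the kind mentioned above can be sketched as follows. The thresholds and the `CircuitOpen` name are illustrative, not the project's actual values; the idea is that after repeated downstream failures, calls are skipped for a cooldown period instead of piling up doomed requests.

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown: float = 30.0):
        self.max_failures = max_failures        # failures before opening
        self.cooldown = cooldown                # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise CircuitOpen("skipping call during cooldown")
            self.opened_at = None               # half-open: allow one retry
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                       # any success resets the count
        return result
```

Wrapping each external API call (ASR, LLM, TTS) in a per-session breaker keeps one failing provider from stalling the whole pipeline.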

Key Features

Technical capabilities and highlights

Real-time audio streaming via bidirectional WebSocket

Speech recognition using OpenAI Whisper API

Intelligent responses from Google Gemini LLM

Natural speech synthesis using ElevenLabs TTS

Low latency end-to-end processing under 2 seconds

Concurrent session support for 100+ simultaneous conversations

Resilience patterns including circuit breakers

Comprehensive observability with metrics dashboard
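The features above all hinge on the under-2-second target. An illustrative budget shows how such a target decomposes across stages; these per-stage numbers are hypothetical, not measurements from the project.

```python
# Hypothetical latency budget for a <2 s voice-to-voice turn.
budget_seconds = {
    "WebSocket transport (both directions)": 0.1,
    "Whisper ASR": 0.5,
    "Gemini LLM (time to first tokens)": 0.7,
    "ElevenLabs TTS (time to first audio)": 0.5,
}
total = sum(budget_seconds.values())
assert total < 2.0  # budget fits the target with headroom
```

Streaming each stage's partial output into the next (rather than waiting for it to finish) is what makes the LLM and TTS entries first-token/first-audio times instead of full-response times.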
