LuxTTS: Open-Source Voice Cloning at 150x Realtime on 1GB VRAM
A new open-source text-to-speech model is turning heads in the AI community. LuxTTS, built on the ZipVoice architecture, can clone any voice from just 3 seconds of audio and generate speech at 150x realtime speed on a single GPU.
Tiny Model, Big Output
The model fits within 1GB of VRAM, making it accessible to virtually any consumer GPU, and it runs faster than realtime even on CPU alone. While most TTS systems top out at 24kHz output, LuxTTS generates at 48kHz - double the standard sample rate - delivering noticeably clearer audio.
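To put those headline numbers in perspective, a quick back-of-the-envelope calculation (plain Python, independent of the model itself) shows what 48kHz output and a 150x realtime factor mean in practice:

```python
# Back-of-the-envelope numbers for LuxTTS's reported specs.

SAMPLE_RATE_HZ = 48_000   # LuxTTS output rate (many TTS models use 24_000)
REALTIME_FACTOR = 150     # reported generation speed on a single GPU

def samples_for(seconds: float, rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples produced for a clip of the given length."""
    return int(seconds * rate)

def generation_time(audio_seconds: float, rtf: float = REALTIME_FACTOR) -> float:
    """Wall-clock seconds needed to synthesize audio at a given realtime factor."""
    return audio_seconds / rtf

print(samples_for(60))                # 2880000 samples per minute at 48kHz
print(samples_for(60, 24_000))        # 1440000 at the usual 24kHz
print(generation_time(60))            # 0.4 -> a full minute of speech in 0.4s
```

In other words, doubling the sample rate doubles the data the vocoder must produce per second, which makes the 150x figure more notable, not less.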
How It Works
LuxTTS is a distilled version of ZipVoice, compressed down to just 4 inference steps with an improved sampling technique and a custom 48kHz vocoder. Users provide a short reference audio clip, and the model generates speech in the cloned voice. The entire pipeline runs locally with no API calls, no subscriptions, and no data leaving the machine.
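ZipVoice, the base model, is a flow-matching system, so "4 inference steps" means four evaluations of the learned velocity field during sampling. The sketch below illustrates that shape of loop with a toy stand-in velocity function; it is not LuxTTS's actual code, and the real model predicts velocities with a neural network conditioned on the text and the reference voice.

```python
import random

# Illustrative few-step Euler sampler, the kind of short inference loop
# that distillation to 4 steps targets. The velocity field here is a toy
# stand-in; a real flow-matching TTS model returns a tensor of the same
# shape as x, conditioned on text and speaker reference.

NUM_STEPS = 4  # LuxTTS runs just 4 inference steps

def velocity(x: float, t: float) -> float:
    """Stand-in for the learned velocity field v(x, t)."""
    return -x  # toy dynamics: decay toward zero

def sample(x0: float, num_steps: int = NUM_STEPS) -> float:
    """Integrate from t=0 to t=1 with a few fixed Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one Euler step = one network call
    return x

noise = random.gauss(0.0, 1.0)  # sampling starts from Gaussian noise
latent = sample(noise)          # only 4 "network" evaluations total
```

The speedup from distillation comes directly from this loop length: an undistilled model might take dozens of such steps, each a full forward pass.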
Access and Usage
The model is available on Hugging Face with a live demo on Spaces. A Google Colab notebook is also provided for quick testing. Local installation requires only a pip install and a few lines of Python. The model supports CUDA, CPU, and Apple MPS backends.
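Since the model supports CUDA, CPU, and Apple MPS, a typical local script starts by picking a backend. The helper below captures the standard preference order as a plain function; the package and class names in the trailing comments are assumptions for illustration, not the documented API - check the Hugging Face model card for the real calls.

```python
# Backend selection for a model that supports CUDA, CPU, and Apple MPS.
# With PyTorch installed, the availability flags would come from:
#   pick_device(torch.cuda.is_available(),
#               torch.backends.mps.is_available())

def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

print(pick_device(False, True))  # mps

# Hypothetical usage sketch (names are assumptions, not LuxTTS's API):
#   from luxtts import LuxTTS
#   tts = LuxTTS.from_pretrained(...).to(pick_device(...))
#   wav = tts.generate(text="Hello!", reference="voice_3s.wav")
```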
The project has already attracted community contributions including a Gradio interface, a ComfyUI integration, and a clean desktop app. Float16 inference - expected to nearly double current speeds - is listed on the roadmap.