
Kokoro: A Tiny TTS Model That Beats the Giants

January 15, 2025
3 min read

I’ve been exploring various text-to-speech options for my projects and just found something remarkable: an 82M-parameter model that outperforms giants like MetaVoice (1.2B) and XTTS v2 (467M). Even more interesting? It’s fully open source and easy to self-host.

Here’s a demo of Kokoro in action (this very article!):

[Audio: Kokoro Article TTS]

Another example:

[Audio: Kokoro TTS Demo]

The Technical Stack

At its core, Kokoro uses the StyleTTS 2 architecture with ISTFTNet for audio generation. Despite its small size and limited training data (less than 100 hours of audio), it’s currently ranked #1 in the TTS Spaces Arena. Pretty impressive for a model that’s a fraction of the size of its competitors.

Self Hosting

Thanks to a new FastAPI wrapper, deploying Kokoro is surprisingly straightforward:

# GPU Version
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:v0.1.0post1
 
# CPU Version
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.1.0post1

You get:

  • OpenAI-compatible speech endpoint (see the request sketch after this list)
  • Multiple voice support with voice combining
  • Streaming capability
  • Web UI for testing
  • Fast generation (35-50x realtime on an RTX 4060 Ti)
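
Before reaching for an SDK, you can sanity-check the endpoint directly. This is a minimal sketch using the requests library, assuming the wrapper exposes the same /v1/audio/speech route and JSON body as OpenAI’s speech API:

import requests

# POST to the OpenAI-style speech route exposed by the container
resp = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",
        "voice": "af_bella",
        "input": "Hello from Kokoro!",
        "response_format": "mp3",
    },
)
resp.raise_for_status()

# Save the returned audio bytes
with open("hello.mp3", "wb") as f:
    f.write(resp.content)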

Using it from the official OpenAI Python client is just as simple:

from openai import OpenAI

# Point the client at the local Kokoro server; no real API key is needed
client = OpenAI(
    base_url="http://localhost:8880/v1",
    api_key="not-needed"
)

response = client.audio.speech.create(
    model="kokoro",
    voice="af_bella",  # or combine voices: "af_bella+af_sky"
    input="Hello world!",
    response_format="mp3"
)
response.stream_to_file("output.mp3")  # write the generated audio to disk
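
Since the server also streams, you don’t have to buffer the whole file before writing it. Here’s a minimal sketch using the streaming helper in recent versions of the openai Python package, assuming the endpoint mirrors OpenAI’s streaming behavior:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

# Stream the audio to disk chunk by chunk instead of holding it all in memory
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Streaming a longer passage works the same way.",
    response_format="mp3",
) as response:
    response.stream_to_file("streamed.mp3")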

Practical Benefits

What makes this particularly exciting for developers:

  • Resource efficient (runs well even on CPU)
  • Apache 2.0 licensed
  • Multiple audio format support (mp3, wav, opus, flac)
  • Streaming support with adjustable chunking
  • Easy voice combination (see the sketch below)
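
To illustrate voice combining and an alternate output format, here’s a variation of the earlier snippet. The plain "+" syntax appears in the example above; anything fancier (weights, more than two voices) isn’t assumed here:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

# Blend two voices with "+" and request FLAC instead of mp3
response = client.audio.speech.create(
    model="kokoro",
    voice="af_bella+af_sky",
    input="Two voices, one track.",
    response_format="flac",
)
response.stream_to_file("blended.flac")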

Current Limitations

It’s worth noting a few constraints:

  • No voice cloning capability
  • Mainly English support
  • Training data focused on narration over conversation

Looking Forward

I’ve integrated this into a few projects already, and the combination of small size, good quality, and easy deployment makes it incredibly practical. The fact that it achieves this performance with such minimal resources makes me excited about future optimizations in the TTS space.

I’m currently testing it in an audiobook generation pipeline. Is anyone else building something similar? Contact me if so!

Code and containerized deployment options are available on GitHub.