Built to respect your privacy

Every word you type stays on your device. Scribeus runs the AI model directly in your browser, so neither your text nor the generated audio ever reaches a server.

⚡

Instant playback

Audio starts playing as soon as each sentence is synthesized, so you never wait for the full text before you hear anything.

🎙

4 natural voices

Bella and Jessica for female narration, Liam and Adam for male — all powered by the Kokoro-82M model.

🔒

Zero-upload privacy

The model runs in a Web Worker. Your text never leaves your device — not even in encrypted form.

📶

Works offline

After the first load, the model (roughly 155 MB) is cached in your browser. Generate speech on a plane, no Wi-Fi needed.

🎚

Speed control

Adjust speech rate from 0.5× to 2× for presentations, audiobooks, or fast-paced content review.

💾

WAV download

Export as a lossless 24 kHz WAV file — ready for video editors, podcasts, or presentations.
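For readers curious what such an export involves, here is a minimal sketch of packing raw samples into a WAV file. It is not Scribeus's actual code: the function name `encodeWav`, the mono channel layout, and the 16-bit PCM sample format are all assumptions made for illustration. Only the 24 kHz sample rate comes from the description above.

```typescript
// Hypothetical helper: wrap mono float samples (range -1..1) in a
// 16-bit PCM WAV container. Bit depth and channel count are assumptions.
function encodeWav(samples: Float32Array, sampleRate = 24000): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte header + data
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, 1, true); // one channel (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample to [-1, 1], then scale to the signed 16-bit range.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In a browser, the returned buffer could be wrapped as `new Blob([encodeWav(samples)], { type: "audio/wav" })` and offered as a download link.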

Frequently asked questions

Everything you need to know about Scribeus.

How much text can I convert?

Up to 5,000 characters per generation. Long texts are split automatically at sentence boundaries and stitched into a single audio file.
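As a rough illustration of the split-and-stitch idea, the sketch below cuts text at sentence-ending punctuation and joins per-sentence audio buffers back into one. The names `MAX_CHARS`, `splitSentences`, and `stitch` are hypothetical, and the splitting rule is deliberately naive. Scribeus's actual splitter is not documented here and may behave differently.

```typescript
// Hypothetical constant mirroring the 5,000-character cap stated above.
const MAX_CHARS = 5000;

// Split text wherever ., !, or ? is followed by whitespace.
// Naive rule: abbreviations such as "Dr. Smith" are split as well.
function splitSentences(text: string): string[] {
  if (text.length > MAX_CHARS) {
    throw new RangeError(`Text exceeds ${MAX_CHARS} characters`);
  }
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Concatenate per-sentence audio buffers into one continuous buffer.
function stitch(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}
```

Splitting at sentence boundaries rather than at a fixed character count keeps each chunk a natural unit of speech, which helps avoid audible seams mid-phrase.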

Is my text completely private?

Yes. The Kokoro model runs entirely inside your browser in a Web Worker. No text, no audio, and no metadata is transmitted to any server.

Do I need an internet connection?

Only for the first visit, to download the AI model (~155 MB). After that, the model is cached and Scribeus works fully offline.

What AI model powers Scribeus?

Kokoro-82M v1.0, an open-source neural TTS model with 82 million parameters. We use the q4f16 quantized ONNX variant, which is optimized for browser inference while aiming to preserve audio quality.

Is Scribeus really free?

Completely free. No subscription, no account, and no cap on how many generations you run (the 5,000-character limit applies per generation). Scribeus is part of the RuntimeHub suite of free, private, browser-based tools, supported by non-intrusive ads.

How long does generation take?

Speed depends on your device. A short sentence typically generates in 2–5 seconds. Longer passages are processed in chunks — you hear the first chunk while the rest are being generated.
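The chunked behavior described above can be sketched as an async generator that starts synthesizing the next sentence before handing over the current one. This is an illustration under assumptions, not Scribeus's implementation: the `Synthesize` type and `streamSpeech` function are hypothetical names, and the real model call is replaced by whatever function the caller passes in.

```typescript
// Hypothetical signature for a synthesizer: one sentence in, samples out.
type Synthesize = (sentence: string) => Promise<Float32Array>;

// Yield audio chunk by chunk. The next sentence is requested before the
// current chunk is yielded, so it can be generated while the caller is
// busy playing the current one.
async function* streamSpeech(
  sentences: string[],
  synthesize: Synthesize,
): AsyncGenerator<Float32Array> {
  if (sentences.length === 0) return;
  let pending = synthesize(sentences[0]);
  for (let i = 0; i < sentences.length; i++) {
    const audio = await pending;
    if (i + 1 < sentences.length) {
      pending = synthesize(sentences[i + 1]); // prefetch the next chunk
    }
    yield audio;
  }
}
```

A caller would consume this with `for await (const chunk of streamSpeech(sentences, synthesize))`, playing each chunk as it arrives. The sketch prefetches only one sentence ahead, which keeps memory use small at the cost of a possible gap if synthesis is slower than playback.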