SpeechT5 TTS: GGUF (ggml-quantised)

GGUF / ggml conversion of microsoft/speecht5_tts for use with CrispStrobe/CrispASR.

SpeechT5 is a lightweight (~80M parameters) encoder-decoder TTS model with four components:

  • Text encoder: 12-layer transformer (768d) with relative positional encoding
  • Speech decoder: 6-layer autoregressive (AR) decoder generating continuous mel frames (no codebook tokens)
  • Postnet: 5-layer Conv1d + BatchNorm + Tanh residual stack
  • HiFi-GAN vocoder: four upsampling stages (rates [4,4,4,4]) with MRF resblocks, producing 16 kHz PCM
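The vocoder figures above imply a fixed relationship between mel frames and output samples. The following is a minimal sketch of that arithmetic, assuming the vocoder's total upsampling factor is simply the product of the listed rates and that no extra padding or trimming is applied. The function names are illustrative and not part of CrispASR.

```python
# Sketch: relate mel-frame count to output audio length for the HiFi-GAN stage.
from math import prod

UPSAMPLE_RATES = [4, 4, 4, 4]   # per-stage rates from the component list
SAMPLE_RATE_HZ = 16_000         # vocoder output sample rate


def samples_per_frame(rates=UPSAMPLE_RATES):
    """Total upsampling factor: the product of the per-stage rates."""
    return prod(rates)


def audio_seconds(n_mel_frames, rates=UPSAMPLE_RATES, sample_rate=SAMPLE_RATE_HZ):
    """Approximate waveform duration for a given number of mel frames."""
    return n_mel_frames * samples_per_frame(rates) / sample_rate


if __name__ == "__main__":
    hop = samples_per_frame()
    print(hop)                   # 256 samples per mel frame
    print(hop / SAMPLE_RATE_HZ)  # 0.016 s per frame
    print(audio_seconds(125))    # 2.0 s of audio
```

Under these assumptions each mel frame covers 16 ms of audio, so the length of an utterance is set by how many frames the decoder emits before it stops.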

Speaker conditioning uses a 512-d x-vector (e.g. from Matthijs/cmu-arctic-xvectors). Output is deterministic: decoding is greedy, with no sampling.

Released under the MIT license.

Files

| File | Language | Size | Notes |
|---|---|---|---|
| speecht5-tts-f16.gguf | English | 301 MB | encoder + decoder + postnet + HiFi-GAN vocoder |
| speecht5-german-f16.gguf | German | 300 MB | German fine-tune, same architecture |
| speaker.bin | - | 2 KB | Default 512-d x-vector for speaker conditioning |
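The 2 KB size of speaker.bin matches 512 float32 values (512 × 4 = 2048 bytes). Assuming the file is a raw, headerless little-endian float32 dump (an inference from the size alone, not something this card confirms), a custom x-vector could be written and read back roughly as follows. `write_xvector` and `read_xvector` are hypothetical helper names.

```python
# Sketch: store a 512-d x-vector as raw float32.
# ASSUMPTION: speaker.bin is a headerless little-endian float32 dump;
# this is inferred from its 2 KB size, not documented by the model card.
import struct
from pathlib import Path

XVECTOR_DIM = 512
BYTES_PER_FLOAT32 = 4


def write_xvector(path, values):
    """Write exactly XVECTOR_DIM floats as little-endian float32."""
    if len(values) != XVECTOR_DIM:
        raise ValueError(f"expected {XVECTOR_DIM} values, got {len(values)}")
    Path(path).write_bytes(struct.pack(f"<{XVECTOR_DIM}f", *values))


def read_xvector(path):
    """Read a raw float32 x-vector, checking the file size first."""
    data = Path(path).read_bytes()
    expected = XVECTOR_DIM * BYTES_PER_FLOAT32
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    return list(struct.unpack(f"<{XVECTOR_DIM}f", data))
```

The size check in `read_xvector` is a cheap way to test the format assumption against the shipped file before passing a custom vector to `--voice`.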

Quick start

# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli

# 2. Download model + speaker
huggingface-cli download cstr/speecht5-tts-GGUF speecht5-tts-f16.gguf speaker.bin --local-dir .

# 3. Synthesize
./build/bin/crispasr --backend speecht5 -m speecht5-tts-f16.gguf \
    --voice speaker.bin \
    --tts "Hello, how are you today?" \
    --tts-output hello.wav

Or with auto-download:

./build/bin/crispasr -m speecht5 --auto-download \
    --tts "The quick brown fox jumps over the lazy dog." \
    --tts-output fox.wav

Python binding

from crispasr import Session

sess = Session("speecht5-tts-f16.gguf")
sess.set_voice("speaker.bin")
pcm = sess.synthesize("Hello world.")
sess.write_wav("hello.wav", pcm)
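To check a synthesized file without involving CrispASR, the header can be inspected with Python's standard `wave` module. This sketch assumes the output is a PCM-encoded WAV (the `wave` module rejects other encodings such as IEEE float). `wav_info` is an illustrative helper, not part of the binding.

```python
# Sketch: inspect a WAV header using only the standard library.
# ASSUMPTION: the file is PCM-encoded; wave.open raises wave.Error otherwise.
import wave


def wav_info(path):
    """Return basic header fields and the duration of a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": w.getsampwidth(),
            "seconds": w.getnframes() / rate,
        }
```

Calling `wav_info("hello.wav")` on the file written above should report a sample rate of 16000 if the output follows the vocoder's 16 kHz rate.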

Architecture details

See docs/architecture.md#speecht5 for the full architecture breakdown.

Conversion

Converted with models/convert-speecht5-tts-to-gguf.py from the CrispASR repo. The HiFi-GAN vocoder weights are from microsoft/speecht5_hifigan and are embedded in the same GGUF file.
