SpeechT5 TTS — GGUF (ggml-quantised)

GGUF / ggml conversion of microsoft/speecht5_tts for use with CrispStrobe/CrispASR.

SpeechT5 is a lightweight (~80M param) encoder-decoder TTS model:

Text encoder — 12-layer transformer (768d) with relative positional encoding
Speech decoder — 6-layer AR decoder generating continuous mel frames (no codebook tokens)
Postnet — 5-layer Conv1d + BatchNorm + Tanh residual stack
HiFi-GAN vocoder — four upsampling stages (rates [4,4,4,4], 256x in total) with MRF resblocks, producing 16 kHz PCM
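
The vocoder figures above imply a fixed relationship between mel frames and audio duration. The following is a minimal sketch of that arithmetic, assuming the vocoder emits exactly the product of its upsample rates in samples per input mel frame (the usual HiFi-GAN behaviour, ignoring any edge padding). The function names are illustrative and are not part of CrispASR.

```python
import math

# Values taken from the component list above.
UPSAMPLE_RATES = [4, 4, 4, 4]
SAMPLE_RATE_HZ = 16_000


def samples_per_frame(rates=UPSAMPLE_RATES):
    """PCM samples produced per mel frame (assumed: product of the rates)."""
    return math.prod(rates)


def frames_to_seconds(n_frames, rates=UPSAMPLE_RATES, sample_rate=SAMPLE_RATE_HZ):
    """Approximate audio duration for a given number of mel frames."""
    return n_frames * samples_per_frame(rates) / sample_rate
```

Under these assumptions each mel frame covers 256 samples, or 16 ms of audio, so a rough duration estimate follows directly from the number of frames the decoder generates.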

Speaker conditioning uses a 512-d x-vector (e.g. from Matthijs/cmu-arctic-xvectors). Output is deterministic (greedy decoding, no sampling).

Released under the MIT license.

Files

File	Language	Size	Notes
`speecht5-tts-f16.gguf`	English	301 MB	encoder + decoder + postnet + HiFi-GAN vocoder
`speecht5-german-f16.gguf`	German	300 MB	German fine-tune, same architecture
`speaker.bin`	—	2 KB	Default 512-d x-vector for speaker conditioning
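
The 2 KB size of `speaker.bin` is consistent with 512 values of 4 bytes each. The sketch below reads such a file under the assumption that it is a headerless array of little-endian float32 values. That layout is a guess based on the file size alone and is not confirmed by the CrispASR documentation, and `load_xvector` is a hypothetical helper name.

```python
import struct

XVECTOR_DIM = 512  # dimensionality stated in the table above


def load_xvector(path, dim=XVECTOR_DIM):
    """Read a speaker embedding, ASSUMING raw little-endian float32 data."""
    with open(path, "rb") as f:
        raw = f.read()
    expected = dim * 4  # 4 bytes per float32
    if len(raw) != expected:
        raise ValueError(
            f"expected {expected} bytes for a {dim}-d float32 vector, got {len(raw)}"
        )
    return list(struct.unpack(f"<{dim}f", raw))
```

If the size check fails, the file probably uses a different layout (for example, one with a header), and the assumption above does not hold.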

Quick start

# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli

# 2. Download model + speaker
huggingface-cli download cstr/speecht5-tts-GGUF speecht5-tts-f16.gguf speaker.bin --local-dir .

# 3. Synthesize
./build/bin/crispasr --backend speecht5 -m speecht5-tts-f16.gguf \
    --voice speaker.bin \
    --tts "Hello, how are you today?" \
    --tts-output hello.wav

Or with auto-download:

./build/bin/crispasr -m speecht5 --auto-download \
    --tts "The quick brown fox jumps over the lazy dog." \
    --tts-output fox.wav

Python binding

from crispasr import Session

sess = Session("speecht5-tts-f16.gguf")
sess.set_voice("speaker.bin")
pcm = sess.synthesize("Hello world.")
sess.write_wav("hello.wav", pcm)

Architecture details

See docs/architecture.md#speecht5 for the full architecture breakdown.

Conversion

Converted with models/convert-speecht5-tts-to-gguf.py from the CrispASR repo. The HiFi-GAN vocoder weights are from microsoft/speecht5_hifigan and are embedded in the same GGUF file.
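
To sanity-check a downloaded or freshly converted file, the fixed-size GGUF header can be read without any ggml tooling. This is a minimal sketch that assumes a little-endian file with GGUF version 2 or later, where the tensor and metadata counts are 64-bit. `read_gguf_header` is an illustrative helper and is not part of the CrispASR repo.

```python
import struct


def read_gguf_header(path):
    """Return version, tensor count and metadata key/value count of a GGUF file."""
    with open(path, "rb") as f:
        head = f.read(24)  # 4-byte magic + uint32 version + two uint64 counts
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError("not a GGUF file (bad magic or truncated header)")
    # Assumes little-endian layout and GGUF version >= 2.
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Calling `read_gguf_header("speecht5-tts-f16.gguf")` should return a small dictionary. A `ValueError` usually points to an incomplete download or a file that is not GGUF at all.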
