# Supertonic TTS Quantization for QCS6490

A step-by-step guide to quantizing the Supertonic TTS model for the Qualcomm QCS6490 using QAIRT/QNN.
## Sample Output

Audio generated on the QCS6490 board using the quantized models (10 diffusion steps, noise-reduced).
## Requirements
- QAIRT/QNN SDK v2.37
- Python 3.8+
- Target device: QCS6490
## Pipeline Architecture

```
                 text + style
                      │
          ┌───────────┴───────────┐
          │                       │
  duration_predictor         text_encoder
          │                       │
   duration (scalar)     text_emb (1,128,256)
          │                       │
  latent_mask (1,1,256)           │
          └───────────┬───────────┘
                      │
     vector_estimator (10 diffusion steps)
                      │
              denoised_latent
                      │
                   vocoder
                      │
               audio (44.1kHz)
```
The `duration_predictor` outputs a single scalar representing the total speech duration. This is post-processed into a `latent_mask` that tells the `vector_estimator` how many of the 256 fixed-size latent frames are active speech versus padding.
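The duration-to-mask post-processing can be sketched as follows. This is a minimal illustration: the frame rate (`FRAMES_PER_SECOND`) and rounding behavior are assumptions, and the actual values live in the preparation notebook.

```python
import numpy as np

# Hypothetical constants -- the real values are defined in the
# Input_Preparation notebook, not here.
MAX_FRAMES = 256          # fixed latent length expected by vector_estimator
FRAMES_PER_SECOND = 21.5  # assumed latent frame rate

def make_latent_mask(duration_s: float) -> np.ndarray:
    """Turn the scalar duration into a (1, 1, 256) binary mask:
    leading frames are active speech, trailing frames are padding."""
    n_active = min(MAX_FRAMES, int(round(duration_s * FRAMES_PER_SECOND)))
    mask = np.zeros((1, 1, MAX_FRAMES), dtype=np.float32)
    mask[:, :, :n_active] = 1.0
    return mask

mask = make_latent_mask(3.0)  # ~3 seconds of speech
```

The `vector_estimator` then only denoises the frames where the mask is 1 into meaningful speech latents; the rest remain padding.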
## Workflow

1. **Input Preparation** (`Input_Preparation.ipynb`): Prepare calibration inputs for model quantization.
2. **Step-by-Step Quantization** (`Supertonic_TTS_StepbyStep.ipynb`): Convert the ONNX models to QNN format with quantization for the HTP backend.
3. **Correlation Verification** (`Correlation_Verification.ipynb`): Verify quantized model outputs against the float reference using cosine similarity.
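The verification step boils down to a cosine-similarity comparison between each quantized model's output buffer and the float reference. A minimal sketch (the real comparison, including file loading, lives in `Correlation_Verification.ipynb`):

```python
import numpy as np

def cosine_similarity(ref: np.ndarray, quant: np.ndarray) -> float:
    """Cosine similarity between two flattened output tensors."""
    ref = ref.ravel().astype(np.float64)
    quant = quant.ravel().astype(np.float64)
    denom = np.linalg.norm(ref) * np.linalg.norm(quant)
    return float(np.dot(ref, quant) / denom) if denom else 0.0

# Synthetic example: a reference tensor plus small "quantization noise".
rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 128, 256)).astype(np.float32)
quant = ref + 0.01 * rng.standard_normal(ref.shape).astype(np.float32)
score = cosine_similarity(ref, quant)  # close to 1.0 for a good quantization
```

Scores near 1.0 indicate the quantized graph tracks the float model; a threshold (e.g. > 0.99) is a reasonable, though here assumed, acceptance criterion.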
## Project Structure

```
├── Input_Preparation.ipynb          # Prepare calibration inputs
├── Supertonic_TTS_StepbyStep.ipynb  # ONNX → QNN quantization guide
├── Correlation_Verification.ipynb   # Output verification
├── assets/                          # ONNX models (git submodule)
│   └── onnx/
│       ├── text_encoder.onnx
│       ├── duration_predictor.onnx
│       ├── vector_estimator.onnx
│       └── vocoder.onnx
├── QNN_Models/                      # Quantized QNN models (.bin, .cpp)
├── QNN_Model_lib/                   # QNN runtime libraries (aarch64)
├── qnn_calibration/                 # Calibration data for verification
├── inputs/                          # Prepared input data
└── board_output/                    # Inference outputs from board
```
## Models

| Model | Description |
|---|---|
| `text_encoder` | Encodes text tokens with style embedding |
| `duration_predictor` | Predicts phoneme durations |
| `vector_estimator` | Diffusion-based latent generator (10 steps) |
| `vocoder` | Converts latent to audio waveform |
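The `vector_estimator`'s 10 diffusion steps amount to a fixed-step integration loop over the latent. The sketch below shows the control flow with a toy callable standing in for the ONNX/QNN graph; the stand-in estimator, step schedule, and Euler update rule are all assumptions for illustration, not the model's actual solver.

```python
import numpy as np

NUM_STEPS = 10  # matches the 10 diffusion steps used on the board

def toy_estimator(latent, text_emb, t):
    # Hypothetical stand-in for the vector_estimator graph:
    # a velocity field pushing the latent toward the text embedding.
    return text_emb - latent

def run_diffusion(text_emb, latent_mask, estimator=toy_estimator):
    rng = np.random.default_rng(0)
    latent = rng.standard_normal(text_emb.shape).astype(np.float32)
    dt = 1.0 / NUM_STEPS
    for step in range(NUM_STEPS):
        t = step * dt
        v = estimator(latent, text_emb, t)  # one estimator call per step
        latent = latent + dt * v            # fixed-step Euler update
    return latent * latent_mask             # zero out padding frames

text_emb = np.ones((1, 128, 256), dtype=np.float32)
mask = np.zeros((1, 1, 256), dtype=np.float32)
mask[..., :64] = 1.0
denoised = run_diffusion(text_emb, mask)
```

Because the step count is fixed at 10, the quantized estimator is simply invoked 10 times per utterance, with the same graph and I/O shapes each time.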
### ONNX Models (Source)

Located in `assets/onnx/` (git submodule from Hugging Face):

- `text_encoder.onnx`
- `duration_predictor.onnx`
- `vector_estimator.onnx`
- `vocoder.onnx`
### QNN Models (Quantized)

Located in `QNN_Models/`:

- `text_encoder_htp.bin` / `.cpp`
- `vector_estimator_htp.bin` / `.cpp`
- `vocoder_htp.bin` / `.cpp`
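The `.bin`/`.cpp` pairs are produced by the QNN SDK converter and then compiled into model libraries. A command sketch for one model (flag spellings follow the QNN SDK tool documentation; the input-list path is an assumption — verify both against your installed SDK v2.37 and the step-by-step notebook):

```shell
# 1. Convert + quantize: the calibration input list drives quantization;
#    this emits text_encoder_htp.cpp and text_encoder_htp.bin.
qnn-onnx-converter \
    --input_network assets/onnx/text_encoder.onnx \
    --input_list inputs/text_encoder_input_list.txt \
    --output_path QNN_Models/text_encoder_htp.cpp

# 2. Compile the generated sources into a deployable model library (.so)
#    for the board's aarch64 toolchain.
qnn-model-lib-generator \
    -c QNN_Models/text_encoder_htp.cpp \
    -b QNN_Models/text_encoder_htp.bin \
    -t aarch64-oe-linux-gcc11.2 \
    -o QNN_Model_lib
```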
### Compiled Libraries (Ready for Deployment)

Located in `QNN_Model_lib/aarch64-oe-linux-gcc11.2/`:

- `libtext_encoder_htp.so`
- `libvector_estimator_htp.so`
- `libvocoder_htp.so`
- `libduration_predictor_htp.so`

These `.so` files are compiled from the `.cpp` sources and are ready to be deployed (via SCP) to the board for inference.
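Deployment and a smoke-test run can be sketched as below. The board IP, directory layout, and input-list path are placeholders; `qnn-net-run` and its options come from the QNN SDK tool documentation — confirm against your SDK version:

```shell
# Copy a compiled model library and its input list to the board
scp QNN_Model_lib/aarch64-oe-linux-gcc11.2/libtext_encoder_htp.so root@BOARD_IP:/data/tts/
scp inputs/text_encoder_input_list.txt root@BOARD_IP:/data/tts/

# On the board: run the model through the HTP backend and collect raw outputs
qnn-net-run \
    --backend libQnnHtp.so \
    --model /data/tts/libtext_encoder_htp.so \
    --input_list /data/tts/text_encoder_input_list.txt \
    --output_dir /data/tts/board_output
```

The raw output buffers written to `board_output/` are what the correlation notebook compares against the float reference.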
**Note:** The `duration_predictor` is quantized and compiled but not used in the current calibration-based workflow, since `latent_mask` is precomputed. For an end-to-end pipeline with arbitrary text input, the duration predictor must run first to generate the `latent_mask` dynamically.
## Getting Started

Clone with submodules:

```
git clone --recurse-submodules https://github.com/dev-ansh-r/Supertonic-TTS-QCS6490
```

Then follow the notebooks in order:

1. `Input_Preparation.ipynb`
2. `Supertonic_TTS_StepbyStep.ipynb`
3. `Correlation_Verification.ipynb`
## Note

An inference script and sample application are not yet provided; optimization work is ongoing and will be released soon.
## License

This model inherits its licensing from Supertone/supertonic-2:

- Model: OpenRAIL-M License
- Code: MIT License

Copyright (c) 2026 Supertone Inc. (original model)
## Model Tree

`dev-ansh-r/qualcomm-Supertonic-TTS-QCS6490` is derived from the base model `Supertone/supertonic-2`.