Approximetal committed on
Commit
f36e46d
·
verified ·
1 Parent(s): e702978

Upload folder using huggingface_hub

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. README.md +116 -9
  2. app.py +19 -0
  3. apt.txt +3 -0
  4. gradio_mix.py +1209 -0
  5. inference_gradio.py +576 -0
  6. lemas_tts/__init__.py +6 -0
  7. lemas_tts/api.py +306 -0
  8. lemas_tts/configs/multilingual_grl.yaml +78 -0
  9. lemas_tts/configs/multilingual_prosody.yaml +78 -0
  10. lemas_tts/infer/edit_multilingual.py +184 -0
  11. lemas_tts/infer/frontend.py +251 -0
  12. lemas_tts/infer/infer_cli.py +386 -0
  13. lemas_tts/infer/text_norm/__init__.py +0 -0
  14. lemas_tts/infer/text_norm/cn_tn.py +824 -0
  15. lemas_tts/infer/text_norm/en_tn.py +178 -0
  16. lemas_tts/infer/text_norm/gp2py.py +148 -0
  17. lemas_tts/infer/text_norm/id_tn.py +275 -0
  18. lemas_tts/infer/text_norm/jieba_dict.txt +0 -0
  19. lemas_tts/infer/text_norm/pinyin-lexicon-r.txt +4120 -0
  20. lemas_tts/infer/text_norm/symbols.py +419 -0
  21. lemas_tts/infer/text_norm/tokenizer.py +235 -0
  22. lemas_tts/infer/text_norm/txt2pinyin.py +225 -0
  23. lemas_tts/infer/utils_infer.py +661 -0
  24. lemas_tts/model/backbones/README.md +20 -0
  25. lemas_tts/model/backbones/dit.py +254 -0
  26. lemas_tts/model/backbones/ecapa_tdnn.py +931 -0
  27. lemas_tts/model/backbones/mmdit.py +189 -0
  28. lemas_tts/model/backbones/prosody_encoder.py +433 -0
  29. lemas_tts/model/backbones/unett.py +250 -0
  30. lemas_tts/model/cfm.py +899 -0
  31. lemas_tts/model/modules.py +802 -0
  32. lemas_tts/model/utils.py +190 -0
  33. lemas_tts/scripts/inference_gradio.py +584 -0
  34. requirements.txt +185 -0
  35. uvr5/gui_data/constants.py +1147 -0
  36. uvr5/lib_v5/mdxnet.py +140 -0
  37. uvr5/lib_v5/mixer.ckpt +3 -0
  38. uvr5/lib_v5/modules.py +74 -0
  39. uvr5/lib_v5/pyrb.py +92 -0
  40. uvr5/lib_v5/spec_utils.py +703 -0
  41. uvr5/lib_v5/vr_network/__init__.py +1 -0
  42. uvr5/lib_v5/vr_network/layers.py +143 -0
  43. uvr5/lib_v5/vr_network/layers_new.py +126 -0
  44. uvr5/lib_v5/vr_network/model_param_init.py +59 -0
  45. uvr5/lib_v5/vr_network/modelparams/1band_sr16000_hl512.json +19 -0
  46. uvr5/lib_v5/vr_network/modelparams/1band_sr32000_hl512.json +19 -0
  47. uvr5/lib_v5/vr_network/modelparams/1band_sr33075_hl384.json +19 -0
  48. uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl1024.json +19 -0
  49. uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl256.json +19 -0
  50. uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl512.json +19 -0
README.md CHANGED
@@ -1,12 +1,119 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
- title: LEMAS Edit
3
- emoji: 🏃
4
- colorFrom: red
5
- colorTo: yellow
6
- sdk: gradio
7
- sdk_version: 6.2.0
8
- app_file: app.py
9
- pinned: false
 
 
 
 
 
 
 
 
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # LEMAS-TTS Gradio Demo (Hugging Face Space)
2
+
3
+ This folder is a **clean, inference-only** version of LEMAS-TTS, organized for easy deployment on **Hugging Face Spaces**.
4
+
5
+ It keeps only:
6
+ - the inference models & configs (`lemas_tts`)
7
+ - pretrained checkpoints and vocab (`pretrained_models`)
8
+ - the bundled UVR5 denoiser (`uvr5`)
9
+ - a Gradio web UI (`inference_gradio.py`, `app.py`)
10
+
11
+ ---
12
+
13
+ ## 1. Features
14
+
15
+ - Zero-shot TTS: clone voice from a reference audio + reference text
16
+ - Multilingual text input (Chinese / English / ES / IT / PT / DE, etc.)
17
+ - Optional UVR5-based reference denoising
18
+ - Two custom LEMAS checkpoints:
19
+ - `multilingual_prosody_custom`
20
+ - `multilingual_acc_grl_custom`
21
+
22
+ ---
23
+
24
+ ## 2. Project Structure
25
+
26
+ ```text
27
+ LEMAS-TTS_gradio/
28
+ app.py # HF Space entrypoint (Gradio Blocks)
29
+ inference_gradio.py # Full Gradio UI & logic
30
+ requirements.txt # Minimal runtime dependencies
31
+
32
+ lemas_tts/ # Core LEMAS-TTS package (inference only)
33
+ api.py # F5TTS API (used by the UI)
34
+ configs/ # Model configs (F5TTS / E2TTS)
35
+ infer/ # Inference utilities & text frontend
36
+ model/ # DiT backbone, utils, etc.
37
+
38
+ pretrained_models/ # All local assets needed for inference
39
+ ckpts/
40
+ F5TTS_v1_Base_vocos_custom_multilingual_prosody/model_2698000.pt
41
+ F5TTS_v1_Base_vocos_custom_multilingual_acc_grl/model_2680000.pt
42
+ prosody_encoder/...
43
+ vocos-mel-24khz/...
44
+ data/
45
+ multilingual_prosody_custom/vocab.txt
46
+ multilingual_acc_grl_custom/vocab.txt
47
+ test_examples/*.wav # Demo audios used in the UI
48
+ uvr5/
49
+ models/MDX_Net_Models/model_data/*.onnx, *.json
50
+
51
+ uvr5/ # Bundled UVR5 implementation for denoising
52
+ ```
53
+
54
+ `lemas_tts.api.F5TTS` automatically resolves `pretrained_models/` based on the repo layout, so no extra path configuration is required.
55
+
56
+ ---
57
+
58
+ ## 3. How to Run Locally
59
+
60
+ ```bash
61
+ cd LEMAS-TTS_gradio
62
+ pip install -r requirements.txt
63
+ python app.py
64
+ ```
65
+
66
+ Then open the printed URL (default `http://127.0.0.1:7860`) in your browser.
67
+
68
  ---
69
+
70
+ ## 4. Hugging Face Space Setup
71
+
72
+ 1. Create a new Space (type: **Gradio**).
73
+ 2. Upload the contents of `LEMAS-TTS_gradio/` to the Space repo:
74
+ - `app.py`
75
+ - `inference_gradio.py`
76
+ - `requirements.txt`
77
+ - `lemas_tts/`
78
+ - `pretrained_models/`
79
+ - `uvr5/`
80
+ 3. In the Space settings, choose a GPU hardware profile (the model is heavy).
81
+ 4. The Space will automatically run `app.py` and launch the Gradio Blocks named `app`.
82
+
83
+ No extra arguments are needed; all paths are relative inside the repo.
84
+
85
  ---
86
 
87
+ ## 5. Usage Tips
88
+
89
+ - **Reference Text** should match the reference audio roughly in content and language for best voice cloning.
90
+ - **Denoise**:
91
+ - Turn on if your reference audio is noisy; it runs UVR5 on CPU.
92
+ - Turn off if the reference is already clean (saves time).
93
+ - **Seed**:
94
+ - `-1` → random seed
95
+ - Any other integer → reproducible output
96
+
97
+ ---
98
+
99
+ ## 6. 中文说明(简要)
100
+
101
+ 这个目录是专门为 **Hugging Face Space** 打包的 **推理版 LEMAS-TTS**:
102
+
103
+ - 只保留推理相关代码(`lemas_tts`)、预训练模型(`pretrained_models`)和 UVR5 去噪模块(`uvr5`)
104
+ - Gradio 入口为 `app.py`,内部调用 `inference_gradio.py` 里的 `app`(一个 `gr.Blocks` 界面)
105
+ - `pretrained_models/` 下已经包含:
106
+ - 自定义多语种 prosody / accent GRL 的 finetune 权重
107
+ - vocoder(`vocos-mel-24khz`)
108
+ - prosody encoder
109
+ - 以及示例语音 `test_examples/*.wav`
110
+
111
+ 在本地或 Space 中运行步骤:
112
+
113
+ ```bash
114
+ pip install -r requirements.txt
115
+ python app.py
116
+ ```
117
+
118
+ 然后在浏览器中打开提示的链接即可使用零样本 TTS Demo。
119
+
app.py ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Gradio entrypoint for Hugging Face Spaces for LEMAS-Edit.
3
+
4
+ This file exposes the Blocks app defined in `gradio_mix.get_app`.
5
+ """
6
+
7
+ import gradio as gr # noqa: F401
8
+
9
+ from gradio_mix import get_app
10
+
11
+ _app = get_app()
12
+
13
+ # Expose as both `app` and `demo` for maximum compatibility
14
+ app = _app
15
+ demo = _app
16
+
17
+
18
+ if __name__ == "__main__":
19
+ app.queue(api_open=True).launch()
apt.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ ffmpeg
2
+ espeak-ng
3
+ espeak
gradio_mix.py ADDED
@@ -0,0 +1,1209 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os, gc
2
+ import re, time
3
+ import logging
4
+ from num2words import num2words
5
+ import gradio as gr
6
+ import torch, torchaudio
7
+ import numpy as np
8
+ import random
9
+ from scipy.io import wavfile
10
+ import onnx
11
+ import onnxruntime as ort
12
+ import copy
13
+ import uroman as ur
14
+ import jieba, zhconv
15
+ from pypinyin.core import Pinyin
16
+ from pypinyin import Style
17
+
18
+ from lemas_tts.api import TTS, PRETRAINED_ROOT, CKPTS_ROOT
19
+ from lemas_tts.infer.edit_multilingual import gen_wav_multilingual
20
+ from lemas_tts.infer.text_norm.txt2pinyin import (
21
+ MyConverter,
22
+ _PAUSE_SYMBOL,
23
+ change_tone_in_bu_or_yi,
24
+ get_phoneme_from_char_and_pinyin,
25
+ )
26
+ from lemas_tts.infer.text_norm.cn_tn import NSWNormalizer
27
+ # import io
28
+ # import uuid
29
# ---------------------------------------------------------------------------
# Module-level runtime setup
# ---------------------------------------------------------------------------
# Use the repo-bundled jieba dictionary (Chinese word segmentation) when it
# exists, instead of jieba's default.
_JIEBA_DICT = os.path.join(
    os.path.dirname(__file__),
    "lemas_tts",
    "infer",
    "text_norm",
    "jieba_dict.txt",
)
if os.path.isfile(_JIEBA_DICT):
    jieba.set_dictionary(_JIEBA_DICT)

import langid
# Restrict langid's search space to the languages the TTS frontend supports.
langid.set_languages(['es','pt','zh','en','de','fr','it', 'ar', 'ru', 'ja', 'ko', 'hi', 'th', 'id', 'vi'])

# Empty CA bundle disables curl certificate lookup (works around SSL issues
# on some hosts — NOTE(review): this weakens TLS verification).
os.environ['CURL_CA_BUNDLE'] = ''
DEMO_PATH = os.getenv("DEMO_PATH", "./demo")                    # demo asset dir
TMP_PATH = os.getenv("TMP_PATH", "./demo/temp")                 # scratch dir
MODELS_PATH = os.getenv("MODELS_PATH", "./pretrained_models")   # local model root

device = "cuda" if torch.cuda.is_available() else "cpu"
ASR_DEVICE = "cpu"  # force whisperx/pyannote to CPU to avoid cuDNN issues
# Globals populated lazily by load_models().
whisper_model, align_model = None, None
tts_edit_model = None

_whitespace_re = re.compile(r"\s+")
alpha_pattern = re.compile(r"[a-zA-Z]")

formatter = ("%(asctime)s [%(levelname)s] %(filename)s:%(lineno)d || %(message)s")
logging.basicConfig(format=formatter, level=logging.INFO)
65
+
66
+ # def get_random_string():
67
+ # return "".join(str(uuid.uuid4()).split("-"))
68
+
69
def seed_everything(seed):
    """Seed every RNG used by the pipeline; ``seed == -1`` means stay random."""
    if seed == -1:
        return
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # Trade cuDNN autotuning for reproducible kernels.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
78
+
79
+
80
+ # class AudioSR:
81
+ # def __init__(self, model_name):
82
+ # code_dir = "/cto_labs/vistring/zhaozhiyuan/code/SpeechAugment/versatile_audio_super_resolution"
83
+ # self.model = self.load_model(model_name, code_dir)
84
+ # self.sr = 48000
85
+ # self.chunk_size=10.24
86
+ # self.overlap=0.16
87
+ # self.guidance_scale=1
88
+ # self.ddim_steps=20
89
+ # self.multiband_ensemble=False
90
+
91
+ # def load_model(self, model_name, code_dir):
92
+ # import sys, json
93
+ # sys.path.append(code_dir)
94
+ # from inference import Predictor
95
+ # sr_model = Predictor()
96
+ # sr_model.setup(model_name)
97
+ # return sr_model
98
+
99
+ # def audiosr(self, in_wav, src_sr, tar_sr, chunk_size=10.24, overlap=0.16, seed=0, guidance_scale=1, ddim_steps=20, multiband_ensemble=False):
100
+ # if seed == 0:
101
+ # seed = random.randint(0, 2**32 - 1)
102
+ # print(f"Setting seed to: {seed}")
103
+ # print(f"overlap = {overlap}")
104
+ # print(f"guidance_scale = {guidance_scale}")
105
+ # print(f"ddim_steps = {ddim_steps}")
106
+ # print(f"chunk_size = {chunk_size}")
107
+ # print(f"multiband_ensemble = {multiband_ensemble}")
108
+ # print(f"in_wav.shape = {in_wav.shape}")
109
+
110
+ # in_wav = torchaudio.functional.resample(in_wav.squeeze(), src_sr, 24000)
111
+ # in_wav = in_wav.squeeze().numpy()
112
+
113
+ # out_wav = self.model.process_audio(
114
+ # in_wav, 24000,
115
+ # chunk_size=chunk_size,
116
+ # overlap=overlap,
117
+ # seed=seed,
118
+ # guidance_scale=guidance_scale,
119
+ # ddim_steps=ddim_steps,
120
+ # multiband_ensemble=multiband_ensemble,
121
+ # )
122
+ # out_wav = out_wav[:int(self.sr*in_wav.shape[0]/24000)].T
123
+ # if tar_sr != self.sr:
124
+ # out_wav = torchaudio.functional.resample(torch.from_numpy(out_wav).squeeze(), self.sr, tar_sr)
125
+ # else:
126
+ # out_wav = torch.from_numpy(out_wav)
127
+ # print(f"out.shape = {out_wav.shape} tar_sr={tar_sr}")
128
+ # return out_wav.squeeze()
129
+
130
+
131
class UVR5:
    """Small wrapper around the bundled uvr5 implementation for denoising."""

    def __init__(self, model_dir):
        # model_dir holds Kim_Vocal_1.onnx and its MDX-Net JSON config.
        code_dir = os.path.join(os.path.dirname(__file__), "uvr5")
        self.model = self.load_model(model_dir, code_dir)

    def load_model(self, model_dir, code_dir):
        """Build an MDX-Net ``Inference`` object from the bundled uvr5 package.

        Appends ``code_dir`` to ``sys.path`` so the bundled package imports.
        """
        import sys, json
        if code_dir not in sys.path:
            sys.path.append(code_dir)
        from multiprocess_cuda_infer import ModelData, Inference
        model_path = os.path.join(model_dir, "Kim_Vocal_1.onnx")
        config_path = os.path.join(model_dir, "MDX-Net-Kim-Vocal1.json")
        with open(config_path, "r", encoding="utf-8") as f:
            configs = json.load(f)
        model_data = ModelData(
            model_path=model_path,
            audio_path = model_dir,
            result_path = model_dir,
            device = 'cpu',  # denoising deliberately runs on CPU
            process_method = "MDX-Net",
            base_dir=model_dir,
            **configs
        )

        uvr5_model = Inference(model_data, 'cpu')
        uvr5_model.load_model(model_path, 1)
        return uvr5_model

    def denoise(self, audio_info):
        """Denoise a Gradio ``(sr, ndarray)`` pair.

        Returns ``(waveform ndarray, 44100)``; UVR5 expects stereo 44.1 kHz.
        """
        input_audio = load_wav(audio_info, sr=44100, channel=2)
        # NOTE(review): demix_base is assumed to return a torch tensor with
        # channels-first layout — confirm against the bundled uvr5 code.
        output_audio = self.model.demix_base({0:input_audio.squeeze()}, is_match_mix=False)
        return output_audio.squeeze().T.numpy(), 44100
167
+
168
+
169
class DeepFilterNet:
    """Streaming ONNX DeepFilterNet denoiser running frame-by-frame on CPU."""

    def __init__(self, model_path):
        self.hop_size = 480   # samples per streaming frame (10 ms @ 48 kHz)
        self.fft_size = 960   # analysis window; also the lookahead padding
        self.model = self.load_model(model_path)


    def load_model(self, model_path, threads=1):
        """Create a single-threaded CPU onnxruntime session for the model."""
        sess_options = ort.SessionOptions()
        sess_options.intra_op_num_threads = threads
        sess_options.graph_optimization_level = (ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED)
        sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

        model = onnx.load_model(model_path)
        ort_session = ort.InferenceSession(
            model.SerializeToString(),
            sess_options,
            providers=["CPUExecutionProvider"],
        )

        # NOTE(review): these two locals only document the model's I/O names
        # and are otherwise unused.
        input_names = ["input_frame", "states", "atten_lim_db"]
        output_names = ["enhanced_audio_frame", "new_states", "lsnr"]
        return ort_session


    def denoise(self, audio_info):
        """Denoise a Gradio ``(sr, ndarray)`` pair; returns ``(ndarray, 48000)``."""
        wav = load_wav(audio_info, 48000)
        orig_len = wav.shape[-1]
        # Pad so length divides evenly into hop_size frames, plus fft_size of
        # trailing lookahead for the final frames.
        hop_size_divisible_padding_size = (self.hop_size - orig_len % self.hop_size) % self.hop_size
        orig_len += hop_size_divisible_padding_size
        wav = torch.nn.functional.pad(
            wav, (0, self.fft_size + hop_size_divisible_padding_size)
        )
        chunked_audio = torch.split(wav, self.hop_size)

        # Recurrent model state carried across successive frames.
        state = np.zeros(45304,dtype=np.float32)
        atten_lim_db = np.zeros(1,dtype=np.float32)
        enhanced = []
        for frame in chunked_audio:
            out = self.model.run(None,input_feed={"input_frame":frame.numpy(),"states":state,"atten_lim_db":atten_lim_db})
            enhanced.append(torch.tensor(out[0]))
            state = out[1]

        enhanced_audio = torch.cat(enhanced).unsqueeze(0)  # [t] -> [1, t] typical mono format

        # Drop the algorithmic delay (fft_size - hop_size samples).
        d = self.fft_size - self.hop_size
        enhanced_audio = enhanced_audio[:, d: orig_len + d]

        return enhanced_audio.squeeze().numpy(), 48000
219
+
220
+
221
class TextNorm():
    """Chinese text normalization: pause tagging, transcript rebuilding,
    and character-to-pinyin/phoneme conversion for the TTS frontend."""

    def __init__(self):
        my_pinyin = Pinyin(MyConverter())
        self.pinyin_parser = my_pinyin.pinyin

    def sil_type(self, time_s):
        """Map a silence duration in seconds to a pause tag ("" / #1..#4).

        NOTE(review): ``round(time_s)`` rounds to an *integer*, so the
        0.4 / 0.8 / 1.5 thresholds effectively behave as 0 / 1 / 2 — confirm
        whether ``round(time_s, 1)`` was intended.
        """
        if round(time_s) < 0.4:
            return ""
        elif round(time_s) >= 0.4 and round(time_s) < 0.8:
            return "#1"
        elif round(time_s) >= 0.8 and round(time_s) < 1.5:
            return "#2"
        elif round(time_s) >= 1.5 and round(time_s) < 3.0:
            return "#3"
        elif round(time_s) >= 3.0:
            return "#4"


    def add_sil_raw(self, sub_list, start_time, end_time, target_transcript):
        """Rebuild a transcript string with pause tags inserted between words,
        substituting ``target_transcript`` for the words inside the edit window.

        Args:
            sub_list: word dicts with "word"/"start"/"end" keys.
            start_time, end_time: the edit window in seconds.
            target_transcript: replacement text for the window.
        """
        txt = []
        txt_list = [x["word"] for x in sub_list]
        sil = self.sil_type(sub_list[0]["start"])
        if len(sil) > 0:
            txt.append(sil)
        txt.append(txt_list[0])
        for i in range(1, len(sub_list)):
            if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
                txt.append(target_transcript)
                target_transcript = ""  # substitute only once for the window
            else:
                sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
                if len(sil) > 0:
                    txt.append(sil)
                txt.append(txt_list[i])
        return ' '.join(txt)

    def add_sil(self, sub_list, start_time, end_time, target_transcript, src_lang, tar_lang):
        """Like ``add_sil_raw`` but returns ``[[lang, text], ...]`` runs, using
        ``tar_lang`` for the replacement and merging same-language neighbors."""
        txts = []
        txt_list = [x["word"] for x in sub_list]
        sil = self.sil_type(sub_list[0]["start"])
        if len(sil) > 0:
            txts.append([src_lang, sil])

        if sub_list[0]["start"] < start_time:
            txts.append([src_lang, txt_list[0]])
        for i in range(1, len(sub_list)):
            if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
                txts.append([tar_lang, target_transcript])
                target_transcript = ""  # substitute only once
            else:
                sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
                if len(sil) > 0:
                    txts.append([src_lang, sil])
                txts.append([src_lang, txt_list[i]])

        # Merge consecutive entries that share a language tag.
        target_txt = [txts[0]]
        for txt in txts[1:]:
            if txt[1] == "":
                continue
            if txt[0] != target_txt[-1][0]:
                target_txt.append([txt[0], ""])
            target_txt[-1][-1] += " " + txt[1]

        return target_txt


    def get_prompt(self, sub_list, start_time, end_time, src_lang):
        """Collect ``[[lang, text], ...]`` runs (with pause tags) for the words
        inside [start_time, end_time], for use as the prompt transcript."""
        txts = []
        txt_list = [x["word"] for x in sub_list]

        if start_time <= sub_list[0]["start"]:
            sil = self.sil_type(sub_list[0]["start"])
            if len(sil) > 0:
                txts.append([src_lang, sil])
            txts.append([src_lang, txt_list[0]])

        for i in range(1, len(sub_list)):
            if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
                sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
                if len(sil) > 0:
                    txts.append([src_lang, sil])
                txts.append([src_lang, txt_list[i]])

        # Merge consecutive entries that share a language tag.
        target_txt = [txts[0]]
        for txt in txts[1:]:
            if txt[1] == "":
                continue
            if txt[0] != target_txt[-1][0]:
                target_txt.append([txt[0], ""])
            target_txt[-1][-1] += " " + txt[1]
        return target_txt


    def txt2pinyin(self, text):
        """Normalize *text* and convert it to parallel char/phoneme lists.

        Splits on embedded pause tags (#1..#4), normalizes each segment with
        NSWNormalizer, segments with jieba, and converts Chinese words to
        pinyin-derived phonemes. Returns ``(txts, phonemes)``.
        """
        txts, phonemes = [], []
        texts = re.split(r"(#\d)", text)
        print("before norm: ", texts)
        for text in texts:
            if text in {'#1', '#2', '#3', '#4'}:
                # Pause tags pass through to both streams unchanged.
                txts.append(text)
                phonemes.append(text)
                continue
            text = NSWNormalizer(text.strip()).normalize()

            text_list = list(jieba.cut(text))
            print("jieba cut: ", text, text_list)
            for words in text_list:
                if words in _PAUSE_SYMBOL:
                    # Punctuation maps to a pause symbol attached to the
                    # previous token.
                    phonemes[-1] += _PAUSE_SYMBOL[words]
                    txts[-1] += words
                elif re.search("[\u4e00-\u9fa5]+", words):
                    # CJK word: convert to numbered-tone pinyin, then phonemes.
                    pinyin = self.pinyin_parser(words, style=Style.TONE3, errors="ignore")
                    new_pinyin = []
                    for x in pinyin:
                        x = "".join(x)
                        if "#" not in x:
                            new_pinyin.append(x)
                        else:
                            phonemes.append(words)
                            continue
                    # Tone sandhi for 不/一 unless the word ends with them.
                    new_pinyin = change_tone_in_bu_or_yi(words, new_pinyin) if len(words)>1 and words[-1] not in {"一","不"} else new_pinyin
                    phoneme = get_phoneme_from_char_and_pinyin(words, new_pinyin)
                    phonemes += phoneme
                    txts += list(words)
                elif re.search(r"[a-zA-Z]", words) or re.search(r"#[1-4]", words):
                    # Latin text / explicit pause tags pass through verbatim.
                    phonemes.append(words)
                    txts.append(words)
        return txts, phonemes
355
+
356
+
357
+
358
def chunk_text(text, max_chars=135):
    """
    Split *text* into chunks of at most ``max_chars`` UTF-8 bytes each.

    Sentences (delimited by ASCII punctuation followed by whitespace, or CJK
    punctuation) are kept whole and greedily packed into chunks.

    Args:
        text (str): The text to be split.
        max_chars (int): Maximum UTF-8 byte length per chunk.

    Returns:
        List[str]: The text chunks, stripped of surrounding whitespace.
    """
    sentences = re.split(r"(?<=[;:,.!?])\s+|(?<=[;:,。!?])", text)

    chunks, buf = [], ""
    for sentence in sentences:
        # Sentences ending in a single-byte (ASCII) char get a space separator.
        if sentence and len(sentence[-1].encode("utf-8")) == 1:
            piece = sentence + " "
        else:
            piece = sentence
        if len(buf.encode("utf-8")) + len(sentence.encode("utf-8")) <= max_chars:
            buf += piece
        else:
            if buf:
                chunks.append(buf.strip())
            buf = piece

    if buf:
        chunks.append(buf.strip())

    return chunks
386
+
387
+
388
class MMSAlignModel:
    """Forced aligner built on torchaudio's MMS_FA pipeline + uroman."""

    def __init__(self):
        from torchaudio.pipelines import MMS_FA as bundle
        self.mms_model = bundle.get_model()
        self.mms_model.to(device)
        self.mms_tokenizer = bundle.get_tokenizer()
        self.mms_aligner = bundle.get_aligner()
        # Romanizer used to map arbitrary scripts to the aligner's alphabet.
        self.text_normalizer = ur.Uroman()


    def text_normalization(self, text_list):
        """Lower-case each word, keeping letters and apostrophes; hyphens and
        otherwise-empty words become the wildcard '*'."""
        text_normalized = []
        for word in text_list:
            text_char = ''
            for c in word:
                if c.isalpha() or c=="'":
                    text_char += c.lower()
                elif c == "-":
                    text_char += '*'
            text_char = text_char if len(text_char) > 0 else "*"
            text_normalized.append(text_char)
        assert len(text_normalized) == len(text_list), f"normalized text len != raw text len: {len(text_normalized)} != {text_list}"
        return text_normalized

    def compute_alignments(self, waveform: torch.Tensor, tokens):
        """Run the acoustic model and aligner; returns (emission, token_spans)."""
        with torch.inference_mode():
            emission, _ = self.mms_model(waveform.to(device))
            token_spans = self.mms_aligner(emission[0], tokens)
        return emission, token_spans


    def align(self, data, wav):
        """Align ``data['text']`` against *wav* (a Gradio ``(sr, ndarray)`` pair).

        Returns a dict with per-word start/end times (seconds) and scores.
        """
        waveform = load_wav(wav, 16000).unsqueeze(0)
        raw_text = data['text'][0]
        text = " ".join(data['text'][1]).replace("-", " ")
        text = re.sub("\s+", " ", text)
        # Romanize, then strip everything outside the aligner's alphabet.
        text_normed = self.text_normalizer.romanize_string(text, lcode=data["lang"])
        # NOTE(review): local name 'fliter' is a typo for 'filter' (kept as-is).
        fliter = re.compile("[^a-z^*^'^ ]")
        text_normed = fliter.sub('', text_normed.lower())
        text_normed = re.sub("\s+", " ", text_normed)
        text_normed = text_normed.split()
        # Alignment requires a 1:1 word correspondence with the raw text.
        assert len(text_normed) == len(raw_text), f"normalized text len != raw text len: {len(text_normed)} != {len(raw_text)}"
        tokens = self.mms_tokenizer(text_normed)
        with torch.inference_mode():
            emission, _ = self.mms_model(waveform.to(device))
            token_spans = self.mms_aligner(emission[0], tokens)
        num_frames = emission.size(1)
        ratio = waveform.size(1) / num_frames
        res = []
        for i in range(len(token_spans)):
            # Mean span score; frame indices converted to seconds at 16 kHz.
            score = round(sum([x.score for x in token_spans[i]]) / len(token_spans[i]), ndigits=3)
            start = round(waveform.size(-1) * token_spans[i][0].start / num_frames / 16000, ndigits=3)
            end = round(waveform.size(-1) * token_spans[i][-1].end / num_frames / 16000, ndigits=3)
            res.append({"word": raw_text[i], "start": start, "end": end, "score": score})

        res = {"lang":data["lang"], "start": 0, "end": round(waveform.shape[-1]/16000, ndigits=3), "text_raw":data["text_raw"], "text": text, "words": res}
        return res
446
+
447
+
448
class WhisperxModel:
    """WhisperX-based transcriber; output feeds the global forced aligner."""

    def __init__(self, model_name):
        """Load a whisperx ASR model on ``ASR_DEVICE``, preferring a local VAD
        checkpoint under ``MODELS_PATH`` to avoid network downloads."""
        from whisperx import load_model
        from pathlib import Path
        prompt = None

        # Prefer a local VAD model (to avoid network download / 301 issues)
        vad_fp = Path(MODELS_PATH) / "whisperx-vad-segmentation.bin"
        if not vad_fp.is_file():
            logging.warning(
                "Local whisperx VAD not found at %s, falling back to default download path.",
                vad_fp,
            )
            vad_fp = None

        self.model = load_model(
            model_name,
            ASR_DEVICE,
            compute_type="float32",
            asr_options={
                "suppress_numerals": True,
                "max_new_tokens": None,
                "clip_timestamps": None,
                "initial_prompt": prompt,
                "append_punctuations": ".。,,!!??::、",
                "hallucination_silence_threshold": None,
                "multilingual": True,
                "hotwords": None
            },
            vad_model_fp=str(vad_fp) if vad_fp is not None else None,
        )

    def transcribe(self, audio_info, lang=None):
        """Transcribe a Gradio ``(sr, ndarray)`` pair and forward to the aligner.

        Args:
            audio_info: (sample_rate, waveform) tuple.
            lang: optional language code; auto-detected when None.

        Returns:
            The aligner's result dict (see ``MMSAlignModel.align``).
        """
        audio = load_wav(audio_info).numpy()
        if lang is None:
            lang = self.model.detect_language(audio)
        # BUGFIX: namedtuple._replace returns a NEW object; the original code
        # discarded it, so the Simplified-Chinese prompt was never applied.
        if lang == 'zh':
            self.model.options = self.model.options._replace(initial_prompt="简体中文:")
        else:
            self.model.options = self.model.options._replace(initial_prompt=None)
        segments = self.model.transcribe(audio, batch_size=8, language=lang)["segments"]
        transcript = " ".join([segment["text"] for segment in segments])

        # If whisper detected an unsupported language, re-detect from the text
        # and transcribe again with the corrected language.
        if lang not in {'es','pt','zh','en','de','fr','it', 'ar', 'ru', 'ja', 'ko', 'hi', 'th', 'id', 'vi'}:
            lang = langid.classify(transcript)[0]
            segments = self.model.transcribe(audio, batch_size=8, language=lang)["segments"]
            transcript = " ".join([segment["text"] for segment in segments])
        logging.debug(f"whisperx: {segments}")

        transcript = zhconv.convert(transcript, 'zh-hans')
        transcript = transcript.replace("-", " ")
        transcript = re.sub(_whitespace_re, " ", transcript)
        # BUGFIX: guard against an empty transcript before stripping the
        # single leading space (the original indexed transcript[0] directly,
        # raising IndexError on silent audio).
        if transcript.startswith(" "):
            transcript = transcript[1:]
        segments = {'lang':lang, 'text_raw':transcript}
        if lang == "zh":
            segments["text"] = text_norm.txt2pinyin(transcript)
        else:
            transcript = replace_numbers_with_words(transcript, lang=lang).split(' ')
            segments["text"] = (transcript, transcript)

        return align_model.align(segments, audio_info)
509
+
510
+
511
def load_wav(audio_info, sr=16000, channel=1):
    """
    Convert a Gradio-style ``(sample_rate, np.ndarray)`` pair to a torch tensor.

    Args:
        audio_info: tuple of (source sample rate, waveform). The waveform may
            be mono ``(t,)`` or stereo ``(t, 2)`` / ``(2, t)``.
        sr: target sample rate; resampled via torchaudio when it differs.
        channel: 1 -> mono output (channel mean), 2 -> stereo (duplicated).

    Returns:
        A squeezed float tensor, peak-normalized and clipped to +/-0.999.
    """
    raw_sr, audio = audio_info
    # (t, 2) -> (2, t): put the channel axis first.
    audio = audio.T if len(audio.shape) > 1 and audio.shape[1] == 2 else audio
    # BUGFIX: guard the peak normalization — dividing by a zero peak turned
    # all-silent input into NaNs.
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    audio = torch.from_numpy(audio).squeeze().float()
    if channel == 1 and len(audio.shape) == 2:  # stereo to mono
        audio = audio.mean(dim=0, keepdim=True)
    elif channel == 2 and len(audio.shape) == 1:
        audio = torch.stack((audio, audio))  # mono to stereo
    if raw_sr != sr:
        audio = torchaudio.functional.resample(audio.squeeze(), raw_sr, sr)
    audio = torch.clip(audio, -0.999, 0.999).squeeze()
    return audio
524
+
525
+
526
def update_word_time(lst, cut_time, edit_start, edit_end):
    """Shift word timestamps (and the edit window) left by ``cut_time`` seconds.

    Mutates the dicts in *lst* in place; the shifted edit start is clamped to 0.
    Returns ``(lst, new_edit_start, new_edit_end)``.
    """
    for entry in lst:
        entry["start"] = round(entry["start"] - cut_time, ndigits=3)
        entry["end"] = round(entry["end"] - cut_time, ndigits=3)
    shifted_start = max(round(edit_start - cut_time, ndigits=3), 0)
    shifted_end = round(edit_end - cut_time, ndigits=3)
    return lst, shifted_start, shifted_end
533
+
534
+
535
+ # def update_word_time2(lst, cut_time, edit_start, edit_end):
536
+ # for i in range(len(lst)):
537
+ # lst[i]["start"] = round(lst[i]["start"] + cut_time, ndigits=3)
538
+ # return lst, edit_start, edit_end
539
+
540
+
541
def get_audio_slice(audio, words_info, start_time, end_time, max_len=10, sr=16000, code_sr=50):
    """Cut a context window of ~``max_len`` seconds around the edit region.

    Args:
        audio: waveform tensor (last dim = samples at ``sr``).
        words_info: word dicts with "start"/"end" times in seconds.
        start_time, end_time: edit region in seconds.
        max_len: total context window length in seconds.
        sr: audio sample rate; code_sr: codec frame rate (for chunk rounding).

    Returns:
        ``((head, middle, tail), (sub_list, new_start, new_end))`` where the
        middle slice covers the selected words and times are rebased to it.
    """
    audio_dur = audio.shape[-1] / sr
    sub_list = []
    # If less than max_len/2 remains after the edit region, keep everything
    # to the end and trim from the front.
    if audio_dur - end_time <= max_len/2:
        for word in reversed(words_info):
            if word['start'] > start_time or audio_dur - word['start'] < max_len:
                sub_list = [word] + sub_list

    # If the edit region starts within max_len/2 of the beginning, keep the
    # head entirely and trim from the back.
    elif start_time <=max_len/2:
        for word in words_info:
            if word['end'] < max(end_time, max_len):
                sub_list += [word]

    # Otherwise keep max_len/2 of context on each side of the edit region.
    else:
        for word in words_info:
            if word['start'] > start_time - max_len/2 and word['end'] < end_time + max_len/2:
                sub_list += [word]
    audio = audio.squeeze()

    start = int(sub_list[0]['start']*sr)
    end = int(sub_list[-1]['end']*sr)
    end -= (end-start) % int(sr/code_sr)  # round down to a whole codec chunk

    # Rebase word times (and the edit window) to the middle slice's origin.
    sub_list, start_time, end_time = update_word_time(sub_list, sub_list[0]['start'], start_time, end_time)
    audio = audio.squeeze()

    return (audio[:start], audio[start:end], audio[end:]), (sub_list, start_time, end_time)
573
+
574
+
575
def load_models(lemas_model_name, whisper_model_name, alignment_model_name, denoise_model_name):
    """(Re)build all global inference models used by the Gradio UI.

    Args:
        lemas_model_name: LEMAS-TTS variant dir name under CKPTS_ROOT.
        whisper_model_name: whisperx ASR model name.
        alignment_model_name: "MMS" or whisperx-based alignment.
        denoise_model_name: "UVR5" or "DeepFilterNet".

    Returns:
        A fresh ``gr.Accordion`` (used by the UI to refresh the panel).
    """
    global transcribe_model, align_model, denoise_model, text_norm, tts_edit_model
    # Free GPU memory from any previously-loaded models before reloading.
    torch.cuda.empty_cache()
    gc.collect()

    if denoise_model_name == "UVR5":
        denoise_model = UVR5(os.path.join(str(PRETRAINED_ROOT), "uvr5"))
    elif denoise_model_name == "DeepFilterNet":
        denoise_model = DeepFilterNet("./audio_preprocess/denoiser_model.onnx")

    if alignment_model_name == "MMS":
        align_model = MMSAlignModel()
    else:
        # NOTE(review): WhisperxAlignModel is not defined in this portion of
        # the file — presumably defined elsewhere in gradio_mix.py; verify.
        align_model = WhisperxAlignModel()

    text_norm = TextNorm()

    transcribe_model = WhisperxModel(whisper_model_name)

    # Load LEMAS-TTS editing model (selected multilingual variant)
    from pathlib import Path

    ckpt_dir = Path(CKPTS_ROOT) / lemas_model_name
    # Pick the lexicographically-last checkpoint (highest step number).
    ckpt_candidates = sorted(
        list(ckpt_dir.glob("*.safetensors")) + list(ckpt_dir.glob("*.pt"))
    )
    if not ckpt_candidates:
        raise gr.Error(f"No LEMAS-TTS ckpt found under {ckpt_dir}")
    ckpt_file = str(ckpt_candidates[-1])

    vocab_file = Path(PRETRAINED_ROOT) / "data" / lemas_model_name / "vocab.txt"
    if not vocab_file.is_file():
        raise gr.Error(f"Vocab file not found: {vocab_file}")

    # Prosody encoder is optional: only enabled when both files are present.
    prosody_cfg = Path(CKPTS_ROOT) / "prosody_encoder" / "pretssel_cfg.json"
    prosody_ckpt = Path(CKPTS_ROOT) / "prosody_encoder" / "prosody_encoder_UnitY2.pt"
    use_prosody = prosody_cfg.is_file() and prosody_ckpt.is_file()

    tts_edit_model = TTS(
        model=lemas_model_name,
        ckpt_file=ckpt_file,
        vocab_file=str(vocab_file),
        device=device,
        use_ema=True,
        frontend="phone",
        use_prosody_encoder=use_prosody,
        prosody_cfg_path=str(prosody_cfg) if use_prosody else "",
        prosody_ckpt_path=str(prosody_ckpt) if use_prosody else "",
    )
    logging.info(f"Loaded LEMAS-TTS edit model from {ckpt_file}")

    return gr.Accordion()
634
+
635
+
636
def get_transcribe_state(segments):
    """Derive the UI transcript state dict from aligned segments.

    `segments` must carry 'text_raw' and a word list under 'words', each word a
    dict with 'word', 'start' and 'end' keys.
    """
    logging.info("===========After Align===========")
    logging.info(segments)
    words = segments["words"]
    with_starts = [f"{w['start']} {w['word']}" for w in words]
    with_ends = [f"{w['word']} {w['end']}" for w in words]
    bounds = [f"{w['start']} {w['word']} {w['end']}" for w in words]
    return {
        "segments": segments,
        "transcript": segments["text_raw"],
        "words_info": words,
        "transcript_with_start_time": " ".join(with_starts),
        "transcript_with_end_time": " ".join(with_ends),
        "word_bounds": bounds,
    }
647
+
648
+
649
def transcribe(seed, audio_info):
    """Run Whisper transcription and refresh the transcript boxes and word dropdowns."""
    if transcribe_model is None:
        raise gr.Error("Transcription model not loaded")
    seed_everything(seed)

    state = get_transcribe_state(transcribe_model.transcribe(audio_info))
    bounds = state["word_bounds"]
    from_word = gr.Dropdown(value=bounds[0], choices=bounds, interactive=True)
    to_word = gr.Dropdown(value=bounds[-1], choices=bounds, interactive=True)

    return [
        state["transcript"],
        state["transcript_with_start_time"],
        state["transcript_with_end_time"],
        # gr.Dropdown(value=state["word_bounds"][-1], choices=state["word_bounds"], interactive=True), # prompt_to_word
        from_word,  # edit_from_word
        to_word,    # edit_to_word
        state,
    ]
664
+
665
def align(transcript, audio_info, state):
    """Re-run forced alignment on a (possibly hand-corrected) transcript.

    Args:
        transcript: transcript text edited by the user in the UI.
        audio_info: Gradio audio value, forwarded to the alignment model.
        state: previous transcribe state; only the detected language is reused.

    Returns:
        The same 6-element update list as transcribe(): three transcript
        textboxes, the two edit-word dropdowns, and the refreshed state dict.
    """
    lang = state["segments"]["lang"]
    # print("realign: ", transcript, state)
    transcript = re.sub(_whitespace_re, " ", transcript)
    # Fix: the old `transcript[0] == " "` check raised IndexError when the
    # textbox was empty; startswith() is safe on "".
    if transcript.startswith(" "):
        transcript = transcript[1:]
    segments = {'lang': lang, 'text': transcript, 'text_raw': transcript}
    if lang == "zh":
        # Chinese goes through the pinyin front-end before alignment.
        segments["text"] = text_norm.txt2pinyin(transcript)
    else:
        # Other languages: spell out digits, then align whitespace tokens.
        transcript = replace_numbers_with_words(transcript)
        segments["text"] = (transcript.split(' '), transcript.split(' '))
    # print("text:", segments["text"])
    segments = align_model.align(segments, audio_info)

    state = get_transcribe_state(segments)

    return [
        state["transcript"], state["transcript_with_start_time"], state["transcript_with_end_time"],
        # gr.Dropdown(value=state["word_bounds"][-1], choices=state["word_bounds"], interactive=True), # prompt_to_word
        gr.Dropdown(value=state["word_bounds"][0], choices=state["word_bounds"], interactive=True),  # edit_from_word
        gr.Dropdown(value=state["word_bounds"][-1], choices=state["word_bounds"], interactive=True),  # edit_to_word
        state
    ]
688
+
689
+
690
def denoise(audio_info):
    """Denoise the input audio with the loaded model; returns (sr, samples)."""
    samples, sample_rate = denoise_model.denoise(audio_info)
    return (sample_rate, samples)
694
+
695
def cancel_denoise(audio_info):
    """Put the original (un-denoised) audio back into the denoise preview slot."""
    return audio_info
697
+
698
def get_output_audio(audio_tensors, sr):
    """Concatenate per-sentence tensors into one int16 clip for gr.Audio."""
    joined = torch.cat(audio_tensors, -1).squeeze().cpu().numpy()
    pcm16 = (joined * np.iinfo(np.int16).max).astype(np.int16)
    print("save result:", pcm16.shape)
    # wavfile.write(os.path.join(TMP_PATH, "output.wav"), sr, result)
    return (int(sr), pcm16)
705
+
706
+
707
def get_edit_audio_part(audio_info, edit_start, edit_end):
    """Cut the [edit_start, edit_end] span (in seconds) out of an (sr, samples) pair."""
    sr, samples = audio_info
    lo = int(edit_start * sr)
    hi = int(edit_end * sr)
    return (sr, samples[lo:hi])
711
+
712
+
713
def crossfade_concat(chunk1, chunk2, overlap):
    """Join two chunks with an equal-power (cos^2) crossfade over `overlap` samples.

    Note: the first `overlap` samples of `chunk2` are overwritten in place with
    the blended values before concatenation.
    """
    # Complementary cos^2 ramps: fade_out goes 1 -> 0, fade_in goes 0 -> 1.
    fade_out = torch.cos(torch.linspace(0, torch.pi / 2, overlap)) ** 2
    fade_in = torch.cos(torch.linspace(torch.pi / 2, 0, overlap)) ** 2
    tail = chunk1[-overlap:]
    chunk2[:overlap] = tail * fade_out + chunk2[:overlap] * fade_in
    return torch.cat((chunk1[:-overlap], chunk2), dim=0)
720
+
721
def replace_numbers_with_words(sentence, lang="en"):
    """Spell out integer digit runs in `sentence` using num2words.

    Digit runs are first isolated with surrounding spaces, then each run is
    converted to words; on any conversion failure the digits are kept as-is.

    Args:
        sentence: input text.
        lang: num2words language code (default "en").

    Returns:
        The sentence with digit runs replaced by their spelled-out form.
    """
    sentence = re.sub(r'(\d+)', r' \1 ', sentence)  # add spaces around numbers

    def replace_with_words(match):
        num = match.group(0)
        try:
            return num2words(num, lang=lang)  # Convert numbers to words
        # Fix: was a bare `except:`, which also swallowed KeyboardInterrupt
        # and SystemExit; Exception still covers num2words failures.
        except Exception:
            return num  # In case num2words fails (unlikely with digits but just to be safe)

    return re.sub(r'\b\d+\b', replace_with_words, sentence)
730
+
731
+
732
def run(seed, nfe_step, speed, cfg_strength, sway_sampling_coef, ref_ratio,
        audio_info, denoised_audio, transcribe_state, transcript, smart_transcript,
        mode, start_time, end_time,
        split_text, selected_sentence, audio_tensors):
    """Perform a LEMAS-TTS speech edit on the selected time span.

    The words between `start_time` and `end_time` are replaced by `transcript`,
    the full target text is rebuilt from the alignment, and the model
    regenerates the edited region.

    Args:
        seed: RNG seed; -1 means random.
        nfe_step, cfg_strength, sway_sampling_coef, ref_ratio: sampling knobs
            forwarded to gen_wav_multilingual().
        speed, mode, split_text, selected_sentence, audio_tensors: accepted for
            UI wiring compatibility; not used by the edit path below.
        audio_info / denoised_audio: (sr, samples) pairs; the denoised version
            is preferred when its duration matches the original.
        transcribe_state: state dict produced by transcribe()/align().
        transcript: replacement text for the selected span.
        smart_transcript: requires a whisper transcript to be present.
        start_time, end_time: edit span in seconds.

    Returns:
        (output_audio, target_text, sentence_dropdown_update, audio_tensors).

    Raises:
        gr.Error: on missing models/alignment or an invalid time span.
    """
    if tts_edit_model is None:
        raise gr.Error("LEMAS-TTS edit model not loaded")
    if smart_transcript and (transcribe_state is None):
        raise gr.Error("Can't use smart transcript: whisper transcript not found")

    # if mode == "Rerun":
    #     colon_position = selected_sentence.find(':')
    #     selected_sentence_idx = int(selected_sentence[:colon_position])
    #     sentences = [selected_sentence[colon_position + 1:]]

    # Choose base audio (denoised if duration matches)
    audio_base = audio_info
    audio_dur = round(audio_info[1].shape[0] / audio_info[0], ndigits=3)
    if denoised_audio is not None:
        denoised_dur = round(denoised_audio[1].shape[0] / denoised_audio[0], ndigits=3)
        # Accept the denoised track when durations match exactly, or nearly
        # match at a different sample rate (resampling rounding tolerance).
        if audio_dur == denoised_dur or (
            denoised_audio[0] != audio_info[0] and abs(audio_dur - denoised_dur) < 0.1
        ):
            audio_base = denoised_audio
            logging.info("use denoised audio")

    raw_sr, raw_wav = audio_base
    print("audio_dur: ", audio_dur, raw_sr, raw_wav.shape, start_time, end_time)

    # Build target text by replacing the selected span with `transcript`
    words = transcribe_state["words_info"]
    if not words:
        raise gr.Error("No word-level alignment found; please run Transcribe first.")

    start_time = float(start_time)
    end_time = float(end_time)
    if end_time <= start_time:
        raise gr.Error("Edit end time must be greater than start time.")

    # Find word indices covering the selected region
    # start_idx: first word that ends after start_time.
    start_idx = 0
    for i, w in enumerate(words):
        if w["end"] > start_time:
            start_idx = i
            break

    # end_idx: one past the last word that starts before end_time.
    end_idx = len(words)
    for i in range(len(words) - 1, -1, -1):
        if words[i]["start"] < end_time:
            end_idx = i + 1
            break
    # Guarantee a non-empty span of at least one word.
    if end_idx <= start_idx:
        end_idx = min(start_idx + 1, len(words))

    word_start_sec = float(words[start_idx]["start"])
    word_end_sec = float(words[end_idx - 1]["end"])

    # Edit span in seconds (relative to full utterance), padded by 100 ms.
    edit_start = max(0.0, word_start_sec - 0.1)
    edit_end = min(word_end_sec + 0.1, audio_dur)
    parts_to_edit = [(edit_start, edit_end)]

    # Rebuild target text: words before the span + replacement + words after.
    # NOTE(review): assumes display_text splits into the same number of tokens
    # as the alignment's word list — confirm for zh transcripts.
    display_text = transcribe_state["segments"]["text_raw"].strip()
    txt_list = display_text.split(" ") if display_text else [w["word"] for w in words]

    prefix = " ".join(txt_list[:start_idx]).strip()
    suffix = " ".join(txt_list[end_idx:]).strip()
    new_phrase = transcript.strip()

    pieces = []
    if prefix:
        pieces.append(prefix)
    if new_phrase:
        pieces.append(new_phrase)
    if suffix:
        pieces.append(suffix)
    target_text = " ".join(pieces)

    logging.info(
        "target_text: %s (start_idx=%d, end_idx=%d, parts_to_edit=%s)",
        target_text,
        start_idx,
        end_idx,
        parts_to_edit,
    )

    # Prepare audio for LEMAS-TTS editing (mono, target SR)
    segment_audio = load_wav(audio_base, sr=tts_edit_model.target_sample_rate)

    seed_val = None if seed == -1 else int(seed)

    wav_out, _ = gen_wav_multilingual(
        tts_edit_model,
        segment_audio,
        tts_edit_model.target_sample_rate,
        target_text,
        parts_to_edit,
        nfe_step=int(nfe_step),
        cfg_strength=float(cfg_strength),
        sway_sampling_coef=float(sway_sampling_coef),
        ref_ratio=float(ref_ratio),
        no_ref_audio=False,
        use_acc_grl=False,
        use_prosody_encoder_flag=True,
        seed=seed_val,
    )

    # Convert the generated float waveform to int16 for the gr.Audio output.
    wav_np = wav_out.cpu().numpy()
    wav_np = np.clip(wav_np, -0.999, 0.999)
    wav_int16 = (wav_np * np.iinfo(np.int16).max).astype(np.int16)
    out_sr = int(tts_edit_model.target_sample_rate)

    output_audio = (out_sr, wav_int16)
    sentences = [f"0: {target_text}"]
    audio_tensors = [torch.from_numpy(wav_np)]

    component = gr.Dropdown(choices=sentences, value=sentences[0])
    return output_audio, target_text, component, audio_tensors
849
+
850
+
851
def update_input_audio(audio_info):
    """Resize the edit-time sliders to match the uploaded audio's duration.

    Args:
        audio_info: None, a file path, or an (sr, samples) tuple from gr.Audio.

    Returns:
        Updates for [edit_start_time, edit_end_time]: start reset to 0, end set
        to the clip duration.

    Raises:
        gr.Error: if the audio value has an unsupported type.
    """
    if audio_info is None:
        # Fix: was `return 0, 0, 0` — three values for two Gradio outputs.
        return [gr.Slider(value=0), gr.Slider(value=0)]
    if isinstance(audio_info, str):
        # Fix: was `torchaudio.info(audio_path)` — `audio_path` was undefined
        # (NameError); the path lives in `audio_info`.
        info = torchaudio.info(audio_info)
        max_time = round(info.num_frames / info.sample_rate, 2)
    elif isinstance(audio_info, tuple):
        max_time = round(audio_info[1].shape[0] / audio_info[0], 2)
    else:
        # Previously fell through with `max_time` unbound (NameError).
        raise gr.Error(f"Unsupported audio input type: {type(audio_info)}")
    return [
        # gr.Slider(maximum=max_time, value=max_time),
        gr.Slider(maximum=max_time, value=0),
        gr.Slider(maximum=max_time, value=max_time),
    ]
864
+
865
+
866
def change_mode(mode):
    """Toggle visibility of the mode-specific control groups.

    Outputs map to: tts_mode_controls, edit_mode_controls, edit_word_mode,
    split_text, long_tts_sentence_editor.
    """
    is_edit = mode == "Edit"
    is_long_tts = mode == "Long TTS"
    return [
        gr.Group(visible=not is_edit),
        gr.Group(visible=is_edit),
        gr.Radio(visible=is_edit),
        gr.Radio(visible=is_long_tts),
        gr.Group(visible=is_long_tts),
    ]
875
+
876
+
877
def load_sentence(selected_sentence, audio_tensors):
    """Play back the audio for a dropdown entry formatted as 'idx: text'."""
    if selected_sentence is None:
        return None
    colon_position = selected_sentence.find(':')
    sentence_idx = int(selected_sentence[:colon_position])
    # Use LEMAS-TTS target sample rate if available, otherwise default to 16000
    sample_rate = getattr(tts_edit_model, "target_sample_rate", 16000)
    return get_output_audio([audio_tensors[sentence_idx]], sample_rate)
885
+
886
+
887
def update_bound_word(is_first_word, selected_word, edit_word_mode):
    """Map a '<start> <word> <end>' dropdown entry to an edit-boundary time.

    "Replace half" uses the word's midpoint; otherwise the first word maps to
    its start time and the last word to its end time.
    """
    if selected_word is None:
        return None

    tokens = selected_word.split(' ')
    word_start, word_end = float(tokens[0]), float(tokens[-1])

    if edit_word_mode == "Replace half":
        return (word_start + word_end) / 2
    return word_start if is_first_word else word_end
901
+
902
+
903
def update_bound_words(from_selected_word, to_selected_word, edit_word_mode):
    """Refresh both edit-boundary sliders from the two word dropdowns."""
    start_bound = update_bound_word(True, from_selected_word, edit_word_mode)
    end_bound = update_bound_word(False, to_selected_word, edit_word_mode)
    return [start_bound, end_bound]
908
+
909
+
910
# Help text rendered next to the "Smart transcript" checkbox.
smart_transcript_info = """
If enabled, the target transcript will be constructed for you:</br>
- In TTS and Long TTS mode just write the text you want to synthesize.</br>
- In Edit mode just write the text to replace selected editing segment.</br>
If disabled, you should write the target transcript yourself:</br>
- In TTS mode write prompt transcript followed by generation transcript.</br>
- In Long TTS select split by newline (<b>SENTENCE SPLIT WON'T WORK</b>) and start each line with a prompt transcript.</br>
- In Edit mode write full prompt</br>
"""

# Initial value of the "Original transcript" textbox.
demo_original_transcript = ""

# Demo target texts per mode, for both smart and regular transcript entry.
demo_text = {
    "TTS": {
        "smart": "take over the stage for half an hour,",
        "regular": "Gwynplaine had, besides, for his work and for his feats of strength, take over the stage for half an hour."
    },
    "Edit": {
        "smart": "Just write it line-by-line.",
        "regular": "照片、医疗记录、神经重塑的易损性,这是某种数据库啊!还有PRELESS的脑部扫描、生物管型、神经重塑."
    },
    "Long TTS": {
        "smart": "You can run the model on a big text!\n"
                 "Just write it line-by-line. Or sentence-by-sentence.\n"
                 "If some sentences sound odd, just rerun the model on them, no need to generate the whole text again!",
        "regular": "Gwynplaine had, besides, for his work and for his feats of strength, You can run the model on a big text!\n"
                   "Gwynplaine had, besides, for his work and for his feats of strength, Just write it line-by-line. Or sentence-by-sentence.\n"
                   "Gwynplaine had, besides, for his work and for his feats of strength, If some sentences sound odd, just rerun the model on them, no need to generate the whole text again!"
    }
}

# Flat set of every demo string above; update_demo() uses it to detect whether
# the transcript box still holds unedited demo text.
all_demo_texts = {vv for k, v in demo_text.items() for kk, vv in v.items()}
942
+
943
+ demo_words = ['0.069 Gwynplain 0.611', '0.671 had, 0.912', '0.952 besides, 1.414', '1.494 for 1.634', '1.695 his 1.835', '1.915 work 2.136', '2.196 and 2.297', '2.337 for 2.517', '2.557 his 2.678', '2.758 feats 3.019', '3.079 of 3.139', '3.2 strength, 3.561', '4.022 round 4.263', '4.303 his 4.444', '4.524 neck 4.705', '4.745 and 4.825', '4.905 over 5.086', '5.146 his 5.266', '5.307 shoulders, 5.768', '6.23 an 6.33', '6.531 esclavine 7.133', '7.213 of 7.293', '7.353 leather. 7.614']
944
+
945
+ demo_words_info = [{'word': 'Gwynplain', 'start': 0.069, 'end': 0.611, 'score': 0.833}, {'word': 'had,', 'start': 0.671, 'end': 0.912, 'score': 0.879}, {'word': 'besides,', 'start': 0.952, 'end': 1.414, 'score': 0.863}, {'word': 'for', 'start': 1.494, 'end': 1.634, 'score': 0.89}, {'word': 'his', 'start': 1.695, 'end': 1.835, 'score': 0.669}, {'word': 'work', 'start': 1.915, 'end': 2.136, 'score': 0.916}, {'word': 'and', 'start': 2.196, 'end': 2.297, 'score': 0.766}, {'word': 'for', 'start': 2.337, 'end': 2.517, 'score': 0.808}, {'word': 'his', 'start': 2.557, 'end': 2.678, 'score': 0.786}, {'word': 'feats', 'start': 2.758, 'end': 3.019, 'score': 0.97}, {'word': 'of', 'start': 3.079, 'end': 3.139, 'score': 0.752}, {'word': 'strength,', 'start': 3.2, 'end': 3.561, 'score': 0.742}, {'word': 'round', 'start': 4.022, 'end': 4.263, 'score': 0.916}, {'word': 'his', 'start': 4.303, 'end': 4.444, 'score': 0.666}, {'word': 'neck', 'start': 4.524, 'end': 4.705, 'score': 0.908}, {'word': 'and', 'start': 4.745, 'end': 4.825, 'score': 0.882}, {'word': 'over', 'start': 4.905, 'end': 5.086, 'score': 0.847}, {'word': 'his', 'start': 5.146, 'end': 5.266, 'score': 0.791}, {'word': 'shoulders,', 'start': 5.307, 'end': 5.768, 'score': 0.729}, {'word': 'an', 'start': 6.23, 'end': 6.33, 'score': 0.854}, {'word': 'esclavine', 'start': 6.531, 'end': 7.133, 'score': 0.803}, {'word': 'of', 'start': 7.213, 'end': 7.293, 'score': 0.772}, {'word': 'leather.', 'start': 7.353, 'end': 7.614, 'score': 0.896}]
946
+
947
+
948
def update_demo(mode, smart_transcript, edit_word_mode, transcript, edit_from_word, edit_to_word):
    """Swap in demo transcript/word selections when the user hasn't edited them."""
    if transcript not in all_demo_texts:
        # User-provided text: leave everything untouched.
        return transcript, edit_from_word, edit_to_word

    replace_half = edit_word_mode == "Replace half"
    from_default = demo_words[2] if replace_half else demo_words[3]
    to_default = demo_words[12] if replace_half else demo_words[11]
    # Only swap a dropdown if it still holds one of the demo defaults.
    swap_from = edit_from_word in (demo_words[2], demo_words[3])
    swap_to = edit_to_word in (demo_words[11], demo_words[12])
    new_transcript = demo_text[mode]["smart" if smart_transcript else "regular"]
    return [
        new_transcript,
        from_default if swap_from else edit_from_word,
        to_default if swap_to else edit_to_word,
    ]
962
+
963
def get_app():
    """Build and return the Gradio Blocks UI for LEMAS-TTS speech editing.

    Layout: model loading row on top; then input/transcription column,
    edit-controls column, and output column; generation parameters at the
    bottom. All event handlers are wired to the module-level functions above.
    """
    with gr.Blocks() as app:
        # --- Model selection / loading row -------------------------------
        with gr.Row():
            with gr.Column(scale=2):
                load_models_btn = gr.Button(value="Load models")
            with gr.Column(scale=5):
                with gr.Accordion("Select models", open=False) as models_selector:
                    # For LEMAS-TTS editing, we expose a simple model selector
                    # between the two multilingual variants.
                    lemas_model_choice = gr.Radio(
                        label="LEMAS-TTS Model",
                        choices=["multilingual_grl", "multilingual_prosody"],
                        value="multilingual_grl",
                        interactive=True,
                    )
                    with gr.Row():
                        denoise_model_choice = gr.Radio(label="Denoise Model", scale=2, value="UVR5", choices=["UVR5", "DeepFilterNet"]) # "830M", "330M_TTSEnhanced", "830M_TTSEnhanced"])
                        # whisper_backend_choice = gr.Radio(label="Whisper backend", value="", choices=["whisperX", "whisper"])
                        whisper_model_choice = gr.Radio(label="Whisper model", scale=3, value="medium", choices=["base", "small", "medium", "large"])
                        align_model_choice = gr.Radio(label="Forced alignment model", scale=2, value="MMS", choices=["whisperX", "MMS"], visible=False)
                        # audiosr_choice = gr.Radio(label="AudioSR model", scale=2, value="None", choices=["basic", "speech", "None"])

        # --- Main editing area -------------------------------------------
        with gr.Row():
            # Left column: input audio, transcription and denoising.
            with gr.Column(scale=2):
                input_audio = gr.Audio(value=f"{DEMO_PATH}/V-00013_en-US.wav", label="Input Audio", interactive=True)

                with gr.Row():
                    transcribe_btn = gr.Button(value="Transcribe")
                    align_btn = gr.Button(value="ReAlign")
                with gr.Group():
                    original_transcript = gr.Textbox(label="Original transcript", lines=5, interactive=True, value=demo_original_transcript,
                                                     info="Use whisperx model to get the transcript. Fix and align it if necessary.")
                    with gr.Accordion("Word start time", open=False, visible=False):
                        transcript_with_start_time = gr.Textbox(label="Start time", lines=5, interactive=False, info="Start time before each word")
                    with gr.Accordion("Word end time", open=False, visible=False):
                        transcript_with_end_time = gr.Textbox(label="End time", lines=5, interactive=False, info="End time after each word")

                with gr.Row():
                    denoise_btn = gr.Button(value="Denoise")
                    cancel_btn = gr.Button(value="Cancel Denoise")
                denoise_audio = gr.Audio(label="Denoised Audio", value=None, interactive=False)

            # Middle column: target text and edit-span selection.
            with gr.Column(scale=3):
                with gr.Group():
                    transcript_inbox = gr.Textbox(label="Text", lines=5, value=demo_text["Edit"]["smart"])
                    with gr.Row(visible=False):
                        smart_transcript = gr.Checkbox(label="Smart transcript", value=True)
                        with gr.Accordion(label="?", open=False):
                            info = gr.Markdown(value=smart_transcript_info)

                    # Only "Edit" mode is exposed; the radio is kept hidden for
                    # handler-signature compatibility.
                    mode = gr.Radio(label="Mode", choices=["Edit"], value="Edit", visible=False)
                    with gr.Row(visible=False):
                        split_text = gr.Radio(label="Split text", choices=["Newline", "Sentence"], value="Newline",
                                              info="Split text into parts and run TTS for each part.", visible=True)
                        edit_word_mode = gr.Radio(label="Edit word mode", choices=["Replace half", "Replace all"], value="Replace all",
                                                  info="What to do with first and last word", visible=False)

                # with gr.Group(visible=False) as tts_mode_controls:
                #     with gr.Row():
                #         edit_from_word = gr.Dropdown(label="First word in prompt", choices=demo_words, value=demo_words[12], interactive=True)
                #         edit_to_word = gr.Dropdown(label="Last word in prompt", choices=demo_words, value=demo_words[18], interactive=True)
                #     with gr.Row():
                #         edit_start_time = gr.Slider(label="Prompt start time", minimum=0, maximum=7.614, step=0.001, value=4.022)
                #         edit_end_time = gr.Slider(label="Prompt end time", minimum=0, maximum=7.614, step=0.001, value=5.768)
                #     with gr.Row():
                #         check_btn = gr.Button(value="Check prompt",scale=1)
                #         edit_audio = gr.Audio(label="Prompt Audio", scale=3)

                # with gr.Group() as edit_mode_controls:
                with gr.Row():
                    edit_from_word = gr.Dropdown(label="First word to edit", choices=demo_words, value=demo_words[12], interactive=True)
                    edit_to_word = gr.Dropdown(label="Last word to edit", choices=demo_words, value=demo_words[18], interactive=True)
                with gr.Row():
                    edit_start_time = gr.Slider(label="Edit from time", minimum=0, maximum=7.614, step=0.001, value=4.022)
                    edit_end_time = gr.Slider(label="Edit to time", minimum=0, maximum=7.614, step=0.001, value=5.768)
                with gr.Row():
                    check_btn = gr.Button(value="Check edit words",scale=1)
                    edit_audio = gr.Audio(label="Edit word(s)", scale=3)

                run_btn = gr.Button(value="Run", variant="primary")

            # Right column: generated output and sentence re-generation.
            with gr.Column(scale=2):
                output_audio = gr.Audio(label="Output Audio")
                with gr.Accordion("Inference transcript", open=True):
                    inference_transcript = gr.Textbox(label="Inference transcript", lines=5, interactive=False, info="Inference was performed on this transcript.")
                with gr.Group(visible=False) as long_tts_sentence_editor:
                    sentence_selector = gr.Dropdown(label="Sentence", value=None,
                                                    info="Select sentence you want to regenerate")
                    sentence_audio = gr.Audio(label="Sentence Audio", scale=2)
                    rerun_btn = gr.Button(value="Rerun")

        # --- Generation parameters ---------------------------------------
        with gr.Row():
            with gr.Accordion("Generation Parameters - change these if you are unhappy with the generation", open=False):
                with gr.Row():
                    nfe_step = gr.Number(
                        label="NFE Step",
                        value=64,
                        precision=0,
                        info="Number of function evaluations (sampling steps).",
                    )
                    speed = gr.Slider(
                        label="Speed",
                        minimum=0.5,
                        maximum=1.5,
                        step=0.05,
                        value=1.0,
                        info="Placeholder for future use; currently not applied.",
                    )
                    cfg_strength = gr.Slider(
                        label="CFG Strength",
                        minimum=0.0,
                        maximum=10.0,
                        step=0.5,
                        value=5.0,
                        info="Classifier-free guidance strength.",
                    )

                with gr.Row():
                    sway_sampling_coef = gr.Slider(
                        label="Sway",
                        minimum=2.0,
                        maximum=5.0,
                        step=0.1,
                        value=3.0,
                        info="Sampling sway coefficient.",
                    )
                    ref_ratio = gr.Slider(
                        label="Ref Ratio",
                        minimum=0.0,
                        maximum=1.0,
                        step=0.05,
                        value=1.0,
                        info="How much to rely on reference audio (if used).",
                    )
                    seed = gr.Number(
                        label="Seed",
                        value=-1,
                        precision=0,
                        info="-1 for random, otherwise fixed seed.",
                    )

        # Hidden state: generated per-sentence tensors and the last alignment.
        audio_tensors = gr.State()
        transcribe_state = gr.State(value={"words_info": demo_words_info, "lang":"zh"})

        # --- Event wiring -------------------------------------------------
        edit_word_mode.change(fn=update_demo,
                              inputs=[mode, smart_transcript, edit_word_mode, transcript_inbox, edit_from_word, edit_to_word],
                              outputs=[transcript_inbox, edit_from_word, edit_to_word])
        smart_transcript.change(
            fn=update_demo,
            inputs=[mode, smart_transcript, edit_word_mode, transcript_inbox, edit_from_word, edit_to_word],
            outputs=[transcript_inbox, edit_from_word, edit_to_word],
        )

        load_models_btn.click(fn=load_models,
                              inputs=[lemas_model_choice, whisper_model_choice, align_model_choice, denoise_model_choice], # audiosr_choice],
                              outputs=[models_selector])

        input_audio.upload(fn=update_input_audio,
                           inputs=[input_audio],
                           outputs=[edit_start_time, edit_end_time]) # prompt_end_time

        transcribe_btn.click(fn=transcribe,
                             inputs=[seed, input_audio],
                             outputs=[original_transcript, transcript_with_start_time, transcript_with_end_time,
                                      edit_from_word, edit_to_word, transcribe_state]) # prompt_to_word
        align_btn.click(fn=align,
                        inputs=[original_transcript, input_audio, transcribe_state],
                        outputs=[original_transcript, transcript_with_start_time, transcript_with_end_time,
                                 edit_from_word, edit_to_word, transcribe_state]) # prompt_to_word

        denoise_btn.click(fn=denoise,
                          inputs=[input_audio],
                          outputs=[denoise_audio])

        cancel_btn.click(fn=cancel_denoise,
                         inputs=[input_audio],
                         outputs=[denoise_audio])

        # mode.change(fn=change_mode,
        #             inputs=[mode],
        #             outputs=[tts_mode_controls, edit_mode_controls, edit_word_mode, split_text, long_tts_sentence_editor])

        check_btn.click(fn=get_edit_audio_part,
                        inputs=[input_audio, edit_start_time, edit_end_time],
                        outputs=[edit_audio])

        run_btn.click(fn=run,
                      inputs=[
                          seed, nfe_step, speed, cfg_strength, sway_sampling_coef, ref_ratio,
                          input_audio, denoise_audio, transcribe_state, transcript_inbox, smart_transcript,
                          mode, edit_start_time, edit_end_time,
                          split_text, sentence_selector, audio_tensors
                      ],
                      outputs=[output_audio, inference_transcript, sentence_selector, audio_tensors])

        sentence_selector.change(
            fn=load_sentence,
            inputs=[sentence_selector, audio_tensors],
            outputs=[sentence_audio],
        )
        rerun_btn.click(fn=run,
                        inputs=[
                            seed, nfe_step, speed, cfg_strength, sway_sampling_coef, ref_ratio,
                            input_audio, denoise_audio, transcribe_state, transcript_inbox, smart_transcript,
                            gr.State(value="Rerun"), edit_start_time, edit_end_time,
                            split_text, sentence_selector, audio_tensors
                        ],
                        outputs=[output_audio, inference_transcript, sentence_audio, audio_tensors])

        # prompt_to_word.change(fn=update_bound_word,
        #                       inputs=[gr.State(False), prompt_to_word, gr.State("Replace all")],
        #                       outputs=[prompt_end_time])
        edit_from_word.change(fn=update_bound_word,
                              inputs=[gr.State(True), edit_from_word, edit_word_mode],
                              outputs=[edit_start_time])
        edit_to_word.change(fn=update_bound_word,
                            inputs=[gr.State(False), edit_to_word, edit_word_mode],
                            outputs=[edit_end_time])
        edit_word_mode.change(fn=update_bound_words,
                              inputs=[edit_from_word, edit_to_word, edit_word_mode],
                              outputs=[edit_start_time, edit_end_time])

    return app
1188
+
1189
+
1190
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="VoiceCraft gradio app.")

    parser.add_argument("--demo-path", default="./demo", help="Path to demo directory")
    parser.add_argument("--tmp-path", default="/cto_labs/vistring/zhaozhiyuan/outputs/voicecraft/tmp", help="Path to tmp directory")
    parser.add_argument("--models-path", default="/cto_labs/vistring/zhaozhiyuan/outputs/voicecraft/pretrain/VoiceCraft", help="Path to voicecraft models directory")
    parser.add_argument("--port", default=41020, type=int, help="App port")
    parser.add_argument("--share", action="store_true", help="Launch with public url")
    parser.add_argument("--server_name", default="0.0.0.0", type=str, help="Server name for launching the app. 127.0.0.1 for localhost; 0.0.0.0 to allow access from other machines in the local network. Might also give access to external users depends on the firewall settings.")

    # Some downstream libs expect $USER to exist (e.g. in containers it may not).
    os.environ["USER"] = os.getenv("USER", "user")
    args = parser.parse_args()
    # Module-level path globals read by get_app() and the handlers above.
    DEMO_PATH = args.demo_path
    TMP_PATH = args.tmp_path
    MODELS_PATH = args.models_path

    app = get_app()
    app.queue().launch(share=args.share, server_name=args.server_name, server_port=args.port)
inference_gradio.py ADDED
@@ -0,0 +1,576 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import gc
2
+ import os
3
+ import platform
4
+ import psutil
5
+ import tempfile
6
+ from glob import glob
7
+ import traceback
8
+ import click
9
+ import gradio as gr
10
+ import torch
11
+ import torchaudio
12
+ import soundfile as sf
13
+ from pathlib import Path
14
+
15
+ from cached_path import cached_path
16
+
17
+ from lemas_tts.api import TTS, PRETRAINED_ROOT, CKPTS_ROOT
18
+
19
# Global variables holding the lazily (re)loaded TTS API and its load keys.
tts_api = None
last_checkpoint = ""
last_device = ""
last_ema = None

# Device detection: prefer CUDA, then Intel XPU, then Apple MPS, else CPU.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "xpu"
    if torch.xpu.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

REPO_ROOT = Path(__file__).resolve().parent

# HF location for large TTS checkpoints (too big for Space storage)
HF_PRETRAINED_ROOT = "hf://LEMAS-Project/LEMAS-TTS/pretrained_models"

# Point phonemizer at the espeak-ng-data shipped inside `pretrained_models`
# (the bundled dictionaries). The shared library itself comes from the
# system-installed espeak-ng (via apt); PHONEMIZER_ESPEAK_LIBRARY is left
# unset on purpose so a locally copied .so cannot clash with the Space's
# base image.
ESPEAK_DATA_DIR = Path(PRETRAINED_ROOT) / "espeak-ng-data"
os.environ["ESPEAK_DATA_PATH"] = str(ESPEAK_DATA_DIR)
os.environ["ESPEAKNG_DATA_PATH"] = str(ESPEAK_DATA_DIR)
47
+
48
+
49
class UVR5:
    """Small wrapper around the bundled uvr5 implementation for denoising."""

    def __init__(self, model_dir: Path, code_dir: Path):
        # model_dir: directory holding the ONNX weights and JSON config;
        # code_dir: directory of the bundled uvr5 sources (added to sys.path).
        self.model = self.load_model(str(model_dir), str(code_dir))

    def load_model(self, model_dir: str, code_dir: str):
        """Build the MDX-Net Inference object for the Kim_Vocal_1 model on CPU."""
        import sys
        import json

        # Make the bundled uvr5 package importable.
        if code_dir not in sys.path:
            sys.path.append(code_dir)

        from multiprocess_cuda_infer import ModelData, Inference

        model_path = os.path.join(model_dir, "Kim_Vocal_1.onnx")
        config_path = os.path.join(model_dir, "MDX-Net-Kim-Vocal1.json")
        with open(config_path, "r", encoding="utf-8") as f:
            configs = json.load(f)
        model_data = ModelData(
            model_path=model_path,
            audio_path=model_dir,
            result_path=model_dir,
            device="cpu",
            process_method="MDX-Net",
            base_dir=model_dir,  # keep base_dir and model_dir the same (paths under `pretrained_models`)
            **configs,
        )

        uvr5_model = Inference(model_data, "cpu")
        uvr5_model.load_model(model_path, 1)
        return uvr5_model

    def denoise(self, audio_info):
        """Separate vocals from `audio_info`; returns (samples, 44100).

        NOTE(review): `demix_base` is assumed to return a torch tensor here
        (`.numpy()` is called on it) — confirm against the uvr5 sources.
        """
        print("denoise UVR5: ", audio_info)
        # MDX-Net expects 44.1 kHz stereo input.
        input_audio = load_wav(audio_info, sr=44100, channel=2)
        output_audio = self.model.demix_base({0: input_audio.squeeze()}, is_match_mix=False)
        return output_audio.squeeze().T.numpy(), 44100
87
+
88
# Instantiate the denoiser once at import time (weights under
# `pretrained_models/uvr5`, code under the repo's `uvr5/` folder).
denoise_model = UVR5(
    model_dir=str(Path(PRETRAINED_ROOT) / "uvr5"),
    code_dir=str(REPO_ROOT / "uvr5"),
)
92
+
93
def load_wav(audio_info, sr=16000, channel=1):
    """Load an audio file as a peak-normalized float tensor.

    Args:
        audio_info: path to the audio file (anything torchaudio can read).
        sr: target sample rate; audio is resampled when it differs.
        channel: 1 for mono output, 2 for stereo output.

    Returns:
        A float tensor clipped to [-0.999, 0.999] at sample rate `sr`.
    """
    print("load audio:", audio_info)
    audio, raw_sr = torchaudio.load(audio_info)
    # torchaudio returns (channels, time); flip data that arrived as
    # (time, 2) back to channel-first layout.
    audio = audio.T if len(audio.shape) > 1 and audio.shape[1] == 2 else audio
    # Peak-normalize. Clamp the peak away from zero so a silent file does
    # not produce NaNs via 0/0 (previously an unguarded division).
    peak = torch.max(torch.abs(audio))
    audio = audio / torch.clamp(peak, min=1e-8)
    audio = audio.squeeze().float()
    if channel == 1 and len(audio.shape) == 2:  # stereo to mono
        audio = audio.mean(dim=0, keepdim=True)
    elif channel == 2 and len(audio.shape) == 1:
        audio = torch.stack((audio, audio))  # mono to stereo
    if raw_sr != sr:
        audio = torchaudio.functional.resample(audio.squeeze(), raw_sr, sr)
    audio = torch.clip(audio, -0.999, 0.999).squeeze()
    return audio
107
+
108
+
109
def denoise(audio_info):
    """Run UVR5 vocal isolation on `audio_info` and return the output path.

    Writes to a unique temporary file instead of the previous fixed
    "./denoised_audio.wav", so concurrent Gradio requests cannot overwrite
    each other's results.
    """
    import tempfile

    denoised_audio, sr = denoise_model.denoise(audio_info)
    with tempfile.NamedTemporaryFile(delete=False, prefix="denoised_", suffix=".wav") as f:
        save_path = f.name
    sf.write(save_path, denoised_audio, sr, format='wav', subtype='PCM_24')
    print("save denoised audio:", save_path)
    return save_path
115
+
116
def cancel_denoise(audio_info):
    """Cancel a previous denoise: echo the untouched reference audio back."""
    original = audio_info
    return original
118
+
119
+
120
def get_checkpoints_project(project_name=None, is_gradio=True):
    """List available checkpoint files for the dropdown.

    Searches *.pt / *.safetensors under the local `CKPTS_ROOT`; when nothing
    is found locally, falls back to known checkpoints on Hugging Face
    (hf:// URLs that `infer()` later resolves via `cached_path`).

    Args:
        project_name: restrict search to `CKPTS_ROOT/<project_name>`;
            None scans the whole checkpoint tree.
        is_gradio: when True return a `gr.update` for the dropdown,
            otherwise return `(files_checkpoints, select_checkpoint)`.
    """
    checkpoint_dir = [str(CKPTS_ROOT)]
    # Remote ckpt locations on HF (used when local ckpts are not present)
    remote_ckpts = {
        "multilingual_grl": f"{HF_PRETRAINED_ROOT}/ckpts/multilingual_grl/multilingual_grl.safetensors",
        "multilingual_prosody": f"{HF_PRETRAINED_ROOT}/ckpts/multilingual_prosody/multilingual_prosody.safetensors",
    }

    if project_name is None:
        # Look for checkpoints anywhere under the local checkpoint directory.
        files_checkpoints = []
        for path in checkpoint_dir:
            if os.path.isdir(path):
                files_checkpoints.extend(glob(os.path.join(path, "**/*.pt"), recursive=True))
                files_checkpoints.extend(glob(os.path.join(path, "**/*.safetensors"), recursive=True))
                break
        # Fallback to remote ckpts if none found locally
        if not files_checkpoints:
            files_checkpoints = list(remote_ckpts.values())
    else:
        files_checkpoints = []
        if os.path.isdir(checkpoint_dir[0]):
            files_checkpoints = glob(os.path.join(checkpoint_dir[0], project_name, "*.pt"))
            files_checkpoints.extend(glob(os.path.join(checkpoint_dir[0], project_name, "*.safetensors")))
        # If no local ckpts for this project, try remote mapping
        if not files_checkpoints:
            ckpt = remote_ckpts.get(project_name)
            files_checkpoints = [ckpt] if ckpt is not None else []
    print("files_checkpoints:", project_name, files_checkpoints)
    # Separate pretrained and regular checkpoints
    pretrained_checkpoints = [f for f in files_checkpoints if "pretrained_" in os.path.basename(f)]
    regular_checkpoints = [
        f
        for f in files_checkpoints
        if "pretrained_" not in os.path.basename(f) and "model_last.pt" not in os.path.basename(f)
    ]
    # Renamed from `last_checkpoint`: the old name shadowed the module-level
    # cache variable of the same name that `infer()` mutates via `global`.
    last_checkpoint_files = [f for f in files_checkpoints if "model_last.pt" in os.path.basename(f)]

    # Sort regular checkpoints by update number (model_<N>.pt); fall back to
    # lexicographic order when a name doesn't match that pattern.
    try:
        regular_checkpoints = sorted(
            regular_checkpoints, key=lambda x: int(os.path.basename(x).split("_")[1].split(".")[0])
        )
    except (IndexError, ValueError):
        regular_checkpoints = sorted(regular_checkpoints)

    # Combine in order: pretrained, regular, last
    files_checkpoints = pretrained_checkpoints + regular_checkpoints + last_checkpoint_files

    # Default selection: the newest checkpoint (last in the ordering).
    select_checkpoint = None if not files_checkpoints else files_checkpoints[-1]

    if is_gradio:
        return gr.update(choices=files_checkpoints, value=select_checkpoint)

    return files_checkpoints, select_checkpoint
176
+
177
+
178
def get_available_projects():
    """Return sorted project folder names under `pretrained_models/data`.

    Folders whose name contains "test" (held-out example data) are skipped;
    when no local data directory exists, the two known HF projects are used.
    """
    candidate_dirs = [
        str(Path(PRETRAINED_ROOT) / "data"),
    ]

    projects = []
    for candidate in candidate_dirs:
        if os.path.isdir(candidate):
            projects = [name for name in os.listdir(candidate) if "test" not in name]
            break

    # Fallback: if no local data dir, default to known HF projects
    if not projects:
        projects = ["multilingual_grl", "multilingual_prosody"]

    projects.sort()
    print("project_list:", projects)
    return projects
198
+
199
+
200
def infer(
    project, file_checkpoint, exp_name, ref_text, ref_audio, denoise_audio, gen_text, nfe_step, use_ema, separate_langs, frontend, speed, cfg_strength, use_acc_grl, ref_ratio, no_ref_audio, sway_sampling_coef, use_prosody_encoder, seed
):
    """Gradio callback: synthesize speech cloning the reference voice.

    Returns a 3-tuple for the UI outputs: (wav_file_path, device_info, seed),
    or (None, error_message, "") on any failure.
    """
    global last_checkpoint, last_device, tts_api, last_ema

    # Resolve checkpoint path (local or HF URL)
    ckpt_path = file_checkpoint
    if isinstance(ckpt_path, str) and ckpt_path.startswith("hf://"):
        try:
            ckpt_resolved = str(cached_path(ckpt_path))
        except Exception as e:
            traceback.print_exc()
            return None, f"Error downloading checkpoint: {str(e)}", ""
    else:
        ckpt_resolved = ckpt_path

    if not os.path.isfile(ckpt_resolved):
        return None, "Checkpoint not found!", ""

    # Prefer the denoised reference audio when the user produced one.
    if denoise_audio:
        ref_audio = denoise_audio

    device_test = device  # Use the global device

    # Rebuild the cached TTS engine only when checkpoint/device/EMA changed.
    if last_checkpoint != ckpt_resolved or last_device != device_test or last_ema != use_ema or tts_api is None:
        if last_checkpoint != ckpt_resolved:
            last_checkpoint = ckpt_resolved

        if last_device != device_test:
            last_device = device_test

        if last_ema != use_ema:
            last_ema = use_ema

        # Automatically enable prosody encoder when using the prosody checkpoint
        # (overrides the hidden checkbox value).
        use_prosody_encoder = True if "prosody" in str(ckpt_resolved) else False

        # Resolve vocab file (local)
        local_vocab = Path(PRETRAINED_ROOT) / "data" / project / "vocab.txt"
        if not local_vocab.is_file():
            return None, "Vocab file not found!", ""
        vocab_file = str(local_vocab)

        # Resolve prosody encoder config & weights (local)
        local_prosody_cfg = Path(CKPTS_ROOT) / "prosody_encoder" / "pretssel_cfg.json"
        local_prosody_ckpt = Path(CKPTS_ROOT) / "prosody_encoder" / "prosody_encoder_UnitY2.pt"
        if not local_prosody_cfg.is_file() or not local_prosody_ckpt.is_file():
            return None, "Prosody encoder files not found!", ""
        prosody_cfg_path = str(local_prosody_cfg)
        prosody_ckpt_path = str(local_prosody_ckpt)

        try:
            tts_api = TTS(
                model=exp_name,
                ckpt_file=ckpt_resolved,
                vocab_file=vocab_file,
                device=device_test,
                use_ema=use_ema,
                frontend=frontend,
                use_prosody_encoder=use_prosody_encoder,
                prosody_cfg_path=prosody_cfg_path,
                prosody_ckpt_path=prosody_ckpt_path,
            )
        except Exception as e:
            traceback.print_exc()
            return None, f"Error loading model: {str(e)}", ""

        print("Model loaded >>", device_test, file_checkpoint, use_ema)

    if seed == -1:  # -1 used for random
        seed = None

    # NOTE(review): on the cached path (model not reloaded) the checkbox value
    # of use_prosody_encoder is forwarded unmodified — confirm this matches
    # the checkpoint the cached engine was built with.
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
            tts_api.infer(
                ref_file=ref_audio,
                ref_text=ref_text.strip(),
                gen_text=gen_text.strip(),
                nfe_step=nfe_step,
                separate_langs=separate_langs,
                speed=speed,
                cfg_strength=cfg_strength,
                sway_sampling_coef=sway_sampling_coef,
                use_acc_grl=use_acc_grl,
                ref_ratio=ref_ratio,
                no_ref_audio=no_ref_audio,
                use_prosody_encoder=use_prosody_encoder,
                file_wave=f.name,
                seed=seed,
            )
        return f.name, f"Device: {tts_api.device}", str(tts_api.seed)
    except Exception as e:
        traceback.print_exc()
        return None, f"Inference error: {str(e)}", ""
294
+
295
+
296
def get_gpu_stats():
    """Return a human-readable accelerator summary (CUDA, XPU, MPS, or none)."""
    gpu_stats = ""

    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        for i in range(gpu_count):
            gpu_name = torch.cuda.get_device_name(i)
            gpu_properties = torch.cuda.get_device_properties(i)
            total_memory = gpu_properties.total_memory / (1024**3)  # in GB
            allocated_memory = torch.cuda.memory_allocated(i) / (1024**2)  # in MB
            reserved_memory = torch.cuda.memory_reserved(i) / (1024**2)  # in MB

            gpu_stats += (
                f"GPU {i} Name: {gpu_name}\n"
                f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
                f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
                f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
            )
    elif torch.xpu.is_available():
        # Intel XPU mirrors the CUDA memory-introspection API.
        gpu_count = torch.xpu.device_count()
        for i in range(gpu_count):
            gpu_name = torch.xpu.get_device_name(i)
            gpu_properties = torch.xpu.get_device_properties(i)
            total_memory = gpu_properties.total_memory / (1024**3)  # in GB
            allocated_memory = torch.xpu.memory_allocated(i) / (1024**2)  # in MB
            reserved_memory = torch.xpu.memory_reserved(i) / (1024**2)  # in MB

            gpu_stats += (
                f"GPU {i} Name: {gpu_name}\n"
                f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
                f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
                f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
            )
    elif torch.backends.mps.is_available():
        gpu_count = 1
        gpu_stats += "MPS GPU\n"
        total_memory = psutil.virtual_memory().total / (
            1024**3
        )  # Total system memory (MPS doesn't have its own memory)
        allocated_memory = 0
        reserved_memory = 0

        gpu_stats += (
            f"Total system memory: {total_memory:.2f} GB\n"
            f"Allocated GPU memory (MPS): {allocated_memory:.2f} MB\n"
            f"Reserved GPU memory (MPS): {reserved_memory:.2f} MB\n"
        )

    else:
        gpu_stats = "No GPU available"

    return gpu_stats
349
+
350
+
351
def get_cpu_stats():
    """Return a summary of CPU load, system memory, and process priority."""
    usage_pct = psutil.cpu_percent(interval=1)
    vmem = psutil.virtual_memory()
    used_mb = vmem.used / (1024**2)
    total_mb = vmem.total / (1024**2)

    # Nice value of the current process (lower = higher scheduling priority).
    current_process = psutil.Process(os.getpid())
    nice_value = current_process.nice()

    return (
        f"CPU Usage: {usage_pct:.2f}%\n"
        f"System Memory: {used_mb:.2f} MB used / {total_mb:.2f} MB total ({vmem.percent}% used)\n"
        f"Process Priority (Nice value): {nice_value}"
    )
370
+
371
+
372
def get_combined_stats():
    """Concatenate GPU and CPU stats into one markdown snippet for the UI."""
    return f"### GPU Stats\n{get_gpu_stats()}\n\n### CPU Stats\n{get_cpu_stats()}"
378
+
379
+
380
+ # Create Gradio interface
381
+ with gr.Blocks(title="LEMAS-TTS Inference") as app:
382
+ gr.Markdown(
383
+ """
384
+ # Zero-Shot TTS
385
+
386
+ Set seed to -1 for random generation.
387
+ """
388
+ )
389
+ with gr.Accordion("Model configuration", open=False):
390
+ # Model configuration
391
+ with gr.Row():
392
+ exp_name = gr.Radio(
393
+ label="Model",
394
+ choices=["multilingual_grl", "multilingual_prosody"],
395
+ value="multilingual_grl",
396
+ visible=False,
397
+ )
398
+ # Project selection
399
+ available_projects = get_available_projects()
400
+
401
+ # Get initial checkpoints
402
+ list_checkpoints, checkpoint_select = get_checkpoints_project(available_projects[0] if available_projects else None, False)
403
+
404
+ with gr.Row():
405
+ with gr.Column(scale=1):
406
+ # load_models_btn = gr.Button(value="Load models")
407
+ cm_project = gr.Dropdown(
408
+ choices=available_projects,
409
+ value=available_projects[0] if available_projects else None,
410
+ label="Project",
411
+ allow_custom_value=True,
412
+ scale=4
413
+ )
414
+
415
+ with gr.Column(scale=5):
416
+ cm_checkpoint = gr.Dropdown(
417
+ choices=list_checkpoints, value=checkpoint_select, label="Checkpoints", allow_custom_value=True # scale=4,
418
+ )
419
+ bt_checkpoint_refresh = gr.Button("Refresh", scale=1)
420
+
421
+ with gr.Row():
422
+ ch_use_ema = gr.Checkbox(label="Use EMA", visible=False, value=True, scale=2, info="Turn off at early stage might offer better results")
423
+ frontend = gr.Radio(label="Frontend", visible=False, choices=["phone", "char", "bpe"], value="phone", scale=3)
424
+ separate_langs = gr.Checkbox(label="Separate Languages", visible=False, value=True, scale=2, info="separate language tokens")
425
+
426
+ # Inference parameters
427
+ with gr.Row():
428
+ nfe_step = gr.Number(label="NFE Step", scale=1, value=64)
429
+ speed = gr.Slider(label="Speed", scale=3, value=1.0, minimum=0.5, maximum=1.5, step=0.1)
430
+ cfg_strength = gr.Slider(label="CFG Strength", scale=2, value=5.0, minimum=0.0, maximum=10.0, step=1)
431
+ sway_sampling_coef = gr.Slider(label="Sway Sampling Coef", scale=2, value=3, minimum=2, maximum=5, step=0.1)
432
+ ref_ratio = gr.Slider(label="Ref Ratio", scale=2, value=1.0, minimum=0.0, maximum=1.0, step=0.1)
433
+ no_ref_audio = gr.Checkbox(label="No Reference Audio", visible=False, value=False, scale=1, info="No mel condition")
434
+ use_acc_grl = gr.Checkbox(label="Use accent grl condition", visible=False, value=True, scale=1, info="Use accent grl condition")
435
+ use_prosody_encoder = gr.Checkbox(label="Use prosody encoder", visible=False, value=False, scale=1, info="Use prosody encoder")
436
+ seed = gr.Number(label="Random Seed", scale=1, value=-1, minimum=-1)
437
+
438
+
439
+ # Input fields
440
+ ref_text = gr.Textbox(label="Reference Text", placeholder="Enter the text for the reference audio...")
441
+ ref_audio = gr.Audio(label="Reference Audio", type="filepath", interactive=True, show_download_button=True, editable=True)
442
+
443
+
444
+ with gr.Accordion("Denoise audio (Optional / Recommend)", open=True):
445
+ with gr.Row():
446
+ denoise_btn = gr.Button(value="Denoise")
447
+ cancel_btn = gr.Button(value="Cancel Denoise")
448
+ denoise_audio = gr.Audio(label="Denoised Audio", value=None, type="filepath", interactive=True, show_download_button=True, editable=True)
449
+
450
+ gen_text = gr.Textbox(label="Text to Generate", placeholder="Enter the text you want to generate...")
451
+
452
+ # Inference button and outputs
453
+ with gr.Row():
454
+ txt_info_gpu = gr.Textbox("", label="Device Info")
455
+ seed_info = gr.Textbox(label="Used Random Seed")
456
+ check_button_infer = gr.Button("Generate Audio", variant="primary")
457
+
458
+ gen_audio = gr.Audio(label="Generated Audio", type="filepath", interactive=True, show_download_button=True, editable=True)
459
+
460
+ # Examples
461
+ def _resolve_example(name: str) -> str:
462
+ local = Path(PRETRAINED_ROOT) / "data" / "test_examples" / name
463
+ return str(local) if local.is_file() else ""
464
+
465
+ examples = gr.Examples(
466
+ examples=[
467
+ ["em, #1 I have a list of YouTubers, and I'm gonna be going to their houses and raiding them by.",
468
+ _resolve_example("en.wav"),
469
+ "我有一份 YouTuber 名单,我打算去他们家,对他们进行突袭。",
470
+ ],
471
+ ["Te voy a dar un tip #1 que le copia a John Rockefeller, uno de los empresarios más picudos de la historia.",
472
+ _resolve_example("es.wav"),
473
+ "我要给你一个从历史上最精明的商人之一约翰·洛克菲勒那里抄来的秘诀。",
474
+ ],
475
+ ["Nova, #1 dia 25 desse mês vai rolar operação the last Frontier.",
476
+ _resolve_example("pt.wav"),
477
+ "新消息,本月二十五日,'最后的边疆行动'将启动。",
478
+ ],
479
+ ],
480
+ inputs=[
481
+ ref_text,
482
+ ref_audio,
483
+ gen_text,
484
+ ],
485
+ outputs=[gen_audio, txt_info_gpu, seed_info],
486
+ fn=infer,
487
+ cache_examples=False
488
+ )
489
+
490
+ # System Info section at the bottom
491
+ gr.Markdown("---")
492
+ gr.Markdown("## System Information")
493
+ with gr.Accordion("Update System Stats", open=False):
494
+ update_button = gr.Button("Update System Stats", scale=1)
495
+ output_box = gr.Textbox(label="GPU and CPU Information", lines=5, scale=5)
496
+
497
+ def update_stats():
498
+ return get_combined_stats()
499
+
500
+
501
+ denoise_btn.click(fn=denoise,
502
+ inputs=[ref_audio],
503
+ outputs=[denoise_audio])
504
+
505
+ cancel_btn.click(fn=cancel_denoise,
506
+ inputs=[ref_audio],
507
+ outputs=[denoise_audio])
508
+
509
+ # Event handlers
510
+ check_button_infer.click(
511
+ fn=infer,
512
+ inputs=[
513
+ cm_project,
514
+ cm_checkpoint,
515
+ exp_name,
516
+ ref_text,
517
+ ref_audio,
518
+ denoise_audio,
519
+ gen_text,
520
+ nfe_step,
521
+ ch_use_ema,
522
+ separate_langs,
523
+ frontend,
524
+ speed,
525
+ cfg_strength,
526
+ use_acc_grl,
527
+ ref_ratio,
528
+ no_ref_audio,
529
+ sway_sampling_coef,
530
+ use_prosody_encoder,
531
+ seed,
532
+ ],
533
+ outputs=[gen_audio, txt_info_gpu, seed_info],
534
+ )
535
+
536
+ bt_checkpoint_refresh.click(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
537
+ cm_project.change(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
538
+
539
+ ref_audio.change(
540
+ fn=lambda x: None,
541
+ inputs=[ref_audio],
542
+ outputs=[denoise_audio]
543
+ )
544
+
545
+ update_button.click(fn=update_stats, outputs=output_box)
546
+
547
+ # Auto-load system stats on startup
548
+ app.load(fn=update_stats, outputs=output_box)
549
+
550
+
551
@click.command()
@click.option("--port", "-p", default=7860, type=int, help="Port to run the app on")
@click.option("--host", "-H", default="0.0.0.0", help="Host to run the app on")
@click.option(
    "--share",
    "-s",
    default=False,
    is_flag=True,
    help="Share the app via Gradio share link",
)
@click.option("--api", "-a", default=True, is_flag=True, help="Allow API access")
def main(port, host, share, api):
    """CLI entry point: launch the Gradio app built above."""
    global app
    print("Starting LEMAS-TTS Inference Interface...")
    print(f"Device: {device}")
    # allowed_paths lets Gradio serve the bundled example audio files.
    app.queue(api_open=api).launch(
        server_name=host,
        server_port=port,
        share=share,
        show_api=api,
        allowed_paths=[str(Path(PRETRAINED_ROOT) / "data")],
    )


if __name__ == "__main__":
    main()
lemas_tts/__init__.py ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
"""LEMAS-TTS package: exposes the high-level `TTS` inference class."""

from .api import TTS

__all__ = ["TTS"]

__version__ = "0.1.0"
6
+
lemas_tts/api.py ADDED
@@ -0,0 +1,306 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import random
3
+ import sys
4
+ from pathlib import Path
5
+ import re, regex
6
+ import soundfile as sf
7
+ import tqdm
8
+ from hydra.utils import get_class
9
+ from omegaconf import OmegaConf
10
+
11
+ from lemas_tts.infer.utils_infer import (
12
+ load_model,
13
+ load_vocoder,
14
+ transcribe,
15
+ preprocess_ref_audio_text,
16
+ infer_process,
17
+ remove_silence_for_generated_wav,
18
+ save_spectrogram,
19
+ )
20
+ from lemas_tts.model.utils import seed_everything
21
+ from lemas_tts.model.backbones.dit import DiT
22
+
23
+
24
# Resolve repository layout so we can find pretrained assets (ckpts, vocoder, etc.)
THIS_FILE = Path(__file__).resolve()
print("THIS_FILE:", THIS_FILE)  # startup diagnostic for Spaces logs
27
+
28
+ def _find_repo_root(start: Path) -> Path:
29
+ """Locate the repo root by looking for a `pretrained_models` folder upwards."""
30
+ for p in [start, *start.parents]:
31
+ if (p / "pretrained_models").is_dir():
32
+ return p
33
+ cwd = Path.cwd()
34
+ if (cwd / "pretrained_models").is_dir():
35
+ return cwd
36
+ return start
37
+
38
+
39
+ def _find_pretrained_root(start: Path) -> Path:
40
+ """
41
+ Locate the `pretrained_models` root, with support for:
42
+ 1) Explicit env override (LEMAS_PRETRAINED_ROOT)
43
+ 2) Hugging Face Spaces model mount under /models
44
+ 3) Local source tree (searching upwards from this file)
45
+ """
46
+ # 1) Explicit override
47
+ env_root = os.environ.get("LEMAS_PRETRAINED_ROOT")
48
+ if env_root:
49
+ p = Path(env_root)
50
+ if p.is_dir():
51
+ return p
52
+
53
+ # 2) HF Spaces model mount: /models/<model_id>/pretrained_models
54
+ models_dir = Path("/models")
55
+ if models_dir.is_dir():
56
+ # Try the expected model name first
57
+ specific = models_dir / "LEMAS-Project__LEMAS-TTS"
58
+ if (specific / "pretrained_models").is_dir():
59
+ return specific / "pretrained_models"
60
+ # Otherwise, pick the first model that has a pretrained_models subdir
61
+ for child in models_dir.iterdir():
62
+ if child.is_dir() and (child / "pretrained_models").is_dir():
63
+ return child / "pretrained_models"
64
+
65
+ # 3) Local repo layout
66
+ repo_root = _find_repo_root(start)
67
+ if (repo_root / "pretrained_models").is_dir():
68
+ return repo_root / "pretrained_models"
69
+
70
+ cwd = Path.cwd()
71
+ if (cwd / "pretrained_models").is_dir():
72
+ return cwd / "pretrained_models"
73
+
74
+ # Fallback: assume under repo root even if directory is missing
75
+ return repo_root / "pretrained_models"
76
+
77
+
78
# Resolved filesystem anchors used throughout the package.
REPO_ROOT = _find_repo_root(THIS_FILE)  # repository checkout root
PRETRAINED_ROOT = _find_pretrained_root(THIS_FILE)  # `pretrained_models` dir
CKPTS_ROOT = PRETRAINED_ROOT / "ckpts"  # model/vocoder checkpoints
81
+
82
class TTS:
    """High-level inference wrapper for the LEMAS-TTS flow-matching model.

    Loads the DiT backbone, a vocoder, optionally a prosody encoder and a
    text frontend, and exposes `infer()` for zero-shot voice cloning.
    """

    def __init__(
        self,
        model="multilingual",
        ckpt_file="",
        vocab_file="",
        use_prosody_encoder=False,
        prosody_cfg_path="",
        prosody_ckpt_path="",
        ode_method="euler",
        use_ema=False,
        vocoder_local_path=str(CKPTS_ROOT / "vocos-mel-24khz"),
        device=None,
        hf_cache_dir=None,
        frontend="phone",
    ):
        """Build the inference stack.

        Args:
            model: config name — `configs/<model>.yaml` must exist.
            ckpt_file: path to the DiT checkpoint (.pt / .safetensors).
            vocab_file: path to vocab.txt for the tokenizer.
            use_prosody_encoder: attach the PretSSEL prosody encoder.
            prosody_cfg_path / prosody_ckpt_path: prosody encoder assets.
            ode_method: ODE solver for the CFM sampler.
            use_ema: load EMA weights from the checkpoint.
            vocoder_local_path: local vocoder dir; falls back to HF download.
            device: explicit device string; auto-detected when None.
            hf_cache_dir: cache dir passed to the vocoder downloader.
            frontend: "phone" | "char" | "bpe" | None (raw text).
        """
        # Load model architecture config from bundled yaml
        config_dir = THIS_FILE.parent / "configs"
        model_cfg = OmegaConf.load(config_dir / f"{model}.yaml")
        # model_cls = get_class(f"lemas_tts.model.dit.{model_cfg.model.backbone}")
        model_arc = model_cfg.model.arch

        self.mel_spec_type = model_cfg.model.mel_spec.mel_spec_type
        self.target_sample_rate = model_cfg.model.mel_spec.target_sample_rate

        self.ode_method = ode_method
        self.use_ema = use_ema
        # remember whether this TTS instance is configured with a prosody encoder
        self.use_prosody_encoder = use_prosody_encoder
        # Language-tag mapping — presumably LEMAS tags -> espeak-ng voice ids;
        # TODO confirm against the frontend implementation.
        self.langs = {"cmn":"zh", "zh":"zh", "en":"en-us", "it":"it", "es":"es", "pt":"pt-br", "fr":"fr-fr", "de":"de", "ru":"ru", "id":"id", "vi":"vi", "th":"th"}

        if device is not None:
            self.device = device
        else:
            import torch

            # Auto-detect: CUDA > XPU > MPS > CPU.
            self.device = (
                "cuda"
                if torch.cuda.is_available()
                else "xpu"
                if torch.xpu.is_available()
                else "mps"
                if torch.backends.mps.is_available()
                else "cpu"
            )

        # # Load models
        # Prefer local vocoder directory if it exists; otherwise let `load_vocoder`
        # fall back to downloading from the default HF repo (charactr/vocos-mel-24khz).
        vocoder_is_local = False
        if vocoder_local_path is not None:
            try:
                vocoder_is_local = Path(vocoder_local_path).is_dir()
            except TypeError:
                vocoder_is_local = False

        self.vocoder = load_vocoder(
            self.mel_spec_type, vocoder_is_local, vocoder_local_path, self.device, hf_cache_dir
        )
        # self.vocoder = load_vocoder(vocoder_name="vocos", is_local=True, local_path=vocoder_local_path, device=self.device)
        if frontend is not None:
            from lemas_tts.infer.frontend import TextNorm
            # try:
            # Try requested frontend first (typically "phone")
            self.frontend = TextNorm(dtype=frontend)
            # except Exception as e:
            #     # If espeak/phonemizer is not available, gracefully fall back to char frontend
            #     print(f"[TTS] Failed to init TextNorm with dtype='{frontend}': {e}")
            #     print("[TTS] Falling back to char frontend (no espeak required).")
            #     self.frontend = TextNorm(dtype="char")
        else:
            # NOTE(review): infer() dereferences self.frontend.dtype, so
            # frontend=None would crash there — confirm intended usage.
            self.frontend = None

        self.ema_model = load_model(
            DiT,
            model_arc,
            ckpt_file,
            self.mel_spec_type,
            vocab_file,
            self.ode_method,
            self.use_ema,
            self.device,
            use_prosody_encoder=use_prosody_encoder,
            prosody_cfg_path=prosody_cfg_path,
            prosody_ckpt_path=prosody_ckpt_path,
        )

    def transcribe(self, ref_audio, language=None):
        """ASR-transcribe `ref_audio` via `utils_infer.transcribe`."""
        return transcribe(ref_audio, language)

    def export_wav(self, wav, file_wave, remove_silence=False):
        """Write `wav` to `file_wave` at the model's target sample rate."""
        sf.write(file_wave, wav, self.target_sample_rate)

        if remove_silence:
            remove_silence_for_generated_wav(file_wave)

    def export_spectrogram(self, spec, file_spec):
        """Save a spectrogram image of `spec` to `file_spec`."""
        save_spectrogram(spec, file_spec)

    def infer(
        self,
        ref_file,
        ref_text,
        gen_text,
        show_info=print,
        progress=tqdm,
        target_rms=0.1,
        cross_fade_duration=0.15,
        use_acc_grl=False,
        ref_ratio=None,
        no_ref_audio=False,
        cfg_strength=2,
        nfe_step=32,
        speed=1.0,
        sway_sampling_coef=5,
        separate_langs=False,
        fix_duration=None,
        use_prosody_encoder=True,
        file_wave=None,
        file_spec=None,
        seed=None,
    ):
        """Synthesize `gen_text` in the voice of `ref_file`/`ref_text`.

        Returns (wav, sample_rate, spectrogram); optionally writes the wav
        to `file_wave` and the spectrogram image to `file_spec`.
        The RNG seed actually used is recorded on `self.seed`.
        """
        # Seed all RNGs; None draws a fresh random seed (reproducible later).
        if seed is None:
            seed = random.randint(0, sys.maxsize)
        seed_everything(seed)
        self.seed = seed

        ref_file, ref_text = preprocess_ref_audio_text(ref_file, ref_text)
        print("preprocesss:\n", "ref_file:", ref_file, "\nref_text:", ref_text)
        # Text normalization: phone frontend yields "|"-separated phone
        # tokens; char frontend yields a (lang, normalized-chars) pair.
        # In both cases "cmn" is folded into the "zh" language tag.
        if self.frontend.dtype == "phone":
            ref_text = self.frontend.text2phn(ref_text+". ").replace("(cmn)", "(zh)").split("|")
            gen_text = gen_text.split("\n")
            gen_text = [self.frontend.text2phn(x+". ").replace("(cmn)", "(zh)").split("|") for x in gen_text]

        elif self.frontend.dtype == "char":
            src_lang, ref_text = self.frontend.text2norm(ref_text+". ")
            ref_text = ["("+src_lang.replace("cmn", "zh")+")"] + list(ref_text)
            gen_text = gen_text.split("\n")
            gen_text = [self.frontend.text2norm(x+". ") for x in gen_text]
            gen_text = [["("+x[0].replace("cmn", "zh")+")"] + list(x[1]) for x in gen_text]
        print("after frontend:\n", "ref_text:", ref_text, "\ngen_text:", gen_text)

        if separate_langs:
            ref_text = self.process_phone_list(ref_text)  # Optional
            gen_text = [self.process_phone_list(x) for x in gen_text]

        print("gen_text:", gen_text, "\nref_text:", ref_text)

        wav, sr, spec = infer_process(
            ref_file,
            ref_text,
            gen_text,
            self.ema_model,
            self.vocoder,
            self.mel_spec_type,
            show_info=show_info,
            progress=progress,
            target_rms=target_rms,
            cross_fade_duration=cross_fade_duration,
            nfe_step=nfe_step,
            cfg_strength=cfg_strength,
            sway_sampling_coef=sway_sampling_coef,
            use_prosody_encoder=use_prosody_encoder,
            use_acc_grl=use_acc_grl,
            ref_ratio=ref_ratio,
            no_ref_audio=no_ref_audio,
            speed=speed,
            fix_duration=fix_duration,
            device=self.device,
        )

        if file_wave is not None:
            self.export_wav(wav, file_wave, remove_silence=False)

        if file_spec is not None:
            self.export_spectrogram(spec, file_spec)

        return wav, sr, spec


    def process_phone_list(self, parts):
        """(vocab756 version) Prefix phones that lack a language id with the
        currently active language tag, and normalize pause/punctuation runs.
        """
        puncs = {"#1", "#2", "#3", "#4", "_", "!", ",", ".", "?", '"', "'", "^", "。", ",", "?", "!"}
        # parts = phn_str.split('|')
        processed = []
        current_lang = ""
        for i in range(len(parts)):
            part = parts[i]
            if part.startswith('(') and part.endswith(')') and part[1:-1] in self.langs:
                # Token is a language id like "(en)": remember it but do not
                # emit it standalone.
                current_lang = part
                # processed.append(part)
            elif part in puncs:  # pause marker or punctuation
                # Collapse pause/punct sequences: drop a trailing "_" before
                # punctuation, and skip a "_" that follows punctuation.
                if len(processed) > 0 and processed[-1] == "_":
                    processed.pop()
                elif len(processed) > 0 and processed[-1] in puncs and part == "_":
                    continue
                processed.append(part)
                # if i < len(parts) - 1 and parts[i+1] != "_":
                #     processed.append("_")
            elif current_lang is not None:
                # Regular phone: prefix with the active language id.
                # NOTE(review): current_lang starts as "" and is never set to
                # None, so this branch always fires — likely intended.
                processed.append(f"{current_lang}{part}")
        return processed
288
+
289
+
290
+ if __name__ == "__main__":
291
+ f5tts = F5TTS()
292
+
293
+ wav, sr, spec = f5tts.infer(
294
+ ref_file=str((THIS_FILE.parent / "infer" / "examples" / "basic" / "basic_ref_en.wav").resolve()),
295
+ ref_text="some call me nature, others call me mother nature.",
296
+ gen_text=(
297
+ "I don't really care what you call me. I've been a silent spectator, watching species evolve, "
298
+ "empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture "
299
+ "you; ignore me and you shall face the consequences."
300
+ ),
301
+ file_wave=str((REPO_ROOT / "outputs" / "api_out.wav").resolve()),
302
+ file_spec=str((REPO_ROOT / "outputs" / "api_out.png").resolve()),
303
+ seed=None,
304
+ )
305
+
306
+ print("seed :", f5tts.seed)
lemas_tts/configs/multilingual_grl.yaml ADDED
@@ -0,0 +1,78 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # compute_environment: LOCAL_MACHINE
2
+ # debug: false
3
+ # distributed_type: MULTI_GPU
4
+ # downcast_bf16: 'no'
5
+ # enable_cpu_affinity: true
6
+ # gpu_ids: all
7
+ # # machine_rank: 0
8
+ # # main_training_function: main
9
+ # mixed_precision: bf16
10
+ # num_machines: 1
11
+ # num_processes: 16
12
+ # # rdzv_backend: static
13
+ # same_network: true
14
+ # use_cpu: false
15
+
16
+
17
+ hydra:
18
+ run:
19
+ dir: exp/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
20
+
21
+ datasets:
22
+ name: multilingual_vocab898_acc_grl_ctc_fix # dataset name
23
+ batch_size_per_gpu: 40000 # 8 GPUs, 8 * 38400 = 307200
24
+ batch_size_type: frame # frame | sample
25
+ max_samples: 64 # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
26
+ num_workers: 2
27
+ separate_langs: True
28
+
29
+ optim:
30
+ epochs: 100
31
+ learning_rate: 2e-5
32
+ num_warmup_updates: 1000 # warmup updates
33
+ grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps
34
+ max_grad_norm: 1.0 # gradient clipping
35
+ bnb_optimizer: False # use bnb 8bit AdamW optimizer or not
36
+ model:
37
+ name: multilingual # model name
38
+ tokenizer: custom # tokenizer type
39
+ tokenizer_path: "pretrained_models/data/multilingual_grl/vocab.txt" # if 'custom' tokenizer, define the path want to use (should be vocab.txt)
40
+ audio_dir: "pretrained_models/data/multilingual_grl"
41
+ use_ctc_loss: True # whether to use ctc loss
42
+ use_spk_enc: False
43
+ use_prosody_encoder: False
44
+ prosody_cfg_path: "pretrained_models/ckpts/prosody_encoder/pretssel_cfg.json" # pretssel_cfg.json
45
+ prosody_ckpt_path: "pretrained_models/ckpts/prosody_encoder/prosody_encoder_UnitY2.pt" # prosody_encoder_pretssel.pt
46
+
47
+ backbone: DiT
48
+ arch:
49
+ dim: 1024
50
+ depth: 22
51
+ heads: 16
52
+ ff_mult: 2
53
+ text_dim: 512
54
+ text_mask_padding: True
55
+ qk_norm: null # null | rms_norm
56
+ conv_layers: 4
57
+ pe_attn_head: null
58
+ checkpoint_activations: False # recompute activations and save memory for extra compute
59
+ mel_spec:
60
+ target_sample_rate: 24000
61
+ n_mel_channels: 100
62
+ hop_length: 256
63
+ win_length: 1024
64
+ n_fft: 1024
65
+ mel_spec_type: vocos # vocos | bigvgan
66
+ vocoder:
67
+ is_local: True # use local offline ckpt or not
68
+ # Path in the original training environment; kept here for reference only.
69
+ # For the open-sourced LEMAS-TTS repo, use `pretrained_models/ckpts/vocos-mel-24khz`.
70
+ local_path: "pretrained_models/ckpts/vocos-mel-24khz" # local vocoder path
71
+
72
+ ckpts:
73
+ logger: tensorboard # wandb | tensorboard | null
74
+ log_samples: True # infer random sample per save checkpoint. wip, normal to fail with extra long samples
75
+ save_per_updates: 1000 # save checkpoint per updates
76
+ keep_last_n_checkpoints: -1 # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
77
+ last_per_updates: 1000 # save last checkpoint per updates
78
+ save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
lemas_tts/configs/multilingual_prosody.yaml ADDED
@@ -0,0 +1,78 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # compute_environment: LOCAL_MACHINE
2
+ # debug: false
3
+ # distributed_type: MULTI_GPU
4
+ # downcast_bf16: 'no'
5
+ # enable_cpu_affinity: true
6
+ # gpu_ids: all
7
+ # # machine_rank: 0
8
+ # # main_training_function: main
9
+ # mixed_precision: bf16
10
+ # num_machines: 1
11
+ # num_processes: 16
12
+ # # rdzv_backend: static
13
+ # same_network: true
14
+ # use_cpu: false
15
+
16
+
17
+ hydra:
18
+ run:
19
+ dir: exp/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
20
+
21
+ datasets:
22
+ name: multilingual_vocab898_acc_grl_prosody_ctc_fix # dataset name
23
+ batch_size_per_gpu: 40000 # frames per GPU; with 8 GPUs the effective batch is 8 * 40000 = 320000 frames
24
+ batch_size_type: frame # frame | sample
25
+ max_samples: 64 # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
26
+ num_workers: 2
27
+ separate_langs: True
28
+
29
+ optim:
30
+ epochs: 100
31
+ learning_rate: 2e-5
32
+ num_warmup_updates: 1000 # warmup updates
33
+ grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps
34
+ max_grad_norm: 1.0 # gradient clipping
35
+ bnb_optimizer: False # use bnb 8bit AdamW optimizer or not
36
+ model:
37
+ name: multilingual # model name
38
+ tokenizer: custom # tokenizer type
39
+ tokenizer_path: "pretrained_models/data/multilingual_grl/vocab.txt" # if 'custom' tokenizer, define the path want to use (should be vocab.txt)
40
+ audio_dir: "pretrained_models/data/multilingual_grl"
41
+ use_ctc_loss: True # whether to use ctc loss
42
+ use_spk_enc: False
43
+ use_prosody_encoder: True
44
+ prosody_cfg_path: "pretrained_models/ckpts/prosody_encoder/pretssel_cfg.json" # pretssel_cfg.json
45
+ prosody_ckpt_path: "pretrained_models/ckpts/prosody_encoder/prosody_encoder_UnitY2.pt" # prosody_encoder_pretssel.pt
46
+
47
+ backbone: DiT
48
+ arch:
49
+ dim: 1024
50
+ depth: 22
51
+ heads: 16
52
+ ff_mult: 2
53
+ text_dim: 512
54
+ text_mask_padding: True
55
+ qk_norm: null # null | rms_norm
56
+ conv_layers: 4
57
+ pe_attn_head: null
58
+ checkpoint_activations: False # recompute activations and save memory for extra compute
59
+ mel_spec:
60
+ target_sample_rate: 24000
61
+ n_mel_channels: 100
62
+ hop_length: 256
63
+ win_length: 1024
64
+ n_fft: 1024
65
+ mel_spec_type: vocos # vocos | bigvgan
66
+ vocoder:
67
+ is_local: True # use local offline ckpt or not
68
+ # Path in the original training environment; kept here for reference only.
69
+ # For the open-sourced LEMAS-TTS repo, use `pretrained_models/ckpts/vocos-mel-24khz`.
70
+ local_path: "pretrained_models/ckpts/vocos-mel-24khz" # local vocoder path
71
+
72
+ ckpts:
73
+ logger: tensorboard # wandb | tensorboard | null
74
+ log_samples: True # infer random sample per save checkpoint. wip, normal to fail with extra long samples
75
+ save_per_updates: 1000 # save checkpoint per updates
76
+ keep_last_n_checkpoints: -1 # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
77
+ last_per_updates: 1000 # save last checkpoint per updates
78
+ save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
lemas_tts/infer/edit_multilingual.py ADDED
@@ -0,0 +1,184 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Multilingual speech editing helpers for LEMAS-TTS.
3
+
4
+ This is adapted from F5-TTS's `speech_edit_multilingual.py`, but uses the
5
+ `lemas_tts.api.TTS` API instead of `F5TTS`.
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ from typing import List, Tuple
11
+
12
+ import torch
13
+ import torch.nn.functional as F
14
+ import torchaudio
15
+
16
+ from lemas_tts.api import TTS
17
+
18
+
19
def build_tokens_from_text(tts: TTS, text: str) -> List[List[str]]:
    """
    Convert raw text into token sequence(s) consistent with the multilingual
    LEMAS-TTS training pipeline.

    Mirrors the frontend handling of `TTS.infer`:
    - frontend.dtype == "phone" -> TextNorm.text2phn -> split on '|'
    - frontend.dtype == "char"  -> TextNorm.text2norm -> language tag + chars
    - no frontend / unknown dtype -> plain character sequence fallback.

    Returns a single-element list wrapping the token list.
    """
    sentence = text.strip()
    # Guarantee terminal punctuation so the frontend sees a full sentence.
    if not sentence.endswith((".", "。", "!", "?", "?", "!")):
        sentence = sentence + "."

    frontend = getattr(tts, "frontend", None)
    if frontend is None:
        # No frontend attached: fall back to raw characters.
        return [list(sentence)]

    dtype = getattr(frontend, "dtype", "phone")

    if dtype == "phone":
        # Phone path: '|'-separated phones; normalize the Mandarin tag.
        phones = frontend.text2phn(sentence + " ").replace("(cmn)", "(zh)")
        return [[tok for tok in phones.split("|") if tok]]

    if dtype == "char":
        # Char path: language tag token followed by the normalized characters.
        lang, norm = frontend.text2norm(sentence + " ")
        tag = f"({lang.replace('cmn', 'zh')})"
        return [[tag] + list(norm)]

    # Unknown dtype: character-level fallback.
    return [list(sentence)]
54
+
55
+
56
def gen_wav_multilingual(
    tts: TTS,
    segment_audio: torch.Tensor,
    sr: int,
    target_text: str,
    parts_to_edit: List[Tuple[float, float]],
    nfe_step: int = 64,
    cfg_strength: float = 5.0,
    sway_sampling_coef: float = 3.0,
    ref_ratio: float = 1.0,
    no_ref_audio: bool = False,
    use_acc_grl: bool = False,
    use_prosody_encoder_flag: bool = False,
    seed: int | None = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Core editing routine:
    - build an edit mask over the mel frames;
    - run CFM.sample with that mask and the new text;
    - decode mel to waveform via the vocoder.

    Args:
        tts: loaded LEMAS-TTS ``TTS`` instance (supplies device, CFM model, vocoder).
        segment_audio: source waveform, 1-D ``[T]`` or 2-D ``[1, T]``, at rate ``sr``.
        sr: sample rate of ``segment_audio`` in Hz.
        target_text: transcript of the *edited* utterance (whole segment).
        parts_to_edit: ``(start_sec, end_sec)`` spans to regenerate; everything else is kept.
        nfe_step: number of solver steps for CFM sampling.
        cfg_strength: classifier-free guidance strength.
        sway_sampling_coef: sway sampling coefficient forwarded to ``model.sample``.
        ref_ratio, no_ref_audio, use_acc_grl, use_prosody_encoder_flag:
            forwarded to ``model.sample``; semantics defined by the CFM model.
        seed: optional RNG seed for reproducible sampling.

    Returns:
        ``(waveform, mel)`` — waveform with batch dim squeezed, mel as ``[B, C, T_mel]``.

    Raises:
        RuntimeError: if the CFM model has no attached MelSpec.
        ValueError: if ``tts.mel_spec_type`` is neither "vocos" nor "bigvgan".
    """
    device = tts.device
    model = tts.ema_model
    vocoder = tts.vocoder

    mel_spec = getattr(model, "mel_spec", None)
    if mel_spec is None:
        raise RuntimeError("CFM model has no attached MelSpec; check your checkpoint.")

    target_sr = int(mel_spec.target_sample_rate)
    hop_length = int(mel_spec.hop_length)
    target_rms = 0.1

    # Ensure a [1, T] batch dimension.
    if segment_audio.dim() == 1:
        audio = segment_audio.unsqueeze(0)
    else:
        audio = segment_audio

    # RMS normalization: only boost quiet audio up to target_rms (never
    # attenuate); the boost is undone on the output at the end.
    rms = torch.sqrt(torch.mean(torch.square(audio)))
    if rms < target_rms:
        audio = audio * target_rms / rms

    # Resample if needed
    if sr != target_sr:
        resampler = torchaudio.transforms.Resample(sr, target_sr)
        audio = resampler(audio)

    audio = audio.to(device)

    # Build edit mask over mel frames (True = keep original frame).
    # `offset` tracks, in samples (float), how far the mask has been built.
    offset = 0.0
    edit_mask = torch.zeros(1, 0, dtype=torch.bool, device=device)
    for (start, end) in parts_to_edit:
        # small safety margin around the region to edit
        start = max(start - 0.1, 0.0)
        end = min(end + 0.1, audio.shape[-1] / target_sr)
        part_dur_sec = end - start
        part_dur_samples = int(round(part_dur_sec * target_sr))
        start_samples = int(round(start * target_sr))

        # frames before edited span: keep original (mask=True)
        num_keep_frames = int(round((start_samples - offset) / hop_length))
        # frames inside edited span: to be regenerated (mask=False)
        num_edit_frames = int(round(part_dur_samples / hop_length))

        if num_keep_frames > 0:
            edit_mask = torch.cat(
                [edit_mask, torch.ones(1, num_keep_frames, dtype=torch.bool, device=device)],
                dim=-1,
            )
        if num_edit_frames > 0:
            edit_mask = torch.cat(
                [edit_mask, torch.zeros(1, num_edit_frames, dtype=torch.bool, device=device)],
                dim=-1,
            )

        offset = end * target_sr

    # Pad mask to full sequence length (True = keep original)
    total_frames = audio.shape[-1] // hop_length
    if edit_mask.shape[-1] < total_frames + 1:
        pad_len = total_frames + 1 - edit_mask.shape[-1]
        edit_mask = F.pad(edit_mask, (0, pad_len), value=True)

    duration = total_frames

    # Text tokens using multilingual frontend
    final_text_list = build_tokens_from_text(tts, target_text)

    # For multilingual models trained with `separate_langs=True`, we need to
    # post-process the phone sequence so that each non-punctuation token is
    # prefixed with its language id, consistent with training and the main API.
    if hasattr(tts, "process_phone_list") and len(final_text_list) > 0:
        final_text_list = [tts.process_phone_list(final_text_list[0])]
    print("final_text_list:", final_text_list)

    with torch.inference_mode():
        generated, _ = model.sample(
            cond=audio,
            text=final_text_list,
            duration=duration,
            steps=nfe_step,
            cfg_strength=cfg_strength,
            sway_sampling_coef=sway_sampling_coef,
            seed=seed,
            edit_mask=edit_mask,
            use_acc_grl=use_acc_grl,
            use_prosody_encoder=use_prosody_encoder_flag,
            ref_ratio=ref_ratio,
            no_ref_audio=no_ref_audio,
        )

    generated = generated.to(torch.float32)
    generated_mel = generated.permute(0, 2, 1)  # [B, C, T_mel]

    mel_for_vocoder = generated_mel.to(device)
    if tts.mel_spec_type == "vocos":
        wav_out = vocoder.decode(mel_for_vocoder)
    elif tts.mel_spec_type == "bigvgan":
        wav_out = vocoder(mel_for_vocoder)
    else:
        raise ValueError(f"Unsupported vocoder type: {tts.mel_spec_type}")

    # Undo the input RMS boost so output loudness matches the source.
    if rms < target_rms:
        wav_out = wav_out * rms / target_rms

    return wav_out.squeeze(0), generated_mel
184
+
lemas_tts/infer/frontend.py ADDED
@@ -0,0 +1,251 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os, re, regex
2
+ import langid
3
+ import uroman as ur
4
+ import jieba, zhconv
5
+ from num2words import num2words
6
+
7
+ jieba.set_dictionary(dictionary_path=os.path.join(os.path.dirname(__file__) + "/../infer/text_norm/jieba_dict.txt"))
8
+ # from pypinyin.core import Pinyin
9
+ from pypinyin import pinyin, lazy_pinyin, Style
10
+
11
+ from .text_norm.txt2pinyin import _PAUSE_SYMBOL, get_phoneme_from_char_and_pinyin
12
+ from .text_norm.cn_tn import NSWNormalizer
13
+ from .text_norm.tokenizer import TextTokenizer, txt2phone
14
+ from pypinyin.contrib.tone_convert import to_initials, to_finals_tone3
15
+ from pypinyin_dict.phrase_pinyin_data import large_pinyin # large_pinyin # cc_cedict
16
+ large_pinyin.load()
17
+
18
class TextNorm():
    """Multilingual text normalization / phonemization frontend.

    Combines a Mandarin pinyin lexicon, a Chinese non-standard-word (NSW)
    normalizer, and espeak-backed IPA tokenizers for the supported
    non-Chinese languages.
    """

    def __init__(self, dtype="phone"):
        # my_pinyin = Pinyin(MyConverter())
        # self.pinyin_parser = my_pinyin.pinyin
        # Mandarin pinyin syllable -> phoneme list, from the bundled lexicon file.
        cmn_lexicon = open(os.path.join(os.path.dirname(__file__)+'/../infer/text_norm/pinyin-lexicon-r.txt'),'r', encoding="utf-8").readlines()
        cmn_lexicon = [x.strip().split() for x in cmn_lexicon]
        self.cmn_dict = {x[0]:x[1:] for x in cmn_lexicon}
        # Restrict langid's search space to the languages the pipeline supports.
        langid.set_languages(['es','pt','zh','en','de','fr','it','ru', 'vi','id','th','ja','ko','ar'])
        # language id -> espeak voice name; Chinese goes through the pinyin path
        # instead (the "cmn"/"zh" entries are deliberately commented out).
        langs = {"en":"en-us", "it":"it", "es":"es", "pt":"pt-br", "fr":"fr-fr", "de":"de", "ru":"ru", "vi":"vi", "id":"id", "th":"th", "ja":"ja", "ko":"ko"} # "zh":"cmn", "cmn":"cmn", "ar":"ar-sa"}
        text_tokenizer = {}
        for k,v in langs.items():
            tokenizer = TextTokenizer(language=v, backend="espeak")
            # NOTE(review): "cmn" is commented out of `langs` above, so this
            # remap branch is currently dead; kept as-is.
            lang = "zh" if k == "cmn" else k
            text_tokenizer[k] = (lang, tokenizer)
        self.text_tokenizer = text_tokenizer
        self.cn_tn = NSWNormalizer()
        # dtype: "phone" (IPA/pinyin phonemes) or "char" (normalized characters).
        self.dtype = dtype
35
+
36
+ def detect_lang(self, text):
37
+ lang, _ = langid.classify(text)[0]
38
+ return lang
39
+
40
+ def sil_type(self, time_s):
41
+ if round(time_s) < 0.4:
42
+ return ""
43
+ elif round(time_s) >= 0.4 and round(time_s) < 0.8:
44
+ return "#1"
45
+ elif round(time_s) >= 0.8 and round(time_s) < 1.5:
46
+ return "#2"
47
+ elif round(time_s) >= 1.5 and round(time_s) < 3.0:
48
+ return "#3"
49
+ elif round(time_s) >= 3.0:
50
+ return "#4"
51
+
52
+
53
+ def add_sil_raw(self, sub_list, start_time, end_time, target_transcript):
54
+ txt = []
55
+ txt_list = [x["word"] for x in sub_list]
56
+ sil = self.sil_type(sub_list[0]["start"])
57
+ if len(sil) > 0:
58
+ txt.append(sil)
59
+ txt.append(txt_list[0])
60
+ for i in range(1, len(sub_list)):
61
+ if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
62
+ txt.append(target_transcript)
63
+ target_transcript = ""
64
+ else:
65
+ sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
66
+ if len(sil) > 0:
67
+ txt.append(sil)
68
+ txt.append(txt_list[i])
69
+ return ' '.join(txt)
70
+
71
    def add_sil(self, sub_list, start_time, end_time, target_transcript, src_lang, tar_lang):
        """Like `add_sil_raw`, but returns language-tagged chunks.

        Builds a list of [lang, text] pairs: source-language words/pauses
        outside [start_time, end_time], and *target_transcript* (in
        *tar_lang*) for words inside it. Consecutive same-language chunks
        are merged; empty entries are dropped.
        """
        txts = []
        txt_list = [x["word"] for x in sub_list]
        sil = self.sil_type(sub_list[0]["start"])
        if len(sil) > 0:
            txts.append([src_lang, sil])

        # Keep the first word only if it starts before the edited region.
        if sub_list[0]["start"] < start_time:
            txts.append([src_lang, txt_list[0]])
        for i in range(1, len(sub_list)):
            if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
                # Inside the edited span: substitute the target text once;
                # later matches append "" and are dropped in the merge below.
                txts.append([tar_lang, target_transcript])
                target_transcript = ""
            else:
                sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
                if len(sil) > 0:
                    txts.append([src_lang, sil])
                txts.append([src_lang, txt_list[i]])

        # Merge consecutive same-language chunks, skipping empty texts.
        # NOTE(review): raises IndexError if `txts` is empty (e.g. a
        # single-word sub_list fully inside the edit span with an empty
        # target) — confirm callers guarantee non-empty input.
        target_txt = [txts[0]]
        for txt in txts[1:]:
            if txt[1] == "":
                continue
            if txt[0] != target_txt[-1][0]:
                target_txt.append([txt[0], ""])
            target_txt[-1][-1] += " " + txt[1]

        return target_txt
99
+
100
+ def replace_numbers_with_words(self, sentence, lang="en"):
101
+ sentence = re.sub(r'(\d+)', r' \1 ', sentence) # add spaces around numbers
102
+
103
+ def replace_with_words(match):
104
+ num = match.group(0)
105
+ try:
106
+ return num2words(num, lang=lang) # Convert numbers to words
107
+ except:
108
+ return num # In case num2words fails (unlikely with digits but just to be safe)
109
+ return re.sub(r'\b\d+\b', replace_with_words, sentence) # Regular expression that matches numbers
110
+
111
+
112
    def get_prompt(self, sub_list, start_time, end_time, src_lang):
        """Collect the words inside [start_time, end_time] as a prompt.

        Returns merged [lang, text] chunks (same merge scheme as `add_sil`),
        including pause markers derived from the word timing gaps.
        """
        txts = []
        txt_list = [x["word"] for x in sub_list]

        # Include the very first word (and any leading silence) only when the
        # prompt window starts at or before it.
        if start_time <= sub_list[0]["start"]:
            sil = self.sil_type(sub_list[0]["start"])
            if len(sil) > 0:
                txts.append([src_lang, sil])
            txts.append([src_lang, txt_list[0]])

        for i in range(1, len(sub_list)):
            # if sub_list[i]["start"] <= start_time and sub_list[i]["end"] <= end_time:
            #     txts.append([tar_lang, target_transcript])
            #     target_transcript = ""
            if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
                sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
                if len(sil) > 0:
                    txts.append([src_lang, sil])
                txts.append([src_lang, txt_list[i]])

        # Merge consecutive same-language chunks, dropping empty texts.
        # NOTE(review): raises IndexError if no word falls inside the window
        # (txts empty) — confirm callers guarantee an overlap.
        target_txt = [txts[0]]
        for txt in txts[1:]:
            if txt[1] == "":
                continue
            if txt[0] != target_txt[-1][0]:
                target_txt.append([txt[0], ""])
            target_txt[-1][-1] += " " + txt[1]
        return target_txt
140
+
141
+
142
    def txt2pinyin(self, text):
        """Normalize Chinese text and convert it to pinyin-based phonemes.

        The input may contain explicit pause markers "#1".."#4", Chinese
        characters, and Latin words. Returns (txts, phonemes): the display
        tokens and the phoneme tokens, built in parallel.
        """
        txts, phonemes = [], []
        # Split out explicit pause markers so they pass through untouched.
        texts = re.split(r"(#\d)", text)
        print("before norm: ", texts)
        for text in texts:
            if text in {'#1', '#2', '#3', '#4'}:
                txts.append(text)
                phonemes.append(text)
                continue
            # NSW normalization (numbers, dates, units, ... -> Chinese words).
            text = self.cn_tn.normalize(text.strip())

            text_list = list(jieba.cut(text))
            print("jieba cut: ", text, text_list)
            for words in text_list:
                if words in _PAUSE_SYMBOL:
                    # phonemes[-1] += _PAUSE_SYMBOL[words]
                    # Punctuation mapped to a pause symbol.
                    # NOTE(review): `txts[-1] += words` assumes a previous
                    # token exists; a leading punctuation mark would raise
                    # IndexError — confirm inputs.
                    phonemes.append(_PAUSE_SYMBOL[words])
                    # phonemes.append('#1')
                    txts[-1] += words
                elif re.search("[\u4e00-\u9fa5]+", words):
                    # Chinese word: pinyin with tone sandhi, neutral tone as "5".
                    # pinyin = self.pinyin_parser(words, style=Style.TONE3, errors="ignore")
                    pinyin = lazy_pinyin(words, style=Style.TONE3, tone_sandhi=True, neutral_tone_with_five=True)
                    new_pinyin = []
                    for x in pinyin:
                        x = "".join(x)
                        if "#" not in x:
                            new_pinyin.append(x)
                        else:
                            # Pinyin containing "#" is treated as unparseable;
                            # fall back to the raw word.
                            phonemes.append(words)
                            continue
                    # new_pinyin = change_tone_in_bu_or_yi(words, new_pinyin) if len(words)>1 and words[-1] not in {"一","不"} else new_pinyin
                    phoneme = get_phoneme_from_char_and_pinyin(words, new_pinyin)
                    phonemes += phoneme
                    txts += list(words)
                elif re.search(r"[a-zA-Z]", words) or re.search(r"#[1-4]", words):
                    # Latin word (or embedded pause marker): uppercase verbatim.
                    phonemes.append(words.upper())
                    txts.append(words.upper())
        # phonemes.append("#1")
        # phones = " ".join(phonemes)
        return txts, phonemes
182
+
183
+
184
    def txt2pin_phns(self, text):
        """Convert a space-separated mixed pinyin/foreign-word string to a
        '|'-separated phone string.

        Pinyin syllables found in the Mandarin lexicon become
        "(zh)|initial|final"; other words are phonemized to IPA via the
        per-language espeak tokenizer and prefixed with a "(lang)" tag.
        Word boundaries are marked with "_" tokens.
        """
        # Detach punctuation that is glued to the previous token, then
        # collapse whitespace.
        text = re.sub(r'(?<! )(' + r'[^\w\s]' + r')', r' \1', text)
        text = re.sub(r'\s+', ' ', text).strip()

        # print(text.split(" "))
        res_list = []
        for txt in text.split(" "):
            if txt in self.cmn_dict:
                # Mandarin syllable: language tag + initial + final(tone3).
                # res_list += ["(zh)" + x for x in self.cmn_dict[txt]]
                res_list.append("(zh)")
                res_list.append(to_initials(txt, strict=False))
                res_list.append(to_finals_tone3(txt, neutral_tone_with_five=True))
            elif txt == '':
                continue
            elif txt[0] in {"#1", "#2", "#3", "#4"} or not bool(regex.search(r'\p{L}', txt[0][0])):
                # Pause marker or punctuation: drop a trailing word boundary
                # and emit the token verbatim.
                # NOTE(review): `txt[0] in {"#1", ...}` compares a single
                # character against two-character markers and can never be
                # true; the punctuation test effectively decides this branch.
                if len(res_list) > 0 and res_list[-1] == "_":
                    res_list.pop()
                res_list += [txt]
                continue
            else:
                # Foreign word: detect its language and phonemize to IPA.
                if len(res_list) > 0 and res_list[-1] == "_":
                    res_list.pop()
                lang = langid.classify(txt)[0]
                lang = lang if lang in self.text_tokenizer else "en"
                tokenizer = self.text_tokenizer[lang][1]
                ipa = tokenizer.backend.phonemize([txt], separator=tokenizer.separator, strip=True, njobs=1)
                phns = ipa[0] if ipa[0][0] == "(" else f"({lang})_" + ipa[0]
                res_list += phns.replace("_", "|_|").split("|")

                # lang = phns.split(")")[0][1:]
                # phns = phns[len(lang)+3:].replace("_", "|_|")
                # phns = phns.split("|")
                # for i in range(len(phns)):
                #     if phns[i] not in {"#1", "#2", "#3", "#4", "_", ",", ".", "?", "!"}:
                #         phns[i] = f"({lang})" + phns[i]
                # res_list += phns
            # Word boundary marker after each emitted word.
            res_list.append("_")
        res = "|".join(res_list)
        # Collapse runs of consecutive boundary markers.
        res = re.sub(r'(\|_)+', '|_', res)
        return res
224
+
225
+
226
    def text2phn(self, sentence, lang=None):
        """Convert *sentence* into a '|'-separated phone string.

        Text containing Chinese characters goes through jieba + pinyin and
        then the mixed pinyin/IPA path (`txt2pin_phns`); any other language
        is phonemized to IPA directly and prefixed with a "(lang)" tag.
        """
        if not lang:
            lang = langid.classify(sentence)[0]
        if re.search("[\u4e00-\u9fa5]+", sentence):
            txts, phones = self.txt2pinyin(sentence)
            transcript_norm = " ".join(phones)
            phones = self.txt2pin_phns(transcript_norm)  # IPA mix Pinyin
        else:
            # NOTE(review): `transcript` is computed but unused here.
            transcript = self.replace_numbers_with_words(sentence, lang=lang).split(' ')
            transcript_norm = sentence
            # All IPA
            # Sentence-final periods are replaced with commas before phonemization.
            phones = txt2phone(self.text_tokenizer[lang][1], transcript_norm.strip().replace(".", ",").replace("。", ","))
            # NOTE(review): `phones[0]` raises IndexError on an empty result.
            phones = f"({lang})|" + phones if phones[0] != "(" else phones
        return phones
240
+
241
+
242
+ def text2norm(self, sentence, lang=None):
243
+ if not lang:
244
+ lang = langid.classify(sentence)[0]
245
+ if re.search("[\u4e00-\u9fa5]+", sentence):
246
+ txts, phones = self.txt2pinyin(sentence)
247
+ transcript_norm = " ".join(phones)
248
+ else:
249
+ transcript = self.replace_numbers_with_words(sentence, lang=lang).split(' ')
250
+ transcript_norm = sentence
251
+ return (lang, transcript_norm)
lemas_tts/infer/infer_cli.py ADDED
@@ -0,0 +1,386 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import argparse
2
+ import codecs
3
+ import os
4
+ import re
5
+ from datetime import datetime
6
+ from importlib.resources import files
7
+ from pathlib import Path
8
+
9
+ import numpy as np
10
+ import soundfile as sf
11
+ import tomli
12
+ from cached_path import cached_path
13
+ from hydra.utils import get_class
14
+ from omegaconf import OmegaConf
15
+
16
+ from lemas_tts.infer.utils_infer import (
17
+ mel_spec_type,
18
+ target_rms,
19
+ cross_fade_duration,
20
+ nfe_step,
21
+ cfg_strength,
22
+ sway_sampling_coef,
23
+ speed,
24
+ fix_duration,
25
+ device,
26
+ infer_process,
27
+ load_model,
28
+ load_vocoder,
29
+ preprocess_ref_audio_text,
30
+ remove_silence_for_generated_wav,
31
+ )
32
+
33
+ THIS_FILE = Path(__file__).resolve()
34
+
35
+
36
+ def _find_repo_root(start: Path) -> Path:
37
+ """Locate the repo root by looking for a `pretrained_models` folder upwards."""
38
+ for p in [start, *start.parents]:
39
+ if (p / "pretrained_models").is_dir():
40
+ return p
41
+ cwd = Path.cwd()
42
+ if (cwd / "pretrained_models").is_dir():
43
+ return cwd
44
+ return start
45
+
46
+
47
+ REPO_ROOT = _find_repo_root(THIS_FILE)
48
+ PRETRAINED_ROOT = REPO_ROOT / "pretrained_models"
49
+ CKPTS_ROOT = PRETRAINED_ROOT / "ckpts"
50
+
51
+
52
+ parser = argparse.ArgumentParser(
53
+ prog="python3 infer-cli.py",
54
+ description="Commandline interface for E2/F5 TTS with Advanced Batch Processing.",
55
+ epilog="Specify options above to override one or more settings from config.",
56
+ )
57
+ parser.add_argument(
58
+ "-c",
59
+ "--config",
60
+ type=str,
61
+ default=os.path.join(files("lemas_tts").joinpath("infer/examples/basic"), "basic.toml"),
62
+ help="The configuration file, default see infer/examples/basic/basic.toml",
63
+ )
64
+
65
+
66
+ # Note. Not to provide default value here in order to read default from config file
67
+
68
+ parser.add_argument(
69
+ "-m",
70
+ "--model",
71
+ type=str,
72
+ help="The model name: F5TTS_v1_Base | F5TTS_Base | E2TTS_Base | etc.",
73
+ )
74
+ parser.add_argument(
75
+ "-mc",
76
+ "--model_cfg",
77
+ type=str,
78
+ help="The path to F5-TTS model config file .yaml",
79
+ )
80
+ parser.add_argument(
81
+ "-p",
82
+ "--ckpt_file",
83
+ type=str,
84
+ help="The path to model checkpoint .pt, leave blank to use default",
85
+ )
86
+ parser.add_argument(
87
+ "-v",
88
+ "--vocab_file",
89
+ type=str,
90
+ help="The path to vocab file .txt, leave blank to use default",
91
+ )
92
+ parser.add_argument(
93
+ "-r",
94
+ "--ref_audio",
95
+ type=str,
96
+ help="The reference audio file.",
97
+ )
98
+ parser.add_argument(
99
+ "-s",
100
+ "--ref_text",
101
+ type=str,
102
+ help="The transcript/subtitle for the reference audio",
103
+ )
104
+ parser.add_argument(
105
+ "-t",
106
+ "--gen_text",
107
+ type=str,
108
+ help="The text to make model synthesize a speech",
109
+ )
110
+ parser.add_argument(
111
+ "-f",
112
+ "--gen_file",
113
+ type=str,
114
+ help="The file with text to generate, will ignore --gen_text",
115
+ )
116
+ parser.add_argument(
117
+ "-o",
118
+ "--output_dir",
119
+ type=str,
120
+ help="The path to output folder",
121
+ )
122
+ parser.add_argument(
123
+ "-w",
124
+ "--output_file",
125
+ type=str,
126
+ help="The name of output file",
127
+ )
128
parser.add_argument(
    "--save_chunk",
    action="store_true",
    # Fixed grammar in user-facing help text ("each audio chunks" -> "each audio chunk").
    help="To save each audio chunk during inference",
)
parser.add_argument(
    "--remove_silence",
    action="store_true",
    # Fixed user-facing typo: "ouput" -> "output".
    help="To remove long silence found in output",
)
138
+ parser.add_argument(
139
+ "--load_vocoder_from_local",
140
+ action="store_true",
141
+ help="To load vocoder from local dir, default to ../checkpoints/vocos-mel-24khz",
142
+ )
143
+ parser.add_argument(
144
+ "--vocoder_name",
145
+ type=str,
146
+ choices=["vocos", "bigvgan"],
147
+ help=f"Used vocoder name: vocos | bigvgan, default {mel_spec_type}",
148
+ )
149
+ parser.add_argument(
150
+ "--target_rms",
151
+ type=float,
152
+ help=f"Target output speech loudness normalization value, default {target_rms}",
153
+ )
154
+ parser.add_argument(
155
+ "--cross_fade_duration",
156
+ type=float,
157
+ help=f"Duration of cross-fade between audio segments in seconds, default {cross_fade_duration}",
158
+ )
159
+ parser.add_argument(
160
+ "--nfe_step",
161
+ type=int,
162
+ help=f"The number of function evaluation (denoising steps), default {nfe_step}",
163
+ )
164
+ parser.add_argument(
165
+ "--cfg_strength",
166
+ type=float,
167
+ help=f"Classifier-free guidance strength, default {cfg_strength}",
168
+ )
169
+ parser.add_argument(
170
+ "--sway_sampling_coef",
171
+ type=float,
172
+ help=f"Sway Sampling coefficient, default {sway_sampling_coef}",
173
+ )
174
+ parser.add_argument(
175
+ "--speed",
176
+ type=float,
177
+ help=f"The speed of the generated audio, default {speed}",
178
+ )
179
+ parser.add_argument(
180
+ "--fix_duration",
181
+ type=float,
182
+ help=f"Fix the total duration (ref and gen audios) in seconds, default {fix_duration}",
183
+ )
184
+ parser.add_argument(
185
+ "--device",
186
+ type=str,
187
+ help="Specify the device to run on",
188
+ )
189
+ args = parser.parse_args()
190
+
191
+
192
+ # config file
193
+
194
+ config = tomli.load(open(args.config, "rb"))
195
+
196
+
197
+ # command-line interface parameters
198
+
199
+ model = args.model or config.get("model", "F5TTS_v1_Base")
200
+ ckpt_file = args.ckpt_file or config.get("ckpt_file", "")
201
+ vocab_file = args.vocab_file or config.get("vocab_file", "")
202
+
203
+ ref_audio = args.ref_audio or config.get("ref_audio", "infer/examples/basic/basic_ref_en.wav")
204
+ ref_text = (
205
+ args.ref_text
206
+ if args.ref_text is not None
207
+ else config.get("ref_text", "Some call me nature, others call me mother nature.")
208
+ )
209
+ gen_text = args.gen_text or config.get("gen_text", "Here we generate something just for test.")
210
+ gen_file = args.gen_file or config.get("gen_file", "")
211
+
212
+ output_dir = args.output_dir or config.get("output_dir", "tests")
213
+ output_file = args.output_file or config.get(
214
+ "output_file", f"infer_cli_{datetime.now().strftime(r'%Y%m%d_%H%M%S')}.wav"
215
+ )
216
+
217
+ save_chunk = args.save_chunk or config.get("save_chunk", False)
218
+ remove_silence = args.remove_silence or config.get("remove_silence", False)
219
+ load_vocoder_from_local = args.load_vocoder_from_local or config.get("load_vocoder_from_local", False)
220
+
221
+ vocoder_name = args.vocoder_name or config.get("vocoder_name", mel_spec_type)
222
+ target_rms = args.target_rms or config.get("target_rms", target_rms)
223
+ cross_fade_duration = args.cross_fade_duration or config.get("cross_fade_duration", cross_fade_duration)
224
+ nfe_step = args.nfe_step or config.get("nfe_step", nfe_step)
225
+ cfg_strength = args.cfg_strength or config.get("cfg_strength", cfg_strength)
226
+ sway_sampling_coef = args.sway_sampling_coef or config.get("sway_sampling_coef", sway_sampling_coef)
227
+ speed = args.speed or config.get("speed", speed)
228
+ fix_duration = args.fix_duration or config.get("fix_duration", fix_duration)
229
+ device = args.device or config.get("device", device)
230
+
231
+
232
+ # patches for pip pkg user
233
+ if "infer/examples/" in ref_audio:
234
+ ref_audio = str(files("lemas_tts").joinpath(f"{ref_audio}"))
235
+ if "infer/examples/" in gen_file:
236
+ gen_file = str(files("lemas_tts").joinpath(f"{gen_file}"))
237
+ if "voices" in config:
238
+ for voice in config["voices"]:
239
+ voice_ref_audio = config["voices"][voice]["ref_audio"]
240
+ if "infer/examples/" in voice_ref_audio:
241
+ config["voices"][voice]["ref_audio"] = str(files("lemas_tts").joinpath(f"{voice_ref_audio}"))
242
+
243
+
244
+ # ignore gen_text if gen_file provided
245
+
246
+ if gen_file:
247
+ gen_text = codecs.open(gen_file, "r", "utf-8").read()
248
+
249
+
250
+ # output path
251
+
252
+ wave_path = Path(output_dir) / output_file
253
+ # spectrogram_path = Path(output_dir) / "infer_cli_out.png"
254
+ if save_chunk:
255
+ output_chunk_dir = os.path.join(output_dir, f"{Path(output_file).stem}_chunks")
256
+ if not os.path.exists(output_chunk_dir):
257
+ os.makedirs(output_chunk_dir)
258
+
259
+
260
+ # load vocoder
261
+
262
+ if vocoder_name == "vocos":
263
+ vocoder_local_path = str(CKPTS_ROOT / "vocos-mel-24khz")
264
+ elif vocoder_name == "bigvgan":
265
+ vocoder_local_path = "../checkpoints/bigvgan_v2_24khz_100band_256x"
266
+
267
+ vocoder = load_vocoder(
268
+ vocoder_name=vocoder_name, is_local=load_vocoder_from_local, local_path=vocoder_local_path, device=device
269
+ )
270
+
271
+
272
+ # load TTS model
273
+
274
+ model_cfg = OmegaConf.load(
275
+ args.model_cfg or config.get("model_cfg", str(files("lemas_tts").joinpath(f"configs/{model}.yaml")))
276
+ )
277
+ model_cls = get_class(f"lemas_tts.model.{model_cfg.model.backbone}")
278
+ model_arc = model_cfg.model.arch
279
+
280
+ repo_name, ckpt_step, ckpt_type = "F5-TTS", 1250000, "safetensors"
281
+
282
+ if model != "F5TTS_Base":
283
+ assert vocoder_name == model_cfg.model.mel_spec.mel_spec_type
284
+
285
+ # override for previous models
286
+ if model == "F5TTS_Base":
287
+ if vocoder_name == "vocos":
288
+ ckpt_step = 1200000
289
+ elif vocoder_name == "bigvgan":
290
+ model = "F5TTS_Base_bigvgan"
291
+ ckpt_type = "pt"
292
+ elif model == "E2TTS_Base":
293
+ repo_name = "E2-TTS"
294
+ ckpt_step = 1200000
295
+
296
+ if not ckpt_file:
297
+ ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{model}/model_{ckpt_step}.{ckpt_type}"))
298
+
299
+ print(f"Using {model}...")
300
+ ema_model = load_model(
301
+ model_cls, model_arc, ckpt_file, mel_spec_type=vocoder_name, vocab_file=vocab_file, device=device
302
+ )
303
+
304
+
305
+ # inference process
306
+
307
+
308
def main():
    """Run batch inference using the module-level config/CLI settings.

    Splits `gen_text` into voice-tagged chunks (`[voice] text ...`),
    synthesizes each chunk with the matching reference voice, concatenates
    the segments, and writes the result to `wave_path`.
    """
    main_voice = {"ref_audio": ref_audio, "ref_text": ref_text}
    if "voices" not in config:
        voices = {"main": main_voice}
    else:
        voices = config["voices"]
        voices["main"] = main_voice
    # Preprocess every reference voice once (transcription / trimming).
    for voice in voices:
        print("Voice:", voice)
        print("ref_audio ", voices[voice]["ref_audio"])
        voices[voice]["ref_audio"], voices[voice]["ref_text"] = preprocess_ref_audio_text(
            voices[voice]["ref_audio"], voices[voice]["ref_text"]
        )
        print("ref_audio_", voices[voice]["ref_audio"], "\n\n")

    generated_audio_segments = []
    # Split *before* each "[voice]" tag (lookahead keeps the tag in the chunk).
    reg1 = r"(?=\[\w+\])"
    chunks = re.split(reg1, gen_text)
    reg2 = r"\[(\w+)\]"
    for text in chunks:
        if not text.strip():
            continue
        match = re.match(reg2, text)
        if match:
            voice = match[1]
        else:
            print("No voice tag found, using main.")
            voice = "main"
        if voice not in voices:
            print(f"Voice {voice} not found, using main.")
            voice = "main"
        # Strip the voice tag before synthesis.
        text = re.sub(reg2, "", text)
        ref_audio_ = voices[voice]["ref_audio"]
        ref_text_ = voices[voice]["ref_text"]
        gen_text_ = text.strip()
        print(f"Voice: {voice}")
        audio_segment, final_sample_rate, spectragram = infer_process(
            ref_audio_,
            ref_text_,
            gen_text_,
            ema_model,
            vocoder,
            mel_spec_type=vocoder_name,
            target_rms=target_rms,
            cross_fade_duration=cross_fade_duration,
            nfe_step=nfe_step,
            cfg_strength=cfg_strength,
            sway_sampling_coef=sway_sampling_coef,
            speed=speed,
            fix_duration=fix_duration,
            device=device,
        )
        generated_audio_segments.append(audio_segment)

        if save_chunk:
            # Truncate very long text so the chunk filename stays manageable.
            if len(gen_text_) > 200:
                gen_text_ = gen_text_[:200] + " ... "
            sf.write(
                os.path.join(output_chunk_dir, f"{len(generated_audio_segments) - 1}_{gen_text_}.wav"),
                audio_segment,
                final_sample_rate,
            )

    if generated_audio_segments:
        final_wave = np.concatenate(generated_audio_segments)

        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        with open(wave_path, "wb") as f:
            sf.write(f.name, final_wave, final_sample_rate)
            # Remove silence
            if remove_silence:
                remove_silence_for_generated_wav(f.name)
            print(f.name)
383
+
384
+
385
+ if __name__ == "__main__":
386
+ main()
lemas_tts/infer/text_norm/__init__.py ADDED
File without changes
lemas_tts/infer/text_norm/cn_tn.py ADDED
@@ -0,0 +1,824 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ # coding=utf-8
3
+ # Authors:
4
+ # 2019.5 Zhiyang Zhou (https://github.com/Joee1995/chn_text_norm.git)
5
+ # 2019.9 Jiayu DU
6
+ #
7
+ # requirements:
8
+ # - python 3.X
9
+ # notes: python 2.X WILL fail or produce misleading results
10
+
11
+ import sys, os, argparse, codecs, string, re, unicodedata
12
+
13
+ # ================================================================================ #
14
+ # basic constant
15
+ # ================================================================================ #
16
+ CHINESE_DIGIS = u'零一二三四五六七八九'
17
+ BIG_CHINESE_DIGIS_SIMPLIFIED = u'零壹贰叁肆伍陆柒捌玖'
18
+ BIG_CHINESE_DIGIS_TRADITIONAL = u'零壹貳參肆伍陸柒捌玖'
19
+ SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = u'十百千万'
20
+ SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = u'拾佰仟萬'
21
+ LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'亿兆京垓秭穰沟涧正载'
22
+ LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'億兆京垓秭穰溝澗正載'
23
+ SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'十百千万'
24
+ SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'拾佰仟萬'
25
+
26
+ ZERO_ALT = u'〇'
27
+ ONE_ALT = u'幺'
28
+ TWO_ALTS = [u'两', u'兩']
29
+
30
+ POSITIVE = [u'正', u'正']
31
+ NEGATIVE = [u'负', u'負']
32
+ POINT = [u'点', u'點']
33
+ # PLUS = [u'加', u'加']
34
+ # SIL = [u'杠', u'槓']
35
+
36
+ # 中文数字系统类型
37
+ NUMBERING_TYPES = ['low', 'mid', 'high']
38
+
39
+ CURRENCY_NAMES = '(人民币|美元|日元|英镑|欧元|马克|法郎|加拿大元|澳元|港币|先令|芬兰马克|爱尔兰镑|' \
40
+ '里拉|荷兰盾|埃斯库多|比塞塔|印尼盾|林吉特|新西兰元|比索|卢布|新加坡元|韩元|泰铢)'
41
+ CURRENCY_UNITS = '((亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)'
42
+ COM_QUANTIFIERS = '(匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|' \
43
+ '砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|' \
44
+ '针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|' \
45
+ '毫|厘|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|' \
46
+ '盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|' \
47
+ '纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块)'
48
+
49
+ # punctuation information are based on Zhon project (https://github.com/tsroten/zhon.git)
50
+ CHINESE_PUNC_STOP = '!?。。'
51
+ CHINESE_PUNC_NON_STOP = '"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
52
+ CHINESE_PUNC_OTHER = '·〈〉-'
53
+ CHINESE_PUNC_LIST = CHINESE_PUNC_STOP + CHINESE_PUNC_NON_STOP + CHINESE_PUNC_OTHER
54
+
55
+ # ================================================================================ #
56
+ # basic class
57
+ # ================================================================================ #
58
class ChineseChar(object):
    """
    A Chinese character that exists in a simplified and a traditional form,
    e.g. simplified '负' vs. traditional '負'.  Conversion code reads whichever
    attribute it needs.
    """

    def __init__(self, simplified, traditional):
        self.simplified = simplified
        self.traditional = traditional

    def __str__(self):
        # Prefer the simplified form, fall back to traditional.
        return self.simplified or self.traditional or None

    def __repr__(self):
        # Delegate to __str__ so both printable forms agree.
        return self.__str__()
+
77
+
78
class ChineseNumberUnit(ChineseChar):
    """
    A Chinese number-unit character (十/百/千/万/亿/...).

    Besides the simplified/traditional pair, each unit also carries the
    "big" (formal/financial) writing, e.g. '陆' and '陸'.
    """

    def __init__(self, power, simplified, traditional, big_s, big_t):
        super(ChineseNumberUnit, self).__init__(simplified, traditional)
        # The unit's numeric value is 10 ** power.
        self.power = power
        self.big_s = big_s
        self.big_t = big_t

    def __str__(self):
        return '10^{}'.format(self.power)

    @classmethod
    def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False):
        # `value` is a (simplified, traditional) pair; `index` is the position
        # in the unit table from which the power of ten is derived.
        if small_unit:
            # 十/百/千/万 -> 10^1 .. 10^4.
            return ChineseNumberUnit(power=index + 1,
                                     simplified=value[0], traditional=value[1], big_s=value[1], big_t=value[1])
        elif numbering_type == NUMBERING_TYPES[0]:
            # 'low' system: each larger unit is 10x the previous one.
            return ChineseNumberUnit(power=index + 8,
                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
        elif numbering_type == NUMBERING_TYPES[1]:
            # 'mid' system (default): each larger unit is 10000x the previous one.
            return ChineseNumberUnit(power=(index + 2) * 4,
                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
        elif numbering_type == NUMBERING_TYPES[2]:
            # 'high' system: each larger unit is the square of the previous one.
            return ChineseNumberUnit(power=pow(2, index + 3),
                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
        else:
            raise ValueError(
                'Counting type should be in {0} ({1} provided).'.format(NUMBERING_TYPES, numbering_type))
112
+
113
+
114
class ChineseNumberDigit(ChineseChar):
    """
    A Chinese digit character (0-9), including the formal "big" writings
    and optional alternative forms (〇/幺/两).
    """

    def __init__(self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None):
        super(ChineseNumberDigit, self).__init__(simplified, traditional)
        self.value = value
        self.big_s = big_s
        self.big_t = big_t
        self.alt_s = alt_s
        self.alt_t = alt_t

    def __str__(self):
        # Digits print as their numeric value.
        return str(self.value)

    @classmethod
    def create(cls, i, v):
        # v = (simplified, traditional, big simplified, big traditional).
        return cls(i, *v)
133
+
134
+
135
class ChineseMath(ChineseChar):
    """
    A Chinese math symbol (sign or decimal point) with its ASCII equivalent.
    """

    def __init__(self, simplified, traditional, symbol, expression=None):
        super(ChineseMath, self).__init__(simplified, traditional)
        self.symbol = symbol          # ASCII symbol, e.g. '+', '-', '.'
        self.expression = expression  # optional callable implementing the op
        # Math symbols have no dedicated "big" writing; reuse the plain forms.
        self.big_s = simplified
        self.big_t = traditional
146
+
147
+
148
+ CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath
149
+
150
+
151
class NumberSystem(object):
    """
    Chinese number system: a plain namespace that create_system() populates
    with `units`, `digits` and `math` attributes.
    """
    pass
156
+
157
+
158
class MathSymbol(object):
    """
    Math symbols used by the Chinese number system (simplified/traditional), e.g.
    positive = ['正', '正']
    negative = ['负', '負']
    point = ['点', '點']
    """

    def __init__(self, positive, negative, point):
        self.positive = positive
        self.negative = negative
        self.point = point

    def __iter__(self):
        # Yield the three symbols in declaration order.
        yield from self.__dict__.values()
174
+
175
+
176
+ # class OtherSymbol(object):
177
+ # """
178
+ # 其他符号
179
+ # """
180
+ #
181
+ # def __init__(self, sil):
182
+ # self.sil = sil
183
+ #
184
+ # def __iter__(self):
185
+ # for v in self.__dict__.values():
186
+ # yield v
187
+
188
+
189
+ # ================================================================================ #
190
+ # basic utils
191
+ # ================================================================================ #
192
def create_system(numbering_type=NUMBERING_TYPES[1]):
    """
    Build the Chinese number system for the given numbering type (default 'mid').

    NUMBERING_TYPES = ['low', 'mid', 'high']:
        low:  '兆' = '亿' * '十'  = 10^9,  '京' = '兆' * '十', etc.
        mid:  '兆' = '亿' * '万'  = 10^12, '京' = '兆' * '万', etc.
        high: '兆' = '亿' * '亿'  = 10^16, '京' = '兆' * '兆', etc.

    Returns a NumberSystem with `units`, `digits` and `math` populated.
    """

    # Chinese number units of '亿' and larger.
    all_larger_units = zip(
        LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED, LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL)
    larger_units = [CNU.create(i, v, numbering_type, False)
                    for i, v in enumerate(all_larger_units)]
    # Chinese number units of '十, 百, 千, 万'.
    all_smaller_units = zip(
        SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED, SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL)
    smaller_units = [CNU.create(i, v, small_unit=True)
                     for i, v in enumerate(all_smaller_units)]
    # Digits 0-9 with their formal ("big") writings.
    chinese_digis = zip(CHINESE_DIGIS, CHINESE_DIGIS,
                        BIG_CHINESE_DIGIS_SIMPLIFIED, BIG_CHINESE_DIGIS_TRADITIONAL)
    digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)]
    # Alternative writings: 〇 for zero, 幺 for one, 两/兩 for two.
    digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT
    digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT
    digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1]

    # Sign and decimal-point symbols, with callables giving their arithmetic meaning.
    positive_cn = CM(POSITIVE[0], POSITIVE[1], '+', lambda x: x)
    negative_cn = CM(NEGATIVE[0], NEGATIVE[1], '-', lambda x: -x)
    point_cn = CM(POINT[0], POINT[1], '.', lambda x,
                  y: float(str(x) + '.' + str(y)))
    # sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y)))
    system = NumberSystem()
    system.units = smaller_units + larger_units
    system.digits = digits
    system.math = MathSymbol(positive_cn, negative_cn, point_cn)
    # system.symbols = OtherSymbol(sil_cn)
    return system
232
+
233
+
234
def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]):
    """Convert a Chinese number string to an arabic number string.

    e.g. '两千万' -> '20000000'; a decimal part after 点/點 is supported.
    """

    def get_symbol(char, system):
        # Map a single character to its unit / digit / math symbol object.
        # Returns None implicitly for unknown characters.
        for u in system.units:
            if char in [u.traditional, u.simplified, u.big_s, u.big_t]:
                return u
        for d in system.digits:
            if char in [d.traditional, d.simplified, d.big_s, d.big_t, d.alt_s, d.alt_t]:
                return d
        for m in system.math:
            if char in [m.traditional, m.simplified]:
                return m

    def string2symbols(chinese_string, system):
        # Split at the decimal point (if any) and map every char to a symbol.
        int_string, dec_string = chinese_string, ''
        for p in [system.math.point.simplified, system.math.point.traditional]:
            if p in chinese_string:
                int_string, dec_string = chinese_string.split(p)
                break
        return [get_symbol(c, system) for c in int_string], \
            [get_symbol(c, system) for c in dec_string]

    def correct_symbols(integer_symbols, system):
        """
        Normalize elided forms, e.g.
        一百八 to 一百八十
        一亿一千三百万 to 一亿 一千万 三百万
        """

        # A leading '十' implies '一十'.
        if integer_symbols and isinstance(integer_symbols[0], CNU):
            if integer_symbols[0].power == 1:
                integer_symbols = [system.digits[1]] + integer_symbols

        # A trailing digit inherits the next-lower unit: 一百八 -> 一百八[十].
        if len(integer_symbols) > 1:
            if isinstance(integer_symbols[-1], CND) and isinstance(integer_symbols[-2], CNU):
                integer_symbols.append(
                    CNU(integer_symbols[-2].power - 1, None, None, None, None))

        result = []
        unit_count = 0
        for s in integer_symbols:
            if isinstance(s, CND):
                result.append(s)
                unit_count = 0
            elif isinstance(s, CNU):
                current_unit = CNU(s.power, None, None, None, None)
                unit_count += 1

                if unit_count == 1:
                    result.append(current_unit)
                elif unit_count > 1:
                    # Consecutive units multiply: fold this unit's power into
                    # every earlier, smaller unit (e.g. 千 + 万 -> 千万).
                    for i in range(len(result)):
                        if isinstance(result[-i - 1], CNU) and result[-i - 1].power < current_unit.power:
                            result[-i - 1] = CNU(result[-i - 1].power +
                                                 current_unit.power, None, None, None, None)
        return result

    def compute_value(integer_symbols):
        """
        Compute the value.
        When current unit is larger than previous unit, current unit * all previous units will be used as all previous units.
        e.g. '两千万' = 2000 * 10000 not 2000 + 10000
        """
        value = [0]
        last_power = 0
        for s in integer_symbols:
            if isinstance(s, CND):
                value[-1] = s.value
            elif isinstance(s, CNU):
                value[-1] *= pow(10, s.power)
                if s.power > last_power:
                    # Bigger unit than anything seen so far: it scales all
                    # previously accumulated partial values too.
                    value[:-1] = list(map(lambda v: v *
                                          pow(10, s.power), value[:-1]))
                last_power = s.power
                value.append(0)
        return sum(value)

    system = create_system(numbering_type)
    int_part, dec_part = string2symbols(chinese_string, system)
    int_part = correct_symbols(int_part, system)
    int_str = str(compute_value(int_part))
    # Decimal digits are read positionally, so just concatenate their values.
    dec_str = ''.join([str(d.value) for d in dec_part])
    if dec_part:
        return '{0}.{1}'.format(int_str, dec_str)
    else:
        return int_str
319
+
320
+
321
def num2chn(number_string, numbering_type=NUMBERING_TYPES[1], big=False,
            traditional=False, alt_zero=False, alt_one=False, alt_two=True,
            use_zeros=True, use_units=True):
    """Convert an arabic number string to its Chinese reading.

    Args:
        number_string: digit string, optionally with one decimal dot.
        numbering_type: unit system, one of NUMBERING_TYPES.
        big: use formal/financial digit forms (壹贰...); disables 两.
        traditional: emit traditional characters.
        alt_zero / alt_one: use the alternatives 〇 / 幺 instead of 零 / 一.
        alt_two: use 两 where idiomatic (e.g. 两千 instead of 二千).
        use_zeros: insert 零 for skipped places.
        use_units: read with place-value units (十/百/...); if False, read
            the integer part digit-by-digit.
    """

    def get_value(value_string, use_zeros=True):
        # Recursively convert an integer digit string into symbol objects.

        striped_string = value_string.lstrip('0')

        # record nothing if all zeros
        if not striped_string:
            return []

        # record one digits
        elif len(striped_string) == 1:
            # A leading zero was stripped -> voice it as 零 before the digit.
            if use_zeros and len(value_string) != len(striped_string):
                return [system.digits[0], system.digits[int(striped_string)]]
            else:
                return [system.digits[int(striped_string)]]

        # recursively record multiple digits
        else:
            # Largest unit strictly smaller than the number's magnitude.
            result_unit = next(u for u in reversed(
                system.units) if u.power < len(striped_string))
            result_string = value_string[:-result_unit.power]
            return get_value(result_string) + [result_unit] + get_value(striped_string[-result_unit.power:])

    system = create_system(numbering_type)

    int_dec = number_string.split('.')
    if len(int_dec) == 1:
        int_string = int_dec[0]
        dec_string = ""
    elif len(int_dec) == 2:
        int_string = int_dec[0]
        dec_string = int_dec[1]
    else:
        raise ValueError(
            "invalid input num string with more than one dot: {}".format(number_string))

    if use_units and len(int_string) > 1:
        result_symbols = get_value(int_string)
    else:
        result_symbols = [system.digits[int(c)] for c in int_string]
    dec_symbols = [system.digits[int(c)] for c in dec_string]
    if dec_string:
        result_symbols += [system.math.point] + dec_symbols

    if alt_two:
        # Replace 二 with 两 when it precedes a unit other than 十
        # (e.g. 两百, 两千, but 二十).
        liang = CND(2, system.digits[2].alt_s, system.digits[2].alt_t,
                    system.digits[2].big_s, system.digits[2].big_t)
        for i, v in enumerate(result_symbols):
            if isinstance(v, CND) and v.value == 2:
                next_symbol = result_symbols[i + 1] if i < len(result_symbols) - 1 else None
                previous_symbol = result_symbols[i - 1] if i > 0 else None
                if isinstance(next_symbol, CNU) and isinstance(previous_symbol, (CNU, type(None))):
                    if next_symbol.power != 1 and ((previous_symbol is None) or (previous_symbol.power != 1)):
                        result_symbols[i] = liang

    # if big is True, '两' will not be used and `alt_two` has no impact on output
    if big:
        attr_name = 'big_'
        if traditional:
            attr_name += 't'
        else:
            attr_name += 's'
    else:
        if traditional:
            attr_name = 'traditional'
        else:
            attr_name = 'simplified'

    result = ''.join([getattr(s, attr_name) for s in result_symbols])

    # if not use_zeros:
    #     result = result.strip(getattr(system.digits[0], attr_name))

    if alt_zero:
        result = result.replace(
            getattr(system.digits[0], attr_name), system.digits[0].alt_s)

    if alt_one:
        result = result.replace(
            getattr(system.digits[1], attr_name), system.digits[1].alt_s)

    # A bare fraction like '.5' reads as 零点五.
    for i, p in enumerate(POINT):
        if result.startswith(p):
            return CHINESE_DIGIS[0] + result

    # ^10, 11, .., 19: drop the leading 一 of 一十X.
    if len(result) >= 2 and result[1] in [SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0],
                                          SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0]] and \
            result[0] in [CHINESE_DIGIS[1], BIG_CHINESE_DIGIS_SIMPLIFIED[1], BIG_CHINESE_DIGIS_TRADITIONAL[1]]:
        result = result[1:]

    return result
417
+
418
+
419
+ # ================================================================================ #
420
+ # different types of rewriters
421
+ # ================================================================================ #
422
class Cardinal:
    """
    CARDINAL: conversion between an arabic number string and Chinese text.
    """

    def __init__(self, cardinal=None, chntext=None):
        self.cardinal = cardinal  # arabic-digit string, e.g. '123'
        self.chntext = chntext    # Chinese reading, e.g. '一百二十三'

    def chntext2cardinal(self):
        # Chinese reading -> arabic number string.
        return chn2num(self.chntext)

    def cardinal2chntext(self):
        # Arabic number string -> Chinese reading.
        return num2chn(self.cardinal)
436
+
437
class Digit:
    """
    DIGIT: read a number digit-by-digit (no place-value units),
    e.g. '09' -> '零九'.  Used for years and ID-like numbers.
    """

    def __init__(self, digit=None, chntext=None):
        self.digit = digit
        self.chntext = chntext

    # def chntext2digit(self):
    #     return chn2num(self.chntext)

    def digit2chntext(self):
        # alt_two=False: read 2 as 二 (never 两); use_units=False: no 十/百/...
        return num2chn(self.digit, alt_two=False, use_units=False)
451
+
452
+
453
class TelePhone:
    """
    TELEPHONE: read a phone number digit-by-digit.
    """

    def __init__(self, telephone=None, raw_chntext=None, chntext=None):
        self.telephone = telephone
        self.raw_chntext = raw_chntext
        self.chntext = chntext

    # def chntext2telephone(self):
    #     sil_parts = self.raw_chntext.split('<SIL>')
    #     self.telephone = '-'.join([
    #         str(chn2num(p)) for p in sil_parts
    #     ])
    #     return self.telephone

    def telephone2chntext(self, fixed=False):
        """Read each digit group; `fixed` means a landline ('-' separated)."""
        if fixed:
            parts, marker = self.telephone.split('-'), '<SIL>'
        else:
            parts, marker = self.telephone.strip('+').split(), '<SP>'
        self.raw_chntext = marker.join(
            num2chn(part, alt_two=False, use_units=False) for part in parts)
        # raw_chntext keeps the pause markers; chntext is the plain reading.
        self.chntext = self.raw_chntext.replace(marker, '')
        return self.chntext
485
+
486
+
487
class Fraction:
    """
    FRACTION: 'a/b' <-> 'b分之a' (the denominator is read first in Chinese).
    """

    def __init__(self, fraction=None, chntext=None):
        self.fraction = fraction
        self.chntext = chntext

    def chntext2fraction(self):
        denominator, numerator = self.chntext.split('分之')
        return '{}/{}'.format(chn2num(numerator), chn2num(denominator))

    def fraction2chntext(self):
        numerator, denominator = self.fraction.split('/')
        return '{}分之{}'.format(num2chn(denominator), num2chn(numerator))
503
+
504
+
505
class Date:
    """
    DATE: a date string like '1999年2月20日' -> Chinese reading.

    The year is read digit-by-digit ('一九九九年'), month and day as cardinals.
    """

    def __init__(self, date=None, chntext=None):
        self.date = date
        self.chntext = chntext

    # def chntext2date(self):
    #     chntext = self.chntext
    #     try:
    #         year, other = chntext.strip().split('年', maxsplit=1)
    #         year = Digit(chntext=year).digit2chntext() + '年'
    #     except ValueError:
    #         other = chntext
    #         year = ''
    #     if other:
    #         try:
    #             month, day = other.strip().split('月', maxsplit=1)
    #             month = Cardinal(chntext=month).chntext2cardinal() + '月'
    #         except ValueError:
    #             day = chntext
    #             month = ''
    #         if day:
    #             day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1]
    #     else:
    #         month = ''
    #         day = ''
    #     date = year + month + day
    #     self.date = date
    #     return self.date

    def date2chntext(self):
        date = self.date
        try:
            year, other = date.strip().split('年', 1)
            year = Digit(digit=year).digit2chntext() + '年'
        except ValueError:
            # No '年' marker: the whole string is month/day.
            other = date
            year = ''
        if other:
            try:
                month, day = other.strip().split('月', 1)
                month = Cardinal(cardinal=month).cardinal2chntext() + '月'
            except ValueError:
                # Fixed: the original fell back to the full `date` string here,
                # re-including any already-consumed year part for inputs like
                # '1999年20日'; only the remainder after the year belongs here.
                day = other
                month = ''
            if day:
                # `day` ends with its literal marker (日/号); convert digits only.
                day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1]
        else:
            month = ''
            day = ''
        chntext = year + month + day
        self.chntext = chntext
        return self.chntext
561
+
562
class Time:
    """
    TIME: a clock time like '12:30' -> Chinese reading (12时30分).
    """

    def __init__(self, time=None, chntext=None):
        self.time = time
        self.chntext = chntext

    # def chntext2money(self):
    #     return self.money

    def time2chntext(self):
        text = self.time.replace('-', '至')
        # NOTE(review): findall returns 3-element group tuples, so the
        # len(...) > 2 guard is always true whenever any match exists.
        found = re.compile(r'(\d{1,2}:\d{1,2}(:)?(\d{1,2})?)').findall(text)
        if found and len(found[0]) > 2:
            text = text.replace(':', '时', 1).replace(':', '分', 1)
        self.chntext = text
        return self.chntext
584
+
585
class Money:
    """
    MONEY: read the numeric parts of an amount string in Chinese,
    leaving currency words (元/块/...) in place.
    """

    def __init__(self, money=None, chntext=None):
        self.money = money
        self.chntext = chntext

    # def chntext2money(self):
    #     return self.money

    def money2chntext(self):
        text = self.money
        # Convert every integer/decimal number found in the string.
        for groups in re.compile(r'(\d+(\.\d+)?)').findall(text):
            number = groups[0]
            text = text.replace(number, Cardinal(cardinal=number).cardinal2chntext())
        self.chntext = text
        return self.chntext
606
+
607
+
608
class Percentage:
    """
    PERCENTAGE: '80.03%' <-> '百分之八十点零三'.
    """

    def __init__(self, percentage=None, chntext=None):
        self.percentage = percentage
        self.chntext = chntext

    def chntext2percentage(self):
        # NOTE(review): str.strip('百分之') strips a *character set*, so digit
        # readings are safe but trailing 百/分/之 characters would be
        # over-stripped — confirm inputs always end right after the number.
        return chn2num(self.chntext.strip().strip('百分之')) + '%'

    def percentage2chntext(self):
        return '百分之' + num2chn(self.percentage.strip().strip('%'))
622
+
623
+
624
+ # ================================================================================ #
625
+ # NSW Normalizer
626
+ # ================================================================================ #
627
class NSWNormalizer:
    """
    Non-Standard-Word normalizer for Chinese text: rewrites dates, times,
    money, phone numbers, fractions, percentages and plain numbers into
    their spoken Chinese form.

    Fixed: __init__ previously accepted no text although every call site in
    this file used ``NSWNormalizer(text).normalize()`` (a TypeError).  The
    constructor now takes an optional ``raw_text`` and ``normalize()`` falls
    back to it, so both calling styles work.
    """

    def __init__(self, raw_text=' '):
        self.raw_text = raw_text  # '^' + raw_text + '$' once normalize() runs
        self.norm_text = ''

    def _particular(self):
        # Restore patterns like 'O2O'/'B2C' where the digit 2 between ASCII
        # letters was wrongly converted to 二.
        text = self.norm_text
        pattern = re.compile(r"(([a-zA-Z]+)二([a-zA-Z]+))")
        matchers = pattern.findall(text)
        if matchers:
            # print('particular')
            for matcher in matchers:
                text = text.replace(matcher[0], matcher[1] + '2' + matcher[2], 1)
        self.norm_text = text
        return self.norm_text

    def normalize(self, raw_text=None):
        """Normalize *raw_text* (or the text given to the constructor)."""
        if raw_text is None:
            raw_text = self.raw_text
        # Sentinels guarantee a non-digit char on both sides for the regexes.
        self.raw_text = '^' + raw_text + '$'
        text = unicodedata.normalize("NFKC", self.raw_text)
        # Normalize dates.
        pattern = re.compile(r"\D+((([089]\d|(19|20)\d{2})年)?(\d{1,2}月(\d{1,2}[日号])?)?)")
        matchers = pattern.findall(text)
        if matchers:
            # print('date')
            for matcher in matchers:
                text = text.replace(matcher[0], Date(date=matcher[0]).date2chntext(), 1)

        # Normalize times.
        pattern = re.compile(r"\D+((\d{1,2}-)?\d{1,2}[时点:]((\d{1,2}-)?\d{1,2}[分:]((\d{1,2}-)?\d{1,2}秒)?)?)")
        matchers = pattern.findall(text)
        if matchers:
            # print('time')
            for matcher in matchers:
                text = text.replace(matcher[0], Time(time=matcher[0]).time2chntext(), 1)

        # Normalize money amounts.
        pattern = re.compile(r"\D+((\d+(\.\d+)?)[多余几]?" + CURRENCY_UNITS + r"(\d" + CURRENCY_UNITS + r"?)?)")
        matchers = pattern.findall(text)
        if matchers:
            # print('money')
            for matcher in matchers:
                text = text.replace(matcher[0], Money(money=matcher[0]).money2chntext(), 1)

        # Normalize landline / mobile phone numbers.
        # Mobile prefixes (see http://www.jihaoba.com/news/show/13680):
        # China Mobile:  139 138 137 136 135 134 159 158 157 150 151 152 188 187 182 183 184 178 198
        # China Unicom:  130 131 132 156 155 186 185 176
        # China Telecom: 133 153 189 180 181 177
        pattern = re.compile(r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D")
        matchers = pattern.findall(text)
        if matchers:
            # print('telephone')
            for matcher in matchers:
                text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(), 1)
        # Landlines: optional area code, then 7-8 digits.
        pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")
        matchers = pattern.findall(text)
        if matchers:
            # print('fixed telephone')
            for matcher in matchers:
                text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(fixed=True), 1)

        # Normalize fractions.
        pattern = re.compile(r"(\d+/\d+)")
        matchers = pattern.findall(text)
        if matchers:
            # print('fraction')
            for matcher in matchers:
                text = text.replace(matcher, Fraction(fraction=matcher).fraction2chntext(), 1)

        # Normalize percentages (fold fullwidth % to ASCII first).
        text = text.replace('%', '%')
        pattern = re.compile(r"(\d+(\.\d+)?%)")
        matchers = pattern.findall(text)
        if matchers:
            # print('percentage')
            for matcher in matchers:
                text = text.replace(matcher[0], Percentage(percentage=matcher[0]).percentage2chntext(), 1)

        # Normalize number + quantifier (e.g. 3个, 12.5万吨).
        pattern = re.compile(r"(\d+(\.\d+)?)[多余几]?" + COM_QUANTIFIERS)
        matchers = pattern.findall(text)
        if matchers:
            # print('cardinal+quantifier')
            for matcher in matchers:
                text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)

        # Normalize long digit sequences (IDs) digit-by-digit.
        pattern = re.compile(r"(\d{2,32})")
        matchers = pattern.findall(text)
        if matchers:
            # print('digit')
            for matcher in matchers:
                text = text.replace(matcher, Digit(digit=matcher).digit2chntext(), 1)

        # Normalize any remaining plain numbers.
        pattern = re.compile(r"(\d+(\.\d+)?)")
        matchers = pattern.findall(text)
        if matchers:
            # print('cardinal')
            for matcher in matchers:
                text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)

        self.norm_text = text
        self._particular()

        return self.norm_text.lstrip('^').rstrip('$')
735
+
736
+
737
def nsw_test_case(raw_text):
    """Print one normalization example: input line, then normalized output."""
    print('I:' + raw_text)
    # Fixed: NSWNormalizer.__init__ takes no text argument in this module;
    # the text must go to normalize() (the original passed it to the
    # constructor, which raised TypeError).
    print('O:' + NSWNormalizer().normalize(raw_text))
    print('')
741
+
742
+
743
def nsw_test():
    """Run the built-in normalization smoke cases."""
    cases = [
        '固话:0595-23865596或23880880。',
        '固话:0595-23865596或23880880。',
        '手机:+86 19859213959或15659451527。',
        '分数:32477/76391。',
        '百分数:80.03%。',
        '编号:31520181154418。',
        '纯数:2983.07克或12345.60米。',
        '日期:1999年2月20日或09年3月15号。',
        '金钱:12块5,34.5元,20.1万',
        '特殊:O2O或B2C。',
        '3456万吨',
        '2938个',
        '938',
        '今天吃了115个小笼包231个馒头',
        '有62%的概率',
    ]
    for case in cases:
        nsw_test_case(case)
759
+
760
+
761
if __name__ == '__main__':
    # nsw_test()

    # Batch text-normalization CLI: read a UTF-8 text file (optionally in
    # Kaldi "key text..." format) and write the normalized, punctuation-free
    # output line by line.
    p = argparse.ArgumentParser()
    p.add_argument('ifile', help='input filename, assume utf-8 encoding')
    p.add_argument('ofile', help='output filename')
    p.add_argument('--to_upper', action='store_true', help='convert to upper case')
    p.add_argument('--to_lower', action='store_true', help='convert to lower case')
    p.add_argument('--has_key', action='store_true', help="input text has Kaldi's key as first field.")
    p.add_argument('--log_interval', type=int, default=100000, help='log interval in number of processed lines')
    args = p.parse_args()

    ifile = codecs.open(args.ifile, 'r', 'utf8')
    ofile = codecs.open(args.ofile, 'w+', 'utf8')

    n = 0
    for l in ifile:
        key = ''
        text = ''
        if args.has_key:
            # First whitespace-separated field is the Kaldi utterance key.
            cols = l.split(maxsplit=1)
            key = cols[0]
            if len(cols) == 2:
                text = cols[1].strip()
            else:
                text = ''
        else:
            text = l.strip()

        # Case conversion (mutually exclusive flags).
        if args.to_upper and args.to_lower:
            sys.stderr.write('cn_tn.py: to_upper OR to_lower?')
            exit(1)
        if args.to_upper:
            text = text.upper()
        if args.to_lower:
            text = text.lower()

        # NSW(Non-Standard-Word) normalization.
        # Fixed: NSWNormalizer.__init__ takes no text argument; pass the text
        # to normalize() (the original constructor call raised TypeError).
        text = NSWNormalizer().normalize(text)

        # Punctuation removal: map every CN and EN punctuation char to a space.
        old_chars = CHINESE_PUNC_LIST + string.punctuation  # includes all CN and EN punctuations
        new_chars = ' ' * len(old_chars)
        del_chars = ''
        text = text.translate(str.maketrans(old_chars, new_chars, del_chars))

        # Emit, keeping the key column when present.
        if args.has_key:
            ofile.write(key + '\t' + text + '\n')
        else:
            if text.strip() != '':  # skip empty line in pure text format(without Kaldi's utt key)
                ofile.write(text + '\n')

        n += 1
        if n % args.log_interval == 0:
            sys.stderr.write("cn_tn.py: {} lines done.\n".format(n))
            sys.stderr.flush()

    sys.stderr.write("cn_tn.py: {} lines done in total.\n".format(n))
    sys.stderr.flush()

    ifile.close()
    ofile.close()
+ ofile.close()
lemas_tts/infer/text_norm/en_tn.py ADDED
@@ -0,0 +1,178 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2017 Keith Ito
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the "Software"), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+
11
+ # The above copyright notice and this permission notice shall be included in
12
+ # all copies or substantial portions of the Software.
13
+
14
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
20
+ # THE SOFTWARE.
21
+
22
+ import re
23
+ from unidecode import unidecode
24
+ import inflect
25
+
26
+ _inflect = inflect.engine()
27
+ _comma_number_re = re.compile(r"([0-9][0-9\,]+[0-9])")
28
+ _decimal_number_re = re.compile(r"([0-9]+\.[0-9]+)")
29
+ _pounds_re = re.compile(r"£([0-9\,]*[0-9]+)")
30
+ _dollars_re = re.compile(r"\$([0-9\.\,]*[0-9]+)")
31
+ _ordinal_re = re.compile(r"[0-9]+(st|nd|rd|th)")
32
+ _number_re = re.compile(r"[0-9]+")
33
+
34
+
35
+ def _remove_commas(m):
36
+ return m.group(1).replace(",", "")
37
+
38
+
39
+ def _expand_decimal_point(m):
40
+ return m.group(1).replace(".", " point ")
41
+
42
+
43
+ def _expand_dollars(m):
44
+ match = m.group(1)
45
+ parts = match.split(".")
46
+ if len(parts) > 2:
47
+ return match + " dollars" # Unexpected format
48
+ dollars = int(parts[0]) if parts[0] else 0
49
+ cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
50
+ if dollars and cents:
51
+ dollar_unit = "dollar" if dollars == 1 else "dollars"
52
+ cent_unit = "cent" if cents == 1 else "cents"
53
+ return "%s %s, %s %s" % (dollars, dollar_unit, cents, cent_unit)
54
+ elif dollars:
55
+ dollar_unit = "dollar" if dollars == 1 else "dollars"
56
+ return "%s %s" % (dollars, dollar_unit)
57
+ elif cents:
58
+ cent_unit = "cent" if cents == 1 else "cents"
59
+ return "%s %s" % (cents, cent_unit)
60
+ else:
61
+ return "zero dollars"
62
+
63
+
64
def _expand_ordinal(m):
    # '1st' -> 'first', '22nd' -> 'twenty-second', via the inflect engine.
    return _inflect.number_to_words(m.group(0))
66
+
67
+
68
def _expand_number(m):
    """Spell out a matched integer, reading 1001-2999 in year style where natural."""
    num = int(m.group(0))
    if not (1000 < num < 3000):
        return _inflect.number_to_words(num, andword="")
    if num == 2000:
        return "two thousand"
    if 2000 < num < 2010:
        return "two thousand " + _inflect.number_to_words(num % 100)
    if num % 100 == 0:
        return _inflect.number_to_words(num // 100) + " hundred"
    # Year-style digit pairs, e.g. 1999 -> 'nineteen ninety-nine'.
    return _inflect.number_to_words(
        num, andword="", zero="oh", group=2
    ).replace(", ", " ")
83
+
84
+
85
def normalize_numbers(text):
    """Expand numbers, currency and ordinals into words (substitution order matters)."""
    substitutions = (
        (_comma_number_re, _remove_commas),
        (_pounds_re, r"\1 pounds"),
        (_dollars_re, _expand_dollars),
        (_decimal_number_re, _expand_decimal_point),
        (_ordinal_re, _expand_ordinal),
        (_number_re, _expand_number),
    )
    for pattern, replacement in substitutions:
        text = re.sub(pattern, replacement, text)
    return text
93
+
94
+ # Regular expression matching whitespace:
95
+ _whitespace_re = re.compile(r"\s+")
96
+
97
+ # List of (regular expression, replacement) pairs for abbreviations:
98
+ _abbreviations = [
99
+ (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
100
+ for x in [
101
+ ("mrs", "misess"),
102
+ ("mr", "mister"),
103
+ ("dr", "doctor"),
104
+ ("st", "saint"),
105
+ ("co", "company"),
106
+ ("jr", "junior"),
107
+ ("maj", "major"),
108
+ ("gen", "general"),
109
+ ("drs", "doctors"),
110
+ ("rev", "reverend"),
111
+ ("lt", "lieutenant"),
112
+ ("hon", "honorable"),
113
+ ("sgt", "sergeant"),
114
+ ("capt", "captain"),
115
+ ("esq", "esquire"),
116
+ ("ltd", "limited"),
117
+ ("col", "colonel"),
118
+ ("ft", "fort"),
119
+ ]
120
+ ]
121
+
122
+
123
def expand_abbreviations(text):
    """Expand abbreviations like 'mr.' / 'dr.' using the _abbreviations table."""
    for regex, replacement in _abbreviations:
        text = re.sub(regex, replacement, text)
    return text
127
+
128
+
129
def expand_numbers(text):
    """Alias for normalize_numbers, kept for cleaner-pipeline naming."""
    return normalize_numbers(text)
131
+
132
+
133
def lowercase(text):
    """Lowercase *text* (plain str.lower)."""
    return text.lower()
135
+
136
+
137
def collapse_whitespace(text):
    """Collapse every run of whitespace into a single space."""
    return _whitespace_re.sub(" ", text)
139
+
140
+
141
def convert_to_ascii(text):
    # Transliterate any Unicode text to plain ASCII via unidecode
    # (e.g. 'café' -> 'cafe').
    return unidecode(text)
143
+
144
+
145
def basic_cleaners(text):
    """Basic pipeline that lowercases and collapses whitespace without transliteration."""
    return collapse_whitespace(lowercase(text))
150
+
151
+
152
def transliteration_cleaners(text):
    """Pipeline for non-English text that transliterates to ASCII."""
    return collapse_whitespace(lowercase(convert_to_ascii(text)))
158
+
159
+
160
def english_cleaners(text):
    """Pipeline for English text, including number and abbreviation expansion."""
    for step in (convert_to_ascii, lowercase, expand_numbers,
                 expand_abbreviations, collapse_whitespace):
        text = step(text)
    return text
168
+
169
def read_lexicon(lex_path):
    """Load a pronunciation lexicon with one 'WORD PH1 PH2 ...' entry per line.

    Returns a dict mapping word -> phone list; only the first entry for a
    word is kept, matching the original behavior.

    Fixed: open with an explicit UTF-8 encoding instead of the platform
    default, and skip blank / whitespace-only lines (previously a blank line
    produced a bogus '' -> [] entry, and leading whitespace produced an
    empty-string word).
    """
    lexicon = {}
    with open(lex_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # blank line
            word, phones = parts[0], parts[1:]
            if word not in lexicon:
                lexicon[word] = phones
    return lexicon
lemas_tts/infer/text_norm/gp2py.py ADDED
@@ -0,0 +1,148 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import argparse
2
+ import copy
3
+ import os
4
+ from typing import List
5
+
6
+ import jieba
7
+ import pypinyin
8
+
9
+ SPECIAL_NOTES = '。?!?!.;;:,,:'
10
+
11
+
12
def read_vocab(file: os.PathLike) -> List[str]:
    """Read a vocabulary file: one symbol per line, blank lines dropped.

    Args:
        file: path to a UTF-8 text file.

    Returns:
        List of non-empty symbol strings in file order.
    """
    # Explicit UTF-8: vocab files may contain Chinese characters, so do not
    # depend on the platform's locale encoding.
    with open(file, encoding='utf-8') as f:
        vocab = f.read().split('\n')
    vocab = [v for v in vocab if len(v) > 0 and v != '\n']
    return vocab
17
+
18
+
19
class TextNormal:
    """Chinese grapheme-to-pinyin front end with tone-sandhi fixes.

    Splits raw text into sentences on punctuation, converts each sentence
    to pinyin (jieba word segmentation + pypinyin), then applies
    third-tone sandhi, the bu4 -> bu2 tone change, neutral-tone padding,
    and optional erhua (儿化) merging.

    Fix vs. original: removed a leftover debug ``print`` in :meth:`gp2py`
    that polluted stdout on every call when ``fix_er`` was enabled.
    """

    def __init__(self,
                 gp_vocab_file: os.PathLike,
                 py_vocab_file: os.PathLike,
                 add_sp1=False,
                 fix_er=False,
                 add_sil=True):
        """
        Args:
            gp_vocab_file: grapheme (character) vocabulary file, or None.
            py_vocab_file: pinyin vocabulary file, or None.
            add_sp1: replace pause punctuation with the 'sp1' token.
            fix_er: merge 儿 into the preceding syllable when the merged
                syllable exists in the pinyin vocabulary.
            add_sil: wrap each sentence with 'sil' boundary tokens.
        """
        if gp_vocab_file is not None:
            self.gp_vocab = read_vocab(gp_vocab_file)
        if py_vocab_file is not None:
            self.py_vocab = read_vocab(py_vocab_file)
            # Fast membership table used to validate merged erhua syllables.
            self.in_py_vocab = dict([(p, True) for p in self.py_vocab])
        self.add_sp1 = add_sp1
        self.add_sil = add_sil
        self.fix_er = fix_er

    def _split2sent(self, text):
        """Split *text* into sentences on SPECIAL_NOTES punctuation.

        Returns:
            (sentences, tokens): the sentence fragments and the punctuation
            marks encountered in *text*, in order.
        """
        new_sub = [text]
        while True:
            sub = copy.deepcopy(new_sub)
            new_sub = []
            for s in sub:
                sp = False
                for t in SPECIAL_NOTES:
                    if t in s:
                        new_sub += s.split(t)
                        sp = True
                        break

                if not sp and len(s) > 0:
                    new_sub += [s]
            # Stop once a full pass produced no further splits.
            if len(new_sub) == len(sub):
                break
        tokens = [a for a in text if a in SPECIAL_NOTES]

        return new_sub, tokens

    def _correct_tone3(self, pys: List[str]) -> List[str]:
        """Apply third-tone sandhi: in runs of tone-3 syllables, earlier
        syllables are pronounced tone 2."""
        # Three consecutive tone-3 syllables: change the middle one first.
        for i in range(2, len(pys)):
            if pys[i][-1] == '3' and pys[i - 1][-1] == '3' and pys[i - 2][-1] == '3':
                pys[i - 1] = pys[i - 1][:-1] + '2'  # change the middle one
        # Remaining tone-3 pairs: the first of each pair becomes tone 2.
        for i in range(1, len(pys)):
            if pys[i][-1] == '3':
                if pys[i - 1][-1] == '3':
                    pys[i - 1] = pys[i - 1][:-1] + '2'
        return pys

    def _correct_tone4(self, pys: List[str]) -> List[str]:
        """不 sandhi: bu4 before a tone-4 syllable becomes bu2
        (bu2 yao4), but stays bu4 otherwise (bu4 neng2)."""
        for i in range(len(pys) - 1):
            if pys[i] == 'bu4':
                if pys[i + 1][-1] == '4':
                    pys[i] = 'bu2'
        return pys

    def _replace_with_sp(self, pys: List[str]) -> List[str]:
        """Turn pause punctuation (commas, 、) into the 'sp1' pause token."""
        for i, p in enumerate(pys):
            if p in ',,、':
                pys[i] = 'sp1'
        return pys

    def _correct_tone5(self, pys: List[str]) -> List[str]:
        """Append '5' (neutral tone) to any syllable lacking a tone digit."""
        for i in range(len(pys)):
            if pys[i][-1] not in '1234':
                pys[i] += '5'
        return pys

    def gp2py(self, gp_text: str) -> List[str]:
        """Convert a grapheme string to parallel pinyin / grapheme sentences.

        Returns:
            (py_sent_list, gp_sent_list): per-sentence pinyin strings and
            space-separated grapheme strings, optionally 'sil'-wrapped.
        """
        gp_sent_list, tokens = self._split2sent(gp_text)
        py_sent_list = []
        for sent in gp_sent_list:
            pys = []
            # Segment into words first so pypinyin can disambiguate
            # polyphonic characters from context.
            for words in list(jieba.cut(sent)):
                py = pypinyin.pinyin(words, pypinyin.TONE3)
                py = [p[0] for p in py]
                pys += py
            if self.add_sp1:
                pys = self._replace_with_sp(pys)
            pys = self._correct_tone3(pys)
            pys = self._correct_tone4(pys)
            pys = self._correct_tone5(pys)
            if self.add_sil:
                py_sent_list += [' '.join(['sil'] + pys + ['sil'])]
            else:
                py_sent_list += [' '.join(pys)]

        if self.add_sil:
            gp_sent_list = ['sil ' + ' '.join(list(gp)) + ' sil' for gp in gp_sent_list]
        else:
            gp_sent_list = [' '.join(list(gp)) for gp in gp_sent_list]

        if self.fix_er:
            new_py_sent_list = []
            for py, gp in zip(py_sent_list, gp_sent_list):
                py = self._convert_er2(py, gp)
                new_py_sent_list += [py]
            py_sent_list = new_py_sent_list

        return py_sent_list, gp_sent_list

    def _convert_er2(self, py, gp):
        """Merge 儿 (er2) into the preceding syllable as an 'r' coda (erhua),
        but only when the merged syllable exists in the pinyin vocabulary."""
        py2hz = dict([(p, h) for p, h in zip(py.split(), gp.split())])
        py_list = py.split()
        for i, p in enumerate(py_list):
            if (p == 'er2' and py2hz[p] == '儿' and i > 1 and len(py_list[i - 1]) > 2 and py_list[i - 1][-1] in '1234'):

                py_er = py_list[i - 1][:-1] + 'r' + py_list[i - 1][-1]

                if self.in_py_vocab.get(py_er, False):  # must be in vocab
                    py_list[i - 1] = py_er
                    py_list[i] = 'r'
        py = ' '.join(py_list)
        return py
138
+
139
+
140
if __name__ == '__main__':
    # CLI smoke test: convert -t/--text to pinyin using local vocab files.
    parser = argparse.ArgumentParser()
    parser.add_argument('-t', '--text', type=str)
    tn = TextNormal('gp.vocab', 'py.vocab', add_sp1=True, fix_er=True)
    py_list, gp_list = tn.gp2py(parser.parse_args().text)
    for py, gp in zip(py_list, gp_list):
        print(py + '|' + gp)
lemas_tts/infer/text_norm/id_tn.py ADDED
@@ -0,0 +1,275 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Indonesian TTS Text Normalization for YouTube subtitles
2
+ # Requirements: pip install num2words
3
+ import re
4
+ from num2words import num2words
5
+
6
+ # --- small slang map (expandable) ---
7
+ SLANG_MAP = {
8
+ "gpp": "nggak apa-apa",
9
+ "gak": "nggak", "ga": "nggak", "gk": "nggak",
10
+ "sy": "saya", "sya": "saya",
11
+ "km": "kamu",
12
+ "tp": "tapi", "tpi": "tapi",
13
+ "jd": "jadi",
14
+ "bgt": "banget",
15
+ "blm": "belum",
16
+ "trs": "terus",
17
+ "sm": "sama",
18
+ "wkwk": "wkwk", # keep as-is (laugh token) or strip later
19
+ "wkwkwk": "wkwk"
20
+ }
21
+
22
+ # emoji pattern: removes most emoji blocks
23
+ EMOJI_PATTERN = re.compile(
24
+ "["
25
+ "\U0001F600-\U0001F64F" # emoticons
26
+ "\U0001F300-\U0001F5FF" # symbols & pictographs
27
+ "\U0001F680-\U0001F6FF" # transport & map symbols
28
+ "\U0001F1E0-\U0001F1FF" # flags (iOS)
29
+ "\U00002700-\U000027BF" # dingbats
30
+ "\U000024C2-\U0001F251"
31
+ "]+", flags=re.UNICODE)
32
+
33
+ # units map
34
+ UNITS = {
35
+ "kg": "kilogram","g": "gram","km": "kilometer",
36
+ "m": "meter","cm": "sentimeter","mm": "milimeter",
37
+ "l": "liter"
38
+ }
39
+
40
+ # helper: safe num2words for Indonesian
41
def num_to_words_ind(num_str):
    """Convert a numeric string to Indonesian words.

    - Decimals such as '1.5' or '1,5' are read as '<int> koma <digit> ...'
      with each fractional digit spelled out individually.
    - Grouping separators ('.', ',') in plain integers such as '10.000'
      are stripped before conversion.
    - On conversion failure the original string is returned unchanged.

    Fix vs. original: the two bare ``except:`` clauses also caught
    SystemExit/KeyboardInterrupt; they now catch only the conversion
    errors ``int``/``num2words`` can raise.
    """
    num_str = num_str.strip()
    # Remove thousand separators commonly used in Indonesian (dot); if a
    # single separator splits two digit groups, treat it as a decimal point.
    if re.match(r'^\d+[.,]\d+$', num_str):
        # Decimal number: normalise the separator to '.' and split.
        s = num_str.replace(',', '.')
        left, right = s.split('.', 1)
        try:
            left_w = num2words(int(left), lang='id')
        except (ValueError, OverflowError):
            # Fall back to the raw text rather than crashing the pipeline.
            left_w = left
        # Read each fractional digit separately ("satu koma lima").
        right_w = " ".join(num2words(int(d), lang='id') for d in right if d.isdigit())
        return f"{left_w} koma {right_w}"
    else:
        # Strip '.'/',' used as thousand separators, then convert.
        cleaned = re.sub(r'[.,]', '', num_str)
        try:
            return num2words(int(cleaned), lang='id')
        except (ValueError, OverflowError):
            return num_str
68
+
69
+ # helper: per-digit reader for phone numbers (default)
70
def read_digits_per_digit(number_str, prefix_plus=False):
    """Spell a phone-number-like string digit by digit in Indonesian.

    Non-digit characters (spaces, dashes) are skipped; when *prefix_plus*
    is true the result is prefixed with 'plus' (for '+62...' numbers).
    """
    spoken = [num2words(int(d), lang='id') for d in re.findall(r'\d', number_str)]
    words = " ".join(spoken)
    return "plus " + words if prefix_plus else words
76
+
77
+ # noise removal rule for tokens like 'yyy6yy' or other long mixed garbage:
78
def is_noise_token(tok):
    """Heuristically detect keyboard/ASR garbage tokens.

    A token counts as noise when it is at least 4 characters long and
    either mixes letters with digits (e.g. 'yyy6yy') or consists of one
    character repeated four or more times (e.g. 'aaaa').
    """
    if len(tok) < 4:
        return False
    has_letter = re.search(r'[A-Za-z]', tok)
    has_digit = re.search(r'\d', tok)
    if has_letter and has_digit:
        return True
    return bool(re.fullmatch(r'(.)\1{3,}', tok))
89
+
90
+ # --- 新增:标点规范化函数 ---
91
def punctuation_normalize(text):
    """Normalise punctuation for the TTS front end.

    All punctuation except . , ! ? becomes a comma, runs of commas are
    collapsed, leading commas/ellipses are stripped, and spacing around
    commas and between words is regularised.
    """
    rules = [
        # Brackets, quotes, colon, semicolon, dashes, ellipsis, slashes -> comma.
        (r'[:;()\[\]{}"“”«»…—–/\\]', ','),
        # Collapse comma runs into one.
        (r',+', ','),
        # Drop leading commas and ellipses.
        (r'^(,|\.\.\.|…)+\s*', ''),
        # Exactly one space after each comma.
        (r'\s*,\s*', ', '),
    ]
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    # Merge remaining whitespace and trim the edges.
    return re.sub(r'\s+', ' ', text).strip()
109
+
110
+
111
def normalize_id_tts(text):
    """
    Main normalization pipeline tailored for:
    - Indonesian YouTube subtitles (mostly ASR/MT)
    - TTS frontend requirements:
        * Remove emojis
        * Keep . , ! ? as sentence/phrase delimiters
        * Replace other punctuation with comma
        * Expand numbers, percents, currency, units, times, dates
        * Remove keyboard noise like 'yyy6yy'
        * Keep English words as-is
        * Keep repeated words (do not collapse)

    Note: token branches below are order-sensitive — each ``continue``
    claims the token for the first matching rule (currency before percent,
    phone before time, etc.).
    """
    if not text:
        return text

    # 1) Normalize whitespace and trim
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)

    # 2) Remove emojis
    text = EMOJI_PATTERN.sub('', text)

    # 3) Punctuation normalization (replaces the old PUNCT_TO_COMMA pass)
    text = punctuation_normalize(text)

    # Protect time and date patterns so commas cannot break them apart.
    text = re.sub(r'(\d{1,2}):(\d{2})', lambda m: f"__TIME_{m.group(1)}_{m.group(2)}__", text)
    text = re.sub(r'(\d{1,4})[\/-](\d{1,2})[\/-](\d{1,4})', lambda m: f"__DATE_{m.group(1)}_{m.group(2)}_{m.group(3)}__", text)

    # Restore the protected time/date markers (slashes normalised to '/').
    text = re.sub(r'__TIME_(\d{1,2})_(\d{2})__', lambda m: f"{m.group(1)}:{m.group(2)}", text)
    text = re.sub(r'__DATE_(\d{1,4})_(\d{1,2})_(\d{1,4})__', lambda m: f"{m.group(1)}/{m.group(2)}/{m.group(3)}", text)

    # 4) Tokenize loosely by spaces and punctuation
    tokens = re.split(r'(\s+|[,.!?])', text)  # keep delimiters

    out_tokens = []
    for tok in tokens:
        # Whitespace/empty capture groups pass through untouched.
        if not tok or tok.isspace():
            out_tokens.append(tok)
            continue

        # keep punctuation .,!? as-is
        if tok in ['.', ',', '!', '?']:
            out_tokens.append(tok)
            continue

        # remove any remaining emojis or control chars
        if EMOJI_PATTERN.search(tok):
            continue

        # slang normalization (case-insensitive lookup)
        lower_tok = tok.lower()
        if lower_tok in SLANG_MAP:
            out_tokens.append(SLANG_MAP[lower_tok])
            continue

        # remove noise tokens (mixed letter/digit garbage, char repeats)
        if is_noise_token(tok):
            continue

        # currency: Rp 10.000 or rp10.000 -> "... rupiah"
        m = re.match(r'^(Rp|rp)\s*([0-9\.,]+)$', tok)
        if m:
            num = m.group(2)
            cleaned = re.sub(r'[.,]', '', num)
            out_tokens.append(f"{num_to_words_ind(cleaned)} rupiah")
            continue

        # percent like 30% -> "... persen"
        m = re.match(r'^(\d+)%$', tok)
        if m:
            out_tokens.append(f"{num_to_words_ind(m.group(1))} persen")
            continue

        # phone numbers +62..., 0812... -> read digit by digit
        m = re.match(r'^\+?\d[\d\-\s]{6,}\d$', tok)
        if m:
            prefix_plus = tok.startswith('+')
            out_tokens.append(read_digits_per_digit(tok, prefix_plus=prefix_plus))
            continue

        # time hh:mm -> "pukul H lewat M menit"
        m = re.match(r'^(\d{1,2}):(\d{2})$', tok)
        if m:
            h, mi = m.group(1), m.group(2)
            # lstrip('0') or '0' keeps "00" readable as zero.
            h_w = num_to_words_ind(h.lstrip('0') or '0')
            mi_w = num_to_words_ind(mi.lstrip('0') or '0')
            out_tokens.append(f"pukul {h_w} lewat {mi_w} menit")
            continue

        # date yyyy/mm/dd or dd/mm/yyyy -> "D <Month> YYYY"
        m = re.match(r'^(\d{1,4})\/(\d{1,2})\/(\d{1,4})$', tok)
        if m:
            a,b,c = m.group(1), m.group(2).zfill(2), m.group(3)
            # A 4-digit first field means ISO order; otherwise day-first.
            if len(a) == 4:
                year, month, day = a, b, c
            elif len(c) == 4:
                day, month, year = a, b, c
            else:
                day, month, year = a, b, c
            MONTHS = {
                "01": "Januari","02": "Februari","03": "Maret","04": "April",
                "05": "Mei","06": "Juni","07": "Juli","08": "Agustus",
                "09": "September","10": "Oktober","11": "November","12": "Desember"
            }
            day_w = num_to_words_ind(day.lstrip('0') or '0')
            year_w = num_to_words_ind(year)
            month_name = MONTHS.get(month, month)
            out_tokens.append(f"{day_w} {month_name} {year_w}")
            continue

        # units like 30kg -> "tiga puluh kilogram"
        m = re.match(r'^(\d+)\s*(kg|g|km|m|cm|mm|l)$', tok, flags=re.I)
        if m:
            num, unit = m.group(1), m.group(2).lower()
            unit_word = UNITS.get(unit, unit)
            out_tokens.append(f"{num_to_words_ind(num)} {unit_word}")
            continue

        # plain integers
        if re.fullmatch(r'\d+', tok):
            out_tokens.append(num_to_words_ind(tok))
            continue

        # numbers with separators ('1.234', '3,14')
        if re.fullmatch(r'[\d\.,]+', tok) and re.search(r'[.,]', tok):
            out_tokens.append(num_to_words_ind(tok))
            continue

        # keep English/as-is tokens
        out_tokens.append(tok)

    normalized = "".join(out_tokens)

    # final cleanup: spacing around punctuation
    normalized = re.sub(r'\s+,', ',', normalized)
    normalized = re.sub(r',\s*', ', ', normalized)
    normalized = re.sub(r'\s+\.', '.', normalized)
    normalized = re.sub(r'\s+!', '!', normalized)
    normalized = re.sub(r'\s+\?', '?', normalized)
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    # Comment out the next line if you do not want the output lower-cased.
    normalized = normalized.lower()

    return normalized
259
+
260
+ # -------------------------
261
+ # Example usage and tests
262
+ # -------------------------
263
if __name__ == "__main__":
    # Smoke test covering noise removal, currency, percent, time, date,
    # phone numbers, decimals and leading punctuation.
    examples = [
        "kita cek Project nadi PHP pemberi harapan palsu tuh yyy6yy 46 ini ini usernya ini di bagian user",
        "Harga Rp 10.000, diskon 30%! Buka jam 09:30 (hari 2025/11/28).",
        "Call +62 812-3456-7890 sekarang!",
        "angka kecil 3.14 dan 1,234 serta 1000",
        "[musik]",
        "... atau mungkin juga jumlah anggota keluarga mereka."
    ]
    for sample in examples:
        print("IN: ", sample)
        print("OUT:", normalize_id_tts(sample))
        print("-"*60)
lemas_tts/infer/text_norm/jieba_dict.txt ADDED
The diff for this file is too large to render. See raw diff
 
lemas_tts/infer/text_norm/pinyin-lexicon-r.txt ADDED
@@ -0,0 +1,4120 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ a1 a1
2
+ a2 a2
3
+ a3 a3
4
+ a4 a4
5
+ a5 a5
6
+ ai1 ai1
7
+ ai2 ai2
8
+ ai3 ai3
9
+ ai4 ai4
10
+ ai5 ai5
11
+ an1 an1
12
+ an2 an2
13
+ an3 an3
14
+ an4 an4
15
+ an5 an5
16
+ ang1 ang1
17
+ ang2 ang2
18
+ ang3 ang3
19
+ ang4 ang4
20
+ ang5 ang5
21
+ ao1 ao1
22
+ ao2 ao2
23
+ ao3 ao3
24
+ ao4 ao4
25
+ ao5 ao5
26
+ ba1 b a1
27
+ ba2 b a2
28
+ ba3 b a3
29
+ ba4 b a4
30
+ ba5 b a5
31
+ bai1 b ai1
32
+ bai2 b ai2
33
+ bai3 b ai3
34
+ bai4 b ai4
35
+ bai5 b ai5
36
+ ban1 b an1
37
+ ban2 b an2
38
+ ban3 b an3
39
+ ban4 b an4
40
+ ban5 b an5
41
+ bang1 b ang1
42
+ bang2 b ang2
43
+ bang3 b ang3
44
+ bang4 b ang4
45
+ bang5 b ang5
46
+ bao1 b ao1
47
+ bao2 b ao2
48
+ bao3 b ao3
49
+ bao4 b ao4
50
+ bao5 b ao5
51
+ bei1 b ei1
52
+ bei2 b ei2
53
+ bei3 b ei3
54
+ bei4 b ei4
55
+ bei5 b ei5
56
+ ben1 b en1
57
+ ben2 b en2
58
+ ben3 b en3
59
+ ben4 b en4
60
+ ben5 b en5
61
+ beng1 b eng1
62
+ beng2 b eng2
63
+ beng3 b eng3
64
+ beng4 b eng4
65
+ beng5 b eng5
66
+ bi1 b i1
67
+ bi2 b i2
68
+ bi3 b i3
69
+ bi4 b i4
70
+ bi5 b i5
71
+ bian1 b ian1
72
+ bian2 b ian2
73
+ bian3 b ian3
74
+ bian4 b ian4
75
+ bian5 b ian5
76
+ biao1 b iao1
77
+ biao2 b iao2
78
+ biao3 b iao3
79
+ biao4 b iao4
80
+ biao5 b iao5
81
+ bie1 b ie1
82
+ bie2 b ie2
83
+ bie3 b ie3
84
+ bie4 b ie4
85
+ bie5 b ie5
86
+ bin1 b in1
87
+ bin2 b in2
88
+ bin3 b in3
89
+ bin4 b in4
90
+ bin5 b in5
91
+ bing1 b ing1
92
+ bing2 b ing2
93
+ bing3 b ing3
94
+ bing4 b ing4
95
+ bing5 b ing5
96
+ bo1 b o1
97
+ bo2 b o2
98
+ bo3 b o3
99
+ bo4 b o4
100
+ bo5 b o5
101
+ bu1 b u1
102
+ bu2 b u2
103
+ bu3 b u3
104
+ bu4 b u4
105
+ bu5 b u5
106
+ ca1 c a1
107
+ ca2 c a2
108
+ ca3 c a3
109
+ ca4 c a4
110
+ ca5 c a5
111
+ cai1 c ai1
112
+ cai2 c ai2
113
+ cai3 c ai3
114
+ cai4 c ai4
115
+ cai5 c ai5
116
+ can1 c an1
117
+ can2 c an2
118
+ can3 c an3
119
+ can4 c an4
120
+ can5 c an5
121
+ cang1 c ang1
122
+ cang2 c ang2
123
+ cang3 c ang3
124
+ cang4 c ang4
125
+ cang5 c ang5
126
+ cao1 c ao1
127
+ cao2 c ao2
128
+ cao3 c ao3
129
+ cao4 c ao4
130
+ cao5 c ao5
131
+ ce1 c e1
132
+ ce2 c e2
133
+ ce3 c e3
134
+ ce4 c e4
135
+ ce5 c e5
136
+ cen1 c en1
137
+ cen2 c en2
138
+ cen3 c en3
139
+ cen4 c en4
140
+ cen5 c en5
141
+ ceng1 c eng1
142
+ ceng2 c eng2
143
+ ceng3 c eng3
144
+ ceng4 c eng4
145
+ ceng5 c eng5
146
+ cha1 ch a1
147
+ cha2 ch a2
148
+ cha3 ch a3
149
+ cha4 ch a4
150
+ cha5 ch a5
151
+ chai1 ch ai1
152
+ chai2 ch ai2
153
+ chai3 ch ai3
154
+ chai4 ch ai4
155
+ chai5 ch ai5
156
+ chan1 ch an1
157
+ chan2 ch an2
158
+ chan3 ch an3
159
+ chan4 ch an4
160
+ chan5 ch an5
161
+ chang1 ch ang1
162
+ chang2 ch ang2
163
+ chang3 ch ang3
164
+ chang4 ch ang4
165
+ chang5 ch ang5
166
+ chao1 ch ao1
167
+ chao2 ch ao2
168
+ chao3 ch ao3
169
+ chao4 ch ao4
170
+ chao5 ch ao5
171
+ che1 ch e1
172
+ che2 ch e2
173
+ che3 ch e3
174
+ che4 ch e4
175
+ che5 ch e5
176
+ chen1 ch en1
177
+ chen2 ch en2
178
+ chen3 ch en3
179
+ chen4 ch en4
180
+ chen5 ch en5
181
+ cheng1 ch eng1
182
+ cheng2 ch eng2
183
+ cheng3 ch eng3
184
+ cheng4 ch eng4
185
+ cheng5 ch eng5
186
+ chi1 ch iii1
187
+ chi2 ch iii2
188
+ chi3 ch iii3
189
+ chi4 ch iii4
190
+ chi5 ch iii5
191
+ chong1 ch ong1
192
+ chong2 ch ong2
193
+ chong3 ch ong3
194
+ chong4 ch ong4
195
+ chong5 ch ong5
196
+ chou1 ch ou1
197
+ chou2 ch ou2
198
+ chou3 ch ou3
199
+ chou4 ch ou4
200
+ chou5 ch ou5
201
+ chu1 ch u1
202
+ chu2 ch u2
203
+ chu3 ch u3
204
+ chu4 ch u4
205
+ chu5 ch u5
206
+ chuai1 ch uai1
207
+ chuai2 ch uai2
208
+ chuai3 ch uai3
209
+ chuai4 ch uai4
210
+ chuai5 ch uai5
211
+ chuan1 ch uan1
212
+ chuan2 ch uan2
213
+ chuan3 ch uan3
214
+ chuan4 ch uan4
215
+ chuan5 ch uan5
216
+ chuang1 ch uang1
217
+ chuang2 ch uang2
218
+ chuang3 ch uang3
219
+ chuang4 ch uang4
220
+ chuang5 ch uang5
221
+ chui1 ch uei1
222
+ chui2 ch uei2
223
+ chui3 ch uei3
224
+ chui4 ch uei4
225
+ chui5 ch uei5
226
+ chun1 ch uen1
227
+ chun2 ch uen2
228
+ chun3 ch uen3
229
+ chun4 ch uen4
230
+ chun5 ch uen5
231
+ chuo1 ch uo1
232
+ chuo2 ch uo2
233
+ chuo3 ch uo3
234
+ chuo4 ch uo4
235
+ chuo5 ch uo5
236
+ ci1 c ii1
237
+ ci2 c ii2
238
+ ci3 c ii3
239
+ ci4 c ii4
240
+ ci5 c ii5
241
+ cong1 c ong1
242
+ cong2 c ong2
243
+ cong3 c ong3
244
+ cong4 c ong4
245
+ cong5 c ong5
246
+ cou1 c ou1
247
+ cou2 c ou2
248
+ cou3 c ou3
249
+ cou4 c ou4
250
+ cou5 c ou5
251
+ cu1 c u1
252
+ cu2 c u2
253
+ cu3 c u3
254
+ cu4 c u4
255
+ cu5 c u5
256
+ cuan1 c uan1
257
+ cuan2 c uan2
258
+ cuan3 c uan3
259
+ cuan4 c uan4
260
+ cuan5 c uan5
261
+ cui1 c uei1
262
+ cui2 c uei2
263
+ cui3 c uei3
264
+ cui4 c uei4
265
+ cui5 c uei5
266
+ cun1 c uen1
267
+ cun2 c uen2
268
+ cun3 c uen3
269
+ cun4 c uen4
270
+ cun5 c uen5
271
+ cuo1 c uo1
272
+ cuo2 c uo2
273
+ cuo3 c uo3
274
+ cuo4 c uo4
275
+ cuo5 c uo5
276
+ da1 d a1
277
+ da2 d a2
278
+ da3 d a3
279
+ da4 d a4
280
+ da5 d a5
281
+ dai1 d ai1
282
+ dai2 d ai2
283
+ dai3 d ai3
284
+ dai4 d ai4
285
+ dai5 d ai5
286
+ dan1 d an1
287
+ dan2 d an2
288
+ dan3 d an3
289
+ dan4 d an4
290
+ dan5 d an5
291
+ dang1 d ang1
292
+ dang2 d ang2
293
+ dang3 d ang3
294
+ dang4 d ang4
295
+ dang5 d ang5
296
+ dao1 d ao1
297
+ dao2 d ao2
298
+ dao3 d ao3
299
+ dao4 d ao4
300
+ dao5 d ao5
301
+ de1 d e1
302
+ de2 d e2
303
+ de3 d e3
304
+ de4 d e4
305
+ de5 d e5
306
+ dei1 d ei1
307
+ dei2 d ei2
308
+ dei3 d ei3
309
+ dei4 d ei4
310
+ dei5 d ei5
311
+ den1 d en1
312
+ den2 d en2
313
+ den3 d en3
314
+ den4 d en4
315
+ den5 d en5
316
+ deng1 d eng1
317
+ deng2 d eng2
318
+ deng3 d eng3
319
+ deng4 d eng4
320
+ deng5 d eng5
321
+ di1 d i1
322
+ di2 d i2
323
+ di3 d i3
324
+ di4 d i4
325
+ di5 d i5
326
+ dia1 d ia1
327
+ dia2 d ia2
328
+ dia3 d ia3
329
+ dia4 d ia4
330
+ dia5 d ia5
331
+ dian1 d ian1
332
+ dian2 d ian2
333
+ dian3 d ian3
334
+ dian4 d ian4
335
+ dian5 d ian5
336
+ diao1 d iao1
337
+ diao2 d iao2
338
+ diao3 d iao3
339
+ diao4 d iao4
340
+ diao5 d iao5
341
+ die1 d ie1
342
+ die2 d ie2
343
+ die3 d ie3
344
+ die4 d ie4
345
+ die5 d ie5
346
+ ding1 d ing1
347
+ ding2 d ing2
348
+ ding3 d ing3
349
+ ding4 d ing4
350
+ ding5 d ing5
351
+ diu1 d iou1
352
+ diu2 d iou2
353
+ diu3 d iou3
354
+ diu4 d iou4
355
+ diu5 d iou5
356
+ dong1 d ong1
357
+ dong2 d ong2
358
+ dong3 d ong3
359
+ dong4 d ong4
360
+ dong5 d ong5
361
+ dou1 d ou1
362
+ dou2 d ou2
363
+ dou3 d ou3
364
+ dou4 d ou4
365
+ dou5 d ou5
366
+ du1 d u1
367
+ du2 d u2
368
+ du3 d u3
369
+ du4 d u4
370
+ du5 d u5
371
+ duan1 d uan1
372
+ duan2 d uan2
373
+ duan3 d uan3
374
+ duan4 d uan4
375
+ duan5 d uan5
376
+ dui1 d uei1
377
+ dui2 d uei2
378
+ dui3 d uei3
379
+ dui4 d uei4
380
+ dui5 d uei5
381
+ dun1 d uen1
382
+ dun2 d uen2
383
+ dun3 d uen3
384
+ dun4 d uen4
385
+ dun5 d uen5
386
+ duo1 d uo1
387
+ duo2 d uo2
388
+ duo3 d uo3
389
+ duo4 d uo4
390
+ duo5 d uo5
391
+ e1 e1
392
+ e2 e2
393
+ e3 e3
394
+ e4 e4
395
+ e5 e5
396
+ ei1 ei1
397
+ ei2 ei2
398
+ ei3 ei3
399
+ ei4 ei4
400
+ ei5 ei5
401
+ en1 en1
402
+ en2 en2
403
+ en3 en3
404
+ en4 en4
405
+ en5 en5
406
+ eng1 eng1
407
+ eng2 eng2
408
+ eng3 eng3
409
+ eng4 eng4
410
+ eng5 eng5
411
+ r1 er1
412
+ r2 er2
413
+ r3 er3
414
+ r4 er4
415
+ r5 er5
416
+ er1 er1
417
+ er2 er2
418
+ er3 er3
419
+ er4 er4
420
+ er5 er5
421
+ fa1 f a1
422
+ fa2 f a2
423
+ fa3 f a3
424
+ fa4 f a4
425
+ fa5 f a5
426
+ fan1 f an1
427
+ fan2 f an2
428
+ fan3 f an3
429
+ fan4 f an4
430
+ fan5 f an5
431
+ fang1 f ang1
432
+ fang2 f ang2
433
+ fang3 f ang3
434
+ fang4 f ang4
435
+ fang5 f ang5
436
+ fei1 f ei1
437
+ fei2 f ei2
438
+ fei3 f ei3
439
+ fei4 f ei4
440
+ fei5 f ei5
441
+ fen1 f en1
442
+ fen2 f en2
443
+ fen3 f en3
444
+ fen4 f en4
445
+ fen5 f en5
446
+ feng1 f eng1
447
+ feng2 f eng2
448
+ feng3 f eng3
449
+ feng4 f eng4
450
+ feng5 f eng5
451
+ fo1 f o1
452
+ fo2 f o2
453
+ fo3 f o3
454
+ fo4 f o4
455
+ fo5 f o5
456
+ fou1 f ou1
457
+ fou2 f ou2
458
+ fou3 f ou3
459
+ fou4 f ou4
460
+ fou5 f ou5
461
+ fu1 f u1
462
+ fu2 f u2
463
+ fu3 f u3
464
+ fu4 f u4
465
+ fu5 f u5
466
+ ga1 g a1
467
+ ga2 g a2
468
+ ga3 g a3
469
+ ga4 g a4
470
+ ga5 g a5
471
+ gai1 g ai1
472
+ gai2 g ai2
473
+ gai3 g ai3
474
+ gai4 g ai4
475
+ gai5 g ai5
476
+ gan1 g an1
477
+ gan2 g an2
478
+ gan3 g an3
479
+ gan4 g an4
480
+ gan5 g an5
481
+ gang1 g ang1
482
+ gang2 g ang2
483
+ gang3 g ang3
484
+ gang4 g ang4
485
+ gang5 g ang5
486
+ gao1 g ao1
487
+ gao2 g ao2
488
+ gao3 g ao3
489
+ gao4 g ao4
490
+ gao5 g ao5
491
+ ge1 g e1
492
+ ge2 g e2
493
+ ge3 g e3
494
+ ge4 g e4
495
+ ge5 g e5
496
+ gei1 g ei1
497
+ gei2 g ei2
498
+ gei3 g ei3
499
+ gei4 g ei4
500
+ gei5 g ei5
501
+ gen1 g en1
502
+ gen2 g en2
503
+ gen3 g en3
504
+ gen4 g en4
505
+ gen5 g en5
506
+ geng1 g eng1
507
+ geng2 g eng2
508
+ geng3 g eng3
509
+ geng4 g eng4
510
+ geng5 g eng5
511
+ gong1 g ong1
512
+ gong2 g ong2
513
+ gong3 g ong3
514
+ gong4 g ong4
515
+ gong5 g ong5
516
+ gou1 g ou1
517
+ gou2 g ou2
518
+ gou3 g ou3
519
+ gou4 g ou4
520
+ gou5 g ou5
521
+ gu1 g u1
522
+ gu2 g u2
523
+ gu3 g u3
524
+ gu4 g u4
525
+ gu5 g u5
526
+ gua1 g ua1
527
+ gua2 g ua2
528
+ gua3 g ua3
529
+ gua4 g ua4
530
+ gua5 g ua5
531
+ guai1 g uai1
532
+ guai2 g uai2
533
+ guai3 g uai3
534
+ guai4 g uai4
535
+ guai5 g uai5
536
+ guan1 g uan1
537
+ guan2 g uan2
538
+ guan3 g uan3
539
+ guan4 g uan4
540
+ guan5 g uan5
541
+ guang1 g uang1
542
+ guang2 g uang2
543
+ guang3 g uang3
544
+ guang4 g uang4
545
+ guang5 g uang5
546
+ gui1 g uei1
547
+ gui2 g uei2
548
+ gui3 g uei3
549
+ gui4 g uei4
550
+ gui5 g uei5
551
+ gun1 g uen1
552
+ gun2 g uen2
553
+ gun3 g uen3
554
+ gun4 g uen4
555
+ gun5 g uen5
556
+ guo1 g uo1
557
+ guo2 g uo2
558
+ guo3 g uo3
559
+ guo4 g uo4
560
+ guo5 g uo5
561
+ ha1 h a1
562
+ ha2 h a2
563
+ ha3 h a3
564
+ ha4 h a4
565
+ ha5 h a5
566
+ hai1 h ai1
567
+ hai2 h ai2
568
+ hai3 h ai3
569
+ hai4 h ai4
570
+ hai5 h ai5
571
+ han1 h an1
572
+ han2 h an2
573
+ han3 h an3
574
+ han4 h an4
575
+ han5 h an5
576
+ hang1 h ang1
577
+ hang2 h ang2
578
+ hang3 h ang3
579
+ hang4 h ang4
580
+ hang5 h ang5
581
+ hao1 h ao1
582
+ hao2 h ao2
583
+ hao3 h ao3
584
+ hao4 h ao4
585
+ hao5 h ao5
586
+ he1 h e1
587
+ he2 h e2
588
+ he3 h e3
589
+ he4 h e4
590
+ he5 h e5
591
+ hei1 h ei1
592
+ hei2 h ei2
593
+ hei3 h ei3
594
+ hei4 h ei4
595
+ hei5 h ei5
596
+ hen1 h en1
597
+ hen2 h en2
598
+ hen3 h en3
599
+ hen4 h en4
600
+ hen5 h en5
601
+ heng1 h eng1
602
+ heng2 h eng2
603
+ heng3 h eng3
604
+ heng4 h eng4
605
+ heng5 h eng5
606
+ hong1 h ong1
607
+ hong2 h ong2
608
+ hong3 h ong3
609
+ hong4 h ong4
610
+ hong5 h ong5
611
+ hou1 h ou1
612
+ hou2 h ou2
613
+ hou3 h ou3
614
+ hou4 h ou4
615
+ hou5 h ou5
616
+ hu1 h u1
617
+ hu2 h u2
618
+ hu3 h u3
619
+ hu4 h u4
620
+ hu5 h u5
621
+ hua1 h ua1
622
+ hua2 h ua2
623
+ hua3 h ua3
624
+ hua4 h ua4
625
+ hua5 h ua5
626
+ huai1 h uai1
627
+ huai2 h uai2
628
+ huai3 h uai3
629
+ huai4 h uai4
630
+ huai5 h uai5
631
+ huan1 h uan1
632
+ huan2 h uan2
633
+ huan3 h uan3
634
+ huan4 h uan4
635
+ huan5 h uan5
636
+ huang1 h uang1
637
+ huang2 h uang2
638
+ huang3 h uang3
639
+ huang4 h uang4
640
+ huang5 h uang5
641
+ hui1 h uei1
642
+ hui2 h uei2
643
+ hui3 h uei3
644
+ hui4 h uei4
645
+ hui5 h uei5
646
+ hun1 h uen1
647
+ hun2 h uen2
648
+ hun3 h uen3
649
+ hun4 h uen4
650
+ hun5 h uen5
651
+ huo1 h uo1
652
+ huo2 h uo2
653
+ huo3 h uo3
654
+ huo4 h uo4
655
+ huo5 h uo5
656
+ ji1 j i1
657
+ ji2 j i2
658
+ ji3 j i3
659
+ ji4 j i4
660
+ ji5 j i5
661
+ jia1 j ia1
662
+ jia2 j ia2
663
+ jia3 j ia3
664
+ jia4 j ia4
665
+ jia5 j ia5
666
+ jian1 j ian1
667
+ jian2 j ian2
668
+ jian3 j ian3
669
+ jian4 j ian4
670
+ jian5 j ian5
671
+ jiang1 j iang1
672
+ jiang2 j iang2
673
+ jiang3 j iang3
674
+ jiang4 j iang4
675
+ jiang5 j iang5
676
+ jiao1 j iao1
677
+ jiao2 j iao2
678
+ jiao3 j iao3
679
+ jiao4 j iao4
680
+ jiao5 j iao5
681
+ jie1 j ie1
682
+ jie2 j ie2
683
+ jie3 j ie3
684
+ jie4 j ie4
685
+ jie5 j ie5
686
+ jin1 j in1
687
+ jin2 j in2
688
+ jin3 j in3
689
+ jin4 j in4
690
+ jin5 j in5
691
+ jing1 j ing1
692
+ jing2 j ing2
693
+ jing3 j ing3
694
+ jing4 j ing4
695
+ jing5 j ing5
696
+ jiong1 j iong1
697
+ jiong2 j iong2
698
+ jiong3 j iong3
699
+ jiong4 j iong4
700
+ jiong5 j iong5
701
+ jiu1 j iou1
702
+ jiu2 j iou2
703
+ jiu3 j iou3
704
+ jiu4 j iou4
705
+ jiu5 j iou5
706
+ ju1 j v1
707
+ ju2 j v2
708
+ ju3 j v3
709
+ ju4 j v4
710
+ ju5 j v5
711
+ juan1 j van1
712
+ juan2 j van2
713
+ juan3 j van3
714
+ juan4 j van4
715
+ juan5 j van5
716
+ jue1 j ve1
717
+ jue2 j ve2
718
+ jue3 j ve3
719
+ jue4 j ve4
720
+ jue5 j ve5
721
+ jun1 j vn1
722
+ jun2 j vn2
723
+ jun3 j vn3
724
+ jun4 j vn4
725
+ jun5 j vn5
726
+ ka1 k a1
727
+ ka2 k a2
728
+ ka3 k a3
729
+ ka4 k a4
730
+ ka5 k a5
731
+ kai1 k ai1
732
+ kai2 k ai2
733
+ kai3 k ai3
734
+ kai4 k ai4
735
+ kai5 k ai5
736
+ kan1 k an1
737
+ kan2 k an2
738
+ kan3 k an3
739
+ kan4 k an4
740
+ kan5 k an5
741
+ kang1 k ang1
742
+ kang2 k ang2
743
+ kang3 k ang3
744
+ kang4 k ang4
745
+ kang5 k ang5
746
+ kao1 k ao1
747
+ kao2 k ao2
748
+ kao3 k ao3
749
+ kao4 k ao4
750
+ kao5 k ao5
751
+ ke1 k e1
752
+ ke2 k e2
753
+ ke3 k e3
754
+ ke4 k e4
755
+ ke5 k e5
756
+ kei1 k ei1
757
+ kei2 k ei2
758
+ kei3 k ei3
759
+ kei4 k ei4
760
+ kei5 k ei5
761
+ ken1 k en1
762
+ ken2 k en2
763
+ ken3 k en3
764
+ ken4 k en4
765
+ ken5 k en5
766
+ keng1 k eng1
767
+ keng2 k eng2
768
+ keng3 k eng3
769
+ keng4 k eng4
770
+ keng5 k eng5
771
+ kong1 k ong1
772
+ kong2 k ong2
773
+ kong3 k ong3
774
+ kong4 k ong4
775
+ kong5 k ong5
776
+ kou1 k ou1
777
+ kou2 k ou2
778
+ kou3 k ou3
779
+ kou4 k ou4
780
+ kou5 k ou5
781
+ ku1 k u1
782
+ ku2 k u2
783
+ ku3 k u3
784
+ ku4 k u4
785
+ ku5 k u5
786
+ kua1 k ua1
787
+ kua2 k ua2
788
+ kua3 k ua3
789
+ kua4 k ua4
790
+ kua5 k ua5
791
+ kuai1 k uai1
792
+ kuai2 k uai2
793
+ kuai3 k uai3
794
+ kuai4 k uai4
795
+ kuai5 k uai5
796
+ kuan1 k uan1
797
+ kuan2 k uan2
798
+ kuan3 k uan3
799
+ kuan4 k uan4
800
+ kuan5 k uan5
801
+ kuang1 k uang1
802
+ kuang2 k uang2
803
+ kuang3 k uang3
804
+ kuang4 k uang4
805
+ kuang5 k uang5
806
+ kui1 k uei1
807
+ kui2 k uei2
808
+ kui3 k uei3
809
+ kui4 k uei4
810
+ kui5 k uei5
811
+ kun1 k uen1
812
+ kun2 k uen2
813
+ kun3 k uen3
814
+ kun4 k uen4
815
+ kun5 k uen5
816
+ kuo1 k uo1
817
+ kuo2 k uo2
818
+ kuo3 k uo3
819
+ kuo4 k uo4
820
+ kuo5 k uo5
821
+ la1 l a1
822
+ la2 l a2
823
+ la3 l a3
824
+ la4 l a4
825
+ la5 l a5
826
+ lai1 l ai1
827
+ lai2 l ai2
828
+ lai3 l ai3
829
+ lai4 l ai4
830
+ lai5 l ai5
831
+ lan1 l an1
832
+ lan2 l an2
833
+ lan3 l an3
834
+ lan4 l an4
835
+ lan5 l an5
836
+ lang1 l ang1
837
+ lang2 l ang2
838
+ lang3 l ang3
839
+ lang4 l ang4
840
+ lang5 l ang5
841
+ lao1 l ao1
842
+ lao2 l ao2
843
+ lao3 l ao3
844
+ lao4 l ao4
845
+ lao5 l ao5
846
+ le1 l e1
847
+ le2 l e2
848
+ le3 l e3
849
+ le4 l e4
850
+ le5 l e5
851
+ lei1 l ei1
852
+ lei2 l ei2
853
+ lei3 l ei3
854
+ lei4 l ei4
855
+ lei5 l ei5
856
+ leng1 l eng1
857
+ leng2 l eng2
858
+ leng3 l eng3
859
+ leng4 l eng4
860
+ leng5 l eng5
861
+ li1 l i1
862
+ li2 l i2
863
+ li3 l i3
864
+ li4 l i4
865
+ li5 l i5
866
+ lia1 l ia1
867
+ lia2 l ia2
868
+ lia3 l ia3
869
+ lia4 l ia4
870
+ lia5 l ia5
871
+ lian1 l ian1
872
+ lian2 l ian2
873
+ lian3 l ian3
874
+ lian4 l ian4
875
+ lian5 l ian5
876
+ liang1 l iang1
877
+ liang2 l iang2
878
+ liang3 l iang3
879
+ liang4 l iang4
880
+ liang5 l iang5
881
+ liao1 l iao1
882
+ liao2 l iao2
883
+ liao3 l iao3
884
+ liao4 l iao4
885
+ liao5 l iao5
886
+ lie1 l ie1
887
+ lie2 l ie2
888
+ lie3 l ie3
889
+ lie4 l ie4
890
+ lie5 l ie5
891
+ lin1 l in1
892
+ lin2 l in2
893
+ lin3 l in3
894
+ lin4 l in4
895
+ lin5 l in5
896
+ ling1 l ing1
897
+ ling2 l ing2
898
+ ling3 l ing3
899
+ ling4 l ing4
900
+ ling5 l ing5
901
+ liu1 l iou1
902
+ liu2 l iou2
903
+ liu3 l iou3
904
+ liu4 l iou4
905
+ liu5 l iou5
906
+ lo1 l o1
907
+ lo2 l o2
908
+ lo3 l o3
909
+ lo4 l o4
910
+ lo5 l o5
911
+ long1 l ong1
912
+ long2 l ong2
913
+ long3 l ong3
914
+ long4 l ong4
915
+ long5 l ong5
916
+ lou1 l ou1
917
+ lou2 l ou2
918
+ lou3 l ou3
919
+ lou4 l ou4
920
+ lou5 l ou5
921
+ lu1 l u1
922
+ lu2 l u2
923
+ lu3 l u3
924
+ lu4 l u4
925
+ lu5 l u5
926
+ luan1 l uan1
927
+ luan2 l uan2
928
+ luan3 l uan3
929
+ luan4 l uan4
930
+ luan5 l uan5
931
+ lue1 l ve1
932
+ lue2 l ve2
933
+ lue3 l ve3
934
+ lue4 l ve4
935
+ lue5 l ve5
936
+ lve1 l ve1
937
+ lve2 l ve2
938
+ lve3 l ve3
939
+ lve4 l ve4
940
+ lve5 l ve5
941
+ lun1 l uen1
942
+ lun2 l uen2
943
+ lun3 l uen3
944
+ lun4 l uen4
945
+ lun5 l uen5
946
+ luo1 l uo1
947
+ luo2 l uo2
948
+ luo3 l uo3
949
+ luo4 l uo4
950
+ luo5 l uo5
951
+ lv1 l v1
952
+ lv2 l v2
953
+ lv3 l v3
954
+ lv4 l v4
955
+ lv5 l v5
956
+ ma1 m a1
957
+ ma2 m a2
958
+ ma3 m a3
959
+ ma4 m a4
960
+ ma5 m a5
961
+ mai1 m ai1
962
+ mai2 m ai2
963
+ mai3 m ai3
964
+ mai4 m ai4
965
+ mai5 m ai5
966
+ man1 m an1
967
+ man2 m an2
968
+ man3 m an3
969
+ man4 m an4
970
+ man5 m an5
971
+ mang1 m ang1
972
+ mang2 m ang2
973
+ mang3 m ang3
974
+ mang4 m ang4
975
+ mang5 m ang5
976
+ mao1 m ao1
977
+ mao2 m ao2
978
+ mao3 m ao3
979
+ mao4 m ao4
980
+ mao5 m ao5
981
+ me1 m e1
982
+ me2 m e2
983
+ me3 m e3
984
+ me4 m e4
985
+ me5 m e5
986
+ mei1 m ei1
987
+ mei2 m ei2
988
+ mei3 m ei3
989
+ mei4 m ei4
990
+ mei5 m ei5
991
+ men1 m en1
992
+ men2 m en2
993
+ men3 m en3
994
+ men4 m en4
995
+ men5 m en5
996
+ meng1 m eng1
997
+ meng2 m eng2
998
+ meng3 m eng3
999
+ meng4 m eng4
1000
+ meng5 m eng5
1001
+ mi1 m i1
1002
+ mi2 m i2
1003
+ mi3 m i3
1004
+ mi4 m i4
1005
+ mi5 m i5
1006
+ mian1 m ian1
1007
+ mian2 m ian2
1008
+ mian3 m ian3
1009
+ mian4 m ian4
1010
+ mian5 m ian5
1011
+ miao1 m iao1
1012
+ miao2 m iao2
1013
+ miao3 m iao3
1014
+ miao4 m iao4
1015
+ miao5 m iao5
1016
+ mie1 m ie1
1017
+ mie2 m ie2
1018
+ mie3 m ie3
1019
+ mie4 m ie4
1020
+ mie5 m ie5
1021
+ min1 m in1
1022
+ min2 m in2
1023
+ min3 m in3
1024
+ min4 m in4
1025
+ min5 m in5
1026
+ ming1 m ing1
1027
+ ming2 m ing2
1028
+ ming3 m ing3
1029
+ ming4 m ing4
1030
+ ming5 m ing5
1031
+ miu1 m iou1
1032
+ miu2 m iou2
1033
+ miu3 m iou3
1034
+ miu4 m iou4
1035
+ miu5 m iou5
1036
+ mo1 m o1
1037
+ mo2 m o2
1038
+ mo3 m o3
1039
+ mo4 m o4
1040
+ mo5 m o5
1041
+ mou1 m ou1
1042
+ mou2 m ou2
1043
+ mou3 m ou3
1044
+ mou4 m ou4
1045
+ mou5 m ou5
1046
+ mu1 m u1
1047
+ mu2 m u2
1048
+ mu3 m u3
1049
+ mu4 m u4
1050
+ mu5 m u5
1051
+ na1 n a1
1052
+ na2 n a2
1053
+ na3 n a3
1054
+ na4 n a4
1055
+ na5 n a5
1056
+ nai1 n ai1
1057
+ nai2 n ai2
1058
+ nai3 n ai3
1059
+ nai4 n ai4
1060
+ nai5 n ai5
1061
+ nan1 n an1
1062
+ nan2 n an2
1063
+ nan3 n an3
1064
+ nan4 n an4
1065
+ nan5 n an5
1066
+ nang1 n ang1
1067
+ nang2 n ang2
1068
+ nang3 n ang3
1069
+ nang4 n ang4
1070
+ nang5 n ang5
1071
+ nao1 n ao1
1072
+ nao2 n ao2
1073
+ nao3 n ao3
1074
+ nao4 n ao4
1075
+ nao5 n ao5
1076
+ ne1 n e1
1077
+ ne2 n e2
1078
+ ne3 n e3
1079
+ ne4 n e4
1080
+ ne5 n e5
1081
+ nei1 n ei1
1082
+ nei2 n ei2
1083
+ nei3 n ei3
1084
+ nei4 n ei4
1085
+ nei5 n ei5
1086
+ nen1 n en1
1087
+ nen2 n en2
1088
+ nen3 n en3
1089
+ nen4 n en4
1090
+ nen5 n en5
1091
+ neng1 n eng1
1092
+ neng2 n eng2
1093
+ neng3 n eng3
1094
+ neng4 n eng4
1095
+ neng5 n eng5
1096
+ ni1 n i1
1097
+ ni2 n i2
1098
+ ni3 n i3
1099
+ ni4 n i4
1100
+ ni5 n i5
1101
+ nian1 n ian1
1102
+ nian2 n ian2
1103
+ nian3 n ian3
1104
+ nian4 n ian4
1105
+ nian5 n ian5
1106
+ niang1 n iang1
1107
+ niang2 n iang2
1108
+ niang3 n iang3
1109
+ niang4 n iang4
1110
+ niang5 n iang5
1111
+ niao1 n iao1
1112
+ niao2 n iao2
1113
+ niao3 n iao3
1114
+ niao4 n iao4
1115
+ niao5 n iao5
1116
+ nie1 n ie1
1117
+ nie2 n ie2
1118
+ nie3 n ie3
1119
+ nie4 n ie4
1120
+ nie5 n ie5
1121
+ nin1 n in1
1122
+ nin2 n in2
1123
+ nin3 n in3
1124
+ nin4 n in4
1125
+ nin5 n in5
1126
+ ning1 n ing1
1127
+ ning2 n ing2
1128
+ ning3 n ing3
1129
+ ning4 n ing4
1130
+ ning5 n ing5
1131
+ niu1 n iou1
1132
+ niu2 n iou2
1133
+ niu3 n iou3
1134
+ niu4 n iou4
1135
+ niu5 n iou5
1136
+ nong1 n ong1
1137
+ nong2 n ong2
1138
+ nong3 n ong3
1139
+ nong4 n ong4
1140
+ nong5 n ong5
1141
+ nou1 n ou1
1142
+ nou2 n ou2
1143
+ nou3 n ou3
1144
+ nou4 n ou4
1145
+ nou5 n ou5
1146
+ nu1 n u1
1147
+ nu2 n u2
1148
+ nu3 n u3
1149
+ nu4 n u4
1150
+ nu5 n u5
1151
+ nuan1 n uan1
1152
+ nuan2 n uan2
1153
+ nuan3 n uan3
1154
+ nuan4 n uan4
1155
+ nuan5 n uan5
1156
+ nue1 n ve1
1157
+ nue2 n ve2
1158
+ nue3 n ve3
1159
+ nue4 n ve4
1160
+ nue5 n ve5
1161
+ nve1 n ve1
1162
+ nve2 n ve2
1163
+ nve3 n ve3
1164
+ nve4 n ve4
1165
+ nve5 n ve5
1166
+ nuo1 n uo1
1167
+ nuo2 n uo2
1168
+ nuo3 n uo3
1169
+ nuo4 n uo4
1170
+ nuo5 n uo5
1171
+ nv1 n v1
1172
+ nv2 n v2
1173
+ nv3 n v3
1174
+ nv4 n v4
1175
+ nv5 n v5
1176
+ o1 o1
1177
+ o2 o2
1178
+ o3 o3
1179
+ o4 o4
1180
+ o5 o5
1181
+ ou1 ou1
1182
+ ou2 ou2
1183
+ ou3 ou3
1184
+ ou4 ou4
1185
+ ou5 ou5
1186
+ pa1 p a1
1187
+ pa2 p a2
1188
+ pa3 p a3
1189
+ pa4 p a4
1190
+ pa5 p a5
1191
+ pai1 p ai1
1192
+ pai2 p ai2
1193
+ pai3 p ai3
1194
+ pai4 p ai4
1195
+ pai5 p ai5
1196
+ pan1 p an1
1197
+ pan2 p an2
1198
+ pan3 p an3
1199
+ pan4 p an4
1200
+ pan5 p an5
1201
+ pang1 p ang1
1202
+ pang2 p ang2
1203
+ pang3 p ang3
1204
+ pang4 p ang4
1205
+ pang5 p ang5
1206
+ pao1 p ao1
1207
+ pao2 p ao2
1208
+ pao3 p ao3
1209
+ pao4 p ao4
1210
+ pao5 p ao5
1211
+ pei1 p ei1
1212
+ pei2 p ei2
1213
+ pei3 p ei3
1214
+ pei4 p ei4
1215
+ pei5 p ei5
1216
+ pen1 p en1
1217
+ pen2 p en2
1218
+ pen3 p en3
1219
+ pen4 p en4
1220
+ pen5 p en5
1221
+ peng1 p eng1
1222
+ peng2 p eng2
1223
+ peng3 p eng3
1224
+ peng4 p eng4
1225
+ peng5 p eng5
1226
+ pi1 p i1
1227
+ pi2 p i2
1228
+ pi3 p i3
1229
+ pi4 p i4
1230
+ pi5 p i5
1231
+ pian1 p ian1
1232
+ pian2 p ian2
1233
+ pian3 p ian3
1234
+ pian4 p ian4
1235
+ pian5 p ian5
1236
+ piao1 p iao1
1237
+ piao2 p iao2
1238
+ piao3 p iao3
1239
+ piao4 p iao4
1240
+ piao5 p iao5
1241
+ pie1 p ie1
1242
+ pie2 p ie2
1243
+ pie3 p ie3
1244
+ pie4 p ie4
1245
+ pie5 p ie5
1246
+ pin1 p in1
1247
+ pin2 p in2
1248
+ pin3 p in3
1249
+ pin4 p in4
1250
+ pin5 p in5
1251
+ ping1 p ing1
1252
+ ping2 p ing2
1253
+ ping3 p ing3
1254
+ ping4 p ing4
1255
+ ping5 p ing5
1256
+ po1 p o1
1257
+ po2 p o2
1258
+ po3 p o3
1259
+ po4 p o4
1260
+ po5 p o5
1261
+ pou1 p ou1
1262
+ pou2 p ou2
1263
+ pou3 p ou3
1264
+ pou4 p ou4
1265
+ pou5 p ou5
1266
+ pu1 p u1
1267
+ pu2 p u2
1268
+ pu3 p u3
1269
+ pu4 p u4
1270
+ pu5 p u5
1271
+ qi1 q i1
1272
+ qi2 q i2
1273
+ qi3 q i3
1274
+ qi4 q i4
1275
+ qi5 q i5
1276
+ qia1 q ia1
1277
+ qia2 q ia2
1278
+ qia3 q ia3
1279
+ qia4 q ia4
1280
+ qia5 q ia5
1281
+ qian1 q ian1
1282
+ qian2 q ian2
1283
+ qian3 q ian3
1284
+ qian4 q ian4
1285
+ qian5 q ian5
1286
+ qiang1 q iang1
1287
+ qiang2 q iang2
1288
+ qiang3 q iang3
1289
+ qiang4 q iang4
1290
+ qiang5 q iang5
1291
+ qiao1 q iao1
1292
+ qiao2 q iao2
1293
+ qiao3 q iao3
1294
+ qiao4 q iao4
1295
+ qiao5 q iao5
1296
+ qie1 q ie1
1297
+ qie2 q ie2
1298
+ qie3 q ie3
1299
+ qie4 q ie4
1300
+ qie5 q ie5
1301
+ qin1 q in1
1302
+ qin2 q in2
1303
+ qin3 q in3
1304
+ qin4 q in4
1305
+ qin5 q in5
1306
+ qing1 q ing1
1307
+ qing2 q ing2
1308
+ qing3 q ing3
1309
+ qing4 q ing4
1310
+ qing5 q ing5
1311
+ qiong1 q iong1
1312
+ qiong2 q iong2
1313
+ qiong3 q iong3
1314
+ qiong4 q iong4
1315
+ qiong5 q iong5
1316
+ qiu1 q iou1
1317
+ qiu2 q iou2
1318
+ qiu3 q iou3
1319
+ qiu4 q iou4
1320
+ qiu5 q iou5
1321
+ qu1 q v1
1322
+ qu2 q v2
1323
+ qu3 q v3
1324
+ qu4 q v4
1325
+ qu5 q v5
1326
+ quan1 q van1
1327
+ quan2 q van2
1328
+ quan3 q van3
1329
+ quan4 q van4
1330
+ quan5 q van5
1331
+ que1 q ve1
1332
+ que2 q ve2
1333
+ que3 q ve3
1334
+ que4 q ve4
1335
+ que5 q ve5
1336
+ qun1 q vn1
1337
+ qun2 q vn2
1338
+ qun3 q vn3
1339
+ qun4 q vn4
1340
+ qun5 q vn5
1341
+ ran1 r an1
1342
+ ran2 r an2
1343
+ ran3 r an3
1344
+ ran4 r an4
1345
+ ran5 r an5
1346
+ rang1 r ang1
1347
+ rang2 r ang2
1348
+ rang3 r ang3
1349
+ rang4 r ang4
1350
+ rang5 r ang5
1351
+ rao1 r ao1
1352
+ rao2 r ao2
1353
+ rao3 r ao3
1354
+ rao4 r ao4
1355
+ rao5 r ao5
1356
+ re1 r e1
1357
+ re2 r e2
1358
+ re3 r e3
1359
+ re4 r e4
1360
+ re5 r e5
1361
+ ren1 r en1
1362
+ ren2 r en2
1363
+ ren3 r en3
1364
+ ren4 r en4
1365
+ ren5 r en5
1366
+ reng1 r eng1
1367
+ reng2 r eng2
1368
+ reng3 r eng3
1369
+ reng4 r eng4
1370
+ reng5 r eng5
1371
+ ri1 r iii1
1372
+ ri2 r iii2
1373
+ ri3 r iii3
1374
+ ri4 r iii4
1375
+ ri5 r iii5
1376
+ rong1 r ong1
1377
+ rong2 r ong2
1378
+ rong3 r ong3
1379
+ rong4 r ong4
1380
+ rong5 r ong5
1381
+ rou1 r ou1
1382
+ rou2 r ou2
1383
+ rou3 r ou3
1384
+ rou4 r ou4
1385
+ rou5 r ou5
1386
+ ru1 r u1
1387
+ ru2 r u2
1388
+ ru3 r u3
1389
+ ru4 r u4
1390
+ ru5 r u5
1391
+ rua1 r ua1
1392
+ rua2 r ua2
1393
+ rua3 r ua3
1394
+ rua4 r ua4
1395
+ rua5 r ua5
1396
+ ruan1 r uan1
1397
+ ruan2 r uan2
1398
+ ruan3 r uan3
1399
+ ruan4 r uan4
1400
+ ruan5 r uan5
1401
+ rui1 r uei1
1402
+ rui2 r uei2
1403
+ rui3 r uei3
1404
+ rui4 r uei4
1405
+ rui5 r uei5
1406
+ run1 r uen1
1407
+ run2 r uen2
1408
+ run3 r uen3
1409
+ run4 r uen4
1410
+ run5 r uen5
1411
+ ruo1 r uo1
1412
+ ruo2 r uo2
1413
+ ruo3 r uo3
1414
+ ruo4 r uo4
1415
+ ruo5 r uo5
1416
+ sa1 s a1
1417
+ sa2 s a2
1418
+ sa3 s a3
1419
+ sa4 s a4
1420
+ sa5 s a5
1421
+ sai1 s ai1
1422
+ sai2 s ai2
1423
+ sai3 s ai3
1424
+ sai4 s ai4
1425
+ sai5 s ai5
1426
+ san1 s an1
1427
+ san2 s an2
1428
+ san3 s an3
1429
+ san4 s an4
1430
+ san5 s an5
1431
+ sang1 s ang1
1432
+ sang2 s ang2
1433
+ sang3 s ang3
1434
+ sang4 s ang4
1435
+ sang5 s ang5
1436
+ sao1 s ao1
1437
+ sao2 s ao2
1438
+ sao3 s ao3
1439
+ sao4 s ao4
1440
+ sao5 s ao5
1441
+ se1 s e1
1442
+ se2 s e2
1443
+ se3 s e3
1444
+ se4 s e4
1445
+ se5 s e5
1446
+ sen1 s en1
1447
+ sen2 s en2
1448
+ sen3 s en3
1449
+ sen4 s en4
1450
+ sen5 s en5
1451
+ seng1 s eng1
1452
+ seng2 s eng2
1453
+ seng3 s eng3
1454
+ seng4 s eng4
1455
+ seng5 s eng5
1456
+ sha1 sh a1
1457
+ sha2 sh a2
1458
+ sha3 sh a3
1459
+ sha4 sh a4
1460
+ sha5 sh a5
1461
+ shai1 sh ai1
1462
+ shai2 sh ai2
1463
+ shai3 sh ai3
1464
+ shai4 sh ai4
1465
+ shai5 sh ai5
1466
+ shan1 sh an1
1467
+ shan2 sh an2
1468
+ shan3 sh an3
1469
+ shan4 sh an4
1470
+ shan5 sh an5
1471
+ shang1 sh ang1
1472
+ shang2 sh ang2
1473
+ shang3 sh ang3
1474
+ shang4 sh ang4
1475
+ shang5 sh ang5
1476
+ shao1 sh ao1
1477
+ shao2 sh ao2
1478
+ shao3 sh ao3
1479
+ shao4 sh ao4
1480
+ shao5 sh ao5
1481
+ she1 sh e1
1482
+ she2 sh e2
1483
+ she3 sh e3
1484
+ she4 sh e4
1485
+ she5 sh e5
1486
+ shei1 sh ei1
1487
+ shei2 sh ei2
1488
+ shei3 sh ei3
1489
+ shei4 sh ei4
1490
+ shei5 sh ei5
1491
+ shen1 sh en1
1492
+ shen2 sh en2
1493
+ shen3 sh en3
1494
+ shen4 sh en4
1495
+ shen5 sh en5
1496
+ sheng1 sh eng1
1497
+ sheng2 sh eng2
1498
+ sheng3 sh eng3
1499
+ sheng4 sh eng4
1500
+ sheng5 sh eng5
1501
+ shi1 sh iii1
1502
+ shi2 sh iii2
1503
+ shi3 sh iii3
1504
+ shi4 sh iii4
1505
+ shi5 sh iii5
1506
+ shou1 sh ou1
1507
+ shou2 sh ou2
1508
+ shou3 sh ou3
1509
+ shou4 sh ou4
1510
+ shou5 sh ou5
1511
+ shu1 sh u1
1512
+ shu2 sh u2
1513
+ shu3 sh u3
1514
+ shu4 sh u4
1515
+ shu5 sh u5
1516
+ shua1 sh ua1
1517
+ shua2 sh ua2
1518
+ shua3 sh ua3
1519
+ shua4 sh ua4
1520
+ shua5 sh ua5
1521
+ shuai1 sh uai1
1522
+ shuai2 sh uai2
1523
+ shuai3 sh uai3
1524
+ shuai4 sh uai4
1525
+ shuai5 sh uai5
1526
+ shuan1 sh uan1
1527
+ shuan2 sh uan2
1528
+ shuan3 sh uan3
1529
+ shuan4 sh uan4
1530
+ shuan5 sh uan5
1531
+ shuang1 sh uang1
1532
+ shuang2 sh uang2
1533
+ shuang3 sh uang3
1534
+ shuang4 sh uang4
1535
+ shuang5 sh uang5
1536
+ shui1 sh uei1
1537
+ shui2 sh uei2
1538
+ shui3 sh uei3
1539
+ shui4 sh uei4
1540
+ shui5 sh uei5
1541
+ shun1 sh uen1
1542
+ shun2 sh uen2
1543
+ shun3 sh uen3
1544
+ shun4 sh uen4
1545
+ shun5 sh uen5
1546
+ shuo1 sh uo1
1547
+ shuo2 sh uo2
1548
+ shuo3 sh uo3
1549
+ shuo4 sh uo4
1550
+ shuo5 sh uo5
1551
+ si1 s ii1
1552
+ si2 s ii2
1553
+ si3 s ii3
1554
+ si4 s ii4
1555
+ si5 s ii5
1556
+ song1 s ong1
1557
+ song2 s ong2
1558
+ song3 s ong3
1559
+ song4 s ong4
1560
+ song5 s ong5
1561
+ sou1 s ou1
1562
+ sou2 s ou2
1563
+ sou3 s ou3
1564
+ sou4 s ou4
1565
+ sou5 s ou5
1566
+ su1 s u1
1567
+ su2 s u2
1568
+ su3 s u3
1569
+ su4 s u4
1570
+ su5 s u5
1571
+ suan1 s uan1
1572
+ suan2 s uan2
1573
+ suan3 s uan3
1574
+ suan4 s uan4
1575
+ suan5 s uan5
1576
+ sui1 s uei1
1577
+ sui2 s uei2
1578
+ sui3 s uei3
1579
+ sui4 s uei4
1580
+ sui5 s uei5
1581
+ sun1 s uen1
1582
+ sun2 s uen2
1583
+ sun3 s uen3
1584
+ sun4 s uen4
1585
+ sun5 s uen5
1586
+ suo1 s uo1
1587
+ suo2 s uo2
1588
+ suo3 s uo3
1589
+ suo4 s uo4
1590
+ suo5 s uo5
1591
+ ta1 t a1
1592
+ ta2 t a2
1593
+ ta3 t a3
1594
+ ta4 t a4
1595
+ ta5 t a5
1596
+ tai1 t ai1
1597
+ tai2 t ai2
1598
+ tai3 t ai3
1599
+ tai4 t ai4
1600
+ tai5 t ai5
1601
+ tan1 t an1
1602
+ tan2 t an2
1603
+ tan3 t an3
1604
+ tan4 t an4
1605
+ tan5 t an5
1606
+ tang1 t ang1
1607
+ tang2 t ang2
1608
+ tang3 t ang3
1609
+ tang4 t ang4
1610
+ tang5 t ang5
1611
+ tao1 t ao1
1612
+ tao2 t ao2
1613
+ tao3 t ao3
1614
+ tao4 t ao4
1615
+ tao5 t ao5
1616
+ te1 t e1
1617
+ te2 t e2
1618
+ te3 t e3
1619
+ te4 t e4
1620
+ te5 t e5
1621
+ tei1 t ei1
1622
+ tei2 t ei2
1623
+ tei3 t ei3
1624
+ tei4 t ei4
1625
+ tei5 t ei5
1626
+ teng1 t eng1
1627
+ teng2 t eng2
1628
+ teng3 t eng3
1629
+ teng4 t eng4
1630
+ teng5 t eng5
1631
+ ti1 t i1
1632
+ ti2 t i2
1633
+ ti3 t i3
1634
+ ti4 t i4
1635
+ ti5 t i5
1636
+ tian1 t ian1
1637
+ tian2 t ian2
1638
+ tian3 t ian3
1639
+ tian4 t ian4
1640
+ tian5 t ian5
1641
+ tiao1 t iao1
1642
+ tiao2 t iao2
1643
+ tiao3 t iao3
1644
+ tiao4 t iao4
1645
+ tiao5 t iao5
1646
+ tie1 t ie1
1647
+ tie2 t ie2
1648
+ tie3 t ie3
1649
+ tie4 t ie4
1650
+ tie5 t ie5
1651
+ ting1 t ing1
1652
+ ting2 t ing2
1653
+ ting3 t ing3
1654
+ ting4 t ing4
1655
+ ting5 t ing5
1656
+ tong1 t ong1
1657
+ tong2 t ong2
1658
+ tong3 t ong3
1659
+ tong4 t ong4
1660
+ tong5 t ong5
1661
+ tou1 t ou1
1662
+ tou2 t ou2
1663
+ tou3 t ou3
1664
+ tou4 t ou4
1665
+ tou5 t ou5
1666
+ tu1 t u1
1667
+ tu2 t u2
1668
+ tu3 t u3
1669
+ tu4 t u4
1670
+ tu5 t u5
1671
+ tuan1 t uan1
1672
+ tuan2 t uan2
1673
+ tuan3 t uan3
1674
+ tuan4 t uan4
1675
+ tuan5 t uan5
1676
+ tui1 t uei1
1677
+ tui2 t uei2
1678
+ tui3 t uei3
1679
+ tui4 t uei4
1680
+ tui5 t uei5
1681
+ tun1 t uen1
1682
+ tun2 t uen2
1683
+ tun3 t uen3
1684
+ tun4 t uen4
1685
+ tun5 t uen5
1686
+ tuo1 t uo1
1687
+ tuo2 t uo2
1688
+ tuo3 t uo3
1689
+ tuo4 t uo4
1690
+ tuo5 t uo5
1691
+ wa1 w ua1
1692
+ wa2 w ua2
1693
+ wa3 w ua3
1694
+ wa4 w ua4
1695
+ wa5 w ua5
1696
+ wai1 w uai1
1697
+ wai2 w uai2
1698
+ wai3 w uai3
1699
+ wai4 w uai4
1700
+ wai5 w uai5
1701
+ wan1 w uan1
1702
+ wan2 w uan2
1703
+ wan3 w uan3
1704
+ wan4 w uan4
1705
+ wan5 w uan5
1706
+ wang1 w uang1
1707
+ wang2 w uang2
1708
+ wang3 w uang3
1709
+ wang4 w uang4
1710
+ wang5 w uang5
1711
+ wei1 w uei1
1712
+ wei2 w uei2
1713
+ wei3 w uei3
1714
+ wei4 w uei4
1715
+ wei5 w uei5
1716
+ wen1 w uen1
1717
+ wen2 w uen2
1718
+ wen3 w uen3
1719
+ wen4 w uen4
1720
+ wen5 w uen5
1721
+ weng1 w uen1
1722
+ weng2 w uen2
1723
+ weng3 w uen3
1724
+ weng4 w uen4
1725
+ weng5 w uen5
1726
+ wo1 w uo1
1727
+ wo2 w uo2
1728
+ wo3 w uo3
1729
+ wo4 w uo4
1730
+ wo5 w uo5
1731
+ wu1 w u1
1732
+ wu2 w u2
1733
+ wu3 w u3
1734
+ wu4 w u4
1735
+ wu5 w u5
1736
+ xi1 x i1
1737
+ xi2 x i2
1738
+ xi3 x i3
1739
+ xi4 x i4
1740
+ xi5 x i5
1741
+ xia1 x ia1
1742
+ xia2 x ia2
1743
+ xia3 x ia3
1744
+ xia4 x ia4
1745
+ xia5 x ia5
1746
+ xian1 x ian1
1747
+ xian2 x ian2
1748
+ xian3 x ian3
1749
+ xian4 x ian4
1750
+ xian5 x ian5
1751
+ xiang1 x iang1
1752
+ xiang2 x iang2
1753
+ xiang3 x iang3
1754
+ xiang4 x iang4
1755
+ xiang5 x iang5
1756
+ xiao1 x iao1
1757
+ xiao2 x iao2
1758
+ xiao3 x iao3
1759
+ xiao4 x iao4
1760
+ xiao5 x iao5
1761
+ xie1 x ie1
1762
+ xie2 x ie2
1763
+ xie3 x ie3
1764
+ xie4 x ie4
1765
+ xie5 x ie5
1766
+ xin1 x in1
1767
+ xin2 x in2
1768
+ xin3 x in3
1769
+ xin4 x in4
1770
+ xin5 x in5
1771
+ xing1 x ing1
1772
+ xing2 x ing2
1773
+ xing3 x ing3
1774
+ xing4 x ing4
1775
+ xing5 x ing5
1776
+ xiong1 x iong1
1777
+ xiong2 x iong2
1778
+ xiong3 x iong3
1779
+ xiong4 x iong4
1780
+ xiong5 x iong5
1781
+ xiu1 x iou1
1782
+ xiu2 x iou2
1783
+ xiu3 x iou3
1784
+ xiu4 x iou4
1785
+ xiu5 x iou5
1786
+ xu1 x v1
1787
+ xu2 x v2
1788
+ xu3 x v3
1789
+ xu4 x v4
1790
+ xu5 x v5
1791
+ xuan1 x van1
1792
+ xuan2 x van2
1793
+ xuan3 x van3
1794
+ xuan4 x van4
1795
+ xuan5 x van5
1796
+ xue1 x ve1
1797
+ xue2 x ve2
1798
+ xue3 x ve3
1799
+ xue4 x ve4
1800
+ xue5 x ve5
1801
+ xun1 x vn1
1802
+ xun2 x vn2
1803
+ xun3 x vn3
1804
+ xun4 x vn4
1805
+ xun5 x vn5
1806
+ ya1 y ia1
1807
+ ya2 y ia2
1808
+ ya3 y ia3
1809
+ ya4 y ia4
1810
+ ya5 y ia5
1811
+ yan1 y ian1
1812
+ yan2 y ian2
1813
+ yan3 y ian3
1814
+ yan4 y ian4
1815
+ yan5 y ian5
1816
+ yang1 y iang1
1817
+ yang2 y iang2
1818
+ yang3 y iang3
1819
+ yang4 y iang4
1820
+ yang5 y iang5
1821
+ yao1 y iao1
1822
+ yao2 y iao2
1823
+ yao3 y iao3
1824
+ yao4 y iao4
1825
+ yao5 y iao5
1826
+ ye1 y ie1
1827
+ ye2 y ie2
1828
+ ye3 y ie3
1829
+ ye4 y ie4
1830
+ ye5 y ie5
1831
+ yi1 y i1
1832
+ yi2 y i2
1833
+ yi3 y i3
1834
+ yi4 y i4
1835
+ yi5 y i5
1836
+ yin1 y in1
1837
+ yin2 y in2
1838
+ yin3 y in3
1839
+ yin4 y in4
1840
+ yin5 y in5
1841
+ ying1 y ing1
1842
+ ying2 y ing2
1843
+ ying3 y ing3
1844
+ ying4 y ing4
1845
+ ying5 y ing5
1846
+ yo1 y iou1
1847
+ yo2 y iou2
1848
+ yo3 y iou3
1849
+ yo4 y iou4
1850
+ yo5 y iou5
1851
+ yong1 y iong1
1852
+ yong2 y iong2
1853
+ yong3 y iong3
1854
+ yong4 y iong4
1855
+ yong5 y iong5
1856
+ you1 y iou1
1857
+ you2 y iou2
1858
+ you3 y iou3
1859
+ you4 y iou4
1860
+ you5 y iou5
1861
+ yu1 y v1
1862
+ yu2 y v2
1863
+ yu3 y v3
1864
+ yu4 y v4
1865
+ yu5 y v5
1866
+ yuan1 y van1
1867
+ yuan2 y van2
1868
+ yuan3 y van3
1869
+ yuan4 y van4
1870
+ yuan5 y van5
1871
+ yue1 y ve1
1872
+ yue2 y ve2
1873
+ yue3 y ve3
1874
+ yue4 y ve4
1875
+ yue5 y ve5
1876
+ yun1 y vn1
1877
+ yun2 y vn2
1878
+ yun3 y vn3
1879
+ yun4 y vn4
1880
+ yun5 y vn5
1881
+ za1 z a1
1882
+ za2 z a2
1883
+ za3 z a3
1884
+ za4 z a4
1885
+ za5 z a5
1886
+ zai1 z ai1
1887
+ zai2 z ai2
1888
+ zai3 z ai3
1889
+ zai4 z ai4
1890
+ zai5 z ai5
1891
+ zan1 z an1
1892
+ zan2 z an2
1893
+ zan3 z an3
1894
+ zan4 z an4
1895
+ zan5 z an5
1896
+ zang1 z ang1
1897
+ zang2 z ang2
1898
+ zang3 z ang3
1899
+ zang4 z ang4
1900
+ zang5 z ang5
1901
+ zao1 z ao1
1902
+ zao2 z ao2
1903
+ zao3 z ao3
1904
+ zao4 z ao4
1905
+ zao5 z ao5
1906
+ ze1 z e1
1907
+ ze2 z e2
1908
+ ze3 z e3
1909
+ ze4 z e4
1910
+ ze5 z e5
1911
+ zei1 z ei1
1912
+ zei2 z ei2
1913
+ zei3 z ei3
1914
+ zei4 z ei4
1915
+ zei5 z ei5
1916
+ zen1 z en1
1917
+ zen2 z en2
1918
+ zen3 z en3
1919
+ zen4 z en4
1920
+ zen5 z en5
1921
+ zeng1 z eng1
1922
+ zeng2 z eng2
1923
+ zeng3 z eng3
1924
+ zeng4 z eng4
1925
+ zeng5 z eng5
1926
+ zha1 zh a1
1927
+ zha2 zh a2
1928
+ zha3 zh a3
1929
+ zha4 zh a4
1930
+ zha5 zh a5
1931
+ zhai1 zh ai1
1932
+ zhai2 zh ai2
1933
+ zhai3 zh ai3
1934
+ zhai4 zh ai4
1935
+ zhai5 zh ai5
1936
+ zhan1 zh an1
1937
+ zhan2 zh an2
1938
+ zhan3 zh an3
1939
+ zhan4 zh an4
1940
+ zhan5 zh an5
1941
+ zhang1 zh ang1
1942
+ zhang2 zh ang2
1943
+ zhang3 zh ang3
1944
+ zhang4 zh ang4
1945
+ zhang5 zh ang5
1946
+ zhao1 zh ao1
1947
+ zhao2 zh ao2
1948
+ zhao3 zh ao3
1949
+ zhao4 zh ao4
1950
+ zhao5 zh ao5
1951
+ zhe1 zh e1
1952
+ zhe2 zh e2
1953
+ zhe3 zh e3
1954
+ zhe4 zh e4
1955
+ zhe5 zh e5
1956
+ zhei1 zh ei1
1957
+ zhei2 zh ei2
1958
+ zhei3 zh ei3
1959
+ zhei4 zh ei4
1960
+ zhei5 zh ei5
1961
+ zhen1 zh en1
1962
+ zhen2 zh en2
1963
+ zhen3 zh en3
1964
+ zhen4 zh en4
1965
+ zhen5 zh en5
1966
+ zheng1 zh eng1
1967
+ zheng2 zh eng2
1968
+ zheng3 zh eng3
1969
+ zheng4 zh eng4
1970
+ zheng5 zh eng5
1971
+ zhi1 zh iii1
1972
+ zhi2 zh iii2
1973
+ zhi3 zh iii3
1974
+ zhi4 zh iii4
1975
+ zhi5 zh iii5
1976
+ zhong1 zh ong1
1977
+ zhong2 zh ong2
1978
+ zhong3 zh ong3
1979
+ zhong4 zh ong4
1980
+ zhong5 zh ong5
1981
+ zhou1 zh ou1
1982
+ zhou2 zh ou2
1983
+ zhou3 zh ou3
1984
+ zhou4 zh ou4
1985
+ zhou5 zh ou5
1986
+ zhu1 zh u1
1987
+ zhu2 zh u2
1988
+ zhu3 zh u3
1989
+ zhu4 zh u4
1990
+ zhu5 zh u5
1991
+ zhua1 zh ua1
1992
+ zhua2 zh ua2
1993
+ zhua3 zh ua3
1994
+ zhua4 zh ua4
1995
+ zhua5 zh ua5
1996
+ zhuai1 zh uai1
1997
+ zhuai2 zh uai2
1998
+ zhuai3 zh uai3
1999
+ zhuai4 zh uai4
2000
+ zhuai5 zh uai5
2001
+ zhuan1 zh uan1
2002
+ zhuan2 zh uan2
2003
+ zhuan3 zh uan3
2004
+ zhuan4 zh uan4
2005
+ zhuan5 zh uan5
2006
+ zhuang1 zh uang1
2007
+ zhuang2 zh uang2
2008
+ zhuang3 zh uang3
2009
+ zhuang4 zh uang4
2010
+ zhuang5 zh uang5
2011
+ zhui1 zh uei1
2012
+ zhui2 zh uei2
2013
+ zhui3 zh uei3
2014
+ zhui4 zh uei4
2015
+ zhui5 zh uei5
2016
+ zhun1 zh uen1
2017
+ zhun2 zh uen2
2018
+ zhun3 zh uen3
2019
+ zhun4 zh uen4
2020
+ zhun5 zh uen5
2021
+ zhuo1 zh uo1
2022
+ zhuo2 zh uo2
2023
+ zhuo3 zh uo3
2024
+ zhuo4 zh uo4
2025
+ zhuo5 zh uo5
2026
+ zi1 z ii1
2027
+ zi2 z ii2
2028
+ zi3 z ii3
2029
+ zi4 z ii4
2030
+ zi5 z ii5
2031
+ zong1 z ong1
2032
+ zong2 z ong2
2033
+ zong3 z ong3
2034
+ zong4 z ong4
2035
+ zong5 z ong5
2036
+ zou1 z ou1
2037
+ zou2 z ou2
2038
+ zou3 z ou3
2039
+ zou4 z ou4
2040
+ zou5 z ou5
2041
+ zu1 z u1
2042
+ zu2 z u2
2043
+ zu3 z u3
2044
+ zu4 z u4
2045
+ zu5 z u5
2046
+ zuan1 z uan1
2047
+ zuan2 z uan2
2048
+ zuan3 z uan3
2049
+ zuan4 z uan4
2050
+ zuan5 z uan5
2051
+ zui1 z uei1
2052
+ zui2 z uei2
2053
+ zui3 z uei3
2054
+ zui4 z uei4
2055
+ zui5 z uei5
2056
+ zun1 z uen1
2057
+ zun2 z uen2
2058
+ zun3 z uen3
2059
+ zun4 z uen4
2060
+ zun5 z uen5
2061
+ zuo1 z uo1
2062
+ zuo2 z uo2
2063
+ zuo3 z uo3
2064
+ zuo4 z uo4
2065
+ zuo5 z uo5
2066
+ ar1 a1 rr
2067
+ ar2 a2 rr
2068
+ ar3 a3 rr
2069
+ ar4 a4 rr
2070
+ ar5 a5 rr
2071
+ air1 ai1 rr
2072
+ air2 ai2 rr
2073
+ air3 ai3 rr
2074
+ air4 ai4 rr
2075
+ air5 ai5 rr
2076
+ anr1 an1 rr
2077
+ anr2 an2 rr
2078
+ anr3 an3 rr
2079
+ anr4 an4 rr
2080
+ anr5 an5 rr
2081
+ angr1 ang1 rr
2082
+ angr2 ang2 rr
2083
+ angr3 ang3 rr
2084
+ angr4 ang4 rr
2085
+ angr5 ang5 rr
2086
+ aor1 ao1 rr
2087
+ aor2 ao2 rr
2088
+ aor3 ao3 rr
2089
+ aor4 ao4 rr
2090
+ aor5 ao5 rr
2091
+ bar1 b a1 rr
2092
+ bar2 b a2 rr
2093
+ bar3 b a3 rr
2094
+ bar4 b a4 rr
2095
+ bar5 b a5 rr
2096
+ bair1 b ai1 rr
2097
+ bair2 b ai2 rr
2098
+ bair3 b ai3 rr
2099
+ bair4 b ai4 rr
2100
+ bair5 b ai5 rr
2101
+ banr1 b an1 rr
2102
+ banr2 b an2 rr
2103
+ banr3 b an3 rr
2104
+ banr4 b an4 rr
2105
+ banr5 b an5 rr
2106
+ bangr1 b ang1 rr
2107
+ bangr2 b ang2 rr
2108
+ bangr3 b ang3 rr
2109
+ bangr4 b ang4 rr
2110
+ bangr5 b ang5 rr
2111
+ baor1 b ao1 rr
2112
+ baor2 b ao2 rr
2113
+ baor3 b ao3 rr
2114
+ baor4 b ao4 rr
2115
+ baor5 b ao5 rr
2116
+ beir1 b ei1 rr
2117
+ beir2 b ei2 rr
2118
+ beir3 b ei3 rr
2119
+ beir4 b ei4 rr
2120
+ beir5 b ei5 rr
2121
+ benr1 b en1 rr
2122
+ benr2 b en2 rr
2123
+ benr3 b en3 rr
2124
+ benr4 b en4 rr
2125
+ benr5 b en5 rr
2126
+ bengr1 b eng1 rr
2127
+ bengr2 b eng2 rr
2128
+ bengr3 b eng3 rr
2129
+ bengr4 b eng4 rr
2130
+ bengr5 b eng5 rr
2131
+ bir1 b i1 rr
2132
+ bir2 b i2 rr
2133
+ bir3 b i3 rr
2134
+ bir4 b i4 rr
2135
+ bir5 b i5 rr
2136
+ bianr1 b ian1 rr
2137
+ bianr2 b ian2 rr
2138
+ bianr3 b ian3 rr
2139
+ bianr4 b ian4 rr
2140
+ bianr5 b ian5 rr
2141
+ biaor1 b iao1 rr
2142
+ biaor2 b iao2 rr
2143
+ biaor3 b iao3 rr
2144
+ biaor4 b iao4 rr
2145
+ biaor5 b iao5 rr
2146
+ bier1 b ie1 rr
2147
+ bier2 b ie2 rr
2148
+ bier3 b ie3 rr
2149
+ bier4 b ie4 rr
2150
+ bier5 b ie5 rr
2151
+ binr1 b in1 rr
2152
+ binr2 b in2 rr
2153
+ binr3 b in3 rr
2154
+ binr4 b in4 rr
2155
+ binr5 b in5 rr
2156
+ bingr1 b ing1 rr
2157
+ bingr2 b ing2 rr
2158
+ bingr3 b ing3 rr
2159
+ bingr4 b ing4 rr
2160
+ bingr5 b ing5 rr
2161
+ bor1 b o1 rr
2162
+ bor2 b o2 rr
2163
+ bor3 b o3 rr
2164
+ bor4 b o4 rr
2165
+ bor5 b o5 rr
2166
+ bur1 b u1 rr
2167
+ bur2 b u2 rr
2168
+ bur3 b u3 rr
2169
+ bur4 b u4 rr
2170
+ bur5 b u5 rr
2171
+ car1 c a1 rr
2172
+ car2 c a2 rr
2173
+ car3 c a3 rr
2174
+ car4 c a4 rr
2175
+ car5 c a5 rr
2176
+ cair1 c ai1 rr
2177
+ cair2 c ai2 rr
2178
+ cair3 c ai3 rr
2179
+ cair4 c ai4 rr
2180
+ cair5 c ai5 rr
2181
+ canr1 c an1 rr
2182
+ canr2 c an2 rr
2183
+ canr3 c an3 rr
2184
+ canr4 c an4 rr
2185
+ canr5 c an5 rr
2186
+ cangr1 c ang1 rr
2187
+ cangr2 c ang2 rr
2188
+ cangr3 c ang3 rr
2189
+ cangr4 c ang4 rr
2190
+ cangr5 c ang5 rr
2191
+ caor1 c ao1 rr
2192
+ caor2 c ao2 rr
2193
+ caor3 c ao3 rr
2194
+ caor4 c ao4 rr
2195
+ caor5 c ao5 rr
2196
+ cer1 c e1 rr
2197
+ cer2 c e2 rr
2198
+ cer3 c e3 rr
2199
+ cer4 c e4 rr
2200
+ cer5 c e5 rr
2201
+ cenr1 c en1 rr
2202
+ cenr2 c en2 rr
2203
+ cenr3 c en3 rr
2204
+ cenr4 c en4 rr
2205
+ cenr5 c en5 rr
2206
+ cengr1 c eng1 rr
2207
+ cengr2 c eng2 rr
2208
+ cengr3 c eng3 rr
2209
+ cengr4 c eng4 rr
2210
+ cengr5 c eng5 rr
2211
+ char1 ch a1 rr
2212
+ char2 ch a2 rr
2213
+ char3 ch a3 rr
2214
+ char4 ch a4 rr
2215
+ char5 ch a5 rr
2216
+ chair1 ch ai1 rr
2217
+ chair2 ch ai2 rr
2218
+ chair3 ch ai3 rr
2219
+ chair4 ch ai4 rr
2220
+ chair5 ch ai5 rr
2221
+ chanr1 ch an1 rr
2222
+ chanr2 ch an2 rr
2223
+ chanr3 ch an3 rr
2224
+ chanr4 ch an4 rr
2225
+ chanr5 ch an5 rr
2226
+ changr1 ch ang1 rr
2227
+ changr2 ch ang2 rr
2228
+ changr3 ch ang3 rr
2229
+ changr4 ch ang4 rr
2230
+ changr5 ch ang5 rr
2231
+ chaor1 ch ao1 rr
2232
+ chaor2 ch ao2 rr
2233
+ chaor3 ch ao3 rr
2234
+ chaor4 ch ao4 rr
2235
+ chaor5 ch ao5 rr
2236
+ cher1 ch e1 rr
2237
+ cher2 ch e2 rr
2238
+ cher3 ch e3 rr
2239
+ cher4 ch e4 rr
2240
+ cher5 ch e5 rr
2241
+ chenr1 ch en1 rr
2242
+ chenr2 ch en2 rr
2243
+ chenr3 ch en3 rr
2244
+ chenr4 ch en4 rr
2245
+ chenr5 ch en5 rr
2246
+ chengr1 ch eng1 rr
2247
+ chengr2 ch eng2 rr
2248
+ chengr3 ch eng3 rr
2249
+ chengr4 ch eng4 rr
2250
+ chengr5 ch eng5 rr
2251
+ chir1 ch iii1 rr
2252
+ chir2 ch iii2 rr
2253
+ chir3 ch iii3 rr
2254
+ chir4 ch iii4 rr
2255
+ chir5 ch iii5 rr
2256
+ chongr1 ch ong1 rr
2257
+ chongr2 ch ong2 rr
2258
+ chongr3 ch ong3 rr
2259
+ chongr4 ch ong4 rr
2260
+ chongr5 ch ong5 rr
2261
+ chour1 ch ou1 rr
2262
+ chour2 ch ou2 rr
2263
+ chour3 ch ou3 rr
2264
+ chour4 ch ou4 rr
2265
+ chour5 ch ou5 rr
2266
+ chur1 ch u1 rr
2267
+ chur2 ch u2 rr
2268
+ chur3 ch u3 rr
2269
+ chur4 ch u4 rr
2270
+ chur5 ch u5 rr
2271
+ chuair1 ch uai1 rr
2272
+ chuair2 ch uai2 rr
2273
+ chuair3 ch uai3 rr
2274
+ chuair4 ch uai4 rr
2275
+ chuair5 ch uai5 rr
2276
+ chuanr1 ch uan1 rr
2277
+ chuanr2 ch uan2 rr
2278
+ chuanr3 ch uan3 rr
2279
+ chuanr4 ch uan4 rr
2280
+ chuanr5 ch uan5 rr
2281
+ chuangr1 ch uang1 rr
2282
+ chuangr2 ch uang2 rr
2283
+ chuangr3 ch uang3 rr
2284
+ chuangr4 ch uang4 rr
2285
+ chuangr5 ch uang5 rr
2286
+ chuir1 ch uei1 rr
2287
+ chuir2 ch uei2 rr
2288
+ chuir3 ch uei3 rr
2289
+ chuir4 ch uei4 rr
2290
+ chuir5 ch uei5 rr
2291
+ chunr1 ch uen1 rr
2292
+ chunr2 ch uen2 rr
2293
+ chunr3 ch uen3 rr
2294
+ chunr4 ch uen4 rr
2295
+ chunr5 ch uen5 rr
2296
+ chuor1 ch uo1 rr
2297
+ chuor2 ch uo2 rr
2298
+ chuor3 ch uo3 rr
2299
+ chuor4 ch uo4 rr
2300
+ chuor5 ch uo5 rr
2301
+ cir1 c ii1 rr
2302
+ cir2 c ii2 rr
2303
+ cir3 c ii3 rr
2304
+ cir4 c ii4 rr
2305
+ cir5 c ii5 rr
2306
+ congr1 c ong1 rr
2307
+ congr2 c ong2 rr
2308
+ congr3 c ong3 rr
2309
+ congr4 c ong4 rr
2310
+ congr5 c ong5 rr
2311
+ cour1 c ou1 rr
2312
+ cour2 c ou2 rr
2313
+ cour3 c ou3 rr
2314
+ cour4 c ou4 rr
2315
+ cour5 c ou5 rr
2316
+ cur1 c u1 rr
2317
+ cur2 c u2 rr
2318
+ cur3 c u3 rr
2319
+ cur4 c u4 rr
2320
+ cur5 c u5 rr
2321
+ cuanr1 c uan1 rr
2322
+ cuanr2 c uan2 rr
2323
+ cuanr3 c uan3 rr
2324
+ cuanr4 c uan4 rr
2325
+ cuanr5 c uan5 rr
2326
+ cuir1 c uei1 rr
2327
+ cuir2 c uei2 rr
2328
+ cuir3 c uei3 rr
2329
+ cuir4 c uei4 rr
2330
+ cuir5 c uei5 rr
2331
+ cunr1 c uen1 rr
2332
+ cunr2 c uen2 rr
2333
+ cunr3 c uen3 rr
2334
+ cunr4 c uen4 rr
2335
+ cunr5 c uen5 rr
2336
+ cuor1 c uo1 rr
2337
+ cuor2 c uo2 rr
2338
+ cuor3 c uo3 rr
2339
+ cuor4 c uo4 rr
2340
+ cuor5 c uo5 rr
2341
+ dar1 d a1 rr
2342
+ dar2 d a2 rr
2343
+ dar3 d a3 rr
2344
+ dar4 d a4 rr
2345
+ dar5 d a5 rr
2346
+ dair1 d ai1 rr
2347
+ dair2 d ai2 rr
2348
+ dair3 d ai3 rr
2349
+ dair4 d ai4 rr
2350
+ dair5 d ai5 rr
2351
+ danr1 d an1 rr
2352
+ danr2 d an2 rr
2353
+ danr3 d an3 rr
2354
+ danr4 d an4 rr
2355
+ danr5 d an5 rr
2356
+ dangr1 d ang1 rr
2357
+ dangr2 d ang2 rr
2358
+ dangr3 d ang3 rr
2359
+ dangr4 d ang4 rr
2360
+ dangr5 d ang5 rr
2361
+ daor1 d ao1 rr
2362
+ daor2 d ao2 rr
2363
+ daor3 d ao3 rr
2364
+ daor4 d ao4 rr
2365
+ daor5 d ao5 rr
2366
+ der1 d e1 rr
2367
+ der2 d e2 rr
2368
+ der3 d e3 rr
2369
+ der4 d e4 rr
2370
+ der5 d e5 rr
2371
+ deir1 d ei1 rr
2372
+ deir2 d ei2 rr
2373
+ deir3 d ei3 rr
2374
+ deir4 d ei4 rr
2375
+ deir5 d ei5 rr
2376
+ denr1 d en1 rr
2377
+ denr2 d en2 rr
2378
+ denr3 d en3 rr
2379
+ denr4 d en4 rr
2380
+ denr5 d en5 rr
2381
+ dengr1 d eng1 rr
2382
+ dengr2 d eng2 rr
2383
+ dengr3 d eng3 rr
2384
+ dengr4 d eng4 rr
2385
+ dengr5 d eng5 rr
2386
+ dir1 d i1 rr
2387
+ dir2 d i2 rr
2388
+ dir3 d i3 rr
2389
+ dir4 d i4 rr
2390
+ dir5 d i5 rr
2391
+ diar1 d ia1 rr
2392
+ diar2 d ia2 rr
2393
+ diar3 d ia3 rr
2394
+ diar4 d ia4 rr
2395
+ diar5 d ia5 rr
2396
+ dianr1 d ian1 rr
2397
+ dianr2 d ian2 rr
2398
+ dianr3 d ian3 rr
2399
+ dianr4 d ian4 rr
2400
+ dianr5 d ian5 rr
2401
+ diaor1 d iao1 rr
2402
+ diaor2 d iao2 rr
2403
+ diaor3 d iao3 rr
2404
+ diaor4 d iao4 rr
2405
+ diaor5 d iao5 rr
2406
+ dier1 d ie1 rr
2407
+ dier2 d ie2 rr
2408
+ dier3 d ie3 rr
2409
+ dier4 d ie4 rr
2410
+ dier5 d ie5 rr
2411
+ dingr1 d ing1 rr
2412
+ dingr2 d ing2 rr
2413
+ dingr3 d ing3 rr
2414
+ dingr4 d ing4 rr
2415
+ dingr5 d ing5 rr
2416
+ diur1 d iou1 rr
2417
+ diur2 d iou2 rr
2418
+ diur3 d iou3 rr
2419
+ diur4 d iou4 rr
2420
+ diur5 d iou5 rr
2421
+ dongr1 d ong1 rr
2422
+ dongr2 d ong2 rr
2423
+ dongr3 d ong3 rr
2424
+ dongr4 d ong4 rr
2425
+ dongr5 d ong5 rr
2426
+ dour1 d ou1 rr
2427
+ dour2 d ou2 rr
2428
+ dour3 d ou3 rr
2429
+ dour4 d ou4 rr
2430
+ dour5 d ou5 rr
2431
+ dur1 d u1 rr
2432
+ dur2 d u2 rr
2433
+ dur3 d u3 rr
2434
+ dur4 d u4 rr
2435
+ dur5 d u5 rr
2436
+ duanr1 d uan1 rr
2437
+ duanr2 d uan2 rr
2438
+ duanr3 d uan3 rr
2439
+ duanr4 d uan4 rr
2440
+ duanr5 d uan5 rr
2441
+ duir1 d uei1 rr
2442
+ duir2 d uei2 rr
2443
+ duir3 d uei3 rr
2444
+ duir4 d uei4 rr
2445
+ duir5 d uei5 rr
2446
+ dunr1 d uen1 rr
2447
+ dunr2 d uen2 rr
2448
+ dunr3 d uen3 rr
2449
+ dunr4 d uen4 rr
2450
+ dunr5 d uen5 rr
2451
+ duor1 d uo1 rr
2452
+ duor2 d uo2 rr
2453
+ duor3 d uo3 rr
2454
+ duor4 d uo4 rr
2455
+ duor5 d uo5 rr
2456
+ er1 e1 rr
2457
+ er2 e2 rr
2458
+ er3 e3 rr
2459
+ er4 e4 rr
2460
+ er5 e5 rr
2461
+ eir1 ei1 rr
2462
+ eir2 ei2 rr
2463
+ eir3 ei3 rr
2464
+ eir4 ei4 rr
2465
+ eir5 ei5 rr
2466
+ enr1 en1 rr
2467
+ enr2 en2 rr
2468
+ enr3 en3 rr
2469
+ enr4 en4 rr
2470
+ enr5 en5 rr
2471
+ engr1 eng1 rr
2472
+ engr2 eng2 rr
2473
+ engr3 eng3 rr
2474
+ engr4 eng4 rr
2475
+ engr5 eng5 rr
2476
+ far1 f a1 rr
2477
+ far2 f a2 rr
2478
+ far3 f a3 rr
2479
+ far4 f a4 rr
2480
+ far5 f a5 rr
2481
+ fanr1 f an1 rr
2482
+ fanr2 f an2 rr
2483
+ fanr3 f an3 rr
2484
+ fanr4 f an4 rr
2485
+ fanr5 f an5 rr
2486
+ fangr1 f ang1 rr
2487
+ fangr2 f ang2 rr
2488
+ fangr3 f ang3 rr
2489
+ fangr4 f ang4 rr
2490
+ fangr5 f ang5 rr
2491
+ feir1 f ei1 rr
2492
+ feir2 f ei2 rr
2493
+ feir3 f ei3 rr
2494
+ feir4 f ei4 rr
2495
+ feir5 f ei5 rr
2496
+ fenr1 f en1 rr
2497
+ fenr2 f en2 rr
2498
+ fenr3 f en3 rr
2499
+ fenr4 f en4 rr
2500
+ fenr5 f en5 rr
2501
+ fengr1 f eng1 rr
2502
+ fengr2 f eng2 rr
2503
+ fengr3 f eng3 rr
2504
+ fengr4 f eng4 rr
2505
+ fengr5 f eng5 rr
2506
+ for1 f o1 rr
2507
+ for2 f o2 rr
2508
+ for3 f o3 rr
2509
+ for4 f o4 rr
2510
+ for5 f o5 rr
2511
+ four1 f ou1 rr
2512
+ four2 f ou2 rr
2513
+ four3 f ou3 rr
2514
+ four4 f ou4 rr
2515
+ four5 f ou5 rr
2516
+ fur1 f u1 rr
2517
+ fur2 f u2 rr
2518
+ fur3 f u3 rr
2519
+ fur4 f u4 rr
2520
+ fur5 f u5 rr
2521
+ gar1 g a1 rr
2522
+ gar2 g a2 rr
2523
+ gar3 g a3 rr
2524
+ gar4 g a4 rr
2525
+ gar5 g a5 rr
2526
+ gair1 g ai1 rr
2527
+ gair2 g ai2 rr
2528
+ gair3 g ai3 rr
2529
+ gair4 g ai4 rr
2530
+ gair5 g ai5 rr
2531
+ ganr1 g an1 rr
2532
+ ganr2 g an2 rr
2533
+ ganr3 g an3 rr
2534
+ ganr4 g an4 rr
2535
+ ganr5 g an5 rr
2536
+ gangr1 g ang1 rr
2537
+ gangr2 g ang2 rr
2538
+ gangr3 g ang3 rr
2539
+ gangr4 g ang4 rr
2540
+ gangr5 g ang5 rr
2541
+ gaor1 g ao1 rr
2542
+ gaor2 g ao2 rr
2543
+ gaor3 g ao3 rr
2544
+ gaor4 g ao4 rr
2545
+ gaor5 g ao5 rr
2546
+ ger1 g e1 rr
2547
+ ger2 g e2 rr
2548
+ ger3 g e3 rr
2549
+ ger4 g e4 rr
2550
+ ger5 g e5 rr
2551
+ geir1 g ei1 rr
2552
+ geir2 g ei2 rr
2553
+ geir3 g ei3 rr
2554
+ geir4 g ei4 rr
2555
+ geir5 g ei5 rr
2556
+ genr1 g en1 rr
2557
+ genr2 g en2 rr
2558
+ genr3 g en3 rr
2559
+ genr4 g en4 rr
2560
+ genr5 g en5 rr
2561
+ gengr1 g eng1 rr
2562
+ gengr2 g eng2 rr
2563
+ gengr3 g eng3 rr
2564
+ gengr4 g eng4 rr
2565
+ gengr5 g eng5 rr
2566
+ gongr1 g ong1 rr
2567
+ gongr2 g ong2 rr
2568
+ gongr3 g ong3 rr
2569
+ gongr4 g ong4 rr
2570
+ gongr5 g ong5 rr
2571
+ gour1 g ou1 rr
2572
+ gour2 g ou2 rr
2573
+ gour3 g ou3 rr
2574
+ gour4 g ou4 rr
2575
+ gour5 g ou5 rr
2576
+ gur1 g u1 rr
2577
+ gur2 g u2 rr
2578
+ gur3 g u3 rr
2579
+ gur4 g u4 rr
2580
+ gur5 g u5 rr
2581
+ guar1 g ua1 rr
2582
+ guar2 g ua2 rr
2583
+ guar3 g ua3 rr
2584
+ guar4 g ua4 rr
2585
+ guar5 g ua5 rr
2586
+ guair1 g uai1 rr
2587
+ guair2 g uai2 rr
2588
+ guair3 g uai3 rr
2589
+ guair4 g uai4 rr
2590
+ guair5 g uai5 rr
2591
+ guanr1 g uan1 rr
2592
+ guanr2 g uan2 rr
2593
+ guanr3 g uan3 rr
2594
+ guanr4 g uan4 rr
2595
+ guanr5 g uan5 rr
2596
+ guangr1 g uang1 rr
2597
+ guangr2 g uang2 rr
2598
+ guangr3 g uang3 rr
2599
+ guangr4 g uang4 rr
2600
+ guangr5 g uang5 rr
2601
+ guir1 g uei1 rr
2602
+ guir2 g uei2 rr
2603
+ guir3 g uei3 rr
2604
+ guir4 g uei4 rr
2605
+ guir5 g uei5 rr
2606
+ gunr1 g uen1 rr
2607
+ gunr2 g uen2 rr
2608
+ gunr3 g uen3 rr
2609
+ gunr4 g uen4 rr
2610
+ gunr5 g uen5 rr
2611
+ guor1 g uo1 rr
2612
+ guor2 g uo2 rr
2613
+ guor3 g uo3 rr
2614
+ guor4 g uo4 rr
2615
+ guor5 g uo5 rr
2616
+ har1 h a1 rr
2617
+ har2 h a2 rr
2618
+ har3 h a3 rr
2619
+ har4 h a4 rr
2620
+ har5 h a5 rr
2621
+ hair1 h ai1 rr
2622
+ hair2 h ai2 rr
2623
+ hair3 h ai3 rr
2624
+ hair4 h ai4 rr
2625
+ hair5 h ai5 rr
2626
+ hanr1 h an1 rr
2627
+ hanr2 h an2 rr
2628
+ hanr3 h an3 rr
2629
+ hanr4 h an4 rr
2630
+ hanr5 h an5 rr
2631
+ hangr1 h ang1 rr
2632
+ hangr2 h ang2 rr
2633
+ hangr3 h ang3 rr
2634
+ hangr4 h ang4 rr
2635
+ hangr5 h ang5 rr
2636
+ haor1 h ao1 rr
2637
+ haor2 h ao2 rr
2638
+ haor3 h ao3 rr
2639
+ haor4 h ao4 rr
2640
+ haor5 h ao5 rr
2641
+ her1 h e1 rr
2642
+ her2 h e2 rr
2643
+ her3 h e3 rr
2644
+ her4 h e4 rr
2645
+ her5 h e5 rr
2646
+ heir1 h ei1 rr
2647
+ heir2 h ei2 rr
2648
+ heir3 h ei3 rr
2649
+ heir4 h ei4 rr
2650
+ heir5 h ei5 rr
2651
+ henr1 h en1 rr
2652
+ henr2 h en2 rr
2653
+ henr3 h en3 rr
2654
+ henr4 h en4 rr
2655
+ henr5 h en5 rr
2656
+ hengr1 h eng1 rr
2657
+ hengr2 h eng2 rr
2658
+ hengr3 h eng3 rr
2659
+ hengr4 h eng4 rr
2660
+ hengr5 h eng5 rr
2661
+ hongr1 h ong1 rr
2662
+ hongr2 h ong2 rr
2663
+ hongr3 h ong3 rr
2664
+ hongr4 h ong4 rr
2665
+ hongr5 h ong5 rr
2666
+ hour1 h ou1 rr
2667
+ hour2 h ou2 rr
2668
+ hour3 h ou3 rr
2669
+ hour4 h ou4 rr
2670
+ hour5 h ou5 rr
2671
+ hur1 h u1 rr
2672
+ hur2 h u2 rr
2673
+ hur3 h u3 rr
2674
+ hur4 h u4 rr
2675
+ hur5 h u5 rr
2676
+ huar1 h ua1 rr
2677
+ huar2 h ua2 rr
2678
+ huar3 h ua3 rr
2679
+ huar4 h ua4 rr
2680
+ huar5 h ua5 rr
2681
+ huair1 h uai1 rr
2682
+ huair2 h uai2 rr
2683
+ huair3 h uai3 rr
2684
+ huair4 h uai4 rr
2685
+ huair5 h uai5 rr
2686
+ huanr1 h uan1 rr
2687
+ huanr2 h uan2 rr
2688
+ huanr3 h uan3 rr
2689
+ huanr4 h uan4 rr
2690
+ huanr5 h uan5 rr
2691
+ huangr1 h uang1 rr
2692
+ huangr2 h uang2 rr
2693
+ huangr3 h uang3 rr
2694
+ huangr4 h uang4 rr
2695
+ huangr5 h uang5 rr
2696
+ huir1 h uei1 rr
2697
+ huir2 h uei2 rr
2698
+ huir3 h uei3 rr
2699
+ huir4 h uei4 rr
2700
+ huir5 h uei5 rr
2701
+ hunr1 h uen1 rr
2702
+ hunr2 h uen2 rr
2703
+ hunr3 h uen3 rr
2704
+ hunr4 h uen4 rr
2705
+ hunr5 h uen5 rr
2706
+ huor1 h uo1 rr
2707
+ huor2 h uo2 rr
2708
+ huor3 h uo3 rr
2709
+ huor4 h uo4 rr
2710
+ huor5 h uo5 rr
2711
+ jir1 j i1 rr
2712
+ jir2 j i2 rr
2713
+ jir3 j i3 rr
2714
+ jir4 j i4 rr
2715
+ jir5 j i5 rr
2716
+ jiar1 j ia1 rr
2717
+ jiar2 j ia2 rr
2718
+ jiar3 j ia3 rr
2719
+ jiar4 j ia4 rr
2720
+ jiar5 j ia5 rr
2721
+ jianr1 j ian1 rr
2722
+ jianr2 j ian2 rr
2723
+ jianr3 j ian3 rr
2724
+ jianr4 j ian4 rr
2725
+ jianr5 j ian5 rr
2726
+ jiangr1 j iang1 rr
2727
+ jiangr2 j iang2 rr
2728
+ jiangr3 j iang3 rr
2729
+ jiangr4 j iang4 rr
2730
+ jiangr5 j iang5 rr
2731
+ jiaor1 j iao1 rr
2732
+ jiaor2 j iao2 rr
2733
+ jiaor3 j iao3 rr
2734
+ jiaor4 j iao4 rr
2735
+ jiaor5 j iao5 rr
2736
+ jier1 j ie1 rr
2737
+ jier2 j ie2 rr
2738
+ jier3 j ie3 rr
2739
+ jier4 j ie4 rr
2740
+ jier5 j ie5 rr
2741
+ jinr1 j in1 rr
2742
+ jinr2 j in2 rr
2743
+ jinr3 j in3 rr
2744
+ jinr4 j in4 rr
2745
+ jinr5 j in5 rr
2746
+ jingr1 j ing1 rr
2747
+ jingr2 j ing2 rr
2748
+ jingr3 j ing3 rr
2749
+ jingr4 j ing4 rr
2750
+ jingr5 j ing5 rr
2751
+ jiongr1 j iong1 rr
2752
+ jiongr2 j iong2 rr
2753
+ jiongr3 j iong3 rr
2754
+ jiongr4 j iong4 rr
2755
+ jiongr5 j iong5 rr
2756
+ jiur1 j iou1 rr
2757
+ jiur2 j iou2 rr
2758
+ jiur3 j iou3 rr
2759
+ jiur4 j iou4 rr
2760
+ jiur5 j iou5 rr
2761
+ jur1 j v1 rr
2762
+ jur2 j v2 rr
2763
+ jur3 j v3 rr
2764
+ jur4 j v4 rr
2765
+ jur5 j v5 rr
2766
+ juanr1 j van1 rr
2767
+ juanr2 j van2 rr
2768
+ juanr3 j van3 rr
2769
+ juanr4 j van4 rr
2770
+ juanr5 j van5 rr
2771
+ juer1 j ve1 rr
2772
+ juer2 j ve2 rr
2773
+ juer3 j ve3 rr
2774
+ juer4 j ve4 rr
2775
+ juer5 j ve5 rr
2776
+ junr1 j vn1 rr
2777
+ junr2 j vn2 rr
2778
+ junr3 j vn3 rr
2779
+ junr4 j vn4 rr
2780
+ junr5 j vn5 rr
2781
+ kar1 k a1 rr
2782
+ kar2 k a2 rr
2783
+ kar3 k a3 rr
2784
+ kar4 k a4 rr
2785
+ kar5 k a5 rr
2786
+ kair1 k ai1 rr
2787
+ kair2 k ai2 rr
2788
+ kair3 k ai3 rr
2789
+ kair4 k ai4 rr
2790
+ kair5 k ai5 rr
2791
+ kanr1 k an1 rr
2792
+ kanr2 k an2 rr
2793
+ kanr3 k an3 rr
2794
+ kanr4 k an4 rr
2795
+ kanr5 k an5 rr
2796
+ kangr1 k ang1 rr
2797
+ kangr2 k ang2 rr
2798
+ kangr3 k ang3 rr
2799
+ kangr4 k ang4 rr
2800
+ kangr5 k ang5 rr
2801
+ kaor1 k ao1 rr
2802
+ kaor2 k ao2 rr
2803
+ kaor3 k ao3 rr
2804
+ kaor4 k ao4 rr
2805
+ kaor5 k ao5 rr
2806
+ ker1 k e1 rr
2807
+ ker2 k e2 rr
2808
+ ker3 k e3 rr
2809
+ ker4 k e4 rr
2810
+ ker5 k e5 rr
2811
+ keir1 k ei1 rr
2812
+ keir2 k ei2 rr
2813
+ keir3 k ei3 rr
2814
+ keir4 k ei4 rr
2815
+ keir5 k ei5 rr
2816
+ kenr1 k en1 rr
2817
+ kenr2 k en2 rr
2818
+ kenr3 k en3 rr
2819
+ kenr4 k en4 rr
2820
+ kenr5 k en5 rr
2821
+ kengr1 k eng1 rr
2822
+ kengr2 k eng2 rr
2823
+ kengr3 k eng3 rr
2824
+ kengr4 k eng4 rr
2825
+ kengr5 k eng5 rr
2826
+ kongr1 k ong1 rr
2827
+ kongr2 k ong2 rr
2828
+ kongr3 k ong3 rr
2829
+ kongr4 k ong4 rr
2830
+ kongr5 k ong5 rr
2831
+ kour1 k ou1 rr
2832
+ kour2 k ou2 rr
2833
+ kour3 k ou3 rr
2834
+ kour4 k ou4 rr
2835
+ kour5 k ou5 rr
2836
+ kur1 k u1 rr
2837
+ kur2 k u2 rr
2838
+ kur3 k u3 rr
2839
+ kur4 k u4 rr
2840
+ kur5 k u5 rr
2841
+ kuar1 k ua1 rr
2842
+ kuar2 k ua2 rr
2843
+ kuar3 k ua3 rr
2844
+ kuar4 k ua4 rr
2845
+ kuar5 k ua5 rr
2846
+ kuair1 k uai1 rr
2847
+ kuair2 k uai2 rr
2848
+ kuair3 k uai3 rr
2849
+ kuair4 k uai4 rr
2850
+ kuair5 k uai5 rr
2851
+ kuanr1 k uan1 rr
2852
+ kuanr2 k uan2 rr
2853
+ kuanr3 k uan3 rr
2854
+ kuanr4 k uan4 rr
2855
+ kuanr5 k uan5 rr
2856
+ kuangr1 k uang1 rr
2857
+ kuangr2 k uang2 rr
2858
+ kuangr3 k uang3 rr
2859
+ kuangr4 k uang4 rr
2860
+ kuangr5 k uang5 rr
2861
+ kuir1 k uei1 rr
2862
+ kuir2 k uei2 rr
2863
+ kuir3 k uei3 rr
2864
+ kuir4 k uei4 rr
2865
+ kuir5 k uei5 rr
2866
+ kunr1 k uen1 rr
2867
+ kunr2 k uen2 rr
2868
+ kunr3 k uen3 rr
2869
+ kunr4 k uen4 rr
2870
+ kunr5 k uen5 rr
2871
+ kuor1 k uo1 rr
2872
+ kuor2 k uo2 rr
2873
+ kuor3 k uo3 rr
2874
+ kuor4 k uo4 rr
2875
+ kuor5 k uo5 rr
2876
+ lar1 l a1 rr
2877
+ lar2 l a2 rr
2878
+ lar3 l a3 rr
2879
+ lar4 l a4 rr
2880
+ lar5 l a5 rr
2881
+ lair1 l ai1 rr
2882
+ lair2 l ai2 rr
2883
+ lair3 l ai3 rr
2884
+ lair4 l ai4 rr
2885
+ lair5 l ai5 rr
2886
+ lanr1 l an1 rr
2887
+ lanr2 l an2 rr
2888
+ lanr3 l an3 rr
2889
+ lanr4 l an4 rr
2890
+ lanr5 l an5 rr
2891
+ langr1 l ang1 rr
2892
+ langr2 l ang2 rr
2893
+ langr3 l ang3 rr
2894
+ langr4 l ang4 rr
2895
+ langr5 l ang5 rr
2896
+ laor1 l ao1 rr
2897
+ laor2 l ao2 rr
2898
+ laor3 l ao3 rr
2899
+ laor4 l ao4 rr
2900
+ laor5 l ao5 rr
2901
+ ler1 l e1 rr
2902
+ ler2 l e2 rr
2903
+ ler3 l e3 rr
2904
+ ler4 l e4 rr
2905
+ ler5 l e5 rr
2906
+ leir1 l ei1 rr
2907
+ leir2 l ei2 rr
2908
+ leir3 l ei3 rr
2909
+ leir4 l ei4 rr
2910
+ leir5 l ei5 rr
2911
+ lengr1 l eng1 rr
2912
+ lengr2 l eng2 rr
2913
+ lengr3 l eng3 rr
2914
+ lengr4 l eng4 rr
2915
+ lengr5 l eng5 rr
2916
+ lir1 l i1 rr
2917
+ lir2 l i2 rr
2918
+ lir3 l i3 rr
2919
+ lir4 l i4 rr
2920
+ lir5 l i5 rr
2921
+ liar1 l ia1 rr
2922
+ liar2 l ia2 rr
2923
+ liar3 l ia3 rr
2924
+ liar4 l ia4 rr
2925
+ liar5 l ia5 rr
2926
+ lianr1 l ian1 rr
2927
+ lianr2 l ian2 rr
2928
+ lianr3 l ian3 rr
2929
+ lianr4 l ian4 rr
2930
+ lianr5 l ian5 rr
2931
+ liangr1 l iang1 rr
2932
+ liangr2 l iang2 rr
2933
+ liangr3 l iang3 rr
2934
+ liangr4 l iang4 rr
2935
+ liangr5 l iang5 rr
2936
+ liaor1 l iao1 rr
2937
+ liaor2 l iao2 rr
2938
+ liaor3 l iao3 rr
2939
+ liaor4 l iao4 rr
2940
+ liaor5 l iao5 rr
2941
+ lier1 l ie1 rr
2942
+ lier2 l ie2 rr
2943
+ lier3 l ie3 rr
2944
+ lier4 l ie4 rr
2945
+ lier5 l ie5 rr
2946
+ linr1 l in1 rr
2947
+ linr2 l in2 rr
2948
+ linr3 l in3 rr
2949
+ linr4 l in4 rr
2950
+ linr5 l in5 rr
2951
+ lingr1 l ing1 rr
2952
+ lingr2 l ing2 rr
2953
+ lingr3 l ing3 rr
2954
+ lingr4 l ing4 rr
2955
+ lingr5 l ing5 rr
2956
+ liur1 l iou1 rr
2957
+ liur2 l iou2 rr
2958
+ liur3 l iou3 rr
2959
+ liur4 l iou4 rr
2960
+ liur5 l iou5 rr
2961
+ lor1 l o1 rr
2962
+ lor2 l o2 rr
2963
+ lor3 l o3 rr
2964
+ lor4 l o4 rr
2965
+ lor5 l o5 rr
2966
+ longr1 l ong1 rr
2967
+ longr2 l ong2 rr
2968
+ longr3 l ong3 rr
2969
+ longr4 l ong4 rr
2970
+ longr5 l ong5 rr
2971
+ lour1 l ou1 rr
2972
+ lour2 l ou2 rr
2973
+ lour3 l ou3 rr
2974
+ lour4 l ou4 rr
2975
+ lour5 l ou5 rr
2976
+ lur1 l u1 rr
2977
+ lur2 l u2 rr
2978
+ lur3 l u3 rr
2979
+ lur4 l u4 rr
2980
+ lur5 l u5 rr
2981
+ luanr1 l uan1 rr
2982
+ luanr2 l uan2 rr
2983
+ luanr3 l uan3 rr
2984
+ luanr4 l uan4 rr
2985
+ luanr5 l uan5 rr
2986
+ luer1 l ve1 rr
2987
+ luer2 l ve2 rr
2988
+ luer3 l ve3 rr
2989
+ luer4 l ve4 rr
2990
+ luer5 l ve5 rr
2991
+ lver1 l ve1 rr
2992
+ lver2 l ve2 rr
2993
+ lver3 l ve3 rr
2994
+ lver4 l ve4 rr
2995
+ lver5 l ve5 rr
2996
+ lunr1 l uen1 rr
2997
+ lunr2 l uen2 rr
2998
+ lunr3 l uen3 rr
2999
+ lunr4 l uen4 rr
3000
+ lunr5 l uen5 rr
3001
+ luor1 l uo1 rr
3002
+ luor2 l uo2 rr
3003
+ luor3 l uo3 rr
3004
+ luor4 l uo4 rr
3005
+ luor5 l uo5 rr
3006
+ lvr1 l v1 rr
3007
+ lvr2 l v2 rr
3008
+ lvr3 l v3 rr
3009
+ lvr4 l v4 rr
3010
+ lvr5 l v5 rr
3011
+ mar1 m a1 rr
3012
+ mar2 m a2 rr
3013
+ mar3 m a3 rr
3014
+ mar4 m a4 rr
3015
+ mar5 m a5 rr
3016
+ mair1 m ai1 rr
3017
+ mair2 m ai2 rr
3018
+ mair3 m ai3 rr
3019
+ mair4 m ai4 rr
3020
+ mair5 m ai5 rr
3021
+ manr1 m an1 rr
3022
+ manr2 m an2 rr
3023
+ manr3 m an3 rr
3024
+ manr4 m an4 rr
3025
+ manr5 m an5 rr
3026
+ mangr1 m ang1 rr
3027
+ mangr2 m ang2 rr
3028
+ mangr3 m ang3 rr
3029
+ mangr4 m ang4 rr
3030
+ mangr5 m ang5 rr
3031
+ maor1 m ao1 rr
3032
+ maor2 m ao2 rr
3033
+ maor3 m ao3 rr
3034
+ maor4 m ao4 rr
3035
+ maor5 m ao5 rr
3036
+ mer1 m e1 rr
3037
+ mer2 m e2 rr
3038
+ mer3 m e3 rr
3039
+ mer4 m e4 rr
3040
+ mer5 m e5 rr
3041
+ meir1 m ei1 rr
3042
+ meir2 m ei2 rr
3043
+ meir3 m ei3 rr
3044
+ meir4 m ei4 rr
3045
+ meir5 m ei5 rr
3046
+ menr1 m en1 rr
3047
+ menr2 m en2 rr
3048
+ menr3 m en3 rr
3049
+ menr4 m en4 rr
3050
+ menr5 m en5 rr
3051
+ mengr1 m eng1 rr
3052
+ mengr2 m eng2 rr
3053
+ mengr3 m eng3 rr
3054
+ mengr4 m eng4 rr
3055
+ mengr5 m eng5 rr
3056
+ mir1 m i1 rr
3057
+ mir2 m i2 rr
3058
+ mir3 m i3 rr
3059
+ mir4 m i4 rr
3060
+ mir5 m i5 rr
3061
+ mianr1 m ian1 rr
3062
+ mianr2 m ian2 rr
3063
+ mianr3 m ian3 rr
3064
+ mianr4 m ian4 rr
3065
+ mianr5 m ian5 rr
3066
+ miaor1 m iao1 rr
3067
+ miaor2 m iao2 rr
3068
+ miaor3 m iao3 rr
3069
+ miaor4 m iao4 rr
3070
+ miaor5 m iao5 rr
3071
+ mier1 m ie1 rr
3072
+ mier2 m ie2 rr
3073
+ mier3 m ie3 rr
3074
+ mier4 m ie4 rr
3075
+ mier5 m ie5 rr
3076
+ minr1 m in1 rr
3077
+ minr2 m in2 rr
3078
+ minr3 m in3 rr
3079
+ minr4 m in4 rr
3080
+ minr5 m in5 rr
3081
+ mingr1 m ing1 rr
3082
+ mingr2 m ing2 rr
3083
+ mingr3 m ing3 rr
3084
+ mingr4 m ing4 rr
3085
+ mingr5 m ing5 rr
3086
+ miur1 m iou1 rr
3087
+ miur2 m iou2 rr
3088
+ miur3 m iou3 rr
3089
+ miur4 m iou4 rr
3090
+ miur5 m iou5 rr
3091
+ mor1 m o1 rr
3092
+ mor2 m o2 rr
3093
+ mor3 m o3 rr
3094
+ mor4 m o4 rr
3095
+ mor5 m o5 rr
3096
+ mour1 m ou1 rr
3097
+ mour2 m ou2 rr
3098
+ mour3 m ou3 rr
3099
+ mour4 m ou4 rr
3100
+ mour5 m ou5 rr
3101
+ mur1 m u1 rr
3102
+ mur2 m u2 rr
3103
+ mur3 m u3 rr
3104
+ mur4 m u4 rr
3105
+ mur5 m u5 rr
3106
+ nar1 n a1 rr
3107
+ nar2 n a2 rr
3108
+ nar3 n a3 rr
3109
+ nar4 n a4 rr
3110
+ nar5 n a5 rr
3111
+ nair1 n ai1 rr
3112
+ nair2 n ai2 rr
3113
+ nair3 n ai3 rr
3114
+ nair4 n ai4 rr
3115
+ nair5 n ai5 rr
3116
+ nanr1 n an1 rr
3117
+ nanr2 n an2 rr
3118
+ nanr3 n an3 rr
3119
+ nanr4 n an4 rr
3120
+ nanr5 n an5 rr
3121
+ nangr1 n ang1 rr
3122
+ nangr2 n ang2 rr
3123
+ nangr3 n ang3 rr
3124
+ nangr4 n ang4 rr
3125
+ nangr5 n ang5 rr
3126
+ naor1 n ao1 rr
3127
+ naor2 n ao2 rr
3128
+ naor3 n ao3 rr
3129
+ naor4 n ao4 rr
3130
+ naor5 n ao5 rr
3131
+ ner1 n e1 rr
3132
+ ner2 n e2 rr
3133
+ ner3 n e3 rr
3134
+ ner4 n e4 rr
3135
+ ner5 n e5 rr
3136
+ neir1 n ei1 rr
3137
+ neir2 n ei2 rr
3138
+ neir3 n ei3 rr
3139
+ neir4 n ei4 rr
3140
+ neir5 n ei5 rr
3141
+ nenr1 n en1 rr
3142
+ nenr2 n en2 rr
3143
+ nenr3 n en3 rr
3144
+ nenr4 n en4 rr
3145
+ nenr5 n en5 rr
3146
+ nengr1 n eng1 rr
3147
+ nengr2 n eng2 rr
3148
+ nengr3 n eng3 rr
3149
+ nengr4 n eng4 rr
3150
+ nengr5 n eng5 rr
3151
+ nir1 n i1 rr
3152
+ nir2 n i2 rr
3153
+ nir3 n i3 rr
3154
+ nir4 n i4 rr
3155
+ nir5 n i5 rr
3156
+ nianr1 n ian1 rr
3157
+ nianr2 n ian2 rr
3158
+ nianr3 n ian3 rr
3159
+ nianr4 n ian4 rr
3160
+ nianr5 n ian5 rr
3161
+ niangr1 n iang1 rr
3162
+ niangr2 n iang2 rr
3163
+ niangr3 n iang3 rr
3164
+ niangr4 n iang4 rr
3165
+ niangr5 n iang5 rr
3166
+ niaor1 n iao1 rr
3167
+ niaor2 n iao2 rr
3168
+ niaor3 n iao3 rr
3169
+ niaor4 n iao4 rr
3170
+ niaor5 n iao5 rr
3171
+ nier1 n ie1 rr
3172
+ nier2 n ie2 rr
3173
+ nier3 n ie3 rr
3174
+ nier4 n ie4 rr
3175
+ nier5 n ie5 rr
3176
+ ninr1 n in1 rr
3177
+ ninr2 n in2 rr
3178
+ ninr3 n in3 rr
3179
+ ninr4 n in4 rr
3180
+ ninr5 n in5 rr
3181
+ ningr1 n ing1 rr
3182
+ ningr2 n ing2 rr
3183
+ ningr3 n ing3 rr
3184
+ ningr4 n ing4 rr
3185
+ ningr5 n ing5 rr
3186
+ niur1 n iou1 rr
3187
+ niur2 n iou2 rr
3188
+ niur3 n iou3 rr
3189
+ niur4 n iou4 rr
3190
+ niur5 n iou5 rr
3191
+ nongr1 n ong1 rr
3192
+ nongr2 n ong2 rr
3193
+ nongr3 n ong3 rr
3194
+ nongr4 n ong4 rr
3195
+ nongr5 n ong5 rr
3196
+ nour1 n ou1 rr
3197
+ nour2 n ou2 rr
3198
+ nour3 n ou3 rr
3199
+ nour4 n ou4 rr
3200
+ nour5 n ou5 rr
3201
+ nur1 n u1 rr
3202
+ nur2 n u2 rr
3203
+ nur3 n u3 rr
3204
+ nur4 n u4 rr
3205
+ nur5 n u5 rr
3206
+ nuanr1 n uan1 rr
3207
+ nuanr2 n uan2 rr
3208
+ nuanr3 n uan3 rr
3209
+ nuanr4 n uan4 rr
3210
+ nuanr5 n uan5 rr
3211
+ nuer1 n ve1 rr
3212
+ nuer2 n ve2 rr
3213
+ nuer3 n ve3 rr
3214
+ nuer4 n ve4 rr
3215
+ nuer5 n ve5 rr
3216
+ nver1 n ve1 rr
3217
+ nver2 n ve2 rr
3218
+ nver3 n ve3 rr
3219
+ nver4 n ve4 rr
3220
+ nver5 n ve5 rr
3221
+ nuor1 n uo1 rr
3222
+ nuor2 n uo2 rr
3223
+ nuor3 n uo3 rr
3224
+ nuor4 n uo4 rr
3225
+ nuor5 n uo5 rr
3226
+ nvr1 n v1 rr
3227
+ nvr2 n v2 rr
3228
+ nvr3 n v3 rr
3229
+ nvr4 n v4 rr
3230
+ nvr5 n v5 rr
3231
+ or1 o1 rr
3232
+ or2 o2 rr
3233
+ or3 o3 rr
3234
+ or4 o4 rr
3235
+ or5 o5 rr
3236
+ our1 ou1 rr
3237
+ our2 ou2 rr
3238
+ our3 ou3 rr
3239
+ our4 ou4 rr
3240
+ our5 ou5 rr
3241
+ par1 p a1 rr
3242
+ par2 p a2 rr
3243
+ par3 p a3 rr
3244
+ par4 p a4 rr
3245
+ par5 p a5 rr
3246
+ pair1 p ai1 rr
3247
+ pair2 p ai2 rr
3248
+ pair3 p ai3 rr
3249
+ pair4 p ai4 rr
3250
+ pair5 p ai5 rr
3251
+ panr1 p an1 rr
3252
+ panr2 p an2 rr
3253
+ panr3 p an3 rr
3254
+ panr4 p an4 rr
3255
+ panr5 p an5 rr
3256
+ pangr1 p ang1 rr
3257
+ pangr2 p ang2 rr
3258
+ pangr3 p ang3 rr
3259
+ pangr4 p ang4 rr
3260
+ pangr5 p ang5 rr
3261
+ paor1 p ao1 rr
3262
+ paor2 p ao2 rr
3263
+ paor3 p ao3 rr
3264
+ paor4 p ao4 rr
3265
+ paor5 p ao5 rr
3266
+ peir1 p ei1 rr
3267
+ peir2 p ei2 rr
3268
+ peir3 p ei3 rr
3269
+ peir4 p ei4 rr
3270
+ peir5 p ei5 rr
3271
+ penr1 p en1 rr
3272
+ penr2 p en2 rr
3273
+ penr3 p en3 rr
3274
+ penr4 p en4 rr
3275
+ penr5 p en5 rr
3276
+ pengr1 p eng1 rr
3277
+ pengr2 p eng2 rr
3278
+ pengr3 p eng3 rr
3279
+ pengr4 p eng4 rr
3280
+ pengr5 p eng5 rr
3281
+ pir1 p i1 rr
3282
+ pir2 p i2 rr
3283
+ pir3 p i3 rr
3284
+ pir4 p i4 rr
3285
+ pir5 p i5 rr
3286
+ pianr1 p ian1 rr
3287
+ pianr2 p ian2 rr
3288
+ pianr3 p ian3 rr
3289
+ pianr4 p ian4 rr
3290
+ pianr5 p ian5 rr
3291
+ piaor1 p iao1 rr
3292
+ piaor2 p iao2 rr
3293
+ piaor3 p iao3 rr
3294
+ piaor4 p iao4 rr
3295
+ piaor5 p iao5 rr
3296
+ pier1 p ie1 rr
3297
+ pier2 p ie2 rr
3298
+ pier3 p ie3 rr
3299
+ pier4 p ie4 rr
3300
+ pier5 p ie5 rr
3301
+ pinr1 p in1 rr
3302
+ pinr2 p in2 rr
3303
+ pinr3 p in3 rr
3304
+ pinr4 p in4 rr
3305
+ pinr5 p in5 rr
3306
+ pingr1 p ing1 rr
3307
+ pingr2 p ing2 rr
3308
+ pingr3 p ing3 rr
3309
+ pingr4 p ing4 rr
3310
+ pingr5 p ing5 rr
3311
+ por1 p o1 rr
3312
+ por2 p o2 rr
3313
+ por3 p o3 rr
3314
+ por4 p o4 rr
3315
+ por5 p o5 rr
3316
+ pour1 p ou1 rr
3317
+ pour2 p ou2 rr
3318
+ pour3 p ou3 rr
3319
+ pour4 p ou4 rr
3320
+ pour5 p ou5 rr
3321
+ pur1 p u1 rr
3322
+ pur2 p u2 rr
3323
+ pur3 p u3 rr
3324
+ pur4 p u4 rr
3325
+ pur5 p u5 rr
3326
+ qir1 q i1 rr
3327
+ qir2 q i2 rr
3328
+ qir3 q i3 rr
3329
+ qir4 q i4 rr
3330
+ qir5 q i5 rr
3331
+ qiar1 q ia1 rr
3332
+ qiar2 q ia2 rr
3333
+ qiar3 q ia3 rr
3334
+ qiar4 q ia4 rr
3335
+ qiar5 q ia5 rr
3336
+ qianr1 q ian1 rr
3337
+ qianr2 q ian2 rr
3338
+ qianr3 q ian3 rr
3339
+ qianr4 q ian4 rr
3340
+ qianr5 q ian5 rr
3341
+ qiangr1 q iang1 rr
3342
+ qiangr2 q iang2 rr
3343
+ qiangr3 q iang3 rr
3344
+ qiangr4 q iang4 rr
3345
+ qiangr5 q iang5 rr
3346
+ qiaor1 q iao1 rr
3347
+ qiaor2 q iao2 rr
3348
+ qiaor3 q iao3 rr
3349
+ qiaor4 q iao4 rr
3350
+ qiaor5 q iao5 rr
3351
+ qier1 q ie1 rr
3352
+ qier2 q ie2 rr
3353
+ qier3 q ie3 rr
3354
+ qier4 q ie4 rr
3355
+ qier5 q ie5 rr
3356
+ qinr1 q in1 rr
3357
+ qinr2 q in2 rr
3358
+ qinr3 q in3 rr
3359
+ qinr4 q in4 rr
3360
+ qinr5 q in5 rr
3361
+ qingr1 q ing1 rr
3362
+ qingr2 q ing2 rr
3363
+ qingr3 q ing3 rr
3364
+ qingr4 q ing4 rr
3365
+ qingr5 q ing5 rr
3366
+ qiongr1 q iong1 rr
3367
+ qiongr2 q iong2 rr
3368
+ qiongr3 q iong3 rr
3369
+ qiongr4 q iong4 rr
3370
+ qiongr5 q iong5 rr
3371
+ qiur1 q iou1 rr
3372
+ qiur2 q iou2 rr
3373
+ qiur3 q iou3 rr
3374
+ qiur4 q iou4 rr
3375
+ qiur5 q iou5 rr
3376
+ qur1 q v1 rr
3377
+ qur2 q v2 rr
3378
+ qur3 q v3 rr
3379
+ qur4 q v4 rr
3380
+ qur5 q v5 rr
3381
+ quanr1 q van1 rr
3382
+ quanr2 q van2 rr
3383
+ quanr3 q van3 rr
3384
+ quanr4 q van4 rr
3385
+ quanr5 q van5 rr
3386
+ quer1 q ve1 rr
3387
+ quer2 q ve2 rr
3388
+ quer3 q ve3 rr
3389
+ quer4 q ve4 rr
3390
+ quer5 q ve5 rr
3391
+ qunr1 q vn1 rr
3392
+ qunr2 q vn2 rr
3393
+ qunr3 q vn3 rr
3394
+ qunr4 q vn4 rr
3395
+ qunr5 q vn5 rr
3396
+ ranr1 r an1 rr
3397
+ ranr2 r an2 rr
3398
+ ranr3 r an3 rr
3399
+ ranr4 r an4 rr
3400
+ ranr5 r an5 rr
3401
+ rangr1 r ang1 rr
3402
+ rangr2 r ang2 rr
3403
+ rangr3 r ang3 rr
3404
+ rangr4 r ang4 rr
3405
+ rangr5 r ang5 rr
3406
+ raor1 r ao1 rr
3407
+ raor2 r ao2 rr
3408
+ raor3 r ao3 rr
3409
+ raor4 r ao4 rr
3410
+ raor5 r ao5 rr
3411
+ rer1 r e1 rr
3412
+ rer2 r e2 rr
3413
+ rer3 r e3 rr
3414
+ rer4 r e4 rr
3415
+ rer5 r e5 rr
3416
+ renr1 r en1 rr
3417
+ renr2 r en2 rr
3418
+ renr3 r en3 rr
3419
+ renr4 r en4 rr
3420
+ renr5 r en5 rr
3421
+ rengr1 r eng1 rr
3422
+ rengr2 r eng2 rr
3423
+ rengr3 r eng3 rr
3424
+ rengr4 r eng4 rr
3425
+ rengr5 r eng5 rr
3426
+ rir1 r iii1 rr
3427
+ rir2 r iii2 rr
3428
+ rir3 r iii3 rr
3429
+ rir4 r iii4 rr
3430
+ rir5 r iii5 rr
3431
+ rongr1 r ong1 rr
3432
+ rongr2 r ong2 rr
3433
+ rongr3 r ong3 rr
3434
+ rongr4 r ong4 rr
3435
+ rongr5 r ong5 rr
3436
+ rour1 r ou1 rr
3437
+ rour2 r ou2 rr
3438
+ rour3 r ou3 rr
3439
+ rour4 r ou4 rr
3440
+ rour5 r ou5 rr
3441
+ rur1 r u1 rr
3442
+ rur2 r u2 rr
3443
+ rur3 r u3 rr
3444
+ rur4 r u4 rr
3445
+ rur5 r u5 rr
3446
+ ruar1 r ua1 rr
3447
+ ruar2 r ua2 rr
3448
+ ruar3 r ua3 rr
3449
+ ruar4 r ua4 rr
3450
+ ruar5 r ua5 rr
3451
+ ruanr1 r uan1 rr
3452
+ ruanr2 r uan2 rr
3453
+ ruanr3 r uan3 rr
3454
+ ruanr4 r uan4 rr
3455
+ ruanr5 r uan5 rr
3456
+ ruir1 r uei1 rr
3457
+ ruir2 r uei2 rr
3458
+ ruir3 r uei3 rr
3459
+ ruir4 r uei4 rr
3460
+ ruir5 r uei5 rr
3461
+ runr1 r uen1 rr
3462
+ runr2 r uen2 rr
3463
+ runr3 r uen3 rr
3464
+ runr4 r uen4 rr
3465
+ runr5 r uen5 rr
3466
+ ruor1 r uo1 rr
3467
+ ruor2 r uo2 rr
3468
+ ruor3 r uo3 rr
3469
+ ruor4 r uo4 rr
3470
+ ruor5 r uo5 rr
3471
+ sar1 s a1 rr
3472
+ sar2 s a2 rr
3473
+ sar3 s a3 rr
3474
+ sar4 s a4 rr
3475
+ sar5 s a5 rr
3476
+ sair1 s ai1 rr
3477
+ sair2 s ai2 rr
3478
+ sair3 s ai3 rr
3479
+ sair4 s ai4 rr
3480
+ sair5 s ai5 rr
3481
+ sanr1 s an1 rr
3482
+ sanr2 s an2 rr
3483
+ sanr3 s an3 rr
3484
+ sanr4 s an4 rr
3485
+ sanr5 s an5 rr
3486
+ sangr1 s ang1 rr
3487
+ sangr2 s ang2 rr
3488
+ sangr3 s ang3 rr
3489
+ sangr4 s ang4 rr
3490
+ sangr5 s ang5 rr
3491
+ saor1 s ao1 rr
3492
+ saor2 s ao2 rr
3493
+ saor3 s ao3 rr
3494
+ saor4 s ao4 rr
3495
+ saor5 s ao5 rr
3496
+ ser1 s e1 rr
3497
+ ser2 s e2 rr
3498
+ ser3 s e3 rr
3499
+ ser4 s e4 rr
3500
+ ser5 s e5 rr
3501
+ senr1 s en1 rr
3502
+ senr2 s en2 rr
3503
+ senr3 s en3 rr
3504
+ senr4 s en4 rr
3505
+ senr5 s en5 rr
3506
+ sengr1 s eng1 rr
3507
+ sengr2 s eng2 rr
3508
+ sengr3 s eng3 rr
3509
+ sengr4 s eng4 rr
3510
+ sengr5 s eng5 rr
3511
+ shar1 sh a1 rr
3512
+ shar2 sh a2 rr
3513
+ shar3 sh a3 rr
3514
+ shar4 sh a4 rr
3515
+ shar5 sh a5 rr
3516
+ shair1 sh ai1 rr
3517
+ shair2 sh ai2 rr
3518
+ shair3 sh ai3 rr
3519
+ shair4 sh ai4 rr
3520
+ shair5 sh ai5 rr
3521
+ shanr1 sh an1 rr
3522
+ shanr2 sh an2 rr
3523
+ shanr3 sh an3 rr
3524
+ shanr4 sh an4 rr
3525
+ shanr5 sh an5 rr
3526
+ shangr1 sh ang1 rr
3527
+ shangr2 sh ang2 rr
3528
+ shangr3 sh ang3 rr
3529
+ shangr4 sh ang4 rr
3530
+ shangr5 sh ang5 rr
3531
+ shaor1 sh ao1 rr
3532
+ shaor2 sh ao2 rr
3533
+ shaor3 sh ao3 rr
3534
+ shaor4 sh ao4 rr
3535
+ shaor5 sh ao5 rr
3536
+ sher1 sh e1 rr
3537
+ sher2 sh e2 rr
3538
+ sher3 sh e3 rr
3539
+ sher4 sh e4 rr
3540
+ sher5 sh e5 rr
3541
+ sheir1 sh ei1 rr
3542
+ sheir2 sh ei2 rr
3543
+ sheir3 sh ei3 rr
3544
+ sheir4 sh ei4 rr
3545
+ sheir5 sh ei5 rr
3546
+ shenr1 sh en1 rr
3547
+ shenr2 sh en2 rr
3548
+ shenr3 sh en3 rr
3549
+ shenr4 sh en4 rr
3550
+ shenr5 sh en5 rr
3551
+ shengr1 sh eng1 rr
3552
+ shengr2 sh eng2 rr
3553
+ shengr3 sh eng3 rr
3554
+ shengr4 sh eng4 rr
3555
+ shengr5 sh eng5 rr
3556
+ shir1 sh iii1 rr
3557
+ shir2 sh iii2 rr
3558
+ shir3 sh iii3 rr
3559
+ shir4 sh iii4 rr
3560
+ shir5 sh iii5 rr
3561
+ shour1 sh ou1 rr
3562
+ shour2 sh ou2 rr
3563
+ shour3 sh ou3 rr
3564
+ shour4 sh ou4 rr
3565
+ shour5 sh ou5 rr
3566
+ shur1 sh u1 rr
3567
+ shur2 sh u2 rr
3568
+ shur3 sh u3 rr
3569
+ shur4 sh u4 rr
3570
+ shur5 sh u5 rr
3571
+ shuar1 sh ua1 rr
3572
+ shuar2 sh ua2 rr
3573
+ shuar3 sh ua3 rr
3574
+ shuar4 sh ua4 rr
3575
+ shuar5 sh ua5 rr
3576
+ shuair1 sh uai1 rr
3577
+ shuair2 sh uai2 rr
3578
+ shuair3 sh uai3 rr
3579
+ shuair4 sh uai4 rr
3580
+ shuair5 sh uai5 rr
3581
+ shuanr1 sh uan1 rr
3582
+ shuanr2 sh uan2 rr
3583
+ shuanr3 sh uan3 rr
3584
+ shuanr4 sh uan4 rr
3585
+ shuanr5 sh uan5 rr
3586
+ shuangr1 sh uang1 rr
3587
+ shuangr2 sh uang2 rr
3588
+ shuangr3 sh uang3 rr
3589
+ shuangr4 sh uang4 rr
3590
+ shuangr5 sh uang5 rr
3591
+ shuir1 sh uei1 rr
3592
+ shuir2 sh uei2 rr
3593
+ shuir3 sh uei3 rr
3594
+ shuir4 sh uei4 rr
3595
+ shuir5 sh uei5 rr
3596
+ shunr1 sh uen1 rr
3597
+ shunr2 sh uen2 rr
3598
+ shunr3 sh uen3 rr
3599
+ shunr4 sh uen4 rr
3600
+ shunr5 sh uen5 rr
3601
+ shuor1 sh uo1 rr
3602
+ shuor2 sh uo2 rr
3603
+ shuor3 sh uo3 rr
3604
+ shuor4 sh uo4 rr
3605
+ shuor5 sh uo5 rr
3606
+ sir1 s ii1 rr
3607
+ sir2 s ii2 rr
3608
+ sir3 s ii3 rr
3609
+ sir4 s ii4 rr
3610
+ sir5 s ii5 rr
3611
+ songr1 s ong1 rr
3612
+ songr2 s ong2 rr
3613
+ songr3 s ong3 rr
3614
+ songr4 s ong4 rr
3615
+ songr5 s ong5 rr
3616
+ sour1 s ou1 rr
3617
+ sour2 s ou2 rr
3618
+ sour3 s ou3 rr
3619
+ sour4 s ou4 rr
3620
+ sour5 s ou5 rr
3621
+ sur1 s u1 rr
3622
+ sur2 s u2 rr
3623
+ sur3 s u3 rr
3624
+ sur4 s u4 rr
3625
+ sur5 s u5 rr
3626
+ suanr1 s uan1 rr
3627
+ suanr2 s uan2 rr
3628
+ suanr3 s uan3 rr
3629
+ suanr4 s uan4 rr
3630
+ suanr5 s uan5 rr
3631
+ suir1 s uei1 rr
3632
+ suir2 s uei2 rr
3633
+ suir3 s uei3 rr
3634
+ suir4 s uei4 rr
3635
+ suir5 s uei5 rr
3636
+ sunr1 s uen1 rr
3637
+ sunr2 s uen2 rr
3638
+ sunr3 s uen3 rr
3639
+ sunr4 s uen4 rr
3640
+ sunr5 s uen5 rr
3641
+ suor1 s uo1 rr
3642
+ suor2 s uo2 rr
3643
+ suor3 s uo3 rr
3644
+ suor4 s uo4 rr
3645
+ suor5 s uo5 rr
3646
+ tar1 t a1 rr
3647
+ tar2 t a2 rr
3648
+ tar3 t a3 rr
3649
+ tar4 t a4 rr
3650
+ tar5 t a5 rr
3651
+ tair1 t ai1 rr
3652
+ tair2 t ai2 rr
3653
+ tair3 t ai3 rr
3654
+ tair4 t ai4 rr
3655
+ tair5 t ai5 rr
3656
+ tanr1 t an1 rr
3657
+ tanr2 t an2 rr
3658
+ tanr3 t an3 rr
3659
+ tanr4 t an4 rr
3660
+ tanr5 t an5 rr
3661
+ tangr1 t ang1 rr
3662
+ tangr2 t ang2 rr
3663
+ tangr3 t ang3 rr
3664
+ tangr4 t ang4 rr
3665
+ tangr5 t ang5 rr
3666
+ taor1 t ao1 rr
3667
+ taor2 t ao2 rr
3668
+ taor3 t ao3 rr
3669
+ taor4 t ao4 rr
3670
+ taor5 t ao5 rr
3671
+ ter1 t e1 rr
3672
+ ter2 t e2 rr
3673
+ ter3 t e3 rr
3674
+ ter4 t e4 rr
3675
+ ter5 t e5 rr
3676
+ teir1 t ei1 rr
3677
+ teir2 t ei2 rr
3678
+ teir3 t ei3 rr
3679
+ teir4 t ei4 rr
3680
+ teir5 t ei5 rr
3681
+ tengr1 t eng1 rr
3682
+ tengr2 t eng2 rr
3683
+ tengr3 t eng3 rr
3684
+ tengr4 t eng4 rr
3685
+ tengr5 t eng5 rr
3686
+ tir1 t i1 rr
3687
+ tir2 t i2 rr
3688
+ tir3 t i3 rr
3689
+ tir4 t i4 rr
3690
+ tir5 t i5 rr
3691
+ tianr1 t ian1 rr
3692
+ tianr2 t ian2 rr
3693
+ tianr3 t ian3 rr
3694
+ tianr4 t ian4 rr
3695
+ tianr5 t ian5 rr
3696
+ tiaor1 t iao1 rr
3697
+ tiaor2 t iao2 rr
3698
+ tiaor3 t iao3 rr
3699
+ tiaor4 t iao4 rr
3700
+ tiaor5 t iao5 rr
3701
+ tier1 t ie1 rr
3702
+ tier2 t ie2 rr
3703
+ tier3 t ie3 rr
3704
+ tier4 t ie4 rr
3705
+ tier5 t ie5 rr
3706
+ tingr1 t ing1 rr
3707
+ tingr2 t ing2 rr
3708
+ tingr3 t ing3 rr
3709
+ tingr4 t ing4 rr
3710
+ tingr5 t ing5 rr
3711
+ tongr1 t ong1 rr
3712
+ tongr2 t ong2 rr
3713
+ tongr3 t ong3 rr
3714
+ tongr4 t ong4 rr
3715
+ tongr5 t ong5 rr
3716
+ tour1 t ou1 rr
3717
+ tour2 t ou2 rr
3718
+ tour3 t ou3 rr
3719
+ tour4 t ou4 rr
3720
+ tour5 t ou5 rr
3721
+ tur1 t u1 rr
3722
+ tur2 t u2 rr
3723
+ tur3 t u3 rr
3724
+ tur4 t u4 rr
3725
+ tur5 t u5 rr
3726
+ tuanr1 t uan1 rr
3727
+ tuanr2 t uan2 rr
3728
+ tuanr3 t uan3 rr
3729
+ tuanr4 t uan4 rr
3730
+ tuanr5 t uan5 rr
3731
+ tuir1 t uei1 rr
3732
+ tuir2 t uei2 rr
3733
+ tuir3 t uei3 rr
3734
+ tuir4 t uei4 rr
3735
+ tuir5 t uei5 rr
3736
+ tunr1 t uen1 rr
3737
+ tunr2 t uen2 rr
3738
+ tunr3 t uen3 rr
3739
+ tunr4 t uen4 rr
3740
+ tunr5 t uen5 rr
3741
+ tuor1 t uo1 rr
3742
+ tuor2 t uo2 rr
3743
+ tuor3 t uo3 rr
3744
+ tuor4 t uo4 rr
3745
+ tuor5 t uo5 rr
3746
+ war1 w ua1 rr
3747
+ war2 w ua2 rr
3748
+ war3 w ua3 rr
3749
+ war4 w ua4 rr
3750
+ war5 w ua5 rr
3751
+ wair1 w uai1 rr
3752
+ wair2 w uai2 rr
3753
+ wair3 w uai3 rr
3754
+ wair4 w uai4 rr
3755
+ wair5 w uai5 rr
3756
+ wanr1 w uan1 rr
3757
+ wanr2 w uan2 rr
3758
+ wanr3 w uan3 rr
3759
+ wanr4 w uan4 rr
3760
+ wanr5 w uan5 rr
3761
+ wangr1 w uang1 rr
3762
+ wangr2 w uang2 rr
3763
+ wangr3 w uang3 rr
3764
+ wangr4 w uang4 rr
3765
+ wangr5 w uang5 rr
3766
+ weir1 w uei1 rr
3767
+ weir2 w uei2 rr
3768
+ weir3 w uei3 rr
3769
+ weir4 w uei4 rr
3770
+ weir5 w uei5 rr
3771
+ wenr1 w uen1 rr
3772
+ wenr2 w uen2 rr
3773
+ wenr3 w uen3 rr
3774
+ wenr4 w uen4 rr
3775
+ wenr5 w uen5 rr
3776
+ wengr1 w uen1 rr
3777
+ wengr2 w uen2 rr
3778
+ wengr3 w uen3 rr
3779
+ wengr4 w uen4 rr
3780
+ wengr5 w uen5 rr
3781
+ wor1 w uo1 rr
3782
+ wor2 w uo2 rr
3783
+ wor3 w uo3 rr
3784
+ wor4 w uo4 rr
3785
+ wor5 w uo5 rr
3786
+ wur1 w u1 rr
3787
+ wur2 w u2 rr
3788
+ wur3 w u3 rr
3789
+ wur4 w u4 rr
3790
+ wur5 w u5 rr
3791
+ xir1 x i1 rr
3792
+ xir2 x i2 rr
3793
+ xir3 x i3 rr
3794
+ xir4 x i4 rr
3795
+ xir5 x i5 rr
3796
+ xiar1 x ia1 rr
3797
+ xiar2 x ia2 rr
3798
+ xiar3 x ia3 rr
3799
+ xiar4 x ia4 rr
3800
+ xiar5 x ia5 rr
3801
+ xianr1 x ian1 rr
3802
+ xianr2 x ian2 rr
3803
+ xianr3 x ian3 rr
3804
+ xianr4 x ian4 rr
3805
+ xianr5 x ian5 rr
3806
+ xiangr1 x iang1 rr
3807
+ xiangr2 x iang2 rr
3808
+ xiangr3 x iang3 rr
3809
+ xiangr4 x iang4 rr
3810
+ xiangr5 x iang5 rr
3811
+ xiaor1 x iao1 rr
3812
+ xiaor2 x iao2 rr
3813
+ xiaor3 x iao3 rr
3814
+ xiaor4 x iao4 rr
3815
+ xiaor5 x iao5 rr
3816
+ xier1 x ie1 rr
3817
+ xier2 x ie2 rr
3818
+ xier3 x ie3 rr
3819
+ xier4 x ie4 rr
3820
+ xier5 x ie5 rr
3821
+ xinr1 x in1 rr
3822
+ xinr2 x in2 rr
3823
+ xinr3 x in3 rr
3824
+ xinr4 x in4 rr
3825
+ xinr5 x in5 rr
3826
+ xingr1 x ing1 rr
3827
+ xingr2 x ing2 rr
3828
+ xingr3 x ing3 rr
3829
+ xingr4 x ing4 rr
3830
+ xingr5 x ing5 rr
3831
+ xiongr1 x iong1 rr
3832
+ xiongr2 x iong2 rr
3833
+ xiongr3 x iong3 rr
3834
+ xiongr4 x iong4 rr
3835
+ xiongr5 x iong5 rr
3836
+ xiur1 x iou1 rr
3837
+ xiur2 x iou2 rr
3838
+ xiur3 x iou3 rr
3839
+ xiur4 x iou4 rr
3840
+ xiur5 x iou5 rr
3841
+ xur1 x v1 rr
3842
+ xur2 x v2 rr
3843
+ xur3 x v3 rr
3844
+ xur4 x v4 rr
3845
+ xur5 x v5 rr
3846
+ xuanr1 x van1 rr
3847
+ xuanr2 x van2 rr
3848
+ xuanr3 x van3 rr
3849
+ xuanr4 x van4 rr
3850
+ xuanr5 x van5 rr
3851
+ xuer1 x ve1 rr
3852
+ xuer2 x ve2 rr
3853
+ xuer3 x ve3 rr
3854
+ xuer4 x ve4 rr
3855
+ xuer5 x ve5 rr
3856
+ xunr1 x vn1 rr
3857
+ xunr2 x vn2 rr
3858
+ xunr3 x vn3 rr
3859
+ xunr4 x vn4 rr
3860
+ xunr5 x vn5 rr
3861
+ yar1 y ia1 rr
3862
+ yar2 y ia2 rr
3863
+ yar3 y ia3 rr
3864
+ yar4 y ia4 rr
3865
+ yar5 y ia5 rr
3866
+ yanr1 y ian1 rr
3867
+ yanr2 y ian2 rr
3868
+ yanr3 y ian3 rr
3869
+ yanr4 y ian4 rr
3870
+ yanr5 y ian5 rr
3871
+ yangr1 y iang1 rr
3872
+ yangr2 y iang2 rr
3873
+ yangr3 y iang3 rr
3874
+ yangr4 y iang4 rr
3875
+ yangr5 y iang5 rr
3876
+ yaor1 y iao1 rr
3877
+ yaor2 y iao2 rr
3878
+ yaor3 y iao3 rr
3879
+ yaor4 y iao4 rr
3880
+ yaor5 y iao5 rr
3881
+ yer1 y ie1 rr
3882
+ yer2 y ie2 rr
3883
+ yer3 y ie3 rr
3884
+ yer4 y ie4 rr
3885
+ yer5 y ie5 rr
3886
+ yir1 y i1 rr
3887
+ yir2 y i2 rr
3888
+ yir3 y i3 rr
3889
+ yir4 y i4 rr
3890
+ yir5 y i5 rr
3891
+ yinr1 y in1 rr
3892
+ yinr2 y in2 rr
3893
+ yinr3 y in3 rr
3894
+ yinr4 y in4 rr
3895
+ yinr5 y in5 rr
3896
+ yingr1 y ing1 rr
3897
+ yingr2 y ing2 rr
3898
+ yingr3 y ing3 rr
3899
+ yingr4 y ing4 rr
3900
+ yingr5 y ing5 rr
3901
+ yor1 y iou1 rr
3902
+ yor2 y iou2 rr
3903
+ yor3 y iou3 rr
3904
+ yor4 y iou4 rr
3905
+ yor5 y iou5 rr
3906
+ yongr1 y iong1 rr
3907
+ yongr2 y iong2 rr
3908
+ yongr3 y iong3 rr
3909
+ yongr4 y iong4 rr
3910
+ yongr5 y iong5 rr
3911
+ your1 y iou1 rr
3912
+ your2 y iou2 rr
3913
+ your3 y iou3 rr
3914
+ your4 y iou4 rr
3915
+ your5 y iou5 rr
3916
+ yur1 y v1 rr
3917
+ yur2 y v2 rr
3918
+ yur3 y v3 rr
3919
+ yur4 y v4 rr
3920
+ yur5 y v5 rr
3921
+ yuanr1 y van1 rr
3922
+ yuanr2 y van2 rr
3923
+ yuanr3 y van3 rr
3924
+ yuanr4 y van4 rr
3925
+ yuanr5 y van5 rr
3926
+ yuer1 y ve1 rr
3927
+ yuer2 y ve2 rr
3928
+ yuer3 y ve3 rr
3929
+ yuer4 y ve4 rr
3930
+ yuer5 y ve5 rr
3931
+ yunr1 y vn1 rr
3932
+ yunr2 y vn2 rr
3933
+ yunr3 y vn3 rr
3934
+ yunr4 y vn4 rr
3935
+ yunr5 y vn5 rr
3936
+ zar1 z a1 rr
3937
+ zar2 z a2 rr
3938
+ zar3 z a3 rr
3939
+ zar4 z a4 rr
3940
+ zar5 z a5 rr
3941
+ zair1 z ai1 rr
3942
+ zair2 z ai2 rr
3943
+ zair3 z ai3 rr
3944
+ zair4 z ai4 rr
3945
+ zair5 z ai5 rr
3946
+ zanr1 z an1 rr
3947
+ zanr2 z an2 rr
3948
+ zanr3 z an3 rr
3949
+ zanr4 z an4 rr
3950
+ zanr5 z an5 rr
3951
+ zangr1 z ang1 rr
3952
+ zangr2 z ang2 rr
3953
+ zangr3 z ang3 rr
3954
+ zangr4 z ang4 rr
3955
+ zangr5 z ang5 rr
3956
+ zaor1 z ao1 rr
3957
+ zaor2 z ao2 rr
3958
+ zaor3 z ao3 rr
3959
+ zaor4 z ao4 rr
3960
+ zaor5 z ao5 rr
3961
+ zer1 z e1 rr
3962
+ zer2 z e2 rr
3963
+ zer3 z e3 rr
3964
+ zer4 z e4 rr
3965
+ zer5 z e5 rr
3966
+ zeir1 z ei1 rr
3967
+ zeir2 z ei2 rr
3968
+ zeir3 z ei3 rr
3969
+ zeir4 z ei4 rr
3970
+ zeir5 z ei5 rr
3971
+ zenr1 z en1 rr
3972
+ zenr2 z en2 rr
3973
+ zenr3 z en3 rr
3974
+ zenr4 z en4 rr
3975
+ zenr5 z en5 rr
3976
+ zengr1 z eng1 rr
3977
+ zengr2 z eng2 rr
3978
+ zengr3 z eng3 rr
3979
+ zengr4 z eng4 rr
3980
+ zengr5 z eng5 rr
3981
+ zhar1 zh a1 rr
3982
+ zhar2 zh a2 rr
3983
+ zhar3 zh a3 rr
3984
+ zhar4 zh a4 rr
3985
+ zhar5 zh a5 rr
3986
+ zhair1 zh ai1 rr
3987
+ zhair2 zh ai2 rr
3988
+ zhair3 zh ai3 rr
3989
+ zhair4 zh ai4 rr
3990
+ zhair5 zh ai5 rr
3991
+ zhanr1 zh an1 rr
3992
+ zhanr2 zh an2 rr
3993
+ zhanr3 zh an3 rr
3994
+ zhanr4 zh an4 rr
3995
+ zhanr5 zh an5 rr
3996
+ zhangr1 zh ang1 rr
3997
+ zhangr2 zh ang2 rr
3998
+ zhangr3 zh ang3 rr
3999
+ zhangr4 zh ang4 rr
4000
+ zhangr5 zh ang5 rr
4001
+ zhaor1 zh ao1 rr
4002
+ zhaor2 zh ao2 rr
4003
+ zhaor3 zh ao3 rr
4004
+ zhaor4 zh ao4 rr
4005
+ zhaor5 zh ao5 rr
4006
+ zher1 zh e1 rr
4007
+ zher2 zh e2 rr
4008
+ zher3 zh e3 rr
4009
+ zher4 zh e4 rr
4010
+ zher5 zh e5 rr
4011
+ zheir1 zh ei1 rr
4012
+ zheir2 zh ei2 rr
4013
+ zheir3 zh ei3 rr
4014
+ zheir4 zh ei4 rr
4015
+ zheir5 zh ei5 rr
4016
+ zhenr1 zh en1 rr
4017
+ zhenr2 zh en2 rr
4018
+ zhenr3 zh en3 rr
4019
+ zhenr4 zh en4 rr
4020
+ zhenr5 zh en5 rr
4021
+ zhengr1 zh eng1 rr
4022
+ zhengr2 zh eng2 rr
4023
+ zhengr3 zh eng3 rr
4024
+ zhengr4 zh eng4 rr
4025
+ zhengr5 zh eng5 rr
4026
+ zhir1 zh iii1 rr
4027
+ zhir2 zh iii2 rr
4028
+ zhir3 zh iii3 rr
4029
+ zhir4 zh iii4 rr
4030
+ zhir5 zh iii5 rr
4031
+ zhongr1 zh ong1 rr
4032
+ zhongr2 zh ong2 rr
4033
+ zhongr3 zh ong3 rr
4034
+ zhongr4 zh ong4 rr
4035
+ zhongr5 zh ong5 rr
4036
+ zhour1 zh ou1 rr
4037
+ zhour2 zh ou2 rr
4038
+ zhour3 zh ou3 rr
4039
+ zhour4 zh ou4 rr
4040
+ zhour5 zh ou5 rr
4041
+ zhur1 zh u1 rr
4042
+ zhur2 zh u2 rr
4043
+ zhur3 zh u3 rr
4044
+ zhur4 zh u4 rr
4045
+ zhur5 zh u5 rr
4046
+ zhuar1 zh ua1 rr
4047
+ zhuar2 zh ua2 rr
4048
+ zhuar3 zh ua3 rr
4049
+ zhuar4 zh ua4 rr
4050
+ zhuar5 zh ua5 rr
4051
+ zhuair1 zh uai1 rr
4052
+ zhuair2 zh uai2 rr
4053
+ zhuair3 zh uai3 rr
4054
+ zhuair4 zh uai4 rr
4055
+ zhuair5 zh uai5 rr
4056
+ zhuanr1 zh uan1 rr
4057
+ zhuanr2 zh uan2 rr
4058
+ zhuanr3 zh uan3 rr
4059
+ zhuanr4 zh uan4 rr
4060
+ zhuanr5 zh uan5 rr
4061
+ zhuangr1 zh uang1 rr
4062
+ zhuangr2 zh uang2 rr
4063
+ zhuangr3 zh uang3 rr
4064
+ zhuangr4 zh uang4 rr
4065
+ zhuangr5 zh uang5 rr
4066
+ zhuir1 zh uei1 rr
4067
+ zhuir2 zh uei2 rr
4068
+ zhuir3 zh uei3 rr
4069
+ zhuir4 zh uei4 rr
4070
+ zhuir5 zh uei5 rr
4071
+ zhunr1 zh uen1 rr
4072
+ zhunr2 zh uen2 rr
4073
+ zhunr3 zh uen3 rr
4074
+ zhunr4 zh uen4 rr
4075
+ zhunr5 zh uen5 rr
4076
+ zhuor1 zh uo1 rr
4077
+ zhuor2 zh uo2 rr
4078
+ zhuor3 zh uo3 rr
4079
+ zhuor4 zh uo4 rr
4080
+ zhuor5 zh uo5 rr
4081
+ zir1 z ii1 rr
4082
+ zir2 z ii2 rr
4083
+ zir3 z ii3 rr
4084
+ zir4 z ii4 rr
4085
+ zir5 z ii5 rr
4086
+ zongr1 z ong1 rr
4087
+ zongr2 z ong2 rr
4088
+ zongr3 z ong3 rr
4089
+ zongr4 z ong4 rr
4090
+ zongr5 z ong5 rr
4091
+ zour1 z ou1 rr
4092
+ zour2 z ou2 rr
4093
+ zour3 z ou3 rr
4094
+ zour4 z ou4 rr
4095
+ zour5 z ou5 rr
4096
+ zur1 z u1 rr
4097
+ zur2 z u2 rr
4098
+ zur3 z u3 rr
4099
+ zur4 z u4 rr
4100
+ zur5 z u5 rr
4101
+ zuanr1 z uan1 rr
4102
+ zuanr2 z uan2 rr
4103
+ zuanr3 z uan3 rr
4104
+ zuanr4 z uan4 rr
4105
+ zuanr5 z uan5 rr
4106
+ zuir1 z uei1 rr
4107
+ zuir2 z uei2 rr
4108
+ zuir3 z uei3 rr
4109
+ zuir4 z uei4 rr
4110
+ zuir5 z uei5 rr
4111
+ zunr1 z uen1 rr
4112
+ zunr2 z uen2 rr
4113
+ zunr3 z uen3 rr
4114
+ zunr4 z uen4 rr
4115
+ zunr5 z uen5 rr
4116
+ zuor1 z uo1 rr
4117
+ zuor2 z uo2 rr
4118
+ zuor3 z uo3 rr
4119
+ zuor4 z uo4 rr
4120
+ zuor5 z uo5 rr
lemas_tts/infer/text_norm/symbols.py ADDED
@@ -0,0 +1,419 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ pinyin_dict = {
2
+ "a": ("^", "a"),
3
+ "ai": ("^", "ai"),
4
+ "an": ("^", "an"),
5
+ "ang": ("^", "ang"),
6
+ "ao": ("^", "ao"),
7
+ "ba": ("b", "a"),
8
+ "bai": ("b", "ai"),
9
+ "ban": ("b", "an"),
10
+ "bang": ("b", "ang"),
11
+ "bao": ("b", "ao"),
12
+ "be": ("b", "e"),
13
+ "bei": ("b", "ei"),
14
+ "ben": ("b", "en"),
15
+ "beng": ("b", "eng"),
16
+ "bi": ("b", "i"),
17
+ "bian": ("b", "ian"),
18
+ "biao": ("b", "iao"),
19
+ "bie": ("b", "ie"),
20
+ "bin": ("b", "in"),
21
+ "bing": ("b", "ing"),
22
+ "bo": ("b", "o"),
23
+ "bu": ("b", "u"),
24
+ "ca": ("c", "a"),
25
+ "cai": ("c", "ai"),
26
+ "can": ("c", "an"),
27
+ "cang": ("c", "ang"),
28
+ "cao": ("c", "ao"),
29
+ "ce": ("c", "e"),
30
+ "cen": ("c", "en"),
31
+ "ceng": ("c", "eng"),
32
+ "cha": ("ch", "a"),
33
+ "chai": ("ch", "ai"),
34
+ "chan": ("ch", "an"),
35
+ "chang": ("ch", "ang"),
36
+ "chao": ("ch", "ao"),
37
+ "che": ("ch", "e"),
38
+ "chen": ("ch", "en"),
39
+ "cheng": ("ch", "eng"),
40
+ "chi": ("ch", "iii"),
41
+ "chong": ("ch", "ong"),
42
+ "chou": ("ch", "ou"),
43
+ "chu": ("ch", "u"),
44
+ "chua": ("ch", "ua"),
45
+ "chuai": ("ch", "uai"),
46
+ "chuan": ("ch", "uan"),
47
+ "chuang": ("ch", "uang"),
48
+ "chui": ("ch", "uei"),
49
+ "chun": ("ch", "uen"),
50
+ "chuo": ("ch", "uo"),
51
+ "ci": ("c", "ii"),
52
+ "cong": ("c", "ong"),
53
+ "cou": ("c", "ou"),
54
+ "cu": ("c", "u"),
55
+ "cuan": ("c", "uan"),
56
+ "cui": ("c", "uei"),
57
+ "cun": ("c", "uen"),
58
+ "cuo": ("c", "uo"),
59
+ "da": ("d", "a"),
60
+ "dai": ("d", "ai"),
61
+ "dan": ("d", "an"),
62
+ "dang": ("d", "ang"),
63
+ "dao": ("d", "ao"),
64
+ "de": ("d", "e"),
65
+ "dei": ("d", "ei"),
66
+ "den": ("d", "en"),
67
+ "deng": ("d", "eng"),
68
+ "di": ("d", "i"),
69
+ "dia": ("d", "ia"),
70
+ "dian": ("d", "ian"),
71
+ "diao": ("d", "iao"),
72
+ "die": ("d", "ie"),
73
+ "ding": ("d", "ing"),
74
+ "diu": ("d", "iou"),
75
+ "dong": ("d", "ong"),
76
+ "dou": ("d", "ou"),
77
+ "du": ("d", "u"),
78
+ "duan": ("d", "uan"),
79
+ "dui": ("d", "uei"),
80
+ "dun": ("d", "uen"),
81
+ "duo": ("d", "uo"),
82
+ "e": ("^", "e"),
83
+ "ei": ("^", "ei"),
84
+ "en": ("^", "en"),
85
+ "ng": ("^", "en"),
86
+ "eng": ("^", "eng"),
87
+ "er": ("^", "er"),
88
+ "fa": ("f", "a"),
89
+ "fan": ("f", "an"),
90
+ "fang": ("f", "ang"),
91
+ "fei": ("f", "ei"),
92
+ "fen": ("f", "en"),
93
+ "feng": ("f", "eng"),
94
+ "fo": ("f", "o"),
95
+ "fou": ("f", "ou"),
96
+ "fu": ("f", "u"),
97
+ "ga": ("g", "a"),
98
+ "gai": ("g", "ai"),
99
+ "gan": ("g", "an"),
100
+ "gang": ("g", "ang"),
101
+ "gao": ("g", "ao"),
102
+ "ge": ("g", "e"),
103
+ "gei": ("g", "ei"),
104
+ "gen": ("g", "en"),
105
+ "geng": ("g", "eng"),
106
+ "gong": ("g", "ong"),
107
+ "gou": ("g", "ou"),
108
+ "gu": ("g", "u"),
109
+ "gua": ("g", "ua"),
110
+ "guai": ("g", "uai"),
111
+ "guan": ("g", "uan"),
112
+ "guang": ("g", "uang"),
113
+ "gui": ("g", "uei"),
114
+ "gun": ("g", "uen"),
115
+ "guo": ("g", "uo"),
116
+ "ha": ("h", "a"),
117
+ "hai": ("h", "ai"),
118
+ "han": ("h", "an"),
119
+ "hang": ("h", "ang"),
120
+ "hao": ("h", "ao"),
121
+ "he": ("h", "e"),
122
+ "hei": ("h", "ei"),
123
+ "hen": ("h", "en"),
124
+ "heng": ("h", "eng"),
125
+ "hong": ("h", "ong"),
126
+ "hou": ("h", "ou"),
127
+ "hu": ("h", "u"),
128
+ "hua": ("h", "ua"),
129
+ "huai": ("h", "uai"),
130
+ "huan": ("h", "uan"),
131
+ "huang": ("h", "uang"),
132
+ "hui": ("h", "uei"),
133
+ "hun": ("h", "uen"),
134
+ "huo": ("h", "uo"),
135
+ "ji": ("j", "i"),
136
+ "jia": ("j", "ia"),
137
+ "jian": ("j", "ian"),
138
+ "jiang": ("j", "iang"),
139
+ "jiao": ("j", "iao"),
140
+ "jie": ("j", "ie"),
141
+ "jin": ("j", "in"),
142
+ "jing": ("j", "ing"),
143
+ "jiong": ("j", "iong"),
144
+ "jiu": ("j", "iou"),
145
+ "ju": ("j", "v"),
146
+ "juan": ("j", "van"),
147
+ "jue": ("j", "ve"),
148
+ "jun": ("j", "vn"),
149
+ "ka": ("k", "a"),
150
+ "kai": ("k", "ai"),
151
+ "kan": ("k", "an"),
152
+ "kang": ("k", "ang"),
153
+ "kao": ("k", "ao"),
154
+ "ke": ("k", "e"),
155
+ "kei": ("k", "ei"),
156
+ "ken": ("k", "en"),
157
+ "keng": ("k", "eng"),
158
+ "kong": ("k", "ong"),
159
+ "kou": ("k", "ou"),
160
+ "ku": ("k", "u"),
161
+ "kua": ("k", "ua"),
162
+ "kuai": ("k", "uai"),
163
+ "kuan": ("k", "uan"),
164
+ "kuang": ("k", "uang"),
165
+ "kui": ("k", "uei"),
166
+ "kun": ("k", "uen"),
167
+ "kuo": ("k", "uo"),
168
+ "la": ("l", "a"),
169
+ "lai": ("l", "ai"),
170
+ "lan": ("l", "an"),
171
+ "lang": ("l", "ang"),
172
+ "lao": ("l", "ao"),
173
+ "le": ("l", "e"),
174
+ "lei": ("l", "ei"),
175
+ "leng": ("l", "eng"),
176
+ "li": ("l", "i"),
177
+ "lia": ("l", "ia"),
178
+ "lian": ("l", "ian"),
179
+ "liang": ("l", "iang"),
180
+ "liao": ("l", "iao"),
181
+ "lie": ("l", "ie"),
182
+ "lin": ("l", "in"),
183
+ "ling": ("l", "ing"),
184
+ "liu": ("l", "iou"),
185
+ "lo": ("l", "o"),
186
+ "long": ("l", "ong"),
187
+ "lou": ("l", "ou"),
188
+ "lu": ("l", "u"),
189
+ "lv": ("l", "v"),
190
+ "luan": ("l", "uan"),
191
+ "lve": ("l", "ve"),
192
+ "lue": ("l", "ve"),
193
+ "lun": ("l", "uen"),
194
+ "luo": ("l", "uo"),
195
+ "ma": ("m", "a"),
196
+ "mai": ("m", "ai"),
197
+ "man": ("m", "an"),
198
+ "mang": ("m", "ang"),
199
+ "mao": ("m", "ao"),
200
+ "me": ("m", "e"),
201
+ "mei": ("m", "ei"),
202
+ "men": ("m", "en"),
203
+ "meng": ("m", "eng"),
204
+ "mi": ("m", "i"),
205
+ "mian": ("m", "ian"),
206
+ "miao": ("m", "iao"),
207
+ "mie": ("m", "ie"),
208
+ "min": ("m", "in"),
209
+ "ming": ("m", "ing"),
210
+ "miu": ("m", "iou"),
211
+ "mo": ("m", "o"),
212
+ "mou": ("m", "ou"),
213
+ "mu": ("m", "u"),
214
+ "na": ("n", "a"),
215
+ "nai": ("n", "ai"),
216
+ "nan": ("n", "an"),
217
+ "nang": ("n", "ang"),
218
+ "nao": ("n", "ao"),
219
+ "ne": ("n", "e"),
220
+ "nei": ("n", "ei"),
221
+ "nen": ("n", "en"),
222
+ "neng": ("n", "eng"),
223
+ "ni": ("n", "i"),
224
+ "nia": ("n", "ia"),
225
+ "nian": ("n", "ian"),
226
+ "niang": ("n", "iang"),
227
+ "niao": ("n", "iao"),
228
+ "nie": ("n", "ie"),
229
+ "nin": ("n", "in"),
230
+ "ning": ("n", "ing"),
231
+ "niu": ("n", "iou"),
232
+ "nong": ("n", "ong"),
233
+ "nou": ("n", "ou"),
234
+ "nu": ("n", "u"),
235
+ "nv": ("n", "v"),
236
+ "nuan": ("n", "uan"),
237
+ "nve": ("n", "ve"),
238
+ "nue": ("n", "ve"),
239
+ "nuo": ("n", "uo"),
240
+ "o": ("^", "o"),
241
+ "ou": ("^", "ou"),
242
+ "pa": ("p", "a"),
243
+ "pai": ("p", "ai"),
244
+ "pan": ("p", "an"),
245
+ "pang": ("p", "ang"),
246
+ "pao": ("p", "ao"),
247
+ "pe": ("p", "e"),
248
+ "pei": ("p", "ei"),
249
+ "pen": ("p", "en"),
250
+ "peng": ("p", "eng"),
251
+ "pi": ("p", "i"),
252
+ "pian": ("p", "ian"),
253
+ "piao": ("p", "iao"),
254
+ "pie": ("p", "ie"),
255
+ "pin": ("p", "in"),
256
+ "ping": ("p", "ing"),
257
+ "po": ("p", "o"),
258
+ "pou": ("p", "ou"),
259
+ "pu": ("p", "u"),
260
+ "qi": ("q", "i"),
261
+ "qia": ("q", "ia"),
262
+ "qian": ("q", "ian"),
263
+ "qiang": ("q", "iang"),
264
+ "qiao": ("q", "iao"),
265
+ "qie": ("q", "ie"),
266
+ "qin": ("q", "in"),
267
+ "qing": ("q", "ing"),
268
+ "qiong": ("q", "iong"),
269
+ "qiu": ("q", "iou"),
270
+ "qu": ("q", "v"),
271
+ "quan": ("q", "van"),
272
+ "que": ("q", "ve"),
273
+ "qun": ("q", "vn"),
274
+ "ran": ("r", "an"),
275
+ "rang": ("r", "ang"),
276
+ "rao": ("r", "ao"),
277
+ "re": ("r", "e"),
278
+ "ren": ("r", "en"),
279
+ "reng": ("r", "eng"),
280
+ "ri": ("r", "iii"),
281
+ "rong": ("r", "ong"),
282
+ "rou": ("r", "ou"),
283
+ "ru": ("r", "u"),
284
+ "rua": ("r", "ua"),
285
+ "ruan": ("r", "uan"),
286
+ "rui": ("r", "uei"),
287
+ "run": ("r", "uen"),
288
+ "ruo": ("r", "uo"),
289
+ "sa": ("s", "a"),
290
+ "sai": ("s", "ai"),
291
+ "san": ("s", "an"),
292
+ "sang": ("s", "ang"),
293
+ "sao": ("s", "ao"),
294
+ "se": ("s", "e"),
295
+ "sen": ("s", "en"),
296
+ "seng": ("s", "eng"),
297
+ "sha": ("sh", "a"),
298
+ "shai": ("sh", "ai"),
299
+ "shan": ("sh", "an"),
300
+ "shang": ("sh", "ang"),
301
+ "shao": ("sh", "ao"),
302
+ "she": ("sh", "e"),
303
+ "shei": ("sh", "ei"),
304
+ "shen": ("sh", "en"),
305
+ "sheng": ("sh", "eng"),
306
+ "shi": ("sh", "iii"),
307
+ "shou": ("sh", "ou"),
308
+ "shu": ("sh", "u"),
309
+ "shua": ("sh", "ua"),
310
+ "shuai": ("sh", "uai"),
311
+ "shuan": ("sh", "uan"),
312
+ "shuang": ("sh", "uang"),
313
+ "shui": ("sh", "uei"),
314
+ "shun": ("sh", "uen"),
315
+ "shuo": ("sh", "uo"),
316
+ "si": ("s", "ii"),
317
+ "song": ("s", "ong"),
318
+ "sou": ("s", "ou"),
319
+ "su": ("s", "u"),
320
+ "suan": ("s", "uan"),
321
+ "sui": ("s", "uei"),
322
+ "sun": ("s", "uen"),
323
+ "suo": ("s", "uo"),
324
+ "ta": ("t", "a"),
325
+ "tai": ("t", "ai"),
326
+ "tan": ("t", "an"),
327
+ "tang": ("t", "ang"),
328
+ "tao": ("t", "ao"),
329
+ "te": ("t", "e"),
330
+ "tei": ("t", "ei"),
331
+ "teng": ("t", "eng"),
332
+ "ti": ("t", "i"),
333
+ "tian": ("t", "ian"),
334
+ "tiao": ("t", "iao"),
335
+ "tie": ("t", "ie"),
336
+ "ting": ("t", "ing"),
337
+ "tong": ("t", "ong"),
338
+ "tou": ("t", "ou"),
339
+ "tu": ("t", "u"),
340
+ "tuan": ("t", "uan"),
341
+ "tui": ("t", "uei"),
342
+ "tun": ("t", "uen"),
343
+ "tuo": ("t", "uo"),
344
+ "wa": ("^", "ua"),
345
+ "wai": ("^", "uai"),
346
+ "wan": ("^", "uan"),
347
+ "wang": ("^", "uang"),
348
+ "wei": ("^", "uei"),
349
+ "wen": ("^", "uen"),
350
+ "weng": ("^", "ueng"),
351
+ "wo": ("^", "uo"),
352
+ "wu": ("^", "u"),
353
+ "xi": ("x", "i"),
354
+ "xia": ("x", "ia"),
355
+ "xian": ("x", "ian"),
356
+ "xiang": ("x", "iang"),
357
+ "xiao": ("x", "iao"),
358
+ "xie": ("x", "ie"),
359
+ "xin": ("x", "in"),
360
+ "xing": ("x", "ing"),
361
+ "xiong": ("x", "iong"),
362
+ "xiu": ("x", "iou"),
363
+ "xu": ("x", "v"),
364
+ "xuan": ("x", "van"),
365
+ "xue": ("x", "ve"),
366
+ "xun": ("x", "vn"),
367
+ "ya": ("^", "ia"),
368
+ "yan": ("^", "ian"),
369
+ "yang": ("^", "iang"),
370
+ "yao": ("^", "iao"),
371
+ "ye": ("^", "ie"),
372
+ "yi": ("^", "i"),
373
+ "yin": ("^", "in"),
374
+ "ying": ("^", "ing"),
375
+ "yo": ("^", "iou"),
376
+ "yong": ("^", "iong"),
377
+ "you": ("^", "iou"),
378
+ "yu": ("^", "v"),
379
+ "yuan": ("^", "van"),
380
+ "yue": ("^", "ve"),
381
+ "yun": ("^", "vn"),
382
+ "za": ("z", "a"),
383
+ "zai": ("z", "ai"),
384
+ "zan": ("z", "an"),
385
+ "zang": ("z", "ang"),
386
+ "zao": ("z", "ao"),
387
+ "ze": ("z", "e"),
388
+ "zei": ("z", "ei"),
389
+ "zen": ("z", "en"),
390
+ "zeng": ("z", "eng"),
391
+ "zha": ("zh", "a"),
392
+ "zhai": ("zh", "ai"),
393
+ "zhan": ("zh", "an"),
394
+ "zhang": ("zh", "ang"),
395
+ "zhao": ("zh", "ao"),
396
+ "zhe": ("zh", "e"),
397
+ "zhei": ("zh", "ei"),
398
+ "zhen": ("zh", "en"),
399
+ "zheng": ("zh", "eng"),
400
+ "zhi": ("zh", "iii"),
401
+ "zhong": ("zh", "ong"),
402
+ "zhou": ("zh", "ou"),
403
+ "zhu": ("zh", "u"),
404
+ "zhua": ("zh", "ua"),
405
+ "zhuai": ("zh", "uai"),
406
+ "zhuan": ("zh", "uan"),
407
+ "zhuang": ("zh", "uang"),
408
+ "zhui": ("zh", "uei"),
409
+ "zhun": ("zh", "uen"),
410
+ "zhuo": ("zh", "uo"),
411
+ "zi": ("z", "ii"),
412
+ "zong": ("z", "ong"),
413
+ "zou": ("z", "ou"),
414
+ "zu": ("z", "u"),
415
+ "zuan": ("z", "uan"),
416
+ "zui": ("z", "uei"),
417
+ "zun": ("z", "uen"),
418
+ "zuo": ("z", "uo"),
419
+ }
lemas_tts/infer/text_norm/tokenizer.py ADDED
@@ -0,0 +1,235 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # cp from https://github.com/lifeiteng/vall-e/blob/main/valle/data/tokenizer.py
2
+ # Copyright 2023 (authors: Feiteng Li)
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import os
17
+ import re, logging
18
+ from dataclasses import asdict, dataclass
19
+ from typing import Any, Dict, List, Optional, Pattern, Union
20
+ import math
21
+ import numpy as np
22
+ import torch
23
+ import torchaudio
24
+ # from lhotse.features import FeatureExtractor
25
+ # from lhotse.utils import Seconds, compute_num_frames
26
+ from phonemizer.backend.espeak.wrapper import EspeakWrapper
27
+ from phonemizer.backend import EspeakBackend
28
+ from phonemizer.backend.espeak.language_switch import LanguageSwitch
29
+ from phonemizer.backend.espeak.words_mismatch import WordMismatch
30
+ from phonemizer.punctuation import Punctuation
31
+ from phonemizer.separator import Separator
32
+
33
+ # Configure espeak-ng via espeakng_loader if available.
34
+ # This provides a consistent libespeak-ng + data across environments (e.g. HF Spaces).
35
+ try:
36
+ import espeakng_loader
37
+
38
+ EspeakWrapper.set_library(espeakng_loader.get_library_path())
39
+ data_path = espeakng_loader.get_data_path()
40
+ # Export data path via environment so underlying espeak-ng uses it.
41
+ os.environ["ESPEAK_DATA_PATH"] = data_path
42
+ os.environ["ESPEAKNG_DATA_PATH"] = data_path
43
+ print("[LEMAS-TTS] espeak-ng configured via espeakng_loader")
44
+ except Exception as e: # ImportError or runtime errors
45
+ # Fall back to system espeak-ng discovery.
46
+ print(f"[LEMAS-TTS] espeakng_loader not available or failed ({e}); using system espeak-ng")
47
+
48
+
49
+ class TextTokenizer:
50
+ """Phonemize Text."""
51
+
52
+ def __init__(
53
+ self,
54
+ language="en-us",
55
+ backend="espeak",
56
+ separator=Separator(word="_", syllable="-", phone="|"),
57
+ preserve_punctuation=True,
58
+ punctuation_marks: Union[str, Pattern] = Punctuation.default_marks(),
59
+ with_stress: bool = False,
60
+ tie: Union[bool, str] = False,
61
+ language_switch: LanguageSwitch = "keep-flags",
62
+ words_mismatch: WordMismatch = "ignore",
63
+ ) -> None:
64
+ phonemizer = EspeakBackend(
65
+ language,
66
+ punctuation_marks=punctuation_marks,
67
+ preserve_punctuation=preserve_punctuation,
68
+ with_stress=with_stress,
69
+ tie=tie,
70
+ language_switch=language_switch,
71
+ words_mismatch=words_mismatch,
72
+ )
73
+
74
+ self.backend = phonemizer
75
+ self.separator = separator
76
+
77
+ def to_list(self, phonemized: str) -> List[str]:
78
+ fields = []
79
+ for word in phonemized.split(self.separator.word):
80
+ # "ɐ m|iː|n?" ɹ|ɪ|z|ɜː|v; h|ɪ|z.
81
+ pp = re.findall(r"\w+|[^\w\s]", word, re.UNICODE)
82
+ fields.extend(
83
+ [p for p in pp if p != self.separator.phone]
84
+ + [self.separator.word]
85
+ )
86
+ assert len("".join(fields[:-1])) == len(phonemized) - phonemized.count(
87
+ self.separator.phone
88
+ )
89
+ return fields[:-1]
90
+
91
+ def __call__(self, text, strip=True) -> List[List[str]]:
92
+ if isinstance(text, str):
93
+ text = [text]
94
+ phones = []
95
+ for txt in text:
96
+ if txt == '':
97
+ continue
98
+ if txt[0] == '#':
99
+ phones.append(txt)
100
+ else:
101
+ ipa = text_tokenizer.backend.phonemize([txt], separator=text_tokenizer.separator, strip=True, njobs=1, logger=logging.basicConfig(level=logging.ERROR))
102
+ phones += text_tokenizer.to_list(ipa[0])
103
+ return phones
104
+
105
+
106
+ def tokenize_text(tokenizer: TextTokenizer, text: str) -> List[str]:
107
+ phonemes = tokenizer([text.strip()])
108
+ return phonemes[0] # k2symbols
109
+
110
+
111
+ _PAUSE_SYMBOL = {'、':',', ',':',', '。':',', '!':'!', '?':'?', ':':':'}
112
+ def _replace(match):
113
+ word = match.group(0)
114
+ return _PAUSE_SYMBOL[word]
115
+
116
+ def txt2phone(tokenizer: TextTokenizer, text: str):
117
+ text = re.sub('|'.join(_PAUSE_SYMBOL.keys()), _replace, text)
118
+ text = re.split(r"(#\d)", text)
119
+ phones = []
120
+ for txt in text:
121
+ if txt == '':
122
+ continue
123
+ if txt[0] == '#':
124
+ phones.append(txt)
125
+ else:
126
+ ipa = tokenizer.backend.phonemize([txt], separator=tokenizer.separator, strip=True, njobs=1)
127
+ phones += tokenizer.to_list(ipa[0])
128
+ phones = "|".join(phones).replace("(|", "(").replace("|)", ")")
129
+ # phones = ["(cmn)"] + phones.split("|")
130
+ return phones
131
+
132
+
133
+ def convert_audio(wav: torch.Tensor, sr: int, target_sr: int, target_channels: int):
134
+ assert wav.shape[0] in [1, 2], "Audio must be mono or stereo."
135
+ if target_channels == 1:
136
+ wav = wav.mean(0, keepdim=True)
137
+ elif target_channels == 2:
138
+ *shape, _, length = wav.shape
139
+ wav = wav.expand(*shape, target_channels, length)
140
+ elif wav.shape[0] == 1:
141
+ wav = wav.expand(target_channels, -1)
142
+ wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
143
+ return wav
144
+
145
+
146
+ class AudioTokenizer:
147
+ """EnCodec audio."""
148
+
149
+ def __init__(
150
+ self,
151
+ device: Any = None,
152
+ signature = None
153
+ ) -> None:
154
+ from audiocraft.solvers import CompressionSolver
155
+ model = CompressionSolver.model_from_checkpoint(signature)
156
+ self.sample_rate = model.sample_rate
157
+ self.channels = model.channels
158
+
159
+ if not device:
160
+ device = torch.device("cpu")
161
+ if torch.cuda.is_available():
162
+ device = torch.device("cuda:0")
163
+
164
+ self._device = device
165
+
166
+ self.codec = model.to(device)
167
+
168
+ @property
169
+ def device(self):
170
+ return self._device
171
+
172
+ def encode(self, wav: torch.Tensor) -> torch.Tensor:
173
+ codes = self.codec.encode(wav.to(self.device))
174
+ return [(codes[0], None)]
175
+
176
+ def decode(self, frames: torch.Tensor) -> torch.Tensor:
177
+ frames = frames[0][0] # [1,4,T]
178
+ return self.codec.decode(frames)
179
+
180
+
181
+
182
+ def tokenize_audio(tokenizer: AudioTokenizer, audio, offset = -1, num_frames=-1):
183
+ # Load and pre-process the audio waveform
184
+ if type(audio) == str:
185
+ if offset != -1 and num_frames!=-1:
186
+ wav, sr = torchaudio.load(audio, frame_offset=offset, num_frames=num_frames)
187
+ else:
188
+ wav, sr = torchaudio.load(audio)
189
+ wav = convert_audio(wav, sr, tokenizer.sample_rate, tokenizer.channels)
190
+ wav = wav.unsqueeze(0)
191
+ else:
192
+ wav = audio.unsqueeze(0).unsqueeze(0)
193
+ # Extract discrete codes from EnCodec
194
+ with torch.no_grad():
195
+ encoded_frames = tokenizer.encode(wav)
196
+ return encoded_frames
197
+
198
+
199
+ class AudioSR:
200
+ """EnCodec audio."""
201
+
202
+ def __init__(
203
+ self,
204
+ model_path,
205
+ device = "cpu",
206
+ ) -> None:
207
+ import dac
208
+ self.codec = dac.DAC.load(model_path)
209
+ self.codec.to(device)
210
+ self.codec.eval()
211
+
212
+ self.sample_rate = self.codec.sample_rate
213
+ self.channels = 1
214
+ self._device = device
215
+
216
+ @property
217
+ def device(self):
218
+ return self._device
219
+
220
+ def encode(self, wav: torch.Tensor) -> torch.Tensor:
221
+ length = wav.shape[-1]
222
+ right_pad = math.ceil(length / self.codec.hop_length) * self.codec.hop_length - length
223
+ wav = torch.nn.functional.pad(wav, (0, right_pad))
224
+ z, codes, _, _, _ = self.codec.encode(wav.to(self._device))
225
+ return [(codes, z)]
226
+
227
+ def decode(self, frames: torch.Tensor) -> torch.Tensor:
228
+ # frames = frames[0][0] # [1,4,T]
229
+ # with torch.no_grad():
230
+ # z = self.codec.quantizer.from_codes(frames)[0]
231
+ # y = self.codec.decode(z)
232
+ z = frames[0][1] # [1, 2048, T]
233
+ with torch.no_grad():
234
+ y = self.codec.decode(z)
235
+ return y
lemas_tts/infer/text_norm/txt2pinyin.py ADDED
@@ -0,0 +1,225 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import multiprocessing
2
+ from concurrent.futures import ProcessPoolExecutor
3
+ import argparse
4
+ import os, sys, re
5
+ from random import shuffle
6
+ from tqdm import tqdm
7
+ from pypinyin import Style
8
+ from pypinyin.contrib.neutral_tone import NeutralToneWith5Mixin
9
+ from pypinyin.converter import DefaultConverter
10
+ from pypinyin.core import Pinyin
11
import jieba
# Point jieba at the dictionary shipped next to this module.
# BUGFIX: the original concatenated the filename into the single argument of
# os.path.join ("dirname + '/jieba_dict.txt'"); pass it as a separate
# component so the path is joined portably.
jieba.set_dictionary(dictionary_path=os.path.join(os.path.dirname(__file__), 'jieba_dict.txt'))
13
+
14
+ from .symbols import pinyin_dict
15
+ from .cn_tn import NSWNormalizer
16
+
17
+
18
zh_pattern = re.compile("[\u4e00-\u9fa5]")
alpha_pattern = re.compile(r"[a-zA-Z]")


def is_zh(word):
    """Return True when *word* contains at least one common CJK character."""
    return zh_pattern.search(word) is not None


def is_alpha(word):
    """Return True when *word* contains at least one ASCII letter."""
    return alpha_pattern.search(word) is not None
30
+
31
def get_phoneme_from_char_and_pinyin(chn_char, pinyin):
    """Align a Chinese character string with its pinyin and emit phoneme tokens.

    Applies tone sandhi for runs of third tones, the special reading of "嗯",
    and erhua ("儿化") merging; "#N" prosody marks are passed through and
    pause punctuation becomes "#3".

    Args:
        chn_char: character string, possibly containing "#N" prosody marks.
        pinyin: TONE3-style pinyin strings, one per Chinese character
            (modified in place by the sandhi rules).

    Returns:
        list[str]: phoneme/pause tokens, e.g. ["ce4", "shi4", "#3"].
    """
    # we do not need #4, use sil to replace it
    chn_char = chn_char.replace("#4", "")
    char_len = len(chn_char)
    i, j = 0, 0
    result = []
    while i < char_len:
        cur_char = chn_char[i]
        if is_zh(cur_char):
            if pinyin[j][:-1] == 'n':  # special pinyin of "嗯"
                pinyin[j] = 'en' + pinyin[j][-1]
            if i < len(chn_char) - 2 and is_zh(chn_char[i:i + 3]) and pinyin[j][-1] == pinyin[j + 1][-1] == pinyin[j + 2][-1] == '3':
                # three consecutive third tones: the middle becomes second tone
                pinyin[j + 1] = pinyin[j + 1][:-1] + '2'
            if i < len(chn_char) - 1 and pinyin[j][:-1] in pinyin_dict and is_zh(chn_char[i]) and is_zh(chn_char[i + 1]) and pinyin[j][-1] == pinyin[j + 1][-1] == '3':
                # two consecutive third tones: the first becomes second tone
                pinyin[j] = pinyin[j][:-1] + '2'
            if pinyin[j][:-1] not in pinyin_dict:  # erhua: merge "X儿" into X + er5
                assert chn_char[i + 1] == "儿", f"current_char : {cur_char}, next_char: {chn_char[i+1]}, cur_pinyin: {pinyin[j]}"
                assert pinyin[j][-2] == "r"
                tone = pinyin[j][-1]
                a = pinyin[j][:-2]
                # BUGFIX: the original appended the bare name `er5` (NameError);
                # the intended token is the string "er5".
                result += [a + tone, "er5"]
                if i + 2 < char_len and chn_char[i + 2] != "#":
                    result.append("#0")
                i += 2
                j += 1
            else:
                tone = pinyin[j][-1]
                a = pinyin[j][:-1]
                # (the initial/final split via pinyin_dict[a] was unused and
                # has been removed; full-syllable tokens are emitted instead)
                result.append(a + tone)
                i += 1
                j += 1
        # TODO support English alpha
        elif cur_char == "#":
            result.append(chn_char[i:i + 2])
            i += 2
        elif cur_char in _PAUSE_SYMBOL:  # punctuation: insert a long pause
            # NOTE(review): pop() was meant to drop a trailing "#0", but the
            # per-character "#0" appends are disabled, so this drops the
            # previous phoneme instead — confirm the intended behavior.
            if result:  # BUGFIX: guard against leading punctuation (empty list)
                result.pop()
            result.append("#3")
            i += 1
        else:
            # ignore the unknown char
            i += 1
    if result and result[-1] == "#0":  # drop a trailing #0 (guarded for empty result)
        result = result[:-1]
    assert j == len(pinyin)
    return result
97
+
98
# _PAUSE_SYMBOL = {'、', ',', '。', ',', '!', '!', '?', ':', ':', '《', '》', '·', '(', ')', '(', ')'}
# Pause punctuation mapping: keys are the full- and half-width marks recognized
# in the input; values are the ASCII marks appended to the preceding token.
_PAUSE_SYMBOL = {'.':'.', '、':',', ',':',', '。':'.', ',':',', '!':'!', '!':'!', '?':'?', '?':'?', ':':',', ':':',', '——':','}
100
+
101
class MyConverter(NeutralToneWith5Mixin, DefaultConverter):
    """pypinyin converter that renders the neutral tone as explicit tone "5"."""
    pass
103
+
104
+
105
def checkErHuaYin(text, GT_pinyin):
    """Split erhua ("儿化") pinyin syllables so they line up with the text.

    When the punctuation-stripped text is longer than the pinyin list and
    contains "儿", syllables such as "huar1" are expanded into "hua1" + "er5",
    and the matched "X儿" pair is collapsed in the working text so that
    subsequent indices stay aligned.

    Returns:
        The (possibly expanded) pinyin list.
    """
    check_pattern = re.compile("[\\t\.\!\?\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()“”:;]+")
    check_text = check_pattern.sub('', text)
    if len(check_text) <= len(GT_pinyin) or '儿' not in check_text:
        return GT_pinyin
    expanded = []
    for idx, syllable in enumerate(GT_pinyin):
        is_erhua = (
            syllable[-2] == 'r'
            and syllable[:2] != 'er'
            and check_text[idx + 1] == '儿'
        )
        if is_erhua:
            expanded.append(syllable[:-2] + syllable[-1])
            expanded.append('er5')
            pair = check_text[idx:idx + 2]
            # collapse "X儿" -> "X" once so later indices keep matching
            check_text = re.compile(pair).sub(pair[:-1], check_text, count=1)
        else:
            expanded.append(syllable)
    return expanded
123
+
124
+
125
def change_tone_in_bu_or_yi(chars, pinyin_list):
    """Apply tone sandhi for "一" (yi) and "不" (bu).

    Rules implemented:
      - "一" sandwiched between two identical characters (e.g. 看一看) reads
        as the neutral tone "yi5";
      - "一" before a 4th-tone syllable reads as "yi2";
      - "不" before a 4th-tone syllable reads as "bu2".

    Args:
        chars: the character string.
        pinyin_list: TONE3 pinyin, one entry per character; modified in place
            and also returned.

    Returns:
        The (possibly modified) pinyin list.
    """
    location_yi = [m.start() for m in re.finditer(r'一', chars)]
    location_bu = [m.start() for m in re.finditer(r'不', chars)]
    # BUGFIX: the original bound `l < len(chars)` is always true, so
    # `chars[l + 1]` / `pinyin_list[l + 1]` raised IndexError whenever the
    # character was the last one; the correct bound is `l < len(chars) - 1`.
    for l in location_yi:
        if 0 < l < len(chars) - 1 and chars[l - 1] == chars[l + 1]:
            pinyin_list[l] = 'yi5'
        elif l < len(chars) - 1 and pinyin_list[l + 1][-1] == '4':
            pinyin_list[l] = 'yi2'
    for l in location_bu:
        if l < len(chars) - 1 and pinyin_list[l + 1][-1] == '4':
            pinyin_list[l] = 'bu2'
    return pinyin_list
138
+
139
+
140
def txt2pinyin(text, pinyin_parser):
    """Convert raw Chinese/mixed text to a space-joined phoneme string.

    The text is normalized (numbers, symbols) with NSWNormalizer, segmented
    with jieba, converted to TONE3 pinyin, and run through the tone-sandhi and
    erhua handling. Pause punctuation is appended to the previous token;
    Latin-letter words are kept as upper-case tokens.

    Args:
        text: input text.
        pinyin_parser: a pypinyin ``Pinyin(...).pinyin`` callable.

    Returns:
        str: space-joined phoneme tokens.
    """
    phonemes = []
    text = NSWNormalizer(text.strip()).normalize().upper()
    for segment in text.split(' '):
        for words in jieba.cut(segment):
            if words in _PAUSE_SYMBOL:
                # BUGFIX: guard against punctuation appearing before any
                # phoneme (the original indexed phonemes[-1] on an empty list)
                if phonemes:
                    phonemes[-1] += _PAUSE_SYMBOL[words]
            elif re.search("[\u4e00-\u9fa5]+", words):
                pinyin = pinyin_parser(words, style=Style.TONE3, errors="ignore")
                new_pinyin = []
                for x in pinyin:
                    x = "".join(x)
                    if "#" not in x:
                        new_pinyin.append(x)
                # sandhi for 一/不 only when they are not word-final
                new_pinyin = change_tone_in_bu_or_yi(words, new_pinyin) if len(words) > 1 and words[-1] not in {"一", "不"} else new_pinyin
                # phoneme seq, e.g. ["ce4", "shi4", "wen2", "ben3"]
                phonemes += get_phoneme_from_char_and_pinyin(words, new_pinyin)
            elif re.search(r"[a-zA-Z]", words):
                phonemes.append(words.upper())
    return " ".join(phonemes)
166
+
167
+
168
+
169
def process_batch(text_list, save_dir):
    """Convert a batch of (name, text) pairs into phoneme files under *save_dir*.

    Each item is written to ``<save_dir>/<name>.txt``. Failures are logged and
    skipped so a single bad sample does not abort the whole batch.
    """
    my_pinyin = Pinyin(MyConverter())
    pinyin_parser = my_pinyin.pinyin

    for text_info in tqdm(text_list):
        try:
            name, text = text_info
            save_path = os.path.join(save_dir, name + ".txt")
            phones = txt2pinyin(text, pinyin_parser)
            # BUGFIX: use a context manager so the handle is closed
            # deterministically (the original leaked open file handles)
            with open(save_path, 'w', encoding='utf-8') as fout:
                fout.write(phones)
        except Exception as e:
            print(text_info, e)
181
+
182
def parallel_process(filenames, num_processes, save_dir):
    """Split *filenames* into ``num_processes`` chunks and convert them in parallel."""
    total = len(filenames)
    with ProcessPoolExecutor(max_workers=num_processes) as pool:
        futures = []
        for worker in range(num_processes):
            lo = int(worker * total / num_processes)
            hi = int((worker + 1) * total / num_processes)
            futures.append(pool.submit(process_batch, filenames[lo:hi], save_dir))

        # propagate any worker exception and show progress
        for fut in tqdm(futures):
            fut.result()
193
+
194
+
195
+ if __name__ == "__main__":
196
+ parser = argparse.ArgumentParser()
197
+ parser.add_argument(
198
+ "--text_file", type=str, default="", help="path to input text file")
199
+ parser.add_argument(
200
+ "--save_dir", type=str, default="", help="path to output text file")
201
+ parser.add_argument(
202
+ '--workers', type=int, default=4, help='You are advised to set the number of processes to the same as the number of CPU cores')
203
+ args = parser.parse_args()
204
+
205
+ sampling_rate = 16000
206
+
207
+ os.makedirs(args.save_dir, exist_ok=True)
208
+
209
+ filenames = open(args.text_file, 'r', encoding='utf-8').readlines()
210
+ filenames = [x.strip().split('\t') for x in tqdm(filenames)]
211
+ filenames = [[x[0], x[-1]] for x in tqdm(filenames)]
212
+ # shuffle(filenames)
213
+ print(len(filenames))
214
+ multiprocessing.set_start_method("spawn", force=True)
215
+
216
+ if args.workers == 0:
217
+ args.workers = os.cpu_count()
218
+
219
+ parallel_process(filenames, args.workers, args.save_dir)
220
+
221
+
222
+ #################################################################################
223
+
224
+
225
+
lemas_tts/infer/utils_infer.py ADDED
@@ -0,0 +1,661 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # A unified script for inference process
2
+ # Make adjustments inside functions, and consider both gradio and cli scripts if need to change func output format
3
+ import os
4
+ import sys
5
+ from pathlib import Path
6
+ from concurrent.futures import ThreadPoolExecutor
7
+
8
+ os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" # for MPS device compatibility
9
+ sys.path.append(f"{os.path.dirname(os.path.abspath(__file__))}/../../third_party/BigVGAN/")
10
+
11
+ import hashlib
12
+ import re
13
+ import tempfile
14
+ from importlib.resources import files
15
+
16
+ import matplotlib
17
+
18
+ matplotlib.use("Agg")
19
+
20
+ import matplotlib.pylab as plt
21
+ import numpy as np
22
+ import torch
23
+ import torchaudio
24
+ import tqdm
25
+ from huggingface_hub import hf_hub_download
26
+ from pydub import AudioSegment, silence
27
+ from transformers import pipeline
28
+ from vocos import Vocos
29
+
30
+ from lemas_tts.model.cfm import CFM
31
+ from lemas_tts.model.utils import (
32
+ get_tokenizer,
33
+ convert_char_to_pinyin,
34
+ )
35
+
36
+
37
+ def _find_repo_root(start: Path) -> Path:
38
+ """Locate the repo root by looking for a `pretrained_models` folder upwards."""
39
+ for p in [start, *start.parents]:
40
+ if (p / "pretrained_models").is_dir():
41
+ return p
42
+ cwd = Path.cwd()
43
+ if (cwd / "pretrained_models").is_dir():
44
+ return cwd
45
+ return start
46
+
47
+
48
# Resolve repository layout for pretrained assets when running from source tree
THIS_FILE = Path(__file__).resolve()
REPO_ROOT = _find_repo_root(THIS_FILE)
PRETRAINED_ROOT = REPO_ROOT / "pretrained_models"  # root of downloaded assets
CKPTS_ROOT = PRETRAINED_ROOT / "ckpts"  # model checkpoints live under here
54
+ _ref_audio_cache = {}
55
+
56
+ device = (
57
+ "cuda"
58
+ if torch.cuda.is_available()
59
+ else "xpu"
60
+ if torch.xpu.is_available()
61
+ else "mps"
62
+ if torch.backends.mps.is_available()
63
+ else "cpu"
64
+ )
65
+
66
+ # -----------------------------------------
67
+
68
+ target_sample_rate = 24000
69
+ n_mel_channels = 100
70
+ hop_length = 256
71
+ win_length = 1024
72
+ n_fft = 1024
73
+ mel_spec_type = "vocos"
74
+ target_rms = 0.1
75
+ cross_fade_duration = 0.15
76
+ ode_method = "euler"
77
+ nfe_step = 32 # 16, 32
78
+ cfg_strength = 3.0
79
+ sway_sampling_coef = 1
80
+ speed = 1.0
81
+ fix_duration = None
82
+
83
+ # -----------------------------------------
84
+
85
+
86
+ # chunk text into smaller pieces
87
+
88
+
89
def chunk_text(text, max_chars=135):
    """
    Split *text* into chunks of at most *max_chars* UTF-8 bytes.

    Sentences are delimited by ASCII/CJK punctuation; a chunk is flushed
    whenever appending the next sentence would exceed the byte budget.

    Args:
        text (str): The text to be split.
        max_chars (int): The maximum number of UTF-8 bytes per chunk.

    Returns:
        List[str]: The whitespace-stripped text chunks.
    """
    sentences = re.split(r"(?<=[;:,.!?])\s+|(?<=[;:,。!?])", text)

    chunks = []
    pending = ""
    for sentence in sentences:
        # A single-byte (ASCII) terminator gets a trailing space so adjacent
        # sentences do not run together.
        sep = " " if sentence and len(sentence[-1].encode("utf-8")) == 1 else ""
        if len(pending.encode("utf-8")) + len(sentence.encode("utf-8")) <= max_chars:
            pending += sentence + sep
        else:
            if pending:
                chunks.append(pending.strip())
            pending = sentence + sep

    if pending:
        chunks.append(pending.strip())

    return chunks
117
+
118
+
119
+ # load vocoder
120
def load_vocoder(vocoder_name="vocos", is_local=False, local_path="", device=device, hf_cache_dir=None):
    """Load a mel vocoder ("vocos" or "bigvgan"), locally or from the HF hub.

    Args:
        vocoder_name: "vocos" or "bigvgan".
        is_local: load weights from *local_path* instead of downloading.
        local_path: directory holding config.yaml / pytorch_model.bin (vocos)
            or a BigVGAN checkpoint directory.
        device: target device for the vocoder.
        hf_cache_dir: optional HuggingFace cache directory.

    Returns:
        The vocoder module in eval mode on *device*.

    Raises:
        ImportError: for "bigvgan" when the submodule is not set up.
        ValueError: for an unknown *vocoder_name*.
    """
    if vocoder_name == "vocos":
        # vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz").to(device)
        if is_local:
            print(f"Load vocos from local path {local_path}")
            config_path = f"{local_path}/config.yaml"
            model_path = f"{local_path}/pytorch_model.bin"
        else:
            print("Download Vocos from huggingface charactr/vocos-mel-24khz")
            repo_id = "charactr/vocos-mel-24khz"
            config_path = hf_hub_download(repo_id=repo_id, cache_dir=hf_cache_dir, filename="config.yaml")
            model_path = hf_hub_download(repo_id=repo_id, cache_dir=hf_cache_dir, filename="pytorch_model.bin")
        vocoder = Vocos.from_hparams(config_path)
        state_dict = torch.load(model_path, map_location="cpu", weights_only=True)
        from vocos.feature_extractors import EncodecFeatures

        if isinstance(vocoder.feature_extractor, EncodecFeatures):
            # Encodec weights are not stored in the checkpoint; copy them from
            # the freshly constructed feature extractor so load_state_dict succeeds.
            encodec_parameters = {
                "feature_extractor.encodec." + key: value
                for key, value in vocoder.feature_extractor.encodec.state_dict().items()
            }
            state_dict.update(encodec_parameters)
        vocoder.load_state_dict(state_dict)
        vocoder = vocoder.eval().to(device)
    elif vocoder_name == "bigvgan":
        try:
            from third_party.BigVGAN import bigvgan
        except ImportError:
            # BUGFIX: the original only printed here and then hit a NameError
            # on `bigvgan` below; fail fast with the actionable message instead.
            raise ImportError(
                "You need to follow the README to init submodule and change the BigVGAN source code."
            )
        if is_local:
            # download generator from https://huggingface.co/nvidia/bigvgan_v2_24khz_100band_256x/tree/main
            vocoder = bigvgan.BigVGAN.from_pretrained(local_path, use_cuda_kernel=False)
        else:
            vocoder = bigvgan.BigVGAN.from_pretrained(
                "nvidia/bigvgan_v2_24khz_100band_256x", use_cuda_kernel=False, cache_dir=hf_cache_dir
            )

        vocoder.remove_weight_norm()
        vocoder = vocoder.eval().to(device)
    else:
        # BUGFIX: the original fell through and raised NameError on `vocoder`.
        raise ValueError(f"Unsupported vocoder_name: {vocoder_name}")
    return vocoder
160
+
161
+
162
+ # load asr pipeline
163
+
164
# Lazily-constructed global Whisper ASR pipeline (see initialize_asr_pipeline).
asr_pipe = None


def initialize_asr_pipeline(device: str = device, dtype=None):
    """Create the global Whisper ASR pipeline used by transcribe()."""
    global asr_pipe
    if dtype is None:
        # fp16 only on CUDA cards with compute capability >= 7 and not under
        # ZLUDA — presumably half precision misbehaves there; confirm.
        use_half = (
            "cuda" in device
            and torch.cuda.get_device_properties(device).major >= 7
            and not torch.cuda.get_device_name().endswith("[ZLUDA]")
        )
        dtype = torch.float16 if use_half else torch.float32
    asr_pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3-turbo",
        torch_dtype=dtype,
        device=device,
    )
183
+
184
+
185
+ # transcribe
186
+
187
+
188
def transcribe(ref_audio, language=None):
    """Transcribe *ref_audio* with the (lazily created) global Whisper pipeline.

    Args:
        ref_audio: path to the audio to transcribe.
        language: optional language hint passed through to Whisper.

    Returns:
        str: the stripped transcription text.
    """
    global asr_pipe
    if asr_pipe is None:
        initialize_asr_pipeline(device=device)
    gen_kwargs = {"task": "transcribe"}
    if language:
        gen_kwargs["language"] = language
    result = asr_pipe(
        ref_audio,
        chunk_length_s=30,
        batch_size=128,
        generate_kwargs=gen_kwargs,
        return_timestamps=False,
    )
    return result["text"].strip()
199
+
200
+
201
+ # load model checkpoint for inference
202
+
203
+
204
def load_checkpoint(model, ckpt_path, device: str, dtype=None, use_ema=True):
    """Load CFM weights from *ckpt_path* into *model* and move it to *device*.

    Supports both ``.safetensors`` and torch checkpoints, optionally using the
    EMA shadow weights stored under ``ema_model_state_dict``.

    Args:
        model: the CFM model to load into.
        ckpt_path: path to a ``.safetensors`` or torch checkpoint file.
        device: target device string.
        dtype: parameter dtype; defaults to fp16 on capable CUDA GPUs
            (compute capability >= 7, not ZLUDA), else fp32.
        use_ema: load the EMA weights instead of the raw model weights.

    Returns:
        The model with weights loaded, moved to *device*.
    """
    if dtype is None:
        dtype = (
            torch.float16
            if "cuda" in device
            and torch.cuda.get_device_properties(device).major >= 7
            and not torch.cuda.get_device_name().endswith("[ZLUDA]")
            else torch.float32
        )
    model = model.to(dtype)

    ckpt_type = ckpt_path.split(".")[-1]
    if ckpt_type == "safetensors":
        from safetensors.torch import load_file

        checkpoint = load_file(ckpt_path, device=device)
    else:
        checkpoint = torch.load(ckpt_path, map_location=device, weights_only=True)

    if use_ema:
        if ckpt_type == "safetensors":
            # safetensors files store the EMA weights flat; wrap them so both
            # formats are handled uniformly below
            checkpoint = {"ema_model_state_dict": checkpoint}
        # strip the "ema_model." prefix and EMA bookkeeping entries
        checkpoint["model_state_dict"] = {
            k.replace("ema_model.", ""): v
            for k, v in checkpoint["ema_model_state_dict"].items()
            if k not in ["initted", "step"]
        }

        # patch for backward compatibility, 305e3ea
        for key in [
            "mel_spec.mel_stft.mel_scale.fb",
            "mel_spec.mel_stft.spectrogram.window",
            "ctc.proj.0.weight",
            "ctc.proj.0.bias",
            "ctc.ctc_proj.weight",
            "ctc.ctc_proj.bias",
        ]:
            if key in checkpoint["model_state_dict"]:
                del checkpoint["model_state_dict"][key]

        # use strict=False so newly added modules (e.g. prosody encoder)
        # that are initialized from their own checkpoints do not cause
        # missing-key errors when loading older CFM checkpoints
        model.load_state_dict(checkpoint["model_state_dict"], strict=False)
    else:
        if ckpt_type == "safetensors":
            checkpoint = {"model_state_dict": checkpoint}
        model.load_state_dict(checkpoint["model_state_dict"], strict=False)

    del checkpoint
    torch.cuda.empty_cache()

    return model.to(device)
257
+
258
+
259
+ # load model for inference
260
+
261
+
262
def load_model(
    model_cls,
    model_cfg,
    ckpt_path,
    mel_spec_type=mel_spec_type,
    vocab_file="",
    ode_method=ode_method,
    use_ema=True,
    device=device,
    use_prosody_encoder=False,
    prosody_cfg_path="",
    prosody_ckpt_path="",
):
    """Construct a CFM model around the given backbone and load its checkpoint.

    Args:
        model_cls: backbone class (e.g. DiT) instantiated with ``**model_cfg``.
        model_cfg: keyword configuration for the backbone.
        ckpt_path: path to the CFM checkpoint.
        mel_spec_type: "vocos" or "bigvgan"; bigvgan forces fp32 weights.
        vocab_file: tokenizer vocab path; defaults to the packaged vocab.txt.
        ode_method: ODE solver name used during sampling.
        use_ema: load EMA weights from the checkpoint.
        device: target device.
        use_prosody_encoder: attach the optional prosody encoder.
        prosody_cfg_path / prosody_ckpt_path: prosody encoder assets; when
            empty they default to files under pretrained_models/ckpts.

    Returns:
        The CFM model on *device*, ready for sampling.
    """
    if vocab_file == "":
        vocab_file = str(files("lemas_tts").joinpath("infer/examples/vocab.txt"))
    tokenizer = "custom"

    print("\nvocab : ", vocab_file)
    print("token : ", tokenizer)
    print("model : ", ckpt_path, "\n")

    vocab_char_map, vocab_size = get_tokenizer(vocab_file, tokenizer)

    # Resolve prosody encoder assets if requested but paths not provided
    if use_prosody_encoder:
        if not prosody_cfg_path:
            prosody_cfg_path = str(CKPTS_ROOT / "prosody_encoder" / "pretssel_cfg.json")
        if not prosody_ckpt_path:
            prosody_ckpt_path = str(CKPTS_ROOT / "prosody_encoder" / "prosody_encoder_UnitY2.pt")
    model = CFM(
        transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels, use_prosody_encoder=use_prosody_encoder),
        mel_spec_kwargs=dict(
            n_fft=n_fft,
            hop_length=hop_length,
            win_length=win_length,
            n_mel_channels=n_mel_channels,
            target_sample_rate=target_sample_rate,
            mel_spec_type=mel_spec_type,
        ),
        odeint_kwargs=dict(
            method=ode_method,
        ),
        vocab_char_map=vocab_char_map,
        use_prosody_encoder=use_prosody_encoder,
        prosody_cfg_path=prosody_cfg_path,
        prosody_ckpt_path=prosody_ckpt_path,
    ).to(device)

    # bigvgan expects fp32 mels; otherwise let load_checkpoint pick the dtype
    dtype = torch.float32 if mel_spec_type == "bigvgan" else None
    model = load_checkpoint(model, ckpt_path, device, dtype=dtype, use_ema=use_ema)

    return model
314
+
315
+
316
def remove_silence_edges(audio, silence_threshold=-42):
    """Trim leading and trailing silence (below *silence_threshold* dBFS)."""
    # Leading edge: pydub reports the first non-silent millisecond directly.
    lead_ms = silence.detect_leading_silence(audio, silence_threshold=silence_threshold)
    audio = audio[lead_ms:]

    # Trailing edge: walk backwards one millisecond at a time until sound.
    keep_seconds = audio.duration_seconds
    for frame in reversed(audio):
        if frame.dBFS > silence_threshold:
            break
        keep_seconds -= 0.001
    return audio[: int(keep_seconds * 1000)]
330
+
331
+
332
+ # preprocess reference audio and text
333
+
334
+
335
def preprocess_ref_audio_text(ref_audio_orig, ref_text, clip_short=True, show_info=print):
    """Normalize the reference clip and ensure a usable reference transcript.

    The audio is (optionally) clipped to at most ~12 s at silence boundaries,
    edge silence is trimmed, and the result is written to a temp wav file. If
    *ref_text* is empty, the clip is transcribed with Whisper (cached by the
    audio's MD5 hash); the text is then forced to end with sentence-final
    punctuation.

    Args:
        ref_audio_orig: path to the original reference audio.
        ref_text: reference transcription; empty string triggers ASR.
        clip_short: clip the reference down to ~12 s when True.
        show_info: progress callback (e.g. print or a gradio logger).

    Returns:
        (ref_audio, ref_text): path to the processed temp wav and the final
        reference text.
    """
    show_info("Converting audio...")
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
        aseg = AudioSegment.from_file(ref_audio_orig)

        if clip_short:
            # 1. try to find long silence for clipping
            non_silent_segs = silence.split_on_silence(
                aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000, seek_step=10
            )
            non_silent_wave = AudioSegment.silent(duration=0)
            for non_silent_seg in non_silent_segs:
                if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 12000:
                    show_info("Audio is over 12s, clipping short. (1)")
                    break
                non_silent_wave += non_silent_seg

            # 2. try to find short silence for clipping if 1. failed
            if len(non_silent_wave) > 12000:
                non_silent_segs = silence.split_on_silence(
                    aseg, min_silence_len=100, silence_thresh=-40, keep_silence=1000, seek_step=10
                )
                non_silent_wave = AudioSegment.silent(duration=0)
                for non_silent_seg in non_silent_segs:
                    if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 12000:
                        show_info("Audio is over 12s, clipping short. (2)")
                        break
                    non_silent_wave += non_silent_seg

            aseg = non_silent_wave

            # 3. if no proper silence found for clipping
            if len(aseg) > 12000:
                aseg = aseg[:12000]
                show_info("Audio is over 12s, clipping short. (3)")

        # trim edge silence and leave a short trailing pad
        aseg = remove_silence_edges(aseg) + AudioSegment.silent(duration=50)
        aseg.export(f.name, format="wav")
        ref_audio = f.name

    # Compute a hash of the reference audio file
    with open(ref_audio, "rb") as audio_file:
        audio_data = audio_file.read()
        audio_hash = hashlib.md5(audio_data).hexdigest()

    if not ref_text.strip():
        global _ref_audio_cache
        if audio_hash in _ref_audio_cache:
            # Use cached asr transcription
            show_info("Using cached reference text...")
            ref_text = _ref_audio_cache[audio_hash]
        else:
            show_info("No reference text provided, transcribing reference audio...")
            ref_text = transcribe(ref_audio)
            # Cache the transcribed text (not caching custom ref_text, enabling users to do manual tweak)
            _ref_audio_cache[audio_hash] = ref_text
    else:
        show_info("Using custom reference text...")

    # Ensure ref_text ends with a proper sentence-ending punctuation
    if not ref_text.endswith(". ") and not ref_text.endswith("。"):
        if ref_text.endswith("."):
            ref_text += " "
        else:
            ref_text += ". "

    print("\nref_text ", ref_text)

    return ref_audio, ref_text
404
+
405
+
406
+ # infer process: chunk text -> infer batches [i.e. infer_batch_process()]
407
+
408
+
409
def infer_process(
    ref_audio,
    ref_text,
    gen_text,
    model_obj,
    vocoder,
    mel_spec_type=mel_spec_type,
    show_info=print,
    progress=tqdm,
    target_rms=target_rms,
    cross_fade_duration=cross_fade_duration,
    nfe_step=nfe_step,
    cfg_strength=cfg_strength,
    sway_sampling_coef=sway_sampling_coef,
    use_acc_grl=True,
    use_prosody_encoder=True,
    ref_ratio=None,
    no_ref_audio=False,
    speed=speed,
    fix_duration=fix_duration,
    device=device,
):
    """Chunk *gen_text*, run batched inference, and return the stitched result.

    Thin wrapper over infer_batch_process(): loads the reference audio,
    splits the target text into size-budgeted batches (when *ref_text* is a
    plain string), and returns the first — and, non-streaming, only — yield.

    Returns:
        (final_wave, sample_rate, combined_spectrogram) as produced by
        infer_batch_process().
    """
    # Split the input text into batches
    audio, sr = torchaudio.load(ref_audio)

    if type(ref_text) == str:
        # budget chars per batch from the chars/sec rate the reference implies
        max_chars = int(len(ref_text.encode("utf-8")) / (audio.shape[-1] / sr) * (22 - audio.shape[-1] / sr))
        gen_text_batches = chunk_text(gen_text, max_chars=max_chars)
    else:
        # non-str ref_text: caller provides pre-built batches — TODO confirm contract
        gen_text_batches = gen_text

    print(f"ref_text:", ref_text)
    for i, gen_text in enumerate(gen_text_batches):
        print(f"gen_text {i}", gen_text)
    print("\n")

    show_info(f"Generating audio in {len(gen_text_batches)} batches...")
    return next(
        infer_batch_process(
            (audio, sr),
            ref_text,
            gen_text_batches,
            model_obj,
            vocoder,
            mel_spec_type=mel_spec_type,
            progress=progress,
            target_rms=target_rms,
            cross_fade_duration=cross_fade_duration,
            nfe_step=nfe_step,
            cfg_strength=cfg_strength,
            sway_sampling_coef=sway_sampling_coef,
            use_acc_grl=use_acc_grl,
            use_prosody_encoder=use_prosody_encoder,
            ref_ratio=ref_ratio,
            no_ref_audio=no_ref_audio,
            speed=speed,
            fix_duration=fix_duration,
            device=device,
        )
    )
469
+
470
+
471
+ # infer batches
472
+
473
+
474
def infer_batch_process(
    ref_audio,
    ref_text,
    gen_text_batches,
    model_obj,
    vocoder,
    mel_spec_type="vocos",
    progress=tqdm,
    target_rms=0.1,
    cross_fade_duration=0.15,
    nfe_step=32,
    cfg_strength=2.0,
    sway_sampling_coef=-1,
    use_acc_grl=True,
    use_prosody_encoder=True,
    ref_ratio=None,
    no_ref_audio=False,
    speed=1,
    fix_duration=None,
    device=None,
    streaming=False,
    chunk_size=2048,
):
    """Run CFM sampling for every batch in *gen_text_batches*.

    The reference audio is mono-mixed, RMS-boosted up to *target_rms*, and
    resampled to target_sample_rate. Each text batch is synthesized via
    ``model_obj.sample`` and vocoded; non-streaming results are cross-faded
    into one waveform.

    Yields:
        streaming=True: (wave_chunk, target_sample_rate) per *chunk_size* slice.
        streaming=False: a single (final_wave, target_sample_rate,
        combined_spectrogram) tuple, or (None, target_sample_rate, None)
        when nothing was generated.
    """
    audio, sr = ref_audio
    if audio.shape[0] > 1:
        # mix multi-channel reference down to mono
        audio = torch.mean(audio, dim=0, keepdim=True)

    rms = torch.sqrt(torch.mean(torch.square(audio)))
    if rms < target_rms:
        # boost quiet references; the inverse scaling is re-applied below
        audio = audio * target_rms / rms
    if sr != target_sample_rate:
        resampler = torchaudio.transforms.Resample(sr, target_sample_rate)
        audio = resampler(audio)
    audio = audio.to(device)

    generated_waves = []
    spectrograms = []

    if type(ref_text) == str:
        # single-byte (ASCII) ending: add a space so ref and gen text don't fuse
        if len(ref_text[-1].encode("utf-8")) == 1:
            ref_text = ref_text + " "

    def process_batch(gen_text):
        # Generator: synthesize one text batch; yields wave chunks when
        # streaming, else one (wave, mel) pair.
        local_speed = speed

        if type(ref_text) == str:
            # very short prompts sound rushed; slow them down
            if len(gen_text.encode("utf-8")) < 10:
                local_speed = 0.3

            # Prepare the text
            text_list = [ref_text + gen_text]
            final_text_list = convert_char_to_pinyin(text_list)
        else:
            # non-str ref_text is assumed pre-tokenized — TODO confirm caller contract
            final_text_list = [ref_text + gen_text]
        print("final_text_list:", final_text_list)

        ref_audio_len = audio.shape[-1] // hop_length
        if fix_duration is not None:
            duration = int(fix_duration * target_sample_rate / hop_length)
        else:
            # Calculate duration proportionally to the reference chars-per-frame rate
            ref_text_len = len(ref_text)  # .encode("utf-8")
            gen_text_len = len(gen_text)  # .encode("utf-8")
            duration = ref_audio_len + int(ref_audio_len / ref_text_len * gen_text_len / local_speed)

        # inference
        with torch.inference_mode():
            generated, _ = model_obj.sample(
                cond=audio,
                text=final_text_list,
                duration=duration,
                steps=nfe_step,
                cfg_strength=cfg_strength,
                sway_sampling_coef=sway_sampling_coef,
                use_acc_grl=use_acc_grl,
                use_prosody_encoder=use_prosody_encoder,
                ref_ratio=ref_ratio,
                no_ref_audio=no_ref_audio,
            )
            del _

            generated = generated.to(torch.float32)  # generated mel spectrogram
            generated = generated[:, ref_audio_len:, :]  # drop the reference prefix
            generated = generated.permute(0, 2, 1)
            if mel_spec_type == "vocos":
                generated_wave = vocoder.decode(generated)
            elif mel_spec_type == "bigvgan":
                generated_wave = vocoder(generated)
            if rms < target_rms:
                # undo the earlier loudness boost
                generated_wave = generated_wave * rms / target_rms

            # wav -> numpy
            # generated_wave = torch.clip(generated_wave, -0.999, 0.999)
            generated_wave = generated_wave.squeeze().cpu().numpy()

            if streaming:
                for j in range(0, len(generated_wave), chunk_size):
                    yield generated_wave[j : j + chunk_size], target_sample_rate
            else:
                generated_cpu = generated[0].cpu().numpy()
                del generated
                yield generated_wave, generated_cpu

    if streaming:
        for gen_text in progress.tqdm(gen_text_batches) if progress is not None else gen_text_batches:
            for chunk in process_batch(gen_text):
                yield chunk
    else:
        # NOTE(review): process_batch is a generator function, so submit() only
        # creates the generator; the actual sampling runs at next(result) on
        # this thread — the pool likely adds no parallelism. Confirm intent.
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(process_batch, gen_text) for gen_text in gen_text_batches]
            for future in progress.tqdm(futures) if progress is not None else futures:
                result = future.result()
                if result:
                    generated_wave, generated_mel_spec = next(result)
                    generated_waves.append(generated_wave)
                    spectrograms.append(generated_mel_spec)

        if generated_waves:
            if cross_fade_duration <= 0:
                # Simply concatenate
                final_wave = np.concatenate(generated_waves)
            else:
                # Combine all generated waves with cross-fading
                final_wave = generated_waves[0]
                for i in range(1, len(generated_waves)):
                    prev_wave = final_wave
                    next_wave = generated_waves[i]

                    # Calculate cross-fade samples, ensuring it does not exceed wave lengths
                    cross_fade_samples = int(cross_fade_duration * target_sample_rate)
                    cross_fade_samples = min(cross_fade_samples, len(prev_wave), len(next_wave))

                    if cross_fade_samples <= 0:
                        # No overlap possible, concatenate
                        final_wave = np.concatenate([prev_wave, next_wave])
                        continue

                    # Overlapping parts
                    prev_overlap = prev_wave[-cross_fade_samples:]
                    next_overlap = next_wave[:cross_fade_samples]

                    # Fade out and fade in
                    fade_out = np.linspace(1, 0, cross_fade_samples)
                    fade_in = np.linspace(0, 1, cross_fade_samples)

                    # Cross-faded overlap
                    cross_faded_overlap = prev_overlap * fade_out + next_overlap * fade_in

                    # Combine
                    new_wave = np.concatenate(
                        [prev_wave[:-cross_fade_samples], cross_faded_overlap, next_wave[cross_fade_samples:]]
                    )

                    final_wave = new_wave

            # Create a combined spectrogram
            combined_spectrogram = np.concatenate(spectrograms, axis=1)
            final_wave = np.clip(final_wave, -0.999, 0.999)
            yield final_wave, target_sample_rate, combined_spectrogram

        else:
            yield None, target_sample_rate, None
636
+
637
+
638
+ # remove silence from generated wav
639
+
640
+
641
def remove_silence_for_generated_wav(filename):
    """Strip long (>=1 s) silences from the wav at *filename*, rewriting it in place."""
    segment = AudioSegment.from_file(filename)
    voiced_parts = silence.split_on_silence(
        segment, min_silence_len=1000, silence_thresh=-50, keep_silence=500, seek_step=10
    )
    rebuilt = AudioSegment.silent(duration=0)
    for part in voiced_parts:
        rebuilt = rebuilt + part
    rebuilt.export(filename, format="wav")
651
+
652
+
653
+ # save spectrogram
654
+
655
+
656
def save_spectrogram(spectrogram, path):
    """Render *spectrogram* (2-D array, mel bins x frames) as an image at *path*."""
    plt.figure(figsize=(12, 4))
    plt.imshow(spectrogram, origin="lower", aspect="auto")
    plt.colorbar()
    plt.savefig(path)
    plt.close()  # free the figure (Agg backend, no display)
lemas_tts/model/backbones/README.md ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## Backbones quick introduction
2
+
3
+
4
+ ### unett.py
5
+ - flat unet transformer
6
+ - structure same as in e2-tts & voicebox paper except using rotary pos emb
7
+ - possible abs pos emb & convnextv2 blocks for embedded text before concat
8
+
9
+ ### dit.py
10
+ - adaln-zero dit
11
+ - embedded timestep as condition
12
+ - concatted noised_input + masked_cond + embedded_text, linear proj in
13
+ - possible abs pos emb & convnextv2 blocks for embedded text before concat
14
+ - possible long skip connection (first layer to last layer)
15
+
16
+ ### mmdit.py
17
+ - stable diffusion 3 block structure
18
+ - timestep as condition
19
+ - left stream: text embedded and applied an abs pos emb
20
+ - right stream: masked_cond & noised_input concatted and with same conv pos emb as unett
lemas_tts/model/backbones/dit.py ADDED
@@ -0,0 +1,254 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ from typing import Optional
13
+
14
+ import torch
15
+ from torch import nn
16
+ import torch.nn.functional as F
17
+
18
+ from x_transformers.x_transformers import RotaryEmbedding
19
+
20
+ from lemas_tts.model.modules import (
21
+ TimestepEmbedding,
22
+ ConvNeXtV2Block,
23
+ ConvPositionEmbedding,
24
+ DiTBlock,
25
+ AdaLayerNorm_Final,
26
+ precompute_freqs_cis,
27
+ get_pos_embed_indices,
28
+ )
29
+ from lemas_tts.model.backbones.ecapa_tdnn import ECAPA_TDNN
30
+
31
+ # Text embedding
32
+
33
+
34
class TextEmbedding(nn.Module):
    """Embed text token ids and align them to the mel-frame axis.

    Index 0 of the embedding table is reserved as a filler token, so raw ids
    are shifted by +1 on entry (batch padding of -1 therefore maps to 0).
    With ``conv_layers > 0`` the embedded text additionally receives a
    precomputed sinusoidal position embedding and is refined by a stack of
    ConvNeXtV2 blocks before being consumed by the transformer trunk.
    """

    def __init__(self, text_num_embeds, text_dim, mask_padding=True, conv_layers=0, conv_mult=2):
        super().__init__()
        self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim)  # use 0 as filler token

        self.mask_padding = mask_padding  # mask filler and batch padding tokens or not

        if conv_layers > 0:
            self.extra_modeling = True
            self.precompute_max_pos = 4096  # ~44s of 24khz audio
            self.register_buffer("freqs_cis", precompute_freqs_cis(text_dim, self.precompute_max_pos), persistent=False)
            self.text_blocks = nn.Sequential(
                *[ConvNeXtV2Block(text_dim, text_dim * conv_mult) for _ in range(conv_layers)]
            )
        else:
            self.extra_modeling = False

    def forward(self, text: int["b nt"], seq_len, drop_text=False):  # noqa: F722
        """Return per-frame text embeddings of shape (batch, seq_len, text_dim).

        ``text`` is truncated or zero-padded to ``seq_len`` (the mel length).
        ``drop_text=True`` zeroes all ids for classifier-free guidance.
        """
        text = text + 1  # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
        text = text[:, :seq_len]  # curtail if character tokens are more than the mel spec tokens
        batch, text_len = text.shape[0], text.shape[1]
        text = F.pad(text, (0, seq_len - text_len), value=0)
        if self.mask_padding:
            # computed before the optional drop, so CFG still masks padding
            text_mask = text == 0

        if drop_text:  # cfg for text
            text = torch.zeros_like(text)

        text = self.text_embed(text)  # b n -> b n d

        # possible extra modeling
        if self.extra_modeling:
            # sinus pos emb
            # NOTE(review): batch_start is created on CPU; presumably
            # get_pos_embed_indices / indexing handle device placement — confirm.
            batch_start = torch.zeros((batch,), dtype=torch.long)
            pos_idx = get_pos_embed_indices(batch_start, seq_len, max_pos=self.precompute_max_pos)
            text_pos_embed = self.freqs_cis[pos_idx]
            text = text + text_pos_embed

            # convnextv2 blocks
            if self.mask_padding:
                # re-zero padded positions around every conv block so that
                # padding never leaks into neighbours via the convolutions
                text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
                for block in self.text_blocks:
                    text = block(text)
                    text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
            else:
                text = self.text_blocks(text)

        return text
82
+
83
+
84
+ # noised input audio and context mixing embedding
85
+
86
+
87
class InputEmbedding(nn.Module):
    """Fuse noised audio, masked conditioning audio and text embeddings.

    The three streams are concatenated on the feature axis, linearly
    projected to the model width, then enriched with a convolutional
    position embedding added residually.
    """

    def __init__(self, mel_dim, text_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(mel_dim * 2 + text_dim, out_dim)
        self.conv_pos_embed = ConvPositionEmbedding(dim=out_dim)

    def forward(self, x: float["b n d"], cond: float["b n d"], text_embed: float["b n d"], drop_audio_cond=False):  # noqa: F722
        """Return the fused sequence of shape (b, n, out_dim)."""
        if drop_audio_cond:  # classifier-free guidance on the audio condition
            cond = torch.zeros_like(cond)

        fused = self.proj(torch.cat((x, cond, text_embed), dim=-1))
        return fused + self.conv_pos_embed(fused)
100
+
101
+
102
+ # Transformer backbone using DiT blocks
103
+
104
+
105
class DiT(nn.Module):
    """AdaLN-zero DiT backbone for flow-matching TTS.

    Consumes noised mel frames, masked conditioning mel frames and embedded
    text; the timestep is embedded and injected through the adaptive layer
    norms of every block. Optionally adds a projected 512-dim prosody
    embedding to the text stream and supports a long skip connection from
    the first to the last layer.
    """

    def __init__(
        self,
        *,
        dim,
        depth=8,
        heads=8,
        dim_head=64,
        dropout=0.1,
        ff_mult=4,
        mel_dim=100,
        text_num_embeds=256,
        text_dim=None,
        text_mask_padding=True,
        qk_norm=None,
        conv_layers=0,
        pe_attn_head=None,
        long_skip_connection=False,
        checkpoint_activations=False,
        use_prosody_encoder=False,
    ):
        super().__init__()

        self.time_embed = TimestepEmbedding(dim)
        if text_dim is None:
            text_dim = mel_dim
        self.text_embed = TextEmbedding(
            text_num_embeds, text_dim, mask_padding=text_mask_padding, conv_layers=conv_layers
        )
        # project prosody embeddings (512-dim) to text_dim for conditioning
        self.use_prosody_encoder = use_prosody_encoder
        if use_prosody_encoder:
            self.prosody_text_proj = nn.Linear(512, text_dim)
        else:
            self.prosody_text_proj = None
        self.text_cond, self.text_uncond = None, None  # text cache
        self.input_embed = InputEmbedding(mel_dim, text_dim, dim)

        self.rotary_embed = RotaryEmbedding(dim_head)

        self.dim = dim
        self.depth = depth

        self.transformer_blocks = nn.ModuleList(
            [
                DiTBlock(
                    dim=dim,
                    heads=heads,
                    dim_head=dim_head,
                    ff_mult=ff_mult,
                    dropout=dropout,
                    qk_norm=qk_norm,
                    pe_attn_head=pe_attn_head,
                )
                for _ in range(depth)
            ]
        )
        self.long_skip_connection = nn.Linear(dim * 2, dim, bias=False) if long_skip_connection else None

        self.norm_out = AdaLayerNorm_Final(dim)  # final modulation
        self.proj_out = nn.Linear(dim, mel_dim)

        self.checkpoint_activations = checkpoint_activations

        self.initialize_weights()

    def initialize_weights(self):
        """Zero-init all modulation and output layers (AdaLN-zero scheme),
        so each block initially acts close to identity."""
        # Zero-out AdaLN layers in DiT blocks:
        for block in self.transformer_blocks:
            nn.init.constant_(block.attn_norm.linear.weight, 0)
            nn.init.constant_(block.attn_norm.linear.bias, 0)

        # Zero-out output layers:
        nn.init.constant_(self.norm_out.linear.weight, 0)
        nn.init.constant_(self.norm_out.linear.bias, 0)
        nn.init.constant_(self.proj_out.weight, 0)
        nn.init.constant_(self.proj_out.bias, 0)

    def ckpt_wrapper(self, module):
        """Wrap a module call for torch.utils.checkpoint (positional args only)."""
        # https://github.com/chuanyangjin/fast-DiT/blob/main/models.py
        def ckpt_forward(*inputs):
            outputs = module(*inputs)
            return outputs

        return ckpt_forward

    def clear_cache(self):
        """Drop the cached conditional/unconditional text embeddings."""
        self.text_cond, self.text_uncond = None, None

    def forward(
        self,
        x: float["b n d"],  # nosied input audio # noqa: F722
        cond: float["b n d"],  # masked cond audio # noqa: F722
        text: int["b nt"],  # text # noqa: F722
        time: float["b"] | float[""],  # time step # noqa: F821 F722
        drop_audio_cond,  # cfg for cond audio
        drop_text,  # cfg for text
        mask: bool["b n"] | None = None,  # noqa: F722
        cache=False,
        prosody_text: Optional[torch.Tensor] = None,
    ):
        """Predict the output for one step; returns a (b, n, mel_dim) tensor.

        With ``cache=True`` the text embedding for the current cond/uncond
        branch is computed once and reused across calls (use ``clear_cache()``
        when the text changes).
        """
        batch, seq_len = x.shape[0], x.shape[1]
        if time.ndim == 0:
            # broadcast a scalar timestep over the batch
            time = time.repeat(batch)

        # t: conditioning time, text: text, x: noised audio + cond audio + text
        t = self.time_embed(time)
        if cache:
            if drop_text:
                if self.text_uncond is None:
                    self.text_uncond = self.text_embed(text, seq_len, drop_text=True)
                text_embed = self.text_uncond
            else:
                if self.text_cond is None:
                    self.text_cond = self.text_embed(text, seq_len, drop_text=False)
                text_embed = self.text_cond
        else:
            text_embed = self.text_embed(text, seq_len, drop_text=drop_text)

        # optional prosody conditioning on text side
        if prosody_text is not None and self.use_prosody_encoder:
            # prosody_text: (B, T_text, 512) -> project to text_dim and align to seq_len
            pt = self.prosody_text_proj(prosody_text)
            if pt.size(1) < seq_len:
                pad_len = seq_len - pt.size(1)
                pt = F.pad(pt, (0, 0, 0, pad_len))
            elif pt.size(1) > seq_len:
                pt = pt[:, :seq_len]
            text_embed = text_embed + pt
        x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)

        rope = self.rotary_embed.forward_from_seq_len(seq_len)

        if self.long_skip_connection is not None:
            residual = x

        for block in self.transformer_blocks:
            if self.checkpoint_activations:
                # https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint
                x = torch.utils.checkpoint.checkpoint(self.ckpt_wrapper(block), x, t, mask, rope, use_reentrant=False)
            else:
                x = block(x, t, mask=mask, rope=rope)

        if self.long_skip_connection is not None:
            x = self.long_skip_connection(torch.cat((x, residual), dim=-1))

        x = self.norm_out(x, t)
        output = self.proj_out(x)

        return output
lemas_tts/model/backbones/ecapa_tdnn.py ADDED
@@ -0,0 +1,931 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """A popular speaker recognition and diarization model.
2
+
3
+ Authors
4
+ * Hwidong Na 2020
5
+ """
6
+
7
+ import math
8
+ import os
9
+ import torch # noqa: F401
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+
13
+
14
def length_to_mask(length, max_len=None, dtype=None, device=None):
    """Create a binary padding mask from per-sequence lengths.

    Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3

    Arguments
    ---------
    length : torch.LongTensor
        1D tensor holding the length of each sequence in the batch.
    max_len : int
        Size of the mask's second dimension; defaults to ``length.max()``.
    dtype : torch.dtype, default: None
        Output dtype; defaults to ``length.dtype``.
    device: torch.device, default: None
        Output device; defaults to ``length.device``.

    Returns
    -------
    mask : tensor
        The binary mask, shape (len(length), max_len).

    Example
    -------
    >>> length=torch.Tensor([1,2,3])
    >>> mask=length_to_mask(length)
    >>> mask
    tensor([[1., 0., 0.],
            [1., 1., 0.],
            [1., 1., 1.]])
    """
    assert len(length.shape) == 1

    if max_len is None:
        max_len = length.max().long().item()

    # positions [0..max_len) compared against each length gives the mask
    positions = torch.arange(max_len, device=length.device, dtype=length.dtype)
    mask = positions.expand(len(length), max_len) < length.unsqueeze(1)

    dtype = length.dtype if dtype is None else dtype
    device = length.device if device is None else device

    return torch.as_tensor(mask, dtype=dtype, device=device)
60
+
61
+
62
def get_padding_elem(L_in: int, stride: int, kernel_size: int, dilation: int):
    """Compute the [left, right] zero-padding for a 1d convolution.

    For ``stride == 1`` the padding is chosen so the convolution output has
    the same length as the input; for ``stride > 1`` a symmetric half-kernel
    padding is used.

    Arguments
    ---------
    L_in : int
        Input length on the time axis.
    stride : int
        Stride of the convolution.
    kernel_size : int
        Kernel size of the convolution.
    dilation : int
        Dilation of the convolution.

    Returns
    -------
    list[int]
        Two-element [left, right] padding.
    """
    if stride > 1:
        # Fix: the original also derived an (unused) output length here;
        # the padding only depends on the kernel size, so the dead
        # computation is removed.
        half_kernel = kernel_size // 2
        return [half_kernel, half_kernel]

    # "same"-style padding: pad by the amount the convolution would shrink.
    L_out = (L_in - dilation * (kernel_size - 1) - 1) // stride + 1
    pad = (L_in - L_out) // 2
    return [pad, pad]
82
+
83
+
84
class Conv1d(nn.Module):
    """This function implements 1d convolution.

    Arguments
    ---------
    out_channels : int
        It is the number of output channels.
    kernel_size : int
        Kernel size of the convolutional filters.
    input_shape : tuple
        The shape of the input. Alternatively use ``in_channels``.
    in_channels : int
        The number of input channels. Alternatively use ``input_shape``.
    stride : int
        Stride factor of the convolutional filters. When the stride factor > 1,
        a decimation in time is performed.
    dilation : int
        Dilation factor of the convolutional filters.
    padding : str
        (same, valid, causal). If "valid", no padding is performed.
        If "same" and stride is 1, output shape is the same as the input shape.
        "causal" results in causal (dilated) convolutions.
    padding_mode : str
        This flag specifies the type of padding. See torch.nn documentation
        for more information.
    skip_transpose : bool
        If False, uses batch x time x channel convention of speechbrain.
        If True, uses batch x channel x time convention.

    Example
    -------
    >>> inp_tensor = torch.rand([10, 40, 16])
    >>> cnn_1d = Conv1d(
    ...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=5
    ... )
    >>> out_tensor = cnn_1d(inp_tensor)
    >>> out_tensor.shape
    torch.Size([10, 40, 8])
    """

    def __init__(
        self,
        out_channels,
        kernel_size,
        input_shape=None,
        in_channels=None,
        stride=1,
        dilation=1,
        padding="same",
        groups=1,
        bias=True,
        padding_mode="reflect",
        skip_transpose=True,
    ):
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        self.dilation = dilation
        self.padding = padding
        self.padding_mode = padding_mode
        # set True by _check_input_shape for 2d input (treated as 1 channel)
        self.unsqueeze = False
        self.skip_transpose = skip_transpose

        if input_shape is None and in_channels is None:
            raise ValueError("Must provide one of input_shape or in_channels")

        if in_channels is None:
            in_channels = self._check_input_shape(input_shape)

        # padding=0 in the underlying conv: padding is applied manually in
        # forward so "same"/"causal" modes and padding_mode are supported
        self.conv = nn.Conv1d(
            in_channels,
            out_channels,
            self.kernel_size,
            stride=self.stride,
            dilation=self.dilation,
            padding=0,
            groups=groups,
            bias=bias,
        )

    def forward(self, x):
        """Returns the output of the convolution.

        Arguments
        ---------
        x : torch.Tensor (batch, time, channel)
            input to convolve. 2d or 4d tensors are expected.
        """

        if not self.skip_transpose:
            x = x.transpose(1, -1)

        if self.unsqueeze:
            x = x.unsqueeze(1)

        if self.padding == "same":
            x = self._manage_padding(x, self.kernel_size, self.dilation, self.stride)

        elif self.padding == "causal":
            # pad only on the left so the output never sees future frames
            num_pad = (self.kernel_size - 1) * self.dilation
            x = F.pad(x, (num_pad, 0))

        elif self.padding == "valid":
            pass

        else:
            raise ValueError(
                "Padding must be 'same', 'valid' or 'causal'. Got " + self.padding
            )

        # cast input to the conv's parameter dtype (mixed-precision safety)
        wx = self.conv(x.to(self.conv.weight.dtype))

        if self.unsqueeze:
            wx = wx.squeeze(1)

        if not self.skip_transpose:
            wx = wx.transpose(1, -1)

        return wx

    def _manage_padding(
        self,
        x,
        kernel_size: int,
        dilation: int,
        stride: int,
    ):
        """This function performs zero-padding on the time axis
        such that their lengths is unchanged after the convolution.

        Arguments
        ---------
        x : torch.Tensor
            Input tensor.
        kernel_size : int
            Size of kernel.
        dilation : int
            Dilation used.
        stride : int
            Stride.
        """

        # Detecting input shape
        L_in = x.shape[-1]

        # Time padding
        padding = get_padding_elem(L_in, stride, kernel_size, dilation)

        # Applying padding (self.padding_mode, e.g. "reflect")
        x = F.pad(x, padding, mode=self.padding_mode)

        return x

    def _check_input_shape(self, shape):
        """Checks the input shape and returns the number of input channels."""

        if len(shape) == 2:
            self.unsqueeze = True
            in_channels = 1
        elif self.skip_transpose:
            in_channels = shape[1]
        elif len(shape) == 3:
            in_channels = shape[2]
        else:
            raise ValueError("conv1d expects 2d, 3d inputs. Got " + str(len(shape)))

        # Kernel size must be odd (so "same" padding can be symmetric)
        if self.kernel_size % 2 == 0:
            raise ValueError(
                "The field kernel size must be an odd number. Got %s."
                % (self.kernel_size)
            )
        return in_channels
257
+
258
+
259
class Fp32BatchNorm(nn.Module):
    """Batch norm that always keeps its statistics in float32.

    Wraps ``nn.SyncBatchNorm`` (when a distributed group with world size > 1
    is initialized and ``sync=True``) or ``nn.BatchNorm1d``. The input is
    cast to float32 for normalization and the output cast back to the input
    dtype, keeping batch norm numerically stable under mixed precision.
    """

    def __init__(self, sync=True, *args, **kwargs):
        super().__init__()

        # Syncing across processes only makes sense in an initialized
        # multi-process group; also guard builds without distributed support.
        if (
            not torch.distributed.is_available()
            or not torch.distributed.is_initialized()
            or torch.distributed.get_world_size() == 1
        ):
            sync = False

        if sync:
            self.bn = nn.SyncBatchNorm(*args, **kwargs)
        else:
            self.bn = nn.BatchNorm1d(*args, **kwargs)

        self.sync = sync

    def forward(self, input):
        # Lazily promote the wrapped module's state to float32 the first
        # time it is observed in lower precision (e.g. after .half()).
        if self.bn.running_mean.dtype != torch.float:
            if self.sync:
                self.bn.running_mean = self.bn.running_mean.float()
                self.bn.running_var = self.bn.running_var.float()
                if self.bn.affine:
                    try:
                        self.bn.weight = self.bn.weight.float()
                        self.bn.bias = self.bn.bias.float()
                    except TypeError:
                        # Fix: was a bare ``except:``. nn.Module refuses
                        # assigning a plain Tensor over a Parameter
                        # (TypeError); fall back to casting the module.
                        self.bn.float()
            else:
                self.bn.float()

        output = self.bn(input.float())
        return output.type_as(input)
292
+
293
+
294
class BatchNorm1d(nn.Module):
    """Applies 1d batch normalization to the input tensor.

    Wraps ``Fp32BatchNorm`` (statistics kept in float32); with
    ``enabled=False`` it becomes a pass-through identity.

    Arguments
    ---------
    input_shape : tuple
        The expected shape of the input. Alternatively, use ``input_size``.
    input_size : int
        The expected size of the input. Alternatively, use ``input_shape``.
    eps : float
        This value is added to std deviation estimation to improve the numerical
        stability.
    momentum : float
        It is a value used for the running_mean and running_var computation.
    affine : bool
        When set to True, the affine parameters are learned.
    track_running_stats : bool
        When set to True, this module tracks the running mean and variance,
        and when set to False, this module does not track such statistics.
    combine_batch_time : bool
        When true, it combines batch an time axis.
    skip_transpose : bool
        When True, the input is taken as channel-first; when False it is
        transposed to channel-first around the norm.
    enabled : bool
        When False, no normalization is applied (identity).

    Example
    -------
    >>> input = torch.randn(100, 10)
    >>> norm = BatchNorm1d(input_shape=input.shape)
    >>> output = norm(input)
    >>> output.shape
    torch.Size([100, 10])
    """

    def __init__(
        self,
        input_shape=None,
        input_size=None,
        eps=1e-05,
        momentum=0.1,
        affine=True,
        track_running_stats=True,
        combine_batch_time=False,
        skip_transpose=True,
        enabled=True,
    ):
        super().__init__()
        self.combine_batch_time = combine_batch_time
        self.skip_transpose = skip_transpose

        # channel dim is index 1 for channel-first input, last dim otherwise
        if input_size is None and skip_transpose:
            input_size = input_shape[1]
        elif input_size is None:
            input_size = input_shape[-1]

        if enabled:
            self.norm = Fp32BatchNorm(
                num_features=input_size,
                eps=eps,
                momentum=momentum,
                affine=affine,
                track_running_stats=track_running_stats,
            )
        else:
            self.norm = nn.Identity()

    def forward(self, x):
        """Returns the normalized input tensor.

        Arguments
        ---------
        x : torch.Tensor (batch, time, [channels])
            input to normalize. 2d or 3d tensors are expected in input
            4d tensors can be used when combine_dims=True.
        """
        shape_or = x.shape
        if self.combine_batch_time:
            # fold time into the batch axis so norm runs per channel
            if x.ndim == 3:
                x = x.reshape(shape_or[0] * shape_or[1], shape_or[2])
            else:
                x = x.reshape(shape_or[0] * shape_or[1], shape_or[3], shape_or[2])

        elif not self.skip_transpose:
            # channel-last input: move channels to dim 1 for the norm
            x = x.transpose(-1, 1)

        x_n = self.norm(x)

        # undo the reshape/transpose so the output shape matches the input
        if self.combine_batch_time:
            x_n = x_n.reshape(shape_or)
        elif not self.skip_transpose:
            x_n = x_n.transpose(1, -1)

        return x_n
385
+
386
+
387
class Linear(torch.nn.Module):
    """Computes a linear transformation y = wx + b.

    Arguments
    ---------
    n_neurons : int
        It is the number of output neurons (i.e, the dimensionality of the
        output).
    bias : bool
        If True, the additive bias b is adopted.
    combine_dims : bool
        If True and the input is 4D, combine 3rd and 4th dimensions of input.

    Example
    -------
    >>> inputs = torch.rand(10, 50, 40)
    >>> lin_t = Linear(input_shape=(10, 50, 40), n_neurons=100)
    >>> output = lin_t(inputs)
    >>> output.shape
    torch.Size([10, 50, 100])
    """

    def __init__(
        self,
        n_neurons,
        input_shape=None,
        input_size=None,
        bias=True,
        combine_dims=False,
    ):
        super().__init__()
        self.combine_dims = combine_dims

        if input_shape is None and input_size is None:
            raise ValueError("Expected one of input_shape or input_size")

        if input_size is None:
            # infer the feature size from the given shape
            input_size = input_shape[-1]
            if len(input_shape) == 4 and self.combine_dims:
                input_size = input_shape[2] * input_shape[3]

        # Weights are initialized following pytorch approach
        self.w = nn.Linear(input_size, n_neurons, bias=bias)

    def forward(self, x):
        """Returns the linear transformation of input tensor.

        Arguments
        ---------
        x : torch.Tensor
            Input to transform linearly.
        """
        if x.ndim == 4 and self.combine_dims:
            b, t, d1, d2 = x.shape
            x = x.reshape(b, t, d1 * d2)

        return self.w(x)
445
+
446
+
447
class TDNNBlock(nn.Module):
    """A single TDNN layer: Conv1d -> activation -> (optional) batch norm.

    Arguments
    ---------
    in_channels : int
        Number of input channels.
    out_channels : int
        The number of output channels.
    kernel_size : int
        The kernel size of the convolution.
    dilation : int
        The dilation of the convolution.
    activation : torch class
        A class for constructing the activation layers.
    batch_norm : bool
        Whether batch normalization is applied after the activation.

    Example
    -------
    >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
    >>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)
    >>> out_tensor = layer(inp_tensor).transpose(1, 2)
    >>> out_tensor.shape
    torch.Size([8, 120, 64])
    """

    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        dilation,
        activation=nn.ReLU,
        batch_norm=True,
    ):
        super().__init__()
        self.conv = Conv1d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            dilation=dilation,
        )
        self.activation = activation()
        self.norm = BatchNorm1d(input_size=out_channels, enabled=batch_norm)

    def forward(self, x):
        """Apply conv -> activation -> norm to a (N, C, L) tensor."""
        hidden = self.conv(x)
        hidden = self.activation(hidden)
        return self.norm(hidden)
493
+
494
+
495
class Res2NetBlock(torch.nn.Module):
    """Res2Net block with dilation.

    Splits the channels into ``scale`` groups; the first group passes
    through untouched, every later group goes through a TDNN block that
    also receives the previous group's output, building a multi-scale
    receptive field.

    Arguments
    ---------
    in_channels : int
        The number of channels expected in the input.
    out_channels : int
        The number of output channels.
    scale : int
        The scale of the Res2Net block.
    kernel_size: int
        The kernel size of the Res2Net block.
    dilation : int
        The dilation of the Res2Net block.

    Example
    -------
    >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
    >>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
    >>> out_tensor = layer(inp_tensor).transpose(1, 2)
    >>> out_tensor.shape
    torch.Size([8, 120, 64])
    """

    def __init__(
        self,
        in_channels,
        out_channels,
        scale=8,
        kernel_size=3,
        dilation=1,
        batch_norm=True,
    ):
        super().__init__()
        assert in_channels % scale == 0
        assert out_channels % scale == 0

        width_in = in_channels // scale
        width_out = out_channels // scale

        self.blocks = nn.ModuleList(
            [
                TDNNBlock(
                    width_in,
                    width_out,
                    kernel_size=kernel_size,
                    dilation=dilation,
                    batch_norm=batch_norm,
                )
                for _ in range(scale - 1)
            ]
        )
        self.scale = scale

    def forward(self, x):
        """Process a (N, C, L) tensor; the first chunk is passed through."""
        chunks = torch.chunk(x, self.scale, dim=1)
        outputs = [chunks[0]]
        prev = None
        for idx in range(1, self.scale):
            # from the 3rd chunk on, mix in the previous block's output
            inp = chunks[idx] if prev is None else chunks[idx] + prev
            prev = self.blocks[idx - 1](inp)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)
562
+
563
+
564
class SEBlock(nn.Module):
    """Squeeze-and-excitation block.

    Rescales each channel of the input by a sigmoid gate computed from the
    (optionally length-masked) temporal mean.

    Arguments
    ---------
    in_channels : int
        The number of input channels.
    se_channels : int
        The number of output channels after squeeze.
    out_channels : int
        The number of output channels.

    Example
    -------
    >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
    >>> se_layer = SEBlock(64, 16, 64)
    >>> lengths = torch.rand((8,))
    >>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
    >>> out_tensor.shape
    torch.Size([8, 120, 64])
    """

    def __init__(self, in_channels, se_channels, out_channels):
        super().__init__()

        self.conv1 = Conv1d(
            in_channels=in_channels, out_channels=se_channels, kernel_size=1
        )
        self.relu = torch.nn.ReLU(inplace=True)
        self.conv2 = Conv1d(
            in_channels=se_channels, out_channels=out_channels, kernel_size=1
        )
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x, lengths=None):
        """Gate a (N, C, L) tensor channel-wise."""
        T = x.shape[-1]
        if lengths is None:
            pooled = x.mean(dim=2, keepdim=True)
        else:
            # lengths are relative (see the class example); scale to frames
            # and average only over the unpadded positions
            mask = length_to_mask(lengths * T, max_len=T, device=x.device)
            mask = mask.unsqueeze(1)
            frames = mask.sum(dim=2, keepdim=True)
            pooled = (x * mask).sum(dim=2, keepdim=True) / frames

        gate = self.sigmoid(self.conv2(self.relu(self.conv1(pooled))))
        return gate * x
612
+
613
+
614
class AttentiveStatisticsPooling(nn.Module):
    """This class implements an attentive statistic pooling layer for each channel.
    It returns the concatenated mean and std of the input tensor.

    Arguments
    ---------
    channels: int
        The number of input channels.
    attention_channels: int
        The number of attention channels.
    global_context : bool
        If True, the attention network also sees the utterance-level mean and
        std concatenated to every frame.
    batch_norm : bool
        Whether the internal TDNN block uses batch normalization.

    Example
    -------
    >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
    >>> asp_layer = AttentiveStatisticsPooling(64)
    >>> lengths = torch.rand((8,))
    >>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
    >>> out_tensor.shape
    torch.Size([8, 1, 128])
    """

    def __init__(
        self, channels, attention_channels=128, global_context=True, batch_norm=True
    ):
        super().__init__()

        self.eps = 1e-12
        self.global_context = global_context
        if global_context:
            self.tdnn = TDNNBlock(
                channels * 3, attention_channels, 1, 1, batch_norm=batch_norm
            )
        else:
            # Fix: ``batch_norm`` was previously passed positionally into
            # TDNNBlock's ``activation`` slot (and again as batch_norm),
            # making the block instantiate ``bool()`` as its activation and
            # crash in forward. Pass it by keyword instead.
            self.tdnn = TDNNBlock(
                channels, attention_channels, 1, 1, batch_norm=batch_norm
            )
        self.tanh = nn.Tanh()
        self.conv = Conv1d(
            in_channels=attention_channels, out_channels=channels, kernel_size=1
        )

    def forward(self, x, lengths=None):
        """Calculates attention-weighted mean and std for a batch.

        Arguments
        ---------
        x : torch.Tensor
            Tensor of shape [N, C, L].
        lengths : torch.Tensor, optional
            Relative lengths (see the class example); defaults to
            full-length sequences.

        Returns
        -------
        torch.Tensor of shape [N, 2 * C, 1].
        """
        L = x.shape[-1]

        def _compute_statistics(x, m, dim=2, eps=self.eps):
            # weighted mean/std under weights m; clamp keeps sqrt stable
            mean = (m * x).sum(dim)
            std = torch.sqrt((m * (x - mean.unsqueeze(dim)).pow(2)).sum(dim).clamp(eps))
            return mean, std

        if lengths is None:
            lengths = torch.ones(x.shape[0], device=x.device)

        # Make binary mask of shape [N, 1, L]
        mask = length_to_mask(lengths * L, max_len=L, device=x.device)
        mask = mask.unsqueeze(1)

        # Expand the temporal context of the pooling layer by allowing the
        # self-attention to look at global properties of the utterance.
        if self.global_context:
            # torch.std is unstable for backward computation
            # https://github.com/pytorch/pytorch/issues/4320
            total = mask.sum(dim=2, keepdim=True).float()
            mean, std = _compute_statistics(x, mask / total)
            mean = mean.unsqueeze(2).repeat(1, 1, L)
            std = std.unsqueeze(2).repeat(1, 1, L)
            attn = torch.cat([x, mean, std], dim=1)
        else:
            attn = x

        # Apply layers
        attn = self.conv(self.tanh(self.tdnn(attn)))

        # Filter out zero-paddings so they get zero attention weight
        attn = attn.masked_fill(mask == 0, float("-inf"))

        attn = F.softmax(attn, dim=2)
        mean, std = _compute_statistics(x, attn)
        # Append mean and std of the batch
        pooled_stats = torch.cat((mean, std), dim=1)
        pooled_stats = pooled_stats.unsqueeze(2)

        return pooled_stats
703
+
704
+
705
class SERes2NetBlock(nn.Module):
    """ECAPA-TDNN building block: TDNN -> Res2Net -> TDNN -> SE with residual.

    Arguments
    ---------
    in_channels : int
        The number of input channels.
    out_channels: int
        The number of output channels.
    res2net_scale: int
        The scale of the Res2Net block.
    se_channels : int
        The number of squeeze channels in the SE block.
    kernel_size: int
        The kernel size of the TDNN blocks.
    dilation: int
        The dilation of the Res2Net block.
    activation : torch class
        A class for constructing the activation layers.

    Example
    -------
    >>> x = torch.rand(8, 120, 64).transpose(1, 2)
    >>> conv = SERes2NetBlock(64, 64, res2net_scale=4)
    >>> out = conv(x).transpose(1, 2)
    >>> out.shape
    torch.Size([8, 120, 64])
    """

    def __init__(
        self,
        in_channels,
        out_channels,
        res2net_scale=8,
        se_channels=128,
        kernel_size=1,
        dilation=1,
        activation=torch.nn.ReLU,
        batch_norm=True,
    ):
        super().__init__()
        self.out_channels = out_channels

        self.tdnn1 = TDNNBlock(
            in_channels,
            out_channels,
            kernel_size=1,
            dilation=1,
            activation=activation,
            batch_norm=batch_norm,
        )
        self.res2net_block = Res2NetBlock(
            out_channels,
            out_channels,
            res2net_scale,
            kernel_size,
            dilation,
            batch_norm=batch_norm,
        )
        self.tdnn2 = TDNNBlock(
            out_channels,
            out_channels,
            kernel_size=1,
            dilation=1,
            activation=activation,
            batch_norm=batch_norm,
        )
        self.se_block = SEBlock(out_channels, se_channels, out_channels)

        # 1x1 conv on the residual path when channel counts differ
        self.shortcut = None
        if in_channels != out_channels:
            self.shortcut = Conv1d(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=1,
            )

    def forward(self, x, lengths=None):
        """Apply the block to a (N, C, L) tensor with a residual connection."""
        identity = x if self.shortcut is None else self.shortcut(x)

        out = self.tdnn1(x)
        out = self.res2net_block(out)
        out = self.tdnn2(out)
        out = self.se_block(out, lengths)

        return out + identity
789
+
790
+
791
class ECAPA_TDNN(torch.nn.Module):
    """An implementation of the speaker embedding model in a paper.
    "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in
    TDNN Based Speaker Verification" (https://arxiv.org/abs/2005.07143).

    Arguments
    ---------
    input_size : int
        Feature dimension of the input (and, in this variant, of the output:
        the final conv maps back to ``input_size`` rather than ``lin_neurons``).
    activation : torch class
        A class for constructing the activation layers.
    channels : list of ints
        Output channels for TDNN/SERes2Net layer.
    kernel_sizes : list of ints
        List of kernel sizes for each layer.
    dilations : list of ints
        List of dilations for kernels in each layer.
    lin_neurons : int
        Number of neurons in linear layers.
        NOTE(review): accepted for API compatibility but unused here — the
        final projection outputs ``input_size`` channels (see ``self.fc``).

    Example
    -------
    >>> input_feats = torch.rand([5, 120, 80])
    >>> compute_embedding = ECAPA_TDNN(80, lin_neurons=192)
    >>> outputs = compute_embedding(input_feats)
    >>> outputs.shape
    torch.Size([5, 80])
    """

    def __init__(
        self,
        input_size,
        lin_neurons=192,
        activation=torch.nn.ReLU,
        # NOTE(review): mutable default lists are shared across calls; harmless
        # here because they are only read, never mutated.
        channels=[512, 512, 512, 512, 1536],
        kernel_sizes=[5, 3, 3, 3, 1],
        dilations=[1, 2, 3, 4, 1],
        attention_channels=128,
        res2net_scale=8,
        se_channels=128,
        global_context=True,
        batch_norm=True,
    ):

        super().__init__()
        assert len(channels) == len(kernel_sizes)
        assert len(channels) == len(dilations)
        self.channels = channels
        self.blocks = nn.ModuleList()

        # The initial TDNN layer
        self.blocks.append(
            TDNNBlock(
                input_size,
                channels[0],
                kernel_sizes[0],
                dilations[0],
                activation,
                batch_norm=batch_norm,
            )
        )

        # SE-Res2Net layers
        for i in range(1, len(channels) - 1):
            self.blocks.append(
                SERes2NetBlock(
                    channels[i - 1],
                    channels[i],
                    res2net_scale=res2net_scale,
                    se_channels=se_channels,
                    kernel_size=kernel_sizes[i],
                    dilation=dilations[i],
                    activation=activation,
                    batch_norm=batch_norm,
                )
            )

        # Multi-layer feature aggregation: fuses the concatenated outputs of
        # all SE-Res2Net layers back to channels[-1].
        self.mfa = TDNNBlock(
            channels[-1],
            channels[-1],
            kernel_sizes[-1],
            dilations[-1],
            activation,
            batch_norm=batch_norm,
        )

        # Attentive Statistical Pooling (mean+std -> 2x channels)
        self.asp = AttentiveStatisticsPooling(
            channels[-1],
            attention_channels=attention_channels,
            global_context=global_context,
        )
        self.asp_bn = BatchNorm1d(input_size=channels[-1] * 2, enabled=batch_norm)

        # Final linear transformation (1x1 conv), back to input_size channels.
        self.fc = Conv1d(
            in_channels=channels[-1] * 2,
            out_channels=input_size,  # lin_neurons,
            kernel_size=1,
        )

    # @torch.cuda.amp.autocast(enabled=True, dtype=torch.float32)
    def forward(self, x, lengths=None):
        """Returns the embedding vector.

        Arguments
        ---------
        x : torch.Tensor
            Tensor of shape (batch, time, channel).
        lengths : torch.Tensor, optional
            Relative lengths in [0, 1], forwarded to layers that accept them.
        """
        # Minimize transpose for efficiency
        x = x.transpose(1, 2)

        xl = []
        for layer in self.blocks:
            # Some blocks accept `lengths`, others (e.g. the first TDNN) do
            # not — fall back to a plain call on TypeError.
            try:
                x = layer(x, lengths=lengths)
            except TypeError:
                x = layer(x)
            xl.append(x)

        # Multi-layer feature aggregation (skip the initial TDNN output)
        x = torch.cat(xl[1:], dim=1)
        x = self.mfa(x)

        # Attentive Statistical Pooling -> (batch, 2*channels[-1], 1)
        x = self.asp(x, lengths=lengths)
        x = self.asp_bn(x)

        # Final linear transformation
        x = self.fc(x)

        # Drop the trailing singleton time axis -> (batch, input_size)
        x = x.squeeze(-1)
        return x
927
+
928
+
929
if __name__ == "__main__":
    # Smoke test: instantiate the encoder (128-dim input, batch norm disabled).
    ecapa = ECAPA_TDNN(128, batch_norm=False)
    # print(ecapa)
lemas_tts/model/backbones/mmdit.py ADDED
@@ -0,0 +1,189 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import torch
13
+ from torch import nn
14
+
15
+ from x_transformers.x_transformers import RotaryEmbedding
16
+
17
+ from lemas_tts.model.modules import (
18
+ TimestepEmbedding,
19
+ ConvPositionEmbedding,
20
+ MMDiTBlock,
21
+ AdaLayerNorm_Final,
22
+ precompute_freqs_cis,
23
+ get_pos_embed_indices,
24
+ )
25
+
26
+
27
+ # text embedding
28
+
29
+
30
class TextEmbedding(nn.Module):
    """Text embedding with absolute sinusoidal position encoding.

    Index 0 is reserved as the filler token (inputs are shifted by +1), and
    embeddings at filler/pad positions can optionally be zeroed out.
    """

    def __init__(self, out_dim, text_num_embeds, mask_padding=True):
        super().__init__()
        self.text_embed = nn.Embedding(text_num_embeds + 1, out_dim)  # will use 0 as filler token

        self.mask_padding = mask_padding  # mask filler and batch padding tokens or not

        # Precomputed sinusoidal table; non-persistent buffer so it follows
        # .to(device) but is not written to checkpoints.
        self.precompute_max_pos = 1024
        self.register_buffer("freqs_cis", precompute_freqs_cis(out_dim, self.precompute_max_pos), persistent=False)

    def forward(self, text: int["b nt"], drop_text=False) -> int["b nt d"]:  # noqa: F722
        text = text + 1  # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
        if self.mask_padding:
            # Remember filler/pad positions before any CFG zeroing.
            text_mask = text == 0

        if drop_text:  # cfg for text
            text = torch.zeros_like(text)

        text = self.text_embed(text)  # b nt -> b nt d

        # sinus pos emb
        batch_start = torch.zeros((text.shape[0],), dtype=torch.long)
        batch_text_len = text.shape[1]
        pos_idx = get_pos_embed_indices(batch_start, batch_text_len, max_pos=self.precompute_max_pos)
        text_pos_embed = self.freqs_cis[pos_idx]

        text = text + text_pos_embed

        if self.mask_padding:
            # Zero embeddings at filler/pad positions so they carry no signal.
            text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)

        return text
62
+
63
+
64
+ # noised input & masked cond audio embedding
65
+
66
+
67
class AudioEmbedding(nn.Module):
    """Joint embedding of noised audio and (possibly dropped) condition audio.

    The two mel streams are concatenated along the feature axis, projected to
    the model width, and refined with a residual convolutional position
    embedding.
    """

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)
        self.conv_pos_embed = ConvPositionEmbedding(out_dim)

    def forward(self, x: float["b n d"], cond: float["b n d"], drop_audio_cond=False):  # noqa: F722
        # Classifier-free guidance: replace the condition with zeros when dropped.
        effective_cond = torch.zeros_like(cond) if drop_audio_cond else cond
        stacked = torch.cat((x, effective_cond), dim=-1)
        projected = self.linear(stacked)
        return self.conv_pos_embed(projected) + projected
80
+
81
+
82
+ # Transformer backbone using MM-DiT blocks
83
+
84
+
85
class MMDiT(nn.Module):
    """Transformer backbone built from MM-DiT blocks.

    Text (context branch, `c`) and audio (main branch, `x`) are processed
    jointly; the timestep embedding `t` modulates every block via AdaLN.
    """

    def __init__(
        self,
        *,
        dim,
        depth=8,
        heads=8,
        dim_head=64,
        dropout=0.1,
        ff_mult=4,
        mel_dim=100,
        text_num_embeds=256,
        text_mask_padding=True,
        qk_norm=None,
    ):
        super().__init__()

        self.time_embed = TimestepEmbedding(dim)
        self.text_embed = TextEmbedding(dim, text_num_embeds, mask_padding=text_mask_padding)
        self.text_cond, self.text_uncond = None, None  # text cache
        self.audio_embed = AudioEmbedding(mel_dim, dim)

        self.rotary_embed = RotaryEmbedding(dim_head)

        self.dim = dim
        self.depth = depth

        self.transformer_blocks = nn.ModuleList(
            [
                MMDiTBlock(
                    dim=dim,
                    heads=heads,
                    dim_head=dim_head,
                    dropout=dropout,
                    ff_mult=ff_mult,
                    # Last block only outputs the audio branch; the text branch
                    # ends there.
                    context_pre_only=i == depth - 1,
                    qk_norm=qk_norm,
                )
                for i in range(depth)
            ]
        )
        self.norm_out = AdaLayerNorm_Final(dim)  # final modulation
        self.proj_out = nn.Linear(dim, mel_dim)

        self.initialize_weights()

    def initialize_weights(self):
        """DiT-style init: zero the AdaLN modulation and the output layers so
        the network starts close to an identity mapping."""
        # Zero-out AdaLN layers in MMDiT blocks:
        for block in self.transformer_blocks:
            nn.init.constant_(block.attn_norm_x.linear.weight, 0)
            nn.init.constant_(block.attn_norm_x.linear.bias, 0)
            nn.init.constant_(block.attn_norm_c.linear.weight, 0)
            nn.init.constant_(block.attn_norm_c.linear.bias, 0)

        # Zero-out output layers:
        nn.init.constant_(self.norm_out.linear.weight, 0)
        nn.init.constant_(self.norm_out.linear.bias, 0)
        nn.init.constant_(self.proj_out.weight, 0)
        nn.init.constant_(self.proj_out.bias, 0)

    def clear_cache(self):
        """Drop cached text embeddings (kept across CFG/ODE sampling steps)."""
        self.text_cond, self.text_uncond = None, None

    def forward(
        self,
        x: float["b n d"],  # nosied input audio  # noqa: F722
        cond: float["b n d"],  # masked cond audio  # noqa: F722
        text: int["b nt"],  # text  # noqa: F722
        time: float["b"] | float[""],  # time step  # noqa: F821 F722
        drop_audio_cond,  # cfg for cond audio
        drop_text,  # cfg for text
        mask: bool["b n"] | None = None,  # noqa: F722
        cache=False,
    ):
        batch = x.shape[0]
        if time.ndim == 0:
            # Broadcast a scalar timestep to the whole batch.
            time = time.repeat(batch)

        # t: conditioning (time), c: context (text + masked cond audio), x: noised input audio
        t = self.time_embed(time)
        if cache:
            # Reuse text embeddings across sampling steps; invalidate with
            # clear_cache() when the text changes.
            if drop_text:
                if self.text_uncond is None:
                    self.text_uncond = self.text_embed(text, drop_text=True)
                c = self.text_uncond
            else:
                if self.text_cond is None:
                    self.text_cond = self.text_embed(text, drop_text=False)
                c = self.text_cond
        else:
            c = self.text_embed(text, drop_text=drop_text)
        x = self.audio_embed(x, cond, drop_audio_cond=drop_audio_cond)

        # Separate rotary embeddings: audio and text sequences differ in length.
        seq_len = x.shape[1]
        text_len = text.shape[1]
        rope_audio = self.rotary_embed.forward_from_seq_len(seq_len)
        rope_text = self.rotary_embed.forward_from_seq_len(text_len)

        for block in self.transformer_blocks:
            c, x = block(x, c, t, mask=mask, rope=rope_audio, c_rope=rope_text)

        x = self.norm_out(x, t)
        output = self.proj_out(x)

        return output
lemas_tts/model/backbones/prosody_encoder.py ADDED
@@ -0,0 +1,433 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Prosody encoder backbone based on the Pretssel ECAPA-TDNN architecture.
3
+
4
+ This module provides:
5
+ - ProsodyEncoder: wraps an ECAPA-TDNN model to produce utterance-level
6
+ prosody embeddings from 80-dim FBANK features.
7
+ - extract_fbank_16k: utility to compute 80-bin FBANK from 16kHz audio.
8
+
9
+ It is self-contained (no fairseq2 dependency) and can be used inside
10
+ CFM or other models as a conditioning network.
11
+ """
12
+
13
+ from __future__ import annotations
14
+
15
+ from pathlib import Path
16
+ from typing import List, Optional, Tuple
17
+ import json
18
+
19
+ import torch
20
+ import torchaudio
21
+ from torch import Tensor
22
+ from torch import nn
23
+ from torch.nn import Conv1d, LayerNorm, Module, ModuleList, ReLU, Sigmoid, Tanh, init
24
+ import torch.nn.functional as F
25
+
26
+
27
+ AUDIO_SAMPLE_RATE = 16_000
28
+
29
+
30
class ECAPA_TDNN(Module):
    """
    ECAPA-TDNN core used in Pretssel prosody encoder.

    Expects input features of shape (B, T, C) with C=80 and returns
    a normalized embedding of shape (B, embed_dim).
    """

    def __init__(
        self,
        channels: List[int],
        kernel_sizes: List[int],
        dilations: List[int],
        attention_channels: int,
        res2net_scale: int,
        se_channels: int,
        global_context: bool,
        groups: List[int],
        embed_dim: int,
        input_dim: int,
    ):
        super().__init__()
        assert len(channels) == len(kernel_sizes) == len(dilations)
        self.channels = channels
        self.embed_dim = embed_dim
        self.blocks = ModuleList()

        # Initial frame-level TDNN layer.
        self.blocks.append(
            TDNNBlock(
                input_dim,
                channels[0],
                kernel_sizes[0],
                dilations[0],
                groups[0],
            )
        )

        # SE-Res2Net layers (all entries except the first and last).
        for i in range(1, len(channels) - 1):
            self.blocks.append(
                SERes2NetBlock(
                    channels[i - 1],
                    channels[i],
                    res2net_scale=res2net_scale,
                    se_channels=se_channels,
                    kernel_size=kernel_sizes[i],
                    dilation=dilations[i],
                    groups=groups[i],
                )
            )

        # Multi-layer feature aggregation over the concatenated block outputs.
        self.mfa = TDNNBlock(
            channels[-1],
            channels[-1],
            kernel_sizes[-1],
            dilations[-1],
            groups=groups[-1],
        )

        # Attentive statistics pooling -> (B, 2*channels[-1], 1).
        self.asp = AttentiveStatisticsPooling(
            channels[-1],
            attention_channels=attention_channels,
            global_context=global_context,
        )
        self.asp_norm = LayerNorm(channels[-1] * 2, eps=1e-12)

        # Final 1x1 projection down to the embedding dimension.
        self.fc = Conv1d(
            in_channels=channels[-1] * 2,
            out_channels=embed_dim,
            kernel_size=1,
        )

        self.reset_parameters()

    def reset_parameters(self) -> None:
        """Xavier-uniform (ReLU gain) init for every Conv1d in the network."""
        def encoder_init(m: Module) -> None:
            if isinstance(m, Conv1d):
                init.xavier_uniform_(m.weight, init.calculate_gain("relu"))

        self.apply(encoder_init)

    def forward(
        self,
        x: Tensor,
        padding_mask: Optional[Tensor] = None,
    ) -> Tensor:
        # x: (B, T, C)
        x = x.transpose(1, 2)  # (B, C, T)

        xl = []
        for layer in self.blocks:
            x = layer(x, padding_mask=padding_mask)
            xl.append(x)

        # Aggregate all SE-Res2Net outputs (skip the initial TDNN output).
        x = torch.cat(xl[1:], dim=1)
        x = self.mfa(x)

        x = self.asp(x, padding_mask=padding_mask)
        # LayerNorm operates on the channel axis, hence the transposes.
        x = self.asp_norm(x.transpose(1, 2)).transpose(1, 2)

        x = self.fc(x)

        x = x.transpose(1, 2).squeeze(1)  # (B, embed_dim)
        # L2-normalize so embeddings lie on the unit hypersphere.
        return F.normalize(x, dim=-1)
133
+
134
+
135
class TDNNBlock(Module):
    """1-D dilated convolution followed by ReLU and LayerNorm over channels."""

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        dilation: int,
        groups: int = 1,
    ):
        super().__init__()
        # "same" padding for the dilated kernel.
        same_pad = dilation * (kernel_size - 1) // 2
        self.conv = Conv1d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            dilation=dilation,
            padding=same_pad,
            groups=groups,
        )
        self.activation = ReLU()
        self.norm = LayerNorm(out_channels, eps=1e-12)

    def forward(self, x: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
        # x: (B, C, T). Convolve + activate, then LayerNorm over channels.
        out = self.activation(self.conv(x))
        out = out.transpose(1, 2)   # (B, T, C) for LayerNorm
        out = self.norm(out)
        return out.transpose(1, 2)  # back to (B, C, T)
159
+
160
+
161
class Res2NetBlock(Module):
    """Res2Net module: split channels into `scale` groups and process them
    hierarchically, feeding each group's output into the next one."""

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        scale: int = 8,
        kernel_size: int = 3,
        dilation: int = 1,
    ):
        super().__init__()
        assert in_channels % scale == 0
        assert out_channels % scale == 0

        width_in = in_channels // scale
        width_hidden = out_channels // scale
        # One TDNN branch per group except the first, which passes through.
        self.blocks = ModuleList(
            [
                TDNNBlock(
                    width_in,
                    width_hidden,
                    kernel_size=kernel_size,
                    dilation=dilation,
                )
                for _ in range(scale - 1)
            ]
        )
        self.scale = scale

    def forward(self, x: Tensor) -> Tensor:
        chunks = torch.chunk(x, self.scale, dim=1)
        outputs = [chunks[0]]  # group 0 is an identity path
        prev = None
        for idx, chunk in enumerate(chunks[1:], start=1):
            branch = self.blocks[idx - 1]
            # Each later group also receives the previous group's output.
            prev = branch(chunk) if idx == 1 else branch(chunk + prev)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)
200
+
201
+
202
class SEBlock(Module):
    """Squeeze-and-Excitation: rescale channels with a gate computed from the
    (optionally length-masked) temporal mean."""

    def __init__(
        self,
        in_channels: int,
        se_channels: int,
        out_channels: int,
    ):
        super().__init__()
        self.conv1 = Conv1d(in_channels=in_channels, out_channels=se_channels, kernel_size=1)
        self.relu = ReLU(inplace=True)
        self.conv2 = Conv1d(in_channels=se_channels, out_channels=out_channels, kernel_size=1)
        self.sigmoid = Sigmoid()

    def forward(self, x: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
        # Squeeze: average over time, ignoring padded frames when a mask is given.
        if padding_mask is None:
            context = x.mean(dim=2, keepdim=True)
        else:
            # padding_mask: (B, T) with 1 for valid, 0 for pad
            valid = padding_mask.unsqueeze(1)  # (B, 1, T)
            frame_counts = valid.sum(dim=2, keepdim=True)
            context = (x * valid).sum(dim=2, keepdim=True) / torch.clamp(frame_counts, min=1.0)

        # Excite: bottleneck producing a per-channel gate in (0, 1).
        gate = self.relu(self.conv1(context))
        gate = self.sigmoid(self.conv2(gate))
        return gate * x
227
+
228
+
229
class AttentiveStatisticsPooling(Module):
    """Attentive statistics pooling: attention-weighted mean and std over time,
    summarizing a (N, C, L) input as a (N, 2*C, 1) utterance-level vector."""

    def __init__(
        self, channels: int, attention_channels: int = 128, global_context: bool = True
    ):
        super().__init__()
        self.eps = 1e-12
        self.global_context = global_context
        if global_context:
            # Attention also sees utterance-level mean/std, hence 3x channels in.
            self.tdnn = TDNNBlock(channels * 3, attention_channels, 1, 1)
        else:
            self.tdnn = TDNNBlock(channels, attention_channels, 1, 1)

        self.tanh = Tanh()
        self.conv = Conv1d(in_channels=attention_channels, out_channels=channels, kernel_size=1)

    def forward(self, x: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
        # x: (N, C, L)
        N, C, L = x.shape

        def _compute_statistics(
            x: Tensor, m: Tensor, dim: int = 2, eps: float = 1e-12
        ) -> Tuple[Tensor, Tensor]:
            # Weighted mean/std; clamp keeps sqrt away from zero/negative values.
            mean = (m * x).sum(dim)
            std = torch.sqrt((m * (x - mean.unsqueeze(dim)).pow(2)).sum(dim).clamp(eps))
            return mean, std

        if padding_mask is not None:
            mask = padding_mask
        else:
            mask = torch.ones(N, L, device=x.device, dtype=x.dtype)
        mask = mask.unsqueeze(1)  # (N, 1, L)

        if self.global_context:
            # Append utterance-level mean/std so attention can use global context.
            total = mask.sum(dim=2, keepdim=True).to(x)
            mean, std = _compute_statistics(x, mask / total)
            mean = mean.unsqueeze(2).repeat(1, 1, L)
            std = std.unsqueeze(2).repeat(1, 1, L)
            attn = torch.cat([x, mean, std], dim=1)
        else:
            attn = x

        attn = self.conv(self.tanh(self.tdnn(attn)))

        # Padded frames get -inf so softmax assigns them zero weight.
        attn = attn.masked_fill(mask == 0, float("-inf"))

        attn = F.softmax(attn, dim=2)
        mean, std = _compute_statistics(x, attn)
        pooled_stats = torch.cat((mean, std), dim=1)
        pooled_stats = pooled_stats.unsqueeze(2)
        return pooled_stats
279
+
280
+
281
class SERes2NetBlock(Module):
    """Pretssel SE-Res2Net block: 1x1 TDNN -> Res2Net -> 1x1 TDNN -> SE,
    plus a residual path (1x1 conv when channel counts differ)."""

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        res2net_scale: int = 8,
        se_channels: int = 128,
        kernel_size: int = 1,
        dilation: int = 1,
        groups: int = 1,
    ):
        super().__init__()
        self.out_channels = out_channels
        self.tdnn1 = TDNNBlock(
            in_channels,
            out_channels,
            kernel_size=1,
            dilation=1,
            groups=groups,
        )
        self.res2net_block = Res2NetBlock(
            out_channels,
            out_channels,
            res2net_scale,
            kernel_size,
            dilation,
        )
        self.tdnn2 = TDNNBlock(
            out_channels,
            out_channels,
            kernel_size=1,
            dilation=1,
            groups=groups,
        )
        self.se_block = SEBlock(out_channels, se_channels, out_channels)

        # Only project the residual when the channel count changes.
        self.shortcut = (
            Conv1d(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=1,
            )
            if in_channels != out_channels
            else None
        )

    def forward(self, x: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
        residual = x if self.shortcut is None else self.shortcut(x)

        out = self.tdnn1(x)
        out = self.res2net_block(out)
        out = self.tdnn2(out)
        out = self.se_block(out, padding_mask=padding_mask)
        return out + residual
335
+
336
+
337
def extract_fbank_16k(audio_16k: Tensor) -> Tensor:
    """
    Compute 80-dim FBANK features from 16kHz audio.

    Args:
        audio_16k: Tensor of shape (T,) or (1, T)
    Returns:
        fbank: Tensor of shape (T_fbank, 80)
    """
    if audio_16k.ndim == 1:
        audio_16k = audio_16k.unsqueeze(0)

    # kaldi.fbank needs at least one full analysis window
    # (default 25 ms @ 16 kHz -> 400 samples); tile short clips until they fit.
    min_len = 400
    num_samples = audio_16k.shape[-1]
    if num_samples < min_len:
        # audio_16k is guaranteed 2-D here (see unsqueeze above), so the
        # former 1-D repeat branch was dead code and has been removed.
        repeat_times = min_len // num_samples + 1
        audio_16k = audio_16k.repeat(1, repeat_times)

    fbank = torchaudio.compliance.kaldi.fbank(
        audio_16k,
        num_mel_bins=80,
        sample_frequency=AUDIO_SAMPLE_RATE,
    )
    return fbank
362
+
363
+
364
class ProsodyEncoder(nn.Module):
    """
    High-level wrapper for the Pretssel prosody encoder.

    Usage:
        encoder = ProsodyEncoder(cfg_path, ckpt_path, freeze=True)
        emb = encoder(fbank_batch)  # (B, 512)
    """

    def __init__(self, cfg_path: Path, ckpt_path: Path, freeze: bool = True):
        super().__init__()
        model_cfg = self._load_pretssel_model_cfg(cfg_path)
        self.encoder = self._build_prosody_encoder(model_cfg)
        self._load_prosody_encoder_state(self.encoder, ckpt_path)
        if freeze:
            # Freeze the encoder so it acts as a fixed feature extractor.
            for p in self.encoder.parameters():
                p.requires_grad = False

    @staticmethod
    def _load_pretssel_model_cfg(cfg_path: Path) -> dict:
        """Read a Pretssel JSON config and return its top-level 'model' dict."""
        cfg = json.loads(cfg_path.read_text())
        if "model" not in cfg:
            raise ValueError(f"{cfg_path} does not contain a top-level 'model' key.")
        return cfg["model"]

    @staticmethod
    def _build_prosody_encoder(model_cfg: dict) -> ECAPA_TDNN:
        """Instantiate the ECAPA-TDNN from the 'prosody_*' config entries."""
        encoder = ECAPA_TDNN(
            channels=model_cfg["prosody_channels"],
            kernel_sizes=model_cfg["prosody_kernel_sizes"],
            dilations=model_cfg["prosody_dilations"],
            attention_channels=model_cfg["prosody_attention_channels"],
            res2net_scale=model_cfg["prosody_res2net_scale"],
            se_channels=model_cfg["prosody_se_channels"],
            global_context=model_cfg["prosody_global_context"],
            groups=model_cfg["prosody_groups"],
            embed_dim=model_cfg["prosody_embed_dim"],
            input_dim=model_cfg["input_feat_per_channel"],
        )
        return encoder

    @staticmethod
    def _load_prosody_encoder_state(model: Module, ckpt_path: Path) -> None:
        """Load encoder weights from a checkpoint.

        If the state dict carries 'prosody_encoder.' / 'prosody_encoder_model.'
        prefixes (i.e. it was saved from a larger model), keep only those
        entries with the prefix stripped. Loading is strict in effect:
        ``strict=False`` is used only to collect the key diff, and any
        missing/unexpected key raises RuntimeError.
        """
        state = torch.load(ckpt_path, map_location="cpu")
        if isinstance(state, dict):
            if all(isinstance(k, str) for k in state.keys()) and (
                any(k.startswith("prosody_encoder.") for k in state.keys())
                or any(k.startswith("prosody_encoder_model.") for k in state.keys())
            ):
                # Keep only prosody-encoder entries, dropping their prefix.
                state = {
                    k.replace("prosody_encoder_model.", "", 1).replace("prosody_encoder.", "", 1): v
                    for k, v in state.items()
                    if k.startswith("prosody_encoder.") or k.startswith("prosody_encoder_model.")
                }
        missing, unexpected = model.load_state_dict(state, strict=False)
        if missing or unexpected:
            raise RuntimeError(
                f"Error loading checkpoint {ckpt_path}: missing keys={missing}, "
                f"unexpected keys={unexpected}"
            )

    def forward(self, fbank: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
        """
        Args:
            fbank: Tensor of shape (B, T, 80)
            padding_mask: Optional tensor of shape (B, T) with 1 for valid.
        Returns:
            emb: Tensor of shape (B, 512)
        """
        return self.encoder(fbank, padding_mask=padding_mask)
lemas_tts/model/backbones/unett.py ADDED
@@ -0,0 +1,250 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+ from typing import Literal
12
+
13
+ import torch
14
+ from torch import nn
15
+ import torch.nn.functional as F
16
+
17
+ from x_transformers import RMSNorm
18
+ from x_transformers.x_transformers import RotaryEmbedding
19
+
20
+ from lemas_tts.model.modules import (
21
+ TimestepEmbedding,
22
+ ConvNeXtV2Block,
23
+ ConvPositionEmbedding,
24
+ Attention,
25
+ AttnProcessor,
26
+ FeedForward,
27
+ precompute_freqs_cis,
28
+ get_pos_embed_indices,
29
+ )
30
+
31
+
32
+ # Text embedding
33
+
34
+
35
+ class TextEmbedding(nn.Module):
36
+ def __init__(self, text_num_embeds, text_dim, mask_padding=True, conv_layers=0, conv_mult=2):
37
+ super().__init__()
38
+ self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
39
+
40
+ self.mask_padding = mask_padding # mask filler and batch padding tokens or not
41
+
42
+ if conv_layers > 0:
43
+ self.extra_modeling = True
44
+ self.precompute_max_pos = 4096 # ~44s of 24khz audio
45
+ self.register_buffer("freqs_cis", precompute_freqs_cis(text_dim, self.precompute_max_pos), persistent=False)
46
+ self.text_blocks = nn.Sequential(
47
+ *[ConvNeXtV2Block(text_dim, text_dim * conv_mult) for _ in range(conv_layers)]
48
+ )
49
+ else:
50
+ self.extra_modeling = False
51
+
52
+ def forward(self, text: int["b nt"], seq_len, drop_text=False): # noqa: F722
53
+ text = text + 1 # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
54
+ text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
55
+ batch, text_len = text.shape[0], text.shape[1]
56
+ text = F.pad(text, (0, seq_len - text_len), value=0)
57
+ if self.mask_padding:
58
+ text_mask = text == 0
59
+
60
+ if drop_text: # cfg for text
61
+ text = torch.zeros_like(text)
62
+
63
+ text = self.text_embed(text) # b n -> b n d
64
+
65
+ # possible extra modeling
66
+ if self.extra_modeling:
67
+ # sinus pos emb
68
+ batch_start = torch.zeros((batch,), dtype=torch.long)
69
+ pos_idx = get_pos_embed_indices(batch_start, seq_len, max_pos=self.precompute_max_pos)
70
+ text_pos_embed = self.freqs_cis[pos_idx]
71
+ text = text + text_pos_embed
72
+
73
+ # convnextv2 blocks
74
+ if self.mask_padding:
75
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
76
+ for block in self.text_blocks:
77
+ text = block(text)
78
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
79
+ else:
80
+ text = self.text_blocks(text)
81
+
82
+ return text
83
+
84
+
85
+ # noised input audio and context mixing embedding
86
+
87
+
88
class InputEmbedding(nn.Module):
    """Fuse noised mel, condition mel, and text embedding into model width,
    then add a residual convolutional position embedding."""

    def __init__(self, mel_dim, text_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(mel_dim * 2 + text_dim, out_dim)
        self.conv_pos_embed = ConvPositionEmbedding(dim=out_dim)

    def forward(self, x: float["b n d"], cond: float["b n d"], text_embed: float["b n d"], drop_audio_cond=False):  # noqa: F722
        # Classifier-free guidance: zero the audio condition when dropped.
        effective_cond = torch.zeros_like(cond) if drop_audio_cond else cond
        fused = self.proj(torch.cat((x, effective_cond, text_embed), dim=-1))
        return self.conv_pos_embed(fused) + fused
+
102
+
103
+ # Flat UNet Transformer backbone
104
+
105
+
106
+ class UNetT(nn.Module):
107
+ def __init__(
108
+ self,
109
+ *,
110
+ dim,
111
+ depth=8,
112
+ heads=8,
113
+ dim_head=64,
114
+ dropout=0.1,
115
+ ff_mult=4,
116
+ mel_dim=100,
117
+ text_num_embeds=256,
118
+ text_dim=None,
119
+ text_mask_padding=True,
120
+ qk_norm=None,
121
+ conv_layers=0,
122
+ pe_attn_head=None,
123
+ skip_connect_type: Literal["add", "concat", "none"] = "concat",
124
+ ):
125
+ super().__init__()
126
+ assert depth % 2 == 0, "UNet-Transformer's depth should be even."
127
+
128
+ self.time_embed = TimestepEmbedding(dim)
129
+ if text_dim is None:
130
+ text_dim = mel_dim
131
+ self.text_embed = TextEmbedding(
132
+ text_num_embeds, text_dim, mask_padding=text_mask_padding, conv_layers=conv_layers
133
+ )
134
+ self.text_cond, self.text_uncond = None, None # text cache
135
+ self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
136
+
137
+ self.rotary_embed = RotaryEmbedding(dim_head)
138
+
139
+ # transformer layers & skip connections
140
+
141
+ self.dim = dim
142
+ self.skip_connect_type = skip_connect_type
143
+ needs_skip_proj = skip_connect_type == "concat"
144
+
145
+ self.depth = depth
146
+ self.layers = nn.ModuleList([])
147
+
148
+ for idx in range(depth):
149
+ is_later_half = idx >= (depth // 2)
150
+
151
+ attn_norm = RMSNorm(dim)
152
+ attn = Attention(
153
+ processor=AttnProcessor(pe_attn_head=pe_attn_head),
154
+ dim=dim,
155
+ heads=heads,
156
+ dim_head=dim_head,
157
+ dropout=dropout,
158
+ qk_norm=qk_norm,
159
+ )
160
+
161
+ ff_norm = RMSNorm(dim)
162
+ ff = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
163
+
164
+ skip_proj = nn.Linear(dim * 2, dim, bias=False) if needs_skip_proj and is_later_half else None
165
+
166
+ self.layers.append(
167
+ nn.ModuleList(
168
+ [
169
+ skip_proj,
170
+ attn_norm,
171
+ attn,
172
+ ff_norm,
173
+ ff,
174
+ ]
175
+ )
176
+ )
177
+
178
+ self.norm_out = RMSNorm(dim)
179
+ self.proj_out = nn.Linear(dim, mel_dim)
180
+
181
+ def clear_cache(self):
182
+ self.text_cond, self.text_uncond = None, None
183
+
184
+ def forward(
185
+ self,
186
+ x: float["b n d"], # nosied input audio # noqa: F722
187
+ cond: float["b n d"], # masked cond audio # noqa: F722
188
+ text: int["b nt"], # text # noqa: F722
189
+ time: float["b"] | float[""], # time step # noqa: F821 F722
190
+ drop_audio_cond, # cfg for cond audio
191
+ drop_text, # cfg for text
192
+ mask: bool["b n"] | None = None, # noqa: F722
193
+ cache=False,
194
+ ):
195
+ batch, seq_len = x.shape[0], x.shape[1]
196
+ if time.ndim == 0:
197
+ time = time.repeat(batch)
198
+
199
+ # t: conditioning time, c: context (text + masked cond audio), x: noised input audio
200
+ t = self.time_embed(time)
201
+ if cache:
202
+ if drop_text:
203
+ if self.text_uncond is None:
204
+ self.text_uncond = self.text_embed(text, seq_len, drop_text=True)
205
+ text_embed = self.text_uncond
206
+ else:
207
+ if self.text_cond is None:
208
+ self.text_cond = self.text_embed(text, seq_len, drop_text=False)
209
+ text_embed = self.text_cond
210
+ else:
211
+ text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
212
+ x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
213
+
214
+ # postfix time t to input x, [b n d] -> [b n+1 d]
215
+ x = torch.cat([t.unsqueeze(1), x], dim=1) # pack t to x
216
+ if mask is not None:
217
+ mask = F.pad(mask, (1, 0), value=1)
218
+
219
+ rope = self.rotary_embed.forward_from_seq_len(seq_len + 1)
220
+
221
+ # flat unet transformer
222
+ skip_connect_type = self.skip_connect_type
223
+ skips = []
224
+ for idx, (maybe_skip_proj, attn_norm, attn, ff_norm, ff) in enumerate(self.layers):
225
+ layer = idx + 1
226
+
227
+ # skip connection logic
228
+ is_first_half = layer <= (self.depth // 2)
229
+ is_later_half = not is_first_half
230
+
231
+ if is_first_half:
232
+ skips.append(x)
233
+
234
+ if is_later_half:
235
+ skip = skips.pop()
236
+ if skip_connect_type == "concat":
237
+ x = torch.cat((x, skip), dim=-1)
238
+ x = maybe_skip_proj(x)
239
+ elif skip_connect_type == "add":
240
+ x = x + skip
241
+
242
+ # attention and feedforward blocks
243
+ x = attn(attn_norm(x), rope=rope, mask=mask) + x
244
+ x = ff(ff_norm(x)) + x
245
+
246
+ assert len(skips) == 0
247
+
248
+ x = self.norm_out(x)[:, 1:, :] # unpack t from x
249
+
250
+ return self.proj_out(x)
lemas_tts/model/cfm.py ADDED
@@ -0,0 +1,899 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ from random import random
13
+ import random as _random
14
+ from typing import Callable, Dict, OrderedDict
15
+ import math
16
+ from pathlib import Path
17
+
18
+ import torch
19
+ import torch.nn.functional as F
20
+ import torchaudio
21
+ from torch import nn
22
+ from torch.nn.utils.rnn import pad_sequence
23
+ from torchdiffeq import odeint
24
+
25
+ from lemas_tts.model.modules import MelSpec
26
+ from lemas_tts.model.modules import MIEsitmator, AccentClassifier, grad_reverse
27
+ from lemas_tts.model.backbones.ecapa_tdnn import ECAPA_TDNN
28
+ from lemas_tts.model.backbones.prosody_encoder import ProsodyEncoder, extract_fbank_16k
29
+ from lemas_tts.model.utils import (
30
+ default,
31
+ exists,
32
+ lens_to_mask,
33
+ list_str_to_idx,
34
+ list_str_to_tensor,
35
+ mask_from_frac_lengths,
36
+ )
37
+
38
+
39
def clip_and_shuffle(mel, mel_len, sample_rate=24000, hop_length=256, ratio=None):
    """
    Randomly clip a mel-spectrogram segment and shuffle ~1-second chunks to
    create an accent-invariant conditioning segment.

    This is an inference-time utility used by the accent GRL path.

    Args:
        mel: tensor of shape [n_mels, T].
        mel_len: int, original mel length (T).
        sample_rate: audio sample rate used to derive frames-per-second.
        hop_length: mel hop length in samples.
        ratio: optional float in (0, 1]; when given, the clipped segment is
            exactly ``ratio * mel_len`` frames, otherwise a random 25%-75%.

    Returns:
        Tensor with the same shape as ``mel``, built from shuffled chunks.
    """
    frames_per_second = int(sample_rate / hop_length)  # ~93 frames/second at 24 kHz / 256 hop

    # ---- 1. Randomly crop 25%~75% of the original length (or ratio * length) ----
    total_len = mel_len
    if not ratio:
        seg_len = _random.randint(int(0.25 * total_len), int(0.75 * total_len))
    else:
        seg_len = int(total_len * ratio)
    # Guard against a degenerate zero-length segment on very short inputs,
    # which would leave `chunks` empty and make torch.cat below raise.
    seg_len = max(1, min(seg_len, total_len))
    start = _random.randint(0, max(0, total_len - seg_len))
    mel_seg = mel[:, start : start + seg_len]

    # ---- 2. Split into ~1-second chunks ----
    n_chunks = (mel_seg.size(1) + frames_per_second - 1) // frames_per_second
    chunks = [
        mel_seg[:, i * frames_per_second : (i + 1) * frames_per_second]
        for i in range(n_chunks)
    ]

    # ---- 3. Shuffle chunk order ----
    _random.shuffle(chunks)
    shuffled_mel = torch.cat(chunks, dim=1)

    # ---- 4. Repeat random chunks until reaching the original length ----
    if shuffled_mel.size(1) < total_len:
        repeat_chunks = []
        # Only the shortfall needs to be filled (the original compared against
        # total_len, appending more chunks than necessary before trimming).
        needed = total_len - shuffled_mel.size(1)
        while sum(c.size(1) for c in repeat_chunks) < needed:
            repeat_chunks.append(_random.choice(chunks))
        shuffled_mel = torch.cat([shuffled_mel] + repeat_chunks, dim=1)

    # ---- 5. Trim to exactly mel_len ----
    shuffled_mel = shuffled_mel[:, :total_len]
    assert shuffled_mel.shape == mel.shape, f"shuffled_mel.shape != mel.shape: {shuffled_mel.shape} != {mel.shape}"

    return shuffled_mel
84
+
85
+ class CFM(nn.Module):
86
    def __init__(
        self,
        transformer: nn.Module,
        sigma=0.0,
        odeint_kwargs: dict = dict(
            # atol = 1e-5,
            # rtol = 1e-5,
            method="euler"  # 'midpoint'
        ),
        audio_drop_prob=0.3,
        text_drop_prob=0.1,
        num_channels=None,
        mel_spec_module: nn.Module | None = None,
        mel_spec_kwargs: dict = dict(),
        frac_lengths_mask: tuple[float, float] = (0.7, 1.0),
        vocab_char_map: dict[str, int] | None = None,
        use_ctc_loss: bool = False,
        use_spk_enc: bool = False,
        use_prosody_encoder: bool = False,
        prosody_cfg_path: str | None = None,
        prosody_ckpt_path: str | None = None,
    ):
        """
        Conditional flow-matching TTS model wrapping a transformer backbone.

        Args:
            transformer: flow-prediction backbone; must expose a ``.dim`` attribute.
            sigma: conditional-flow noise scale (stored on the instance; 0.0 here).
            odeint_kwargs: options forwarded to ``torchdiffeq.odeint`` at sampling time.
            audio_drop_prob: probability of dropping the audio condition (CFG training).
            text_drop_prob: probability of dropping text (and audio) condition (CFG training).
            num_channels: mel channel count; defaults to the mel module's ``n_mel_channels``.
            mel_spec_module: optional pre-built mel extractor; a ``MelSpec`` is built otherwise.
            mel_spec_kwargs: kwargs for the default ``MelSpec`` when no module is given.
            frac_lengths_mask: (min, max) fraction range for the random training mask span.
            vocab_char_map: char -> index map used to tokenize text.
            use_ctc_loss: enable the auxiliary MI/CTC estimator loss.
            use_spk_enc: enable the ECAPA-TDNN speaker-encoder branch.
            use_prosody_encoder: enable the prosody-encoder branch (only effective
                when both ``prosody_cfg_path`` and ``prosody_ckpt_path`` are given).
            prosody_cfg_path: config file path for the prosody encoder.
            prosody_ckpt_path: checkpoint path for the prosody encoder.
        """
        super().__init__()

        self.frac_lengths_mask = frac_lengths_mask

        # mel spec
        self.mel_spec = default(mel_spec_module, MelSpec(**mel_spec_kwargs))
        num_channels = default(num_channels, self.mel_spec.n_mel_channels)
        self.num_channels = num_channels

        # classifier-free guidance
        self.audio_drop_prob = audio_drop_prob
        self.text_drop_prob = text_drop_prob

        # transformer
        self.transformer = transformer
        dim = transformer.dim
        self.dim = dim

        # conditional flow related
        self.sigma = sigma

        # sampling related
        self.odeint_kwargs = odeint_kwargs

        # vocab map for tokenization
        self.vocab_char_map = vocab_char_map

        # Prosody encoder (Pretssel ECAPA-TDNN); only enabled when both paths exist.
        self.use_prosody_encoder = (
            use_prosody_encoder and prosody_cfg_path is not None and prosody_ckpt_path is not None
        )
        if self.use_prosody_encoder:
            cfg_path = Path(prosody_cfg_path)
            ckpt_path = Path(prosody_ckpt_path)
            # frozen: prosody encoder weights are not trained here
            self.prosody_encoder = ProsodyEncoder(cfg_path, ckpt_path, freeze=True)
            # 512-d prosody -> mel channel dimension
            self.prosody_to_mel = nn.Linear(512, self.num_channels)
            self.prosody_dropout = nn.Dropout(p=0.2)
        else:
            self.prosody_encoder = None

        # Speaker encoder
        self.use_spk_enc = use_spk_enc
        if use_spk_enc:
            self.speaker_encoder = ECAPA_TDNN(
                self.num_channels,
                self.dim,
                channels=[512, 512, 512, 512, 1536],
                kernel_sizes=[5, 3, 3, 3, 1],
                dilations=[1, 2, 3, 4, 1],
                attention_channels=128,
                res2net_scale=4,
                se_channels=128,
                global_context=True,
                batch_norm=True,
            )

        self.use_ctc_loss = use_ctc_loss
        if use_ctc_loss:
            self.ctc = MIEsitmator(len(self.vocab_char_map), self.num_channels, self.dim, dropout=self.text_drop_prob)

        # Accent classifier fed through a gradient-reversal layer (GRL) during
        # training to encourage accent-invariant conditioning.
        # NOTE(review): num_accents=12 is hard-coded — presumably the number of
        # language/accent classes in the training data; confirm against dataset.
        self.accent_classifier = AccentClassifier(input_dim=self.num_channels, hidden_dim=self.dim, num_accents=12)
        self.accent_criterion = nn.CrossEntropyLoss()
173
+
174
+ def load_partial_weights(self, model: nn.Module,
175
+ ckpt_path: str,
176
+ device="cpu",
177
+ verbose=True) -> int:
178
+ """
179
+ 仅加载形状匹配的参数,其余跳过。
180
+ 返回成功加载的参数数量。
181
+ """
182
+ state_dict = torch.load(ckpt_path, map_location=device)
183
+ model_dict = model.state_dict()
184
+
185
+ ok_count = 0
186
+ new_dict: OrderedDict[str, torch.Tensor] = OrderedDict()
187
+
188
+ for k, v in state_dict.items():
189
+ if k in model_dict and v.shape == model_dict[k].shape:
190
+ new_dict[k] = v
191
+ ok_count += 1
192
+ else:
193
+ if verbose:
194
+ print(f"[SKIP] {k} ckpt:{v.shape} model:{model_dict[k].shape if k in model_dict else 'N/A'}")
195
+
196
+ model_dict.update(new_dict)
197
+ model.load_state_dict(model_dict)
198
+ if verbose:
199
+ print(f"=> 成功加载 {ok_count}/{len(state_dict)} 个参数")
200
+ return ok_count
201
+
202
    @property
    def device(self):
        # Device of the first parameter; assumes the module has at least one parameter.
        return next(self.parameters()).device
205
+
206
    @torch.no_grad()
    def sample(
        self,
        cond: float["b n d"] | float["b nw"],  # noqa: F722
        text: int["b nt"] | list[str],  # noqa: F722
        duration: int | int["b"],  # noqa: F821
        *,
        lens: int["b"] | None = None,  # noqa: F821
        steps=32,
        cfg_strength=1.0,
        sway_sampling_coef=None,
        seed: int | None = None,
        max_duration=4096,
        vocoder: Callable[[float["b d n"]], float["b nw"]] | None = None,  # noqa: F722
        no_ref_audio=False,
        duplicate_test=False,
        t_inter=0.1,
        edit_mask=None,
        use_acc_grl = True,
        use_prosody_encoder = True,
        ref_ratio = 1,
    ):
        """
        Sample mel frames (or a waveform, when a vocoder is supplied) by
        integrating the learned flow ODE, conditioned on a reference
        audio/mel prompt and text.

        Args:
            cond: reference prompt — raw wave (b, nw) or mel (b, n, d).
            text: token ids or list of strings (tokenized via vocab_char_map).
            duration: total output length in mel frames (scalar or per-sample).
            lens: valid prompt lengths; defaults to the full prompt.
            steps: number of ODE steps.
            cfg_strength: classifier-free guidance scale.
            sway_sampling_coef: optional exponent offset for the time grid;
                clamped by a numerically safe maximum.
            seed: per-sample RNG seed for the initial noise.
            max_duration: hard cap on output frames.
            vocoder: optional mel -> waveform callable.
            no_ref_audio: replace the prompt with matched-mean noise.
            duplicate_test: debug mode starting integration at t_inter.
            edit_mask: extra mask ANDed into the prompt mask (for editing).
            use_acc_grl: route the condition through the accent GRL.
            use_prosody_encoder: apply the global prosody embedding (if built).
            ref_ratio: < 1 shuffles a clipped fraction of the prompt as condition.

        Returns:
            (out, trajectory): generated mel [b n d] (or waveform after the
            vocoder) and the full ODE trajectory.
        """
        self.eval()

        # raw wave -> mel, keep a copy for prosody encoder if available
        raw_audio = None
        if cond.ndim == 2:
            raw_audio = cond.clone()  # (B, nw)
            cond = self.mel_spec(cond)
            cond = cond.permute(0, 2, 1)
            assert cond.shape[-1] == self.num_channels

        cond = cond.to(next(self.parameters()).dtype)
        # per-sample mean over time, used later to renormalize generated frames
        cond_mean = cond.mean(dim=1, keepdim=True)
        batch, cond_seq_len, device = *cond.shape[:2], cond.device
        if not exists(lens):
            lens = torch.full((batch,), cond_seq_len, device=device, dtype=torch.long)

        # optional global prosody conditioning at inference (one embedding per sample)
        prosody_mel_cond = None
        prosody_text_cond = None
        prosody_embeds = None
        if self.prosody_encoder is not None and raw_audio is not None and use_prosody_encoder:
            embeds = []
            for b in range(batch):
                audio_b = raw_audio[b].unsqueeze(0)  # (1, nw)
                src_sr = self.mel_spec.target_sample_rate
                # prosody encoder expects 16 kHz fbank features
                if src_sr != 16_000:
                    audio_16k = torchaudio.functional.resample(
                        audio_b, src_sr, 16_000
                    ).squeeze(0)
                else:
                    audio_16k = audio_b.squeeze(0)
                fbank = extract_fbank_16k(audio_16k)
                fbank = fbank.unsqueeze(0).to(device=device, dtype=cond.dtype)
                emb = self.prosody_encoder(fbank, padding_mask=None)[0]  # (512,)
                embeds.append(emb)
            prosody_embeds = torch.stack(embeds, dim=0)  # (B, 512)
            # broadcast along mel and text
            prosody_mel_cond = prosody_embeds[:, None, :].expand(-1, cond_seq_len, -1)

        if use_acc_grl:
            # ref_ratio < 1: condition on a clipped + chunk-shuffled fraction of
            # the prompt (accent-invariant); otherwise use the full prompt.
            # NOTE(review): clip_and_shuffle path assumes batch == 1 (squeeze(0)).
            if ref_ratio < 1:
                rand_mel = clip_and_shuffle(cond.permute(0, 2, 1).squeeze(0), cond.shape[1], ratio=ref_ratio)
                rand_mel = rand_mel.unsqueeze(0).permute(0, 2, 1)
                assert rand_mel.shape == cond.shape, f"Shape diff: rand_mel.shape: {rand_mel.shape}, cond.shape: {cond.shape}"
                cond_grl = grad_reverse(rand_mel, lambda_=1.0)
            else:
                # grad_reverse is identity in forward; relevant only if grads flow
                cond_grl = grad_reverse(cond, lambda_=1.0)

        # text

        if isinstance(text, list):
            if exists(self.vocab_char_map):
                text = list_str_to_idx(text, self.vocab_char_map).to(device)
            else:
                text = list_str_to_tensor(text).to(device)
            assert text.shape[0] == batch

        # duration

        cond_mask = lens_to_mask(lens)
        if edit_mask is not None:
            cond_mask = cond_mask & edit_mask

        if isinstance(duration, int):
            duration = torch.full((batch,), duration, device=device, dtype=torch.long)

        duration = torch.maximum(
            torch.maximum((text != -1).sum(dim=-1), lens) + 1, duration
        )  # duration at least text/audio prompt length plus one token, so something is generated
        # clamp and convert max_duration to python int for padding ops
        duration = duration.clamp(max=max_duration)
        max_duration = int(duration.amax().item())

        # duplicate test corner for inner time step oberservation
        if duplicate_test:
            test_cond = F.pad(cond, (0, 0, cond_seq_len, max_duration - 2 * cond_seq_len), value=0.0)

        cond = F.pad(cond, (0, 0, 0, max_duration - cond_seq_len), value=0.0)

        if prosody_mel_cond is not None:
            prosody_mel_cond = F.pad(
                prosody_mel_cond, (0, 0, 0, max_duration - cond_seq_len), value=0.0
            )
            # additive prosody conditioning in mel space
            prosody_mel_proj = self.prosody_to_mel(prosody_mel_cond)
            cond = cond + prosody_mel_proj

        if no_ref_audio:
            # replace the prompt with low-variance noise whose per-channel mean
            # matches the original prompt mean
            random_cond = torch.randn_like(cond) * 0.1 + cond_mean
            random_cond = random_cond / random_cond.mean(dim=1, keepdim=True) * cond_mean
            print("cond:", cond.mean(), cond.max(), cond.min(), "random_cond:", random_cond.mean(), random_cond.max(), random_cond.min(), "mean_cond:", cond_mean.shape)
            cond = random_cond

        cond_mask = F.pad(cond_mask, (0, max_duration - cond_mask.shape[-1]), value=False)
        cond_mask = cond_mask.unsqueeze(-1)

        if use_acc_grl:
            cond_grl = F.pad(cond_grl, (0, 0, 0, max_duration - cond_seq_len), value=0.0)

        # NOTE: this outer step_cond is shadowed inside fn(); kept because its
        # dtype is used below when building y0.
        step_cond = torch.where(cond_mask, cond, torch.zeros_like(cond))  # allow direct control (cut cond audio) with lens passed in

        if batch > 1:
            mask = lens_to_mask(duration)
        else:  # save memory and speed up, as single inference need no mask currently
            mask = None

        # neural ode

        def compute_sway_max(steps: int,
                             t_start: float = 0.0,
                             dtype=torch.float32,
                             min_ratio: float | None = None,
                             safety_factor: float = 0.5) -> float:
            """
            Compute a safe upper bound for sway_sampling_coef given steps and t_start.

            - steps: number of ODE steps
            - t_start: start time in [0,1)
            - dtype: torch dtype (for machine eps)
            - min_ratio: smallest distinguishable dt^p (if None, use conservative default)
            - safety_factor: scale down the theoretical maximum to be safe
            """
            assert 0.0 <= t_start < 1.0
            dt = (1.0 - t_start) / max(1, steps)
            eps = torch.finfo(dtype).eps

            if min_ratio is None:
                # conservative default: ~100 * eps (float32 -> ~1e-5)
                min_ratio = max(1e-9, 1e2 * float(eps))

            if dt >= 0.9:
                p_max = 1.0 + 10.0
            else:
                # solve dt^p >= min_ratio => p <= log(min_ratio)/log(dt)
                p_max = math.log(min_ratio) / math.log(dt)

            sway_max = max(0.0, p_max - 1.0)
            sway_max = sway_max * float(safety_factor)
            return torch.tensor(sway_max, device=device, dtype=dtype)

        # prepare text-side prosody conditioning if embeddings available
        if prosody_embeds is not None:
            text_len = text.shape[1]
            prosody_text_cond = prosody_embeds[:, None, :].expand(-1, text_len, -1)
        else:
            prosody_text_cond = None

        def fn(t, x):
            # ODE right-hand side: predicted flow at time t, with CFG.
            # At each step the conditioning is fixed.
            if use_acc_grl:
                step_cond = torch.where(cond_mask, cond_grl, torch.zeros_like(cond_grl))
            else:
                step_cond = torch.where(cond_mask, cond, torch.zeros_like(cond))

            # predict flow
            pred = self.transformer(
                x=x,
                cond=step_cond,
                text=text,
                time=t,
                mask=mask,
                drop_audio_cond=False,
                drop_text=False,
                cache=True,
                prosody_text=prosody_text_cond,
            )
            if cfg_strength < 1e-5:
                return pred

            # unconditional pass for classifier-free guidance
            null_pred = self.transformer(
                x=x,
                cond=step_cond,
                text=text,
                time=t,
                mask=mask,
                drop_audio_cond=True,
                drop_text=True,
                cache=True,
                prosody_text=prosody_text_cond,
            )
            # time-dependent CFG weight: strongest at t=0, decays quadratically
            cfg_t = cfg_strength * ((1 - t) ** 2)
            res = pred + (pred - null_pred) * cfg_t
            # clamp to keep the ODE state bounded
            res = res.clamp(-20, 20)
            return res

        # noise input
        # to make sure batch inference result is same with different batch size, and for sure single inference
        # still some difference maybe due to convolutional layers
        y0 = []
        for dur in duration:
            if exists(seed):
                torch.manual_seed(seed)
            y0.append(torch.randn(dur, self.num_channels, device=self.device, dtype=step_cond.dtype))
        y0 = pad_sequence(y0, padding_value=0, batch_first=True)

        t_start = 0

        # duplicate test corner for inner time step oberservation
        if duplicate_test:
            t_start = t_inter
            y0 = (1 - t_start) * y0 + t_start * test_cond
            steps = int(steps * (1 - t_start))

        t = torch.linspace(t_start, 1, int(steps + 1), device=self.device, dtype=step_cond.dtype)

        # sway sampling: warp the time grid t -> t^(1+coef), with coef clamped
        # to a numerically safe bound derived from the step size
        sway_max = compute_sway_max(steps, t_start=t_start, dtype=step_cond.dtype, min_ratio=1e-9, safety_factor=0.7)
        if sway_sampling_coef is not None:
            sway_sampling_coef = min(sway_max, sway_sampling_coef)
            t = t ** (1 + sway_sampling_coef)
        else:
            t = t ** (1 + sway_max)

        trajectory = odeint(fn, y0, t, **self.odeint_kwargs)
        # drop the text-embedding cache populated by cache=True above
        self.transformer.clear_cache()

        sampled = trajectory[-1]
        out = sampled
        # paste the original prompt back over the conditioned region
        out = torch.where(cond_mask, cond, out)

        # The generated (zero-padded) region of `out` gets its own mean, which
        # is then shifted so that it roughly matches the prompt (cond) mean.
        if no_ref_audio:
            out_mean = out[:,cond_seq_len:,:].mean(dim=1, keepdim=True)
            out[:,cond_seq_len:,:] = out[:,cond_seq_len:,:] - (out_mean - cond_mean)

        if exists(vocoder):
            out = out.permute(0, 2, 1)
            out = vocoder(out)
        return out, trajectory
474
+
475
+
476
+ def info_nce_speaker(self,
477
+ e_gt: torch.Tensor,
478
+ e_pred: torch.Tensor,
479
+ temperature: float = 0.1):
480
+ """
481
+ InfoNCE loss for speaker encoder training.
482
+ 同一条样本的 e_gt 与 e_pred 互为正例,其余均为负例。
483
+
484
+ Args:
485
+ temperature: 温度缩放 τ
486
+
487
+ Returns:
488
+ loss: 标量 tensor,可 backward
489
+ """
490
+ B = e_gt.size(0)
491
+ # 2. L2 归一化
492
+ e_gt = F.normalize(e_gt, dim=1)
493
+ e_pred = F.normalize(e_pred, dim=1)
494
+
495
+ # 3. 计算 B×B 相似度矩阵(pred 对 gt)
496
+ logits = torch.einsum('bd,cd->bc', e_pred, e_gt) / temperature # [B, B]
497
+
498
+ # 4. 正例标签正好是对角线
499
+ labels = torch.arange(B, device=logits.device)
500
+
501
+ # 5. InfoNCE = cross-entropy over in-batch negatives
502
+ loss = F.cross_entropy(logits, labels)
503
+ return loss
504
+
505
+
506
    def forward_old(
        self,
        batchs: Dict[str, torch.Tensor],
        *,
        noise_scheduler: str | None = None,
    ):
        """
        Legacy training forward: flow-matching loss plus optional speaker
        InfoNCE, CTC and accent-GRL auxiliary losses.

        Args:
            batchs: batch dict with at least "mel" [B, D, T], "mel_lengths",
                "rand_mel" [B, D, T], "text" (list of strings or ids), "langs".
            noise_scheduler: unused placeholder (see TODO below).

        Returns:
            (total_loss, ctc_scaled, accent_loss, n_valid, cond, pred).
        """

        inp = batchs["mel"].permute(0, 2, 1)  # -> [B, T, D]
        lens = batchs["mel_lengths"]

        rand_mel = batchs["rand_mel"].permute(0, 2, 1)

        text = batchs["text"]
        target_text_lengths = torch.tensor([len(x) for x in text], device=inp.device)

        langs = batchs["langs"]

        # handle raw wave
        if inp.ndim == 2:
            inp = self.mel_spec(inp)
            inp = inp.permute(0, 2, 1)
            assert inp.shape[-1] == self.num_channels

        batch, seq_len, dtype, device, _σ1 = *inp.shape[:2], inp.dtype, self.device, self.sigma

        # handle text as string
        if isinstance(text, list):
            if exists(self.vocab_char_map):
                text = list_str_to_idx(text, self.vocab_char_map).to(device)
            else:
                text = list_str_to_tensor(text).to(device)
            assert text.shape[0] == batch

        # lens and mask
        if not exists(lens):
            lens = torch.full((batch,), seq_len, device=device)

        mask = lens_to_mask(lens, length=seq_len)  # useless here, as collate_fn will pad to max length in batch

        # get a random span to mask out for training conditionally
        frac_lengths = torch.zeros((batch,), device=self.device).float().uniform_(*self.frac_lengths_mask)
        rand_span_mask = mask_from_frac_lengths(lens, frac_lengths)

        if exists(mask):
            rand_span_mask &= mask

        # mel is x1
        x1 = inp

        # x0 is gaussian noise
        x0 = torch.randn_like(x1)

        # time step
        time = torch.rand((batch,), dtype=dtype, device=self.device)
        # TODO. noise_scheduler

        # sample xt (φ_t(x) in the paper)
        t = time.unsqueeze(-1).unsqueeze(-1)
        φ = (1 - t) * x0 + t * x1
        flow = x1 - x0

        # condition = target mel with the random span zeroed out
        cond = torch.where(rand_span_mask[..., None], torch.zeros_like(x1), x1)

        # NOTE(review): original comment claimed "use spk_emb with 50%
        # probability", but the code always blends with a random ratio.
        if self.use_spk_enc:

            spk_emb = self.speaker_encoder(rand_mel, lens)
            # global_emb: [batch, 1, dim] -> expand to [batch, seq_len, dim]
            spk_emb = spk_emb.unsqueeze(1).expand_as(x1)
            # apply the span mask to the speaker embedding
            cond = torch.where(rand_span_mask[..., None], torch.zeros_like(spk_emb), spk_emb)

            # blend: cond * r + spk_emb * (1 - r) with a per-sample random r
            rand_num = torch.rand((batch, 1, 1), dtype=dtype, device=self.device)
            cond = cond * rand_num + spk_emb * (1 - rand_num)

        # gradient reversal: push the condition towards accent invariance
        cond_grl = grad_reverse(cond, lambda_=1.0)

        # transformer and cfg training with a drop rate
        drop_audio_cond = random() < self.audio_drop_prob  # p_drop in voicebox paper
        if random() < self.text_drop_prob:  # p_uncond in voicebox paper
            drop_audio_cond = True
            drop_text_cond = True
        else:
            drop_text_cond = False

        # if want rigorously mask out padding, record in collate_fn in dataset.py, and pass in here
        # adding mask will use more memory, thus also need to adjust batchsampler with scaled down threshold for long sequences
        pred = self.transformer(x=φ, cond=cond_grl, text=text, time=time, drop_audio_cond=drop_audio_cond, drop_text=drop_text_cond)

        # flow matching loss (clamped for numerical stability)
        pred_clamp = pred.float().clamp(-20, 20)
        loss = F.mse_loss(pred_clamp, flow, reduction="none")
        loss = loss[rand_span_mask]  # [N]

        # replace NaNs / extreme values with a cap before averaging
        loss = torch.where(torch.isnan(loss) | (loss > 300.0), 300.0, loss)
        loss = loss.mean()

        # accent classification on the gradient-reversed condition
        accent_logits = self.accent_classifier(cond_grl)
        accent_logits_mean = accent_logits.mean(dim=1)
        lang_labels = langs.to(accent_logits.device).long()
        accent_loss = self.accent_criterion(accent_logits_mean, lang_labels)
        # guard against NaN / Inf in accent_loss
        if not torch.isfinite(accent_loss):
            accent_loss = torch.zeros_like(accent_loss, device=accent_loss.device)
        loss += 0.1 * accent_loss

        # auxiliary losses only apply to samples late in the flow (t > 0.5),
        # where pred is close enough to real mel to be meaningful
        valid_indices = torch.where(time > 0.5)[0]
        if valid_indices.size(0) > 2:
            # dynamically select the qualifying samples
            selected_gt = inp[valid_indices]
            selected_pred = pred[valid_indices]
            selected_text = text[valid_indices]
            selected_lens = lens[valid_indices]
            selected_target_lengths = target_text_lengths[valid_indices]

        if self.use_spk_enc and valid_indices.size(0) > 2:
            # speaker encoder loss
            e_gt = self.speaker_encoder(selected_gt, selected_lens)
            e_pred = self.speaker_encoder(selected_pred, selected_lens)
            spk_loss = self.info_nce_speaker(e_gt, e_pred)
            if not torch.isnan(spk_loss).any():
                loss = loss + spk_loss * 10.0
            else:
                spk_loss = torch.zeros_like(loss, device=loss.device, requires_grad=False)
        else:
            spk_loss = torch.zeros_like(loss, device=loss.device, requires_grad=False)

        # ctc loss
        if self.use_ctc_loss and valid_indices.size(0) > 2:
            # only computed when t > 0.5 (see valid_indices above)
            ctc_loss = self.ctc(
                decoder_outputs=selected_pred,
                target_phones=selected_text,
                decoder_lengths=selected_lens,
                target_lengths=selected_target_lengths,
            )
            # only add the CTC loss when it is finite and non-trivial
            if not torch.isnan(ctc_loss).any() and ctc_loss.item() > 1e-6:
                ctc_scaled = ctc_loss
                loss = loss + 0.1 * ctc_scaled
            else:
                ctc_scaled = torch.zeros_like(loss, device=loss.device, requires_grad=False)
        else:
            ctc_scaled = torch.zeros_like(loss, device=loss.device, requires_grad=False)

        total_loss = loss  # base flow loss + others added above
        # note: we intentionally do NOT add 0.0 * pred.sum() etc. here, to avoid
        # propagating NaNs from intermediate tensors into the loss scalar.

        return total_loss, ctc_scaled, accent_loss, len(valid_indices), cond, pred
701
+
702
+
703
+ def forward(self, batchs: Dict[str, torch.Tensor], *, noise_scheduler: str | None = None):
704
+ """
705
+ Simplified forward version for accent-invariant flow matching.
706
+ Removes speaker encoder and CTC parts, keeps accent GRL.
707
+ """
708
+ inp = batchs["mel"].permute(0, 2, 1) # [B, T_mel, D]
709
+ lens = batchs["mel_lengths"]
710
+ text = batchs["text"]
711
+ langs = batchs["langs"]
712
+ audio_16k_list = batchs.get("audio_16k", None)
713
+ prosody_idx_list = batchs.get("prosody_idx", None)
714
+
715
+ # # ---- 4. 随机截取并打乱 segment ----
716
+ # rand_mel = [clip_and_shuffle(spec, spec.shape[-1]) for spec in batchs["mel"]]
717
+
718
+ # padded_rand_mel = []
719
+ # for spec in rand_mel:
720
+ # padding = (0, batchs["mel"].shape[-1] - spec.size(-1))
721
+ # padded_spec = F.pad(spec, padding, value=0)
722
+ # padded_rand_mel.append(padded_spec)
723
+ # rand_mel = torch.stack(padded_rand_mel).permute(0, 2, 1)
724
+ # assert rand_mel.shape == inp.shape, f"shape diff: rand_mel.shape: {rand_mel.shape}, inp.shape: {inp.shape}"
725
+
726
+ if inp.ndim == 2:
727
+ inp = self.mel_spec(inp).permute(0, 2, 1)
728
+ assert inp.shape[-1] == self.num_channels
729
+
730
+ batch, seq_len, dtype, device = *inp.shape[:2], inp.dtype, self.device
731
+
732
+ # --- handle text
733
+ if isinstance(text, list):
734
+ if exists(self.vocab_char_map):
735
+ text = list_str_to_idx(text, self.vocab_char_map).to(device)
736
+ else:
737
+ text = list_str_to_tensor(text).to(device)
738
+ assert text.shape[0] == batch
739
+ # print("text:", batchs["text"][0], text.shape, text[0], batchs["text_lengths"][0])
740
+ # --- prosody conditioning (compute embeddings per sub-utterance)
741
+ prosody_mel_cond = None
742
+ prosody_text_cond = None
743
+ if (
744
+ self.prosody_encoder is not None
745
+ and audio_16k_list is not None
746
+ and prosody_idx_list is not None
747
+ ):
748
+ # prepare zero tensors for each sample
749
+ T_mel = seq_len
750
+ T_text = text.shape[1]
751
+ prosody_mel_cond = torch.zeros(batch, T_mel, 512, device=device, dtype=dtype)
752
+ prosody_text_cond = torch.zeros(batch, T_text, 512, device=device, dtype=dtype)
753
+
754
+ # collect all segments, run encoder per segment
755
+ seg_embeds: list[Tensor] = []
756
+ seg_meta: list[tuple[int, int, int, int, int, int]] = []
757
+ for b in range(batch):
758
+ audio_b = audio_16k_list[b]
759
+ idx_list = prosody_idx_list[b]
760
+ if audio_b is None or idx_list is None:
761
+ continue
762
+ audio_b = audio_b.to(device=device, dtype=dtype)
763
+ for seg in idx_list:
764
+ text_start, text_end, mel_start, mel_end, audio_start, audio_end = seg
765
+ # clamp audio indices
766
+ audio_start = max(0, min(audio_start, audio_b.shape[0] - 1))
767
+ audio_end = max(audio_start + 1, min(audio_end, audio_b.shape[0]))
768
+ audio_seg = audio_b[audio_start:audio_end]
769
+ if audio_seg.numel() == 0:
770
+ continue
771
+ fbank = extract_fbank_16k(audio_seg) # (T_fbank, 80)
772
+ fbank = fbank.unsqueeze(0).to(device=device, dtype=dtype) # (1, T_fbank, 80)
773
+ with torch.no_grad():
774
+ emb = self.prosody_encoder(fbank, padding_mask=None)[0] # (512,)
775
+ seg_embeds.append(emb)
776
+ seg_meta.append(
777
+ (b, text_start, text_end, mel_start, mel_end)
778
+ )
779
+
780
+ if seg_embeds:
781
+ seg_embeds_tensor = torch.stack(seg_embeds, dim=0) # (N_seg, 512)
782
+ # scatter embeddings back to per-sample tensors
783
+ for emb, meta in zip(seg_embeds_tensor, seg_meta):
784
+ b, ts, te, ms, me = meta
785
+ emb_exp = emb.to(device=device, dtype=dtype)
786
+ prosody_mel_cond[b, ms:me, :] = emb_exp
787
+ prosody_text_cond[b, ts:te, :] = emb_exp
788
+
789
+ # dropout on prosody conditioning
790
+ prosody_mel_cond = self.prosody_dropout(prosody_mel_cond)
791
+ prosody_text_cond = self.prosody_dropout(prosody_text_cond)
792
+
793
+ # --- mask & random span
794
+ mask = lens_to_mask(lens, length=seq_len)
795
+ frac_lengths = torch.zeros((batch,), device=device).float().uniform_(*self.frac_lengths_mask)
796
+ rand_span_mask = mask_from_frac_lengths(lens, frac_lengths)
797
+ if exists(mask):
798
+ rand_span_mask &= mask
799
+
800
+ # --- flow setup
801
+ x1 = inp
802
+ x0 = torch.randn_like(x1)
803
+ time = torch.rand((batch,), dtype=dtype, device=device)
804
+ t = time[:, None, None]
805
+ φ = (1 - t) * x0 + t * x1
806
+ flow = x1 - x0
807
+
808
+ # --- conditional input (masked mel) + optional prosody
809
+ cond = torch.where(rand_span_mask[..., None], torch.zeros_like(x1), x1) # x1 # rand_mel
810
+ if prosody_mel_cond is not None:
811
+ prosody_mel_proj = self.prosody_to_mel(prosody_mel_cond) # (B, T_mel, num_channels)
812
+ # if needed, pad/crop to seq_len
813
+ if prosody_mel_proj.size(1) < seq_len:
814
+ pad_len = seq_len - prosody_mel_proj.size(1)
815
+ prosody_mel_proj = F.pad(prosody_mel_proj, (0, 0, 0, pad_len))
816
+ elif prosody_mel_proj.size(1) > seq_len:
817
+ prosody_mel_proj = prosody_mel_proj[:, :seq_len, :]
818
+ cond = cond + prosody_mel_proj
819
+
820
+ # --- Gradient reversal: encourage accent-invariant cond
821
+ cond_grl = grad_reverse(cond, lambda_=1.0)
822
+
823
+ # # --- random drop condition for CFG-like robustness
824
+ # drop_audio_cond = random() < self.audio_drop_prob
825
+ # drop_text_cond = random() < self.text_drop_prob if not drop_audio_cond else True
826
+
827
+ # safe per-batch random (tensor)
828
+ rand_for_drop = torch.rand(1, device=device)
829
+ drop_audio_cond = (rand_for_drop.item() < self.audio_drop_prob)
830
+ rand_for_text = torch.rand(1, device=device)
831
+ drop_text_cond = (rand_for_text.item() < self.text_drop_prob)
832
+
833
+ # --- main prediction
834
+ pred = self.transformer(
835
+ x=φ,
836
+ cond=cond_grl,
837
+ text=text,
838
+ time=time,
839
+ drop_audio_cond=drop_audio_cond,
840
+ drop_text=drop_text_cond,
841
+ prosody_text=prosody_text_cond,
842
+ )
843
+
844
+ # === FLOW LOSS (robust mask-weighted) ===
845
+ pred_clamp = pred.float().clamp(-20, 20)
846
+ per_elem_loss = F.mse_loss(pred_clamp, flow, reduction="none") # [B, T, D]
847
+
848
+ mask_exp = rand_span_mask.unsqueeze(-1).to(dtype=per_elem_loss.dtype) # [B, T, 1]
849
+ masked_loss = per_elem_loss * mask_exp # zeros where mask False
850
+
851
+ # total selected scalar (frames * dim)
852
+ n_selected = mask_exp.sum() * per_elem_loss.size(-1) # scalar
853
+ denom = torch.clamp(n_selected, min=1.0)
854
+
855
+ loss_sum = masked_loss.sum()
856
+ loss = loss_sum / denom
857
+ # numeric safety
858
+ loss = torch.where(torch.isnan(loss) | (loss > 300.0), torch.tensor(300.0, device=loss.device, dtype=loss.dtype), loss)
859
+
860
+ # === ACCENT LOSS ===
861
+ accent_logits = self.accent_classifier(cond_grl)
862
+ # pool across time -> [B, C]
863
+ accent_logits_mean = accent_logits.mean(dim=1)
864
+ lang_labels = langs.to(accent_logits_mean.device).long()
865
+ accent_loss = self.accent_criterion(accent_logits_mean, lang_labels)
866
+ # guard against NaN / Inf in accent_loss
867
+ if not torch.isfinite(accent_loss):
868
+ accent_loss = torch.zeros_like(accent_loss, device=accent_loss.device)
869
+
870
+ base_loss = loss + 0.1 * accent_loss
871
+
872
+ # === OPTIONAL CTC LOSS (robust, only on valid samples) ===
873
+ ctc_scaled = torch.tensor(0.0, device=device, dtype=dtype)
874
+ if getattr(self, "use_ctc_loss", False) and getattr(self, "ctc", None) is not None:
875
+ # select samples with larger t for CTC supervision (similar to forward_old)
876
+ valid_indices = torch.where(time > 0.5)[0]
877
+ if valid_indices.size(0) > 2:
878
+ selected_pred = pred[valid_indices]
879
+ selected_text = text[valid_indices]
880
+ selected_lens = lens[valid_indices]
881
+ # text was tokenized from list_str_to_idx, where padding is -1
882
+ selected_target_lengths = (selected_text != -1).sum(dim=-1)
883
+
884
+ ctc_loss = self.ctc(
885
+ decoder_outputs=selected_pred,
886
+ target_phones=selected_text,
887
+ decoder_lengths=selected_lens,
888
+ target_lengths=selected_target_lengths,
889
+ )
890
+ if torch.isfinite(ctc_loss) and ctc_loss.item() > 1e-6:
891
+ ctc_scaled = ctc_loss
892
+ base_loss = base_loss + 0.1 * ctc_scaled
893
+
894
+ total_loss = base_loss
895
+
896
+ # note: we intentionally do NOT add 0.0 * pred.sum() etc. here, to avoid
897
+ # propagating NaNs from intermediate tensors into the loss scalar.
898
+
899
+ return total_loss, accent_loss, ctc_scaled, cond, pred
lemas_tts/model/modules.py ADDED
@@ -0,0 +1,802 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import math
13
+ from typing import Optional
14
+
15
+ import torch
16
+ import torch.nn.functional as F
17
+ import torchaudio
18
+ from librosa.filters import mel as librosa_mel_fn
19
+ from torch import nn
20
+ from x_transformers.x_transformers import apply_rotary_pos_emb
21
+ from torch.autograd import Function
22
+
23
+ # raw wav to mel spec
24
+
25
+
26
# Per-configuration caches so the mel filterbank and Hann window are built once
# per (STFT settings, device) combination and reused across calls.
mel_basis_cache = {}
hann_window_cache = {}
28
+
29
+
30
def get_bigvgan_mel_spectrogram(
    waveform,
    n_fft=1024,
    n_mel_channels=100,
    target_sample_rate=24000,
    hop_length=256,
    win_length=1024,
    fmin=0,
    fmax=None,
    center=False,
):  # Copy from https://github.com/NVIDIA/BigVGAN/tree/main
    """Compute a log-mel spectrogram the BigVGAN way (reflect-padded, uncentered STFT)."""
    device = waveform.device
    cache_key = f"{n_fft}_{n_mel_channels}_{target_sample_rate}_{hop_length}_{win_length}_{fmin}_{fmax}_{device}"

    # Build (and cache) the mel filterbank and Hann window once per configuration/device.
    if cache_key not in mel_basis_cache:
        filterbank = librosa_mel_fn(
            sr=target_sample_rate, n_fft=n_fft, n_mels=n_mel_channels, fmin=fmin, fmax=fmax
        )
        mel_basis_cache[cache_key] = torch.from_numpy(filterbank).float().to(device)
        hann_window_cache[cache_key] = torch.hann_window(win_length).to(device)

    mel_basis = mel_basis_cache[cache_key]
    window = hann_window_cache[cache_key]

    # Manual reflect padding so uncentered STFT frames line up with hop_length.
    pad = (n_fft - hop_length) // 2
    padded = torch.nn.functional.pad(waveform.unsqueeze(1), (pad, pad), mode="reflect").squeeze(1)

    stft = torch.stft(
        padded,
        n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window=window,
        center=center,
        pad_mode="reflect",
        normalized=False,
        onesided=True,
        return_complex=True,
    )
    # Magnitude with a small epsilon under the sqrt for numerical safety.
    magnitude = torch.sqrt(torch.view_as_real(stft).pow(2).sum(-1) + 1e-9)

    # Project onto mel bands, then log-compress with a floor.
    return torch.log(torch.clamp(torch.matmul(mel_basis, magnitude), min=1e-5))
73
+
74
+
75
def get_vocos_mel_spectrogram(
    waveform,
    n_fft=1024,
    n_mel_channels=100,
    target_sample_rate=24000,
    hop_length=256,
    win_length=1024,
):
    """Compute a log-mel spectrogram with torchaudio, matching the Vocos frontend."""
    transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=target_sample_rate,
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        n_mels=n_mel_channels,
        power=1,
        center=True,
        normalized=False,
        norm=None,
    ).to(waveform.device)

    # Accept (b, 1, nw) input and drop the channel axis down to (b, nw).
    if len(waveform.shape) == 3:
        waveform = waveform.squeeze(1)

    assert len(waveform.shape) == 2

    # Magnitude mel spectrogram, floored then log-compressed.
    return transform(waveform).clamp(min=1e-5).log()
102
+
103
+
104
class MelSpec(nn.Module):
    """Extract log-mel spectrograms with either the vocos or bigvgan frontend.

    The non-persistent `dummy` buffer tracks which device this module lives on
    so the extractor can be moved lazily to the input's device in `forward`.
    """

    def __init__(
        self,
        n_fft=1024,
        hop_length=256,
        win_length=1024,
        n_mel_channels=100,
        target_sample_rate=24_000,
        mel_spec_type="vocos",
    ):
        super().__init__()
        # Fix: was `assert ..., print(...)` — print() executes unconditionally and
        # returns None, so the assertion message was always None. Use a plain string.
        assert mel_spec_type in ["vocos", "bigvgan"], (
            "We only support two extract mel backend: vocos or bigvgan"
        )

        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
        self.n_mel_channels = n_mel_channels
        self.target_sample_rate = target_sample_rate

        if mel_spec_type == "vocos":
            self.extractor = get_vocos_mel_spectrogram
        elif mel_spec_type == "bigvgan":
            self.extractor = get_bigvgan_mel_spectrogram

        self.register_buffer("dummy", torch.tensor(0), persistent=False)

    def forward(self, wav):
        """Return the mel spectrogram of `wav` (batched raw waveform)."""
        # Follow the input's device (e.g. first call after moving data to GPU).
        if self.dummy.device != wav.device:
            self.to(wav.device)

        mel = self.extractor(
            waveform=wav,
            n_fft=self.n_fft,
            n_mel_channels=self.n_mel_channels,
            target_sample_rate=self.target_sample_rate,
            hop_length=self.hop_length,
            win_length=self.win_length,
        )

        return mel
144
+
145
+
146
+ # sinusoidal position embedding
147
+
148
+
149
class SinusPositionEmbedding(nn.Module):
    """Classic sinusoidal embedding: first half sin, second half cos."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, x, scale=1000):
        # Geometric frequency ladder spanning half the embedding dimension.
        half = self.dim // 2
        log_step = math.log(10000) / (half - 1)
        inv_freq = torch.exp(torch.arange(half, device=x.device).float() * -log_step)
        angles = scale * x.unsqueeze(1) * inv_freq.unsqueeze(0)
        return torch.cat((angles.sin(), angles.cos()), dim=-1)
162
+
163
+
164
+ # convolutional position embedding
165
+
166
+
167
class ConvPositionEmbedding(nn.Module):
    """Grouped Conv1d stack used as a convolutional positional embedding."""

    def __init__(self, dim, kernel_size=31, groups=16):
        super().__init__()
        assert kernel_size % 2 != 0  # odd kernel keeps length with symmetric padding
        pad = kernel_size // 2
        self.conv1d = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=pad),
            nn.Mish(),
            nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=pad),
            nn.Mish(),
        )

    def forward(self, x: float["b n d"], mask: bool["b n"] | None = None):  # noqa: F722
        # Zero padded positions before and after the convolution so padding
        # never leaks into (or out of) real frames.
        if mask is not None:
            mask = mask[..., None]
            x = x.masked_fill(~mask, 0.0)

        out = self.conv1d(x.permute(0, 2, 1)).permute(0, 2, 1)

        if mask is not None:
            out = out.masked_fill(~mask, 0.0)

        return out
191
+
192
+
193
+ # rotary positional embedding related
194
+
195
+
196
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, theta_rescale_factor=1.0):
    """Precompute rotary-embedding angles, returned as [cos | sin] of shape (end, dim).

    `theta_rescale_factor` follows the NTK-aware rescaling trick (reddit user bloc97)
    so rotary embeddings extend to longer sequences without fine-tuning:
    https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
    https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py
    """
    theta *= theta_rescale_factor ** (dim / (dim - 2))
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    positions = torch.arange(end, device=inv_freq.device)
    angles = torch.outer(positions, inv_freq).float()
    # Real (cos) and imaginary (sin) parts concatenated along the last axis.
    return torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
208
+
209
+
210
def get_pos_embed_indices(start, length, max_pos, scale=1.0):
    """Absolute position indices from `start`, stretched by `scale`, clipped to max_pos-1."""
    scale = scale * torch.ones_like(start, dtype=torch.float32)  # broadcast scalar scale per batch
    offsets = (
        torch.arange(length, device=start.device, dtype=torch.float32).unsqueeze(0) * scale.unsqueeze(1)
    ).long()
    pos = start.unsqueeze(1) + offsets
    # Clamp so extra-long sequences never index past the precomputed table.
    return torch.where(pos < max_pos, pos, max_pos - 1)
220
+
221
+
222
+ # Global Response Normalization layer (Instance Normalization ?)
223
+
224
+
225
class GRN(nn.Module):
    """Global Response Normalization (ConvNeXt-V2); identity at initialisation."""

    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):
        # L2 energy per channel over the sequence axis, normalised by its channel mean.
        global_norm = torch.norm(x, p=2, dim=1, keepdim=True)
        calibrated = global_norm / (global_norm.mean(dim=-1, keepdim=True) + 1e-6)
        return self.gamma * (x * calibrated) + self.beta + x
235
+
236
+
237
+ # ConvNeXt-V2 Block https://github.com/facebookresearch/ConvNeXt-V2/blob/main/models/convnextv2.py
238
+ # ref: https://github.com/bfs18/e2_tts/blob/main/rfwave/modules.py#L108
239
+
240
+
241
class ConvNeXtV2Block(nn.Module):
    """ConvNeXt-V2 residual block: depthwise conv -> LN -> MLP with GRN.

    https://github.com/facebookresearch/ConvNeXt-V2/blob/main/models/convnextv2.py
    ref: https://github.com/bfs18/e2_tts/blob/main/rfwave/modules.py#L108
    """

    def __init__(
        self,
        dim: int,
        intermediate_dim: int,
        dilation: int = 1,
    ):
        super().__init__()
        padding = (dilation * (7 - 1)) // 2  # keeps sequence length for kernel size 7
        self.dwconv = nn.Conv1d(
            dim, dim, kernel_size=7, padding=padding, groups=dim, dilation=dilation
        )  # depthwise conv
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, intermediate_dim)  # pointwise/1x1 conv as linear
        self.act = nn.GELU()
        self.grn = GRN(intermediate_dim)
        self.pwconv2 = nn.Linear(intermediate_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        # Depthwise conv operates over time, so swap to (b, d, n) and back.
        h = self.dwconv(x.transpose(1, 2)).transpose(1, 2)
        h = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(h)))))
        return residual + h
270
+
271
+
272
+ # RMSNorm
273
+
274
+
275
class RMSNorm(nn.Module):
    """Root-mean-square LayerNorm with a learned per-channel scale.

    Uses torch's fused `F.rms_norm` when available (torch >= 2.4), otherwise a
    manual float32 implementation; both paths compute the same result.
    """

    def __init__(self, dim: int, eps: float):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
        # Fix: the old check `float(torch.__version__[:3]) >= 2.4` breaks on
        # versions like "2.10.0" (truncates to "2.1"). Compare (major, minor)
        # as an integer tuple instead; fall back to the manual path on any
        # unexpected version string.
        try:
            major, minor = torch.__version__.split(".")[:2]
            self.native_rms_norm = (int(major), int(minor.split("+")[0])) >= (2, 4)
        except ValueError:
            self.native_rms_norm = False

    def forward(self, x):
        if self.native_rms_norm:
            # Fused kernel wants input in the weight's (possibly half) dtype.
            if self.weight.dtype in [torch.float16, torch.bfloat16]:
                x = x.to(self.weight.dtype)
            x = F.rms_norm(x, normalized_shape=(x.shape[-1],), weight=self.weight, eps=self.eps)
        else:
            # Compute variance in float32 for numerical stability, then rescale.
            variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
            x = x * torch.rsqrt(variance + self.eps)
            if self.weight.dtype in [torch.float16, torch.bfloat16]:
                x = x.to(self.weight.dtype)
            x = x * self.weight

        return x
295
+
296
+
297
+ # AdaLayerNorm
298
+ # return with modulated x for attn input, and params for later mlp modulation
299
+
300
+
301
class AdaLayerNorm(nn.Module):
    """Adaptive LayerNorm conditioned on an embedding.

    Returns the modulated x for the attention input plus the gate/shift/scale
    parameters used later for MLP modulation.
    """

    def __init__(self, dim):
        super().__init__()

        self.silu = nn.SiLU()
        self.linear = nn.Linear(dim, dim * 6)

        self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)

    def forward(self, x, emb=None):
        # One projection yields all six modulation parameter sets.
        params = self.linear(self.silu(emb))
        (
            shift_msa,
            scale_msa,
            gate_msa,
            shift_mlp,
            scale_mlp,
            gate_mlp,
        ) = params.chunk(6, dim=1)

        x = self.norm(x) * (1 + scale_msa[:, None]) + shift_msa[:, None]
        return x, gate_msa, shift_mlp, scale_mlp, gate_mlp
316
+
317
+
318
+ # AdaLayerNorm for final layer
319
+ # return only with modulated x for attn input, cuz no more mlp modulation
320
+
321
+
322
class AdaLayerNorm_Final(nn.Module):
    """Adaptive LayerNorm for the final layer: scale/shift only, no MLP gates."""

    def __init__(self, dim):
        super().__init__()

        self.silu = nn.SiLU()
        self.linear = nn.Linear(dim, dim * 2)

        self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)

    def forward(self, x, emb):
        scale, shift = self.linear(self.silu(emb)).chunk(2, dim=1)
        return self.norm(x) * (1 + scale)[:, None, :] + shift[:, None, :]
337
+
338
+
339
+ # FeedForward
340
+
341
+
342
class FeedForward(nn.Module):
    """Transformer MLP: Linear -> GELU -> Dropout -> Linear."""

    def __init__(self, dim, dim_out=None, mult=4, dropout=0.0, approximate: str = "none"):
        super().__init__()
        inner_dim = int(dim * mult)
        out_features = dim if dim_out is None else dim_out

        activation = nn.GELU(approximate=approximate)
        project_in = nn.Sequential(nn.Linear(dim, inner_dim), activation)
        self.ff = nn.Sequential(project_in, nn.Dropout(dropout), nn.Linear(inner_dim, out_features))

    def forward(self, x):
        return self.ff(x)
354
+
355
+
356
+ # Attention with possible joint part
357
+ # modified from diffusers/src/diffusers/models/attention_processor.py
358
+
359
+
360
class Attention(nn.Module):
    """Multi-head attention, optionally joint with a context stream (MM-DiT style).

    If `context_dim` is set, extra q/k/v projections for the context are created
    and the processor is expected to run joint attention; `context_pre_only`
    drops the context output projection (final layer has no context FFN).
    """

    def __init__(
        self,
        processor: JointAttnProcessor | AttnProcessor,
        dim: int,
        heads: int = 8,
        dim_head: int = 64,
        dropout: float = 0.0,
        context_dim: Optional[int] = None,  # if not None -> joint attention
        context_pre_only: bool = False,
        qk_norm: Optional[str] = None,
    ):
        super().__init__()

        if not hasattr(F, "scaled_dot_product_attention"):
            # fix: error message had a typo ("equires" -> "requires")
            raise ImportError("Attention requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.")

        self.processor = processor

        self.dim = dim
        self.heads = heads
        self.inner_dim = dim_head * heads
        self.dropout = dropout

        self.context_dim = context_dim
        self.context_pre_only = context_pre_only

        # `sample`-stream projections
        self.to_q = nn.Linear(dim, self.inner_dim)
        self.to_k = nn.Linear(dim, self.inner_dim)
        self.to_v = nn.Linear(dim, self.inner_dim)

        if qk_norm is None:
            self.q_norm = None
            self.k_norm = None
        elif qk_norm == "rms_norm":
            self.q_norm = RMSNorm(dim_head, eps=1e-6)
            self.k_norm = RMSNorm(dim_head, eps=1e-6)
        else:
            raise ValueError(f"Unimplemented qk_norm: {qk_norm}")

        # context-stream projections (joint attention only)
        if self.context_dim is not None:
            self.to_q_c = nn.Linear(context_dim, self.inner_dim)
            self.to_k_c = nn.Linear(context_dim, self.inner_dim)
            self.to_v_c = nn.Linear(context_dim, self.inner_dim)
            if qk_norm is None:
                self.c_q_norm = None
                self.c_k_norm = None
            elif qk_norm == "rms_norm":
                self.c_q_norm = RMSNorm(dim_head, eps=1e-6)
                self.c_k_norm = RMSNorm(dim_head, eps=1e-6)

        self.to_out = nn.ModuleList([])
        self.to_out.append(nn.Linear(self.inner_dim, dim))
        self.to_out.append(nn.Dropout(dropout))

        if self.context_dim is not None and not self.context_pre_only:
            self.to_out_c = nn.Linear(self.inner_dim, context_dim)

    def forward(
        self,
        x: float["b n d"],  # noised input x  # noqa: F722
        c: float["b n d"] = None,  # context c  # noqa: F722
        mask: bool["b n"] | None = None,  # noqa: F722
        rope=None,  # rotary position embedding for x
        c_rope=None,  # rotary position embedding for c
    ) -> torch.Tensor:
        # Dispatch to the configured processor; joint path only when c is given.
        if c is not None:
            return self.processor(self, x, c=c, mask=mask, rope=rope, c_rope=c_rope)
        else:
            return self.processor(self, x, mask=mask, rope=rope)
430
+
431
+
432
+ # Attention processor
433
+
434
+
435
class AttnProcessor:
    """Plain self-attention processor for `Attention` (no context stream)."""

    def __init__(
        self,
        pe_attn_head: int | None = None,  # number of attention head to apply rope, None for all
    ):
        self.pe_attn_head = pe_attn_head

    def __call__(
        self,
        attn: Attention,
        x: float["b n d"],  # noised input x # noqa: F722
        mask: bool["b n"] | None = None,  # noqa: F722
        rope=None,  # rotary position embedding
    ) -> torch.FloatTensor:
        batch_size = x.shape[0]

        # `sample` projections
        query = attn.to_q(x)
        key = attn.to_k(x)
        value = attn.to_v(x)

        # attention: reshape to (b, heads, n, head_dim) for scaled_dot_product_attention
        inner_dim = key.shape[-1]
        head_dim = inner_dim // attn.heads
        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        # qk norm (per-head RMSNorm when configured on `attn`)
        if attn.q_norm is not None:
            query = attn.q_norm(query)
        if attn.k_norm is not None:
            key = attn.k_norm(key)

        # apply rotary position embedding
        if rope is not None:
            freqs, xpos_scale = rope
            q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)

            if self.pe_attn_head is not None:
                # rope applied in place to only the first `pe_attn_head` heads
                # (heads are dim 1 after the transpose above)
                pn = self.pe_attn_head
                query[:, :pn, :, :] = apply_rotary_pos_emb(query[:, :pn, :, :], freqs, q_xpos_scale)
                key[:, :pn, :, :] = apply_rotary_pos_emb(key[:, :pn, :, :], freqs, k_xpos_scale)
            else:
                query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
                key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)

        # mask. e.g. inference got a batch with different target durations, mask out the padding
        if mask is not None:
            attn_mask = mask
            attn_mask = attn_mask.unsqueeze(1).unsqueeze(1)  # 'b n -> b 1 1 n'
            attn_mask = attn_mask.expand(batch_size, attn.heads, query.shape[-2], key.shape[-2])
        else:
            attn_mask = None

        x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
        # merge heads back: (b, heads, n, head_dim) -> (b, n, heads * head_dim)
        x = x.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
        x = x.to(query.dtype)

        # linear proj
        x = attn.to_out[0](x)
        # dropout
        x = attn.to_out[1](x)

        if mask is not None:
            # re-zero padded positions after the output projection
            mask = mask.unsqueeze(-1)
            x = x.masked_fill(~mask, 0.0)

        return x
504
+
505
+
506
+ # Joint Attention processor for MM-DiT
507
+ # modified from diffusers/src/diffusers/models/attention_processor.py
508
+
509
+
510
class JointAttnProcessor:
    """Joint attention over the noised-input stream x and the context stream c.

    Both streams are projected separately, concatenated along the sequence
    axis (x tokens first), attended jointly, then split back apart.
    """

    def __init__(self):
        pass

    def __call__(
        self,
        attn: Attention,
        x: float["b n d"],  # noised input x # noqa: F722
        c: float["b nt d"] = None,  # context c, here text # noqa: F722
        mask: bool["b n"] | None = None,  # noqa: F722
        rope=None,  # rotary position embedding for x
        c_rope=None,  # rotary position embedding for c
    ) -> torch.FloatTensor:
        # kept only to know where to split the x part from the c part afterwards
        residual = x

        batch_size = c.shape[0]

        # `sample` projections
        query = attn.to_q(x)
        key = attn.to_k(x)
        value = attn.to_v(x)

        # `context` projections
        c_query = attn.to_q_c(c)
        c_key = attn.to_k_c(c)
        c_value = attn.to_v_c(c)

        # attention: reshape all six to (b, heads, n, head_dim)
        inner_dim = key.shape[-1]
        head_dim = inner_dim // attn.heads
        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        c_query = c_query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        c_key = c_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        c_value = c_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        # qk norm
        if attn.q_norm is not None:
            query = attn.q_norm(query)
        if attn.k_norm is not None:
            key = attn.k_norm(key)
        if attn.c_q_norm is not None:
            c_query = attn.c_q_norm(c_query)
        if attn.c_k_norm is not None:
            c_key = attn.c_k_norm(c_key)

        # apply rope for context and noised input independently
        if rope is not None:
            freqs, xpos_scale = rope
            q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
            query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
            key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
        if c_rope is not None:
            freqs, xpos_scale = c_rope
            q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
            c_query = apply_rotary_pos_emb(c_query, freqs, q_xpos_scale)
            c_key = apply_rotary_pos_emb(c_key, freqs, k_xpos_scale)

        # joint attention: x tokens first, context tokens appended after (dim 2 = sequence)
        query = torch.cat([query, c_query], dim=2)
        key = torch.cat([key, c_key], dim=2)
        value = torch.cat([value, c_value], dim=2)

        # mask. e.g. inference got a batch with different target durations, mask out the padding
        if mask is not None:
            attn_mask = F.pad(mask, (0, c.shape[1]), value=True)  # no mask for c (text)
            attn_mask = attn_mask.unsqueeze(1).unsqueeze(1)  # 'b n -> b 1 1 n'
            attn_mask = attn_mask.expand(batch_size, attn.heads, query.shape[-2], key.shape[-2])
        else:
            attn_mask = None

        x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
        x = x.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
        x = x.to(query.dtype)

        # Split the attention outputs (x part has residual.shape[1] tokens).
        x, c = (
            x[:, : residual.shape[1]],
            x[:, residual.shape[1] :],
        )

        # linear proj
        x = attn.to_out[0](x)
        # dropout
        x = attn.to_out[1](x)
        if not attn.context_pre_only:
            c = attn.to_out_c(c)

        if mask is not None:
            mask = mask.unsqueeze(-1)
            x = x.masked_fill(~mask, 0.0)
            # c = c.masked_fill(~mask, 0.)  # no mask for c (text)

        return x, c
605
+
606
+
607
+ # DiT Block
608
+
609
+
610
class DiTBlock(nn.Module):
    """DiT transformer block: AdaLN-modulated self-attention plus a gated MLP."""

    def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1, qk_norm=None, pe_attn_head=None):
        super().__init__()

        self.attn_norm = AdaLayerNorm(dim)
        self.attn = Attention(
            processor=AttnProcessor(pe_attn_head=pe_attn_head),
            dim=dim,
            heads=heads,
            dim_head=dim_head,
            dropout=dropout,
            qk_norm=qk_norm,
        )

        self.ff_norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.ff = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")

    def forward(self, x, t, mask=None, rope=None):
        """x: noised input, t: time embedding."""
        # AdaLN produces the attention input plus the gates/shifts for the MLP.
        modulated, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.attn_norm(x, emb=t)

        # Gated residual around attention.
        x = x + gate_msa.unsqueeze(1) * self.attn(x=modulated, mask=mask, rope=rope)

        # AdaLN-style modulation of the MLP input, then another gated residual.
        mlp_in = self.ff_norm(x) * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
        x = x + gate_mlp.unsqueeze(1) * self.ff(mlp_in)

        return x
642
+
643
+
644
+ # MMDiT Block https://arxiv.org/abs/2403.03206
645
+
646
+
647
class MMDiTBlock(nn.Module):
    r"""
    MMDiT block (https://arxiv.org/abs/2403.03206),
    modified from diffusers/src/diffusers/models/attention.py

    notes.
    _c: context related. text, cond, etc. (left part in sd3 fig2.b)
    _x: noised input related. (right part)
    context_pre_only: last layer only do prenorm + modulation cuz no more ffn
    """

    def __init__(
        self, dim, heads, dim_head, ff_mult=4, dropout=0.1, context_dim=None, context_pre_only=False, qk_norm=None
    ):
        super().__init__()
        if context_dim is None:
            context_dim = dim
        self.context_pre_only = context_pre_only

        # Final-layer context stream only needs scale/shift (no MLP gates).
        self.attn_norm_c = AdaLayerNorm_Final(context_dim) if context_pre_only else AdaLayerNorm(context_dim)
        self.attn_norm_x = AdaLayerNorm(dim)
        self.attn = Attention(
            processor=JointAttnProcessor(),
            dim=dim,
            heads=heads,
            dim_head=dim_head,
            dropout=dropout,
            context_dim=context_dim,
            context_pre_only=context_pre_only,
            qk_norm=qk_norm,
        )

        if not context_pre_only:
            self.ff_norm_c = nn.LayerNorm(context_dim, elementwise_affine=False, eps=1e-6)
            self.ff_c = FeedForward(dim=context_dim, mult=ff_mult, dropout=dropout, approximate="tanh")
        else:
            self.ff_norm_c = None
            self.ff_c = None
        self.ff_norm_x = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.ff_x = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")

    def forward(self, x, c, t, mask=None, rope=None, c_rope=None):  # x: noised input, c: context, t: time embedding
        # pre-norm & modulation for attention input
        if self.context_pre_only:
            norm_c = self.attn_norm_c(c, t)
        else:
            norm_c, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.attn_norm_c(c, emb=t)
        norm_x, x_gate_msa, x_shift_mlp, x_scale_mlp, x_gate_mlp = self.attn_norm_x(x, emb=t)

        # attention (joint over both streams)
        x_attn_output, c_attn_output = self.attn(x=norm_x, c=norm_c, mask=mask, rope=rope, c_rope=c_rope)

        # process attention output for context c
        if self.context_pre_only:
            c = None  # last layer: the context stream ends here
        else:  # if not last layer
            c = c + c_gate_msa.unsqueeze(1) * c_attn_output

            norm_c = self.ff_norm_c(c) * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]
            c_ff_output = self.ff_c(norm_c)
            c = c + c_gate_mlp.unsqueeze(1) * c_ff_output

        # process attention output for input x
        x = x + x_gate_msa.unsqueeze(1) * x_attn_output

        norm_x = self.ff_norm_x(x) * (1 + x_scale_mlp[:, None]) + x_shift_mlp[:, None]
        x_ff_output = self.ff_x(norm_x)
        x = x + x_gate_mlp.unsqueeze(1) * x_ff_output

        return c, x
716
+
717
+
718
+ # time step conditioning embedding
719
+
720
+
721
class TimestepEmbedding(nn.Module):
    """Embed scalar diffusion timesteps: sinusoidal features -> 2-layer MLP."""

    def __init__(self, dim, freq_embed_dim=256):
        super().__init__()
        self.time_embed = SinusPositionEmbedding(freq_embed_dim)
        self.time_mlp = nn.Sequential(nn.Linear(freq_embed_dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, timestep: float["b"]):  # noqa: F821
        # Sinusoidal features come out float32; match the timestep dtype before the MLP.
        features = self.time_embed(timestep).to(timestep.dtype)
        return self.time_mlp(features)  # b d
732
+
733
+
734
class MIEsitmator(nn.Module):
    """CTC-based estimator head over decoder outputs.

    (Class name typo "MIEsitmator" is kept as-is for checkpoint/caller
    compatibility.)
    """

    def __init__(self, vocab_size, decoder_dim, hidden_size, dropout=0.5):
        super(MIEsitmator, self).__init__()
        self.proj = nn.Sequential(
            torch.nn.Linear(decoder_dim, hidden_size, bias=True),
            nn.ReLU(),
            nn.Dropout(p=dropout),
        )
        # +1 output for the CTC blank symbol (blank index == vocab_size).
        self.ctc_proj = torch.nn.Linear(hidden_size, vocab_size + 1, bias=True)
        self.ctc = nn.CTCLoss(blank=vocab_size, reduction='mean', zero_infinity=True)

    def forward(self, decoder_outputs, target_phones, decoder_lengths, target_lengths):
        """Return a scalar, length-normalised and clamped CTC loss."""
        out = self.proj(decoder_outputs.type(self.ctc_proj.weight.dtype))
        log_probs = self.ctc_proj(out).log_softmax(dim=2)
        # nn.CTCLoss expects (T, B, C)
        log_probs = log_probs.transpose(1, 0)
        ctc_loss = self.ctc(log_probs.float(), target_phones, decoder_lengths, target_lengths)
        # Normalise by decoder frame count so it matches the frame-averaged losses.
        ctc_loss = ctc_loss / decoder_lengths.float()
        # Clamp NaN / exploding values so one bad utterance cannot derail training.
        ctc_loss = torch.where((ctc_loss > 300.0) | torch.isnan(ctc_loss), 300.0, ctc_loss)
        return ctc_loss.mean()

    def inference(self, decoder_output):
        """Return (T, B, vocab+1) CTC log-probabilities.

        Fix: the original ended with `log_probs.item()`, which raises
        RuntimeError for any multi-element tensor (i.e. every real sequence);
        return the log-probability tensor instead.
        """
        out = self.proj(decoder_output.type(self.ctc_proj.weight.dtype))
        log_probs = self.ctc_proj(out).log_softmax(dim=2)
        return log_probs.transpose(1, 0)
774
+
775
+
776
class AccentClassifier(nn.Module):
    """Small MLP head that maps frame features to accent logits."""

    def __init__(self, input_dim, hidden_dim, num_accents, dropout=0.3):
        super().__init__()
        layers = [
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_accents),
        ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Logits per position; pooling (if any) is the caller's job.
        return self.net(x)
788
+
789
+
790
class GradientReversalFunction(Function):
    """Identity in the forward pass; flips and scales gradients in backward."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate (and scale) the incoming gradient; no gradient w.r.t. lambda_.
        return -ctx.lambda_ * grad_output, None


def grad_reverse(x, lambda_=1.0):
    """Convenience wrapper around GradientReversalFunction (for GRL training)."""
    return GradientReversalFunction.apply(x, lambda_)
802
+
lemas_tts/model/utils.py ADDED
@@ -0,0 +1,190 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ import random
5
+ from collections import defaultdict
6
+ from importlib.resources import files
7
+
8
+ import torch
9
+ from torch.nn.utils.rnn import pad_sequence
10
+
11
+ import jieba
12
+ from pypinyin import lazy_pinyin, Style
13
+ import sys
14
+
15
+ # seed everything
16
+
17
+
18
def seed_everything(seed=0):
    """Seed Python, hashing, and all torch RNGs for reproducible runs."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade kernel speed for determinism in cuDNN.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
26
+
27
+
28
+ # helpers
29
+
30
+
31
def exists(v):
    """Return True when ``v`` is not None."""
    return v is not None


def default(v, d):
    """Return ``v`` unless it is None, otherwise the fallback ``d``."""
    return d if v is None else v
37
+
38
+
39
+ # tensor helpers
40
+
41
+
42
def lens_to_mask(t: int["b"], length: int | None = None) -> bool["b n"]:  # noqa: F722 F821
    """Build a boolean padding mask from per-item lengths.

    Args:
        t: 1-D tensor of lengths, shape (batch,).
        length: maximum sequence length; defaults to ``t.amax()``.

    Returns:
        Bool tensor of shape (batch, length); True inside each sequence.
    """
    if length is None:
        length = t.amax()

    positions = torch.arange(length, device=t.device)
    return positions.unsqueeze(0) < t.unsqueeze(1)
48
+
49
+
50
def mask_from_start_end_indices(seq_len: int["b"], start: int["b"], end: int["b"]):  # noqa: F722 F821
    """Bool mask, True where ``start[i] <= pos < end[i]``, padded to max(seq_len)."""
    max_seq_len = seq_len.max().item()
    positions = torch.arange(max_seq_len, device=start.device).long()
    after_start = positions.unsqueeze(0) >= start.unsqueeze(1)
    before_end = positions.unsqueeze(0) < end.unsqueeze(1)
    return after_start & before_end


def mask_from_frac_lengths(seq_len: int["b"], frac_lengths: float["b"]):  # noqa: F722 F821
    """Mask a random contiguous span covering ``frac_lengths`` of each sequence."""
    lengths = (frac_lengths * seq_len).long()
    max_start = seq_len - lengths

    # Pick a uniformly random start so the span fits inside the sequence.
    rand = torch.rand_like(frac_lengths)
    start = (max_start * rand).long().clamp(min=0)
    end = start + lengths

    return mask_from_start_end_indices(seq_len, start, end)
67
+
68
+
69
def maybe_masked_mean(t: float["b n d"], mask: bool["b n"] = None) -> float["b d"]:  # noqa: F722
    """Average ``t`` over dim 1, excluding positions where ``mask`` is False.

    Falls back to a plain mean when no mask is supplied. The denominator is
    clamped to 1 so fully-masked rows yield zeros instead of NaN.
    """
    if mask is None:
        return t.mean(dim=1)

    masked_t = t.masked_fill(~mask.unsqueeze(-1), 0.0)
    totals = masked_t.sum(dim=1)
    counts = mask.float().sum(dim=1).clamp(min=1.0)
    return totals / counts
78
+
79
+
80
+ # simple utf-8 tokenizer, since paper went character based
81
def list_str_to_tensor(text: list[str], padding_value=-1) -> int["b nt"]:  # noqa: F722
    """Encode each string as UTF-8 byte ids (ByT5 style) and pad into a batch."""
    byte_tensors = [torch.tensor(list(s.encode("UTF-8"))) for s in text]
    return pad_sequence(byte_tensors, padding_value=padding_value, batch_first=True)


# char tokenizer, based on custom dataset's extracted .txt file
def list_str_to_idx(
    text: list[str] | list[list[str]],
    vocab_char_map: dict[str, int],  # {char: idx}
    padding_value=-1,
) -> int["b nt"]:  # noqa: F722
    """Map characters (or pinyin tokens) to vocab indices, 0 for unknown, and pad."""
    idx_tensors = [
        torch.tensor([vocab_char_map.get(token, 0) for token in entry]) for entry in text
    ]
    return pad_sequence(idx_tensors, padding_value=padding_value, batch_first=True)
95
+
96
+
97
+ # Get tokenizer
98
def _read_vocab(vocab_path):
    """Load a one-token-per-line vocab file into a {token: index} dict."""
    vocab_char_map = {}
    with open(vocab_path, "r", encoding="utf-8") as f:
        for i, char in enumerate(f):
            # Strip only the trailing newline: tokens may legitimately be " ".
            vocab_char_map[char[:-1]] = i
    return vocab_char_map


def get_tokenizer(dataset_name, tokenizer: str = "pinyin"):
    """
    tokenizer - "pinyin" do g2p for only chinese characters, need .txt vocab_file
              - "char" for char-wise tokenizer, need .txt vocab_file
              - "byte" for utf-8 tokenizer
              - "custom" if you're directly passing in a path to the vocab.txt you want to use
    vocab_size - if use "pinyin", all available pinyin types, common alphabets (also those with accent) and symbols
               - if use "char", derived from unfiltered character & symbol counts of custom dataset
               - if use "byte", set to 256 (unicode byte range)

    Raises:
        ValueError: for an unsupported ``tokenizer`` mode (previously this
            fell through and crashed with UnboundLocalError at the return).
    """
    if tokenizer in ["pinyin", "char"]:
        tokenizer_path = os.path.join(files("lemas_tts").joinpath("../../data"), f"{dataset_name}_{tokenizer}/vocab.txt")
        vocab_char_map = _read_vocab(tokenizer_path)
        vocab_size = len(vocab_char_map)
        assert vocab_char_map[" "] == 0, "make sure space is of idx 0 in vocab.txt, cuz 0 is used for unknown char"

    elif tokenizer == "byte":
        vocab_char_map = None
        vocab_size = 256

    elif tokenizer == "custom":
        # Here ``dataset_name`` is interpreted directly as a vocab file path.
        vocab_char_map = _read_vocab(dataset_name)
        vocab_size = len(vocab_char_map)

    else:
        raise ValueError(f"unknown tokenizer type: {tokenizer!r}")

    return vocab_char_map, vocab_size
129
+
130
+
131
# convert char to pinyin
def convert_char_to_pinyin(text_list, polyphone=True):
    """Convert each string to a token list: latin text kept as characters,
    Chinese characters replaced by TONE3 pinyin syllables (space-prefixed).

    Args:
        text_list: list of raw input strings.
        polyphone: when True, pure-CJK segments go through context-aware
            pinyin with tone sandhi.

    Returns:
        List of per-string token lists (characters and pinyin syllables).
    """
    # jieba initializes lazily; silence its logger before first use.
    if jieba.dt.initialized is False:
        jieba.default_logger.setLevel(50)  # CRITICAL
        jieba.initialize()

    final_text_list = []
    # Map full-width punctuation/quotes to ASCII equivalents to avoid OOV tokens.
    custom_trans = str.maketrans(
        {";": ",", "“": '"', "”": '"', "‘": "'", "’": "'"}
    )  # add custom trans here, to address oov

    def is_chinese(c):
        # NOTE(review): U+3100 onward also covers bopomofo/CJK symbols, not
        # only hanzi — presumably intentional for this frontend; verify.
        return (
            "\u3100" <= c <= "\u9fff"  # common chinese characters
        )

    for text in text_list:
        char_list = []
        text = text.translate(custom_trans)
        # NOTE(review): this commit ships ``lemas_tts/infer/text_norm/cn_tn.py``;
        # confirm ``lemas_tts.infer.cn_tn`` is a valid import path and not a
        # stale module reference.
        from lemas_tts.infer.cn_tn import NSWNormalizer
        text = NSWNormalizer(text.strip()).normalize()
        text = list(jieba.cut(text))
        for seg in text:
            # UTF-8 byte length distinguishes ASCII (1 byte/char) from
            # pure-CJK segments (3 bytes/char) and mixed content.
            seg_byte_len = len(bytes(seg, "UTF-8"))
            if seg_byte_len == len(seg):  # if pure alphabets and symbols
                # Insert a separating space between adjacent latin words.
                if char_list and seg_byte_len > 1 and char_list[-1] not in " :'\"":
                    char_list.append(" ")
                char_list.extend(seg)
            elif polyphone and seg_byte_len == 3 * len(seg):  # if pure east asian characters
                seg_ = lazy_pinyin(seg, style=Style.TONE3, tone_sandhi=True)
                for i, c in enumerate(seg):
                    if is_chinese(c):
                        char_list.append(" ")
                    char_list.append(seg_[i])
            else:  # if mixed characters, alphabets and symbols
                for c in seg:
                    if ord(c) < 256:
                        char_list.extend(c)
                    elif is_chinese(c):
                        char_list.append(" ")
                        char_list.extend(lazy_pinyin(c, style=Style.TONE3, tone_sandhi=True))
                    else:
                        # Other non-latin characters pass through unchanged.
                        char_list.append(c)
        final_text_list.append(char_list)

    return final_text_list
177
+
178
+
179
+ # filter func for dirty data with many repetitions
180
+
181
+
182
def repetition_found(text, length=2, tolerance=10):
    """Return True if any length-``length`` substring occurs more than ``tolerance`` times.

    Used to filter dirty transcripts that contain pathological repetitions.
    """
    counts = defaultdict(int)
    for start in range(len(text) - length + 1):
        pattern = text[start : start + length]
        counts[pattern] += 1
        # Stop early the moment a pattern crosses the tolerance threshold.
        if counts[pattern] > tolerance:
            return True
    return False
lemas_tts/scripts/inference_gradio.py ADDED
@@ -0,0 +1,584 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import gc
2
+ import os
3
+ import platform
4
+ import psutil
5
+ import tempfile
6
+ from glob import glob
7
+ import traceback
8
+ import click
9
+ import gradio as gr
10
+ import torch
11
+
12
+ import sys
13
+ from pathlib import Path
14
+
15
+ # Add the local code directory so that `lemas_tts` can be imported when running this
16
+ # script directly without installing the package.
17
+ THIS_FILE = Path(__file__).resolve()
18
+ SRC_ROOT = THIS_FILE.parents[2] # .../code
19
+ sys.path.append(str(SRC_ROOT))
20
+
21
+
22
+ def _find_repo_root(start: Path) -> Path:
23
+ """Locate the repo root by looking for a `pretrained_models` folder upwards."""
24
+ for p in [start, *start.parents]:
25
+ if (p / "pretrained_models").is_dir():
26
+ return p
27
+ cwd = Path.cwd()
28
+ if (cwd / "pretrained_models").is_dir():
29
+ return cwd
30
+ return start
31
+
32
+
33
+ REPO_ROOT = _find_repo_root(THIS_FILE)
34
+ PRETRAINED_ROOT = REPO_ROOT / "pretrained_models"
35
+ CKPTS_ROOT = PRETRAINED_ROOT / "ckpts"
36
+ DATA_ROOT = PRETRAINED_ROOT / "data"
37
+ UVR5_CODE_DIR = REPO_ROOT / "code" / "uvr5"
38
+ UVR5_MODEL_DIR = PRETRAINED_ROOT / "uvr5" / "models" / "MDX_Net_Models" / "model_data"
39
+
40
+ from lemas_tts.api import F5TTS
41
+ import torch, torchaudio
42
+ import soundfile as sf
43
+
44
+ # Global variables
45
+ tts_api = None
46
+ last_checkpoint = ""
47
+ last_device = ""
48
+ last_ema = None
49
+
50
+ # Device detection
51
+ device = (
52
+ "cuda"
53
+ if torch.cuda.is_available()
54
+ else "xpu"
55
+ if torch.xpu.is_available()
56
+ else "mps"
57
+ if torch.backends.mps.is_available()
58
+ else "cpu"
59
+ )
60
+
61
+
62
class UVR5:
    """Thin wrapper around the bundled MDX-Net (Kim_Vocal_1) vocal separator.

    Loads the ONNX model on CPU at construction time and exposes a single
    ``denoise`` entry point used by the Gradio UI.
    """

    def __init__(self, model_dir):
        # The uvr5 inference code lives outside this package; its directory is
        # appended to sys.path inside load_model.
        code_dir = str(UVR5_CODE_DIR)
        self.model = self.load_model(str(model_dir), code_dir)

    def load_model(self, model_dir, code_dir):
        """Build the MDX-Net inference object from the on-disk ONNX + JSON config."""
        import sys, json, os
        # Make `multiprocess_cuda_infer` importable from the uvr5 code folder.
        sys.path.append(code_dir)
        from multiprocess_cuda_infer import ModelData, Inference
        model_path = os.path.join(model_dir, 'Kim_Vocal_1.onnx')
        config_path = os.path.join(model_dir, 'MDX-Net-Kim-Vocal1.json')
        configs = json.loads(open(config_path, 'r', encoding='utf-8').read())
        model_data = ModelData(
            model_path=model_path,
            audio_path = model_dir,
            result_path = model_dir,
            device = 'cpu',
            process_method = "MDX-Net",
            base_dir=code_dir,
            **configs
        )

        uvr5_model = Inference(model_data, 'cpu')
        uvr5_model.load_model(model_path, 1)
        return uvr5_model

    def denoise(self, audio_info):
        """Separate vocals from ``audio_info`` (a file path).

        Returns:
            Tuple of (samples as a (frames, channels) numpy array, 44100).
        """
        print("denoise UVR5: ", audio_info)
        # MDX-Net operates on 44.1 kHz stereo input.
        input_audio = load_wav(audio_info, sr=44100, channel=2)
        output_audio = self.model.demix_base({0:input_audio.squeeze()}, is_match_mix=False)
        # transform = torchaudio.transforms.Resample(44100, 16000)
        # output_audio = transform(output_audio)
        return output_audio.squeeze().T.numpy(), 44100
95
+
96
+
97
+ denoise_model = UVR5(UVR5_MODEL_DIR)
98
+
99
def load_wav(audio_info, sr=16000, channel=1):
    """Load an audio file, peak-normalize, fix channel count, and resample.

    Args:
        audio_info: path (or file-like object) accepted by ``torchaudio.load``.
        sr: target sample rate.
        channel: 1 for mono output, 2 for stereo.

    Returns:
        Float tensor clipped to [-0.999, 0.999] at the requested rate/channels.
    """
    print("load audio:", audio_info)
    audio, raw_sr = torchaudio.load(audio_info)
    # NOTE(review): torchaudio.load returns (channels, frames); transposing
    # when dim 1 == 2 looks aimed at (frames, 2) inputs — confirm intent.
    audio = audio.T if len(audio.shape) > 1 and audio.shape[1] == 2 else audio
    # Peak-normalize. Guard against all-zero (silent) input, which previously
    # divided by zero and filled the tensor with NaNs.
    peak = torch.max(torch.abs(audio))
    if peak > 0:
        audio = audio / peak
    audio = audio.squeeze().float()
    if channel == 1 and len(audio.shape) == 2:  # stereo to mono
        audio = audio.mean(dim=0, keepdim=True)
    elif channel == 2 and len(audio.shape) == 1:
        audio = torch.stack((audio, audio))  # mono to stereo
    if raw_sr != sr:
        audio = torchaudio.functional.resample(audio.squeeze(), raw_sr, sr)
    audio = torch.clip(audio, -0.999, 0.999).squeeze()
    return audio
113
+
114
+
115
def denoise(audio_info):
    """Run the global UVR5 separator on ``audio_info`` and return the output path."""
    output_path = "./denoised_audio.wav"
    samples, sample_rate = denoise_model.denoise(audio_info)
    # Persist as 24-bit PCM so the Gradio Audio component can serve it.
    sf.write(output_path, samples, sample_rate, format='wav', subtype='PCM_24')
    print("save denoised audio:", output_path)
    return output_path
121
+
122
def cancel_denoise(audio_info):
    """Revert the denoised slot to the raw reference audio (identity pass-through)."""
    return audio_info
124
+
125
+
126
def get_checkpoints_project(project_name=None, is_gradio=True):
    """Get available checkpoint files"""
    base_dir = str(CKPTS_ROOT)

    if project_name is None:
        # No project selected: scan the whole checkpoint tree recursively.
        files_checkpoints = []
        if os.path.isdir(base_dir):
            for ext in ("pt", "safetensors"):
                files_checkpoints.extend(
                    glob(os.path.join(base_dir, f"**/*.{ext}"), recursive=True)
                )
    else:
        # project_name = project_name.replace("_pinyin", "").replace("_char", "")
        if project_name != "F5TTS_v1_Base":
            project_name = "_".join(
                ["F5TTS_v1_Base", "vocos", "custom", project_name.replace("_custom", "")]
            )
        if os.path.isdir(base_dir):
            files_checkpoints = glob(os.path.join(base_dir, project_name, "*.pt"))
            files_checkpoints += glob(os.path.join(base_dir, project_name, "*.safetensors"))
        else:
            files_checkpoints = []

    print("files_checkpoints:", project_name, files_checkpoints)

    # Separate pretrained / regular / last checkpoints.
    pretrained_checkpoints = [
        p for p in files_checkpoints if "pretrained_" in os.path.basename(p)
    ]
    regular_checkpoints = [
        p
        for p in files_checkpoints
        if "pretrained_" not in os.path.basename(p)
        and "model_last.pt" not in os.path.basename(p)
    ]
    last_checkpoint = [
        p for p in files_checkpoints if "model_last.pt" in os.path.basename(p)
    ]

    # Sort regular checkpoints by their embedded step number when possible.
    def _step_number(path):
        return int(os.path.basename(path).split("_")[1].split(".")[0])

    try:
        regular_checkpoints = sorted(regular_checkpoints, key=_step_number)
    except (IndexError, ValueError):
        regular_checkpoints = sorted(regular_checkpoints)

    # Combine in order: pretrained, regular, last; pick the last as default.
    files_checkpoints = pretrained_checkpoints + regular_checkpoints + last_checkpoint
    select_checkpoint = files_checkpoints[-1] if files_checkpoints else None

    if is_gradio:
        return gr.update(choices=files_checkpoints, value=select_checkpoint)

    return files_checkpoints, select_checkpoint
172
+
173
+
174
def get_available_projects():
    """Get available project names from data directory"""
    data_path = str(DATA_ROOT)

    project_list = []
    if os.path.isdir(data_path):
        # Skip evaluation/test folders.
        project_list = [name for name in os.listdir(data_path) if "test" not in name]

    # Fallback to a sensible default if no projects are found
    if not project_list:
        project_list = ["multilingual_acc_grl_custom"]

    return project_list
190
+
191
+
192
def infer(
    project, file_checkpoint, exp_name, ref_text, ref_audio, denoise_audio, gen_text, nfe_step, use_ema, separate_langs, frontend, speed, cfg_strength, use_acc_grl, ref_ratio, no_ref_audio, sway_sampling_coef, use_prosody_encoder, seed
):
    """Synthesize speech with F5TTS, lazily (re)loading the model on change.

    Caches the loaded model in module globals and reloads only when the
    checkpoint, device, or EMA setting differs from the previous call.

    Returns:
        Tuple of (generated wav path or None, status/device message, seed used).
    """
    global last_checkpoint, last_device, tts_api, last_ema

    if not os.path.isfile(file_checkpoint):
        return None, "Checkpoint not found!", ""

    # Prefer the denoised reference audio when the user produced one.
    if denoise_audio:
        ref_audio = denoise_audio

    device_test = device  # Use the global device

    # Reload the model whenever any load-relevant setting changed.
    if last_checkpoint != file_checkpoint or last_device != device_test or last_ema != use_ema or tts_api is None:
        if last_checkpoint != file_checkpoint:
            last_checkpoint = file_checkpoint

        if last_device != device_test:
            last_device = device_test

        if last_ema != use_ema:
            last_ema = use_ema

        # Try to find vocab file
        vocab_file = None
        possible_vocab_paths = [
            str(DATA_ROOT / project / "vocab.txt"),
            # legacy fallbacks for older layouts
            f"./data/{project}/vocab.txt",
            f"../../data/{project}/vocab.txt",
            "./data/Emilia_ZH_EN_pinyin/vocab.txt",
            "../../data/Emilia_ZH_EN_pinyin/vocab.txt",
        ]

        for path in possible_vocab_paths:
            if os.path.isfile(path):
                vocab_file = path
                break

        if vocab_file is None:
            return None, "Vocab file not found!", ""

        try:
            tts_api = F5TTS(
                model=exp_name,
                ckpt_file=file_checkpoint,
                vocab_file=vocab_file,
                device=device_test,
                use_ema=use_ema,
                frontend=frontend,
                use_prosody_encoder=use_prosody_encoder,
                prosody_cfg_path=str(CKPTS_ROOT / "prosody_encoder" / "pretssel_cfg.json"),
                prosody_ckpt_path=str(CKPTS_ROOT / "prosody_encoder" / "prosody_encoder_UnitY2.pt"),
            )
        except Exception as e:
            traceback.print_exc()
            return None, f"Error loading model: {str(e)}", ""

    print("Model loaded >>", device_test, file_checkpoint, use_ema)

    if seed == -1:  # -1 used for random
        seed = None

    try:
        # delete=False so the file survives past the context for Gradio to serve.
        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
            tts_api.infer(
                ref_file=ref_audio,
                ref_text=ref_text.strip(),
                gen_text=gen_text.strip(),
                nfe_step=nfe_step,
                separate_langs=separate_langs,
                speed=speed,
                cfg_strength=cfg_strength,
                sway_sampling_coef=sway_sampling_coef,
                use_acc_grl=use_acc_grl,
                ref_ratio=ref_ratio,
                no_ref_audio=no_ref_audio,
                use_prosody_encoder=use_prosody_encoder,
                file_wave=f.name,
                seed=seed,
            )
        return f.name, f"Device: {tts_api.device}", str(tts_api.seed)
    except Exception as e:
        traceback.print_exc()
        return None, f"Inference error: {str(e)}", ""
277
+
278
+
279
def get_gpu_stats():
    """Get GPU statistics as a human-readable multi-line string.

    Guards ``torch.xpu`` / ``torch.backends.mps`` attribute access so the
    function does not raise AttributeError on torch builds that lack those
    backends (requirements pin torch==2.3.1).
    """
    gpu_stats = ""

    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        for i in range(gpu_count):
            gpu_name = torch.cuda.get_device_name(i)
            gpu_properties = torch.cuda.get_device_properties(i)
            total_memory = gpu_properties.total_memory / (1024**3)  # in GB
            allocated_memory = torch.cuda.memory_allocated(i) / (1024**2)  # in MB
            reserved_memory = torch.cuda.memory_reserved(i) / (1024**2)  # in MB

            gpu_stats += (
                f"GPU {i} Name: {gpu_name}\n"
                f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
                f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
                f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
            )
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        gpu_count = torch.xpu.device_count()
        for i in range(gpu_count):
            gpu_name = torch.xpu.get_device_name(i)
            gpu_properties = torch.xpu.get_device_properties(i)
            total_memory = gpu_properties.total_memory / (1024**3)  # in GB
            allocated_memory = torch.xpu.memory_allocated(i) / (1024**2)  # in MB
            reserved_memory = torch.xpu.memory_reserved(i) / (1024**2)  # in MB

            gpu_stats += (
                f"GPU {i} Name: {gpu_name}\n"
                f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
                f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
                f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
            )
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        gpu_count = 1
        gpu_stats += "MPS GPU\n"
        total_memory = psutil.virtual_memory().total / (
            1024**3
        )  # Total system memory (MPS doesn't have its own memory)
        allocated_memory = 0
        reserved_memory = 0

        gpu_stats += (
            f"Total system memory: {total_memory:.2f} GB\n"
            f"Allocated GPU memory (MPS): {allocated_memory:.2f} MB\n"
            f"Reserved GPU memory (MPS): {reserved_memory:.2f} MB\n"
        )

    else:
        gpu_stats = "No GPU available"

    return gpu_stats
332
+
333
+
334
def get_cpu_stats():
    """Get CPU statistics"""
    cpu_usage = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory()
    memory_used = mem.used / (1024**2)
    memory_total = mem.total / (1024**2)
    memory_percent = mem.percent

    # Include this process's scheduling priority for diagnostics.
    nice_value = psutil.Process(os.getpid()).nice()

    return (
        f"CPU Usage: {cpu_usage:.2f}%\n"
        f"System Memory: {memory_used:.2f} MB used / {memory_total:.2f} MB total ({memory_percent}% used)\n"
        f"Process Priority (Nice value): {nice_value}"
    )
353
+
354
+
355
def get_combined_stats():
    """Get combined system stats"""
    return f"### GPU Stats\n{get_gpu_stats()}\n\n### CPU Stats\n{get_cpu_stats()}"
361
+
362
+
363
# Create Gradio interface
# NOTE(review): the original indentation was lost in transit; the widget
# nesting below is reconstructed from the declaration order — verify layout.
with gr.Blocks(title="LEMAS-TTS Inference") as app:
    gr.Markdown(
        """
        # Zero-Shot TTS

        Set seed to -1 for random generation.
        """
    )
    # Collapsible model/inference configuration panel.
    with gr.Accordion("Model configuration", open=False):
        # Model configuration
        with gr.Row():
            exp_name = gr.Radio(
                label="Model", choices=["F5TTS_v1_Base", "F5TTS_Base", "E2TTS_Base"], value="F5TTS_v1_Base", visible=False
            )
        # Project selection
        available_projects = get_available_projects()

        # Get initial checkpoints
        list_checkpoints, checkpoint_select = get_checkpoints_project(available_projects[0] if available_projects else None, False)

        with gr.Row():
            with gr.Column(scale=1):
                # load_models_btn = gr.Button(value="Load models")
                cm_project = gr.Dropdown(
                    choices=available_projects,
                    value=available_projects[0] if available_projects else None,
                    label="Project",
                    allow_custom_value=True,
                    scale=4
                )

            with gr.Column(scale=5):
                cm_checkpoint = gr.Dropdown(
                    choices=list_checkpoints, value=checkpoint_select, label="Checkpoints", allow_custom_value=True  # scale=4,
                )
                bt_checkpoint_refresh = gr.Button("Refresh", scale=1)

        with gr.Row():
            ch_use_ema = gr.Checkbox(label="Use EMA", value=True, scale=2, info="Turn off at early stage might offer better results")
            frontend = gr.Radio(label="Frontend", choices=["phone", "char", "bpe"], value="phone", scale=3)
            separate_langs = gr.Checkbox(label="Separate Languages", value=True, scale=2, info="separate language tokens")

        # Inference parameters
        with gr.Row():
            nfe_step = gr.Number(label="NFE Step", scale=1, value=64)
            speed = gr.Slider(label="Speed", scale=3, value=1.0, minimum=0.5, maximum=1.5, step=0.1)
            cfg_strength = gr.Slider(label="CFG Strength", scale=2, value=5.0, minimum=0.0, maximum=10.0, step=1)
            sway_sampling_coef = gr.Slider(label="Sway Sampling Coef", scale=2, value=3, minimum=-1, maximum=5, step=0.1)
            ref_ratio = gr.Slider(label="Ref Ratio", scale=2, value=1.0, minimum=0.0, maximum=1.0, step=0.1)
            no_ref_audio = gr.Checkbox(label="No Reference Audio", value=False, scale=1, info="No mel condition")
            use_acc_grl = gr.Checkbox(label="Use accent grl condition", value=False, scale=1, info="Use accent grl condition")
            use_prosody_encoder = gr.Checkbox(label="Use prosody encoder", value=False, scale=1, info="Use prosody encoder")
            seed = gr.Number(label="Random Seed", scale=1, value=5828684826493313192, minimum=-1)

    # Input fields
    ref_text = gr.Textbox(label="Reference Text", placeholder="Enter the text for the reference audio...")
    ref_audio = gr.Audio(label="Reference Audio", type="filepath", interactive=True, show_download_button=True, editable=True)

    # Optional UVR5 denoise step for the reference audio.
    with gr.Row():
        denoise_btn = gr.Button(value="Denoise")
        cancel_btn = gr.Button(value="Cancel Denoise")
    denoise_audio = gr.Audio(label="Denoised Audio", value=None, type="filepath", interactive=True, show_download_button=True, editable=True)

    gen_text = gr.Textbox(label="Text to Generate", placeholder="Enter the text you want to generate...")

    # Inference button and outputs
    with gr.Row():
        txt_info_gpu = gr.Textbox("", label="Device Info")
        seed_info = gr.Textbox(label="Used Random Seed")
        check_button_infer = gr.Button("Generate Audio", variant="primary")

    gen_audio = gr.Audio(label="Generated Audio", type="filepath", interactive=True, show_download_button=True, editable=True)

    # Examples
    examples = gr.Examples(
        examples=[
            [
                "Ich glaub, mein Schwein pfeift.",
                str(DATA_ROOT / "test_examples" / "de.wav"),
                "我觉得我的猪在吹口哨。",
            ],
            [
                "em, #1 I have a list of YouTubers, and I'm gonna be going to their houses and raiding them by.",
                str(DATA_ROOT / "test_examples" / "en.wav"),
                "我有一份 YouTuber 名单,我打算去他们家,对他们进行突袭。",
            ],
            [
                "Te voy a dar un tip #1 que le copia a John Rockefeller, uno de los empresarios más picudos de la historia.",
                str(DATA_ROOT / "test_examples" / "es.wav"),
                "我要给你一个从历史上最精明的商人之一约翰·洛克菲勒那里抄来的秘诀。",
            ],
            [
                "Per l'amor di Dio #1 fai, #2 se pensi di non poterti fermare, fallo #1 e fallo.",
                str(DATA_ROOT / "test_examples" / "it.wav"),
                "看在上帝的份上,去做吧,如果你认为你无法停止,那就去做吧,继续做下去。",
            ],
            [
                "Nova, #1 dia 25 desse mês vai rolar operação the last Frontier.",
                str(DATA_ROOT / "test_examples" / "pt.wav"),
                "新消息,本月二十五日,'最后的边疆行动'将启动。",
            ],
            # ["Good morning! #1 ",
            # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_0.wav",
            # " #1"
            # ],
            # ["Good morning! #1 ",
            # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_1.wav",
            # " #1",
            # ],
            # ["Good morning! #1 ",
            # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_2.wav",
            # " #1",
            # ],
            # ["Oh, and in case I don't see ya, #1",
            # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_3.wav",
            # " #1",
            # ],
            # ["Good afternoon, good evening, and good night. #1",
            # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_4.wav",
            # " #1",
            # ],
        ],
        inputs=[
            ref_text,
            ref_audio,
            gen_text,
        ],
        outputs=[gen_audio, txt_info_gpu, seed_info],
        fn=infer,
        cache_examples=False
    )

    # System Info section at the bottom
    gr.Markdown("---")
    gr.Markdown("## System Information")
    with gr.Accordion("Update System Stats", open=False):
        update_button = gr.Button("Update System Stats", scale=1)
        output_box = gr.Textbox(label="GPU and CPU Information", lines=5, scale=5)

    def update_stats():
        # Small wrapper so the click handler refreshes both GPU and CPU stats.
        return get_combined_stats()


    denoise_btn.click(fn=denoise,
                      inputs=[ref_audio],
                      outputs=[denoise_audio])

    cancel_btn.click(fn=cancel_denoise,
                     inputs=[ref_audio],
                     outputs=[denoise_audio])

    # Event handlers
    check_button_infer.click(
        fn=infer,
        inputs=[
            cm_project,
            cm_checkpoint,
            exp_name,
            ref_text,
            ref_audio,
            denoise_audio,
            gen_text,
            nfe_step,
            ch_use_ema,
            separate_langs,
            frontend,
            speed,
            cfg_strength,
            use_acc_grl,
            ref_ratio,
            no_ref_audio,
            sway_sampling_coef,
            use_prosody_encoder,
            seed,
        ],
        outputs=[gen_audio, txt_info_gpu, seed_info],
    )

    bt_checkpoint_refresh.click(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
    cm_project.change(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])

    # Changing the reference audio invalidates any previous denoised result.
    ref_audio.change(
        fn=lambda x: None,
        inputs=[ref_audio],
        outputs=[denoise_audio]
    )

    update_button.click(fn=update_stats, outputs=output_box)

    # Auto-load system stats on startup
    app.load(fn=update_stats, outputs=output_box)
557
+
558
+
559
@click.command()
@click.option("--port", "-p", default=7860, type=int, help="Port to run the app on")
@click.option("--host", "-H", default="0.0.0.0", help="Host to run the app on")
@click.option(
    "--share",
    "-s",
    default=False,
    is_flag=True,
    help="Share the app via Gradio share link",
)
@click.option("--api", "-a", default=True, is_flag=True, help="Allow API access")
def main(port, host, share, api):
    """Launch the Gradio inference server with the given CLI options."""
    print("Starting LEMAS-TTS Inference Interface...")
    print(f"Device: {device}")
    # Queueing enables concurrent requests; allowed_paths lets Gradio serve
    # the example audio files from the data directory.
    app.queue(api_open=api).launch(
        server_name=host,
        server_port=port,
        share=share,
        show_api=api,
        allowed_paths=[str(DATA_ROOT)],
    )
581
+
582
+
583
# Script entry point: launch the Gradio server via the click CLI.
if __name__ == "__main__":
    main()
requirements.txt ADDED
@@ -0,0 +1,185 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ --extra-index-url https://download.pytorch.org/whl/cu121
2
+ faster-whisper==1.1.0
3
+ whisperx==3.1.1
4
+ accelerate>=0.33.0
5
+ aiofiles==23.2.1
6
+ aiohappyeyeballs==2.6.1
7
+ aiohttp==3.13.2
8
+ aiosignal==1.4.0
9
+ annotated-doc==0.0.4
10
+ annotated-types==0.7.0
11
+ antlr4-python3-runtime==4.9.3
12
+ anyio==4.12.0
13
+ attrs==25.4.0
14
+ audioread==3.1.0
15
+ babel==2.17.0
16
+ bitsandbytes>0.37.0; platform_machine != "arm64" and platform_system != "Darwin"
17
+ boto3==1.42.16
18
+ botocore==1.42.16
19
+ brotli==1.2.0
20
+ cached_path
21
+ cachetools==6.2.4
22
+ certifi==2025.11.12
23
+ cffi==2.0.0
24
+ charset-normalizer==3.4.4
25
+ click
26
+ contourpy==1.3.2
27
+ csvw==3.7.0
28
+ cycler==0.12.1
29
+ datasets
30
+ decorator==5.2.1
31
+ dill==0.4.0
32
+ dlinfo==2.0.0
33
+ docopt==0.6.2
34
+ einops==0.8.1
35
+ einx==0.3.0
36
+ ema-pytorch==0.7.3
37
+ encodec==0.1.1
38
+ espeakng==1.0.2
39
+ espeakng-loader==0.2.4
40
+ espeak_phonemizer==1.3.1
41
+ fastapi==0.127.0
42
+ ffmpy==1.0.0
43
+ filelock==3.20.1
44
+ fonttools==4.61.1
45
+ frozendict==2.4.7
46
+ frozenlist==1.8.0
47
+ fsspec==2025.10.0
48
+ gitdb==4.0.12
49
+ GitPython==3.1.45
50
+ google-api-core==2.28.1
51
+ google-auth==2.45.0
52
+ google-cloud-core==2.5.0
53
+ google-cloud-storage==3.7.0
54
+ google-crc32c==1.8.0
55
+ google-resumable-media==2.8.0
56
+ googleapis-common-protos==1.72.0
57
+ gradio==5.38.0
58
+ gradio-client==1.11.0
59
+ groovy==0.1.2
60
+ h11==0.16.0
61
+ hf-xet==1.2.0
62
+ httpcore==1.0.9
63
+ httpx==0.28.1
64
+ huggingface-hub==0.36.0
65
+ hydra-core>=1.3.0
66
+ idna==3.11
67
+ isodate==0.7.2
68
+ jieba
69
+ Jinja2==3.1.6
70
+ jmespath==1.0.1
71
+ joblib==1.5.3
72
+ jsonschema==4.25.1
73
+ jsonschema-specifications==2025.9.1
74
+ kiwisolver==1.4.9
75
+ langid==1.1.6
76
+ language-tags==1.2.0
77
+ lazy_loader==0.4
78
+ librosa
79
+ llvmlite==0.42.0
80
+ loguru==0.7.3
81
+ markdown-it-py==4.0.0
82
+ MarkupSafe
83
+ matplotlib
84
+ mdurl==0.1.2
85
+ mpmath==1.3.0
86
+ msgpack==1.1.2
87
+ multidict==6.7.0
88
+ multiprocess==0.70.18
89
+ networkx==3.1
90
+ num2words==0.5.13
91
+ numba==0.59.0
92
+ numpy==1.26.0
93
+ nvidia-cublas-cu12==12.1.3.1
94
+ nvidia-cuda-cupti-cu12==12.1.105
95
+ nvidia-cuda-nvrtc-cu12==12.1.105
96
+ nvidia-cuda-runtime-cu12==12.1.105
97
+ nvidia-cudnn-cu12==8.9.2.26
98
+ nvidia-cufft-cu12==11.0.2.54
99
+ nvidia-cufile-cu12==1.11.1.6
100
+ nvidia-curand-cu12==10.3.2.106
101
+ nvidia-cusolver-cu12==11.4.5.107
102
+ nvidia-cusparse-cu12==12.1.0.106
103
+ nvidia-cusparselt-cu12==0.6.3
104
+ nvidia-nccl-cu12==2.20.5
105
+ nvidia-nvjitlink-cu12==12.6.85
106
+ nvidia-nvtx-cu12==12.1.105
107
+ omegaconf==2.3.0
108
+ onnx==1.16.0
109
+ onnxruntime
110
+ onnxruntime-gpu
111
+ orjson==3.11.5
112
+ packaging==25.0
113
+ pandas==2.3.3
114
+ phonemizer==3.3.0
115
+ pillow==11.3.0
116
+ platformdirs==4.5.1
117
+ pooch==1.8.2
118
+ propcache==0.4.1
119
+ proto-plus==1.27.0
120
+ protobuf==6.33.2
121
+ psutil==7.2.0
122
+ pyarrow==22.0.0
123
+ pyasn1==0.6.1
124
+ pyasn1_modules==0.4.2
125
+ pycparser==2.23
126
+ pydantic<=2.10.6
127
+ pydantic_core==2.27.2
128
+ pydub
129
+ py-espeak-ng==0.1.8
130
+ Pygments==2.19.2
131
+ pyparsing==3.3.1
132
+ pypinyin
133
+ pypinyin-dict
134
+ python-dateutil==2.9.0.post0
135
+ python-multipart==0.0.21
136
+ pytz==2025.2
137
+ PyYAML==6.0.3
138
+ rdflib==7.5.0
139
+ referencing==0.37.0
140
+ regex
141
+ requests==2.32.5
142
+ rfc3986==1.5.0
143
+ rich==13.9.4
144
+ rpds-py==0.30.0
145
+ rsa==4.9.1
146
+ s3transfer==0.16.0
147
+ safehttpx==0.1.7
148
+ safetensors
149
+ scikit-learn==1.7.1
150
+ scipy==1.15.3
151
+ segments==2.3.0
152
+ semantic-version==2.10.0
153
+ sentry-sdk==2.48.0
154
+ setuptools==80.9.0
155
+ shellingham==1.5.4
156
+ six==1.17.0
157
+ smmap==5.0.2
158
+ soundfile
159
+ soxr==1.0.0
160
+ starlette==0.50.0
161
+ sympy==1.14.0
162
+ termcolor==3.2.0
163
+ threadpoolctl==3.6.0
164
+ tokenizers==0.22.1
165
+ tomli
166
+ tomlkit==0.13.3
167
+ torch==2.3.1
168
+ torchaudio==2.3.1
169
+ torchdiffeq==0.2.4
170
+ tqdm>=4.65.0
171
+ transformers
172
+ transformers-stream-generator
173
+ triton==2.3.1
174
+ typer==0.16.0
175
+ typing_extensions==4.12.2
176
+ tzdata==2025.3
177
+ uritemplate==4.2.0
178
+ urllib3==2.6.2
179
+ uroman
180
+ uvicorn==0.40.0
181
+ vocos
182
+ x-transformers>=1.31.14
183
+ xxhash==3.6.0
184
+ yarl==1.22.0
185
+ zhconv
uvr5/gui_data/constants.py ADDED
@@ -0,0 +1,1147 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import platform
2
+
3
+ #Platform Details
4
+ OPERATING_SYSTEM = platform.system()
5
+ SYSTEM_ARCH = platform.platform()
6
+ SYSTEM_PROC = platform.processor()
7
+ ARM = 'arm'
8
+
9
+ #Main Font
10
+ MAIN_FONT_NAME = "Century Gothic"
11
+
12
+ #Model Types
13
+ VR_ARCH_TYPE = 'VR Arc'
14
+ MDX_ARCH_TYPE = 'MDX-Net'
15
+ DEMUCS_ARCH_TYPE = 'Demucs'
16
+ VR_ARCH_PM = 'VR Architecture'
17
+ ENSEMBLE_MODE = 'Ensemble Mode'
18
+ ENSEMBLE_STEM_CHECK = 'Ensemble Stem'
19
+ SECONDARY_MODEL = 'Secondary Model'
20
+ DEMUCS_6_STEM_MODEL = 'htdemucs_6s'
21
+
22
+ DEMUCS_V3_ARCH_TYPE = 'Demucs v3'
23
+ DEMUCS_V4_ARCH_TYPE = 'Demucs v4'
24
+ DEMUCS_NEWER_ARCH_TYPES = [DEMUCS_V3_ARCH_TYPE, DEMUCS_V4_ARCH_TYPE]
25
+
26
+ DEMUCS_V1 = 'v1'
27
+ DEMUCS_V2 = 'v2'
28
+ DEMUCS_V3 = 'v3'
29
+ DEMUCS_V4 = 'v4'
30
+
31
+ DEMUCS_V1_TAG = 'v1 | '
32
+ DEMUCS_V2_TAG = 'v2 | '
33
+ DEMUCS_V3_TAG = 'v3 | '
34
+ DEMUCS_V4_TAG = 'v4 | '
35
+ DEMUCS_NEWER_TAGS = [DEMUCS_V3_TAG, DEMUCS_V4_TAG]
36
+
37
+ DEMUCS_VERSION_MAPPER = {
38
+ DEMUCS_V1:DEMUCS_V1_TAG,
39
+ DEMUCS_V2:DEMUCS_V2_TAG,
40
+ DEMUCS_V3:DEMUCS_V3_TAG,
41
+ DEMUCS_V4:DEMUCS_V4_TAG}
42
+
43
+ #Download Center
44
+ DOWNLOAD_FAILED = 'Download Failed'
45
+ DOWNLOAD_STOPPED = 'Download Stopped'
46
+ DOWNLOAD_COMPLETE = 'Download Complete'
47
+ DOWNLOAD_UPDATE_COMPLETE = 'Update Download Complete'
48
+ SETTINGS_MENU_EXIT = 'exit'
49
+ NO_CONNECTION = 'No Internet Connection'
50
+ VIP_SELECTION = 'VIP:'
51
+ DEVELOPER_SELECTION = 'VIP:'
52
+ NO_NEW_MODELS = 'All Available Models Downloaded'
53
+ ENSEMBLE_PARTITION = ': '
54
+ NO_MODEL = 'No Model Selected'
55
+ CHOOSE_MODEL = 'Choose Model'
56
+ SINGLE_DOWNLOAD = 'Downloading Item 1/1...'
57
+ DOWNLOADING_ITEM = 'Downloading Item'
58
+ FILE_EXISTS = 'File already exists!'
59
+ DOWNLOADING_UPDATE = 'Downloading Update...'
60
+ DOWNLOAD_MORE = 'Download More Models'
61
+
62
+ #Menu Options
63
+
64
+ AUTO_SELECT = 'Auto'
65
+
66
+ #LINKS
67
+ DOWNLOAD_CHECKS = "https://raw.githubusercontent.com/TRvlvr/application_data/main/filelists/download_checks.json"
68
+ MDX_MODEL_DATA_LINK = "https://raw.githubusercontent.com/TRvlvr/application_data/main/mdx_model_data/model_data.json"
69
+ VR_MODEL_DATA_LINK = "https://raw.githubusercontent.com/TRvlvr/application_data/main/vr_model_data/model_data.json"
70
+
71
+ DEMUCS_MODEL_NAME_DATA_LINK = "https://raw.githubusercontent.com/TRvlvr/application_data/main/demucs_model_data/model_name_mapper.json"
72
+ MDX_MODEL_NAME_DATA_LINK = "https://raw.githubusercontent.com/TRvlvr/application_data/main/mdx_model_data/model_name_mapper.json"
73
+
74
+ DONATE_LINK_BMAC = "https://www.buymeacoffee.com/uvr5"
75
+ DONATE_LINK_PATREON = "https://www.patreon.com/uvr"
76
+
77
+ #DOWNLOAD REPOS
78
+ NORMAL_REPO = "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/"
79
+ UPDATE_REPO = "https://github.com/TRvlvr/model_repo/releases/download/uvr_update_patches/"
80
+
81
+ UPDATE_MAC_ARM_REPO = "https://github.com/Anjok07/ultimatevocalremovergui/releases/download/v5.5.0/Ultimate_Vocal_Remover_v5_5_MacOS_arm64.dmg"
82
+ UPDATE_MAC_X86_64_REPO = "https://github.com/Anjok07/ultimatevocalremovergui/releases/download/v5.5.0/Ultimate_Vocal_Remover_v5_5_MacOS_x86_64.dmg"
83
+ UPDATE_LINUX_REPO = "https://github.com/Anjok07/ultimatevocalremovergui#linux-installation"
84
+ UPDATE_REPO = "https://github.com/TRvlvr/model_repo/releases/download/uvr_update_patches/"
85
+
86
+ ISSUE_LINK = 'https://github.com/Anjok07/ultimatevocalremovergui/issues/new'
87
+ VIP_REPO = b'\xf3\xc2W\x19\x1foI)\xc2\xa9\xcc\xb67(Z\xf5',\
88
+ b'gAAAAABjQAIQ-NpNMMxMedpKHHb7ze_nqB05hw0YhbOy3pFzuzDrfqumn8_qvraxEoUpZC5ZXC0gGvfDxFMqyq9VWbYKlA67SUFI_wZB6QoVyGI581vs7kaGfUqlXHIdDS6tQ_U-BfjbEAK9EU_74-R2zXjz8Xzekw=='
89
+ NO_CODE = 'incorrect_code'
90
+
91
+ #Extensions
92
+
93
+ ONNX = '.onnx'
94
+ CKPT = '.ckpt'
95
+ YAML = '.yaml'
96
+ PTH = '.pth'
97
+ TH_EXT = '.th'
98
+ JSON = '.json'
99
+
100
+ #GUI Buttons
101
+
102
+ START_PROCESSING = 'Start Processing'
103
+ WAIT_PROCESSING = 'Please wait...'
104
+ STOP_PROCESSING = 'Halting process, please wait...'
105
+ LOADING_MODELS = 'Loading models...'
106
+
107
+ #---Messages and Logs----
108
+
109
+ MISSING_MODEL = 'missing'
110
+ MODEL_PRESENT = 'present'
111
+
112
+ UNRECOGNIZED_MODEL = 'Unrecognized Model Detected', ' is an unrecognized model.\n\n' + \
113
+ 'Would you like to select the correct parameters before continuing?'
114
+
115
+ STOP_PROCESS_CONFIRM = 'Confirmation', 'You are about to stop all active processes.\n\nAre you sure you wish to continue?'
116
+ NO_ENSEMBLE_SELECTED = 'No Models Selected', 'Please select ensemble and try again.'
117
+ PICKLE_CORRU = 'File Corrupted', 'Unable to load this ensemble.\n\n' + \
118
+ 'Would you like to remove this ensemble from your list?'
119
+ DELETE_ENS_ENTRY = 'Confirm Removal', 'Are you sure you want to remove this entry?'
120
+
121
+ ALL_STEMS = 'All Stems'
122
+ VOCAL_STEM = 'Vocals'
123
+ INST_STEM = 'Instrumental'
124
+ OTHER_STEM = 'Other'
125
+ BASS_STEM = 'Bass'
126
+ DRUM_STEM = 'Drums'
127
+ GUITAR_STEM = 'Guitar'
128
+ PIANO_STEM = 'Piano'
129
+ SYNTH_STEM = 'Synthesizer'
130
+ STRINGS_STEM = 'Strings'
131
+ WOODWINDS_STEM = 'Woodwinds'
132
+ BRASS_STEM = 'Brass'
133
+ WIND_INST_STEM = 'Wind Inst'
134
+ NO_OTHER_STEM = 'No Other'
135
+ NO_BASS_STEM = 'No Bass'
136
+ NO_DRUM_STEM = 'No Drums'
137
+ NO_GUITAR_STEM = 'No Guitar'
138
+ NO_PIANO_STEM = 'No Piano'
139
+ NO_SYNTH_STEM = 'No Synthesizer'
140
+ NO_STRINGS_STEM = 'No Strings'
141
+ NO_WOODWINDS_STEM = 'No Woodwinds'
142
+ NO_WIND_INST_STEM = 'No Wind Inst'
143
+ NO_BRASS_STEM = 'No Brass'
144
+ PRIMARY_STEM = 'Primary Stem'
145
+ SECONDARY_STEM = 'Secondary Stem'
146
+
147
+ #Other Constants
148
+ DEMUCS_2_SOURCE = ["instrumental", "vocals"]
149
+ DEMUCS_4_SOURCE = ["drums", "bass", "other", "vocals"]
150
+
151
+ DEMUCS_2_SOURCE_MAPPER = {
152
+ INST_STEM: 0,
153
+ VOCAL_STEM: 1}
154
+
155
+ DEMUCS_4_SOURCE_MAPPER = {
156
+ BASS_STEM: 0,
157
+ DRUM_STEM: 1,
158
+ OTHER_STEM: 2,
159
+ VOCAL_STEM: 3}
160
+
161
+ DEMUCS_6_SOURCE_MAPPER = {
162
+ BASS_STEM: 0,
163
+ DRUM_STEM: 1,
164
+ OTHER_STEM: 2,
165
+ VOCAL_STEM: 3,
166
+ GUITAR_STEM:4,
167
+ PIANO_STEM:5}
168
+
169
+ DEMUCS_4_SOURCE_LIST = [BASS_STEM, DRUM_STEM, OTHER_STEM, VOCAL_STEM]
170
+ DEMUCS_6_SOURCE_LIST = [BASS_STEM, DRUM_STEM, OTHER_STEM, VOCAL_STEM, GUITAR_STEM, PIANO_STEM]
171
+
172
+ DEMUCS_UVR_MODEL = 'UVR_Model'
173
+
174
+ CHOOSE_STEM_PAIR = 'Choose Stem Pair'
175
+
176
+ STEM_SET_MENU = (VOCAL_STEM,
177
+ INST_STEM,
178
+ OTHER_STEM,
179
+ BASS_STEM,
180
+ DRUM_STEM,
181
+ GUITAR_STEM,
182
+ PIANO_STEM,
183
+ SYNTH_STEM,
184
+ STRINGS_STEM,
185
+ WOODWINDS_STEM,
186
+ BRASS_STEM,
187
+ WIND_INST_STEM,
188
+ NO_OTHER_STEM,
189
+ NO_BASS_STEM,
190
+ NO_DRUM_STEM,
191
+ NO_GUITAR_STEM,
192
+ NO_PIANO_STEM,
193
+ NO_SYNTH_STEM,
194
+ NO_STRINGS_STEM,
195
+ NO_WOODWINDS_STEM,
196
+ NO_BRASS_STEM,
197
+ NO_WIND_INST_STEM)
198
+
199
+ STEM_PAIR_MAPPER = {
200
+ VOCAL_STEM: INST_STEM,
201
+ INST_STEM: VOCAL_STEM,
202
+ OTHER_STEM: NO_OTHER_STEM,
203
+ BASS_STEM: NO_BASS_STEM,
204
+ DRUM_STEM: NO_DRUM_STEM,
205
+ GUITAR_STEM: NO_GUITAR_STEM,
206
+ PIANO_STEM: NO_PIANO_STEM,
207
+ SYNTH_STEM: NO_SYNTH_STEM,
208
+ STRINGS_STEM: NO_STRINGS_STEM,
209
+ WOODWINDS_STEM: NO_WOODWINDS_STEM,
210
+ BRASS_STEM: NO_BRASS_STEM,
211
+ WIND_INST_STEM: NO_WIND_INST_STEM,
212
+ NO_OTHER_STEM: OTHER_STEM,
213
+ NO_BASS_STEM: BASS_STEM,
214
+ NO_DRUM_STEM: DRUM_STEM,
215
+ NO_GUITAR_STEM: GUITAR_STEM,
216
+ NO_PIANO_STEM: PIANO_STEM,
217
+ NO_SYNTH_STEM: SYNTH_STEM,
218
+ NO_STRINGS_STEM: STRINGS_STEM,
219
+ NO_WOODWINDS_STEM: WOODWINDS_STEM,
220
+ NO_BRASS_STEM: BRASS_STEM,
221
+ NO_WIND_INST_STEM: WIND_INST_STEM,
222
+ PRIMARY_STEM: SECONDARY_STEM}
223
+
224
+ NON_ACCOM_STEMS = (
225
+ VOCAL_STEM,
226
+ OTHER_STEM,
227
+ BASS_STEM,
228
+ DRUM_STEM,
229
+ GUITAR_STEM,
230
+ PIANO_STEM,
231
+ SYNTH_STEM,
232
+ STRINGS_STEM,
233
+ WOODWINDS_STEM,
234
+ BRASS_STEM,
235
+ WIND_INST_STEM)
236
+
237
+ MDX_NET_FREQ_CUT = [VOCAL_STEM, INST_STEM]
238
+
239
+ DEMUCS_4_STEM_OPTIONS = (ALL_STEMS, VOCAL_STEM, OTHER_STEM, BASS_STEM, DRUM_STEM)
240
+ DEMUCS_6_STEM_OPTIONS = (ALL_STEMS, VOCAL_STEM, OTHER_STEM, BASS_STEM, DRUM_STEM, GUITAR_STEM, PIANO_STEM)
241
+ DEMUCS_2_STEM_OPTIONS = (VOCAL_STEM, INST_STEM)
242
+ DEMUCS_4_STEM_CHECK = (OTHER_STEM, BASS_STEM, DRUM_STEM)
243
+
244
+ #Menu Dropdowns
245
+
246
+ VOCAL_PAIR = f'{VOCAL_STEM}/{INST_STEM}'
247
+ INST_PAIR = f'{INST_STEM}/{VOCAL_STEM}'
248
+ OTHER_PAIR = f'{OTHER_STEM}/{NO_OTHER_STEM}'
249
+ DRUM_PAIR = f'{DRUM_STEM}/{NO_DRUM_STEM}'
250
+ BASS_PAIR = f'{BASS_STEM}/{NO_BASS_STEM}'
251
+ FOUR_STEM_ENSEMBLE = '4 Stem Ensemble'
252
+
253
+ ENSEMBLE_MAIN_STEM = (CHOOSE_STEM_PAIR, VOCAL_PAIR, OTHER_PAIR, DRUM_PAIR, BASS_PAIR, FOUR_STEM_ENSEMBLE)
254
+
255
+ MIN_SPEC = 'Min Spec'
256
+ MAX_SPEC = 'Max Spec'
257
+ AUDIO_AVERAGE = 'Average'
258
+
259
+ MAX_MIN = f'{MAX_SPEC}/{MIN_SPEC}'
260
+ MAX_MAX = f'{MAX_SPEC}/{MAX_SPEC}'
261
+ MAX_AVE = f'{MAX_SPEC}/{AUDIO_AVERAGE}'
262
+ MIN_MAX = f'{MIN_SPEC}/{MAX_SPEC}'
263
+ MIN_MIX = f'{MIN_SPEC}/{MIN_SPEC}'
264
+ MIN_AVE = f'{MIN_SPEC}/{AUDIO_AVERAGE}'
265
+ AVE_MAX = f'{AUDIO_AVERAGE}/{MAX_SPEC}'
266
+ AVE_MIN = f'{AUDIO_AVERAGE}/{MIN_SPEC}'
267
+ AVE_AVE = f'{AUDIO_AVERAGE}/{AUDIO_AVERAGE}'
268
+
269
+ ENSEMBLE_TYPE = (MAX_MIN, MAX_MAX, MAX_AVE, MIN_MAX, MIN_MIX, MIN_AVE, AVE_MAX, AVE_MIN, AVE_AVE)
270
+ ENSEMBLE_TYPE_4_STEM = (MAX_SPEC, MIN_SPEC, AUDIO_AVERAGE)
271
+
272
+ BATCH_MODE = 'Batch Mode'
273
+ BETA_VERSION = 'BETA'
274
+ DEF_OPT = 'Default'
275
+
276
+ CHUNKS = (AUTO_SELECT, '1', '5', '10', '15', '20',
277
+ '25', '30', '35', '40', '45', '50',
278
+ '55', '60', '65', '70', '75', '80',
279
+ '85', '90', '95', 'Full')
280
+
281
+ BATCH_SIZE = (DEF_OPT, '2', '3', '4', '5',
282
+ '6', '7', '8', '9', '10')
283
+
284
+ VOL_COMPENSATION = (AUTO_SELECT, '1.035', '1.08')
285
+
286
+ MARGIN_SIZE = ('44100', '22050', '11025')
287
+
288
+ AUDIO_TOOLS = 'Audio Tools'
289
+
290
+ MANUAL_ENSEMBLE = 'Manual Ensemble'
291
+ TIME_STRETCH = 'Time Stretch'
292
+ CHANGE_PITCH = 'Change Pitch'
293
+ ALIGN_INPUTS = 'Align Inputs'
294
+
295
+ if OPERATING_SYSTEM == 'Windows' or OPERATING_SYSTEM == 'Darwin':
296
+ AUDIO_TOOL_OPTIONS = (MANUAL_ENSEMBLE, TIME_STRETCH, CHANGE_PITCH, ALIGN_INPUTS)
297
+ else:
298
+ AUDIO_TOOL_OPTIONS = (MANUAL_ENSEMBLE, ALIGN_INPUTS)
299
+
300
+ MANUAL_ENSEMBLE_OPTIONS = (MIN_SPEC, MAX_SPEC, AUDIO_AVERAGE)
301
+
302
+ PROCESS_METHODS = (VR_ARCH_PM, MDX_ARCH_TYPE, DEMUCS_ARCH_TYPE, ENSEMBLE_MODE, AUDIO_TOOLS)
303
+
304
+ DEMUCS_SEGMENTS = ('Default', '1', '5', '10', '15', '20',
305
+ '25', '30', '35', '40', '45', '50',
306
+ '55', '60', '65', '70', '75', '80',
307
+ '85', '90', '95', '100')
308
+
309
+ DEMUCS_SHIFTS = (0, 1, 2, 3, 4, 5,
310
+ 6, 7, 8, 9, 10, 11,
311
+ 12, 13, 14, 15, 16, 17,
312
+ 18, 19, 20)
313
+
314
+ DEMUCS_OVERLAP = (0.25, 0.50, 0.75, 0.99)
315
+
316
+ VR_AGGRESSION = (1, 2, 3, 4, 5,
317
+ 6, 7, 8, 9, 10, 11,
318
+ 12, 13, 14, 15, 16, 17,
319
+ 18, 19, 20)
320
+
321
+ VR_WINDOW = ('320', '512','1024')
322
+ VR_CROP = ('256', '512', '1024')
323
+ POST_PROCESSES_THREASHOLD_VALUES = ('0.1', '0.2', '0.3')
324
+
325
+ MDX_POP_PRO = ('MDX-NET_Noise_Profile_14_kHz', 'MDX-NET_Noise_Profile_17_kHz', 'MDX-NET_Noise_Profile_Full_Band')
326
+ MDX_POP_STEMS = ('Vocals', 'Instrumental', 'Other', 'Drums', 'Bass')
327
+ MDX_POP_NFFT = ('4096', '5120', '6144', '7680', '8192', '16384')
328
+ MDX_POP_DIMF = ('2048', '3072', '4096')
329
+
330
+ SAVE_ENSEMBLE = 'Save Ensemble'
331
+ CLEAR_ENSEMBLE = 'Clear Selection(s)'
332
+ MENU_SEPARATOR = 35*'•'
333
+ CHOOSE_ENSEMBLE_OPTION = 'Choose Option'
334
+
335
+ INVALID_ENTRY = 'Invalid Input, Please Try Again'
336
+ ENSEMBLE_INPUT_RULE = '1. Only letters, numbers, spaces, and dashes allowed.\n2. No dashes or spaces at the start or end of input.'
337
+
338
+ ENSEMBLE_OPTIONS = (SAVE_ENSEMBLE, CLEAR_ENSEMBLE)
339
+ ENSEMBLE_CHECK = 'ensemble check'
340
+
341
+ SELECT_SAVED_ENSEMBLE = 'Select Saved Ensemble'
342
+ SELECT_SAVED_SETTING = 'Select Saved Setting'
343
+ ENSEMBLE_OPTION = "Ensemble Customization Options"
344
+ MDX_OPTION = "Advanced MDX-Net Options"
345
+ DEMUCS_OPTION = "Advanced Demucs Options"
346
+ VR_OPTION = "Advanced VR Options"
347
+ HELP_OPTION = "Open Information Guide"
348
+ ERROR_OPTION = "Open Error Log"
349
+ VERIFY_BEGIN = 'Verifying file '
350
+ SAMPLE_BEGIN = 'Creating Sample '
351
+ MODEL_MISSING_CHECK = 'Model Missing:'
352
+
353
+ # Audio Player
354
+
355
+ PLAYING_SONG = ": Playing"
356
+ PAUSE_SONG = ": Paused"
357
+ STOP_SONG = ": Stopped"
358
+
359
+ SELECTED_VER = 'Selected'
360
+ DETECTED_VER = 'Detected'
361
+
362
+ SAMPLE_MODE_CHECKBOX = lambda v:f'Sample Mode ({v}s)'
363
+ REMOVED_FILES = lambda r, e:f'Audio Input Verification Report:\n\nRemoved Files:\n\n{r}\n\nError Details:\n\n{e}'
364
+ ADVANCED_SETTINGS = (ENSEMBLE_OPTION, MDX_OPTION, DEMUCS_OPTION, VR_OPTION, HELP_OPTION, ERROR_OPTION)
365
+
366
+ WAV = 'WAV'
367
+ FLAC = 'FLAC'
368
+ MP3 = 'MP3'
369
+
370
+ MP3_BIT_RATES = ('96k', '128k', '160k', '224k', '256k', '320k')
371
+ WAV_TYPE = ('PCM_U8', 'PCM_16', 'PCM_24', 'PCM_32', '32-bit Float', '64-bit Float')
372
+
373
+ SELECT_SAVED_SET = 'Choose Option'
374
+ SAVE_SETTINGS = 'Save Current Settings'
375
+ RESET_TO_DEFAULT = 'Reset to Default'
376
+ RESET_FULL_TO_DEFAULT = 'Reset to Default'
377
+ RESET_PM_TO_DEFAULT = 'Reset All Application Settings to Default'
378
+
379
+ SAVE_SET_OPTIONS = (SAVE_SETTINGS, RESET_TO_DEFAULT)
380
+
381
+ TIME_PITCH = ('1.0', '2.0', '3.0', '4.0')
382
+ TIME_TEXT = '_time_stretched'
383
+ PITCH_TEXT = '_pitch_shifted'
384
+
385
+ #RegEx Input Validation
386
+
387
+ REG_PITCH = r'^[-+]?(1[0]|[0-9]([.][0-9]*)?)$'
388
+ REG_TIME = r'^[+]?(1[0]|[0-9]([.][0-9]*)?)$'
389
+ REG_COMPENSATION = r'\b^(1[0]|[0-9]([.][0-9]*)?|Auto|None)$\b'
390
+ REG_THES_POSTPORCESS = r'\b^([0]([.][0-9]{0,6})?)$\b'
391
+ REG_CHUNKS = r'\b^(200|1[0-9][0-9]|[1-9][0-9]?|Auto|Full)$\b'
392
+ REG_CHUNKS_DEMUCS = r'\b^(200|1[0-9][0-9]|[1-9][0-9]?|Auto|Full)$\b'
393
+ REG_MARGIN = r'\b^[0-9]*$\b'
394
+ REG_SEGMENTS = r'\b^(200|1[0-9][0-9]|[1-9][0-9]?|Default)$\b'
395
+ REG_SAVE_INPUT = r'\b^([a-zA-Z0-9 -]{0,25})$\b'
396
+ REG_AGGRESSION = r'^[-+]?[0-9]\d*?$'
397
+ REG_WINDOW = r'\b^[0-9]{0,4}$\b'
398
+ REG_SHIFTS = r'\b^[0-9]*$\b'
399
+ REG_BATCHES = r'\b^([0-9]*?|Default)$\b'
400
+ REG_OVERLAP = r'\b^([0]([.][0-9]{0,6})?|None)$\b'
401
+
402
+ # Sub Menu
403
+
404
+ VR_ARCH_SETTING_LOAD = 'Load for VR Arch'
405
+ MDX_SETTING_LOAD = 'Load for MDX-Net'
406
+ DEMUCS_SETTING_LOAD = 'Load for Demucs'
407
+ ALL_ARCH_SETTING_LOAD = 'Load for Full Application'
408
+
409
+ # Mappers
410
+
411
+ DEFAULT_DATA = {
412
+
413
+ 'chosen_process_method': MDX_ARCH_TYPE,
414
+ 'vr_model': CHOOSE_MODEL,
415
+ 'aggression_setting': 10,
416
+ 'window_size': 512,
417
+ 'batch_size': 4,
418
+ 'crop_size': 256,
419
+ 'is_tta': False,
420
+ 'is_output_image': False,
421
+ 'is_post_process': False,
422
+ 'is_high_end_process': False,
423
+ 'post_process_threshold': 0.2,
424
+ 'vr_voc_inst_secondary_model': NO_MODEL,
425
+ 'vr_other_secondary_model': NO_MODEL,
426
+ 'vr_bass_secondary_model': NO_MODEL,
427
+ 'vr_drums_secondary_model': NO_MODEL,
428
+ 'vr_is_secondary_model_activate': False,
429
+ 'vr_voc_inst_secondary_model_scale': 0.9,
430
+ 'vr_other_secondary_model_scale': 0.7,
431
+ 'vr_bass_secondary_model_scale': 0.5,
432
+ 'vr_drums_secondary_model_scale': 0.5,
433
+ 'demucs_model': CHOOSE_MODEL,
434
+ 'demucs_stems': ALL_STEMS,
435
+ 'segment': DEMUCS_SEGMENTS[0],
436
+ 'overlap': DEMUCS_OVERLAP[0],
437
+ 'shifts': 2,
438
+ 'chunks_demucs': CHUNKS[0],
439
+ 'margin_demucs': 44100,
440
+ 'is_chunk_demucs': False,
441
+ 'is_chunk_mdxnet': False,
442
+ 'is_primary_stem_only_Demucs': False,
443
+ 'is_secondary_stem_only_Demucs': False,
444
+ 'is_split_mode': True,
445
+ 'is_demucs_combine_stems': True,
446
+ 'demucs_voc_inst_secondary_model': NO_MODEL,
447
+ 'demucs_other_secondary_model': NO_MODEL,
448
+ 'demucs_bass_secondary_model': NO_MODEL,
449
+ 'demucs_drums_secondary_model': NO_MODEL,
450
+ 'demucs_is_secondary_model_activate': False,
451
+ 'demucs_voc_inst_secondary_model_scale': 0.9,
452
+ 'demucs_other_secondary_model_scale': 0.7,
453
+ 'demucs_bass_secondary_model_scale': 0.5,
454
+ 'demucs_drums_secondary_model_scale': 0.5,
455
+ 'demucs_stems': ALL_STEMS,
456
+ 'demucs_pre_proc_model': NO_MODEL,
457
+ 'is_demucs_pre_proc_model_activate': False,
458
+ 'is_demucs_pre_proc_model_inst_mix': False,
459
+ 'mdx_net_model': CHOOSE_MODEL,
460
+ 'chunks': CHUNKS[0],
461
+ 'margin': 44100,
462
+ 'compensate': AUTO_SELECT,
463
+ 'is_denoise': False,
464
+ 'is_invert_spec': False,
465
+ 'is_mixer_mode': False,
466
+ 'mdx_batch_size': DEF_OPT,
467
+ 'mdx_voc_inst_secondary_model': NO_MODEL,
468
+ 'mdx_other_secondary_model': NO_MODEL,
469
+ 'mdx_bass_secondary_model': NO_MODEL,
470
+ 'mdx_drums_secondary_model': NO_MODEL,
471
+ 'mdx_is_secondary_model_activate': False,
472
+ 'mdx_voc_inst_secondary_model_scale': 0.9,
473
+ 'mdx_other_secondary_model_scale': 0.7,
474
+ 'mdx_bass_secondary_model_scale': 0.5,
475
+ 'mdx_drums_secondary_model_scale': 0.5,
476
+ 'is_save_all_outputs_ensemble': True,
477
+ 'is_append_ensemble_name': False,
478
+ 'chosen_audio_tool': AUDIO_TOOL_OPTIONS[0],
479
+ 'choose_algorithm': MANUAL_ENSEMBLE_OPTIONS[0],
480
+ 'time_stretch_rate': 2.0,
481
+ 'pitch_rate': 2.0,
482
+ 'is_gpu_conversion': False,
483
+ 'is_primary_stem_only': False,
484
+ 'is_secondary_stem_only': False,
485
+ 'is_testing_audio': False,
486
+ 'is_add_model_name': False,
487
+ 'is_accept_any_input': False,
488
+ 'is_task_complete': False,
489
+ 'is_normalization': False,
490
+ 'is_create_model_folder': False,
491
+ 'mp3_bit_set': '320k',
492
+ 'save_format': WAV,
493
+ 'wav_type_set': 'PCM_16',
494
+ 'user_code': '',
495
+ 'export_path': '',
496
+ 'input_paths': [],
497
+ 'lastDir': None,
498
+ 'export_path': '',
499
+ 'model_hash_table': None,
500
+ 'help_hints_var': False,
501
+ 'model_sample_mode': False,
502
+ 'model_sample_mode_duration': 30
503
+ }
504
+
505
+ SETTING_CHECK = ('vr_model',
506
+ 'aggression_setting',
507
+ 'window_size',
508
+ 'batch_size',
509
+ 'crop_size',
510
+ 'is_tta',
511
+ 'is_output_image',
512
+ 'is_post_process',
513
+ 'is_high_end_process',
514
+ 'post_process_threshold',
515
+ 'vr_voc_inst_secondary_model',
516
+ 'vr_other_secondary_model',
517
+ 'vr_bass_secondary_model',
518
+ 'vr_drums_secondary_model',
519
+ 'vr_is_secondary_model_activate',
520
+ 'vr_voc_inst_secondary_model_scale',
521
+ 'vr_other_secondary_model_scale',
522
+ 'vr_bass_secondary_model_scale',
523
+ 'vr_drums_secondary_model_scale',
524
+ 'demucs_model',
525
+ 'segment',
526
+ 'overlap',
527
+ 'shifts',
528
+ 'chunks_demucs',
529
+ 'margin_demucs',
530
+ 'is_chunk_demucs',
531
+ 'is_primary_stem_only_Demucs',
532
+ 'is_secondary_stem_only_Demucs',
533
+ 'is_split_mode',
534
+ 'is_demucs_combine_stems',
535
+ 'demucs_voc_inst_secondary_model',
536
+ 'demucs_other_secondary_model',
537
+ 'demucs_bass_secondary_model',
538
+ 'demucs_drums_secondary_model',
539
+ 'demucs_is_secondary_model_activate',
540
+ 'demucs_voc_inst_secondary_model_scale',
541
+ 'demucs_other_secondary_model_scale',
542
+ 'demucs_bass_secondary_model_scale',
543
+ 'demucs_drums_secondary_model_scale',
544
+ 'demucs_stems',
545
+ 'mdx_net_model',
546
+ 'chunks',
547
+ 'margin',
548
+ 'compensate',
549
+ 'is_denoise',
550
+ 'is_invert_spec',
551
+ 'mdx_batch_size',
552
+ 'mdx_voc_inst_secondary_model',
553
+ 'mdx_other_secondary_model',
554
+ 'mdx_bass_secondary_model',
555
+ 'mdx_drums_secondary_model',
556
+ 'mdx_is_secondary_model_activate',
557
+ 'mdx_voc_inst_secondary_model_scale',
558
+ 'mdx_other_secondary_model_scale',
559
+ 'mdx_bass_secondary_model_scale',
560
+ 'mdx_drums_secondary_model_scale',
561
+ 'is_save_all_outputs_ensemble',
562
+ 'is_append_ensemble_name',
563
+ 'chosen_audio_tool',
564
+ 'choose_algorithm',
565
+ 'time_stretch_rate',
566
+ 'pitch_rate',
567
+ 'is_primary_stem_only',
568
+ 'is_secondary_stem_only',
569
+ 'is_testing_audio',
570
+ 'is_add_model_name',
571
+ "is_accept_any_input",
572
+ 'is_task_complete',
573
+ 'is_create_model_folder',
574
+ 'mp3_bit_set',
575
+ 'save_format',
576
+ 'wav_type_set',
577
+ 'user_code',
578
+ 'is_gpu_conversion',
579
+ 'is_normalization',
580
+ 'help_hints_var',
581
+ 'model_sample_mode',
582
+ 'model_sample_mode_duration')
583
+
584
+ # Message Box Text
585
+
586
+ INVALID_INPUT = 'Invalid Input', 'The input is invalid.\n\nPlease verify the input still exists or is valid and try again.'
587
+ INVALID_EXPORT = 'Invalid Export Directory', 'You have selected an invalid export directory.\n\nPlease make sure the selected directory still exists.'
588
+ INVALID_ENSEMBLE = 'Not Enough Models', 'You must select 2 or more models to run ensemble.'
589
+ INVALID_MODEL = 'No Model Chosen', 'You must select an model to continue.'
590
+ MISSING_MODEL = 'Model Missing', 'The selected model is missing or not valid.'
591
+ ERROR_OCCURED = 'Error Occured', '\n\nWould you like to open the error log for more details?\n'
592
+
593
+ # GUI Text Constants
594
+
595
+ BACK_TO_MAIN_MENU = 'Back to Main Menu'
596
+
597
+ # Help Hint Text
598
+
599
+ INTERNAL_MODEL_ATT = 'Internal model attribute. \n\n ***Do not change this setting if you are unsure!***'
600
+ STOP_HELP = 'Halts any running processes. \n A pop-up window will ask the user to confirm the action.'
601
+ SETTINGS_HELP = 'Opens the main settings guide. This window includes the \"Download Center\"'
602
+ COMMAND_TEXT_HELP = 'Provides information on the progress of the current process.'
603
+ SAVE_CURRENT_SETTINGS_HELP = 'Allows the user to open any saved settings or save the current application settings.'
604
+ CHUNKS_HELP = ('For MDX-Net, all values use the same amount of resources. Using chunks is no longer recommended.\n\n' + \
605
+ '• This option is now only for output quality.\n' + \
606
+ '• Some tracks may fare better depending on the value.\n' + \
607
+ '• Some tracks may fare worse depending on the value.\n' + \
608
+ '• Larger chunk sizes use will take less time to process.\n' +\
609
+ '• Smaller chunk sizes use will take more time to process.\n')
610
+ CHUNKS_DEMUCS_HELP = ('This option allows the user to reduce (or increase) RAM or V-RAM usage.\n\n' + \
611
+ '• Smaller chunk sizes use less RAM or V-RAM but can also increase processing times.\n' + \
612
+ '• Larger chunk sizes use more RAM or V-RAM but can also reduce processing times.\n' + \
613
+ '• Selecting \"Auto\" calculates an appropriate chuck size based on how much RAM or V-RAM your system has.\n' + \
614
+ '• Selecting \"Full\" will process the track as one whole chunk. (not recommended)\n' + \
615
+ '• The default selection is \"Auto\".')
616
+ MARGIN_HELP = 'Selects the frequency margins to slice the chunks from.\n\n• The recommended margin size is 44100.\n• Other values can give unpredictable results.'
617
+ AGGRESSION_SETTING_HELP = ('This option allows you to set how strong the primary stem extraction will be.\n\n' + \
618
+ '• The range is 0-100.\n' + \
619
+ '• Higher values perform deeper extractions.\n' + \
620
+ '• The default is 10 for instrumental & vocal models.\n' + \
621
+ '• Values over 10 can result in muddy-sounding instrumentals for the non-vocal models')
622
+ WINDOW_SIZE_HELP = ('The smaller your window size, the better your conversions will be. \nHowever, a smaller window means longer conversion times and heavier resource usage.\n\n' + \
623
+ 'Breakdown of the selectable window size values:\n' + \
624
+ '• 1024 - Low conversion quality, shortest conversion time, low resource usage.\n' + \
625
+ '• 512 - Average conversion quality, average conversion time, normal resource usage.\n' + \
626
+ '• 320 - Better conversion quality.')
627
+ DEMUCS_STEMS_HELP = ('Here, you can choose which stem to extract using the selected model.\n\n' +\
628
+ 'Stem Selections:\n\n' +\
629
+ '• All Stems - Saves all of the stems the model is able to extract.\n' +\
630
+ '• Vocals - Pulls vocal stem only.\n' +\
631
+ '• Other - Pulls other stem only.\n' +\
632
+ '• Bass - Pulls bass stem only.\n' +\
633
+ '• Drums - Pulls drum stem only.\n')
634
+ SEGMENT_HELP = ('This option allows the user to reduce (or increase) RAM or V-RAM usage.\n\n' + \
635
+ '• Smaller segment sizes use less RAM or V-RAM but can also increase processing times.\n' + \
636
+ '• Larger segment sizes use more RAM or V-RAM but can also reduce processing times.\n' + \
637
+ '• Selecting \"Default\" uses the recommended segment size.\n' + \
638
+ '• It is recommended that you not use segments with \"Chunking\".')
639
+ ENSEMBLE_MAIN_STEM_HELP = 'Allows the user to select the type of stems they wish to ensemble.\n\nOptions:\n\n' +\
640
+ f'• {VOCAL_PAIR} - The primary stem will be the vocals and the secondary stem will be the the instrumental\n' +\
641
+ f'• {OTHER_PAIR} - The primary stem will be other and the secondary stem will be no other (the mixture without the \'other\' stem)\n' +\
642
+ f'• {BASS_PAIR} - The primary stem will be bass and the secondary stem will be no bass (the mixture without the \'bass\' stem)\n' +\
643
+ f'• {DRUM_PAIR} - The primary stem will be drums and the secondary stem will be no drums (the mixture without the \'drums\' stem)\n' +\
644
+ f'• {FOUR_STEM_ENSEMBLE} - This option will gather all the 4 stem Demucs models and ensemble all of the outputs.\n'
645
+ ENSEMBLE_TYPE_HELP = 'Allows the user to select the ensemble algorithm to be used to generate the final output.\n\nExample & Other Note:\n\n' +\
646
+ f'• {MAX_MIN} - If this option is chosen, the primary stem outputs will be processed through \nthe \'Max Spec\' algorithm, and the secondary stem will be processed through the \'Min Spec\' algorithm.\n' +\
647
+ f'• Only a single algorithm will be shown when the \'4 Stem Ensemble\' option is chosen.\n\nAlgorithm Details:\n\n' +\
648
+ f'• {MAX_SPEC} - This algorithm combines the final results and generates the highest possible output from them.\nFor example, if this algorithm were processing vocal stems, you would get the fullest possible \n' +\
649
+ 'result making the ensembled vocal stem sound cleaner. However, it might result in more unwanted artifacts.\n' +\
650
+ f'• {MIN_SPEC} - This algorithm combines the results and generates the lowest possible output from them.\nFor example, if this algorithm were processing instrumental stems, you would get the cleanest possible result \n' +\
651
+ 'result, eliminating more unwanted artifacts. However, the result might also sound \'muddy\' and lack a fuller sound.\n' +\
652
+ f'• {AUDIO_AVERAGE} - This algorithm simply combines the results and averages all of them together. \n'
653
+ ENSEMBLE_LISTBOX_HELP = 'List of the all the models available for the main stem pair selected.'
654
+ IS_GPU_CONVERSION_HELP = ('When checked, the application will attempt to use your GPU (if you have one).\n' +\
655
+ 'If you do not have a GPU but have this checked, the application will default to your CPU.\n\n' +\
656
+ 'Note: CPU conversions are much slower than those processed through the GPU.')
657
+ SAVE_STEM_ONLY_HELP = 'Allows the user to save only the selected stem.'
658
+ IS_NORMALIZATION_HELP = 'Normalizes output to prevent clipping.'
659
+ CROP_SIZE_HELP = '**Only compatible with select models only!**\n\n Setting should match training crop-size value. Leave as is if unsure.'
660
+ IS_TTA_HELP = ('This option performs Test-Time-Augmentation to improve the separation quality.\n\n' +\
661
+ 'Note: Having this selected will increase the time it takes to complete a conversion')
662
+ IS_POST_PROCESS_HELP = ('This option can potentially identify leftover instrumental artifacts within the vocal outputs. \nThis option may improve the separation of some songs.\n\n' +\
663
+ 'Note: Selecting this option can adversely affect the conversion process, depending on the track. Because of this, it is only recommended as a last resort.')
664
+ IS_HIGH_END_PROCESS_HELP = 'The application will mirror the missing frequency range of the output.'
665
+ SHIFTS_HELP = ('Performs multiple predictions with random shifts of the input and averages them.\n\n' +\
666
+ '• The higher number of shifts, the longer the prediction will take. \n- Not recommended unless you have a GPU.')
667
+ OVERLAP_HELP = 'This option controls the amount of overlap between prediction windows (for Demucs one window is 10 seconds)'
668
+ IS_CHUNK_DEMUCS_HELP = '• Enables \"Chunks\".\n• We recommend you not enable this option with \"Split Mode\" enabled or with the Demucs v4 Models.'
669
+ IS_CHUNK_MDX_NET_HELP = '• Enables \"Chunks\".\n• Using this option for MDX-Net no longer effects RAM usage.\n• Having this enabled will effect output quality, for better or worse depending on the set value.'
670
+ IS_SPLIT_MODE_HELP = ('• Enables \"Segments\". \n• We recommend you not enable this option with \"Enable Chunks\".\n' +\
671
+ '• Deselecting this option is only recommended for those with powerful PCs or if using \"Chunk\" mode instead.')
672
+ IS_DEMUCS_COMBINE_STEMS_HELP = 'The application will create the secondary stem by combining the remaining stems \ninstead of inverting the primary stem with the mixture.'
673
+ COMPENSATE_HELP = 'Compensates the audio of the primary stems to allow for a better secondary stem.'
674
+ IS_DENOISE_HELP = '• This option removes a majority of the noise generated by the MDX-Net models.\n• The conversion will take nearly twice as long with this enabled.'
675
+ CLEAR_CACHE_HELP = 'Clears any user selected model settings for previously unrecognized models.'
676
+ IS_SAVE_ALL_OUTPUTS_ENSEMBLE_HELP = 'Enabling this option will keep all indivudual outputs generated by an ensemble.'
677
+ IS_APPEND_ENSEMBLE_NAME_HELP = 'The application will append the ensemble name to the final output \nwhen this option is enabled.'
678
+ DONATE_HELP = 'Takes the user to an external web-site to donate to this project!'
679
+ IS_INVERT_SPEC_HELP = '• This option may produce a better secondary stem.\n• Inverts primary stem with mixture using spectragrams instead of wavforms.\n• This inversion method is slightly slower.'
680
+ IS_MIXER_MODE_HELP = '• This option may improve separations for outputs from 4-stem models.\n• Might produce more noise.\n• This option might slow down separation time.'
681
+ IS_TESTING_AUDIO_HELP = 'Appends a unique 10 digit number to output files so the user \ncan compare results with different settings.'
682
+ IS_MODEL_TESTING_AUDIO_HELP = 'Appends the model name to output files so the user \ncan compare results with different settings.'
683
+ IS_ACCEPT_ANY_INPUT_HELP = 'The application will accept any input when enabled, even if it does not have an audio format extension.\n\nThis is for experimental purposes, and having it enabled is not recommended.'
684
+ IS_TASK_COMPLETE_HELP = 'When enabled, chimes will be heard when a process completes or fails.'
685
+ IS_CREATE_MODEL_FOLDER_HELP = 'Two new directories will be generated for the outputs in \nthe export directory after each conversion.\n\n' +\
686
+ '• First directory - Named after the model.\n' +\
687
+ '• Second directory - Named after the track.\n\n' +\
688
+ '• Example: \n\n' +\
689
+ '─ Export Directory\n' +\
690
+ ' └── First Directory\n' +\
691
+ ' └── Second Directory\n' +\
692
+ ' └── Output File(s)'
693
+ DELETE_YOUR_SETTINGS_HELP = 'This menu contains your saved settings. You will be asked to \nconfirm if you wish to delete the selected setting.'
694
+ SET_STEM_NAME_HELP = 'Choose the primary stem for the selected model.'
695
+ MDX_DIM_T_SET_HELP = INTERNAL_MODEL_ATT
696
+ MDX_DIM_F_SET_HELP = INTERNAL_MODEL_ATT
697
+ MDX_N_FFT_SCALE_SET_HELP = 'Set the N_FFT size the model was trained with.'
698
+ POPUP_COMPENSATE_HELP = f'Choose the appropriate voluem compensattion for the selected model\n\nReminder: {COMPENSATE_HELP}'
699
+ VR_MODEL_PARAM_HELP = 'Choose the parameters needed to run the selected model.'
700
+ CHOSEN_ENSEMBLE_HELP = 'Select saved enselble or save current ensemble.\n\nDefault Selections:\n\n• Save the current ensemble.\n• Clears all current model selections.'
701
+ CHOSEN_PROCESS_METHOD_HELP = 'Here, you choose between different Al networks and algorithms to process your track.\n\n' +\
702
+ 'There are five options:\n\n' +\
703
+ '• VR Architecture - These models use magnitude spectrograms for Source Separation.\n' +\
704
+ '• MDX-Net - These models use Hybrid Spectrogram/Waveform for Source Separation.\n' +\
705
+ '• Demucs v3 - These models use Hybrid Spectrogram/Waveform for Source Separation.\n' +\
706
+ '• Ensemble Mode - Here, you can get the best results from multiple models and networks.\n' +\
707
+ '• Audio Tools - These are additional tools for added convenience.'
708
+ INPUT_FOLDER_ENTRY_HELP = 'Select Input:\n\nHere is where you select the audio files(s) you wish to process.'
709
+ INPUT_FOLDER_ENTRY_HELP_2 = 'Input Option Menu:\n\nClick here to access the input option menu.'
710
+ OUTPUT_FOLDER_ENTRY_HELP = 'Select Output:\n\nHere is where you select the directory where your processed files are to be saved.'
711
+ INPUT_FOLDER_BUTTON_HELP = 'Open Input Folder Button: \n\nOpens the directory containing the selected input audio file(s).'
712
+ OUTPUT_FOLDER_BUTTON_HELP = 'Open Output Folder Button: \n\nOpens the selected output folder.'
713
+ CHOOSE_MODEL_HELP = 'Each process method comes with its own set of options and models.\n\nHere is where you choose the model associated with the selected process method.'
714
+ FORMAT_SETTING_HELP = 'Save outputs as '
715
+ SECONDARY_MODEL_ACTIVATE_HELP = 'When enabled, the application will run an additional inference with the selected model(s) above.'
716
+ SECONDARY_MODEL_HELP = 'Choose the secondary model associated with this stem you wish to run with the current process method.'
717
+ SECONDARY_MODEL_SCALE_HELP = 'The scale determines how the final audio outputs will be averaged between the primary and secondary models.\n\nFor example:\n\n' +\
718
+ '• 10% - 10 percent of the main model result will be factored into the final result.\n' +\
719
+ '• 50% - The results from the main and secondary models will be averaged evenly.\n' +\
720
+ '• 90% - 90 percent of the main model result will be factored into the final result.'
721
+ PRE_PROC_MODEL_ACTIVATE_HELP = 'The application will run an inference with the selected model above, pulling only the instrumental stem when enabled. \nFrom there, all of the non-vocal stems will be pulled from the generated instrumental.\n\nNotes:\n\n' +\
722
+ '• This option can significantly reduce vocal bleed within the non-vocal stems.\n' +\
723
+ '• It is only available in Demucs.\n' +\
724
+ '• It is only compatible with non-vocal and non-instrumental stem outputs.\n' +\
725
+ '• This will increase thetotal processing time.\n' +\
726
+ '• Only VR and MDX-Net Vocal or Instrumental models are selectable above.'
727
+
728
+ AUDIO_TOOLS_HELP = 'Here, you choose between different audio tools to process your track.\n\n' +\
729
+ '• Manual Ensemble - You must have 2 or more files selected as your inputs. Allows the user to run their tracks through \nthe same algorithms used in Ensemble Mode.\n' +\
730
+ '• Align Inputs - You must have exactly 2 files selected as your inputs. The second input will be aligned with the first input.\n' +\
731
+ '• Time Stretch - The user can speed up or slow down the selected inputs.\n' +\
732
+ '• Change Pitch - The user can change the pitch for the selected inputs.\n'
733
+ PRE_PROC_MODEL_INST_MIX_HELP = 'When enabled, the application will generate a third output without the selected stem and vocals.'
734
+ MODEL_SAMPLE_MODE_HELP = 'Allows the user to process only part of a track to sample settings or a model without \nrunning a full conversion.\n\nNotes:\n\n' +\
735
+ '• The number in the parentheses is the current number of seconds the generated sample will be.\n' +\
736
+ '• You can choose the number of seconds to extract from the track in the \"Additional Settings\" menu.'
737
+
738
+ POST_PROCESS_THREASHOLD_HELP = 'Allows the user to control the intensity of the Post_process option.\n\nNotes:\n\n' +\
739
+ '• Higher values potentially remove more artifacts. However, bleed might increase.\n' +\
740
+ '• Lower values limit artifact removal.'
741
+
742
+ BATCH_SIZE_HELP = 'Specify the number of batches to be processed at a time.\n\nNotes:\n\n' +\
743
+ '• Higher values mean more RAM usage but slightly faster processing times.\n' +\
744
+ '• Lower values mean less RAM usage but slightly longer processing times.\n' +\
745
+ '• Batch size value has no effect on output quality.'
746
+
747
+ # Warning Messages
748
+
749
+ STORAGE_ERROR = 'Insufficient Storage', 'There is not enough storage on main drive to continue. Your main drive must have at least 3 GB\'s of storage in order for this application function properly. \n\nPlease ensure your main drive has at least 3 GB\'s of storage and try again.\n\n'
750
+ STORAGE_WARNING = 'Available Storage Low', 'Your main drive is running low on storage. Your main drive must have at least 3 GB\'s of storage in order for this application function properly.\n\n'
751
+ CONFIRM_WARNING = '\nAre you sure you wish to continue?'
752
+ PROCESS_FAILED = 'Process failed, please see error log\n'
753
+ EXIT_PROCESS_ERROR = 'Active Process', 'Please stop the active process or wait for it to complete before you exit.'
754
+ EXIT_HALTED_PROCESS_ERROR = 'Halting Process', 'Please wait for the application to finish halting the process before exiting.'
755
+ EXIT_DOWNLOAD_ERROR = 'Active Download', 'Please stop the download or wait for it to complete before you exit.'
756
+ SET_TO_DEFAULT_PROCESS_ERROR = 'Active Process', 'You cannot reset all of the application settings during an active process.'
757
+ SET_TO_ANY_PROCESS_ERROR = 'Active Process', 'You cannot reset the application settings during an active process.'
758
+ RESET_ALL_TO_DEFAULT_WARNING = 'Reset Settings Confirmation', 'All application settings will be set to factory default.\n\nAre you sure you wish to continue?'
759
+ AUDIO_VERIFICATION_CHECK = lambda i, e:f'++++++++++++++++++++++++++++++++++++++++++++++++++++\n\nBroken File Removed: \n\n{i}\n\nError Details:\n\n{e}\n++++++++++++++++++++++++++++++++++++++++++++++++++++'
760
+ INVALID_ONNX_MODEL_ERROR = 'Invalid Model', 'The file selected is not a valid MDX-Net model. Please see the error log for more information.'
761
+
762
+
763
+ # Separation Text
764
+
765
+ LOADING_MODEL = 'Loading model...'
766
+ INFERENCE_STEP_1 = 'Running inference...'
767
+ INFERENCE_STEP_1_SEC = 'Running inference (secondary model)...'
768
+ INFERENCE_STEP_1_4_STEM = lambda stem:f'Running inference (secondary model for {stem})...'
769
+ INFERENCE_STEP_1_PRE = 'Running inference (pre-process model)...'
770
+ INFERENCE_STEP_2_PRE = lambda pm, m:f'Loading pre-process model ({pm}: {m})...'
771
+ INFERENCE_STEP_2_SEC = lambda pm, m:f'Loading secondary model ({pm}: {m})...'
772
+ INFERENCE_STEP_2_SEC_CACHED_MODOEL = lambda pm, m:f'Secondary model ({pm}: {m}) cache loaded.\n'
773
+ INFERENCE_STEP_2_PRE_CACHED_MODOEL = lambda pm, m:f'Pre-process model ({pm}: {m}) cache loaded.\n'
774
+ INFERENCE_STEP_2_SEC_CACHED = 'Loading cached secondary model source(s)... Done!\n'
775
+ INFERENCE_STEP_2_PRIMARY_CACHED = 'Model cache loaded.\n'
776
+ INFERENCE_STEP_2 = 'Inference complete.'
777
+ SAVING_STEM = 'Saving ', ' stem...'
778
+ SAVING_ALL_STEMS = 'Saving all stems...'
779
+ ENSEMBLING_OUTPUTS = 'Ensembling outputs...'
780
+ DONE = ' Done!\n'
781
+ ENSEMBLES_SAVED = 'Ensembled outputs saved!\n\n'
782
+ NEW_LINES = "\n\n"
783
+ NEW_LINE = "\n"
784
+ NO_LINE = ''
785
+
786
+ # Widget Placements
787
+
788
+ MAIN_ROW_Y = -15, -17
789
+ MAIN_ROW_X = -4, 21
790
+ MAIN_ROW_WIDTH = -53
791
+ MAIN_ROW_2_Y = -15, -17
792
+ MAIN_ROW_2_X = -28, 1
793
+ CHECK_BOX_Y = 0
794
+ CHECK_BOX_X = 20
795
+ CHECK_BOX_WIDTH = -50
796
+ CHECK_BOX_HEIGHT = 2
797
+ LEFT_ROW_WIDTH = -10
798
+ LABEL_HEIGHT = -5
799
+ OPTION_HEIGHT = 7
800
+ LOW_MENU_Y = 18, 16
801
+ FFMPEG_EXT = (".aac", ".aiff", ".alac" ,".flac", ".FLAC", ".mov", ".mp4", ".MP4",
802
+ ".m4a", ".M4A", ".mp2", ".mp3", "MP3", ".mpc", ".mpc8",
803
+ ".mpeg", ".ogg", ".OGG", ".tta", ".wav", ".wave", ".WAV", ".WAVE", ".wma", ".webm", ".eac3", ".mkv")
804
+
805
+ FFMPEG_MORE_EXT = (".aa", ".aac", ".ac3", ".aiff", ".alac", ".avi", ".f4v",".flac", ".flic", ".flv",
806
+ ".m4v",".mlv", ".mov", ".mp4", ".m4a", ".mp2", ".mp3", ".mp4", ".mpc", ".mpc8",
807
+ ".mpeg", ".ogg", ".tta", ".tty", ".vcd", ".wav", ".wma")
808
+ ANY_EXT = ""
809
+
810
+ # Secondary Menu Constants
811
+
812
+ VOCAL_PAIR_PLACEMENT = 1, 2, 3, 4
813
+ OTHER_PAIR_PLACEMENT = 5, 6, 7, 8
814
+ BASS_PAIR_PLACEMENT = 9, 10, 11, 12
815
+ DRUMS_PAIR_PLACEMENT = 13, 14, 15, 16
816
+
817
+ # Drag n Drop String Checks
818
+
819
+ DOUBLE_BRACKET = "} {"
820
+ RIGHT_BRACKET = "}"
821
+ LEFT_BRACKET = "{"
822
+
823
+ # Manual Downloads
824
+
825
+ VR_PLACEMENT_TEXT = 'Place models in \"models/VR_Models\" directory.'
826
+ MDX_PLACEMENT_TEXT = 'Place models in \"models/MDX_Net_Models\" directory.'
827
+ DEMUCS_PLACEMENT_TEXT = 'Place models in \"models/Demucs_Models\" directory.'
828
+ DEMUCS_V3_V4_PLACEMENT_TEXT = 'Place items in \"models/Demucs_Models/v3_v4_repo\" directory.'
829
+
830
+ FULL_DOWNLOAD_LIST_VR = {
831
+ "VR Arch Single Model v5: 1_HP-UVR": "1_HP-UVR.pth",
832
+ "VR Arch Single Model v5: 2_HP-UVR": "2_HP-UVR.pth",
833
+ "VR Arch Single Model v5: 3_HP-Vocal-UVR": "3_HP-Vocal-UVR.pth",
834
+ "VR Arch Single Model v5: 4_HP-Vocal-UVR": "4_HP-Vocal-UVR.pth",
835
+ "VR Arch Single Model v5: 5_HP-Karaoke-UVR": "5_HP-Karaoke-UVR.pth",
836
+ "VR Arch Single Model v5: 6_HP-Karaoke-UVR": "6_HP-Karaoke-UVR.pth",
837
+ "VR Arch Single Model v5: 7_HP2-UVR": "7_HP2-UVR.pth",
838
+ "VR Arch Single Model v5: 8_HP2-UVR": "8_HP2-UVR.pth",
839
+ "VR Arch Single Model v5: 9_HP2-UVR": "9_HP2-UVR.pth",
840
+ "VR Arch Single Model v5: 10_SP-UVR-2B-32000-1": "10_SP-UVR-2B-32000-1.pth",
841
+ "VR Arch Single Model v5: 11_SP-UVR-2B-32000-2": "11_SP-UVR-2B-32000-2.pth",
842
+ "VR Arch Single Model v5: 12_SP-UVR-3B-44100": "12_SP-UVR-3B-44100.pth",
843
+ "VR Arch Single Model v5: 13_SP-UVR-4B-44100-1": "13_SP-UVR-4B-44100-1.pth",
844
+ "VR Arch Single Model v5: 14_SP-UVR-4B-44100-2": "14_SP-UVR-4B-44100-2.pth",
845
+ "VR Arch Single Model v5: 15_SP-UVR-MID-44100-1": "15_SP-UVR-MID-44100-1.pth",
846
+ "VR Arch Single Model v5: 16_SP-UVR-MID-44100-2": "16_SP-UVR-MID-44100-2.pth",
847
+ "VR Arch Single Model v4: MGM_HIGHEND_v4": "MGM_HIGHEND_v4.pth",
848
+ "VR Arch Single Model v4: MGM_LOWEND_A_v4": "MGM_LOWEND_A_v4.pth",
849
+ "VR Arch Single Model v4: MGM_LOWEND_B_v4": "MGM_LOWEND_B_v4.pth",
850
+ "VR Arch Single Model v4: MGM_MAIN_v4": "MGM_MAIN_v4.pth"
851
+ }
852
+
853
+ FULL_DOWNLOAD_LIST_MDX = {
854
+ "MDX-Net Model: UVR-MDX-NET Main": "UVR_MDXNET_Main.onnx",
855
+ "MDX-Net Model: UVR-MDX-NET Inst Main": "UVR-MDX-NET-Inst_Main.onnx",
856
+ "MDX-Net Model: UVR-MDX-NET 1": "UVR_MDXNET_1_9703.onnx",
857
+ "MDX-Net Model: UVR-MDX-NET 2": "UVR_MDXNET_2_9682.onnx",
858
+ "MDX-Net Model: UVR-MDX-NET 3": "UVR_MDXNET_3_9662.onnx",
859
+ "MDX-Net Model: UVR-MDX-NET Inst 1": "UVR-MDX-NET-Inst_1.onnx",
860
+ "MDX-Net Model: UVR-MDX-NET Inst 2": "UVR-MDX-NET-Inst_2.onnx",
861
+ "MDX-Net Model: UVR-MDX-NET Inst 3": "UVR-MDX-NET-Inst_3.onnx",
862
+ "MDX-Net Model: UVR-MDX-NET Karaoke": "UVR_MDXNET_KARA.onnx",
863
+ "MDX-Net Model: UVR_MDXNET_9482": "UVR_MDXNET_9482.onnx",
864
+ "MDX-Net Model: Kim_Vocal_1": "Kim_Vocal_1.onnx",
865
+ "MDX-Net Model: kuielab_a_vocals": "kuielab_a_vocals.onnx",
866
+ "MDX-Net Model: kuielab_a_other": "kuielab_a_other.onnx",
867
+ "MDX-Net Model: kuielab_a_bass": "kuielab_a_bass.onnx",
868
+ "MDX-Net Model: kuielab_a_drums": "kuielab_a_drums.onnx",
869
+ "MDX-Net Model: kuielab_b_vocals": "kuielab_b_vocals.onnx",
870
+ "MDX-Net Model: kuielab_b_other": "kuielab_b_other.onnx",
871
+ "MDX-Net Model: kuielab_b_bass": "kuielab_b_bass.onnx",
872
+ "MDX-Net Model: kuielab_b_drums": "kuielab_b_drums.onnx"}
873
+
874
+ FULL_DOWNLOAD_LIST_DEMUCS = {
875
+
876
+ "Demucs v4: htdemucs_ft":{
877
+ "f7e0c4bc-ba3fe64a.th":"https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/f7e0c4bc-ba3fe64a.th",
878
+ "d12395a8-e57c48e6.th":"https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/d12395a8-e57c48e6.th",
879
+ "92cfc3b6-ef3bcb9c.th":"https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/92cfc3b6-ef3bcb9c.th",
880
+ "04573f0d-f3cf25b2.th":"https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/04573f0d-f3cf25b2.th",
881
+ "htdemucs_ft.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/htdemucs_ft.yaml"
882
+ },
883
+
884
+ "Demucs v4: htdemucs":{
885
+ "955717e8-8726e21a.th": "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th",
886
+ "htdemucs.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/htdemucs.yaml"
887
+ },
888
+
889
+ "Demucs v4: hdemucs_mmi":{
890
+ "75fc33f5-1941ce65.th": "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/75fc33f5-1941ce65.th",
891
+ "hdemucs_mmi.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/hdemucs_mmi.yaml"
892
+ },
893
+ "Demucs v4: htdemucs_6s":{
894
+ "5c90dfd2-34c22ccb.th": "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/5c90dfd2-34c22ccb.th",
895
+ "htdemucs_6s.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/htdemucs_6s.yaml"
896
+ },
897
+ "Demucs v3: mdx":{
898
+ "0d19c1c6-0f06f20e.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/0d19c1c6-0f06f20e.th",
899
+ "7ecf8ec1-70f50cc9.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/7ecf8ec1-70f50cc9.th",
900
+ "c511e2ab-fe698775.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/c511e2ab-fe698775.th",
901
+ "7d865c68-3d5dd56b.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/7d865c68-3d5dd56b.th",
902
+ "mdx.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx.yaml"
903
+ },
904
+
905
+ "Demucs v3: mdx_q":{
906
+ "6b9c2ca1-3fd82607.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/6b9c2ca1-3fd82607.th",
907
+ "b72baf4e-8778635e.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/b72baf4e-8778635e.th",
908
+ "42e558d4-196e0e1b.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/42e558d4-196e0e1b.th",
909
+ "305bc58f-18378783.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/305bc58f-18378783.th",
910
+ "mdx_q.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx_q.yaml"
911
+ },
912
+
913
+ "Demucs v3: mdx_extra":{
914
+ "e51eebcc-c1b80bdd.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/e51eebcc-c1b80bdd.th",
915
+ "a1d90b5c-ae9d2452.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/a1d90b5c-ae9d2452.th",
916
+ "5d2d6c55-db83574e.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/5d2d6c55-db83574e.th",
917
+ "cfa93e08-61801ae1.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/cfa93e08-61801ae1.th",
918
+ "mdx_extra.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx_extra.yaml"
919
+ },
920
+
921
+ "Demucs v3: mdx_extra_q": {
922
+ "83fc094f-4a16d450.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/83fc094f-4a16d450.th",
923
+ "464b36d7-e5a9386e.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/464b36d7-e5a9386e.th",
924
+ "14fc6a69-a89dd0ee.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/14fc6a69-a89dd0ee.th",
925
+ "7fd6ef75-a905dd85.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/7fd6ef75-a905dd85.th",
926
+ "mdx_extra_q.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx_extra_q.yaml"
927
+ },
928
+
929
+ "Demucs v3: UVR Model":{
930
+ "ebf34a2db.th": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/ebf34a2db.th",
931
+ "UVR_Demucs_Model_1.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/UVR_Demucs_Model_1.yaml"
932
+ },
933
+
934
+ "Demucs v3: repro_mdx_a":{
935
+ "9a6b4851-03af0aa6.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/9a6b4851-03af0aa6.th",
936
+ "1ef250f1-592467ce.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/1ef250f1-592467ce.th",
937
+ "fa0cb7f9-100d8bf4.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/fa0cb7f9-100d8bf4.th",
938
+ "902315c2-b39ce9c9.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/902315c2-b39ce9c9.th",
939
+ "repro_mdx_a.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/repro_mdx_a.yaml"
940
+ },
941
+
942
+ "Demucs v3: repro_mdx_a_time_only":{
943
+ "9a6b4851-03af0aa6.th":"https://dl.fbaipublicfiles.com/demucs/mdx_final/9a6b4851-03af0aa6.th",
944
+ "1ef250f1-592467ce.th":"https://dl.fbaipublicfiles.com/demucs/mdx_final/1ef250f1-592467ce.th",
945
+ "repro_mdx_a_time_only.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/repro_mdx_a_time_only.yaml"
946
+ },
947
+
948
+ "Demucs v3: repro_mdx_a_hybrid_only":{
949
+ "fa0cb7f9-100d8bf4.th":"https://dl.fbaipublicfiles.com/demucs/mdx_final/fa0cb7f9-100d8bf4.th",
950
+ "902315c2-b39ce9c9.th":"https://dl.fbaipublicfiles.com/demucs/mdx_final/902315c2-b39ce9c9.th",
951
+ "repro_mdx_a_hybrid_only.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/repro_mdx_a_hybrid_only.yaml"
952
+ },
953
+
954
+ "Demucs v2: demucs": {
955
+ "demucs-e07c671f.th": "https://dl.fbaipublicfiles.com/demucs/v3.0/demucs-e07c671f.th"
956
+ },
957
+
958
+ "Demucs v2: demucs_extra": {
959
+ "demucs_extra-3646af93.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/demucs_extra-3646af93.th"
960
+ },
961
+
962
+ "Demucs v2: demucs48_hq": {
963
+ "demucs48_hq-28a1282c.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/demucs48_hq-28a1282c.th"
964
+ },
965
+
966
+ "Demucs v2: tasnet": {
967
+ "tasnet-beb46fac.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/tasnet-beb46fac.th"
968
+ },
969
+
970
+ "Demucs v2: tasnet_extra": {
971
+ "tasnet_extra-df3777b2.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/tasnet_extra-df3777b2.th"
972
+ },
973
+
974
+ "Demucs v2: demucs_unittest": {
975
+ "demucs_unittest-09ebc15f.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/demucs_unittest-09ebc15f.th"
976
+ },
977
+
978
+ "Demucs v1: demucs": {
979
+ "demucs.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/demucs.th"
980
+ },
981
+
982
+ "Demucs v1: demucs_extra": {
983
+ "demucs_extra.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/demucs_extra.th"
984
+ },
985
+
986
+ "Demucs v1: light": {
987
+ "light.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/light.th"
988
+ },
989
+
990
+ "Demucs v1: light_extra": {
991
+ "light_extra.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/light_extra.th"
992
+ },
993
+
994
+ "Demucs v1: tasnet": {
995
+ "tasnet.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/tasnet.th"
996
+ },
997
+
998
+ "Demucs v1: tasnet_extra": {
999
+ "tasnet_extra.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/tasnet_extra.th"
1000
+ }
1001
+ }
1002
+
1003
+ # Main Menu Labels
1004
+
1005
+ CHOOSE_PROC_METHOD_MAIN_LABEL = 'CHOOSE PROCESS METHOD'
1006
+ SELECT_SAVED_SETTINGS_MAIN_LABEL = 'SELECT SAVED SETTINGS'
1007
+ CHOOSE_MDX_MODEL_MAIN_LABEL = 'CHOOSE MDX-NET MODEL'
1008
+ BATCHES_MDX_MAIN_LABEL = 'BATCH SIZE'
1009
+ VOL_COMP_MDX_MAIN_LABEL = 'VOLUME COMPENSATION'
1010
+ SELECT_VR_MODEL_MAIN_LABEL = 'CHOOSE VR MODEL'
1011
+ AGGRESSION_SETTING_MAIN_LABEL = 'AGGRESSION SETTING'
1012
+ WINDOW_SIZE_MAIN_LABEL = 'WINDOW SIZE'
1013
+ CHOOSE_DEMUCS_MODEL_MAIN_LABEL = 'CHOOSE DEMUCS MODEL'
1014
+ CHOOSE_DEMUCS_STEMS_MAIN_LABEL = 'CHOOSE STEM(S)'
1015
+ CHOOSE_SEGMENT_MAIN_LABEL = 'SEGMENT'
1016
+ ENSEMBLE_OPTIONS_MAIN_LABEL = 'ENSEMBLE OPTIONS'
1017
+ CHOOSE_MAIN_PAIR_MAIN_LABEL = 'MAIN STEM PAIR'
1018
+ CHOOSE_ENSEMBLE_ALGORITHM_MAIN_LABEL = 'ENSEMBLE ALGORITHM'
1019
+ AVAILABLE_MODELS_MAIN_LABEL = 'AVAILABLE MODELS'
1020
+ CHOOSE_AUDIO_TOOLS_MAIN_LABEL = 'CHOOSE AUDIO TOOL'
1021
+ CHOOSE_MANUAL_ALGORITHM_MAIN_LABEL = 'CHOOSE ALGORITHM'
1022
+ CHOOSE_RATE_MAIN_LABEL = 'RATE'
1023
+ CHOOSE_SEMITONES_MAIN_LABEL = 'SEMITONES'
1024
+ GPU_CONVERSION_MAIN_LABEL = 'GPU Conversion'
1025
+
1026
+ if OPERATING_SYSTEM=="Darwin":
1027
+ LICENSE_OS_SPECIFIC_TEXT = '• This application is intended for those running macOS Catalina and above.\n' +\
1028
+ '• Application functionality for systems running macOS Mojave or lower is not guaranteed.\n' +\
1029
+ '• Application functionality for older or budget Mac systems is not guaranteed.\n\n'
1030
+ FONT_SIZE_F1 = 13
1031
+ FONT_SIZE_F2 = 11
1032
+ FONT_SIZE_F3 = 12
1033
+ FONT_SIZE_0 = 9
1034
+ FONT_SIZE_1 = 11
1035
+ FONT_SIZE_2 = 12
1036
+ FONT_SIZE_3 = 13
1037
+ FONT_SIZE_4 = 14
1038
+ FONT_SIZE_5 = 15
1039
+ FONT_SIZE_6 = 17
1040
+ HELP_HINT_CHECKBOX_WIDTH = 13
1041
+ MDX_CHECKBOXS_WIDTH = 14
1042
+ VR_CHECKBOXS_WIDTH = 14
1043
+ ENSEMBLE_CHECKBOXS_WIDTH = 18
1044
+ DEMUCS_CHECKBOXS_WIDTH = 14
1045
+ DEMUCS_PRE_CHECKBOXS_WIDTH = 20
1046
+ GEN_SETTINGS_WIDTH = 17
1047
+ MENU_COMBOBOX_WIDTH = 16
1048
+
1049
+ elif OPERATING_SYSTEM=="Linux":
1050
+ LICENSE_OS_SPECIFIC_TEXT = '• This application is intended for those running Linux Ubuntu 18.04+.\n' +\
1051
+ '• Application functionality for systems running other Linux platforms is not guaranteed.\n' +\
1052
+ '• Application functionality for older or budget systems is not guaranteed.\n\n'
1053
+ FONT_SIZE_F1 = 10
1054
+ FONT_SIZE_F2 = 8
1055
+ FONT_SIZE_F3 = 9
1056
+ FONT_SIZE_0 = 7
1057
+ FONT_SIZE_1 = 8
1058
+ FONT_SIZE_2 = 9
1059
+ FONT_SIZE_3 = 10
1060
+ FONT_SIZE_4 = 11
1061
+ FONT_SIZE_5 = 12
1062
+ FONT_SIZE_6 = 15
1063
+ HELP_HINT_CHECKBOX_WIDTH = 13
1064
+ MDX_CHECKBOXS_WIDTH = 14
1065
+ VR_CHECKBOXS_WIDTH = 16
1066
+ ENSEMBLE_CHECKBOXS_WIDTH = 25
1067
+ DEMUCS_CHECKBOXS_WIDTH = 18
1068
+ DEMUCS_PRE_CHECKBOXS_WIDTH = 27
1069
+ GEN_SETTINGS_WIDTH = 17
1070
+ MENU_COMBOBOX_WIDTH = 19
1071
+
1072
+ elif OPERATING_SYSTEM=="Windows":
1073
+ LICENSE_OS_SPECIFIC_TEXT = '• This application is intended for those running Windows 10 or higher.\n' +\
1074
+ '• Application functionality for systems running Windows 7 or lower is not guaranteed.\n' +\
1075
+ '• Application functionality for Intel Pentium & Celeron CPUs systems is not guaranteed.\n\n'
1076
+ FONT_SIZE_F1 = 10
1077
+ FONT_SIZE_F2 = 8
1078
+ FONT_SIZE_F3 = 9
1079
+ FONT_SIZE_0 = 7
1080
+ FONT_SIZE_1 = 8
1081
+ FONT_SIZE_2 = 9
1082
+ FONT_SIZE_3 = 10
1083
+ FONT_SIZE_4 = 11
1084
+ FONT_SIZE_5 = 12
1085
+ FONT_SIZE_6 = 15
1086
+ HELP_HINT_CHECKBOX_WIDTH = 16
1087
+ MDX_CHECKBOXS_WIDTH = 16
1088
+ VR_CHECKBOXS_WIDTH = 16
1089
+ ENSEMBLE_CHECKBOXS_WIDTH = 25
1090
+ DEMUCS_CHECKBOXS_WIDTH = 18
1091
+ DEMUCS_PRE_CHECKBOXS_WIDTH = 27
1092
+ GEN_SETTINGS_WIDTH = 23
1093
+ MENU_COMBOBOX_WIDTH = 19
1094
+
1095
+
1096
+ LICENSE_TEXT = lambda a, p:f'Current Application Version: Ultimate Vocal Remover {a}\n' +\
1097
+ f'Current Patch Version: {p}\n\n' +\
1098
+ 'Copyright (c) 2022 Ultimate Vocal Remover\n\n' +\
1099
+ 'UVR is free and open-source, but MIT licensed. Please credit us if you use our\n' +\
1100
+ f'models or code for projects unrelated to UVR.\n\n{LICENSE_OS_SPECIFIC_TEXT}' +\
1101
+ 'This bundle contains the UVR interface, Python, PyTorch, and other\n' +\
1102
+ 'dependencies needed to run the application effectively.\n\n' +\
1103
+ 'Website Links: This application, System or Service(s) may contain links to\n' +\
1104
+ 'other websites and downloads, and they are solely provided to you as an\n' +\
1105
+ 'additional convenience. You understand and acknowledge that by clicking\n' +\
1106
+ 'or activating such links you are accessing a site or service outside of\n' +\
1107
+ 'this application, and that we do not screen, review, approve, or otherwise\n' +\
1108
+ 'endorse any content or information contained in these linked websites.\n' +\
1109
+ 'You acknowledge and agree that we, our affiliates and partners are not\n' +\
1110
+ 'responsible for the contents of any of these linked websites, including\n' +\
1111
+ 'the accuracy or availability of information provided by the linked websites,\n' +\
1112
+ 'and we make no representations or warranties regarding your use of\n' +\
1113
+ 'the linked websites.\n\n' +\
1114
+ 'This application is MIT Licensed\n\n' +\
1115
+ 'Permission is hereby granted, free of charge, to any person obtaining a copy\n' +\
1116
+ 'of this software and associated documentation files (the "Software"), to deal\n' +\
1117
+ 'in the Software without restriction, including without limitation the rights\n' +\
1118
+ 'to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n' +\
1119
+ 'copies of the Software, and to permit persons to whom the Software is\n' +\
1120
+ 'furnished to do so, subject to the following conditions:\n\n' +\
1121
+ 'The above copyright notice and this permission notice shall be included in all\n' +\
1122
+ 'copies or substantial portions of the Software.\n\n' +\
1123
+ 'THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n' +\
1124
+ 'IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n' +\
1125
+ 'FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n' +\
1126
+ 'AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n' +\
1127
+ 'LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n' +\
1128
+ 'OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n' +\
1129
+ 'SOFTWARE.'
1130
+
1131
+ CHANGE_LOG_HEADER = lambda patch:f"Patch Version:\n\n{patch}"
1132
+
1133
+ #DND CONSTS
1134
+
1135
+ MAC_DND_CHECK = ('/Users/',
1136
+ '/Applications/',
1137
+ '/Library/',
1138
+ '/System/')
1139
+ LINUX_DND_CHECK = ('/home/',
1140
+ '/usr/')
1141
+ WINDOWS_DND_CHECK = ('A:', 'B:', 'C:', 'D:', 'E:', 'F:', 'G:', 'H:', 'I:', 'J:', 'K:', 'L:', 'M:', 'N:', 'O:', 'P:', 'Q:', 'R:', 'S:', 'T:', 'U:', 'V:', 'W:', 'X:', 'Y:', 'Z:')
1142
+
1143
+ WOOD_INST_MODEL_HASH = '0ec76fd9e65f81d8b4fbd13af4826ed8'
1144
+ WOOD_INST_PARAMS = {
1145
+ "vr_model_param": "4band_v3",
1146
+ "primary_stem": NO_WIND_INST_STEM
1147
+ }
uvr5/lib_v5/mdxnet.py ADDED
@@ -0,0 +1,140 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from abc import ABCMeta
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ from pytorch_lightning import LightningModule
6
+ from .modules import TFC_TDF
7
+
8
+ dim_s = 4
9
+
10
+ class AbstractMDXNet(LightningModule):
11
+ __metaclass__ = ABCMeta
12
+
13
+ def __init__(self, target_name, lr, optimizer, dim_c, dim_f, dim_t, n_fft, hop_length, overlap):
14
+ super().__init__()
15
+ self.target_name = target_name
16
+ self.lr = lr
17
+ self.optimizer = optimizer
18
+ self.dim_c = dim_c
19
+ self.dim_f = dim_f
20
+ self.dim_t = dim_t
21
+ self.n_fft = n_fft
22
+ self.n_bins = n_fft // 2 + 1
23
+ self.hop_length = hop_length
24
+ self.window = nn.Parameter(torch.hann_window(window_length=self.n_fft, periodic=True), requires_grad=False)
25
+ self.freq_pad = nn.Parameter(torch.zeros([1, dim_c, self.n_bins - self.dim_f, self.dim_t]), requires_grad=False)
26
+
27
+ def configure_optimizers(self):
28
+ if self.optimizer == 'rmsprop':
29
+ return torch.optim.RMSprop(self.parameters(), self.lr)
30
+
31
+ if self.optimizer == 'adamw':
32
+ return torch.optim.AdamW(self.parameters(), self.lr)
33
+
34
+ class ConvTDFNet(AbstractMDXNet):
35
+ def __init__(self, target_name, lr, optimizer, dim_c, dim_f, dim_t, n_fft, hop_length,
36
+ num_blocks, l, g, k, bn, bias, overlap):
37
+
38
+ super(ConvTDFNet, self).__init__(
39
+ target_name, lr, optimizer, dim_c, dim_f, dim_t, n_fft, hop_length, overlap)
40
+ self.save_hyperparameters()
41
+
42
+ self.num_blocks = num_blocks
43
+ self.l = l
44
+ self.g = g
45
+ self.k = k
46
+ self.bn = bn
47
+ self.bias = bias
48
+
49
+ if optimizer == 'rmsprop':
50
+ norm = nn.BatchNorm2d
51
+
52
+ if optimizer == 'adamw':
53
+ norm = lambda input:nn.GroupNorm(2, input)
54
+
55
+ self.n = num_blocks // 2
56
+ scale = (2, 2)
57
+
58
+ self.first_conv = nn.Sequential(
59
+ nn.Conv2d(in_channels=self.dim_c, out_channels=g, kernel_size=(1, 1)),
60
+ norm(g),
61
+ nn.ReLU(),
62
+ )
63
+
64
+ f = self.dim_f
65
+ c = g
66
+ self.encoding_blocks = nn.ModuleList()
67
+ self.ds = nn.ModuleList()
68
+ for i in range(self.n):
69
+ self.encoding_blocks.append(TFC_TDF(c, l, f, k, bn, bias=bias, norm=norm))
70
+ self.ds.append(
71
+ nn.Sequential(
72
+ nn.Conv2d(in_channels=c, out_channels=c + g, kernel_size=scale, stride=scale),
73
+ norm(c + g),
74
+ nn.ReLU()
75
+ )
76
+ )
77
+ f = f // 2
78
+ c += g
79
+
80
+ self.bottleneck_block = TFC_TDF(c, l, f, k, bn, bias=bias, norm=norm)
81
+
82
+ self.decoding_blocks = nn.ModuleList()
83
+ self.us = nn.ModuleList()
84
+ for i in range(self.n):
85
+ self.us.append(
86
+ nn.Sequential(
87
+ nn.ConvTranspose2d(in_channels=c, out_channels=c - g, kernel_size=scale, stride=scale),
88
+ norm(c - g),
89
+ nn.ReLU()
90
+ )
91
+ )
92
+ f = f * 2
93
+ c -= g
94
+
95
+ self.decoding_blocks.append(TFC_TDF(c, l, f, k, bn, bias=bias, norm=norm))
96
+
97
+ self.final_conv = nn.Sequential(
98
+ nn.Conv2d(in_channels=c, out_channels=self.dim_c, kernel_size=(1, 1)),
99
+ )
100
+
101
+ def forward(self, x):
102
+
103
+ x = self.first_conv(x)
104
+
105
+ x = x.transpose(-1, -2)
106
+
107
+ ds_outputs = []
108
+ for i in range(self.n):
109
+ x = self.encoding_blocks[i](x)
110
+ ds_outputs.append(x)
111
+ x = self.ds[i](x)
112
+
113
+ x = self.bottleneck_block(x)
114
+
115
+ for i in range(self.n):
116
+ x = self.us[i](x)
117
+ x *= ds_outputs[-i - 1]
118
+ x = self.decoding_blocks[i](x)
119
+
120
+ x = x.transpose(-1, -2)
121
+
122
+ x = self.final_conv(x)
123
+
124
+ return x
125
+
126
+ class Mixer(nn.Module):
127
+ def __init__(self, device, mixer_path):
128
+
129
+ super(Mixer, self).__init__()
130
+
131
+ self.linear = nn.Linear((dim_s+1)*2, dim_s*2, bias=False)
132
+
133
+ self.load_state_dict(
134
+ torch.load(mixer_path, map_location=device)
135
+ )
136
+
137
+ def forward(self, x):
138
+ x = x.reshape(1,(dim_s+1)*2,-1).transpose(-1,-2)
139
+ x = self.linear(x)
140
+ return x.transpose(-1,-2).reshape(dim_s,2,-1)
uvr5/lib_v5/mixer.ckpt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea781bd52c6a523b825fa6cdbb6189f52e318edd8b17e6fe404f76f7af8caa9c
3
+ size 1208
uvr5/lib_v5/modules.py ADDED
@@ -0,0 +1,74 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import torch.nn as nn
3
+
4
+
5
class TFC(nn.Module):
    """Time-Frequency Convolution block: ``l`` identical conv->norm->ReLU
    layers applied sequentially, preserving the channel count ``c``."""

    def __init__(self, c, l, k, norm):
        super(TFC, self).__init__()

        layers = [
            nn.Sequential(
                nn.Conv2d(in_channels=c, out_channels=c, kernel_size=k, stride=1, padding=k // 2),
                norm(c),
                nn.ReLU(),
            )
            for _ in range(l)
        ]
        # Keep the attribute name `H` so existing checkpoints load unchanged.
        self.H = nn.ModuleList(layers)

    def forward(self, x):
        for block in self.H:
            x = block(x)
        return x
23
+
24
+
25
class DenseTFC(nn.Module):
    """Densely connected variant of TFC: each intermediate layer's output is
    concatenated with its input along the channel axis before the next layer.

    NOTE(review): every Conv2d here is built with ``in_channels=c``, but after
    the first concatenation the running tensor has ``2c`` channels, so for
    ``l > 1`` the later layers would receive more channels than they expect.
    This path is only reached when TFC_TDF is built with ``dense=True`` (not
    the default) -- confirm against the upstream implementation before
    enabling it.
    """

    def __init__(self, c, l, k, norm):
        super(DenseTFC, self).__init__()

        self.conv = nn.ModuleList()
        for i in range(l):
            self.conv.append(
                nn.Sequential(
                    nn.Conv2d(in_channels=c, out_channels=c, kernel_size=k, stride=1, padding=k // 2),
                    norm(c),
                    nn.ReLU(),
                )
            )

    def forward(self, x):
        # Dense connectivity: concatenate each layer's output with its input;
        # the final layer consumes the accumulated stack.
        for layer in self.conv[:-1]:
            x = torch.cat([layer(x), x], 1)
        return self.conv[-1](x)
43
+
44
+
45
class TFC_TDF(nn.Module):
    """TFC block followed by an optional TDF (time-distributed
    fully-connected) residual branch acting on the frequency axis ``f``.

    ``bn`` controls the TDF bottleneck: None disables the branch, 0 uses a
    single full-width Linear, any other value uses a two-layer bottleneck
    with hidden width ``f // bn``.
    """

    def __init__(self, c, l, f, k, bn, dense=False, bias=True, norm=nn.BatchNorm2d):
        super(TFC_TDF, self).__init__()

        self.use_tdf = bn is not None

        # Keep attribute names `tfc` / `tdf` for checkpoint compatibility.
        self.tfc = DenseTFC(c, l, k, norm) if dense else TFC(c, l, k, norm)

        if not self.use_tdf:
            return

        if bn == 0:
            tdf_layers = [
                nn.Linear(f, f, bias=bias),
                norm(c),
                nn.ReLU(),
            ]
        else:
            tdf_layers = [
                nn.Linear(f, f // bn, bias=bias),
                norm(c),
                nn.ReLU(),
                nn.Linear(f // bn, f, bias=bias),
                norm(c),
                nn.ReLU(),
            ]
        self.tdf = nn.Sequential(*tdf_layers)

    def forward(self, x):
        out = self.tfc(x)
        if self.use_tdf:
            # Residual connection around the frequency-wise Linear branch.
            out = out + self.tdf(out)
        return out
74
+
uvr5/lib_v5/pyrb.py ADDED
@@ -0,0 +1,92 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import subprocess
3
+ import tempfile
4
+ import six
5
+ import numpy as np
6
+ import soundfile as sf
7
+ import sys
8
+
9
# Resolve the directory holding the bundled `rubberband` CLI binary.  When
# running from a PyInstaller bundle ("frozen"), resources are unpacked under
# sys._MEIPASS; otherwise use this module's directory.
if getattr(sys, 'frozen', False):
    BASE_PATH_RUB = sys._MEIPASS
else:
    BASE_PATH_RUB = os.path.dirname(os.path.abspath(__file__))

__all__ = ['time_stretch', 'pitch_shift']

# Full path to the rubberband command-line executable shipped next to this file.
__RUBBERBAND_UTIL = os.path.join(BASE_PATH_RUB, 'rubberband')

# Python 2 has no subprocess.DEVNULL; fall back to an os.devnull handle.
if six.PY2:
    DEVNULL = open(os.devnull, 'w')
else:
    DEVNULL = subprocess.DEVNULL
22
+
23
def __rubberband(y, sr, **kwargs):
    """Run the rubberband CLI on audio ``y`` and return the processed audio.

    The audio is round-tripped through temporary WAV files; each kwarg
    becomes a CLI flag/value pair (e.g. ``{'--tempo': 1.5}``).  Raises
    RuntimeError (chained from OSError) if the binary cannot be executed.
    Temporary files are always removed.
    """

    assert sr > 0

    # Get the input and output tempfile; close the fds immediately since
    # only the paths are needed.
    fd, infile = tempfile.mkstemp(suffix='.wav')
    os.close(fd)
    fd, outfile = tempfile.mkstemp(suffix='.wav')
    os.close(fd)

    # dump the audio
    sf.write(infile, y, sr)

    try:
        # Execute rubberband ('-q' silences its console output).
        arguments = [__RUBBERBAND_UTIL, '-q']

        for key, value in six.iteritems(kwargs):
            arguments.append(str(key))
            arguments.append(str(value))

        arguments.extend([infile, outfile])

        subprocess.check_call(arguments, stdout=DEVNULL, stderr=DEVNULL)

        # Load the processed audio.
        y_out, _ = sf.read(outfile, always_2d=True)

        # make sure that output dimensions matches input
        if y.ndim == 1:
            y_out = np.squeeze(y_out)

    except OSError as exc:
        # Typically the binary is missing or not executable.
        six.raise_from(RuntimeError('Failed to execute rubberband. '
                                    'Please verify that rubberband-cli '
                                    'is installed.'),
                       exc)

    finally:
        # Remove temp files
        os.unlink(infile)
        os.unlink(outfile)

    return y_out
67
+
68
def time_stretch(y, sr, rate, rbargs=None):
    """Time-stretch audio by ``rate`` using the rubberband CLI.

    Parameters
    ----------
    y : audio samples (mono or multi-channel).
    sr : int
        Sample rate of ``y``; must be positive.
    rate : float
        Stretch factor (> 0); 1.0 returns ``y`` unchanged without invoking
        rubberband.
    rbargs : dict, optional
        Extra rubberband CLI flags.  Not mutated (the original setdefault
        mutated the caller's dict).

    Raises
    ------
    ValueError
        If ``rate`` is not strictly positive.
    """
    if rate <= 0:
        raise ValueError('rate must be strictly positive')

    if rate == 1.0:
        return y

    # Copy so the caller's dict is not mutated by setdefault below.
    rbargs = dict(rbargs) if rbargs else {}
    rbargs.setdefault('--tempo', rate)

    return __rubberband(y, sr, **rbargs)
81
+
82
def pitch_shift(y, sr, n_steps, rbargs=None):
    """Pitch-shift audio by ``n_steps`` semitones using the rubberband CLI.

    Parameters
    ----------
    y : audio samples (mono or multi-channel).
    sr : int
        Sample rate of ``y``.
    n_steps : float
        Semitones to shift; 0 returns ``y`` unchanged without invoking
        rubberband.
    rbargs : dict, optional
        Extra rubberband CLI flags.  Not mutated (the original setdefault
        mutated the caller's dict).
    """
    if n_steps == 0:
        return y

    # Copy so the caller's dict is not mutated by setdefault below.
    rbargs = dict(rbargs) if rbargs else {}
    rbargs.setdefault('--pitch', n_steps)

    return __rubberband(y, sr, **rbargs)
uvr5/lib_v5/spec_utils.py ADDED
@@ -0,0 +1,703 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import librosa
import numpy as np
import soundfile as sf
import math
import random
import platform
import traceback

# Platform detection used to pick the rubberband wrapper and resampler.
OPERATING_SYSTEM = platform.system()
SYSTEM_ARCH = platform.platform()
SYSTEM_PROC = platform.processor()
ARM = 'arm'

# Select the rubberband wrapper: the third-party pyrubberband package on
# Windows, the bundled CLI wrapper everywhere else.  (The original file also
# did an unconditional `from . import pyrb` before this conditional and
# imported `math` twice; both redundancies are removed here.)
if OPERATING_SYSTEM == 'Windows':
    from pyrubberband import pyrb
else:
    from . import pyrb

# Resampler choice: polyphase on ARM Macs, sinc_fastest elsewhere.
if OPERATING_SYSTEM == 'Darwin':
    wav_resolution = "polyphase" if SYSTEM_PROC == ARM or ARM in SYSTEM_ARCH else "sinc_fastest"
else:
    wav_resolution = "sinc_fastest"

# Ensembling algorithm identifiers (see ensembling / ensemble_inputs).
MAX_SPEC = 'Max Spec'
MIN_SPEC = 'Min Spec'
AVERAGE = 'Average'
29
+
30
def crop_center(h1, h2):
    """Center-crop ``h1`` along the last (time) axis to ``h2``'s length.

    Both tensors are 4-D; only axis 3 is cropped.  Returns ``h1`` unchanged
    when the lengths already match, and raises ValueError when ``h1`` is
    shorter than ``h2``.
    """
    t1 = h1.size()[3]
    t2 = h2.size()[3]

    if t1 == t2:
        return h1
    if t1 < t2:
        raise ValueError('h1_shape[3] must be greater than h2_shape[3]')

    start = (t1 - t2) // 2
    return h1[:, :, :, start:start + t2]
44
+
45
def preprocess(X_spec):
    """Split a complex spectrogram into magnitude and phase components."""
    return np.abs(X_spec), np.angle(X_spec)
50
+
51
def make_padding(width, cropsize, offset):
    """Compute (left, right, roi_size) padding so ``width`` splits into
    whole ROI windows of ``cropsize - 2 * offset`` frames each.

    When the offset consumes the whole crop (roi would be 0), the full
    cropsize is used as the ROI instead.
    """
    roi_size = cropsize - 2 * offset or cropsize
    right = roi_size - (width % roi_size) + offset
    return offset, right, roi_size
59
+
60
def wave_to_spectrogram(wave, hop_length, n_fft, mid_side=False, mid_side_b2=False, reverse=False):
    """Convert a stereo waveform to a 2-channel complex STFT spectrogram.

    The two channels may first be remapped:
      reverse     -- time-reverse both channels
      mid_side    -- encode as mid ((L+R)/2) and side (L-R)
      mid_side_b2 -- alternative mid/side weighting used by some models

    Returns a Fortran-ordered (2, bins, frames) complex array.
    """
    if reverse:
        wave_left = np.flip(np.asfortranarray(wave[0]))
        wave_right = np.flip(np.asfortranarray(wave[1]))
    elif mid_side:
        wave_left = np.asfortranarray(np.add(wave[0], wave[1]) / 2)
        wave_right = np.asfortranarray(np.subtract(wave[0], wave[1]))
    elif mid_side_b2:
        wave_left = np.asfortranarray(np.add(wave[1], wave[0] * .5))
        wave_right = np.asfortranarray(np.subtract(wave[0], wave[1] * .5))
    else:
        wave_left = np.asfortranarray(wave[0])
        wave_right = np.asfortranarray(wave[1])

    # Pass n_fft by keyword: librosa >= 0.10 no longer accepts it
    # positionally, and this matches the call style already used in
    # wave_to_spectrogram_mt.
    spec_left = librosa.stft(wave_left, n_fft=n_fft, hop_length=hop_length)
    spec_right = librosa.stft(wave_right, n_fft=n_fft, hop_length=hop_length)

    spec = np.asfortranarray([spec_left, spec_right])

    return spec
80
+
81
def wave_to_spectrogram_mt(wave, hop_length, n_fft, mid_side=False, mid_side_b2=False, reverse=False):
    """Multi-threaded variant of wave_to_spectrogram: the two channel STFTs
    are computed in parallel (one in a worker thread, one on the caller's
    thread).  Channel remapping flags behave as in wave_to_spectrogram.
    """
    import threading

    if reverse:
        wave_left = np.flip(np.asfortranarray(wave[0]))
        wave_right = np.flip(np.asfortranarray(wave[1]))
    elif mid_side:
        wave_left = np.asfortranarray(np.add(wave[0], wave[1]) / 2)
        wave_right = np.asfortranarray(np.subtract(wave[0], wave[1]))
    elif mid_side_b2:
        wave_left = np.asfortranarray(np.add(wave[1], wave[0] * .5))
        wave_right = np.asfortranarray(np.subtract(wave[0], wave[1] * .5))
    else:
        wave_left = np.asfortranarray(wave[0])
        wave_right = np.asfortranarray(wave[1])

    # Publish the worker's result through a local container instead of the
    # original module-level `global spec_left`, so concurrent calls cannot
    # clobber each other's result.
    results = {}

    def run_thread(**kwargs):
        results['spec_left'] = librosa.stft(**kwargs)

    thread = threading.Thread(target=run_thread, kwargs={'y': wave_left, 'n_fft': n_fft, 'hop_length': hop_length})
    thread.start()
    spec_right = librosa.stft(wave_right, n_fft=n_fft, hop_length=hop_length)
    thread.join()

    spec = np.asfortranarray([results['spec_left'], spec_right])

    return spec
110
+
111
def normalize(wave, is_normalize=False):
    """Optionally peak-normalize a waveform and downmix it to mono.

    Parameters
    ----------
    wave : np.ndarray
        2-D audio array; orientation may be (channels, samples) or
        (samples, channels) -- the shorter axis is treated as channels.
    is_normalize : bool
        When True and the peak exceeds 1.0, rescale so the peak is 1.0.

    Returns
    -------
    np.ndarray
        1-D mono waveform.
    """
    maxv = np.max(np.abs(wave))
    if maxv > 1.0:
        print(f"\nNormalization Set {is_normalize}: Input above threshold for clipping. Max:{maxv}")
        if is_normalize:
            print(f"The result was normalized.")
            wave /= maxv
        else:
            print(f"The result was not normalized.")
    else:
        print(f"\nNormalization Set {is_normalize}: Input not above threshold for clipping. Max:{maxv}")
    # stereo to mono
    # NOTE(review): this always averages across the shorter axis, i.e. the
    # output is forced to mono regardless of is_normalize -- confirm callers
    # expect mono output here.
    if wave.shape[1] < wave.shape[0]:
        wave = np.mean(wave, axis=1)
    else:
        wave = np.mean(wave, axis=0)
    return wave
129
+
130
def normalize_two_stem(wave, mix, is_normalize=False):
    """Optionally peak-normalize a primary stem together with the mixture.

    Both arrays are scaled by the same factor so their relative level is
    preserved.

    Parameters
    ----------
    wave : np.ndarray
        Primary source audio.
    mix : np.ndarray
        Mixture audio, rescaled alongside ``wave``.
    is_normalize : bool
        When True and the stem's peak exceeds 1.0, rescale both arrays.

    Returns
    -------
    tuple
        (wave, mix), possibly rescaled in place.
    """

    maxv = np.abs(wave).max()
    max_mix = np.abs(mix).max()

    if maxv > 1.0:
        print(f"\nNormalization Set {is_normalize}: Primary source above threshold for clipping. Max:{maxv}")
        print(f"\nNormalization Set {is_normalize}: Mixture above threshold for clipping. Max:{max_mix}")
        if is_normalize:
            print(f"The result was normalized.")
            # NOTE(review): both arrays are divided by the stem's peak (maxv),
            # not by their own peaks -- this keeps the stem/mix balance but
            # can leave the mix clipped when max_mix > maxv. Confirm intended.
            wave /= maxv
            mix /= maxv
        else:
            print(f"The result was not normalized.")
    else:
        print(f"\nNormalization Set {is_normalize}: Input not above threshold for clipping. Max:{maxv}")


    print(f"\nNormalization Set {is_normalize}: Primary source - Max:{np.abs(wave).max()}")
    print(f"\nNormalization Set {is_normalize}: Mixture - Max:{np.abs(mix).max()}")

    return wave, mix
153
+
154
def combine_spectrograms(specs, mp):
    """Stack per-band spectrograms into a single multi-band spectrogram.

    Parameters
    ----------
    specs : dict
        Band index (1-based) -> (2, bins, frames) complex spectrogram.
    mp : model-parameter object
        Supplies mp.param['band'][d]['crop_start'/'crop_stop'], the total
        'bins' count and the pre-filter range.

    Returns
    -------
    np.ndarray
        Fortran-ordered (2, bins + 1, frames) complex64 array.

    Raises
    ------
    ValueError
        If the cropped bands overflow the configured bin count.
    """
    # Truncate every band to the shortest frame count.
    l = min([specs[i].shape[2] for i in specs])
    spec_c = np.zeros(shape=(2, mp.param['bins'] + 1, l), dtype=np.complex64)
    offset = 0
    bands_n = len(mp.param['band'])

    # Copy each band's cropped bin range into the stacked output.
    for d in range(1, bands_n + 1):
        h = mp.param['band'][d]['crop_stop'] - mp.param['band'][d]['crop_start']
        spec_c[:, offset:offset+h, :l] = specs[d][:, mp.param['band'][d]['crop_start']:mp.param['band'][d]['crop_stop'], :l]
        offset += h

    if offset > mp.param['bins']:
        raise ValueError('Too much bins')

    # lowpass filter above the pre-filter start
    if mp.param['pre_filter_start'] > 0: # and mp.param['band'][bands_n]['res_type'] in ['scipy', 'polyphase']:
        if bands_n == 1:
            spec_c = fft_lp_filter(spec_c, mp.param['pre_filter_start'], mp.param['pre_filter_stop'])
        else:
            # Multi-band: apply a gentle exponential roll-off instead of
            # the linear low-pass fade.
            gp = 1
            for b in range(mp.param['pre_filter_start'] + 1, mp.param['pre_filter_stop']):
                g = math.pow(10, -(b - mp.param['pre_filter_start']) * (3.5 - gp) / 20.0)
                gp = g
                spec_c[:, b, :] *= g

    return np.asfortranarray(spec_c)
180
+
181
def spectrogram_to_image(spec, mode='magnitude'):
    """Render a spectrogram as an 8-bit image array.

    'magnitude' maps log-power to intensity; 'phase' maps the angle.
    Complex input is reduced with abs/angle first.  3-D input (two
    channels) becomes an HxWx3 image whose first plane is the per-pixel
    channel maximum.
    """
    if mode == 'magnitude':
        y = np.abs(spec) if np.iscomplexobj(spec) else spec
        y = np.log10(y ** 2 + 1e-8)
    elif mode == 'phase':
        y = np.angle(spec) if np.iscomplexobj(spec) else spec

    # Rescale to the full 0..255 range.
    y -= y.min()
    y *= 255 / y.max()
    img = np.uint8(y)

    if y.ndim == 3:
        img = img.transpose(1, 2, 0)
        img = np.concatenate(
            [np.max(img, axis=2, keepdims=True), img],
            axis=2,
        )

    return img
205
+
206
def reduce_vocal_aggressively(X, y, softmask):
    """Attenuate bins of ``y`` where the residual ``X - y`` dominates.

    Wherever the residual magnitude exceeds ``y``'s magnitude, ``y``'s
    magnitude is reduced by ``softmask`` times the residual magnitude
    (clipped at zero); ``y``'s phase is preserved.
    """
    residual = X - y
    y_mag = np.abs(y)
    res_mag = np.abs(residual)

    dominated = res_mag > y_mag
    new_mag = np.clip(y_mag - res_mag * dominated * softmask, 0, np.inf)

    return new_mag * np.exp(1.j * np.angle(y))
215
+
216
def merge_artifacts(y_mask, thres=0.01, min_range=64, fade_size=32):
    """Smooth a separation mask over sustained high-energy regions.

    Frame ranges whose mask minimum stays above ``thres`` for more than
    ``min_range`` frames are blended toward 1.0, with linear fades of
    ``fade_size`` frames at the edges, which suppresses artifact flutter.
    Any failure is logged and the mask is returned unmodified.
    """
    mask = y_mask

    try:
        if min_range < fade_size * 2:
            raise ValueError('min_range must be >= fade_size * 2')

        # Frames where every bin of the mask exceeds the threshold.
        idx = np.where(y_mask.min(axis=(0, 1)) > thres)[0]
        # Start/end frame of each contiguous run of such frames.
        start_idx = np.insert(idx[np.where(np.diff(idx) != 1)[0] + 1], 0, idx[0])
        end_idx = np.append(idx[np.where(np.diff(idx) != 1)[0]], idx[-1])
        # Keep only runs longer than min_range.
        artifact_idx = np.where(end_idx - start_idx > min_range)[0]
        weight = np.zeros_like(y_mask)
        if len(artifact_idx) > 0:
            start_idx = start_idx[artifact_idx]
            end_idx = end_idx[artifact_idx]
            old_e = None
            for s, e in zip(start_idx, end_idx):
                # Merge runs separated by less than one fade length.
                if old_e is not None and s - old_e < fade_size:
                    s = old_e - fade_size * 2

                if s != 0:
                    # Fade-in at the start of the run.
                    weight[:, :, s:s + fade_size] = np.linspace(0, 1, fade_size)
                else:
                    s -= fade_size

                if e != y_mask.shape[2]:
                    # Fade-out at the end of the run.
                    weight[:, :, e - fade_size:e] = np.linspace(1, 0, fade_size)
                else:
                    e += fade_size

                # Full weight between the fades.
                weight[:, :, s + fade_size:e - fade_size] = 1
                old_e = e

            # Blend the mask toward 1.0 where the weight is high.
            v_mask = 1 - y_mask
            y_mask += weight * v_mask

            mask = y_mask
    except Exception as e:
        # Best-effort post-processing: report and fall back to the input mask.
        error_name = f'{type(e).__name__}'
        traceback_text = ''.join(traceback.format_tb(e.__traceback__))
        message = f'{error_name}: "{e}"\n{traceback_text}"'
        print('Post Process Failed: ', message)


    return mask
261
+
262
def align_wave_head_and_tail(a, b):
    """Truncate two (channels, samples) arrays to their common sample length.

    Fixes the original slice ``a[:l, :l]``, which also clipped the channel
    axis (harmless in practice only because ``l`` is normally far larger
    than the channel count); only the sample axis should be cut.
    """
    l = min(a[0].size, b[0].size)

    return a[:, :l], b[:, :l]
266
+
267
def spectrogram_to_wave(spec, hop_length, mid_side, mid_side_b2, reverse, clamp=False):
    """ISTFT a 2-channel spectrogram and undo any channel remapping.

    Inverse of wave_to_spectrogram: ``reverse`` flips time back,
    ``mid_side`` decodes mid/side back to L/R, ``mid_side_b2`` decodes the
    alternative weighting.  ``clamp`` is accepted but unused here.
    Returns a Fortran-ordered (2, samples) array.
    """
    spec_left = np.asfortranarray(spec[0])
    spec_right = np.asfortranarray(spec[1])

    wave_left = librosa.istft(spec_left, hop_length=hop_length)
    wave_right = librosa.istft(spec_right, hop_length=hop_length)

    if reverse:
        return np.asfortranarray([np.flip(wave_left), np.flip(wave_right)])
    elif mid_side:
        # L = mid + side/2, R = mid - side/2.
        return np.asfortranarray([np.add(wave_left, wave_right / 2), np.subtract(wave_left, wave_right / 2)])
    elif mid_side_b2:
        return np.asfortranarray([np.add(wave_right / 1.25, .4 * wave_left), np.subtract(wave_left / 1.25, .4 * wave_right)])
    else:
        return np.asfortranarray([wave_left, wave_right])
282
+
283
def spectrogram_to_wave_mt(spec, hop_length, mid_side, reverse, mid_side_b2):
    """Multi-threaded variant of spectrogram_to_wave (one ISTFT per thread).

    NOTE(review): the worker publishes its result through a module-level
    global (``wave_left``), so concurrent calls would race -- confirm this
    is only ever invoked from a single thread at a time.
    """
    import threading

    spec_left = np.asfortranarray(spec[0])
    spec_right = np.asfortranarray(spec[1])

    def run_thread(**kwargs):
        global wave_left
        wave_left = librosa.istft(**kwargs)

    thread = threading.Thread(target=run_thread, kwargs={'stft_matrix': spec_left, 'hop_length': hop_length})
    thread.start()
    wave_right = librosa.istft(spec_right, hop_length=hop_length)
    thread.join()

    if reverse:
        return np.asfortranarray([np.flip(wave_left), np.flip(wave_right)])
    elif mid_side:
        # L = mid + side/2, R = mid - side/2.
        return np.asfortranarray([np.add(wave_left, wave_right / 2), np.subtract(wave_left, wave_right / 2)])
    elif mid_side_b2:
        return np.asfortranarray([np.add(wave_right / 1.25, .4 * wave_left), np.subtract(wave_left / 1.25, .4 * wave_right)])
    else:
        return np.asfortranarray([wave_left, wave_right])
306
+
307
def cmb_spectrogram_to_wave(spec_m, mp, extra_bins_h=None, extra_bins=None):
    """Inverse of combine_spectrograms: reconstruct audio from a stacked
    multi-band spectrogram.

    Each band is re-inserted at its original FFT position, band-limited
    with low/high-pass fades, converted to a waveform at the band's sample
    rate and resampled upward, summing progressively into the final
    full-rate wave.

    Parameters
    ----------
    spec_m : np.ndarray
        (2, bins, frames) stacked spectrogram from combine_spectrograms.
    mp : model-parameter object with per-band STFT/crop/filter settings.
    extra_bins_h, extra_bins : optional high-end bins restored into the top
        band when high-end processing is bypassed.

    Returns
    -------
    np.ndarray
        Stereo waveform at the top band's sample rate.
    """
    bands_n = len(mp.param['band'])
    offset = 0

    for d in range(1, bands_n + 1):
        bp = mp.param['band'][d]
        spec_s = np.ndarray(shape=(2, bp['n_fft'] // 2 + 1, spec_m.shape[2]), dtype=complex)
        h = bp['crop_stop'] - bp['crop_start']
        # Re-insert this band's bins at their original FFT position.
        spec_s[:, bp['crop_start']:bp['crop_stop'], :] = spec_m[:, offset:offset+h, :]

        offset += h
        if d == bands_n: # highest band
            if extra_bins_h: # if --high_end_process bypass
                max_bin = bp['n_fft'] // 2
                spec_s[:, max_bin-extra_bins_h:max_bin, :] = extra_bins[:, :extra_bins_h, :]
            if bp['hpf_start'] > 0:
                spec_s = fft_hp_filter(spec_s, bp['hpf_start'], bp['hpf_stop'] - 1)
            if bands_n == 1:
                wave = spectrogram_to_wave(spec_s, bp['hl'], mp.param['mid_side'], mp.param['mid_side_b2'], mp.param['reverse'])
            else:
                wave = np.add(wave, spectrogram_to_wave(spec_s, bp['hl'], mp.param['mid_side'], mp.param['mid_side_b2'], mp.param['reverse']))
        else:
            # Resample this band up to the next band's sample rate.
            # NOTE(review): positional sr arguments to librosa.resample were
            # removed in librosa >= 0.10; use orig_sr=/target_sr= when
            # upgrading the pinned librosa version.
            sr = mp.param['band'][d+1]['sr']
            if d == 1: # lowest band
                spec_s = fft_lp_filter(spec_s, bp['lpf_start'], bp['lpf_stop'] - 1) # test
                spec_s = fft_lp_filter(spec_s, bp['lpf_start'], bp['lpf_stop'])
                wave = librosa.resample(spectrogram_to_wave(spec_s, bp['hl'], mp.param['mid_side'], mp.param['mid_side_b2'], mp.param['reverse']), bp['sr'], sr, res_type=wav_resolution)
            else: # mid bands: band-pass, add to the running sum, resample
                spec_s = fft_hp_filter(spec_s, bp['hpf_start'], bp['hpf_stop'] - 1)
                spec_s = fft_lp_filter(spec_s, bp['lpf_start'], bp['lpf_stop'])
                wave2 = np.add(wave, spectrogram_to_wave(spec_s, bp['hl'], mp.param['mid_side'], mp.param['mid_side_b2'], mp.param['reverse']))
                wave = librosa.resample(wave2, bp['sr'], sr, res_type=wav_resolution)

    return wave
346
+
347
def fft_lp_filter(spec, bin_start, bin_stop):
    """Low-pass a spectrogram in place: fade bins linearly from full gain at
    ``bin_start`` down to zero at ``bin_stop``, then zero everything above.
    Returns the same (mutated) array.
    """
    step = 1 / (bin_stop - bin_start)
    g = 1.0
    for b in range(bin_start, bin_stop):
        g -= step
        spec[:, b, :] = g * spec[:, b, :]

    spec[:, bin_stop:, :] *= 0

    return spec
356
+
357
def fft_hp_filter(spec, bin_start, bin_stop):
    """High-pass a spectrogram in place: fade bins linearly from full gain
    at ``bin_start`` down toward zero at ``bin_stop`` (iterating downward),
    then zero everything at or below ``bin_stop``.  Returns the same
    (mutated) array.
    """
    step = 1 / (bin_start - bin_stop)
    g = 1.0
    for b in range(bin_start, bin_stop, -1):
        g -= step
        spec[:, b, :] = g * spec[:, b, :]

    spec[:, 0:bin_stop + 1, :] *= 0

    return spec
366
+
367
def mirroring(a, spec_m, input_high_end, mp):
    """Synthesize the missing high-frequency band by mirroring lower bins.

    a == 'mirroring'  : reflect the band just below the pre-filter cutoff,
                        restore the input high end's phase, and keep the
                        quieter of (input, mirror) per bin.
    a == 'mirroring2' : scale the mirrored magnitudes by the input high end
                        (x1.7) and keep the quieter per bin.
    Any other mode falls through and returns None.
    """
    if 'mirroring' == a:
        # Reflect the bins just below the pre-filter start (10-bin margin).
        mirror = np.flip(np.abs(spec_m[:, mp.param['pre_filter_start']-10-input_high_end.shape[1]:mp.param['pre_filter_start']-10, :]), 1)
        # Re-apply the original high-end phase to the mirrored magnitudes.
        mirror = mirror * np.exp(1.j * np.angle(input_high_end))

        # Keep whichever is quieter to avoid boosting noise.
        return np.where(np.abs(input_high_end) <= np.abs(mirror), input_high_end, mirror)

    if 'mirroring2' == a:
        mirror = np.flip(np.abs(spec_m[:, mp.param['pre_filter_start']-10-input_high_end.shape[1]:mp.param['pre_filter_start']-10, :]), 1)
        mi = np.multiply(mirror, input_high_end * 1.7)

        return np.where(np.abs(input_high_end) <= np.abs(mi), input_high_end, mi)
379
+
380
def adjust_aggr(mask, is_non_accom_stem, aggressiveness):
    """Sharpen a separation mask according to the aggressiveness setting.

    The mask is raised to a power greater than 1 (with a softened exponent
    below ``split_bin``), pushing intermediate values toward 0 and making
    the separation more aggressive.  Per-channel corrections may be applied
    via ``aggressiveness['aggr_correction']``.  Mutates and returns ``mask``.
    """
    aggr = aggressiveness['value']

    if aggr != 0:
        # Non-accompaniment stems use the complementary setting.
        if is_non_accom_stem:
            aggr = 1 - aggr

        aggr = [aggr, aggr]

        if aggressiveness['aggr_correction'] is not None:
            aggr[0] += aggressiveness['aggr_correction']['left']
            aggr[1] += aggressiveness['aggr_correction']['right']

        for ch in range(2):
            # Lower bins get a softened exponent (1 + aggr/3), higher bins
            # the full exponent (1 + aggr).
            mask[ch, :aggressiveness['split_bin']] = np.power(mask[ch, :aggressiveness['split_bin']], 1 + aggr[ch] / 3)
            mask[ch, aggressiveness['split_bin']:] = np.power(mask[ch, aggressiveness['split_bin']:], 1 + aggr[ch])

    return mask
401
+
402
def stft(wave, nfft, hl):
    """Stereo STFT helper: returns a Fortran-ordered (2, bins, frames)
    complex array for a (2, samples) waveform.
    """
    wave_left = np.asfortranarray(wave[0])
    wave_right = np.asfortranarray(wave[1])
    # Pass n_fft by keyword: librosa >= 0.10 no longer accepts it
    # positionally (matches the call style in wave_to_spectrogram_mt).
    spec_left = librosa.stft(wave_left, n_fft=nfft, hop_length=hl)
    spec_right = librosa.stft(wave_right, n_fft=nfft, hop_length=hl)
    spec = np.asfortranarray([spec_left, spec_right])

    return spec
410
+
411
def istft(spec, hl):
    """Inverse of stft(): reconstruct a Fortran-ordered (2, samples)
    waveform from a (2, bins, frames) complex spectrogram.
    """
    channels = [
        librosa.istft(np.asfortranarray(channel), hop_length=hl)
        for channel in (spec[0], spec[1])
    ]
    return np.asfortranarray(channels)
419
+
420
def spec_effects(wave, algorithm='Default', value=None):
    """Combine two stereo waveforms using a spectral or linear strategy.

    Parameters
    ----------
    wave : sequence of two (2, samples) arrays.
    algorithm : str
        'Min_Mag' / 'Max_Mag' keep the per-bin smaller/larger magnitude;
        'Default' is a linear crossfade weighted by ``value``;
        'Invert_p' performs phase-aware subtraction of wave[0] from wave[1].
    value : float
        Crossfade weight, used only by 'Default'.
    """
    spec = [stft(wave[0],2048,1024), stft(wave[1],2048,1024)]
    if algorithm == 'Min_Mag':
        v_spec_m = np.where(np.abs(spec[1]) <= np.abs(spec[0]), spec[1], spec[0])
        wave = istft(v_spec_m,1024)
    elif algorithm == 'Max_Mag':
        v_spec_m = np.where(np.abs(spec[1]) >= np.abs(spec[0]), spec[1], spec[0])
        wave = istft(v_spec_m,1024)
    elif algorithm == 'Default':
        wave = (wave[1] * value) + (wave[0] * (1-value))
    elif algorithm == 'Invert_p':
        # Subtract the per-bin max magnitude (with wave[0]'s phase) from
        # wave[1]'s spectrum.
        X_mag = np.abs(spec[0])
        y_mag = np.abs(spec[1])
        max_mag = np.where(X_mag >= y_mag, X_mag, y_mag)
        v_spec = spec[1] - max_mag * np.exp(1.j * np.angle(spec[0]))
        wave = istft(v_spec,1024)

    return wave
438
+
439
def spectrogram_to_wave_no_mp(spec, n_fft=2048, hop_length=1024):
    """Single-threaded ISTFT; mono output is duplicated to two channels.

    NOTE(review): librosa.istft only accepts an ``n_fft`` keyword in newer
    releases -- verify against the pinned librosa version.
    """
    wave = librosa.istft(spec, n_fft=n_fft, hop_length=hop_length)

    if wave.ndim == 1:
        wave = np.asfortranarray([wave,wave])

    return wave
446
+
447
def wave_to_spectrogram_no_mp(wave):
    """STFT with fixed 2048/1024 parameters; a mono spectrogram is
    duplicated so the result always has two channels.
    """
    spec = librosa.stft(wave, n_fft=2048, hop_length=1024)

    if spec.ndim == 1:
        spec = np.asfortranarray([spec, spec])

    return spec
455
+
456
def invert_audio(specs, invert_p=True):
    """Subtract a stem spectrogram from a mixture spectrogram.

    Parameters
    ----------
    specs : list
        [mixture_spec, stem_spec]; both are truncated (in place) to the
        common frame count.
    invert_p : bool
        True uses phase-aware subtraction (per-bin max magnitude with the
        mixture's phase); False first attenuates the stem with
        reduce_vocal_aggressively and subtracts it directly.

    Returns
    -------
    np.ndarray
        The residual spectrogram.
    """

    ln = min([specs[0].shape[2], specs[1].shape[2]])
    specs[0] = specs[0][:,:,:ln]
    specs[1] = specs[1][:,:,:ln]

    if invert_p:
        X_mag = np.abs(specs[0])
        y_mag = np.abs(specs[1])
        max_mag = np.where(X_mag >= y_mag, X_mag, y_mag)
        v_spec = specs[1] - max_mag * np.exp(1.j * np.angle(specs[0]))
    else:
        specs[1] = reduce_vocal_aggressively(specs[0], specs[1], 0.2)
        v_spec = specs[0] - specs[1]

    return v_spec
472
+
473
def invert_stem(mixture, stem):
    """Phase-aware subtraction of a stem waveform from the mixture waveform.

    Both inputs are converted to spectrograms, subtracted via invert_audio,
    and the result is returned as a negated, transposed waveform.
    """
    mix_spec = wave_to_spectrogram_no_mp(mixture)
    stem_spec = wave_to_spectrogram_no_mp(stem)
    inverted = spectrogram_to_wave_no_mp(invert_audio([mix_spec, stem_spec]))
    return -inverted.T
480
+
481
def ensembling(a, specs):
    """Fold a list of spectrograms into one according to strategy ``a``.

    MIN_SPEC / MAX_SPEC keep the per-bin smaller / larger magnitude across
    all inputs, trimming each pair to the common frame count.

    NOTE(review): the AVERAGE branch only replaces bins whose magnitudes
    are exactly equal, so it is effectively a no-op rather than a mean;
    ensemble_inputs routes AVERAGE to average_audio() beforehand, so this
    branch appears to be dead code -- confirm before relying on it.
    """
    for i in range(1, len(specs)):
        if i == 1:
            spec = specs[0]

        # Trim both operands to the common frame count.
        ln = min([spec.shape[2], specs[i].shape[2]])
        spec = spec[:,:,:ln]
        specs[i] = specs[i][:,:,:ln]

        if MIN_SPEC == a:
            spec = np.where(np.abs(specs[i]) <= np.abs(spec), specs[i], spec)
        if MAX_SPEC == a:
            spec = np.where(np.abs(specs[i]) >= np.abs(spec), specs[i], spec)
        if AVERAGE == a:
            spec = np.where(np.abs(specs[i]) == np.abs(spec), specs[i], spec)

    return spec
498
+
499
def ensemble_inputs(audio_input, algorithm, is_normalization, wav_type_set, save_path):
    """Ensemble several audio files into one and write the result.

    AVERAGE takes the sample-wise mean of the waveforms; other algorithms
    combine spectrograms via ensembling().  The result is zero-padded to
    the longest input before being written to ``save_path``.
    """

    wavs_ = []

    if algorithm == AVERAGE:
        output = average_audio(audio_input)
        samplerate = 44100
    else:
        specs = []

        for i in range(len(audio_input)):
            # All inputs are resampled to 44.1 kHz; `samplerate` keeps the
            # value returned for the last file loaded.
            wave, samplerate = librosa.load(audio_input[i], mono=False, sr=44100)
            wavs_.append(wave)
            spec = wave_to_spectrogram_no_mp(wave)
            specs.append(spec)

        # Pad the ensembled result to the longest input waveform.
        wave_shapes = [w.shape[1] for w in wavs_]
        target_shape = wavs_[wave_shapes.index(max(wave_shapes))]

        output = spectrogram_to_wave_no_mp(ensembling(algorithm, specs))
        output = to_shape(output, target_shape.shape)

    sf.write(save_path, normalize(output.T, is_normalization), samplerate, subtype=wav_type_set)
522
+
523
def to_shape(x, target_shape):
    """Zero-pad ``x`` at the trailing end of each axis so its shape equals
    ``target_shape`` (each target dimension must be >= the source one).
    """
    pads = tuple((0, target_dim - src_dim)
                 for src_dim, target_dim in zip(x.shape, target_shape))
    return np.pad(x, pads, mode='constant')
531
+
532
def to_shape_minimize(x: np.ndarray, target_shape):
    """Zero-pad ``x`` at the trailing end of each axis to ``target_shape``.

    Identical in behavior to to_shape; kept as a separate name for API
    compatibility with existing callers.
    """
    pads = tuple((0, target_dim - src_dim)
                 for src_dim, target_dim in zip(x.shape, target_shape))
    return np.pad(x, pads, mode='constant')
541
+
542
def augment_audio(export_path, audio_file, rate, is_normalization, wav_type_set, save_format=None, is_pitch=False):
    """Time-stretch or pitch-shift an audio file and save the result.

    Parameters
    ----------
    export_path : output file path.
    audio_file : input file path (loaded at 44.1 kHz).
    rate : float
        Stretch factor when ``is_pitch`` is False; semitone shift otherwise.
    save_format : callable
        Applied to the exported path for format conversion.
        NOTE(review): called unconditionally below, so passing the default
        None would raise -- confirm all callers supply it.
    is_pitch : bool
        Selects pitch_shift vs time_stretch.
    """

    wav, sr = librosa.load(audio_file, sr=44100, mono=False)

    if wav.ndim == 1:
        wav = np.asfortranarray([wav,wav])

    # Process each channel independently through rubberband.
    if is_pitch:
        wav_1 = pyrb.pitch_shift(wav[0], sr, rate, rbargs=None)
        wav_2 = pyrb.pitch_shift(wav[1], sr, rate, rbargs=None)
    else:
        wav_1 = pyrb.time_stretch(wav[0], sr, rate, rbargs=None)
        wav_2 = pyrb.time_stretch(wav[1], sr, rate, rbargs=None)

    # Channels can come back with slightly different lengths; pad the shorter.
    if wav_1.shape > wav_2.shape:
        wav_2 = to_shape(wav_2, wav_1.shape)
    if wav_1.shape < wav_2.shape:
        wav_1 = to_shape(wav_1, wav_2.shape)

    wav_mix = np.asfortranarray([wav_1, wav_2])

    sf.write(export_path, normalize(wav_mix.T, is_normalization), sr, subtype=wav_type_set)
    save_format(export_path)
565
+
566
def average_audio(audio):
    """Load several audio files and return their sample-wise mean.

    All files are resampled to 44.1 kHz; shorter ones are zero-padded to
    the longest before averaging.  Returns a (channels, samples) array.
    """

    waves = []
    wave_shapes = []
    final_waves = []

    for i in range(len(audio)):
        # librosa.load returns (data, sr); only the data is kept.
        wave = librosa.load(audio[i], sr=44100, mono=False)
        waves.append(wave[0])
        wave_shapes.append(wave[0].shape[1])

    # Use the longest waveform as the padding target.
    wave_shapes_index = wave_shapes.index(max(wave_shapes))
    target_shape = waves[wave_shapes_index]
    waves.pop(wave_shapes_index)
    final_waves.append(target_shape)

    # Pad every other waveform to the target length.
    for n_array in waves:
        wav_target = to_shape(n_array, target_shape.shape)
        final_waves.append(wav_target)

    waves = sum(final_waves)
    waves = waves/len(audio)

    return waves
590
+
591
def average_dual_sources(wav_1, wav_2, value):
    """Weighted blend of two sources: ``value * wav_1 + (1 - value) * wav_2``.

    The shorter array is zero-padded to the longer one's shape before
    blending.
    """
    if wav_1.shape > wav_2.shape:
        wav_2 = to_shape(wav_2, wav_1.shape)
    if wav_1.shape < wav_2.shape:
        wav_1 = to_shape(wav_1, wav_2.shape)

    return wav_1 * value + wav_2 * (1 - value)
601
+
602
def reshape_sources(wav_1: np.ndarray, wav_2: np.ndarray):
    """Return ``wav_2`` adjusted (padded or trimmed) to ``wav_1``'s length.

    NOTE(review): only wav_2 is returned; the trimming of wav_1 at the end
    has no effect outside this function.
    """

    if wav_1.shape > wav_2.shape:
        wav_2 = to_shape(wav_2, wav_1.shape)
    if wav_1.shape < wav_2.shape:
        ln = min([wav_1.shape[1], wav_2.shape[1]])
        wav_2 = wav_2[:,:ln]

    # Trim both to the common sample length.
    ln = min([wav_1.shape[1], wav_2.shape[1]])
    wav_1 = wav_1[:,:ln]
    wav_2 = wav_2[:,:ln]

    return wav_2
615
+
616
def align_audio(file1, file2, file2_aligned, file_subtracted, wav_type_set, is_normalization, command_Text, progress_bar_main_var, save_format):
    """Estimate the time offset between two recordings, align them and save
    the aligned track plus the difference (file1 - aligned file2).

    The offset is estimated by cross-correlating 1-second windows at random
    positions and voting on the most frequent lag.

    UI callbacks: ``command_Text(str)`` for log output,
    ``progress_bar_main_var.set(int)`` for progress, ``save_format(path)``
    for post-save format conversion.
    """
    def get_diff(a, b):
        # Lag of the cross-correlation peak between the two windows.
        corr = np.correlate(a, b, "full")
        diff = corr.argmax() - (b.shape[0] - 1)
        return diff

    progress_bar_main_var.set(10)

    # read tracks (resampled to 44.1 kHz, transposed to (samples, channels))
    wav1, sr1 = librosa.load(file1, sr=44100, mono=False)
    wav2, sr2 = librosa.load(file2, sr=44100, mono=False)
    wav1 = wav1.transpose()
    wav2 = wav2.transpose()

    command_Text(f"Audio file shapes: {wav1.shape} / {wav2.shape}\n")

    wav2_org = wav2.copy()
    progress_bar_main_var.set(20)

    command_Text("Processing files... \n")

    # pick random position and get diff

    counts = {} # counting up for each diff value
    progress = 20

    check_range = 64

    base = (64 / check_range)

    for i in range(check_range):
        # Random 1 s window (avoiding the first/last 2 s), with a random
        # extra shift so the vote is not biased by a single alignment.
        index = int(random.uniform(44100 * 2, min(wav1.shape[0], wav2.shape[0]) - 44100 * 2))
        shift = int(random.uniform(-22050,+22050))
        samp1 = wav1[index :index +44100, 0] # currently use left channel
        samp2 = wav2[index+shift:index+shift+44100, 0]
        progress += 1 * base
        progress_bar_main_var.set(progress)
        diff = get_diff(samp1, samp2)
        diff -= shift

        # Ignore implausible lags (more than half a second).
        if abs(diff) < 22050:
            if not diff in counts:
                counts[diff] = 0
            counts[diff] += 1

    # use max counted diff value
    max_count = 0
    est_diff = 0
    for diff in counts.keys():
        if counts[diff] > max_count:
            max_count = counts[diff]
            est_diff = diff

    command_Text(f"Estimated difference is {est_diff} (count: {max_count})\n")

    progress_bar_main_var.set(90)

    audio_files = []

    def save_aligned_audio(wav2_aligned):
        # Write the aligned second track and compute the subtraction.
        command_Text(f"Aligned File 2 with File 1.\n")
        command_Text(f"Saving files... ")
        sf.write(file2_aligned, normalize(wav2_aligned, is_normalization), sr2, subtype=wav_type_set)
        save_format(file2_aligned)
        min_len = min(wav1.shape[0], wav2_aligned.shape[0])
        wav_sub = wav1[:min_len] - wav2_aligned[:min_len]
        audio_files.append(file2_aligned)
        return min_len, wav_sub

    # make aligned track 2
    if est_diff > 0:
        # file2 lags: prepend silence.
        wav2_aligned = np.append(np.zeros((est_diff, 2)), wav2_org, axis=0)
        min_len, wav_sub = save_aligned_audio(wav2_aligned)
    elif est_diff < 0:
        # file2 leads: drop the leading samples.
        wav2_aligned = wav2_org[-est_diff:]
        min_len, wav_sub = save_aligned_audio(wav2_aligned)
    else:
        command_Text(f"Audio files already aligned.\n")
        command_Text(f"Saving inverted track... ")
        min_len = min(wav1.shape[0], wav2.shape[0])
        wav_sub = wav1[:min_len] - wav2[:min_len]

    wav_sub = np.clip(wav_sub, -1, +1)

    sf.write(file_subtracted, normalize(wav_sub, is_normalization), sr1, subtype=wav_type_set)
    save_format(file_subtracted)

    progress_bar_main_var.set(95)
uvr5/lib_v5/vr_network/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # VR init.
uvr5/lib_v5/vr_network/layers.py ADDED
@@ -0,0 +1,143 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ from torch import nn
3
+ import torch.nn.functional as F
4
+
5
+ from lib_v5 import spec_utils
6
+
7
class Conv2DBNActiv(nn.Module):
    """Conv2d -> BatchNorm2d -> activation, packaged as one reusable unit.

    Args:
        nin: input channel count.
        nout: output channel count.
        ksize: convolution kernel size.
        stride: convolution stride.
        pad: zero-padding added to both spatial sides.
        dilation: convolution dilation rate.
        activ: activation-module class (instantiated with no arguments).
    """

    def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU):
        super(Conv2DBNActiv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(
                nin, nout,
                kernel_size=ksize,
                stride=stride,
                padding=pad,
                dilation=dilation,
                bias=False),  # bias is redundant right before BatchNorm
            nn.BatchNorm2d(nout),
            activ()
        )

    def forward(self, x):
        # Defining forward() (instead of overriding __call__) keeps
        # nn.Module's hook/dispatch machinery intact; instances remain
        # callable exactly as before.
        return self.conv(x)
25
+
26
class SeperableConv2DBNActiv(nn.Module):
    """Depthwise-separable convolution -> BatchNorm2d -> activation.

    A depthwise conv (groups=nin) followed by a 1x1 pointwise conv.
    NOTE: class name keeps the historical "Seperable" spelling because
    external callers and checkpoints reference it.

    Args:
        nin: input channel count.
        nout: output channel count.
        ksize: depthwise kernel size.
        stride: depthwise stride.
        pad: depthwise padding.
        dilation: depthwise dilation rate.
        activ: activation-module class (instantiated with no arguments).
    """

    def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU):
        super(SeperableConv2DBNActiv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(
                nin, nin,
                kernel_size=ksize,
                stride=stride,
                padding=pad,
                dilation=dilation,
                groups=nin,  # depthwise: one filter per input channel
                bias=False),
            nn.Conv2d(
                nin, nout,
                kernel_size=1,  # pointwise channel mixing
                bias=False),
            nn.BatchNorm2d(nout),
            activ()
        )

    def forward(self, x):
        # forward() instead of __call__ so nn.Module hooks keep working.
        return self.conv(x)
49
+
50
+
51
class Encoder(nn.Module):
    """Two-conv encoder stage returning (downsampled features, skip).

    conv1 keeps the spatial size (stride 1) and its output is used as the
    skip connection; conv2 applies the requested stride for downsampling.
    """

    def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU):
        super(Encoder, self).__init__()
        self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ)
        self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ)

    def forward(self, x):
        # forward() instead of __call__ so nn.Module hooks keep working.
        skip = self.conv1(x)
        h = self.conv2(skip)

        return h, skip
63
+
64
+
65
class Decoder(nn.Module):
    """Decoder stage: 2x bilinear upsample, optional skip concat, conv.

    When a skip tensor is provided it is center-cropped to match the
    upsampled size before channel-wise concatenation. Optional 2D dropout
    is applied after the convolution.
    """

    def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False):
        super(Decoder, self).__init__()
        self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ)
        self.dropout = nn.Dropout2d(0.1) if dropout else None

    def forward(self, x, skip=None):
        # forward() instead of __call__ so nn.Module hooks keep working.
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
        if skip is not None:
            # Crop the skip so spatial sizes match after upsampling.
            skip = spec_utils.crop_center(skip, x)
            x = torch.cat([x, skip], dim=1)
        h = self.conv(x)

        if self.dropout is not None:
            h = self.dropout(h)

        return h
83
+
84
+
85
class ASPPModule(nn.Module):
    """Atrous Spatial Pyramid Pooling head with a variable branch count.

    Branches: a globally frequency-pooled 1x1 branch, a plain 1x1 branch,
    and three dilated separable 3x3 branches; certain architectures
    (identified by ``nn_architecture`` magic numbers) add one or two extra
    dilated branches. All branch outputs are concatenated and fused by a
    1x1 bottleneck with dropout.

    Args:
        nn_architecture: integer architecture id used to pick 5/6/7 branches.
        nin: per-branch channel count (input and branch output).
        nout: bottleneck output channel count.
        dilations: dilation rates for the three standard dilated branches.
        activ: activation-module class for all sub-blocks.
    """

    def __init__(self, nn_architecture, nin, nout, dilations=(4, 8, 16), activ=nn.ReLU):
        super(ASPPModule, self).__init__()
        # Global-context branch: pool the frequency axis to 1, keep time.
        self.conv1 = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, None)),
            Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ)
        )

        self.nn_architecture = nn_architecture
        # Magic architecture ids — presumably model-weight signatures used
        # to select the branch count; TODO confirm against model loader.
        self.six_layer = [129605]
        self.seven_layer = [537238, 537227, 33966]

        extra_conv = SeperableConv2DBNActiv(
            nin, nin, 3, 1, dilations[2], dilations[2], activ=activ)

        self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ)
        self.conv3 = SeperableConv2DBNActiv(
            nin, nin, 3, 1, dilations[0], dilations[0], activ=activ)
        self.conv4 = SeperableConv2DBNActiv(
            nin, nin, 3, 1, dilations[1], dilations[1], activ=activ)
        self.conv5 = SeperableConv2DBNActiv(
            nin, nin, 3, 1, dilations[2], dilations[2], activ=activ)

        if self.nn_architecture in self.six_layer:
            self.conv6 = extra_conv
            nin_x = 6
        elif self.nn_architecture in self.seven_layer:
            # NOTE(review): conv6 and conv7 are the *same* module object
            # (extra_conv), so they share weights and a checkpoint's conv7
            # tensors are what end up loaded last — confirm this matches
            # the pretrained models before "fixing" it.
            self.conv6 = extra_conv
            self.conv7 = extra_conv
            nin_x = 7
        else:
            nin_x = 5

        # Fuse the nin_x concatenated branches down to nout channels.
        self.bottleneck = nn.Sequential(
            Conv2DBNActiv(nin * nin_x, nout, 1, 1, 0, activ=activ),
            nn.Dropout2d(0.1)
        )

    def forward(self, x):
        _, _, h, w = x.size()
        # Upsample the pooled branch back to the input's spatial size.
        feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True)
        feat2 = self.conv2(x)
        feat3 = self.conv3(x)
        feat4 = self.conv4(x)
        feat5 = self.conv5(x)

        if self.nn_architecture in self.six_layer:
            feat6 = self.conv6(x)
            out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6), dim=1)
        elif self.nn_architecture in self.seven_layer:
            feat6 = self.conv6(x)
            feat7 = self.conv7(x)
            out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6, feat7), dim=1)
        else:
            out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1)

        bottle = self.bottleneck(out)
        return bottle
uvr5/lib_v5/vr_network/layers_new.py ADDED
@@ -0,0 +1,126 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ from torch import nn
3
+ import torch.nn.functional as F
4
+
5
+ from lib_v5 import spec_utils
6
+
7
class Conv2DBNActiv(nn.Module):
    """Conv2d -> BatchNorm2d -> activation, packaged as one reusable unit.

    Args:
        nin: input channel count.
        nout: output channel count.
        ksize: convolution kernel size.
        stride: convolution stride.
        pad: zero-padding added to both spatial sides.
        dilation: convolution dilation rate.
        activ: activation-module class (instantiated with no arguments).
    """

    def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU):
        super(Conv2DBNActiv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(
                nin, nout,
                kernel_size=ksize,
                stride=stride,
                padding=pad,
                dilation=dilation,
                bias=False),  # bias is redundant right before BatchNorm
            nn.BatchNorm2d(nout),
            activ()
        )

    def forward(self, x):
        # Defining forward() (instead of overriding __call__) keeps
        # nn.Module's hook/dispatch machinery intact; instances remain
        # callable exactly as before.
        return self.conv(x)
25
+
26
class Encoder(nn.Module):
    """Two-conv encoder stage (new layout: stride on the first conv).

    Unlike the older Encoder, this variant downsamples in conv1 and
    returns only the feature map — no skip connection.
    """

    def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU):
        super(Encoder, self).__init__()
        self.conv1 = Conv2DBNActiv(nin, nout, ksize, stride, pad, activ=activ)
        self.conv2 = Conv2DBNActiv(nout, nout, ksize, 1, pad, activ=activ)

    def forward(self, x):
        # forward() instead of __call__ so nn.Module hooks keep working.
        h = self.conv1(x)
        h = self.conv2(h)

        return h
38
+
39
+
40
class Decoder(nn.Module):
    """Decoder stage: 2x bilinear upsample, optional skip concat, conv.

    When a skip tensor is provided it is center-cropped to match the
    upsampled size before channel-wise concatenation. Optional 2D dropout
    is applied after the convolution.
    """

    def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False):
        super(Decoder, self).__init__()
        self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ)
        self.dropout = nn.Dropout2d(0.1) if dropout else None

    def forward(self, x, skip=None):
        # forward() instead of __call__ so nn.Module hooks keep working.
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)

        if skip is not None:
            # Crop the skip so spatial sizes match after upsampling.
            skip = spec_utils.crop_center(skip, x)
            x = torch.cat([x, skip], dim=1)

        h = self.conv1(x)

        if self.dropout is not None:
            h = self.dropout(h)

        return h
62
+
63
+
64
class ASPPModule(nn.Module):
    """Atrous Spatial Pyramid Pooling block (fixed five branches).

    Combines a frequency-pooled global branch, a 1x1 pointwise branch and
    three dilated 3x3 branches, then fuses the five feature maps through a
    1x1 bottleneck. Optional 2D dropout on the fused output.
    """

    def __init__(self, nin, nout, dilations=(4, 8, 12), activ=nn.ReLU, dropout=False):
        super(ASPPModule, self).__init__()
        # Global-context branch: collapse the frequency axis, then 1x1 conv.
        self.conv1 = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, None)),
            Conv2DBNActiv(nin, nout, 1, 1, 0, activ=activ),
        )
        # Pointwise branch.
        self.conv2 = Conv2DBNActiv(nin, nout, 1, 1, 0, activ=activ)
        # Dilated 3x3 branches (padding equals dilation to keep size).
        self.conv3 = Conv2DBNActiv(nin, nout, 3, 1, dilations[0], dilations[0], activ=activ)
        self.conv4 = Conv2DBNActiv(nin, nout, 3, 1, dilations[1], dilations[1], activ=activ)
        self.conv5 = Conv2DBNActiv(nin, nout, 3, 1, dilations[2], dilations[2], activ=activ)
        self.bottleneck = Conv2DBNActiv(nout * 5, nout, 1, 1, 0, activ=activ)
        self.dropout = nn.Dropout2d(0.1) if dropout else None

    def forward(self, x):
        _, _, height, width = x.size()
        # Upsample the pooled branch back to the input's spatial size.
        pooled = F.interpolate(
            self.conv1(x), size=(height, width), mode='bilinear', align_corners=True)
        branches = [pooled, self.conv2(x), self.conv3(x), self.conv4(x), self.conv5(x)]
        fused = self.bottleneck(torch.cat(branches, dim=1))

        if self.dropout is not None:
            fused = self.dropout(fused)

        return fused
99
+
100
+
101
class LSTMModule(nn.Module):
    """Bidirectional-LSTM branch over the time axis of a spectrogram map.

    A 1x1 conv squeezes the input to a single channel, the frequency bins
    become the LSTM feature dimension, and a dense head projects back to
    ``nin_lstm`` bins, yielding a (N, 1, nbins, nframes) map.
    """

    def __init__(self, nin_conv, nin_lstm, nout_lstm):
        super(LSTMModule, self).__init__()
        self.conv = Conv2DBNActiv(nin_conv, 1, 1, 1, 0)
        self.lstm = nn.LSTM(
            input_size=nin_lstm,
            hidden_size=nout_lstm // 2,  # bidirectional doubles this back to nout_lstm
            bidirectional=True
        )
        self.dense = nn.Sequential(
            nn.Linear(nout_lstm, nin_lstm),
            nn.BatchNorm1d(nin_lstm),
            nn.ReLU()
        )

    def forward(self, x):
        batch, _, n_bins, n_frames = x.size()
        # Collapse channels to 1 and reorder to (frames, batch, bins) for the LSTM.
        seq = self.conv(x)[:, 0].permute(2, 0, 1)
        seq, _ = self.lstm(seq)
        # Flatten (frames, batch) so BatchNorm1d sees one sample per row.
        flat = self.dense(seq.reshape(-1, seq.size()[-1]))
        out = flat.reshape(n_frames, batch, 1, n_bins)
        # Back to (batch, 1, bins, frames).
        return out.permute(1, 2, 3, 0)
uvr5/lib_v5/vr_network/model_param_init.py ADDED
@@ -0,0 +1,59 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import pathlib
3
+
4
# Fallback parameters used when no model config file is supplied:
# a two-band 768-bin spectrogram layout at 44.1 kHz.
default_param = {
    'bins': 768,
    'unstable_bins': 9,  # training only
    'reduction_bins': 762,  # training only
    'sr': 44100,
    'pre_filter_start': 757,
    'pre_filter_stop': 768,
    'band': {
        1: {
            'sr': 11025,
            'hl': 128,
            'n_fft': 960,
            'crop_start': 0,
            'crop_stop': 245,
            'lpf_start': 61,  # inference only
            'res_type': 'polyphase',
        },
        2: {
            'sr': 44100,
            'hl': 512,
            'n_fft': 1536,
            'crop_start': 24,
            'crop_stop': 547,
            'hpf_start': 81,  # inference only
            'res_type': 'sinc_best',
        },
    },
}
+
34
+
35
def int_keys(d):
    """Build a dict from (key, value) pairs, converting digit-string keys to int.

    Intended as json's ``object_pairs_hook``, so *d* is a list of pairs;
    later duplicates overwrite earlier ones, as with a plain dict.
    """
    return {int(key) if key.isdigit() else key: value for key, value in d}
42
+
43
+
44
class ModelParameters(object):
    """Spectrogram/band parameter container loaded from a model config.

    Accepts a ``.pth`` archive (a zip containing ``param.json``), a plain
    ``.json`` file, or falls back to ``default_param`` for any other path
    (including the empty default). Digit-string keys (band indices) are
    converted to ints via ``int_keys``, and optional stereo/processing
    flags default to ``False``.
    """

    def __init__(self, config_path=''):
        suffix = pathlib.Path(config_path).suffix
        if suffix == '.pth':
            # Model bundles are zip archives with the params embedded.
            import zipfile

            with zipfile.ZipFile(config_path, 'r') as archive:
                self.param = json.loads(archive.read('param.json'), object_pairs_hook=int_keys)
        elif suffix == '.json':
            with open(config_path, 'r', encoding='utf-8') as f:
                self.param = json.load(f, object_pairs_hook=int_keys)
        else:
            # Deep-copy so the flag defaults added below don't mutate the
            # shared module-level default_param for later instances.
            import copy
            self.param = copy.deepcopy(default_param)

        for k in ['mid_side', 'mid_side_b', 'mid_side_b2', 'stereo_w', 'stereo_n', 'reverse']:
            if k not in self.param:
                self.param[k] = False
uvr5/lib_v5/vr_network/modelparams/1band_sr16000_hl512.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bins": 1024,
3
+ "unstable_bins": 0,
4
+ "reduction_bins": 0,
5
+ "band": {
6
+ "1": {
7
+ "sr": 16000,
8
+ "hl": 512,
9
+ "n_fft": 2048,
10
+ "crop_start": 0,
11
+ "crop_stop": 1024,
12
+ "hpf_start": -1,
13
+ "res_type": "sinc_best"
14
+ }
15
+ },
16
+ "sr": 16000,
17
+ "pre_filter_start": 1023,
18
+ "pre_filter_stop": 1024
19
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr32000_hl512.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bins": 1024,
3
+ "unstable_bins": 0,
4
+ "reduction_bins": 0,
5
+ "band": {
6
+ "1": {
7
+ "sr": 32000,
8
+ "hl": 512,
9
+ "n_fft": 2048,
10
+ "crop_start": 0,
11
+ "crop_stop": 1024,
12
+ "hpf_start": -1,
13
+ "res_type": "kaiser_fast"
14
+ }
15
+ },
16
+ "sr": 32000,
17
+ "pre_filter_start": 1000,
18
+ "pre_filter_stop": 1021
19
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr33075_hl384.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bins": 1024,
3
+ "unstable_bins": 0,
4
+ "reduction_bins": 0,
5
+ "band": {
6
+ "1": {
7
+ "sr": 33075,
8
+ "hl": 384,
9
+ "n_fft": 2048,
10
+ "crop_start": 0,
11
+ "crop_stop": 1024,
12
+ "hpf_start": -1,
13
+ "res_type": "sinc_best"
14
+ }
15
+ },
16
+ "sr": 33075,
17
+ "pre_filter_start": 1000,
18
+ "pre_filter_stop": 1021
19
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl1024.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bins": 1024,
3
+ "unstable_bins": 0,
4
+ "reduction_bins": 0,
5
+ "band": {
6
+ "1": {
7
+ "sr": 44100,
8
+ "hl": 1024,
9
+ "n_fft": 2048,
10
+ "crop_start": 0,
11
+ "crop_stop": 1024,
12
+ "hpf_start": -1,
13
+ "res_type": "sinc_best"
14
+ }
15
+ },
16
+ "sr": 44100,
17
+ "pre_filter_start": 1023,
18
+ "pre_filter_stop": 1024
19
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl256.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bins": 256,
3
+ "unstable_bins": 0,
4
+ "reduction_bins": 0,
5
+ "band": {
6
+ "1": {
7
+ "sr": 44100,
8
+ "hl": 256,
9
+ "n_fft": 512,
10
+ "crop_start": 0,
11
+ "crop_stop": 256,
12
+ "hpf_start": -1,
13
+ "res_type": "sinc_best"
14
+ }
15
+ },
16
+ "sr": 44100,
17
+ "pre_filter_start": 256,
18
+ "pre_filter_stop": 256
19
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl512.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bins": 1024,
3
+ "unstable_bins": 0,
4
+ "reduction_bins": 0,
5
+ "band": {
6
+ "1": {
7
+ "sr": 44100,
8
+ "hl": 512,
9
+ "n_fft": 2048,
10
+ "crop_start": 0,
11
+ "crop_stop": 1024,
12
+ "hpf_start": -1,
13
+ "res_type": "sinc_best"
14
+ }
15
+ },
16
+ "sr": 44100,
17
+ "pre_filter_start": 1023,
18
+ "pre_filter_stop": 1024
19
+ }