I think you misunderstood.
Text input comes either from the user or from the reasoning/inference LLM.
That text is fed into the TTS LLM, which converts it to speech tokens (these carry the expected intonation, how loud or quiet the voice should be, etc., based on how the model was trained). (Think voice cloning.)
The neural codec then converts the tokens to sound.
The chain from input to sound needs no reasoning beyond knowing how to produce speech patterns from its training.
Remember, this is TTS, not STT.
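
To make that concrete, here's a rough sketch of just the TTS stage. The names here (tts_llm, neural_codec, voice_prompt) are hypothetical placeholders, not any real library's API:

```python
# Hypothetical placeholders, not a real library API.

def tts_llm(text: str, voice_prompt: bytes | None = None) -> list[int]:
    """TTS LLM: map text to discrete speech tokens. Intonation, loudness,
    pacing, etc. come from training; an optional reference clip steers
    the voice (the voice-cloning part)."""
    return [ord(c) % 256 for c in text]  # dummy tokens for illustration

def neural_codec(speech_tokens: list[int]) -> bytes:
    """Neural codec: decode speech tokens into raw audio samples."""
    return bytes(speech_tokens)  # dummy waveform for illustration

# The whole chain is learned pattern generation, not reasoning:
audio = neural_codec(tts_llm("Hello there."))
```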
In a full workflow (sketched in code below):
User input
| (text input)
Inference LLM
| (LLM response)
Speech LLM
| (response as speech tokens)
Neural codec
| (sound waves for playback)
Sound player
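
And the same workflow as a runnable skeleton. Again, every name (inference_llm, speech_llm, play_audio) is a hypothetical stand-in for whatever models and audio backend you'd actually wire in:

```python
# Hypothetical end-to-end chain mirroring the diagram above;
# each function is a placeholder, not a real library API.

def inference_llm(user_input: str) -> str:
    """Reasoning/inference LLM: produces the text response."""
    return f"(model response to: {user_input})"  # dummy response

def speech_llm(text: str) -> list[int]:
    """TTS LLM: converts the response into discrete speech tokens."""
    return [ord(c) % 256 for c in text]  # dummy tokens

def neural_codec(tokens: list[int]) -> bytes:
    """Neural codec: decodes speech tokens into sound waves (raw samples)."""
    return bytes(tokens)  # dummy audio

def play_audio(audio: bytes) -> None:
    """Sound player: hands the waveform to the audio backend."""
    print(f"playing {len(audio)} bytes of audio")

def pipeline(user_input: str) -> None:
    text = inference_llm(user_input)   # user input -> LLM response
    tokens = speech_llm(text)          # response   -> speech tokens
    audio = neural_codec(tokens)       # tokens     -> sound waves
    play_audio(audio)                  # waveform   -> playback

pipeline("What's the weather like?")
```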