GGUF/llama.cpp support
Hi, I'm really excited to try a MOSS-Audio model because Gemma, Qwen Omni, and most other audio text-to-text models do not support word-level timestamps and seem limited.
I'm curious whether the MOSS team would ever integrate support for MOSS-Audio into llama.cpp. Currently, Qwen audio models, Mistral's Voxtral, Gemma, LFM2-Audio, and Ultravox are supported in llama.cpp.
Thanks for your interest in MOSS-Audio — we really appreciate it. At the moment, we do not have official plans to support llama.cpp. That said, if someone from the community would like to help adapt MOSS-Audio to llama.cpp, we would be very grateful.
Thanks. I tried out the 8B models and was disappointed by them. They are not very good at following instructions, even though the output was sometimes useful. The thinking one was marginally better, but I could not get either to output in a consistent format after many attempts, even after providing a complete example. They are not good at parsing individual music notes or changes in pitch, which is what I was trying to do with them. I don't know of any models that can do this well, but I was hoping MOSS-Audio-8B-Instruct/MOSS-Audio 8B Thinking would.
Thanks a lot for the detailed feedback.
Accurate timestamp alignment is already a challenging problem by itself, and combining that with fine-grained pitch or note recognition makes it even harder. Compared with standard speech transcription, this kind of task places much higher demands on temporal precision, acoustic modeling, and output consistency, so your experience is understandable.
We would also love to learn more about your specific use case. For example, are you working with vocals or instruments, monophonic or polyphonic audio, and are you mainly looking for note-level transcription, pitch contour tracking, or structured event outputs with timestamps?
If you are open to it, feel free to send more details, sample inputs, or your expected output format to my email as well, and we’d be happy to discuss further.
Thank you. At first, I was trying to do pitch contour tracking. Later, I found that more traditional prosody-tracking tools could accomplish what I wanted. So, for specific fine-grained timestamps I can use other word-alignment tools. However, those tools do not interpret the pitch values.
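For context, the kind of analysis those traditional prosody tools do can be sketched in a few lines. This is a toy autocorrelation pitch tracker of my own (frame size, hop, and frequency range are arbitrary choices), not any particular tool's implementation:

```python
import math

SR = 16000  # assumed sample rate (Hz)

def estimate_pitch(frame, sr=SR, fmin=80.0, fmax=400.0):
    """Estimate the dominant pitch of one frame by picking the
    autocorrelation lag with the highest correlation."""
    n = len(frame)
    best_lag, best_corr = None, 0.0
    for lag in range(int(sr / fmax), int(sr / fmin) + 1):
        corr = sum(frame[i] * frame[i - lag] for i in range(lag, n))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sr / best_lag if best_lag else None

def pitch_contour(samples, sr=SR, frame=1024, hop=512):
    """Slide a window over the signal and estimate pitch per frame."""
    return [estimate_pitch(samples[i:i + frame], sr)
            for i in range(0, len(samples) - frame, hop)]

# Synthetic check: a 220 Hz sine should yield a flat ~220 Hz contour.
tone = [math.sin(2 * math.pi * 220 * t / SR) for t in range(SR // 2)]
contour = pitch_contour(tone)
```

Real tools use far more robust estimators (e.g. pYIN), but the sliding-window structure is the same.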
So I was trying to prompt MOSS Audio to get the regions of spoken audio where there is a change in intonation. Here is the prompt I used:
Encode the intonation of all the speech in the included audio sample. Use the following format, including inter-sentence ↗︎ and ↘︎ arrows. Go word by word first to decide where the rising and falling arrows should go.
Example:
1. "The conference room is on the ↗︎ third floor, ↘︎ near the window." --> Standard informational tone. The slight rise on "third floor" emphasizes where the conference room is.
2. "↗︎ Why'd you come over ↘︎ here ↗︎ anyway?" --> The "why'd" is marginally drawn out, some vocal fry on "anyway", which has an upward, questioning intonation.
3. "I ↘︎ mean..." --> A brief pause and lowered voice indicative of gathering thoughts.
4. "I don't know ↗︎ why..." --> The word "why" is spoken with an up-down pitch. Her voice trails off.
5. "I don't know." --> Spoken very low.
6. "So, ↘︎ um..." --> Emphasis on "um", trails off.
The models never followed the format. I could never get them to answer in any format I specified, and they came up with a different answer style every time I prompted them.
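For what it's worth, a target format like the one above is regular enough to check mechanically, which at least makes the failures measurable. A minimal sketch (my own regex, not anything MOSS-specific, assuming one numbered entry per line):

```python
import re

# Each line should look like:  N. "…text with optional arrows…" --> commentary
LINE_RE = re.compile(r'^\d+\.\s+"[^"]*"\s+-->\s+\S.*$')

def follows_format(output: str) -> bool:
    """True if every non-empty line matches the numbered-quote-arrow pattern."""
    lines = [l for l in output.splitlines() if l.strip()]
    return bool(lines) and all(LINE_RE.match(l) for l in lines)
```

A checker like this can also drive a retry loop: re-prompt whenever `follows_format` returns False.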
Thanks for the detailed feedback. MOSS-Audio does have the ability to perceive pitch- and intonation-related information in speech.
However, for a very specific output schema like the arrow-based format you described, prompting alone may not be enough to make the model follow it consistently. In this case, you may need to adjust the format on your side, use another way to structure the output, or fine-tune the model on data with a similar target format.
Alternatively, you could try the thinking model, which has better instruction-following performance.
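One way to "structure the output on your side", as suggested above, is to accept whatever arrow-annotated text the model produces and normalize it into explicit events. A rough sketch, under the assumption that an arrow directly precedes the word it marks (the helper name and schema here are my own invention):

```python
RISE, FALL = "\u2197", "\u2198"  # ↗ and ↘ base code points

def parse_intonation(annotated: str):
    """Turn arrow-annotated text into word list + structured tone events."""
    events, words = [], []
    pending = None
    for token in annotated.split():
        stripped = token.strip("\ufe0e\ufe0f")  # drop variation selectors
        if stripped == RISE:
            pending = "rise"
        elif stripped == FALL:
            pending = "fall"
        else:
            if pending:
                events.append({"word": token, "index": len(words), "tone": pending})
                pending = None
            words.append(token)
    return {"words": words, "events": events}

parsed = parse_intonation("Why'd you come over ↘ here ↗︎ anyway?")
```

Post-processing like this is more forgiving than demanding an exact schema from the model, since it only needs the arrows to appear somewhere near the right words.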