Abstract
Klear addresses audio-video joint generation challenges through a unified model architecture, progressive multitask training, and large-scale dense-caption data construction, achieving superior alignment and generalization.
Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and investigate three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. For training, we adopt a progressive multitask regime that moves from random modality masking to joint optimization across tasks under a multistage curriculum, yielding robust representations, strengthening audio-visual-aligned world knowledge, and preventing unimodal collapse. For data, we present the first large-scale audio-video dataset with dense captions, built with a novel automated pipeline that annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
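The abstract does not spell out how Omni-Full Attention is computed. One plausible reading of a "single-tower" design with unified DiT blocks is full self-attention over the concatenation of video, audio, and text tokens, so every modality attends to every other one inside each block. The PyTorch sketch below illustrates that reading only; the class name, dimensions, and layer layout are illustrative assumptions, not Klear's actual implementation.

```python
import torch
import torch.nn as nn

class OmniFullAttentionBlock(nn.Module):
    """Illustrative sketch: one DiT-style block where video, audio, and text
    tokens are concatenated and processed by a single full attention layer.
    Names and sizes are assumptions, not the paper's implementation."""

    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_tokens, text_tokens):
        # Concatenate all modalities into one sequence so attention is "full"
        # across video, audio, and text (no per-modality masking).
        x = torch.cat([video_tokens, audio_tokens, text_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back into per-modality streams for the next block.
        nv, na = video_tokens.shape[1], audio_tokens.shape[1]
        return x[:, :nv], x[:, nv:nv + na], x[:, nv + na:]
```

Under this reading, the tight audio-visual coupling comes from every token sharing one attention pattern, in contrast to dual-tower designs that exchange information only through occasional cross-attention layers.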
Community
Klear: 26B model for joint audio-video generation
- Single-tower DiT with "Omni-Full Attention" across video, audio, and text
- Progressive multi-task training (T2V, T2A, T2AV, I2V all in one model; see the sketch after this comment)
- 81M sample dataset with dense captions
Claims Veo 3-level performance on lip-sync & AV consistency
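Neither the abstract nor this summary details how the progressive multi-task training folds T2V, T2A, T2AV, and I2V into a single model. One plausible reading, consistent with the random-modality-masking description, is that each training sample randomly selects a task and masks the loss on the modalities that serve only as conditions or are absent for that task. The snippet below is a minimal sketch of that idea; the task mix, probabilities, and token counts are assumptions for illustration and are not published in the paper.

```python
import random
import torch

# Hypothetical task mix for progressive multi-task training.
# Probabilities are made up for illustration.
TASKS = {
    "T2AV": 0.4,  # text -> audio + video (joint generation)
    "T2V":  0.2,  # text -> video only (audio branch unsupervised)
    "T2A":  0.2,  # text -> audio only (video branch unsupervised)
    "I2V":  0.2,  # image + text -> video (first frame kept as condition)
}

def sample_task():
    names, probs = zip(*TASKS.items())
    return random.choices(names, weights=probs, k=1)[0]

def build_masks(task, num_video_tokens, num_audio_tokens, tokens_per_frame=256):
    """Return boolean loss masks: True where the diffusion loss is applied.
    Masking one modality's loss turns the joint model into a unimodal
    generator for that step, which is how one set of weights can cover
    T2V, T2A, T2AV, and I2V without separate towers."""
    video_mask = torch.ones(num_video_tokens, dtype=torch.bool)
    audio_mask = torch.ones(num_audio_tokens, dtype=torch.bool)
    if task == "T2V":
        audio_mask[:] = False            # audio stream is not supervised
    elif task == "T2A":
        video_mask[:] = False            # video stream is not supervised
    elif task == "I2V":
        video_mask[:tokens_per_frame] = False  # first-frame tokens are a clean condition
    return video_mask, audio_mask

task = sample_task()
v_mask, a_mask = build_masks(task, num_video_tokens=1024, num_audio_tokens=256)
print(task, int(v_mask.sum()), int(a_mask.sum()))
```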
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation (2025)
- MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning (2026)
- JoVA: Unified Multimodal Learning for Joint Video-Audio Generation (2025)
- LTX-2: Efficient Joint Audio-Visual Foundation Model (2026)
- Omni2Sound: Towards Unified Video-Text-to-Audio Generation (2026)
- JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation (2025)
- DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation (2025)
no weights