Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments Paper • 2605.30280 • Published 7 days ago • 133
Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction Paper • 2605.26230 • Published 10 days ago • 41
SpatialBench: Is Your Spatial Foundation Model an All-Round Player? Paper • 2605.27367 • Published 9 days ago • 70
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding Paper • 2605.27365 • Published 9 days ago • 135
CubePart: An Open-Vocabulary Part-Controllable 3D Generator Paper • 2605.28763 • Published 8 days ago • 14
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving Paper • 2605.23163 • Published 10 days ago • 17
GEM: Generative Supervision Helps Embodied Intelligence Paper • 2605.28548 • Published 8 days ago • 40
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models Paper • 2605.21573 • Published 15 days ago • 107
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation Paper • 2604.18168 • Published Apr 20 • 96
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds Paper • 2604.14268 • Published Apr 15 • 122
Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself Paper • 2604.14048 • Published Apr 15 • 16
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation Paper • 2604.10030 • Published Apr 11 • 15