VideoNSA: Native Sparse Attention Scales Video Understanding
VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to 128K vision-text tokens using only 3.6% of the full attention budget while maintaining competitive performance.
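For intuition (and assuming the 3.6% figure refers to the fraction of key-value pairs each query attends to), a 128K-token context under that budget corresponds to roughly 0.036 × 128K ≈ 4.6K attended tokens per query rather than all 128K.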
VideoNSA employs a hybrid attention strategy with three complementary branches, following the Native Sparse Attention design:
- Compression: coarse-grained attention over compressed blocks of keys and values for global context.
- Selection: fine-grained attention over the most relevant key-value blocks, chosen per query.
- Sliding window: dense local attention over a window of recent tokens.
Each branch is weighted by learnable per-head gates, enabling adaptive token allocation across different tasks (see the sketch below).
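The gating mechanism can be pictured with a minimal PyTorch sketch. This is an illustrative assumption, not the released implementation: the class name, shapes, and the use of plain dense attention as a stand-in for the compression, selection, and sliding-window branches are placeholders; only the per-head, per-branch learnable sigmoid gating follows the description above.

```python
# Illustrative sketch of gated three-branch attention.
# Branch internals are stubbed with dense attention for brevity; in the real
# method each branch attends over a different (compressed, selected, or local)
# subset of keys and values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedThreeBranchAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # One learnable gate per head and per branch
        # (compression / selection / sliding window).
        self.gate = nn.Linear(dim, 3 * num_heads, bias=False)

    def _branch_attn(self, q, k, v):
        # Placeholder branch: dense scaled-dot-product attention.
        return F.scaled_dot_product_attention(q, k, v)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q, k, v = (
            z.reshape(b, t, self.num_heads, self.head_dim).transpose(1, 2)
            for z in (q, k, v)
        )

        # Three branches; identical stubs standing in for compression,
        # selection, and sliding-window attention.
        branch_outputs = torch.stack(
            [self._branch_attn(q, k, v) for _ in range(3)], dim=0
        )  # (3, b, heads, t, head_dim)

        # Per-token, per-head gates in [0, 1], one weight per branch.
        gates = torch.sigmoid(self.gate(x))             # (b, t, 3 * heads)
        gates = gates.view(b, t, 3, self.num_heads)      # (b, t, 3, heads)
        gates = gates.permute(2, 0, 3, 1).unsqueeze(-1)  # (3, b, heads, t, 1)

        mixed = (gates * branch_outputs).sum(dim=0)      # (b, heads, t, head_dim)
        mixed = mixed.transpose(1, 2).reshape(b, t, d)
        return self.out(mixed)


if __name__ == "__main__":
    attn = GatedThreeBranchAttention(dim=64, num_heads=4)
    x = torch.randn(2, 16, 64)
    print(attn(x).shape)  # torch.Size([2, 16, 64])
```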
For installation, training, and evaluation instructions, please refer to the project repository.
@misc{song2025videonsanativesparseattention,
title={VideoNSA: Native Sparse Attention Scales Video Understanding},
author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
year={2025},
eprint={2510.02295},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.02295},
}
This model is released under the Apache 2.0 License.