Improve model card: Add pipeline tag, library name, links, and usage example

#2
by nielsr (HF Staff)
Files changed (1)
  1. README.md +50 -0
README.md CHANGED
@@ -1,9 +1,59 @@
  ---
  license: mit
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
+
  # V-Thinker: Interactive Thinking with Images

+ <div align="center">
+ <img src="https://github.com/We-Math/V-Thinker/raw/main/assets/logo2.png" alt="V-Thinker Logo" width="500">
+ </div>
+
+ This repository hosts the **V-Thinker** model, a general-purpose multimodal reasoning assistant that enables **Interactive Thinking with Images**, as presented in the paper [V-Thinker: Interactive Thinking with Images](https://huggingface.co/papers/2511.04460).
+
+ **V-Thinker** focuses on integrating image interaction with long-horizon reasoning through an end-to-end reinforcement learning framework, comprising a Data Evolution Flywheel and a Visual Progressive Training Curriculum.
+
+ * **[📄 Paper](https://huggingface.co/papers/2511.04460)**
+ * **[💻 GitHub Repository](https://github.com/We-Math/V-Thinker)**
+
+ Associated Hugging Face Resources:
+ * **[🤗 V-Thinker Models](https://huggingface.co/We-Math/V-Thinker)**
+ * **[📚 V-Interaction-400K Dataset](https://huggingface.co/datasets/We-Math/V-Interaction-400K)**
+ * **[📚 V-Perception-40K Dataset](https://huggingface.co/datasets/We-Math/V-Perception-40K)**
+ * **[📊 VTBench Benchmark](https://huggingface.co/datasets/We-Math/VTBench)**

  ## Abstract

  Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, profoundly shifting from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by narrow visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions — diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
+
+ ## Quick Start
+
+ The authors provide a simple script to run inference on custom cases:
+
+ ```bash
+ cd ./eval/vtbench_IR
+ python inference.py
+ ```
+
+ For more details on installation, training, and further inference examples, please refer to the [official GitHub repository](https://github.com/We-Math/V-Thinker).
+
+ ## Citation
+
+ If you find **V-Thinker** useful for your research or applications, please cite the paper:
+
+ ```bibtex
+ @misc{qiao2025vthinker,
+ title={V-Thinker: Interactive Thinking with Images},
+ author={Runqi Qiao and Qiuna Tan and Minghan Yang and Guanting Dong and Peiqing Yang and Shiqiang Lang and Enhui Wan and Xiaowan Wang and Yida Xu and Lan Yang and Chong Sun and Chen Li and Honggang Zhang},
+ year={2025},
+ eprint={2511.04460},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2511.04460},
+ }
+ ```
+
+ ## License
+
+ This project is released under the [MIT License](https://opensource.org/licenses/MIT).
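Since the updated card declares `library_name: transformers` and `pipeline_tag: image-text-to-text`, a minimal inference sketch with the Transformers `image-text-to-text` pipeline could sit alongside the Quick Start script above. This is only a sketch under the assumption that the `We-Math/V-Thinker` checkpoint loads through the standard pipeline API; the image URL and prompt are placeholders.

```python
from transformers import pipeline

# Sketch only: assumes the We-Math/V-Thinker checkpoint is compatible with the
# standard image-text-to-text pipeline declared in the card's metadata.
pipe = pipeline("image-text-to-text", model="We-Math/V-Thinker")

# Chat-style input: one user turn with an image (placeholder URL) and a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/geometry_problem.png"},  # placeholder image
            {"type": "text", "text": "Solve the problem in the image and explain your reasoning step by step."},
        ],
    }
]

# Generate a response; max_new_tokens bounds the length of the reasoning trace.
outputs = pipe(text=messages, max_new_tokens=512)
print(outputs[0]["generated_text"])
```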