Instructions to use VisionXLab/FIRM-Gen-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use VisionXLab/FIRM-Gen-8B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="VisionXLab/FIRM-Gen-8B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("VisionXLab/FIRM-Gen-8B")
model = AutoModelForImageTextToText.from_pretrained("VisionXLab/FIRM-Gen-8B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt tokens
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use VisionXLab/FIRM-Gen-8B with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "VisionXLab/FIRM-Gen-8B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "VisionXLab/FIRM-Gen-8B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
Use Docker
docker model run hf.co/VisionXLab/FIRM-Gen-8B
- SGLang
How to use VisionXLab/FIRM-Gen-8B with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "VisionXLab/FIRM-Gen-8B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "VisionXLab/FIRM-Gen-8B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
Use Docker images
```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "VisionXLab/FIRM-Gen-8B" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "VisionXLab/FIRM-Gen-8B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
- Docker Model Runner
How to use VisionXLab/FIRM-Gen-8B with Docker Model Runner:
docker model run hf.co/VisionXLab/FIRM-Gen-8B
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- llama-factory
- reward-model
- image-generation
- reinforcement-learning
- generated_from_trainer
model-index:
- name: FIRM-Gen-8B (gen_reward_sft)
  results: []
FIRM-Gen-8B (gen_reward_sft)
This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct and serves as a robust reward model (critic) for text-to-image generation. It was introduced as part of the FIRM (Faithful Image Reward Modeling) framework in the paper "Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation".
- Paper: Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
- Project Page: firm-reward.github.io
- Repository: VisionXLab/FIRM-Reward
Model Description
FIRM-Gen-8B is trained on the FIRM-Gen-293K dataset to provide accurate and reliable guidance for faithful image generation. It addresses two common issues with Multimodal Large Language Models (MLLMs) used as critics, reward hacking and hallucination, by using a "plan-then-score" pipeline to evaluate how well a generated image follows complex instructions.
Within a Reinforcement Learning (RL) pipeline, this model acts as the critic, assigning scores that guide the optimization of generative models (like Stable Diffusion 3.5 or FLUX) toward better instruction adherence and visual fidelity.
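To make the critic's role concrete, the following is a minimal sketch of how per-image scores could be turned into a training signal. It assumes a group-relative normalization (as used in GRPO-style methods); the model card does not state which RL algorithm consumes the scores, so the function name `group_advantages` and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(scores, eps=1e-6):
    """Normalize critic scores within one group of images sampled for the same prompt.

    Assumption: group-relative normalization; the actual RL recipe may differ.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    # Images scored above the group mean get a positive advantage, others negative.
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical critic scores for four images generated from one prompt.
scores = [2.0, 4.0, 4.0, 6.0]
advantages = group_advantages(scores)
print(advantages)
```

Normalizing within a group means only the relative ranking of images for the same prompt matters, which reduces sensitivity to the critic's absolute score scale.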
Intended Uses & Limitations
This model is intended to be used as a reward signal in RL pipelines or as an evaluation metric for text-to-image alignment. It is compatible with the transformers library and can be deployed using the reward server scripts found in the official repository.
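As a rough illustration of the evaluation-metric use case, the sketch below builds a chat message in the image-plus-text format accepted by the transformers chat template and extracts a numeric score from the generated text. The instruction wording, the helper names `build_reward_messages` and `parse_score`, and the assumption that the response ends with a numeric score are all hypothetical; the reward server scripts in the official repository define the actual prompt and output format.

```python
import re


def build_reward_messages(image_url, prompt):
    """Build a single-turn chat message asking the critic to score an image.

    Assumption: the instruction text below is illustrative, not the official prompt.
    """
    instruction = (
        "Evaluate how faithfully the image follows this prompt, "
        f"then give a final score.\nPrompt: {prompt}"
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": instruction},
            ],
        }
    ]


def parse_score(generated_text):
    """Return the last number found in the text as a float, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generated_text)
    return float(numbers[-1]) if numbers else None


messages = build_reward_messages(
    "https://example.com/generated.png",  # placeholder URL
    "a red cube on top of a blue sphere",
)
print(parse_score("Plan: check the cube and the sphere. Score: 4.5"))
```

The resulting `messages` list can be passed to `processor.apply_chat_template` or to a pipeline in the same way as any other image-text-to-text input.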
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 5
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 80
- total_eval_batch_size: 16
- optimizer: AdamW
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0
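The batch-size and scheduler settings above relate to one another as sketched below. The total train batch size follows directly from the listed values; the learning-rate function is a generic linear-warmup-plus-cosine-decay shape, and `total_steps=1000` is a hypothetical value chosen only for illustration. Exact step rounding in the training framework may differ.

```python
import math

# Values taken from the hyperparameter list.
per_device_train_batch = 5
num_devices = 8
gradient_accumulation_steps = 2
total_train_batch = per_device_train_batch * num_devices * gradient_accumulation_steps
print(total_train_batch)  # 5 * 8 * 2 = 80


def lr_at(step, total_steps, peak_lr=1e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, used only to show the shape of the schedule.
total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps))
```

With a warmup ratio of 0.1, the learning rate reaches its peak of 1e-05 after the first tenth of training and decays smoothly afterward.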
Training Results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.6131 | 0.1380 | 500 | 0.6089 |
| 0.5714 | 0.2760 | 1000 | 0.5768 |
| 0.5524 | 0.4140 | 1500 | 0.5562 |
| 0.5370 | 0.5520 | 2000 | 0.5407 |
| 0.5282 | 0.6899 | 2500 | 0.5283 |
| 0.5155 | 0.8279 | 3000 | 0.5207 |
| 0.5106 | 0.9659 | 3500 | 0.5181 |
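The step and epoch columns can be used for a rough consistency check against the batch size. The sketch below estimates the number of optimizer steps per epoch from each row and multiplies by the total train batch size of 80. It assumes every step consumed a full batch, so the sample count is only approximate; the exact train/validation split is not stated in this card.

```python
# (step, epoch) pairs copied from the training results table.
rows = [
    (500, 0.1380),
    (1000, 0.2760),
    (1500, 0.4140),
    (2000, 0.5520),
    (2500, 0.6899),
    (3000, 0.8279),
    (3500, 0.9659),
]
total_train_batch = 80  # from the hyperparameter list

# Each row gives its own estimate of steps per epoch; average them.
estimates = [step / epoch for step, epoch in rows]
est_steps_per_epoch = sum(estimates) / len(estimates)

# Approximate number of training samples seen in one epoch.
est_samples = est_steps_per_epoch * total_train_batch

print(round(est_steps_per_epoch), round(est_samples))
```

The estimate comes out at about 3,623 steps per epoch, or roughly 290K samples, which is in line with the scale suggested by the FIRM-Gen-293K dataset name.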
Citation
If you find this model useful, please cite:
@article{zhao2025trust,
title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
journal={arXiv preprint arXiv:2603.12247},
year={2025}
}