**Note:** This model is not an official release; it is a proof of concept of the Arlow Vision architecture.

Arlow Vision is the standalone vision-pretraining stage for the Arlow multimodal stack. It trains the visual tower to produce visual tokens that match the Arlow text backbone width and can later be plugged into a full vision-language model.

This model requires a specific Transformers fork because the architecture code has not yet been merged into official Transformers.

Special Transformers fork: https://github.com/yuchenxie4645/transformers/tree/ArlowVL

```bash
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .
```

Training Summary

| Item | Value |
| --- | --- |
| Objective | Masked autoencoding over visual patch tokens |
| Modalities | Images, with optional video mixed into training |
| Output width | 3072 |
| Next stage | Multimodal alignment with the Arlow text backbone |

Model

| Item | Value |
| --- | --- |
| Vision encoder | ArlowVLVisionModel |
| Depth | 48 |
| Embedding dimension | 1536 |
| Hidden size | 3072 |
| Attention heads | 24 |
| Patch size | 14 |
| Temporal patch size | 2 |
| Spatial merge size | 2 |
| Activation | gelu_pytorch_tanh |
| Deformable attention | Enabled |
| Progressive patches | Enabled |
| DeepStack visual features | Enabled |
| M-ROPE | Enabled |
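The patch settings above determine how many visual tokens an input yields. A rough sketch of that arithmetic (a plain helper written for illustration, not part of the released code; it assumes inputs are pre-resized to multiples of the patch size and that each spatial_merge_size × spatial_merge_size block of patches merges into one token):

```python
def visual_token_count(height, width, num_frames=1,
                       patch_size=14, temporal_patch_size=2,
                       spatial_merge_size=2):
    """Estimate the visual token count for an image or clip under the config above."""
    # Spatial grid of 14x14 patches.
    grid_h = height // patch_size
    grid_w = width // patch_size
    # Frames are grouped temporal_patch_size at a time; a single image is one group.
    grid_t = max(1, num_frames // temporal_patch_size)
    # Each 2x2 block of patches merges into a single token.
    return grid_t * (grid_h // spatial_merge_size) * (grid_w // spatial_merge_size)

print(visual_token_count(224, 224))                # 16x16 patches -> 8x8 merged grid -> 64
print(visual_token_count(224, 224, num_frames=8))  # 4 temporal groups -> 256
```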

Data

| Item | Value |
| --- | --- |
| Primary modality | Images |
| Optional modality | Video |
| Default video sampling probability | 0.25 |
| Default image data | ILSVRC/imagenet-1k train split |
| Default video data | ucf101 train split |
| Recommended larger-scale direction | YFCC-style image data and OpenVid-style video data |
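With a video sampling probability of 0.25, each training example is drawn from the video stream a quarter of the time and from the image stream otherwise. A minimal sketch of that mixing logic (the iterators are toy placeholders standing in for the actual imagenet-1k and ucf101 loaders):

```python
import random

def mixed_samples(image_iter, video_iter, video_prob=0.25, seed=0):
    """Yield samples, drawing from the video stream with probability video_prob."""
    rng = random.Random(seed)
    while True:
        source = video_iter if rng.random() < video_prob else image_iter
        yield next(source)

# Infinite toy streams in place of the real dataset iterators.
images = iter(lambda: "image", None)
videos = iter(lambda: "video", None)

sampler = mixed_samples(images, videos)
draws = [next(sampler) for _ in range(10000)]
print(draws.count("video") / len(draws))  # close to 0.25
```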

Optimization

| Item | Value |
| --- | --- |
| Hardware target | 8x RTX 8000 with 48 GB each |
| System RAM target | 200 GB |
| Precision | fp16 |
| Attention backend | sdpa |
| Distributed strategy | DeepSpeed ZeRO-2 |
| Epochs | 1 |
| Steps per epoch cap | 2621440 |
| Per-device batch size | 2 |
| Gradient accumulation | 16 |
| Effective global batch size on 8 GPUs | 256 |
| Learning rate | 1.5e-4 |
| Weight decay | 0.05 |
| Warmup steps | 40000 |
| Max grad norm | 1.0 |
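The effective global batch size follows directly from the per-device settings: per-device batch × gradient-accumulation steps × GPU count. A quick check of the numbers above:

```python
per_device_batch = 2
grad_accum_steps = 16
num_gpus = 8

# Samples contributing to each optimizer step across all ranks.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 256
```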

MAE Objective

| Item | Value |
| --- | --- |
| Mask ratio | 0.75 |
| Decoder embedding size | 512 |
| Decoder depth | 8 |
| Decoder heads | 8 |
| Normalized pixel loss | Enabled |
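Under the 0.75 mask ratio, the encoder sees a random 25% of the patch tokens and the lightweight decoder reconstructs the remaining 75%. A minimal sketch of that masking step (pure Python for illustration, not the training code):

```python
import random

def random_mask(num_tokens, mask_ratio=0.75, seed=0):
    """Return (kept, masked) token index lists for one sample."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_keep = int(num_tokens * (1 - mask_ratio))
    kept = sorted(order[:num_keep])    # visible tokens fed to the encoder
    masked = sorted(order[num_keep:])  # tokens the decoder must reconstruct
    return kept, masked

kept, masked = random_mask(64)  # e.g. the merged token grid of a 224x224 image
print(len(kept), len(masked))   # 16 48
```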

Exported Artifacts

| Item | Value |
| --- | --- |
| Main artifact to keep | checkpoint-*/vision_encoder/ |
| Matching preprocessing artifacts | image_processor/, video_processor/, processor_config.json |
Model size: 2B parameters (F16, safetensors).
Hub model id: yuchenxie/Arlow-Vision-Encoder