> **Note:** This model is not official. It is a proof of concept of the Arlow Vision architecture.
Arlow Vision is the standalone vision-pretraining stage for the Arlow multimodal stack. It trains the visual tower to produce visual tokens that match the Arlow text backbone width and can later be plugged into a full vision-language model.
This model requires a specific Transformers fork, because the architecture code has not yet been merged into official Transformers.

Special Transformers fork: https://github.com/yuchenxie4645/transformers/tree/ArlowVL

```shell
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .
```
## Training Summary

| Item | Value |
| --- | --- |
| Objective | Masked autoencoding over visual patch tokens |
| Modalities | Images, with optional video mixed into training |
| Output width | 3072 |
| Next stage | Multimodal alignment with the Arlow text backbone |
## Model

| Item | Value |
| --- | --- |
| Vision encoder | `ArlowVLVisionModel` |
| Depth | 48 |
| Embedding dimension | 1536 |
| Hidden size | 3072 |
| Attention heads | 24 |
| Patch size | 14 |
| Temporal patch size | 2 |
| Spatial merge size | 2 |
| Activation | `gelu_pytorch_tanh` |
| Deformable attention | Enabled |
| Progressive patches | Enabled |
| DeepStack visual features | Enabled |
| M-ROPE | Enabled |
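As a rough sketch of how the shape parameters above interact, the number of visual tokens per input follows from the patch size, temporal patch size, and spatial merge size. The helper below is illustrative only (the function name and the assumption of a square, patch-aligned input are ours, not part of the released code):

```python
def visual_token_count(height, width, n_frames=1,
                       patch_size=14, temporal_patch_size=2,
                       spatial_merge_size=2):
    """Estimate the number of visual tokens for one input.

    The input is cut into patch_size x patch_size spatial patches,
    frames are grouped in temporal_patch_size chunks, and each
    spatial_merge_size x spatial_merge_size window of patches is
    merged into a single output token.
    """
    h_patches = height // patch_size
    w_patches = width // patch_size
    t_patches = max(n_frames // temporal_patch_size, 1)
    merged = (h_patches // spatial_merge_size) * (w_patches // spatial_merge_size)
    return t_patches * merged

# A 224x224 image: 16 patches per side, merged 2x2 -> 8x8 = 64 tokens.
print(visual_token_count(224, 224))          # -> 64
# An 8-frame 224x224 clip: 4 temporal groups x 64 -> 256 tokens.
print(visual_token_count(224, 224, n_frames=8))  # -> 256
```

Each resulting token is then projected to the 3072-dimensional output width that matches the Arlow text backbone.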
## Data

| Item | Value |
| --- | --- |
| Primary modality | Images |
| Optional modality | Video |
| Default video sampling probability | 0.25 |
| Default image data | ILSVRC/imagenet-1k train split |
| Default video data | ucf101 train split |
| Recommended larger-scale direction | YFCC-style image data and OpenVid-style video data |
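The video sampling probability amounts to a per-example Bernoulli draw between the two modalities. A minimal sketch, with a hypothetical helper name (the actual data-loading code may differ):

```python
import random

def sample_modality(rng, video_prob=0.25):
    """Pick the modality for the next training example."""
    return "video" if rng.random() < video_prob else "image"

rng = random.Random(0)
draws = [sample_modality(rng) for _ in range(10_000)]
print(draws.count("video") / len(draws))  # close to 0.25
```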
## Optimization

| Item | Value |
| --- | --- |
| Hardware target | 8× RTX 8000 (48 GB each) |
| System RAM target | 200 GB |
| Precision | fp16 |
| Attention backend | sdpa |
| Distributed strategy | DeepSpeed ZeRO-2 |
| Epochs | 1 |
| Steps per epoch cap | 2,621,440 |
| Per-device batch size | 2 |
| Gradient accumulation | 16 |
| Effective global batch size on 8 GPUs | 256 |
| Learning rate | 1.5e-4 |
| Weight decay | 0.05 |
| Warmup steps | 40,000 |
| Max grad norm | 1.0 |
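The effective global batch size row follows directly from the other three batch settings:

```python
per_device_batch = 2   # per-device batch size
grad_accum_steps = 16  # gradient accumulation steps
n_gpus = 8             # data-parallel ranks

effective_batch = per_device_batch * grad_accum_steps * n_gpus
print(effective_batch)  # -> 256
```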
## MAE Objective

| Item | Value |
| --- | --- |
| Mask ratio | 0.75 |
| Decoder embedding size | 512 |
| Decoder depth | 8 |
| Decoder heads | 8 |
| Normalized pixel loss | Enabled |
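The masking step of a standard MAE objective with a 0.75 mask ratio can be sketched without the model: shuffle the patch-token indices per sample, keep 25% for the encoder, and reconstruct the rest with the decoder. The helper below is illustrative (the name and index-based interface are ours):

```python
import random

def random_mask(n_tokens, mask_ratio=0.75, seed=None):
    """Return (kept, masked) index lists for one sample.

    A fraction mask_ratio of the patch tokens is hidden; only the
    kept tokens are fed to the encoder, and the lightweight decoder
    reconstructs the pixels of the masked ones.
    """
    rng = random.Random(seed)
    idx = list(range(n_tokens))
    rng.shuffle(idx)
    n_keep = int(n_tokens * (1 - mask_ratio))
    return sorted(idx[:n_keep]), sorted(idx[n_keep:])

kept, masked = random_mask(256, seed=0)
print(len(kept), len(masked))  # -> 64 192
```

With the normalized pixel loss enabled, each masked patch's target pixels are standardized (zero mean, unit variance per patch) before the reconstruction loss is computed.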
## Exported Artifacts

| Item | Value |
| --- | --- |
| Main artifact to keep | `checkpoint-*/vision_encoder/` |
| Matching preprocessing artifacts | `image_processor/`, `video_processor/`, `processor_config.json` |