lmms-lab-encoder/onevision-encoder-large

xiangan committed 29 days ago

Commit 3b2a6ee (verified) · 1 Parent(s): 97925c5

Upload folder using huggingface_hub

Files changed (1)

README.md +18 -18

@@ -2,30 +2,12 @@
 license: apache-2.0
 ---
-### Model Card
-| Property                      | Value                             |
-| ----------------------------- | --------------------------------- |
-| **Model Type**                | Vision Transformer (ViT)          |
-| **Architecture**              | HEVC-Style Vision Transformer     |
-| **Hidden Size**               | 1024                              |
-| **Intermediate Size**         | 4096                              |
-| **Number of Layers**          | 24                                |
-| **Number of Attention Heads** | 16                                |
-| **Patch Size**                | 14                                |
-| **Image Resolution**          | 448×448 (pre-trained)             |
-| **Video Resolution**          | 224×224 with 256 tokens per frame |
-| **Positional Encoding**       | 3D RoPE (4:6:6 split for T:H:W)   |
-| **Normalization**             | Layer Normalization               |
-| **Activation Function**       | GELU                              |
-| **License**                   | Apache 2.0                        |
 ### Key Features
 - **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
 - **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
-### Intended Use
 #### Downstream Tasks
@@ -136,3 +118,21 @@ Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K sample
     <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
   </picture>
 </p>

+### Model Card
+| Property                      | Value                             |
+| ----------------------------- | --------------------------------- |
+| **Model Type**                | Vision Transformer (ViT)          |
+| **Architecture**              | HEVC-Style Vision Transformer     |
+| **Hidden Size**               | 1024                              |
+| **Intermediate Size**         | 4096                              |
+| **Number of Layers**          | 24                                |
+| **Number of Attention Heads** | 16                                |
+| **Patch Size**                | 14                                |
+| **Image Resolution**          | 448×448 (pre-trained)             |
+| **Video Resolution**          | 224×224 with 256 tokens per frame |
+| **Positional Encoding**       | 3D RoPE (4:6:6 split for T:H:W)   |
+| **Normalization**             | Layer Normalization               |
+| **Activation Function**       | GELU                              |
+| **License**                   | Apache 2.0                        |
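
Note on the relocated table: the figures in the Model Card are mutually consistent, which a short sanity check can confirm. The sketch below is not part of the commit or the model's code. It assumes non-overlapping square patches, and it assumes the 4:6:6 RoPE ratio is applied to the per-head dimension (hidden size divided by number of heads); the helper names `tokens_per_frame` and `split_by_ratio` are hypothetical.

```python
from typing import Tuple

# Values copied from the Model Card table in the diff above.
HIDDEN_SIZE = 1024
NUM_HEADS = 16
PATCH_SIZE = 14


def tokens_per_frame(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Count non-overlapping square patches in a square frame."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side * side


def split_by_ratio(total: int, ratio: Tuple[int, ...]) -> Tuple[int, ...]:
    """Split `total` proportionally to `ratio`, requiring exact division."""
    unit, remainder = divmod(total, sum(ratio))
    if remainder != 0:
        raise ValueError("total is not divisible by the sum of the ratio")
    return tuple(part * unit for part in ratio)


# Per-head dimension: 1024 / 16.
head_dim = HIDDEN_SIZE // NUM_HEADS

# 224 / 14 = 16 patches per side, matching the table's tokens-per-frame entry.
video_tokens = tokens_per_frame(224)

# Same arithmetic at the 448x448 image resolution.
image_tokens = tokens_per_frame(448)

# ASSUMPTION: the T:H:W ratio divides the per-head dimension.
rope_dims = split_by_ratio(head_dim, (4, 6, 6))

print(head_dim, video_tokens, image_tokens, rope_dims)
```

Under these assumptions the 224×224 video setting yields the 256 tokens per frame stated in the table, and the 4:6:6 ratio divides the per-head dimension without remainder.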