xiangan committed on
Commit
3b2a6ee
·
verified ·
1 Parent(s): 97925c5

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +18 -18
README.md CHANGED
@@ -2,30 +2,12 @@
  license: apache-2.0
  ---
 
- ### Model Card
-
- | Property | Value |
- | ----------------------------- | --------------------------------- |
- | **Model Type** | Vision Transformer (ViT) |
- | **Architecture** | HEVC-Style Vision Transformer |
- | **Hidden Size** | 1024 |
- | **Intermediate Size** | 4096 |
- | **Number of Layers** | 24 |
- | **Number of Attention Heads** | 16 |
- | **Patch Size** | 14 |
- | **Image Resolution** | 448×448 (pre-trained) |
- | **Video Resolution** | 224×224 with 256 tokens per frame |
- | **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
- | **Normalization** | Layer Normalization |
- | **Activation Function** | GELU |
- | **License** | Apache 2.0 |
 
  ### Key Features
 
  - **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
  - **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
 
- ### Intended Use
 
  #### Downstream Tasks
@@ -136,3 +118,21 @@ Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K sample
  <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
  </picture>
  </p>
+
+ ### Model Card
+
+ | Property | Value |
+ | ----------------------------- | --------------------------------- |
+ | **Model Type** | Vision Transformer (ViT) |
+ | **Architecture** | HEVC-Style Vision Transformer |
+ | **Hidden Size** | 1024 |
+ | **Intermediate Size** | 4096 |
+ | **Number of Layers** | 24 |
+ | **Number of Attention Heads** | 16 |
+ | **Patch Size** | 14 |
+ | **Image Resolution** | 448×448 (pre-trained) |
+ | **Video Resolution** | 224×224 with 256 tokens per frame |
+ | **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
+ | **Normalization** | Layer Normalization |
+ | **Activation Function** | GELU |
+ | **License** | Apache 2.0 |
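
The "dense frames, sparse patches" idea from the Key Features section can be sketched as follows. This is a minimal illustration with hypothetical names: the per-patch saliency scores and the per-frame budget `k` are assumptions, not the encoder's actual selection criterion.

```python
# Sketch of codec-style patch selection: keep the top-k most "important"
# patches from every frame, rather than all patches from a few frames.
# The importance scores here are placeholders for whatever measure the
# encoder actually uses.

def select_patches(frames, k):
    """frames: list of per-frame patch scores; returns (frame_idx, patch_idx) pairs."""
    selected = []
    for f, scores in enumerate(frames):
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        selected.extend((f, i) for i in sorted(top))
    return selected

# 3 frames x 4 patches, keeping the 2 highest-scoring patches per frame.
frames = [[0.1, 0.9, 0.3, 0.8], [0.5, 0.2, 0.7, 0.1], [0.4, 0.6, 0.9, 0.2]]
print(select_patches(frames, 2))
# [(0, 1), (0, 3), (1, 0), (1, 2), (2, 1), (2, 2)]
```

Every frame contributes some tokens, so temporal coverage stays dense while the token budget stays fixed.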
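
The numbers in the added Model Card table are internally consistent, which a quick check confirms. This sketch assumes `head_dim = hidden_size / num_heads` and that the 4:6:6 T:H:W ratio partitions the per-head rotary frequency pairs; the exact channel layout of the model's 3D RoPE is not stated in the diff.

```python
# Arithmetic implied by the model card values above.
hidden_size, num_heads = 1024, 16
patch, video_res = 14, 224

head_dim = hidden_size // num_heads            # 1024 // 16 = 64
tokens_per_frame = (video_res // patch) ** 2   # 16 * 16 = 256, matching the table

# Assumed interpretation: the 4:6:6 split divides the head_dim // 2 rotary
# pairs, i.e. 32 pairs -> 8 temporal + 12 height + 12 width.
pairs = head_dim // 2
unit = pairs // (4 + 6 + 6)
t, h, w = 4 * unit, 6 * unit, 6 * unit
print(tokens_per_frame, (t, h, w))  # 256 (8, 12, 12)
```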