
TexTeller ONNX

ONNX export of OleehyO/TexTeller, an image-to-LaTeX model based on VisionEncoderDecoderModel.

This export is tuned for:

  • transformers.js (browser / Node)
  • WebGPU / ONNX Runtime Web
  • KV-cache decoding (supports decoder_with_past_model.onnx with dynamic batch)

Contents

Main files:

  • encoder_model.onnx

    • Input: pixel_values of shape [batch_size, 1, 448, 448]
  • decoder_model.onnx

    • Decoder without past key values; not needed if you decode with decoder_with_past_model.onnx
  • decoder_with_past_model.onnx

    • Decoder with KV cache
    • Key inputs (see the runtime sketch after this list):
      • input_ids: [batch_size, decoder_sequence_length]
      • encoder_hidden_states: [batch_size, encoder_sequence_length, 768]
      • past_key_values.N.decoder.{key,value}: [batch_size, 16, past_decoder_sequence_length, 64]
      • past_key_values.N.encoder.{key,value}: [batch_size, 16, encoder_sequence_length, 64]
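
To make these shapes concrete, here is a minimal sketch of feeding decoder_with_past_model.onnx directly with onnxruntime-web. numLayers and encLen are hypothetical placeholders (derive the real values from session.inputNames and the actual encoder output), and the zero-filled buffers only illustrate shapes, not a working decode:

import * as ort from "onnxruntime-web";

const batch = 1, numHeads = 16, headDim = 64; // from the shapes above
const numLayers = 12; // ASSUMPTION — count the past_key_values.N.* entries in session.inputNames
const encLen = 785;   // ASSUMPTION — encoder_sequence_length produced by encoder_model.onnx

const session = await ort.InferenceSession.create("decoder_with_past_model.onnx");

// In a real decode loop, encoder_hidden_states comes from encoder_model.onnx
// and the past tensors come from the previous step's present.* outputs.
const pastLen = 4; // number of tokens already decoded
const feeds = {
  input_ids: new ort.Tensor("int64", BigInt64Array.from([1n]), [batch, 1]),
  encoder_hidden_states: new ort.Tensor(
    "float32", new Float32Array(batch * encLen * 768), [batch, encLen, 768]),
};
for (let n = 0; n < numLayers; n++) {
  for (const kv of ["key", "value"]) {
    feeds[`past_key_values.${n}.decoder.${kv}`] = new ort.Tensor(
      "float32", new Float32Array(batch * numHeads * pastLen * headDim),
      [batch, numHeads, pastLen, headDim]);
    feeds[`past_key_values.${n}.encoder.${kv}`] = new ort.Tensor(
      "float32", new Float32Array(batch * numHeads * encLen * headDim),
      [batch, numHeads, encLen, headDim]);
  }
}

const out = await session.run(feeds);
// out.logits: [batch, 1, vocab_size]; feed out["present.N.decoder.*"] back
// as the next step's past_key_values.N.decoder.* inputs.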

Supporting files:

  • config.json – model config
  • tokenizer.json / tokenizer_config.json – tokenizer
  • preprocessor_config.json – image preprocessing config

Preprocessing

preprocessor_config.json:

{
  "do_resize": true,
  "size": { "height": 448, "width": 448 },
  "resample": 3,
  "do_normalize": true,
  "image_mean": [0.9545467],
  "image_std": [0.15394445],
  "do_convert_rgb": false,
  "num_channels": 1,
  "feature_extractor_type": "ViTFeatureExtractor"
}

Important:

  • Input must be grayscale (num_channels = 1)
  • Resize to 448 × 448 (resample 3 = PIL bicubic)
  • Normalize (per channel):

    x = (x / 255.0 - 0.9545467) / 0.15394445

If you load this repo with AutoProcessor / ImageProcessor in transformers or transformers.js, these settings are applied automatically.
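
If you preprocess by hand (e.g. for a custom onnxruntime-web pipeline), a browser-side sketch follows. Two assumptions: canvas drawImage resizing (not the bicubic resample from the config) and a standard luminance grayscale conversion, so the result is close to, but not bit-exact with, the reference processor:

function preprocess(img) {
  const size = 448;
  const canvas = document.createElement("canvas");
  canvas.width = canvas.height = size;
  const ctx = canvas.getContext("2d");
  ctx.drawImage(img, 0, 0, size, size);
  const { data } = ctx.getImageData(0, 0, size, size); // RGBA bytes

  const mean = 0.9545467, std = 0.15394445;
  const out = new Float32Array(size * size);
  for (let i = 0; i < size * size; i++) {
    // Luminance grayscale (an assumption; the reference processor's exact
    // conversion may differ slightly), then the normalization from above.
    const gray =
      (0.299 * data[4 * i] + 0.587 * data[4 * i + 1] + 0.114 * data[4 * i + 2]) / 255;
    out[i] = (gray - mean) / std;
  }
  return out; // wrap as a float32 tensor of shape [1, 1, 448, 448]
}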


transformers.js (browser / WebGPU) example

import { pipeline } from "@huggingface/transformers";

// This repo's id
const MODEL_ID = "Ji-Ha/TexTeller3-ONNX-dynamic";

const captioner = await pipeline("image-to-text", MODEL_ID, {
  device: "webgpu", // or "wasm"
  dtype: "fp16",    // good default for WebGPU
});

// Any image source supported by transformers.js: URL, HTMLImageElement, etc.
const outputs = await captioner("path-or-url-to-image.png", {
  max_new_tokens: 128,
});

console.log(outputs[0]?.generated_text);

Notes

  • This ONNX export supports decoder_with_past_model.onnx with dynamic batch, so you can implement your own batched, KV-cached beam search on top of model.forward and past_key_values; a lower-level starting point is sketched below.
  • For simple use cases, using pipeline("image-to-text", ...) as shown above is enough.
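
If you need more control than the pipeline (custom decoding loops, batching, logit processing), you can drive the processor, tokenizer, and model classes directly. A minimal sketch, assuming the standard transformers.js v3 vision-encoder-decoder API (verify class and method names against your installed version):

import {
  AutoProcessor,
  AutoTokenizer,
  VisionEncoderDecoderModel,
  RawImage,
} from "@huggingface/transformers";

const MODEL_ID = "Ji-Ha/TexTeller3-ONNX-dynamic";

const processor = await AutoProcessor.from_pretrained(MODEL_ID);
const tokenizer = await AutoTokenizer.from_pretrained(MODEL_ID);
const model = await VisionEncoderDecoderModel.from_pretrained(MODEL_ID, {
  device: "webgpu",
  dtype: "fp16",
});

// RawImage handles URLs, paths (Node), and blobs.
const image = await RawImage.read("path-or-url-to-image.png");
const { pixel_values } = await processor(image);

// generate() runs the encoder once, then decodes with the KV cache.
const output_ids = await model.generate({ pixel_values, max_new_tokens: 256 });
const [latex] = tokenizer.batch_decode(output_ids, { skip_special_tokens: true });
console.log(latex);

generate() handles the encoder pass and KV-cached decoding internally; for a custom beam search you would instead call model.forward step by step and manage past_key_values yourself.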