LLaVA-PLLuM-12b-nc-instruct
This model is the first Polish-focused Vision-Language Model (VLM), created by extending the open-source LLaVA architecture with the PLLuM language model. Our pipeline integrates high-quality multimodal instruction tuning with PLLuM’s strong Polish linguistic abilities, resulting in a VLM that demonstrates significantly improved understanding of Polish language, culture, and context-specific visual reasoning.
Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Evaluation
- Environmental Impact
- Technical Specifications
- Citation
- How to Get Started with the Model
Model Details
Model Description
- Developed by: NASK PIB
- Funded by: NASK PIB
- Shared by: NASK PIB
- Model type: Multimodal (Image-Text-to-Text) / Vision Language Model
- Language(s) (NLP): Polish, English
- License: The LLaVA-PLLuM-12b-nc-instruct model is published under the PLLuM-1.0 license.
Model Sources
- Blogpost
Uses
Direct Use
The model is intended for research and development purposes, specifically focusing on multimodal tasks requiring the Polish language and cultural context. It can be used directly for:
- Visual Question Answering (VQA) in Polish: Users can provide an image and ask questions about it in Polish (e.g., "Co znajduje się na zdjęciu?").
- Image Captioning: Generating detailed descriptions of images in grammatically correct Polish.
- Optical Character Recognition (OCR): Extracting and interpreting text visible within images, including Polish documents.
- Object Counting: Performing simple enumeration of objects within a visual scene.
- Multimodal Research: Serving as a baseline or starting point for researchers developing non-English or bilingual Vision-Language Models (VLMs).
Downstream Use
This model can be fine-tuned or integrated into larger applications to support specific use cases, such as:
- Accessibility Tools: Creating applications that describe surroundings or digital content to visually impaired Polish speakers.
- E-commerce: Generating automated product descriptions based on images for Polish marketplaces.
- Educational Assistants: Developing tutoring systems that can explain visual content (diagrams, historical photos) to students in Polish.
- Specialized Fine-tuning: The model can be further fine-tuned on domain-specific datasets (e.g., Polish medical imaging reports or legal document analysis) to improve performance in niche sectors.
Out-of-Scope Use
- Generation of Harmful Content: Utilizing the model to generate hate speech, explicit content, or to facilitate harassment and disinformation.
- High-Stakes Factual Retrieval: Like all Large Language Models, this model can "hallucinate" or produce factually incorrect information. It should not be relied upon as a sole source of truth without human verification.
- English-Primary Tasks: While the model retains some English capabilities, it is optimized for Polish. Users seeking state-of-the-art performance strictly for English tasks should prefer models trained primarily on English data.
Bias, Risks, and Limitations
- Potential Hallucinations: Like other LLMs, PLLuM may occasionally produce factually incorrect or fabricated content.
- Sensitivity & Bias: The current version has not undergone multimodal safety alignment. As a result, users may encounter biased behavior or toxic generations, particularly when the model is prompted with visual inputs.
- Context Length: The model supports an 8,192-token context; tasks approaching this limit may degrade output quality or exceed memory constraints.
Recommendations
Users (both direct and downstream) should be aware of the risks, biases, and limitations of the model. We recommend the following:
- Treat as a Research Proof-of-Concept: This model represents a preliminary step toward robust Polish multimodal AI. It is not a finished commercial product. Users should exercise caution when applying it to real-world scenarios and should not deploy it in production environments without extensive domain-specific testing and guardrails.
- Human Verification Required: Like all Large Multimodal Models (LMMs), this model is prone to "hallucinations"—confidently stating incorrect facts or describing objects that are not present in the image. Always keep a human in the loop to verify outputs, especially for factual queries or quantitative tasks (e.g., counting objects).
- Awareness of Translation Artifacts: A significant portion of the instruction-tuning dataset (e.g., ALLaVA, LLaVA-Instruct) was automatically translated from English to Polish. While we employed filtering metrics (COMET), some linguistic unnaturalness or translation artifacts may persist in the model's responses.
Training Details
Training Data
The model was trained in two stages using a combination of translated open-source datasets and synthetic data, totaling approximately 2 million samples with an 85% Polish / 15% English split.
- Stage 1: Pre-training (Feature Alignment)
- Stage 2: Instruction Tuning (Visual Instruction Tuning)
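The 85% Polish / 15% English mix described above can be illustrated with simple weighted sampling. This is a minimal sketch only: the pool contents, sizes, and the function name are hypothetical, not the actual data pipeline.

```python
import random

def sample_language_mix(pl_pool, en_pool, n_samples, pl_ratio=0.85, seed=0):
    """Draw a training mix with a target Polish/English ratio.

    `pl_pool` / `en_pool` are hypothetical lists of sample IDs; the real
    pipeline mixes full datasets, this only illustrates the 85/15 split.
    """
    rng = random.Random(seed)
    n_pl = round(n_samples * pl_ratio)          # Polish share of the batch
    n_en = n_samples - n_pl                     # remainder is English
    mix = [rng.choice(pl_pool) for _ in range(n_pl)] + \
          [rng.choice(en_pool) for _ in range(n_en)]
    rng.shuffle(mix)
    return mix

mix = sample_language_mix([f"pl_{i}" for i in range(1000)],
                          [f"en_{i}" for i in range(1000)], 200)
print(sum(s.startswith("pl_") for s in mix))  # 170 of 200 samples are Polish
```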
Training Procedure
Preprocessing
To create high-quality Polish multimodal data from English sources, a rigorous translation and filtering pipeline was employed:
- Translation: Source English datasets were translated using the Tower+ 72B model.
- Filtering: The COMET reference-free metric was used to filter out poor-quality translations.
- Manual Review: A portion of the data underwent manual expert filtering to ensure linguistic quality.
- Dynamic Tiling: Following LLaVA-NeXT, images are processed with dynamic tiling to support higher input resolutions.
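To make the dynamic-tiling step concrete, the sketch below mirrors the resolution-selection idea used in LLaVA-NeXT-style tiling: among candidate grids built from fixed-size tiles, pick the one that preserves the most image detail after an aspect-preserving resize, breaking ties by minimising padded (wasted) area. The candidate grid list is illustrative, not the model's actual configuration.

```python
def select_best_resolution(orig_size, candidate_resolutions):
    """Pick the tiling grid that preserves the most image detail."""
    ow, oh = orig_size
    best, best_eff, best_waste = None, -1, float("inf")
    for w, h in candidate_resolutions:
        # Aspect-preserving downscale of the original into the candidate grid.
        scale = min(w / ow, h / oh)
        dw, dh = int(ow * scale), int(oh * scale)
        # Effective resolution is capped at the original image's area.
        effective = min(dw * dh, ow * oh)
        wasted = w * h - effective
        if effective > best_eff or (effective == best_eff and wasted < best_waste):
            best, best_eff, best_waste = (w, h), effective, wasted
    return best

# Illustrative candidate grids built from 384 px tiles (1x2, 2x1, 2x2, ...):
grids = [(384, 768), (768, 384), (768, 768), (384, 1152), (1152, 384)]
print(select_best_resolution((1024, 512), grids))  # (768, 384)
```

A wide 1024x512 input maps to the 2x1 grid, which matches its aspect ratio with no padding.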
Speeds, Sizes, Times
- Training Stages: 2 Stages.
- Epochs: 1 Epoch for both stages.
- Batch Size: 256 (Stage 1), 128 (Stage 2).
- Context Size: 8,192 tokens.
- Trainable Parameters:
- Stage 1: 30M (Projector only).
- Stage 2: 12B (LLM via LoRA) + 400M (Vision Encoder) + 30M (Projector).
- Learning Rates (Stage 2): 2×10⁻⁶ (Vision Encoder), 2×10⁻⁵ (Projector & LLM).
- LoRA Config: Rank 128, Alpha 256, Dropout 0.05.
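For intuition on what the LoRA configuration adds, each adapted weight matrix contributes rank × (d_in + d_out) trainable parameters. The sketch below estimates the adapter size for a 12B-class transformer; the hidden size, layer count, and set of adapted projections are illustrative assumptions, not PLLuM's actual dimensions.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical 12B-class transformer: hidden size 5120, 40 layers,
# LoRA (rank 128) applied to the four attention projections per layer.
hidden, layers, rank = 5120, 40, 128
per_layer = 4 * lora_param_count(hidden, hidden, rank)
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable LoRA parameters")  # 209.7M
```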
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Quantitative: MMBench v1.1 (Development Split). The dataset was translated to Polish using Tower+ 72B and subsequently manually corrected by experts to remove translation artifacts (referred to as MMBench-PL).
- Qualitative (Model-as-a-Judge): XM3600 (Polish subset), a dataset requiring accurate and culturally relevant image descriptions.
Factors
- Language: Performance comparison between Polish (target) and English (source) capabilities.
- Task Type: Object recognition, OCR, commonsense reasoning, fine-grained perception, and cultural context recognition.
Metrics
- Accuracy: Used for MMBench multiple-choice questions.
- Win-rate (LLM-as-a-Judge): Pairwise comparison using LLaVA-OneVision-72B to judge caption quality between PLLuM and baseline models (PaliGemma, Qwen2.5, Pixtral).
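The win-rate metric above reduces to a simple count over the judge's pairwise verdicts. A minimal sketch follows; counting ties as half a win is a common convention, which the card does not explicitly specify, and the verdict list is fabricated for illustration only.

```python
def win_rate(judgments):
    """Pairwise win-rate: wins plus half-credit for ties, over all comparisons."""
    wins = sum(1.0 for j in judgments if j == "win")
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)

# Hypothetical judge verdicts for PLLuM vs. one baseline:
verdicts = ["win"] * 60 + ["tie"] * 10 + ["loss"] * 30
print(f"{win_rate(verdicts):.2%}")  # 65.00%
```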
Results
Summary
The model demonstrates a significant advancement in Polish multimodal capabilities:
- MMBench-PL: Achieves 79.35% accuracy, a +9.55 percentage-point improvement over LLaVA-1.6-Vicuna-13B, while maintaining comparable English performance.
- Captioning Quality: Outperforms PaliGemma-3B (65.28% win-rate), slightly outperforms LLaVA-1.6-Mistral-7B and LLaVA-1.6-Vicuna-13B, and shows competitive, though slightly lower, results compared to Qwen2.5-VL-7B and Pixtral-12B.
- Qualitative Analysis: The model shows superior handling of Polish grammar/morphology and correctly identifies Polish cultural elements (e.g., specific landmarks like the Palace of Culture and Science, regional food like Toruń gingerbread) where generic models often fail.
Societal Impact Assessment
- Cultural Inclusion: This model helps bridge the gap in multimodal AI for the Polish language, allowing for technology that reflects local linguistic and cultural nuances rather than defaulting to US-centric norms.
- Lack of Safety Alignment: Important: As a research proof-of-concept, this model has not undergone specific safety alignment (e.g., RLHF) for the vision-language domain. Consequently, it may be more prone to generating biased, toxic, or inappropriate responses compared to fully commercialized models, especially when prompted with controversial visual content.
- Reliability: Users should be aware of the potential for hallucinations, particularly in OCR or counting tasks, and should not use the model for high-stakes decision-making.
Technical Specifications
Model Architecture and Objective
- Architecture: Based on the LLaVA-NeXT framework.
- Language Model: PLLuM-12B-nc-instruct (Polish-native, instruction-tuned).
- Vision Encoder: SigLIP2 So400m/14, 384px (chosen for its strong multilingual alignment).
- Connector: Two-layer MLP projector.
- Objective: The model uses a standard autoregressive language modeling objective, conditioned on visual inputs processed through the encoder and projector.
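The connector's role can be sketched numerically: patch features from the vision encoder are mapped into the LLM's embedding space by a two-layer MLP with a nonlinearity in between, as in LLaVA-style connectors. The dimensions below (729 patches of 1152-d SigLIP features projected to a 5120-d space) and the random weights are illustrative assumptions, not the model's actual parameters.

```python
import numpy as np

def mlp_projector(patch_feats, w1, b1, w2, b2):
    """Two-layer MLP connector: vision patch features -> LLM embedding space."""
    h = patch_feats @ w1 + b1
    # GELU nonlinearity (tanh approximation) between the two layers.
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

# Illustrative dims: 729 patches x 1152-d features -> 5120-d LLM space.
rng = np.random.default_rng(0)
feats = rng.standard_normal((729, 1152))
w1, b1 = rng.standard_normal((1152, 5120)) * 0.01, np.zeros(5120)
w2, b2 = rng.standard_normal((5120, 5120)) * 0.01, np.zeros(5120)
tokens = mlp_projector(feats, w1, b1, w2, b2)
print(tokens.shape)  # (729, 5120)
```

Each projected patch then occupies one "visual token" position in the LLM's autoregressive input sequence.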
Compute Infrastructure
Hardware
We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computing facilities and support within computational grant no. PLG/2025/018129.
Model Card Contact
For questions or contributions, please reach out via: nlp@nask.pl
How to Get Started with the Model
Inference Example using Transformers
Use the code below to run the model. We recommend using transformers >= 4.56.2.
import torch
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"

# Load the processor and the model (fp16, auto-placed on available devices).
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto",
)

# Download an example image.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Build a Polish VQA prompt; the <image> token marks where the image is inserted.
conversation = [
    {
        "role": "user",
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail."
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
input_len = inputs.input_ids.shape[1]
generated_ids = output[0][input_len:]
print(processor.decode(generated_ids, skip_special_tokens=True))
Inference with vLLM
You can also run the model with vLLM; see the example below. We recommend using vllm >= 0.10.0.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from PIL import Image
import requests

model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialise the vLLM engine (bf16, 8k context, one image per prompt).
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1},
)

# Download an example image.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Build a Polish VQA prompt; the <image> token marks where the image is inserted.
conversation = [
    {
        "role": "user",
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail."
    },
]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)
output = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params,
)
print(output[0].outputs[0].text)
Citation
If you use this model, please cite the following paper:
@inproceedings{statkiewicz2026annotation,
title = {Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework},
author = {Statkiewicz, Grzegorz and
Dobrzeniecka, Alicja and
Seweryn, Karolina and
Krasnod{\k e}bska, Aleksandra and
Piosek, Karolina and
Bogusz, Katarzyna and
Cygert, Sebastian and
Kusa, Wojciech},
booktitle = {Proceedings of the Student Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)},
year = {2026},
publisher = {Association for Computational Linguistics}
}