LLaVA-PLLuM-12b-nc-instruct
This model is the first Polish-focused Vision-Language Model (VLM), created by extending the open-source LLaVA architecture with the PLLuM language model. Our pipeline integrates high-quality multimodal instruction tuning with PLLuM’s strong Polish linguistic abilities, resulting in a VLM that demonstrates significantly improved understanding of Polish language, culture, and context-specific visual reasoning.
Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Evaluation
- Environmental Impact
- Technical Specifications
- Citation
- How to Get Started with the Model
Model Details
Model Description
- Developed by: NASK PIB
- Funded by: NASK PIB
- Shared by: NASK PIB
- Model type: Multimodal (Image-Text-to-Text) / Vision Language Model
- Language(s) (NLP): Polish, English
- License: The LLaVA-PLLuM-12b-nc-instruct model is published under the PLLuM-1.0 license.
Model Sources
- Blogpost
Uses
Direct Use
The model is intended for research and development purposes, specifically focusing on multimodal tasks requiring the Polish language and cultural context. It can be used directly for:
- Visual Question Answering (VQA) in Polish: Users can provide an image and ask questions about it in Polish (e.g., "Co znajduje się na zdjęciu?").
- Image Captioning: Generating detailed descriptions of images in grammatically correct Polish.
- Optical Character Recognition (OCR): Extracting and interpreting text visible within images, including Polish documents.
- Object Counting: Performing simple enumeration of objects within a visual scene.
- Multimodal Research: Serving as a baseline or starting point for researchers developing non-English or bilingual Vision-Language Models (VLMs).
Downstream Use
This model can be fine-tuned or integrated into larger applications to support specific use cases, such as:
- Accessibility Tools: Creating applications that describe surroundings or digital content to visually impaired Polish speakers.
- E-commerce: Generating automated product descriptions based on images for Polish marketplaces.
- Educational Assistants: Developing tutoring systems that can explain visual content (diagrams, historical photos) to students in Polish.
- Specialized Fine-tuning: The model can be further fine-tuned on domain-specific datasets (e.g., Polish medical imaging reports or legal document analysis) to improve performance in niche sectors.
Out-of-Scope Use
- Generation of Harmful Content: Utilizing the model to generate hate speech, explicit content, or to facilitate harassment and disinformation.
- High-Stakes Factual Retrieval: Like all Large Language Models, this model can "hallucinate" or produce factually incorrect information. It should not be relied upon as a sole source of truth without human verification.
- English-Primary Tasks: While the model retains some English capabilities, it is optimized for Polish. Users seeking state-of-the-art performance strictly for English tasks should prefer models trained primarily on English data.
Bias, Risks, and Limitations
- Potential Hallucinations: Like other LLMs, PLLuM may occasionally produce factually incorrect or fabricated content.
- Sensitivity & Bias: The current version has not undergone multimodal safety alignment. As a result, users may encounter biased behavior or toxic generations, particularly when the model is prompted with visual inputs.
- Context Length: The model supports an 8,192-token context; tasks approaching this limit may degrade output quality or exceed memory constraints.
Recommendations
Users (both direct and downstream) should be aware of the risks, biases, and limitations of the model. We recommend the following:
- Treat as a Research Proof-of-Concept: This model represents a preliminary step toward robust Polish multimodal AI. It is not a finished commercial product. Users should exercise caution when applying it to real-world scenarios and should not deploy it in production environments without extensive domain-specific testing and guardrails.
- Human Verification Required: Like all Large Multimodal Models (LMMs), this model is prone to "hallucinations"—confidently stating incorrect facts or describing objects that are not present in the image. Always keep a human in the loop to verify outputs, especially for factual queries or quantitative tasks (e.g., counting objects).
- Awareness of Translation Artifacts: A significant portion of the instruction-tuning dataset (e.g., ALLaVA, LLaVA-Instruct) was automatically translated from English to Polish. While we employed filtering metrics (COMET), some linguistic unnaturalness or translation artifacts may persist in the model's responses.
Training Details
Training Data
The model was trained in two stages using a combination of translated open-source datasets and synthetic data, totaling approximately 2 million samples with an 85% Polish / 15% English split.
- Stage 1: Pre-training (Feature Alignment)
- Stage 2: Instruction Tuning (Visual Instruction Tuning)
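The 85% Polish / 15% English mix described above can be illustrated with simple weighted sampling. This is a minimal sketch only: the pool contents, sizes, and the function name are hypothetical, not the actual data pipeline.

```python
import random

def sample_language_mix(pl_pool, en_pool, n_samples, pl_ratio=0.85, seed=0):
    """Draw a training mix with a target Polish/English ratio.

    `pl_pool` / `en_pool` are hypothetical lists of sample IDs; the real
    pipeline mixes full datasets, this only illustrates the 85/15 split.
    """
    rng = random.Random(seed)
    n_pl = round(n_samples * pl_ratio)          # Polish share of the batch
    n_en = n_samples - n_pl                     # remainder is English
    mix = [rng.choice(pl_pool) for _ in range(n_pl)] + \
          [rng.choice(en_pool) for _ in range(n_en)]
    rng.shuffle(mix)
    return mix

mix = sample_language_mix([f"pl_{i}" for i in range(1000)],
                          [f"en_{i}" for i in range(1000)], 200)
print(sum(s.startswith("pl_") for s in mix))  # 170 of 200 samples are Polish
```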
Training Procedure
Preprocessing
To create high-quality Polish multimodal data from English sources, a rigorous translation and filtering pipeline was employed:
- Translation: Source English datasets were translated using the Tower+ 72B model.
- Filtering: The COMET reference-free metric was used to filter out poor-quality translations.
- Manual Review: A portion of the data underwent manual expert filtering to ensure linguistic quality.
- Dynamic Tiling: Following LLaVA-NeXT, images are processed with dynamic tiling to support higher input resolutions.
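To make the dynamic-tiling step concrete, the sketch below mirrors the resolution-selection idea used in LLaVA-NeXT-style tiling: among candidate grids built from fixed-size tiles, pick the one that preserves the most image detail after an aspect-preserving resize, breaking ties by minimising padded (wasted) area. The candidate grid list is illustrative, not the model's actual configuration.

```python
def select_best_resolution(orig_size, candidate_resolutions):
    """Pick the tiling grid that preserves the most image detail."""
    ow, oh = orig_size
    best, best_eff, best_waste = None, -1, float("inf")
    for w, h in candidate_resolutions:
        # Aspect-preserving downscale of the original into the candidate grid.
        scale = min(w / ow, h / oh)
        dw, dh = int(ow * scale), int(oh * scale)
        # Effective resolution is capped at the original image's area.
        effective = min(dw * dh, ow * oh)
        wasted = w * h - effective
        if effective > best_eff or (effective == best_eff and wasted < best_waste):
            best, best_eff, best_waste = (w, h), effective, wasted
    return best

# Illustrative candidate grids built from 384 px tiles (1x2, 2x1, 2x2, ...):
grids = [(384, 768), (768, 384), (768, 768), (384, 1152), (1152, 384)]
print(select_best_resolution((1024, 512), grids))  # (768, 384)
```

A wide 1024x512 input maps to the 2x1 grid, which matches its aspect ratio with no padding.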
Speeds, Sizes, Times
- Training Stages: 2 Stages.
- Epochs: 1 Epoch for both stages.
- Batch Size: 256 (Stage 1), 128 (Stage 2).
- Context Size: 8,192 tokens.
- Trainable Parameters:
- Stage 1: 30M (Projector only).
- Stage 2: 12B (LLM via LoRA) + 400M (Vision Encoder) + 30M (Projector).
- Learning Rates (Stage 2): 2×10⁻⁶ (Vision Encoder), 2×10⁻⁵ (Projector & LLM).
- LoRA Config: Rank 128, Alpha 256, Dropout 0.05.
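For intuition on what the LoRA configuration adds, each adapted weight matrix contributes rank × (d_in + d_out) trainable parameters. The sketch below estimates the adapter size for a 12B-class transformer; the hidden size, layer count, and set of adapted projections are illustrative assumptions, not PLLuM's actual dimensions.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical 12B-class transformer: hidden size 5120, 40 layers,
# LoRA (rank 128) applied to the four attention projections per layer.
hidden, layers, rank = 5120, 40, 128
per_layer = 4 * lora_param_count(hidden, hidden, rank)
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable LoRA parameters")  # 209.7M
```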
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Quantitative: MMBench v1.1 (Development Split). The dataset was translated to Polish using Tower+ 72B and subsequently manually corrected by experts to remove translation artifacts (referred to as MMBench-PL).
- Qualitative (Model-as-a-Judge): XM3600 (Polish subset), a dataset requiring accurate and culturally relevant image descriptions.
Factors
- Language: Performance comparison between Polish (target) and English (source) capabilities.
- Task Type: Object recognition, OCR, commonsense reasoning, fine-grained perception, and cultural context recognition.
Metrics
- Accuracy: Used for MMBench multiple-choice questions.
- Win-rate (LLM-as-a-Judge): Pairwise comparison using LLaVA-OneVision-72B to judge caption quality between PLLuM and baseline models (PaliGemma, Qwen2.5, Pixtral).
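The win-rate metric above reduces to a simple count over the judge's pairwise verdicts. A minimal sketch follows; counting ties as half a win is a common convention, which the card does not explicitly specify, and the verdict list is fabricated for illustration only.

```python
def win_rate(judgments):
    """Pairwise win-rate: wins plus half-credit for ties, over all comparisons."""
    wins = sum(1.0 for j in judgments if j == "win")
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)

# Hypothetical judge verdicts for PLLuM vs. one baseline:
verdicts = ["win"] * 60 + ["tie"] * 10 + ["loss"] * 30
print(f"{win_rate(verdicts):.2%}")  # 65.00%
```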
Results
Summary
The model demonstrates a significant advancement in Polish multimodal capabilities:
- MMBench-PL: Achieves 79.35% accuracy, a +9.55 percentage-point improvement over LLaVA-1.6-Vicuna-13B, while maintaining comparable English performance.
- Captioning Quality: Outperforms PaliGemma-3B (65.28% win-rate), slightly outperforms LLaVA-1.6-Mistral-7B and LLaVA-1.6-Vicuna-13B, and shows competitive, though slightly lower, results compared to Qwen2.5-VL-7B and Pixtral-12B.
- Qualitative Analysis: The model shows superior handling of Polish grammar/morphology and correctly identifies Polish cultural elements (e.g., specific landmarks like the Palace of Culture and Science, regional food like Toruń gingerbread) where generic models often fail.
Societal Impact Assessment
- Cultural Inclusion: This model helps bridge the gap in multimodal AI for the Polish language, allowing for technology that reflects local linguistic and cultural nuances rather than defaulting to US-centric norms.
- Lack of Safety Alignment: Important: As a research proof-of-concept, this model has not undergone specific safety alignment (e.g., RLHF) for the vision-language domain. Consequently, it may be more prone to generating biased, toxic, or inappropriate responses compared to fully commercialized models, especially when prompted with controversial visual content.
- Reliability: Users should be aware of the potential for hallucinations, particularly in OCR or counting tasks, and should not use the model for high-stakes decision-making.
Technical Specifications
Model Architecture and Objective
- Architecture: Based on the LLaVA-NeXT framework.
- Language Model: PLLuM-12B-nc-instruct (Polish-native, instruction-tuned).
- Vision Encoder: SigLIP2 So400m/14, 384px (chosen for its strong multilingual alignment).
- Connector: Two-layer MLP projector.
- Objective: The model uses a standard autoregressive language modeling objective, conditioned on visual inputs processed through the encoder and projector.
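The connector's role can be sketched numerically: patch features from the vision encoder are mapped into the LLM's embedding space by a two-layer MLP with a nonlinearity in between, as in LLaVA-style connectors. The dimensions below (729 patches of 1152-d SigLIP features projected to a 5120-d space) and the random weights are illustrative assumptions, not the model's actual parameters.

```python
import numpy as np

def mlp_projector(patch_feats, w1, b1, w2, b2):
    """Two-layer MLP connector: vision patch features -> LLM embedding space."""
    h = patch_feats @ w1 + b1
    # GELU nonlinearity (tanh approximation) between the two layers.
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

# Illustrative dims: 729 patches x 1152-d features -> 5120-d LLM space.
rng = np.random.default_rng(0)
feats = rng.standard_normal((729, 1152))
w1, b1 = rng.standard_normal((1152, 5120)) * 0.01, np.zeros(5120)
w2, b2 = rng.standard_normal((5120, 5120)) * 0.01, np.zeros(5120)
tokens = mlp_projector(feats, w1, b1, w2, b2)
print(tokens.shape)  # (729, 5120)
```

Each projected patch then occupies one "visual token" position in the LLM's autoregressive input sequence.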
Compute Infrastructure
Hardware
We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computing facilities and support within computational grant no. PLG/2025/018129.
Model Card Contact
For questions or contributions, please reach out via: nlp@nask.pl
How to Get Started with the Model
Inference Example using Transformers
Use the code below to run the model. We recommend using transformers >= 4.56.2.
import torch
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"

# Load the processor and the model (fp16, auto-placed on available devices).
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto",
)

# Download an example image.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Build a Polish VQA prompt; the <image> token marks where the image is inserted.
conversation = [
    {
        "role": "user",
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail."
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
input_len = inputs.input_ids.shape[1]
generated_ids = output[0][input_len:]
print(processor.decode(generated_ids, skip_special_tokens=True))
Inference with vLLM
You can also run the model with vLLM; see the example below. We recommend using vllm >= 0.10.0.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from PIL import Image
import requests

model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialise the vLLM engine (bf16, 8k context, one image per prompt).
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1},
)

# Download an example image.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Build a Polish VQA prompt; the <image> token marks where the image is inserted.
conversation = [
    {
        "role": "user",
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail."
    },
]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)
output = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params,
)
print(output[0].outputs[0].text)
Citation
If you use this model, please cite the following paper:
@inproceedings{statkiewicz2026annotation,
title = {Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework},
author = {Statkiewicz, Grzegorz and
Dobrzeniecka, Alicja and
Seweryn, Karolina and
Krasnod{\k e}bska, Aleksandra and
Piosek, Karolina and
Bogusz, Katarzyna and
Cygert, Sebastian and
Kusa, Wojciech},
booktitle = {Proceedings of the Student Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)},
year = {2026},
publisher = {Association for Computational Linguistics}
}