Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Abstract
Vision-language models demonstrate limited capability in inferring structured cultural metadata from visual input, showing inconsistent performance across different cultures and metadata types.
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce Appear2Meaning, a multi-category, cross-cultural benchmark for this task, and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture only fragmented signals and exhibit substantial performance variation across cultures and metadata types, yielding inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
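To make the three reported scores concrete, here is a minimal sketch of how they could be computed. The function names (`score`, `_match`), the attribute set, and the use of normalized string equality in place of the paper's LLM-as-Judge semantic check are all illustrative assumptions, not the authors' implementation.

```python
# Sketch of exact-match, partial-match, and attribute-level accuracy,
# assuming each prediction and reference is a dict mapping attribute
# names to strings. All names here are assumptions for illustration.

def _match(pred: str, ref: str) -> bool:
    # Stand-in for the LLM-as-Judge semantic-alignment check.
    return pred.strip().lower() == ref.strip().lower()

def score(predictions: list[dict], references: list[dict], attributes: list[str]):
    exact = partial = 0
    per_attr = {a: 0 for a in attributes}
    for pred, ref in zip(predictions, references):
        hits = [a for a in attributes if _match(pred.get(a, ""), ref[a])]
        exact += len(hits) == len(attributes)   # every attribute aligned
        partial += len(hits) > 0                # at least one attribute aligned
        for a in hits:
            per_attr[a] += 1
    n = len(references)
    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_accuracy": {a: c / n for a, c in per_attr.items()},
    }
```

Under this scheme, exact-match requires every attribute to align while partial-match credits any single correct attribute, which is why the two scores can diverge sharply when models capture only fragmented signals.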
Community
We introduce Appear2Meaning, a cross-cultural benchmark for structured cultural metadata inference from images.
Unlike standard image captioning, this task requires models to predict non-observable attributes such as culture, period, origin, and creator from visual input alone. The dataset contains 750 curated objects from the Getty and the Metropolitan Museum of Art, covering multiple object types and four cultural regions.
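As a concrete illustration of the task format, the hypothetical record below shows what one benchmark item might look like: the model receives only the image and must predict the non-observable metadata. The field names and values are our assumptions, not the released dataset schema.

```python
# Hypothetical Appear2Meaning item (illustrative schema, not the real one).
item = {
    "image": "met_12345.jpg",               # the only input the model sees
    "source": "The Metropolitan Museum of Art",
    "object_type": "ceramic vessel",
    "reference": {                           # target structured metadata
        "culture": "Korean",
        "period": "Joseon dynasty, 15th century",
        "origin": "Korea",
        "creator": "Unknown",
    },
}
```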
Key Findings
- Structured metadata inference is significantly harder than image captioning
- Models capture partial signals but fail at coherent multi-attribute prediction
- Strong variation across cultural regions, with models performing best on East Asian objects
- Frequent errors include cross-cultural misattribution and period compression
Why it matters
This work highlights the gap between visual perception and culturally grounded reasoning in VLMs, providing a benchmark for studying bias, generalization, and structured multimodal inference in cultural heritage.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains (2026)
- MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment (2026)
- MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps (2026)
- Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints (2026)
- From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation (2026)
- GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing (2026)
- CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning (2026)