Snowflake Arctic Embeddings for ROR Affiliation Matching

A sentence embedding model fine-tuned for Research Organization Registry (ROR) affiliation matching.

Model Description

This model is fine-tuned from Snowflake/snowflake-arctic-embed-l-v2.0 using contrastive learning on the AffilGood contrastive dataset. It produces embeddings optimized for matching affiliation strings to ROR organization records.

Training

  • Base model: Snowflake/snowflake-arctic-embed-l-v2.0
  • Training dataset: SIRIS-Lab/affilgood-contrastive-dataset
  • Training examples: 50,255
  • Validation examples: 2,645
  • Loss: MultipleNegativesRankingLoss (with hard negatives)
  • Epochs: 3
  • Batch size: 32
  • Learning rate: 2e-05
  • Max sequence length: 256
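
To make the loss listed above concrete, here is a minimal NumPy sketch of what a multiple-negatives ranking objective with hard negatives computes. It is an illustration, not the training code used for this model. The function name `mnr_loss` is hypothetical, and the cosine similarity and scale of 20 are assumed from the sentence-transformers defaults, since the card does not state which settings were used.

```python
import numpy as np


def mnr_loss(anchors, positives, hard_negatives, scale=20.0):
    """In-batch softmax cross-entropy over scaled cosine similarities.

    anchors, positives, hard_negatives: arrays of shape (batch, dim).
    Row i of `positives` is the correct match for row i of `anchors`;
    every other positive and every hard negative acts as a negative.
    `scale=20.0` is an assumed default, not a value stated in this card.
    """
    def normalize(x):
        x = np.asarray(x, dtype=np.float64)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a = normalize(anchors)
    # Candidate pool: all positives first, then all hard negatives.
    candidates = np.concatenate(
        [normalize(positives), normalize(hard_negatives)], axis=0
    )

    logits = scale * (a @ candidates.T)                 # (batch, 2 * batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    targets = np.arange(a.shape[0])  # the positive for anchor i is column i
    return float(-log_probs[targets, targets].mean())
```

The loss is small when each anchor is closest to its own positive and grows when a hard negative or another in-batch positive scores higher, which is why larger batches and curated hard negatives both make the task harder.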

Usage

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cometadata/snowflake-arctic-ror-affiliations")

# Encode raw affiliation strings (the queries)
affiliations = [
    "Department of Physics, MIT, Cambridge, MA",
    "Harvard Medical School, Boston",
]
embeddings = model.encode(affiliations, normalize_embeddings=True)

# Encode ROR organization names (the candidates to match against)
organizations = [
    "Massachusetts Institute of Technology",
    "Harvard University",
]
org_embeddings = model.encode(organizations, normalize_embeddings=True)

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
# similarities[i, j] scores affiliation i against organization j.
similarities = np.dot(embeddings, org_embeddings.T)
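
The dot product in the usage example is a cosine similarity only because the embeddings are L2-normalized. As a small self-contained sketch (the helper name `cosine_similarity_matrix` is hypothetical and not part of any library), the same scores can be computed from unnormalized vectors by normalizing explicitly:

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the (len(a), len(b)) matrix of cosine similarities.

    Works for unnormalized inputs: each row is divided by its L2 norm
    before the dot product is taken.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T
```

If embeddings were encoded without `normalize_embeddings=True`, a plain dot product would also reflect vector length, so a helper like this keeps scores comparable across strings.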

Intended Use

This model is designed for dense retrieval in affiliation matching pipelines. It should be used as the first-stage retriever to find candidate ROR organizations for a given affiliation string.
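
As a rough sketch of the first-stage retrieval step described above, the function below returns the `k` highest-scoring organizations for each affiliation by brute-force search. The name `top_k_candidates` and the default `k=5` are illustrative assumptions, and both inputs are assumed to be L2-normalized embedding matrices such as those returned by `model.encode(..., normalize_embeddings=True)`.

```python
import numpy as np


def top_k_candidates(query_embeddings, org_embeddings, k=5):
    """Return (indices, scores) of the k best organizations per query.

    query_embeddings: (num_queries, dim) normalized affiliation embeddings.
    org_embeddings:   (num_orgs, dim) normalized organization embeddings.
    Both outputs have shape (num_queries, min(k, num_orgs)), with each row
    ordered from most to least similar.
    """
    scores = np.asarray(query_embeddings) @ np.asarray(org_embeddings).T
    k = min(k, scores.shape[1])
    indices = np.argsort(-scores, axis=1)[:, :k]          # descending order
    top_scores = np.take_along_axis(scores, indices, axis=1)
    return indices, top_scores
```

The returned indices point into the organization list, so the candidates can be handed to whatever later stage makes the final match. Exhaustive search like this is only a starting point; for a large registry an approximate nearest-neighbor index would likely be a better fit.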

Training Data

Fine-tuned on SIRIS-Lab/affilgood-contrastive-dataset, which contains 52,900 affiliation-organization pairs (50,255 for training and 2,645 for validation) with curated hard negatives across 105 languages.

Timestamp

2026-01-07T08:08:33.561241+00:00

Model size: 0.6B params (Safetensors, tensor type F32)