# Snowflake Arctic Embeddings for ROR Affiliation Matching
A sentence embedding model fine-tuned for Research Organization Registry (ROR) affiliation matching.

## Model Description

This model is fine-tuned from Snowflake/snowflake-arctic-embed-l-v2.0 using contrastive learning
on the AffilGood contrastive dataset. It produces embeddings optimized for matching affiliation
strings to ROR organization records.

## Training

- Base model: Snowflake/snowflake-arctic-embed-l-v2.0
- Training dataset: SIRIS-Lab/affilgood-contrastive-dataset
- Training examples: 50,255
- Validation examples: 2,645
- Loss: MultipleNegativesRankingLoss (with hard negatives)
- Epochs: 3
- Batch size: 32
- Learning rate: 2e-05
- Max sequence length: 256
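
MultipleNegativesRankingLoss scores each anchor against every positive in the batch, treating the matching positive as the target and the rest as in-batch negatives; mined hard negatives are appended as extra candidate columns. A minimal NumPy sketch of this idea (the `mnr_loss` helper and the scale factor of 20 are illustrative assumptions, not taken from the actual training code):

```python
import numpy as np

def mnr_loss(anchor_embs, pos_embs, scale=20.0):
    """Sketch of MultipleNegativesRankingLoss: for row i, positive i is the
    target class and every other positive in the batch acts as a negative."""
    scores = scale * (anchor_embs @ pos_embs.T)  # batch similarity matrix
    # Row-wise log-softmax, then pick the diagonal (the true pairs)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

aligned = mnr_loss(anchors, anchors)         # each positive matches its anchor
shuffled = mnr_loss(anchors, anchors[::-1])  # positives deliberately misassigned
print(aligned < shuffled)  # correctly paired batches score a lower loss
```

The correctly paired batch yields a near-zero loss while the shuffled one is penalized, which is the gradient signal that pulls matching affiliation/organization embeddings together.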

## Usage

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("cometadata/snowflake-arctic-ror-affiliations")

# Encode affiliation strings
affiliations = [
    "Department of Physics, MIT, Cambridge, MA",
    "Harvard Medical School, Boston",
]
embeddings = model.encode(affiliations, normalize_embeddings=True)

# Encode ROR organization names for matching
organizations = [
    "Massachusetts Institute of Technology",
    "Harvard University",
]
org_embeddings = model.encode(organizations, normalize_embeddings=True)

# Cosine similarity (embeddings are normalized, so a dot product suffices)
similarities = np.dot(embeddings, org_embeddings.T)
```

## Intended Use

This model is designed for dense retrieval in affiliation matching pipelines. It should be used as the first-stage retriever to find candidate ROR organizations for a given affiliation string.
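
Given precomputed, normalized embeddings, the first-stage retrieval step reduces to a top-k similarity search. A sketch with toy vectors (`top_k_candidates` is an illustrative helper, not part of the model's API):

```python
import numpy as np

def top_k_candidates(affil_embs, org_embs, k=5):
    """First-stage retrieval: for each affiliation embedding, return the
    indices of the k most similar organization embeddings."""
    sims = affil_embs @ org_embs.T  # cosine similarities (inputs normalized)
    k = min(k, org_embs.shape[0])
    # Sort each row in descending similarity order and keep the first k
    return np.argsort(-sims, axis=1)[:, :k]

# Toy unit-norm embeddings: affiliation 0 is closest to org 1, affiliation 1 to org 0
orgs = np.eye(3)
affils = np.array([[0.2, 0.9, 0.1], [0.9, 0.2, 0.3]])
affils /= np.linalg.norm(affils, axis=1, keepdims=True)

candidates = top_k_candidates(affils, orgs, k=2)
print(candidates)  # candidate org indices per affiliation, best first
```

The returned candidate lists would then feed a second-stage reranker or exact-match check against the ROR records.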

## Training Data

Fine-tuned on SIRIS-Lab/affilgood-contrastive-dataset, which contains 52,900 affiliation-organization pairs with curated hard negatives across 105 languages.

## Timestamp

2026-01-07T08:08:33.561241+00:00