arxiv:2511.10657

Patent Representation Learning via Self-supervision

Published on Nov 3, 2025
Abstract

This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, in which different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks: claims and summaries benefit retrieval, while background sections aid classification, underscoring the value of patents' inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
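To make the core idea concrete, below is a minimal sketch (not the authors' released code) of section-based contrastive training: two sections of the same patent form a positive pair under a standard InfoNCE objective, with the other patents in the batch serving as negatives. The encoder checkpoint, mean pooling, and temperature are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder encoder; the paper's backbone may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

def embed(texts):
    """Mean-pool token embeddings into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

def info_nce(view_a, view_b, temperature=0.05):
    """InfoNCE loss: matching rows of view_a / view_b are positives."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.T / temperature                       # (B, B) cosine similarities
    labels = torch.arange(a.size(0))                     # diagonal entries are positive pairs
    return F.cross_entropy(logits, labels)

# Toy batch: each patent contributes two intra-document views
# (e.g., its abstract and its first claim).
abstracts = ["A method for coating turbine blades ...",
             "A battery separator with ceramic layers ..."]
claims    = ["1. A turbine blade comprising a coating ...",
             "1. A separator for a lithium-ion cell ..."]

loss = info_nce(embed(abstracts), embed(claims))
loss.backward()
```

Unlike SimCSE-style dropout, where both views come from the same text passed through the encoder twice, the two views here differ in content and discourse role, which is the source of the semantic diversity the abstract credits with mitigating over-dispersion.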
