Text Ranking
sentence-transformers
Safetensors
Transformers
Russian
bert
text-classification
rubert
cross-encoder
reranker
msmarco
text-embeddings-inference
Instructions to use DiTy/cross-encoder-russian-msmarco with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use DiTy/cross-encoder-russian-msmarco with sentence-transformers:
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("DiTy/cross-encoder-russian-msmarco")

    query = "Which planet is known as the Red Planet?"
    passages = [
        "Venus is often called Earth's twin because of its similar size and proximity.",
        "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
        "Jupiter, the largest planet in our solar system, has a prominent red spot.",
        "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
    ]

    scores = model.predict([(query, passage) for passage in passages])
    print(scores)
- Transformers
How to use DiTy/cross-encoder-russian-msmarco with Transformers:
    # Load model directly
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("DiTy/cross-encoder-russian-msmarco")
    model = AutoModelForSequenceClassification.from_pretrained("DiTy/cross-encoder-russian-msmarco")
- Notebooks
- Google Colab
- Kaggle
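The scores returned by `model.predict` are aligned with the input pairs, so reranking comes down to sorting passages by score. The following is a minimal sketch, assuming that a higher score means a more relevant passage. The `rank_passages` helper and the score values are hypothetical placeholders for illustration, not real model output.

```python
from typing import List, Sequence, Tuple


def rank_passages(
    passages: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Pair each passage with its score and sort by descending score."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    return sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)


# Placeholder inputs: real scores would come from model.predict(...).
example_passages = ["passage about Venus", "passage about Mars", "passage about Jupiter"]
example_scores = [0.02, 0.97, 0.15]

ranked = rank_passages(example_passages, example_scores)
for passage, score in ranked:
    print(f"{score:.2f}  {passage}")
```

Keeping the ranking step separate from the model call makes it easy to test without downloading any weights.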
Upload tokenizer
- README.md +7 -10
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +57 -0
- vocab.txt +0 -0
README.md CHANGED
@@ -1,4 +1,6 @@
 ---
+language:
+- ru
 library_name: sentence-transformers
 tags:
 - sentence-transformers
@@ -9,19 +11,14 @@ tags:
 - msmarco
 datasets:
 - unicamp-dl/mmarco
-language:
-- ru
 base_model: DeepPavlov/rubert-base-cased
 widget:
-- text:
-    как часто нужно ходить к стоматологу? [SEP] Дядя Женя работает врачем
-    стоматологом.
+- text: как часто нужно ходить к стоматологу? [SEP] Дядя Женя работает врачем стоматологом.
   example_title: Example 1
-- text:
-
-
-    чаще
-    отследить любые начинающиеся проблемы и исправить их сразу же.
+- text: как часто нужно ходить к стоматологу? [SEP] Минимальный обязательный срок
+    посещения зубного врача – раз в год, но специалисты рекомендуют делать это чаще
+    – раз в полгода, а ещё лучше – раз в квартал. При таком сроке легко отследить
+    любые начинающиеся проблемы и исправить их сразу же.
   example_title: Example 2
 ---
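Each widget example in the README front matter packs a query and a candidate passage into one string, separated by `[SEP]`. To score such an example programmatically, the string would first need to be split back into a pair. A minimal sketch, assuming the separator appears exactly as the literal text `[SEP]`; the `split_widget_text` helper is hypothetical and not part of any library.

```python
from typing import Tuple


def split_widget_text(text: str, sep: str = "[SEP]") -> Tuple[str, str]:
    """Split a 'query [SEP] passage' string into a (query, passage) pair."""
    query, found, passage = text.partition(sep)
    if not found:
        raise ValueError(f"separator {sep!r} not found in widget text")
    return query.strip(), passage.strip()


# First widget example from the README front matter.
widget_text = (
    "как часто нужно ходить к стоматологу? [SEP] "
    "Дядя Женя работает врачем стоматологом."
)
query, passage = split_widget_text(widget_text)
print(query)
print(passage)
```

The resulting tuple has the same shape as the `(query, passage)` pairs passed to `model.predict` in the sentence-transformers snippet.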
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
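The special tokens above are the standard BERT set. For a cross-encoder, the two relevant ones are `[CLS]` and `[SEP]`: BERT-style models take a text pair as `[CLS] A [SEP] B [SEP]`, with segment ids 0 for the first part and 1 for the second. The sketch below illustrates that layout only. It assumes whitespace splitting in place of the real WordPiece tokenizer, and `build_pair` is a hypothetical helper, not the tokenizer's API.

```python
from typing import Dict, List

# Values copied from special_tokens_map.json.
CLS_TOKEN = "[CLS]"
SEP_TOKEN = "[SEP]"


def build_pair(query: str, passage: str) -> Dict[str, List]:
    """Lay out a (query, passage) pair the way BERT-style models expect it."""
    # Whitespace splitting is a stand-in for WordPiece tokenization.
    first = [CLS_TOKEN] + query.split() + [SEP_TOKEN]
    second = passage.split() + [SEP_TOKEN]
    return {
        "tokens": first + second,
        # Segment 0 covers [CLS] + query + [SEP]; segment 1 covers passage + [SEP].
        "token_type_ids": [0] * len(first) + [1] * len(second),
    }


encoded = build_pair("how often", "twice a year")
print(encoded["tokens"])
print(encoded["token_type_ids"])
```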
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
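One value in this config deserves attention: `model_max_length` is 1000000000000000019884624838656, which is what `int(1e30)` evaluates to in Python. It appears to be a placeholder written when no real limit was recorded, so truncation based on it would effectively never trigger. The sketch below shows one way a caller might guard against that. The `effective_max_length` helper is hypothetical, and the fallback of 512 is an assumption (a common limit for BERT-base models) that should be checked against the model's own configuration.

```python
import json

# int(1e30): the value found in tokenizer_config.json, treated here as "unset".
UNSET_MAX_LENGTH = int(1e30)
# Assumed fallback; verify against the model config before relying on it.
ASSUMED_FALLBACK = 512


def effective_max_length(config: dict, fallback: int = ASSUMED_FALLBACK) -> int:
    """Return a usable max length, replacing the placeholder with a fallback."""
    value = config.get("model_max_length", UNSET_MAX_LENGTH)
    if value >= UNSET_MAX_LENGTH:
        return fallback
    return int(value)


# A trimmed copy of the relevant fields from tokenizer_config.json.
config = json.loads(
    '{"model_max_length": 1000000000000000019884624838656,'
    ' "tokenizer_class": "BertTokenizer"}'
)
print(effective_max_length(config))
```

In practice, passing an explicit maximum length at tokenization time avoids depending on this field at all.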
vocab.txt ADDED
The diff for this file is too large to render.