update dataset

Files changed (12)

README.md +3 -374
requirements.txt +13 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/dataset_dict.json +0 -1
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/data-00001-of-00005.arrow +3 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/data-00002-of-00005.arrow +3 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/data-00003-of-00005.arrow +3 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/data-00004-of-00005.arrow +3 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/dataset_info.json +65 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/state.json +25 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/val/data-00000-of-00001.arrow +3 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/val/dataset_info.json +59 -0
training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/val/state.json +13 -0

README.md CHANGED

@@ -1,374 +1,3 @@
----
-license: cc-by-nc-nd-4.0
----
-![Untitled design (3)](https://cdn-uploads.huggingface.co/production/uploads/64cd5b3f0494187a9e8b7c69/bpOe1xggl9lw90JMi3VsC.png)
-This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction.
-- `embeddings` folder contains processed huggingface datasets with PeptideCLM embeddings. The `.csv` is the pre-processed data.
-- `metrics` folder contains the model performance on the validation data
-- `models` host all trained model weights
-- `training_data` host all **raw data** to train the classifiers
-- `functions` contains files to utilize the trained weights and classifiers
-- `train` contains the script to train classifiers on the pre-processed embeddings, either through xgboost or MLPs.
-- `scoring_function.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications
-# PeptiVerse 🧬🌌
-A collection of machine learning predictors for non-canonical and canonical peptide property prediction for SMILES representation. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.
-## Predictors 🧫
-PeptiVerse includes the following property predictors:
-| Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
-|-----------|-------------|-----------------| --------------------|--------------|------------|
-| **Non-Hemolysis** | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
-| **Solubility** | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
-| **Non-Fouling** | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
-| **Permeability** | Cell membrane permeability (PAMPA lipophilicity score log P scale, range -10 to 0) | ≥ −6.0 indicate strong permeability and values < 6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
-| **Binding Affinity** | Peptide-protein binding strength (-log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0 − 7.5), and high binding (≥ 7.5) | PepLand | 1806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |
-## Model Performance 🌟
-#### Binary Classification Predictors
-| Predictor | Val AUC | Val F1 |
-|-----------|----------------|----------|
-| **Non-Hemolysis** | 0.7902 | 0.8260 |
-| **Solubility** | 0.6016 | 0.5767 |
-| **Nonfouling** | 0.9327 | 0.8774 |
-#### Regression Predictors
-| Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
-|-----------|------------------------------|----------------------------|
-| **Permeability** | 0.958 | 0.710 |
-| **Binding Affinity** | 0.805 | 0.611 |
-## Setup 🌟
-1. Clone the repository:
-```bash
-git clone https://github.com/sophtang/PeptiVerse.git
-cd PeptiVerse
-```
-2. Install environment:
-```bash
-conda env create -f environment.yml
-conda activate peptiverse
-```
-3. Change the `base_path` in each file to ensure that all model weights and tokenizers are loaded correctly.
-## Usage 🌟
-#### 1. Hemolysis Prediction
-Predicts the probability that a peptide is **not hemolytic**. Higher scores indicate safer peptides.
-```python
-import sys
-sys.path.append('/path/to/PeptiVerse')
-from functions.hemolysis.hemolysis import Hemolysis
-# Initialize predictor
-hemo = Hemolysis()
-# Input peptide in SMILES format
-peptides = [
-    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
-]
-# Get predictions
-scores = hemo(peptides)
-print(f"Non-hemolytic probability: {scores[0]:.3f}")
-```
-**Output interpretation:**
-- Score close to 1.0 = likely non-hemolytic (safe)
-- Score close to 0.0 = likely hemolytic (unsafe)
----
-#### 2. Solubility Prediction
-Predicts aqueous solubility. Higher scores indicate better solubility.
-```python
-from functions.solubility.solubility import Solubility
-# Initialize predictor
-sol = Solubility()
-# Input peptide
-peptides = [
-    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
-]
-# Get predictions
-scores = sol(peptides)
-print(f"Solubility probability: {scores[0]:.3f}")
-```
-**Output interpretation:**
-- Score close to 1.0 = highly soluble
-- Score close to 0.0 = poorly soluble
----
-#### 3. Nonfouling Prediction
-Predicts protein resistance/non-fouling properties.
-```python
-from functions.nonfouling.nonfouling import Nonfouling
-# Initialize predictor
-nf = Nonfouling()
-# Input peptide
-peptides = [
-    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
-]
-# Get predictions
-scores = nf(peptides)
-print(f"Nonfouling score: {scores[0]:.3f}")
-```
-**Output interpretation:**
-- Higher scores = better non-fouling properties
----
-#### 4. Permeability Prediction
-Predicts membrane permeability on a log P scale.
-```python
-from functions.permeability.permeability import Permeability
-# Initialize predictor
-perm = Permeability()
-# Input peptide
-peptides = [
-    "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
-]
-# Get predictions
-scores = perm(peptides)
-print(f"Permeability (log P): {scores[0]:.3f}")
-```
-**Output interpretation:**
-- Higher values = more permeable
-- Typical range: -10 to 0 (log scale)
----
-#### 5. Binding Affinity Prediction
-Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.
-```python
-from functions.binding.binding import BindingAffinity
-# Target protein sequence (amino acid format)
-target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."
-# Initialize predictor with target protein
-binding = BindingAffinity(prot_seq=target_protein)
-# Input peptide in SMILES format
-peptides = [
-    "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
-]
-# Get predictions
-scores = binding(peptides)
-print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
-```
-**Output interpretation:**
-- Higher values = stronger binding
-- Scale: -log(Kd/Ki/IC50)
-  - 7.5+ = tight binding (≤ ~30nM)
-  - 6.0-7.5 = medium binding (~30nM - 1μM)
-  - <6.0 = weak binding (> 1μM)
----
-## Batch Processing 🌟
-All predictors support batch processing for multiple peptides:
-```python
-from functions.hemolysis.hemolysis import Hemolysis
-hemo = Hemolysis()
-# Multiple peptides
-peptides = [
-    "NCC(=O)N[C@H](CS)C(=O)O",
-    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
-    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
-]
-# Get predictions for all
-scores = hemo(peptides)
-for i, score in enumerate(scores):
-    print(f"Peptide {i+1}: {score:.3f}")
-```
----
-## Unified Scoring with Multiple Predictors 🌟
-For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.
-### Basic Usage
-```python
-import sys
-sys.path.append('/path/to/PeptiVerse')
-from scoring_functions import ScoringFunctions
-# Initialize with desired scoring functions
-# Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
-#            'solubility', 'hemolysis', 'nonfouling'
-scoring = ScoringFunctions(
-    score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
-    prot_seqs=[]  # Empty if not using binding affinity
-)
-# Input peptides in SMILES format
-peptides = [
-    'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
-    'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
-]
-# Get scores (returns numpy array of shape: num_peptides x num_functions)
-scores = scoring(input_seqs=peptides)
-print(scores)
-```
-### Adding Binding Affinity
-```python
-from scoring_functions import ScoringFunctions
-# Target protein sequence (amino acid format)
-tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."
-# Initialize with binding affinity for one protein
-scoring = ScoringFunctions(
-    score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
-    prot_seqs=[tfr_protein]  # Provide target protein sequence
-)
-peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
-scores = scoring(input_seqs=peptides)
-# scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
-print(f"Scores for peptide 1:")
-print(f"  Binding Affinity: {scores[0][0]:.3f}")
-print(f"  Solubility: {scores[0][1]:.3f}")
-print(f"  Hemolysis: {scores[0][2]:.3f}")
-print(f"  Permeability: {scores[0][3]:.3f}")
-```
-### Multiple Binding Targets
-```python
-# For dual binding affinity prediction
-protein1 = "MMDQARSAFSNLFGGEPLSYTR..."  # First target
-protein2 = "MTKSNGEEPKMGGRMERFQQGV..."  # Second target
-scoring = ScoringFunctions(
-    score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
-    prot_seqs=[protein1, protein2]  # Provide both protein sequences
-)
-peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
-scores = scoring(input_seqs=peptides)
-# scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
-```
-### Output Format
-The `ScoringFunctions` class returns a numpy array where:
-- **Rows**: Each row corresponds to one input peptide
-- **Columns**: Each column corresponds to one scoring function (in the order specified)
-```python
-# Example with 3 peptides and 4 scoring functions
-scores = scoring(input_seqs=peptides)
-# Shape: (3, 4)
-# scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
-# scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
-# scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
-```
----
-## Complete Example 🌟
-```python
-import sys
-sys.path.append('/path/to/PeptiVerse')
-from functions.hemolysis.hemolysis import Hemolysis
-from functions.solubility.solubility import Solubility
-from functions.permeability.permeability import Permeability
-# Initialize predictors
-hemo = Hemolysis()
-sol = Solubility()
-perm = Permeability()
-# Test peptide
-peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]
-# Get all predictions
-hemo_score = hemo(peptide)[0]
-sol_score = sol(peptide)[0]
-perm_score = perm(peptide)[0]
-print("Peptide Property Predictions:")
-print(f"  Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
-print(f"  Solubility: {sol_score:.3f}")
-print(f"  Permeability: {perm_score:.3f}")
-```
----
-## Model Architecture 🌟
-All predictors use:
-- **Embeddings**: PeptideCLM-23M (RoFormer-based peptide language model)
-- **Classifier**: XGBoost gradient boosting
-- **Input**: SMILES representation of peptides
-- **Training**: Models trained on curated datasets with cross-validation
----
-## Citation
-If you find this repository helpful for your publications, please consider citing our paper:
-```
-@article{zhang2025peptiverse,
-  title={PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
-  author={Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
-  journal={bioRxiv},
-  doi={10.64898/2025.12.31.697180}
-  year={2026}
-}
-```
-To use this repository, you agree to abide by the MIT License.

+version https://git-lfs.github.com/spec/v1
+oid sha256:0dd39f30b6311602a8b9533d532405c3f5427d7b61179f993d29b10f95627017
+size 18784
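
Note that the three added lines are a Git LFS pointer rather than README prose: a `version` line followed by `oid` and `size` key/value lines. The following is a minimal sketch, assuming only that pointer layout, of how such a file could be parsed to see what the README now points at. The helper name `parse_lfs_pointer` is hypothetical and not part of this repository.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each pointer line is "<key> <value>", separated by a single space.
        key, _, value = line.partition(" ")
        fields[key] = value

    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        raise ValueError("not a Git LFS pointer")
    if "oid" not in fields or "size" not in fields:
        raise ValueError("pointer is missing 'oid' or 'size'")

    # The oid is written as "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


# Pointer text copied from the hunk above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0dd39f30b6311602a8b9533d532405c3f5427d7b61179f993d29b10f95627017
size 18784
"""
info = parse_lfs_pointer(pointer)
print(info["oid_algo"], info["size"])
```

A README stored this way will likely render as pointer text until the LFS object is fetched, which may be worth confirming before merging.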

requirements.txt ADDED

	@@ -0,0 +1,13 @@

+gradio>=4.0.0
+pandas>=2.0.0
+numpy>=1.24.0
+plotly>=5.14.0
+torch>=2.0.0
+transformers==4.46.0
+scikit-learn>=1.3.0
+biopython>=1.81
+rdkit>=2023.3.1
+seaborn
+SmilesPE
+xgboost
+ipython

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/dataset_dict.json DELETED

	@@ -1 +0,0 @@
-{"splits": ["train", "val"]}

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/data-00001-of-00005.arrow ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f6281286045c604cd2d8c21bb2fc04112969044de787a84fac80e653a1d14b58
+size 506101952

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/data-00002-of-00005.arrow ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dba30b3c634beee1d4076a4913fc9579e05ace58fd14f029254028339f7d6dc1
+size 346101152

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/data-00003-of-00005.arrow ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eb80d65357bf3aef44c7d38e5d5c263be7a1baec9b79d79b022bbd983cb1b3da
+size 480935432

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/data-00004-of-00005.arrow ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:49ae3e0d1e202e56e1982e7c1de2f5728b6ba256534a68d0f59694026f53a640
+size 425662736

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/dataset_info.json ADDED

	@@ -0,0 +1,65 @@

+{
+  "builder_name": "generator",
+  "citation": "",
+  "config_name": "default",
+  "dataset_name": "generator",
+  "dataset_size": 2114611931,
+  "description": "",
+  "download_checksums": {},
+  "download_size": 0,
+  "features": {
+    "sequence": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "label": {
+      "dtype": "int64",
+      "_type": "Value"
+    },
+    "embedding": {
+      "feature": {
+        "feature": {
+          "dtype": "float16",
+          "_type": "Value"
+        },
+        "length": 768,
+        "_type": "List"
+      },
+      "_type": "List"
+    },
+    "attention_mask": {
+      "feature": {
+        "dtype": "int8",
+        "_type": "Value"
+      },
+      "_type": "List"
+    },
+    "length": {
+      "dtype": "int64",
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": "",
+  "size_in_bytes": 2114611931,
+  "splits": {
+    "train": {
+      "name": "train",
+      "num_bytes": 2114611931,
+      "num_examples": 8838,
+      "shard_lengths": [
+        3000,
+        3000,
+        2000,
+        838
+      ],
+      "dataset_name": "generator"
+    }
+  },
+  "version": {
+    "version_str": "0.0.0",
+    "major": 0,
+    "minor": 0,
+    "patch": 0
+  }
+}
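
The schema above stores one 768-dimensional vector per token (`embedding`) alongside an `attention_mask`, so a consumer has to reduce each row to a fixed-size vector itself. Below is a minimal sketch of masked mean pooling over a single row, in plain Python. Whether the downstream classifiers pool this way is an assumption, and `masked_mean_pool` is a hypothetical helper. The toy row uses 2-dimensional vectors instead of 768 to keep the example readable.

```python
from typing import List, Sequence


def masked_mean_pool(
    embedding: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the per-token vectors whose mask entry is non-zero (hypothetical helper)."""
    if len(embedding) != len(attention_mask):
        raise ValueError("embedding and attention_mask must have the same length")

    # Keep only the token vectors that the mask marks as real (non-padding).
    kept = [vec for vec, keep in zip(embedding, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")

    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy row shaped like the schema: three tokens, the last one is padding.
row = {
    "embedding": [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    "attention_mask": [1, 1, 0],
}
pooled = masked_mean_pool(row["embedding"], row["attention_mask"])
print(pooled)  # [2.0, 3.0]
```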

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/train/state.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "_data_files": [
+    {
+      "filename": "data-00000-of-00005.arrow"
+    },
+    {
+      "filename": "data-00001-of-00005.arrow"
+    },
+    {
+      "filename": "data-00002-of-00005.arrow"
+    },
+    {
+      "filename": "data-00003-of-00005.arrow"
+    },
+    {
+      "filename": "data-00004-of-00005.arrow"
+    }
+  ],
+  "_fingerprint": "b072567eebe4f415",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_output_all_columns": false,
+  "_split": "train"
+}
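
The `state.json` above names five shards, including `data-00000-of-00005.arrow`, which is not among the files added in this commit. The sketch below uses only the standard library to check that every shard listed in a `state.json` sits next to it on disk. `missing_shards` is a hypothetical helper, and the demo directory is a throwaway stand-in, not the real dataset.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def missing_shards(split_dir: Path) -> List[str]:
    """Return shard filenames listed in state.json that are absent from split_dir."""
    state = json.loads((split_dir / "state.json").read_text())
    names = [entry["filename"] for entry in state["_data_files"]]
    return [name for name in names if not (split_dir / name).is_file()]


# Demo: recreate the situation in this diff, where shards 1-4 exist but shard 0 does not.
with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp)
    state = {
        "_data_files": [
            {"filename": f"data-{i:05d}-of-00005.arrow"} for i in range(5)
        ]
    }
    (split_dir / "state.json").write_text(json.dumps(state))
    for i in range(1, 5):
        (split_dir / f"data-{i:05d}-of-00005.arrow").touch()
    missing = missing_shards(split_dir)

print(missing)  # ['data-00000-of-00005.arrow']
```

If shard 0 was already committed earlier, the check passes on the real directory. Otherwise the split will likely fail to load.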

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/val/data-00000-of-00001.arrow ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fad51e18ba1b7cd70dca29a2954e708aa11054e2a7719c05af651ad8da49775e
+size 411839544

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/val/dataset_info.json ADDED

	@@ -0,0 +1,59 @@

+{
+  "builder_name": "generator",
+  "citation": "",
+  "config_name": "default",
+  "dataset_name": "generator",
+  "dataset_size": 411837070,
+  "description": "",
+  "download_checksums": {},
+  "download_size": 0,
+  "features": {
+    "sequence": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "label": {
+      "dtype": "int64",
+      "_type": "Value"
+    },
+    "embedding": {
+      "feature": {
+        "feature": {
+          "dtype": "float16",
+          "_type": "Value"
+        },
+        "length": 768,
+        "_type": "List"
+      },
+      "_type": "List"
+    },
+    "attention_mask": {
+      "feature": {
+        "dtype": "int8",
+        "_type": "Value"
+      },
+      "_type": "List"
+    },
+    "length": {
+      "dtype": "int64",
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": "",
+  "size_in_bytes": 411837070,
+  "splits": {
+    "train": {
+      "name": "train",
+      "num_bytes": 411837070,
+      "num_examples": 2198,
+      "dataset_name": "generator"
+    }
+  },
+  "version": {
+    "version_str": "0.0.0",
+    "major": 0,
+    "minor": 0,
+    "patch": 0
+  }
+}

training_data_cleaned/toxicity/tox_smiles_with_embeddings_unpooled/val/state.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "_data_files": [
+    {
+      "filename": "data-00000-of-00001.arrow"
+    }
+  ],
+  "_fingerprint": "4bac7336e23d3fed",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_output_all_columns": false,
+  "_split": "train"
+}