Model Card: algaGPT
Overview
- Name: algaGPT
- Type: Causal language model for protein sequence classification
- Base: nanoGPT (Andrej Karpathy)
- Task: Binary classification of microalgal vs. contaminant protein sequences
- Mode: TI-inclusive (full-length sequences)
Training
- Data: ~58.6M protein sequences (1:1 algal:contaminant ratio)
- Algal sources: 166 microalgal genomes across 10 phyla
- Contaminant sources: Bacterial, archaeal, and fungal sequences from NCBI nr
Performance
| Metric | Value |
|---|---|
| Recall | >99% |
| Speed vs BLASTP+ | ~10,701× faster |
Usage
Input a protein sequence; the model generates a classification tag, algal or conta (contaminant), via next-token prediction.
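As a rough illustration of classification by next-token prediction, the sketch below compares the scores a model assigns to the two tags as the next token after the sequence and returns the more likely one. It is not the algaGPT inference code: the `classify` helper, the `next_token_logits` callable, and the `toy_logits` stand-in are hypothetical, and it assumes each tag can be scored as a single token, which this card does not specify.

```python
import math
from typing import Callable, Dict, Tuple

# Tag names as given in this card; how they are tokenized is an assumption.
TAGS = ("algal", "conta")


def classify(
    sequence: str,
    next_token_logits: Callable[[str], Dict[str, float]],
) -> Tuple[str, float]:
    """Return the more likely tag and its probability among the two tags.

    `next_token_logits` is a hypothetical callable that maps a protein
    sequence to the model's logits for the token that follows it.
    """
    logits = next_token_logits(sequence)
    scores = [logits[tag] for tag in TAGS]

    # Softmax restricted to the two tag tokens (max-shifted for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    best = max(range(len(TAGS)), key=lambda i: probs[i])
    return TAGS[best], probs[best]


def toy_logits(sequence: str) -> Dict[str, float]:
    """Stand-in for a real model; the rule here has no biological meaning."""
    if sequence.startswith("M"):
        return {"algal": 2.0, "conta": 0.0}
    return {"algal": 0.0, "conta": 2.0}


tag, prob = classify("MKTAYIAKQR", toy_logits)
print(tag, round(prob, 3))
```

In practice `toy_logits` would be replaced by a call into the trained checkpoint. Restricting the softmax to the two tag tokens gives a probability that is easy to threshold, although the released model may decode the tag differently.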
Citation
Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras. Patterns. 2024;6(11).
Contact
Kourosh Salehi-Ashtiani (ksa3@nyu.edu)