Model Card: algaGPT

Overview

  • Name: algaGPT
  • Type: Causal language model for protein sequence classification
  • Base: nanoGPT (Andrej Karpathy)
  • Task: Binary classification of microalgal vs. contaminant protein sequences
  • Mode: TI-inclusive (full-length sequences)

Training

  • Data: ~58.6M protein sequences (1:1 algal:contaminant ratio)
  • Algal sources: 166 microalgal genomes across 10 phyla
  • Contaminant sources: Bacterial, archaeal, and fungal sequences from NCBI nr

Performance

Metric             Score
Recall             >99%
Speed vs. BLASTP+  ~10,701× faster
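
For reference, recall is the fraction of true members of a class that the model labels correctly, TP / (TP + FN). The sketch below shows the arithmetic only. The function name and the counts are hypothetical, and the card does not state which class was treated as positive or how large the evaluation set was.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when there are no positive examples")
    return true_positives / total


# Illustrative counts only; these are not results reported for algaGPT.
print(f"{recall(995, 5):.3f}")
```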

Usage

Input a protein sequence; the model generates a classification tag, either algal or conta (contaminant), via next-token prediction.
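
A minimal sketch of how the generated text could be turned into a label, assuming the model emits one of the two tag strings after the prompt. The names `parse_tag`, `classify`, and `fake_generate` are hypothetical, and the prompt format is an assumption. The real tokenization and sampling call depend on the released checkpoint and are not described in this card.

```python
from typing import Callable

# The card names the two classes "algal" and "conta"; treating them as
# plain substrings of the generated text is an assumption.
TAGS = ("algal", "conta")


def parse_tag(generated: str) -> str:
    """Return the earliest class tag found in the generated text, else 'unknown'."""
    text = generated.lower()
    hits = [(text.find(tag), tag) for tag in TAGS if tag in text]
    return min(hits)[1] if hits else "unknown"


def classify(sequence: str, generate: Callable[[str], str]) -> str:
    """Prompt a text generator with a protein sequence and parse its tag."""
    prompt = sequence.strip().upper()  # assumed prompt format: the bare sequence
    return parse_tag(generate(prompt))


def fake_generate(prompt: str) -> str:
    # Stand-in for the real sampling call; it always answers "algal".
    return "algal"


print(classify("MKTAYIAKQR", fake_generate))
```

Passing the generator in as a callable keeps the parsing logic independent of how the checkpoint is loaded, so the stand-in can be replaced with a real nanoGPT sampling function without changing `classify`.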

Citation

Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras. Patterns. 2024;6(11).

Contact

Kourosh Salehi-Ashtiani ksa3@nyu.edu
