A unified transformer model for cross-species codon optimization that generates optimized DNA sequences from protein inputs using species-specific codon usage patterns.
The Multi-Species Codon Transformer addresses the challenge of codon optimization across different organisms by conditioning codon generation on both protein context and target species. Unlike existing models that require separate optimizers for each species, our approach uses a single unified model that can optimize codons for multiple organisms simultaneously.
The model extends CodonBERT with additional biological context through:
- Codon Embeddings: Pretrained CodonBERT token representations
- Amino Acid Embeddings: Learned representations for each amino acid
- Species Embeddings: Organism-specific context vectors
- DNA+Protein Tokenization: the codon (DNA) sequence and its corresponding protein sequence are encoded together, so generation is conditioned on both (see the embedding sketch after this list)
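A minimal sketch of how these three context signals could be combined into a single input representation. The module and argument names below are hypothetical and not taken from the repository; the actual model wiring may differ:

```python
import torch
import torch.nn as nn

class MultiContextEmbedding(nn.Module):
    """Sums codon, amino-acid, and species embeddings into one input embedding."""

    def __init__(self, codon_vocab_size, aa_vocab_size, num_species, hidden_dim):
        super().__init__()
        self.codon_embed = nn.Embedding(codon_vocab_size, hidden_dim)   # pretrained CodonBERT token embeddings can be loaded here
        self.aa_embed = nn.Embedding(aa_vocab_size, hidden_dim)         # learned amino-acid representations
        self.species_embed = nn.Embedding(num_species, hidden_dim)      # organism-specific context vectors
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, codon_ids, aa_ids, species_id):
        # The single species id is broadcast across the sequence so every
        # position sees the organism context.
        species = self.species_embed(species_id).unsqueeze(1)
        x = self.codon_embed(codon_ids) + self.aa_embed(aa_ids) + species
        return self.norm(x)
```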
The CodonBERT vocabulary is extended with:
- 37 additional codon tokens
- 5 organism-specific tokens
- 20 amino acid tokens
- Standard BERT special tokens ([CLS], [SEP], [MASK], [UNK])
Input sequences follow the layout:

`[CLS] [SPECIES_TOKEN] [CODON_SEQUENCE] [SEP] [AMINO_ACID_SEQUENCE] [SEP]`
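A hedged sketch of how the extended vocabulary and this input layout could be set up with a Hugging Face tokenizer. The checkpoint name and the species/amino-acid token strings are illustrative placeholders, not the project's actual identifiers:

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint name; substitute the actual CodonBERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("codonbert-base")

# Extend the vocabulary with organism and amino-acid tokens (names illustrative).
species_tokens = ["[HOMO_SAPIENS]", "[NICOTIANA_TABACUM]", "[MUS_MUSCULUS]",
                  "[DANIO_RERIO]", "[ARABIDOPSIS_THALIANA]"]
amino_acid_tokens = [f"[AA_{aa}]" for aa in "ACDEFGHIKLMNPQRSTVWY"]
tokenizer.add_tokens(species_tokens + amino_acid_tokens)
# Remember to call model.resize_token_embeddings(len(tokenizer)) after extending.

def build_input(species_token, codons, amino_acids):
    """Builds: [CLS] [SPECIES] codons [SEP] amino acids [SEP]."""
    codon_part = f"{species_token} {' '.join(codons)}"
    aa_part = " ".join(f"[AA_{aa}]" for aa in amino_acids)
    # The tokenizer inserts [CLS] and the two [SEP] tokens automatically.
    return tokenizer(codon_part, aa_part)

encoding = build_input("[HOMO_SAPIENS]", ["ATG", "GCT", "AAA"], "MAK")
```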
Training data is derived from the CodonTransformer dataset, focusing on the five most represented organisms:
- Homo sapiens: 111,997 sequences
- Nicotiana tabacum: 69,642 sequences
- Mus musculus: 68,447 sequences
- Danio rerio: 47,006 sequences
- Arabidopsis thaliana: 40,558 sequences
The data was split as follows (see the sampling sketch after this list):
- Training: 80,000 entries (20k per organism)
- Validation: 20,000 entries (4k per organism)
- Test: 30,000 entries (randomly sampled)
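One possible way to reproduce the per-organism splits with pandas. The file name and column names are assumptions; the per-organism counts mirror the figures listed above:

```python
import pandas as pd

# Assumes one row per sequence and an "organism" column
# (derived from the CodonTransformer dataset); names are illustrative.
df = pd.read_csv("codon_transformer_top5.csv")

train_parts, val_parts = [], []
for organism, group in df.groupby("organism"):
    group = group.sample(frac=1.0, random_state=42)   # shuffle within each organism
    train_parts.append(group.iloc[:20_000])           # 20k training entries per organism
    val_parts.append(group.iloc[20_000:24_000])       # 4k validation entries per organism

train_df = pd.concat(train_parts)
val_df = pd.concat(val_parts)

# Test set: 30k entries sampled at random from the remaining data.
remaining = df.drop(index=train_df.index.union(val_df.index))
test_df = remaining.sample(n=30_000, random_state=42)
```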
The model was trained with a masked language modelling (MLM) objective using the following hyperparameters (a training sketch follows the list):
- Epochs: 15
- Learning Rate: 2.41e-4
- Batch Size: 32
- Weight Decay: 0.193
- Warmup Ratio: 0.1
- Scheduler: Linear
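A sketch of the MLM training setup with the Hugging Face `Trainer`, using the hyperparameters listed above. The masking probability, output directory, and the `tokenizer`/`model`/dataset objects are assumptions, not values taken from the repository:

```python
from transformers import (Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)

# `tokenizer`, `model`, `train_dataset`, and `eval_dataset` are assumed to be
# defined elsewhere (e.g. the extended-vocabulary tokenizer and CodonBERT model).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # standard BERT masking rate; assumed, not stated in the README
)

args = TrainingArguments(
    output_dir="multi-species-codon-transformer",
    num_train_epochs=15,
    learning_rate=2.41e-4,
    per_device_train_batch_size=32,
    weight_decay=0.193,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```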
Final loss and perplexity:
- Training Loss: 2.153
- Validation Loss: 2.140
- Training Perplexity: 8.61
- Validation Perplexity: 8.50
- Test Perplexity: 8.52
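The reported perplexities are simply the exponential of the corresponding cross-entropy losses:

```python
import math

train_ppl = math.exp(2.153)  # ~ 8.61
val_ppl = math.exp(2.140)    # ~ 8.50
print(f"train perplexity = {train_ppl:.2f}, validation perplexity = {val_ppl:.2f}")
```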
This project is licensed under the MIT License - see the LICENSE file for details.
- Built upon CodonBERT by Hallee et al.
- Dataset derived from CodonTransformer repository
- Inspired by recent advances in protein language modeling and codon optimization