
ArcInstitute/evo2


Evo 2: DNA modeling and design across all life's domains

Evo 2 is a state-of-the-art DNA language model for long-context modeling and design. Evo 2 uses the Striped Hyena 2 architecture and was pretrained using Savanna on 2,048 GPUs. Evo 2 models DNA sequences at single-nucleotide resolution with a context length of up to 1 million base pairs. It is trained autoregressively on OpenGenome2, a dataset of 8.8 trillion tokens spanning all domains of life.

We describe Evo 2 in the preprint: "Genome modeling and design across all domains of life with Evo 2".

Contents

Setup

Requirements

Evo 2 is based on Striped Hyena 2. A CUDA-capable system is required to build and install the prerequisites. Evo 2 uses FlashAttention-2, which may not work on all GPU architectures. Please consult the FlashAttention GitHub repository for the current list of supported GPUs.

Installation

Follow the commands below to install.

conda create -n evo2 python=3.12 -y && conda activate evo2
git clone https://github.com/arcinstitute/evo2.git
cd evo2/
git submodule init vortex
git submodule update vortex
pip install .

After installation, verify that it succeeded by running the test script:

python ./test/test_evo2.py --model_name evo2_7b

Checkpoints

We provide the following model checkpoints, hosted on HuggingFace:

Checkpoint Name   Description
evo2_40b          Pretrained with a 1-million-token context, obtained through context extension of evo2_40b_base.
evo2_7b           Pretrained with a 1-million-token context, obtained through context extension of evo2_7b_base.
evo2_40b_base     Pretrained with an 8,192-token context length.
evo2_7b_base      Pretrained with an 8,192-token context length.
evo2_1b_base      A smaller model pretrained with an 8,192-token context length.

To use Evo 2 40B, you will need multiple GPUs. Vortex automatically handles device placement, splitting the model across all available CUDA devices.
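To see why the 40B model needs multiple devices, a rough, unofficial back-of-envelope estimate (the byte-per-parameter figure and GPU size below are illustrative assumptions, not published numbers):

```python
# Rough, unofficial estimate: a 40B-parameter model stored in bf16 needs
# about 2 bytes per parameter just for the weights, before activations
# and inference state.
params = 40e9
bytes_per_param = 2                     # bf16
weights_gb = params * bytes_per_param / 1e9

# ~80 GB of weights alone: with activation and inference-state overhead,
# this exceeds a single 80 GB accelerator, hence the multi-GPU requirement.
print(f"{weights_gb:.0f} GB of weights")
```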

Usage

Below are simple examples of how to download Evo 2 and use it locally in Python.

Inference

Evo 2 can be used to score the likelihood of DNA sequences:

import torch
from evo2 import Evo2

evo2_model = Evo2('evo2_40b')
evo2_model.model.eval()

sequence = 'ACGT'
input_ids = torch.tensor(
    evo2_model.tokenizer.tokenize(sequence),
    dtype=torch.int,
).unsqueeze(0).to('cuda:0')

outputs, _ = evo2_model(input_ids)
logits = outputs[0]

print('Logits: ', logits)
print('Shape (batch, length, vocab): ', logits.shape)
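The logits above can be turned into a sequence log-likelihood by summing the log-probability of each observed next token. The sketch below shows that standard computation in pure Python on a toy logits array; the toy vocabulary, logits, and token IDs are placeholders, not Evo 2's actual tokenizer mapping or tensor shapes.

```python
import math

def sequence_log_likelihood(logits, token_ids):
    """Sum log P(token_{t+1} | tokens_<=t) from next-token logits.

    logits: per-position lists over the vocab (position t predicts token t+1)
    token_ids: the tokenized sequence being scored
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        # Numerically stable log-softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += row[token_ids[t + 1]] - log_z
    return total

# Toy example with a 4-symbol vocabulary: each position puts the highest
# logit on the token that actually comes next.
toy_logits = [[2.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 0.0]]
ll = sequence_log_likelihood(toy_logits, [0, 0, 1, 2])
print(ll)
```

Higher (less negative) values mean the model finds the sequence more plausible, which is the basis for variant-effect scoring with likelihoods.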

Generation

Evo 2 can generate DNA sequences from prompts:

from evo2 import Evo2

evo2_model = Evo2('evo2_7b')
evo2_model.model.eval()

output = evo2_model.generate(prompt_seqs=["ACGT"], n_tokens=400, temperature=1.0, top_k=4)
print(output.sequences[0])
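The `temperature` and `top_k` arguments control how the next token is drawn from the model's distribution. The sketch below illustrates what top-k sampling with temperature does in plain Python; it is a generic illustration, not Evo 2's internal sampler.

```python
import math
import random

def sample_top_k(logits, k=4, temperature=1.0, rng=None):
    """Illustrative top-k sampling: keep the k highest logits,
    rescale by temperature, then draw from the renormalized distribution."""
    rng = rng or random.Random()
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights over the surviving candidates (numerically stable).
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return top[rng.choices(range(len(top)), weights=weights, k=1)[0]]

# With k=1 this reduces to greedy decoding: always the argmax.
print(sample_top_k([5.0, 1.0, 0.0, 0.0], k=1))
```

Lower temperatures sharpen the distribution toward the top candidates; `top_k=4` restricts sampling to the four most likely tokens, which for a DNA vocabulary roughly means the four nucleotides.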

Embeddings

Evo 2 embeddings can be extracted and saved for downstream use:

import torch
from evo2 import Evo2

evo2_model = Evo2('evo2_7b')
evo2_model.model.eval()

sequence = 'ACGT'
input_ids = torch.tensor(
    evo2_model.tokenizer.tokenize(sequence),
    dtype=torch.int,
).unsqueeze(0).to('cuda:0')

layer_name = 'blocks.28.mlp.l3'

outputs, embeddings = evo2_model.forward(input_ids, return_embeddings=True, layer_names=[layer_name])

print('Embeddings shape: ', embeddings[layer_name].shape)
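Per-token embeddings like the ones above are often reduced to a single fixed-length vector before being fed to a downstream classifier. A minimal mean-pooling sketch, shown here in pure Python on a toy length-by-dimension list (the real embeddings returned above are PyTorch tensors, where `tensor.mean(dim=1)` does the same thing):

```python
def mean_pool(token_embeddings):
    """Average a (length x dim) list of per-token embeddings
    into a single dim-length sequence vector."""
    length = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(row[j] for row in token_embeddings) / length
            for j in range(dim)]

# Two tokens, two embedding dimensions.
print(mean_pool([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```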

Dataset

The OpenGenome2 dataset used for pretraining Evo 2 is available on Hugging Face Datasets. The data is provided either as raw FASTA files or as JSONL files that include preprocessing and data augmentation.
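JSONL files store one JSON record per line, so they can be streamed without loading the whole file. A minimal reading sketch; the `"text"` field name below is a hypothetical placeholder, since the actual OpenGenome2 record layout may differ:

```python
import io
import json

# Stand-in for an open file handle; each line is one JSON record.
# The "text" field name is an assumption, not the documented schema.
raw = io.StringIO('{"text": "ACGTACGT"}\n'
                  '{"text": "TTGGCCAA"}\n')
sequences = [json.loads(line)["text"] for line in raw]
print(sequences)
```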

Training Code

Evo 2 was trained using Savanna, an open-source framework for training alternative architectures.

Citation

If you find these models useful for your research, please cite the relevant papers:

TODO
