This repository contains the code for the paper "DivPro: Diverse Protein Sequence Design with Direct Structure Recovery Guidance".
We provide the model checkpoint in the folder model_ckpt and the TS50 and TS500 datasets in the folder data. The CATH 4.2 dataset can be downloaded from http://people.csail.mit.edu/ingraham/graph-protein-design/data/. To run inference on CATH, place the downloaded chain_set.jsonl and chain_set_splits.json files in the folder data.
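Before running inference on CATH, it can help to confirm that the two downloaded files are in place and readable. The snippet below is a minimal sketch, not part of this repository: the helper name load_split is hypothetical, and it assumes that chain_set_splits.json maps split names such as "test" to lists of chain names and that each line of chain_set.jsonl is a JSON object with a "name" field. Adjust the keys if your copy of the files differs.

```python
import json
from pathlib import Path


def load_split(data_dir, split="test"):
    """Return the chain_set.jsonl entries whose name is listed in the given split.

    Assumed layout (not confirmed by this README):
      chain_set_splits.json -> {"train": [...], "validation": [...], "test": [...]}
      chain_set.jsonl       -> one JSON object per line with a "name" field
    """
    data_dir = Path(data_dir)
    with open(data_dir / "chain_set_splits.json") as f:
        wanted = set(json.load(f)[split])

    entries = []
    with open(data_dir / "chain_set.jsonl") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            if entry["name"] in wanted:
                entries.append(entry)
    return entries


if __name__ == "__main__":
    data_dir = Path("data")
    required = ["chain_set.jsonl", "chain_set_splits.json"]
    missing = [name for name in required if not (data_dir / name).exists()]
    if missing:
        print("Missing files in data/:", ", ".join(missing))
    else:
        print("Test-set structures found:", len(load_split(data_dir)))
```

If the script reports missing files, download them from the URL above and place them in data before running infer.py on CATH.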
pip install torch torchvision torchaudio
pip install tqdm
We recommend running on Linux systems.
Run the following command to start designing sequences.
python infer.py <dataset>
The <dataset> argument can be CATH (the CATH 4.2 test set), ts50, or ts500. The script samples 5 sequences for each structure in the dataset and prints the results for the first structure as a demonstration. An example output from running python infer.py ts50:
3a4rA
Native sequence:
GPLGSQELRLRVQGKEKHQMLEISLSPDSPLKVLMSHYEEAMGLSGHKLSFFFDGTKLSGKELPADLGLESGDLIEVWG
Generated sequences:
GPLGSTPIKITVKGNKPDDVLTLDLPPTAPLETVIKEVQKALGLEGAELTFYYNGKKLTGTEYPADLGLKSGDTITIEG
GSLGSKPIKVTVKGDKPDDVLELELEPTAKLKELKEAFLEALGLKGKDLKFYYNGKELTGDEYPEDLGLKDGDTITVKG
GPLGDEPIRVTVRGDKPDDVVTVELRPDEPLAALMAEFQAALGKEGADLTFYYKGKRLSGEELPADLGLKDGDTVTVEG
GSLGSKPIKVTVRGEKKDDVVEVDLAPSAPLKHLIDKFQEALGKKGKDLKFYYNGKELTGSELPSDLGLKSGDVIEVKG
GPLGSTPITLTVVGEDASDVLTITLSPTAPLATVIDAFQEALGLKGADLTFYYNGKKLSGSELPADLGLKSGDTITVTG
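To get a quick feel for how close the generated sequences are to the native one, and how much they differ from each other, you can compare them position by position. The snippet below is a minimal sketch and not part of this repository: the function names recovery and pairwise_diversity are hypothetical, recovery is taken here as the fraction of positions where a generated sequence matches the native one, and diversity as the mean fraction of differing positions over all pairs of generated sequences. The exact metric definitions used in the paper may differ.

```python
from itertools import combinations


def recovery(native, generated):
    """Fraction of positions where the generated residue equals the native one."""
    if len(native) != len(generated):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(native, generated))
    return matches / len(native)


def pairwise_diversity(sequences):
    """Mean fraction of differing positions over all pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(1.0 - recovery(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Native sequence and the first two generated sequences from the example output.
    native = "GPLGSQELRLRVQGKEKHQMLEISLSPDSPLKVLMSHYEEAMGLSGHKLSFFFDGTKLSGKELPADLGLESGDLIEVWG"
    generated = [
        "GPLGSTPIKITVKGNKPDDVLTLDLPPTAPLETVIKEVQKALGLEGAELTFYYNGKKLTGTEYPADLGLKSGDTITIEG",
        "GSLGSKPIKVTVKGDKPDDVLELELEPTAKLKELKEAFLEALGLKGKDLKFYYNGKELTGDEYPEDLGLKDGDTITVKG",
    ]
    for seq in generated:
        print(f"recovery: {recovery(native, seq):.3f}")
    print(f"pairwise diversity: {pairwise_diversity(generated):.3f}")
```

Both functions operate on plain strings, so the printed output of infer.py can be pasted in directly.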