Official implementation of the paper "Multi-purpose RNA Language Modeling with Motif-aware Pre-training and Type-guided Fine-tuning" with PaddlePaddle.
This repository contains code and pre-trained models for RNAErnie, which leverages RNA motifs as biological priors and proposes a motif-level random masking strategy to enhance pre-training tasks. Furthermore, RNAErnie improves sequence classification, RNA-RNA interaction prediction, and RNA secondary structure prediction by fine-tuning or adapting on downstream tasks with two-stage type-guided learning. Our paper will be published soon.
- 2024.06.26: Since most researchers prefer transformers and PyTorch as the backend, I ported this work to transformers and trained a PyTorch model from scratch. The new model is trained with stronger settings: the maximum model length is now 2048, and the pre-training dataset is the newest version of RNAcentral, which contains about 31 million RNA sequences after length filtering (<2048). The PyTorch model has been uploaded to Hugging Face at https://huggingface.co/WANGNingroci/RNAErnie, and the training framework/tokenization is located at https://github.com/CatIIIIIIII/RNAErnie2. (NOTE: the tokenization differs slightly from the original Paddle implementation.) Moreover, MultiMolecule is implementing the current most powerful RNA language models with transformers and PyTorch. Our model can also be accessed at https://huggingface.co/multimolecule/rnaernie.
- 2024.05.13: 🎉🎉 Our paper has been published at https://www.nature.com/articles/s42256-024-00836-4.
- 2024.04.20: 🎉🎉 RNAErnie has been accepted by Nature Machine Intelligence! The paper will be released soon.
- 2024.03.21: Add DOI and citation.
- 2024.01.26: Add ad-hoc pre-training with additional classification task.
- 2024.01.23: Integrate AUC metric in base_classes.py for simpler usage; add content and update log sections in README.md.
If you have any questions, feel free to contact us by email: [email protected].
First, download the repository and create the environment.
git clone https://github.com/CatIIIIIIII/RNAErnie.git
cd ./RNAErnie
conda env create -f environment.yml
Then, activate the "RNAErnie" environment.
conda activate RNAErnie
Or you could use the Docker image instead.
First clone the repository:
git clone https://github.com/CatIIIIIIII/RNAErnie.git
Here we provide two ways to load the Docker image.
[Option1] You can directly access the Docker image using this link:
https://hub.docker.com/r/nwang227/rnaernie
After signing in to Docker, you can pull the image using the following command:
sudo docker pull nwang227/rnaernie:1.1
NOTE:
- If you encounter the error "unauthorized: authentication required", you have not logged in to your Docker account to access Docker Hub.
  - Sign up for a Docker account.
  - Log in with: sudo docker login -u username --password-stdin
  - Then try to pull the image again.
- If you encounter the error "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?", the Docker service has not been started.
  - Start the Docker service with: systemctl start docker
  - Then try to run the container again.
[Option2] Or you can download the image tar from Google Drive using the following URL:
https://drive.google.com/file/d/1Lkgw7w9xGZQ02PnU3yk0cn1V9om2yfd3
and load it with:
sudo docker load --input rnaernie-1.1.tar
Run the container with the data volume mounted:
sudo docker run --gpus all --name rnaernie_docker -it -v $PWD/RNAErnie:/home/ nwang227/rnaernie:1.1 /bin/bash
TODO: Due to a Python version conflict, the RNA secondary structure prediction task is not available in the Docker image. We will fix this in the future.
You can download my selected (nts<512) pretraining dataset from Google Drive or from RNAcentral and place the .fasta files in the ./data/pre_random folder.
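If you start from a raw RNAcentral download rather than the pre-selected files, the length selection mentioned above can be approximated with a few lines of Python. This is a minimal sketch, not the script used to build the released dataset: `read_fasta` and `filter_by_length` are hypothetical helper names, and the strict `<` comparison is an assumption taken from the "nts<512" wording.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, max_len: int = 512) -> List[Tuple[str, str]]:
    """Keep records whose sequence is strictly shorter than max_len (assumed cutoff)."""
    return [(h, s) for h, s in records if len(s) < max_len]


# Toy input: one short record split over two lines, one record that is too long.
example = [">short", "ACGU", "ACGU", ">long", "A" * 600]
kept = filter_by_length(read_fasta(example), max_len=512)
```

To filter a real file, pass an open file handle (for example `open("input.fasta")`) to `read_fasta` in place of the toy list.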
Then, you can use the following command to generate the pre-training data:
Pretrain RNAErnie on selected RNAcentral datasets (nts<=512) with the following command:
python run_pretrain.py \
    --output_dir=./output \
    --per_device_train_batch_size=50 \
    --learning_rate=0.0001 \
    --save_steps=1000
To use multi-GPU training, launch the script through paddle.distributed.launch:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch run_pretrain.py
where CUDA_VISIBLE_DEVICES specifies the GPU ids you want to use.
Our pre-trained model with BERT, ERNIE and MOTIF masking strategies can be downloaded from Google Drive. Place the .pdparams and .json files in the ./output/BERT,ERNIE,MOTIF,PROMPT folder.
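To give an intuition for the MOTIF strategy named above, the toy sketch below masks one motif occurrence as a whole span instead of masking isolated bases. It is an illustration only and not the repository's implementation: the function name `motif_mask`, the motif list, and the rule "pick one occurrence uniformly at random" are all assumptions, and it assumes single-nucleotide tokens (as with k_mer=1) so that string offsets equal token indices.

```python
import random
import re
from typing import List, Optional, Sequence


def motif_mask(tokens: Sequence[str],
               motifs: Sequence[str],
               mask_token: str = "[MASK]",
               rng: Optional[random.Random] = None) -> List[str]:
    """Return a copy of tokens with one randomly chosen motif occurrence masked.

    Assumes every token is a single nucleotide character.
    If no motif occurs, the tokens are returned unchanged.
    """
    rng = rng or random.Random()
    seq = "".join(tokens)
    spans = []
    for motif in motifs:
        # The lookahead lets us find overlapping occurrences as well.
        for match in re.finditer("(?=" + re.escape(motif) + ")", seq):
            spans.append((match.start(), match.start() + len(motif)))
    if not spans:
        return list(tokens)
    start, end = rng.choice(spans)
    # Mask every position inside the chosen span, keep the rest.
    return [mask_token if start <= i < end else tok
            for i, tok in enumerate(tokens)]


# "AUG" occurs at positions 2-4 and 6-8; exactly one occurrence gets masked.
masked = motif_mask(list("GGAUGCAUGC"), ["AUG"], rng=random.Random(0))
```

Masking the span as a unit forces the model to predict the whole motif from its context, which is the general idea behind span-level masking.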
You can visualize the pre-training process with the following command:
visualdl --logdir ./output/BERT,ERNIE,MOTIF,PROMPT/runs/you_date/
Then you can extract embeddings of given RNA sequences or from a .fasta file with the following code:
import paddle
from rna_ernie import BatchConverter
from paddlenlp.transformers import ErnieModel

# ========== Set device
paddle.set_device("gpu")

# ========== Prepare Data
data = [
    ("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
    ("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
]
# data = "./data/ft/seq_cls/nRC/test.fa"

# ========== Batch Converter
batch_converter = BatchConverter(k_mer=1,
                                 vocab_path="./data/vocab/vocab_1MER.txt",
                                 batch_size=256,
                                 max_seq_len=512)

# ========== RNAErnie Model
rna_ernie = ErnieModel.from_pretrained("output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final/")
rna_ernie.eval()

# call batch_converter to convert sequences to batch inputs
for names, _, inputs_ids in batch_converter(data):
    with paddle.no_grad():
        # extract whole sequence embeddings
        embeddings = rna_ernie(inputs_ids)[0].detach()
        # extract [CLS] token embedding
        embeddings_cls = embeddings[:, 0, :]
You can download training data from Google Drive and place them in the ./data/ft/seq_cls folder. Three datasets (nRC, lncRNA_H, lncRNA_M) are available for this task.
Fine-tune RNAErnie on RNA sequence classification task with the following command:
python run_seq_cls.py \
    --dataset=nRC \
    --dataset_dir=./data/ft/seq_cls \
    --model_name_or_path=./output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final \
    --train=True \
    --batch_size=50 \
    --num_train_epochs=100 \
    --learning_rate=0.0001 \
    --output=./output_ft/seq_cls
Moreover, to train on long ncRNA classification tasks, change the --dataset argument to lncRNA_M or lncRNA_H. You can also add the --use_chunk=True argument to chunk and ensemble the whole sequence.
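The "chunk and ensemble" idea can be pictured as follows: a sequence longer than the model's maximum length is split into windows, each window is classified separately, and the per-window predictions are combined. The sketch below is one possible reading, not the behavior of --use_chunk itself: non-overlapping windows and simple probability averaging are assumptions, and `chunk_sequence`, `ensemble_predict`, and `toy_predict_proba` are hypothetical names (the last one is a stand-in for a real classifier).

```python
from typing import Callable, List, Tuple


def chunk_sequence(seq: str, max_len: int = 512) -> List[str]:
    """Split seq into consecutive, non-overlapping windows of at most max_len."""
    # An empty sequence still yields one (empty) chunk so callers get a prediction.
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)] or [seq]


def ensemble_predict(seq: str,
                     predict_proba: Callable[[str], List[float]],
                     max_len: int = 512) -> Tuple[int, List[float]]:
    """Average per-chunk class probabilities; return (predicted class, averaged probs)."""
    probs = [predict_proba(chunk) for chunk in chunk_sequence(seq, max_len)]
    n_classes = len(probs[0])
    avg = [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg


def toy_predict_proba(chunk: str) -> List[float]:
    """Stand-in classifier: probability of class 1 is the G/C fraction of the chunk."""
    gc = sum(ch in "GC" for ch in chunk) / max(len(chunk), 1)
    return [1.0 - gc, gc]


# A 900-nt toy sequence is split into a 512-nt chunk and a 388-nt chunk.
label, avg = ensemble_predict("G" * 600 + "A" * 300, toy_predict_proba, max_len=512)
```

Averaging treats every chunk equally regardless of its length; weighting by chunk length or using overlapping windows are equally plausible alternatives.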
To use two-stage fine-tuning, you can add the --two_stage=True argument.
Or you could download our weights of RNAErnie on sequence classification tasks from Google Drive and place them in the ./output_ft/seq_cls folder.
Then you could evaluate the performance with the following command:
python run_seq_cls.py \
    --dataset=nRC \
    --dataset_dir=./data/ft/seq_cls \
    --model_name_or_path=./output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final \
    --model_path=./output_ft/seq_cls/nRC/BERT,ERNIE,MOTIF,PROMPT/model_state.pdparams \
    --train=False \
    --batch_size=50
To evaluate the two-stage procedure, you can add the --two_stage=True argument and change the --model_path to ./output_ft/seq_cls/nRC/BERT,ERNIE,MOTIF,PROMPT,2.
You can download training data from Google Drive and place them in the ./data/ft/rr_inter folder.
Fine-tune RNAErnie on RNA-RNA interaction task with the following command:
python run_rr_inter.py \
    --dataset=MirTarRAW \
    --dataset_dir=./data/ft/rr_inter \
    --model_name_or_path=./output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final \
    --train=True \
    --batch_size=256 \
    --num_train_epochs=100 \
    --lr=0.001 \
    --output=./output_ft/rr_inter
Or you could download our weights of RNAErnie on RNA-RNA interaction tasks from Google Drive and place them in the ./output_ft/rr_inter folder.
Then you could evaluate the performance with the following codes:
python run_rr_inter.py \
    --dataset=MirTarRAW \
    --dataset_dir=./data/ft/rr_inter \
    --model_name_or_path=./output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final \
    --model_path=./output_ft/rr_inter/MirTarRAW/BERT,ERNIE,MOTIF,PROMPT \
    --train=False \
    --batch_size=256
You can download training data from Google Drive, then unzip and place them in the ./data/ft/ssp folder. Two tasks (RNAStrAlign-ArchiveII, bpRNA1m) are available for this task.
Adapt RNAErnie on RNA secondary structure prediction task with the following command:
python run_ssp.py \
    --task_name=RNAStrAlign \
    --dataset_dir=./data/ft/ssp \
    --model_name_or_path=./output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final \
    --train=True \
    --num_train_epochs=50 \
    --lr=0.001 \
    --output=./output_ft/ssp
Note: we use interface.*.so compiled from mxfold2. If your system cannot run the interface.*.so file, you could download the source code from here and compile it yourself. Then copy the generated interface.*.so file to the ./ path.
Or you could download our weights of RNAErnie on RNA secondary structure prediction tasks from Google Drive and place them in the ./output_ft/ssp folder.
Then you could evaluate the performance with the following codes:
python run_ssp.py \
    --task_name=RNAStrAlign \
    --dataset_dir=./data/ft/ssp \
    --train=False
We also implement other BERT-like large-scale pre-trained RNA language models for comparison, see here: https://github.com/CatIIIIIIII/RNAErnie_baselines.
If you use the code or the data for your research, please cite our paper as follows:
@Article{Wang2024,
author={Wang, Ning
and Bian, Jiang
and Li, Yuchen
and Li, Xuhong
and Mumtaz, Shahid
and Kong, Linghe
and Xiong, Haoyi},
title={Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning},
journal={Nature Machine Intelligence},
year={2024},
month={May},
day={13},
issn={2522-5839},
doi={10.1038/s42256-024-00836-4},
url={https://doi.org/10.1038/s42256-024-00836-4}
}
We pre-trained the model from scratch with an additional classification head appended to the '[CLS]' token. The total loss function is
The pre-trained model can be downloaded from Google Drive. Place the .pdparams and .json files in the ./output/BERT,ERNIE,MOTIF,ADHOC folder. Moreover, the original pre-trained RNAErnie weights at 1 epoch can be obtained from Google Drive.
