A novel attention-based peptide language model for DPP-IV inhibitory peptide identification. This is a model for recognizing and predicting DPP-IV inhibitory peptides, based on BERT, which was proposed by Google AI. We pre-train a BERT model on a large collection of protein sequences downloaded from UniProt, fine-tune it on a DPP-IV dataset, and evaluate its performance.
The pre-trained model can be downloaded from the link below:
https://drive.google.com/file/d/1lR0DslcUL3Ui3JA6ufgN8_LugIs6EV1I/view?usp=sharing.
The fine-tuned model can be downloaded from the link below:
https://drive.google.com/file/d/1qmnBZ7AvJfXSMersmAPge5Xe3NCE-ygL/view?usp=sharing.
Then decompress these model files and put them in the root of the project, using the Linux tar command, for example:
tar -zxvf fine_tune_model.tar.gz
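The pre-trained model archive is extracted the same way; assuming the downloaded archive is named pre_train_model.tar.gz (check the actual file name after downloading):
tar -zxvf pre_train_model.tar.gz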
If you want to pre-train the BERT model yourself, download the pre-training data from the link below:
https://drive.google.com/file/d/1QeXWV5_OIKgms7u5ShNfvPEKPBLOC3UT/view?usp=sharing.
Then put the data into the pre_train_data folder.
Use the script pretrain_data_creating.sh to preprocess the pre-training data:
./pretrain_data_creating.sh
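pretrain_data_creating.sh converts the raw protein sequences into BERT's pre-training input format. As a rough illustration of the k-mer word segmentation implied by the vocab_1kmer.txt-style vocabulary files (the function below is a sketch, not part of the repository):

def kmer_tokenize(sequence, k=1):
    # Split a peptide/protein sequence into overlapping k-mers;
    # with k=1 every amino acid becomes its own token.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(" ".join(kmer_tokenize("IPAVF", k=1)))  # I P A V F
print(" ".join(kmer_tokenize("IPAVF", k=2)))  # IP PA AV VF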
If you don't want to spend time preprocessing the pre-training data, you can download our preprocessed data from the link below:
https://drive.google.com/file/d/1dDoWLFAfd4JL3vZfUX-Tb4LLtlOjpRJz/view?usp=sharing.
After preprocessing the data, you can use the script pre_train.sh to pre-train the BERT model:
./pre_train.sh
When you are ready to fine-tune the model, run the following command. The positive and negative files are FASTA files (a small example is shown after the command).
python fine_tune_model.py \
--do_eval True \
--do_save_model True \
--data_name DPP-IV \
--batch_size 32 \
--num_train_epochs 50 \
--warmup_proportion 0.1 \
--learning_rate 2e-6 \
--using_tpu False \
--seq_length 128 \
--data_root ./Fine_tune_data/1kmer_data/ \
--positive_file train-positive.txt \
--negative_file train-negative.txt \
--kmer 1 \
--vocab_file ./vocab/vocab_1kmer.txt \
--init_checkpoint ./model/1kmer_model/model.ckpt \
--bert_config ./bert_config_1.json \
--save_path ./fine_tune_model/1kmer_fine_tune_model/model.ckpt
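Both the positive and negative files are plain FASTA files. For example, a positive training file might look like this (the identifiers and sequences below are illustrative only):
>positive_1
IPAVF
>positive_2
LPQNIPPL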
The meaning of each parameter is as follows; change them according to your needs. You can also open fine_tune_model.py and edit these parameters directly.
do_eval: whether to evaluate the model after training
do_save_model: whether to save the model after training
data_name: the name of the training set
batch_size: batch size
num_train_epochs: number of training epochs
warmup_proportion: proportion of training steps used for learning-rate warmup (see the sketch after this list)
learning_rate: learning rate
using_tpu: whether to use a TPU
seq_length: sequence length
data_root: the location of the training set to be used
positive_file: the name of the file containing the positive training data
negative_file: the name of the file containing the negative training data
kmer: the type of word segmentation (k-mer size)
vocab_file: location of the vocabulary file
init_checkpoint: the checkpoint from which the model is initialized
bert_config: the BERT configuration file
save_path: where to save the fine-tuned model
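For reference, in the standard BERT fine-tuning setup warmup_proportion is converted into a number of warmup steps from the dataset size, batch size, and epoch count. A minimal sketch, assuming this project follows BERT's usual convention:

num_train_examples = 1000                            # size of your training set (example value)
num_train_steps = int(num_train_examples / 32 * 50)  # batch_size = 32, num_train_epochs = 50
num_warmup_steps = int(num_train_steps * 0.1)        # warmup_proportion = 0.1
print(num_train_steps, num_warmup_steps)             # 1562 156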
You can predict your own peptide data with the following command:
python BERT_DPPIV_prediction.py
You should change the code in rows 140-145 of BERT_DPPIV_prediction.py according to your needs:
data_name: location of the testing set
out_file: storage location of the test results
model_path: the location of the trained model
kmer: the type of word segmentation
config_file: the BERT configuration file
vocab_file: location of the vocabulary file
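For example, the block in rows 140-145 might be edited as follows (the test-data and output paths below are illustrative; adapt them to your own layout):

data_name = './test_data/test.fasta'        # location of the testing set
out_file = './results/predictions.txt'      # storage location of the test results
model_path = './fine_tune_model/1kmer_fine_tune_model/model.ckpt'  # trained model
kmer = 1                                    # type of word segmentation
config_file = './bert_config_1.json'        # BERT configuration file
vocab_file = './vocab/vocab_1kmer.txt'      # vocabulary file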