Repository for the code used in the article:
A memetic algorithm enables global all-atom protein-protein docking with sidechain flexibility
- PyRosetta==4
- numpy>=1.21.0
- pandas>=1.3.4
- scipy>=1.7.1
- seaborn>=0.11.2
- setuptools>=44.0.0
- imageio>=2.10.1
- matplotlib>=3.4.3
This package is only compatible with Python 3.4 and above. To install this package, please follow the instructions below:
- Install the dependencies listed above
- Download and install PyRosetta following the instructions found at http://www.pyrosetta.org/dow
- Install the package itself:
git clone https://github.com/Andre-lab/evodock.git
cd evodock
pip install -r requirements.txt
or
git clone https://github.com/Andre-lab/evodock.git
cd evodock
python setup.py install
or
pip install git+https://github.com/Andre-lab/evodock.git
The setup.py and environment.yml files are provided for alternative installation with pip or conda.
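After installing, a quick check that the dependencies are importable can save a failed run. The sketch below is not part of the repository; it assumes the usual import names for the packages listed above (PyRosetta is imported as `pyrosetta`), and `missing_modules` is a hypothetical helper name.

```python
import importlib.util
import sys

# Import names assumed for the dependencies listed above
# (PyRosetta is imported as "pyrosetta").
REQUIRED_MODULES = [
    "pyrosetta", "numpy", "pandas", "scipy",
    "seaborn", "setuptools", "imageio", "matplotlib",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 4):
        sys.exit("Python 3.4 or newer is required")
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All dependencies found")
```

This only checks that each module can be located; it does not verify the minimum versions listed above.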
- Preprocess the complex PDB with prepacking
python ./scripts/prepacking.py <input_pdb>
- Create a configuration file following the example found at sample_dock.ini
[Docking]
# selects docking protocol [Global, Local]
type=Global
[Inputs]
# complex pdb
pose_input=/inputs/input_pdb/1ACB/1ACB_c_u_0001.pdb
native_input=/inputs/native_pdb/1ACB/1ACB_c_b.pdb
[Outputs]
# output file log
output_path=sample_dock/
output_pdb=True
[DE]
# evolution algorithm parent strategy [RANDOM, BEST]
scheme=BEST
# population size
popsize=10
# mutation rate (weight factor F)
mutate=0.9
# crossover probability (CR)
recombination=0.3
# maximum number of generations/iterations (stopping criteria)
maxiter=10
# hybrid local search strategy [None, only_slide, mcm_rosetta]
local_search=mcm_rosetta
Information about the DE parameters can be found at https://en.wikipedia.org/wiki/Differential_evolution
- Run the algorithm with the desired configuration
python evodock.py configs/sample_dock_global.ini
or
python -m evodock configs/sample_dock_global.ini
The files configs/sample_dock_global.ini, configs/sample_dock_flexbb.ini and configs/sample_dock_refinement.ini contain configuration examples for Global Docking, Flexible Backbone Docking and Global Docking with an initial population, respectively.
pose_input is the path to a complex with two chains that has previously been preprocessed with a prepack protocol in order to fix possible sidechain collisions. A script for this is provided in the scripts folder.
output_path indicates the output folder for the results in .csv format. output_pdb is a boolean that controls whether PDBs are dumped during the evolution, together with the final evolved protein.
The "type" option selects between global docking (Global), local docking (Local), flexible backbone docking (Flexbb) and docking from a starting population, such as models from ClusPro (Refinement).
The Differential Evolution parameters ([DE]) that you must change for a production run are popsize (from 10 to 100) and maxiter (from 10 to 100), which leads to an evolution of 100 individuals over 100 iterations/generations. The evolutionary parameters (mutation F and crossover CR) can be fine-tuned for specific purposes, although the values used here (F=0.9 and CR=0.3) have shown a good balance between exploration and exploitation in our benchmark runs. scheme corresponds to the selection strategy for the base vector in the mutation operation (see https://en.wikipedia.org/wiki/Differential_evolution for more details). The local_search parameter can be set to None (only DE is performed), only_slide (the local search operation is equivalent to applying slide_into_contact) or mcm_rosetta (which applies slide_into_contact plus MC energy minimization and sidechain optimization; this is the recommended option and the one used in our benchmarks).
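As an illustration of the production settings described above, the following sketch derives a production [DE] section from the sample values. It assumes the configuration is a standard INI file that Python's configparser can read; `production_config` is a hypothetical helper, not a function shipped with this package.

```python
import configparser
import io

# [DE] section copied from the sample configuration above.
SAMPLE_DE = """
[DE]
scheme=BEST
popsize=10
mutate=0.9
recombination=0.3
maxiter=10
local_search=mcm_rosetta
"""


def production_config(sample_text, popsize=100, maxiter=100):
    """Return the config text with popsize and maxiter raised for production."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(sample_text)
    parser["DE"]["popsize"] = str(popsize)
    parser["DE"]["maxiter"] = str(maxiter)
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


print(production_config(SAMPLE_DE))
```

Only popsize and maxiter are changed; F, CR, scheme and local_search keep the sample values.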
For flexible backbone docking (Flexbb), use path_ligands and path_receptors to indicate the paths of the *.pdb files with the different backbone ensembles.
For Refinement, use init_pdbs to indicate the path of the *.pdb files used as the initial population, e.g. models from ClusPro.
A run produces the following log files:
- evolution*csv is a summary of the evolutionary process: the generation number, the average energy of the population, the lowest energy of the population and the RMSD of the individual with the lowest energy.
- popul*csv is the status of each generation during the evolution. Each line corresponds to the population information of one generation.
- interface*csv is similar to popul*csv, but it reports the interface energy value and the iRMSD for each individual at each generation.
- trials*csv is equivalent to popul*csv, but it reports the trials (candidates) generated during each generation. This is particularly useful if you want to check whether DE+MC is creating proper candidates that can contribute to the evolution.
- time*csv is the computational time (in seconds) for each generation.
- best*csv contains, on each line, the rotation (first 3 values) and translation (3 values) of the individual with the lowest energy value.
python ./scripts/make_scatter_plot.py "<path_to_popul*.csv>"
It creates the global energy vs RMSD plot if the input is a popul*csv file, or the interface energy vs iRMSD plot if the input is an interface*csv file. Each point corresponds to an individual in the last generation. Several *csv files can be specified to collect the results from independent runs, with each color corresponding to one run.
For each evolution*csv file:
python ./scripts/make_evolution_plot.py <path to evolution*.csv>
It creates a line plot where the y-axis corresponds to the global energy function (used as the fitness function during the evolution) and the x-axis corresponds to the generation.
The green line corresponds to the average energy of the population, while the red line corresponds to the lowest energy of the population. A proper evolution should keep the two lines close together, and the average line should follow the trend of the lowest-energy line. That indicates that the population is evolving towards the best-energy individual. If there is a large difference between the two lines, the F and CR parameters should be tuned, for example by decreasing the value of F to reduce the exploration of the algorithm.
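The same check can be done numerically. The sketch below is only an illustration: it takes the two series as plain Python lists rather than reading a file, the function names are hypothetical, and the tolerance is an arbitrary user-chosen value, not a threshold recommended by the authors.

```python
def energy_gap(average, lowest):
    """Per-generation difference between average and lowest population energy."""
    if len(average) != len(lowest):
        raise ValueError("both series need one value per generation")
    return [avg - low for avg, low in zip(average, lowest)]


def is_converging(average, lowest, tolerance):
    """True if the gap in the last generation is within `tolerance`."""
    gaps = energy_gap(average, lowest)
    return gaps[-1] <= tolerance


# Toy numbers for illustration only, not EvoDOCK output.
avg = [-10.0, -14.0, -18.5]
low = [-15.0, -17.0, -19.0]
print(energy_gap(avg, low))
print(is_converging(avg, low, tolerance=1.0))
```

A gap that stays large or grows over the generations corresponds to the situation described above, where F and CR may need tuning.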
Differential Evolution [Price97] is a population-based search method. DE creates new candidate solutions by combining existing ones according to a simple formula of vector crossover and mutation, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand.
- Storn, R., Price, K. Differential Evolution – A Simple and Efficient Heuristic for global Optimization over Continuous Spaces. Journal of Global Optimization 11, 341–359 (1997). https://doi.org/10.1023/A:1008202821328
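To make the description above concrete, here is a minimal, generic DE loop in pure Python. It is a textbook-style sketch, not the implementation used in this package: it has no local search step, it optimizes an arbitrary function over box bounds rather than docking poses, and the clipping to bounds is a choice made for this example. The argument names mirror the [DE] configuration keys for readability only.

```python
import random


def differential_evolution(fitness, bounds, scheme="BEST", popsize=10,
                           mutate=0.9, recombination=0.3, maxiter=10, seed=0):
    """Minimise `fitness` over box `bounds` with a basic DE loop.

    Requires popsize >= 4 so that distinct random individuals can be drawn.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(popsize)]
    scores = [fitness(ind) for ind in population]

    for _ in range(maxiter):
        best_idx = min(range(popsize), key=scores.__getitem__)
        for i in range(popsize):
            others = [j for j in range(popsize) if j != i]
            if scheme == "BEST":
                # Base vector is the best individual at the start of the generation.
                base = population[best_idx]
                r1, r2 = rng.sample(others, 2)
            else:
                # RANDOM: base vector is a randomly chosen individual.
                r0, r1, r2 = rng.sample(others, 3)
                base = population[r0]
            # Mutation: base + F * (difference of two random individuals).
            mutant = [base[k] + mutate * (population[r1][k] - population[r2][k])
                      for k in range(dim)]
            # Keep the mutant inside the search box.
            mutant = [min(max(v, lo), hi) for v, (lo, hi) in zip(mutant, bounds)]
            # Binomial crossover: each gene comes from the mutant with
            # probability CR; one gene is always taken from the mutant.
            forced = rng.randrange(dim)
            trial = [mutant[k] if (rng.random() < recombination or k == forced)
                     else population[i][k] for k in range(dim)]
            trial_score = fitness(trial)
            # Greedy selection: the trial replaces the parent if it is not worse.
            if trial_score <= scores[i]:
                population[i], scores[i] = trial, trial_score

    best_idx = min(range(popsize), key=scores.__getitem__)
    return population[best_idx], scores[best_idx]


def sphere(x):
    """Simple test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, score = differential_evolution(sphere, [(-5.0, 5.0)] * 3,
                                     popsize=20, maxiter=50)
print(best, score)
```

Because selection is greedy, the lowest energy in the population never gets worse from one generation to the next.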