Added Dockerfile, updated README.md with instructions to use docker

padilha · padilha · commit 9e9b9694d5ae · 2020-04-08T17:05:33.000-03:00
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,23 @@
+FROM ubuntu:18.04
+
+SHELL ["/bin/bash", "-c"]
+
+RUN mkdir /home/CRISPRcasIdentifier
+WORKDIR /home/CRISPRcasIdentifier
+COPY *.py ./
+COPY crispr-env.yml ./
+COPY README.md ./
+COPY HMM_sets.tar.gz ./
+COPY trained_models_2015.tar.gz ./
+ADD examples ./examples
+ADD software ./software
+
+RUN apt-get update && \
+    apt-get install -y wget && rm -rf /var/lib/apt/lists/*
+RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+RUN bash Miniconda3-latest-Linux-x86_64.sh -b
+RUN rm Miniconda3-latest-Linux-x86_64.sh
+ENV PATH=/root/miniconda3/bin:$PATH
+
+RUN conda env create -f crispr-env.yml -n crispr-env
+RUN echo "source ~/miniconda3/etc/profile.d/conda.sh && conda activate crispr-env" >> ~/.bashrc
diff --git a/README.md b/README.md
@@ -1,14 +1,64 @@
-## CRISPRcasIdentifier
+# CRISPRcasIdentifier
 
-CRISPRcasIdentifier is an effective machine learning approach for the identification and classification of CRISPR-Cas proteins. It consists of a holistic strategy which allows us to: (i) combine regression and classification approaches for improving the quality of the input protein cassettes and predicting their subtypes with high accuracy; (ii) to detect signature genes for the different subtypes; (iii) to extract several types of information for each protein, such as potential rules that reveal the identity of neighboring genes; and (iv) define a complete and extensible framework able to integrate newly discovered Cas proteins and CRISPR subtypes. We achieve balanced accuracy scores above 0.95 in the classification experiment of CRISPR subtypes, mean absolute error values below 0.05 for the prediction of the normalized bit-score of different Cas proteins and a balanced accuracy of 0.88 in our benchmarking against other available tools.
+CRISPRcasIdentifier is an effective machine learning approach for the identification and classification of CRISPR-Cas proteins. It consists of a holistic strategy that allows us to: (i) combine regression and classification approaches to improve the quality of the input protein cassettes and predict their subtypes with high accuracy; (ii) detect signature genes for the different subtypes; (iii) extract several types of information for each protein, such as potential rules that reveal the identity of neighboring genes; and (iv) define a complete and extensible framework able to integrate newly discovered Cas proteins and CRISPR subtypes. We achieve balanced accuracy scores above 0.95 in the classification experiment of CRISPR subtypes, mean absolute error values below 0.05 for the prediction of the normalized bit-score of different Cas proteins, and a balanced accuracy of 0.89 in our benchmarking against other available tools.
 
-### Requirements
+## Requirements
 
-CRISPRcasIdentifier has been tested with Python 3.7.6. To run it, we recommend installing the same library versions we used. Since we exported our classifiers using [joblib.dump](https://scikit-learn.org/stable/modules/model_persistence.html), it is not guaranteed that they will work properly if loaded using other Python and/or libraries versions. For such, we recommend the use of conda virtual environments, which make it easy to install the correct Python and library dependencies without affecting the whole operating system (see below).
+CRISPRcasIdentifier has been tested with Python 3.7.6. To run it, we recommend installing the same library versions we used. Since we exported our classifiers using [joblib.dump](https://scikit-learn.org/stable/modules/model_persistence.html), they are not guaranteed to work properly if loaded with other Python and/or library versions. For this reason, we recommend using our docker image or a conda virtual environment, either of which makes it easy to install the correct Python and library dependencies without affecting the whole operating system (see below).
 
-### Setting up a virtual environment
+### First step: clone this repository
 
-The easiest way to install the correct python version and its dependencies to run CRISPRcasIdentifier is by using [miniconda](https://docs.conda.io/en/latest/miniconda.html).
+```
+git clone https://github.com/BackofenLab/CRISPRcasIdentifier.git
+```
+
+### Second step: download the Hidden Markov (HMM) and Machine Learning (ML) models
+
+Due to GitHub's file size constraints, we made our HMM and ML models available on Google Drive. You can download them [here](https://drive.google.com/file/d/166bh1sAjoB9kW5pn8YrEuEWrsM2QDV78/view?usp=sharing) and [here](https://drive.google.com/file/d/1ZOR1e-wIb_rxtCiU3OaBVdrHrup1svq3/view?usp=sharing). Save both tar.gz files (`HMM_sets.tar.gz` and `trained_models_2015.tar.gz`) inside CRISPRcasIdentifier's folder. It is not necessary to extract them, since the tool will do that the first time it is run.
+
+Next, choose one of the two alternatives for the third step: docker or conda.
+
+### Third step (docker)
+
+The easiest way to run CRISPRcasIdentifier is by using docker (please refer to its [installation guideline](https://docs.docker.com/get-docker/) for details).
+
+After installing docker, build an image from the Dockerfile:
+
+```
+cd CRISPRcasIdentifier
+docker build -t crispr-cas-identifier .
+```
+
+Run the docker image in a new container:
+
+```
+docker run -it crispr-cas-identifier:latest /bin/bash
+```
+
+To avoid creating a new container every time you want to use CRISPRcasIdentifier, you can reuse the existing container with the following commands:
+
+```
+docker restart CONTAINER_ID
+docker exec -it CONTAINER_ID /bin/bash
+```
+
+You can obtain the CONTAINER_ID by using:
+
+```
+docker ps --all
+```
+
+You can also copy a local FASTA input file to CRISPRcasIdentifier's container by using:
+
+```
+docker cp file.fa CONTAINER_ID:/home/CRISPRcasIdentifier
+```
+
+After this, everything should be set up and you can skip to the "How to use" section.
+
+### Third step (conda)
+
+Another way to install the correct Python version and its dependencies to run CRISPRcasIdentifier is by using [miniconda](https://docs.conda.io/en/latest/miniconda.html).
 
 Install Miniconda
 
@@ -25,11 +75,7 @@ conda env create -f crispr-env.yml -n crispr-env
 conda activate crispr-env
 ```
 
-### Downloading the Hidden Markov (HMM) and Machine Learning (ML) models
-
-Due to GitHub's file size constraints, we made our HMM and ML models available in Google Drive. You can download them [here](https://drive.google.com/file/d/166bh1sAjoB9kW5pn8YrEuEWrsM2QDV78/view?usp=sharing) and [here](https://drive.google.com/file/d/1ZOR1e-wIb_rxtCiU3OaBVdrHrup1svq3/view?usp=sharing). Save both tar.gz files inside CRISPRcasIdentifier's folder. It is not necessary to extract them, since CRISPRcasIdentifier will do that the first time it is run.
-
-### How to use
+## How to use
 
 To list the available command line arguments type