This repository contains xr-seq and damage-seq workflows.
-
This workflow is prepared using Snakemake workflow management system and conda
-
To run the workflow, you should have conda installed for environment management. All the other packages including Snakemake and their dependencies can be obtained automatically through environments prepared for each step of the workflow. You can follow the installation steps from the link.
-
Initially, you should clone the repository and navigate into the directory:
git clone https://github.com/CompGenomeLab/xr-ds-seq-snakemake.git cd xr-ds-seq-snakemake -
Next, you should create a conda environment with the defined packages. Install mamba and create the environment using mamba:
conda install -c conda-forge mamba mamba create -c bioconda -c conda-forge -c r -n repair snakemake=6.3.0 python=3.8 rust=1.50 sra-tools=2.11.0 conda activate repair
- You should install boquila tool to simulate repair and damage reads.
This workflow is prepared according to the structure recommended by Snakemake:
-
config/: contains the configuration files. -
logs/: contains the log files of each step. This folder will automatically appear when you run the workflow. -
report/: contains the description files of figures, which will be used in reports. -
resources/: containssamples/where the raw XR-seq and Damage-seq data are stored,input/where the input files are stored, andref_genomes/where the reference genome files are stored. Reference genome files can be automatically produced by the workflows, if they are properly defined in the config files. -
results/: contains the generated files and figures. This folder will automatically appear when you run the workflow. -
workflow/: containsenvs/where the environments are stored,rules/where the Snakemake rules are stored, andscripts/where the scripts used inside the rules are stored.
Before running the workflow, you should edit the configuration files.
For both XR-seq and Damage-seq workflows, there are 2 configuration files:
config_(XR/DS)_initial.yaml and config_(XR/DS).yaml.
The configuration file with "initial" prefix shouldn't be modified
by the user since they are containing configuration settings
that are common for all XR-seq and Damage-seq experiments.
For more detail about these configuration files,
check out the readme file in config/ directory.
The parameters for "config_(XR/DS).yaml" as below:
-
sample: The name of the sample file w/o the extension. Multiple sample names can be given in the below format:sample: - "SAMPLE_1" - "SAMPLE_2" - "SAMPLE_3"-
Using the given sample name, the workflow will look for
{SAMPLE}.fastq.gzas raw data. Therefore, the fastq file must be gzipped before running the workflow. -
If the layout of the given sample is paired-end, the workflow will look for
{SAMPLE}_R1.fastq.gz,{SAMPLE}_R2.fastq.gzand{SAMPLE}_1.fastq.gz,{SAMPLE}_2.fastq.gzas raw data. Therefore, paired-end sample files must contain_R1/2or_1/2suffixes and the suffixes should not be given in the configuration file to thesample.
-
-
damage_type: Damage type of each sample should be provided here in the same order of the samples:damage_type: - "64" - "CPD" - "oxaliplatin"-
Currently damages below are available can be provided as (case-insensitive):
- (6-4)PP:
64,64pp,(6-4)pp,6-4pp; - CPD:
CPD; - Cisplatin:
cisplatin; - Oxaliplatin:
oxaliplatin.
- (6-4)PP:
-
-
srr: Contains 2 parameters:enabled,codes.enabledis a boolean variable (can only be True or False) and if it is True, then the pipeline will retrieve the raw data by sra-toolkit the defining the SRR code in thecodesparameter.srr: enabled: True codes: - "SRRXXXXXXX"-
Each downloaded raw data with the SRR codes will be named as the corresponding sample name in the
sampleparameter. -
If a sample have multiple SRR codes, then the code should be provided in the given format:
SRRXXXXXXX:SRRXXXXXXX:SRRXXXXXXX
-
-
input: Contains 4 parameters:exist,files,sample,srr. This parameter is used for boquila, which can take input files for more accurate read simulations. If you don't have any input file, you can setexistas False.filesparameter contains the names of input files w/o the extension.sampleparameter contains the indeces of samples that will use the input of the same order. Lastly,srrcan be used for downloading inputs from sra-toolkit. It works the same was as thesrrparameter above. As a whole inputs can be provided as:input: exist: True files: - "input0" - "input1" - "input2" sample: - "0-3" # input0 will be used for sample 0,1,2,3 - "4, 7" # input1 will be used for sample 4,7 - "5, 6" # input2 will be used for sample 5,6 srr: enabled: True codes: - "SRRXXXXXXX" # srr of input0 - "SRRXXXXXXX" # srr of input1 - "SRRXXXXXXX" # srr of input2 -
Genome parameters: The parameters
build,species,datatype, andreleaseare defined to retrieve correct reference genome from ensembl. For more information, you can check out the link.-
If
genome_downloadis set to False, then a fasta file expected to be provided asresources/ref_genomes/${build}/genome_${build}.fa, where${build}should be the string given inbuildparameter. -
In the same manner, If
bowtie2_buildis set to False, then a build should be provided inresources/ref_genomes/${build}/Bowtie2/, where${build}should be the string given inbuildparameter. File name before the extensions must begenome_${build}. Lastly to properly setbowtie2_buildto False,genome_downloadmust be set to False as well.
-
-
filter: This parameter is used to filter chromosomes by the given regex.
After adjusting the configuration file, you can run the workflow
from xr-ds-seq-snakemake directory.
-
For XR-seq:
snakemake -pr --use-conda --cores 64 --snakefile Snakefile_XR.smk --keep-going --rerun-incomplete -
For Damage-seq:
snakemake -pr --use-conda --cores 64 --snakefile Snakefile_DS.smk --keep-going --rerun-incomplete
Note: To run the workflow on Slurm Workload Manager as set of jobs, --profile flag must be provided: |
|---|
snakemake -pr --use-conda --profile config/slurm --snakefile Snakefile_DS.smk --keep-going --rerun-incomplete
To generate detailed HTML report files, the code below should be run after workflow:
snakemake --report report.zip