Name	Name	Last commit message	Last commit date
Latest commit History 285 Commits
bin	bin
db	db
evaluation	evaluation
example	example
script	script
simulation	simulation
.gitignore	.gitignore
License	License
README.MD	README.MD
environment.yml	environment.yml
index.sh	index.sh

SpecHLA: full-resolution HLA typing from sequencing data

Install

First, create the env with conda, and activate the env.

git clone https://github.com/deepomicslab/SpecHLA.git --depth 1
cd SpecHLA/
conda env create --prefix=./spechla_env -f environment.yml
conda activate ./spechla_env

Second, make the softwares in bin/ executable.

chmod +x -R bin/*

Third, index the database and install the packages.

unset LD_LIBRARY_PATH && unset LIBRARY_PATH 
bash index.sh

Perform SpecHLA with

bash script/whole/SpecHLA.sh -h

Note:

SpecHLA requires a License of Novoalign in bin/. If not detected, it uses bowtie2 as replacement automatically. The License file of Novoalign should be put in the bin/ folder before running bash index.sh.
SpecHLA now supports Linux and Windows WSL systems.
SpecHLA does not accept short single-end reads.

Test

Please go to the example/ folder, run SpecHLA with given scripts, and find results in the output/.

Basic Usage

Main functions

Scripts	Description
script/ExtractHLAread.sh	Extract HLA reads from enrichment-free data.
script/whole/SpecHLA.sh	HLA typing with paired-end (PE), PE+long reads, PE+HiC, or PE+10X data.
script/long_read_typing.py	HLA typing with only long-read data.
script/cal.hla.copy.pl	Detect HLA LOH events based on SpecHLA's typing results.

Extract HLA-related reads

First extract HLA reads with enrichment-free data. Otherwise, HLA typing would be slow. Map reads to hg19 or hg38, then use script/ExtractHLAread.sh to extract HLA-related reads. We use the script of Kourami with minor revision for this step. Extract HLA-related reads by

USAGE: bash script/ExtractHLAread.sh -s <sample_id> -b <bamfile> -r <refGenome> -o <outdir>

 -s          : desired sample name (ex: NA12878) [required]

 -b          : sorted and indexed bam or cram (ex: NA12878.bam) [required]

 -r          : hg38 or hg19

 -o          : folder to save extracted reads [required]

HLA Typing

Full-resolution and exon HLA typing using SpecHLA. With Exome data like WES or RNASeq, only support exon typing.

Perform full-resolution HLA typing with paired-end reads by

bash script/whole/SpecHLA.sh -n <sample> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Perform exon HLA typing with paired-end reads by

bash script/whole/SpecHLA.sh -n <sample> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/ -u 1

Perform full-resolution HLA typing with paired-end reads and PacBio reads by

bash script/whole/SpecHLA.sh -n <sample> -t <sample.pacbio.fq.gz> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Perform full-resolution HLA typing with paired-end reads and Nanopore reads by

bash script/whole/SpecHLA.sh -n <sample> -e <sample.nanopore.fq.gz> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Perform full-resolution HLA typing with paired-end reads and Hi-C reads by

bash script/whole/SpecHLA.sh -n <sample> -c <sample.hic.fwd.fq.gz> -d <sample.hic.rev.fq.gz> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Perform full-resolution HLA typing with paired-end reads and 10X linked reads by (LongRanger should be installed in system env)

bash script/whole/SpecHLA.sh -n <sample> -x <sample.10x.read.folder> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Consider long Indels and use population information for annotation by

bash script/whole/SpecHLA.sh -n <sample> -v True -p <Asia> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Full arguments can be seen in

SpecHLA: Full-resolution HLA typing from sequencing data.

Note:
  1) Use HLA reads only, otherwise, it would be slow. Use ExtractHLAread.sh to extract HLA reads first.
  2) WGS, WES, and RNASeq data are supported.
  3) With Exome data like WES or RNASeq, must select exon typing  (-u 0).
  4) Short single-end read data are not supported.

Usage:
  bash SpecHLA.sh -n <sample> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o <outdir>

Options:
  -n        Sample ID. <required>
  -1        The first fastq file of paired-end data. <required>
  -2        The second fastq file of paired-end data. <required>
  -o        The output folder to store the typing results. Default is ./output
  -u        Choose full-length or exon typing [0|1]. 0 indicates full-length, 1 means exon,
            default is 0. With Exome or RNA data, must select 1 (i.e., exon typing).
  -p        The population of the sample [Asian, Black, Caucasian, Unknown, nonuse] for annotation.
            Default is Unknown, meaning use mean allele frequency in all populations. nonuse indicates
            only adopting mapping score and considering zero-frequency alleles.
  -j        Number of threads [5].
  -t        Pacbio fastq file.
  -e        Nanopore fastq file.
  -c        fwd hi-c fastq file.
  -d        rev hi-c fastq file.
  -x        Path of folder created by 10x demultiplexing. Prefix of the filenames of FASTQs
            should be the same as Sample ID. Please install Longranger in the system env.
  -w        The weight to use allele imbalance info for phasing [0-1]. Default is 0 that means
            not use. 1 means only use imbalance info; other values integrate reads and allele imbalance.
  -m        The maximum mismatch number tolerated in assigning gene-specific reads. Deault
            is 2. It should be set larger to infer novel alleles.
  -y        The minimum different mapping score between the best and second-best aligned genes.
            Discard the read if the score is lower than this value. Deault is 0.1.
  -v        True or False. Consider long InDels if True, else only consider small variants.
            Default is False.
  -q        Minimum variant quality. Default is 0.01. Set it larger in high quality samples.
  -s        Minimum variant depth. Default is 5.
  -a        Use this long InDel file if provided.
  -r        The minimum Minor Allele Frequency (MAF), default is 0.05 for full length and
            0.1 for exon typing.
  -k        The mean depth in a window lower than this value will be masked by N, default is 5.
            Set 0 to avoid masking.
  -z        Whether only mask exon region, True or False, default is False.
  -f        The trio infromation; child:parent_1:parent_2 [Example: NA12878:NA12891:NA12892]. If provided,
            use trio info to improve typing. Note: use it after performing SpecHLA once already.
  -b        Whether use database for unlinked block phasing [0|1], default is 1 (i.e., use).
  -h        Show this message.

HLA typing with long-read data alone

Perform HLA typing only with long reads by

usage: python3 script/long_read_typing.py -h

HLA Typing with only long-read data.

Required arguments:
  -r        Long-read fastq file. PacBio or Nanopore. (default: None)
  -n        Sample ID (default: None)
  -o        The output folder to store the typing results. (default: ./output)

Optional arguments:
  -p        The population of the sample [Asian, Black, Caucasian, Unknown, nonuse] for annotation. Unknown
              means use mean allele frequency in all populations. nonuse indicates only adopting mapping score
              and considering zero-frequency alleles. (default: Unknown)
  -j        Number of threads. (default: 5)
  -d        Minimum score difference to assign a read to a gene. (default: 0.001)
  -g        Whether use G group resolution annotation [0|1]. (default: 0)
  -m        1 represents typing, 0 means only read assignment (default: 1)
  -a        Prefix of filtered fastq file. (default: long_read)
  -h, --help

Pedigree phasing

After performing HLA typing, we can afford trio information to update the results of child by

bash script/SpecHLA.sh -n <child> -1 <child.fq.1.gz> -2 <child.fq.2.gz> -o outdir/ --trio child:parent1:parent2

The fresh results can be found in the trio/ folder inside the original result folder.

Detect HLA LOH in cancer samples

If the purity of cancer sample and the ploidy of HLA region are known, we can infer HLA LOH event by

usage:  perl script/cal.hla.copy.pl [Options] -S <samplename> -C <5> -purity <Purity> -ploidy <Ploidy> -F <filelist> -T <hla.result.txt> -O <outdir>

        OPTIONS:
        -purity  [f]  the purity of tumor sample. <required>
        -ploidy  [f]  the ploidy of tumor sample in HLA gene region. <required>
        -S       [s]  sample name <required>
        -C       [f]  the cutoff of heterogeneous snp number. Default is 5.
        -F       [s]  the HLA_*_freq.txt file list. <required>  Obtained by "ls typing_result_dir/*_freq.txt >freq.list"
        -T       [s]  the hla typing result file of Spechla <required>
        -O       [s]  the output dir. <required> Need exist before running.

The result file "merge.hla.copy.txt" can be found in the outdir.

Use ls typing_result_dir/*_freq.txt >freq.list to generate the file required by -F. The command example is

perl script/cal.hla.copy.pl -purity 0.5 -ploidy 2 -S test -F freq.list -T hla.result.txt -O ./

Interpret output

In the denoted outdir, the results of each sample are saved in a folder named as the sample ID.

In the directory of one specific sample, you will find the below files:

Output	Description
hla.result.txt	HLA-typing results for all alleles
hla.result.details.txt	All the alleles with the highest mapping score in annotation
hla.result.g.group.txt	G group resolution HLA type
hla.allele..HLA_.fasta	Sequence of each allele (the low-depth region is masked by N)
HLA_*_freq.txt	Haplotype frequencies of each gene
HLA_*.rephase.vcf.gz	Phased vcf file for each gene

If you performed pedigree phasing, you will find below files.

Output	Description
trio/HLA_*.trio.vcf.gz	Phased vcf file after pedigree phasing
hla.allele..HLA_.fasta	Allele sequence after pedigree phasing

An example for hla.result.txt is as below:

Sample  HLA_A_1 HLA_A_2 HLA_B_1 HLA_B_2 HLA_C_1 HLA_C_2 HLA_DPA1_1      HLA_DPA1_2      HLA_DPB1_1      HLA_DPB1_2      HLA_DQA1_1    HLA_DQA1_2      HLA_DQB1_1      HLA_DQB1_2      HLA_DRB1_1      HLA_DRB1_2
HG00118 A*24:02:01:01   A*02:01:01:01   B*07:02:01:01   B*40:01:02:02   C*03:04:01:01   C*07:02:01:03   DPA1*01:03:01:02        DPA1*01:03:01:02    DPB1*04:01:01:01        DPB1*04:01:01:01        DQA1*01:02:01:01        DQA1*01:02:01:01        DQB1*06:02:01:01        DQB1*06:02:01:01    DRB1*15:01:01:01        DRB1*15:01:01:01

An example for HLA_*_freq.txt is as below:

# Haplotype     Frequency
hla.allele.1.HLA_A.fasta 0.532
hla.allele.2.HLA_A.fasta 0.468
# The number of heterozygous variant is 30

An example for HLA LOH result file: merge.hla.copy.txt is as below:

Sample  HLA     Allele1 Allele2 copyratio       KeptHLA LossHLA Freq1   Freq2   Purity  Het_num LOH
test    A       A*24:02:01:01   A*02:01:01:01   1:1     A*24:02:01:01   A*02:01:01:01   0.507   0.493   0.5     100     N
test    B       B*07:02:01:01   B*40:01:02:01   1:1     B*07:02:01:01   B*40:01:02:01   0.502   0.498   0.5     32      N
test    C       C*03:04:01:01   C*07:02:01:03   1:1     C*07:02:01:03   C*03:04:01:01   0.433   0.567   0.5     99      N
test    DPA1    DPA1*01:03:01:02        DPA1*01:03:01:02        2:0     DPA1*01:03:01:02        homogeneous     1       0       0.5     0       N
test    DPB1    DPB1*04:01:01:01        DPB1*04:01:01:01        2:0     DPB1*04:01:01:01        DPB1*04:01:01:01        0.909   0.091   0.5     1       N
test    DQA1    DQA1*01:02:01:01        DQA1*01:02:01:01        1:1     DQA1*01:02:01:01        DQA1*01:02:01:01        0.548   0.452   0.5     1       N
test    DQB1    DQB1*06:02:01:01        DQB1*06:02:01:01        2:0     DQB1*06:02:01:01        homogeneous     1       0       0.5     0       N
test    DRB1    DRB1*15:01:01:01        DRB1*15:01:01:01        2:0     DRB1*15:01:01:01        DRB1*15:01:01:01        0.65    0.35    0.5     2       N

Interpret each column as

Header	Description
Sample	Sample Name
HLA	HLA locus
Allele1	HLA type of the first allele
Allele2	HLA type of the second allele
copyratio	Copy numbers of two alleles
KeptHLA	The remaining allele
LossHLA	The possible lost allele
Freq1	Frequency of the first allele
Freq2	Frequency of the second allele
Purity	Tumor purity
Het_num	Number of heterozygous variant
LOH	Whether LOH exists (Y or N)

Dependencies

Systematic requirement

SpecHLA requires conda 4.12.0+, cmake 3.16.3+, and GCC 9.4.0+ for environment construction and software installation.

Programming

python=3.8.12 or above
Python libraries: numpy=1.22.3, pulp=2.6.0, pysam=0.19.0, scipy=1.8.0, and biopython=1.79.
Perl 5 or above

Third party packages

SpecHLA enables automatic installation of these third party packages using conda. Also, we can install the softwares by ourselves, and add their locations to system path (Exception: put bcftools, fermikit, and novoalign to bin/ folder). After applying for the authority of Novoalign, please put the novoalign.lic (License file) in the bin/ folder.

Third party packages:
- SamTools-1.3.1
- bamutil Version: 1.0.14
- BWA-0.7.17-r1188
- blast 2.12.0
- BLAT v. 36x2
- bedtools-v2.26.0
- bcftools-Version: 1.9
- Freebayes-v1.2.0-2-g29c4002
- Novoalign V4.02.01
- ScanIndel v1.3
- Fermikit-0.13
- SpecHap v1.0.1 and ExtractHAIRs
- Minimap 2.17-r941
- pbmm2-1.4.0
- pbsv Version 2.6.2
- Bowtie 2 version 2.3.4.1
- longshot-0.4.5

Getting help

Should you have any queries, please feel free to contact us, we will reply as soon as possible ([email protected]).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SpecHLA: full-resolution HLA typing from sequencing data

Install

Test

Basic Usage

Main functions

Extract HLA-related reads

HLA Typing

HLA typing with long-read data alone

Pedigree phasing

Detect HLA LOH in cancer samples

Interpret output

Dependencies

Systematic requirement

Programming

Third party packages

Getting help

About

Uh oh!

Releases 8

Contributors 4

Uh oh!

Languages

License

deepomicslab/SpecHLA

Folders and files

Latest commit

History

Repository files navigation

SpecHLA: full-resolution HLA typing from sequencing data

Install

Test

Basic Usage

Main functions

Extract HLA-related reads

HLA Typing

HLA typing with long-read data alone

Pedigree phasing

Detect HLA LOH in cancer samples

Interpret output

Dependencies

Systematic requirement

Programming

Third party packages

Getting help

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 8

Contributors 4

Uh oh!

Languages