0% found this document useful (0 votes)
93 views

Basic Linux Alignement

This tutorial covers an introduction to Linux for next-generation DNA sequencing data analysis. It will provide instructions on basic Linux commands, how to map millions of DNA reads onto a genome, and use the Integrative Genome Viewer (IGV). The tutorial will not cover sequencing technologies, alignment algorithms, or advanced Linux concepts in detail. It will teach logging into remote servers, working with files and directories, copying/moving files, and using the text editor Emacs.

Uploaded by

Abdo Elmamoun
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
93 views

Basic Linux Alignement

This tutorial covers an introduction to Linux for next-generation DNA sequencing data analysis. It will provide instructions on basic Linux commands, how to map millions of DNA reads onto a genome, and use the Integrative Genome Viewer (IGV). The tutorial will not cover sequencing technologies, alignment algorithms, or advanced Linux concepts in detail. It will teach logging into remote servers, working with files and directories, copying/moving files, and using the text editor Emacs.

Uploaded by

Abdo Elmamoun
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 37

Tutorial

 1:  
An  Introduc1on  to  Linux  for  Next-­‐Gen  
DNA  Sequencing  Data  Analysis:  
A  Hands-­‐on  Tutorial  

Dr  Stratos  Efstathiadis  
[email protected]  
Technical  Director  
High  Performance  Compu1ng  Facility  
Center  for  Health  Informa1cs  and  Bioinforma1cs  
NYULMC  
This is the First Part in a Series of tutorials in High
Performance Computing and Sequencing Informatics

•  This tutorial will cover:


–  Running basic Linux commands
•  Logging-in, Files and Directories, File Editing, Running command line
tools.
–  How to map millions of Next-Gen DNA reads onto a Genome
–  A complete Next-Gen Seq. Read Alignment Example
–  Using the Integrative Genome Viewer (IGV)

•  This tutorial will Not cover:


–  Sequencing technologies or alignment algorithms in detail.
–  Advanced Linux concepts.
•  File and Directory permissions, Environment variables, Shell scripts,
batch job submission, etc. will be covered in another tutorial.
Bioinformatics Requires Powerful
Computers
•  One definition of bioinformatics is:
“The use of computers to analyze biological problems.

•  As biological data sets have grown larger and


biological problems have become more complex, the
requirements for computing power have also grown.

•  Computers that can provide this power generally use


the Unix operating system - so you must learn Unix
Will  computers  crash  Genomics?  
SCIENCE,  11  February  2011,  pg  666  
Remote Login using Secure Shell (ssh)

•  Secure Shell: A set of tools that allow secure interaction with


remote servers (ssh, sshd, ssh-add, ssh-keygen, ssh-agent,
etc.) using two-factor authentication, based on
–  What you know (a pass phrase)
–  What you have (a key)

–  ssh is the de-facto remote login mechanism in Linux/Unix.


rlogin, rsh, telnet, etc. use insecure protocols (transmit clear
text passwords) and are not used (actually, blocked).
How to login remotely on a Linux server
using Secure Shell (ssh)
$ ssh [email protected]

Username:    your  NYULMC  Kerberos  id  


Hostname:  ec2-­‐aa-­‐ddd-­‐cc-­‐nn.compute-­‐1.amazonaws.com  

•  Troubleshoo1ng:  ssh  –vvvv  username@hostname  


•  The  ssh-­‐agent  stores  the  key  in  memory  (of  your  laptop/desktop).  
First Linux Commands
-bash-3.2$ date
Wed Jan 2 13:28:49 EST 2013

-bash-3.2$ pwd (Present/current Working Directory)


/home/username
Home  directory  

-bash-3.2$ ls (list files in the pwd)

-bash-3.2$ touch myfile (creates an empty file)

Linux  file  names  and  commands  are  Case  Sensi?ve:  


myfile  is  different  than  MyFile  
-bash-3.2$ LS
-­‐bash:  LS:  command  not  found  
Working  with  Directories  
•  Directories  are  a  means  of  organizing  your  
files  on  a  Unix  computer.    
–  They  are  equivalent  to  folders  on  Windows  and  
Macintosh  computers    
•  Directories  contain  files,  executable  
programs,  and  sub-­‐directories    
•  Understanding  how  to  use  directories  is  
crucial  to  manipula1ng  your  files  on  a  Unix  
system.    
The  Tree  structure  of  the  file  system  
Your  Home  Directory  
•  When  you  login  to  the  server,  you  always  start  
in  your  Home  directory.  
•  Create  sub-­‐directories  to  store  specific  
projects  or  groups  of  informa1on,  just  as  you  
would  place  folders  in  a  filing  cabinet.  
•  Do  not  accumulate  thousands  of  files  with  
cryp1c  names  in  your  Home  directory  
Shortcuts  
•  There  are  some  important  shortcuts  in  Unix  for  
specifying  directories  
•   .  (dot)  means  "the  current  directory"    
•   ..  means  "the  parent  directory"  -­‐  the  directory  one  level  
above  the  current  directory,  so  cd  ..  will  move  you  up  one  
level  
•   ~  (1lde)  means  your  Home  directory,  so  cd  ~  will  move  
you  back  to  your  Home.  
–  Just  typing  a  plain  cd  will  also  bring  you  back  to  your  home  
directory  
File  &  Directory  Commands  
•  This  is  a  minimal  list  of  Unix  commands  that  you  must  know  
for  file  management:  
ls (list) mkdir (make directory)

cd (change directory) rmdir (remove directory)

cp (copy) pwd (present working directory)

mv (move) more (view by page)

rm (remove) cat (view entire file on screen)

•  All  of  these  commands  can  be  modified  with  many  op1ons.  
Learn  to  use  Unix   man  pages  for  more  informa1on.  
Copy  &  Move  
•  cp  lets  you  copy  a  file  from  any  directory  to  
any  other  directory,  or  create  a  copy  of  a  file  
with  a  new  name  in  one  directory  
•  cp filename.ext newfilename.ext
•  cp filename.ext subdir/newname.ext
•  cp /u/jdoe01/filename.ext ./subdir/newfilename.ext

•  mv  allows  you  to  move  files  to  other  


directories,  but  it  is  also  used  to  rename  files.    
–  Filename  and  directory  syntax  for  mv  is  exactly  the  same  as  
for  the  cp  command.    
•  mv filename.ext subdir/newfilename.ext

–  NOTE:  When  you  use  mv  to  move  a  file  into  another  
directory,  the  current  file  is  deleted.    
Delete  
•  Use  the  command  rm  (remove)  to  delete  
files  
•  There  is  no  way  to  undo  this  command!!!  
–  We  have  set  the  server  to  ask  if  you  really  want  
to  remove  each  file  before  it  is  deleted.  
–  You  must  answer   Y  or  else  the  file  is  not  
deleted.  
> ls
af151074.gb_pr5 test.seq
> rm test.seq
rm: remove test.seq? y
> ls
af151074.gb_pr5
Exercise:  Working  with  Files  and  Directories  
-bash-3.2$ mkdir project1 (Make Directory)
-bash-3.2$ cd project1 (Change Directory)
-bash-3.2$ pwd
/home/efstae01/project1
-bash-3.2$ cp /data/tutorial/ChIPseq_chr19.fastq .
-bash-3.2$ ls –la

-bash-3.2$ head ChIPseq_chr19.fastq


@HWUSI-EAS610_0001:3:1:4:1405#0/1
GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG
+
B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##
@HWUSI-EAS610_0001:3:1:5:1490#0/1
GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG
+
B@A?AAA1BB;A5B44>AA3'@AB>+>@AB94A?A?
@HWUSI-EAS610_0001:3:1:6:388#0/1
CAGAGTTCATGAAATAGGCCTCTAGTCTTCCTAGAC
FASTQ  Files  
36  bp  read  
@HWUSI-EAS610_0001:3:1:4:1405#0/1
GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG
+
B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##
36  Quality  scores  

• The de-facto file format for sharing sequence read data


• Sequence and a per-base quality score
• FASTQ Files are Text files
• 4 Lines per read
-bash-3.2$ wc –l ChIPseq_chr19.fastq
1082524 ChIPseq_chr19.fastq
Star?ng  emacs  
•  To  start  Emacs,  at  the  >  command  prompt,  
just  type:    emacs    
•  To  use  Emacs  to  edit  a  file,  type:  
   emacs  filename  
(where  filename  is  the  name  of  your  file)  
•  When  Emacs  is  launched,  it  opens  either  a  
blank  text  window  or  a  window  containing  
the  text  of  an  exis1ng  file.    
Save  &  Exit  
•  To  save  a  file  as  you  are  working  on  it,  type:  
     Ctrl-­‐x  »  Ctrl-­‐s  
•  To  exit  emacs  and  return  to  the  Unix  shell,  
type:    Ctrl  -­‐x  »  Ctrl  -­‐c  

  If  you  have  made  any  changes  to  the  file,  Emacs  


will  ask  you  if  you  want  to  save:    
Save file /u/browns02/nrdc.msf? (y,n,!,.,q,C-r or C-h)
•  Type  “y”  to  save  your  changes  and  exit  
•  If  you  type  “n”,  then  it  will  ask  again:  
Modified buffers exist; exit anyway? (yes or no)
•  If  you  answer  “no”,  then  it  will  return  you  to  the  file,  
you  must  answer  “yes”  to  exit  without  saving  
changes  
Copying  data  using  Secure  Copy  (scp)  
Exercise:    
Login  on  the  tutorial  Amazon  Instance  
-­‐bash-­‐3.2$  cd  project1  
-­‐bash-­‐3.2$  fastqc  ChIPseq_chr19.fastq  
-­‐bash-­‐3.2$  ls  –l    
Copy  the  file  ChIPseq_chr19.fastq.zip  to  your  laptop  using  
scp:  
Laptop>  scp  user@ec2-­‐54-­‐242-­‐89-­‐16.compute-­‐1.amazonaws.com:~/
project1/ChIPseq_chr19_fastqc.zip  .  
Unzip  it:  
Laptop>  unzip  ChIPseq_chr19_fastqc.zip  
Open  with  your  browser  fastqc-­‐report.html  
Short  Read  Alignment  
•  Given  a  reference  and  a  set  of  reads,  report  at  least  
one  “good”  local  alignment  for  each  read  if  one  
exists  
–  Approximate  answer  to:  where  in  genome  did  the  read  originate?  

•  What is “good”? For now, we concentrate on:


–  Fewer mismatches is better …TGATCATA… beqer  than   …TGATCATA…
–  Failing to align a low-quality GATCAA GAGAAT

base is better than failing to …TGATATTA… beqer  than   …TGATcaTA…


GATcaT GTACAT
align a high-quality base
Short  Read  Applica1ons  

GGTATAC…  
…CCATAG   TATGCGCCC   CGGAAATTT   CGGTATAC  
…CCAT   CTATATGCG   TCGGAAATT   CGGTATAC  
…CCAT   GGCTATATG   CTATCGGAAA   GCGGTATA  
…CCA   AGGCTATAT   CCTATCGGA   TTGCGGTA   C…  

Finding  the  
…CCA   AGGCTATAT   GCCCTATCG   TTTGCGGT   C…  
…CC   AGGCTATAT   GCCCTATCG   AAATTTGC   ATAC…  
…CC   TAGGCTATA   GCGCCCTA   AAATTTGC   GTATAC…  
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…  
alignments  is  
 GAAATTTGC  
GGAAATTTG  
typically  the  
performance  
CGGAAATTT  
CGGAAATTT  
TCGGAAATT  

boqleneck  
CTATCGGAAA  
CCTATCGGA   TTTGCGGT  
GCCCTATCG   AAATTTGC  
…CC   GCCCTATCG   AAATTTGC   ATAC…  
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…  
Indexing  
•  Genomes  and  reads  are  too  large  for  direct  
approaches  like  dynamic  programming  
•  Indexing  is  required  

Suffix  tree   Suffix  array   Seed  hash  tables  


Many  variants,  incl.  spaced  seeds  

•  Choice  of  index  is  key  to  performance  


NGS  Read  Alignment  
Burrows  Wheeler  Transforma1on  (BWT)  

•  Invented  by  David  Wheeler  in  1983  (Bell  Labs).  Published  in  1994.  
           “A  Block  Sor@ng  Lossless  Data  Compression  Algorithm”    
             Systems  Research  Center  Technical  Report  No  124.  Palo  Alto,  CA:  Digital  Equipment  
Corpora@on,  Burrows  M,  Wheeler  DJ.  1994    

•  Originally  developed  for  compressing  large  files  (bzip2,  etc.)  

•  Lossless,  Fully  Reversible  

•  Alignment  Tools  based  on  BWT:  bow@e,  BWA,  SOAP2,  etc.  

•  Approach:  
–  Align  reads  on  the  transformed  reference  genome,  using  an  efficient  index  (FM  index)  
–  Solve  the  simple  problem  first  (align  one  character)  and  then  build  on  that  solu1on  to  solve  a  
slightly  harder  problem  (two  characters)  etc.  

•  Results  in  great  speed  and  efficiency  gains  (a  few  GigaByte  of  RAM  for  the  en1re  H.  
Genome).  Other  approaches  require  tens  of  GigaBytes  of  memory  and  are  much  
slower.  
Generating the SAM file
-­‐bash-­‐3.2$  cd  project1  
-­‐bash-­‐3.2$  bwa  aln  /data/tutorial/mm9.fasta      ChIPseq_chr19.fastq  >  
ChIPseq_chr19.sai  
bwa_aln]  17bp  reads:  max_diff  =  2  
[bwa_aln]  38bp  reads:  max_diff  =  3  
[bwa_aln]  64bp  reads:  max_diff  =  4  
[bwa_aln]  93bp  reads:  max_diff  =  5  
[bwa_aln]  124bp  reads:  max_diff  =  6  
[bwa_aln]  157bp  reads:  max_diff  =  7  
[bwa_aln]  190bp  reads:  max_diff  =  8  
[bwa_aln]  225bp  reads:  max_diff  =  9  
[bwa_aln_core]  calculate  SA  coordinate...  25.94  sec  
[bwa_aln_core]  write  to  the  disk...  0.03  sec  
[bwa_aln_core]  262144  sequences  have  been  processed.  
[bwa_aln_core]  calculate  SA  coordinate...  0.86  sec  
[bwa_aln_core]  write  to  the  disk...  0.00  sec  
[bwa_aln_core]  270631  sequences  have  been  processed.  
[main]  Version:  0.6.2-­‐r126  
[main]  CMD:  bwa  aln  /data/tutorial/mm9.fasta  ChIPseq_chr19.fastq  
[main]  Real  1me:  28.230  sec;  CPU:  28.229  sec  

-­‐bash-­‐3.2$  bwa  samse  /data/tutorial/mm9.fasta  ChIPseq_chr19.sai  ChIPseq_chr19.fastq    


>  ChIPseq_chr19.sam  
[bwa_aln_core]  convert  to  sequence  coordinate...  3.39  sec  
[bwa_aln_core]  refine  gapped  alignments...  0.43  sec  
[bwa_aln_core]  print  alignments...  0.50  sec  
[bwa_aln_core]  262144  sequences  have  been  processed.  
[bwa_aln_core]  convert  to  sequence  coordinate...  1.75  sec  
[bwa_aln_core]  refine  gapped  alignments...  0.31  sec  
[bwa_aln_core]  print  alignments...  0.01  sec  
[bwa_aln_core]  270631  sequences  have  been  processed.  
[main]  Version:  0.6.2-­‐r126  
[main]  CMD:  bwa  samse  /data/tutorial/mm9.fasta  ChIPseq_chr19.sai  ChIPseq_chr19.fastq  
[main]  Real  1me:  6.710  sec;  CPU:  6.711  sec  
more  
•  Use  the  command  more  to  view  at  the  contents  of  a  file  one  screen  at  a  1me:    

   > -bash-3.2$ more ChIPseq_chr19.sam


@SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856
@SQ SN:chr12 LN:121257530
@SQ SN:chr13 LN:120284312
@SQ SN:chr14 LN:125194864
@SQ SN:chr15 LN:103494974
@SQ SN:chr16 LN:98319150
@SQ SN:chr17 LN:95272651
@SQ SN:chr18 LN:90772031
@SQ SN:chr19 LN:61342430
@SQ SN:chr1 LN:197195432
@SQ SN:chr2 LN:181748087
@SQ SN:chr3 LN:159599783
@SQ SN:chr4 LN:155630120
@SQ SN:chr5 LN:152537259
@SQ SN:chr6 LN:149517037
@SQ SN:chr7 LN:152524553
@SQ SN:chr8 LN:131738871
@SQ SN:chr9 LN:124076172
@SQ SN:chrM LN:16299
@SQ SN:chrX LN:166650296
@SQ SN:chrY LN:15902555
HWUSI-EAS610_0001:3:1:4:1405#0 16 chr19 60874227 37 36M * 0 0
CACTCACCTCTCTCTGATCTCTGGAATTGAACTATC ##97>08<<?6<0?2A>1
?7?5?=@5/7@4<;03;B XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:9T26
HWUSI-EAS610_0001:3:1:5:1490#0 0 chr19 32960373 37 36M * 0 0
GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG B@A?AAA1BB;A5B44>A
A3'@AB>+>@AB94A?A? XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
HWUSI-EAS610_0001:3:1:6:388#0 16 chr19 18177553 37 36M * 0 0
GTCTAGGAAGACTAGAGGCCTATTTCATGAACTCTG @AABAAB@@B@?BBBABA
BA;?<CBBBBBA>C;BBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
HWUSI-EAS610_0001:3:1:7:1045#0 16 chr19 60698168 37 36M * 0 0
ATGTGAGGCAATGTGCTCCATTTCCTTTCCCTATCC =>6AB?@BA<;:?AA@9A
B87;.=@=:>@B@>3,?B XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
Hit  the  spacebar  to  page  down  through  the  file  
–  Ctrl-­‐U  moves  back  up  a  page  
–  At  the  boqom  of  the  screen,  more  shows  how  much  of  the  file  has  been  displayed  
hqp://samtools.sourceforge.net/SAM1.pdf  
SAM/BAM  format  
Header  sec1on   @HD
@SQ
VN:1.0
SN:chr20 AS:HG18 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891

posi1on  of  alignment  

Alignment  sec1on  
Query  Name   Ref  sequence   query  sequence  (same  strand  as  ref)   query  quality  

V00-HWI-EAS132:3:38:959:2035#0 147 chr1 28 255 36M = 79 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG aaX`[\`Y^Y^]ZX``\EV_BBBBBBBBBBBBBBBB NM:i:1


V00-HWI-EAS132:4:99:122:772#0 177 chr1 32 255 36M = 9127 0 AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT aaaaaa\OWaI_\WL\aa`Xa^]\ZUaa[XWT\^XR NM:i:1
V00-HWI-EAS132:4:44:473:970#0 25 chr1 40 255 36M * 0 0 GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT __YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB NM:i:2
V00-HWI-EAS132:4:29:113:1934#0 99 chr1 44 255 36M = 107 0 GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC aaaQaa__``]\\_^``^a^`a`_^^^_XQ[ZS\XX NM:i:1
Converting the SAM file to a BAM file
BAM  file  B:      
Binary  format  vs  Text  format  (SAM),  resul1ng  in  more  efficient  data  storage.  
Sequencing  plazorm  independent.  

-­‐bash-3.2$ samtools view -bt /data/tutorial/mm9.fasta -o


ChIPseq_chr19.bam ChIPseq_chr19.sam
[samopen] SAM header is present: 22 sequences.

-bash-3.2$ samtools sort ChIPseq_chr19.bam


ChIPseq_chr19.sorted
-bash-3.2$ samtools index ChIPseq_chr19.sorted.bam

-bash-3.2$ samtools view in.bam chr2:20,100,000-20,200,000 >


out.bam
-bash-3.2$ samtools view lib15103.sorted.rmdup.bam "gi|
121635883|ref|NC_008769.1|:2,100,000-2,200,000"
BAM Files: Using the bitwise FLAG
-f : bit set
-F: bit is not set

Counting Alignments:
-bash-3.2$ samtools view -c ChIPseq_chr19.sorted.bam
270631

Count unmapped reads –f 4 (the bitwise flag 0x004 is set)


-bash-3.2$ samtools view -c -f 4 ChIPseq_chr19.sorted.bam
1839

Count the reads that are not unmapped (hence, count mapped
alignments) –F 4 (the bitwise flag 0x0004 is Not set)
-bash-3.2$ samtools view -c -F 4 ChIPseq_chr19.sorted.bam
268792

-bash-3.2$ samtools flagstat ChIPseq_chr19.sorted.bam


Putting it all together in a shell script (use emacs)
/data/tutorial/first
REFERENCE=/data/tutorial/mm9.fasta  
FASTQ=ChIPseq_chr19.fastq  
SAI=ChIPseq_chr19.sai  
SAM=ChIPseq_chr19.sam  
BAM=ChIPseq_chr19.bam  
SORTED=ChIPseq_chr19.sorted  
echo  $REFERENCE;  echo  $FASTQ;  echo  $SAI  

bwa  aln  $REFERENCE  $FASTQ  >  $SAI  


bwa  samse  $REFERENCE  $SAI  $FASTQ    >  $SAM  
samtools  view  -­‐bt  $REFERENCE  -­‐o  $BAM  $SAM        
samtools  sort  $BAM  $SORTED  
The  screen  command  
•  Run  “screen”  before  you  run  your  command  
•  Detach  by  pressing:  Control-­‐a  d  
•  Logout  

•  Login  again  
•  Re-­‐aOach  by  running  the  command:  “screen  –r”  
•  You  may  have  several  screen  sessions  acRve  
Using IGV to view alignments
The  Integra1ve  Genome  Viewer  can  be  installed  on  your  laptop  

wget  hqp://www.broadins1tute.org/igv/projects/downloads/IGV_2.1.22.zip    

unzip  IGV_2.1.22.zip    

cd  IGV_2.1.22    

java  -­‐Xmx2g  -­‐jar  igv.jar  

You  will  need  to  transfer  to  your  laptop:  


•  Indexed  Reference  FASTA  file  
•  Reference  Genome  Feature  File  (GFF)  
•  Sorted  and  indexed  mapped  read  BAM  files  
Using IGV to view alignments

You might also like