Basic Linux Alignment
1: An Introduction to Linux for Next-Gen DNA Sequencing Data Analysis: A Hands-on Tutorial
Dr Stratos Efstathiadis
[email protected]
Technical Director, High Performance Computing Facility
Center for Health Informatics and Bioinformatics, NYULMC
This is the First Part in a Series of tutorials in High
Performance Computing and Sequencing Informatics
• All of these commands can be modified with many options. Learn to use Unix man pages for more information.
Copy & Move
• cp lets you copy a file from any directory to any other directory, or create a copy of a file with a new name in one directory
• cp filename.ext newfilename.ext
• cp filename.ext subdir/newname.ext
• cp /u/jdoe01/filename.ext ./subdir/newfilename.ext
– NOTE: When you use mv to move a file into another directory, the file is removed from its original location.
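The difference between cp and mv can be checked safely in a scratch directory. The following is a minimal sketch, not part of the original exercise; the directory created by mktemp and the names notes.txt and subdir are hypothetical.

```shell
# Work in a throwaway directory so no real files are touched (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir subdir
echo "sample data" > notes.txt

# cp leaves the original in place and creates a second copy.
cp notes.txt subdir/notes_copy.txt
ls notes.txt subdir/notes_copy.txt

# mv relocates the file: afterwards it exists only at the destination.
mv notes.txt subdir/notes_moved.txt
ls subdir
[ -e notes.txt ] || echo "notes.txt is no longer in the original directory"
```

After the mv step, only the copy and the moved file remain, both inside subdir.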
Delete
• Use the command rm (remove) to delete files
• There is no way to undo this command!
– We have set the server to ask if you really want to remove each file before it is deleted.
– You must answer y, or else the file is not deleted.
> ls
af151074.gb_pr5 test.seq
> rm test.seq
rm: remove test.seq? y
> ls
af151074.gb_pr5
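The confirmation prompt shown above is what rm prints when it is run with the -i option. How the tutorial server was configured is not stated; a common approach (an assumption here) is an alias such as alias rm='rm -i'. The sketch below tries the option on throwaway files, with the answers piped in instead of typed; the file names are hypothetical.

```shell
# Scratch directory with two throwaway files (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch keep.seq test.seq

# -i makes rm ask before each removal. Answering "n" leaves the file untouched.
echo n | rm -i keep.seq

# Answering "y" deletes the file. There is no undo.
echo y | rm -i test.seq

ls    # only keep.seq is left
```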
Exercise: Working with Files and Directories
-bash-3.2$ mkdir project1 (Make Directory)
-bash-3.2$ cd project1 (Change Directory)
-bash-3.2$ pwd
/home/efstae01/project1
-bash-3.2$ cp /data/tutorial/ChIPseq_chr19.fastq .
-bash-3.2$ ls -la
[Figure: short reads stacked under the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC… at the positions where they align]
Finding the alignments is typically the performance bottleneck
Indexing
• Genomes and reads are too large for direct approaches like dynamic programming
• Indexing is required
• The Burrows-Wheeler transform was invented by David Wheeler in 1983 (Bell Labs) and published in 1994: Burrows M, Wheeler DJ. “A Block Sorting Lossless Data Compression Algorithm”. Systems Research Center Technical Report No 124. Palo Alto, CA: Digital Equipment Corporation, 1994.
• Approach:
– Align reads on the transformed reference genome, using an efficient index (FM index)
– Solve the simple problem first (align one character) and then build on that solution to solve a slightly harder problem (two characters), etc.
• Results in great speed and efficiency gains (a few gigabytes of RAM for the entire human genome). Other approaches require tens of gigabytes of memory and are much slower.
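The transform behind this kind of index can be illustrated on a toy string. The sketch below uses the naive textbook construction (list every rotation, sort them, keep the last column) and is for illustration only: real aligners such as BWA build their index with far more memory-efficient algorithms. The function name bwt and the example string ACAACG are not from the tutorial, and the script needs bash rather than plain sh.

```shell
# Naive Burrows-Wheeler transform: append a terminator ($), list every
# rotation of the string, sort the rotations, and read off the last column.
bwt() {
  local s="${1}\$"   # "$" marks the end of the text and sorts before A, C, G, T
  local n=${#s} i
  for ((i = 0; i < n; i++)); do
    printf '%s\n' "${s:i}${s:0:i}"      # rotation starting at position i
  done |
    LC_ALL=C sort |                     # byte-wise order, so "$" comes first
    while IFS= read -r rotation; do
      printf '%s' "${rotation: -1}"     # last character of each sorted rotation
    done
  printf '\n'
}

bwt ACAACG    # prints GC$AAAC
```

Identical characters tend to cluster in the output (here the run AAA), which is what makes the transformed text compressible and lets an FM index match a read one character at a time.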
Generating the SAM file
-bash-3.2$ cd project1
-bash-3.2$ bwa aln /data/tutorial/mm9.fasta ChIPseq_chr19.fastq > ChIPseq_chr19.sai
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 25.94 sec
[bwa_aln_core] write to the disk... 0.03 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 0.86 sec
[bwa_aln_core] write to the disk... 0.00 sec
[bwa_aln_core] 270631 sequences have been processed.
[main] Version: 0.6.2-r126
[main] CMD: bwa aln /data/tutorial/mm9.fasta ChIPseq_chr19.fastq
[main] Real time: 28.230 sec; CPU: 28.229 sec
Alignment section
[Figure labels: Query Name; Ref sequence; query sequence (same strand as ref); query quality]
Counting Alignments:
-bash-3.2$ samtools view -c ChIPseq_chr19.sorted.bam
270631
Count the reads that are not unmapped (hence, count mapped alignments) with -F 4, which keeps only reads where the bitwise flag 0x0004 (read unmapped) is not set:
-bash-3.2$ samtools view -c -F 4 ChIPseq_chr19.sorted.bam
268792
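The -F 4 filter is a bitwise test on the FLAG field of each alignment record. A small sketch of that test in bash arithmetic follows; the helper name is_mapped and the example FLAG values are illustrative, while the two counts are the ones reported by the samtools commands above.

```shell
# A read counts as mapped when bit 0x4 ("read unmapped") of its FLAG is not set.
is_mapped() {
  local flag=$1
  (( (flag & 0x4) == 0 ))
}

# 0 and 16 are mapped reads (forward and reverse strand); 4 and 20 have bit 0x4 set.
for flag in 0 16 4 20; do
  if is_mapped "$flag"; then
    echo "FLAG $flag: mapped (kept by -F 4)"
  else
    echo "FLAG $flag: unmapped (dropped by -F 4)"
  fi
done

# Counts taken from the two samtools commands above.
total=270631
mapped=268792
echo "unmapped reads: $(( total - mapped ))"    # 1839
awk -v m="$mapped" -v t="$total" 'BEGIN { printf "mapped: %.2f%%\n", 100 * m / t }'
```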
• Login again
• Re-attach by running the command: “screen -r”
• You may have several screen sessions active
Using IGV to view alignments
The Integrative Genomics Viewer (IGV) can be installed on your laptop
wget http://www.broadinstitute.org/igv/projects/downloads/IGV_2.1.22.zip
unzip IGV_2.1.22.zip
cd IGV_2.1.22