Safe Haskell	Safe-Inferred
Language	Haskell2010

Bio.TwoBit

Description

.2bit format (from the UCSC Genome Browser FAQ)

A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.

The file begins with a 16-byte header containing the following fields:

signature - the number 0x1A412743 in the architecture of the machine that created the file
version - zero for now. Readers should abort if they see a version number higher than 0
sequenceCount - the number of sequences in the file
reserved - always zero for now

All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.
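The signature check described above can be sketched as follows. This is a minimal illustration working on a strict ByteString, not the code used by this module; the names `ByteOrder`, `Header`, `word32At`, and `parseHeader` are hypothetical.

```haskell
import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as B
import Data.Word (Word32)

-- | Byte order detected from the signature (hypothetical helper type).
data ByteOrder = BigEndian | LittleEndian deriving (Eq, Show)

-- | Header fields that remain interesting once the signature is checked.
data Header = Header
  { hdrOrder    :: ByteOrder
  , hdrVersion  :: Word32
  , hdrSeqCount :: Word32
  } deriving (Eq, Show)

-- | Read a 32-bit word at a byte offset in the given byte order.
word32At :: ByteOrder -> Int -> B.ByteString -> Word32
word32At order off bs = foldl step 0 bytes
  where
    raw   = B.unpack (B.take 4 (B.drop off bs))
    bytes = if order == BigEndian then raw else reverse raw
    step acc b = (acc `shiftL` 8) .|. fromIntegral b

-- | Parse the 16-byte header, trying both byte orders for the signature.
parseHeader :: B.ByteString -> Either String Header
parseHeader bs
  | B.length bs < 16 = Left "header too short"
  | otherwise =
      case filter sigOk [BigEndian, LittleEndian] of
        []          -> Left "bad signature"
        (order : _) ->
          let version = word32At order 4 bs
          in if version /= 0
               then Left "unsupported version"
               else Right (Header order version (word32At order 8 bs))
  where
    sigOk order = word32At order 0 bs == 0x1A412743
```

Whichever byte order makes the signature match is then used for every later multi-byte field, which is the byte-swapping rule stated above.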

The header is followed by a file index, which contains one entry for each sequence. Each index entry contains three fields:

nameSize - a byte containing the length of the name field
name - the sequence name itself (in ASCII-compatible byte string), of variable length depending on nameSize
offset - the 32-bit offset of the sequence data relative to the start of the file, not aligned to any 4-byte padding boundary
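As an illustration of the index layout, the sketch below walks a run of index entries. It assumes a little-endian file (see the signature check above for the general case), and `word32LE` and `parseIndex` are hypothetical names, not part of this module.

```haskell
import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as B
import Data.Word (Word32)

-- | Little-endian 32-bit word at the start of a byte string
--   (the caller must ensure at least 4 bytes are present).
word32LE :: B.ByteString -> Word32
word32LE = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0 . B.unpack . B.take 4

-- | Parse @n@ index entries, returning (name, offset) pairs.
parseIndex :: Int -> B.ByteString -> Either String [(B.ByteString, Word32)]
parseIndex 0 _ = Right []
parseIndex n bs =
  case B.uncons bs of
    Nothing -> Left "truncated index"
    Just (nameSize, rest)
      | B.length rest < len + 4 -> Left "truncated index entry"
      | otherwise ->
          ((name, word32LE afterName) :) <$> parseIndex (n - 1) (B.drop 4 afterName)
      where
        len = fromIntegral nameSize
        (name, afterName) = B.splitAt len rest
```

Because each entry is `1 + nameSize + 4` bytes long, entries have to be read in order; there is no way to jump to the k-th entry directly.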

The index is followed by the sequence records, which contain nine fields:

dnaSize - number of bases of DNA in the sequence
nBlockCount - the number of blocks of Ns in the file (representing unknown sequence)
nBlockStarts - an array of length nBlockCount of 32 bit integers indicating the (0-based) starting position of a block of Ns
nBlockSizes - an array of length nBlockCount of 32 bit integers indicating the length of a block of Ns
maskBlockCount - the number of masked (lower-case) blocks
maskBlockStarts - an array of length maskBlockCount of 32 bit integers indicating the (0-based) starting position of a masked block
maskBlockSizes - an array of length maskBlockCount of 32 bit integers indicating the length of a masked block
reserved - always zero for now
packedDna - the DNA packed to two bits per base, represented as follows: T - 00, C - 01, A - 10, G - 11. The first base is in the most significant 2 bits of the byte; the last base is in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011.
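The packing rule for packedDna can be made concrete with a small sketch. The helper names `codeToBase`, `unpackByte`, and `packByte` are hypothetical and only illustrate the bit layout given above.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- | Map a 2-bit code to its base: 0..3 = T, C, A, G.
codeToBase :: Word8 -> Char
codeToBase c = "TCAG" !! fromIntegral (c .&. 3)

-- | Unpack one byte into four bases, most significant pair first.
unpackByte :: Word8 -> String
unpackByte w = [ codeToBase (w `shiftR` s) | s <- [6, 4, 2, 0] ]

-- | Pack exactly four bases into one byte; Nothing on any other input.
packByte :: String -> Maybe Word8
packByte bases
  | length bases == 4 =
      foldl (\acc c -> (acc `shiftL` 2) .|. c) 0 <$> mapM baseToCode bases
  | otherwise = Nothing
  where
    baseToCode b = lookup b (zip "TCAG" [0 ..])
```

With these definitions, `unpackByte 0x1B` gives "TCAG", matching the 00011011 example in the format description.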

In this format, it is neither possible nor necessary to store Ns in the main sequence, and one wouldn't expect them to take up space there. However, they do; hard masked sequence is typically stored as many Ts. The sensible way to treat these is probably to just say there are two kinds of implied annotation (repeats and large gaps for a typical genome), which can be interpreted in whatever way fits.

Synopsis

data TwoBitFile = TBF {
- tbf_raw :: !(ForeignPtr Word8)
- tbf_size :: !Int
- tbf_path :: !ByteString
- tbf_chroms :: !(Array TwoBitChromosome)
- tbf_chrmap :: !(HashMap ByteString TwoBitChromosome)
}
openTwoBit :: FilePath -> IO TwoBitFile
data TwoBitChromosome = TBC {
- tbc_raw :: !(ForeignPtr Word8)
- tbc_name :: !ByteString
- tbc_index :: !Int
- tbc_dna_offset :: !Word32
- tbc_dna_size :: !Word32
- tbc_fwd_seq :: Int -> TwoBitSequence' Unidrectional
- tbc_rev_seq :: Int -> TwoBitSequence' Bidirectional
}
tbf_chrnames :: TwoBitFile -> [ByteString]
findChrom :: ByteString -> TwoBitFile -> Maybe TwoBitChromosome
data TwoBitSequence' dir
- = SomeSeq !Masking !(ForeignPtr Word8) !Word !Int (TwoBitSequence' dir)
- | RefEnd
type TwoBitSequence = TwoBitSequence' Unidrectional
data Unidrectional
data Bidirectional
unpackRSRaw :: TwoBitSequence' dir -> [Word8]
unpackRS :: TwoBitSequence' dir -> [Word8]
unpackRSMasked :: TwoBitSequence' dir -> [Word8]
newtype Masking = Masking Word8
isSoftMasked :: Masking -> Bool
isHardMasked :: Masking -> Bool
noneMasked :: Masking
softMasked :: Masking
hardMasked :: Masking
bothMasked :: Masking

Documentation

data TwoBitFile

Constructors

TBF

Fields

tbf_raw :: !(ForeignPtr Word8)
tbf_size :: !Int
tbf_path :: !ByteString
tbf_chroms :: !(Array TwoBitChromosome)
tbf_chrmap :: !(HashMap ByteString TwoBitChromosome)

openTwoBit :: FilePath -> IO TwoBitFile

Brings a 2bit file into memory. The file is mmap'ed, so it will not work on streams that are not actual files. It's also unsafe if the file is concurrently modified in any way.

data TwoBitChromosome

Constructors

TBC

Fields

tbc_raw :: !(ForeignPtr Word8)
tbc_name :: !ByteString
tbc_index :: !Int
tbc_dna_offset :: !Word32
tbc_dna_size :: !Word32
tbc_fwd_seq :: Int -> TwoBitSequence' Unidrectional
Lazily generated sequence in forward direction; the argument is the offset of the first base.
tbc_rev_seq :: Int -> TwoBitSequence' Bidirectional
Lazily generated sequence in reverse direction; the argument is the offset of the first base to the right of the beginning. (The first base generated is the complement of the base found at offset-1.)
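The offset conventions of the two accessors can be modelled on a plain list of 2-bit codes. This is only a sketch of the semantics described above, not how the module generates sequences; `complementCode`, `forwardFrom`, and `reverseFrom` are hypothetical names.

```haskell
import Data.Bits (xor)
import Data.Word (Word8)

-- | Complement a 2-bit base code (T=0, C=1, A=2, G=3).
--   Flipping the high bit swaps T with A and C with G.
complementCode :: Word8 -> Word8
complementCode = (`xor` 2)

-- | Forward view: bases at offset, offset+1, ...
forwardFrom :: Int -> [Word8] -> [Word8]
forwardFrom = drop

-- | Reverse view: complements of the bases at offset-1, offset-2, ..., 0.
reverseFrom :: Int -> [Word8] -> [Word8]
reverseFrom off = map complementCode . reverse . take off
```

For the codes of TCAG, `reverseFrom 3` yields the codes of TGA, the reverse complement of the first three bases TCA.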

tbf_chrnames :: TwoBitFile -> [ByteString]

findChrom :: ByteString -> TwoBitFile -> Maybe TwoBitChromosome

Finds a named scaffold in the reference. If it doesn't find the exact name, it will try to compensate for the crazy naming differences between NCBI and UCSC. This doesn't work in general, but is good enough in the common case. In particular, "1" maps to "chr1" and back, "GL000192.1" to "chr1_gl000192_random" and back, and "chrM" to "MT" and back.
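A fallback lookup of this kind might look like the sketch below. It covers only the "1"/"chr1" and "chrM"/"MT" cases on plain Strings and an association list; the random-scaffold mapping is omitted, and `altNames` and `lookupChrom` are hypothetical names that do not reflect how findChrom is actually implemented.

```haskell
import Data.List (isPrefixOf)

-- | Alternative spellings to try when an exact lookup fails (illustrative only).
altNames :: String -> [String]
altNames "MT"   = ["chrM"]
altNames "chrM" = ["MT"]
altNames name
  | "chr" `isPrefixOf` name = [drop 3 name]
  | otherwise               = ["chr" ++ name]

-- | Look up the exact name first, then fall back to alternative spellings.
lookupChrom :: String -> [(String, a)] -> Maybe a
lookupChrom name table =
  case [ v | n <- name : altNames name, Just v <- [lookup n table] ] of
    (v : _) -> Just v
    []      -> Nothing
```

Trying the exact name first keeps the lookup predictable when a file happens to contain both spellings.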

data TwoBitSequence' dir

This is a (piece of a) reference sequence. It consists of stretches with uniform masking.

The offset is stored as a Word. This is done because on a 32 bit platform, every bit counts. This limits the genome to approximately four gigabases, which would be a file of about one gigabyte. That's just about enough to work with the human genome. On a 64 bit platform, the file format itself imposes a limit of four gigabytes, or about 16 gigabases in total.

If length is zero, the piece is empty and the mask, pointer, and offset fields may not be valid. If length is positive, ptr+offset points at the first base of the piece. If length is negative, ptr+offset points just past the end of the piece, ptr+offset+length points to the first base of the piece, and the sequence is meant to be reverse complemented.

In a TwoBitSequence, length must not be negative. In a TwoBitSequence' Bidirectional, length can be positive or negative.

Constructors

SomeSeq

Fields

!Masking
how is it masked?
!(ForeignPtr Word8)
primitive bases in 2bit encoding: [0..3] = TCAG
!Word
offset in bases(!)
!Int
length in bases
(TwoBitSequence' dir)

RefEnd

Instances

Show (TwoBitSequence' dir)
Defined in Bio.TwoBit. Methods: showsPrec :: Int -> TwoBitSequence' dir -> ShowS; show :: TwoBitSequence' dir -> String; showList :: [TwoBitSequence' dir] -> ShowS

type TwoBitSequence = TwoBitSequence' Unidrectional

data Unidrectional

data Bidirectional

unpackRSRaw :: TwoBitSequence' dir -> [Word8]

Unpacks a reference sequence into a (very long) list of bytes. Each byte contains the nucleotide in bits 0 and 1, with values 0..3 corresponding to TCAG, and the soft and hard mask bits in bits 2 and 3, respectively.
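A consumer of such a byte list could pull the three pieces of information apart as in this sketch. The record `RawBase` and the function `decodeRaw` are hypothetical helpers, written only from the bit layout stated above.

```haskell
import Data.Bits (testBit, (.&.))
import Data.Word (Word8)

-- | Decoded view of one byte in the layout described above.
data RawBase = RawBase
  { rbCode :: Word8   -- ^ 0..3 = T, C, A, G (bits 0 and 1)
  , rbSoft :: Bool    -- ^ soft-mask flag (bit 2)
  , rbHard :: Bool    -- ^ hard-mask flag (bit 3)
  } deriving (Eq, Show)

-- | Split a raw byte into its base code and mask flags.
decodeRaw :: Word8 -> RawBase
decodeRaw w = RawBase (w .&. 3) (testBit w 2) (testBit w 3)
```

For example, the byte 5 (binary 0101) decodes to a soft-masked C, and 14 (binary 1110) to an A that is both soft and hard masked.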

unpackRS :: TwoBitSequence' dir -> [Word8]

Unpacks a reference sequence into a (very long) list of ASCII characters. Hard masked nucleotides become the letter N, others become TCAG.

unpackRSMasked :: TwoBitSequence' dir -> [Word8]

Unpacks a reference sequence into a list of ASCII characters, interpreting masking in the customary way. Specifically, hard masking produces Ns, soft masking produces lower case letters, and dual masking produces lower case Ns.
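The two renderings can be sketched per base, starting from the raw byte layout documented for unpackRSRaw (code in bits 0 and 1, soft mask in bit 2, hard mask in bit 3). `renderPlain` and `renderMasked` are hypothetical names; they produce Char rather than Word8 for readability and are not the module's own code.

```haskell
import Data.Bits (testBit, (.&.))
import Data.Char (toLower)
import Data.Word (Word8)

-- | Hard-masked bases become N, all others become T, C, A or G.
renderPlain :: Word8 -> Char
renderPlain w
  | testBit w 3 = 'N'
  | otherwise   = "TCAG" !! fromIntegral (w .&. 3)

-- | Like 'renderPlain', but soft masking additionally lowers the case,
--   so a base with both masks set becomes a lower-case n.
renderMasked :: Word8 -> Char
renderMasked w
  | testBit w 2 = toLower (renderPlain w)
  | otherwise   = renderPlain w
```

Mapping `renderMasked` over the raw bytes 0, 1, 6, 8, 12 gives "TCaNn".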

newtype Masking

2bit supports two kinds of masking, typically rendered as lowercase letters (softMasked) and Ns (hardMasked). They can overlap (bothMasked), and even hard-masked regions have underlying sequence (which is normally ignored).

Constructors

Masking Word8
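One plausible way to model overlapping masks is a bit set combined with bitwise OR. The sketch below uses a separate type `Mask` with an assumed encoding (bit 0 for soft, bit 1 for hard); the actual bit assignment and Semigroup behaviour of Masking are not stated here, so treat every name and value in this block as an assumption.

```haskell
import Data.Bits ((.&.), (.|.))
import Data.Word (Word8)

-- | Stand-in for a mask bit set (assumed encoding, not the real Masking).
newtype Mask = Mask Word8 deriving (Eq, Show)

-- | Combining two masks keeps every flag set in either one.
instance Semigroup Mask where
  Mask a <> Mask b = Mask (a .|. b)

instance Monoid Mask where
  mempty = Mask 0

none, soft, hard, both :: Mask
none = Mask 0
soft = Mask 1   -- assumed: bit 0 marks soft masking
hard = Mask 2   -- assumed: bit 1 marks hard masking
both = Mask 3

isSoft, isHard :: Mask -> Bool
isSoft (Mask m) = m .&. 1 /= 0
isHard (Mask m) = m .&. 2 /= 0
```

Under this model `soft <> hard` equals `both`, which mirrors the statement that the two kinds of masking can overlap.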

Instances

Monoid Masking
Defined in Bio.TwoBit. Methods: mempty :: Masking; mappend :: Masking -> Masking -> Masking; mconcat :: [Masking] -> Masking
Semigroup Masking
Defined in Bio.TwoBit. Methods: (<>) :: Masking -> Masking -> Masking; sconcat :: NonEmpty Masking -> Masking; stimes :: Integral b => b -> Masking -> Masking
Bounded Masking
Defined in Bio.TwoBit. Methods: minBound :: Masking; maxBound :: Masking
Enum Masking
Defined in Bio.TwoBit. Methods: succ :: Masking -> Masking; pred :: Masking -> Masking; toEnum :: Int -> Masking; fromEnum :: Masking -> Int; enumFrom :: Masking -> [Masking]; enumFromThen :: Masking -> Masking -> [Masking]; enumFromTo :: Masking -> Masking -> [Masking]; enumFromThenTo :: Masking -> Masking -> Masking -> [Masking]
Read Masking
Defined in Bio.TwoBit. Methods: readsPrec :: Int -> ReadS Masking; readList :: ReadS [Masking]; readPrec :: ReadPrec Masking; readListPrec :: ReadPrec [Masking]
Show Masking
Defined in Bio.TwoBit. Methods: showsPrec :: Int -> Masking -> ShowS; show :: Masking -> String; showList :: [Masking] -> ShowS
Eq Masking
Defined in Bio.TwoBit. Methods: (==) :: Masking -> Masking -> Bool; (/=) :: Masking -> Masking -> Bool
Ord Masking
Defined in Bio.TwoBit. Methods: compare :: Masking -> Masking -> Ordering; (<) :: Masking -> Masking -> Bool; (<=) :: Masking -> Masking -> Bool; (>) :: Masking -> Masking -> Bool; (>=) :: Masking -> Masking -> Bool; max :: Masking -> Masking -> Masking; min :: Masking -> Masking -> Masking

isSoftMasked :: Masking -> Bool

isHardMasked :: Masking -> Bool

noneMasked :: Masking

softMasked :: Masking

hardMasked :: Masking

bothMasked :: Masking