Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Bio.TwoBit
Description
.2bit format (from the UCSC Genome Browser FAQ)
A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.
The file begins with a 16-byte header containing the following fields:
- signature - the number 0x1A412743 in the architecture of the machine that created the file
- version - zero for now. Readers should abort if they see a version number higher than 0
- sequenceCount - the number of sequences in the file
- reserved - always zero for now
All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.
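The signature check can be sketched as follows. This is a minimal illustration that works on a plain list of bytes, not the code used by this module; the names `Endianness`, `Header`, `word32` and `parseHeader` are hypothetical. Instead of byte-swapping the signature, it reads the first word in both byte orders, which amounts to the same test.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32, Word8)

-- | Byte order of the file, as detected from the signature field.
data Endianness = LittleEndian | BigEndian
  deriving (Eq, Show)

-- | Header fields of interest (hypothetical record, for illustration only).
data Header = Header
  { hdrEndianness :: Endianness
  , hdrVersion    :: Word32
  , hdrSeqCount   :: Word32
  } deriving (Eq, Show)

-- | Assemble a 32-bit word from four bytes in the given byte order.
word32 :: Endianness -> [Word8] -> Word32
word32 LittleEndian = foldr (\b acc -> acc `shiftL` 8 .|. fromIntegral b) 0
word32 BigEndian    = foldl (\acc b -> acc `shiftL` 8 .|. fromIntegral b) 0

signature :: Word32
signature = 0x1A412743

-- | Parse the 16-byte header, rejecting unknown signatures and versions.
parseHeader :: [Word8] -> Either String Header
parseHeader bytes
  | length (take 16 bytes) < 16 = Left "header shorter than 16 bytes"
  | otherwise = do
      endian <- detect
      -- field i is the i-th 32-bit word of the header, in the detected order
      let field i = word32 endian (take 4 (drop (4 * i) bytes))
      if field 1 /= 0
        then Left "unsupported version"
        else Right (Header endian (field 1) (field 2))
  where
    sigBytes = take 4 bytes
    detect
      | word32 LittleEndian sigBytes == signature = Right LittleEndian
      | word32 BigEndian sigBytes == signature    = Right BigEndian
      | otherwise                                 = Left "bad signature"
```

Once the byte order is known, every later multi-byte field would be read with the same `word32 endian`.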
The header is followed by a file index, which contains one entry for each sequence. Each index entry contains three fields:
- nameSize - a byte containing the length of the name field
- name - the sequence name itself (in ASCII-compatible byte string), of variable length depending on nameSize
- offset - the 32-bit offset of the sequence data relative to the start of the file, not aligned to any 4-byte padding boundary
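Reading the index could look roughly like this. The sketch assumes a little-endian file and a plain list of bytes as input; `IndexEntry`, `parseEntry` and `parseIndex` are hypothetical names and not part of this module.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.Word (Word32, Word8)

-- | One index entry: the sequence name and the file offset of its record.
data IndexEntry = IndexEntry
  { entryName   :: String
  , entryOffset :: Word32
  } deriving (Eq, Show)

-- | Parse one entry from the front of the input and return the unconsumed rest.
-- Assumes little-endian byte order; a byte-swapped file needs the offset reversed.
parseEntry :: [Word8] -> Maybe (IndexEntry, [Word8])
parseEntry [] = Nothing
parseEntry (n : rest)
  | length nameBytes < nameLen = Nothing
  | otherwise =
      case splitAt 4 afterName of
        (offBytes@[_, _, _, _], remaining) ->
          Just (IndexEntry name (leWord32 offBytes), remaining)
        _ -> Nothing
  where
    nameLen = fromIntegral n
    (nameBytes, afterName) = splitAt nameLen rest
    name = map (chr . fromIntegral) nameBytes
    leWord32 :: [Word8] -> Word32
    leWord32 = foldr (\b acc -> acc `shiftL` 8 .|. fromIntegral b) 0

-- | Parse exactly the given number of entries (sequenceCount from the header).
parseIndex :: Int -> [Word8] -> Maybe [IndexEntry]
parseIndex 0 _ = Just []
parseIndex k bs = do
  (entry, rest) <- parseEntry bs
  (entry :) <$> parseIndex (k - 1) rest
```

Because nameSize is a single byte, a name can be at most 255 bytes long, and the entries have to be read in order since their length varies.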
The index is followed by the sequence records, which contain nine fields:
- dnaSize - number of bases of DNA in the sequence
- nBlockCount - the number of blocks of Ns in the file (representing unknown sequence)
- nBlockStarts - an array of length nBlockCount of 32 bit integers indicating the (0-based) starting position of a block of Ns
- nBlockSizes - an array of length nBlockCount of 32 bit integers indicating the length of a block of Ns
- maskBlockCount - the number of masked (lower-case) blocks
- maskBlockStarts - an array of length maskBlockCount of 32 bit integers indicating the (0-based) starting position of a masked block
- maskBlockSizes - an array of length maskBlockCount of 32 bit integers indicating the length of a masked block
- reserved - always zero for now
- packedDna - the DNA packed to two bits per base, represented as follows: T - 00, C - 01, A - 10, G - 11. The first base is in the most significant 2 bits of a byte; the last base is in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011.
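The packing described for packedDna can be illustrated with a small decoder. This is a sketch over a list of bytes with hypothetical names (`baseChar`, `unpackByte`, `unpackDna`); it ignores the N and mask blocks entirely.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)

-- | Map a 2-bit code to its base: T = 00, C = 01, A = 10, G = 11.
baseChar :: Word8 -> Char
baseChar code = "TCAG" !! fromIntegral (code .&. 3)

-- | The four bases stored in one byte, most significant pair first.
unpackByte :: Word8 -> String
unpackByte b = [baseChar (b `shiftR` s) | s <- [6, 4, 2, 0]]

-- | Decode the first n bases; the final byte may be only partially used.
unpackDna :: Int -> [Word8] -> String
unpackDna n = take n . concatMap unpackByte
```

With these definitions, `unpackByte 0x1B` gives `"TCAG"`, matching the 00011011 example from the field description.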
In this format, it is neither possible nor necessary to store Ns in the main sequence, and one wouldn't expect them to take up space there. However, they do; hard masked sequence is typically stored as many Ts. The sensible way to treat these is probably to just say there are two kinds of implied annotation (repeats and large gaps for a typical genome), which can be interpreted in whatever way fits.
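Treating the two block lists as implied annotation amounts to an interval lookup per position. The following sketch uses hypothetical names (`Block`, `covered`, `annotations`) and a naive linear search; a real reader would more likely walk the sorted blocks alongside the sequence.

```haskell
-- | A block as stored in a sequence record: (0-based start, length).
type Block = (Int, Int)

-- | True if the position falls inside any of the blocks.
covered :: [Block] -> Int -> Bool
covered blocks pos =
  any (\(start, size) -> pos >= start && pos < start + size) blocks

-- | For each of the first n positions: (inside an N block, inside a mask block).
annotations :: [Block] -> [Block] -> Int -> [(Bool, Bool)]
annotations nBlocks maskBlocks n =
  [(covered nBlocks i, covered maskBlocks i) | i <- [0 .. n - 1]]
```

How the two flags are turned into output (Ns, lower case, or something else) is left to the caller, which mirrors the "whatever way fits" reading above.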
Synopsis
- data TwoBitFile = TBF {
- tbf_raw :: !(ForeignPtr Word8)
- tbf_size :: !Int
- tbf_path :: !ByteString
- tbf_chroms :: !(Array TwoBitChromosome)
- tbf_chrmap :: !(HashMap ByteString TwoBitChromosome)
- openTwoBit :: FilePath -> IO TwoBitFile
- data TwoBitChromosome = TBC {
- tbc_raw :: !(ForeignPtr Word8)
- tbc_name :: !ByteString
- tbc_index :: !Int
- tbc_dna_offset :: !Word32
- tbc_dna_size :: !Word32
- tbc_fwd_seq :: Int -> TwoBitSequence' Unidrectional
- tbc_rev_seq :: Int -> TwoBitSequence' Bidirectional
- tbf_chrnames :: TwoBitFile -> [ByteString]
- findChrom :: ByteString -> TwoBitFile -> Maybe TwoBitChromosome
- data TwoBitSequence' dir
- = SomeSeq !Masking !(ForeignPtr Word8) !Word !Int (TwoBitSequence' dir)
- | RefEnd
- type TwoBitSequence = TwoBitSequence' Unidrectional
- data Unidrectional
- data Bidirectional
- unpackRSRaw :: TwoBitSequence' dir -> [Word8]
- unpackRS :: TwoBitSequence' dir -> [Word8]
- unpackRSMasked :: TwoBitSequence' dir -> [Word8]
- newtype Masking = Masking Word8
- isSoftMasked :: Masking -> Bool
- isHardMasked :: Masking -> Bool
- noneMasked :: Masking
- softMasked :: Masking
- hardMasked :: Masking
- bothMasked :: Masking
Documentation
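data TwoBitFile
Constructors
TBF
Fields
- tbf_raw :: !(ForeignPtr Word8)
- tbf_size :: !Int
- tbf_path :: !ByteString
- tbf_chroms :: !(Array TwoBitChromosome)
- tbf_chrmap :: !(HashMap ByteString TwoBitChromosome)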
openTwoBit :: FilePath -> IO TwoBitFile
Brings a 2bit file into memory. The file is mmap'ed, so it will not work on streams that are not actual files. It's also unsafe if the file is concurrently modified in any way.
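data TwoBitChromosome
Constructors
TBC
Fields
- tbc_raw :: !(ForeignPtr Word8)
- tbc_name :: !ByteString
- tbc_index :: !Int
- tbc_dna_offset :: !Word32
- tbc_dna_size :: !Word32
- tbc_fwd_seq :: Int -> TwoBitSequence' Unidrectional
- tbc_rev_seq :: Int -> TwoBitSequence' Bidirectional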
tbf_chrnames :: TwoBitFile -> [ByteString]
findChrom :: ByteString -> TwoBitFile -> Maybe TwoBitChromosome
Finds a named scaffold in the reference. If it doesn't find the exact name, it will try to compensate for the crazy naming differences between NCBI and UCSC. This doesn't work in general, but is good enough in the common case. In particular, "1" maps to "chr1" and back, "GL000192.1" to "chr1_gl000192_random" and back, and "chrM" to "MT" and back.
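The fallback idea can be sketched independently of this module. The code below is not the implementation of findChrom; it is a simplified illustration on `String` keys and an association list, with hypothetical names (`alternativeNames`, `lookupChrom`), and it only covers the "chr" prefix and the mitochondrial special case, not the unplaced scaffolds.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)

-- | Other spellings to try when the exact name is absent.
-- Only the "chr" prefix and the chrM/MT pair are handled in this sketch.
alternativeNames :: String -> [String]
alternativeNames "chrM" = ["MT"]
alternativeNames "MT"   = ["chrM"]
alternativeNames name =
  case stripPrefix "chr" name of
    Just short -> [short]
    Nothing    -> ["chr" ++ name]

-- | Look up the exact name first, then each alternative spelling.
lookupChrom :: String -> [(String, a)] -> Maybe a
lookupChrom name table =
  listToMaybe (mapMaybe (`lookup` table) (name : alternativeNames name))
```

Trying the exact name first keeps the lookup predictable for references that already use the caller's convention.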
data TwoBitSequence' dir
This is a (piece of a) reference sequence. It consists of stretches with uniform masking.
The offset is stored as a Word. This is done because on a 32 bit platform, every bit counts. This limits the genome to approximately four gigabases, which would be a file of about one gigabyte. That's just about enough to work with the human genome. On a 64 bit platform, the file format itself imposes a limit of four gigabytes, or about 16 gigabases in total.
If length is zero, the piece is empty and the mask, pointer, and offset fields may not be valid. If length is positive, ptr+offset points at the first base of the piece. If length is negative, ptr+offset points just past the end of the piece, ptr+offset+length points to the first base of the piece, and the sequence is meant to be reverse complemented.
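The sign convention for length can be made concrete with a small sketch. It assumes offsets are counted in bases and uses hypothetical names (`Piece`, `orient`, `revCompl`); it does not touch the real SomeSeq constructor. With the encoding T = 0, C = 1, A = 2, G = 3, complementing a base is an exclusive-or with 2.

```haskell
import Data.Bits (xor)
import Data.Word (Word8)

-- | A piece normalised to a forward range plus an orientation flag.
data Piece = Piece
  { pieceStart    :: Int   -- ^ position of the first base in storage order
  , pieceLength   :: Int   -- ^ number of bases, always positive
  , pieceRevCompl :: Bool  -- ^ True if the bases must be reverse complemented
  } deriving (Eq, Show)

-- | Interpret an (offset, length) pair following the sign convention.
orient :: Int -> Int -> Maybe Piece
orient offset len
  | len == 0  = Nothing                                    -- empty piece
  | len > 0   = Just (Piece offset len False)
  | otherwise = Just (Piece (offset + len) (negate len) True)

-- | Reverse complement a run of 2-bit codes (T <-> A, C <-> G).
revCompl :: [Word8] -> [Word8]
revCompl = reverse . map (`xor` 2)
```

For example, `orient 10 (-4)` describes the four bases at positions 6 to 9, to be read back to front and complemented.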
In a TwoBitSequence, length must not be negative. In a TwoBitSequence' Bidirectional, length can be positive or negative.
Constructors
SomeSeq !Masking !(ForeignPtr Word8) !Word !Int (TwoBitSequence' dir)
RefEnd
Instances
Show (TwoBitSequence' dir)
Defined in Bio.TwoBit
Methods
- showsPrec :: Int -> TwoBitSequence' dir -> ShowS
- show :: TwoBitSequence' dir -> String
- showList :: [TwoBitSequence' dir] -> ShowS
data Unidrectional
data Bidirectional
unpackRSRaw :: TwoBitSequence' dir -> [Word8]
Unpacks a reference sequence into a (very long) list of bytes. Each byte contains the nucleotide in bits 0 and 1 with values 0..3 corresponding to TCAG, and the soft and hard mask bits in bits 2 and 3, respectively.
unpackRS :: TwoBitSequence' dir -> [Word8]
Unpacks a reference sequence into a (very long) list of ASCII characters. Hard masked nucleotides become the letter N, others become TCAG.
unpackRSMasked :: TwoBitSequence' dir -> [Word8]
Unpacks a reference sequence into a list of ASCII characters, interpreting masking in the customary way. Specifically, hard masking produces Ns, soft masking produces lower case letters, and dual masking produces lower case Ns.
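The customary rendering can be sketched per byte. The function below is a hypothetical helper (`renderMasked`), not the body of unpackRSMasked; it assumes the raw layout documented for unpackRSRaw, with the nucleotide in bits 0 and 1, the soft mask in bit 2 and the hard mask in bit 3.

```haskell
import Data.Bits (testBit, (.&.))
import Data.Char (toLower)
import Data.Word (Word8)

-- | Render one raw byte: hard masking gives N, soft masking gives lower case,
-- and both together give a lower case n.
renderMasked :: Word8 -> Char
renderMasked w = applySoft (if hard then 'N' else base)
  where
    base      = "TCAG" !! fromIntegral (w .&. 3)
    soft      = testBit w 2
    hard      = testBit w 3
    applySoft = if soft then toLower else id
```

Applied to the raw bytes 1, 5, 9 and 13 (a C that is unmasked, soft masked, hard masked, and both), this gives `C`, `c`, `N` and `n`.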
newtype Masking
2bit supports two kinds of masking, typically rendered as lowercase letters (MaskSoft) and Ns (MaskHard). They can overlap (MaskBoth), and even the hard masking has underlying sequence (which is normally ignored).
Instances
Monoid Masking
Semigroup Masking
Bounded Masking
Enum Masking
Read Masking
Show Masking
Eq Masking
Ord Masking
isSoftMasked :: Masking -> Bool
isHardMasked :: Masking -> Bool
noneMasked :: Masking
softMasked :: Masking
hardMasked :: Masking
bothMasked :: Masking