Manipulate genomic features and validate the syntax and reference sequence of your GFF3 files.
- Free software: BSD license
- Documentation: https://gff3-py.readthedocs.org.
- Simple data structures: Parses a `GFF3` file into a structure composed of simple Python `dict` and `list`.
- Validation: Validates the `GFF3` syntax on parse, and saves the error messages in the parsed structure.
- Best effort parsing: Despite any detected errors, continues to parse the whole file and makes as much sense of it as possible.
- Uses the Python `logging` library to log error messages, with support for custom loggers.
- Parses embedded or external `FASTA` sequences to check bounds and number of `N`s.
- Checks and corrects the phase for `CDS` features.
- Tree traversal methods `ancestors` and `descendants` return a simple `list` in breadth-first search order.
- Transfer children and parents using the `adopt` and `adopted` methods.
- Test for overlapping features using the `overlap` method.
- Remove a feature and its associated features using the `remove` method.
- Write the modified structure to a GFF3 file using the `write` method.
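To illustrate what "a simple list in breadth-first search order" means for the traversal methods listed above, here is a minimal standalone sketch. It does not use the library: the feature dictionaries, the `children` key, and the `descendants` helper are hypothetical stand-ins chosen for illustration, and the library's own implementation may differ.

```python
from collections import deque


def descendants(feature):
    """Return all descendants of `feature` in breadth-first order.

    Assumes each feature is a dict with an optional 'children' list
    (a hypothetical layout, not necessarily the library's).
    """
    seen = set()
    order = []
    queue = deque(feature.get('children', []))
    while queue:
        node = queue.popleft()
        if id(node) in seen:
            # a feature with several parents is reported only once
            continue
        seen.add(id(node))
        order.append(node)
        queue.extend(node.get('children', []))
    return order


# A small feature tree: exon2 is shared by both transcripts.
exon1 = {'id': 'exon1'}
exon2 = {'id': 'exon2'}
mrna1 = {'id': 'mRNA1', 'children': [exon1, exon2]}
mrna2 = {'id': 'mRNA2', 'children': [exon2]}
gene = {'id': 'gene1', 'children': [mrna1, mrna2]}

print([f['id'] for f in descendants(gene)])
# ['mRNA1', 'mRNA2', 'exon1', 'exon2']
```

All direct children come first, followed by grandchildren, which is why both transcripts appear before any exon.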
An example that just parses a GFF3 file named annotations.gff and validates it
using an external FASTA file named annotations.fa looks like:
# validate.py
# ============
from gff3 import Gff3
# initialize a Gff3 object
gff = Gff3()
# parse GFF3 file and do syntax checking, this populates gff.lines and gff.features
# if an embedded ##FASTA directive is found, parse the sequences into gff.fasta_embedded
gff.parse('annotations.gff')
# parse the external FASTA file into gff.fasta_external
gff.parse_fasta_external('annotations.fa')
# Check seqid, bounds and the number of Ns in each feature using one or more reference sources
gff.check_reference(allowed_num_of_n=0, feature_types=['CDS'])
# Checks whether child features are within the coordinate boundaries of parent features
gff.check_parent_boundary()
# Calculates the correct phase and checks if it matches the given phase for CDS features
gff.check_phase()

A more feature-complete GFF3 validator with a command-line interface, which also generates a validation report in Markdown, is available under examples/gff_valid.py.
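For readers unfamiliar with the phase check in the first example, the following standalone sketch shows one common way to compute the expected phase of each CDS segment, following the GFF3 definition of phase as the number of bases to skip to reach the next codon start. The `expected_phases` function is hypothetical and is not part of the library. It assumes 1-based inclusive coordinates and that the first segment in transcription order begins a complete codon.

```python
def expected_phases(segments, strand):
    """Return {(start, end): phase} for the CDS segments of one transcript.

    Assumptions (not taken from the library): coordinates are 1-based and
    inclusive, and the first segment in transcription order has phase 0.
    """
    # Transcription order is ascending on '+', descending on '-'.
    ordered = sorted(segments, reverse=(strand == '-'))
    phases = {}
    phase = 0
    for start, end in ordered:
        phases[(start, end)] = phase
        length = end - start + 1
        # Bases left over after the last complete codon determine how many
        # bases the next segment must skip to reach a codon boundary.
        phase = (3 - (length - phase) % 3) % 3
    return phases


cds = [(1, 10), (21, 25), (31, 39)]
print(expected_phases(cds, '+'))
# {(1, 10): 0, (21, 25): 2, (31, 39): 0}
print(expected_phases(cds, '-'))
# {(31, 39): 0, (21, 25): 0, (1, 10): 1}
```

Comparing these expected values with the phase column of each `CDS` line is the essence of a phase check; a real validator also has to handle incomplete transcripts, where the first segment may legitimately start with a non-zero phase.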
The following example demonstrates how to filter, traverse, and modify the parsed GFF3 lines list.
- Change features with type `exon` to `pseudogenic_exon` and type `transcript` to `pseudogenic_transcript` if the feature has an ancestor of type `pseudogene`.
- If a `pseudogene` feature overlaps with a `gene` feature, move all of the children from the `pseudogene` feature to the `gene` feature, and remove the `pseudogene` feature.
# fix_pseudogene.py
# =================
from gff3 import Gff3
gff = Gff3('annotations.gff')
type_map = {'exon': 'pseudogenic_exon', 'transcript': 'pseudogenic_transcript'}
pseudogenes = [line for line in gff.lines if line['line_type'] == 'feature' and line['type'] == 'pseudogene']
for pseudogene in pseudogenes:
    # convert types
    for line in gff.descendants(pseudogene):
        if line['type'] in type_map:
            line['type'] = type_map[line['type']]
    # find overlapping gene
    overlapping_genes = [line for line in gff.lines if line['line_type'] == 'feature' and line['type'] == 'gene' and gff.overlap(line, pseudogene)]
    if overlapping_genes:
        # move pseudogene children to overlapping gene
        gff.adopt(pseudogene, overlapping_genes[0])
        # remove pseudogene
        gff.remove(pseudogene)
gff.write('annotations_fixed.gff')
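
The script above relies on an overlap test between two features. As a rough illustration of what such a test involves, here is a standalone sketch that treats features as plain dictionaries with `seqid`, `start`, and `end` keys (named after the GFF3 columns). The `features_overlap` helper is hypothetical, and the library's own `overlap` method may apply different or additional rules, such as strand checks.

```python
def features_overlap(a, b):
    """Return True if two features share at least one base.

    Assumes 1-based inclusive coordinates; features on different
    sequences never overlap.
    """
    return (a['seqid'] == b['seqid']
            and a['start'] <= b['end']
            and b['start'] <= a['end'])


pseudogene = {'seqid': 'chr1', 'type': 'pseudogene', 'start': 100, 'end': 500}
gene = {'seqid': 'chr1', 'type': 'gene', 'start': 450, 'end': 900}
other = {'seqid': 'chr2', 'type': 'gene', 'start': 450, 'end': 900}

print(features_overlap(pseudogene, gene))   # True
print(features_overlap(pseudogene, other))  # False
```

Because the coordinates are inclusive, two features that merely touch at a single base (one ends at 500, the other starts at 500) count as overlapping under this definition.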