Skip to content

🧬 High-performance VCF file parser and reformatter with VEP annotation support. Converts complex VCF files to analyzable TSV format with intelligent transcript handling.

License

Notifications You must be signed in to change notification settings

flalom/vcf-reformatter

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

15 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

VCF Reformatter: What is it?

Did it ever happen that you had VCF files and you wanted to have a look at the data as you would do with a normal table? VCF Reformatter is here for your rescue!

A Rust command-line tool for parsing and reformatting VCF (Variant Call Format) files, with support for VEP (Variant Effect Predictor) and SnpEff annotations. This tool flattens complex VCF files into tab-separated values (TSV) format for easier downstream analysis. Also incredibly useful for quick checks to your data!

VCF Reformatter

License: MIT Rust Build Status Performance Release

install with bioconda Conda

Transform complex VCF files into clean, analyzable tables with ease

A high-performance Rust tool for flattening VCF files with intelligent VEP and SnpEff annotation handling


πŸš€ Quick Start

# Download binary from releases (easiest! You download and use it)
wget https://github.com/flalom/vcf-reformatter/releases/latest/download/vcf-reformatter-v0.3.0-linux-x86_64
chmod +x vcf-reformatter-v0.3.0-linux-x86_64

# Transform your VCF file  
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz

# Generate MAF output ⚠️ (in beta!)
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz --output-format maf

OR Via Bioconda

conda install -c bioconda vcf-reformatter
# or
# mamba install vcf-reformatter -c bioconda

OR install from crates.io:

cargo install vcf-reformatter

OR build from source (you need Rust toolchain):

git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release
./target/release/vcf-reformatter sample.vcf.gz

⚠️ Experimental MAF support

MAF output is currently in beta testing (v0.3.0). Known limitations:

  • VAF calculation needs refinement for some genotype patterns
  • Multi-sample handling requires validation
  • Use with caution in production workflows

Memory considerations for MAF:

  • Files >100K variants: Monitor memory usage
  • Files >1M variants: Ensure adequate RAM (16GB+)

🎯 Why VCF Reformatter?

The Problem: VCF files are notoriously difficult to analyze. Complex nested annotations, semicolon-separated INFO fields, and multi-transcript VEP annotations make downstream analysis a nightmare.

The Solution: VCF Reformatter flattens everything into clean, readable TSV format that works seamlessly with Excel, R, Python, and any analysis tool (⚠️ beware Excel auto-correction!).

Before & After

Before (Raw VCF):

chr1  69511  .  A  G  1294.53  .  DP=65;AF=1;CSQ=G|missense_variant|MODERATE|OR4F5|ENSG00000186092...

After (Reformatted TSV):

CHROM  POS    REF  ALT  QUAL     INFO_DP  INFO_AF  CSQ_Allele  CSQ_Consequence      CSQ_SYMBOL
chr1   69511  A    G    1294.53  65       1        G           missense_variant     OR4F5

✨ Key Features

Feature Description Benefit
🧬 VEP/SnpEff Annotation Parsing Intelligent handling of CSQ/ANN annotations No more manual parsing of complex VEP/SnpEff output
πŸ‘€ Automatic Annotation Recognition Automatic detection of CSQ/ANN annotations Saving even more time now for both VEP and SnpEff
πŸ”€ Smart Transcript Handling Most severe, first only, or split transcripts Choose the analysis approach that fits your needs
πŸš€ Parallel Processing Multi-threaded processing up to 30k variants/sec Process large cohorts in minutes, not hours
πŸ“ Native Compression Direct .vcf.gz reading & gzip output Seamless workflow with compressed/uncompressed files
🎯 Production Ready Comprehensive error handling & logging Reliable for automated pipelines
🐳 Container Support Docker & Singularity ready Deploy anywhere, from laptops to HPC clusters

πŸ“¦ Installation

Option 1: Download Pre-compiled Binaries (Easiest!)

No Rust installation required - just download and run:

  1. Go to Releases

  2. Download the binary for your platform:

    • vcf-reformatter-v0.3.0-linux-x86_64 β†’ Linux (most users)
    • vcf-reformatter-v0.3.0-linux-x86_64-static β†’ HPC clusters (works everywhere)
    • vcf-reformatter-v0.3.0-windows-x86_64.exe β†’ Windows
    • vcf-reformatter-v0.3.0-macos-x86_64 β†’ Intel Mac
    • vcf-reformatter-v0.3.0-macos-arm64 β†’ Apple Silicon Mac (M1/M2/M3/M4)
  3. Make executable and run:

# Linux/Mac
chmod +x vcf-reformatter-*
./vcf-reformatter-* --help

# Windows
# Just double-click or run from command prompt
# C++ might be required, if not already installed

Option 2: Build from Source

git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release

Option 3: Docker

# Build the container
docker build -t vcf-reformatter .

# Run with your data
docker run --rm -v $(pwd):/data vcf-reformatter /data/sample.vcf.gz

Option 4: Singularity

# Build Singularity image
singularity build vcf-reformatter.sif Singularity

# Run on HPC cluster
singularity run --bind $PWD:/data vcf-reformatter.sif /data/sample.vcf.gz -j 16

πŸ› οΈ Usage

Basic Usage

# Simple conversion
vcf-reformatter input.vcf.gz

# Most severe consequence only (recommended for analysis)
vcf-reformatter input.vcf.gz -t most-severe

# All transcripts in separate rows (comprehensive)
vcf-reformatter input.vcf.gz -t split

Annotation Type Detection

# Auto-detect annotation type (recommended)
vcf-reformatter input.vcf.gz -a auto

# Force VEP processing
vcf-reformatter vep_annotated.vcf.gz -a vep -t most-severe

# Force SnpEff processing  
vcf-reformatter snpeff_annotated.vcf.gz -a snpeff -t most-severe

Advanced Usage

# High-performance processing with compression
vcf-reformatter large_cohort.vcf.gz \
  --transcript-handling most-severe \
  --threads 0 \
  --compress \
  --output-dir results/ \
  --prefix my_analysis \
  --verbose

# Optimized for HPC environments
vcf-reformatter huge_dataset.vcf.gz -t most-severe -j 32 -o /scratch/results/ -c -v

Complete Options

Usage: vcf-reformatter [OPTIONS] <INPUT_FILE>

Arguments:
  <INPUT_FILE>  Input VCF file (supports .vcf.gz)

Options:
  --output-format <FORMAT>     Output format [default: tsv] 
                               [values: tsv, maf]
  --center <CENTER>            Sequencing center for MAF output  
  --ncbi-build <BUILD>         Genome build 
                               [default: GRCh38]
  --sample-barcode <BARCODE>   Sample identifier for MAF output
  -t, --transcript-handling <MODE>  How to handle multiple transcripts
                                   [default: first]
                                   [values: most-severe, first, split]
  -a, --annotation-type <N>        Which annotations to parse VEP/SnpEff
                                   [default: auto]
                                   [values: snpeff, vep, auto]
  -j, --threads <N>                Thread count (0 = auto-detect) [default: 1]
  -o, --output-dir <DIR>           Output directory [default: current]
  -p, --prefix <PREFIX>            Output file prefix [default: input filename]
  -c, --compress                   Compress output with gzip
  -v, --verbose                    Detailed performance statistics
  -h, --help                       Show help
  -V, --version                    Show version

🧬 Transcript Handling Modes

VCF files with VEP annotations often contain multiple transcript annotations per variant. Choose the strategy that fits your analysis:

🎯 Most Severe (--transcript-handling most-severe)

Best for: Clinical analysis, variant prioritization

vcf-reformatter input.vcf.gz -t most-severe

# for maf output
vcf-reformatter input.vcf.gz -t most-severe --output-format maf

Selects the transcript with the most severe consequence (stop_gained > missense_variant > synonymous, etc.)

⚑ First Only (--transcript-handling first) [Default]

Best for: Quick analysis, performance-critical workflows

vcf-reformatter input.vcf.gz  # Uses first transcript by default

Processes only the first transcript annotation (fastest option)

πŸ“Š Split All (--transcript-handling split)

Best for: Comprehensive analysis, transcript-level studies

vcf-reformatter input.vcf.gz -t split

Creates separate rows for each transcript (most detailed output)

πŸ“ˆ Performance

Benchmarks

  • Small files (< 1K variants): ~5,000 variants/sec
  • Medium files (1K-10K variants): ~15,000 variants/sec
  • Large files (10K+ variants): ~30,000 variants/sec

Optimization Tips

# Auto-detect optimal thread count
vcf-reformatter input.vcf.gz -j 0

# For files > 10K variants, use parallel processing
vcf-reformatter input.vcf.gz -t most-severe -j 0 -v

# Combine with compression for large outputs
vcf-reformatter input.vcf.gz -t split -j 0 -c -v

πŸ“Š Output Format

File Structure

VCF Reformatter generates two files:

  • {prefix}_header.txt - Original VCF header and metadata
  • {prefix}_reformatted.tsv - Flattened tabular data

Column Types

  1. Standard VCF: CHROM, POS, ID, REF, ALT, QUAL, FILTER
  2. INFO Fields: INFO_DP, INFO_AF, INFO_AC, etc.
  3. VEP Annotations: CSQ_Allele, CSQ_Consequence, CSQ_SYMBOL, CSQ_Gene, etc.
  4. SnpEff Annotations: ANN_Allele, ANN_Annotation_Impact, ANN_Gene_Name, ANN_Distance, etc.
  5. Sample Data: SAMPLE1_GT, SAMPLE1_DP, SAMPLE1_AD, etc.

Example Output VEP

CHROM  POS    ID     REF  ALT  QUAL     FILTER  INFO_DP  CSQ_Consequence      CSQ_SYMBOL  SAMPLE1_GT
chr1   69511  .      A    G    1294.53  PASS    65       missense_variant     OR4F5       1/1
chr1   69761  rs123  C    T    892.15   PASS    42       synonymous_variant   OR4F5       0/1

Example Output SnpEff

CHROM  POS    ID     REF  ALT  QUAL     FILTER  INFO_DP  ANN_Annotation          ANN_Gene_Name  SAMPLE1_GT
chr1   69761  rs587   C    T  730  PASS   .     214      synonymous_variant      OR4F5          0/1
chr1   924024  .      A    G  53   PASS   .     409      5_prime_UTR_variant     SAMD11         1/1

πŸ”§ Integration Examples

With R

# Read compressed output directly
library(data.table)
data <- fread("output_reformatted.tsv.gz")

# Quick variant summary
summary(data$CSQ_Consequence)

With Python

import pandas as pd

# Load and analyze
df = pd.read_csv("output_reformatted.tsv.gz", sep="\t", compression="gzip")
df['CSQ_Consequence'].value_counts()

In Workflows

# Nextflow pipeline
vcf-reformatter ${vcf} -t most-severe -j ${task.cpus} -o results/ -c

# Snakemake rule
shell: "vcf-reformatter {input.vcf} -t most-severe -j {threads} -o {params.outdir} -c"

🐳 Container Usage

Docker

# Build once
docker build -t vcf-reformatter .

# Run anywhere
docker run --rm \
  -v $(pwd):/data \
  vcf-reformatter \
  /data/input.vcf.gz \
  -t most-severe -j 4 -o /data/results/ -c

Singularity (HPC)

# On HPC cluster
singularity run \
  --bind $PWD:/data \
  --bind /scratch:/scratch \
  vcf-reformatter.sif \
  /data/large_cohort.vcf.gz \
  -t most-severe -j 16 -o /scratch/results/ -c -v

πŸ§ͺ Use Cases

Use Case Command Why It Works
Clinical Variant Review vcf-reformatter variants.vcf.gz -t most-severe Prioritizes clinically relevant consequences
Population Analysis vcf-reformatter cohort.vcf.gz -t first -j 0 -c Fast processing of large cohorts
Transcript Studies vcf-reformatter genes.vcf.gz -t split -v Comprehensive transcript-level analysis
Quick Data Exploration vcf-reformatter sample.vcf.gz Simple, fast conversion for immediate analysis
HPC Batch Processing vcf-reformatter huge.vcf.gz -t most-severe -j 32 -c Optimized for high-performance computing

πŸš€ What's New in v0.3.0

  • βœ… MAF Output Support (in Beta⚠️) - Direct conversion to Mutation Annotation Format
  • βœ… Auto-metadata Detection (in Beta⚠️) - Extracts center/sample info from VCF headers for MAF
  • βœ… Memory-Efficient Processing (streaming) - Chunked streaming for large files (>>100K variants)
  • βœ… Enhanced Error Handling - Better processing of malformed files
  • βœ… Comprehensive Testing - 70+ test cases ensure reliability

Previous Releases

πŸš€ What's New in v0.2.0

  • βœ… SnpEff Support - Full ANN field parsing with intelligent detection
  • βœ… Smart Auto-Detection - Automatically identifies VEP vs SnpEff annotations
  • βœ… Enhanced Error Handling - Better processing of malformed or headerless files

TODOs

  • Add SnpEff supportβœ…
  • Output MAF format optionβœ…
  • Add stdin to combine with other tools, such as bcftools
  • Support for multi-sample VCF files in MAF output

🀝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Add tests for new functionality
  4. Commit your changes: git commit -am 'Add feature'
  5. Push to the branch: git push origin feature-name
  6. Submit a pull request

Development Setup

git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo test  # Run the test suite
cargo run -- data/sample.vcf.gz -v  # Test with sample data

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • VCF Format Contributors - For the standard that enables genomic data sharing
  • VEP Team - For the powerful variant annotation framework
  • Rust Community - For the incredible ecosystem that makes this possible
  • Bioinformatics Community - For feedback and feature requests

Frequently Asked Questions

Q: Which transcript handling mode should I use?

  • Clinical analysis: --transcript-handling most-severe
  • Quick exploration: --transcript-handling first
  • Comprehensive analysis: --transcript-handling split

Q: How does this compare to other VCF tools?

VCF Reformatter is specifically designed for:

  • Converting complex VEP/SnpEff annotations to tabular format
  • Handling multiple transcripts intelligently
  • High-performance parallel processing
  • Easy integration with R/Python workflows

Q: Can I use this in production pipelines?

Yes! VCF Reformatter is designed for production use with:

  • Comprehensive error handling
  • Docker/Singularity support
  • Automated testing
  • Stable CLI interface

Q: What's the difference between TSV and MAF output?

  • TSV: Direct flattening of VCF fields (default)
  • MAF (beta): Standardized cancer genomics format for downstream tools

Q: What if I get out-of-memory errors?

  • Use TSV format instead of MAF: vcf-reformatter file.vcf.gz -j 0 -c
  • Enable verbose mode to monitor: vcf-reformatter file.vcf.gz -v

πŸ“ž Support


⭐ Star this repo if VCF Reformatter helps your research!

Made with ❀️ by Flavio Lombardo

About

🧬 High-performance VCF file parser and reformatter with VEP annotation support. Converts complex VCF files to analyzable TSV format with intelligent transcript handling.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages