2 unstable releases
| 0.3.0 | Aug 15, 2025 |
|---|---|
| 0.2.0 | Jul 22, 2025 |
#97 in Biology
75 downloads per month
135KB
2.5K
SLoC
VCF Reformatter: What is it?
Did it ever happen that you had VCF files and you wanted to have a look at the data as you would do with a normal table? VCF Reformatter is here for your rescue!
A Rust command-line tool for parsing and reformatting VCF (Variant Call Format) files, with support for VEP (Variant Effect Predictor) and SnpEff annotations. This tool flattens complex VCF files into tab-separated values (TSV) format for easier downstream analysis. Also incredibly useful for quick checks to your data!
VCF Reformatter
Transform complex VCF files into clean, analyzable tables with ease
A high-performance Rust tool for flattening VCF files with intelligent VEP and SnpEff annotation handling
π Quick Start
# Download binary from releases (easiest! You download and use it)
wget https://github.com/flalom/vcf-reformatter/releases/latest/download/vcf-reformatter-v0.3.0-linux-x86_64
chmod +x vcf-reformatter-v0.3.0-linux-x86_64
# Transform your VCF file
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz
# Generate MAF output β οΈ (in beta!)
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz --output-format maf
OR Via Bioconda
conda install -c bioconda vcf-reformatter
# or
# mamba install vcf-reformatter -c bioconda
OR install from crates.io:
cargo install vcf-reformatter
OR build from source (you need Rust toolchain):
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release
./target/release/vcf-reformatter sample.vcf.gz
β οΈ Experimental MAF support
MAF output is currently in beta testing (v0.3.0). Known limitations:
- VAF calculation needs refinement for some genotype patterns
- Multi-sample handling requires validation
- Use with caution in production workflows
Memory considerations for MAF:
- Files >100K variants: Monitor memory usage
- Files >1M variants: Ensure adequate RAM (16GB+)
π― Why VCF Reformatter?
The Problem: VCF files are notoriously difficult to analyze. Complex nested annotations, semicolon-separated INFO fields, and multi-transcript VEP annotations make downstream analysis a nightmare.
The Solution: VCF Reformatter flattens everything into clean, readable TSV format that works seamlessly with Excel, R, Python, and any analysis tool (β οΈ beware Excel auto-correction!).
Before & After
Before (Raw VCF):
chr1 69511 . A G 1294.53 . DP=65;AF=1;CSQ=G|missense_variant|MODERATE|OR4F5|ENSG00000186092...
After (Reformatted TSV):
CHROM POS REF ALT QUAL INFO_DP INFO_AF CSQ_Allele CSQ_Consequence CSQ_SYMBOL
chr1 69511 A G 1294.53 65 1 G missense_variant OR4F5
β¨ Key Features
| Feature | Description | Benefit |
|---|---|---|
| 𧬠VEP/SnpEff Annotation Parsing | Intelligent handling of CSQ/ANN annotations | No more manual parsing of complex VEP/SnpEff output |
| π Automatic Annotation Recognition | Automatic detection of CSQ/ANN annotations | Saving even more time now for both VEP and SnpEff |
| π Smart Transcript Handling | Most severe, first only, or split transcripts | Choose the analysis approach that fits your needs |
| π Parallel Processing | Multi-threaded processing up to 30k variants/sec | Process large cohorts in minutes, not hours |
| π Native Compression | Direct .vcf.gz reading & gzip output |
Seamless workflow with compressed/uncompressed files |
| π― Production Ready | Comprehensive error handling & logging | Reliable for automated pipelines |
| π³ Container Support | Docker & Singularity ready | Deploy anywhere, from laptops to HPC clusters |
π¦ Installation
Option 1: Download Pre-compiled Binaries (Easiest!)
No Rust installation required - just download and run:
-
Go to Releases
-
Download the binary for your platform:
vcf-reformatter-v0.3.0-linux-x86_64β Linux (most users)vcf-reformatter-v0.3.0-linux-x86_64-staticβ HPC clusters (works everywhere)vcf-reformatter-v0.3.0-windows-x86_64.exeβ Windowsvcf-reformatter-v0.3.0-macos-x86_64β Intel Macvcf-reformatter-v0.3.0-macos-arm64β Apple Silicon Mac (M1/M2/M3/M4)
-
Make executable and run:
# Linux/Mac
chmod +x vcf-reformatter-*
./vcf-reformatter-* --help
# Windows
# Just double-click or run from command prompt
# C++ might be required, if not already installed
Option 2: Build from Source
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release
Option 3: Docker
# Build the container
docker build -t vcf-reformatter .
# Run with your data
docker run --rm -v $(pwd):/data vcf-reformatter /data/sample.vcf.gz
Option 4: Singularity
# Build Singularity image
singularity build vcf-reformatter.sif Singularity
# Run on HPC cluster
singularity run --bind $PWD:/data vcf-reformatter.sif /data/sample.vcf.gz -j 16
π οΈ Usage
Basic Usage
# Simple conversion
vcf-reformatter input.vcf.gz
# Most severe consequence only (recommended for analysis)
vcf-reformatter input.vcf.gz -t most-severe
# All transcripts in separate rows (comprehensive)
vcf-reformatter input.vcf.gz -t split
Annotation Type Detection
# Auto-detect annotation type (recommended)
vcf-reformatter input.vcf.gz -a auto
# Force VEP processing
vcf-reformatter vep_annotated.vcf.gz -a vep -t most-severe
# Force SnpEff processing
vcf-reformatter snpeff_annotated.vcf.gz -a snpeff -t most-severe
Advanced Usage
# High-performance processing with compression
vcf-reformatter large_cohort.vcf.gz \
--transcript-handling most-severe \
--threads 0 \
--compress \
--output-dir results/ \
--prefix my_analysis \
--verbose
# Optimized for HPC environments
vcf-reformatter huge_dataset.vcf.gz -t most-severe -j 32 -o /scratch/results/ -c -v
Complete Options
Usage: vcf-reformatter [OPTIONS] <INPUT_FILE>
Arguments:
<INPUT_FILE> Input VCF file (supports .vcf.gz)
Options:
--output-format <FORMAT> Output format [default: tsv]
[values: tsv, maf]
--center <CENTER> Sequencing center for MAF output
--ncbi-build <BUILD> Genome build
[default: GRCh38]
--sample-barcode <BARCODE> Sample identifier for MAF output
-t, --transcript-handling <MODE> How to handle multiple transcripts
[default: first]
[values: most-severe, first, split]
-a, --annotation-type <N> Which annotations to parse VEP/SnpEff
[default: auto]
[values: snpeff, vep, auto]
-j, --threads <N> Thread count (0 = auto-detect) [default: 1]
-o, --output-dir <DIR> Output directory [default: current]
-p, --prefix <PREFIX> Output file prefix [default: input filename]
-c, --compress Compress output with gzip
-v, --verbose Detailed performance statistics
-h, --help Show help
-V, --version Show version
𧬠Transcript Handling Modes
VCF files with VEP annotations often contain multiple transcript annotations per variant. Choose the strategy that fits your analysis:
π― Most Severe (--transcript-handling most-severe)
Best for: Clinical analysis, variant prioritization
vcf-reformatter input.vcf.gz -t most-severe
# for maf output
vcf-reformatter input.vcf.gz -t most-severe --output-format maf
Selects the transcript with the most severe consequence (stop_gained > missense_variant > synonymous, etc.)
β‘ First Only (--transcript-handling first) [Default]
Best for: Quick analysis, performance-critical workflows
vcf-reformatter input.vcf.gz # Uses first transcript by default
Processes only the first transcript annotation (fastest option)
π Split All (--transcript-handling split)
Best for: Comprehensive analysis, transcript-level studies
vcf-reformatter input.vcf.gz -t split
Creates separate rows for each transcript (most detailed output)
π Performance
Benchmarks
- Small files (< 1K variants): ~5,000 variants/sec
- Medium files (1K-10K variants): ~15,000 variants/sec
- Large files (10K+ variants): ~30,000 variants/sec
Optimization Tips
# Auto-detect optimal thread count
vcf-reformatter input.vcf.gz -j 0
# For files > 10K variants, use parallel processing
vcf-reformatter input.vcf.gz -t most-severe -j 0 -v
# Combine with compression for large outputs
vcf-reformatter input.vcf.gz -t split -j 0 -c -v
π Output Format
File Structure
VCF Reformatter generates two files:
{prefix}_header.txt- Original VCF header and metadata{prefix}_reformatted.tsv- Flattened tabular data
Column Types
- Standard VCF:
CHROM,POS,ID,REF,ALT,QUAL,FILTER - INFO Fields:
INFO_DP,INFO_AF,INFO_AC, etc. - VEP Annotations:
CSQ_Allele,CSQ_Consequence,CSQ_SYMBOL,CSQ_Gene, etc. - SnpEff Annotations:
ANN_Allele,ANN_Annotation_Impact,ANN_Gene_Name,ANN_Distance, etc. - Sample Data:
SAMPLE1_GT,SAMPLE1_DP,SAMPLE1_AD, etc.
Example Output VEP
CHROM POS ID REF ALT QUAL FILTER INFO_DP CSQ_Consequence CSQ_SYMBOL SAMPLE1_GT
chr1 69511 . A G 1294.53 PASS 65 missense_variant OR4F5 1/1
chr1 69761 rs123 C T 892.15 PASS 42 synonymous_variant OR4F5 0/1
Example Output SnpEff
CHROM POS ID REF ALT QUAL FILTER INFO_DP ANN_Annotation ANN_Gene_Name SAMPLE1_GT
chr1 69761 rs587 C T 730 PASS . 214 synonymous_variant OR4F5 0/1
chr1 924024 . A G 53 PASS . 409 5_prime_UTR_variant SAMD11 1/1
π§ Integration Examples
With R
# Read compressed output directly
library(data.table)
data <- fread("output_reformatted.tsv.gz")
# Quick variant summary
summary(data$CSQ_Consequence)
With Python
import pandas as pd
# Load and analyze
df = pd.read_csv("output_reformatted.tsv.gz", sep="\t", compression="gzip")
df['CSQ_Consequence'].value_counts()
In Workflows
# Nextflow pipeline
vcf-reformatter ${vcf} -t most-severe -j ${task.cpus} -o results/ -c
# Snakemake rule
shell: "vcf-reformatter {input.vcf} -t most-severe -j {threads} -o {params.outdir} -c"
π³ Container Usage
Docker
# Build once
docker build -t vcf-reformatter .
# Run anywhere
docker run --rm \
-v $(pwd):/data \
vcf-reformatter \
/data/input.vcf.gz \
-t most-severe -j 4 -o /data/results/ -c
Singularity (HPC)
# On HPC cluster
singularity run \
--bind $PWD:/data \
--bind /scratch:/scratch \
vcf-reformatter.sif \
/data/large_cohort.vcf.gz \
-t most-severe -j 16 -o /scratch/results/ -c -v
π§ͺ Use Cases
| Use Case | Command | Why It Works |
|---|---|---|
| Clinical Variant Review | vcf-reformatter variants.vcf.gz -t most-severe |
Prioritizes clinically relevant consequences |
| Population Analysis | vcf-reformatter cohort.vcf.gz -t first -j 0 -c |
Fast processing of large cohorts |
| Transcript Studies | vcf-reformatter genes.vcf.gz -t split -v |
Comprehensive transcript-level analysis |
| Quick Data Exploration | vcf-reformatter sample.vcf.gz |
Simple, fast conversion for immediate analysis |
| HPC Batch Processing | vcf-reformatter huge.vcf.gz -t most-severe -j 32 -c |
Optimized for high-performance computing |
π What's New in v0.3.0
- β MAF Output Support (in Betaβ οΈ) - Direct conversion to Mutation Annotation Format
- β Auto-metadata Detection (in Betaβ οΈ) - Extracts center/sample info from VCF headers for MAF
- β Memory-Efficient Processing (streaming) - Chunked streaming for large files (>>100K variants)
- β Enhanced Error Handling - Better processing of malformed files
- β Comprehensive Testing - 70+ test cases ensure reliability
Previous Releases
π What's New in v0.2.0
- β SnpEff Support - Full ANN field parsing with intelligent detection
- β Smart Auto-Detection - Automatically identifies VEP vs SnpEff annotations
- β Enhanced Error Handling - Better processing of malformed or headerless files
TODOs
Add SnpEff supportβOutput MAF format optionβ- Add
stdinto combine with other tools, such asbcftools - Support for multi-sample VCF files in MAF output
π€ Contributing
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch:
git checkout -b feature-name - Add tests for new functionality
- Commit your changes:
git commit -am 'Add feature' - Push to the branch:
git push origin feature-name - Submit a pull request
Development Setup
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo test # Run the test suite
cargo run -- data/sample.vcf.gz -v # Test with sample data
π License
This project is licensed under the MIT License - see the LICENSE file for details.
π Acknowledgments
- VCF Format Contributors - For the standard that enables genomic data sharing
- VEP Team - For the powerful variant annotation framework
- Rust Community - For the incredible ecosystem that makes this possible
- Bioinformatics Community - For feedback and feature requests
Frequently Asked Questions
Q: Which transcript handling mode should I use?
- Clinical analysis:
--transcript-handling most-severe - Quick exploration:
--transcript-handling first - Comprehensive analysis:
--transcript-handling split
Q: How does this compare to other VCF tools?
VCF Reformatter is specifically designed for:
- Converting complex VEP/SnpEff annotations to tabular format
- Handling multiple transcripts intelligently
- High-performance parallel processing
- Easy integration with R/Python workflows
Q: Can I use this in production pipelines?
Yes! VCF Reformatter is designed for production use with:
- Comprehensive error handling
- Docker/Singularity support
- Automated testing
- Stable CLI interface
Q: What's the difference between TSV and MAF output?
- TSV: Direct flattening of VCF fields (default)
- MAF (beta): Standardized cancer genomics format for downstream tools
Q: What if I get out-of-memory errors?
- Use TSV format instead of MAF:
vcf-reformatter file.vcf.gz -j 0 -c - Enable verbose mode to monitor:
vcf-reformatter file.vcf.gz -v
π Support
- π Issues: GitHub Issues
- π§ Email: fl@flaviolombardo.site
β Star this repo if VCF Reformatter helps your research!
Made with β€οΈ by Flavio Lombardo
Dependencies
~6β12MB
~245K SLoC