Have you ever had VCF files and wanted to look at the data as you would a normal table? VCF Reformatter is here to help!
A Rust command-line tool for parsing and reformatting VCF (Variant Call Format) files, with support for VEP (Variant Effect Predictor) and SnpEff annotations. This tool flattens complex VCF files into tab-separated values (TSV) format for easier downstream analysis. It is also very useful for quick checks of your data!
Transform complex VCF files into clean, analyzable tables with ease
A high-performance Rust tool for flattening VCF files with intelligent VEP and SnpEff annotation handling
# Download the binary from releases (easiest: just download and run)
wget https://github.com/flalom/vcf-reformatter/releases/latest/download/vcf-reformatter-v0.3.0-linux-x86_64
chmod +x vcf-reformatter-v0.3.0-linux-x86_64
# Transform your VCF file
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz
# Generate MAF output ⚠️ (in beta!)
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz --output-format maf
OR via Bioconda:
conda install -c bioconda vcf-reformatter
# or
# mamba install vcf-reformatter -c bioconda
OR install from crates.io:
cargo install vcf-reformatter
OR build from source (requires the Rust toolchain):
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release
./target/release/vcf-reformatter sample.vcf.gz

MAF output is currently in beta testing (v0.3.0). Known limitations:
- VAF calculation needs refinement for some genotype patterns
- Multi-sample handling requires validation
- Use with caution in production workflows
Memory considerations for MAF:
- Files >100K variants: Monitor memory usage
- Files >1M variants: Ensure adequate RAM (16GB+)
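
The VAF limitation listed above concerns how variant allele frequency is derived from genotype data. For reference, a common definition divides the ALT read depth by the total depth in the AD (allelic depth) field. The sketch below assumes that definition; it is not taken from VCF Reformatter's source, and the function name `vaf_from_ad` is hypothetical.

```python
from typing import Optional


def vaf_from_ad(ad: str) -> Optional[float]:
    """Compute VAF for the first ALT allele from an AD string such as "12,36".

    Assumption: VAF = first ALT depth / sum of all allele depths.
    Returns None when AD is missing, malformed, or has zero total depth.
    """
    parts = ad.split(",")
    if len(parts) < 2:
        return None  # e.g. "." or a single value
    try:
        depths = [int(p) for p in parts]
    except ValueError:
        return None  # e.g. ".,." placeholders
    total = sum(depths)
    if total == 0:
        return None
    return depths[1] / total


print(vaf_from_ad("12,36"))
```

Comparing a value computed this way against the MAF output for a few variants is one way to spot-check genotype patterns where the beta VAF calculation may still need refinement.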
The Problem: VCF files are notoriously difficult to analyze. Complex nested annotations, semicolon-separated INFO fields, and multi-transcript VEP annotations make downstream analysis a nightmare.
The Solution: VCF Reformatter flattens everything into a clean, readable TSV format that works seamlessly with Excel, R, Python, and any other analysis tool.
Before (Raw VCF):
chr1 69511 . A G 1294.53 . DP=65;AF=1;CSQ=G|missense_variant|MODERATE|OR4F5|ENSG00000186092...
After (Reformatted TSV):
CHROM POS REF ALT QUAL INFO_DP INFO_AF CSQ_Allele CSQ_Consequence CSQ_SYMBOL
chr1 69511 A G 1294.53 65 1 G missense_variant OR4F5
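
To make the before/after transformation concrete, here is a minimal Python sketch of the flattening idea. It is an illustration only, not the tool's Rust implementation. The function name `flatten_record` is hypothetical, and the CSQ field order is hard-coded as an assumption (in a real VEP file it is declared in the header's CSQ description).

```python
from typing import Dict, List

# Assumed CSQ sub-field order; real files declare it in the VCF header.
CSQ_FIELDS = ["Allele", "Consequence", "IMPACT", "SYMBOL", "Gene"]


def flatten_record(line: str, csq_fields: List[str]) -> Dict[str, str]:
    """Flatten one tab-separated VCF data line into a column -> value mapping."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    row = {"CHROM": chrom, "POS": pos, "ID": vid, "REF": ref,
           "ALT": alt, "QUAL": qual, "FILTER": flt}
    if info == ".":
        return row  # no INFO data for this record
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key == "CSQ":
            # Keep only the first transcript annotation in this sketch.
            first = value.split(",")[0]
            for name, val in zip(csq_fields, first.split("|")):
                row["CSQ_" + name] = val
        elif key:
            # Flag-style INFO keys have no "="; store "true" (an assumption).
            row["INFO_" + key] = value if sep else "true"
    return row


line = "chr1\t69511\t.\tA\tG\t1294.53\t.\tDP=65;AF=1;CSQ=G|missense_variant|MODERATE|OR4F5|ENSG00000186092"
row = flatten_record(line, CSQ_FIELDS)
print(row["INFO_DP"], row["CSQ_Consequence"], row["CSQ_SYMBOL"])
```

Each INFO key becomes an `INFO_`-prefixed column and each CSQ sub-field a `CSQ_`-prefixed column, which is the naming pattern shown in the reformatted TSV above.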
| Feature | Description | Benefit |
|---|---|---|
| VEP/SnpEff Annotation Parsing | Intelligent handling of CSQ/ANN annotations | No more manual parsing of complex VEP/SnpEff output |
| Automatic Annotation Recognition | Automatic detection of CSQ/ANN annotations | Saves even more time, for both VEP and SnpEff |
| Smart Transcript Handling | Most severe, first only, or split transcripts | Choose the analysis approach that fits your needs |
| Parallel Processing | Multi-threaded processing up to 30k variants/sec | Process large cohorts in minutes, not hours |
| Native Compression | Direct .vcf.gz reading & gzip output | Seamless workflow with compressed/uncompressed files |
| Production Ready | Comprehensive error handling & logging | Reliable for automated pipelines |
| Container Support | Docker & Singularity ready | Deploy anywhere, from laptops to HPC clusters |
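
The automatic annotation recognition listed in the table can be pictured as a check on the VCF header: VEP declares its annotations in an INFO field named CSQ, while SnpEff uses ANN. The Python sketch below illustrates that idea under this assumption; the tool's actual detection logic may differ, and `detect_annotation_type` is a hypothetical name.

```python
from typing import Iterable


def detect_annotation_type(header_lines: Iterable[str]) -> str:
    """Return "vep", "snpeff", or "none" based on INFO declarations in the header."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ,"):
            return "vep"      # VEP writes consequences to the CSQ field
        if line.startswith("##INFO=<ID=ANN,"):
            return "snpeff"   # SnpEff writes annotations to the ANN field
    return "none"


header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene">',
]
print(detect_annotation_type(header))
```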
No Rust installation required - just download and run:
- Go to Releases
- Download the binary for your platform:
  - vcf-reformatter-v0.3.0-linux-x86_64 → Linux (most users)
  - vcf-reformatter-v0.3.0-linux-x86_64-static → HPC clusters (works everywhere)
  - vcf-reformatter-v0.3.0-windows-x86_64.exe → Windows
  - vcf-reformatter-v0.3.0-macos-x86_64 → Intel Mac
  - vcf-reformatter-v0.3.0-macos-arm64 → Apple Silicon Mac (M1/M2/M3/M4)
- Make executable and run:
# Linux/Mac
chmod +x vcf-reformatter-*
./vcf-reformatter-* --help
# Windows
# Just double-click or run from command prompt
# C++ might be required, if not already installed

# Build from source (requires the Rust toolchain)
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release

# Build the container
docker build -t vcf-reformatter .
# Run with your data
docker run --rm -v $(pwd):/data vcf-reformatter /data/sample.vcf.gz

# Build Singularity image
singularity build vcf-reformatter.sif Singularity
# Run on HPC cluster
singularity run --bind $PWD:/data vcf-reformatter.sif /data/sample.vcf.gz -j 16

# Simple conversion
vcf-reformatter input.vcf.gz
# Most severe consequence only (recommended for analysis)
vcf-reformatter input.vcf.gz -t most-severe
# All transcripts in separate rows (comprehensive)
vcf-reformatter input.vcf.gz -t split

# Auto-detect annotation type (recommended)
vcf-reformatter input.vcf.gz -a auto
# Force VEP processing
vcf-reformatter vep_annotated.vcf.gz -a vep -t most-severe
# Force SnpEff processing
vcf-reformatter snpeff_annotated.vcf.gz -a snpeff -t most-severe

# High-performance processing with compression
vcf-reformatter large_cohort.vcf.gz \
--transcript-handling most-severe \
--threads 0 \
--compress \
--output-dir results/ \
--prefix my_analysis \
--verbose
# Optimized for HPC environments
vcf-reformatter huge_dataset.vcf.gz -t most-severe -j 32 -o /scratch/results/ -c -v

Usage: vcf-reformatter [OPTIONS] <INPUT_FILE>
Arguments:
<INPUT_FILE> Input VCF file (supports .vcf.gz)
Options:
--output-format <FORMAT> Output format [default: tsv]
[values: tsv, maf]
--center <CENTER> Sequencing center for MAF output
--ncbi-build <BUILD> Genome build
[default: GRCh38]
--sample-barcode <BARCODE> Sample identifier for MAF output
-t, --transcript-handling <MODE> How to handle multiple transcripts
[default: first]
[values: most-severe, first, split]
-a, --annotation-type <N> Which annotations to parse VEP/SnpEff
[default: auto]
[values: snpeff, vep, auto]
-j, --threads <N> Thread count (0 = auto-detect) [default: 1]
-o, --output-dir <DIR> Output directory [default: current]
-p, --prefix <PREFIX> Output file prefix [default: input filename]
-c, --compress Compress output with gzip
-v, --verbose Detailed performance statistics
-h, --help Show help
-V, --version Show version
VCF files with VEP annotations often contain multiple transcript annotations per variant. Choose the strategy that fits your analysis:
Best for: Clinical analysis, variant prioritization
vcf-reformatter input.vcf.gz -t most-severe
# for maf output
vcf-reformatter input.vcf.gz -t most-severe --output-format maf
Selects the transcript with the most severe consequence (stop_gained > missense_variant > synonymous_variant, etc.)
Best for: Quick analysis, performance-critical workflows
vcf-reformatter input.vcf.gz # Uses first transcript by default
Processes only the first transcript annotation (fastest option)
Best for: Comprehensive analysis, transcript-level studies
vcf-reformatter input.vcf.gz -t split
Creates separate rows for each transcript (most detailed output)
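
The three strategies can be compared side by side with a small Python sketch. It is a simplified illustration, not the tool's implementation: the severity list is an assumed three-term subset of the ordering mentioned above, and the names `select_transcripts`, `severity_rank`, and the `T1`..`T3` transcript IDs are hypothetical.

```python
from typing import Dict, List

# Illustrative subset, most severe first; a real ranking covers many more terms.
SEVERITY_ORDER = ["stop_gained", "missense_variant", "synonymous_variant"]


def severity_rank(transcript: Dict[str, str]) -> int:
    """Lower rank means more severe; unknown terms rank last."""
    # VEP joins multiple consequences for one transcript with "&".
    terms = transcript.get("Consequence", "").split("&")
    ranks = [SEVERITY_ORDER.index(t) for t in terms if t in SEVERITY_ORDER]
    return min(ranks) if ranks else len(SEVERITY_ORDER)


def select_transcripts(transcripts: List[Dict[str, str]], mode: str = "first") -> List[Dict[str, str]]:
    """Return the transcript annotations that would become output rows."""
    if not transcripts:
        return []
    if mode == "first":
        return [transcripts[0]]
    if mode == "most-severe":
        return [min(transcripts, key=severity_rank)]
    if mode == "split":
        return list(transcripts)
    raise ValueError("unknown mode: " + mode)


example = [
    {"Feature": "T1", "Consequence": "synonymous_variant"},
    {"Feature": "T2", "Consequence": "stop_gained"},
    {"Feature": "T3", "Consequence": "missense_variant&splice_region_variant"},
]
for mode in ("first", "most-severe", "split"):
    print(mode, [t["Feature"] for t in select_transcripts(example, mode)])
```

In this sketch `first` and `most-severe` always yield one row per variant, while `split` yields one row per transcript, which is why the split output is the largest.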
- Small files (< 1K variants): ~5,000 variants/sec
- Medium files (1K-10K variants): ~15,000 variants/sec
- Large files (10K+ variants): ~30,000 variants/sec
# Auto-detect optimal thread count
vcf-reformatter input.vcf.gz -j 0
# For files > 10K variants, use parallel processing
vcf-reformatter input.vcf.gz -t most-severe -j 0 -v
# Combine with compression for large outputs
vcf-reformatter input.vcf.gz -t split -j 0 -c -v

VCF Reformatter generates two files:
- {prefix}_header.txt - Original VCF header and metadata
- {prefix}_reformatted.tsv - Flattened tabular data

Output columns:
- Standard VCF: CHROM, POS, ID, REF, ALT, QUAL, FILTER
- INFO Fields: INFO_DP, INFO_AF, INFO_AC, etc.
- VEP Annotations: CSQ_Allele, CSQ_Consequence, CSQ_SYMBOL, CSQ_Gene, etc.
- SnpEff Annotations: ANN_Allele, ANN_Annotation_Impact, ANN_Gene_Name, ANN_Distance, etc.
- Sample Data: SAMPLE1_GT, SAMPLE1_DP, SAMPLE1_AD, etc.
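
The per-sample column names follow from pairing each sample's values with the keys in the VCF FORMAT column. The Python sketch below shows that pairing under the assumption that columns are named `<sample>_<key>`, as in the list above; `expand_samples` is a hypothetical helper, not part of the tool.

```python
from typing import Dict, List


def expand_samples(format_field: str, sample_names: List[str],
                   sample_values: List[str]) -> Dict[str, str]:
    """Turn FORMAT plus per-sample strings into <sample>_<key> columns."""
    keys = format_field.split(":")
    row = {}
    for name, value in zip(sample_names, sample_values):
        # zip stops early when a sample omits trailing sub-fields,
        # which the VCF format permits.
        for key, val in zip(keys, value.split(":")):
            row[name + "_" + key] = val
    return row


columns = expand_samples("GT:DP:AD", ["SAMPLE1"], ["1/1:65:0,65"])
print(columns)
```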
CHROM POS ID REF ALT QUAL FILTER INFO_DP CSQ_Consequence CSQ_SYMBOL SAMPLE1_GT
chr1 69511 . A G 1294.53 PASS 65 missense_variant OR4F5 1/1
chr1 69761 rs123 C T 892.15 PASS 42 synonymous_variant OR4F5 0/1
CHROM POS ID REF ALT QUAL FILTER INFO_DP ANN_Annotation ANN_Gene_Name SAMPLE1_GT
chr1 69761 rs587 C T 730 PASS . 214 synonymous_variant OR4F5 0/1
chr1 924024 . A G 53 PASS . 409 5_prime_UTR_variant SAMD11 1/1
# R: read compressed output directly (fread needs the R.utils package for .gz files)
library(data.table)
data <- fread("output_reformatted.tsv.gz")
# Quick variant summary: count variants per consequence
table(data$CSQ_Consequence)

# Python: load and analyze
import pandas as pd
df = pd.read_csv("output_reformatted.tsv.gz", sep="\t", compression="gzip")
# Count variants per consequence
df['CSQ_Consequence'].value_counts()
# Nextflow pipeline
vcf-reformatter ${vcf} -t most-severe -j ${task.cpus} -o results/ -c
# Snakemake rule
shell: "vcf-reformatter {input.vcf} -t most-severe -j {threads} -o {params.outdir} -c"

# Build once
docker build -t vcf-reformatter .
# Run anywhere
docker run --rm \
-v $(pwd):/data \
vcf-reformatter \
/data/input.vcf.gz \
-t most-severe -j 4 -o /data/results/ -c

# On HPC cluster
singularity run \
--bind $PWD:/data \
--bind /scratch:/scratch \
vcf-reformatter.sif \
/data/large_cohort.vcf.gz \
-t most-severe -j 16 -o /scratch/results/ -c -v

| Use Case | Command | Why It Works |
|---|---|---|
| Clinical Variant Review | vcf-reformatter variants.vcf.gz -t most-severe | Prioritizes clinically relevant consequences |
| Population Analysis | vcf-reformatter cohort.vcf.gz -t first -j 0 -c | Fast processing of large cohorts |
| Transcript Studies | vcf-reformatter genes.vcf.gz -t split -v | Comprehensive transcript-level analysis |
| Quick Data Exploration | vcf-reformatter sample.vcf.gz | Simple, fast conversion for immediate analysis |
| HPC Batch Processing | vcf-reformatter huge.vcf.gz -t most-severe -j 32 -c | Optimized for high-performance computing |
- ✅ MAF Output Support (in beta ⚠️) - Direct conversion to Mutation Annotation Format
- ✅ Auto-metadata Detection (in beta ⚠️) - Extracts center/sample info from VCF headers for MAF
- ✅ Memory-Efficient Processing (streaming) - Chunked streaming for large files (>>100K variants)
- ✅ Enhanced Error Handling - Better processing of malformed files
- ✅ Comprehensive Testing - 70+ test cases ensure reliability
- ✅ SnpEff Support - Full ANN field parsing with intelligent detection
- ✅ Smart Auto-Detection - Automatically identifies VEP vs SnpEff annotations
- ✅ Enhanced Error Handling - Better processing of malformed or headerless files

Roadmap:
- Add SnpEff support ✅
- Output MAF format option ✅
- Add stdin support to combine with other tools, such as bcftools
- Support for multi-sample VCF files in MAF output
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Add tests for new functionality
- Commit your changes: git commit -am 'Add feature'
- Push to the branch: git push origin feature-name
- Submit a pull request
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo test # Run the test suite
cargo run -- data/sample.vcf.gz -v # Test with sample data

This project is licensed under the MIT License - see the LICENSE file for details.
- VCF Format Contributors - For the standard that enables genomic data sharing
- VEP Team - For the powerful variant annotation framework
- Rust Community - For the incredible ecosystem that makes this possible
- Bioinformatics Community - For feedback and feature requests
- Clinical analysis: --transcript-handling most-severe
- Quick exploration: --transcript-handling first
- Comprehensive analysis: --transcript-handling split
VCF Reformatter is specifically designed for:
- Converting complex VEP/SnpEff annotations to tabular format
- Handling multiple transcripts intelligently
- High-performance parallel processing
- Easy integration with R/Python workflows
Yes! VCF Reformatter is designed for production use with:
- Comprehensive error handling
- Docker/Singularity support
- Automated testing
- Stable CLI interface
- TSV: Direct flattening of VCF fields (default)
- MAF (beta): Standardized cancer genomics format for downstream tools
- Use TSV format instead of MAF: vcf-reformatter file.vcf.gz -j 0 -c
- Enable verbose mode to monitor: vcf-reformatter file.vcf.gz -v
- Issues: GitHub Issues
- Email: [email protected]

⭐ Star this repo if VCF Reformatter helps your research!
Made with ❤️ by Flavio Lombardo