NSHA NGS Analysis Pipeline

In brief, data were aligned to the human reference genome (GRCh37.75) with BWA[1] and processed using Picard[2], the GATK[3], and VT[4]. Variants were called using six variant callers, each with a different error profile and different performance characteristics: MuTect[5], FreeBayes[6], VarDict[7], Pindel[8], Platypus[9], and Scalpel[10]. The calls were then combined into a unified call set using bcbio-ensemble[11]. Variants were annotated with snpEff[12] and VCFAnno[13] using a variety of data sources, including dbSNP[14], 1000 Genomes[15], the Exome Sequencing Project’s EVS[16], Ensembl[17], ClinVar[18], and COSMIC[19].
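The idea of combining several callers into one call set can be illustrated with a minimal sketch. This is not bcbio-ensemble's actual implementation or interface; the function name `ensemble_calls`, the `min_callers` threshold, and the variant key layout are hypothetical and chosen only for illustration. The text does not state how many callers had to agree.

```python
from collections import defaultdict


def ensemble_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to a set of variant keys,
    here (chromosome, position, ref, alt) tuples. The threshold is an
    assumption; the pipeline description does not give one.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for key in variants:
            support[key].add(caller)
    # Report each retained variant with the sorted list of callers behind it.
    return {
        key: sorted(callers)
        for key, callers in support.items()
        if len(callers) >= min_callers
    }


# Toy input: made-up variant keys, not real results.
example_calls = {
    "MuTect": {("1", 100, "A", "T"), ("2", 200, "G", "C")},
    "VarDict": {("1", 100, "A", "T")},
    "FreeBayes": {("3", 300, "C", "CT")},
}
unified = ensemble_calls(example_calls, min_callers=2)
```

Because each caller has a different error profile, requiring agreement between callers tends to drop caller-specific artifacts at the cost of some sensitivity.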

Standard Filters

Variants were kept in the analysis only if they met all of the following criteria: an allele frequency below 1% in every population, an estimated somatic allele frequency above 2%, a read depth greater than 500 reads, and either a functional impact on the protein (missense or nonsense mutations) or a known clinical association.
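The standard filters can be expressed as a single predicate. The sketch below is illustrative only: the `Variant` record and its field names are hypothetical, the population criterion is read as "the highest frequency across all populations is below 1%", and the depth threshold is treated as a count of reads.

```python
from dataclasses import dataclass

# Effect labels for the protein-altering classes named in the text.
# The exact strings are hypothetical placeholders.
PROTEIN_ALTERING = {"missense", "nonsense"}


@dataclass
class Variant:
    max_population_af: float    # highest allele frequency across populations
    somatic_af: float           # estimated somatic allele frequency
    depth: int                  # read depth at the site
    effect: str                 # predicted functional effect
    clinical_association: bool  # True if a clinical association is known


def passes_standard_filters(v: Variant) -> bool:
    """Return True if the variant meets every standard filter."""
    rare = v.max_population_af < 0.01
    supported = v.somatic_af > 0.02 and v.depth > 500
    relevant = v.effect in PROTEIN_ALTERING or v.clinical_association
    return rare and supported and relevant
```

For example, a missense variant at 0.1% population frequency, 5% somatic allele frequency, and a depth of 800 reads passes, while the same variant at a depth of 400 reads does not.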

References:

1. BWA
2. Picard
3. GATK
4. VT
5. MuTect
6. FreeBayes
7. VarDict
8. Pindel
9. Platypus
10. Scalpel
11. bcbio-ensemble
12. snpEff
13. VCFAnno
14. dbSNP
15. 1000 Genomes
16. EVS
17. Ensembl
18. ClinVar
19. COSMIC

Written on April 21, 2016