NSHA NGS Analysis Pipeline

In brief, data were aligned to the human reference genome (GRCh37.75) with BWA[1] and processed using Picard[2], the GATK[3], and VT[4]. Variants were called using six variant callers, each with a different error profile and different performance characteristics: MuTect[5], FreeBayes[6], VarDict[7], Pindel[8], Platypus[9], and Scalpel[10]. The resulting calls were then combined into a unified call set using bcbio-ensemble[11]. Variants were annotated with snpEff[12] and VCFAnno[13] using a variety of data sources, including dbSNP[14], 1000 Genomes[15], the Exome Sequencing Project’s EVS[16], Ensembl[17], ClinVar[18], and COSMIC[19].
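
To illustrate the idea of combining several callers into one call set, here is a minimal sketch that keeps a variant when at least a minimum number of callers report it. This is not the bcbio-ensemble implementation. The function name `combine_calls`, the `(chrom, pos, ref, alt)` variant key, and the `min_callers=2` threshold are assumptions made for illustration; the text above does not state which threshold was used.

```python
from collections import defaultdict


def combine_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` callers.

    calls_by_caller maps a caller name to a set of variant keys,
    here assumed to be (chrom, pos, ref, alt) tuples.
    The threshold is hypothetical, not taken from the pipeline.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    # Return each retained variant with the sorted list of supporting callers.
    return {
        variant: sorted(callers)
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }


# Toy input: two callers, one shared variant and one caller-specific variant.
example_calls = {
    "MuTect": {("7", 140453136, "A", "T")},
    "VarDict": {("7", 140453136, "A", "T"), ("17", 7577120, "C", "T")},
}
unified = combine_calls(example_calls)
```

With the toy input, only the variant seen by both callers remains in `unified`. Requiring agreement between callers with different error profiles is one way to trade a little sensitivity for fewer caller-specific false positives.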

Standard Filters

Variants were kept in the analysis only if they met all of the following criteria: an allele frequency below 1% in any population, an estimated somatic allele frequency above 2%, a read depth greater than 500x, and either a functional impact on the protein (missense or nonsense mutations) or a known clinical association.
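
The filter criteria above can be sketched as a single predicate. This is a sketch under stated assumptions, not the pipeline's actual code: the `Variant` record, its field names, and the effect labels are hypothetical. The population criterion is interpreted here as requiring the allele frequency to be below 1% in every population that reports one.

```python
from dataclasses import dataclass, field

# Thresholds taken from the filter description.
MAX_POPULATION_AF = 0.01   # 1% population allele frequency
MIN_SOMATIC_AF = 0.02      # 2% estimated somatic allele frequency
MIN_DEPTH = 500            # read depth must exceed this value

# Hypothetical effect labels standing in for annotation output.
PROTEIN_ALTERING = {"missense", "nonsense"}


@dataclass
class Variant:
    """Hypothetical record holding only the fields the filter needs."""
    somatic_af: float
    depth: int
    effect: str
    population_afs: dict = field(default_factory=dict)  # population -> AF
    clinically_associated: bool = False


def passes_standard_filters(v: Variant) -> bool:
    # Assumption: a variant absent from all populations counts as rare.
    max_pop_af = max(v.population_afs.values(), default=0.0)
    rare = max_pop_af < MAX_POPULATION_AF
    supported = v.somatic_af > MIN_SOMATIC_AF and v.depth > MIN_DEPTH
    relevant = v.effect in PROTEIN_ALTERING or v.clinically_associated
    return rare and supported and relevant


kept = passes_standard_filters(
    Variant(somatic_af=0.05, depth=800, effect="missense",
            population_afs={"EUR": 0.001, "AFR": 0.002})
)
```

Note that the functional-impact and clinical-association criteria are alternatives, so a synonymous variant with a known clinical association would still pass as long as the frequency and depth criteria hold.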

References:

  1. BWA
  2. Picard
  3. GATK
  4. VT
  5. MuTect
  6. FreeBayes
  7. VarDict
  8. Pindel
  9. Platypus
  10. Scalpel
  11. bcbio-ensemble
  12. snpEff
  13. VCFAnno
  14. dbSNP
  15. 1000 Genomes
  16. EVS
  17. Ensembl
  18. ClinVar
  19. COSMIC
Written on April 21, 2016