Assessment of MuTect2 and VarScan2 for somatic mutation detection in exome sequencing
DOI:
https://doi.org/10.58445/rars.1207
Keywords:
Bioinformatics, Variant Calling, Genomics, DNA Sequencing
Abstract
Next-generation sequencing is widely used to identify somatic mutations in cancer, and it is increasingly applied not only in research but also in the diagnosis of clinical oncology patients, to personalize and improve treatment. Somatic variant callers need two sets of sequencing data, one from the cancer tissue and one from its normal tissue counterpart, which they compare to detect somatic mutations. There are many somatic variant callers to choose from, but few comparison papers have been published. Because there is no standard for the detection of somatic mutations, it is pivotal to find an efficient way to compare these tools. We assessed two somatic variant callers, MuTect2 and VarScan2, on matched tumoral and non-tumoral samples acquired from the publication “SomaticCombiner: improving the performance of somatic variant calling based on evaluation tests and a consensus approach.” by Wang et al. (1). We hypothesized that MuTect2 would perform better with cancer samples, as it employs the probabilistic framework of Bayesian statistics used by most existing variant callers. The performance of both callers was analyzed in synthesized and real cancer samples. The two callers performed similarly across samples: VarScan2 usually surpassed MuTect2 when mutation frequency was high, whereas MuTect2 was more consistent across all mutation frequencies. VarScan2 found a higher number of concordant mutations at high frequencies, but at frequencies below 20% MuTect2 performed better, identifying up to 4000 mutations to VarScan2’s 1000. Similarly, at frequencies over 40%, VarScan2 had a lower rate of missed mutations than MuTect2. VarScan2 also had higher recall and higher precision than MuTect2. Measured by F1-score, however, MuTect2 maintained its accuracy over a wider range of mutation frequencies.
MuTect2 outperformed VarScan2 in the synthetic data, as well as in most of the data acquired from cancer patients.
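The precision, recall, and F1-score named in the abstract can be illustrated with a minimal Python sketch. It assumes each variant is identified by chromosome, position, reference allele, and alternate allele, and that a truth set of known mutations is available, as it would be for synthetic data. The function name `score_calls`, the `Metrics` type, and the toy variants are hypothetical and are not taken from the study.

```python
from typing import NamedTuple, Set, Tuple

# Assumption: a variant is keyed by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_calls(called: Set[Variant], truth: Set[Variant]) -> Metrics:
    """Compare a caller's output against a truth set of known mutations."""
    tp = len(called & truth)   # concordant calls (true positives)
    fp = len(called - truth)   # calls absent from the truth set
    fn = len(truth - called)   # missed mutations (false negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Metrics(precision, recall, f1)


# Toy data for illustration only; these are not results from the study.
truth = {
    ("chr1", 100, "A", "T"),
    ("chr1", 200, "G", "C"),
    ("chr2", 50, "C", "T"),
    ("chr3", 75, "T", "G"),
}
caller_a = {
    ("chr1", 100, "A", "T"),
    ("chr1", 200, "G", "C"),
    ("chr2", 999, "A", "G"),
}

m = score_calls(caller_a, truth)
print(f"precision={m.precision:.3f} recall={m.recall:.3f} f1={m.f1:.3f}")
```

Because F1 combines precision and recall into a single number, computing it separately for each mutation-frequency range is one way to compare how consistently two callers behave across frequencies.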
References
Wang M, et al. “SomaticCombiner: improving the performance of somatic variant calling based on evaluation tests and a consensus approach.” Nature Scientific Reports, no. 12898, Jul. 2020, doi:10.1038/s41598-020-69772-8.
Fjaer R, et al. “A novel somatic mutation in GNB2 provides new insights to the pathogenesis of Sturge-Weber Syndrome.” Hum Mol Genet., Oct 2022, doi: 10.1093/hmg/ddab144.
Wang Q, et al. “Comparison of somatic variant detection algorithms using Ion Torrent targeted deep sequencing data.” BMC Med Genomics, published online, Dec. 2019, doi:10.1186/s12920-019-0636-y.
Rabbani B, Tekin M, Mahdieh N. “The promise of whole-exome sequencing in medical genetics.” Journal of Human Genetics, vol. 59; 5-15, Jan 2014, doi:10.1038/jhg.2013.114.
van Dijk E L, et al. “Ten Years of Next-Generation Sequencing Technology.” Trends in Genetics, vol. 30, no. 9, 2014, doi:10.1016/j.tig.2014.07.001.
“Mutect2.” GATK. gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2. Accessed 9 Nov 2022.
Koboldt DC, et al. “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.” Genome Research, vol. 22:568–76, Feb. 2012, doi:10.1101/gr.129684.111.
Kim S, et al. “Strelka2: fast and accurate calling of germline and somatic variants.” Nature Methods, vol. 15:591–4, July 2018, doi:10.1038/s41592-018-0051-x.
Kim S, et al. “Virmid: accurate detection of somatic mutations with sample impurity inference.” Genome Biol, vol. 14: R90, Aug 2013, doi:10.1186/gb-2013-14-8-r90.
Uğur Sezerman O, et al. “Bioinformatics Workflows for Genomic Variant Discovery, Interpretation and Prioritization.” In book: Bioinformatics Tools for Detection and Clinical Interpretation of Genomic Variations. Published by Intech Open, Jun 2019, doi: 10.5772/intechopen.85524.
Li H, Durbin R. “Fast and accurate short read alignment with Burrows-Wheeler transform.” Bioinformatics, Volume 25, Issue 14, July 2009, doi:10.1093/bioinformatics/btp324.
Langmead B, Salzberg SL. “Fast gapped-read alignment with Bowtie 2.” Nature Methods, vol. 9; 357–9, Mar 2012, doi:10.1038/nmeth.1923.
“Novoalign: Powerful tool designed for mapping of short reads onto a reference genome from Illumina, Ion Torrent, and 454 NGS platforms.” Novocraft. www.novocraft.com/products/novoalign/. Accessed 1 Dec 2022.
Marçais G, et al. “MUMmer4: A fast and versatile genome alignment system.” PLoS Comput Biol, vol. 14:e1005944, Jan 2018, doi:10.1371/journal.pcbi.1005944.
“SAM file format.” Metagenomics. www.metagenomics.wiki/tools/samtools/bam-sam-file-format. Accessed 1 Dec 2022.
“BAM File Format.” Illumina. support.illumina.com/help/BS_App_RNASeq_Alignment_OLH_1000000006112/Content/Source/Informatics/BAM-Format.htm. Accessed 1 Dec 2022.
“Picard.” Broadinstitute Github. broadinstitute.github.io/picard/. Accessed 1 Dec 2022.
“GATK.” Broadinstitute. gatk.broadinstitute.org/hc/en-us. Accessed 1 Dec 2022.
McKenna A, et al. “The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.” Genome Research. vol 20:1297–303, Jul 2010, doi:10.1101/gr.107524.110.
“VarScan.” VarScan Sourceforge. varscan.sourceforge.net/. Accessed 1 Dec 2022.
Larson DE, et al. “SomaticSniper: identification of somatic point mutations in whole genome sequencing data.” Bioinformatics, vol 28:311–7, Feb 2012, doi: 10.1093/bioinformatics/btr665.
Li H, et al. “The Sequence Alignment/Map format and SAMtools.” Bioinformatics, vol 25:2078–9, Aug 2009, doi: 10.1093/bioinformatics/btp352.
Fang LT, et al. “An ensemble approach to accurately detect somatic mutations using SomaticSeq.” Genome Biol, vol. 16:197, 2015, doi: 10.1186/s13059-015-0758-2.
Ramos AH, et al. “Oncotator: cancer variant annotation tool.” Human Mutation, vol. 36: E2423–9, Apr 2015, doi: 10.1002/humu.22771.
Douville C, et al. “CRAVAT: cancer-related analysis of variants toolkit.” Bioinformatics, vol. 29:647–8, Mar 2013, doi: 10.1093/bioinformatics/btt017.
Benjamin DI, et al. “Calling Somatic SNVs and Indels with Mutect2.” Cold Spring Harbor Laboratory. Dec 2019, doi: 10.1101/861054.
“ICGC-TCGA DREAM Mutation Calling challenge.” Bionetworks S. www.synapse.org/#!Synapse:syn312572/wiki/58893. Accessed 20 Dec 2022.
Alioto TS, et al. “A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing.” Nature Communications, vol. 6:1000, Dec 2015, doi: 10.1038/ncomms10001.
Griffith M, et al. “Optimizing cancer genome sequencing and analysis.” Cell Systems, vol. 1:210–23, Sep 2015, doi:10.1016/j.cels.2015.08.015.
Craig DW, et al. “A somatic reference standard for cancer genome sequencing.” Scientific Reports Nature, vol. 6:24607, Apr 2016, doi: 10.1038/srep24607.
“Overview – precisionFDA.” Precision FDA. precision.fda.gov/. Accessed 20 Dec 2022.
Yu G, et al. “Whole-Exome Sequencing of Nasopharyngeal Carcinoma Families Reveals Novel Variants Potentially Involved in Nasopharyngeal Carcinoma.” Scientific Reports Nature, vol. 9:9916, Jul 2019, doi: 10.1038/s41598-019-46137-4.
Bolger AM, Lohse M, Usadel B. “Trimmomatic: a flexible trimmer for Illumina sequence data.” Bioinformatics, vol. 30:2114–20, Aug 2014, doi:10.1093/bioinformatics/btu170.
Lähnemann D, et al. “Accurate and scalable variant calling from single cell DNA sequencing data with ProSolo.” Nature Communications, vol. 12, no. 6744, Nov 2021, doi: 10.1038/s41467-021-26938-w.
Krøigård AB, et al. “Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted Deep Sequencing Data.” PLoS One, vol. 11:e0151664, Mar 2016, doi: 10.1371/journal.pone.0151664.
“FastQC A Quality Control tool for High Throughput Sequence Data.” Babraham Bioinformatics. www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 18 Feb 2023.
Shirley M. “fastqp: Simple FASTQ quality assessment using Python.” Github. github.com/mdshw5/fastqp. Accessed 18 Feb 2023.
“NGS QC Toolkit: a toolkit for quality check and filtering of next generation sequencing data of Roche and Illumina technology.” NGSQCToolkit. www.nipgr.res.in/ngsqctoolkit.html. Accessed 18 Feb 2023.
Tammi MT. “PRINSEQ.” Bioinformatics. bioinformaticshome.com/tools/rna-seq/descriptions/PRINSEQ.html#gsc.tab=0. Accessed 18 Feb 2023.
Zhou Q, et al. “QC-Chain: fast and holistic quality control method for next-generation sequencing data.” PLoS One, vol. 8:e60234, Apr 2013, doi:10.1371/journal.pone.0060234.
License
Copyright (c) 2024 Carmen Alves Sabin
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.