Best Practices for Mouse SNP and Indel VCF Files Using GATK Mutect2

Answered using ChatGPT 4o by ChatGPT4o

Question:

Is there a best practice for choosing mouse SNP and indel VCF files when running GATK Mutect2 on mouse whole-exome data? For mm10 there seem to be several available; for mm39 the newest seems to be the 2021 release from /pub/databases/mousegenomes/REL-2112-v8-SNPs_Indels.

Also, for human samples GATK GetPileupSummaries can use the gnomad.vcf.gz files, but no such file seems to be available for mouse. Does this mean the step can be skipped for mouse?

Answer:

When working with mouse whole exome data using GATK Mutect2, there are several best practices and considerations to keep in mind, especially regarding the use of SNP and indel VCF files and handling steps like GetPileupSummaries.

Best Practices for Mouse SNP and Indel VCF Files

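  1. Reference Genome: Pick one reference genome (e.g., mm10 or mm39) and use it throughout. The BAM files, known-variant VCFs, panel of normals, and interval lists must all be on the same assembly with the same contig names; a VCF built for mm10 cannot be used with an mm39 alignment unless it is lifted over first.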

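  2. SNP and Indel VCF Files: For mouse data, use the most up-to-date and comprehensive SNP and indel VCF files available for your assembly. For mm39, the 2021 version from /pub/databases/mousegenomes/REL-2112-v8-SNPs_Indels is a reasonable choice. In a Mutect2 workflow these files act as a catalogue of known germline variants, which helps separate germline variation from somatic calls. Note that Mutect2's --germline-resource argument expects an allele frequency (AF) annotation in the INFO field, so check that the file you download provides one before passing it in.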

  3. Panel of Normals (PoN): If available, use a panel of normals specific to your mouse strain and experimental conditions. This helps in filtering out common technical artifacts and germline variants.
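
As a rough illustration of how the three inputs above fit together, here is a minimal Python sketch that assembles a Mutect2 command line. The flags (-R, -I, -normal, --germline-resource, --panel-of-normals, -L, -O) are standard Mutect2 arguments; the helper name and every file name are hypothetical placeholders, and the germline resource is deliberately left out of the example call on the assumption that no AF-annotated mouse VCF is at hand.

```python
from typing import List, Optional


def mutect2_command(
    reference: str,
    tumor_bam: str,
    output_vcf: str,
    normal_bam: Optional[str] = None,
    normal_sample: Optional[str] = None,
    germline_resource: Optional[str] = None,
    panel_of_normals: Optional[str] = None,
    intervals: Optional[str] = None,
) -> List[str]:
    """Build a `gatk Mutect2` argument list; optional inputs are added only when given."""
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]
    if normal_bam and normal_sample:
        # -normal takes the sample name (SM tag) of the normal BAM, not a file path.
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    if germline_resource:
        # Only useful if the VCF carries AF in its INFO field.
        cmd += ["--germline-resource", germline_resource]
    if panel_of_normals:
        cmd += ["--panel-of-normals", panel_of_normals]
    if intervals:
        # Restrict calling to the exome capture targets.
        cmd += ["-L", intervals]
    cmd += ["-O", output_vcf]
    return cmd


# Hypothetical file names, for illustration only.
cmd = mutect2_command(
    reference="mm39.fa",
    tumor_bam="tumor.bam",
    normal_bam="normal.bam",
    normal_sample="normal_sample",
    panel_of_normals="mouse_pon.vcf.gz",
    intervals="exome_targets.interval_list",
    output_vcf="tumor.unfiltered.vcf.gz",
)
print(" ".join(cmd))
```

The function only builds the argument list; it can be handed to subprocess.run(cmd, check=True) once the paths point at real files.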

Handling GetPileupSummaries for Mouse Data

GetPileupSummaries does not estimate contamination by itself: it tabulates read counts at known common variant sites, and CalculateContamination then uses that table to estimate cross-sample contamination. For human samples, the sites and their population allele frequencies come from a database such as gnomAD. For mouse data, there isn't a direct equivalent to gnomAD, so you have a few options:

  1. Custom Population VCF: If you have access to a large cohort of mouse exome or genome data, you can create a custom population VCF file. This file would serve a similar purpose to gnomAD by providing population allele frequencies for the mouse genome. GetPileupSummaries reads those frequencies from the AF annotation in the INFO field, so the custom file must include it.

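  2. Skipping the Step: If creating a custom population VCF is not feasible, you can skip GetPileupSummaries. Without its output, CalculateContamination cannot be run either, so FilterMutectCalls is run without a contamination table. The pipeline still completes, but no contamination estimate is applied during filtering, and calls caused by cross-sample contamination are less likely to be removed.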

  3. Alternative Methods: Explore alternative methods for contamination estimation that do not rely on population allele frequency databases. Some tools and methods might be adaptable for mouse data.
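
For option 1, the core arithmetic is computing an allele frequency per site from the cohort's genotypes and writing it as an AF entry in the INFO column. The following is a minimal sketch under stated assumptions: the sites are biallelic, genotypes are plain GT strings, and both function names are hypothetical. Whether frequencies derived from a panel of laboratory mice behave like gnomAD frequencies for contamination estimation is also an assumption that would need checking on your own data.

```python
from typing import Iterable, Optional


def allele_frequency(genotypes: Iterable[str]) -> Optional[float]:
    """Return the ALT allele frequency from biallelic GT strings such as '0/1' or '1|1'.

    Missing alleles ('.') are ignored; returns None if no allele was called.
    """
    alt = 0
    total = 0
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            total += 1
            if allele != "0":
                alt += 1
    return alt / total if total else None


def info_with_af(info: str, af: float) -> str:
    """Append an AF entry to a VCF INFO string ('.' means the field is empty)."""
    entry = f"AF={af:.6g}"
    return entry if info in ("", ".") else f"{info};{entry}"


# Five hypothetical samples at one site: 4 ALT alleles out of 8 called alleles.
cohort = ["0/0", "0/1", "1/1", "./.", "0|1"]
af = allele_frequency(cohort)
print(info_with_af("DP=120", af))  # DP=120;AF=0.5
```

A real file would also need a matching ##INFO header line declaring AF, and would normally be written with a VCF library rather than string handling.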

Summary

  • Use the most recent and comprehensive SNP and indel VCF files available for your reference genome (e.g., mm39 2021 version).
  • Consider creating a custom population VCF if you have access to a large cohort of mouse data.
  • If a custom population VCF is not available, you may skip the GetPileupSummaries step; in that case CalculateContamination is skipped too, and no contamination estimate is applied when the calls are filtered.
  • Explore alternative methods for contamination estimation that might be suitable for mouse data.
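
The choice between running and skipping contamination estimation can be sketched as a small planning function. This is an illustrative sketch, not an official workflow: the GATK tool names and flags (GetPileupSummaries -I/-V/-L/-O, CalculateContamination -I/-O, FilterMutectCalls -R/-V/--contamination-table/-O) are real, while the function names, the default prefix, and all file names are hypothetical.

```python
from typing import List, Optional


def filter_command(
    reference: str,
    unfiltered_vcf: str,
    filtered_vcf: str,
    contamination_table: Optional[str] = None,
) -> List[str]:
    """Build a FilterMutectCalls command; the contamination table is optional."""
    cmd = ["gatk", "FilterMutectCalls", "-R", reference, "-V", unfiltered_vcf]
    if contamination_table:
        cmd += ["--contamination-table", contamination_table]
    cmd += ["-O", filtered_vcf]
    return cmd


def plan(
    reference: str,
    tumor_bam: str,
    unfiltered_vcf: str,
    filtered_vcf: str,
    sites_vcf: Optional[str] = None,
    prefix: str = "tumor",
) -> List[List[str]]:
    """Return the commands to run, with or without contamination estimation."""
    steps: List[List[str]] = []
    table = None
    if sites_vcf:
        pileups = f"{prefix}.pileups.table"
        table = f"{prefix}.contamination.table"
        # Count reads at the population sites (the VCF must carry AF in INFO).
        steps.append(["gatk", "GetPileupSummaries", "-I", tumor_bam,
                      "-V", sites_vcf, "-L", sites_vcf, "-O", pileups])
        # Turn the pileup table into a contamination estimate.
        steps.append(["gatk", "CalculateContamination", "-I", pileups, "-O", table])
    steps.append(filter_command(reference, unfiltered_vcf, filtered_vcf, table))
    return steps


# Hypothetical file names. Without a sites VCF, only filtering is planned.
skipped = plan("mm39.fa", "tumor.bam", "tumor.unfiltered.vcf.gz", "tumor.filtered.vcf.gz")
full = plan("mm39.fa", "tumor.bam", "tumor.unfiltered.vcf.gz", "tumor.filtered.vcf.gz",
            sites_vcf="mouse_population_af.vcf.gz")
for step in full:
    print(" ".join(step))
```

Keeping the contamination table optional in one place means the same script works whether or not a population VCF becomes available later.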

Following these practices should help you get more accurate variant calls from your mouse whole exome data with GATK Mutect2.