how to parallelise mutect2
Parallelizing Mutect2, a tool from the GATK suite used for calling somatic mutations, can significantly speed up the analysis of large datasets. Here are some strategies to parallelize Mutect2:
Scatter-gather is the most common way to parallelize GATK tools, including Mutect2. The idea is to split the genome (or the target regions) into smaller sets of intervals (scatter), run Mutect2 on each set in parallel, and then combine the per-interval results (gather).
ScatterIntervalsByNs or IntervalListTools can be used to create interval lists.
Create Interval List:
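# -SD is the reference sequence dictionary, which BedToIntervalList requires (assumed here to be reference.dict, e.g. from gatk CreateSequenceDictionary)
# Use the .interval_list extension so GATK reads the file as a Picard-style interval list
gatk BedToIntervalList -I regions.bed -O intervals.interval_list -SD reference.dict
Split Interval List:
gatk SplitIntervals -R reference.fasta -L intervals.interval_list -O interval_files/ -scatter 20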
Run Mutect2 in Parallel:
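# {/.} is the input path without its directory or extension (e.g. 0000-scattered), so each output lands in the current directory
parallel -j 20 gatk Mutect2 -R reference.fasta -I input.bam -L {} -O output_{/.}.vcf ::: interval_files/*.interval_list
Combine VCF Files:
# GatherVcfs needs its inputs in genomic order; the zero-padded shard names sort that way
gatk GatherVcfs $(printf -- '-I %s ' output_*-scattered.vcf) -O final_output.vcf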
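One thing the gather step should also cover: each Mutect2 run writes a stats file next to its VCF (the VCF name plus .stats), and FilterMutectCalls later expects a single merged stats file for the combined VCF. The sketch below builds the GatherVcfs and MergeMutectStats commands together from one ordered list of shard VCFs. The function names and the DRY_RUN switch are made up for illustration, and it prints the commands by default instead of running them.

```shell
#!/usr/bin/env bash

# Print the command in dry-run mode (the default here); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: gather_mutect2_shards FINAL_VCF SHARD_VCF...
# Shards must be given in genomic order, because GatherVcfs does not sort.
gather_mutect2_shards() {
    local final_vcf=$1
    shift
    local vcf_args=() stats_args=() shard
    for shard in "$@"; do
        vcf_args+=( -I "$shard" )
        # Assumes the default Mutect2 stats location: <shard vcf>.stats
        stats_args+=( -stats "$shard.stats" )
    done
    run gatk GatherVcfs "${vcf_args[@]}" -O "$final_vcf"
    # Writing the merged stats to <final vcf>.stats puts it where
    # FilterMutectCalls looks by default.
    run gatk MergeMutectStats "${stats_args[@]}" -O "$final_vcf.stats"
}
```

Call it with the final VCF name followed by the shard VCFs in the same order as the interval files, check the printed commands, and then repeat with DRY_RUN=0 to actually run them.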
Workflow management systems like Snakemake or Nextflow can automate the scatter-gather process and handle parallel execution.
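# Run with: snakemake --cores 20
# SplitIntervals names its outputs 0000-scattered.interval_list, 0001-scattered.interval_list, ...
SHARDS = [f"{i:04d}" for i in range(20)]

rule all:
    input:
        "final_output.vcf"

rule split_intervals:
    input:
        "reference.fasta"
    output:
        expand("interval_files/{shard}-scattered.interval_list", shard=SHARDS)
    shell:
        """
        gatk SplitIntervals -R {input} -O interval_files/ -scatter 20
        """

rule mutect2:
    input:
        bam="input.bam",
        ref="reference.fasta",
        interval="interval_files/{shard}-scattered.interval_list"
    output:
        vcf="output_{shard}.vcf"
    shell:
        """
        gatk Mutect2 -R {input.ref} -I {input.bam} -L {input.interval} -O {output.vcf}
        """

rule gather_vcfs:
    input:
        expand("output_{shard}.vcf", shard=SHARDS)
    output:
        "final_output.vcf"
    params:
        # GatherVcfs takes one -I per file, in genomic order
        vcfs=lambda wildcards, input: " ".join(f"-I {vcf}" for vcf in input)
    shell:
        """
        gatk GatherVcfs {params.vcfs} -O {output}
        """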
If you have access to an HPC cluster, you can submit multiple Mutect2 jobs to the cluster's job scheduler (e.g., SLURM, PBS, SGE).
Create a SLURM Job Script:
#!/bin/bash
#SBATCH --job-name=mutect2
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=24:00:00
#SBATCH --output=mutect2_%A_%a.out
#SBATCH --array=1-20
# Pick the Nth interval file for array task N (ls output is sorted, so the mapping is stable)
INTERVAL_FILE=$(ls interval_files/*.interval_list | sed -n "${SLURM_ARRAY_TASK_ID}p")
gatk Mutect2 -R reference.fasta -I input.bam -L "$INTERVAL_FILE" -O "output_${SLURM_ARRAY_TASK_ID}.vcf"
Submit the Job:
sbatch mutect2_job.sh
Gather VCF Files:
# Pass all 20 shards in numeric order; a plain glob would sort output_10.vcf before output_2.vcf
gatk GatherVcfs $(printf -- '-I output_%d.vcf ' $(seq 1 20)) -O final_output.vcf
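The gather step does not have to be run by hand once the array finishes. A minimal sketch, assuming the gather command has been put into its own job script (gather_job.sh is a hypothetical name, as is the function): submit the array, capture its job ID, and submit the gather job with an afterok dependency so it starts only if every array task succeeded.

```shell
# Usage: submit_scatter_gather ARRAY_SCRIPT GATHER_SCRIPT
# Prints the job ID of the gather job.
submit_scatter_gather() {
    local array_script=$1 gather_script=$2
    local array_id
    # --parsable makes sbatch print only the job ID (optionally "ID;cluster")
    array_id=$(sbatch --parsable "$array_script") || return 1
    array_id=${array_id%%;*}
    # afterok on an array job waits for all of its tasks to finish successfully
    sbatch --parsable --dependency="afterok:${array_id}" "$gather_script"
}
```

For example, submit_scatter_gather mutect2_job.sh gather_job.sh queues both jobs in one go. If any array task fails, the gather job never runs, and most clusters will leave it pending with an unsatisfied dependency until it is cancelled.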
Cloud platforms like Google Cloud, AWS, or Azure can run the same scatter-gather pattern through their batch processing services (e.g., AWS Batch, Google Cloud Batch, Azure Batch), typically with one array task per interval file.
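Whichever service is used, each task has to work out which interval file is its own. Batch services usually expose the task's array index as an environment variable. The sketch below maps that index to the file names SplitIntervals produces (0000-scattered.interval_list and so on). It assumes a 1-based SLURM array as in the job script above, and zero-based AWS_BATCH_JOB_ARRAY_INDEX (AWS Batch) and BATCH_TASK_INDEX (Google Cloud Batch) variables. Check the variable names against your provider's documentation; the function names are made up for illustration.

```shell
# Print this task's zero-based shard index, or fail if no known variable is set.
shard_index() {
    if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
        echo $(( SLURM_ARRAY_TASK_ID - 1 ))   # assumes --array=1-N
    elif [ -n "${AWS_BATCH_JOB_ARRAY_INDEX:-}" ]; then
        echo "$AWS_BATCH_JOB_ARRAY_INDEX"
    elif [ -n "${BATCH_TASK_INDEX:-}" ]; then
        echo "$BATCH_TASK_INDEX"
    else
        echo "no array index variable set" >&2
        return 1
    fi
}

# Print the interval file for this task, using the SplitIntervals naming scheme.
shard_file() {
    local idx
    idx=$(shard_index) || return 1
    printf 'interval_files/%04d-scattered.interval_list\n' "$idx"
}
```

Inside the task, the Mutect2 call then takes -L "$(shard_file)". This way the same worker script can run unchanged on a cluster or in the cloud.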
By following these strategies, you can effectively parallelize Mutect2 to handle large datasets more efficiently.