How can I parallelise Mutect2?
Parallelizing Mutect2, the somatic short-variant caller from the GATK suite, can significantly speed up the analysis of large datasets. Mutect2 itself is essentially single-threaded (apart from the `--native-pair-hmm-threads` option, which only parallelizes the PairHMM step), so the real gains come from running many independent instances over different genomic regions. Here are some strategies:
**1. Scatter-gather over intervals**

This is the most common method for parallelizing GATK tools, including Mutect2.

**Scatter:** Use `IntervalListTools` from Picard or `SplitIntervals` from GATK to divide the genome into intervals.
```bash
gatk SplitIntervals \
  -R reference.fasta \
  -L intervals.list \
  --scatter-count 50 \
  -O scattered_intervals/
```
This command splits the genome into 50 interval files (named like `0000-scattered.interval_list` by default) in `scattered_intervals/`.
**Process:** Run Mutect2 on each interval in parallel, using a job scheduler (e.g., SLURM, SGE) or GNU Parallel.
```bash
parallel -j 8 \
  gatk Mutect2 -R reference.fasta -I input.bam -L {} -O output_{#}.vcf \
  ::: scattered_intervals/*.interval_list
```

This runs up to 8 Mutect2 jobs concurrently; `{#}` is GNU Parallel's job number, which keeps the output file names simple instead of embedding the full interval path.
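For the scheduler route, a SLURM array job is a common alternative to GNU Parallel. The sketch below only writes a hypothetical batch script; the 0–49 index range mirrors the `--scatter-count 50` example, and the paths and SplitIntervals' default `.interval_list` naming are assumptions, not tested values.

```shell
# Hedged sketch of a SLURM array job: one array task per interval file.
# Paths (scattered_intervals/, reference.fasta, input.bam) and resource
# requests are placeholders carried over from the examples above.
cat > mutect2_array.sbatch <<'EOF'
#!/bin/bash
#SBATCH --array=0-49
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
# Each array task picks the interval file matching its index.
INTERVALS=(scattered_intervals/*.interval_list)
gatk Mutect2 \
  -R reference.fasta \
  -I input.bam \
  -L "${INTERVALS[$SLURM_ARRAY_TASK_ID]}" \
  -O "output_${SLURM_ARRAY_TASK_ID}.vcf"
EOF
echo "Submit with: sbatch mutect2_array.sbatch"
```

Each of the 50 array tasks runs one shard; SLURM schedules as many concurrently as the cluster allows.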
**Gather:** Combine the per-interval VCFs using `GatherVcfs` from Picard or `MergeVcfs` from GATK.
```bash
gatk MergeVcfs \
  -I output_1.vcf -I output_2.vcf -I output_3.vcf ... -I output_50.vcf \
  -O final_output.vcf
```
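With 50 shards, typing the `-I` flags by hand is error-prone. A small shell sketch can assemble the argument list; the three shard names below are hypothetical placeholders standing in for a real `output_*.vcf` glob.

```shell
# Build the "-I shard.vcf" argument list for MergeVcfs programmatically.
# In practice you would loop over output_*.vcf; placeholder names are
# used here for illustration.
ARGS=""
for vcf in output_1.vcf output_2.vcf output_3.vcf; do
  ARGS="$ARGS -I $vcf"
done
CMD="gatk MergeVcfs$ARGS -O final_output.vcf"
echo "$CMD"
# prints: gatk MergeVcfs -I output_1.vcf -I output_2.vcf -I output_3.vcf -O final_output.vcf
```

Note that each Mutect2 shard also writes a `.stats` file alongside its VCF; GATK provides a `MergeMutectStats` tool to combine these before running `FilterMutectCalls` on the merged VCF.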
**2. Workflow management systems**

Workflow management systems like WDL (Workflow Description Language) with Cromwell, Nextflow, or Snakemake can automate the scatter-gather process.
**WDL script:** Write a WDL script to define the scatter-gather workflow.
```wdl
workflow mutect2_scatter_gather {
  File reference
  File input_bam
  Array[File] intervals

  scatter (interval in intervals) {
    call Mutect2 {
      input:
        reference = reference,
        input_bam = input_bam,
        interval = interval
    }
  }

  call MergeVcfs {
    input:
      vcfs = Mutect2.output_vcf
  }
}

task Mutect2 {
  File reference
  File input_bam
  File interval

  command {
    gatk Mutect2 -R ${reference} -I ${input_bam} -L ${interval} -O output.vcf
  }
  output {
    File output_vcf = "output.vcf"
  }
}

task MergeVcfs {
  Array[File] vcfs

  command {
    gatk MergeVcfs -I ${sep=' -I ' vcfs} -O final_output.vcf
  }
  output {
    File final_output_vcf = "final_output.vcf"
  }
}
```
**Run with Cromwell:**

```bash
java -jar cromwell.jar run mutect2_scatter_gather.wdl -i inputs.json
```
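Cromwell reads workflow inputs from a JSON file keyed by `workflow.input` names. Given the WDL above, `inputs.json` might look like the following sketch; all file paths are placeholders you would replace with your own:

```json
{
  "mutect2_scatter_gather.reference": "/path/to/reference.fasta",
  "mutect2_scatter_gather.input_bam": "/path/to/input.bam",
  "mutect2_scatter_gather.intervals": [
    "/path/to/scattered_intervals/0000-scattered.interval_list",
    "/path/to/scattered_intervals/0001-scattered.interval_list"
  ]
}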
**3. Cloud platforms**

Cloud platforms like Google Cloud, AWS, or Azure offer managed services for running GATK workflows in parallel.
**4. Kubernetes**

Containerize the GATK tools and run them on a Kubernetes cluster for scalable and efficient parallel processing.
**Dockerize GATK:** Create a Docker image based on the official Broad Institute image.

```dockerfile
FROM broadinstitute/gatk:latest
```
**Kubernetes Job:** Define a Kubernetes Job to run Mutect2 on each interval.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mutect2-job
spec:
  parallelism: 50
  template:
    spec:
      containers:
      - name: mutect2
        image: your-docker-image
        command: ["gatk", "Mutect2", "-R", "reference.fasta", "-I", "input.bam", "-L", "interval", "-O", "output.vcf"]
      restartPolicy: Never
```
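One caveat with the Job above: all 50 pods would run the same command with the same `interval` placeholder. A way to give each pod its own shard is an Indexed Job, which injects a `JOB_COMPLETION_INDEX` environment variable into each pod. The sketch below assumes the image name, mount paths, and a per-index interval file naming scheme (`0.interval_list` … `49.interval_list`) that you would need to set up yourself:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mutect2-indexed
spec:
  completions: 50
  parallelism: 50
  completionMode: Indexed   # each pod gets a unique JOB_COMPLETION_INDEX
  template:
    spec:
      containers:
      - name: mutect2
        image: your-docker-image
        command: ["/bin/sh", "-c"]
        # Assumes interval files were renamed 0.interval_list ... 49.interval_list
        # and that /data is a shared volume mounted into each pod.
        args:
          - gatk Mutect2 -R /data/reference.fasta -I /data/input.bam
            -L /data/scattered_intervals/${JOB_COMPLETION_INDEX}.interval_list
            -O /data/output_${JOB_COMPLETION_INDEX}.vcf
      restartPolicy: Never
```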
In summary, parallelizing Mutect2 means splitting the genome into smaller interval chunks, running Mutect2 on each chunk in parallel, and then merging the results. This can be achieved with a manual scatter-gather approach, a workflow management system, cloud-based services, or a container orchestration platform like Kubernetes. Choose the method that best fits your computational resources and workflow requirements.