Next, we will call variants across the genome using Minimap2, Genome Analysis Toolkit (GATK), and Samtools in the following steps:
#1) Map samples to reference genome using Minimap2, convert files from SAM to BAM format
#2) Mark duplicate reads using GATK
#3) Add read group information to BAM files using GATK
#4) Index with Samtools
#5) Base Quality Recalibration
#6) Call SNPs and indels
#7) Combine multiple VCF files
#8) Perform joint Genotyping with GenotypeGVCFs
#9) SNP Filtering
Map your raw genomic data from an individual or individuals to the reference genome of that particular species using Minimap2, a sequence alignment program that aligns DNA sequences against a large reference database. For more information, or to download, visit Minimap2.
During this step, also convert the file format from .SAM (sequence alignment map) to .BAM (binary alignment map) using Samtools, a suite of programs designed for interacting with high-throughput sequencing data. For more information, or to download, visit Samtools.
To run Minimap2 and Samtools on your data, be sure that the modules are available, versions are identified and loaded on your HPC, and write a job script using the following code:
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
minimap2 -ax sr -t 20 /path/to/reference/genome/file.fasta /path/to/raw/data/files/file_1.fastq.gz /path/to/raw/data/files/file_2.fastq.gz | samtools sort -@20 -O BAM -o output_bam_file.bam -
#
echo = `date` job $JOB_NAME done
#
#Minimap2 Explanation
#-ax sr: -a outputs SAM format; -x sr is the preset for Illumina short reads, change as needed
#-t: number of threads to use
#
#Samtools Explanation
#sort: sort alignments by coordinate
#-@: number of threads to use
#-O: output format (BAM)
#-o: name of the output file
#-: read input from the pipe (standard input)
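As an optional sanity check (not part of the pipeline itself), you can confirm the sorted BAM file is intact and view basic mapping statistics with Samtools; the file name below matches the output of the command above:
#verify the BAM file is complete and not truncated
samtools quickcheck output_bam_file.bam && echo "BAM OK"
#print mapping statistics (total reads, mapped reads, properly paired reads, etc.)
samtools flagstat output_bam_file.bam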
Genome Analysis Toolkit (GATK) is a suite of tools for variant discovery in high-throughput sequencing data, producing high-quality variant calls from reads mapped to a reference. We will first use GATK to mark duplicate reads, i.e., reads that originated from a single fragment of DNA, which commonly arise during sample preparation. For more information about GATK, to read best-practices recommendations, or to download the required software packages, visit GATK.
To run GATK on your data, be sure that the modules are available, versions are identified and loaded on your HPC, and write a job script using the following code:
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
rungatk MarkDuplicates \
-I=bam_file.bam -O=bam_file_duplicates_marked.bam -M=bam_file_duplicates_marked.metrics
#
echo = `date` job $JOB_NAME done
#
#Explanation
#-I: input .bam file
#-O: output / marked duplicates file
#-M: output / metrics file
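If you would like to see what fraction of reads were flagged as duplicates (an optional check), the metrics file written by -M contains a PERCENT_DUPLICATION column that can be viewed directly:
#print the metrics table; the column header and data row follow the METRICS CLASS line
grep -A 2 "^## METRICS CLASS" bam_file_duplicates_marked.metrics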
Next, apply GATK to add read group information to your .bam file. This command will assign all of the reads in the file to a single new read group. A read group is a set of NGS reads that were all the product of a single sequencing run on one lane. Algorithms in GATK require read group information to identify which reads were sequenced together to compensate for variability and error between sequencing runs.
To run GATK on your data, be sure that the modules are available, versions are identified and loaded on your HPC, and write a job script using the following code:
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
rungatk AddOrReplaceReadGroups -I=/path/to/file/bam_file_duplicates_marked.bam -O=bam_file_duplicates_marked_read_ID.bam -ID=1 -LB=lib2 -PL=ILLUMINA -PU=unit1 -SM=sample_name
#
echo = `date` job $JOB_NAME done
#
#Explanation
#-I: input .bam file
#-O: output file with the read group added
#-ID: read group ID, e.g., lane number
#-LB: library ID
#-PL: platform, e.g., Illumina
#-PU: platform unit / run barcode
#-SM: sample name
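To confirm the read group was added (an optional check), print the BAM header and look for the @RG line, which should show the ID, LB, PL, PU, and SM values you supplied:
#print only the @RG lines from the BAM header
samtools view -H bam_file_duplicates_marked_read_ID.bam | grep '^@RG'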
Next, use Samtools to index the .bam file. The index command will create a new index file that will allow fast and efficient look-up of data in the .bam file.
The following code can be run in an interactive / computational node on your HPC. Be sure to load GATK and Samtools, specifying the current versions, prior to running the following code:
rungatk CreateSequenceDictionary -R=reference_genome_file.fasta -O=reference_genome_file.dict
samtools faidx reference_genome_file.fasta
samtools index bam_file_duplicates_marked_read_ID.bam
#Explanation
#-R: reference genome
#-O: output sequence dictionary; give it the same base name as the reference (.dict) so GATK can locate it automatically
The GATK best practices recommend performing Base Quality Score Recalibration (BQSR), which detects systematic errors in the data by comparing it to a training set of known variant sites. While this works well for model organisms, it is challenging for non-model organisms, which typically lack a high-confidence database of known sites for training. It is possible to build a custom training set, but in many cases this has performed poorly and has led to biased results. To apply this procedure, or to determine whether it may benefit your organism, visit GATK Recalibration.
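If you do have a reliable set of known variant sites for your organism, a minimal BQSR sketch looks like the following; known_sites.vcf.gz is a placeholder for your training set, and this illustration is not a required step in this pipeline:
#step 1: build a recalibration table by comparing the data to known variant sites
rungatk BaseRecalibrator \
-I bam_file_duplicates_marked_read_ID.bam \
-R reference_genome_file.fasta \
--known-sites known_sites.vcf.gz \
-O recal_data.table
#step 2: apply the recalibration table to produce an adjusted BAM
rungatk ApplyBQSR \
-I bam_file_duplicates_marked_read_ID.bam \
-R reference_genome_file.fasta \
--bqsr-recal-file recal_data.table \
-O bam_file_recalibrated.bam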
HaplotypeCaller is the tool within GATK built to call germline single nucleotide polymorphisms (SNPs) and small indels using local de novo assembly of haplotype regions. For more information about how this tool works, visit HaplotypeCaller.
To run HaplotypeCaller on your data, be sure that the modules are available, versions are identified and loaded on your HPC, and write a job script using the following code:
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
rungatk HaplotypeCaller -I bam_file.bam -R reference_fasta_file.fasta -ERC GVCF -O output_file.g.vcf.gz
#
echo = `date` job $JOB_NAME done
#
#Explanation
#-I: input .bam file
#-R: reference .fasta file
#-O: output
#-ERC GVCF: emit reference confidence scores, producing a GVCF for joint genotyping; recommended when working with many samples
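Because HaplotypeCaller runs per sample, with many individuals it is convenient to loop over BAM files in the job script. Below is a minimal sketch assuming one indexed, read-group-annotated BAM per sample in the working directory (file names are placeholders):
#run HaplotypeCaller once per sample, naming each GVCF after its BAM
for bam in *_duplicates_marked_read_ID.bam; do
  sample=$(basename "$bam" .bam)
  rungatk HaplotypeCaller -I "$bam" -R reference_fasta_file.fasta -ERC GVCF -O "${sample}.g.vcf.gz"
done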
The next step is to combine the per-individual GVCF files you have produced into one cohort file. We will do this by indexing the files with HTSlib and then merging them with CombineGVCFs.
Once HaplotypeCaller has completed, move all of the GVCF files into a single folder and index them; indexing can be done on an interactive or computational node on your HPC. Here we use HTSlib, which is designed for high-throughput sequencing data. For more information, or to download, visit HTSlib.
To index your data using HTSlib, be sure that the module is available and loaded on your HPC, and you are working in an interactive or computational node. Submit the following code:
#change directory to folder with your files
cd /path/to/folder/containing/files
#index command, for indexing vcf files
tabix -p vcf file_1.g.vcf.gz
tabix -p vcf file_2.g.vcf.gz
tabix -p vcf file_3.g.vcf.gz
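With many samples, the same indexing can be done in a loop rather than one file at a time:
#index every GVCF in the current folder
for f in *.g.vcf.gz; do tabix -p vcf "$f"; done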
CombineGVCFs is a tool within GATK built to combine multiple single-sample GVCF files, merging them into a single multi-sample GVCF file. For more information, visit CombineGVCFs.
To run CombineGVCFs on your data, be sure that the modules are available, versions are identified and loaded on your HPC, and write a job script using the following code:
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
rungatk CombineGVCFs -R reference_genome.fasta \
-V file_1.g.vcf.gz \
-V file_2.g.vcf.gz \
-V file_3.g.vcf.gz \
-O output_cohort_file.g.vcf.gz
#
echo = `date` job $JOB_NAME done
#
#Explanation
#-V: individual.g.vcf files
#-R: reference genome
#-O: combined cohort output file (.g.vcf.gz)
To index your data using HTSlib, be sure that the module is available and loaded on your HPC, and you are working in an interactive or computational node. Submit the following code:
#change directory to folder with your files
cd /path/to/folder/containing/file
#index command, for indexing vcf files
tabix -p vcf output_cohort_file.g.vcf.gz
GenotypeGVCFs is a tool within GATK built to perform joint genotyping on your samples. Note that this tool processes only one input file, whether that is a cohort file or a single individual. The file must also contain the genotype likelihoods produced by HaplotypeCaller, which we generated earlier in this module. For more information about this tool, visit GenotypeGVCFs.
To run GenotypeGVCFs on your data, be sure that the modules are available, versions are identified and loaded on your HPC, and write a job script using the following code:
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
export GATK_JAVA_OPTIONS="-XX:+UseParallelGC -XX:ParallelGCThreads=32"
rungatk GenotypeGVCFs \
-R /path/to/reference/genome/file/fasta_file.fasta \
-V cohort_vcf_file.g.vcf.gz \
-O output_file.vcf.gz
#
echo = `date` job $JOB_NAME done
#
#Explanation
#-V: cohort file containing all individual samples
#-R: reference genome
#-O: output file with the genotype in vcf format
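As a quick optional check before filtering, you can count how many variant records were genotyped:
#count variant records (all non-header lines) in the genotyped VCF
zcat output_file.vcf.gz | grep -vc '^#'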
VariantFiltration is a tool within GATK built to hard-filter variant calls based on custom quality criteria such as sequencing depth and mapping quality. The output vcf file can then be used in downstream population genomic analyses. For more information about this tool, visit VariantFiltration.
To run VariantFiltration on your data, be sure that the modules are available, versions are identified and loaded on your HPC, and write a job script using the following code:
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
rungatk VariantFiltration \
-R /path/to/reference/genome/file/fasta_file.fasta \
-V output_file.vcf.gz \
-O output_file_varfilter.vcf \
--filter-name "Low_depth10" \
--filter-expression "DP < 10"
#
echo = `date` job $JOB_NAME done
#
#Explanation
#-V: genotyped .vcf file from the previous step
#-R: reference genome
#-O: output file with the variant filter annotations
#--filter-name: a filter name of your choosing, printed in the FILTER column of the vcf file for your information
#--filter-expression: the hard-filtering criterion you want to apply
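Note that VariantFiltration only annotates the FILTER column; sites failing the filter remain in the file. If you want a VCF containing only passing sites for downstream analyses, one option is GATK's SelectVariants, sketched below with the file names used above:
#keep only variants that passed all filters
rungatk SelectVariants \
-R /path/to/reference/genome/file/fasta_file.fasta \
-V output_file_varfilter.vcf \
--exclude-filtered \
-O output_file_pass_only.vcf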