Gene Prediction and Begin Annotation

GeMoMa

Gene Model Mapper (GeMoMa) is a homology-based gene prediction program that uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thus, GeMoMa uses amino acid and intron position conservation to create gene models. For more information, or to install, visit GeMoMa.

GeMoMa uses the annotation of coding genes from reference genomes, so first, we need to download reference genomes, including a fasta and gff file for each genome. Download reference genomes for your species and / or closely related taxa. To download reference genomes from NCBI:

#download using wget command
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/species/file_genomic.fna.gz
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/species/file_genomic.gff.gz

To run GeMoMa, be sure that the module is installed and loaded on your HPC. Recognize that the GeMoMa pipeline contains several steps, and is therefore computationally heavy. Be sure to consider this when writing your job script and submitting. Write and submit a job with the following code:

echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
set maxHeapSize 400000
#
GeMoMa -Xmx400G GeMoMaPipeline threads=$NSLOTS outdir=../gemoma/ GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO p=false o=true t=../assembly/file.fasta s=own i=species a=/path/to/reference/genome/species_genomic.gff g=/path/to/reference/genome/species_genomic.fna s=own i=species #add multiple reference genomes..
#
echo = `date` job $JOB_NAME done
#
#Explanation
#-Xmx: define max heap size to increase default memory use
#threads: # of CPUs
#AnnotationFinalizer.r: to rename the predictions (NO/YES)
#p: obtain predicted proteines (false/true)
#o: keep and safe  individual predictions for each reference species (false/true)
#t: target genome (fasta)
#outdir: path and prefix of output directory
#i: allows to provide an ID for each reference organism
#g: path to individual reference genome
#a: path to indvidual annotation genome