1 Busco

Busco is a software that assesses genome completeness based on evolutionarily-informed expectations of gene content of near-universal single-copy orthologs by searching for these select sets of orthologous genes. There are several databases that can be used with Busco, and can be downloaded here.

Busco can be run with or without internet, so we will consider two ways to run this software:

Option 1: For option 1, we will assume that you have internet connection on the node you are running your job on. Note: Busco does not have the ability to direct an output file to a different folder than the folder you submit the job from, even with a path identified. In this case, consider creating a Busco folder.

#make directory / folder
mkdir Busco
#
#change directory to your Busco folder
cd Busco

Be sure that it is available and loaded on your HPC, and write and submit a job with the following code:

echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
busco  -o output_file -i /path/to/input/fasta/file.fasta -l mammalia_odb10 -c $NSLOTS -m genome
#
echo = `date` job $JOB_NAME done
#
#Explanation:
#-o: name of the output folder and files
#-i: input file (FASTA)
#-l: lineage - ex. mammalian / name of the database (This will automatically connected and download the database from the BUSCO website)
#-c: number of CPUs
#-m: mode (options are *genome*, transcriptome, proteins)

Option 2: If you do not have internet connection on the node you are running the software on, you can download the database and run the program offline.

#make directory / folder
mkdir Busco
#
#change directory to your Busco folder
cd Busco
#
#download database from web, ex. mammal database
wget https://busco-data.ezlab.org/v5/data/lineages/mammalia_odb10.2021-02-19.tar.gz
#
#extract files within Busco folder
tar -zxf mammalia_odb10.2021-02-19.tar.gz

Write and submit a job with the following code:

echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
busco  -o output_file -i /path/to/input/fasta/file.fasta -l mammalia_odb10 -c $NSLOTS -m genome --offline --download_path /path/to/datasets
#
echo = `date` job $JOB_NAME done
#
#Explanation:
#-o: name of the output folder and files
#-i: input file (FASTA)
#-l: lineage - ex. mammalian / name of the database (This will automatically connected and download the database from the BUSCO website)
#-c: number of CPUs
#-m: mode (options are *genome*, transcriptome, proteins)

2 Blobtools

Blobtools is a command line tool designed for interactive quality assessment of genome assemblies and contaminant detection and filtering. We will use both Blobtools and Blobtools2 in the following steps:

1) Apply Blobtools

2) Create Blobtools2 database

3) Add data to database

4) Create interactive html page with results

2.1 Apply Blobtools

Note: Identify the version of Blobtools you are using when loading the module on your HPC Write and submit a job script with the following code:

echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
blastn -db /data/genomics/db/ncbi/db/latest_v4/nt/nt -query /path/to_assembly/file.fasta -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 20 -max_hsps 1 -evalue 1e-20 -num_threads $NSLOTS -out clouded_leopard_blast.out
#
echo = `date` job $JOB_NAME done
#
#Explanation
#-db: ncbi nucleotide database
#-query: input file (FASTA)
#-outfmt: format of the output file (important to for blobtools) 
#-max_target_seqs: Number of aligned sequences to keep.
#-max_hsps: Maximum number of HSPs (alignments) to keep for any single query-subject pair.
#-num_threads: number of CPUs
#-out: name of the output file

2.2 Create Blobtools2 Database

The following can be run on an interactive, or computational node. First, load blobtools, specifying the version you are using. Next, submit the following code on your interactive node:

blobtools create --fasta /path/to_assembly/file.fasta species_blobt_nb
#
#Explanation
#create: command to create the database with blobtools2
#--fasta: path and name of the assembly fasta file
#species_blobt_nb: name of the database

2.3 Add Data to Database

Once you have a Blob database, additional data can be added by parsing analysis output files into one or more fields using the blobtools “add” command. Preset parsers are available for a range of analysis types, such as BLAST or “Diamond hits” with taxonomic assignments for scaffolds or contigs. Read mapping files provide base and read coverage and BUSCO results show completeness metrics for the genome assembly.

This step can also be run on an interactive node:

blobtools add --threads 3 --hits species_blast.out --taxrule bestsumorder --taxdump /path/to/taxdump/folder species_blobt_nb 
#
blobtools add --threads 3 --cov species_sorted.bam species_blobt_nb 
#
blobtools add --threads 3 --busco table.tsv species_blobt_nb

2.4 Create Interactive HTML Page with Results

After you finish creating and adding data to the database in order to visualize the results, install blobtools2 on your personal machine and download the database folder.

#tar zip folder
tar -czvf name-of-archive.tar.gz /path/to/directory/or/file

#download using ffsend
module load bio/ffsend

#ffsend
ffsend upload <file>

#Install blobtools on your machine using conda

#make new folder for blobtool results
mkdir blobtools_results

#move to new folder
cd blobtools_results

#move or ffsend folder to this folder

#unzip file
untar -xzvf archive.tar.gz

#After installing blobtools2, activate your conda environment and from the program folder by running the following command on your database folder:
conda activate btk_env
./blobtools view --remote path/to/species_nb

You should be able to interact with and visualize results