Building a Diamond Db using Refseq Protein

The idea is to build a Diamond database from RefSeq protein (non-redundant) and later compare it to BLAST.

The first step is to download the FASTA files (non-redundant only).

wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.nonredundant_protein*.faa.gz

At the moment, there are ~928 FASTA files taking ~26 GB of disk space.
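With that many files, an interrupted transfer is easy to miss. The sketch below is one way to test each archive with `gzip -t` before building anything. The function name `check_archives` is my own and not part of any tool.

```shell
#!/usr/bin/env bash
# check_archives DIR: test every *.faa.gz in DIR with `gzip -t`.
# Prints the name of each corrupt archive (one per line) and
# returns non-zero if at least one was found.
# (check_archives is a hypothetical helper name.)
check_archives() {
    local dir="${1:-.}" bad=0 f
    for f in "$dir"/*.faa.gz; do
        [ -e "$f" ] || continue        # the glob matched nothing
        if ! gzip -t "$f" 2>/dev/null; then
            echo "$f"
            bad=$((bad + 1))
        fi
    done
    return $(( bad > 0 ))
}

# Example: check_archives . && echo "all archives OK"
```

Any file it reports can be fetched again with the same `wget -c` command after deleting the broken copy.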

I am also interested in having taxonomic information alongside the results. According to Diamond’s manual, this requires two more files:

wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
# Remember to unzip this one:
unzip taxdmp.zip

After that, it is possible to build the database using:

# diamond version 0.9.29
zcat *.faa.gz | ./diamond makedb --taxonmap prot.accession2taxid.gz --taxonnames names.dmp --taxonnodes nodes.dmp -d refseq_protein_nonredund_diamond

The build takes about 34 minutes and peaks at ~26 GB of RAM on a computer with 48 threads. The resulting index file is 54 GB.
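Once the database exists, a search that reports taxonomy could look roughly like the sketch below. The wrapper name `search_refseq`, the query and output file names, and the chosen output columns are assumptions for illustration. The `staxids` column is only filled in because the database was built with the taxonomy files.

```shell
#!/usr/bin/env bash
# Path to the diamond binary; the post runs it from the current directory.
DIAMOND="${DIAMOND:-./diamond}"

# search_refseq QUERY_FASTA OUT_TSV: blastp search against the database built above.
# (search_refseq is a hypothetical wrapper, not part of diamond.)
search_refseq() {
    local query="$1" out="$2"
    "$DIAMOND" blastp \
        --db refseq_protein_nonredund_diamond \
        --query "$query" \
        --out "$out" \
        --outfmt 6 qseqid sseqid pident length evalue bitscore staxids \
        --threads 48
}

# Example (file names are placeholders): search_refseq my_proteins.faa hits.tsv
```

The thread count matches the machine described here and should be adjusted for other hardware.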

To automate these steps, one can use a Makefile:

# Makefile content (recipe lines must be indented with a tab character):

SHELL=/bin/bash
.ONESHELL:

.PHONY: all clear

NCBI_TAXON:=ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy

all: prot.accession2taxid.gz \
	taxdmp.zip \
	faa.done \
	refseq_protein_nonredund_diamond.dmnd

prot.accession2taxid.gz:
	wget -c $(NCBI_TAXON)/accession2taxid/$@

taxdmp.zip:
	wget -c $(NCBI_TAXON)/$@ && \
	unzip $@

faa.done:
	wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.nonredundant_protein*.faa.gz && \
	touch $@

# diamond appends .dmnd to the name given to -d, so the target is the .dmnd file.
# With the bare name as target, make never finds the file and rebuilds on every run.
refseq_protein_nonredund_diamond.dmnd: faa.done taxdmp.zip prot.accession2taxid.gz
	zcat *.faa.gz | ./diamond makedb --taxonmap $(word 3, $^) --taxonnodes nodes.dmp --taxonnames names.dmp -d $(basename $@)
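
A full run needs room for both the ~26 GB of downloads and the 54 GB index, so it may be worth checking free disk space before calling `make`. This is a minimal sketch. The helper name `has_free_gb` is hypothetical, and the 80 GB threshold is simply the sum of the two figures reported in this post.

```shell
#!/usr/bin/env bash
# has_free_gb DIR MIN_GB: succeed if the filesystem holding DIR
# has at least MIN_GB gigabytes available.
# (has_free_gb is a hypothetical helper name.)
has_free_gb() {
    local dir="$1" min_gb="$2" free_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Example: has_free_gb . 80 && make
```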
