Building a Diamond DB using RefSeq Protein

The idea is to build a Diamond database from the RefSeq protein (non-redundant) set and later compare its results with BLAST.

The first step is to download the FASTA files (non-redundant only).

wget -c*.faa.gz

At the moment, there are ~928 FASTA files taking ~26 GB of disk space.
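To check these numbers on your own download, a small helper along these lines could count the archives and report their combined size. This is only a sketch: the function name `summarize_faa` is made up for illustration, and it assumes all `*.faa.gz` files sit directly in one directory.

```shell
# Hypothetical helper: count *.faa.gz files in a directory and print their total size.
summarize_faa() {
    local dir="${1:-.}"
    local count
    # tr strips the padding that some wc implementations add
    count=$(find "$dir" -maxdepth 1 -type f -name '*.faa.gz' | wc -l | tr -d ' ')
    echo "files: $count"
    if [ "$count" -gt 0 ]; then
        # du -c appends a grand total as the last line
        du -ch "$dir"/*.faa.gz | tail -n 1
    fi
}
```

Calling `summarize_faa .` in the download directory prints the file count followed by the `du` grand total.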

I also want taxonomic information alongside the results. According to Diamond’s manual, this requires two more downloads:

wget -c

wget -c
# Remember to unzip this one
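Before building, it may be worth confirming that the three taxonomy inputs used by the `makedb` command (`prot.accession2taxid.gz`, `names.dmp` and `nodes.dmp`) are actually in place. The sketch below does that; the function name `check_taxonomy_files` is an assumption for illustration, not part of Diamond.

```shell
# Hypothetical helper: succeed only if all taxonomy inputs exist and are non-empty.
check_taxonomy_files() {
    local dir="${1:-.}" missing=0 f
    for f in prot.accession2taxid.gz names.dmp nodes.dmp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_taxonomy_files . && echo "taxonomy files OK"` reports on stderr whichever file is missing, which usually means the archive was not unzipped.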

After that, it is possible to build the database using:

# diamond version 0.9.29
zcat *.faa.gz | ./diamond makedb --taxonmap prot.accession2taxid.gz --taxonnames names.dmp --taxonnodes nodes.dmp -d refseq_protein_nonredund_diamond
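As a sanity check on what gets piped into `makedb`, one could count the FASTA records in the compressed files and compare that number with what Diamond reports for the finished database (for instance via `diamond dbinfo`, if your version provides it). A minimal sketch, with `count_sequences` as a made-up name:

```shell
# Hypothetical helper: count FASTA records across all *.faa.gz files in a directory.
count_sequences() {
    local dir="${1:-.}"
    # gzip -dc is the portable spelling of zcat; every record header starts with '>'
    gzip -dc "$dir"/*.faa.gz | grep -c '^>'
}
```

Running `count_sequences .` decompresses every file once, so on the full RefSeq set expect it to take a while.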

It takes about 34 minutes to finish, with a peak of ~26 GB of RAM, on a computer with 48 threads. The resulting index file is 54 GB.

To automate things a bit, one can use a Makefile:
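Given the size of the index, checking free disk space before starting the build can save a failed run. The sketch below uses the 54 GB figure from this run as a rough threshold; the function name `has_free_gb` is hypothetical, and the real requirement will change as RefSeq grows.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least NEED_GB free.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    # df -P guarantees one line per filesystem; column 4 is available space in KB (-k)
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

For example, `has_free_gb . 54 || echo "not enough space for the index" >&2` before running `diamond makedb`.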

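#Makefile content:

# Set NCBI_TAXON to the base URL of the NCBI taxonomy download directory.

.PHONY: all clear

all: prot.accession2taxid.gz \
	taxdmp.zip \
	faa.done \
	refseq_protein_nonredund_diamond

# Accession-to-taxid mapping, passed to --taxonmap
prot.accession2taxid.gz:
	wget -c $(NCBI_TAXON)/accession2taxid/$@

# Taxonomy dump; unzipping it yields names.dmp and nodes.dmp
taxdmp.zip:
	wget -c $(NCBI_TAXON)/$@ && \
	unzip $@

# Marker file so the FASTA download is not repeated
faa.done:
	wget -c*.faa.gz && \
	touch $@

# $(word 3, $^) is the third prerequisite, prot.accession2taxid.gz
refseq_protein_nonredund_diamond: faa.done taxdmp.zip prot.accession2taxid.gz
	zcat *.faa.gz | ./diamond makedb --taxonmap $(word 3, $^) --taxonnodes nodes.dmp --taxonnames names.dmp -d $@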
