Sorting Big Files – How to speed up sort?

I had an 11 GB text file with two columns that needed to be sorted by the first column. The question was: is there a way to make the sort faster?

To find out, I ran the following commands on an idle machine, one at a time:

time sort -k1,1 --parallel=1 homologues.tsv > homologues_sorted_parallel1.tsv
time sort -k1,1 --parallel=2 homologues.tsv > homologues_sorted_parallel2.tsv
time sort -k1,1 --parallel=3 homologues.tsv > homologues_sorted_parallel3.tsv
time sort -k1,1 --parallel=4 homologues.tsv > homologues_sorted_parallel4.tsv
time sort -k1,1 --parallel=5 homologues.tsv > homologues_sorted_parallel5.tsv
time sort -k1,1 --parallel=6 homologues.tsv > homologues_sorted_parallel6.tsv
time sort -k1,1 --parallel=7 homologues.tsv > homologues_sorted_parallel7.tsv
time sort -k1,1 --parallel=8 homologues.tsv > homologues_sorted_parallel8.tsv
time sort -k1,1 --parallel=9 homologues.tsv > homologues_sorted_parallel9.tsv
time sort -k1,1 --parallel=10 homologues.tsv > homologues_sorted_parallel10.tsv

time LC_ALL=C sort -k1,1 --parallel=1 homologues.tsv > homologues_sorted_lc_parallel1.tsv
time LC_ALL=C sort -k1,1 --parallel=2 homologues.tsv > homologues_sorted_lc_parallel2.tsv
time LC_ALL=C sort -k1,1 --parallel=3 homologues.tsv > homologues_sorted_lc_parallel3.tsv
time LC_ALL=C sort -k1,1 --parallel=4 homologues.tsv > homologues_sorted_lc_parallel4.tsv
time LC_ALL=C sort -k1,1 --parallel=5 homologues.tsv > homologues_sorted_lc_parallel5.tsv
time LC_ALL=C sort -k1,1 --parallel=6 homologues.tsv > homologues_sorted_lc_parallel6.tsv
time LC_ALL=C sort -k1,1 --parallel=7 homologues.tsv > homologues_sorted_lc_parallel7.tsv
time LC_ALL=C sort -k1,1 --parallel=8 homologues.tsv > homologues_sorted_lc_parallel8.tsv
time LC_ALL=C sort -k1,1 --parallel=9 homologues.tsv > homologues_sorted_lc_parallel9.tsv
time LC_ALL=C sort -k1,1 --parallel=10 homologues.tsv > homologues_sorted_lc_parallel10.tsv
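
The twenty commands above can also be written as a loop. The following is only a sketch, not what I actually ran: bench_sort is a made-up helper name, it assumes GNU sort (which provides --parallel), and it measures whole wall-clock seconds with date +%s instead of time so that it also works in a plain sh.

```shell
# Hypothetical helper: bench_sort FILE [MAX_THREADS]
# Assumes GNU sort, since --parallel is a GNU coreutils option.
# Prints one tab-separated line per run: locale, threads, elapsed seconds.
bench_sort() (
    input=$1
    max_threads=${2:-10}
    base=${input%.tsv}
    n=1
    while [ "$n" -le "$max_threads" ]; do
        # Run with the current (default) locale
        start=$(date +%s)
        sort -k1,1 --parallel="$n" "$input" > "${base}_sorted_parallel${n}.tsv"
        end=$(date +%s)
        printf 'default\t%s\t%s\n' "$n" "$((end - start))"

        # Same sort, but with byte-wise comparison (LC_ALL=C)
        start=$(date +%s)
        LC_ALL=C sort -k1,1 --parallel="$n" "$input" > "${base}_sorted_lc_parallel${n}.tsv"
        end=$(date +%s)
        printf 'LC_ALL=C\t%s\t%s\n' "$n" "$((end - start))"

        n=$((n + 1))
    done
)
```

Something like bench_sort homologues.tsv 10 > timings.tsv would then collect all the timings in one small table, ready for plotting.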

The timings are summarized in the following plot:

So LC_ALL=C seems to be important 😀
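
The likely reason is that with LC_ALL=C sort compares raw bytes instead of applying the collation rules of the current locale. The price is that the resulting order can be different. A small sketch with made-up input (not taken from homologues.tsv) shows the effect:

```shell
# In the C locale, sort compares raw bytes, so uppercase ASCII
# letters come before lowercase ones.
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# prints A, B, a, b (one per line)

# With the current locale the order may differ, depending on
# which locale is set on the machine.
printf 'b\nA\na\nB\n' | sort
```

If the sorted file is later fed to tools that expect sorted input, such as join or comm, those should run under the same LC_ALL=C setting.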

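Also, beyond --parallel=8 there does not seem to be much improvement. That fits the GNU sort documentation, which says the default thread count is capped at 8 because of diminishing gains after that.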
