2bit – Storing DNA

General aspects

The UCSC Genome Browser help page states that the 2bit format is a

… highly efficient way to store genomic sequence.

In general it:

  • stores multiple fasta sequences;
  • stores masking information;
  • is compact;
  • is randomly-accessible;
  • can store up to 4 Gb.

A 2bit file contains multiple fields that are organized into:

  1. Header
  2. Index
  3. Sequence records

You can check the UCSC FAQ for the field definitions and the twoBit.h source code for the specific details behind them.
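To make the header and index layout more concrete, here is a minimal Python sketch that parses both from the raw bytes of a 2bit file. It follows the field order given in the UCSC FAQ (signature, version, sequence count, reserved, then one name/offset entry per sequence). The function name `read_header_and_index` and the tiny in-memory demo file are my own illustration, not part of KentUtils, and the sketch only covers version 0 files with 32-bit offsets.

```python
import struct

SIGNATURE = 0x1A412743  # magic number given in the UCSC format description


def read_header_and_index(data):
    """Parse the header and index from the raw bytes of a 2bit file.

    Returns (sequence_count, {name: offset}).
    """
    # Try little-endian first, then big-endian; the signature tells which.
    for endian in ("<", ">"):
        signature, version, seq_count, _reserved = struct.unpack_from(
            endian + "4I", data, 0
        )
        if signature == SIGNATURE:
            break
    else:
        raise ValueError("not a 2bit file: bad signature")
    if version != 0:
        raise ValueError("this sketch only handles version 0 (32-bit offsets)")

    index = {}
    pos = 16  # the header is four 32-bit fields
    for _ in range(seq_count):
        name_size = data[pos]  # one byte holding the name length
        name = data[pos + 1:pos + 1 + name_size].decode("ascii")
        pos += 1 + name_size
        (offset,) = struct.unpack_from(endian + "I", data, pos)
        pos += 4
        index[name] = offset
    return seq_count, index


# Hypothetical one-sequence file built in memory, just to exercise the parser.
demo = (
    struct.pack("<4I", SIGNATURE, 0, 1, 0)
    + bytes([4]) + b"chrI"
    + struct.pack("<I", 25)
)
print(read_header_and_index(demo))  # (1, {'chrI': 25})
```

With a real file you would pass the first part of the file instead, for example `open("ce11.2bit", "rb").read()`. The offsets in the index are what make random access possible: a reader can jump straight to one sequence record without scanning the others.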

The DNA sequences are kept in 2 bits per base using:

  • T – 00
  • C – 01
  • A – 10
  • G – 11

So, the sequence CCAGT will look like 0101101100.
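The mapping above can be sketched in a few lines of Python. This is only an illustration of the encoding, not the KentUtils implementation: the helper names `to_bit_string` and `pack` are made up, and the sketch assumes the sequence contains only A, C, G and T (the real format keeps runs of N in separate blocks). Packing four bases per byte with the first base in the high bits follows the UCSC format description.

```python
# Mapping taken from the list above: T=00, C=01, A=10, G=11.
CODES = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}


def to_bit_string(seq):
    """Return the 2-bit codes of seq as a string of 0s and 1s."""
    return "".join(format(CODES[base], "02b") for base in seq.upper())


def pack(seq):
    """Pack seq into bytes, four bases per byte, first base in the high bits."""
    packed = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Pad a short final chunk with zero bits on the right.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


print(to_bit_string("CCAGT"))  # 0101101100
print(pack("CCAGT").hex())     # 5b00
```

Five bases need two bytes here because the last byte is padded, so the saving relative to one byte per base only shows up on longer sequences.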

KentUtils – 2bit

Among the useful KentUtils tools, there are those dedicated to 2bit files:

faToTwoBit – Convert DNA from fasta to 2bit format
twoBitDup – check to see if a 2bit file has any identical sequences in it
twoBitInfo – get information about sequences in a 2bit file
twoBitMask – apply masking to a 2bit file, creating a new 2bit file
twoBitToFa – Convert all or part of 2bit file to fasta

Command line examples

faToTwoBit

faToTwoBit ce11.fa ce11.2bit

twoBitToFa

twoBitToFa ce11.2bit ce11.fa

twoBitToFa for a given interval only

twoBitToFa hg38.2bit:chr1:15000-15025 hg38_chr1_15000_15025.fa

#hg38_chr1_15000_15025.fa content:

#>chr1:15000-15025
#ATCCGACATCAAGTGCCCACCTTGG

twoBitToFa with -noMask (discards all repeat/masking information):

## -noMask - convert all to uppercase

twoBitToFa -noMask ce11.2bit ce11_nomask.fa
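According to the UCSC format description, masking is stored separately from the bases, as lists of block starts and block sizes, which is why it can be dropped on output. The Python sketch below shows the idea on a plain string. The function names `soft_mask_blocks` and `apply_mask` are hypothetical, and this is not how KentUtils implements it internally.

```python
import re


def soft_mask_blocks(seq):
    """Return (start, size) pairs for each run of lowercase bases, 0-based."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"[a-z]+", seq)]


def apply_mask(seq, blocks):
    """Lowercase the given blocks of an otherwise uppercase sequence."""
    chars = list(seq.upper())
    for start, size in blocks:
        chars[start:start + size] = [c.lower() for c in chars[start:start + size]]
    return "".join(chars)


seq = "ACgtACGTtt"
blocks = soft_mask_blocks(seq)
print(blocks)                           # [(2, 2), (8, 2)]
print(apply_mask(seq.upper(), blocks))  # ACgtACGTtt
print(seq.upper())                      # ACGTACGTTT, the -noMask view
```

Keeping the mask as a short list of intervals means the bases themselves still fit in 2 bits each, whatever the repeat content.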

twoBitInfo

# twoBitInfo

twoBitInfo ce11.2bit ce11.out
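The output of twoBitInfo is a tab-separated list with one sequence name and its size per line, which is easy to post-process. A small Python sketch, assuming that two-column layout; the function name `read_two_bit_info` and the example lines are made up for illustration and are not real ce11 values.

```python
def read_two_bit_info(lines):
    """Parse twoBitInfo output (name<TAB>size per line) into a dict."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, size = line.split("\t")[:2]
        sizes[name] = int(size)
    return sizes


# Made-up example lines, not the real ce11 sizes.
example = ["seqA\t1000\n", "seqB\t250\n"]
sizes = read_two_bit_info(example)
print(sizes)                # {'seqA': 1000, 'seqB': 250}
print(sum(sizes.values()))  # 1250
```

With a real file, pass the open handle instead of the list, for example `with open("ce11.out") as handle: sizes = read_two_bit_info(handle)`.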

To learn a bit more about 2bit files, I played with 91 genomes from the UCSC Genome Browser.

My computer's specifications:

CPU:

sudo lshw -class processor
*-cpu
description: CPU
product: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
vendor: Intel Corp.
physical id: 37
bus info: cpu@0
version: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
serial: To Be Filled By O.E.M.
slot: U3E1
size: 3115MHz
capacity: 3500MHz
width: 64 bits
clock: 100MHz
capabilities: ... configuration: cores=4 threads=8

RAM:

sudo lshw -short -C memory

H/W path Device Class Description
============================================================
/0/38/0 memory 16GiB SODIMM DDR4 Synchronous 2133 MHz (0.5 ns)
/0/38/1 memory 16GiB SODIMM DDR4 Synchronous 2133 MHz (0.5 ns)

Storage:

sudo lshw -class disk -class storage
*-nvme
description: Non-Volatile memory controller
product: NVMe Controller
vendor: Toshiba Corporation
physical id: 0
bus info: pci@0000:04:00.0
logical name: /dev/nvme0
version: 01
width: 64 bits
clock: 33MHz
capabilities: nvme pm msi pciexpress msix nvm_express bus_master cap_list
configuration: driver=nvme latency=0
resources: irq:16 memory:de400000-de403fff
*-disk
description: NVMe disk
product: THNSN51T02DU7 NVMe TOSHIBA 1024GB
physical id: 0
logical name: /dev/nvme0n1
version: 57DA4103
serial: 46JS103NT61V
size: 953GiB (1024GB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=2d829d72-0083-4967-9c98-094e1c369246 logicalsectorsize=512 sectorsize=512

Conversion times

As a simple exercise, I measured the time needed to convert from fasta to 2bit and from 2bit to fasta. For both directions, I timed the conversion of soft-masked sequences (repeats shown as lowercase letters) and unmasked sequences (all uppercase letters, no repeat information).

Fasta to 2bit:

[Figure: fasta to 2bit conversion times]

2bit to fasta:

[Figure: 2bit to fasta conversion times]

It seems that the repeat content, or masking profile, of the sequences plays a role in the time needed to convert to or from 2bit.
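For anyone who wants to repeat this kind of measurement, one possible way to time a conversion is to wrap the command in a small Python helper. This is only a sketch of one approach, not necessarily how the timings above were collected. The helper name `time_command` is made up, and a stand-in command is used so the example runs without KentUtils installed.

```python
import subprocess
import sys
import time


def time_command(args):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start


# Stand-in command so the sketch runs without KentUtils installed.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.3f} s")

# With KentUtils on the PATH, the calls would look like:
# time_command(["faToTwoBit", "ce11.fa", "ce11.2bit"])
# time_command(["twoBitToFa", "ce11.2bit", "ce11.fa"])
```

Wall-clock time includes disk reads and writes, so repeated runs on the same file may be faster once the file is in the operating system cache.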

Data compression ratio

Following the Wikipedia definition of data compression ratio, I computed the ratio of the uncompressed fasta size to the 2bit size.

[Figure: data compression ratio, fasta to 2bit]
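The ratio itself is simply the uncompressed size divided by the compressed size. A short Python sketch, with a made-up helper name `compression_ratio` and an idealized example in which every base takes one byte in fasta and two bits in 2bit:

```python
def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Data compression ratio: uncompressed size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return uncompressed_bytes / compressed_bytes


# Idealized case: 1,000,000 bases at one byte each versus two bits each.
bases = 1_000_000
print(compression_ratio(bases, bases * 2 // 8))  # 4.0
```

Real files will not hit exactly 4: fasta also spends bytes on header lines and line breaks, while 2bit adds its header, index, and the N and mask block lists. For files on disk, `os.path.getsize("ce11.fa")` and `os.path.getsize("ce11.2bit")` would supply the two sizes.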

Overall, 2bit does a good job of compressing fasta files while still allowing random access to the data, and the time needed for conversion is not a problem at all.

References

TwoBit Sequence Archives – https://genome.ucsc.edu/goldenpath/help/twoBit.html

.2bit format – http://genome.ucsc.edu/FAQ/FAQformat.html#format7
