Fetching Sequences using rest.ensembl.org

While looking through a recently published Housekeeping genes database, I became interested in fetching the transcript sequences for a list of genes of interest.

A snippet of one of the available lists (http://www.housekeeping.unicamp.br/Housekeeping_GenesHuman.csv):

...
ENST00000598742;RPS19;19;41859918;41872926
ENST00000600659;RPS28;19;8321500;8323340
ENST00000600880;RPL37A;2;216498850;216529454
ENST00000601048;SELENOW;19;47778572;47784681
ENST00000602318;LAMTOR5;1;110401253;110407708
ENST00000602402;DNAJC3;13;95677139;95794989
ENST00000602845;NCBP2-AS2;3;196942623;196943540
ENST00000602866;PSMC3;11;47418769;47426266
ENST00000605895;RER1;1;2391833;2405444
...
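
In the snippet above, the first semicolon-separated field of each line is the Ensembl transcript ID, which is all the REST API needs. A minimal sketch for pulling those IDs out of a local copy of the list; the function name extract_ids and the local file name are assumptions for illustration, and the grep is only there in case the file has a header row:

```shell
# extract_ids FILE: print the first ';'-separated field of every line
# that starts with an Ensembl transcript ID (ENST...). Lines that do not
# start with ENST (for example a possible header row) are skipped.
extract_ids() {
    grep '^ENST' "$1" | cut -d';' -f1
}

# Example, assuming the list was saved as Housekeeping_GenesHuman.csv:
#   extract_ids Housekeeping_GenesHuman.csv > ids.txt
```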


Ensembl offers a REST API that makes this whole process easier!

According to the Ensembl documentation, data can be fetched using any of:

  • Perl
  • Python2
  • Python3
  • Ruby
  • Java
  • R
  • Curl
  • Wget

Let's try one example using wget, taking this entry from the list:

ENST00000602845;NCBP2-AS2;3;196942623;196943540

wget -q --header='Content-type:text/plain' 'https://rest.ensembl.org/sequence/id/ENST00000602845?'  -O -

# the output:

GAAGACGAGGGCGGCGAGGTCGGGTTCCGGGCGCTTGGAGAAGATGGTGCTGCGGCGGCTGCTGGCCGCCCTGCTGCACAGCCCGCAGCTGGTGGAACGTCTGTCAGAGTCGCGGCCTATCCGACGTGCGGCGCAGCTCACGGCCTTCGCACTGCTGCAGGCCCAGCTGCGGGGCCAGGACGCGGCCCGCCGCCTGCAGGACCTCGCGGCTGGGCCCGTGGGCTCCCTGTGCCGCCGCGCTGAGCGATTTAGAGACGCCTTCACCCAGGAGCTACGCCGCGGCCTCCGAGGCCGCTCGGGGCCACCACCAGGTAGCCAGAGGGGCCCTGGCGCAAACATTTAATCCTGGGCTGTGCGGGGCCGAGGCCGCTTGCTTTTCCTTCCGGGCTCTACAGTGGCATCAATGTGGAGGGGTCATTCCGGGCACTGCGCGCGGCTTCGAATCCCGACTGGGATTGTTGGCCTGCAGACATCCCACGCATAAGAGCCTAGGCCAGACCGCCCGCTCCGTTGAAGTCTTGTGATTGGACAAGACACAGTGTGGAGACAGCCCTAAGCCTAACAGAGATGAAGGTAGGCTGGGTCCAGACACGGCACCTACGGAGAGCCACGGACCGAAGCCAGAGAGCCTTTCCTCTGCAAGTGGGACTGAAACTCTTGACAGATGCTGCTCAATCTGACTGGTATAGCAGGACAGTTAATTCCAGGGACGATATGGATGAAAAGACAACCCTACAGCTGCCAAATTCCTTTGATTAAATGTGTGAGCTGGTTGATAGGCATGAGTGTGATACTTCTCAGGCAAGATGTGTTAAGAATACCGGGGACTGTAGGCCTATGGTAATAATAAACACGTATTTTATGAAATGA

We can request FASTA output by setting the Content-type header to text/x-fasta:

wget -q --header='Content-type:text/x-fasta' 'https://rest.ensembl.org/sequence/id/ENST00000602845?format=fasta'  -O -

>ENST00000602845.2 chromosome:GRCh38:3:196942674:196943543:1
GAAGACGAGGGCGGCGAGGTCGGGTTCCGGGCGCTTGGAGAAGATGGTGCTGCGGCGGCT
GCTGGCCGCCCTGCTGCACAGCCCGCAGCTGGTGGAACGTCTGTCAGAGTCGCGGCCTAT
CCGACGTGCGGCGCAGCTCACGGCCTTCGCACTGCTGCAGGCCCAGCTGCGGGGCCAGGA
CGCGGCCCGCCGCCTGCAGGACCTCGCGGCTGGGCCCGTGGGCTCCCTGTGCCGCCGCGC
TGAGCGATTTAGAGACGCCTTCACCCAGGAGCTACGCCGCGGCCTCCGAGGCCGCTCGGG
GCCACCACCAGGTAGCCAGAGGGGCCCTGGCGCAAACATTTAATCCTGGGCTGTGCGGGG
CCGAGGCCGCTTGCTTTTCCTTCCGGGCTCTACAGTGGCATCAATGTGGAGGGGTCATTC
CGGGCACTGCGCGCGGCTTCGAATCCCGACTGGGATTGTTGGCCTGCAGACATCCCACGC
ATAAGAGCCTAGGCCAGACCGCCCGCTCCGTTGAAGTCTTGTGATTGGACAAGACACAGT
GTGGAGACAGCCCTAAGCCTAACAGAGATGAAGGTAGGCTGGGTCCAGACACGGCACCTA
CGGAGAGCCACGGACCGAAGCCAGAGAGCCTTTCCTCTGCAAGTGGGACTGAAACTCTTG
ACAGATGCTGCTCAATCTGACTGGTATAGCAGGACAGTTAATTCCAGGGACGATATGGAT
GAAAAGACAACCCTACAGCTGCCAAATTCCTTTGATTAAATGTGTGAGCTGGTTGATAGG
CATGAGTGTGATACTTCTCAGGCAAGATGTGTTAAGAATACCGGGGACTGTAGGCCTATG
GTAATAATAAACACGTATTTTATGAAATGA
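
Since the original goal was a whole list of transcripts, the single request above can be wrapped in a loop. This is only a sketch: fetch_one, fetch_all and the PAUSE variable are names made up for this example, the one-second pause is an arbitrary courtesy delay rather than a documented requirement, and the file names in the usage comment are assumptions.

```shell
# fetch_one ID: download one transcript as FASTA to stdout,
# using the same wget call as above.
fetch_one() {
    wget -q --header='Content-type:text/x-fasta' \
        "https://rest.ensembl.org/sequence/id/$1?format=fasta" -O -
}

# fetch_all: read one ID per line on stdin, skip empty lines, and print
# every FASTA record to stdout. Failed downloads are reported on stderr.
fetch_all() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue
        fetch_one "$id" || echo "failed: $id" >&2
        sleep "${PAUSE:-1}"   # pause between requests (seconds)
    done
}

# Example (needs network access), assuming a local copy of the list:
#   cut -d';' -f1 Housekeeping_GenesHuman.csv | fetch_all > housekeeping.fa
```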

The default is to fetch the genomic sequence. It's also possible to fetch the CDS, cDNA and protein sequences by setting the type parameter:

# CDS - spliced transcript sequence without UTR

wget -q --header='Content-type:text/x-fasta' 'https://rest.ensembl.org/sequence/id/ENST00000602845?format=fasta;type=cds'  -O -

>ENST00000602845.2
ATGGTGCTGCGGCGGCTGCTGGCCGCCCTGCTGCACAGCCCGCAGCTGGTGGAACGTCTG
TCAGAGTCGCGGCCTATCCGACGTGCGGCGCAGCTCACGGCCTTCGCACTGCTGCAGGCC
CAGCTGCGGGGCCAGGACGCGGCCCGCCGCCTGCAGGACCTCGCGGCTGGGCCCGTGGGC
TCCCTGTGCCGCCGCGCTGAGCGATTTAGAGACGCCTTCACCCAGGAGCTACGCCGCGGC
CTCCGAGGCCGCTCGGGGCCACCACCAGGTAGCCAGAGGGGCCCTGGCGCAAACATTTAA


# CDNA - spliced transcript sequence with UTR

wget -q --header='Content-type:text/x-fasta' 'https://rest.ensembl.org/sequence/id/ENST00000602845?format=fasta;type=cdna'  -O -

>ENST00000602845.2
GAAGACGAGGGCGGCGAGGTCGGGTTCCGGGCGCTTGGAGAAGATGGTGCTGCGGCGGCT
GCTGGCCGCCCTGCTGCACAGCCCGCAGCTGGTGGAACGTCTGTCAGAGTCGCGGCCTAT
CCGACGTGCGGCGCAGCTCACGGCCTTCGCACTGCTGCAGGCCCAGCTGCGGGGCCAGGA
CGCGGCCCGCCGCCTGCAGGACCTCGCGGCTGGGCCCGTGGGCTCCCTGTGCCGCCGCGC
TGAGCGATTTAGAGACGCCTTCACCCAGGAGCTACGCCGCGGCCTCCGAGGCCGCTCGGG
GCCACCACCAGGTAGCCAGAGGGGCCCTGGCGCAAACATTTAATCCTGGGCTGTGCGGGG
CCGAGGCCGCTTGCTTTTCCTTCCGGGCTCTACAGTGGCATCAATGTGGAGGGGTCATTC
CGGGCACTGCGCGCGGCTTCGAATCCCGACTGGGATTGTTGGCCTGCAGACATCCCACGC
ATAAGAGCCTAGGCCAGACCGCCCGCTCCGTTGAAGTCTTGTGATTGGACAAGACACAGT
GTGGAGACAGCCCTAAGCCTAACAGAGATGAAGGTAGGCTGGGTCCAGACACGGCACCTA
CGGAGAGCCACGGACCGAAGCCAGAGAGCCTTTCCTCTGCAAGTGGGACTGAAACTCTTG
ACAGATGCTGCTCAATCTGACTGGTATAGCAGGACAGTTAATTCCAGGGACGATATGGAT
GAAAAGACAACCCTACAGCTGCCAAATTCCTTTGATTAAATGTGTGAGCTGGTTGATAGG
CATGAGTGTGATACTTCTCAGGCAAGATGTGTTAAGAATACCGGGGACTGTAGGCCTATG
GTAATAATAAACACGTATTTTATGAAATGA


# Protein

wget -q --header='Content-type:text/x-fasta' 'https://rest.ensembl.org/sequence/id/ENST00000602845?format=fasta;type=protein'  -O -

>ENSP00000488305.1
MVLRRLLAALLHSPQLVERLSESRPIRRAAQLTAFALLQAQLRGQDAARRLQDLAAGPVG
SLCRRAERFRDAFTQELRRGLRGRSGPPPGSQRGPGANI
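
The requests above differ only in the value of the type parameter, so they can be folded into one small helper. A sketch, not part of any Ensembl tooling: ensembl_url and fetch_fasta are hypothetical names, and the URL layout simply mirrors the commands shown above.

```shell
# Base of the endpoint used in the examples above.
ENSEMBL_SEQ='https://rest.ensembl.org/sequence/id'

# ensembl_url ID [TYPE]: print the request URL.
# TYPE is one of genomic, cds, cdna or protein; it defaults to genomic.
ensembl_url() {
    printf '%s/%s?format=fasta;type=%s\n' "$ENSEMBL_SEQ" "$1" "${2:-genomic}"
}

# fetch_fasta ID [TYPE]: download the sequence as FASTA to stdout.
fetch_fasta() {
    wget -q --header='Content-type:text/x-fasta' "$(ensembl_url "$1" "$2")" -O -
}

# Example (needs network access):
#   fetch_fasta ENST00000602845 protein
```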

Using BLAT, one can check how the fetched sequences align to the human genome.

Another quite interesting feature is the possibility of extending the sequence at the 5' and 3' ends by a number of bases you define, using the expand_5prime and expand_3prime parameters (available only for the default type, genomic):

wget -q --header='Content-type:text/x-fasta' 'https://rest.ensembl.org/sequence/id/ENST00000602845?format=fasta;expand_5prime=1000'  -O -

>ENST00000602845.2 chromosome:GRCh38:3:196941674:196943543:1
CGCCCCGCCAACCCCCAACCGTATTCCATCCCTACCTCCGCTGATACTGAAAAGTCTACT
TTGAATATCTGCTGCCTCCCCTTTTTCCACAAAATATACTGTGATATAGTCCGGACTAAA
AATTTCCCTCTTCTGAAATCCAGCAATTCTCCGATCTTTACATCCGAACACCCTCAAAGT
GCCGAGCTTGGCGGGTCCACCTCCCCACTCCGAAGCTTCCCCGAGGGCGGAGTGAGGACT
CCACTTGTGTCTCCCACGCACCGCGTACAGCTTCCGTAACACCATCCTCCCAGAGAAGGG
GCCCGAATCGCCGAAGGGCACTGCTTCGCCGATTTAAAAAACAAAGCAAAAAGCCCCGCA
TCTGCATCAGGAAGGCGCCTCTGCCTACTCTGGGAGAGAGAAGGGCACCCCTCCCCCTTG
CTACGTAGTCGTCTGCGGAGGCACAACCGTGGAAACGGGAGCCGCCACCACCACCACCGC
TCAAACCTCTCGGCACTGGCTGGGGTACAGGGAGCGGCTGCGAGCGAATGGGATAAGCGA
GCCTCCAGTTCCCCGTCTTCCAGAGCAAGTGGCTTCAGTGATATCCAAGCGCCCTTCCAG
CACCCATTCCCTGCCTCGCCAGCAGGCACCGGGGCCCCACTTGGCGTTTGCGGATTTCAG
CTGAATGGGAGGCGACACAATGAGACAAGAGCAAACCGTTTCTCAGCGTTCTTGCCCAGG
GCCTTCCCGTCTCGCGGCCCGGCCTCCCTCACCCGGAAGTGCTGGTCCCGGTACTGGCTC
AGCTCCACGTAGGAGTCGCTGCGCAGCGCCTTCAGGAGGCCACCCGACATAGTGCAGAGA
AGCGGACCACAATGCGGCGACTCCCGGCACGAGGCTGCGTCCGCGATGGCGGAAGCGGAA
ACGCGCGGAGGCGAGCATCTCATTGGACCCAATCCGAGGGCGGCGTGTCGTCATCAAGCT
GCGCGGGGGCATAGACGTCCGGGTCGGGCGCCGCGGGGCGGAAGACGAGGGCGGCGAGGT
CGGGTTCCGGGCGCTTGGAGAAGATGGTGCTGCGGCGGCTGCTGGCCGCCCTGCTGCACA
GCCCGCAGCTGGTGGAACGTCTGTCAGAGTCGCGGCCTATCCGACGTGCGGCGCAGCTCA
CGGCCTTCGCACTGCTGCAGGCCCAGCTGCGGGGCCAGGACGCGGCCCGCCGCCTGCAGG
ACCTCGCGGCTGGGCCCGTGGGCTCCCTGTGCCGCCGCGCTGAGCGATTTAGAGACGCCT
TCACCCAGGAGCTACGCCGCGGCCTCCGAGGCCGCTCGGGGCCACCACCAGGTAGCCAGA
GGGGCCCTGGCGCAAACATTTAATCCTGGGCTGTGCGGGGCCGAGGCCGCTTGCTTTTCC
TTCCGGGCTCTACAGTGGCATCAATGTGGAGGGGTCATTCCGGGCACTGCGCGCGGCTTC
GAATCCCGACTGGGATTGTTGGCCTGCAGACATCCCACGCATAAGAGCCTAGGCCAGACC
GCCCGCTCCGTTGAAGTCTTGTGATTGGACAAGACACAGTGTGGAGACAGCCCTAAGCCT
AACAGAGATGAAGGTAGGCTGGGTCCAGACACGGCACCTACGGAGAGCCACGGACCGAAG
CCAGAGAGCCTTTCCTCTGCAAGTGGGACTGAAACTCTTGACAGATGCTGCTCAATCTGA
CTGGTATAGCAGGACAGTTAATTCCAGGGACGATATGGATGAAAAGACAACCCTACAGCT
GCCAAATTCCTTTGATTAAATGTGTGAGCTGGTTGATAGGCATGAGTGTGATACTTCTCA
GGCAAGATGTGTTAAGAATACCGGGGACTGTAGGCCTATGGTAATAATAAACACGTATTT
TATGAAATGA
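
A quick sanity check on the expansion is to count the bases returned. Going by the two FASTA headers shown above, the unexpanded request spans 196942674 to 196943543 (870 bases), and the expanded one starts 1000 bases earlier at 196941674 (1870 bases). The helper below is a sketch; fasta_length is a hypothetical name chosen for this example.

```shell
# fasta_length: read FASTA on stdin and print the number of sequence
# characters, ignoring header lines (those starting with '>').
fasta_length() {
    grep -v '^>' | tr -d '\r\n' | wc -c | tr -d ' '
}

# Example (needs network access); if the response matches the output
# above, this should report 1870:
#   wget -q --header='Content-type:text/x-fasta' \
#     'https://rest.ensembl.org/sequence/id/ENST00000602845?format=fasta;expand_5prime=1000' -O - \
#     | fasta_length
```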

And again, we can check the result using BLAT.

That’s it for now!
