UCSC Genome Browser – Docker solution for a self-hosted instance

The idea of being able to browse a genome using a great solution such as the UCSC Genome Browser seemed really interesting to me.

Fortunately, with Docker this takes only a few steps.

We will need two Docker containers: one for Apache and one for the MySQL database. The Apache container will retrieve data from the MySQL container and render the results.

To build the Apache and MySQL images, the following Dockerfiles (modified from https://github.com/icebert/docker_ucsc_genome_browser and https://github.com/icebert/docker_ucsc_genome_browser_db) are necessary.

Dockerfile – Apache

FROM ubuntu:19.10
LABEL Description="UCSC Genome Browser"

RUN apt-get update && \
	apt-get install -y \
	git \
	build-essential \
        apache2 \
	mysql-client-8.0 \
	mysql-client-core-8.0 \
        libpng-dev \
	libssl-dev \
	openssl \
	rsync \
	libmysqlclient-dev && \
        apt-get clean

ENV MACHTYPE x86_64
RUN mkdir -p ~/bin/${MACHTYPE}
RUN rm /var/www/html/index.html && mkdir /var/www/trash && \
    mkdir /usr/local/apache && ln -s /var/www/html /usr/local/apache/htdocs && \
    rsync -avzP rsync://hgdownload.cse.ucsc.edu/htdocs/ /var/www/html/

RUN mkdir /var/www/cgi-bin && \
    rsync -avP rsync://hgdownload.soe.ucsc.edu/cgi-bin/ /var/www/cgi-bin/

RUN { \
        echo 'db.host=gbdb'; \
        echo 'db.user=admin'; \
        echo 'db.password=admin'; \
        echo 'db.trackDb=trackDb'; \
        echo 'defaultGenome=Human'; \
        echo 'central.db=hgcentral'; \
        echo 'central.host=gbdb'; \
        echo 'central.user=admin'; \
        echo 'central.password=admin'; \
        echo 'central.domain='; \
        echo 'backupcentral.db=hgcentral'; \
        echo 'backupcentral.host=gbdb'; \
        echo 'backupcentral.user=admin'; \
        echo 'backupcentral.password=admin'; \
        echo 'backupcentral.domain='; \
    } > /var/www/cgi-bin/hg.conf


# Setup Housekeeping
RUN { echo '#!/bin/bash'; \
      echo 'find /var/www/trash/ \! \( -regex "/var/www/trash/ct/.*" \
      -or -regex "/var/www/trash/hgSs/.*" \) -type f -amin +5040 -exec rm -f {} \;'; \
      echo 'find /var/www/trash/    \( -regex "/var/www/trash/ct/.*" \
      -or -regex "/var/www/trash/hgSs/.*" \) -type f -amin +10080 -exec rm -f {} \;'; \
    } > /etc/cron.daily/genomebrowser

RUN chmod +x /etc/cron.daily/genomebrowser

RUN sed -i 's/<\/VirtualHost>//' /etc/apache2/sites-enabled/000-default.conf && \
    { \
        echo 'XBitHack on'; \
        echo ''; \
        echo '<Directory /var/www/html>'; \
        echo '    Options +Includes'; \
        echo '    SSILegacyExprParser on'; \
        echo '</Directory>'; \
        echo ''; \
        echo 'ScriptAlias /cgi-bin/ /var/www/cgi-bin/'; \
        echo '<Directory "/var/www/cgi-bin">'; \
        echo '    AllowOverride None'; \
        echo '    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch'; \
        echo '    SetHandler cgi-script'; \
        echo '    Require all granted'; \
        echo '</Directory>'; \
        echo ''; \
        echo '<Directory /var/www/html/trash>'; \
        echo '    Options MultiViews'; \
        echo '    AllowOverride None'; \
        echo '    Order allow,deny'; \
        echo '    Allow from all'; \
        echo '</Directory>'; \
        echo ''; \
        echo '</VirtualHost>'; \
    } >> /etc/apache2/sites-enabled/000-default.conf

RUN ln -s /etc/apache2/mods-available/include.load /etc/apache2/mods-enabled/ && \
    ln -s /etc/apache2/mods-available/cgi.load /etc/apache2/mods-enabled/

RUN mkdir -p /gbdb && chown -R www-data:www-data /var/www /gbdb


ENV APACHE_RUN_USER www-data
ENV APACHE_RUN_GROUP www-data
ENV APACHE_LOG_DIR /var/log/apache2

EXPOSE 80 443

CMD ["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]

Dockerfile – MySQL

FROM ubuntu:19.10
LABEL Description="UCSC Genome Browser"

RUN apt-get update && \
	apt-get install -y \
	git \
	build-essential \
    apache2 \
	mysql-client-8.0 \
	mysql-client-core-8.0 \
    libpng-dev \
	libssl-dev \
	openssl \
	rsync \
	libmysqlclient-dev && \
    apt-get clean

ENV MACHTYPE x86_64
RUN mkdir -p ~/bin/${MACHTYPE}
RUN rm /var/www/html/index.html && mkdir /var/www/trash && \
    mkdir /usr/local/apache && ln -s /var/www/html /usr/local/apache/htdocs && \
    rsync -avzP rsync://hgdownload.cse.ucsc.edu/htdocs/ /var/www/html/

RUN mkdir /var/www/cgi-bin && \
    rsync -avP rsync://hgdownload.soe.ucsc.edu/cgi-bin/ /var/www/cgi-bin/


RUN { \
        echo 'db.host=gbdb'; \
        echo 'db.user=admin'; \
        echo 'db.password=admin'; \
        echo 'db.trackDb=trackDb'; \
        echo 'defaultGenome=Human'; \
        echo 'central.db=hgcentral'; \
        echo 'central.host=gbdb'; \
        echo 'central.user=admin'; \
        echo 'central.password=admin'; \
        echo 'central.domain='; \
        echo 'backupcentral.db=hgcentral'; \
        echo 'backupcentral.host=gbdb'; \
        echo 'backupcentral.user=admin'; \
        echo 'backupcentral.password=admin'; \
        echo 'backupcentral.domain='; \
    } > /var/www/cgi-bin/hg.conf


RUN { \
        echo '#!/bin/bash'; \
        echo 'find /var/www/trash/ \! \( -regex "/var/www/trash/ct/.*" \
              -or -regex "/var/www/trash/hgSs/.*" \) -type f -amin +5040 -exec rm -f {} \;'; \
        echo 'find /var/www/trash/    \( -regex "/var/www/trash/ct/.*" \
              -or -regex "/var/www/trash/hgSs/.*" \) -type f -amin +10080 -exec rm -f {} \;'; \
    } > /etc/cron.daily/genomebrowser

RUN chmod +x /etc/cron.daily/genomebrowser

RUN sed -i 's/<\/VirtualHost>//' /etc/apache2/sites-enabled/000-default.conf && \
    { \
        echo 'XBitHack on'; \
        echo ''; \
        echo '<Directory /var/www/html>'; \
        echo '    Options +Includes'; \
        echo '    SSILegacyExprParser on'; \
        echo '</Directory>'; \
        echo ''; \
        echo 'ScriptAlias /cgi-bin/ /var/www/cgi-bin/'; \
        echo '<Directory "/var/www/cgi-bin">'; \
        echo '    AllowOverride None'; \
        echo '    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch'; \
        echo '    SetHandler cgi-script'; \
        echo '    Require all granted'; \
        echo '</Directory>'; \
        echo ''; \
        echo '<Directory /var/www/html/trash>'; \
        echo '    Options MultiViews'; \
        echo '    AllowOverride None'; \
        echo '    Order allow,deny'; \
        echo '    Allow from all'; \
        echo '</Directory>'; \
        echo ''; \
        echo '</VirtualHost>'; \
    } >> /etc/apache2/sites-enabled/000-default.conf

RUN ln -s /etc/apache2/mods-available/include.load /etc/apache2/mods-enabled/ && \
    ln -s /etc/apache2/mods-available/cgi.load /etc/apache2/mods-enabled/


RUN mkdir -p /gbdb

RUN chown -R www-data.www-data /var/www /gbdb

ENV APACHE_RUN_USER www-data
ENV APACHE_RUN_GROUP www-data
ENV APACHE_LOG_DIR /var/log/apache2

EXPOSE 80 443

CMD ["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]

One could structure both Dockerfiles like:

build/
├── apache
│   └── Dockerfile
└── mysql
    └── Dockerfile

Building Docker Images

To build the Docker images:

# at build/apache
docker build -t foo/ucsc_apache .
# about 7 minutes to finish and 5.33GB in size

Then, inside build/mysql:

# at build/mysql
docker build -t foo/ucsc_db .
# about 2 minutes to finish and 1.03GB in size

Checking Docker images:

docker images
REPOSITORY             TAG                 IMAGE ID            CREATED             SIZE
foo/ucsc_apache        latest              85c2451cb518        3 minutes ago       5.28GB
foo/ucsc_db            latest              ea107ce25fb3        28 minutes ago      1.03GB

Preparing the storage

There are two options for handling data with Docker: either you store it inside the image (i.e. the MySQL database and the genomes), or you store it outside the image and mount volumes containing the data. According to the Docker documentation, the second option is more reliable.

MySQL and Apache were already initialized during the image build, so the data must be copied from the containers into host directories that will be mounted as volumes later.

docker run -d --name gbdb -p 3306:3306 foo/ucsc_db

docker run -d --name apache --link gbdb:gbdb -p 8041:80 foo/ucsc_apache

Checking docker containers:

docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS                           NAMES
36543465a161        foo/ucsc_apache     "/usr/sbin/apache2ct…"   About a minute ago   Up About a minute   443/tcp, 0.0.0.0:8041->80/tcp   apache
e6de41fa43c9        foo/ucsc_db         "mysqld -u root"         2 minutes ago        Up 2 minutes        0.0.0.0:3306->3306/tcp          gbdb

Now let's create the data and gbdb directories by copying the data from the Docker containers to the host:

# at build dir
# my container id for apache = 36543465a161 - check yours
docker cp 36543465a161:/gbdb ./
# my container id for mysql = e6de41fa43c9 - check yours
docker cp e6de41fa43c9:/data ./

After that the tree should look like:

build/
├── apache
├── data
│   ├── hgcentral
│   ├── hgFixed
│   ├── #innodb_temp
│   ├── mysql
│   ├── performance_schema
│   └── sys
├── gbdb
└── mysql
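Before mounting these directories as volumes, it can help to confirm that the copy really produced the layout above. This is only a minimal sketch: check_tree is a hypothetical helper of mine (not part of Docker or kentutils), and the list of directories it looks for is taken from the tree shown above.

```shell
#!/bin/bash
# check_tree: hypothetical helper, verifies that the directories copied out of
# the containers exist under the given build directory.
check_tree() {
    local root=$1 missing=0 d
    # directory names taken from the tree listing above
    for d in data/hgcentral data/hgFixed data/mysql gbdb; do
        if [ ! -d "$root/$d" ]; then
            echo "missing: $root/$d" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

From the build directory, something like `check_tree . && echo "tree looks complete"` should print the message only when all four directories are present.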

Test Run

It is time to check whether everything has worked so far. For that, let's run the containers with the volumes mounted. Bear in mind that we do not have any genomes yet; we will add one after this test!

# stop running docker containers and remove them
docker stop apache gbdb
docker rm apache gbdb


docker run -d -v $(pwd)/data:/data --name gbdb -p 3306:3306 foo/ucsc_db
docker run -d -v $(pwd)/gbdb:/gbdb --name apache --link gbdb:gbdb -p 8041:80  foo/ucsc_apache

If everything went well, it should now be possible to open “http://0.0.0.0:8041” in the browser and see the Genome Browser being served.

For now, the containers are not needed:

docker stop apache gbdb
docker rm apache gbdb

Adding a genome

This part is a bit trickier. It is necessary to generate some files and also to add MySQL entries.

The Kent utilities (kentutils) are needed; the precompiled binaries can be downloaded with rsync:

# at build 
mkdir kentutils
cd kentutils
rsync -aP  rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/* ./

Here I will use an acoel genome. On 15 Mar 2019, a paper on the acoel genome was published in Science by Andrew R. Gehrke and collaborators. The data are available at http://srivastavalab.rc.fas.harvard.edu/ and there is a JBrowse instance where you can explore the data.

# at build directory
mkdir acoel
cd acoel
# genome
wget http://srivastavalab.rc.fas.harvard.edu/genome_files/hmi_genome.fa

Not always, but in general, the UCSC Genome Browser names assemblies using the first three letters of the genus name and the first three letters of the species name, followed by the version. For example, Equus caballus is referred to as equCab1.
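The naming rule is easy to sketch in shell. make_db_id below is a hypothetical helper of mine, not something shipped with kentutils, and it only covers the common case described above (UCSC does not always follow this pattern).

```shell
#!/bin/bash
# make_db_id GENUS SPECIES VERSION
# Hypothetical helper: builds an ID from the first three letters of the genus
# (lowercase), the first three letters of the species (first letter
# uppercase) and the version number.
make_db_id() {
    local genus=$1 species=$2 version=$3
    local g s
    g=$(printf '%s' "$genus" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    s=$(printf '%s' "$species" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    # uppercase only the first letter of the species part
    s="$(printf '%s' "$s" | cut -c1 | tr '[:lower:]' '[:upper:]')$(printf '%s' "$s" | cut -c2-3)"
    printf '%s%s%s\n' "$g" "$s" "$version"
}

make_db_id Equus caballus 1    # prints equCab1
```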

For the acoel Hofstenia miamia, the ID will be hofMia1. Now, UCSC needs a set of files to render a Hofstenia session:

2bit

# at build dir - create directory in gbdb
mkdir -p gbdb/hofMia1
cd gbdb/hofMia1
../../kentutils/faToTwoBit ../../acoel/hmi_genome.fa ./hofMia1.2bit

agp

# at build/gbdb/hofMia1
../../kentutils/hgFakeAgp -minContigGap=1 ../../acoel/hmi_genome.fa hofMia1.agp
# test file
../../kentutils/checkAgpAndFa hofMia1.agp hofMia1.2bit  > checkagp.out

chrom.sizes

# at build/gbdb/hofMia1
../../kentutils/twoBitInfo hofMia1.2bit stdout | sort -k2nr > chrom.sizes

bed/chromInfo/chromInfo.tab

# at build/gbdb/hofMia1
mkdir -p bed/chromInfo
awk '{printf "%s\t%d\t/gbdb/hofMia1/hofMia1.2bit\n", $1, $2}' chrom.sizes > bed/chromInfo/chromInfo.tab
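To see what this awk command produces, here is the same transformation on a toy chrom.sizes. The two scaffold names and sizes are made up for illustration; the real file comes from twoBitInfo as shown above.

```shell
#!/bin/bash
# Toy input: two invented scaffolds (name <TAB> size), written to a temp dir
tmp=$(mktemp -d)
printf 'scaffoldA\t1500\nscaffoldB\t900\n' > "$tmp/chrom.sizes"

# Same awk call as above: append the path of the 2bit file as a third column
awk '{printf "%s\t%d\t/gbdb/hofMia1/hofMia1.2bit\n", $1, $2}' \
    "$tmp/chrom.sizes" > "$tmp/chromInfo.tab"

cat "$tmp/chromInfo.tab"
# scaffoldA	1500	/gbdb/hofMia1/hofMia1.2bit
# scaffoldB	900	/gbdb/hofMia1/hofMia1.2bit

# Sanity check: every row must have exactly three tab-separated fields
awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad + 0 }' "$tmp/chromInfo.tab" \
    && echo "chromInfo.tab OK"
```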

bed/gc5Base/gc5Base.wib

# at build/gbdb/hofMia1
mkdir -p bed/gc5Base wib
../../kentutils/hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 hofMia1 hofMia1.2bit | ../../kentutils/wigEncode stdin bed/gc5Base/gc5Base.{wig,wib}
cp -R bed wib/

track_files/trackDb.ra

# at build/gbdb/hofMia1
mkdir track_files

Inside track_files put trackDb.ra:

# trackDb.ra file:
track gc5Base
shortLabel GC Percent
type wig 0 100
longLabel GC Percent in 5-Base Windows
visibility dense
group map
priority 20
colorR 250
colorG 130
colorB 0
altColorR 128
altColorG 128
altColorB 128
 
track gaps
shortLabel Gaps
longLabel gaps
group map
priority 149.1
visibility dense
type bed 3 .
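A quick way to double check which track stanzas a trackDb.ra file defines is to pull out the "track" lines. list_tracks is a hypothetical helper of mine; it only reads the file and knows nothing about how hgTrackDb itself parses it.

```shell
#!/bin/bash
# list_tracks FILE
# Hypothetical helper: prints the name given on each "track <name>" line.
list_tracks() {
    awk '$1 == "track" { print $2 }' "$1"
}
```

Run as `list_tracks track_files/trackDb.ra` from build/gbdb/hofMia1, it should print gc5Base and gaps for the file above.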

~/.hg.conf

This file is necessary to configure MySQL access and should look like this:

###########################################################
# MINIMAL Config file for the  UCSC Human Genome Browser 
#
# format is key=value, no spaces around the values or around the keys.
#
# For a documentation of all config options in hg.conf, see our example file at
# https://github.com/ucscGenomeBrowser/kent/blob/master/src/product/ex.hg.conf
# It includes many comments.
# 
# This hg.conf file is intended to be copied and placed into a user's
# ~/.hg.conf file so they can use the UCSC Genome Browser command line
# programs.

# Credentials to access the local mysql server
db.host=0.0.0.0
db.user=admin
db.password=admin
db.trackDb=hofMia1

# if your MySQL system is configured for a different socket connection,
# use the following variables to override the MySQL defaults:
#db.socket=/var/run/mysqld/mysqld.sock
db.port=3306

# The locations of the directory that holds file-based data
# (e.g. alignments, database images, indexed bigBed files etc)
# By default, this mirror can load missing files from the hgdownload server at UCSC
# To disable on-the-fly loading of files, comment out gbdbLoc2 and 
# the slow-db.* section below.
gbdbLoc1=/gbdb/
gbdbLoc2=http://hgdownload.soe.ucsc.edu/gbdb/

# central.host is the name of the host of the central MySQL
# database where stuff common to all versions of the genome
# and the user database is stored.
central.db=hgcentral
central.host=localhost
central.socket=/var/run/mysqld/mysqld.sock

# Be sure this user has UPDATE AND INSERT privs for hgcentral
central.user=readwrite
central.password=update
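Since the file format is key=value with no spaces around keys or values, a stray space is an easy mistake to make. The following is a minimal sketch of a check for that; check_hgconf is a hypothetical helper, and the pattern reflects only the rule stated in the comment at the top of the file, not the actual parser used by the Kent tools.

```shell
#!/bin/bash
# check_hgconf FILE
# Hypothetical helper: prints (with line numbers) every line that is neither
# blank, nor a comment, nor key=value without spaces around key or value.
# Returns 0 only when no such line exists.
check_hgconf() {
    ! grep -nEv \
        -e '^[[:space:]]*(#.*)?$' \
        -e '^[A-Za-z0-9_.]+=([^[:space:]](.*[^[:space:]])?)?$' \
        "$1"
}
```

For example, `check_hgconf ~/.hg.conf && echo "format OK"` prints the message only if every setting follows the key=value form; a line such as "db.user = admin" would be reported instead.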

MySQL entries

To add information into MySQL we need to start the docker container:

# at build dir
docker run -d -v $(pwd)/data:/data --name gbdb -p 3306:3306 foo/ucsc_db
docker run -d -v $(pwd)/gbdb:/gbdb --name apache --link gbdb:gbdb -p 8041:80  foo/ucsc_apache
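The MySQL server inside the container may need a few seconds before it accepts connections, so the first hgsql call can fail if it is run immediately. A small generic retry loop is one way around that. This is just a sketch: retry is a hypothetical helper of mine, and the probe command in the usage note is an assumption (I am assuming hgsql forwards -e to the mysql client).

```shell
#!/bin/bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, at most ATTEMPTS times,
# sleeping DELAY_SECONDS after each failed attempt.
retry() {
    local attempts=$1 delay=$2
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

Something like `retry 30 2 kentutils/hgsql -e 'select 1' hgcentral` would then wait up to about a minute for the database to come up before giving up.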

Now, use MySQL to create a hofMia1 database:

# at build dir
kentutils/hgsql
# this should give you a MySQL prompt
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| hgFixed            |
| hgcentral          |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
6 rows in set (0.00 sec)

mysql> create database hofMia1;
Query OK, 1 row affected (0.03 sec)

Add the trackDb.sql, hgFindSpec.sql and chromInfo.sql schemas to the hofMia1 database:

# at build dir
wget https://raw.githubusercontent.com/ucscGenomeBrowser/kent/master/src/hg/lib/trackDb.sql
wget https://raw.githubusercontent.com/ucscGenomeBrowser/kent/master/src/hg/lib/hgFindSpec.sql
wget https://raw.githubusercontent.com/ucscGenomeBrowser/kent/master/src/hg/lib/chromInfo.sql

sed -i 's/varchar(255)/varchar(200)/g' *.sql
kentutils/hgsql hofMia1 < hgFindSpec.sql
kentutils/hgsql hofMia1 < trackDb.sql
kentutils/hgsql hofMia1 < chromInfo.sql

Now, load the data into the newly created tables:

# at build dir
# add chromInfo
kentutils/hgLoadSqlTab hofMia1 chromInfo chromInfo.sql gbdb/hofMia1/bed/chromInfo/chromInfo.tab
# add agp
kentutils/hgGoldGapGl hofMia1 gbdb/hofMia1/hofMia1.agp
# add gc5Base
kentutils/hgLoadWiggle -pathPrefix=/gbdb/hofMia1/wib hofMia1 gc5Base gbdb/hofMia1/bed/gc5Base/gc5Base.wig

Let's go back to MySQL and create more entries:

# at build dir
kentutils/hgsql

mysql> use hgcentral;
Database changed

mysql> INSERT INTO defaultDb VALUES ("H. miamia", "hofMia1");
mysql> INSERT INTO genomeClade VALUES ("H. miamia", "worm", "1");
mysql> INSERT INTO dbDb VALUES ("hofMia1","Mar 2019","/gbdb/hofMia1","H. miamia","scaffold1_1:1-100",1,1,"H. miamia","Hofstenia miamia","/gbdb/hofMia1/html/description.html",1,0, "version 1.0", 442651);

Update the database based on track_files/trackDb.ra:

# at build/gbdb/hofMia1
../../kentutils/hgTrackDb . hofMia1 trackDb ../../trackDb.sql track_files

And now check “http://0.0.0.0:8041”. Select the genome in the top-left bar and type Hofstenia in “species search”: “Hofstenia miamia” should appear as an option in the combo box. Then, “GO”!

A rather empty browser is shown but that’s it for now!

In future posts tracks will be added and also a BLAT service will be configured.

References

Manual installation of the UCSC Genome Browser on a Unix server – https://genome.ucsc.edu/goldenpath/help/mirrorManual.html

2 thoughts on “UCSC Genome Browser – Docker solution for a self-hosted instance”

    1. Hi Gabriela, depending on which type of track you need, e.g. a bed file, use hgLoadBed dbname filename.bed. Then you need to add the proper track information to track_files/trackDb.ra and finally update your database.
