Parameters Optimization with Nextflow

Recently I used Nextflow to try different parameter values for a tool and found it worth sharing. We start by defining the parameters:

    param1 = [50,70,90]
    param2 = [1,2,3,4,5]
    param3 = [0.1,0.2,0.3]

Then we declare the input, output and commands to run:

    in_data = Channel.fromPath('test')

    process test {
        input:
        path indata from in_data
        each p1 from param1
        …
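To get a feeling for how many runs such a parameter sweep produces, here is a minimal Python sketch. It assumes the truncated process declares one `each` input per parameter list (only `each p1 from param1` is visible in the excerpt), in which case Nextflow runs the process once per combination of values. The name `combinations` is mine, not part of the original post.

```python
from itertools import product

# Parameter lists copied from the excerpt above.
param1 = [50, 70, 90]
param2 = [1, 2, 3, 4, 5]
param3 = [0.1, 0.2, 0.3]

# Assumption: one `each` input per list, so the process is executed
# once for every element of the Cartesian product of the three lists.
combinations = list(product(param1, param2, param3))

print(len(combinations))  # number of runs per input file
print(combinations[0])    # first combination that would be tried
```

With these lists that is 3 * 5 * 3 = 45 runs per input file, which is worth knowing before launching anything expensive.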

Fasterq-dump and Snakemake

A note on how to automate fetching public datasets from NCBI using the SRA Toolkit and Snakemake. Here we use a config file, a rule file and a conda environment file. First, the Snakefile:

    configfile: "include/rules/config.yaml"
    include: "include/rules/fasterqdump.rule"

    rule all:
        input:
            expand("01_raw/done__{srr}_dump", srr=config['srr'])

The config file (include/rules/config.yaml):

    srr:
      - SRR12345678

And the rule file (include/rules/fasterqdump.rule):

    rule prefetch:
        output:
            "01_raw/.prefetch/sra/{srr}.sra"
        …
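If `expand` in `rule all` looks opaque, the following Python sketch mimics what it does for the pattern above. `expand_sketch` is a hypothetical stand-in written for illustration, not Snakemake's implementation, and it only accepts lists of values.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Hypothetical stand-in for Snakemake's expand(): fill the pattern
    with every combination of the given wildcard values."""
    names = list(wildcards)
    value_lists = [wildcards[name] for name in names]
    return [
        pattern.format(**dict(zip(names, values)))
        for values in product(*value_lists)
    ]


# Same content as include/rules/config.yaml in the excerpt.
config = {"srr": ["SRR12345678"]}

targets = expand_sketch("01_raw/done__{srr}_dump", srr=config["srr"])
print(targets)
```

Adding more accessions under `srr:` in the config would simply add one target file per accession to `rule all`.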

InterProScan and Snakemake

Following up on a previous post, "InterProScan and Docker", here is a quick note on running InterProScan with Snakemake. The commented Snakefile:

    # Input FASTA files with protein sequences should be at lib/foobar.pep
    PEPS, = glob_wildcards("lib/{pep}.pep")

    # configfile path
    configfile: "include/rules/config.yaml"

    rule all:
        input:
            expand("02_interproscan/{pep}.tsv",pep=PEPS)

    # get/install interproscan
    rule install_interproscan:
        # https://interproscan-docs.readthedocs.io/en/latest/HowToDownload.html
        input:
        output:
            touch("02_interproscan/done__install_interproscan")
        params:
            "temp/"
        …
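The first lines of that Snakefile discover the inputs and derive the targets from them. A rough Python sketch of the same idea is below. `glob_peps` is a hypothetical helper of mine that only looks at the top level of the directory, so it is a simplification of what `glob_wildcards` matches, and the file names are placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path


def glob_peps(lib_dir):
    """Hypothetical, simplified stand-in for glob_wildcards("lib/{pep}.pep"):
    return the {pep} part of every *.pep file directly inside lib_dir."""
    return sorted(path.stem for path in Path(lib_dir).glob("*.pep"))


with tempfile.TemporaryDirectory() as tmp:
    lib = Path(tmp) / "lib"
    lib.mkdir()
    # Placeholder inputs: two protein FASTA files and one unrelated file.
    for name in ("foobar.pep", "other.pep", "notes.txt"):
        (lib / name).touch()
    PEPS = glob_peps(lib)

# Equivalent of expand("02_interproscan/{pep}.tsv", pep=PEPS)
targets = [f"02_interproscan/{pep}.tsv" for pep in PEPS]
print(targets)
```

In the Snakefile, the trailing comma in `PEPS, =` is there because `glob_wildcards` returns one list per wildcard, and the comma unpacks the single list for `{pep}`.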

Download Project from BaseSpace

Quick note on how to download data from BaseSpace. On Linux, you cannot bulk-download files through the web interface. The alternative is to use the BaseSpace Sequence Hub CLI. First, fetch the "bs" application:

    wget "https://api.bintray.com/content/basespace/BaseSpaceCLI-EarlyAccess-BIN/latest/\$latest/amd64-linux/bs?bt_package=latest" -O bs

Fix permissions:

    chmod +x ./bs
    chmod 755 ./bs

Then, authenticate at BaseSpace:

    ./bs auth

This command …

Duplicity – Incremental backups of a single file

Keeping versions of a single file can be tricky, especially if each version of the file is big enough to cause trouble. I have tried duplicity and it seems to fit very well. On Debian 10, one can install it via:

    apt-get install duplicity -y
    duplicity --version
    duplicity 0.7.18.2

Now, let's try it:

    # create …

Creating bigPsl track for UCSC genome browser

The starting point is a psl file. You also need pslToBigPsl, bedToBigBed, bigPsl.as and a chrom.sizes file in your path. The Makefile to automate it is:

    SHELL=/bin/bash
    .ONESHELL:
    PSL=$(wildcard *.psl)
    TXT:=$(addsuffix .txt, $(basename $(PSL)))
    BB:=$(addsuffix .bb,$(basename $(TXT)))

    .PHONY: all clear

    all: $(BB)

    %.txt: %.psl
        pslToBigPsl $< stdout | sort -k1,1 -k2,2n > $@
    …

Pandas : compare – Checking differences between DataFrames

When comparing DataFrames, compare is here to help. Imagine you have two different methods and you want to check the differences in their results by comparing tables.

    import pandas as pd
    import numpy as np

    # Let's create two dataframes
    df1 = pd.DataFrame(np.array([[101, 102, 103], [201, 202, 203], [301, 302, 303]]),
                       columns=['Value1', 'Value2', 'Value3'],
                       index=['A1',"A2","A3"])
    df1 …
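Since the excerpt stops before the second DataFrame appears, here is a small self-contained sketch of how `DataFrame.compare` behaves. `df1` is the one from the excerpt; `df2` and the changed value 999 are made up for illustration and are not from the original post. It assumes pandas 1.1 or newer, where `compare` was introduced.

```python
import numpy as np
import pandas as pd

# df1 as defined in the excerpt above.
df1 = pd.DataFrame(
    np.array([[101, 102, 103], [201, 202, 203], [301, 302, 303]]),
    columns=["Value1", "Value2", "Value3"],
    index=["A1", "A2", "A3"],
)

# Hypothetical second result table: identical except for one cell.
df2 = df1.copy()
df2.loc["A2", "Value2"] = 999

# compare() keeps only the rows and columns that differ and labels the
# two sides as "self" (df1) and "other" (df2).
diff = df1.compare(df2)
print(diff)
```

Only row `A2` and column `Value2` survive in `diff`, so a large pair of tables collapses to just the cells where the two methods disagree.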

Pandas : pipe – Tablewise function application

pipe is designed to help chain function calls on DataFrames and Series. As a showcase, let's grab the Ensembl genomes table and play with it. First, import the libraries:

    # import libraries
    import pandas as pd
    import numpy as np

    # show full columns
    pd.set_option('display.max_colwidth', None)

Now, get the Ensembl genomes table:

    # get ensembl genomes table
    colnames = …
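As the excerpt is cut before any `pipe` call is shown, here is a minimal sketch of the chaining idea on a toy table. The rows, the column names `species` and `division`, and the two helper functions are all made up for illustration; this is not the real Ensembl genomes table nor the code from the original post.

```python
import pandas as pd


def keep_division(df, division):
    """Hypothetical step: keep only the rows of one division."""
    return df[df["division"] == division]


def add_name_length(df):
    """Hypothetical step: add a column with the length of the species name."""
    return df.assign(name_length=df["species"].str.len())


# Toy stand-in for a genomes table (made-up rows).
genomes = pd.DataFrame(
    {
        "species": ["homo_sapiens", "mus_musculus", "zea_mays"],
        "division": ["EnsemblVertebrates", "EnsemblVertebrates", "EnsemblPlants"],
    }
)

# pipe passes the DataFrame as the first argument of each function,
# so the steps read top to bottom instead of inside out.
result = (
    genomes
    .pipe(keep_division, "EnsemblVertebrates")
    .pipe(add_name_length)
)
print(result)
```

The chained version is equivalent to `add_name_length(keep_division(genomes, "EnsemblVertebrates"))`, but it stays readable as more steps are added.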

Conda environment and projects

A very important aspect of reproducible bioinformatics is managing software, tools and environments properly. One interesting option for such a difficult task is Conda. As stated on Conda's website: "Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux." I have been using conda environments …