Minor (And Not So Minor) Annoyances in Bioinformatics. An Ongoing Saga

So today started out as a pretty frustrating morning due to a random (not really random) failure in some analysis pipelines on data I am working on for a collaborator. The analysis has already taken far longer than expected for various reasons, some of which are my fault and some of which aren’t. But given some of the issues that cropped up, I was inspired to post a bit of a vent about the things that end up annoying me as a researcher in bioinformatics. Some of these are specific to today and some aren’t, but I am putting them all here anyway. In some cases I may call out specific software. In all cases I appreciate the work put into the tool by its developers. Often it is a tool I use a lot. Sometimes it is a specific case that is symptomatic of larger issues in bioinformatics software design. After all, if I thought the tool was complete garbage I wouldn’t care enough to vent about it (unless it was really widely used and terrible, but that’s something for other posts). OK, so in no particular order:

NSHA NGS Analysis Pipeline

In brief, data were aligned to the human reference genome (GRCh37.75) with BWA[1] and processed using Picard[2], the GATK[3], and VT[4]. Variants were called using six different variant callers, each with different error profiles and performance characteristics (MuTect[5], FreeBayes[6], VarDict[7], Pindel[8], Platypus[9], and Scalpel[10]), and then combined into a unified call set using bcbio-ensemble[11]. Variants were annotated with snpEff[12] and VCFAnno[13] using a variety of data sources including dbSNP[14], 1000 Genomes[15], the Exome Sequencing Project’s EVS[16], Ensembl[17], ClinVar[18], and COSMIC[19].
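To give a feel for what combining several callers into one call set means, here is a minimal sketch in Python. It is not how bcbio-ensemble works internally. It is a simple vote-count filter that keeps a variant when at least `min_callers` callers report it. The function name `ensemble_calls`, the `Variant` tuple layout, and the threshold are all assumptions made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def ensemble_calls(
    calls_by_caller: Dict[str, Iterable[Variant]],
    min_callers: int = 2,
) -> List[Variant]:
    """Keep variants reported by at least `min_callers` distinct callers.

    This is an illustrative vote threshold, not the bcbio-ensemble algorithm.
    """
    support: Counter = Counter()
    for variants in calls_by_caller.values():
        # De-duplicate within a caller so each caller votes at most once.
        for variant in set(variants):
            support[variant] += 1
    return sorted(v for v, n in support.items() if n >= min_callers)


if __name__ == "__main__":
    # Toy data only; these are not real calls from any sample.
    toy_calls = {
        "mutect": [("1", 100, "A", "T"), ("2", 200, "G", "C")],
        "freebayes": [("1", 100, "A", "T")],
        "vardict": [("2", 200, "G", "C"), ("3", 300, "C", "CT")],
    }
    for variant in ensemble_calls(toy_calls, min_callers=2):
        print(variant)
```

Raising `min_callers` trades sensitivity for specificity, which is the basic tension any ensemble approach has to manage when the callers have different error profiles.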

Organization and Getting Things Done

Like many scientists I can be a bit ‘scatter-brained.’ Stereotypes are sometimes true, after all. My brain is almost always ‘on’, thinking about a million different things (or sometimes locked onto something specific), and I can easily lose track of things. I started really working on my organization, time management, and productivity skills a few years ago during my post-doc. It started out of necessity: I had a lot of standing meetings to go to because ours was a large, multi-group, Genome Canada funded project and I was the sole Bioinformatician. So I had my subset of exome sequencing projects that I was following all the way through, but I was also doing the initial data analysis for all projects before they got passed on to another post-doc or graduate student for further study. As the only Bioinformatician (even among PIs), I was also involved in a lot of the higher-level planning and meetings. Coupled with normal post-doc life, I really needed to start living by my calendar. I also needed to learn some work-life balance skills, such as not answering emails at 1am that could easily wait until morning. Later during my post-doc I was also involved with some friends in getting a start-up going. I’m not involved with that anymore, except in an occasional advisory capacity, but it definitely made organization even more important. I was doing a few hours in the offices every morning before heading to the lab, a few hours in the evening at home, and some random meetings either on Google Hangouts or in person on some days. My calendar became even more important, but so did things like time tracking and task tracking tools.

Ubuntu and Gnome

I have been using Ubuntu for a long time, and while I don’t hate the Unity desktop, I was growing increasingly disillusioned with it. I’ve also always been irritated by its resource usage. Especially with Compiz turned on, RAM usage is fairly substantial. My workstation has 16GB of RAM, so I’m not that concerned for general usage, but I also tend to do a lot of heavy computation and development testing on this system. When your processes use RAM in the GB range you want to keep as much free as possible so you don’t run into any issues. Further, I’m usually running a VirtualBox instance of Windows, because within the hospital only the managed desktops have access to Clinical Applications and the Shared drive. I don’t NEED to run this all of the time, as the most important thing (Outlook) also runs on my phone. But it is easier if it is running in the background as much as possible. I give it 4GB of RAM because otherwise it tends to run pretty slowly and I hate any sort of lag in my program response. Like I said, I can turn it off or reduce its RAM usage whenever I need to. And if I have Cassandra running for my testing database, it uses a good 4GB of RAM as well. Anyway, long story short, I found Unity uses a fair number of resources as well, so I decided to do some experimenting.

Building a Cluster Part 3

It’s been a while, and all of the cluster components are here and have been running through burn-in at the datacentre for a few weeks. We had some minor hiccups waiting on a power cable to come in for the Cisco 10G switch, because apparently they had to use a power connection just different enough from all other computer equipment that you need to buy theirs, and of course it was back-ordered.

Web Development Tools and Bioinformatics

Over the last few years I have been doing a lot of experimentation and development work (mostly unpublished) surrounding things like genomic pipelines and ways of managing and exploring genomic level data (focused on rare variant analysis in humans). While there are plenty of exceptional programs and tools out there for this (particularly pipelines), we bioinformaticians do like to tinker and re-invent the wheel a lot. Sometimes this is bad (I’ve been guilty of this in the past), and sometimes it isn’t. We all also tend to come at how to execute and configure things (again, particularly pipelines) in our own particular ways so sometimes even the best software can be a chore for us to use, because some early step just doesn’t seem right to us.

Bioinformatics Microservices Part I

Following on from my previous post about why I think moving towards microservice-type designs will be beneficial to bioinformatics, I want to discuss a project I am currently working on, how I envision the different microservices fitting together, and why I’m going that route.

Building a Cluster Part 2

Last time I discussed a bit about the needs we had identified and outlined for supporting next-generation sequencing in a clinical setting from a bioinformatics perspective. I focused a bit at the end on the storage solution we are using, which is based on the Storinator product. We have three of the 4U units in-house now, with some drives on the way. The 10 GbE switch is also now in the data centre, and other than that we are just waiting on our compute solution to be delivered. For our compute option we went with the Dell FX2 platform. The FX2 is an example of the recent trend of moving towards converged architectures to simplify operations and generally reduce costs. This platform comes in a variety of configurations and densities; we opted for the 4 node compute option with I/O aggregators to simplify our networking. The 4 compute nodes themselves actually communicate over the backplane of the FX2, meaning that communication between nodes doesn't need to go to the switch and back. That is definitely one of the main advantages of the aggregator over the ethernet pass-through module they offer, and the cost difference is pretty minimal. With 10 GbE between the storage nodes and compute, overall we should have a blazing fast cluster.

Building a Cluster Part 1

As part of my job as a Clinical Bioinformatician getting a Next-Generation Sequencing-based diagnostic test up and running, I am designing and building the small-scale computing cluster that we need to support this. Now, we don’t need a tremendous amount of computing power, since we are doing targeted sequencing on Illumina MiSeq benchtop sequencers. Of course, while the primary purpose of this computing resource is to support clinical diagnostic needs, by purchasing these sequencers in the hospital we will also be supporting research using the equipment, including research by other Faculty members who are new to NGS. As the sole Bioinformatician in the hospital, and one of the few working on human genomics at the University, I tend to collaborate on a lot of diverse projects. So the cluster needs to support that as well, with all clinical work receiving priority tasking.