So today started out as a pretty frustrating morning due to a random (not really random) failure in some analysis pipelines on some data I am working on for a collaborator. The analysis has already taken far longer than expected for various reasons, some of which are my fault, and some of which aren’t. But given some of the issues that cropped up, I was inspired to post a little bit of a vent about things that end up just annoying me as a researcher in bioinformatics. Some of these are specific to today, some aren’t, but I am putting them all here anyway. In some cases I may call out specific software. In all cases I appreciate the work put into the tool by its developers. Often it is a tool I use a lot. Sometimes it is a specific case that is symptomatic of larger issues in bioinformatics software design. After all, if I thought the tool was complete garbage I wouldn’t care enough to vent about it (unless it was really widely used and terrible, but that’s something for other posts). OK, so in no particular order:
In brief, data were aligned to the human reference genome (GRCh37.75) with BWA and processed using Picard, the GATK, and VT. Variants were called using six different variant callers, each with different error profiles and performance characteristics (MuTect, FreeBayes, VarDict, Pindel, Platypus, and Scalpel), and then combined into a unified call set using bcbio-ensemble. Variants were annotated with snpEff and VCFAnno from a variety of data sources including dbSNP, 1000 Genomes, the Exome Sequencing Project’s EVS, Ensembl, ClinVar, and COSMIC.
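To give a rough feel for what combining several callers into a unified call set means, here is a minimal sketch of a simple "keep it if enough callers agree" vote. It is only an illustration and not how bcbio-ensemble actually works: the function name `ensemble_calls`, the `min_callers` threshold, the `(chrom, pos, ref, alt)` key, and the toy variants are all assumptions made up for this example.

```python
from collections import defaultdict


def ensemble_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to a set of variants, where each
    variant is a (chrom, pos, ref, alt) tuple. Both the key format and the
    voting rule are illustrative assumptions, not the real tool's logic.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    # Return each retained variant with the sorted list of supporting callers.
    return {
        variant: sorted(callers)
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }


# Toy, made-up calls from three of the callers named above.
calls = {
    "MuTect": {("1", 12345, "A", "T"), ("2", 500, "G", "C")},
    "FreeBayes": {("1", 12345, "A", "T")},
    "VarDict": {("1", 12345, "A", "T"), ("3", 42, "C", "CA")},
}

unified = ensemble_calls(calls, min_callers=2)
for variant, callers in unified.items():
    print(variant, callers)
```

In this toy input only the variant seen by all three callers survives the vote. A real ensemble step also has to normalize variant representation first, since different callers can describe the same indel differently.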
Like many scientists I can be a bit ‘scatter-brained.’ Stereotypes are sometimes true, after all. My brain is almost always ‘on’, thinking about a million different things (or sometimes locked onto something specific), and I can easily lose track of things. I started really working on my organization, time management, and productivity skills a few years ago during my post-doc. It started out of necessity: I had a lot of standing meetings to go to for our project, because it was a large, multi-group, Genome Canada funded project and I was the sole Bioinformatician. So I had my subset of exome sequencing projects that I was following all the way through, but I was also doing the initial data analysis for all projects before they got passed on to another post-doc or graduate student for further study. As the only Bioinformatician (even among PIs), I was also involved in a lot of the higher-level planning and meetings. Coupled with normal post-doc life, I really needed to start living by my calendar. I also needed to start learning some work-life balance skills, like not answering emails at 1am that could easily wait for morning. Later during my post-doc I was also involved with some friends in getting a startup going. I’m not involved with that anymore, except in an occasional advisory capacity, but it definitely made organization even more important. I was doing a few hours in the offices every morning before heading to the lab, a few hours in the evening at home, and some random meetings either on Google Hangouts or in person on some days. My calendar became even more important, but so did things like time tracking and task tracking tools.
I have been using Ubuntu for a long time, and while I don’t hate the Unity desktop manager, I was growing increasingly disillusioned with it. I’ve also always been irritated by its resource usage. Especially with Compiz turned on, RAM usage is fairly substantial. My workstation has 16GB of RAM, so I’m not that concerned for general usage, but I also tend to do a lot of heavy computation and development testing on this system. When your processes use RAM in the GB range, you want to keep as much free as possible so you don’t run into any issues. Further, I’m usually running a VirtualBox instance of Windows, because within the hospital only the managed desktops can access the Clinical Applications and the Shared drive. I don’t NEED to run this all of the time, as the most important thing (Outlook) also runs on my phone. But it is easier if it is running in the background as much as possible. I give it 4GB of RAM because otherwise it tends to run pretty slowly, and I hate any sort of lag in my program response. Like I said, I can turn it off or reduce its RAM usage whenever I need to. And if I have Cassandra running for my testing database, it uses a good 4GB of RAM as well. Anyway, long story short, I found Unity uses a fair number of resources as well, so I decided to do some experimenting.
It’s been a while, and all of the cluster components are here and have been running through burn-in at the datacentre for a few weeks. We had some minor hiccups waiting on a power cable to come in for the Cisco 10G switch, because apparently they had to use a power connection just different enough from all other computer equipment that you need to buy theirs, and of course it was back-ordered.
Over the last few years I have been doing a lot of experimentation and development work (mostly unpublished) surrounding things like genomic pipelines and ways of managing and exploring genomic-level data (focused on rare variant analysis in humans). While there are plenty of exceptional programs and tools out there for this (particularly pipelines), we bioinformaticians do like to tinker and re-invent the wheel a lot. Sometimes this is bad (I’ve been guilty of this in the past), and sometimes it isn’t. We also all tend to come at how to execute and configure things (again, particularly pipelines) in our own particular ways, so sometimes even the best software can be a chore for us to use because some early step just doesn’t seem right to us.
Following on from my previous post about why I think moving towards microservices-type designs will be beneficial to bioinformatics, I want to discuss a project I am currently working on, how I envision the different microservices fitting together, and why I’m going that route.
Last time I discussed a bit about the needs we had identified and outlined for supporting next-generation sequencing in a clinical setting from a bioinformatics perspective. I focused a bit at the end on the solution for storage we are using from
As part of my job as a Clinical Bioinformatician getting a Next-Generation Sequencing-based Diagnostics test up and running, I am designing and building the small-scale computing cluster that we need to support this. Now, we don’t need a tremendous amount of computing power, since we are doing targeted sequencing on Illumina MiSeq benchtop sequencers. Of course, while the primary purpose of this computing resource is to support clinical diagnostic needs, by purchasing these sequencers in the hospital we will also be supporting research that uses the equipment and research by other Faculty members who are new to NGS. As the sole Bioinformatician in the hospital, and one of the few working on human genomics at the University, I tend to collaborate on a lot of diverse projects. So the cluster needs to support that as well, with all clinical work receiving priority tasking.