It’s a good week to start seriously using computers for science.
(For non-bio-y friends: all this paragraph is saying is that I’ve more or less settled into my new role doing science solely from a computer, but that the transition has been pretty challenging.) After nearly a month of writing perl scripts to parse increasingly convoluted files and reading about high-throughput sequence analysis software, I was given the reins to several transcriptome data sets on Monday. I spent Monday getting used to working on the Minnesota Supercomputing Institute’s servers and running FastQC on subsets of the data. Tuesday was going to be a big day: I planned to run FastQC on all of the files, begin looking through the output, and start pruning/filtering with the FASTX-Toolkit. This is probably a light schedule for serious bioinformatics people, but it was enough to get me worked up; I was pretty damn apprehensive Monday night!
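For the "run FastQC on all of the files" step, the approach I’ve been leaning toward is a dry-run loop: print the command for each file and eyeball the list before actually launching anything. A minimal sketch (the data/ directory, file names, and qc_reports output folder are all made up; --outdir is FastQC’s output flag):

```shell
# Dry run: print the FastQC command for every input file before
# committing to the real batch. Drop the leading "echo" once the
# list of commands looks right. (Paths here are hypothetical.)
for f in data/*.fastq.gz; do
    echo fastqc --outdir qc_reports "$f"
done
```

Cheap insurance when a typo in a glob could otherwise mean re-queuing a day of jobs.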
Luckily for me, I woke up on Tuesday morning to find a couple of highly relevant (and strangely comforting) papers in PLoS Biology:
1. Best Practices for Scientific Computing (http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001745). Though I’m not really writing my own software (not yet, anyway), there’s a lot of great stuff in this article. It has been very challenging to live by Best Practice 4 (“Don’t repeat yourself,” or DRY). So far, I’ve been writing my perl & shell scripts locally, uploading them to the MSI cluster, and running them there. If I run into an error, I’m never sure whether I should edit the script locally and re-upload it, or edit it on the server and download the updated copy. Augh!
2. A Field Guide to Genomics Research (http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001744). I don’t really know what kind of genomics researcher I will shape up to be (if any). For now, I guess “Servant” is the most appropriate, though Dr. Hirsch is a much better Master than the one portrayed in the paper.
Then, today, I came across the preprint of “10 Simple Rules for the Care and Feeding of Scientific Data” on arXiv (http://arxiv.org/pdf/1401.2134v1.pdf). Now that I’ve seen how intense the datasets and associated scripts can be, I appreciate everything in this paper! I am becoming increasingly convinced that I should start using Git (or even just Figshare?) to manage the scripts I am writing. Luckily, there’s a nice little tutorial to get me started on that, too!
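For anyone else in the same boat: getting a scripts folder under Git really is only a handful of commands. A minimal sketch (the directory, file name, and commit message are invented; the two git config lines are one-time setup before your first commit):

```shell
# Put a scripts folder under version control. Directory and file
# names are hypothetical; the config lines are one-time setup.
mkdir -p scripts_repo && cd scripts_repo
git init -q
git config user.name  "Your Name"
git config user.email "you@example.com"
echo 'echo "hello, cluster"' > run_qc.sh
git add run_qc.sh
git commit -q -m "First snapshot of cluster scripts"
git log --oneline
```

From there, every "edit, re-upload, which copy is current?" moment becomes a commit with a history you can actually read.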
Though my more-or-less solo adventures into the world of “supercomputing institutes” and processing hundreds of files that each contain 16,000,000 lines of text have been kind of bewildering, these sorts of papers are helping a lot. There’s nobody in the lab (other than Dr. Hirsch, whose schedule as a new professor doesn’t offer much wiggle room) whom I can turn to for help, so I really appreciate these general pieces of advice.
Now if only I can figure out a good system to manage the hundreds of shell scripts I have written this week to run jobs on the MSI servers, I’ll feel even better! I have no idea whether other people’s “scripts” folders look like mine did after half a day of data processing:
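One idea I’m toying with for taming the pile: instead of hand-writing each submission script, generate them all from a single template, so there’s one file to edit and a hundred consistent copies. A rough sketch (the PBS resource line, paths, and file names below are all invented for illustration, not MSI’s actual settings):

```shell
# Generate one PBS submission script per FASTQ file from a template,
# so every job script stays consistent. All paths, file names, and
# resource requests here are hypothetical.
mkdir -p jobs data
touch data/sample1.fastq data/sample2.fastq   # stand-in inputs
for f in data/*.fastq; do
    base=$(basename "$f" .fastq)
    cat > "jobs/${base}_fastqc.sh" <<EOF
#!/bin/bash
#PBS -l walltime=2:00:00,nodes=1:ppn=1
cd \$PBS_O_WORKDIR
fastqc --outdir qc_reports $f
EOF
done
ls jobs/
```

Fixing a bug then means editing the template and regenerating, rather than touching a hundred near-identical files by hand, which is Best Practice 4 again.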
And while I’m talking about PLoS love, I’ll also add that the work I am doing is follow-up work on the science my advisor published in this PLoS ONE paper.
Next time, I promise to write a less introspective, more interesting post 😉