Beginning to get comfortable as a computer guy. Also, PLoS rocks.

It’s a good week to start seriously using computers for science.

(For non-bio-y friends: all this paragraph is saying is that I’ve more or less settled into my new role of doing science solely on a computer, but that the transition has been pretty challenging.) After nearly a month of writing perl scripts to parse increasingly convoluted files and reading about high-throughput sequence analysis software, I was given the reins to several transcriptome data sets on Monday. I spent Monday getting used to working on the Minnesota Supercomputing Institute’s servers and running FastQC on subsets of the data. Tuesday was going to be a big day: I planned to run FastQC on all of the files, begin looking through the output, and start pruning/filtering with the FastX toolkit. This is probably a light schedule for serious bioinformatics people, but it was enough for me to get worked up. I was pretty damn apprehensive Monday night!

Luckily for me, I woke up on Tuesday morning to find a couple of highly relevant (and strangely comforting) papers in PLoS Biology:

1. Best Practices for Scientific Computing. Though I’m not really writing my own software (not for now, anyway), there’s a lot of great stuff in this article. It has been very challenging to live by Best Practice 4 (Don’t Repeat Yourself, or DRY). So far, I’ve been writing my perl & shell scripts locally, uploading them onto the MSI cluster, and running them there. If I run into an error, I’m never sure whether I should edit locally and reupload the script, or whether I should just edit it on the server and download the updated copy! Augh!

2. A Field Guide to Genomics Research. I don’t really know what kind of genomics researcher I will shape up to be (if any). For now, I guess “Servant” is the most appropriate, though Dr. Hirsch is a much better Master than the one portrayed in the paper.

Then, today, I came across the preprint of “10 Simple Rules for the Care and Feeding of Scientific Data” on arXiv. Now that I’ve seen how intense the datasets and associated scripts can be, I appreciate everything in this paper! I am becoming increasingly convinced that I should start using Git (or even just Figshare?) to manage the scripts I am writing. Luckily there’s a nice little tutorial to begin on that too!

Though my more-or-less solo adventures into the world of “supercomputing institutes” and processing hundreds of files that contain 16,000,000 lines of text each have been kind of bewildering, these sorts of papers are helping a lot. There’s nobody in the lab (other than Dr. Hirsch, whose schedule as a new professor doesn’t offer much wiggle room) whom I can turn to for help, so I really appreciate these general pieces of advice.
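To put that line count in perspective, here is a small sketch of how it translates into reads. It assumes the files are standard unwrapped FASTQ, where each read takes exactly four lines (header, sequence, separator, qualities); the function name `count_fastq_reads` is just something made up for illustration.

```python
def count_fastq_reads(lines):
    """Count reads in an iterable of FASTQ lines.

    Assumes unwrapped FASTQ: exactly four lines per read.
    """
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of four suggests a
        # truncated or wrapped file, so refuse to guess.
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# The same arithmetic for a 16,000,000-line file:
reads_per_file = 16_000_000 // 4  # 4,000,000 reads
```

Passing an open file handle (for example `count_fastq_reads(open(path))`, with `path` being whichever file is of interest) streams the file line by line, so nothing that size ever has to sit in memory.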

Now if only I can figure out a good system to manage the 100s of shell scripts I have written over this week to run jobs on the MSI servers, I’ll feel even better! I have no idea if people’s “scripts” folder looks like mine did after half a day of data processing:

[Screenshot of my “scripts” folder, 2014-01-10]
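One way to keep a folder like that from exploding, in the spirit of Don’t Repeat Yourself, is to keep a single template and generate the per-file job scripts from it. The sketch below is only an illustration under a few assumptions: the `#PBS -N` line assumes a PBS-style scheduler, the directory names `jobs` and `fastqc_out` are placeholders, and the helper names (`sample_name`, `render_job`, `write_jobs`) are hypothetical. The only FastQC option used is `-o`, which sets the output directory.

```python
from pathlib import Path
from string import Template

# Hypothetical job template. The "#PBS -N" line (job name) assumes a
# PBS-style scheduler; adjust or remove it for other systems.
JOB_TEMPLATE = Template("""#!/bin/bash
#PBS -N fastqc_${sample}
mkdir -p ${outdir}
fastqc ${fastq} -o ${outdir}
""")


def sample_name(fastq):
    """Strip the directory and a FASTQ-style extension from a path."""
    name = Path(fastq).name
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def render_job(fastq, outdir="fastqc_out"):
    """Fill in the template for one input file."""
    return JOB_TEMPLATE.substitute(
        sample=sample_name(fastq), fastq=fastq, outdir=outdir
    )


def write_jobs(fastq_files, script_dir="jobs", outdir="fastqc_out"):
    """Write one job script per input file and return the script paths."""
    script_dir = Path(script_dir)
    script_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for fastq in fastq_files:
        path = script_dir / f"fastqc_{sample_name(fastq)}.sh"
        path.write_text(render_job(str(fastq), outdir))
        paths.append(path)
    return paths
```

With something like this, only the generator script needs to live under version control; the generated job scripts can be deleted and recreated at will, for example with `write_jobs(sorted(Path("data").glob("*.fastq")))`, where `data` is a placeholder directory.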

And while I’m talking about PLoS love, I’ll also add that the work I am doing is follow-up work on the science published by my advisor in this PLoS One paper.

Next time, I promise to write a less introspective, more interesting post 😉

