The data science hype is real. Glassdoor named data scientist the best job in America four years in a row, and it was nudged out of the top spot only this year. Data science is transforming medicine, healthcare, finance, business, nonprofits, and government. MIT is spending a billion dollars on a college dedicated solely to AI. An entire education industry has sprouted to train new data scientists as fast as possible to fill the burgeoning demand, and for good reason: with 90 percent of the world’s data generated in the last two years, we’re in dire need of people who understand how to find patterns in that pile of numbers.
My first programming language was R. I fell in love with the nuance R granted for visualizing data, and how with a little practice it was straightforward to pull off complex statistical analyses. I coded in R throughout my Ph.D., but I needed to switch to Python for my first non-academic job. Picking up a second language went much faster than the first, but there was a lot to get used to when I transitioned.
It’s often tempting to make multiple t-test comparisons when running analyses with multiple groups. If you have three groups, this logic would look like “I’ll run a t-test to see if Group A is significantly different from Group B, then another to check if Group A is significantly different from Group C, then one more for whether Group B is different from Group C.” This logic, while seemingly intuitive, is seriously flawed. I’ll use an R function I wrote, false_pos, to help visualize why multiple t-tests can lead to highly inflated false positive rates.
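To make the inflation concrete, here is a rough back-of-the-envelope sketch in Python (not the false_pos function mentioned above, which is not shown here). The function names and the 0.05 threshold are illustrative assumptions. It also treats the tests as independent, which pairwise t-tests on shared groups are not, so read the numbers as an approximation rather than an exact rate.

```python
from math import comb


def n_pairwise_comparisons(n_groups):
    """Number of distinct pairs of groups (A-B, A-C, B-C, ...)."""
    return comb(n_groups, 2)


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests tests.

    Assumes the tests are independent and each uses threshold alpha.
    """
    return 1 - (1 - alpha) ** n_tests


# How quickly the false positive rate grows with the number of groups.
for n_groups in (2, 3, 5, 10):
    n_tests = n_pairwise_comparisons(n_groups)
    rate = familywise_error_rate(n_tests)
    print(f"{n_groups} groups -> {n_tests} t-tests -> {rate:.3f}")
```

With three groups and three t-tests, the approximate chance of at least one false positive is already about 14 percent rather than the nominal 5 percent.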
After hearing so much about Andrew Ng’s famed Machine Learning Coursera course, I started taking the course and loved it. (His demeanor can make any topic sound reassuringly simple!) Early in the course, Ng covers linear regression via gradient descent. In other words, given a series of points, how can we find the line that best represents those points? And to take it a step further, how can we do that with machine learning?
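As a minimal sketch of the idea, assuming a mean squared error cost and a fixed learning rate (the function name, parameter values, and toy data below are made up for illustration and are not taken from the course), gradient descent for a best-fit line could look like this in plain Python:

```python
def fit_line(xs, ys, learning_rate=0.05, n_iters=2000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(n_iters):
        # Prediction error for every point under the current line.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        # Step downhill, scaled by the learning rate.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

The learning rate matters: if it is too large for the scale of the data, the updates overshoot and diverge, and if it is too small, the fit needs many more iterations.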
I love data visualization, and one holiday my partner surprised me with the book Dear Data. The book is a series of weekly letters two data analysts wrote to one another with visualizations of data on random topics. One week they tracked the number of times they said “thank you,” for example; another week, they counted the number of times they looked at a clock. In their letters, they visualized their data. One of the most interesting parts of the book was seeing how differently they could plot the same type of data.
Welcome to another episode of “Random R,” where we’ll ask random programming and statistical questions and answer them with R. Today, for whatever reason, let’s say we want to dive into methods for comparing values. We’ll start simple (e.g. is 5 greater than 4? Read on to find out.) and then work our way towards trickier element-wise comparisons among multiple matrices.
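For readers coming from Python rather than R, here is a small sketch of what element-wise comparison means, using nested lists as stand-in matrices. The helper names and example values are hypothetical, and the sketch assumes all matrices have the same shape.

```python
def elementwise_greater(a, b):
    """Compare two same-shaped matrices cell by cell: is a[i][j] > b[i][j]?"""
    return [[x > y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def elementwise_max(*matrices):
    """Largest value at each position across any number of same-shaped matrices."""
    return [[max(cells) for cells in zip(*rows)] for rows in zip(*matrices)]


m1 = [[1, 5], [3, 2]]
m2 = [[2, 4], [3, 1]]
m3 = [[0, 9], [7, 2]]

print(elementwise_greater(m1, m2))
print(elementwise_max(m1, m2, m3))
```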
Writing this in September 2017 after the new first-years have arrived on campus, I realize it’s now been four years since I started the PhD. Back in 2013, I had just finished a year in Germany on a Fulbright scholarship, studying social and antipredator behavior in birds at the Max Planck Institute for Ornithology. The work I had helped with there was on its way to being published in Animal Behaviour, as was my undergraduate senior thesis work. I had just received an NSF-GRFP fellowship – a great vote of confidence from the federal government – and had spent the last weeks of summer traveling and enjoying the Behaviour conference in Newcastle, UK.
Welcome to the first Random R post, where we ask random programming questions and use R to figure them out. In this post we’ll look at the computational efficiency of for loops versus the
We’ve reached the point in learning R where we can now afford to focus on efficiency over “whatever works.” A prime example of this is the family of apply functions, which are powerful tools to quickly analyze data. The functions can find slices of data frames, matrices, or lists and rapidly perform calculations on them. These functions make it simple to perform analyses like finding the variance of all rows of a matrix, or calculating the mean of all individuals that meet conditions X, Y, and Z in your data, or to feed each element of a vector into a complex equation. With these functions in hand, you will have the tools to move beyond introductory knowledge of R and into more advanced analyses.
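The same "apply a function to every slice" pattern can be sketched outside R. The Python below is only a loose analogue of the apply family, not a reproduction of it, and the helper names, the stand-in equation, and the data are invented for illustration.

```python
from statistics import variance


def row_variances(matrix):
    """Sample variance of each row, one result per row."""
    return [variance(row) for row in matrix]


def apply_to_each(values, func):
    """Feed every element of a vector through func and collect the results."""
    return [func(v) for v in values]


matrix = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
]
print(row_variances(matrix))

# A made-up quadratic stands in for the "complex equation".
print(apply_to_each([1, 2, 3], lambda x: 3 * x ** 2 + 1))
```

Here statistics.variance computes the sample variance (dividing by n - 1), which is also what R's var reports.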
Year three. If an American PhD takes 5-6 years, then this is when you pass the halfway point. You’re now deep in the weeds. The big picture science that originally got you into this PhD gets harder to remember. The questions you set off to answer years ago really need to stand up to the second guesses that come when you start dedicating hundreds of hours to answering them. It gets hard not to hear those quiet voices asking if there’s a better way to do things: is academia the path of most happiness for me, what’s consulting or industry like, am I actually doing interesting work? And of course, the eternal question of investing in automation versus doing the mind-numbing manual work! (See xkcd for the right ratio of investment to payoff!)
If you’ve been following this series from the first post, you might recall this soapbox speech I gave:
With programming, you get rid of the comfortable structure of a friendly interface with buttons in favor of freedom. Your analyses are now limited by your imagination and knowledge of the R language, not what someone else thought was relevant for you.
When I was a teenager, I didn’t mind repetitive, mindless labor. To document my musical tastes, every few months I’d gather data from my iTunes library, manually counting the number of songs that I had listened to 0, 1, 2, 3, up to 6 times so I could create a histogram in Excel. I loved office work like stapling papers for hours, and pipetting was one of my favorite aspects of the evolutionary development lab I briefly worked in during college. (Side note: my favorite aspect was feeding the opossums, which made me realize I loved animal behavior.)
For me, coding redefines possibility. With the widespread availability of cheap computational power and online resources for learning how to code, it is easier than ever to pick up a language and start learning from data. A key step early in this journey to unlock insights is data visualization. Producing effective and interesting graphs can not only explain the data better - it can draw viewers in who wouldn’t otherwise be interested.
When I first encountered R in 2011, I was a junior in college. I had heard about it from other undergrads and from my TAs, and the conversations varied widely from loving R to hating it. One constant, though, was how powerful R was for data analysis and visualization. Ambitious, I tried downloading R to familiarize myself and learn its quirks.
The 2nd year of the PhD felt like being a teenager. You’re no longer new to graduate school, and you’re starting to feel the pressure of having something to show for your time here. Within your second year, you go from a PhD student interested in a topic to a PhD candidate who understands the topic well enough to convince others it’s important. This transition has felt a bit like growing up: a loss of naivety and the addition of responsibilities, but also a legitimization of who you are as a researcher. This blog post describes my experiences during this year and what I’ve learned from them.
Sometime within the first two years of a North American biology PhD, grad students take an exam that determines whether their research ideas hold water or whether they should leave. No pressure! This rite of passage is called the generals, qualifying, or preliminary exam (“generals,” “quals,” and “prelims”), and it’s analogous to a master’s defense in the European system. The specifics of the exam vary greatly between universities but tend to involve a written literature review and thesis proposal, sometimes a written exam, and a multiple-hour oral exam by the thesis committee. The committee, which consists of 3-5 professors who read your proposal, will ask you questions for around three hours and then decide whether you should stay.
The story behind Couzin et al. 2011: “Uninformed individuals promote democratic consensus in animal groups”
Couzin ID, Ioannou CC, Demirel G, Gross T, Torney CJ, Hartnett A, Conradt L, Levin SA, Leonard NE. 2011. Uninformed individuals promote democratic consensus in animal groups. Science. 334: 1578-1580.
The first year of the PhD is over… I guess! It doesn’t really feel like your second year until the new first-years arrive in September, and work hasn’t suddenly stopped with the end of the semester, unlike in college. If anything, this summer is when I’ll actually make any progress on experimental ideas I’ve been developing since I first e-mailed my advisor two years ago. But enough time has passed that I think I can share some reflections on my first year that can hopefully help someone else starting or considering starting a PhD in biology.
You are not sitting next to me right now as I type these thoughts. You’re most likely not in New Jersey, and you might not even be in the U.S. The fact that it’s even possible for you to be reading these words right now highlights the power of communicating ideas through writing. Effective communication is the difference between you growing bored and leaving halfway through this blog post to explore other parts of the internet, and you reaching the end (before moving on to explore the rest of the internet!).
In the sciences it’s easy to get in the mindset of “go to college, go to grad school, get a postdoc, be a professor” for your career. While this trajectory works, I want to talk about the crazy idea of breaking from the path for a year or two before you throw yourself into a PhD program. This applies to people applying to professional schools like medicine or law, as well!
[Note: Dr. Bergen has since completed his Ph.D. and is now a Strategy Insights & Planning Consultant at ZS.]
[Disclaimer: I received the GRFP during the 2012-13 application cycle, which might not reflect what the current GRFP is looking for. But hopefully the broader themes in this post still apply!]
As the one-year anniversary of my graduation approaches, I’ve had some time to reflect on my college experience and the ways it has - and hasn’t - prepared me for graduate-level research.
At this point in the year, with grad applications closing and the waiting process beginning (or continuing for some of us), this post might not seem all that relevant to the college seniors who have hopefully figured out how to apply to graduate schools. This post may seem early for juniors who are interested in grad school but figure they have time before they apply. Maybe the occasional freshman or sophomore who stumbles across this blog will think that grad school is so far in the distance it’s not even worth thinking about right now. However, the following advice is a general path for leveling up as a researcher and figuring out what about biology interests you, knowledge that will serve you well regardless of whether you pursue grad school.