When I started learning SQL, I found it hard to progress beyond the absolute basics. I loved DataCamp’s courses because I could just type the code directly into a console on the screen. But once the courses ended, how could I practice what I learned? And how could I continue improving, when all the tutorials I found just consisted of code snippets, without an underlying database I could query myself?
In our last post, we covered data structures, or the ways that programming languages store data in memory. We touched upon abstract data types, theoretical entities that are implemented via data structures. The concept of a “vehicle” can be viewed as an abstract data type, for example, with a “bike” being the data structure.
Imagine you build a wildly popular app that is quickly growing towards a million users. (Congrats!) While users love the app, they’re complaining that the app is becoming slower and slower, to the point that some users are starting to leave. You notice that the main bottleneck is how user info is retrieved during authentication: currently, your app searches through an unsorted list of Python dictionaries until it finds the requested user ID.
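To make the bottleneck concrete, here is a minimal sketch of the two lookup strategies. The record layout (an "id" and a "name" field) and the function names are assumptions for illustration, not the app's actual code. Scanning the list costs time proportional to the number of users, while a dictionary keyed by user ID finds a record in roughly constant time on average.

```python
from typing import Optional


def find_user_linear(users: list[dict], user_id: int) -> Optional[dict]:
    """Scan the unsorted list until the ID matches: O(n) per lookup."""
    for user in users:
        if user["id"] == user_id:
            return user
    return None


def build_index(users: list[dict]) -> dict[int, dict]:
    """Build a dict keyed by user ID once: O(n) up front, O(1) average per lookup."""
    return {user["id"]: user for user in users}


# Hypothetical records; the field names are assumptions for this sketch.
users = [
    {"id": 42, "name": "Ada"},
    {"id": 7, "name": "Grace"},
    {"id": 19, "name": "Alan"},
]

users_by_id = build_index(users)

slow_result = find_user_linear(users, 7)  # walks the list element by element
fast_result = users_by_id.get(7)          # hashes the key and jumps to the record
```

The index takes extra memory and has to be kept in sync as users are added or removed, which is the usual trade-off for the faster lookup.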
Predicting the future has always been a universal challenge, whether the decision is to plant crops now or next week, marry someone or remain single, sell a stock or hold, or go to college or play music full time. We will never be able to perfectly predict the future, but we can use tools from the statistical field of forecasting to better understand what lies ahead.
Cloud services firm Domo estimates that for every minute in 2020, WhatsApp users sent 41.7 million messages, Netflix streamed 404,000 hours of video, $240,000 changed hands on Venmo, and 69,000 people applied for jobs on LinkedIn. In that firehose of data are patterns those companies use to understand the present, predict the future, and ultimately stay alive in a hyper-competitive market.
When it comes to writing complex pipelines running in production, it’s critical to have a clear understanding of what each function does, and how its outputs affect downstream functions. But despite our best efforts to write modular, well-tested functions, bugs love hiding in the handoffs between functions, and they can be hard to catch even with end-to-end tests.
From ancient government, library, and medical records to present-day video and IoT streams, we have always needed ways to efficiently store and retrieve data. Yesterday’s filing cabinets have become today’s computer databases, with two major paradigms for how to best organize data: the relational (SQL) versus non-relational (NoSQL) approach.
“I could always do data science if academia doesn’t work out.” It’s a recurring thought many graduate students and postdocs experience, especially if their work involves hearty servings of programming and statistics, the core elements of data science. Data science can be a rewarding alternative to academia, and academics do have many qualities that make them attractive candidates for data science roles. However, there are also often large holes in academics’ skill sets that can deter them from being hired straight off the bat.
So far, we’ve covered the technical side to data science: statistics, analytics, and software engineering. But no matter how talented you are at crunching numbers and writing code, your effectiveness as a data scientist is limited if you chase questions that don’t actually help your company, or you can’t get anyone to incorporate the results of your analyses. Similarly, how do you stay motivated and relevant in a field that’s constantly evolving?
Welcome to the fourth post in our series on how to enter data science! So far, we’ve covered the range of data science roles, some inferential statistics fundamentals, and manipulating and analyzing data. This post will focus on software engineering concepts that are essential for data science.
Welcome to the third post in our series on how to enter data science! The first post covered how to navigate the broad diversity of data science roles in the industry, and the second was a deep dive on (some!) statistics essential to being an effective data scientist. In this post, we’ll cover skills you’ll need when manipulating and analyzing data. Get ready for lots of syntax highlighting!
In the last post, we defined the key elements of data science as 1) deriving insights from data and 2) communicating those insights to others. Despite the huge diversity in how these elements are expressed in actual data scientist roles, there is a core skill set that will serve you well no matter where you go. The remaining posts in this series will define and explore these skills in detail.
The data science hype is real. Glassdoor labeled data scientist as the best job in America four years in a row, nudged out of the top spot only this year. Data science is transforming medicine, healthcare, finance, business, nonprofits, and government. MIT is spending a billion dollars on a college dedicated solely to AI. An entire education industry has sprouted to train new data scientists as fast as possible to fill the burgeoning demand, and for good reason: when 90 percent of the world’s data was generated in the last two years, we’re in dire need of people who understand how to find patterns in that pile of numbers.
My first programming language was R. I fell in love with the nuance R granted for visualizing data, and how with a little practice it was straightforward to pull off complex statistical analyses. I coded in R throughout my Ph.D., but I needed to switch to Python for my first non-academic job. Picking up a second language went much faster than the first, but there was a lot to get used to when I transitioned.
It’s often tempting to make multiple t-test comparisons when running analyses with multiple groups. If you have three groups, this logic would look like “I’ll run a t-test to see if Group A is significantly different from Group B, then another to check if Group A is significantly different from Group C, then one more for whether Group B is different from Group C.” This logic, while seemingly intuitive, is seriously flawed. I’ll use an R function I wrote, false_pos, to help visualize why multiple t-tests can lead to highly inflated false positive rates.
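The arithmetic behind that inflation can be sketched in a few lines. This is not the false_pos function from the post, only a back-of-the-envelope calculation with hypothetical function names. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so treat the result as an approximation.

```python
def pairwise_comparisons(num_groups: int) -> int:
    """Number of distinct pairs among the groups: k * (k - 1) / 2."""
    return num_groups * (num_groups - 1) // 2


def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across the tests.

    Assumes independent tests, each run at significance level alpha.
    """
    return 1 - (1 - alpha) ** num_tests


num_tests = pairwise_comparisons(3)             # A vs. B, A vs. C, B vs. C
rate = familywise_error_rate(0.05, num_tests)   # 1 - 0.95 ** 3
```

With three groups and a 0.05 threshold per test, the chance of at least one false positive is about 14 percent rather than 5 percent, and it climbs quickly as groups are added.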
After hearing so much about Andrew Ng’s famed Machine Learning Coursera course, I started taking the course and loved it. (His demeanor can make any topic sound reassuringly simple!) Early in the course, Ng covers linear regression via gradient descent. In other words, given a series of points, how can we find the line that best represents those points? And to take it a step further, how can we do that with machine learning?
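As a rough illustration of the idea, here is a minimal gradient descent sketch for fitting a line in plain Python. It is not code from the course. The function name, learning rate, step count, and sample points are all assumptions, and the cost is plain mean squared error, which differs from some textbook formulations by a constant factor.

```python
def fit_line(xs, ys, learning_rate=0.05, num_steps=5000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(num_steps):
        # Prediction error for every point under the current line.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        # Step downhill, scaled by the learning rate.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

On these points the estimates settle close to a slope of 2 and an intercept of 1. A learning rate that is too large for the data makes the updates diverge, so it usually needs tuning.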
I love data visualization, and one holiday my partner surprised me with the book Dear Data. The book is a series of weekly letters two data analysts wrote to one another with visualizations of data on random topics. One week they tracked the number of times they said “thank you,” for example; another week, they counted the number of times they looked at a clock. In their letters, they visualized their data. One of the most interesting parts of the book was seeing how differently they could plot the same type of data.
Welcome to another episode of “Random R,” where we’ll ask random programming and statistical questions and answer them with R. Today, for whatever reason, let’s say we want to dive into methods for comparing values. We’ll start simple (e.g. is 5 greater than 4? Read on to find out.) and then work our way towards trickier element-wise comparisons among multiple matrices.
Writing this in September 2017 after the new first-years have arrived on campus, I realize it’s now been four years since I started the PhD. Back in 2013, I had just finished a year in Germany on a Fulbright scholarship, studying social and antipredator behavior in birds at the Max Planck Institute for Ornithology. The work I had helped with there was on its way to being published in Animal Behaviour, as was my undergraduate senior thesis work. I had just received an NSF-GRFP fellowship – a great vote of confidence from the federal government – and had spent the last weeks of summer traveling and enjoying the Behaviour conference in Newcastle, UK.
Welcome to the first Random R post, where we ask random programming questions and use R to figure them out. In this post we’ll look at the computational efficiency of for loops versus the
Learning R series
Year three. If an American PhD takes 5-6 years, then this is when you pass the halfway point. You’re now in the thick of the weeds. The big picture science that originally got you into this PhD gets harder to remember. The questions you set off to answer years ago really need to stand up to the second guesses that come from when you start dedicating hundreds of hours to answering them. It gets hard not to hear those quiet voices asking if there’s a better way to do things: is academia the path of most happiness for me, what’s consulting or industry like, am I actually doing interesting work? And of course, the eternal question of investing in automation versus doing the mind-numbing manual work! (See xkcd for the right ratio of investment to payoff!)
The 2nd year of the PhD felt like being a teenager. You’re no longer new to graduate school, and you’re starting to feel the pressure of having something to show for your time here. Within your second year, you go from a PhD student interested in a topic, to a PhD candidate who understands the topic well enough that they can convince others it’s important. This transition has felt a bit like growing up: a loss of naivety and the addition of responsibilities, but also a legitimization of who you are as a researcher. This blog post describes my experiences during this year and what I’ve learned from them.
Sometime within the first two years of a North American biology PhD, grad students take an exam that determines whether their research ideas hold water or whether they should leave. No pressure! This rite of passage is called the generals, qualifying, or preliminary exam (“generals,” “quals,” or “prelims”), and it’s analogous to a master’s defense in the European system. The specifics of the exam vary greatly between universities but tend to involve a written literature review and thesis proposal, sometimes a written exam, and a multiple-hour oral exam by the thesis committee. The committee, which consists of 3-5 professors who read your proposal, will ask you questions for around three hours and then decide whether you should stay.
The story behind Couzin et al. 2011: “Uninformed individuals promote democratic consensus in animal groups”
Couzin ID, Ioannou CC, Demirel G, Gross T, Torney CJ, Hartnett A, Conradt L, Levin SA, Leonard NE. 2011. Uninformed individuals promote democratic consensus in animal groups. Science. 334: 1578-1580.
The first year of the PhD is over… I guess! It doesn’t really feel like your second year until the new first-years arrive in September, and work hasn’t suddenly stopped with the end of the semester, unlike in college. If anything, this summer is when I’ll actually make any progress on experimental ideas I’ve been developing since I first e-mailed my advisor two years ago. But enough time has passed that I think I can share some reflections on my first year that can hopefully help someone else starting or considering starting a PhD in biology.
You are not sitting next to me right now as I type these thoughts. You’re most likely not in New Jersey, and you might not even be in the U.S. The fact that it’s even possible for you to be reading these words right now highlights the power of communicating ideas through writing. Effective communication is the difference between you growing bored and leaving halfway through this blog post to explore other parts of the internet, and you reaching the end (before moving on to explore the rest of the internet!).
In the sciences it’s easy to get in the mindset of “go to college, go to grad school, get a postdoc, be a professor” for your career. While this trajectory works, I want to talk about the crazy idea of breaking from the path for a year or two before you throw yourself into a PhD program. This applies to people applying to professional schools like medicine or law, as well!
[Note: Dr. Bergen has since completed his Ph.D. and is now a Strategy Insights & Planning Consultant at ZS.]
[Disclaimer: I received the GRFP during the 2012-13 application cycle, which might not reflect what the current GRFP is looking for. But hopefully the broader themes in this post still apply!]
As the one-year anniversary of my graduation approaches, I’ve had some time to reflect on my college experience and the ways it has - and hasn’t - prepared me for graduate-level research.
At this point in the year, with grad applications closing and the waiting process beginning (or continuing for some of us), this post might not seem all that relevant to the college seniors who have hopefully figured out how to apply to graduate schools. This post may seem early for juniors who are interested in grad school but figure they have time before they apply. Maybe the occasional freshman or sophomore who stumbles across this blog will think that grad school is so far in the distance it’s not even worth thinking about right now. However, the following advice is a general path for leveling up as a researcher and figuring out what about biology interests you, knowledge that will serve you well regardless of whether you pursue grad school.