Building a Random Forest by Hand in Python

January 28, 2024
#machine-learning

From drug discovery to species classification, credit scoring to cybersecurity and more, the random forest is a popular and powerful algorithm for modeling our complex world. Its versatility and predictive prowess would seem to require cutting-edge complexity, but if we dig into what a random forest actually is, we see a shockingly simple set of repeating steps.

A business lens on precision and recall

December 20, 2023
#machine-learning #statistics

Population under a magnifying glass

SQL Riddles to Test Your Wits

February 21, 2023
#sql

SQL is a deceptively simple language. Across its many dialects, users can query databases in a syntax similar to English. What you see is what you get… until you don’t.

AWS Essentials for Data Science
3. Compute

December 31, 2022
#aws #python

Screenshot from Christopher Hesse’s amazing Image-to-Image Demo

AWS Essentials for Data Science
2. Storage

August 20, 2022
#aws #sql #python

Do you store your music, videos, and personal files in a garage full of hard drives? My bet is… no. Unless you’ve avoided iCloud, Dropbox, and Google Drive the last fifteen years – and if you have, props to you! – then you’re likely using cloud storage. You can recover your texts if you lose your phone; you can share files with links instead of massive email attachments; you can organize and search your photos by who’s in them.

AWS Essentials for Data Science
1. Why Cloud Computing?

April 17, 2022
#aws #python

Imagine you’re a coordinator for data science meet & greets in New York. A major part of your planning involves reserving a venue to accommodate your guests. You’ve always rented venues around the city, but you wonder if it’d be better to just buy your own to avoid the hassle of searching for a free one every time.

Intermediate SQL

January 17, 2022
#sql

When I started learning SQL, I found it hard to progress beyond the absolute basics. I loved DataCamp’s courses because I could just type the code directly into a console on the screen. But once the courses ended, how could I practice what I learned? And how could I continue improving, when all the tutorials I found just consisted of code snippets, without an underlying database I could query myself?

Exploring stacks and queues

November 28, 2021
#computer-science #python

In our last post, we covered data structures, or the ways that programming languages store data in memory. We touched upon abstract data types, theoretical entities that are implemented via data structures. The concept of a “vehicle” can be viewed as an abstract data type, for example, with a “bike” being the data structure.

Intro to data structures

October 19, 2021
#computer-science #python

Imagine you build a wildly popular app that is quickly growing towards a million users. (Congrats!) While users love the app, they’re complaining that the app is becoming slower and slower, to the point that some users are starting to leave. You notice that the main bottleneck is how user info is retrieved during authentication: currently, your app searches through an unsorted list of Python dictionaries until it finds the requested user ID.
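
As a rough illustration of that bottleneck, here is a minimal sketch comparing a linear scan over a list of dictionaries with a lookup in a dictionary keyed by user ID. The `users` data, the field names, and both helper functions are hypothetical and not taken from the post; the dictionary index is one common alternative, not necessarily the one the post settles on.

```python
# Hypothetical user records; the "id" and "name" fields are assumptions.
users = [{"id": i, "name": f"user{i}"} for i in range(1000)]


def find_user_linear(users, user_id):
    """Scan the list until a matching ID is found: O(n) per lookup."""
    for user in users:
        if user["id"] == user_id:
            return user
    return None


# Build the index once; each later lookup is O(1) on average.
users_by_id = {user["id"]: user for user in users}


def find_user_indexed(users_by_id, user_id):
    """Look the user up directly by ID in the dictionary index."""
    return users_by_id.get(user_id)
```

Both functions return the same record for an existing ID and `None` for a missing one; the difference is only in how much work each lookup does as the user count grows.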

A deep dive on ARIMA models

August 11, 2021
#python #statistics

Predicting the future has forever been a universal challenge, whether the decision is to plant crops now or next week, marry someone or remain single, sell a stock or hold, or go to college or play music full time. We will never be able to perfectly predict the future, but we can use tools from the statistical field of forecasting to better understand what lies ahead.

A hands-on demo of analyzing big data with Spark

June 16, 2021
#python

Cloud services firm Domo estimates that for every minute in 2020, WhatsApp users sent 41.7 million messages, Netflix streamed 404,000 hours of video, $240,000 changed hands on Venmo, and 69,000 people applied for jobs on LinkedIn. In that firehose of data are patterns those companies use to understand the present, predict the future, and ultimately stay alive in a hyper-competitive market.

Fish schools as ensemble learning algorithms

June 3, 2021
#academia #machine-learning

Fish school. Photo by jean wimmerlin on Unsplash

Lessons from the first two data scientists at a startup

April 29, 2021
#careers #data-science

Two knights. Photo by Camerauthor Photosandstories on Unsplash

Efficient type validation for Python functions

April 18, 2021
#projects #python

When it comes to writing complex pipelines running in production, it’s critical to have a clear understanding of what each function does, and how its outputs affect downstream functions. But despite our best efforts to write modular, well-tested functions, bugs love hiding in the handoffs between functions, and they can be hard to catch even with end-to-end tests.

3 levels of technical abstraction when sharing your code

April 13, 2021
#computer-science

SQL vs. NoSQL databases in Python

April 5, 2021
#python #sql

From ancient government, library, and medical records to present-day video and IoT streams, we have always needed ways to efficiently store and retrieve data. Yesterday’s filing cabinets have become today’s computer databases, with two major paradigms for how to best organize data: the relational (SQL) versus non-relational (NoSQL) approach.

Building a full-stack spam catching app
3. Frontend & Deployment

March 21, 2021
#machine-learning #projects #python

Building a full-stack spam catching app
2. Backend

March 14, 2021
#machine-learning #projects #python

Building a full-stack spam catching app
1. Context

March 11, 2021
#machine-learning #projects #python

Transitioning to data science from academia

February 10, 2021
#academia #careers #data-science

“I could always do data science if academia doesn’t work out.” It’s a recurring thought many graduate students and postdocs experience, especially if their work involves hearty servings of programming and statistics, the core elements of data science. Data science can be a rewarding alternative to academia, and academics do have many qualities that make them attractive candidates for data science roles. However, there are also often large holes in academics’ skill sets that can deter them from being hired straight off the bat.

How to enter data science
5. The people

January 21, 2021
#careers #data-science

So far, we’ve covered the technical side to data science: statistics, analytics, and software engineering. But no matter how talented you are at crunching numbers and writing code, your effectiveness as a data scientist is limited if you chase questions that don’t actually help your company, or you can’t get anyone to incorporate the results of your analyses. Similarly, how do you stay motivated and relevant in a field that’s constantly evolving?

How to enter data science
4. The engineering

December 30, 2020
#careers #data-science #python

Welcome to the fourth post in our series on how to enter data science! So far, we’ve covered the range of data science roles, some inferential statistics fundamentals, and manipulating and analyzing data. This post will focus on software engineering concepts that are essential for data science.

How to enter data science
3. The analytics

December 15, 2020
#careers #data-science #machine-learning #python

Welcome to the third post in our series on how to enter data science! The first post covered how to navigate the broad diversity of data science roles in the industry, and the second was a deep dive on (some!) statistics essential to being an effective data scientist. In this post, we’ll cover skills you’ll need when manipulating and analyzing data. Get ready for lots of syntax highlighting!

How to enter data science
2. The statistics

December 5, 2020
#careers #data-science #statistics

In the last post, we defined the key elements of data science as 1) deriving insights from data and 2) communicating those insights to others. Despite the huge diversity in how these elements are expressed in actual data scientist roles, there is a core skill set that will serve you well no matter where you go. The remaining posts in this series will define and explore these skills in detail.

How to enter data science
1. The target

August 27, 2020
#careers #data-science

The data science hype is real. Glassdoor labeled data scientist as the best job in America four years in a row, nudged out of the top spot only this year. Data science is transforming medicine, healthcare, finance, business, nonprofits, and government. MIT is spending a billion dollars on a college dedicated solely to AI. An entire education industry has sprouted to train new data scientists as fast as possible to fill the burgeoning demand, and for good reason: when 90 percent of the world’s data was generated in the last two years, we’re in dire need of people who understand how to find patterns in that pile of numbers.

Perspectives on Python after R

July 24, 2020
#python #r

My first programming language was R. I fell in love with the nuance R granted for visualizing data, and how with a little practice it was straightforward to pull off complex statistical analyses. I coded in R throughout my Ph.D., but I needed to switch to Python for my first non-academic job. Picking up a second language went much faster than the first, but there was a lot to get used to when I transitioned.

Visualizing the danger of multiple t-test comparisons

May 13, 2018
#projects #r #statistics

It’s often tempting to make multiple t-test comparisons when running analyses with multiple groups. If you have three groups, this logic would look like “I’ll run a t-test to see if Group A is significantly different from Group B, then another to check if Group A is significantly different from Group C, then one more for whether Group B is different from Group C.” This logic, while seemingly intuitive, is seriously flawed. I’ll use an R function I wrote, false_pos, to help visualize why multiple t-tests can lead to highly inflated false positive rates.
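
To put a number on that inflation, here is a back-of-the-envelope sketch. It is not the `false_pos` R function mentioned above; `familywise_error_rate` is a hypothetical helper, and it assumes the tests are independent, which pairwise comparisons among the same groups are not, so treat the result as an approximation.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests tests,
    assuming each test is independent and run at level alpha."""
    return 1 - (1 - alpha) ** n_tests


# Three pairwise t-tests (A vs. B, A vs. C, B vs. C), each at alpha = 0.05.
fwer = familywise_error_rate(0.05, 3)
print(f"{fwer:.3f}")  # about 0.143, nearly triple the nominal 0.05
```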

Linear regression via gradient descent

April 22, 2018
#machine-learning #projects #r #statistics

After hearing so much about Andrew Ng’s famed Machine Learning Coursera course, I started taking the course and loved it. (His demeanor can make any topic sound reassuringly simple!) Early in the course, Ng covers linear regression via gradient descent. In other words, given a series of points, how can we find the line that best represents those points? And to take it a step further, how can we do that with machine learning?

Visualizing my daily commute

November 1, 2017
#projects #r

I love data visualization, and one holiday my partner surprised me with the book Dear Data. The book is a series of weekly letters two data analysts wrote to one another with visualizations of data on random topics. One week they tracked the number of times they said “thank you,” for example; another week, they counted the number of times they looked at a clock. In their letters, they visualized their data. One of the most interesting parts of the book was seeing how differently they could plot the same type of data.

How to be fancy with comparisons in R

October 14, 2017
#r

Welcome to another episode of “Random R,” where we’ll ask random programming and statistical questions and answer them with R. Today, for whatever reason, let’s say we want to dive into methods for comparing values. We’ll start simple (e.g. is 5 greater than 4? Read on to find out.) and then work our way towards trickier element-wise comparisons among multiple matrices.

Ph.D. reflections
4th year

September 18, 2017
#academia

Writing this in September 2017 after the new first-years have arrived on campus, I realize it’s now been four years since I started the PhD. Back in 2013, I had just finished a year in Germany on a Fulbright scholarship, studying social and antipredator behavior in birds at the Max Planck Institute for Ornithology. The work I had helped with there was on its way to being published in Animal Behaviour, as was my undergraduate senior thesis work. I had just received an NSF-GRFP fellowship – a great vote of confidence from the federal government – and had spent the last weeks of summer traveling and enjoying the Behaviour conference in Newcastle, UK.

For loops vs. apply - a race in efficiency

July 13, 2017
#projects #r

Welcome to the first Random R post, where we ask random programming questions and use R to figure them out. In this post we’ll look at the computational efficiency of for loops versus the apply function.

Learning R
5. The apply functions

February 1, 2017
#r

Learning R series

Ph.D. reflections
3rd year

August 12, 2016
#academia

Year three. If an American PhD takes 5-6 years, then this is when you pass the halfway point. You’re now in the thick of the weeds. The big-picture science that originally got you into this PhD gets harder to remember. The questions you set off to answer years ago really need to stand up to the second-guessing that comes once you start dedicating hundreds of hours to answering them. It gets hard not to hear those quiet voices asking if there’s a better way to do things: is academia the path of most happiness for me, what’s consulting or industry like, am I actually doing interesting work? And of course, the eternal question of investing in automation versus doing the mind-numbing manual work! (See xkcd for the right ratio of investment to payoff!)

Learning R
4. Functions and if statements

May 15, 2016
#r

Learning R
3. For loops and random walks

April 24, 2016
#r

Learning R
2. Random data and plotting

December 22, 2015
#r

Learning R
1. Introduction

December 15, 2015
#r

Ph.D. reflections
2nd year

October 8, 2015
#academia

The 2nd year of the PhD felt like being a teenager. You’re no longer new to graduate school, and you’re starting to feel the pressure of having something to show for your time here. Within your second year, you go from a PhD student interested in a topic to a PhD candidate who understands the topic well enough to convince others it’s important. This transition has felt a bit like growing up: a loss of naivety and the addition of responsibilities, but also a legitimization of who you are as a researcher. This blog post describes my experiences during this year and what I’ve learned from them.

Prelims summary and advice

July 28, 2015
#academia

Sometime within the first two years of a North American biology PhD, grad students take an exam that determines whether their research ideas hold water or whether they should leave. No pressure! This rite of passage is called the generals, qualifying, or preliminary exam (“generals,” “quals,” or “prelims” for short), and it’s analogous to a Master’s defense in the European system. The specifics of the exam vary greatly between universities but tend to involve a written literature review and thesis proposal, sometimes a written exam, and a multiple-hour oral exam by the thesis committee. The committee, which consists of 3-5 professors who read your proposal, will ask you questions for around three hours and then decide whether you should stay.

Behind the Scenes
Couzin et al. 2011

July 16, 2015
#academia

The story behind Couzin et al. 2011: “Uninformed individuals promote democratic consensus in animal groups”

Couzin ID, Ioannou CC, Demirel G, Gross T, Torney CJ, Hartnett A, Conradt L, Levin SA, Leonard NE. 2011. Uninformed individuals promote democratic consensus in animal groups. Science. 334: 1578-1580.

Ph.D. reflections
1st year

June 5, 2014
#academia

The first year of the PhD is over… I guess! It doesn’t really feel like your second year until the new first-years arrive in September, and work hasn’t suddenly stopped with the end of the semester, unlike in college. If anything, this summer is when I’ll actually make any progress on experimental ideas I’ve been developing since I first e-mailed my advisor two years ago. But enough time has passed that I think I can share some reflections on my first year that can hopefully help someone else starting or considering starting a PhD in biology.

Writing the self-contained universe

November 23, 2013
#academia

You are not sitting next to me right now as I type these thoughts. You’re most likely not in New Jersey, and you might not even be in the U.S. The fact that it’s even possible for you to be reading these words right now highlights the power of communicating ideas through writing. Effective communication is the difference between you growing bored and leaving halfway through this blog post to explore other parts of the internet, and you reaching the end (before moving on to explore the rest of the internet!).

Gap years - trying things out

June 8, 2013
#academia

In the sciences it’s easy to get in the mindset of “go to college, go to grad school, get a postdoc, be a professor” for your career. While this trajectory works, I want to talk about the crazy idea of breaking from the path for a year or two before you throw yourself into a PhD program. This applies to people applying to professional schools like medicine or law, as well!

Advice for the Gates Cambridge application - II

April 27, 2013
#academia

Hi all,

Advice for the Gates Cambridge application - I

April 25, 2013
#academia

[Note: Dr. Bergen has since completed his Ph.D. and is now a Strategy Insights & Planning Consultant at ZS.]

Application advice for the NSF-GRFP

April 16, 2013
#academia

[Disclaimer: I received the GRFP during the 2012-13 application cycle, which might not reflect what the current GRFP is looking for. But hopefully the broader themes in this post still apply!]

Nine months out of the college bubble

February 16, 2013
#career

As the one-year anniversary of my graduation approaches, I’ve had some time to reflect on my college experience and the ways it has - and hasn’t - prepared me for graduate-level research.

How to get into grad school for bio

November 1, 2011
#academia

At this point in the year, with grad applications closing and the waiting process beginning (or continuing for some of us), this post might not seem all that relevant to the college seniors who have hopefully figured out how to apply to graduate schools. This post may seem early for juniors who are interested in grad school but figure they have time before they apply. Maybe the occasional freshman or sophomore who stumbles across this blog will think that grad school is so far in the distance it’s not even worth thinking about right now. However, the following advice is a general path for leveling up as a researcher and figuring out what about biology interests you, knowledge that will serve you well regardless of whether you pursue grad school.