Tag: statistics

A business lens on precision and recall

December 20, 2023
#machine-learning #statistics

Population under a magnifying glass

A deep dive on ARIMA models

August 11, 2021
#python #statistics

Predicting the future has always been a universal challenge, in decisions ranging from whether to plant crops now or next week, to whether to marry someone or remain single, sell a stock or hold it, or go to college or play music full time. We will never be able to perfectly predict the future^[1], but we can use tools from the statistical field of forecasting to better understand what lies ahead.

How to enter data science
2. The statistics

December 5, 2020
#careers #data-science #statistics

In the last post, we defined the key elements of data science as 1) deriving insights from data and 2) communicating those insights to others. Despite the huge diversity in how these elements are expressed in actual data scientist roles, there is a core skill set that will serve you well no matter where you go. The remaining posts in this series will define and explore these skills in detail.

Visualizing the danger of multiple t-test comparisons

May 13, 2018
#projects #r #statistics

It’s often tempting to make multiple t-test comparisons when running analyses with multiple groups. If you have three groups, this logic would look like “I’ll run a t-test to see if Group A is significantly different from Group B, then another to check if Group A is significantly different from Group C, then one more for whether Group B is different from Group C.” This logic, while seemingly intuitive, is seriously flawed. I’ll use an R function I wrote, false_pos, to help visualize why multiple t-tests can lead to highly inflated false positive rates.
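To put a rough number on that inflation, here is a minimal Python sketch. It is not the false_pos function from the post (that one is written in R); the name familywise_error_rate is hypothetical, and the formula assumes the pairwise tests are independent, which is only an approximation when the comparisons share groups.

```python
from math import comb


def familywise_error_rate(n_groups: int, alpha: float = 0.05) -> float:
    """Approximate chance of at least one false positive across all pairwise t-tests.

    Assumption: the tests are treated as independent, so this is a
    back-of-the-envelope figure rather than an exact rate.
    """
    n_tests = comb(n_groups, 2)  # number of distinct pairs of groups
    return 1 - (1 - alpha) ** n_tests


# Three groups means three pairwise comparisons (A-B, A-C, B-C).
for groups in (2, 3, 5, 10):
    print(groups, comb(groups, 2), round(familywise_error_rate(groups), 3))
```

Under that independence assumption, three groups already push the chance of at least one false positive from 5% to roughly 14%, and the figure climbs quickly as groups are added.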

Linear regression via gradient descent

April 22, 2018
#machine-learning #projects #r #statistics

After hearing so much about Andrew Ng’s famed Machine Learning course on Coursera, I started taking it and loved it. (His demeanor can make any topic sound reassuringly simple!) Early in the course, Ng covers linear regression via gradient descent. In other words, given a set of points, how can we find the line that best represents those points? And to take it a step further, how can we do that with machine learning?
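As a rough illustration of the idea, the sketch below fits a line by batch gradient descent on mean squared error. It is written in Python rather than the R used in the post, and the function name fit_line, the learning rate, the iteration count, and the toy data are all assumptions chosen for the example, not values from the post or the course.

```python
def fit_line(xs, ys, learning_rate=0.05, n_iters=2000):
    """Fit y ~ intercept + slope * x by batch gradient descent on mean squared error."""
    n = len(xs)
    intercept, slope = 0.0, 0.0  # start from a flat line through the origin
    for _ in range(n_iters):
        # Prediction error for every point under the current line.
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_intercept = 2 * sum(errors) / n
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Step downhill along the gradient.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x (made up for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

The learning rate matters here: too large and the updates overshoot and diverge, too small and the loop needs many more iterations to settle on the best-fitting line.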