Learning R
5. The apply functions
Written by Matt Sosna on February 1, 2017 We’ve reached the point in learning R where we can now afford to focus on efficiency over “whatever works.” A prime example of this is the apply
functions, which are powerful tools to quickly analyze data. The functions can find slices of data frames, matrices, or lists and rapidly perform calculations on them. These functions make it simple to perform analyses like finding the variance of all rows of a matrix, or calculating the mean of all individuals that meet conditions X, Y, and Z in your data, or to feed each element of a vector into a complex equation. With these functions in hand, you will have the tools to move beyond introductory knowledge of R and into more advanced analyses.
This is the fifth in a series of posts on R. This post covers:
apply
,sapply
,tapply
,lapply
(pronounced “apply,” “sapply,” “tapply,” and “lapply”)sweep
Previous posts included:
 An introduction to R
 Distributions, plotting, and linear regression
 For loops and random walks
 Functions and if statements
The most important code from those posts is shown below:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
c() # Concatenate
x < c(5, 10) # x is now a twoelement vector
x[x > 7] # Subset the values of x that are greater than 7
# Assign to the object "data" 100 randomlydrawn values from a normal
# distribution with mean zero and standard deviation 1
data < rnorm(100)
# Plot a histogram of the "data" object
hist(data)
# Assign to the object "y" 100 randomlydrawn values from a uniform
# distribution with minimum zero and maximum 1
y < runif(100)
# Make a scatterplot of y as a function of data
plot(y ~ data)
for(i in 1:length(x)){ # For i in the sequence 1 to the length of x...
print(i) # ...print i
}
# Create our own mean function
our.mean < function(x){sum(x) / length(x)}
# Print the statement "Yup, less than 5" if 1 + 1 < 5 is true
if(1 + 1 < 5){print("Yup, less than 5")}
apply
The first of the apply functions is the one the family is named after. Let’s revisit the following plot from the third R post on for loops.
A random walk, if you remember, is a series of random steps. The walk can be in as many dimensions as you want; the ten walks shown above are in one dimension (where there’s only one movement axis: up/down). It seems like the variance in the walks increases with time: the more you move to the right, the more spread out the walks are. How could we quantify this?
In case you missed it, here’s how we generated the data for the above figure. But this time, let’s run 100 random walks instead of 10.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
start < 0
mean.step < 0
sd.step < 1
n.steps < 100000
n.walks < 100 # The number of random walks you want to run
# Create an empty matrix that we'll fill with the random walk values
WALKS < matrix(NA, nrow = n.walks, ncol = n.steps)
# Start all the walks at the origin
WALKS[, 1] < start
# Perform the random walks
for(i in 2:n.steps){
WALKS[, i] < WALKS[, i1] + # Take the previous positions...
rnorm(n.walks, mean.step, sd.step) # ...and add a step to each one
}
[Note: if you did rnorm(1, ...)
, you would generate 100 identical random walks because the same value would be added to each walk. That’s why we need to specify the number of random walks in the first argument in rnorm
.]
WALKS
is a 100 x 100,000 matrix, where each row is a random walk and each column is a time point. To show how the variance of the walks changes over time, we want to take the variance of each column. One way to do this would be to create a for
loop that iterates over each column:
1
2
3
4
5
walk.variance < c()
for(i in 1:ncol(WALKS)){
walk.variance[i] < var(WALKS[, i])
print(i)
}
This works, but there’s a simpler way to do this. Let’s use apply
, which only requires one line of code.
1
walk.variance < apply(WALKS, 2, var)
apply
’s inputs are the matrix, whether you want rows (apply(..., 1, ...)
) or columns (apply(..., 2, ...)
), and the function you want to apply to that dimension of the matrix, which for us is var
.
[Note: you can also do this for higherdimensional arrays, e.g. a 3dimensional “book” where you want to summarize each “page.” That would be apply(..., 3, ...)
]
If we plot walk.variance
, we see a nice linear relationship between the variance and how long we’ve run the random walk.
You can apply any function to each column or row, as long as it can take a vector input. Using what we learned in post 4 of our R series, let’s apply a more complicated function to each row of the matrix. This function will find the mean of the six highest values of each random walk, then subtract it from the range. No real reason why. Just flexing a little R muscle!
1
2
3
4
5
6
7
# Find the mean of the six highest values of the input,
# then subtract it from the range
fun.times < function(x){
return(mean(head(x))  (range(x)[2]  range(x)[1]))
}
lots.of.fun < apply(WALKS, 1, fun.times)
sweep
apply
is great for performing a calculation on each row or column of a matrix, but what if you want to, say, subtract an entire vector across each column of a matrix? You’re not subtracting just one value from the matrix. You have a separate value for each column. For this, you’ll want sweep
.
Here’s a simple example that doesn’t require sweep
. You’re subtracting one value from the entire matrix.
1
WALKS.adjusted < WALKS  mean(WALKS)
Here, the new object (WALKS.adjusted
) has been centered around the mean of WALKS
. This isn’t incredibly informative, though, because the mean of WALKS
condenses the information from all the random walks and all the time points of those walks into one value.
With sweep
, we can be more nuanced. Let’s create a new matrix, WALKS.adjusted2
, where all of the walks are centered around the mean of all the random walks for that particular time point. Walks that are have particularly high values will have positive values, average walks will be around zero, and walks with low values will have negative values.
1
WALKS.adjusted2 < sweep(WALKS, 2, colMeans(WALKS), "")
In English, the code above says subtract (""
) the mean of each column (colMeans(WALKS)
) from each column (2
) of WALKS
.
sapply
apply
works great on matrices, but it doesn’t work if you want to perform a calculation on every element of a vector, or if you want to perform a calculation on only select columns of a matrix or data frame. In these cases, sapply
is the answer. sapply
takes in a vector, matrix, or list (a data type we’ll cover here) and returns a vector. In this post, we’ll just focus on using sapply
on vectors^{[1]}.
Let’s say we have a list of probabilities, and we want to simulate outcomes based on those probabilities. Maybe we have 50 friends, each with some probability of wanting to marathon the eight Harry Potter movies back to back in a massive sleepover party. We can simulate “rolling the dice” on a vector of those probabilities with sapply
.^{[2]}.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# 50 random values drawn from a uniform distribution between 0 and 1
probs < runif(50, 0, 1)
# The possible outcomes
outcomes < c("It's marathon time!",
"Sorry, I actually already have plans...")
# A function that returns an outcome, given a probability
prob_to_outcome < function(x){sample(outcomes, 1, prob = c(x, 1x))}
result < sapply(probs, FUN = prob_to_outcome)
print(result)
# > "Sorry, I actually already have plans..."
# "Sorry, I actually already have plans..."
# "It's marathon time!"
# ...
prob_to_outcome
takes in a probability of wanting to marathon the Harry Potter movies and simulates whether the person ended up choosing to marathon the series or not. In line 11, sapply
runs prob_to_outcome
on every probability in probs
. We could do the same with a for
loop, but it would be both more lines of code and less efficient:
1
2
3
4
5
# Less efficient version of line 11 above
result < c()
for(i in 1:length(probs)){
result[i] < prob_to_outcome(probs[i])
}
tapply
While you can use sapply
on data frames^{[1]}, tapply
is a friendlier tool. Let’s say we successfully hosted our Harry Potter marathon and we surveyed our guests on how they rank the eight films. Our data is vertically stacked in a dataframe like this. And for a bit of fun, we also know whether they consider themselves Twilight fans or not.
We’ll start by defining some variables:
1
2
3
4
5
6
7
n_movies < 8
n_guests < 10
titles < c("Sorcerer's Stone", "Chamber of Secrets",
"Prisoner of Azkaban", "Goblet of Fire",
"Order of the Phoenix", "HalfBlood Prince",
"Deathly Hallows p.1", "Deathly Hallows p.2")
Then we’ll loop through our guests, generate a dataframe with each guest’s info, and stacking the dataframes on top of one another. We could avoid using a for
loop here, but given the fact that we’re vertically stacking data and dealing with both single values (guest_id
, being a fan of Twilight) and vectors (titles
, rankings
), it’s a lot clearer to use a loop.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
for(i in 1:n_guests){
rankings < sample(1:n_movies, n_movies, replace=F)
twilight_fan < sample(c(TRUE, FALSE), 1)
df_guest < data.frame('guest_id' = i,
'twilight_fan' = twilight_fan,
'movie' = titles,
'rank' = rankings)
# Start df_final or add to it
if(i == 1){
df_final < df_guest
} else {
df_final < rbind(df_final, df_guest)
}
}
Now we can actually analyze our data. Let’s ask what the average rank is for each movie. With tapply
, the first argument is the data to analyze, the second argument is how to group the data, and the third is the function to apply.
1
2
3
4
5
6
7
tapply(df$rank, df$movie, mean)
# Chamber of Secrets Deathly Hallows p.2 Deathly Hallows p.1
# 4.2 4.0 5.2
# Goblet of Fire HalfBlood Prince Order of the Phoenix
# 3.4 4.3 4.7
# Prisoner of Azkaban Sorcerer's Stone
# 5.0 5.2
Looks like Goblet of Fire wins! (Let’s just ignore any semblance of statistical rigor.) But what about for people who like vs. dislike Twilight? For this extra nuance, we can just modify the second argument in tapply
to include both df$movie
and df$twilight_fan
. As a step further to keep the code clean, I’ll use the with
command, which lets us avoid needing to type df$
before any column in df
.
1
2
3
4
5
6
7
8
9
10
with(df, tapply(rank, list(movie, twilight_fan), mean))
# FALSE TRUE
# Chamber of Secrets 2.0 4.75
# Deathly Hallows p.2 3.0 4.25
# Deathly Hallows p.1 6.0 5.00
# Goblet of Fire 6.0 2.75
# HalfBlood Prince 6.5 3.75
# Order of the Phoenix 3.5 5.00
# Prisoner of Azkaban 2.0 5.75
# Sorcerer's Stone 7.0 4.75
In a shocking upset, turns out the bestranked film is a tie between Chamber of Secrets and Prisoner of Azkaban for people who don’t like Twilight. For those who do like Twilight, Goblet of Fire reigns again. But to see if any of these insights are worth making it past a toy example in a blog post, we’ll want a much bigger representative sample of… real people.
lapply
Finally, we have lapply
, a function that lets you apply a function to every element of a list. In R, lists are objects that contain other objects. We can create a sample list, L
, that has the numbers 1 to 3, a string, and a dataframe.
1
2
3
4
5
6
7
8
9
10
11
L < list(1:3, 'hello', data.frame('id'=1))
L
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] "hello"
#
# [[3]]
# id
# 1 1
With lapply
, we can efficiently apply a function to each object in the list. It doesn’t make a whole lot of sense, but we could, for example, apply min
to each element. Note that lapply
returns a list^{[3].}
1
2
3
4
5
6
7
8
9
lapply(L, min)
# [[1]]
# [1] 1
#
# [[2]]
# [1] "hello"
#
# [[3]]
# [1] 1
Let’s finish this post with an example that is more sensible and has more Harry Potter in it. In an effort to get on the front page of /r/dataisbeautiful, we decide to collect some data at our party. Every 30 minutes, we poll the audience on how much fun they’re having. (It’d be solid 10/10’s throughout the night, but play along and imagine there’s some variation.) With these data, we build a matrix where each row is a party participant and each column is a time point (3:10am, 3:20am, etc.). The data are how much fun person i
is having at time point j
.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
n_samples < 100
n_guests < 10
fun_mat < matrix(round(runif(5000, 0, 10)),
ncol = n_samples, nrow = n_guests)
rownames(fun_mat) < sapply(1:nrow(fun_mat), function(x){paste0("person", x)})
colnames(fun_mat) < sapply(1:ncol(fun_mat), function(x){paste0("t", x)})
print(fun_mat)
# t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 ...
# person1 7 7 5 7 5 1 1 7 7 5 0 7 8 9 ...
# person2 6 9 5 10 4 2 2 1 2 9 0 1 1 5 ...
# person3 8 1 7 4 5 1 3 5 4 1 8 9 10 1 ...
# person4 2 1 2 10 1 2 8 5 1 6 6 5 0 0 ...
# ...
What if we want to know what moments each person was having at least 9/10 fun? We could just write which(fun_mat >= 9)
, but that would convert fun_mat
into a vector and then return the indices of the vector where the value is greater than 9. So for our 100 x 50 matrix, we’d get answers like 9
and 738
and 2111
, which don’t have any identifying information on who was happy when.
This sounds like a job for apply
, so we run this:
1
2
3
4
5
6
7
8
9
10
passed_threshold < apply(fun_mat, 1, function(x){which(x >= 9)})
print(passed_threshold)
# $person1
# t2 t6 t23 t27 t53 t76
# 2 6 23 27 53 76
#
# $person2
# t1 t2 t30 t32 t40 t47 t48 t50 ...
# 1 2 30 32 40 47 48 50 ...
# ...
We now have a list, where each object is a person, and the values are the time points at which they scored at least a 9/10 on having fun. However, we can now use lapply
to get more nuanced information from passed_threshold
. Here is the first time point at which someone passed the threshold of having fun:
1
2
3
4
5
6
7
8
# Find the first time point that passed the threshold
lapply(passed_threshold, min)
# $person1
# [1] 2
#
# $person2
# [1] 1
# ...
And here is a full summary of when having fun occurred:
1
2
3
4
5
6
7
8
9
10
# Get a summary of the time points for each person that
# passed the threshold
lapply(passed_threshold, summary)
# $person1
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 2.00 10.25 25.00 31.17 46.50 76.00
#
# $person2
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 36.00 50.00 51.87 76.50 90.00
Note that these commands return their contents as a list. If you’d prefer they were returned as a vector, we can use sapply
instead.
1
2
3
sapply(passed_threshold, min)
# person1 person2 person3 person4 person5 person6 ...
# 2 1 4 3 6 9
Conclusions
Thanks for reading the “Introduction to R” series! If you want to keep learning R, check out some “Random R” posts in the blog page. You can find little experiments in there like racing for
loops vs. apply
, visualizing GPS data, and more.
Cheers,
Matt
Footnotes

[sapply] We can also use
sapply
on dataframes and matrices when we only want to run our calculation on specific columns. Let’s say we run a lab experiment and have the results in a dataframe that has columns for our data, as well as metadata like the date of the trial, the treatment name, etc. If we want to find the median of our data columns, we can’t just doapply
 R will complain about trying to find the median for the nonnumeric columns. Instead, we can get the indices of the numerical columns and then usesapply
.1 2 3 4 5 6 7 8 9 10 11 12
# Generate sample data height_gain < rnorm(20) weight_gain < rnorm(20) treatment < sort(rep(c("medicine", "placebo"), 10)) trial < factor(rep(1:5, 4)) # Indicate vals are factors, not ints df < data.frame(height_gain, weight_gain, treatment, trial) numeric_cols < sapply(df, is.numeric) sapply(which(numeric_cols), function(x){median(df[, x])}) # > height_gain weight_gain # 2.683714 2.070083
You’ll notice we call
sapply
twice  first to get a vector of booleans (TRUE
andFALSE
) for whether each column indf
is numeric, and then again to find the median of each numeric column. [Becausenumeric_cols
is a vector ofTRUE
andFALSE
values, we don’t have to specifywhich(numeric_cols == T)
;which
already only returns indices that areTRUE
.]
However, while it’s possible to usesapply
like this, I find it more convenient to usetapply
so we can avoid using indices (and can instead use column names, which is easier to verify that we’re doing what we think we’re doing). We can even get away with usingapply
again, such as withapply(df[, which(numeric_cols)], 2, median)
. So I usually stick with analyzing vectors withsapply
, which is why this section is a footnote instead of in the main text. 
[sapply] Note that when we simulate outcomes based on probabilities, the outcomes will differ between runs  if we have a probability of 0.5, for example, we can expect roughly half our runs to have a positive outcome and half to have a negative. If you want to control this randomness (e.g. for sharing a reproducible analysis), use
set.seed
to force the next random function to always return the same value.1 2 3 4 5 6 7 8 9 10
# Without seed rnorm(3) # 1.2724293 0.4146414 1.5399500 rnorm(3) # 1.1476570 0.2894616 0.2992151 # With seed set.seed(1) rnorm(3) # 0.6264538 0.1836433 0.8356286 set.seed(1) rnorm(3) # 0.6264538 0.1836433 0.8356286

[lapply] We can actually run
sapply
on the list to get our result as a vector, though the response might seem confusing.1
sapply(L, min) # "1" "hello" "1"
Why are the first and third elements strings instead of integers? This is because vectors in R must have elements of the same type. Because we have both integers and strings when we apply
min
to each element ofL
, R needs to choose one of the two data types for our vector. The number1
has a clear conversion to the string"1"
, but there’s no obvious integer to convert"hello"
to. R therefore converts all elements to strings before returning the vector.