I fell out with tapply and in love with dplyr

A long time ago (5 years) I wrote a blog post on tapply. Back then I was just getting into programming, and I thought the possibilities of tapply were amazing. So, it seems, do many others, as it has become one of my most viewed articles.

However, I never use tapply these days because the output is either a named vector or a matrix. Both of these require munging before I can use the output. Three months after I wrote my tapply post, a little package called dplyr was released. It took a while before it became integral to my workflow (I like to use as few packages as possible), but now I use it almost daily. The two biggest reasons are:

  1. A data frame as output.
  2. Readable code.
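
To make the first point concrete, here is a minimal base R sketch of the munging that tapply output needs before it behaves like a data frame. The tiny data frame `toy` and the names `snow_by_month` and `snow_df` are made up for illustration.

```r
# Hypothetical toy data: two months, a few observations each
toy <- data.frame(month = c("01", "01", "02", "02", "02"),
                  snow_lying = c(1, 0, 1, 1, 0))

# tapply returns a named one-dimensional array, not a data frame
snow_by_month <- tapply(toy$snow_lying, toy$month, sum)

# Munging needed to get back to a data frame
snow_df <- data.frame(month = names(snow_by_month),
                      snow_days = as.vector(snow_by_month))
```

The group labels only exist as names on the result, so they have to be pulled out by hand into a column of their own.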

Now that we’re all living in the tidyverse, I’m a bit confused that so many folk still land on my blog looking for tapply. So this post updates and supersedes what I wrote previously. I’ve repeated the toy example I made before:

# Load dplyr (provides %>%, mutate, group_by and summarise)
library(dplyr)

# Generate an example time series (one row per day)
df = data.frame(date=seq.Date(as.Date("1990-01-01"),
                              as.Date("2013-12-31"),
                              by="day"))

# Add some data (0s and 1s); the seed just makes the random sample reproducible
set.seed(42)
df = df %>%
   mutate(snow_lying=sample(c(0, 1), nrow(df), replace=TRUE))

# Get month and year from date
df = df %>%
   mutate(month=format(date, "%m"),
          year=format(date, "%Y"))

# Sum for each month
df %>%
   group_by(month) %>%
   summarise(snow_days=sum(snow_lying))

# Sum for each month, each year
df %>%
   group_by(year, month) %>%
   summarise(snow_days=sum(snow_lying))
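
For comparison, here is a rough sketch of the same two-factor summary done both ways. The small data frame `toy` is a made-up stand-in for the df built above, and `.groups = "drop"` assumes dplyr 1.0.0 or later, where it returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data, standing in for the df built above
toy <- data.frame(year = c("1990", "1990", "1991", "1991"),
                  month = c("01", "02", "01", "01"),
                  snow_lying = c(1, 0, 1, 1))

# tapply with two grouping factors returns a year-by-month matrix,
# with NA for combinations that have no rows
snow_matrix <- tapply(toy$snow_lying, list(toy$year, toy$month), sum)

# dplyr returns one row per year/month combination that is present;
# .groups = "drop" (dplyr >= 1.0.0) gives back an ungrouped tibble
snow_long <- toy %>%
  group_by(year, month) %>%
  summarise(snow_days = sum(snow_lying), .groups = "drop")
```

The matrix is handy for eyeballing, but the long data frame can go straight into another mutate, a join or a plot without reshaping.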

Read more about dplyr here: https://dplyr.tidyverse.org/
