I fell out with tapply and in love with dplyr
A long time ago (5 years) I wrote a blog post on tapply. Back then I was just getting into programming and I thought the possibilities of tapply were amazing. So, it seems, do many others, as it has become one of my most viewed articles.
However, I never use tapply these days because the output is either a named vector or a matrix. Both of these need munging before I can do anything else with the output. Three months after I wrote my tapply post, a little package called dplyr was released. It took a while before it became integral to my workflow (I like to use as few packages as possible), but now I use it almost daily. The two biggest reasons are:
- A data frame as output
- Readable code
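To make the first point concrete, here is a minimal sketch in base R of what tapply hands back. The `toy` data frame and its values are made up purely for illustration; only the column names follow the example used in this post.

```r
# Hypothetical toy data: six observations across two years and two months
toy = data.frame(
  year = c("1990", "1990", "1990", "1991", "1991", "1991"),
  month = c("01", "01", "02", "01", "02", "02"),
  snow_lying = c(1, 1, 1, 0, 1, 0)
)

# One grouping variable: tapply returns a named vector, not a data frame
by_month = tapply(toy$snow_lying, toy$month, sum)
is.data.frame(by_month)

# Two grouping variables: tapply returns a matrix (years by months)
by_year_month = tapply(toy$snow_lying, list(toy$year, toy$month), sum)
is.matrix(by_year_month)

# The munging step: reshape the matrix into a long data frame by hand
long = as.data.frame(as.table(by_year_month))
names(long) = c("year", "month", "snow_days")
long
```

The last three lines are the kind of extra step I would rather not write every time I summarise something.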
Now that we’re all living in the tidyverse, I’m a bit confused that so many folk still land on my blog looking for tapply. So this post updates and supersedes what I wrote previously. I’ve repeated the toy example I made before:
# Load dplyr for %>%, mutate, group_by and summarise
library(dplyr)

# Generate an example time series
df = data.frame(date=seq.Date(as.Date("1990-01-01"), as.Date("2013-12-31"), by=1))

# Add some data (0s and 1s)
df = df %>%
  mutate(snow_lying=sample(c(0, 1), nrow(df), replace=TRUE))

# Get month and year from date
df = df %>%
  mutate(month=format(date, "%m"), year=format(date, "%Y"))

# Sum for each month
df %>%
  group_by(month) %>%
  summarise(snow_days=sum(snow_lying))

# Sum for each month, each year
df %>%
  group_by(year, month) %>%
  summarise(snow_days=sum(snow_lying))
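If you are moving old tapply code over, a quick cross-check can be reassuring. The sketch below rebuilds a one-year version of the toy data and compares the two approaches. The `set.seed` call, the shorter date range and the names `monthly`, `old_way` and `snowiest` are my own assumptions for illustration, not part of the original example.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the comparison is reproducible

# One year of made-up 0/1 data, built the same way as the toy example
df = data.frame(date=seq.Date(as.Date("1990-01-01"), as.Date("1990-12-31"), by=1))
df = df %>%
  mutate(snow_lying=sample(c(0, 1), nrow(df), replace=TRUE),
         month=format(date, "%m"))

# dplyr: the result is a data frame (a tibble)
monthly = df %>%
  group_by(month) %>%
  summarise(snow_days=sum(snow_lying))

# tapply: the result is a named vector
old_way = tapply(df$snow_lying, df$month, sum)

# Same numbers, different containers
all(monthly$snow_days == as.vector(old_way))

# Because the output is a data frame, it pipes straight into the next step
snowiest = monthly %>%
  filter(snow_days == max(snow_days))
```

The final step is where the data frame output pays off: no reshaping is needed before filtering, joining or plotting.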