Why tapply should be your new best friend

Ever been faced with a dataset spread across a range of categories? Perhaps you had shopper counts per day over a multi-year period and wanted to know the footfall for each day of the week. Or, in keeping with my snow theme, you might have a range of sites recording lying snow and want to know how many days snow lay at each site, or how many days per month snow lay at each site.

While you can do these things in a spreadsheet using countif() type statements, they’re clumsy, become very awkward across larger numbers of variables and get very slow when you need to process larger amounts of data.

In the R statistical programming language you can use a function called tapply. It belongs to the apply/lapply/sapply/mapply family, which lets you apply an action across a dataset. Specifically, tapply applies a function to groups of values defined by a factor, R’s data type for categories. So if you have a column of station numbers or names, it can be used as a category index, and the values are grouped by station before the function you call is applied to each group.

The key point is, it’s much more straightforward in practice than trying to explain it!
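
As a tiny illustration of the grouping idea, here is a minimal sketch using made-up values. The station names and snow depths are invented for this example and don’t come from any real dataset.

```r
# Hypothetical readings: snow depth (cm) at three made-up stations
depth = c(2, 5, 0, 3, 4, 1)
station = factor(c("A", "B", "A", "C", "B", "C"))

# tapply(values, index, function): split depth by station, then
# apply mean to each group
station_mean = tapply(depth, station, mean)
print(station_mean)
```

The result has one named value per station, so station_mean["B"] picks out the mean for station B.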

To get started, the first thing to do is install R. It’s free and available for all major operating systems; you can get it here: http://cran.r-project.org/. I prefer to use it with RStudio because it lets you have a plot, help and a text editor all open in the same window.

As an example I’ve generated some data so you can see how it works. In practice you’d want to import your own – tips later on. (NB comments are marked with a # in R)

# Create some data

# Generate an example time series
data = data.frame(date = seq(as.Date("1990-01-01"), as.Date("2013-12-31"), by="day"))
# Create columns with the month and year
data$month = substring(data$date,6,7)
data$year = substring(data$date,1,4)
# Create some data using a normal distribution and the length of
# the data series, then round to create a binary set.
data$snow_lying = round(rnorm(nrow(data), mean=0.5, sd=0.1))

So now we’ve got a dataframe with four columns: full date, month, year and some 0/1s to represent if snow was lying on a given day.

You can explore this data via R. Some useful commands are:

summary(data)
dim(data) # gives the dataframe dimensions
names(data)
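
A few other base R functions are handy at this stage. The sketch below rebuilds a shorter version of the example data (one year rather than 24, purely to keep it quick) so that it can run on its own.

```r
# Rebuild a small version of the example data so this runs on its own
data = data.frame(date = seq(as.Date("1990-01-01"), as.Date("1990-12-31"), by="day"))
data$month = substring(data$date, 6, 7)
data$snow_lying = round(rnorm(nrow(data), mean=0.5, sd=0.1))

head(data)               # first six rows
str(data)                # column types at a glance
table(data$snow_lying)   # how many 0s and 1s were generated
```

Because the values come from rnorm, the counts reported by table will differ from run to run.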

But how do you go about using tapply? Very straightforwardly, it turns out! Here are some examples:

# Simplest application
tapply(data$snow_lying, data$month, sum)
# Using it with more than one index
tapply(data$snow_lying, list(data$month, data$year), sum)
# What if you wanted to apply something using options
# within a function?
tapply(data$snow_lying, list(data$month,data$year), function(i){
  sum(i>0, na.rm=TRUE)
})

The last example lets you set options within a function, in this case na.rm, which lets sum work even if there are NAs present in the dataset. For our generated data this isn’t an issue, but wrapping a call in function() like this is the way to access some more complicated statistical tasks in R. You can also assign these outputs to a variable (add x = before tapply), which makes it easier to pass the results on to something else, like a graph, table output or statistical test.
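
To see why na.rm matters, the sketch below uses a small made-up vector with one missing value (the site labels and values are invented for illustration). It also shows that tapply passes any extra arguments on to the function, which can save writing a wrapper, and that the result can be stored in a variable.

```r
# Made-up observations with one missing value
snow = c(1, 0, NA, 1, 1, 0)
site = c("north", "north", "north", "south", "south", "south")

# Without na.rm the NA propagates to that group's total
with_na = tapply(snow, site, sum)

# Extra arguments after the function are passed on to it
totals = tapply(snow, site, sum, na.rm=TRUE)

# Equivalent form using an anonymous function
totals2 = tapply(snow, site, function(i) sum(i, na.rm=TRUE))

# The stored result can be indexed by group name
totals["south"]
```

Both forms give the same totals; the anonymous function is only needed when you want to do more than pass an option through.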

Download R and have a play with the above examples. When you’re feeling comfortable try using your own data. To import it use the read.table/csv functions. To get data out again use write.table/csv. Example:

# Set your working directory:
setwd("~/Files/")
# Reading in a csv, in this case | is used as the column separator
# and there are no column names. If there are, use header=TRUE
x = read.csv("file.csv", sep="|", header=FALSE)
# To write a csv out:
write.csv(x, "mynewfile.csv")
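
Putting import and tapply together might look like the following sketch. So that it runs without any real file, it first writes a small made-up table to a temporary file. The column names site and snow_lying are assumptions for the example, so substitute your own.

```r
# Write a small made-up dataset to a temporary csv (stand-in for your file)
tmp = tempfile(fileext=".csv")
example = data.frame(site=c("A", "A", "B", "B", "B"),
                     snow_lying=c(1, 0, 1, 1, 0))
write.csv(example, tmp, row.names=FALSE)

# Read it back in; this file has a header row and uses commas
x = read.csv(tmp, header=TRUE)

# Days with snow lying at each site
snow_days = tapply(x$snow_lying, x$site, sum)

# Convert to a data frame and write the summary out
out = data.frame(site=names(snow_days), snow_days=as.vector(snow_days))
write.csv(out, tempfile(fileext=".csv"), row.names=FALSE)
```

Setting row.names=FALSE stops write.csv from adding an extra column of row numbers to the output file.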

The R help files are excellent, although they do take a little getting used to! These are available online or via R/RStudio (try typing ?tapply at the console).

Get stuck in!
