How to write functions and use lists in R

R is a programming language used for data analysis, statistics and graphics production. If you’ve not come across it before, this video will give you some background:

OK, OK, it couldn’t be much more corny… but it serves a purpose (not entirely sure what it is though).

If you analyse data, and particularly statistics, you should really have R in your toolbox. Like all programming, it makes light work of repetitive tasks. As a trade-off, there’s a learning curve involved. Thankfully the syntax is easy to get to grips with (if I can manage it, anyone can). There are some tips here for moving from spreadsheets to R.

R is platform independent (Linux, Mac, Windows…) and is free and open source software. You can download R from a CRAN (Comprehensive R Archive Network) mirror near you. I use R through RStudio, which incorporates a text editor and help tab, and I recommend it as a much more user-friendly option!

One of the great things about R is being able to write your own functions to undertake multi-step tasks with very little effort. There are loads of great resources on the web for R; I particularly like Quick-R, and CRAN itself has some excellent help documentation and manuals.

So what’s the point of this post? To make life much easier in R you should write functions to accomplish repetitive tasks. If you find you’re copying code and just changing a few parameters, then you should probably be writing a function to ease your workflow. Many generic functions have already been written and are available as packages. The ones you write yourself will be those that work specifically with your data.
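As a rough sketch of the copy-and-paste problem described above, here is the same calculation written twice by hand and then once as a function. The data and the name rescale01 are made up for this illustration.

```r
# Hypothetical data, invented for this example
heights = c(150, 160, 170, 180)
weights = c(50, 65, 80, 95)

# Copy-and-paste version: the same formula typed out for each variable
heights.scaled = (heights - min(heights)) / (max(heights) - min(heights))
weights.scaled = (weights - min(weights)) / (max(weights) - min(weights))

# Function version: write the formula once, reuse it anywhere
rescale01 = function(x){
   (x - min(x)) / (max(x) - min(x)) # squeeze x into the range 0 to 1
}

heights.fn = rescale01(heights)
weights.fn = rescale01(weights)
```

Both versions give the same numbers, but only the function version needs fixing in one place if the formula changes.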

Here’s an example that calculates a moving average (taken from stackoverflow):

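ma = function(i, n){ # i: numeric vector or time series, n: window length
   stats::filter(i, rep(1/n, n), sides=2) # equal weights of 1/n, centred on each point
   # stats:: makes sure base R's filter is used even if dplyr is loaded
}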

Where i is your data (a numeric vector or time series) and n is the number of points in the moving-average window. As you’ll note from the original stackoverflow post, the sides argument sets whether the window is centred on each data point (sides=2) or covers only that point and the values before it (sides=1). If you wanted a more general function where you specified this each time, you would write:

ma = function(i, n, s){ # s: 1 = trailing window, 2 = centred window
   stats::filter(i, rep(1/n, n), sides=s)
}
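To see what the sides argument does in practice, here is a small worked example on made-up values, assuming the three-argument version of ma above (redefined here so the snippet runs on its own). The as.numeric() call just strips the time series wrapper that filter returns.

```r
ma = function(i, n, s){
   stats::filter(i, rep(1/n, n), sides = s)
}

x = c(2, 4, 6, 8, 10) # made-up values

# Centred window: each value is averaged with its neighbours
centred = as.numeric(ma(x, 3, 2))  # NA 4 6 8 NA

# Trailing window: each value is averaged with the two before it
trailing = as.numeric(ma(x, 3, 1)) # NA NA 4 6 8
```

The NA values appear wherever the window runs off the end of the data, so a longer window means more missing values at the edges.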

Quite often I’ll write a short, task-specific function with the variables specified inside it, because I just want to apply the same workflow to a number of datasets. This is most easily accomplished using lists. A list in R is a collection of objects, and each element can be of any type (a vector, a data frame, even another list). There’s a function in R called lapply that applies a function to each element of a list or vector and returns the results as a list. Here’s an example of lapply to import a folder of files:

# Tell R where the files are
setwd("~/dir/dir/") # Use C:/ for windows

# Make a vector of file names to import
in.files = list.files("./")
# use pattern="formathere"
# to specify only certain files
# import files to a list using a function
data.in = lapply(in.files, function(i){
   read.csv(i, sep="|", header=FALSE)
   # specify file separator and
   # whether it contains headings
})
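Here is a sketch of the same import that you can run anywhere, since it first writes a couple of throwaway files to a temporary folder. The file names and contents are made up. It also shows the pattern argument mentioned in the comments above, plus two optional extras that are my additions rather than part of the original recipe: full.names=TRUE (so you don't need setwd) and naming the list elements after their files.

```r
# Create two small pipe-separated files in a temporary folder
# (stand-ins for your own data; names and values are invented)
tmp.dir = file.path(tempdir(), "lapply-demo")
dir.create(tmp.dir, showWarnings = FALSE)
writeLines(c("1|10", "2|20"), file.path(tmp.dir, "a.txt"))
writeLines(c("3|30", "4|40"), file.path(tmp.dir, "b.txt"))
writeLines("not data", file.path(tmp.dir, "notes.md"))

# Only pick up files ending in .txt; full.names = TRUE returns
# complete paths, so the working directory doesn't matter
in.files = list.files(tmp.dir, pattern = "\\.txt$", full.names = TRUE)

# Import each file into one element of a list
data.in = lapply(in.files, function(i){
   read.csv(i, sep = "|", header = FALSE)
})

# Label each element with its file name so you can tell them apart
names(data.in) = basename(in.files)
```

With names in place you can pull out a file by name, for example data.in[["a.txt"]], rather than having to remember its position.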

So what you end up with is a list holding the contents of each file, one data frame per element. Here are some handy tips for working with lists (your.list stands for any list, such as data.in above):

# Select first list element:
your.list[[1]]
# Use head() to view the first few lines:
head(your.list[[1]])

# Or if you have many columns:
your.list[[1]][1:10, 1:5] # rows,columns

# Do something to the elements of the list:
mean.list = lapply(your.list, mean) # works if each element is a numeric vector

# Do something to the second column of each element:
mean.list = lapply(your.list, function(i){
   mean(i[, 2])
})

# Combine the list into a data frame:
your.df = do.call(cbind.data.frame, your.list) # by columns (elements need the same number of rows)
your.df = do.call(rbind.data.frame, your.list) # by rows (elements need the same columns)
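To try the tips above without any files to hand, here is a minimal sketch on a made-up list of two small data frames. The use of sapply at the end is my addition: it does the same job as lapply but tries to simplify the result to a vector.

```r
# A made-up list of two small data frames
your.list = list(
   first  = data.frame(id = 1:3, value = c(2, 4, 6)),
   second = data.frame(id = 4:6, value = c(10, 20, 30))
)

# [[ ]] returns the element itself (here a data frame);
# [ ] returns a shorter list that contains the element
class(your.list[[1]]) # "data.frame"
class(your.list[1])   # "list"

# Mean of the second column of each element, returned as a list
mean.list = lapply(your.list, function(i){
   mean(i[, 2])
})

# The same calculation, simplified to a named numeric vector
mean.vec = sapply(your.list, function(i) mean(i[, 2]))

# Stack the elements by rows into a single data frame
your.df = do.call(rbind.data.frame, your.list)
```

Mixing up the single and double brackets is a common source of confusing errors, so it's worth checking class() when something that should be a data frame misbehaves.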

The best thing to do is take these examples and start working with them on your data. Good luck!
