dplyr, (mc)lapply, for-loop and speed

UPDATED: following some great comments, I’ve added sections on purrr and data.table.

I was at EdinbR on Friday evening and David Robinson's talk prompted some excellent discussions, not least with Isla and Gergana. One topic was dplyr versus lapply. I started using R in 2012, just before dplyr came to prominence, and so I seem to have one foot in base and the other in the tidyverse. Ambitiously aiming for the best of both worlds!

I often use lapply to wrap up my scripts which clean and process files, but Isla pointed out I could do this with dplyr. I couldn't figure out (with available Sunday time) how to use write_csv in a pipe and pass it an argument for a file name without resorting to a loop/*apply; happily, the comments below answered this with purrr.

This post is for all those folk who might be considering wrapping some tidyverse commands in a function to make their code cleaner and hopefully gain some speed advantages.

For this example I've created a dummy set of raw data to which we can apply some arbitrary selection and transformation before saving it as clean. In this case I've created dates with some observations happening each day, labelled 'a' to 'c'. Alongside these I've duplicated the columns to give us something to drop. The dummy data are created with:

library(tidyverse)

# Dummy files to process
dir.create("./temp/raw", recursive=T)
dir.create("./temp/clean")

lapply(1950:2017, function(i){
   date = seq.Date(as.Date(paste0(i, "-01-01")),
                 as.Date(paste0(i, "-12-31")),
                 by=1)
   a = rnorm(length(date))
   a1 = rnorm(length(date))
   a2 = rnorm(length(date))
   b = rpois(length(date), 10)
   b1 = rpois(length(date), 10)
   b2 = rpois(length(date), 10)
   c = rexp(length(date), 5)
   c1 = rexp(length(date), 5)
   c2 = rexp(length(date), 5)

   write_csv(data.frame(date, a, a1, a2, b, b1, b2, c, c1, c2),
             paste0("./temp/raw/file_", i, ".csv"))
})
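An aside on the lapply() above: it's used purely for its side effect of writing files. Because write_csv() invisibly returns the data frame it wrote, lapply() also collects all 68 of them and will print the lot at the console; purrr::walk(), which appears later, does the same job but returns its input invisibly.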

We’ve now got a directory with 68 csv files, each containing some fabricated daily data. In order to read a file into R, the first thing we need is its name and path; we can get these with `list.files()`:

# Get a vector of file names
f = list.files("./temp/raw", pattern="file")
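As an aside, `list.files()` can also return ready-made paths. A small sketch (not used below, since keeping bare file names lets the same `f` build both the raw and clean paths):

# full.names=TRUE prepends the directory, e.g. "./temp/raw/file_1950.csv"
f_full = list.files("./temp/raw", pattern="file", full.names=TRUE)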

Now that we have an object, `f`, containing all our file names, we can write a process to get them ready for analysis. I'm illustrating this by selecting four columns (date, a, b and c) and converting them to a long, tidy format. I'd originally had a stab at writing this process in the tidyverse alone, but couldn't figure out how to pass `write_csv()` a file name from within the pipe; as the commenters kindly pointed out, `purrr::walk()` is the tidyverse answer, and that's what the next block uses.

# Interactive tidyverse (purrr::walk)
system.time(
   walk(f, ~ paste0("./temp/raw/", .x) %>%
      read_csv() %>%
      select(date, a, b, c) %>%
      gather(variable, value, -date) %>%
      write_csv(paste0("./temp/clean/", .x)))
)

We can adapt the above slightly to make it a function. We’re now able to pass our tidyverse code individual file names, here represented by `i`. Finally, the clean data are written out into the clean folder. In the real world we may also want to change the file name to reflect the clean status.

# As a function
read_clean_write = function(i){
   paste0("./temp/raw/", i) %>%
      read_csv() %>%
      select(date, a, b, c) %>%
      gather(variable, value, -date) %>%
      write_csv(paste0("./temp/clean/", i))
}
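On that last renaming point, here's a minimal sketch of a helper (out_name() is a hypothetical name, not part of the pipeline above):

# Hypothetical helper: file_1950.csv -> file_1950_clean.csv
out_name = function(i) sub("\\.csv$", "_clean.csv", i)

The last line of read_clean_write() would then become write_csv(paste0("./temp/clean/", out_name(i))).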

We can make a second function using the data.table package, which is renowned for its speed. Note that we stay in data.table syntax throughout: `melt.data.table()` expects a data.table, and dplyr's `select()` can hand it back a plain data frame.

# data.table function
library(data.table)

fread_clean_fwrite = function(i){
   paste0("./temp/raw/", i) %>%
      fread() %>%
      .[, .(date, a, b, c)] %>%
      melt.data.table(id.vars = "date") %>%
      fwrite(paste0("./temp/clean/", i))
}

Finally, we can run either of the above two functions in a for loop (usually frowned upon), with lapply, or something else:

# Loop
for (j in f){
   read_clean_write(j)
}

# lapply
lapply(f, read_clean_write)

But how fast were they? Can we get faster? Thankfully R provides `system.time()` for timing code execution. In order to get faster, it makes sense to use all the processing power our machines have. The 'parallel' package has some great tools to help us run our jobs in parallel and take advantage of multicore processing. My favourite is `mclapply()`, because it is very, very easy to take an `lapply` and make it multicore. Note that `mclapply()` relies on forking, so it doesn't work on Windows (see the `parLapply()` sketch after the timings). The following script times `read_clean_write()` and `fread_clean_fwrite()` run via `purrr::walk()`, a for loop (boo, hiss), `lapply()` and `mclapply()`. I've run these as list elements to make life easier later on.

library(parallel)

# Time each approach once
times = list(
   purrr=system.time(
      walk(f, read_clean_write)
   ),
   forloop=system.time(
      for (j in f){
         read_clean_write(j)
      }
   ),
   lapply = system.time(
      lapply(f, read_clean_write)
   ),
   lapply_data.table=system.time(
      lapply(f, fread_clean_fwrite)
   ),
   mclapply = system.time(
      mclapply(f, read_clean_write, mc.cores=4)
   ),
   mclapply_data.table = system.time(
      mclapply(f, fread_clean_fwrite, mc.cores=4)
   )
)
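As mentioned, mclapply() relies on forking and so isn't available on Windows. A rough Windows-friendly equivalent uses a socket cluster with parLapply(); this is a sketch only and isn't included in the timings. Note that each worker starts as a blank R session, so packages have to be loaded on the workers:

# PSOCK cluster: works on Windows, but has its own start-up overhead
cl = makeCluster(4)
clusterEvalQ(cl, library(tidyverse))   # load packages on each worker
parLapply(cl, f, read_clean_write)
stopCluster(cl)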

Next we can plot up these results. I'm using sapply to pull just the elapsed time from each proc_time object, then stripping the '.elapsed' suffix that sapply adds to the vector names.

[Figure: bar plot of elapsed time for each method]

x = sapply(times, function(i){i["elapsed"]})
names(x) = substr(names(x), 1, nchar(names(x)) - 8)

par(mar=c(5, 8.5, 4, 2) + 0.1)
barplot(x, names.arg=names(x),
        main="Elapsed time on an Intel i5 4460 with 4 cores at 3.2GHz",
        xlab="Seconds", horiz=T, las=1)
par(mar=c(5, 4, 4, 2) + 0.1)

Unsurprisingly the mclapply versions are the clear winners: the work is spread across four cores instead of one, so unless the job is trivial (or limited by disk I/O) the parallel version will usually come out ahead. Pairing mclapply with data.table's fread/fwrite is faster still.
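Rather than hard-coding four cores, you can ask the machine how many it has. A small sketch: detectCores() counts logical cores, so leaving one free keeps the computer responsive.

# Use all but one of the available cores
n_cores = max(1, detectCores() - 1)
mclapply(f, read_clean_write, mc.cores=n_cores)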

Having run this code a few times, I noticed the results are not consistent. Because everything is scripted, we can examine that variability. I've done this by running each method 100 times:

# Run each method 100 times
times = lapply(1:100, function(i){
   list(
      purrr=system.time(
         walk(f, read_clean_write)
      ),
      forloop=system.time(
         for (j in f){
            read_clean_write(j)
         }
      ),
      lapply = system.time(
         lapply(f, read_clean_write)
      ),
      lapply_data.table=system.time(
         lapply(f, fread_clean_fwrite)
      ),
      mclapply = system.time(
         mclapply(f, read_clean_write, mc.cores=4)
      ),
      mclapply_data.table = system.time(
         mclapply(f, fread_clean_fwrite, mc.cores=4)
      )
   )
})
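Incidentally, as one commenter notes below, the microbenchmark package automates this kind of repeated timing and summarises the results for you. A minimal sketch of the same idea:

library(microbenchmark)
microbenchmark(
   forloop = for (j in f) read_clean_write(j),
   lapply = lapply(f, read_clean_write),
   mclapply = mclapply(f, read_clean_write, mc.cores=4),
   times = 10
)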

x = lapply(times, function(i){
   y = sapply(i, function(k){k["elapsed"]})
   names(y) = substr(names(y), 1, nchar(names(y))-8)
   y
})

x = lapply(seq_along(x), function(i){
   data.frame(run=i,
              purrr=x[[i]]["purrr"],
              forloop=x[[i]]["forloop"],
              lapply=x[[i]]["lapply"],
              lapply_data.table=x[[i]]["lapply_data.table"],
              mclapply=x[[i]]["mclapply"],
              mclapply_data.table=x[[i]]["mclapply_data.table"])
})
x = do.call("rbind.data.frame", x)
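(dplyr's bind_rows(x) would collapse the list of data frames in a single step, if you prefer.)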

My poor computer! Now we can plot these results up. I’ve chosen violin plots to help us see the distribution of results:

[Figure: violin plots of 100 runs comparing read, transform and write times for each method]

png("./temp/violin.png", height=500, width=1000)
x %>%
   gather(variable, value, -run) %>%
   ggplot(aes(variable, value)) +
   geom_violin(fill="grey70") +
   labs(title="100 runs comparing read, transform and write in R",
        x="",
        y="Seconds") +
   coord_flip() +
   theme_bw() +
   theme(text = element_text(size=20),
         panel.grid=element_blank())
dev.off()
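One gotcha: if you run this from a sourced script rather than interactively, wrap the ggplot pipeline in print(), otherwise nothing is drawn to the device before dev.off().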

Then, we can pull out median values for each:

x %>%
   gather(variable, value, -run) %>%
   group_by(variable) %>%
   summarise(med=median(value)) %>%
   arrange(med)

Method               Median time (seconds)
purrr                1.85
lapply               1.84
forloop              1.83
lapply_data.table    0.83
mclapply             0.68
mclapply_data.table  0.36

Finally, what do I recommend you use? Occasionally I need to use a for loop (less than once a year), because using lapply is too difficult. It's nice to see I'm probably not incurring much penalty for this heresy. Generally, I believe lapply is an excellent solution, not least because if I need a speed boost I need only call `library(parallel)`, tell mclapply how many cores to use, and I'm away. As it seems to be shout-out season (isn't it always in the great R community?!), this book on efficient R programming by Colin Gillespie and Robin Lovelace is excellent!

9 Comments

  1. dplyr is not the right tidyverse tool for the read_csv and write_csv parts of this task. Those two functions will not accept a vector of filenames.
    The tidyverse tool to use is purrr.
    purrr::walk() will iterate over the filenames in f. It probably won’t be faster than lapply or mclapply.
    You could use it exactly how you’ve used lapply and the function you created:

    walk(f, read_clean_write)

    or without creating a function like this:

    walk(f, ~ paste0("./temp/raw/", .x) %>%
       read_csv() %>%
       select(date, a, b, c) %>%
       gather(variable, value, -date) %>%
       write_csv(paste0("./temp/clean/", .x)))

  2. Thanks for a great post. In regards to your question on writing multiple CSV files, the tidyverse approach would be to use the purrr package which is a part of the core tidyverse. As I’m rather new to using purrr, I’m afraid my explanations would serve you poorly, but http://purrr.tidyverse.org/ is a great starting resource. The following approach, using the map() and walk2() functions from purrr, does the trick:
    library(tidyverse)
    library(here)
    df_list <- map(.x = f, .f = ~read_csv(here::here("temp", "raw", .x)))
    names(df_list) <- f
    df_list %>%
       map(~select(., date, a, b, c)) %>%
       walk2(.y = names(.), ~write_csv(x = .x, path = here::here("temp", "clean", .y)))

  3. Hmmm… that should read:
    library(tidyverse)
    library(here)
    df_list <- map(.x = f, .f = ~read_csv(here::here("temp", "raw", .x)))
    names(df_list) <- f
    df_list %>%
       map(~select(., date, a, b, c)) %>%
       walk2(.y = names(.), ~write_csv(x = .x, path = here::here("temp", "clean", .y)))

  4. Hi,

    1. microbenchmark::microbenchmark does the repeated testing with a nice overview for free. 🙂
    2. You won’t lose much time in the loop part for just about any real-life workload. It’s about the efficiency of reading and melting the data. For larger files it’s also about keeping IO and CPU working at full speed.
    3. As always, data.table is faster:

    Unit: milliseconds
                                     expr       min        lq      mean    median        uq       max neval cld
     for (j in f) { read_clean_write(j) } 2532.5078 2555.4614 2623.5843 2630.8622 2643.1606 2827.8491    10   b
       for (j in f) fread_clean_fwrite(j)  681.6711  704.0201  751.9411  759.9708  780.9979  839.2101    10  a

    I just use fread instead of read_csv, melt.data.table instead of gather and fwrite instead of write_csv.

    fread_clean_fwrite <- function(file) paste0("./temp/raw/", file) %>%
       fread %>%
       .[, .(date, a, b, c)] %>%
       melt.data.table(id.vars = "date") %>%
       fwrite(paste0("./temp/clean/", file))

    4. Even better parallel:

    Unit: milliseconds
                                      expr       min        lq      mean    median        uq       max neval cld
      for (j in f) { read_clean_write(j) } 2631.6450 2649.4903 2687.9149 2658.6920 2702.2436 2801.7455    10    c
        for (j in f) fread_clean_fwrite(j)  660.8406  703.7845  751.5452  767.5334  788.0965  799.4225    10   b
     parLapply(cl, f, fread_clean_fwrite)   227.3648  233.9839  244.5252  242.1347  253.1194  267.0709    10  a

    5. Please don’t use T for TRUE. It’s not the same 🙂

    > T <- FALSE
    > T == TRUE
    [1] FALSE

    1. Thanks for your comment. I do keep meaning to investigate data.table, but there are only so many hours in the day!

      I did expect there to be a benchmarking package, but that wasn’t the original purpose of the post and there’s a certain satisfaction (and learning) from writing a quick script to do the work yourself.

      Yes, TRUE/FALSE are reserved and T/F are pre-assigned global variables. I don’t see a problem with their use, YMMV.
