A quick primer on split-apply-combine problems

16th December, 2011 richierocks

I’ve just answered my hundred billionth question on Stack Overflow that goes something like

I want to calculate some statistic for lots of different groups.

Although these questions provide a steady stream of easy points, it's such a common and basic data analysis concept that I thought it would be useful to have a document to refer people to.

First off, you need to get your data in the right format. The canonical form in R is a data frame with one column containing the values to calculate a statistic for and another column containing the group to which each value belongs. A good example is the InsectSprays dataset, built into R.

head(InsectSprays)
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A
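If your data arrive with one column per group instead, base R's stack function is one way to get to this long form. The following is only a sketch: the `wide` data frame and its values are made up for illustration, and the column names are chosen to match InsectSprays.

```r
# Hypothetical wide-format data: one column of counts per spray
wide <- data.frame(A = c(10, 7, 20), B = c(11, 17, 21))

# stack() puts all the values in one column and the
# originating column names in a second (factor) column
long <- stack(wide)

# Rename the default columns (values, ind) to match InsectSprays
names(long) <- c("count", "spray")
long
```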

These problems are widely known as split-apply-combine problems after the three steps involved in their solution. Let’s go through it step by step.

First, we split the count column by the spray column.

(count_by_spray <- with(InsectSprays, split(count, spray)))
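To see what the split step produced, it can help to inspect the result. This short check assumes nothing beyond the built-in dataset: split returns a list with one element per level of spray, each holding that group's counts.

```r
count_by_spray <- with(InsectSprays, split(count, spray))

# One list element per spray
names(count_by_spray)

# Number of observations in each group
lengths(count_by_spray)
```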

Secondly, we apply the statistic to each element of the list. Let's use the mean here.

(mean_by_spray <- lapply(count_by_spray, mean))

Finally, (if possible) we recombine the list as a vector.

unlist(mean_by_spray)
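The "if possible" matters when the statistic returns more than one number per group. As a rough sketch, using range as the example statistic (my choice, not part of the original walkthrough), you can combine the pieces into a matrix by row-binding them instead of unlisting.

```r
count_by_spray <- with(InsectSprays, split(count, spray))

# range() returns two values per group, so unlist() would
# flatten them into one long vector; rbind keeps them paired
range_by_spray <- lapply(count_by_spray, range)
range_matrix <- do.call(rbind, range_by_spray)
colnames(range_matrix) <- c("min", "max")
range_matrix
```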

This procedure is such a common thing that there are many functions to speed up the process. sapply and vapply do the last two steps together.

sapply(count_by_spray, mean)
vapply(count_by_spray, mean, numeric(1))
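The difference between the two is that sapply guesses the output shape, whereas vapply checks each result against the template you supply. A small sketch of that behaviour, again using range as an illustrative statistic that returns two values:

```r
count_by_spray <- with(InsectSprays, split(count, spray))

# sapply quietly simplifies length-2 results to a matrix
range_matrix <- sapply(count_by_spray, range)
dim(range_matrix)

# vapply refuses, because range() does not match numeric(1);
# the error is caught here so the script keeps running
result <- tryCatch(
  vapply(count_by_spray, range, numeric(1)),
  error = function(e) "vapply rejected the result"
)
result
```

The stricter check makes vapply the safer choice inside functions, where a silently changed output shape is harder to spot.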

We can do even better than that, however. tapply, aggregate and by all provide a one-function solution to these S-A-C problems.

with(InsectSprays, tapply(count, spray, mean))
with(InsectSprays, by(count, spray, mean))
aggregate(count ~ spray, InsectSprays, mean)
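These functions compute the same numbers but hand them back in different containers, which is usually what decides between them. A quick comparison, as a sketch:

```r
mean_tapply <- with(InsectSprays, tapply(count, spray, mean))
mean_aggregate <- aggregate(count ~ spray, InsectSprays, mean)

# tapply gives a named array; aggregate gives a data frame
class(mean_tapply)
class(mean_aggregate)

# The group means themselves agree
all.equal(as.vector(mean_tapply), mean_aggregate$count)
```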

The plyr package also provides several solutions, with a choice of output format. ddply takes a data frame and returns another data frame, which is what you’ll want most of the time. dlply takes a data frame and returns the uncombined list, which is useful if you want to do another processing step before combining.

library(plyr)
ddply(InsectSprays, .(spray), summarise, mean.count = mean(count))
dlply(InsectSprays, .(spray), summarise, mean.count = mean(count))
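As an illustration of that "another processing step" idea, here is a sketch that assumes plyr is installed. The choice of range and diff is mine, purely for demonstration: dlply keeps the per-group ranges as a list, and a second step turns each range into a spread before combining.

```r
library(plyr)

# Split and apply, but keep the uncombined list
ranges_by_spray <- dlply(InsectSprays, .(spray), function(d) range(d$count))

# Extra processing step on each piece, then combine to a vector
spread_by_spray <- sapply(ranges_by_spray, diff)
spread_by_spray
```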

You can read much more on this type of problem and the plyr solution in The Split-Apply-Combine Strategy for Data Analysis, in the Journal of Statistical Software, by the ubiquitous Hadley Wickham.

One tiny variation on the problem is when you want the output statistic vector to have the same length as the original input vectors. For this, there is the ave function (which provides mean as the default function).

with(InsectSprays, ave(count, spray))
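A typical use for this is centring each value on its own group's mean. The sketch below also shows swapping in a different statistic; note that the function has to be passed by name as FUN, since unnamed arguments after the first are treated as grouping variables.

```r
# Group mean repeated for every row of the original data
group_mean <- with(InsectSprays, ave(count, spray))

# Centre each count on the mean of its own spray
centred <- InsectSprays$count - group_mean

# A different statistic: FUN must be named
group_max <- with(InsectSprays, ave(count, spray, FUN = max))
head(data.frame(InsectSprays, group_mean, centred, group_max))
```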

Tags: apply, combine, plyr, r, split, statistics

Anonymous

16th December, 2011 at 23:47 pm

Hi Richie

Thanks for this run-through – having the different options listed together gives more insight into how R does things.

Kevin
- 13th April, 2013 at 6:55 am
  
  And, why is replications=1000? With total time for 1000 runs taking 215 seconds in the worst case, that’s only 0.215 seconds for one run of the test. So it appears that this benchmark, as presented, is merely finding significant differences of insignificant times. In other words, it’s comparing very small and insignificant timings. Suggestion: report the fastest of (just) 3 runs on a _large_ dataset. Or some other benchmark where a single run does actually take a long time. Significant differences of _significant_ times are of interest.
soundray

17th December, 2011 at 11:43 am

This is very useful, thanks. Small suggestion: “ddply takes a data frame and reurns the uncombined list” — this should be “dlply” and “returns”.
- richierocks
  
  19th December, 2011 at 10:40 am
  
  Glad you liked it. Typos now fixed.
Anonymous

17th December, 2011 at 22:26 pm

Bad link:

https://4dpiecharts.com/2011/12/16/a-quick-primer-on-split-apply-combine-problems/www.jstatsoft.org/v40/i01

->

http://www.jstatsoft.org/v40/i01
- richierocks
  
  19th December, 2011 at 10:40 am
  
  Ta. Link now fixed.
- 11th February, 2013 at 20:54 pm
  
  Did you purposely write a non-vectorized version of your function? If not, a vectorized version is only 30-40% slower than your Rcpp version (which could probably be faster with some profiling). Still nice, but far from Priceless.

  vaccinateVectorized <- function(age, female, ily) {
    p <- (0.25 + 0.3 * 1/(1-exp(0.04 * age)) + 0.1 * ily) * (0.75 + female * 0.5)
    # replaces slow ifelse()
    # vectorized max/min, documented in ?max
    p <- pmax(0,p)
    p <- pmin(1,p)
    data.frame(age, female, ily, p)
  }
  do_vectorized <- function(df) {
    vaccinateVectorized(df$age, df$female, df$ily)
  }
  identical(do_forloop(cohort), do_vectorized(cohort)) # TRUE
  benchmark(do_forloop(cohort), do_rcpp(cohort), do_vectorized(cohort))
  - 7th August, 2013 at 5:03 am
    
    Great post! I always see plyr, ddply, and data.table used in responses on both stack overflow and the R help list. It’s great to have an explanation of these packages and functions.
11th February, 2013 at 17:58 pm

Hi Josh,

Thanks. I tried your suggestion and I got a different problem to you!

First the compilation error:

Error in compileCode(f, code, language = language, verbose = verbose) :
  Compilation ERROR, function(s)/method(s) not created! cygwin warning:

Then some warnings about msdos style paths.

Then the relevant error messages, about comparison between signed and unsigned integers:

file1fdc365c273.cpp:1:0: sorry, unimplemented: 64-bit mode not compiled in
file1fdc365c273.cpp: In function ‘SEXPREC* file1fdc365c273(SEXPREC*)’:
file1fdc365c273.cpp:58:38: warning: comparison between signed and unsigned integer expressions
make: *** [file1fdc365c273.o] Error 1

4D Pie Charts

Richie Cotton
