## useR2013: The caret tutorial

This afternoon I went to Max Kuhn’s tutorial on his caret package. caret stands for Classification And REgression Training. It provides a consistent interface to nearly 150 different models in R, in much the same way as the plyr package provides a consistent interface to the apply functions.

The basic usage of caret is to split your data into training and test sets.

```r
my_data <- split(my_data, runif(nrow(my_data)) > p) # for some value of p
names(my_data) <- c("training", "testing")
```

Then call `train` on your training set.

```r
training_model <- train(
  response ~ .,
  data = my_data$training,
  method = "a type of model!"
)
```

Then make predictions using `predict`.

```r
predictions <- predict(training_model, my_data$testing)
```
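Putting the steps together, here’s a minimal end-to-end sketch on the built-in iris data (the dataset, the `method = "rpart"` choice and the seed are my own illustration, not from the tutorial):

```r
library(caret)

# Split iris into training and test sets; createDataPartition is
# caret's own stratified splitting helper, which keeps the class
# proportions balanced between the two sets.
set.seed(42)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# Fit a classification tree (method = "rpart").
model <- train(Species ~ ., data = training, method = "rpart")

# Predict on the held-out data and tabulate the results.
predictions <- predict(model, testing)
table(predictions, testing$Species)
```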

So the basic usage is very simple. The devil is of course in the statistical details. You still have to choose (at least one) type of model, and there are many options for how those models should be fit.

Max suggested that a good strategy for modelling is to begin with a powerful black box method (boosting, random forests or support vector machines) since they can usually provide excellent fits. The next step is to use a simple, understandable model (some form of regression perhaps) and see how much predictive power you are losing.
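That strategy can be sketched with caret itself. Here I compare a random forest against a plain linear regression on the built-in mtcars data (my own choice of dataset and resampling setup, assuming the randomForest package is installed):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Black-box model first: a random forest.
set.seed(1)
rf_model <- train(mpg ~ ., data = mtcars, method = "rf", trControl = ctrl)

# Then a simple, understandable model: ordinary linear regression.
set.seed(1)
lm_model <- train(mpg ~ ., data = mtcars, method = "lm", trControl = ctrl)

# resamples() collects the cross-validated performance of both models,
# so you can see how much predictive power the simple model loses.
summary(resamples(list(random_forest = rf_model, linear = lm_model)))
```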

I suspect that in order to get the full benefit of caret, I’ll need to read Max’s book: Applied Predictive Modeling.

## A little Christmas Present for you

Here’s an excerpt from my chapter “Blood, sweat and urine” from The Bad Data Handbook. Have a lovely Christmas!

I spent six years working in the statistical modeling team at the UK’s Health and Safety Laboratory. A large part of my job was working with the laboratory’s chemists, looking at occupational exposure to various nasty substances to see if an industry was adhering to safe limits. The laboratory gets sent tens of thousands of blood and urine samples each year (and sometimes more exotic fluids like sweat or saliva), and has its own team of occupational hygienists who visit companies and collect yet more samples.

The sample collection process is known as “biological monitoring.” This is because when the occupational hygienists get home and their partners ask “How was your day?,” “I’ve been biological monitoring, darling” is more respectable to say than “I spent all day getting welders to wee into a vial.”

In 2010, I was lucky enough to be given a job swap with James, one of the chemists. James’s parlour trick is that, after running many thousands of samples, he can tell the level of creatinine in someone’s urine with uncanny accuracy, just by looking at it. This skill was only revealed to me after we’d spent an hour playing “guess the creatinine level” and James had suggested that “we make it more interesting.” I’d lost two packets of fig rolls before I twigged that I was onto a loser.

The principle of the job swap was that I would spend a week in the lab assisting with the experiments, and then James would come to my office to help out generating the statistics. In the process, we’d both learn about each other’s working practices and find ways to make future projects more efficient.

In the laboratory, I learned how to pipette (harder than it looks), and about the methods used to ensure that the numbers spat out of the mass spectrometer were correct. So as well as testing urine samples, within each experiment you need to test blanks (distilled water, used to clean out the pipes, and also to check that you are correctly measuring zero), calibrators (samples of a known concentration for calibrating the instrument), and quality controllers (samples with a concentration in a known range, to make sure the calibration hasn’t drifted). On top of this, each instrument needs regular maintaining and recalibrating to ensure its accuracy.

Just knowing that these things have to be done to get sensible answers out of the machinery was a small revelation. Before I’d gone into the job swap, I didn’t really think about where my data came from; that was someone else’s problem. From my point of view, if the numbers looked wrong (extreme outliers, or otherwise dubious values) they were a mistake; otherwise they were simply “right.” Afterwards, my view is more nuanced. Now all the numbers look like, maybe not quite a guess, but certainly only an approximation of the truth. This measurement error is important to remember, though for health and safety purposes, there’s a nice feature. Values can be out by an order of magnitude at the extreme low end for some tests, but we don’t need to worry so much about that. It’s the high exposures that cause health problems, and measurement error is much smaller at the top end.

## A quick primer on split-apply-combine problems

I’ve just answered my hundred billionth question on Stack Overflow that goes something like

> I want to calculate some statistic for lots of different groups.

Although these questions provide a steady stream of easy points, it’s such a common and basic data analysis concept that I thought it would be useful to have a document to refer people to.

First off, you need your data in the right format. The canonical form in R is a data frame with one column containing the values to calculate a statistic for, and another column containing the group to which each value belongs. A good example is the InsectSprays dataset, built into R.

```r
head(InsectSprays)
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A
```

These problems are widely known as split-apply-combine problems after the three steps involved in their solution. Let’s go through it step by step.

First, we split the `count` column by the `spray` column.

```r
(count_by_spray <- with(InsectSprays, split(count, spray)))
```

Secondly, we apply the statistic to each element of the list. Let’s use the `mean` here.

```r
(mean_by_spray <- lapply(count_by_spray, mean))
```

Finally, (if possible) we recombine the list as a vector.

```r
unlist(mean_by_spray)
```

This procedure is such a common thing that there are many functions to speed up the process. `sapply` and `vapply` do the last two steps together.

```r
sapply(count_by_spray, mean)
vapply(count_by_spray, mean, numeric(1))
```

We can do even better than that, however. `tapply`, `aggregate` and `by` all provide a one-function solution to these S-A-C problems.

```r
with(InsectSprays, tapply(count, spray, mean))
with(InsectSprays, by(count, spray, mean))
aggregate(count ~ spray, InsectSprays, mean)
```

The `plyr` package also provides several solutions, with a choice of output format. `ddply` takes a data frame and returns another data frame, which is what you’ll want most of the time. `dlply` takes a data frame and returns the uncombined list, which is useful if you want to do another processing step before combining.

```r
ddply(InsectSprays, .(spray), summarise, mean.count = mean(count))
dlply(InsectSprays, .(spray), summarise, mean.count = mean(count))
```

You can read much more on this type of problem and the `plyr` solution in “The Split-Apply-Combine Strategy for Data Analysis”, in the Journal of Statistical Software, by the ubiquitous Hadley Wickham.

One tiny variation on the problem is when you want the output statistic vector to have the same length as the original input vectors. For this, there is the `ave` function (which provides `mean` as the default function).

```r
with(InsectSprays, ave(count, spray))
```

## More useless statistics

Over at the ExploringDataBlog, Ron Pearson just wrote a post about the cases when means are useless. In fact, it’s possible to calculate a whole load of stats on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are doing an intro to stats course, you will see this example soon) is the Anscombe quartet.

The data set is available in R as `anscombe`, but it requires a little reshaping to be useful.

```r
anscombe2 <- with(anscombe, data.frame(
  x = c(x1, x2, x3, x4),
  y = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))
```

Note the use of `gl` to autogenerate factor levels.

So we have four sets of x-y data, which we can easily calculate summary statistics from using `ddply` from the `plyr` package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and run a linear regression.

```r
library(plyr)
(stats <- ddply(anscombe2, .(group), summarize,
  mean         = mean(y),
  std_dev      = sd(y),
  correlation  = cor(x, y),
  lm_intercept = lm(y ~ x)$coefficients[1],
  lm_x_effect  = lm(y ~ x)$coefficients[2]
))
  group     mean  std_dev correlation lm_intercept lm_x_effect
1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091
```

Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)

```r
library(ggplot2)
(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
)
```

Each dataset is *really* different – the statistics we routinely calculate don’t fully describe the data. Which brings me to the second statistics joke.

A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.