How R will turn into SQL

16th July, 2013 14 comments

Up until very recently the only way of running R code was through the standard R distribution. Of course you could use another IDE, but somewhere underneath it all you would be running the same, standard R engine from the R-core team.

This is no longer your only option. A couple of weeks ago Radford Neal released pqR, a research project that reimplements pieces of the R engine to make it “pretty quick”. Much of pqR is likely to be folded back into the main R engine, but that isn’t the only new player in town.

A team at Tibco, including software architect Michael Sannella, have rewritten the R engine from the ground up for TERR, a commercial R distribution. This leads to some very interesting possibilities. Maybe in the next few years we could see many more R engines appearing. Oracle have recently invested heavily in R, and it’s certainly imaginable that they could create a high performance R that is tightly coupled to their database products. Google are also big R investors, and I could easily see them creating an R engine that has parallelism built in.

After that, perhaps even Microsoft might notice R and fix the fact that its integration with .NET is rubbish. IronR, anyone?

This situation isn’t as far-fetched as you may think: there is a very obvious precedent in the data analysis world. There are dozens of different database vendors that use a common interface language: SQL. In the world of databases, the engine is separate from the programming language.

This is a good thing – multiple vendors provide competition, differentiation and innovation. You have a full spectrum of products from big corporate databases (Oracle again) down to miniature data stores like SQLite.

The fact that SQL has led the way means that some of the potential traps are visible before we embark down this road. My biggest bugbear with SQL is that it isn’t quite universal: different vendors have their own non-standard extensions, which means that not all SQL code is portable. A situation like that of HTML, C++ or Fortran, where a standards committee defines an official version of the language, would be preferable. Whether R-core or another body would set this standard is a matter to be decided. (I suspect that R-core would not welcome the additional administrative burden, and commercial vendors may want more control in defining the spec, so a separate R-standards group is more likely.)

These are interesting times for R, and I look forward to seeing how the separation of language and engine progresses.


  1. A sloppy sentence in a previous version of this post made it sound like Michael Sannella ran Tibco. He’s actually a software architect.
  2. The future is happening faster than we think. Renjin and Riposte are two other R engines.

useR2013: The caret tutorial

9th July, 2013 Leave a comment

This afternoon I went to Max Kuhn’s tutorial on his caret package. caret stands for Classification And REgression Training. It provides a consistent interface to nearly 150 different models in R, in much the same way as the plyr package provides a consistent interface to the apply functions.

The basic usage of caret is to split your data into training and test sets.

my_data <- split(my_data, runif(nrow(my_data)) > p) # p is roughly the fraction used for training
names(my_data) <- c("training", "testing")          # split() puts the FALSE (runif <= p) group first
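
To make that concrete, here is the same split run on a built-in dataset. The choice of iris, the seed and p = 0.8 are mine for illustration; note that split() returns the FALSE group (the rows where runif(n) <= p) first, so "training" ends up holding roughly a fraction p of the rows.

```r
set.seed(42)  # for reproducibility
p <- 0.8      # approximate fraction of rows used for training

my_data <- split(iris, runif(nrow(iris)) > p)
names(my_data) <- c("training", "testing")

nrow(my_data$training) + nrow(my_data$testing) # 150: every row lands in exactly one set
```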

Then call train on your training set.

training_model <- train(
  response ~ ., 
  data   = my_data$training,
  method = "a type of model!")

Then predict it using predict.

predictions <- predict(training_model, my_data$testing)

So the basic usage is very simple. The devil is of course in the statistical details. You still have to choose (at least one) type of model, and there are many options for how those models should be fit.

Max suggested that a good strategy for modelling is to begin with a powerful black box method (boosting, random forests or support vector machines) since they can usually provide excellent fits. The next step is to use a simple, understandable model (some form of regression perhaps) and see how much predictive power you are losing.

I suspect that in order to get the full benefit of caret, I’ll need to read Max’s book: Applied Predictive Modeling.

useR2013: The Rcpp tutorial

9th July, 2013 Leave a comment

I’m at useR 2013, and this morning I attended Hadley Wickham and Romain Francois’s tutorial on the Rcpp package for calling C++ code from R. I’ve spent the last eight years avoiding C++ after having nightmares about obscure pointer bugs, so I went into the room slightly skeptical about this package.

I think the most important takeaway from the tutorial was a clear sense of when and why you might want to use C++.

The main selling point for using C++ with R is, in Hadley’s words, that R is optimised for making programmers efficient whereas C++ is optimised for making machines efficient, so the languages are complementary. That is, most of the time the slow part of doing statistics is you. Occasionally, however, the slow part will be running your code, and in those instances C++ is better than R.

In order to write fast R code, it needs to be vectorised, and that often means using different functions to the scalar versions. A classic example is using ifelse instead of separate if and else blocks, or using pmax instead of max.
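
For instance, a small base-R illustration of those two examples:

```r
x <- c(-2, 3, -5, 7)

# Vectorised conditional: one ifelse() call replaces an if/else inside a loop
signs <- ifelse(x > 0, "positive", "non-positive")

# Parallel maximum: pmax() works element-wise, whereas max(x, 0) returns a single number
clipped <- pmax(x, 0)

signs    # "non-positive" "positive" "non-positive" "positive"
clipped  # 0 3 0 7
```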

Knowing how to vectorise R code thus requires quite a large vocabulary of functions. In C++ there is no vectorisation – you just write a for loop.

There are three things in particular that C++ does much faster than R: the above-mentioned looping, resizing vectors, and calling functions. (For the last point, Hadley quoted an overhead of 2ns to call a function in C++ versus 200ns in R.)

This means that C++ is useful for the following restricted use cases:

  1. When vectorisation is difficult or impossible. This is common when one element of a vector depends upon previous elements. MCMC is a classic example.
  2. When you are changing the size of a vector in a loop. Run length encoding was the example given.
  3. When you need to make millions of function calls. Recursive functions and some optimisation and simulation problems fit this category.

Typically C++ can give you an order of magnitude or two of speedup over an R equivalent, but this is wildly problem-dependent, and many of the built-in functions call C code which will run at (more or less) the same speed as a C++ version. It’s also important to consider how often the code will be run. Even if you have a thousand-fold speedup, if the running time of the R function is 0.01s, then you need to run it 60000 times just to get back the 10 minutes it took you to rewrite it in C++.
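
The break-even arithmetic in that last sentence can be checked directly (these are the figures from the paragraph, not measurements):

```r
r_time       <- 0.01            # seconds per run of the R version
speedup      <- 1000            # assumed thousand-fold C++ speedup
cpp_time     <- r_time / speedup
rewrite_cost <- 10 * 60         # ten minutes spent on the C++ rewrite, in seconds

# Number of runs needed before the time saved per run repays the rewrite
break_even <- rewrite_cost / (r_time - cpp_time)
round(break_even) # 60060, i.e. roughly 60000 runs
```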

Anyway, using Rcpp makes it surprisingly simple to call C++ code. You need to install Rtools under Windows (the installr package can help with that), and of course the Rcpp package itself.

install.packages(c("installr", "Rcpp"))

Check that Rcpp is working by seeing if the following expression returns 2.

library(Rcpp)
evalCpp("1 + 1")

Then you can create a C++ function using the cppFunction function. Here’s a reimplementation of the any function. (Although it doesn’t deal with missing values.)

cppFunction('
  bool Any(LogicalVector x) {
    for(int i = 0; i < x.size(); ++i) {
      if(x[i]) return true;
    }
    return false;
  }
')

Notice that in C++ you must be explicit about the types of variable that are passed into and returned from a function. How to write C++ is beyond the scope of this post, so I’ll say no more.

You can now call the Any function like this.

Any(runif(10) > 0.5) # TRUE, almost surely
Any(runif(10) > 1.5) # always FALSE

A brainfuck interpreter for R

24th April, 2013 2 comments

The deadline for my book on R is fast approaching, so naturally I’m in full procrastination mode.  So much so that I’ve spent this evening creating a brainfuck interpreter for R.  brainfuck is a very simple programming language: you get an array of 30000 bytes, an index, and just eight commands.  You move the index left or right along the array with < and >; increase or decrease the value at the current position with + and -; read and write characters using . and ,; and start and end loops with [ and ].

There seem to be two approaches to creating a brainfuck interpreter: directly execute the commands, or generate code in a sensible language and execute that. I’ve opted for the latter approach because it’s easier, at least in R. Generating R code and then calling eval is probably a little slower than directly executing commands, but that’s the least of your worries with brainfuck. Even writing a trivial page-long program will take you many million times longer than it takes to execute.
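
As a toy sketch of that second approach (this is not the package’s actual internals; bf_to_r and run_bf are made-up names, and the input command , and pointer bounds checks are omitted for brevity), each brainfuck command can be mapped to a line of R, pasted together, and evaluated:

```r
# Translate a brainfuck program into a string of R code
bf_to_r <- function(program) {
  lookup <- c(
    ">" = "i <- i + 1",
    "<" = "i <- i - 1",
    "+" = "tape[i] <- tape[i] + 1L",
    "-" = "tape[i] <- tape[i] - 1L",
    "." = "cat(rawToChar(as.raw(tape[i])))",
    "[" = "while(tape[i] != 0) {",
    "]" = "}"
  )
  commands <- strsplit(program, "")[[1]]
  lines <- lookup[commands]          # non-command characters map to NA
  paste(c("tape <- integer(30000); i <- 1", lines[!is.na(lines)]),
        collapse = "\n")
}

run_bf <- function(program) {
  eval(parse(text = bf_to_r(program)))
}

run_bf("+++++ +++[>+++++ +++<-]>+++++ +++.") # prints "H" (8 * 8 + 8 = 72, ASCII 'H')
```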

The fact that you have to mix data variables (that 30000 element raw vector and an index) with commands means that an object oriented approach is useful. The whole interpreter is stored in a single reference class, of type brainfuck. Rather than me showing you all the code here, I suggest that you take a look at it (or clone it) from its repository on bitbucket. (I’ll submit to CRAN soon.)

Here’s a Hello World example taken from Wikipedia. To use the brainfuck package, you just create/import your brainfuck program as a character vector (non-command characters are ignored, so you can comment your code). Call fuckbrain once to create the interpreter variable, then call its interpret method on each program that you want to run.

hello_world <- "+++++ +++++  initialize counter (cell #0) to 10
[                            use loop to set the next four cells to 70/100/30/10
    > +++++ ++               add  7 to cell #1
    > +++++ +++++            add 10 to cell #2 
    > +++                    add  3 to cell #3
    > +                      add  1 to cell #4
    <<<< -                   decrement counter (cell #0)
]                            end the loop when the counter reaches zero
> ++ .                       print 'H'
> + .                        print 'e'
+++++ ++ .                   print 'l'
.                            print 'l'
+++ .                        print 'o'
> ++ .                       print ' '
<< +++++ +++++ +++++ .       print 'W'
> .                          print 'o'
+++ .                        print 'r'
----- - .                    print 'l'
----- --- .                  print 'd'
> + .                        print '!'
> .                          print '\n'"
bfi <- fuckbrain()
bfi$interpret(hello_world)

A little Christmas Present for you

25th December, 2012 Leave a comment

Here’s an excerpt from my chapter “Blood, sweat and urine” from The Bad Data Handbook. Have a lovely Christmas!

I spent six years working in the statistical modeling team at the UK’s Health and Safety
Laboratory. A large part of my job was working with the laboratory’s chemists, looking
at occupational exposure to various nasty substances to see if an industry was adhering
to safe limits. The laboratory gets sent tens of thousands of blood and urine samples
each year (and sometimes more exotic fluids like sweat or saliva), and has its own team
of occupational hygienists who visit companies and collect yet more samples.
The sample collection process is known as “biological monitoring.” This is because when
the occupational hygienists get home and their partners ask “How was your day?,” “I’ve
been biological monitoring, darling” is more respectable to say than “I spent all day
getting welders to wee into a vial.”
In 2010, I was lucky enough to be given a job swap with James, one of the chemists.
James’s parlour trick is that, after running many thousands of samples, he can tell the
level of creatinine in someone’s urine with uncanny accuracy, just by looking at it. This
skill was only revealed to me after we’d spent an hour playing “guess the creatinine level”
and James had suggested that “we make it more interesting.” I’d lost two packets of fig
rolls before I twigged that I was onto a loser.

The principle of the job swap was that I would spend a week in the lab assisting with
the experiments, and then James would come to my office to help out generating the
statistics. In the process, we’d both learn about each other’s working practices and find
ways to make future projects more efficient.
In the laboratory, I learned how to pipette (harder than it looks), and about the methods
used to ensure that the numbers spat out of the mass spectrometer were correct. So as
well as testing urine samples, within each experiment you need to test blanks (distilled
water, used to clean out the pipes, and also to check that you are correctly measuring
zero), calibrators (samples of a known concentration for calibrating the instrument),
and quality controllers (samples with a concentration in a known range, to make sure
the calibration hasn’t drifted). On top of this, each instrument needs regular maintaining
and recalibrating to ensure its accuracy.
Just knowing that these things have to be done to get sensible answers out of the machinery was a small revelation. Before I’d gone into the job swap, I didn’t really think
about where my data came from; that was someone else’s problem. From my point of
view, if the numbers looked wrong (extreme outliers, or otherwise dubious values) they
were a mistake; otherwise they were simply “right.” Afterwards, my view is more
nuanced. Now all the numbers look like, maybe not quite a guess, but certainly only an
approximation of the truth. This measurement error is important to remember, though
for health and safety purposes, there’s a nice feature. Values can be out by an order of
magnitude at the extreme low end for some tests, but we don’t need to worry so much
about that. It’s the high exposures that cause health problems, and measurement error
is much smaller at the top end.


Have my old job!

14th November, 2012 Leave a comment

My old job at the Health & Safety Laboratory is being advertised, and at a higher pay grade to boot.  (Though it is still civil service pay, and thus not going to make you rich.)

You’ll need to have solid mathematical modelling skills, particularly solving systems of ODEs, and be proficient at writing scientific code, preferably R or MATLAB or acslX. From chats with a few people at the lab, management are especially keen to get someone who can bring in money so grant writing and blagging skills are important too.

It’s a smashing place to work and the people are lovely.  Also, you get flexitime and loads of holiday.  If you are looking for a maths job in North West* England then I can heartily recommend applying.

*Buxton is sometimes North West England (when we get BBC local news) and sometimes in the East Midlands (like when we vote in European elections).


Indexing with factors

8th November, 2012 1 comment

This is a silly problem that bit me again recently. It’s an elementary mistake that I’ve somehow repeatedly failed to learn to avoid in eight years of R coding. Here’s an example to demonstrate.

Suppose we create a data frame with a categorical column, in this case the heights of ten adults along with their gender.

(heights <- data.frame(
  height_cm = c(153, 181, 150, 172, 165, 149, 174, 169, 198, 163),
  gender    = c("female", "male", "female", "male", "male", "female", "female", "male", "male", "female")
))
Using a factory-fresh copy of R, the gender column will be stored as a factor with two levels: “female” and then “male”. This is all well and good, though the column can be kept as characters by setting stringsAsFactors = FALSE.

Now suppose that we want to assign a body weight to these people, based upon a gender average.

avg_body_weight_kg <- c(male = 78, female = 63)

Pop quiz: what does this next line of code give us?

avg_body_weight_kg[heights$gender]
Well, the first value of heights$gender is “female”, so the first value should be 63, and the second value of heights$gender is “male”, so the second value should be 78, and so on. Let’s try it.

avg_body_weight_kg[heights$gender]
#  male female   male female female   male   male female female   male 
#    78     63     78     63     63     78     78     63     63     78 

Uh-oh, the values are reversed. So what really happened? When you use a factor as an index, R silently converts it to an integer vector of its underlying level codes. “female” is the first level, so every “female” is converted to 1, picking out the first element of avg_body_weight_kg (the value 78, named “male”), and every “male” is converted to 2, picking out 63.

The fundamental problem is that there are two natural interpretations of a factor index – character indexing or integer indexing. Since these can give conflicting results, ideally R would provide a warning when you use a factor index. Until such a change gets implemented, I suggest that best practice is to always explicitly convert factors to integer or to character before you use them in an index.
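
A minimal sketch of that advice, reusing the avg_body_weight_kg lookup from above: converting to character indexes by name, while converting to integer makes the level-code behaviour explicit.

```r
avg_body_weight_kg <- c(male = 78, female = 63)
gender <- factor(c("female", "male", "female"))

# Character indexing: look values up by name (usually what you want)
by_name <- avg_body_weight_kg[as.character(gender)]

# Integer indexing: use the underlying level codes (female = 1, male = 2)
by_code <- avg_body_weight_kg[as.integer(gender)]

unname(by_name) # 63 78 63
unname(by_code) # 78 63 78
```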

