Archive

Posts Tagged ‘r’

Regular expressions for everyone else

25th September, 2014

Regular expressions are an amazing tool for working with character data, but they are also painful to read and write.  Even after years of working with them, I struggle to remember the syntax for negative lookahead, or which way round the start and end anchor symbols go.

Consequently, I’ve created the regex package for human-readable regular expression generation.  It’s currently only on GitHub (CRAN version arriving as soon as you give me feedback), so you can get it with:

library(devtools)
install_github("regex", "richierocks")

Before, if I wanted to find the names of all the operators in the base package, my workflow would be something like:

I need ls, with a pattern that matches punctuation.  So I open the ?regex help page and look for the character class for punctuation.  My first attempt is then:

ls(baseenv(), pattern = "[:punct:]")

Ok, wait, the class has to be wrapped in square brackets itself.

ls(baseenv(), pattern = "[[:punct:]]")

Better, but that’s matching S3 classes and some functions too.  I want to match only where there’s punctuation at the start.  What’s the anchor for the start?  Back to reading ?regex.  Sod it, there’s too much text here; it’s probably a dollar sign.

ls(baseenv(), pattern = "$[[:punct:]]")

Hmm, nope.  Must be a caret.

ls(baseenv(), pattern = "^[[:punct:]]")

Hurrah!  Still, it took me 5 minutes for a simple example.  For something more complicated like matching email addresses or telephone numbers or particular time formats, building regular expressions this way can become time consuming and frustrating.  Here’s the equivalent syntax using regex.

ls(baseenv(), pattern = START %c% punct())

START is just a constant that returns a caret. The %c% operator is a wrapper to paste0, and punct is a function returning a group of punctuation characters.  You can pass it arguments to match multiple punctuation characters.  For example, punct(3, 5) matches between 3 and 5 punctuation characters.

You also get lower-level functions.  punct(3, 5) is a convenience wrapper for repeated(group(PUNCT), 3, 5).
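
As a quick sanity check of that equivalence (a sketch assuming the package is loaded and that, as described above, these functions simply build and print a pattern string):

punct(3, 5)                   # e.g. "[[:punct:]]{3,5}"
repeated(group(PUNCT), 3, 5)  # should build the same pattern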

As a more complicated example, you can match an email address like:

one_or_more(group(ASCII_ALNUM %c% "._%+-")) %c%
  "@" %c%
  one_or_more(group(ASCII_ALNUM %c% ".-")) %c%
  DOT %c%
  ascii_alpha(2, 4)

This reads “Match one or more letters, numbers, dots, underscores, percents, plusses or hyphens. Then match an ‘at’ symbol. Then match one or more letters, numbers, dots, or hyphens. Then match a dot. Then match two to four letters.”

There are also functions for tokenising, capturing, and lookahead/lookbehind, and an operator for alternation.  I’m already rather excited about how much easier regular expressions have become for me to use.

Finally, a use for rapply

15th July, 2014

In the apply family of functions, rapply is the unloved ginger stepchild. While lapply, sapply and vapply make regular appearances in my code, and apply and tapply have occasional cameo appearances, in ten years of R coding, I’ve never once found a good use for rapply.

Maybe once a year I take a look at the help page, decide it looks too complicated, and ignore the function again. So today I was very pleased to have found a genuine use for the function. It isn’t life-changing, but it’s quite cute.

Complex classes often have a print method that hides their internals. For example, regression models created by glm are lists with thirty elements, but their print method displays only the call, the coefficients and a few statistics.

# From example(glm)
utils::data(anorexia, package = "MASS")
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
                family = gaussian, data = anorexia)
str(anorex.1)

To see everything, you need to use unclass.

unclass(anorex.1) #many pages of output

unclass has a limitation: it only removes the top level class, so subelements keep their classes. For example, compare:

class(unclass(anorex.1)$qr) # qr
class(unclass(anorex.1$qr)) # list

Using rapply, we can remove classes throughout the whole of the object, turning it into a list of simple objects.

rapply(anorex.1, unclass, how = "replace")

As well as letting us thoroughly inspect the contents of the object, this allows the object to be used with other code that doesn’t understand particular classes.
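
As a minimal check of that claim (reusing anorex.1 from above; the commented results are what the description implies rather than verified output):

flattened <- rapply(anorex.1, unclass, how = "replace")
class(anorex.1$qr)    # "qr"
class(flattened$qr)   # a plain "list" once every class is stripped
str(flattened, max.level = 1)  # inspect every element with no print methods in the way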


Automatically convert RUnit tests to testthat tests

12th May, 2014

There’s a new version of my assertive package, for sanity-checking code, on its way to CRAN. The release has been delayed a while, since my previous attempt at an upload met with an error that was generated on the CRAN machine but not on my own. The problem lay with some code designed to autorun the RUnit tests for the package. After fiddling for a while and getting nowhere, I decided it was time to make the switch to testthat.

I’ve been a long-time RUnit user, since the syntax is near-identical to every other xUnit variant in every programming language. So switching between RUnit and MATLAB xUnit or NUnit requires no thinking. testthat has a couple of important advantages over RUnit though.

Firstly, test_package makes it easy to, ahem, test your package. A page of code for finding and nicely displaying bad tests has been reduced to test_package("assertive").

Secondly, testing for warnings is much cleaner. In RUnit, you have to use convoluted mechanisms like:

test.sqrt.negative_numbers.throws_a_warning <- function()
{
  old_ops <- options(warn = 2)
  on.exit(options(old_ops))
  checkException(sqrt(-1))
}

The testthat equivalent is more readable:

test_that(
  "sqrt throws a warning for negative number inputs",
  {
    expect_warning(sqrt(-1))
  }
)

Thirdly, testthat caches tests, so you spend less time waiting for tests that you know are fine to rerun.

These benefits mean that I’ve been meaning to switch packages for a while. The big problem was that the assertive package contains over 300 unit tests. At about a minute or so to update each test, that was five hours of tedious work that I couldn’t be bothered to do. Instead, I spent two days making a package that automatically converts RUnit tests to testthat tests. Not exactly a time saving, but it was more fun.

It isn’t on CRAN yet, but you can get it from GitHub.

library(devtools)
install_github("richierocks/runittotestthat")

The package contains functions to convert tests on an individual/file/package basis.

convert_test takes an RUnit test function and returns a call to test_that. Here’s an example for the sqrt function.

library(runittotestthat)
test.sqrt.3.returns_1.732 <- function()
{
  x <- 3
  expected <- 1.73205080756888
  checkEquals(sqrt(x), expected)
}
convert_test(test.sqrt.3.returns_1.732)
## test_that("test.sqrt.3.returns_1.732", {
##   x <- 3
##   expected <- 1.73205080756888
##   expect_equal(expected, sqrt(x))
## })

convert_test works with more complicated test functions. You can have multiple checks, nested inside if blocks or loops if you really want.

test.some_complicated_nonsense.returns_an_appropriate_testthat_test <- function()
{
  x <- 6:10
  for(i in 1:5)
  {
    if(i %% 2 == 0)
    {
      checkTrue(all(x > i), msg = "i divisible by 2") 
      if(i == 4)
      {
        checkIdentical(4, i, msg = "i = 4")
      } else
      {
        while(i > 0) 
        {
          checkIdentical(2, i, msg = "i = 2")
        }
        repeat
        {
          checkException(stop("!!!"))
          break
        }
      }
    }               
  }
}
convert_test(test.some_complicated_nonsense.returns_an_appropriate_testthat_test)
## test_that("test.some_complicated_nonsense.returns_an_appropriate_testthat_test", 
## {
##   x <- 6:10
##   for (i in 1:5) {
##     if (i%%2 == 0) {
##       expect_true(all(x > i), info = "i divisible by 2")
##       if (i == 4) {
##         expect_identical(i, 4, info = "i = 4")
##       }
##       else {
##         while (i > 0) {
##           expect_identical(i, 2, info = "i = 2")
##         }
##         repeat {
##           expect_error(stop("!!!"))
##           break
##         }
##       }
##     }
##   }
## })

Of course, the main use for this is converting whole files or packages at a time, so runittotestthat contains convert_test_file and convert_package_tests for this purpose. By default (so you don’t overwrite your RUnit tests by mistake), they write their output to the console, but you can also write the resulting testthat tests to a file. Converting all 300 of assertive’s tests was as easy as:

convert_package_tests(
  "assertive", 
  test_file_regexp = "^test", 
  testthat_files = paste("new-", runit_files)
)

In that line of code, the runit_files variable is a special name that refers to the names of the files that contain your RUnit tests. It means that the output testthat file names can be based upon the original input names.

Although runittotestthat works fine on all my tests, automatic code editing is a tricky task, so there may be some weird edge cases that I’ve missed. Please download the package and play with it, and let me know if you find any bugs.

Introducing the pathological package for manipulating paths, files and directories

28th April, 2014

I was recently hunting for a function that will strip the extension from a file – changing foo.png to foo, and so forth. I was knitting a report, and wanted to replace the file extension of the input with the extension of the output file. (knitr handles this automatically in most cases, but I had some custom logic in there that meant I had to work things out manually.)

Finding file extensions is such a common task that I figured that someone must have written a function to solve the problem already. A quick search using findFn("file extension") from the sos package revealed a few thousand hits. There’s a lot of noise in there, but I found a few promising candidates.

There’s removeExt in the limma package (you can find it on Bioconductor), strip_extension in Kmisc, remove_file_extension which has identical copies in both spatial.tools and gdalUtils, and extension in the raster package.

To save you the time and effort, I’ve tried them all, and unfortunately they all suck.

At a bare minimum, a file extension stripper needs to be vectorized, deal with different file extensions within that vector, deal with multiple levels of extension (for things like “tar.gz” files), with filenames containing dots other than the one marking the extension, with missing values, and with directories.  OK, that’s quite a few things, but I’m picky.

Since all the existing options failed, I’ve made my own function. In fact, I went overboard and created a package of path manipulation utilities, the pathological package. It isn’t on CRAN yet, but you can install it via:

library(devtools)
install_github("richierocks/pathological")

It’s been a while since I’ve used MATLAB, but I have fond recollections of its fileparts function that splits a path up into the directory, filename and extension.

The pathological equivalent is to decompose a path, which returns a three-column character matrix.

library(pathological)
x <- c(
  "somedir/foo.tgz",         # single extension
  "another dir\\bar.tar.gz", # double extension
  "baz",                     # no extension
  "quux. quuux.tbz2",        # single ext, dots in filename
  R.home(),                  # a dir
  "~",                       # another dir
  "~/quuuux.tar.xz",         # a file in a dir
  "",                        # empty 
  ".",                       # current dir
  "..",                      # parent dir
  NA_character_              # missing
)
(decomposed <- decompose_path(x))
##                          dirname                      filename      extension
## somedir/foo.tgz         "d:/workspace/somedir"       "foo"         "tgz"    
## another dir\\bar.tar.gz "d:/workspace/another dir"   "bar"         "tar.gz" 
## baz                     "d:/workspace"               "baz"         ""       
## quux. quuux.tbz2        "d:/workspace"               "quux. quuux" "tbz2"   
## C:/PROGRA~1/R/R-31~1.0  "C:/Program Files/R/R-3.1.0" ""            ""       
## ~                       "C:/Users/richie/Documents"  ""            ""       
## ~/quuuux.tar.xz         "C:/Users/richie/Documents"  "quuuux"      "tar.xz" 
## ""                           ""            ""       
## .                       "d:/workspace"               ""            ""       
## ..                      "d:/"                        ""            ""       
## <NA>                    NA                           NA            NA       
## attr(,"class")
## [1] "decomposed_path" "matrix" 

There are some shortcut functions to get at different parts of the filename:

get_extension(x)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz 
##                   "tgz"                "tar.gz"                      "" 
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~ 
##                  "tbz2"                      ""                      "" 
##         ~/quuuux.tar.xz                                               . 
##                "tar.xz"                      ""                      "" 
##                      ..                    <NA> 
##                      ""                      NA 
                     
strip_extension(x)
##  [1] "d:/workspace/somedir/foo"         "d:/workspace/another dir/bar"    
##  [3] "d:/workspace/baz"                 "d:/workspace/quux. quuux"        
##  [5] "C:/Program Files/R/R-3.1.0"       "C:/Users/richie/Documents"       
##  [7] "C:/Users/richie/Documents/quuuux" "/"                               
##  [9] "d:/workspace"                     "d:/"                             
## [11] NA 

strip_extension(x, include_dir = FALSE)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz 
##                   "foo"                   "bar"                   "baz" 
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~ 
##           "quux. quuux"                      ""                      "" 
##         ~/quuuux.tar.xz                                               . 
##                "quuuux"                      ""                      "" 
##                      ..                    <NA> 
##                      ""                      NA 

You can also get your original file location (in a standardised form) using

recompose_path(decomposed)
##  [1] "d:/workspace/somedir/foo.tgz"           
##  [2] "d:/workspace/another dir/bar.tar.gz"    
##  [3] "d:/workspace/baz"                       
##  [4] "d:/workspace/quux. quuux.tbz2"          
##  [5] "C:/Program Files/R/R-3.1.0"             
##  [6] "C:/Users/richie/Documents"              
##  [7] "C:/Users/richie/Documents/quuuux.tar.xz"
##  [8] "/"                                      
##  [9] "d:/workspace"                           
## [10] "d:/"                                    
## [11] NA 

The package also contains a few other path utilities. The standardisation I mentioned comes from standardise_path (standardize_path also available for Americans), and there’s a dir_copy function for copying directories.
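
For completeness, here’s roughly how those might be called. The function names come from the paragraph above, but the argument names and exact behaviour are my assumptions, so check the package documentation before relying on them.

standardise_path("~/quuuux.tar.xz")      # canonical absolute form of a path
# dir_copy("somedir", "somedir_backup")  # copy a whole directory (arguments assumed)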

It’s brand new, so after I’ve complained about other people’s code, I’m sure karma will ensure that you’ll find a bug or two, but I hope you find it useful.

Fearsome Engines Part 3: Which one should you use?

13th October, 2013

There are lots of R engines emerging! I’ve interviewed members of each of the teams involved in these projects. In part 1 of this series, we covered the motivation of each project. Part 2 looked at the technical achievements and new features. This part tries to determine which projects are suitable for which users.

Compatibility

CXXR and pqR are forks of GNU R, and have retained full package compatibility. This means that they should work out of the box on your existing R code (though note that both are Linux-only and require you to build from source).

The other four engines are complete rebuilds, and consequently have had to recreate compatibility.

TERR currently has over 1200 completely compatible packages, and many more partially compatible ones.

Renjin also has, I think, over one thousand compatible packages. Helpfully, there’s a list of which CRAN packages work, and which don’t. (I haven’t bothered counting the blue dots, but it looks like almost half the packages are compatible.)

Riposte and FastR are at an earlier stage of development: FastR has no package support at all, and Riposte is just getting around to it.

Justin Talbot:

A couple of months back I started a big push to transition Riposte from an academic project to a full-featured drop-in replacement for R. Right now, Riposte can load the base and recommended packages with just a few errors, which is a substantial improvement over a couple months back when it couldn’t load even one.

Interestingly, one of the big headaches reported by several engine developers is that some packages make assumptions about the internals of R. They assume things like numeric vectors being pointers to double arrays, which makes it harder for other engines to extend the functionality. It seems that some work is needed on the R Language Definition to clarify exactly what a package should and should not be allowed to assume.

Licensing

pqR, CXXR, Renjin and FastR are all licensed under the GPL. Riposte is also open source, under the more permissive 2-clause BSD licence.

TERR, by contrast, is a closed-source product. It comes in both free and paid-for versions.

Ecosystem

Having lots of different engines is mostly great, but fragmentation is a big worry. Web development suffered for a long time (and still does, a little bit) from having to write different code for different browsers. SQL is subtly different for different databases, with vendor-specific extensions rampant.

All the engine developers have stated their intention to work towards (or retain) full compatibility with GNU R, and use that as the reference implementation. Certainly fragmentation is a problem that no-one wants.

Good intentions are all very well, but I was curious to know how much interaction there has been between the different teams, and with R-Core.

Between teams:

Tomas Kalibera previously worked at the University of Kent, where Andrew Runnalls is based. Surprisingly, they didn’t have much contact, as they were working in different areas at the time.

Jan Vitek was on Justin Talbot’s dissertation committee, so there has been some contact between the FastR and Riposte teams.

Justin has also spoken with Alex Bertram of Renjin.

Overall, there hasn’t been that much contact. I think I’d be less worried about fragmentation if the teams talked with each other more. (TERR is an exception in this; as a commercial product, a certain level of secrecy is required.)

A few R-Core names crop up in relation to several projects:

Doug Bates has been involved with CXXR.

Duncan Murdoch helped get some of pqR’s bug fixes merged into GNU R, and Radford Neal has had some contact with Luke Tierney and Robert Gentleman.

Luke Tierney has helped with the bytecode compilation in Renjin.

John Chambers, Ross Ihaka and Simon Urbanek have provided feedback on the TERR project.

Luke Tierney, Duncan Temple-Lang and Robert Gentleman have provided feedback to FastR.

Luke Tierney has helped on Riposte.

So at least some members of R-Core are aware of these projects, and there is some collaboration going on. Personally, I’d quite like to see a more formal arrangement, with a language committee trying to refine the R Language Definition to aid compatibility between projects. This is probably just a legacy of my time as a civil servant.

The FastR team organized a number of workshops that featured members of the R-Core team (one on virtual execution environments for scientific computing and one on big data). After the SPLASH conference in October 2013, key developers from the Renjin and TERR engines will meet up with the FastR team, along with some R-Core members, at an NSF workshop on scalable data analytics organized by Vitek.

The future

As a commercial project, TERR lives or dies by its ability to be sold (it does have a real customer base already, predominantly in the oil & gas, life sciences and consumer goods sectors).

Renjin is supported by BeDataDriven’s consultancy work, so it also has ongoing financial support.

The other four projects are all academic, so their long term support is trickier.

Justin works for Tableau Software, and they let him develop Riposte in his 20% time, but additional developers would help the project.

Justin Talbot:

Right now the team is just me. Zach has moved on to his dissertation work, the very cool Terra project (http://terralang.org). I’d be more than happy to expand the team if anyone is interested in helping out. Right now Riposte is at a place where there is a lot of easily factorable work in supporting popular packages or R’s internal functions that people could pick up if they wanted to. As I said at the beginning, Riposte’s overriding goal is to figure out how to execute dynamically-typed vector code like R at near the performance of highly optimized C. I think this sets it apart from a number of the other R projects like CXX, Renjin, and pqR, where better performance is nice, but not the driving design goal. If Riposte’s mission interests anyone, please get in contact!

CXXR and pqR are also solo projects and would benefit from developers.

Radford Neal:

I’m treating pqR as a fork, without any expectation that it will necessarily be merged with the version maintained by the R Core Team.

One necessary thread of work on pqR consists of fixing bugs, getting it to work in more environments (Windows, Mac, with GUIs), and adding features from later R Core releases. I’d like to have it compatible soon with R-2.15.1, and later with at least R-2.15.3.

There are still many areas in which performance can be improved without major design changes.

At some point I’d like to get back to doing statistics, so I hope that other people will also get involved with pqR.

FastR has a long term goal of being maintained by statisticians and not computer scientists!

Jan Vitek:

The hope, in an ideal world, is eventually that we could turn FastR over to R-Core.

As well as developers, of course, the biggest thing that these projects need is a userbase. Feedback and bug reporting are hugely important, so a great way for you to contribute is simply to try these projects out. Failing that, telling other R users that these projects exist is an excellent start. Get tweeting!

Fearsome Engines Part 2: Innovations and new features

13th October, 2013

There are lots of R engines emerging! I’ve interviewed members of each of the teams involved in these projects. In part 1 of this series, we covered the motivation of each project. This part looks at the technical achievements and new features.

Many of the innovations are performance improvements, reflecting the primary goal of several of the teams. pqR (the only engine that, like R, is written in C) gets a lot of benefit from simple fixes to R’s C code.

Radford Neal:

A big part of the speed-up in pqR comes just from local rewrites of C code, or relatively limited redesigns of how operations are done.

In general, however, some bigger changes have been required to make R go faster.

Garbage collection and reference counting

One performance bottleneck seems to be R’s garbage collector (the thing that frees up memory when you remove variables), with almost all the projects using a different collection strategy to try and make memory management more efficient.

Renjin and FastR get a lot of benefit for free by using the Java Virtual Machine (JVM):

Alex Bertram:

The JVM has 20 years and millions of man hours invested in the garbage collector. As revolutionary as R is, you just don’t have a group of people that are focused on garbage collection. So out of the box we see that if you have a memory-intensive benchmark then Renjin sails through it [compared to GNU R].

One of the things that we get for free is parallel garbage collection. That’s a nice extra.

In theory, if you copy a variable then R will create a completely new copy of it. For example, if you do y <- x <- 1:10, then x and y both store completely separate copies of 1:10. In practice, this is really wasteful of memory, so y will just point to the same memory location as x until you modify one of the variables to make them different. This means that you need to keep a count of how many variables point to a particular location in memory; a process known as reference counting.
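
You can watch this copy-on-modify behaviour in GNU R using tracemem, which requires an R build with memory profiling enabled (the standard CRAN binaries have it):

x <- 1:10
y <- x        # no copy yet: both names point at the same memory
tracemem(x)   # start reporting when this object gets duplicated
y[1] <- 0L    # modifying y forces the copy, and tracemem reports it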

GNU R uses a simple system where variables are referenced zero, one or two-or-more times. Both pqR and TERR have tried to improve on this. pqR stores the number of references in a three-bit field, so it can count from zero to seven (seven is really “seven-or-more”). TERR goes even further, storing the count as a 16-bit integer, allowing counts up to 65535. The more accurate reference count allows some memory to be reclaimed when a reference count decreases to zero, which is cheaper than reclaiming it via the garbage collector. (GNU R and pqR always use the garbage collector for reclaiming memory.)

While the pros and cons of different methods of reference counting and garbage collection can get quite technical, the good news is that it seems to be an area where substantial performance improvement is possible. Radford Neal has written in depth about the pqR implementation, and Michael Sannella’s talk from useR! 2013 is available on SlideShare.

Smart execution

GNU R uses what is known as “lazy evaluation” on variables. That is, when you pass variables into a function they aren’t copied or evaluated until absolutely necessary. Several of the engines have taken this concept further, with smarter ways of evaluating (or not evaluating) code in order to improve performance.
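
As a reminder of what lazy evaluation means in GNU R, an argument is only evaluated if the function actually uses it:

f <- function(x, y) x                    # y is never touched
f(42, stop("this is never evaluated"))   # returns 42 with no error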

Riposte uses “fusion” operations on vectors to reduce the creation of intermediate variables.

Justin Talbot:

While R uses C code for vector operations, it only does one vector operation at a time. This leads to lots of memory traffic as R must read in each vector operand and write out each vector result. Riposte tackles this problem by performing “fusion”, an optimization where multiple vector operations are performed simultaneously in the machine’s registers. This eliminates a lot of memory traffic and can speed up long vector code. In some of our benchmarks I’ve seen 10-50x performance improvements for long vectors (100K-100s of millions of elements).

As an example, if you wanted to find which elements of a vector corresponded to males over 40, you could write something like age > 40 & gender == "male". GNU R would create several intermediate vectors for age > 40 and gender == "male" before calculating the result. An alternative that uses less memory is to calculate the result of the whole operation on each element, and then loop over the vector (in C++, where loops are fast).
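
Roughly speaking, the difference looks like this (the explicit loop is written in R purely for illustration; a fusing engine performs that single pass in compiled code rather than in the interpreter):

age <- c(25, 45, 61, 38)
gender <- c("male", "male", "female", "male")

# GNU R's strategy: two intermediate logical vectors, then the result.
tmp1 <- age > 40
tmp2 <- gender == "male"
result <- tmp1 & tmp2

# The fused equivalent: one pass over the data, no intermediates.
result2 <- logical(length(age))
for (i in seq_along(age)) {
  result2[i] <- age[i] > 40 && gender[i] == "male"
}
identical(result, result2)  # TRUE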

Of course, R’s flexibility causes problems.

Justin Talbot:

R allows you to replace the default operators within almost any scope, by e.g.:

`+` <- function(x,y) x*y

Because of this, if you have a simple for loop:


j <- 0
for(i in 1:10000) {
  j <- j + 1
}

an interpreter can’t simply do an addition. Instead, it has to check each time through the loop that `+` is still really plus and not something else. If j were some long vector, this wouldn’t matter: the time to do the check would be dominated by the cost of actually doing the addition. But if j is short, as it is here, then the time to check dominates.

Riposte solves this problem by assuming that a small set of very common operators cannot be replaced in very limited circumstances. Basically, if you write:

j+1

and j is a basic type (not an S3/S4 class) Riposte assumes that + must be normal addition. If you really want + to use your redefined version you can either (1) make j an object or (2) call it using names, e.g.: `+`(x=j, y=1).

Most of the other engines have similar concepts to Riposte’s fusion. Renjin calls it a “vector pipeline” and FastR calls it a “view”. All refer to executing multiple steps as a group.

Jan Vitek:

We have two implementations of a vector. One is the usual one, and one is an interface. This is a lazy proxy to the computation. So if you do binary plus on two vectors you could allocate an array and do the plus on every element of the inputs as GNU R would do it, but in our implementation you could also create a view. This is an object that just remembers that there are these two vectors that it needs to add. It only starts adding the numbers when asked for them.

The last sentence is particularly important. Somewhat non-intuitively, executing code straight away doesn’t always result in the best performance – sometimes it can be faster to wait. This “deferred execution” becomes very important when trying to parallelise code, as we’ll see in a moment. One surprising challenge for deferred execution is respecting the order of error messages and warnings produced by GNU R.

Parallelism

Much of GNU R is single threaded by default, so with four cores now standard on desktop machines, exploiting multicore capabilities is an obvious target for performance boosting. In fairness, since R 2.14.0, the parallel package has allowed multicore capabilities with a bit of code rewriting, and there are many packages for various types of parallelism. The holy grail of parallelism, however, is for it to happen automatically (“implicit parallelism”) without code rewrites.
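
For comparison, explicit parallelism with the parallel package looks like this (mclapply relies on forking, so it only parallelises on Unix-alikes; on Windows you would use parLapply with a cluster from makeCluster instead):

library(parallel)
# Run eight independent simulations across two cores.
results <- mclapply(1:8, function(i) mean(rnorm(1e6)), mc.cores = 2)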

The FastR project have been pushing parallel computing in several forms.

Tomas Kalibera:

We aim to push on parallelism a lot. There are different parallelisms that you may look at: multi-threading, GPUs and possibly distributed settings.

One of the good things about an implementation from scratch is that you can build a mostly thread-safe interpreter. At least we know the points where we aren’t thread-safe. That means that we can evaluate the views in parallel. We know that they are simple operations on vectors. They don’t have side-effects. We can just put them in a thread-pool and almost magically you get parallelism.

In the long run the goal is to integrate parallelisation into the compiler. To use the compiler to analyse the code and understand where parallelism makes sense and to automatically synthesise parallelism in the programs. It should work!

Renjin’s vector pipeline provides implicit parallelism.

Alex Bertram:

With our vector pipeline, we have support for implicit parallelisation. Add the sums of two large vectors, and it will recognise that the calculation has two independent sums and calculate those on different threads. You don’t have to deal with that; it just parallelises it automatically.

Renjin will try and defer work as long as possible so that it has an opportunity to parallelise more things. You can get quite far before you have to do any work, and then Renjin will say “I’ve got all these different calculations that I have to do, a lot of them are independent. Let me distribute them across the available cores”.

pqR uses a related concept called helper threads. While individual calculations (“tasks” in pqR terminology) are single threaded, several calculations can be done in parallel, one on each thread.

Riposte has several parallelisation tricks up its sleeve. As well as vector fusion and implicit multicore behaviour it also uses SSE and AVX processor level instructions. (AFAIK, these were originally designed for parallel processing in media streaming so this is quite a novel use.) Surprisingly, fusion is much more important than multicore parallelism.

Justin Talbot:

For big vector operations on which Riposte performs fusion, it will attempt to leverage the SSE or AVX instructions to get a 2-4x performance improvement on a single core. It will also distribute the work across multiple threads. This leads to nearly linear performance increases for large enough vectors. Note the orders of magnitude though: fusion itself can lead to 10-50x performance improvement. Distributing the work across 4-6 cores only leads to at most a 4-6x improvement. So there’s more to be gained by doing fusion well.

Bytecode compilation

Interpreted languages are, in general, slower than compiled languages. This means that there is scope for improving R’s performance by compiling the code before it is executed. Recent versions of GNU R include the compiler package, which converts functions to bytecode that can then be run more quickly than purely interpreted code. Riposte and Renjin both try to improve on this.
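
GNU R’s own byte-compiler is easy to try for yourself:

library(compiler)
f <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}
fc <- cmpfun(f)        # byte-compile the function
system.time(f(1e6))
system.time(fc(1e6))   # typically noticeably faster for scalar loop code like this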

Justin Talbot:

when I started working on R it only used the older AST walking interpreter originally developed in the early 1990s. Interpreter technology has advanced considerably since then. Riposte uses a much more modern register-based bytecode interpreter. This made it about 5x faster than R on scalar code (e.g. your typical for loop). Luke Tierney’s new stack-based bytecode interpreter speeds up R somewhat, but I think Riposte’s interpreter is still 2-3x faster in general.

Compiling R code is, in general, a very hard problem, since R code is very flexible. If you use promises, call the eval function, redefine operators, or mess about with environments, then optimising code is almost impossible.

A similar situation occurs with JavaScript, and the browser makers (who have considerably more resources than the R community) have put a lot of effort into making JavaScript run faster. There, the solution seems to be “compile the bits that you can, optimise other bits where possible, and just run the hard stuff”.

Renjin uses a similar approach for R code.

Alex Bertram:

We have different interpretation modes. The baseline is this Abstract Syntax Tree (AST) interpreter that’s modelled almost exactly like GNU R. If you want to figure out how an eval statement works or system calls, we just went in and looked at how it was implemented in GNU R. The aim of that is that we have at least one AST interpreter that is capable of handling everything.

If we see that you’re dealing with large data, or there’s a function that is getting called a lot, or you’ve hit a for loop of one to one million, then it breaks out of that baseline interpreter and tries to use more advanced optimisations. That will only work for a subset of the R language, so the worst case is that we get performance equal to GNU R.

If you’ve got a for loop and you’re calling eval, then Renjin will throw up its hands and say ‘alright, if that’s what you want to do then you’re not going to get any performance speedup’. But if it sees that in the for loop, all you’re doing are some computations, then it will break out [of the AST interpreter] and compile down to machine code.

Data that isn’t in memory

By default, all R variables are stored in RAM. This contrasts with, for example, SAS, where variables are stored on disk. There is a tradeoff in this – RAM can be accessed more quickly than disk, but it limits the size of the data that you can store. Some R packages allow disk-based variables, most notably the ff and bigmemory packages. The MonetDB.R package allows variables to be stored in a MonetDB database.

In an ideal world, you should be allowed to store your variables anywhere: in RAM, on disk, or in a database, and R would be able to work with them in the same way. GNU R isn’t well designed for this. As an example, numeric vectors, in the underlying C code, are always assumed to be a pointer to a double array. To easily allow different data types, a level of abstraction is needed.

TERR leverages the object-oriented nature of C++ to achieve this. In TERR, each data type is an abstract C++ class that can have one or more representations.

I don’t know any details of Tibco’s implementation, but it seems that from an abstract class of NumericVector, you could have concrete classes of NumericVectorInRam and NumericVectorOnDisk to account for different data sources (not to mention specialisation for vectors with or without names and other attributes).
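
To make that idea concrete in R’s own terms, here is a toy S4 sketch of the same separation. It has nothing to do with TERR’s actual C++ classes, and the element_at generic is invented purely for illustration.

library(methods)
setClass("NumericVector", representation("VIRTUAL"))
setClass("NumericVectorInRam", contains = "NumericVector",
  representation(data = "numeric"))
setClass("NumericVectorOnDisk", contains = "NumericVector",
  representation(path = "character"))

setGeneric("element_at", function(x, i) standardGeneric("element_at"))
setMethod("element_at", "NumericVectorInRam", function(x, i) x@data[i])
setMethod("element_at", "NumericVectorOnDisk", function(x, i) {
  # read just the i-th double from the file backing the vector
  con <- file(x@path, open = "rb")
  on.exit(close(con))
  seek(con, where = (i - 1) * 8)
  readBin(con, what = "double", n = 1)
})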

Data from disk and database sources isn’t yet supported in TERR, but it is planned for the future.

Both FastR and Renjin have also made some progress towards non-RAM variables.

Alex Bertram:

We have been able to introduce a layer of abstraction between the algorithms that operate on data and where that data is stored. In GNU R there is an expectation that all the data is in arrays, and that’s very difficult to overcome.

In Renjin, it’s very basic but we have an interface, so if you want the sum of something, say a vector or other object, we hand some algorithm a pointer to an interface that says ‘if you want the tenth element, then call this method; if you want the length, then call that, etc.’. So it could be in memory, backed by an array, but it could also be on disk. It could be in a database or in a buffer.

That’s the basis of everything else: there’s a layer of abstraction between your algorithms and your storage.

Code specialisation

CXXR and FastR both make some performance gains by providing specialised routines for particular cases.

Tomas Kalibera:

Imagine that you are indexing a vector. Every user of R knows that there are very rich semantics. You can have logical or numeric indices, you can have NAs, you can have attributes on the base or on the index. There are many different semantics. One possibility is to implement the code using a single subroutine that will cover all the general cases, but this is slow.

We implement many different selection methods. One is still a general one that can do everything, but we also have specialised ones that make some assumptions. [For example], in one we assume that the base numeric vector is a double vector and the index is a scalar integer or a scalar double within the bounds of the vector. Of course you have to have some guards – some checks that these assumptions are valid – but once you check these guards the code is much simpler, and the just-in-time Java compiler is then able to optimise the code much better because it is so much smaller than the general case. We get big speedups from this.

Other features

TERR provides a C++ API to allow easy integration with C++ code. (Using Rcpp, it isn’t hard, but alternatives are always welcome.)

FastR uses some Oracle technology to improve performance.

Jan Vitek:

FastR 0.1 is a self-modifying interpreter. The interpreter is formed by a tree of executable nodes, which initially reflects the syntax tree of the program, but as the program executes, the tree rewrites itself into more optimised versions based on properties of the observed execution. Most optimisations come from restricting to specialised types of arguments (for example, nodes for vector indexing or arithmetic), or specialising on the contents of the arguments (for example, that a vector index is within bounds). The tree can then rewrite its nodes to general versions if the applied specialisations cease to hold.

The new version of FastR, 0.2, includes some novel Oracle technologies. (Oracle is one of the sponsors of the project.) Truffle and Graal allow code to self-modify as it runs, and to generate native code, providing on-the-fly optimisation. FastR has also experimented with Oracle’s Delite which autogenerates GPU code.

CXXR has been conspicuously absent from this post so far. Its single user-level new feature is provenance tracking, based on the S AUDIT feature.

Andrew Runnalls:

R itself stores (in the .Rhistory file) a record of all R commands that have been issued, so to that extent you can “recreate your current workspace”. The S AUDIT feature, however, enabled you to identify the specific sequence of commands that contributed to the current value of a particular S object: it was this selectivity that I particularly valued.

Rather than adding new features, a substantial amount of development effort has gone into refactoring R to make it more maintainable and better documented. Are these features? Hell yes! Rather than being user-level features, they are developer-level features. In this respect, I think CXXR might be useful for the developers of other engines: rather than trying to replicate features from GNU R’s source code, it might be easier to replicate them from CXXR.

This post is far from exhaustive on the new features in the engines. But as you can tell from what I’ve included, there really are a ton of technical innovations that have been made, with more on the way.

I put it to Radford Neal that it was difficult to tell which ones were the important ones.

Radford Neal:

What is most important is very hard to say, since it depends so much on the user’s program and environment. Helper threads are of no use on a single-core system! And speeding up matrix multiplication doesn’t help programs that don’t use it. I’d encourage people who’ve tried pqR to let me know what programs have and have not been sped up a lot – so far I haven’t gotten much of that. This is most helpful, of course, if you can send me your program and data so that I can try it myself, and maybe change pqR to make your program faster.

Let me know in the comments which features seem most exciting to you, or if there is anything that you want to see in R that isn’t possible.

In the next part, I’ll have a go at figuring out which engines you might want to use for which purpose.

Webcast on Writing Great R Code

19th September, 2013

While I’m promoting things, you might also want to know that I’m doing a webcast on how to write great R code next Wednesday. It’s at 6pm British Summer Time or 10am Pacific Daylight Time.

The big problem with being a data scientist is that you have to be both a statistician and a programmer, which is really two full-time jobs, and consequently a lot of hard work. The webcast will cover some tools in R that make the programming side of things much easier.

You’ll get to hear about:

  • Writing stylish code.
  • Finding bad functions with the sig package.
  • Writing robust code with the assertive package.
  • Testing your code with the testthat package.
  • Documenting your code with the roxygen2 package.

It’s going to be pitched at a beginner’s-plus level. There’s nothing hard, but if you haven’t used these four packages, I’m sure you’ll learn something new. Register here.
