Author Archive

How do you get things into base-R?

15th January, 2015 2 comments

A couple of months ago I spotted that the examples for the paste function weren’t very good, and actually, there were quite a few functions that new users of R are likely to encounter, that weren’t well explained.

I’ve now managed to get some updated examples into R (paste, sum, NumericConstants, pie, a couple dozen more functions hopefully to follow), and a few people have asked how I did it.

The important thing to remember is that R is a communist ecosystem: all R users are equal, except those in the ruling R Core Team Party. If you want anything to happen, you need to persuade a member of R Core.

If the change can be made by creating of adding to a package, you’ll probably find it easier to do that. R Core members all have day jobs, and get a lot of crappy requests, so expect your request to be considered very low priority.

Of course, there is a certain amount of psychology involved in this. If R Core have heard of you, then your chances increase a little. Spend some time discussing things on r-devel, upload some packages to CRAN, and say hello to them at useR. (I mostly avoid r-devel in favour Stack Overflow, spend a fair amount of time getting packages rejected from CRAN, but I have said hello to a few of them.)

There are three ways of making your request to R Core:

File a bug report

If you have found a problem that you can reproduce, and can articulate, and seems possible to fix in a reasonable amount of time, then file a bug report.

My success rate with bugs is slightly better than fifty-fifty: 12 out of 22 were closed as fixed, and one more was closed as won’t-fix but was then fixed anyway. This means that you need to be psychologically prepared for rejection: not all your fixes will be made, and you shouldn’t get upset when they don’t.

There are many pitfalls involved in filing a bug report, so if you want to stand any chance of success, then try to follow these guidelines.

– You absolutely must have a reproducible example of the behaviour that you want to change. If you can’t reproduce it, you aren’t ready to submit the bug report.

– Speculation as to the cause of a problem is a cultural no-no. Brian Ripley will tell you off if you do this. Stick to the facts.

– If the problem involves external libraries (the Windows API for example), then your chance of getting a successful change is drastically reduced, and you may want to consider other methods of contacting R Core before you file a bug report.

– The same is true if your proposed change involves changing the signature of a base R function (even if it doesn’t conflict with existing behaviour).

– Make it as easy as you can for a change to be made. Your probability of success decays exponentially with the amount of time it takes for the fix to be made. (This is true of any software project, not just R.)

Start a discussion on the r-devel mailing list

If you can’t consistently reproduce a problem, or if there are several possible behaviours that could be right, or there is some sort of trade-off involved in making a change, then you should ask here first.

Having a discussion on r-devel means that other community members get to discuss the pros and cons of your idea, and allows people to test your idea on other operating systems/versions of R, and with other use cases.

Contact the member of R Core directly

If your problem is fuzzy, and you aren’t finding any luck on r-devel, then you might want to try contacting a member of R Core directly. (I want to emphasise that you should usually try r-devel first, and email is less invasive than a phone call or turning up at their house.)

Most bits of R fall under general R Core maintenance but some specialist packages have a single maintainer, and direct contact is likely to be more successful for these.

For example, for anything related to the grid package, you need to speak to Paul Murrell. For anything related to lattice, you want Deepayan Sarkar. For codetools, you want Luke Tierney. Find out who the maintainers are with:

lapply(dir(R.home("library")), packageDescription)

In the end, you may use several methods of communication: I submitted a bug report about the paste examples, and then had some follow up email with Martin Maechler about updating further examples.

In terms of time-scales, I’ve had a few bugs fixed the same day, but in general expect things to take weeks to months, especially for bigger changes.

Tags: , ,

Update on improving examples in base-R

24th December, 2014 2 comments

Last month I was ranting about the state of some of the examples in base-R, particularly the paste function.

Martin Maechler has now kindly taken my suggested examples and added them into R. Hopefully this will reduce the number of newbie questions about “how do I join these strings together”.

Since Martin showed some interest in improving the state of the examples, I’ve given him updated content for another 30 or so help pages, and some Christmas homework of getting them into R too!

Tags: ,

Improving base-R examples

25th November, 2014 11 comments

Earlier today I saw the hundred bazillionth question about how to use the paste function. My initial response was “take a look at example(paste) to see how it works”.

Then I looked at example(paste), and it turns out that it’s not very good at all. There isn’t even an example of how to use the collapse argument. Considering that paste is one of the first functions that beginners come across, as well as being a little bit tricky (getting to understand the difference between the sep and collapse arguments takes a bit of thinking about when you are new), this seems like a big oversight.

I’ve submitted this as a bug, with a suggested improvement to the examples. Fingers crossed that R-core will accept the update, or something like it.

It got me thinking though, how many other base functions could do with better examples? I had a quick look at some common functions that beginners seems to get confused with, and the following all have fairly bad example sections:
In base: browser, get, seq
In stats: formula, lm, runif, t.test
In graphics: plot
In utils: download.file, read.table

If you have half an hour spare, have a go at writing a better example page for one of these functions, or any other function in the base distribution, then submit it to the bug tracker. (If you aren’t sure that your examples are good enough, or you need advice, try posting what you have on r-devel before submitting a bug report. Dealing with bug reports takes up valuable R-core time, so you need to be sure of quality first.)

This seems like a really easy way to make R more accessible for beginners.

Regular expressions for everyone else

25th September, 2014 Leave a comment

Regular expressions are an amazing tool for working with character data, but they are also painful to read and write.  Even after years of working with them, I struggle to remember the syntax for negative lookahead, or which way round the start and end anchor symbols go.

Consequently, I’ve created the regex package for human readable regular expression generation.  It’s currently only on github (CRAN version arriving as soon as you give me feedback), so you can get it with:

install_github("regex", "richierocks")

Before, if I wanted to find the names of all the operators in the base package, my workflow would be something like:

I need ls, with a pattern that matches punctuation.  So I open the ?regex help page and look for the character class for punctuation.  My first attempt is then:

ls(baseenv(), pattern = "[:punct:]")

Ok, wait, the class has to be wrapped in square brackets itself.

ls(baseenv(), pattern = "[[:punct:]]")

Better, but that’s matching S3 classes and some functions too.  I want to match only where there’s punctuation at the start.  What’s the anchor for the start?  Back to reading ?regex.  Sod it, there’s too much text here; it’s probably a dollar sign.

ls(baseenv(), pattern = "$[[:punct:]]")

Hmm, nope.  Must be a caret.

ls(baseenv(), pattern = "^[[:punct:]]")

Hurrah!  Still, it took me 5 minutes for a simple example.  For something more complicated like matching email addresses or telephone numbers or particular time formats, building regular expressions this way can become time consuming and frustrating.  Here’s the equivalent syntax using regex.

ls(baseenv(), pattern = START %c% punct())

START; is just a constant that returns a caret. The %c% operator is a wrapper to paste0, and punct is a function returning a group of punctuation.  You can pass it argument to match multiple punctuation.  For example punct(3, 5) matches between 3 and 5 punctuation characters.

You also get lower-level functions.  punct(3, 5) is a convenience wrapper for repeated(group(PUNCT), 3, 5).

As a more complicated example, you can match an email address like:

one_or_more(group(ASCII_ALNUM %c% "._%+-")) %c%
  "@" %c%
  one_or_more(group(ASCII_ALNUM %c% ".-")) %c%
  DOT %c%
  ascii_alpha(2, 4)

This reads Match one or more letters, numbers, dots, underscores, percents, plusses or hyphens. Then match an ‘at’ symbol. Then match one or more letters, numbers, dots, or hyphens. Then match a dot. Then match two to four letters.

There are also functions for tokenising, capturing, and lookahead/lookbehind, and an operator for alternation.  I’m already rather excited about how much easier regular expressions have become for me to use.

Finally, a use for rapply

15th July, 2014 2 comments

In the apply family of functions, rapply is the unloved ginger stepchild. While lapply, sapply and vapply make regular appearances in my code, and apply and tapply have occasional cameo appearances, in ten years of R coding, I’ve never once found a good use for rapply.

Maybe once a year I take a look at the help page, decide it looks to complicated, and ignore the function again. So today I was very pleased to have found a genuine use for the function. It isn’t life-changing, but it’s quite cute.

Complex classes often have a print method that hides their internals. For example, regression models created by glm are lists with thirty elements, but their print method displays only the call, the coefficients and a few statistics.

# From example(glm)
utils::data(anorexia, package = "MASS")
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
                family = gaussian, data = anorexia)

To see everything, you need to use unclass.

unclass(anorex.1) #many pages of output

unclass has a limitation: it only removes the top level class, so subelements keep their classes. For example, compare:

class(unclass(anorex.1)$qr) # qr
class(unclass(anorex.1$qr)) # list

Using rapply, we can remove classes throughout the whole of the object, turning it into a list of simple objects.

rapply(anorex.1, unclass, how = "replace")

As well as allowing us to thoroughly inspect the contents of the object, it also allows the object to be used with other code that doesn’t understand particular classes.

Tags: , , ,

Automatically convert RUnit tests to testthat tests

12th May, 2014 2 comments

There’s a new version of my assertive package, for sanity-checking code, on its way to CRAN. The release has been delayed a while, since my previous attempt at an upload met with an error that was only generated on the CRAN machine, but not on my own. The problem lay with some code designed to autorun the RUnit tests for the package. After fiddling for a while and getting nowhere, I decided it was time to make the switch to testthat.

I’ve been a long-time RUnit user, since the syntax is near-identical to every other xUnit variant in every programming language. So switching between RUnit and MATLAB xUnit or NUnit requires no thinking. testthat has a couple of important advantages over RUnit though.

test_package makes it easy to, ahem, test your package. A page of code for finding and nicely displaying bad tests has been reduced to test_package("assertive").

Secondly, testing for warnings is much cleaner. In RUnit, you have to use convoluted mechanisms like:

test.sqrt.negative_numbers.throws_a_warning <- function()
  old_ops <- options(warn = 2)

The testthat equivalent is more readable:

  "sqrt throws a warning for negative number inputs",

Thirdly, testthat caches tests, so you spend less time waiting for tests that you know are fine to rerun.

These benefits mean that I’ve been meaning to switch packages for a while. The big problem was that the assertive package contains over 300 unit tests. At about a minute or so to update each tests, that was five hours of tedious work that I couldn’t be bothered to do. Instead, I spent two days making a package that automatically converts RUnit tests to testthat tests. Not exactly a time saving, but it was more fun.

It isn’t on CRAN yet, but you can get it from github.


The package contains functions to convert tests on an individual/file/package basis.

convert_test takes an RUnit test function and returns a call to test_that. Here’s an example for the sqrt function.

test.sqrt.3.returns_1.732 <- function()
  x <- 3
  expected <- 1.73205080756888
  checkEquals(sqrt(x), expected)
## test_that("test.sqrt.3.returns_1.732", {
##   x <- 3
##   expected <- 1.73205080756888
##   expect_equal(expected, sqrt(x))
## })

convert_test works with more complicated test functions. You can have multiple checks, nested inside if blocks or loops if you really want.

test.some_complicated_nonsense.returns_an_appropriate_testthat_test <- function()
  x <- 6:10
  for(i in 1:5)
    if(i %% 2 == 0)
      checkTrue(all(x > i), msg = "i divisible by 2") 
      if(i == 4)
        checkIdentical(4, i, msg = "i = 4")
      } else
        while(i > 0) 
          checkIdentical(2, i, msg = "i = 2")
## test_that("test.some_complicated_nonsense.returns_an_appropriate_testthat_test", 
## {
##   x <- 6:10
##   for (i in 1:5) {
##     if (i%%2 == 0) {
##       expect_true(all(x > i), info = "i divisible by 2")
##       if (i == 4) {
##         expect_identical(i, 4, info = "i = 4")
##       }
##       else {
##         while (i > 0) {
##           expect_identical(i, 2, info = "i = 2")
##         }
##         repeat {
##           expect_error(stop("!!!"))
##           break
##         }
##       }
##     }
##   }
## })

Of course, the main use for this is converting whole files or packages at at time, so runittotestthat contains convert_test_file and convert_package_tests for this purpose. By default (so you don’t overwrite your RUnit tests by mistake), they write their output to the console, but you can also write the resulting testthat tests to a file. converting all 300 of assertive’s tests was as easy as

  test_file_regexp = "^test", 
  testthat_files = paste("new-", runit_files)

In that line of code, the runit_files variable is a special name that refers to the names of the files that contain you RUnit tests. It means that the output testthat file names can be be based upon original input names.

Although runittotestthat works fine on all my tests, automatic code editing is a tricky task, so there may be some weird edge cases that I’ve missed. Please download the package and play with it, and let me know if you find any bugs.

Update: \code>runittotestthat is on CRAN these days.

Introducing the pathological package for manipulating paths, files and directories

28th April, 2014 8 comments

I was recently hunting for a function that will strip the extension from a file – changing foo.png to foo, and so forth. I was knitting a report, and wanted to replace the file extension of the input with the extension of the the output file. (knitr handles this automatically in most cases but I had some custom logic in there that meant I had to work things manually.)

Finding file extensions is such a common task that I figured that someone must have written a function to solve the problem already. A quick search using findFn("file extension") from the sos package revealed a few thousand hits. There’s a lot of noise in there, but I found a few promising candidates.

There’s removeExt in the limma package (you can find it on Bioconductor), strip_extension in Kmisc, remove_file_extension which has identical copies in both and gdalUtils, and extension in the raster.

To save you the time and effort, I’ve tried them all, and unfortunately they all suck.

At a bare minimum, a file extension stripper needs to be vectorized, deal with different file extensions within that vector, deal with multiple levels of extension (for things like “tar.gz” files), and with filenames with dots in the name other than the extension, and with missing values, and with directories. OK, that’s quite a few things but I’m picky.

Since all the existing options failed, I’ve made my own function. In fact, I went overboard and created a package of path manipulation utilities, the pathological package. It isn’t on CRAN yet, but you can install it via:


It’s been a while since I’ve used MATLAB, but I have fond recollections of its fileparts function that splits a path up into the directory, filename and extension.

The pathological equivalent is to decompose a path, which returns a character matrix data.frame with three columns.

x <- c(
  "somedir/foo.tgz",         # single extension
  "another dir\\bar.tar.gz", # double extension
  "baz",                     # no extension
  "quux. quuux.tbz2",        # single ext, dots in filename
  R.home(),                  # a dir
  "~",                       # another dir
  "~/quuuux.tar.xz",         # a file in a dir
  "",                        # empty 
  ".",                       # current dir
  "..",                      # parent dir
  NA_character_              # missing
(decomposed <- decompose_path(x))
##                          dirname                      filename      extension
## somedir/foo.tgz         "d:/workspace/somedir"       "foo"         "tgz"    
## another dir\\bar.tar.gz "d:/workspace/another dir"   "bar"         "tar.gz" 
## baz                     "d:/workspace"               "baz"         ""       
## quux. quuux.tbz2        "d:/workspace"               "quux. quuux" "tbz2"   
## C:/PROGRA~1/R/R-31~1.0  "C:/Program Files/R/R-3.1.0" ""            ""       
## ~                       "C:/Users/richie/Documents"  ""            ""       
## ~/quuuux.tar.xz         "C:/Users/richie/Documents"  "quuuux"      "tar.xz" 
## ""                           ""            ""       
## .                       "d:/workspace"               ""            ""       
## ..                      "d:/"                        ""            ""       
## <NA>                    NA                           NA            NA       
## attr(,"class")
## [1] "decomposed_path" "matrix" 

There are some shortcut functions to get at different parts of the filename:

##         somedir/foo.tgz another dir\\bar.tar.gz                     baz 
##                   "tgz"                "tar.gz"                      "" 
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~ 
##                  "tbz2"                      ""                      "" 
##         ~/quuuux.tar.xz                                               . 
##                "tar.xz"                      ""                      "" 
##                      ..                    <NA> 
##                      ""                      NA 
##  [1] "d:/workspace/somedir/foo"         "d:/workspace/another dir/bar"    
##  [3] "d:/workspace/baz"                 "d:/workspace/quux. quuux"        
##  [5] "C:/Program Files/R/R-3.1.0"       "C:/Users/richie/Documents"       
##  [7] "C:/Users/richie/Documents/quuuux" "/"                               
##  [9] "d:/workspace"                     "d:/"                             
## [11] NA 

strip_extension(x, include_dir = FALSE)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz 
##                   "foo"                   "bar"                   "baz" 
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~ 
##           "quux. quuux"                      ""                      "" 
##         ~/quuuux.tar.xz                                               . 
##                "quuuux"                      ""                      "" 
##                      ..                    <NA> 
##                      ""                      NA 

You can also get your original file location (in a standardised form) using

##  [1] "d:/workspace/somedir/foo.tgz"           
##  [2] "d:/workspace/another dir/bar.tar.gz"    
##  [3] "d:/workspace/baz"                       
##  [4] "d:/workspace/quux. quuux.tbz2"          
##  [5] "C:/Program Files/R/R-3.1.0"             
##  [6] "C:/Users/richie/Documents"              
##  [7] "C:/Users/richie/Documents/quuuux.tar.xz"
##  [8] "/"                                      
##  [9] "d:/workspace"                           
## [10] "d:/"                                    
## [11] NA 

The package also contains a few other path utilities. The standardisation I mentioned comes from standardise_path (standardize_path also available for Americans), and there’s a dir_copy function for copying directories.

It’s brand new, so after I’ve complained about other people’s code, I’m sure karma will ensure that you’ll find a bug or two, but I hope you find it useful.