Archive

Posts Tagged ‘programming’

mabbles: The Missing Piece of a Unified Theory of Tidyness

20th April, 2017

R programming has seen a big shift in the last couple of years. All those packages that RStudio have been creating to solve this or that problem suddenly started to cohere into a larger ecosystem of packages. Once it was given a name, the tidyverse, it became possible to start thinking about the structure of the ecosystem, how its packages relate to each other, and what new packages are needed. At this point, the tidyverse is already the dominant ecosystem on CRAN: five of the top ten most downloaded packages are tidyverse packages, and most of the packages in the ecosystem are in the top one hundred.

As the core tidyverse packages like dplyr mature, the most exciting developments are in its expansion into new fields. Notably, tidytext is taking over text mining, tidyquant is poised to conquer financial time series analyses, and sf is getting the spatial stats community excited.

There is one area that remains stubbornly distinct from the tidyverse. Bioconductor dominates biological research, particularly ‘omics fields (genomics, transcriptomics, proteomics, and metabolomics). Thanks to the heavy curation of packages by the Bioconductor core team, the two and a half thousand packages in the Bioconductor repository also form a coherent ecosystem.

In the same way that the general theory of relativity and quantum mechanics are incredibly powerful by themselves but are currently irreconcilable when it comes to thinking about gravity, the tidyverse and Bioconductor are more or less mutually exclusive ecosystems of R packages for data analysis. The fundamental data structure of the tidyverse is the data frame, but for Bioconductor it is the ExpressionSet.

If you’ve not come across ExpressionSets before, they essentially consist of a data frame of feature data, a data frame of response data, and a matrix of measurements. This data type is marvelously suited to dealing with data from ‘omics experiments and has served Bioconductor well for years.
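If you want to see one in action, here’s a minimal sketch of building a toy ExpressionSet (this assumes the Bioconductor Biobase package; the data and names are my own invention):

library(Biobase)

# A toy expression matrix: 3 features (rows) by 2 samples (columns)
expr_matrix <- matrix(
  rnorm(6), nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"), c("s1", "s2"))
)

# Response data: one row per sample
pheno <- AnnotatedDataFrame(
  data.frame(treatment = c("ctrl", "drug"), row.names = c("s1", "s2"))
)

# Feature data: one row per feature
feats <- AnnotatedDataFrame(
  data.frame(chromosome = c("1", "2", "X"),
             row.names = c("gene1", "gene2", "gene3"))
)

eset <- ExpressionSet(
  assayData = expr_matrix, phenoData = pheno, featureData = feats
)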

However, over the last decade, biological datasets have been growing exponentially, and for many experiments it is now no longer practical to store them in RAM, which means that an ExpressionSet is impractical. There are some very clever workarounds, but it strikes me that what Bioconductor needs is a trick from the tidyverse.

My earlier statement that the data frame is the fundamental data structure in the tidyverse isn’t quite true. It’s actually the tibble, an abstraction of the data frame. From a user point of view, tibbles behave like data frames with a slightly nicer print method. From a technical point of view, they have one huge advantage: they don’t care where their data is. tibbles can store their data in a regular data.frame, a data.table, a database, or on Spark. The user gets to write the same dplyr code to manipulate them, but the analysis can scale beyond the limits of RAM.
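To illustrate the point, here’s a sketch of the same pipeline running against two backends (it assumes the DBI, RSQLite, and dbplyr packages are installed; summarise_mpg is a name I’ve made up for the example):

library(dplyr)

# The same dplyr code, whatever the backend
summarise_mpg <- function(data) {
  data %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg))
}

# Backend 1: an in-memory data frame
summarise_mpg(mtcars)

# Backend 2: a SQLite database; dplyr translates the pipeline to SQL
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)
summarise_mpg(mtcars_db)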

If Bioconductor could have a similar abstracted ExpressionSet object, its users and developers could stop worrying about the rapidly expanding sizes of biological data.

Swapping out the data frame parts of an ExpressionSet is simple – you can just use tibbles already. The tricky part is what to do with the matrix. What is needed is an object that behaves like a matrix to the user, but acts like a tibble underneath.

I call such a theoretical object a mabble.

Unfortunately, right now, it doesn’t exist. This is where you come in. I think that there is plenty of fame and fortune for the person or team that can develop such an object, so I urge you to have a go.

The basic idea seems reasonably simple. You store the mabble as a tibble, with three columns for row, column, and value. Here’s a very simple implementation.

# The implementation uses tibble, plus assertions from the assertive package
library(assertive)
library(tibble)

mabble <- function(x, nrow = NULL, ncol = NULL) {
  # Decide on dimensions
  n <- length(x)
  if(is.null(nrow)) {
    if(is.null(ncol)) {
      # Default to column vector
      nrow <- n
      ncol <- 1
    } else { # only ncol known
      nrow <- n / ncol
      assert_all_are_whole_numbers(nrow)
    }
  } else {
    if(is.null(ncol)) { # only nrow known
      ncol <- n / nrow
      assert_all_are_whole_numbers(ncol)
    } else { # both known
      # Not allowing recycling for now; may change my mind later
      assert_all_are_equal_to(nrow * ncol, length(x))
    }
  }

  m <- tibble(
    r = rep.int(seq_len(nrow), times = ncol), # row index of each value
    c = rep(seq_len(ncol), each = nrow),      # column index of each value
    v = x                                     # the values, in column-major order
  )
  class(m) <- c("mbl", "tbl_df", "tbl", "data.frame")
  m
}

Then you need a print method so it displays like a matrix. Here’s a simple solution, though ideally only a limited number of rows and columns would be displayed; see the sketch after the example output below.

as.matrix.mbl <- function(x, ...) {
  reshape2::acast(x, r ~ c, value.var = "v")
}

print.mbl <- function(x, ...) {
  print(as.matrix(x))
  invisible(x)
}
(m <- mabble(1:12, 3, 4))
##   1 2 3  4
## 1 1 4 7 10
## 2 2 5 8 11
## 3 3 6 9 12
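Here’s a rough sketch of that truncated display (replacing the simple method above; the six row/column cutoff is an arbitrary choice of mine):

print.mbl <- function(x, n = 6, ...) {
  # Only reshape the top-left corner, rather than materializing everything
  corner <- dplyr::filter(x, r <= n, c <= n)
  print(reshape2::acast(corner, r ~ c, value.var = "v"))
  if(max(x$r) > n || max(x$c) > n) {
    cat("# ... with", max(x$r), "rows and", max(x$c), "columns in total\n")
  }
  invisible(x)
}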

The grunt work is to write methods for all the things that matrices can do. Transposing is easy – you just swap the r and c columns.

t.mbl <- function(x) {
  # Transposing just means swapping the row and column indices
  x %>% 
    dplyr::select(r = c, c = r, v)
}
t(m)
##    1  2  3
## 1  1  2  3
## 2  4  5  6
## 3  7  8  9
## 4 10 11 12

There are a lot of things that need to be worked out. Right now, I have no idea how you implement linear algebra with a mabble. I don’t have time to build this thing myself, but I’d be happy to advise you if you are interested in creating it.
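That said, to give a flavour of how it might go, here’s a hedged sketch of matrix multiplication on the long format (mbl_multiply is a name I’ve invented; it assumes dplyr 1.0 or later for the .groups argument):

library(dplyr)

# (AB)[i, k] = sum over j of A[i, j] * B[j, k].
# Joining A's column index to B's row index lines up the inner dimension, j.
mbl_multiply <- function(x, y) {
  m <- x %>%
    rename(i = r, j = c) %>%
    inner_join(rename(y, j = r, k = c), by = "j") %>%
    group_by(r = i, c = k) %>%
    summarise(v = sum(v.x * v.y), .groups = "drop")
  class(m) <- c("mbl", "tbl_df", "tbl", "data.frame")
  m
}

mbl_multiply(m, t(m))  # should match as.matrix(m) %*% t(as.matrix(m))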


Update: A few people have quite rightly pointed out that Bioconductor is moving towards having SummarizedExperiment as its fundamental data structure. Further, SummarizedExperiments contain Assays, which are a virtual class, meaning that they can have different backends. So it looks like other people have been thinking along similar lines to me.
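For the unfamiliar, here’s a minimal sketch of that newer container (assuming the Bioconductor SummarizedExperiment package; the toy data is mine):

library(SummarizedExperiment)

# A toy count matrix: 3 genes by 2 samples
counts <- matrix(
  rpois(6, lambda = 10), nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"), c("s1", "s2"))
)

se <- SummarizedExperiment(
  assays  = list(counts = counts),                     # stored as an Assays object
  rowData = DataFrame(chromosome = c("1", "2", "X")),  # feature data
  colData = DataFrame(treatment = c("ctrl", "drug"))   # response data
)
assay(se, "counts")  # retrieve the matrix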

I still think that harnessing the power of tibbles to provide instant connections to databases and Spark is useful. So a mabble could be a useful intermediate object. That is, the user accesses the Assay element of their SummarizedExperiment which is instantiated as a MabbleAssay which is a mabble underneath, which is actually a tibble which connects to the data store somewhere else. Simple!

Also, Dave Robinson has the biobroom package, for tidying up Bioconductor objects.

Automatically convert RUnit tests to testthat tests

12th May, 2014

There’s a new version of my assertive package, for sanity-checking code, on its way to CRAN. The release has been delayed a while, since my previous attempt at an upload met with an error that appeared on the CRAN machines but not on my own. The problem lay with some code designed to autorun the RUnit tests for the package. After fiddling for a while and getting nowhere, I decided it was time to make the switch to testthat.

I’ve been a long-time RUnit user, since the syntax is near-identical to every other xUnit variant in every programming language. So switching between RUnit and MATLAB xUnit or NUnit requires no thinking. testthat has a few important advantages over RUnit though.

Firstly, test_package makes it easy to, ahem, test your package. A page of code for finding and nicely displaying bad tests has been reduced to test_package("assertive").

Secondly, testing for warnings is much cleaner. In RUnit, you have to use convoluted mechanisms like:

test.sqrt.negative_numbers.throws_a_warning <- function()
{
  old_ops <- options(warn = 2)
  on.exit(options(old_ops))
  checkException(sqrt(-1))
}

The testthat equivalent is more readable:

test_that(
  "sqrt throws a warning for negative number inputs",
  {
    expect_warning(sqrt(-1))
  }
)

Thirdly, testthat caches tests, so you spend less time rerunning tests that you know are fine.

These benefits mean that I’ve been meaning to switch packages for a while. The big problem was that the assertive package contains over 300 unit tests. At about a minute or so to update each test, that was five hours of tedious work that I couldn’t be bothered to do. Instead, I spent two days making a package that automatically converts RUnit tests to testthat tests. Not exactly a time saving, but it was more fun.

It isn’t on CRAN yet, but you can get it from GitHub.

library(devtools)
install_github("richierocks/runittotestthat")

The package contains functions to convert tests on an individual/file/package basis.

convert_test takes an RUnit test function and returns a call to test_that. Here’s an example for the sqrt function.

library(runittotestthat)
test.sqrt.3.returns_1.732 <- function()
{
  x <- 3
  expected <- 1.73205080756888
  checkEquals(sqrt(x), expected)
}
convert_test(test.sqrt.3.returns_1.732)
## test_that("test.sqrt.3.returns_1.732", {
##   x <- 3
##   expected <- 1.73205080756888
##   expect_equal(expected, sqrt(x))
## })

convert_test works with more complicated test functions. You can have multiple checks, nested inside if blocks or loops if you really want.

test.some_complicated_nonsense.returns_an_appropriate_testthat_test <- function()
{
  x <- 6:10
  for(i in 1:5)
  {
    if(i %% 2 == 0)
    {
      checkTrue(all(x > i), msg = "i divisible by 2") 
      if(i == 4)
      {
        checkIdentical(4, i, msg = "i = 4")
      } else
      {
        while(i > 0) 
        {
          checkIdentical(2, i, msg = "i = 2")
        }
        repeat
        {
          checkException(stop("!!!"))
          break
        }
      }
    }               
  }
}
convert_test(test.some_complicated_nonsense.returns_an_appropriate_testthat_test)
## test_that("test.some_complicated_nonsense.returns_an_appropriate_testthat_test", 
## {
##   x <- 6:10
##   for (i in 1:5) {
##     if (i%%2 == 0) {
##       expect_true(all(x > i), info = "i divisible by 2")
##       if (i == 4) {
##         expect_identical(i, 4, info = "i = 4")
##       }
##       else {
##         while (i > 0) {
##           expect_identical(i, 2, info = "i = 2")
##         }
##         repeat {
##           expect_error(stop("!!!"))
##           break
##         }
##       }
##     }
##   }
## })

Of course, the main use for this is converting whole files or packages at a time, so runittotestthat contains convert_test_file and convert_package_tests for this purpose. By default (so you don’t overwrite your RUnit tests by mistake), they write their output to the console, but you can also write the resulting testthat tests to a file. Converting all 300 of assertive’s tests was as easy as

convert_package_tests(
  "assertive", 
  test_file_regexp = "^test", 
  testthat_files = paste0("new-", runit_files)
)

In that line of code, the runit_files variable is a special name that refers to the names of the files that contain your RUnit tests. It means that the output testthat file names can be based on the original input names.

Although runittotestthat works fine on all my tests, automatic code editing is a tricky task, so there may be some weird edge cases that I’ve missed. Please download the package and play with it, and let me know if you find any bugs.

Update: runittotestthat is on CRAN these days.

Exploring the functions in a package

26th January, 2012

Sometimes it can be useful to list all the functions inside a package. This is done in the same way that you would list variables in your workspace. That is, using ls. The syntax is ls(pos = "package:packagename"), which is easy enough if you can remember it. Unfortunately, I never can, and have to type search() first to see what the format of that string is.

Today, that problem is solved with a tiny utility function to save remembering things, and to save typing.

lsp <- function(package, all.names = FALSE, pattern) 
{
  # Accept an unquoted package name, e.g. lsp(base)
  package <- deparse(substitute(package))
  ls(
      pos = paste("package", package, sep = ":"), 
      all.names = all.names, 
      pattern = pattern
  )
}

all.names and pattern behave in the same way as they do in regular ls. You use it like this:

lsp(base)
lsp(base, TRUE)
lsp(base, pattern = "^is")

EDIT: I’ve had a couple of questions about the use case, and there are some interesting comments on alternatives. My thinking behind this function was that I sometimes know I’ve seen a function in a package but can’t remember what it’s called. If you can hazard a guess at the name, then apropos is probably better, though it looks everywhere on the search path rather than in a particular package. Autocompletion is also useful for this, but you need to know the first few characters of what you are looking for. (Activate autocompletion by pressing TAB in R GUI or RStudio, or CTRL+space in Eclipse. I can’t remember what the shortcut is in emacs, but you probably just mash CTRL+META until you have RSI.) Finally, the unknownR package is useful for finding new functions that you hadn’t heard of yet.
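For comparison with apropos (the pattern is just an example):

# apropos looks for matching names everywhere on the search path,
# not just in one package
apropos("^is")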