Archive for the ‘R’ Category

mabbles: The Missing Piece of a Unified Theory of Tidyness

20th April, 2017 12 comments

R programming has seen a big shift in the last couple of years. All those packages that RStudio have been creating to solve this or that problem suddenly started to cohere into a larger ecosystem of packages. Once it was given a name, the tidyverse, it became possible to start thinking about the structure of the ecosystem and how packages relate to each other and what new packages were needed. At this point, the tidyverse is already the dominant ecosystem on CRAN. Five of the top ten downloaded packages are tidyverse packages, and most of the packages in the ecosystem are in the top one hundred.

As the core tidyverse packages like dplyr mature, the most exciting development is the ecosystem’s expansion into new fields. Notably, tidytext is taking over text mining, tidyquant is poised to conquer financial time series analyses, and sf is getting the spatial stats community excited.

There is one area that remains stubbornly distinct from the tidyverse. Bioconductor dominates biological research, particularly ‘omics fields (genomics, transcriptomics, proteomics, and metabolomics). Thanks to the heavy curation of packages by Bioconductor Core, the two and a half thousand packages in the Bioconductor repository also form a coherent ecosystem.

In the same way that the general theory of relativity and quantum mechanics are incredibly powerful by themselves but are currently irreconcilable when it comes to thinking about gravity, the tidyverse and Bioconductor are more or less mutually exclusive ecosystems of R packages for data analysis. The fundamental data structure of the tidyverse is the data frame, but for Bioconductor it is the ExpressionSet.

If you’ve not come across ExpressionSets before, they essentially consist of a data frame of feature data, a data frame of response data, and a matrix of measurements. This data type is marvelously suited to dealing with data from ‘omics experiments and has served Bioconductor well for years.
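
To make that layout concrete, here is a rough base-R sketch of the three pieces. It is only an illustration: a real ExpressionSet is an S4 object from the Biobase package, whereas the plain list below, its element names, and the numbers in it are all made up for this example.

```r
# Toy stand-in for an ExpressionSet (hypothetical names; not the Biobase API)
features <- data.frame(
  gene = c("g1", "g2", "g3"),
  chromosome = c("1", "1", "X"),
  stringsAsFactors = FALSE
)
samples <- data.frame(
  sample = c("s1", "s2"),
  treatment = c("control", "drug"),
  stringsAsFactors = FALSE
)

# One row per feature, one column per sample (made-up values)
measurements <- matrix(
  c(5.1, 4.8, 7.2, 6.9, 3.3, 3.0),
  nrow = nrow(features),
  ncol = nrow(samples),
  dimnames = list(features$gene, samples$sample)
)

toy_eset <- list(
  features = features,
  samples = samples,
  measurements = measurements
)
dim(toy_eset$measurements)
```

The point to notice is that the rows of the matrix line up with the feature data and its columns line up with the response data, so all three pieces have to be kept in sync.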

However, over the last decade, biological datasets have been growing exponentially, and for many experiments it is now no longer practical to store them in RAM, which means that an ExpressionSet is impractical. There are some very clever workarounds, but it strikes me that what Bioconductor needs is a trick from the tidyverse.

My earlier statement that the data frame is the fundamental data structure in the tidyverse isn’t quite true. It’s actually the tibble, an abstraction of the data frame. From a user point of view, tibbles behave like data frames with a slightly nicer print method. From a technical point of view, they have one huge advantage: they don’t care where their data is. tibbles can store their data in a regular data.frame, a data.table, a database, or on Spark. The user gets to write the same dplyr code to manipulate them, but the analysis can scale beyond the limits of RAM.

If Bioconductor could have a similar abstracted ExpressionSet object, its users and developers could stop worrying about the rapidly expanding sizes of biological data.

Swapping out the data frame parts of an ExpressionSet is simple – you can just use tibbles already. The tricky part is what to do with the matrix. What is needed is an object that behaves like a matrix to the user, but acts like a tibble underneath.

I call such a theoretical object a mabble.

Unfortunately, right now, it doesn’t exist. This is where you come in. I think that there is plenty of fame and fortune for the person or team that can develop such an object, so I urge you to have a go.

The basic idea seems reasonably simple. You store the mabble as a tibble, with three columns for row, column, and value. Here’s a very simple implementation.

library(tibble)    # for tibble()
library(assertive) # for the assert_* functions

mabble <- function(x, nrow = NULL, ncol = NULL) {
  # Decide on dimensions
  n <- length(x)
  if(is.null(nrow)) {
    if(is.null(ncol)) {
      # Default to column vector
      nrow <- n
      ncol <- 1
    } else { # only ncol known
      nrow <- n / ncol
      assert_all_are_whole_numbers(nrow)
    }
  } else {
    if(is.null(ncol)) { # only nrow known
      ncol <- n / nrow
      assert_all_are_whole_numbers(ncol)
    } else { # both known
      # Not allowing recycling for now; may change my mind later
      assert_all_are_equal_to(nrow * ncol, length(x))
    }
  }

  m <- tibble(
    r = rep.int(seq_len(nrow), times = ncol),
    c = rep(seq_len(ncol), each = nrow),
    v = x
  )
  class(m) <- c("mbl", "tbl_df", "tbl", "data.frame")
  m
}

Then you need a print method so it displays like a matrix. Here’s a simple solution, though ideally only a limited number of rows and columns would be displayed.

as.matrix.mbl <- function(x, ...) {
  # Cast the long (r, c, v) table back to a wide matrix
  reshape2::acast(x, r ~ c, value.var = "v")
}

print.mbl <- function(x, ...) {
  print(as.matrix(x), ...)
  invisible(x)
}
(m <- mabble(1:12, 3, 4))
##   1 2 3  4
## 1 1 4 7 10
## 2 2 5 8 11
## 3 3 6 9 12
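
Limiting the display could work along these lines. This is a minimal base-R sketch rather than part of the implementation above: it uses a plain data frame with the same r, c, and v columns instead of a real mabble, and the helper name head_matrix and its max_rows and max_cols arguments are hypothetical.

```r
# Hypothetical helper: build only the top-left corner of the matrix
# from a long-format table with columns r (row), c (column), v (value)
head_matrix <- function(x, max_rows = 10, max_cols = 10) {
  keep <- x[x$r <= max_rows & x$c <= max_cols, ]
  out <- matrix(NA, nrow = max(keep$r), ncol = max(keep$c))
  # Matrix indexing: each (r, c) pair picks the cell to fill
  out[cbind(keep$r, keep$c)] <- keep$v
  out
}

# Same layout as mabble(1:12, 3, 4), but as a plain data frame
long <- data.frame(
  r = rep(1:3, times = 4),
  c = rep(1:4, each = 3),
  v = 1:12
)
corner <- head_matrix(long, max_rows = 2, max_cols = 3)
corner
```

Filtering before reshaping matters here: if the data lived in a database, only the rows needed for the visible corner would have to be fetched.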

The grunt work is to write methods for all the things that matrices can do. Transposing is easy – you just swap the r and c columns.

t.mbl <- function(x) {
  # Swap the row and column indices; the values stay where they are
  dplyr::select(x, r = c, c = r, v)
}
t(m)
##    1  2  3
## 1  1  2  3
## 2  4  5  6
## 3  7  8  9
## 4 10 11 12

There are a lot of things that need to be worked out. Right now, I have no idea how you implement linear algebra with a mabble. I don’t have time to make this thing myself but I’d be happy to advise you if you are interested in creating something yourself.
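
As one possible starting point for the linear algebra, and very much an assumption rather than a worked-out design, matrix multiplication can be written as a join followed by a grouped sum on the long format. The sketch below uses base R data frames with r, c, and v columns in place of real mabbles; the helper names to_long, to_wide, and multiply_long are made up for the example.

```r
# Convert an ordinary matrix to the long (r, c, v) layout and back
to_long <- function(m) {
  data.frame(r = as.vector(row(m)), c = as.vector(col(m)), v = as.vector(m))
}
to_wide <- function(long) {
  out <- matrix(0, nrow = max(long$r), ncol = max(long$c))
  out[cbind(long$r, long$c)] <- long$v
  out
}

# Hypothetical helper: multiply two long-format matrices
multiply_long <- function(a, b) {
  lhs <- data.frame(i = a$r, k = a$c, va = a$v)
  rhs <- data.frame(k = b$r, j = b$c, vb = b$v)
  # Pair every a[i, k] with every b[k, j]
  joined <- merge(lhs, rhs, by = "k")
  joined$term <- joined$va * joined$vb
  # Sum the products over k for each (i, j) cell
  summed <- aggregate(term ~ i + j, data = joined, FUN = sum)
  data.frame(r = summed$i, c = summed$j, v = summed$term)
}

a <- matrix(1:6, nrow = 2)
b <- matrix(1:6, nrow = 3)
product <- to_wide(multiply_long(to_long(a), to_long(b)))
all.equal(product, a %*% b)
```

A join and a grouped sum are operations that database backends already support, which is why this shape seems promising. Decompositions and solvers are a much harder problem, and this sketch says nothing about them.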


Update: A few people have quite rightly pointed out that Bioconductor is moving towards having SummarizedExperiments as its fundamental data structure. Further, SummarizedExperiments contain Assays, which are a virtual class. This means that they can have different backends. So it looks like other people have been thinking along similar lines to me.

I still think that harnessing the power of tibbles to provide instant connections to databases and Spark is useful. So a mabble could be a useful intermediate object. That is, the user accesses the Assay element of their SummarizedExperiment which is instantiated as a MabbleAssay which is a mabble underneath, which is actually a tibble which connects to the data store somewhere else. Simple!

Also, Dave Robinson has the biobroom package, for tidying up Bioconductor objects.

Introducing DohaR – A new R User Group in Doha, Qatar

I’m starting a new R User Group in Doha, Qatar.  Our first meetup is on 26th May, at the HBKU Student Center in Education City.  I’ll be talking about run-time testing with my assertive package, and there will be two other speakers who I need to find pretty sharpish.  (If you want to talk, get in touch!)

Both new and more seasoned useRs are welcome. RSVP on the meetup site:

http://www.meetup.com/doha-rug

(Registration is free.)

Big package update: assertive is now 16 packages; new pathological also on CRAN

9th October, 2015 2 comments

One of the bits of feedback that I got from the useR conference was that my assertive package, for run-time testing of code, was too big. Since then I’ve been working to modularise it, and the results are now on CRAN.

Now, rather than one big package, there are fifteen assertive.* packages for specific pieces of functionality. For example, assertive.numbers contains functionality for checking numeric vectors, assertive.datetimes contains functionality for checking dates and times, and assertive.properties contains functionality for checking names, attributes, and lengths.

This finer grained approach means that if you want to develop a package with assertions, you can choose only the parts that you need, allowing for a much smaller footprint.

The pathological package, which depends upon assertive, gives you an example of how to do this.

The assertive package itself now contains no functions of its own – it merely imports and re-exports the functions from the other 15 packages. This means that if you are working with assertive interactively, you can still simply type library(assertive) and have access to all the functionality.

Qatar R User Group

15th September, 2015 5 comments

Microsoft have announced new funding for R User Groups (Python and Hadoop too), so now seems as good a time as any for me to stop procrastinating and set up a Qatar R User Group.

If you live anywhere near Doha, and are interested in coming along to a (probably monthly) meetup about R, then fill in the survey to let me know when and where is best for you to meet, and drop me an email at richierocksATgmailDOTcom to say you’d like to come along.

R packages with unlimited licenses

26th August, 2015 7 comments

I had an interesting email today saying that developers at the writer’s company wanted to use one of my packages, but weren’t allowed because it was under an unlimited license.

I release quite a few of my packages under an unlimited license, at least for toy projects and immature packages. In those cases, letting users do what they want is more important to me than the fairness of, for example, the GPL.

(assertive is a notable exception, because it’s taken a lot of work, and also because it contains some RStudio code.)

Anyway, the lady who wrote to me requested that I release my package under a dual license to enable her staff to use it.

My alternative solution is more elegant: since the license is unlimited, you can simply download the package source, edit the DESCRIPTION file to change the license to whatever you want, and use it as you see fit.
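
If you would rather script that edit than do it by hand, something like the following base-R sketch would work. It creates a stand-in DESCRIPTION file in a temporary directory so that it runs on its own; the package name toypkg and the choice of GPL-3 as the new license are purely illustrative.

```r
# Stand-in for an unpacked package source (the package name is made up)
desc_file <- file.path(tempdir(), "DESCRIPTION")
write.dcf(
  data.frame(
    Package = "toypkg",
    Version = "0.0-1",
    License = "Unlimited",
    stringsAsFactors = FALSE
  ),
  file = desc_file
)

# Read the fields, swap the license, and write the file back
desc <- read.dcf(desc_file)
desc[, "License"] <- "GPL-3"
write.dcf(desc, file = desc_file)

read.dcf(desc_file, fields = "License")
```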

New version of assertive and answers to tutorial exercises

16th July, 2015 Leave a comment

I gave a tutorial at useR on testing R code, which turned out to be a great way of getting feedback on my code! Based on the suggestions by attendees, I’ve made a big update to the package, which is now on CRAN. Full details of the new features can be accessed on the ?changes help page within the package.

Also, the slides, exercises and answers from the tutorial are now available online.

The state of assertions in R

3rd July, 2015 7 comments

“Assertion” is computer-science jargon for a run-time check on your code. In R, this typically means function argument checks (“did they pass a numeric vector rather than a character vector into your function?”), and data quality checks (“does the date-of-birth column contain values in the past?”).
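
As a minimal illustration of both kinds of check, here is a sketch that uses only base R rather than any of the assertion packages. The function name mean_age_in_years and the error messages are made up for the example.

```r
mean_age_in_years <- function(date_of_birth) {
  # Argument check: fail early if the input has the wrong type
  if (!inherits(date_of_birth, "Date")) {
    stop("date_of_birth must be a Date vector.")
  }
  # Data quality check: every date of birth should be in the past
  if (any(date_of_birth >= Sys.Date(), na.rm = TRUE)) {
    stop("date_of_birth contains dates that are not in the past.")
  }
  mean(as.numeric(Sys.Date() - date_of_birth) / 365.25, na.rm = TRUE)
}

mean_age_in_years(as.Date(c("1980-01-01", "1990-06-15")))
```

Passing a character vector, or a date in the future, stops with an error instead of quietly returning a wrong answer.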

The four packages

R currently has four packages for assertions: assertive, which is mine; assertthat by Hadley Wickham; assertr by Tony Fischetti; and ensurer by Stefan Bache.

Having four packages feels like too many; we’re duplicating effort, and it makes package choice too hard for users. I didn’t know about the existence of assertr or ensurer until a couple of days ago, but the useR conference has helped bring these rivals to my attention. I’ve chatted with the authors of the other three packages to see if we can streamline things a little.

Hadley said that assertthat isn’t a high priority for him – dplyr, ggplot2 and tidyr (among many others) are more important – so he’s not going to develop it further. Since assertthat is mostly a subset of assertive anyway, this shouldn’t be a problem. I’ll take a look at how easy it is to provide an assertthat API, so existing users can have a direct replacement.

Tony said that the focus of assertr is predominantly data checking. It only works with data frames, and has a more limited remit than assertive. He plans to change the backend to be built on top of assertive. That is, assertr will be an assertive extension that makes it easy to apply assertions to multiple columns in data frames.

Stefan has stated that he prefers to keep ensurer separate, since it has a different philosophical stance to assertive, and I agree. ensurer is optimised for being lightweight and elegant; assertive is optimised for clarity of user code and clarity of error messages (at a cost of some bulk).

So overall, we’re down from four distinct assertion packages to two groups (assertive/assertr and ensurer). This feels sensible. It’s the optimum number for minimizing duplication while still having some competition to spur development onwards.

The assertive development plan

ensurer has one feature in particular that I definitely want to include in assertive: you can create type-safe functions.
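
To give a flavour of the idea, here is a rough base-R sketch of a type-safe function. It does not use ensurer’s actual API or anything in assertive; the wrapper name type_safe and its arguments are hypothetical.

```r
# Hypothetical helper: wrap f so its input and output are checked on every call
type_safe <- function(f, is_valid_input, is_valid_output) {
  function(x) {
    if (!is_valid_input(x)) {
      stop("Input has the wrong type.")
    }
    result <- f(x)
    if (!is_valid_output(result)) {
      stop("Output has the wrong type.")
    }
    result
  }
}

# A square root that only accepts, and only returns, numeric vectors
safe_sqrt <- type_safe(sqrt, is.numeric, is.numeric)
safe_sqrt(16)
```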

The question of bulk has also been playing on my mind for a while. The package isn’t huge by any means – its tar.gz file is 836kB – but the number of functions can make it a little difficult for new users to find their way around. A couple of years ago when I was working with a lot of customer data, I included functions for checking things like the validity of UK postcodes. These are things that I’m unlikely to use at all in my current job, so it seems superfluous to have them. That means that I’d like to make assertive more modular. The core things should be available in an assertive.base package, with specialist assertions in additional packages.

I also want to make it easier for other package developers to include their own assertions in their packages. This will require a bit of rethinking about how the existing assertion engine works, and what internal bits I need to expose.

One bit of feedback I got from the attendees at my tutorial this week was that for simulation usage (where you call the same function millions of times), assertions can slow down the code too much. So a way to turn off the assertions (but keep them there for debugging purposes) would be useful.
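
One way this could work is a global option that the assertions consult before doing any checking. The sketch below is base R only and is an assumption about a possible design, not how assertive does it; the option name sketch.assertions.enabled and the function names are made up.

```r
# Hypothetical option name; assertions are on unless it is set to FALSE
assertions_enabled <- function() {
  isTRUE(getOption("sketch.assertions.enabled", TRUE))
}

assert_all_positive <- function(x) {
  # Skip the check entirely when assertions are switched off
  if (assertions_enabled() && !isTRUE(all(x > 0))) {
    stop("x contains values that are not positive.")
  }
  invisible(x)
}

simulate_once <- function(x) {
  assert_all_positive(x)
  sum(log(x))
}

options(sketch.assertions.enabled = FALSE) # fast path for simulations
simulate_once(c(1, 2, 3))
options(sketch.assertions.enabled = TRUE)  # back on for debugging
```

The assertion stays in the code, so turning the option back on restores the checks without editing any functions.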

The top feature request, however, was for pipe compatibility. Stefan’s magrittr package has rocketed in popularity (I’m a huge fan), so this definitely needs implementing. It should be a small fix, so I should have it included soon.
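
The essence of pipe compatibility is that an assertion hands its input back when the check passes, so the value can carry on down the pipeline. Here is a minimal sketch of that idea, assuming R 4.1 or later so that the native |> pipe can stand in for magrittr’s %>%; the function name assert_is_numeric_sketch is made up and is not assertive’s implementation.

```r
assert_is_numeric_sketch <- function(x) {
  if (!is.numeric(x)) {
    stop("x is not numeric.")
  }
  x # return the input so it can flow on to the next step
}

result <- c(1, 4, 9) |>
  assert_is_numeric_sketch() |>
  sqrt()
result
```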

There are some other small fixes like better NA handling and a better error message for is_in_range that I plan to make soon.

The final (rather non-trivial) feature I want to add to assertive is support for error messages in multiple languages. The infrastructure is in place for translations (it currently supports both of the languages that I know: British English and American English); I just need some people who can speak other languages to do the translations. If you are interested in translating, drop me an email or let me know in the comments.