## mabbles: The Missing Piece of a Unified Theory of Tidyness

R programming has seen a big shift in the last couple of years. The packages that RStudio has been creating to solve this or that problem suddenly started to cohere into a larger ecosystem. Once that ecosystem was given a name, the tidyverse, it became possible to think about its structure: how the packages relate to each other, and what new packages are needed. At this point, the tidyverse is already the dominant ecosystem on CRAN. Five of the top ten downloaded packages are tidyverse packages, and most of the packages in the ecosystem are in the top one hundred.

As the core tidyverse packages like dplyr mature, the most exciting developments are the ecosystem's expansion into new fields. Notably, tidytext is taking over text mining, tidyquant is poised to conquer financial time series analysis, and sf is getting the spatial statistics community excited.

There is one area that remains stubbornly distinct from the tidyverse. Bioconductor dominates biological research, particularly the ‘omics fields (genomics, transcriptomics, proteomics, and metabolomics). Thanks to the heavy curation of packages by the Bioconductor Core team, the two and a half thousand packages in the Bioconductor repository also form a coherent ecosystem.

In the same way that the general theory of relativity and quantum mechanics are incredibly powerful by themselves but are currently irreconcilable when it comes to thinking about gravity, the tidyverse and Bioconductor are more or less mutually exclusive ecosystems of R packages for data analysis. The fundamental data structure of the tidyverse is the data frame, but for Bioconductor it is the ExpressionSet.

If you’ve not come across ExpressionSets before, they essentially consist of a data frame of feature data, a data frame of response data, and a matrix of measurements. This data type is marvelously suited to dealing with data from ‘omics experiments and has served Bioconductor well for years.
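To make that structure concrete, here is a tiny ExpressionSet built by hand. This is a sketch assuming the Biobase package from Bioconductor is installed; the gene/sample names and values are invented for illustration.

```r
# Requires Bioconductor's Biobase package
library(Biobase)

# The matrix of measurements: features (rows) by samples (columns)
measurements <- matrix(
  rnorm(6), nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("sample", 1:2))
)

# The response (sample) data frame
pheno <- AnnotatedDataFrame(
  data.frame(treatment = c("control", "drug"),
             row.names = colnames(measurements))
)

# The feature data frame
features <- AnnotatedDataFrame(
  data.frame(chromosome = c("1", "2", "X"),
             row.names = rownames(measurements))
)

eset <- ExpressionSet(assayData = measurements,
                      phenoData = pheno,
                      featureData = features)

exprs(eset)  # the matrix of measurements
pData(eset)  # the response data frame
fData(eset)  # the feature data frame
```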

However, over the last decade, biological datasets have been growing exponentially, and for many experiments it is now no longer practical to store them in RAM, which means that an ExpressionSet is impractical. There are some very clever workarounds, but it strikes me that what Bioconductor needs is a trick from the tidyverse.

My earlier statement that the data frame is the fundamental data structure in the tidyverse isn’t quite true. It’s actually the tibble, an abstraction of the data frame. From a user point of view, tibbles behave like data frames with a slightly nicer print method. From a technical point of view, they have one huge advantage: they don’t care where their data is. tibbles can store their data in a regular data.frame, a data.table, a database, or on Spark. The user gets to write the same dplyr code to manipulate them, but the analysis can scale beyond the limits of RAM.
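To illustrate the backend-agnostic point, here is a sketch using an in-memory SQLite database (it assumes the dplyr, dbplyr, DBI, and RSQLite packages are installed; `summarise_mpg` is a made-up helper name). The same pipeline runs locally in R, or gets translated to SQL and run in the database.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

local_tbl  <- as_tibble(mtcars)   # data lives in RAM
remote_tbl <- tbl(con, "mtcars")  # data lives in the database

# Identical dplyr code for both backends
summarise_mpg <- function(x) {
  x %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg))
}

summarise_mpg(local_tbl)            # executed in R
summarise_mpg(remote_tbl)           # translated to SQL, run in SQLite
```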

If Bioconductor could have a similar abstracted ExpressionSet object, its users and developers could stop worrying about the rapidly expanding sizes of biological data.

Swapping out the data frame parts of an ExpressionSet is simple – you can just use tibbles already. The tricky part is what to do with the matrix. What is needed is an object that behaves like a matrix to the user, but acts like a tibble underneath.

I call such a theoretical object a *mabble*.

Unfortunately, right now, it doesn’t exist. This is where you come in. I think that there is plenty of fame and fortune for the person or team that can develop such an object, so I urge you to have a go.

The basic idea seems reasonably simple. You store the mabble as a tibble, with three columns for row, column, and value. Here’s a very simple implementation.

```r
# Requires the tibble and assertive packages
library(tibble)
library(assertive)

mabble <- function(x, nrow = NULL, ncol = NULL) {
  # Decide on dimensions
  n <- length(x)
  if(is.null(nrow)) {
    if(is.null(ncol)) {
      # Default to column vector
      nrow <- n
      ncol <- 1
    } else {
      # only ncol known
      nrow <- n / ncol
      assert_all_are_whole_numbers(nrow)
    }
  } else {
    if(is.null(ncol)) {
      # only nrow known
      ncol <- n / nrow
      assert_all_are_whole_numbers(ncol)
    } else {
      # both known
      # Not allowing recycling for now; may change my mind later
      assert_all_are_equal_to(nrow * ncol, length(x))
    }
  }
  m <- tibble(
    r = rep.int(seq_len(nrow), times = ncol),
    c = rep(seq_len(ncol), each = nrow),
    v = x
  )
  class(m) <- c("mbl", "tbl_df", "tbl", "data.frame")
  m
}
```

Then you need a print method so it displays like a matrix. Here’s a simple solution, though ideally only a limited number of rows and columns would be displayed.

```r
as.matrix.mbl <- function(x, ...) {
  reshape2::acast(x, r ~ c, value.var = "v")
}

print.mbl <- function(x, ...) {
  print(as.matrix(x))
}

(m <- mabble(1:12, 3, 4))
##   1 2 3  4
## 1 1 4 7 10
## 2 2 5 8 11
## 3 3 6 9 12
```

The grunt work is to write methods for all the things that matrices can do. Transposing is easy – you just swap the r and c columns.

```r
t.mbl <- function(x) {
  x %>% dplyr::select(r = c, c = r, v)
}

t(m)
##    1  2  3
## 1  1  2  3
## 2  4  5  6
## 3  7  8  9
## 4 10 11 12
```
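A few more generics fall straight out of the triplet representation. Here is a sketch of `dim` and simple element extraction, assuming the `mabble()` constructor above; note that this `[` method only handles the two-index `m[i, j]` case, and defining `[` on a tibble subclass can interfere with dplyr's internals, so the class is stripped before filtering.

```r
dim.mbl <- function(x) {
  c(max(x$r), max(x$c))
}

`[.mbl` <- function(x, i, j) {
  # Drop the mbl class first so dplyr's own `[` is used internally,
  # avoiding infinite recursion back into this method
  tbl <- x
  class(tbl) <- setdiff(class(tbl), "mbl")
  tbl %>%
    dplyr::filter(r %in% i, c %in% j) %>%
    dplyr::pull(v)
}

dim(m)   # 3 4
m[2, 3]  # 8
```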

There are a lot of things that need to be worked out. Right now, I have no idea how you implement linear algebra with a mabble. I don’t have time to make this thing myself but I’d be happy to advise you if you are interested in creating something yourself.
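That said, the triplet form does suggest one route for matrix multiplication: it is an inner join on the shared inner dimension followed by a grouped sum. Below is a sketch; `mmul_mbl` is a hypothetical helper name (in practice you would want to hook into the `%*%` generic, which needs S4), it assumes the `mabble()` constructor above and dplyr 1.0+ (for the `.groups` argument), and it ignores sparsity concerns such as missing cells.

```r
library(dplyr)

mmul_mbl <- function(x, y) {
  # Rename so the join and grouping columns are unambiguous:
  # x is indexed (i, k), y is indexed (k, j)
  xi <- rename(tibble::as_tibble(x), i = r, k = c, xv = v)
  yi <- rename(tibble::as_tibble(y), k = r, j = c, yv = v)
  res <- inner_join(xi, yi, by = "k") %>%     # match x's columns to y's rows
    group_by(i, j) %>%
    summarise(v = sum(xv * yv), .groups = "drop") %>%
    select(r = i, c = j, v) %>%
    arrange(c, r)
  class(res) <- c("mbl", "tbl_df", "tbl", "data.frame")
  res
}

v <- mabble(1:4, 4, 1)  # a 4x1 column vector
mmul_mbl(m, v)          # a 3x1 mabble with values 70, 80, 90
```

The nice property of this formulation is that joins and grouped sums are exactly the operations that databases and Spark are good at, so it should translate to remote backends for free.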

**Update**: A few people have quite rightly pointed out that Bioconductor is moving towards having SummarizedExperiments as its fundamental data structure. Further, SummarizedExperiments contain Assays, which are a virtual class. This means that they can have different backends. So it looks like other people have been thinking along similar lines to me.

I still think that harnessing the power of tibbles to provide instant connections to databases and Spark is useful. So a mabble could be a useful intermediate object. That is, the user accesses the Assay element of their SummarizedExperiment which is instantiated as a MabbleAssay which is a mabble underneath, which is actually a tibble which connects to the data store somewhere else. Simple!

Also, Dave Robinson has the biobroom package, for tidying up Bioconductor objects.

Notice that R’s matrix is optimized so that it works more efficiently with many internal algebraic functions, etc. If you use a data.frame-like object instead, it would lead to decreased performance in many cases. Moreover, matrix is a very simple object with nice, well-defined properties. I can’t see any reason for replacing it.

The reason for replacing it is when your matrix is so big that it doesn’t fit in RAM. There are solutions like bigmemory that let you process matrices on disk, but if your data gets big enough, you probably want to process it in a database or with Spark, which means storing it in a table, which means it has to look a bit like a data frame internally.

oh i figured you would say “SummarizedExperiment” – ExpressionSet is just for expression. I think the rest of your argument is sound.

Yeah I skipped mentioning SummarizedExperiments for simplicity.

aren’t there already solutions for memory mapped data structures in R? e.g. BigMemory package + the stuff from Microsoft R.

bigmemory implements matrices that are really objects on disk. What I want is an abstract concept that can be a local matrix or a bigmemory matrix or a matrix-like in a database or wherever.

My idea here is that if you harness the power of tibbles, then the work of connecting to other data stores is done for you.

I’ll try to write up a vignette on DelayedArray (https://github.com/Bioconductor-mirror/DelayedArray) because having an “abstract concept that can be a local matrix or a bigmemory matrix or a matrix-like” is exactly its intent. Here are some quick thoughts, but do check out the docs.

Currently, the backend (‘seed’) to a DelayedArray can be a base::matrix, a matrix from the Matrix package (e.g. sparse matrices), a data.frame/tibble/DataFrame, an array of run-length encoded data (RleArray), or an on-disk array backed by a HDF5 file (HDF5Array). The latter is probably of most relevance to this discussion. Basically, anything that is matrix-like. FWIW I think it should be possible to support objects like bigmemory matrices and ff as a seed in a DelayedArray (I don’t think a ‘database-backed’ tibble works out-of-the-box, but it’s an interesting avenue to explore).

You can then stick a DelayedArray inside a SummarizedExperiment as an element in the assays slot – if you use a disk-based backend, such as HDF5Array (https://github.com/Bioconductor-mirror/HDF5Array), then you have a rich object for storing genomics data that is disk-backed rather than all in RAM. This is really powerful, I believe.

Now, something to bear in mind is that methods for DelayedArray objects currently work using a block-processing (apply+combine) strategy – chunks of the data are realised in memory (if needed) and coerced to a base::array. This won’t be the most efficient way of doing things for all backends or operations. Something that I’ve discussed with Hervé Pagès (core BioC developer and the developer of DelayedArray, HDF5Array) is to use the ‘native’ method when it’s available. The idea being that if a specialized method exists then it is used, but if not we can fall back on the apply+combine strategy. For example, suppose you have a DelayedArray where the seed is a sparse matrix and you want to compute the colSums. Rather than the current strategy of apply+combine that coerces the data to a (dense) array, we could/should instead defer to the colSums method defined in the Matrix package for sparse matrices. You can imagine how this might extend to database-backed DelayedArray objects.

Have you looked at the “Matrix” package? It solves the sparse-matrix requirement that you described, as well as the linear-algebra operations. The missing part is the “where the object lives” abstraction. As some have posted already, the memory-mapped tools get you there. Ideally you would be able to borrow the abstraction from the tidyverse tools by incorporating it into the “Matrix” package OOP…

I’m a bioconductor user/contributor, and I really like the high-level suggestion and motivation for this post.

A SpatialGridDataFrame is essentially this; it had a lot of promise but was never generalized, fully fleshed out, or given multiple backends.

It’s really a set of columns with dims, and could apply to spatial grids by addition of an affine transformation (defaulting to a grid index if not a specific one). I agree this needs a unified treatment for the future of R.

This would indeed be a nice/useful extension for SummarizedExperiment objects. I guess an implementation would use HDF5, already supported by Bioc (HDF5Array) and spark-hdf5 (a plugin to enable Apache Spark to read HDF5 files).