I’m starting a new R User Group in Doha, Qatar. Our first meetup is on 26th May, at the HBKU Student Center in Education City. I’ll be talking about run-time testing with my assertive package, and there will be two other speakers who I need to find pretty sharpish. (If you want to talk, get in touch!)
Both new and more seasoned useRs are welcome. RSVP on the meetup site:
(Registration is free.)
One of the bits of feedback that I got from the useR conference was that my
assertive package, for run-time testing of code, was too big. Since then I’ve been working to modularise it, and the results are now on CRAN.
Now, rather than one big package, there are fifteen
assertive.* packages for specific pieces of functionality. For example,
assertive.numbers contains functionality for checking numeric vectors,
assertive.datetimes contains functionality for checking dates and times, and
assertive.properties contains functionality for checking names, attributes, and lengths.
This finer grained approach means that if you want to develop a package with assertions, you can choose only the parts that you need, allowing for a much smaller footprint.
The pathological package, which depends upon
assertive, gives you an example of how to do this.
The assertive package itself now contains no functions of its own – it merely imports and re-exports the functions from the other fifteen packages. This means that if you are working with
assertive interactively, you can still simply type
library(assertive) and have access to all the functionality.
Microsoft have announced new funding for R User Groups (Python and Hadoop too), so now seems as good a time as any for me to stop procrastinating and set up a Qatar R User Group.
If you live anywhere near Doha, and are interested in coming along to a (probably monthly) meetup about R, then fill in the survey to let me know when and where is best for you to meet, and drop me an email at richierocksATgmailDOTcom to say you’d like to come along.
I had an interesting email today saying that developers at the writer’s company wanted to use one of my packages, but weren’t allowed because it was under an unlimited license.
I release quite a few of my packages under an unlimited licence, at least for toy projects and immature packages. In those cases, letting users do what they want is more important to me than the fairness of, for example, the GPL.
(assertive is a notable exception, because it’s taken a lot of work, and also because it contains some RStudio code.)
Anyway, the lady who wrote to me requested that I release my package under a dual license to enable her staff to use it.
My alternate solution is more elegant: since the license is unlimited, you can simply download the package source, edit the DESCRIPTION file to change the license to whatever you want, and use it as you see fit.
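The re-licensing step can be sketched in a few lines of R. This is a minimal illustration, assuming the package source has already been downloaded and untarred (e.g. via download.packages() and untar()); "mypkg" and the choice of MIT licence are placeholders, not anything specific to my packages.

```r
# Sketch: swap the License field of an unlimited-licensed package.
# Here we fake a downloaded DESCRIPTION file in tempdir() to keep
# the example self-contained; "mypkg" is a placeholder name.
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(c("Package: mypkg",
             "Version: 0.1",
             "License: Unlimited"), desc_path)

# The actual edit: replace the License line with whatever you want.
desc <- readLines(desc_path)
desc <- sub("^License:.*", "License: MIT + file LICENSE", desc)
writeLines(desc, desc_path)

readLines(desc_path)  # the License field is now MIT
```

After that, rebuild and install the package from the edited source directory with install.packages(…, repos = NULL, type = "source").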
I gave a tutorial at useR on testing R code, which turned out to be a great way of getting feedback on my code! Based on the suggestions by attendees, I’ve made a big update to the package, which is now on CRAN. Full details of the new features can be accessed in the
?changes help page within the package.
Also, the slides, exercises and answers from the tutorial are now available online.
“Assertion” is computer-science jargon for a run-time check on your code. In R , this typically means function argument checks (“did they pass a numeric vector rather than a character vector into your function?”), and data quality checks (“does the date-of-birth column contain values in the past?”).
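A toy example of both kinds of check, using only base R’s stopifnot() so it runs anywhere (assertive wraps the same idea up with clearer error messages, via functions such as assert_is_numeric() and assert_all_are_in_past()):

```r
# Run-time assertions: an argument check and a data quality check.
mean_age <- function(date_of_birth) {
  # Argument check: did they pass dates rather than strings?
  stopifnot(inherits(date_of_birth, "Date"))
  # Data quality check: dates of birth must be in the past.
  stopifnot(all(date_of_birth <= Sys.Date()))
  mean(as.numeric(Sys.Date() - date_of_birth) / 365.25)
}

mean_age(as.Date(c("1980-05-26", "1990-01-01")))  # fine
# mean_age("1980-05-26")  # error: a string, not a Date
```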
The four packages
R currently has four packages for assertions:
assertive, which is mine;
assertthat by Hadley Wickham,
assertr by Tony Fischetti, and
ensurer by Stefan Bache.
Having four packages feels like too many; we’re duplicating effort, and it makes package choice too hard for users. I didn’t know about the existence of
ensurer until a couple of days ago, but the useR conference has helped bring these rivals to my attention. I’ve chatted with the authors of the other three packages to see if we can streamline things a little.
Hadley said that
assertthat isn’t a high priority for him – dplyr, ggplot2 and tidyr (among many others) are more important – so he’s not going to develop it further. Since
assertthat is mostly a subset of
assertive anyway, this shouldn’t be a problem. I’ll take a look at how easy it is to provide an
assertthat API, so existing users can have a direct replacement.
Tony said that the focus of
assertr is predominantly data checking. It only works with data frames, and has a more limited remit than
assertive. He plans to change the backend to be built on top of
assertive. That is,
assertr will be an
assertive extension that makes it easy to apply assertions to multiple columns in data frames.
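To illustrate the idea (this is my own sketch, not assertr’s actual API), a column-wise assertion helper might look like this:

```r
# Apply an assertion predicate to several data frame columns at once.
# This is an illustrative sketch, not assertr's real interface.
assert_columns <- function(data, cols, predicate) {
  for (col in cols) {
    if (!all(predicate(data[[col]]))) {
      stop("Column '", col, "' failed its assertion")
    }
  }
  invisible(data)  # return the data so calls can be chained
}

assert_columns(mtcars, c("mpg", "wt"), function(x) x > 0)  # passes silently
# assert_columns(mtcars, "drat", function(x) x > 5)        # would error
```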
Stefan has stated that he prefers to keep
ensurer separate, since it has a different philosophical stance to
assertive, and I agree.
ensurer is optimised for being lightweight and elegant;
assertive is optimised for clarity of user code and clarity of error messages (at a cost of some bulk).
So overall, we’re down from four distinct assertion packages to two groups (ensurer on its own, and assertr built on top of assertive). This feels sensible. It’s the optimum number for minimizing duplication while still having some competition to spur development onwards.
The assertive development plan
ensurer has one feature in particular that I definitely want to include in
assertive: you can create type-safe functions.
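In base R, the type-safe-function idea can be sketched as a wrapper that checks both the input and the output (this is my illustration of the concept, not ensurer’s or assertive’s actual API):

```r
# Wrap a function so that its argument and return value are
# type-checked at run time. Illustrative sketch only.
type_safe <- function(f, arg_check, result_check) {
  function(x) {
    stopifnot(arg_check(x))          # input type check
    result <- f(x)
    stopifnot(result_check(result))  # output type check
    result
  }
}

safe_sqrt <- type_safe(sqrt, is.numeric, is.numeric)
safe_sqrt(c(1, 4, 9))  # 1 2 3
# safe_sqrt("oops")    # error before sqrt() is even called
```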
The question of bulk has also been playing on my mind for a while. It isn’t huge by any means – the tar.gz file for the package is 836kB – but the number of functions can make it a little difficult for new users to find their way around. A couple of years ago when I was working with a lot of customer data, I included functions for checking things like the validity of UK postcodes. These are things that I’m unlikely to use at all in my current job, so it seems superfluous to have them. That means that I’d like to make
assertive more modular. The core things should be available in an
assertive.base package, with specialist assertions in additional packages.
I also want to make it easier for other package developers to include their own assertions in their packages. This will require a bit of rethinking about how the existing assertion engine works, and what internal bits I need to expose.
One bit of feedback I got from the attendees at my tutorial this week was that for simulation usage (where you call the same function millions of times), assertions can slow down the code too much. So a way to turn off the assertions (but keep them there for debugging purposes) would be useful.
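One possible design for such a switch, sketched in base R; the option name "assertive.enabled" is hypothetical, not an existing API:

```r
# A global on/off switch for assertions, controlled by an option.
assertions_enabled <- function() {
  isTRUE(getOption("assertive.enabled", TRUE))  # on by default
}

assert_positive <- function(x) {
  if (assertions_enabled()) stopifnot(all(x > 0))
  invisible(x)
}

options(assertive.enabled = FALSE)  # e.g. before a million-call simulation
assert_positive(-1)                 # the check is skipped: no error
options(assertive.enabled = TRUE)   # turn checks back on for debugging
```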
The top feature request, however, was pipe compatibility. Stefan’s
magrittr package has rocketed in popularity (I’m a huge fan), so this definitely needs implementing. It should be a small fix, so I should have it included soon.
There are some other small fixes like better NA handling and a better error message for
is_in_range that I plan to make soon.
The final (rather non-trivial) feature I want to add to assertive is support for error messages in multiple languages. The infrastructure for translations is in place (it currently supports both the languages that I know: British English and American English); I just need some people who can speak other languages to do the translations. If you are interested in translating, drop me an email or let me know in the comments.
The morning opened with a speaker I was too bleary-eyed to identify – possibly the dean of the University of Aalborg. Anyway, he said that this is the largest ever useR conference, and the first ever in a Nordic country. Take that, Norway! Also, considering that there are now quite a few R-based conferences (Bioconductor has its own conference, not to mention R in Finance and EARL), it’s impressive that these haven’t taken away from the main event.
Torben the conference organiser then spoke briefly and mentioned that planning for this event started back in June 2013.
Romain Francois gave a talk of equal parts making-R-go-faster, making-R-syntax-easier, and cat-pix. He quipped that he has an open relationship with R: he gets to use Java and C++, and “I’m fine with other people using R”. Jokes aside, he gave an overview of his big R achievements: the
J functions for
rJava that massively simplified its calling syntax; the
//[[Rcpp::export]] command that massively simplified writing
Rcpp code, and the internals of the dplyr package.
He also gave a demo of JJ Allaire’s
RcppParallel, and talked about plans to integrate that into
dplyr for a free performance boost on multicore systems.
I also had a coffee-break chat with mathematician Luzia Burger-Ringer (awesome name), who has recently started R programming after a seventeen year career break to raise children. She said:
“When I returned to work I struggled to remember my mathematics training, but using R I could be productive. Compared to Fortran it’s just a few lines of code to get an answer.”
Considering that I’ve forgotten pretty much everything I know by the time I’ve returned from vacation, I’m impressed by Luzia’s ability to dive in after a seventeen-year break. And I think this is good counter-evidence against R’s perceived tricky learning curve. Try fitting a random forest model in Fortran!
After being suitably caffeinated, I went to the interfacing session discussing connecting R to other languages.
Kasper Hansen had some lessons for integrating external libraries into R packages. He suggested two approaches:
“Either you link to the library, maybe with a function to download that library – this is easiest for the developer; or you include the library in your package – this is easiest for the user”.
He said he’s mostly gone for the latter approach, but that cross-platform development in this way is a bit of a nightmare.
Kasper gave examples of the
illuminaio package, for reading some biological files with no defined specification, some versions of which were encrypted; the
affxparser package for reading Affymetrix RNA sequence files, which didn’t have proper OS-independent file paths, and
RGraphviz which connects to the apparently awfully implemented Graphviz network visualization software. There were many tales of death-by-memory-leak.
In the discussion afterwards it was interesting to note the exchange between Kasper and Dirk Eddelbuettel. Dirk suggested that Kasper was overly negative about the problems of interfacing with external libraries because he’d had the unfortunate luck to deal with many bad-but-important ones, whereas in general you can just pick good libraries to work with.
My opinion is that Kasper had to pick libraries built by biologists, and my experience is that biologists are generally better at biology than software development (to put it politely).
Christophe Best talked about calling R from Go. After creating the language, Google seem to be making good internal use of Go. And as a large organisation, they suffer from the different-people-writing-different-languages problem quite acutely. Consequently, they have a need for R to plug modelling gaps in their fledgling systems language.
Their R-Go connector runs R in a different process to Go (unlike Rcpp, which uses an intra-process system, according to Christophe). This is more complex to set up, but means that “R and Go don’t have shared crashes”.
It sounds promising, but for the moment you can only pass atomic types and lists. Support for data frames is planned, as is support for calling Go from R, so this is a project to watch.
Matt Ziubinski talked about libraries to help you work with Rcpp. He recommended Catch, a testing framework for C++. The code for this looked pretty readable (even to me, who hasn’t really touched C++ in over a decade).
He also recommended Boost, which allows compile-time calculations, easy parallel processing, and pipes.
He was also a big fan of C++11, which simplifies a lot of boilerplate coding.
Dan Putler talked about connecting to Spark’s MLlib package for machine learning. He said that connecting to the library was easy, but then they wondered why they had bothered! Always fun to see some software being flamed.
Apparently the regression tools in Spark MLlib don’t hold a candle to R’s
glm. They may not be fancy functions, but they’ve been carefully built for robustness.
After some soul-searching, Dan decided that Spark was still worth using, despite the weakness of MLlib, since it nicely handles distributing your data.
He and his team have created a SparkGLM package (https://github.com/AlteryxLabs/sparkGLM) that ports R’s linear regression algorithms to Spark.
lm is mostly done;
glm is work-in-progress.
After lunch, I went to the clustering session.
Anders Bilgram kicked off the afternoon session with a talk on unsupervised meta-analysis using Gaussian mixed copula models. Say that ten times fast.
He described this as a semi-parametric version of the more standard Gaussian mixed models. I think he meant “mixed” as in mixture models, where you consider your data to consist of draws from several different distributions, rather than mixed effects models, where you have random effects.
The Gaussian copula bit means that you have to transform your data to be normally distributed first, and he recommended rank normalization for that.
(We do that in proteomics too; you want
qnorm(rank(x) / (length(x) + 1)), and yeah, that should be in a package somewhere.)
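Until it lands in a package, the one-liner above is easy to wrap yourself:

```r
# Rank normalization: map values to standard normal quantiles by rank.
rank_normalize <- function(x) {
  qnorm(rank(x) / (length(x) + 1))
}

x <- rexp(1000)                # heavily skewed input
z <- rank_normalize(x)         # approximately standard normal
c(mean = mean(z), sd = sd(z))  # close to 0 and 1
```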
Anders gave a couple of nice examples: he took a 1.4Mpx photo of the space shuttle and clustered it by pixel color, and clustered the ranks of a replicated gene study.
He did warn that he hadn’t tested his approach with high-dimensional data though.
Claudia Beleites, who asked the previous question about high-dimensional data, went on to talk about hierarchical clustering of (you guessed it) high-dimensional data. In particular, she was looking at the results of vibrational spectroscopy. This looks at the vibrations of molecules, in this case to try to determine what some tissue consists of.
The data is a big 3D-array: two dimensional images at lots of different spectral frequencies.
Claudia had a bit of a discussion about k-means versus hierarchical modelling. She suggested that the fact that k-means often overlooks small clusters, and the fact that you need to know the number of clusters in advance, meant that it was unsuitable for her datasets. The latter point was vigorously debated after the talk, with Martin Maechler arguing that for k-means analyses, you just try lots of values for the number of clusters, and see what gives you the best answer.
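Martin’s suggestion is easy to try in a few lines of base R: run kmeans() for a range of k and look for the “elbow” in the within-cluster sum of squares (a toy example with simulated data, not Claudia’s spectra):

```r
# Choosing k for k-means by scanning a range of values.
set.seed(42)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))  # two true clusters

wss <- sapply(1:8, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})
plot(1:8, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
# The curve typically bends sharply at the true number of clusters (here 2).
```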
Anyway, Claudia had been using hierarchical clustering, and running into problems with calculation time because she has fairly big datasets and hierarchical clustering takes O(n^2) to run.
Her big breakthrough was to notice that you get more or less the same answer clustering on images, or clustering on spectra, and clustering on spectra takes far less time. She had some magic about compressing the information in spectra (peak detection?) but I didn’t follow that too closely.
Silvia Liverani talked about profile regression clustering and her
PReMiuM package. She clearly has better accuracy with the shift key than I do.
Anyway, she said that if you have highly correlated variables (body mass and BMI was her example), it can cause instability and general bad behaviour in your models.
Profile regression models were her solution to this, and she described them as “Bayesian infinite mixture models”, but the technical details went over my head.
The package has support for normal/Poisson/binomial/categorical/censored response variables, missing values, and spatial correlations, so it sounds fully featured.
Silvia said it’s written in C++, but runs MCMC underneath, so that makes it medium speed.
I then dashed off to the Kaleidoscope session to hear about Karl Broman’s socks.
Rasmus Bååth talked about using approximate Bayesian computation to solve the infamous Karl Broman’s socks problem. The big selling point of ABC is that you can calculate stuff where you have no idea how to calculate the maximum likelihood. Anyway, I mostly marvelled at Rasmus’s ability to turn a silly subject into a compelling statistical topic.
Adrian Baddeley gave a keynote on spatial statistics and his work with the
spatstat package. He said that in 1990 when work began on the S version of
spatstat, the field of spatial statistics was considered a difficult domain to work in.
“In 1990 I taught that likelihood methods for spatial statistics were infeasible, and that time-series methods were not extensible to spatial problems.”
Since then, the introduction of MCMC, composite likelihood and non-parametric moments have made things easier, but he gave real credit to the R language for pushing things forward.
“For the first time, we could share code easily to make cumulative progress”
One persistent problem in spatial stats was how to deal with edge corrections. If you sample points inside a rectangular window and calculate the distance from each point to its nearest neighbour, then points near the edge appear to be further from their neighbours than they should be, because their true nearest neighbour may lie outside the window, where you didn’t sample.
Apparently large academic wars were fought in the 1980s and early 90s over how best to correct for the edge effects, until R made it easy to compare methods and everyone realised that there wasn’t much difference between them.
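The edge effect itself is easy to see in a quick base R simulation (my own toy illustration, nothing to do with spatstat’s actual corrections):

```r
# Nearest-neighbour distances for points near the border of a unit
# square are biased upwards, because the true nearest neighbour may
# lie outside the sampling window.
set.seed(1)
n <- 2000
pts <- cbind(runif(n), runif(n))
d <- as.matrix(dist(pts))
diag(d) <- Inf                # a point is not its own neighbour
nn <- apply(d, 1, min)        # observed nearest-neighbour distances

near_edge <- pmin(pts[, 1], 1 - pts[, 1],
                  pts[, 2], 1 - pts[, 2]) < 0.02
mean(nn[near_edge]) / mean(nn[!near_edge])  # typically a little above 1
```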
Adrian also talked about pixel logistic regression as being a development made by the
spatstat team, where you measure the distance from each pixel in an image to a response feature, then do logistic regression on the distances. This turned out to be equivalent to a Poisson point process.
He also said that the structure of R models helped to generate new research questions. The fact that you are supposed to implement
influence functions for every model meant that they had to invent new mathematics to calculate them.
Adrian concluded with the idea that we should seek a grand unification theory for statistics to parallel the attempt to reconcile relativity and quantum physics. Just as several decades ago lm and glm were considered separate classes of model, but today are grouped together, one day we might reconcile frequentist and Bayesian stats.
These are 5 minute talks.
Rafaël Coudret described an algorithm for SAEM. It was a bit technical, and I didn’t grasp what the “SA” stood for, but apparently it works well when you can’t figure out how to write the usual expectation maximization.
Thomas Leeper talked about the
MTurkR interface to Amazon’s Mechanical Turk. This lets you hire workers to do tasks like image recognition, modify and report on tasks, and even pay the workers, all without leaving the R command line.
In future, he wants to support rival services microWorkers and CrowdFunder too.
Luis Candanedo discussed modelling occupancy detection in offices, to save on the heating and electricity bills. He said that IR sensors are too expensive to be practical, so he tried using temperature, humidity, light and CO2 sensors to detect the number of people in the office, then used photographs to make it a supervised dataset.
Random forest models showed that the light sensors were best for predicting occupancy.
He didn’t mention it, but knowing how many hospital beds are taken up is maybe an even more important use case. Though you can probably just see who has been allocated where.
Dirk Eddelbuettel talked about his
drat package for making local file systems or GitHub (or possibly anywhere else) behave like an R repo.
Basically, it bugs him that if you use
devtools::install_github, then you can’t do
utils::update.packages on it afterwards, and
drat fixes that problem.
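The workflow looks something like this sketch; "username" is a placeholder for a GitHub account hosting a drat repository, and the install/update calls are commented out since they need the network:

```r
# The drat workflow, sketched. A drat repo on GitHub is served from
# the account's gh-pages site:
repo_url <- "https://username.github.io/drat"  # what drat points at

# drat::addRepo("username")    # prepends repo_url to options("repos")
# install.packages("somepkg")  # installs from the drat repo...
# utils::update.packages()     # ...and now updating works like CRAN
repo_url
```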
Saskia Freytag talked about epilepsy gene sequencing. She had six datasets of gene expression data from children’s brains, and drew some correlation networks of them. (Actually, I’m not sure if they were correlation networks or partial correlation networks, which seem to be more popular these days.)
Her idea was that true candidate genes for epilepsy should lie in the same networks as known epilepsy genes, thus filtering out many false negatives.
She also had a Shiny interface to help her non-technical colleagues interact with the networks.
Soma Datta talked about teaching R as a first programming language to secondary school children. She said that many of them found C++ and Java to be too hard, and that R had a much higher success rate.
Simple things like having array indices start at one rather than zero, not having to bother with semicolons to terminate lines, and not having to declare variable types made a huge difference to the students’ ability to be productive.
Alan Friedman talked about Lotka’s Law, which states that a very small number of journal paper authors write most of the papers, and it quickly drops off so that 60% of journal authors only write one paper.
He has an implementation package called
LotkasLaw, which librarians might find useful.
Berry Boessenkool talked about extreme value stats. Apparently, as the temperature increases, the median chance of precipitation does too. However, when you look at the extreme high quantiles (> 99.9%) of the chance of precipitation, they increase up to a temperature of 25 degrees Celsius or so, then drop again.
Berry suggested that this was a statistical artefact of not having much data, and when he did a more careful extreme value analysis, the high-quantile probability of precipitation kept increasing with temperature, as the underlying physics suggested it should.
When he talked about precipitation, I’m pretty sure he meant rain, since my rudimentary meteorological knowledge suggests that the probability of sleet and snow drops off quite sharply above zero degrees Celsius.
Jonathan Arfa talked about his participation in a Kaggle competition predicting NCAA Basketball scores in a knockout competition called March Madness.
His team used results from the previous season’s league games, Las Vegas betting odds, a commercial team metric dataset, and the distance travelled to each game to try to predict the results.
He suggested that they could have done better if they’d used a Bayesian approach: if a poor team wins its first couple of games, you know it is better than your model predicts.
Adolfo Alvarez gave a quick rundown of the different approaches for making your code go faster. No time for details, just a big list.
Vectorization, data.table and dplyr, do things in a database, try alternate R engines, parallelize stuff, use GPUs, use Hadoop and Spark, buy time on Amazon or Azure machines.
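The first item on that list is easy to demonstrate in miniature (my own toy benchmark, not from Adolfo’s talk):

```r
# Vectorized code beats an interpreted loop by a wide margin.
x <- runif(1e6)

slow_sum <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi  # one interpreter step per element
  total
}

system.time(slow_sum(x))  # much slower...
system.time(sum(x))       # ...than the single vectorized call
```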
Karen Nielsen talked about predicting EEG data (those time series of your brain’s electrical activity) using regression spline mixed models. Her big advance was to include person and trial effects in the model.
Andrew Kriss talked about his rstats4ag.org website, which gives statistics advice for arable farmers. The stats are fairly basic (on purpose), tailored for use by crop farmers.
Richard Layton talked about teaching graphics. “As well as drawing ‘clear’ graphs, it is important to think about the needs of the audience”, he argued.
While a dot plot may be the best option, if your audience has never seen one before, it may be better to use a boxplot instead. (There are no questions in the lightning talks, so I didn’t get a chance to ask him whether he would go so far as to recommend a 4D pie chart!)
One compelling example for considering the psychology of the audience was a mosaic plot of soldiers’ deaths in the (I think first) Iraq war. By itself, the plot evokes little emotion, but if you put a picture of a soldier dying next to it, it reminds you what the numbers mean.
Michael Höhle headlined today with a talk on zombie preparedness, filling in some of the gaps in the Zombie Survival Guide.
He explained that the most important thing was to track possible zombie outbreak metrics in order to get an early warning of a problem. He gave a good explanation of monitoring homicides by headshot and decapitation, then correcting for the fact that the civil servants reporting these numbers had gone on holiday.
His surveillance package can also be used for non-zombie-related disease outbreaks.