Archive for the ‘Uncategorized’ Category

RL10N: Let R Speak Your Language

23rd March, 2016

R has been translated into 20 languages but currently not many packages have translations. In a survey of CRAN done last December, of the 8274 packages on CRAN, only 50 had any installed translations. Of those, 28 had only a single translation. As the plot below shows, the number of translated packages is almost indistinguishable from zero.

The number of R packages with translations is ridiculously small.

The RL10N project by myself and the excellent Thomas Leeper has just been funded by the R Consortium in order to address this problem, and help R users who aren’t native English speakers. In short, we want to ASSIST R TO TAKE OVER THE WORLD (of data analysis). Cue evil laugh.

There are three strands to the project:

Firstly, there are tools in the tools package for working with translations, but they are a bit fiddly to use. Thomas has a work-in-progress package called msgtools that aims to make things easier. We’ll develop this package to be robust, well documented, and easy for novice package developers to use.

Secondly, Thomas’s MTurkR package provides an interface between R and Amazon’s Mechanical Turk API, a marketplace for human intelligence tasks (HITs), including translation tasks. We’ll develop a package that wraps MTurkR, with functionality for creating and managing translation HITs.

Thirdly, Christopher Lucas and Dustin Tingley’s translateR package provides an interface to the Application Programming Interfaces (APIs) for the Google Translate and Microsoft Translator services for automated translation of text. We’ll create an R package that wraps translateR, with functionality for integrating the automated translations into a package development workflow.
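To give a flavour of what the first strand is smoothing over, here is a minimal sketch of how messages are marked for translation with base R today. The package name mypkg, its domain R-mypkg and the function check_positive are made up for illustration; gettextf, ngettext and tools::update_pkg_po are the standard base R functions.

```r
# Hypothetical package function; "mypkg" and "R-mypkg" are invented names.
check_positive <- function(x) {
  if (any(x <= 0, na.rm = TRUE)) {
    # gettextf() looks the format string up in the "R-mypkg" message
    # catalogue and falls back to the English text if none is installed.
    stop(gettextf("%s contains non-positive values.", "x", domain = "R-mypkg"))
  }
  invisible(x)
}

# ngettext() chooses between singular and plural forms.
n <- 3
msg <- ngettext(n, "%d file was skipped.", "%d files were skipped.",
                domain = "R-mypkg")
skipped_message <- sprintf(msg, n)

# Run from the package's root directory, this (re)creates the po/ templates:
# tools::update_pkg_po(".")
```

With no translation catalogue installed, the English strings are returned unchanged, which is the situation most packages are in.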

Introducing the pathological package for manipulating paths, files and directories

I was recently hunting for a function that will strip the extension from a file – changing foo.png to foo, and so forth. I was knitting a report, and wanted to replace the file extension of the input with the extension of the output file. (knitr handles this automatically in most cases but I had some custom logic in there that meant I had to work things manually.)

Finding file extensions is such a common task that I figured that someone must have written a function to solve the problem already. A quick search using findFn("file extension") from the sos package revealed a few thousand hits. There’s a lot of noise in there, but I found a few promising candidates.

There’s removeExt in the limma package (you can find it on Bioconductor), strip_extension in Kmisc, remove_file_extension which has identical copies in both spatial.tools and gdalUtils, and extension in the raster package.

To save you the time and effort, I’ve tried them all, and unfortunately they all suck.

At a bare minimum, a file extension stripper needs to be vectorized, deal with different file extensions within that vector, deal with multiple levels of extension (for things like “tar.gz” files), and with filenames with dots in the name other than the extension, and with missing values, and with directories. OK, that’s quite a few things but I’m picky.
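As a rough illustration of those requirements, here is a base R sketch that covers the vectorisation, multiple-extension, dots-in-filename and missing-value cases with a single regular expression. The function name strip_ext_sketch is made up, and this is not how pathological does it. It never looks at the file system, so it cannot tell directories from files, and it treats every trailing run of dot-separated alphanumeric pieces as extension, so a name like data.v1.csv loses more than you might want.

```r
# A hypothetical helper, not part of pathological or any package named above.
strip_ext_sketch <- function(x) {
  x <- as.character(x)
  # Drop a trailing run of ".alnum" pieces, e.g. ".tgz" or ".tar.gz".
  # A dot followed by a space (as in "quux. quuux") does not match,
  # and sub() passes missing values through as NA.
  sub("(\\.[[:alnum:]]+)+$", "", x)
}

paths <- c("somedir/foo.tgz", "bar.tar.gz", "baz", "quux. quuux.tbz2", NA)
stripped <- strip_ext_sketch(paths)
```

Base R also ships tools::file_path_sans_ext, which handles the single-extension case and, with compression = TRUE, a trailing .gz, .bz2 or .xz as well.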

Since all the existing options failed, I’ve made my own function. In fact, I went overboard and created a package of path manipulation utilities, the pathological package. It isn’t on CRAN yet, but you can install it via:

library(devtools)
install_github("richierocks/pathological")


It’s been a while since I’ve used MATLAB, but I have fond recollections of its fileparts function that splits a path up into the directory, filename and extension.

The pathological equivalent is decompose_path, which returns a character matrix with three columns.

library(pathological)
x <- c(
  "somedir/foo.tgz",         # single extension
  "another dir\\bar.tar.gz", # double extension
  "baz",                     # no extension
  "quux. quuux.tbz2",        # single ext, dots in filename
  R.home(),                  # a dir
  "~",                       # another dir
  "~/quuuux.tar.xz",         # a file in a dir
  "",                        # empty
  ".",                       # current dir
  "..",                      # parent dir
  NA_character_              # missing
)
(decomposed <- decompose_path(x))
##                          dirname                      filename      extension
## somedir/foo.tgz         "d:/workspace/somedir"       "foo"         "tgz"
## another dir\\bar.tar.gz "d:/workspace/another dir"   "bar"         "tar.gz"
## baz                     "d:/workspace"               "baz"         ""
## quux. quuux.tbz2        "d:/workspace"               "quux. quuux" "tbz2"
## C:/PROGRA~1/R/R-31~1.0  "C:/Program Files/R/R-3.1.0" ""            ""
## ~                       "C:/Users/richie/Documents"  ""            ""
## ~/quuuux.tar.xz         "C:/Users/richie/Documents"  "quuuux"      "tar.xz"
##                         ""                           ""            ""
## .                       "d:/workspace"               ""            ""
## ..                      "d:/"                        ""            ""
## <NA>                    NA                           NA            NA
## attr(,"class")
## [1] "decomposed_path" "matrix"


There are some shortcut functions to get at different parts of the filename:

get_extension(x)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz
##                   "tgz"                "tar.gz"                      ""
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~
##                  "tbz2"                      ""                      ""
##         ~/quuuux.tar.xz                                               .
##                "tar.xz"                      ""                      ""
##                      ..                    <NA>
##                      ""                      NA

strip_extension(x)
##  [1] "d:/workspace/somedir/foo"         "d:/workspace/another dir/bar"
##  [3] "d:/workspace/baz"                 "d:/workspace/quux. quuux"
##  [5] "C:/Program Files/R/R-3.1.0"       "C:/Users/richie/Documents"
##  [7] "C:/Users/richie/Documents/quuuux" "/"
##  [9] "d:/workspace"                     "d:/"
## [11] NA

strip_extension(x, include_dir = FALSE)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz
##                   "foo"                   "bar"                   "baz"
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~
##           "quux. quuux"                      ""                      ""
##         ~/quuuux.tar.xz                                               .
##                "quuuux"                      ""                      ""
##                      ..                    <NA>
##                      ""                      NA


You can also get your original file location (in a standardised form) using

recompose_path(decomposed)
##  [1] "d:/workspace/somedir/foo.tgz"
##  [2] "d:/workspace/another dir/bar.tar.gz"
##  [3] "d:/workspace/baz"
##  [4] "d:/workspace/quux. quuux.tbz2"
##  [5] "C:/Program Files/R/R-3.1.0"
##  [6] "C:/Users/richie/Documents"
##  [7] "C:/Users/richie/Documents/quuuux.tar.xz"
##  [8] "/"
##  [9] "d:/workspace"
## [10] "d:/"
## [11] NA


The package also contains a few other path utilities. The standardisation I mentioned comes from standardise_path (standardize_path also available for Americans), and there’s a dir_copy function for copying directories.
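If you can’t install the package, base R gets you part of the way. The sketch below is only an approximation and pathological’s own functions may behave differently: normalizePath stands in for path standardisation and file.copy with recursive = TRUE stands in for directory copying. The directory and file names are invented for the example.

```r
# Base R approximations only; all paths live under the session's temp dir.
source_dir <- file.path(tempdir(), "src_dir")
dest_parent <- file.path(tempdir(), "dest_parent")
dir.create(source_dir, showWarnings = FALSE)
dir.create(dest_parent, showWarnings = FALSE)
writeLines("hello", file.path(source_dir, "a.txt"))

# With recursive = TRUE, the whole directory is copied into dest_parent.
copied <- file.copy(source_dir, dest_parent, recursive = TRUE)

# Resolve ".." and use forward slashes on every platform.
messy_path <- file.path(dest_parent, "..", basename(dest_parent))
standardised <- normalizePath(messy_path, winslash = "/", mustWork = FALSE)
```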

It’s brand new, so after I’ve complained about other people’s code, I’m sure karma will ensure that you’ll find a bug or two, but I hope you find it useful.

A little Christmas Present for you

Here’s an excerpt from my chapter “Blood, sweat and urine” from The Bad Data Handbook. Have a lovely Christmas!

I spent six years working in the statistical modeling team at the UK’s Health and Safety
Laboratory. A large part of my job was working with the laboratory’s chemists, looking
at occupational exposure to various nasty substances to see if an industry was adhering
to safe limits. The laboratory gets sent tens of thousands of blood and urine samples
each year (and sometimes more exotic fluids like sweat or saliva), and has its own team
of occupational hygienists who visit companies and collect yet more samples.
The sample collection process is known as “biological monitoring.” This is because when
the occupational hygienists get home and their partners ask “How was your day?,” “I’ve
been biological monitoring, darling” is more respectable to say than “I spent all day
getting welders to wee into a vial.”
In 2010, I was lucky enough to be given a job swap with James, one of the chemists.
James’s parlour trick is that, after running many thousands of samples, he can tell the
level of creatinine in someone’s urine with uncanny accuracy, just by looking at it. This
skill was only revealed to me after we’d spent an hour playing “guess the creatinine level”
and James had suggested that “we make it more interesting.” I’d lost two packets of fig
rolls before I twigged that I was onto a loser.

The principle of the job swap was that I would spend a week in the lab assisting with
the experiments, and then James would come to my office to help out generating the
statistics. In the process, we’d both learn about each other’s working practices and find
ways to make future projects more efficient.
In the laboratory, I learned how to pipette (harder than it looks), and about the methods
used to ensure that the numbers spat out of the mass spectrometer were correct. So as
well as testing urine samples, within each experiment you need to test blanks (distilled
water, used to clean out the pipes, and also to check that you are correctly measuring
zero), calibrators (samples of a known concentration for calibrating the instrument),
and quality controllers (samples with a concentration in a known range, to make sure
the calibration hasn’t drifted). On top of this, each instrument needs regular maintaining
and recalibrating to ensure its accuracy.
Just knowing that these things have to be done to get sensible answers out of the
machinery was a small revelation. Before I’d gone into the job swap, I didn’t really think
about where my data came from; that was someone else’s problem. From my point of
view, if the numbers looked wrong (extreme outliers, or otherwise dubious values) they
were a mistake; otherwise they were simply “right.” Afterwards, my view is more
nuanced. Now all the numbers look like, maybe not quite a guess, but certainly only an
approximation of the truth. This measurement error is important to remember, though
for health and safety purposes, there’s a nice feature. Values can be out by an order of
magnitude at the extreme low end for some tests, but we don’t need to worry so much
about that. It’s the high exposures that cause health problems, and measurement error
is much smaller at the top end.

Today I went to the Radical Statistics conference in London. RadStats was originally a sort of left-wing revolutionary group for statisticians, but these days the emphasis is on exposing dubious statistics by companies and politicians.

Here’s a quick rundown of the day.

First up, Roy Carr-Hill spoke about the problems with collecting demographic data and estimating soft measures of societal progress like wellbeing. (Household surveys exclude people not in households, like the homeless, soldiers and old people in care homes; and English people claim to be 70% satisfied regardless of the question.)

Next was Val Saunders, who started with a useful debunking of some methodological flaws in schizophrenia research, then blew it by detailing her own methodologically flawed research and making overly strong claims to have found the cause of that disease.

Aubrey Blunsohn and David Healy both talked about ways that the pharmaceutical industry fudges results. The list was impressively long, leading me to suspect that far too many people have spent far too long thinking of ways to game the system. The two main recommendations that resonated with me were to extend the trials register to phase 1 trials, to avoid unfavourable studies being buried, and for raw data to be made available for transparent analysis. Pipe dreams.

After lunch, Prem Sikka pointed out that tax avoidance isn’t just shady companies trying to scam the system: accountancy firms actually pay people to dream up new wheezes and sell them to those companies.

Ann Pettifor and final speaker Howard Reed had similar talks evangelising Keynesian stimulus (roughly, big government spending in times of recession) for the UK economy amongst some economic myth debunking. Thought provoking, though both speakers neglected to mention the limitations of such stimuli – you have to avoid spending on pork barrel nonsense (see Japan in the 90s, that buy-a-banger scheme in the UK in 2009) and you have to find a way to turn off the taps when the recession is over.

The other speaker was Allyson Pollack, who discussed debunking a dubious study by Zac Cooper claiming that patients being allowed to choose their surgeon improved success rates in treating acute myocardial infarction. Such patients are generally unconscious while having their heart attack, so it was inevitably nonsense.

Overall a great day.

A Great European Bailout Infographic

8th September, 2011

Whenever there’s a financial crisis, I tend to assume that it’s Dirk Eddelbuettel’s fault, though apparently the EMU debt crisis is more complicated than that. JP Morgan have just released a white paper about the problem, including a Lego infographic of who is asking who for money. Created, apparently, by Peter Cembalest, aged nine. Impressive stuff.

Found via The Register.

The Stats Clinic

27th July, 2011

Here at HSL we have a lot of smart kinda-numerate people who have access to a lot of data. On a bad day, kinda-numerate includes myself, but in general I’m talking about scientists who have done an introductory stats course, but not much else. When all you have is a t-test, suddenly everything looks like two groups of normally distributed numbers whose means need testing for a significant difference.

While we have a pretty good cross-disciplinary setup here, the ease of calculating a mean here or a standard deviation there means that many scientists can’t resist a piece of the number crunching action. Then suddenly there’s an Excel monstrosity that nobody understands rearing its ugly head.

Management has enlightenedly decided to fund a stats clinic, so us number nerds can help out the rest of the lab without any paperwork overhead (which was the biggest reason to put off asking for help). They didn’t like my slogan, but hey, you can’t have everything.

I’m really interested to hear how other organisations deal with this issue. Let me know in the comments.
