I was recently hunting for a function that will strip the extension from a file name – changing foo.tgz to foo, and so forth. I was knitting a report, and wanted to replace the file extension of the input with the extension of the output file. (knitr handles this automatically in most cases, but I had some custom logic in there that meant I had to work things out manually.)
Finding file extensions is such a common task that I figured that someone must have written a function to solve the problem already. A quick search using
findFn("file extension") from the
sos package revealed a few thousand hits. There’s a lot of noise in there, but I found a few promising candidates.
removeExt in the limma package (you can find it on Bioconductor), remove_file_extension, which appears as identical copies in a couple of other packages, and extension in yet another package.
To save you the time and effort, I’ve tried them all, and unfortunately they all suck.
At a bare minimum, a file extension stripper needs to be vectorized, deal with different file extensions within that vector, deal with multiple levels of extension (for things like “tar.gz” files), deal with filenames containing dots other than the extension separator, deal with missing values, and deal with directories. OK, that’s quite a few things, but I’m picky.
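To see why these requirements matter, here’s a quick illustration (my own, not from any of the packages above) of how base R’s tools::file_ext and tools::file_path_sans_ext fall short: they only look at the text after the last dot, so multi-level extensions are mishandled.

```r
# Base R's extension helpers only consider the last dot, so a
# double extension like "tar.gz" loses its first level.
library(tools)

file_ext("foo.tar.gz")           # "gz"      -- drops the "tar" level
file_path_sans_ext("foo.tar.gz") # "foo.tar" -- only strips one level
```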
Since all the existing options failed, I’ve made my own function. In fact, I went overboard and created a package of path manipulation utilities, the
pathological package. It isn’t on CRAN yet, but you can install it via:
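The install command didn’t survive in this copy of the post; for a package that isn’t on CRAN, a typical GitHub install looks like the sketch below. The "richierocks/pathological" repository location is my assumption – adjust it if the package lives elsewhere.

```r
# Install a non-CRAN package from GitHub using devtools.
# The repository name "richierocks/pathological" is an assumption.
library(devtools)
install_github("richierocks/pathological")
```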
It’s been a while since I’ve used MATLAB, but I have fond recollections of its
fileparts function that splits a path up into the directory, filename and extension.
The pathological equivalent is to decompose a path, which returns a
data.frame with three columns.
library(pathological)
x <- c(
  "somedir/foo.tgz",          # single extension
  "another dir\\bar.tar.gz",  # double extension
  "baz",                      # no extension
  "quux. quuux.tbz2",         # single ext, dots in filename
  R.home(),                   # a dir
  "~",                        # another dir
  "~/quuuux.tar.xz",          # a file in a dir
  "",                         # empty
  ".",                        # current dir
  "..",                       # parent dir
  NA_character_               # missing
)
(decomposed <- decompose_path(x))
##                          dirname                      filename      extension
## somedir/foo.tgz          "d:/workspace/somedir"       "foo"         "tgz"
## another dir\\bar.tar.gz  "d:/workspace/another dir"   "bar"         "tar.gz"
## baz                      "d:/workspace"               "baz"         ""
## quux. quuux.tbz2         "d:/workspace"               "quux. quuux" "tbz2"
## C:/PROGRA~1/R/R-31~1.0   "C:/Program Files/R/R-3.1.0" ""            ""
## ~                        "C:/Users/richie/Documents"  ""            ""
## ~/quuuux.tar.xz          "C:/Users/richie/Documents"  "quuuux"      "tar.xz"
##                          ""                           ""            ""
## .                        "d:/workspace"               ""            ""
## ..                       "d:/"                        ""            ""
## <NA>                     NA                           NA            NA
## attr(,"class")
## "decomposed_path" "matrix"
There are some shortcut functions to get at different parts of the filename:
get_extension(x)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz
##                   "tgz"                "tar.gz"                      ""
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~
##                  "tbz2"                      ""                      ""
##         ~/quuuux.tar.xz                                               .
##                "tar.xz"                      ""                      ""
##                      ..                    <NA>
##                      ""                      NA

strip_extension(x)
##  "d:/workspace/somedir/foo"         "d:/workspace/another dir/bar"
##  "d:/workspace/baz"                 "d:/workspace/quux. quuux"
##  "C:/Program Files/R/R-3.1.0"       "C:/Users/richie/Documents"
##  "C:/Users/richie/Documents/quuuux" "/"
##  "d:/workspace"                     "d:/"
##  NA

strip_extension(x, include_dir = FALSE)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz
##                   "foo"                   "bar"                   "baz"
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~
##           "quux. quuux"                      ""                      ""
##         ~/quuuux.tar.xz                                               .
##                "quuuux"                      ""                      ""
##                      ..                    <NA>
##                      ""                      NA
You can also get your original file location (in a standardised form) using
recompose_path(decomposed)
##  "d:/workspace/somedir/foo.tgz"
##  "d:/workspace/another dir/bar.tar.gz"
##  "d:/workspace/baz"
##  "d:/workspace/quux. quuux.tbz2"
##  "C:/Program Files/R/R-3.1.0"
##  "C:/Users/richie/Documents"
##  "C:/Users/richie/Documents/quuuux.tar.xz"
##  "/"
##  "d:/workspace"
##  "d:/"
##  NA
The package also contains a few other path utilities. The standardisation I mentioned comes from standardise_path (standardize_path is also available for Americans), and there’s a dir_copy function for copying directories.
It’s brand new, so after I’ve complained about other people’s code, I’m sure karma will ensure that you’ll find a bug or two, but I hope you find it useful.
Here’s an excerpt from my chapter “Blood, sweat and urine” from The Bad Data Handbook. Have a lovely Christmas!
I spent six years working in the statistical modeling team at the UK’s Health and Safety
Laboratory. A large part of my job was working with the laboratory’s chemists, looking
at occupational exposure to various nasty substances to see if an industry was adhering
to safe limits. The laboratory gets sent tens of thousands of blood and urine samples
each year (and sometimes more exotic fluids like sweat or saliva), and has its own team
of occupational hygienists who visit companies and collect yet more samples.
The sample collection process is known as “biological monitoring.” This is because when
the occupational hygienists get home and their partners ask “How was your day?,” “I’ve
been biological monitoring, darling” is more respectable to say than “I spent all day
getting welders to wee into a vial.”
In 2010, I was lucky enough to be given a job swap with James, one of the chemists.
James’s parlour trick is that, after running many thousands of samples, he can tell the
level of creatinine in someone’s urine with uncanny accuracy, just by looking at it. This
skill was only revealed to me after we’d spent an hour playing “guess the creatinine level”
and James had suggested that “we make it more interesting.” I’d lost two packets of fig
rolls before I twigged that I was onto a loser.
The principle of the job swap was that I would spend a week in the lab assisting with
the experiments, and then James would come to my office to help out generating the
statistics. In the process, we’d both learn about each other’s working practices and find
ways to make future projects more efficient.
In the laboratory, I learned how to pipette (harder than it looks), and about the methods
used to ensure that the numbers spat out of the mass spectrometer were correct. So as
well as testing urine samples, within each experiment you need to test blanks (distilled
water, used to clean out the pipes, and also to check that you are correctly measuring
zero), calibrators (samples of a known concentration for calibrating the instrument),
and quality controllers (samples with a concentration in a known range, to make sure
the calibration hasn’t drifted). On top of this, each instrument needs regular maintaining
and recalibrating to ensure its accuracy.
Just knowing that these things have to be done to get sensible answers out of the
machinery was a small revelation. Before I’d gone into the job swap, I didn’t really think
about where my data came from; that was someone else’s problem. From my point of
view, if the numbers looked wrong (extreme outliers, or otherwise dubious values) they
were a mistake; otherwise they were simply “right.” Afterwards, my view is more
nuanced. Now all the numbers look like, maybe not quite a guess, but certainly only an
approximation of the truth. This measurement error is important to remember, though
for health and safety purposes, there’s a nice feature. Values can be out by an order of
magnitude at the extreme low end for some tests, but we don’t need to worry so much
about that. It’s the high exposures that cause health problems, and measurement error
is much smaller at the top end.
Today I went to the Radical Statistics conference in London. RadStats was originally a sort of left wing revolutionary group for statisticians, but these days the emphasis is on exposing dubious statistics by companies and politicians.
Here’s a quick rundown of the day.
First up, Roy Carr-Hill spoke about the problems with trying to collect demographic data and estimating soft measures of societal progress like wellbeing. (Household surveys exclude people not in households, like the homeless, soldiers, and old people in care homes; and English people claim to be 70% satisfied regardless of the question.)
Next was Val Saunders, who started with a useful debunking of some methodological flaws in schizophrenia research, then blew it by detailing her own methodologically flawed research and making overly strong claims to have found the cause of that disease.
Aubrey Blumsohn and David Healy both talked about ways that the pharmaceutical industry fudges results. The list was impressively long, leading me to suspect that far too many people have spent far too long thinking of ways to game the system. The two main recommendations that resonated with me were to extend the trials register to phase 1 trials, to avoid unfavourable studies being buried, and for raw data to be made available for transparent analysis. Pipe dreams.
After lunch, Prem Sikka pointed out that tax avoidance isn’t just shady companies trying to scam the system; accountancy firms actually pay people to dream up new wheezes and sell them to those companies.
Ann Pettifor and final speaker Howard Reed gave similar talks evangelising Keynesian stimulus (roughly, big government spending in times of recession) for the UK economy, amongst some economic myth debunking. Thought provoking, though both speakers neglected to mention the limitations of such stimuli – you have to avoid spending on pork-barrel nonsense (see Japan in the 90s, or that buy-a-banger scheme in the UK in 2009), and you have to find a way to turn off the taps when the recession is over.
The other speaker was Allyson Pollock, who discussed debunking a dubious study by Zac Cooper claiming that patients being allowed to choose their surgeon improved success rates in treating acute myocardial infarction. Such patients are generally unconscious while having their heart attack, so it was inevitably nonsense.
Overall a great day.
Whenever there’s a financial crisis, I tend to assume that it’s Dirk Eddelbuettel‘s fault, though apparently the EMU debt crisis is more complicated than that. JP Morgan have just released a white paper about the problem, including a Lego infographic of who is asking who for money. Created, apparently, by Peter Cembalest, aged nine. Impressive stuff.
Found via The Register.
Here at HSL we have a lot of smart kinda-numerate people who have access to a lot of data. On a bad day, kinda-numerate includes myself, but in general I’m talking about scientists who have done an introductory stats course, but not much else. When all you have is a t-test, suddenly everything looks like two groups of normally distributed numbers whose means need testing for a significant difference.
While we have a pretty good cross-disciplinary setup here, the ease of calculating a mean here or a standard deviation there means that many scientists can’t resist a piece of the number crunching action. Then suddenly there’s an Excel monstrosity that nobody understands rearing its ugly head.
Management has enlightenedly decided to fund a stats clinic, so us number nerds can help out the rest of the lab without any paperwork overhead (which was the biggest reason to put off asking for help). They didn’t like my slogan, but hey, you can’t have everything.
I’m really interested to hear how other organisations deal with this issue. Let me know in the comments.
The Washington Monthly magazine has a long article about graphics guru Edward Tufte. It mostly covers his work on presenting data, with a few snippets about his powerful friends (he has advised both the Bush and Obama administrations), and his current work on recovery.gov.
I still have no idea whether his name is pronounced “Tuft” or “Tufty” or “Tooft”.
I was recently pointed in the direction of a thermal comfort model by the engineering company Arup (p27–28 of this pdf). Figure 3 at the top of p28 caught my attention.
It’s mostly a nice graph; there’s not too much junk in it. One thing that struck me was that there is an awful lot of information in the legend, and that I found it impossible to retain all that information while switching between the plot and the legend.
The best way to improve this plot then is to find a way to simplify the legend. Upon closer inspection, it seems that there is a lot of information that is repeated. For example, there are only two temperature combinations, and three levels of direct solar energy. Humidity and diffused solar energy are kept the same in all cases. That makes it really easy for us: our five legend options are
| Outdoor temp (deg C) | Direct solar energy (W/m^2) |
|----------------------|-----------------------------|
| 32                   | 700                         |
| 32                   | 150                         |
| 32                   | 500                         |
| 29                   | 500                         |
| 29                   | 150                         |
Elsewhere we can explain that the mezzanine/platform temps are always 2/4 degrees higher than outdoors, that the humidity is always 50%, and that the diffused solar energy is always 100 W/m^2.
Living in Buxton, one of the coldest, rainiest towns in the UK, it amuses me to see that their “low” outdoor temperature is 29°C.
The other thing to note is that we have two variables mapped to the hue. For just five cases, this is just about acceptable, but it isn’t the best option and it won’t scale to many more categories. It’s generally considered best practice to work in HCL color space when mapping variables to colours. I would be tempted to map temperature to hue – whether you pick red as hot and blue as cold or the other way around depends upon how many astronomers you have in your target audience. Then I’d map luminance (lightness) to solar energy: more sunlight = lighter line.
I don’t have the values to exactly recreate the dataset, but here are some made up numbers with the new legend. Notice the combined outdoor temp/direct solar energy variable.
time_points <- 0:27
n_time_points <- length(time_points)
n_cases <- 5
comfort_data <- data.frame(
  time = rep.int(time_points, n_cases),
  comfort = jitter(rep(-2:2, each = n_time_points)),
  outdoor.temperature = rep(
    c(32, 29),
    times = c(3 * n_time_points, 2 * n_time_points)
  ),
  direct.solar.energy = rep(
    c(700, 150, 500, 500, 150),
    each = n_time_points
  )
)
comfort_data$combined <- with(comfort_data,
  factor(paste(outdoor.temperature, direct.solar.energy, sep = ", "))
)
We manually pick the colours to use in HCL space (using
str_detect to examine the factor levels).
library(stringr)
cols <- hcl(
  h = with(comfort_data,
    ifelse(str_detect(levels(combined), "29"), 0, 240)
  ),
  c = 100,
  l = with(comfort_data,
    ifelse(str_detect(levels(combined), "150"), 20,
      ifelse(str_detect(levels(combined), "500"), 50, 80)
    )
  )
)
Drawing the plot is very straightforward: it’s just a line plot.
library(ggplot2)
p <- ggplot(comfort_data, aes(time, comfort, colour = combined)) +
  geom_line(size = 2) +
  scale_colour_manual(
    name = expression(paste(
      "Outdoor temp (", degree, C, "), Direct solar (", W/m^2, ")"
    )),
    values = cols
  ) +
  xlab("Time (minutes)") +
  ylab("Comfort")
p
Sensible people should stop here, and write the additional detail in the figure caption. There is currently no sensible way of writing annotations outside of the plot area (annotate only works inside panels). The following hack was devised by Baptiste Auguie; read this forum thread for other variations.
library(gridExtra)
caption <- tableGrob(
  matrix(
    expression(
      paste(
        "Mezzanine temp is 2", degree, C, " warmer than outdoor temp"
      ),
      paste(
        "Platform temp is 4", degree, C, " warmer than outdoor temp"
      ),
      paste("Humidity is always 50%"),
      paste("Diffused solar energy is always 100", W/m^2)
    )
  ),
  parse = TRUE,
  theme = theme.list(
    gpar.corefill = gpar(fill = NA, col = NA),
    core.just = "center"
  )
)
grid.arrange(p, sub = caption)