I’m writing a book on R for O’Reilly, and I need interesting datasets for the examples. Any data that you provide will get you a mention in the book and in the publicity material, so it’s a great opportunity to publicise your work or your organisation.
Datasets from any area or industry are suitable; the only constraint is that it can be analysed with a few pages of R code to provide a result that a general reader might go “ooh”. There’s a chapter on data cleaning, so even dirty data is suitable!
All the data will be provided in an R package to accompany the book, so you need to be willing to make it publically available. I can help you anonymise the data, or strip out commercially sensitive parts if you require.
If you can provide anything, or you know someone who might be able to, then drop me an email at richierocks AT gmail DOT com. Thanks.
EDIT: There are some (quite) frequently asked questions already! Here are the answers; you can use your Jeopardy! skills to guess the questions.
1. The book is called “Learning R”, and it’s a fairly gentle introduction to the language, covering both how you program in R, and how you analyse data.
2. If you provide data, then yes, you can have an PDF of the pre-release version to make sure I haven’t done something silly with your dataset.
My girlfriend’s biological clock is ticking, and so we’ve started trying to spawn. Since I’m impatient, that has naturally lead to questions like “how long will it take?”. If I were to believe everything on TV, the answer would be easy: have unprotected sex once and pregnancy is guaranteed.
A more cynical me suggests that this isn’t the case. Unfortunately, it is surpisingly difficult to find out the monthly chance of getting pregnant (technical jargon: the “monthly fecundity rate”, or MFR), given that you are having regular sex in the days leading up to ovulation. Everyone agrees that age has a big effect, with women’s peak fertility occuring somewhere around the age of 25. Beyond that point, the internet is filled with near-useless summary statistics like the chance of conceiving after one year. For example, the usually reliable NHS site says
Women become less fertile as they get older. For women aged 35, about 94 out of every 100
who have regular unprotected sex will get pregnant after three years of trying. However, for
women aged 38, only 77 out of every 100 will do so.
I found a couple of reasonably sciency links(George and Kamath, Socal Fertility) that suggest that the MFR is about 25% for a women aged 25, and 10% at age 35. The Scoal link also gives rates of 15% at age 30, 5% at age 40 and less than 1% at age 45. If the woman is too fat, too thin, a smoker, or has hormone problems, or is stressed, then the rate needs reducing.
Given the MFR, the probability of getting pregnant after a given number of months can be calculated with a negative binomial distribution.
months <- 0:60 p_preg_per_month <- c("25" = 0.25, "30" = 0.15, "35" = 0.1, "40" = 0.05, "45" = 0.01) p_success <- unlist(lapply( p_preg_per_month, function(p) pnbinom(months, 1, p) ))
Now we just create a data frame suitable for passing to ggplot2 …
mfr_group <- paste( "MFR =", format(p_preg_per_month, digits = 2), "at age", names(p_preg_per_month) ) mfr_group <- factor(mfr_group, levels = mfr_group) preg_data <- data.frame( months = rep.int(months, length(mfr_group)) , mfr_group = rep(mfr_group, each = length(months)), p_success = p_success )
and draw the plot.
library(ggplot2) (p <- ggplot(preg_data, aes(months, p_success, colour = mfr_group)) + geom_point() + scale_x_continuous(breaks = seq.int(0, 60, 12)) + scale_y_continuous(breaks = seq.int(0, 1, 0.1), limits = c(0, 1)) + scale_colour_discrete("Monthly fecundity rate") + xlab("Months") + ylab("Probability of conception") + opts(panel.grid.major = theme_line(colour = "grey60")) )
So almost half of the (healthy) 25 year olds get pregnant in the first
monthtwo months, and after two years (the point when doctors start considering you to have fertility problems) more than 90% of 35 year olds should conceive. By contrast, just over 20% of 45 year old women will. In fact, even this statistic is over-optimistic: at this age, fertility is rapidly decreasing, and a 1% MFR at age 45 will mean a much lower MFR at age 47 and the negative binomial model breaks down.
Of course, from a male point of view, conception is an embarrassingly parallel problem: you can dramatically reduce the time to conceive a child by sleeping with lots of women at once. (DISCLAIMER: Janette, if you’re reading this, I’m not practising or advocating this technique!)
assertive, my new package for writing robust code, is now on CRAN. It consists of lots of
is functions for checking variables, and corresponding
assert functions that throw an error if the condition doesn’t hold. For example,
is_a_number checks that the input is numeric and scalar.
is_a_number(1) #TRUE is_a_number("a") #FALSE is_a_number(1:10) #FALSE
In the last two cases, the return value of FALSE has an attribute “
cause” that indicates the cause of failure. When “a” is the input, the cause is “
"a" is not of type 'numeric'.“, whereas for
1:10, the cause is “
1:10 does not have length one.“. You can get or set the cause attribute with the
m <- lm(uptake ~ 1, CO2) ok <- is_empty_model(m) if(!ok) cause(ok)
assert functions call an
is function, and if the result is FALSE, they throw an error; otherwise they do nothing.
assert_is_a_number(1) #OK assert_is_a_number("a") #Throws an error
There are also some
has functions, primarily for checking the presence of attributes.
has_names(c(foo = 1, bar = 4, baz = 9)) has_dims(matrix(1:12, nrow = 3))
Some functions apply to properties of vectors. In this case, the
assert functions can check that all the values conform to the condition, or any of the values conform.
x <- -2:2 is_positive(x) #The last two are TRUE assert_any_are_positive(x) #OK assert_all_are_positive(x) #Error
“Why would you want to use these functions?”, you may be asking. The dynamic typing and extreme flexibility of R means that it is very easy to have variables that are the wrong format. This is particularly true when you are dealing with user input. So while you know that the sales totals passed to your function should be a vector of non-negative numbers, or that the regular expression should be a single string rather than a character vector, your user may not. You need to check for these invalid conditions, and return an error message that the user can understand. assertive makes it easy to do all this.
Since this is the first public release of assertive, it hasn’t been widely tested. I’ve written a moderately comprehensive unit-test suite, but there are likely to be a few minor bugs here and there. In particular, I suspect there may be one or two typos in the documentation. Please give the package a try, and let me know if you find any errors, or if you want any other functions adding.
Today I went to the Radical Statistics conference in London. RadStats was originally a sort of left wing revolutionary group for statisticians, but these days the emphasis is on exposing dubious statistics by companies and politicians.
Here’s a quick rundown of the day.
First up Roy Carr-Hill spoke about the problems with trying to collect demographic data and estimating soft measures of societal progress like wellbeing. (Household surveys exclude people not in households, like the homeless soldiers and old people in care homes; and English people claim to be 70% satisfied regardless of the question.)
Next was Val Saunders who started with a useful debunking of done methodological flaws in schizophrenia research, then blew it by detailing her own methodologically flaws research and making overly strong claims to have found the cause of that disease.
Aubrey Blunsohn and David Healy both talked about ways that the pharmaceutical industry fudges results. The list was impressively long, leading me to suspect that far to many people have spent far too long thinking of ways to game the system. The two main recommendations that resonated with me were to extend the trials register to phase 1 trials to avoid unfavourable studies being buried and for raw data to be made available for transparent analysis. Pipe dreams.
After lunch Prem Sikka pointed out that tax avoidance isn’t just shady companies trying to scam the system, but actually accountancy firms pay people to dream up new wheezes and sell them to those companies.
Ann Pettifor and final speaker Howard Reed had similar talks evangelising Keynesian stimulus (roughly, big government spending in times of recession) for the UK economy amongst some economic myth debunking. Thought provoking, though both speakers neglected to mention the limitations of such stimuli – you have to avoid spending in pork barrel nonsense (see Japan in the 90s, that buy-a-banger scheme in the UK in 2009) and you have to find a ways to turn of the taps w when recession is over.
The other speaker was Allyson Pollack who discussed debunking a dubious study by Zac Cooper claiming that patients being allowed to choose their surgeon improved success rated treating acute myocardial infarction. Such patients are generally unconscious while having their heart attack so out was inevitably nonsense.
Overall a great day.
Here at HSL we have a lot of smart kinda-numerate people who have access to a lot of data. On a bad day, kinda-numerate includes myself, but in general I’m talking about scientists who have have done an introductory stats course, but not much else. When all you have is a t-test, suddenly everything looks like two groups of normally distributed numbers that you need to know how significantly different their means are.
While we have a pretty good cross-disciplinary setup here, the ease of calculating a mean here or a standard deviation there means that many scientists can’t resist a piece of the number crunching action. Then suddenly there’s an Excel monstrosity that nobody understands rearing its ugly head.
Management has enlightenedly decided to fund a stats clinic, so us number nerds can help out the rest of the lab without any paperwork overhead (which was the biggest reason to put off asking for help). They didn’t like my slogan, but hey, you can’t have everything.
I’m really interested to hear how other organisations deal with this issue. Let me know in the comments.