### Archive

Posts Tagged ‘dataviz’

## How long does it take to get pregnant?

My girlfriend’s biological clock is ticking, and so we’ve started trying to spawn. Since I’m impatient, that has naturally lead to questions like “how long will it take?”. If I were to believe everything on TV, the answer would be easy: have unprotected sex once and pregnancy is guaranteed.

A more cynical me suggests that this isn’t the case. Unfortunately, it is surpisingly difficult to find out the monthly chance of getting pregnant (technical jargon: the “monthly fecundity rate”, or MFR), given that you are having regular sex in the days leading up to ovulation. Everyone agrees that age has a big effect, with women’s peak fertility occuring somewhere around the age of 25. Beyond that point, the internet is filled with near-useless summary statistics like the chance of conceiving after one year. For example, the usually reliable NHS site says

Women become less fertile as they get older. For women aged 35, about 94 out of every 100
who have regular unprotected sex will get pregnant after three years of trying. However, for
women aged 38, only 77 out of every 100 will do so.

I found a couple of reasonably sciency links(George and Kamath, Socal Fertility) that suggest that the MFR is about 25% for a women aged 25, and 10% at age 35. The Scoal link also gives rates of 15% at age 30, 5% at age 40 and less than 1% at age 45. If the woman is too fat, too thin, a smoker, or has hormone problems, or is stressed, then the rate needs reducing.

Given the MFR, the probability of getting pregnant after a given number of months can be calculated with a negative binomial distribution.

```months <- 0:60
p_preg_per_month <- c("25" = 0.25, "30" = 0.15, "35" = 0.1, "40" = 0.05, "45" = 0.01)
p_success <- unlist(lapply(
p_preg_per_month,
function(p) pnbinom(months, 1, p)
))
```

Now we just create a data frame suitable for passing to ggplot2 …

```mfr_group <- paste(
"MFR =",
format(p_preg_per_month, digits = 2),
"at age",
names(p_preg_per_month)
)
mfr_group <- factor(mfr_group, levels = mfr_group)
preg_data <- data.frame(
months = rep.int(months, length(mfr_group))  ,
mfr_group = rep(mfr_group, each = length(months)),
p_success = p_success
)
```

and draw the plot.

```library(ggplot2)
(p <- ggplot(preg_data, aes(months, p_success, colour = mfr_group)) +
geom_point() +
scale_x_continuous(breaks = seq.int(0, 60, 12)) +
scale_y_continuous(breaks = seq.int(0, 1, 0.1), limits = c(0, 1)) +
scale_colour_discrete("Monthly fecundity rate") +
xlab("Months") +
ylab("Probability of conception") +
opts(panel.grid.major = theme_line(colour = "grey60"))
)
``` So almost half of the (healthy) 25 year olds get pregnant in the first monthtwo months, and after two years (the point when doctors start considering you to have fertility problems) more than 90% of 35 year olds should conceive. By contrast, just over 20% of 45 year old women will. In fact, even this statistic is over-optimistic: at this age, fertility is rapidly decreasing, and a 1% MFR at age 45 will mean a much lower MFR at age 47 and the negative binomial model breaks down.

Of course, from a male point of view, conception is an embarrassingly parallel problem: you can dramatically reduce the time to conceive a child by sleeping with lots of women at once. (DISCLAIMER: Janette, if you’re reading this, I’m not practising or advocating this technique!)

## A Great European Bailout Infographic

8th September, 2011 1 comment

Whenever there’s a financial crisis, I tend to assume that it’s Dirk Eddelbuettel‘s fault, though apparently the EMU debt crisis is more complicated than that. JP Morgan have just released a white paper about the problem, including a Lego infographic of who is asking who for money. Created, apparently, by Peter Cembalest, aged nine. Impressive stuff. Found via The Register.

## Interactive graphics for data analysis I got a copy of Martin Theus and Simon Urbanek’s Interactive Graphics for Data Analysis a couple of years ago, whence it’s been sat on my bookshelf. Since I’ve recently become a self-proclaimed expert on interactive graphics I thought it was about time I read the thing. Which is exactly what I did last weekend at the Leeds Festival (in between rocking out).

It’s a book of two halves, and despite the title the interactivity isn’t really the focus. The book is actually a guide on how to do exploratory data analysis. The first half of the book works like an advanced chart chooser, explaining which plots are useful for which types of data, and what types of interactivity they can benefit from. For me, it was worth it for the many rare plots, like spineplots and interaction plots and mosaic plots and fluctuation diagrams. If you’re bored of barcharts, this is a great way to expand your graphical vocabulary. The second half of the book consists entirely of case studies, where you can practice a workflow for exploring data, which is something that’s always worthwhile doing.

The really big takeaway that I got is that exploratory graphics have different priorities to publication graphics. When you are in the courting stage with a dataset, just getting to know each other, you don’t really care so much about whether the greek letters in your axis label are formatted correctly or whether the shade of pink in your dots is quite right. All you really need is to be able to generate lots and lots of plots quickly, and to be able to see the relationships between them.

It is this last point that the authors claim interactivity is most useful for. Perhaps the canonical example of this is clicking a bar on a histogram or barchart, and having corresponding points on a scatterplot highlighted. To demonstrate this, here’s an example using Simon’s `Acinonyx` package (shortly to be renamed `ix` for “iplots Extreme”). Acinonyx isn’t yet available on CRAN, see its home page
for installation details.

```library(Acinonyx)
library(MASS)
data(Cars93)
interactive_scatter <- with(Cars93, iplot(Horsepower, MPG.city))
interactive_histo <- with(Cars93, ihist(EngineSize))
```

Click a bar in the histogram and the the corresponding points in the scatterplot are highlighted. Likewise, drag to select points in the scatterplot and fractions of the histgram are highlighted.

The equivalent static version would be to use trellising and draw each possible graph combination. Splitting a scatterplot into different groups depending upon bars of a histogram works something like this:

```library(ggplot2)
Cars93\$EngineSizeGroup <- cut(Cars93\$EngineSize, 11)
(static_trellis_scatter <- ggplot(Cars93, aes(Horsepower, MPG.city)) +
geom_point() +
facet_wrap(~ EngineSizeGroup)
)
```

(We don’t actually need to bother with the histograms, since they are a little boring.) The reverse operation – going from a selected region of scatterplot to a higlighted region of bar chart is also possible, but trickier. In this case, we do need both graphs.

```Cars93 <- within(Cars93,
{
selected <- ifelse(
Horsepower < 200 & MPG.city > 20 & MPG.city < 30,
"selected",
"unselected"
)
})
(static_scatter_with_highlight <-
ggplot(Cars93, aes(Horsepower, MPG.city, colour = selected)) +
geom_point()
)
(static_histo_with_highlight <-
ggplot(Cars93, aes(EngineSizeGroup, fill = selected)) +
geom_histogram() +
opts(axis.text.x = theme_text(angle = 30, hjust = 1, vjust = 1))
)
```

My conclusion from reading the book, and from my initial experimentation with `Acinonyx` is that anything you can do interactively is also possible by drawing many static graphs, but the interaction can let you see things quicker.

## Nomograms everywhere!

At useR!, Jonty Rougier talked about nomograms, a once popular visualisation that has fallen by the wayside with the rise of computers. I’d seen a few before, but hadn’t understood how they worked or why you’d want to use them. Anyway, since that talk I’ve been digging around in biology books from the 60s and 70s, and it seems they are full of them. So for those of you who haven’t seen the talk, here’s how they work.

A basic nomogram consists of three scales. By reading off known values from two of the scales, you can estimate a third one. Here’s an example I found in the ICRP‘s reference manual. It’s difficult to measure people’s skin surface area, but height and bodyweight are very straightforward. To use the nomogram, you place a ruler (or other straight edge) on the height* and weight scales and read of the point where the ruler crosses the surface area scale. I’m 177cm tall and weigh 72kg, so according to this, my estimated skin surface area is 1.89m2.

Of course nowadays, the standard way to solve this sort of problem is to write a function. Jonty suggested that the main modern use of nomograms is in fieldwork situations, where computers aren’t handily available. (His case study was Kenyan vets trying to estimate the weight of donkeys form there height and girth.)

Altman and Dittmer’s Respiration and Circulation has many more pretty nomograms. I was particularly impressed by those on blood pH, reproduced below for your pleasure.  Your homework is to dig out a pre-1980 textbook and hunt for more nomograms.

*Gruesomely, the fact that the scale is labelled “length” rather than “height” makes me suspect that the bodies that provided the data were in a permanent lying down position – that is, they were corpses.

## useR2011 Easy interactive ggplots talk

I’m talking tomorrow at useR! on making ggplots interactive with the gWidgets GUI framework. For those of you at useR, here is the code and data, so you can play along on your laptops. For everyone else, I’ll make the slides available in the next few days so you can see what you missed. Note that for confidentiality reasons, I’ve added some random junk to the sample values, so please don’t use this data for actual research. (If you are interested in chemical exposure data, contact the lab.)

Chromium data. Nickel data. Code for examples 1 to 4. Once we introduce example 5, the rest of the code needs a little reworking.

Update: The permissions on the code files were blocking downloads from people who aren’t me. My bad. It should be fixed now.

Another update: By popular demand, here are the code chunks.

```library(ggplot2)
library(gWidgetstcltk)

p <- ggplot(chromium, aes(air, bm)) +
geom_point()

win_ctrls <- gwindow("Plot controls 1-4")
grp_ctrls <- ggroup(container = win_ctrls, horizontal = FALSE)

#1 Changing scales
available_scales <- c(
Linear      = "identity",
Log           = "log10"
)

frm_scale_trans_y <- gframe(
"Y scale transformation",
container = grp_ctrls,
expand = TRUE
)

names(available_scales),
container = frm_scale_trans_y,
handler = function(h, ...)
{
scale_trans_y <- available_scales[svalue(h\$obj)]
p <<- p +
scale_y_continuous(
trans = scale_trans_y
)
print(p)
}
)

frm_scale_trans_x <- gframe(
"X scale transformation",
container = grp_ctrls,
expand = TRUE
)

names(available_scales),
container = frm_scale_trans_x,
handler = function(h, ...)
{
scale_trans_x <- available_scales[svalue(h\$obj)]
p <<- p +
scale_x_continuous(
trans = scale_trans_x
)
print(p)
}
)

#2 Changing facets
facet_choices <- list(
None = ". ~ .",
"RPE" = ". ~ rpe",
"Welding type" = ". ~ welding.type",
"RPE and Welding type" = "rpe ~ welding.type"
)

frm_facets <- gframe(
"Faceting",
container = grp_ctrls,
expand = TRUE
)

names(facet_choices),
container = frm_facets,
handler = function(h, ...)
{
facet_formula <-
facet_choices[[svalue(h\$obj)]]
p <<- p +
facet_grid(facet_formula)
print(p)
}
)

#3 Changing the dataset
frm_datasets <- gframe(
"Datasets",
container = grp_ctrls,
expand = TRUE
)

cmb_datasets <- gcombobox(
c("chromium", "nickel"),
container = frm_datasets,
handler = function(h, ...)
{
p <<- p %+% get(svalue(h\$obj))
print(p)
}
)

#4 Changing the title
frm_title <- gframe(
"Plot title",
container = grp_ctrls,
expand = TRUE
)

txt_title <- gedit(
p\$options\$title,
container = frm_title,
handler = function(h, ...)
{
p <<- p + opts(title = svalue(txt_title))
print(p)
}
)
```
```library(ggplot2)
library(gWidgetstcltk)

p <- ggplot(chromium, aes(air, bm)) +
geom_blank()

print_p <- function()
{
pp <- get("p", envir = globalenv())
if(svalue(chk_points)) pp <- pp + geom_point()
if(svalue(chk_smooth)) pp <- pp + geom_smooth()
print(pp)
}

win_ctrls <- gwindow("Plot controls 1-5")
grp_ctrls <- ggroup(container = win_ctrls, horizontal = FALSE)

#1 Changing scales
available_scales <- c(
Linear      = "identity",
Log           = "log10"
)

frm_scale_trans_y <- gframe(
"Y scale transformation",
container = grp_ctrls,
expand = TRUE
)

names(available_scales),
container = frm_scale_trans_y,
handler = function(h, ...)
{
scale_trans_y <- available_scales[svalue(h\$obj)]
p <<- p +
scale_y_continuous(
trans = scale_trans_y
)
print_p()
}
)

frm_scale_trans_x <- gframe(
"X scale transformation",
container = grp_ctrls,
expand = TRUE
)

names(available_scales),
container = frm_scale_trans_x,
handler = function(h, ...)
{
scale_trans_x <- available_scales[svalue(h\$obj)]
p <<- p +
scale_x_continuous(
trans = scale_trans_x
)
print_p()
}
)

#2 Changing facets
facet_choices <- list(
None = ". ~ .",
"RPE" = ". ~ rpe",
"Welding type" = ". ~ welding.type",
"RPE and Welding type" = "rpe ~ welding.type"
)

frm_facets <- gframe(
"Faceting",
container = grp_ctrls,
expand = TRUE
)

names(facet_choices),
container = frm_facets,
handler = function(h, ...)
{
facet_formula <-
facet_choices[[svalue(h\$obj)]]
p <<- p +
facet_grid(facet_formula)
print_p()
}
)

#3 Changing the dataset
frm_datasets <- gframe(
"Datasets",
container = grp_ctrls,
expand = TRUE
)

cmb_datasets <- gcombobox(
c("chromium", "nickel"),
container = frm_datasets,
handler = function(h, ...)
{
p <<- p %+% get(svalue(h\$obj))
print_p()
}
)

#4 Changing the title
frm_title <- gframe(
"Plot title",
container = grp_ctrls,
expand = TRUE
)

txt_title <- gedit(
p\$options\$title,
container = frm_title,
handler = function(h, ...)
{
p <<- p + opts(title = svalue(txt_title))
print_p()
}
)

#5 Changing geoms
frm_geoms <- gframe(
container = grp_ctrls,
expand = TRUE
)

chk_points <- gcheckbox(
"Points",
container = frm_geoms,
handler = function(h, ...) print_p()
)
chk_smooth <- gcheckbox(
"Smooth curve",
container = frm_geoms,
handler = function(h, ...) print_p()
)
```

## Legendary Plots

I was recently pointed in the direction of a thermal comfort model by the engineering company Arup (p27–28 of this pdf). Figure 3 at the top of p28 caught my attention. It’s mostly a nice graph; there’s not too much junk in it. One thing that struck me was that there is an awful lot of information in the legend, and that I found it impossible to retain all that information while switching between the plot and the legend.

The best way to improve this plot then is to find a way to simplify the legend. Upon closer inspection, it seems that there is a lot of information that is repeated. For example, there are only two temperature combinations, and three levels of direct solar energy. Humidity and diffused solar energy are kept the same in all cases. That makes it really easy for us: our five legend options are

Outdoor temp (deg C) Direct solar energy (W/m^2)
32 700
32 150
32 500
29 500
29 150

Elsewhere we can explain that the mezannine/platform temps are always 2/4 degrees higher than outdoors, and that the humidity is always 50%, and that the diffused solar energy is always 100W/m^2.

Living in Buxton, one of the coldest, rainiest towns in the UK, it amuses me to see that their “low” outdoor temperature is 29°C.

The other thing to note is that we have two variables mapped to the hue. For just five cases, this is just about acceptable, but it isn’t the best option and it won’t scale to many more categories. It’s generally considered best practice to work in HCL color space when mapping variables to colours. I would be tempted to map temperature to hue – whether you pick red as hot and blue as cold or the other way around depends upon how many astronomers you have in your target audience. Then I’d map luminance (lightness) to solar energy: more sunlight = lighter line.

I don’t have the values to exactly recreate the dataset, but here are some made up numbers with the new legend. Notice the combined outdoor temp/direct solar energy variable.

```time_points <- 0:27
n_time_points <- length(time_points)
n_cases <- 5
comfort_data <- data.frame(
time = rep.int(time_points, n_cases),
comfort = jitter(rep(-2:2, each = n_time_points)),
outdoor.temperature = rep(
c(32, 29),
times = c(3 * n_time_points, 2 * n_time_points)
),
direct.solar.energy = rep(
c(700, 150, 500, 500, 150),
each = n_time_points
)
)
comfort_data\$combined <- with(comfort_data,
factor(paste(outdoor.temperature, direct.solar.energy, sep = ", "))
)
```

We manually pick the colours to use in HCL space (using `str_detect` to examine the factor levels).

```library(stringr)
cols <- hcl(
h = with(comfort_data, ifelse(str_detect(levels(combined), "29"), 0, 240)),
c = 100,
l = with(comfort_data,
ifelse(str_detect(levels(combined), "150"), 20,
ifelse(str_detect(levels(combined), "500"), 50, 80))
)
)
```

Drawing the plot is very straightforward, it’s just a line plot.

```library(ggplot2)
p <- ggplot(comfort_data, aes(time, comfort, colour = combined)) +
geom_line(size = 2) +
scale_colour_manual(
name = expression(paste(
"Outdoor temp (", degree, C, "), Direct solar (", W/m^2, ")"
)),
values = cols) +
xlab("Time (minutes)") +
ylab("Comfort")
p
``` Sensible people should stop here, and write the additional detail in the figure caption. There is currently no sensible way of writing annotations outside of the plot area (`annotate` only works inside panels). The following hack was devised by Baptiste Auguie, read this forum thread for other variations.

```library(gridExtra)
caption <- tableGrob(
matrix(
expression(
paste(
"Mezzanine temp is 2", degree, C, " warmer than outdoor temp"
),
paste(
"Platform temp is 4", degree, C, " warmer than outdoor temp"
),
paste("Humidity is always 50%"),
paste(
"Diffused solar energy is always 100", W/m^2
)
)
),
parse = TRUE,
theme = theme.list(
gpar.corefill = gpar(fill = NA, col = NA),
core.just = "center"
)
)
grid.arrange(p,  sub=caption)
``` ## Presenting Immer’s barley data

Last time I talked about adapting graphs for presentations.  This time I’m putting some of the concepts I discussed there into action, with a presentation of Immer’s barley dataset.  This is a classic dataset, originally published in 1934; in 1993 Bill Cleveland mentioned it in his book Visualising Data on account of how it may contain an error.  Here’s the paper/screen version. Here’s the presentation.

For the record, the presentation was created with Impress and the audio recorded with Audacity. Using these tools, it’s pretty straightforward to make and share an audio presentation.

EDIT:

I’ve corrected the map slide. Some people also asked about the code to draw the plots. First up, I recoded the factors so that site and variety appear in order of increasing yield.

```barley\$variety <- with(barley, reorder(variety, yield))
barley\$site <- with(barley, reorder(site, yield))
```

The plot for the graph is straightforward.

```ggplot(barley, aes(yield, variety, colour = year)) +
geom_point() +
facet_grid(site ~ .)
```

The presentation version uses `facet_wrap(~ site)` rather than `facet_grid`, and the text size is increased with `theme_set(theme_grey(24))`.

Tags: , ,