Archive

Posts Tagged ‘ggplot2’

The tenure of Doctor Who incarnations

3rd August, 2013 6 comments

With a new actor being announced tomorrow, it got me pondering about the good Doctor. Specifically, who is the longest serving doctor? IMDB has the data:

whos <- data.frame(
  doctor = c(
    "William Hartnell",
    "Patrick Troughton",
    "Jon Pertwee",
    "Tom Baker",
    "Peter Davison",
    "Colin Baker",
    "Sylvester McCoy",
    "Paul McGann",
    "Christopher Ecclestone",
    "David Tennant",
    "Matt Smith"
  ),
  n_episodes = c(
    136,
    127,
    129,
    173,
    70,
    35,
    42,
    1,
    13,
    49,
    44
  ),
  stringsAsFactors = FALSE
)
whos$doctor <- factor(whos$doctor, levels = whos$doctor)

Let’s plot it to see how it changes over time.

library(ggplot2)
(bars <- ggplot(whos, aes(doctor, n_episodes)) +
  geom_bar(stat = "identity") +
  coord_flip()
) 

The plot shows that earlier doctor whos were typically in more episodes

There was a definite shift after Tom Baker towards a shorter term as the doctor. In terms of screen time, the shift is a little less pronounced due to the move from 25 minutes to 45 minutes per episode from Christopher Ecclestone onwards (and for one of the Colin Baker series).

Tags: , ,

How long does it take to get pregnant?

15th June, 2012 38 comments

My girlfriend’s biological clock is ticking, and so we’ve started trying to spawn. Since I’m impatient, that has naturally lead to questions like “how long will it take?”. If I were to believe everything on TV, the answer would be easy: have unprotected sex once and pregnancy is guaranteed.

A more cynical me suggests that this isn’t the case. Unfortunately, it is surpisingly difficult to find out the monthly chance of getting pregnant (technical jargon: the “monthly fecundity rate”, or MFR), given that you are having regular sex in the days leading up to ovulation. Everyone agrees that age has a big effect, with women’s peak fertility occuring somewhere around the age of 25. Beyond that point, the internet is filled with near-useless summary statistics like the chance of conceiving after one year. For example, the usually reliable NHS site says

Women become less fertile as they get older. For women aged 35, about 94 out of every 100
who have regular unprotected sex will get pregnant after three years of trying. However, for
women aged 38, only 77 out of every 100 will do so.

I found a couple of reasonably sciency links(George and Kamath, Socal Fertility) that suggest that the MFR is about 25% for a women aged 25, and 10% at age 35. The Scoal link also gives rates of 15% at age 30, 5% at age 40 and less than 1% at age 45. If the woman is too fat, too thin, a smoker, or has hormone problems, or is stressed, then the rate needs reducing.

Given the MFR, the probability of getting pregnant after a given number of months can be calculated with a negative binomial distribution.

months <- 0:60
p_preg_per_month <- c("25" = 0.25, "30" = 0.15, "35" = 0.1, "40" = 0.05, "45" = 0.01)
p_success <- unlist(lapply(
  p_preg_per_month, 
  function(p) pnbinom(months, 1, p)
))

Now we just create a data frame suitable for passing to ggplot2 …

mfr_group <- paste(
  "MFR =", 
  format(p_preg_per_month, digits = 2), 
  "at age", 
  names(p_preg_per_month)
)
mfr_group <- factor(mfr_group, levels = mfr_group)
preg_data <- data.frame(
  months = rep.int(months, length(mfr_group))  ,
  mfr_group = rep(mfr_group, each = length(months)),
  p_success = p_success
)

and draw the plot.

library(ggplot2)
(p <- ggplot(preg_data, aes(months, p_success, colour = mfr_group)) +
  geom_point() +
  scale_x_continuous(breaks = seq.int(0, 60, 12)) +
  scale_y_continuous(breaks = seq.int(0, 1, 0.1), limits = c(0, 1)) +
  scale_colour_discrete("Monthly fecundity rate") +
  xlab("Months") +
  ylab("Probability of conception") +
  opts(panel.grid.major = theme_line(colour = "grey60"))
)

The plot shows the probability of conception by number of months of trying for different age groups.

So almost half of the (healthy) 25 year olds get pregnant in the first monthtwo months, and after two years (the point when doctors start considering you to have fertility problems) more than 90% of 35 year olds should conceive. By contrast, just over 20% of 45 year old women will. In fact, even this statistic is over-optimistic: at this age, fertility is rapidly decreasing, and a 1% MFR at age 45 will mean a much lower MFR at age 47 and the negative binomial model breaks down.

Of course, from a male point of view, conception is an embarrassingly parallel problem: you can dramatically reduce the time to conceive a child by sleeping with lots of women at once. (DISCLAIMER: Janette, if you’re reading this, I’m not practising or advocating this technique!)

useR2011 Easy interactive ggplots talk

17th August, 2011 10 comments

I’m talking tomorrow at useR! on making ggplots interactive with the gWidgets GUI framework. For those of you at useR, here is the code and data, so you can play along on your laptops. For everyone else, I’ll make the slides available in the next few days so you can see what you missed. Note that for confidentiality reasons, I’ve added some random junk to the sample values, so please don’t use this data for actual research. (If you are interested in chemical exposure data, contact the lab.)

Chromium data. Nickel data. Code for examples 1 to 4. Once we introduce example 5, the rest of the code needs a little reworking.

Update: The permissions on the code files were blocking downloads from people who aren’t me. My bad. It should be fixed now.

Another update: By popular demand, here are the code chunks.

library(ggplot2)
library(gWidgetstcltk)

chromium <- read.csv("chromium.csv")
nickel <- read.csv("nickel.csv")

p <- ggplot(chromium, aes(air, bm)) +
  geom_point()

win_ctrls <- gwindow("Plot controls 1-4")
grp_ctrls <- ggroup(container = win_ctrls, horizontal = FALSE)

#1 Changing scales
available_scales <- c(
  Linear      = "identity", 
  Log           = "log10"
)   
  
frm_scale_trans_y <- gframe(
  "Y scale transformation", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_scale_trans_y <- gradio(
  names(available_scales), 
  container = frm_scale_trans_y,
  handler = function(h, ...)
  {
    scale_trans_y <- available_scales[svalue(h$obj)]  
    p <<- p + 
      scale_y_continuous(
        trans = scale_trans_y
      )   
    print(p)  
  }
)

frm_scale_trans_x <- gframe(
  "X scale transformation", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_scale_trans_x <- gradio(
  names(available_scales), 
  container = frm_scale_trans_x,
  handler = function(h, ...)
  {
    scale_trans_x <- available_scales[svalue(h$obj)]  
    p <<- p + 
      scale_x_continuous(
        trans = scale_trans_x
      )   
    print(p)  
  }
)

#2 Changing facets
facet_choices <- list(
  None = ". ~ .",
  "RPE" = ". ~ rpe",
  "Welding type" = ". ~ welding.type",
  "RPE and Welding type" = "rpe ~ welding.type"
)

frm_facets <- gframe(
  "Faceting", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_facets <- gradio(
  names(facet_choices), 
  container = frm_facets,
  handler = function(h, ...)
  {
    facet_formula <- 
      facet_choices[[svalue(h$obj)]]
    p <<- p + 
      facet_grid(facet_formula)
    print(p) 
  }
)

#3 Changing the dataset                     
frm_datasets <- gframe(
  "Datasets", 
  container = grp_ctrls,
  expand = TRUE
)  

cmb_datasets <- gcombobox(
  c("chromium", "nickel"), 
  container = frm_datasets,
  handler = function(h, ...)
  {
    p <<- p %+% get(svalue(h$obj)) 
    print(p)  
  }
)

#4 Changing the title
frm_title <- gframe(
  "Plot title", 
  container = grp_ctrls,
  expand = TRUE
)  

txt_title <- gedit(
  p$options$title,
  container = frm_title,
  handler = function(h, ...)
  {  
    p <<- p + opts(title = svalue(txt_title))
    print(p)
  }
)   
library(ggplot2)
library(gWidgetstcltk)

chromium <- read.csv("chromium.csv")
nickel <- read.csv("nickel.csv")

p <- ggplot(chromium, aes(air, bm)) +
  geom_blank()
  
print_p <- function()
{
  pp <- get("p", envir = globalenv())
  if(svalue(chk_points)) pp <- pp + geom_point()  
  if(svalue(chk_smooth)) pp <- pp + geom_smooth()
  print(pp)
}

win_ctrls <- gwindow("Plot controls 1-5")
grp_ctrls <- ggroup(container = win_ctrls, horizontal = FALSE)

#1 Changing scales
available_scales <- c(
  Linear      = "identity", 
  Log           = "log10"
)   
  
frm_scale_trans_y <- gframe(
  "Y scale transformation", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_scale_trans_y <- gradio(
  names(available_scales), 
  container = frm_scale_trans_y,
  handler = function(h, ...)
  {
    scale_trans_y <- available_scales[svalue(h$obj)]  
    p <<- p + 
      scale_y_continuous(
        trans = scale_trans_y
      )   
    print_p()  
  }
)

frm_scale_trans_x <- gframe(
  "X scale transformation", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_scale_trans_x <- gradio(
  names(available_scales), 
  container = frm_scale_trans_x,
  handler = function(h, ...)
  {
    scale_trans_x <- available_scales[svalue(h$obj)]  
    p <<- p + 
      scale_x_continuous(
        trans = scale_trans_x
      )   
    print_p()  
  }
)

#2 Changing facets
facet_choices <- list(
  None = ". ~ .",
  "RPE" = ". ~ rpe",
  "Welding type" = ". ~ welding.type",
  "RPE and Welding type" = "rpe ~ welding.type"
)

frm_facets <- gframe(
  "Faceting", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_facets <- gradio(
  names(facet_choices), 
  container = frm_facets,
  handler = function(h, ...)
  {
    facet_formula <- 
      facet_choices[[svalue(h$obj)]]
    p <<- p + 
      facet_grid(facet_formula)
    print_p() 
  }
)

#3 Changing the dataset                     
frm_datasets <- gframe(
  "Datasets", 
  container = grp_ctrls,
  expand = TRUE
)  

cmb_datasets <- gcombobox(
  c("chromium", "nickel"), 
  container = frm_datasets,
  handler = function(h, ...)
  {
    p <<- p %+% get(svalue(h$obj)) 
    print_p()  
  }
)

#4 Changing the title
frm_title <- gframe(
  "Plot title", 
  container = grp_ctrls,
  expand = TRUE
)  

txt_title <- gedit(
  p$options$title,
  container = frm_title,
  handler = function(h, ...)
  {  
    p <<- p + opts(title = svalue(txt_title))
    print_p()
  }
)

#5 Changing geoms
frm_geoms <- gframe(
  "Add/remove geoms", 
  container = grp_ctrls,
  expand = TRUE
)  

chk_points <- gcheckbox(
  "Points", 
  container = frm_geoms,
  handler = function(h, ...) print_p()  
)      
chk_smooth <- gcheckbox(
  "Smooth curve", 
  container = frm_geoms,
  handler = function(h, ...) print_p()  
)     

Legendary Plots

12th March, 2011 Leave a comment

I was recently pointed in the direction of a thermal comfort model by the engineering company Arup (p27–28 of this pdf). Figure 3 at the top of p28 caught my attention.

Arup thermal comfort model, figure 3

It’s mostly a nice graph; there’s not too much junk in it. One thing that struck me was that there is an awful lot of information in the legend, and that I found it impossible to retain all that information while switching between the plot and the legend.

The best way to improve this plot then is to find a way to simplify the legend. Upon closer inspection, it seems that there is a lot of information that is repeated. For example, there are only two temperature combinations, and three levels of direct solar energy. Humidity and diffused solar energy are kept the same in all cases. That makes it really easy for us: our five legend options are

Outdoor temp (deg C) Direct solar energy (W/m^2)
32 700
32 150
32 500
29 500
29 150

Elsewhere we can explain that the mezannine/platform temps are always 2/4 degrees higher than outdoors, and that the humidity is always 50%, and that the diffused solar energy is always 100W/m^2.

Living in Buxton, one of the coldest, rainiest towns in the UK, it amuses me to see that their “low” outdoor temperature is 29°C.

The other thing to note is that we have two variables mapped to the hue. For just five cases, this is just about acceptable, but it isn’t the best option and it won’t scale to many more categories. It’s generally considered best practice to work in HCL color space when mapping variables to colours. I would be tempted to map temperature to hue – whether you pick red as hot and blue as cold or the other way around depends upon how many astronomers you have in your target audience. Then I’d map luminance (lightness) to solar energy: more sunlight = lighter line.

I don’t have the values to exactly recreate the dataset, but here are some made up numbers with the new legend. Notice the combined outdoor temp/direct solar energy variable.

time_points <- 0:27
n_time_points <- length(time_points)
n_cases <- 5
comfort_data <- data.frame(
  time = rep.int(time_points, n_cases),
  comfort = jitter(rep(-2:2, each = n_time_points)),
  outdoor.temperature = rep(
    c(32, 29),
    times = c(3 * n_time_points, 2 * n_time_points)
  ),
  direct.solar.energy = rep(
    c(700, 150, 500, 500, 150),
    each = n_time_points
  )
)
comfort_data$combined <- with(comfort_data,
  factor(paste(outdoor.temperature, direct.solar.energy, sep = ", "))
)

We manually pick the colours to use in HCL space (using str_detect to examine the factor levels).

library(stringr)
cols <- hcl(
  h = with(comfort_data, ifelse(str_detect(levels(combined), "29"), 0, 240)),
  c = 100,
  l = with(comfort_data,
    ifelse(str_detect(levels(combined), "150"), 20,
    ifelse(str_detect(levels(combined), "500"), 50, 80))
  )
)

Drawing the plot is very straightforward, it’s just a line plot.

library(ggplot2)
p <- ggplot(comfort_data, aes(time, comfort, colour = combined)) +
  geom_line(size = 2) +
  scale_colour_manual(
    name = expression(paste(
      "Outdoor temp (", degree, C, "), Direct solar (", W/m^2, ")"
    )),
    values = cols) +
  xlab("Time (minutes)") +
  ylab("Comfort")
p

My version of the plot, with an improved legend

Sensible people should stop here, and write the additional detail in the figure caption. There is currently no sensible way of writing annotations outside of the plot area (annotate only works inside panels). The following hack was devised by Baptiste Auguie, read this forum thread for other variations.

library(gridExtra)
caption <- tableGrob(
  matrix(
    expression(
      paste(
        "Mezzanine temp is 2", degree, C, " warmer than outdoor temp"
      ),
      paste(
        "Platform temp is 4", degree, C, " warmer than outdoor temp"
      ),
      paste("Humidity is always 50%"),
      paste(
        "Diffused solar energy is always 100", W/m^2
      )
    )
  ),
  parse = TRUE,
  theme = theme.list(
    gpar.corefill = gpar(fill = NA, col = NA),
    core.just = "center"
  )
)
grid.arrange(p,  sub=caption)

The additional information is included in the plot's subcaption

Follow

Get every new post delivered to your Inbox.

Join 204 other followers