
Posts Tagged ‘data-viz’

Introducing the Lowry Plot

11th January, 2011

Here at the Health and Safety Laboratory* we’re big fans of physiologically-based pharmacokinetic (PBPK) models (say that 10 times fast) for predicting concentrations of chemicals around your body based upon an exposure. These models take the form of a big system of ODEs.

Because they contain many equations, and consequently many parameters (organ masses, blood flow rates, partition coefficients and so on), it is important to do a sensitivity analysis on the model to determine which parameters are important and which are not. After all, many of these parameters are tricky to measure accurately, so you need to know whether an error in a parameter will throw off the whole model.

How to do sensitivity analysis of an ODE-based model is a worthy topic of discussion, but not one I want to cover today. For now, it is sufficient to know the basic idea: you input different parameter combinations and see which parameters cause the model output to change the most.
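If you want a more concrete picture, here's a toy one-at-a-time sketch under invented assumptions (a made-up two-parameter "model", wiggled one parameter at a time) – it isn't the variance-based method used for the real analysis, but it shows the principle:

toy_model <- function(a, b) a * exp(-b * 2)        # invented stand-in for a real ODE model
nominal <- list(a = 1, b = 0.5)                    # invented nominal parameter values
sensitivity_of <- function(param) {
  outputs <- replicate(500, {
    p <- nominal
    p[[param]] <- p[[param]] * runif(1, 0.9, 1.1)  # wiggle one parameter by +/- 10%
    do.call(toy_model, p)
  })
  var(outputs)  # bigger variance => the output is more sensitive to this parameter
}
sapply(names(nominal), sensitivity_of)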

For each parameter, the sensitivity analysis returns a main effect, which is the fraction of variance in model output accounted for by that parameter. Each parameter also has an interaction effect, which is the fraction of variance accounted for by that parameter’s interaction with all the other parameters. The main effect + interactions give the total effect for that parameter. Got it?

When I was working on this problem recently, I searched the literature for ways to display the results of the sensitivity analysis. “Draw a pie chart” it said. Not with less than 4 dimensions, thank you! “How about a bar chart?”, the literature suggested. How boring. It was clear that something better needed to be done.

Note that the plotting technique I’m about to describe is suitable for any sensitivity analysis, not just those of PBPK models.

The first step from the basic bar chart was to switch to a series of 3 Pareto plots for main effects, interactions and total effects. This was a big improvement, but having three graphs to look at was a pain.

I also tried a stacked bar chart for the total effects, but this still didn’t answer the most important question:

How many parameters should I include in my model?

This is a rather fuzzy question, but it is possible to restate it in a way that becomes solvable:

What is the smallest number of parameters that I need to include to account for a given percentage of the variance in my model?

It turns out that you don't know an exact value for the amount of variance accounted for – you just know a range. The lower bound is the sum of the main effects of the parameters you include in the model. The upper bound is the sum of the total effects of those same parameters. Actually, we can do better than that. Interaction effects are double counted (since each interaction is the interaction with all the other parameters), so a second upper bound is that the included variance can never exceed 100% minus the sum of the main effects of the parameters you didn't include.
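To make that concrete, here's a small numerical sketch with invented effect fractions, keeping the first two of four parameters:

main  <- c(A = 0.50, B = 0.25, C = 0.10, D = 0.05)        # invented main effects
inter <- c(A = 0.05, B = 0.04, C = 0.02, D = 0.01)        # invented interactions
total <- main + inter                                     # total effects
included <- c("A", "B")                                   # parameters kept in the model
lower  <- sum(main[included])                             # lower bound: included main effects
upper1 <- sum(total[included])                            # first upper bound: included total effects
upper2 <- 1 - sum(main[setdiff(names(main), included)])   # second upper bound: 1 - excluded main effects
c(lower = lower, upper = min(upper1, upper2))             # accounted-for variance lies in this range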

I realise that there have been far too many long sentences in this post already, so I'll skip to my solution. After some pondering, I realised that I could combine the Pareto plot with the stacked bar chart. I call the result the Lowry plot, after the Lancashire artist – the graph looks a lot like one of the smoke stacks he painted so many of.

A Lowry plot of some m-xylene data

If you include the first 4 parameters then, reading up from the fourth bar, you account for between 85% and 90% of the variance. Conversely, to include at least 90% of the variance, you find the bar where the whole ribbon is above 90%. In this case you need the first six parameters.

To recreate this, here’s some sample data taken from a model of m-xylene. The important thing is that you need 3 columns: parameter name, main effect and interaction.

library(ggplot2)
library(reshape) # melt() is used below

m_xylene_data <- data.frame(
  Parameter = c(
    "BW", "CRE", "DS", "KM", "MPY", "Pba", "Pfaa", 
    "Plia", "Prpda", "Pspda", "QCC", "QfaC", "QliC", 
    "QPC", "QspdC", "Rurine", "Vfac", "VliC", "Vmax"),
  "Main Effect" = c(
    1.03E-01, 9.91E-02, 9.18E-07, 3.42E-02, 9.27E-3, 2.82E-2, 2.58E-05, 
    1.37E-05, 5.73E-4, 2.76E-3, 6.77E-3, 8.67E-05, 1.30E-02, 
    1.19E-01, 4.75E-04, 5.25E-01, 2.07E-04, 1.73E-03, 1.08E-03),
  Interaction = c(
    1.49E-02, 1.43E-02, 1.25E-04, 6.84E-03, 3.25E-03, 7.67E-03, 8.34E-05, 
    1.17E-04, 2.04E-04, 7.64E-04, 2.84E-03, 8.72E-05, 2.37E-03, 
    2.61E-02, 6.68E-04, 4.57E-02, 1.32E-04, 6.96E-04, 6.55E-04
  )
)

Sometimes it is easier to use the data in wide format, other times in long format. This data fortification process returns both, plus some extra columns.

fortify_lowry_data <- function(data, 
  param_var = "Parameter", 
  main_var = "Main.Effect", 
  inter_var = "Interaction")
{
  #Convert wide to long format
  mdata <- melt(data, id.vars = param_var)

  #Order columns by main effect and reorder parameter levels
  o <- order(data[, main_var], decreasing = TRUE)
  data <- data[o, ]
  data[, param_var] <- factor(
    data[, param_var], levels = data[, param_var]
  )

  #Force main effect, interaction to be numeric
  data[, main_var] <- as.numeric(data[, main_var])
  data[, inter_var] <- as.numeric(data[, inter_var])

  #total effect is main effect + interaction
  data$.total.effect <- rowSums(data[, c(main_var, inter_var)])

  #Get cumulative totals for the ribbon
  data$.cumulative.main.effect <- cumsum(data[, main_var])
  data$.cumulative.total.effect <- cumsum(data$.total.effect)

  #A quirk of ggplot2 means we need x coords of bars
  data$.numeric.param <- as.numeric(data[, param_var])

  #The other upper bound
  #.maximum =  1 - main effects not included
  data$.maximum <- c(1 - rev(cumsum(rev(data[, main_var])))[-1], 1)

  data$.valid.ymax <- with(data, 
    pmin(.maximum, .cumulative.total.effect)
  )
  
  mdata[, param_var] <- factor(
    mdata[, param_var], levels = data[, param_var]
  ) 
  list(data = data, mdata = mdata)
}
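To get a feel for what the fortification returns, you can inspect both pieces (using the m_xylene_data frame defined above):

fortified <- fortify_lowry_data(m_xylene_data)
head(fortified$data)   # wide format: one row per parameter, plus the cumulative and bound columns
head(fortified$mdata)  # long format: one row per parameter/effect-type combination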

The plot is essentially a bar chart plus a cumulative frequency ribbon, with some tweaks to make it prettier.

lowry_plot <- function(data,
  param_var = "Parameter",
  main_var = "Main.Effect",
  inter_var = "Interaction",
  x_lab = "Parameters",
  y_lab = "Total Effects (= Main Effects + Interactions)",
  ribbon_alpha = 0.5,
  x_text_angle = 25)
{
  #Fortify data and dump contents into plot function environment
  data_list <- fortify_lowry_data(data, param_var, main_var, inter_var)
  list2env(data_list, envir = sys.frame(sys.nframe()))

  p <- ggplot(data) +
    geom_bar(aes_string(x = param_var, y = "value", fill = "variable"),
      data = mdata) +
    geom_ribbon(
      aes(x = .numeric.param, ymin = .cumulative.main.effect, ymax = .valid.ymax),
      data = data,
      alpha = ribbon_alpha) +
    xlab(x_lab) +
    ylab(y_lab) +
    scale_y_continuous(formatter = "percent") +
    opts(axis.text.x = theme_text(angle = x_text_angle, hjust = 1)) +
    scale_fill_grey(end = 0.5) +
    opts(legend.position = "top",
      legend.title = theme_blank(),
      legend.direction = "horizontal"
    )
  p
}

m_xylene_lowry <- lowry_plot(m_xylene_data)
m_xylene_lowry
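One caveat: the code above targets the ggplot2 API as it stood in early 2011; opts(), theme_text(), theme_blank() and formatter = "percent" have since been removed. A rough equivalent for current ggplot2 – a sketch only, assuming the scales package, the reshape package for melt(), and the default column names produced by fortify_lowry_data() – would look something like this:

lowry_plot_current <- function(data, ribbon_alpha = 0.5, x_text_angle = 25) {
  parts <- fortify_lowry_data(data)
  ggplot(parts$data) +
    geom_col(aes(Parameter, value, fill = variable), data = parts$mdata) +  # stacked bars from the long data
    geom_ribbon(
      aes(.numeric.param, ymin = .cumulative.main.effect, ymax = .valid.ymax),
      alpha = ribbon_alpha                                                  # cumulative ribbon from the wide data
    ) +
    scale_y_continuous(labels = scales::percent) +
    scale_fill_grey(end = 0.5) +
    labs(x = "Parameters", y = "Total Effects (= Main Effects + Interactions)") +
    theme(
      axis.text.x = element_text(angle = x_text_angle, hjust = 1),
      legend.position = "top",
      legend.title = element_blank(),
      legend.direction = "horizontal"
    )
}
lowry_plot_current(m_xylene_data)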

EDIT: I've tweaked the code to make it a little simpler; the resampling is no longer needed.

*Dear legal folk, this is a personal blog and I’m not officially speaking on behalf of my employer. That said, a substantial chunk of the people who care about these things love PBPK.


Pareto plot party!

5th December, 2010

A Pareto plot is an enhanced bar chart. It comes in useful for deciding which bars in your bar chart are important. To see this, take a look at some made up DVD sales data.

set.seed(1234)
dvd_names <- c(
  "Toy Tales 3", "The Dusk Saga: Black Out", "Urban Coitus 2",
  "Dragon Training for Dummies", "Germination", "Fe Man 2",
  "Harold The Wizard", "Embodiment", "Alice in Elysium",
  "The Disposables", "Burpee 3", "The Morning After"
)
n_products <- length(dvd_names)
dvd_sales <- data.frame(
  product = dvd_names,
  value    = rlnorm(n_products) ^ 2
)

library(ggplot2)
bar_chart <- ggplot(dvd_sales, aes(product, value)) + 
  geom_bar() +   
  xlab("") +
  ylab("Sales value") +
  opts(axis.text.x=theme_text(angle=20, hjust=1))
bar_chart

Bar chart of DVD sales
The “importance” of each bar is given by its height, so we can make the plot easier to interpret by simply sorting the bars from largest to smallest.

ordered_dvd_sales <- dvd_sales
ordered_dvd_sales$product <- with(ordered_dvd_sales, reorder(product, -value))
ordered_bar_chart <- bar_chart %+% ordered_dvd_sales
ordered_bar_chart

Ordered bar chart of DVD sales

Suppose you need to describe what you’ve been selling to your boss. You don’t want to list every DVD because bosses have short attention spans and get confused easily. To make it easy for the boss, you can tell him which films make up 90% of sales. (Or some other percentage — if your catalogue is much bigger then a smaller percentage may be more realistic.)

To find 90% of sales, we need a cumulative total of the sales values. Due to a technicality of ggplot2, since bars use a categorical scale but lines require a numeric scale, we also need to convert the x values to be numeric. This function fortifies our dataset with the requisite columns.

fortify_pareto_data <- function(data, xvar, yvar)
{
  for(v in c(xvar, yvar))
  {
    if(!(v %in% colnames(data))) 
    {
      stop(sQuote(v), " is not a column of the dataset")
    }
  }

  o <- order(data[, yvar], decreasing = TRUE)
  data <- data[o, ]
  data[, xvar] <- factor(data[, xvar], levels = data[, xvar])
                                            
  data[, yvar] <- as.numeric(data[, yvar])
  data$.cumulative.y <- cumsum(data[, yvar])
  
  data$.numeric.x <- as.numeric(data[, xvar])
  data
}

#Convert sales to fraction of sales
fractional_dvd_sales <- dvd_sales             
total_sales <- sum(fractional_dvd_sales$value)
fractional_dvd_sales$value <- fractional_dvd_sales$value / total_sales

#Now add columns for Pareto plot
fortified_dvd_sales <- fortify_pareto_data(fractional_dvd_sales, "product", "value")
pareto_plot <- bar_chart %+% fortified_dvd_sales +
    geom_line(aes(.numeric.x, .cumulative.y)) +
    ylab("Percentage of sales") +
    scale_y_continuous(formatter = "percent")
pareto_plot

Pareto plot of DVD sales

To see which DVDs constitute 90% of sales, read across from 90% on the y-axis until you hit the cumulative total line. Now read down until you hit the x-axis, and all the DVDs to the left of that point constitute your "important" set. In this case, you'd be telling your boss that 90% of sales come from "Urban Coitus 2", "Fe Man 2", "Germination" and "The Dusk Saga: Black Out".
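If you'd rather compute that set than read it off the chart, a small helper along these lines will do it (hypothetical, and assuming the .cumulative.y column created by fortify_pareto_data above):

which_important <- function(data, xvar, target = 0.9) {
  cutoff <- which(data$.cumulative.y >= target)[1]  # first row where the cumulative total reaches the target
  as.character(data[seq_len(cutoff), xvar])         # everything up to and including that row
}
which_important(fortified_dvd_sales, "product")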

And there you have it: a Pareto plot. These plots are useful whenever you need to reduce the number of categories of data. As well as these businessy examples, they are great for things like principal component analysis and factor analysis where you need to reduce the number of components/factors.


The truth is stranger than fiction

27th September, 2010

It turns out that when you search Google for the phrase 4D pie chart, this blog is only the second link.  The number one spot is taken by a tutorial on how to create animated pie charts with Maxon Computer’s Cinema 4D package.  I can’t detect any hint of sarcasm in the instructions.  Hang your head in shame, Maxon Computer employee.


Dotplot on xkcd

27th September, 2010

There's a nice example of a dotplot on xkcd today. Hand-drawn lines aside, it's well drawn. The only obvious thing wrong with it is the scale. If you're going to log-transform your data, then the axis labels should show the untransformed amounts. So the numbers -17 to -5 along the top x-axis should really be e^-17 to e^-5, or better still, numbers in base 10, because they are much easier to understand. For example, compare:

“Stochastic is used e to the power 14 times as often as f*cking stochastic.”

“Stochastic is used 10 to the power 6 times as often as f*cking stochastic.”
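For the record, the translation is just a change of base, since e^x = 10^(x / ln 10):

14 / log(10)  # log() is the natural log in R; this is about 6.08, so e^14 is roughly 10^6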


Visualising questionnaires

25th September, 2010

Last week I was shown the results of a workplace happiness questionnaire.  There’s a cut down version of the dataset here, with numbers and wording changed to protect the not-so-innocent.

The plots were ripe for a makeover.  The ones I saw were 2nd hand photocopies, but I’ve tried to recreate their full glory as closely as possible.

A badly styled stacked bar chart, ripe for a makeover

To the creator's credit, they have at least picked the correct plot type: a stacked bar chart is infinitely preferable to a pie chart. That said, there's a lot of work to be done. Most obviously, the pointless 3D effect needs removing, and the colour scheme is badly chosen. Rainbow-style colour schemes that change hues are best suited to unordered categorical variables. If you have some sense of ordering to the variable then a sequential scale is more appropriate. That means keeping the hue fixed and either scaling from light to dark, or from grey to a saturated colour. In this case, we have ordering and also a midpoint – the "neutral" response. That means that we should use a diverging scale (where saturation or brightness increases as you move farther from the mid-point).
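If you want to see which diverging palettes are available, RColorBrewer will draw them all for you (just an aside – the scale itself gets applied with scale_fill_brewer further down):

library(RColorBrewer)
display.brewer.all(type = "div")  # the diverging palettes, such as RdBu and PuOr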

More problematic than these style issues is that it isn’t easy to answer any useful question about the dataset.  To me, the obvious questions are

On balance, are people happy?

Which questions indicate the biggest problems?

Which sections indicate the biggest problems?

All these questions require us to condense the seven points of data for each question down to a single score, so that we can order the questions from most negative to most positive. The simplest, most obvious scoring system is a linear one. We give -3 points for "strongly disagree", -2 for "disagree", through to +3 for "strongly agree". (In this case, all the questions are phrased so that agreeing is a good thing. A well-designed questionnaire should contain a balance of positively and negatively phrased questions to avoid yes-ladder (link is NSFW) type effects. If you have negatively phrased questions, you'll need to reverse the scores. Also notice that each question uses the same multiple-choice scale. If your questions have different numbers of responses, or more than one answer is allowed, then it may be inappropriate to compare the questions.)

Since the scoring system is slightly arbitrary, it is best practice to check your results under a different scoring system. Perhaps you think that the "strongly" responses should be more heavily weighted, in which case a quadratic scoring system would be appropriate. (Replace the weights 1:3 with (1:3)^2/2.) Assume the data is stored in the data frame dfr.

dfr$score.linear <- with(dfr,
   -3 * strongly.disagree - 2 * disagree - slightly.disagree +
   slightly.agree + 2 * agree + 3 * strongly.agree)
dfr$score.quad <- with(dfr,
   -4.5 * strongly.disagree - 2 * disagree - 0.5 * slightly.disagree +
   0.5 * slightly.agree + 2 * agree + 4.5 * strongly.agree)

For the rest of this post, I’ll just present results and code for the linear scoring system. Switch the word “linear” with “quad” to see the alternative results. To get an ordering from “worst” to “best”, we order by -score.linear.

dfr_linear <- within(dfr,
{
   question <- reorder(question, -score.linear)
   section <- reorder(section, -score.linear)
})

To make the data frame suitable for plotting with ggplot, we reshape it from wide to long format.

library(reshape)
w2l <- function(dfr) melt(dfr, measure.vars = colnames(dfr)[4:10])
mdfr_linear <- w2l(dfr_linear)

To answer the first question, we simply take a histogram of the scores and see whether they are mostly above or below zero.

library(ggplot2)
hist_scores_linear <- ggplot(dfr, aes(score.linear)) + geom_histogram(binwidth = 10)

histogram of linear scores

Hmm, not good. Most of the questions had a negative score, implying that the workforce seems unhappy. Now we want to know why they are unhappy. Here are the cleaned up versions of those stacked bar charts again. As well as the style improvements mentioned above, we plot all the questions together, and in order of increasing score (so the problem questions are the first things you read).

bar_all_q_linear <- ggplot(mdfr_linear, aes(question, value, fill = variable)) +
   geom_bar(position = "stack") +
   coord_flip() +
   xlab("") +
   ylab("Number of responses") +
   scale_fill_brewer(type = "div")

bar chart of all questions, linear score

So deaf managers are the biggest issue. Finally, it can be useful to know which of the sections scored badly, to find more general problem areas. First we find the mean score by section.

mean_by_section <- with(dfr_linear, tapply(score.linear, section, mean))
dfr_mean_by_section <- data.frame(
   value = mean_by_section,
   section = names(mean_by_section)
)

Now we visualise these scores as a dotplot.

plot_by_section <- function(p)
{
   p + geom_point(colour = "grey20") +
      geom_point(aes(value, section), data = dfr_mean_by_section) +
      xlab("Score") + ylab("")
}

pt_by_section_linear <- plot_by_section(
   ggplot(mdfr_linear, aes(score.linear, section))
)

dot plot of scores by section, linear score

Here you can see that communication is the biggest problem area.


Pie charts over at Juice

3rd September, 2010

The Juice Analytics blog (incidentally, one of the few corporate blogs worth reading) has a makeover of the Federal IT dashboard, including a discussion of 3D pie charts.  Only 3 dimensions?  Bah!


Sweet bar chart o’ mine

30th August, 2010

Last week I was asked to visualise some heart rate data from an experiment. The experimentees were clothed in protective suits and made to do a bunch of exercises while various physiological parameters were measured. Including “deep body temperature”. Gross. The heart rates were taken every five minutes over the two and a half hour period. Here’s some R code to make fake data for you to play with. The heart rates rise as the workers are made to do exercise, and fall again during the cooling down period, but it’s a fairly noisy series.
interval <- 5
heart_data <- data.frame(
  time = seq.int(0, 150, interval)
)
n_data <- nrow(heart_data)
frac_n_data <- floor(.7 * n_data)
heart_data$rate <- runif(n_data, 50, 80) +
  c(seq.int(0, 50, length.out = frac_n_data),
    seq.int(50, 0, length.out = n_data - frac_n_data)
  )
heart_data$lower <- heart_data$rate - runif(n_data, 10, 30)
heart_data$upper <- heart_data$rate + runif(n_data, 10, 30)

The standard way of displaying a time series (that is, a numeric variable that changes over time) is with a line plot. Here’s the ggplot2 code for such a plot.

library(ggplot2)
plot_base <- ggplot(heart_data, aes(time, rate))
plot_line <- plot_base + geom_line()
plot_line

Line plot of heart rate data

Using a line isn't always appropriate, however. If you have missing data, or the data are irregular or infrequent, then it is misleading to join the points together with a line. Other things are happening during the times that you have no data for. ggplot2 automatically removes lines that have a missing value between them (as represented by NA values), but in the case of irregular or infrequent data you don't want any lines at all. In this case, using points rather than lines is the best option, effectively creating a scatterplot.
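As a quick aside, here's what that NA behaviour looks like if you knock a value out of the fake heart rate data (the gap is invented, not part of the experiment):

heart_gap <- heart_data
heart_gap$rate[10] <- NA                            # pretend one reading went missing
ggplot(heart_gap, aes(time, rate)) + geom_line()    # the line breaks at the missing point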

plot_point <- plot_base + geom_point()
plot_point

Scatter plot of heart rate data

Since heart rate can change dramatically over the course of five minutes, the data generated by the experiment should be considered infrequent, and so I opted for the scatterplot approach.

The experimenters, however, wanted a bar chart.

plot_bar <- plot_base +
  geom_bar(aes(factor(time), rate), alpha = 0.7) +
  opts(axis.text.x = theme_text(size = 8))
plot_bar

Bar chart of heart rate data

I hadn't considered this use of a bar chart before, so it was interesting to think about the pros and cons relative to using points. First up, the bar chart does successfully communicate the numeric values, and the fact that they are discrete. The big difference is that the bars are forced to stretch down to zero, squeezing the data into a small range near the top of the plot. Whether or not you think this is a good thing depends upon the questions you want to answer about the heart rates.

If you want to be able to say “the maximum heart rate was twice as fast as the minimum heart rate”, then bars are great for this. Comparing lengths is what bars are made for. If on the other hand, you want to focus on the relative differences between data (“how much does the heart rate go up by when the subject did some step-ups?”), then points make more sense, since you are zoomed in to the range of the data.

There are a couple of other downsides to using a bar chart. Bars have a much lower data-ink ratio than points. Further, if we want to add a confidence region to the plot, it gets very busy with bars. Compare

plot_point_region <- plot_point +
  geom_segment(
    aes(x = time, xend = time, y = lower, yend = upper),
    size = 2, alpha = .4)
plot_point_region

plot_bar_region <- plot_bar +
  geom_segment(
    aes(x = as.numeric(factor(time)),
        xend = as.numeric(factor(time)),
        y = lower,
        yend = upper),
    size = 2, colour = "grey30")
plot_bar_region

Bar chart of heart rate data, with confidence
Scatter plot of heart rate data, with confidence

The big deal-breaker for me is that a bar chart seems semantically wrong. Bar charts are typically used to visualise a numeric variable split over several categories. This isn't the case here: time is not categorical.

Something about this analysis was bugging me though, and I started wondering “Is it ever appropriate to use bars in a time series?”. Last night, as I was watching Guns ‘N’ Roses headline the Leeds Festival, the answer came to me. GNR were at least an order of magnitude more awesome than expected, but damn, some of those power ballads go on a long time, which allowed my mind to wander. Here’s their set list, with song lengths. (Solos and instrumentals omitted, and I wasn’t standing there with a stopwatch so data are taken from the album versions.)

songs <- c(
  "Chinese Democracy",
  "Welcome To The Jungle",
  "It's So Easy",
  "Mr. Brownstone",
  "Sorry",
  "Live And Let Die",
  "This I Love",
  "Rocket Queen",
  "Street Of Dreams",
  "You Could Be Mine",
  "Sweet Child O' Mine",
  "November Rain",
  "Knockin' On Heaven's Door",
  "Nightrain",
  "Paradise City"
)

albums <- c(
  "Appetite for Destruction",
  "G 'N' R Lies",
  "Use your Illusion I",
  "Use your Illusion II",
  "\"The Spaghetti Incident?\"",
  "Chinese Democracy"
)
)

gnr <- data.frame(
  song = ordered(songs, levels = songs),
  length = c(283, 274, 203, 229, 374, 184, 334, 373, 286, 344, 355, 544, 336, 269, 406),
  album = ordered(albums[c(6, 1, 1, 1, 6, 3, 6, 1, 6, 4, 1, 3, 4, 1, 1)], levels = albums)
)

plot_gnr <- ggplot(gnr, aes(song, length, fill = album)) +
  geom_bar() +
  opts(axis.text.x = theme_text(angle = 90, hjust = 1))
plot_gnr

Guns 'N' Roses set list

Here we have a "categorical time series". The data are ordered in time, but form discrete chunks. As a bonus, the album colouring tells you which tunes have stood the test of time. In this case, the band's debut Appetite for Destruction was played even more than the current miracle-it-arrived-at-all Chinese Democracy. G 'N' R Lies and "The Spaghetti Incident?", by contrast, didn't feature at all.
