
Visualising questionnaires

Last week I was shown the results of a workplace happiness questionnaire. There’s a cut-down version of the dataset here, with numbers and wording changed to protect the not-so-innocent.

The plots were ripe for a makeover. The ones I saw were second-hand photocopies, but I’ve tried to recreate their full glory as closely as possible.

A badly styled stacked barchart, ripe for a makeover

To the creator’s credit, they have at least picked the correct plot type: a stacked bar chart is infinitely preferable to a pie chart. That said, there’s a lot of work to be done. Most obviously, the pointless 3D effect needs removing, and the colour scheme is badly chosen. Rainbow-style colour schemes that change hue are best suited to unordered categorical variables. If the variable has some sense of ordering, then a sequential scale is more appropriate. That means keeping the hue fixed and either scaling from light to dark, or from grey to a saturated colour. In this case, we have ordering and also a midpoint: the “neutral” response. That means that we should use a diverging scale (where saturation or brightness increases as you move farther from the midpoint).
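As a rough illustration of a diverging scale, the sketch below builds seven colours in base R, one per response level, running from a saturated red through a pale grey midpoint to a saturated blue. The three anchor colours are assumptions picked for the example, not the palette used in the plots later in this post.

```r
# Anchor colours are illustrative assumptions:
# negative end, neutral midpoint, positive end.
anchors <- c("#B2182B", "#F7F7F7", "#2166AC")

# colorRampPalette() (grDevices, shipped with base R) interpolates
# linearly between the anchors and returns a palette function.
diverging <- colorRampPalette(anchors)(7)

# One colour per response, "strongly disagree" through "strongly agree".
# The fourth colour is the pale midpoint used for the neutral response.
print(diverging)
```

The hue changes only at the midpoint, so the reader can tell the direction of a response from the hue and its strength from the saturation.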

More problematic than these style issues is that it isn’t easy to answer any useful question about the dataset. To me, the obvious questions are:

On balance, are people happy?

Which questions indicate the biggest problems?

Which sections indicate the biggest problems?

All these questions require us to condense the seven response counts for each question down to a single score, so that we can order the questions from most negative to most positive. The simplest, most obvious scoring system is a linear one. We give -3 points for “strongly disagree”, -2 for “disagree”, through to +3 for “strongly agree”. (In this case, all the questions are phrased so that agreeing is a good thing. A well-designed questionnaire should contain a balance of positively and negatively phrased questions to avoid yes ladder (link is NSFW) type effects. If you have negatively phrased questions, you’ll need to reverse the scores. Also notice that each question uses the same multiple-choice scale. If your questions have different numbers of responses, or more than one answer is allowed, then it may be inappropriate to compare the questions.)
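Reversing the scores for a negatively phrased question amounts to flipping the weights. Here is a minimal sketch; the response counts are made up for illustration and do not come from the dataset.

```r
# Linear weights, "strongly disagree" through "strongly agree".
weights <- -3:3

# Hypothetical response counts for one question, in the same order.
counts <- c(10, 8, 5, 4, 3, 2, 1)

# Score for a positively phrased question (agreeing is good).
score_as_asked <- sum(weights * counts)

# Score if the same question were negatively phrased (agreeing is bad):
# reverse the weights so that agreement counts against.
score_reversed <- sum(rev(weights) * counts)

print(c(as_asked = score_as_asked, reversed = score_reversed))
```

Because the weights are symmetric about zero, the reversed score is simply the negative of the original one.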

Since the scoring system is slightly arbitrary, it is best practice to check your results under a different scoring system. Perhaps you think that the “strongly” responses should be more heavily weighted, in which case a quadratic scoring system would be appropriate. (Replace the weights 1:3 with (1:3)^2/2.) Assume the data is stored in the data frame dfr.

# linear weights: -3, -2, -1, (0 for neutral), 1, 2, 3
dfr$score.linear <- with(dfr,
   -3 * strongly.disagree - 2 * disagree - slightly.disagree +
   slightly.agree + 2 * agree + 3 * strongly.agree)
# quadratic weights: (1:3)^2 / 2 gives 0.5, 2, 4.5
dfr$score.quad <- with(dfr,
   -4.5 * strongly.disagree - 2 * disagree - 0.5 * slightly.disagree +
   0.5 * slightly.agree + 2 * agree + 4.5 * strongly.agree)
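The coefficients above can also be generated rather than typed out, which makes it easier to try other weightings. The following is only a sketch on a toy data frame: the counts are invented, and the column name neutral is an assumption, since the scoring code never needs to name the zero-weight column.

```r
# Response columns in order; "neutral" is an assumed column name.
levels_7 <- c("strongly.disagree", "disagree", "slightly.disagree",
              "neutral",
              "slightly.agree", "agree", "strongly.agree")

w_linear <- setNames(-3:3, levels_7)
# Keep the sign, square the magnitude, halve: -4.5, -2, -0.5, 0, 0.5, 2, 4.5
w_quad <- sign(w_linear) * w_linear^2 / 2

# Toy counts for two questions (hypothetical data).
toy <- data.frame(
  strongly.disagree = c(5, 0), disagree = c(3, 1),
  slightly.disagree = c(2, 1), neutral = c(4, 2),
  slightly.agree = c(1, 3), agree = c(0, 4), strongly.agree = c(0, 6)
)

# Each score is a weighted sum of the counts, i.e. a matrix product.
counts <- as.matrix(toy[levels_7])
toy$score.linear <- as.vector(counts %*% w_linear)
toy$score.quad <- as.vector(counts %*% w_quad)

print(toy[c("score.linear", "score.quad")])
```

If the ordering of questions changes a lot between the two columns, that is a hint the conclusions depend on the choice of weights.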

For the rest of this post, I’ll just present results and code for the linear scoring system. Switch the word “linear” with “quad” to see the alternative results. To get an ordering from “worst” to “best”, we order by -score.linear.

dfr_linear <- within(dfr, {
   question <- reorder(question, -score.linear)
   section <- reorder(section, -score.linear)
})
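To see what reorder does to the factor levels, here is a small sketch on made-up data (the question labels and scores are hypothetical). reorder sorts the levels by increasing value of its second argument, so ordering by the negated score puts the most positive question first in the level order.

```r
# Toy data: three questions with hypothetical linear scores.
toy <- data.frame(
  question = factor(c("pay", "hours", "managers")),
  score.linear = c(5, 0, -10)
)

# Default level order is alphabetical: hours, managers, pay.
print(levels(toy$question))

# Reorder the levels by increasing -score.linear.
toy$question <- reorder(toy$question, -toy$score.linear)

# Now the most positive question is the first level
# and the most negative is the last.
print(levels(toy$question))
```

When the axis is flipped for a horizontal bar chart, the first level is drawn nearest the origin at the bottom, so the last level, the most negative question, ends up at the top of the plot.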

To make the data frame suitable for plotting with ggplot, we reshape it from wide to long format.

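library(reshape2)  # provides melt(); the older reshape package has it too
# columns 4 to 10 hold the seven response counts
w2l <- function(dfr) melt(dfr, measure.vars = colnames(dfr)[4:10])
mdfr_linear <- w2l(dfr_linear)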
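If the wide-to-long step is unfamiliar, this base R sketch builds by hand the same shape of result on a toy data frame. It is not how melt is implemented, and the data and column names are invented for the example. Each response column becomes a block of rows, with the column name stored in variable and the count in value.

```r
# Toy wide data: one row per question, one column per response (hypothetical).
wide <- data.frame(
  question = c("Q1", "Q2"),
  disagree = c(4, 1),
  neutral = c(2, 2),
  agree = c(1, 5),
  stringsAsFactors = FALSE
)
measure <- c("disagree", "neutral", "agree")

# Long format: one row per question/response pair.
long <- data.frame(
  question = rep(wide$question, times = length(measure)),
  variable = factor(rep(measure, each = nrow(wide)), levels = measure),
  value = unlist(wide[measure], use.names = FALSE)
)

print(long)
```

Keeping variable as a factor with the levels in questionnaire order matters later, because the fill colours in a stacked bar chart follow the level order.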

To answer the first question, we simply plot a histogram of the scores and see whether they are mostly above or below zero.

library(ggplot2)
hist_scores_linear <- ggplot(dfr, aes(score.linear)) + geom_histogram(binwidth = 10)

histogram of linear scores
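A numeric summary can back up what the histogram shows. This is a sketch on made-up scores; with the real data you would use dfr$score.linear in place of the hypothetical scores vector.

```r
# Hypothetical per-question scores, standing in for dfr$score.linear.
scores <- c(-40, -25, -12, -3, 8, 15)

n_negative <- sum(scores < 0)       # how many questions scored below zero
share_negative <- mean(scores < 0)  # the same as a proportion
median_score <- median(scores)      # a robust "on balance" figure

print(c(n_negative = n_negative,
        share_negative = share_negative,
        median_score = median_score))
```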

Hmm, not good. Most of the questions had a negative score, implying that the workforce seems unhappy. Now we want to know why they are unhappy. Here are the cleaned up versions of those stacked bar charts again. As well as the style improvements mentioned above, we plot all the questions together, and in order of increasing score (so the problem questions are the first things you read).

bar_all_q_linear <- ggplot(mdfr_linear, aes(question, value, fill = variable)) +
   # stat = "identity" plots the counts in value directly instead of counting rows
   geom_bar(stat = "identity", position = "stack") +
   coord_flip() +
   xlab("") +
   ylab("Number of responses") +
   scale_fill_brewer(type = "div")

bar chart of all questions, linear score

So deaf managers are the biggest issue. Finally, it can be useful to know which of the sections scored badly, to find more general problem areas. First we find the mean score by section.

mean_by_section <- with(dfr_linear, tapply(score.linear, section, mean))
dfr_mean_by_section <- data.frame(
   value = mean_by_section,
   section = names(mean_by_section)
)
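For readers who have not met tapply before, here is the same calculation on a toy data frame (section names and scores are hypothetical). tapply splits the scores by section, applies mean to each group, and returns a named array.

```r
# Toy data: two sections with two questions each (hypothetical scores).
toy <- data.frame(
  section = c("communication", "communication", "pay", "pay"),
  score.linear = c(-30, -10, 5, 15)
)

# Mean score per section; the result is named by section.
toy_means <- with(toy, tapply(score.linear, section, mean))
print(toy_means)

# Wrap the result in a data frame so it can be passed to a plotting layer.
toy_mean_df <- data.frame(
  value = as.vector(toy_means),
  section = names(toy_means)
)
print(toy_mean_df)
```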

Now we visualise these scores as a dotplot.

plot_by_section <- function(p) {
   p + geom_point(colour = "grey20") +
      # overlay the mean score of each section
      geom_point(aes(value, section),
         data = dfr_mean_by_section) +
      xlab("Score") + ylab("")
}

pt_by_section_linear <- plot_by_section(
   ggplot(mdfr_linear, aes(score.linear, section))
)

dot plot of scores by section, linear score

Here you can see that communication is the biggest problem area.

  1. Emil
    26th September, 2010 at 8:28 am

    Nice! I’m about to analyse my first questionnaire in R, and this is just what I need. Thanks!

  2. 26th September, 2010 at 9:37 am

    Thanks for the ggplot2 demonstration. I’d change the x-axis to percentage of respondents. I think the fact that there are different numbers of respondents to each question makes it difficult to compare the results between questions. It would also be useful to have the score for each item displayed on the graph. Managers often like to think of their performance on an item in terms of a single number.

    • 27th September, 2010 at 15:03 pm

      Good point, although an advantage of leaving the scale as counts is that it might indicate whether nonresponse is correlated to the question. For example, if few people respond to questions about management, it might indicate that the employees fear reprisal.

      Another way to normalize the categories is to plot the data as a mosaic plot. Then the bar heights (horizontal direction) represent percentages of responses within a question and the widths of each bar (vertical direction) indicate the relative response rate.

      • 27th September, 2010 at 17:54 pm

        Interesting idea with the mosaic plots. There are some good (lattice-based) functions in the vcd package for drawing those, and other variations like association plots.

        I’ve been contemplating the counts vs proportions thing some more, and my natural response is “draw both and see what you learn from each”. After all, graphs are cheap. I don’t think the fact that the positive responses don’t line up is a huge issue in this case. Management only tend to run these questionnaires when they think they have problems, so the negative responses are likely to be the bits they worry about the most (and they are aligned).

  3. Ruben
    2nd October, 2010 at 15:52 pm

    Hi there,
    this is a great post but I’m wondering if it would be possible to learn how to get the data into a data frame object in R since I’m just a beginner.
    I managed to download the google spreadsheet as a csv file but now I’m at a loss as to how to load it to R properly. I tried read.csv but the results are not good.
    Many thanks in advance,

    • 3rd October, 2010 at 12:48 pm

      You are on the right track. Navigate to the directory where you stored the CSV file, then type

      dfr <- read.csv("sample%20questionnaire.csv")

  4. Ruben
    5th October, 2010 at 17:04 pm

    Thanks a lot for the suggestion.
    I’ll try to replicate the results then.

    • 6th October, 2010 at 17:39 pm

      It just occurred to me that Google docs might be trying to do something clever with the format and giving you a european-style CSV file (depending upon your location/OS locale settings). Open the file in a text editor, and see if the values are separated by semi-colons instead of commas. If so, you need read.csv2 instead of read.csv.

  5. Ruben
    8th October, 2010 at 11:20 am

    Hi there,
    thanks for the update. After checking, I saw that the file was using commas so I followed your suggestion and I was able to replicate your example.
    Thanks a lot for the posting.
    Looking forward to keep on reading more interesting postings!
