Home > R > Indexing with factors

Indexing with factors

This is a silly problem that bit me again recently. It’s an elementary mistake that I’ve somehow repeatedly failed to learn to avoid in eight years of R coding. Here’s an example to demonstrate.

Suppose we create a data frame with a categorical column, in this case the heights of ten adults along with their gender.

(heights <- data.frame(
  height_cm = c(153, 181, 150, 172, 165, 149, 174, 169, 198, 163),
  gender    = c("female", "male", "female", "male", "male", "female", "female", "male", "male", "female")
))

Using a factory fresh copy of R, the gender column will be assigned a factor with two levels: “female” and then “male”. This is all well and good, though the column can be kept as characters by setting stringsAsFactors = FALSE.

Now suppose that we want to assign a body weight to these people, based upon a gender average.

avg_body_weight_kg <- c(male = 78, female = 63)

Pop quiz: what does this next line of code give us?

avg_body_weight_kg[heights$gender]  

Well, the first value of heights$gender is “female”, so the first value should be 63, and the second value of heights$gender is “male”, so the second value should be 78, and so on. Let’s try it.

avg_body_weight_kg[heights$gender]  
#  male female   male female female   male   male female female   male 
#    78     63     78     63     63     78     78     63     63     78 

Uh-oh, the values are reversed. So what really happened? When you use a factor as an index, R silently converts it to an integer vector. That means that the first index of “female” is converted to 1, giving a value of 78, and so on.

The fundamental problem is that there are two natural interpretations of a factor index – character indexing or integer indexing. Since these can give conflicting results, ideally R would provide a warning when you use a factor index. Until such a change gets implemented, I suggest that best practice is to always explicitly convert factors to integer or to character before you use them in an index.

         
avg_body_weight_kg[as.character(heights$gender)]  
avg_body_weight_kg[as.integer(heights$gender)]
About these ads
  1. causu
    26th December, 2012 at 7:01 am

    Nice!!I dont find this question.

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 217 other followers

%d bloggers like this: