R hits 10000 questions on stackoverflow
A milestone, though not that exciting as questions go. Still, if you haven’t yet joined the cult of Stack Exchange, take a look here.
Viewing the internals of MATLAB Matrices
A cool undocumented trick I just learnt from The MathWorks’ Bob Gilmore. If you type
format debug
Then printing any vector reveals information about its internal representation. For example:
x = magic(3)
x =
Structure address = 6bc1ab0
m = 3
n = 3
pr = d8dccf0
pi = 0
8 1 6
3 5 7
4 9 2
The structure address is the address in memory where the matrix is stored, m and n are the number of rows and columns respectively of the matrix, and pr and pi are pointers to the addresses of the matrices storing the real and imaginary components of the matrix.
One interesting thing to look at is the representation of scalar numbers.
y = 1
y =
Structure address = 6bc31e0
m = 1
n = 1
pr = d790b90
pi = 0
1
Yep: they are stored in exactly the same way as matrices: in the same way the “everything in R is a vector”, everything in MATLAB is a matrix. To finish up, here are some more examples for you to explore:
% higher dimensional arrays
rand(2, 3, 4)
% cell arrays (unfortunately not that revealing)
{1, magic(3)}
% sparse matrices (very interesting)
sparse(ones(3))
Exploring the functions in a package
Sometimes it can be useful to list all the functions inside a package. This is done in the same way that you would list variables in your workspace. That is, using ls. The syntax is ls(pos = "package:packagename"), which is easy enough if you can remember it. Unfortunately, I never can, and have to type search() first to see what the format of that string is.
Today, that problem is solved with a tiny utility function to save remembering things, and to save typing.
lsp <- function(package, all.names = FALSE, pattern)
{
package <- deparse(substitute(package))
ls(
pos = paste("package", package, sep = ":"),
all.names = all.names,
pattern = pattern
)
}
all.names and pattern behave in the same way as they do in regular ls. You use it like this:
lsp(base) lsp(base, TRUE) lsp(base, pattern = "^is")
EDIT: I’ve had a couple of questions about the use case, and there are some interesting comments on alternatives. My thinking behind this function was that I sometimes know I’ve seen a function in a package but can’t remember what it’s called. If you can hazard a guess at the name, then apropos is probably better, though it looks everywhere on the search path rather than in a particular package. Autocompletion is also useful for this, but you need to know the first few characters of what you are looking for. (Activate autocompletions by pressing TAB in R GUI or Rstudio or CTRL+space in eclipse. I can’t remember what the shortcut is in emacs, but you probably just mash CTRL+META until you have RSI.) Finally, the unknownR package is useful for finding new functions that you hadn’t heard of yet.
A quick primer on split-apply-combine problems
I’ve just answered my hundred billionth question on Stack Overflow that goes something like
I want to calculate some statistic for lots of different groups.
Although these questions provide a steady stream of easy points, its such a common and basic data analysis concept that I thought it would be useful to have a document to refer people to.
First off, you need to data in the right format. The canonical form in R is a data frame with one column containing the values to calculate a statistic for and another column containing the group to which that value belongs. A good example is the InsectSprays dataset, built into R.
head(InsectSprays) count spray 1 10 A 2 7 A 3 20 A 4 14 A 5 14 A 6 12 A
These problems are widely known as split-apply-combine problems after the three steps involved in their solution. Let’s go through it step by step.
First, we split the count column by the spray column.
(count_by_spray <- with(InsectSprays, split(count, spray)))
Secondly, we apply the statistic to each element of the list. Lets use the mean here.
(mean_by_spray <- lapply(count_by_spray, mean))
Finally, (if possible) we recombine the list as a vector.
unlist(mean_by_spray)
This procedure is such a common thing that there are many functions to speed up the process. sapply and vapply do the last two steps together.
sapply(count_by_spray, mean) vapply(count_by_spray, mean, numeric(1))
We can do even better than that however. tapply, aggregate and by all provide a one-function solution to these S-A-C problems.
with(InsectSprays, tapply(count, spray, mean)) with(InsectSprays, by(count, spray, mean)) aggregate(count ~ spray, InsectSprays, mean)
The plyr package also provides several solutions, with a choice of output format. ddply takes a data frame and returned another data frame, which is what you’ll want most of the time. dlply takes a data frame and returns the uncombined list, which is useful if you want to do another processing step before combining.
ddply(InsectSprays, .(spray), summarise, mean.count = mean(count)) dlply(InsectSprays, .(spray), summarise, mean.count = mean(count))
You can read much more on this type of problem and the plyr solution in The Split-Apply-Combine Strategy for Data Analysis, in the Journal of Statistical Software, by the ubiquitous Hadley Wickham.
One tiny variation on the problem is when you want the output statistic vector to have the same length as the original input vectors. For this, there is the ave function (which provides mean as the default function).
with(InsectSprays, ave(count, spray))
MATLAB’s stand out new feature
It’s been a while since my last MATLAB post, not because I don’t love the language, but more because I do most of my blogging from home, where I have no license, and because (mostly thanks to R-bloggers) I get ten times as many page views for the R posts. (TODO: Create MATLAB-bloggers service.)
Having returned from holiday (it was lovely, thanks for asking) I’ve been trying out the latest release of MATLAB – R2011b. So far, the standout new feature is the automatic variable renaming. If you change the name of a variable at the point where it was declared, then pressing Shift+Enter lets MATLAB rename all other instances. IDEs for statically-typed languages have had this feature for years, but to see it in a dynamically-typed language is very impressive.
A Great European Bailout Infographic
Whenever there’s a financial crisis, I tend to assume that it’s Dirk Eddelbuettel‘s fault, though apparently the EMU debt crisis is more complicated than that. JP Morgan have just released a white paper about the problem, including a Lego infographic of who is asking who for money. Created, apparently, by Peter Cembalest, aged nine. Impressive stuff.
Found via The Register.
Interactive graphics for data analysis
I got a copy of Martin Theus and Simon Urbanek’s Interactive Graphics for Data Analysis a couple of years ago, whence it’s been sat on my bookshelf. Since I’ve recently become a self-proclaimed expert on interactive graphics I thought it was about time I read the thing. Which is exactly what I did last weekend at the Leeds Festival (in between rocking out).
It’s a book of two halves, and despite the title the interactivity isn’t really the focus. The book is actually a guide on how to do exploratory data analysis. The first half of the book works like an advanced chart chooser, explaining which plots are useful for which types of data, and what types of interactivity they can benefit from. For me, it was worth it for the many rare plots, like spineplots and interaction plots and mosaic plots and fluctuation diagrams. If you’re bored of barcharts, this is a great way to expand your graphical vocabulary. The second half of the book consists entirely of case studies, where you can practice a workflow for exploring data, which is something that’s always worthwhile doing.
The really big takeaway that I got is that exploratory graphics have different priorities to publication graphics. When you are in the courting stage with a dataset, just getting to know each other, you don’t really care so much about whether the greek letters in your axis label are formatted correctly or whether the shade of pink in your dots is quite right. All you really need is to be able to generate lots and lots of plots quickly, and to be able to see the relationships between them.
It is this last point that the authors claim interactivity is most useful for. Perhaps the canonical example of this is clicking a bar on a histogram or barchart, and having corresponding points on a scatterplot highlighted. To demonstrate this, here’s an example using Simon’s Acinonyx package (shortly to be renamed ix for “iplots Extreme”). Acinonyx isn’t yet available on CRAN, see its home page
for installation details.
library(Acinonyx) library(MASS) data(Cars93) interactive_scatter <- with(Cars93, iplot(Horsepower, MPG.city)) interactive_histo <- with(Cars93, ihist(EngineSize))
Click a bar in the histogram and the the corresponding points in the scatterplot are highlighted. Likewise, drag to select points in the scatterplot and fractions of the histgram are highlighted.
The equivalent static version would be to use trellising and draw each possible graph combination. Splitting a scatterplot into different groups depending upon bars of a histogram works something like this:
library(ggplot2) Cars93$EngineSizeGroup <- cut(Cars93$EngineSize, 11) (static_trellis_scatter <- ggplot(Cars93, aes(Horsepower, MPG.city)) + geom_point() + facet_wrap(~ EngineSizeGroup) )
(We don’t actually need to bother with the histograms, since they are a little boring.) The reverse operation – going from a selected region of scatterplot to a higlighted region of bar chart is also possible, but trickier. In this case, we do need both graphs.
Cars93 <- within(Cars93,
{
selected <- ifelse(
Horsepower < 200 & MPG.city > 20 & MPG.city < 30,
"selected",
"unselected"
)
})
(static_scatter_with_highlight <-
ggplot(Cars93, aes(Horsepower, MPG.city, colour = selected)) +
geom_point()
)
(static_histo_with_highlight <-
ggplot(Cars93, aes(EngineSizeGroup, fill = selected)) +
geom_histogram() +
opts(axis.text.x = theme_text(angle = 30, hjust = 1, vjust = 1))
)
My conclusion from reading the book, and from my initial experimentation with Acinonyx is that anything you can do interactively is also possible by drawing many static graphs, but the interaction can let you see things quicker.
Nomograms everywhere!
At useR!, Jonty Rougier talked about nomograms, a once popular visualisation that has fallen by the wayside with the rise of computers. I’d seen a few before, but hadn’t understood how they worked or why you’d want to use them. Anyway, since that talk I’ve been digging around in biology books from the 60s and 70s, and it seems they are full of them. So for those of you who haven’t seen the talk, here’s how they work.
A basic nomogram consists of three scales. By reading off known values from two of the scales, you can estimate a third one. Here’s an example I found in the ICRP‘s reference manual.
It’s difficult to measure people’s skin surface area, but height and bodyweight are very straightforward. To use the nomogram, you place a ruler (or other straight edge) on the height* and weight scales and read of the point where the ruler crosses the surface area scale. I’m 177cm tall and weigh 72kg, so according to this, my estimated skin surface area is 1.89m2.
Of course nowadays, the standard way to solve this sort of problem is to write a function. Jonty suggested that the main modern use of nomograms is in fieldwork situations, where computers aren’t handily available. (His case study was Kenyan vets trying to estimate the weight of donkeys form there height and girth.)
Altman and Dittmer’s Respiration and Circulation has many more pretty nomograms. I was particularly impressed by those on blood pH, reproduced below for your pleasure.
Your homework is to dig out a pre-1980 textbook and hunt for more nomograms.
*Gruesomely, the fact that the scale is labelled “length” rather than “height” makes me suspect that the bodies that provided the data were in a permanent lying down position – that is, they were corpses.







