Tim Slekar at the Huffington Post tries to give readers a lecture in “stats 101” and ends up making an elementary error himself. Tim writes:
For those that forgot stat 101, correlation does not mean causation. Example: If you take a sample of people involved in automobile accidents on their way to work and ask the sample if they had breakfast and then checked the correlation between eating breakfast and automobile accidents — it would be through the roof. But what does this mean? Nothing from a cause and effect stand point. Eating breakfast does not cause car accidents — period. Remember, correlation does not mean causation. However, this simple statistical rule doesn’t seem to matter to the reformers.
If you took a sample of people involved in accidents and looked at the correlation between eating breakfast and being in an accident, it wouldn’t be “through the roof”; it would be undefined. Let X be a dummy variable equal to 1 if someone ate breakfast and 0 if they didn’t, and let Y be a dummy variable equal to 1 if someone was involved in an accident and 0 if they weren’t. A dataset like the one Tim is talking about might look like this:
X,Y
1,1
0,1
1,1
0,1
0,1
1,1
1,1
1,1
1,1
Notice that all of the Ys are equal to one, because Tim said we are sampling people who have been involved in an accident. The problem is, as they teach you in stats 101, the formula for correlation is cov(X,Y)/(std_dev(X)*std_dev(Y)), where cov is the covariance and std_dev is the standard deviation. The standard deviation of a variable that is always equal to 1 is zero. And so the correlation has a zero in the denominator, which makes it undefined, not “through the roof”.
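For anyone who wants to check the arithmetic, here is a minimal sketch in Python that applies the stats 101 formula to the nine rows listed above, using only the standard library. The function name `pearson` and the choice to return `None` when a standard deviation is zero are assumptions made for illustration, not part of any particular statistics package.

```python
import math

# Data from the table above: X = ate breakfast, Y = was in an accident.
X = [1, 0, 1, 0, 0, 1, 1, 1, 1]
Y = [1, 1, 1, 1, 1, 1, 1, 1, 1]


def pearson(xs, ys):
    """Return cov(X,Y) / (std_dev(X) * std_dev(Y)), or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and standard deviations (dividing by n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # A variable that never varies has zero standard deviation,
    # so the denominator is zero and the ratio is not defined.
    if sd_x == 0 or sd_y == 0:
        return None
    return cov / (sd_x * sd_y)


print(pearson(X, Y))  # None: Y never varies, so the correlation is undefined
```

Returning `None` is just one way to signal "undefined"; a numerical library might instead report a not-a-number value along with a warning.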

20 comments
Saturday ~ June 25th, 2011 at 7:36 pm
Max
Probably at least 1 guy on the roads skipped breakfast
Saturday ~ June 25th, 2011 at 10:00 pm
govt_mule
A little knowledge is a dangerous thing.
Sunday ~ June 26th, 2011 at 1:29 am
ed
Wow, that guy is quite a piece of work. Don’t send your kids to Penn State Altoona, I guess.
Sunday ~ June 26th, 2011 at 7:56 pm
govt_mule
Send your kids to Altoona – profs there apparently understand that the concept of correlation is perfectly valid even if the application of a particular formula gives an odd answer.
As Johnnie Linn points out, the covariance is also 0, so the formal result using Adam’s formula is zero, not “undefined”, and some more sophisticated calculations using limits or variational principles could be used to get a defined result.
But all you really need to do is read to the bottom of the Wikipedia page on the Pearson Correlation Coefficient to find the proper formula to apply: “The reflective correlation is a variant of Pearson’s correlation in which the data are not centered around their mean”. This formula works perfectly fine with Adam’s data and gives a correlation coefficient of 0.79.
Sunday ~ June 26th, 2011 at 8:19 pm
Adam Ozimek
You’re joking, right?
Monday ~ June 27th, 2011 at 7:49 am
govt_mule
Nope
Monday ~ June 27th, 2011 at 9:24 am
govt_mule
To elaborate, I read this blog to learn about economics and its application to social science issues. I’d find almost any conceptual critique of Slekar’s breakfast/accident scenario informative. But saying his example is wrong because plugging numbers into a particular formula gives a wacky answer seems both uninformative and incorrect (in the sense that there are formulae or techniques for getting a number out of the data, even though that number may not be meaningful).
Monday ~ June 27th, 2011 at 9:48 am
Adam Ozimek
Dude, it wasn’t just “some particular formula”, it was the correlation. When someone refers to “the correlation” that is the formula they are talking about unless it’s specifically stated otherwise. The guy said “stats 101”, he did not mean the reflective correlation and for you to act like that’s a possibility is a joke. When someone reports a correlation people don’t ask “hold on a sec, do you mean the normal or uncentered correlation?”. It’s clearly what he intended. When you make ridiculous arguments like this it makes me take the rest of your comments less seriously, because I know you just have some weird need to argue with whatever I say.
Monday ~ June 27th, 2011 at 3:48 pm
govt_mule
Not trying to vex you. I don’t measure correlations in my work, so my knowledge of calculations is limited to what’s on Wikipedia. The article on Correlation states that “Correlation refers to any of a broad class of statistical relationships involving dependence” and goes on to describe Pearson’s formula, the Reflective formula, as well as other methods such as Rank, Distance and Brownian Correlation. My impression was there are many valid ways to measure correlation, and I didn’t expect a formula explicitly described on a Wikipedia page to be so arcane that it would be considered a joke. I stand corrected.
Monday ~ June 27th, 2011 at 8:03 pm
jme2
“When you make ridiculous arguments like this it makes me take the rest of your comments less seriously, because I know you just have some weird need to argue with whatever I say.”
Sorry, but I have to chuckle at this, since it seems likely to me that the author of the post you (Ozimek) are responding to might be thinking exactly the same thought about you!
He certainly messed up the details of his example, but his intent was in the right place, and anyone who understands stats would get what he was trying to say. Worth pointing out the error, I suppose, but hardly worth getting worked up over.
Sunday ~ June 26th, 2011 at 12:38 pm
Cahal Moran
To be fair, it’s easy to trip up when thinking of examples of things like this.
I guess the best one is the king who discovered there was the most disease in the towns with the most doctors and had them all killed (whether or not it is true).
Sunday ~ June 26th, 2011 at 3:26 pm
White Rabbit
btw, aside from the somewhat bad (although still pretty unambiguous) formulation of the example, it’s not even true that “eating breakfast does not cause accidents, period”.
I’m quite sure there’s a fairly significant increase in the risk of being involved in a traffic accident amongst drivers who are eating breakfast *while* driving.
Sunday ~ June 26th, 2011 at 6:03 pm
Johnnie Linn
The zero variance of Y is not the whole story. The covariance of X and Y is also zero. So we have a zero divided by zero. In the case of continuous functions, L’Hopital’s Rule can resolve a zero/zero into a finite number. I don’t know if there is a corresponding rule for discrete distributions. Educators with master’s degrees, even from Penn State Altoona, might not know either.
Sunday ~ June 26th, 2011 at 6:11 pm
Paul
One name for this would be “the importance of denominators”. If all you have are people who had accidents, nothing can be correlated or associated with a variable which doesn’t vary. In this case, all you have is the numerator; to find the correlation you need breakfast eating among the non-accident drivers.
Monday ~ June 27th, 2011 at 1:09 pm
mike
The reason your example gave undefined is you set it up very much incorrectly. If I was looking at coinflips and defined a variable to be only the times I got heads, then I tried to apply a statistic about the distribution of coinflips... would that work? No. Pearson’s correlation coefficient is only DEFINED when you have two random variables that have distributions. It turns out that coinflips do have distributions when you include both heads and tails flips.
In the same way, you set up the problem to fail. The only way this problem would make sense is to define Y to be ALL people who drove that day, with 1 for those who got into accidents and 0, no accidents. The way you defined it is meaningless.
If you now compute correlation the way you do it, you will not get undefined.
Also, the way you set it up, you ignore the correlation of times people ate breakfast and didn’t get in a car accident, you’re making an anti-Bayesian type mistake.
Thank you for proving your own point.
Monday ~ June 27th, 2011 at 2:10 pm
Adam Ozimek
The author at the Huffington Post is the one who set it up to fail by defining the sample as being people who got into car accidents, I’m pointing that out.
Monday ~ June 27th, 2011 at 2:26 pm
Johnnie Linn
It would be nice if we could have a daily census of all drivers as to whether they had breakfast that day, but censuses are costly, so usually we have to make do with samples, and there is difficulty in getting a control sample, after the fact, that matches our sample of drivers in accidents. But we might be able to do more with our sample of drivers in accidents. Select rear-end collisions only. Two drivers in each accident. Ascertain the distribution of breakfasts for each pair of drivers. Find out the correlation between breakfasts and being hit, and breakfasts and not being hit. We can get meaningful results with our sample consisting only of drivers who have been in rear-end collisions.
Monday ~ June 27th, 2011 at 2:28 pm
Johnnie Linn
“Not being hit” means being the driver that hit the other.
Monday ~ June 27th, 2011 at 7:36 pm
Khal Mojo
You don’t even need to *go there*. As Tim Slekar said, the “correlation” would be through the roof. The correlation between eating breakfast and NOT being in an accident would be stratospheric.
Tuesday ~ November 8th, 2011 at 1:52 am
Links for 2011-06-25 « Random Ramblings of Rude Reality
[...] Correlation, Causation, and Confusion on Modeled Behavior [...]