ABSTRACT. Theta Hat is a blog about statistics, launched in February of 2009, and abandoned in April of the same year.
The odds ratio is a statistical measure used to compare the odds of a particular outcome across two groups. It is, as the name implies, a ratio of two odds. Suppose for instance that among smokers the prevalence of a particular disease is 5%, whereas among non-smokers the prevalence is 2.5%. The odds of disease amongst smokers is 0.05/(1-0.05) or approximately 0.053; the odds of disease amongst non-smokers is 0.025/(1-0.025) or approximately 0.026. The odds ratio for disease comparing smokers to non-smokers is the ratio of the two odds: 0.053/0.026, approximately 2.05 (when computed from the unrounded odds).
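The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation; the function name odds is my own choice, and the two prevalences are the ones from the example.

```python
def odds(p):
    """Convert a probability p (with 0 <= p < 1) into odds."""
    return p / (1.0 - p)

# Prevalences from the smoking example in the text
p_smokers = 0.05
p_nonsmokers = 0.025

odds_smokers = odds(p_smokers)
odds_nonsmokers = odds(p_nonsmokers)

# The odds ratio is simply the ratio of the two odds
odds_ratio = odds_smokers / odds_nonsmokers

print(f"odds (smokers):     {odds_smokers:.3f}")
print(f"odds (non-smokers): {odds_nonsmokers:.3f}")
print(f"odds ratio:         {odds_ratio:.2f}")
```

Working from the unrounded odds avoids compounding the rounding error that comes from dividing 0.053 by 0.026 directly.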
The odds ratio is not a very intuitive measure; what does it mean, for instance, to have twice the odds of disease? Unless we spend a lot of time gambling, we are generally more accustomed to thinking about probabilities rather than odds. Thus, from a conceptual perspective, the relative risk is often preferred to the odds ratio. The relative risk, as the name implies, is a comparison of probabilities across two groups: it is the probability for an outcome amongst one group, divided by the probability for the outcome amongst a second group. In the example above, the relative risk for disease comparing smokers to non-smokers is 0.05/0.025, which is precisely 2.
Why does the odds ratio persist in the literature if the relative risk is more intuitive? Practicalities. Mathematically, it is more convenient to model odds than it is to model probabilities. Note that a probability can only take on a value between 0 and 1, so the natural log of a probability ranges only from negative infinity to 0. In comparison, the natural log of an odds can range from negative infinity to positive infinity. Logistic regression, by far the most popular method for modeling dichotomous outcomes, exploits this property: it models the log odds of an outcome as a linear function of the predictors.
Mathematical convenience isn’t the only reason why the odds ratio is so popular: there are also practical constraints. In certain circumstances, the relative risk simply cannot be appropriately calculated. A common example is when the data are sampled separately by outcome: in case-control studies, for instance, diseased individuals are sampled from a registry of cases and controls are sampled from the general population of non-diseased individuals. When the data are sampled in such a manner, the relative risk computed from the sample is simply not an appropriate estimate of the relative risk in the population. The reason is intuitive: if the sampling scheme depends on the outcome, the probability of the outcome cannot be appropriately estimated from the sample.
How are case-control data analyzed? It turns out that the odds ratio for an outcome is equal to the odds ratio for the predictor (this is a mathematical fact, and is always true: in a two-by-two table of exposure by outcome, both reduce to the same ratio of cross-products). The odds ratio for exposure can be appropriately estimated from a case-control study (the sampling scheme does not depend on the predictor), and thus the odds ratio for disease can be appropriately estimated from a case-control study. Thus, analysis of case-control studies almost invariably involves the odds ratio.
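To make the point about ranges concrete, here is a small Python sketch comparing the log of a probability with the log odds (the logit). The function names logit and inv_logit are my own, and the probabilities are arbitrary illustrative values, not data from any study.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map any real number back to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# log(p) is never positive, whereas the log odds takes both signs
for p in (0.001, 0.025, 0.05, 0.5, 0.95, 0.999):
    print(f"p = {p:<6} log(p) = {math.log(p):8.3f}   log odds = {logit(p):8.3f}")

# A linear predictor can be any real number; the inverse logit
# always turns it back into a valid probability
for x in (-10.0, 0.0, 10.0):
    print(f"linear predictor = {x:6.1f} -> probability = {inv_logit(x):.5f}")
```

Because the log odds is unbounded in both directions, a linear function of the predictors can never produce an impossible value on that scale.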
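A rough Python sketch of why this works, using expected counts rather than a random sample. The population size (100,000 smokers and 100,000 non-smokers) and the sampling fractions (20% of cases, 1% of non-cases) are hypothetical numbers chosen for illustration; the 5% and 2.5% prevalences are the ones from the smoking example.

```python
def or_for_disease(a, b, c, d):
    """Odds of disease among the exposed divided by odds among the unexposed.

    a: exposed cases,   b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    return (a / b) / (c / d)

def or_for_exposure(a, b, c, d):
    """Odds of exposure among cases divided by odds of exposure among non-cases."""
    return (a / c) / (b / d)

def naive_relative_risk(a, b, c, d):
    """Relative risk computed directly from the table, ignoring how it was sampled."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical population: 5% of 100,000 smokers and 2.5% of 100,000 non-smokers diseased
population = (5000, 95000, 2500, 97500)

# Hypothetical case-control design: sample 20% of cases and 1% of non-cases.
# Expected counts are used so that the result is not obscured by sampling noise.
case_fraction, control_fraction = 0.20, 0.01
a, b, c, d = population
sample = (a * case_fraction, b * control_fraction,
          c * case_fraction, d * control_fraction)

for label, table in (("population", population), ("case-control sample", sample)):
    print(label)
    print(f"  odds ratio for disease:  {or_for_disease(*table):.3f}")
    print(f"  odds ratio for exposure: {or_for_exposure(*table):.3f}")
    print(f"  naive relative risk:     {naive_relative_risk(*table):.3f}")
```

Both odds ratios equal (a*d)/(b*c), so the sampling fractions cancel; the naive relative risk has no such cancellation, which is why it changes when cases are oversampled.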
But: the odds ratio is not intuitive! Fortunately, in situations when the outcome is rare, the odds ratio approximates the relative risk; this is a mathematical nicety owing to the fact that the odds of an outcome is roughly equal to the probability for the outcome if the probability is small. Thus, when the outcome is rare, and the analysis explicitly involves the odds ratio, the odds ratio is often presented as an approximation for the relative risk. Note for instance that in the example provided above, the prevalence of the disease is fairly low, and thus the odds ratio is fairly close to the relative risk (2.05 versus 2).
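How quickly the approximation degrades can be sketched in Python by holding the relative risk fixed at 2 and letting the outcome become more common. The baseline risks below are arbitrary illustrative values (0.025 corresponds to the smoking example), and the function name is my own.

```python
def odds_ratio_from_risks(p_exposed, p_unexposed):
    """Odds ratio implied by the outcome probabilities in two groups."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed

RELATIVE_RISK = 2.0  # held fixed throughout

# Hypothetical baseline risks, from very rare to common
for p_unexposed in (0.001, 0.025, 0.10, 0.25, 0.40):
    p_exposed = RELATIVE_RISK * p_unexposed
    odds_ratio = odds_ratio_from_risks(p_exposed, p_unexposed)
    print(f"baseline risk = {p_unexposed:5.3f}   relative risk = {RELATIVE_RISK:.2f}"
          f"   odds ratio = {odds_ratio:.2f}")
```

The relative risk is the same in every row; only the odds ratio moves, and it moves further from 2 as the baseline risk grows.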
When is the odds ratio not a good approximation for the relative risk? When the outcome is not rare. Many situations arise when the outcome is not sufficiently rare to use the odds ratio to approximate the relative risk. For instance, in a study examining risk factors for unprotected sex amongst groups with a high risk for STDs, the outcome (unprotected sex) is probably reasonably common.
Unfortunately, it is not actually very difficult to find examples, even in generally esteemed journals and publications, in which the odds ratio is inappropriately presented as a relative risk. For example, in a 2004 article in CHANCE (a general-interest statistical magazine published by the American Statistical Association) regarding field goals in (American) football, odds ratios are explicitly labeled as relative risks, even though the outcome (hitting field goals) was clearly not sufficiently rare to allow for such an approximation. Worse, a couple of passages in the text imply a comparison of probabilities, and not in fact a comparison of odds. The author writes, for instance, that in cloudy weather there is “an estimated 20.2% increase in the probability of success on each kick.” This is simply not true. There is an estimated 20.2% increase in odds, not a 20.2% increase in risk.
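To see how different the two statements are, the following Python sketch applies a 20.2% increase in odds (an odds ratio of 1.202) to a few baseline probabilities of success. The baseline probabilities are hypothetical, since the success rates from the article are not reproduced here, and the function name is my own.

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Probability that results from multiplying the baseline odds by odds_ratio."""
    odds_baseline = p_baseline / (1.0 - p_baseline)
    odds_new = odds_baseline * odds_ratio
    return odds_new / (1.0 + odds_new)

ODDS_RATIO = 1.202  # a 20.2% increase in the odds of success

# Hypothetical baseline probabilities of a successful kick
for p_baseline in (0.50, 0.70, 0.80, 0.90):
    p_new = apply_odds_ratio(p_baseline, ODDS_RATIO)
    relative_increase = (p_new / p_baseline - 1.0) * 100.0
    print(f"baseline = {p_baseline:.2f} -> new probability = {p_new:.3f}"
          f"   (relative increase in probability: {relative_increase:.1f}%)")
```

Under these assumed baselines the relative increase in probability is always smaller than 20.2%, and the gap widens as success becomes more common.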
Such mistakes can have unfortunate consequences. Interesting scientific research is often picked up by the popular media, and any odds ratio inappropriately presented as a relative risk in the scientific article is likely to be presented in the media as a relative risk (I’ve come across examples where this has occurred). Headline: “people who do A are 2.2 times more likely to have B.” Maybe… but if 2.2 was actually an odds ratio, then no.
Moral of the story: be careful when presenting or reading about odds ratios or relative risks; if the odds ratio is being used to approximate the relative risk, be sure that the outcome is sufficiently rare to ensure that the approximation is appropriate.
Posted: March 6, 2009 by Y-H.
Tags: case-control studies, odds ratios, relative risks.
As hinted in my post about R, I’m a big fan of libre software. And, as suggested in that post, I think that there would be many advantages to borrowing concepts from the libre-software movement and conducting statistical analysis in an open and transparent fashion. I had planned on posting a short draft of principles to adhere to when conducting “open analysis,” and was surprised but pleased to come across a post at Dataspora’s blog from July of last year which pretty much hit all the suggestions I wanted to make. I don’t really have much to add to the post except to say that I think the principles can and should be extended to all statistical analysis, and not just data visualization.
I will attempt, as much as possible, to adhere to principles of open analysis when preparing analysis for this site. This means (1) including the data, (2) including the source code, and (3) using software that is itself open.
As you may already be aware, the New York Times ran a story last month about the statistical software program (and programming language) R; they also posted a follow-up note from the author on the Times’ technology blog. I thought the Times article did a reasonably good job summarizing what R is and how it’s being used, but I thought this would be an appropriate time to share some of my own thoughts about the program.
I’ve been using R since around 2001 or 2002 and I use it almost exclusively. R is by no means always the most satisfying of programs to work with — for example, there is definitely a bit of a learning curve — but its advantages, in my mind, far outweigh its disadvantages. Some of the reasons why I like R:
R can be operated directly from a command prompt. This is really convenient, as it means that you don’t have to create a file of code simply to perform basic operations (although you can and, for the sake of convenience, probably should use an external file for code if you expect to be using a lot of code for your project).
Running a statistical software program in an interactive manner can also lend itself to “good” ways of statistical thought. Statistical analysis is often performed in an algorithmic fashion; yet, depending on the goals of the project, carrying out methods in a prescribed manner can be an inappropriate way to conduct analysis (e.g., if your goal is to confirm some hypothesized relationship between two variables but you want to control for potential confounders, algorithmic methods of model building can actually be dangerous). Taking the time to think and understand the data is almost always useful (though: performing analysis in an interactive manner can also be dangerous, but more about this at some later point in time).
R is free of charge. This is a pretty big deal considering the cost of most commercial statistical programs. The base package for SPSS, for instance, retails for a couple thousand dollars, and if you want additional features (say, something actually fairly routine like logistic regression), you have to purchase additional modules.
You can do almost whatever you like with R; you can look at and modify the source code and you can share the software with your friends. Having access to the source code may seem to have few practical advantages, but openness is transparency, and transparency is a pretty good thing to have in research. Indeed, it is puzzling to me that people question the “validity” of R when in fact one of the great things about R is that you can actually look at the source code to see how a particular method is being implemented (and see if it is being implemented correctly). The same simply cannot be said of proprietary systems like SAS or Stata.
The freedom to examine and modify R’s code makes R not only transparent, but also highly customizable. Pretty much any function and capability in R can be changed — and advanced programming skills aren’t always required (some functions, for instance, can be easily modified with knowledge of the R language alone). Situations do arise when customization is useful, and I like that R allows for modifications.
Numerous add-on packages are available for R, representing a diverse array of fields and methods. The availability of these packages makes R powerful and rich in features. There are few things you can’t do in R and there are actually some things you can do in multiple ways: if you don’t like R’s default graphics system, for instance, you can try lattice or ggplot2.
There are also relatively minor things I like about R. Most statistical packages, for instance, require files (e.g., data) to be referenced via absolute paths; R appears unique in that it recognizes relative paths. Thus, supposing a file is in the working directory, moving the directory does not affect the way in which the file needs to be referenced from R (in comparison, in programs in which absolute paths are required, the call to the data must always change if the directory containing the file is moved).
Posted: February 4, 2009 by Y-H.
Tags: R, SAS, SPSS, Stata, statistical computing.