My previous post, though not really intended to be focused on p values, led to a long and interesting discussion on the issue by Revere.

Revere commented that it wasn't so clear whether I was writing as a frequentist or a Bayesian. In reality I'm a consumer of biostatistics, not a statistician. For me, the issue is pretty simple: I want to know whether a given result I'm seeing is true.

Although I wrote about this in my last post, I didn't frame it in exactly this way: the problem with p values is that I want to know how likely it is that the result I'm seeing is correct, while the p value tells me how unlikely it is that I would be seeing a result at least this extreme if the null hypothesis were true. This is similar to the problem with knowing the sensitivity and specificity of a test, when what you really want to know is the positive predictive value (how likely is it that the patient has the disease now that the test has come back positive?).
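The diagnostic-test analogy can be made concrete with Bayes' theorem. The sketch below uses hypothetical numbers (a 95% sensitive, 95% specific test and a 1% prevalence); the point is that sensitivity and specificity alone don't tell you the positive predictive value, just as a p value alone doesn't tell you how likely the result is to be true.

```python
# Bayes' theorem applied to a diagnostic test: sensitivity and
# specificity describe the test itself, but the positive predictive
# value (PPV) also depends on how common the disease is.
# All numbers here are hypothetical, chosen only for illustration.

def ppv(sensitivity, specificity, prevalence):
    """P(disease | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A quite good test (95% sensitive, 95% specific) applied to a
# rare condition (1% prevalence) still gives a modest PPV:
print(round(ppv(0.95, 0.95, 0.01), 3))  # -> 0.161
```

In other words, most positives from this hypothetical test would be false positives, despite the test's apparently excellent operating characteristics; the missing ingredient is the prior probability, which is exactly what the p value also leaves out.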

The academic medical world started hearing a lot about using confidence intervals in preference to p values 15 or 20 years ago, and I think that led to many doctors concluding that confidence intervals solve this basic p value problem. Confidence intervals have some real benefits compared with p values, but this is not one of them. To look at this, we need to try to understand what a CI means.

Again using the HIV vaccine example from my last post, we can look at the same parameter that had a p value of 0.04, but now examine the point estimate of efficacy (31.2%) and its 95% CI (1.1 to 52.1%). What is it that has a 95% chance of being true, given that CI of 1.1 to 52.1%?

However tricky p value interpretation is for physicians, understanding the meaning of this CI is much worse. I've almost never heard a physician able to correctly interpret the meaning of such a CI when put on the spot with the above question. When trying to teach the interpretation I repeatedly get told there must be a simpler way to communicate the idea. Given my years of failure at this, I have little hope that this post will adequately clarify things for most readers, so if others who teach this have found a way that works to explain it, please write!

All we can really say about that CI is that (excluding any problems with the design or performance of the trial) if we performed the same study 100 times and calculated a 95% CI each time, we would expect that 95 of the 100 CIs formed in this way would include the true value of vaccine efficacy.
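That repeated-sampling definition is easy to demonstrate by simulation. The sketch below (with made-up numbers, not the trial's) runs many simulated "studies" estimating a proportion, builds a standard 95% normal-approximation CI from each, and counts how often the interval captures the true value — the count comes out close to 95%.

```python
import math
import random

# Simulate the frequentist coverage interpretation of a 95% CI:
# repeat the "study" many times and count how often the computed
# interval contains the true parameter. Numbers are illustrative.

random.seed(1)
TRUE_P = 0.3    # hypothetical true proportion
N = 500         # subjects per simulated study
STUDIES = 10_000

covered = 0
for _ in range(STUDIES):
    successes = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = successes / N
    se = math.sqrt(p_hat * (1 - p_hat) / N)
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    if lo <= TRUE_P <= hi:
        covered += 1

print(covered / STUDIES)  # close to 0.95
```

Note what the simulation does and does not show: 95% is a property of the *procedure* across repetitions, not a probability attached to any single interval once it has been computed.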

Note that just as with p values, this isn't what we want to know: we want to know how likely it is that the true value is inside the particular CI that we are looking at, but that isn't what the CI actually tells us.

However, despite recognizing what the CI really does and does not tell us, consumer of biostatistics that I am, I (and others) approach CIs operationally: we choose to interpret the CI as a range of values with which the data are reasonably compatible, and values outside the CI as reasonably incompatible with the data. So, other things being equal, I would say that a vaccine efficacy of 5% was compatible with the results of the NEJM study, while a vaccine efficacy of 60% was not. This does not mean that I think the study has excluded the possibility of the vaccine having 60% efficacy, just that this would be unusual under the play of chance.

This operational definition works as I decide how to write recommendations about whether to administer such a vaccine. If the vaccine truly had an efficacy of 31% I would likely recommend wide use in high risk patients. If the high end of the CI were true (52% efficacy), I might recommend universal vaccination. If the low end were true (1% efficacy) I would probably recommend leaving it on the shelf. Looking at this, I can quickly realize that if this trial were the only information I had about vaccine efficacy then I have inadequately precise data to support whatever recommendation (or set of recommendations) I might want to make about administering HIV vaccine.

If I were using the GRADE scheme for grading such recommendations, I might have started with the assumption that I had high quality evidence from a large randomized trial. But when I realize that I would make different recommendations based on the reasonable values at each end of the CI, I know that I must downgrade the quality of the evidence for such imprecision. Recommendations for HIV vaccine based on this trial, using the GRADE scheme, would certainly be graded as having no better than moderate quality evidence because of this imprecision.

Instead of some arbitrary definition of whether a trial is large enough or precise enough, using the CI in this way allows me to communicate something important about the quality of the evidence as I grade recommendations. I downgrade for imprecision not because a CI crosses a null effect boundary (like 1.0 for a relative risk) but because the CI crosses a clinical boundary where the appropriate recommendation would change from one side of the boundary to the other.

In this way, the CI is far more useful than a p value. I keep in the back of my mind, though, that the CI doesn't really mean what I'm trying to use it to mean -- it's just that I usually don't have anything better.

I'll write more in the future about how GRADE looks at other types of limitations on the quality of evidence from randomized trials.
