Over at Effect Measure, Revere has posed a thought experiment (or, perhaps, a disguised real experiment) that provides a valuable jumping-off point for discussing the worth of observational data.
In brief, Revere describes an uncontrolled study for a new application of an antiepileptic drug as a treatment for refractory hypertension. A study is conducted by administering the drug to 29 patients and checking their blood pressures at baseline and after a month of therapy.
I glanced at the comments on the blog, and interestingly many commenters are spending their time trying to pin a name on what exactly this study is ("case series", "observational", and the like). I often don't find naming things all that helpful unless everyone has a clear understanding of the naming convention. That said, this would most commonly be called either an observational study (if we take the view that the therapy was going to be administered regardless of any interest in collecting data about what happened) or an uncontrolled clinical trial (if we take the view that the primary goal was to find out whether the AED works for refractory hypertension).
In the thought experiment, the blood pressure readings a month later are substantially lower.
Revere poses the following question:
"...is this a sufficiently reliable study that a reasonable practitioner, committed to the use of scientific evidence in her practice, would consider?"
I'm going to move on quickly from that question, because I think the correct answer isn't that helpful: the study contains data about a potential new therapy, and a reasonable practitioner should always consider such evidence if she or he has the time and ability to do so.
To understand, instead, whether we should believe that the results of this study can be ascribed to the drug working, we might first ask how we would feel about the result had it been obtained from a randomized trial conducted exactly like the uncontrolled trial, except with 29 patients randomly assigned to receive the AED and 29 assigned not to receive it. Notice that this RCT is still unblinded and, as such, has no placebo arm.
We'd likely note a few problems with this RCT. First, it has a relatively small number of patients, though if the drug is highly effective this might not be a problem. We'll come back to this point a bit later.
Second, the outcome is fairly indirect (in GRADE terminology) with respect to an outcome of clinical importance. Revere points out that BP reduction is not a clinical outcome: it's a commonly used surrogate, but what we really care about is a reduction in cardiovascular events, and that is not being measured. BP lowering is such a standard surrogate that it's easy to lose sight of the fact that it is a surrogate at all, but not everything that lowers BP is good for you (consider endotoxin, for instance). The actual measurement, however, adds a further level of indirectness: we are measuring BP reduction at one month as a surrogate for long-term BP reduction, and long-term BP reduction is the marker we actually use as a surrogate for a reduction in CV events.
Third, as noted above, the study is unblinded. Does this matter? We're told that the BP both at baseline and at one month is being taken with an automated cuff and all the clinical assistant has to do is write down the measurement. This sounds like a simple solution, but it is fraught with problems. The patients are unblinded, and so the ones taking the medication may believe in its efficacy, relax more, and so have lower BP readings in the office. The clinical assistants believe in the drug and want the study to succeed, and so may take several readings in patients on the AED ("why don't you sit here and relax, and I'll recheck your BP in a few minutes -- this drug looks like it's working in almost everyone, so I'm sure your BP will come down if we just wait"), but record the first BP reading in the control group patients. And, of course, those analyzing the results know who is and is not receiving the AED, and this might affect all sorts of things (not that we like to admit this).
So this small, unblinded, short-term study of a surrogate outcome might not be all that convincing, even as an RCT.
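The sample-size concern can be made concrete with a quick simulation. This is a hedged sketch, not an analysis of the actual study: the 15 mmHg standard deviation and the 5 and 15 mmHg effect sizes are illustrative assumptions, and significance is judged with a simple two-sided z-test.

```python
import math
import random
import statistics

random.seed(1)

def simulated_power(effect_mmhg, n=29, sd=15, trials=2000):
    """Fraction of simulated two-arm trials in which a two-sided
    z-test at alpha = 0.05 detects the BP difference."""
    hits = 0
    for _ in range(trials):
        control = [random.gauss(0, sd) for _ in range(n)]
        treated = [random.gauss(-effect_mmhg, sd) for _ in range(n)]
        se = math.sqrt(statistics.variance(control) / n
                       + statistics.variance(treated) / n)
        z = (statistics.mean(control) - statistics.mean(treated)) / se
        if abs(z) > 1.96:
            hits += 1
    return hits / trials

power_small = simulated_power(5)   # modest effect: often missed
power_large = simulated_power(15)  # large effect: almost always found
print(power_small, power_large)
```

With 29 patients per arm, a modest BP reduction is detected only a minority of the time, while a large one is found almost always — which is why a small trial is less of a problem when the drug is highly effective.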
When we push it a step further and make the trial uncontrolled, we introduce the major additional problem of regression to the mean (as well as secular trends in BP over time). In the RCT, we had a control group, and so we at least knew what happened to the BP readings of patients not treated with the AED; in the uncontrolled study we have no such knowledge. We're not told exactly what the entry criteria are, but presumably a well-controlled BP would prevent a patient from being given the AED as an additional antihypertensive. Thus, patients whose BP oscillates around a mean value will likely be excluded if they happen to be at a BP minimum and enrolled if they are at a BP maximum. One month later, from the usual biologic variation plus this entry criterion alone, the average BP can be expected to be lower.
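Regression to the mean under such an entry criterion is easy to demonstrate by simulation. All the numbers here (a true mean BP of 150 mmHg, 8 mmHg of between-patient spread, 10 mmHg of visit-to-visit noise, and an entry cutoff of 155 mmHg) are illustrative assumptions:

```python
import random
import statistics

random.seed(0)

changes = []
for _ in range(100_000):
    true_bp = random.gauss(150, 8)             # patient's underlying mean BP
    baseline = true_bp + random.gauss(0, 10)   # noisy office reading at entry
    if baseline < 155:                         # entry criterion: still "refractory"
        continue                               # well-controlled patients excluded
    follow_up = true_bp + random.gauss(0, 10)  # one month later, NO drug given
    changes.append(follow_up - baseline)

mean_change = statistics.mean(changes)
print(round(mean_change, 1))  # clearly negative despite zero treatment effect
```

Even though no one in this simulation is treated at all, the enrolled patients' average BP falls by several mmHg at follow-up, purely because enrollment selected readings that happened to be high.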
Adding this problem to the ones we noted in the RCT version, we are unlikely to be terribly sanguine about believing the results of this small trial.
Are there any circumstances, though, where despite all this we might trust the results anyway? One of the real insights of the GRADE group, to my mind, is a focus on magnitude of effect in thinking about observational data, rather than focusing on minor differences in study design (such as cohort versus case control). With large enough effects, we might accept some pretty minimalist study designs.
For instance, rabies is historically a 100% fatal illness once clinical symptoms appear in someone who has never been vaccinated. In 2005, a 15-year-old girl survived rabies after treatment with a novel experimental regimen. You could imagine that anything can happen once, and it may have been coincidence that this girl survived and received this novel regimen. However, if one more person with rabies were to receive the regimen and survive, it would seem spectacularly unlikely that the explanation would be anything but that the regimen works. A rational clinician would treat any new patient with rabies with that regimen from that moment until a superior regimen was found. (As far as I know, no other patient has survived rabies on this regimen.)
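The arithmetic behind "spectacularly unlikely" is worth spelling out. As a hedged back-of-envelope calculation, suppose we grant a deliberately generous 1-in-1,000 baseline survival rate for an illness that is historically 100% fatal:

```python
# Deliberately generous assumed survival rate for symptomatic,
# unvaccinated rabies without the regimen (historically ~0%).
p_survive_untreated = 0.001

# Probability that the first two patients given the regimen both
# survive by sheer coincidence, with the regimen doing nothing:
p_two_coincidences = p_survive_untreated ** 2
print(p_two_coincidences)  # about one in a million
```

Even under that generous assumption, two consecutive chance survivals are a one-in-a-million event, which is why a second survivor would make "the regimen works" the only plausible explanation.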
Similarly, in response to the humorous attack on EBM asking what RCT data support the use of parachutes when jumping from airplanes, the GRADE group would respond that the magnitude of effect of parachutes is sufficient to constitute high quality evidence for their use (okay, clinical epidemiologists can be somewhat humor-challenged). That is, we have all sorts of historical evidence about what happens when people fall from great heights (evidence we might consider only slightly indirect with respect to falling from an airplane), as well as lots of observational data about what happens when people fall from airplanes wearing parachutes. Not everyone who falls from a great height without a parachute dies, and not everyone wearing a parachute lives, but the effect size is so large that we have high quality evidence for parachutes in the absence of a clinical trial.
To give one example that might feel more real, I was doing a lot of AIDS care in the 1990s. In 1995 an abstract was published from one of the ID meetings about the effects of ritonavir in about 50 people with late AIDS. (I've tried in vain in the recent past to find this abstract -- if anyone can point me to it, I would be grateful.) The results were like nothing we had seen before -- patients' CD4 counts rose dramatically, opportunistic infections improved, and many patients who would have been expected to die improved instead. We did not know what would happen long-term, but it was obvious, without any RCT, that ritonavir was effective therapy for AIDS, at least in the short term. By 1996, we were treating people with triple therapy "cocktails" for HIV, again without any RCTs with clinical endpoints, and watching people who had been dying walk out of hospitals and hospice care as their OIs resolved. The magnitude of effect was such that we had high quality evidence for these cocktails based on observational data alone. (Not that this actually prevented researchers from proceeding with an RCT of triple therapy, but that's a post for another day.)
In the refractory hypertension thought experiment, the problem is that we do not have enough evidence about how unusual the observed BP lowering was to conclude that it represents a very large (and thus very unexpected) effect. The result could have been produced by measurement biases and regression to the mean, and so have had nothing to do with the drug being studied.
Revere started his/her post talking about Cochrane and the focus on RCTs. I agree both with concerns about Cochrane reviews (which can be of quite variable quality), and about the general issue of ignoring high quality observational data (most often high quality because of very large magnitudes of effect) to focus exclusively on RCTs. However, I don't find the evidence from the study in Revere's thought experiment to be sufficiently high quality that I would be generally willing to administer the AED for refractory hypertension without further study. Having read such a study, I might take it into consideration sufficiently to administer the AED as a last resort -- a situation in which I might be willing to grasp at any straw of evidence.