If the answer isn’t 42, how do we find it?

Those of you around my age may be familiar with Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, in which the answer to the question ‘What is the meaning of life, the universe and everything?’ turns out to be the number 42. We wish that systematic reviews could be like that: throw all the evidence into a big number cruncher and out pops a single answer.

This is what statistical meta-analysis does. But are the answers it gives as absurd as Douglas Adams’ 42?

In his recent blog, Cyrus Samii argues that statistical meta-analysis may often be inappropriate (meta-analysis, more broadly defined, covers any synthesis). There is just too little evidence to meaningfully combine in this way. Heterogeneity of context, intervention and estimation method renders any measure of an ‘average treatment effect’ useless. So, says Cyrus, many of the meta-analyses being produced today – including his own – are probably wrong-headed. They are in pursuit of a single-number answer where none exists. Rather than meta-analysis, studies should present the ‘best available evidence’ for the question at hand.

Now, I don’t disagree with Cyrus at all, but I do think there is more to be said.

First, this argument is not a rejection of meta-analysis and should not be read that way. If we are advocating for the best available method to answer the question at hand, that best method will sometimes be meta-analysis. Whether there is currently sufficient evidence of the right sort to carry out the meta-analysis is then a secondary consideration, albeit an important one.

Second, let us unpack a bit what we mean by ‘the right method to answer the right question’.  3ie supported a Campbell review of interventions to improve schooling in low- and middle-income countries by Petrosino et al., which we repackaged with our own analysis in a 3ie working paper. This review lumps together many studies of different interventions in different contexts. But if the question is ‘on average, have interventions to get children into school been effective?’, then lumping all these studies and doing meta-analysis is the right approach. And the answer is ‘yes, they have been effective, as have been those interventions for improving learning outcomes. But it takes different interventions to get children into school than it does for them to learn once they are there.’  These are useful questions to answer, and meta-analysis is the best approach to answer them.

Third, heterogeneity is the friend of meta-analysis, not its enemy. With sufficient observations we can unpack the average treatment effects by intervention type, beneficiary population and so on. For example, which interventions are most effective at getting children into school? Answer: conditional cash transfers and providing resources for teachers, with pre-school and school feeding also looking promising.

One of Cyrus’s own reviews uses meta-analysis to show that land reform generally has productivity enhancing effects, but not in Africa.  And one of my favourite recent reviews, by Sarah Baird and colleagues, coded the degree of monitoring and enforcement of conditionality in conditional cash transfer programmes. The review clearly shows that programmes in which conditions are better monitored and enforced have a greater impact on school enrolments.
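To make the idea of unpacking an average concrete, here is a minimal sketch of fixed-effect, inverse-variance pooling, first across all studies and then by intervention type. The effect sizes, standard errors and group labels are invented purely for illustration and do not come from any of the reviews mentioned here. The function names are likewise hypothetical. A real review would probably use a random-effects model and dedicated software.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (effect size, standard error, intervention type) rows.
# These numbers are made up for illustration; they are NOT real results.
studies = [
    (0.30, 0.10, "cash transfer"),
    (0.20, 0.10, "cash transfer"),
    (0.05, 0.05, "school feeding"),
    (0.15, 0.05, "school feeding"),
]


def pool(effects_and_ses):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    # Each study is weighted by the inverse of its variance (1 / SE^2),
    # so more precise studies count for more.
    weights = [1.0 / se ** 2 for _, se in effects_and_ses]
    total_weight = sum(weights)
    estimate = sum(w * e for w, (e, _) in zip(weights, effects_and_ses)) / total_weight
    return estimate, sqrt(1.0 / total_weight)


def pool_by_group(rows):
    """Pool separately within each intervention type (a simple subgroup analysis)."""
    groups = defaultdict(list)
    for effect, se, group in rows:
        groups[group].append((effect, se))
    return {group: pool(members) for group, members in groups.items()}


# The single 'lumped' answer, and the same evidence unpacked by type.
overall = pool([(effect, se) for effect, se, _ in studies])
by_type = pool_by_group(studies)

print("overall:", overall)
for group, (estimate, se) in by_type.items():
    print(group, estimate, se)
```

The overall estimate answers the ‘on average’ question. The per-group estimates are what let a review say which kinds of intervention drive that average, provided each group has enough studies behind it.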

Having said all that, there are certainly cases in which quantitative synthesis of effect sizes is not appropriate. 3ie promotes synthesis of both factual and counterfactual evidence across the causal chain – as exemplified in our recent review of farmer field schools.  But how many studies actually do that?

The most pertinent methods issue raised by Cyrus’s blog, I believe, is that qualitative synthesis is generally done so poorly. Many studies present what is essentially an annotated bibliography: a list of studies with a paragraph devoted to each one, perhaps organised into sections. But this is just a presentation of the data; it is not a synthesis. There are well-established methods of qualitative synthesis, including coding and matrices, which are barely applied in the reviews I have seen. These methods allow a thematically organised presentation rather than a study-by-study one.

So, I am all for using the right method for the right question. But there is too little understanding of, and agreement on, the right methods of qualitative synthesis.

The efficacy – effectiveness continuum and impact evaluation

This week we proudly launch the Impact Evaluation Repository, a comprehensive index of around 2,400 impact evaluations in international development that have met our explicit inclusion criteria. In creating these criteria we set out to establish an objective, binary (yes or no) measure of whether a study is an impact evaluation, as defined by 3ie, or not. Some criteria were simple (does the study evaluate a programme or policy?) while others were more controversial (does it use experimental or quasi-experimental methods?).

But for one particular criterion, studies did not always fit neatly into a ‘Yes’ or ‘No’ category: Does the study measure programme effectiveness?

One of the key identification strategies in impact evaluation is the randomised controlled trial (RCT). This method involves the random assignment of an intervention to a study population. As you can well imagine, there are an awful lot of RCTs in the biomedical sciences. A quick search of PubMed reveals that more than 360,000 studies published since 1961 have been indexed as RCTs. By our estimation, around 12,000 of these have taken place in low- and middle-income countries. We knew right away that many were medical efficacy trials that would not be directly relevant for international development policy making.

So we had to draw a line in the sand; a line we called ‘Effectiveness’. The problem with drawing lines in the sand, of course, is that sometimes they disappear.

At the outset, the difference between efficacy and effectiveness studies seemed simple enough. Efficacy trials (usually small scale) determine whether a treatment works under ideal (laboratory) conditions. Meanwhile, the more relevant effectiveness studies (normally large scale) examine whether that treatment works under ‘real world’ conditions. But after a while, that line in the sand became pretty blurry. A lot of questions started cropping up. Should all large-scale community-based trials of Vitamin A supplementation count as development impact evaluations? Does every vaccination trial in the developing world count as a development impact evaluation? What sets impact evaluations apart?

Many folks in the biomedical sciences have already pointed out that these two categories sit on a continuum rather than being mutually exclusive concepts. As Mark Borigini notes, it is rare to have a perfect clinical study. Indeed, a number of trials we considered for the repository primarily examined efficacy but also added value to the conversation on treatment effectiveness. But in that case, any biomedical RCT that measures outcomes at the household, community, or regional level (just about anything outside of a laboratory setting) could also be considered an effectiveness trial.

There were many times when we found ourselves saying: “this one feels pretty efficacy-ey,” or “that study has the distinct aroma of effectiveness,” or “it has a certain je ne sais quoi.” As it turns out, this kind of gut-feeling analysis isn’t far off. But we needed to ground our subjectivity in a more deliberate way.

To do this, we drew from early conversations around explanatory and pragmatic trials. Explanatory trials are used to test causal research hypotheses, while pragmatic trials are intended more to inform policy decisions. As Schwartz and Lellouch (1967) point out, this distinction often comes down to the ex ante attitude of the authors around trial design. Pragmatic trials answer real world questions about what treatment is best for the patient in the immediate moment. For 3ie, pragmatic trials produce pragmatic results, and are generally the most applicable. We are not merely concerned with the effect of a drug; we are concerned with the effectiveness of the overall intervention such that we can make recommendations that inform development programming.

To guide this sometimes-subjective decision making, we created a screening criterion (below) to help our screeners conceptualise where a study fits on the efficacy – effectiveness continuum. This tool notwithstanding, what we ultimately found is that in a small number of circumstances it is not totally clear where a study belongs.

In these cases we look to Schwartz and Lellouch’s attitude towards trial design. A more apt metaphor, though, might be found in the landmark 1964 U.S. Supreme Court case Jacobellis v. Ohio. Writing in a concurring opinion, Justice Potter Stewart famously described his threshold test for identifying obscenity, which is not protected under the First Amendment, by saying, “I know it when I see it.”

Item 6a from the 3ie Repository Screening Tool

Studies may exist anywhere on the efficacy-effectiveness continuum. Typically, efficacy studies examine treatment outcomes under highly controlled conditions. Effectiveness studies go beyond laboratory trials and examine interventions in real world settings. Note that RCTs that only address the biomedical efficacy of a drug or treatment should be excluded. The following are screening guidelines to help make this judgment:

If any of these conditions are met in addition to methodological criteria in #6 above, select ‘Yes’:

  1. The intervention under study promotes a social, economic or behavioral change either as one of the final measured outcomes or as a mechanism within the theory of change (beyond the self-administration of a drug). For example, the study may include health behavior messaging, training, provision of information, or screening or surveillance for specific disease conditions.
  2. The study measures any other outcomes in addition to or beyond purely biomedical indicators (such as returns to education, economic productivity, quality of life, disability adjusted life years (DALYs) and spillover effects).
  3. The study measures the cost-effectiveness or cost-benefit of the treatment(s).
  4. The study records any additional formative information that could guide the design or execution of future studies. For example, an RCT that also measures acceptability of a particular treatment (measuring respondent satisfaction with treatment not merely a rate of compliance or uptake) would be included.
  5. The treatment is both prepared and delivered by a community health worker, or trained layperson (such as a parent, teacher or community member and not merely one of the program or study enumeration team).
  6. The programme or outcomes measured answer, or attempt to answer, a question relevant to the roll-out of international development policies or interventions.

If it is unclear whether the study meets any of these conditions (1-6), select ‘Unclear’.

Note that, to err on the side of inclusion, studies marked ‘Unclear’ should likely be included.

If the study meets none of these conditions (1-6), select ‘No’.
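The decision rule in Item 6a can be sketched as a small function. This is only an illustration of the logic as written above, not the tool 3ie screeners actually use. The function names and the encoding of each condition as True (met), False (not met) or None (cannot tell) are assumptions made for this example. It also assumes the methodological criteria in #6 have already been met, and it simplifies ‘should likely be included’ to ‘included’.

```python
from typing import Optional, Sequence

N_CONDITIONS = 6  # the six screening conditions listed in Item 6a


def screen_item_6a(conditions: Sequence[Optional[bool]]) -> str:
    """Return 'Yes', 'Unclear' or 'No' for one study.

    conditions: one entry per screening condition, where
    True = condition met, False = not met, None = screener cannot tell.
    (This encoding is an assumption for the sketch.)
    """
    if len(conditions) != N_CONDITIONS:
        raise ValueError(f"expected {N_CONDITIONS} answers, got {len(conditions)}")
    # Any single condition being met is enough for 'Yes'.
    if any(answer is True for answer in conditions):
        return "Yes"
    # No condition is clearly met, but at least one cannot be ruled out.
    if any(answer is None for answer in conditions):
        return "Unclear"
    # Every condition is clearly not met.
    return "No"


def include_in_repository(answer: str) -> bool:
    """Err on the side of inclusion: 'Unclear' is treated like 'Yes' here."""
    return answer in {"Yes", "Unclear"}


# Example: a hypothetical trial that only reports cost-effectiveness (condition 3).
example = [False, False, True, False, False, False]
print(screen_item_6a(example), include_in_repository(screen_item_6a(example)))
```

Writing the rule out this way also shows where the judgment sits. The function is trivial; deciding whether each individual condition is met, not met or unclear is where the ‘I know it when I see it’ test still applies.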