Monthly Archives: November 2014

How big is big? The need for sector knowledge in judging effect sizes and performing power calculations

A recent Innovations for Poverty Action (IPA) newsletter reported new study findings from Ghana on using SMS reminders to ensure people complete their course of anti-malaria pills. The researchers concluded that the intervention worked and that more research is needed to tailor the messages to be even more effective.

The proportion of those completing the pack of pills was 62 per cent in the control group and 66 per cent in the treatment group. My first reaction was rather different from that of the researchers. I thought, “Well, that didn’t work very well. They had better look for something else.” The researchers seem to be falling into the all too common trap of mistaking statistical significance for practical significance.
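Purely to illustrate that gap between statistical and practical significance, here is a minimal sketch of a two-proportion test in Python, assuming a hypothetical 2,500 people per arm (the study’s actual sample size is not given here). The point is only that, with a sample that large, a four percentage point difference clears the conventional significance bar even if it looks practically modest.

```python
# Hypothetical illustration: statistical vs practical significance.
# Assumes 2,500 people per arm (NOT the actual study's sample size).
from statsmodels.stats.proportion import proportions_ztest

n_per_arm = 2500
completed = [int(0.66 * n_per_arm), int(0.62 * n_per_arm)]  # treatment, control
nobs = [n_per_arm, n_per_arm]

z_stat, p_value = proportions_ztest(completed, nobs)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# With samples this large, a 4 percentage point gap is statistically
# significant, which says nothing about whether it is big enough to matter.
```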

But then I remembered an interview I heard in the wake of the Lance Armstrong doping scandal. “Without the drugs”, a fellow cyclist said, “Armstrong would have been in the top ten rather than first.” But in many races just a second or two separates the person coming first from the one in tenth place.

How much difference do performance-enhancing drugs really make in athletics? This rather nice blog lays out lots of data to answer this question. Performance-enhancing drugs clearly do work. The improvement for the 1500 metres is 7-10 seconds. Given the world record of 3 minutes 26 seconds, that may not sound like a lot. It is just three to five per cent. But it took thirty years for the world record to improve by 10 seconds, from 3:36 to 3:26. The women’s world record for the 400 metres was set, most likely with the assistance of performance-enhancing drugs, back in the 1980s. No one has come close since.

As a runner myself, I know the big difference between running 10 kilometres in 39 minutes 50 seconds and running it in 40 minutes 20 seconds. Or a marathon (42.2 kilometres) in 2 hours 59 minutes 40 seconds rather than 3 hours 1 minute 10 seconds. It is very frustrating speaking to non-runners who say, “Well, that’s not much of a difference.”

So, going back to the IPA study in Ghana, who am I to say whether the increase from 62 to 66 per cent is big or not? I simply don’t have enough sector knowledge to make that judgment. I don’t know what else has been tried and how well it worked. And in particular, I don’t know which of the approaches to getting people to complete their malaria treatment is most cost effective. As my 3ie colleague Shagun argued in her recent blog, it takes specialist sector knowledge to know how big an effect needs to be for it to matter, and knowing this is crucial in performing accurate power calculations. Impact evaluation study teams often don’t have that knowledge. They should be consulting sector policymakers to find out how big is big enough.

Calculating success: the role of policymakers in setting the minimum detectable effect


When you think about how sample sizes are decided for an impact evaluation, the mental image is that of a lonely researcher labouring away on a computer, making calculations in Stata or Excel. This scenario is not too far removed from reality.

But this reality is problematic. Researchers should actually be talking to government officials or implementers from NGOs while making their calculations. What is often deemed ‘technical’ actually involves making several considered choices based on on-the-ground policy and programming realities.

In a recent blog here on Evidence Matters, my 3ie colleagues Ben and Eric highlighted the importance of researchers clarifying the assumptions they had made for their power and sample size calculations. One assumption that is key to power and sample size calculations is the minimum detectable effect (MDE), the smallest effect size you have a reasonable chance of detecting as statistically significant, given your sample size.

For example, in order to estimate the sample size required for testing the effects of a conditional cash transfer (CCT) programme on children’s learning outcomes at a given power level, say 80 per cent, the MDE needs to be assumed. This assumption is about the minimum effect we want to be able to detect. If the CCT programme is expected to lead to an increase of 20 per cent or more in learning outcomes compared to the control group, then the researchers might use 20 per cent as the MDE for calculating the required sample size. Of course, the baseline value of the outcome matters: how much improvement one expects depends on the value one is starting from.
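As a rough sketch of how that assumption feeds into the sample size, the snippet below uses Python’s statsmodels with entirely hypothetical numbers: a baseline test score of 50 points, a standard deviation of 25, and an MDE of a 20 per cent increase (10 points, or 0.4 standard deviations).

```python
# Sketch of a sample size calculation for the hypothetical CCT example.
# All numbers are illustrative assumptions, not taken from any real study.
from statsmodels.stats.power import TTestIndPower

baseline_mean = 50.0   # hypothetical mean test score in the control group
sd = 25.0              # hypothetical standard deviation of test scores
mde_relative = 0.20    # minimum detectable effect: a 20% increase

effect_size = (mde_relative * baseline_mean) / sd  # standardised effect (Cohen's d)

n_per_arm = TTestIndPower().solve_power(
    effect_size=effect_size,  # 0.4 standard deviations
    alpha=0.05,               # 5% significance level
    power=0.80,               # 80% power
)
print(f"Required sample size: about {n_per_arm:.0f} children per arm")
# Under these assumptions this comes out at roughly 100 per arm.
```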

But there is often no clear basis for knowing what number to pick as the MDE. So, researchers make a best guess estimate or assume a number. Some researchers might use the estimated effect size of similar past interventions on selected outcomes. Others might do a thorough literature review to understand theoretically how much the impact should be, given the timeline of the intervention.

But how one comes up with this number has serious implications. Once you decide on an MDE of 20 per cent, your impact evaluation will have little chance of detecting any difference smaller than 20 per cent. In other words, if the actual difference between treatment and control schools for learning outcomes is 10 per cent, a study powered for a 20 per cent difference is more likely to come up with a null finding and conclude that the CCT programme has no impact on learning outcomes.

And the smaller your MDE, the larger the sample size required. So, researchers always have to balance the desire for a study that can detect as small a difference as possible against the additional expense of collecting data from a larger sample.
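To make the trade-off concrete, the short loop below reuses the hypothetical baseline and standard deviation from the sketch above: as the MDE shrinks, the required sample grows rapidly, and halving the MDE roughly quadruples the sample needed.

```python
# How the required sample size grows as the MDE shrinks
# (same hypothetical baseline mean of 50 and SD of 25 as above).
from statsmodels.stats.power import TTestIndPower

baseline_mean, sd = 50.0, 25.0
analysis = TTestIndPower()

for mde_relative in (0.25, 0.20, 0.15, 0.10):
    effect_size = (mde_relative * baseline_mean) / sd
    n_per_arm = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
    print(f"MDE {mde_relative:.0%}: about {n_per_arm:.0f} per arm")
# Halving the MDE from 20% to 10% roughly quadruples the required sample.
```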

Having a study that can detect as small a difference as possible between treatment and control groups is not necessarily a good thing. That is why it is important to involve decision makers and other key stakeholders using the evaluation in deciding the MDE. They may feel that a 10 per cent increase in learning outcomes between control and treatment groups, as a result of the CCT programme, is too low to justify investment in the programme. Spending money on a study with an MDE of 10 per cent, which would require a larger sample size, would then not be meaningful for those policymakers. If we did, wouldn’t we be spending money on a study that is ‘overpowered’?

At 3ie, many of the research proposals we receive make no mention of the policymaker’s perspective on the expected minimum improvement in outcomes required for justifying the investment being made in the programme.  No one has defined what success will look like for the decision maker.

To get that important definition, researchers can ask policymakers and programme managers some simple questions at the time the research proposal is being prepared. Here are some indicative questions to illustrate: What aspects of the scheme or programme are most important to you? What outcomes, in your view, should this scheme change or improve? If the scheme were to be given to village A and not village B, what difference would you expect to see in the outcomes for individuals in village A as compared to village B (assuming the baseline value of the outcome is the same in both villages)? Would this difference be sufficient for you to decide whether or not to roll out this programme in other districts?

Of course, as 3ie’s Executive Director, Howard White, highlights in his blog, researchers need to balance this information and ensure that the MDE is not set too large.

So, as researchers, instead of monopolising the exercise, let us involve policymakers and programme implementers in the power calculations of a study. It’s high time we started following the conventional wisdom that the MDE should be the minimum change in the outcome that would justify, for a policymaker, the investment made in the intervention.

“Well, that didn’t work. Let’s do it again.”


Suppose you toss a coin and it comes up heads. Do you conclude that it is a double-headed coin? No, you don’t. Suppose it comes up heads twice, and then a third time. Do you now conclude the coin is double-headed? Again, no you don’t. There is a one in eight chance (12.5 per cent) that a coin will come up heads three times in a row. So, though it is not that likely, it can and does happen.

So, if an impact evaluation finds that an intervention doesn’t work, should we discard that intervention? No, we shouldn’t. We should do it again. Our study is based on a sample, so there is a probability attached to the study findings. More specifically, the power of the study is the probability that we correctly conclude that a successful intervention is working (that is, that we correctly reject the null hypothesis of no effect). Power is typically set at 80 per cent. That means that 20 per cent of the time we will find that successful programmes don’t work.

Actually it is worse than this.  The true power for many impact evaluations is only around 50 per cent. So, if a programme is working, an under-powered study is no better than tossing a coin for finding that fact out! This is a rather distressing state of affairs. But it can be addressed in three ways: (1) realistic power calculations, (2) external replication, and (3) meta-analysis.

My colleagues, Ben and Eric, recently blogged on the importance of performing and reporting power calculations. And I would emphasise one of their points: it is crucial to have realistic assumptions for these calculations. That is frequently not the case. A main culprit is setting too large a minimum effect size: the larger you set this effect, the smaller the sample you need to detect it. But if the actual effect is smaller, then your study is underpowered. One reason this happens is that researchers believe project staff’s inflated views of programme impact, e.g. a 50 per cent increase in income (Really? Please include me in that project). So, if you use 50 per cent as your minimum effect but the true impact is ‘only’ 15 per cent, you have a greatly reduced chance of detecting it. A second factor is that researchers ignore the funnel of attrition. Far fewer people take part in the intervention than expected, so estimates of the treatment effect on the treated will be underpowered. This is why 3ie requires proposals it considers for funding to have detailed and well-grounded power calculations.
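To put rough numbers on the 50-versus-15 per cent example, here is a sketch using made-up figures (a baseline income of 100 and a standard deviation of 80): it first solves for the sample size implied by the optimistic MDE and then asks how much power that sample actually has against the smaller true effect.

```python
# What happens when the assumed MDE is far larger than the true effect.
# Baseline income of 100 and SD of 80 are illustrative assumptions only.
from statsmodels.stats.power import TTestIndPower

baseline, sd = 100.0, 80.0
analysis = TTestIndPower()

# Sample size implied by the optimistic 50% MDE
assumed_d = (0.50 * baseline) / sd
n_per_arm = analysis.solve_power(effect_size=assumed_d, alpha=0.05, power=0.80)

# Power that sample actually gives against the 'true' 15% effect
true_d = (0.15 * baseline) / sd
achieved_power = analysis.solve_power(effect_size=true_d, nobs1=n_per_arm,
                                      alpha=0.05, power=None)
print(f"n per arm: {n_per_arm:.0f}, achieved power: {achieved_power:.0%}")
# Under these assumptions the achieved power falls well below 50 per cent,
# let alone the nominal 80 per cent.
```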

Second, as I hope is clear by now, false negatives are likely to be very common. Just because a study finds no significant impact doesn’t mean the intervention doesn’t work. To improve internal validity, one approach is to develop a theory of change (see here, here and here), which may reveal an obvious reason why an intervention failed, as in the case of the recent randomised controlled trial that found that textbooks don’t affect learning outcomes if they are not given to the students! But for external validity, the answer is to do it again! This is external replication: trying the same programme, usually in a different place. But actually, doing it again in the same place is scientifically more sound.

But I don’t mean you should just keep doing it again and again until you have one study that finds an impact and then say, “Ah ha, so it does work. All the other results are false negatives.” ‘Goal scoring’, that is counting how many studies find a significant impact and how many don’t, is simply an incorrect way of summarising these data. Of course findings will be mixed, since each study is based on a sample and not the entire population.

But meta-analysis turns the confused signal of ‘mixed findings’ from multiple studies into a clear signal: the intervention works or it does not. The findings from all the studies can be pooled together to get one overall estimate of the impact of the programme. Meta-analysis leverages the power of the combined sample across studies to give a more precise impact estimate. It can even turn out that three poorly powered studies individually find no effect, but once they are combined in a meta-analysis, a positive effect is found.
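As a minimal sketch of the pooling step, here is a fixed-effect, inverse-variance meta-analysis of three invented studies, each with a positive but individually non-significant estimate; the numbers are made up purely to show how pooling sharpens the signal.

```python
# Fixed-effect (inverse-variance) meta-analysis of three hypothetical studies.
# Estimates and standard errors are invented for illustration only.
import numpy as np
from scipy.stats import norm

estimates = np.array([4.0, 3.0, 5.0])   # e.g. percentage point impacts
std_errors = np.array([2.5, 2.2, 3.0])  # each study individually non-significant

weights = 1.0 / std_errors**2                       # inverse-variance weights
pooled = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
z = pooled / pooled_se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"Pooled estimate: {pooled:.2f} (SE {pooled_se:.2f}), z = {z:.2f}, p = {p_value:.3f}")
# Pooling the three underpowered studies yields a precise, statistically
# significant overall estimate even though none is significant on its own.
```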

So, think about power and get it right. If something doesn’t work, try it again. And, then take all the results and conduct a meta-analysis.  Evidence can improve and even save lives. But if evidence is misused, it is just a waste of money.