
Proof-of-concept evaluations: Building evidence for effective scale-ups

I delivered a talk at 3ie’s Delhi Seminar Series on a recently published PLoS ONE paper and follow-up research. This project was a randomised experiment evaluating the potential for text messages to remind malaria patients to complete their course of antimalarial medication. Specifically, we looked at completion of the only class of drugs fully effective in curing malaria in Sub-Saharan Africa: Artemisinin-based Combination Therapies (ACTs). An individual’s failure to complete treatment can have both private and public harms – parasite resistance to these drugs is already emerging in Southeast Asia and there is no clear alternative treatment in the pipeline.

Several interesting questions arose during the course of the seminar, including from discussant Simon Brooker. Some of these questions also came up in follow-up visits to vendors in Ghana. The overarching question behind all of them was: why did we design the intervention to be so hands-off?


  • Why didn’t we allow the vendors to play a stronger role in educating and enrolling patients into the text messaging system?
  • Why didn’t we provide financial support to those for whom phone credit was a barrier to enrolling in the system?
  • Why didn’t we use more interactive forms of texting or even voice-calling (including Interactive Voice Response, such as used here)?
  • Why didn’t we link our messages to a larger system of messaging the drug vendors themselves to remind them about protocol (as was done here)?

Why this way?

I believe we took this approach for three main reasons.

First, our funder CHAI (as part of an operational research project for the Affordable Medicines Facility – malaria (AMFm)) wanted a proof-of-concept about the minimal supportive moving parts required to get patients enrolled into a text messaging system of reminders to complete their medication. In the context of the AMFm, as well as Ghana’s National Health Insurance Scheme, the availability and affordability of ACTs have been expanding rapidly. But support to encourage appropriate use of those ACTs has lagged behind.

So, we wanted to learn what could be scaled up cheaply and easily.

This study is the first randomised evaluation of a direct-to-patient (rather than to health workers) text messaging programme for malaria in Sub-Saharan Africa. We purposively chose northern Ghana as the site of the study (specifically, in and around Tamale in Northern Region, which falls below the Ghanaian average on most welfare and development indicators). We worried that finding an effect from a text messaging programme in the capital, Accra, would not go very far in convincing people that a similar programme could work across Ghana. So, we deliberately made it a bit harder for ourselves to find an impact.

Second, we wanted to isolate the effect of the text message itself. Had the vendors played a stronger role in educating their patients about the need to complete their antimalarial medication, we would have been unable to identify the effect of the text messages alone (without proliferating into an octopus of treatment arms, which budget constraints would not allow).

In this context, we were looking for answers to questions such as: Would the vendors hand out the flyers with minimal encouragement? Would it work if the vendors didn’t tell patients that the point of the messages was to remind them to finish their meds (vendors themselves were kept in the dark about this point until the end of the study)? Would it work if surveyors did not assist patients in enrolling into the system (by either giving a missed call or sending a text)?

Third, the intervention reflected a somewhat narrow conception of mHealth-as-text-message, rather than of text messages as social interactions embedded within larger social systems of communication and health care. This mHealth intervention, though run through a computer running Python and sending messages directly to mobile phones, was still very much embedded in social relationships, such as those between drug vendors and their patients (a point I bring out here).
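For the technically curious, here is a rough sketch of the kind of minimal reminder logic such a system runs. It is an illustration only, not the study’s actual code: the 'send_sms' gateway call, the message wording and the schedule are all hypothetical.

    from datetime import datetime, timedelta

    def send_sms(phone_number, text):
        # Hypothetical gateway call; a real system would use an SMS provider's API.
        print(f"to {phone_number}: {text}")

    def schedule_reminders(phone_number, enrolment_time, course_days=3):
        # One reminder on each remaining day of the three-day ACT course (days 2 and 3).
        message = "Please remember to take your malaria medicine until it is finished."
        return [(enrolment_time + timedelta(days=d), message) for d in range(1, course_days)]

    # Example: a patient enrols today; a real system would queue these for later sending.
    for send_at, text in schedule_reminders("+233XXXXXXXXX", datetime.now()):
        print(f"queued for {send_at:%Y-%m-%d %H:%M}")
        send_sms("+233XXXXXXXXX", text)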

Which way next?

From this study, we see that text messages can indeed have an effect on treatment completion. Precisely how to interpret the effect size is open to debate but as a proof-of-concept, we now have an idea that even in a purposively tough context, text messages may be part of the arsenal that moves patients towards full completion of malaria medication. This has practical significance as well as statistical significance: it can work. Moreover, there is suggestive evidence that the programme could be scaled up, given the hands-off approach we took and the enthusiasm of the vendors with whom we followed up.

There is still, however, a long way to go, as this intervention only gets us to around a 70 per cent completion rate for antimalarial medication. A likely way forward is to think about text messages as one part of a larger, socially embedded intervention with multiple prongs to reach health providers, caregivers and patients through a variety of media and interaction mechanisms. This proof-of-concept evaluation should allow us to build on what works, making this more than a one-off study. It pushes us closer to the ultimate goal of 100 per cent completion of antimalarial medication.

Myths about microcredit and meta-analysis


It is widely claimed that microcredit lifts people out of poverty and empowers women. But evidence to support such claims is often anecdotal.

A typical microfinance organisation website paints a picture of very positive impact through stories: “Small loans enable them (women) to transform their lives, their children’s futures and their communities… The impact continues year after year.” Even where claims are based on rigorous evidence, as in a recent article on microfinance in the Guardian by the chief executive officer of CGAP, the evidence presented is usually from a small number of selected impact evaluations, rather than the full range of available evidence. On the other hand, leading academics such as Naila Kabeer have long questioned the empowerment benefits of microcredit.

So, how do we know if microcredit works? The currency in which policymakers and journalists trade to answer such questions should be systematic reviews and meta-analyses, not single studies. Meta-analysis, which is the appraisal and synthesis of statistical information on programme impacts from all relevant studies, can offer credible answers.

When meta-analysis was first proposed in the 1970s, psychologist Hans Eysenck called it ‘an exercise in mega-silliness’. It still seems to be a dirty word in some policy and research circles, including lately in international development. Some of the concerns about meta-analysis, such as those around pooling evidence from wildly different contexts, may be justified. But others are due to misconceptions about why meta-analysis should be undertaken and the essential components of a good meta-analysis.

3ie and the Campbell Collaboration have recently published a systematic review and meta-analysis by Jos Vaessen and colleagues on the impact of microcredit programmes on women’s empowerment. Vaessen’s meta-analysis paints a very different picture of the impact of microcredit. The research team systematically collected, appraised and synthesised evidence from all the available impact studies. A naïve assessment of that evidence would have indicated that the majority of studies (15 of the 25) found a positive and statistically significant relationship between microcredit and women’s empowerment. The remaining 10 studies found no significant relationship.

So, the weight of evidence based on this vote-count would have supported the positive claims about microcredit. In contrast, Vaessen’s meta-analysis concluded “there is no evidence for an effect of microcredit on women’s control over household spending… (and) it is therefore very unlikely that, overall, microcredit has a meaningful and substantial impact on empowerment processes in a broader sense.” So, what then explains these different conclusions, and in particular, the unequivocal findings from the meta-analysis?

The Vaessen study is a good example of why meta-analysis is highly policy relevant. The meta-analysis process has four distinct phases: conversion of the impacts reported in the studies into policy-relevant quantities (effect sizes), quality assessment of the studies, assessment of reporting biases, and synthesis, including possible statistical pooling across studies to estimate an average impact. It uses these methods to overcome four serious problems in interpreting evidence from single impact evaluations for decision makers.

First, the size of the impacts found in single studies may not be policy significant. That is, impacts are not sufficiently large in magnitude to justify the costs of delivery or participation. But this information is often not communicated transparently. Thus, many single impact evaluations – and unfortunately a large number of systematic reviews in international development – focus their reporting on whether their impact findings are positive or negative, and not on how big the impact is. This is why an essential component of meta-analysis is to calculate study effect sizes, which measure the magnitude of the impacts in common units. Vaessen’s review concludes that the magnitude of the impacts found in all studies is too small to be of policy significance.
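To make ‘common units’ concrete, here is a minimal sketch of how a standardised mean difference (Cohen’s d with the Hedges’ g small-sample correction) is computed from summary statistics. The numbers are made up for illustration and are not drawn from the review.

    import math

    def standardised_mean_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
        # Cohen's d: difference in means divided by the pooled standard deviation.
        pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
        d = (mean_t - mean_c) / pooled_sd
        # Hedges' g applies a small-sample correction factor.
        return d * (1 - 3 / (4 * (n_t + n_c) - 9))

    # Illustrative (made-up) treatment and control group summary statistics
    print(round(standardised_mean_difference(0.55, 0.50, 0.25, 0.25, 400, 400), 2))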

The second problem with single studies is that they are frequently biased. Biased studies usually overestimate impacts. Many microcredit evaluations illustrate this by naïvely comparing outcomes among beneficiaries and non-beneficiaries without accounting for innate personal characteristics such as entrepreneurial spirit and attitude to risk. These characteristics are very likely to be the reason why certain women get the loans and make successful investments. All good meta-analyses critically appraise evidence through systematic risk-of-bias assessment. The Vaessen review finds that 16 of the 25 included studies show ‘serious weaknesses’, and that these same studies also systematically over-estimate impacts. In contrast, the most trustworthy studies (the randomised controlled trials and credible quasi-experiments) do not find any evidence to suggest microcredit improved women’s position in the household in communities in Asia and Africa (see Figure).

The third problem is that the sample size in many impact evaluations is too small to detect statistically significant changes in outcomes – that is, they are under-powered. As noted in recent 3ie blogs by Shagun Sabarwal and Howard White, the problem is so serious that perhaps half of all impact studies wrongly conclude that there is no significant impact, when in fact there is. Meta-analysis provides a powerful solution to this problem by taking advantage of the larger sample size from multiple evaluations and pooling that evidence. A good meta-analysis estimates the average impact across programmes and also illustrates how impacts in individual programmes vary, using what are called forest plots.
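For readers unfamiliar with how pooling works in practice, the sketch below shows a basic fixed-effect (inverse-variance) meta-analysis on made-up effect sizes and standard errors. Reviews like Vaessen’s typically also fit random-effects models and draw the forest plots with dedicated software, so treat this purely as an illustration of the principle.

    import math

    def fixed_effect_pool(effects, std_errors):
        # Weight each study by the inverse of its variance, then average.
        weights = [1 / se**2 for se in std_errors]
        pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
        pooled_se = math.sqrt(1 / sum(weights))
        return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

    # Made-up study effect sizes (standardised mean differences) and standard errors
    effects = [0.05, -0.02, 0.10, 0.00, -0.04]
    std_errors = [0.08, 0.10, 0.12, 0.07, 0.09]
    print(fixed_effect_pool(effects, std_errors))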

The forest plot for Vaessen’s study, presented in the figure, shows an average impact of zero (0.01), as indicated by the diamond, and also shows very little difference in impacts for the individual programmes, as indicated by the horizontal lines which measure the individual study confidence intervals.

There are of course legitimate concerns about how relevant and appropriate it is to pool evidence from different programmes across different contexts. Researchers have long expressed concerns about the misuse of meta-analysis to estimate a significant impact by pooling findings from incomparable contexts or biased results (“junk in, junk out”).

But where evaluations are not sufficiently comparable to pool statistically, for example because studies use different outcome measures, a good meta-analysis should use some other method to account for problems of statistical power in the individual evaluation studies. Edoardo Masset’s systematic review of nutrition impacts in agriculture programmes assesses statistical power in individual studies, concluding that most studies simply lack the power to provide policy guidance.

In the case of Vaessen’s meta-analysis, which estimates the impacts of micro-credit programmes on a specific indicator of empowerment – women’s control over household spending – the interventions and outcomes were considered sufficiently similar to pool. Subsequent analysis concluded that any differences across programmes were unlikely to be due to contextual factors and much more likely a consequence of reporting biases.

This brings us to the fourth and final problem with single studies, which is that they are very unlikely to represent the full range of impacts that a programme might have. Publication bias, well-known across research fields, occurs where journal editors are more likely to accept studies whose findings clearly confirm or refute a hypothesis. Conversely, they are less likely to publish studies with null or statistically insignificant findings. The Journal of Development Effectiveness explicitly encourages publication of null findings in an attempt to reduce this problem, but most journals still don’t. In what is possibly one of the most interesting advances in research science in recent years, meta-analysis can be used to test for publication bias. Vaessen’s analysis suggests that publication biases may well be present. But the problems of bias and ‘salami slicing’ in the individual evaluation studies, where multiple publications appeared on the same data and programmes, are also important.
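One common way to test for publication bias is to check whether small, imprecise studies report systematically larger effects, for instance with Egger’s regression test. The sketch below, on made-up data, illustrates the idea; it is not a reproduction of the diagnostics used in the Vaessen review.

    import numpy as np

    def egger_intercept(effects, std_errors):
        # Egger's test regresses each study's z-score on its precision;
        # an intercept far from zero suggests small-study (publication) bias.
        z = np.asarray(effects) / np.asarray(std_errors)
        precision = 1.0 / np.asarray(std_errors)
        slope, intercept = np.polyfit(precision, z, 1)
        return intercept

    # Made-up data in which the smaller studies (larger SEs) show bigger effects
    print(round(egger_intercept([0.40, 0.35, 0.25, 0.10, 0.05],
                                [0.20, 0.18, 0.12, 0.06, 0.05]), 2))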

Like all good meta-analyses, the Vaessen review incorporates quality appraisal, the calculation of impact magnitudes and assessment of reporting biases. The programmes and outcomes reported in the single impact evaluations were judged sufficiently similar to pool statistically. By doing this, the review reveals ‘reconcilable differences’ across single studies.

Microcredit and other small-scale financial services may have beneficial impacts for other outcomes, although other systematic reviews of impact evidence (here and here) suggest this is often not the case. But it doesn’t appear to stand up as a means of empowering women.


Demand creation for voluntary medical male circumcision: how can we influence emotional choices?

This year, in anticipation of World AIDS Day, UNAIDS is focusing more attention on reducing new infections as opposed to treatment expansion. As explained by the Center for Global Development’s Mead Over in his blog post, reducing new infections is crucial for easing the strain on government budgets for treatment as well as for eventually reaching “the AIDS transition”, when the total number of people living with HIV begins to decline.

Male circumcision is one of the few biomedical HIV prevention strategies with evidence of a large impact on reducing HIV acquisition among men, based on three trials conducted in South Africa, Kenya and Uganda. In 2007, the World Health Organization and UNAIDS recommended scaling up voluntary medical male circumcision (VMMC), particularly in priority countries in Eastern and Southern Africa. Although some progress has been made in the last few years, with close to 6 million circumcisions completed by the end of 2013 in the priority countries, we are still far from the goal of 20.2 million male circumcisions by 2015, the level necessary to avert 3.36 million new HIV infections. How can we design interventions to achieve the level of male circumcisions necessary to help reach the AIDS transition?

Interventions to promote VMMC have been quite successful at increasing the supply of circumcisions. The slow progress is blamed on demand. Governments and others employ two main approaches, informed by acceptability studies, for increasing the demand for VMMC—behaviour change communication (BCC) and opportunity or transaction cost reduction. BCC uses a variety of channels to provide information on the benefits (primarily health benefits) of VMMC. The cost reduction approaches compensate men for financial costs incurred, such as travel expenses, and/or opportunity costs incurred, such as lost working days, when they are circumcised.

The results from these approaches have been disappointing. A recent randomised controlled trial evaluating the impact of comprehensive information about male circumcision and HIV risk in Lilongwe, Malawi shows no significant effect on adult or child demand for circumcision after one year. In another 3ie-supported study in Malawi, information increased the likelihood of getting circumcised by only 1.4 percentage points. On the cost side, a randomised controlled trial conducted in Kenya of an intervention to reduce the costs associated with VMMC finds that small, fixed economic incentives to compensate for lost wages, ranging from KES 700-1200 (USD 8.75-15), increased VMMC uptake within two months among men aged 25-49 years by 7.1 per cent. Although this finding is statistically significant, the effect size is small, suggesting that the targeted level of male circumcision coverage might not be achieved by addressing cost-related barriers alone.

Both these approaches are based on rational choice theory, that is, they assume that men make the decision to get circumcised as a rational choice that maximises benefits to them and minimises costs to them. Given the low numbers overall and the impact evaluation findings of small effects, we have to wonder whether the decision to be circumcised is really a standard rational choice type decision for many men. Maybe—just maybe—some men simply don’t want to be circumcised. Perhaps the decision about male circumcision is an emotional choice decision more than a rational choice decision.

Where does that leave us? Is there anything we can do to influence the emotional choice decision?

3ie’s thematic window for increasing the demand for VMMC is designed to promote and test innovative approaches for increasing the demand for male circumcision. In our scoping paper for this window, we suggest that one approach to innovation may be to engage peers and female intimate partners as catalysts to generate demand. These influencers may appeal to both the emotional choice and rational choice aspects of the circumcision decision. For example, men may find information provided by their peers to be more credible, but they may also feel more comfortable with circumcision if they know someone else who has chosen to do it. Intimate partners may be in a better position to frame the information being given to uncircumcised men, but they may also be able to persuade men on an emotional level in a way that mass media certainly cannot. One concern that has been raised about promoting circumcision among men in stable relationships is that circumcision could signal that the man intends to have sex with others. A barrier to demand like this one can only be addressed within the relationship.

Two studies funded under 3ie’s thematic window are testing interventions based on peers and intimate partners. One, in Zambia, is using peer referral incentives to increase demand for voluntary medical male circumcision. The second, in Uganda, involves female intimate partners (pregnant women in their third trimester) delivering a customised behaviour change communication message to their partners in order to increase the uptake of VMMC.

There is one advantage for this desired behaviour change compared to many of the others we often seek to influence—we only need this behavioural response to occur once. A second approach to influencing the emotional choice takes advantage of this aspect. We know from behavioural economics and many other fields that people often have present-biased preferences, that is, they are not good at doing things they don’t want to do in the present in order to gain benefits over the long term. So perhaps an intervention designed to give a reward in the present can be used to induce behaviour in the present.

Two studies funded under the thematic window are testing interventions based on this idea in Kenya and Tanzania. These interventions employ a lottery (or raffle) that gives men getting circumcised a chance to win a prize, in both cases some kind of phone. The material gain is only received by a subset of men, so the intent is not to compensate them for costs. Rather, the theory is that the prospect of winning the prize will induce the desired behavioural response for the single time needed. We hypothesise that these interventions will also benefit from the tendency of people to overestimate probabilities near zero.
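The ‘overestimate probabilities near zero’ idea comes from prospect theory’s probability weighting function. As a purely illustrative sketch, Prelec’s one-parameter form below shows how a one-in-a-hundred chance of winning a phone can feel bigger than it objectively is; the parameter value is a commonly used illustrative choice, not an estimate from these studies.

    import math

    def prelec_weight(p, alpha=0.65):
        # Prelec (1998) weighting: w(p) = exp(-(-ln p)**alpha); alpha < 1
        # overweights small probabilities and underweights large ones.
        return math.exp(-(-math.log(p)) ** alpha)

    for p in (0.01, 0.05, 0.50):
        print(f"objective p = {p:.2f}, perceived weight = {prelec_weight(p):.2f}")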

The final results from these and three other studies on demand creation for VMMC will be available in the first half of 2015. We hope to learn from them both what works for increasing the demand for VMMC and some insights into influencing emotional choices.

How big is big? The need for sector knowledge in judging effect sizes and performing power calculations

A recent Innovations for Poverty Action (IPA) newsletter reported new study findings from Ghana on using SMS reminders to ensure people complete their course of anti-malaria pills. The researchers concluded that the intervention worked and that more research is needed to tailor the messages to be even more effective.

The proportion of those completing the pack of pills was 62 per cent in the control group and 66 per cent in the treatment group. My first reaction was rather different from that of the researchers. I thought: well, that didn’t work very well; they had better look for something else. The researchers seemed to be falling into the all too common trap of mistaking statistical significance for practical significance.
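To see how even a four percentage point gap can clear the bar of statistical significance while still looking modest, here is a simple two-proportion z-test on hypothetical sample sizes (the study’s actual numbers are not used here).

    from math import sqrt
    from scipy.stats import norm

    def two_proportion_z(p1, n1, p2, n2):
        # Two-sided z-test for the difference between two independent proportions.
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p2 - p1) / se
        return z, 2 * (1 - norm.cdf(abs(z)))

    # 62 vs 66 per cent completion with a hypothetical 1,500 patients per arm
    z, p_value = two_proportion_z(0.62, 1500, 0.66, 1500)
    print(f"z = {z:.2f}, p = {p_value:.3f}")  # significant at the 5 per cent level here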

But then I remembered an interview I heard in the wake of the Lance Armstrong doping scandal. “Without the drugs”, a fellow cyclist said, “Armstrong would have been in the top ten rather than first.” But in many races just a second or two separates the person coming in first from the one in tenth place.

How much difference do performance-enhancing drugs really make in athletics? This rather nice blog lays out lots of data to answer this question. Performance-enhancing drugs clearly do work. The improvement for the 1500 metres is 7-10 seconds. Given the world record of 3 minutes 26 seconds, that may not sound like a lot – it is just three to five per cent. But it took thirty years for the world record to improve by 10 seconds, from 3:36 to 3:26. The women’s world record for the 400 metres was set, most likely with the assistance of performance-enhancing drugs, back in the 1980s. No one has come close since.

As a runner myself, I know the big difference between running 10 kilometres in 39 minutes 50 seconds compared to 40 minutes 20 seconds.  Or a marathon (42.2 kilometres) in 2 hours 59 minutes 40 seconds rather than 3 hours 1 minute 10 seconds. It is very frustrating speaking to non-runners who say, “Well, that’s not much of a difference.”

So, going back to the IPA study in Ghana, who am I to say whether the increase from 62 to 66 per cent is big or not? I simply don’t have enough sector knowledge to make that judgment. I don’t know what else has been tried and how well it worked. And in particular, I don’t know which of the approaches to get people to complete their malaria treatment is most cost effective. As my 3ie colleague Shagun argued in her recent blog, it takes specialist sector knowledge to know how big an effect needs to be for it to matter, and knowing this is crucial in performing accurate power calculations. Impact evaluation study teams often don’t have that knowledge. They should be consulting sector policymakers to find out how big is big enough.

Calculating success: the role of policymakers in setting the minimum detectable effect


When you think about how sample sizes are decided for an impact evaluation, the mental image is that of a lone researcher labouring away at a computer, running calculations in Stata or Excel. This scenario is not too far removed from reality.

But this reality is problematic. Researchers should actually be talking to government officials or implementers from NGOs while making their calculations. What is often deemed ‘technical’ actually involves making several considered choices based on on-the-ground policy and programming realities.

In a recent blog here on Evidence Matters, my 3ie colleagues Ben and Eric highlighted the importance of researchers clarifying the assumptions they had made for their power and sample size calculations. One assumption that is key to power and sample size calculations is the minimum detectable effect (MDE), the smallest effect size you have a reasonable chance of detecting as statistically significant, given your sample size.

For example, in order to estimate the sample size required for testing the effects of a conditional cash transfer (CCT) programme on children’s learning outcomes at a given power level, say 80 per cent, the MDE needs to be assumed. This assumption is about the minimum effect we want to be able to detect. If the CCT programme is expected to lead to a 20 per cent or greater increase in learning outcomes compared to the control group, then the researchers might use 20 per cent as the MDE for calculating the required sample size. Of course, the baseline value of the outcome matters: one’s expectation of how much the outcome will improve depends on the baseline value from which one is starting.

But there is often no clear basis for knowing what number to pick as the MDE. So, researchers make a best guess estimate or assume a number. Some researchers might use the estimated effect size of similar past interventions on selected outcomes. Others might do a thorough literature review to understand theoretically how much the impact should be, given the timeline of the intervention.

But how one comes up with this number has serious implications. Once you decide on an MDE of 20 per cent, your impact evaluation will have a lower chance of detecting any difference smaller than 20 per cent. In other words, if the actual difference between treatment and control schools in learning outcomes is 10 per cent, a study powered for a 20 per cent difference is more likely to come up with a null finding and conclude that the CCT programme has no impact on learning outcomes.

And the smaller your MDE, the larger the required sample size. So researchers always have to balance the desire for a study that can detect as small a difference as possible against the additional expense of collecting data from larger samples.
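As a rough sketch of how the MDE drives sample size, the calculation below uses the standard two-sample formula for a continuous outcome at 80 per cent power and 5 per cent significance. The MDE here is expressed in standard deviation units rather than the percentage increase used in the CCT example above, and clustering is ignored, so the numbers are purely illustrative.

    from math import ceil
    from scipy.stats import norm

    def n_per_arm(mde_sd, power=0.80, alpha=0.05):
        # Sample size per arm to detect an effect of mde_sd standard deviations
        # with a two-sided test under simple individual randomisation.
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        return ceil(2 * (z_alpha + z_beta) ** 2 / mde_sd ** 2)

    # Halving the MDE roughly quadruples the required sample size
    for mde in (0.20, 0.10):
        print(f"MDE = {mde:.2f} SD -> about {n_per_arm(mde)} children per arm")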

Having a study that can detect as small a difference as possible between treatment and control groups is not necessarily a good thing. That is why it is important to involve decision makers and other key stakeholders who will use the evaluation in deciding the MDE. They may feel that a 10 per cent increase in learning outcomes between control and treatment groups, as a result of the CCT programme, is too small to justify investment in the programme. Spending money on a study with an MDE of 10 per cent, which would require a larger sample size, would not then be meaningful for those policymakers. Wouldn’t we simply be spending money on a study that is ‘overpowered’?

At 3ie, many of the research proposals we receive make no mention of the policymaker’s perspective on the expected minimum improvement in outcomes required for justifying the investment being made in the programme.  No one has defined what success will look like for the decision maker.

To get that important definition, researchers can ask policymakers and programme managers some simple questions at the time the research proposal is being prepared. Here are some indicative questions to illustrate: What aspects of the scheme or programme are most important to you? What outcomes should this scheme change or improve according to you? If the scheme were to be given to village A and not village B, then what is the difference you would expect to see in the outcomes for individuals in village A as compared to village B (assuming the baseline value of the outcome is the same in both villages)? Would this difference be sufficient for you to decide whether or not to roll out this programme in other districts?

Of course, as 3ie’s Executive Director, Howard White, highlights in his blog, researchers need to balance this information and ensure that the MDE is not set at too large a value.

So, as researchers, instead of monopolising the power calculations of a study, let us involve policymakers and programme implementers in them. It’s high time we started following the conventional wisdom that the MDE should be the minimum change in the outcome that would justify, for a policymaker, the investment made in the intervention.