
Gearing up for Making Impact Evaluation Matter

Over the last week, 3ie staff in Delhi, London and Washington were busy coordinating conference logistics, finalising the conference programme, figuring out how to balance 3ie publications and clothing in their suitcases, and putting the last touches to their presentations. This is typical conference preparation, but this conference is going to be different. Why? The participant mix: more than 500 people, balanced among policymakers, programme managers and implementers, and researchers.

With our partners – the Asian Development Bank (ADB) and the Philippine Institute for Development Studies (PIDS) – we’ve put together an exciting programme for the first-ever global conference on impact evaluations and systematic reviews held in Asia. Making impact evaluation matter: better evidence for effective policies and programmes will run from 1–5 September.

We have put together workshops and parallel conference sessions that will promote peer learning. We will be fostering engagement and rich discussions on how to use, make decisions from, design, implement, and learn from impact evaluations and evidence syntheses.

In the opening plenary, Paul Gertler will address challenges to using rigorous evidence to make better policy, and this theme will run through the conference.  A key conference goal is to have this diverse and expert collective of participants share knowledge and experience about overcoming barriers to evidence supply and demand.

To further this goal, implementers will be able to learn more about commissioning and using impact evaluation findings. They can then discuss with researchers what evaluation questions and methods are most likely to answer their decision-making needs on the ground. These discussions will benefit researchers too, as they come to better understand the priority evidence questions of implementers, policymakers and funders.

Policymakers and funders will find out about the benefits – and challenges — of undertaking high-quality, mixed-methods impact evaluations and systematic reviews. They will also understand more about when studies and reviews are most effective and how they can use their findings.

We will foster productive and diverse dialogues and debates on how to improve development effectiveness.  There will be ample knowledge-sharing on the state-of-the-field methods and learning about how and when to do impact evaluations and reviews well.

We will be asking important questions. Are we providing quality evidence that informs a decision-maker’s need?  What are the gaps that limit uptake and use of rigorous findings? We hope to find important answers as well.

The conference organisers will encourage practitioners, researchers, policymakers and donors to address questions we know interest them: How and when does community-driven development work? How can impact evaluation be used to assess agency performance? And how can we improve infrastructure policy and planning?

Sessions will raise sector-specific questions, such as how to scale-up a successful programme (or modify or terminate one that has proved less successful), as well as how to take lessons learnt in one context and translate them for planning and priority setting in another one.

David McKenzie, reviewing this paper from Boudreau and colleagues, recently reminded us that conferences and coffee meet-ups are important for finding and engaging with research collaborators. We fully agree! But we also want to expand the conversation beyond co-authoring to match-making between the needs of implementers, decision makers, and researchers. We certainly hope some new projects and partnerships are initiated at this conference.

If you’re coming to Manila, we look forward to seeing you. Remember that you can start to get in the swing of the many conversations by viewing the joint ADB-3ie video lecture series that 3ie put together for this conference. You can also receive regular updates from the conference organisers, including new blogs, videos, photos and other conference updates, through the IE matters mobile app, available in the Google Play Store (an app for iPhone users will be launched shortly).

If you can’t make it to Manila, don’t worry, there are lots of ways to participate. You might miss networking during coffee breaks and shared meals, but you can still be part of the larger conversation about how to make impact evaluations and systematic reviews matter for planning and decision-making. There will be live streaming of the plenary sessions; check the conference homepage for details. We’ll be blogging and tweeting #IEmatters. Our roaming videographers will be interviewing participants throughout and catching some of the parallel sessions. We’ll be posting those videos on the conference website, on Facebook and on our own website.

Stay tuned.

Ten things that can go wrong with randomised controlled trials

I am often in meetings with staff of implementing agencies in which I say things like ‘a randomised design will allow you to make the strongest conclusions about causality’. So I am not an ‘unrandomista’.

However, from the vantage point of 3ie having funded over 150 studies in the last few years, there are some pitfalls to watch for in order to design and implement randomised controlled trials (RCTs) that lead to better policies and better lives. If we don’t watch out for these, we will just end up wasting the time and money of funders, researchers and the intended beneficiaries.

So here’s my list of top ten things that can go wrong with RCTs. And yes, you will quickly spot that most of these points are not unique to RCTs and could apply to any impact evaluation. But that does not take away from the fact that they still need to be avoided when designing and implementing RCTs.

  1. Testing things that just don’t work: We have funded impact evaluations in which the technology at the heart of the intervention didn’t work under actual field conditions. We don’t need a half-million-dollar impact evaluation to find this out. In such cases, a formative evaluation, which includes small-scale field testing of the technology, should precede an RCT.
  2. Evaluating interventions that no one wants: Many impact evaluations fail because there is little or no take-up of the intervention.  If 10 per cent or fewer of intended beneficiaries are interested in an intervention, then we don’t need to evaluate its impact. It will not work for more than 90 per cent of the intended beneficiaries because they don’t want it.  The funnel of attrition is a tool that can help us understand low take-up and assess whether it can be fixed. But designers commonly over-estimate benefits of their interventions whilst under-estimating costs to users. So the programme that is implemented may simply be unappealing or inappropriate. Like point one, this point also relates to intervention design rather than evaluation. But this is an important point for evaluators to pay attention to since many of our studies examine researcher-designed interventions. And here again, a formative evaluation prior to the impact evaluation will give information on adoption rates, as well as the facilitators of and barriers to adoption.
  3. Carrying out underpowered evaluations: Studies are generally designed to have power of 80 per cent, which means that one fifth (20 per cent) of the time that the intervention works, the study will fail to find that it does so. In reality, the actual power of many RCTs is only around 50 per cent, so the RCT is no better than tossing a coin for correctly finding out whether an intervention works. Even when power calculations are properly done, studies can end up underpowered. Most often, this is because the calculations assumed a much larger impact, or a much higher adoption rate, than the project actually achieves, so the true impact cannot be detected.
  4. Getting the standard errors wrong: Most RCTs are cluster RCTs in which random assignment is at a higher level than the unit at which outcomes are measured. So, an intervention is randomly assigned to schools, but we measure child learning outcomes. Or it is randomly assigned to districts, but we measure village-level outcomes.  The standard errors in these cases have to be adjusted for this clustering, which makes them larger.  We are therefore less likely to find an impact from the intervention. So, studies which don’t adjust the standard errors in this way may incorrectly find an impact where there is none. And if clustering is not taken into account in the power calculations, then an underpowered study with too few clusters will almost certainly be the result.
  5. Not getting buy-in for randomisation: The idea of random allocation of a programme remains anathema to many programme implementers, despite the many arguments that can be made in favour of RCT designs that would overcome their objections. Getting buy-in for randomisation can thus be a difficult task. Buy-in needs to be secured across all relevant agencies, and at all levels within those agencies. The agreed random assignment may fail if the researchers miss getting the buy-in of a key agency for the implementation of the impact evaluation. Political interference from above or at the local level, or even the actions of lower-level staff in the implementing agency, can act as stumbling blocks to the implementation of random assignment. This leads to…
  6. Self-contamination: Contamination occurs when the control group is exposed to the same intervention or another intervention that affects the same outcomes. Self-contamination occurs when the project itself causes the contamination. Such contamination may occur through spillovers, such as word of mouth in the case of information interventions. It could happen if the people in the control group use services in the treatment area. But it can also occur when staff from the implementing agency are left with unutilised resources from the project area, which they then deliver to the control group.
  7. Measuring the wrong outcomes: The study may be well conducted but fail to impress policymakers if it doesn’t measure the impact on the outcomes they are interested in, or those which matter most to beneficiaries. Do women value time, money, control over their lives or their children’s health?  Which of these outcomes should we measure in the case of microfinance, water, sanitation and hygiene interventions and so on? A common reason that important outcomes are not measured is that unintended consequences, which should have ideally been captured in the theory of change, were ignored. Prior qualitative work at the evaluation design stage and engagement with policymakers, intended beneficiaries and other key stakeholders can reduce the risks of this error.
  8. Looking at the stars: The ‘cult of significance’ has a strong grip on the economics profession, with far too much attention paid to statistical significance (the number of stars a coefficient has in the table of results), and too little to the size and importance of the coefficient.  Hence researchers can miss the fact that a very significant impact is actually really rather small in absolute terms and too little to be of interest to policymakers. Where there is a clear single outcome of the intervention, then cost effectiveness is a good way of reporting impact, preferably in a table of comparisons with other interventions affecting the same outcome. Where researchers have followed 3ie’s advice to take this approach, it has sometimes reversed the policy conclusion derived from focusing on statistical significance alone.
  9. Reporting biased findings: Studies should report and discuss all estimated outcomes, and preferably these outcomes should have been identified at the evaluation design stage. The design should also be registered, for example in 3ie’s Registry for International Development Impact Evaluations. Many studies focus unduly on significant coefficients, often the positive ones, discounting ‘perverse’ (negative) and insignificant results. And even where there is no impact, authors sometimes still conclude that the intervention should be scaled up, possibly because of publication bias, because it is a researcher-designed intervention, or because they have fallen foul of the biases that lead traditional evaluations to favour the intervention.
  10. Failing to unpack the causal chain: Causal chain analysis is necessary to understand how an intervention was implemented and how and why it worked for whom and where.  Researchers are often left to speculate in the interpretation of their findings because they have failed to collect data on the intermediate variables which would have allowed them to test their interpretation. The theory of change needs to be specified at the design stage; the evaluation questions need to be based upon the theory of change; and both factual and counterfactual analysis should be conducted to understand the whole causal chain.
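The power arithmetic behind point 3 is easy to check for yourself. The sketch below is a minimal illustration, using only the standard normal approximation for a two-arm comparison of means; all the numbers (effect size, sample size) are hypothetical, not drawn from any 3ie study. It shows how a trial powered at 80 per cent for an assumed effect of 0.2 standard deviations collapses to well below coin-toss power if the true effect is only half that.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_arm(effect_sd, n_per_arm, z_crit=1.96):
    """Approximate power of a two-arm RCT comparing means.

    effect_sd is the standardised effect (difference in means / SD);
    z_crit is the two-sided 5 per cent critical value. Uses the usual
    normal approximation and ignores the negligible chance of a
    significant result in the wrong direction.
    """
    se = math.sqrt(2.0 / n_per_arm)  # SE of the difference, in SD units
    return normal_cdf(effect_sd / se - z_crit)

# Designed for roughly 80% power at an assumed effect of 0.2 SD...
print(round(power_two_arm(0.20, 394), 2))
# ...but if the true effect is only 0.1 SD, power collapses far below 50%
print(round(power_two_arm(0.10, 394), 2))
```

With 394 participants per arm, the design delivers about 80 per cent power for a 0.2 SD effect, but under 30 per cent if the true effect is 0.1 SD: an over-optimistic impact assumption alone is enough to produce the underpowered studies described above.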
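The clustering penalty in point 4 can also be put in numbers. A standard back-of-the-envelope tool is the design effect, 1 + (m − 1)ρ, where m is the cluster size and ρ the intracluster correlation. The figures below are purely illustrative, not from any particular evaluation.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from randomising clusters rather than individuals."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Sample size an individually randomised trial would need to match
    the precision of the clustered design."""
    return n_total / design_effect(cluster_size, icc)

# Hypothetical school-based cluster RCT: 50 schools x 40 pupils,
# with an intracluster correlation of 0.10 for learning outcomes
print(round(effective_sample_size(2000, 40, 0.10)))
```

Here 2,000 surveyed pupils are worth only about 400 independent observations. A power calculation that ignores this is computing power for a sample nearly five times larger than the one the design actually delivers.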

This list of pitfalls is not meant to stop researchers from doing RCTs or to warn policymakers off using their findings. As I said at the outset, an RCT is a design that allows us to make the strongest conclusions about causality. But RCTs fail to live up to their potential if they fall into any of the above ten traps of design and implementation. So, let’s carry out better RCTs for better policies and better lives. How 3ie tackles these challenges will be the subject of a future blog.



How fruity should you be?

A couple of months back, the BBC reported a new study which questioned the existing advice to eat five portions of fresh fruit and vegetables a day. Five was not enough, according to the study authors; it should be seven. I really do try each day to eat five portions. Where was I going to find the time and space for these extra two portions? But this looked like a sound study published in a respected academic journal, with data from over 65,000 people.

But hang on a minute. This is a study based on observational data, with no attention whatsoever to selection bias. That is, the observations are of people who cram in seven portions of fruit and vegetables a day, are health nuts who exercise three times a day, and follow it up with a bracing cold bath and a quick yoga session. The BBC also quotes the sensible-sounding Professor Tom Sanders, of the School of Medicine, King’s College London, who says it was ‘already known’ that people who said they ate lots of fruit and vegetables were health conscious, educated and better-off, which could account for the drop in risk. Exactly, Tom. So it is not clear why the BBC pegged the article on this apparently erroneous finding. A better headline would have been ‘UK academic blasts study for mistaking correlation for causation’.

But the BBC redeemed itself last week by reporting a systematic review published in the British Medical Journal which concludes that five portions a day really is enough. More than five has no additional health benefits.

Why should I believe this ‘new research’, as the BBC calls it, when it was misleading me back in April into eating seven portions? I should believe it because it is not new research at all. It is something better: it is a systematic review.

Why is a systematic review so great? It is great because the study team did an extremely comprehensive search for all the studies of this topic they could possibly find, which turned up over 7,000 research papers. They then screened them all for quality, keeping only those which contained credible evidence of a causal link: just 16 of the 7,000 studies. Turning to the original systematic review, I see the authors are also using observational data. But they use only the estimates that adjust for confounders, which can deal with selection bias on observables, though not entirely. The systematic review pools together all these findings to get a single estimate based on over 800,000 people. And they present a very nice graph which shows how the risk of dying is reduced by eating more fruit and vegetables, but the effect clearly plateaus between four and five portions. So, actually, four would be okay and five is better. But more than five will just result in flatulence, not fitness.
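A common way reviews pool study estimates is inverse-variance weighting, in which more precise studies get more weight; I am not claiming this is exactly the method the BMJ review used. The sketch below uses made-up log relative risks from three hypothetical studies to show the mechanics.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Fixed-effect (inverse-variance weighted) pooled estimate and its SE."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical log relative risks of death per extra daily portion,
# from three studies of differing precision
est, se = pool_fixed_effect([-0.06, -0.04, -0.05], [0.02, 0.01, 0.04])
```

The pooled estimate sits closest to the most precise study, and its standard error is smaller than that of any single study, which is why a review drawing on 800,000 people can settle a question that individual studies cannot.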

The rise of evidence-based medicine was closely associated with the rise of systematic reviews. Doctors’ decisions on treatment options need to be based on evidence from thousands of patients, not just the dozens they have seen. We can and should apply the same principles to development policy. And there are hundreds of reviews already available to development policymakers. Check out the 3ie systematic review database and learn about what works in development programmes based on evidence from reviews, and not just single studies. Systematic reviews can inform better policies that can lead to better lives.