
Failure is the new black in development fashion: Why learning from mistakes should be more than a fad.

During a meeting at the Inter-American Development Bank (IADB) last week, I mentioned the UK Department for International Development’s moves toward recognising failure, and the role that recognising failure plays in learning (see Duncan Green’s recent blog on this). Arturo Galindo, from IADB’s Office of Strategic Planning and Development Effectiveness, responded by picking up a copy of their latest Development Effectiveness Overview and opening it to show the word FAILURE emblazoned across the page in large yellow letters. The word failure appears 27 times in the report, compared to just 22 times for the word success. This doesn’t mean that IADB is in any way a failing institution. Quite the opposite. They are putting into action the same principle emphasised by Ian Goldman from South Africa’s Department of Performance Monitoring and Evaluation: incentives should not penalise failure, but failing to learn from failure.

Impact evaluations have an important part to play in learning from failure.  Eighty per cent of new businesses fail in the first five years: a fact that holds true generally across the world. Do we really think that public sector and NGO programmes do any better?  Failing development programmes survive because they don’t face the bottom line in the way unsuccessful businesses do. So, how does one estimate the bottom line for development programmes? Traditional process evaluations are subject to various biases which make them less likely to point to the harsh reality of failure (as discussed in my 3ie working paper with Daniel Phillips). Impact evaluations are not subject to these biases. Impact evaluations are the bottom line for development programmes.

In the past, development agencies have shied away from acknowledging failure. They have cherry-picked projects and programmes in order to learn from best practice. In doing so, they ignore the fact that we should also learn from our mistakes. But now some agencies are doing impact evaluations on a serious scale. A few years ago Oxfam GB instituted a new results system, which includes 30 new impact evaluations a year on a random sample of their projects. Last year, close to 60 new IADB projects – that is nearly half of their total projects – included impact evaluations. The systematic collection of results from these studies as they are completed will start to give a more accurate picture of agency performance.

The intention to learn from failure signals an important step in the use of impact evaluation. Agencies have started producing impact studies, but have not really thought about how they fit into their overall learning and accountability frameworks. In an earlier blog I wrote about the unhappy marriage between results frameworks and impact evaluation. In fact, they have not even been dating. So, systematic attempts to draw lessons from impact evaluations – as IADB does in the 2013 Development Effectiveness Overview – should be lauded. These attempts take us further down the road of improving lives through impact evaluation.

When will researchers ever learn?

I was recently sent a link to this 1985 World Health Organization (WHO) paper which examines the case for using experimental and quasi-experimental designs to evaluate water supply and sanitation (WSS) interventions in developing countries.

This paper came out nearly 30 years ago. But the problems it lists in impact evaluation study designs are still encountered today. What are these problems?

Lack of comparability of treatment and control, including in randomised controlled trials (RCTs): Experience in several large trials in both developed and developing countries shows that differences in secular trends, including changes caused by epidemics which disproportionately affect one of the two groups, make it very difficult to attribute observed changes to the intervention. Researchers who use black box RCT designs, which ignore context and simply assume that randomisation will achieve balance, fall into this trap. Data must be collected on contextual factors, including other interventions in the study area, so that researchers are aware of them.
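
As a rough illustration (the simulated data, variable names and the 0.1 threshold below are my own assumptions, not anything from the WHO paper), a simple balance check on baseline covariates and concurrent interventions is one way to avoid treating randomisation as a black box:

```python
# A minimal sketch with simulated, hypothetical data: check baseline balance
# between treatment and control on contextual covariates rather than assuming
# that randomisation has delivered balance.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300  # a modest trial, where chance imbalance is quite plausible

df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "baseline_diarrhoea": rng.binomial(1, 0.25, n),
    "household_size": rng.poisson(5, n),
    "other_wss_programme": rng.binomial(1, 0.15, n),  # a concurrent intervention
})

def standardised_mean_diff(x, treated):
    """Difference in group means divided by the pooled standard deviation."""
    x1, x0 = x[treated == 1], x[treated == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

for col in ["baseline_diarrhoea", "household_size", "other_wss_programme"]:
    smd = standardised_mean_diff(df[col].to_numpy(), df["treated"].to_numpy())
    print(f"{col}: SMD = {smd:+.2f}")  # |SMD| above ~0.1 is a common warning flag
```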

Sample sizes are often too small: Ex ante power calculations are becoming more common, but they are still not common enough. When they are done, they assume unrealistically large effect sizes, over-estimate compliance and under-estimate attrition, and sometimes even ignore the effect of clustering on the standard errors. All of this reduces the true power of the study. A typical power calculation sets the sample size required for power of 80 per cent, but the power actually achieved is more likely around 50 per cent. This means that if the intervention works, there is only a 50 per cent chance that the study will find that it does. An under-powered RCT is no better than tossing a coin for finding out whether a successful intervention is actually working. The WHO paper suggests that sample sizes of 120,000 households may be too low to detect the impact of a WSS intervention – but we see many studies with sample sizes of 200 or less!
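
To make the arithmetic concrete, here is a minimal sketch of how optimistic planning assumptions erode power. Every number in it (prevalence, compliance, attrition, cluster size, intra-cluster correlation) is an illustrative assumption of mine rather than a figure from the WHO paper; with these particular choices, a design nominally powered at 80 per cent ends up with roughly 50 per cent power.

```python
# A minimal sketch with illustrative numbers: a two-arm trial of a WSS
# intervention, designed for 80 per cent power, loses power once imperfect
# compliance, attrition and clustering are taken into account.
from scipy.stats import norm

alpha = 0.05
z_alpha = norm.ppf(1 - alpha / 2)

def n_per_arm(p_control, p_treat, power=0.80):
    """Sample size per arm for a two-sample comparison of proportions."""
    z_beta = norm.ppf(power)
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (z_alpha + z_beta) ** 2 * var / (p_control - p_treat) ** 2

def achieved_power(p_control, p_treat, n):
    """Approximate power given the true effect and the effective n per arm."""
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    z = abs(p_control - p_treat) / (var / n) ** 0.5
    return norm.cdf(z - z_alpha)

# Planned design: diarrhoea prevalence of 30% in control, 18% in treatment.
n_planned = n_per_arm(0.30, 0.18)
print(f"planned sample size per arm: {n_planned:.0f}")

# Reality check: only 80% of treated households actually use the facility,
# 10% of households are lost to follow-up, and village-level clustering
# inflates the variance (design effect = 1 + (m - 1) * ICC).
p_treat_itt = 0.30 - 0.80 * (0.30 - 0.18)   # effect diluted under intention-to-treat
n_analysed = n_planned * (1 - 0.10)         # attrition
design_effect = 1 + (10 - 1) * 0.02         # 10 households per cluster, ICC = 0.02
n_effective = n_analysed / design_effect
print(f"power actually achieved: {achieved_power(0.30, p_treat_itt, n_effective):.0%}")
```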

Inaccurate data (misclassification bias): There are good reasons to believe that the reporting of outcomes by those surveyed, and data on whether or not they utilised the intervention, are likely to be inaccurate. Better-designed instruments, with cross-checks and more triangulation, including complementary qualitative research, can help get around this problem. In the absence of a counteracting measure, studies are likely to under-estimate programme impact, thus exacerbating the problem of lack of power.
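
A short simulation shows the mechanism (the sensitivity and specificity figures below are illustrative assumptions, not survey estimates): when misreporting is non-differential between arms, the estimated effect shrinks by a factor of sensitivity + specificity − 1, so the study looks weaker than it really is.

```python
# A minimal sketch with illustrative parameters: non-differential misreporting
# of the outcome attenuates the estimated impact towards zero.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                              # large n, so the gap is not sampling noise
true_control, true_treat = 0.30, 0.18    # true diarrhoea prevalence by arm

def reported_prevalence(true_p, sensitivity=0.8, specificity=0.9):
    """Prevalence as captured by a survey with imperfect recall/reporting."""
    outcome = rng.binomial(1, true_p, n)
    reported = np.where(outcome == 1,
                        rng.binomial(1, sensitivity, n),      # some true cases missed
                        rng.binomial(1, 1 - specificity, n))  # some false reports
    return reported.mean()

obs_control = reported_prevalence(true_control)
obs_treat = reported_prevalence(true_treat)
print(f"true effect:      {true_control - true_treat:.3f}")
print(f"estimated effect: {obs_control - obs_treat:.3f}")   # noticeably smaller
```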

Ethical problems: There are legitimate ethical concerns about withholding interventions which we know to have positive impacts. For many interventions, including those in WSS, we know there is a positive impact: most WSS interventions achieve a 40-60 per cent reduction in child diarrhoea. So research resources are better devoted to questions of how to ensure sustained adoption and proper use of improved facilities. Doing so also avoids the ethical problem. But too few researchers concern themselves with answering these practical design and implementation questions.

Time and budget constraints: Several of the above problems, such as poor survey instrument design and insufficient sample size, stem from the unrealistic time and budget constraints imposed on studies. Study time frames are often too short to allow effects to emerge, and certainly too short to see whether they are sustained. So, what is the solution?

Impact evaluation is not the only sort of evaluation, and with short time frames and small budgets it is probably better to do a high quality process evaluation than a low quality impact evaluation. This does not mean that impact evaluations are not required. They are needed, but resources must be strategically deployed to undertake high quality studies which avoid these problems. Considering we have known about these problems for nearly thirty years, it is about time we learned from them and stopped making the same mistakes.