Monthly Archives: March 2013

Collaborations key to improved impact evaluation designs

Do funding agencies distort impact evaluations? A session organised by BetterEvaluation on choosing and using evaluation methods, at the recent South Asian Conclave of Evaluators in Kathmandu, focused on this issue. Participants were quite candid about funding agencies dictating terms to researchers. “The terms of reference often define the log frame of evaluation (i.e. the approach to designing, executing and assessing projects) and grants are awarded on the basis of budgets that applicants submit. It’s a bidding process — all about executing the funding agency’s design at a minimal cost,” said a researcher.

But participants said that 3ie’s approach to funding impact evaluations is refreshingly different. Instead of structuring the impact evaluation design a priori, 3ie engages actively with researchers and implementing agencies to develop an appropriate study design. This approach was demonstrated by an ongoing 3ie-supported study assessing the impact of media campaigns on the incidence of early child marriage.

Child marriage continues to be practised in India, even though it is a punishable crime. The implementing agency, Breakthrough, a global human rights organisation, uses interventions such as mass media, community mobilisation and training of community members, developed during an earlier campaign on domestic violence. Breakthrough approached the government to partner with it on the child marriage campaign, but was first asked to demonstrate the effectiveness of its strategies. The organisation then approached 3ie for assistance in evaluating the programme. Through an open bidding process, Catalyst Management Services of Bangalore was selected to design and undertake an impact evaluation, funded through 3ie’s Policy Window.

Extensive discussions were held between 3ie, the research team and the implementing agency to arrive at a study design encompassing both qualitative and quantitative aspects. “3ie worked with us not merely as funders but as partners on the study design,” said Urvashi Wattal, project coordinator, Catalyst Management Services. The collaboration led to changes in the project design: the study, initially planned in three districts, was expanded to include three more, and a mid-line survey was added.

Collecting data on a sensitive issue like child marriage is challenging. The research team will use the polling booth method, which they have applied in other studies. Community members will be asked to enter an enclosed area resembling a polling booth and vote on issues related to child marriage. Votes will be cast using colours and other visual cues, accommodating low levels of literacy.

As a 3ie staff member, I was pleased to attend a session – one 3ie had no role in organising – that highlighted as best practice the way we work to respond to the policy needs of Southern governments and NGOs. More and more implementing agencies are approaching 3ie to undertake similar studies supported by our Policy Window.

Tips on selling randomised controlled trials

Development programme staff often throw up their hands in horror when they are told to randomise assignment of their intervention. “It is not possible, it is not ethical, it will make implementation of the programme impossible”, they exclaim.

In a new paper in the Journal of Development Effectiveness I outline how different randomised controlled trial (RCT) designs overcome all these objections. Randomisation need not be a big deal.

When we randomise, we obviously don’t do it across the whole population. We randomise only across the eligible population. Conducting an RCT requires that we first define and identify the eligible population. This is a good thing. Designing an RCT can help ensure better targeting by making sure the eligible population is identified properly.

It is very rare that the entire eligible population gets the programme from day one. Usually some of the eligible population is excluded, at least temporarily, because of resource or logistical constraints. We can exploit the fact of ‘untreated but eligible’ to get a valid control group. A common way of doing this is to ‘randomise across the pipeline’. Let’s say a programme is to reach 600 communities over three years. These communities can be divided into three equal groups to enter the programme in years 1, 2 and 3, and we randomly choose which communities go into which group. This approach was used in the evaluation of the well-known Progresa conditional cash transfer programme in Mexico, where the communities receiving the intervention in year 3 acted as a control group for the first two years.
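Randomising across the pipeline is mechanically simple. The sketch below shows one way to do it in Python; the community labels, seed and cohort sizes are illustrative, not taken from the Progresa study:

```python
import random

# 600 eligible communities, to be phased in over three years
communities = [f"community_{i}" for i in range(600)]

random.seed(42)            # fix the seed so the assignment is reproducible
random.shuffle(communities)

# Split the shuffled list into three equal entry cohorts
cohorts = {
    "year_1": communities[:200],
    "year_2": communities[200:400],
    "year_3": communities[400:600],  # acts as control during years 1 and 2
}

for year, group in cohorts.items():
    print(year, len(group))
```

Because the shuffle is random, each community has the same chance of landing in each entry year, which is what makes the later cohorts a valid control group for the earlier ones.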

If the entire eligible population can be treated, then we can use a ‘raised threshold design’: the eligibility criteria are slightly relaxed to expand the eligible group. For example, a vocational training programme in Colombia admitted around 25 trainees to each course. Course administrators were asked to identify 30 entrants from the 100 or so applicants for each course. Twenty-five of these 30 were randomly picked to be admitted, while the remaining five entered the comparison group. This approach made virtually no difference to the programme: the same number would have taken the course, and the same number would have been rejected, even in the absence of randomisation.
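The raised threshold assignment amounts to randomly drawing the admitted group from the slightly enlarged pool. A minimal sketch, with illustrative applicant labels and a hypothetical helper function:

```python
import random

def raised_threshold_assignment(entrants, n_admit, seed=0):
    """Randomly admit n_admit of the identified entrants;
    the remainder form the comparison group."""
    rng = random.Random(seed)   # seeded so the draw is reproducible
    pool = list(entrants)
    rng.shuffle(pool)
    return pool[:n_admit], pool[n_admit:]

# 30 entrants identified by course administrators, 25 places available
entrants = [f"applicant_{i}" for i in range(30)]
admitted, comparison = raised_threshold_assignment(entrants, 25)
print(len(admitted), len(comparison))  # prints "25 5"
```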

The ‘raised threshold design’ can also be applied geographically. If you plan to implement a programme in 30 communities, identify 60 which are eligible and randomly assign half to the comparison group.

And you need not randomly assign the whole eligible population. In the first example with 600 communities, power calculations would probably show that at most 120 of the 600 are needed for the evaluation. So for 480 of the communities, that is 80 per cent of the eligible population, the programme can be implemented in whatever way the managers want. It is just for the remaining 20 per cent that we need to randomise the order in which the communities enter the programme. Everyone gets the programme, and in the planned time frame. As evaluators, we are just requesting a change in the order for a small share of communities. Randomisation is no big deal.
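Restricting randomisation to the evaluation sample can be sketched as follows; the sample size of 120 is the illustrative figure from the text, and the equal split into early and late entry is an assumption for the example:

```python
import random

rng = random.Random(7)  # seeded for reproducibility
communities = [f"community_{i}" for i in range(600)]

# Suppose power calculations say 120 communities suffice for the evaluation
evaluation_sample = rng.sample(communities, 120)

# The remaining 480 communities can be phased in however managers prefer
unrestricted = [c for c in communities if c not in set(evaluation_sample)]

# Only the evaluation sample has its entry order randomised
rng.shuffle(evaluation_sample)
early_entry = evaluation_sample[:60]   # treated first
late_entry = evaluation_sample[60:]    # enter later, act as control meanwhile
```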

There may still be objections to RCTs in cases where the control group does not receive the programme. Indeed, in clinical trials it is the norm that the control group receives a treatment rather than no treatment, usually the existing treatment. We can do the same when we evaluate development interventions. In any case, policymakers are more likely to want to know how the new programme compares to existing programmes than how it compares to doing nothing. Or the control group can get some basic package (A), while the treatment group receives the basic package plus some other component we think increases effectiveness (A+B). Or we can use a three-arm factorial design: A, B and A+B.

And finally, for programmes which are indeed universally available, such as health insurance, an encouragement design can be used. These designs randomly allocate an encouragement to take up the programme, such as information, to one group. This creates a new group of programme participants, whose outcomes can be compared to those in areas which have not received the encouragement, thus allowing calculation of the impact of the programme. These designs do not affect the programme in any way, other than to increase take-up.
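One standard way to back out the programme’s impact from an encouragement design is the Wald (instrumental variables) estimator: scale the difference in outcomes by the difference in take-up between the encouraged and non-encouraged groups. The sketch below is not from the paper; the numbers are purely illustrative:

```python
def wald_estimate(y_enc, y_ctrl, takeup_enc, takeup_ctrl):
    """Local average treatment effect from an encouragement design:
    the outcome difference scaled by the take-up difference."""
    return (y_enc - y_ctrl) / (takeup_enc - takeup_ctrl)

# Toy numbers: encouragement raises take-up from 20% to 50%
# and raises the mean outcome from 10.0 to 11.5
effect = wald_estimate(y_enc=11.5, y_ctrl=10.0,
                       takeup_enc=0.50, takeup_ctrl=0.20)
print(round(effect, 6))
```

Intuitively, only the extra 30 per cent of take-up can explain the 1.5-point outcome gap, so the effect per participant is 1.5 / 0.3 = 5.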

So randomisation is indeed not a big deal. Various evaluation designs make little actual difference to the intervention. And what about ethics, you may ask. In most cases it is unethical not to do an RCT if you can. We don’t know if most programmes work or not, and so we need rigorous evaluations to provide that information.

Of sausages and systematic reviews

“Literature reviews are like sausages… I don’t eat sausages as I don’t know what goes into them.” Dean Karlan said this to an unfortunate researcher at a conference recently. The ‘sausage problem’ puts in a nutshell why at 3ie we favour the scientific approach to evidence synthesis — evidence as encapsulated by the systematic review.

We know that systematic reviews can be a very good accountability exercise in helping answer the question “do we know whether a particular programme is beneficial or harmful?”. So instead of cherry picking our favourable development stories, we collect and synthesise all the rigorous evidence. We also say how reliable we think the evidence is through quality appraisal – which should be conducted by two researchers independently, to a level at least as detailed as would be required by the peer review system of a top quality journal. This helps to answer the question “do we know what we think we know?” and, therefore, whether policymakers can trust the evidence. As Mark Petticrew, professor of public health evaluation at the London School of Hygiene and Tropical Medicine says, if policymakers don’t follow a review’s recommendations (which may well be for legitimate reasons), they should at least explain why.

But, outside medicine, reviews have not been very useful to the majority of decision makers – the practitioners involved in the design and implementation of projects, programmes and policies. Serving them usually requires drawing on a wider body of evidence than impact evaluations alone.

Fortunately, we now have decades of existing development research to draw on, including that generated through surveys as well as by more in-depth ethnographic and participatory research, to help us answer many questions. As I explain below, we are experimenting with different approaches to incorporating broader evidence into reviews at 3ie. But first, a little myth-busting. A typical systematic review goes from several thousand papers identified in the initial search, to a couple of hundred for which full-text screening is conducted, and then to a dozen or fewer included effectiveness studies. These figures give the impression that a lot of evidence is being thrown away.

Study search flow for farmer field schools literature


The figure above shows the study search for 3ie’s systematic review of farmer field schools evaluations. The exclusion of the first 28,000 papers is not an issue: reviews cast the net widely to ensure studies are not missed, and so pick up a lot of studies that are irrelevant. The real issue is at the next stage – narrowing down from the 460 full text studies. These studies are relevant evaluations, which generally get excluded on grounds of study design.

Thus, a traditional ‘review of effects’ would have limited the farmer field schools review to just 15 included studies (and, if restricted to RCTs alone, the review would have returned precisely zero results). The analysis of these 15 studies is important, since these are the studies we believe can credibly answer high-level policymakers’ ‘what works’ question.

The approach we use at 3ie does require that studies without credible designs are excluded from the synthesis of causal effects. In the case of farmer field schools, our analysis indicates that farmer field schools do have an impact on real-life outcomes like yields, revenues and empowerment, at least in the short to medium term. But it also confirms that diffusion from trained farmers to their neighbours doesn’t happen.

But what about the remaining 97 per cent of the potentially relevant literature identified for full-text assessment? The farmer field schools community of practice is committed to generating evidence, and we found an additional 119 impact evaluations. We don’t believe these additional evaluations are policy actionable for outcomes relevant to farmers’ quality of life, such as yields and incomes (usually because comparability of the control group cannot be assured). But many of the studies do support the findings on process outcomes (knowledge and adoption of practices). We can also use these studies to make recommendations for improving evaluation design. They suggest that the scale of resources devoted to farmer field schools evaluations might usefully be re-allocated in future to conducting fewer but more rigorous impact evaluations, particularly ones based on a solid counterfactual which assess impacts over the medium to longer term. 3ie is itself funding one such study in China.

However, analysis of the rest of the causal chain requires other types of evidence, and this evidence is thin in impact evaluations. Hence the need to turn to studies which are usually excluded from reviews at the final screening stage.

Thus, we included 25 qualitative evaluations in the review, which have helped us understand the reasons for lack of diffusion found in the quantitative analysis – mainly that the message is too complex for farmers to learn outside of formal training. More generally, the studies identify some of the more common problems in implementation, notably where a top-down ‘transfer of technology’ approach has been implemented for an intervention based on a participatory-transformative theoretical approach. The qualitative studies also helped us to understand better the empowerment aspect of farmer field schools.

Some of the best reviews to date use mixed methods in this way (see ‘Teenage pregnancy and social disadvantage: systematic review integrating controlled trials and qualitative studies’ by Angela Harden, Ginny Brunton, Adam Fletcher and Ann Oakley). But there are still important policy-relevant questions left unanswered, relating to the scale of implementation and how targeting occurs. So, in the final components of the review, we are providing a global portfolio review of 260 farmer field schools projects, and collecting data from 130 studies reporting on targeting and participation. The latter work is ongoing, but we expect it to provide useful information about important goals of some schools, such as the ability to reach women farmers. More generally, the analysis should help us understand whether those who have taken part in the impact evaluations are ‘typical’ farmers, and therefore how generalisable our review findings are.

Systematic reviewing, done right, has the potential to change the culture of development policymaking and research, as it has already done in other fields. Its main strengths are rigour and transparency, and these principles can be applied to answer a wide range of policy questions. Doing the ‘full’ evidence synthesis which we have undertaken for farmer field schools does require more resources. I encourage those interested to read Birte Snilstveit’s paper on going ‘Beyond Bare Bones’ in systematic reviews for options with different resource implications for effectiveness reviews, and to look out for 3ie’s farmer field schools review for an example of broader evidence synthesis.

I wrote this during the inspirational Evaluation Conclave meeting in Kathmandu in February 2013, where Kultar Singh Siddhu, Director, Sambodhi Research and Communications, presented the following quote from the Buddha as a mantra for evaluation. I think it’s also an apt description of why we should do and use systematic reviews:

“Believe nothing, because you have been told it… do not believe merely out of respect for the teacher. But whatsoever, after due examination and analysis, you find to be kind, conducive to the good, the benefit, the welfare of all beings – that doctrine take as your guide.”

Unraveling the opaque through impact evaluations

“If anybody tells you to conduct an impact evaluation, tell that person to go to hell!”

This was the comment made by a renowned impact evaluator at a recent conference after Khalid Al Khudair presented his NGO Glowork’s remarkable success in mobilising the female labour force in Saudi Arabia. The evaluator was trying to make the point that if a programme is obviously working, it is a waste of time and money to conduct an impact evaluation. I disagree.

In Saudi Arabia, many women want to work but they often face difficulties getting their applications through to the human resources department of an organisation. Glowork’s online portal solves this problem by bringing demand and supply together. Since many women prefer to work from home, Glowork created a work-from-home scheme, allowing women to set up an office in their own house with employer supervision of the home worker. Glowork’s system seems to have created a win-win situation: it gives families an additional source of income and allows the economy to benefit from a well-educated part of the population that had been excluded from the labour market.

The fact that women have been brought into the workforce through this system seems to show the success of the idea. Yet the comment of the impact evaluator kept ringing in my ear. I kept asking myself whether an impact evaluation could uncover something more. Marx summed up my doubts very well when he said, ‘If the essence and appearance of things directly coincided, all science would be superfluous’. Assessing success by merely looking at the number of women brought into the workforce might be deceiving. What if there is something going on under the surface?

Let’s take the example of microcredit. Until very recently we were quite sure about its positive effect. The high number of people taking up credit suggested a success story. But there is now a debate about whether microcredit really works and if so, then for whom. Impact evaluation is able to provide answers by examining why some benefit while others don’t. The microfinance crisis in Andhra Pradesh, India, clearly showed that people do not always benefit from receiving loans. The risk of microcredit is that people can get pulled into a debt-trap. Borrowers may end up cross-financing their loans, taking up one loan to pay back another one. Both the unintended consequences and the positive impacts of microcredit are revealed only by rigorous impact evaluations. It’s clear that impressive stats on the high number of people taking up microcredit did not tell us the whole story.

The example of microcredit shows that numbers on take-up tell only part of the story. The highly skilled and experienced commentator at the conference should have known that he was drawing a conclusion about the impact of Glowork from monitoring data alone. Yes, the number of women participating in the labour force did increase. And yes, female employment increased as well. But what these changes have meant for women’s roles in the household is still unclear. What do these numbers really convey about the empowerment of women?

As the founder of Glowork pointed out, female labour market participation changes the distribution of tasks in a household. Does the weight of a woman’s voice increase merely by giving her a job? We hope it does! But a woman’s new job could also have negative unintended effects on her bargaining power vis-à-vis her husband.

Monitoring cannot tell us if such changes are taking place. Impact evaluation on the other hand can do this by peeling away layers of opacity and shedding light on issues like empowerment and social transformation. A comprehensive theory of change, which should be the basis of any impact evaluation, unravels not only the direct causal pathway but also factors in possible negative externalities or spill-overs. Does a new job imply that the woman spends less time on household chores? What is the effect on the upbringing of children? How does it affect household dynamics? These second generation questions are much more interesting and useful for programme design than the first generation question of whether the chances of being employed increase as a result of Glowork.

So I disagree with the comment that impact evaluation is unnecessary for Glowork. Khalid Al Khudair, Glowork’s founder, showed great insight in deciphering the problems of women in the Saudi labour market. He was also highly creative in finding solutions. The results of an impact evaluation are likely to only further fuel his creativity in designing more effective programmes.

(This blog was authored by Markus as a reflection on his participation in the Arab Youth and Entrepreneurship conference in Doha.)