
Let’s bring back theory to theory of change

Anyone who has ever applied for a grant from 3ie knows that we care about theory of change. Many others in development care about theory of change as well. Craig Valters of the Overseas Development Institute explains that development professionals are using the term theory of change in three ways: as discourse, as a tool, and as an approach. And indeed, a Google search for ‘theory of change’ in international development returns images of elaborate diagrams of logical frameworks, long documents describing policies and approaches, manuals presenting tools, and blog post after blog post. Theory of change has come to mean many things to many people, and in the process it has lost its soul: theory.

We need to bring theory back. Theory is what allows us to learn cumulatively from one programme to another and from one evaluation to another. Theory is our best hope for the external validity of any impact evaluation and for using small pilots to indicate what might happen at scale. Theory is the best way to design a programme with the highest probability of working and the lowest probability of causing unintended bad things to happen.

What is theory?

Academic circles take the concept so much for granted that I had to go looking for a good definition. I found this one: a theory is ‘a well-substantiated explanation of some aspect of the world, based on a body of facts that have been repeatedly confirmed through observation and experiment.’ (Thank you, Wikipedia.) Wikipedia goes on to explain, ‘The strength of a theory is related to the diversity of phenomena it can explain, which is measured by its ability to make falsifiable predictions with respect to those phenomena. Theories are improved (or replaced by better theories) as more evidence is gathered, so that accuracy in prediction improves over time…’ This definition reminds us about what is needed for an explanation to be considered a theory and also why theories are so useful. A well-founded theory is one based on evidence gathered over time, and it can be used to make predictions. This predictive power of a theory, which we get from repeated tests of it, is a key element to establishing external validity, or the ability to apply what we learn about what works from one situation to another.

Assumptions ≠ theory

We often see ‘theory of change’ presented simply as a results framework or logical framework, sometimes augmented by assumptions, which are intended to provide the theory. Vogel (2012) explains, ‘The central idea in theory of change thinking is making assumptions explicit. Assumptions act as “rules of thumb” that influence our choices, as individuals and organisations.’ But assumptions are not theory. Assumptions do not influence our choices. The predictive powers of most theories rely on making assumptions, but simply assuming that an activity will lead to a result is no better than employing the good-things-must-cause-good-things theory or the if-you-build-it-they-will-come theory.

From arrows and assumptions to theory

The figures below depict the distinctions I am making. Figure 1 shows a logical framework for a programme to provide literacy training to adults. Figure 2 shows some of the assumptions that we make in following the logic of the arrows in the framework. Figure 3 shows the underlying theories that explain (or predict) whether the programme will work.

Figure 1. Logical framework for an adult literacy programme

Figure 2. Assumptions implicit in the logical framework

Figure 3. Theories underlying the design of the adult literacy programme

When we think about the programme in terms of the theories underlying the causal chain, we have a richer picture of what is required for the logical framework to work. We should not just assume that adults will seek training to become literate. We should understand how they are able to optimise their welfare in their current environment. This allows us to predict whether they will make the investment in literacy training as the result of a rational choice. We should not just assume that any training about reading will lead to literacy. We should understand how adults learn to read so that we can design a training programme most likely to achieve an improvement in literacy outcomes. We should also not just assume that individuals who learn how to read will be able to find and get better jobs for higher incomes. We should understand the features of the local labour market so that we know there is an excess demand for literate workers at jobs with higher wages. We should also understand the local labour market well enough to predict whether or not an influx of new literate workers will drive down wages.

You may be thinking that’s a lot of theories for one simple programme. You are right. Development is complicated.

How we get it wrong when we ignore theory

Here are two examples of how we can get things wrong when we ignore theory. The first example is an edutainment intervention. I am on the research team for the impact evaluation of a television soap opera designed to address violent extremism (not funded by 3ie). When we began to design the impact evaluation, we looked carefully at the programme – the television programme, that is. The outline was the same as any soap opera. There are bad people; there are good people; as viewers, we are supposed to dislike the bad people and like the good people. I wondered, why should we believe that bad people viewing this soap opera will dislike the bad people and change their own behaviour as a result? I never became a better person from watching all those episodes of All My Children in graduate school.

I looked to my own discipline, economics, for an explanation and couldn’t find one. I turned to my political scientist colleague on the study team. He didn’t have an explanation either. So we turned to psychology. It turns out the theories in psychology do address how observing others’ behaviour might influence our own, but the theory supported by empirical evidence, i.e. social learning theory, does not support the standard soap opera approach. It predicts that we model behaviours we observe, but that we model both the good and the bad. There is no reason to predict that when we observe both the good and the bad in a soap opera, we only model the good, even when the bad might be punished. Instead social learning theory supports what my colleague and I call the Remember-the-Titans approach, where the drama depicts primarily the positive behaviour that we want observers to model (albeit with some tension to make the story interesting). After identifying the relevant theory, we suggested some changes in the soap opera story to the programme team.

The second example comes from the agriculture sector. An intervention that has been replicated in many countries over the years is the farmer field school. The farmer field school is a programme designed to increase the productivity of small-scale farmers by teaching them better agricultural practices, such as fertiliser use. The design of the training itself is based on adult learning theory. The premise behind the sustainability of the farmer field school model, and the claim for its cost effectiveness, is that there will be positive spillovers. Farmers who attend the school and learn the better practices directly will teach these practices to their neighbours who do not attend the school so that all farmers in the community will ultimately benefit from the programme.

The question is, why would any self-respecting economist predict that a rational agent who gains a productive advantage would share that advantage with her competitors? In fact, many training programmes rely on the assumption that people who receive training will pass on that knowledge to others, creating positive spillovers, or indirect benefits. For some kinds of training (e.g. political mobilisation, sexually transmitted disease prevention) that assumption makes sense. But when markets are involved, we should not ignore competition unless we have a good reason to rule it out.

In the case of farmer field schools, a 3ie systematic review shows that the intervention produces positive outcomes for those who attend the school, providing evidence in support of the adult learning theory. But there is no evidence that neighbouring or non-participant farmers have improved outcomes, supporting, in a sense, economic theory.

Whose job is it anyway?

As I mentioned above, we at 3ie often press impact evaluators to provide a detailed theory of change for the programmes they are evaluating. Unless they designed those programmes, however, the responsibility for the theory is not really theirs. They have to take the programme as it is, and all too often, we see development programmes with no more than (and sometimes less than) results frameworks supporting them. Does that mean we should not require the impact evaluators to design their evaluations around theories?

No, it doesn’t. Remember that a theory is an explanation that is repeatedly confirmed using experiment and observation. Whenever possible, we want to increase the return on investment from individual impact evaluations by having them contribute evidence that helps to improve the predictive power of these explanations, even where the improvement means showing that the theory is wrong or does not apply. The more evidence we can lend to theories, the better the theories we have for programme designers to use the next time around. If an impact evaluator cannot suggest a clear theory or set of theories on which the programme is based, then we need to recognise that this limits (but does not eliminate) what can be learned from that impact evaluation.

So what can we do?

How to solve the problem fully is a longer discussion, perhaps for another blog post. One part of the solution, though, is that we need to make sure that our programme design teams and our impact evaluation teams are able to draw on the social scientific expertise needed to explore all the relevant theories. They need to draw on this expertise to understand the body of evidence that supports, or does not support, applying those theories to the challenge at hand. This is not easy to do. Development is complicated, and often theories from several disciplines apply. But we are not going to be successful if we rely on the ‘logic’ that good things cause good things to happen or that if you build it, they will come. Let’s bring theory back to theory of change.

How synthesised evidence can help with meeting the Sustainable Development Goals

In early 2016, 193 governments across the world put together a to-do list that would intimidate even the most workaholic overachiever: wipe out poverty, fight inequality and tackle climate change over the next 15 years. The United Nations led in articulating these ambitions as 17 Sustainable Development Goals (SDGs), which were then translated into 169 targets that will be monitored – a remarkable feat given the disparate views of the various stakeholders.

But what needs to be done to get these indicators to move? What actions will be most effective, especially given the constrained resources globally? The Copenhagen Consensus, for example, claims that the goals and indicators need to be prioritised for their cost-effectiveness. Since countries the world over have been addressing most of the SDGs for some time, one would assume there must be a body of lessons about what works, where and when. That’s where the problem lies.

There are indeed many lessons. But they need to be curated. Anecdotes, correlations and rigorous counterfactual evaluations need to be sorted from one another; the conditions under which lessons from one setting can be applied in another need to be identified; and an intervention’s effects need to be traced over time and at different scales.

Research syntheses do this much-needed curation. Systematic reviews (SRs) use transparent methods to find, assess and synthesise the best available results on a research question, such as what works best in achieving outcomes like some of the SDG targets. Evidence Gap Maps (EGMs) provide a visualisation of the density of the evidence available on a particular topic. These kinds of research syntheses are sorely needed for informing decision-making related to the SDGs.

How systematic reviews offer a sound basis for decision-making

An impact evaluation of a programme may show different results based on the context. Let’s take the example of supplementary feeding programmes that aim to reduce undernutrition in young children. A recent 3ie-supported systematic review shows that these interventions are effective overall, but there is a great deal of variability in the results depending on the context. Food supplements are effective at critical times, especially for infants and children younger than two years. They are also effective when children are poor or malnourished and when supplementary feeding is supervised. Finally, the quality and quantity of the food also matter.

SRs, unlike single impact evaluations, thus show the variety of results that a programme can achieve under different circumstances. Programme designers and policymakers need to consider these contextual factors if they want their programmes to succeed. Evidence from systematic reviews can be used for making decisions that are better tailored to specific contexts. See this 3ie blog for some more examples of SRs that can be used for informing approaches to achieving the SDGs.

There are, however, challenges to using these reviews. No matter how good they are, SRs cannot be expected to provide a policy ‘magic bullet’ or define a specific set of interventions that nations should implement to move an SDG needle. It is therefore important to manage expectations while distilling lessons from research syntheses that can help move the SDG indicators.

How systematic reviews can address complexity

Another challenge is that, by their nature, the SDGs and their targets are breathtakingly broad. SDG 4, for example, seeks to ensure inclusive and equitable quality education and promote lifelong learning opportunities for all. 3ie’s in-house systematic review on education effectiveness, difficult enough as it was to carry out, was able to study only three or maybe four of SDG 4’s ten targets. The targets excluded from the systematic review include those relating to ensuring equal access to affordable vocational and technical education, increasing the number of youth and adults who have relevant skills for employment, promoting a culture of peace, and providing access for persons with disabilities.

The inclusion of these targets would have made the study too unwieldy and impossible to finish. The problem is compounded because all of the SDGs are related to one another. Moving the needle on one target has implications for the others. Mapping these interlinkages was such a complex exercise that the UN sponsored a competition on how to visualise them interactively (see the winning graph below).

Research syntheses can address this complexity if they provide a theory of change which maps out the most important linkages. For example, a recently published 3ie-supported SR investigates the impact of self-help groups (SHGs) on women’s empowerment. SDG 5 seeks to empower all women and girls, and SHGs are promoted in South Asia precisely to do that. But do they actually empower women? And if so, why? The review examined not only the impact of SHGs on various kinds of empowerment but also the intermediate outcomes of SHG participation, building a solid theory of change to inform the search and review process. The authors found that SHGs have a positive effect on the economic, social and political empowerment of women: they build women’s independence by improving their ability to access, own and control resources, increase women’s bargaining power within the household and foster solidarity amongst SHG members.

[Graph: visualising the interlinkages among the SDGs. Source: United Nations]

How EGMs explore the causal chain

EGMs are a first step in understanding the complexity of the causal linkages between interventions and their final outcomes. EGMs do not tell us whether interventions are successful, but they do show what evidence is available on the effectiveness of a particular intervention. Again, it is not only the final outcome that is interesting but also the several steps leading to it.

For example, a 3ie EGM on education interventions points to the evidence on teacher performance, student attendance and test scores, one outcome leading to the next. Similarly, another 3ie EGM on youth and transferable skills directs us to the evidence in the following causal order: students’ learning, market behaviours, employment and wages. A 3ie blog provides an example of how an EGM on productive safety net programmes can quickly tell us what we know and don’t know for designing solutions to eradicate poverty and meet SDG 1.

In sum, evaluation is critical if we are to know what will move the SDG indicators – a point already made forcefully by the UN Evaluation Group. An important component of this effort is to synthesise the results of rigorous evidence gathering. Applying the lessons from these syntheses must account for local context and be based on a robust theory of change.

3ie is organising its second London Evidence Week from 11-15 April at venues at the London School of Hygiene and Tropical Medicine (LSHTM) and Birkbeck, University of London. It is a series of events that will bring together evaluators, researchers, policymakers and programme managers. The week-long discussions will explore the challenges and opportunities in using high-quality evidence to inform decision-making, especially in key sectors identified by the UN SDGs.

The pitfalls of going from pilot to scale, or why ecological validity matters

The hip word in development research these days is scale. For many, the goal of experimenting has become to quickly learn what works and then scale up those things that do. It is so popular to talk about scaling up these days that some have started using upscale as a verb, which might seem a bit awkward to those who live in upscale neighbourhoods or own upscale boutiques.

We have very little evidence about whether estimated effect sizes from pilots can be observed at scale. When we have impact evaluation evidence from the pilot, we rarely bother to conduct another rigorous evaluation at scale. Evidence Action, a well-known proponent of randomised controlled trial-informed programming, frankly states, “…we do not measure impact at scale (and are unapologetic about it)”.

The reasons are easy to understand. We are eager to use what we learn from our experiments or pilots to make a positive difference in developing countries as quickly as possible. And impact evaluations are expensive, so why spend the resources to measure a result that we have already demonstrated? Especially when it is so much more expensive and complicated to conduct an impact evaluation at scale.

We do not always see the same effects at scale

A few recent studies, however, suggest that we cannot necessarily expect the same results at scale that we measured for the pilot. Bold et al. (2013) use an impact evaluation to test at scale an education intervention in Kenya that was shown to work well in a pilot trial. They find, ‘Strong effects of short-term contracts produced in controlled experimental settings are lost in weak public institutions.’ Berge et al. (2012) use an impact evaluation to test what they call a local implementation of a business training intervention in Tanzania and conclude, ‘the estimated effect of research-led interventions should be interpreted as an upper bound of what can be achieved when scaling up such interventions locally’. Vivalt (2015) analyses a large database of impact evaluation findings to address the question of how much we can generalise from them and reports, “I find that if policymakers were to use the simplest, naive predictor of a program’s effects, they would typically be off by 97%.”

There are a number of reasons to expect that the measured effects from our pilot studies are not directly predictive of the impacts at scale. Most observers focus on the concept of external validity. Aronson et al. (2007) define external validity as ‘the extent to which the results of a study can be generalised to other situations and to other people’. Based on that definition, external validity should not be the main problem in generalising from a pilot to a scaled implementation of the same intervention in the same place.

The problem is ecological validity

The concept I find more useful for the pilot-to-scale challenge is ecological validity. Brewer (2000) defines ecological validity as ‘the extent to which the methods, materials and setting of a study approximate the real-world that is being examined.’ If we want to go from pilot to scale and expect the same net impacts in the same setting, what we need to establish is the ecological validity of the impact evaluation of the pilot. There are, however, several potential threats to ecological validity. Here are some examples:

  • Novelty effects. Some of the interventions that we pilot for development are highly innovative and creative. The strong results we see in the pilot phase may be partly due to the novelty of the intervention. The novelty may wear off at scale, particularly if everyone is now a participant or hears the same message. 3ie funded two impact evaluations that piloted cell phone lotteries as a way to increase demand for voluntary medical male circumcision. As it turned out, the pilots did not produce results, but if they had, I fear they would have measured a lot of novelty effect.
  • Partial versus general equilibrium. A basic premise of market economics is that if everyone plays on the same level playing field (e.g. has the same information, faces the same prices), no one can earn more profit than anyone else. That outcome is the general equilibrium outcome. By definition, pilot programmes designed to improve economic outcomes measure only partial equilibrium outcomes. That is, only a small sample of the ultimately targeted group receives the economic advantage of the intervention. If those in the treatment group operate in the same markets as those not in the treatment group, the pilot programme can introduce a competitive advantage; it can unlevel the playing field. What is measured is not predictive of what will happen when the intervention is scaled up and the playing field is levelled. I saw an example of this in a pilot intervention that delivered price information to a treatment group of farmers through cell phones, while ensuring that the control farmers did not have access to the same price information. The profits measured for the treatment farmers who had superior information cannot be expected when all the farmers in the market get the same information (a toy numerical sketch after this list illustrates the point).

  • Implementation agents. This threat to ecological validity is one that Bold et al. and Berge et al. highlight, although they lump it under the concept of external validity. Researchers who want to pilot programmes for the purpose of experimentation often find NGOs or even local students and researchers to implement their programme. In order to ensure the fidelity of the implementation, the researchers carefully train the local implementers and sometimes monitor them closely (although sometimes not, which is a topic for another blog post). At scale, programmes need to be implemented by the government or other local implementers not trained and monitored by researchers. But who implements a programme, and how they are managed and monitored, makes a big difference to the outcomes, as shown by Bold et al. and Berge et al.
  • Implementation scale. What we hope when a programme goes to scale is that there will be economies of scale, i.e. the costs increase less than proportionately when you spread the programme over a larger group of beneficiaries. These economies of scale should mean that the cost-effectiveness at scale is even better than what was measured for the pilot. Unfortunately, there can also be diseconomies of scale (i.e. costs increase more than proportionately), particularly for complicated interventions. In addition, at scale the programme may encounter input or labour supply constraints that did not affect the pilot. Supply constraints will be important to consider when scaling up successful pilot programmes for HIV self-tests, for example.
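To make the partial-versus-general-equilibrium point concrete, here is a minimal, purely hypothetical sketch in Python. Every number in it (the farmers, the two market prices, the production cost and the size of the price adjustment at scale) is invented for illustration and not drawn from any study. It mimics a pilot that tells a small treatment group which market pays more, and then asks what happens once every farmer is informed and the better price is bid down:

```python
import random

random.seed(1)

UNIT_COST = 5.0        # hypothetical production cost per unit sold
LOW, HIGH = 8.0, 10.0  # hypothetical prices in two nearby markets


def profit(informed, high_price):
    # An informed farmer sells in the better-paying market; an uninformed one guesses.
    price = high_price if informed else random.choice([LOW, high_price])
    return price - UNIT_COST


# Partial equilibrium (pilot): only 50 of 1,000 farmers are treated, so prices are unchanged.
treated = [profit(True, HIGH) for _ in range(50)]
control = [profit(False, HIGH) for _ in range(950)]
pilot_estimate = sum(treated) / len(treated) - sum(control) / len(control)

# General equilibrium (at scale): everyone is informed, so supply shifts to the
# better market and (in this stylised story) bids its price down to 9.0.
at_scale = [profit(True, 9.0) for _ in range(1000)]
no_programme = [profit(False, HIGH) for _ in range(1000)]
scale_effect = sum(at_scale) / len(at_scale) - sum(no_programme) / len(no_programme)

print(f"Pilot (partial equilibrium) estimate of the profit gain:  {pilot_estimate:.2f}")
print(f"Gain once all farmers are informed (general equilibrium): {scale_effect:.2f}")
```

In this stylised example the pilot measures a clear profit advantage for treated farmers, but once everyone has the same information and the higher price adjusts, the advantage largely disappears; the pilot estimate tells us little about the net impact at scale.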

Unfortunately, if we cannot achieve the same results at scale, the findings from the pilot are of little use. Sure, we can often collect performance evaluation data to observe outcomes at scale, but without the counterfactual analysis, we do not know what the net impact is at scale, which means we cannot conduct cost-effectiveness analysis. In allocating scarce resources across unlimited development needs, we need to be able to compare the cost-effectiveness of programmes as they will ultimately be implemented.

What can we do?

My first recommendation is that we pay more attention to the behavioural theories and assumptions and to the economic theories and assumptions that we explicitly or implicitly make when designing our pilot interventions. If our intervention is a cool new nudge, what is the novelty effect likely to be? How does our small programme operate in the context of the larger market? And so on.

My second recommendation is that we conduct more impact evaluations at scale. I am not arguing that we need to test everything at scale. Working at 3ie, I certainly understand how expensive impact evaluations of large programmes can be. But when careful consideration reveals high threats to ecological validity, a new intervention should not be labelled as successful until we can measure a net impact at scale. Contrary to the arguments of those who oppose impact evaluation at scale, scale does not need to mean that the programme covers every person in the entire country. It just needs to mean that the programme being tested closely approximates, in terms of agents and markets and size, a programme covering the whole country. Put differently, an impact evaluation of a programme at scale should be defined as a programme impact evaluation with minimal threats to ecological validity.

My third recommendation is that we pay more attention to the difference between pilot studies that test mechanisms and pilot studies that test programmes. Instead of expecting to go from pilot to scale, we should expect more often to go from pilot to better theory. Better theory can then inform the design of the full programme, and impact evaluation of the programme at scale can provide the evidence policymakers need for resource allocation.

(This blog post is adapted from a presentation I gave at the American Evaluation Association 2015 conference. Registered users can see the full panel session here.)