Ten things that can go wrong with randomised controlled trials

August 19, 2014

I am often in meetings with staff of implementing agencies in which I say things like ‘a randomised design will allow you to make the strongest conclusions about causality’. So I am not an ‘unrandomista’.

However, from the vantage point of 3ie having funded over 150 studies in the last few years, there are some pitfalls to watch for in order to design and implement randomised controlled trials (RCTs) that lead to better policies and better lives. If we don’t watch out for these, we will just end up wasting the time and money of funders, researchers and the intended beneficiaries.

So here’s my list of top ten things that can go wrong with RCTs. And yes, you will quickly spot that most of these points are not unique to RCTs and could apply to any impact evaluation. But that does not take away from the fact that they still need to be avoided when designing and implementing RCTs.

  1. Testing things that just don’t work: We have funded impact evaluations in which the technology at the heart of the intervention didn’t work under actual field conditions. We don’t need a half-million-dollar impact evaluation to find this out. In such cases, a formative evaluation, which includes small-scale field testing of the technology, should precede an RCT.
  2. Evaluating interventions that no one wants: Many impact evaluations fail because there is little or no take-up of the intervention.  If 10 per cent or fewer of intended beneficiaries are interested in an intervention, then we don’t need to evaluate its impact. It will not work for more than 90 per cent of the intended beneficiaries because they don’t want it.  The funnel of attrition is a tool that can help us understand low take-up and assess whether it can be fixed. But designers commonly over-estimate benefits of their interventions whilst under-estimating costs to users. So the programme that is implemented may simply be unappealing or inappropriate. Like point one, this point also relates to intervention design rather than evaluation. But this is an important point for evaluators to pay attention to since many of our studies examine researcher-designed interventions. And here again, a formative evaluation prior to the impact evaluation will give information on adoption rates, as well as the facilitators of and barriers to adoption.
  3. Carrying out underpowered evaluations: Studies are generally designed to have power of 80 per cent, which means that one fifth (20 per cent) of the time the intervention works, the study will fail to detect that it does. In reality, the actual power of many RCTs is only around 50 per cent, so the RCT is no better than tossing a coin at correctly finding out whether an intervention works. Even when power calculations are properly done, studies can end up underpowered. Most often this is because the impact the project is assumed to have is much larger than the impact it actually has, or because assumptions about adoption are over-optimistic, so the true impact cannot be detected (a minimal power-calculation sketch follows this list).
  4. Getting the standard errors wrong: Most RCTs are cluster RCTs, in which random assignment is at a higher level than the unit at which outcomes are measured. So, an intervention is randomly assigned to schools, but we measure child learning outcomes; or it is randomly assigned to districts, but we measure village-level outcomes. The standard errors in these cases have to be adjusted for this clustering, which makes them larger, so we are less likely to find an impact from the intervention. Studies which don’t adjust the standard errors in this way may therefore incorrectly find an impact where there is none. And if clustering is not taken into account in the power calculations, then an underpowered study with too few clusters will almost certainly be the result (see the design-effect sketch after this list).
  5. Not getting buy-in for randomisation: The idea of random allocation of a programme remains anathema to many programme implementers, despite the many arguments in favour of RCT designs that could overcome their objections. Getting buy-in for randomisation can thus be a difficult task. Buy-in is needed across all relevant agencies, and at all levels within those agencies. The agreed random assignment may fail if the researchers fail to secure the buy-in of a key agency for the implementation of the impact evaluation. Political interference from above or at the local level, or even the actions of lower-level staff in the implementing agency, can act as a stumbling block to the implementation of random assignment. This leads to…
  6. Self-contamination: Contamination occurs when the control group is exposed to the same intervention or another intervention that affects the same outcomes. Self-contamination occurs when the project itself causes the contamination. Such contamination may occur through spillovers, such as word of mouth in the case of information interventions. It could happen if the people in the control group use services in the treatment area. But it can also occur when staff from the implementing agency are left with unutilized resources from the project area, so they deliver them to the control group.
  7. Measuring the wrong outcomes: The study may be well conducted but fail to impress policymakers if it doesn’t measure the impact on the outcomes they are interested in, or those which matter most to beneficiaries. Do women value time, money, control over their lives or their children’s health?  Which of these outcomes should we measure in the case of microfinance, water, sanitation and hygiene interventions and so on? A common reason that important outcomes are not measured is that unintended consequences, which should have ideally been captured in the theory of change, were ignored. Prior qualitative work at the evaluation design stage and engagement with policymakers, intended beneficiaries and other key stakeholders can reduce the risks of this error.
  8. Looking at the stars: The ‘cult of significance’ has a strong grip on the economics profession, with far too much attention paid to statistical significance (the number of stars a coefficient has in the table of results) and too little to the size and importance of the coefficient. Researchers can therefore miss the fact that a highly significant impact is small in absolute terms and too small to be of interest to policymakers. Where the intervention has a clear single outcome, cost effectiveness is a good way of reporting impact, preferably in a table of comparisons with other interventions affecting the same outcome (a worked cost-effectiveness comparison follows this list). Where researchers have followed 3ie’s advice to take this approach, it has sometimes reversed the policy conclusion that would have been drawn from statistical significance alone.
  9. Reporting biased findings: Studies should report and discuss all estimated outcomes, and preferably these outcomes should have been identified at the evaluation design stage. The design should also be registered, for example in 3ie’s Registry for International Development Impact Evaluations. Many studies focus unduly on significant coefficients, often the positive ones, discounting ‘perverse’ (negative) and insignificant results. Or, even where there is no impact, the authors still conclude that the intervention should be scaled up, possibly because of publication bias, because it is a researcher-designed intervention, or because they have fallen foul of the biases that push traditional evaluations to favour the intervention.
  10. Failing to unpack the causal chain: Causal chain analysis is necessary to understand how an intervention was implemented, and how and why it worked, for whom and where. Researchers are often left to speculate in interpreting their findings because they failed to collect data on the intermediate variables that would have allowed them to test their interpretation. The theory of change needs to be specified at the design stage; the evaluation questions need to be based upon the theory of change; and both factual and counterfactual analysis should be conducted to understand the whole causal chain.
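
To make point 3 concrete, here is a minimal power-calculation sketch for a two-arm, individually randomised trial. All the numbers (the assumed 0.40 standard-deviation impact, the true 0.20 SD impact, 50 per cent take-up and 100 participants per arm) are hypothetical, chosen only to illustrate how over-optimistic assumptions leave a study underpowered.

```python
# Illustrative power calculation for a two-arm, individually randomised trial.
# All effect sizes, take-up rates and sample sizes below are hypothetical.
from scipy.stats import norm

def power_two_arm(n_per_arm, effect_sd, alpha=0.05):
    """Approximate power to detect a mean difference of effect_sd
    (in standard-deviation units) with n_per_arm observations per group."""
    se = (2 / n_per_arm) ** 0.5          # SE of the difference, in SD units
    z_crit = norm.ppf(1 - alpha / 2)     # critical value for a two-sided 5% test
    return 1 - norm.cdf(z_crit - effect_sd / se)

n = 100                                   # sample chosen assuming a 0.40 SD impact
print(round(power_two_arm(n, 0.40), 2))   # ~0.81: the 80% design target is met on paper

# If the true impact is only 0.20 SD, and only half the treatment group takes up
# the intervention so the intention-to-treat effect falls to 0.10 SD:
print(round(power_two_arm(n, 0.20), 2))        # ~0.29
print(round(power_two_arm(n, 0.20 * 0.5), 2))  # ~0.11: far worse than a coin toss
```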
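
On point 4, the standard correction for clustering is the design effect, 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intra-cluster correlation. The sketch below uses made-up values (40 pupils per school, an ICC of 0.15) to show how sharply clustering shrinks the effective sample size.

```python
# Illustrative design-effect calculation for a cluster RCT.
# The cluster size and intra-cluster correlation (ICC) are hypothetical.

def design_effect(cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

n_pupils, pupils_per_school, icc = 4000, 40, 0.15
deff = design_effect(pupils_per_school, icc)    # 1 + 39 * 0.15 = 6.85
effective_n = n_pupils / deff                   # roughly 584 of the 4,000 pupils

print(f"Design effect: {deff:.2f}")
print(f"Effective sample size: {effective_n:.0f} of {n_pupils} pupils")
# Naive standard errors ignore this factor and are too small by sqrt(deff), about 2.6,
# which is how unadjusted analyses 'find' impacts that are not really there.
```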
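
And on point 8, a back-of-the-envelope cost-effectiveness comparison, again with entirely hypothetical costs and impacts, shows how a highly significant but tiny effect can still be a poor buy compared with an alternative intervention affecting the same outcome.

```python
# Hypothetical cost-effectiveness comparison for two interventions targeting
# the same outcome (additional years of schooling per child).
interventions = {
    # name: (cost per child in USD, impact in additional years of schooling)
    "A (tiny but highly significant effect)": (50.0, 0.02),
    "B (modest effect, modest significance)": (30.0, 0.10),
}

for name, (cost, impact) in interventions.items():
    print(f"Intervention {name}: ${cost / impact:,.0f} per additional year of schooling")

# A costs $2,500 per additional year of schooling, B only $300: a comparison
# that a table of starred coefficients alone would never reveal.
```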

This list of pitfalls is not meant to stop researchers from doing RCTs or to warn policymakers off using their findings. As I said at the outset, an RCT is a design that allows us to make the strongest conclusions about causality. But RCTs fail to live up to their potential if they fall into any of the ten traps of design and implementation listed above. So, let’s carry out better RCTs for better policies and better lives. How 3ie tackles these challenges will be the subject of a future blog.

 

 




4 Comments on “Ten things that can go wrong with randomised controlled trials”

  1. Mustafa

    We shall be waiting for the promised future blog to know how 3ie tackles these challenges.

  2. Lauchlan Munro

    The 11th thing that can go wrong is doing the RCT at the wrong time. RCTs often assume that, after a certain time, the impacts “must” be detectable. But the relationship between time and impacts is often non-linear, and there are many possible sets of relations between time elapsed and impacts, depending on the type of project/programme. Projects like vaccination campaigns may yield benefits almost immediately, for example, but in bridge construction, there are likely to be no impacts until the bridge is complete, then there will be a steady level of benefits flowing for a long time. A capacity building programme, or a gender rights programme, may yield no detectable benefits for a long time, but when the benefits come, they may take off exponentially. See Michael Woolcock 2009 “Toward a plurality of methods in project evaluation: a contextualised approach to understanding impact trajectories and efficacy”, Journal of Development Effectiveness, Vol. 1, No. 1.

  3. Daniele Giovannucci

    Your common-sense frankness about RCTs is appreciated and the lessons of 3IE’s considerable experience are very valuable.

    We particularly value distilled wisdom around things like not being ruled merely by mathematics, as in the case of considering the size and importance of the coefficient and not merely the statistical significance. Similarly, I would suggest not being limited by the causal chain, and would add the concept of employing the range of appropriate tools, including qualitative ones, that are needed for a holistic interpretation of the multiple factors (policy, ecological, social, economic, etc.) that are often at play in any assessment.

    1. Bart Doorneweert

      Great comment, Daniele. In fact, you should stress the need for qualitative analysis even further. A good qualitative, exploratory study could help prevent many of the points above:

      1. Testing things that just don’t work: Do the homework on your users, understand what is relevant to them. Then build a prototype that puts together some really sound and contextually relevant hypotheses. No more inceptions based only on quantitative baseline studies!

      2. Evaluating interventions that no one wants: Same as 1. Get some traction on your solution first before considering anything at scale. See it moving with your own eyes

      3. Carrying out underpowered evaluations: If you understand how and where you’re getting traction, then you’ll smell an RCT that is off right from the paper

      7. Measuring the wrong outcomes: Again, understand your traction, and you’ll easily be able to come up with a very parsimonious set of indicators that capture where you want the project to go to.

      10. Failing to unpack the causal chain: If you don’t look at the context of the user, and the journey this user has to take through your intervention to arrive at an outcome, you don’t understand what you’re doing, and of course your project will fail. Test early on with mock-ups or demonstrations to go through the whole idea of the intervention. Observe what people are doing from 2 holes in a proverbial cardboard box. Then collect insights, and make your causal chain.

      Development totally sucks at design. And there are designers who have no role in development yet. It is high time to include them, and to build in some budget for sound qualitative work and testing before we scale the brilliant technologies we aim to bring to all worlds.
