Learning power lessons: verifying the viability of impact evaluations

June 11, 2018

Learning from one’s past mistakes is a sign of maturity. By that metric, 3ie is growing up. We now require pilot research before funding most full impact evaluation studies. We developed this requirement to address several issues, including assessing whether intervention uptake is sufficient, verifying whether the expected or detectable effect is reasonable and determining how similar participants are within clusters. All of these interrelated issues share an origin in a problem 3ie recognized a few years ago (highlighted in this blog): eager impact evaluators frequently jump into studies with incomplete information about the interventions they are evaluating. While the evaluation question may seem reasonable and justified to everyone involved, inadequate background information can cause miscalculations that render years of work and hundreds of thousands of dollars meaningless. Low intervention uptake, unrealistic expected or detectable effects and underestimated intracluster correlation coefficients (ICCs) may each result in insufficiently powered research and thus waste valuable evaluation resources.

Increasing the accuracy of evaluation assumptions matters. Insufficient power can torpedo an evaluation, which is the motivation behind 3ie’s proposal preparation pilot phase grants. By providing a small grant to demonstrate the viability of an evaluation, we aim to maximize the effectiveness of our limited resources to fund answerable evaluation questions. The formative research grants provide evidence of adequate intervention uptake while validating the accuracy of power calculation assumptions. We reached this point by learning from previous power assumption missteps.

Lesson one: intervention uptake

Our first power lesson revolves around unreasonable intervention uptake assumptions. Low uptake may arise from several factors, including an intervention designed with insufficient knowledge of local culture or low demand among the intended beneficiaries. Low uptake immediately endangers the usefulness of a proposed evaluation, because a sample sized for higher uptake may be too small to detect a significant change, or to credibly establish a null result, among treatment recipients. Pilot studies help to validate the expected uptake of interventions, and thus enable correct calculation of sample size while demonstrating the viability of the proposed intervention.
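A rough sketch of why uptake matters for sample size, assuming a simple two-arm, individually randomized comparison of means, an intention-to-treat analysis, no uptake in the control group and no effect on people who do not take up the intervention. Under those assumptions the average effect across everyone assigned to treatment shrinks in proportion to uptake, so the required sample grows with the inverse square of uptake. The function names and the 0.3 standard deviation effect are illustrative and do not come from any 3ie study.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_sd, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided comparison of two means.

    effect_sd is the difference in means in standard-deviation units.
    Uses the usual normal approximation: n = 2 * (z_alpha + z_beta)^2 / effect^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_sd ** 2)


def n_per_arm_with_uptake(effect_on_takers_sd, uptake, alpha=0.05, power=0.80):
    """Sample size per arm when only a fraction `uptake` of the treatment arm
    actually receives the intervention (assumption: non-takers see no effect)."""
    diluted_effect = effect_on_takers_sd * uptake
    return n_per_arm(diluted_effect, alpha, power)


# Hypothetical effect of 0.3 SD among people who take up the intervention.
for uptake in (1.0, 0.8, 0.5):
    print(uptake, n_per_arm_with_uptake(0.3, uptake))
# 1.0 175
# 0.8 273
# 0.5 698
```

In this illustration, halving uptake roughly quadruples the sample needed, which is why a pilot estimate of uptake is worth having before the main study is sized.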

One example of an incorrect uptake assumption comes from a 3ie-funded evaluation in 2015. The intervention used a cadre of community health workers to deliver HIV medicines (antiretrovirals) to “stable” patients. These are patients who had been on treatment for at least six months and had test results that indicated that the virus was under control. During study enrollment, the evaluators realized that far fewer patients qualified as “stable” than they had anticipated. In addition, it was taking much longer than expected to get test results confirming eligibility. These two challenges resulted in much slower and lower enrollment than expected. In the end, the researchers needed three extensions, expansion to two additional districts and nearly US$165,000 in additional funding to complete their study in a way that allowed them to evaluate the effects they hypothesized.

Lesson two: expected effect sizes

Our second lesson stems from poorly justified expectations about changes in outcomes of interest. Many researchers, policymakers and implementers expect interventions to produce substantial positive changes in various outcomes for recipients. Compounding this potential error, many researchers then use this optimistic assessment as their “minimum detectable effect.” Powering a study on unrealistic expectations will likely leave it underpowered, because detecting smaller changes in the outcome of interest requires larger sample sizes. By ground-truthing the expected effectiveness of an intervention, researchers can both recalculate their sample size requirements and confirm with policymakers the intervention’s potential impact.

Knowing what will be “useful” to a policymaker should also inform how researchers design and power an evaluation. Policymakers have little use for knowing that an intervention caused a 5 percentage point increase in an outcome if that change is not clinically relevant, not large enough to save the government money or not large enough to create a meaningful difference for the beneficiary. At the same time, if a 10 percentage point increase would make the policymaker very excited to expand a program, but the implementers or researchers “hope” that the intervention will result in a 25 percentage point increase, powering the study to detect 25 percentage points may be a fatal error. If the intervention “only” produces a 20 percentage point increase, the study will be underpowered, and the evaluators will likely not be able to detect statistically significant changes. Sometimes the researchers’ best choice is to power their study conservatively, allowing for the greatest likelihood of detecting smaller, but still policy-relevant, levels of impact.
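The arithmetic behind this example can be sketched as follows, assuming a two-arm, individually randomized study with a binary outcome, a two-sided 5 percent test, 80 percent power and a purely hypothetical control-group rate of 40 percent (the text above does not give a baseline). The function names are illustrative and the formulae are the standard normal approximations, not a reproduction of 3ie's calculator.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def n_per_arm_two_proportions(p_control, effect, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect `effect` (in proportion units,
    e.g. 0.10 for 10 percentage points) over a control rate of `p_control`."""
    p_treat = p_control + effect
    variance_sum = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    z_total = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(z_total ** 2 * variance_sum / effect ** 2)


def power_two_proportions(p_control, effect, n_per_arm, alpha=0.05):
    """Approximate power for a given per-arm sample size.
    Ignores the negligible probability of rejecting in the wrong direction."""
    p_treat = p_control + effect
    se = sqrt((p_control * (1 - p_control) + p_treat * (1 - p_treat)) / n_per_arm)
    return Z.cdf(effect / se - Z.inv_cdf(1 - alpha / 2))


baseline = 0.40  # hypothetical control-group rate
for effect in (0.25, 0.20, 0.10):
    print(effect, n_per_arm_two_proportions(baseline, effect))
# 0.25 59
# 0.2 95
# 0.1 385

# Study sized for the hoped-for 25 points, but the true effect is 20 points:
n_hoped = n_per_arm_two_proportions(baseline, 0.25)
print(round(power_two_proportions(baseline, 0.20, n_hoped), 2))  # roughly 0.60
```

Under these assumptions, a study sized for the hoped-for 25 percentage points has only about a 60 percent chance of detecting a true 20 point effect, and sizing it for the policy-relevant 10 points takes more than six times as many participants per arm.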

An example of unrealistic expected effects comes from a 3ie-funded evaluation of how cash transfers influence livelihoods and conservation in Sierra Leone (final report available here). The researchers designed their randomized controlled trial to measure both the influence of earned versus windfall aid and the effect of community versus individual aid. The researchers note that the implementing agency expected the different aid interventions to cause large changes in economic, social and conservation outcomes. Unfortunately, after visiting the intervention and control areas six times over three years, the researchers were unable to detect any consistent, statistically significant impacts on these outcomes, either relative to control or between the treatment arms. While they estimated some differences, they argue that the study was underpowered to show these differences to be statistically significant.

Lesson three: outcome intracluster correlation coefficients

Our third lesson focuses on ICCs. Many researchers assume ICCs, either based on previous studies (which oftentimes themselves assumed the inter-relatedness of their samples) or based on a supposed “rule of thumb” for an ICC that does not exist. ICCs may also vary with time and place. Underestimating one’s ICC may lead to underpowered research, as higher ICCs require larger sample sizes to account for the similarity of participants within clusters.

Of all of the evaluation design problems, an incomplete understanding of ICCs may be the most frustrating. This is a problem that does not have to persist. Instead of relying on assumed ICCs, or on ICCs for outcomes that are only tangentially related to the outcomes of interest for the proposed study, current impact evaluation researchers could simply report the ICCs from their research. The more documented ICCs there are in the literature, the less researchers would need to rely on assumptions or mismatched estimates, and the lower the likelihood of discovering that a study is underpowered because of insufficient sample size.
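Reporting an ICC does not require much extra work. As a minimal sketch, assuming outcome data grouped by cluster and using the classical one-way ANOVA estimator (one of several possible estimators, swapped in here for illustration rather than taken from the 3ie working paper), the calculation looks like this. The function name and the example scores are hypothetical.

```python
from statistics import fmean


def anova_icc(clusters):
    """One-way ANOVA estimate of the intracluster correlation coefficient.

    `clusters` is a list of lists, one inner list of outcome values per cluster.
    Returns (MSB - MSW) / (MSB + (m0 - 1) * MSW), where m0 is the average
    cluster size adjusted for unequal cluster sizes.
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)
    grand_mean = fmean([x for c in clusters for x in c])

    # Between-cluster and within-cluster sums of squares.
    ss_between = sum(len(c) * (fmean(c) - grand_mean) ** 2 for c in clusters)
    ss_within = sum((x - fmean(c)) ** 2 for c in clusters for x in c)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Adjusted average cluster size; equals the common size when clusters are equal.
    m0 = (n_total - sum(s * s for s in sizes) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (m0 - 1) * ms_within)


# Hypothetical test scores for pupils in three schools.
scores_by_school = [
    [52, 55, 61, 58],
    [70, 66, 73, 69],
    [45, 50, 48, 53],
]
print(round(anova_icc(scores_by_school), 2))
```

The same function could be applied to changes in an outcome rather than levels by passing each participant's follow-up value minus their baseline value.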

Recently 3ie funded a proposal preparation grant for an evaluation of a Nepalese education training intervention. In their proposal, the researchers, based on previous evaluations, assumed an ICC of 0.2. After winning the award, the researchers delved deeper into two nationally representative educational outcomes datasets. Based on that research, they calculated a revised ICC of 0.6. This tripling of the ICC ultimately forced them to remove one intervention arm to ensure they had a large enough sample size to measure the effect of the main intervention of interest. This is a good example of the usefulness of 3ie’s new approach to most evaluation studies, as the researchers’ sample size recalculations gave them the greatest likelihood of properly measuring the effectiveness of this education intervention.
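To see why a tripled ICC can force a change of design, the standard design effect for equal-sized clusters, 1 + (m - 1) × ICC, can be applied to the two ICC values above. This is only a sketch: the cluster size of 25 pupils and the 175 participants per arm needed under individual randomization are hypothetical figures chosen for illustration, not numbers from the Nepal study.

```python
from math import ceil


def design_effect(icc, cluster_size):
    """Variance inflation from cluster randomization with equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def clusters_per_arm(n_individual, icc, cluster_size):
    """Clusters needed per arm, given the per-arm sample size that would be
    enough under individual randomization (`n_individual`)."""
    inflated_n = n_individual * design_effect(icc, cluster_size)
    return ceil(inflated_n / cluster_size)


N_INDIVIDUAL = 175  # hypothetical per-arm sample under individual randomization
CLUSTER_SIZE = 25   # hypothetical pupils sampled per school

for icc in (0.2, 0.6):
    print(icc,
          round(design_effect(icc, CLUSTER_SIZE), 1),
          clusters_per_arm(N_INDIVIDUAL, icc, CLUSTER_SIZE))
# 0.2 5.8 41
# 0.6 15.4 108
```

With these illustrative inputs, moving from an ICC of 0.2 to 0.6 raises the number of clusters needed per arm from 41 to 108. When the number of available clusters is fixed, dropping an arm is one way to give the remaining arms enough clusters.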

Without accurate assumptions, researchers lose their time, donors lose their money, stakeholders lose interest and policymakers’ questions remain unanswered. Power calculations and the subsequent sample size requirements underlie all impact evaluations. The evaluation community has the power to correct many of these miscalculations. Please join us in raising the expectations for impact evaluation research by holding proposals to a higher bar. We can all do better.

More resources: 3ie published a working paper, Power calculation for causal inference in social science: sample size and minimum detectable effect determination, that draws from real world examples to offer pointers for calculating statistical power for individual and cluster randomised controlled trials. The manual is accompanied by the Sample size and minimum detectable effect calculator©, a free online tool that allows users to work directly with each of the formulae presented in the paper.

This blog is a part of a special series to commemorate 3ie’s 10th anniversary. In these blogs, our current and former staff members reflect on the lessons learned in the last decade working on some of 3ie’s major programmes.




3 Comments on “Learning power lessons: verifying the viability of impact evaluations”

  1. Steven Glazerman

    These are important lessons and I’m glad to see them articulated like this. It speaks to the need for researchers to publish their ICCs for the benefit of other researchers who follow in their footsteps. Much like these papers:
    • Kelcey, Shen, and Spybrook (Evaluation Review, Africa)
    • Hedges and Hedberg 2007 (EEPA)
    • Hedges and Hedberg 2013 (Evaluation Review)
    • Jacob, Zhu, and Bloom (JREE, Education)
    • Westine, Spybrook, and Taylor (Evaluation Review)
    • Schochet (JEBS)

    One quibble though: under-powered studies are not a total waste, because of (a) meta-analysis and (b) Bayesian analysis, which allows us to update our priors with any data, even if it is insufficient by classical frequentist standards. The ability to reject the null at 0.05 is not the only reason for doing research, although if you are spending precious development dollars to do evaluation, it’s not a bad threshold to use.

    That having been said, your first lesson, about uptake (and I would add, program fidelity) is a great one, probably even more important than statistical power. Program implementers routinely overestimate their ability to provide services that are truly innovative, that test an important idea, and that beneficiaries will want to participate in.

  2. David Levine

    Regarding ICCs: It would be great to have a database of ICCs for different outcomes. Importantly, the ICCs would need to be in changes, as well as levels. I can find ICCs for levels pretty easily, but it is hard to find changes in published results. Can 3ie take on this task?

  3. Anna Heard

    Thanks for these insightful comments.

    @Steven: Thanks for providing some examples of published ICCs from the literature. We’re hoping this trend catches on! Regarding under-powered analysis, we agree with your point. The main issues here are limited funding and timeliness. Impact evaluations are usually expensive and long. It’s a tough sell to donors if the design prevents meaningful interpretation without additional analysis. And, when a study could have been designed better if the researcher(s) used more appropriate assumptions, it may not be a total waste, but it is definitely not the best use of precious resources.

    @David: 3ie is currently setting up infrastructure to improve our database management. Documenting ICCs isn’t in our plans at the moment. If we found a funder, adjustments might be possible….

    -Anna and Ben

