Requiring fuel gauges: A pitch for justifying impact evaluation sample size assumptions

October 17, 2014

We expect researchers to defend their assumptions when they write papers or present at seminars. Well, we expect them to defend most of their assumptions. The assumptions behind their sample sizes, which come from their power calculations, are rarely discussed. Yet sample sizes and power calculations matter. Power calculations determine the minimum sample size a study requires, which then has to be reconciled with budget constraints. If the sample is too small to provide adequate statistical power, evaluators increase the risk of drawing mistaken conclusions about the effectiveness of interventions (see here for implications of carrying out underpowered evaluations).

Power calculations are performed in a few different ways. After reviewing numerous 3ie grant requests, we’ve learned some lessons about the key power calculation parameters, namely the minimum detectable effect and the intracluster correlation coefficient (ICC).

The choice of the minimum detectable effect is a critical part of the power analysis. The smaller the study’s minimum detectable effect, the larger the sample size required to ensure sufficient power. When the variables being studied have intrinsic meaning (income, production per hectare, etc.), as in many cases in development economics, the minimum detectable effect should simply be the expected raw difference between the population mean of the experimental group and the population mean of the control group.
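To make the link between the minimum detectable effect and sample size concrete, here is a minimal sketch using the standard normal-approximation formula for a two-arm, individually randomized comparison of means. The function name, the 5 per cent significance level, the 80 per cent power, and the illustrative numbers are our assumptions, not values from any proposal.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(mde, sd, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided comparison of two means.

    mde and sd are in the raw units of the outcome (e.g. dollars, hectares).
    Assumes equal arm sizes, a common standard deviation, and no clustering.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mde ** 2)


# Hypothetical outcome with a standard deviation of 100 units.
for mde in (50, 25):
    print(f"MDE = {mde}: about {n_per_arm(mde, sd=100)} observations per arm")
```

Because the minimum detectable effect enters the formula squared, halving it roughly quadruples the required sample.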

Unfortunately, many economists are not following this approach. In many proposals we receive, the minimum detectable effect is standardized or expressed as the minimum detectable effect size, with effect size being the difference in the population means of the two groups divided by the standard deviation of the outcome of interest. As a result, the minimum detectable effect size is reported in units of standard deviations. This standardization trend should change. Economists should base their interventions both on the ability to detect the relevant minimum level of impact and also on the cost effectiveness of the intervention.

Many proposals compound their minimum detectable effect size problems by borrowing Cohen’s classification system, where effect sizes of .20 are small, .50 are medium, and .80 are large. Little justification exists for applying this framework to economic research (see here for discussion of some methodological limitations of Cohen’s classification). It’s unclear both how Cohen’s classification became a rule of thumb for minimum detectable effect sizes and how it’s relevant to economics (or education, examples here and here).

Economists are nonetheless using standardized effect sizes and Cohen’s classification system as a benchmark. In a recent proposal that aimed to evaluate the impact of payments for environmental services, the authors powered their study to detect a minimum change of 0.25 standard deviations in the number of hectares of natural forest conserved (the outcome variable of interest). This minimum detectable effect size corresponds to 0.35 hectares. Relative to the baseline mean of natural forest, which is 1.15 hectares, it corresponds to a 30 per cent increase in natural forest. Although this increase seems quite substantial to us, a minimum detectable effect size of 0.25 standard deviations is considered small according to Cohen’s classification. The minimum detectable effect is more than just a number. Researchers should justify their assumptions by explaining how this minimum detectable effect is relevant for both the cost of the intervention and the impact on treatment populations.
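The arithmetic behind this example can be checked in a few lines. The three inputs are the figures quoted above; the implied standard deviation is derived from them rather than reported in the proposal, and the function names are ours.

```python
def implied_sd(raw_mde, standardized_mde):
    """Standard deviation implied by a raw MDE and its standardized counterpart."""
    return raw_mde / standardized_mde


def relative_change(raw_mde, baseline_mean):
    """Raw MDE expressed as a share of the baseline mean."""
    return raw_mde / baseline_mean


standardized_mde = 0.25  # in standard deviations
raw_mde = 0.35           # hectares
baseline_mean = 1.15     # hectares of natural forest at baseline

print(f"Implied SD: {implied_sd(raw_mde, standardized_mde):.2f} hectares")
print(f"MDE relative to baseline: {relative_change(raw_mde, baseline_mean):.0%}")
```

An effect that Cohen’s labels would call small turns out to be an increase of roughly 30 per cent once it is expressed against the baseline mean.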

Economists should look at how epidemiologists calculate minimum detectable effect sizes. Economists typically provide neither the mean nor the variance of the outcome indicator in their publications. Public health researchers provide both. We attribute this reporting difference to economists using Optimal Design, which does not require the mean and variance, to perform power calculations. Public health researchers conduct power calculations using formulas from Hayes and Bennett, which require the mean and variance. Without these parameters, it is impossible to present the minimum detectable effect as a percentage change. Presenting the minimum detectable effect as a percentage change allows researchers to judge the detectable magnitude of change due to the intervention. We therefore reiterate McKenzie’s call for including the assumed mean and variance of each outcome indicator when reporting power calculation measurements.
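As an illustration of why such formulas force the mean and variance into the open, here is a sketch of the commonly quoted Hayes and Bennett expression for the number of clusters per arm when comparing means. We reproduce it from memory, so please check it against the original paper before relying on it. The cluster size of 20, the between-cluster coefficient of variation k of 0.25, and the equal standard deviations in both arms are our assumptions; the means and standard deviation reuse the forest example above.

```python
from math import ceil
from statistics import NormalDist


def clusters_per_arm(mean0, mean1, sd0, sd1, n_per_cluster, k,
                     alpha=0.05, power=0.80):
    """Clusters per arm for comparing two means in a cluster-randomized design.

    k is the coefficient of variation of true cluster means between clusters.
    Every input is in raw units, so means and variances must be stated.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    within = (sd0 ** 2 + sd1 ** 2) / n_per_cluster  # sampling variation inside clusters
    between = k ** 2 * (mean0 ** 2 + mean1 ** 2)    # variation between clusters
    return ceil(1 + z ** 2 * (within + between) / (mean0 - mean1) ** 2)


# Control mean 1.15 ha, treatment mean 1.50 ha (a 0.35 ha MDE), SD 1.4 ha.
print(clusters_per_arm(1.15, 1.50, 1.4, 1.4, n_per_cluster=20, k=0.25))
```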

The ICC, which measures how similar individuals are within a cluster, is the other major power calculation assumption for cluster RCTs (here’s a related blog). As research moves from examining individuals to examining clusters of individuals, the similarity of those clusters must be accounted for when determining sample size requirements. It is important to accurately account for ICCs as the greater the similarity within a cluster, the greater the number of observations that are needed to adequately power the impact evaluation.
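The usual way to quantify this is the design effect, 1 + (m − 1) × ICC, where m is the number of individuals per cluster. The sketch below assumes equal cluster sizes; the function names and example values are ours.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)


# Hypothetical design: 50 clusters of 20 individuals each.
for icc in (0.0, 0.05, 0.20):
    ess = effective_sample_size(50, 20, icc)
    print(f"ICC = {icc:.2f}: design effect {design_effect(20, icc):.2f}, "
          f"effective sample size {ess:.0f}")
```

Even a modest ICC shrinks the effective sample noticeably when clusters are large, which is why the assumed value deserves justification.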

Like the minimum detectable effect, ICC assumptions determine required sample sizes. These assumptions are also typically unjustified. We receive many proposals with ICCs seemingly pulled from thin air. One recent application based its ICC on a study conducted in a different country, which had a completely different socio-economic background. Another relied on research conducted over a decade ago, with no argument for its current validity. Methods to improve ICC estimates are evolving (repeated measurements appear to increase ICC accuracy). Ideally, researchers should use pilot surveying to calculate actual ICCs. If piloting the intervention is impossible, an alternative is to test multiple, realistic ICCs to determine the study’s power sensitivity and better understand sample size requirements.
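A sensitivity check of this kind might look like the sketch below, which holds the design fixed and recomputes power across a range of ICCs. It uses a normal approximation with equal cluster sizes and ignores small-sample corrections; the design (30 clusters of 20 per arm) and the grid of ICCs are hypothetical, while the 0.35 hectare effect and 1.4 hectare standard deviation come from the forest example above.

```python
from math import sqrt
from statistics import NormalDist


def power_cluster_rct(mde, sd, clusters_per_arm, cluster_size, icc, alpha=0.05):
    """Approximate power of a two-arm cluster RCT to detect a raw difference mde."""
    deff = 1 + (cluster_size - 1) * icc  # design effect
    se = sd * sqrt(2 * deff / (clusters_per_arm * cluster_size))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(mde / se - z_alpha)


# Same design, different ICC assumptions.
for icc in (0.01, 0.05, 0.10, 0.20):
    p = power_cluster_rct(0.35, 1.4, clusters_per_arm=30, cluster_size=20, icc=icc)
    print(f"ICC = {icc:.2f}: power = {p:.2f}")
```

Reporting a table like this alongside the chosen ICC shows readers how much the study’s power depends on that single assumption.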

Online power calculation appendices are one way to increase transparency about assumptions. As the number of studies reporting null results increases, appendices that include minimum detectable effect and ICC assumptions would allow researchers to assess whether null findings are due to a lack of power. Standardizing power calculation ‘fuel gauge’ reporting, through comprehensible minimum detectable effects and justified intracluster correlation coefficients, would improve the accuracy of social science impact evaluation research.

8 Comments on “Requiring fuel gauges: A pitch for justifying impact evaluation sample size assumptions”

  1. Heather Lanthorn

    Great work guys — very informative post.

    Looking forward to a follow-up in which you discuss how researchers might determine a “relevant minimum level of impact,” where relevance is *both* in the academic and the policy/programmatic-influence sense. In some ways, this is about defining expectations or parameters for program ‘success’ and is thus small-p political and operational as much as ‘technical.’

    Also, I think that further discussion on the ICC — how researchers should estimate it in the first place given limited data availability and, then, report it later (so that future researchers can estimate it based on a wider swath of data) would be helpful.

    1. Benjamin DK Wood (Ben and Eric)

      Thank you for your comment Heather.

      We recognize the difficulty with determining a relevant minimum level of impact where relevance is both in the academic and the policy/programmatic-influence sense. While there are a number of ways researchers might begin justifying their MDE, the first step is simply reporting the study’s MDE assumptions.

      There are a few possibilities when trying to estimate an MDE. A good place to start is to explore the estimated effect size of similar past interventions on selected outcomes. Another option is to hold discussions with the intervention program managers and ask them what they consider a relevant minimum effect size to both justify the cost of the intervention and consider their programme to be effective. The bottom line is that the choice should be based on something tangible, not an MDE researchers pulled from nowhere.

      Finally, you might look at the level of the outcome before the intervention starts. When you have a small baseline outcome value, you can expect a larger increase in your outcome than when you start with a higher one. For example, it is easier to achieve a 30% increase in school enrollment when your baseline enrollment is 50% than when your baseline enrollment is 80%.

      We agree that this is an important topic and there is plenty of room for more blogs and research!

  2. Rohit

    Thanks for this – in general, I think this is a very good (and needed) post. Far too many times, studies are either underpowered, or study designs incorporate assumptions in their power calculations that are not adequately motivated.

    That having been said, I want to push back a little bit on some of the points raised. A focus on the standardized effect size can be immensely useful for a number of things: it allows for easier comparison across studies from a purely linguistic perspective by standardizing units and gives one a clearer sense of the magnitudes being discussed. I have no way of knowing how I should think about a 20% increase in profits, or even a $200 increase in profits, but a 0.20 standard deviation increase in profits gives me a sense of how much the increase is when considered against a standard measure. You might argue that the 20% or $200 numbers need to be put in context by presenting some characteristics of the distribution (such as the mean and standard deviation) as well. However, that is precisely what would be needed in order to calculate the standardized effect size as well. Perhaps we are approaching the same issue – standardized effect sizes pulled out of thin air – from different positions, but to my mind, the issue is not the use of standardized effect sizes as you mention. Rather, the issue is with starting off with standardized effect sizes rather than by calculating the standardized effect sizes using information from empirical distributions (whether pilots, or similar samples.) From my experience, a number of economists calculate the standardized effect size, they don’t begin with it; surely that is just as acceptable as presenting non-standardized effect sizes with characteristics of the distribution? From an Occam’s razor perspective though, standardized effect sizes present the information more effectively. Standardized effect sizes pulled out of thin air are indeed problematic, but no more so than talking about non-standardized effect sizes without any discussion of the distribution.

    On the issue of rules-of-thumb, I am also very sympathetic to the argument that one size (say a standardized effect size of 0.2) fits all approaches are bad. However, rules of thumb can still be very useful with suitable caveats. Indeed, while 0.2 might be appropriately considered a moderate effect size with some types of interventions, 0.4 may be far more appropriate with others. Again though, this is not an issue with rules-of-thumb per se, but inappropriate usage of them. Even the Hayes et al. power calculations paper (which you mention) refers to rules-of-thumb that health researchers can consider when computing the intra-cluster correlation (not often over 0.25, very rarely over 0.5). Ultimately, what is required is a well justified assumption, not a total abandonment of heuristics.

    1. Benjamin DK Wood (Ben and Eric)

      Thank you for supporting the general concept behind our blog Rohit. We appreciate your thoughtful response.

      In general, we agree with most of your points. There is a place for MDES in facilitating comparisons between different contexts and as a metric to measure a specific outcome. And if researchers are clearly exploring the distribution of their baseline or pilot or other existing data, we support the use of MDES. In those cases, we would simply urge more justification for the MDES by encouraging researchers to be transparent in their assumptions and list the reasons behind their decision making.

      However, we believe a number of studies use MDES and general rules-of-thumb as a shortcut to circumvent difficult sampling analysis. When using MDES, researchers should justify their ideas of both the expected MDE and the standard deviation. Currently, these MDES, if presented at all, are typically just stated. Our concern is not with the use of MDES per se; our concern is with MDES that appear to be pulled from thin air. We may all be in agreement here.

      As for Hayes’ usage of a rule of thumb for K, the coefficient of variation of true proportions between clusters, we are less sympathetic. With the increasing amount of available data, being able to justify K (and your ICC) using existing data or a pilot survey should be easier. We are less concerned in general with researchers making conservative assumptions, although we still believe they should work to justify their assumptions.

      In regards to Cohen’s classification, we continue to not see its relevance to economics. This system evolved from psychology, where outcomes of interest are less likely to have an intrinsic meaning. Economists are typically using outcomes with an intrinsic meaning, increasing the importance of justifying assumptions behind the ICC and MDE. We are not sure of the background behind Hayes’ choice of a K value of 0.25. That being said, we do not believe there is a parallel with economics and Cohen’s classification. Cohen’s system was not designed for economics and we are unaware of a formal argument explaining why we should inherently believe the small, medium, and large effect size thresholds hold for these types of research inquiries.

  4. Hugh Waddington

    Sounds like someone (not me) really needs to put together a database of existing ICCs for social and economic development variables, if there’s not one out there already…?

    1. Benjamin DK Wood (Ben and Eric)

      We are unaware of an existing ICC database. Anyone else?

      An ICC database would be helpful, but would involve a lot of work. ICCs are available for a number of outcomes, depending on the questions asked in the survey. These ICCs are also constrained to the enumeration area, although they might arguably be relevant to other nearby communities. We believe increasing the transparency around ICCs in existing studies, and encouraging researchers to make their data available for post-intervention ICC calculation, are good first steps towards increasing the accuracy of ICC evaluation assumptions.
