There are four situations in which peer reviewers shouldn't talk about power. If you review manuscripts before publication, avoiding talk of power in these four situations will make you a top-notch reviewer who provides constructive, actionable advice to researchers.
For excellent reasons, talking about p-values and statistical significance has become very unpopular. Unfortunately, talking about power has become the new trend among peer reviewers. While many peer reviewers are knowledgeable in statistics and provide valuable comments to researchers, we have been seeing an increasing number of papers returned with the comment "Was a power calculation done?" when it is clear that the peer reviewer doesn't understand the concept of statistical power. If you are a peer reviewer, this post will help you talk smartly about power.
To talk smart about power as a peer reviewer, you first need to understand what power is. Keep this simple definition in mind when talking about power:
Power is the probability of rejecting the null hypothesis when in fact the null hypothesis is false.
Read this over a couple of times if you aren't certain you understand it. Read it again the next time you comment on power. Understanding this definition is vital to providing constructive advice, and it makes clear that there are four situations in which peer reviewers should not talk about power.
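If it helps to see the definition in action, here is a minimal simulation sketch (not from the original post; the means, standard deviation, and sample size below are made-up numbers). It generates many hypothetical studies in which the null hypothesis is genuinely false and counts how often a t-test rejects it; that proportion is the power.

```python
# A minimal sketch: estimating power by simulation.
# All numbers (means, SD, sample size) are made up purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_per_group = 30
true_difference = 10.0  # the null hypothesis (no difference) is in fact false
sd = 20.0

n_sims = 10_000
rejections = 0
for _ in range(n_sims):
    control = rng.normal(100.0, sd, n_per_group)
    treated = rng.normal(100.0 + true_difference, sd, n_per_group)
    _, p_value = stats.ttest_ind(control, treated)
    rejections += p_value < alpha

# Power: the proportion of these hypothetical studies that reject H0
# when H0 is genuinely false.
print(f"Estimated power: {rejections / n_sims:.2f}")
```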
When a study rejects its null hypothesis, it is pointless to talk about power.
Looking back at the definition above, we see that power addresses the possibility of failing to reject the null hypothesis. If the study has a small p-value and rejects the null hypothesis, it is mathematically silly to talk about power.
The flip side of power is sample size. Again, when the null hypothesis is rejected, it is pointless to talk about the adequacy of the sample size. Clearly the sample size was large enough to reject the null hypothesis: both the sample size and the power were adequate.
I once had a peer reviewer comment that "although the study rejected the null hypothesis the sample size was small. Did the study have adequate power?" Here the peer reviewer showed a clear lack of understanding of power by confusing it with statistical significance. The chance that a study accidentally rejects a null hypothesis that is actually true is not the study's power... that is the study's alpha level (or p-value).
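To see why that comment misses the mark, here is a companion to the sketch above (same made-up setup), this time with the null hypothesis true: the proportion of studies that accidentally reject it sits near the alpha level of 0.05, no matter how much power the study has.

```python
# Companion sketch: here the null hypothesis is TRUE (no real difference),
# so "accidental" rejections occur at roughly the alpha level, not the power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_per_group = 30
sd = 20.0

n_sims = 10_000
false_rejections = 0
for _ in range(n_sims):
    group_a = rng.normal(100.0, sd, n_per_group)
    group_b = rng.normal(100.0, sd, n_per_group)  # same distribution as group_a
    _, p_value = stats.ttest_ind(group_a, group_b)
    false_rejections += p_value < alpha

print(f"Accidental rejection rate: {false_rejections / n_sims:.2f}")  # about 0.05
```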
What is a simple way to really up your game as a peer reviewer? Don't talk about power when a study rejects its null hypothesis.
When little is known about the target population, it can be dangerous to do a power calculation.
Making an appropriate power calculation is not trivial, as the math behind power calculations is a bit strange. First, because the calculations are done before any data are collected, we often don't know enough about the target population to make accurate calculations. Second, because power is the probability of rejecting the null hypothesis under an alternative that may or may not be true, we are forced to make some rather interesting speculations.
What do you need to know to calculate sample size based on a power calculation?

1. The null and alternative hypotheses
2. The statistical test that will be used
3. The level of statistical significance
4. The effect size
5. The target power
6. How to actually make the calculation

Look complicated? It is. Let's go through each of these six steps one at a time.
1. The null and alternative hypotheses must be specified fully before starting the study. They must be worded correctly and follow general statistical best practice for hypotheses. Importantly, the hypotheses specified at the study design phase must be the same as the ones used to analyze the data. If the researchers change the research hypotheses, the power calculation must be redone before data are collected. If the researchers change the hypotheses after data collection - a statistically indefensible but common practice - the power calculation is meaningless.
2. The statistical test must also be specified for the power calculation. Again, this cannot be changed later. If the researcher initially planned to do a t-test but switched to a linear regression, the power calculation is no longer valid.
3. Unfortunately, to do a power calculation the researchers must specify a statistical level of significance. As we discussed in a previous blog post, setting a level of significance is a bit controversial, and the number is always a guess. Usually, we choose 0.05. Unfortunately, the power calculation relies heavily on this number.
4. Effect size is a complex calculation, and it is at this step that the whole process of calculating power often falls apart. In most cases, it requires far more information than the researchers have. Take, for example, a recent study I participated in that looked at the time for school-aged children to apply different types of tourniquets. To calculate the effect size for the time to apply the tourniquet, we would need to specify:

a. the mean time for these children to apply the tourniquet under the null hypothesis;
b. the standard deviation of the time to apply the tourniquet; and
c. the mean time to apply the tourniquet under the alternative hypothesis.
In most cases, we won't know the values of a and b above. Remember that we need the mean and standard deviation for these particular students applying these particular tourniquets. If the study is novel, these values will not be available in the literature; often we need a pilot study to determine them.
What about c? Wait! How can we know the mean time for the alternative hypothesis? Unfortunately, we don't, and can't. Here we need to rely on scientific judgment, not math: researchers must estimate what a meaningful alternative hypothesis would be. For instance, in this study we may want to assume that a difference of 30 seconds is meaningful, while a difference of less than 30 seconds would be meaningless. Sadly, this guess by the investigators plays a huge role in the sample size calculation, and small changes to this number make huge changes in the result (the sketch after step 6 below shows how).
5. Another guess here: specify a target power. Usually, we just use 80%. Again, this is arbitrary and can have a huge effect on the power calculation.
6. Finally, we make the calculation. Making the calculation is not easy, and researchers who are not statisticians should get help. My personal favourite reference is a 374-page textbook by Thomas P. Ryan.
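For readers who want to see what the calculation looks like in practice, here is a minimal sketch using Python's statsmodels library (my choice of tool, not the reference mentioned above). It strings together the guesses from the steps above: hypothetical pilot-study numbers for the tourniquet example, the investigators' 30-second meaningful difference, an alpha of 0.05, and a target power of 80%.

```python
# A minimal sketch of the whole sample size calculation, assuming a
# two-sample t-test. Every number here is a guess or a placeholder,
# exactly as described in the steps above.
from statsmodels.stats.power import TTestIndPower

sd_seconds = 45.0               # (b) assumed SD of application time (hypothetical)
meaningful_diff_seconds = 30.0  # (c) smallest difference judged meaningful
effect_size = meaningful_diff_seconds / sd_seconds  # Cohen's d, about 0.67

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                 # step 3: chosen level of significance
    power=0.80,                 # step 5: chosen target power
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.0f}")

# The answer is extremely sensitive to the guessed inputs: shrink the assumed
# effect size to d = 0.5 and the required n per group jumps to roughly 64.
```

Note how every input is a judgment call; change any one of them and the "required" sample size moves, often dramatically.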
What does all this mean for the peer reviewer? Making a power calculation is not easy. It involves a myriad of assumptions and some complex math. For novel studies investigating a phenomenon about which little is known, an accurate power calculation is probably not possible. A power calculation based on a string of guesses is likely to lead to a false sense of knowledge where none actually exists.
There is little point in doing a power calculation when the researcher cannot control the sample size.
What is the purpose of doing a power calculation anyway? The main reason for this calculation is to determine the appropriate sample size. In many cases, the sample size is out of the control of the researcher, in which case doing a sample size calculation is truly a waste of time. This is common in small studies; often the entire population is sampled.
For instance, consider a study that looks at implementing simulation training in disaster medicine for medical students. The researchers may choose to run a randomized trial in which medical students are randomized to the new training or not. Here, the sample is all medical students at the medical school; the sample size is fixed, so it is pointless to do a power calculation.
When the intervention is harmless, or even beneficial, to the subjects, determining power may be an entirely academic exercise with little practical utility.
Why calculate a sample size anyway? The original purpose of calculating a sample size is to ensure that the research maintains an ethically acceptable approach to its subjects. Sample size calculation is critical with human subjects when they are being tested with a new and unproven medical treatment. For example, consider a study investigating the role of steroids in COVID-19. There is ample evidence from previous studies in other, similar diseases that steroids may be helpful. However, we also know that steroid use is associated with side effects, including gastric ulcers, adrenal gland suppression, avascular necrosis, and delayed wound healing. An appropriate research protocol would have a just-the-right-size sample: large enough to reliably detect a meaningful benefit, but no larger than necessary, so that as few participants as possible are exposed to the potential harms.
In this study of potentially harmful steroids, a rigorous sample size calculation would be mandatory. If there are insufficient data about the mean and standard deviation of the effect on participants, a small pilot study should be done first to ensure that the sample size calculation is correct. The ethics board should insist on an accurate sample size calculation before the study is performed. The peer reviewer should insist on seeing the details of the power calculation. The journal should refuse to publish the study without assurance that the power calculation was correct.
In contrast, consider a study that looks at a relatively harmless treatment: for instance, an educational study that randomizes participants to in-person lessons or video lessons on the fire code. In this case it is quite unlikely that the participants would be harmed by the treatment; in fact, if the baseline is no fire code education at all, harm is very unlikely. For this study, precise calculation of a sample size may not be important, and it may be fine for the researchers simply to try their protocol on a reasonable sample.
Whose responsibility is it, anyway, to ensure that a study is done in an ethical manner? This is primarily the responsibility of the agency granting ethical permission for the research, and the decision about whether the sample size is appropriate should be made before the study is performed.
Journals should ensure that they publish only ethical studies, and certainly peer reviewers should look out for studies that are ethically irresponsible. However, in many cases the lack of a sample size calculation poses no ethical risk to the participants. When studies are small and likely harmless, peer reviewers should not insist on a power calculation.
We saw above that power calculations are difficult and sometimes unimportant. But what about negative studies? If a study fails to reject its null hypothesis, doesn't a power calculation help us decide how to interpret the data?
Personally, I find it much easier to consider this maxim:
Power calculations should be used for study planning. Confidence intervals should be used for study interpretation.
There are two reasons why confidence intervals are preferable to power in the interpretation of negative studies:

1. Once the data have been collected, a "post hoc" power calculation adds nothing new: it is essentially a restatement of the observed p-value in different units.
2. The confidence interval shows the full range of effect sizes that are compatible with the data, so readers can judge for themselves whether a practically important effect has been ruled out.
How can the peer reviewer help if the study fails to reject its null hypothesis? When I am asked to review such a study, rather than asking the researchers "Was a power calculation performed?" I encourage them to look at the confidence interval for the effect size. I suggest that they look at both the upper and lower ends of the interval and, in the discussion, consider what the practical significance would be if each of those values were true.
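As a sketch of that advice, here is one way a researcher might compute and read the confidence interval for a difference in means; the data below are made up, and the point is simply how the two ends of the interval are interpreted.

```python
# A sketch (with made-up data) of interpreting a "negative" study through the
# confidence interval for the effect size, rather than through post hoc power.
import numpy as np
from scipy import stats

control = np.array([118.0, 95.0, 130.0, 102.0, 121.0, 110.0, 99.0, 125.0])
treated = np.array([112.0, 90.0, 128.0, 100.0, 115.0, 105.0, 96.0, 119.0])

n1, n2 = len(control), len(treated)
diff = treated.mean() - control.mean()

# Classic pooled-variance 95% confidence interval for the difference in means.
pooled_var = ((n1 - 1) * control.var(ddof=1) + (n2 - 1) * treated.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)
lower, upper = diff - t_crit * se, diff + t_crit * se

print(f"Difference in means: {diff:.1f} (95% CI {lower:.1f} to {upper:.1f})")
# The useful questions for the discussion section: what would it mean in
# practice if the true difference were as small as `lower`? As large as `upper`?
```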