P-hacking in Scientific Research

How statistics can be weaponized

In 2015, a study indicated that eating a 1.5-ounce bar of dark chocolate every day would aid in weight loss (1). How did a real clinical study, with recruited human participants and legitimate data, reach such a ridiculous conclusion (2)? The answer is P-hacking (3): the practice of adjusting how data is analyzed until it yields a desired outcome.

In general, scientific research is conducted by collecting data and observing how changing independent variables affects dependent ones. The gold standard for structuring such research nowadays is Null Hypothesis Significance Testing (NHST) (4). When performing research using NHST, researchers come up with two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis simply states that there is no association between the two variables, while the alternative hypothesis states that the proposed effect is real. By analyzing the collected data, scientists can show that the observed effect is unlikely to have occurred simply by chance and thereby ‘reject’ the null hypothesis (5).
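To make this concrete, here is a minimal sketch of an NHST comparison in Python. The data are synthetic and the group names are illustrative, not taken from any real study: two groups are compared with a two-sample t-test, and the null hypothesis is rejected only if the resulting p-value falls below the usual threshold.

```python
# Minimal NHST sketch with synthetic data (illustrative only).
# Requires numpy and scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical weight-change measurements (kg) for two groups of 20 people.
# Both groups are drawn from the same distribution, so the null hypothesis
# ("no difference between the groups") is actually true here.
control = rng.normal(loc=0.0, scale=1.0, size=20)
chocolate = rng.normal(loc=0.0, scale=1.0, size=20)

# Two-sample t-test: null = equal group means, alternative = different means.
t_stat, p_value = stats.ttest_ind(chocolate, control)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the difference looks 'significant'.")
else:
    print("Fail to reject the null hypothesis.")
```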

A result’s p-value is the probability of seeing an effect at least as large as the one observed if there were actually no relationship between the independent and dependent variables, that is, if only random chance were at work. For the past several decades, scientists have taken p = 0.05, roughly a 1-in-20 chance of the results arising randomly, as the gold standard threshold for rejecting the null hypothesis. When p is less than 0.05 for a specific result, the result is called “statistically significant”.
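A quick simulation (an illustrative sketch, not part of any study cited here) shows what that threshold means in practice: when there is truly no effect, roughly 1 experiment in 20 will still cross p < 0.05 purely by chance.

```python
# Simulate many experiments in which the null hypothesis is true and count
# how often they come out "statistically significant" anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Two groups drawn from identical distributions: there is no real effect.
    a = rng.normal(size=20)
    b = rng.normal(size=20)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Prints a value close to 5%, i.e. about 1 in 20.
print(f"'Significant' results with no real effect: {false_positives / n_experiments:.1%}")
```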

Popular scientific journals have been shown to disproportionately favor publishing statistically significant results (6). Unfortunately, researchers need to publish in these journals if they want to get funding, so they are incentivized to pursue and publish significant findings, a pressure described as “publish or perish” (4). To get statistically significant results when they might not exist, scientists can abuse the flaws of NHST and p-values via P-hacking.

P-hacking primarily occurs when researchers selectively report data. How can they do this? Researchers gather their data sets, choose how to interpret and analyze the data, and then choose what to publish (4). One common manifestation of P-hacking is recording many dependent variables and then deciding which ones to analyze and report (4). This is exactly what happened in the chocolate study. The researchers tracked the participants’ weight, cholesterol, and many other factors related to health. Even if the chance of a spurious correlation between eating chocolate and any single measure, such as cholesterol, was small, there were still 15 other variables that could yield a correlation, which greatly increased the probability of a “noteworthy” finding. Another common form of P-hacking is stopping data collection before the experiment is finished if the data collected so far already yields significant p-values (4).
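The effect of testing many outcomes can be simulated directly. This is a sketch under assumed numbers (18 outcome measures and 15 participants per group are illustrative choices, not figures from the chocolate study): even with no real effects at all, most such studies can report at least one “significant” result.

```python
# Simulate studies that measure many outcomes but have no real effects,
# and count how many can report at least one "significant" finding.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies = 2_000
n_outcomes = 18          # assumed number of measured outcome variables
studies_with_a_finding = 0

for _ in range(n_studies):
    for _ in range(n_outcomes):
        # Treatment and control come from the same distribution: null is true.
        a = rng.normal(size=15)
        b = rng.normal(size=15)
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            studies_with_a_finding += 1
            break

# Analytically, 1 - 0.95**18 is about 0.60, so roughly 60% of these
# no-effect studies still produce a publishable-looking result.
print(f"Studies with at least one 'significant' outcome: "
      f"{studies_with_a_finding / n_studies:.1%}")
```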

Oftentimes, P-hacking can occur without malicious intent. Scientists have to make decisions about how to interpret their data all the time. For instance, researchers once put a fish that was “not alive at the time of scanning” in an MRI machine. They showed the fish a series of photos of people and asked it to identify the emotions of the people in each photo (7). Throughout this exercise, the researchers recorded the brain activity of the fish with the MRI machine. Surprisingly, the analysis found significant activation in an 81-cubic-millimeter region of the brain: according to the MRI machine, the dead fish was responding. When the researchers reanalyzed the same t-contrasts with standard corrections, commonly used data correction tools, they found no activation in the fish’s brain (7). Skipping these corrections, or misapplying them, can leave p-values that falsely qualify as ‘significant.’ Scientists who omit or misuse such tools can therefore P-hack entirely by accident.

MRI scans of the dead fish, with the areas of apparent brain activation highlighted in red.
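The general mechanism behind such corrections can be sketched generically. The example below illustrates multiple-comparisons correction in general (using a Bonferroni adjustment), not the actual analysis pipeline from the fish study: when thousands of noise-only “voxels” are each tested at p < 0.05, hundreds cross the threshold, but almost none survive the correction.

```python
# Generic sketch: thousands of pure-noise "voxels", each tested against zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_voxels = 10_000
alpha = 0.05

# One-sample t-test of pure noise against a mean of zero, one test per voxel.
noise = rng.normal(size=(n_voxels, 30))
_, p_values = stats.ttest_1samp(noise, popmean=0.0, axis=1)

uncorrected_hits = np.sum(p_values < alpha)
bonferroni_hits = np.sum(p_values < alpha / n_voxels)   # Bonferroni correction

print(f"'Active' voxels without correction: {uncorrected_hits}")  # roughly 500
print(f"'Active' voxels after Bonferroni:   {bonferroni_hits}")   # almost always 0
```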

P-hacking and the unreliability of the system we use to determine scientific significance are large hurdles to meaningful research and to increasing trust in science. Thankfully, there are initiatives to prevent P-hacking in science. One of the most promising is a total reworking of the scientific publication process known as 2-step manuscript submission (5). In this system, a journal reviews the methodology behind a paper’s data collection and analysis and decides whether to publish the study before even seeing the results. This method promises to put good scientific practice before results and may be the solution to our P-hacking predicament.