June 22nd, 2018 - The P-Value and P-Hacking
The old saying that there are three kinds of lies (lies, damned lies, and statistics) is often attributed to the British Prime Minister Benjamin Disraeli. It is easy to manipulate data to support one's position. One increasingly visible misleading practice in the world of research is p-hacking, in which an effect appears statistically significant but in reality is not.
A p-value, which is used to determine statistical significance, assesses how unusual the observed data are relative to what one would expect by chance. The lower the p-value, the more unusual the data, and the stronger the evidence of a statistically significant effect. However, if the analysis that produced a particular p-value was only one of many analyses performed, is the result really that noteworthy?
A p-value can be understood as the percentage of the time that results at least as strong as one's own could be produced by randomly selected numbers alone, with no actual effect present. A p-value of .05 means there is only a 5% chance that data no different from random numbers would produce the observed result, which increases confidence in the hypothesized effect. On its own, that is an acceptable risk, but when the analysis is one of many, the risk of a false positive inflates.
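To make that concrete, here is a minimal simulation sketch (Python with NumPy and SciPy, not from the original post; the sample sizes are arbitrary choices for illustration). Two groups are drawn from the same random distribution, so there is no real effect, yet a standard t-test still returns p < .05 roughly 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    group_a = rng.normal(size=30)   # pure noise, no real effect
    group_b = rng.normal(size=30)   # drawn from the same distribution
    result = stats.ttest_ind(group_a, group_b)
    if result.pvalue < 0.05:        # "significant" purely by chance
        false_positives += 1

print(false_positives / n_experiments)  # roughly 0.05
```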
Suppose you had a lottery ticket that had a 5% chance of winning. With that single ticket, the chance that you have a winning ticket is 5%. With three tickets, the probability of having a winning ticket increases to 14%. If you bought a dozen, your chances jump to 46%. Likewise, it would be extraordinary if you bought a single Powerball ticket that won the jackpot, but it would be hardly surprising if you won after purchasing half of all possible number combinations (because, in that case, your chance of winning is 50%).
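The lottery arithmetic follows from a simple formula: with n independent tickets, each with a probability p of winning, the chance of at least one winner is 1 - (1 - p)^n. A quick sketch to verify the numbers above:

```python
# Chance of holding at least one winning ticket, assuming each ticket
# is an independent 5% chance of winning.
p_single = 0.05
for n_tickets in (1, 3, 12):
    p_at_least_one_win = 1 - (1 - p_single) ** n_tickets
    print(n_tickets, round(p_at_least_one_win, 2))
# 1  0.05
# 3  0.14
# 12 0.46
```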
The same applies to multiple statistical analyses. A single test might be noteworthy with a p-value of 5%, but not if it is one of 20 or 30 analyses. When conducting that many, an unusual-looking result is expected to emerge, even in data that were randomly generated.
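The same formula predicts that 20 analyses of pure noise yield at least one "significant" result about 1 - 0.95^20, or roughly 64%, of the time. A hedged simulation sketch of that scenario (the number of repetitions and sample sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_tests = 5_000, 20
studies_with_a_hit = 0

for _ in range(n_studies):
    # Run 20 t-tests, all on random noise with no real effect.
    p_values = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ]
    if min(p_values) < 0.05:    # at least one test looks "significant"
        studies_with_a_hit += 1

print(studies_with_a_hit / n_studies)  # roughly 0.64
```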
This was brilliantly demonstrated a few years ago when a published study reported that chocolate helped people lose weight, lower cholesterol, and improve well-being. However, these three outcomes were among 18 different measures analyzed. When looking at the overall picture, the "improvements" are meaningless (it should be noted that the study was conducted deliberately to expose this research problem).
It is one thing if all analyses are presented and observers can judge for themselves how they should be interpreted. It is another thing if only the low p-value results are reported, because one cannot tell whether those analyses were among many others. To be fair, few researchers do this deliberately as a means to deceive, but their results are just as misleading.
Running multiple analyses is not a bad thing if proper practices are followed. When working with people who are conducting your analyses, be sure that they are not p-hacking to show results that do not exist. A future post will highlight strategies to help ensure that misleading, and sometimes deceptive, practices do not occur.