# The Cult of Statistical Significance

“But is it significant?”

That’s always one of the first questions researchers in economics and finance are asked. It makes an interesting contrast with “Does it matter?”

*The Cult of Statistical Significance* by Stephen T. Ziliak and Deirdre N. McCloskey is a book that every economist, research analyst, and investor probably needs to read but very few have. The authors describe how the entire field of economics and finance has become enthralled by p-values. If a result is statistically significant at the 5% level, it is considered a valid phenomenon. A result that fails that test is treated as if it does not exist.

Obviously, the 5% rule misses two points. First, by chance alone, one in every 20 tests of an effect that does not exist will clear that threshold. Since thousands, perhaps millions, of tests are conducted on finance and economics data every year, we can imagine how many spuriously positive results are found and then published. After all, a positive result is far easier to publish than a negative one.
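The 1-in-20 arithmetic is easy to verify with a quick simulation. The sketch below tests 1,000 hypothetical "strategies" whose true mean return is exactly zero; the sample size, volatility, and seed are all illustrative assumptions:

```python
import math
import random

def mean_test_significant(returns, z_crit=1.96):
    """Two-sided z-test (normal approximation) of 'true mean == 0'."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return abs(mean / math.sqrt(var / n)) > z_crit

def count_false_positives(n_strategies=1000, n_days=250, seed=42):
    """Count 'significant' results among strategies with zero true edge."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_strategies):
        noise = [rng.gauss(0.0, 0.01) for _ in range(n_days)]  # one year of pure noise
        if mean_test_significant(noise):
            hits += 1
    return hits

# Expected to come out near 5% of 1,000 tests, purely by chance.
print(count_false_positives(), "spurious 'discoveries' out of 1,000")
```

None of these strategies has any real edge, yet roughly one in 20 passes the conventional significance test.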

I remember sitting through a seminar in my university days. A researcher presented statistically significant evidence that company directors leave the board before the firm gets into trouble with its auditors or regulators. That is all well and good. But then he showed us that this observation can make money: a full 0.2% outperformance per year — before transaction costs.

Because the researcher had so many data points to estimate his regression, he could generate statistical significance even though the effect had no economic significance. In the end, it was a purely academic exercise.

And second, in the 21st century, the amount of available data has multiplied time and time again. Hedge funds and traditional asset managers apply big data to find patterns in markets that they can exploit. They analyze the data with artificial intelligence (AI) to find “meaningful” correlations that traditional analyses would miss. This approach to investing has a lot of *challenges* to overcome.

A major and rarely mentioned one: The more data we look at, the more likely we’ll find statistically significant effects, and the more underlying data we have, the more powerful our statistical tests become. So with more data, we can detect smaller and smaller effects that may or may not be economically meaningful.
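This effect of sample size on test power follows directly from the z-statistic, which grows with the square root of the number of observations. A sketch under illustrative assumptions (a tiny 2-basis-point daily "edge" against 1% daily volatility):

```python
import math

# Illustrative assumptions: a 2bp-per-day true edge, 1% daily volatility.
mu, sigma = 0.0002, 0.01

def z_stat(n):
    """z-statistic for testing 'mean == 0' with n observations."""
    return mu / (sigma / math.sqrt(n))

for n in (1_000, 10_000, 100_000, 1_000_000):
    z = z_stat(n)
    print(f"n = {n:>9,}: z = {z:6.2f}, significant at 5%: {abs(z) > 1.96}")
```

The same tiny effect is invisible at 1,000 observations but overwhelmingly "significant" at a million, even though its economic size never changes.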

In “Statistical Nonsignificance in Empirical Economics,” Alberto Abadie analyzes how much knowledge we gain from a statistically significant test result. The dashed curve in the chart below shows the assumed distribution of a variable before any tests are done. Then we measure the data — for example, returns of stocks with specific characteristics — and end up with a statistically significant result. The solid curve shows where the true effect could lie, depending on the number of data points. With very few data points, a statistically significant result carves out quite a big chunk of the distribution, so we learn much more when we get a significant result from few data points.

But with 10,000 data points, the carve-out is extremely small. What that means is the more data we have, the less informative a statistically significant result becomes. On the other hand, if there’s a failure of statistical significance with a test on 10,000 data points, we learn an awful lot. In fact, we would know that the true value would have to be almost exactly zero. And that, in itself, could give rise to an extremely powerful investment strategy.

**The Impact of a Statistically Significant Result on Our Knowledge**
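The "almost exactly zero" claim can be made concrete: a non-significant two-sided test at sample size n confines the true mean to roughly ±1.96σ/√n, a band that shrinks quickly as n grows. A minimal sketch (σ is an assumed value for illustration):

```python
import math

sigma = 0.01  # illustrative standard deviation of the quantity being tested

def max_undetected_effect(n, z_crit=1.96):
    """Largest true mean consistent with a non-significant two-sided test."""
    return z_crit * sigma / math.sqrt(n)

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6,}: |true effect| < {max_undetected_effect(n):.6f}")
```

With 10,000 data points, the band around zero is ten times narrower than with 100, which is why a non-significant result on a large sample is so informative.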

This is a major reason why so many big data and AI applications fail in real life and why so many equity *factors* stop working once they’re described in the academic literature.

In fact, a stricter definition of *significance* that accounts for possible data-mining bias shows that, out of the hundreds of published equity factors, only three are largely immune to p-hacking and data mining: the value factor, the momentum factor, and a really esoteric factor that I still haven’t understood properly.
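The article does not spell out which stricter definition it refers to, but the simplest way to account for testing hundreds of factors is a multiple-testing adjustment such as the Bonferroni correction, which raises the significance hurdle with the number of tests. A sketch (the count of 300 factors is an illustrative assumption):

```python
from statistics import NormalDist

def bonferroni_z(n_tests, alpha=0.05):
    """Two-sided z threshold after a Bonferroni correction for n_tests tests."""
    return NormalDist().inv_cdf(1 - alpha / n_tests / 2)

print(f"  1 test  : |z| > {bonferroni_z(1):.2f}")    # the familiar ~1.96 hurdle
print(f"300 tests : |z| > {bonferroni_z(300):.2f}")  # a much higher hurdle
```

Under this correction, a factor mined from hundreds of candidates needs a z-statistic well above 3 — rather than the usual 1.96 — before it should be believed.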

So what’s the big takeaway? Just because it’s statistically “significant” doesn’t mean it matters. And if it isn’t significant, it may well matter a lot. The next time you come across a significant new result, ask yourself if it matters.

**For more from Joachim Klement, CFA, don’t miss *7 Mistakes Every Investor Makes (And How to Avoid Them)* and *Risk Profiling and Tolerance*, and sign up for his Klement on Investing commentary.**


*All posts are the opinion of the author. As such, they should not be construed as investment advice, nor do the opinions expressed necessarily reflect the views of CFA Institute or the author’s employer.*

Image credit: ©Getty Images / fstop123


#### Comments

Joachim: Can you please elaborate on this statement: “And if it isn’t significant, it may well matter a lot.”

While I guess the discussion about something being “exactly zero” gives rise to a potential for useful information, I can’t see an actual application in real life.

Can you give examples of things which matter a lot which do not move the needle on statistical significance? Having trouble following the corollary.

P-hacking may also be described as ex-post specification searching.

Unfortunately, we keep repeating tests that have no statistical significance or statistical power because it is much, much easier to publish results that are statistically significant. This publication bias leads to academic filing cabinets overflowing with unpublished research destined to be unknowingly replicated.

Parabolic increases in the national debt and money supply are probably going to overshadow what used to be “statistically significant” from here on out. The Fed continues to be excused (actually complimented) for what would have been considered crazy in 2008-9. The commitment has been made. We’ll see.

Thank you for this piece and especially for your reference to Abadie’s article. P-hacking has been written up in the sciences and now in the social sciences. The question is what happens now. Perhaps the beginning of a change? Have not seen it yet. Not only do editors want to publish papers with significant results, but typically papers whose results are consistent with current paradigms.

All the more reason why studies that are well designed with good data but which lack significance should not be ignored.

Very interesting article, thanks Joachim.

Of course, when it comes to time series analysis, which is often the focus of investment research, we must remember that a limited sample also presents an additional and unique challenge: the observations made over a relatively short interval in time may be tainted by the governing dynamics of a specific regime. A larger sample, ceteris paribus, is arguably more likely to transcend multiple regimes and therefore to contain historical observations that capture a wider range of possible outcomes.

Note that this is different from the standard problem associated with limited samples, which is about the increased possibility that a small random sample is simply not representative of the population as a whole.

The regime problem is far more profound than that: The parameters of the probability distribution are evolving over time and a larger sample, transcending multiple regimes, is needed to get a truer sense of the range of possible outcomes.

In this sense, and without contradicting the validity of the essence of the article above, in time series analysis there are many instances where you want the sample (historical interval) to be large (long) and the statistical significance of the conclusion to be high.

“And if it isn’t significant, it may well matter a lot.”

Great food for thought. Too much faked statistical significance is frequently misused to market “smart” factor investments for outperformance.

Thus, rational investors prefer simple indexing with untilted, broad, market-cap-weighted equity investments. These let them avoid the uncompensated risk of large cyclical factor swings and of factor decay lasting up to several decades, risks that are usually not accounted for properly.

However, many investors have more emotional and/or expressive needs for risk taking. They may prefer certain investment styles, whether statistically significant or not. The fit to their risk profiles and investor personalities matters a lot more to them, e.g., as value or growth investors. According to behavioral finance, appropriate factor investments raise their chances significantly of staying invested during good and bad times.

By the way, Damodaran has an interesting point for those who take value for granted:

“Even those people who believe they’re value players are far more dependent on momentum than they realize, because ultimately, for them to make money, the price has to move to its value.”

“…the biggest factor in pricing is what other people are doing. Investing has always been a momentum game…”

Accordingly, the simple combination of pure traditional indexing and time-series momentum via managed futures is the best approach to benefit from all market regimes: boom, bust, or normal. It can exploit the return potential of both non-correlated markets, for convergent and divergent risk taking respectively, with tail risk protection included. Interestingly, this seems to matter least.