I have just posted a new paper on a fundamental problem that all scientists face: how do you evaluate whether a discovery is true or just a fluke?
We are accustomed to using a “two sigma” rule: if the effect we observe is more than two standard deviations from zero, we declare it statistically significant at the 95% confidence level and treat it as real. The effect might be a financial product that claims to “outperform” the market.
However, this rule is flawed when there are multiple tests. Let’s first consider two examples from non-financial fields.
My first example is the widely heralded discovery of the Higgs Boson in 2012. The particle was first theorized in 1964 – the same year that William Sharpe’s paper on the capital asset pricing model (CAPM) was published. The first tests of the CAPM were published eight years later, and Sharpe was awarded a Nobel Prize in 1990. For Peter Higgs, it was a much longer road. It took years to complete the Large Hadron Collider (LHC), at a cost of about $5 billion. The Higgs Boson was declared “discovered” on July 4, 2012, and Nobel Prizes were awarded in 2013.
So why is this relevant for finance? It has to do with the testing method. Scientists knew that the particle was rare and that it decays very quickly. The idea of the LHC is to have beams of particles collide. Theoretically, you would expect to see the Higgs Boson in about one in ten billion collisions within the LHC. The boson decays quickly, so the key is measuring its decay signature. Over a quadrillion collisions were conducted and a massive amount of data was collected. The problem is that each of the so-called decay signatures can also be produced by normal events from known processes.
To declare a discovery, scientists agreed to what appeared to be a very tough standard: the observed occurrences of the candidate particle (the Higgs Boson) had to be “five sigma” different from what would be expected in a world with no new particle. Five standard deviations is generally considered a tough standard. Yet in finance, we routinely accept discoveries where the hurdle is two – not five. Indeed, there is a hedge fund called Two Sigma.
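To get a feel for how much tougher that standard is, here is a quick back-of-the-envelope calculation (a sketch in Python, using only the standard library) of the chance that pure noise clears each threshold.

```python
# Chance that pure noise produces a result at least this many standard
# deviations above zero (one-sided tail of a standard normal).
from statistics import NormalDist

norm = NormalDist()
for sigma in (2, 3, 5):
    p_fluke = 1 - norm.cdf(sigma)
    print(f"{sigma} sigma: probability of a fluke ~ {p_fluke:.1e}")

# 2 sigma: ~2.3e-02 (about 1 in 44)
# 5 sigma: ~2.9e-07 (about 1 in 3.5 million)
```

Moving from two sigma to five shrinks the tolerated fluke rate by roughly five orders of magnitude.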
Particle physics is not alone in having a tougher hurdle to exceed. Consider the research done in bio-genetics. In genetic association studies, researchers try to link a certain disease to human genes by testing for a causal effect between the disease and each gene. Given that there are more than 20,000 human genes that are expressed, multiple testing is a real issue. To make it even more challenging, a disease is often caused not by a single gene but by interactions among several genes. Counting all the possibilities, the total number of tests can easily exceed a million. Given this large number of tests, a tougher standard must be applied. With the conventional thresholds, a large percentage of studies that document significant associations are not replicable.
To give an example, a recent study in Nature claims to find two genetic linkages for Parkinson’s disease. About half a million genetic sequences were tested for a potential association with the disease. Given this large number of tests, tens of thousands of genetic sequences would appear to affect the disease under conventional standards. We need a tougher standard to lower the possibility of false discoveries. Indeed, the identified gene loci from the tests were more than “five sigma”.
There are many more examples such as the search for exoplanets. However, there is a common theme in these examples. A higher threshold is required because the number of tests is large. For the Higgs Boson, there were potentially trillions of tests. For research in bio-genetics, there are millions of combinations. With multiple tests, there is a chance of a fluke finding.
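To see why more tests demand a higher bar, here is a minimal sketch of the simplest multiple-testing correction (Bonferroni): if we want at most a 5% chance of even one fluke across all tests, each individual test must clear a far stricter sigma hurdle. The test counts below are illustrative.

```python
# Per-test hurdle needed to keep the overall chance of a single false
# discovery at 5%, as the number of tests grows (Bonferroni correction).
from statistics import NormalDist

norm = NormalDist()
alpha = 0.05  # tolerated probability of at least one fluke overall

for n_tests in (1, 1_000, 1_000_000):
    per_test_p = alpha / n_tests            # Bonferroni per-test level
    hurdle = norm.inv_cdf(1 - per_test_p)   # one-sided sigma threshold
    print(f"{n_tests:>9,} tests: per-test p = {per_test_p:.1e}, "
          f"hurdle ~ {hurdle:.1f} sigma")

# 1 test:          hurdle ~ 1.6 sigma
# 1,000,000 tests: hurdle ~ 5.3 sigma - back in "five sigma" territory
```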
Here is another classic example. Suppose you receive a promotional email from an investment manager promoting a stock. The email asks you to judge the record of recommendations in real time. Only a single stock is recommended and the recommendation is either long or short. You get an email every week for 10 weeks. Each week the manager is correct. The track record is amazing because the probability of such an occurrence is very small (0.5^10 ≈ 0.001). Conventional statistics would say there is a very small chance (less than 0.1%) that this is a false discovery, i.e. that the manager is no good. You hire the manager.
Later you find out the strategy. The manager randomly picks a stock and initially sends out 100,000 emails with 50% saying long and 50% saying short. If the stock goes up in value, the next week’s mailing list is trimmed to 50,000 (only sending to the long recommendations). Every week the list is reduced by 50%. By the end of the 10th week, 97 people would have received this “amazing” track record of 10 correct picks in a row.
If these 97 people had realized how the promotion was organized, then getting 10 in a row would be expected. Indeed, you get the 97 people by multiplying 100,000 x 0.5^10. There is no skill here. It is random and the claims were snake oil.
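The arithmetic of the scheme is easy to reproduce. The sketch below simply mirrors the mailing-list trimming described above.

```python
# Reproduce the promotion: halve the mailing list each week, keeping only
# the recipients who received a correct (coin-flip) call.
remaining = 100_000
for week in range(10):
    remaining //= 2          # half of the picks are right by chance
print(remaining)             # 97 people see a perfect 10-week record

# Equivalently, the expected count is 100,000 x 0.5^10:
print(100_000 * 0.5 ** 10)   # ~97.7
```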
There are many obvious applications. One that is immediate is in the evaluation of fund managers. With over 10,000 managers, you expect some to randomly outperform year after year. Indeed, if managers were randomly choosing strategies, you would expect at least 300 of them to have five consecutive years of outperformance.
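A quick simulation makes the point (a sketch under the assumption that each manager’s yearly outperformance is a pure coin flip):

```python
# Simulate 10,000 managers whose out/underperformance each year is a
# coin flip, and count how many "outperform" five years in a row.
import random

random.seed(0)
n_managers, n_years = 10_000, 5

lucky = sum(
    all(random.random() < 0.5 for _ in range(n_years))
    for _ in range(n_managers)
)
print(f"Managers with five straight winning years by luck: {lucky}")
# The expected count is 10,000 x 0.5^5 = 312.5, consistent with the
# "at least 300" figure above.
```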
My paper shows how to adjust evaluation methods for multiple tests. The usual Sharpe Ratios (excess return divided by volatility) need to be adjusted.
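To make the flavor of such an adjustment concrete, here is a minimal illustrative sketch – not the procedure from the paper – that converts a Sharpe Ratio into a t-statistic, applies a Bonferroni-style penalty for the number of strategies tried, and converts back. The inputs (an annualized Sharpe Ratio of 1.0, 10 years of data, 200 strategies tried) are hypothetical.

```python
# Illustrative sketch only: a Bonferroni-style "haircut" of a Sharpe Ratio.
# Assumptions (not from the paper): annualized Sharpe Ratio, T years of
# data, i.i.d. returns, and a known number of strategies tried.
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def haircut_sharpe(sr_annual: float, years: float, n_tests: int) -> float:
    """Penalize an annualized Sharpe Ratio for n_tests strategies tried."""
    t_stat = sr_annual * sqrt(years)            # SR -> t-statistic
    p_single = 2 * (1 - norm.cdf(abs(t_stat)))  # two-sided p-value
    p_multi = min(p_single * n_tests, 1.0)      # Bonferroni adjustment
    t_adj = norm.inv_cdf(1 - p_multi / 2)       # adjusted t-statistic
    return t_adj / sqrt(years)                  # back to Sharpe units

# Hypothetical example: an SR of 1.0 over 10 years looks great on its own...
print(haircut_sharpe(1.0, years=10, n_tests=1))    # ~1.0 (no penalty)
# ...but shrinks sharply once we admit 200 strategies were tried.
print(haircut_sharpe(1.0, years=10, n_tests=200))  # ~0.32
```

Even this crude correction cuts the apparent Sharpe Ratio by roughly two-thirds once a few hundred trials are admitted.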
The bottom line is that most of the claims of outperformance by financial products are likely false.
See my paper “Evaluating Trading Strategies”.