13 January, 2017

A list of quant funds / tickers

Keeping Track of Quant Funds
Columbia Threadneedle publishes a list of quant funds and their Bloomberg tickers, which it presumably holds in portfolio; probably this is a regulatory requirement. The list is not representative of quant funds in general, and these do not appear to be in any way the best quant funds.

I have cut and pasted a list from Jan 17, just in case the link goes inactive.


Columbia Threadneedle Jan 2017 Quant Fund Indices.

Columbia Threadneedle Quant Fund Indices (updated monthly)

12 January, 2017

Backtest overfitting, charlatanism and touchy-feely math heuristics

A few readings on overfitting

The basis for these papers is sound. A significant amount of overfitting is done in the name of so-called quant or smart beta funds. In typical statistical or econometric modelling, methodologies to avoid this have evolved over the years, from statistical testing procedures and information criteria to various forms of cross validation (and bootstrap confidence interval derivations), but almost all of these methods require an actual forecast. Yet in many algorithmic trading strategies there is no forecast, only an allocation of weight to the strategy. (We will discuss in a later post why estimating a weight directly, rather than estimating a forecasting model which is then adapted to produce a weight, is in many ways much more efficient.) The strategy itself is judged by a number of factors, primary among them the Sharpe ratio. The issue is that Sharpe ratios, when optimised in-sample (IS), may be spurious, and the resulting strategies lead to poor out-of-sample (OOS) performance.

While p-hacking, overfitting and irreproducibility have been studied for some time, including in finance (Lo and MacKinlay, Halbert White's bootstrap Reality Check test, Romano-Wolf, etc.), the problem seems to have recently been rediscovered by a group of mathematicians and by Campbell Harvey. We will review the recent papers, try to put them in context, and explain how little they add to the current understanding. Finally, we will look at Campbell Harvey's more recent Lucky Factors analysis, which not only properly contextualises his earlier work but also helps advance the theory of searching for data-mined equity quant factors.



Bailey et al - Financial Charlatanism
There are many approaches to dealing with this overfitting. Bailey et al consider one approach to strategies which leads to some interesting heuristics but is, in some sense, far from practicable. In their paper, they take a large-sample approach to Sharpe ratios. Given returns which are iid normal, they quote Andy Lo's result on the asymptotic sampling distribution of the Sharpe ratio (as is typical, it is asymptotically normal). But since strategies are selected by maximising the IS Sharpe ratio, it is more relevant to consider the maximum Sharpe ratio rather than $E[\widehat{SR}]$ and $\mathrm{stdev}[\widehat{SR}]$. In particular, they look at the asymptotic distribution of the maximum of Sharpe ratios on iid $N(0,1)$ returns, i.e. those whose expected value is zero. Using a standard result from extreme value theory (EVT), they obtain an approximation for $E[\max_n \{x_n\}]$, where the $x_n$ are the (asymptotically standard normal) z-scores of the estimated Sharpe ratios. This is tantamount to considering multiple independent trials, each of which is meant to depict a strategy.
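As a concrete illustration, here is a minimal sketch (my own code, not the authors') of the EVT-style approximation for the expected maximum of $N$ independent standard normal trials, checked against a quick Monte Carlo; the function names are mine.

```python
import numpy as np
from scipy.stats import norm

def expected_max_sharpe(n_trials: int) -> float:
    """EVT-style approximation quoted by Bailey et al for the expected maximum
    of n_trials iid N(0,1) draws (here, z-scored IS Sharpe ratios of
    zero-skill strategies); gamma is the Euler-Mascheroni constant."""
    gamma = np.euler_gamma
    return ((1 - gamma) * norm.ppf(1 - 1 / n_trials)
            + gamma * norm.ppf(1 - 1 / (n_trials * np.e)))

def monte_carlo_max(n_trials: int, n_sims: int = 5_000, seed: int = 0) -> float:
    """Brute-force check of the same expectation by simulation."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_sims, n_trials)).max(axis=1).mean()

for n in (10, 100, 1000):
    print(f"N={n:>4}: approx {expected_max_sharpe(n):.3f}, "
          f"simulated {monte_carlo_max(n):.3f}")
```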

Given this EVT-style distribution and its time-scaling, the authors are able to put forward a minimum backtest length such that, for a given number of independent strategies with true Sharpe of zero, the expected maximum IS Sharpe ratio stays below 1. Again, they consider strategies that should have a Sharpe of 0 (OOS the Sharpe is expected to be zero) and then look at the expected maximum in-sample Sharpe. For a given number N of strategies, they want to find the minimum backtest length (MinBTL) such that this expected maximum Sharpe ratio is below 1. Despite the headline claim that you need at least a certain number of data points to backtest a given number of strategies, it is not clear whether this is really achieved within the scope of their definitions.
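To make the time-scaling concrete, here is my own sketch of the MinBTL idea (under the simplifying assumptions above: independent trials, zero true Sharpe, annualised Sharpe estimates scaling like $1/\sqrt{\text{years}}$): requiring the expected maximum IS Sharpe to stay below 1 gives a minimum length of roughly $E[\max_n x_n]^2$ years, which is bounded above by $2\ln N$.

```python
import numpy as np
from scipy.stats import norm

def min_backtest_length_years(n_trials: int) -> float:
    """Years of data needed so that the expected maximum annualised IS Sharpe
    over n_trials independent zero-skill strategies stays below 1.
    Uses the same EVT approximation for E[max] as above."""
    gamma = np.euler_gamma
    e_max = ((1 - gamma) * norm.ppf(1 - 1 / n_trials)
             + gamma * norm.ppf(1 - 1 / (n_trials * np.e)))
    return e_max ** 2

for n in (10, 45, 100, 1000):
    print(f"N={n:>4}: MinBTL ~ {min_backtest_length_years(n):.1f} years "
          f"(upper bound 2 ln N = {2 * np.log(n):.1f})")
```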

While instructive, there is almost nothing in the presentation which is actually applicable. Strategies are rarely chosen from a discrete set and then optimised. The only discrete strategies I can conceive of are moving average rules (22 days or 35 days?), and even those are only sometimes constrained to a few parameter choices.

As the authors say, anything continuous is econometric and beyond their scope of study. I would claim that all models ever considered are econometric (EWMA rules are econometric; vol scaling is econometric even if the underlying rules are single MA rules). Moreover, how can we define independence for strategies, as though they were random samples? Is a 30/60-day moving average crossover rule independent of a 15/20-day one? We have no guidance and no formal definitions. In fact, I would say the paper is not even what we might deem mathematics, since the results are merely heuristic. Interesting, but close to useless.

The paper is partly a rant against overfitting with so-called technicals, and partly a diatribe. While I am altogether sympathetic to the critique, that the Notices of the AMS should be a vehicle for diatribe is, IMHO, quite inappropriate.


Harvey and Liu, Backtesting
Harvey takes quite a different tack and grounds his approach in more familiar multiple-testing results. He appears unaware of the significant literature on multiple testing in finance (partly because those papers appeared in the econometrics literature rather than the financial econometrics literature). Nonetheless, Harvey's explanations are sound and reasonable. Harvey notes that the Sharpe ratio estimate has a t-distribution, so each Sharpe ratio is effectively a test that the expected excess return of a strategy is above zero, and from this he can define a p-value. Harvey then considers the theory of multiple tests.
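As a rough illustration of this mapping (my own sketch under the simplifying assumption of iid returns, not Harvey's exact formulas): with $T$ observations, the t-statistic is approximately the per-period Sharpe ratio times $\sqrt{T}$, and a p-value follows from the t-distribution.

```python
import numpy as np
from scipy import stats

def sharpe_to_pvalue(annual_sharpe: float, n_years: float, periods_per_year: int = 12):
    """Map an annualised Sharpe ratio to a t-statistic and one-sided p-value.

    Assumes iid returns: the per-period Sharpe is annual_sharpe / sqrt(periods_per_year)
    and the t-statistic is the per-period Sharpe times sqrt(number of observations)."""
    n_obs = int(n_years * periods_per_year)
    t_stat = annual_sharpe / np.sqrt(periods_per_year) * np.sqrt(n_obs)
    p_value = stats.t.sf(t_stat, df=n_obs - 1)  # one-sided: H0 is zero excess return
    return t_stat, p_value

print(sharpe_to_pvalue(annual_sharpe=1.0, n_years=5))   # roughly (2.24, 0.015)
```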

We first review testing and errors. A Type I error is a false positive; a Type II error is a false negative. So if in reality a hypothesis (say, that a strategy works) is false but our method claims it to be true, this is a Type I error. If the hypothesis is true and we mistakenly claim it to be false, it is a Type II error.

To make it more complicated, in statistics we usually frame this with a null hypothesis $H_0$ (here, that the strategy has zero excess return) which we want to test. If we mistakenly reject $H_0$ when it is actually true, that is a Type I error. We usually control Type I errors by controlling the significance level of the test statistic. If we choose 5% significance we are saying we are OK with falsely rejecting a true null in one of every 20 trials; if we choose 1% we are only OK with such an error in one of every 100 trials.

The problem with multiple testing is that we cannot perform 100 trials and just choose the one test which happens to pass. The probability that at least one test has a p-value below our significance cutoff increases dramatically as we increase the number of tests. The usual testing procedure is valid for a single test; if one wants to consider multiple, possibly correlated, tests, then the resulting p-values are incorrect and need adjustment.
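To see how quickly this blows up, a one-line calculation for independent tests at the 5% level: the probability of at least one spurious pass is $1-(1-\alpha)^N$.

```python
# Probability of at least one false positive among N independent tests at level alpha.
alpha = 0.05
for n_tests in (1, 10, 20, 100):
    print(n_tests, round(1 - (1 - alpha) ** n_tests, 3))
# 1 -> 0.05, 10 -> 0.401, 20 -> 0.642, 100 -> 0.994
```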

Harvey takes the sample size and a predetermined number of models under consideration (no concept of independent strategies here, à la Bailey) and works on an adjustment to all the Sharpe ratios. Three standard adjustments are considered: Bonferroni, Holm and Benjamini-Hochberg-Yekutieli (BHY). The first two are adjustments to p-values that control the probability of multiple (possibly correlated) tests producing even one Type I error: if the number of false positives is denoted $N_r$, they control the Family-Wise Error Rate $FWER = P\{N_r \geq 1\}$. BHY instead controls the False Discovery Rate, $FDR = E[N_r/R]$, where $R$ is the number of rejections (discoveries) and $N_r$ is the number of false discoveries among them.
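Here is a small reference implementation of the three adjustments, written for illustration (statsmodels offers equivalent, battle-tested versions): Bonferroni and Holm control the FWER, while BHY controls the FDR under arbitrary dependence.

```python
import numpy as np

def adjust_pvalues(pvals, method="holm"):
    """Multiple-testing adjusted p-values for Bonferroni, Holm and
    Benjamini-Hochberg-Yekutieli (BHY)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)           # indices sorting p ascending
    ranked = p[order]

    if method == "bonferroni":
        adj_sorted = np.minimum(m * ranked, 1.0)
    elif method == "holm":          # step-down, controls FWER
        steps = (m - np.arange(m)) * ranked
        adj_sorted = np.minimum(np.maximum.accumulate(steps), 1.0)
    elif method == "bhy":           # step-up, controls FDR under dependence
        c_m = np.sum(1.0 / np.arange(1, m + 1))
        steps = c_m * m * ranked / np.arange(1, m + 1)
        adj_sorted = np.minimum(np.minimum.accumulate(steps[::-1])[::-1], 1.0)
    else:
        raise ValueError(method)

    adj = np.empty(m)
    adj[order] = adj_sorted         # map back to the original ordering
    return adj

raw = [0.001, 0.01, 0.02, 0.04, 0.2]
for meth in ("bonferroni", "holm", "bhy"):
    print(meth, np.round(adjust_pvalues(raw, meth), 3))
```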

Using any of the above p-value adjustments (take raw p-values and alter them to control either FWER or FDR, at the cost of power), Harvey takes Sharpe ratios, converts them to p-values, adjusts those p-values, and then inverts the t-distribution to derive adjusted, "haircut" Sharpe ratios which account for the multiple tests and are meant to be more robust OOS.
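Putting the pieces together, a rough sketch of the haircut idea (my own illustration, not Harvey and Liu's exact procedure; iid returns and a Bonferroni adjustment assumed): Sharpe ratio to t-statistic to p-value, adjust the p-value for the number of trials, then invert back to a haircut Sharpe ratio.

```python
import numpy as np
from scipy import stats

def haircut_sharpe(annual_sharpe, n_years, n_trials, periods_per_year=12):
    """Illustrative haircut: Sharpe -> p-value, Bonferroni-adjust for the
    number of strategies tried, then invert back to an adjusted Sharpe."""
    n_obs = int(n_years * periods_per_year)
    t_stat = annual_sharpe / np.sqrt(periods_per_year) * np.sqrt(n_obs)
    p_raw = stats.t.sf(t_stat, df=n_obs - 1)      # one-sided p-value
    p_adj = min(1.0, n_trials * p_raw)            # Bonferroni haircut
    t_adj = stats.t.isf(p_adj, df=n_obs - 1)      # back to a t-statistic
    return max(0.0, t_adj) / np.sqrt(n_obs) * np.sqrt(periods_per_year)

# An IS annualised Sharpe of 1.0 over 10 years, found after 10 trials, is
# haircut to roughly 0.75; after 100 trials it drops to roughly 0.4.
print(haircut_sharpe(1.0, n_years=10, n_trials=10))
print(haircut_sharpe(1.0, n_years=10, n_trials=100))
```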

The paper is quite intuitive and is worth a read. It does not offer much new to the scientific literature but is a good introduction. 

White and others
Unfortunately, Harvey was not aware of the significant work by White and others on adjusting p-values for multiple testing while retaining higher power. It turns out the Holm and Bonferroni methods are quite extreme and result in a large loss of power. Consequently, in using them one makes many Type II errors, discarding strategies that actually work. White takes this into account by estimating the dependence between the tested hypotheses via the bootstrap. The resulting method can be used for an arbitrary number of models to determine whether at least one of them is significant, has a positive excess return, etc. (a minimal sketch of the bootstrap idea follows the list below). White's method has been used and reused:

  • To debunk day-of-the-week effects
  • To debunk hundreds of technical trading rules
  • To show that momentum does indeed offer significant positive returns
  • To determine the full set of models which outperform (Romano-Wolf). This was in fact a major innovation and improved quite considerably upon White's method
  • To determine model confidence sets: which models do indeed outperform the others, so that we can rank them, and which groups have indeterminate ranking (Hansen-Lunde-Nason's Model Confidence Set)
  • To debunk the equity factors being farmed almost continuously
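For concreteness, a minimal sketch of the Reality Check idea as I understand it (my own simplification: a plain iid bootstrap rather than the stationary bootstrap White uses, so serial dependence is ignored): given each candidate's performance relative to a benchmark, bootstrap the distribution of the maximum average outperformance and compare the observed maximum to it.

```python
import numpy as np

def reality_check_pvalue(relative_perf, n_boot=5000, seed=0):
    """Sketch of White's Reality Check for data snooping.

    relative_perf: (T, K) array of each of K strategies' per-period performance
    minus the benchmark's. H0: no strategy beats the benchmark on average."""
    f = np.asarray(relative_perf, dtype=float)
    t_len, _ = f.shape
    means = f.mean(axis=0)
    v_obs = np.sqrt(t_len) * means.max()           # observed test statistic

    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_boot):
        idx = rng.integers(0, t_len, size=t_len)   # resample time indices
        v_boot = np.sqrt(t_len) * (f[idx].mean(axis=0) - means).max()
        count += v_boot >= v_obs
    return count / n_boot

# Example: 20 pure-noise strategies plus one with a small genuine edge.
rng = np.random.default_rng(1)
noise = rng.standard_normal((1000, 20)) * 0.01
skilled = rng.standard_normal((1000, 1)) * 0.01 + 0.001
print(reality_check_pvalue(np.hstack([noise, skilled])))  # small p-value: at least one works
```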

References

Bailey, Borwein, López de Prado and Zhu, Pseudo-Mathematics and Financial Charlatanism
Harvey and Liu, Backtesting
Harvey and Liu, Lucky Factors
Hansen, Lunde and Nason, The Model Confidence Set