Keeping Track of Quant Funds
Columbia Threadneedle publishes a list of quant funds and their Bloomberg tickers, which it presumably holds in its portfolio. This is probably a regulatory requirement. These funds are not representative of quant funds in general, and they do not appear to be in any way the best quant funds.
I have cut and pasted a list from Jan 17, just in case the link goes inactive.
Columbia Threadneedle Jan 2017 Quant Fund Indices.
Columbia Threadneedle Quant Fund Indices (updated monthly)
Bayesian Statistics, Econometrics, a bit of Machine Intelligence and the Finance of European Rates Markets.
13 January, 2017
12 January, 2017
Backtest overfitting, charlatanism and touchy-feely math heuristics
A few readings on overfitting
The basis for these papers is sound. A significant amount of overfitting is done in the name of so-called quant or smart beta funds. In typical statistical or econometric models, methodologies to avoid this have evolved over the years, from statistical testing procedures and model selection criteria to various forms of cross-validation (and bootstrap confidence interval derivations), but almost all of these methods require an actual forecast. Yet in many algorithmic trading strategies there is no forecast, only an allocation of weight for the strategy. (In a later discussion we will cover why estimating a weight directly, rather than estimating a forecasting model which is then adapted to produce a weight, is in many ways much more efficient.) The strategy itself is judged by a number of factors, primary among them being the Sharpe ratio. The issue is that Sharpe ratios, when optimised in-sample (IS), may be spurious, and the resulting strategies perform poorly out-of-sample (OOS).
While p-hacking, overfitting and irreproducibility have been studied for some time, including in finance (Lo and MacKinlay, Halbert White's Bootstrap Reality Check, Romano-Wolf etc.), the topic seems to have recently been rediscovered by a group of mathematicians and by Campbell Harvey. We will review the recent papers, try to put them in context and explain the minimal impact they have had on the current understanding. Finally, we will look into Campbell Harvey's more recent Lucky Factors analysis, which not only properly contextualises his earlier analysis but also helps advance the theory of searching for data-mined factors (equities quant factors).
Bailey et al - Financial Charlatanism
There are many approaches to dealing with this overfitting. Bailey et al consider one approach which leads to some interesting heuristics but is in some sense far from practicable. In their paper, they take a large-sample approach to Sharpe ratios. Assuming returns are iid normal, they quote Andy Lo's result on the asymptotic sampling distribution of the Sharpe ratio (as is typical, it is asymptotically normal). But given that strategies are chosen to maximise IS Sharpe ratios, it is more relevant to consider the maximum Sharpe ratio (rather than $E[\widehat{SR}]$ and $stdev[\widehat{SR}]$, where $\widehat{SR}$ is the estimated Sharpe ratio). In particular, they look at the asymptotic distribution of the maximum Sharpe ratio on iid $N(0,1)$ returns, i.e., those for which the expected value is zero. Using a standard result from extreme value theory (EVT), they derive an approximation for $E[\max_n x_n]$, with $x_n$ distributed as the z-score of an iid $N(0,1)$ distribution. This is tantamount to considering multiple independent trials, which are meant to depict strategies.
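To make the EVT result concrete, here is a minimal sketch that compares the approximation $E[\max_N] \approx (1-\gamma) Z^{-1}[1 - 1/N] + \gamma Z^{-1}[1 - 1/(Ne)]$ (as I read it from the paper, with $\gamma$ the Euler-Mascheroni constant and $Z^{-1}$ the standard normal quantile function) against a brute-force simulation. The function names, the number of simulations and the seed are my own illustrative choices, not anything taken from the paper.

```python
import math
import random
from statistics import NormalDist, mean

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int) -> float:
    """EVT approximation to E[max] of n_trials iid N(0,1) draws (needs n_trials > 1)."""
    z = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
            + EULER_GAMMA * z(1 - 1 / (n_trials * math.e)))


def simulated_max_sharpe(n_trials: int, n_sims: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same expectation, for comparison only."""
    rng = random.Random(seed)
    return mean(max(rng.gauss(0.0, 1.0) for _ in range(n_trials))
                for _ in range(n_sims))


for n in (10, 100):
    print(n, round(expected_max_sharpe(n), 3), round(simulated_max_sharpe(n), 3))
```

Even though every trial has a true mean of zero, the best of 100 of them looks like a z-score of roughly 2.5, which is the whole point of the heuristic.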
Given this EVT-style distribution and its time-scaling, the authors are able to put forward a minimum backtest length for a given number of independent strategies such that the Sharpe ratio, when maximised across strategies, is below 1. Again, they consider strategies that should have a Sharpe ratio of 0 (OOS it is expected to be zero) and then look at the expected maximum in-sample Sharpe. For a given number $N$ of strategies they want to find the minimum backtest length, minBTL, such that the expected maximum Sharpe ratio is below 1. In spite of the claim that you need at least a certain number of data points to backtest a given number of strategies, it is not clear whether this is really achieved within the scope of their definition.
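The time-scaling argument can be sketched in a few lines. An annualised Sharpe ratio estimated from $y$ years of iid returns with true Sharpe 0 has a standard deviation of roughly $1/\sqrt{y}$, so the expected maximum over $N$ trials is roughly $E[\max_N]/\sqrt{y}$, and requiring this to stay below a target Sharpe gives $y \geq (E[\max_N]/\text{target})^2$. The function name and the default target of 1 below are my own illustrative choices.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def min_backtest_length(n_trials: int, target_sharpe: float = 1.0) -> float:
    """Approximate minimum backtest length in years (needs n_trials > 1).

    Below this length, the best of n_trials zero-skill strategies is expected
    to show an annualised in-sample Sharpe ratio above target_sharpe.
    """
    z = NormalDist().inv_cdf
    e_max = ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
             + EULER_GAMMA * z(1 - 1 / (n_trials * math.e)))
    return (e_max / target_sharpe) ** 2


for n in (10, 45, 100, 1000):
    print(n, round(min_backtest_length(n), 2))
```

With about 45 independent trials the sketch gives roughly five years of data, which shows how quickly a modest search exhausts a typical backtest window.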
While instructive, there is almost nothing in the presentation which is actually applicable. Strategies are rarely chosen from a discrete set and then optimised. The only discrete strategies which I can conceive of are moving average rules (22 days or 35 days?), which are sometimes constrained to have only a few parameter choices.
As the authors say, anything continuous is econometric and beyond their scope of study. I would claim that all models ever considered are econometric (EWMAs are econometric; vol scaling is econometric even if the underlying rules are single MA rules). Moreover, how can we define independence for strategies, as though they were random samples? Is a crossing moving average rule of 30-60 days independent from a crossing moving average rule of 15-20 days? We have no guidance and no formal definitions. In fact, I would say that the paper is not even what we might deem to be mathematics, since the results are merely heuristic. Interesting, but close to useless.
The paper is partly a rant against overfitting with so-called technicals. It is partly diatribe. While I am altogether sympathetic to this critique, that the Notices of the AMS should be a vehicle for diatribe is, IMHO, quite inappropriate.
Harvey-Liu - Backtesting
Harvey takes quite a different tack and grounds his approach in more familiar multiple testing results. He appears unaware of the significant literature on multiple testing in finance (partly because the papers were in the econometrics literature rather than the financial econometrics literature). Nonetheless, Harvey's explanations are sound and reasonable. Harvey treats each Sharpe ratio, suitably scaled, as a t-statistic, so that each Sharpe ratio is effectively a test that the expected excess return of a strategy is above zero, and with this he can define a p-value. Harvey then considers the theory of multiple tests.
We first review testing and errors. A Type I error is a false positive. A Type II error is a false negative. So if in reality a claimed effect (say, that a strategy makes money) is false but using our method we claim it to be true, then this is a Type I error. If the effect is real and we mistakenly claim it to be false, it is a Type II error.
To make it more complicated, in statistics we usually have a null hypothesis, $H_0$, of no effect, and this is what we test. If we mistakenly reject $H_0$ when it is actually true, it is a Type I error; if we fail to reject $H_0$ when it is actually false, it is a Type II error. We usually control Type I errors by choosing the significance level of a test. If we choose a 5% significance level, we are saying we are OK with falsely rejecting a true null on average once in every 20 trials. If we choose 1%, we are only OK with making such an error once in every 100 trials.
The problem with multiple testing is that we cannot perform 100 trials and just choose the one test which happens to pass. The probability that at least one test has a p-value below our significance cutoff increases dramatically as we increase the number of tests. The testing procedure is valid for just a single test. If one wants to consider multiple, possibly correlated, tests, then the raw p-values are incorrect and need some adjustment.
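A quick calculation shows how fast the problem grows. The sketch below assumes the tests are independent and that every null hypothesis is true (an idealisation: real strategies are correlated), in which case the chance of at least one false positive among $n$ tests at level $\alpha$ is $1 - (1 - \alpha)^n$. The function name is just illustrative.

```python
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """P(at least one p-value < alpha) for n_tests independent true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 10, 20, 100):
    print(n, round(prob_at_least_one_false_positive(n), 3))
```

With 20 independent tests at the 5% level, the chance of at least one spurious "discovery" is already about 64%, and with 100 tests it is a near certainty.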
Harvey takes the sample size and a predetermined number of models under consideration (there is no concept of independent strategies here, a la Bailey) and works out an adjustment to all Sharpe ratios. Three standard adjustments are considered: Bonferroni, Holm and Benjamini-Hochberg-Yekutieli (BHY). The first two are adjustments to p-values which prevent multiple (possibly correlated) tests from resulting in even one Type I error. If the number of Type I errors is denoted by $N_r$, then we are controlling the Family-Wise Error Rate $FWER = P\{N_r \geq 1\}$. BHY is meant to control the False Discovery Rate, $FDR = E[N_r/R]$, where $R$ is the total number of rejections (discoveries) and $N_r$ is the number of those which are false positives.
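For readers who have not seen the three adjustments written out, here is a minimal sketch of their usual textbook forms: Bonferroni multiplies every p-value by the number of tests $M$, Holm is the step-down version of this, and BHY is a step-up procedure with the extra factor $c(M) = \sum_{j=1}^{M} 1/j$. The exact variant in the paper may differ in details, and the function names and example p-values are my own.

```python
from typing import List, Sequence


def bonferroni(pvals: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals: Sequence[float]) -> List[float]:
    """Step-down: the k-th smallest p-value is scaled by (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max     # enforce monotonicity in the ordering
    return adjusted


def bhy(pvals: Sequence[float]) -> List[float]:
    """Step-up FDR adjustment, valid under arbitrary dependence."""
    m = len(pvals)
    c_m = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        i = order[rank]
        running_min = min(running_min, c_m * m / (rank + 1) * pvals[i])
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.01, 0.03, 0.04, 0.20]  # made-up p-values for illustration
print(bonferroni(raw))
print(holm(raw))
print(bhy(raw))
```

Holm is never more conservative than Bonferroni, which is visible in the output: every Holm-adjusted p-value is at most the corresponding Bonferroni one.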
Using any of the above p-value adjustment methods (which take raw p-values and alter them to control either the FDR or the FWER, at the cost of the power of the tests), Harvey takes Sharpe ratios, converts them to p-values, adjusts these, and then inverts the t-distribution to derive adjusted Sharpe ratios which are haircut for multiple testing and are meant to be more robust OOS.
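The round trip from Sharpe ratio to p-value and back can be sketched as follows. This is only an illustration under my own simplifying assumptions: the t-statistic is taken as the annualised Sharpe ratio times the square root of the number of years, the t-distribution is replaced by a normal approximation (reasonable for long samples), p-values are two-sided, and Bonferroni is used as the adjustment. The function name is hypothetical.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def haircut_sharpe(sharpe: float, years: float, n_tests: int) -> float:
    """Bonferroni-haircut annualised Sharpe ratio (normal approximation)."""
    t_stat = sharpe * math.sqrt(years)
    p_raw = 2.0 * _NORMAL.cdf(-abs(t_stat))   # two-sided p-value
    p_adj = min(1.0, n_tests * p_raw)         # Bonferroni adjustment
    if p_adj == 0.0:                          # p-value underflowed: no visible haircut
        return sharpe
    t_adj = -_NORMAL.inv_cdf(p_adj / 2.0)     # t-stat implied by the adjusted p-value
    return math.copysign(t_adj / math.sqrt(years), sharpe)


for m in (1, 10, 100):
    print(m, round(haircut_sharpe(1.0, 10.0, m), 3))
```

Under these assumptions, a Sharpe ratio of 1 over ten years survives intact if it was the only strategy tried, but is cut to less than half if it was the best of 100.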
The paper is quite intuitive and is worth a read. It does not offer much new to the scientific literature but is a good introduction.
White and others
Unfortunately, Harvey was not aware of the significant work by White and others on adjusting p-values to correct for multiple testing while retaining higher power. It turns out that the Holm and Bonferroni methods are quite extreme and result in a large loss of power. Consequently, in using them one makes lots of Type II errors, rejecting strategies that actually work. White attempts to take this into account by estimating the correlation between the tested hypotheses via the bootstrap. The resulting method can be used for an arbitrary number of models to determine whether at least one of them is significant, has a positive excess return, etc. White's method has been used and reused:
- To debunk day of the week effects
- To debunk hundreds of technical trading models.
- To show momentum indeed does offer significant positive returns
- To determine the set of models which outperform (Romano-Wolf). In fact, this was a major innovation and improved quite considerably upon White's method
- To determine model confidence sets: which models do indeed outperform others, so that we can rank them, and which groups have indeterminate ranking (Model Confidence Sets)
- To debunk all the equities factors being farmed almost continuously
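White's bootstrap idea can be sketched roughly as follows. This is a simplified reading of the Reality Check, not a faithful reimplementation: the test statistic is the largest $\sqrt{T}$-scaled mean excess return across strategies, and its null distribution is approximated by resampling time indices with a stationary bootstrap (blocks of random, geometrically distributed length), using the same resampled indices for every strategy so that the correlation between them is preserved. All names, the mean block length and the simulated data are my own assumptions.

```python
import math
import random
from typing import List, Sequence


def stationary_bootstrap_indices(n_obs: int, mean_block: float,
                                 rng: random.Random) -> List[int]:
    """Resampled time indices made of wrapped blocks of geometric length."""
    idx = [rng.randrange(n_obs)]
    while len(idx) < n_obs:
        if rng.random() < 1.0 / mean_block:
            idx.append(rng.randrange(n_obs))    # start a new block
        else:
            idx.append((idx[-1] + 1) % n_obs)   # continue the current block
    return idx


def reality_check_pvalue(excess: Sequence[Sequence[float]], n_boot: int = 500,
                         mean_block: float = 10.0, seed: int = 0) -> float:
    """excess[k][t] is the return of strategy k over the benchmark at time t."""
    rng = random.Random(seed)
    n_obs = len(excess[0])
    root_n = math.sqrt(n_obs)
    means = [sum(series) / n_obs for series in excess]
    observed = max(root_n * m for m in means)
    exceed = 0
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(n_obs, mean_block, rng)
        # Centre each bootstrap mean at its sample mean to impose the null.
        boot_max = max(root_n * (sum(series[t] for t in idx) / n_obs - m)
                       for series, m in zip(excess, means))
        if boot_max > observed:
            exceed += 1
    return exceed / n_boot


demo_rng = random.Random(42)
noise = [[demo_rng.gauss(0.0, 1.0) for _ in range(250)] for _ in range(10)]
print("p-value, zero-skill strategies only:", reality_check_pvalue(noise, n_boot=200))
```

The p-value answers a single question, whether the best strategy in the whole set beats the benchmark once the search is accounted for, which is why the later Romano-Wolf and model confidence set work was needed to say which strategies do.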
References
Harvey-Liu: Backtesting
Harvey-Liu: Lucky Factors
Harvey (video): The Risks of Historical Backtesting
Bailey-Borwein-de Prado-Zhu: Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance
Romano-Shaikh-Wolf: http://www.econ.uzh.ch/dam/jcr:ffffffff-935a-b0d6-ffff-ffffff21355a/et_2008.pdf
Lo-MacKinlay: Data-Snooping Biases in Tests of Financial Asset Pricing Models
Romano-Wolf (Efficient Computation): Efficient computation of adjusted p-values for resampling-based stepdown multiple testing
Corradi-Swanson: http://econweb.rutgers.edu/nswanson/papers/corradi_swanson_whitefest_1108_2011_09_06.pdf
Hansen-Lunde-Nason: The Model Confidence Set