Double-Digit Numerics has decades of experience in statistics, both theoretical and applied. We are well placed to do research, having completed PhD-level Statistics at Stanford University, California, which is ranked among the best universities in the world for Statistics research.
The areas listed below are a sample of those we have experience in, chosen for their relevance to business.
A number estimated from data without a confidence level may be misleading or dangerous. A hedge fund may report that it has averaged a 20% return over the past 5 years. Sounds like a good fund. But if all that return has been due to luck then it is not a good fund.
That figure of 20% is an estimate of the fund manager's ability to generate returns. It is an estimate because it was obtained from only 5 years' worth of data. So you want a confidence interval around it before you put confidence in the manager. If the confidence interval is -10% to +50% then you cannot be sure that the manager has any skill at all: the manager's true return-generating ability may be as low as -10% p.a.
Much of our work goes into generating confidence intervals. It is important work, though it is not always easy. For example, if you backtest a model and find that it returned 20% p.a. over 5 years, how do you calculate a confidence interval for that figure? Most people don't, either because they don't care (they should!) or because they don't know how. There are statistical methods for doing it.
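One such method is the bootstrap. Here is a minimal sketch, using an invented series of 60 monthly returns standing in for a 5-year backtest: resample the months with replacement many times and look at the spread of the recomputed annualised return.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical backtest: 60 monthly returns (5 years), invented for illustration.
monthly_returns = rng.normal(loc=0.015, scale=0.06, size=60)

def annualised(monthly):
    """Geometric annualised return from a series of monthly returns."""
    growth = np.prod(1.0 + monthly)
    years = len(monthly) / 12.0
    return growth ** (1.0 / years) - 1.0

# Bootstrap: resample the months with replacement and recompute the statistic.
boot = np.array([
    annualised(rng.choice(monthly_returns, size=len(monthly_returns), replace=True))
    for _ in range(10_000)
])

point = annualised(monthly_returns)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Backtest return: {point:.1%} p.a., 95% CI: [{lo:.1%}, {hi:.1%}]")
# Note: this simple bootstrap treats the months as independent; autocorrelated
# returns would call for a block bootstrap instead.
```

Even with a healthy-looking point estimate, the interval is usually wide enough to be sobering, which is exactly the point.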
Data Mining has a terrible reputation in the public mind. There are phrases such as "how to lie with statistics," "a statistician uses numbers like a drunk uses a lamppost - for support rather than illumination," and "torture the data until it confesses to what you want."
At DDNUM we have been torturing data for decades and we know when we have gone too far. We have a good feel for what happens to significance levels when you start mining the data (the significance drops!). We bring the right degree of scepticism and cynicism when we go mining. And we understand the implications of regression to the mean (which means that what you have found will quite likely disappear next year).
Data mining uses algorithms to discover predictive patterns in data sets.
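A quick sketch of why mined "discoveries" need that scepticism: test 100 signals that are pure noise against a return series that is also pure noise, and a handful will still look significant at the usual 5% level. The numbers here are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_days, n_signals = 1_000, 100
returns = rng.normal(0.0, 0.01, n_days)               # "market" returns: pure noise
signals = rng.normal(0.0, 1.0, (n_signals, n_days))   # candidate signals: also pure noise

# p-value for the correlation between each useless signal and next-day returns
p_values = np.array([stats.pearsonr(s[:-1], returns[1:])[1] for s in signals])

print(f"'Significant' signals at p < 0.05: {np.sum(p_values < 0.05)} out of {n_signals}")
print(f"Smallest p-value found by mining: {p_values.min():.4f}")
# Roughly 5 of the 100 useless signals will look significant. The mining itself
# manufactures the "discovery", which is why nominal significance levels can't
# be taken at face value once you have searched through many candidates.
```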
It is tempting to develop a model, to fit it and optimise it using past data, and then to project the past into the future. But such a model cannot be expected to produce the same results in the future, due to "regression to the mean" (RTTM). An easy way to explain RTTM is to consider a breeder's attempt to breed a tall plant variety by selecting tall plants to breed from. If you have a batch of plants with a mean height of 1.0 metres you can select the tallest 10 plants from the batch to breed from. Suppose these plants have a mean height of 1.2m. They are all extreme plants and their offspring will not be as extreme. So their offspring may have a mean height of 1.05m. Yes, you have offspring taller than the original batch, but the offspring are shorter than the parents.
This is a law of probability. If you select from the extremes of a distribution the next value from the distribution is more likely to be closer to the mean simply because there is more probability there.
The process of selecting the tallest plants is conceptually the same as fitting a model (using the best fit) or optimising a model, because you are going for the extremes. Best fit or optimisation is by definition an extreme-seeking process. So using the model in the future won't produce results as good as it did in the past. The results will regress back to the mean.
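The effect is easy to demonstrate with a small simulation. In this sketch, 200 model variants all have the same modest true edge (the figures are assumed, purely for illustration); the variant that happens to look best in-sample is selected, and its out-of-sample performance falls back toward the true edge.

```python
import numpy as np

rng = np.random.default_rng(2)

n_models, n_days = 200, 250

# 200 model variants, each with the same modest true daily edge of 0.02%,
# backtested over one year of noisy daily returns (figures assumed for illustration).
true_edge = 0.0002
in_sample = true_edge + rng.normal(0.0, 0.01, (n_models, n_days))
out_sample = true_edge + rng.normal(0.0, 0.01, (n_models, n_days))

# "Optimise": pick the variant with the best in-sample mean return.
best = np.argmax(in_sample.mean(axis=1))

print(f"Best model in-sample:     {in_sample[best].mean() * 250:.1%} p.a.")
print(f"Same model out-of-sample: {out_sample[best].mean() * 250:.1%} p.a.")
print(f"True edge of every model: {true_edge * 250:.1%} p.a.")
# The in-sample winner was selected partly for its luck, so its future
# performance regresses back toward the true (modest) mean.
```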
That's why there are thousands of books and papers and blogs promoting trading schemes that have worked in the past but won't work in the future.
When we backtest models we use hypothesis testing and cross-validation. These methods separate the data used to fit the model from the data used to test it, so that the model's performance is measured after regression to the mean has occurred. That is, we measure how much RTTM there is. Usually there is a lot. Efficient markets are like that!
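Here is a minimal sketch of what that separation looks like in practice, using a walk-forward split. The moving-average rule and the simulated prices are invented for illustration; the point is only that the parameter is chosen on one window and judged on the next.

```python
import numpy as np

rng = np.random.default_rng(3)
prices = 100 * np.cumprod(1 + rng.normal(0.0002, 0.01, 2_000))  # simulated price series

def rule_returns(prices, lookback):
    """Daily returns of a simple moving-average rule: long when price > its moving average."""
    returns = np.diff(prices) / prices[:-1]
    ma = np.convolve(prices, np.ones(lookback) / lookback, mode="valid")
    signal = (prices[lookback - 1:-1] > ma[:-1]).astype(float)
    return signal * returns[lookback - 1:]

train, test = prices[:1_000], prices[1_000:]

# "Fit" the rule by picking the lookback with the best in-sample performance...
lookbacks = range(5, 100, 5)
best = max(lookbacks, key=lambda lb: rule_returns(train, lb).mean())

# ...then measure it only on data it has never seen.
print(f"Best lookback in-sample: {best}")
print(f"In-sample mean daily return:     {rule_returns(train, best).mean():.5f}")
print(f"Out-of-sample mean daily return: {rule_returns(test, best).mean():.5f}")
```

The gap between the two figures is the regression to the mean being measured.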
The analysis of Time Series is one of the more difficult areas of statistics because the observations are correlated with one another. This means that many of the statistical rules that assume independent observations break down, among them the Central Limit Theorem and formulas that divide by the square root of n.
Unfortunately a lot of non-statisticians don't know that these rules no longer apply and use them anyway. Sigh. People divide by the square root of n and get numbers for risk that are too small. I hope that's not what caused the credit crunch of 2007. People use the Central Limit Theorem to assume that their numbers have a Gaussian (normal) distribution when in fact their numbers really have some other distribution that has higher risk (ditto credit crunch).
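A small simulation shows how badly the square-root-of-n formula can mislead when the data are autocorrelated. Here an AR(1) series with an assumed autocorrelation of 0.7 stands in for correlated data; the naive formula is compared with how much the sample mean actually varies across many simulated series.

```python
import numpy as np

rng = np.random.default_rng(4)

n, phi, n_sims = 250, 0.7, 5_000   # one year of daily data, strong positive autocorrelation

def ar1_series(n, phi):
    """Simulate an AR(1) series: x[t] = phi * x[t-1] + noise."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

sample_means = np.array([ar1_series(n, phi).mean() for _ in range(n_sims)])

one_series = ar1_series(n, phi)
naive_se = one_series.std(ddof=1) / np.sqrt(n)   # the "divide by sqrt(n)" formula
true_se = sample_means.std(ddof=1)               # how much the mean actually varies

print(f"Naive standard error (sigma / sqrt(n)): {naive_se:.3f}")
print(f"Actual standard error of the mean:      {true_se:.3f}")
# With positive autocorrelation the naive formula understates the uncertainty
# (and hence the risk) by a wide margin.
```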
Mean Reversion (MR) sounds like Regression to the Mean but the two are conceptually different. Most people get them mixed up. Mean Reversion says that something that deviates from the mean will eventually return to the mean ("what goes up must come down"). But that's not a law. It's a model. RTTM says that extreme values are more likely to be followed by less extreme values. That's a law, not a model.
If you get the two mixed up you could lose a lot of money. If what you think is a law isn't one, then what you expect to happen may never happen.
Cointegration is a model in which two time series are tied together in a way that can lead to mean reversion of the spread between them. The degree of cointegration can be measured, telling you whether to expect mean reversion or not, and when the model no longer fits you can stop using it. It is one of several ways of detecting mean reversion that are better than just noticing that it appears to happen in a chart.
An example is exchange rates. Here is a chart of the value of the USD vs the NZD since 1986. It certainly looks as if it fluctuates around a mean value of 0.6. When it gets too far away it "reverts" back. But that is just a model. It's not a law. If the inflation rate in New Zealand is consistently higher than in the USA then you would expect the exchange rate to drift up. It may revert to a mean but the mean itself is drifting.
Since MR is a model, you must test that the model is valid and fits the data before you use it. Yet so many people claim that a process is mean reverting as if that were a law and therefore needed no validating.
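As a sketch of what "test it first" can look like, here are two standard checks on simulated data: an Engle-Granger cointegration test between two series built to share a common trend, and an ADF test on their spread. The series, the hedge ratio and the thresholds are all invented for illustration; in practice the hedge ratio would be estimated by regression.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, coint

rng = np.random.default_rng(5)

# Two simulated prices that share a common random-walk component,
# so their spread is mean reverting by construction.
n = 1_000
common = np.cumsum(rng.normal(0, 1, n))
x = common + rng.normal(0, 1, n)
y = 0.8 * common + rng.normal(0, 1, n)

# Engle-Granger test: null hypothesis is "no cointegration".
t_stat, p_value, _ = coint(x, y)
print(f"Cointegration test p-value: {p_value:.4f}")

# ADF test on the spread: null hypothesis is "the spread has a unit root (no mean reversion)".
# The hedge ratio 1/0.8 is known here because we built the data; normally you estimate it.
spread = x - (1 / 0.8) * y
adf_p = adfuller(spread)[1]
print(f"ADF p-value for the spread: {adf_p:.4f}")
# Small p-values are evidence for mean reversion. A raw series like an exchange
# rate deserves the same scrutiny before a mean-reversion model is trusted.
```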
The theory of optimal Design of Experiments was started early last century by agricultural scientists who wanted to do experiments to compare, say, different fertilisers. You don't want to waste land or plants or time, so experiments are designed to make optimal use of these resources.
The theory is less used in finance. Suppose you want to test out two trading models. You have to design an "experiment" to test them. In this simple case the main element of the design is to work out how long to trade the models for before declaring a winner (and being reasonably sure that the declared winner is the true best model).
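One way to work out "how long" is a power calculation. The figures below are assumed purely for illustration: model A is expected to beat model B by 0.02% per day, daily return differences have a standard deviation of 1%, and we want 80% power at the usual 5% significance level.

```python
from statsmodels.stats.power import TTestIndPower

# Assumed figures, purely for illustration.
daily_edge = 0.0002            # expected difference in mean daily return
daily_sd = 0.01                # standard deviation of daily returns
effect_size = daily_edge / daily_sd   # standardised effect size

days_needed = TTestIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"Trading days needed per model: {days_needed:.0f}  (~{days_needed / 250:.0f} years)")
# A tiny edge relative to the noise means a very long experiment --
# which is exactly what experimental design forces you to confront up front.
```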
The main business use of experiments is perhaps testing marketing campaigns on clients, or testing web pages for their ability to attract sales. If you have 10 web pages and you want to test two different versions of each page, that means you have 1024 different combinations of pages to test. Experimental design theory lets you get the information you want using far fewer than the full 1024 combinations.
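Here is a minimal sketch of the idea behind one such design, a two-level fractional factorial: run a full factorial in 4 "base" factors and generate the other 6 columns as products of the base columns, giving main-effect information from 16 runs instead of 1024. The particular choice of generator columns here is one assumed option, not the only one.

```python
import numpy as np
from itertools import product

# Full factorial in 4 base factors, coded -1 / +1 (16 runs).
base = np.array(list(product([-1, 1], repeat=4)))
A, B, C, D = base.T

# The remaining 6 factors are assigned to interaction columns of the base factors.
generators = [A * B * C, A * B * D, A * C * D, B * C * D, A * B * C * D, A * B]
design = np.column_stack([base] + generators)   # 16 runs x 10 factors

print(design.shape)   # (16, 10): 16 test combinations instead of 1024
print(design[:4])     # each row says which version of each of the 10 pages to show
# The price of the saving is aliasing: some main effects are confounded with
# interactions, which is often acceptable when you mainly want to know which
# page versions matter at all.
```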
When you find a pattern using Data Mining you wonder if the pattern is "significant" or arose due to chance. So you want to test the null hypothesis that the pattern arose by chance. You hope to reject that hypothesis and thereby conclude that the pattern is a genuine pattern (that you can make money from).
Hypothesis Testing theory is pretty much used only by statisticians, as most other people have not heard of it. You can save a lot of money by testing a trading rule against a null hypothesis rather than spending money trying it out live.
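A minimal sketch of such a test, on an invented year of daily returns from a trading rule: the null hypothesis is that the rule's true mean return is zero, i.e. that the pattern is just chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Hypothetical daily returns of a trading rule over one year (invented for illustration).
rule_returns = rng.normal(0.0005, 0.01, 250)

# Null hypothesis: the rule's true mean daily return is zero (the pattern is just chance).
t_stat, p_value = stats.ttest_1samp(rule_returns, popmean=0.0)

print(f"Mean daily return: {rule_returns.mean():.4%}")
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null: the pattern looks genuine at the 5% level.")
else:
    print("Cannot reject the null: the pattern may well be luck.")
# Caveat: the t-test assumes roughly independent returns; autocorrelated returns
# need the time-series adjustments discussed above.
```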
We at DDNUM have a saying: "if you can't see it then it ain't really there." This refers to the role of visualisation and intuition in the analysis of data.
Some effect may be significant because it is highly unlikely to be a chance event (people say "p < 0.05"). But if you can't plot the data to give a visual impression of the effect then it is (and rightly so) hard to swallow.
A fund manager may say that the fund has beaten the index by 4% p.a. over the last 10 years. But I'd rather see a chart of the monthly fund returns vs the monthly index returns to see what the fund has really been up to. Most people can tell at a glance if the fund has alpha and get a rough idea of a confidence interval for that alpha (the alpha value is where a line drawn through the returns intersects the y axis).
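The line the eye fits on that chart is just a regression of fund returns on index returns, and its intercept is the alpha. Here is a minimal sketch on invented monthly data; the return figures are assumed purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Invented monthly data: 10 years of index returns and a fund that adds a little alpha.
index = rng.normal(0.006, 0.04, 120)
fund = 0.003 + 1.1 * index + rng.normal(0.0, 0.01, 120)

# Regress fund returns on index returns: the intercept of the fitted line is the alpha.
model = sm.OLS(fund, sm.add_constant(index)).fit()
alpha, beta = model.params
alpha_ci = model.conf_int()[0]   # 95% confidence interval for the intercept

print(f"Estimated alpha: {alpha:.2%} per month ({(1 + alpha) ** 12 - 1:.1%} p.a.)")
print(f"95% CI for alpha: [{alpha_ci[0]:.2%}, {alpha_ci[1]:.2%}] per month")
print(f"Beta: {beta:.2f}")
# Where the fitted line crosses the y axis is the alpha, and the confidence
# interval says how sure you can be of it -- the same judgement the eye makes
# from the scatter plot, only quantified.
```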