Statistical Learning builds prediction models. The term refers to learning from data, specifically data that has an element of randomness in it. This data is called the training set and consists of observables and outcomes. We use this data to create prediction rules, which we then apply to new data to predict its outcomes.
A financial example is to predict stock prices of companies from their financial reports by looking at past data showing the relationship between financial reports and stock prices.
There are a number of different ways to construct prediction rules. Some examples are described below, arranged roughly in order of increasing predictive power.
Regression Models construct equations that give the outcome as an explicit function of the data, such as y = a + b*x. Linear functions are the best known and are commonly available in most packages. Nonlinear models and models with more than one explanatory variable are more complicated and require expertise to fit and to interpret.
Financial data often has dozens of explanatory variables, and regression models quickly become dangerous as the number of variables increases. The coefficients of the regression become meaningless due to noise and correlations among the variables. Techniques such as Ridge Regression and the Lasso are required to reduce prediction error.
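As a rough sketch in Python (scikit-learn), with made-up data standing in for financial-report variables, here is how an ordinary regression compares with Ridge and Lasso fits:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Hypothetical data: 200 stocks, 30 noisy financial-report variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)             # two highly correlated columns
y = 2.0 * X[:, 0] - 1.0 * X[:, 5] + rng.normal(size=200)   # outcome, e.g. next-period price

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)    # shrinks and sets many coefficients exactly to zero

# With correlated inputs the OLS coefficients become unstable; the penalised fits are tamer.
print(ols.coef_[:2], ridge.coef_[:2], lasso.coef_[:2])
```

In practice the penalty strengths (alpha) would themselves be chosen by cross-validation, which is discussed further below.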
Projection pursuit is a type of regression technique which involves finding the most "interesting" possible projections in multidimensional data. A projection is a way of reducing multidimensional data into fewer dimensions similar to the way a projector can project a 3D object onto a 2D screen.
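To make the idea of an "interesting" projection concrete, here is a very simplified Python sketch (our illustration, not a full projection pursuit implementation): random one-dimensional projections are scored by how non-Gaussian they look, using excess kurtosis as a crude interestingness index.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
# Hypothetical 5-dimensional data containing two hidden clusters.
X = np.vstack([rng.normal(0, 1, size=(100, 5)),
               rng.normal(3, 1, size=(100, 5))])

best_score, best_direction = -np.inf, None
for _ in range(1000):
    w = rng.normal(size=5)
    w /= np.linalg.norm(w)        # random unit direction
    z = X @ w                     # project the data onto that direction
    score = abs(kurtosis(z))      # "interestingness": distance from Gaussian
    if score > best_score:
        best_score, best_direction = score, w

print(best_direction, best_score)   # the retained projection and its score
```

The best-scoring direction is the one that separates the two hidden clusters, which is exactly the kind of structure projection pursuit is designed to find.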
Deep Learning is a new area of Machine Learning which goes a bit "deeper" into the data, first deriving transformations of it (such as clustering) to which the usual methods of machine learning are then applied. An example might be a word recognition task where the deeper task is to recognise letters first.
(From Wikipedia) Various deep learning architectures such as deep neural networks, convolutional deep neural networks, and deep belief networks have been applied to fields like computer vision, automatic speech recognition, natural language processing, and music/audio signal recognition where they have been shown to produce state-of-the-art results on various tasks.
We can find the best Deep Learning algorithms for your situation and deliver them as R, MATLAB, Python, or stand-alone code including implementations using, for example, 0xdata, Torch, Theano, and Caffe.
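As a toy illustration only (a real project would use one of the frameworks above), a small multi-layer network can be sketched in Python with scikit-learn:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Digit recognition as a stand-in for the "recognise letters first" idea.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three hidden layers: each layer learns a transformation of the previous one.
net = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```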
Instead of giving the outcome as an explicit function of the data, the outcomes are simply smoothed. This gives the prediction as a kind of average of the outcomes in the neighbourhood of the point being predicted. In the chart at the left, the black line is a regression line with a simple equation of the form y = a + b*x and the blue line is a smoothed line (calculated using the LOWESS smoother). It isn't possible to give an equation for this other than y = LOWESS(x). But you can almost tell by eye that the blue line is a better predictor of the y value than the black line.
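A minimal Python sketch of the same comparison, using the LOWESS smoother from statsmodels on made-up data with a clearly nonlinear relationship:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + 0.3 * rng.normal(size=200)   # outcomes with a nonlinear pattern

# Straight line y = a + b*x versus the smoothed curve y = LOWESS(x).
line = LinearRegression().fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1))
smoothed = lowess(y, x, frac=0.2, return_sorted=False)

print("line error:  ", np.mean((y - line) ** 2))
print("LOWESS error:", np.mean((y - smoothed) ** 2))
```

Note that this compares the fit on the training data itself; judging how well each approach predicts new data is where cross-validation, described next, comes in.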
The advantage of this approach over regression is that the predictions are not constrained to fit a neat equation calculated from the data, so they can fit the data better. The disadvantage is that the predictions mostly use only data close to the point being predicted and so get less help from data further away, so they may not be as good. The optimum tradeoff has to be determined by techniques such as cross-validation, where you drop out some data to see how well the dropped data is predicted by the remaining data.
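A minimal Python sketch of that idea, using scikit-learn's cross-validation to compare a straight-line fit with a nearest-neighbour smoother (the width of the smoothing neighbourhood is the kind of setting you would tune this way):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.3 * rng.normal(size=200)

# Each model is repeatedly fitted with part of the data dropped out and
# scored on how well it predicts the dropped-out part.
for name, model in [("line", LinearRegression()),
                    ("smoother k=5", KNeighborsRegressor(n_neighbors=5)),
                    ("smoother k=50", KNeighborsRegressor(n_neighbors=50))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, -scores.mean())
```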
There are numerous different smoothers available to statisticians. Their general form could be described by y = smooth(x). But there are other ways of smoothing data such as Kernel methods.
Kernel Methods smooth by fitting a different regression function for each data point. The regression uses only data close to that point and the whole sequence of regressions is combined by weighting to produce an overall smooth curve. The weighting function is called the kernel.
These kernel smoothers give us a bit more information than the smoothers above because the weights tell us something about the observations. The methods also have application to other problems such as classification and probability density estimation.
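A minimal Python sketch of a kernel smoother, in the simplest case where the local fit at each point is just a weighted average of nearby outcomes (a Nadaraya-Watson smoother with a Gaussian kernel):

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth=0.5):
    """Nadaraya-Watson smoother: a Gaussian-kernel weighted average at each point."""
    x_new = np.atleast_1d(x_new)
    # Kernel weights: observations close to each prediction point count most.
    w = np.exp(-0.5 * ((x_new[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + 0.3 * rng.normal(size=200)
print(kernel_smooth(x, y, [2.5, 5.0, 7.5]))   # smoothed predictions at three points
```

The bandwidth plays the same role as the smoothing span above and would normally be chosen by cross-validation.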
Classification is where you have data with observables that you are already able to classify, and you wish to determine rules for making that classification. An example is where you have company financial information for a number of stocks for which you know whether they are "growth" or "value" stocks. You can then develop classification rules that you can apply to other stocks to determine their classification.
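A minimal Python sketch of that workflow, with made-up fundamentals standing in for real company data and a logistic regression as the classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [P/E ratio, dividend yield %] per stock,
# labelled 1 for "growth" and 0 for "value".
X = np.array([[45, 0.0], [38, 0.5], [30, 1.0], [12, 4.0], [10, 5.5], [15, 3.0]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X, y)

# Apply the learned rule to stocks whose classification is unknown.
new_stocks = np.array([[40, 0.2], [11, 4.8]])
print(clf.predict(new_stocks))          # predicted class labels
print(clf.predict_proba(new_stocks))    # and the probabilities behind them
```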
Clustering is where you do not know the classifications in advance but let the data themselves suggest classifications. Clusters may reveal useful information about stocks, but this kind of analysis tends to be used more in the examination of customer databases, where it can be very useful to know if your customers fall into clusters.
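A minimal Python sketch of letting the data suggest their own groups, here with k-means on the same kind of made-up fundamentals:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: [P/E ratio, dividend yield %] with no labels given.
X = np.array([[45, 0.0], [38, 0.5], [30, 1.0], [12, 4.0], [10, 5.5], [15, 3.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # which cluster each stock was assigned to
print(kmeans.cluster_centers_)  # the "typical" stock in each cluster
```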
A Classification Tree (sometimes called a decision tree) is a prediction method which provides a set of rules that ask a series of questions where the content of one question depends on the answer to the previous question.
For example, if you want to determine whether a company is a "growth" stock or a "value" stock you may ask "is the PEG ratio less than 0.88?", followed by the question "is the dividend payout ratio less than 73%?" Statistical methods will have determined which are the best questions to ask and what the best values of the parameters (0.88 and 73% in the example) are. The rule will also have been calibrated so that it gives the probability of its prediction being correct.
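A minimal Python sketch of fitting such a tree; the PEG and payout figures below are made up, and in a real fit the questions and cut-off values are chosen by the algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [PEG ratio, dividend payout ratio %],
# labelled 1 for "growth" and 0 for "value".
X = np.array([[0.6, 20], [0.8, 35], [0.7, 10], [1.5, 80], [2.0, 75], [1.8, 60]])
y = np.array([1, 1, 1, 0, 0, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The fitted tree is itself the set of questions and cut-off values.
print(export_text(tree, feature_names=["PEG", "payout %"]))
print(tree.predict_proba([[0.9, 50]]))   # probability attached to the prediction
```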
Regression Trees are similar but the prediction from the set of questions is a numerical value such as the predicted stock price rather than a classification such as whether it is a growth stock or not.
Neural Networks were developed by computer scientists working in artificial intelligence as a way to mimic the way the brain works. The brain consists of interconnected cells where the strength of the connections varies. This is easily mimicked in a computer, but the difficult part is converting the problem into cell form. It is difficult to interpret the resulting model, but it is nevertheless easy to get predictions from it.
Essentially, neural networks take nonlinear functions of linear combinations of the input data, which is a powerful approach for prediction and classification. It may not be of concern that the model cannot be interpreted - whether or not it makes money could be more important.
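The "nonlinear functions of linear combinations" idea can be written out directly. Here is a minimal Python sketch of a forward pass through one hidden layer, with the weights made up rather than trained:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.array([1.2, -0.7, 0.3])          # one observation with three inputs

W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer: 4 cells
W2, b2 = rng.normal(size=4), rng.normal()              # output layer: 1 cell

hidden = np.tanh(W1 @ x + b1)   # nonlinear function of linear combinations of the inputs
output = W2 @ hidden + b2       # the prediction; training adjusts W1, b1, W2, b2
print(output)
```

Fitting a network consists of adjusting those weights so the outputs match the training outcomes.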
There is a lot of software available for fitting Neural Networks - far more than for many other statistical techniques - so this may be a consideration when choosing a method.
Support Vector Machines are used for classification and regression. They use hyperplanes to maximise the "margin" between the data sets (such as value and growth stocks) that we are trying to classify. Support Vector Regression is similar but produces a numerical prediction rather than a classification.
Support Vector Machines are relatively new in Statistical Learning and so offer you a competitive edge. They are a nonlinear method and so can be quite powerful. They can require a lot of computer power as they have some difficult optimisations to do.
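A minimal Python sketch of a support vector classifier on made-up growth/value data, with the nonlinearity coming from the kernel:

```python
import numpy as np
from sklearn.svm import SVC, SVR

# Hypothetical features: [PEG ratio, dividend payout ratio %]; 1 = growth, 0 = value.
X = np.array([[0.6, 20], [0.8, 35], [0.7, 10], [1.5, 80], [2.0, 75], [1.8, 60]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = SVC(kernel="rbf", C=1.0).fit(X, y)    # maximise the margin between the classes
print(clf.predict([[0.9, 50]]))
print(clf.support_vectors_)                 # the observations that define the margin

# Support Vector Regression: same machinery, numerical prediction instead of a class.
reg = SVR(kernel="rbf").fit(X, np.array([25.0, 30.0, 28.0, 12.0, 10.0, 14.0]))
print(reg.predict([[0.9, 50]]))
```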
Bayesian Analysis is one of our favourite learning methods because it is powerful. The downside is that it does not easily produce point estimates of predictions; it produces probability distributions instead, and the numerical integration needed to turn these into point estimates can use a LOT of computer time. But modern fast multi-core computers make this kind of learning much more feasible than in the past, which means it is still a relatively new field of research that can give you a competitive edge.
In Bayesian Analysis you update probabilities based on numbers you find in the data. For example, you may start off with a probability of 50% that the stock you are interested in beats the market. Then, after examining the data, you refine the probability to 72%. This makes the stock a good buy. But if the probability drops to 38% it may be a bad buy. (These probabilities are calculated using an old probability rule called Bayes rule. The rule lets you update the probability of a hypothesis given evidence by using the probability of the evidence given the hypothesis.)
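A minimal Python sketch of that update; the prior is 50% as in the example, and the two likelihoods are made-up numbers chosen to reproduce the 72% figure:

```python
# Hypothesis H: the stock beats the market.  Evidence E: what we saw in the data.
prior = 0.50             # P(H) before looking at the data
p_e_given_h = 0.72       # P(E | H)      - assumed for illustration
p_e_given_not_h = 0.28   # P(E | not H)  - assumed for illustration

# Bayes rule: P(H | E) = P(E | H) * P(H) / P(E)
evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
posterior = p_e_given_h * prior / evidence
print(posterior)   # 0.72 - the updated probability that the stock beats the market
```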
In applications to Statistical Learning in finance we mostly use probability distributions rather than probabilities. The distributions allow you to calculate risks as well as probabilities.
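A minimal Python sketch of working with a full distribution rather than a single probability: a Beta posterior for the chance that a stock beats the market, updated with hypothetical monthly counts:

```python
from scipy.stats import beta

# Start from a flat Beta(1, 1) prior on p = P(stock beats the market in a month).
# Hypothetical evidence: it beat the market in 14 of the last 20 months.
posterior = beta(1 + 14, 1 + 6)

print(posterior.mean())           # central estimate of p
print(1 - posterior.cdf(0.5))     # probability that p really exceeds 0.5
print(posterior.ppf(0.05))        # a downside (risk) figure: 5th percentile of p
```

Having the whole distribution is what lets you quote a risk figure, such as the 5th percentile, alongside the headline probability.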