1

30

在继续浏览本公司网站前，请您确认您或您所代表的机构是一名“合格投资者”。“合格投资者”指根据任何国家和地区的证券和投资法规所规定的有资格投资于私募证券投资基金的专业投资者。例如根据我国《私募投资基金监督管理暂行办法》的规定，合格投资者的标准如下：

一、具备相应风险识别能力和风险承担能力，投资于单只私募基金的金额不低于100万元且符合下列相关标准的单位和个人：

1、净资产不低于1000万元的单位；

2、金融资产不低于300万元或者最近三年个人年均收入不低于50万元的个人。(前款所称金融资产包括银行存款、股票、债券、基金份额、资产管理计划、银行理财产品、信托计划、保险产品、期货权益等。)

二、下列投资者视为合格投资者：

1、社会保障基金、企业年金等养老基金、慈善基金等社会公益基金；

2、依法设立并在基金业协会备案的投资计划；

3、投资于所管理私募基金的私募基金管理人及其从业人员；

4、中国证监会规定的其他投资者。

如果您继续访问或使用本网站及其所载资料，即表明您声明及保证您或您所代表的机构为“合格投资者”，并将遵守对您适用的司法区域的有关法律及法规，同意并接受以下条款及相关约束。如果您不符合“合格投资者”标准或不同意下列条款及相关约束，请勿继续访问或使用本网站及其所载信息及资料。

投资涉及风险，投资者应详细审阅产品的发售文件以获取进一步资料，了解有关投资所涉及的风险因素，并寻求适当的专业投资和咨询意见。产品净值及其收益存在涨跌可能，过往的产品业绩数据并不预示产品未来的业绩表现。本网站所提供的资料并非投资建议或咨询意见，投资者不应依赖本网站所提供的信息及资料作出投资决策。

与本网站所载信息及资料有关的所有版权、专利权、知识产权及其他产权均为本公司所有。本公司概不向浏览该资料人士发出、转让或以任何方式转移任何种类的权利。

本声明包含网络使用的有关条款。凡浏览本网站及相关网页的用户，均表示接受以下条款。

1、并非所有的客户都可以获得所有的产品和服务，您是否符合条件享受特别产品和服务，最终的解释权归我公司。我公司保留对该网页包含的信息和资料及其显示的条款、条件和说明变更的权利。

2、任何在本网站出现的信息包括但不限于评论、预测、图表、指标、理论、直接的或暗示的指示均只作为参考，您须对任何自主决定的行为负责。

3、本网站提供的有关投资分析报告、股市预测文章信息等仅供参考，股市有风险，入市须谨慎！本网站所提供之公司资料、个股资料等信息，力求但不保证数据的准确性，如有错漏，请以基金业协会公示信息报刊为准。本网站不对因本网资料全部或部分内容产生的或因依赖该资料而引致的任何损失承担任何责任。

4、互联网传输可能会受到干扰，中断、延迟或数据错误，本公司对于非本公司能控制的通讯设施故障可能引致的数据及交易之准确性或及时性不负任何责任。

5、凡通过本网站与其他网站的链结，而获得其所提供的网上资料及内容，您应该自己进行辨别及判断，我公司不承担任何责任。

6、本站某些部分或网页可能包括单独条款和条件，作为对本条款和条件的补充，如果有任何冲突，该等附加条款和条件将对相关部分或网页适用。

7、本人已阅读并同意《数字证书服务协议》。

1、并非所有的客户都可以获得所有的产品和服务，您是否符合条件享受特别产品和服务，最终的解释权归我公司。我公司保留对该网页包含的信息和资料及其显示的条款、条件和说明变更的权利。

2、任何在本网站出现的信息包括但不限于评论、预测、图表、指标、理论、直接的或暗示的指示均只作为参考，您须对任何自主决定的行为负责。

3、本网站提供的有关投资分析报告、股市预测文章信息等仅供参考，股市有风险，入市须谨慎！本网站所提供之公司资料、个股资料等信息，力求但不保证数据的准确性，如有错漏，请以基金业协会公示信息报刊为准。本网站不对因本网资料全部或部分内容产生的或因依赖该资料而引致的任何损失承担任何责任。

4、互联网传输可能会受到干扰，中断、延迟或数据错误，本公司对于非本公司能控制的通讯设施故障可能引致的数据及交易之准确性或及时性不负任何责任。

5、凡通过本网站与其他网站的链结，而获得其所提供的网上资料及内容，您应该自己进行辨别及判断，我公司不承担任何责任。

6、本站某些部分或网页可能包括单独条款和条件，作为对本条款和条件的补充，如果有任何冲突，该等附加条款和条件将对相关部分或网页适用。

7、本人已阅读并同意《数字证书服务协议》。

Chuan Shi 2016-11-09
本文章840阅读

**1**

**Introduction**

**The marriage between machine learning and investment populates stock selection using artificial intelligence. In the context of machine learing, stock picking is a typical classification problem. **Various financial and technical factors are viewed as features of stocks, and the returns of the assets are used to determine the labels they possess.

**Stock selection with machine learning belongs to the domain of supervised learning. It separates good and bad stocks with labels and trains a classification model with historical data. The model extracts the relationship among the labels and the features and parameters of the model are then determined. **Once the model is obtained, it is applied to the latest features' data of stocks in the hope of picking stocks that will generate higher absolute returns in the next holding period. Without a doubt, the classification algorithm is the key in this process.

**Prevailing supervised learning classification algorithms include support vector machines, naïve Bayes, K nearest neighbors, neural nets, as well as classification trees (or decision trees) and a set of more advanced algorithms derived from simple tree-structured methods. **The figure below shows the effects of these algorithms on three public data sets.

In today's article and the next one, we will talk about tree-structured classification algorithms. In particular, we will start with the basic classification tree and then introduce more advanced techniques including **bagging**, **boosting**, and **random forest**. Unlike the basic algorithm that uses a single tree, these advanced algorithms employee multiple trees to improve the classification performance. **Bagging, Boosting, and Random forest are all in the family of ensemble learning, which combines the results from multiple basic trees by using the majority rule to achieve better classification accuracy. **

The introduction to the basic classification tree and its CART algorithm is the topic of today's article. We will also talk about its pros and cons.

**2**

**Classification tree and the CART algorithm**

**2.1 Classification tree**

We describe a typical classification problem as follows.

**There are n sample points and m classes. Each sample point belongs to one of those m classes. Each sample point is described by a k-dimension vector of features. A classification tree (or any other classification algorithm) is a classifier that maps the sample points from the feature space to the class domain. It classifies the sample points according to their features.**

A classifier belongs to supervised learning because its parameter selection depends on how the model learns from the historical labelled training data. When training a classifier, the correct labels of all training samples are known in advance, and they are used along with features as input in the modeling process.

**A classification tree is constructed by making repetitive splits of all the sample points and the subsequently created subsets of the data. In each split, a linear combination of one or more features is used as the rule to further partition the data. Each round of split is a finer partition of its predecessor, and eventually a hierarchical structure is formed.**

The following example comes from Wikipedia and it demonstrates a classification tree that predicts the survival of passengers on the Titanic. In this example, demographic properties are features and class labels are 'survived' and 'died'.

In this tree, a bold black node represents the feature selected at each stage. For example, the first stage uses gender (two branches are men and women), the second stage selects age (two branches are for people over and under 9.5 years old), and the third stage picks the number of spouses or siblings aboard (two branches for the number greater than 2.5 or not). They are **internal nodes** in this classification tree. The red and green round-corner rectangles represent the **leaves (terminal nodes)**, with each leaf being marked with a definite class label ('survived' or 'died' in this case). Once the classification tree is constructed, its internal nodes clearly illustrate the classification process. In this case, it is straightforward to see that women and children are predicted as the main survivors by the model, and this is consistent with the reality.

Although this example is simple, it demonstrates two important points about a classification tree:

**1. As a supervised learning algorithm, a classification tree is capable of identifying the most effective features. **As in this example, the classification tree correctly chooses the gender and age (rather than other characteristics, such as body weight or cabin level) as the most important classification features.

**2. A classification tree predicts all samples belonging to a given leaf to be in the class of that leaf, no matter what their actual class labels are. When training the model, we do not deliberately require samples of one leaf indeed all belong to the class of the leaf. **(With finer and finer segmentation of the data, it is always possible to construct a tree where all samples match the labels of the leaves they belong to, but that often means overfitting.) In this example, all female passengers are assigned to a class of 'survived'. However, in fact, only 73% of women survived, which means that there are 27% of female samples whose predicted labels do not agree with their true labels. Of course, we can 'improve' this in-sample classification accuracy by further partitioning, such that the tree assigns the class of 'died' to a female whose name is Rose and who knows Jack, and the class of 'survived' to a female whose name is Anna from the first cabin. However, this will lead us to step into the trap of overfitting and such a model makes no sense to the out-of-sample data. Will there be another female named Rose who happens to know Jack in the future? Unlikely. Even if she does, will she must die when the ship hits an iceberg?

These two points show that the prediction accuracy of a classification tree depends on two things: **(1) how to select the most effective features that capture the relationship between features and labels**, and **(2) how to determine the size of the classification tree **(a rough tree means features are not fully utilized, while a fine tree has the danger of overfitting; so there must be a trade-off.).

We next introduce a nice CART algorithm that determines the optimal classification tree.

**2.2 CART algorithm**

CART stands for Classification And Regression Trees, and it was developed by Breiman et al. (1984). It has become a popular commercial software package for implementing classification tree algorithms. It determines the classification tree model according to the historical data and evaluation criteria (i.e., selecting the classification features of each stage and determining the size of the classification tree).

CART is concerned with the following issues: **(1) how to pick the best feature at each split, (2) how to measure the accuracy of the classification, and (3) how to determine the size of the tree by considering both the accuarcy and the complexity of the tree. **In what follows, we explain these issues one by one.

**2.2.1 Selecting the classification features**

**CART uses recursive binary partitioning to create a binary tree.** For each potential feature to be selected for the current split, samples to be further partitioned are assigned to two segments according to whether or not (binary) they satisfy that feature. In the illustration below, suppose that we have chosen feature A1 in the previous stage, and we want to determine which feature to use for the current split to further partition the subset of the data that satisfies A1.

**When we do so, all feasible features are evaluated, and the one which leads to the greatest increase in node purity is chosen. This can be accomplised using an impurity function, which is a function of the proportions of the learning sample belonging to the possible classes.**

Let's say we have s samples that satisfy A1, and among them there are Pi samples belong to class i, i = 1, ..., J. Then, before we pick the feature for the current split, the purity of node A1 can be measured by either the **Gini index of diversity **or** the entropy function.**

**Gini index of diversity:**

**Entropy function:**

The impurity function assumes its minimum value for a node that is completely pure, having all cases from the learning sample corresponding to the node belonging to the same class. In this case, the purity of the node is 1. In reality, since it is not possible for all samples to have the same label, the purity of a node is less than 1.

To assess the quality of a potential split (that is the chosen of a potential feature for the current node), one can compute the value of the impurity function using the cases in the learning sample corresponding to the parent node (A1 in our case), and substract from this the weighted average of the impurity for the two children nodes (i.e., the two nodes for whether or not A2 is satisfied), with the weights proportional to the number of cases of the sample corresponding to each of the two children nodes, to get the decrease in the overall impurity that would result from the split. Among all the potential ways to split (i.e., features to use), the one which will result in the greatest increase in node purity is chosen. Note that this only guarantees local optimum as the split at the current layer does not consider subsequent splits. As a result, CART is a greedy algorithm when it selects features to partition the data.

**2.2.2 Measuring the classification accuracy**

We are concerned with the prediction accuracy of the classification tree on **out-of-sample new data**. Therefore, in evaluating the accuracy of the classification tree, it is meaningless to only consider its effectiveness on the training data. In case of extreme overfitting, the model's error on the training set can be reduced to 0. Suppose that we construct a tree such that each sample itself becomes an terminal node of the tree. Such a tree provide 100% accuracy on the training data but it is not useful at all when we apply it to run classification on new data.

Therefore, to measure the classification accuracy, we must use the data unseen in the model training process. We call such data the testing data set. It is often assumed that the training and testing data come from the same underlying population and therefore they share the same distribution. As a result, as long as we have sufficient independent data, it is always a good idea to save enough data from training and use them for testing. But in many cases, especially in the financial sector, data is scarce and valuable. For stock selection, we want to use historical data as much as possible to train the model. That is we do not want to simply save some data for testing (although we know how important it is!) such that we do not have sufficient training data to train a meaningful classification tree. In this case, **K-fold cross validation** can be employeed.

K-fold cross validation divides all the sample data into K segments of equal size. Each time one segment of data is preserved for the validation purpose while the other K - 1 segments are used for traininng the model. The classification error on that validation data set is used to evaluate the accuracy of the model built using the rest of the data. As a result, there are a total of K trees (in practice, the value of K is generally between 5 and 10).

**To train these K trees, we must ensure that the modeling criteria are the same (for example, the impurity function should be the same, or the number of terminal nodes, i.e., leaves, in all of these threes should be the same). **Averaging the classification errors of these K trees gives the final measure of the accuracy of the classification tree. Should we use the above modeling criteria to construct a tree using all the sample data, this average can serve as an estimate of the model error.

**2.2.3 Determining the size of the tree**

One must specify how to grow the best tree. **However, it does not work very well to use some sort of a stop splitting rule to determine that a node should be declared a terminal node and the corresponding subset of data not split any further. **This is because it can be the case that the best split possible at a certain stage may decrease impurity by only a small amount, but if that split is made, each of the subset of data corresponding to both of the descendant nodes can be split in ways to produce an appreciable decrease in impurity.

**Because of this phenomenon, what works better is to first grow a very large tree (for example, splits can be made until 5 or fewer members of the sample correspond to each terminal node). We refer to this large tree as T0. Then a sequence of smaller trees can be created by pruning T0 and a tree having a fewer number of nodes is produced. **In this process, both the accuracy and the complexity of the tree will be considered to determine the best tree. A method that achieves these goals is called **minimum cost–complexity pruning.**

Suppose we have a tree denoted by T, and the let |T| be the number of terminal nodes of the tree, R(T) be the estimate of the classification error. Let a ≥ 0 be the penalty coefficient for the complexity of the tree. Then, the cost-complexity measure, Ra(T), for the tree T is,

Ra(T) = R(T) + a|T|

When a is small, accuracy takes precedence over complexity, and this will lead to a tree with a large number of nodes. As a increases, so does the aversion to complexity. This will result in a tree with fewers nodes and less accuracy. We must search for the optimal value of a that minimizes Ra(T).

To obtain the best tree, the algorithm starts with T0. For different values of a in an ascending order, it generates a set of **nested trees** T1, T2, ... using **weakest-link pruning**. Since pruning reduces the number of nodes, it decreases the classification accuracy. Weakest-link pruning means to cut the node whose negetive impact to Ra(T) is minimum (Sutton 2005).

Nested trees mean that T1 is obtained by pruning T0, and T2 is obtained by pruning T1, etc. **Each new tree is obtained by pruning its predecessor.** To give you a concrete example, suppose we want to pick the best value of the parameter a among 1 to 10, and we start with a = 1. We generate T1 from T0 such that R1(T1) is minimized. Then, we set a to 2, and generate T2 from T1 to ensure that R2(T2) is minimized. We keep doing this and finally we have a tree T9 and a = 10. We then grow the last tree T10 by requiring R10(T10) to be minimum. The procedure above creates a set of nested trees T1 to T10, and each of the tree is optimal to their corresponding values of a. Then we compare these 10 values of Ra(Ta), a = 1, 2, ..., 10, and the a* that gives the minimum of Ra(Ta) is the optimal value of a. The corresponding tree Ta* is the best tree.

To summarize, the steps to determine the best tree with the usage of K-fold cross validation are:

Steps to determine the optimal tree

**1. **Divide the training data into K folds.

**2. **For each fold, conduct the following:

**2.1 **Set aside this fold as testing data

**2.2 **Use the rest K-1 folds of data to train the model and grow a initial tree T that is large enough

**2.3 **Start with T and the range of a, construct a set of nested trees by using minimum cost–complexity pruning

**3. **After step 2, for each value of a, there are K different trees (since we use K-fold), and therefore K different Ra for that specific a. We take their average as the measure of the quality of the tree associated with a. For all the values of a, we pick the optimal one that gives us the minimum of Ra. This value of a, denoted by a*, is optimal.

**4. **Use all the training data and a*, train the model to grow a tree that minimizes Ra*, and this is our best tree.

**3**

**Pros and Cons**

Tree-structured classification is a nonparametric computationally intensive method. With the power of modern computers, the computation intensity is no longer a big concern and this method has greatly increased in popularity during the past dozen years. They can be applied to data sets having both a large number of sample points and a large number of features. In addition, classification trees are easy to interpret and they have no requirement on the dependency of the data.

**There, however, are some inherent deficiencies in classification trees. **Let x be the features' vector and it comes from some unknown population distribution, Yx be the true class labels of samples, and C(x) be the predicted labels of samples from the tree. Note that C(x) is therefore a random variable and we use E[C(x)] to denote the expectation of C(x). The error of such a classification tree then satisfies,

**This derivation says the error of the classification has two parts -- the bias E([Yx – E[C(x)]]^2) and the variance Var(C(x)). A classification tree is a unstable classifier such that it is very sensitive to the training data. **When there is variation in the training data, the model parameters can be different and the out-of-sample classification results can change. This means that Var(C(x)) can make a big contribution to the classification error. **Such a large variance is a intrinsic part of a classification tree, but if we could somehow use E[C(x)] to replace C(x) in the prediction, we can avoid the variance of C(x).**

**Of course, E[C(x)] is unknown in reality. However, we can average the results of a large number of classification trees and expect the average of many C(x) from these trees to approximate E[C(x)] well. By the law of large numbers, the average of multiple C(x) approaches to E[C(x)] as the number of trees increases. This is the advantage of having multiple trees over a single tree.**

We mentioned some advanced algorithms in the beginning of this article and they are bagging (Breiman 1996a, 1996b), boosting (Freund and Schapire 1996), and random forests (Breiman 2001). They all use multiple trees to replace a single tree for classification. They effectively reduce the variance of classification without changing the bias, and therefore significantly improve the classification accuracy. We will introduce these advanced algorithms in the next article of this series.

**References**

Breiman, L. (1996a). Bagging predictors. *Machine Learning* Vol. 24, 123 – 140.

Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. *Ann. Statist.* Vol. 24, 2350–2383.

Breiman, L. (2001). Random forests. *Machine Learning* Vol. 45, 5 – 32.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Pacific Grove, CA.

Freund, Y., Schapire, R. (1996). Experiments with a new boosting algorithm. In: Saitta, L. (Ed.), Machine Learning: Proceedings of the Thirteenth International Conference. Morgan Kaufmann, San Francisco, CA.

Sutton, C. D. (2005). Classification and Regression Trees, Bagging, and Boosting. In Rao, C. R., Wegman, E. J., Solka, J. L. (Eds.), Handbook of Statistics 24: Data Mining and Data Visualization, Chapter 11.

地址 : 北京市朝阳区领地 OFFICE 大厦 3 号楼 203

电话 : 8610 - 5601 3848 E-mail : info@liang-xin.com

Copyright © 2017-2020 北京量信投资管理有限公司版权所有 京ICP备17020789号