
Chuan Shi 2017-09-18

**1**

**Start with t distribution**

Parameter estimation is everywhere in quantitative investment. For example, in Markowitz's mean-variance framework for asset allocation, we must estimate the mean and variance of asset returns. In most cases, having only the point estimate of a parameter of interest is insufficient. We are more interested in the error between our estimate and the true value of that parameter in the unknown population. In this regard, interval estimates -- in particular the **confidence interval** of the parameter -- provide the information we demand.

People are most familiar with the confidence interval (CI) for the **population mean**. With the **Central Limit Theorem** and the **Normal distribution assumption**, there is an elegant expression for the CI of the population mean. Specifically, one can convert the sample mean to a t-statistic using the sample standard deviation, and then find the critical values by looking up the t-value table. With the critical values, it is very easy to calculate the CI. Since Student's t distribution is symmetric, the CI computed this way is also symmetric.

Let's refer to the method described above as **the traditional method from Normal Theory**. I'd like to say more about the two strong assumptions behind it: the Central Limit Theorem and the Normal distribution.

Suppose the population follows a Normal distribution and we want to find the CI of its mean. If σ -- the standard deviation of the population -- is known, then we can use the Normal distribution to compute the CI. If, on the other hand, σ is unknown, then we must replace σ with s -- the standard deviation of the sample -- and replace the Normal distribution with the t-distribution in the calculation of the CI. This is precisely why the t-distribution was developed. Therefore, **using the t-distribution to compute the CI requires the population to be normally distributed.**

However, for most problems in practice the population is far from normally distributed, and this seems to prevent us from using the t-distribution. The good news is that we have another powerful weapon in our arsenal: the Central Limit Theorem. It says that no matter what the distribution of the population looks like, the distribution of the sample mean asymptotically approaches the Normal. As a result, we can still use the t distribution to compute its CI.

In probability theory, the central limit theorem establishes that, in most situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (a bell curve) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

The argument above indicates that we rely heavily on the Central Limit Theorem and the Normal distribution assumption when calculating the CI for the mean. **However, if the population distribution is very irregular or the sample size is far from sufficient, the normal approximation of the mean given by the Central Limit Theorem can be very questionable, and the CI calculated using the t distribution can be inaccurate.**
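To make this concrete, here is a small simulation sketch (Python with NumPy; the lognormal population, sample sizes, and seed are illustrative choices, not from the article). For a heavily skewed population, the sampling distribution of the mean is still clearly skewed at n = 5, so a symmetric t-based CI would be questionable; at n = 500 it is much closer to Normal, as the Central Limit Theorem promises.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_skewness(n, reps=20000):
    # Draw `reps` samples of size n from a heavily skewed lognormal
    # population and return the skewness of the resulting sample means.
    m = rng.lognormal(0.0, 1.0, size=(reps, n)).mean(axis=1)
    return ((m - m.mean()) ** 3).mean() / m.std() ** 3

print(mean_skewness(5))    # still clearly skewed
print(mean_skewness(500))  # much closer to symmetric
```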

**Other than the mean, we are also interested in other statistics such as the median, percentiles, the standard deviation, and the correlation coefficient, just to name a few. Unlike the mean, there exists no elegant expression from which their CIs can be computed nicely. As a result, their CIs cannot be computed by the traditional method based on the Normal Theory.**

To conquer these difficulties, this article introduces a statistical technique named **Bootstrap**, and we will show that it is very powerful for computing the CI of various statistics.

**2**

**Bootstrap: the origin and principle**

*The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates. -- Efron & Tibshirani, An introduction to the bootstrap, 1993*

Bootstrap was popularized by Bradley Efron in 1979, and he also coined the term 'Bootstrap'. **The key idea behind it is to perform computations on the data itself to estimate the variation of statistics that are themselves computed from the same data.** Modern computing power makes Bootstrap very simple to implement.

Confused? I am going to explain the meaning of Bootstrap and its principle right away.

Bootstrap comes from the English phrase 'pull yourself up by your bootstraps', i.e., to lift yourself off the ground by pulling on your own bootstraps. What it implies is that **'you should improve your situation by your own efforts'**.

In the context of parameter estimation, **Bootstrap means that we use only the data at hand, without making any assumptions about the population, to measure the error of the sample statistics when they are used as estimates of the true yet unknown population statistics.**

*The central idea is that it may sometimes be better to draw conclusions about the characteristics of a population strictly from the sample at hand, rather than by making perhaps unrealistic assumptions about the population. -- Mooney & Duval, Bootstrapping, 1993*

So what shall we do specifically? **How do we estimate the error of the statistics obtained from the sample data just by using the same data at hand (again and again)?**

To answer this question, we must first talk about a very important technique: **resampling with replacement**. Here, 'with replacement' is the key. To explain it, think about the following example. Suppose a jar contains 10 balls labelled from 1 to 10, and we draw balls from the jar one at a time. Suppose we get ball No. 3 in the first draw. 'Resampling with replacement' requires us to put it back into the jar before the next draw. As a consequence, in the second draw we are equally likely to draw any of the 10 balls, including ball No. 3. In contrast, there are many examples of 'resampling without replacement' in daily life, such as the 36/7 lottery and the World Cup draw. In those circumstances, the balls are not put back once they are drawn from the pool.
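The jar example can be sketched in a few lines of Python (standard library only; the seed is an arbitrary illustrative choice):

```python
import random

random.seed(0)
jar = list(range(1, 11))  # 10 balls labelled 1 to 10

# Resampling WITH replacement: every draw sees the full jar,
# so the same ball (e.g., No. 3) can be drawn more than once.
with_replacement = [random.choice(jar) for _ in range(10)]

# Resampling WITHOUT replacement (like a lottery draw):
# each ball can be drawn at most once.
without_replacement = random.sample(jar, 10)

print(with_replacement)
print(without_replacement)  # always a permutation of 1..10
```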

With this technique in mind, we are now ready to explain the Bootstrap principle. We start with the following setup:

**1. **Let v represent a population statistic of interest (e.g., the mean, the median, the standard deviation, etc.) from the unknown population distribution F.

**2. **Let x1, x2, …, xn be a sample from the population. We refer to it as the original sample.

**3. **Let u represent the sample statistic corresponding to v.

**4. **Using the data in the original sample as our new 'population', conduct resampling with replacement to derive a Bootstrap sample. The data in the Bootstrap sample is denoted by x1*, x2*, …, xn*. The Bootstrap sample's size must be the same as the size of the original sample.

**5. **Let u* be the Bootstrap statistic computed from the resample.

**Bootstrap principle says: The variation of u (around v) is well-approximated by the variation of u* (around u). **

**To determine the variation of u*, we conduct resampling with replacement on the original sample data to derive a large number of Bootstrap samples (**without the power of a modern computer, this would be a mission impossible for a human being**). We compute the statistic u* from each of these samples, and together these values constitute the distribution of u*. Using this distribution, it is easy to tell how u* varies around u, and we use this to approximate how u varies around v. The variation of the statistic u depends on the size of the original sample. Therefore, if we want to approximate this variation, we need to use Bootstrap samples of the same size.**

**With the Bootstrap principle, we can use the empirical Bootstrap method to compute the confidence interval of any statistic.**

**3**

**Empirical Bootstrap method**

We use a 'toy example' to explain how to find the CI of the population mean using the empirical Bootstrap method. Suppose we have the following 10 data points sampled from an unknown population: 30, 37, 36, 43, 42, 48, 43, 46, 41, 42.

There are two questions to be addressed: (1) find the point estimate of the mean; (2) find the 80% Bootstrap confidence interval.

Since the sample mean is the point estimate of the population mean, the answer to the first question is straightforward: it is 40.8. As for the second question, since the sample size is small and we have no knowledge about the distribution of the population, we use the empirical Bootstrap method, rather than the traditional method, to find the CI.

**To find the CI we need to know how much the distribution of sample mean, \bar x, varies around the population mean, μ. In other words, we would like to know the distribution of δ = \bar x – μ. δ is the error when we use \bar x to estimate μ.**

If we knew the distribution of δ, we could find the critical values required to compute the CI. In this example, since we want to derive the 80% CI, the critical values are δ_{0.9} and δ_{0.1}, the 10th and 90th percentiles of δ (the subscript denotes the right-tail probability). With these, the confidence interval is

[\bar x - δ_{0.1}, \bar x - δ_{0.9}]

The reasoning behind this is

P(δ_{0.9} ≤ \bar x - μ ≤ δ_{0.1}) = 0.8, which is equivalent to P(\bar x - δ_{0.1} ≤ μ ≤ \bar x - δ_{0.9}) = 0.8

**We hasten to point out that the probability computed above is conditional. **It represents the probability that the variation of the sample mean \bar x around the true mean μ is between δ_{0.1} and δ_{0.9}, given that the true mean is μ.

**Unfortunately, since there is only one sample from the population and the true mean μ is unknown, we do not have the distribution of δ, and therefore we do not know the values of δ_{0.9} and δ_{0.1}. However, we do have the Bootstrap principle. It says that even though we don't know how \bar x varies around μ (i.e., the distribution of δ), it can be approximated by how \bar x* varies around \bar x, i.e., the distribution of δ*, where δ* is the difference between the Bootstrap statistic and the sample statistic,**

δ* = \bar x* - \bar x

Since δ* is computed by resampling the original data, we can have a computer simulate δ* as many times as we'd like -- from each Bootstrap sample, we find its mean and subtract the original sample mean (40.8) from it. **By the law of large numbers, we can estimate the distribution of δ* with high precision.**

We find the critical values δ*_{0.9} and δ*_{0.1} from the distribution of δ*, and use them as approximations of δ_{0.9} and δ_{0.1}. The CI of μ then follows,

[\bar x - δ*_{0.1}, \bar x - δ*_{0.9}]

The procedure above shows the power of the empirical Bootstrap method.

Back to the example, 200 Bootstrap samples are generated with the help of a computer program. The figure below shows 10 of them (one resample in each column).

With these resamples, we find 200 values of δ*, which range from -4.4 to 4.0. Their cumulative distribution function looks like the following.

Next, we need to find δ*_{0.9} and δ*_{0.1} from these values. To do this, we sort the 200 values of δ* in ascending order. Since **δ*_{0.9} is the 10th percentile, we choose the 20th element in the list. Likewise, since δ*_{0.1} is the 90th percentile, we choose the 181st element in the list.** They are δ*_{0.9} = -1.9 and δ*_{0.1} = 2.2. Recall that the sample mean is 40.8, and therefore the 80% confidence interval for μ is

[40.8 - 2.2, 40.8 + 1.9] = [38.6, 42.7]
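The whole procedure fits in a few lines of Python (NumPy assumed; because the 200 resamples are random, the critical values differ slightly from the -1.9 and 2.2 reported above):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([30, 37, 36, 43, 42, 48, 43, 46, 41, 42])
x_bar = data.mean()  # 40.8, the point estimate of mu

# 200 Bootstrap samples, each the same size as the original sample
boot = rng.choice(data, size=(200, data.size), replace=True)

# delta* = Bootstrap sample mean minus the original sample mean
delta_star = boot.mean(axis=1) - x_bar

# delta*_{0.9} and delta*_{0.1}: the 10th and 90th percentiles of delta*
d_09, d_01 = np.percentile(delta_star, [10, 90])

# 80% empirical Bootstrap CI: [x_bar - delta*_{0.1}, x_bar - delta*_{0.9}]
ci = (x_bar - d_01, x_bar - d_09)
print(ci)
```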

**4**

**Bootstrap percentile method**

Let's turn our attention to another method, the **Bootstrap percentile method**. It differs from the empirical Bootstrap method in an apparent way. **Instead of computing the differences δ*, the Bootstrap percentile method uses the distribution of the Bootstrap sample statistic as a direct approximation of the distribution of the original sample statistic.**

Let's reuse the previous example to explain it.

In that example, we resample the original data to derive 200 Bootstrap samples. Each resample has a mean, and together these means constitute the distribution of \bar x* (see the figure below).

The percentile method says to use the distribution of \bar x* as an approximation to the distribution of \bar x. As a result, we only need to find the 0.1 and 0.9 critical values of this distribution, and they are the boundaries of the CI for μ. In this example, the two values are 38.9 and 43, respectively. The confidence interval for μ from the Bootstrap percentile method is therefore [38.9, 43].
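A sketch of the percentile method in the same setup (NumPy assumed; exact numbers vary with the random resamples):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([30, 37, 36, 43, 42, 48, 43, 46, 41, 42])

# The percentile method treats the Bootstrap sample means as a direct
# approximation of the distribution of the sample mean itself.
boot_means = rng.choice(data, size=(200, data.size), replace=True).mean(axis=1)

# 80% CI: read the 10% and 90% critical values straight off this distribution
ci = tuple(np.percentile(boot_means, [10, 90]))
print(ci)
```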

It becomes very clear that the CIs computed using these two approaches are not the same. **Is one better than the other?**

The difference between the empirical Bootstrap method and the Bootstrap percentile method can be summarized as follows.

The empirical Bootstrap method **approximates the distribution of δ using that of δ***. It then adds the error to both sides of the sample mean. As a result, the confidence interval is centered at \bar x.

The Bootstrap percentile method **approximates the distribution of \bar x using that of \bar x*** (since we only have one sample from the population, we do not have the distribution of \bar x, and this method says we can use the distribution of \bar x* as a good approximation). The CI then comes from the distribution of \bar x* directly. **A very strong assumption here is that the distribution of \bar x can be well-approximated by the distribution of \bar x*. However, this cannot be guaranteed, and therefore this method should not be used.**

The Bootstrap principle tells us the following: **the sample statistic \bar x is 'centered' at the population mean μ, and the Bootstrap statistic \bar x* is centered at \bar x. If there is a significant separation between \bar x and μ, then the distributions of \bar x and \bar x* will also differ significantly (i.e., the assumption of the Bootstrap percentile method fails to hold). On the other hand, the distribution of δ = \bar x - μ describes the variation of \bar x about its center, and likewise the distribution of δ* = \bar x* - \bar x describes the variation of \bar x* about its center. So, even if the centers are different, the two variations about the centers can be approximately equal. Consequently, we should use \bar x as the center and use the distribution of δ* to describe the variation around it, which is exactly what the empirical Bootstrap method does.**

The following figure summarizes the difference between the two approaches.

**5**

**Bootstrapped-t method**

In addition to the two methods discussed in the previous two sections, we'd like to talk about one more method, the **Bootstrapped-t method**.

As its name suggests, this method is similar to the traditional method. In the traditional method based on Normal Theory, we look up the t-value table to find the critical values for the CI of the mean. **This implicitly assumes that the distribution of the t-statistic is symmetric.**

In practice, however, this assumption may fail. In that case, the critical values from the t-value table are incorrect. This motivates the development of the Bootstrapped-t method, which allows asymmetric t values for the CI.

**The key idea of this method is to convert the statistics of the Bootstrap samples into a set of t-statistics. These many values of the t-statistic give us its distribution. We use this distribution, instead of the t-value table, to derive the critical values required by the CI. The CI is then computed by,**

[\bar x - t*_{0.1} × s_{\bar x}, \bar x - t*_{0.9} × s_{\bar x}]

where s_{\bar x} is the standard deviation of the original sample.

As for the mean, we can use the following formula to transform the Bootstrap sample mean into a t-statistic (note that if the target statistic is not the mean, there may exist no analytic expression for such a transformation),

t*_i = (\bar x*_i - \bar x) / (s*_i / √n)

where \bar x*_i and s*_i are the mean and standard deviation of the ith Bootstrap sample, respectively, and n is the sample size.

Back to our example, the cumulative distribution function of the 200 Bootstrapped t-statistics looks like the following.

The critical values of the Bootstrapped t-statistic are -1.17 and 1.81. The CI for μ is therefore [31.82, 46.62]. Note that this CI is wider than the CIs calculated using the other two methods. This is because we use the standard deviation of the original sample in this calculation. Since there are only 10 data points in the original sample, this standard deviation is large, and it leads to a wider CI.
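A sketch of the Bootstrapped-t calculation (NumPy assumed; following the article, s_{\bar x} is taken as the standard deviation of the original sample, and the random resamples make the critical values differ slightly from -1.17 and 1.81):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([30, 37, 36, 43, 42, 48, 43, 46, 41, 42])
n = data.size
x_bar = data.mean()
s_xbar = data.std()  # standard deviation of the original sample, as in the text

boot = rng.choice(data, size=(200, n), replace=True)

# t*_i = (x_bar*_i - x_bar) / (s*_i / sqrt(n)) for each Bootstrap sample
t_star = (boot.mean(axis=1) - x_bar) / (boot.std(axis=1, ddof=1) / np.sqrt(n))

# Asymmetric critical values come from the Bootstrapped t distribution
# itself, not from a t-value table.
t_09, t_01 = np.percentile(t_star, [10, 90])

# CI: [x_bar - t*_{0.1} * s_xbar, x_bar - t*_{0.9} * s_xbar]
ci = (x_bar - t_01 * s_xbar, x_bar - t_09 * s_xbar)
print(ci)
```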

We introduce this method to broaden your perspective. In practice, we still recommend using the empirical Bootstrap method.

**6**

**Not just the mean**

For the sake of comparing different Bootstrap methods, so far in this article we have been using the population mean as our target statistic. The Bootstrap technique is equally effective for finding the CIs of other statistics.

Let's use the median as an example. We still have the same 10 data points as before (30, 37, 36, 43, 42, 48, 43, 46, 41, 42), drawn from some unknown population. We will apply the empirical Bootstrap method to find the 95% confidence interval for the median.

With those 200 Bootstrap samples, it is straightforward to find the critical values we need. Since we are concerned with the 95% CI, the critical values are the errors corresponding to the 2.5% and 97.5% percentiles, and they are -5.0 and 2.5. In addition, the sample median is 42. Therefore, the 95% confidence interval for the median is [39.5, 47].
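The same empirical Bootstrap recipe applied to the median (NumPy assumed; random resamples mean the critical values will not exactly reproduce -5.0 and 2.5):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([30, 37, 36, 43, 42, 48, 43, 46, 41, 42])
med = np.median(data)  # sample median = 42

# delta* = Bootstrap sample median minus the original sample median
boot = rng.choice(data, size=(200, data.size), replace=True)
delta_star = np.median(boot, axis=1) - med

# 95% CI: errors at the 2.5% and 97.5% percentiles
d_low, d_high = np.percentile(delta_star, [2.5, 97.5])
ci = (med - d_high, med - d_low)
print(ci)
```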

**7**

**Bootstrap and quantitative investment**

This article explains how to use the Bootstrap technique to measure the error in parameter estimation. Bootstrap makes no assumptions about the population distribution, and therefore it can be applied to any statistic. This makes it a very powerful tool.

It is important to mention that **the resampling of Bootstrap can't improve our point estimate.** Using the mean as an example, the sample mean \bar x is the point estimate of the population mean μ. By using Bootstrap, we compute \bar x* for many resamples of the data. If we took the average of all the \bar x*, we would expect it to be very close to \bar x (in fact, it can be shown that the expectation of \bar x* equals \bar x). This wouldn't tell us anything new about the true value of μ. However, the values of \bar x* are very effective in helping us measure how \bar x varies around μ, and this is the essence of the Bootstrap method.
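This point is easy to verify numerically (NumPy assumed; 10,000 resamples is an arbitrary illustrative choice): averaging many Bootstrap sample means just recovers the original sample mean, while their spread is what actually carries the new information.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([30, 37, 36, 43, 42, 48, 43, 46, 41, 42])

# 10,000 Bootstrap sample means: their average clusters tightly around
# the original sample mean 40.8 -- no new information about mu.
boot_means = rng.choice(data, size=(10_000, data.size), replace=True).mean(axis=1)
print(data.mean(), boot_means.mean())

# Their spread, however, is exactly what the Bootstrap uses to measure
# how the sample mean varies around mu.
print(boot_means.std())
```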

The Bootstrap technique has many useful applications in quantitative investment. For instance, it can be used to correct parameter estimation bias. Suppose we'd like to know the correlation coefficient of the returns of two assets, and the only data we have are time series of historical returns. By using the empirical Bootstrap method, we can find the error in the parameter estimation, and this can help derive better investment strategies or achieve more solid risk management. As another example, the classification tree is a simple algorithm that can be used for stock selection. It is sensitive to the in-sample data and its predictions have a lot of variance. The Bootstrap technique can be viewed as a great ensemble learning meta-algorithm, and it can be applied to boost the performance of simple tree-based classification algorithms. The 'bagging' algorithm, which combines classification trees with the Bootstrap technique, is capable of providing more accurate predictions.

Finally, the methods discussed in this article are all nonparametric, i.e., we make no assumptions at all about the underlying distribution and draw Bootstrap samples by resampling the data. In some problems, if we know the family of the population distribution, we can use the so-called parametric Bootstrap. The only difference between the parametric and empirical Bootstrap is the source of the Bootstrap sample. For the parametric Bootstrap, we generate the Bootstrap sample from a parametrized distribution. For example, if we know that the population is exponentially distributed with some unknown parameter λ, we can apply the parametric Bootstrap method to estimate the confidence interval of this parameter. Due to space constraints, we will not expand the discussion of this topic here. Readers are encouraged to go through the relevant references.
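A minimal sketch of the parametric Bootstrap for the exponential example (NumPy assumed; the true λ, sample size, and seed are all illustrative): resamples are drawn from the fitted exponential distribution instead of from the data itself.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical observed sample from an Exponential population, lambda unknown
data = rng.exponential(scale=2.0, size=50)
lam_hat = 1.0 / data.mean()  # maximum-likelihood estimate of lambda

# Parametric Bootstrap: resample from the FITTED distribution
boot = rng.exponential(scale=1.0 / lam_hat, size=(2000, data.size))
delta_star = 1.0 / boot.mean(axis=1) - lam_hat

# 95% CI for lambda, empirical-Bootstrap style
d_low, d_high = np.percentile(delta_star, [2.5, 97.5])
ci = (lam_hat - d_high, lam_hat - d_low)
print(ci)
```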

Copyright © 2017-2020 Beijing Liangxin Investment Management Co., Ltd. All rights reserved. 京ICP备17020789号
