The second question has to be answered through information on the distribution or at least the variance of the estimator.
Of course there are answers in very simple contexts: for instance, when the parameter of interest is the mean $\mu$, the estimator $\bar{x}$ has a known standard deviation $\sigma/\sqrt{n}$, estimated by the standard error $\hat{\sigma}/\sqrt{n}$, sometimes denoted $\widehat{se}$.
In maximum likelihood theory, question 1 is answered by using the MLE $\hat{\theta}$, and question 2 can then be answered with an approximate standard error based on the observed information, $\widehat{se}(\hat{\theta}) \approx 1/\sqrt{I(\hat{\theta})}$.
The bootstrap is a more general way to answer question 2. The underlying idea is the following: if we had several samples from the unknown (true) distribution $F$, then we could look directly at the variation of the estimator $\hat{\theta}$ across those samples.
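Here is a minimal sketch of this ideal (and usually unavailable) situation, our own illustration rather than code from the notes: we pretend the true $F$ is a standard normal and the statistic is the median, draw many samples from $F$, and look at the spread of the estimates.

n=50; nrep=1000;
thetahat=zeros(1,nrep);
for r=(1:nrep)
  x=randn(n,1);              % a fresh sample from the "true" F
  thetahat(r)=median(x);     % the estimator of interest
end
sqrt(var(thetahat))          % its sampling standard error, known here only because we know F

The bootstrap mimics this by resampling from the data instead of from $F$.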
        | Heart attacks | No heart attacks | Subjects |
Aspirin |           104 |            10933 |    11037 |
Placebo |           189 |            10845 |    11034 |
Or is the difference between the rates in the two populations really different from zero?
We could run a simulation experiment to find out.
>> [zeros(1,10) ones(1,2)]
ans =
     0     0     0     0     0     0     0     0     0     0     1     1
>> [zeros(1,10) ones(1,2)]'
ans =
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     1
     1
>> [zeros(1,10) ; ones(1,2)]
??? All rows in the bracketed expression must have the same number of columns.
>> sample1=[zeros(1,109) ones(1,1)]';
>> sample2=[zeros(1,108) ones(1,2)]';
>> orig=sample1
>> [n,p]=size(orig)
n =
   110
p =
     1
>> thetab=zeros(1,1000);
File myrand.m for the bootstrap index creation:
function out=myrand(m,n,range)
%Returns an m-by-n matrix of draws
%of integers from 1 to range,
%with replacement
out=floor(rand(m,n)*range)+1;
File bsample.m for the bootstrap resample creation:
function out=bsample(orig)
%Function to create one resample from
%the original sample orig, where
%orig is the original data, and is a
%matrix with nrow observations and ncol variables
[n,p]=size(orig);
indices=myrand(1,n,n);
out=orig(indices,:);
>> for (b =(1:1000))
     res.bsample1=bsample(sample1);
     thetab(b)=sum(res.bsample1==1);
   end
>> hist(thetab)
This is what the histogram looks like:
Here is the computation for the complete data set:
>> sample1=[zeros(1,10933),ones(1,104)]';
>> sample2=[zeros(1,10845),ones(1,189)]';
>> thetab=zeros(1,1000);
>> for (b =(1:1000))
     res.bsample1=bsample(sample1);
     thetab(b)=sum(res.bsample1==1);
   end
>> for (b =(1:1000))
     res.bsample1=bsample(sample1);
     thetab(b)=sum(res.bsample1==1)-2;
   end
>> hist(thetab)
>> mean(thetab)
ans =
   -0.9530
>> var(thetab)
ans =
    1.0398
>> sum(thetab>0)
ans =
    96
This is what the histogram looks like:
>> thetab=zeros(1,1000);
>> for (b =(1:1000))
     res.bsample1=bsample(sample1);
     res.bsample2=bsample(sample2);
     thetab(b)=sum(res.bsample2==1)-sum(res.bsample1==1);
   end
>> hist(thetab)
This is what the histogram looks like:
From an original sample $\mathbf{x}=(x_1,\ldots,x_n)$, if we are interested in the behaviour of a statistic $\hat{\theta}=s(\mathbf{x})$, we can consider the sequence of new values $\hat{\theta}^{*1},\hat{\theta}^{*2},\ldots$ obtained by computing the statistic on new bootstrap samples. Practically speaking, this requires generating integers between 1 and $n$, each integer having the same probability.
Here is an example of a line of matlab that does just that:
indices=randint(1,n,n)+1;
If we use S we do not need to generate the new observations one by one: the following command draws an $n$-vector with replacement from the vector of indices $(1,\ldots,n)$.
sample(n,n,replace=T)
An approximation of the distribution of the estimate $\hat{\theta}$ is provided by the distribution of the bootstrap replications $\hat{\theta}^{*1},\ldots,\hat{\theta}^{*B}$.
If we were given $B$ true samples and their associated estimates $\hat{\theta}^{(1)},\ldots,\hat{\theta}^{(B)}$, we could compute the usual variance estimate for this sample of values, namely
$$\widehat{\mathrm{var}}(\hat{\theta})=\frac{1}{B-1}\sum_{b=1}^{B}\Big(\hat{\theta}^{(b)}-\bar{\theta}^{(\cdot)}\Big)^2,\qquad \bar{\theta}^{(\cdot)}=\frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{(b)}.$$
The bootstrap applies the same formula to the bootstrap replications $\hat{\theta}^{*b}$.
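As a sketch of how this is used in practice (our own helper built on bsample.m above; the name bootse is ours):

function out=bootse(orig,theta,B)
%Bootstrap estimate of the standard error of the
%estimator defined by the function named in theta
%(a character string), using B resamples and bsample.m
thetab=zeros(1,B);
for b=(1:B)
  thetab(b)=feval(theta,bsample(orig));
end
out=sqrt(var(thetab));

For example, bootse(treat,'median',1000) would give the bootstrap standard error of the median for the mouse treatment data used below.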
Here are some computations for the mouse data (page 11 of the text):
treat=[94 38 23 197 99 16 141]'
treat =
    94
    38
    23
   197
    99
    16
   141
>> median(treat)
ans =
    94
>> mean(treat)
ans =
   86.8571
>> var(treat)
ans =
   4.4578e+03
>> var(treat)/7
ans =
  636.8299
>> sqrt(637)
ans =
   25.2389
thetab=zeros(1,1000);
for (b =(1:1000))
  thetab(b)=median(bsample(treat));
end
hist(thetab)
>> sqrt(var(thetab))
ans =
   37.7768
>> mean(thetab)
ans =
   80.5110
This is what the histogram looks like:
Control group:

control=[52 104 146 10 51 30 40 27 46]';
>> median(control)
ans =
    46
>> mean(control)
ans =
   56.2222
>> var(control)
ans =
   1.8042e+03
>> var(control)/length(control)
ans =
  200.4660
>> sqrt(200.4660)
ans =
   14.1586
thetab=zeros(1,1000);
for (b =(1:1000))
  thetab(b)=median(bsample(control));
end
hist(thetab)
>> sqrt(var(thetab))
ans =
   11.9218
>> mean(thetab)
ans =
   45.4370

This is what the histogram looks like:
Comparing the two medians, we can use the bootstrap estimates of their standard errors to find out whether the difference between the two medians is significant.
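Here is a rough sketch of such a comparison (our own, assuming the treat and control vectors defined above); it forms an approximate z-statistic from the two bootstrap standard errors:

B=1000; medt=zeros(1,B); medc=zeros(1,B);
for b=(1:B)
  medt(b)=median(bsample(treat));
  medc(b)=median(bsample(control));
end
se_t=sqrt(var(medt)); se_c=sqrt(var(medc));
z=(median(treat)-median(control))/sqrt(se_t^2+se_c^2)

With the standard errors found above (about 38 and 12) and the observed difference 94-46=48, this ratio is a bit over 1, well short of 2, so the difference is not clearly significant.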
Suppose we condition on the sample of $n$ distinct observations; there are then as many different resamples as there are ways of choosing $n$ objects out of a set of $n$ possible contenders with repetitions allowed, namely $\binom{2n-1}{n}$.
At this point it is interesting to introduce a new notation for a bootstrap resample. Up to now we have denoted a possible resample by, say, $\mathbf{x}^*=(x^*_1,\ldots,x^*_n)$; because of the exchangeability/symmetry property we can recode this as the vector $(n_1,\ldots,n_n)$ counting the number of occurrences of each of the $n$ observations. In this recoding we have $n_1+\cdots+n_n=n$, and the set of all bootstrap resamples is the set of integer points of the $(n-1)$-dimensional simplex
$$\Big\{(n_1,\ldots,n_n):\ n_i\ge 0,\ \sum_{i=1}^n n_i=n\Big\}.$$
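A small helper along these lines (our own sketch; the name countweights is hypothetical) converts a vector of bootstrap indices into the weight vector $(n_1,\ldots,n_n)$:

function w=countweights(indices,n)
%Convert a vector of bootstrap indices (values in 1..n)
%into the count vector (n_1,...,n_n) of the recoding above
w=histc(indices,1:n);

For instance countweights(myrand(1,n,n),n) gives the weight vector of one resample; its entries always sum to n.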
The number of resamples grows very fast, approximately like $2^{2n-1}/\sqrt{\pi n}$ (Stirling's approximation to $\binom{2n-1}{n}$). Here is the function file approxcom.m, which computes this approximation:
function out=approxcom(n)
out=round((pi*n)^(-.5)*2^(2*n-1));

This produces the following table of the number of resamples:
Theoretically, we could give a complete enumeration of the bootstrap sampling distribution; all we need to know is how to compute the statistic for all the bootstrap resamples. There is a way of going through all resamples as defined by the simplex above; it is called a Gray code. As an example, consider the law school data used by Efron (1982). One can only hope to enumerate completely for moderate sample sizes ( today). For larger sample sizes, partial enumeration through carefully spaced points is discussed in Diaconis and Holmes (1994a). Another idea is to use a small dose of randomness, not as much as Monte Carlo, by doing a random walk between close points so as to still be able to use updating procedures; this is detailed, in the case of exploring the tails of a bootstrap distribution, in Diaconis and Holmes (1994b).
It is easy to give a recursive description of such a list, starting from the list for $n=1$ (namely $0,1$). Given a list $L_n$ of length $2^n$, form $L_{n+1}$ by putting a zero before each entry in $L_n$, and a one before each entry in $L_n$. Concatenate these two lists by writing down the first followed by the second in reverse order. Thus from $(0,1)$ we get $(00,01,11,10)$ and then the list displayed above for $n=3$. For $n=4$ the list becomes:
0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000.
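The recursion translates directly into a few lines of Matlab (a sketch of ours, not code from the notes; graylist is a hypothetical name):

function L=graylist(n)
%Binary reflected Gray code built by the recursion described
%above: row i of L is the codeword of rank i-1
if n==1
  L=[0;1];
  return
end
P=graylist(n-1);
L=[zeros(size(P,1),1) P;          %a zero before each entry of the previous list
   ones(size(P,1),1) flipud(P)];  %a one before each entry, list reversed

For example graylist(3) returns the eight rows 000, 001, 011, 010, 110, 111, 101, 100.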
Gray codes were invented by F. Gray (1939) for sending sequences of bits using a frequency transmitting device. If the ordinary integer indexing of the bit sequence is used then a small change in reception, between 15 and 16, for instance, has a large impact on the bit string understood. Gray codes enable a coding that minimizes the effect of such an error. A careful description and literature review can be found in Wilf (1989). One crucial feature: there are non-recursive algorithms for providing the successor to a vector in the sequence in a simple way. This is implemented through keeping track of the divisibility by 2 and of the step number.
One way to express this is as follows: let $b(i)$ be the binary representation of the integer $i$, and let $g(i)$ be the string of rank $i$ in the Gray code list. Then $g(i)=b(i)\oplus\lfloor b(i)/2\rfloor$ (a bitwise exclusive or of $b(i)$ with itself shifted one place to the right), and conversely $b(i)$ is recovered from $g(i)$ by cumulating exclusive ors from the leading bit downwards. For example, when $n=4$ the list above shows that the string of rank 6 is 0101; now $b(6)=0110$ and $0110\oplus 0011=0101$, as claimed. Thus from a given string in the Gray code and its rank one can compute the successor. There is a parsimonious implementation of this in the algorithm given in the appendix. Proofs of these results can be found in Wilf (1989).
Let $x_1,\ldots,x_n$ be the original data, supposed to be independent and identically distributed from an unknown distribution $F$ on a space $\mathcal{X}$. The bootstrap proceeds by supposing that replacing $F$ by $\hat{F}_n$, the empirical distribution, can provide insight into sampling variability problems.
Practically, one proceeds by repeatedly choosing $n$ points from the $n$ observed points with replacement. This leads to bootstrap replications $\mathbf{x}^*=(x^*_1,\ldots,x^*_n)$. There are $n^n$ such possible replications; however, these are not all different, and grouping together replications that generate the same multiset we can characterize each resample by its weight vector $(n_1,\ldots,n_n)$, where $n_i$ is the number of times $x_i$ appears in the replication. Thus $\sum_{i=1}^n n_i=n$.
Let the space of compositions of $n$ into at most $n$ parts be
$$\mathcal{C}_n=\Big\{(n_1,\ldots,n_n):\ n_i\ \text{integer},\ n_i\ge 0,\ \sum_{i=1}^n n_i=n\Big\}.$$
Efficient updating avoids recomputing the multinomial weight of each resample, $\binom{n}{n_1,\ldots,n_n}n^{-n}$, from scratch, which would involve multiplying very large numbers of factors. This is what makes the computation feasible. Gray codes generate compositions by changing two coordinates of the weight vector, one up by one and one down by one. This means that the multinomial weight can be easily updated by multiplying and dividing by the two changed coordinates. Similar procedures, discussed in Section C below, allow efficient updating of the statistics of interest.
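A minimal sketch of the updating step (our own illustration; the variable names are ours): when one unit moves from coordinate i to coordinate j, the multinomial probability changes by a single multiply and divide.

n=5; w=[2 1 1 1 0];                       %current weight vector, sum(w)==n
p=factorial(n)/prod(factorial(w))/n^n;    %multinomial probability of w
i=1; j=5;                                 %Gray code move: w(i) down, w(j) up
pnew=p*w(i)/(w(j)+1);                     %updated probability, no factorials recomputed
w(i)=w(i)-1; w(j)=w(j)+1;                 %the new composition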
Following earlier work by Nijenhuis, Wilf and Knuth, Klingsberg (1982) gave methods of generating Gray codes for compositions. We will discuss this construction briefly here; details can be found in Klingsberg (1982) and Wilf (1989).
For $n=3$, the algorithm produces the 10 compositions of 3 into at most 3 parts in the following order:
(3,0,0), (2,1,0), (1,2,0), (0,3,0), (0,2,1), (1,1,1), (2,0,1), (1,0,2), (0,1,2), (0,0,3).
The easiest way to understand the generation of such a list is recursive: the construction of the $k$-compositions of $n$ can actually be done through that of the $(k-1)$-compositions of $n, n-1, \ldots, 0$.
For any $m$, the list of 2-compositions of $m$ is just $(m,0),(m-1,1),\ldots,(0,m)$, which is of length $m+1$. So the 2-compositions out of 3 are
(3,0), (2,1), (1,2), (0,3),
the 2-compositions out of 2 are
(2,0), (1,1), (0,2),
and the 2-compositions out of 1 are
(1,0), (0,1).
Finally there is only one 2-composition of 0: (0,0).
The 3-out-of-3 list is obtained by appending a 0 to each entry of the list for 3, a 1 to each entry of the list for 2, a 2 to each entry of the list for 1, and a 3 to the list for 0. These four lists are then concatenated by writing the first, the second in reverse order, the third in its original order, followed by the fourth in reverse order. This actually gives the list of 10 compositions displayed above.
The same procedure leads to the 35 compositions of 4 into at most 4 parts in the following order:
The lists generated in this way have the property that two successive compositions differ only by $\pm 1$ in two coordinates.
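Here is a recursive Matlab sketch of this construction (ours, following the description above rather than Klingsberg's published routine; graycomp is a hypothetical name):

function L=graycomp(n,k)
%Gray code list of the k-compositions of n built recursively:
%each row sums to n and two successive rows differ by +1/-1
%in exactly two coordinates (assumes k>=2)
if k==2
  L=[(n:-1:0)' (0:n)'];                   %2-compositions (n,0),(n-1,1),...,(0,n)
  return
end
L=[]; flip=0;
for m=n:-1:0
  sub=graycomp(m,k-1);                    %(k-1)-compositions of m
  block=[sub repmat(n-m,size(sub,1),1)];  %append the last coordinate n-m
  if flip
    block=flipud(block);                  %every other block in reverse order
  end
  L=[L;block];
  flip=~flip;
end

Calling graycomp(3,3) reproduces the 10 compositions listed above, and size(graycomp(4,4),1) returns 35.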
Klingsberg (1982) provides a simple nonrecursive algorithm that generates the successor of any composition in this Gray code. This is crucial for the implementation in the present paper. It requires that one keep track of the whereabouts of the first two non-zero elements and an updating counter. Two versions of the algorithm are provided in the appendix.
We conclude this subsection by discussing a different algorithm due to Nijenhuis and Wilf (1978, pp. 40-46) which runs through the compositions in lexicographic order (reading from right to left). This algorithm was suggested for bootstrapping by Fisher and Hall (1991).
N.W. algorithm to run through the compositions $(r_1,\ldots,r_k)$ of $n$:
(1) Set $r_1=n$, $r_2=\cdots=r_k=0$.
(2) Let $h$ be the first index with $r_h\neq 0$. Set $t=r_h$, $r_h=0$, $r_1=t-1$, $r_{h+1}=r_{h+1}+1$.
(3) Stop when $r_k=n$.
For example, the following list gives all 35 compositions of 4 into at most 4 parts in the order produced by the N.W. algorithm:
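A Matlab sketch of the successor step (ours, implementing the update rule reconstructed above; nextcomp is a hypothetical name):

function r=nextcomp(r)
%Successor of the composition r=(r_1,...,r_k) in the
%Nijenhuis-Wilf (lexicographic) order; start from (n,0,...,0)
%and stop when r(k)==n
h=find(r~=0,1);          %first non-zero entry
t=r(h);
r(h)=0;
r(1)=t-1;
r(h+1)=r(h+1)+1;

Starting from r=[4 0 0 0] and applying nextcomp repeatedly until r(4)==4 produces the 35 compositions.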
Intermediate between the complete enumeration of all resamples and the drawing of a few resamples at random lies the balanced bootstrap; see Gleason (1988, The American Statistician, 263-266) for a simple algorithm.
The idea is that, just as each observation has the same chance of appearing when we enumerate all resamples, we would like each observation to appear the same total number of times once all the bootstrap resamples have been drawn.
We can ensure this by taking the list of the $n$ observations repeated $B$ times (a list of length $nB$), taking a random permutation of these $nB$ values, and then taking the first $n$ as the first resample, the next $n$ as the second, and so on.
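A sketch of this scheme in Matlab (ours; balboot is a hypothetical name):

function samples=balboot(orig,B)
%Balanced bootstrap: each of the n observations appears exactly
%B times over the B resamples; orig is an n x p data matrix and
%samples is an n x p x B array of resamples
[n,p]=size(orig);
pool=repmat((1:n)',B,1);           %each index repeated B times
pool=pool(randperm(n*B));          %random permutation of the n*B indices
samples=zeros(n,p,B);
for b=1:B
  idx=pool((b-1)*n+1:b*n);         %consecutive blocks of n indices
  samples(:,:,b)=orig(idx,:);
end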
This is the method used for drawing a sample at random from the empirical distribution. I will start by giving a history and general remarks about Monte Carlo methods for those who have never studied them before.
There are certain desirable properties that we want estimators to have. Consistency, which ensures that the estimates converge to the right answer as the sample size increases, is guaranteed here by the properties of Riemann sums. Other properties of interest are:
More precisely, if crude Monte Carlo estimates $\theta=E\,g(X)$ by $\hat{\theta}_B=\frac{1}{B}\sum_{b=1}^B g(X_b)$ with independent draws $X_b$, its variance is $\mathrm{var}(\hat{\theta}_B)=\mathrm{var}\,g(X)/B$.
This is the idea behind antithetic resampling (see Hall, 1989). Suppose we are interested in a real-valued parameter and that we have ordered our original sample $x_{(1)}\le x_{(2)}\le\cdots\le x_{(n)}$. With each resample $\mathbf{x}^{*b}$ and statistic $\hat{\theta}^{*b}$ we associate $\hat{\theta}^{**b}$, computed by applying a special permutation to the $x$'s, chosen so that the correlation between $\hat{\theta}^{*b}$ and $\hat{\theta}^{**b}$ is as small (as negative) as possible. If the statistic is a smooth function of the mean, for instance, then the 'reversal permutation' that maps $x_{(1)}$ into $x_{(n)}$, $x_{(2)}$ into $x_{(n-1)}$, etc., is the best: the small sample values are exchanged with the larger observations, and the average of the two estimates has smaller variance.
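A rough sketch of one antithetic pair (ours, reusing myrand-style indexing and the reversal permutation; antipair is a hypothetical name):

function [t1,t2]=antipair(orig,theta)
%One antithetic pair of bootstrap replications: orig is an
%n x 1 data vector, theta the name of the statistic (a string)
n=length(orig);
sorted=sort(orig);                    %order statistics x_(1)<=...<=x_(n)
idx=myrand(1,n,n);                    %bootstrap indices into the sorted sample
t1=feval(theta,sorted(idx));          %usual bootstrap replication
t2=feval(theta,sorted(n+1-idx));      %antithetic replication (reversal permutation)

Averaging (t1+t2)/2 over B such pairs gives a smaller Monte Carlo variance than 2B independent replications whenever the two members of a pair are negatively correlated.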
We often speak of the asymptotic properties of the sample mean $\bar{X}_n$. These refer to the sequence $\bar{X}_1,\bar{X}_2,\ldots$ as $n$ grows. These functions of the data are "the same" in some sense for all sample sizes; the notion of statistical functional makes this precise.
Suppose we are interested in real-valued parameters. We often have a situation where the parameter of interest is a function $\theta=T(F)$ of the distribution function $F$; such parameters are called statistical functionals. Examples include the mean $T(F)=\int x\,dF(x)$ and the median $T(F)=F^{-1}(1/2)$.
The usual estimates for these functionals are obtained by simply plugging in the empirical distribution function $\hat{F}_n$ for the unknown theoretical one: $\hat{\theta}=T(\hat{F}_n)$. Thus, taking into account that $\hat{F}_n$ puts mass $1/n$ on each observation, for any function $g$ we have:
$$\int g(x)\,d\hat{F}_n(x)=\frac{1}{n}\sum_{i=1}^n g(x_i).$$
We say that $X_n$ converges in law to $X$ if, when the random variable $X_n$ has cdf $F_n$ and the rv $X$ has cdf $F$, we have $F_n(x)\to F(x)$ at every continuity point $x$ of $F$; we write $X_n\Rightarrow X$.
Convergence in Probability
Note: If $X_n\Rightarrow X$, where $X$ has a continuous limit distribution function $F$, then the convergence of the cdfs is in fact uniform: $\sup_x|F_n(x)-F(x)|\to 0$.
Because of the result noted above, the pointwise convergence $\hat{F}_n(x)\to F(x)$ also ensures that $\sup_x|\hat{F}_n(x)-F(x)|\to 0$; this is actually true uniformly in $F$ because Kolmogorov's statistic $\sqrt{n}\,\sup_x|\hat{F}_n(x)-F(x)|$ is pivotal for continuous $F$.
Definition: A statistic is said to be pivotal if its distribution does not depend on any unknown parameters.
Example: Student's $t$ statistic, $t=\sqrt{n}(\bar{X}-\mu)/\hat{\sigma}$, whose distribution under normal sampling is $t_{n-1}$ whatever the values of $\mu$ and $\sigma$.
The bootstrap is said to work if the bootstrap distribution of $\hat{\theta}^*-\hat{\theta}$ (conditionally on the data) converges to the same limit as the sampling distribution of $\hat{\theta}-\theta$, suitably normalized; for instance
$$\sup_x\Big|P^*\big(\sqrt{n}(\hat{\theta}^*-\hat{\theta})\le x\big)-P\big(\sqrt{n}(\hat{\theta}-\theta)\le x\big)\Big|\longrightarrow 0.$$
Typically, we will be in the situation where $\sqrt{n}(\hat{\theta}-\theta)$ converges in law to a normal limit $N(0,\tau^2)$.
Examples:
Estimating the bias of the median:
Bootstrapping a probability:
Note: is not an unknown parameter.
This method was the jackknife. This was the first time that the sample itself was manipulated: each observation is dropped in turn from the sample; the empirical distribution function obtained when the $i$th observation has been dropped is called $\hat{F}_{(i)}$, the functional estimate $T(\hat{F}_{(i)})$ gives the estimator $\hat{\theta}_{(i)}$, and their average is denoted
$$\hat{\theta}_{(\cdot)}=\frac{1}{n}\sum_{i=1}^n\hat{\theta}_{(i)}.$$
We can use the jackknife to estimate the bias by:
$$\widehat{\mathrm{bias}}_{\mathrm{jack}}=(n-1)\big(\hat{\theta}_{(\cdot)}-\hat{\theta}\big).$$
We can show that if the bias is of order $1/n$, as in $E(\hat{\theta})=\theta+\frac{a_1}{n}+\frac{a_2}{n^2}+\cdots$, then the bias of the jackknife-corrected estimate $\tilde{\theta}=\hat{\theta}-\widehat{\mathrm{bias}}_{\mathrm{jack}}$ is of order $1/n^2$.
Example: Suppose $\theta=E(Y)/E(Z)$, for which the plug-in estimate is the ratio of sample means:
$$\hat{\theta}=\bar{y}/\bar{z}.$$
%------------------------------------------
function out=jackbias(theta,orig)
%Estimate the bias using the jackknife
%Theta has to be a character string containing
%a valid function name
[n,p]=size(orig);
lot=feval(theta,orig(2:n,:));
k=length(lot);
lo=zeros(n,k);
lo(1,:)=lot;
lo(n,:)=feval(theta,orig(1:(n-1),:));
for i=(2:(n-1))
  lo(i,:)=feval(theta,orig([1:(i-1),(i+1):n],:));
end
thetadot=mean(lo);
out=(n-1)*thetadot -(n-1)*feval(theta,orig);
%-------------------------------
function out=ratio(yszs)
%Computes the ratio of the mean of the first
%column to the mean of the second column
out=mean(yszs(:,1))/mean(yszs(:,2));
%-------------------------------------
>> z=oldpatch-placebo; y=newpatch-oldpatch;
>> yz=[y,z]
yz =
       -1200        8406
        2601        2342
       -2705        8187
        1982        8459
       -1290        4795
         351        3516
        -638        4796
       -2719       10238
>> ratio(yz)
ans =
   -0.0713
>> jackbias('ratio',yz)
ans =
    0.0080
Bootstrap Simulations
function out=bootbias(theta,orig,B)
%Bootstrap estimate of the bias of the estimator
%defined by the function named in theta
thetabs=zeros(1,B);
for b=(1:B)
  bs=bsample(orig);
  thetabs(b)=feval(theta,bs);
end
theta0=feval(theta,orig);
out=mean(thetabs)-theta0;
%-----------------------------------
>> bootbias('ratio',yz,1000)
ans =
    0.0085
%-----------------------------------
function out=bootmultin(orig,B)
%Weight (count) vectors of B bootstrap resamples,
%one row per resample
[n,p]=size(orig);
out=zeros(B,n);
for b=1:B
  inds=randint(1,n,n)+1;
  for i=1:n
    out(b,inds(i))=out(b,inds(i))+1;
  end;
end
%-----------------------------------
The best thing to do, at the beginning of the study, is to take a random sub-sample, without any particular stratification, and put it aside for the confirmatory stage. Many scientists are sparing with their data and collect only just enough to model, but nowadays the expense of an extra 25% or so should be accepted, especially when the consequences of the study are medical; this is what Tukey and Mallows call a careful separate diagnostic.
For instance, in discriminant analysis: for each observation, do the analysis without that observation and check whether or not it is well classified; this will give an unbiased estimate of the percentage badly classified.
Cross Validation can thus be used when one variable
has the particular status of being explained.
And in regression, we want to estimate the prediction error
$$\mathrm{PE}=E\big[(Y-\hat{Y})^2\big],$$
which cross validation estimates by $\frac{1}{n}\sum_{i=1}^n\big(y_i-\hat{y}_{(i)}\big)^2$, where $\hat{y}_{(i)}$ is the prediction of $y_i$ from the model fitted without observation $i$.
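A minimal sketch of this computation for a linear model (ours; loocvpe is a hypothetical name):

function cvpe=loocvpe(X,y)
%Leave-one-out cross-validation estimate of prediction error
%for the linear model y = X*beta + error; X is n x p, y is n x 1
[n,p]=size(X);
errs=zeros(n,1);
for i=1:n
  keep=[1:(i-1),(i+1):n];
  beta=X(keep,:)\y(keep);          %fit without observation i
  errs(i)=y(i)-X(i,:)*beta;        %predict the left-out observation
end
cvpe=mean(errs.^2);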
However it has also been used at the diagnostic stage in principal components, and in classification and regression trees where it helps choose the size of an `optimal tree'.
We call the unknown parameter $\theta$ and our estimate $\hat{\theta}$. Suppose that we were in an ideal (unrealistic) situation in which we knew the distribution of $\hat{\theta}-\theta$; we would be especially interested in its quantiles: denote the $\alpha/2$ quantile by $q_{\alpha/2}$ and the $1-\alpha/2$ quantile by $q_{1-\alpha/2}$.
By definition we have
$$P\big(\hat{\theta}-\theta\le q_{\alpha/2}\big)=\alpha/2 \quad\text{and}\quad P\big(\hat{\theta}-\theta\le q_{1-\alpha/2}\big)=1-\alpha/2,$$
giving the $1-\alpha$ confidence interval
$$\big[\hat{\theta}-q_{1-\alpha/2},\ \hat{\theta}-q_{\alpha/2}\big].$$
What we called $q_{\alpha/2}$ and $q_{1-\alpha/2}$ were the ideal quantiles from which we built the confidence interval $[\hat{\theta}-q_{1-\alpha/2},\ \hat{\theta}-q_{\alpha/2}]$. They are estimated by the corresponding quantiles of the bootstrap distribution of $\hat{\theta}^*-\hat{\theta}$: using the $\alpha$th quantile of $\hat{\theta}^*-\hat{\theta}$, the $\hat{\theta}$'s do NOT cancel out.
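A minimal Matlab sketch of this interval (ours), assuming thetab holds B bootstrap replications of the estimator and theta0 the original estimate:

alpha=0.05;
qs=prctile(thetab-theta0,[100*alpha/2 100*(1-alpha/2)]);
interval=[theta0-qs(2) theta0-qs(1)]    %the theta0's do not cancel out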
Preliminaries: Pivotal quantities
Construction of confidence intervals is based (ideally) on a pivotal quantity $R(X,\theta)$, that is, a function of the sample and the parameter whose distribution is independent of the parameter $\theta$, the sample, or any other unknown parameter. Thus, for instance, no knowledge is needed about the parameters $\mu$ and $\sigma$ of a normal variable to construct confidence intervals for both of these parameters, because one has two pivotal quantities available:
$$\frac{\sqrt{n}(\bar{X}-\mu)}{\hat{\sigma}}\sim t_{n-1} \qquad\text{and}\qquad \frac{(n-1)\hat{\sigma}^2}{\sigma^2}\sim\chi^2_{n-1}.$$
Let $r_{\alpha}$ denote the largest $\alpha$ quantile of the distribution of the pivot $R(X,\theta)$, and let $\Theta$ be the parameter space, the set of all possible values for $\theta$; then
$$\big\{\theta\in\Theta:\ r_{\alpha/2}\le R(X,\theta)\le r_{1-\alpha/2}\big\}$$
is a confidence set for $\theta$ with coverage $1-\alpha$.
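As a reminder of how such a pivot is used (a standard textbook calculation, not taken from these notes), the $t$ pivot above gives
$$P\Big(t_{n-1,\,\alpha/2}\le \frac{\sqrt{n}(\bar{X}-\mu)}{\hat{\sigma}}\le t_{n-1,\,1-\alpha/2}\Big)=1-\alpha
\;\Longleftrightarrow\;
\mu\in\Big[\bar{X}-t_{n-1,\,1-\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}},\ \ \bar{X}-t_{n-1,\,\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}}\Big],$$
so the interval has exact coverage $1-\alpha$ whatever $\mu$ and $\sigma$ are.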
Such an ideal situation is of course rare, and one has to make do with approximate pivots (sometimes called roots). Therefore errors will be made on the level ($1-\alpha$) of the confidence sets. The level error is sensitive to how far the root is from being a pivot.
We will present several approaches for finding roots. One approach is to studentize the root, that is, to divide it by an estimate of its standard deviation.
We will suppose that the asymptotic variance of $\hat{\theta}$ is $\sigma^2/n$, with a corresponding estimate $\hat{\sigma}^2/n$.
The studentized bootstrap or bootstrap-$t$ method (Efron 1982, Hall 1988) is based on approximating the distribution function of $T=(\hat{\theta}-\theta)/\hat{\sigma}$ by that of its bootstrap version $T^*=(\hat{\theta}^*-\hat{\theta})/\hat{\sigma}^*$, where $\hat{\sigma}^*$ is the standard error estimate recomputed from the resample.
Denote by $\hat{t}_{\alpha}$ the $\alpha$th quantile of the bootstrap distribution of $T^*$; the bootstrap-$t$ interval is then
$$\big[\hat{\theta}-\hat{t}_{1-\alpha/2}\,\hat{\sigma},\ \ \hat{\theta}-\hat{t}_{\alpha/2}\,\hat{\sigma}\big].$$
function out=corr(orig)
%Correlation coefficient
c1=corrcoef(orig);
out=c1(1,2);
%------------------------------------------
function interv=boott(orig,theta,B,sdB,alpha)
%Studentized bootstrap confidence intervals
%
theta0=feval(theta,orig);
[n,p]=size(orig);
thetab=zeros(B,1);
sdstar=zeros(B,1);
thetas=zeros(B,1);
for b=1:B
  indices=myrand(1,n,n);
  samp=orig(indices,:);
  thetab(b)=feval(theta,samp);
  %Compute the bootstrap se, se*
  sdstar(b)=feval('btsd',samp,theta,n,sdB);
  %Studentized statistic
  thetas(b)=(thetab(b)-theta0)/sdstar(b);
end
se=sqrt(var(thetab));
Pct=100*(alpha/2);
lower=prctile(thetas,Pct);
upper=prctile(thetas,100-Pct);
interv=[(theta0-upper*se) (theta0 - lower*se)];
%----------------------------------------------
function out=btsd(orig,theta,n,NB)
%Compute the bootstrap estimate
%of the std error of the estimator
%defined by the function theta
%NB number of bootstrap simulations
thetab=zeros(NB,1);
for b=(1:NB)
  indices=myrand(1,n,n);
  samp=orig(indices,:);
  thetab(b)=feval(theta,samp);
end
out=sqrt(var(thetab));
%----------------------------------------------
>> boott(law15,'corr',1000,30,.05)
ans =
   -0.4737    1.0137
%----------------------------------------------
>> boott(law15,'corr',2000,30,.05)
ans =
   -0.2899    0.9801
%----------------------
function out=transcorr(orig)
%transformed correlation coefficient
c1=corrcoef(orig);
rho=c1(1,2);
out=.5*log((1+rho)/(1-rho));

>> transcorr(law15)
ans =
    1.0362
>> tanh(1.03)
ans =
    0.7739
>> boott(law15,'transcorr',100,30,.05)
ans =
   -0.7841    1.7940
>> tanh(ans)
ans =
   -0.6550    0.9462
>> boott(law15,'transcorr',1000,30,.05)
ans =
    0.0473    1.7618
>> tanh(ans)
ans =
    0.0473    0.9427
>> transcorr(law15)
ans =
    1.0362
>> 2/sqrt(12)
ans =
    0.5774
>> 1.0362 - 0.5774
ans =
    0.4588
>> 1.0362 + 0.5774
ans =
    1.6136
>> tanh([.4588 1.6136])
ans =
    0.4291    0.9237
%%%%%%%%%%%%%%% True confidence intervals %
>> prctile(res100k,[5 95])
ans =
    0.5307    0.9041
The percentile method consists in taking the $1-\alpha$ confidence interval for $\theta$ to be
$$\big[\hat{\theta}^*_{(\alpha/2)},\ \hat{\theta}^*_{(1-\alpha/2)}\big],$$
the $\alpha/2$ and $1-\alpha/2$ quantiles of the bootstrap distribution of $\hat{\theta}^*$.
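In Matlab this is one line (a sketch of ours, again assuming thetab holds the B bootstrap replications):

alpha=0.05;
interval=prctile(thetab,[100*alpha/2 100*(1-alpha/2)])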
If $g$ is linear, then as we know finding its expectation is trivial, and we can also find its variance (recalling that $E(aX+b)=aE(X)+b$ and $\mathrm{var}(aX+b)=a^2\,\mathrm{var}(X)$).
So when $g$ is not linear we have to bring ourselves back to that case by using a Taylor expansion: we hope to find a region where $X$ dwells with high probability and where $g$ is approximately linear.
This method proves a generalisation of the central limit theorem: if $\sqrt{n}(\bar{X}_n-\mu)\Rightarrow N(0,\sigma^2)$, then $\sqrt{n}\big(g(\bar{X}_n)-g(\mu)\big)\Rightarrow N\big(0,g'(\mu)^2\sigma^2\big)$. By Taylor's theorem:
$$g(\bar{X}_n)\approx g(\mu)+g'(\mu)(\bar{X}_n-\mu).$$
Binomial: We are interested in estimating $p$; should we work with $\hat{p}$ directly, or with a transformation of it?
The delta method provides a route towards variance stabilization, i.e., making the variance less dependent on the unknown parameter (and thus making the distribution closer to being pivotal).
We would like to find transformations $h$ such that $\mathrm{var}\,h(\hat{\theta})\approx c$, where $c$ does not depend on $\theta$. Since the delta method gives $\mathrm{var}\,h(\hat{\theta})\approx h'(\theta)^2\,\mathrm{var}(\hat{\theta})$, by taking the transformation to be such that $h'(\theta)\propto 1/\sqrt{\mathrm{var}(\hat{\theta})}$ we have a constant variance.
Poisson case: $\mathrm{var}(\bar{X})=\theta/n$, so $h'(\theta)\propto\theta^{-1/2}$ and $h(\theta)=2\sqrt{\theta}$; the square root transformation stabilizes the variance.
Chi-square: Let $Y=\sum_{i=1}^n X_i^2$, where $X_i\sim N(0,\sigma^2)$, so that $Y/\sigma^2\sim\chi^2_n$. Then $E(Y)=n\sigma^2$ and $\mathrm{var}(Y)=2n\sigma^4$, so the standard deviation of $Y$ is proportional to its mean, and the variance stabilizing transformation is the logarithm, $h(y)=\log y$.
An improved estimate of the expectation can be obtained by taking the second order Taylor expansion:
$$E\,g(\bar{X})\approx g(\mu)+\tfrac{1}{2}g''(\mu)\,\mathrm{var}(\bar{X}).$$
Note: If you don't remember about Taylor series, you can get the computer to remind you by using Maple.
Type maple, then:
> help(taylor);
> taylor(f(x),x=mu,3);
for multivariate:
> readlib(mtaylor);
> mtaylor(f(x,y),[x=a,y=b],3);
To simplify the formulas, suppose the variances of both $X$ and $Y$ are 1 and the means are 0; then a simplified version of the argument in Lehmann (page 317) says that, for any bivariate distribution with correlation coefficient $\rho$, the transformation $z=\tfrac{1}{2}\log\frac{1+\rho}{1-\rho}$ approximately stabilizes the variance of the sample correlation coefficient (this is the transformation used in transcorr above).