

Lectures

Situation of resampling in contemporary statistics

Fisher's Classical Paradigm

This will be explained as a flow chart showing where Fisher's initial hypotheses come in and how the data are studied under the conjectured distribution, usually multivariate normal. There are several books on Multivariate Analysis that follow this paradigm and I will not go into any details, but it is worth bearing in mind that if one does know, from previous studies for instance, that the data are multivariate normal, there is only a simple estimation problem left: all the possible information in the data is summarised by the mean and the variance-covariance matrix, and no more is to be said.

Tukey and Mallows' Exploratory-Confirmatory Paradigm

The underlying principle

The questions addressed

Suppose we are interested in the estimation of an unknown parameter $\theta$ (this doesn't mean that we are in a parametric context). The two main questions asked at that stage are:

  1. What is a good estimate $\widehat{\theta}$ of $\theta$?
  2. How accurate is that estimate?

The second question has to be answered through information on the distribution, or at least the variance, of the estimator.

Of course there are answers in very simple contexts: for instance, when the parameter of interest is the mean $\mu$, the estimator $\bar{X}$ has a standard deviation with a well-known estimate, the estimated standard error, sometimes noted

\begin{displaymath}
\hat{\sigma}=\left[\sum_{i=1}^n(X_i-\bar{X})^2/n^2\right]^{1/2}\end{displaymath}

However no such simple formula is available for the sample median, for instance.

In maximum likelihood theory, question 1 is answered by using the mle, and question 2 can then be answered with an approximate standard error $\widehat{se}(\hat{\theta})=
\frac{1}{\sqrt{\mbox{Fisher Information}}}$.

The bootstrap is a more general way to answer question 2. The idea is the following:

If we had several samples from the unknown (true) distribution $F$, then we could consider the variation of the estimator:

\begin{displaymath}F \qquad
\begin{array}{llll}
\stackrel{\mbox{Random sampling}}{\longrightarrow} & (X_1^1,X_2^1,\ldots,X_n^1)= & {\cal X}_n^1 & {\longrightarrow} \hat{\theta}_1\\
\vdots & \vdots & \vdots & \vdots\\
{\longrightarrow} & (X_1^B,X_2^B,\ldots,X_n^B)= & {\cal X}_n^B & {\longrightarrow} \hat{\theta}_B
\end{array}\end{displaymath}

Such a situation never occurs in practice, so we replace these new samples by a resampling procedure based on the only information we have about $F$, namely the empirical distribution $\hat{F}_n$: this is what is called bootstrapping.

The bootstrap: Some Examples

A binomial example

Suppose we don't know any probability or statistics, and we are told that

              Heart attacks   No heart attack   Subjects
  Aspirin          104             10933          11037
  Placebo          189             10845          11034
The question : Is the true ratio $\theta$, the parameter of interest, smaller than 1?

Or is the difference between rates in both populations strictly different from zero?

We could run a simulation experiment to find out.

>> [zeros(1,10) ones(1,2)]
ans =
     0     0     0     0     0     0     0     0     0     0     1     1
>> [zeros(1,10) ones(1,2)]'
ans =
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     1
     1
>> [zeros(1,10) ; ones(1,2)]
???  All rows in the bracketed expression must have the same 
number of columns.
>> sample1=[zeros(1,109) ones(1,1)]';
>> sample2=[zeros(1,108) ones(1,2)]';
>> orig=sample1
>>[n,p]=size(orig)
n =     110
p =    1
>> thetab=zeros(1,1000);

File myrand.m for the bootstrap index creation:

function out=myrand(m,n,range)
%Returns an m by n matrix of integers
%drawn with replacement (uniformly)
%from 1 to range

out=floor(rand(m,n)*range)+1;

File bsample.m for the bootstrap resample creation:

function out=bsample(orig)
%Function to create one resample from
%the original sample orig, where
%orig is the original data, and is a 
%matrix with nrow observations and ncol variables
      [n,p]=size(orig);
      indices=myrand(1,n,n);
      out=orig(indices,:);


Computing the bootstrap distribution for one sample

>> for (b =(1:1000))
      res.bsample1=bsample(sample1);
      thetab(b)=sum(res.bsample1==1);  
 end
>>hist(thetab)

This is what the histogram looks like:
Here is the complete data set computation

>> sample1=[zeros(1,10933),ones(1,104)]';
>> sample2=[zeros(1,10845),ones(1,189)]';
>> thetab=zeros(1,1000);
>> for (b =(1:1000))
      res.bsample1=bsample(sample1);
      thetab(b)=sum(res.bsample1==1);  
 end

Comparing to a hypothetical parameter

Suppose that we are trying to test $\theta=\frac{2}{110}$

>> for (b =(1:1000))
      res.bsample1=bsample(sample1);
      thetab(b)=sum(res.bsample1==1)-2;  
 end
>>hist(thetab)
>> mean(thetab)
ans =
   -0.9530
>> var(thetab)
ans =
    1.0398
>> sum(thetab>0)
ans =
    96

This is what the histogram looks like:

Computing the bootstrap distribution for two samples

>>thetab=zeros(1,1000);
>> for (b =(1:1000))
      res.bsample1=bsample(sample1);
      res.bsample2=bsample(sample2);
      thetab(b)=sum(res.bsample2==1)-sum(res.bsample1==1);  
 end
>>hist(thetab)

This is what the histogram looks like:

Without the computer

Sample one could be considered as the realization of a Binomial random variable, from some unknown Binomial distribution, for which the best estimate given by maximum likelihood would be:

\begin{displaymath}B(n_1,p_1),\qquad \hat{p_1}=\frac{\sum X_i}{n_1}=\frac{104}{11037}\end{displaymath}

The second sample would be considered as coming from another Binomial, in the most general case:

\begin{displaymath}B(n_2,p_2),\qquad \hat{p_2}=\frac{\sum X_i}{n_2}=\frac{189}{11034}\end{displaymath}

Theoretically, what can we say about the distribution of $\hat{p_1} - \hat{p_2}$?
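
One standard (classical) answer, stated here only for reference: for large samples the difference of the two proportions is approximately normal,

\begin{displaymath}\hat{p_1}-\hat{p_2} \approx \mbox{${\cal N}$}\left(p_1-p_2,\;
\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}\right)\end{displaymath}

so a test of $p_1=p_2$ can be based on the standardized difference; the two-sample bootstrap computation above approximates the same distribution without any normality argument.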

Some notation

From an original sample

\begin{displaymath}{\cal X}_n
=(X_1,X_2...X_n) \stackrel{iid}{\sim} F\end{displaymath}

draw a new sample of $n$ observations from the original sample with replacement, each observation having the same probability of being drawn ($=\frac{1}{n}$). A bootstrap sample is often denoted

\begin{displaymath}{\cal X}_n^*
=X_1^*,X_2^*...X_n^* \stackrel{iid}{\sim} F_n
\mbox{ the empirical distribution }\end{displaymath}

If we are interested in the behaviour of a random variable $\widehat{\theta}=\theta({\cal X}_n,F)$, then we can consider the sequence of $B$ new values obtained through computation of $B$ new bootstrap samples.

Practically speaking, this requires generating integers between 1 and $n$, each integer having the same probability.

Here is an example of a line of matlab that does just that:

indices=randint(1,n,n)+1;

If we use S we won't need to generate the new observations one by one; the following command generates an $n$-vector drawn with replacement from the indices $(1\ldots n)$:

sample(n,n,replace=T)

An approximation of the distribution of the estimate $\widehat{\theta}=\theta({\cal X}_n,F)$ is provided by the distribution of

\begin{displaymath}\widehat{\theta}^{*b}=
\theta({\cal X}_n^{*b},F_n),   b=1..B
\end{displaymath}

$G_n^*(t)=P_{F_n}\left(
\widehat{\theta}^* \leq t \right)$ denotes the bootstrap distribution of $\widehat{\theta}^*$, often approximated by

\begin{displaymath}
\widehat{G}_n^*(t)=
\#\{\widehat{\theta}^*
\leq t
\}/B\end{displaymath}

\fbox{The Bootstrap Algorithm}

  1. Compute the original estimate from the original data. $\widehat{\theta}=\theta({\cal X}_n)$
  2. For b=1 to B do : %B is the number of bootstrap samples
    1. Create a resample ${\cal X}_b^*$
    2. Compute $\widehat{\theta}^*_b=\theta({\cal X}_b^*)$
  3. Compare $\widehat{\theta}^*_b$ to $\widehat{\theta}$.
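
A minimal MATLAB sketch of steps 1 and 2 of this algorithm, in the style of the feval-based functions used later in these notes (the function name bootdist is ours; the statistic is passed as a function name, and step 3 then amounts to comparing the returned values with feval(theta,orig)):

function thetastar=bootdist(theta,orig,B)
%Generic bootstrap: returns the B bootstrap replications of the
%statistic named by the character string theta, computed by feval;
%orig is the data matrix, B the number of resamples
[n,p]=size(orig);
thetastar=zeros(1,B);
for b=1:B
   indices=floor(rand(1,n)*n)+1;           %n indices drawn with replacement from 1..n
   thetastar(b)=feval(theta,orig(indices,:));
end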

Accuracy of the sample mean

Using the linearity of the mean and the fact that the sample is iid we have

\begin{displaymath}
\widehat{se}(\bar{x})= \sqrt{\frac{s^2}{n}}\end{displaymath}

where $s^2$ is the usual estimate of the variance obtained from the sample.

If we were given $B$ samples from the true distribution, and their associated estimates $\hat{\theta}^{*b}$, we could compute the usual variance estimate for this sample of $B$ values, namely:

\begin{displaymath}\widehat{se}_{boot}(s)=\{ \sum_{b=1}^B
[s(\mbox{${\cal X}$}^{*b})-s(\mbox{${\cal X}$}^{*.})]^2/(B-1)
\}^{\frac{1}{2}}
\end{displaymath}

where

\begin{displaymath}s(\mbox{${\cal X}$}^{*.})=\frac{1}{B}\sum_{b=1}^B s(\mbox{${\cal X}$}^{*b})\end{displaymath}

Here are some computations for the mouse data (page 11 of the text).

treat=[94 38 23 197 99 16 141]'
treat =
    94
    38
    23
   197
    99
    16
   141
>> median(treat)         
ans =    94
>> mean(treat)
ans =   86.8571
>> var(treat)
ans =   4.4578e+03
>> var(treat)/7
ans =
  636.8299
>> sqrt(637)
ans =   25.2389
thetab=zeros(1,1000);
for (b =(1:1000))               
thetab(b)=median(bsample(treat));
end
hist(thetab)
>> sqrt(var(thetab))
ans =
   37.7768
>> mean(thetab)
ans =
   80.5110

This is what the histogram looks like:

Control Group

control=[52 104 146 10 51 30 40 27 46]';
>> median(control)
ans =    46
>> mean(control)
ans =   56.2222
>> var(control)
ans =   1.8042e+03
>> var(control)/length(control)
ans =  200.4660
>> sqrt(200.4660)
ans =   14.1586
thetab=zeros(1,1000);
for (b =(1:1000))               
thetab(b)=median(bsample(control));
end
hist(thetab)
>> sqrt(var(thetab))
ans =   11.9218
>> mean(thetab)
ans =   45.4370
This is what the histogram looks like:

To compare the two medians, we could use these estimates of the standard errors to find out whether the difference between the two medians is significant.
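
As a rough check, using the bootstrap standard errors just computed and treating the two groups as independent:

\begin{displaymath}\frac{94-46}{\sqrt{37.78^2+11.92^2}}\simeq\frac{48}{39.6}\simeq 1.2\end{displaymath}

so the observed difference of 48 between the medians is only about 1.2 standard errors away from zero, not clearly significant.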

The combinatorics of the bootstrap distribution

As we noted in class, and as the histograms show, the main feature of the bootstrap distribution of the median is that it can take on very few values: in the case of the treatment group, for instance, at most $7$. The simple bootstrap will always present this discrete character, even if we know the underlying distribution is continuous; there are ways to fix this, and in many cases it won't matter, but it is an important feature.

How many different bootstrap samples are there?

By different samples we mean samples that differ as (multi)sets, i.e. there is no difference between the sample $\{x_1,x_2,\ldots,x_n\}$ and $\{x_2,x_1,\ldots , x_n \}$; in other words the observations are exchangeable, or the statistic of interest is a symmetric function $s$ of the sample: $\hat{\theta}=s(\mbox{${\cal X}$})$.
Definition:
The sequence $(X_1,X_2,\ldots,X_n)$ of random variables is said to be exchangeable if the distribution of the $n$ vector $(X_1,X_2,\ldots,X_n)$ is the same as that of $(X_{\pi(1)},X_{\pi(2)},\ldots,X_{\pi(n)})$, for $\pi$ any permutation of $n$ elements.

Suppose we condition on the sample of $n$ distinct observations $\mbox{${\cal X}$}$. There are then as many different samples as there are ways of choosing $n$ objects out of a set of $n$ possible contenders, repetitions being allowed.

At this point it is interesting to introduce a new notation for a bootstrap resample. Up to now we have written a possible resample as, say, $\mbox{${\cal X}$}^{*b}=\{x_1,x_1,x_3,x_4,x_4\}$; because of the exchangeability/symmetry property we can recode this as the $n$-vector counting the number of occurrences of each of the observations. In this recoding we have $\mbox{${\cal X}$}^{*b}=(2,0,1,2,0)$ and the set of all bootstrap resamples is the $n$ dimensional simplex

\begin{displaymath}C_n=\{(k_1,k_2,\ldots,k_n), k_i \in {\sf N}, \sum k_i=n \}\end{displaymath}

Here is the argument I used in class to explain how big $C_n$ is. Each component of the vector is considered to be a box; there are $n$ boxes to contain $n$ balls in all, and we want to count the number of ways of separating the $n$ balls into the $n$ boxes. Put down $n-1$ separators $\vert$ to make the boxes, and $n$ balls; there are then $2n-1$ positions from which to choose the $n-1$ bars' positions. For instance our vector above corresponds to: oo||o|oo| . Thus

\begin{displaymath}\vert C_n\vert={{2n-1}\choose{n-1}}\end{displaymath}

Stirling's formula ( $n!\sim n^ne^{-n}(2\pi n)^{\frac{1}{2}}$) gives an approximation $C_n \sim (n\pi)^{-\frac{1}{2}} 2^{2n-1}$,

here is the function file approxcom.m

function out=approxcom(n)
out=round((pi*n)^(-.5)*2^(2*n-1));
that produces the following table of the number of resamples:

\begin{displaymath}
\begin{array}{\vert l\vert l\vert l\vert l\vert}
\hline
n & 20 & 25 & 30\\
\hline
\vert C_n\vert & 6.93\times 10^{10}& 6.35\times 10^{13} &
5.94\times 10^{16}\\
\hline
\end{array}\end{displaymath}

Are all these samples equally likely? Thinking about the probability of drawing the sample consisting of all $x_1$'s, by choosing the index $1$ $n$ times in the uniform integer generation, should persuade you that this sample appears only once in $n^{n}$ draws, whereas the sample with $x_1$ once and $x_2$ for all the other observations can appear in $n$ out of the $n^{n}$ ways.

Which is the most likely bootstrap sample?

The most likely resample is the original sample $\mbox{${\cal X}$}=\{x_1,x_2,...,x_n\}$; the easiest way to see this is to consider the multinomial distribution of the counts, described next.


The multinomial distribution

In fact when we are drawing bootstrap resamples we are just drawing a vector $(k_1,k_2,...k_n)$ from the multinomial distribution, with each of the $n$ categories equally likely, $p_i=\frac{1}{n}$, so that the probability of a possible vector is

\begin{displaymath}Prob_{boot}(k_1,k_2,...k_n)=\frac{n!}{k_1!k_2!\cdots k_n!}
\left(\frac{1}{n}\right)^{k_1+k_2+k_3+\cdots +k_n}=
{{n}\choose{k_1,k_2,\ldots,k_n}} n^{-n}
\end{displaymath}

This is largest when all the $k_i$'s are $1$; thus the most likely sample in the bootstrap resampling is the original sample. Here is a table of the probability $n!/n^n$ of this most likely sample:

\begin{displaymath}
\begin{array}{\vert l\vert l\vert l\vert}
\hline
n & 15 & 20\\
\hline
n!/n^n & 3\times 10^{-6} &
2.3\times 10^{-8}\\
\hline
\end{array}\end{displaymath}

As long as the statistic is a reasonably smooth function of the observations, we can see that the discreteness of the bootstrap distribution is not a problem.

Complete Enumeration

Theoretically, we could give a complete enumeration of the bootstrap sampling distribution; all we need is a way to compute the statistic $\hat{\theta}=s(\mbox{${\cal X}$}^*)$ for all the bootstrap resamples. There is a way of going through all resamples as defined by the simplex $C_n$; it is called a Gray code. As an example, consider the law school data used by Efron (1982). One can only hope to enumerate completely for moderate sample sizes ($n\leq
20$ today). For larger sample sizes, partial enumeration through carefully spaced points is discussed in Diaconis and Holmes (1994a). Another idea is to use a small dose of randomness, not as much as Monte Carlo, by doing a random walk between close points so as to be able to use updating procedures all the same; this is detailed, for the case of exploring the tails of a bootstrap distribution, in Diaconis and Holmes (1994b).

The original Gray code

Let $Z^n_2$ be the set of binary $n$-tuples. This may be identified with the vertices of the usual $n$-cube or with the set of all subsets of an $n$ element set. The original Gray code gives an ordered list of $Z^n_2$ with the property that successive values differ only in a single place. For example, when $n=3$ such a list is

$000,\; 001,\; 011,\; 010,\; 110,\; 111,\; 101,\; 100\; .$

It is easy to give a recursive description of such a list, starting from the list for $n=1$ (namely 0,1). Given a list $L_n$ of length $2^n$, form $L_{n+1}$ by putting a zero before each entry in $L_n$, and a one before each entry in $L_n$. Concatenate these two lists by writing down the first followed by the second in reverse order. Thus from $0,1$ we get $00, 01, 11, 10$ and then the list displayed above for $n=3$. For $n=4$ the list becomes : $
0000,\;
0001,\;
0011,\;
0010,\;
0110,\;
0111,\;
0101,\;
0100,\;
1100,\;
1101,\;
1111,\;
1110,\;
1010,\;
1011,\;
1001,\;
1000.
$

Gray codes were invented by F. Gray (1939) for sending sequences of bits using a frequency transmitting device. If the ordinary integer indexing of the bit sequence is used then a small change in reception, between 15 and 16, for instance, has a large impact on the bit string understood. Gray codes enable a coding that minimizes the effect of such an error. A careful description and literature review can be found in Wilf (1989). One crucial feature: there are non-recursive algorithms for providing the successor to a vector in the sequence in a simple way. This is implemented through keeping track of the divisibility by 2 and of the step number.

One way to express this is as follows : let $m=\sum \epsilon_i 2^i$ be the binary representation of the integer $m$, and let $\cdots e_3 e_2 e_1 e_0$ be the string of rank $m$ in the Gray code list. Then $e_i=\epsilon_i+\epsilon_{i+1} \; \; (mod 2) \; \; (i=0,1,2 \ldots)$ and $\epsilon_i=e_i+e_{i+1}+ \cdots (mod 2) \; \; (i=0,1,2 \ldots)$. For example, when $n=4$, the list above shows the string of rank $6$ is $0101$ ; now $6=0110=0 \cdot 1 + 1 \cdot 2 + 1 \cdot 4 +0 \cdot 8$. So $e_0 = 0+1=1, e_1 = 1+1=0, e_2 = 1+0=1,
e_3=0+0=0$. Thus from a given string in the Gray code and the rank one can compute the successor. There is a parsimonious implementation of this in the algorithm given in the appendix. Proofs of these results can be found in Wilf (1989).
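
As a quick MATLAB check of this relation (using the built-in functions bitxor, bitshift and dec2bin; for $n=4$ this reproduces the list above):

m=(0:15)';
g=bitxor(m,bitshift(m,-1));                     %e_i = epsilon_i + epsilon_{i+1} (mod 2)
[dec2bin(m,4) repmat('  ',16,1) dec2bin(g,4)]   %rank m in binary next to its Gray code string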

Gray Codes for the Bootstrap

Bickel and Freedman (1981) carried out exact enumeration of the bootstrap distribution for the mean using the fast Fourier transform, Bailey (1992) used a similar approach for simple functionals of means. Fisher and Hall (1991) suggest exact enumeration procedures that we will compare to the Gray code approach in Section B below.

Let ${\cal X}_n=\{x_1,x_2,\cdots,x_n\}$ be the original data supposed to be independent and identically distributed from an unknown distribution $F$ on a space ${\cal X}$. The bootstrap proceeds by supposing that replacement of $F$ by $F_n$, the empirical distribution, can provide insight on sampling variability problems.

Practically one proceeds by repeatedly choosing from the $n$ points with replacement. This leads to bootstrap replications ${\cal X}^*_n=\{x^*_1,\cdots,x^*_n\}$. There are $n^n$ such possible replications, however these are not all different and grouping together replications that generate the same subset we can characterize each resample by its weight vector $(k_1,k_2,\cdots,k_n)$ where $k_i$ is the number of times $x_i$ appears in the replication. Thus $k_1+\cdots +k_n=n$.

Let the space of compositions of $n$ into at most $n$ parts be

\begin{displaymath}C_n=\{{\bf k}=(k_1,\cdots,k_n),\; k_1+\cdots +k_n=n,\; k_i\geq 0,\;
k_i\;\hbox{integer}\}.\leqno (3.1)\end{displaymath}

Thus $\vert C_n\vert={2n-1\choose n-1}$. We proceed by running through all compositions in a systematic way. Note that the uniform distribution on ${\cal X}^n_n$ induces a multinomial distribution on $C_n$

\begin{displaymath}m_n({\bf k})={1\over n^n}{n\choose k_1\cdots k_n}.\leqno (3.2)\end{displaymath}

To form the exhaustive bootstrap distribution of a statistic $T({\cal X}_n)$ one need only compute each of the ${2n-1\choose n-1}$ statistics and associate a weight $m_n({\bf k})$ with it. The shift from ${\cal X}^n_n$ to $C_n$ gives substantial savings. For the law school data, $n=15, 15^{15}\simeq 4.38\times 10^{17}$ while ${29\choose 14} \simeq
7.7\times
10^7$.

Efficient updating avoids multiplying such large numbers by factors of $n$. This is what makes the computation feasible. Gray codes generate compositions by changing two coordinates of the vector ${\bf k}$ by one up and one down. This means that $m_n({\bf k})$ can be easily updated by multiplying and dividing by the new coordinates. Similar procedures, discussed in Section C below allow efficient changes in the statistics of interest.

Gray codes for compositions.

Following earlier work by Nijenhuis, Wilf and Knuth, Klingsberg (1982) gave methods of generating Gray codes for compositions. We will discuss this construction briefly here, details can be found in Klingsberg (1982) and Wilf (1989).

For $n=3$, the algorithm produces the 10 compositions of $C_3$ in the following order:

\begin{displaymath}300,\; 210,\; 120,\; 030,\; 021,\; 111,\; 201,\; 102,\; 012,\; 003\; .\end{displaymath}

The easiest way to understand the generation of such a list is recursive: the construction of the $n$-part compositions of $n$ can actually be done through that of the $(n-1)$-part compositions of $(n-i), i=0,\cdots,n$.

For any $n$, the 2-composition is just provided by the list $n0,(n-1)1,\cdots,0n$, which is of length $n+1$. So the 2-compositions out of 3 are $L(n,n-1)=L(3,2)=30, 21, 12, 03$

the 2-compositions out of 2 are $L(n-1,n-1)=L(2,2)=20, 11, 02$

and 2-compositions out of 1 are $L(n-2,n-1)=L(1,2)=10, 01$

Finally there is only one 2-composition of $0$: $L(0,2)=00$

The 3-out of 3 list is obtained by appending a $0$ to the $L(3,2)$ list, a 1 to the $L(2,2)$ list, a 2 to the $L(1,2)$ list and a 3 to the $L(0,2)$ list. These four lists are then concatenated by writing the first, the second in reverse order, the third in its original order followed by the fourth in reverse order. This is actually written:

\begin{displaymath}{\cal L}(3,3)=(3,2)\oplus 0,\;\; \overline{(2,2)\oplus 1},\;\;
(1,2)\oplus 2,\;\;\overline{(0,2)\oplus 3}\end{displaymath}

and more generally

\begin{displaymath}(n,n)=(n,n-1)\oplus 0,\;\;\overline{(n-1,n-1)\oplus
1},\;\;
(n-2,n-1)\oplus 2,\cdots \;.\end{displaymath}

The same procedure leads to the 35 compositions of $n=4$ in the following order:

\begin{displaymath}
\begin{array}{l}
4000,\; 3100,\; 2200,\; 1300,\; 0400,\; 0310,\; 1210,\\
2110,\; 3010,\; 2020,\; 1120,\; 0220,\; 0130,\; 1030,\\
0040,\; 0031,\; 0121,\; 1021,\; 2011,\; 1111,\; 0211,\\
0301,\; 1201,\; 2101,\; 3001,\; 2002,\; 1102,\; 0202,\\
0112,\; 1012,\; 0022,\; 0013,\; 0103,\; 1003,\; 0004.
\end{array}\end{displaymath}

The lists generated in this way have the property that two successive compositions differ only by $\pm 1$ in two coordinates.

Klingsberg (1982) provides a simple nonrecursive algorithm that generates the successor of any composition in this Gray code. This is crucial for the implementation in the present paper. It requires that one keep track of the whereabouts of the first two non-zero elements and of an updating counter. Both an $S$ and a $C$ version of the algorithm are provided in the appendix.

We conclude this subsection by discussing a different algorithm due to Nijenhuis and Wilf (1978, pp. 40-46) which runs through the compositions in lexicographic order (reading from right to left). This algorithm was suggested for bootstrapping by Fisher and Hall (1991).


$N-W$ algorithm to run through compositions $C_n$.

(1) Set $k_1=n, k_i=0, 2\leq i\leq n$.

(2) Let $h=$ first $i$ with $k_i\neq 0$. Set $t=k_h,\; k_h=0,\;
k_1=t-1,\; k_{h+1}=k_{h+1}+1$.

(3) Stop when $k_n=n$.


For example, the following list gives all 35 compositions in $C_4$ in the order produced by the N-W algorithm:

\begin{eqnarray*}
&&4000,\; 3100,\; 2200,\; 1300,\; 0400,\; 3010,\; 2110,\\
&&1210,\; 0310,\; 2020,\; 1120,\; 0220,\; 1030,\; 0130,\\
&&0040,\; 3001,\; 2101,\; 1201,\; 0301,\; 2011,\; 1111,\\
&&0211,\; 1021,\; 0121,\; 0031,\; 2002,\; 1102,\; 0202,\\
&&1012,\; 0112,\; 0022,\; 1003,\; 0103,\; 0013,\; 0004.
\end{eqnarray*}



Balanced Bootstraps

Intermediate between the complete enumeration of all samples and the drawing of a few samples at random lies the Balanced Bootstrap (see Gleason, 1988, American Statistician, 263-266, for a simple algorithm).

The idea is that, just as each observation appears equally often when we count all $\vert C_n\vert$ samples, we would like each observation to appear the same total number of times once all $B$ bootstrap resamples have been drawn.

We can ensure this by making a list containing $B$ repetitions of each of the observations $x_1,x_2,\ldots, x_n$, taking a random permutation of these $nB$ values, and then using the first $n$ as the first resample, the next $n$ as the second, and so on.
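
A minimal MATLAB sketch of this permutation scheme (the function name balboot and the three-dimensional output array are our choices, not Gleason's algorithm):

function out=balboot(orig,B)
%Balanced bootstrap: every observation appears exactly B times
%over the B resamples; out(:,:,b) is the b-th resample
[n,p]=size(orig);
pool=repmat((1:n)',B,1);               %B copies of each index
pool=pool(randperm(n*B));              %one random permutation of the nB indices
out=zeros(n,p,B);
for b=1:B
   inds=pool(((b-1)*n+1):(b*n));       %consecutive blocks of n indices
   out(:,:,b)=orig(inds,:);
end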

Monte Carlo

This is the method used for drawing a sample at random from the empirical distribution. I will start by giving a history and general remarks about Monte Carlo methods for those who have never studied them before.


What is a Monte Carlo Method?

There is not necessarily a random component in the original problem that one wants to solve, usually a problem for which there is no analytical solution. An unknown (deterministic) parameter is expressed as a parameter of some random distribution, which is then simulated. The oldest well-known example is the estimation of $\pi$ by Buffon, in his needle-on-the-floorboards experiment, where, supposing a needle of the same length as the width between cracks, we have:

\begin{displaymath}p(needle\; crosses\; crack)=\frac{2}{\pi} \mbox{ implying }
\hat{\pi}=2\frac{\#\,\mbox{tries}}{\#\,\mbox{hits}}
\end{displaymath}

In physics and statistics, many of the problems Monte Carlo is used on take the form of estimating an integral unknown in closed form:

\begin{displaymath}\theta=\int_0^1 f(u)du
\mbox{ which can be seen as the evaluation of }
Ef(u),\mbox{ where }u \sim U(0,1)\end{displaymath}

  1. The crude, or mean-value, Monte Carlo method proposes to generate $B$ numbers $u_1,\ldots,u_B$ uniformly from $(0,1)$ and to estimate $\theta$ by their average:

    \begin{displaymath}\hat{\theta}=\frac{1}{B}\sum_{b=1}^B
f(u_b)\end{displaymath}

  2. The hit-or-miss Monte Carlo method generates random points in a bounded rectangle and counts the number of 'hits' or points that are in the region whose area we want to evaluate.

    \begin{displaymath}\hat{\theta}=\frac{\#\,\mbox{hits}}{\#\,\mbox{total}}\end{displaymath}
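
A small MATLAB comparison of the two estimators on $f(u)=u^2$, so that $\theta=1/3$ (the choice of $f$ and of $B$ is ours, for illustration only):

B=10000;
u=rand(1,B);
thetacrude=mean(u.^2)                  %crude (mean-value) estimate of int_0^1 u^2 du = 1/3
x=rand(1,B); y=rand(1,B);
thetahit=sum(y<=x.^2)/B                %hit-or-miss: fraction of random points under the curve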

Which estimate is better?
This is similar to comparing statistical estimators in general.

There are certain desirable properties that we want estimators to have. Consistency, which ensures that as the sample size increases the estimate converges to the right answer, is guaranteed here by the law of large numbers. Other properties of interest are unbiasedness and variance.

Both the above methods are unbiased, that is, when repeated many times their average values are centred at the actual value $\theta$:

\begin{displaymath}E(\hat{\theta})=\theta\end{displaymath}

So the choice between them lies in finding the one with the smaller variance. The heuristic I developed in class to see that hit-or-miss has the higher variance is based on the idea that the extra variance comes from the added randomness of generating both coordinates at random, instead of just the abscissae as in crude Monte Carlo.

More precisely, the variance of crude Monte Carlo is

\begin{displaymath}\sigma_M^2=\frac{1}{n}\int_0^1(f(u)-\theta)^2du=\frac{E(f(u)-\theta)^2}{n}=\frac{1}{n}E(f(u)^2)
-\frac{\theta^2}{n}
\end{displaymath}

and that of hit and miss Monte Carlo, which is just a Binomial(n,$\theta$) is:

\begin{displaymath}\sigma_H^2=\frac{\theta(1-\theta)}{n}
\end{displaymath}

The difference between these two variances is always positive:

\begin{displaymath}\sigma_H^2-\sigma_M^2=
\frac{1}{n}\int_0^1f(u)(1-f(u))du >0
\end{displaymath}

Most improvements to Monte Carlo methods are variance-reduction techniques.

Antithetic Resampling

Suppose we have two random variables $X$ and $Y$ that provide estimators for $\theta$, that they have the same variance, and that they are negatively correlated; then $\frac{1}{2}(X+Y)$ will provide a better estimate for $\theta$ because its variance will be smaller.
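
To see why, write $\sigma^2$ for the common variance and $\rho<0$ for the correlation of $X$ and $Y$:

\begin{displaymath}var\left(\frac{X+Y}{2}\right)=\frac{1}{4}\left[var(X)+var(Y)+2cov(X,Y)\right]
=\frac{\sigma^2}{2}(1+\rho) < \frac{\sigma^2}{2}\end{displaymath}

whereas averaging two independent copies would only bring the variance down to $\sigma^2/2$.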

This is the idea in antithetic resampling (see Hall, 1989). Suppose we are interested in a real-valued parameter and that we have ordered our original sample $x_1<x_2< \cdots< x_n$. To each resample $\mbox{${\cal X}$}^*=\{x_{j_1},x_{j_2},\ldots,x_{j_n}\}$ and statistic $s(\mbox{${\cal X}$}^*)$ we associate $\mbox{${\cal X}$}^{**}$, obtained by applying to the $j_i$'s a special permutation chosen to make $cov(s(\mbox{${\cal X}$}^{**}), s(\mbox{${\cal X}$}^*))<0$ and as small as possible. If the statistic is a smooth function of the mean, for instance, then the `reversal permutation' that maps $1$ into $n$, $2$ into $n-1$, etc. is the best: the small sample values are exchanged with the larger observations, and the average of these two estimates gives an estimate with smaller variance.

Importance Sampling

This is often used in simulation, and is a method to work around the small-area problem. If we want to evaluate tail probabilities, or very small areas, we may get very few hits from our random number generator in that area. However we can modify the random number generator to make that area more likely, as long as we take this into account when we do the summation. Importance sampling is based on the equalities:

\begin{displaymath}\int_0^1 f(u)du =\int_0^1 \frac{f}{g}(u) g(u) du=
\int_0^1 \frac{f}{g}(u) dG(u)
\end{displaymath}
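
A small MATLAB sketch of the idea (the integrand $f(u)=u^4$ and the sampling density $g(u)=2u$ on $(0,1)$ are our choices for illustration; draws from $g$ are obtained by inversion, $u=\sqrt{v}$ with $v \sim U(0,1)$):

B=10000;
u=rand(1,B);
thetacrude=mean(u.^4)                  %crude Monte Carlo for theta = int_0^1 u^4 du = 1/5
v=sqrt(rand(1,B));                     %draws from g(u)=2u by inversion of G(u)=u^2
thetais=mean(v.^4./(2*v))              %average of (f/g)(u) under g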

Theoretical Underpinnings

Statistical Functionals

(Reference: Erich Lehmann, 1998, pp. 381-438.)

We often speak of the asymptotic properties of the sample mean $\bar{X}$. These refer to the sequence $\bar{X}_n$. These functions are the same, in some sense, for all sample sizes. The notion of a statistical functional makes this clearer.

Suppose we are interested in real-valued parameters. We often have a situation where the parameter of interest is a function of the distribution function $F$: $\theta=t(F)$. Such functions are called statistical functionals. Examples:

\begin{displaymath}\mu=E_F(X),\qquad \mu^{(k)}=E_F(X-E(X))^k , \qquad F^{-1}(p) \end{displaymath}

Goodness of fit statistics:

\begin{displaymath}\mbox{Kolmogorov-Smirnov 's } h(F)=sup_x \vert F(x)-F_0(x)\vert \end{displaymath}

is estimated by:

\begin{displaymath}h(\mbox{$\hat{F}_n$})=sup_x \vert\mbox{$\hat{F}_n$}(x)-F_0(x)\vert \end{displaymath}

Ratio of two means.

\begin{displaymath}\theta=\frac{\mu_1}{\mu_2}=\frac{E_{F_1}(X)}{E_{F_2}(X)}\end{displaymath}

We use the sample cdf

\begin{displaymath}\hat{F}_n(x)=\frac{\#\{X_i \leq x\}}{n}=\frac{1}{n}\sum_{i=1}^n
\delta_{\{X_i\leq x\}}\end{displaymath}

as the nonparametric estimate of the unknown distribution $F$.

The usual estimates for these functionals are obtained by simply plugging in the empirical distribution function for the unknown theoretical one.

Thus taking into account that for any function $g$ we have:

\begin{displaymath}\int g(x)d\mbox{$\hat{F}_n$}(x)=\frac{1}{n}\sum_{i=1}^n g(x_i)\end{displaymath}

the plug-in estimate for the mean is $\int x \,d\mbox{$\hat{F}_n$}(x) =\frac{1}{n}\sum_{i=1}^n x_i$, the usual sample estimate; for the variance, it is the biased estimate:

\begin{displaymath}\widehat{var_F(X)}=\int (x-E_{\mbox{$\hat{F}_n$}}(X))^2 \,d\mbox{$\hat{F}_n$}(x)=\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2
\end{displaymath}

Notions of Convergence

Convergence in Law
A sequence of cumulative distribution functions $H_n$ is said to converge in distribution to $H$ iff $H_n(x)\longrightarrow H(x)$ on all continuity points of $H$.

We say that if the random variable $Y_n$ has cdf $H_n$ and the rv $Y$ has cdf $H$, $Y_n$ converges in law to $Y$, and we write

\begin{displaymath}Y_n \stackrel{\cal{ L}}{\longrightarrow} Y\end{displaymath}

This does not mean that $Y_n$ and $Y$ are arbitrarily close, think of the random variables $U \sim U(0,1)$ and $1-U$.

Convergence in Probability

\begin{displaymath}Y_n \stackrel{P}{\longrightarrow} c \quad \mbox{ iff } \quad
\forall \epsilon>0,\; P(\vert Y_n-c\vert < \epsilon) \longrightarrow 1\end{displaymath}

Note:
If $k_n Y_n \stackrel{\cal{ L}}{\longrightarrow} H$ where $H$ is a limit distribution and $k_n\longrightarrow \infty$ then $Y_n \stackrel{P}{\longrightarrow} 0$.

Why is the empirical cdf $\hat{F}_n$ a good estimator of $F$?


\begin{displaymath}\mbox{$\hat{F}_n$}(a)=\frac{Number \; of \; X_i \leq a}{n}\end{displaymath}

This count has a binomial distribution with success probability $F(a)$, so that the theoretical expectation is $E(\mbox{$\hat{F}_n$}(a))=F(a)$ and $var (\mbox{$\hat{F}_n$}(a))=pq/n=\frac{1}{n}F(a)(1-F(a))$. From this and de Moivre's central limit theorem for the binomial, we can show, for fixed real $a$,

\begin{displaymath}\sqrt{n}(\mbox{$\hat{F}_n$}(a)-F(a)) \stackrel{\cal{ L}}{\longrightarrow} \mbox{${\cal N}$}(0,F(a)(1-F(a)))\end{displaymath}

This shows pointwise consistency.

Because of the result noted above, this also ensures that $\mbox{$\hat{F}_n$}(a) \stackrel{P}{\longrightarrow} F(a)$. This is actually true uniformly in $a$, because Kolmogorov's statistic is pivotal:

\begin{displaymath}d(\mbox{$\hat{F}_n$},F)=sup_{x} \vert\mbox{$\hat{F}_n$}(x)-F(x)\vert =D_n \end{displaymath}

has a distribution that does not depend on $F$. In fact $D_n \stackrel{\cal{ P}}{\longrightarrow} 0$, as $n \longrightarrow \infty$. It is easy for continuous $F$, but also true for larger classes of cdfs (Serfling, Billingsley).

Definition: A statistic is said to be pivotal if its distribution does not depend on any unknown parameters.

Example: Student's $t$ statistic.

Generalized Statistical Functionals

When we want to evaluate an estimator, construct confidence intervals, etc., we are usually interested in evaluating quantities that are functions of the unknown distribution $F$, the empirical $\hat{F}_n$, and the sample size. Here are some examples:
  1. The sampling distribution of the error: $\lambda_n(F,\mbox{$\hat{F}_n$})=P_F(\sqrt{n}(\theta(\mbox{$\hat{F}_n$})-\theta(F))\leq x)$
  2. The bias: $\lambda_n(F,\mbox{$\hat{F}_n$})=E_F(\theta(\mbox{$\hat{F}_n$}))-\theta(F)$
  3. The standard error: $\lambda_n(F,\mbox{$\hat{F}_n$})=\sqrt{E_F(\theta(\mbox{$\hat{F}_n$})-\theta(F))^2}
$
For each of these examples, what the bootstrap proposes is to replace $F$ by the empirical $\hat{F}_n$.

The bootstrap is said to work if

\begin{displaymath}\lambda_n(\mbox{$\hat{F}_n$},\mbox{$\hat{F}_n$}^*)-\lambda_n(F,\mbox{$\hat{F}_n$})\stackrel{P}{\longrightarrow} 0
\end{displaymath}

Typically, we will be in the situation where

\begin{displaymath}\lambda_n(F,\mbox{$\hat{F}_n$}){\longrightarrow} \lambda(F) \qquad \mbox{for
all F under consideration}
\end{displaymath}

If the above is true the bootstrap works if and only if:

\begin{displaymath}\lambda_n(\mbox{$\hat{F}_n$},\mbox{$\hat{F}_n$}^*){\longrightarrow} \lambda(F)
\end{displaymath}

We will not give any proofs in this class. (If you want to do a theoretical project, I can guide you to cases where this can be done.)

Examples: Estimating the bias of the median (here for $n=3$, where the bootstrap median equals $X_{(1)}$, $X_{(2)}$, $X_{(3)}$ with probabilities $\frac{7}{27}$, $\frac{13}{27}$, $\frac{7}{27}$ respectively):

\begin{displaymath}
E(X_{(2)}^*)-X_{(2)}= \left[ \frac{7}{27}
X_{(1)}+\frac{13}{27}X_{(2)}+\frac{7}{27}X_{(3)}\right]-X_{(2)}
=\frac{14}{27}\left[\frac{X_{(1)}+X_{(3)}}{2}-X_{(2)}\right]
\end{displaymath}

Bootstrapping a probability:

\begin{displaymath}\lambda_n(\mbox{$\hat{F}_n$},\mbox{$\hat{F}_n$}^*)=P_{\mbox{$\hat{F}_n$}}[(X^*_1,X^*_2,\ldots,X^*_n)\in S]\end{displaymath}

as in the case of finding an error distribution $P_F(\sqrt{n}(\theta(\mbox{$\hat{F}_n$})-\theta(F))\leq x)$, we can see that this amounts to finding the volume of a set $S$ under a known distribution, and Monte Carlo is the natural route. So in this new notation we see that bootstrapping is a two-step process: first replace the unknown $F$ by $\hat{F}_n$, then approximate the resulting quantity by Monte Carlo.

The jackknife

Suppose the bias of an estimator, $E_n-\theta$, where $E_n$ denotes its expectation for a sample of size $n$, is of the form:

\begin{displaymath}
(*) \qquad E_n-\theta=\frac{a_1}{n}+\frac{a_2}{n^2}+\frac{a_3}{n^3}+\cdots
\end{displaymath}

even if the numerators are unknown quantities depending on the real distribution $F$.

A method proposed to reduce such a bias is the jackknife.

This was the first time that the sample itself was manipulated: each observation is dropped in turn from the sample, the distribution function obtained when the $i$th observation has been dropped is called $F_{(i)}$, the corresponding functional estimate gives the estimator $\hat{\theta}_{(i)}=t(F_{(i)})$, and their average is denoted $\hat{\theta}_{(.)}=\frac{1}{n}
\sum_{i=1}^n \hat{\theta}_{(i)}$


\begin{displaymath}
\mathbf{X}_{(i)} = (x_1, x_2, \ldots x_{i-1}, x_{i+1}, \ldots x_n)
\end{displaymath}

We can use the jackknife for estimating the bias by:

\begin{displaymath}\widehat{Bias}_{\mbox{jack}} = (n-1)
(\hat{\theta}_{(\cdot)}-\hat{\theta})
\end{displaymath}

We can show that if the bias is of order $\frac{1}{n}$ as in $(*)$, then the bias of the jackknife-corrected estimate is of order $\frac{1}{n^2}$. Indeed,

\begin{displaymath}E_F \hat{\theta}_{(\cdot)}=E_{n-1}=\theta+\frac{a_1(F)}{n-1}+\frac{a_2(F)}
{(n-1)^2}+\cdots \end{displaymath}


\begin{displaymath}\mbox{The Jackknife estimate is:} \tilde{\theta}=
\hat{\theta}-\widehat{Bias}_{\mbox{jack}}\end{displaymath}


\begin{displaymath}E_F \tilde{\theta}= E_F
\hat{\theta}-E_F\widehat{Bias}_{\mbox{jack}}= nE_n-(n-1)E_{n-1}
=\theta-\frac{a_2(F)}{n(n-1)}+{a_3(F)}\left(\frac{1}{n^2}-
\frac{1}{(n-1)^2}\right)+\cdots \end{displaymath}

So the order of the bias has been decreased from $O(\frac{1}{n})$ to $O(\frac{1}{n^2})$

Example:
Suppose $\theta=var(X)=\int_{supp(F)} (x-\mu)^2dF$ for which the plug in estimate is:

\begin{displaymath}\hat{\theta}=\int_{supp(F_n)} (x-\mu(F_n))^2dF_n
=\frac{1}{n}\sum (x_i-\bar{x})^2\end{displaymath}

The jackknife estimate of variance


\begin{displaymath}
\widehat{se}_{\mbox{jack}}^2 = [ \frac{n-1}{n}
\sum (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2 ]
\end{displaymath}

Example: Patch Data

Jackknife

%------------------------------------------
function out=jackbias(theta,orig)
%Estimate the bias using the jackknife
%theta has to be a character string containing
%a valid function name; orig is the data matrix
[n,p]=size(orig);
lot=feval(theta,orig(2:n,:));
k=length(lot);
lo=zeros(n,k);
lo(1,:)=lot;
lo(n,:)=feval(theta,orig(1:(n-1),:));
for i=(2:(n-1))
   lo(i,:)=feval(theta,orig([1:(i-1),(i+1):n],:)); 
end
thetadot=mean(lo);
out=(n-1)*thetadot -(n-1)*feval(theta,orig);
%-------------------------------
function out=ratio(yszs)
%Computes the ratio of the mean of the first
%column to the mean of the second column
out=mean(yszs(:,1))/mean(yszs(:,2));
%-------------------------------------
>>z=oldpatch-placebo ; y=newpatch-oldpatch
>> yz=[y,z]
       -1200        8406
        2601        2342
       -2705        8187
        1982        8459
       -1290        4795
         351        3516
        -638        4796
       -2719       10238
>>ratio(yz)
   -0.0713
>> jackbias('ratio',yz)
ans =    0.0080

Bootstrap Simulations

function out=bootbias(theta,orig,B)
%Bootstrap estimate of the bias of the statistic
%named by the string theta, based on B resamples
thetabs=zeros(1,B);
for b=(1:B)
    bs=bsample(orig);
    thetabs(b)=feval(theta,bs);
end
theta0=feval(theta,orig);
out=mean(thetabs)-theta0;
%-----------------------------------
>> bootbias('ratio',yz,1000)
ans =    0.0085
%-----------------------------------
function out=bootmultin(orig,B)
%Returns a B by n matrix whose b-th row gives the
%multinomial counts (k_1,...,k_n) of the b-th resample
[n,p]=size(orig);
out=zeros(B,n);
for b=1:B
inds=randint(1,n,n)+1;
for i=1:n
 out(b,inds(i))=out(b,inds(i))+1;
end;
end
%-----------------------------------

Cross Validation

Separate Diagnostic Data Sets

When one has iterated through several exploratory methods, varied the projections and looked for the best `fit', there seems to be only one honest method of verifying whether one is overfitting noise or whether there really is a latent variable or a good prediction available, and that is to have another set of data of exactly the same type.

The best thing to do is, at the beginning of the study, to take a random sub-sample, without any particular stratification, and to put it aside for the confirmatory stage. Many scientists are mean with their data and only have just enough to model, but nowadays the expense of an extra 25% or so should be accepted, especially when the consequences of the study are medical; this is what Tukey and Mallows call a careful separate diagnostic.

Cross-Validation when there is a response variable

When the above prescription is not followed and one of the variables has the status of the variable to be explained, it is possible (at computational expense, but who cares?) to redo the analysis leaving out part of the data and comparing with the reference set.

For instance in Discriminant Analysis:
for each observation, do the analysis without that observation and check whether or not it is well classified; this gives a nearly unbiased estimate of the percentage badly classified. Cross-validation can thus be used when one variable has the particular status of being explained.

And in regression
We want to estimate the prediction error:

\begin{displaymath}PE=E_F(y-\hat{y})^2\end{displaymath}

This can be done by cross validation, writing:

\begin{displaymath}PRESS=\frac{1}{n}\sum_{i=1}^n
(\hat{y}_{(i)}-y_i)^2\end{displaymath}
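
A minimal MATLAB sketch of this computation for a simple linear regression (the function name press and the use of polyfit and polyval are our choices):

function out=press(x,y)
%Leave-one-out cross-validation estimate of prediction error
%for the simple linear regression of y on x
n=length(y);
sq=zeros(n,1);
for i=1:n
   keep=[1:(i-1),(i+1):n];
   c=polyfit(x(keep),y(keep),1);       %refit the line without observation i
   sq(i)=(polyval(c,x(i))-y(i))^2;     %squared prediction error at the left-out point
end
out=mean(sq);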

However it has also been used at the diagnostic stage in principal components, and in classification and regression trees where it helps choose the size of an `optimal tree'.

Confidence Intervals

We call the unknown parameter $\theta$ and our estimate $\hat{\theta}$. Suppose we were in an ideal (unrealistic) situation in which we knew the distribution of $\hat{\theta}-\theta$; we will be especially interested in its quantiles: denote the $\alpha$ quantile by $\underline{\delta}=\delta^{(\alpha)}$ and the $1-\alpha$ quantile by $\bar{\delta}=\delta^{(1-\alpha)}$.

By definition we have: $P(\hat{\theta}-\theta \leq \underline{\delta})=\alpha$ and $P(\hat{\theta}-\theta \leq \bar{\delta} )= 1-\alpha$ giving a $1-2\alpha$ confidence interval:

\begin{displaymath}[\hat{\theta}-\bar{\delta},\hat{\theta}-\underline{\delta}]\end{displaymath}

  1. Would we get the same answer without centering with respect to $\hat{\theta}$?

    What we called $\mbox{$\underline{\delta}$}$ and $\bar{\delta}$ were the ideal quantiles from which we build the confidence interval $[\hat{\theta}-\bar{\delta},\hat{\theta}-\mbox{$\underline{\delta}$}]$. They are estimated using $\bar{\delta}^*$, the $(1-\alpha)$th quantile of $\hat{\theta}^*-\hat{\theta}$; the $\hat{\theta}$'s do NOT cancel out.

  2. These intervals would not be the same if we took the distribution of $\hat{\theta}^*$ to mimic simply that of the $\hat{\theta}$'s (those are seen later as the percentile method estimates).

Properties and Improvements

As can be predicted, the bigger $B$, the number of bootstrap resamples, the better the Monte Carlo approximations are; thus much recent work has been devoted to making the most of the computations. However, as we have seen, improving the MC approximation is not the only route for improving the bootstrap's performance; this can also be done in the ways described below.

Preliminaries : Pivotal quantities
Construction of confidence intervals is based (ideally) on a pivotal quantity $R_n$, that is, a function of the sample and the parameter $\theta$ whose distribution does not depend on $\theta$, or on any other unknown parameter. Thus, for instance, no knowledge is needed about the parameters $\mu$ and $\sigma^{2}$ of a normal variable $X \sim N(
\mu,\sigma^{2})$ to construct confidence intervals for both of these parameters, because one has two pivotal quantities available:

\begin{displaymath}\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}\end{displaymath}


\begin{displaymath}\frac{(n-1)S^{2}}{\sigma^{2}}\sim \chi^{2}_{n-1} \end{displaymath}

Let $c_{n}(\alpha)$ denote the $(1-\alpha)$th quantile of the distribution of the pivot $R_n$, and $\Theta$ the parameter space, the set of all possible values of $\theta$; then

\begin{displaymath}\left\{t \in \Theta : R_{n}(t) \leq
c_{n}(\alpha) \right\}\end{displaymath}

is a confidence set whose level is $1-\alpha$. Thus if $t_{(1-\alpha/2)}$, $\chi^{2}_{\alpha/2}$ and $\chi^{2}_{1-\alpha/2}$ denote the relevant quantiles for the distributions cited above the corresponding $1-\alpha$ confidence intervals will be

\begin{displaymath}\left[\bar{x}-t_{(1-\alpha/2)}s/\sqrt{n},
\bar{x}+t_{(1-\alpha/2)}s/\sqrt{n}
\right]
\end{displaymath}

and

\begin{displaymath}
\left[(n-1)s^{2}/\chi^{2}_{1-\alpha/2},
(n-1)s^{2}/\chi^{2}_{\alpha/2}
\right]\end{displaymath}

Such an ideal situation is of course rare, and one has to make do with approximate pivots (sometimes called roots). Therefore errors will be made on the level ($1-\alpha$) of the confidence sets. The level error is sensitive to how far $R_{n}$ is from being a pivot.

We will present several approaches for finding roots. One approach is to studentize the root, that is, divide it by an estimate of its standard deviation.

Studentized Confidence Intervals

Suppose that we are interested in a parameter $\theta$, and that $\widehat{\theta}$ is an estimate based on the n-sample ${\cal X}_n$.

We will suppose that the asymptotic variance of $\widehat{\theta}$ is $\sigma^{2}/n=se^2(\widehat{\theta})$ with a corresponding estimate $\widehat{se}^2(\widehat{\theta})$.

The studentized bootstrap or bootstrap-t method (Efron 1982, Hall 1988) is based on approximating the distribution function $H_F(x)$ of $(\widehat{\theta}-\theta)/
\widehat{se}$ by $H_{F_n}(x)$, the distribution of $\frac{(\widehat{\theta^{*}}-\widehat{\theta})}{\widehat{se}^*}$.

Denote by $c_{n,\alpha}^*$ the $\alpha$ th quantile of the distribution of $(\widehat{\theta^{*}}-\widehat{\theta})/\widehat{se}^*$,

\begin{displaymath}c_{n,\alpha}^*=H_{F_n}^{-1}(\alpha)\end{displaymath}

Thus the equitailed bootstrap-t confidence interval of coverage $1-2\alpha$ is:

\begin{displaymath}\left[\widehat{\theta}-c_{n,(1-\alpha)}
\widehat{se}
, \widehat{\theta}-c_{n,\alpha}\widehat{se}
\right]
\end{displaymath}

One way of finding an estimate $\widehat{se}^{*}$ is to use a second-level bootstrap; this is rather expensive computation-wise, and one could prefer a jackknife estimate as an intermediate solution.

\fbox{Bootstrap-t Algorithm}
  1. Compute the original estimate of $\theta$ : THETA(0)
  2. For b=1 to B do :
    1. Generate a sample ${\cal X}_b^*$
    2. Compute the estimate of $\theta$ based on ${\cal X}_b^*$: THETA(b)
    3. Use ${\cal X}_b^*$ as original data and compute an estimate of the standard deviation of $\widehat{\theta}$
      (requires another set of iterations): SIGMA(b).
    4. Compute $R({\cal X}_b^*)$=(THETA(b)-THETA(0))/ SIGMA(b)
  3. Compute CNA and CNCA, the $\alpha$ and $(1-\alpha)$'th quantiles of the sample formed by all the R's.
  4. LO=THETA(0)-CNCA*SIGMA(0)
  5. HI=THETA(0)-CNA*SIGMA(0)
where SIGMA(0) is an estimate of the standard error of $\widehat{\theta}$ from the original sample (in the boott code below, the bootstrap standard error of the THETA(b)'s plays this role).

Correlation Coefficient Example

function out=corr(orig)
%Correlation coefficient
c1=corrcoef(orig);
out=c1(1,2);
%------------------------------------------
function interv=boott(orig,theta,B,sdB,alpha)
%Studentized bootstrap confidence intervals
%
     theta0=feval(theta,orig);
     [n,p]=size(orig);
      thetab=zeros(B,1);
      sdstar=zeros(B,1);
      thetas=zeros(B,1);
      for b=1:B
        indices=myrand(1,n,n);
        samp=orig(indices,:);
        thetab(b)=feval(theta,samp);
%Compute the bootstrap se,se*   
        sdstar(b)=feval('btsd',samp,theta,n,sdB);
%Studentized statistic
      thetas(b)=(thetab(b)-theta0)/sdstar(b);
      end
      se=sqrt(var(thetab));
Pct=100*(alpha/2);
lower=prctile(thetas,Pct);
upper=prctile(thetas,100-Pct);
interv=[(theta0-upper*se) (theta0 - lower*se)];
%----------------------------------------------
function out=btsd(orig,theta,n,NB)
%Compute the bootstrap estimate
%of the std error of the estimator
%defined by the function theta
%NB number of bootstrap simulations
thetab=zeros(NB,1);
  for b=(1:NB)
      indices=myrand(1,n,n);
      samp=orig(indices,:);
      thetab(b)=feval(theta,samp);
  end
out=sqrt(var(thetab));
%----------------------------------------------
>> boott(law15,'corr',1000,30,.05)
   -0.4737    1.0137
%----------------------------------------------
>> boott(law15,'corr',2000,30,.05)
   -0.2899    0.9801
%----------------------

Transformations of the parameter: Matlab example

function out=transcorr(orig)
%transformed correlation coefficient
c1=corrcoef(orig);
rho=c1(1,2);
out=.5*log((1+rho)/(1-rho));
>> transcorr(law15)
    1.0362
>> tanh(1.03)
    0.7739
>> boott(law15,'transcorr',100,30,.05) 
   -0.7841    1.7940
>> tanh(ans)
   -0.6550    0.9462
>> boott(law15,'transcorr',1000,30,.05)
    0.0473    1.7618
>> tanh(ans)
    0.0473    0.9427
>> transcorr(law15)                    
    1.0362
>> 2/sqrt(12)
    0.5774
>> 1.0362 - 0.5774
    0.4588
>> 1.0362 + 0.5774    
    1.6136
>> tanh([.4588 1.6136])
    0.4291    0.9237
%%%%%%%%%%%%%%%True confidence Intervals%
>> prctile(res100k,[5 95]) 
ans =
    0.5307    0.9041

Confidence Intervals : The percentile method

We denote by $G_n^*(t)=P_{F_n}\left(
\widehat{\theta}^* \leq t \right)$ the bootstrap distribution of $\widehat{\theta}^*$, approximated by

\begin{displaymath}
\widehat{G}_n^*(t)=
\#\{\widehat{\theta}^*
\leq t
\}/B\end{displaymath}

The percentile method consists in taking the $1-2\alpha$ confidence interval for $\theta$ as being

\begin{displaymath}\left[
\widehat{G}_n^{*-1}(\alpha),
\widehat{G}_n^{*-1}(1-\alpha)
\right]\end{displaymath}

Theoretically speaking this is equivalent to replacement of the unknown distribution $G(x,F)=P_F(\widehat{\theta}<x)$ by the estimate $G(x,F_n)$.
\fbox{The Percentile Algorithm } :
  1. Input the level $=2\alpha$ of the confidence interval
  2. For b=1 to B do /* B is the number of bootstrap samples*/
  3. For i=1 to n do /* Generation of Sample ${\cal X}_b$ */
  4. Compute $\widehat{\theta}^*_b=\theta({\cal X}_b)$
  5. Combine the $\widehat{\theta}^*_b=\theta({\cal X}_b)$ into a new data set THETA
  6. Compute the $\alpha$ and $(1-\alpha)$'th quantiles for THETA.
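
A minimal MATLAB sketch of this algorithm for the correlation of the law school data, reusing the bsample and corr functions defined earlier (law15 is the data matrix used in the transcripts above; the values of B and alpha are our choices):

B=1000; alpha=.025;                    %2*alpha = .05, i.e. a 95% interval
thetab=zeros(1,B);
for b=1:B
   thetab(b)=corr(bsample(law15));
end
interv=prctile(thetab,[100*alpha 100*(1-alpha)])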

Variance Stabilizing Transformations

Given a transformation $Y=g(X)$ of a random variable $X$, we may want $var(Y)$, to assess the accuracy of an estimate, or just its expectation.

If $g$ is linear, as we know, finding its expectation is trivial, and we can also find its variance (recalling that $E(a+bV)=a+bE(V)$ and $Var(a+bV)=b^2Var(V)$).

When $g$ is not linear we bring ourselves back to that case by using a Taylor expansion: we hope to find a region where $X$ dwells with high probability and where $g$ is approximately linear.

\begin{displaymath}Y=g(X) \simeq g(\mu_X) + (X-\mu_X) g'(\mu_X)
+\frac{1}{2}(X-\mu_X)^2 g''(\mu_X)\end{displaymath}

This gives as a first order approximation to the expectation of $Y$ : $
E(Y)\simeq g(\mu_X) $ and as a second order approximation

\begin{displaymath}
E(Y)\simeq g(\mu_X) + \frac{1}{2} \sigma_X^2 g''(\mu_X) \end{displaymath}

This is called the delta method. How good is such an approximation? That depends on two things:
1) how nonlinear $g$ is in a neighborhood of $\mu_X$; 2) how big $\sigma_X^2$ is.
We are only interested in the neighborhood of $\mu_X$ because that is where $X$ lies most of the time (remember Chebychev).

This method provides a generalisation of the central limit theorem:

\begin{displaymath}\mbox{If }\sqrt{n}(\widehat{\theta}_n-\theta)\stackrel{L}{\Longrightarrow}
\mbox{${\cal N}$}(0,\tau^2)\mbox{ then }
\sqrt{n}(f(\widehat{\theta}_n)-f(\theta))\stackrel{L}{\Longrightarrow}
\mbox{${\cal N}$}(0,\tau^2[f'(\theta)]^2)
\end{displaymath}

as long as $f'(\theta)$ exists and is non zero.

By Taylor's theorem:

\begin{displaymath}f(\widehat{\theta}_n)\simeq
f(\theta)+(\widehat{\theta}_n-\theta)f'(\theta)\end{displaymath}


\begin{displaymath}\mbox{So that }
\sqrt{n}(f(\widehat{\theta}_n)-f(\theta))\simeq \sqrt{n}(\widehat{\theta}_n-\theta)f'(\theta)
\end{displaymath}

Examples: Even though $1/X, \log X , e^X$ are non normal, this holds because the transformations will be locally linear.

Binomial: We're interested in estimating $p^2$, should we

  1. Do $n$ binomial trials with probability of success $p^2 (X)$. $\longrightarrow$ $X/n$
  2. Do $n$ binomial trials with probability $p$ of success $(Y)$ and then estimate the parameter by $(Y/n)^2$
  1. $\sqrt{n}(\frac{X}{n}-p^2)\longrightarrow \mbox{${\cal N}$}(0,p^2(1-p^2))$
  2. $\sqrt{n}((\frac{Y}{n})^2-p^2)\longrightarrow \mbox{${\cal N}$}(0,pq\times 4 p^2))$
Comparison: For large $n$, $X/n$ will be more accurate iff

\begin{eqnarray*}
p^2(1-p^2) &< &pq\times 4 p^2 \qquad \Longrightarrow
(1+p) < 4p \Longrightarrow
\frac{1}{3} < p
\end{eqnarray*}



Otherwise it is $(Y/n)^2$ that is the most efficient.

The delta method provides a route towards variance stabilization, ie making the variance less dependent on $\theta$ (and thus making the distribution closer to being pivotal).

We would like to find transformations $f$ such that $\sqrt{n}[f(\bar{x})-f(\lambda)]
\stackrel{L}{\longrightarrow} \mbox{${\cal N}$}(0,c^2)$, where $c^2$ does not depend on $\lambda$.

By taking the transformation $f$ to be such that $f'(\theta)=
\frac{c}{\tau(\theta)}$, we have a constant variance.

Poisson case:
$\theta=\lambda,\tau(\theta)=
\sqrt{\lambda}$,

\begin{displaymath}f'(\theta)=\frac{c}{\sqrt{\lambda}}
\qquad f(\lambda)=2c\sqrt{\lambda}\end{displaymath}

For $c=1$, we have $2\sqrt{n}(\sqrt{\bar{X}}-\sqrt{\lambda})\stackrel{L}{\longrightarrow}
\mbox{${\cal N}$}(0,1)$ and $R=2\sqrt{n}(\sqrt{\bar{X}}-\sqrt{\lambda})$ is asymptotically pivotal.

Chisquare:
Let $Y_i=X_i^2$, where $X_i \stackrel{iid}{\sim} \mbox{${\cal N}$}(0,\sigma^2)$. Then $E(Y_i)=\sigma^2$ and $var(Y_i)=2\sigma^4$, so $\theta=\sigma^2$ and $\tau^2(\theta)=2\theta^2$; the variance stabilizing $f$ is

\begin{displaymath}f'(\theta)=\frac{c}{\sqrt{2}\theta} \; \; \mbox{ so }
f(\theta)=\frac{c}{\sqrt{2}}\log\theta ,\mbox{ and for } c=1,\quad
\sqrt{\frac{n}{2}}\,\log{\frac{\bar{Y}}{\sigma^2}} \longrightarrow \mbox{${\cal N}$}(0,1)
\end{displaymath}

Correlation Coefficient

For the correlation coefficient case, we actually need a more sophisticated version, supposing we are interested in

\begin{displaymath}Z=g(X,Y)\end{displaymath}

a function of two random variables that we know about, we need a two-dimensional Taylor series. Write $\mu=(\mu_x,\mu_y)$. Then

\begin{displaymath}Z=g(X,Y) \simeq g(\mu)+(X-\mu_x)\frac{\partial g}{\partial x}(\mu)+ (Y-\mu_y)\frac{\partial g}{\partial y}(\mu) \end{displaymath}

as a first approximation. This gives us straight away, for the expectation, $E(Z)\simeq g(\mu)$, and with a little more work, for the variance:

\begin{displaymath}Var(Z)\simeq \left(\frac{\partial g}{\partial x}(\mu)\right)^2 \sigma_X^2
+\left(\frac{\partial g}{\partial y}(\mu)\right)^2 \sigma_Y^2
+2\,\frac{\partial g}{\partial x}(\mu) \frac{\partial g}{\partial y}(\mu)\, \sigma_{X,Y}\end{displaymath}

An improved estimate of the expectation can be obtained by taking the second order Taylor expansion:

\begin{eqnarray*}
Z=g(X,Y)\simeq &&g(\mu)+(X-\mu_x)\frac{\partial g}{\partial x}(\mu)
+(Y-\mu_y)\frac{\partial g}{\partial y}(\mu)\\
&&+\frac{1}{2}(X-\mu_x)^2\frac{\partial^2 g}{\partial x^2}(\mu)
+\frac{1}{2}(Y-\mu_y)^2\frac{\partial^2 g}{\partial y^2}(\mu)
+(X-\mu_x)(Y-\mu_y)\frac{\partial^2 g}{\partial x \partial y} (\mu)
\end{eqnarray*}



From this and from the properties of the expectation:

\begin{displaymath}E(Z)\simeq g(\mu)+ \frac{1}{2}\sigma_X^2 \frac{\partial^2 g}{\partial x^2}(\mu)
+\frac{1}{2}\sigma_Y^2 \frac{\partial^2 g}{\partial y^2}(\mu)
+\sigma_{XY} \frac{\partial^2 g}{\partial x \partial y} (\mu)\end{displaymath}

For $p$-variables it is similar.

Note : If you don't remember about Taylor series you can get the computer to remind you by using Maple.

Type maple, then:

>help(taylor);
>taylor(f(x),x=mu,3);
for multivariate:
>readlib(mtaylor);
>mtaylor(f(x,y),[x=a,y=b],3);

To simplify the formulas, suppose the variances of both $X$ and $Y$ are 1 and the means are 0. A simplified version of the argument in Lehmann (page 317) then says that, for any bivariate distribution with correlation coefficient $\rho$,

\begin{displaymath}
\sqrt{n}
\left(\frac{\frac{1}{n} \sum X_i Y_i}{S_X S_Y}-\rho\right)
\longrightarrow \mbox{${\cal N}$}(0,\gamma^2)\qquad \gamma^2=(1-\rho^2)^2
\end{displaymath}

The variance stabilizing transformation is given by the condition, with $c=1$:

\begin{displaymath}f'(\rho)=\frac{1}{1-\rho^2}=\frac{1}{2}[\frac{1}{1-\rho}+\frac{1}{1+\rho}]
\end{displaymath}

so that

\begin{displaymath}
f(\rho)=\frac{1}{2}\log\frac{1+\rho}{1-\rho}
\end{displaymath}

This is called Fisher's $z$-transformation.

Bootstrap World



Susan Holmes 2002-04-25