Question 1. Using Python
See hw1_key.py
Question 2. Properties of Random Variables
Find the value of \(c\) that minimizes \[ \begin{eqnarray} E[ (X - c)^2] \nonumber \end{eqnarray} \] Why does this suggest \(E[X]\) is a "good" guess for the value of \(X\)?
Answer: \[ \begin{eqnarray} E[ (X - c)^2] & = & E[X^2] - 2 c E[X] + E[c^2] \nonumber \\ & = & E[X^2] - 2 cE[X] + c^2 \nonumber \\ \frac{\partial E[ (X - c)^2]}{\partial c} & = & - 2 E[X] + 2 c \nonumber \\ 0 & = & - 2 E[X] + 2 c^{*} \nonumber \\ E[X] & = & c^{*} \nonumber \end{eqnarray} \]
Since \(\frac{\partial^{2} E[(X - c)^2]}{\partial c^{2}} = 2 > 0\), the critical value is a minimum. This shows that \(E[X]\) is the value of \(c\) that minimizes the expected squared distance from \(X\), which is why it is a "good" guess for the value of \(X\).
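As a quick numerical check (a minimal sketch, not part of the original key; the simulated exponential distribution and the use of optimize() are choices made only for illustration), the value of \(c\) that minimizes the average squared distance to simulated draws of \(X\) should be very close to the sample mean:
# Illustration only: simulate X and find the c that minimizes mean((X - c)^2)
set.seed(123)
x.draws <- rexp(10000, rate = 2)            # E[X] = 0.5 for this choice
msd <- function(c) mean((x.draws - c)^2)    # average squared distance from c
optimize(msd, interval = c(-10, 10))$minimum
mean(x.draws)                               # nearly identical to the minimizer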
Suppose \(Y\) and \(Z\) are random variables, with joint density \(f(y, z)\).
Answer: the marginal density of \(Y\) is \[ \begin{eqnarray} f_{Y}(y) & = & \int_{-\infty}^{\infty} f(y, z)\, dz \nonumber \end{eqnarray} \]
Answer: the marginal density of \(Z\) is \[ \begin{eqnarray} f_{Z}(z) & = & \int_{-\infty}^{\infty} f(y, z)\, dy \nonumber \end{eqnarray} \]
Answer: if \(Y\) and \(Z\) are independent, then \(f(y, z) = f_{Y}(y) f_{Z}(z)\), so \[ \begin{eqnarray} E[YZ] & = & \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} yz\, f(y, z)\, dy\, dz \nonumber \\ & = & \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} yz\, f_{Y}(y) f_{Z}(z)\, dy\, dz \nonumber \\ & = & \int_{-\infty}^{\infty} y f_{Y}(y)\, dy \int_{-\infty}^{\infty} z f_{Z}(z)\, dz \nonumber \\ & = & E[Y] E[Z] \nonumber \end{eqnarray} \]
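A small simulation (again an illustrative sketch, not part of the original key; the particular distributions are arbitrary) makes the result concrete: for independent draws, the sample mean of \(YZ\) is close to the product of the sample means:
# Illustration only: independent Y and Z, so E[YZ] = E[Y]E[Z]
set.seed(456)
y <- rnorm(100000, mean = 2, sd = 1)        # E[Y] = 2
z <- rgamma(100000, shape = 3, rate = 1)    # E[Z] = 3
mean(y * z)                                 # approximately 6
mean(y) * mean(z)                           # approximately 6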
Question 3. Finding Critical Values for a Function
Suppose we have a function \(f:\Re\rightarrow\Re\), \(f(x) = \sin(x)\).
In R, plot \(\sin(x)\) for \(x \in [-2 \pi, 2 \pi]\) (here \(\pi\) is the mathematical constant).
xx <- seq(-2*pi, 2*pi, len = 1000)
plot(sin(xx) ~ xx, type = 'l', lwd = 3, xlab = 'x', ylab = 'sin(x)', main = "f(x) = sin(x)")
The first derivative is \(f^{'}(x) = \cos(x)\). In R, plot it over \([-2 \pi, 2 \pi]\).
plot(cos(xx) ~ xx, type = 'l', lwd = 3, xlab = 'x', ylab = 'cos(x)', main = "f'(x) = cos(x)")
The second derivative is \(f^{''}(x) = -\sin(x)\). In R, plot it over \([-2 \pi, 2 \pi]\).
plot(I(-sin(xx)) ~ xx, type = 'l', lwd = 3, xlab = 'x', ylab = '-sin(x)', main = "f''(x) = -sin(x)")
We say that \(x^{*}\) is a critical value for a function if \(f^{'}(x^{*}) = 0\). We can find \(x^{*}\) algebraically, or we can use a computational approach. We discussed the Newton-Raphson approach in class on Thursday. We're going to write our own implementation of the algorithm in R and apply it to find the critical values of \(f(x) = \sin(x)\). The update step is \[ \begin{eqnarray} x_{t+1} & = & x_{t} - \frac{f^{'}(x_{t})}{f^{''}(x_{t})} \nonumber \end{eqnarray} \] where \(f^{'}(x_{t})\) is the first derivative evaluated at \(x_{t}\) and \(f^{''}(x_{t})\) is the second derivative evaluated at \(x_{t}\).
Write a function in R that provides the update step for some value \(x_{t}\) if \(f(x) = \sin(x)\).
# Newton-Raphson update for f(x) = sin(x): x - f'(x)/f''(x) = x - cos(x)/(-sin(x)) = x + cos(x)/sin(x)
up_newt <- function(x){
  x.1 <- x + cos(x)/sin(x)
  return(x.1)
}
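As a quick sanity check (an illustrative example, not part of the original key), iterating the update from a starting value of 1 should move toward the nearby critical value \(\pi/2 \approx 1.5708\):
# Illustration only: repeated updates from x = 1 approach pi/2
x <- 1
for(i in 1:5){
  x <- up_newt(x)
}
x   # approximately 1.5708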
Using the while command in R, write a loop that continues updating until the change, \(|x_{t+1} - x_{t}|\), drops below \(10^{-5}\).
while(diff > tol){
  x <- x.1
  x.1 <- up_newt(x)
  diff <- abs(x.1 - x)
}
Putting the pieces together into a single function:
# Run the Newton-Raphson update from starting value x until the change drops below tol
newt_raph <- function(x, tol){
  x.1 <- up_newt(x)
  diff <- abs(x.1 - x)
  while(diff > tol){
    x <- x.1
    x.1 <- up_newt(x)
    diff <- abs(x.1 - x)
  }
  return(x.1)
}
Applying the algorithm at several starting values:
guesses <- c(-2, -1, 1, 2)
for(i in guesses){
  print(newt_raph(i, 10^-5))
}
## [1] -1.571
## [1] -1.571
## [1] 1.571
## [1] 1.571
The algorithm is unstable near \(x = 0\) because \(f^{''}(0) = -\sin(0) = 0\), so the update divides by a value that is nearly zero. As a result, small differences in starting values near zero cause very large differences in the first step of the algorithm.
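To illustrate the sensitivity (an assumed example, not output reproduced from the original key), starting values just above and just below zero take a first step of size roughly \(1/x_{0}\) in opposite directions, so the two runs land on critical values that are far apart:
# Illustration only: near zero, cos(x)/sin(x) is roughly 1/x, so the first
# update jumps to about +1000 or -1000 for these starting values
newt_raph(0.001, 10^-5)
newt_raph(-0.001, 10^-5)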
Question 4. Probit Regression with a Prior
Suppose that we assume the following data generation process \[ \begin{eqnarray} Y_{i} & \sim & \text{Bernoulli}(\pi_{i} ) \nonumber \\ \pi_{i} & = & \Phi(\boldsymbol{X}_{i} \boldsymbol{\beta}) \nonumber \\ \beta_{j} & \sim & \text{Normal}(\mu, \sigma^{2}_{j} )\nonumber \end{eqnarray} \] with \(\boldsymbol{X}_{i} = (1, x_{i})\) for all \(i\) \((i = 1, ..., N)\), \(\boldsymbol{\beta} = (\beta_{1}, \beta_{2})\), and \(\Phi(\cdot)\) is the cumulative normal distribution function.
This is very similar to the model described in class, but now we have added a prior on \(\boldsymbol{\beta}\). This slightly alters the objective function: \[ \begin{eqnarray} p(\boldsymbol{\beta} | \boldsymbol{Y}, \boldsymbol{X}) & \propto & p(\boldsymbol{\beta}|\mu, \sigma^2) \times p(\boldsymbol{Y}| \boldsymbol{X}, \boldsymbol{\beta}) \label{e:post} \\ & \propto & \prod_{j=1}^{2} \frac{1}{\sqrt{2\pi} \sigma} \exp\left( - \frac{(\beta_{j} - \mu )^2}{2 \sigma^2} \right) \times \prod_{i=1}^{N} \Phi(\boldsymbol{X}_{i}\boldsymbol{\beta})^{Y_{i} } \left( 1- \Phi(\boldsymbol{X}_{i}\boldsymbol{\beta}) \right)^{1 - Y_{i} } \nonumber \end{eqnarray} \]
In this problem, we will examine how the prior on \(\boldsymbol{\beta}\), and in particular the values we set for \(\mu\) and \(\sigma^2\), alters our inferences about \(\boldsymbol{\beta}\).
Analytically, write out the \(\log(p(\boldsymbol{\beta} | \boldsymbol{Y}, \boldsymbol{X}))\).
Answer \[
\begin{eqnarray}
\log p(\boldsymbol{\beta}| \boldsymbol{Y}, \boldsymbol{X}) & = & \sum_{j=1}^{2} \left(- \log \sigma - \frac{(\beta_{j} - \mu)^{2}}{2 \sigma^2} \right) + \sum_{i=1}^{N} \left[ Y_{i} \log \Phi(\boldsymbol{X}_{i} \boldsymbol{\beta}) + (1- Y_{i}) \log \left( 1 - \Phi(\boldsymbol{X}_{i} \boldsymbol{\beta}) \right) \right] + c \nonumber
\end{eqnarray}
\]
In R, create a function for the \(\log\) of Equation 1.
log.post <- function(pars, mu, sig.2, Y, X){
  beta <- pars[1:2]
  # log prior: independent Normal(mu, sig.2) densities for each coefficient
  part1 <- dnorm(beta, mean = mu, sd = sqrt(sig.2), log = T)
  # linear predictor and probit probabilities
  part2 <- X %*% beta
  probs <- pnorm(part2)
  # Bernoulli log likelihood for each observation
  combine <- log(probs)*Y + log(1 - probs)*(1 - Y)
  out <- sum(part1) + sum(combine)
  return(out)
}
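Before turning to the assignment's data, the function can be checked on simulated inputs (an illustrative example; X.toy, Y.toy, and the coefficient values are assumptions made only for this sketch):
# Illustration only: evaluate the log posterior at beta = (0, 0) on toy data
set.seed(1)
X.toy <- cbind(1, rnorm(100))                         # design matrix with intercept
Y.toy <- rbinom(100, 1, pnorm(X.toy %*% c(0.5, -1)))  # outcomes from a probit DGP
log.post(c(0, 0), mu = 0, sig.2 = 1000, Y = Y.toy, X = X.toy)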
Use optim to find \(\widehat{\boldsymbol{\beta}}\) with \(\mu = 0\) and \(\sigma^2 = 1000\).
data_set<- read.delim('~/Dropbox/TextAsData/ProblemSets/PS1/DataSet.csv', sep=',')
data_set<- as.matrix(data_set)
Y<- data_set[,1]
X<- data_set[,-1]
max.output<- optim(c(1, 1), log.post, method = 'BFGS', control = list(trace =100, fnscale = -1), hessian = T, X = X, Y = Y , mu = 0, sig.2 = 1000)
## initial value 794.131479
## iter 10 value 194.440790
## iter 10 value 194.440790
## iter 10 value 194.440790
## final value 194.440790
## converged
max.output
## $par
## [1] 0.8724 -1.3899
##
## $value
## [1] -194.4
##
## $counts
## function gradient
## 34 10
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] -181.75 -70.77
## [2,] -70.77 -97.58
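As an aside (this step is not asked for in the problem and is not part of the original key), because hessian = TRUE was requested, one standard follow-up is to invert the negative Hessian at the maximum to obtain an approximate covariance matrix and standard errors:
# Aside, for illustration only: approximate covariance matrix from the Hessian
var.hat <- solve(-max.output$hessian)
se.hat <- sqrt(diag(var.hat))
se.hat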
Using a for loop, store estimates of how \(\beta_{2}\) changes as you vary \(\sigma^2\) from \(10\) to \(0.01\). Plot \(\beta_{2}\) against \(\sigma^2\) and describe what happens as \(\sigma^2\) varies.
sigma.2.seq <- seq(0.01, 10, len = 1000)
store.est <- c()
for(z in 1:length(sigma.2.seq)){
  temp <- optim(c(1, 1), log.post, method = 'BFGS', control = list(trace = 100, fnscale = -1), hessian = T, X = X, Y = Y, mu = 1, sig.2 = sigma.2.seq[z])
  store.est[z] <- temp$par[2]
}
# output suppressed
As \(\sigma^2 \rightarrow 0\), the prior becomes more informative: it pulls the estimate of \(\beta_{2}\) toward the prior mean \(\mu\), so the prior affects the inference more. As \(\sigma^2\) grows, the data dominate and the estimate approaches the value obtained under the diffuse prior.
plot(store.est~sigma.2.seq, type='b', xlab = 'Sigma^2', ylab = 'Beta2')