CS 229 - Machine Learning

Probabilities and Statistics refresher

By Afshine Amidi and Shervine Amidi

Introduction to Probability and Combinatorics

Sample space The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by $S$.


Event Any subset $E$ of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in $E$, then we say that $E$ has occurred.


Axioms of probability For each event $E$, we denote $P(E)$ as the probability of event $E$ occurring.

Axiom 1 ― Every probability is between 0 and 1 inclusive, i.e.:

$$\boxed{0\leqslant P(E)\leqslant 1}$$

Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e.:

$$\boxed{P(S)=1}$$

Axiom 3 ― For any sequence of mutually exclusive events $E_1, ..., E_n$, we have:

$$\boxed{P\left(\bigcup_{i=1}^nE_i\right)=\sum_{i=1}^nP(E_i)}$$

Permutation A permutation is an arrangement of $r$ objects from a pool of $n$ objects, in a given order. The number of such arrangements is given by $P(n, r)$, defined as:

$$\boxed{P(n, r)=\frac{n!}{(n-r)!}}$$

Combination A combination is an arrangement of $r$ objects from a pool of $n$ objects, where the order does not matter. The number of such arrangements is given by $C(n, r)$, defined as:

$$\boxed{C(n, r)=\frac{P(n, r)}{r!}=\frac{n!}{r!(n-r)!}}$$

Remark: we note that for $0\leqslant r\leqslant n$, we have $P(n,r)\geqslant C(n,r)$.
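
As a quick numerical sanity check, both counts can be evaluated directly; this is a minimal sketch assuming Python 3.8+ (for `math.perm` and `math.comb`):

```python
from math import factorial, perm, comb

n, r = 5, 3

# P(n, r) = n! / (n - r)!  -- number of ordered arrangements
p_formula = factorial(n) // factorial(n - r)

# C(n, r) = P(n, r) / r!   -- number of unordered selections
c_formula = p_formula // factorial(r)

print(p_formula, perm(n, r))     # 60 60
print(c_formula, comb(n, r))     # 10 10
assert perm(n, r) >= comb(n, r)  # P(n, r) >= C(n, r) for 0 <= r <= n
```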


Conditional Probability

Bayes' rule For events $A$ and $B$ such that $P(B)>0$, we have:

$$\boxed{P(A|B)=\frac{P(B|A)P(A)}{P(B)}}$$

Remark: we have $P(A\cap B)=P(A)P(B|A)=P(A|B)P(B)$.
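
For instance, Bayes' rule can be applied to a toy diagnostic-test scenario; the numbers below are purely illustrative and not taken from the cheatsheet:

```python
# Hypothetical numbers: P(A) is the prevalence of a condition,
# P(B|A) the test sensitivity, P(B|not A) the false-positive rate.
p_A = 0.01           # P(A)
p_B_given_A = 0.95   # P(B|A)
p_B_given_notA = 0.05

# P(B) via the law of total probability over {A, not A}
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 4))  # ~0.161
```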


Partition Let $\{A_i, i\in[\![1,n]\!]\}$ be such that for all $i$, $A_i\neq\varnothing$. We say that $\{A_i\}$ is a partition if we have:

$$\boxed{\forall i\neq j, A_i\cap A_j=\emptyset\quad\textrm{ and }\quad\bigcup_{i=1}^nA_i=S}$$

Remark: for any event $B$ in the sample space, we have $\displaystyle P(B)=\sum_{i=1}^nP(B|A_i)P(A_i)$.


Extended form of Bayes' rule Let $\{A_i, i\in[\![1,n]\!]\}$ be a partition of the sample space. We have:

$$\boxed{P(A_k|B)=\frac{P(B|A_k)P(A_k)}{\displaystyle\sum_{i=1}^nP(B|A_i)P(A_i)}}$$
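
A minimal sketch of the extended form, using a hypothetical three-event partition (the probabilities are made up for illustration):

```python
# Hypothetical partition A_1, A_2, A_3 of the sample space (priors sum to 1)
prior = [0.5, 0.3, 0.2]        # P(A_i)
likelihood = [0.1, 0.4, 0.7]   # P(B|A_i)

# Denominator: law of total probability, P(B) = sum_i P(B|A_i) P(A_i)
p_B = sum(l * p for l, p in zip(likelihood, prior))

# Extended Bayes' rule: P(A_k|B) for every k
posterior = [l * p / p_B for l, p in zip(likelihood, prior)]
print([round(x, 3) for x in posterior])
assert abs(sum(posterior) - 1) < 1e-12  # posteriors over a partition sum to 1
```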

Independence Two events $A$ and $B$ are independent if and only if we have:

$$\boxed{P(A\cap B)=P(A)P(B)}$$

Random Variables

Definitions

Random variable A random variable, often noted $X$, is a function that maps every element in a sample space to the real line.


Cumulative distribution function (CDF) The cumulative distribution function $F$, which is monotonically non-decreasing and is such that $\underset{x\rightarrow-\infty}{\textrm{lim}}F(x)=0$ and $\underset{x\rightarrow+\infty}{\textrm{lim}}F(x)=1$, is defined as:

$$\boxed{F(x)=P(X\leqslant x)}$$

Remark: we have $P(a < X\leqslant b)=F(b)-F(a)$.


Probability density function (PDF) The probability density function $f$ characterizes how likely $X$ is to take on a given value: in the discrete case $f(x_j)=P(X=x_j)$, while in the continuous case $f(x)dx$ is the probability that $X$ falls in an infinitesimal interval around $x$.


Relationships involving the PDF and CDF Here are the important properties to know in the discrete (D) and the continuous (C) cases.

| Case | CDF $F$ | PDF $f$ | Properties of PDF |
|------|---------|---------|-------------------|
| (D) | $\displaystyle F(x)=\sum_{x_i\leqslant x}P(X=x_i)$ | $f(x_j)=P(X=x_j)$ | $\displaystyle 0\leqslant f(x_j)\leqslant1\textrm{ and }\sum_{j}f(x_j)=1$ |
| (C) | $\displaystyle F(x)=\int_{-\infty}^xf(y)dy$ | $\displaystyle f(x)=\frac{dF}{dx}$ | $\displaystyle f(x)\geqslant0\textrm{ and }\int_{-\infty}^{+\infty}f(x)dx=1$ |
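
These relationships can be checked numerically; the sketch below (assuming NumPy and SciPy are available) integrates the exponential density $f(x)=\lambda e^{-\lambda x}$ and compares the result with its closed-form CDF $F(x)=1-e^{-\lambda x}$:

```python
import numpy as np
from scipy.integrate import quad

lam = 2.0
f = lambda x: lam * np.exp(-lam * x)   # PDF of Exp(lambda), supported on x >= 0

# F(x) = integral of f up to x, compared with the closed form 1 - exp(-lam x)
for x in [0.1, 0.5, 1.0, 3.0]:
    F_numeric, _ = quad(f, 0, x)
    F_closed = 1 - np.exp(-lam * x)
    assert abs(F_numeric - F_closed) < 1e-8

# Total probability integrates to 1
total, _ = quad(f, 0, np.inf)
print(round(total, 6))  # 1.0
```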

Expectation and Moments of the Distribution Here are the expressions of the expected value $E[X]$, generalized expected value $E[g(X)]$, $k^{th}$ moment $E[X^k]$ and characteristic function $\psi(\omega)$ for the discrete and continuous cases:

| Case | $E[X]$ | $E[g(X)]$ | $E[X^k]$ | $\psi(\omega)$ |
|------|--------|-----------|----------|----------------|
| (D) | $\displaystyle \sum_{i=1}^nx_if(x_i)$ | $\displaystyle \sum_{i=1}^ng(x_i)f(x_i)$ | $\displaystyle \sum_{i=1}^nx_i^kf(x_i)$ | $\displaystyle\sum_{i=1}^nf(x_i)e^{i\omega x_i}$ |
| (C) | $\displaystyle \int_{-\infty}^{+\infty}xf(x)dx$ | $\displaystyle \int_{-\infty}^{+\infty}g(x)f(x)dx$ | $\displaystyle \int_{-\infty}^{+\infty}x^kf(x)dx$ | $\displaystyle\int_{-\infty}^{+\infty}f(x)e^{i\omega x}dx$ |
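
A small sketch (NumPy assumed) evaluating these sums for a hypothetical discrete distribution:

```python
import numpy as np

# Hypothetical discrete distribution: values x_i with probabilities f(x_i)
x = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([0.1, 0.2, 0.3, 0.4])        # sums to 1

E_X = np.sum(x * f)                        # E[X]
E_gX = np.sum(np.exp(x) * f)               # E[g(X)] with g(x) = exp(x)
E_X2 = np.sum(x**2 * f)                    # second moment E[X^2]

omega = 0.5
psi = np.sum(f * np.exp(1j * omega * x))   # characteristic function psi(omega)

print(E_X, E_X2, round(abs(psi), 4))       # 2.0 5.0 ...
```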

Variance The variance of a random variable, often noted $\textrm{Var}(X)$ or $\sigma^2$, is a measure of the spread of its distribution function. It is determined as follows:

$$\boxed{\textrm{Var}(X)=E[(X-E[X])^2]=E[X^2]-E[X]^2}$$

Standard deviation The standard deviation of a random variable, often noted $\sigma$, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:

$$\boxed{\sigma=\sqrt{\textrm{Var}(X)}}$$
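
Continuing the hypothetical discrete example above, a quick check (NumPy assumed) that the two expressions for the variance agree:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([0.1, 0.2, 0.3, 0.4])

E_X = np.sum(x * f)
var_definition = np.sum((x - E_X) ** 2 * f)   # E[(X - E[X])^2]
var_shortcut = np.sum(x**2 * f) - E_X**2      # E[X^2] - E[X]^2
sigma = np.sqrt(var_definition)               # standard deviation

assert abs(var_definition - var_shortcut) < 1e-12
print(var_definition, sigma)  # 1.0 1.0
```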

Transformation of random variables Let the variables $X$ and $Y$ be linked by some invertible function $g$, i.e. $Y=g(X)$. By noting $f_X$ and $f_Y$ the probability density functions of $X$ and $Y$ respectively, we have:

$$\boxed{f_Y(y)=f_X(x)\left|\frac{dx}{dy}\right|}$$
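
A Monte Carlo sanity check of the change-of-variable formula (NumPy assumed): with $X\sim\mathcal{U}(0,1)$ and $Y=X^2$, the formula gives $f_Y(y)=\frac{1}{2\sqrt{y}}$ on $(0,1)$, which can be compared against a histogram of simulated values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = x**2                              # Y = g(X) with g(x) = x^2, invertible on (0, 1)

# f_Y(y) = f_X(x) |dx/dy| = 1 * 1 / (2 sqrt(y))
f_Y = lambda y: 1.0 / (2.0 * np.sqrt(y))

# Compare the analytic density with an empirical histogram density
hist, edges = np.histogram(y, bins=50, range=(0.01, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_Y(centers))))  # small: sampling noise only
```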

Leibniz integral rule Let $g$ be a function of $x$ and potentially $c$, and $a, b$ boundaries that may depend on $c$. We have:

$$\boxed{\frac{\partial}{\partial c}\left(\int_a^bg(x)dx\right)=\frac{\partial b}{\partial c}\cdot g(b)-\frac{\partial a}{\partial c}\cdot g(a)+\int_a^b\frac{\partial g}{\partial c}(x)dx}$$
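
A numerical illustration of the rule (SciPy assumed) for a hypothetical choice $g(x)=e^{cx}$, $a(c)=c$, $b(c)=c^2$, comparing a finite-difference derivative of the integral with the right-hand side:

```python
import numpy as np
from scipy.integrate import quad

g = lambda x, c: np.exp(c * x)
a = lambda c: c        # lower bound depends on c
b = lambda c: c**2     # upper bound depends on c

def I(c):
    val, _ = quad(lambda x: g(x, c), a(c), b(c))
    return val

c, h = 2.0, 1e-5
lhs = (I(c + h) - I(c - h)) / (2 * h)   # d/dc of the integral, by finite differences

da_dc, db_dc = 1.0, 2 * c
dg_dc, _ = quad(lambda x: x * np.exp(c * x), a(c), b(c))   # integral of dg/dc
rhs = db_dc * g(b(c), c) - da_dc * g(a(c), c) + dg_dc

print(abs(lhs - rhs) / abs(rhs))  # close to zero (numerical error only)
```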

Probability Distributions

Chebyshev's inequality Let $X$ be a random variable with expected value $\mu$ and standard deviation $\sigma$. For $k>0$, we have the following inequality:

$$\boxed{P(|X-\mu|\geqslant k\sigma)\leqslant\frac{1}{k^2}}$$
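
An empirical illustration (NumPy assumed), checking the bound for an exponential random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.5
x = rng.exponential(scale=1/lam, size=1_000_000)

mu, sigma = 1/lam, 1/lam   # mean and standard deviation of Exp(lambda)
for k in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(x - mu) >= k * sigma)        # P(|X - mu| >= k sigma)
    print(k, round(empirical, 4), "<=", round(1/k**2, 4))   # Chebyshev bound 1/k^2
```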

Main distributions Here are the main distributions to have in mind:

| Type | Distribution | PDF | $\psi(\omega)$ | $E[X]$ | $\textrm{Var}(X)$ |
|------|--------------|-----|----------------|--------|-------------------|
| (D) | $X\sim\mathcal{B}(n, p)$ | $\displaystyle\binom{n}{x} p^xq^{n-x}$ | $(pe^{i\omega}+q)^n$ | $np$ | $npq$ |
| (D) | $X\sim\textrm{Po}(\mu)$ | $\displaystyle \frac{\mu^x}{x!}e^{-\mu}$ | $e^{\mu(e^{i\omega}-1)}$ | $\mu$ | $\mu$ |
| (C) | $X\sim\mathcal{U}(a, b)$ | $\displaystyle \frac{1}{b-a}$ | $\displaystyle\frac{e^{i\omega b}-e^{i\omega a}}{(b-a)i\omega}$ | $\displaystyle\frac{a+b}{2}$ | $\displaystyle\frac{(b-a)^2}{12}$ |
| (C) | $X\sim\mathcal{N}(\mu, \sigma)$ | $\displaystyle \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ | $e^{i\omega\mu-\frac{1}{2}\omega^2\sigma^2}$ | $\mu$ | $\sigma^2$ |
| (C) | $X\sim\textrm{Exp}(\lambda)$ | $\lambda e^{-\lambda x}$ | $\displaystyle\frac{1}{1-\frac{i\omega}{\lambda}}$ | $\displaystyle\frac{1}{\lambda}$ | $\displaystyle\frac{1}{\lambda^2}$ |
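
A quick simulation-based check of two rows of the table (NumPy assumed), comparing empirical means and variances with the closed-form expressions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Binomial(n, p): mean np, variance npq with q = 1 - p
n, p = 10, 0.3
b = rng.binomial(n, p, size=500_000)
print(b.mean(), n * p, "|", b.var(), n * p * (1 - p))

# Exponential(lambda): mean 1/lambda, variance 1/lambda^2
lam = 2.0
e = rng.exponential(scale=1/lam, size=500_000)
print(e.mean(), 1/lam, "|", e.var(), 1/lam**2)
```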

Jointly Distributed Random Variables

Marginal density and cumulative distribution From the joint probability density function $f_{XY}$, we have:

| Case | Marginal density | Cumulative function |
|------|------------------|---------------------|
| (D) | $\displaystyle f_X(x_i)=\sum_{j}f_{XY}(x_i,y_j)$ | $\displaystyle F_{XY}(x,y)=\sum_{x_i\leqslant x}\sum_{y_j\leqslant y}f_{XY}(x_i,y_j)$ |
| (C) | $\displaystyle f_X(x)=\int_{-\infty}^{+\infty}f_{XY}(x,y)dy$ | $\displaystyle F_{XY}(x,y)=\int_{-\infty}^x\int_{-\infty}^yf_{XY}(x',y')dx'dy'$ |
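
A minimal sketch (NumPy assumed) computing the discrete marginals from a hypothetical joint pmf stored as a matrix whose entry $(i,j)$ is $f_{XY}(x_i,y_j)$:

```python
import numpy as np

# Hypothetical joint pmf of (X, Y): rows index x_i, columns index y_j
f_XY = np.array([[0.10, 0.20],
                 [0.05, 0.25],
                 [0.15, 0.25]])
assert abs(f_XY.sum() - 1.0) < 1e-12   # total probability is 1

f_X = f_XY.sum(axis=1)   # marginal of X: sum over the y_j
f_Y = f_XY.sum(axis=0)   # marginal of Y: sum over the x_i

# Joint CDF at (x_1, y_0): sum of f_XY over x_i <= x_1 and y_j <= y_0
F_x1_y0 = f_XY[:2, :1].sum()
print(f_X, f_Y, F_x1_y0)
```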

Conditional density The conditional density of $X$ with respect to $Y$, often noted $f_{X|Y}$, is defined as follows:

$$\boxed{f_{X|Y}(x)=\frac{f_{XY}(x,y)}{f_Y(y)}}$$

Independence Two random variables $X$ and $Y$ are said to be independent if we have:

$$\boxed{f_{XY}(x,y)=f_X(x)f_Y(y)}$$

Covariance We define the covariance of two random variables $X$ and $Y$, that we note $\sigma_{XY}^2$ or more commonly $\textrm{Cov}(X,Y)$, as follows:

$$\boxed{\textrm{Cov}(X,Y)\triangleq\sigma_{XY}^2=E[(X-\mu_X)(Y-\mu_Y)]=E[XY]-\mu_X\mu_Y}$$

Correlation By noting $\sigma_X, \sigma_Y$ the standard deviations of $X$ and $Y$, we define the correlation between the random variables $X$ and $Y$, noted $\rho_{XY}$, as follows:

$$\boxed{\rho_{XY}=\frac{\sigma_{XY}^2}{\sigma_X\sigma_Y}}$$

Remark 1: we note that for any random variables $X, Y$, we have $\rho_{XY}\in[-1,1]$.

Remark 2: if $X$ and $Y$ are independent, then $\rho_{XY}=0$. The converse does not hold in general.
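
Both quantities can be computed directly from samples (NumPy assumed); here with a hypothetical pair of linearly related variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)   # hypothetical Y correlated with X

cov = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X - mu_X)(Y - mu_Y)]
rho = cov / (x.std() * y.std())                  # correlation, lies in [-1, 1]

print(round(cov, 3), round(rho, 3))              # ~2.0 and ~0.894
print(np.cov(x, y, bias=True)[0, 1], np.corrcoef(x, y)[0, 1])  # library versions
```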


Parameter estimation

Definitions

Random sample A random sample is a collection of $n$ random variables $X_1, ..., X_n$ that are independent and identically distributed with $X$.


Estimator An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.


Bias The bias of an estimator $\hat{\theta}$ is defined as being the difference between the expected value of the distribution of $\hat{\theta}$ and the true value, i.e.:

$$\boxed{\textrm{Bias}(\hat{\theta})=E[\hat{\theta}]-\theta}$$

Remark: an estimator is said to be unbiased when we have $E[\hat{\theta}]=\theta$.


Estimating the mean

Sample mean The sample mean of a random sample is used to estimate the true mean $\mu$ of a distribution, is often noted $\overline{X}$ and is defined as follows:

$$\boxed{\overline{X}=\frac{1}{n}\sum_{i=1}^nX_i}$$

Remark: the sample mean is unbiased, i.e. $E[\overline{X}]=\mu$.


Central Limit Theorem Let us have a random sample $X_1, ..., X_n$ following a given distribution with mean $\mu$ and variance $\sigma^2$, then we have:

$$\boxed{\overline{X}\underset{n\rightarrow+\infty}{\sim}\mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)}$$
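
The theorem can be illustrated by simulation (NumPy assumed): averaging many i.i.d. draws from a markedly non-Gaussian distribution and checking that the sample mean concentrates around $\mu$ with spread close to $\sigma/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, trials = 1.0, 200, 20_000

# Each row is one random sample X_1, ..., X_n drawn from Exp(lambda)
samples = rng.exponential(scale=1/lam, size=(trials, n))
means = samples.mean(axis=1)             # one sample mean per trial

mu, sigma = 1/lam, 1/lam
print(means.mean(), mu)                  # close to the true mean
print(means.std(), sigma / np.sqrt(n))   # close to sigma / sqrt(n)
```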

Estimating the variance

Sample variance The sample variance of a random sample is used to estimate the true variance $\sigma^2$ of a distribution, is often noted $s^2$ or $\hat{\sigma}^2$ and is defined as follows:

$$\boxed{s^2=\hat{\sigma}^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X})^2}$$

Remark: the sample variance is unbiased, i.e. $E[s^2]=\sigma^2$.
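
A simulation sketch (NumPy assumed) contrasting the unbiased $\frac{1}{n-1}$ estimator with the biased $\frac{1}{n}$ version:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 200_000

x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
xbar = x.mean(axis=1, keepdims=True)

s2_unbiased = ((x - xbar) ** 2).sum(axis=1) / (n - 1)   # divides by n - 1
s2_biased = ((x - xbar) ** 2).sum(axis=1) / n           # divides by n

print(s2_unbiased.mean(), sigma2)             # ~4.0: E[s^2] = sigma^2
print(s2_biased.mean(), (n - 1)/n * sigma2)   # ~3.2: biased low
```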


Chi-Squared relation with sample variance Let $s^2$ be the sample variance of a random sample drawn from a Gaussian distribution with true variance $\sigma^2$. We have:

$$\boxed{\frac{s^2(n-1)}{\sigma^2}\sim\chi_{n-1}^2}$$
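
A quick simulation check (NumPy assumed) that $(n-1)s^2/\sigma^2$ computed from Gaussian samples matches the first two moments of a $\chi^2_{n-1}$ distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 2.0, 8, 200_000

# Gaussian samples, as required for the chi-squared relation
x = rng.normal(scale=np.sqrt(sigma2), size=(trials, n))
s2 = x.var(axis=1, ddof=1)      # sample variance with the 1/(n-1) convention
q = (n - 1) * s2 / sigma2       # should follow chi^2 with n-1 degrees of freedom

print(q.mean(), n - 1)          # chi^2_{n-1} has mean n-1
print(q.var(), 2 * (n - 1))     # ... and variance 2(n-1)
```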