ML Basic Concepts (CSCI-567)

2022-01-13
🗂️
AI

Prerequisite knowledge for CSCI-567 Machine Learning

Sample Space $(\Omega)$: the set of all possible outcomes or realizations

Event $(A)$: a subset of the sample space

Probability: we assign a real number $P(A)$ to each event $A$, called the probability of $A$

Probability Axioms:

  1. $P(A)\ge 0$ for every $A$
  2. $P(\Omega)=1$
  3. If $A_1, A_2, \ldots$ are disjoint, then $P(\bigcup_{i=1}^{\infty}A_i)=\sum_{i=1}^{\infty}P(A_i)$
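As a quick sanity check, the axioms can be verified mechanically on a toy finite sample space. The fair-die setup below is a hypothetical illustration, not part of the course material:

```python
from fractions import Fraction

# Toy probability space: a fair six-sided die (hypothetical example).
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in omega}

def prob(event):
    """P(A): sum the probabilities of the outcomes in event A."""
    return sum(p[w] for w in event)

even, odd = {2, 4, 6}, {1, 3, 5}

assert all(prob({w}) >= 0 for w in omega)          # Axiom 1: nonnegativity
assert prob(omega) == 1                            # Axiom 2: P(Omega) = 1
assert prob(even | odd) == prob(even) + prob(odd)  # Axiom 3: additivity for disjoint events
```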

Random Variable: formally, a random variable is a measurable function that maps a probability space into a measurable space; informally, it is a variable whose value depends on the outcome of a random event.

The data are specific realizations of random variables

A statistic is any function of the data (i.e., of the random variables).

Distribution Function

Definition: suppose $X$ is a random variable and $x$ is a specific value of it. The cumulative distribution function is the function $F:\mathbb{R}\to[0,1]$, where $F(x)=P(X\leq x)$

If $X$ is discrete $\Rightarrow$ probability mass function: $f(x)=P(X=x)$

If $X$ is continuous $\Rightarrow$ probability density function: $f$ is a density for $X$ if $f(x)\geq 0$ for all $x$, $\int_{-\infty}^{\infty}f(x)\,dx=1$, and for every $a\leq b$,

$$P(a\leq X\leq b)=\int_a^bf(x)\,dx$$

If $F(x)$ is differentiable everywhere, $f(x)=F^\prime(x)$
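The relation $f(x)=F^\prime(x)$ can be checked numerically. Below is a small sketch using the standard normal distribution, whose CDF can be written via `math.erf`; the step size `h` is an arbitrary choice:

```python
import math

def Phi(x):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def numeric_pdf(x, h=1e-5):
    """Central-difference derivative of the CDF, approximating f(x) = F'(x)."""
    return (Phi(x + h) - Phi(x - h)) / (2 * h)

# Exact standard-normal density at 0 is 1/sqrt(2*pi) ~ 0.3989.
assert abs(numeric_pdf(0.0) - 1 / math.sqrt(2 * math.pi)) < 1e-6
```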

Expected Values

  • Discrete random variable $X$: $E[g(X)]=\sum_{x\in \mathcal{X}}g(x)f(x)$
  • Continuous random variable $X$: $E[g(X)]=\int_{-\infty}^\infty g(x)f(x)\,dx$

Mean and Variance

$\mu=E[X]$ is the mean; $\mathrm{var}[X]=E[(X-\mu)^2]$ is the variance. We also have $\mathrm{var}[X]=E[X^2]-\mu^2$
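The identity $\mathrm{var}[X]=E[X^2]-\mu^2$ is easy to confirm on a small discrete pmf (the numbers below are made up for illustration):

```python
# Hypothetical pmf on {0, 1, 2}.
xs = [0, 1, 2]
ps = [0.2, 0.5, 0.3]

mu = sum(x * p for x, p in zip(xs, ps))                 # E[X]
ex2 = sum(x**2 * p for x, p in zip(xs, ps))             # E[X^2]
var_def = sum((x - mu)**2 * p for x, p in zip(xs, ps))  # E[(X - mu)^2]

# The two expressions for the variance agree (up to float rounding).
assert abs(var_def - (ex2 - mu**2)) < 1e-9
```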

Common Distributions

Multivariate Distributions

$$F_{X,Y}(x,y):=P(X\leq x, Y\leq y)$$

$$f_{X,Y}(x,y):=\frac{\partial^2F_{X,Y}(x,y)}{\partial x\partial y}$$

Marginal Distribution of $X$ (discrete case):

$$f_X(x)=P(X=x)=\sum_yP(X=x,Y=y)=\sum_yf_{X,Y}(x,y)$$

or $f_X(x)=\int_yf_{X,Y}(x,y)\,dy$ for a continuous variable

Conditional probability of $X$ given $Y=y$ is

$$f_{X|Y}(x|y)=P(X=x|Y=y)=\frac{P(X=x,Y=y)}{P(Y=y)}=\frac{f_{X,Y}(x,y)}{f_Y(y)}$$

Law of Total Probability: if $X$ takes values $x_1,\ldots,x_n$ and $y$ is a value of $Y$, we have

$$f_Y(y)=\sum_jf_{Y|X}(y|x_j)f_X(x_j)$$

Bayes Rule:

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

$$f_{X|Y}(x_i|y)=\frac{f_{Y|X}(y|x_i)f_X(x_i)}{\sum_jf_{Y|X}(y|x_j)f_X(x_j)}$$

$$f_{X|Y}(x|y)=\frac{f_{Y|X}(y|x)f_X(x)}{\int f_{Y|X}(y|x')f_X(x')\,dx'}$$
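The discrete form of Bayes' rule, with the law of total probability in the denominator, can be sketched on a small joint pmf. All the numbers below are hypothetical:

```python
# Hypothetical joint pmf f_{X,Y}(x, y).
joint = {('x1', 'y1'): 0.10, ('x1', 'y2'): 0.30,
         ('x2', 'y1'): 0.20, ('x2', 'y2'): 0.40}
xs, ys = ('x1', 'x2'), ('y1', 'y2')

def f_X(x):                       # marginal of X
    return sum(joint[(x, y)] for y in ys)

def f_Y(y):                       # marginal of Y
    return sum(joint[(x, y)] for x in xs)

def f_Y_given_X(y, x):            # conditional f_{Y|X}(y|x)
    return joint[(x, y)] / f_X(x)

# Bayes rule: f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / sum_j f_{Y|X}(y|x_j) f_X(x_j)
y = 'y1'
denom = sum(f_Y_given_X(y, xj) * f_X(xj) for xj in xs)  # law of total probability
posterior = {x: f_Y_given_X(y, x) * f_X(x) / denom for x in xs}

# Matches the direct definition f_{X,Y}(x, y) / f_Y(y).
assert all(abs(posterior[x] - joint[(x, y)] / f_Y(y)) < 1e-9 for x in xs)
```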

Independent Variables: $X$ and $Y$ are independent if and only if

$P(X=x, Y=y)=P(X=x)P(Y=y)$ or $f_{X,Y}(x,y)=f_X(x)f_Y(y)$ for all values $x$ and $y$.

IID variables: Independent and identically distributed random variables are drawn from the same distribution and are all mutually independent.

If $X_1,\ldots,X_n$ are independent, we have

$$E\left[\prod_{i=1}^{n}X_i\right]=\prod_{i=1}^{n}E[X_i],\qquad \mathrm{var}\left[\sum_{i=1}^na_iX_i\right]=\sum_{i=1}^na_i^2\,\mathrm{var}[X_i]$$
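A Monte Carlo spot check of the variance identity for independent variables; the dice, coefficients, and sample size below are arbitrary choices:

```python
import random

random.seed(1)
n = 200_000

# Two independent fair dice, with arbitrary coefficients a1 = 2, a2 = 3.
x1 = [random.randint(1, 6) for _ in range(n)]
x2 = [random.randint(1, 6) for _ in range(n)]

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

lhs = var([2 * a + 3 * b for a, b in zip(x1, x2)])  # var[2 X1 + 3 X2]
rhs = 4 * var(x1) + 9 * var(x2)                     # a1^2 var[X1] + a2^2 var[X2]

# Both sides estimate 13 * 35/12 ~ 37.9 and agree up to sampling noise.
assert abs(lhs - rhs) / rhs < 0.02
```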

Linearity of Expectation: even if $X_1,\ldots,X_n$ are not independent,

$$E\left[\sum_{i=1}^nX_i\right]=\sum_{i=1}^nE[X_i]$$

Covariance

$$\mathrm{cov}(X,Y)=E[(X-\mu_x)(Y-\mu_y)]=E[XY]-\mu_x\mu_y$$

Correlation coefficients

$$\mathrm{corr}(X,Y)=\mathrm{cov}(X,Y)/\sigma_x\sigma_y$$

($\sigma$: standard deviation)

Sample Mean:

$$\overline{X}=\frac{1}{N}\sum_{i=1}^NX_i$$

Sample Variance:

$$S^2=\frac{1}{N-1}\sum_{i=1}^N(X_i-\overline X)^2$$

If $X_i$ are iid:

$$E[\overline X]=E[X_i]=\mu,\qquad \mathrm{Var}(\overline X)=\sigma^2/N,\qquad E[S^2]=\sigma^2$$
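A simulation sketch of the unbiasedness claims, using iid uniform(0, 1) samples (true mean $1/2$, true variance $1/12$); the sample size and trial count are arbitrary:

```python
import random

random.seed(2)
N, trials = 5, 50_000

mean_of_means, mean_of_s2 = 0.0, 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(N)]
    xbar = sum(xs) / N
    s2 = sum((x - xbar) ** 2 for x in xs) / (N - 1)  # the N-1 makes S^2 unbiased
    mean_of_means += xbar / trials
    mean_of_s2 += s2 / trials

assert abs(mean_of_means - 0.5) < 0.01   # E[Xbar] = mu = 1/2
assert abs(mean_of_s2 - 1 / 12) < 0.01   # E[S^2] = sigma^2 = 1/12
```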

Point Estimation

The point estimator $\hat{\theta}_N$ is a function of samples $X_1,\ldots,X_N$ that approximates a parameter $\theta$ of the distribution of $X_i$.

Sample Bias: the bias of an estimator is

$$\mathrm{bias}(\hat{\theta}_N)=E_\theta[\hat\theta_N]-\theta$$

An estimator is unbiased if $E_\theta[\hat\theta_N]=\theta$

Standard error: the standard deviation of $\hat\theta_N$ is called the standard error

$$se(\hat\theta_N)=\sqrt{\mathrm{Var}(\hat\theta_N)}$$

Information Theory

Suppose $X$ can take one of $m$ values: $x_1, x_2,\ldots,x_m$. The probability distribution is $P(X=x_i)=p_i$

Entropy: $H(X)=-\sum_{i=1}^{m}p_i\log p_i$

  • "High entropy" means the distribution of $X$ is close to uniform

  • "Low entropy" means the distribution of $X$ is concentrated on a few values (peaked)

Conditional Entropy

$$H(Y|X)=\sum_{i=1}^{m}p_iH(Y|X=x_i)=-\sum_{i=1}^{m}\sum_{j=1}^{n}p(x_i,y_j)\log p(y_j|x_i)$$

Mutual Information: the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons (bits), nats, or hartleys) obtained about one random variable by observing the other random variable.

$$I(Y;X)=H(Y)-H(Y|X)$$

$$I(Y;X)=I(X;Y)=H(X)+H(Y)-H(X,Y)$$
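Entropy and mutual information can be computed directly for a small joint pmf; the probabilities below are hypothetical and entropies are in bits (`log2`):

```python
import math

# Hypothetical joint pmf p(x, y); X and Y are dependent here.
joint = {('a', 0): 0.25, ('a', 1): 0.25,
         ('b', 0): 0.40, ('b', 1): 0.10}
px = {'a': 0.50, 'b': 0.50}   # marginal of X
py = {0: 0.65, 1: 0.35}       # marginal of Y

def H(dist):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

mi = H(px) + H(py) - H(joint)  # I(X;Y) = H(X) + H(Y) - H(X,Y)

assert mi > 0                         # dependence => positive MI
assert mi < min(H(px), H(py)) + 1e-9  # MI never exceeds either marginal entropy
```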

In mathematical statistics, the Kullback–Leibler divergence $KL(P\parallel Q)$ (also called relative entropy) is a statistical distance: a measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$.

$$KL(p\parallel q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$$

$$I(X;Y)=KL(p(x,y)\parallel p(x)p(y))$$
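The identity $I(X;Y)=KL(p(x,y)\parallel p(x)p(y))$ can be checked on the same kind of toy pmf (hypothetical numbers, entropies in bits):

```python
import math

# Hypothetical joint pmf and its marginals.
joint = {('a', 0): 0.25, ('a', 1): 0.25,
         ('b', 0): 0.40, ('b', 1): 0.10}
px = {'a': 0.50, 'b': 0.50}
py = {0: 0.65, 1: 0.35}

def kl(p, q):
    """KL(p || q) in bits, for pmfs over the same support."""
    return sum(pi * math.log2(pi / q[k]) for k, pi in p.items() if pi > 0)

def H(d):
    return -sum(p * math.log2(p) for p in d.values() if p > 0)

# MI as the KL divergence between the joint and the product of marginals.
product = {(x, y): px[x] * py[y] for x in px for y in py}
mi_as_kl = kl(joint, product)

# Agrees with I(X;Y) = H(X) + H(Y) - H(X,Y).
assert abs(mi_as_kl - (H(px) + H(py) - H(joint))) < 1e-9
```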

Convex Functions:

$f$ is convex if for any two points $x_1$ and $x_2$ in its domain $X$ and any $t\in[0,1]$,

$$f(tx_1+(1-t)x_2)\le tf(x_1)+(1-t)f(x_2)$$
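The defining inequality can be spot-checked numerically by sampling random $x_1$, $x_2$, and $t$; the interval and trial count below are arbitrary choices:

```python
import random

random.seed(3)

def looks_convex(f, lo, hi, trials=10_000):
    """Spot-check f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2) on random samples."""
    for _ in range(trials):
        x1, x2 = random.uniform(lo, hi), random.uniform(lo, hi)
        t = random.random()
        if f(t * x1 + (1 - t) * x2) > t * f(x1) + (1 - t) * f(x2) + 1e-9:
            return False   # found a violating triple (x1, x2, t)
    return True

assert looks_convex(lambda x: x * x, -5, 5)        # x^2 is convex
assert not looks_convex(lambda x: -x * x, -5, 5)   # -x^2 is concave, not convex
```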

Convex Set: a set $S$ is convex if and only if for any $x_1, x_2\in S$ and any $t\in[0,1]$, $tx_1+(1-t)x_2\in S$

Convex Optimization is the minimization of a convex function (equivalently, the maximization of a concave function) over a convex set.