Prerequisite knowledge for CSCI-567 Machine Learning
**Sample Space** $(\Omega)$: the set of all possible outcomes or realizations.
**Event** $(A)$: a subset of the sample space.
**Probability**: we assign a real number $P(A)$ to each event $A$, called the probability of $A$.
**Probability Axioms**:
- $P(A)\ge 0$ for every $A$
- $P(\Omega)=1$
- If $A_1, A_2, \ldots$ are disjoint, then $P(\bigcup_{i=1}^{\infty}A_i)=\sum_{i=1}^{\infty}P(A_i)$
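As a quick sanity check, here is a minimal sketch of the axioms on a finite sample space (the fair six-sided die is just an illustrative choice):

```python
# Probability axioms illustrated on a fair six-sided die.
P = {outcome: 1/6 for outcome in range(1, 7)}  # P assigns 1/6 to each outcome

assert all(p >= 0 for p in P.values())   # Axiom 1: P(A) >= 0
assert abs(sum(P.values()) - 1) < 1e-12  # Axiom 2: P(Omega) = 1

# Axiom 3 (finite additivity): for disjoint A and B, P(A u B) = P(A) + P(B)
A, B = {1, 2}, {5}
p_union = sum(P[x] for x in A | B)
assert abs(p_union - (sum(P[x] for x in A) + sum(P[x] for x in B))) < 1e-12
```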
**Random Variable**: a measurable function that maps a probability space into a measurable space.
Informally, a random variable is a variable whose values depend on the outcomes of a random event.
The data are specific realizations of random variables.
A **statistic** is any function of the data (or of the random variables).
**Distribution Function**
Definition: suppose $X$ is a random variable and $x$ is a specific value of it. The cumulative distribution function is the function $F:\mathbb{R}\to[0,1]$ where $F(x)=P(X\leq x)$.
If $X$ is discrete $\Rightarrow$ probability mass function: $f(x)=P(X=x)$.
If $X$ is continuous $\Rightarrow$ $f$ is a probability density function for $X$ if $f(x)\geq 0$ for all $x$, $\int_{-\infty}^{\infty}f(x)\,dx=1$, and for every $a\leq b$,
$$P(a\leq X\leq b)=\int_a^b f(x)\,dx$$
If $F(x)$ is differentiable everywhere, $f(x)=F'(x)$.
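To see the CDF/pdf relationship concretely, here is a small sketch assuming a standard normal $X$ (using `scipy.stats.norm`); a numerical derivative of $F$ recovers $f$, and $F(b)-F(a)$ matches the integral of $f$:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

x = np.linspace(-3, 3, 601)
h = 1e-5
# f(x) = F'(x): a central difference of the CDF recovers the pdf
f_numeric = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
assert np.allclose(f_numeric, norm.pdf(x), atol=1e-6)

# P(a <= X <= b) = F(b) - F(a) = integral of f from a to b
a, b = -1.0, 2.0
prob_integral, _ = quad(norm.pdf, a, b)
assert abs((norm.cdf(b) - norm.cdf(a)) - prob_integral) < 1e-9
```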
**Expected Values**
For a discrete random variable $X$: $E[g(X)]=\sum_{x\in \mathcal{X}}g(x)f(x)$
For a continuous random variable $X$: $E[g(X)]=\int_{-\infty}^\infty g(x)f(x)\,dx$
**Mean and Variance**
$\mu=E[X]$ is the mean; $\mathrm{var}[X]=E[(X-\mu)^2]$ is the variance. We also have $\mathrm{var}[X]=E[X^2]-\mu^2$.
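A quick numerical check of $\mathrm{var}[X]=E[X^2]-\mu^2$ on simulated draws (the exponential distribution is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # true mean 2, true variance 4

mu = x.mean()
var_def = ((x - mu) ** 2).mean()   # E[(X - mu)^2]
var_alt = (x ** 2).mean() - mu**2  # E[X^2] - mu^2
assert abs(var_def - var_alt) < 1e-9  # identical up to floating-point error
print(var_def)  # ~4.0
```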
**Common Distributions**
**Multivariate Distributions**
Joint CDF: $F_{X,Y}(x,y):=P(X\leq x, Y\leq y)$
Joint density: $f_{X,Y}(x,y):=\frac{\partial^2 F_{X,Y}(x,y)}{\partial x\,\partial y}$
Marginal distribution of $X$ (discrete case):
$$f_X(x)=P(X=x)=\sum_y P(X=x,Y=y)=\sum_y f_{X,Y}(x,y)$$
or $f_X(x)=\int_y f_{X,Y}(x,y)\,dy$ for a continuous variable.
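A minimal discrete example (the 2×3 joint table is made up) showing that marginalization is just a row/column sum:

```python
import numpy as np

# Hypothetical joint pmf f_{X,Y}: rows index x in {0,1}, columns index y in {0,1,2}
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
assert abs(joint.sum() - 1) < 1e-12  # a valid joint distribution

f_X = joint.sum(axis=1)  # marginal of X: sum over y
f_Y = joint.sum(axis=0)  # marginal of Y: sum over x
print(f_X)  # [0.4 0.6]
print(f_Y)  # [0.35 0.35 0.3 ]
```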
**Conditional probability** of $X$ given $Y=y$:
$$f_{X|Y}(x|y)=P(X=x|Y=y)=\frac{P(X=x,Y=y)}{P(Y=y)}=\frac{f_{X,Y}(x,y)}{f_Y(y)}$$
**Law of total probability**: if $X$ takes values $x_1,\ldots,x_n$ and $y$ is a value of $Y$, we have
$$f_Y(y)=\sum_j f_{Y|X}(y|x_j)f_X(x_j)$$
**Bayes Rule**:
$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$
Discrete case: $f_{X|Y}(x_i|y)=\frac{f_{Y|X}(y|x_i)f_X(x_i)}{\sum_j f_{Y|X}(y|x_j)f_X(x_j)}$
Continuous case: $f_{X|Y}(x|y)=\frac{f_{Y|X}(y|x)f_X(x)}{\int f_{Y|X}(y|x')f_X(x')\,dx'}$
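A small worked example applying the law of total probability and Bayes rule (the disease/test numbers are invented for illustration):

```python
# Hypothetical binary setup: X = disease status, Y = test result.
f_X = {"sick": 0.01, "healthy": 0.99}            # prior f_X(x)
f_pos_given_X = {"sick": 0.95, "healthy": 0.05}  # f_{Y|X}(positive | x)

# Law of total probability: f_Y(pos) = sum_j f_{Y|X}(pos|x_j) f_X(x_j)
f_pos = sum(f_pos_given_X[x] * f_X[x] for x in f_X)

# Bayes rule: f_{X|Y}(sick | pos)
f_sick_given_pos = f_pos_given_X["sick"] * f_X["sick"] / f_pos
print(f_pos)             # 0.059
print(f_sick_given_pos)  # ~0.161: with a rare disease, a positive test is far from conclusive
```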
**Independent Variables**: $X$ and $Y$ are independent if and only if
$P(X=x, Y=y)=P(X=x)P(Y=y)$, or $f_{X,Y}(x,y)=f_X(x)f_Y(y)$, for all values $x$ and $y$.
**IID variables**: independent and identically distributed random variables are drawn from the same distribution and are all mutually independent.
If $X_1,\ldots,X_n$ are independent, we have
$$E\left[\prod_{i=1}^{n}X_i\right]=\prod_{i=1}^{n}E[X_i],\qquad \mathrm{var}\left[\sum_{i=1}^n a_iX_i\right]=\sum_{i=1}^n a_i^2\,\mathrm{var}[X_i]$$
**Linearity of Expectation**: even if $X_1,\ldots,X_n$ are not independent,
$$E\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n E[X_i]$$
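These identities are easy to check by simulation; the sketch below uses three arbitrary independent distributions with known moments:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000
x1 = rng.normal(1.0, 1.0, n)   # E = 1,   var = 1
x2 = rng.uniform(0.0, 2.0, n)  # E = 1,   var = 1/3
x3 = rng.exponential(0.5, n)   # E = 0.5, var = 0.25

# Independence: E[X1 X2 X3] = E[X1] E[X2] E[X3] = 0.5
print((x1 * x2 * x3).mean())    # ~0.5

# var[2 X1 + 3 X2] = 4 var[X1] + 9 var[X2] = 4 + 3 = 7
print((2 * x1 + 3 * x2).var())  # ~7.0

# Linearity needs no independence: E[X1 + X1^2] = 1 + (1 + 1^2) = 3
print((x1 + x1**2).mean())      # ~3.0 even though X1 and X1^2 are dependent
```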
**Covariance**
$$\mathrm{cov}(X,Y)=E[(X-\mu_x)(Y-\mu_y)]=E[X\cdot Y]-\mu_x\mu_y$$
**Correlation coefficient**:
$$\mathrm{corr}(X,Y)=\frac{\mathrm{cov}(X,Y)}{\sigma_x\sigma_y}$$
($\sigma$: standard deviation)
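A short NumPy check (the linear relationship $Y=2X+\varepsilon$ is arbitrary, chosen so the true correlation is $2/\sqrt{5}\approx 0.894$):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500_000)
y = 2 * x + rng.normal(size=500_000)

cov_manual = (x * y).mean() - x.mean() * y.mean()  # E[XY] - mu_x mu_y
corr_manual = cov_manual / (x.std() * y.std())
print(cov_manual)               # ~2.0
print(corr_manual)              # ~0.894
print(np.corrcoef(x, y)[0, 1])  # matches the manual computation
```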
**Sample Mean**:
$$\overline{X}=\frac{1}{N}\sum_{i=1}^N X_i$$
**Sample Variance**:
$$S^2=\frac{1}{N-1}\sum_{i=1}^N(X_i-\overline X)^2$$
If the $X_i$ are iid:
$E[\overline X]=E[X_i]=\mu$
$\mathrm{Var}(\overline X)=\sigma^2/N$
$E[S^2]=\sigma^2$
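The $N-1$ denominator is exactly what makes $S^2$ unbiased, which a simulation over many repeated samples makes visible (all numbers below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials, sigma2 = 10, 200_000, 4.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))

means = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)  # divide by N-1

print(means.mean())  # ~0.0: E[X bar] = mu
print(means.var())   # ~0.4: Var(X bar) = sigma^2 / N
print(s2.mean())     # ~4.0: E[S^2] = sigma^2 (unbiased)
# Dividing by N instead is biased low: E = sigma^2 (N-1)/N = 3.6
print(samples.var(axis=1, ddof=0).mean())
```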
**Point Estimation**
The point estimator $\hat{\theta}_N$ is a function of the samples $X_1,\ldots,X_N$ that approximates a parameter $\theta$ of the distribution of the $X_i$.
**Bias**: the bias of an estimator is
$$\mathrm{bias}(\hat{\theta}_N)=E_\theta[\hat\theta_N]-\theta$$
An estimator is unbiased if $E_\theta[\hat\theta_N]=\theta$.
**Standard error**: the standard deviation of $\hat\theta_N$ is called the standard error:
$$\mathrm{se}(\hat\theta_N)=\sqrt{\mathrm{Var}(\hat\theta_N)}$$
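For the sample mean as the estimator, $\mathrm{se}(\overline X)=\sigma/\sqrt{N}$; the sketch below (with arbitrary $\mu=5$, $\sigma=2$, $N=25$) compares the formula with the empirical spread of the estimator:

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials, mu, sigma = 25, 100_000, 5.0, 2.0

# theta-hat = sample mean, computed over many independent datasets
estimates = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)

print(estimates.mean() - mu)  # ~0: empirical bias
print(estimates.std())        # empirical standard error
print(sigma / np.sqrt(N))     # theoretical se = 0.4
```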
**Information Theory**
Suppose $X$ can take one of $m$ values $x_1, x_2,\ldots,x_m$, with probability distribution $P(X=x_i)=p_i$.
**Entropy**: $H(X)=-\sum_{i=1}^{m}p_i\log p_i$
**Conditional Entropy**:
$$H(Y|X)=\sum_{i=1}^{m}p_i H(Y|X=x_i)=-\sum_{i=1}^{m}\sum_{j=1}^{n}p(x_i,y_j)\log p(y_j|x_i)$$
**Mutual information** (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons (bits), nats, or hartleys) obtained about one random variable by observing the other random variable.
$$I(Y;X)=H(Y)-H(Y|X)$$
$$I(Y;X)=I(X;Y)=H(X)+H(Y)-H(X,Y)$$
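A compact sketch computing entropy, conditional entropy, and mutual information from a made-up 2×2 joint pmf ($\log_2$ gives bits); it uses the identity $H(Y|X)=H(X,Y)-H(X)$, which is equivalent to the double-sum definition above:

```python
import numpy as np

# Hypothetical joint pmf p(x, y): rows index x, columns index y
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

def entropy(p):
    p = p[p > 0]  # convention: 0 log 0 = 0
    return -(p * np.log2(p)).sum()

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
H_Y_given_X = H_XY - H_X  # H(Y|X) = H(X,Y) - H(X)
I_XY = H_Y - H_Y_given_X  # I(Y;X) = H(Y) - H(Y|X)
print(H_Y_given_X)        # ~0.71 bits
print(I_XY)               # ~0.26 bits
assert abs(I_XY - (H_X + H_Y - H_XY)) < 1e-12
```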
In mathematical statistics, the Kullback–Leibler divergence $KL(P\parallel Q)$ (also called relative entropy) is a statistical distance: a measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$.
$$KL(p\parallel q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$$
$$I(X;Y)=KL\big(p(x,y)\parallel p(x)\,p(y)\big)$$
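Continuing the same made-up joint table, the sketch below checks $I(X;Y)=KL(p(x,y)\parallel p(x)p(y))$ and illustrates that KL is not symmetric:

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0
    mask = p > 0
    return (p[mask] * np.log2(p[mask] / q[mask])).sum()

p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
indep = np.outer(p_x, p_y)  # product of marginals p(x) p(y)

print(kl(p_xy.ravel(), indep.ravel()))  # ~0.26 bits = I(X;Y)
print(kl(indep.ravel(), p_xy.ravel()))  # different: KL is asymmetric
```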
**Convex Functions**: a function $f$ is convex if for any two points $x_1$ and $x_2$ in its domain $X$ and any $t\in[0,1]$,
$$f(tx_1+(1-t)x_2)\le tf(x_1)+(1-t)f(x_2)$$
**Convex Set**: a set $S$ is convex if and only if for any $x_1, x_2\in S$ and any $t\in[0,1]$, $tx_1+(1-t)x_2\in S$.
**Convex Optimization**: the minimization of a convex function (equivalently, the maximization of a concave function) over a convex set.
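To close, a sketch that spot-checks the convexity inequality for an arbitrary quadratic and solves a tiny convex optimization over a convex set (an interval) with SciPy's `minimize_scalar`:

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: (x - 2.0) ** 2  # a convex function

# Convexity: f(t x1 + (1-t) x2) <= t f(x1) + (1-t) f(x2) at random points
rng = np.random.default_rng(5)
x1, x2 = rng.uniform(-10, 10, size=(2, 1000))
t = rng.uniform(0, 1, size=1000)
assert np.all(f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + 1e-9)

# Convex optimization: minimize f over the convex set [-1, 1]
res = minimize_scalar(f, bounds=(-1, 1), method="bounded")
print(res.x)  # ~1.0: the constrained minimum lies on the boundary
```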