Regression vs. Classification
- Regression predicts a continuous output; classification predicts a discrete label.
- The two settings measure prediction error differently.
Measuring error for one prediction
- absolute error: |prediction - sale price|
- squared error: (prediction - sale price)^2
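A quick numeric illustration of the two error measures (the prediction and sale price below are made-up numbers):

```python
# Toy comparison of absolute vs. squared error for a single prediction.
prediction = 310_000.0   # hypothetical predicted sale price
sale_price = 300_000.0   # hypothetical true sale price

absolute_error = abs(prediction - sale_price)    # 10,000
squared_error = (prediction - sale_price) ** 2   # 100,000,000 (large errors are penalized much more)

print(absolute_error, squared_error)
```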
Formal setup for linear regression
- Input: $x \in \mathbb{R}^D$
- Output: $y \in \mathbb{R}$
- Training data: $\mathcal{D} = \{(x_n, y_n), n = 1, 2, \ldots, N\}$
- Linear model: $f: \mathbb{R}^D \to \mathbb{R}$, with $f(x) = w_0 + \sum_{d=1}^{D} w_d x_d = w_0 + w^T x$
- For notational convenience, $f(x) = \tilde{w}^T \tilde{x} = \tilde{x}^T \tilde{w}$, where $\tilde{w} = [w_0, w_1, w_2, \ldots, w_D]^T$ and $\tilde{x} = [1, x_1, x_2, \ldots, x_D]^T$
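A minimal NumPy sketch of the augmented-vector trick; the weights and input below are arbitrary illustrative values:

```python
import numpy as np

# Arbitrary example with D = 3 input features.
x = np.array([2.0, -1.0, 0.5])        # x in R^D
w0 = 4.0                              # bias / intercept
w = np.array([1.0, 3.0, -2.0])        # w in R^D

# Plain form: f(x) = w0 + w^T x
f_plain = w0 + w @ x

# Augmented form: prepend a constant 1 to x and absorb w0 into w~.
x_tilde = np.concatenate(([1.0], x))  # x~ = [1, x1, ..., xD]
w_tilde = np.concatenate(([w0], w))   # w~ = [w0, w1, ..., wD]
f_aug = w_tilde @ x_tilde

assert np.isclose(f_plain, f_aug)     # both forms give the same prediction
```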
RSS (residual sum of squares)
$$\mathrm{RSS}(\tilde{w}) = \sum_n (f(x_n) - y_n)^2 = \sum_n (\tilde{w}^T \tilde{x}_n - y_n)^2 = \sum_n (\tilde{x}_n^T \tilde{w} - y_n)^2$$
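A short sketch of computing RSS over a toy training set (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))               # N inputs in R^D
y = rng.normal(size=N)                    # N targets
w_tilde = rng.normal(size=D + 1)          # candidate weights [w0, w1, ..., wD]

# Augment each input with a leading 1: X~ is N x (D+1).
X_tilde = np.hstack([np.ones((N, 1)), X])

# RSS(w~) = sum_n (x~_n^T w~ - y_n)^2
residuals = X_tilde @ w_tilde - y
rss = np.sum(residuals ** 2)
print(rss)
```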
Empirical risk minimizer / least squares solution
$$\tilde{w}^* = \arg\min_{\tilde{w} \in \mathbb{R}^{D+1}} \mathrm{RSS}(\tilde{w})$$
General least squares solution
Objective: $\mathrm{RSS}(\tilde{w}) = \sum_n (\tilde{w}^T \tilde{x}_n - y_n)^2$
$$\nabla \mathrm{RSS}(\tilde{w}) = 2 \sum_n \tilde{x}_n (\tilde{w}^T \tilde{x}_n - y_n) \propto \sum_n \tilde{x}_n \tilde{x}_n^T \tilde{w} - \sum_n y_n \tilde{x}_n$$
Notation:
$$\tilde{X} = \begin{bmatrix} \tilde{x}_1^T \\ \tilde{x}_2^T \\ \vdots \\ \tilde{x}_N^T \end{bmatrix} \in \mathbb{R}^{N \times (D+1)}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^{N \times 1}$$
Setting the gradient to zero:
$$\nabla \mathrm{RSS}(\tilde{w}) = (\tilde{X}^T \tilde{X}) \tilde{w} - \tilde{X}^T y = 0$$
$$\tilde{w}^* = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$$
When $D = 0$: $(\tilde{X}^T \tilde{X})^{-1} = \frac{1}{N}$ and $\tilde{X}^T y = \sum_n y_n$, so $\tilde{w}^* = \frac{\sum_n y_n}{N}$, i.e. the mean of the training labels.
- Inverting the matrix $\tilde{X}^T \tilde{X}$ takes $O(D^3)$ time.
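A sketch of the closed-form solution on synthetic data; in practice the linear system is solved directly (e.g. `np.linalg.solve` or `lstsq`) rather than forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 100, 4
X = rng.normal(size=(N, D))
true_w_tilde = rng.normal(size=D + 1)                    # ground-truth [w0, w]
X_tilde = np.hstack([np.ones((N, 1)), X])                # augment with a bias column
y = X_tilde @ true_w_tilde + 0.01 * rng.normal(size=N)   # targets with small noise

# w~* = (X~^T X~)^{-1} X~^T y, computed as a linear solve instead of an explicit inverse.
A = X_tilde.T @ X_tilde                                  # (D+1) x (D+1)
b = X_tilde.T @ y                                        # (D+1)
w_star = np.linalg.solve(A, b)

print(np.allclose(w_star, true_w_tilde, atol=0.05))      # approximately recovers the true weights
```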
What if $\tilde{X}^T \tilde{X}$ is not invertible?
- When $N < D + 1$, there is not enough data to estimate all the parameters.
- $\tilde{X}^T \tilde{X}$ is symmetric
- Eigendecomposition:
$$\tilde{X}^T \tilde{X} = U^T \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 & 0 \\ 0 & \lambda_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \lambda_D & 0 \\ 0 & 0 & \cdots & 0 & \lambda_{D+1} \end{bmatrix} U$$
$$(\tilde{X}^T \tilde{X})^{-1} = U^T \begin{bmatrix} 1/\lambda_1 & 0 & \cdots & 0 & 0 \\ 0 & 1/\lambda_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1/\lambda_D & 0 \\ 0 & 0 & \cdots & 0 & 1/\lambda_{D+1} \end{bmatrix} U$$
Non-invertibility means that some of the eigenvalues are zero.
One natural fix: add a positive constant $\lambda$ to the diagonal:
$$\tilde{X}^T \tilde{X} + \lambda I = U^T \begin{bmatrix} \lambda_1 + \lambda & 0 & \cdots & 0 & 0 \\ 0 & \lambda_2 + \lambda & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \lambda_D + \lambda & 0 \\ 0 & 0 & \cdots & 0 & \lambda_{D+1} + \lambda \end{bmatrix} U$$
The solution becomes: $\tilde{w}^* = (\tilde{X}^T \tilde{X} + \lambda I)^{-1} \tilde{X}^T y$
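A sketch of the $\lambda I$ fix in NumPy, assuming a rank-deficient design with fewer examples than parameters (synthetic data, arbitrary $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 5, 10                               # N < D + 1, so X~^T X~ cannot be full rank
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
X_tilde = np.hstack([np.ones((N, 1)), X])

A = X_tilde.T @ X_tilde                    # (D+1) x (D+1), rank at most N
print(np.linalg.matrix_rank(A))            # some eigenvalues are exactly 0

lam = 0.1                                  # regularization strength (arbitrary choice here)
w_star = np.linalg.solve(A + lam * np.eye(D + 1), X_tilde.T @ y)
print(w_star.shape)                        # (D+1,), well-defined despite the rank deficiency
```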
Parametric vs. non-parametric:
- Parametric methods: the size of the model does not grow with the size of the training set $N$.
- Non-parametric methods: the size of the model grows with the size of the training set.
Linear regression with nonlinear basis
- Use a nonlinear mapping $\phi: \mathbb{R}^D \to \mathbb{R}^M$, $x \mapsto z = \phi(x)$, to transform the data into a richer feature space.
- Then apply linear regression.
Model: $f(x) = w^T \phi(x)$, where $w \in \mathbb{R}^M$
Objective: $\mathrm{RSS}(w) = \sum_n (w^T \phi(x_n) - y_n)^2$
Least squares solution: $w^* = (\Phi^T \Phi)^{-1} \Phi^T y$, where
$$\Phi = \begin{bmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{bmatrix} \in \mathbb{R}^{N \times M}$$
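A sketch using a polynomial basis $\phi(x) = [1, x, x^2, \ldots, x^M]$ for scalar inputs (so $\Phi$ has $M+1$ columns including the constant feature); the data here is randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 30, 3                                      # degree-3 polynomial basis
x = rng.uniform(-1, 1, size=N)                    # scalar inputs
y = np.sin(3 * x) + 0.1 * rng.normal(size=N)      # a nonlinear target

# Phi[n] = phi(x_n) = [1, x_n, x_n^2, ..., x_n^M], so Phi is N x (M+1).
Phi = np.vander(x, M + 1, increasing=True)

# w* = (Phi^T Phi)^{-1} Phi^T y, again via a linear solve.
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

y_hat = Phi @ w_star                              # fitted values, f(x) = w^T phi(x)
print(np.mean((y_hat - y) ** 2))                  # training mean squared error
```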
Underfitting:
- large training error
- large test error
Overfitting:
- small training error
- large test error
How to prevent overfitting:
- use more data
- control the model complexity
- For a polynomial basis, the degree $M$ clearly controls the complexity
- Use cross-validation to pick the hyperparameter $M$ (see the sketch after this list)
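A minimal K-fold cross-validation sketch for choosing the polynomial degree $M$; `fit_poly` and `cv_error` are illustrative helpers written here, not library functions, and the data is synthetic:

```python
import numpy as np

def fit_poly(x, y, M, lam=0.0):
    """Least squares (optionally ridge) fit with a degree-M polynomial basis."""
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)

def cv_error(x, y, M, K=5):
    """Average validation MSE of a degree-M fit over K folds."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit_poly(x[train], y[train], M)
        pred = np.vander(x[val], M + 1, increasing=True) @ w
        errs.append(np.mean((pred - y[val]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + 0.1 * rng.normal(size=60)

# Pick the degree with the smallest cross-validation error.
best_M = min(range(1, 9), key=lambda M: cv_error(x, y, M))
print(best_M)
```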
Regularized linear regression:
$$\varepsilon(w) = \mathrm{RSS}(w) + \lambda R(w)$$
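For the common choice $R(w) = \|w\|_2^2$ (ridge regression), setting the gradient of the regularized objective to zero recovers the same $\lambda I$-shifted closed form seen above; a brief derivation sketch:

```latex
% Ridge regression: R(w) = ||w||_2^2
\varepsilon(w) = \mathrm{RSS}(w) + \lambda \|w\|_2^2
              = \|\Phi w - y\|_2^2 + \lambda \|w\|_2^2
% Gradient with respect to w:
\nabla \varepsilon(w) = 2\,\Phi^T (\Phi w - y) + 2\lambda w
% Setting the gradient to zero:
(\Phi^T \Phi + \lambda I)\, w^* = \Phi^T y
\qquad\Longrightarrow\qquad
w^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y
```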