Minimizing the perceptron loss directly does not really make sense ($w = 0$ trivially attains the minimum value of 0), but the algorithm derived from this perspective does.
Gradient Descent (GD)
The gradient is sometimes referred to as first-order information of a function; methods based on it are therefore called first-order methods.
Goal: minimize F(w)
Algorithm: move a bit in the negative gradient direction: $w^{(t+1)} \leftarrow w^{(t)} - \eta \nabla F(w^{(t)})$, where $\eta > 0$ is called the step size or learning rate.
in theory, $\eta$ should be set in terms of certain parameters of $F$
in practice, we just try several small values
Essentially, GD follows from the first-order Taylor approximation $F(w) \approx F(w^{(t)}) + \nabla F(w^{(t)})^\top (w - w^{(t)})$: substituting $w - w^{(t)} = -\eta \nabla F(w^{(t)})$ gives $F(w^{(t+1)}) \approx F(w^{(t)}) - \eta \|\nabla F(w^{(t)})\|_2^2 \le F(w^{(t)})$, so each step decreases the objective (to first order).
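To make this concrete, here is a minimal numpy sketch of the GD loop, using a least-squares objective as an illustrative stand-in for $F$ (the stand-in objective and the function names are assumptions for illustration):

```python
import numpy as np

# Illustrative stand-in objective: F(w) = 0.5 * ||Xw - y||^2
def grad_F(w, X, y):
    return X.T @ (X @ w - y)

def gradient_descent(X, y, eta=0.01, num_steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w = w - eta * grad_F(w, X, y)  # w(t+1) <- w(t) - eta * grad F(w(t))
    return w
```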
Stochastic Gradient Descent (SGD)
$w^{(t+1)} \leftarrow w^{(t)} - \eta \tilde{\nabla} F(w^{(t)})$, where $\tilde{\nabla} F(w^{(t)})$ is a random variable (called a stochastic gradient) satisfying $\mathbb{E}[\tilde{\nabla} F(w^{(t)})] = \nabla F(w^{(t)})$
It could be much faster to obtain a stochastic gradient!
Both GD and SGD enjoy convergence guarantees on convex objectives.
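As a sketch of how a stochastic gradient can be constructed, the snippet below samples one example and rescales its gradient by $N$, so that the expectation over the random index equals the full gradient (same least-squares stand-in as above):

```python
import numpy as np

def sgd(X, y, eta=0.001, num_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        n = rng.integers(N)                # pick one example uniformly at random
        g = N * (X[n] @ w - y[n]) * X[n]   # stochastic gradient; E[g] = grad F(w)
        w = w - eta * g
    return w
```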
Applying GD to perceptron loss
$F(w) = \sum_{n=1}^{N} \max\{0, -y_n w^\top x_n\}$
$\nabla F(w) = -\sum_{n=1}^{N} \mathbb{I}[y_n w^\top x_n \le 0]\, y_n x_n$; only misclassified examples contribute to the gradient
GD update: $w \leftarrow w + \eta \sum_{n=1}^{N} \mathbb{I}[y_n w^\top x_n \le 0]\, y_n x_n$
Each update requires one full pass over the entire training set.
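In numpy, one GD step on the perceptron loss might look like the following sketch (labels assumed to be in $\{-1, +1\}$; the function name is illustrative):

```python
import numpy as np

def perceptron_gd_step(w, X, y, eta):
    # indicator of mistakes: y_n * w^T x_n <= 0
    mistakes = y * (X @ w) <= 0
    # w <- w + eta * sum over misclassified n of y_n x_n
    return w + eta * (X[mistakes].T @ y[mistakes])
```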
Applying SGD to perceptron loss
How to construct a stochastic gradient?
Pick one example $n \in [N]$ uniformly at random and set $\tilde{\nabla} F(w) = -N\, \mathbb{I}[y_n w^\top x_n \le 0]\, y_n x_n$ (the factor $N$ makes it unbiased: taking the expectation over the random index $n$ recovers $\nabla F(w)$)
SGD update: $w \leftarrow w + \eta\, \mathbb{I}[y_n w^\top x_n \le 0]\, y_n x_n$ ($\eta$ absorbing the constant $N$); each update touches only one data point!
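A corresponding sketch of the single-example update, with the indicator implemented as a branch:

```python
def perceptron_sgd_step(w, x_n, y_n, eta=1.0):
    # update only when example n is misclassified; the indicator is 0 otherwise
    if y_n * (w @ x_n) <= 0:
        w = w + eta * y_n * x_n
    return w
```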
Perceptron Algorithm
The perceptron algorithm is SGD with $\eta = 1$ applied to the perceptron loss:
Repeat:
Pick a data point $x_n$ uniformly at random
If $\mathrm{sgn}(w^\top x_n) \neq y_n$, update $w \leftarrow w + y_n x_n$
Note:
$w$ is always a linear combination of the training examples.
$\eta = 1$ does not really matter in terms of training error: starting from $w = 0$, running with step size $\eta$ just scales every iterate by $\eta$, which leaves all sign predictions unchanged.
If the training set is linearly separable:
perceptron converges in a finite number of steps
training error is 0
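Putting the pieces together, a minimal sketch of the full algorithm, assuming a fixed update budget as the stopping rule (the notes only specify the repeat loop):

```python
import numpy as np

def perceptron(X, y, max_updates=10_000, seed=0):
    """SGD with eta = 1 on the perceptron loss."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)                    # w stays a linear combination of the x_n
    for _ in range(max_updates):
        n = rng.integers(N)            # pick a data point uniformly at random
        if np.sign(w @ X[n]) != y[n]:  # mistake (sign(0) = 0 also triggers an update)
            w = w + y[n] * X[n]        # eta = 1 update
    return w
```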
sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
$\sigma(z)$ is between 0 and 1
$\sigma(w^\top x) \ge 0.5 \Leftrightarrow w^\top x \ge 0$, consistent with predicting the label with $\mathrm{sgn}(w^\top x)$
larger $w^\top x$ $\Rightarrow$ larger $\sigma(w^\top x)$ $\Rightarrow$ higher confidence in label 1
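A quick numpy sketch of the sigmoid, with a few illustrative evaluations of the threshold and monotonicity properties (the example inputs are our own):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), always in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5: w^T x = 0 is exactly the decision boundary
print(sigmoid(3.0))   # ~0.95: larger w^T x, higher confidence in label 1
print(sigmoid(-3.0))  # ~0.05
```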