Linear Classifier (CSCI-567)

2022-02-09
🗂️
AI

Classification:

  • $N$ samples/instances: $D^{\mathrm{TRAIN}}=\{(\vec x_1, y_1),(\vec x_2, y_2),\ldots,(\vec x_N, y_N)\}$
  • Each $\vec x_n \in \mathbb{R}^D$ is called a feature vector
  • Each $y_n\in[C]=\{1,2,\ldots,C\}$ is called a label/class/category
  • They are used for learning $f:\mathbb{R}^D\rightarrow[C]$ for future prediction

This post focuses on binary classification.

  • Number of classes: C=2
  • Labels: {-1, +1}

Model: $\mathcal F=\{f(\vec x)=\operatorname{sign}(\vec w^T\vec x)\mid\vec w\in\mathbb{R}^D\}$, where $\operatorname{sign}(\vec w^T\vec x)=\begin{cases} +1 & \text{if } \vec w^T\vec x>0 \\ -1 & \text{if } \vec w^T\vec x\leq0 \end{cases}$
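As a quick sketch, the prediction rule can be written in plain Python (the weight and feature vectors below are made-up examples):

```python
def predict(w, x):
    # sign(w^T x): +1 if the inner product is positive, -1 otherwise
    # (ties at exactly 0 go to -1, matching the definition above)
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else -1

print(predict([1.0, -2.0], [3.0, 1.0]))  # w^T x = 3 - 2 = 1 > 0, so prints 1
```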

Loss Function

$z=y_n\vec w^T\vec x_n$

  • perceptron loss: $\ell_{\mathrm{perceptron}}(z)=\max\{0, -z\}$
  • hinge loss: $\ell_{\mathrm{hinge}}(z)=\max\{0, 1-z\}$
  • logistic loss: $\ell_{\mathrm{logistic}}(z)=\log(1+\exp(-z))$
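The three losses are easy to compare numerically; a minimal sketch in plain Python, where $z>0$ means a correct prediction (note the logistic loss here is the standard $\log(1+e^{-z})$):

```python
import math

def perceptron_loss(z):
    return max(0.0, -z)

def hinge_loss(z):
    return max(0.0, 1.0 - z)

def logistic_loss(z):
    return math.log(1.0 + math.exp(-z))

# All three losses are small (or zero) for confidently correct predictions
# (large positive z) and grow as z becomes negative (misclassification).
for z in (-2.0, 0.0, 2.0):
    print(z, perceptron_loss(z), hinge_loss(z), logistic_loss(z))
```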

Find ERM (Empirical Risk Minimization)

$\vec w^{\ast}=\mathop{\mathrm{argmin}}\limits_{\vec w\in\mathbb{R}^D}\sum\limits_{n=1}^N \ell(y_n\vec w^T\vec x_n)$

Minimizing the perceptron loss does not really make sense on its own ($\vec w=\vec 0$ attains zero loss), but the algorithm derived from this perspective does.
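To see the degeneracy concretely: at $\vec w=\vec 0$ every margin $z=y_n\vec w^T\vec x_n$ is zero, so the total perceptron loss is already at its minimum of 0. A toy check with made-up data:

```python
# Made-up (x, y) pairs; at w = 0 every margin z = y * w^T x equals 0,
# so the perceptron loss max{0, -z} is 0 on every example.
data = [([1.0, 2.0], 1), ([-1.0, 0.5], -1), ([3.0, -1.0], 1)]
w = [0.0, 0.0]
total = sum(max(0.0, -y * sum(wi * xi for wi, xi in zip(w, x)))
            for x, y in data)
print(total)  # prints 0.0
```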

Gradient Descent (GD)

  • Gradient is sometimes referred to as first-order information of a function. Therefore, these methods are called first-order methods.

Goal: minimize $F(\vec w)$

Algorithm: move a bit in the negative gradient direction: $\vec w^{(t+1)}\leftarrow\vec w^{(t)}-\eta\nabla F(\vec w^{(t)})$, where $\eta>0$ is called the step size or learning rate.

  • in theory, $\eta$ should be set in terms of some parameters of $F$
  • in practice, we just try several small values

Essentially, GD is a Taylor approximation: $F(\vec w)\approx F(\vec w^{(t)})+\nabla F(\vec w^{(t)})^T(\vec w-\vec w^{(t)})$. Substituting $\vec w-\vec w^{(t)}=-\eta\nabla F(\vec w^{(t)})$ gives $F(\vec w^{(t+1)})\approx F(\vec w^{(t)})-\eta\|\nabla F(\vec w^{(t)})\|_2^2\leq F(\vec w^{(t)})$.
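A minimal sketch of the update on a toy convex function (the function and step size below are made up for illustration):

```python
# Gradient descent on F(w) = (w - 3)^2, whose gradient is F'(w) = 2(w - 3).
# Each step moves against the gradient, so F never increases.
def grad(w):
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1
for _ in range(100):
    w -= eta * grad(w)  # w^{(t+1)} = w^{(t)} - eta * grad
print(w)  # converges toward the minimizer w = 3
```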

Stochastic Gradient Descent (SGD)

$\vec w^{(t+1)}\leftarrow\vec w^{(t)}-\eta\tilde\nabla F(\vec w^{(t)})$, where $\tilde\nabla F(\vec w^{(t)})$ is a random variable (called the stochastic gradient) satisfying $\mathbb{E}[\tilde\nabla F(\vec w^{(t)})]=\nabla F(\vec w^{(t)})$.

It could be much faster to obtain a stochastic gradient!

Convergence guarantees exist for both GD and SGD on convex objectives.
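The unbiasedness requirement $\mathbb{E}[\tilde\nabla F]=\nabla F$ can be checked on a toy objective (the data values below are made up):

```python
# Toy objective F(w) = sum_n (w - a_n)^2 with gradient sum_n 2(w - a_n).
# The stochastic gradient picks one term n uniformly and scales it by N,
# so its expectation (the uniform average over n) recovers the full gradient.
a = [1.0, 2.0, 4.0]
N = len(a)
w = 0.0
full_grad = sum(2.0 * (w - an) for an in a)
expected_stochastic = sum(N * 2.0 * (w - an) for an in a) / N
print(full_grad, expected_stochastic)  # both -14.0
```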

Applying GD to perceptron loss

$F(\vec w)=\sum\limits_{n=1}^N\max\{0, -y_n\vec w^T\vec x_n\}$

$\nabla F(\vec w)=-\sum\limits_{n=1}^N\mathbb{I}[y_n\vec w^T\vec x_n\leq0]\,y_n\vec x_n$; only misclassified examples contribute to the gradient

GD update: $\vec w\leftarrow\vec w+\eta\sum\limits_{n=1}^N\mathbb{I}[y_n\vec w^T\vec x_n\leq0]\,y_n\vec x_n$

Each update makes one pass over the entire training set.
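One full-batch update can be sketched as follows (the dataset and step size are made up):

```python
def gd_step(w, data, eta):
    # One full-batch GD step on the perceptron loss:
    # w <- w + eta * sum over misclassified n of y_n * x_n
    update = [0.0] * len(w)
    for x, y in data:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # misclassified
            for i, xi in enumerate(x):
                update[i] += y * xi
    return [wi + eta * ui for wi, ui in zip(w, update)]

data = [([1.0, 1.0], 1), ([-1.0, -1.0], -1)]  # made-up, linearly separable
w = [0.0, 0.0]
w = gd_step(w, data, eta=0.5)
print(w)  # both points are misclassified at w = 0, so prints [1.0, 1.0]
```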

Applying SGD to perceptron loss

How to construct a stochastic gradient?

Pick one example $n\in[N]$ uniformly at random; then $\tilde\nabla F(\vec w)=-N\,\mathbb{I}[y_n\vec w^T\vec x_n\leq0]\,y_n\vec x_n$

SGD update: $\vec w\leftarrow\vec w+\eta\,\mathbb{I}[y_n\vec w^T\vec x_n\leq0]\,y_n\vec x_n$ ($\eta$ absorbs the constant $N$); each update touches only one data point!

Perceptron Algorithm

The perceptron algorithm is SGD with $\eta=1$ applied to the perceptron loss:

Repeat:

  • Pick a data point $\vec x_n$ uniformly at random
  • If $\operatorname{sign}(\vec w^T\vec x_n)\neq y_n$, update $\vec w\leftarrow\vec w+y_n\vec x_n$

Note:

  • $\vec w$ is always a linear combination of the training examples.
  • $\eta=1$ does not really matter in terms of training error: scaling $\vec w$ does not change the sign of $\vec w^T\vec x$.

If the training set is linearly separable:

  • the perceptron converges in a finite number of steps
  • the training error is 0
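Putting the pieces together, a minimal perceptron sketch on a made-up linearly separable dataset (using the $y_n\vec w^T\vec x_n\leq0$ mistake criterion from the SGD derivation):

```python
import random

def perceptron(data, max_iters=1000, seed=0):
    # SGD with eta = 1 on the perceptron loss
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(max_iters):
        x, y = rng.choice(data)  # pick a data point uniformly at random
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # misclassified
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Made-up linearly separable data (positives up-right, negatives down-left)
data = [([2.0, 1.0], 1), ([1.0, 3.0], 1),
        ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w = perceptron(data)
errors = sum(1 for x, y in data
             if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0)
print(errors)  # training error is 0 on a separable set
```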

Sigmoid function: $\sigma(z)=\frac{1}{1+e^{-z}}$

  • $\sigma(z)$ is between 0 and 1
  • $\sigma(\vec w^T\vec x)\geq0.5 \Leftrightarrow \vec w^T\vec x\geq0$, consistent with predicting the label with $\operatorname{sign}(\vec w^T\vec x)$
  • larger $\vec w^T\vec x$ $\Rightarrow$ larger $\sigma(\vec w^T\vec x)$ $\Rightarrow$ higher confidence in label $+1$
  • $\sigma(z)+\sigma(-z)=1$ for all $z$
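These properties are easy to verify numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))                  # 0.5: the decision boundary w^T x = 0
print(sigmoid(5.0))                  # close to 1: high confidence in label +1
print(sigmoid(2.0) + sigmoid(-2.0))  # approximately 1, by the symmetry above
```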