NNC (CSCI-567)

2022-01-28
AI

Algorithm

Training Data:

  • N samples/instances: $D^{TRAIN}=\{(\vec x_1, y_1),(\vec x_2, y_2),...,(\vec x_N, y_N)\}$
  • Each $\vec x_n \in \mathbb{R}^D$ is called a feature vector
  • Each $y_n\in[C]=\{1,2,...,C\}$ is called a label/class/category
  • They are used for learning $f:\mathbb{R}^D\rightarrow[C]$ for future prediction

Nearest neighbor: $\vec x(1)=\vec x_{nn(\vec{x})}$, where $nn(\vec x)\in[N]=\{1,2,...,N\}$ is the index of one of the training instances: $nn(\vec{x})=\mathop{argmin}\limits_{n\in[N]}\parallel\vec{x}-\vec{x_n}\parallel_2=\mathop{argmin}\limits_{n\in[N]}\sqrt{\sum\limits_{d=1}^D(x_d-x_{nd})^2}$

Classification rule: $y=f(\vec{x})=y_{nn(\vec x)}$
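
A minimal NumPy sketch of this rule (the names `nnc_predict`, `X_train`, and `y_train` are illustrative, not from the lecture):

```python
import numpy as np

def nnc_predict(X_train, y_train, x):
    """Label x with the label of its nearest training instance (L2 distance)."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # ||x - x_n||_2 for every n
    nn = np.argmin(dists)                                 # nn(x): index of the closest training point
    return y_train[nn]                                    # y = y_{nn(x)}

# toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5]])
y_train = np.array([1, 2, 2])
print(nnc_predict(X_train, y_train, np.array([0.3, 0.1])))  # prints 1
```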

Decision boundary: for every point in the space, we can determine its label using the NNC rule. This gives rise to a decision boundary that partitions the space into different regions.

Accuracy: the percentage of data points being correctly classified.

Error Rate: the percentage of data points being incorrectly classified.

Accuracy+Error Rate=1

Accuracy and Error Rate defined on the training data set:

$A^{TRAIN}=\frac{1}{N}\sum\limits_n\mathbb{I}[f(\vec{x_n})==y_n]$

$\varepsilon^{TRAIN}=\frac{1}{N}\sum\limits_n\mathbb{I}[f(\vec{x_n})\neq y_n]$

$\mathbb{I}(e)$ is the indicator function: $\mathbb{I}(e)=\begin{cases} 1 & \text{if } e \text{ is true} \\ 0 & \text{if } e \text{ is false} \end{cases}$

For NNC, $A^{TRAIN}=100\%$, $\varepsilon^{TRAIN}=0$ (every training point is its own nearest neighbor)
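
A sketch of these two quantities in NumPy (the helper name `accuracy_and_error` is illustrative); passing test points instead of training points gives the test versions defined below:

```python
import numpy as np

def accuracy_and_error(f, X, y):
    """Accuracy and error rate of classifier f on the dataset (X, y)."""
    preds = np.array([f(x) for x in X])
    acc = np.mean(preds == y)   # (1/N) sum_n I[f(x_n) == y_n]
    err = np.mean(preds != y)   # (1/N) sum_n I[f(x_n) != y_n]
    return acc, err             # acc + err == 1
```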

Test/Evaluation data

  • $D^{TEST}=\{(\vec x_1, y_1),(\vec x_2, y_2),...,(\vec x_M, y_M)\}$
  • A fresh dataset that does not overlap with the training set
  • Test accuracy and test error
    • $A^{TEST}=\frac{1}{M}\sum\limits_m\mathbb{I}[f(\vec{x_m})==y_m]$
    • $\varepsilon^{TEST}=\frac{1}{M}\sum\limits_m\mathbb{I}[f(\vec{x_m})\neq y_m]$
  • Test accuracy and test error are better measurements of how well $f$ will perform on future data than training accuracy and training error

Variants of NNC

  1. measure nearness with other distances, for example the $L_p$ distance: $\parallel\vec {x}-\vec{x_n}\parallel_p=(\sum\limits_d |x_d-x_{nd}|^p)^\frac{1}{p}$, for $p\geq1$

  2. K-nearest neighbor: every neighbor votes for its label; predict with the majority

    1. as K increases, the decision boundary becomes smoother
  3. Preprocessing data.

    One issue of NNC: distances depend on the units of the features. We can preprocess the data so it looks more "normalized" (a sketch combining these variants follows this list).

    • Compute the means and standard deviations in each feature
      • $\bar{x_d}=\frac{1}{N}\sum\limits_n x_{nd}$
      • $s_d^2=\frac{1}{N-1}\sum\limits_n(x_{nd}-\bar{x_d})^2$
    • Scale the feature accordingly
      • $x_{nd}\leftarrow\frac{x_{nd}-\bar{x_d}}{s_d}$
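
A sketch combining the three variants above (an $L_p$ distance, K-neighbor majority voting, and per-feature standardization); the function names are illustrative:

```python
import numpy as np

def standardize(X_train, X_other):
    """Scale every feature by the training mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)        # uses the 1/(N-1) normalization from the notes
    return (X_train - mean) / std, (X_other - mean) / std

def knn_predict(X_train, y_train, x, K=3, p=2):
    """Majority vote among the K training points closest to x in L_p distance."""
    dists = np.sum(np.abs(X_train - x) ** p, axis=1) ** (1.0 / p)
    knn = np.argsort(dists)[:K]                           # indices of the K nearest neighbors
    labels, counts = np.unique(y_train[knn], return_counts=True)
    return labels[np.argmax(counts)]                      # the majority label
```

With $K=1$ and $p=2$ this reduces to the basic NNC rule.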

Hyper-parameters in NNC

  • The distance measure (e.g. the parameter $p$ for the $L_p$ norm)
  • K (the number of nearest neighbors)
  • Different ways of preprocessing data

Dataset

  • Training data are used for learning $f(\cdot)$
  • Test data are used for assessing how well $f(\cdot)$ does
  • Development/Validation data are used to optimize hyper-parameters (a split sketch follows this list).
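
A sketch of one way to create the three splits (the 70/15/15 proportions and the name `split_data` are arbitrary choices for illustration):

```python
import numpy as np

def split_data(X, y, dev_frac=0.15, test_frac=0.15, seed=0):
    """Randomly partition (X, y) into training, development, and test sets."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test, n_dev = int(len(X) * test_frac), int(len(X) * dev_frac)
    test, dev, train = idx[:n_test], idx[n_test:n_test + n_dev], idx[n_test + n_dev:]
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])
```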

Recipe

  • For each possible value of the hyperparameter
    • Train a model using $D^{TRAIN}$
    • Evaluate the performance of the model on $D^{DEV}$
  • Choose the model with the best performance on $D^{DEV}$
  • Evaluate the model on $D^{TEST}$ (this recipe is sketched below)
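
A sketch of this recipe for the hyper-parameter K, reusing the illustrative helpers `knn_predict` and `accuracy_and_error` from the earlier sketches and assuming the split arrays `X_train, y_train, X_dev, y_dev, X_test, y_test` already exist:

```python
best_K, best_acc = None, -1.0
for K in [1, 3, 5, 7, 9]:                                        # candidate hyper-parameter values
    f = lambda x, K=K: knn_predict(X_train, y_train, x, K=K)     # "training" just stores D_TRAIN
    acc, _ = accuracy_and_error(f, X_dev, y_dev)                 # performance on D_DEV
    if acc > best_acc:
        best_K, best_acc = K, acc                                # keep the best hyper-parameter
f_best = lambda x: knn_predict(X_train, y_train, x, K=best_K)
test_acc, test_err = accuracy_and_error(f_best, X_test, y_test)  # final evaluation on D_TEST
```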

S-fold Cross-validation

  • When we do not have a development set
  • Split the training data into S equal parts
  • Use each part in turn as a development dataset and use the others as a training dataset
  • Choose the hyper-parameter leading to best average performance
  • Use the best hyper-parameter to train a model using all of $D^{TRAIN}$ (see the sketch below)
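
A sketch of S-fold cross-validation for the same hyper-parameter (names are illustrative; it reuses `knn_predict` and `accuracy_and_error` from the earlier sketches):

```python
import numpy as np

def cross_validate_K(X, y, candidate_Ks, S=5, seed=0):
    """Return the K with the best accuracy averaged over S folds of the training data."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), S)
    avg_acc = {}
    for K in candidate_Ks:
        accs = []
        for s in range(S):
            dev_idx = folds[s]                                    # fold s plays the development set
            train_idx = np.concatenate([folds[t] for t in range(S) if t != s])
            f = lambda x: knn_predict(X[train_idx], y[train_idx], x, K=K)
            accs.append(accuracy_and_error(f, X[dev_idx], y[dev_idx])[0])
        avg_acc[K] = np.mean(accs)
    return max(avg_acc, key=avg_acc.get)   # then retrain on all of D_TRAIN with this K
```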

Advantage of NNC

  • Simple, easy to implement

Disadvantage of NNC

  • Computationally intensive for large-scale problems: $O(ND)$ for each prediction.
  • Need to carry the training data around. This type of method is called nonparametric.
  • Choosing the right hyper-parameters can be involved

Typical steps of developing a machine learning system

  • Collect data, split into training, development, and test sets.
  • Train a model with a machine learning algorithm. Most often we apply cross-validation to tune hyper-parameters.
  • Evaluate using the test data and report performance
  • Use the model to make predictions/decisions on future data.

How good is NNC?

Most standard assumption: every data point $(\vec{x}, y)$ (from $D^{TRAIN}, D^{DEV}, D^{TEST}$) is an independent and identically distributed (i.i.d.) sample from an unknown joint distribution $P$.

  • often written as $(\vec{x}, y) \mathop{\sim}\limits^{i.i.d.} P$

The test error of a fixed classifier is therefore a random variable, and the expectation of the test error is the expected error/mistake of $f$:

$E[\varepsilon^{TEST}]=\frac{1}{M}\sum\limits_{m=1}^{M}E_{(\vec{x_m}, y_m)\sim P}\mathbb{I}[f(\vec{x_m})\neq y_m]=E_{(\vec{x}, y)\sim P}\mathbb{I}[f(\vec{x})\neq y]$

Test error is a proxy of expected error. The larger the test set, the better the approximation.
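
A small simulation of this point (the joint distribution and the fixed classifier below are entirely made up for illustration): the classifier's expected error is 0.1 by construction, and its test error moves closer to 0.1 as M grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(M):
    """Draw M i.i.d. samples from a toy P: x ~ N(0,1), y = I[x > 0] with the label flipped w.p. 0.1."""
    x = rng.normal(size=M)
    y = (x > 0).astype(int)
    flip = rng.random(M) < 0.1
    return x, np.where(flip, 1 - y, y)

f = lambda x: (x > 0).astype(int)      # a fixed classifier; its expected error is 0.1
for M in [100, 1_000, 100_000]:
    x, y = sample(M)
    print(M, np.mean(f(x) != y))       # test error; approaches 0.1 as M grows
```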

Expected risk

More generally, for a loss function $L(y^\prime, y)$, the expected risk of $f$ is defined as $R(f)=E_{(\vec x, y)\sim P}L(f(\vec x), y)$

  • $L(y^\prime, y)=\mathbb{I}[y^\prime \neq y]$ is called the 0-1 loss

Bayes optimal classifier: $f^\ast(x)=\mathop{argmax}\limits_{c\in[C]}P(c|x)$

The optimal risk: $R(f^\ast)=E_{x\sim P_x}[1-\max_{c\in[C]}P(c|x)]$ where $P_x$ is the marginal distribution of $x$.

It is easy to show $R(f^\ast)\leq R(f)$ for any $f$.

For the special case $C=2$, let $\eta(x)=P(0|x)$; then $R(f^\ast)=E_{x\sim P_x}[\min\{\eta(x), 1-\eta(x)\}]$
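
A short derivation of the last three statements, conditioning on $x$ (a standard argument, written out here for completeness):

```latex
\begin{align*}
R(f) &= E_{x\sim P_x}\, E_{y\sim P(\cdot\mid x)}\,\mathbb{I}[f(x)\neq y]
      = E_{x\sim P_x}\bigl[1 - P(f(x)\mid x)\bigr] \\
     &\geq E_{x\sim P_x}\bigl[1 - \max_{c\in[C]} P(c\mid x)\bigr] = R(f^\ast)
      \qquad \text{(the max minimizes the inner term for every } x\text{)} \\
C=2:\quad
R(f^\ast) &= E_{x\sim P_x}\bigl[1 - \max\{\eta(x),\, 1-\eta(x)\}\bigr]
           = E_{x\sim P_x}\bigl[\min\{\eta(x),\, 1-\eta(x)\}\bigr]
\end{align*}
```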