Algorithm
Training Data:
- N samples/instances: $\mathcal{D}^{\text{TRAIN}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$
- Each $x_n \in \mathbb{R}^D$ is called a feature vector
- Each $y_n \in [C] = \{1, 2, \dots, C\}$ is called a label/class/category
- They are used for learning a classifier $f: \mathbb{R}^D \to [C]$ for future prediction
Nearest neighbor: $x_{nn(x)}$, where $nn(x) \in [N] = \{1, 2, \dots, N\}$ is the index of the training instance nearest to $x$: $nn(x) = \operatorname{argmin}_{n \in [N]} \|x - x_n\|_2 = \operatorname{argmin}_{n \in [N]} \sqrt{\sum_{d=1}^{D} (x_d - x_{nd})^2}$
Classification rule: $y = f(x) = y_{nn(x)}$
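A minimal NumPy sketch of this rule (the array names and the helper `nn_predict` are illustrative, not part of the original notes):

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """Nearest neighbor classification: return y_{nn(x)}, the label of the
    training instance closest to x in Euclidean distance.
    X_train: (N, D) feature matrix; y_train: (N,) labels; x: (D,) query."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # ||x - x_n||_2 for all n
    return y_train[np.argmin(dists)]                   # label of the nearest neighbor
```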
Decision boundary: for every point in the space, we can determine its label using the NNC rule. This gives rise to a decision boundary that partitions the space into different regions.
Accuracy: the percentage of data points being correctly classified.
Error Rate: the percentage of data points being incorrectly classified.
Accuracy + Error Rate = 1
Accuracy and Error Rate defined on the training data set: $A^{\text{TRAIN}} = \frac{1}{N} \sum_{n} \mathbb{I}[f(x_n) = y_n]$, $\epsilon^{\text{TRAIN}} = \frac{1}{N} \sum_{n} \mathbb{I}[f(x_n) \neq y_n]$
$\mathbb{I}[\cdot]$ is the indicator function: $\mathbb{I}[\text{true}] = 1$ and $\mathbb{I}[\text{false}] = 0$
For NNC, $A^{\text{TRAIN}} = 100\%$ and $\epsilon^{\text{TRAIN}} = 0$, since each training point is its own nearest neighbor.
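These quantities are one line each in code; a sketch, assuming `nn_predict` from above and arrays `X`, `y` holding the training set:

```python
# Training accuracy/error of NNC: classify every training point, count matches.
# For NNC this gives accuracy 1.0, as each point is its own nearest neighbor.
preds = np.array([nn_predict(X, y, xi) for xi in X])
train_accuracy = np.mean(preds == y)   # (1/N) * sum_n I[f(x_n) = y_n]
train_error = 1.0 - train_accuracy
```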
Test/Evaluation data
- A fresh dataset $\mathcal{D}^{\text{TEST}} = \{(x_1, y_1), \dots, (x_M, y_M)\}$ that does not overlap with the training set
- Test accuracy $A^{\text{TEST}}$ and test error $\epsilon^{\text{TEST}}$ are defined analogously on $\mathcal{D}^{\text{TEST}}$
- Test accuracy and test error are better measurements of performance than training accuracy and training error
Variants of NNC
- Measure nearness with other distances, for example the $\ell_p$ distance: $\|x - x_n\|_p = \left( \sum_{d=1}^{D} |x_d - x_{nd}|^p \right)^{1/p}$, for $p \geq 1$
- K-nearest neighbors: every neighbor votes for its label, and we predict with the majority. As K increases, the decision boundary becomes smoother. (Both variants are sketched in code after the preprocessing notes below.)
- Preprocessing the data
One issue with NNC: distances depend on the units of the features. We can preprocess the data so it looks more "normalized":
- Compute the mean and standard deviation of each feature: $\bar{x}_d = \frac{1}{N} \sum_{n} x_{nd}$, $s_d^2 = \frac{1}{N-1} \sum_{n} (x_{nd} - \bar{x}_d)^2$
- Scale each feature accordingly: $x_{nd} \leftarrow \frac{x_{nd} - \bar{x}_d}{s_d}$
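A sketch combining the two variants above with the scaling step (function names are illustrative; the statistics are computed on the training set only, then applied to both sets):

```python
from collections import Counter
import numpy as np

def standardize(X_train, X_test):
    """Scale each feature to zero mean and unit standard deviation, using
    means and standard deviations computed on the training set."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_train - mean) / std, (X_test - mean) / std

def knn_predict(X_train, y_train, x, k=1, p=2):
    """K-nearest-neighbor prediction with the l_p distance: each of the k
    nearest training instances votes for its label; the majority wins."""
    dists = (np.abs(X_train - x) ** p).sum(axis=1) ** (1.0 / p)
    nearest = np.argsort(dists)[:k]                    # indices of k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]
```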
Hyper-parameters in NNC
- The distance measure (e.g. the parameter $p$ for the $\ell_p$ norm)
- K (how many nearest neighbors to use)
- Different ways of preprocessing the data
Dataset
- Training data $\mathcal{D}^{\text{TRAIN}}$ are used for learning $f$
- Test data $\mathcal{D}^{\text{TEST}}$ are used for assessing how well $f$ will do in predicting the future
- Development/Validation data $\mathcal{D}^{\text{DEV}}$ are used to optimize hyper-parameters
Recipe (a code sketch follows this list)
- For each possible value of the hyper-parameter (e.g. K):
  - Train a model using $\mathcal{D}^{\text{TRAIN}}$
  - Evaluate the performance of the model on $\mathcal{D}^{\text{DEV}}$
- Choose the model with the best performance on $\mathcal{D}^{\text{DEV}}$
- Evaluate the chosen model on $\mathcal{D}^{\text{TEST}}$
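A sketch of this recipe for tuning K, reusing the illustrative `knn_predict` from above:

```python
def tune_k(X_train, y_train, X_dev, y_dev, candidates=(1, 3, 5, 7, 9)):
    """For each candidate K, 'train' on the training set (for k-NN, training
    is just storing the data) and evaluate on the development set; return
    the K with the best development accuracy."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        preds = np.array([knn_predict(X_train, y_train, x, k=k) for x in X_dev])
        acc = np.mean(preds == y_dev)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# The chosen model is then evaluated once on the held-out test set.
```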
S-fold Cross-validation
- Used when we do not have a development set
- Split the training data into S equal parts
- Use each part in turn as the development set, with the other S − 1 parts as the training set
- Choose the hyper-parameter leading to the best average performance
- Use the best hyper-parameter to train a model on all of $\mathcal{D}^{\text{TRAIN}}$ (sketched in code below)
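A sketch of S-fold cross-validation under the same assumptions (again reusing the illustrative `knn_predict`):

```python
def cross_validate_k(X, y, candidates=(1, 3, 5, 7, 9), S=5, seed=0):
    """Split the training data into S parts; each part serves once as the
    development set while the rest serve as the training set. Return the K
    with the best average development accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), S)  # S index blocks
    best_k, best_acc = None, -1.0
    for k in candidates:
        accs = []
        for s in range(S):
            dev = folds[s]
            train = np.concatenate([folds[t] for t in range(S) if t != s])
            preds = np.array([knn_predict(X[train], y[train], x, k=k)
                              for x in X[dev]])
            accs.append(np.mean(preds == y[dev]))
        if np.mean(accs) > best_acc:
            best_k, best_acc = k, float(np.mean(accs))
    return best_k
```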
Advantage of NNC
- Simple, easy to implement
Disadvantages of NNC
- Computationally intensive for large-scale problems: $O(ND)$ per prediction (naively)
- Need to carry the training data around; this type of method is called nonparametric
- Choosing the right hyper-parameters can be involved
Typical steps of developing a machine learning system
- Collect data, split into training, development, and test sets.
- Train a model with a machine learning algorithm. Most often we apply cross-validation to tune hyper-parameters.
- Evaluate using the test data and report performance
- Use the model to predict the future / make decisions.
How good is NNC?
Most standard assumption: every data point $(x, y)$ (from either $\mathcal{D}^{\text{TRAIN}}$ or $\mathcal{D}^{\text{TEST}}$) is an independent and identically distributed (i.i.d.) sample of an unknown joint distribution $P$.
- Often written as $(x, y) \sim P$
Test error of a fixed classifier $f$ is therefore a random variable, and the expectation of the test error is the expected error/mistake of $f$: $\mathbb{E}[\epsilon^{\text{TEST}}] = \mathbb{E}_{(x,y) \sim P}\, \mathbb{I}[f(x) \neq y]$
Test error is a proxy for this expected error. The larger the test set, the better the approximation.
Expected risk
More generally, for a loss function $L(y', y)$, the expected risk of $f$ is defined as $R(f) = \mathbb{E}_{(x,y) \sim P}\, L(f(x), y)$
- $L(y', y) = \mathbb{I}[y' \neq y]$ is called 0-1 loss; under it, $R(f)$ is exactly the expected error above
Bayes optimal classifier: $f^*(x) = \operatorname{argmax}_{c \in [C]} P(c \mid x)$
The optimal risk: $R(f^*) = \mathbb{E}_{x \sim P_x} \left[ 1 - \max_{c \in [C]} P(c \mid x) \right]$, where $P_x$ is the marginal distribution of $x$.
It is easy to show that $R(f^*) \leq R(f)$ for any $f$ (a short derivation follows).
For the special case $C = 2$, let $\eta(x) = P(0 \mid x)$; then $R(f^*) = \mathbb{E}_{x \sim P_x} \left[ \min\{\eta(x), 1 - \eta(x)\} \right]$.
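To justify the "easy to show" step: for 0-1 loss, condition on $x$ and take the expectation over the label (notation as above):

$$
R(f) = \mathbb{E}_{x \sim P_x} \sum_{c \in [C]} P(c \mid x)\, \mathbb{I}[f(x) \neq c]
     = \mathbb{E}_{x \sim P_x} \bigl[ 1 - P(f(x) \mid x) \bigr]
     \geq \mathbb{E}_{x \sim P_x} \Bigl[ 1 - \max_{c \in [C]} P(c \mid x) \Bigr]
     = R(f^*).
$$

For $C = 2$, $1 - \max\{\eta(x), 1 - \eta(x)\} = \min\{\eta(x), 1 - \eta(x)\}$, which recovers the special case above.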