The Perceptron Learning Algorithm (PLA) is one of the simplest Machine Learning (ML) algorithms. This post goes through an elementary example to illustrate how PLA works.
Notes on Perceptron Learning Algorithm
We start with some definitions and an outline of the classification problem; next we move to the plain vanilla perceptron learning algorithm.
Linear Classifier
Consider a set of points $X = \{ x = (1, x_1, \dots, x_d) \} \subset \mathbb{R}^{d+1}$ with first coordinate fixed to 1, and a mapping $f: X \to Y$ where $Y = \{-1, +1\}$, which maps elements of $X$ to two different classes denoted by elements of $Y$. The two classes are assumed to be linearly separable, i.e. there exists a vector $w \in \mathbb{R}^{d+1}$ such that $f(x) = \mathrm{sign}(w * x)$, where $*$ is the dot product on $\mathbb{R}^{d+1}$. For the sake of the following discussion we will define $\mathrm{sign}(0) = +1$. Such a mapping is a linear classifier (on the set $X$). The set of the pairs

$D = \{ (x_n, y_n) : y_n = f(x_n), \; n = 1, \dots, N \}$

is called the training data set.
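To make the definitions concrete, here is a minimal sketch of such a classifier in Python; the function names and the sample numbers are my own illustration, not part of the formal setup:

```python
def sign(s):
    # Convention adopted above: sign(0) = +1.
    return 1 if s >= 0 else -1

def h(w, x):
    """Linear classifier h_w(x) = sign(w * x)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

# A point (1, 3.0) -- first coordinate fixed to 1 -- classified with w = (-2, 1):
print(h([-2.0, 1.0], [1.0, 3.0]))  # prints 1, since -2*1 + 1*3 = 1 > 0
```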
Perceptron Learning Algorithm (PLA)
Given a training data set $D$, the goal of PLA is to find some $\tilde{w}$ that provides correct classification on $D$, i.e. such that for all pairs $(x_n, y_n)$ in the training set $\mathrm{sign}(\tilde{w} * x_n) = y_n$ holds. Obviously, for a linearly separable training data set such a classifier is not unique.
The algorithm is iterative. On each iteration $t$ we will obtain a new candidate value for $\tilde{w}$, which we will denote as $w^{(t)}$, and the corresponding linear mapping $x \mapsto \mathrm{sign}(w^{(t)} * x)$ will be denoted as $h^{(t)}$.
Initialization. Assign $w^{(0)} = 0$.
Step $t$ (for $t = 1, 2, \dots$).
- Compute values of $h^{(t-1)}$ on pairs of the training set until we find a misclassified pair, say, $(x_k, y_k)$ for which $h^{(t-1)}(x_k) \ne y_k$.
- Update the candidate using the rule: $w^{(t)} = w^{(t-1)} + y_k x_k$.
Termination. Stop iterations when all points are correctly classified.
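These steps translate almost directly into code. Below is a minimal sketch in Python; the names are mine, and the loop relies on the linear-separability assumption (on a non-separable set it would never terminate):

```python
def sign(s):
    return 1 if s >= 0 else -1  # convention: sign(0) = +1

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pla(training_set):
    """Perceptron Learning Algorithm on a list of (x, y) pairs,
    where each x = (1, x_1, ..., x_d) and y is -1 or +1."""
    w = [0.0] * len(training_set[0][0])  # initialization: w^(0) = 0
    while True:
        # Scan for a misclassified pair (x_k, y_k).
        bad = next(((x, y) for x, y in training_set if sign(dot(w, x)) != y), None)
        if bad is None:
            return w  # termination: all points classified correctly
        x, y = bad
        w = [wi + y * xi for wi, xi in zip(w, x)]  # update: w^(t) = w^(t-1) + y_k * x_k
```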
The intuition behind the update method (the second item above) is that it adjusts $w^{(t)}$ to improve its performance on the selected misclassified point:

$w^{(t)} * x_k = (w^{(t-1)} + y_k x_k) * x_k = w^{(t-1)} * x_k + y_k (x_k * x_k),$

and since always $x_k * x_k > 0$, due to the first non-zero coordinate, the value of $w^{(t)} * x_k$ is closer to having the same sign as $y_k$ than $w^{(t-1)} * x_k$. Since each step improves classification only on a single training pair (and may actually make classification performance worse on some or all other pairs), it is not obvious why the PLA converges to a solution of the linear classification problem. The proof will be discussed in one of the future posts. Here we will use an extremely simple example to illustrate the PLA.
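As a quick numeric check of this identity, take an arbitrary candidate and a point it misclassifies (the numbers here are purely illustrative):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

w_prev = [0.5, 1.0]    # illustrative current candidate w^(t-1)
x, y = [1.0, 2.0], -1  # w_prev * x = 2.5 > 0, so this point is misclassified
w_next = [wi + y * xi for wi, xi in zip(w_prev, x)]  # PLA update

# w_next * x == w_prev * x + y * (x * x):  -2.5 == 2.5 + (-1) * 5.0
print(dot(w_next, x), dot(w_prev, x) + y * dot(x, x))
# y * (w_next * x) = 2.5 is larger than y * (w_prev * x) = -2.5,
# i.e. the update moved w_next * x toward the sign of y by x * x = 5.
```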
Example: One-dimensional Training Data Set.
The training data set consists of two training points, say,

$(x_1, y_1) = ((1, -1), -1)$ and $(x_2, y_2) = ((1, \tfrac{1}{2}), +1).$
While this is probably the simplest example that can be devised to illustrate the PLA, it shows the idea behind it in a clear, unobstructed way. Executing the PLA step by step, we first initialize $w^{(0)} = (0, 0)$.
Iteration 1. We obtain:

$h^{(0)}(x_1) = \mathrm{sign}(w^{(0)} * x_1) = \mathrm{sign}(0) = +1$;
$h^{(0)}(x_2) = \mathrm{sign}(w^{(0)} * x_2) = \mathrm{sign}(0) = +1$.
Note that $h^{(0)}$ classifies the entire range of real values of $x$ as +1. Out of the two training data points only the first one is misclassified: the desired value is $y_1 = -1$ while $h^{(0)}$ returns +1. This results in an update:

$w^{(1)} = w^{(0)} + y_1 x_1 = (0, 0) - (1, -1) = (-1, 1).$
Iteration 2. With the new candidate $w^{(1)}$ we obtain the following values on the data points:

$h^{(1)}(x_1) = \mathrm{sign}((-1, 1) * (1, -1)) = \mathrm{sign}(-2) = -1$;
$h^{(1)}(x_2) = \mathrm{sign}((-1, 1) * (1, \tfrac{1}{2})) = \mathrm{sign}(-\tfrac{1}{2}) = -1$.
As a side note, the decision boundary between the two classes is now located at $x = 1$ (while for $h^{(0)}$ it was at $-\infty$).
The value of the decision boundary is obtained by solving the equation

$w^{(1)} * (1, x) = 0,$

or, equivalently, $-1 + x = 0$.
The misclassified point $x_2$ leads to an update:

$w^{(2)} = w^{(1)} + y_2 x_2 = (-1, 1) + (1, \tfrac{1}{2}) = (0, \tfrac{3}{2}).$

That moves the decision boundary to the root of the equation $\tfrac{3}{2} x = 0$, i.e. to $x = 0$, and results in all training data points being classified correctly.
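For completeness, the two iterations can be traced in code; this standalone sketch reuses the illustrative data values chosen above:

```python
def sign(s):
    return 1 if s >= 0 else -1  # convention: sign(0) = +1

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

training_set = [([1.0, -1.0], -1), ([1.0, 0.5], +1)]  # the two points from above

w = [0.0, 0.0]  # w^(0)
for t in (1, 2):
    # Pick the (here unique) misclassified pair and apply the update.
    x, y = next((x, y) for x, y in training_set if sign(dot(w, x)) != y)
    w = [wi + y * xi for wi, xi in zip(w, x)]
    print(f"w^({t}) = {w}")  # w^(1) = [-1.0, 1.0], w^(2) = [0.0, 1.5]

assert all(sign(dot(w, x)) == y for x, y in training_set)  # all classified correctly
```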
References
The PLA is covered by many books and articles. The AML Book listed below is easily accessible and well written, and has a corresponding MOOC.
[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.