## Notes on Perceptron. Part 1: The Perceptron Learning Algorithm.

The Perceptron Learning Algorithm (PLA) is one of the simplest Machine Learning (ML) algorithms. This post goes through an elementary example to illustrate how PLA works.

# Notes on Perceptron Learning Algorithm

We start with some definitions and classification problem outline, next we move to the plain vanilla perceptron learning algorithm.

## Linear Classifier

Consider set of points $X=\{x_i \in R^{n+1}, i=1, \dots ,N\}$ with first coordinate fixed to 1 and mapping $X \to Y$ where $Y=\{-1,1\}$, which maps elements of $X$ to two different classes denoted by elements of $Y$.  The two classes are assumed to be linearly separable, i.e. there exists vector $W \in R^{n+1}$ such that $Y(x) = sign(W * x)$ where * is dot product on $R^{n+1}$. For the sake of the following discussion we will define $sign(0) = 1$. Such mapping $Y$ is a linear classifier (on the set $X$). The set of the pairs

$T=\{ (x_i, y_i)$: $y_i=Y(x_i), x_i \in X, i=1, \dots ,N\}$

is called the training data set.

## Perceptron Learning Algorithm (PLA)

Given a training data set $T$, the goal of PLA is to find some $W$ that provides correct classification on $T$, i.e. for all pairs in the training set  holds $Y(x_i) = sign(W * x_i) = y_i$. Obviously, for a linearly separable training data sets such classifier is not unique.

The algorithm is iterative. On each iteration we will obtain new candidate value for $W$, which we will denote as $w_k$ and the corresponding linear mapping will be denoted as $Y_k(x)=w_k * x$.

Initialization. Assign $w_0 = 0$.

Step $k$.

1. Compute values of $Y_k(x_i)$ pairs on the training set until we find a misclassified pair, say, for $i$, $Y_k(x_i) \neq y_i$.
2. Update the candidate using the rule: $w_{k+1} = w_k + y_ix_i$.

Termination. Stop iterations when all points are correctly classified.

The intuition behind the update method (item 2) is that it adjusts $w(k)$ to improve its performance on the selected misclassified point:

$Y_{k+1}(x_i) = w_{k+1} * x_i$

$= (w_k + y_ix_i)*x_i$

$= Y_k(x_i) + y_i|x_i|^2$

and since always $|x_i|^2 > 0$, due to the first non-zero coordinate, the value of $Y_{k+1}(x_i)$ is closer to having the same sign as $y_i$ than $Y_{k}(x_i)$. Since each step improves only on a single training pair (and may actually make classification performance worse on some or all other pairs), it is not obvious why the PLA converges to a solution of the linear classification problem. The proof will be discussed in one of the future posts. Here we will use an extremely simple examples to illustrate the PLA.

## Example: One-dimensional Training Data Set.

The training data set consists of two training points:

$x_1 = (1,2), y_1 = -1$;

$x_2 = (1,-0.25), y_2 = 1$

also shown below:

Figure 1. Training set. First coordinate of x is not shown (=1).

While this is probably the simplest example that can be devised to illustrate the PLA, it shows the idea behind it in a clear unobstructed way. By executing the PLA step-by-step we will first initialize $w_0 = (0,0)$.

Iteration 1. We obtain:

$Y_0(x_1) = sign(w_0 * x_1) = sign(0 * 1 + 0 * 2) = sign(0) = 1$;

$Y_0(x_2) = sign(w_0 * x_2) = sign(0 * 1 + 0 * ( - 0.25)) = sign(0) = 1$.

Note that $Y_0$ classifies the entire range of real values of $x$ as +1. Out of the two training data points only the first one $x_1$ is misclassified: the desired value is -1 while $Y_0(x_1)$ returns +1. This results in an update:

$w_1 = w_0 + (-1)*(1, 2) = (-1, -2)$

Iteration 2. With the new candidate $Y_1(x)=w_1*x$ we obtain the following values on the data points:

$Y_1(x_1) = sign(w_0 * x_1) = sign( ( - 1) * 1 + ( - 2) * 2) = sign(-5) = -1$;

$Y_1(x_2) = sign(w_0 * x_2) = sign( ( - 1) * 1 + ( - 2) * ( - 0.25)) = sign(-0.5) = - 1$.

As a side note, the decision boundary between the two classes is now located at $x = -0.5$ (while for $Y_0$ it was at -∞):

Figure 2. Decision boundary after first iteration.

The value $x = -0.5$ of decision boundary is obtained by solving equation:

$w_0 * x = 0$ or, equivalently, $(-1) * 1 + ( - 2) * x = 0$.

The misclassified point $x_2 = (1, - 0.25)$ leads to an update:

$w_2 = w_1 + 1*(1, -0.25) = (0, -2.25)$

That moves decision boundary to the root of the equation $0 * 1 + x * (-2.25) = 0$ i.e. to $x = 0$ and results in all training data points classified correctly:

Figure 3. Second iteration moves decision boundary to x=0. Both training data points are classified correctly.

## References

The PLA is covered by many books and articles. The AML Book listed below is easily accessible, well written and has a corresponding MOOC.

[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data”  AMLBook, 213 pages, 2012.