## Notes on Perceptron. Part 4: Convergence Theorem

This post discusses an outline of the the PLA convergence theorem. While the algorithm convergence is not obvious, its proof hinges only on two key inequalities.

The Perceptron Learning Algorithm was outlined in this earlier post.

## Notations and Convergence Conditions

The input vectors $x$ in the input set are (m+1)-dimensional and have first coordinate fixed to 1. The vector defining linear functional on the space of input vectors is denoted by $w$, it is also (m+1) dimensional and its first component is called bias. Dot product on the linear space at hands is defined in the standard (Euclidean) fashion and is denoted as $x*w$.

Each vector in the training set of vectors $x_i$, $i=1,n$ is assigned to one of the two classes denoted by +1 and -1. Let’s denote assignment mapping $C(x):x \to \{+1,-1\}$. The assignment is linearly separable if there exists $w^*$ such that $sign(x_i*w^*)=C(x_i)$.

By assuming linear separability achieved with vector $w^*$, we can define a quantity $a$ that can be treated as the minimal separation from the decision boundary:

(1) $a=$ min $C(x_i) (x_i*w^*)$.

We will also assume that the PLA starts iterations with $w(0)=0$ and vector $w$ updated to the i-th iteration is denoted as $w(i)$.

## Proof Idea

The main idea of the proof is to obtain two bounds on the growth of the squared norm $|w(n)|^2$.

• The first (quadratic lower) bound is obtained directly from counting updates and showing that the squared norm in question grows at least quadratically with the number of iterations. This does not depend on the particular choice of the update vector $x_i$ used to update $w(n)$.
• The second (linear upper) boundd is obtained with the direct use of the update rule: $x_i$ is picked because it is misclassified and such a choice makes $w(n)$ be more “aligned” with $w^*$ which limits the growth of $w(n)$ from above linearly.

A combination of both bounds gives the desired result and an estimate on maximum number of iterations required for PLA to converge.

## Convergence Theorem Proof

Unfolding $w(n+1)$ using the update rule we get:

(2) $w(n+1)=\sum_{i=1,n}C(x_i)x_i$,

when multiplied by $w^*$ and using (1) this gives:

(3) $w(n+1)*w^*=\sum_{i=1,n}C(x_i)x(i)*w^*\ge n a$

Next we use Cauchy-Schwarz inequality to obtain:

(4) $|w(n+1)|^2|w^*|^2 \ge (w(n+1)*w^*)^2$

that, by combining (3) and (4),  leads to:

(5) $|w(n+1)|^2 \ge \ n^2 a^2/|w^*|^2$

which gives the first component of the proof: the squared norm of the vectors $w(n)$ during update keeps at least as fast as $n^2$.

To obtain the bound on the growth of  $|w(n+1)|^2$ we first consider equalities:

(6) $w(k+1)=w(k)+C(x_k)x_k$

(7) $|w(k+1)|^2=|w(k)|^2+|x_k|^2+2C(x_k)x(k)*w(k)$

where (6) is by definition and (7) is expanded using Euclidean metric. By the definition of PLA the vector $x(k)$ was picked because it was misclassified hence the sign on f the last term in (7) us negative since $x(k)*w(k)$ is of opposite sign of $C(x_k)$. This leads to inequalities:

(8) $|w(k+1)|^2 \le |w(k)|^2+|x_k|^2$

and

(9) $|w(k+1)|^2 - |w(k)|^2 \le |x_k|^2$

Summation of inequality (9) for k in range from 0 to n with initial condition $w(0)=0$ results in:

(10) $|w(n+1)|^2 \le \sum_{i=1,n}|x_k|^2 \le n b$

where $b=$ max $|x_i|^2$ denotes maximum of squared norm of vectors $x_i, i=1, \dots ,n$.

The inequality (10) gives the second component of the proof: the growth of squared norm of the vectors $w(n)$ is limited from above by linear function of n.

Finally, it immediately follows from (5) and (10) that the algorithm has to stop. The maximum iteration number $n_*$ for stopping can be obtained from the equality: $n^2 a^2/|w^*|^2 = n b$ where both estimates of growth are combined, i.e.

(12) $n_* = b |w^*|^2/ a^2$

Qualitatively, the smaller minimal separation $a$ of the training data, the longer it may take to converge. In practice the number of iterations required for PLA to converge can be considerably smaller than the estimate above. However, neither $a$ or $|w^*|^2$ are known in advance for an arbitrary dataset so there is no easy way to calculate the estimate (12) upfront, without running the algorithm, or even conclude if it will converge or not.

The proof above was borrowed from [1] and is close to what [2] suggests as a series of steps in an exercise for Chapter 1.

## References

[1] S.Haykin “Neural Networks: A Comprehensive Foundation”, 19989, Prentice-Hall, 842 pages.

[2] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data”  AMLBook, 213 pages, 2012.