## Notes on Perceptron. Part 5: Adaline

The adaptive linear neuron (or element), aka Adaline, is a variation of the PLA in which the update is scaled by the magnitude of the mismatch. While not universally the best choice of classifier learning algorithm, Adaline can be viewed as an improvement over the original PLA.

## Notation and PLA error measure

Training vectors $x_k, k=1,\dots,N$ are elements of $X=R^m$, and they are classified into two classes $C=\{-1,1\}$, so that the training set is the set of pairs $T=\{(x_k,c_k): x_k \in X, c_k \in C, k=1,\dots,N\}$. Let's define a linear functional $F(x)=w*x$ on the space $X$, where $*$ denotes the dot product, which by definition is $w*x=w^Tx$. The class $c(x)$ of $x$ is defined as $-1$ if $F(x) \le 0$ and $+1$ otherwise. An error measure on the training set for a given $w$ can be defined in many ways. The PLA can be treated as if it measures error as the number of misclassified samples. Using different measures leads to modified update rules that still resemble the PLA update rule but may offer an advantage.
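The classifier defined above is a direct transcription into code (the function names below are illustrative, not from the post's codebase):

```python
import numpy as np

def F(w, x):
    '''The linear functional F(x) = w*x (a dot product).'''
    return float(np.dot(w, x))

def c_of(w, x):
    '''The class rule: -1 if F(x) <= 0, +1 otherwise.'''
    return -1 if F(w, x) <= 0 else 1
```

Note that the boundary case $F(x)=0$ is assigned to class $-1$, matching the rule above.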

## The $L_2$ Error

Let's define the $L_2$ error measure $E(T)$ on the training dataset $T$ as follows:

(1) $E(T)=\sum_{k=1,N}(c_k -F(x_k))^2$

Note that $E(T)$ does not involve the mapping $c:R \to \{-1,1\}$ but uses the value of $F(x_k)$ on the training vector $x_k$ directly. Minimizing (1) is equivalent to fitting a linear regression solution to the classification problem. While this may not be the best approach in general, it can still lead to reasonable update rules that make PLA more efficient.
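The error measure (1) is a one-liner in NumPy. A minimal sketch, assuming `X` stacks the training vectors $x_k$ as rows and `c` holds the labels $c_k$ (names are illustrative):

```python
import numpy as np

def l2_error(w, X, c):
    '''Squared error (1): sum over k of (c_k - F(x_k))^2.'''
    residuals = c - X @ w          # c_k - F(x_k) for every k at once
    return float(np.sum(residuals ** 2))
```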

An iterative solution for linear regression requires computing the gradient of (1) with respect to $w$ and updating $w$ on every step using that gradient scaled by some learning rate $\eta$. The gradient calculation is straightforward:

(2) $\frac{\partial E(w)}{\partial w_j} = -2\sum_{k=1,N}(c_k -F(x_k))\frac{\partial F(x_k)}{\partial w_j} = -2\sum_{k=1,N}(c_k -x_k*w)x^j_k$

where $j=1,\dots,m$ is coordinate index in $X=R^m$. The coordinate-wise formula (2) is equivalent to the following one in more compact vector notation:

(3) $\nabla E(w)=-2\sum_{k=1,N}(c_k -x_k*w)x_k$
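In NumPy the vector form (3) is again a one-liner; as before, `X` stacks the $x_k$ as rows and `c` holds the labels (a sketch with illustrative names):

```python
import numpy as np

def l2_gradient(w, X, c):
    '''Gradient (3): -2 * sum over k of (c_k - x_k*w) x_k.'''
    return -2.0 * X.T @ (c - X @ w)
```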

The general gradient descent update with learning rate $\eta$ is:

(4) $w(n+1)=w(n)-\eta \nabla E(w(n))$
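As a minimal sketch, the full-batch descent loop looks as follows; note that each step adds $-\eta$ times the gradient to the current weights $w(n)$. The data layout and the parameter defaults are illustrative assumptions:

```python
import numpy as np

def fit_l2(X, c, eta=0.01, epochs=100):
    '''Full-batch gradient descent on the squared error (1).'''
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = -2.0 * X.T @ (c - X @ w)  # gradient (3)
        w = w - eta * grad               # descent step
    return w
```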

If we focus on a single misclassified sample $k$ (i.e. a single term in the summation (3)) instead of the full-batch update (4), we get the update rule:

(5) $w(n+1)=w(n)+2\eta (c_k -w(n)*x_k)x_k$

This is a variation of Adaline (from Adaptive Linear Element or Neuron) that can be viewed as a variant of the PLA (see post). The main differences are the introduction of a learning rate and the use of the current mismatch $(c_k -w(n)*x_k)$ between the target class and the classifier output, which amplifies updates for samples with larger deviation from the target.

An update to our running PLA code example (again see this post) to support the Adaline modification is nearly trivial:

```python
class AdalineCLA(PLA):
    '''
    Adaptive linear element Classifier Learning Algorithm (CLA).
    The algorithm approximately minimizes square error of classification.
    '''
    def __init__(self, training_data, learning_rate=1):
        '''In addition to training data, constructor takes a learning
        rate parameter, a multiplier that replaces 1 in the PLA.'''
        self.n_points = training_data.shape[0]
        self.learning_rate = learning_rate
        self.curr_values = np.ones(self.n_points)
        PLA.__init__(self, training_data)

    def update_error_measure(self):
        '''In addition to PLA, updates and stores the value of the linear
        functional used in the classifier.'''
        self.classifier.values_on(self.data, self.curr_values)
        PLA.update_error_measure(self)

    def update_classifier(self):
        '''Updates classifier's decision vector using the Adaline rule
        for the currently selected misclassified sample.'''
        delta_vect = self.data[self.update_idx, :-1]
        curr_value = self.curr_values[self.update_idx]
        target_class = self.data[self.update_idx, -1]
        multiplier = 2*abs(target_class - curr_value)
        self.classifier.vect_w[:-1] += target_class * \
            multiplier * \
            self.learning_rate * \
            delta_vect
```


The selection rule for the update index remains the same. We used the abs function to emphasize the source of the actual sign of the update: it comes from the target class, just like in PLA. Making a batch version of Adaline (class BatchAdalineCLA sub-classed from AdalineCLA) is also straightforward:

```python
    def update_classifier(self):
        '''Batch version: updates classifier's decision vector by
        accumulating error from all misclassified samples.'''
        for idx in self.misclassified:
            delta_vect = self.data[idx, :-1]
            curr_value = self.curr_values[idx]
            target_class = self.data[idx, -1]
            multiplier = 2*abs(target_class - curr_value)
            self.classifier.vect_w[:-1] += target_class * \
                multiplier * \
                self.learning_rate * \
                delta_vect
```


We leave out the obvious changes to the super-classes needed to keep the list of misclassified samples (or compute it dynamically). In the batch version the contributions from all misclassified samples are accumulated during a single epoch update. This affects the sensible choice of learning rate: it has to be smaller than the one for by-sample Adaline. An obvious rule of thumb is to make the learning rate proportional to 1/n_points. Proper selection of the learning rate may influence convergence significantly, with a trade-off between convergence speed and proximity of the result to the optimal (in the $L_2$ sense) solution. We will leave a comparison of the various flavors of PLA to one of the next posts. With a good learning rate selection, batch Adaline can give a significant speed-up over other variations of PLA. The following example converged in only 10 epochs, which is much faster than the 38 epochs for the same data in one of the previous posts (second example). Click on the thumbnail to see the gif animation:
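To illustrate the 1/n_points rule of thumb, here is a standalone sketch of one possible batch Adaline (not the post's PLA-based classes; all names and defaults are illustrative). Each epoch accumulates the per-sample update $2\eta(c_k - w*x_k)x_k$ over the currently misclassified samples, with $\eta$ set to 1/n_points:

```python
import numpy as np

def batch_adaline(X, c, epochs=50):
    '''One possible batch Adaline: rows of X are the training vectors,
    c holds the labels in {-1, +1}.'''
    n_points = X.shape[0]
    eta = 1.0 / n_points           # rule-of-thumb learning rate
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        values = X @ w                                 # F(x_k) for all k
        predicted = np.where(values <= 0.0, -1.0, 1.0)
        mis = predicted != c                           # misclassified mask
        if not mis.any():
            break                                      # training set separated
        # accumulate 2*eta*(c_k - w*x_k)*x_k over misclassified samples
        w = w + 2.0 * eta * X[mis].T @ (c[mis] - values[mis])
    return w
```

On a small linearly separable set this converges in a handful of epochs, consistent with the speed-up described above.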