
Notes on Perceptron. Part 5: Adaline

Adaptive linear neuron (or element), aka Adaline, can be viewed as a variation of PLA in which the update is scaled by the magnitude of the mismatch between the current output and the target class. While not universally the best choice of a classifier learning algorithm, it is an improvement over the original PLA.

Notation and PLA error measure

Training vectors x_k, k=1,\dots,N are elements of X=R^m and are assigned to one of two classes C=\{-1,1\}, so that the training set is the set of pairs T=\{(x_k,c_k): x_k \in X, c_k \in C, k=1,\dots,N\}. Let's define a linear functional F(x)=w*x on the space X, where * denotes the dot product, w*x=w^Tx. The class c(x) of x is defined as -1 if F(x) \le 0 and +1 otherwise. An error measure on the training set for a given w can be defined in many ways. The PLA can be treated as if it measures error as the number of misclassified samples. Using a different measure leads to a modified update rule that still resembles the PLA update rule but may give an advantage.
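For illustration, the decision rule above is a couple of lines of numpy (the arrays here are made-up values, not part of the PLA code from the earlier posts):

import numpy as np

# Made-up example with m = 2: a decision vector w and a training vector x.
w = np.array([0.5, -1.0])       # defines the linear functional F(x) = w*x
x = np.array([2.0, 0.3])        # a training vector from X = R^m

F = np.dot(w, x)                # F(x) = w^T x
c = -1 if F <= 0 else 1         # class: -1 if F(x) <= 0, +1 otherwise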

The L_2 Error

Let's define the L_2 error measure E(T) on the training dataset T as follows.

(1) E(T)=\sum_{k=1,N}(c_k -F(x_k))^2

Note that E(T) does not involve the mapping c:R \to \{-1,1\} but uses the value of F(x_k) on the training vector x_k directly. Minimizing (1) is therefore equivalent to fitting a linear regression solution to the classification problem. While this may not be the best approach in general, it still leads to a reasonable update rule that makes PLA more efficient.
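As a sketch of that connection (a toy example, assuming the data layout used by the code later in this post: rows are training vectors with the class label in the last column), the w minimizing (1) is just the ordinary least-squares solution:

import numpy as np

# Toy training set: each row is a training vector x_k followed by its class c_k.
data = np.array([[ 1.0,  2.0,  1.0],
                 [-1.5,  0.5, -1.0],
                 [ 2.0, -0.3,  1.0]])
X = data[:, :-1]                # training vectors x_k
c = data[:, -1]                 # target classes c_k

def l2_error(w):
    '''The L_2 error (1) for a given decision vector w.'''
    return np.sum((c - X.dot(w)) ** 2)

# Minimizing (1) is an ordinary least-squares problem.
w_opt = np.linalg.lstsq(X, c, rcond=None)[0]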

An iterative solution for linear regression requires computing the gradient of (1) with respect to w and updating w at every step using that gradient scaled by some learning rate \eta. The gradient calculation is straightforward:

(2)  \frac{\partial E(w)}{\partial w_j} = -2\sum_{k=1,N}(c_k -F(x_k))\frac{\partial F(x_k)}{\partial w_j} = -2\sum_{k=1,N}(c_k -x_k*w)x^j_k

where j=1,\dots,m is the coordinate index in X=R^m. In more compact vector notation, the coordinate-wise formula (2) is equivalent to:

(3) \nabla E(w)=-2\sum_{k=1,N}(c_k -x_k*w)x_k

The general gradient descent update with learning rate \eta is:

(4) w(n+1)=w(n)-\eta \nabla E(w(n))
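A bare-bones numpy sketch of (3) and (4) (not the class-based code used in this series; X is the N-by-m matrix of training vectors, c is the vector of target classes, and the defaults for eta and n_steps are arbitrary illustration values):

import numpy as np

def gradient(w, X, c):
    '''The gradient (3) of the L_2 error with respect to w.'''
    return -2.0 * X.T.dot(c - X.dot(w))

def batch_gradient_descent(X, c, eta=0.01, n_steps=100):
    '''Repeats the full-batch update (4) n_steps times.'''
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - eta * gradient(w, X, c)
    return w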

If we focus on a single misclassified sample k (i.e., a single term in the sum (3)) instead of the full batch update (4), we get the update rule:

(5) w(n+1)=w(n)+2\eta (c_k -w(n)*x_k)x_k

This is a variation of Adaline (from Adaptive Linear Element, or Neuron) that can be viewed as a variant of PLA (see the earlier post). The main differences are the introduction of the learning rate and the use of the current mismatch (c_k -w(n)*x_k) between the output and the target class, which amplifies the updates for samples with larger deviation from the target.
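A single application of (5) is a one-liner in numpy (again just a sketch, with X and c as in the earlier toy example, not the class-based code below):

import numpy as np

def adaline_step(w, k, X, c, eta):
    '''One by-sample Adaline update (5) using training vector k.'''
    return w + 2.0 * eta * (c[k] - np.dot(w, X[k])) * X[k]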

Adaline Python code

An update to our running PLA code example (again see this post) to support the Adaline modification is nearly trivial:

import numpy as np

# PLA is the perceptron learning algorithm class from the earlier posts in this series.

class AdalineCLA(PLA):
    '''
    Adaptive linear element Classifier Learning Algorithm (CLA).
    The algorithm approximately minimizes square error of classification.
    '''
    def __init__(self, training_data, learning_rate=1):
        '''In addition to training data constructor takes learning
        rate parameter, a multiplier that replaces 1 in the PLA.'''
        self.n_points = training_data.shape[0]
        self.learning_rate = learning_rate
        self.curr_values = np.ones(self.n_points)
        PLA.__init__(self, training_data)

    def update_error_measure(self):
        '''In addition to PLA, updates and stores the value of linear functional 
        used in the classifier.'''
        self.classifier.values_on(self.data, self.curr_values)
        PLA.update_error_measure(self)

    def update_classifier(self):
        '''Updates classifier's decision vector.'''
        delta_vect = self.data[self.update_idx, :-1]
        curr_value = self.curr_values[self.update_idx]
        target_class = self.data[self.update_idx, -1]
        multiplier = 2*abs(target_class - curr_value)
        self.classifier.vect_w[:-1] += target_class * \
                                       multiplier * \
                                       self.learning_rate * \
                                       delta_vect

The selection rule for the update index remains the same. We used the abs function to emphasize the source of the actual sign of the update: it comes from the target class, just like in PLA. Indeed, for a misclassified sample the current value w(n)*x_k lies on the wrong side of zero, so the difference (c_k - w(n)*x_k) has the sign of c_k and equals c_k |c_k - w(n)*x_k|, which is exactly the factor used in the code. Making a batch version of Adaline (a class BatchAdalineCLA subclassed from AdalineCLA) is also straightforward:

    def update_classifier(self):
        '''Batch version: updates classifier's decision vector by
        accumulating error from all misclassified samples.'''
        for idx in self.misclassified:
            delta_vect = self.data[idx, :-1]
            curr_value = self.curr_values[idx]
            target_class = self.data[idx, -1]
            multiplier = 2*abs(target_class - curr_value)
            self.classifier.vect_w[:-1] += target_class * \
                                           multiplier * \
                                           self.learning_rate * \
                                           delta_vect

We leave out the obvious changes to the super-classes needed to keep the list of misclassified samples (or compute it dynamically). In the batch version the contributions from all misclassified samples are accumulated during a single epoch update. This affects the sensible choice of learning rate: it has to be smaller than the one for by-sample Adaline. An obvious rule of thumb is to make the learning rate proportional to 1/n_points. Proper selection of the learning rate may influence convergence significantly, with a trade-off between convergence speed and proximity of the result to the optimal (in the L_2 sense) solution. We will leave the comparison of various flavors of PLA to one of the next posts. With a good learning rate selection batch Adaline can give a significant speed-up over other variations of PLA. The following example converged in only 10 epochs, which is much faster than the 38 epochs for the same data in one of the previous posts (the second example). Click on the thumbnail to see the gif animation:

[Animation: Batch Adaline]
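As a usage sketch of the 1/n_points rule of thumb (training_data here stands for a numpy array with the class label in the last column, as the constructors above expect; the loop that actually runs the epochs is part of the PLA code from the earlier posts and is not shown here):

n_points = training_data.shape[0]
batch_adaline = BatchAdalineCLA(training_data,
                                learning_rate=1.0 / n_points)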

References

For a short high-level description see Wikipedia [1]. The book [2] discusses Adaline as an exercise and treats it as a variation of PLA; this is the approach we followed while implementing Adaline in Python.

[1] http://en.wikipedia.org/wiki/ADALINE
[2] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.