## Notes on Perceptron. Part 6: Experimenting with Learning Rate

Here we compare Pocket PLA, Adaline, and some Adaline variations with an adjustable learning rate, using artificial training data.

Choosing a learning rate is not a trivial task when dealing with Adaline, or with any gradient method for that matter.

For the following set of experiments we used artificial training data generated by the algorithm from the previous post and the following variations of PLA:

1. Pocket PLA,
2. Adaline with fixed learning rate,
3. Adaline with a non-increasing learning rate (see below),
4. Adaline with a variable learning rate that can both decrease and grow depending on the progress (see below).

We will use this enumeration of PLA variants in the legends of the plots that follow.

We have chosen a (rather arbitrary) rule to set the “base learning rate” (i.e., the initial value before we attempt to “improve” it) to $\theta_0=\log(M)/N$, where $M$ is the dimension of the input space and $N$ is the number of training data points.
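As a quick sanity check, the base rate is easy to compute; the sketch below assumes a natural logarithm (the base is not specified above) and an illustrative function name:

```python
import math

def base_learning_rate(M, N):
    """Base learning rate theta_0 = log(M) / N for input space
    dimension M and N training data points (natural log assumed)."""
    return math.log(M) / N

# E.g. for the 100-dimensional, 1000-point experiments below:
theta0 = base_learning_rate(100, 1000)   # roughly 0.0046
```

Note that the rule grows only logarithmically with the dimension but shrinks linearly with the number of training points.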

Adaline modifications 3 and 4 introduced here work as follows.

The “non-increasing learning rate” variation (3) is only allowed to reduce the learning rate, multiplying it by a fixed factor less than 1 after an epoch that did not increase the number of misclassified samples. That logic corresponds to the following Python code:

```python
class MALRBatchAdalineCLA(perceptron.BatchAdalineCLA):
    '''Batch Adaline with a non-increasing learning rate
    (hence the 'MALRB' prefix of the class name).'''

    def __init__(self, training_data, learning_rate=1, learning_rate_mult=0.8):
        super().__init__(training_data, learning_rate)
        self.learning_rate_mult = learning_rate_mult

    def update_error_measure(self):
        '''Also adjusts the learning rate: reduces it if the last
        epoch did not increase the number of misclassified samples
        above the best count seen so far.'''
        super().update_error_measure()
        # Decrease learning rate as needed:
        if self.epoch_id > 2:
            if self.progress[-1] <= min(self.progress[:-1]):
                self.learning_rate *= self.learning_rate_mult
```

The “variable learning rate” version (4) is allowed to adjust the rate both ways, depending on the progress made during the last epoch. Here is the corresponding Python code:

```python
    def update_error_measure(self):
        '''Also adjusts the learning rate, in either direction,
        depending on the progress of the last epoch.'''
        super().update_error_measure()
        if self.epoch_id > 2:
            if self.progress[-1] <= min(self.progress[:-1]):
                # Improvement: decrease learning rate.
                self.learning_rate *= self.learning_rate_mult
            else:
                # No improvement: increase it back.
                self.learning_rate /= self.learning_rate_mult
```
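Stripped of the perceptron machinery, the two adjustment rules can be sketched as a single pure function over the progress history (the per-epoch misclassification counts); the function and parameter names here are illustrative, not from the actual module:

```python
def adjust_rate(rate, progress, mult=0.8, allow_increase=False):
    """Apply the learning-rate rule to `rate` given the per-epoch
    misclassification counts in `progress`.

    Variation (3): only shrink the rate when the last epoch did not
    raise the count above the best seen so far (allow_increase=False).
    Variation (4): additionally grow the rate back otherwise
    (allow_increase=True).
    """
    if len(progress) <= 2:          # mirrors `self.epoch_id > 2`
        return rate
    if progress[-1] <= min(progress[:-1]):
        return rate * mult          # shrink the step
    elif allow_increase:
        return rate / mult          # grow the step back
    return rate

# A trace whose last epoch is the best so far shrinks the rate:
adjust_rate(1.0, [50, 40, 35])                        # → 0.8
# A worsening epoch leaves variation (3) alone but grows variation (4):
adjust_rate(1.0, [50, 40, 45])                        # → 1.0
adjust_rate(1.0, [50, 40, 45], allow_increase=True)   # → 1.25
```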


### Experiments

I used a fixed seed for the random classifier used for data generation across the experiments (seed1=11204, not shown on the plots) but varied the seed used for the training data points (“seed2” in the plot text). The progress data (the number of misclassified training samples per epoch) was generated in Python using the TaskManager from the previous post (it takes a while to run PLA on 1 million points, so parallelizing the work helps). I used R to render the plots: the epoch number along the x axis and the log of the number of misclassified samples along the y axis. As a side note, I found separating data generation from plotting quite handy: with the data sitting on disk ready to use, it is easy to adjust plotting options or even switch to a different tool.
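That decoupling amounts to dumping each run's progress to disk and plotting separately; here is a minimal sketch of the idea (the file layout and column names are my own, not from the original code):

```python
import csv

def save_progress(path, run_label, progress):
    """Write per-epoch misclassification counts to a CSV file that R
    (or any other plotting tool) can pick up later: one row per epoch."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["run", "epoch", "misclassified"])
        for epoch, count in enumerate(progress, start=1):
            writer.writerow([run_label, epoch, count])

# Hypothetical progress trace for one run:
save_progress("progress_run1.csv", "PocketPLA", [120, 90, 75, 75, 60])
```

In R the file can then be read with `read.csv("progress_run1.csv")` and plotted however one likes, without re-running the (slow) training.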

The experiments show a variety of behaviors, mostly demonstrating how gradient methods can fail if the optimization step is not adjusted properly. Click on the thumbnails to see full-size images.

Figures 1-3. Number of training points: 1000, 10000 and 1000000.

On Figure 1 (above left) Pocket PLA (1) steadily improves but does not converge after 300 epochs, even though the training data is separable (margin=0.0). The fixed-rate (2) and non-increasing (3) Adaline variants trace the same curve, showing that both suffer from the wildly different eigenvalues of the quadratic form of the optimization criterion. And the variable-rate Adaline magically converged soon after 250 epochs. On Figure 2 (above middle) the number of training points is increased tenfold from the 1000 of Figure 1, and Pocket PLA beats all the other variations, but the variable learning rate still offers an improvement over plain fixed-rate Adaline; and on 1 million points (Figure 3, above right) it converges while the other variations do not. So, variable-step Adaline appears to be a decent competitor for Pocket PLA, right? Not so fast… With the same parameters but the training data vector space dimension increased from 100 (Figures 1-3) to 1000, we see that Adaline does not show as much promise as before:

Figures 4-6. Number of training points: 1000, 10000 and 1000000.

Reducing the data dimensionality by a factor of 10 has the opposite effect. The variable-step Adaline (4) beats the other variations on 1 million points hands down and shows very attractive behavior (it converges!) in the two other cases:

Figures 7-9. Number of training points: 1000, 10000 and 1000000.

That was the linearly separable case (margin = 0). What happens if we introduce non-separability? With margin = 0.1 used to generate the training data points we get the following:

Figures 10-12. Number of training points: 1000, 10000 and 1000000.

Sadly, what seemed to be a good idea (adjusting the Adaline learning rate in a naive, straightforward way) shows not-so-practical behavior. The variable-rate Adaline (4) diverges back to almost where it started. An increased margin only makes the picture less favorable for the Adaline variations (regardless of the data dimensionality), as the plot becomes even more erratic.

### What Went Wrong?

The tell-tale zigzag behavior is typical for gradient descent caught in a narrow valley around the minimum. On the zigzag part, each iteration overshoots to the opposite side of the valley and fails to follow the direction of the valley toward the arg min point. Besides, the expression our version of Adaline calculates when updating the decision vector is not exactly a gradient. It points in the general direction of the gradient but differs from it because we do not include all the terms: only misclassified samples are used, which makes the gradient “incomplete”. Regardless, getting a classifier to work properly via gradient descent is a tricky task without using sigmoid functions to compute the class of an input point. The main obstacle is outliers, which can throw off the result.
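To make the “incomplete gradient” point concrete, here is a small sketch comparing the full least-squares gradient with an Adaline-style direction that sums only over misclassified points (pure Python, illustrative names, not the actual module's code):

```python
def full_gradient(w, xs, ys):
    """Gradient of E(w) = sum_i (w.x_i - y_i)^2 over ALL samples."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += 2.0 * err * xj
    return g

def misclassified_gradient(w, xs, ys):
    """The same sum restricted to samples the current w misclassifies,
    i.e. sign(w.x) != y -- the "incomplete" direction discussed above."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        s = sum(wi * xi for wi, xi in zip(w, x))
        if (s >= 0) != (y >= 0):        # misclassified sample
            for j, xj in enumerate(x):
                g[j] += 2.0 * (s - y) * xj
    return g

# Once every point is correctly classified the restricted "gradient"
# vanishes, while the true least-squares gradient generally does not:
w = [1.0, 0.0]
xs = [[1.0, 0.0], [-2.0, 0.0]]
ys = [1.0, -1.0]
misclassified_gradient(w, xs, ys)   # → [0.0, 0.0]
full_gradient(w, xs, ys)            # → [4.0, 0.0]
```

So the restricted direction stops pulling on the weights as soon as classification succeeds, even though the squared-error criterion is still far from its minimum.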

There are several remedies for the problems described. We may discuss them in upcoming posts.

## Conclusion

Attempts to improve over Pocket PLA by using Adaline and varying the learning rate with naive rules may indeed work in a narrow range of parameters. If the parameters are tuned properly, the advantage can be well worth the tweaking effort. Gradient methods in naive implementations may fail to reach the minimum (which, as we know, does exist in our case!). Additionally, numerical overflow and underflow issues can hurt their performance.

On the other hand, the lack of parameters in Pocket PLA leads to extremely robust (virtually “fool-proof”) behavior on pretty much any training data.

An important disclaimer: the artificial training data we used was generated by a specific algorithm that does not have Gaussian properties. In practice we divided the set of uniform random points in the unit hypercube in half using a suitable hyperplane. Algorithms based on minimizing the sum of squared errors (LMS = Least Mean Squares) are proven to give optimal results on inputs with a Gaussian distribution (unlike the uniform one used in my training data). That may be the subject of yet another post: re-running the comparison presented here with a Gaussian training data generator.
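For reference, the uniform generation scheme described above can be sketched as follows (the actual generator lives in the earlier post; this is a hedged stand-in with made-up names):

```python
import random

def generate_uniform_data(n_points, dim, seed=None):
    """Uniform random points in the unit hypercube, labelled by a fixed
    random hyperplane through the cube's center -- a stand-in for the
    generator from the earlier post (note: no Gaussian structure here)."""
    rng = random.Random(seed)
    # Random hyperplane normal vector:
    normal = [rng.uniform(-1, 1) for _ in range(dim)]
    data = []
    for _ in range(n_points):
        x = [rng.random() for _ in range(dim)]
        # Side of the hyperplane for the point shifted to the center:
        s = sum(n * (xi - 0.5) for n, xi in zip(normal, x))
        data.append((x, 1 if s >= 0 else -1))
    return data

data = generate_uniform_data(1000, 100, seed=11204)
```

A Gaussian variant would only need to swap `rng.random()` for `rng.gauss(0, 1)` (and drop the centering), which is what makes the proposed re-run cheap to set up.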

To be continued.

## References

The post is mostly inspired by suggested exercises in:
[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data”. AMLBook, 213 pages, 2012.