Consider a separable data set generated using methods from this post. The parameters that enter formula (12) for the maximum number of steps until convergence for PLA from this post are:

R^2, the max squared norm of the training vectors (they are generated in the square [-1, +1], which gives the value 2, plus 1 for the bias coordinate);

||w||, the norm of the classifier decision vector;

rho, the margin, i.e. the minimal separation of the data points from the decision boundary.

The max number of epochs until PLA converges is then estimated as R^2 * ||w||^2 / rho^2.
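As a minimal sketch of this estimate (with hypothetical parameter values, since the concrete values from the experiment are not reproduced here), the bound can be computed as:

```python
# Sketch: the worst-case bound R^2 * ||w||^2 / rho^2 for PLA,
# computed for hypothetical parameter values, not the ones from the
# original experiment.

def pla_iteration_bound(r_sq, w_norm_sq, margin):
    """Upper bound on the number of PLA updates until convergence."""
    return r_sq * w_norm_sq / margin ** 2

# Example: points in the square [-1, 1]^2 with a bias coordinate fixed
# to 1 give max squared norm R^2 = 2 + 1 = 3.
bound = pla_iteration_bound(r_sq=3.0, w_norm_sq=25.0, margin=0.01)
print(bound)  # ~7.5e5 -- orders of magnitude above typical practice
```

A small margin in the denominator dominates the estimate, which is why the theoretical bound is so pessimistic compared to the observed histogram.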

By fixing the dataset with the described parameters and running stochastic PLA 500 times with different random seeds we obtained the following histogram of the number of epochs until convergence. As the histogram shows, the theoretical estimate is higher than typical practical values by orders of magnitude. The obtained distribution is unimodal and is centered around a value that is almost three orders of magnitude closer to zero.

The code snippet used to generate histogram is straightforward:

epochs = []
n_runs = 500
for _ in range(n_runs):
    pla_rand = perceptron.PLARandomized(training_data)
    while pla_rand.next_epoch():
        pass
    epochs.append(pla_rand.epoch_id)
n, bins, patches = plt.hist(epochs, 5, normed=1, histtype='stepfilled')
plt.setp(patches, 'facecolor', 'grey', 'alpha', 0.25)
plt.xlabel('n epochs')
plt.ylabel('n runs')
plt.title(r'Stochastic PLA: Number of epochs to convergence')
plt.savefig("stochastic_PLA_convergence_" + str(data_dim).zfill(3) +
            "_" + str(n_points) + '_' + str(n_runs) + ".png",
            bbox_inches='tight')

We omitted the required import statements for brevity. We also normalized the number of runs (normed=1 in the histogram parameters). The implementation of PLARandomized is essentially the same as for the basic deterministic PLA, with the only difference being the random choice of a misclassified sample on each epoch.

[1] S. Haykin, “Neural Networks: A Comprehensive Foundation”, Prentice-Hall, 1999, 842 pages.

[2] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.

We start with a couple of examples. In both of them the goal is to produce a plan for crossing a river.

Puzzle 1. [1] A Wolf, A Goat and A Cabbage. A farmer needs to cross a river and to transport a wolf, a goat and a cabbage to the other side in a boat. The boat can carry the farmer and only one item. The following pairs of items can’t be left unattended by the farmer: wolf-goat and goat-cabbage for obvious reasons.

Puzzle 49. [1] Cannibals and Missionaries. Three missionaries and three cannibals must cross a river. Their boat can hold only two people, and it can’t cross the river by itself. Missionaries present on either bank can not be outnumbered by cannibals.

The common components of such puzzles are: a list of locations (river sides in the examples above), a list of items to move in between the initial and the goal location and a list of conditions to satisfy while moving. The objective is to plan moves so that the goal state is reached from the initial state while each intermediate step satisfies the conditions.

We can represent a state of the world in such puzzles by specifying the number of items at each location. If there are n_k items of kind k in total, then the number of items of kind k at any location belongs to the set {0, 1, ..., n_k}. If there are m locations, then the distribution of items of kind k across them is described by a point in the set S_k, which is the Cartesian product of m instances of {0, 1, ..., n_k}. Combining all together for all item types we get a complete space S = S_1 x S_2 x ... Any world state for such kind of puzzles is a point in S.
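The product structure of the state space can be made concrete with a small sketch; the puzzle items and counts below follow the wolf-goat-cabbage example, with the boat standing in for the farmer:

```python
# Sketch: enumerating the full state space S as a Cartesian product,
# for the wolf-goat-cabbage puzzle (one of each item, plus the boat).
from itertools import product

counts = {'boat': 1, 'wolf': 1, 'goat': 1, 'cabbage': 1}

# For each item kind with n items, the count on the left bank is in
# {0..n}; the count on the right bank is then determined as n - left.
axes = [range(n + 1) for n in counts.values()]
space = [dict(zip(counts, point)) for point in product(*axes)]
print(len(space))  # 2*2*2*2 = 16 points in S
```

Only a subset of these 16 points is valid (e.g. the goat must not be left alone with the wolf), which is exactly what the is_valid checks below filter out.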

Such formalization helps to easily enumerate all eligible neighbors of a current state for the purpose of searching for a path in S from the initial state to the goal state.

We will adopt the simplest possible strategy for state-space search. Namely, the algorithm we are going to use is a specialization of the Graph-Search algorithm (Figure 3.19 in [2]) with a breadth first forward search strategy. Next we will explore a Python implementation of this approach and apply it to the two puzzles above.

Programmatically we can represent a point using, say, a nested dictionary. Each item type present in the puzzle will correspond to a key at some level of the hierarchy. The dictionary will give a complete world state description, e.g. dct['left']['goat'] will contain the number of goats on the left side of the river. Locations and items correspond to different levels of nesting in such a dictionary. For the kind of puzzles we are exploring, nesting depth 2 is sufficient (locations and items), but the nested dictionary class can easily describe more complex situations by deeper nesting.

'''Defines class for symmetric nested dictionary treated like a complete
tree with methods following depth first traversal. For some additional
implementation notes see post
http://stackoverflow.com/questions/14692690/access-python-nested-dictionary-items-via-a-list-of-keys
'''
import pprint


class NestedDict:
    '''Nested dictionary with the main assumption that it has same keys
    for all nodes at the same level.'''

    def __init__(self, dictn=None):
        '''Initializes empty dictionary and empty list of labels.'''
        self.dictn = {} if dictn is None else dictn
        self.labels = []
        self.depth = 0

    def expand(self, labels, func=lambda: None):
        '''Given nested list of labels of the form
        [[L1a, L1b, ...][L2a, ...][L3a, ...]...] creates the tree with
        leaves created using the passed in function. By default leaf
        values are None.
        :param labels: nested list of labels.
        :param func: function to create default leaves.'''
        self.dictn = self._expand(labels, func)
        self.depth = len(labels)
        self.labels = labels

    def _expand(self, labels, func):
        '''Expands dictionary recursively to the next level in the tree
        depth.
        :param labels: remaining labels to expand.
        :param func: function for creating leaves.'''
        dictn = {}
        is_leaf = len(labels) == 1
        if is_leaf:
            for label in labels[0]:
                dictn[label] = func()
        else:
            for label in labels[0]:
                dictn[label] = self._expand(labels[1:], func)
        return dictn

    def get_at(self, dpath):
        '''Returns value given a path as list of labels.
        :param dpath: list in the form [L1, L2, ...] where Li are labels
        to follow on the corresponding level.'''
        return reduce(lambda d, k: d[k], dpath, self.dictn)

    def set_at(self, dpath, value):
        '''Uses get_at to set value at the path in the tree. Assumes
        that the tree is already fully expanded.
        :param dpath: data path in the tree, a list of labels to visit
        on the corresponding level.
        :param value: value to set on the path.'''
        self.get_at(dpath[:-1])[dpath[-1]] = value

    def get_as_str(self):
        return pprint.pformat(self.dictn, indent=1, width=800)

To describe world state we subclass nested dictionary and end up with a more specialized class:

'''Logistics puzzle state.'''
import copy

import nested_dict


class State(nested_dict.NestedDict):
    '''A 'template' state of a logistics puzzle world. The only
    assumption is that any state is a subset of the Cartesian product of
    state components (which is always true but is highly redundant if
    the valid subset is small).'''

    def __init__(self, labels, state_values):
        '''World state described by nested dictionary.
        :param labels: labels, a list of lists; e.g. the first one is
        locations, e.g. ['left', 'right'], the second one is item
        labels, e.g. ['goat', 'wolf', ...].
        :param state_values: list of pairs, e.g. [['left', 'goat'], 1]
        which corresponds to one goat on the left bank of the river.'''
        nested_dict.NestedDict.__init__(self)
        self.expand(labels, func=lambda: 0)
        for item, item_value in state_values:
            self.set_at(item, item_value)
        self.prev_state = None

    def get_neighbor(self):
        '''Returns deep copy of self, sets self as predecessor.'''
        state_candidate = copy.deepcopy(self)
        state_candidate.prev_state = self
        return state_candidate

    def enumerate_neighbors(self):
        return []

    def is_valid(self):
        return True

    def equals(self, other):
        '''The danger of string comparison is that the order used for
        formatting the string representation may vary.'''
        return self.get_as_str() == other.get_as_str()

    def get_path(self, prefix=[]):
        '''Returns list of the states visited on the way to the current.
        Used to print out the solution.'''
        prefix.append(self.get_as_str())
        if not self.prev_state is None:
            self.prev_state.get_path(prefix)

As an example of specialization of world state, consider PuzzleState class for cannibals and missionaries.

import state


class PuzzleState(state.State):
    '''Defines state in the search graph for missionaries and cannibals
    puzzle.'''

    locations = ['left', 'right']
    items = ['boat', 'missionaries', 'cannibals']
    labels = [locations, items]
    boat_capacity = 2

    def __init__(self, state_values):
        state.State.__init__(self, PuzzleState.labels, state_values)

    def side_is_valid(self, side):
        '''Returns true if on the given side of the river the number of
        cannibals does not exceed the number of missionaries (or there
        are no missionaries on that side at all).'''
        return self.dictn[side]['missionaries'] >= self.dictn[side]['cannibals'] \
            or self.dictn[side]['missionaries'] == 0

    def is_valid(self):
        '''True if the state is valid, i.e. missionaries can't be eaten.'''
        return self.side_is_valid('left') and self.side_is_valid('right')

    def boat_location_and_destinations(self):
        '''Returns current location of the boat and possible
        destinations for it.'''
        destinations = []
        location = None
        for loc in PuzzleState.locations:
            if self.dictn[loc]['boat'] == 1:
                location = loc
            else:
                destinations.append(loc)
        return location, destinations

    def enumerate_neighbors(self):
        '''Returns list of admissible states directly reachable from the
        current state (self).'''
        neighbors = []
        location, destinations = self.boat_location_and_destinations()
        for dest in destinations:
            for mm in range(self.dictn[location]['missionaries'] + 1):
                for cc in range(self.dictn[location]['cannibals'] + 1):
                    n_moved = cc + mm
                    if n_moved > PuzzleState.boat_capacity or n_moved == 0:
                        continue
                    state_candidate = self.get_neighbor()
                    state_candidate.move(location, dest, "cannibals", cc)
                    state_candidate.move(location, dest, "missionaries", mm)
                    state_candidate.move(location, dest, "boat", 1)
                    if state_candidate.is_valid():
                        neighbors.append(state_candidate)
        return neighbors

    def move(self, from_location, to_location, item, n_items):
        '''Moves n_items of kind item between two locations.'''
        self.dictn[from_location][item] -= n_items
        self.dictn[to_location][item] += n_items

We may consider introducing another intermediate class for river crossing puzzles. The move and boat_location_and_destinations methods would then be factored out into that new class.

The solver below uses string representation of a state to keep track of the visited states and assumes that states implement an appropriate get_as_str() method.

class BreadthFirstForward:
    def __init__(self, initial, goal):
        '''Saves the initial state and the goal state as instance
        variables. Initializes the internal bookkeeping: lists of
        visited and fringe states.'''
        self.initial = initial
        self.goal = goal
        self.visited = [initial.get_as_str()]
        self.fringe = [self.initial]

    def run(self):
        '''Implements the simplest version of breadth first search using
        a FIFO queue for the fringe states.'''
        while self.fringe:
            state = self.fringe[0]
            if state.equals(self.goal):
                return state
            self.fringe = self.fringe[1:]
            new_states = state.enumerate_neighbors()
            for new_state in new_states:
                state_hash = new_state.get_as_str()
                if state_hash in self.visited:
                    continue
                self.visited.append(state_hash)
                self.fringe.append(new_state)


def solve_breadth_first(start, goal):
    '''A wrapper to create the solver, run it and print the result to
    the console.'''
    slv = BreadthFirstForward(start, goal)
    solution = slv.run()
    if not solution is None:
        solution_str = []
        solution.get_path(solution_str)
        for line in solution_str:
            print line
    else:
        print "No solution."

The driver code and the output for the missionaries and cannibals puzzle look like this:

import solver

missionaries = [['left', 'missionaries'], 3]
cannibals = [['left', 'cannibals'], 3]
boat = [['left', 'boat'], 1]
start = PuzzleState([missionaries, cannibals, boat])

missionaries = [['right', 'missionaries'], 3]
cannibals = [['right', 'cannibals'], 3]
boat = [['right', 'boat'], 1]
goal = PuzzleState([missionaries, cannibals, boat])

solver.solve_breadth_first(start, goal)

{'left': {'boat': 0, 'cannibals': 0, 'missionaries': 0}, 'right': {'boat': 1, 'cannibals': 3, 'missionaries': 3}}
{'left': {'boat': 1, 'cannibals': 2, 'missionaries': 0}, 'right': {'boat': 0, 'cannibals': 1, 'missionaries': 3}}
{'left': {'boat': 0, 'cannibals': 1, 'missionaries': 0}, 'right': {'boat': 1, 'cannibals': 2, 'missionaries': 3}}
{'left': {'boat': 1, 'cannibals': 3, 'missionaries': 0}, 'right': {'boat': 0, 'cannibals': 0, 'missionaries': 3}}
{'left': {'boat': 0, 'cannibals': 2, 'missionaries': 0}, 'right': {'boat': 1, 'cannibals': 1, 'missionaries': 3}}
{'left': {'boat': 1, 'cannibals': 2, 'missionaries': 2}, 'right': {'boat': 0, 'cannibals': 1, 'missionaries': 1}}
{'left': {'boat': 0, 'cannibals': 1, 'missionaries': 1}, 'right': {'boat': 1, 'cannibals': 2, 'missionaries': 2}}
{'left': {'boat': 1, 'cannibals': 1, 'missionaries': 3}, 'right': {'boat': 0, 'cannibals': 2, 'missionaries': 0}}
{'left': {'boat': 0, 'cannibals': 0, 'missionaries': 3}, 'right': {'boat': 1, 'cannibals': 3, 'missionaries': 0}}
{'left': {'boat': 1, 'cannibals': 2, 'missionaries': 3}, 'right': {'boat': 0, 'cannibals': 1, 'missionaries': 0}}
{'left': {'boat': 0, 'cannibals': 1, 'missionaries': 3}, 'right': {'boat': 1, 'cannibals': 2, 'missionaries': 0}}
{'left': {'boat': 1, 'cannibals': 3, 'missionaries': 3}, 'right': {'boat': 0, 'cannibals': 0, 'missionaries': 0}}

There is a potential caveat with using a string representation of a state for keeping track of visited states. We need to make sure the nested dictionary is traversed in the same order each time by the pprint function pformat. Here we rely on the pprint module implementation and do not provide any additional guarantees. If the assumption fails, the search will not fail but will become more expensive, as we will be visiting the same state more than once. Eventually the search will return a solution if one exists, but it may differ from the breadth first search solution.
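One way to remove the dependence on pprint ordering (a sketch, not what the code above does) is to serialize the state with keys explicitly sorted at every level:

```python
# Sketch: a canonical, order-independent key for a nested-dictionary
# state, as an alternative to relying on pprint.pformat ordering.
import json

def state_key(dictn):
    """Serialize a nested dict with keys sorted at every level."""
    return json.dumps(dictn, sort_keys=True)

a = {'left': {'goat': 1, 'wolf': 0}, 'right': {'wolf': 1, 'goat': 0}}
b = {'right': {'goat': 0, 'wolf': 1}, 'left': {'wolf': 0, 'goat': 1}}
print(state_key(a) == state_key(b))  # True: same state, same key
```

With such a key, two dictionaries describing the same world state always hash to the same string, regardless of key insertion order.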

Finally, it is obvious that swapping the initial and the goal states turns the algorithm into a backward search.

It might be an interesting exercise (for which we may have a separate post in the future) to build an automated puzzle generator. It is relatively easy to come up with a generator and parser of first order logic rules describing valid states. Automated class generation for such rules could also be not so difficult to implement. The main challenge would be conversion of abstract items, locations and transitions into human readable puzzle fiction. Even without going that far, the framework we outlined here allows testing simple variations of the existing puzzles, say adding more items and locations (e.g. an island where items can be “parked” temporarily). We can also add item conditions: a perishable item left on the initial side of the river for longer than a certain number of moves is a loss and leads to an invalid state of the world. That would increase the depth of the nested dictionary to three (locations, items, item conditions).

[1] A. Levitin, M. Levitin, Algorithmic Puzzles, Oxford University Press, 2011, 257 pages.

[2] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Second Edition, Prentice Hall, 2003, 1081 pages.

[3] More river crossing puzzles https://justpuzzles.wordpress.com/2011/02/16/river-crossing-1/

Choosing the learning rate is not a trivial task when dealing with Adaline, or any gradient method for that matter.

For the set of the following experiments we used artificial learning data generated by the algorithm from this previous post and the following variations of PLA:

- Pocket PLA,
- Adaline with fixed learning rate,
- Adaline with non-increasing learning rate (see below),
- Adaline with variable learning rate that can decrease and grow depending on the progress (see below).

We will use this enumeration of PLA variants in the legends of the plots that follow.

We have chosen a (rather arbitrary) rule to set the “base learning rate” (i.e. the initial value before we attempt to “improve” it) depending on the dimension of the input space and the number of training data points.

Adaline modifications 3 and 4 introduced here work as follows.

The “non-increasing learning rate” variation (3) is only allowed to reduce the learning rate, by a fixed multiplier less than 1, after an epoch that did not increase the number of misclassified samples. That logic corresponds to the following Python code:

class MALRBatchAdalineCLA(perceptron.BatchAdalineCLA):
    '''Adaline variation: Monotonously decreasing Adaptive Learning Rate
    (the 'MALRB' prefix for the class name).'''

    def __init__(self, training_data, learning_rate=1,
                 learning_rate_mult=0.8):
        perceptron.BatchAdalineCLA.__init__(self, training_data,
                                            learning_rate)
        self.learning_rate_mult = learning_rate_mult

    def update_error_measure(self):
        '''Also adjusts learning rate: reduces it if the last epoch did
        not increase the number of misclassified samples.'''
        perceptron.BatchAdalineCLA.update_error_measure(self)
        # Decrease learning rate as needed:
        if self.epoch_id > 2:
            if self.progress[-1] <= min(self.progress[:-1]):
                self.learning_rate *= self.learning_rate_mult

The “variable learning rate” version (4) is allowed to adjust rate both ways depending on the improvement made during the last epoch. Here is the corresponding Python code:

def update_error_measure(self):
    '''...also adjusts learning rate.'''
    perceptron.BatchAdalineCLA.update_error_measure(self)
    # Decrease or increase learning rate depending on progress:
    if self.epoch_id > 2:
        if self.progress[-1] <= min(self.progress[:-1]):
            self.learning_rate *= self.learning_rate_mult
        else:
            self.learning_rate /= self.learning_rate_mult

I used a fixed seed for the random classifier used for data generation across experiments (seed1=11204, not shown on the plots) but varied the seed used for the training data points (“seed2” in the plot text). The progress data (number of misclassified training samples) was generated in Python using TaskManager from the previous post (it takes a while to run PLA on 1 million points, so parallelizing the work helps). I used R to render the plots: the epoch number along the x axis and the log of the number of misclassified samples along the y axis. As a side note, I found separating data generation from plotting quite handy: with the data sitting on disk ready to use, it is easy to adjust plotting options and even to select a different tool.

The experiments show a variety of behaviors, mostly demonstrating how gradient methods can fail if the optimization step is not adjusted properly. Click on the thumbnails to see the full size images.

Figures 1-3. Number of training points 1000, 10000 and 1000000.

On Figure 1 (above left) Pocket PLA (1) steadily improves but does not converge after 300 epochs even though the training data is separable (margin=0.0). Both fixed and non-increasing Adaline, (2) and (3), trace the same curve, showing that both suffer from wildly different eigenvalues of the quadratic form of the optimization criterion. And variable Adaline magically converged soon after 250 epochs. On Figure 2 (above middle) the number of training points is increased tenfold from the 1000 of Figure 1, and Pocket PLA beats all other variations, but variable learning rate still offers an improvement over the plain fixed learning rate Adaline; and on 1 million points (Figure 3, above right) it converges while the other variations do not. So, variable step Adaline appears to be decent competition for Pocket PLA, right? Not so fast… With the same parameters but the dimension of the training vector space increased from 100 (Figures 1-3) to 1000, we see that Adaline does not show as much promise as previously:

Figures 4-6. Number of training points 1000, 10000 and 1000000.

Reducing the data dimensionality by a factor of 10 brings the opposite effect. The variable step Adaline (4) beats the other variations on 1 million points hands down and shows very attractive behavior (it converges!) in the two other cases:

Figures 7-9. Number of training points 1000, 10000 and 1000000.

That was the linearly separable case (margin = 0). What happens if we introduce non-separability? With margin = 0.1 used to generate the training data points we obtain the following:

Figures 10-12. Number of training points 1000, 10000 and 1000000.

Sadly, what seemed to be a good idea (adjusting the Adaline learning rate in a naive straightforward way) shows not so practical behavior. The variable rate Adaline (4) diverges back to almost where it started. An increased margin only makes the picture less favorable for the Adaline variations (regardless of the data dimensionality) as the plot becomes even more erratic.

The tell-tale zigzag behavior is typical for gradient descent caught in a narrow valley around the minimum. On the zigzag part each next iteration overshoots to the opposite side of the valley and fails to follow the direction of the valley toward the arg min point. Besides, the expression calculated in our version of Adaline when updating the decision vector is not exactly a gradient. It points in the general direction of the gradient but differs because we do not include all the terms: only misclassified samples are used, which makes the gradient “incomplete”. Regardless of that, running gradient descent to get a classifier working properly is a tricky task without using sigmoid functions to compute the class of the input point. The main obstacle is outliers that can throw off the result.
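The overshooting can be reproduced on a toy problem; the function and step sizes below are made up for illustration only:

```python
# Sketch: gradient descent on an ill-conditioned quadratic
# f(x, y) = x^2 + 50*y^2. With a learning rate near the stability
# limit for the steep direction, the y coordinate zigzags across the
# valley while progress along the flat x direction stays slow.

def step(x, y, lr):
    # Gradient of f is (2x, 100y).
    return x - lr * 2 * x, y - lr * 100 * y

x, y = 1.0, 1.0
ys = []
for _ in range(5):
    x, y = step(x, y, lr=0.019)
    ys.append(y)
print(ys)  # signs alternate: ~-0.9, 0.81, -0.729, ...
```

Each step multiplies y by (1 - 0.019*100) = -0.9: the sign flips every iteration and the magnitude shrinks only slowly, which is exactly the narrow-valley picture described above.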

There are several remedies to the problems described. We may discuss them in the upcoming posts.

Attempts to improve over Pocket PLA by using Adaline and varying the learning rate with naive rules may indeed work in a narrow range of parameters. If the parameters are tuned properly the advantage can be well worth the effort of tweaking them. Gradient methods in their naive implementations may fail to reach the minimum (which we know does exist in our case!). Additionally, numerical over- and underflow issues can damage their performance.

On the other hand, lack of parameters in Pocket PLA leads to extremely robust (virtually “fool proof”) behavior on pretty much any training data.

An important disclaimer to make here is that the artificial training data we used was generated using a specific algorithm which does not have Gaussian properties. In practice we divided in half a set of uniform random points in the unit hypercube using a suitable hyperplane. The algorithms based on minimization of the sum of squared errors (LMS = Least Mean Squares) are proven to give the optimal result on inputs that have a Gaussian distribution (unlike the uniform one used in my training data). That may be a subject of yet another post: re-running the comparison presented here using a Gaussian training data generator.

To be continued.

The post is mostly inspired by suggested exercises in:

[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.


Running multi-threaded Python apps is challenging because of the global interpreter lock (GIL) [1]. A simple workaround is to start a number of OS sub-processes, each of which can run its own interpreter. A little bit of bookkeeping is required to manage tasks that map to such sub-processes. The requirements for the class performing such bookkeeping could be as simple as:

- initialize tasks list and update it as needed during calculations;
- keep a manageable number of sub-processes started at the same time; their number can be specified by the user or deduced automatically in more advanced OS-dependent solutions;
- terminate execution when all sub-processes finish their calculations and the tasks list is empty.

Below is a minimal class that can be extended by inheritance to fit particular computational needs.

'''Module task_manager.py
***********************

Provides a simple framework for setting up os subprocess based
computations to utilize multi-core architecture.
'''
import os.path
import subprocess
import time


class TaskManager:
    '''Holds tasks lists, runs checks on their completion and starts new
    tasks if task slot(s) are available.'''

    def __init__(self, n_threads, timeouts={}):
        '''Minimal initialization. A task descriptor consists of a
        complete parameters list for starting execution as an OS
        sub-process; i.e. at minimum it contains the name of the
        executable. Other required command line parameters must also be
        included into the list in the proper order as string values.'''
        self.proc_poll_dt = timeouts.get('proc_poll_dt', 0.2)
        self.next_free_dt = timeouts.get('next_free_dt', 0.5)
        self.proc_popen_dt = timeouts.get('proc_popen_dt', 0.5)
        self.n_threads = n_threads
        self.procs = [None] * n_threads
        self.task_descriptors = []

    def update_task_descriptors(self):
        '''A stub to override in sub-classes. Generates from scratch,
        loads from disk or updates the list of the current task
        descriptions. Must include any additional required timeouts and
        synchronization.'''
        pass

    def update_tasks_status(self):
        '''Polls running processes and clears slots of those that have
        terminated.'''
        last_avail = None
        for proc_id in range(self.n_threads):
            if self.procs[proc_id] is not None:
                return_val = self.procs[proc_id].poll()
                time.sleep(self.proc_poll_dt)
                if return_val is not None:
                    self.procs[proc_id] = None
                    last_avail = proc_id
            else:
                last_avail = proc_id
        return last_avail

    def run(self):
        '''Main loop. Checks if there is an empty task slot and starts a
        new task with an entry from self.task_descriptors.'''
        while True:
            last_avail = self.update_tasks_status()
            if last_avail is None:
                time.sleep(self.next_free_dt)
                continue
            else:
                self.update_task_descriptors()
                if not self.task_descriptors:
                    return
                # Open new process:
                params = self.task_descriptors[-1]
                self.procs[last_avail] = subprocess.Popen(params)
                self.task_descriptors = self.task_descriptors[:-1]
                time.sleep(self.proc_popen_dt)

A minimal test code for the class could look like this. The tasks list update does not create any tasks, so the test must terminate right away.

def test_trivial():
    tmon = TaskManager(3)
    tmon.run()

Next is a test with the tasks list populated once, consisting of calls to the sleep function. The sleep task code looks like this:

if __name__ == '__main__':
    import time
    import sys
    print sys.argv[1:]
    time.sleep(10)
    print __file__, ':Done.'

and it sits in the task_man_test.py module. The test code for the sleep tasks is below:

class SleepTaskManager(TaskManager):
    def __init__(self):
        TaskManager.__init__(self, 2)
        self.n_created = 0
        self.executable = ['c:\\python27\\pythonw.exe', 'task_man_test.py']

    def update_task_descriptors(self):
        if self.n_created == 0:
            self.n_created = 10
            for ii in range(self.n_created):
                next_task = self.executable + [str(ii)]
                self.task_descriptors.append(next_task)


def test_sleep():
    tmon = SleepTaskManager()
    tmon.run()
    print 'Test sleep completed.'


test_sleep()
print __file__, ':Done.'

Despite the simplicity of the base implementation, the class can be easily extended to distribute computations across several computers over a LAN or the Internet. It can be done using a Google Drive shared folder in a MapReduce fashion. Before starting computation of a task, a computer checks if the task is already started by the other computers. Creating an empty output file uniquely linked by its name to a task allows doing that with reasonable reliability. If the task is already started, it will be skipped by the other computers. Maintaining a global list of tasks placed on Google Drive may be less efficient because of the latency and the necessity to open, read and save the tasks list file. New file creation by itself has smaller latency. The final task could be generated when all sub-tasks are completed; its purpose could be combining the computation results into a single file.
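The claim-by-file idea can be sketched as follows; the shared folder here is a local temporary directory standing in for a synced Google Drive folder, and the task name is made up:

```python
# Sketch of the claim-by-file idea: a task is "claimed" by atomically
# creating its uniquely named output file in a shared folder.
import errno
import os
import tempfile

def try_claim(shared_dir, task_name):
    """Return True if we claimed the task, False if someone else did."""
    path = os.path.join(shared_dir, task_name + '.out')
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as err:
        if err.errno == errno.EEXIST:
            return False
        raise
    os.close(fd)
    return True

shared = tempfile.mkdtemp()  # stand-in for the synced shared folder
print(try_claim(shared, 'task_001'))  # True: first claim wins
print(try_claim(shared, 'task_001'))  # False: already claimed
```

Note that with a cloud-synced folder the atomicity is only as good as the sync service's conflict handling, hence "reasonable reliability" rather than a hard guarantee.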

[1] Global Interpreter Lock on Python Wiki.

[2] MapReduce on Wikipedia.

Training vectors are elements of R^{m+1} and they are classified using two classes, so that the training set is the set of pairs (x_i, y_i), y_i in {-1, +1}. Let's define a linear functional h on the space R^{m+1}: h(x) = (w, x), where (w, x) is the dot product, which by definition is (w, x) = sum_j w_j x_j. The class of x is defined as +1 if h(x) > 0 and -1 otherwise. An error measure on the training set for a given w can be defined in many ways. The PLA can be treated as if it measures error as the number of misclassified samples. Using different measures leads to modified update rules that still resemble the PLA update rule but may give an advantage.

Let's define the error measure on the training dataset as follows:

(1)   E(w) = sum_{i=1..N} ((w, x_i) - y_i)^2

Note that E(w) does not use the class mapping, but uses the value of the linear functional on the training vector directly. Minimizing (1) would be equivalent to fitting a linear regression solution to the classification problem. While this may not be the best approach in general, it may still lead to reasonable update rules that would make PLA more efficient.

An iterative solution for linear regression would require computing the gradient of (1) with respect to w and updating w on every step using that gradient scaled by some learning rate eta. The gradient calculation is straightforward:

(2)   dE/dw_j = sum_{i=1..N} 2 ((w, x_i) - y_i) x_{ij}

where j is the coordinate index in R^{m+1}. The coordinate-wise formula (2) is equivalent to the following one in more compact vector notation:

(3)   grad E(w) = 2 sum_{i=1..N} ((w, x_i) - y_i) x_i

The general gradient descent update with learning rate eta is:

(4)   w <- w - eta * grad E(w)

If we focus on a single misclassified sample (i.e. a single term in the summation (3)) instead of the full batch update (4), we will get the update rule:

(5)   w <- w + 2 * eta * (y_i - (w, x_i)) x_i

This is a variation of Adaline (from **Ada**ptive **Line**ar Element or Neuron) that can be viewed as a variant of PLA (see post). The main differences are the introduction of the learning rate and the use of the current mismatch between the functional value and the target class, which amplifies updates for samples with larger deviation from the target.
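A single application of rule (5) can be sketched on hand-made toy numbers (they are illustrative only, not taken from the post's experiments):

```python
# Sketch: one Adaline-style single-sample update following rule (5),
# applied to a hand-made misclassified sample.

def adaline_update(w, x, y, lr):
    """w <- w + 2*lr*(y - <w, x>) * x for a misclassified sample."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + 2 * lr * (y - h) * xi for wi, xi in zip(w, x)]

w = [0.0, 1.0, -1.0]   # bias weight plus 2 coordinates
x = [1.0, 0.5, 0.5]    # first coordinate fixed to 1 (bias)
y = 1.0                # target class
# <w, x> = 0.0 here, so x sits on the boundary and is misclassified.
w_new = adaline_update(w, x, y, lr=0.1)
print(w_new)  # the functional value on x moves toward +1
```

After the update the functional value (w_new, x) becomes positive, i.e. the sample would now be classified as +1; the size of the step scales with the mismatch y - (w, x), unlike the fixed-size PLA step.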

An update to our running PLA code example (again see this post) to support the Adaline modification is nearly trivial:

class AdalineCLA(PLA):
    '''Adaptive linear element Classifier Learning Algorithm (CLA). The
    algorithm approximately minimizes square error of classification.'''

    def __init__(self, training_data, learning_rate=1):
        '''In addition to training data the constructor takes a learning
        rate parameter, a multiplier that replaces 1 in the PLA.'''
        self.n_points = training_data.shape[0]
        self.learning_rate = learning_rate
        self.curr_values = np.ones(self.n_points)
        PLA.__init__(self, training_data)

    def update_error_measure(self):
        '''In addition to PLA, updates and stores the value of the
        linear functional used in the classifier.'''
        self.classifier.values_on(self.data, self.curr_values)
        PLA.update_error_measure(self)

    def update_classifier(self):
        '''Updates classifier's decision vector.'''
        delta_vect = self.data[self.update_idx, :-1]
        curr_value = self.curr_values[self.update_idx]
        target_class = self.data[self.update_idx, -1]
        multiplier = 2 * abs(target_class - curr_value)
        self.classifier.vect_w[:-1] += target_class * \
            multiplier * \
            self.learning_rate * \
            delta_vect

The selection rule for the update index remains the same. We used the abs function to emphasize the source of the actual sign of the update: it comes from the target class, just like in PLA. Making a batch version of Adaline (class BatchAdalineCLA sub-classed from AdalineCLA) is also straightforward:

def update_classifier(self):
    '''Batch version: updates classifier's decision vector by
    accumulating error from all misclassified samples.'''
    for idx in self.misclassified:
        delta_vect = self.data[idx, :-1]
        curr_value = self.curr_values[idx]
        target_class = self.data[idx, -1]
        multiplier = 2 * abs(target_class - curr_value)
        self.classifier.vect_w[:-1] += target_class * \
            multiplier * \
            self.learning_rate * \
            delta_vect

We leave out the obvious changes to the super-classes needed to keep the list of misclassified samples (or compute it dynamically). In the batch version the contributions from all misclassified samples are accumulated during a single epoch update. This affects the sensible choice of learning rate: it has to be smaller than that for the by-sample Adaline. An obvious rule of thumb is to make the learning rate proportional to 1/n_points. Proper selection of the learning rate may influence the convergence significantly, with a trade-off between convergence speed and proximity of the result to the optimal (in the least-squares sense) solution. We will leave comparison of various flavors of PLA to one of the next posts. With a good learning rate selection batch Adaline can give a significant speed-up over other variations of PLA. The following example converged in only 10 epochs, which is much faster than 38 epochs for the same data in one of the previous posts (second example). Click on the thumbnail to see the gif animation:
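The 1/n_points rule of thumb can be illustrated with toy numbers (made up for this sketch): with n misclassified samples of comparable error, the batch update is roughly n times one per-sample update, so dividing the rate by n keeps the effective step comparable.

```python
# Sketch: why the batch learning rate should scale like 1/n_points.
# Toy numbers only.
n = 1000
single_term = 0.5           # magnitude of one per-sample update term
batch_term = n * single_term  # accumulated over all misclassified

lr_sample = 0.1
lr_batch = lr_sample / n    # the 1/n_points rule of thumb

# Effective step sizes end up comparable:
print(lr_sample * single_term, lr_batch * batch_term)
```

Without the rescaling, the batch step would be n times larger than the by-sample step and the iteration would likely diverge.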

For a short high-level description see Wikipedia [1]. The book [2] discusses Adaline in an exercise and treats it as a variation of the PLA. This is the approach we followed while implementing Adaline in Python.

[1] http://en.wikipedia.org/wiki/ADALINE

[2] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.

The Perceptron Learning Algorithm was outlined in this earlier post.

The input vectors x in the input set are (m+1)-dimensional and have the first coordinate fixed to 1. The vector defining a linear functional on the space of input vectors is denoted by w; it is also (m+1)-dimensional and its first component is called the bias. The dot product on the linear space at hand is defined in the standard (Euclidean) fashion and is denoted as w · x.

Each vector x_k in the training set of vectors, k = 1, ..., N, is assigned to one of the two classes denoted by +1 and -1. Let's denote the assignment mapping y(x_k) = y_k. The assignment is linearly separable if there exists w such that sign(w · x_k) = y_k for all k.

By assuming that linear separability is achieved with some vector w_o, we can define a quantity γ that can be treated as the minimal separation from the decision boundary:

(1) γ = min y (w_o · x), where the minimum is taken over all training pairs (x, y).

We will also assume that the PLA starts iterations with w(0) = 0; the vector obtained after the k-th update is denoted as w(k).

The main idea of the proof is to obtain two bounds on the growth of the squared norm ||w(n)||².

**The first (quadratic lower) bound** is obtained directly from counting updates and showing that the squared norm in question grows at least quadratically with the number of updates n. This does not depend on the particular choice of the misclassified vector used to update w(k). **The second (linear upper) bound** is obtained with the direct use of the update rule: x(k) is picked because it is misclassified, and such a choice makes w(k) more "aligned" with x(k), which limits the growth of ||w(n)||² from above linearly.

A combination of both bounds gives the desired result and an estimate on maximum number of iterations required for PLA to converge.

Unfolding w(n) using the update rule w(k) = w(k-1) + y(k) x(k), where x(k) is the misclassified training vector picked at the k-th update and y(k) is its class, we get:

(2) w(n) = y(1) x(1) + y(2) x(2) + ... + y(n) x(n),

which, when multiplied by w_o and using (1), gives:

(3) w_o · w(n) ≥ n γ.

Next we use the Cauchy–Schwarz inequality to obtain:

(4) ||w_o||² ||w(n)||² ≥ (w_o · w(n))²,

which, by combining (3) and (4), leads to:

(5) ||w(n)||² ≥ n² γ² / ||w_o||²,

which gives **the first component of the proof**: the squared norm of the weight vector grows at least as fast as n².

To obtain the upper bound on the growth of ||w(n)||² we first consider the equalities:

(6) w(k) = w(k-1) + y(k) x(k),

(7) ||w(k)||² = ||w(k-1)||² + ||x(k)||² + 2 y(k) (w(k-1) · x(k)),

where (6) is by definition and (7) is (6) expanded using the Euclidean metric. By the definition of the PLA, the vector x(k) was picked because it was misclassified, hence the last term in (7) is negative, since w(k-1) · x(k) is of the opposite sign to y(k). This leads to the inequalities:

(8) 2 y(k) (w(k-1) · x(k)) ≤ 0

and

(9) ||w(k)||² ≤ ||w(k-1)||² + ||x(k)||².

Summation of inequality (9) for k in the range from 1 to n with the initial condition w(0) = 0 results in:

(10) ||w(n)||² ≤ n R²,

where R² denotes the maximum of the squared norms ||x||² over the training vectors x.

The inequality (10) gives **the second component of the proof**: the growth of the squared norm of the weight vector is limited from above by a linear function of n.

**Finally**, it immediately follows from (5) and (10) that the algorithm has to stop. The maximum number of updates n_max can be obtained from the equality n_max² γ² / ||w_o||² = n_max R², where both estimates of growth are combined, i.e.

(12) n_max = R² ||w_o||² / γ².

Qualitatively, the smaller the minimal separation γ of the training data, the longer the PLA may take to converge. In practice the number of iterations required for the PLA to converge can be considerably smaller than the estimate above. However, neither γ nor w_o is known in advance for an arbitrary dataset, so there is no easy way to calculate estimate (12) upfront, without running the algorithm, or even to conclude whether it will converge at all.
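To make estimate (12) concrete, here is a small sketch that evaluates it for a hand-made separable dataset; the separating vector w_o and the data points are illustrative assumptions, not taken from the post.

```python
import numpy as np

# Hedged sketch: evaluating the bound n_max = R^2 * ||w_o||^2 / gamma^2
# from (12); w_o and the points below are assumptions.
w_o = np.array([0.0, 1.0, 1.0])           # a separating vector (bias, w1, w2)
points = np.array([[1.0,  0.5,  0.5],     # rows: (leading 1, x, y)
                   [1.0, -0.5, -0.5],
                   [1.0,  0.8,  0.1],
                   [1.0, -0.3, -0.9]])
classes = np.sign(points @ w_o)           # labels induced by w_o itself

r_sq = max(np.dot(p, p) for p in points)                          # R^2
gamma = min(c * np.dot(w_o, p) for c, p in zip(classes, points))  # margin
n_max = r_sq * np.dot(w_o, w_o) / gamma**2
```

As in the histogram experiment discussed earlier in this series, the number of epochs actually observed in practice is typically far below such a bound.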

The proof above was borrowed from [1] and is close to what [2] suggests as a series of steps in an exercise for Chapter 1.

[1] S. Haykin. "Neural Networks: A Comprehensive Foundation", Prentice-Hall, 842 pages, 1999.

[2] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.


A minor modification of the code from the previous post on generation of artificial linearly separable datasets allows us to generate "almost" separable data, i.e. data where the number of data points that violate linear separability can be controlled and the max violation distance from the "true" decision boundary is a parameter. The code below uses the function value_on instead of class_of (see the Classifier python code in one of the previous posts). That allows us to pick the data point class randomly within a band of specified "width" around the classifier's decision boundary. (The width here is in quotes because the classifier's linear function vect_w is not normalized.)

def nearly_separable_2d(seed, n_points, classifier, boundary_width):
    '''Generates n_points which are not linearly separable with a
    controlled degree of inseparability (in 2d). The classifier has to be
    able to return the floating point value of the underlying linear
    function.
    :params seed: sets random seed for the random generator,
    :params n_points: number of points to generate,
    :params classifier: a linear function returning floating point for
        each point in 2d;
    :params boundary_width: instead of comparing
        classifier.value_on(data point) with 0 we compare it with a random
        number in the range [-boundary_width, boundary_width] to assign
        the class +1 or -1 to a data point.
    '''
    np.random.seed(seed)
    dim_x = 2
    data_dim = dim_x + 1 + 1  # leading 1 and class value
    data = np.ones(data_dim * n_points).reshape(n_points, data_dim)
    # fill in random values
    data[:, 1] = -1 + 2*np.random.rand(n_points)
    data[:, 2] = -1 + 2*np.random.rand(n_points)
    # in the vicinity of the boundary generate a random number
    # to decide on the class of the data point:
    rand_boundary = -1 + 2*np.random.rand(n_points)
    rand_boundary *= boundary_width
    # TODO: use numpy way of applying a function to rows
    for idx in range(n_points):
        curr_val = classifier.value_on(data[idx])
        data[idx, -1] = 1 if rand_boundary[idx] < curr_val else -1
    return data

Below is a plot of 300 training points generated using the code above. Note that the separability violators are concentrated around the decision line of the classifier used to generate the data.

The modifications to the PLA needed to support the Pocket algorithm are straightforward and there is no need to present them. The PLA now has to check both the convergence criterion and the max iteration limit to terminate.
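For reference, the pocket idea can be sketched as a standalone function (hypothetical names, not the post's class hierarchy): run plain PLA updates, but keep the best-scoring weight vector seen so far "in the pocket" and return it when the iteration limit is reached.

```python
import numpy as np

def pocket_pla(data, max_epochs=100):
    '''Hedged sketch of Pocket PLA on rows (1, x..., class); the structure
    is illustrative and does not mirror the post's classes.'''
    w = np.zeros(data.shape[1] - 1)
    best_w, best_hits = w.copy(), -1
    for _ in range(max_epochs):
        preds = np.where(data[:, :-1] @ w >= 0, 1, -1)
        hits = int(np.sum(preds == data[:, -1]))
        if hits > best_hits:              # better than the pocket: keep it
            best_w, best_hits = w.copy(), hits
        misses = np.flatnonzero(preds != data[:, -1])
        if misses.size == 0:              # separable case: fully converged
            break
        idx = misses[0]                   # standard PLA update
        w = w + data[idx, -1] * data[idx, :-1]
    return best_w, best_hits
```

On non-separable data the plain PLA never terminates on its own, while this variant always returns the best classifier encountered within max_epochs.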

The following two animations show the behavior of the Pocket PLA on artificial datasets with non-separable data generated by the function nearly_separable_2d above. The max iteration number was set to 100. Both animations show that updates in later steps mostly happen on the data points that violate linear separability. Click on a thumbnail to see the gif animation.

[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.

[2] Wikipedia (http://en.wikipedia.org/wiki/Perceptron)

The following GIF animations illustrate the intuition behind the PLA: even though individual training samples may become incorrectly classified after a single update, the decision boundary continues to converge to a classifier consistent with the training data.

Click on the thumbnails to see the corresponding GIF animation.

The third example shows that a smaller separation margin for the training samples results in much slower convergence at the end. It is interesting to note that the PLA classified most of the samples quite quickly, and only the three closest to the decision line were used to make fine adjustments for hundreds of iterations before the algorithm converged. This is expected for small-margin cases and is reflected in the PLA convergence theorem, which we hope to discuss in the next post.

The Python code for the PLA visualization is placed into a separate class, which either opens a window with the next update or saves an image generated via the pylab library. Since this code is essentially a straightforward extension of the visualization from the previous post, I decided not to include it in the post.

The PLA class is quite compact:

import numpy as np
import random

import linear_classifier


class PLA:
    '''Basic (deterministic) implementation of the Perceptron Learning
    Algorithm.'''

    def __init__(self, training_data):
        '''Stores training data and initializes the linear classifier with
        a zero vector.
        :param training_data: numpy array of training data.
        '''
        self.data = training_data
        self.n_points = self.data.shape[0]
        # Initialize classifier:
        data_dim = self.data.shape[1]
        vect_w = np.zeros(data_dim)
        self.classifier = linear_classifier.Classifier(vect_w)
        # Linear classifier, as initialized, classifies all as +1:
        self.curr_classes = np.ones(self.n_points)
        self.epoch_id = 0
        self.progress = []
        self.n_misclassified = 0
        self.update_idx = None
        self.update_error_measure()

    def get_next_idx_to_update(self):
        '''Based on the current classes, finds the first index that needs
        an update (first misclassified index).'''
        self.update_idx = next(idx for idx, train, curr
                               in zip(range(self.n_points),
                                      self.data[:, -1],
                                      self.curr_classes)
                               if train != curr)

    def compute_classes(self):
        '''Computes current classes.'''
        self.classifier.classes_of(self.data, self.curr_classes)

    def update_classifier(self):
        '''Updates the classifier's decision vector.'''
        delta_vect = self.data[self.update_idx, :-1]
        update_sign = self.data[self.update_idx, -1]
        self.classifier.vect_w[:-1] += update_sign * delta_vect

    def update_error_measure(self):
        '''Counts misclassified samples and stores a boolean array of
        hits/misses; updates progress by recording the percentage of the
        correctly classified examples.'''
        curr_class_equal = np.equal(self.data[:, -1], self.curr_classes)
        hits = np.sum(curr_class_equal)
        self.n_misclassified = self.n_points - hits
        self.progress.append(hits/float(self.n_points))
        if self.n_misclassified == 0:
            return
        self.get_next_idx_to_update()

    def next_epoch(self):
        '''Performs one step of the PLA. Applies the current self.vect_w
        to the training samples to get their current classification, saves
        the "score", selects a misclassified sample and adjusts
        self.vect_w according to the PLA rule.'''
        self.epoch_id += 1
        # No need to step if all samples are correctly classified
        if self.n_misclassified == 0:
            return False
        self.update_classifier()
        self.compute_classes()
        self.update_error_measure()
        return True

The function next_epoch() returns False when all points in the training data are classified correctly. Until then we run a loop, organized in a way that leaves room for visualization to do its job:

pla = perceptron.PLA(training_data)
while True:
    # Visualization calls come here.
    if not pla.next_epoch():
        break

A direct variation of the deterministic PLA selects the next update index randomly among the misclassified training samples:

    def get_next_idx_to_update(self):
        '''Instead of picking the first misclassified sample, selects a
        random misclassified one.'''
        self.update_idx = random.choice(
            [idx for idx, train, curr in zip(range(self.n_points),
                                             self.data[:, -1],
                                             self.curr_classes)
             if train != curr])

[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.


We start by defining a linear function on n-dimensional space and the binary classifier associated with it. Next we move on to generating artificial test data and plotting it.

In anticipation of using it for Perceptron Learning Algorithm (PLA) tests, we add another dimension to the vectors to hold the class of the training sample. The training data structure is a numpy array with the first column all set to 1 and the last column holding the class assignment, meaning 2D data results in allocating and holding a 2-dimensional n_points x 4 array.

'''
Module linear_classifier.py
***************************
Defines a linear function and a linear classifier based on the function.
'''
import numpy as np


class Classifier:
    '''Class to represent a linear function and the associated linear
    classifier on n-dimensional space.'''

    def __init__(self, vect_w=None):
        '''Initializes coefficients; if None they must be initialized
        later.
        :param vect_w: vector of coefficients.'''
        self.vect_w = vect_w

    def init_random_last0(self, dimension, seed=None):
        '''Initializes to a random vector with the last coordinate = 0,
        uses seed if provided.
        :params dimension: vector dimension;
        :params seed: random seed.
        '''
        if seed is not None:
            np.random.seed(seed)
        self.vect_w = -1 + 2*np.random.rand(dimension)
        self.vect_w[-1] = 0  # exclude class coordinate

    def value_on(self, vect_x):
        '''Computes the value of the function on vector vect_x.
        :param vect_x: the argument of the linear function.'''
        return sum(p * q for p, q in zip(self.vect_w, vect_x))

    def class_of(self, vect_x):
        '''Computes a class, one of the values {-1, 1}, on vector vect_x.
        :param vect_x: the argument of the linear function.'''
        return 1 if self.value_on(vect_x) >= 0 else -1

As a side note, we could define both the linear function and the associated classifier using closures and lambdas, but that would leave the coefficients of the function not directly accessible (we would need to compute them unless we specified them explicitly). It would also make it less convenient to add more functionality, like the plotting utility we discuss next.

Let's add to the Classifier class a utility function to simplify 2D plotting. The purpose of this function is to return a pair of points used to draw the decision boundary within a specified box (axis-aligned rectangle). The decision boundary is the line where the value of the function equals 0. An important note: for simplicity we skip checks for the special cases when the line is parallel to one of the axes. We may later generalize the function by passing in a selection of two coordinates to handle the case when the classifier is defined in a higher dimension.

    def intersect_aabox2d(self, box=None):
        '''Returns the two-point intersection (if any) of the decision
        line (function value = 0) with an axis-aligned rectangle.'''
        if box is None:
            box = ((-1, -1), (1, 1))
        minx = min(box[0][0], box[1][0])
        maxx = max(box[0][0], box[1][0])
        miny = min(box[0][1], box[1][1])
        maxy = max(box[0][1], box[1][1])
        intsect_x = []
        intsect_y = []
        for side_x in (minx, maxx):
            ya = -(self.vect_w[0] + self.vect_w[1] * side_x)/self.vect_w[2]
            if ya >= miny and ya <= maxy:
                intsect_x.append(side_x)
                intsect_y.append(ya)
        for side_y in (miny, maxy):
            xb = -(self.vect_w[0] + self.vect_w[2] * side_y)/self.vect_w[1]
            if xb <= maxx and xb >= minx:
                intsect_x.append(xb)
                intsect_y.append(side_y)
        return intsect_x, intsect_y

In a separate module (let's call it test_data_gen.py) we define a generator of test data. We also add the simplest possible test code to plot the data and the decision boundary. Having that plot gives some reassurance that the code we are going to use is at least reasonably correct. To follow best practices we would also need to add tests to the module and make sure that special cases are handled properly. We skip all this in the current discussion to get to the action (PLA tests in the upcoming posts) asap.

import numpy as np


def separable_2d(seed, n_points, classifier):
    '''Generates n_points which are separable in 2d via the passed-in
    classifier.
    :params seed: sets random seed for the random generator,
    :params n_points: number of points to generate,
    :params classifier: a function returning either +1 or -1 for each
        point in 2d.'''
    np.random.seed(seed)
    dim_x = 2
    data_dim = dim_x + 1 + 1  # leading 1 and class value
    data = np.ones(data_dim * n_points).reshape(n_points, data_dim)
    # fill in random values
    data[:, 1] = -1 + 2*np.random.rand(n_points)
    data[:, 2] = -1 + 2*np.random.rand(n_points)
    # TODO: use numpy way of applying a function to rows.
    for idx in range(n_points):
        data[idx, -1] = classifier.class_of(data[idx])
    return data

The code above generates points uniformly distributed in the box ((-1,-1),(1,1)). Of course, the box could be made a parameter, but we don't bother with this generalization. We may consider it later when we investigate the influence of data normalization on the convergence speed of learning algorithms, in particular the PLA.

The simplest possible test code, plotting the data and the decision boundary, is also straightforward. It uses the pylab module for plotting.

if __name__ == "__main__":
    # Import * is not the best practice, but...
    from pylab import *

    import linear_classifier

    data_dim = 2
    classifier = linear_classifier.Classifier()
    classifier.init_random_last0(data_dim + 2, 130216)
    data = separable_2d(263247, 12, classifier)
    condition = data[:, 3] >= 0
    positive = np.compress(condition, data, axis=0)
    neg_condition = data[:, 3] < 0
    negative = np.compress(neg_condition, data, axis=0)
    x_pos = positive[:, 1]
    y_pos = positive[:, 2]
    x_neg = negative[:, 1]
    y_neg = negative[:, 2]
    plot_lim = 1.2
    box = ((-plot_lim, -plot_lim), (plot_lim, plot_lim))
    decision_x, decision_y = classifier.intersect_aabox2d(box)
    figure()
    ylim([-plot_lim, plot_lim])
    xlim([-plot_lim, plot_lim])
    plot(x_pos, y_pos, 'g+', label="Class=+1")
    plot(x_neg, y_neg, 'r.', label="Class=-1")
    plot(decision_x, decision_y, 'b-', label="Decision Boundary")
    legend(bbox_to_anchor=(0., 0.9, 1., .102), ncol=3,
           mode="expand", borderaxespad=0.)
    xlabel('x')
    ylabel('y')
    title('Artificial Training Data')
    show()

It draws the following image:

The generated data is a numpy array:

[[ 1.         -0.1928764   0.39962001  1.        ]
 [ 1.          0.25038266 -0.88412106 -1.        ]
 [ 1.         -0.89879402 -0.75034091  1.        ]
 [ 1.         -0.44705782 -0.63754637  1.        ]
 [ 1.          0.18564908  0.84809286 -1.        ]
 [ 1.          0.47149923 -0.72909877 -1.        ]
 [ 1.         -0.33891335 -0.11343507  1.        ]
 [ 1.          0.27546945 -0.55661253 -1.        ]
 [ 1.          0.93156523 -0.02588478 -1.        ]
 [ 1.          0.71528106 -0.53893679 -1.        ]
 [ 1.         -0.37236843  0.90336908  1.        ]
 [ 1.          0.34107903  0.54026163 -1.        ]]

We do not bother adding a References section to this post, as all of the techniques and methods are widely used, well explained, and can be searched for on the Internet. That makes the post a nearly-trivial tutorial (with some minor todos left as an exercise). In the next installment we will investigate the behavior of the Perceptron Learning Algorithm (PLA) on artificial training data generated using the code from this post.


We start with some definitions and an outline of the classification problem; next we move to the plain-vanilla perceptron learning algorithm.

Consider a set X of points x with the first coordinate fixed to 1 and a mapping f: X → {-1, +1}, which maps elements of X to two different classes denoted by the elements of {-1, +1}. The two classes are assumed to be linearly separable, i.e. there exists a vector w such that f(x) = sign(w * x), where * is the dot product on the space at hand. For the sake of the following discussion we define sign(0) = +1. Such a mapping h(x) = sign(w * x) is **a linear classifier** (on the set X). The set of the pairs

D = {(x_k, y_k) : x_k ∈ X, y_k = f(x_k), k = 1, ..., N}

is called the **training data set**.

Given a training data set D, the goal of the PLA is to find some w that provides correct classification on D, i.e. sign(w * x_k) = y_k holds for all pairs in the training set. Obviously, for a linearly separable training data set such a classifier is not unique.

The algorithm is iterative. On each iteration we obtain a new candidate value for w, which we denote w(i); the corresponding linear mapping is denoted h_i(x) = sign(w(i) * x).

Initialization. Assign w(0) = 0.

Step i.

- Compute the values h_i(x_k) on the training set until we find a misclassified pair, say, h_i(x_k) ≠ y_k for a pair (x_k, y_k).
- Update the candidate using the rule: w(i+1) = w(i) + y_k x_k.
- Update the candidate using the rule: .

Termination. Stop iterations when all points are correctly classified.

The intuition behind the update method (item 2) is that it adjusts w to improve its performance on the selected misclassified point:

y_k (w(i+1) * x_k) = y_k (w(i) * x_k) + y_k² (x_k * x_k) = y_k (w(i) * x_k) + ||x_k||²,

and since always ||x_k||² ≥ 1, due to the first coordinate fixed to 1, the value of w(i+1) * x_k is closer to having the same sign as y_k than w(i) * x_k. Since each step improves performance only on a single training pair (and may actually make classification worse on some or all other pairs), it is not obvious why the PLA converges to a solution of the linear classification problem. The proof will be discussed in one of the future posts. Here we will use an extremely simple example to illustrate the PLA.
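The improvement on the selected point can also be checked numerically; the toy vectors below are assumptions chosen so that x is misclassified by the current candidate w.

```python
import numpy as np

# Numeric check of the alignment argument above; x, y, w are toy values.
x = np.array([1.0, 0.3, -0.7])    # training vector with the leading 1
y = -1.0                          # its target class
w = np.array([0.5, -0.5, 0.1])    # current candidate; it misclassifies x
before = y * np.dot(w, x)         # negative: x is misclassified
after = y * np.dot(w + y * x, x)  # alignment after the PLA update
# The gain (after - before) is exactly ||x||^2, at least 1 due to the
# leading coordinate, so the update always improves alignment on x.
```

Note that nothing here bounds how much the update hurts the alignment on the other training points; that is exactly why convergence requires a separate proof.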

The training data set consists of two training points:

;

also shown below:

While this is probably the simplest example that can be devised to illustrate the PLA, it shows the idea behind it in a clear, unobstructed way. Executing the PLA step by step, we first initialize w(0) = 0.

Iteration 1. We obtain:

;

.

Note that h_0 classifies the entire range of real values as +1. Of the two training data points only the first one is misclassified: the desired value is -1 while h_0 returns +1. This results in an update:

Iteration 2. With the new candidate w(1) we obtain the following values on the data points:

;

.

As a side note, the decision boundary between the two classes is now located at a finite point (while for h_0 it was at -∞):

The value of the decision boundary is obtained by solving the equation:

or, equivalently, .

The misclassified point leads to an update:

That moves the decision boundary to the root of the equation and results in all training data points being classified correctly:

The PLA is covered by many books and articles. The AML Book listed below is easily accessible, well written and has a corresponding MOOC.

[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. “Learning From Data” AMLBook, 213 pages, 2012.
