Multi-Layer Learning Networks

Multi-Layer Learning Networks

R. A. STAFFORD

Philco Corp., Aeronutronic DivisionNewport Beach, California

This paper is concerned with the problem of designing a network of linear threshold elements capable of efficiently adapting its various sets of weights so as to produce a prescribed input-output relation. It is to accomplish this adaptation by being repetitively presented with the various inputs along with the corresponding desired outputs. We will not be concerned here with the further requirement of various kinds of ability to “generalize”—i.e., to tend to give correct outputs for inputs that have not previously occurred when they are similar in some transformed sense to other inputs that have occurred.

In putting forth a model for such an adapting or “learning” network, a requirement is laid down that the complexity of the adaption process in terms of interconnections among elements needed for producing appropriate weight changes, should not greatly exceed that already required to produce outputs from inputs with a static set of weights. In fact, it has been found possible to use the output-from-input computing capacity of the network to help choose proper weight changes by observing the effect on the output of a variety of possible weight changes.

No attempt is made here to defend the proposed network model on theoretical grounds since no effective theory is known at present. Instead, the plausibility of the various aspects of the network model, combined with empirical results must suffice.

To simplify the problem it is assumed that the network receives a set of two-valued inputs, x₁, x₂, ..., xₙ, and is required to produce only a single two-valued output, y. It is convenient to assign the numerical quantities +1 and -1 to the two values of each variable.

The simplest network would consist of a single linear threshold element with a set of weights, c₀, c₁, c₂, ..., cₙ. These determine theoutput-input relation or function so that y is +1 or -1 according as the quantity, c₀ + c₁x₁ + c₂x₂ + ... + cₙxₙ, is positive or not, respectively. It is possible for such a single element to exhibit an adaptive behavior as follows. If, for a given set, x₁, x₂, ..., xₙ, the output, y, is correct, then make no changes to the weights. Otherwise change the weights according to the equations

It has been shown by a number of people that the weights of such an element are assured of arriving at a set of values which produce the correct output-input relation after a sufficient number of errors, provided that such a set exists. An upper bound on the number of possible errors can be given which depends only on the initial weight values and the logical function to be learned. This does not, however, solve our network problem for two reasons.

First, as the number, n, of inputs gets large, the number of errors to be expected for most functions which can be learned increases to unreasonable values. For example, for n = 6, most such functions result in 500 to 1000 errors compared to an average of 32 errors to be expected in a perfect learning device.

Second, and more important, the fraction of those logical functions which can be generated in a single element becomes vanishingly small as n increases. For example, at n = 6 less than one in each three trillion logical functions is so obtainable.

It can be demonstrated that if a sufficiently large number of linear threshold elements is used, with the outputs of some being the inputs of others, then a final output can be produced which is any desired logical function of the inputs. The difficulty in such a network lies in the fact that we are no longer provided with a knowledge of the correct output for each element, but only for the final output. If the final output is incorrect there is no obvious way to determine which sets of weights should be altered.

As a result of considerable study and experimentation at Aeronutronic, a network model has been evolved which, it is felt, will get around these difficulties. It consists of four basic features which will now be described.

It is proposed that all weights in elements attached to inputs which come from other elements in the network be restricted to positive values. (Weights attached to the original inputs to the network, of course, must be allowed to be of either sign.) The reason for such a restriction is this. If element 1 is an input to element 2 with weight c₁₂, element 2 to element 3 with weight c₂₃,etc., then the sign of the product, c₁₂c₂₃ ..., gives the sense of the effect of a change in the output of element 1 on the final element in the chain (assuming this is the only such chain between the two elements). If these various weights were of either possible sign, then a decision as to whether or not to change the output in element 1 to help correct an error in the final element would involve all weights in the chain. Moreover, since there would in general be a multiplicity of such chains, the decision is rendered impossibly difficult.

The above restriction removes this difficulty. If the output of any element in the network is changed, say, from -1 to +1, the effect on the final element, if it is affected at all, is in the same direction.

It should be noted that this restriction does not seriously affect the logical capabilities of a network. In fact, if a certain logical function can be achieved in a network with the use of weights of unrestricted sign, then the same function can be generated in another network with only positive interconnecting weights and, at worst, twice the number of elements. In the worst case this is done by generating in the restricted network both the output and its complement for each element of the unrestricted network. (It is assumed that there are no loops in the network.)

The central problem in network learning is that of determining, for a given input, the set of elements whose outputs can be altered so as to correct the final element, and which will do the least amount of damage to previous adaptations to other inputs. Once this set has been determined, the incrementing rule given for a single element will apply in this case as well (subject to the restriction of leaving interconnecting weights positive), since the desired final output coincides with that desired for each of the elements to be changed (because of positive interconnecting weights).

In the process of arriving at such a decision three factors need to be considered. Elements selected for change should tend to be those whoseoutput would thereby be affected for a minimum number of other possible inputs. At the same time it should be ascertained that a change in each of the elements in question does indeed contribute significantly towards correcting the final output. Finally, a minimum number of such elements should be used.

It would appear at first that this kind of decision is impossible to achieve if the complexity of the decision apparatus is kept comparable to that of the basic input-output network as mentioned earlier. However, in the method to be described it is felt that a reasonable approximation to these requirements will be achieved without an undue increase in complexity.

It is assumed that in addition to its normal inputs, each element receives a variable input bias which we can call b. The output of every element should then be determined by the sign of the usual weighted sum of its inputs plus this bias quantity. This bias is to be the same for each element of the network. If b = 0 the network will behave as before. However, if b is increased gradually, various elements throughout the network will commence changing from -1 to +1, with one or a few changing at any one time as a rule. If b is decreased, the opposite will occur.

Now suppose that for a given input the final output ought to be +1 but actually is -1. Assume that b is then raised so high that this final output is corrected. Then commence a gradual decline in b. Various elements may revert to -1, but until the final output does, no weights are changed. When the final output does revert to -1, it is due to an element’s having a sum (weighted sum plus bias) which just passed down through zero. This then caused a chain effect of changing elements up to the final element, but presumably this element is the only one possessing a zero sum. This can then be the signal for the weights on an element to change—a change of final output from right to wrong accompanied simultaneously by a zero sum in the element itself.

After such a weight change, the final output will be correct once more and the bias can again proceed to fall. Before it reaches zero, this process may occur a number of times throughout the network. When the bias finally stands at zero with the final output correct, the network is ready for the next input. Of course if -1 is desired, the bias will change in the opposite direction.

It is possible that extending the weight change process a little past the zero bias level may have beneficial results. This might increase the life expectancy of each learned input-output combination and thereby reduce the total number of errors. This is because the methodused above can stop the weight correction process so that even though the final output is correct, some elements whose output are essential to the final output have sums close to zero, which are easily changed by subsequent weight changes.

It will be noted that this method conforms to all three considerations mentioned previously. First, by furnishing each element the same bias, and by not changing weights until the final output becomes incorrect with dropping bias, there is a strong tendency to select elements which, with b = 0, would have sums close to zero. But the size of the sum in an element is a good measure of the amount of damage done to an element for other inputs if its current output is to be changed. Second, it is obvious that each element changed has had a demonstrable effect on the final output. Finally, there will be a clear tendency to change only a minimum of elements because changes never occur until the output clearly requires a change.

On the other hand this method requires little more added complexity to the network than it already has. Each element requires a bias, an error signal, and the desired final output, these things being uniform for all elements in a network. Some external device must manipulate the bias properly, but this is a simple behavior depending only on an error signal and the desired final output—not on the state of individual elements in the network. What one has, then, is a network consisting of elements which are nearly autonomous as regards their decisions to change weights. Such a scheme appears to be the only way to avoid constructing a central weight-change decision apparatus of great complexity. This rather sophisticated decision is made possible by utilizing the computational capabilities the network already possesses in producing outputs from inputs.

It should be noted here that this varying bias method requires that the variable bias be furnished to just those elements which have variable weights and to no others. Any fixed portion of the network, such as preliminary layers or final majority function for example, must operate independently of the variable bias. Otherwise, the final output may go from right to wrong as the bias moves towards zero and no variable-weight element be to blame. In such a case the network would be hung up.

A third aspect of the network model is that for all the care taken in the previous steps, they will not suffice in settling quickly to a setof weights that will generate the required logical function unless there is a great multiplicity of ways in which this can be done. This is to say that a learning network needs to have an excess margin of weights and elements beyond the minimum required to generate the functions which are to be learned.

This is analogous to the situation that prevails for a single element as regards the allowed range of values on its weights. It can be shown for example, that any function for n=6 that can be generated by a single element can be obtained with each weight restricted to the range of integer values -9,-8, ..., +9. Yet no modification of the stated weight change rule is known which restricts weight values to these and yet has any chance of ever being learned for most functions.

It would appear from some of the preliminary results of network simulations that it may be useful to have elements become “fatigued” after undergoing an excessive number of weight changes. Experiments have been performed on simplifications of the model described so far which had the occasional result that a small number of elements came to a state where they received most of the weight increments, much to the detriment of the learning process. In such cases the network behaves as if it were composed of many fewer adjustable elements. In a sense this is asking each element to maintain a record of the data it is being asked to store so that it does not attempt to exceed its own information capacity.

It is not certain just how this fatigue factor should enter in the element’s actions, but if it is to be compatible with the variable bias method, this fatigue factor must enter into the element’s response to a changing bias. Once an element changes state with zero sum at the same time that the final output becomes wrong, incrementing must occur if the method is to work. Hence a “fatigued” element must respond less energetically to a change of bias, perhaps with a kind of variable factor to be multiplied by the bias term.

It is felt that the problem of selecting the structure of interconnections for a network is intimately connected to the previously mentioned problem of generalization. Presumably a given type of generalization can be obtained by providing appropriate fixedportions of the network and an appropriate interconnection structure for the variable portion. However, for very large networks, it is undoubtedly necessary to restrict the complexity so that it can be specified by relatively simple rules. Since very little is known about this quite important problem, no further discussion will be attempted here.

A computer simulation of some of the network features previously described has been made on an IBM 7090. Networks with an excess of elements and with only positive interconnecting weights were used. However, in place of the variable bias method, a simple choice of the element of sum closest to, and on the wrong side of, zero was made without regard to the effectiveness of the element in correcting the final output. No fatigue factors were used.

The results of these simulations are very encouraging, but at the same time indicate the need for the more sophisticated methods. No attempt will be made here to describe the results completely.

In one series of learning experiments, a 22-element network was used which had three layers, 10 elements on the first, 11 on the second, and 1 on the third. The single element on the third was the final output, and was a fixed majority function of the 11 elements in the second layer. These in turn each received inputs from each of the 10 on the first layer and from each of the 6 basic inputs. The 10 on the first layer each received only the 6 basic inputs. A set of four logical functions, A, B, C, and D, was used. Function A was actually a linear threshold function which could be generated by the weights 8, 7, 6, 5, 4, 3, 2, functions B and C were chosen by randomly filling in a truth table, while D was the parity function.

Table I gives the results of one series of runs with these functions and this network, starting with various random initial weights. The quantity, r, is the number of complete passes through the 64-entry truth table before the function was completely learned, while e is the total number of errors made. In evaluating the results it should be noted that an ideal learning device would make an average of 32 errors altogether on each run. The totals recorded in these runs are agreeably close to this ideal. As expected, the linear threshold function is the easiest to learn, but it is surprising that the parity function was substantially easier than the two randomly chosen functions.Table IIgives a chastening result of the same experiment with all interconnecting weights removed except that the final element is a fixed majority function of the other 21 elements. Thus there was adaptation on one layer only. As can be seenTable Iis hardly better thanTable IIso that the value of variable interconnecting weights was not being fully realized. In a later experiment the number of elements was reduced to 12 elements and the same functions used. In this case the presence of extra interconnecting weights actually proved to be a hindrance! However a close examination of the incrementing process brought out the fact that the troublesome behavior was due to the greater chance of having only a few (often only one) elements do nearly all the incrementing. It is expected that the use of the additional refinements discussed herein will produce a considerable improvement in bringing out the full power of adaptation in multiple layers of a network.

Aside from the previous question of deciding on network structure, there are several other questions that remain to be studied in learning networks.

There is the question of requiring more than a single output from a network. If, say, two outputs are required for a given input, one +1 and the other -1, this runs into conflict with the incrementing process. Changes that aid one output may act against the other.Apparently the searching process depicted before with a varying bias must be considerably refined to find weight changes which act on all the outputs in the required way. This is far from an academic question because there will undoubtedly be numerous cases in which the greatest part of the input-output computation will have shared features for all output variables. Only at later levels do they need to be differentiated. Hence it is necessary to envision a single network producing multiple outputs rather than a separate network for each output variable if full efficiency is to be achieved.

Another related question is that of using input variables that are either many-, or continuous-, valued rather than two-valued. No fundamental difficulties are discernible in this case, but the matter deserves some considerable study and experimentation.

Another important question involves the use of a succession of inputs for producing an output. That is, it may be useful to allow time to enter into the network’s logical action, thus giving it a “dynamic” as well as “static” capability.


Back to IndexNext