Neural Networks

As our next example we shall study networks of simple minded neurons. Attractors on these nets will look and behave much like the attractors for our gene nets. Our interest however is not the behavior of a single net but rather a net's capacity to learn in both supervised and unsupervised settings. This is not so far from your work on optimal fiber nets. In that case you began with a random set of fiber diameters and a given load and the net slowly learned how to redistribute its volume in such a way as to minimize the work done by the load. The role of teacher was played by fmincon.

1. Multi-Layer Perceptrons

In this setting we shall teach a neural network to respond in a desired way when faced with certain stimuli.

To learn is to modify one's synaptic weights.

Let us start with a simple example. One neuron with two inputs. The neuron scales and sums its inputs and fires if the sum exceeds a given threshold. If the inputs and output are binary and the threshold for firing is 0.5 then

     out = round(w1in1 + w2in2)

is our neural net. The challenge is to choose w1 and w2 in accordance with a particular task. For example, can we teach this neuron to perform the and operation? More precisely, we search for w1 and w2 such that

     round(w10 + w20) = 0
     round(w10 + w21) = 0
     round(w11 + w20) = 0
     round(w11 + w21) = 1

Here is one possible approach. You see that w learns to associate each stimulus with its preferred target by following the negative gradient of the error.

This example is of course a few hundred billion cells short of a full brain and is perhaps a little abrupt with respect to threshold. Let us consider bigger more gentle networks. We will consider networks with 3 layers comprised of

n inputs h hidden cells, and m outputs

The h-by-n weighting matrix from inputs to hidden cells will be called V and the m-by-h weighting matrix from hidden cells to outputs will be called W. Pan right for a concrete example.

The hard threshold at each cell will be replaced by the soft sigmoid

s(x) = 1/(1+exp(0.5-x))

With this notation, if p is an input pattern then the output will be

(1) o = s(Ws(Vp))

This representation is dangerously brief - please make sure you understand it before proceeding.

We will typically have I input patterns,

p1, p2, ..., pI

that we would like to pair with I targets

t1, t2, ..., tI

We do this by choosing V and W to minimize the misfit

I E(V,W) = (1/2) &sum ||s(Ws(Vpi))-ti||2 i=1

(Recall that each pi has n components and each ti has m components.)

We accomplish the minimization of E by guiding V and W along the gradient of E. That is, we update V and W according to

W = W - rate gradW E V = V - rate gradV E

These derivatives are easiest to see if there is but one input pattern (I=1) and a single output cell (m=1), for then the misfit is simply

E(V,W) = (1/2)(s(Ws(Vp))-t)2

and so the chain rule reveals the gradient, or vector of partial derivatives

gradW E(V,W) = [ &partE/&partW1 &partE/&partW2 ... &partE/&partWh ]

takes the form

gradW E(V,W) = (s(Ws(Vp))-t) * s'(Ws(Vp)) * s(Vp)T 1x1 1x1 1xh

And similarly, the matrix of partial derivatives, gradV E(V,W),

&partE/&partV1,1 &partE/&partV1,2 ... &partE/&partV1,n &partE/&partV2,1 &partE/&partV2,2 ... &partE/&partV2,n ... ... ... ... &partE/&partVh,1 &partE/&partVh,2 ... &partE/&partVh,n

takes the form

gradV E(V,W) = (s(Ws(Vp))-t) * s'(Ws(Vp)) * (WT .* s'(Vp)) * pT 1x1 1x1 hx1 1xn

These expressions can be simplified a bit upon observing that our sigmoid obeys

s'(x) = s(x)(1-s(x))

and so, setting

q = s(Vp) and o = s(Wq) brings gradW E = (o-t)o(1-o)qT and gradV E = (o-t)o(1-o)(WT.*q.*(1-q))pT

We have coded this training here. We test the net it generates using nnxor, in this diary
>> [V,W] = xortrain([5 5;1 1],[10 -15],10000,0.25)

V = 6.0321    6.0311
    1.0216    1.0066


W = 13.4652  -17.7826

>> nnxor([0 0]',V,W)
ans = 0.1062

>> nnxor([0 1]',V,W)
ans = 0.8600

>> nnxor([1 0]',V,W)
ans = 0.8524

>> nnxor([1 1]',V,W)
ans = 0.1615
Do you see that our pupil has stumbled upon the Boolean Identity

XOR(a,b) = OR(a,b) - AND(a,b)

You see from the code that we have proceeded as if there was only one input pattern by simply choosing one at random each time. The same ploy may be used in the case of multiple outputs, as, e.g., in this week's assignment.

2. Hopfield Nets

Hopfield nets are comprised of N hard-threshold nodes with all-to-all symmetric coupling where on = 1 and off = -1. Let us start with the net at the right. There are 4 nodes and so we must build a 4-by-4 weight matrix W. The bidirectional arrows imply equal weights in each direction, i.e.,

Wij = Wji

If s is the current state of the net then

ns = sign2(Ws) where sign2(x) = 1 if x > 0, sign2(x) = -1 if x <= 0,

will be the new state. This net can be trained to remember an input pattern p by setting the weights to

W = ppT

then p and its mirror image -p will be attractors. That is, they will satisfy s=sign2(Ws). To see this note that

Wp = ppTp = p(pTp) = (pTp)p and so sign2(Wp) = sign2(p) = p.

In fact, for each initial state the very next state will be either p or -p, or -ones(N,1) (when pTs=0), for

Ws = ppTs = p(pTs) = (pTs)p and so sign2(Ws) = sign2(pTs)sign2(p),

unless p is orthogonal to s, i.e., pTs=0, in which case

sign2(Ws) = -ones(N,1).

If this full off vector is itself orthogonal to p then it and p and -p will be the only attractors. To get a feel for this I encourage you to dissect and exercise hop on several 4-by-1 choices.

All of this generalizes nicely to multiple training patterns. In fact, if p1 and p2 are two such patterns we set

P = [p1 p2] and W = PPT

Arguing as above, we find

Ws = PPTs = (sTp1)p1 + (sTp2)p2

Evalutating sign2 of this is now a much more interesting affair. If p1 and p2 are orthogonal then it is not hard to see that both p1 and p2 (and their mirrors) will be attractors. In the general case they remain some (of possibly many) attractors. To get a feel for this I encourage you to exercise hop on several 4-by-2 choices.