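$$\mathrm{out} = \mathrm{round}(w_1\,\mathrm{in}_1 + w_2\,\mathrm{in}_2)$$
$$\mathrm{round}(w_1\cdot 0 + w_2\cdot 0) = 0$$
$$\mathrm{round}(w_1\cdot 0 + w_2\cdot 1) = 0$$
$$\mathrm{round}(w_1\cdot 1 + w_2\cdot 0) = 0$$
$$\mathrm{round}(w_1\cdot 1 + w_2\cdot 1) = 1$$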
This example is of course a few hundred billion cells short of a full brain and is perhaps a little abrupt with respect to threshold. Let us consider bigger, gentler networks. We will consider networks with 3 layers, comprised of n inputs, h hidden cells, and m outputs. The h-by-n weighting matrix from inputs to hidden cells will be called V and the m-by-h weighting matrix from hidden cells to outputs will be called W. Pan right for a concrete example.
The hard threshold at each cell will be replaced by the soft sigmoid
$$s(x) = \frac{1}{1+\exp(0.5-x)}.$$
With this notation, if p is an input pattern then the output will be
$$o = s(Ws(Vp)). \tag{1}$$
This representation is dangerously brief; please make sure you understand it before proceeding.
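To unpack (1), here is a minimal sketch of the forward pass in plain Python. It is not the course's nnxor code. It assumes that s acts on each component of its vector argument, and it borrows the 4-decimal values of V and W printed in the training diary further on. The helper names matvec and forward are ours.

```python
import math

def s(x):
    """Soft sigmoid from the notes: s(x) = 1/(1+exp(0.5-x))."""
    return 1.0 / (1.0 + math.exp(0.5 - x))

def matvec(A, x):
    """Product of a matrix (list of rows) with a vector (list)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def forward(p, V, W):
    """Evaluate o = s(W s(V p)), applying s componentwise."""
    q = [s(z) for z in matvec(V, p)]      # h hidden activations
    return [s(z) for z in matvec(W, q)]   # m outputs

# Assumption: weights copied, rounded to 4 decimals, from the diary.
V = [[6.0321, 6.0311],
     [1.0216, 1.0066]]      # h-by-n with h = 2, n = 2
W = [[13.4652, -17.7826]]   # m-by-h with m = 1

for p in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(p, forward(p, V, W)[0])
```

The printed outputs should land close to the four values in the diary, up to the rounding of V and W.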
We will typically have I input patterns,
$$p_1, p_2, \ldots, p_I,$$
that we would like to pair with I targets
$$t_1, t_2, \ldots, t_I.$$
We do this by choosing V and W to minimize the misfit
$$E(V,W) = \frac{1}{2}\sum_{i=1}^{I} \|s(Ws(Vp_i)) - t_i\|^2.$$
(Recall that each $p_i$ has n components and each $t_i$ has m components.)
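As a sanity check on the misfit, the following Python sketch evaluates E for the four XOR patterns. The XOR targets are our assumption, inferred from the name xortrain. It compares the weights reported in the diary further on with all-zero weights. The function names are ours and this is not the course code.

```python
import math

def s(x):
    """Soft sigmoid from the notes."""
    return 1.0 / (1.0 + math.exp(0.5 - x))

def forward(p, V, W):
    """o = s(W s(V p)) with s applied componentwise."""
    q = [s(sum(v * pk for v, pk in zip(row, p))) for row in V]
    return [s(sum(w * qj for w, qj in zip(row, q))) for row in W]

def misfit(patterns, targets, V, W):
    """E(V,W) = (1/2) * sum_i ||s(W s(V p_i)) - t_i||^2."""
    total = 0.0
    for p, t in zip(patterns, targets):
        o = forward(p, V, W)
        total += sum((oi - ti) ** 2 for oi, ti in zip(o, t))
    return 0.5 * total

# Assumption: XOR patterns and targets, with I = 4, n = 2, m = 1.
patterns = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [[0], [1], [1], [0]]

V_diary = [[6.0321, 6.0311], [1.0216, 1.0066]]
W_diary = [[13.4652, -17.7826]]
V_zero = [[0.0, 0.0], [0.0, 0.0]]
W_zero = [[0.0, 0.0]]

print("misfit, diary weights:", misfit(patterns, targets, V_diary, W_diary))
print("misfit, zero weights: ", misfit(patterns, targets, V_zero, W_zero))
```

The diary weights should give a much smaller misfit than the zero weights, which is all that training aims for.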
We accomplish the minimization of E by guiding V and W down the gradient of E. That is, we update V and W according to
$$W = W - \mathrm{rate}\;\mathrm{grad}_W E, \qquad V = V - \mathrm{rate}\;\mathrm{grad}_V E.$$
These derivatives are easiest to see if there is but one input pattern (I=1) and a single output cell (m=1), for then the misfit is simply
$$E(V,W) = \frac{1}{2}\,(s(Ws(Vp)) - t)^2$$
and so the chain rule reveals that the gradient, or vector of partial derivatives,
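$$\mathrm{grad}_W E(V,W) = \left[\; \frac{\partial E}{\partial W_1} \quad \frac{\partial E}{\partial W_2} \quad \cdots \quad \frac{\partial E}{\partial W_h} \;\right]$$
takes the form
$$\mathrm{grad}_W E(V,W) = \underbrace{(s(Ws(Vp)) - t)}_{1\times 1}\;\underbrace{s'(Ws(Vp))}_{1\times 1}\;\underbrace{s(Vp)^T}_{1\times h}.$$
And similarly, the matrix of partial derivatives, $\mathrm{grad}_V E(V,W)$,
$$\begin{bmatrix}
\partial E/\partial V_{1,1} & \partial E/\partial V_{1,2} & \cdots & \partial E/\partial V_{1,n} \\
\partial E/\partial V_{2,1} & \partial E/\partial V_{2,2} & \cdots & \partial E/\partial V_{2,n} \\
\vdots & \vdots & & \vdots \\
\partial E/\partial V_{h,1} & \partial E/\partial V_{h,2} & \cdots & \partial E/\partial V_{h,n}
\end{bmatrix}$$
takes the form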
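$$\mathrm{grad}_V E(V,W) = \underbrace{(s(Ws(Vp)) - t)}_{1\times 1}\;\underbrace{s'(Ws(Vp))}_{1\times 1}\;\underbrace{(W^T \mathbin{.*} s'(Vp))}_{h\times 1}\;\underbrace{p^T}_{1\times n}.$$
These expressions can be simplified a bit upon observing that our sigmoid obeys
$$s'(x) = s(x)(1 - s(x))$$
and so, setting
$$q = s(Vp) \quad\text{and}\quad o = s(Wq)$$
brings
$$\mathrm{grad}_W E = (o-t)\,o(1-o)\,q^T \quad\text{and}\quad \mathrm{grad}_V E = (o-t)\,o(1-o)\,(W^T \mathbin{.*} q \mathbin{.*} (1-q))\,p^T.$$
We have coded this training here. We test the net it generates using nnxor, in this diary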
>> [V,W] = xortrain([5 5;1 1],[10 -15],10000,0.25)
V =
    6.0321    6.0311
    1.0216    1.0066
W =
   13.4652  -17.7826
>> nnxor([0 0]',V,W)
ans = 0.1062
>> nnxor([0 1]',V,W)
ans = 0.8600
>> nnxor([1 0]',V,W)
ans = 0.8524
>> nnxor([1 1]',V,W)
ans = 0.1615
Do you see that our pupil has stumbled upon the Boolean identity
$$\mathrm{XOR}(a,b) = \mathrm{OR}(a,b) - \mathrm{AND}(a,b)\,?$$
You see from the code that we have proceeded as if there was only one input pattern by simply choosing one at random each time. The same ploy may be used in the case of multiple outputs, as, e.g., in this week's assignment.
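The update rule, the simplified gradients, and the pick-one-pattern-at-random ploy can be sketched together in Python. This is a stand-in for the linked training code, not a copy of it. The XOR targets, the seed, and the function names are our assumptions. The starting weights, step count, and rate mirror the arguments of the xortrain call in the diary.

```python
import math
import random

def s(x):
    """Soft sigmoid from the notes."""
    return 1.0 / (1.0 + math.exp(0.5 - x))

def hidden_and_output(p, V, W):
    """Return q = s(Vp) and the scalar o = s(Wq) for a single output cell."""
    q = [s(sum(v * pk for v, pk in zip(row, p))) for row in V]
    o = s(sum(w * qj for w, qj in zip(W, q)))
    return q, o

def gradients(p, t, V, W):
    """gradW E = (o-t)o(1-o)q^T and gradV E = (o-t)o(1-o)(W^T.*q.*(1-q))p^T."""
    q, o = hidden_and_output(p, V, W)
    delta = (o - t) * o * (1.0 - o)
    grad_W = [delta * qj for qj in q]
    grad_V = [[delta * wj * qj * (1.0 - qj) * pk for pk in p]
              for wj, qj in zip(W, q)]
    return grad_V, grad_W

def misfit(patterns, targets, V, W):
    """E summed over all patterns, for a single output cell."""
    return 0.5 * sum((hidden_and_output(p, V, W)[1] - t) ** 2
                     for p, t in zip(patterns, targets))

def train(V, W, patterns, targets, steps, rate, seed=0):
    """Gradient descent on one randomly chosen pattern per step."""
    rng = random.Random(seed)          # fixed seed only for repeatability
    V = [row[:] for row in V]          # work on copies
    W = W[:]
    for _ in range(steps):
        i = rng.randrange(len(patterns))
        grad_V, grad_W = gradients(patterns[i], targets[i], V, W)
        W = [w - rate * g for w, g in zip(W, grad_W)]
        V = [[v - rate * g for v, g in zip(v_row, g_row)]
             for v_row, g_row in zip(V, grad_V)]
    return V, W

# Assumption: XOR patterns and targets.
patterns = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 0]
V0 = [[5.0, 5.0], [1.0, 1.0]]
W0 = [10.0, -15.0]

V, W = train(V0, W0, patterns, targets, steps=10000, rate=0.25)
print("misfit before:", misfit(patterns, targets, V0, W0))
print("misfit after: ", misfit(patterns, targets, V, W))
```

Both gradients are computed before either weight matrix is updated, so grad_V uses the same W that produced o. Because the patterns are drawn at random, the trained weights will not match the diary digit for digit.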
Hopfield nets are comprised of N hard-threshold nodes with all-to-all symmetric coupling where on = 1 and off = -1. Let us start with the net at the right. There are 4 nodes and so we must build a 4-by-4 weight matrix W. The bidirectional arrows imply equal weights in each direction, i.e.,
$$W_{ij} = W_{ji}.$$
If s is the current state of the net then
$$\mathit{ns} = \mathrm{sign2}(Ws), \quad\text{where}\quad \mathrm{sign2}(x) = 1 \text{ if } x > 0, \quad \mathrm{sign2}(x) = -1 \text{ if } x \le 0,$$
will be the new state. This net can be trained to remember an input pattern p by setting the weights to
$$W = pp^T;$$
then p and its mirror image -p will be attractors. That is, they will satisfy s = sign2(Ws). To see this note that
$$Wp = pp^Tp = p(p^Tp) = (p^Tp)p \quad\text{and so}\quad \mathrm{sign2}(Wp) = \mathrm{sign2}(p) = p.$$
In fact, for each initial state s the very next state will be either p or -p, or -ones(N,1) (when $p^Ts = 0$), for
$$Ws = pp^Ts = p(p^Ts) = (p^Ts)p \quad\text{and so}\quad \mathrm{sign2}(Ws) = \mathrm{sign2}(p^Ts)\,\mathrm{sign2}(p),$$
unless p is orthogonal to s, i.e., $p^Ts = 0$, in which case
$$\mathrm{sign2}(Ws) = -\mathrm{ones}(N,1).$$
If this full-off vector is itself orthogonal to p then it and p and -p will be the only attractors. To get a feel for this I encourage you to dissect and exercise hop on several 4-by-1 choices.
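For readers without hop at hand, a small Python sketch of the single-pattern net follows. It is our own stand-in, not the course's hop. The pattern p is an arbitrary choice, and the names outer and step are ours.

```python
def sign2(x):
    """Hard threshold from the notes: 1 if x > 0, -1 if x <= 0."""
    return 1 if x > 0 else -1

def outer(p):
    """Weight matrix W = p p^T for one training pattern."""
    return [[pi * pj for pj in p] for pi in p]

def step(W, state):
    """One update of the whole net: new state = sign2(W s)."""
    return [sign2(sum(w * sj for w, sj in zip(row, state))) for row in W]

# Assumption: an arbitrary 4-by-1 pattern of +1/-1 entries.
p = [1, -1, 1, -1]
W = outer(p)

trial_states = [
    [1, -1, 1, -1],     # p itself
    [-1, 1, -1, 1],     # the mirror image -p
    [1, 1, 1, -1],      # p^T s = 2 > 0
    [1, 1, 1, 1],       # orthogonal to p
    [-1, -1, -1, -1],   # the full-off vector, also orthogonal to this p
]
for state in trial_states:
    print(state, "->", step(W, state))
```

For this p the full-off vector is orthogonal to p, so it maps to itself and joins p and -p as an attractor.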
All of this generalizes nicely to multiple training patterns. In fact, if $p_1$ and $p_2$ are two such patterns we set
$$P = [p_1 \;\; p_2] \quad\text{and}\quad W = PP^T.$$
Arguing as above, we find
$$Ws = PP^Ts = (s^Tp_1)\,p_1 + (s^Tp_2)\,p_2.$$
Evaluating sign2 of this is now a much more interesting affair. If $p_1$ and $p_2$ are orthogonal then it is not hard to see that both $p_1$ and $p_2$ (and their mirrors) will be attractors. In the general case they remain some (of possibly many) attractors. To get a feel for this I encourage you to exercise hop on several 4-by-2 choices.
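One way to explore the two-pattern case is to list every state that satisfies s = sign2(Ws). The Python sketch below does so by brute force over all 2^N states. It is our own illustration rather than the course's hop, and the orthogonal pair p1, p2 is an arbitrary choice.

```python
from itertools import product

def sign2(x):
    """Hard threshold from the notes: 1 if x > 0, -1 if x <= 0."""
    return 1 if x > 0 else -1

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weights(patterns):
    """W = P P^T, i.e. the sum of p p^T over the training patterns."""
    n = len(patterns[0])
    return [[sum(p[i] * p[j] for p in patterns) for j in range(n)]
            for i in range(n)]

def matvec(W, state):
    return [dot(row, state) for row in W]

def step(W, state):
    """One update of the whole net: new state = sign2(W s)."""
    return [sign2(z) for z in matvec(W, state)]

def attractors(W):
    """All +1/-1 states s with s = sign2(W s), found by exhaustive search."""
    n = len(W)
    return [list(st) for st in product([1, -1], repeat=n)
            if step(W, list(st)) == list(st)]

# Assumption: an arbitrary orthogonal pair of 4-by-1 patterns.
p1 = [1, 1, -1, -1]
p2 = [1, -1, 1, -1]
W = weights([p1, p2])

for state in attractors(W):
    print(state)
```

The list contains p1, p2 and their mirrors. Any further entries are the "possibly many" other attractors mentioned above.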