The communication pattern is asymmetrical: rows broadcast, columns
rotate.
(Note: The Cannon algorithm is similar to Fox, but uniformly
does rotation).
Fox (and Cannon) treatments make the following assumptions:
The fox algorithm:
q = sqrt(p) // number of rows, cols in processor grid
// A is operand 1, B is operand 2 in A * B
// C is result
// i,j = process row, column
// src, dest rows for rotating 'up'
src = i+1 mod q;
dest = i-1 mod q;
for (stage = 0; stage < q; stage++) {
k_bar = (i+stage) mod q;
broadcast(A[i,k_bar]) to row i;
C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j]
sendrecv(B[k_bar,j],src,dest);
}
Here is Pacheco's code