for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    // reads C(i,j) (can be register-allocated in the innermost loop if we know that C, A, and B cannot alias)
    for (k = 0; k < n; k++) {
      // reads row i, column k of A into cache. CACHE HIT: (i,k-1), read in the previous iteration, is adjacent in memory.
      // reads row k, column j of B into cache. CACHE MISS: (k-1,j), read in the previous iteration, is far from (k,j) in memory, giving little spatial locality.
      C[i,j] = C[i,j] + A[i,k]*B[k,j];
    }
  }
}
Number of memory accesses (cache misses) = n^3 (due to B)
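For concreteness (illustrative numbers, not from the notes): with n = 1024, the column accesses to B alone cause about 1024^3 ≈ 1.1 * 10^9 misses, one per multiply-add.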
Blocked (tiled) matrix multiply. Consider A, B, C to be NxN matrices of bxb sub-blocks, where b = n/N is the block size.
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    // reads the block at C(i,j) into cache. Likely O(b) misses; one for each row of the block.
    for (k = 0; k < N; k++) {
      // reads the block at A(i,k) into cache. Likely O(b) misses; one for each row of the block.
      // reads the block at B(k,j) into cache. Likely O(b) misses (one for each row of the block). Notice that the access pattern within the block is still column-major, but the memory-load pattern can be row-major.
      BLOCK_MULTIPLY(C[i,j], A[i,k], B[k,j], b)
    }
  }
}
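BLOCK_MULTIPLY above is just an ordinary b x b matrix multiply-accumulate on sub-blocks. A minimal sketch in C (the flat row-major layout, the pointer signature, and the function name are assumptions for illustration; the notes only name the call):

/* Cblk += Ablk * Bblk for one b x b sub-block.
 * Cblk, Ablk, Bblk point at the top-left element of each sub-block inside a
 * row-major n x n matrix, so consecutive rows of a block are n elements apart. */
void block_multiply(double *Cblk, const double *Ablk, const double *Bblk, int b, int n) {
  for (int i = 0; i < b; i++)
    for (int j = 0; j < b; j++)
      for (int k = 0; k < b; k++)
        Cblk[i*n + j] += Ablk[i*n + k] * Bblk[k*n + j];
}

With flat row-major n x n arrays, the loop nest above would call it as block_multiply(&C[i*b*n + j*b], &A[i*b*n + k*b], &B[k*b*n + j*b], b, n), where i, j, k are the block indices.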
Number of memory accesses: 2 * N^3 * (n/N)^2 (each of the N^3 innermost iterations reads one (n/N)x(n/N) block of A and one of B) + 2 * N^2 * (n/N)^2 (each of the N^2 blocks of C is read once and written once, outside the k loop). This transformation is called loop tiling.
The improvement = n^3 / (N*n^2) = n/N = b. In general, increasing b sounds like a good idea, but only until the three bxb blocks no longer fit in the cache together. Thus if the cache size is M (in elements), then the maximum block size we can have is b = sqrt(M/3) (which is also the maximum speedup we can get).
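For example (illustrative numbers): a 32 KB cache holds M = 4096 double-precision elements, so b <= sqrt(4096/3) ≈ 36, i.e., roughly 36x36 blocks and a speedup of at most about 36 over the untiled loop.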
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    A[i,j] = 0;
    for (k = 0; k < n; k++) {
      A[i,j] += X[i,k]*Y[k,j];
    }
  }
}
Serial execution of matrix multiplication:
c is the number of cache lines in the cache. If all n columns of Y can survive in the cache simultaneously, one iteration of the outer loop (e.g., i = 0) takes n^2/c cache misses (Y has n^2 elements, and spatial locality allows a factor of c); later iterations of the outer loop then find Y already resident.
Parallel execution of matrix multiplication: assign n/p rows of A to each processor. With this, each processor needs to access n/p rows of A and X, but the entire Y. Each processor computes n^2/p elements of A, performing n^3/p multiply-and-add operations to do so.
While the computation time decreases in proportion to p, the communication cost (total number of cache misses) rises in proportion to p, because Y needs to be loaded in its entirety into each processor's cache. The total number of cache misses is roughly n^2/c (for the partitioned rows of A and X) + p*n^2/c (for Y, loaded once per processor).
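A minimal sketch of this row partitioning in C with OpenMP (the pragma, the flat row-major arrays, and the function name are assumptions for illustration; the notes do not prescribe a threading mechanism):

/* Each thread computes a contiguous band of roughly n/p rows of A.
 * It touches only its own rows of A and X, but reads all of Y.
 * Compile with -fopenmp (or equivalent). */
void matmul_rowpar(int n, double *A, const double *X, const double *Y) {
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      double acc = 0.0;
      for (int k = 0; k < n; k++)
        acc += X[i*n + k] * Y[k*n + j];
      A[i*n + j] = acc;
    }
  }
}

The tiling below attacks the miss count itself rather than just dividing the work across processors: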
for (ii = 0; ii < n; ii+=B) {
  for (jj = 0; jj < n; jj+=B) {
    for (kk = 0; kk < n; kk+=B) {
      for (i = ii; i < ii+B; i++) {
        for (j = jj; j < jj+B; j++) {
          for (k = kk; k < kk+B; k++) {
            A[i,j] += X[i,k]*Y[k,j];
          }
        }
      }
    }
  }
}
We have re-ordered the iteration space significantly, but because the order in which the products are accumulated does not matter (addition is commutative and associative), this remains correct, provided A is zero-initialized before the loop nest (the initialization statement from the untiled version is no longer inside these loops).
To bring a block of X or Y into the cache, we require B^2/c cache misses. The number of times we need to bring in a block is (n/B)^3 for each of X and Y, so the total number of cache misses = 2 * (n/B)^3 * B^2/c = 2n^3/(Bc). If B = n, we get 2n^2/c cache misses; on the other hand, if B is, say, 100, such that all three blocks fit in the cache, we still reduce the miss count by a factor of O(Bc) [compared to the worst case of O(n^3) cache misses].
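For example (illustrative numbers): with n = 1000, B = 100, and c = 8 elements per cache line, the tiled loops incur about 2 * 1000^3 / (100 * 8) = 2.5 * 10^6 misses, versus up to ~10^9 in the untiled worst case.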
This blocking can be done at each level of the cache hierarchy with different Bs. e.g., B=8 for L1, 80 for L2, ...
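A sketch of what two-level tiling looks like in C (the function name, the VLA-style parameters, and the assumption that B1 divides B2 and B2 divides n are illustrative, not from the notes):

/* Two-level tiling: B2 is the (larger) L2 tile size, B1 the L1 tile size.
 * Assumes B1 divides B2, B2 divides n, and A is zero-initialized. */
void matmul_tiled2(int n, int B1, int B2, double A[n][n], double X[n][n], double Y[n][n]) {
  for (int ii2 = 0; ii2 < n; ii2 += B2)
    for (int jj2 = 0; jj2 < n; jj2 += B2)
      for (int kk2 = 0; kk2 < n; kk2 += B2)
        /* L1-sized tiles inside each L2 tile */
        for (int ii = ii2; ii < ii2 + B2; ii += B1)
          for (int jj = jj2; jj < jj2 + B2; jj += B1)
            for (int kk = kk2; kk < kk2 + B2; kk += B1)
              for (int i = ii; i < ii + B1; i++)
                for (int j = jj; j < jj + B1; j++)
                  for (int k = kk; k < kk + B1; k++)
                    A[i][j] += X[i][k] * Y[k][j];
}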
In practice: blocking can improve performance by a factor of about 3 on a modern uniprocessor (beyond that point the loop becomes compute-bound; recall Amdahl's law); for multiprocessors, where the set of blocks is farmed out across the processors, the speedup can be linear in the number of processors.
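A sketch of farming blocks out to processors, again with OpenMP (the pragma and the signature are assumptions; any mechanism that gives each processor whole output blocks behaves the same way):

/* Each thread owns whole (ii, jj) output blocks of A, so no two threads ever
 * write the same element and no synchronization on A is needed.
 * Assumes B divides n and A is zero-initialized. */
void matmul_tiled_par(int n, int B, double A[n][n], double X[n][n], double Y[n][n]) {
  #pragma omp parallel for collapse(2) schedule(static)
  for (int ii = 0; ii < n; ii += B)
    for (int jj = 0; jj < n; jj += B)
      for (int kk = 0; kk < n; kk += B)
        for (int i = ii; i < ii + B; i++)
          for (int j = jj; j < jj + B; j++)
            for (int k = kk; k < kk + B; k++)
              A[i][j] += X[i][k] * Y[k][j];
}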