This section serves a number of purposes: First, it illustrates the issues that were raised while discussing the parallel implementation of matrix-matrix multiply and its variants. Second, it shows that high performance can indeed be attained using the techniques that underlie PLAPACK. Finally, it demonstrates that high performance can be attained on different architectures without any changes to the infrastructure. However, one should not read too much into the performance graphs: the numbers we collected and graphed reflect only the parts of the infrastructure that had been optimized at the time this guide was completed. For current performance numbers for a range of algorithms, consult the PLAPACK web page.
All performance numbers center on the implementation of the most basic case of matrix-matrix multiplication, C = A B. For this case, we show the performance of the three variants: panel-panel, matrix-panel, and panel-matrix.
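As a point of reference, the three variants can be viewed as three ways of blocking the same triple loop: panel-panel accumulates rank-nb updates over the k dimension, matrix-panel computes nb columns of C at a time, and panel-matrix computes nb rows of C at a time. The sketch below is not PLAPACK code and the function names are ours; it illustrates only the loop orderings, ignoring distribution and performance.

```python
def matmul_panel_panel(A, B, nb):
    """C := A B as a sum of rank-nb updates: for each block of nb
    columns of A and the matching nb rows of B, add their product to C."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(0, k, nb):          # partition the k dimension
        hi = min(p + nb, k)
        for i in range(m):
            for j in range(n):
                C[i][j] += sum(A[i][l] * B[l][j] for l in range(p, hi))
    return C

def matmul_matrix_panel(A, B, nb):
    """C := A B one block of nb columns of C at a time (C_p = A B_p)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(0, n, nb):          # partition the n dimension
        for j in range(p, min(p + nb, n)):
            for i in range(m):
                C[i][j] = sum(A[i][l] * B[l][j] for l in range(k))
    return C

def matmul_panel_matrix(A, B, nb):
    """C := A B one block of nb rows of C at a time (C_p = A_p B)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(0, m, nb):          # partition the m dimension
        for i in range(p, min(p + nb, m)):
            for j in range(n):
                C[i][j] = sum(A[i][l] * B[l][j] for l in range(k))
    return C
```

All three produce the same C; they differ only in which dimension is partitioned, which is why each variant is preferred when the corresponding dimension (k, n, or m) is the small one.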
Performance is reported in MFLOPS/sec per node, or millions of floating point operations per second per node. Given that a problem with dimensions m , n , and k requires time t(m,n,k) (in seconds) to complete on p nodes, the MFLOPS/sec per node for matrix-matrix multiplication is computed using the formula

MFLOPS/sec per node = 2 m n k / ( p × t(m,n,k) × 10^6 ).
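The rate computation can be sketched in a few lines; mflops_per_node is our own helper name, and 2mnk is the standard operation count for a matrix-matrix multiply.

```python
def mflops_per_node(m, n, k, t_seconds, num_nodes):
    """MFLOPS/sec per node for an m x n x k matrix-matrix multiply that
    took t_seconds on num_nodes nodes (2*m*n*k flops in total)."""
    return (2.0 * m * n * k) / (t_seconds * num_nodes * 1.0e6)

# Example: a 5000 x 5000 x 5000 multiply completing in 100 seconds on
# 64 nodes sustains 2 * 5000^3 / (100 * 64 * 10^6) ~= 39 MFLOPS/sec per node.
```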
We discuss performance on three current generation distributed memory parallel computers: the Intel Paragon with GP nodes (one compute processor per node), the Cray T3D, and the IBM SP-2. In addition, we show performance on the Convex Exemplar S-class system, in a shared-memory configuration. In all cases we will report performance per node, where we use the term node for a compute processor. All performance is for 64-bit precision (i.e. double precision real on the Exemplar, the Paragon, and the SP-2, and single precision real on the T3D). The peak performance of an individual node limits the peak performance per node that can be achieved for our parallel implementations. As of this writing, the single processor peak performances for 64-bit arithmetic matrix-matrix multiply, using assembly coded sequential BLAS, are roughly given by
The speed and topology of the interconnection network affect how closely peak performance can be approached. Since our infrastructure is far from optimized with respect to communication as of this writing, there is little need to discuss network speeds beyond noting that the Intel Paragon and Cray T3D both have very fast interconnection networks, while the IBM SP-2 has a noticeably slower network relative to its individual processor speeds. On the Exemplar, we use MPI to communicate between processors, so that it is programmed as if it were a distributed memory architecture.
In Figure , we show the performance on 64-node configurations of the different architectures. In that figure, the performance of the panel-panel variant is reported, with the matrix dimensions chosen so that all three matrices are square. The algorithmic and distribution blocking sizes were taken to be equal to each other, with the common value chosen to give good performance on each architecture: 20 on the Paragon and 64 on the T3D and SP-2. The different curves approach the peak performance for the given architecture, eventually leveling off.
In Figure , we again report the performance on the different architectures for the panel-panel variant, with the same choices for the blocking parameters.
However, this time we fix dimensions m and n to be large, and we vary dimension k . For the panel-panel based implementation, on all architectures high performance is achieved even for small values of k , which verifies that the panel-panel variant is a good choice when k is small. This was predicted when we discussed implementation of the different variants.
In Figure , we illustrate the performance for the different variants on the Intel Paragon, by fixing dimensions m and n to be large ( m = n = 5000 ) and varying k along the x-axis. Our previous discussion predicts that the panel-panel implementation is most natural for this shape of the matrices, and indeed, that variant outperforms the other two. One fact worth mentioning is that the matrix-panel and panel-matrix variants level off at a considerably lower level of performance. This is because the sequential dgemm implementation performs better for a panel-panel update than for a matrix-panel or panel-matrix multiply when the panel width is 20. (Notice that this reflects a limitation of the implementation of the sequential dgemm call. Better performance can be attained by a more careful implementation of that kernel.) Similar behavior is reported in Figure for the IBM SP-2.
In Figure , we again illustrate the performance for the different variants on the Intel Paragon, but this time dimensions m and k are fixed to be large ( m = k = 5000 ) and n is varied. Our previous discussion predicts that the matrix-panel variant is the most natural for this shape of the matrices, and indeed, that variant initially outperforms the other two, when n is small. Since the matrix-panel and panel-matrix variants level off at a considerably lower level of performance, due to the sequential dgemm implementation, eventually the panel-panel variant becomes fastest. Similar behavior is reported in Figure for the IBM SP-2.
Notice that showing performance numbers for a single configuration of an architecture gives no insight into the scalability of the approach. As nodes are added, the problem size can grow so that memory use per node remains constant (after all, as nodes are added, so is memory). For the matrix-matrix multiply example, this means that if m = n = k , then the dimensions can grow with the square root of the number of nodes. Generally, dense linear algebra algorithms can be parallelized so that they are scalable in the following sense: if the problem size is increased so that memory use per node is kept constant, then the efficiency (e.g., MFLOPS/sec per node) will remain (approximately) constant. To verify this trend for matrix-matrix multiply, we report in Figure the MFLOPS/sec per node on the Intel Paragon when the number of nodes is varied, and the memory per node is kept constant. Notice that the performance is only slightly affected as nodes are added.
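The square-root growth rule can be made concrete with a small back-of-the-envelope helper. The function name and the per-node storage budget below are ours, chosen only for illustration:

```python
import math

def square_dimension(num_nodes, words_per_node):
    """Largest m with m = n = k such that the three m x m matrices
    (3*m^2 words total) fit in num_nodes * words_per_node words."""
    total_words = num_nodes * words_per_node
    return math.isqrt(total_words // 3)

# With a hypothetical budget of 3,000,000 words per node:
# square_dimension(64, 3_000_000)   -> 8000
# square_dimension(256, 3_000_000)  -> 16000  (4x the nodes, 2x the dimension)
```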
In Figure , we show that by carefully picking parameters and matrix dimensions, one can further improve performance. In that figure, we contrast the performance of the panel-panel variant on a Cray T3D for two slightly different choices: The first curve shows the exact same data as used in Figure : m = n = 8000 and nb = 64 . For that curve, no particular care was taken when picking the values of k at which the timings were collected. For the second curve, we increased the problem size slightly, to m = n = 8192 (a power of two); the distribution blocking size was kept at nb = 64 , and the algorithmic blocking size was increased. Also, k was sampled at powers of 2 . Notice that performance improves slightly.
In Figure , we examine the performance of PLAPACK on the Convex Exemplar S-series. While all data presented was collected using a single 16-processor shared memory cluster, communication between processors uses MPI, as on a distributed memory architecture. The graph shows performance for square matrices, with the different curves representing different numbers of nodes being used. While trying to meet the deadline for this guide, we did not have time to fully tune PLAPACK for the Exemplar. Nonetheless, very good performance is achieved. It should be noted that for some matrix sizes, performance on the Exemplar was much lower than expected. In order to make the graph less busy, those performance numbers are not reported. We expect that further experimentation will allow us to more consistently achieve high performance.