Parallel implementation of most dense linear algebra operations is a relatively well understood process. Nonetheless, availability of general purpose, high performance parallel dense linear algebra libraries is severely hampered by the fact that translating the sequential algorithms to a parallel code requires careful manipulation of indices and parameters describing the data, its distribution to processors, and/or the communication required. It is this manipulation of indices that is highly error-prone, leading to bugs in parallel code. This in turn stands in the way of the parallel implementation of more sophisticated algorithms, since the coding effort simply becomes overwhelming.
The Parallel Linear Algebra Package (PLAPACK) infrastructure attempts to overcome this complexity by providing a coding interface that mirrors the natural description of sequential dense linear algebra algorithms. These descriptions typically do not require more than half a chalk board to explain. To achieve this, we have adopted an ``object based'' approach to programming. This object based approach has already been popularized for high performance parallel computing by libraries like the Toolbox being developed at Mississippi State University , the PETSc library at Argonne National Laboratory , and the Message-Passing Interface [, ].
In order to parallelize a linear algebra algorithm, we must start by partitioning and distributing the vectors and matrices involved. Traditionally, parallel dense linear algebra algorithms and libraries start by partitioning and distributing the matrix, with the distribution of vectors being an afterthought. This seems to make sense: A dense matrix typically requires memory and computation involving such matrices typically requires operations. Vectors on the other hand require only O(n) memory and computation involving vectors tends to only require O(n) or operations. It is this partitioning and distribution of matrices that then dictates the interface between an application and a parallel linear algebra library.
While this appears to be convenient for the library, this approach creates an inherent conflict between the needs of the application and the library. It is the vectors in linear systems that naturally dictate the partitioning and distribution of work associated with (most) applications that lead to linear systems . Notice that in a typical application, the linear system is created to compute values for degrees of freedom, which have some spatial significance. In finite element or boundary element methods, we solve for force, stress, or displacement at points in space. For the application, it is thus more natural to partition the domain of interest into sub domains, like domain decomposition methods do, and assign those sub domains to nodes (processors). This is equivalent to partitioning the vectors and assigning the sub-vectors to nodes.
The PLAPACK infrastructure uses a data distribution that starts by partitioning the vectors associated with the linear algebra problem and assigning the sub-vectors to nodes. The matrix distribution is then induced by the distribution of these vectors. This approach was chosen in an attempt to create more reasonable interfaces between applications and libraries. However, the surprising discovery has been that this approach greatly simplifies the implementation of the infrastructure, allowing much more generality (in future extensions of the infrastructure) while simultaneously reducing the amount of code required when compared to previous generation parallel dense linear algebra libraries .