My first introduction to parallel computing came in the early eighties,
when I was a Research Assistant for Prof. G. W. Stewart and
Prof. Dianne O'Leary at the University of Maryland in College Park.
At the time, we were happy to have an experimental
parallel computer, the *ZMOB*, consisting of 128 Z80 processors.
These processors were slow enough that we never bothered to
measure the MFLOPS attained.
My introduction to *high performance* parallel computing occurred while
I was on a leave of absence at the University of Tennessee.
It was during this leave that Prof. Jack Dongarra introduced
me to LAPACK and the distributed memory version of LAPACK that was
being developed, which later evolved into ScaLAPACK.

Collaboration with Dr. David Payne, Lance Shuler, and Jerrell Watts on the Interprocessor Collective Communication (InterCom) Project [] showed how to systematically implement collective communication libraries for distributed memory architectures. Indeed, a large part of the introductory chapter of this book was taken from Part II of a manuscript authored with Payne, Shuler, and Watts, with the working title "A Street Guide to Collective Communication." The systematic approach used by PLAPACK to redistribute vectors and matrices was to be an illustration of the use of collective communication in that effort.

The inherent link between matrix distribution and vector distribution first became obvious to us in a paper authored with Dr. John Lewis and Dr. David Payne. In that paper [], we discussed the efficient implementation of matrix-vector multiplication and showed how communication overhead is reduced by using a distribution that we now call Physically Based Matrix Distribution. Insights into applications provided by Carter Edwards, Dr. Abani Patra, and Dr. Po Geng offered further evidence that this distribution is a common distribution appropriate for dense, sparse iterative, and sparse direct linear algebra methods [].

The systematic approach to parallel implementation of the level-3 Basic Linear Algebra Subprograms started with the Scalable Universal Matrix Multiplication Algorithm (SUMMA) [], a collaboration with Jerrell Watts. This was later extended to all level-3 BLAS in a collaboration with Dr. Almadena Chtchelkanova, John Gunnels, Greg Morrow, and James Overfelt []. Interestingly enough, most of this work was inspired by looking at parallel out-of-core implementations of the LU factorization, first with Mercedes Marques and later with Ken Klimkowski [].

We would like to thank Dr. Yuan-Jye (Jason) Wu for agreeing to be a one-man alpha-release test site. By implementing a complex algorithm (the reduction of a square dense matrix to banded form []) using PLAPACK, Jason exercised many components of the infrastructure, thereby revealing many initial bugs. We also appreciate the encouragement received from Dr. Chris Bischof, Dr. Steven Huss-Lederman, Prof. Xiaobai Sun, and others involved in the PRISM project. We gratefully acknowledge George Fann at PNNL for his support of the PLAPACK project and for his time collecting the IBM SP-2 performance numbers presented later in the guide. Similarly, we are indebted to Dr. Greg Astfalk for assembling the numbers for the Convex Exemplar.

It has been a great pleasure to work with the graduate students listed as contributors to the various chapters. As of this writing, they are all Ph.D. students at The University of Texas at Austin. Their fields of study span Aerospace Engineering (Greg Baker), Computational and Applied Mathematics (Carter Edwards and James Overfelt), Computer Science (John Gunnels), and Physics (Philip Alpatov and Greg Morrow). All started as students in my special topics course on Parallel Techniques for Numerical Algorithms. A number of them became Research Assistants as part of the project. Others volunteered their time (supported by other projects). It is through their talent and enthusiasm that the PLAPACK project progressed as rapidly as it has.

To date, the PLAPACK project has been supported in part by the following funding sources:

- The Parallel Research on Invariant Subspace Methods (PRISM) project, under ARPA grant P-95006.
- The NASA High Performance Computing and Communications Program's Earth and Space Sciences Project under NRA Grants NAG5-2497 and NAG5-2511.
- The Environmental Molecular Sciences construction project at Pacific Northwest National Laboratory (PNNL). PNNL is a multiprogram national laboratory operated by Battelle Memorial Institute for the U.S. Department of Energy under Contract DE-AC06-76RLO 1830.
- The Intel Research Council.

We gratefully acknowledge access provided by the following parallel computing facilities:

- The University of Texas Computation Center's High Performance Computing Facility.
- The Texas Institute for Computational and Applied Mathematics' Distributed Computing Laboratory.
- The Molecular Science Computing Facility in the Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL).
- The Intel Paragon system operated by the California Institute of Technology on behalf of the Concurrent Supercomputing Consortium, which also arranged access to this facility.
- Cray Research, a Silicon Graphics Company.
- The Convex Division of the Hewlett-Packard Company in Richardson, Texas.