Next: 7.9 Summary Up: 7 High Performance Fortran Previous: 7.7 Performance Issues

7.8 Case Study: Gaussian Elimination

To further illustrate the use of HPF, we present a slightly more complex example. The problem considered is the Gaussian elimination method used to solve a system of linear equations

A x = b,

where A is a known matrix of size N × N, x is the required solution vector, and b is a known vector of size N. This example is often used in discussions of HPF, as it shows the benefits of cyclic distributions. The method proceeds in two stages:

1. Gaussian elimination. The original system of equations is reduced to the upper triangular form

U x = c,

where U is a matrix of size N × N in which all elements below the diagonal are zero and all diagonal elements have the value 1, and c is the correspondingly transformed right-hand-side vector.

2. Back substitution. The new system of equations is solved to obtain the values of x.

Figure 7.10: The ith step of the Gaussian elimination algorithm, in which nonzero subdiagonal elements in column i are eliminated by subtracting appropriate multiples of the pivot row.

The Gaussian elimination stage of the algorithm comprises N-1 steps. In the basic algorithm, the ith step eliminates nonzero subdiagonal elements in column i by subtracting from each row j in the range [i+1,N] the ith row scaled by the factor A(j,i)/A(i,i), so as to make the element A(j,i) zero. Hence, the algorithm sweeps down the matrix from the top left corner to the bottom right corner, leaving zero subdiagonal elements behind it (Figure 7.10).

For numerical stability, this basic algorithm is modified so that instead of stepping through rows in order, it selects in step i the row in the range [i,N] whose element in column i has the largest absolute value. This row (called the pivot) is swapped with row i prior to performing the subtractions.

Program 7.7 is an HPF implementation of this algorithm. For efficiency, this program maintains the vector b in the (N+1)th column of the array A. The first do-loop implements Gaussian elimination. The MAXLOC intrinsic is used to identify the pivot row. Rather than performing an explicit swap with row i, an indirection array called Indx is used to keep track of the actual indices of selected rows. This array is updated once the pivot is identified. The next statement computes the N scale factors; notice that the computation can be performed with a single array assignment. Finally, the FORALL statement performs the subtractions. The mask ensures that the subtraction is performed only for rows that have not previously been selected as pivots (Indx(j).EQ.0). Once the do-loop is complete, a second FORALL is used to reorganize the matrix into upper triangular form.

The last four lines of the program perform the back substitution. In reverse order from N to 1, each element of the solution is computed and then substituted into A to simplify the matrix.
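The combined elimination and back-substitution logic can be sketched in Python with NumPy. This is an illustrative translation, not the HPF program itself: the function name gauss_solve and its local variables are hypothetical, but the scheme mirrors the one just described, with b stored as an extra column, an Indx-style indirection array instead of explicit row swaps, and a final reordering into upper triangular form.

```python
import numpy as np

def gauss_solve(A, b):
    """Gaussian elimination with partial pivoting via an indirection array."""
    n = len(b)
    # Keep b as column n of the working matrix, as the HPF program does.
    M = np.hstack([A.astype(float), np.asarray(b, float).reshape(n, 1)])
    indx = np.zeros(n, dtype=int)    # 0 means "not yet selected as a pivot"
    order = np.zeros(n, dtype=int)   # order[i] = row chosen as pivot at step i
    for i in range(n):
        # Select the pivot: among unselected rows, the one with the
        # largest-magnitude element in column i (the MAXLOC step).
        candidates = np.where(indx == 0)[0]
        p = candidates[np.argmax(np.abs(M[candidates, i]))]
        indx[p] = i + 1
        order[i] = p
        # Scale factors for every row (a single array assignment).
        fac = M[:, i] / M[p, i]
        # Subtract multiples of the pivot row from unselected rows only,
        # the role of the Indx(j).EQ.0 mask in the FORALL.
        for j in range(n):
            if indx[j] == 0:
                M[j, i:] -= fac[j] * M[p, i:]
    # Reorganize into upper triangular form, then back-substitute.
    U = M[order]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (U[i, n] - U[i, i + 1:n] @ x[i + 1:n]) / U[i, i]
    return x
```

Unlike Program 7.7, this sketch does not normalize the diagonal to 1; the back substitution divides by U(i,i) instead, which is algebraically equivalent.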

Figure 7.11: Communication and computation in the various phases of the HPF Gaussian elimination algorithm. Arrows represent communication, and shading indicates tasks involved in computation in each phase. The five phases are described in Section 7.8.

Before developing data distribution directives for this program, let us determine how much concurrency it exposes and what data dependencies may lead to communication. We can think of the data-parallel program as specifying a fine-grained partition comprising N × N tasks, each responsible for a single element of A. (These tasks characterize the computation that would be associated with data elements by the owner-computes rule.) As illustrated in Figure 7.11, each of the N-1 steps of the elimination algorithm involves five principal phases, as follows:

1. The MAXLOC statement involves a reduction operation performed by the N tasks in the ith column.

2. The maximum value identified by the reduction (max_indx) must be broadcast within the ith column, since it is required for the computation of scale factors.

3. The computation of scale factors (the array Fac) requires N independent operations, one in each task in the ith column.

4. A scale factor (Fac(j)) and a pivot row value (Row(k)) must be broadcast within each row and column, respectively, since they are required for the update.

5. The FORALL statement involves independent operations, one per task.

Studying this algorithm, we see that it has two interesting attributes. First, there is little locality in communication beyond the fact that broadcasts and reductions are performed within rows and columns. Second, computation tends to be clustered: in each step, much of the computation is performed by tasks in a single row and column (before the FORALL) and in the bottom right-hand corner (the FORALL). These attributes can be exploited when developing data distribution directives to complete the parallel algorithm.

In many grid-based problems, we prefer to use a BLOCK distribution of the principal data structures because it reduces communication requirements by enhancing locality. However, in the Gaussian elimination problem, a BLOCK distribution has no communication advantages; furthermore, it causes many processors to be idle, particularly in the later stages of computation. In contrast, a CYCLIC distribution scatters computation over many processors and hence reduces idle time. Therefore, we could use the following data distribution directives.
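The idle-time argument can be made concrete with a small ownership calculation. In this hypothetical sketch (N, P, and the step index are chosen purely for illustration), we compute which processors own columns that still hold active work at a late elimination step, under each distribution:

```python
# Owner of each of N columns on P processors, under the two distributions.
N, P = 16, 4
block = [j * P // N for j in range(N)]   # BLOCK: contiguous chunks of N/P columns
cyclic = [j % P for j in range(N)]       # CYCLIC: columns dealt out round-robin

# At elimination step i, only columns j >= i still hold active work.
i = 12
active_block = {block[j] for j in range(i, N)}    # only one processor is busy
active_cyclic = {cyclic[j] for j in range(i, N)}  # all processors stay busy
print(sorted(active_block), sorted(active_cyclic))  # → [3] [0, 1, 2, 3]
```

Under BLOCK, the active columns all lie in the last processor's chunk, so three of the four processors idle; under CYCLIC, the remaining columns are spread over all four.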

```
!HPF$  ALIGN Row(j) WITH A(1,j)
!HPF$  ALIGN X(i) WITH A(i,N+1)
!HPF$  DISTRIBUTE A(*,CYCLIC)
```

Of course, the number of processors that can be used efficiently by this one-dimensional decomposition is limited. An alternative formulation, more efficient on large numbers of processors, decomposes A in two dimensions. This can be specified as follows.

```
!HPF$  ALIGN Row(j) WITH A(1,j)
!HPF$  ALIGN X(i) WITH A(i,N+1)
!HPF$  DISTRIBUTE A(CYCLIC,CYCLIC)
```
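A rough way to see why the two-dimensional distribution keeps more processors busy is to note that under (CYCLIC,CYCLIC), element (i,j) of A lives on processor (i mod P, j mod Q) of a P × Q processor grid. The following sketch (the grid shape, matrix size, and step index are illustrative assumptions) checks that every processor owns part of the trailing submatrix even late in the elimination:

```python
P, Q, N = 2, 3, 12   # hypothetical 2 x 3 processor grid, 12 x 12 matrix

def owner(i, j):
    # (CYCLIC,CYCLIC): round-robin assignment in both dimensions.
    return (i % P, j % Q)

# The trailing submatrix updated at a late elimination step k:
k = 9
owners = {owner(i, j) for i in range(k, N) for j in range(k, N)}
# owners contains all P*Q processors, so none is idle at this step.
```

As long as the trailing submatrix is at least P × Q, every processor participates in the update, which is the load-balance property the text attributes to cyclic distributions.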
