
Examples of ON Directives

 

The following are valid examples of ON directives. Most of them illustrate idioms that programmers might want to use, rather than contrived situations. For simplicity, the first several examples assume the following array declarations:

      REAL A(N), B(N), C(N), D(N)
      !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK), C(BLOCK), D(BLOCK)

One of the most commonly requested capabilities for HPF is the ability to control how loop iterations are assigned to processors. (Historically, the ON clause first appeared in the Kali FORALL construct to perform exactly this role.) The ON directive provides this capability, as shown in the following examples:

      !HPF$ INDEPENDENT
      DO I = 2, N-1
        !HPF$ ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
      END DO

      !HPF$ INDEPENDENT
      DO J = 2, N-1
        !HPF$ ON HOME(A(J+1)) BEGIN
         A(J) = B(J+1) + C(J+1) + D(J+1)
        !HPF$ END ON
      END DO

The ON directive in the I loop sets the active processor for each iteration of the loop to be the processor that stores A(I). In other words, it advises the compiler to have each processor run over its own section of the A array (and therefore of B as well). The references to B(I-1) and B(I+1) must be fetched from off-processor for the first and last iterations on each processor (except on the boundary processors); note that the processors supplying those values are not named in the HOME clause, because the directive constrains where the computation executes, not where its operands reside.

The ON directive in the J loop similarly sets the active processor set for each iteration, but advises the compiler to shift the computation by one element. As a result, each processor does a vector sum of its own sections of B, C, and D, stores the first element of the result on the processor to its left, and stores the rest of the result (shifted by one) in A. It is worth noting that the directives would still be valid (and would still minimize nonresident data accesses) if the arrays were distributed CYCLIC, although the number of nonresident references would then be much higher.
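
For instance, the I loop could be applied unchanged to cyclically distributed arrays. The following is a minimal sketch of that variation; only the DISTRIBUTE directive differs from the declarations above:

      REAL A(N), B(N), C(N), D(N)
      !HPF$ DISTRIBUTE A(CYCLIC), B(CYCLIC), C(CYCLIC), D(CYCLIC)

      !HPF$ INDEPENDENT
      DO I = 2, N-1
        !HPF$ ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
      END DO

Under CYCLIC, B(I-1) and B(I+1) reside on neighboring processors for essentially every iteration, which is why the nonresident reference count rises even though each iteration is still placed with the element of A it assigns.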

Sometimes it is advantageous to ``split'' an iteration between processors. The following case shows one example of this:

      !HPF$ INDEPENDENT
      DO I = 2, N-1
        !HPF$ ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
        !HPF$ ON HOME(C(I+1))
        C(I+1) = A(I) * D(I+1)
      END DO

Here, the active processor sets for the two statements in the loop body are different. Because of the first ON clause, the reference to A(I) is resident in the first statement; because of the second, the same reference is nonresident in the second statement for some values of I. This choice maximizes data locality within each statement, but it requires data movement between the two statements.
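
For comparison, the same loop body could be kept under a single home. The following is a minimal sketch of that alternative: no data movement is needed between the two statements, but the references to C(I+1) and D(I+1) become nonresident at each block boundary. Which version is preferable depends on the relative costs of the two kinds of data movement on the target machine.

      !HPF$ INDEPENDENT
      DO I = 2, N-1
        !HPF$ ON HOME(A(I)) BEGIN
          A(I) = (B(I) + B(I-1) + B(I+1))/3
          C(I+1) = A(I) * D(I+1)
        !HPF$ END ON
      END DO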

Because ON clauses nest naturally, they can be useful for expressing parallelism along different dimensions of an array. Consider the following example:

      REAL X(M,M)
      !HPF$ DISTRIBUTE X(BLOCK,BLOCK)

      !HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
        !HPF$ ON HOME(X(:,J)) BEGIN
          DO I = 2, M
            !HPF$ ON HOME(X(I,J))
            X(I,J) = (X(I-1,J) + X(I,J)) / 2
          END DO
        !HPF$ END ON
      END DO

The active processor set for each iteration of the J loop is a column of the (presumably universal) processors arrangement. The I loop further subdivides the computation, giving each processor responsibility for computing the elements it owns. Many compilers would choose this computation partitioning automatically for such a simple example. However, a compiler might instead attempt to fully parallelize the outer loop, executing each inner loop sequentially on one processor. (This might be attractive on a machine with very fast communication.) By inserting the ON clauses, the user has advised against that strategy, accepting more restricted parallelism in exchange for additional locality.

Notice that the ON directive neither requires nor implies the INDEPENDENT assertion. In this nest (and in the variation below), each iteration of the I loop depends on the preceding one, yet the ON directive still partitions the computation among processors. The ON directive does not automatically make a loop parallel.

    Advice to implementors. ``Dimension-based'' nesting, as above, will probably be a common case. The HOME clauses can be inverted at each level, treating indices from outer loops as run-time invariants. (End of advice to implementors.)

    Advice to users. Nested ON directives will tend to have efficient implementations if their HOME clauses refer to different dimensions of the processors arrangement, as in the above example. This minimizes the interaction between the levels of the loops, simplifying the implementation. (End of advice to users.)

Consider the following variation on the above example:

      REAL Y(M,M)
      !HPF$ DISTRIBUTE Y(BLOCK,*)

      !HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
        !HPF$ ON HOME(Y(:,J)) BEGIN
          DO I = 2, M
            !HPF$ ON HOME(Y(I,J))
            Y(I,J) = (Y(I-1,J) + Y(I,J)) / 2
          END DO
        !HPF$ END ON
      END DO

Note that the ON clauses have not changed, except for the name of the array. The interpretation is similar to the above, except that the outer ON directive assigns each iteration of the J loop to all of the processors. The inner ON directive again implements a simple owner-computes rule. The programmer has directed the compiler to distribute a serial computation across all the processors. There are a few scenarios where this is more efficient than parallelizing the outer loop:

  1. Parallelizing the outer loop will generate many nonresident references, since only a part of each column is on any processor. If nonresident references are very expensive (or if M is relatively small), this overhead may outweigh any gain from parallel execution.
  2. The compiler may take advantage of the INDEPENDENT directive to avoid inserting any synchronization. This allows a natural pipelined execution: a processor will execute its part of the I loop for one value of J, then immediately go on to the next J iteration. Thus, the first processor will start on J=2 while the second receives the data it needs (from processor one) for J=1. (A similar pipeline would develop in the X example above.)

    Advice to users. This example points out how ON may improve software engineering. While the ``value'' of HOME(X(I)) will change if X's mapping changes, its intent will usually stay the same: run the loop ``aligned with'' the array X. Moreover, the form of the clauses is portable, and they simplify experimenting with alternative computation partitionings. Both qualities are similar to the advantages of DISTRIBUTE and ALIGN over low-level data layout mechanisms. (End of advice to users.)

ON directives are particularly useful when the compiler cannot accurately estimate data locality, for example when the computation uses indirection arrays. Consider three variations of the same loop:

      REAL X(N), Y(N)
      INTEGER IX(M), IY(M)
      !HPF$ DISTRIBUTE X(BLOCK), Y(BLOCK)
      !HPF$ DISTRIBUTE IX(BLOCK), IY(BLOCK)

      !HPF$ INDEPENDENT
      DO I = 1, N
        !HPF$ ON HOME( X(I) )
        X(I) = Y(IX(I)) - Y(IY(I))
      END DO

      !HPF$ INDEPENDENT
      DO J = 1, N
        !HPF$ ON HOME( IX(J) )
        X(J) = Y(IX(J)) - Y(IY(J))
      END DO

      !HPF$ INDEPENDENT
      DO K = 1, N
        !HPF$ ON HOME( X(IX(K)) )
        X(K) = Y(IX(K)) - Y(IY(K))
      END DO

In the I loop, each processor runs over its section of the X array. (That is, the active processor for iteration I is the owner of X(I).) Only the reference X(I) is guaranteed to be resident. (If M does not equal N, then IX and IY have a different block size than X, and thus a different mapping.) However, if it is usually the case that X(I), Y(IX(I)), and Y(IY(I)) are located on the same processor, then this choice of active processors may be the best available. (If X(I) and one of the other references are always on the same processor, then the programmer should add the RESIDENT clause as explained in Section 9.3.)

In the J loop, iteration J's active processor is the owner of IX(J). Because IY has the same distribution as IX, the reference IY(J) is always resident, as is IX(J). This is the most common array reference class in the loop, so this choice minimizes the number of nonresident data references in the absence of any special properties of IX and IY. It may not balance the load evenly among processors; for example, if N = M/2 then half the processors will be idle. As before, if the values in IX or IY ensure that one of the Y references is always resident, a RESIDENT assertion should be added.

In the K loop, only the reference Y(IX(K)) is guaranteed to be resident (because Y and X have the same distribution). However, the values stored in IX and IY may ensure that Y(IY(K)) and X(K) are always resident. Even if the three REAL values are not always, but merely ``usually,'' on the same processor, this may be a good computation partitioning for both locality and parallelism. These advantages must be weighed against the cost of computing the partitioning: since the HOME clause depends on a (presumably large) array of runtime values, substantial time may be required to determine which iterations are assigned to each processor.

It should be clear from this discussion that there is no magic solution for handling complex computation partitionings; the best answer is usually a combination of application knowledge, careful data structure design (including the ordering of elements), and efficient compilation methodology and runtime support.
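
As an illustration of the RESIDENT suggestion for the I loop, suppose the values in IX guarantee that Y(IX(I)) always lives on the same processor as X(I). The following sketch shows how that assertion might be attached to the ON directive; the exact syntax and semantics of the RESIDENT clause are the subject of Section 9.3, so this is intended only to indicate the intent:

      !HPF$ INDEPENDENT
      DO I = 1, N
        !HPF$ ON HOME( X(I) ), RESIDENT( Y(IX(I)) )
        X(I) = Y(IX(I)) - Y(IY(I))
      END DO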

    Advice to implementors. The K loop is the situation for which the inspector strategy described above was designed. If there is an outer loop around any of these examples, and that loop modifies neither the distribution of X nor the values of IX, then a record of each processor's iterations can be saved and reused. The cost is at worst linear in the sizes of the arrays. (End of advice to implementors.)
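
As a rough illustration of the saved-iterations idea, the following sketch shows the kind of node code a compiler or runtime might produce for the K loop. It is not HPF source: MY_PROC() and OWNER_X(I) are hypothetical helpers standing for ``this processor's identity'' and ``the owner of X(I),'' and the communication needed for nonresident Y references is omitted.

      INTEGER MY_ITERS(N), NMINE, K, T
!     Inspector: executed once, and reusable as long as neither the
!     mapping of X nor the contents of IX change.
      NMINE = 0
      DO K = 1, N
        IF (OWNER_X(IX(K)) .EQ. MY_PROC()) THEN
          NMINE = NMINE + 1
          MY_ITERS(NMINE) = K
        END IF
      END DO
!     Executor: visit only the iterations assigned to this processor.
      DO T = 1, NMINE
        K = MY_ITERS(T)
        X(K) = Y(IX(K)) - Y(IY(K))
      END DO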

    Advice to users. It is unlikely that any current production compiler will generate low-overhead code for the K loop. The difference from the previous examples is that the HOME clause is not a function that the compiler can easily invert. Some compilers may choose to execute every iteration on all processors, testing the HOME clause at run time; others may precompute a list of iterations for each processor. Of course, the cost of computing the list will be substantial.

    In practice, one would make all the arrays the same size to avoid some of the alignment problems above; the example was written this way for pedagogical reasons, not as an example of good data structure design. (End of advice to users.)
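
Following that advice, a sketch of the uniform-size design might look like the following. The assumption is that all four arrays have the same extent and BLOCK distribution (onto the same, default processors arrangement), so that X(I), IX(I), and IY(I) share an owner:

      REAL X(N), Y(N)
      INTEGER IX(N), IY(N)
      !HPF$ DISTRIBUTE X(BLOCK), Y(BLOCK), IX(BLOCK), IY(BLOCK)

      !HPF$ INDEPENDENT
      DO I = 1, N
        !HPF$ ON HOME( X(I) )
        X(I) = Y(IX(I)) - Y(IY(I))
      END DO

Only the two indirect references to Y can then be nonresident; any remaining locality depends on the values stored in IX and IY.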

