The following are valid examples of ON directives. Most of them illustrate idioms that programmers might want to use, rather than contrived situations. For simplicity, the first several examples assume the following array declarations:
REAL A(N), B(N), C(N), D(N)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK), C(BLOCK), D(BLOCK)
One of the most commonly requested capabilities for HPF is control over how loop iterations are assigned to processors. (Historically, the ON clause first appeared in the Kali FORALL construct to perform exactly this role.) The ON directive provides this control, as shown in the following examples:
!HPF$ INDEPENDENT
DO I = 2, N-1
!HPF$ ON HOME(A(I))
A(I) = (B(I) + B(I-1) + B(I+1))/3
END DO
!HPF$ INDEPENDENT
DO J = 2, N-1
!HPF$ ON HOME(A(J+1)) BEGIN
A(J) = B(J+1) + C(J+1) + D(J+1)
!HPF$ END ON
END DO
The ON directive in the I loop sets the active processor for each iteration of the loop to be the processor that stores A(I). In other words, it advises the compiler to have each processor run over its own section of the A array (and therefore of B as well). The references to B(I-1) and B(I+1) must be fetched from off-processor for the first and last iterations on each processor (except at the array boundaries); note that those references are not covered by the HOME clause. The ON directive in the J loop similarly sets the active processor set for each iteration, but advises the compiler to shift the computation. As a result, each processor computes a vector sum of its own sections of B, C, and D, stores the first element of the result on the processor to its left, and stores the rest of the result (shifted by one) in A. It is worth noting that the directives would still be valid (and would still minimize nonresident data accesses) if the arrays were distributed CYCLIC, although the number of nonresident references would then be much higher.
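To make the iteration-to-processor mapping concrete, the following Python sketch (not HPF; the processor count and array size are illustrative) computes which processor executes each iteration under the two HOME clauses, assuming the usual ceiling-based BLOCK mapping:

```python
# Illustrative sketch: which processor executes each loop iteration
# under ON HOME(A(I)) and ON HOME(A(J+1)), for a BLOCK distribution
# of N elements over P processors (0-based processor numbers).

def block_owner(i, n, p):
    """Processor owning element i (1-based) of an n-element BLOCK array on p processors."""
    block = -(-n // p)          # ceil(n / p), the BLOCK size
    return (i - 1) // block

N, P = 16, 4
# ON HOME(A(I)): iteration I runs on the processor that owns A(I).
procs_i = [block_owner(i, N, P) for i in range(2, N)]
# ON HOME(A(J+1)): iteration J runs on the owner of A(J+1) -- shifted by one.
procs_j = [block_owner(j + 1, N, P) for j in range(2, N)]

print(procs_i)
print(procs_j)
```

The shifted HOME clause simply slides each iteration one element to the right in the ownership map, which is why the first result element of each processor's chunk lands on its left neighbor.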
Advice to implementors. Consider the general loop
DO i = lb, ub, stride
!HPF$ ON HOME(array(f(i))) BEGIN
body
!HPF$ END ON
END DO
where array has some data mapping. Assume the mapping gives processor p the elements myset(p). (In a BLOCK distribution, for example, myset(p) is a contiguous range of integers.) Then the generated code on processor p should be
DO i ∈ [lb : ub : stride] ∩ f⁻¹(myset(p))
   body
END DO
(This schematic does not show where communication or synchronization
must be placed; that must be derived from analysis of the body.)
Moreover, f is likely to be the identity function or a linear
function with integer coefficients, both of which can be inverted
easily. Given this, techniques for iterating through the resulting
set can be found in the recent literature.
(End of advice to implementors.)
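As a concrete, non-normative sketch of this scheme, the following Python fragment computes the local iteration set [lb : ub : stride] ∩ f⁻¹(myset(p)) for a linear subscript function f(i) = a*i + b. All names are illustrative, and the set is computed here by enumeration; a real compiler would invert f in closed form rather than testing every iteration.

```python
# Sketch of the implementors' scheme: the iterations processor p executes
# are those whose subscript f(i) = a*i + b lands in p's resident set.

def local_iterations(lb, ub, stride, a, b, myset):
    """Iterations i in [lb : ub : stride] with f(i) = a*i + b in myset."""
    return [i for i in range(lb, ub + 1, stride)
            if (a * i + b) in myset]

# Hypothetical example: processor p owns elements 5..8 (a BLOCK piece),
# and the subscript function is f(i) = i + 1, as in the J loop above.
print(local_iterations(2, 15, 1, 1, 1, set(range(5, 9))))   # iterations 4..7
```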
Sometimes it is advantageous to ``split'' an iteration between processors. The following case shows one example of this:
!HPF$ INDEPENDENT
DO I = 2, N-1
!HPF$ ON HOME(A(I))
A(I) = (B(I) + B(I-1) + B(I+1))/3
!HPF$ ON HOME(C(I+1))
C(I+1) = A(I) * D(I+1)
END DO
Here, the active processor sets for the two statements in the loop body are different. The first ON clause makes the reference to A(I) resident in the first statement. In the second statement, where the ON clause names the owner of C(I+1), A(I) is nonresident for some values of I. This maximizes data locality in each statement individually, but does require data movement between the two statements.
Advice to implementors. A loop of the form
DO i = lb, ub, stride
!HPF$ ON HOME(array1(f1(i)))
stmt1
!HPF$ ON HOME(array2(f2(i)))
stmt2
END DO
can be executed on processor p as
set1 = [lb : ub : stride] ∩ f1⁻¹(myset1(p))
set2 = [lb : ub : stride] ∩ f2⁻¹(myset2(p))
DO i ∈ set1 ∪ set2
   IF (i ∈ set1) THEN
      stmt1
   END IF
   IF (i ∈ set2) THEN
      stmt2
   END IF
END DO
where myset1(p) is the resident set for array1, and myset2(p) is the resident set for array2. (Again, synchronization and communication must be handled by other means.) Code transformations such as loop distribution and loop peeling can be used to eliminate the tests in many cases. They will be particularly profitable if there are data dependences between the ON blocks. (End of advice to implementors.)
Advice to users. Splitting an iteration like this is likely to require either additional tests at runtime or additional analysis by the compiler. Even if the compiler can generate low-overhead scheduling for the individual ON clauses, combining them is not necessarily low-overhead. The locality benefits must be rather substantial for this to pay off, but there are cases where multiple ON clauses are valuable. (All these statements are particularly true if one ON block uses data computed in another.) (End of advice to users.)
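The guarded union loop in the schematic above can be sketched in Python as follows. All names are illustrative, the iteration sets are computed by enumeration rather than by the closed-form inversion a compiler would use, and the statements are stand-in callables:

```python
# Sketch of the split-iteration scheme: each statement executes only on
# the iterations whose HOME reference is resident on this processor.

def run_split_iteration(lb, ub, stride, f1, f2, myset1, myset2, stmt1, stmt2):
    iters = range(lb, ub + 1, stride)
    set1 = {i for i in iters if f1(i) in myset1}
    set2 = {i for i in iters if f2(i) in myset2}
    for i in sorted(set1 | set2):
        if i in set1:
            stmt1(i)
        if i in set2:        # note: tests set2, not set1
            stmt2(i)

done1, done2 = [], []
run_split_iteration(2, 9, 1,
                    lambda i: i,       # f1(i) = i,     as in HOME(A(I))
                    lambda i: i + 1,   # f2(i) = i + 1, as in HOME(C(I+1))
                    set(range(2, 6)),  # hypothetical resident set for array1
                    set(range(4, 8)),  # hypothetical resident set for array2
                    done1.append, done2.append)
print(done1, done2)   # [2, 3, 4, 5] [3, 4, 5, 6]
```

Loop peeling would remove the two membership tests by splitting the sorted union into the runs where only set1, both, or only set2 apply.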
Because ON clauses nest naturally, they can be useful for expressing parallelism along multiple dimensions. Consider the following example:
REAL X(M,M)
!HPF$ DISTRIBUTE X(BLOCK,BLOCK)
!HPF$ INDEPENDENT, NEW(I)
DO J = 1, M
!HPF$ ON HOME(X(:,J)) BEGIN
DO I = 2, M
!HPF$ ON HOME(X(I,J))
X(I,J) = (X(I-1,J) + X(I,J)) / 2
END DO
!HPF$ END ON
END DO
The active processor set for each iteration of the J loop is
a column of the (presumably universal) processors arrangement. The
I loop further subdivides the computation, giving each
processor responsibility for computing the elements it owns. Many
compilers would have chosen this computation partitioning
automatically for such a simple example. However, the compiler might
have attempted to fully parallelize the outer loop, executing each
inner loop sequentially on one processor. (This might be attractive
on a machine with very fast communications.) By inserting the
ON clauses, the user has advised against this strategy, thus
trading additional locality for restricted parallelism. Notice that
the ON directive neither requires nor implies the
INDEPENDENT assertion. In both nests, each iteration of the
I loop depends on the preceding iteration, but the
ON directive can still partition the computation among
processors. The ON directive does not automatically make a
loop parallel.
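The active processor sets in this example can be illustrated with a small Python sketch, assuming a PR x PC processor grid and the ceiling-based BLOCK mapping (the grid shape and array size are illustrative, not prescribed by the directives):

```python
# Illustrative sketch: under DISTRIBUTE X(BLOCK,BLOCK) on a PR x PC grid,
# HOME(X(:,J)) is a whole column of processors, while HOME(X(I,J))
# narrows the active set to a single processor.

def block_owner(i, n, p):
    """BLOCK owner of element i (1-based) among p processors."""
    return (i - 1) // -(-n // p)     # -(-n // p) is ceil(n / p)

M, PR, PC = 8, 2, 2

def home_col(j):            # active set for ON HOME(X(:,J))
    return [(r, block_owner(j, M, PC)) for r in range(PR)]

def home_elem(i, j):        # active set for ON HOME(X(I,J))
    return (block_owner(i, M, PR), block_owner(j, M, PC))

print(home_col(3))          # every processor row in grid column 0
print(home_elem(6, 3))      # a single processor in that column
```

The inner ON narrows the active set that the outer ON established, which is exactly the nesting the directives express.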
Consider the following variation on the above example:
REAL Y(M,M)
!HPF$ DISTRIBUTE Y(BLOCK,*)
!HPF$ INDEPENDENT, NEW(I)
DO J = 1, M
!HPF$ ON HOME(Y(:,J)) BEGIN
DO I = 2, M
!HPF$ ON HOME(Y(I,J))
Y(I,J) = (Y(I-1,J) + Y(I,J)) / 2
END DO
!HPF$ END ON
END DO
Note that the ON clauses have not changed, except for the name of the array. The interpretation is similar to the above, except that the outer ON directive assigns each iteration of the J loop to all of the processors. The inner ON directive again implements a simple owner-computes rule. The programmer has directed the compiler to distribute a serial computation across all the processors. There are a few scenarios where this is more efficient than parallelizing the outer loop.
ON directives are particularly useful when the compiler cannot accurately estimate data locality, for example when the computation uses indirection arrays. Consider three variations of the same loop:
REAL X(N), Y(N)
INTEGER IX(M), IY(M)
!HPF$ DISTRIBUTE X(BLOCK), Y(BLOCK)
!HPF$ DISTRIBUTE IX(BLOCK), IY(BLOCK)
!HPF$ INDEPENDENT
DO I = 1, N
!HPF$ ON HOME( X(I) )
X(I) = Y(IX(I)) - Y(IY(I))
END DO
!HPF$ INDEPENDENT
DO J = 1, N
!HPF$ ON HOME( IX(J) )
X(J) = Y(IX(J)) - Y(IY(J))
END DO
!HPF$ INDEPENDENT
DO K = 1, N
!HPF$ ON HOME( X(IX(K)) )
X(K) = Y(IX(K)) - Y(IY(K))
END DO
In the I loop, each processor runs over its own section of the X array. (That is, the active processor for iteration I is the owner of X(I).) Only the reference X(I) is guaranteed to be resident. (If M does not equal N, then IX and IY have a different block size than X, and thus a different mapping.) However, if it is usually the case that X(I), Y(IX(I)), and Y(IY(I)) are located on the same processor, then this choice of active processors may be the best available. (If X(I) and one of the other references are always on the same processor, then the programmer should add the RESIDENT clause, as explained in Section 9.3.)
In the J loop, iteration J's active processor is the owner of IX(J). Because IY has the same distribution as IX, the reference IY(J) is always resident, as is IX(J). This is the most common array reference class in the loop, so it minimizes the number of nonresident data references in the absence of any special properties of IX and IY. It may not balance the load evenly among processors; for example, if N = M/2 then half the processors will be idle. As before, if the values in IX or IY ensure that one of the Y references is always resident, a RESIDENT assertion should be added.
In the K loop, only the reference Y(IX(K)) is guaranteed to be resident (because Y and X have the same distribution). However, the values stored in IX and IY may ensure that Y(IY(K)) and X(K) are always resident. Even if the three REAL values are not always, but merely ``usually'', on the same processor, this may be a good computation partitioning for both locality and parallelism. These advantages must be weighed against the cost of computing the partitioning, however: because the HOME clause depends on a (presumably large) array of runtime values, substantial time may be required to determine which iterations are assigned to each processor.
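The runtime partitioning cost for the K loop can be made concrete with a Python sketch (the array sizes and IX values below are hypothetical): because the active processor for iteration K is the owner of X(IX(K)), computing the per-processor iteration sets requires a full pass over the indirection array.

```python
# Sketch of the runtime partitioning the K loop requires: group the
# iterations by the BLOCK owner of the indirectly addressed element.

def block_owner(i, n, p):
    """BLOCK owner of element i (1-based) among p processors."""
    return (i - 1) // -(-n // p)     # -(-n // p) is ceil(n / p)

def iteration_sets(ix, n, nprocs):
    """Map each processor to the iterations K (1-based) it executes
    under ON HOME(X(IX(K))), where X has n elements."""
    sets = {p: [] for p in range(nprocs)}
    for k, target in enumerate(ix, start=1):
        sets[block_owner(target, n, nprocs)].append(k)
    return sets

IX = [3, 8, 1, 6, 2, 7]     # hypothetical indirection values
print(iteration_sets(IX, 8, 2))
```

The pass over IX is the "substantial time" referred to above; it must be repeated, or its result cached, whenever IX changes.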
Advice to users. It should be clear from this discussion that there is no magic solution for handling complex computation partitionings; the best answer is usually a combination of application knowledge, careful data structure design (including ordering of the elements), and efficient compilation methodology and runtime support.
In practice, one would make all the arrays the same size to avoid some of the alignment problems above; the example was written this way for pedagogical reasons, not as an example of good data structure design. (End of advice to users.)