The following are valid examples of ON directives. Most of them illustrate idioms that programmers might want to use, rather than contrived situations. For simplicity, the first several examples assume the following array declarations:
      REAL A(N), B(N), C(N), D(N)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK), C(BLOCK), D(BLOCK)
One of the most commonly requested capabilities for HPF is control over how loop iterations are assigned to processors. (Historically, the ON clause first appeared in the Kali FORALL construct to perform exactly this role.) The ON directive provides this capability, as shown in the following examples:
!HPF$ INDEPENDENT
      DO I = 2, N-1
!HPF$    ON HOME(A(I))
         A(I) = (B(I) + B(I-1) + B(I+1))/3
      END DO

!HPF$ INDEPENDENT
      DO J = 2, N-1
!HPF$    ON HOME(A(J+1)) BEGIN
         A(J) = B(J+1) + C(J+1) + D(J+1)
!HPF$    END ON
      END DO
The ON directive in the I loop sets the active processor for each iteration of the loop to be the processor that stores A(I). In other words, it advises the compiler to have each processor run over its own section of the A array (and therefore of B as well). The references to B(I-1) and B(I+1) must be fetched from off-processor for the first and last iterations on each processor (except at the boundary processors); note that those off-processor locations are not mentioned in the HOME clause. The ON directive in the J loop similarly sets the active processor for each iteration, but advises the compiler to shift the computation by one element. As a result, each processor performs a vector sum of its own sections of B, C, and D, stores the first element of the result on the processor to its left, and stores the rest of the result (shifted by one) in its own section of A. It is worth noting that the directives would still be valid (and would still minimize nonresident data accesses) if the arrays were distributed CYCLIC, although the number of nonresident references would then be much higher.
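The effect of the shifted HOME clause in the J loop can be modeled outside HPF. The following sketch (an illustration, not part of any HPF implementation; the helper name owner_block is assumed) computes which processor executes each iteration of the J loop under ON HOME(A(J+1)) for a BLOCK distribution of 8 elements over 2 processors:

```python
def owner_block(idx, n, nprocs):
    """Owner (0-based) of 1-based element idx of an n-element array
    under a BLOCK distribution with ceiling(n/nprocs) block size."""
    bsize = -(-n // nprocs)          # ceiling division: block size
    return (idx - 1) // bsize

# DO J = 2, N-1 with ON HOME(A(J+1)), for N = 8 and 2 processors.
N, P = 8, 2
schedule = [(j, owner_block(j + 1, N, P)) for j in range(2, N)]
print(schedule)   # [(2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1)]
```

Note that iteration J = 4 runs on processor 1 (the owner of A(5)) but assigns A(4), which processor 0 owns; this is the "store the first element on the processor to the left" behavior described above.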
Advice to implementors. Consider a loop of the form

      DO i = lb, ub, stride
!HPF$    ON HOME(array(f(i))) BEGIN
         body
!HPF$    END ON
      END DO
where array has some data mapping. Assume that the mapping gives processor p the elements myset(p). (In a BLOCK distribution, for example, myset(p) is a contiguous range of integers.) Then the generated code on processor p should be

      DO i ∈ [lb : ub : stride] ∩ f⁻¹(myset(p))
         body
      END DO

(This schematic does not show where communication or synchronization must be placed; that must be derived from analysis of the body.) Moreover, f is likely to be the identity function or a linear function with integer coefficients, both of which can be inverted easily. Given this, techniques for iterating through the resulting set can be found in several recent conference proceedings. (End of advice to implementors.)
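The set computation above can be made concrete with a small sketch. This is an illustration under stated assumptions (BLOCK distribution, linear subscript f(i) = a*i + b; the helper names block_range and local_iterations are invented for this example), and it enumerates the candidate iterations naively where a real compiler would invert f symbolically:

```python
def block_range(n, nprocs, p):
    """myset(p): 1-based elements of an n-element array owned by
    processor p (0-based) under a BLOCK distribution."""
    bsize = -(-n // nprocs)                    # ceiling division
    return range(p * bsize + 1, min((p + 1) * bsize, n) + 1)

def local_iterations(lb, ub, stride, a, b, n, nprocs, p):
    """Iterations i in [lb:ub:stride] whose subscript f(i) = a*i + b
    falls in myset(p), i.e. [lb:ub:stride] ∩ f⁻¹(myset(p))."""
    owned = block_range(n, nprocs, p)
    lo, hi = owned[0], owned[-1]
    # Naive scan; since f is linear, a compiler would solve
    # lo <= a*i + b <= hi for i directly instead.
    return [i for i in range(lb, ub + 1, stride) if lo <= a * i + b <= hi]

# Example: DO I = 2, N-1 with ON HOME(A(I)) (f = identity), N = 8, 2 procs.
print(local_iterations(2, 7, 1, 1, 0, 8, 2, 0))   # [2, 3, 4]
print(local_iterations(2, 7, 1, 1, 0, 8, 2, 1))   # [5, 6, 7]
```

Each processor thus executes only the iterations whose HOME element it owns, which is exactly the schedule described for the I loop above.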
Sometimes it is advantageous to "split" an iteration between processors. The following case shows one example of this:
!HPF$ INDEPENDENT
      DO I = 2, N-1
!HPF$    ON HOME(A(I))
         A(I) = (B(I) + B(I-1) + B(I+1))/3
!HPF$    ON HOME(C(I+1))
         C(I+1) = A(I) * D(I+1)
      END DO
Here, the active processor sets for the two statements in the loop body differ. Under the first ON clause, the reference to A(I) in the first statement is resident; under the second ON clause, A(I) is nonresident (for some values of I) in the second statement. This maximizes data locality within each statement, but it does require data movement between the two.
Advice to implementors. A loop of the form

      DO i = lb, ub, stride
!HPF$    ON HOME(array1(f1(i)))
         stmt1
!HPF$    ON HOME(array2(f2(i)))
         stmt2
      END DO

can be executed on processor p as

      set1 = [lb : ub : stride] ∩ f1⁻¹(myset1(p))
      set2 = [lb : ub : stride] ∩ f2⁻¹(myset2(p))
      DO i ∈ set1 ∪ set2
         IF (i ∈ set1) THEN
            stmt1
         END IF
         IF (i ∈ set2) THEN
            stmt2
         END IF
      END DO
where myset1(p) is the resident set for array1, and myset2(p) is the resident set for array2. (Again, synchronization and communication must be handled by other means.) Code transformations such as loop distribution and loop peeling can be used to eliminate the tests in many cases. They will be particularly profitable if there are data dependences between the ON blocks. (End of advice to implementors.) Advice to users. Splitting an iteration like this is likely to require either additional tests at runtime or additional analysis by the compiler. Even if the compiler can generate low-overhead scheduling for the individual ON clauses, combining them is not necessarily low-overhead. The locality benefits must be rather substantial for this to pay off, but there are cases where multiple ON clauses are valuable. (All these statements are particularly true if one ON block uses data computed in another one.) (End of advice to users.)
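The union-with-guards schedule can be sketched for the A/C example above, with f1(i) = i (from HOME(A(I))) and f2(i) = i+1 (from HOME(C(I+1))). This is illustrative only; the helper names block_set and split_schedule are assumed, and a BLOCK distribution of 8 elements over 2 processors is used:

```python
def block_set(n, nprocs, p):
    """myset(p): 1-based indices owned by processor p under BLOCK."""
    bsize = -(-n // nprocs)                    # ceiling division
    return set(range(p * bsize + 1, min((p + 1) * bsize, n) + 1))

def split_schedule(lb, ub, n, nprocs, p):
    """Iterations processor p executes, with flags telling which of
    the two guarded statements it performs in each."""
    iters = range(lb, ub + 1)
    owned = block_set(n, nprocs, p)
    set1 = {i for i in iters if i in owned}         # f1(i) = i    -> HOME(A(I))
    set2 = {i for i in iters if i + 1 in owned}     # f2(i) = i+1  -> HOME(C(I+1))
    return [(i, i in set1, i in set2) for i in sorted(set1 | set2)]

# DO I = 2, N-1 with N = 8 on 2 processors:
print(split_schedule(2, 7, 8, 2, 0))
print(split_schedule(2, 7, 8, 2, 1))
```

The output shows iteration 4 genuinely split: processor 0 executes only stmt1 for I = 4, while processor 1 executes only stmt2 for the same iteration.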
Because ON clauses nest naturally, they can be useful for expressing parallelism along different dimensions of a computation. Consider the following examples:
      REAL X(M,M)
!HPF$ DISTRIBUTE X(BLOCK,BLOCK)
!HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
!HPF$    ON HOME(X(:,J)) BEGIN
         DO I = 2, M
!HPF$       ON HOME(X(I,J))
            X(I,J) = (X(I-1,J) + X(I,J)) / 2
         END DO
!HPF$    END ON
      END DO

The active processor set for each iteration of the J loop is a column of the (presumably universal) processors arrangement. The I loop further subdivides the computation, giving each processor responsibility for computing the elements it owns. Many compilers would choose this computation partitioning automatically for such a simple example. However, the compiler might instead have attempted to fully parallelize the outer loop, executing each inner loop sequentially on a single processor. (This might be attractive on a machine with very fast communication.) By inserting the ON clauses, the user has advised against that strategy, trading some parallelism for additional locality. Notice that the ON directive neither requires nor implies the INDEPENDENT assertion. In both nests, each iteration of the I loop depends on the preceding iteration, but the ON directive can still partition the computation among processors; the ON directive does not automatically make a loop parallel.
Consider the following variation on the above example:
      REAL Y(M,M)
!HPF$ DISTRIBUTE Y(BLOCK,*)
!HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
!HPF$    ON HOME(Y(:,J)) BEGIN
         DO I = 2, M
!HPF$       ON HOME(Y(I,J))
            Y(I,J) = (Y(I-1,J) + Y(I,J)) / 2
         END DO
!HPF$    END ON
      END DO
Note that the ON clauses have not changed, except for the name of the array. The interpretation is similar to the above, except that because the second dimension of Y is not distributed, the outer ON directive assigns each iteration of the J loop to all of the processors. The inner ON directive again implements a simple owner-computes rule. In effect, the programmer has directed the compiler to distribute a serial computation across all the processors. There are a few scenarios in which this is more efficient than parallelizing the outer loop:
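The difference between the two distributions can be made explicit by computing the active processor set that HOME(X(:,J)) or HOME(Y(:,J)) denotes. The sketch below is illustrative (the helper name column_owners and the 2x2 processor arrangement are assumptions, and (BLOCK,*) is modeled simply as "every processor owns part of every column"):

```python
def column_owners(dist, m, prows, pcols, j):
    """Set of processors (pr, pc) owning part of column j (1-based)
    of an m x m array, under the named distribution."""
    if dist == "(BLOCK,BLOCK)":
        bsize = -(-m // pcols)            # column-block size
        pc = (j - 1) // bsize             # one processor column owns column j
        return {(pr, pc) for pr in range(prows)}
    if dist == "(BLOCK,*)":
        # Second axis undistributed: every processor holds a row block
        # of every column, so the whole arrangement is active.
        return {(pr, pc) for pr in range(prows) for pc in range(pcols)}
    raise ValueError(dist)

# Column 3 of an 8 x 8 array on a 2 x 2 arrangement:
print(sorted(column_owners("(BLOCK,BLOCK)", 8, 2, 2, 3)))  # one processor column
print(sorted(column_owners("(BLOCK,*)", 8, 2, 2, 3)))      # all processors
```

Under (BLOCK,BLOCK) each J iteration is confined to one processor column, while under (BLOCK,*) every iteration activates the entire arrangement, which is why the second version distributes a serial computation across all processors.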
ON directives are particularly useful when the compiler cannot accurately estimate data locality, for example when the computation uses indirection arrays. Consider three variations of the same loop:
      REAL X(N), Y(N)
      INTEGER IX(M), IY(M)
!HPF$ DISTRIBUTE X(BLOCK), Y(BLOCK)
!HPF$ DISTRIBUTE IX(BLOCK), IY(BLOCK)

!HPF$ INDEPENDENT
      DO I = 1, N
!HPF$    ON HOME( X(I) )
         X(I) = Y(IX(I)) - Y(IY(I))
      END DO

!HPF$ INDEPENDENT
      DO J = 1, N
!HPF$    ON HOME( IX(J) )
         X(J) = Y(IX(J)) - Y(IY(J))
      END DO

!HPF$ INDEPENDENT
      DO K = 1, N
!HPF$    ON HOME( X(IX(K)) )
         X(K) = Y(IX(K)) - Y(IY(K))
      END DO
In the I loop, each processor runs over its own section of the X array; that is, the active processor for iteration I is the owner of X(I). Only the reference X(I) is guaranteed to be resident. (If M does not equal N, then IX and IY have a different block size than X, and thus a different mapping.) However, if X(I), Y(IX(I)), and Y(IY(I)) are usually located on the same processor, then this choice of active processors may be the best available. (If X(I) and one of the other references are always on the same processor, then the programmer should add the RESIDENT clause, as explained in Section 9.3.)

In the J loop, iteration J's active processor is the owner of IX(J). Because IY has the same distribution as IX, the reference IY(J) is always resident, as is IX(J). This is the most common array reference class in the loop, so this choice minimizes the number of nonresident data references in the absence of any special properties of IX and IY. It may not balance the load evenly among processors, however; for example, if N = M/2, then half the processors will be idle. As before, if the values in IX or IY ensure that one of the Y references is always resident, a RESIDENT assertion should be added.

In the K loop, only the reference Y(IX(K)) is guaranteed to be resident (because Y and X have the same distribution). However, the values stored in IX and IY may ensure that Y(IY(K)) and X(K) are always resident as well. Even if the three REAL values are not always, but merely "usually," on the same processor, this may be a good computation partitioning for both locality and parallelism. These advantages must be weighed against the cost of computing the partitioning, however. Because the HOME clause depends on a (presumably large) array of runtime values, substantial time may be required to determine which iterations are assigned to each processor.
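The runtime cost mentioned for the K loop comes from having to inspect the index array before any iteration can be scheduled. A sketch of such an inspector follows (illustrative only: the names owner_block and inspect, the sample IX contents, and the BLOCK helpers are all assumptions, not part of any HPF runtime):

```python
def owner_block(idx, n, nprocs):
    """Owner (0-based) of 1-based element idx under BLOCK distribution."""
    bsize = -(-n // nprocs)                    # ceiling division
    return (idx - 1) // bsize

def inspect(ix, n, nprocs):
    """Partition iterations K = 1..len(ix) of the K loop by the owner
    of X(IX(K)), i.e. evaluate HOME(X(IX(K))) at runtime."""
    iters = {p: [] for p in range(nprocs)}
    for k, target in enumerate(ix, start=1):   # O(M) scan of the index array
        iters[owner_block(target, n, nprocs)].append(k)
    return iters

IX = [3, 7, 1, 8, 2, 5, 4, 6]                  # example runtime contents of IX
print(inspect(IX, 8, 2))                       # {0: [1, 3, 5, 7], 1: [2, 4, 6, 8]}
```

The scan itself is cheap, but in general it must be redone (or its result cached and communicated) whenever IX changes, which is the overhead the text warns about.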
It should be clear from this discussion that there is no magic solution for handling complex computation partitionings; the best answer is usually a combination of application knowledge, careful data structure design (including ordering of the elements), and efficient compilation methodology and runtime support.
Advice to users. In practice, one would make all the arrays the same size to avoid some of the alignment problems above; the example was written this way for pedagogical reasons, not as an example of good data structure design. (End of advice to users.)