The key rule about ON directives when applied to subprogram invocations is that the invocation does not change the active processor set. In effect, the callee inherits the caller's active processors. Thus,
!HPF$ PROCESSORS P(10) !HPF$ DISTRIBUTE X(BLOCK) ONTO P !HPF$ ON ( P(1:3) ) CALL PERSON_TO_PERSON() !HPF$ ON ( P(4:7) ) CALL COLLECT( X )
calls PERSON_TO_PERSON on three processors, while it calls COLLECT on four. The actual argument to COLLECT does not reside completely on the active set of processors. This is allowed, with appropriate declarations of the corresponding dummy argument as explained below.
The above rule has interesting implications for data distributions within the called routine. In particular, dummy arguments must be declared under the same restrictions as local objects, thus ensuring that the dummy is always stored on the active processor set. This does not imply that the corresponding actual argument is local, however. Consider the possibilities for how a dummy can be explicitly mapped:
It should also be noted that these rules do not invalidate any HPF programs written without using the ON directive. In those programs, the active processor set never changes (at least, not from the view of the language). Therefore, subset and universal processors arrangements can be used interchangeably, and the restriction on use of transcriptive mappings is obeyed automatically. (End of rationale.)
Let us return to the previous example:
!HPF$ PROCESSORS P(10) !HPF$ DISTRIBUTE X(BLOCK) ONTO P !HPF$ ON ( P(4:7) ) CALL COLLECT( X )
If COLLECT were declared as
SUBROUTINE COLLECT( A ) !HPF$ DISTRIBUTE A(CYCLIC)
then the call will be executed as follows:
SUBROUTINE COLLECT( A ) !HPF$ DISTRIBUTE A(BLOCK)
then the process would be the same, except that there would be a remapping from BLOCK on 10 processors to BLOCK on 4 processors. That is, the block size would increase by 2.5 times (with related shuffling of data) and then revert to the original. Again, it is important to note that the distribution of A is onto the active processor set rather than onto all of P.
The similar examples
REAL X(100,100), Y(100,100) !HPF$ PROCESSORS P(4), Q(2,2) !HPF$ DISTRIBUTE X(BLOCK,*) ONTO P !HPF$ DISTRIBUTE Y(BLOCK,BLOCK) ONTO Q INTERFACE SUBROUTINE A_CAB( B ) REAL B(:) !HPF$ DISTRIBUTE B *(BLOCK) END INTERFACE !HPF$ ON ( P(4:7) ) CALL A_CAB( X( 1:100, 1 ) !HPF$ ON HOME( X(1:100,1) ) CALL A_CAB( X(1:100,100) ) !HPF$ ON HOME( Y(1:100,1) ) CALL A_CAB( Y(1:100,1) ) !HPF$ ON HOME( Y(99,1:100) ) CALL A_CAB( Y(99,1:100) )
can be explained as follows. Calling A_CAB(1:100,1) on P(4:7) will produce a remapping from 10 processors to 4, as in the example above. (The compiler would be expected to produce a warning in this case, as explained in Section 4.) Calling A_CAB(X(1:100,100)) on HOME(X(1:100,1)) produces no such remapping (or warning), because the active processor set does not change; therefore, the descriptive mapping correctly asserts that the data is already on the right processors. The last two examples, calling A_CAB(Y(1:100,1)) and A_CAB(Y(99,1:100)) on the homes of their arguments, are also accomplished without remapping. In both cases, the actual arguments are mapped BLOCK-wise onto a subset of the processors (a column of Q in the first case, a row of Q in the second.
Two examples of transcriptive mapping are also useful:
! Assume ! PROCESSORS P(4) ! is declared in a module REAL X(100) !HPF$ DISTRIBUTE X(CYCLIC(5)) ONTO P INTERFACE SUBROUTINE FOR_HELP( C ) REAL C(:) !HPF$ INHERIT C END INTERFACE !HPF$ ON HOME( X(11:20) ) CALL FOR_HELP( X(11:20) ) !HPF$ ON ( P(1) ) CALL FOR_HELP( X(51:60) ) !Nonconforming
The first example is valid--the actual argument is (trivially) distributed on the active processor set. The second example is invalid--for example, element X(51) is stored onP(3), which is not in the active processor set for the call. The second example would be valid if the ON directive specified P(3:4) or HOME(X(11:20)), both of which map to the same processor set.
Calls to EXTRINSIC subprograms also deserve mention. The "standard" HPF 2.0 description of calling an EXTRINSIC (Section 6) says in part:
This constraint is changed to read
The intent is the same as in the original language design. Processors where data is stored can neither appear nor disappear; nor may the set of processors executing the program change without notice to the program. Similarly, some extrinsic kinds specify "all processors must be synchronized" or "execution of a local procedure on each processor"; such language is understood to mean "all active processors must be synchronized" or "execution of a local procedure on each active processor."
If a procedure uses alternate return, then the target of the return must be have the same active processor set as the CALL statement. In effect, this means that labels passed as arguments must refer to statement in the same ON block as the CALL statement.
Explicit use of CALLs in ON directives is oftedn associated with task parallelism. Several examples can be found in Section 9.4. The following example illustrates how processors can be used for a one-dimensional domain decomposition algorithm:
!HPF$ PROCESSORS PROCS(NP) !HPF$ DISTRIBUTE X(BLOCK) ONTO PROCS ! Compute ILO(IP) = lower bound on PROCS(IP) ! Compute IHI(IP) = upper bound on PROCS(IP) DONE = .FALSE. DO WHILE (.NOT. DONE) !HPF$ INDEPENDENT DO IP = 1, NP !HPF$ ON (PROCS(IP)) CALL SOLVE_SUBDOMAIN( IP, X(ILO(IP):IHI(IP)) ) END DO !HPF$ ON HOME(X) BEGIN CALL SOLVE_BOUNDARIES( X, ILO(1:NP), IHI(1:NP) ) DONE = CONVERGENCE_TEST( X, ILO(1:NP), IHI(1:NP) ) !HPF$ END ON END DO
The algorithm divides the entire computational domain (array X) into NP subdomains, one for each processor. The INDEPENDENT IP loop performs a computation on each subdomain's interior. The ON directive tells the compiler which processors to use in executing these (conceptually) parallel operations. This can increase data locality substantially, particularly if the compiler could not otherwise analyze the data access patterns in SOLVE_SUBDOMAIN. The subroutine SOLVE_SUBDOMAIN can use a transcriptive or descriptive mapping for its array argument, placing it on a single processor. In the next phase, the processors collaborate to update the boundaries of the subdomains and test for convergence. Subroutines SOLVE_BOUNDARIES and CONVERGENCE_TEST may well have their own loops similar to the IP loop, with similar RESIDENT clauses. Note that only the lower and upper bounds of each subdomain is recorded; this allows different processors to process different-sized subdomains. However, each subdomain must ``fit'' into one processor's section of the X array.