ON Directives Applied to Subprogram Invocations

Next: The RESIDENT ClauseDirective, Up: The ON Directive Previous: Examples of ON Directives

ON Directives Applied to Subprogram Invocations

The key rule about ON directives when applied to subprogram invocations is that the invocation does not change the active processor set. In effect, the callee inherits the caller's active processors. Thus,

!HPF$ PROCESSORS P(10)
!HPF$ DISTRIBUTE X(BLOCK) ONTO P

!HPF$ ON ( P(1:3) )
      CALL PERSON_TO_PERSON()
!HPF$ ON ( P(4:7) )
      CALL COLLECT( X )

calls PERSON_TO_PERSON on three processors, while it calls COLLECT on four. The actual argument to COLLECT does not reside completely on the active set of processors. This is allowed, with appropriate declarations of the corresponding dummy argument as explained below.

The above rule has interesting implications for data distributions within the called routine. In particular, dummy arguments must be declared under the same restrictions as local objects, thus ensuring that the dummy is always stored on the active processor set. This does not imply that the corresponding actual argument is local, however. Consider the possibilities for how a dummy can be explicitly mapped:

Prescriptive mapping: If the actual is not mapped on the active processor set, it will be remapped. This is exactly analogous to remapping a BLOCK-distributed array to CYCLIC via a prescriptive mapping.
Descriptive mapping: The user is asserting that the actual is already mapped onto the set of active processors. If the assertion is true, then the dummy is already stored locally; if not, then the compiler inserts a remapping operation (and reports a warning, following the recommendations in Section 4).
Transcriptive mapping: In this case, a new restriction must be made to allow efficient access to the dummy argument. If a dummy is transcriptively mapped, then the actual argument must be resident on the active processor set at the time of the invocation. This may be checked at run-time.

In summary, a dummy argument is always mapped to the set of active processors, although the actual argument need not be (except in the case of transcriptive mappings).

Rationale.The treatment of dummy arguments as local objects is consistent with all previous Fortran (and FORTRAN) standards. Moreover, it has the advantage that it reflects the usual expectations and wishes of programmers. Dummy arguments are not expected to create great inefficiencies in Fortran programs; ensuring that they are always stored locally tends to reinforce that expectation. Also, programmers are used to ``pass by reference'' behavior, in which arguments are not copied; the restrictions on data mapping to active processor sets allow this implementation when the data is not remapped on subprogram call. One case of this deserves special mention--transcriptive mappings. If the programmer wants to keep the data in place (the usual expectation of INHERIT and related features) and control which processors execute the computation (the meaning of ON), then the basic principles of active sets (set forth in Section 9.1) imply that the data must be resident before the call is made. When remapping occurs due to explicit directives, then surely the user expects a communication cost to accompany the remapping.

It should also be noted that these rules do not invalidate any HPF programs written without using the ON directive. In those programs, the active processor set never changes (at least, not from the view of the language). Therefore, subset and universal processors arrangements can be used interchangeably, and the restriction on use of transcriptive mappings is obeyed automatically. (End of rationale.)

Advice to implementors.These restrictions imply that accesses to dummy arguments never require one-sided communication if the argument is explicitly mapped and the ON clause is used. Of course, accesses to global data may still run into serious complications. If the compiler itself partitions the computation, it is not restricted by the ON directive rules. (End of advice to implementors.)

Advice to users.The idea to remember in calling subprograms from an ON block is this: make sure that the actual arguments are stored on the active set. If the subroutine interface uses transcriptive (``take anything'') mappings, then this is a requirement. If the subroutine uses any other type of mapping, then having resident actual arguments may avoid the expense of remapping data. (Of course, it does not by itself guarentee that remapping doesn't occur--a prescriptive interface can force a BLOCK-to-CYCLIC redistribution--but it does ensure that the remapping is between active processors. This allows simpler and more efficient collective communications operations to be generated in the runtime system.) (End of advice to users.)

Let us return to the previous example:

!HPF$ PROCESSORS P(10)
!HPF$ DISTRIBUTE X(BLOCK) ONTO P

!HPF$ ON ( P(4:7) )
      CALL COLLECT( X )

If COLLECT were declared as

      SUBROUTINE COLLECT( A )
!HPF$ DISTRIBUTE A(CYCLIC)

then the call will be executed as follows:

X will be remapped from BLOCK on 10 processors (i.e., all of P) to CYCLIC on 4 processors (i.e., P(4:7)). This will be a many-to-many exchange pattern.
COLLECT will be called on processors P(4), P(5), P(6), and P(7). Accesses to A within the subroutine will be satisfied from the redistributed array on those processors.
A will be remapped back to the distribution of X. This is the inverse of step 1.

Note that the distribution of A is onto 4 processors (the active processor set inside the call), not onto the universal processor set. If the interface is

      SUBROUTINE COLLECT( A )
!HPF$ DISTRIBUTE A(BLOCK)

then the process would be the same, except that there would be a remapping from BLOCK on 10 processors to BLOCK on 4 processors. That is, the block size would increase by 2.5 times (with related shuffling of data) and then revert to the original. Again, it is important to note that the distribution of A is onto the active processor set rather than onto all of P.

The similar examples

      REAL X(100,100), Y(100,100)
!HPF$ PROCESSORS P(4), Q(2,2)
!HPF$ DISTRIBUTE X(BLOCK,*) ONTO P
!HPF$ DISTRIBUTE Y(BLOCK,BLOCK) ONTO Q

      INTERFACE
        SUBROUTINE A_CAB( B )
        REAL B(:)
!HPF$   DISTRIBUTE B *(BLOCK)
      END INTERFACE

!HPF$ ON ( P(4:7) )
      CALL A_CAB( X( 1:100, 1 )
!HPF$ ON HOME( X(1:100,1) )
      CALL A_CAB( X(1:100,100) )
!HPF$ ON HOME( Y(1:100,1) )
      CALL A_CAB( Y(1:100,1) )
!HPF$ ON HOME( Y(99,1:100) )
      CALL A_CAB( Y(99,1:100) )

can be explained as follows. Calling A_CAB(1:100,1) on P(4:7) will produce a remapping from 10 processors to 4, as in the example above. (The compiler would be expected to produce a warning in this case, as explained in Section 4.) Calling A_CAB(X(1:100,100)) on HOME(X(1:100,1)) produces no such remapping (or warning), because the active processor set does not change; therefore, the descriptive mapping correctly asserts that the data is already on the right processors. The last two examples, calling A_CAB(Y(1:100,1)) and A_CAB(Y(99,1:100)) on the homes of their arguments, are also accomplished without remapping. In both cases, the actual arguments are mapped BLOCK-wise onto a subset of the processors (a column of Q in the first case, a row of Q in the second.

Two examples of transcriptive mapping are also useful:

! Assume
! PROCESSORS P(4)
! is declared in a module
      REAL X(100)
!HPF$ DISTRIBUTE X(CYCLIC(5)) ONTO P

      INTERFACE
        SUBROUTINE FOR_HELP( C )
        REAL C(:)
!HPF$   INHERIT C
      END INTERFACE

!HPF$ ON HOME( X(11:20) )
      CALL FOR_HELP( X(11:20) )
!HPF$ ON ( P(1) )
      CALL FOR_HELP( X(51:60) )    !Nonconforming

The first example is valid--the actual argument is (trivially) distributed on the active processor set. The second example is invalid--for example, element X(51) is stored onP(3), which is not in the active processor set for the call. The second example would be valid if the ON directive specified P(3:4) or HOME(X(11:20)), both of which map to the same processor set.

Calls to EXTRINSIC subprograms also deserve mention. The "standard" HPF 2.0 description of calling an EXTRINSIC (Section 6) says in part:

as if

Exactly the same set of processors are visible to the HPF environment before and after the subprogram call.

This constraint is changed to read

Exactly the same set of active processors are available to the HPF environment before and after the subprogram call.
Exactly the same universal processor set is visible to the HPF environment before and after the subprogram call.

The intent is the same as in the original language design. Processors where data is stored can neither appear nor disappear; nor may the set of processors executing the program change without notice to the program. Similarly, some extrinsic kinds specify "all processors must be synchronized" or "execution of a local procedure on each processor"; such language is understood to mean "all active processors must be synchronized" or "execution of a local procedure on each active processor."

Rationale. This gives the combination of EXTRINSIC procedureds and ON directives a fork-join model of parallelism, which seems to be both natural and semantically clean. (End of rationale.)

If a procedure uses alternate return, then the target of the return must be have the same active processor set as the CALL statement. In effect, this means that labels passed as arguments must refer to statement in the same ON block as the CALL statement.

Rationale. This constraint is similar to the prohibition against jumping out of an ON block, and has the same justification. (End of rationale.)

Explicit use of CALLs in ON directives is oftedn associated with task parallelism. Several examples can be found in Section 9.4. The following example illustrates how processors can be used for a one-dimensional domain decomposition algorithm:

!HPF$ PROCESSORS PROCS(NP)
!HPF$ DISTRIBUTE X(BLOCK) ONTO PROCS

! Compute ILO(IP) = lower bound on PROCS(IP)
! Compute IHI(IP) = upper bound on PROCS(IP)
DONE = .FALSE.
DO WHILE (.NOT. DONE)
  !HPF$ INDEPENDENT
  DO IP = 1, NP
    !HPF$ ON (PROCS(IP))
    CALL SOLVE_SUBDOMAIN( IP, X(ILO(IP):IHI(IP)) )
  END DO
  !HPF$ ON HOME(X) BEGIN
    CALL SOLVE_BOUNDARIES( X, ILO(1:NP), IHI(1:NP) )
    DONE = CONVERGENCE_TEST( X, ILO(1:NP), IHI(1:NP) )
  !HPF$ END ON
END DO

The algorithm divides the entire computational domain (array X) into NP subdomains, one for each processor. The INDEPENDENT IP loop performs a computation on each subdomain's interior. The ON directive tells the compiler which processors to use in executing these (conceptually) parallel operations. This can increase data locality substantially, particularly if the compiler could not otherwise analyze the data access patterns in SOLVE_SUBDOMAIN. The subroutine SOLVE_SUBDOMAIN can use a transcriptive or descriptive mapping for its array argument, placing it on a single processor. In the next phase, the processors collaborate to update the boundaries of the subdomains and test for convergence. Subroutines SOLVE_BOUNDARIES and CONVERGENCE_TEST may well have their own loops similar to the IP loop, with similar RESIDENT clauses. Note that only the lower and upper bounds of each subdomain is recorded; this allows different processors to process different-sized subdomains. However, each subdomain must ``fit'' into one processor's section of the X array.

Advice to implementors.The IP loop above is likely to be a common idiom among programmers doing block-structured codes. In general, it can be implemented by inverting the HOME clause as was done above. In the one-to-one case shown here (probably very popular with programmers), it can be implemented by assigning the processor id to the loop index variable and testing the range of the loop (once). (End of advice to implementors.)

Rationale.Some compilers will propagate the ON information from the caller to the callee at compile time, and some at run time. Repeating the ON clause in the caller and callee will tend to give the compiler better information, resulting in better generated code. (End of rationale.)

Next: The RESIDENT ClauseDirective, Up: The ON Directive Previous: Examples of ON Directives

[ Contact Us | HiPerSoft | Computer Science ]