[ HPF Home | Versions | Compilers | Projects | Publications | Applications | Benchmarks | Events | Contact ] |
HPF specifies an allowed parallel implementation of an INDEPENDENT DO loop with reduction statements, thereby specifying the semantics of such a loop.
Just as the result of the Fortran intrinsic function SUM is defined to be a implementation-dependent approximation to the sum of the elements of its argument array, the value of a reduction variable on exit from its INDEPENDENT DO loop is likewise not completely specified by HPF. One possible value is that which would have been computed by sequential execution of the loop, but other implementation-dependent approximations to this value may be produced. Any such implementation-dependent value is, however, an approximation to the value produced by sequential execution of the loop. If rounding error, underflow, and overflow do not occur, it will be identical to that value.
Since no reference to a protected reduction variable can occur except in a reduction statement, it is not necessary to define the values that these variables may have while protected.
We describe a simple implementation mechanism that applies to commutative reduction operations. On entry to an independent loop, every executing processor allocates a private accumulator variable associated with each variable in the reduction clause on the INDEPENDENT directive, and initializes it to the identity element for the corresponding intrinsic reduction operator. The private accumulator variable has the same shape, type, and kind type parameter as the reduction variable.
The identity elements for the intrinsic operators are defined in Table 5.1.
Table 5.1: Identity elements for intrinsic reduction operators.
Operator Identity Element + 0 - 0 * 1 / 1 .AND. .TRUE. .OR. .FALSE. .EQV. .TRUE. .NEQV. .FALSE.
Table 5.2: Identity elements for intrinsic reduction functions.
Function Identity Element IAND(I,J) NOT(0) (all one-bits) IOR(I,J) 0 IEOR(I,J) 0 MIN(X,Y) the positive number of largest absolute value that has the same type and kind type parameter as the reduction variable MAX(X,Y) the negative number of largest absolute value that has the same type and kind type parameter as the reduction variable
The intrinsic functions that may be used as reduction functions are listed, together with their identity elements, in Table 5.2.
Each processor performs a subset of the loop iterations; when it encounters a reduction statement, it updates its own accumulator variable. A processor is free to perform its loop iterations in any order; furthermore, it may start an iteration, suspend work on it, do some or all of the work of other iterations, and resume work on the suspended iteration. However, any update of a private accumulator variable occurs through the execution of a reduction statement, and reduction statements are executed atomically.
The final value of the reduction variable is computed by combining the private accumulator variables with the value of the reduction variable on entry to the loop, using the reduction operator. The ordering of this reduction is language-processor dependent, just as it is for the intrinsic reduction functions (SUM, etc.).
As an example, consider:
REAL Z Z = 5. !HPF$ INDEPENDENT, REDUCTION(X) DO I = 1, 10 Z = Z + I END DO
The final value of Z will be 5 + (1+2+3+4+5+6+7+8+9+10) = 60; the order in which the additions occur is not specified by HPF.
For a second example, here is a SUM_SCATTER done as an independent loop:
!HPF$ INDEPENDENT, REDUCTION(X) DO I = 1, N X(INDEX(I)) = X(INDEX(I)) - F(I) END DO
The implementation will most likely make a private copy on every processor of an accumulator array XLOCAL of the same type and shape as X, and initialize it to zero. Each iteration will subtract the value of F(I) from its own XLOCAL(INDEX(I)). To create the final result, the implementation must combine all the private accumulator arrays with the initial value of X. The combining operator is the same as the reduction operator, namely addition, so that the result is the sum of the initial value of X and the accumulator arrays. The implementation has the option of using a sparse data structure to store only the updated elements of the local accumulator.
In an MPI based implementation, the MPI_REDUCE function could be used for this task. (End of advice to implementors.)
©2000-2006 Rice University | [ Contact Us | HiPerSoft | Computer Science ] |