HPF_CRAFT is intended for use in circumstances where greater control and performance are desired on MIMD-style architectures. Because data may be declared private, local control is more readily available; and because processor information is available, message-passing and direct-memory-access programming styles can be integrated seamlessly with explicitly mapped data.
The following examples show some of the capabilities of HPF_CRAFT that differ from those of HPF. Others, such as integrated message passing and synchronization primitives, are not shown. Much of HPF can also be used within HPF_CRAFT.
Example 1 illustrates the difference between the default distribution for data and the distribution of mapped data.
! Example 1
      INTEGER PRIVATE_A(100, 20), PRIVATE_B(12, 256), PRIVATE_C
      INTEGER MAPPED_A(100, 20), MAPPED_B(12, 256), MAPPED_C
!HPF$ DISTRIBUTE MAPPED_A(BLOCK, BLOCK), MAPPED_B(BLOCK, *), MAPPED_C

In the above example, given 8 processors, there would be 8 * 100 * 20 (or 16,000) elements named PRIVATE_A across the machine: each processor contains its own entire copy of the array PRIVATE_A. The elements of PRIVATE_A on processor 1 cannot be referenced using implicit syntax by any other processor. There are only 100 * 20 (or 2,000) elements of array MAPPED_A, however, and these elements are distributed across the machine in a (BLOCK, BLOCK) fashion.
The difference between the PRIVATE_A declaration in HPF_CRAFT and that in HPF is the most instructive. In HPF_CRAFT each processor contains one copy of the array, and the values of the elements of the array may vary from processor to processor. HPF implementations are permitted to make one copy of the array per processor the default, but the values of these copies must remain coherent across all processors. In HPF there is no way to write a conforming program in which different processors have different values for the same array.
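As an illustrative sketch of this distinction (not taken from the numbered examples; it borrows the MY_PE() intrinsic that appears later in Example 4), each processor can legally store a different value in its private copy:

! Illustrative sketch, assuming the MY_PE() intrinsic of Example 4.
! After this assignment each processor's copy of PRIVATE_A holds that
! processor's own PE number, so the copies legitimately differ from
! processor to processor -- legal in HPF_CRAFT, inexpressible in a
! conforming HPF program.
      INTEGER PRIVATE_A(100, 20)
      PRIVATE_A = MY_PE()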
Example 2 shows the usefulness of the ON clause for the INDEPENDENT loop as well as giving an example of how private data may be used.
! Example 2
      PRIVATE_C = 0
!HPF$ INDEPENDENT (I, J) ON MAPPED_B(I, J)
      DO J = 1, 256
        DO I = 1, 12
          MAPPED_B(I, J) = MAPPED_B(I, J) + 5
          PRIVATE_C = PRIVATE_C + MAPPED_B(I, J)
        ENDDO
      ENDDO

In this example, each iteration is executed on the processor containing the element of MAPPED_B that it references; the ON clause is what lets the user specify this placement explicitly.
In addition, the private variable PRIVATE_C is used to compute a total for each processor. At the end of execution of the loop, the value of PRIVATE_C may differ from processor to processor, depending upon the values of the array elements held on each processor. These per-processor totals may be used as is, or they can be combined quickly using a barrier or an ATOMIC UPDATE.
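The barrier-based alternative mentioned above might look like the following sketch. The names BARRIER() and N$PES are assumed here to denote a synchronization intrinsic and the number of processors; they are not taken from the original examples and their exact spellings may vary by implementation:

! Hypothetical sketch of combining private totals with a barrier.
! BARRIER() and N$PES are assumed names, not from the original text.
      INTEGER PARTIAL(0:N$PES-1)
!HPF$ DISTRIBUTE PARTIAL(BLOCK)
      PARTIAL(MY_PE()) = PRIVATE_C     ! each PE publishes its private total
      CALL BARRIER()                   ! all partial sums are now visible
      MAPPED_C = SUM(PARTIAL)          ! every PE computes the same global total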
Example 3 shows the final total value being combined into the variable MAPPED_C whose value is available to all processors.
! Example 3
      MAPPED_C = 0
!HPF$ ATOMIC UPDATE
      MAPPED_C = MAPPED_C + PRIVATE_C
Example 4 shows how the language allows private data to vary from processor to processor.
! Example 4
      IF (MY_PE() .EQ. 5) THEN
        PRIVATE_C = some-big-expression
      ENDIF

In this example, PRIVATE_C on processor 5 will hold the result of some-big-expression. Each processor can do distinctly different work and communicate through mapped data.
The code fragment in Example 5 is from an application and shows a few features of the language.
! Example 5
!HPF$ GEOMETRY G(*, CYCLIC)
      REAL FX(100,100), FY(100,100), FZ(100,100)
!HPF$ DISTRIBUTE (G) :: FX, FY, FZ
      REAL FXP(100,16,100), FYP(100,16,100)
!HPF$ DISTRIBUTE FXP(*,*,BLOCK), FYP(*,*,BLOCK)
      INTEGER CELL, ATOM, MAP(1000), NACELL(1000)
!HPF$ INDEPENDENT (CELL) ON FX(1,CELL)
      DO CELL = 1, 100
        JCELL0 = 16*(CELL-1)
        DO NABOR = 1, 13
          JCELL = MAP(JCELL0+NABOR)
          DO ATOM = 1, NACELL(CELL)
            FX(ATOM, CELL) = FX(ATOM, CELL) + FXP(ATOM, NABOR, JCELL)
            FY(ATOM, CELL) = FY(ATOM, CELL) + FYP(ATOM, NABOR, JCELL)
          ENDDO
        ENDDO
      ENDDO
The GEOMETRY directive allows the user to specify a mapping generically and apply it to many arrays (the arrays need not have the same extents).
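For instance (an illustrative sketch, not from the application code), a single geometry could map two arrays of unequal extents:

! Illustrative sketch: one GEOMETRY applied to arrays whose extents differ.
!HPF$ GEOMETRY G(*, CYCLIC)
      REAL A(100, 100), B(50, 200)
!HPF$ DISTRIBUTE (G) :: A, B   ! columns of both A and B are dealt out cyclically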
Example 5 contains a single INDEPENDENT loop, the outer loop, which executes 100 iterations in total. Within this loop the private value of JCELL0 is set on each processor (ensuring that it is a local computation everywhere). Nested inside the INDEPENDENT loop is a private loop, which executes 13 times per processor. Inside it, JCELL is computed locally on each processor, minimizing unnecessary communication. Finally, the innermost loop is also private.