The fundamental model of parallelism in HPF is that of single-threaded data-parallel execution with a globally shared address space. Fortran array statements and the FORALL statement are natural ways of specifying data parallel computation. In addition, HPF provides the INDEPENDENT directive. It can be used to assert that certain loops do not carry any dependences and therefore may be executed in parallel.
Exploitation of data locality is critical to achieving good performance on a high-performance computer, whether a uniprocessor workstation, a network of workstations, or a parallel computer. On a Non-Uniform-Memory-Access (NUMA) parallel computer, the effective distribution of data among processor memories is very important in reducing data movement overheads. One of the key features of HPF is the facility for user specification of data mapping. HPF provides a logical view of the parallel machine as a rectilinear arrangement of abstract processors in one or more dimensions. The programmer can specify the relative alignment of elements of different program arrays, and the distribution of arrays over the logical processor grid. Data mapping is specified using HPF directives that can aid the compiler in optimizing parallel performance, but have no effect on the semantics of the program. This is illustrated by the following simple example.
REAL A(1000,1000) !HPF PROCESSORS procs(4,4) !HPF DISTRIBUTE (BLOCK,BLOCK) ONTO procs :: A DO k = 1, num_iter FORALL (i=2:999, j=2:999) A(i,j) = (A(i,j-1) + A(i-1,j) + A(i,j+1) + A(i+1,j))/4 END FORALL END DO
The code fragment describes a simple Jacobi relaxation computation using a two-dimensional floating-point array A. The HPF directives appear as structured comments. The PROCESSORS directive specifies a logical grid of processors proc. The DISTRIBUTE directive recommends that the compiler partition the array A into equal-sized blocks along each of its dimensions. This will result in a configuration of blocks each containing elements, one block per processor. The PROCESSORS and DISTRIBUTE directive are described in detail later in Section 3.
The outer DO k loop iterates over num_iter Jacobi relaxation steps. The inner loop uses the Fortran 95 FORALL construct. It specifies the execution of the loop body for all values of i and j in the range 2 through 999. The semantics of the FORALL require that the right-hand-side expressions for all iterations (i.e., for all values of i and j between 2 and 999) be evaluated before any of the assignments to the left-hand-side variables are performed.
When targeted for execution on a distributed-memory machine with 16 processors, an HPF compiler generates SPMD code, with each processor locally containing a part of the global array A. The outer k loop is executed sequentially while the inner FORALL is executed in parallel. Each processor will require some ``boundary'' elements of A that reside in partitions mapped to the local memories of other processors. Primitives to achieve the necessary inter-processor communication are inserted by the HPF compiler into the generated SPMD code. The single-threaded data-parallel model with a global name-space makes it convenient for the programmer to specify the strategy for parallelization and data partitioning at a higher level of abstraction. The tedious low-level details of translating from an abstract global name space to the local memories of individual processors and the management of explicit inter-processor communication are left to the compiler.
The following example illustrates some of the communication implications of scalar assignment statements. The purpose is to illustrate the implications of data distribution specifications on communication requirements for parallel execution. The explanations given do not necessarily reflect the actual compilation process.
Consider the following code fragment:
REAL a(1000), b(1000), c(1000), x(500), y(0:501) INTEGER inx(1000) !HPF PROCESSORS procs(10) !HPF DISTRIBUTE (BLOCK) ONTO procs :: a, b, inx !HPF DISTRIBUTE (CYCLIC) ONTO procs :: c !HPF ALIGN x(i) WITH y(i+1) ... a(i) = b(i) ! Assignment 1 x(i) = y(i+1) ! Assignment 2 a(i) = c(i) ! Assignment 3 a(i) = a(i-1) + a(i) + a(i+1) ! Assignment 4 c(i) = c(i-1) + c(i) + c(i+1) ! Assignment 5 x(i) = y(i) ! Assignment 6 a(i) = a(inx(i)) + b(inx(i)) ! Assignment 7
In this example, the PROCESSORS directive specifies a linear arrangement of 10 processors. The DISTRIBUTE directives recommend to the compiler that the arrays a, b, and inx should be distributed among the 10 processors with blocks of 100 contiguous elements per processor. The array c is to be cyclically distributed among the processors with c(1), c(11), ..., c(991) mapped onto processor procs(1); c(2), c(12), ..., c(992) mapped onto processor procs(2); and so on. The complete mapping of arrays x and y onto the processors is not specified, but their relative alignment is indicated by the ALIGN directive. The ALIGN statement recommends that x(i) and y(i+1) be stored on the same processor for all values of i, regardless of the actual distribution chosen by the compiler for y (y(0) and y(1) are not aligned with any element of x). The PROCESSORS, DISTRIBUTE, and ALIGN directives are discussed in detail in Section 3.
In Assignment 1 (a(i) = b(i)), the identical distribution of a and b specifies that for all i, corresponding elements of a(i) and b(i) should be mapped to the same processor. Therefore, execution of this statement requires no communication of data values between processors.
In Assignment 2 (x(i) = y(i+1)), there is no inherent communication. In this case, the relative alignment of the two arrays matches the assignment statement for any actual distribution of the arrays.
Although Assignment 3 (a(i) = c(i)) looks very similar to the first assignment, the communication requirements are very different due to the different distributions of a and c. Array elements a(i) and c(i) are mapped to the same processor for only 10% of the possible values of i. (This can be seen from the definitions of BLOCK and CYCLIC in Section 3.) The elements are located on the same processor if and only if . For example, the assignment involves no inherent communication (i.e., both a(i) and c(i) are on the same processor) if or , but does require communication if .
In Assignment 4 (a(i) = a(i-1) + a(i) + a(i+1)), the references to array a are all on the same processor for about 98% of the possible values of i. The exceptions to this are for any , (when a(i) and a(i-1) are on procs(k) and a(i+1) is on procs(k+1)) and for any (when a(i) and a(i+1) are on procs(k+1) and a(i-1) is on procs(k)). This statement requires communication. only for ``boundary'' elements on each processor,
Assignment 5, c(i) = c(i-1) + c(i) + c(i+1), while superficially similar to Assignment 4, has very different communication behavior. Because the distribution of c is CYCLIC rather than BLOCK, the three references c(i), c(i-1), and c(i+1) are mapped to three distinct processors for any value of i. Therefore, this statement requires communication for at least two of the right-hand side references, regardless of the implementation strategy.
The final two assignments have very limited information regarding the communication requirements. In Assignment 6 (x(i) = y(i)) the only information available is that x(i) and y(i+1) are on the same processor; this has no logical consequences for the relationship between x(i) and y(i). Thus, nothing can be said regarding communication required at runtime for the statement without further information. In Assignment 7 (a(i) = a(inx(i)) + b(inx(i))), it can be proved that a(inx(i)) and b(inx(i)) are always mapped to the same processor. Similarly, it is easy to deduce that a(i) and inx(i) are mapped together. Without knowledge of the values stored in inx, however, the relation between a(i) and a(inx(i)) is unknown, as is the relationship between a(i) and b(inx(i)).