This section shows the use of task parallelism to build a pipelined data-parallel 2-dimensional FFT and illustrates the compilation of task parallelism by showing SPMD code generated from the HPF program.
The basic sequential 2DFFT code is as follows:
REAL, DIMENSION(n,n) :: a1, a2 DO WHILE(.true.) READ (unit = 1, end = 100) a1 CALL rowffts(a1) a2 = a1 CALL colffts(a2) WRITE (unit = 2) a2 CYCLE 100 CONTINUE EXIT END DO
To write a pipelined task and data parallel 2D FFT in HPF, the code is slightly modified and several HPF directives are added. First, variables a1 and a2 are distributed onto disjoint subsets of processors, and then a task region is used to create two lexical tasks to perform rowffts and colffts on different subsets of processors. The assignment a2 = a1 in the task region specifies the transfer of data between the tasks. A new variable done1 is introduced to store the termination condition. The modified code is as follows:
REAL, DIMENSION(n,n) :: a1,a2 LOGICAL done1 !HPF$ PROCESSORS procs(8) !HPF$ DISTRIBUTE a1(block,*) ONTO procs(1:4) !HPF$ DISTRIBUTE a2(*,block) ONTO procs(5:8) !HPF$ TEMPLATE, DIMENSION(4), DISTRIBUTE(BLOCK) ONTO procs(1:4) :: td1 !HPF$ ALIGN WITH td1(*) :: done1 !HPF$ TASK_REGION done1 = .false. DO WHILE (.true.) !HPF$ ON (procs(1:4)) BEGIN, RESIDENT READ (unit = iu,end=100) a1 CALL rowffts(a1) GOTO 101 100 done1 = .true. 101 CONTINUE !HPF$ END ON IF (done1) EXIT a2 = a1 !HPF$ ON (procs(5:8)) BEGIN, RESIDENT CALL colffts(a2) WRITE(unit = ou) a2 !HPF$ END ON END DO !HPF$ END TASK_REGION
Finally, we show simplified SPMD code generated for each processor. We assume a message passing model where sends are asynchronous and nonblocking and receives block until the data is available. We use a simple memory model where variable declarations are identical across all processors even though some variables will be referenced only on subsets of the processors. A shadow variable done1_copy is created by the compiler to transfer information from processor subset 1 to processor subset 2 about termination of processing. The code is as follows:
REAL DIMENSION(n/4,n) :: a1 REAL DIMENSION(n,n/4) :: a2 LOGICAL done1 C Following are compiler generated variables LOGICAL done1_copy LOGICAL inset1, inset2 C C Following magic compiler function call is to set the variables C inset1 and inset2 to .true. for subset 1 and subset 2 processors C respectively, and .false. otherwise. C CALL initialize_tasksets(inset1,inset2) C Code for processor subset 1 IF (inset1) done1 = .false. DO WHILE (.true.) C Read is left unchanged as the code depends on the I/O model READ (unit = 1,end=100) a1 CALL rowffts(a1) GOTO 101 100 done1 = .true. 101 CONTINUE _send(done1,procs(5:8)) IF (done1) EXIT _send(a1,proces(5:8)) END DO END IF C Code for processor subset 2 IF (inset2) DO WHILE(.true.) _receive(done1_copy,procs(1:4)) IF (local_done1) EXIT _receive(a2,procs(1:4)) CALL colffts(a2) C Write is left unchanged as the code depends on the I/O model. WRITE (unit = 2) a2 END DO END IF
_send and _receive are communication calls to transfer variables between subsets of processors. Program execution until the end of input is as follows. Subset 1 processors repeatedly read input, compute rowffts, and send the computed output as well as done1 flag, which normally has the value .false., to subset 2 processors. The subset 2 processors receive the flag and the data set, compute colffts and write the results to the output. When the end of input is reached, subset 1 processor set the value flag done1 to .true., send it and terminate execution. Subset 2 processors receive the flag, recognize that the end of input has been reached, and terminate execution.