| SP Parallel Programming Workshop |
| p a r a l l e l c o d e d e v e l o p m e n t |
| Talk Overview |
|
| Obstacles to Efficient Parallel Programming |
|
| A Methodical Approach |
|
Parallel Programming is a complex and intuitive process that does not easily lend itself to autonomous and completely mechanical methods. We describe a method that
| Step 1: Identify Computational Hotspots |
|
We begin any parallelization process by identifying the parts of the program that consume the most run time.
The goal is to know which code should be parallelized and which code should be recycled from the serial program.
The downside of parallel programming:
The reason is obvious:
In addition, the complexity and feasibility of the parallelization process should be considered when evaluating parallelization targets.
Step 1: Hotspot Identification Checklist
Step 1: Hotspot Identification Example
Generic File I/O
Many programs begin with several pages of code that read input data from a file, create complex data structures, write this data into those data structures, and perform a small amount of preprocessing on that data.
Similarly, many programs perform complimentary tasks at the end which postprocess the data and output results to files.
Observation: These processes are usually not good candidates for parallelization:
| Step 2: Partitioning |
|
In this step we partition our problem into smaller tasks that can execute in parallel with each other.
The intent is to expose multiple opportunities for parallel execution rather then to suggest the actual partition of tasks that will be executed in parallel.
Later, considerations of communication, memory, and task requirements will lead us to agglomerate smaller tasks into larger tasks.
For now, the goal is to partition the tasks as finely as possible so as to provide greater flexibility to make choices during later steps.
There are two primary methods for partitioning a problem.
Data Decomposition: partition the data first. Then partition the computation based on the data partition. Some general heuristics:
Alternatively, combine the two strategies at different points in your problem.
Step 2: Partitioning Checklist
There are exponentially many ways to partition a problem; some may be significantly better or worse than others. The following checkpoints should help you determine if your partitioning is appropriate:
Step 2: Partitioning Example
Generic 3-D Grid
Many programs focus their computation on a 3-dimensional grid structure. Depending on the nature of the program, computation could be focused on planes, columns, individual points, or blocks of data in any combination of the 3 dimensions. Some example data partitions would be:
Step 2: Partitioning Example
Matrix Multiply
In the common matrix multiply problem, two 2-d matrices, A and B, are multiplied to form a third 2-d matrix C. If all matrices are NxN, this problem requires about NxNxN floating-point operations. Each element in C is computed from the dot product of a row of A with a column of B.
for(i=0;i < NRA;i++)
for(j=0;j < NCB;j++)
for(k=0;k < NCA;k++)
c[i][j]+= a[i][k] * b[k][j];
Some potentially good partitions:
Some potentially bad partitions:
Step 2: Partitioning Example
Mandelbrot Image Calculation
In a Mandelbrot image calculation, a 2-d image is generated in which each pixel's value is calculated solely as a function of its coordinates.
for (i=0;i < IMAGE_HEIGHT;i++)
{
p_imag = ystart + ((i + (taskid * (IMAGE_HEIGHT/numtasks))) * yinc);
for (j=0;j < IMAGE_WIDTH;j++)
{
p_real = xstart + (j * xinc); z_real = p_real; z_imag = p_imag;
count = 0;
t_real = z_real * z_real; t_imag = z_imag * z_imag; z_size = t_real + t_imag;
while ((count < MAX_COUNT) && (z_size < MAX_SIZE))
{
z_imag = ((z_real + z_real) * z_imag) + p_imag;
z_real = (t_real - t_imag) + p_real; t_real = z_real * z_real;
t_imag = z_imag * z_imag; z_size = t_real + t_imag;
count++;
}
output_image[i][j] = count * 2;
}
}
Example output images might look like:
Some example partitions would be:
| Step 3: Communication |
|
Ideally, the tasks defined in the previous step could execute completely in parallel, but in most real world programs these tasks require data computed or used by other tasks; communication is necessary.
Data communication is one of the main obstacles to efficient parallel programming.
In the communication step, we identify the data communication requirements of your partition by drawing lines between tasks that either produce or consume data from other tasks.
The intermediate goal is to identify tasks which most frequently communicate with another. In the next step, those tasks will be agglomerated into larger tasks.
Another goal is to identify those data structures which are needed by many tasks and to make decisions on whether to replicate them.
The process: draw lines between those tasks that must communicate.
A small number of communication patterns typically arise:
Step 3: Communication Checklist
The communication step requires few decisions.
The purpose is to make you understand the relationships between different data structures and task in your program.
This will allow you to better map those data structures and tasks to processors.
Step 3: Communication Example
Jacobi Finite Difference Method
The communication patterns of the Jacobi finite difference method are a good example of an algorithm with local communication requirements. In this example, a two-dimensional grid is repeatedly updated by assigning a function of neighbor values to each element in a 2-d grid X:
The communication structure for a single element in such a grid would look like this:
Step 3: Communication Example
Global Reduction Operations
Many programs contain phases where a single value must be summed among each task. The communication structure of such an operation would look like this:
This type of communication operation is inefficient because all receive operations, and all additions, must be serialized.
A better communication strategy would be to use the divide-and-conquer strategy to perform the global sum with a binary tree:
The running time of this code fragment would be reduced from O(N) to O(LogN).
Step 3: Communication Example
Affine Image Warping
In Affine image warping with LaGrange interpolation, an input image is rotated, translated and scaled to produce an output image; each pixel in the output image is overwritten by some pixel in the input image. A small (4 by 64 floating-point numbers) interpolation table is used to interpolate input image values that fall in between pixels.
A code fragment is available here.
Regardless of our data/task partitioning scheme, each task will need to access unknown entries of the interpolation table.
If that table were stored at a single task, excessive communication would occur.
Observation: the interpolation table is computed only once.
Suggestion: replicating the table at each processor will reduce total communication costs and not severely consume task memory.
| Step 4: Agglomeration |
|
At this point we have defined a data partition and identified many tasks which can execute in parallel along with their communication requirements.
In the agglomeration step, we group many small tasks into larger tasks in such a way that we:
How much should we agglomerate? Assume we start with T tasks that will run on P processors.
Step 4: Agglomeration:
How To Agglomerate
Whether we are creating fully- or semi-agglomerated tasks, the basic guidelines are the same (and an exercise in satisfying conflicting demands).
It's often useful to know the communication performance capabilities of your target machine while agglomerating tasks: SP2 Communication Performance Summary.
Step 4: Agglomeration:
Special Case
Special Case: agglomerating T tasks with varying run times to P tasks. We must employ a static load balancing scheme to ensure processors have equal work load.
Step 4: Agglomeration Checklist
Step 4: Agglomeration Example
2-D Jacobi Finite Difference Method
We examine the 2-D Jacobi finite difference method again as an example for task agglomeration. Recall that each element in the 2-d grid is updated each iteration as a function of its 4 neighbors. By grouping together grid elements in 2-D blocks, we can reduce total communication, increase communication granularity and increase computational granularity.
Step 4: Agglomeration Example
Mandelbrot Image Calculation
Recall that in Mandelbrot image calculation, a 2-d image is generated in which each pixel's value is calculated solely as a function of its coordinates. Example image
In Step 2 we suggested that partitioning by pixels would be most aggressive.
We now consider the agglomeration phase. Inspection of the code reveals that the computation time is not constant for each pixel. In fact, different pixel colors in the output image correspond to different amounts of computation.
Because computation is widely varying at each pixel, a potential load-balancing problem arises.
Agglomeration choices:
| Step 5: Mapping tasks to Processors |
|
At this point we should have our program partitioned into reasonable sized tasks, but not yet mapped to processors.
In the case of static problems where we fully agglomerated T smaller tasks into P larger tasks, the mapping task is straightforward; one task per processor; SP2 tip.
In the tougher problems, the number of tasks is unknown; some form of dynamic scheduling and dynamic load balancing must be used.
Step 5: Mapping: Task-Scheduling Algorithms
Task-scheduling algorithms, where tasks are dynamically mapped to processors as computation proceeds, work best when each task has small communication requirements:
Step 5: Mapping Checklist
Step 5: Mapping Examples
Recursive Bisection Example
Step 5: Mapping Examples
Mandelbrot Image Calculation
| Case Study: 2-D FFT |
|
| Overview | |
The problem is described in a serial context, then the above steps are applied to yield a parallel program.
In this problem a series of one-dimensional FFT's are applied to a two-dimensional image.
The input and output of the algorithm is a 2-D image. A brief pseudo-code description of the algorithm is:
do i = 1, IMAGE_HEIGHT
row(tmp_image, i) = 1-d_fft(row(input_image, i))
end do
do i = 1, IMAGE_WIDTH
column(output_image, i) = 1-d_fft(column(tmp_image, i))
end do
The only important data structures are the input, tmp, and output images.
Each image is the same size; typically anywhere from 128 by 128 to 8192
by 8192 or larger.
2-D FFT Parallelization
Identify the Computation Hotspots: intuitive and visual inspection of the code reveals that the bulk of the run time is spent performing the N row FFT's and the N column FFT's. Running the program with prof confirms this.
Partitioning: for 2-D FFT, partitioning is the most important step and there are several choices:
Agglomeration/Mapping: This step, and the mapping step, are simple for this example.
Source code for serial and parallel versions in C:
Source code for serial and parallel versions in Fortran:
| Case Study: 2-D CFD |
|
| Overview | |
The problem is described in a serial context, then the above steps are applied to yield a parallel program.
This case study is a computational fluid dynamics example in two dimensions. In this example, we'll be simulating airflow over a wing. More specifically, this code calculates the solution of a compressible, inviscid, non-conducting flow over a body in two dimensions.
The code solves the 2D, compressible Euler equations on unstructured triangular grids:
The Euler equations, where U is the vector of conserved variables. A and B are the Jacobian matrices defined in the program.
The algorithm approximately solves the Euler equations over the domain by solving the discretized equations on the unstructured grid, or mesh. The values for density, x-momentum, y-momentum, and total energy are stored at the vertices of the mesh.
Basically, the field values at a given timestep are checked to see if they satisfy the discretized Euler equations. If not, then they are adjusted and checked again. This iterative process continues until the field values satisfy the discretized Euler equations to a specified tolerance.
2-D CFD: Computational Elements
The wing is divided into a triangular mesh. Each triangle is called a cell. Each cell has 3 vertices or nodes. Click here for an illustration of the 2-D mesh.
The "residual" is calculated at each cell using field values held at the vertices of that cell.
This fluctuation is then split-up and distributed to the vertices of the cell.
The sum of the fluctuation contributions from all the cells that surround a vertex make up the nodal fluctuation for that vertex.
Once the code has looped through all of the cells, the nodal residuals will have been calculated for each vertex, and the vertex values can be updated for the next timestep.
2-D CFD: Useful Information
CFD Parallelization
Identify the Computation Hotspots: visual inspection of the code shows that the main loop in which the values (density, x-dir momentum, y-dir momentum and total energy) are calculated for each cell are the computationally expensive parts of the algorithm. File I/O, pre-, and post-processing steps will remain serial.
Partitioning: The cell is a natural element for data-partitioning:
Communication: each cell only needs data from its vertices. Each vertex only needs data from vertices that it is directly connected to. Therefore, the 2-D grid itself already identifies communication links.
Agglomeration: at this point we have a separate task for each cell and far more cells than processors. We must agglomerate to larger tasks.
Source code (compressed tar format) for serial and parallel versions in C:
| References and More Information |
|