SP Parallel Programming Workshop
p a r a l l e l     c o d e     d e v e l o p m e n t



  Table of Contents
  1. Talk Overview
  2. Obstacles to Efficient Parallel Programming
  3. A Methodical Approach
  4. Case studies:
  5. References and More Information


 
Talk Overview Up to Table of Contents Down to Obstacles to Efficient Parallel Programming


 
Obstacles to Efficient Parallel Programming Up to Talk Overview Down to A Methodical Approach

  1. Overhead due to data communication

  2. Load Balancing Inefficiencies

  3. Inherently Sequential Algorithmic Nature



 
A Methodical Approach Up to Obstacles to Efficient Parallel Programming Down to Step 1: Identify Computational Hotspots

Parallel Programming is a complex and intuitive process that does not easily lend itself to autonomous and completely mechanical methods. We describe a method that

Drawing from the extensive work in the field of Parallel Programming research, and in particular, the work of Ian Foster our 5 step method is:

  1. Identify Computational Hotspots

  2. Partition the problem into smaller independent tasks

  3. Identify Communication requirements between such tasks

  4. Agglomerate smaller tasks into larger tasks

  5. Map tasks to actual processors

As you become more experienced with parallel programming, you'll be able to consider two, three, or all steps at the same time.


An Illustration of the 5-Step Method:



 
Step 1: Identify Computational Hotspots Up to A Methodical Approach Down to Step 2: Partitioning

We begin any parallelization process by identifying the parts of the program that consume the most run time.

The goal is to know which code should be parallelized and which code should be recycled from the serial program.

The downside of parallel programming:

The reason is obvious:

There are three basic ways to identify the hotspots in your code:

In addition, the complexity and feasibility of the parallelization process should be considered when evaluating parallelization targets.


Step 1: Hotspot Identification Checklist


Step 1: Hotspot Identification Example
Generic File I/O



 
Step 2: Partitioning Up to Step 1: Identify Computational Hotspots Down to Step 3: Communication

In this step we partition our problem into smaller tasks that can execute in parallel with each other.

The intent is to expose multiple opportunities for parallel execution rather then to suggest the actual partition of tasks that will be executed in parallel.

Later, considerations of communication, memory, and task requirements will lead us to agglomerate smaller tasks into larger tasks.

For now, the goal is to partition the tasks as finely as possible so as to provide greater flexibility to make choices during later steps.

There are two primary methods for partitioning a problem.

Data Decomposition: partition the data first. Then partition the computation based on the data partition. Some general heuristics:

Functional Decomposition: partition the computation into smaller tasks. Then, partition the data based on these tasks; this is common in problems where there are no obvious data structures to partition, or where the data structures are highly unstructured.

Alternatively, combine the two strategies at different points in your problem.


Step 2: Partitioning Checklist

There are exponentially many ways to partition a problem; some may be significantly better or worse than others. The following checkpoints should help you determine if your partitioning is appropriate:

  1. Maximize flexibility; does your partition define at least an order of magnitude more tasks than there are processors?

  2. Scaling to larger problem sizes: does your partition avoid redundant storage requirements?

  3. Load balancing: Are tasks of comparable size?

  4. Problem scalability: does the number of tasks scale with problem size?

  5. Maximize flexibility; have you identified several alternative partitions?


Step 2: Partitioning Example
Generic 3-D Grid

Many programs focus their computation on a 3-dimensional grid structure. Depending on the nature of the program, computation could be focused on planes, columns, individual points, or blocks of data in any combination of the 3 dimensions. Some example data partitions would be:


Step 2: Partitioning Example
Matrix Multiply

In the common matrix multiply problem, two 2-d matrices, A and B, are multiplied to form a third 2-d matrix C. If all matrices are NxN, this problem requires about NxNxN floating-point operations. Each element in C is computed from the dot product of a row of A with a column of B.

   for(i=0;i < NRA;i++)
      for(j=0;j < NCB;j++)
         for(k=0;k < NCA;k++)
            c[i][j]+= a[i][k] * b[k][j];
Some potentially good partitions:

Some potentially bad partitions:


Step 2: Partitioning Example
Mandelbrot Image Calculation

In a Mandelbrot image calculation, a 2-d image is generated in which each pixel's value is calculated solely as a function of its coordinates.

for (i=0;i < IMAGE_HEIGHT;i++)
{
   p_imag = ystart + ((i + (taskid * (IMAGE_HEIGHT/numtasks))) * yinc);
   for (j=0;j < IMAGE_WIDTH;j++)
   {
      p_real = xstart + (j * xinc); z_real = p_real; z_imag = p_imag;
      count = 0;
      t_real = z_real * z_real; t_imag = z_imag * z_imag; z_size = t_real + t_imag;
      while ((count < MAX_COUNT) && (z_size < MAX_SIZE))
      {
         z_imag = ((z_real + z_real) * z_imag) + p_imag;
         z_real = (t_real - t_imag) + p_real; t_real = z_real * z_real;
         t_imag = z_imag * z_imag; z_size = t_real + t_imag;
          count++;
      }
      output_image[i][j] = count * 2;
   }
}
Example output images might look like:

Some example partitions would be:



 
Step 3: Communication Up to Step 2: Partitioning Down to Step 4: Agglomeration

Ideally, the tasks defined in the previous step could execute completely in parallel, but in most real world programs these tasks require data computed or used by other tasks; communication is necessary.

Data communication is one of the main obstacles to efficient parallel programming.

In the communication step, we identify the data communication requirements of your partition by drawing lines between tasks that either produce or consume data from other tasks.

The intermediate goal is to identify tasks which most frequently communicate with another. In the next step, those tasks will be agglomerated into larger tasks.

Another goal is to identify those data structures which are needed by many tasks and to make decisions on whether to replicate them.

The process: draw lines between those tasks that must communicate.

A small number of communication patterns typically arise:


Step 3: Communication Checklist


Step 3: Communication Example
Jacobi Finite Difference Method

The communication patterns of the Jacobi finite difference method are a good example of an algorithm with local communication requirements. In this example, a two-dimensional grid is repeatedly updated by assigning a function of neighbor values to each element in a 2-d grid X:

The communication structure for a single element in such a grid would look like this:


Step 3: Communication Example
Global Reduction Operations

Many programs contain phases where a single value must be summed among each task. The communication structure of such an operation would look like this:

This type of communication operation is inefficient because all receive operations, and all additions, must be serialized.

A better communication strategy would be to use the divide-and-conquer strategy to perform the global sum with a binary tree:

The running time of this code fragment would be reduced from O(N) to O(LogN).


Step 3: Communication Example
Affine Image Warping

In Affine image warping with LaGrange interpolation, an input image is rotated, translated and scaled to produce an output image; each pixel in the output image is overwritten by some pixel in the input image. A small (4 by 64 floating-point numbers) interpolation table is used to interpolate input image values that fall in between pixels.

A code fragment is available here.

Regardless of our data/task partitioning scheme, each task will need to access unknown entries of the interpolation table.

If that table were stored at a single task, excessive communication would occur.

Observation: the interpolation table is computed only once.

Suggestion: replicating the table at each processor will reduce total communication costs and not severely consume task memory.



 
Step 4: Agglomeration Up to Step 3: Communication Down to Step 5: Mapping tasks to processors

At this point we have defined a data partition and identified many tasks which can execute in parallel along with their communication requirements.

In the agglomeration step, we group many small tasks into larger tasks in such a way that we:

Alternatively, we could randomly divide tasks among the processors. then execute.

How much should we agglomerate? Assume we start with T tasks that will run on P processors.


Step 4: Agglomeration:
How To Agglomerate

Whether we are creating fully- or semi-agglomerated tasks, the basic guidelines are the same (and an exercise in satisfying conflicting demands).

It's often useful to know the communication performance capabilities of your target machine while agglomerating tasks: SP2 Communication Performance Summary.

  1. Tasks that can run concurrently should be placed in different groups.

  2. Tasks that require the same input data should be grouped together.

  3. Tasks that communicate with each other should be grouped together.

  4. Pairs of tasks that frequently communicate should be grouped together (to increase message granularity).

  5. Surface-to-Volume issues: for problems that only communicate with local neighbors, we can simultaneously increases the message granularity, decrease the total communication requirements, and increase the computation granularity by agglomerating neighbors.


Step 4: Agglomeration:
Special Case

Special Case: agglomerating T tasks with varying run times to P tasks. We must employ a static load balancing scheme to ensure processors have equal work load.


Step 4: Agglomeration Checklist

  1. Has agglomeration reduced total communication costs by grouping tightly coupled tasks together?

  2. Has agglomeration yielded tasks with similar computation and communication costs? If we have created just one task per processor, then these tasks should have nearly identical costs.

  3. Does the number of tasks still scale with problem size?

  4. Has agglomeration preserved opportunities for concurrent execution?

  5. If you expect to use a dynamic task scheduler, does your agglomeration have an order of magnitude more tasks than processors?


Step 4: Agglomeration Example
2-D Jacobi Finite Difference Method

We examine the 2-D Jacobi finite difference method again as an example for task agglomeration. Recall that each element in the 2-d grid is updated each iteration as a function of its 4 neighbors. By grouping together grid elements in 2-D blocks, we can reduce total communication, increase communication granularity and increase computational granularity.


Step 4: Agglomeration Example
Mandelbrot Image Calculation
 

Recall that in Mandelbrot image calculation, a 2-d image is generated in which each pixel's value is calculated solely as a function of its coordinates. Example image

In Step 2 we suggested that partitioning by pixels would be most aggressive.

We now consider the agglomeration phase. Inspection of the code reveals that the computation time is not constant for each pixel. In fact, different pixel colors in the output image correspond to different amounts of computation.

Because computation is widely varying at each pixel, a potential load-balancing problem arises.

Agglomeration choices:



 
Step 5: Mapping tasks to Processors Up to Step 4: Agglomeration Down to Case Study: 2-D FFT

At this point we should have our program partitioned into reasonable sized tasks, but not yet mapped to processors.

In the case of static problems where we fully agglomerated T smaller tasks into P larger tasks, the mapping task is straightforward; one task per processor; SP2 tip.

In the tougher problems, the number of tasks is unknown; some form of dynamic scheduling and dynamic load balancing must be used.


Step 5: Mapping: Task-Scheduling Algorithms

Task-scheduling algorithms, where tasks are dynamically mapped to processors as computation proceeds, work best when each task has small communication requirements:


Step 5: Mapping Checklist

  1. If using a centralized load-balancing scheme, have you verified that the manager will not become a bottleneck?

  2. If using probabilistic or cyclic methods, do you have a large enough number of tasks to ensure reasonable load balance?

  3. If using a dynamic load-balancing scheme, have you evaluated the relative costs of different strategies (ie. probabilistic mappings)?


Step 5: Mapping Examples
Recursive Bisection Example


Step 5: Mapping Examples
Mandelbrot Image Calculation

Discussion

 
Case Study: 2-D FFT Up to Step 5: Mapping tasks to processors Down to Case Study: 2-D CFD
Overview

The problem is described in a serial context, then the above steps are applied to yield a parallel program.

In this problem a series of one-dimensional FFT's are applied to a two-dimensional image.

The input and output of the algorithm is a 2-D image. A brief pseudo-code description of the algorithm is:

   do i = 1, IMAGE_HEIGHT
      row(tmp_image, i) = 1-d_fft(row(input_image, i))
   end do

   do i = 1, IMAGE_WIDTH
      column(output_image, i) = 1-d_fft(column(tmp_image, i))
   end do
The only important data structures are the input, tmp, and output images. Each image is the same size; typically anywhere from 128 by 128 to 8192 by 8192 or larger.


2-D FFT Parallelization

Identify the Computation Hotspots: intuitive and visual inspection of the code reveals that the bulk of the run time is spent performing the N row FFT's and the N column FFT's. Running the program with prof confirms this.

Partitioning: for 2-D FFT, partitioning is the most important step and there are several choices:

Communication: the communication step was useful for eliminating many bad partitions. For the final partition, communication is required between the 1-D FFT phases, but not during the 1-D FFT's.

Agglomeration/Mapping: This step, and the mapping step, are simple for this example.

Source code for serial and parallel versions in C:

Source code for serial and parallel versions in Fortran:



 
Case Study: 2-D CFD Up to Case Study: 2-D FFT Down to References and More Information
Overview

The problem is described in a serial context, then the above steps are applied to yield a parallel program.

This case study is a computational fluid dynamics example in two dimensions. In this example, we'll be simulating airflow over a wing. More specifically, this code calculates the solution of a compressible, inviscid, non-conducting flow over a body in two dimensions.

The code solves the 2D, compressible Euler equations on unstructured triangular grids:

The Euler equations, where U is the vector of conserved variables. A and B are the Jacobian matrices defined in the program.

The algorithm approximately solves the Euler equations over the domain by solving the discretized equations on the unstructured grid, or mesh. The values for density, x-momentum, y-momentum, and total energy are stored at the vertices of the mesh.

Basically, the field values at a given timestep are checked to see if they satisfy the discretized Euler equations. If not, then they are adjusted and checked again. This iterative process continues until the field values satisfy the discretized Euler equations to a specified tolerance.


2-D CFD: Computational Elements

The wing is divided into a triangular mesh. Each triangle is called a cell. Each cell has 3 vertices or nodes. Click here for an illustration of the 2-D mesh.

The "residual" is calculated at each cell using field values held at the vertices of that cell.

This fluctuation is then split-up and distributed to the vertices of the cell.

The sum of the fluctuation contributions from all the cells that surround a vertex make up the nodal fluctuation for that vertex.

Once the code has looped through all of the cells, the nodal residuals will have been calculated for each vertex, and the vertex values can be updated for the next timestep.

2-D CFD: Useful Information


CFD Parallelization

Identify the Computation Hotspots: visual inspection of the code shows that the main loop in which the values (density, x-dir momentum, y-dir momentum and total energy) are calculated for each cell are the computationally expensive parts of the algorithm. File I/O, pre-, and post-processing steps will remain serial.

Partitioning: The cell is a natural element for data-partitioning:

We partition the unstructured 2-D grid into distinct cells.

Communication: each cell only needs data from its vertices. Each vertex only needs data from vertices that it is directly connected to. Therefore, the 2-D grid itself already identifies communication links.

Agglomeration: at this point we have a separate task for each cell and far more cells than processors. We must agglomerate to larger tasks.

Mapping: the mapping step is essentially done at this point; we fully agglomerated the N cell/vertices into P separate tasks.

Source code (compressed tar format) for serial and parallel versions in C:



 
References and More Information Up to Case Study: 2-D CFD