SP Parallel Programming Workshop
p a r a l l e l     p r o g r a m m i n g     i n t r o d u c t i o n



  Table of Contents
  1. Overview
    1. What is Parallelism?
    2. Sequential Programming
    3. The Need for Faster Machines
    4. Parallel Computing
    5. Parallel Programming Overview
  2. Architecture Taxonomy
    1. SISD Model
    2. SIMD Model
    3. MIMD Model
  3. Memory Architectures
    1. Shared Memory
    2. Distributed Memory
    3. Memory-Processor Arrangements
  4. Processor Communication
    1. Communications Network on the IBM SP
  5. Parallel Programming Paradigms
    1. Various Methods
    2. Message Passing
    3. Data Parallel
    4. Implementations
  6. Steps for Creating a Parallel Program
    1. Decomposing the Program
    2. Communication
      1. Point to Point
      2. One to All Broadcast
      3. All to All Broadcast
      4. One to All Personalized
      5. All to all Personalized
      6. Shifts
      7. Collective Computation
  7. Design and Performance Considerations
    1. Amdahl's Law
    2. Load Balancing
    3. Granularity
    4. Data Dependency
    5. Deadlock
    6. Communication Patterns and Bandwidth
    7. I/O Patterns
    8. Debugging
    9. Performance Monitoring and Analysis
  8. Parallel Examples
    1. Essentials of Loop Parallelism
    2. Calculating PI
      1. Serial Problem Description
      2. Parallel Solution
    3. Calculating Array Elements
      1. Serial Problem Description
      2. Parallel Solution
      3. Pool of Tasks
      4. Load Balancing and Granularity
    4. Simple Heat Equation
      1. Serial Problem Description
      2. Parallel Solution
      3. Overlapping Communication and Computation
    5. Application Case Study
  9. References and More Information


 
Overview Up to Table of Contents Down to Sequential Programming
What is Parallelism?

A strategy for performing large, complex tasks faster.

A large task can either be performed serially, one step following another, or can be decomposed into smaller tasks to be performed simultaneously, i.e., in parallel.

Parallelism is done by:

Parallel problem solving is common. Examples: building construction; operating a large organization; automobile manufacturing plant

The automobile analogy.

 
Overview Up to What is Parallelism The Need for Faster Machines
Sequential Programming

Traditionally, programs have been written for serial computers:



 
Overview Up to Sequential Programming Down to Parallel Computing?
The Need for Faster Machines

You might think that one instruction executed in 9 billionths of a second would be fast enough. You'd be wrong.

There are several classes of problems that require faster processing:



 
Overview Up to Need for Faster Machines? Down to Parallel Programming Overview
Parallel Computing

Traditional Supercomputers

Technology

Benefits

Limitations

Parallel Supercomputers

Technology

Benefits

Limitations

Parallel computing requires:



 
Overview Up to Parallel Computing Down to Architecture Taxonomy
Parallel Programming



 
Architecture Taxonomy Up to Parallel Programming Overview Down to SISD Model



 
Architecture Taxonomy Up to Architecture Taxonomy Down to SIMD Model
SISD Model:     Single Instruction, Single Data Stream



 
Architecture Taxonomy Up to SISD Model Down to MIMD Model
SIMD Model:     Single Instruction, Multiple Data Stream


Vector SIMD


Parallel SIMD



 
Architecture Taxonomy Up to SIMD Model Down to Memory Architectures
MIMD Model:     Multiple Instructions, Multiple Data


 
Memory Architectures Up to MIMD Model Down to Shared Memory



 
Memory Architectures Up to Memory Architectures Down to Distributed Memory
Shared Memory


 
Memory Architectures Up to Shared Memory Down to Memory-Processor Arrangements
Distributed Memory


 
Memory Architectures Up to Distributed Memory Down to Communications Network on the IBM SP
Memory / Processor Arrangements


 
Processor Communications Up to Memory-Processor Arrangements Down to Parallel Programming Paradigms
Communications Network on the IBM SP


 
Parallel Programming Paradigms Up to Communications Network on the IBM SP Down to Message Passing
Various Methods

There are many methods of programming parallel computers. Two of the most common are message passing and data parallel.

Note: these models are machine/architecture independent, any of the models can be implemented on any hardware given appropriate operating system support.
An effective implementation is one which closely matches its target hardware and provides the user ease in programming.

 
Parallel Programming Paradigms Up to Various Methods Down to Data Parallel
Message Passing

The message passing model is defined as:

Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.

 
Parallel Programming Paradigms Up to Message Passing Down to Implementations
Data Parallel

The data parallel model is defined as:

Programming with data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.

The compiler converts the program into standard code and calls to a message passing library to distribute the data to all the processes.

 
 
Parallel Programming Paradigms Up to Data Parallel Down to F90 / High Performance Fortran (HPF)
Implementations

Message Passing: Message Passing Interface (MPI)



 
Parallel Programming Paradigms Up to Message Passing Down to Steps for Creating a Parallel Program
Implementations

F90 / High Performance Fortran (HPF)



 
Steps for Creating a Parallel Program Up to F90 / High Performance Fortran (HPF) Down to Decomposing the Program

  1. If you are starting with an existing serial program, debug the serial code completely

  2. Identify the parts of the program that can be executed concurrently:

  3. Decompose the program:

  4. Code development

  5. Compile, Test, Debug

  6. Optimization


 
Steps for Creating a Parallel Program Up to Steps for Creating a Parallel Program Down to Communication
Decomposing the Program

There are three methods for decomposing a problem into smaller tasks to be performed in parallel: Functional Decomposition, Domain Decomposition, or a combination of both



 
Steps for Creating a Parallel Program Up to Decomposing the Program Down to Design and Performance Considerations
Communication

Understanding the interprocessor communications of your program is essential.

The types of communications for message passing and data parallel are exactly the same. In fact most data parallel compilers simply use one of the standard message passing libraries to achieve data movement.

Communications on distributed memory computers:  


Point to Point

The most basic method of communication between two processors is the point to point message. The originating processor "sends" the message to the destination processor. The destination processor then "receives" the message.

The message commonly includes the information, the length of the message, the destination address and possibly a tag.

Typical message passing libraries subdivide the basic sends and receives into two types:

 


One to All Broadcast

A node may have information which all the others require. A broadcast is a message sent to many other nodes.

A One to All broadcast occurs when one processor sends the same information to many other nodes.  


All to All Broadcast

With an All to All broadcast each processor sends its unique information to all the other processors.  


One to All Personalized

Personalized communication send a unique message to each processor.

In One to All personalized communication one processor sends a unique message to every other processor.  


All to All Personalized

In All to All Personalized communication each processor sends a unique message to all other processors.  


Shifts

Shifts are permutations of information. Information is exchanged in one logical direction or the other. Each processor exchanges the same amount of information with its neighbor processor.

There are two types of shifts:

 


Collective Computation

In collective computation (reductions), one member of the group collects data from the other members. Commonly a mathematical operation like a min, max, add, multiple etc. is performed.

 
 
Design and Performance Considerations Up to Communication Down to Load Balancing
Amdahl's Law



 
Design and Performance Considerations Up to Amdahls Law Down to Granularity
Load Balancing



 
Design and Performance Considerations Up to Load Balancing Down to Data Dependency
Granularity



 
Design and Performance Considerations Up to Granularity Down to Deadlock
Data Dependency



 
Design and Performance Considerations Up to Data Dependency Down to Communication Patterns and Bandwidth
Deadlock



 
Design and Performance Considerations Up to Deadlock Down to I/O Patterns
Communication Patterns and Bandwidth



 
Design and Performance Considerations Up to Communication Patterns and Bandwidth Down to Debugging
I/O Patterns



 
Design and Performance Considerations Up to I/O Patterns Down to Performance Monitoring and Analysis
Debugging



 
Design and Performance Considerations Up to Debugging Down to Parallel Examples
Performance Monitoring and Analysis



 
 
Parallel Examples Up to Performance Monitoring and Analysis Down to Calculating PI
Essentials of Loop Parallelism

Some examples will help illustrate the methods of parallel programming and the performance issues involved.

Each of the problems has a main loop. Loops are a main target for parallelizing and vectorizing code. A program often spends much of its time in loops. When it can be done, parallelizing these sections of code can have dramatic benefits.

A step-wise refinement procedure for developing the parallel algorithms will be employed. An initial solution for each problem will be presented and improved by considering performance issues.

Pseudo-code will be used to describe the solutions. The solutions will address the following issues:



 
Parallel Examples Up to Essentials of Loop Parallelism Down to Calculating Array Elements
Calculating PI

Serial Problem Description

 


Calculating PI
Parallel Solution: Message Passing



 
Parallel Examples Up to Calculating PI Down to Simple Heat Equation
Calculating Array Elements

Serial Problem Description

 


Calculating Array Elements
Parallel Solution: Message Passing


Calculating Array Elements
Parallel Solution: Pool of Tasks


Calculating Array Elements
Load Balancing and Granularity



 
Parallel Examples Up to Calculating Array Elements Down to Application Case Study
Simple Heat Equation

Serial Problem Description

 


Simple Heat Equation
Parallel Solutions


Simple Heat Equation
Parallel Solution: Message Passing

 


Simple Heat Equation
Overlapping Communication and Computation



 
Parallel Examples Up to Simple Heat Equation Down to References and More Information
Application Case Study

A detailed case study of a Numeric Weather Prediction Model, developed by Glenn Wightwick of IBM Australia Science & Technology is available HERE.

 
References and More Information Up to Application Case Study