|
SP Parallel Programming Workshop
|
|
p a r a l l e l p r o g r a m m i n g
i n t r o d u c t i o n
|
Table of Contents
-
Overview
-
What is Parallelism?
-
Sequential Programming
-
The Need for Faster Machines
-
Parallel Computing
-
Parallel Programming Overview
-
Architecture Taxonomy
-
SISD Model
-
SIMD Model
-
MIMD Model
-
Memory Architectures
-
Shared Memory
-
Distributed Memory
-
Memory-Processor Arrangements
-
Processor Communication
-
Communications Network on the IBM SP
-
Parallel Programming Paradigms
-
Various Methods
-
Message Passing
-
Data Parallel
- Implementations
-
Steps for Creating a Parallel Program
-
Decomposing the Program
-
Communication
-
Point to Point
-
One to All Broadcast
-
All to All Broadcast
-
One to All Personalized
-
All to all Personalized
-
Shifts
-
Collective Computation
-
Design and Performance Considerations
-
Amdahl's Law
-
Load Balancing
-
Granularity
-
Data Dependency
-
Deadlock
-
Communication Patterns and Bandwidth
-
I/O Patterns
-
Debugging
-
Performance Monitoring and Analysis
-
Parallel Examples
-
Essentials of Loop Parallelism
-
Calculating PI
-
Serial Problem Description
-
Parallel Solution
-
Calculating Array Elements
-
Serial Problem Description
-
Parallel Solution
-
Pool of Tasks
-
Load Balancing and Granularity
-
Simple Heat Equation
-
Serial Problem Description
-
Parallel Solution
-
Overlapping Communication and Computation
- Application Case Study
-
References and More Information
|
Overview
|
|
|
What is Parallelism?
|
A strategy for performing large, complex tasks faster.
A large task can either be performed serially, one step following another,
or can be decomposed into smaller tasks to be performed
simultaneously, i.e., in parallel.
Parallelism is done by:
- Breaking up the task into smaller tasks
- Assigning the smaller tasks to multiple workers to work on simultaneously
- Coordinating the workers
Parallel problem solving is common. Examples: building construction;
operating a large organization; automobile manufacturing plant
The automobile analogy.
|
Overview
|
|
|
Sequential Programming
|
Traditionally, programs have been written for serial computers:
- One instruction executed at a time
- Using one processor
- Processing speed dependent on how fast data can move through hardware
- Speed of Light = 30 cm/nanosecond
- Limits of Copper Wire = 9 cm/nanosecond
- Fastest machines execute approximately 1 instruction in 9-12 billionths
of a second
|
Overview
|
|
|
The Need for Faster Machines
|
You might think that one instruction executed in 9 billionths of a
second would be fast enough. You'd be wrong.
There are several classes of problems that require faster processing:
- Simulation and Modeling problems:
- Based on successive approximations
- More calculations, more precise
- Problems dependent on computations / manipulations of large amounts
of data
- Image and Signal Processing
- Entertainment (Image Rendering)
- Database and Data Mining
- Seismic
- Grand Challenge Problems:
- Climate Modeling
- Fluid Turbulence
- Pollution Dispersion
- Human Genome
- Ocean Circulation
- Quantum Chromodynamics
- Semiconductor Modeling
- Superconductor Modeling
- Combustion Systems
- Vision & Cognition
|
Overview
|
|
|
Parallel Computing
Traditional Supercomputers
|
Technology
- Single processors were created to be as fast as possible.
- Peak performance was achieved with good memory bandwidth.
Benefits
- Supports sequential programming (Which many people understand)
- 30+ years of compiler and tool development
- I/O is relatively simple
Limitations
- Single high performance processors are extremely expensive
- Significant cooling requirements
- Single processor performance is reaching its asymptotic limit
Parallel Supercomputers
Technology
- Applying many smaller cost efficient processors to work on a part of
the same task
- Capitalizing on work done in the microprocessor and networking markets
Benefits
- Ability to achieve performance and work on problems impossible with
traditional computers.
- Exploit "off the shelf" processors, memory, disks and tape systems.
- Ability to scale to problem.
- Ability to quickly integrate new elements into systems thus capitalizing
on improvements made by other markets.
- Commonly much cheaper.
Limitations
- New technology. Programmers need to learn parallel programming approaches.
- Standard sequential codes will not "just run".
- Compilers and tools are often not mature.
- I/O is not as well understood yet.
Parallel computing requires:
- Multiple processors
(The workers)
- Network
(Link between workers)
- Environment to create and manage parallel processing
- Operating System
(Administrator of the system that knows how to
handle multiple workers)
- Parallel Programming Paradigm
- Message Passing
- Data Parallel
- Fortran 90 / High Performance Fortran
- Others
- A parallel algorithm and a parallel program
(The decomposition of the problem into pieces that
multiple workers can perform)
|
Overview
|
|
|
Parallel Programming
|
- Parallel programming involves:
- Decomposing an algorithm or data into parts
- Distributing the parts as tasks which are worked on by multiple
processors simultaneously
- Coordinating work and communications of those processors
- Parallel programming considerations:
- Type of parallel architecture being used
- Type of processor communications used
|
Architecture Taxonomy
|
|
- All parallel computers use multiple processors
- There are several different methods used to classify computers
- No single taxonomy fits all designs
- Not used as much today, but still see the categories
- Flynn's taxonomy uses the relationship of program
instructions to program data. The four categories are:
|
Architecture Taxonomy
|
|
|
SISD Model: Single Instruction, Single Data Stream
|
- Not a parallel computer
- Conventional serial, scalar von Neumann computer
- One instruction stream
- A single instruction is issued each clock cycle
- Each instruction operates on a single (scalar) data element
- Limited by the number of instructions that can be issued in a given unit
of time
- Performance frequently measured in MIPS (million of instructions per
second) or clock frequency MHz (Megahertz)
- Most non-supercomputers
Automobile analogy
|
Architecture Taxonomy
|
|
|
SIMD Model: Single Instruction, Multiple Data Stream
|
- Also von Neumann architectures but more powerful instructions
- Each instruction may operate on more than one data element
- Usually intermediate host executes program logic and broadcasts
instructions to other processors
- Synchronous (lockstep)
- Rating how fast these machines can issue instructions
is not a good measure of their performance
- Performance is measured in MFLOPS (millions of floating point
operations per second)
- Two major types:
- Vector SIMD
- Parallel SIMD
Automobile analogy
Vector SIMD
- Single instruction results in multiple operands being updated
- Scalar processing operates on single data elements. Vector processing
operates on whole vectors (groups) of data at a time.
- Examples:
- Cray 1
- NEC SX-2
- Fujitsu VP
- Hitachi S820
Single processor of:
- Cray C 90
- Cray2
- NEC SX-3
- Fujitsu VP 2000
- Convex C-2
Parallel SIMD
- Processor arrays - single instruction is issued and all processors
execute the same instruction, operating on different sets of data.
- Processors run in a synchronous, lockstep fashion
- Advantages
- Disadvantages
- Decisions within DO loops can result in poor execution by requiring
all processes to perform the operation controlled by decision
whether results are used or not
- Examples:
- Connection Machine CM-2
- Maspar MP-1, MP-2
|
Architecture Taxonomy
|
|
|
MIMD Model: Multiple Instructions, Multiple Data
|
|
Memory Architectures
|
|
- The way processors communicate is dependent upon memory architecture,
which, in turn, will affect how you write your parallel program
- The primary memory architectures are:
- Shared Memory
- Distributed Memory
|
Memory Architectures
|
|
|
Shared Memory
|
- Multiple processors operate independently but share the same memory
resources
- Only one processor can access the shared memory location at a time
- Synchronization achieved by controlling tasks' reading from and
writing to the shared memory
- Advantages
- Easy for user to use efficiently
- Data sharing among tasks is fast (speed of memory access)
- Disadvantages
- Memory is bandwidth limited. Increase of processors without
increase of bandwidth can cause severe bottlenecks
- User is responsible for specifying synchronization, e.g., locks
- Examples:
- Cray Y-MP
- Convex C-2
- Cray C-90
|
Memory Architectures
|
|
|
Distributed Memory
|
- Multiple processors operate independently but
each has its own private memory
- Data is shared across a communications network using message passing
- User responsible for synchronization using message passing
- Advantages
- Memory scalable to number of processors. Increase number of
processors, size of memory and bandwidth increases.
- Each processor can rapidly access its own memory without interference
-
Disadvantages
- Difficult to map existing data structures to this memory organization
- User responsible for sending and receiving data among processors
- To minimize overhead and latency, data should be blocked up in
large chunks and shipped before receiving node needs it
- Examples:
- nCUBE Hypercube
- Intel Hypercube
- TMC CM-5
- IBM SP1, SP2
- Intel Paragon
|
Memory Architectures
|
|
|
Memory / Processor Arrangements
|
- Distributed Memory
- MPP - Massively Parallel Processor
- Shared Memory
- SMP - Symmetric Multiprocessor
- Identical processors
- Equal access to memory
- Sometimes called UMA - Uniform Memory Access
- or CC-UMA - Cache Coherent UMA
- Cache coherent means if one processor updates a location in
shared memory, all the other processors know about the update
- NUMA - Non-Uniform Memory Access
- Sometimes called CC-NUMA - Cache Coherent NUMA
- Often made by linking two or more SMPs
- One SMP can directly access memory of another SMP
- Not all processors have equal access time to all memories
- Memory access across link is slower
- Combinations
- Multiple SMPs connected by a network
- Processors within an SMP communicate via memory
- Requires message passing between SMPs
- One SMP can't directly access the memory of another SMP
- Multiple distributed memory processors connected to a larger shared
memory
- Small fast memory can be used for supplying data to processors
and large slower memory can be used for a backfill to the
smaller memories
- Similar to register <= cache memory <= main memory hierarchy
- Transfer from local memory to shared memory would be transparent
to the user
- Probable design of the future with several processors and
their local memory surrounding a larger shared memory on a
single board
- Comparison
Mulitprocessor Architectures
| CC-UMA | CC-NUMA | MPP |
| | | |
| Architecture
| mostly RISC | mostly RISC | RISC |
| | | | rich instructions |
| | | |
| Examples
| SMPs | SGI Origin | Cray T3E |
| | Sun VExx | Sequent | Maspar |
| | DEC | HP Exemplar | IBM SP2 |
| | SGI Challenge | DEC | |
| | | |
| Communications
| MPI | MPI | MPI |
| | shmem | shmem | |
| | | |
| Scalabiliity
| to 10s of | to 100s of | to 1000s of |
| | processors | processors | processors |
| | | |
| Draw Backs
| limited memory | new architecture | sys admin |
| | bandwidth | | programming |
| | | |
| Software
| many 1000s ISVs | many 1000s ISVs | 10s ISVs |
| Availablilty
| | point-to-point | hard to develop |
| | | communication | and maintain |
| | | | "home grown" |
| | | |
|
Processor Communications
|
|
|
Communications Network on the IBM SP
|
- In order to coordinate tasks of multiple nodes working on the same
problem, some form of inter-processor communications is required to:
- Convey information and data between processors
- Synchronize node activities
- Distributed memory
- SP Node Connectivity
- Nodes are connected to each other by a native high performance switch
- Nodes are connected to the network (and, hence, to each other)
via the ethernet
- Type of Communication
- ip - Internet Protocol (TCP/IP)
- runs over ethernet or
- runs over the switch (depending on environment setup)
- us - User Space
|
Parallel Programming Paradigms
|
|
|
Various Methods
|
There are many methods of programming parallel computers. Two of the most
common are message passing and data parallel.
- Message Passing - the user makes calls to libraries to explicitly
share information between processors.
- Data Parallel - data partitioning determines parallelism
- Shared Memory - multiple processes sharing common memory space
- Remote Memory Operation - set of processes in which a process can
access the memory of another process without its participation
- Threads - a single process having multiple (concurrent)
execution paths
- Combined Models - composed of two or more of the above.
Note: these models are machine/architecture
independent, any of the models can be implemented on
any hardware given appropriate operating system
support.
An effective implementation is one which closely matches its
target hardware and provides the user ease in programming.
|
Parallel Programming Paradigms
|
|
|
Message Passing
|
The message passing model is defined as:
- set of processes using only local memory
- processes communicate by sending and receiving
messages
- data transfer requires
cooperative operations to be performed by each
process (a send operation must have a matching
receive)
Programming with message passing is done by linking with and
making calls to libraries which manage the data exchange
between processors. Message passing libraries are available for most
modern programming languages.
|
Parallel Programming Paradigms
|
|
|
Data Parallel
|
The data parallel model is defined as:
- Each process works on a different part of the same data structure
- Commonly a Single Program Multiple Data (SPMD) approach
- Data is distributed across processors
- All message passing is done invisibly to the programmer
- Commonly built "on top of" one of the common message passing libraries
Programming with data parallel model is accomplished by
writing a program with data parallel constructs and compiling it with
a data parallel compiler.
The compiler converts the program into standard
code and calls to a message passing library to distribute the data to all
the processes.
|
Parallel Programming Paradigms
|
|
|
Implementations
Message Passing: Message Passing Interface (MPI)
|
- Message Passing Interface often called MPI.
- A standard portable message-passing library definition developed in
1993 by a group of parallel computer vendors, software writers, and
application scientists.
- Available to both Fortran and C programs.
- Available on a wide variety of parallel machines.
- Target platform is a distributed memory system such as the SP.
- All inter-task communication is by message passing.
- All parallelism is explicit: the programmer is responsible for
parallelism the program and implementing the MPI constructs.
- Programming model is SPMD (Single Program Multiple Data)
|
Parallel Programming Paradigms
|
|
|
Implementations
F90 / High Performance Fortran (HPF)
|
- Fortran 90 (F90) - (ISO / ANSI standard extensions to Fortran 77).
- High Performance Fortran (HPF) - extensions to F90 to support data parallel
programming.
- Compiler directives allow programmer specification of data distribution
and alignment.
- New compiler constructs and intrinsics allow the programmer to do
computations and manipulations on data with different distributions.
|
Steps for Creating a Parallel Program
|
|
- If you are starting with an existing serial program, debug the serial code
completely
- Identify the parts of the program that can be executed concurrently:
- Requires a thorough understanding of the algorithm
- Exploit any inherent parallelism which may exist.
- May require restructuring of the program and/or algorithm.
May require an entirely new algorithm.
- Decompose the program:
- Functional Parallelism
- Data Parallelism
- Combination of both
- Code development
- Code may be influenced/determined by machine architecture
- Choose a programming paradigm
- Determine communication
- Add code to accomplish task control and communications
- Compile, Test, Debug
- Optimization
- Measure Performance
- Locate Problem Areas
- Improve them
|
Steps for Creating a Parallel Program
|
|
|
Decomposing the Program
|
There are three methods for decomposing a problem into smaller tasks to be
performed in parallel: Functional Decomposition, Domain Decomposition,
or a combination of both
-
Functional Decomposition (Functional Parallelism)
- Decomposing the problem into different tasks which can be
distributed to multiple processors for simultaneous execution
- Good to use when there is not static structure or fixed determination
of number of calculations to be performed
-
Domain Decomposition (Data Parallelism)
- Partitioning the problem's data domain and distributing portions to
multiple processors for simultaneous execution
- Good to use for problems where:
- data is static (factoring and solving large matrix or finite
difference calculations)
- dynamic data structure tied to single entity where entity can be
subsetted (large multi-body problems)
- domain is fixed but computation within various regions of the
domain is dynamic (fluid vortices models)
-
There are many ways to decompose data into partitions to be distributed:
|
Steps for Creating a Parallel Program
|
|
|
Communication
|
Understanding the interprocessor communications of your program is essential.
- Message Passing communication is programed explicitly. The programmer
must understand and code the communication.
- Data Parallel compilers and run-time systems do all
communications behind the scenes. The programmer need not understand
the underlying communications. On the other hand to get good
performance from your code you should write your algorithm with the
best communication possible.
The types of communications for message passing and data parallel are exactly
the same. In fact most data parallel compilers simply use one of the
standard message passing libraries to achieve data movement.
Communications on distributed memory computers:
- Point to Point
- One to All Broadcast
- All to All Broadcast
- One to All Personalized
- All to All Personalized
- Shifts
- Collective Computation
Point to Point
The most basic method of communication between two processors is the
point to point message. The originating processor "sends" the message
to the destination processor. The destination processor then "receives"
the message.
The message commonly includes the information, the length of the message, the
destination address and possibly a tag.
Typical message passing libraries subdivide the basic sends
and receives into two types:
- blocking - processing waits until message is transmitted
- nonblocking - processing continues even if message hasn't been
transmitted yet
One to All Broadcast
A node may have information which all the others require.
A broadcast
is a message sent to many other nodes.
A One to All broadcast occurs when one processor sends the same information
to many other nodes.
All to All Broadcast
With an All to All
broadcast each
processor sends its unique information to all the other processors.
One to All Personalized
Personalized communication send a unique message to each processor.
In One to All personalized communication one processor sends a unique
message to every other processor.
All to All Personalized
In All to All Personalized communication each processor sends a unique message
to all other processors.
Shifts
Shifts are permutations of information. Information is exchanged in one
logical direction or the other. Each processor exchanges the same amount of
information with its neighbor processor.
There are two types of shifts:
- Circular - Each processor exchanges information with its logical neighbor.
When there is no longer a neighbor due to an edge of data the shift
"wraps around" and takes the information from the opposite edge.
- End Off Shift - When an edge occurs, the processor is padded with zero or
a user defined value.
Collective Computation
In collective computation (reductions), one member of the group
collects data from the other members. Commonly a mathematical
operation like a min, max, add, multiple etc. is performed.
|
Design and Performance Considerations
|
|
|
Amdahl's Law
|
- Amdahl's Law
states that potential program
speedup is defined by the fraction of code (P) which can be parallelized:
1
speedup = --------
1 - P
- If none of the code can be parallelized, f = 0 and the speedup = 1 (no
speedup). If all of the code is parallelized, f = 1 and the speedup is
infinite (in theory).
If 50% of the code can be parallelized, maximum speedup = 2, meaning
the code will run twice as fast.
- Introducing the number of processors performing the parallel fraction of
work, the relationship can be modeled by:
1
speedup = ------------
P + S
---
N
where P = parallel fraction, N = number of processors and S = serial
fraction.
- It soon becomes obvious that there are limits to the scalability of
parallelism. For example, at P = .50, .90 and .99 (50%, 90% and 99% of
the code is parallelizable):
speedup
--------------------------------
N P = .50 P = .90 P = .99
----- ------- ------- -------
10 1.82 5.26 9.17
100 1.98 9.17 50.25
1000 1.99 9.91 90.99
10000 1.99 9.91 99.02
- However, certain problems demonstrate increased performance by increasing
the problem size. For example:
2D Grid Calculations 85 seconds 85%
Serial fraction 15 seconds 15%
We can increase the problem size by halving both the grid points and
the time step, which is directly proportional to the grid spacing.
This results in four times the number of grid points (factor of two in
each direction) and twice the number of time steps. The timings then
look like:
2D Grid Calculations 680 seconds 97.84%
Serial fraction 15 seconds 2.16%
- Problems which increase the percentage of parallel time with their size
are more "scalable" than problems with a fixed percentage of parallel
time.
|
Design and Performance Considerations
|
|
|
Load Balancing
|
|
Design and Performance Considerations
|
|
|
Granularity
|
- In order to coordinate between different processors working on the same
problem, some form of communication between them is required
- The ratio between computation and communication is known as granularity.
- Fine-grain
parallelism
- All tasks execute a small number of instructions between
communication cycles
- Low computation to communication ratio
- Facilitates load balancing
- Implies high communication overhead and less opportunity for
performance enhancement
- If granularity is too fine it is possible that the overhead
required for communications and synchronization between tasks
takes longer than the computation.
- Coarse-grain
parallelism
- Typified by long computations consisting of large numbers of
instructions between communication synchronization points
- High computation to communication ratio
- Implies more opportunity for performance increase
- Harder to load balance efficiently
- The most efficient granularity is dependent on the algorithm and the
hardware environment in which it runs
- In most cases overhead associated with communications and
synchronization is high relative to execution speed
so it is advantageous to have coarse granularity.
|
Design and Performance Considerations
|
|
|
Data Dependency
|
- A data dependency exists when there is multiple use of the same storage
location
- Importance of dependencies: frequently inhibit parallel execution
- Example 1:
DO 500 J = MYSTART,MYEND
A(J) = A(J-1) * 2.0
500 CONTINUE
- This code has a data dependency.
- Must have computed value for A(J-1) before we can calculate A(J).
- If Task 2 has A(J) and Task 1 has A(J-1), the value of A(J) is
dependent on:
- Example 2:
task 1 task 2
------ ------
X = 2 X = 4
. .
. .
Y = X**2 Y = X**3
The value of Y is dependent on:
- Distributed memory
If and/or when the value of X is communicated between the tasks.
- Shared memory
Which task last stores the value of X.
- How to handle data dependencies?
|
Design and Performance Considerations
|
|
|
Deadlock
|
- Deadlock describes a condition where two or more processes are waiting
for an event or communication from one of the other processes.
- The simplest example is demonstrated by two processes which are both
programmed to read/receive from the other before writing/sending.
- Example
TASK1 TASK2
------------------ ------------------
X = 4 Y = 8
SOURCE = TASK2 SOURCE = TASK1
RECEIVE (SOURCE,Y) RECEIVE (SOURCE,X)
DEST = TASK2 DEST = TASK1
SEND (DEST, X) SEND (DEST, Y)
Z = X + Y Z = X + Y
- One solution is to change the order of the SEND and
RECEIVE in one of the tasks.
TASK1 TASK2
------------------ ------------------
X = 4 Y = 8
SOURCE = TASK2 DEST = TASK1
RECEIVE (SOURCE,Y) SEND (DEST, Y)
DEST = TASK2 SOURCE = TASK1
SEND (DEST, X) RECEIVE (SOURCE,X)
Z = X + Y Z = X + Y
- Another solution is to use NON-BLOCKING message passing.
|
Design and Performance Considerations
|
|
|
Communication Patterns and Bandwidth
|
- For some problems, increasing the number of processors will:
-
Decrease the execution time attributable to computation
-
But also, increase the execution time attributable to communication
- The time required for communication is dependent upon a given system's
communication bandwidth parameters.
- For example, the time (t) required to send W words between any two
processors is:
t = L + W/B
where L = latency and B = hardware bitstream rate in words per second.
Latency can be thought of as the time required to send a zero byte message
- Communication patterns also affect the computation to communication ratio.
For example, gather-scatter communications between a single processor
and N other processors will be impacted more by an increase in latency than
N processors communicating only with nearest neighbors.
|
Design and Performance Considerations
|
|
|
I/O Patterns
|
- I/O operations are generally regarded as inhibitors to parallelism
- Parallel I/O systems are as yet, largely undefined and not available
- In an environment where all processors see the same filespace, write
operations will result in file overwriting
- Read operations will be affected by the fileserver's ability to handle
multiple read requests at the same time
- I/O which must be conducted over the network (non-local) can cause
severe bottlenecks
- Some options:
- Reduce overall I/O as much as possible
- Confine I/O to specific serial portions of the job
- For example, Task 1 could read an input file and then communicate
required data to other tasks. Likewise, Task 1 could perform
write operation after receiving required data from all other tasks.
- Create unique filenames for each tasks' input/output file(s)
- For distributed memory systems with shared filespace, perform I/O in
local, non-shared filespace
- For example, each processor may have /tmp filespace which can used. This is usually much more efficient than performing I/O over the
network to one's home directory.
|
Design and Performance Considerations
|
|
|
Debugging
|
- Debugging parallel programs is significantly more of a challenge than
debugging serial programs
- Parallel debuggers are beginning to become available, but much work
remains to be done
- Use a modular approach to program development
- Pay as close attention to communication details as to computation details
|
Design and Performance Considerations
|
|
|
Performance Monitoring and Analysis
|
- As with debugging, monitoring and analyzing parallel program execution
is significantly more of a challenge than for serial programs
- A number of parallel tools for execution monitoring and program analysis
are available
- Some are quite useful; some are cross-platform also
- Work remains to be done, particularly in the area of scalability.
|
Parallel Examples
|
|
|
Essentials of Loop Parallelism
|
Some examples will help illustrate the methods of parallel
programming and the performance issues involved.
Each of the problems has a main loop. Loops are a main target
for parallelizing and vectorizing code. A program often spends
much of its time in loops. When it can be done, parallelizing
these sections of code can have dramatic benefits.
A step-wise refinement procedure for developing the parallel algorithms
will be employed. An initial solution for each problem will be presented
and improved by considering performance issues.
Pseudo-code will be used to describe the solutions. The solutions will
address the following issues:
- identification of parallelism
- program decomposition
- load balancing (static vs. dynamic)
- task granularity in the case of dynamic load balancing
- communication patterns - overlapping communication and computation
|
Parallel Examples
|
|
|
Calculating PI
Serial Problem Description
|
- Embarrassingly parallel
- Computationally intensive
- Minimal communication
- Minimal I/O
- The value of PI can be calculated in a number of ways, many of which
are easily parallelized
- Consider the following
method of approximating
PI
- Inscribe a circle in a square
- Randomly generate points in the square
- Determine the number of points in the square that are also in the circle
- Let r be the number of points in the circle divided by the number of
points in the square
- PI ~ 4 r
- Note that the more points generated, the better the approximation
- Serial pseudo code for this procedure:
npoints = 10000
circle_count = 0
do j = 1,npoints
generate 2 random numbers between 0 and 1
xcoordinate = random1 ; ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
- Note that most of the time in running this program would be
spent executing the loop
Calculating PI
Parallel Solution: Message Passing
- Parallel strategy: break the loop into portions which can be
executed by the processors.
- For the task of approximating PI:
- each processor executes its portion of the loop a number of
times.
- each processor can do its work without requiring any information
from the other processors (there are no data dependencies). This
situation is known as Embarassingly Parallel.
- uses SPMD model. One process acts as master and collects
the results.
- Message passing pseudo code:
Red highlights changes for Message Passing.
npoints = 10000
circle_count = 0
p = number of processors
num = npoints/p
find out if I am MASTER or WORKER
do j = 1,num
generate 2 random numbers between 0 and 1
xcoordinate = random1 ; ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do
if I am MASTER
receive from WORKER their circle_counts
compute PI (use MASTER and WORKER calculations)
else if I am WORKER
send to MASTER circle_count
endif
|
Parallel Examples
|
|
|
Calculating Array Elements
Serial Problem Description
|
- This example shows calculations on array elements that require
very little communication.
- Elements of 2-dimensional array are calculated.
- The calculation of elements is independent of one another -
leads to
embarassingly
parallel situation.
- The problem should be computationally intensive.
- Serial code could be of the form:
do j = 1,n
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
- The serial program calculates one element at a time in the specified
order.
Calculating Array Elements
Parallel Solution: Message Passing
- Arrays are distributed so that each processor owns a portion of an array.
- Independent calculation of array elements insures no
communication amongst processors is needed.
- Distribution scheme is chosen by other criteria, e.g. unit stride
through arrays.
- Desirable to have unit stride through arrays, then the
choice of a distribution scheme depends on the programming language.
- Fortran: block cyclic distribution
- C: cyclic block distribution
- After the array is distributed, each processor
executes the portion of the loop corresponding to the data it owns.
- Notice only the loop variables are different from the serial
solution.
- For example, with Fortran and a block cyclic distribution:
do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
- Message Passing Solution:
- With Fortran storage scheme, perform block cyclic
distribution of array.
- Implement as SPMD model.
- Master process initializes array, sends info to worker
processes and receives results.
- Worker process receives info, performs its share of
computation and sends results to master.
- Pseudo code solution:
Red highlights changes for Message
Passing.
find out if I am MASTER or WORKER
if I am MASTER
initialize the array
send each WORKER info on part of array it owns
send each WORKER its portion of initial array
receive from each WORKER results
else if I am WORKER
receive from MASTER info on part of array I own
receive from MASTER my portion of initial array
# calculate my portion of array
do j = my first column,my last column
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
send MASTER results
endif
Calculating Array Elements
Parallel Solution: Pool of Tasks
- We've looked at problems that are static load balanced.
- each processor has fixed amount of work to do
- may be significant idle time for faster or more lightly loaded
processors.
- Usually is not a major concern with dedicated usage. i.e.
loadleveler.
- If you have a load balance problem, you can use a "pool of tasks"
scheme.
- Two processes are employed
- Master Process:
- holds pool of tasks for worker processes to do
- sends worker a task when requested
- collects results from workers
- Worker Process: repeatedly does the following
- gets task from master process
- performs computation
- sends results to master
- Worker processes do not know before runtime which portion of array
they will handle or how many tasks they will perform.
- The fastest process will get more tasks to do.
Dynamic load balancing occurs at run time.
- Solution:
- Calculate an array element
- Worker process gets task from master, performs work, sends
results to master, and gets next task
- Pseudo code solution:
Red highlights changes for Message
Passing.
find out if I am MASTER or WORKER
if I am MASTER
do until no more jobs
send to WORKER next job
receive results from WORKER
end do
tell WORKER no more jobs
else if I am WORKER
do until no more jobs
receive from MASTER next job
calculate array element: a(i,j) = fcn(i,j)
send results to MASTER
end do
endif
Calculating Array Elements
Load Balancing and Granularity
- Static load balancing can result in significant idle time for
faster processors.
- Pool of tasks offers a potential solution - the faster processors
do more work.
- In the pool of tasks solution, the workers calculated array elements,
resulting in
- optimal load balancing: all processors complete work at the
same time
- fine granularity: small unit of computation, master and worker
communicate after every element
- fine granularity may cause very high communications cost
- Alternate Parallel Solution:
- give processors more work - columns or rows rather than elements
- more computation and less communication results in larger
granularity
- reduced communication may improve performance
|
Parallel Examples
|
|
|
Simple Heat Equation
Serial Problem Description
|
- Most problems in parallel computing require communication among
the processors.
- Common problem requires communication with "neighbor" processor.
- The heat equation describes the temperature change over time,
given initial temperature distribution and boundary conditions.
- A finite differencing scheme is employed to solve the
heat equation numerically on a square region.
- The initial
temperature is zero on the boundaries and high in the middle.
- The boundary temperature is held at zero.
- For the fully explicit problem, a time stepping algorithm is used.
The elements of a 2-dimensional array represent the temperature at
points on the square.
- The calculation of an element is dependent on neighbor element
values.
- A serial program would contain code like:
do iy = 2, ny - 1
do ix = 2, nx - 1
u2(ix, iy) =
u1(ix, iy) +
cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) +
cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
end do
end do
Simple Heat Equation
Parallel Solutions
Simple Heat Equation
Parallel Solution: Message Passing
Simple Heat Equation
Overlapping Communication and Computation
|
Parallel Examples
|
|
|
Application Case Study
|
A detailed case study of a Numeric Weather Prediction Model, developed by
Glenn Wightwick of IBM Australia Science & Technology is available
HERE.
|
References and More Information
|
|
- "IBM AIX Parallel Environment Application Development, Release 1.0", IBM
Corporation.
- Carriero, Nicholas and Gelernter, David, "How to Write Parallel Programs -
A First Course". MIT Press, Cambridge, Massachusetts.
- Dowd, Kevin, High Performance Computing", O'Reilly & Associated, Inc.,
Sebastopol, California.
- Hockney, R.W. and Jesshope, C.R., "Parallel Computers 2",Hilger, Bristol
and Philadelphia.
- Ragsdale, Susan, ed., "Parallel Programming", McGraw-Hill, Inc., New York.
- Chandy, K. Mani and Taylor, Stephen, "An Introduction to Parallel
Programming", Jones and Bartlett, Boston
- We gratefully acknowledge John Levesque of Applied Parallel Research for
the use of his "A Parallel Programming Workshop" presentation materials.
- We also gratefully acknowledge the Cornell Theory Center, Ithaca, New
York for the use of portions of their "Parallel Processing" and "Scalable
Processing" presentation materials.
- We also gratefully acknowledge Glenn Ozaki of Silicon Graphics
Incorporated for providing the information for the Multiprocessor
Architecture table.