| SP Parallel Programming II Workshop |
| MPI Performance Topics |
| Review of MPI Message Passing |
|
| Terminology | |
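It is not safe to modify or reuse the application buffer after a non-blocking send call returns until the send is known to have completed. It is the programmer's responsibility to ensure that the application buffer is free for reuse, typically by calling one of the MPI_Wait or MPI_Test routines.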
Non-blocking communications are primarily used to overlap computation with communication to effect performance gains.
| MPI Communication Routines | |
| Routine | Description |
|---|---|
| Blocking Point-to-Point Routines | |
| MPI_Bsend | Buffered send |
| MPI_Recv | Receive |
| MPI_Rsend | Ready send |
| MPI_Send | Standard send |
| MPI_Sendrecv | Combined send and receive |
| MPI_Sendrecv_replace | Combined send and receive using a common buffer |
| MPI_Ssend | Synchronous send |
| Non-Blocking Point-to-Point Routines | |
| MPI_Ibsend | Buffered send |
| MPI_Irecv | Receive |
| MPI_Irsend | Ready send |
| MPI_Isend | Standard send |
| MPI_Issend | Synchronous send |
| Persistent Communications Point-to-Point Routines | |
| MPI_Bsend_init | Creates a persistent buffered send request |
| MPI_Recv_init | Creates a persistent receive request |
| MPI_Rsend_init | Creates a persistent ready send request |
| MPI_Send_init | Creates a persistent standard send request |
| MPI_Ssend_init | Creates a persistent synchronous send request |
| MPI_Start | Activates a persistent request operation |
| MPI_Startall | Activates a collection of persistent request operations |
| Completion / Testing Point-to-Point Routines | |
| MPI_Iprobe | Non-blocking query for a message's arrival |
| MPI_Probe | Blocking query for a message's arrival |
| MPI_Test, MPI_Testall, MPI_Testany, MPI_Testsome | Non-blocking tests for the completion of non-blocking operation request(s) |
| MPI_Wait, MPI_Waitall, MPI_Waitany, MPI_Waitsome | Blocking waits for the completion of non-blocking operation request(s) |
| Collective Communication Routines | |
| MPI_Allgather | Gathers individual messages from each process in the communicator and distributes the resulting message to each process. |
| MPI_Allgatherv | Same as MPI_Allgather but allows for messages to be of different sizes and displacements. |
| MPI_Allreduce | Performs the specified reduction operation across all tasks in the communicator and then distributes the result to all tasks. |
| MPI_Alltoall | Sends a distinct message from each process to every process. |
| MPI_Alltoallv | Same as MPI_Alltoall but allows for messages to be of different sizes and displacements. |
| MPI_Barrier | Creates a barrier synchronization in the communicator. |
| MPI_Bcast | Broadcasts a message from one process to all other processes in the communicator. |
| MPI_Gather | Gathers distinct messages from each task in the communicator to a single destination task. |
| MPI_Gatherv | Same as MPI_Gather but allows for messages to be of different sizes and displacements. |
| MPI_Reduce | Performs a reduction operation across all tasks in the communicator and places the result in a single task. |
| MPI_Reduce_scatter | First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is functionally equivalent to an MPI_Reduce followed by an MPI_Scatterv operation. |
| MPI_Scan | Performs a parallel prefix reduction on data distributed across the communicator. |
| MPI_Scatter | Distributes distinct messages from a single source task to each task in the group. |
| MPI_Scatterv | Same as MPI_Scatter but allows for messages to be of different sizes and displacements. |
| Factors Affecting MPI Performance |
|
| Message Buffering |
|
Example of an unsafe MPI program.
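| Routine | Purpose |
|---|---|
| MPI_Buffer_attach | Attaches user-allocated buffer space for buffered sends |
| MPI_Buffer_detach | Detaches the user buffer space so that it can be freed |
| MPI_Bsend | Buffered send, blocking |
| MPI_Ibsend | Buffered send, non-blocking |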
For IBM's MPI:
|
| MPI Message Passing Protocols |
|
| Eager Protocol | |
For IBM's MPI:
|
| Rendezvous Protocol | |
| Eager Protocol vs. Rendezvous Protocol | |
| Message Size |
|
Example code for message size timing results
| Point-to-Point Communications |
|
| Persistent Communications |
|
Step 1: Create persistent requests
The appropriate routine is called to set up the buffer location(s) that will be sent or received. The five available routines are:

| Routine | Description |
|---|---|
| MPI_Recv_init | Creates a persistent receive request |
| MPI_Bsend_init | Creates a persistent buffered send request |
| MPI_Rsend_init | Creates a persistent ready send request |
| MPI_Send_init | Creates a persistent standard send request |
| MPI_Ssend_init | Creates a persistent synchronous send request |
Step 2: Start communication transmission
Data transmission is begun by calling either MPI_Start or MPI_Startall.

| Routine | Description |
|---|---|
| MPI_Start | Activates a persistent request operation |
| MPI_Startall | Activates a collection of persistent request operations |
Step 3: Wait for communication completion
Because persistent operations are non-blocking, the appropriate MPI_Wait or MPI_Test routine must be used to ensure their completion.
Step 4: Deallocate persistent request objects
When there is no longer a need for persistent communications, the programmer should explicitly free the persistent request objects by using the MPI_Request_free() routine.
|
|
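/* Step 1: create the persistent requests once, outside the loop */
MPI_Recv_init (&rbuff, n, MPI_CHAR, src, tag, comm, &reqs[0]);
MPI_Send_init (&sbuff, n, MPI_CHAR, dest, tag, comm, &reqs[1]);
for (i=1; i<=REPS; i++){
    ...
    /* Step 2: start both transfers */
    MPI_Startall (2, reqs);
    ...
    /* Step 3: wait for completion before reusing the buffers */
    MPI_Waitall (2, reqs, stats);
    ...
}
/* Step 4: free the persistent request objects */
MPI_Request_free (&reqs[0]);
MPI_Request_free (&reqs[1]);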
|
Example code using persistent communications
Comparison code using MPI_Irecv and MPI_Isend
| Derived Datatypes |
|
The code fragment below provides an example.
|
|
/* Some declarations */
typedef struct {
float f1,f2,f3,f4;
int i1,i2;
} f4i2;
f4i2 rbuff, sbuff;
MPI_Datatype newtype, oldtypes[2];
int blockcounts[2];
MPI_Aint offsets[2], extent;
MPI_Status stat;
....
/* Setup MPI structured type for the 4 floats and 2 ints */
offsets[0] = 0;
oldtypes[0] = MPI_FLOAT;
blockcounts[0] = 4;
MPI_Type_extent(MPI_FLOAT, &extent);
offsets[1] = 4 * extent;
oldtypes[1] = MPI_INT;
blockcounts[1] = 2;
MPI_Type_struct(2, blockcounts, offsets, oldtypes, &newtype);
MPI_Type_commit(&newtype);
...
/* Send/Receive 4 floats and 2 ints as a single element of derived datatype */
for (i=1; i<=REPS; i++){
MPI_Send(&sbuff, 1, newtype, 1, tag, MPI_COMM_WORLD);
MPI_Recv(&rbuff, 1, newtype, 1, tag, MPI_COMM_WORLD, &stat);
}
...
/* Send/Receive 4 floats and then 2 ints individually */
for (i=1; i<=REPS; i++){
MPI_Send(&sbuff.f1, 4, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
MPI_Send(&sbuff.i1, 2, MPI_INT, 1, tag, MPI_COMM_WORLD);
MPI_Recv(&rbuff.f1, 4, MPI_FLOAT, 1, tag, MPI_COMM_WORLD, &stat);
MPI_Recv(&rbuff.i1, 2, MPI_INT, 1, tag, MPI_COMM_WORLD, &stat);
}
|
The simple example code provided below demonstrated a performance improvement of 38% when a derived datatype was used instead of individual send/receive operations (see note 1).
Derived datatype vs. individual send/receives example code
| Method | Bandwidth (MB/sec) |
|---|---|
| MPI_Type_vector | 12.10 |
| MPI_Type_struct | 9.57 |
| User pack/unpack | 20.23 |
| Individual send/receive | 0.29 |
| Network Contention |
|
| # Processors | Bandwidth (MB/sec) |
|---|---|
| 2 | 34 |
| 4 | 34 |
| 8 | 34 |
| 16 | 31 |
| 32 | 25 |
| 64 | 22 |
Above results reported by William Gropp and Ewing Lusk in their Supercomputing 96 tutorial, "Tuning MPI Applications for Peak Performance".
| References and More Information |
|
Notes
| 1 | Timing results were obtained on two IBM SP nodes (4-way SMP 332 MHz 604e) configured with 1.5 GB of memory. Unless otherwise indicated, User Space communications were used over the High-Performance Switch. All executions were conducted in a production batch system and used only one processor of a 4-way SMP node. |
| 2 | Timing results were obtained on a variable number of IBM SP nodes (4-way SMP 332 MHz 604e) configured with 1.5 GB of memory. Communication protocols used were User Space and Internet Protocol over the High-Performance Switch. All executions were conducted in a production batch system. "Onnode" timings indicate that MPI tasks populated, as fully as possible, all of the available processors on 4-way SMP nodes. |