SP Parallel Programming Workshop
o p t i m i z a t i o n     t o p i c s



  Table of Contents
  1. Overview
  2. Types of Optimization
  3. The Optimization Process
  4. Optimization Strategy
  5. Preprocessor Optimizations
    1. Overview
    2. Using the KAP Fortran Preprocessor
    3. Using the KAP C Preprocessor
  6. Compiler Optimizations
  7. Hand-Tuning Optimizations
    1. Memory Considerations
      1. The Memory Hierarchy
      2. Efficient Memory Use
    2. Array and Memory Management Optimizations
      1. Stride Minimization
      2. Array Padding
      3. Miscellaneous Array Optimizations
    3. Loop Optimizations
      1. Loop Fusion
      2. Loop Interchange
      3. Invariant IF Code Floating
      4. Loop Defactorizing
      5. Loop Collapse
      6. Loop Unrolling
      7. Loop Unrolling and Sum Reductions
      8. Outer Loop Unrolling
      9. IF Loops, WHILE Loops and DO Loops
    4. Arithmetic Optimizations
      1. BLAS and ESSL Routines
      2. Mathematical Acceleration SubSystem (MASS)
      3. Strength Reduction
      4. Miscellaneous Arithmetic Optimizations
  8. Input/Output Optimizations
  9. References and More Information
  10. Exercise


 
The Optimization Process Up to Types of Optimization Down to Optimization Strategy


The Optimization Process


 
Preprocessor Optimizations Up to Optimization Strategy Down to Using the KAP Fortran Preprocessor
Overview

For the IBM SP environment:


 
Preprocessor Optimizations Up to Preprocessor Optimizations Down to Using the KAP C Preprocessor
Using the KAP Fortran Preprocessor


Note Note: These instructions are for the IBM SP platform, and assume that the KAP Fortran preprocessor product is installed and licensed on the system you are using.

Syntax For Invoking The KAP Fortran Preprocessor

Compiler Compiler options -Pk or -Pk! -Wp,kopt1,kopt2,koptN source file(s)

Compiler

Compiler Options

-Pk or -Pk!

KAP options

Warning The -Wp, flag must come after the -Pk/-Pk! flags. Also, its argument list must contain NO spaces.

Serial Examples

Parallel MPI Example



 
Preprocessor Optimizations Up to Using the KAP Fortran Preprocessor Down to Compiler Optimizations
Using the KAP C Preprocessor


Note Note: These instructions are for the IBM SP platform, and assume that the KAP C preprocessor product is installed and licensed on the system you are using.

Syntax For Invoking the KAP C Preprocessor

Form 1 kxlc or kcc Compiler options +Kn source file(s)
Form 2 kxlc or kcc Compiler options +Kargs=kopt1:kopt2:koptN source file(s)
Form 3 kxlc or kcc Compiler options +Kcc=compiler_location source file(s)

kxlc or kcc

Compiler Options

KAP options

Warning For all KAP C preprocessor colon-separated option lists, there must be NO spaces.

Serial Examples

Parallel MPI Example



 
Compiler Optimizations Up to Using the KAP C Preprocessor Down to Hand-Tuning Optimizations

IBM Compiler Overview

Compiler Optimization Options

Recommendations



 
Hand-Tuning Optimizations Up to Compiler Optimizations Down to Memory Considerations

Why the compiler can't do everything



 
 
Memory Considerations Up to Hand-Tuning Optimizations Down to Efficient Memory Use
The Memory Hierarchy

Memory Hierarchy

CPU
Where the actual work (execution of computer instructions) is performed. Modern CPUs usually contain several different functional components, such as floating point units, fixed point (integer) units, branch control units, etc.

Register
Memory with immediate access by the CPU. Registers comprise the fastest and smallest level of the memory hierarchy. Usually 4-8 bytes in size and less than 100 in number.

Cache
A very fast, but small, area of memory. Cache is typically measured in kilo or mega bytes and may also be subdivided into layers (L1, L2). CPU access to data in cache usually requires only a single CPU cycle. Cache is usually organized into "lines", with each line being measured in bytes (32, 64, 128, 256...)

Main Memory
Main Memory is much larger than cache - usually an order of magnitude, and is measured in mega or gigabytes. However, access to main memory is usually an order of magnitude slower than cache, being measure in tens of cycles. Main memory is organized into "pages" (4096 bytes on the SP).

Disk (Virtual Memory)
Disk is usually several orders of magnitude slower and larger than main memory. Accessing data on disk can cost 100,000s of cycles.

Mass Storage (tape)
Virtually unlimited in size. Access to data on tape is measured in geologic time compared to main memory.

Cache Misses and Page Faults

  Cache Set Associativity

 
Memory Considerations Up to The Memory Hierarchy Down to Array and Memory Management Optimizations
Efficient Memory Use


 
 
Array and Memory Management Optimizations Up to Efficient Memory Use Down to Array Padding
Stride Minimization
Array allocation in C and Fortran



 
Array and Memory Management Optimizations Up to Stride Minimization Down to Miscellaneous Array Optimizations
Array Padding


 
Array and Memory Management Optimizations Up to Array Padding Down to Loop Optimizations
Miscellaneous Array Optimizations

Sparse Arrays

Array Initialization



 
 
Loop Optimizations Up to Miscellaneous Array Optimizations Down to Loop Interchange
Loop Fusion


 
Loop Optimizations Up to Loop Interchange Down to Loop Defactorizing
Invariant IF Code Floating


 
Loop Optimizations Up to Invariant IF Code Floating Down to Loop Collapse
Loop Defactorizing


 
Loop Optimizations Up to Loop Collapse Down to Loop Unrolling and Sum Reductions
Loop Unrolling


 
Loop Optimizations Up to Loop Unrolling Down to Outer Loop Unrolling
Loop Unrolling and Sum Reductions


 
Loop Optimizations Up to Outer Loop Unrolling Down to Arithmetic Optimizations
IF Loops, WHILE Loops and DO Loops


 
 
Arithmetic Optimizations Up to IF Loops, WHILE Loops and DO Loops Down to Mathematical Acceleration SubSystem (MASS)
BLAS and ESSL Routines


 
Arithmetic Optimizations Up to BLAS and ESSL Routines Down to Strength Reduction
Mathematical Acceleration SubSystem (MASS)


 
Arithmetic Optimizations Up to Mathematical Acceleration SubSystem (MASS) Down to Miscellaneous Arithmetic Optimizations
Strength Reduction


Exponentiation by repeated multiplication


Horner's Rule of Polynomial Evaluation


Right shift replacement for integer division by a power of 2


Factorization



 
Arithmetic Optimizations Up to Strength Reduction Down to Input/Output Optimizations
Miscellaneous Arithmetic Optimization


Common Subexpression Elimination Note: even better is:
 
     XX = XX + 2.0 * (X(I)*Y(I)+Z(I)) + X(I)*Y(I)-Z(I) 


Type Considerations



 
Input/Output Optimizations Up to Miscellaneous Arithmetic Optimizations Down to References and More Information

Suggestions



 
References and More Information Up to Input/Output Optimizations