|
SP Parallel Programming Workshop
|
|
o p t i m i z a t i o n t o p i c s
|
Table of Contents
- Overview
- Types of Optimization
- The Optimization Process
- Optimization Strategy
- Preprocessor Optimizations
- Overview
- Using the KAP Fortran Preprocessor
- Using the KAP C Preprocessor
- Compiler Optimizations
- Hand-Tuning Optimizations
- Memory Considerations
- The Memory Hierarchy
- Efficient Memory Use
- Array and Memory Management
Optimizations
- Stride Minimization
- Array Padding
- Miscellaneous Array Optimizations
- Loop Optimizations
- Loop Fusion
- Loop Interchange
- Invariant IF Code Floating
- Loop Defactorizing
- Loop Collapse
- Loop Unrolling
- Loop Unrolling and Sum Reductions
- Outer Loop Unrolling
- IF Loops, WHILE Loops and DO Loops
- Arithmetic Optimizations
- BLAS and ESSL Routines
- Mathematical Acceleration SubSystem (MASS)
- Strength Reduction
- Miscellaneous Arithmetic Optimizations
- Input/Output Optimizations
- References and More Information
- Exercise
|
The Optimization Process
|
|
- Steps in the Optimization Process
- Debug your source code and verify program correctness (usually with
optimization switches off).
- Profile your code; identify opportunities for performance
improvement.
- Perform hand-tuning operations, such as algorithm optimization,
re-coding of bottlenecks, etc.
- Apply preprocessor optimizations and/or compile with optimization
switches on
- Profile your code; Examine blocks of code that consume the most
execution time.
- Repeatedly apply various optimizations to such blocks.
- Ensure mathematical correctness of the program.
|
Preprocessor Optimizations
|
|
|
Overview
|
- A preprocessor is a software tool or package that transforms source
code before compilation, producing as output new source code that can
then be compiled. The types of source transformations which are
performed can be specified by inlined directives or command line flags
and arguments.
- What can a preprocessor do?
- Memory managment optimizations to improve cache use
- Algebraic transformations - replace mathematical expressions with
equivalent, more efficient ones
- Procedure inlining and interprocedural analysis
- Loop optimizations
- Dead code removal, invarient IF code floating
- Replace less efficient code with optimized library calls
(BLAS, ESSL, ...)
- Provide listing files which describe the optimizations performed and
hinderances to optimization.
| For the IBM SP environment:
|
|
Preprocessor Optimizations
|
|
|
Using the KAP Fortran Preprocessor
|
| Note: These instructions are for the IBM SP platform, and assume that the
KAP Fortran preprocessor product is installed and licensed on the system
you are using.
|
Syntax For Invoking The KAP Fortran Preprocessor
Compiler
Compiler options
-Pk or -Pk!
-Wp,kopt1,kopt2,koptN
| source file(s)
| | | | |
Compiler
This is the IBM Fortran compiler which you desire to use. For example,
for a serial Fortran program, you might use xlf or
xlf90. For a parallel MPI program, you might use
mpxlf.
Compiler Options
Specify the usual IBM Fortran compiler options that you desire. In
particular, it's usually a good idea to include the IBM compiler
optimization flags, such as -O3 .
-Pk or -Pk!
Select one of these flags (required). -Pk causes the
transformed source code to be compiled. -Pk! preprocesses
the source file but does not compile it.
KAP options
- There are quite a number of options which can be passed to the
KAP Fortran preprocessor. To accomplish this, you must use the
four character -Wp, flag followed by a comma
separated list of KAP arguments.
- The kapf man page
describes all of the available options. Note that the options usually
have a short form as well. Some useful ones are shown in the table below
(see the man page for details).
| The -Wp, flag must come after the -Pk/-Pk!
flags. Also, its argument list must contain NO spaces.
|
| Option
| Short Form
| Description
|
-aggressive=ab
-ag=ab
| Pads COMMON blocks and adjusts leading array dimensions in COMMON to
promote more efficient cache use. Default is turned off.
|
-cacheline=N
-chl=N
| Informs preprocessor of cache line size in N bytes. Default is 128
bytes.
|
-cachesize=N
-chs=N
| Informs preprocessor of cache size in N Kbytes. Default is 64 Kbytes.
|
-fortran
-f
| Saves the transformed output of preprocessor as Pfilename where
filename is the input Fortran source file. Not saved by default.
|
-include=pathname
-inc=pathname
| Allows specification of directory for required include files. Useful for
MPI programs which have a required include file.
|
-library_calls=libname
-lc=libname
| Directs preprocessor to replace sections of code with calls to standard
numerical library routines which have the same functionality. Use either
blas or essl.
|
-list=file
-l=file
| Instructs the preprocessor to produce a listing file which describes
precompilation optimizations
|
-optimize=N
-o=N
| Specifies the level of optimization, 0 - 5, with 5 being both the
maximum and default.
|
-roundoff=3
-scalaropt=3
-r=3
-so=3
These two options are both the default, and permit the greatest
optimization. Other values (0 - 2) can be set if problems occur.
| | | | | | | | | | | | | | | | | | |
Serial Examples
xlf -O3 -Pk mycode.f
xlf -O3 -Pk -Wp,-o=3,-chs=256 mycode.f
xlf -O2 -Pk! -Wp,-o=4,-l=my.out,-f mycode.f
Parallel MPI Example
mpxlf -O4 -Pk -Wp,-inc=/usr/lpp/poe/include parallel_prog.f
|
Preprocessor Optimizations
|
|
|
Using the KAP C Preprocessor
|
| Note: These instructions are for the IBM SP platform, and assume that the
KAP C preprocessor product is installed and licensed on the system you are
using.
|
Syntax For Invoking the KAP C Preprocessor
Form 1
kxlc or kcc
Compiler options
+Kn
| source file(s)
|
Form 2
kxlc or kcc
Compiler options
+Kargs=kopt1:kopt2:koptN
| source file(s)
|
Form 3
kxlc or kcc
Compiler options
+Kcc=compiler_location
| source file(s)
| | | | | | | | | | | | |
kxlc or kcc
Use either the kxlc (ANSI C) or kcc
(IBM extended C) command to invoke the KAP C Preprocessor. These commands
are actually "drivers" and will call the following for each input source
file:
- IBM C pre-processor
- KAP C optimizing preprocessor
- IBM C compiler
Compiler Options
Specify the usual IBM C compiler options that you desire. In particular,
it's usually a good idea to include the IBM compiler optimization flags,
such as -O3 .
KAP options
- Form 1: This is the simplest way to direct preprocessing
optimization. Specify the level of optimization as
+KN=, where N= is an
integer 0 - 5. The default level is 3, and is considered adequate for
most applications. This form can be combined with the other two forms.
- Form 2: This form is used to explicitly set specific KAP options.
The format is the literal +Kargs= followed by a
colon-separated list of KAP options. This form can be combined with
the other two forms.
- Form 3: This form is used to specify an alternate compiler, and is
particularly useful for compiling parallel MPI codes when you need to
use one of the IBM parallel C compiler scripts, such as
mpcc. This form can be combined with the other two
forms.
- Useful options: Similar to the options described in the KAP Fortran
table, but there are differences (such as no
-ag option). See the
kapc man page for
details. Also see the
kxlc/kcc man page
for additional options.
| For all KAP C preprocessor colon-separated option lists, there must be
NO spaces.
|
Serial Examples
kxlc -O3 +K3 myprog.c
kcc -O2 +Kargs=-o2:-l:-cachesize=128 myprog.c
Parallel MPI Example
kxlc -I/usr/lpp/poe/include +Kcc=/usr/bin/mpcc +K5 parallel_prog.c
|
Compiler Optimizations
|
|
IBM Compiler Overview
- All compilers provide the ability to automatically optimize source codes.
The optimizations that a compiler can perform will vary according to the
compiler.
- There are a variety of Fortran, HPF, C, and C++ compiler commands for the
IBM SP environment.
| IBM Compiler Commands
|
| Language
| Compiler Command
| Description
|
| Fortran
| xlf
| Standard AIX Fortran 77 compiler
|
f77
| Standard AIX Fortran 77 compiler
|
xlf90
| Standard AIX Fortran 90 compiler
|
xlf_r
| SMP Fortran compiler; for use with threads
|
xlf90_r
| SMP Fortran 90 compiler; for use with threads
|
xhhpf90
| Data-parallel Fortran 90 HPF compiler
|
xhhpf
| Data-parallel HPF compiler
|
mpxlf
| Fortran 77 compiler for parallel jobs using IBM's MPI; a wrapper script
|
mpxlf_r
| SMP Fortran 77 compiler for parallel jobs using IBM's MPI and threads;
a wrapper script
|
| C
| xlc
| Standard C compiler (ANSI)
|
cc
| Standard C compiler (IBM Extended)
|
xlc_r
| SMP C compiler; for use with threads (ANSI)
|
cc_r
| SMP C compiler; for use with threads (IBM Extended)
|
mpcc
| C compiler for parallel jobs using IBM's MPI; a wrapper script; assumes
IBM Extended rather than ANSI C
|
mpcc_r
| SMP C compiler for parallel jobs using IBM's MPI and threads; a wrapper
script; assumes IBM Extended rather than ANSI C
|
| C++
| xlC
| Standard C++ compiler
|
xlC_r
| SMP C++ compiler; for use with threads
|
mpCC
| C++ compiler for parallel jobs using MPI; a wrapper script
|
mpCC_r
| SMP C++ compiler for parallel jobs using MPI and threads; a wrapper script
| | | | | | | | | | | | | | | | | | | |
- Despite the number of compiler commands, all of the above compiler
invocations are supported by the same underlying AIX Fortran and C
Language compilers.
- For the most part, a common set of compiler flags can be used for the
various flavors of Fortran compile commands, and a common set for the
various flavors of C / C++.
- A complete description of the IBM compilers is beyond the scope of this
tutorial. Consult the IBM Compiler
documentation for complete information. A summary of the compiler
options can be found in their respective man pages:
- For IBM compiler optimizations, both the C compiler and Fortran compiler
convert source code into the same Intermediate Language before compiling.
A common optimizer then translates the IL code into machine language.
Compiler Optimization Options
- There are a number of Fortran and C compile options which deal specifically
with optimization. The table below describes the more useful ones. Note
that some of these options include caveats which are not mentioned in this
table.
| IBM Compiler Optimization
Options
|
| Option
| Description
|
-O
| Basic optimization. No techniques are used which could alter program
results. "Typically doubles the performance of both integer and
floating point programs" (per IBM).
|
-O2
| Identical to -O
|
-O3
| High level optimization. May alter semantics of the program and produce
results that are not identical to unoptimized results. Usually requires
additional compile time and memory. Typically improves performance 0-7%
over -O2, but can be much greater for some codes (per IBM). Can decrease
performance for some codes also.
|
-O4
| Aggressive optimization, trading off compile time for potential
improvements beyond -O3. Fortran only.
|
-qstrict
| Used with -O3 to turn off the aggressive semantic altering optimizations.
|
-qhot
-qhot=arraypad
Performs additional loop and memory optimizations. Array padding
optimizations may be specified, but can be unsafe. Fortran only.
|
-qarch=arch -qtune=arch
Perform optimizations specific for the type of IBM RISC
architecture specified. Architectures of interest to the SP
environment include:
- com - common to all architectures. Provides
portability at the expense of optimum tuning
- auto - automatically detects architecture and tunes
accordingly. Must run on the platform you compiled on.
- p2sc - Power2 super-chip (P2SC)
- pwr - Power, Power2, and P2SC
- pwr2 - Power2, P2SC
- pwr3 - Power3
- 604 - Power PC 604 (604e)
|
-qcache=list
-qcache=auto
Specifies cache configuration for use by optimizer. The option list is
defines detailed configuration information. Use auto
if you will run on the same platform you compiled on. If used, you must
also specify -qhot.
|
-qipa
| Performs interprocedural analysis, enabling optimizations that can be
performed across procedures (usually performed only within a given
procedure). Must also use one of the -O options. Further optimization
is possible if used in link phase also.
|
-qautodbl=dblpad4
| Promotes REAL and REAL(4) variables to REAL(8); promotes COMPLEX and
COMPLEX(4) to COMPLEX(8). Can significantly improve performance and
also accuracy. Use for Power and Power2 architectures only.
|
-qsmp
| Exploits SMP architecture. Automatic parallelization of DO loops.
Recognition of IBM SMP directives. Source must be compiled with
xlf_r or xlf90_r.
Fortran version 5.1+ only.
|
-qreport=report
| Used to generate a report showing how the program is parallelized and
how loops are optimized. Valid report types are
smplist and hotlist.
Fortran version 5.1+ only.
|
-Pv
| Invoke the VAST preprocessor before compiling.
|
-Pv!
| Invoke the VAST preprocessor only (no compilation)
|
-Pk
| Invoke the KAP preprocessor before compiling.
|
-Pk!
| Invoke the KAP preprocessor only (no compilation)
| | | | | | | | | | | | | | | | |
Recommendations
- Use -O2 or -O3 -qstrict for production
codes.
- If -O4 is used, be sure to compare your code's performance
and results against other optimization levels.
- If your code must be portable across different types of SP nodes,
do not use options which target a specific processor architecture
(-qarch -qtune -qcache). Your code may fail, run
slower, or produce wrong results.
- If executable portability is not a concern, use the architecture specific
options, -qarch=auto -qtune=auto -qcache=auto.
- Use -qipa. Use in the link phase is also recommended.
- If you use -qhot be sure to compare code performance
against not using it.
- Preprocessor use is highly recommended (if available).
- Consult the IBM Compiler documentation for
more information about the above compiler options, and others not mentioned.
Also see XL Fortran:
Eight Ways to Boost Performance.
|
Hand-Tuning Optimizations
|
|
Why the compiler can't do everything
- Sometimes the compiler won't perform optimizations even though it can.
The compiler assigns a higher priority to producing consistent and
correct code, than optimizing performance.
- Sometimes a potential compiler optimization could result in errors: (A +
B) + C does not always equal A + (B + C). Certain optimizations can
reassociate floating-point operations and potentially accumulate into
significant errors.
The difference is usually small: Finite precision floating-point
numbers do not always associate:
(A + B) + C A + (B + C)
3.483986447771696 3.483986447771695
4.467320344550364 4.467320344550365
^
|
--- 15th decimal place
opt01.c
|
Sometimes the difference is huge: Finite floating-point exponents can
lead to "Catastrophic Cancellation".
Result of the original loop is 0.000000
Result of the interchanged loop is 143166118.018429
opt02.c
|
- In cases where the compiler can't or won't perform optimizations, or
when it produces incorrect results, hand-tuning becomes necessary.
- The remainder of this tutorial concerns "hand tuning" optimization
techniques in the areas of:
- Arrays and memory managment
- Loops
- Arithmetic operations
- I/O
|
Memory Considerations
|
|
|
The Memory Hierarchy
|
- Many optimizations target efficient use of memory resources. Understanding
memory related factors is highly useful in understanding optimization
techniques.
- Two aspects of memory that are critical to understanding optimization are:
- The Memory Hierarchy - the different physical "layers" of memory and
their different characteristics.
- Spatial and Temporal Locality: program characteristics which will
improve your code performance.
- Most computer architectures share a similar memory structure. Memory is
organized within a hierarchy with the fastest, smallest, most expensive
memory components at the top, and the slowest, largest, least expensive
components at the bottom.
- The Memory Hierarchy is necessary because the main memory is about 50
times too slow to keep up with the CPU.
- CPU
- Where the actual work (execution of computer instructions) is performed.
Modern CPUs usually contain several different functional components, such
as floating point units, fixed point (integer) units, branch control
units, etc.
- Register
- Memory with immediate access by the CPU. Registers comprise
the fastest and smallest level of the memory hierarchy. Usually 4-8 bytes
in size and less than 100 in number.
- Cache
- A very fast, but small, area of memory. Cache is typically measured in
kilo or mega bytes and may also be subdivided into layers (L1, L2). CPU
access to data in cache usually requires only a single CPU cycle.
Cache is usually organized into "lines", with each line being measured
in bytes (32, 64, 128, 256...)
- Main Memory
- Main Memory is much larger than cache - usually an order of magnitude,
and is measured in mega or gigabytes. However, access to main memory is
usually an order of magnitude slower than cache, being measure in tens
of cycles. Main memory is organized into "pages" (4096 bytes on the SP).
- Disk (Virtual Memory)
- Disk is usually several orders of magnitude slower and larger than main
memory. Accessing data on disk can cost 100,000s of cycles.
- Mass Storage (tape)
- Virtually unlimited in size. Access to data on tape is measured in
geologic time compared to main memory.
Cache Misses and Page Faults
- When a memory location is referenced, its associated cache lines are
checked to see if the data is present in cache. If present, the data
can be loaded into a register. If not, a cache miss occurs.
- A cache miss means the processor must now spend cycles trying to find
where the requested data actually resides, and then bring that data into
cache.
- Searching for the data usually means checking a fast table, called the
Translation Lookaside Buffer (TLB), of the most recently accessed pages
in memory. If the data is not in a page being tracked by the TLB,
a TLB miss occurs and the processor must then search a larger table which
tracks all of the pages in main memory, called the Page Frame Table (PFT).
- If the requested data resides in a page being tracked by the PFT, the
corresponding cache line is filled with that data and its surrounding
elements - up to the size of one cache line.
- If the data is not in main memory, then a page fault occurs
and control is turned over to the operating system.
A page fault generally implies that the page is located in virtual
memory on disk and the operating system must perform I/O to obtain it
- a very expensive operation cycle-wise.
Cache Set Associativity
- For reasons of efficiency, memory locations can only map to a subset of
available cache lines. The association of a memory location with a
set of cache lines is called set associativity. The set
of lines which share common mappings is called a congruence class.
- Power2 SP nodes possess 4-way cache set associativity. This means that
a given memory location can map to only 4 possible cache lines.
|
| Adapted from IBM's POWER3 Introduction and
Tuning Guide
|
- Power 3 nodes are 128-way cache set associative, allowing memory
locations to map to 128 possible cache lines.
|
| Adapted from IBM's POWER3 Introduction and
Tuning Guide
|
- Cache set associativity can lead to performance decreases under certain
conditions:
- When an array is processed with a stride that is a large power of 2.
For example, given a cache that is 4-way set associative with
128 congruence classes of 128 byte cache lines, processing an array
of REAL(8) with a stride of 2048 will cause a cache miss for each
element.
- When there are multiple arrays being processed simultaneously, and
each array is a power of 2 in size, and the arrays are adjacent in
memory.
|
Memory Considerations
|
|
|
Efficient Memory Use
|
- Programs should be designed so that a high percentage of accesses
are made to the higher levels of memory. To accomplish this, the
programmer should strive for two important program characteristics:
Spatial Locality
If location X in memory is currently being accessed,
it is likely that a location near X will be accessed next.
Temporal Locality
If location X in memory is currently being accessed, it is likely that
location X will soon be accessed again.
- Taking advantage of spatial and temporal locality translates will minimize
cache misses, TLB misses, and page faults through data reuse.
- Minimizing cache misses
- Avoid processing data with large strides. Ideally, data should
be processed sequentially as it is stored in memory.
- Avoid cache associativity conflicts. This requires understanding how
memory is mapped into cache lines and will vary across different
architectures.
- Minimizing TLB misses
- The TLB usually holds only a limited number of pointers to recently
referenced pages. On most SP models, the number is 256. Since a page
is 4096 bytes, this means only 1 MB (1048576 bytes) of memory is
tracked by the TLB.
- When processing arrays which use more than this amount of memory
(individually or collectively), be sure to make the most of the
data before accessing new data.
- Refrain from using large-stride data access. For example, a stride
of 4096 bytes (one page) can generate a TLB miss every access.
- Minimizing page faults
- Excessive page faulting is characterized by "disk thrashing", and can
severely degrade performance. One way to avoid this is to insure that
all of your data fits into real (not virtual) memory by using a
machine with larger memory resources.
- If this is not possible, then changing the way data is accessed may
help, such as processing data in "chunks" which fit into memory.
- Make optimal use of the instruction cache. Make sure that the
compute-intensive areas of your program are small enough to fit inside
the instruction cache (for most programs, this is not a concern).
|
Array and Memory Management Optimizations
|
|
|
Stride Minimization
|
- Matrices are allocated differently in C and Fortran.
- In a loop, stride is defined as the distance between
successively accessed elements of a matrix in successive iterations of
the loop. For example:
| C Language
| Fortran
|
| Stride 1
|
for(i=0;i<1000;i++)
for(j=0;j<1000;j++)
c[i][j] = a[i][j] * b[i][j]
|
DO I=1,1000
DO J=1,1000
C(J,I) = A(J,I) * B(J,I)
END DO
END DO
|
| Stride 1000
|
for(j=0;j<1000;j++)
for(i=0;i<1000;i++)
c[i][j] = a[i][j] * b[i][j]
|
DO J=1,1000
DO I=1,1000
C(J,I) = A(J,I) * B(J,I)
END DO
END DO
|
- Processing matrices with minimal stride takes advantage of spatial
locality. When any element is referenced, and needs to be brought into
cache from memory, an entire cache line worth of data is brought with it.
- Small stride exploits the extra data brought in with the cache line,
since the next data to be processed is already in cache.
- Large stride does not exploit the extra data brought in with the cache
line, and in fact, may require a new cache line to be loaded for each
element accessed.
- To ensure minimal stride: In C, the innermost loop's induction variable
should be the rightmost array subscript; for Fortran, it should be the
leftmost array subscript.
Example timings (msec)
| mem01.c
|
mem01.f
|
| stride 1000 loop
| stride 1 loop
| stride 1000 loop
| stride 1 loop
|
| No Optimization
| 404
| 74
| 420
| 89
|
| -O3
| 348
| 12
| 347
| 10
|
|
Array and Memory Management Optimizations
|
|
|
Array Padding
|
- Array Padding is a technique that can be used to reduce cache misses due
to set associativity in arrays that are sized by a power of two.
- To pad an array, simply increase the leading array dimension in Fortran,
or the last dimension in C/C++. The optimal amount to pad varies, but
should be at least the number of elements that fit in one line of cache.
Examples
|
Untuned
REAL*8 A(256,256)
DO I=1,256
DO J=1,256
DO K=1,256
A(I,J)=A(I,J)+A(J,K)-A(K,I)
END DO
END DO
END DO
Tuned
REAL*8 A(257,256)
DO I=1,256
DO J=1,256
DO K=1,256
A(I,J)=A(I,J)+A(J,K)-A(K,I)
END DO
END DO
END DO
|
Example timings (msec)
|
arraypad.c
|
arraypad.f
|
| Untuned
| Tuned
| Untuned
| Tuned
|
| No Optimization
| 5528
| 6172
| 5536
| 6428
|
| -O3
| 5000
| 1525
| 5034
| 1527
|
| -O3 with kapc/kapf
| 5025
| 1023
| 5023
| 1532
|
|
Array and Memory Management Optimizations
|
|
|
Miscellaneous Array Optimizations
|
Sparse Arrays
- A sparse array is one in which only a small proportion of the elements
contain valid data. For each active element accessed, large numbers of
inactive elements may be read by the CPU.
- Even if only active elements are accessed, the entire cache line must
be in the cache. In the worst case, each access causes a cache
miss. Severe paging, and memory space problems may then be encountered.
- A more efficient algorithm that uses denser arrays, or linked lists
may be the best alternative to sparse arrays. Precompilers and compilers
do not provide optimizations for sparse arrays.
Array Initialization
- Static Initialization: performed once when program is loaded.
REAL(8) A(100, 100) /10000*1.0/
- Dynamic Initialization: performed at runtime as the programmer chooses.
DO I=1, DIM1
DO J=1, DIM2
A(I, J) = 1.0
- Use dynamic initialization if the object file must be kept small.
- Use neither form if the array is static and you need it initialized
to zero. Instead, declare the array as STATIC. Static variables are
initialized to zero automatically by the AIX operating system before
they are passed to your program (Note: this action may not be
portable).
|
Loop Optimizations
|
|
|
Loop Fusion
|
- Loop Fusion involves the merging of the statements in several loops
into a single loop.
- Advantages:
- Loop overhead is reduced.
- Allows for better instruction overlap.
- Cache misses can be decreased if both loops reference the same array.
Examples
|
Untuned
for(i=0;i<100000;i++)
x = x * a[i] + b[i];
for(i=0;i<100000;i++)
y = y * a[i] + c[i];
Tuned
for(i=0;i<100000;i++) {
x = x * a[i] + b[i];
y = y * a[i] + c[i];
}
|
Example timings (msec)
|
loop01.c
|
loop01.f
|
| Untuned
| Tuned
| Untuned
| Tuned
|
| No Optimization
| 55
| 39
| 57
| 45
|
| -O3
| 30
| 16
| 29
| 21
|
| -O3 with kapc/kapf
| 30
| 16
| 29
| 21
|
- Disadvantage: Has the potential to increase cache misses due to
cache set associativity effects when the fused loops:
- Contain references to multiple arrays and the starting elements of
those arrays map to the same cache line.
- Access arrays that take up a large portion of cache.
Example timings (msec)
|
loop02.c
|
loop02.f
|
| Untuned
| Tuned
| Untuned
| Tuned
|
| No Optimization
| 11
| 24
| 13
| 27
|
| -O3
| 6
| 19
| 7
| 13
|
| -O3 with kapc/kapf
| 6
| 9
| 7
| 22
|
|
Loop Optimizations
|
|
|
Invariant IF Code Floating
|
- IF statements that do not change from iteration to iteration
may be moved out of the loop.
- The compiler can usually detect and perform this optimization except
when:
- Loops contain calls to procedures and the compiler is pessimistic
about aliasing.
- Variable-bounded loops that may never get entered.
- Complex loops where the invariance can not be determined by the
compiler.
Examples
|
Untuned
for(i=0;i<1000;i++)
{
for(j=0;j<1000;j++)
{
if (a[i] > 100)
b[i] = a[i] - 3.7;
x = x + a[j] + b[i];
}
}
Tuned
for(i=0;i<1000;i++)
{
if (a[i] > 100)
b[i] = a[i] - 3.7;
for(j=0;j<1000;j++)
x = x + a[j] + b[i];
}
Compiler inhibitor
for(i=0;i<ARRAY_SIZE;i++)
{
for(j=0;j<ARRAY_SIZE;j++)
{
srandom(SEED);
if (a[random() % ARRAY_SIZE] >= 100.0)
b[i] = a[i] - 3.7;
x = x + a[j] + b[i];
}
}
|
Example timings (msec)
|
loop05.c
|
loop05.f
|
loop06.c
|
| Untuned
| Tuned
| Untuned
| Tuned
| Untuned
| Tuned
|
| No Optimization
| 452
| 243
| 390
| 241
| 1257
| 15
|
| -O3
| 120
| 120
| 135
| 135
| 1254
| 13
|
| -O3 with kapc/kapf
| 75
| 75
| 135
| 135
| 1250
| 13
|
|
Loop Optimizations
|
|
|
Loop Defactorizing
|
- Loops involving multiplication by a constant can be rewritten so that the
multiplication takes place outside the loop.
- The defactorized loop allows the POWER processor to perform a single
FMA instruction.
- For more complex loops, defactorizing should target balancing adds and
multiplies to provide the floating point unit with a stream of FMA's.
Examples
|
Factorized
DO I=1, ARRAY_SIZE
A(I) = 0.0D0
DO J = 1, ARRAY_SIZE
A(I) = A(I) + B(J) * D(J) * C(I)
ENDDO
ENDDO
Defactorized
DO I=1, ARRAY_SIZE
A(I) = 0.0D0
DO J = 1, ARRAY_SIZE
A(I) = A(I) + B(J) * D(J)
ENDDO
A(I) = A(I) * C(I)
ENDDO
|
Example timings (msec)
|
loop08.c
|
loop08.f
|
| Untuned
| Tuned
| Untuned
| Tuned
|
| No Optimization
| 6991
| 5819
| 6855
| 6482
|
| -O3
| 490
| 291
| 1143
| 382
|
| -O3 with kapc/kapf
| 715
| 605
| 301
| 201
|
|
Loop Optimizations
|
|
|
Loop Unrolling
|
- Data dependence delays can be reduced or eliminated.
- Loop overhead may be reduced.
- May be performed by the compiler during optimization.
Examples
|
Untuned
DO J=1, DIM2
DO I=1, DIM1
DO K=1, 4
A(I, J) = A(I, J) + B(K, I) * C(K, J)
ENDDO
ENDDO
ENDDO
Tuned
DO J=1, DIM2
DO I=1, DIM1
A(I, J) = A(I, J) + B(1, I) * C(1, J)
A(I, J) = A(I, J) + B(2, I) * C(2, J)
A(I, J) = A(I, J) + B(3, I) * C(3, J)
A(I, J) = A(I, J) + B(4, I) * C(4, J)
ENDDO
ENDDO
|
Example timings (msec)
|
loop11.c
|
loop11.f
|
| Untuned
| Tuned
| Untuned
| Tuned
|
| No Optimization
| 224
| 117
| 223
| 160
|
| -O3
| 25
| 20
| 63
| 37
|
| -O3 with kapc/kapf
| 24
| 20
| 42
| 42
|
|
Loop Optimizations
|
|
|
Loop Unrolling and Sum Reductions
|
- Intermediate sums are used inside the inner loop to help eliminate the
data dependency on previous iterations. Permits better "pipelining"
of FMA (floating-point multiply-add) operations.
Examples
|
Untuned
a = 0.0;
for(i=0;i < ARRAY_SIZE;i++)
for(j=0;j < ARRAY_SIZE;j++)
a = a + b[j] * c[i];
Tuned
a1 = a2 = a3 = a4 = 0.0;
for(i=0;i < ARRAY_SIZE;i++)
for(j=0;j < ARRAY_SIZE;j+=4)
{
a1 = a1 + b[j] * c[i];
a2 = a2 + b[j+1] * c[i];
a3 = a3 + b[j+2] * c[i];
a4 = a4 + b[j+3] * c[i];
}
aa = a1 + a2 + a3 + a4;
|
Example timings (msec)
|
loop12.c
|
| No Optimization
| -O3
| -O3 with kapc/kapf
|
| Untuned
| 39
| 12.0
| 12.0
|
| Tuned - depth 2 unrolling
| 33
| 6.1
| 6.0
|
| Tuned - depth 4 unrolling
| 29
| 3.6
| 3.9
|
| Tuned - depth 8 unrolling
| 28
| 3.0
| 2.5
|
| Tuned - depth 16 unrolling
| 27
| 3.3
| 3.1
|
|
Loop Optimizations
|
|
|
IF Loops, WHILE Loops and DO Loops
|
- IF Loops and WHILE Loops inhibit some preprocessor and compiler
optimizations. Use the DO loop construct whenever possible instead.
- The compiler optimizer and kapf Fortran preprocessor can transform some
but not all, IF and WHILE loops for you.
Examples
|
Untuned (IFs and GOTOs)
I = 0
10 I = I + 1
IF (I.GT.100000) THEN
GOTO 30
ENDIF
A(I) = A(I) + B(I) * C(I)
GOTO 10
30 CONTINUE
| Untuned (WHILE loop)
I = 0
DO WHILE (I.LT.100000)
I = I + 1
A(I) = A(I) + B(I) * C(I)
END DO
|
Tuned
DO I=1, 100000
A(I) = A(I) + B(I) * C(I)
ENDDO
| Tuned
DO I=1, 100000
A(I) = A(I) + B(I) * C(I)
ENDDO
|
|
Arithmetic Optimizations
|
|
|
BLAS and ESSL Routines
|
- BLAS: Basic Linear Algebra Subroutines: provide high level
performance for matrix-matrix, matrix-vector and vector-vector
operations.
- ESSL: IBM's Engineering and Scientific Subroutine Library: A
more comprehensive set of routines, tuned for the RS/6000 architecture.
Includes:
- Linear algebra equations and subprograms
- Matrix operations
- Fourier transformations and convolution
- Sorting and searching
- Interpolation
- Numerical quadrature
- Random number generation
- Available for Fortran, C and C++ programs.
- 3 advantages for using the BLAS and ESSL routines:
- BLAS and ESSL subroutine calls are easier to code than the operations
they replace.
- BLAS subroutine calls are portable across different platforms.
- BLAS and ESSL subroutine calls are likely to be highly tuned to
the specific hardware and are likely to perform well.
- See the following documentation for further information:
- "AIX Commands Reference", for information of the BLAS routines.
- "Engineering and Scientific Subroutine Library Guide and Reference",
for information on the ESSL routines.
- The KAP and VAST preprocessors can recognize code which can be
transformed into calls to BLAS and ESSL routines.
|
Arithmetic Optimizations
|
|
|
Mathematical Acceleration SubSystem (MASS)
|
- The Mathematical Acceleration SubSystem (MASS) is a set of highly tuned
scalar and vector subroutines that offer alternatives for functions in
libm.a, the math library supplied with AIX.
- The MASS Scalar Library libmass.a contains an accelerated set of
frequently used math intrinsic functions optimized for the POWER, POWER2,
and PowerPC architectures:
| sqrt
| rsqrt
| exp
| sinh
| x**y
|
| log
| sin
| cos
| cosh
|
|
| tan
| atan
| atan2
| tanh
|
|
- The MASS Vector Libraries contain a set of vector functions specifically
optimized for the POWER2 architecture.
- Additional information
|
Arithmetic Optimizations
|
|
|
Strength Reduction
|
- Reduces the computational cost of an operation while providing
mathematically identical results. Some effective strength reductions for
the RS/6000 architecture include:
- Translation of integer multiplication by a constant (done by compiler
and preprocessors)
- Exponentiation by repeated multiplication
- Horner's Rule of Polynomial Evaluation
- Floating-point division by inverse multiplication (done by compiler
and preprocessors)
- Right shift replacement for integer division by a power of 2
- Factorization
Exponentiation by repeated multiplication
- Performed by XLF Fortran compiler for small, constant, positive
integer expressions. Most other exponentiations are done by
calling a library procedure.
- C does not have an exponentiation operator - all exponentiations
involve a library call to the pow function. Hand tuning
for C programs may be of more benefit than for Fortran programs because
of this.
Horner's Rule of Polynomial Evaluation
Right shift replacement for integer division by a power of 2
- Shift operation requires less cycles than division. Dividend
and divisor must both be unsigned or positive integer.
Examples
|
Untuned
IL = 0
DO I=1,ARRAY_SIZE
DO J=1,ARRAY_SIZE
IL = IL + A(J)/2
ENDDO
ILL(I) = IL
ENDDO
Tuned
IL = 0
ILL = 0
DO I=1,ARRAY_SIZE
DO J=1,ARRAY_SIZE
IL = IL + ISHFT(A(J),-1)
ENDDO
ILL(I) = IL
ENDDO
|
Example timings (msec)
|
arit02.f
|
| Untuned
| Tuned
|
| No Optimization
| 4808
| 4809
|
| -O3
| 534
| 356
|
Factorization
|
Arithmetic Optimizations
|
|
|
Miscellaneous Arithmetic Optimization
|
Common Subexpression Elimination
Note: even better is:
XX = XX + 2.0 * (X(I)*Y(I)+Z(I)) + X(I)*Y(I)-Z(I)
Type Considerations
- Changes the type or precision of data to reduce memory usage, avoid
type conversions, and take advantage of processor specific hardware.
- For example, on the RS/6000 architecture, REAL*16 can not be processed
directly by the FPU. Instead, each REAL*16 operation must be broken
down by software into a lengthy series of instructions. Converting to
REAL*8 will improve performance.
Example timings (msec)
|
arit05.f
|
| Untuned
| Tuned
|
| No Optimization
| 157
| 35
|
| -O3
| 129
| 13
|
|
Input/Output Optimizations
|
|
- Most of the previously discussed optimizations will be of little benefit
if your program is I/O bound. I/O slows down your program because:
- I/O is orders of magnitude slower than internal memory accesses
- I/O routines consume CPU cycles themselves
Suggestions
- Eliminate all unnecessary I/O.
- Move I/O statements outside of loops if possible.
- Use unformatted (binary) I/O whenever possible
- Formatted I/O requires that library calls be made to convert
the binary representation to human readable format and then
converted back again to binary format when the processor must
process the data.
- Formatted I/O can also result in lost precision and rounding errors.
- Binary data is smaller - requires less physical I/O time to
process and less disk space to store. For example, to store
the number "1" as a double precision formatted value, requires
21 bytes (1.000000000000000000 plus newline). Storing it as a
binary value requires only 8 bytes (if it is part of a large record).
- Use large record sizes. Each record (result of one write operation)
of unformatted I/O includes a 4-byte header and 4-byte trailer. Files
with short records will contain a higher percentage of header and
trailer records - thus requiring more I/O and space.
| Write 1024 x 1024 Matrix
of REAL(8)
|
| Method
| File Size (bytes)
| Time (msec)
|
| Formatted, one real(8) per record (1048576 records)
| 22,020,096
| 34997
|
| Binary, one real(8) per record (1048576 records)
| 16,777,216
| 13054
|
| Formatted, one 1048576 (1024x1024) element record
| 20,971,521
| 21812
|
| Binary, one 1048576 (1024x1024) element record
| 8,388,616
| 134
|
iotest.f
- Access data from memory
- If possible, read all data into memory before the program starts
- AIX allows C programmers to access files through the use
of "shared memory segments" and operating system procedures.
Fortran programmers can make Interlanguage calls to the
same procedures. See the online man pages and IBM
documentation for details on using
shmget, shmat, shmdt, shmctl
- Finally, whenever possible, perform I/O to local disk rather than NFS
or networked mounted file systems.
|
References and More Information
|
|
- This tutorial was created by George Gusciora, MHPCC.
- "IBM C Set ++ for AIX User's Guide". Version 3 Release 1. IBM
Corporation. August 1995.
- "IBM C Set ++ for AIX Language Reference". Version 3 Release 1. IBM
Corporation. August 1995.
- "IBM C for AIX User's Guide". Version 3 Release 1. IBM
Corporation. August 1995.
- "IBM C for AIX Language Reference". Version 3 Release 1. IBM
Corporation. August 1995.
- "XL Fortran for AIX User's Guide". Version 5 Release 1. IBM
Corporation. November 1997.
- "XL Fortran for AIX Language Reference". Version 5 Release 1. IBM
Corporation. November 1997.
- "Optimization and Tuning Guide for Fortran, C, and C++". AIX Version 4.
IBM Corporation. June 1996.
- "XL Fortran: Eight Ways to Boost Performance", IBM Corporation.
www.software.ibm.com/ad/fortran/xlfortran/optim.htm
- "Mathematical Acceleration SubSystem (MASS) Version 2.4". IBM Corporation.
www.rs6000.ibm.com/resource/technology/MASS/
- "On-node Optimization". Bob Walkup, IBM Corporation. Presentation
materials for Advanced Parallel Programming workshop held at the Maui High Performance Computing Center, January 5, 1998.
www.rs6000.ibm.com/resource/technology/MASS/