NOTE: This document has been saved locally for performance and preservation reasons. Its original location was: http://www.software.ibm.com/ad/fortran/xlfortran/optim.htm and is © IBM Corporation.

XL Fortran: Eight Ways to Boost Performance

XL Fortran for AIX® Version 6 provides several methods for improving your program's performance. This article presents eight of them, in order from simplest to most advanced, that you can use to boost the performance of most programs.

Using the -O Optimization Options (-O, -O2, -O3, -O4)

Many compiler users shy away from using a compiler's optimization capabilities because historically compiler optimizers were overly aggressive, or just bug-ridden, and tended to introduce errors into valid Fortran programs. The XL Fortran Compiler optimizer is highly reliable. The -O2 option provides an intermediate level of optimization that avoids any techniques that could alter the semantics of valid Fortran programs. Although results at -O2 may not be identical to those produced when you do not select any optimization options, typically -O2 provides better precision than not using any optimization.

Why should you use -O2? Our performance tests suggest that using -O2 typically doubles the performance of both fixed-point and floating-point programs. You do not need to hand-tune your code, or use any other compiler options, to obtain this level of performance improvement, and you don't have to worry about -O2 introducing semantic changes into your program. Note that -O and -O2 provide the same level of optimization.

The compiler provides a higher level of optimization when you use -O3. This option increases the range of optimizations the compiler performs, but it can also increase compilation time and memory use by the compiler. Our experience suggests that -O3 usually improves performance beyond -O2, although in a few cases it can decrease performance. If you use -O3, you may want to compare your program's performance at -O3 to its performance when compiled with -O2.

In certain programs -O3 can change the behavior or results of your program, unless you also specify the -qstrict option, which we recommend for novice users.
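One way to make the comparison suggested above is to build the same source at each setting and time the resulting executables. The following is a minimal sketch, not a definitive build script: the source file name myprog.f, the build_variant helper, and the DRY_RUN variable are hypothetical, and by default the script only prints the commands so that it can be run on a machine without XL Fortran.

```shell
#!/bin/sh
# Dry-run sketch: prints the compile commands instead of running them.
# Run with DRY_RUN set to the empty string to execute them for real.
DRY_RUN=${DRY_RUN-echo}
SRC=myprog.f                      # hypothetical source file

# Build one executable per optimization setting so that the variants
# can be timed against each other (for example, with the time command).
build_variant() {
    # $1 = suffix for the executable name; remaining arguments = options
    suffix=$1
    shift
    $DRY_RUN xlf95 "$@" -o "myprog_$suffix" "$SRC"
}

build_variant O2        -O2
build_variant O3strict  -O3 -qstrict
build_variant O3        -O3
```

If the -O3 and -O3 -qstrict executables produce different results, the difference comes from the transformations that -qstrict suppresses.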

The -qhot option provides a different set of optimizations than -O3 does. Its tuning efforts concentrate on efficient scalarization of array language and iteration-reordering transformations. These transformations may produce results that are not bitwise identical to those produced only at -O2 or -O3. We have found that this option is primarily beneficial when used in conjunction with the -qarch, -qtune, and -qcache options (for iteration-reordering transformations), or with programs using array language, but you may wish to experiment with it in other situations as well. -qhot can occasionally reduce performance if it does not have enough information about the size of loop bounds and array dimensions, and you may want to use timing techniques to determine whether -qhot improves your program's performance.

You must specify -qhot for any of the -qcache settings to have an effect.

The -O4 option aggressively optimizes the source program, trading off additional compile time for potential improvements in the generated code. You can specify the option at compile time or link time. If you specify the option at link time, it has no effect unless you also specify it at compile time for at least the file which contains the main program. Specifying the -O4 option implies the following other options: -qhot, -qipa, -O3 (and all of the options and settings which it implies), -qarch=auto, -qtune=auto, and -qcache=auto. You can specify additional options following the -O4 option; these options will override the implied options listed in the previous sentence. For example, if you are compiling on a 604e machine, you can specify -O4 -qarch=pwr2 to produce executables for a POWER2 target machine.
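The compile-time, link-time, and override rules for -O4 might be applied as in the following sketch. The file names (main.f, solver.f, app) and the DRY_RUN variable are hypothetical; only the options themselves come from the text above. By default the commands are printed rather than executed.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless DRY_RUN is set to empty.
DRY_RUN=${DRY_RUN-echo}

O4_FLAGS="-O4"
# An option placed after -O4 overrides the setting that -O4 implies;
# here -qarch=pwr2 replaces the implied -qarch=auto.
TARGET_FLAGS="$O4_FLAGS -qarch=pwr2"

# Separate compile and link steps: -O4 appears on both, and the file
# containing the main program (main.f, hypothetical) is compiled with it.
$DRY_RUN xlf95 $O4_FLAGS -c main.f solver.f
$DRY_RUN xlf95 $O4_FLAGS -o app main.o solver.o

# Single step, compiling on one machine for a POWER2 target.
$DRY_RUN xlf95 $TARGET_FLAGS -o app_pwr2 main.f solver.f
```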

Recommendation: Use -O4 if you are not concerned about compilation time. Otherwise, use -O2 or -O3 -qstrict for any production-level program you compile. Try using -qhot if you have time to test different versions of your executable file for performance or if your program uses Fortran 90/95 array language.

Using -qarch and -qtune

The RISC System/6000 includes models based on four different chip configurations: the original POWER processor, the PowerPC® processor (including the 601 processor, which is a bridge between the POWER and PowerPC processors), the POWER2 processor, and the POWER3 processor. You can use -qarch and -qtune to target your program to particular machines.

If you intend your program to run only on a particular architecture, you can use the -qarch option to instruct the compiler to generate code specific to that architecture. This allows the compiler to take advantage of machine-specific instructions that can improve performance. -qarch provides arguments for you to specify certain chip models; for example, you can specify -qarch=604 to indicate that your program is to be executed on any PowerPC 604 hardware platform.

Note:

The -qarch option may result in a program that cannot be run on machines with processors other than those supported by the option. If you run such a program on an unsupported processor under AIX Version 4, your program may fail at execution time. If you want your program to run on more than one architecture, but to be tuned to a particular architecture, you can use a combination of the -qarch and -qtune options.

-qarch and -qtune are primarily of benefit for floating-point intensive programs. On PowerPC systems, programs that process mainly unpromoted single-precision variables are more efficient when you specify -qarch=ppc. On POWER2 and POWER3 systems, programs that process mainly double-precision variables (or single-precision variables promoted to double by one of the -qautodbl options) become more efficient with -qarch=pwr2 and -qarch=pwr3.

If your program is likely to be run on all four types of processors equally often, do not specify any -qarch or -qtune options. The default for these options is to support only the common subset of instructions of all processors. If you specify -q32, the defaults are -qarch=com and -qtune=pwr2. If you specify -q64, the defaults are -qarch=ppc and -qtune=pwr3.

If you specify the auto suboption for the -qarch option, XL Fortran automatically detects the specific architecture of the compiling machine. If you specify the auto suboption for the -qtune option, XL Fortran automatically detects the specific processor type of the compiling machine. For both options, XL Fortran assumes that the execution environment will be the same as the compilation environment.

You can further enhance the performance of programs intended for specific machines by using the -qcache and -qhot options.

Recommendation: If your program is intended for the full range of RISC System/6000 implementations, and is not intended primarily for one processor type, do not use either -qarch or -qtune.
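The choices described in this section can be summarized as a small selection helper. This is a sketch under stated assumptions: the arch_tune_flags function, its goal names, the file name myprog.f, and the DRY_RUN variable are all hypothetical. The option pairings are the ones mentioned in the text (the -q32 defaults for the tuned case, -qarch=604 for a single target, and the auto suboptions when the compiling machine is also the execution machine).

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless DRY_RUN is set to empty.
DRY_RUN=${DRY_RUN-echo}
SRC=myprog.f                      # hypothetical source file

# Print the -qarch/-qtune options for a deployment goal.
arch_tune_flags() {
    case $1 in
        portable) : ;;                                # rely on the defaults
        tuned)    echo "-qarch=com -qtune=pwr2" ;;    # runs anywhere, tuned for POWER2
        single)   echo "-qarch=604" ;;                # PowerPC 604 only
        native)   echo "-qarch=auto -qtune=auto" ;;   # same machine compiles and runs
        *)        echo "unknown goal: $1" >&2; return 1 ;;
    esac
}

for goal in portable tuned single native; do
    $DRY_RUN xlf95 -O2 $(arch_tune_flags "$goal") -o "myprog_$goal" "$SRC"
done
```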

Using Interprocedural Analysis

Interprocedural analysis (IPA) enhances the -O optimizations by performing detailed analysis across procedures. It extends the area examined during optimization and inlining from a single procedure to multiple procedures (which can be in different source files) and the linkage between them.

You request IPA by specifying the -qipa option. You can fine-tune the optimizations performed by specifying -qipa suboptions. You must also specify one of -O, -O2, -O3, or -O4. For additional performance benefits, you can also specify the -Q option.

To use IPA, you must:

  1. Do preliminary performance analysis and tuning before compiling with the -qipa option, because IPA analysis uses a two-pass mechanism that increases link time. (You can use the noobject suboption to reduce this overhead.)
  2. Specify the -qipa option on both the compile and link steps of the entire application, or as much of it as possible. Specify suboptions to indicate what assumptions to make about the parts of the program that are not compiled with -qipa. During compilation, the compiler stores interprocedural analysis information in the object file. During linking, the -qipa option causes a complete reoptimization of the entire application.

Recommendation: We strongly recommend that you specify -qipa on both the compile and link steps, and that you specify -qipa=noobject on the compile step.
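The recommended two-step use of IPA could look like the following sketch. The file names (main.f, solver.f, io.f, app), the variable names, and the choice of -O2 as the accompanying optimization level are assumptions made for illustration. The commands are printed rather than executed unless DRY_RUN is set to the empty string.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless DRY_RUN is set to empty.
DRY_RUN=${DRY_RUN-echo}

OPT="-O2"                              # any of -O, -O2, -O3, -O4
COMPILE_FLAGS="$OPT -qipa=noobject"    # store IPA information, reduce overhead
LINK_FLAGS="$OPT -qipa"                # whole-program reoptimization at link time

# Compile step: every source file gets -qipa (file names are hypothetical).
for f in main.f solver.f io.f; do
    $DRY_RUN xlf95 $COMPILE_FLAGS -c "$f"
done

# Link step: -qipa again, so the entire application is reoptimized.
$DRY_RUN xlf95 $LINK_FLAGS -o app main.o solver.o io.o
```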

Using Floating-Point Options

This section explains what default floating-point options you can change to improve performance of floating-point intensive programs. Some of these options can affect conformance to floating-point standards. They can change the results of computations, but in many cases the result is an increase in accuracy.

-qautodbl=dblpad4
This option promotes all REAL and REAL(4) variables to REAL(8), and all COMPLEX and COMPLEX(4) variables to COMPLEX(8), without promoting REAL(8) and COMPLEX(8) variables. It also pads variables that may share storage with promoted single-precision variables. This option can significantly improve the performance of programs that use 32-bit floating-point variables, on POWER, POWER2, and POWER3 machines only. It can also increase accuracy. Note that on PowerPC systems, -qautodbl=dblpad4 can reduce performance, and should generally not be used.
-qrealsize=8
This option promotes all REAL and COMPLEX variables that do not have an explicit size to REAL(8) and COMPLEX(8), respectively. Note that on PowerPC systems, this option can reduce performance, and should generally not be used.
-qfloat=fltint
This option speeds up floating-point-to-integer conversions by not checking for overflows. Fortran does not define what happens when a conversion to integer results in an out-of-range value, so this option does not affect programs conforming to the Fortran 90 or Fortran 95 standards. However, if you suspect that your program may contain conversions to integer that are not representable as integers, do not use this option.
-qfloat=rsqrt
This option lets the compiler calculate the reciprocal of a square root, rather than the square root itself, when the square root appears as a divisor in an expression. The division by the square root is replaced by a multiplication of the reciprocal if the compiler deems that the replacement would be beneficial. This option can cause slight rounding differences.
-qfloat=hssngl
Use this option to speed up single-precision arithmetic, if the results of computations of REAL(4) values are likely to fall outside the valid range of REAL(4) values. If you are certain that the results will always be within the range of REAL(4) values, you can use -qfloat=hsflt instead, which provides even better performance. However, -qfloat=hsflt is unsafe where computations may produce results outside the valid range (-10**38 to 10**38). You should only use -qfloat=hssngl or -qfloat=hsflt if there is a demonstrated performance improvement.

Recommendation for POWER, POWER2, and POWER3 platforms: For single-precision programs, you can improve performance while preserving accuracy by using these floating-point options: -qfloat=fltint:rsqrt:hssngl

If your single-precision program is not memory-intensive (for example, if it does not access more data than the available cache space), you can obtain equal or better performance, and greater precision, by using: -qfloat=fltint:rsqrt -qautodbl=dblpad4

For programs that do not contain single-precision variables, use -qfloat=rsqrt:fltint only. Note that -O3 without -qstrict automatically sets -qfloat=rsqrt:fltint.

Recommendation for PowerPC platform: Single-precision programs are generally more efficient than double-precision programs on PowerPC systems, so promoting default REAL values to REAL(8) can reduce performance. Use the following -qfloat suboptions: -qfloat=hssngl:fltint:rsqrt
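The recommendations above amount to a lookup from platform and program type to a set of options. The sketch below encodes that lookup; the float_flags function and its argument names (power, powerpc, single, single-cache, double) are hypothetical, while the option strings are copied from the recommendations.

```shell
#!/bin/sh
# Print the recommended floating-point options.
#   $1 = platform family: power (POWER, POWER2, POWER3) or powerpc
#   $2 = program type: single        (single precision, memory-intensive)
#                      single-cache  (single precision, data fits in cache)
#                      double        (no single-precision variables)
float_flags() {
    case "$1/$2" in
        power/single)        echo "-qfloat=fltint:rsqrt:hssngl" ;;
        power/single-cache)  echo "-qfloat=fltint:rsqrt -qautodbl=dblpad4" ;;
        power/double)        echo "-qfloat=rsqrt:fltint" ;;
        powerpc/*)           echo "-qfloat=hssngl:fltint:rsqrt" ;;
        *)                   echo "unknown combination: $1/$2" >&2; return 1 ;;
    esac
}

float_flags power single-cache
float_flags powerpc single
```

Whatever the lookup suggests, the advice in the text still applies: keep -qfloat=hssngl only if timing shows a real improvement.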

Using a Fortran Preprocessor

The KAP™ and VAST™ preprocessors for Fortran can produce tuned versions of your source code. You can obtain these preprocessors directly from Kuck and Associates (for KAP) and Pacific Sierra Research (for VAST). You may find that compiling with these preprocessors improves performance for some programs.

The preprocessors perform memory management optimizations, algebraic transformations, inlining, interprocedural analysis, and other optimizations.

If your program contains a large proportion of common algebraic algorithms, these algorithms may already exist in specially tuned libraries such as the BLAS (Basic Linear Algebra Subprograms) library or ESSL (Engineering and Scientific Subroutine Library). The BLAS library is currently shipped with the AIX operating system, while the ESSL library is a separate program product that provides a greater range of algorithms and improved performance. ESSL algorithms are tuned to individual hardware implementations, and take advantage of whatever memory and processor configuration is detected at run time. Both the KAP and VAST preprocessors can generate calls to these libraries.

Using -qcache for Single-Platform Programs

If your program is intended to run exclusively on a single machine or configuration, you can help the compiler tune your program to the memory layout of that machine by using the -qcache option. (You must also specify the -qhot option for -qcache to have any effect. -qhot uses information on cache configurations to determine appropriate memory management optimizations.) There are three types of cache: Data, Instruction, and Combined. Models generally fall into two categories: those with both data and instruction caches, and those with a single, combined data/instruction cache. The type=C|D|I suboption lets you identify the type of cache to which the -qcache option refers.

The -qcache options can also be used to identify the size and set associativity of a model's level-2 cache, or the Translation Lookaside Buffer (TLB), which is a table used to locate recently referenced pages of memory. In most cases you do not need to specify the -qcache entry for a TLB unless your program uses more than 512 KB of data space.

If you specify the auto suboption of -qcache, XL Fortran automatically detects the specific cache configuration of the compiling machine. XL Fortran assumes that the execution environment will be the same as the compilation environment.
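The two ways of supplying cache information might be written as in the sketch below. Only type=C|D|I and auto are named in this article; the level, size, line, and assoc suboption names are given as we recall them from the compiler-option reference and should be checked there. The cache sizes, line sizes, and associativities shown are invented values for a hypothetical machine, as are the file names and the DRY_RUN variable.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless DRY_RUN is set to empty.
DRY_RUN=${DRY_RUN-echo}

# Case 1: the program runs on the machine that compiles it, so the
# compiler can detect the cache configuration itself. -qhot is required
# for -qcache to have any effect.
AUTO_FLAGS="-O3 -qhot -qcache=auto"
$DRY_RUN xlf95 $AUTO_FLAGS -o myprog myprog.f

# Case 2: describe the target machine by hand (all numbers hypothetical):
# a level-1 data cache and a combined level-2 cache.
L1_DATA="-qcache=type=D:level=1:size=64:line=128:assoc=4"
L2_COMBINED="-qcache=type=C:level=2:size=1024:line=128:assoc=1"
MANUAL_FLAGS="-O3 -qhot $L1_DATA $L2_COMBINED"
$DRY_RUN xlf95 $MANUAL_FLAGS -o myprog_target myprog.f
```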

Using SMP

XL Fortran for AIX, Version 6.1 exploits the RS/6000 symmetric multi-processing (SMP) architecture. It supports both automatic parallelization of a Fortran program and explicit parallelization (through a set of directives that you can use to parallelize selected portions of your program). SMP support includes the following:

  • SMP directives:
    • ASSERT
    • BARRIER
    • CNCALL
    • CRITICAL/END CRITICAL
    • DO/END DO
    • DO SERIAL
    • INDEPENDENT
    • MASTER/END MASTER
    • PARALLEL/END PARALLEL
    • PARALLEL DO/END PARALLEL DO
    • PARALLEL SECTIONS/END PARALLEL SECTIONS/SECTION
    • PERMUTATION
    • SCHEDULE
    • THREADLOCAL
  • The DO (work-sharing) construct
  • The MASTER construct
  • The PARALLEL region construct
  • The -qsmp compiler option, to indicate that code should be produced for an SMP system.
  • The xlf95_r, xlf95_r7, xlf90_r, xlf90_r7, xlf_r, and xlf_r7 invocation commands. The main difference between these invocation commands and the xlf95, xlf90, and xlf invocation commands is that the first set of commands links and binds the object files to the thread-safe components (libraries, crt0_r.o, and so on). If you specify the -qsmp compiler option in addition to one of the invocation commands in the first set:
    • Automatic parallelization of DO loops is turned on.
    • The program will recognize all SMP directives.
  • The XLSMPOPTS environment variable, to allow you to specify options which affect SMP execution.
  • Fortran interfaces to the POSIX™ pthreads library.
  • Threadsafe I/O and math library calls.
  • The omp_get_thread_num service and utility procedure, which returns the number of the calling thread within its team (a value between 0 and NUM_PARTHDS - 1).
  • The omp suboption for the -qsmp compiler option, to enforce compliance with the OpenMP Fortran API.
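Building and running an SMP program with the pieces listed above might look like the following sketch. The source and executable names and the DRY_RUN variable are hypothetical. The parthds and schedule suboptions of XLSMPOPTS are given as we recall them from the User's Guide, and the values shown are examples only, so check them against your compiler level.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands unless DRY_RUN is set to empty.
DRY_RUN=${DRY_RUN-echo}

# Use a thread-safe invocation command together with -qsmp: automatic
# parallelization of DO loops is turned on and SMP directives are recognized.
$DRY_RUN xlf95_r -O2 -qsmp -o par_app par_app.f

# Enforce compliance with the OpenMP Fortran API.
$DRY_RUN xlf95_r -O2 -qsmp=omp -o omp_app omp_app.f

# Run-time settings (assumed suboptions): four threads, static scheduling.
XLSMPOPTS="parthds=4:schedule=static"
export XLSMPOPTS
$DRY_RUN ./par_app
```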

Using Asynchronous I/O

You may need to use asynchronous I/O to gain speed and efficiency in scientific programs that perform I/O for large amounts of data. Synchronous I/O blocks the execution of an application until the I/O operation is completed. Asynchronous I/O allows an application to continue processing while the I/O operation is performed in the background. You can modify applications to take advantage of the ability to overlap processing and I/O operations. Multiple asynchronous I/O operations may also be performed simultaneously, on multiple files residing on independent devices.
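The overlap of processing and I/O described above can be illustrated with a conceptual sketch. It uses a background shell job, not the XL Fortran asynchronous I/O facility (see the Language Reference for that), and the file name and loop sizes are arbitrary. The pattern is the same, though: start the transfer, keep computing, and block only at the point where the data is needed.

```shell
#!/bin/sh
# Conceptual sketch of overlapping computation with I/O.
tmp=${TMPDIR:-/tmp}/overlap_demo.$$     # hypothetical scratch file

# Start the "I/O operation" in the background; control returns at once.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$tmp" &
io_pid=$!

# The foreground computation proceeds while the write is in progress.
sum=0
i=1
while [ "$i" -le 100 ]; do
    sum=$((sum + i))
    i=$((i + 1))
done

# Block only here, where the written data is actually required.
wait "$io_pid"
lines=$(wc -l < "$tmp" | tr -d ' ')
echo "sum=$sum lines=$lines"
rm -f "$tmp"
```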

Further Reading

The XL Fortran for AIX User's Guide, SC09-2719, describes how to compile, link, and run your programs using XL Fortran Version 6.1. See in particular Chapter 5, "XL Fortran Compiler-Option Reference", and Chapter 7, "XL Fortran Floating-Point Processing."

The XL Fortran for AIX Language Reference, SC09-2718, describes the XL Fortran programming language. See in particular Chapter 8, "Input/Output Concepts", for more information on asynchronous I/O.

The following books describe the hardware architecture of the RISC System/6000 family of processors:

  • IBM RISC System/6000 Technology, SA23-2619.
  • PowerPC and POWER2: Technical Aspects of the New IBM RISC System/6000, SA23-2737.

To help to understand and exploit the new generation of computer systems that are based on the RS/6000 POWER3 architecture, see the POWER3 Performance and Tuning Guide.

AIX, RISC System/6000, PowerPC, and IBM are trademarks of International Business Machines Corporation in the United States and/or other countries.
POSIX is a trademark of the Institute of Electrical and Electronics Engineers.

©1999 IBM Corporation