Table of Contents
Attaching to a Running Application
Appendix A1. AIX/DPCL Installation Notes
Appendix A2. Linux/Dyninst Installation Notes
Appendix A3. Bugs in the v0.8 Release
Dynaprof is a performance analysis tool designed to insert performance measurement instrumentation directly into a running application's address space at run time. The instrumentation included with this release of Dynaprof can measure real time as well as any hardware performance metric available through PAPI. Run-time instrumentation of the object code has numerous advantages over traditional source-based performance profiling systems. The most significant is that calls to the instrumentation no longer interfere with the compiler's optimization passes. On aggressively scheduled processors, significant code reorganization and subroutine inlining are often required for maximal utilization of the processor's functional units. When additional subroutine calls are added, the performance of an application can change, especially in compute-intensive regions. An additional benefit is that the instrumentation no longer depends on the compilation process: the type and format of the instrumentation can be changed without recompiling the application.
Dynaprof comes as a compressed tar file of precompiled binaries. The directory structure is as follows:
Machine-specific installation notes and dependencies
Contains the Dynaprof binary and the reporting scripts for the included probes
Contains the Dynaprof probes that are inserted into the application to be profiled
Contains this document as well as any other machine-specific information
To install Dynaprof, simply untar/unzip this distribution into an installation area, set the DYNAPROF_PROBEDIR environment variable as described below, and follow the remaining instructions in the INSTALL file. Usually, this consists of making sure you have installed the shared libraries on which the probe modules depend.
Most administrators/users will want to set the DYNAPROF_PROBEDIR environment variable in their login scripts. The other option is to create a wrapper script that sets this variable automatically. If you do not set this variable ahead of time, you will either have to set it at run-time using the set command or explicitly name the full path to the probe in the use command.
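The wrapper-script option can be sketched as follows. This is a minimal illustration in Python, not something shipped with Dynaprof: the names wrapper_environment, main and DEFAULT_PROBEDIR are hypothetical, and the default directory is simply the example path this guide uses for DYNAPROF_PROBEDIR. It assumes dynaprof is reachable through PATH.

```python
import os
import shutil
import sys

# Assumption: example install location borrowed from this guide's
# DYNAPROF_PROBEDIR example; adjust it to the local installation.
DEFAULT_PROBEDIR = "/usr/local/dynaprof/usr/lib"


def wrapper_environment(env, probedir=DEFAULT_PROBEDIR):
    """Return a copy of env with DYNAPROF_PROBEDIR set unless already present."""
    new_env = dict(env)
    new_env.setdefault("DYNAPROF_PROBEDIR", probedir)
    return new_env


def main(argv):
    """Replace this process with dynaprof, passing all arguments through."""
    os.execvpe("dynaprof", ["dynaprof", *argv], wrapper_environment(os.environ))


if __name__ == "__main__":
    if shutil.which("dynaprof") is None:
        print("dynaprof was not found on PATH", file=sys.stderr)
    else:
        main(sys.argv[1:])
```

Because setdefault is used, a value the user has already exported takes precedence over the wrapper's default.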
All of the following variables are optional. Most have intelligent defaults set by the program at startup or by your system administrator during the installation process. Most of these variables can also be set at run time using the set and unset commands.
DYNAPROF_MAKE
This variable sets the name of the make command. It is used mainly for shortcuts during the performance tuning process.
Example:
[mucci@nebula]$ setenv DYNAPROF_MAKE "gmake -f Makefile.aix-power"
DYNAPROF_DEBUG
This variable enables debugging output in Dynaprof. See the -d / --debug option in the section on Command Line Options. Any non-NULL value enables this option. Not recommended.
Example:
[mucci@nebula]$ setenv DYNAPROF_DEBUG 1
DYNAPROF_DEBUGGER
This variable sets the name of the command used to start the debugger. It is used mainly for shortcuts during the performance tuning process.
Example:
[mucci@nebula]$ setenv DYNAPROF_DEBUGGER "gdb -q"
DYNAPROF_POEBIN
THIS VARIABLE IS FOR AIX SYSTEMS ONLY.
This variable sets the full path and name of the POE binary for starting parallel programs under AIX.
Example:
[mucci@nebula]$ setenv DYNAPROF_POEBIN /usr/local/bin/poe
DYNAPROF_PROBEDIR
This variable sets the full path to the directory containing Dynaprof probes.
Example:
[mucci@nebula]$ setenv DYNAPROF_PROBEDIR /usr/local/dynaprof/usr/lib
Dynaprof is a regular executable like most other tools on your system. To start Dynaprof, you have two options.
1. Verify that your PATH environment variable includes the directory where Dynaprof is installed.
2. Explicitly specify the path to Dynaprof as part of the command line.
Dynaprof has a number of command line options, most of which are reasonably self-explanatory. The less obvious options are explained below.
[mucci@nebula]$ dynaprof -h
DynaProf 0.8
Philip J. Mucci, mucci@cs.utk.edu, 2000-2002
Provided courtesy of UTK's Innovative Computing Laboratory.
See http://icl.cs.utk.edu for more information.
This is Open Source Software!
./dynaprof [options] [[--] executable-file [executable-args]]
Options:
-b | --batch      Exit after processing options.
-c| --commmand=   Execute Dynaprof commands from .
-d | --debug      Enable debugging statements in Dynaprof.
-h | --help       Print this message.
-q | --quiet      Do not print version number on startup.
-t | --tty=       Use for input/output by the program being profiled.
-g | --gui        Gui mode, only buffering one line.
-v | --version    Print version information and then exit.
1. The output of the tool is completely unbuffered.
2. All output from any commands after startup is prefaced by an integer representing the number of lines to follow.
3. An extra newline is appended at the end of the above output.
4. All input to the tool is echoed to the screen.
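A front end that consumes output framed according to the four rules above could read one reply at a time as sketched below. This is an illustration only: the function name read_reply is hypothetical, the sample text is invented, and it assumes that the extra newline of rule 3 is not included in the line count of rule 2. Skipping the echoed input of rule 4 is left out.

```python
import io


def read_reply(stream):
    """Read one reply: a line count, that many lines, then one blank line."""
    header = stream.readline()
    if not header:
        return None  # end of stream, no further replies
    count = int(header.strip())
    lines = [stream.readline().rstrip("\n") for _ in range(count)]
    stream.readline()  # consume the extra trailing newline (rule 3)
    return lines


# Invented sample: two replies framed as described above.
sample = io.StringIO(
    "2\nDEFAULT_MODULE\nswim.F\n\n"
    "1\nModule wallclock.so was loaded.\n\n"
)
first = read_reply(sample)
second = read_reply(sample)
print(first)
print(second)
```

Reading an explicit count avoids guessing where a reply ends, which is presumably why the count is emitted before the lines themselves.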
In order to instrument an application with Dynaprof, the user must either load the application into the tool or attach to an already running application. The load command takes one or more arguments. The first argument must be the name of the executable, possibly including a path component. The remaining arguments are simply those that you would pass to the executable as arguments on the command line. Note that glob-style shell expansion is not supported. Upon return from this command, the application will have been created and placed in a stopped state at the first instruction.
Usage: load <executable> [command line arguments]
Example:
(dynaprof) load tests/simple 1 2 3
(dynaprof)
To instrument a threaded application with Dynaprof, the user also uses the load command. Only bound threads, threads that are associated with a kernel thread, are supported at the moment. Some run time environments provide environment variables to control this policy.
On AIX systems, set the following environment variable.
[mucci@nebula]$ setenv AIXTHREAD_SCOPE S
Example:
(dynaprof) poeload tests/mpicount -procs 4
On other systems (Dyninst), parallel applications can be instrumented in two ways.
The first method is for doing interactive performance analysis of only one process of the application. It requires that the parallel run-time system allow the user to start the processes manually. MPICH with the p4 device serves as a good example. By providing the -t option to mpirun, the user can find out the exact commands that need to be run to start the application. The user is then free to start one or more of those processes under an instance of Dynaprof using the mpiload command. The mpiload command is exactly like the load command except that it waits for the process to return from MPI_Init() before allowing the user to perform instrumentation.
Example:
First, have mpirun tell us what it would normally do.
[mucci@nebula]$ mpirun -t -np 2 tests/mpicount
Procgroup file:
localhost.localdomain 0 /home/mucci/work/dynaprof/tests/mpicount
localhost.localdomain 1 /home/mucci/work/dynaprof/tests/mpicount
/home/mucci/work/dynaprof/tests/mpicount -p4pg /home/mucci/work/dynaprof/tests/PI9172 -p4wd /home/mucci/work/dynaprof/tests
ssh localhost.localdomain /home/mucci/work/dynaprof/tests/mpicount -p4pg /home/mucci/work/dynaprof/tests/PI9172 -p4wd /home/mucci/work/dynaprof/tests
Next, start Dynaprof and load in the first process using the arguments from MPI.
(dynaprof) mpiload tests/mpicount /home/mucci/work/dynaprof/tests/mpicount -p4pg /home/mucci/work/dynaprof/tests/PI9172 -p4wd /home/mucci/work/dynaprof/tests
(dynaprof)^Z
[1] Suspended
[mucci@nebula]$
Now start the remote application as mpirun would.
[mucci@nebula]$ ssh localhost.localdomain /home/mucci/work/dynaprof/tests/mpicount -p4pg /home/mucci/work/dynaprof/tests/PI9172 -p4wd /home/mucci/work/dynaprof/tests &
[2] 9283
[mucci@nebula]$ bg
[1] ./dynaprof &
[mucci@nebula]$ fg
(dynaprof)
The second method is for doing batch-mode performance analysis of all the processes of the application. It assumes that the user has made a script file containing the Dynaprof
Example:
[mucci@nebula]$ cat > cpi_drv
mpiload /home/mucci/work/dynaprof/tests/cpi
^D
[mucci@nebula]$ mpirun -np 2 dynaprof -b -c cpi_drv
[mucci@nebula]$
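A command file like cpi_drv can also be generated rather than typed. The sketch below is hypothetical: build_command_file is not part of Dynaprof, the module name cpi.c is invented for illustration, and whether a batch run needs the use, instr and run commands in the file is an assumption, since the example above contains only the mpiload line. The commands themselves are the ones documented in this guide.

```python
def build_command_file(executable, probe="wallclock", module_patterns=()):
    """Return the text of a Dynaprof command file for a batch-mode MPI run."""
    lines = [f"mpiload {executable}", f"use {probe}"]
    # One instr command per module pattern, in the order given.
    lines += [f"instr module {pattern}" for pattern in module_patterns]
    lines.append("run")  # assumption: the file must start execution itself
    return "\n".join(lines) + "\n"


script = build_command_file(
    "/home/mucci/work/dynaprof/tests/cpi",
    module_patterns=["cpi.c"],  # hypothetical module name
)
print(script, end="")
```

The resulting text would be written to a file and passed to dynaprof with the -c option, as in the mpirun line above.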
Usage: attach <executable> <process identifier>
Example:
[mucci@nebula]$ tests/count > /dev/null &
[3] 6327
[mucci@nebula]$ dynaprof
(dynaprof) attach tests/count 6327
(dynaprof)
Unloading an Application
The unload command takes no arguments. If the application is still running, the application will be terminated.
Usage: unload
Usage: detach
Usage: list [module [function]]
Example:
(dynaprof) load tests/swim
(dynaprof) list
DEFAULT_MODULE
swim.F
libm.so.6
libc.so.6
(dynaprof)
DEFAULT_MODULE is something unique to g77. It contains all the Fortran run-time routines. If your application is not compiled with -g, more than likely you'll find your code in the DEFAULT_MODULE. Now let's list the functions found in the module swim.F.
(dynaprof) list swim.F
MAIN__
inital_
calc1_
calc2_
calc3z_
calc3_
(dynaprof)
Now let's list the function calls found in the module swim.F in the MAIN__ routine. Note there is no exit point as main doesn't really ever return. What you see is the entry point followed by numerous calls to Fortran I/O functions. Nestled in between are calls to the user's code.
(dynaprof) list swim.F MAIN__
Entry
Call s_wsle
Call do_lio
Call e_wsle
Call s_wsle
Call do_lio
Call e_wsle
Call inital_
Call s_wsfe
Call do_fio
Call do_fio
Call do_fio
Call do_fio
Call do_fio
Call do_fio
Call do_fio
Call e_wsfe
Call calc1_
Call calc2_
Call s_wsfe
Call do_fio
Call do_fio
Call e_wsfe
Call s_wsfe
Call do_fio
Call do_fio
Call do_fio
Call e_wsfe
Call s_stop
Call calc3z_
Call calc3_
(dynaprof)
Before the user can instrument an application, he must decide what that instrumentation will consist of. There are currently two probes shipped with Dynaprof, the PAPI Probe and the Wallclock Probe. Each probe performs its measurement per-thread. This means that each thread will be counted separately from the others.
The PAPI probe gathers measurements using PAPI, the Performance Application Programming Interface. A full description of the interface is beyond the scope of this document, but it can be found on the PAPI Home Page. Simply put, PAPI uses the processor's hardware performance counters to measure specific hardware events like cache misses, branch mispredictions and floating point instructions. If no argument is specified, the PAPI probe defaults to counting PAPI_FP_INS, that is, floating point instructions. Note that this is very different from counting floating point operations, which is very subjective. Counting hardware events always comes with a caveat: you must know a little about the architecture on which you are running. What is counted as a floating point operation on one architecture may not be counted on another. For example, the fpmv or floating point register move instruction in the IBM Power Architecture is counted as a floating point instruction. If you have any doubts about what the PAPI presets are counting, please see the papi_avail program in the Dynaprof installation directory. It will tell you exactly what PAPI events are available and exactly what they are counting. It is up to you to dig out the processor reference manual to decode the register definitions and understand what you're counting.
Currently, Dynaprof uses PAPI in the user domain. This means that only events that occur in user context will be counted. Other activity on the system will not appreciably affect the counts of most operations except resources that must be flushed and reloaded upon context switches, like caches and TLBs. Note that the PAPI probe also supports multiplexing of counters. That is, if you pass more events than your processor can count at any one given time, PAPI will timeshare the counting hardware to give the illusion that there are far more counters available than actually exist on the hardware. This approach has been shown to work well.
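The idea behind multiplexing can be illustrated with a toy model: each event is counted only during its share of the run, and the raw count is scaled up by the inverse of that share. The sketch below is not PAPI's implementation; the function name scale_multiplexed and all numbers are invented for illustration, and it assumes the event rate is roughly uniform over the run.

```python
def scale_multiplexed(raw_counts, active_time, total_time):
    """Estimate full-run counts from counts taken during part of the run."""
    estimates = {}
    for event, raw in raw_counts.items():
        share = active_time[event] / total_time  # fraction of the run observed
        estimates[event] = raw / share
    return estimates


# Invented numbers: one event observed for half of a 1.0 s run, one for a quarter.
raw = {"PAPI_L1_DCM": 500, "PAPI_L1_ICM": 10}
active = {"PAPI_L1_DCM": 0.5, "PAPI_L1_ICM": 0.25}
print(scale_multiplexed(raw, active, total_time=1.0))
```

The estimate is only as good as the uniformity assumption, so short-lived functions are the ones most likely to be misrepresented when many events are multiplexed.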
Usage: use papiprobe [arg1,arg2,...]
argN can take one of two forms.
1. A PAPI preset event name. It is the user's responsibility to make sure this preset exists on the host architecture. If this event does not exist, the PAPI probe will exit and so will the application.
2. A native event name of the form 0x<hex>@<reg> where <hex> is a hexadecimal event code of the native event and <reg> is the number of the hardware performance register to program.
Example:
Let's use the PAPI probe to count the total number of cycles and the total number of instructions as defined by the PAPI presets.
(dynaprof) use papiprobe PAPI_TOT_CYC, PAPI_TOT_INS
Or let's use the native interface to measure FMAs on FPU 0 and FPU 1 on the Power 3. These correspond to event 11 on counter 4 and event 20 on counter 5.
(dynaprof) use papiprobe 0xb@4, 0x14@5
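To check a native event argument before passing it to the probe, the 0x<hex>@<reg> form can be decoded as sketched here. The helper parse_native_event and the NativeEvent type are hypothetical, and the sketch assumes that <reg> is written in decimal, which the description above does not state.

```python
import re
from typing import NamedTuple


class NativeEvent(NamedTuple):
    code: int      # native event code, decoded from hexadecimal
    register: int  # hardware performance register to program


_NATIVE = re.compile(r"0x([0-9a-fA-F]+)@([0-9]+)")


def parse_native_event(arg):
    """Decode one argument of the form 0x<hex>@<reg>."""
    match = _NATIVE.fullmatch(arg.strip())
    if match is None:
        raise ValueError(f"not a native event specification: {arg!r}")
    return NativeEvent(int(match.group(1), 16), int(match.group(2)))


for arg in "0xb@4, 0x14@5".split(","):
    print(parse_native_event(arg))
```

Decoding the two arguments from the example gives event 11 on register 4 and event 20 on register 5, matching the description of the FMA events.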
The Wallclock probe takes no arguments. It very simply measures elapsed real-time which is sometimes referred to as wallclock time. It does this using the highest resolution and lowest latency real time clock available on the host architecture. The output units are in microseconds.
Usage: use wallclock
Instrumenting the Application
Dynaprof inserts instrumentation directly into the application's address
space. This is accomplished through a run-time code generation and patching
mechanism based upon either Dyninst or DPCL, IBM's derivative effort. Whenever
a function is instrumented, all its children are instrumented as well.
This enables the probe to generate both inclusive and exclusive metrics.
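The usual relationship between the two kinds of metric is that a function's exclusive value is its inclusive value minus the inclusive values of its direct children, which is why the children must be measured too. The sketch below illustrates that definition with invented numbers; it is not taken from the probes' source, the helper exclusive_totals is hypothetical, and it assumes each function has a single caller.

```python
def exclusive_totals(inclusive, children):
    """Subtract each function's direct children from its inclusive total."""
    exclusive = {}
    for name, total in inclusive.items():
        child_sum = sum(inclusive[child] for child in children.get(name, ()))
        exclusive[name] = total - child_sum
    return exclusive


# Toy call tree with invented totals; every function here has exactly one caller.
inclusive = {"main": 100, "slowstuff": 70, "quickstuff": 20, "sleep": 65}
children = {"main": ["slowstuff", "quickstuff"], "slowstuff": ["sleep"]}
exclusive = exclusive_totals(inclusive, children)
print(exclusive)
```

In this toy tree the exclusive values add up to the inclusive total of main, a useful sanity check when reading a profile.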
Usage: instr
Usage: instr module <module_pattern>
Usage: instr function <module> <function_pattern>
The instr command has three forms. The first form, the command by itself, simply prints out the previously instrumented points. The second form instruments all functions inside any modules that match the glob-style pattern. The third form instruments only the matching functions inside a specific module.
Example:
First let's load the fpsx application and enable the use of the Wallclock probe.
(dynaprof) load tests/fspx
(dynaprof) use wallclock
Module wallclock.so was loaded.
Now let's see what's inside.
(dynaprof) list
DEFAULT_MODULE
eos.F
phase.F
setup.F
update.F
supmain.F
io.F
properties.F
solveT.F
libm.so.6
libc.so.6
Ok, let's examine the interesting ones.
(dynaprof) list solveT.F
tinmush_
tinsol_
tinvoid_
(dynaprof) list update.F
akw.1
proflux_
flux_
pde_
Ok, let's instrument the entire solver module first.
(dynaprof) instr module solveT.F
solveT.F, inserted 3 instrumentation points
Now let's just instrument all the flux computation routines.
(dynaprof) instr function update.F *flux_
update.F, inserted 2 instrumentation points
Finally let's look at all the instrumented functions.
(dynaprof) instr
tinmush_
tinsol_
tinvoid_
proflux_
flux_
Looks good. We're ready to continue with execution.
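The effect of a glob-style pattern can be previewed outside the tool. The sketch below uses Python's fnmatch module as a stand-in, on the assumption that Dynaprof's matching behaves like ordinary shell globbing; the helper matching_functions is hypothetical, and the function list is the one printed for update.F above.

```python
from fnmatch import fnmatchcase


def matching_functions(functions, pattern):
    """Return the functions whose names match a glob-style pattern."""
    return [name for name in functions if fnmatchcase(name, pattern)]


update_f = ["akw.1", "proflux_", "flux_", "pde_"]
selected = matching_functions(update_f, "*flux_")
print(selected)
```

The pattern selects proflux_ and flux_, which agrees with the two instrumentation points reported for update.F.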
To begin an application, the user issues the run command. For attached applications, the run command is functionally equivalent to the continue command as described below.
Usage: run
Usage: ^C
To resume execution, one simply issues the continue command.
Usage: continue
Example: (input from the user is in BOLD)
(dynaprof) load tests/input
(dynaprof) run
input the order of the matrix, 0 to exit
1000
^C
Program received signal SIGINT, Interrupt.
Program stopped.
(dynaprof) continue
norm. resid resid machep x(1) x(n)
8.77321770E+00 3.89573462E-12 2.22044605E-16 1.00000000E+00 1.00000000E+00
times are reported for matrices of order 1000
factor solve total mflops
times for array with leading dimension of1001
1.579E+05 3.000E+02 1.582E+05 4.227E-03
input the order of the matrix, 0 to exit
0
Program exited normally.
Dynaprof does not enforce the manner in which each probe generates its output. Because no such restrictions are placed on the probe modules, the probe designer is free to choose whatever output format is most appropriate, be that a real-time binary data feed to a visualization engine or a static data file dumped to disk at the end of the run. The probes included with Dynaprof write the collected data to disk either when the application finishes or when the user explicitly sends the application a SIGHUP signal. This signal causes the probe module to flush the data to disk. Note that this data will be overwritten at the end of the run, so it is recommended that the user copy this data to a new file as soon as the flush has been performed. Currently, both the PAPI probe and the Wallclock probe produce a compact file consisting of encoded ASCII data. The data files are created in the directory where the application exists. Each probe prints a message to this effect when the probe is first initialized. The files are named <executable.pid>, where pid is the process identifier. For multithreaded applications, each thread generates a data file of the form <executable.pid.tid>, where tid is the thread identifier.
Example:
(dynaprof) use wallclock
Module wallclock.so was loaded.
(dynaprof) instr module simple.c
simple.c, inserted 3 instrumentation points
(dynaprof) run
output goes to /home/mucci/dynaprof/tests/simple.8874
In main()
In quickstuff()
In quickstuff()
In slowstuff()
Program exited normally.
Example of Multithreaded Operation:
(dynaprof) load tests/pthread_count
(dynaprof) use wallclock
Module wallclocksmp.so was loaded.
(dynaprof) list
DEFAULT_MODULE
pcount.c
libc.so.6
(dynaprof) instr module pcount.c
pcount.c, inserted 4 instrumentation points
(dynaprof) run
output goes to /home/mucci/dynaprof/tests/pthread_count.8885.1024
output goes to /home/mucci/dynaprof/tests/pthread_count.8887.1026
output goes to /home/mucci/dynaprof/tests/pthread_count.8888.2051
Program exited normally.
This data is then interpreted, formatted and displayed by the reporting scripts included with the Dynaprof distribution. There are two scripts, one for each probe. Each script takes one argument, the name of the data file to process.
Usage: wallclockrpt <Wallclock data file>
Usage: papiproberpt <PAPI probe data file>
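When many data files accumulate, the naming scheme described above can be used to sort them by process and thread before running the reporting scripts. The sketch below is a hypothetical helper, not part of the distribution. Because an executable name may itself contain dots and digits, the caller has to say whether a thread identifier is expected.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class ProbeFile(NamedTuple):
    executable: str
    pid: int
    tid: Optional[int]  # None for single-threaded data files


def parse_probe_file(path, threaded=False):
    """Split <executable.pid> or <executable.pid.tid> into its parts."""
    name = Path(path).name
    expected = 3 if threaded else 2
    parts = name.rsplit(".", expected - 1)
    if len(parts) != expected or not all(p.isdigit() for p in parts[1:]):
        raise ValueError(f"unexpected data file name: {name!r}")
    tid = int(parts[2]) if threaded else None
    return ProbeFile(parts[0], int(parts[1]), tid)


print(parse_probe_file("/home/mucci/dynaprof/tests/simple.8874"))
print(parse_probe_file("pthread_count.8885.1024", threaded=True))
```

The two calls use file names from the examples above and yield the executable name together with the process and thread identifiers.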
Wallclock Probe Output
Let's take the output from the above run of the simple test program
and see what it looks like.
Example:
[mucci@nebula]$ wallclockrpt tests/simple.8874
Exclusive Profile.
Name          Percent   Total     Calls
------------- -------   -----     --------
TOTAL         100       1.442e+10 1
unknown       100       1.442e+10 1
main          0.0001598 2.305e+04 1
quickstuff    0.0001001 1.444e+04 2
slowstuff     9.211e-05 1.328e+04 1
Inclusive Profile.
Name          Percent   Total     SubCalls
------------- -------   -----     --------
TOTAL         100       1.442e+10 0
main          99.98     1.442e+10 5
slowstuff     83.22     1.2e+10   2
quickstuff    16.76     2.417e+09 4
1-Level Inclusive Call Tree.
Parent/-Child Percent   Total     Calls
------------- -------   -----     --------
TOTAL         100       1.442e+10 1
quickstuff    100       2.417e+09 2
- unknown     0.001276  3.084e+04 2
- sleep       100       2.417e+09 2
slowstuff     100       1.2e+10   1
- unknown     0.0002016 2.419e+04 1
- sleep       100       1.2e+10   1
main          100       1.442e+10 1
- unknown     0.0001381 1.992e+04 1
- quickstuff  8.366     1.206e+09 1
- quickstuff  8.398     1.211e+09 1
- slowstuff   83.24     1.2e+10   1
- exit        0         0         1
The PAPI Probe reporting script prints out a header containing machine information and then possibly multiple profiles resembling the output from the Wallclock Probe. Let's instrument the swim application, a popular shallow water benchmark, to measure Level 1 Instruction and Level 1 Data Cache Misses.
Example:
(dynaprof) load tests/swim
(dynaprof) use probes/papiprobe PAPI_L1_DCM, PAPI_L1_ICM
(dynaprof) instr function swim.F calc*
swim.F, inserted 8 instrumentation points
(dynaprof) run
papiprobe: output goes to /home/mucci/work/dynaprof/tests/swim.7366
SPEC benchmark 102.swim
NUMBER OF POINTS IN THE X DIRECTION 512
NUMBER OF POINTS IN THE Y DIRECTION 512
GRID SPACING IN THE X DIRECTION 25000.
GRID SPACING IN THE Y DIRECTION 25000.
TIME STEP 20.
TIME FILTER PARAMETER 0.001
NUMBER OF ITERATIONS 120
CYCLE NUMBER 60 MODEL TIME IN HOURS 0.33
Pcheck = 0.1314E+11
Ucheck = 0.5215E+05
Vcheck = 0.5215E+05
CYCLE NUMBER 120 MODEL TIME IN HOURS 0.67
Pcheck = 0.1314E+11
Ucheck = 0.5215E+05
Vcheck = 0.5215E+05
Program exited normally.
Now let's visualize the data.
[mucci@nebula]$ probes/papiproberpt /home/mucci/work/dynaprof/tests/swim.7366 > out
Output file : /home/mucci/work/dynaprof/tests/swim.7366
Option string : PAPI_L1_DCM,PAPI_L1_ICM
Processor : 1198 Mhz GenuineIntel Intel Pentium III rev 0x1 (1-way)
Total metrics measured : 2
Metric 1: : PAPI_L1_DCM, Level 1 data cache misses (Native 0x45,0x45)
Metric 2: : PAPI_L1_ICM, Level 1 instruction cache misses (Native 0xf28,0xf28)
Total functions : 4
Exclusive Profile of Metric PAPI_L1_DCM.
Name          Percent   Total     Calls
------------- -------   -----     --------
TOTAL         100       5.155e+08 1
calc3_        52.73     2.718e+08 118
calc2_        38.52     1.986e+08 120
calc1_        8.086     4.168e+07 120
unknown       0.3937    2.03e+06  1
calc3z_       0.2722    1.403e+06 1
Inclusive Profile of Metric PAPI_L1_DCM.
Name          Percent   Total     SubCalls
------------- -------   -----     --------
TOTAL         100       5.155e+08 0
calc3_        52.73     2.718e+08 0
calc2_        38.52     1.986e+08 0
calc1_        8.086     4.168e+07 0
calc3z_       0.2722    1.403e+06 0
1-Level Inclusive Call Tree of Metric PAPI_L1_DCM.
Parent/-Child Percent   Total     Calls
------------- -------   -----     --------
TOTAL         100       5.155e+08 1
calc1_        100       4.168e+07 120
calc2_        100       1.986e+08 120
calc3z_       100       1.403e+06 1
calc3_        100       2.718e+08 118
Exclusive Profile of Metric PAPI_L1_ICM.
Name          Percent   Total     Calls
------------- -------   -----     --------
TOTAL         100       9.916e+04 1
unknown       29.52     2.927e+04 1
calc2_        24.01     2.381e+04 120
calc1_        23.5      2.331e+04 120
calc3_        22.87     2.268e+04 118
calc3z_       0.09378   93        1
Inclusive Profile of Metric PAPI_L1_ICM.
Name          Percent   Total     SubCalls
------------- -------   -----     --------
TOTAL         100       9.916e+04 0
calc2_        24.01     2.381e+04 0
calc1_        23.5      2.331e+04 0
calc3_        22.87     2.268e+04 0
calc3z_       0.09378   93        0
1-Level Inclusive Call Tree of Metric PAPI_L1_ICM.
Parent/-Child Percent   Total     Calls
------------- -------   -----     --------
TOTAL         100       9.916e+04 1
calc1_        100       2.331e+04 120
calc2_        100       2.381e+04 120
calc3z_       100       93        1
calc3_        100       2.268e+04 118
PAPI Home Page
http://icl.cs.utk.edu/projects/papi
PMToolkit Home Page
http://www.alphaworks.ibm.com/tech/pmapi
Perfctr Download Page
http://www.csd.uu.se/~mikpe/linux/perfctr
DPCL Project Web Site
http://oss.software.ibm.com/developerworks/opensource/dpcl
DYNINST API Home Page
In addition, please make sure the following environment variables are set appropriately.
LD_LIBRARY_PATH must be set to a colon separated list of directories where the GNU run-time linker can find the following shared libraries. These libraries are included in the usr/lib directory of the Dynaprof distribution.
libpapi.so
libperfctr.so
libdyninstAPI.so
libdyninstAPI_RT.so
DYNINSTAPI_RT_LIB must be set to the fully qualified filename of the libdyninstAPI_RT.so shared library. As mentioned above, this library is included with the Dynaprof distribution.
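Both settings can be checked before starting Dynaprof. The sketch below is a hypothetical pre-flight check, not part of the distribution: check_environment and REQUIRED_LIBS are invented names, and the check only looks for files with exactly the names listed above, whereas the real run-time linker may also consult other locations.

```python
import os
from pathlib import Path

# Library names taken from the list above.
REQUIRED_LIBS = (
    "libpapi.so",
    "libperfctr.so",
    "libdyninstAPI.so",
    "libdyninstAPI_RT.so",
)


def check_environment(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    dirs = [Path(d) for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
    for lib in REQUIRED_LIBS:
        if not any((d / lib).is_file() for d in dirs):
            problems.append(f"{lib} not found in LD_LIBRARY_PATH")
    rt_lib = env.get("DYNINSTAPI_RT_LIB", "")
    if not rt_lib or not Path(rt_lib).is_file():
        problems.append("DYNINSTAPI_RT_LIB does not name an existing file")
    return problems


if __name__ == "__main__":
    for problem in check_environment(os.environ):
        print(problem)
```

An empty result means every listed library was found and DYNINSTAPI_RT_LIB points at an existing file.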
11/14/02 mucci@cs.utk.edu
-------------------------
- xdynaprof does not detect load/attach failures
- xdynaprof windows jump around
7/31/02 mucci@cs.utk.edu
------------------------
- The probe reporting scripts should make sure they only take one argument.
- The SIGHUP flush feature of papiprobe has been broken by the new header written to disk.
7/29/02 mucci@cs.utk.edu
------------------------
- instr of the same functions does not return error
- The -tty argument does not work. The stdin/stdout of dynaprof itself is redirected instead of just the application.
- Multiple instr/interrupt/instr cycles do not reinitialize the probes properly.
7/16/02 mucci@cs.utk.edu
------------------------
- This release only works with GNU Make