Performance properties
Time
- Description:
-
Total time spent for program execution, including the idle times of CPUs
reserved for worker threads during OpenMP sequential execution. This
pattern assumes that every thread of a process has a separate CPU allocated
during the entire runtime of the process. Executions in a time-shared
environment will also include time slices used by other processes.
Over-subscription of processor cores (e.g., exploiting hardware threads)
will also manifest as additional CPU allocation time.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree hierarchy to break down total time into
constituent parts which will help determine how much of it is due to
local/serial computation versus MPI, OpenMP, or POSIX thread parallelization
costs, and how much of that time is wasted waiting for other processes
or threads due to ineffective load balance or due to insufficient
parallelism.
-
Expand the call tree to identify important callpaths and routines where
most time is spent, and examine the times for each process or thread to
locate load imbalance.
- Parent metric:
- None
- Sub-metrics:
-
Execution Time
Overhead Time
OpenMP Idle Threads Time
Visits
- Description:
-
Number of times a call path has been visited. Visit counts for MPI
routine call paths directly relate to the number of MPI Communication Operations and
MPI Synchronization Operations. Visit counts for OpenMP operations and parallel regions
(loops) directly relate to the number of times they were executed.
Routines which were not instrumented, or were filtered during measurement,
do not appear on recorded call paths. Similarly, routines are not shown
if the compiler optimizer successfully in-lined them prior to automatic
instrumentation.
- Unit:
- Counts
- Diagnosis:
-
Call paths that are frequently visited (and thereby have high exclusive
Visit counts) can be expected to have an important role in application
execution performance (e.g., Execution Time). Very frequently executed
routines, which are relatively short and quick to execute, may have an
adverse impact on measurement quality. This can be due to
instrumentation preventing in-lining and other compiler optimizations
and/or overheads associated with measurement such as reading timers and
hardware counters on routine entry and exit. When such routines consist
solely of local/sequential computation (i.e., neither communication nor
synchronization), they should be eliminated to improve the quality of
the parallel measurement and analysis. One approach is to specify the
names of such routines in a filter file for subsequent
measurements to ignore, and thereby considerably reduce their
measurement impact. Alternatively, selective instrumentation
can be employed to entirely avoid instrumenting such routines and
thereby remove all measurement impact. In both cases, uninstrumented
and filtered routines will not appear in the measurement and analysis,
much as if they had been "in-lined" into their calling routine.
- Parent metric:
- None
- Sub-metrics:
-
None
Execution Time
(only available after remapping)
- Description:
-
Time spent on program execution, excluding the idle times of worker
threads during OpenMP sequential execution and time spent on tasks
related to trace generation. Includes time blocked in system calls
(e.g., waiting for I/O to complete) and processor stalls (e.g.,
memory accesses).
- Unit:
- Seconds
- Diagnosis:
-
A low fraction of execution time indicates a suboptimal measurement
configuration leading to trace buffer flushes (see Overhead Time) or
inefficient usage of the available hardware resources (see
OpenMP Idle Threads Time).
- Parent metric:
- Time
- Sub-metrics:
-
Computation Time
MPI Time
OpenMP Time
POSIX Threads Time
OpenACC Time
OpenCL Time
CUDA Time
HIP Time
Overhead Time
(only available after remapping)
- Description:
-
Time spent performing major tasks related to measurement, such as
creation of the experiment archive directory, clock synchronization, or
dumping trace buffer contents to a file. Note that normal per-event
overheads, such as event acquisition, reading timers and hardware
counters, runtime call-path summarization, and storage in trace buffers,
are not included.
- Unit:
- Seconds
- Diagnosis:
-
Significant measurement overheads are typically incurred when
measurement is initialized (e.g., in the program main routine
or MPI_Init) and finalized (e.g., in MPI_Finalize),
and are generally unavoidable. While they extend the total (wallclock)
time for measurement, when they occur before parallel execution starts
or after it completes, the quality of measurement of the parallel
execution is not degraded. Trace file writing overhead time can be
kept to a minimum by specifying an efficient parallel filesystem (when
provided) for the experiment archive (e.g.,
SCOREP_EXPERIMENT_DIRECTORY=/work/mydir).
-
When measurement overhead is reported for other call paths, especially
during parallel execution, measurement perturbation is considerable and
interpretation of the resulting analysis much more difficult. A common
cause of measurement overhead during parallel execution is the flushing
of full trace buffers to disk: warnings issued by the measurement
system indicate when this occurs. When flushing occurs simultaneously
for all processes and threads, the associated perturbation is
localized in time. More often, however, buffer filling and flushing occur
independently at different times on each process/thread, and the
resulting perturbation is extremely disruptive, often forming a
catastrophic chain reaction. It is highly advisable to avoid
intermediate trace buffer flushes by appropriate instrumentation and
measurement configuration, such as specifying a filter file
listing purely computational routines (classified as type USR by
scorep-score -r) or an adequate trace buffer size
(SCOREP_TOTAL_MEMORY larger than max_buf reported by
scorep-score). If the maximum trace buffer capacity requirement
remains too large for a full-size measurement, it may be necessary to
configure the subject application with a smaller problem size or to
perform fewer iterations/timesteps to shorten the measurement (and
thereby reduce the size of the trace).
- Parent metric:
- Time
- Sub-metrics:
-
None
Computation Time
(only available after remapping)
- Description:
-
Time spent in computational parts of the application, excluding
communication and synchronization overheads of parallelization
libraries/language extensions such as MPI, OpenMP, or POSIX threads.
- Unit:
- Seconds
- Diagnosis:
-
Expand the call tree to determine important callpaths and routines
where most computation time is spent, and examine the time for
each process or thread on those callpaths looking for significant
variations which might indicate the origin of load imbalance.
-
Where computation on particular processes/threads is unexpectedly
slow, profiling with PAPI preset or platform-specific hardware counters
may help to understand the origin. Serial program profiling tools
(e.g., gprof) may also be helpful. Generally, compiler optimization
flags and optimized libraries should be investigated to improve serial
performance, and where necessary alternative algorithms employed.
- Parent metric:
- Execution Time
- Sub-metrics:
-
OpenCL Kernel Time
OpenMP Target Kernel Time
CUDA Kernel Time
HIP Kernel Time
MPI Time
(only available after remapping)
- Description:
-
Time spent in (instrumented) MPI calls. Note that depending on the
setting of the SCOREP_MPI_ENABLE_GROUPS environment variable,
certain classes of MPI calls may have been excluded from measurement and
therefore do not show up in the analysis report.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree to determine which classes of MPI operation
contribute the most time. Typically the remaining (exclusive) MPI Time,
corresponding to instrumented MPI routines that are not in one of the
child classes, will be negligible.
- Parent metric:
- Execution Time
- Sub-metrics:
-
MPI Management Time
MPI Synchronization Time
MPI Communication Time
MPI File I/O Time
MPI Management Time
(only available after remapping)
- Description:
-
Time spent in MPI calls related to management operations, such as MPI
initialization and finalization, opening/closing of files used for MPI
file I/O, or creation/deletion of various handles (e.g., communicators
or RMA windows).
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree to determine which classes of MPI management
operation contribute the most time. While some management costs are
unavoidable, others can be decreased by improving load balance or reusing
existing handles rather than repeatedly creating and deleting them.
- Parent metric:
- MPI Time
- Sub-metrics:
-
MPI Init/Finalize Time
MPI Communicator Management Time
MPI File Management Time
MPI Window Management Time
MPI Init/Finalize Time
(only available after remapping)
- Description:
-
Time spent in MPI calls regarding initialization and finalization, i.e.,
MPI_Init or MPI_Init_thread and MPI_Finalize
(world model), as well as MPI_Session_init and
MPI_Session_finalize (sessions model). Also covered are
related query functions.
- Unit:
- Seconds
- Diagnosis:
-
These are unavoidable one-off costs for MPI parallel programs, which
can be expected to increase for larger numbers of processes. Some
applications may not use all of the processes provided (or not use some
of them for the entire execution), such that unused and wasted
processes wait in MPI_Finalize for the others to finish. If
the proportion of time in these calls is significant, it is probably
more effective to use a smaller number of processes (or a larger amount
of computation).
- Parent metric:
- MPI Management Time
- Sub-metrics:
-
None
MPI Communicator Management Time
(only available after remapping)
- Description:
-
Time spent in MPI Communicator management routines such as creating and
freeing communicators, Cartesian and graph topologies, and getting or
setting communicator attributes.
- Unit:
- Seconds
- Diagnosis:
-
There can be significant time in collective operations such as
MPI_Comm_create, MPI_Comm_free and
MPI_Cart_create that are considered neither explicit
synchronization nor communication, but result in implicit barrier
synchronization of participating processes. Avoidable waiting time
for these operations will be reduced if all processes execute them
simultaneously. If these are repeated operations, e.g., in a loop,
it is worth investigating whether their frequency can be reduced by
re-use.
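-
As a minimal sketch (function and variable names are purely illustrative),
creating a sub-communicator once outside an iteration loop and freeing it
once afterwards avoids paying the collective management cost in every iteration:
  #include <mpi.h>
  /* Hypothetical solver loop: split once, reuse the communicator, free once. */
  void solver_loop(int my_row, int my_rank, int num_iters)
  {
      MPI_Comm row_comm;
      MPI_Comm_split(MPI_COMM_WORLD, my_row, my_rank, &row_comm);
      for (int iter = 0; iter < num_iters; ++iter) {
          /* ... per-iteration communication on row_comm ... */
      }
      MPI_Comm_free(&row_comm);
  }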
- Parent metric:
- MPI Management Time
- Sub-metrics:
-
None
MPI File Management Time
(only available after remapping)
- Description:
-
Time spent in MPI file management routines such as opening, closing,
deleting, or resizing files, seeking, syncing, and setting or retrieving
file parameters or the process's view of the data in the file.
- Unit:
- Seconds
- Diagnosis:
-
Collective file management calls (see MPI Collective File Operations) may suffer from
wait states due to load imbalance. Examine the times spent in collective
management routines for each process and try to distribute the preceding
computation from processes with the shortest times to those with the
longest times.
- Parent metric:
- MPI Management Time
- Sub-metrics:
-
None
MPI Window Management Time
(only available after remapping)
- Description:
-
Time spent in MPI window management routines such as creating and freeing
memory windows and getting or setting window attributes.
- Unit:
- Seconds
- Parent metric:
- MPI Management Time
- Sub-metrics:
-
None
MPI Synchronization Time
(only available after remapping)
- Description:
-
Time spent in MPI explicit synchronization calls, such as barriers and
remote memory access window synchronization. Time in point-to-point
message transfers with no payload data used for coordination is currently
part of MPI Point-to-point Communication Time.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree further to determine the proportion of time in
different classes of MPI synchronization operations. Expand the
calltree to identify which callpaths are responsible for the most
synchronization time. Also examine the distribution of synchronization
time on each participating process for indication of load imbalance in
preceding code.
- Parent metric:
- MPI Time
- Sub-metrics:
-
MPI Collective Synchronization Time
MPI One-sided Synchronization Time
MPI Collective Synchronization Time
(only available after remapping)
- Description:
-
Total time spent in MPI barriers.
- Unit:
- Seconds
- Diagnosis:
-
When the time for MPI explicit barrier synchronization is significant,
expand the call tree to determine which MPI_Barrier calls are
responsible, and compare with their Visits count to see how
frequently they were executed. Barrier synchronizations which are not
necessary for correctness should be removed. It may also be appropriate
to use a communicator containing fewer processes, or a number of
point-to-point messages for coordination instead. Also examine the
distribution of time on each participating process for indication of
load imbalance in preceding code.
- Parent metric:
- MPI Synchronization Time
- Sub-metrics:
-
None
MPI Communication Time
(only available after remapping)
- Description:
-
Time spent in MPI communication calls, including point-to-point,
collective, and one-sided communication.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree further to determine the proportion of time in
different classes of MPI communication operations. Expand the calltree
to identify which callpaths are responsible for the most communication
time. Also examine the distribution of communication time on each
participating process for indication of communication imbalance or load
imbalance in preceding code.
- Parent metric:
- MPI Time
- Sub-metrics:
-
MPI Point-to-point Communication Time
MPI Collective Communication Time
MPI One-sided Communication Time
MPI Point-to-point Communication Time
(only available after remapping)
- Description:
-
Total time spent in MPI point-to-point communication calls. Note that
this covers only the time in the sending and receiving calls themselves,
not the message transmission time.
- Unit:
- Seconds
- Diagnosis:
-
Investigate whether communication time is commensurate with the number
of MPI Bytes Transferred. Consider replacing blocking
communication with non-blocking communication that can potentially be
overlapped with computation, or using persistent communication to
amortize message setup costs for common transfers. Also consider the
mapping of processes onto compute resources, especially if there are
notable differences in communication time for particular processes,
which might indicate longer/slower transmission routes or network
congestion.
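-
As an illustrative sketch (buffer, count, and neighbor rank variables as well as
the compute functions are hypothetical), a blocking exchange can be restructured
so that independent computation overlaps with the message transfer:
  #include <mpi.h>
  /* Start the transfers, do work that does not need the incoming data,
     then complete the transfers before using the received buffer. */
  MPI_Request reqs[2];
  MPI_Irecv(recv_buf, count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
  MPI_Isend(send_buf, count, MPI_DOUBLE, right, 0, comm, &reqs[1]);
  compute_interior();                        /* independent of recv_buf */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  compute_boundary(recv_buf);                /* needs the received data */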
- Parent metric:
- MPI Communication Time
- Sub-metrics:
-
None
MPI Collective Communication Time
(only available after remapping)
- Description:
-
Total time spent in MPI collective communication calls.
- Unit:
- Seconds
- Diagnosis:
-
As the number of participating MPI processes
increases (i.e., ranks in MPI_COMM_WORLD or a subcommunicator),
time in collective communication can be expected to increase
correspondingly. Part of the increase will be due to additional data
transmission requirements, which are generally similar for all
participants. A significant part is typically time some (often many)
processes are blocked waiting for the last of the required participants
to reach the collective operation. This may be indicated by significant
variation in collective communication time across processes, but is
most conclusively quantified from the child metrics determinable via
automatic trace pattern analysis.
-
Since the basic transmission cost per byte for collectives can be relatively high,
it may be beneficial to combine several collective operations of the same type,
each with a small amount of data (e.g., a single value per rank), into fewer
operations with larger payloads using either a vector/array of values or an
aggregate datatype. (Overdoing this and aggregating very large message payloads
is counter-productive due to explicit and implicit memory requirements, and MPI
protocol switches for messages larger than an eager transmission threshold.)
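-
For illustration (variable names are hypothetical), three reductions of single
values can be combined into one reduction over a small array:
  #include <mpi.h>
  /* Instead of three MPI_Allreduce calls, each on a single double ... */
  double vals[3] = { local_residual, local_energy, local_mass };
  double sums[3];
  /* ... one collective reduces all three values at once. */
  MPI_Allreduce(vals, sums, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);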
-
MPI implementations generally provide optimized collective communication operations;
however, in rare cases it may be appropriate to replace a collective
communication operation provided by the MPI implementation with a
customized implementation of your own using point-to-point operations.
For example, certain MPI implementations of MPI_Scan include
unnecessary synchronization of all participating processes, or
asynchronous variants of collective operations may be preferable to
fully synchronous ones where they permit overlapping of computation.
- Parent metric:
- MPI Communication Time
- Sub-metrics:
-
None
MPI File I/O Time
(only available after remapping)
- Description:
-
Time spent in MPI file I/O calls.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree further to determine the proportion of time in
different classes of MPI file I/O operations. Expand the calltree to
identify which callpaths are responsible for the most file I/O time.
Also examine the distribution of MPI file I/O time on each process for
indication of load imbalance. Use a parallel filesystem (such as
/work) when possible, and check that appropriate hints values
have been associated with the MPI_Info object of MPI files.
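-
The following sketch shows how hints can be attached when opening an MPI file;
the hint keys shown are reserved by the MPI standard, but whether and how they
take effect depends on the MPI implementation and filesystem, and the file path
is only an example:
  #include <mpi.h>
  MPI_Info info;
  MPI_File fh;
  MPI_Info_create(&info);
  MPI_Info_set(info, "striping_factor", "16");
  MPI_Info_set(info, "collective_buffering", "true");
  MPI_File_open(MPI_COMM_WORLD, "/work/mydir/data.out",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
  MPI_Info_free(&info);
  /* ... file I/O on fh ... */
  MPI_File_close(&fh);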
- Parent metric:
- MPI Time
- Sub-metrics:
-
MPI Individual File I/O Time
MPI Collective File I/O Time
MPI Individual File I/O Time
(only available after remapping)
- Description:
-
Time spent in individual MPI file I/O calls.
- Unit:
- Seconds
- Diagnosis:
-
Expand the calltree to identify which callpaths are responsible for the
most individual file I/O time. When multiple processes read and write
to files, MPI collective file reads and writes can be more efficient.
Examine the number of MPI Individual File Read Operations and MPI Individual File Write Operations to
locate potential opportunities for collective I/O.
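-
As a sketch (file handle, buffer, and size variables are hypothetical), when every
process writes its own block of a shared file, the collective variant of the write
call allows the MPI library to merge and optimize the requests:
  #include <mpi.h>
  MPI_Offset offset = (MPI_Offset)rank * block_elems * sizeof(double);
  /* individual: MPI_File_write_at(fh, offset, buf, block_elems, MPI_DOUBLE, &status); */
  MPI_File_write_at_all(fh, offset, buf, block_elems, MPI_DOUBLE, &status);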
- Parent metric:
- MPI File I/O Time
- Sub-metrics:
-
None
MPI Collective File I/O Time
(only available after remapping)
- Description:
-
Time spent in collective MPI file I/O calls.
- Unit:
- Seconds
- Diagnosis:
-
Expand the calltree to identify which callpaths are responsible for the
most collective file I/O time. Examine the distribution of times on
each participating process for indication of imbalance in the operation
itself or in preceding code. Examine the number of MPI Collective File Read Operations
and MPI Collective File Write Operations done by each process as a possible origin of
imbalance. Where asynchrony or imbalance prevents effective use of
collective file I/O, individual (i.e., non-collective) file I/O may be
preferable.
- Parent metric:
- MPI File I/O Time
- Sub-metrics:
-
None
OpenMP Idle Threads Time
(only available after remapping)
- Description:
-
Idle time on CPUs that may be reserved for teams of threads when the
process is executing sequentially before and after OpenMP parallel
regions, or with fewer than the full team of threads within OpenMP parallel
regions.
-
- Unit:
- Seconds
- Diagnosis:
-
On shared compute resources, unused threads may simply sleep and allow
the resources to be used by other applications; however, on dedicated
compute resources (or where unused threads busy-wait and thereby occupy
the resources) their idle time is charged to the application.
According to Amdahl's Law, the fraction of inherently serial execution
time limits the effectiveness of employing additional threads to reduce
the execution time of parallel regions. Where the Idle Threads Time is
significant, total Time (and wall-clock execution time) may be
reduced by effective parallelization of sections of code which execute
serially. Alternatively, the proportion of wasted Idle Threads Time
will be reduced by running with fewer threads, albeit resulting in a
longer wall-clock execution time but more effective usage of the
allocated compute resources.
- Parent metric:
- Time
- Sub-metrics:
-
OpenMP Limited Parallelism Time
OpenMP Limited Parallelism Time
(only available after remapping)
- Description:
-
Idle time on CPUs that may be reserved for threads within OpenMP
parallel regions where not all of the thread team participates.
-
- Unit:
- Seconds
- Diagnosis:
-
Code sections marked as OpenMP parallel regions which are executed
serially (i.e., only by the master thread) or by fewer than the full
team of threads can result in allocated but unused compute resources
being wasted. Typically this arises from insufficient work being
available within the marked parallel region to productively employ all
threads. This may be because the loop contains too few iterations or
the OpenMP runtime has determined that additional threads would not be
productive. Alternatively, the OpenMP omp_set_num_threads API
or num_threads or if clauses may have been explicitly
specified, e.g., to reduce parallel execution overheads such as
OpenMP Synchronization Time. If the proportion of
OpenMP Limited Parallelism Time is significant, it may be more
efficient to run with fewer threads for that problem size.
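-
A minimal sketch of such clauses (loop bounds, arrays, and the threshold are
hypothetical) that makes the trade-off explicit in the source code:
  /* Only fork a team when there is enough work, and cap the team size
     where additional threads would not be productive. */
  #pragma omp parallel for if(n > 10000) num_threads(4)
  for (int i = 0; i < n; ++i)
      a[i] = f(i);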
- Parent metric:
- OpenMP Idle Threads Time
- Sub-metrics:
-
None
OpenMP Time
(only available after remapping)
- Description:
-
Time spent in OpenMP API calls and code generated by the OpenMP compiler.
In particular, this includes thread team management and synchronization
activities.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree to determine which classes of OpenMP activities
contribute the most time.
- Parent metric:
- Execution Time
- Sub-metrics:
-
OpenMP Target Time
OpenMP Synchronization Time
OpenMP Flush Time
OpenMP Target Time
(only available after remapping)
- Description:
-
Time spent in OpenMP target related API calls and target code generated by the compiler.
In particular, this includes memory management, synchronization and kernel launches.
Note:
Some OpenMP runtimes handle the events in this metric tree asynchronously to reduce overhead and more
accurately describe the runtime behavior. A dispatched event will then correspond to the time it took
to add the event to the target architecture's native event queue (e.g., a CUDA stream). Since some events
require synchronization first, this can lead to high measurement times in an unexpected place
(e.g., long times for data transfers induced by kernel launches). Therefore, events have to be
examined carefully.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree to determine which classes of OpenMP target activities
contribute the most time.
- Parent metric:
- OpenMP Time
- Sub-metrics:
-
OpenMP Target Memory Management Time
OpenMP Target Kernel Launch Time
OpenMP Target Memory Management Time
(only available after remapping)
- Description:
-
Time spent queuing and waiting for data transfer operations to finish.
This includes direct transfers initiated by OpenMP API functions like
omp_target_memcpy, explicit transfers via OpenMP directives,
indirect transfers for kernels, and memory management operations, like
allocations.
- Unit:
- Seconds
- Diagnosis:
-
Excessive numbers of data transfer operations can significantly increase
the overall time spent due to the overhead incurred by each operation.
Try to reduce the number of transfer operations by moving data early,
asynchronously, and less often, transferring as much data as possible per operation.
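-
As a sketch (array and loop variables are hypothetical), a target data region keeps
the arrays resident on the device across many target regions instead of transferring
them implicitly for each one:
  /* Map the data once for the whole time-step loop. */
  #pragma omp target data map(to: a[0:n]) map(tofrom: b[0:n])
  {
      for (int step = 0; step < num_steps; ++step) {
          #pragma omp target teams distribute parallel for
          for (int i = 0; i < n; ++i)
              b[i] += a[i];
      }
  }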
- Parent metric:
- OpenMP Target Time
- Sub-metrics:
-
None
OpenMP Target Kernel Launch Time
(only available after remapping)
- Description:
-
Time spent to launch OpenMP target kernels and wait for their completion, if not asynchronous.
- Unit:
- Seconds
- Parent metric:
- OpenMP Target Time
- Sub-metrics:
-
None
OpenMP Target Kernel Time
(only available after remapping)
- Description:
-
Time spent executing OpenMP target kernels.
- Unit:
- Seconds
- Parent metric:
- Computation Time
- Sub-metrics:
-
None
OpenMP Synchronization Time
(only available after remapping)
- Description:
-
Time spent in OpenMP synchronization, whether barriers or mutual exclusion
via ordered sequentialization, critical sections, atomics or lock API calls.
- Unit:
- Seconds
- Parent metric:
- OpenMP Time
- Sub-metrics:
-
OpenMP Barrier Synchronization Time
OpenMP Critical Synchronization Time
OpenMP Lock API Synchronization Time
OpenMP Ordered Synchronization Time
OpenMP Taskwait Synchronization Time
OpenMP Barrier Synchronization Time
(only available after remapping)
- Description:
-
Time spent in implicit (compiler-generated) or explicit (user-specified)
OpenMP barrier synchronization. Note that during measurement implicit
barriers are treated similarly to explicit ones. The instrumentation
procedure replaces an implicit barrier with an explicit barrier enclosed
by the parallel construct. This is done by adding a nowait
clause and a barrier directive as the last statement of the parallel
construct. In cases where the implicit barrier cannot be removed (i.e.,
parallel region), the explicit barrier is executed in front of the
implicit barrier, which will then be negligible because the thread team
will already be synchronized when reaching it. The synthetic explicit
barrier appears as a special implicit barrier construct.
- Unit:
- Seconds
- Parent metric:
- OpenMP Synchronization Time
- Sub-metrics:
-
OpenMP Explicit Barrier Synchronization Time
OpenMP Implicit Barrier Synchronization Time
OpenMP Explicit Barrier Synchronization Time
(only available after remapping)
- Description:
-
Time spent in explicit (i.e., user-specified) OpenMP barrier
synchronization, both waiting for other threads
and inherent barrier processing overhead.
- Unit:
- Seconds
- Diagnosis:
-
Locate the most costly barrier synchronizations and determine whether
they are necessary to ensure correctness or could be safely removed
(based on algorithm analysis). Consider replacing an explicit barrier
with a potentially more efficient construct, such as a critical section
or atomic, or use explicit locks. Examine the time that each thread
spends waiting at each explicit barrier, and try to re-distribute
preceding work to improve load balance.
- Parent metric:
- OpenMP Barrier Synchronization Time
- Sub-metrics:
-
None
OpenMP Implicit Barrier Synchronization Time
(only available after remapping)
- Description:
-
Time spent in implicit (i.e., compiler-generated) OpenMP barrier
synchronization, both waiting for other threads
and inherent barrier processing overhead.
- Unit:
- Seconds
- Diagnosis:
-
Examine the time that each thread spends waiting at each implicit
barrier, and if there is a significant imbalance then investigate
whether a schedule clause is appropriate. Consider whether
it is possible to employ the nowait clause to reduce the
number of implicit barrier synchronizations.
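-
An illustrative sketch (arrays, bounds, and work functions are hypothetical) of both
suggestions, valid only because the second loop does not depend on the first:
  #pragma omp parallel
  {
      /* Dynamic scheduling reduces waiting when iteration costs vary;
         nowait removes the implicit barrier after the first loop. */
      #pragma omp for schedule(dynamic, 16) nowait
      for (int i = 0; i < n; ++i)
          a[i] = irregular_work(i);
      #pragma omp for
      for (int j = 0; j < m; ++j)
          b[j] = independent_work(j);
  }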
- Parent metric:
- OpenMP Barrier Synchronization Time
- Sub-metrics:
-
None
OpenMP Critical Synchronization Time
(only available after remapping)
- Description:
-
Time spent waiting to enter OpenMP critical sections and in atomics,
where mutual exclusion restricts access to a single thread at a time.
- Unit:
- Seconds
- Diagnosis:
-
Locate the most costly critical sections and atomics and determine
whether they are necessary to ensure correctness or could be safely
removed (based on algorithm analysis).
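-
One common case, sketched below with hypothetical names, is an accumulation guarded
by a critical section or atomic that can instead be expressed with a reduction clause,
avoiding serialization of the threads on every update:
  double sum = 0.0;
  /* replaces: #pragma omp critical (or #pragma omp atomic) around sum += a[i] */
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; ++i)
      sum += a[i];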
- Parent metric:
- OpenMP Synchronization Time
- Sub-metrics:
-
None
OpenMP Lock API Synchronization Time
(only available after remapping)
- Description:
-
Time spent in OpenMP lock API calls.
- Unit:
- Seconds
- Diagnosis:
-
Locate the most costly usage of locks and determine whether they are
necessary to ensure correctness or could be safely removed (based on
algorithm analysis). Consider re-writing the algorithm to use lock-free
data structures.
- Parent metric:
- OpenMP Synchronization Time
- Sub-metrics:
-
None
OpenMP Ordered Synchronization Time
(only available after remapping)
- Description:
-
Time spent waiting to enter OpenMP ordered regions due to enforced
sequentialization of loop iteration execution order in the region.
- Unit:
- Seconds
- Diagnosis:
-
Locate the most costly ordered regions and determine
whether they are necessary to ensure correctness or could be safely
removed (based on algorithm analysis).
- Parent metric:
- OpenMP Synchronization Time
- Sub-metrics:
-
None
OpenMP Taskwait Synchronization Time
(only available after remapping)
- Description:
-
Time spent in OpenMP taskwait directives, waiting for child tasks
to finish.
- Unit:
- Seconds
- Parent metric:
- OpenMP Synchronization Time
- Sub-metrics:
-
None
OpenMP Flush Time
(only available after remapping)
- Description:
-
Time spent in OpenMP flush directives.
- Unit:
- Seconds
- Parent metric:
- OpenMP Time
- Sub-metrics:
-
None
POSIX Threads Time
(only available after remapping)
- Description:
-
Time spent in instrumented POSIX threads API calls. In particular, this
includes thread management and synchronization activities.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree to determine which classes of POSIX thread
activities contribute the most time.
- Parent metric:
- Execution Time
- Sub-metrics:
-
POSIX Threads Management Time
POSIX Threads Synchronization Time
POSIX Threads Management Time
(only available after remapping)
- Description:
-
Time spent managing (i.e., creating, joining, cancelling, etc.) POSIX
threads.
- Unit:
- Seconds
- Diagnosis:
-
Excessive POSIX threads management time in pthread_join
indicates load imbalance, which causes wait states in the joining
threads while they wait for other threads to finish. Examine the join
times and try to re-distribute the computation in the corresponding
worker threads to achieve a better load balance.
-
Also, correlate the thread management time to the Visits of
management routines. If visit counts are high, consider using a
thread pool to reduce the number of thread management operations.
- Parent metric:
- POSIX Threads Time
- Sub-metrics:
-
None
POSIX Threads Synchronization Time
(only available after remapping)
- Description:
-
Time spent in POSIX threads synchronization calls, i.e., mutex and
condition variable operations.
- Unit:
- Seconds
- Diagnosis:
-
Expand the metric tree further to determine the proportion of time in
different classes of POSIX thread synchronization operations. Expand the
calltree to identify which callpaths are responsible for the most
synchronization time. Also examine the distribution of synchronization
time on each participating thread for indication of lock contention
effects.
- Parent metric:
- POSIX Threads Time
- Sub-metrics:
-
POSIX Threads Mutex API Synchronization Time
POSIX Threads Condition API Synchronization Time
POSIX Threads Mutex API Synchronization Time
(only available after remapping)
- Description:
-
Time spent in POSIX threads mutex API calls.
- Unit:
- Seconds
- Diagnosis:
-
Locate the most costly usage of mutex operations and determine whether
they are necessary to ensure correctness or could be safely removed
(based on algorithm analysis). Consider re-writing the algorithm to
use lock-free data structures.
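-
As a sketch of the lock-free suggestion (using a C11 atomic; names are hypothetical),
a mutex-protected shared counter can be replaced by an atomic increment:
  #include <stdatomic.h>
  static atomic_long event_count = 0;
  /* replaces: pthread_mutex_lock(&lock); ++event_count; pthread_mutex_unlock(&lock); */
  void count_event(void)
  {
      atomic_fetch_add_explicit(&event_count, 1, memory_order_relaxed);
  }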
- Parent metric:
- POSIX Threads Synchronization Time
- Sub-metrics:
-
None
POSIX Threads Condition API Synchronization Time
(only available after remapping)
- Description:
-
Time spent in POSIX threads condition API calls.
- Unit:
- Seconds
- Diagnosis:
-
Locate the most costly usage of condition operations and determine whether
they are necessary to ensure correctness or could be safely removed (based
on algorithm analysis). Consider re-writing the algorithm to use data
structures without the need for condition variables.
- Parent metric:
- POSIX Threads Synchronization Time
- Sub-metrics:
-
None
OpenACC Time
(only available after remapping)
- Description:
-
Time spent in the OpenACC run-time system, API calls and on device.
If the OpenACC implementation is based on CUDA, and OpenACC and CUDA
support are both enabled during measurement, the CUDA activities from
within OpenACC will be accounted separately (just like CUDA calls
within MPI and other metric hierarchies).
- Unit:
- Seconds
- Parent metric:
- Execution Time
- Sub-metrics:
-
OpenACC Initialization/Finalization Time
OpenACC Memory Management Time
OpenACC Synchronization Time
OpenACC Kernel Launch Time
OpenACC Initialization/Finalization Time
(only available after remapping)
- Description:
-
Time needed to initialize and finalize OpenACC and OpenACC kernels.
- Unit:
- Seconds
- Parent metric:
- OpenACC Time
- Sub-metrics:
-
None
OpenACC Memory Management Time
(only available after remapping)
- Description:
-
Time spent on memory management including data transfer from host to
device and vice versa.
- Unit:
- Seconds
- Parent metric:
- OpenACC Time
- Sub-metrics:
-
None
OpenACC Synchronization Time
(only available after remapping)
- Description:
-
Time spent on OpenACC synchronization.
- Unit:
- Seconds
- Parent metric:
- OpenACC Time
- Sub-metrics:
-
None
OpenACC Kernel Launch Time
(only available after remapping)
- Description:
-
Time spent to launch OpenACC kernels.
- Unit:
- Seconds
- Parent metric:
- OpenACC Time
- Sub-metrics:
-
None
OpenCL Kernel Time
(only available after remapping)
- Description:
-
Time spent executing OpenCL kernels.
- Unit:
- Seconds
- Parent metric:
- Computation Time
- Sub-metrics:
-
None
OpenCL Time
(only available after remapping)
- Description:
-
Time spent in the OpenCL run-time system, API calls, and on device.
- Unit:
- Seconds
- Parent metric:
- Execution Time
- Sub-metrics:
-
OpenCL General Management Time
OpenCL Memory Management Time
OpenCL Synchronization Time
OpenCL Kernel Launch Time
OpenCL General Management Time
(only available after remapping)
- Description:
-
Time needed for general OpenCL setup, e.g., initialization, device and
event control, etc.
- Unit:
- Seconds
- Parent metric:
- OpenCL Time
- Sub-metrics:
-
None
OpenCL Memory Management Time
(only available after remapping)
- Description:
-
Time spent on memory management including data transfer from host to
device and vice versa.
- Unit:
- Seconds
- Parent metric:
- OpenCL Time
- Sub-metrics:
-
None
OpenCL Synchronization Time
(only available after remapping)
- Description:
-
Time spent on OpenCL synchronization.
- Unit:
- Seconds
- Parent metric:
- OpenCL Time
- Sub-metrics:
-
None
OpenCL Kernel Launch Time
(only available after remapping)
- Description:
-
Time spent to launch OpenCL kernels.
- Unit:
- Seconds
- Parent metric:
- OpenCL Time
- Sub-metrics:
-
None
CUDA Kernel Time
(only available after remapping)
- Description:
-
Time spent executing CUDA kernels.
- Unit:
- Seconds
- Parent metric:
- Computation Time
- Sub-metrics:
-
None
CUDA Time
(only available after remapping)
- Description:
-
Time spent in the CUDA run-time system, API calls and on device.
- Unit:
- Seconds
- Parent metric:
- Execution Time
- Sub-metrics:
-
CUDA General Management Time
CUDA Memory Management Time
CUDA Synchronization Time
CUDA Kernel Launch Time
CUDA General Management Time
(only available after remapping)
- Description:
-
Time needed for general CUDA setup, e.g., initialization, control of
version, device, primary context, context, streams, events, occupancy,
etc.
- Unit:
- Seconds
- Parent metric:
- CUDA Time
- Sub-metrics:
-
None
CUDA Memory Management Time
(only available after remapping)
- Description:
-
Time spent on memory management including data transfer from host to
device and vice versa. Note that "memset" operations are considered
in CUDA Kernel Launch Time.
- Unit:
- Seconds
- Parent metric:
- CUDA Time
- Sub-metrics:
-
None
CUDA Synchronization Time
(only available after remapping)
- Description:
-
Time spent on CUDA synchronization.
- Unit:
- Seconds
- Parent metric:
- CUDA Time
- Sub-metrics:
-
None
CUDA Kernel Launch Time
(only available after remapping)
- Description:
-
Time spent to launch CUDA kernels, including "memset" operations.
- Unit:
- Seconds
- Parent metric:
- CUDA Time
- Sub-metrics:
-
None
HIP Kernel Time
(only available after remapping)
- Description:
-
Time spent executing HIP kernels.
- Unit:
- Seconds
- Parent metric:
- Computation Time
- Sub-metrics:
-
None
HIP Time
(only available after remapping)
- Description:
-
Time spent in HIP API calls.
- Unit:
- Seconds
- Parent metric:
- Execution Time
- Sub-metrics:
-
HIP Stream Management Time
HIP Memory Allocation Time
HIP Memory Transfer Time
HIP Synchronization Time
HIP Kernel Launch Time
HIP Stream Management Time
(only available after remapping)
- Description:
-
Time needed for HIP stream management.
- Unit:
- Seconds
- Parent metric:
- HIP Time
- Sub-metrics:
-
None
HIP Memory Allocation Time
(only available after remapping)
- Description:
-
Time needed for HIP memory allocations.
- Unit:
- Seconds
- Parent metric:
- HIP Time
- Sub-metrics:
-
None
HIP Memory Transfer Time
(only available after remapping)
- Description:
-
Time needed for HIP memory transfers.
- Unit:
- Seconds
- Parent metric:
- HIP Time
- Sub-metrics:
-
None
HIP Synchronization Time
(only available after remapping)
- Description:
-
Time spent on HIP synchronization.
- Unit:
- Seconds
- Parent metric:
- HIP Time
- Sub-metrics:
-
None
HIP Kernel Launch Time
(only available after remapping)
- Description:
-
Time spent to launch HIP kernels.
- Unit:
- Seconds
- Parent metric:
- HIP Time
- Sub-metrics:
-
None
MPI Bytes Transferred
(only available after remapping)
- Description:
-
The total number of bytes that were notionally processed in MPI
communication and synchronization operations (i.e., the sum of the bytes
that were sent and received). Note that the actual number of bytes
transferred is typically not determinable, as this is dependent on the
internal MPI implementation, including message transfer and failed delivery
recovery protocols.
- Unit:
- Bytes
- Diagnosis:
-
Expand the metric tree to break down the bytes transferred into
constituent classes. Expand the call tree to identify where most data
is transferred and examine the distribution of data transferred by each
process.
- Parent metric:
- None
- Sub-metrics:
-
MPI Point-to-point Bytes Transferred
MPI Collective Bytes Transferred
MPI One-Sided Bytes Transferred
MPI Point-to-point Bytes Transferred
(only available after remapping)
- Description:
-
The total number of bytes that were notionally processed by
MPI point-to-point communication operations.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to identify where the most data is transferred
using point-to-point communication and examine the distribution of data
transferred by each process.
- Parent metric:
- MPI Bytes Transferred
- Sub-metrics:
-
MPI Point-to-point Bytes Sent
MPI Point-to-point Bytes Received
MPI Point-to-point Bytes Sent
- Description:
-
The number of bytes that were notionally sent using MPI
point-to-point communication operations.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to see where the most data is sent using
point-to-point communication operations and examine the distribution of
data sent by each process.
-
If the aggregate MPI Point-to-point Bytes Received is less than the amount
sent, some messages were cancelled, received into buffers which were
too small, or simply not received at all. (Generally only aggregate
values can be compared, since sends and receives take place on
different callpaths and on different processes.) Sending more data than
is received wastes network bandwidth. Applications do not conform to
the MPI standard when they do not receive all messages that are sent,
and the unreceived messages degrade performance by consuming network
bandwidth and/or occupying message buffers. Cancelling send operations
is typically expensive, since it usually generates one or more internal
messages.
- Parent metric:
- MPI Point-to-point Bytes Transferred
- Sub-metrics:
-
None
MPI Point-to-point Bytes Received
- Description:
-
The number of bytes that were notionally received using MPI
point-to-point communication operations.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to see where the most data is received using
point-to-point communication and examine the distribution of data
received by each process.
-
If the aggregate MPI Point-to-point Bytes Sent is greater than the amount
received, some messages were cancelled, received into buffers which
were too small, or simply not received at all. (Generally only
aggregate values can be compared, since sends and receives take place
on different callpaths and on different processes.) Applications do
not conform to the MPI standard when they do not receive all messages
that are sent, and the unreceived messages degrade performance by
consuming network bandwidth and/or occupying message buffers.
Cancelling receive operations may be necessary where speculative
asynchronous receives are employed, however, managing the associated
requests also involves some overhead.
- Parent metric:
- MPI Point-to-point Bytes Transferred
- Sub-metrics:
-
None
MPI Collective Bytes Transferred
(only available after remapping)
- Description:
-
The total number of bytes that were notionally processed in
MPI collective communication operations. This assumes that collective
communications are implemented naively using point-to-point
communications, e.g., a broadcast being implemented as sends to each
member of the communicator (including the root itself). Note that
effective MPI implementations use optimized algorithms and/or special
hardware, such that the actual number of bytes transferred may be very
different.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to see where the most data is transferred using
collective communication and examine the distribution of data
transferred by each process.
- Parent metric:
- MPI Bytes Transferred
- Sub-metrics:
-
MPI Collective Bytes Outgoing
MPI Collective Bytes Incoming
MPI Collective Bytes Outgoing
- Description:
-
The number of bytes that were notionally sent by MPI collective
communication operations.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to see where the most data is transferred using
collective communication and examine the distribution of data outgoing
from each process.
- Parent metric:
- MPI Collective Bytes Transferred
- Sub-metrics:
-
None
MPI Collective Bytes Incoming
- Description:
-
The number of bytes that were notionally received by MPI collective
communication operations.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to see where the most data is transferred using
collective communication and examine the distribution of data incoming
to each process.
- Parent metric:
- MPI Collective Bytes Transferred
- Sub-metrics:
-
None
MPI One-Sided Bytes Transferred
(only available after remapping)
- Description:
-
The number of bytes that were notionally processed in MPI one-sided
communication operations.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to see where the most data is transferred using
one-sided communication and examine the distribution of data transferred
by each process.
- Parent metric:
- MPI Bytes Transferred
- Sub-metrics:
-
MPI One-sided Bytes Sent
MPI One-sided Bytes Received
MPI One-sided Bytes Sent
- Description:
-
The number of bytes that were notionally sent in MPI one-sided
communication operations.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to see where the most data is transferred using
one-sided communication and examine the distribution of data sent by
each process.
- Parent metric:
- MPI One-Sided Bytes Transferred
- Sub-metrics:
-
None
MPI One-sided Bytes Received
- Description:
-
The number of bytes that were notionally received in MPI one-sided
communication operations.
- Unit:
- Bytes
- Diagnosis:
-
Expand the calltree to see where the most data is transferred using
one-sided communication and examine the distribution of data received by
each process.
- Parent metric:
- MPI One-Sided Bytes Transferred
- Sub-metrics:
-
None
MPI File Operations
(only available after remapping)
- Description:
-
Number of MPI file operations of any type.
- Unit:
- Counts
- Diagnosis:
-
Expand the metric tree to see the breakdown of different classes of MPI
file operation, expand the calltree to see where they occur, and look
at the distribution of operations done by each process.
- Parent metric:
- None
- Sub-metrics:
-
MPI Individual File Operations
MPI Collective File Operations
MPI Individual File Operations
(only available after remapping)
- Description:
-
Number of individual MPI file operations.
- Unit:
- Counts
- Diagnosis:
-
Examine the distribution of individual MPI file operations done by each
process and compare with the corresponding MPI File Management Time and
MPI Individual File I/O Time.
- Parent metric:
- MPI File Operations
- Sub-metrics:
-
MPI Individual File Read Operations
MPI Individual File Write Operations
MPI Individual File Read Operations
(only available after remapping)
- Description:
-
Number of individual MPI file read operations.
- Unit:
- Counts
- Diagnosis:
-
Examine the callpaths where individual MPI file reads occur and the
distribution of operations done by each process in them, and compare
with the corresponding MPI Individual File I/O Time.
- Parent metric:
- MPI Individual File Operations
- Sub-metrics:
-
None
MPI Individual File Write Operations
(only available after remapping)
- Description:
-
Number of individual MPI file write operations.
- Unit:
- Counts
- Diagnosis:
-
Examine the callpaths where individual MPI file writes occur and the
distribution of operations done by each process in them, and compare
with the corresponding MPI Individual File I/O Time.
- Parent metric:
- MPI Individual File Operations
- Sub-metrics:
-
None
MPI Collective File Operations
(only available after remapping)
- Description:
-
Number of collective MPI file operations.
- Unit:
- Counts
- Diagnosis:
-
Examine the distribution of collective MPI file operations done by each
process and compare with the corresponding MPI File Management Time and
MPI Collective File I/O Time.
- Parent metric:
- MPI File Operations
- Sub-metrics:
-
MPI Collective File Read Operations
MPI Collective File Write Operations
MPI Collective File Read Operations
(only available after remapping)
- Description:
-
Number of collective MPI file read operations.
- Unit:
- Counts
- Diagnosis:
-
Examine the callpaths where collective MPI file reads occur and the
distribution of operations done by each process in them, and compare
with the corresponding MPI Collective File I/O Time.
- Parent metric:
- MPI Collective File Operations
- Sub-metrics:
-
None
MPI Collective File Write Operations
(only available after remapping)
- Description:
-
Number of collective MPI file write operations.
- Unit:
- Counts
- Diagnosis:
-
Examine the callpaths where collective MPI file writes occur and the
distribution of operations done by each process in them, and compare
with the corresponding MPI Collective File I/O Time.
- Parent metric:
- MPI Collective File Operations
- Sub-metrics:
-
None
MPI One-sided Synchronization Time
(only available after remapping)
- Description:
-
Time spent in MPI one-sided synchronization calls.
- Unit:
- Seconds
- Parent metric:
- MPI Synchronization Time
- Sub-metrics:
-
MPI Active Target Synchronization Time
MPI One-sided Passive Target Synchronization Time
MPI Active Target Synchronization Time
(only available after remapping)
- Description:
-
Time spent in MPI one-sided active target synchronization calls.
- Unit:
- Seconds
- Parent metric:
- MPI One-sided Synchronization Time
- Sub-metrics:
-
None
MPI One-sided Passive Target Synchronization Time
(only available after remapping)
- Description:
-
Time spent in MPI one-sided passive target synchronization calls.
- Unit:
- Seconds
- Parent metric:
- MPI One-sided Synchronization Time
- Sub-metrics:
-
None
MPI One-sided Communication Time
(only available after remapping)
- Description:
-
Time spent in MPI one-sided communication operations, for example,
MPI_Accumulate, MPI_Put, or MPI_Get.
- Unit:
- Seconds
- Parent metric:
- MPI Communication Time
- Sub-metrics:
-
None
Computational Load Imbalance Heuristic
(only available after remapping)
- Description:
-
This simple heuristic helps to identify computational load imbalances and
is calculated for each (call-path, process/thread) pair. Its value
represents the absolute difference from the average computation time. This
average value is the aggregated exclusive time spent by all
processes/threads in this call-path, divided by the number of
processes/threads visiting it.
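-
Written as a formula (notation introduced here only for illustration), with
t(c,p) denoting the exclusive computation time of call path c on process/thread p
and N_c the number of processes/threads visiting c:
  \mathrm{imbalance}(c,p) = \left| \, t(c,p) - \frac{1}{N_c} \sum_{q} t(c,q) \, \right|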
-
-
Note:
A high value for a collapsed call tree node does not necessarily mean that
there is a load imbalance in this particular node; the imbalance can
also be somewhere in the subtree underneath. Unused threads outside
of OpenMP parallel regions are considered to constitute OpenMP Idle Threads Time
and are expressly excluded from the computational load imbalance heuristic.
- Unit:
- Seconds
- Diagnosis:
-
Total load imbalance comprises both above-average computation time
and below-average computation time; therefore, at most half of it could
potentially be recovered with perfect (zero-overhead) load balance
that distributed the excess from overloaded to underloaded
processes/threads, such that all took exactly the same time.
-
Computation imbalance is often the origin of communication and
synchronization inefficiencies, where processes/threads block and
must wait idle for partners, however, work partitioning and
parallelization overheads may be prohibitive for complex computations
or unproductive for short computations. Replicating computation on
all processes/threads will eliminate imbalance, but would typically
not result in recovery of this imbalance time (though it may reduce
associated communication and synchronization requirements).
-
Call paths with significant amounts of computational imbalance should
be examined, along with processes/threads with above/below-average
computation time, to identify parallelization inefficiencies. Call paths
executed by a subset of processes/threads may relate to parallelization
that hasn't been fully realized (Computational Load Imbalance Heuristic: Non-participation), whereas
call-paths executed only by a single process/thread
(Computational Load Imbalance Heuristic: Single Participant) often represent unparallelized serial code,
which will be scalability impediments as the number of processes/threads
increases.
- Parent metric:
- None
- Sub-metrics:
-
Computational Load Imbalance Heuristic: Overload
Computational Load Imbalance Heuristic: Underload
Computational Load Imbalance Heuristic: Overload
(only available after remapping)
- Description:
-
This metric identifies processes/threads where the exclusive execution
time spent for a particular call-path was above the average value.
It is a complement to Computational Load Imbalance Heuristic: Underload.
-
-
See Computational Load Imbalance Heuristic for details on how this heuristic is calculated.
- Unit:
- Seconds
- Diagnosis:
-
The CPU time which is above the average time for computation is the
maximum that could potentially be recovered with perfect (zero-overhead)
load balance that distributed the excess from overloaded to underloaded
processes/threads.
- Parent metric:
- Computational Load Imbalance Heuristic
- Sub-metrics:
-
Computational Load Imbalance Heuristic: Single Participant
Computational Load Imbalance Heuristic: Single Participant
(only available after remapping)
- Description:
-
This heuristic distinguishes the execution time of call-paths executed
by a single process/thread, which potentially could be recovered with
perfect parallelization using all available processes/threads.
-
It is the Computational Load Imbalance Heuristic: Overload time for call-paths that only have
non-zero Visits for one process or thread, and complements
Computational Load Imbalance Heuristic: Non-participation in Singularity.
-
- Unit:
- Seconds
- Diagnosis:
-
This time is often associated with activities done exclusively by a
"Master" process/thread (often rank 0) such as initialization,
finalization or I/O, but can apply to any process/thread that
performs computation that none of its peers do (or that does its
computation on a call-path that differs from the others).
-
The CPU time for singular execution of the particular call path
typically presents a serial bottleneck impeding scalability as none of
the other available processes/threads are being used, and
they may well wait idling until the result of this computation becomes
available. (Check the MPI communication and synchronization times,
particularly waiting times, for proximate call paths.)
In such cases, even small amounts of singular execution can
have substantial impact on overall performance and parallel efficiency.
With perfect partitioning and (zero-overhead) parallel
execution of the computation, it would be possible to recover this time.
-
When the amount of time is small compared to the total execution time,
or when the cost of parallelization is prohibitive, it may not be
worth trying to eliminate this inefficiency. As the number of
processes/threads is increased and/or total execution time decreases,
however, the relative impact of this inefficiency can be expected to grow.
- Parent metric:
- Computational Load Imbalance Heuristic: Overload
- Sub-metrics:
-
None
Computational Load Imbalance Heuristic: Underload
(only available after remapping)
- Description:
-
This metric identifies processes/threads where the computation time spent
for a particular call-path was below the average value. It is a complement
to Computational Load Imbalance Heuristic: Overload.
-
-
See Computational Load Imbalance Heuristic for details on how this heuristic is calculated.
- Unit:
- Seconds
- Diagnosis:
-
The CPU time which is below the average time for computation could
potentially be used to reduce the excess from overloaded processes/threads
with perfect (zero-overhead) load balancing.
- Parent metric:
- Computational Load Imbalance Heuristic
- Sub-metrics:
-
Computational Load Imbalance Heuristic: Non-participation
Computational Load Imbalance Heuristic: Non-participation
(only available after remapping)
- Description:
-
This heuristic distinguishes the execution time of call paths that were not
executed by a subset of the processes/threads, which potentially could be
exploited with perfect parallelization using all available processes/threads.
-
It is the Computational Load Imbalance Heuristic: Underload time for call paths which have zero
Visits and were therefore not executed by this process/thread.
-
- Unit:
- Seconds
- Diagnosis:
-
The CPU time used for call paths where not all processes or threads
are exploited typically presents an ineffective parallelization that
limits scalability, if the unused processes/threads wait idling for
the result of this computation to become available. With perfect
partitioning and (zero-overhead) parallel execution of the computation,
it would be possible to recover this time.
- Parent metric:
- Computational Load Imbalance Heuristic: Underload
- Sub-metrics:
-
Computational Load Imbalance Heuristic: Non-participation in Singularity
Computational Load Imbalance Heuristic: Non-participation in Singularity
(only available after remapping)
- Description:
-
This heuristic distinguishes the execution time of call paths executed by only
a single process/thread (i.e., not executed by all of the others), which
potentially could be recovered with perfect parallelization using all
available processes/threads.
-
It is the Computational Load Imbalance Heuristic: Underload time for call paths that only have
non-zero Visits for one process/thread, and complements
Computational Load Imbalance Heuristic: Single Participant.
-
- Unit:
- Seconds
- Diagnosis:
-
The CPU time for singular execution of the particular call path
typically presents a serial bottleneck impeding scalability as none of
the other processes/threads that are available are being used, and
they may well wait idling until the result of this computation becomes
available. With perfect partitioning and (zero-overhead) parallel
execution of the computation, it would be possible to recover this time.
- Parent metric:
- Computational Load Imbalance Heuristic: Non-participation
- Sub-metrics:
-
None
Hits
- Description:
-
Number of exclusive samples inside this region.
- Unit:
- Counts
Wrapped libraries
- Description:
-
Total time spent during program execution in wrapped external libraries.
- Unit:
- Seconds
What is remapping?
A number of additional metrics can be calculated during an analysis report
postprocessing step called remapping. In addition, remapping also
organizes the performance properties in a hierarchical way, which allows
analysis reports to be examined at different levels of granularity. The remapping
step is automatically performed by the Scalasca convenience command
scalasca -examine (or its short form square) the first time an
experiment archive is examined. Thus, it should be transparent to users
following the recommended workflow as described in the
Scalasca User Guide.
However, the remapping process can also be performed manually using the
command-line tool cube_remap2 from the CubeLib package if necessary.
This tool reads an input Cube file and generates a corresponding output Cube
file according to a remapping specification. Note, however, that different
remapping specifications are required for postprocessing runtime summaries
and trace analysis reports. To postprocess a Score-P runtime summary
report profile.cubex and create a summary.cubex report, use
cube_remap2 -d -r `scorep-config --remap-specfile` -o summary.cubex profile.cubex
Likewise, to postprocess a Scalasca trace analysis report scout.cubex
and create a trace.cubex report, use
cube_remap2 -d -r `scalasca --remap-specfile` -o trace.cubex scout.cubex
Note that as of Score-P v5.0 and Scalasca v2.6, respectively, the remapping
specification is embedded in the runtime summary and trace analysis reports
if the specification file can be read from the installation directory at
measurement/analysis time. In this case, the -r <file> option
can be omitted from the commands above. However, this embedded specification
is dropped during any Cube algebra operation (e.g., cube_cut or
cube_merge).
IMPORTANT NOTE:
Remapping specifications are typically targeted towards a particular version
of Score-P or Scalasca. Thus, it is highly recommended to use the
remapping specification distributed with the Score-P/Scalasca version that was
used to generate the input report. Otherwise the remapping may produce
unexpected results.
Copyright © 2012, 2020, Forschungszentrum Jülich GmbH, Germany
Copyright © 2015, 2018, Technische Universität Dresden, Germany