Profiling

Note

Also see Debugging.

You have code and you want it to run fast; this is what Triton is for. But how do you know if your code is running as fast as it can? We are scientists, and if things aren’t quantified we can’t do science on them. Programming can often seem like a black box: modern computers are extremely complicated, and intuition alone can no longer tell you what actually makes code fast or slow. Thus, you need to profile your code: get detailed performance measurements. These measurements show you where to focus your effort to make it run faster.

There are many tools for profiling, and it is one of the fundamental skills for any programming language. You should learn how to do a quick profile just to make sure things are OK, even if you aren’t trying to optimize: you might find a quick win even if you didn’t write the code yourself (for example, discovering that 90% of your time is spent on input/output).

This page is under development, but so far serves as an introduction. We hope to expand it with specific Triton examples.

Summary: profiling on Linux

First off, look at your language-specific profiling tools.

CPU profiling

This can give you a list of where all your processor time is going, either per-function or per-line. Generally, most of your time is in a very small region of your code, and you need to know what this is in order to improve just that part.

See the C and Python profiling example above.
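
If you just want a quick per-function view of a Python program, the standard library’s cProfile module is enough. Below is a minimal, self-contained sketch (the slow_sum and fast_sum functions are made-up examples, not from any real code); you can also profile an existing script without modifying it via python -m cProfile -s cumulative myscript.py:

import cProfile
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):          # pure-Python loop: slow
        total += i * i
    return total

def fast_sum(n):
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(5_000_000)
fast_sum(5_000_000)
profiler.disable()
# print the 10 most expensive calls, sorted by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)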

GNU gprof

gprof is a profiler based on instrumenting your code (build with -pg). It has relatively high overhead, but gives exact information, e.g. the number of times each function is called.
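
A minimal gprof workflow looks like the following sketch (the file names are placeholders): build with -pg, run the program normally (which writes gmon.out), and let gprof turn that into a report:

$ gcc -Wall -O2 -pg myprog.c -o myprog     # instrument at compile and link time
$ ./myprog                                 # run normally; writes gmon.out
$ gprof ./myprog gmon.out | less           # flat profile plus call graph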

Perf

perf is a sampling profiler, which periodically samples events originating e.g. from the CPU performance monitoring unit (PMU). This generates a statistical profile, but the advantage is that the overhead is very low (single-digit percent), and one can get timings at the level of individual assembly instructions. For a simple example, consider a (naive) matrix multiplication program, mymatmul.f90, whose inner loop computes c(i,j) = c(i,j) + a(i,k) * b(k,j):

Compile the program (-g provides debug symbols which will be useful later on, at no performance cost):

$ gfortran -Wall -g -O3 mymatmul.f90

Run the program via the profiler to generate profile data:

$ perf record ./a.out

Now we can look at the profile:

$ perf report
# Samples: 1251
#
# Overhead Command Shared Object Symbol
# ........ .............. ............................. ......
#
85.45% a.out ./a.out [.] MAIN__
4.24% a.out /usr/lib/libgfortran.so.3.0.0 [.] _gfortran_arandom_r4
3.12% a.out /usr/lib/libgfortran.so.3.0.0 [.] kiss_random_kernel

So 85% of the runtime is spent in the main program (symbol MAIN__), and most of the rest is in the random number generator, which the program calls in order to generate the input matrices.

Now, let’s take a closer look at the main program:

$ perf annotate MAIN__
------------------------------------------------
Percent | Source code & Disassembly of a.out
------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 00000000004008b0 <MAIN__>:

: c = 0.
:
: do j = 1, n
: do k = 1, n
: do i = 1, n
: c(i,j) = c(i,j) + a(i,k) * b(k,j)
30.12 : 400a40: 0f 28 04 01 movaps (%rcx,%rax,1),%xmm0
4.92 : 400a44: 0f 59 c1 mulps %xmm1,%xmm0
12.36 : 400a47: 0f 58 04 02 addps (%rdx,%rax,1),%xmm0
40.73 : 400a4b: 0f 29 04 02 movaps %xmm0,(%rdx,%rax,1)
9.65 : 400a4f: 48 83 c0 10 add $0x10,%rax

Unsurprisingly, the inner loop kernel takes up practically all the time.

For more information on using perf, see the perf tutorial at

https://perf.wiki.kernel.org/index.php/Tutorial

Input/output profiling

This will tell you how much time is spent reading and writing data, where, and what type of access patterns it has (big reads, random access, etc.). Note that CPU profiling also shows this time: if input/output functions take up a large share of the runtime, you need to improve your IO.

/usr/bin/time -v prints some useful info about IO operations and statistics.
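
For example (the program name is a placeholder; note the full path, which avoids the shell’s built-in time):

$ /usr/bin/time -v ./my_program

Among other things, the report includes the maximum resident set size and the file system input/output counts.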

Lowest level: use strace to trace every system call that accesses files. This is rather crude, but it works:

# Total bytes written, then read (sums the return values of the system calls)
strace -e trace=desc $command |& egrep 'write' | awk --field-separator='='  '{ x+=$NF } END { print x }'
strace -e trace=desc $command |& egrep 'read' | awk --field-separator='='  '{ x+=$NF } END { print x }'

# Number of calls only
strace -e trace=file -c  $command

Memory profiling

Less common, but it can tell you how much memory is being used and where.

If you are writing your own algorithms, memory profiling becomes more important, because you need to be sure that you are using the memory hierarchy efficiently. There are tools for this, for example Valgrind’s massif (heap usage over time) and cachegrind (cache behaviour) tools.
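
For instance, Valgrind’s massif tool records heap usage over time. A sketch (the program name is a placeholder, and expect the run to be noticeably slower under Valgrind):

$ valgrind --tool=massif ./my_program      # writes massif.out.<pid>
$ ms_print massif.out.<pid>                # text report of heap usage over time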

MPI and parallel profiling

mpiP

mpiP: Lightweight, Scalable MPI Profiling http://mpip.sourceforge.net/. It collects statistical information about MPI functions. mpiP is a link-time library, which means it can be linked against already-compiled object files, though it is recommended that you recompile the code with -g. The debugging information is used to automatically decode program counters into source code filenames and line numbers. mpiP will work without -g, but mileage may vary.

Usage example:

# assume you have your MPI flavor module loaded
module load mpip/3.4.1

# link or compile your code from scratch with -g
mpif90 -g -o my_app my_app.f90 -lmpiP -lm -lbfd -liberty -lunwind
# or
mpif90 -o my_app my_app.o -lmpiP -lm -lbfd -liberty -lunwind

# run the code normally (either interactively with salloc or as usual with sbatch)
salloc -p play --ntasks=4 srun my_app

If everything works, you will see the mpiP header preceding your program’s stdout, and a text report file will be generated in your working directory. The file is small, so no need to worry about quota. Please consult the link above for an explanation of the file contents. At runtime, you can set the MPIP environment variable to change the profiler’s behavior. Example:

export MPIP="-t 10.0 -k 2"

Scalasca

Available through module load scalasca
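
A typical Scalasca workflow is to instrument the code with Score-P, run the measurement, and then examine the result. A sketch, assuming the Score-P compiler wrapper is provided alongside the module (module names and versions may differ):

$ module load scalasca
$ scorep mpif90 -g -O2 -o my_app my_app.f90    # instrument with the Score-P compiler wrapper
$ scalasca -analyze srun ./my_app              # run the measurement inside a Slurm job
$ scalasca -examine scorep_my_app_*            # inspect the resulting experiment directory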

GPU profiling

There are several tools available on Triton for GPU profiling.

PyTorch Profiler

PyTorch Profiler is a tool that helps monitor and optimize the performance of PyTorch models. It provides insights into resource usage such as CPU, GPU, and memory, helping developers identify bottlenecks in model training. For detailed information on how to use PyTorch Profiler, refer to the official documentation.
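
Below is a minimal, self-contained sketch of how the profiler can be used in a training loop (the tiny model and random data are placeholders, not a real workload). It writes TensorBoard traces under ./log, which matches the tensorboard --logdir ./log command used below:

import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1000, 1000).to(device)             # toy model, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log"),   # traces end up in ./log
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step in range(10):
        x = torch.randn(64, 1000, device=device)
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()   # tell the profiler that one training step is done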

On Triton, there is more than one way to visualize the profiler results. One straightforward approach is to use TensorBoard: launch a desktop from Open OnDemand and, in the desktop’s terminal, run:

$ module load mamba
$ source activate yourenv
$ tensorboard --logdir ./log

to launch TensorBoard and point it at the log directory where the profiling results were saved. Now you can view the results: open a browser and go to http://localhost:6006, then navigate to the PyTorch Profiler tab to see detailed profiling information, such as:

  • CPU and GPU utilization

  • Memory usage

  • Kernel execution times

  • Operation breakdowns

NOTE: you need to have the PyTorch Profiler TensorBoard plugin installed in your environment, e.g. via:

pip install torch-tb-profiler

NVIDIA Profilers

NVIDIA provides a suite of profiling tools designed to help users optimize and analyze the performance of applications running on NVIDIA GPUs. These tools are essential for understanding how applications utilize GPU resources and for identifying bottlenecks in the code.

Overview of NVIDIA Profilers:

Nsight Systems is a system-wide performance analysis tool that provides detailed insights into GPU workloads, CPU workloads, and their interactions. It helps identify issues related to CPU-GPU interaction, GPU utilization, memory bandwidth, and other system-level bottlenecks, such as I/O operations.

Nsight Compute is a GPU kernel profiling tool specifically designed for deep-dive analysis of CUDA kernels. It provides a wide range of performance metrics.

The older profiler nvprof has been deprecated and is no longer recommended for profiling GPU applications. It has been replaced by Nsight Compute and Nsight Systems, which provide a more comprehensive and user-friendly profiling experience.

Examples

We have produced some profiling examples.

To use Nsight Systems, you need to run:

$ module load cuda
$ nsys profile -o my_application_report ./my_application

This will generate a report file called my_application_report.nsys-rep (older versions of nsys use the .qdrep extension). To visualize the report on Triton, you can launch a desktop from Open OnDemand. In the desktop’s terminal, run:

$ module load cuda qt
$ nsys-ui

to launch the Nsight Systems GUI. You can then open the report file in the GUI to analyze the performance of your application.

If you want to have a quick overview of the performance of your application, you can add the --stats true flag to the nsys command:

$ nsys profile --stats true -o my_application_report ./my_application

Sample output might look something like this:

  Calculating pi using 1000000000 stochastic trials
  Throws: 785390400/1000000000 Pi: 3.141561508
  Generating '/tmp/nsys-report-dae1.qdstrm'
  [1/8] [========================100%] my_application_report.nsys-rep
  [2/8] [========================100%] my_application_report.sqlite
  [3/8] Executing 'nvtx_sum' stats report
  SKIPPED: /home/username/hpc-examples/slurm/my_application_report.sqlite does not contain NV Tools Extension (NVTX) data.
  [4/8] Executing 'osrt_sum' stats report

Time (%)  Total Time (ns)  Num Calls   Avg (ns)   Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name
--------  ---------------  ---------  ----------  ---------  --------  ---------  -----------  ----------------------
    67.5        386063524         12  32171960.3  1869947.5      1922  212805298   68194552.0  poll
    31.2        178480955        530    336756.5    15054.0       863   86904639    4568774.8  ioctl
     0.4          2103660         43     48922.3    11494.0      6474     828043     147057.8  mmap64
     0.3          1592392         11    144762.9    38465.0      1054    1057811     310592.0  write
     0.2          1191873          1   1191873.0  1191873.0   1191873    1191873          0.0  pthread_cond_wait
     0.1           668982          2    334491.0   334491.0    153375     515607     256136.7  pthread_create
     0.1           419792         66      6360.5     6045.0      1315      16170       2895.0  open64
     0.1           321168          2    160584.0   160584.0       365     320803     226583.9  pthread_cond_broadcast
     0.0           248340         33      7525.5     4207.0      2493      32380       6793.7  fopen
     0.0           143869         15      9591.3     3959.0      1813      61837      15235.8  mmap
     0.0            62325         27      2308.3     1941.0       961       8547       1660.7  fclose
     0.0            60643         48      1263.4       44.0        42      58482       8434.5  fgets
     0.0            58613         74       792.1      759.5       509       1824        208.7  fcntl
     0.0            35772          6      5962.0     5731.0      2006      12850       3821.8  open
     0.0            26375         14      1883.9     1158.5       655       8367       2022.1  read
     0.0            23330          2     11665.0    11665.0      7532      15798       5844.9  fread
     0.0            22813          3      7604.3     9569.0      3606       9638       3462.8  pipe2
     0.0            21978          5      4395.6     4144.0      2141       6999       1769.5  munmap
     0.0            15134          2      7567.0     7567.0      4708      10426       4043.2  socket
     0.0            10788          1     10788.0    10788.0     10788      10788          0.0  connect
     0.0             4913          7       701.9      720.0       552        973        153.8  dup
     0.0             4911          2      2455.5     2455.5      2180       2731        389.6  fwrite
     0.0             3976         64        62.1       26.0        24        128         40.5  pthread_mutex_trylock
     0.0             2007          1      2007.0     2007.0      2007       2007          0.0  bind
     0.0             1081          1      1081.0     1081.0      1081       1081          0.0  listen

  [5/8] Executing 'cuda_api_sum' stats report

Time (%)  Total Time (ns)  Num Calls   Avg (ns)   Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name
--------  ---------------  ---------  ----------  ---------  --------  ---------  -----------  ----------------------
    92.4        196623264          3  65541088.0   122225.0      9122  196491917  113406758.7  cudaMalloc
     6.5         13852441          2   6926220.5  6926220.5     68117   13784324    9698823.0  cudaMemcpy
     0.9          1941642          2    970821.0   970821.0     20026    1921616    1344627.2  cudaLaunchKernel
     0.1           300228          3    100076.0   126648.0     13525     160055      76794.0  cudaFree
     0.0             1428          1      1428.0     1428.0      1428       1428          0.0  cuModuleGetLoadingMode

  [6/8] Executing 'cuda_gpu_kern_sum' stats report

Time (%)  Total Time (ns)  Instances   Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                           Name
--------  ---------------  ---------  ----------  ----------  --------  --------  -----------  -------------------------------------------------------
    85.3         11426785          1  11426785.0  11426785.0  11426785  11426785          0.0  throw_dart(curandStateXORWOW *, int *, unsigned long *)
    14.7          1970581          1   1970581.0   1970581.0   1970581   1970581          0.0  setup_rng(curandStateXORWOW *, unsigned long)

  [7/8] Executing 'cuda_gpu_mem_time_sum' stats report

Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)      Operation
--------  ---------------  -----  --------  --------  --------  --------  -----------  ------------------
    64.3            41055      1   41055.0   41055.0     41055     41055          0.0  [CUDA memcpy DtoH]
    35.7            22784      1   22784.0   22784.0     22784     22784          0.0  [CUDA memcpy HtoD]

  [8/8] Executing 'cuda_gpu_mem_size_sum' stats report

Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation
----------  -----  --------  --------  --------  --------  -----------  ------------------
     0.524      1     0.524     0.524     0.524     0.524        0.000  [CUDA memcpy DtoH]
     0.262      1     0.262     0.262     0.262     0.262        0.000  [CUDA memcpy HtoD]

This output shows that the kernel throw_dart (part of the Monte Carlo simulation) takes up about 85% of the GPU kernel time, while setup_rng (which sets up the random number generators) takes about 15%. It is also important to note that in this example the memory allocation (cudaMalloc) and memory copying (cudaMemcpy) took more time than the actual computation. Memory operations are expensive, so well-optimized codes try to minimize them.

For more detailed guides on how to use nsys, refer to the official NVIDIA documentation.

To use Nsight Compute, you need to run:

$ module load cuda
$ ncu --target-processes all --set full ./my_application

This command profiles all CUDA kernels in your application, providing a comprehensive set of performance metrics, including instruction throughput, memory access efficiency, and occupancy. By default the results are printed to the terminal; to save them instead, add -o my_application_report to the command, which makes Nsight Compute write a report file called my_application_report.ncu-rep.
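
For example (a sketch; as with Nsight Systems, the exact modules needed for the GUI on Triton may differ):

$ ncu -o my_application_report --set full ./my_application   # writes my_application_report.ncu-rep
$ ncu-ui my_application_report.ncu-rep                        # open the report in the Nsight Compute GUI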

Nsight Compute also allows you to analyze the performance of specific CUDA kernels. For example, you can run:

$ ncu --kernel-name 'myKernel' ./my_cuda_application

to profile a kernel named myKernel, or

$ ncu --kernel-name 'regex:^myKernel' ./my_cuda_application

to profile all kernels whose names start with myKernel.

Sample output might look something like this:

 Calculating pi using 1000000000 stochastic trials
 ==PROF== Connected to process 3944692 (/home/username/hpc-examples/slurm/pi-gpu)
 ==PROF== Profiling "throw_dart": 0%....50%....100% - 19 passes
 Throws: 785390400/1000000000 Pi: 3.141561508
 ==PROF== Disconnected from process 3944692
 [3944692] pi-gpu@127.0.0.1
throw_dart(curandStateXORWOW *, int *, unsigned long *) (512, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 7.0
  Section: GPU Speed Of Light Throughput
  ----------------------- ------------- ------------
  Metric Name               Metric Unit Metric Value
  ----------------------- ------------- ------------
  DRAM Frequency          cycle/usecond       877.96
  SM Frequency            cycle/nsecond         1.24
  Elapsed Cycles                  cycle     10544719
  Memory Throughput                   %        49.01
  DRAM Throughput                     %         0.04
  Duration                      msecond         8.52
  L1/TEX Cache Throughput             %        73.75
  L2 Cache Throughput                 %        49.01
  SM Active Cycles                cycle   8850096.11
  Compute (SM) Throughput             %        37.09
  ----------------------- ------------- ------------

  OPT   This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full
        waves across all SMs. Look at Launch Statistics for more details.

  Section: Launch Statistics
  -------------------------------- --------------- ---------------
  Metric Name                          Metric Unit    Metric Value
  -------------------------------- --------------- ---------------
  Block Size                                                   128
  Function Cache Configuration                     CachePreferNone
  Grid Size                                                    512
  Registers Per Thread             register/thread              21
  Shared Memory Configuration Size            byte               0
  Driver Shared Memory Per Block        byte/block               0
  Dynamic Shared Memory Per Block       byte/block               0
  Static Shared Memory Per Block        byte/block               0
  Threads                                   thread           65536
  Waves Per SM                                                0.40
  -------------------------------- --------------- ---------------

  Section: Occupancy
  ------------------------------- ----------- ------------
  Metric Name                     Metric Unit Metric Value
  ------------------------------- ----------- ------------
  Block Limit SM                        block           32
  Block Limit Registers                 block           21
  Block Limit Shared Mem                block           32
  Block Limit Warps                     block           16
  Theoretical Active Warps per SM        warp           64
  Theoretical Occupancy                     %          100
  Achieved Occupancy                        %        39.97
  Achieved Active Warps Per SM           warp        25.58
  ------------------------------- ----------- ------------

  OPT   Estimated Speedup: 60.03%
        This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
        theoretical (100.0%) and measured achieved occupancy (40.0%) can be the result of warp scheduling overheads
        or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
        as well as across blocks of the same kernel. See the CUDA Best Practices Guide
        (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
        optimizing occupancy.

The profiling result suggests that the kernel grid is too small, meaning the workload is not sufficient to fully utilize the GPU’s resources. Only 0.4 full waves across all SMs were active, indicating suboptimal GPU utilization. To improve performance, one may need to increase the grid size to better match the GPU’s capabilities.

For detailed guides on how to use ncu, refer to the documentation: NVIDIA Nsight Compute suite.