跳转至

参考资料

CUDA 安装

总结:

  • 最简单方式是看 cuda 下载页面提供的安装命令
  • deb 安装貌似无法只安装 cuda,不安装驱动
  • runfile 安装可以做到不重启
  • cuda-toolkit-12-1 包含 cuda
  • nvidia-driver-530 包含驱动

有用链接

cuda 下载

不同 metapackage

gpu driver

  • nvidia-headless-535
    • This metapackage installs the NVIDIA driver and the libraries that enable parallel general purpose computation through CUDA and OpenCL.
    • nvidia-headless-535 存在两个 APT-source,分别是发行版和 APT-Sources: /var/lib/dpkg/status

cuda toolkit (nvcc etc)

cuda-toolkit-11-8

  • Meta-package containing all the available toolkit packages related to native CUDA development. Contains the toolkit, samples, and documentation. Locked at CUDA Toolkit version 11.8. cuda-11-8
  • Meta-package containing all the available packages required for native CUDA development. Contains the toolkit, samples, driver and documentation. Locked at CUDA Toolkit version 11.8.

3种方式

  • 发行版源自带的
    • 只包含 nvidia-driver,不包含 cuda toolkit
  • rpm/deb packages
    • 离线版 deb 文件
    • 在线版 nvidia 源
  • runfile package
    • 和发行版无关 (distribution-independent package)

nvidia apt 源

包含所有版本 cuda,driver。其中 driver 可能比发行版源中的 driver 优先级更高

sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

离线版 deb

使用方法

sudo dpkg -i cuda-repo-xxx.deb

sudo cp /var/cuda-repo-xxx/cuda-*-keyring.gpg /usr/share/keyrings/

sudo apt update
sudo apt install cuda-toolkit-xxx

查看离线方法安装的 cuda 版本

~ dpkg --list |grep cuda-repo
ii  cuda-repo-ubuntu2004-11-8-local                             11.8.0-520.61.05-1                                             amd64        cuda repository configuration files
ii  cuda-repo-ubuntu2004-12-1-local                             12.1.0-530.30.02-1                                             amd64        cuda repository configuration files

可以看到cuda-repo 包含一个 apt list 文件

~ dkpg -L cuda-repo-ubuntu2004-11-8-local
/etc/apt/sources.list.d/cuda-ubuntu2004-11-8-local.list

内容为

fyyuan@snode6 ➜  ~ cat /etc/apt/sources.list.d/cuda-ubuntu2004-11-8-local.list
deb [signed-by=/usr/share/keyrings/cuda-368EAC11-keyring.gpg] file:///var/cuda-repo-ubuntu2004-11-8-local /

安装 deb 后,是添加了一个 repo。还需要再手动 install cuda。 不知为何安装 cuda 时一定会自动更新 driver,而使用 runfile 方式,则可以选择不安装驱动。

  • 问题在于 cuda 是最大的一个 metapackage,安装 cuda-toolkit-xx 就好了
  • cuda-drivers-530
  • cuda-toolkit-12-1
  • nvidia-driver-530
fyyuan@snode6 ➜  archive sudo apt install cuda --no-install-recommends
Reading package lists... Done
Building dependency tree
Reading state information... Done

Recommended packages:
  libnvidia-compute-530:i386 libnvidia-decode-530:i386 libnvidia-encode-530:i386 libnvidia-fbc1-530:i386 libnvidia-gl-530:i386

The following packages will be REMOVED:
  cuda-drivers-495 libnvidia-cfg1-495 libnvidia-compute-495 libnvidia-decode-495 libnvidia-encode-495 libnvidia-extra-495 libnvidia-fbc1-495 libnvidia-gl-495 nvidia-compute-utils-495 nvidia-dkms-495 nvidia-driver-495 nvidia-kernel-common-495 nvidia-kernel-source-495
  nvidia-utils-495 xserver-xorg-video-nvidia-495

The following NEW packages will be installed:
  cuda-12-1 cuda-cccl-12-1 cuda-command-line-tools-12-1 cuda-compiler-12-1 cuda-cudart-12-1 cuda-cudart-dev-12-1 cuda-cuobjdump-12-1 cuda-cupti-12-1 cuda-cupti-dev-12-1 cuda-cuxxfilt-12-1 cuda-demo-suite-12-1 cuda-documentation-12-1 cuda-driver-dev-12-1 cuda-drivers-530
  cuda-gdb-12-1 cuda-libraries-12-1 cuda-libraries-dev-12-1 cuda-nsight-12-1 cuda-nsight-compute-12-1 cuda-nsight-systems-12-1 cuda-nvcc-12-1 cuda-nvdisasm-12-1 cuda-nvml-dev-12-1 cuda-nvprof-12-1 cuda-nvprune-12-1 cuda-nvrtc-12-1 cuda-nvrtc-dev-12-1 cuda-nvtx-12-1 cuda-nvvp-12-1
  cuda-opencl-12-1 cuda-opencl-dev-12-1 cuda-profiler-api-12-1 cuda-runtime-12-1 cuda-sanitizer-12-1 cuda-toolkit-12-1 cuda-toolkit-12-1-config-common cuda-toolkit-12-config-common cuda-tools-12-1 cuda-visual-tools-12-1 gds-tools-12-1 libcublas-12-1 libcublas-dev-12-1
  libcufft-12-1 libcufft-dev-12-1 libcufile-12-1 libcufile-dev-12-1 libcurand-12-1 libcurand-dev-12-1 libcusolver-12-1 libcusolver-dev-12-1 libcusparse-12-1 libcusparse-dev-12-1 libnpp-12-1 libnpp-dev-12-1 libnvidia-cfg1-530 libnvidia-common-530 libnvidia-compute-530
  libnvidia-decode-530 libnvidia-encode-530 libnvidia-extra-530 libnvidia-fbc1-530 libnvidia-gl-530 libnvjitlink-12-1 libnvjitlink-dev-12-1 libnvjpeg-12-1 libnvjpeg-dev-12-1 libnvvm-samples-12-1 nsight-compute-2023.1.0 nsight-systems-2023.1.2 nvidia-compute-utils-530
  nvidia-dkms-530 nvidia-driver-530 nvidia-kernel-common-530 nvidia-kernel-source-530 nvidia-utils-530 xserver-xorg-video-nvidia-530
The following packages will be upgraded:
  cuda cuda-drivers nvidia-modprobe nvidia-settings
4 upgraded, 76 newly installed, 15 to remove and 21 not upgraded.
Need to get 0 B/3,012 MB of archives.
After this operation, 7,064 MB of additional disk space will be used.
Do you want to continue? [Y/n]

runfile

不建议使用,否则会和 apt 方法冲突。且没有了 apt 的管理方式。

不同版本冲突性

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-conflicting-installation-methods 总结:

  • 安装 cuda tookit,不同版本间没有冲突
  • 安装 driver,不同版本使用同一种方式 (deb 或 runfile) 安装没有冲突

卸载

sudo /usr/local/cuda-X.Y/bin/cuda-uninstaller  # runfile


# cuda
sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
 "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"

# driver
sudo apt purge "*nvidia*" "libxnvctrl*"   # 只卸载nvidia-driver-xxx不够

sudo apt autoremove

环境变量选择版本

cuda_version="cuda-11.0"
export PATH="/usr/local/$cuda_version/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/$cuda_version/lib64:$LD_LIBRARY_PATH"
export demo_suite="/usr/local/$cuda_version/extras/demo_suite"

cd $demo_suite
./deviceQuery   # 获得设备信息

nvcc 使用

nvcc -std=c++11 -lcurand -lcublas

实例

卸载 runfile 后重启(没重启还有很多 nvdia 的模块)

root@icarus3:/home/nfs/fyyuan# lsmod |grep nv
nvidia_drm             65536  0
nvidia_modeset       1273856  1 nvidia_drm
nvidia              55701504  27 gdrdrv,nvidia_modeset
drm_kms_helper        184320  4 ast,nvidia_drm
drm                   495616  7 drm_kms_helper,drm_vram_helper,ast,nvidia,nvidia_drm,ttm

安装 driver

root@icarus3:/home/nfs/fyyuan# apt-cache policy nvidia-driver-530
nvidia-driver-530:
  Installed: 530.41.03-0ubuntu0.20.04.2
  Candidate: 530.41.03-0ubuntu0.20.04.2
  Version table:
 *** 530.41.03-0ubuntu0.20.04.2 500
        500 https://mirrors.ustc.edu.cn/ubuntu focal-updates/restricted amd64 Packages
        500 https://mirrors.ustc.edu.cn/ubuntu focal-security/restricted amd64 Packages
        100 /var/lib/dpkg/status

安装 cuda

apt install nvidia-cuda-toolkit

root@icarus3:/home/nfs/fyyuan# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

版本太老,为 10.1 的

i0 上有许多 apt source

  • cuda-ubuntu2004-12-1-local.list 对应/var/cuda-repo-ubuntu2004-12-1-local/,包含很多 deb 文件
  • cuda-ubuntu2004-x86_64.list 为 https://developer.download.nvidia.com/compute/cuda deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda /repos/ubuntu2004/x86_64/ /
  • source.list 中还直接包含deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /
root@icarus0:/usr/local/cuda-12.1/bin# ls /etc/apt/sources.list.d/
cuda-ubuntu2004-12-1-local.list       docker.list       nvidia-container-toolkit.list.save
cuda-ubuntu2004-12-1-local.list.save  docker.list.save  nvidia-docker.list.save
cuda-ubuntu2004-x86_64.list           mlnx.list
cuda-ubuntu2004-x86_64.list.save      mlnx.list.save

官网按照 deb(network),发现添加了源,设置了

root@icarus3:/home/nfs/fyyuan# apt install cuda
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  cuda-12-1 cuda-cccl-12-1 cuda-command-line-tools-12-1 cuda-compiler-12-1 cuda-cudart-12-1 cuda-cudart-dev-12-1 cuda-cuobjdump-12-1 cuda-cupti-12-1
  cuda-cupti-dev-12-1 cuda-cuxxfilt-12-1 cuda-demo-suite-12-1 cuda-documentation-12-1 cuda-driver-dev-12-1 cuda-drivers cuda-drivers-530 cuda-gdb-12-1
  cuda-libraries-12-1 cuda-libraries-dev-12-1 cuda-nsight-12-1 cuda-nsight-compute-12-1 cuda-nsight-systems-12-1 cuda-nvcc-12-1 cuda-nvdisasm-12-1
  cuda-nvml-dev-12-1 cuda-nvprof-12-1 cuda-nvprune-12-1 cuda-nvrtc-12-1 cuda-nvrtc-dev-12-1 cuda-nvtx-12-1 cuda-nvvp-12-1 cuda-opencl-12-1
  cuda-opencl-dev-12-1 cuda-profiler-api-12-1 cuda-runtime-12-1 cuda-sanitizer-12-1 cuda-toolkit-12-1 cuda-toolkit-12-1-config-common
  cuda-toolkit-12-config-common cuda-toolkit-config-common cuda-tools-12-1 cuda-visual-tools-12-1 default-jre default-jre-headless gds-tools-12-1
  libcublas-12-1 libcublas-dev-12-1 libcufft-12-1 libcufft-dev-12-1 libcufile-12-1 libcufile-dev-12-1 libcurand-12-1 libcurand-dev-12-1 libcusolver-12-1
  libcusolver-dev-12-1 libcusparse-12-1 libcusparse-dev-12-1 libnpp-12-1 libnpp-dev-12-1 libnvjitlink-12-1 libnvjitlink-dev-12-1 libnvjpeg-12-1
  libnvjpeg-dev-12-1 libnvvm-samples-12-1 nsight-compute-2023.1.1 nsight-systems-2023.1.2 nvidia-modprobe nvidia-settings openjdk-11-jre
  openjdk-11-jre-headless
Suggested packages:
  fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei | fonts-wqy-zenhei fonts-indic
The following NEW packages will be installed:
  cuda cuda-12-1 cuda-cccl-12-1 cuda-command-line-tools-12-1 cuda-compiler-12-1 cuda-cudart-12-1 cuda-cudart-dev-12-1 cuda-cuobjdump-12-1 cuda-cupti-12-1
  cuda-cupti-dev-12-1 cuda-cuxxfilt-12-1 cuda-demo-suite-12-1 cuda-documentation-12-1 cuda-driver-dev-12-1 cuda-drivers cuda-drivers-530 cuda-gdb-12-1
  cuda-libraries-12-1 cuda-libraries-dev-12-1 cuda-nsight-12-1 cuda-nsight-compute-12-1 cuda-nsight-systems-12-1 cuda-nvcc-12-1 cuda-nvdisasm-12-1
  cuda-nvml-dev-12-1 cuda-nvprof-12-1 cuda-nvprune-12-1 cuda-nvrtc-12-1 cuda-nvrtc-dev-12-1 cuda-nvtx-12-1 cuda-nvvp-12-1 cuda-opencl-12-1
  cuda-opencl-dev-12-1 cuda-profiler-api-12-1 cuda-runtime-12-1 cuda-sanitizer-12-1 cuda-toolkit-12-1 cuda-toolkit-12-1-config-common
  cuda-toolkit-12-config-common cuda-toolkit-config-common cuda-tools-12-1 cuda-visual-tools-12-1 default-jre default-jre-headless gds-tools-12-1
  libcublas-12-1 libcublas-dev-12-1 libcufft-12-1 libcufft-dev-12-1 libcufile-12-1 libcufile-dev-12-1 libcurand-12-1 libcurand-dev-12-1 libcusolver-12-1
  libcusolver-dev-12-1 libcusparse-12-1 libcusparse-dev-12-1 libnpp-12-1 libnpp-dev-12-1 libnvjitlink-12-1 libnvjitlink-dev-12-1 libnvjpeg-12-1
  libnvjpeg-dev-12-1 libnvvm-samples-12-1 nsight-compute-2023.1.1 nsight-systems-2023.1.2 nvidia-modprobe openjdk-11-jre openjdk-11-jre-headless
The following packages will be upgraded:
  nvidia-settings
1 upgraded, 69 newly installed, 0 to remove and 59 not upgraded.
Need to get 2802 MB of archives.
After this operation, 6511 MB of additional disk space will be used.
Do you want to continue? [Y/n]

NVCC

两阶段编译

GPU 不同 generation,二进制程序是不兼容的。为了保证应用的兼容性,采用了两阶段编译。

NVIDIA CUDA Compiler Driver

  • nvcc 总是需要使用两个架构:虚拟架构 (compute_xx) 和真实架构 (sm_xx)
  • sm 必须实现了 compute 的功能
  • 虚拟架构表明了应用使用到的能力,使用更低的虚拟架构允许在更广泛的硬件上运行
  • 真实架构越高,生成的代码就更高效(硬件特性更多)

JIT 机制

  • 将 cubin 生成推迟到运行时
  • 缺点是增加了应用 startup 延迟

By specifying a virtual code architecture instead of a real GPU, nvcc postpones the assembly of PTX code until application runtime, at which time the target GPU is exactly known.

nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50

fatbinaries

A different solution to overcome startup delay by JIT while still allowing execution on newer GPUs is to specify multiple code instances

nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_52

example

NVIDIA CUDA Compiler Driver

--gpu-architecture-arch

nvcc x.cu --gpu-architecture=compute_50 --gpu-code=sm_50,sm_52

JIT

gpu-code 如果也指定虚拟架构,表明使用 JIT --gpu-code arguments can be virtual architectures. In this case the stage 2 translation will be omitted for such virtual architecture, and the stage 1 PTX result will be embedded instead. At application launch, and in case the driver does not find a better alternative, the stage 2 compilation will be invoked by the driver with the PTX as input.

nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_52

省略 gpu-code

nvcc x.cu --gpu-architecture=sm_52
nvcc x.cu --gpu-architecture=compute_50

分别等价于

nvcc x.cu --gpu-architecture=compute_52 --gpu-code=sm_52,compute_52
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50

同时省略 arch, gpu-code

默认使用 sm_52?

nvcc x.cu
# 等价于
nvcc x.cu --gpu-architecture=compute_52 --gpu-code=sm_52,compute_52

-generate-code 指定多种组合

使用一个虚拟架构,限制了功能。可以为不同硬件指定不同虚拟架构

compute_50 assumes no half-precision floating-point operation support for both the sm_50 code and the sm_53 code:

nvcc x.cu \
    --generate-code arch=compute_50,code=sm_50 \
    --generate-code arch=compute_50,code=sm_52 \
    --generate-code arch=compute_53,code=sm_53

查看编译结果包含哪些版本

cuobjdump

nvcc 选项

On all platforms, the default host compiler executable (gcc and g++ on Linux and cl.exe on Windows) found in the current execution search path will be used, unless specified otherwise with appropriate options

NVIDIA CUDA Compiler Driver

When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will be generated for this option. It is a warning if there are no visible supported GPU on the system, and the default architecture will be used.

If -arch=all is specified, nvcc embeds a compiled code image for all supported architectures (sm_*), and a PTX program for the highest major virtual architecture. For -arch=all-majornvcc embeds a compiled code image for all supported major versions (sm_*0), plus the earliest supported, and adds a PTX program for the highest major virtual architecture.

环境变量

https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#nvcc-environment-variables

export NVCC_PREPEND_FLAGS='-G -keep -arch=sm_60'
export NVCC_APPEND_FLAGS='-DNAME=" foo "'
nvcc foo.cu -o foo
# 等价于
nvcc -G -keep -arch=sm_60 foo.cu -o foo -DNAME=" foo "

Cuda 编程

统一内存模型

在 CUDA 编程中,__managed__ 是一个 CUDA 扩展关键字,用于标识被修饰的变量或数据结构将在主机(CPU)和设备(GPU)之间自动进行内存管理。它是 CUDA Unified Memory 功能的一部分,旨在简化主机和设备之间的内存管理和数据传输。

通过使用 __managed__ 关键字,您可以将变量或数据结构声明为统一内存(Unified Memory)。这意味着在使用这些统一内存对象时,无需显式地在主机和设备之间进行手动内存分配和数据传输。CUDA 运行时系统会自动处理内存的分配和迁移,以确保数据在主机和设备之间正确共享。

同步

cuda - Does __syncthreads() synchronize all threads in the grid? - Stack Overflow

__syncthreads()  // 该函数同步block内线程(实际是warp,warp内线程执行的是相同指令,本身就是同步的)
  • cudaDeviceSynchronize
    • CPU 等待 GPU 完成 kernel 调用

CUDA 没有全局同步(block 间同步)

  • 对于 SM 数量很大的 GPU,硬件成本太高
  • would force programmer to run fewer blocks (no more than # multiprocessors * # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency 解决办法: 分解为多个 kernel
  • kernel launch 作为 global sync point
  • kernel launch has negligible HW overhead, low SW overhead

cuBLAS

计算矩阵 A x B 方式一:

cublasSgemm (blas_handle, CUBLAS_OP_T, CUBLAS_OP_T,   //CUBLAS使用列优先存储。CUBLAS_OP_T表示进行转置,即行优先。列优先下leading dimesion为m(mxn矩阵)
                     m, n, k,
                     &alpha,
                     d_A, k, d_B, n,
                     &beta,
                     d_C, n); //结果仍为列优先,故计算得到C的转置

方式二:

/* 通过计算B^T * A^T得到C^T,由于列优先所有d_C实际为C的行优先结果。 //https://stackoverflow.com/a/56064726
        疑问:这里beta等于0,故C的初始值没有影响,行列优先均可(ldc=m, n均正确)。但是如果非0,那么如何处理呢?
        */
        cublasSgemm(blas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, m, k,
                    &alpha,
                    d_B, n, d_A, k,
                    &beta,
                    d_C, n);

工具

Which tools are available on which GPU architectures

GPU architecture Visual Profiler and nvprof Nsight Systems Nsight Compute
Maxwell Yes No No
Pascal Yes Yes No
Volta Yes Yes Yes
Turing Yes* Yes Yes
Ampere and later GPU architectures No Yes Yes

nsight sytem

  • high level

nsight compute

  • detailed
  • performance metrics

nvprof

The Visual Profiler is a graphical profiling tool that displays a timeline of your application’s CPU and GPU activity, and that includes an automated analysis engine to identify optimization opportunities. The nvprof profiling tool enables you to collect and view profiling data from the command-line.

Note that Visual Profiler and nvprof will be deprecated in a future CUDA release. The NVIDIA Volta platform is the last architecture on which these tools are fully supported. It is recommended to use next-generation tools NVIDIA Nsight Systems for GPU and CPU sampling and tracing and NVIDIA Nsight Compute for GPU kernel profiling.

Visual Profiler/nvvp

NVIDIA Visual Profiler | NVIDIA Developer

https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual-profiler

The Visual Profiler is available as both a standalone application and as part of Nsight Eclipse Edition. The standalone version of the Visual Profiler, nvvp, is included in the CUDA Toolkit for all supported OSes except for macOS.

步骤

  • 安装设置 1.8 JRE
  • Preparing An Application For Profiling
  • create session: A session contains the settings, data, and profiling results associated with your application. Each session is saved in a separate file; so you can delete, move, copy, or share a session by simply deleting, moving, copying, or sharing the session file. By convention, the file extension .nvvp is used for Visual Profiler session files.
  • analyziing

java 版本

需要 Java Runtime Environment (JRE) 1.8 可以启动时通过 vm 指定

nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

也可以设置全局 java 版本 update-alternatives 是一个 Debian 系统中用于管理多个软件版本的工具。通过它,你可以轻松切换默认使用的 Java 版本

sudo update-alternatives --config java

IOException: locking file

(base) fyyuan@snode6 ➜  ~ nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
java.lang.RuntimeException: Error initializing storage.
        at org.eclipse.osgi.internal.framework.EquinoxContainer.<init>(EquinoxContainer.java:77)
        at org.eclipse.osgi.launch.Equinox.<init>(Equinox.java:31)
        at org.eclipse.core.runtime.adaptor.EclipseStarter.startup(EclipseStarter.java:295)
        at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:231)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:648)
        at org.eclipse.equinox.launcher.Main.basicRun(Main.java:603)
        at org.eclipse.equinox.launcher.Main.run(Main.java:1465)
        at org.eclipse.equinox.launcher.Main.main(Main.java:1438)
Caused by: java.io.IOException: An error occurred while locking file "/staff/fyyuan/.eclipse/org.eclipse.platform_4.4.1_868875438_linux_gtk_x86_64/configuration/org.eclipse.osgi/.manager/.fileTableLock": "Input/output error". A common reason is that the file system or Runtime Environment does not support file locking for that location. Please choose a different location, or disable file locking by passing "-Dosgi.locking=none" as a VM argument.
        at org.eclipse.osgi.internal.location.Locker_JavaNio.lock(Locker_JavaNio.java:49)
        at org.eclipse.osgi.storagemanager.StorageManager.lock(StorageManager.java:388)
        at org.eclipse.osgi.storagemanager.StorageManager.open(StorageManager.java:701)
        at org.eclipse.osgi.storage.Storage.getChildStorageManager(Storage.java:1747)
        at org.eclipse.osgi.storage.Storage.getInfoInputStream(Storage.java:1764)
        at org.eclipse.osgi.storage.Storage.<init>(Storage.java:124)
        at org.eclipse.osgi.storage.Storage.createStorage(Storage.java:84)
        at org.eclipse.osgi.internal.framework.EquinoxContainer.<init>(EquinoxContainer.java:75)
        ... 11 more

提示的添加 vm 参数方法,添加到 cuda-tookit-dir/libnvvp/nvvp.ini 结尾(使用-vmargs 指定)

sudo vim /usr/local/cuda-11.8/libnvvp/nvvp.ini

-startup
plugins/org.eclipse.equinox.launcher_1.3.0.v20140415-2008.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.200.v20140603-1326
-data
@noDefault
-vmargs
-Dosgi.locking=none

PC sample View

  • compute_52 支持 PC 采样。
    • PC and state of warp are sampled at regular interval for one of the active warps per SM
    • The warp state indicates if that warp issued an instruction in a cycle or why it was stalled and could not issue an instruction
    • Hence the stall for the sampled warp need not necessarily indicate that there is a hole in the instruction issue pipeline.
  • compute_60 支持 latency reasons
    • While collecting these samples, there is no instruction issued in the respective warp scheduler and hence these give the latency reasons.

6.0 以上设备,Visual Profiler show two views: ‘Kernel Profile - PC Sampling’which gives the warp state view and‘Kernel Profile - PC Sampling - Latency’which gives the latency reasons.

memory 统计

  • The data paths from the SMs to the memory spaces (Global, Local, Texture, Surface and Shared) report the total number of memory instructions executed, it includes both read and write operations.
  • The data path between memory spaces and “Unified Cache” or “Shared Memory” reports the total amount of memory requests made.
  • All other data paths report the total amount of transferred memory in bytes.

source-disassembly

-lineinfo

Warp state/stall原因

1. Preparing An Application For Profiling — Profiler 12.3 documentation (nvidia.com)

https://docs.nvidia.com/cuda/profiler-users-guide/index.html#warp-state

能够对应到源码进行分析:CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling | NVIDIA Technical Blog

  • Instruction issued
  • Stalled
    • Stalled for instruction fetch
      • For very short kernels, consider fusing into a single kernels.
    • Stalled for execution dependency
    • Stalled for memory dependency
      • 尝试提高内存合并和/或获取字节的效率(对齐等)。使用 gld_efficiency 和 gst_efficiency 检查未合并的内存访问
      • memory-level parallelism (MLP): the number of independent memory operations in flight per thread. Loop unrolling, loading vector types such as float4, and processing multiple elements per thread are all ways to increase memory-level parallelism.
      • 请考虑将经常访问的数据移动到更靠近 SM 的位置,例如使用共享内存或只读数据缓存。
      • 考虑尽可能重新计算数据,而不是从设备内存加载数据
      • 如果本地内存访问量很高,请考虑增加每个线程的寄存器计数以减少溢出,即使以占用率为代价
    • Stalled for memory throttle
    • Stalled for texture
    • Stalled for sync
    • Stalled for constant memory dependency
    • Stalled for pipe busy:functional unit busy
    • Stalled for not selected:Warp 已准备就绪,但没有机会发出,因为选择了其他一些 Warp 进行发出。
    • Stalled for other:Warp 因不常见的原因(如编译器或硬件原因)而被阻止。开发人员无法控制这些 stall。

nvprof

  • Summary Mode
    • 默认模式,nvprof outputs a single result line for each kernel function and each type of CUDA memory copy/set performed by the application.
    • 如果不需要,可以使用 关闭 API 跟踪 --profile-api-trace none 。这减少了一些性能分析开销,尤其是在内核较短时。
  • gpu-trace
    • GPU-Trace mode provides a timeline of all activities taking place on the GPU in chronological order.
    • 按照时间顺序显示。同一个 kernel 会多次显示
    • nvprof --print-gpu-trace matrixMul
  • API-trace: API-trace mode shows the timeline of all CUDA runtime and driver API calls invoked on the host in chronological order.
    • 显示 runtime 和 driver API 调用
    • nvprof --print-api-trace matrixMul
  • Event/metric Summary Mode
    • --events all --metrics all

An event is a countable activity, action, or occurrence on a device. It corresponds to a single hardware counter value which is collected during kernel execution. To see a list of all available events on a particular NVIDIA GPU, type nvprof --query-events.

metric is a characteristic of an application that is calculated from one or more event values. To see a list of all available metrics on a particular NVIDIA GPU, type nvprof --query-metrics. You can also refer to the metrics reference .

Usage: nvprof [options] [application] [application-arguments]

--analysis-metrics
                        Collect profiling data that can be imported to Visual Profiler's
                        "analysis" mode. Note: Use "--export-profile" to specify
                        an export file.

--devices <device ids>
                        Change the scope of subsequent "--events", "--metrics", "--query-events"
                        and "--query-metrics" options.
                        Allowed values:
                                all - change scope to all valid devices
                                comma-separated device IDs - change scope to specified
                        devices

  -e,  --events <event names>
                        Specify the events to be profiled on certain device(s). Multiple
                        event names separated by comma can be specified. Which device(s)
                        are profiled is controlled by the "--devices" option. Otherwise
                        events will be collected on all devices.
                        For a list of available events, use "--query-events".
                        Use "--events all" to profile all events available for each
                        device.
                        Use "--devices" and "--kernels" to select a specific kernel
                        invocation.
 --kernels <kernel path syntax>
                        Change the scope of subsequent "--events", "--metrics" options.
                        The syntax is as follows:
                                <kernel name>
                                Limit scope to given kernel name.
                        or
                                <context id/name>:<stream id/name>:<kernel name>:<invocation>
                        The context/stream IDs, names, kernel name and invocation
                        can be regular expressions. Empty string matches any number
                        or characters. If <context id/name> or <stream id/name>
                        is a positive number, it's strictly matched against the
                        CUDA context/stream ID. Otherwise it's treated as a regular
                        expression and matched against the context/stream name specified
                        by the NVTX library. If the invocation count is a positive
                        number, it's strictly matched against the invocation of
                        the kernel. Otherwise it's treated as a regular expression.
                        Example: --kernels "1:foo:bar:2" will profile any kernel
                        whose name contains "bar" and is the 2nd instance on context
                        1 and on stream named "foo".

  -m,  --metrics <metric names>
                        Specify the metrics to be profiled on certain device(s).
                        Multiple metric names separated by comma can be specified.
                        Which device(s) are profiled is controlled by the "--devices"
                        option. Otherwise metrics will be collected on all devices.
                        For a list of available metrics, use "--query-metrics".
                        Use "--metrics all" to profile all metrics available for
                        each device.
                        Use "--devices" and "--kernels" to select a specific kernel
                        invocation.
                        Note: "--metrics all" does not include some metrics which
                        are needed for Visual Profiler's source level analysis.
                        For that, use "--analysis-metrics".

 --pc-sampling-period <period>
                        Specify PC Sampling period in cycles,  at which the sampling
                        records will be dumped. Allowed values for the period are
                        integers between 5 to 31 both inclusive.
                        This will set the sampling period to (2^period) cycles
                        Default value is a number between 5 and 12 based on the setup.
                        Note: Only available for GM20X+.

 --print-api-summary
                  Print a summary of CUDA runtime/driver API calls.

 --print-api-trace
                  Print CUDA runtime/driver API trace.

 --print-gpu-trace
                        Print individual kernel invocations (including CUDA memcpy's/memset's)
                        and sort them in chronological order. In event/metric profiling
                        mode, show events/metrics for each kernel invocation.
  -s,  --print-summary
                        Print a summary of the profiling result on screen. Note:
                        This is the default unless "--export-profile" or other print
                        options are used.

  -o,  --export-profile <filename>
                        Export the result file which can be imported later or opened
                        by the NVIDIA Visual Profiler.
                                "%p" in the file name string is replaced with the
                        process ID of the application being profiled.
                                "%q{<ENV>}" in the file name string is replaced
                        with the value of the environment variable "<ENV>". If the
                        environment variable is not set it's an error.
                                "%h" in the file name string is replaced with the
                        hostname of the system.
                                "%%" in the file name string is replaced with "%".
                                Any other character following "%" is illegal.
                        By default, this option disables the summary output. Note:
                        If the application being profiled creates child processes,
                        or if '--profile-all-processes' is used, the "%p" format
                        is needed to get correct export files for each process.

metric 具体含义:1. Preparing An Application For Profiling — Profiler 12.3 documentation (nvidia.com)

The Visual Profiler is a graphical profiling tool that displays a timeline of your application’s CPU and GPU activity, and that includes an automated analysis engine to identify optimization opportunities. The nvprof profiling tool enables you to collect and view profiling data from the command-line.

下一代

Note that Visual Profiler and nvprof will be deprecated in a future CUDA release. The NVIDIA Volta platform is the last architecture on which these tools are fully supported. It is recommended to use next-generation tools NVIDIA Nsight Systems for GPU and CPU sampling and tracing and NVIDIA Nsight Compute for GPU kernel profiling.

迁移:1. Preparing An Application For Profiling — Profiler 12.3 documentation (nvidia.com)

  • An event is a countable activity, action, or occurrence on a device. It corresponds to a single hardware counter value which is collected during kernel execution. To see a list of all available events on a particular NVIDIA GPU, type nvprof --query-events.
  • metric is a characteristic of an application that is calculated from one or more event values. To see a list of all available metrics on a particular NVIDIA GPU, type nvprof --query-metrics. You can also refer to the metrics reference .

支持 cudaProfilerStart, Stop CUDA API 用于专注部分代码的分析。nvprof 需要使用--profile-from-start off 来使用

NVTX 用于给 CPU 代码打上标记,好在工具里看到

To understand what the application’s CPU threads are doing outside of CUDA function calls, you can use the NVIDIA Tools Extension API (NVTX). When you add NVTX markers and ranges to your application, the Timeline View shows when your CPU threads are executing within those regions.

The Visual Profiler is available as both a standalone application and as part of Nsight Eclipse Edition. The standalone version of the Visual Profiler, nvvp, is included in the CUDA Toolkit for all supported OSes

流程

  • 准备应用程序
    • 二进制程序,不用专门修改。但是使用 NVTX 更好

Nsight Compute

Nsight Compute Documentation (nvidia.com)

  • Kernel Profiling Guide
    • Kernel Profiling Guide with metric types and meaning, data collection modes and FAQ for common problems.
  • Nsight Compute
    • NVIDIA Nsight Compute User Interface (UI) manual. Information on all views, controls and workflows within the tool UI. Transitions guide for Visual Profiler.
  • Nsight Compute CLI
    • NVIDIA Nsight Compute Command Line Interface (CLI) manual. Information on workflows and options for the command line, including multi-process profiling and NVTX filtering. Transitions guide for Nvprof.
  • nvprof:sm7.5 之前使用,被 nsight compute 替代
  • Nsight compute
    • 图形化界面
    • 命令行:ncu

ncu

Nsight Compute CLI :: Nsight Compute Documentation (nvidia.com)

metric

--metric

Specify all metrics to be profiled, separated by comma. If no --section options are given, only the temporary section containing all metrics listed using this option is collected. If --section options are given in addition to --metrics, all metrics from those sections and from --metrics are collected.

nvprof metric 在 ncu 中的表示 Nsight Compute CLI :: Nsight Compute Documentation (nvidia.com)

|inst_executed|smsp__inst_executed.sum|

问题

Profiling is not supported on this device

Nsight compute 支持的 GPU:Release Notes :: Nsight Compute Documentation (nvidia.com)

  • 不支持 pascal(如 gtx1080),从 Volta GV100 开始支持

无权限使用 metric

修改/etc/modprobe.d,重启保持修改

sudo vim /etc/modprobe.d/nvidia_profile.conf

options nvidia NVreg_RestrictProfilingToAdminUsers=0

update-initramfs -u -k all #更新initramfs, -k all指定所有内核版本

不用重启,临时修改方法

modprobe -rf nvidia_uvm nvidia_drm nvidia_modeset nvidia-vgpu-vfio nvidia

modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0

无法移除时,查看是什么进程占用了 gpu

sudo lsof /dev/nvidia*

可能需要停止 nvidia-persistenced.service 服务

sudo systemctl status nvidia-persistenced.service

调试工具

cuda-memck

使用 cuda-memcheck 检查程序是否有非法地址访问。会使得程序执行变慢

cuda-memcheck prog arg...

cuda-gdb 调试

nvcc -g -G XXX.cu -o XXX # -g对于cpu, -G对于GPU

cuda-gdb 和 gdb 使用类似。

cuda 程序如果代码写错,执行时只能从内核日志中看到报错信息(如 Xid 报错),非常不便。

而使用 cuda-gdb 可以直接定位哪出代码访存错误:

(cuda-gdb) r 10 10 10                                                                                CUDA Exception: Warp Illegal Address                                                                 The exception was triggered at PC 0x55555617ff10 (GEMM.cu:82)

Thread 1 "GEMM" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 4, block (0,0,0), thread (0,6,0), device 0, sm 0, warp 6, lane 0] 0x000055555617ff20 in gemm_block_shared<32, 32, 8><<<(32,32,1),(32,32,1)>>> (A=0x7fffbda00000, B=0x7fffd7800000, C=0x7fffd7c00000, m=1024, k=1024, n=1024) at GEMM.cu:82
82                  Bs[ty][tx] = B(iter * bm + ty, bx * bk + tx); //(tx, ty) in block B(iter, bx)
(cuda-gdb) quit
A debugging session is active.
        Inferior 1 [process 32589] will be killed.
Quit anyway? (y or n) y

NVbit

unsigned int __ballot(int predicate); #
  • If predicate is nonzero, __ballot returns a value with the Nth bit set, where N is the thread index.
int atomicOr(int* address, int val);

warp-level 指令

CSE 599 I Accelerated Computing - Programming GPUs Lecture 18.pdf (tschmidt23.github.io)

cuda SDK