1.2 PyTorch + CUDA
2025-04-20
1.2.1 H20/H100-class GPU issues
These cards use the Hopper architecture with compute capability sm_90, so the CUDA version must be >= 11.8.
Minimum CUDA version required for each GPU: https://stackoverflow.com/questions/28932864/which-compute-capability-is-supported-by-which-cuda-versions/28933055#28933055
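A quick sanity check before building anything (a minimal sketch; the compute_cap query field requires a reasonably recent driver, and the last line assumes PyTorch is already installed):
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader   # expect 9.0 for H20/H100
nvcc --version                                                  # toolkit used for compilation, expect >= 11.8
python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"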
1.2.2 nvdiffrast CUDA 11.8 issue
https://github.com/NVlabs/nvdiffrast/issues/138
Main symptom:
nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference)
nvcc error : 'ptxas' core dumped
Fix: upgrade CUDA 😭; upgrading to 12.1 solved it (note: it is still best not to use Conda's CUDA, which frequently cannot be found as a CUDA environment).
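If it is unclear which toolkit is actually in use, it helps to confirm that nvcc and ptxas both come from the upgraded 12.1 install before retrying (nvdiffrast builds its CUDA ops at first use, so this matters at runtime, not only at install time):
which nvcc && nvcc --version     # should report release 12.1
which ptxas && ptxas --version   # ptxas must come from the same toolkit as nvcc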
1.2.3 Installing CUDA/GCC with Conda
Setting up the environment on a new server:
Install mamba (faster than conda)
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" bash Miniforge3-$(uname)-$(uname -m).sh
Create the environment
mamba create -n h_gs python=3.12 -y
mamba install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia/label/cuda-12.1.0
Install GCC/G++
mamba install gcc=11.4
mamba install gxx=11.4
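An optional quick check that the conda compilers shadow the system ones (assuming the environment is activated):
which gcc g++      # both should resolve inside $CONDA_PREFIX/bin
gcc --version      # should report 11.4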
Install CUDA
Conda install commands for every CUDA version: https://anaconda.org/nvidia/cuda-toolkit
mamba install nvidia/label/cuda-12.1.0::cuda-toolkit -c nvidia/label/cuda-12.1.0
Note: do not install PyTorch with pip here. In practice, installing PyTorch with pip and then installing CUDA from an nvidia/label/cuda-xxx channel is buggy (conda ends up installing the latest CUDA version instead).
Note: Conda's CUDA frequently cannot be found as a CUDA environment:
https://stackoverflow.com/questions/72684130/how-to-set-the-cuda-path-in-the-conda-environment
Enter the Conda environment directory and create the following subdirectories and files:
cd $CONDA_PREFIX
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh
Edit ./etc/conda/activate.d/env_vars.sh as follows:
#!/bin/sh
export CUDA_HOME=$CONDA_PREFIX
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
Edit ./etc/conda/deactivate.d/env_vars.sh as follows:
#!/bin/sh
unset CUDA_HOME
unset LD_LIBRARY_PATH
When you run conda activate for this environment, CUDA_HOME and LD_LIBRARY_PATH are set to the values written in the file; when you run conda deactivate, these environment variables are removed.
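A simple way to verify the hooks work (a sketch; substitute your own environment name for h_gs):
conda activate h_gs
echo $CUDA_HOME            # should print the environment prefix
which nvcc                 # should resolve to $CONDA_PREFIX/bin/nvcc
conda deactivate
echo $CUDA_HOME            # should now print nothing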
1.2.4 undefined symbol: iJIT_NotifyEvent
Traceback (most recent call last):
File "/mnt/data4/home/xxxx/GPAvatar/inference.py", line 11, in <module>
import torch
File "/mnt/data/home/xxxx/miniforge3/envs/GPAvatar/lib/python3.9/site-packages/torch/__init__.py", line 229, in <module>
from torch._C import * # noqa: F403
ImportError: /mnt/data/home/xxxx/miniforge3/envs/GPAvatar/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent
https://github.com/pytorch/pytorch/issues/123097
The bug is triggered by import torch in the following environment:
mamba install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
mamba install nvidia/label/cuda-11.8.0::cuda-toolkit -c nvidia/label/cuda-11.8.0
mkl is at 2024.2.2 at this point; downgrade it to mkl=2024.0:
mamba install mkl==2024.0
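After the downgrade, a one-line check that torch imports cleanly again:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"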
1.2.5 ld: cannot find -lcudart
After installing PyTorch and the CUDA environment with the workflow from 1.2.3, compiling a library against the conda CUDA environment failed:
/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/compiler_compat/ld: cannot find -lcudart: No such file or directory
collect2: error: ld returned 1 exit status
error: command '/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/bin/g++' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for diff_gaussian_rasterization_32d
Running setup.py clean for diff_gaussian_rasterization_32d
Failed to build diff_gaussian_rasterization_32d
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (diff_gaussian_rasterization_32d)
❯ which nvcc
/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/bin/nvcc
❯ echo $CUDA_HOME
/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar
❯ echo $PATH
/home/xxxx/local/bin:/home/xxxx/local/bin:/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/bin:/mnt/data/home/xxxx/miniforge3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
❯ echo $LD_LIBRARY_PATH
/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/lib:
Inspecting the environment shows the symlink here is broken (a possible band-aid is sketched right after the listing):
❯ ls /mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/lib
libcudart.so -> libcudart.so.12.1.55
libcudart.so.12
libcudart.so.12.1.105
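A possible band-aid is to re-point the dangling link at the runtime library that actually exists (not the root cause; the real culprit is the channel mismatch traced below, and the path is just this particular environment's):
cd /mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/lib
ln -sf libcudart.so.12.1.105 libcudart.so    # repair the dangling libcudart.so link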
Digging further, installing PyTorch pulls in cuda-cudart=12.1.105.
These are all the packages from the pytorch and nvidia channels that installing PyTorch brings in:
+ pytorch-mutex 1.0 cuda pytorch Cached
+ libcublas 12.1.0.26 0 nvidia Cached
+ libcufft 11.0.2.4 0 nvidia Cached
+ libcusolver 11.4.4.55 0 nvidia Cached
+ libcusparse 12.0.2.55 0 nvidia Cached
+ libnpp 12.0.2.50 0 nvidia Cached
+ cuda-cudart 12.1.105 0 nvidia Cached
+ cuda-nvrtc 12.1.105 0 nvidia Cached
+ libnvjitlink 12.1.105 0 nvidia Cached
+ libnvjpeg 12.1.1.14 0 nvidia Cached
+ cuda-cupti 12.1.105 0 nvidia Cached
+ cuda-nvtx 12.1.105 0 nvidia Cached
+ ffmpeg 4.3 hf484d3e_0 pytorch Cached
+ libjpeg-turbo 2.0.0 h9bf148f_0 pytorch Cached
+ cuda-version 12.6 3 nvidia Cached
+ libcurand 10.3.7.77 0 nvidia Cached
+ libcufile 1.11.1.6 0 nvidia Cached
+ cuda-opencl 12.6.77 0 nvidia Cached
+ cuda-libraries 12.1.0 0 nvidia Cached
+ cuda-runtime 12.1.0 0 nvidia Cached
+ pytorch-cuda 12.1 ha16c6d3_6 pytorch Cached
+ pytorch 2.4.1 py3.12_cuda12.1_cudnn9.1.0_0 pytorch Cached
+ torchtriton 3.0.0 py312 pytorch Cached
+ torchaudio 2.4.1 py312_cu121 pytorch Cached
+ torchvision 0.19.1 py312_cu121 pytorch Cached
And these are the packages installed by cuda-toolkit-12.1.0:
+ cuda-documentation 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvml-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnvvm-samples 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cccl 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-driver-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-profiler-api 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cudart 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvrtc 12.1.55 0 nvidia/label/cuda-12.1.0 21MB
+ cuda-opencl 12.1.56 0 nvidia/label/cuda-12.1.0 11kB
+ libcublas 12.1.0.26 0 nvidia/label/cuda-12.1.0 Cached
+ libcufft 11.0.2.4 0 nvidia/label/cuda-12.1.0 Cached
+ libcufile 1.6.0.25 0 nvidia/label/cuda-12.1.0 782kB
+ libcurand 10.3.2.56 0 nvidia/label/cuda-12.1.0 54MB
+ libcusolver 11.4.4.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcusparse 12.0.2.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnpp 12.0.2.50 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjitlink 12.1.55 0 nvidia/label/cuda-12.1.0 18MB
+ libnvjpeg 12.1.0.39 0 nvidia/label/cuda-12.1.0 3MB
+ cuda-cupti 12.1.62 0 nvidia/label/cuda-12.1.0 5MB
+ cuda-cuobjdump 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cuxxfilt 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvcc 12.1.66 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvprune 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-gdb 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvdisasm 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvprof 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvtx 12.1.66 0 nvidia/label/cuda-12.1.0 58kB
+ cuda-sanitizer-api 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nsight 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ nsight-compute 2023.1.0.15 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cudart-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvrtc-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-opencl-dev 12.1.56 0 nvidia/label/cuda-12.1.0 Cached
+ libcublas-dev 12.1.0.26 0 nvidia/label/cuda-12.1.0 Cached
+ libcufft-dev 11.0.2.4 0 nvidia/label/cuda-12.1.0 Cached
+ gds-tools 1.6.0.25 0 nvidia/label/cuda-12.1.0 Cached
+ libcufile-dev 1.6.0.25 0 nvidia/label/cuda-12.1.0 Cached
+ libcurand-dev 10.3.2.56 0 nvidia/label/cuda-12.1.0 Cached
+ libcusolver-dev 11.4.4.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcusparse-dev 12.0.2.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnpp-dev 12.0.2.50 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjitlink-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjpeg-dev 12.1.0.39 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-libraries 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cupti-static 12.1.62 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-compiler 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvvp 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-command-line-tools 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nsight-compute 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cudart-static 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvrtc-static 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcublas-static 12.1.0.26 0 nvidia/label/cuda-12.1.0 Cached
+ libcufft-static 11.0.2.4 0 nvidia/label/cuda-12.1.0 Cached
+ libcufile-static 1.6.0.25 0 nvidia/label/cuda-12.1.0 Cached
+ libcurand-static 10.3.2.56 0 nvidia/label/cuda-12.1.0 Cached
+ libcusolver-static 11.4.4.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcusparse-static 12.0.2.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnpp-static 12.0.2.50 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjpeg-static 12.1.0.39 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-libraries-dev 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-libraries-static 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-visual-tools 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-tools 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-toolkit 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
These are the packages installed by cuda-toolkit-12.1.1:
+ cuda-documentation 12.1.105 0 nvidia/label/cuda-12.1.1 91kB
+ cuda-nvml-dev 12.1.105 0 nvidia/label/cuda-12.1.1 87kB
+ libnvvm-samples 12.1.105 0 nvidia/label/cuda-12.1.1 33kB
+ cuda-cccl 12.1.109 0 nvidia/label/cuda-12.1.1 1MB
+ cuda-driver-dev 12.1.105 0 nvidia/label/cuda-12.1.1 17kB
+ cuda-profiler-api 12.1.105 0 nvidia/label/cuda-12.1.1 19kB
+ cuda-cudart 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ cuda-nvrtc 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ cuda-opencl 12.1.105 0 nvidia/label/cuda-12.1.1 11kB
+ libcublas 12.1.3.1 0 nvidia/label/cuda-12.1.1 367MB
+ libcufft 11.0.2.54 0 nvidia/label/cuda-12.1.1 108MB
+ libcufile 1.6.1.9 0 nvidia/label/cuda-12.1.1 783kB
+ libcurand 10.3.2.106 0 nvidia/label/cuda-12.1.1 54MB
+ libcusolver 11.4.5.107 0 nvidia/label/cuda-12.1.1 116MB
+ libcusparse 12.1.0.106 0 nvidia/label/cuda-12.1.1 177MB
+ libnpp 12.1.0.40 0 nvidia/label/cuda-12.1.1 147MB
+ libnvjitlink 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ libnvjpeg 12.2.0.2 0 nvidia/label/cuda-12.1.1 3MB
+ cuda-cupti 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ cuda-cuobjdump 12.1.111 0 nvidia/label/cuda-12.1.1 245kB
+ cuda-cuxxfilt 12.1.105 0 nvidia/label/cuda-12.1.1 302kB
+ cuda-nvcc 12.1.105 0 nvidia/label/cuda-12.1.1 55MB
+ cuda-nvprune 12.1.105 0 nvidia/label/cuda-12.1.1 67kB
+ cuda-gdb 12.1.105 0 nvidia/label/cuda-12.1.1 6MB
+ cuda-nvdisasm 12.1.105 0 nvidia/label/cuda-12.1.1 50MB
+ cuda-nvprof 12.1.105 0 nvidia/label/cuda-12.1.1 5MB
+ cuda-nvtx 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ cuda-sanitizer-api 12.1.105 0 nvidia/label/cuda-12.1.1 18MB
+ cuda-nsight 12.1.105 0 nvidia/label/cuda-12.1.1 119MB
+ nsight-compute 2023.1.1.4 0 nvidia/label/cuda-12.1.1 808MB
+ cuda-cudart-dev 12.1.105 0 nvidia/label/cuda-12.1.1 381kB
+ cuda-nvrtc-dev 12.1.105 0 nvidia/label/cuda-12.1.1 12kB
+ cuda-opencl-dev 12.1.105 0 nvidia/label/cuda-12.1.1 59kB
+ libcublas-dev 12.1.3.1 0 nvidia/label/cuda-12.1.1 76kB
+ libcufft-dev 11.0.2.54 0 nvidia/label/cuda-12.1.1 14kB
+ gds-tools 1.6.1.9 0 nvidia/label/cuda-12.1.1 43MB
+ libcufile-dev 1.6.1.9 0 nvidia/label/cuda-12.1.1 13kB
+ libcurand-dev 10.3.2.106 0 nvidia/label/cuda-12.1.1 460kB
+ libcusolver-dev 11.4.5.107 0 nvidia/label/cuda-12.1.1 51kB
+ libcusparse-dev 12.1.0.106 0 nvidia/label/cuda-12.1.1 178MB
+ libnpp-dev 12.1.0.40 0 nvidia/label/cuda-12.1.1 525kB
+ libnvjitlink-dev 12.1.105 0 nvidia/label/cuda-12.1.1 15MB
+ libnvjpeg-dev 12.2.0.2 0 nvidia/label/cuda-12.1.1 13kB
+ cuda-libraries 12.1.1 0 nvidia/label/cuda-12.1.1 2kB
+ cuda-cupti-static 12.1.105 0 nvidia/label/cuda-12.1.1 12MB
+ cuda-compiler 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-nvvp 12.1.105 0 nvidia/label/cuda-12.1.1 120MB
+ cuda-command-line-tools 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-nsight-compute 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-cudart-static 12.1.105 0 nvidia/label/cuda-12.1.1 948kB
+ cuda-nvrtc-static 12.1.105 0 nvidia/label/cuda-12.1.1 18MB
+ libcublas-static 12.1.3.1 0 nvidia/label/cuda-12.1.1 389MB
+ libcufft-static 11.0.2.54 0 nvidia/label/cuda-12.1.1 199MB
+ libcufile-static 1.6.1.9 0 nvidia/label/cuda-12.1.1 3MB
+ libcurand-static 10.3.2.106 0 nvidia/label/cuda-12.1.1 55MB
+ libcusolver-static 11.4.5.107 0 nvidia/label/cuda-12.1.1 76MB
+ libcusparse-static 12.1.0.106 0 nvidia/label/cuda-12.1.1 185MB
+ libnpp-static 12.1.0.40 0 nvidia/label/cuda-12.1.1 143MB
+ libnvjpeg-static 12.2.0.2 0 nvidia/label/cuda-12.1.1 3MB
+ cuda-libraries-dev 12.1.1 0 nvidia/label/cuda-12.1.1 2kB
+ cuda-libraries-static 12.1.1 0 nvidia/label/cuda-12.1.1 2kB
+ cuda-visual-tools 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-tools 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-toolkit 12.1.1 0 nvidia/label/cuda-12.1.1 2kB
Comparing the lists, it is cuda-12.1.1 that actually matches the CUDA 12.1 build of PyTorch. But if we install the CUDA 12.1 build of PyTorch first and then install cuda-12.1.1, we hit a conflict:
└─ pytorch-cuda is not installable because it requires
└─ libcublas >=12.1.0.26,<12.1.3.1 , which conflicts with any installable versions previously reported.
In other words, the libcublas pinned by the damned CUDA 12.1 build of PyTorch matches cuda-toolkit-12.1.0, while its cuda-cudart and related libraries match cuda-toolkit-12.1.1.
Since pytorch-cuda hard-requires libcublas >=12.1.0.26,<12.1.3.1, we have to accommodate PyTorch and install CUDA 12.1.0. The trick is to replace the nvidia channel from PyTorch's official install command with nvidia/label/cuda-12.1.0, using the following command:
mamba install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia/label/cuda-12.1.0
It then installs exactly the same CUDA libraries as our cuda-toolkit-12.1.0 install:
+ pytorch-mutex 1.0 cuda pytorch Cached
+ libcublas 12.1.0.26 0 nvidia/label/cuda-12.1.0 Cached
+ libcufft 11.0.2.4 0 nvidia/label/cuda-12.1.0 Cached
+ libcusolver 11.4.4.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcusparse 12.0.2.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnpp 12.0.2.50 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjpeg 12.1.0.39 0 nvidia/label/cuda-12.1.0 3MB
+ cuda-cudart 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvrtc 12.1.55 0 nvidia/label/cuda-12.1.0 21MB
+ cuda-opencl 12.1.56 0 nvidia/label/cuda-12.1.0 11kB
+ libcufile 1.6.0.25 0 nvidia/label/cuda-12.1.0 782kB
+ libcurand 10.3.2.56 0 nvidia/label/cuda-12.1.0 54MB
+ cuda-cupti 12.1.62 0 nvidia/label/cuda-12.1.0 5MB
+ cuda-nvtx 12.1.66 0 nvidia/label/cuda-12.1.0 58kB
+ cuda-version 12.1 h1d6eff3_3 conda-forge 21kB
+ ffmpeg 4.3 hf484d3e_0 pytorch Cached
+ libjpeg-turbo 2.0.0 h9bf148f_0 pytorch Cached
+ libnvjitlink 12.1.105 hd3aeb46_0 conda-forge 16MB
+ cuda-libraries 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-runtime 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ pytorch-cuda 12.1 ha16c6d3_6 pytorch Cached
+ pytorch 2.4.1 py3.12_cuda12.1_cudnn9.1.0_0 pytorch Cached
+ torchtriton 3.0.0 py312 pytorch Cached
+ torchvision 0.19.1 py312_cu121 pytorch Cached
+ torchaudio 2.4.1 py312_cu121 pytorch Cached
At this point the problem is solved. Whenever we need pytorch-cuda and cuda-toolkit, we just run the following commands (the order should no longer matter):
mamba install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia/label/cuda-12.1.0
mamba install nvidia/label/cuda-12.1.0::cuda-toolkit -c nvidia/label/cuda-12.1.0
Installing cuda-toolkit afterwards simply tops up the CUDA libraries that pytorch-cuda already pulled in; since everything comes from the same channel, there is no conflict.
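An optional consistency check afterwards (the grep pattern is only illustrative):
conda list | grep -E "cuda-cudart|libcublas|cuda-nvcc"    # all should show channel nvidia/label/cuda-12.1.0
python -c "import torch; print(torch.version.cuda)"       # expect 12.1
nvcc --version                                            # expect release 12.1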
1.2.6 CUDA error: CUBLAS_STATUS_NOT_SUPPORTED
After installing pytorch=1.11.0 + cuda=11.3 with the official command:
mamba install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
calling torch.einsum raises the following error:
File "/mnt/data4/home/xxxx/eg3d/dataset_preprocessing/ffhq/Deep3DFaceRecon_pytorch/models/bfm.py", line 96, in compute_shape
id_part = torch.einsum('ij,aj->ai', self.id_base, id_coeff)
File "/mnt/data/home/xxxx/miniforge3/envs/deep3d_pytorch/lib/python3.8/site-packages/torch/functional.py", line 330, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
Solution:
Install PyTorch with pip instead (the conda cudatoolkit install method for this old version seems to be the problem; from PyTorch v1.13.0, which supports CUDA 11.7, conda installs pytorch-cuda instead, and that path should be fine):
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
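After reinstalling, a minimal smoke test that exercises the same einsum pattern as the failing line in bfm.py (the shapes are made up for illustration):
python -c "import torch; a = torch.randn(1024, 80, device='cuda'); b = torch.randn(4, 80, device='cuda'); print(torch.einsum('ij,aj->ai', a, b).shape)"   # expect torch.Size([4, 1024]) with no CUBLAS error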
1.2.7 ld: cannot find -lcuda
An error hit while compiling the infernal tiny-cuda-nn:
mamba create --name map4d -y python=3.10
mamba install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia/label/cuda-11.8.0
# hit the bug from 1.2.4 -> downgrade mkl
mamba install mkl=2024.0
mamba install gxx=11.4
pip install ninja git+https://github.com/hturki/tiny-cuda-nn.git@ht/res-grid#subdirectory=bindings/torch
The error:
[10/10] /mnt/data/home/wangjiawei/miniforge3/envs/map4d/bin/nvcc -I/tmp/pip-req-build-yzogp32j/include -I/tmp/pip-req-build-yzogp32j/dependencies -I/tmp/pip-req-build-yzogp32j/dependencies/cutlass/include -I/tmp/pip-req-build-yzogp32j/dependencies/cutlass/tools/util/include -I/tmp/pip-req-build-yzogp32j/dependencies/fmt/include -I/mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib/python3.10/site-packages/torch/include -I/mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib/python3.10/site-packages/torch/include/TH -I/mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib/python3.10/site-packages/torch/include/THC -I/mnt/data/home/wangjiawei/miniforge3/envs/map4d/include -I/mnt/data/home/wangjiawei/miniforge3/envs/map4d/include/python3.10 -c -c /tmp/pip-req-build-yzogp32j/src/fully_fused_mlp.cu -o /tmp/pip-req-build-yzogp32j/bindings/torch/src/fully_fused_mlp.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -std=c++14 --extended-lambda --expt-relaxed-constexpr -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -Xcompiler=-Wno-float-conversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -DTCNN_MIN_GPU_ARCH=89 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_89_C -D_GLIBCXX_USE_CXX11_ABI=0
/tmp/pip-req-build-yzogp32j/dependencies/fmt/include/fmt/core.h(288): warning #1675-D: unrecognized GCC pragma
/tmp/pip-req-build-yzogp32j/dependencies/fmt/include/fmt/core.h(288): warning #1675-D: unrecognized GCC pragma
creating build/lib.linux-x86_64-cpython-310/tinycudann_bindings
g++ -pthread -B /mnt/data/home/wangjiawei/miniforge3/envs/map4d/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /mnt/data/home/wangjiawei/miniforge3/envs/map4d/include -fPIC -O2 -isystem /mnt/data/home/wangjiawei/miniforge3/envs/map4d/include -shared /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../dependencies/fmt/src/format.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../dependencies/fmt/src/os.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../src/common.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../src/common_device.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../src/cpp_api.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../src/cutlass_mlp.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../src/encoding.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../src/fully_fused_mlp.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/../../src/network.o /tmp/pip-req-build-yzogp32j/bindings/torch/build/temp.linux-x86_64-cpython-310/tinycudann/bindings.o -L/mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib/python3.10/site-packages/torch/lib -L/mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib -lcuda -lcudadevrt -lcudart_static -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/tinycudann_bindings/_89_C.cpython-310-x86_64-linux-gnu.so
/mnt/data/home/wangjiawei/miniforge3/envs/map4d/compiler_compat/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
error: command '/mnt/data/home/wangjiawei/miniforge3/envs/map4d/bin/g++' failed with exit code 1
[end of output]
Fix:
https://blog.csdn.net/weixin_44546162/article/details/140106401
This libcuda.so is not under the usual lib directory, so you have to look for it elsewhere (builds normally do not even need it; everything links against libcudart 👿).
Wrong attempt: using the libcuda.so from the stubs directory.
In a Conda environment or a CUDA Toolkit installation there is sometimes also a libcuda.so under lib/stubs/, e.g. /mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib/stubs/libcuda.so
Libraries under stubs are stub libraries: they contain only function signatures and other metadata to satisfy symbol lookup at link time, with no actual implementations. Adding this path to the linker search path via
export LD_LIBRARY_PATH=/mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib/stubs:$LD_LIBRARY_PATH
still fails! 😡
Correct method: create a symlink into the Conda environment's lib directory.
Use ldconfig -p | grep libcuda.so to find the real libcuda.so.1 (usually under /usr/lib/x86_64-linux-gnu):
ldconfig -p | grep libcuda.so
libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1 # this is the one we need
libcuda.so.1 (libc6) => /usr/lib/i386-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
libcuda.so (libc6) => /usr/lib/i386-linux-gnu/libcuda.so
Symlink the system's real libcuda.so.1 into the Conda environment's lib directory under the name libcuda.so:
# find the actual path of libcuda.so.1 on the system, e.g. /usr/lib/x86_64-linux-gnu/libcuda.so.1
# then link it into your conda environment's lib folder
ln -s /usr/lib/x86_64-linux-gnu/libcuda.so.1 /mnt/data/home/wangjiawei/miniforge3/envs/map4d/lib/libcuda.so
After that, re-running the pip install just works! 🫡
# installing upstream tiny-cuda-nn works too; this fork is what the project needs
pip install git+https://github.com/hturki/tiny-cuda-nn.git@ht/res-grid#subdirectory=bindings/torch
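A quick check that the bindings built and load (it just prints the module object; the fork's API may differ from upstream tiny-cuda-nn):
python -c "import torch, tinycudann as tcnn; print(tcnn)"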