Skip to content

Conda安装Pytorch和CUDA/GCC与一些环境Bug解决

约 2867 字大约 10 分钟

SkillsPytorchCondaCUDA

2024-12-01

在实验室服务器上的时候,一般是没有 root 权限的,而服务器可能只安装了特定版本的 CUDA 和 GCC,我们就需要自己安装自己需要版本的 CUDA/GCC 来满足安装包时的环境要求。

而 Conda 除了安装 Python 的包以外,其还提供了很多其他库——比如CUDA、GCC甚至还有 COLMAP,那么我们就可以很方便的安装自己的环境的啦!

而官方的 Conda 比较慢,一般还得自己改 channel 到 conda-forge,所以推荐使用 Mamba 来替代,其比 Conda 更快,而且命令与 Conda 一致,只需把通常用的 conda 替换成 mamba 即可。安装地址:https://github.com/conda-forge/miniforge

1 快速安装环境

这里先附上各个版本Pytorch和CUDA的安装命令合集:

但是注意,不要直接使用里面的命令,要想获得良好的体验,请参考下面的例子:

一个简单的例子安装 Pytorch 2.3.0、CUDA 12.1、GCC/G++ 11.4:

  1. 安装 mamba(比 conda 更快)

    wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
    bash Miniforge3-$(uname)-$(uname -m).sh
  2. 创建环境(注意我这里修改了官方的 CUDA 版本 Pytorch 的安装命令,改变了channel 为 -c nvidia/label/cuda-12.1.0

    mamba create -n h_gs python=3.12 -y
    mamba activate h_gs
    mamba install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia/label/cuda-12.1.0
  3. 安装 GCC/G++

    mamba install gcc=11.4
    mamba install gxx=11.4
  4. 安装 CUDA(注意我这里修改了官方的 CUDA 安装命令,多了一个 -c nvidia/label/cuda-12.1.0

    mamba install nvidia/label/cuda-12.1.0::cuda-toolkit -c nvidia/label/cuda-12.1.0

可以看到以上操作有两个地方特别注意的 channel 修改,是因为混乱的 Conda 环境管理会导致原版命令出现 bug,具体情况请看第3节

  • Pytorch 安装的 channel 修改是为了使 pytorch-cuda 与之后我们安装的 CUDA 兼容

  • CUDA 安装的 channel 添加是为了防止其安装与版本不一样的其他CUDA库

    安装 cuda-toolkit 即是安装很多个 CUDA 相关的库,直接使用官方命令安装会导致其首先查看在默认频道并安装最新版本,而不是 nvidia/label/cuda-12.1.0 中的软件包,就会安装一些最新版本的 CUDA 库,有一定出 bug 的风险。

2 环境路径问题

注意:Conda 的 CUDA,经常出现找不到 CUDA 环境的问题,以下操作可以通过将环境变量设置为激活步骤的一部分:

  1. 进入Conda环境目录并创建以下子目录和文件

    cd $CONDA_PREFIX
    mkdir -p ./etc/conda/activate.d
    mkdir -p ./etc/conda/deactivate.d
    touch ./etc/conda/activate.d/env_vars.sh
    touch ./etc/conda/deactivate.d/env_vars.sh
  2. 编辑 ./etc/conda/activate.d/env_vars.sh 如下

    #!/bin/sh
    
    export CUDA_HOME=$CONDA_PREFIX
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
  3. 编辑 ./etc/conda/deactivate.d/env_vars.sh 如下

    #!/bin/sh
    
    unset CUDA_HOME
    unset LD_LIBRARY_PATH

当运行 mamba activate analytics 时,环境变量 CUDA_HOMELD_LIBRARY_PATH 将设置为写入文件的值。当运行 mamba deactivate 时,这些环境变量将被删除。

这样确保了 Conda 环境激活的时候,程序可以正确找到 Conda 安装的包/库的位置

3 各种 Bug

3.1 undefined symbol: iJIT_NotifyEvent

https://github.com/pytorch/pytorch/issues/123097

在以下环境时 import torch 发生的bug:

mamba install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
mamba install nvidia/label/cuda-11.8.0::cuda-toolkit -c nvidia/label/cuda-11.8.0

此时的 mkl=2024.2.2,我们将其降级为 mkl=2024.0

mamba install mkl==2024.0

3.2 ld: cannot find -lcudart

我们使用第1节的原版命令(没有修改 channel)的流程安装pytorch和CUDA环境之后,要用 Conda 的 CUDA 环境进行某个库编译时,出现了bug:

/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/compiler_compat/ld: cannot find -lcudart: No such file or directory
      collect2: error: ld returned 1 exit status
      error: command '/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/bin/g++' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for diff_gaussian_rasterization_32d
  Running setup.py clean for diff_gaussian_rasterization_32d
Failed to build diff_gaussian_rasterization_32d
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (diff_gaussian_rasterization_32d)
 which nvcc
/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/bin/nvcc
 echo $CUDA_HOME
/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar
 echo $PATH
/home/wangjiawei/local/bin:/home/wangjiawei/local/bin:/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/bin:/mnt/data/home/wangjiawei/miniforge3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
 echo $LD_LIBRARY_PATH
/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib:

去探查发现,这里的软链接出了问题:

 ls /mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib

libcudart.so  -> libcudart.so.12.1.55
libcudart.so.12
libcudart.so.12.1.105

继续探究发现,安装Pytorch时会安装 cuda-cudart=12.1.105

以下是按照Pytorch时会安装的所有以 pytorchnvidia 为 channel 的包:

  + pytorch-mutex         1.0  cuda                          pytorch     Cached
  + libcublas       12.1.0.26  0                             nvidia      Cached
  + libcufft         11.0.2.4  0                             nvidia      Cached
  + libcusolver     11.4.4.55  0                             nvidia      Cached
  + libcusparse     12.0.2.55  0                             nvidia      Cached
  + libnpp          12.0.2.50  0                             nvidia      Cached
  + cuda-cudart      12.1.105  0                             nvidia      Cached
  + cuda-nvrtc       12.1.105  0                             nvidia      Cached
  + libnvjitlink     12.1.105  0                             nvidia      Cached
  + libnvjpeg       12.1.1.14  0                             nvidia      Cached
  + cuda-cupti       12.1.105  0                             nvidia      Cached
  + cuda-nvtx        12.1.105  0                             nvidia      Cached
  + ffmpeg                4.3  hf484d3e_0                    pytorch     Cached
  + libjpeg-turbo       2.0.0  h9bf148f_0                    pytorch     Cached
  + cuda-version         12.6  3                             nvidia      Cached
  + libcurand       10.3.7.77  0                             nvidia      Cached
  + libcufile        1.11.1.6  0                             nvidia      Cached
  + cuda-opencl       12.6.77  0                             nvidia      Cached
  + cuda-libraries     12.1.0  0                             nvidia      Cached
  + cuda-runtime       12.1.0  0                             nvidia      Cached
  + pytorch-cuda         12.1  ha16c6d3_6                    pytorch     Cached
  + pytorch             2.4.1  py3.12_cuda12.1_cudnn9.1.0_0  pytorch     Cached
  + torchtriton         3.0.0  py312                         pytorch     Cached
  + torchaudio          2.4.1  py312_cu121                   pytorch     Cached
  + torchvision        0.19.1  py312_cu121                   pytorch     Cached

而这是安装 cuda-toolkit-12.1.0 的包:

  + cuda-documentation           12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvml-dev                12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + libnvvm-samples              12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-cccl                    12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-driver-dev              12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-profiler-api            12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-cudart                  12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvrtc                   12.1.55  0      nvidia/label/cuda-12.1.0       21MB
  + cuda-opencl                  12.1.56  0      nvidia/label/cuda-12.1.0       11kB
  + libcublas                  12.1.0.26  0      nvidia/label/cuda-12.1.0     Cached
  + libcufft                    11.0.2.4  0      nvidia/label/cuda-12.1.0     Cached
  + libcufile                   1.6.0.25  0      nvidia/label/cuda-12.1.0      782kB
  + libcurand                  10.3.2.56  0      nvidia/label/cuda-12.1.0       54MB
  + libcusolver                11.4.4.55  0      nvidia/label/cuda-12.1.0     Cached
  + libcusparse                12.0.2.55  0      nvidia/label/cuda-12.1.0     Cached
  + libnpp                     12.0.2.50  0      nvidia/label/cuda-12.1.0     Cached
  + libnvjitlink                 12.1.55  0      nvidia/label/cuda-12.1.0       18MB
  + libnvjpeg                  12.1.0.39  0      nvidia/label/cuda-12.1.0        3MB
  + cuda-cupti                   12.1.62  0      nvidia/label/cuda-12.1.0        5MB
  + cuda-cuobjdump               12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-cuxxfilt                12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvcc                    12.1.66  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvprune                 12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-gdb                     12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvdisasm                12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvprof                  12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvtx                    12.1.66  0      nvidia/label/cuda-12.1.0       58kB
  + cuda-sanitizer-api           12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nsight                  12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + nsight-compute           2023.1.0.15  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-cudart-dev              12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvrtc-dev               12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-opencl-dev              12.1.56  0      nvidia/label/cuda-12.1.0     Cached
  + libcublas-dev              12.1.0.26  0      nvidia/label/cuda-12.1.0     Cached
  + libcufft-dev                11.0.2.4  0      nvidia/label/cuda-12.1.0     Cached
  + gds-tools                   1.6.0.25  0      nvidia/label/cuda-12.1.0     Cached
  + libcufile-dev               1.6.0.25  0      nvidia/label/cuda-12.1.0     Cached
  + libcurand-dev              10.3.2.56  0      nvidia/label/cuda-12.1.0     Cached
  + libcusolver-dev            11.4.4.55  0      nvidia/label/cuda-12.1.0     Cached
  + libcusparse-dev            12.0.2.55  0      nvidia/label/cuda-12.1.0     Cached
  + libnpp-dev                 12.0.2.50  0      nvidia/label/cuda-12.1.0     Cached
  + libnvjitlink-dev             12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + libnvjpeg-dev              12.1.0.39  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-libraries                12.1.0  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-cupti-static            12.1.62  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-compiler                 12.1.0  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvvp                    12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-command-line-tools       12.1.0  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nsight-compute           12.1.0  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-cudart-static           12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-nvrtc-static            12.1.55  0      nvidia/label/cuda-12.1.0     Cached
  + libcublas-static           12.1.0.26  0      nvidia/label/cuda-12.1.0     Cached
  + libcufft-static             11.0.2.4  0      nvidia/label/cuda-12.1.0     Cached
  + libcufile-static            1.6.0.25  0      nvidia/label/cuda-12.1.0     Cached
  + libcurand-static           10.3.2.56  0      nvidia/label/cuda-12.1.0     Cached
  + libcusolver-static         11.4.4.55  0      nvidia/label/cuda-12.1.0     Cached
  + libcusparse-static         12.0.2.55  0      nvidia/label/cuda-12.1.0     Cached
  + libnpp-static              12.0.2.50  0      nvidia/label/cuda-12.1.0     Cached
  + libnvjpeg-static           12.1.0.39  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-libraries-dev            12.1.0  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-libraries-static         12.1.0  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-visual-tools             12.1.0  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-tools                    12.1.0  0      nvidia/label/cuda-12.1.0     Cached
  + cuda-toolkit                  12.1.0  0      nvidia/label/cuda-12.1.0     Cached

这是安装 cuda-toolkit-12.1.1 的包:

  + cuda-documentation         12.1.105  0      nvidia/label/cuda-12.1.1       91kB
  + cuda-nvml-dev              12.1.105  0      nvidia/label/cuda-12.1.1       87kB
  + libnvvm-samples            12.1.105  0      nvidia/label/cuda-12.1.1       33kB
  + cuda-cccl                  12.1.109  0      nvidia/label/cuda-12.1.1        1MB
  + cuda-driver-dev            12.1.105  0      nvidia/label/cuda-12.1.1       17kB
  + cuda-profiler-api          12.1.105  0      nvidia/label/cuda-12.1.1       19kB
  + cuda-cudart                12.1.105  0      nvidia/label/cuda-12.1.1     Cached
  + cuda-nvrtc                 12.1.105  0      nvidia/label/cuda-12.1.1     Cached
  + cuda-opencl                12.1.105  0      nvidia/label/cuda-12.1.1       11kB
  + libcublas                  12.1.3.1  0      nvidia/label/cuda-12.1.1      367MB
  + libcufft                  11.0.2.54  0      nvidia/label/cuda-12.1.1      108MB
  + libcufile                   1.6.1.9  0      nvidia/label/cuda-12.1.1      783kB
  + libcurand                10.3.2.106  0      nvidia/label/cuda-12.1.1       54MB
  + libcusolver              11.4.5.107  0      nvidia/label/cuda-12.1.1      116MB
  + libcusparse              12.1.0.106  0      nvidia/label/cuda-12.1.1      177MB
  + libnpp                    12.1.0.40  0      nvidia/label/cuda-12.1.1      147MB
  + libnvjitlink               12.1.105  0      nvidia/label/cuda-12.1.1     Cached
  + libnvjpeg                  12.2.0.2  0      nvidia/label/cuda-12.1.1        3MB
  + cuda-cupti                 12.1.105  0      nvidia/label/cuda-12.1.1     Cached
  + cuda-cuobjdump             12.1.111  0      nvidia/label/cuda-12.1.1      245kB
  + cuda-cuxxfilt              12.1.105  0      nvidia/label/cuda-12.1.1      302kB
  + cuda-nvcc                  12.1.105  0      nvidia/label/cuda-12.1.1       55MB
  + cuda-nvprune               12.1.105  0      nvidia/label/cuda-12.1.1       67kB
  + cuda-gdb                   12.1.105  0      nvidia/label/cuda-12.1.1        6MB
  + cuda-nvdisasm              12.1.105  0      nvidia/label/cuda-12.1.1       50MB
  + cuda-nvprof                12.1.105  0      nvidia/label/cuda-12.1.1        5MB
  + cuda-nvtx                  12.1.105  0      nvidia/label/cuda-12.1.1     Cached
  + cuda-sanitizer-api         12.1.105  0      nvidia/label/cuda-12.1.1       18MB
  + cuda-nsight                12.1.105  0      nvidia/label/cuda-12.1.1      119MB
  + nsight-compute           2023.1.1.4  0      nvidia/label/cuda-12.1.1      808MB
  + cuda-cudart-dev            12.1.105  0      nvidia/label/cuda-12.1.1      381kB
  + cuda-nvrtc-dev             12.1.105  0      nvidia/label/cuda-12.1.1       12kB
  + cuda-opencl-dev            12.1.105  0      nvidia/label/cuda-12.1.1       59kB
  + libcublas-dev              12.1.3.1  0      nvidia/label/cuda-12.1.1       76kB
  + libcufft-dev              11.0.2.54  0      nvidia/label/cuda-12.1.1       14kB
  + gds-tools                   1.6.1.9  0      nvidia/label/cuda-12.1.1       43MB
  + libcufile-dev               1.6.1.9  0      nvidia/label/cuda-12.1.1       13kB
  + libcurand-dev            10.3.2.106  0      nvidia/label/cuda-12.1.1      460kB
  + libcusolver-dev          11.4.5.107  0      nvidia/label/cuda-12.1.1       51kB
  + libcusparse-dev          12.1.0.106  0      nvidia/label/cuda-12.1.1      178MB
  + libnpp-dev                12.1.0.40  0      nvidia/label/cuda-12.1.1      525kB
  + libnvjitlink-dev           12.1.105  0      nvidia/label/cuda-12.1.1       15MB
  + libnvjpeg-dev              12.2.0.2  0      nvidia/label/cuda-12.1.1       13kB
  + cuda-libraries               12.1.1  0      nvidia/label/cuda-12.1.1        2kB
  + cuda-cupti-static          12.1.105  0      nvidia/label/cuda-12.1.1       12MB
  + cuda-compiler                12.1.1  0      nvidia/label/cuda-12.1.1        1kB
  + cuda-nvvp                  12.1.105  0      nvidia/label/cuda-12.1.1      120MB
  + cuda-command-line-tools      12.1.1  0      nvidia/label/cuda-12.1.1        1kB
  + cuda-nsight-compute          12.1.1  0      nvidia/label/cuda-12.1.1        1kB
  + cuda-cudart-static         12.1.105  0      nvidia/label/cuda-12.1.1      948kB
  + cuda-nvrtc-static          12.1.105  0      nvidia/label/cuda-12.1.1       18MB
  + libcublas-static           12.1.3.1  0      nvidia/label/cuda-12.1.1      389MB
  + libcufft-static           11.0.2.54  0      nvidia/label/cuda-12.1.1      199MB
  + libcufile-static            1.6.1.9  0      nvidia/label/cuda-12.1.1        3MB
  + libcurand-static         10.3.2.106  0      nvidia/label/cuda-12.1.1       55MB
  + libcusolver-static       11.4.5.107  0      nvidia/label/cuda-12.1.1       76MB
  + libcusparse-static       12.1.0.106  0      nvidia/label/cuda-12.1.1      185MB
  + libnpp-static             12.1.0.40  0      nvidia/label/cuda-12.1.1      143MB
  + libnvjpeg-static           12.2.0.2  0      nvidia/label/cuda-12.1.1        3MB
  + cuda-libraries-dev           12.1.1  0      nvidia/label/cuda-12.1.1        2kB
  + cuda-libraries-static        12.1.1  0      nvidia/label/cuda-12.1.1        2kB
  + cuda-visual-tools            12.1.1  0      nvidia/label/cuda-12.1.1        1kB
  + cuda-tools                   12.1.1  0      nvidia/label/cuda-12.1.1        1kB
  + cuda-toolkit                 12.1.1  0      nvidia/label/cuda-12.1.1        2kB

对比发现是 cuda-12.1.1 才对的上CUDA版本12.1的Pytorch。但是我们在安装的时候,先安装CUDA版本12.1的Pytorch,再安装 cuda-12.1.1 会出现冲突问题:

└─ pytorch-cuda is not installable because it requires
   └─ libcublas >=12.1.0.26,<12.1.3.1 , which conflicts with any installable versions previously reported.

也就是说,该死的CUDA版本12.1的Pytorch的 libcublas 需要适配 cuda-toolkit-12.1.0 ,但是其的 cuda-cudart 等库又需要适配 cuda-toolkit-12.1.1

可以看到 pytorch-cuda 强要求 libcublas >=12.1.0.26,<12.1.3.1,我们只好迁就 pytorch,安装12.1.0的CUDA,但是呢!我们可以修改Pytorch官方给出的 nvidia channel 为 nvidia/label/cuda-12.1.0

使用以下命令:

mamba install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia/label/cuda-12.1.0

其就会安装与我们安装的 cuda-toolkit-12.1.0 一样的一些 cuda 库了!

  + pytorch-mutex         1.0  cuda                          pytorch                      Cached
  + libcublas       12.1.0.26  0                             nvidia/label/cuda-12.1.0     Cached
  + libcufft         11.0.2.4  0                             nvidia/label/cuda-12.1.0     Cached
  + libcusolver     11.4.4.55  0                             nvidia/label/cuda-12.1.0     Cached
  + libcusparse     12.0.2.55  0                             nvidia/label/cuda-12.1.0     Cached
  + libnpp          12.0.2.50  0                             nvidia/label/cuda-12.1.0     Cached
  + libnvjpeg       12.1.0.39  0                             nvidia/label/cuda-12.1.0        3MB
  + cuda-cudart       12.1.55  0                             nvidia/label/cuda-12.1.0     Cached
  + cuda-nvrtc        12.1.55  0                             nvidia/label/cuda-12.1.0       21MB
  + cuda-opencl       12.1.56  0                             nvidia/label/cuda-12.1.0       11kB
  + libcufile        1.6.0.25  0                             nvidia/label/cuda-12.1.0      782kB
  + libcurand       10.3.2.56  0                             nvidia/label/cuda-12.1.0       54MB
  + cuda-cupti        12.1.62  0                             nvidia/label/cuda-12.1.0        5MB
  + cuda-nvtx         12.1.66  0                             nvidia/label/cuda-12.1.0       58kB
  + cuda-version         12.1  h1d6eff3_3                    conda-forge                    21kB
  + ffmpeg                4.3  hf484d3e_0                    pytorch                      Cached
  + libjpeg-turbo       2.0.0  h9bf148f_0                    pytorch                      Cached
  + libnvjitlink     12.1.105  hd3aeb46_0                    conda-forge                    16MB
  + cuda-libraries     12.1.0  0                             nvidia/label/cuda-12.1.0     Cached
  + cuda-runtime       12.1.0  0                             nvidia/label/cuda-12.1.0     Cached
  + pytorch-cuda         12.1  ha16c6d3_6                    pytorch                      Cached
  + pytorch             2.4.1  py3.12_cuda12.1_cudnn9.1.0_0  pytorch                      Cached
  + torchtriton         3.0.0  py312                         pytorch                      Cached
  + torchvision        0.19.1  py312_cu121                   pytorch                      Cached
  + torchaudio          2.4.1  py312_cu121                   pytorch                      Cached

到这里,问题就解决了:我们之后要安装 pytorch-cuda 和 cuda-toolkit 时,只需要执行以下命令(顺序应该不重要了):

mamba install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia/label/cuda-12.1.0
mamba install nvidia/label/cuda-12.1.0::cuda-toolkit -c nvidia/label/cuda-12.1.0

安装 cuda-toolkit 就相当于在安装完 pytorch-cuda 的需要的部分 cuda 库后,进行了补充安装,都是同一个 channel 的当然就不会有问题了

3.3 libstdc++.so.6: version `GLIBCXX_3.4.29' not found

在 conda 环境安装 gcc/gxx 之后,运行开始遇到了以下的报错

  File "/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib/python3.12/site-packages/google/protobuf/internal/wire_format.py", line 13, in <module>
    from google.protobuf import descriptor
  File "/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib/python3.12/site-packages/google/protobuf/descriptor.py", line 28, in <module>
    from google.protobuf.pyext import _message
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib/python3.12/site-packages/google/protobuf/pyext/_message.cpython-312-x86_64-linux-gnu.so)`

排查发现:

在 conda 环境中找不到 libstdc++.so.6 !不过能找到 libstdc++.so.6.0.33

 strings /mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib/libstdc++.so.6 | grep GLIBCXX
strings: '/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib/libstdc++.so.6': No such file

 ls /mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib/libstdc++*
/mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib/libstdc++.so.6.0.33

看来是软链接出了问题,我尝试了重新在 conda 环境安装 gcc/gxx,但是始终无法解决软链接问题,而且在卸载之后,这个 libstdc++.so.6.0.33 依旧存在,看来是某个库的问题。

这样太难排查了,所以直接采取最简单的办法——自己手动软链接:

ln -s /mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/x86_64-conda-linux-gnu/lib/libstdc++.so.so.6.0.33 \
      /mnt/data/home/wangjiawei/miniforge3/envs/GAGAvatar/lib/libstdc++.so.6

问题暂时解决😋

参考

https://stackoverflow.com/questions/72684130/how-to-set-the-cuda-path-in-the-conda-environment

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#saving-environment-variables

https://stackoverflow.com/questions/78484090/conda-cuda12-incompatibility/78843983#78843983