This post records some bugs I ran into while using PyTorch.
[error] RuntimeError: Couldn't export Python operator <python_value>
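During PyTorch training, I used torch.jit.ScriptModule to compile the model to TorchScript and saved it with net.save(path). Saving the model failed with the following error: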
Traceback (most recent call last):
  File "train.py", line 69, in <module>
    train(opt, data_loader, model, visualizer)
  File "train.py", line 42, in train
    model.save('latest')
  File "/opt/ml/job/models/conditional_gan_model.py", line 132, in save
    self.save_network(self.netG, 'G', label, self.gpu_ids)
  File "/opt/ml/job/models/base_model.py", line 49, in save_network
    network.save(save_path)
RuntimeError: Couldn't export Python operator <python_value>

Defined at:
@torch.jit.script_method
def forward(self, x):
    output = self.model(x)
             ~~~~~~~~~~ <--- HERE
    return output
===============End of job stdout&stderr==============
[error] Import Error
After installing PyTorch, running import torch produced the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/maxliu/anaconda3/lib/python3.7/site-packages/torch/__init__.py", line 84, in <module>
    from torch._C import *
ImportError: /home/maxliu/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZNK3c104Type11isSubtypeOfESt10shared_ptrIS0_E
I was sure that the NVIDIA driver, CUDA, and cuDNN were installed correctly, so where was the problem? I started by demangling the missing symbol:
c++filt _ZNK3c104Type11isSubtypeOfESt10shared_ptrIS0_E
This gives the following function:
c10::Type::isSubtypeOf(std::shared_ptr<c10::Type>) const
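
The same lookup can be scripted when there are many symbols to check. The following is a minimal sketch, assuming the c++filt tool from binutils is on PATH. The helper name demangle is hypothetical and not part of PyTorch.

```python
import shutil
import subprocess


def demangle(symbol):
    """Return the demangled form of a C++ symbol, or None if c++filt is missing."""
    tool = shutil.which("c++filt")
    if tool is None:
        # c++filt is not installed; the caller decides how to handle this.
        return None
    result = subprocess.run(
        [tool, symbol], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(demangle("_ZNK3c104Type11isSubtypeOfESt10shared_ptrIS0_E"))
```

The c10:: namespace in the demangled name is the useful hint here: it shows which library is expected to provide the symbol.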
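At this point I suspected that the wrong version of the c10 library was being picked up at load time. The installed pytorch was the latest version, but I had previously used an older libtorch build and added it to LD_LIBRARY_PATH. Running echo $LD_LIBRARY_PATH showed: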
/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/data/source/libtorch/lib:
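Here /data/source/libtorch/lib is the old library directory I had added earlier. Because it is on LD_LIBRARY_PATH, import torch loaded the old libraries from it, and they do not define the symbol that the newer libtorch_python.so expects. Removing this directory from LD_LIBRARY_PATH fixes the problem.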
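
To find such a leftover directory without reading the variable by eye, something like the sketch below may help. It assumes that a stale entry can be recognised by the presence of libc10 or libtorch shared objects. The function name split_library_path and the file patterns are illustrative choices, not anything PyTorch provides.

```python
import glob
import os


def split_library_path(value, patterns=("libc10*.so*", "libtorch*.so*")):
    """Split an LD_LIBRARY_PATH string into (kept, flagged) directory lists.

    A directory is flagged when it contains a file matching one of the
    patterns. Empty and duplicate entries are dropped along the way.
    """
    kept, flagged, seen = [], [], set()
    for entry in value.split(":"):
        if not entry or entry in seen:
            continue
        seen.add(entry)
        has_torch_libs = any(
            glob.glob(os.path.join(entry, pattern)) for pattern in patterns
        )
        (flagged if has_torch_libs else kept).append(entry)
    return kept, flagged


if __name__ == "__main__":
    kept, flagged = split_library_path(os.environ.get("LD_LIBRARY_PATH", ""))
    for entry in flagged:
        print("contains libtorch/c10 libraries:", entry)
    # Print a cleaned value to paste into the shell or a startup file.
    print("export LD_LIBRARY_PATH=" + ":".join(kept))
```

The script only prints a suggested value. The dynamic loader reads LD_LIBRARY_PATH when a process starts, so the variable has to be changed in the shell (or in the startup file that sets it) before launching Python again.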