Fixing RuntimeError: cuda runtime error (710) : device-side assert triggered

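I found someone's code on GitHub and got it running on their dataset. Pleased with that, I switched to my own dataset, and after a round of modifications I ran into a baffling error: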

Traceback (most recent call last):
  File "train_discriminator.py", line 167, in <module>
    main()
  File "train_discriminator.py", line 105, in main
    loss_D.backward()
  File "//anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (710) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1573049306803/work/aten/src/ATen/native/cuda/SoftMax.cu:647

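The log here does not seem to contain any useful clue. After some searching, I read that switching to the CPU makes the error message more informative.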
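One way to make the switch without editing the training code each time is a small debug toggle. The sketch below is only an illustration: the environment variable `FORCE_CPU` and the helper `pick_device_name` are hypothetical names, not part of PyTorch or of the original script.

```python
import os


def pick_device_name(cuda_available: bool) -> str:
    """Return "cpu" or "cuda" for the current run.

    FORCE_CPU is a hypothetical environment variable used here as a
    debugging switch; setting it to "1" forces the CPU even if a GPU exists.
    """
    if os.environ.get("FORCE_CPU") == "1":
        return "cpu"
    return "cuda" if cuda_available else "cpu"
```

In a training script the result would typically be passed on as `device = torch.device(pick_device_name(torch.cuda.is_available()))`, with the model and every batch moved to that device via `.to(device)`. Running with `FORCE_CPU=1` then reproduces the failure on the CPU.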

After switching to the CPU, I got the following error:

File "train_discriminator.py", line 167, in <module>
    main()
  File "train_discriminator.py", line 100, in main
    loss_s = F.cross_entropy(y_disc_real, labels)
  File "/python3.7/site-packages/torch/nn/functional.py", line 2009, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1838, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /opt/conda/conda-bld/pytorch_1573049306803/work/aten/src/THNN/generic/ClassNLLCriterion.c:97
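The assertion in the last line requires every target to satisfy `0 <= target < n_classes`. A quick check before the loss call can list the offending label ids directly. This is a minimal sketch: `find_out_of_range_labels` is a hypothetical helper, and it assumes the labels can be iterated as plain integers (for a tensor such as `labels` above, `labels.tolist()` would do).

```python
def find_out_of_range_labels(labels, n_classes):
    """Return the sorted distinct label ids that violate 0 <= label < n_classes."""
    return sorted({int(y) for y in labels if not 0 <= int(y) < n_classes})
```

With the names from the traceback, the call would look roughly like `find_out_of_range_labels(labels.tolist(), y_disc_real.size(1))`, assuming `y_disc_real` has shape `(batch, n_classes)`. An empty list means the targets are fine; anything else is a label the model has no output slot for.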

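That told me the cause. I load the dataset with PyTorch's torchtext, and at test time a label showed up that never appears in the training set, so its index fell outside the valid class range and triggered the assertion. The lesson is to pay attention to stratified sampling when splitting the data, especially when some class has very few samples.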
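A stratified split can be sketched in plain Python as follows. This is not the code from the original project: `stratified_split` is a hypothetical helper that splits each class separately and always keeps at least one sample of every class in the training part, so no label can appear only at test time.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=0):
    """Split sample indices per class; return (train_idx, test_idx)."""
    # Group sample indices by their label.
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # Cap the test share so at least one sample of the class stays in train.
        n_test = min(int(round(len(idxs) * test_ratio)), len(idxs) - 1)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

A class with a single sample ends up entirely in the training set. If scikit-learn is available, `train_test_split(..., stratify=labels)` serves a similar purpose, though it raises an error for classes with only one sample instead of keeping them in train.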