Caffe training error: Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR
1. Running caffe train fails with the following error:
F0212 20:20:50.783892 23403 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
@ 0x7ff56413d5cd google::LogMessage::Fail()
@ 0x7ff56413f433 google::LogMessage::SendToLog()
@ 0x7ff56413d15b google::LogMessage::Flush()
@ 0x7ff56413fe1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7ff56490289b caffe::CuDNNConvolutionLayer<>::LayerSetUp()
@ 0x7ff5647bcd89 caffe::Net<>::Init()
@ 0x7ff5647bf4be caffe::Net<>::Net()
@ 0x7ff564780839 caffe::Solver<>::InitTrainNet()
@ 0x7ff564781ce5 caffe::Solver<>::Init()
@ 0x7ff564781fff caffe::Solver<>::Solver()
@ 0x7ff5647718e1 caffe::Creator_SGDSolver<>()
@ 0x40a918 train()
@ 0x407668 main
@ 0x7ff562eca830 __libc_start_main
@ 0x407f39 _start
@ (nil) (unknown)
Aborted
Following the approach described in the issue below, I commented out engine: CAFFE:
https://github.com/shicai/MobileNet-Caffe/issues/3
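For a network definition with many layers, editing every engine line by hand is tedious. The following is a minimal sketch, not part of the original fix, that comments out each engine: CAFFE line in prototxt text. The sample layer (its name and num_output) is made up for illustration; the only assumption about the format is that prototxt files accept # comments.

```python
import re

# Matches an uncommented "engine: CAFFE" line, keeping its indentation.
ENGINE_LINE = re.compile(r"^([ \t]*)(engine[ \t]*:[ \t]*CAFFE\b.*)$", re.MULTILINE)


def comment_out_engine(prototxt_text: str) -> str:
    """Prefix every uncommented `engine: CAFFE` line with '# '."""
    return ENGINE_LINE.sub(r"\1# \2", prototxt_text)


# Hypothetical layer, used only to demonstrate the substitution.
sample = """layer {
  name: "example_conv"
  type: "Convolution"
  convolution_param {
    num_output: 32
    engine: CAFFE
  }
}
"""

if __name__ == "__main__":
    print(comment_out_engine(sample))
```

Lines that are already commented are left alone, so running the function twice gives the same result as running it once.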
2. After the change in step 1, I trained again, but the same error was still reported when the network reached conv5-3.
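(1) Searching around, I found that this error can also be caused by running out of memory.
So I suspected an out-of-memory problem and reduced the batch size from 1000 to 256, then to 128, then to 10. None of these worked.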
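To see why batch size is the first thing to try when memory is suspected, here is a rough sketch of how the size of a single float32 activation blob grows with batch size. The 512 x 14 x 14 shape is an assumption chosen for illustration and is not taken from the network above. Real usage is higher, since gradients, weights and the cuDNN workspace also take memory.

```python
def blob_bytes(batch, channels, height, width, bytes_per_value=4):
    """Size in bytes of one N x C x H x W blob of float32 values."""
    return batch * channels * height * width * bytes_per_value


if __name__ == "__main__":
    # Batch sizes tried above; the blob shape is hypothetical.
    for batch in (1000, 256, 128, 10):
        mib = blob_bytes(batch, 512, 14, 14) / 2**20
        print(f"batch={batch:5d}: {mib:8.1f} MiB")
```

The size is linear in the batch size, so going from 1000 to 10 shrinks each blob by a factor of 100.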
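(2) I confirmed that Caffe had been compiled with cuDNN support, so I restored the engine: caffe setting from step 1 and then adjusted the batch size again. After that, training succeeded.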
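One way to check the cuDNN build setting, assuming Caffe was built with the Makefile (a CMake build is configured differently), is to look for an uncommented USE_CUDNN := 1 line in Makefile.config. This is a small sketch of that check, not necessarily how I verified it. The file path in the usage section is an assumption.

```python
import re

# An enabled switch looks like "USE_CUDNN := 1"; a disabled one starts with '#'.
USE_CUDNN = re.compile(r"^[ \t]*USE_CUDNN[ \t]*:?=[ \t]*1\b", re.MULTILINE)


def cudnn_enabled(makefile_config_text: str) -> bool:
    """Return True if the text contains an uncommented USE_CUDNN := 1 line."""
    return USE_CUDNN.search(makefile_config_text) is not None


if __name__ == "__main__":
    import sys

    # Assumed default location: Makefile.config in the current directory.
    path = sys.argv[1] if len(sys.argv) > 1 else "Makefile.config"
    with open(path, encoding="utf-8") as f:
        print("cuDNN enabled:", cudnn_enabled(f.read()))
```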