err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure when training #14034
Replies: 3 comments
More related failures... Traceback (most recent call last):
A quick Google search suggests that this error occurs when Windows TDR (Timeout Detection and Recovery) is triggered, which happens when a GPU operation runs longer than the timeout (2 seconds by default).
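To check whether the TDR timeout has been changed from its default on the affected machine, something like the following could be used. This is a minimal sketch, not part of the original report: the helper name `read_tdr_delay` is hypothetical, and it assumes the timeout is stored as the `TdrDelay` value under the `GraphicsDrivers` registry key, as Microsoft documents for Windows display drivers. When the value is absent, Windows falls back to the 2-second default.

```python
import sys

# Registry location documented by Microsoft for the TDR settings.
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def read_tdr_delay():
    """Return the configured TdrDelay in seconds, or None if it is not set.

    Hypothetical helper for illustration. It returns None on non-Windows
    platforms, and also when the value is missing (in which case Windows
    uses its built-in default).
    """
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrDelay")
            return int(value)
    except FileNotFoundError:
        # Key or value not present: the default timeout applies.
        return None


if __name__ == "__main__":
    delay = read_tdr_delay()
    if delay is None:
        print("TdrDelay not set (or not on Windows); default timeout applies.")
    else:
        print(f"TdrDelay is set to {delay} seconds.")
```

Raising the timeout requires editing the registry as an administrator and rebooting, so it is probably worth confirming first that TDR is actually firing (for example, via display driver reset entries in the Windows Event Viewer).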
@mxnet-label-bot add [question, cuda]
Description
err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure when training Alexnet
Environment info (Required)
----------Python Info----------
Version : 3.6.7
Compiler : MSC v.1900 64 bit (AMD64)
Build : ('v3.6.7:6ec5cf24b7', 'Oct 20 2018 13:35:33')
Arch : ('64bit', 'WindowsPE')
------------Pip Info-----------
Version : 18.1
Directory : C:\Users\ianfe\envs\mxnet\lib\site-packages\pip
----------MXNet Info-----------
Version : 1.3.1
Directory : C:\Users\ianfe\envs\mxnet\lib\site-packages\mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Windows-10-10.0.17763-SP0
system : Windows
node : DESKTOP-RNUS3LP
release : 10
version : 10.0.17763
----------Hardware Info----------
machine : AMD64
processor : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
Name
AMD Ryzen Threadripper 1950X 16-Core Processor
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0170 sec, LOAD: 1.6175 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0590 sec, LOAD: 0.2110 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0440 sec, LOAD: 0.1160 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0170 sec, LOAD: 0.0920 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0140 sec, LOAD: 0.3180 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0150 sec, LOAD: 0.0430 sec.
Package used (Python/R/Scala/Julia):
Error Message:
Traceback (most recent call last):
File "train_alexnet.py", line 111, in <module>
epoch_end_callback=epochEndCBs)
File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\model.py", line 893, in fit
sym_gen=self.sym_gen)
File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\model.py", line 325, in _train_multi_device
executor_manager.update_metric(eval_metric, data_batch.label)
File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\executor_manager.py", line 444, in update_metric
self.curr_execgrp.update_metric(metric, labels, pre_sliced)
File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\executor_manager.py", line 296, in update_metric
metric.update(labels_slice, texec.outputs)
File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\metric.py", line 318, in update
metric.update(labels, preds)
File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\metric.py", line 418, in update
pred_label = pred_label.asnumpy().astype('int32')
File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\ndarray\ndarray.py", line 1972, in asnumpy
ctypes.c_size_t(data.size)))
File "C:\Users\ianfe\Envs\mxnet\lib\site-packages\mxnet\base.py", line 251, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:41:47] c:\jenkins\workspace\mxnet-tag\mxnet\3rdparty\mshadow\mshadow./cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (4 vs. 0) Name: MapPlanKernel ErrStr:unspecified launch failure
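Note that the traceback ends in `asnumpy()`, which is a synchronization point: MXNet schedules GPU work asynchronously, so the kernel that actually failed may have been launched earlier than the line shown. One way to narrow this down, sketched below under assumptions, is to switch to MXNet's synchronous `NaiveEngine` via the `MXNET_ENGINE_TYPE` environment variable before importing `mxnet`. The helper name `enable_synchronous_engine` is hypothetical and not part of the original script; training will be slower in this mode, and it may or may not reproduce an intermittent failure.

```python
import os
import sys


def enable_synchronous_engine():
    """Ask MXNet to run operators synchronously (debugging aid only).

    Hypothetical helper for illustration. The variable has to be set
    before mxnet is imported, because the engine type is chosen when
    the library is loaded.
    """
    if "mxnet" in sys.modules:
        raise RuntimeError(
            "mxnet is already imported; set MXNET_ENGINE_TYPE before the import"
        )
    os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"


enable_synchronous_engine()

# Only now import mxnet and run the training script as usual, e.g.:
# import mxnet as mx
```

With the synchronous engine, an error should surface closer to the operator that triggered it, which can help tell a genuinely faulty kernel apart from a timeout on an otherwise valid one.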
Minimum reproducible example
The failure is intermittent, but the error is always the same.
Steps to reproduce
model = mx.model.FeedForward(
    ctx=[mx.gpu(0), mx.gpu(1), mx.gpu(2)],
    symbol=model,
    initializer=mx.initializer.Xavier(),
    arg_params=argParams,
    aux_params=auxParams,
    optimizer=opt,
    num_epoch=90,
    begin_epoch=args["start_epoch"])
print("[INFO] training network...")
model.fit(
    X=trainIter,
    eval_data=valIter,
    eval_metric=metrics,
    batch_end_callback=batchEndCBs,
    epoch_end_callback=epochEndCBs)
What have you tried to solve it?