MindSpore: Abnormal loss values with graph kernel fusion enabled

Problem description:

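I am running graph kernel fusion on an Ascend 910A with MindSpore 1.1.2; the network is ResNet-50. With identical parameters, training runs normally without any problems. Once enable_graph_kernel=True is added to the program, however, the training loss sometimes becomes negative infinity and sometimes becomes NaN, and in both cases the run then fails with an error.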
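For readers who want to reproduce the comparison, here is a minimal sketch of toggling the option between two otherwise identical runs. The post does not show where the flag is set, so this assumes it is passed to `context.set_context`, which accepts `enable_graph_kernel` in MindSpore 1.1. The helper names `graph_kernel_settings` and `apply_settings` are hypothetical and not part of the original script.

```python
def graph_kernel_settings(enabled, device_id=0):
    """Build keyword arguments for context.set_context (hypothetical helper).

    Only enable_graph_kernel differs between the two runs being compared;
    device_target and device_id are assumptions for an Ascend setup.
    """
    return {
        "device_target": "Ascend",
        "device_id": device_id,
        "enable_graph_kernel": enabled,
    }


def apply_settings(settings):
    """Apply the settings if MindSpore is installed; return True on success."""
    try:
        from mindspore import context
    except ImportError:
        # MindSpore is not available in this environment; nothing to configure.
        return False
    context.set_context(mode=context.GRAPH_MODE, **settings)
    return True


baseline = graph_kernel_settings(False)  # the run that trains normally
fused = graph_kernel_settings(True)      # the run that produces the bad loss
```

Calling `apply_settings(baseline)` or `apply_settings(fused)` before building the network keeps everything else equal, so any difference in loss can be attributed to graph kernel fusion.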

The full error output is as follows:

amp_level: O2
WARNING: 'ControlDepend' is deprecated from version 1.1 and will be removed in a future version, use 'Depend' instead.
epoch: 1 step: 1, loss is 2.2937918
WARNING: 'ControlDepend' is deprecated from version 1.1 and will be removed in a future version, use 'Depend' instead.
epoch: 1 step: 71, loss is 4803.1064
epoch: 1 step: 72, loss is 4034.7324
epoch: 1 step: 142, loss is -5.3169115e+37
epoch: 1 step: 143, loss is 2.3025851
epoch: 1 step: 213, loss is -3.4028235e+38
epoch: 1 step: 214, loss is 2.3025851
[ERROR] RUNTIME(75982)model execute error, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(75982)model execute task failed, device_id=5, model stream_id=526, model task_id=14, model_id=513, first_task_id=65535
[ERROR] RUNTIME(75982)aicore kernel execute failed, device_id=5, stream_id=531, task_id=302, fault kernel_name=Fused_Add_RealDiv_fusion_15159710175106831708_kernel0, func_name=Fused_Add_RealDiv_fusion_15159710175106831708_kernel0
[ERROR] DEVICE(75885,python3.7):2021-04-13-08:26:59.217.107 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:569] GetErrorNodeName] Node: Default/GraphKernel_Add_RealDiv_fusion-op16882, run task error.
[ERROR] DEVICE(75885,python3.7):2021-04-13-08:26:59.218.172 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:587] DumpTaskExceptionInfo] Dump node (Default/GraphKernel_Add_RealDiv_fusion-op16882) task error input/otput data to: ./task_error_dump/5 Error msg:  model execute failed trace:
[ERROR] DEVICE(75885,python3.7):2021-04-13-08:26:59.219.537 [mindspore/ccsrc/runtime/device/ascend/ascend_device_address.cc:663] DumpMemToFile] SyncDeviceToHost: rtMemcpy mem size[4] fail, ret[507899]
[ERROR] DEVICE(75885,python3.7):2021-04-13-08:26:59.221.399 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:569] GetErrorNodeName] Node: Default/GraphKernel_Add_RealDiv_fusion-op16882, run task error.
[ERROR] DEVICE(75885,python3.7):2021-04-13-08:26:59.222.187 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:587] DumpTaskExceptionInfo] Dump node (Default/GraphKernel_Add_RealDiv_fusion-op16882) task error input/otput data to: ./task_error_dump/5 Error msg:  model execute failed trace:
[ERROR] DEVICE(75885,python3.7):2021-04-13-08:26:59.223.474 [mindspore/ccsrc/runtime/device/ascend/ascend_device_address.cc:663] DumpMemToFile] SyncDeviceToHost: rtMemcpy mem size[4] fail, ret[507899]
[ERROR] SESSION(75885,python3.7):2021-04-13-08:26:59.524.099 [mindspore/ccsrc/backend/session/ascend_session.cc:1033] Execute] run task error!
Traceback (most recent call last):
  File "train_profile.py", line 170, in <module>
    model.train(args.epoch_size, ds_train, callbacks=callbacks, dataset_sink_mode=True)  # sink_size=200
  File "/opt/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 592, in train
    sink_size=sink_size)
  File "/opt/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 391, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/home/mayinping/ms-seng-img2col-opt2/model_seng.py", line 236, in _train_dataset_sink_process
    outputs = self._train_network(*inputs)
  File "/opt/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 322, in __call__
    out = self.compile_and_run(*inputs)
  File "/opt/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 592, in compile_and_run
    return _executor(self, *new_inputs, phase=self.phase)
  File "/opt/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 585, in __call__
    return self.run(obj, *args, phase=phase)
  File "/opt/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 613, in run
    return self._exec_pip(obj, *args, phase=phase_real)
  File "/opt/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 75, in wrapper
    results = fn(*arg, **kwargs)
  File "/opt/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 596, in _exec_pip
    return self._executor(args_list, phase)
RuntimeError: mindspore/ccsrc/backend/session/ascend_session.cc:1033 Execute] run task error!

# In file /home/mayinping/ms-seng-img2col-opt2/seng.py(109)
        for i in range(54):

I don't know what is causing graph kernel fusion to fail. Could anyone point me in the right direction?

Solution:

This issue has been fixed in MindSpore r1.2. Thank you for using MindSpore.
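To check whether a run still shows the symptom after upgrading, the loss lines in the training log can be scanned automatically. The sketch below is an illustration under stated assumptions: it expects the `epoch: N step: M, loss is X` format shown in the log above, it treats a negative loss as abnormal (reasonable for a non-negative loss such as cross-entropy, which the post does not confirm), and the `threshold` of 1000 is an arbitrary cutoff. The function name `find_abnormal_losses` is hypothetical. As a side note, the value -3.4028235e+38 in the log matches the most negative finite float32 value, which suggests saturation rather than a meaningful loss.

```python
import math
import re

# Matches lines such as "epoch: 1 step: 71, loss is 4803.1064".
LOSS_LINE = re.compile(r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*(\S+)")


def find_abnormal_losses(lines, threshold=1e3):
    """Return (epoch, step, loss) for each loss value that looks abnormal.

    A loss is flagged if it is NaN, infinite, negative, or larger in
    magnitude than threshold (an assumed cutoff, not from the post).
    """
    abnormal = []
    for line in lines:
        match = LOSS_LINE.search(line)
        if match is None:
            continue  # warnings and error lines carry no loss value
        try:
            loss = float(match.group(3))
        except ValueError:
            continue  # the token after "loss is" was not a number
        epoch, step = int(match.group(1)), int(match.group(2))
        if math.isnan(loss) or math.isinf(loss) or loss < 0 or abs(loss) > threshold:
            abnormal.append((epoch, step, loss))
    return abnormal


# Loss lines copied from the log in this post.
sample_log = [
    "epoch: 1 step: 1, loss is 2.2937918",
    "epoch: 1 step: 71, loss is 4803.1064",
    "epoch: 1 step: 72, loss is 4034.7324",
    "epoch: 1 step: 142, loss is -5.3169115e+37",
    "epoch: 1 step: 143, loss is 2.3025851",
    "epoch: 1 step: 213, loss is -3.4028235e+38",
    "epoch: 1 step: 214, loss is 2.3025851",
]

for epoch, step, loss in find_abnormal_losses(sample_log):
    print(f"epoch {epoch} step {step}: abnormal loss {loss}")
```

On the log above this flags steps 71, 72, 142, and 213, so the divergence begins well before the runtime error at the end of the output.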


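Copyright notice: This is an original article by BaldheadedM, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.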