tf2.6 OOM: tensorflow/core/framework/op_kernel.cc:1680] Resource exhausted: failed to allocate memory

2022-04-27 17:16:35.834265: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 22727688192 memory_limit_: 22727688192 available bytes: 0 curr_region_allocation_bytes_: 45455376384
2022-04-27 17:16:35.834667: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                     22727688192
InUse:                     22719993344
MaxInUse:                  22727555072
NumAllocs:                     1984779
MaxAllocSize:               3804246016
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-04-27 17:16:35.835488: W tensorflow/core/common_runtime/bfc_allocator.cc:468] ****************************************************************************************************
2022-04-27 17:16:35.835746: W tensorflow/core/framework/op_kernel.cc:1680] Resource exhausted: failed to allocate memory
Traceback (most recent call last):
  File "D:/python-workspace/Mask_RCNN-master16/train333.py", line 320, in <module>
    augmentation=augment_seq)
  File "D:\python-workspace\Mask_RCNN-master16\mrcnn\model.py", line 2376, in train
    use_multiprocessing=False
  File "D:\Anaconda3\envs\py36_maskrcnn_env_bak\lib\site-packages\keras\engine\training_v1.py", line 796, in fit
    use_multiprocessing=use_multiprocessing)
  File "D:\Anaconda3\envs\py36_maskrcnn_env_bak\lib\site-packages\keras\engine\training_generator_v1.py", line 586, in fit
    steps_name='steps_per_epoch')
  File "D:\Anaconda3\envs\py36_maskrcnn_env_bak\lib\site-packages\keras\engine\training_generator_v1.py", line 252, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "D:\Anaconda3\envs\py36_maskrcnn_env_bak\lib\site-packages\keras\engine\training_v1.py", line 1076, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "D:\Anaconda3\envs\py36_maskrcnn_env_bak\lib\site-packages\keras\backend.py", line 4032, in __call__
    run_metadata=self.run_metadata)
  File "D:\Anaconda3\envs\py36_maskrcnn_env_bak\lib\site-packages\tensorflow\python\client\session.py", line 1480, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[8,128,128,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node fpn_p4upsampled/resize/ResizeNearestNeighbor}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[Func/training_4/SGD/gradients/gradients/mrcnn_bbox_fc_dropout/dropout_1/cond_grad/StatelessIf/then/_414/input/_955/_7447]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (1) Resource exhausted: OOM when allocating tensor with shape[8,128,128,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node fpn_p4upsampled/resize/ResizeNearestNeighbor}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored.

Process finished with exit code 1

While training Mask R-CNN (train.py) with TensorFlow 2.6 on a Windows 10 server with an RTX 3090 GPU, the run crashes with the OOM error above and the process exits.

Solutions:

The batch size here is just IMAGES_PER_GPU (Mask_RCNN computes BATCH_SIZE = IMAGES_PER_GPU × GPU_COUNT, and GPU_COUNT is 1 on this machine).

1. Reduce the following three parameters (a minimal Config sketch with example reduced values follows this list):

IMAGES_PER_GPU = 8

IMAGE_MIN_DIM = 400

IMAGE_MAX_DIM = 512
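For reference, here is a minimal sketch of where these parameters live in the matterport Mask_RCNN setup; the subclass name and the reduced values are illustrative choices, not prescriptions from this post:

from mrcnn.config import Config

class LowMemoryConfig(Config):
    # Hypothetical Config subclass; Mask_RCNN reads these class attributes
    # and computes BATCH_SIZE = IMAGES_PER_GPU * GPU_COUNT from them.
    NAME = "low_memory"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2     # per-GPU batch size; the failing allocation had a batch of 8
    IMAGE_MIN_DIM = 400
    IMAGE_MAX_DIM = 512    # must remain divisible by 64 for the FPN downsampling

The resulting config object is then passed to mrcnn.model.MaskRCNN(mode="training", config=..., model_dir=...) as in the standard Mask_RCNN training scripts.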

2. Configure how TensorFlow uses GPU memory:

(1) TensorFlow 1.x

① Reserve a fixed fraction (90%) of GPU memory for the process

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # use at most 90% of GPU memory
session = tf.Session(config=config)

② Start with minimal GPU usage and let TensorFlow grow allocations on demand

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory dynamically as needed
session = tf.Session(config=config)
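One caveat: creating a tf.Session by itself does not make Keras use it, and Mask_RCNN's model.py trains through Keras. The configured session also needs to be registered as the Keras backend session; a minimal sketch, assuming TF 1.x with the standalone keras package (under TF 2.x the equivalent call is tf.compat.v1.keras.backend.set_session):

import tensorflow as tf
import keras.backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # or per_process_gpu_memory_fraction = 0.9
K.set_session(tf.Session(config=config))  # make Keras run on this configured session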

(2) TensorFlow 2.x

① Set an explicit GPU memory cap (about 90% of the card's memory)

TF 2.x has no per-process memory fraction; the cap is a per-GPU memory limit in MB (22000 MB below is roughly 90% of a 24 GB RTX 3090):

gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
for gpu in gpus:
    # Cap this GPU at ~22000 MB via a virtual device configuration.
    tf.config.experimental.set_virtual_device_configuration(
        gpu, [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=22000)])
② Start with minimal GPU usage and let TensorFlow grow allocations on demand

gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)  # allocate GPU memory on demand
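Both TF 2.x settings have to run before the GPU is first initialized (i.e., before the Mask R-CNN model is built or any op touches the GPU); once the device is in use they raise a RuntimeError. A minimal sketch of the usual guard, assuming it sits near the top of train.py:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth / virtual device limits must be set before GPU init.
        print(e)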
 


Copyright notice: this is an original article by shanxiderenheni, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.