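Issues with the MTCNN training code

I have been working on MTCNN lately, so I cloned a copy from an expert's GitHub repository. Unfortunately I am clumsy with this stuff, and it threw errors everywhere when I ran it. My questions on the forums all went unanswered, so all I can do is take notes as I go. Hopefully I can hold out until the day it finally runs = =.

Blog post (CSDN): https://blog.csdn.net/weixin_36474809/article/details/82752199
Code for the blog post: https://github.com/AITTSMD/MTCNN-Tensorflow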

README

  1. Download Wider Face Training part only from Official Website , unzip to replace WIDER_train and put it into prepare_data folder.
  2. Download landmark training data from [here](http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm), unzip and put them into prepare_data folder.
  3. Run prepare_data/gen_12net_data.py to generate training data(Face Detection Part) for PNet.
  4. Run gen_landmark_aug_12.py to generate training data(Face Landmark Detection Part) for PNet.
  5. Run gen_imglist_pnet.py to merge two parts of training data.
  6. Run gen_PNet_tfrecords.py to generate tfrecord for PNet.
  7. After training PNet, run gen_hard_example to generate training data(Face Detection Part) for RNet.
  8. Run gen_landmark_aug_24.py to generate training data(Face Landmark Detection Part) for RNet.
  9. Run gen_imglist_rnet.py to merge two parts of training data.
  10. Run gen_RNet_tfrecords.py to generate tfrecords for RNet.(you should run this script four times to generate tfrecords of neg,pos,part and landmark respectively)
  11. After training RNet, run gen_hard_example to generate training data(Face Detection Part) for ONet.
  12. Run gen_landmark_aug_48.py to generate training data(Face Landmark Detection Part) for ONet.
  13. Run gen_imglist_onet.py to merge two parts of training data.

 

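The suffering begins

Following the article, first download the datasets (they are on the official sites; if that is slow, use a download manager such as Xunlei, and pay attention to how many files need to be downloaded), then put them in the corresponding folder (prepare_data).

If needed, a cut-down version is available: https://pan.baidu.com/s/1mf0hM5VqtpdMfTpf2VRELA  extraction code: kuqw   (thanks to the people who shared it online). I still recommend downloading the original files, even though they are large.

Or (update): link: https://pan.baidu.com/s/1LIYlK5sVx4qsK9tvEuJ4cw extraction code: 2yvx

Before getting to the code-specific problems, there are a few common errors worth knowing about: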

 

1.

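This one is a GPU memory problem. Manually limit the fraction of GPU memory the process may use:

import tensorflow as tf
# Let this process use at most about one third of the GPU memory.
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.333)
# Under TensorFlow 2.x, ConfigProto also has to come from tf.compat.v1.
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))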

 

2.

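When you hit "no module" style errors, here is what I found when looking it up:

According to the official announcement, TensorFlow 2.x changed significantly compared with 1.x. tf.contrib is removed from the core TensorFlow repository and build process. TensorFlow's contrib module had grown beyond what could be maintained and supported in a single repository. Larger projects are better maintained separately, while smaller extensions are gradually moved into the core TensorFlow code.

To use TensorFlow 1.x functions/methods, TensorFlow 2.x keeps a compatibility module.

Put simply, change code such as tf.[modulename] (where modulename is the name of the function used) or tf.contrib.[modulename] to:

tf.compat.v1.[modulename]()

This resolves a good share of the compatibility problems.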
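
The renaming described above can be sketched as a small text rewrite. This is only an illustration under assumptions: the helper `to_compat_v1` and the short `TF1_ONLY_NAMES` list are hypothetical names made up for this example and are not part of the repository. TensorFlow also ships a `tf_upgrade_v2` script that performs this kind of conversion far more thoroughly, so treat the snippet as a way to see the idea rather than as a replacement for it.

```python
import re

# Hypothetical, deliberately short list of TF 1.x names that are available
# under tf.compat.v1 in TensorFlow 2.x; extend it for your own code base.
TF1_ONLY_NAMES = (
    "Session",
    "ConfigProto",
    "GPUOptions",
    "placeholder",
    "TFRecordReader",
    "train.string_input_producer",
)

# Match "tf.<name>" only when "tf" is a standalone identifier, so that
# "tf.compat.v1.Session" and "mytf.Session" are left untouched.
_PATTERN = re.compile(
    r"(?<![\w.])tf\.(" + "|".join(re.escape(n) for n in TF1_ONLY_NAMES) + r")\b"
)


def to_compat_v1(source: str) -> str:
    """Rewrite the listed TF 1.x references to tf.compat.v1.<name>."""
    return _PATTERN.sub(r"tf.compat.v1.\1", source)


if __name__ == "__main__":
    print(to_compat_v1("sess = tf.Session(config=tf.ConfigProto())"))
```

Names that are not in the list, such as tf.constant, pass through unchanged, which is the safer default because most of the TF 2.x API does not need the compat prefix.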

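There are still a few small problems, though, such as tensorflow.compat.v1.contrib.slim and tensorflow.compat.v1.contrib.tensorboard.plugins.projector not being found.

That is because these modules were not merged into the compat.v1 package. They need to be installed separately and then imported like this:

import tf_slim as slim
from tensorboard.plugins import projector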

 

3.

 

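This one appears to be a version mismatch between TensorFlow and CUDA. I suggest looking up related posts and checking whether the error comes from such a mismatch. (I use tf 2.4.1, CUDA 11.2, and NVIDIA driver 461.33.)

Reportedly, tensorflow-gpu 2.4.1 no longer registers the XLA:CPU and XLA:GPU devices by default. If you really need them, then after checking the versions you can try adding the following code:

import os
# Set the flag before TensorFlow is imported so that it is seen at start-up.
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'

If the following output then appears, the error should be gone (I do not know whether it really counts as gone = =)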

 

4.

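Error: could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED. This may be a GPU memory usage problem.

Fix:

a) With the TensorFlow session API, let GPU memory be allocated on demand (written with tf.compat.v1 here so that it also runs on 2.x):

import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
with tf.compat.v1.Session(config=config) as sess:
    …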

b) With Keras (TensorFlow backend), let GPU memory be allocated on demand. This form applies to standalone Keras on a TensorFlow 1.x backend:

import tensorflow as tf
from keras import backend as K
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
K.set_session(sess)

c) With TensorFlow 2.0, let GPU memory be allocated on demand (there is no session):

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

 

 

Next are the problems I ran into while running the code in the order given by the README.

1.

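Before running gen_12net_data.py: in some cases the script does not create the …/…/DATA/12 folder automatically, so you have to create a DATA/12 folder yourself under the …/…/ directory relative to prepare_data (or change the file paths directly in the code). Whether a path is absolute or relative matters here.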
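
Creating the folder by hand can also be done from Python before the script runs. The sketch below is an assumption-laden illustration: `ensure_output_dirs` is a hypothetical helper, and the sub-folder names `positive`, `negative` and `part` are my guess at what the script writes, so check them against the save paths in your copy of gen_12net_data.py.

```python
import os


def ensure_output_dirs(base_dir, subdirs=("positive", "negative", "part")):
    """Create base_dir and each sub-folder if missing; return the paths.

    The sub-folder names are assumptions; adjust them to match the script.
    """
    created = []
    for name in subdirs:
        path = os.path.join(base_dir, name)
        # exist_ok=True makes the call safe to repeat on later runs.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Example call (the path is a placeholder, use the one from your script):
# ensure_output_dirs(os.path.join("DATA", "12"))
```

Printing os.path.abspath of the folder you pass in is a quick way to see where a relative path actually points, since it is resolved against the directory the script is started from.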

 

2.

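Running gen_12net_data.py fails with 'NoneType' object has no attribute 'shape' (tracking this one down gave me a headache = =)

Posts online blame Chinese characters in the file names or a wrong path, but that is not the whole story. There are two cases:

       (1) The error appears immediately after starting gen_12net_data.py: WIDER_train was not placed at the location described above, or the path in the code is wrong. Check both carefully.

       (2) The error appears partway through, after the script has run for a while: some files are probably corrupted or missing. If repeated attempts fail, I suggest downloading the dataset again.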
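
Both cases can be narrowed down before the long run by checking that every image named in the annotation file exists. The background, assuming the script loads images with cv2.imread, is that cv2.imread returns None instead of raising when it cannot read a file, and the later `.shape` access is what fails. The sketch below is hypothetical: `find_missing_images` is not part of the repository, and the annotation layout it expects (image path without extension first, then box coordinates) is an assumption to verify against your annotation file.

```python
import os


def find_missing_images(annotation_lines, image_dir, extension=".jpg"):
    """Return image paths named in the annotations that do not exist on disk.

    Assumes each non-empty line starts with an image path relative to
    image_dir, without a file extension, followed by box coordinates.
    """
    missing = []
    for line in annotation_lines:
        fields = line.strip().split()
        if not fields:
            continue  # skip blank lines
        path = os.path.join(image_dir, fields[0] + extension)
        if not os.path.isfile(path):
            missing.append(path)
    return missing


# Example call (file and folder names are placeholders):
# with open("wider_face_train.txt") as f:
#     print(find_missing_images(f, "WIDER_train/images")[:10])
```

If every path is reported missing, the image folder or the path in the code is wrong (case 1). If only a few are missing, the download is incomplete (case 2). A file that exists but is corrupted is not detected by this check.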

 

3.

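Running gen_landmark_aug_12.py fails with: AssertionError

Fix: change data_path in the main function to the directory where the LFW files were extracted. (Or try the candidate directories one by one.)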

 

4.

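Running gen_PNet_tfrecords.py fails with: AttributeError: 'NoneType' object has no attribute 'tostring'

Check whether some calls need to be changed to tf.compat.v1. If the error persists, the output of the first step, gen_12net_data.py, may be incomplete or wrong; rerun the earlier README steps in order. (Headache x2 XP)

The result once training completes after the fix (training takes a while, so please be patient)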

 

5.

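Running train_PNet.py / read_tfrecord_v2.py fails with: RuntimeError: Input pipelines based on Queues are not supported when eager execution is enabled. Please use tf.data to ingest data into your model instead.

See the article: "Methods and ideas for migrating tf1.x contrib code to tf2.x"

If that does not help: from what I found, TF 2.0 enables eager execution by default. Whatever its benefits, the goal here is to get this code running, so it has to be turned off, after which TensorFlow 2.x falls back to 1.x-style graph execution. Add one line at the top of the code to disable it manually (pick one of the two below):

tf.compat.v1.disable_v2_behavior()
tf.compat.v1.disable_eager_execution()

After this change, also check that the dataset_dir path in the code and the name of the .tfrecord_shuffle file are correct.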
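
The path check mentioned above can be made explicit so that a wrong folder or file name fails early with a readable message. This is a sketch under assumptions: `resolve_tfrecord` is a hypothetical helper, and the file name pattern `train_<net>_landmark.tfrecord_shuffle` is taken from the "dataset dir is:" line that the training script prints on my machine, so it may differ in other copies of the code.

```python
import os


def resolve_tfrecord(dataset_dir, net="PNet"):
    """Return the expected shuffled TFRecord path, or raise if it is absent."""
    filename = "train_%s_landmark.tfrecord_shuffle" % net
    path = os.path.join(dataset_dir, filename)
    if not os.path.isfile(path):
        # List what is actually in the folder to make typos easy to spot.
        if os.path.isdir(dataset_dir):
            present = sorted(os.listdir(dataset_dir))
        else:
            present = []
        raise FileNotFoundError(
            "%s not found; files in %s: %s" % (path, dataset_dir, present)
        )
    return path


# Example call (the folder is a placeholder):
# print(resolve_tfrecord("prepare_data/DATA/imglists/PNet"))
```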

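Alternatively, do it right at the import at the top of the file:

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

I have not tried this, so I will leave it alone rather than invite new problems.

After that, read_tfrecord_v2.py produces output.

However, running train_PNet.py afterwards fails again. The warnings come from the disable call; my guess is that the get_shape error is a path problem (?). This is where I am stuck for now, and I would be grateful for pointers from anyone who got it working.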

2021-04-13 13:43:21.220435: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
['E:\\study2\\MTCNN-Tensorflow\\train_models', 'E:\\study2\\MTCNN-Tensorflow', 'D:\\toolsware\\anaconda\\envs\\tensorflow\\python37.zip', 'D:\\toolsware\\anaconda\\envs\\tensorflow\\DLLs', 'D:\\toolsware\\anaconda\\envs\\tensorflow\\lib', 'D:\\toolsware\\anaconda\\envs\\tensorflow', 'D:\\toolsware\\anaconda\\envs\\tensorflow\\lib\\site-packages', '../prepare_data']
2021-04-13 13:43:27.077710: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-13 13:43:27.168672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 245.91GiB/s
2021-04-13 13:43:27.168851: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-13 13:43:27.221668: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-13 13:43:27.221762: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-13 13:43:27.245849: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-13 13:43:27.251559: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-13 13:43:27.279790: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-04-13 13:43:27.304099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-13 13:43:27.307262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-04-13 13:43:27.307604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
E:/study2/MTCNN-Tensorflow/prepare_data/DATA/imglists/PNet\train_PNet_landmark.txt
WARNING:tensorflow:From E:\study2\MTCNN-Tensorflow\prepare_data\read_tfrecord_v2.py:19: string_input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
Total size of the dataset is:  1428473
E:/study2/MTCNN-Tensorflow/data/MTCNN_model/PNet_landmark/PNet
dataset dir is: E:/study2/MTCNN-Tensorflow/prepare_data/DATA/imglists/PNet\train_PNet_landmark.tfrecord_shuffle
WARNING:tensorflow:From D:\toolsware\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\input.py:277: input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
WARNING:tensorflow:From D:\toolsware\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\input.py:189: limit_epochs (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensors(tensor).repeat(num_epochs)`.
WARNING:tensorflow:From D:\toolsware\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\input.py:198: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From D:\toolsware\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\input.py:198: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From E:\study2\MTCNN-Tensorflow\prepare_data\read_tfrecord_v2.py:26: TFRecordReader.__init__ (from tensorflow.python.ops.io_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From E:\study2\MTCNN-Tensorflow\prepare_data\read_tfrecord_v2.py:57: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
Traceback (most recent call last):
  File "E:/study2/MTCNN-Tensorflow/train_models/train_PNet.py", line 46, in <module>
    train_PNet(base_dir, prefix, end_epoch, display, lr)
  File "E:/study2/MTCNN-Tensorflow/train_models/train_PNet.py", line 31, in train_PNet
    train(net_factory,prefix, end_epoch, base_dir, display=display, base_lr=lr)
  File "E:\study2\MTCNN-Tensorflow\train_models\train.py", line 182, in train
    cls_loss_op,bbox_loss_op,landmark_loss_op,L2_loss_op,accuracy_op = net_factory(input_image, label, bbox_target,landmark_target,training=True)
  File "E:\study2\MTCNN-Tensorflow\train_models\mtcnn_model.py", line 187, in P_Net
    print(inputs.get_shape())
AttributeError: 'NoneType' object has no attribute 'get_shape'

Process finished with exit code 1

 

 

 

 

 


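Copyright notice: this is an original article by NIDHOG, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.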