Issues with the MTCNN training code

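I've been working with MTCNN lately, so I cloned a copy from a GitHub expert's repo. Unfortunately I'm all thumbs, and it throws errors at every turn. My questions on the forums all sank without a trace, so all I can do is take notes as I go and hope I hold out until the day it finally runs = =.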

Blog post (CSDN): https://blog.csdn.net/weixin_36474809/article/details/82752199
Code for the post: https://github.com/AITTSMD/MTCNN-Tensorflow

README

Download Wider Face Training part only from Official Website, unzip to replace WIDER_train and put it into prepare_data folder.
Download landmark training data from [here](http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm), unzip and put them into prepare_data folder.
Run prepare_data/gen_12net_data.py to generate training data(Face Detection Part) for PNet.
Run gen_landmark_aug_12.py to generate training data(Face Landmark Detection Part) for PNet.
Run gen_imglist_pnet.py to merge two parts of training data.
Run gen_PNet_tfrecords.py to generate tfrecord for PNet.
After training PNet, run gen_hard_example to generate training data(Face Detection Part) for RNet.
Run gen_landmark_aug_24.py to generate training data(Face Landmark Detection Part) for RNet.
Run gen_imglist_rnet.py to merge two parts of training data.
Run gen_RNet_tfrecords.py to generate tfrecords for RNet.(you should run this script four times to generate tfrecords of neg,pos,part and landmark respectively)
After training RNet, run gen_hard_example to generate training data(Face Detection Part) for ONet.
Run gen_landmark_aug_48.py to generate training data(Face Landmark Detection Part) for ONet.
Run gen_imglist_onet.py to merge two parts of training data.
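
For what it's worth, the PNet part of the README order above can be written down as a small checklist script. This is only a sketch: `locate_scripts` and `print_plan` are names I made up, and it assumes nothing about where each script lives beyond searching the cloned repo for the file names the README mentions.

```python
import os

# Script names for the PNet stage, in the order the README lists them.
PNET_STEPS = [
    "gen_12net_data.py",
    "gen_landmark_aug_12.py",
    "gen_imglist_pnet.py",
    "gen_PNet_tfrecords.py",
]


def locate_scripts(repo_root, names):
    """Map each script name to the first path found under repo_root, or None."""
    found = {name: None for name in names}
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in names:
            if found[name] is None and name in filenames:
                found[name] = os.path.join(dirpath, name)
    return found


def print_plan(repo_root, names=PNET_STEPS):
    """Print the steps in order, with the path of each script or NOT FOUND."""
    located = locate_scripts(repo_root, names)
    for step, name in enumerate(names, start=1):
        print(f"{step}. {name}: {located[name] or 'NOT FOUND'}")
```

Calling `print_plan` on the folder you cloned the repo into shows at a glance whether every script for the stage is present before you start a long run.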

Let the suffering begin

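Following the article, first download the datasets (they are on the official sites; if that is slow, use a download manager such as Xunlei, and pay attention to how many files you need to download), then put them in the corresponding folder (prepare_data).

If needed, you can grab a trimmed-down version: https://pan.baidu.com/s/1mf0hM5VqtpdMfTpf2VRELA extraction code: kuqw (thanks to the kind soul who shared it) (I still recommend the original files, large as they are)

Or (update): link: https://pan.baidu.com/s/1LIYlK5sVx4qsK9tvEuJ4cw extraction code: 2yvx

Before getting to problems in specific scripts, here are a few common errors worth knowing about: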

1.

This one is a GPU memory problem. Manually set the fraction of GPU memory the process may use:

import tensorflow as tf
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))

2.

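For errors along the lines of "no module", here is what I found:

According to the official announcement, TensorFlow 2.x changed significantly compared with 1.x. tf.contrib is being removed from the core TensorFlow repository and build process. The contrib module had grown beyond what can be maintained and supported in a single repository. Larger projects are better maintained separately, while smaller extensions are gradually folded into the TensorFlow core code.
To keep using TensorFlow 1.x functions and methods, TensorFlow 2.x retains a compatibility module.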

Put simply, change code like tf.[modulename] (modulename being the function you call) or tf.contrib.[modulename] to:

tf.compat.v1.[modulename]()

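This takes care of a good share of the compatibility problems.

Some small problems remain, though. For example, tensorflow.compat.v1.contrib.slim and tensorflow.compat.v1.contrib.tensorboard.plugins.projector cannot be found.

That is because these modules were not folded into the compat.v1 package. They have to be installed separately and then imported like this: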

import tf_slim as slim
from tensorboard.plugins import projector
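
Before editing by hand, it can help to list every line that still touches contrib, since those are the ones that a plain rename to tf.compat.v1 will not fix. The sketch below uses only the standard library; `find_contrib_uses` is a name I made up, and the comment stripping is deliberately crude (it would also cut at a `#` inside a string).

```python
import re

# Matches tf.contrib.<name> and tensorflow.contrib.<name>,
# e.g. tf.contrib.layers.xavier_initializer or tensorflow.contrib.slim
CONTRIB_PATTERN = re.compile(r"\b(?:tf|tensorflow)\.contrib\.[A-Za-z_][\w.]*")


def find_contrib_uses(source_text):
    """Return (line_number, matched_name) pairs for every contrib reference."""
    hits = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # crude: drop anything after a '#'
        for match in CONTRIB_PATTERN.finditer(code):
            hits.append((lineno, match.group(0)))
    return hits
```

Reading a script with `open(path, encoding="utf-8").read()` and passing the text in gives a to-do list of places that need a separately installed replacement rather than a rename.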

3.

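This one seems to be a TensorFlow/CUDA version problem. I suggest searching for related posts and checking whether the error comes from a version mismatch. (I am using tf 2.4.1, CUDA 11.2, and NVIDIA driver 461.33.)

Reportedly, tensorflow-gpu 2.4.1 no longer registers the XLA:CPU and XLA:GPU devices by default. If you really do need them, then after checking your versions you can try adding the following code: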

import os
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'

After that, the error should be gone (I honestly can't tell whether it really counts as fixed = =)
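
One detail when setting the flag: assigning TF_XLA_FLAGS directly overwrites any XLA flags that are already set in the environment. A small sketch that appends instead is below. `enable_xla_devices` is a made-up helper name, and calling it before `import tensorflow` is my assumption about the safe order, not something I have verified.

```python
import os


def enable_xla_devices(flag="--tf_xla_enable_xla_devices"):
    """Append the flag to TF_XLA_FLAGS without clobbering flags already set."""
    current = os.environ.get("TF_XLA_FLAGS", "").split()
    if flag not in current:
        current.append(flag)
    os.environ["TF_XLA_FLAGS"] = " ".join(current)
    return os.environ["TF_XLA_FLAGS"]


enable_xla_devices()
# import tensorflow as tf   # import only after the variable has been set
```

Calling the helper twice leaves the variable unchanged, so it is safe to keep at the top of every entry script.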

4.

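Error: could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED. This is probably a GPU memory usage problem.

Fix:

a) TensorFlow with the session-based 1.x API (under TF 2.x, use tf.compat.v1.ConfigProto and tf.compat.v1.Session instead), allocating GPU memory on demand: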

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess :
…

b) Keras (TensorFlow backend), allocating GPU memory on demand:

import tensorflow as tf
from keras import backend as K
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
K.set_session(sess)

c) TensorFlow 2.0, allocating GPU memory on demand (no session):

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

Next come the problems I ran into while running the scripts in README order.

1.

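Before running gen_12net_data.py: in some cases the script does not create the …/…/DATA/12 folder automatically, so you have to create a DATA/12 folder yourself under …/…/ relative to prepare_data (or change the paths directly in the code). (This comes down to absolute versus relative paths.)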
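
Instead of creating the folder by hand, a few lines at the top of the script can do it. This is a minimal sketch: `ensure_output_dirs` is a hypothetical helper, not something from the repo, and the base path you pass in has to match whatever save path your copy of the script actually uses.

```python
import os


def ensure_output_dirs(base_dir, subdirs=()):
    """Create base_dir and any subdirs under it; safe to call repeatedly."""
    os.makedirs(base_dir, exist_ok=True)
    created = [base_dir]
    for name in subdirs:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

For example, `ensure_output_dirs(os.path.join(data_root, "DATA", "12"))`, where `data_root` stands for the parent directory your script writes to. Printing `os.path.abspath(...)` of the result is a quick way to see where a relative path really points, which depends on the directory you launch the script from.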

2.

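Running gen_12net_data.py fails with 'NoneType' object has no attribute 'shape' (tracking this one down gave me a real headache = =)

People online say it is caused by Chinese characters in file names or by path problems, but that is not the whole story. There are two cases:

(1) It fails immediately after gen_12net_data.py starts: WIDER_train was not placed at the location described above, or the path in the code is wrong. Check both carefully;

(2) It fails partway through, after running for a while: some files may be corrupted or missing. If repeated attempts don't help, download the dataset again.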
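
Both cases can be checked up front instead of waiting for the crash. The sketch below walks an annotation file and reports images that are missing or empty on disk. It assumes each line starts with the image's relative path, possibly without the .jpg extension; that is my assumption about the file format, and `find_bad_images` is a made-up name, so adjust it to your annotation file.

```python
import os


def find_bad_images(annotation_file, image_root, suffix=".jpg"):
    """Return paths listed in annotation_file that are missing or empty on disk."""
    bad = []
    with open(annotation_file, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if not parts:
                continue  # skip blank lines
            rel_path = parts[0]
            if not rel_path.lower().endswith(suffix):
                rel_path += suffix  # assumed: extension omitted in the list
            path = os.path.join(image_root, rel_path)
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                bad.append(path)
    return bad
```

If every path comes back as bad, the image root is wrong (case 1). If only a handful do, those files are the damaged or missing ones (case 2). A file that exists but holds a corrupted image will still pass this check.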

3.

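Running gen_landmark_aug_12.py fails with: AssertionError.

Fix: change data_path in the main function to the directory where the LFW files were extracted. (Or just try the candidates one by one.)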

4.

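Running gen_PNet_tfrecords.py fails with: AttributeError: 'NoneType' object has no attribute 'tostring'.

Check whether some code needs to be changed to tf.compat.v1. If it still fails, the output of the first step, gen_12net_data.py, may be incomplete or broken. Rerun the earlier README steps in order. (Headache x2 XP)

What it looks like once this is fixed and the run finishes (it takes a while, so be patient)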

5.

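Running train_PNet.py / read_tfrecord_v2.py fails with: RuntimeError: Input pipelines based on Queues are not supported when eager execution is enabled. Please use tf.data to ingest data into your model instead.

You can refer to the article "Methods and ideas for migrating tf1.x contrib code to tf2.x".

If that doesn't help: from what I have read, TF 2.0 enables eager execution by default. Whatever its benefits, we just want this code to run, and the error says queue-based input pipelines are not supported in that mode, so we need to turn it off. TensorFlow 2.x then falls back to 1.x-style execution. Add one line at the top of the code to disable it manually (pick one of the two below):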

tf.compat.v1.disable_v2_behavior()

tf.compat.v1.disable_eager_execution()

After this change, also check that the dataset_dir path in the code and the name of the .tfrecord_shuffle file are correct.

~~Or, right at the import at the top of the file:~~

~~import tensorflow.compat.v1 as tf~~

~~tf.disable_v2_behavior()~~

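I haven't tried that, so never mind. Better not to invite extra trouble.

After that, read_tfrecord_v2.py produces output.

Running train_PNet.py afterwards fails again, though. The warnings are deprecation notices for the queue-based input pipeline, which only runs at all because of the disable call. My guess is that the get_shape error is a path problem (?). This is where I am stuck for now, and I would be grateful for pointers from anyone who got it working.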

2021-04-13 13:43:21.220435: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
['E:\\study2\\MTCNN-Tensorflow\\train_models', 'E:\\study2\\MTCNN-Tensorflow', 'D:\\toolsware\\anaconda\\envs\\tensorflow\\python37.zip', 'D:\\toolsware\\anaconda\\envs\\tensorflow\\DLLs', 'D:\\toolsware\\anaconda\\envs\\tensorflow\\lib', 'D:\\toolsware\\anaconda\\envs\\tensorflow', 'D:\\toolsware\\anaconda\\envs\\tensorflow\\lib\\site-packages', '../prepare_data']
2021-04-13 13:43:27.077710: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-13 13:43:27.168672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 245.91GiB/s
2021-04-13 13:43:27.168851: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-13 13:43:27.221668: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-13 13:43:27.221762: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-13 13:43:27.245849: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-13 13:43:27.251559: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-13 13:43:27.279790: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-04-13 13:43:27.304099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-13 13:43:27.307262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-04-13 13:43:27.307604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
E:/study2/MTCNN-Tensorflow/prepare_data/DATA/imglists/PNet\train_PNet_landmark.txt
WARNING:tensorflow:From E:\study2\MTCNN-Tensorflow\prepare_data\read_tfrecord_v2.py:19: string_input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
Total size of the dataset is:  1428473
E:/study2/MTCNN-Tensorflow/data/MTCNN_model/PNet_landmark/PNet
dataset dir is: E:/study2/MTCNN-Tensorflow/prepare_data/DATA/imglists/PNet\train_PNet_landmark.tfrecord_shuffle
WARNING:tensorflow:From D:\toolsware\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\input.py:277: input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
WARNING:tensorflow:From D:\toolsware\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\input.py:189: limit_epochs (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensors(tensor).repeat(num_epochs)`.
WARNING:tensorflow:From D:\toolsware\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\input.py:198: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From D:\toolsware\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\training\input.py:198: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From E:\study2\MTCNN-Tensorflow\prepare_data\read_tfrecord_v2.py:26: TFRecordReader.__init__ (from tensorflow.python.ops.io_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From E:\study2\MTCNN-Tensorflow\prepare_data\read_tfrecord_v2.py:57: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
Traceback (most recent call last):
  File "E:/study2/MTCNN-Tensorflow/train_models/train_PNet.py", line 46, in <module>
    train_PNet(base_dir, prefix, end_epoch, display, lr)
  File "E:/study2/MTCNN-Tensorflow/train_models/train_PNet.py", line 31, in train_PNet
    train(net_factory,prefix, end_epoch, base_dir, display=display, base_lr=lr)
  File "E:\study2\MTCNN-Tensorflow\train_models\train.py", line 182, in train
    cls_loss_op,bbox_loss_op,landmark_loss_op,L2_loss_op,accuracy_op = net_factory(input_image, label, bbox_target,landmark_target,training=True)
  File "E:\study2\MTCNN-Tensorflow\train_models\mtcnn_model.py", line 187, in P_Net
    print(inputs.get_shape())
AttributeError: 'NoneType' object has no attribute 'get_shape'

Process finished with exit code 1
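
The deprecation warnings in the log above all point to tf.data as the replacement for the queue-based reader. I have not wired this into the repo, and it is not a fix for the get_shape error, but a tf.data version of the TFRecord reader could look roughly like the sketch below. The feature keys, the raw uint8 image encoding, and the (x - 127.5) / 128 normalization are my assumptions about what the tfrecord generation scripts write. `parse_example` and `make_dataset` are made-up names.

```python
import tensorflow as tf


def parse_example(serialized, image_size=12):
    """Decode one serialized record into (image, label, roi, landmark)."""
    features = tf.io.parse_single_example(
        serialized,
        features={
            # Assumed keys and shapes; check them against the writer script.
            "image/encoded": tf.io.FixedLenFeature([], tf.string),
            "image/label": tf.io.FixedLenFeature([], tf.int64),
            "image/roi": tf.io.FixedLenFeature([4], tf.float32),
            "image/landmark": tf.io.FixedLenFeature([10], tf.float32),
        },
    )
    # Assumed: images are stored as raw uint8 bytes, not as JPEG/PNG.
    image = tf.io.decode_raw(features["image/encoded"], tf.uint8)
    image = tf.reshape(image, [image_size, image_size, 3])
    image = (tf.cast(image, tf.float32) - 127.5) / 128.0
    return (
        image,
        features["image/label"],
        features["image/roi"],
        features["image/landmark"],
    )


def make_dataset(tfrecord_path, batch_size, image_size=12):
    """Build a shuffled, endlessly repeating, batched dataset from one file."""
    dataset = tf.data.TFRecordDataset(tfrecord_path)
    dataset = dataset.map(lambda s: parse_example(s, image_size))
    return dataset.shuffle(10000).repeat().batch(batch_size)
```

With eager execution on, `next(iter(make_dataset(path, 384)))` yields one batch. With eager execution disabled as described earlier, the batch tensors would instead come from `tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()`.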

Original link: https://blog.csdn.net/NIDHOG/article/details/115662191