PyTorch Classic CNNs (4): ZFNet (Pascal VOC 2007)

Proposed in 2013.

ZFNet won ImageNet ILSVRC 2013. Its authors are Matthew Zeiler and Rob Fergus, hence the name ZFNet (short for Zeiler & Fergus Net).

ZFNet: the start of network visualization (Visualizing and Understanding Convolutional Networks)

ZFNet makes only small tweaks to AlexNet; there is no major breakthrough in the network structure itself. Its main contribution is visualization: it uses deconvolution and unpooling (which cannot be inverted exactly, only approximated) to visualize each layer, and it was trained on a single GPU. The paper's biggest contribution is using visualization to reveal what each layer of the network is actually doing and what role it plays.

Under the old way of thinking, we could only judge whether a convolutional network was good by training it over and over; there was no way to know what happens in each convolution, each pooling, or each pass through an activation function, nor why the network works so well, so the only option was endless experimentation in search of a better model. The core work of ZFNet is exactly this: using visualization to reveal what each layer of the network does and what role it plays. Once we know this, we have a much better basis for deciding how to adjust the network and in which direction to optimize it.

Main contributions:

(1) Use a multi-layer deconvolutional network to visualize how features evolve during training and to uncover potential problems;

(2) Study which parts of the input matter most for classification by occluding local regions of the image and observing the effect on the classification result.

Related posts:

  • [PyTorch]ZFNet vs AlexNet | 大海
  • 可视化理解卷积神经网络 | 大海
  • Visualizing and Understanding Convolutional Networks (tina, CSDN)
  • 卷积神经网络可视化理解 (ZHUYOUKANG, CSDN)
  • ZFNet(2013)及可视化的开端 (腾讯云)
  • ZF网络架构深度详解 (MIss-Y, CSDN)

Pascal VOC 2007

The PASCAL Visual Object Classes Challenge 2007 (VOC2007)

The dataset contains a train/val set of 5,011 images and a test set of 4,952 images, 9,963 in total, covering 20 classes.

The 20 classes are: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor.

Directory structure

Downloading / displaying the dataset

import cv2
import numpy as np
from torchvision import datasets
import matplotlib.pyplot as plt

# Download the PASCAL VOC dataset
# If it has already been downloaded, it will not be downloaded again
dataset = datasets.VOCDetection(
    root='dataset/VOC2007/',
    year='2007',
    image_set='trainval',
    download=True
)
# len(dataset) is 5011


# img, target = dataset.__getitem__(0)
img, target = dataset[0]  # take the 0th sample
# img is a <class 'PIL.Image.Image'>
# img.size is (375, 500)
# target is a dict
# target's keys are dict_keys(['folder', 'filename', 'source', 'owner', 'size', 'segmented', 'object'])

img = np.array(img)
# now img.shape is (500, 375, 3)

# The array comes from a PIL image, so it is already in RGB channel order and
# plt.imshow can display it directly. cv2.imshow expects BGR, so for OpenCV
# you would reverse the channel order first.
# cv2.imshow('img', img[:, :, ::-1])
plt.imshow(img)
plt.show()

After downloading and extraction the directory looks like this:
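Roughly (torchvision extracts the VOC2007 archives under the root passed to VOCDetection; the exact listing below is for reference, not a screenshot from the original post):

dataset/VOC2007/VOCdevkit/VOC2007/Annotations/          # one XML annotation file per image
dataset/VOC2007/VOCdevkit/VOC2007/ImageSets/Main/       # train.txt, val.txt, trainval.txt, ...
dataset/VOC2007/VOCdevkit/VOC2007/JPEGImages/           # the .jpg images
dataset/VOC2007/VOCdevkit/VOC2007/SegmentationClass/
dataset/VOC2007/VOCdevkit/VOC2007/SegmentationObject/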

Extracting the full train/val sets and saving the annotated regions by class

"""
Extract the full train/val sets and save the annotated regions (crops) by class.
"""

import cv2
import numpy as np
import os
import xmltodict

# import utils.util as util

def check_dir(data_dir):
    if not os.path.exists(data_dir):
        os.mkdir(data_dir)

path = 'dataset/VOC2007/'
root_dir = path + 'train_val/'
train_dir = path + 'train_val/train/'
val_dir = path + 'train_val/val/'

train_txt_path = path + 'VOCdevkit/VOC2007/ImageSets/Main/train.txt'
val_txt_path = path + 'VOCdevkit/VOC2007/ImageSets/Main/val.txt'

annotation_dir = path + 'VOCdevkit/VOC2007/Annotations'
jpeg_image_dir = path + 'VOCdevkit/VOC2007/JPEGImages'

alphabets = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable',
             'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']


def find_all_cate_rects(annotation_dir, name_list):
    """
    Collect the annotated bounding boxes (rects) for every class.
    """
    cate_list = list()
    for i in range(20):
        cate_list.append(list())

    for name in name_list:
        annotation_path = os.path.join(annotation_dir, name + ".xml")
        with open(annotation_path, 'rb') as f:
            xml_dict = xmltodict.parse(f)
            # print(xml_dict)

            objects = xml_dict['annotation']['object']
            if isinstance(objects, list):
                for obj in objects:
                    obj_name = obj['name']
                    obj_idx = alphabets.index(obj_name)

                    difficult = int(obj['difficult'])
                    if difficult != 1:
                        bndbox = obj['bndbox']
                        cate_list[obj_idx].append({'img_name': name, 'rect':
                            (int(bndbox['xmin']), int(bndbox['ymin']), int(bndbox['xmax']), int(bndbox['ymax']))})
            elif isinstance(objects, dict):
                obj_name = objects['name']
                obj_idx = alphabets.index(obj_name)

                difficult = int(objects['difficult'])
                if difficult != 1:
                    bndbox = objects['bndbox']
                    cate_list[obj_idx].append({'img_name': name, 'rect':
                        (int(bndbox['xmin']), int(bndbox['ymin']), int(bndbox['xmax']), int(bndbox['ymax']))})
            else:
                pass

    return cate_list


def save_cate(cate_list, image_dir, res_dir):
    """
    Save the cropped images for each class.
    """
    # Load every image under image_dir into a dict so crops can be looked up later
    image_dict = dict()
    image_name_list = os.listdir(image_dir)
    for name in image_name_list:
        image_path = os.path.join(image_dir, name)
        img = cv2.imread(image_path)

        image_dict[name.split('.')[0]] = img

    # Iterate over all 20 classes and save the annotated crops
    for i in range(20):
        cate_name = alphabets[i]
        cate_dir = os.path.join(res_dir, cate_name)
        check_dir(cate_dir)

        for item in cate_list[i]:
            img_name = item['img_name']
            xmin, ymin, xmax, ymax = item['rect']

            rect_img = image_dict[img_name][ymin:ymax, xmin:xmax]
            img_path = os.path.join(cate_dir, '%s-%d-%d-%d-%d.png' % (img_name, xmin, ymin, xmax, ymax))
            cv2.imwrite(img_path, rect_img)


if __name__ == '__main__':
    check_dir(root_dir)
    check_dir(train_dir)
    check_dir(val_dir)

    train_name_list = np.loadtxt(train_txt_path, dtype=str)  # np.str is removed in recent NumPy; use the built-in str
    print(train_name_list)
    cate_list = find_all_cate_rects(annotation_dir, train_name_list)
    print([len(x) for x in cate_list])
    save_cate(cate_list, jpeg_image_dir, train_dir)

    val_name_list = np.loadtxt(val_txt_path, dtype=str)
    print(val_name_list)
    cate_list = find_all_cate_rects(annotation_dir, val_name_list)
    print([len(x) for x in cate_list])
    save_cate(cate_list, jpeg_image_dir, val_dir)

    print('done')

datasets.ImageFolder

ImageFolder is a generic data loader. It requires the training, validation or test images to be organized in the following layout:

root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png

ImageFolder assumes all files are stored in per-class folders: each folder contains images of a single class, and the folder name is the class name.

The folder that ImageFolder reads here is organized in exactly this way.
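A minimal usage sketch (the path assumes the train_val directory produced by the extraction script above):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.ToTensor(),
])

# each subfolder of train/ (aeroplane, bicycle, ...) becomes one class
train_set = datasets.ImageFolder(root='dataset/VOC2007/train_val/train', transform=transform)

print(len(train_set))          # number of cropped samples
print(train_set.classes)       # the 20 folder names, i.e. the class names
print(train_set.class_to_idx)  # mapping from class name to label index

img, label = train_set[0]      # img is a 3x227x227 tensor, label is an int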

Network implementation

The paper describes a 224×224 input; in practice 227×227 is used here.

The input image size is the same as for AlexNet.

ZFNet's adjustments on top of AlexNet:

  • In the 1st convolutional layer, the kernel size is reduced from 11 to 7 and the stride from 4 to 2 (which roughly doubles the size of the feature map)
  • To keep the sizes of the subsequent feature maps consistent, the stride of the 2nd convolutional layer is changed from 1 to 2
  • In the 3rd, 4th and 5th convolutional layers, 512, 1024 and 512 filters are used instead of 384, 384 and 256

Just the first two of these modifications already bring a performance gain of a few points.
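As a quick sanity check on the feature map sizes (a small helper, not from the original post, applying the standard output-size formula to a 227×227 input with the conv1 settings used in the code below and in torchvision's AlexNet):

def conv_out(size, kernel, stride, padding=0):
    # standard formula: floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# torchvision AlexNet conv1: 11x11, stride 4, padding 2
print(conv_out(227, kernel=11, stride=4, padding=2))  # 56
# ZFNet conv1 as implemented below: 7x7, stride 2, padding 2
print(conv_out(227, kernel=7, stride=2, padding=2))   # 113 -- roughly twice as large per side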

import os
import time
import copy
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms
from torchvision import datasets
import matplotlib.pyplot as plt


class ZFNet(nn.Module):
    def __init__(self, num_classes=1000):
        super(ZFNet, self).__init__()
        self.features = nn.Sequential(
            # ZFNet change vs AlexNet: conv1 uses a 7x7 kernel with stride 2
            # (AlexNet uses an 11x11 kernel with stride 4)
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # ZFNet change vs AlexNet: conv2 uses stride 2 instead of 1
            nn.Conv2d(64, 192, kernel_size=5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # note: this implementation keeps AlexNet-style filter counts for conv3-5
            # rather than the 512/1024/512 variant mentioned above
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x


data_root_dir = 'dataset/VOC2007/train_val'
model_dir = 'checkpoints/'


def load_data(root_dir):
    transform = transforms.Compose([
        transforms.Resize((227, 227)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    data_loaders = {}
    dataset_sizes = {}
    for phase in ['train', 'val']:
        phase_dir = os.path.join(root_dir, phase)

        data_set = datasets.ImageFolder(
            root=phase_dir,
            transform=transform
        )

        data_loader = DataLoader(
            dataset=data_set,
            batch_size=64,
            shuffle=True,
            num_workers=8
        )

        data_loaders[phase] = data_loader
        dataset_sizes[phase] = len(data_set)

    return data_loaders, dataset_sizes


def train_model(model, criterion, optimizer, scheduler, dataset_sizes, dataloaders, num_epochs=25, device=None):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    loss_dict = {'train': [], 'val': []}
    acc_dict = {'train': [], 'val': []}
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()  # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]
            loss_dict[phase].append(epoch_loss)
            acc_dict[phase].append(epoch_acc.item())  # .item() so the curves can be plotted later

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, loss_dict, acc_dict


def save_png(title, res_dict):
    # x_major_locator = MultipleLocator(1)
    # ax = plt.gca()
    # ax.xaxis.set_major_locator(x_major_locator)
    fig = plt.figure()

    plt.title(title)
    for name, res in res_dict.items():
        for k, v in res.items():
            plt.plot(v, label='%s-%s' % (name, k))

    plt.legend()
    plt.savefig('%s.png' % title)


if __name__ == '__main__':
    data_loaders, data_sizes = load_data(data_root_dir)
    print(data_sizes)

    res_loss = dict()
    res_acc = dict()

    # for name in ['alexnet', 'zfnet']:
    for name in ['zfnet']:
        if name == 'alexnet':
            model = torchvision.models.AlexNet(num_classes=20)
        else:
            model = ZFNet(num_classes=20)
        device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
        model = model.to(device)

        criterion = nn.CrossEntropyLoss()
        # optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
        optimizer = optim.Adam(model.parameters(), lr=1e-3)
        lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

        best_model, loss_dict, acc_dict = train_model(model, criterion, optimizer, lr_scheduler, data_sizes,
                                                      data_loaders, num_epochs=5,
                                                      device=device)
        # Save the best model's weights
        if not os.path.exists(model_dir):
            os.mkdir(model_dir)
        torch.save(best_model.state_dict(), os.path.join(model_dir, '%s.pth' % name))

        res_loss[name] = loss_dict
        res_acc[name] = acc_dict

        print('train %s done' % name)
        print()

    save_png('loss', res_loss)
    save_png('acc', res_acc)

Visualizing each convolutional layer

Why do just two modifications give ZFNet a gain of a few points over AlexNet?

By visualizing AlexNet's features, the authors found that aliasing appears in the 2nd layer. In digital signal processing, aliasing is the phenomenon where different signals become indistinguishable when the sampling rate is too low; the authors attributed it to the overly large stride of the 1st convolutional layer. The fix is to raise the sampling rate, so the stride is reduced from 4 to 2, and the kernel size is shrunk accordingly (with a smaller stride, the kernel no longer needs to cover such a large area). Comparing the features before and after this change, as shown in the paper's figures, the 1st layer shows more, and more discriminative, features, and the 2nd layer's features are also clearer, with no aliasing.

The visualization procedure

As discussed above, a convolutional network maps the original pixel space into feature space layer by layer, and the value at each position of a deep feature map represents how similar that location is to some pattern. Because these values live in feature space, the corresponding patterns are hard for the human eye to inspect directly; to make them observable, they must be mapped back to pixel space. The paper Visualizing and Understanding Convolutional Networks focuses on exactly this backward mapping.

The visualization works on an already-trained network, or on a snapshot taken during training. It does not change the network's weights; it is only used to analyze and understand which features the network responds to for a given input image, and how those features change as training progresses.

Given an input image, first run a forward pass to obtain every layer's feature maps. To visualize what the i-th layer has learned, keep the maximum activation in that layer's feature maps, set every other position and every other feature map to 0, and map the result back to the input pixel space. In an ordinary CNN the forward pass repeatedly goes input image → conv → rectification → pooling → ...; for visualization, starting from some layer's feature map, the reverse path is unpooling → rectification → deconv → ... → input space. In the paper's diagram, the upper part corresponds to deeper layers and the lower part to shallower ones: the forward pass runs bottom-up on the right, and the visualization path runs top-down on the left. (A minimal code sketch of one such stage is given further below.)

The per-layer operations during visualization are as follows:

  • Unpooling: during the forward pass, record which location each maximum in the max pooling layer came from; during unpooling, place the values coming from the layer above directly back at those locations. The Max Locations "Switches" are a binary map, the same size as the pooling layer's input, marking the position of each local maximum.
  • Rectification: since ReLU is the activation function, the forward pass passes positive values through unchanged and zeroes out negatives; the "reverse activation" is no different, so the map coming from the layer above is simply passed through ReLU again.
  • Deconvolution: transposed convolution is probably the better name. A convolution's output map is generally no larger than its input map; the transposed convolution restores the original size, acting as an upsampling step. It shares the same kernels as the forward convolution, but flipped left-right and up-down (i.e. rotated 180°), and convolves them with the feature map coming from the layer above; the result is passed further down. For a more detailed introduction to deconvolution, see the blog post 《一文搞懂 deconvolution、transposed convolution、sub-pixel or fractional convolution》.

Repeating this process maps the features back to the pixel space of the input, producing patterns the human eye can understand. Feeding in different input images then shows which features each layer responds to most strongly, as illustrated by the figures in the paper.
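Below is a minimal sketch of one unpooling → rectification → transposed-convolution stage in PyTorch (not the paper's or this post's code): a single conv+pool stage whose kernel is reused for the backward mapping.

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=2)
pool = nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)  # also return the "switches"
unpool = nn.MaxUnpool2d(kernel_size=3, stride=2)

x = torch.randn(1, 3, 227, 227)

# forward pass: conv -> ReLU -> max pool, remembering where each maximum came from
feat = F.relu(conv(x))            # 1 x 64 x 113 x 113
pooled, switches = pool(feat)     # 1 x 64 x 56 x 56

# keep only the single strongest activation, zero out everything else
flat = pooled.flatten()
keep = torch.zeros_like(flat)
idx = flat.argmax()
keep[idx] = flat[idx]
keep = keep.view_as(pooled)

# backward mapping: unpool (using the recorded switches) -> ReLU -> transposed conv
up = unpool(keep, switches, output_size=feat.size())
up = F.relu(up)
# conv_transpose2d with the forward layer's own weights is equivalent to convolving
# with the flipped (180-degree rotated) kernels, i.e. the paper's deconv step
recon = F.conv_transpose2d(up, conv.weight, stride=2, padding=2)
print(recon.shape)                # torch.Size([1, 3, 227, 227])

Stacking one such stage per layer, from the chosen feature map all the way back to the input, gives reconstructions like those in the paper; details such as bias handling and contrast normalization are omitted in this sketch.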


Copyright notice: this is an original article by hxxjxw, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.