Image Captioning in Practice | Part 4 | Model Evaluation

Welcome to Part 4 of the Image Captioning in Practice series. In this part, we evaluate the image captioning model trained earlier.

Evaluation Metrics

Before implementing the evaluation itself, let's briefly discuss the metric used to evaluate image captioning models.

1. BLEU

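Caption generation in image captioning resembles target-language sequence generation in machine translation, so the quality of generated captions can be assessed with machine translation metrics. One of the most widely used is BLEU (BiLingual Evaluation Understudy). The word "understudy" signals that the metric stands in for human evaluators, since manual evaluation is slow and labor-intensive. BLEU was first proposed in 2002 by Kishore Papineni and colleagues at IBM in the paper "BLEU: A Method for Automatic Evaluation of Machine Translation".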

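When judging a machine translation from a source language (say, English) into a target language (say, Chinese), the criterion is simple: the closer the machine output is to a professional human translation, the better it is. BLEU is built on the same idea. In effect, it measures how similar two sentences are: a machine-translated sentence is compared against one or more reference translations, and the comparison yields a single score, where a higher score indicates a better translation.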

We call a sentence produced by the model a candidate and a sentence from the corpus a reference. BLEU computes a similarity score between the candidate and its references. The score ranges from 0.0 to 1.0: a perfect match scores 1.0, and a perfect mismatch scores 0.0.

2. Strengths and Weaknesses

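BLEU has several clear strengths:

It is cheap and fast to compute;
It is easy to understand;
It is language independent (you can apply it to any language in the world);
It correlates well with human evaluation;
It is widely adopted in both academia and industry.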

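Its weaknesses should not be overlooked, however:

It does not consider grammatical correctness;
Its scores can be distorted by common words;
Short translations sometimes receive high scores;
It does not account for synonyms or paraphrases, so a reasonable translation may be penalized.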

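Keep in mind that BLEU cannot be 100% accurate. It only gives a rough judgment, and its goal is merely to provide a fast, reasonably good automatic evaluation method.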
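
The short-translation weakness listed above is easy to reproduce: a very short candidate can reach perfect n-gram precision. The BLEU paper counters this with a brevity penalty. The sketch below is only an illustration under assumptions: brevity_penalty and the toy sentences are names introduced here, not part of NLTK or of this tutorial's code.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Return 1.0 for candidates longer than the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


reference = 'the cat is on the mat'.split()
short_candidate = 'the cat'.split()

# Every word of the short candidate occurs in the reference: precision is 1.0.
unigram_precision = sum(w in reference for w in short_candidate) / len(short_candidate)
bp = brevity_penalty(len(short_candidate), len(reference))

# The penalty exp(1 - 6/2) pulls the otherwise perfect score down sharply.
print(unigram_precision, round(bp, 4), round(unigram_precision * bp, 4))
```

The penalty only addresses length. It does nothing for the synonym or grammar weaknesses listed above.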

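In the paper, Papineni et al. propose an improvement called modified n-gram precision. It takes into account how often each word occurs in the reference text, so that a candidate is not rewarded for generating an abundance of plausible words. To score blocks of multiple sentences, the paper normalizes the n-gram counts over the whole block. See the paper for details.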
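
To make the idea concrete, here is a minimal sketch of modified n-gram precision in plain Python. It is an illustration under assumptions, not NLTK's implementation: ngrams and modified_precision are helper names introduced here, and the sentences are a toy example in which the candidate simply repeats a common word.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    # For every n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count so that repeated words earn no extra credit.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return Fraction(clipped, total) if total else Fraction(0)


references = [
    'the cat is on the mat'.split(),
    'there is a cat on the mat'.split(),
]
candidate = 'the the the the the the the'.split()

# Unclipped precision: every candidate word appears in some reference.
plain = Fraction(sum(1 for w in candidate if any(w in r for r in references)), len(candidate))
clipped = modified_precision(references, candidate, 1)
print(plain, clipped)  # 1 2/7
```

Without clipping, the degenerate candidate scores a perfect 1. With clipping, "the" is credited at most twice, the highest count in any single reference.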

3. Computing BLEU Scores

The Natural Language Toolkit (NLTK) for Python provides an implementation of BLEU that you can use to evaluate generated text against reference text.

3.1 Sentence BLEU Score

NLTK provides the sentence_bleu() function for evaluating a candidate sentence against one or more reference sentences.

from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(score) # prints 1.0: a perfect match

3.2 Corpus BLEU Score

NLTK also provides the corpus_bleu() function for computing a BLEU score over multiple sentences, such as a paragraph or a document.

# two references for one document
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score) # prints 1.0
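
A corpus-level score is not the average of the sentence-level scores: as NLTK's documentation describes, corpus_bleu() pools the n-gram counts of all sentences before dividing. The sketch below illustrates the difference with unigram precision only. It is a toy example in plain Python, and clipped_unigram_counts is a helper name introduced here, not an NLTK function.

```python
from collections import Counter


def clipped_unigram_counts(reference, candidate):
    """Return (matched, total) unigram counts for one reference/candidate pair."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched, len(candidate)


# (reference, candidate) pairs: the first matches fully, the second not at all.
pairs = [
    ('a dog runs on the grass'.split(), 'a dog runs on the grass'.split()),
    ('two children play'.split(), 'a cat'.split()),
]

counts = [clipped_unigram_counts(ref, cand) for ref, cand in pairs]

# Corpus style: pool the counts over all sentences, then divide once.
pooled = sum(m for m, _ in counts) / sum(t for _, t in counts)
# Sentence style: compute one precision per sentence, then average them.
averaged = sum(m / t for m, t in counts) / len(counts)
print(pooled, averaged)  # 0.75 0.5
```

With pooling, long sentences carry more weight than short ones, which is why the two numbers differ.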

3.3 Cumulative and Individual BLEU Scores

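The BLEU functions in NLTK let you specify a weight for each n-gram order. This gives you the flexibility to compute different kinds of BLEU scores, such as individual and cumulative n-gram scores.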

Individual N-Gram Scores

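An individual n-gram score evaluates only matching n-grams of one specific order, such as single words (1-grams, or unigrams) or word pairs (2-grams, or bigrams). The weights are given as a tuple in which each position corresponds to an n-gram order. To compute the BLEU score for 1-gram matches only, set the 1-gram weight to 1 and the 2-, 3-, and 4-gram weights to 0, that is, weights=(1, 0, 0, 0).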

# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

Cumulative N-Gram Scores

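A cumulative score combines the individual n-gram scores of every order from 1 to n by taking their weighted geometric mean. By default, sentence_bleu() and corpus_bleu() compute the cumulative 4-gram BLEU score, also called BLEU-4. BLEU-4 gives the 1-gram, 2-gram, 3-gram, and 4-gram scores a weight of 1/4 (25%, or 0.25) each.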

# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
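print(score) # the article reports 0.707106781187; newer NLTK versions may instead print a value near 0 with a warning, since no 3-grams or 4-grams match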

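The cumulative and individual 1-gram BLEU scores use the same weights, (1, 0, 0, 0). The cumulative 2-gram BLEU score assigns 50% weight each to 1-grams and 2-grams, and the cumulative 3-gram BLEU score assigns 33% weight each to 1-grams, 2-grams, and 3-grams. Let's make this concrete by computing the cumulative BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores: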

# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

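# Reported output: 0.750000, 0.500000, 0.632878, 0.707107 (the last two may differ on newer NLTK versions, which handle zero n-gram matches differently)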

When describing the performance of a text generation system, it is common to report the cumulative scores from BLEU-1 through BLEU-4.
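
The weighted geometric mean behind the cumulative scores can be checked by hand. The sketch below is a plain-Python illustration, not NLTK's code: cumulative_bleu is a name introduced here, and the precisions are worked out manually for the candidate "this is a test" against the reference "this is small test" (3 of 4 unigrams and 1 of 3 bigrams match).

```python
import math


def cumulative_bleu(precisions, weights, brevity_penalty=1.0):
    """Combine n-gram precisions with a weighted geometric mean."""
    used = [(p, w) for p, w in zip(precisions, weights) if w > 0]
    if any(p == 0 for p, _ in used):
        return 0.0  # a single zero precision drives the geometric mean to zero
    log_mean = sum(w * math.log(p) for p, w in used)
    return brevity_penalty * math.exp(log_mean)


# Candidate and reference have the same length, so the brevity penalty stays 1.
p1, p2 = 3 / 4, 1 / 3
bleu1 = cumulative_bleu([p1, p2], [1.0, 0.0])
bleu2 = cumulative_bleu([p1, p2], [0.5, 0.5])
print(round(bleu1, 6), round(bleu2, 6))  # 0.75 0.5
```

Under this strict formula, a zero precision at any weighted order gives a score of 0. NLTK's handling of zero counts has varied between versions, so its printed BLEU-3 and BLEU-4 values for this example may not match the strict formula.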

Model Evaluation

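Once the model is trained, we need to evaluate its predictions on the test set. The main goal of image captioning is to generate a textual description for a given image, so the goal of evaluation is to judge how close the generated description is to the reference descriptions provided for that image.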

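First, we load the trained model. Then we use it to generate a description for each image in the test set. Generation begins with the start token startseq and emits one word at a time until it produces the end token endseq or reaches the maximum sequence length.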

The code for loading the model is as follows:

from keras.models import load_model
filename = 'model-ep001-loss3.112-val_loss3.153.h5' # replace with your own model file
model = load_model(filename)

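Strictly speaking, when predicting the next word the model outputs that word's index in the vocabulary, so we need to map the index back to the corresponding word.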

def word_for_id(integer, tokenizer):
  for word, index in tokenizer.word_index.items():
    if index == integer:
      return word
  return None
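
word_for_id scans the whole vocabulary on every call. One possible alternative, sketched below under assumptions, inverts the mapping once and then uses dictionary lookups. build_index_to_word is a helper name introduced here, and toy_tokenizer is a stand-in object that only mimics the word_index attribute of a fitted tokenizer.

```python
from types import SimpleNamespace


def build_index_to_word(tokenizer):
    """Invert tokenizer.word_index once so each lookup is a dict access."""
    return {index: word for word, index in tokenizer.word_index.items()}


# Stand-in for a fitted tokenizer; only the word_index attribute is used.
toy_tokenizer = SimpleNamespace(
    word_index={'startseq': 1, 'dog': 2, 'runs': 3, 'endseq': 4}
)

index_to_word = build_index_to_word(toy_tokenizer)
print(index_to_word.get(2))   # dog
print(index_to_word.get(99))  # None, like word_for_id for an unknown index
```

Building the dictionary once before the generation loop avoids repeating the linear scan for every generated word.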

Next, we define a function that generates the sequence:

from numpy import argmax
from keras.preprocessing.sequence import pad_sequences

def generate_desc(model, tokenizer, photo, max_length):
  in_text = 'startseq'
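  # iterate up to the maximum sequence length
  for i in range(max_length):
    # encode the text generated so far as a sequence of word indices
    sequence = tokenizer.texts_to_sequences([in_text])[0]
    # pad the sequence to the fixed input length
    sequence = pad_sequences([sequence], maxlen=max_length)
    # predict the next word
    yhat = model.predict([photo, sequence], verbose=0)
    # take the index with the highest score
    yhat = argmax(yhat)
    # map the index back to a word
    word = word_for_id(yhat, tokenizer)
    # stop if the index has no word in the vocabulary
    if word is None:
      break
    # append the word to form the input for the next prediction
    in_text += ' ' + word
    # stop once the end-of-sequence token is produced
    if word == 'endseq':
      break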
  return in_text
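
The stopping logic of generate_desc can be tried out without a trained model. The sketch below keeps the same loop structure but replaces the model.predict and word_for_id steps with a hypothetical callable, predict_next. Both greedy_decode and toy_predict_next are names introduced here for illustration only.

```python
def greedy_decode(predict_next, max_length, start='startseq', end='endseq'):
    """Grow a caption word by word until `end` appears or max_length is reached."""
    words = [start]
    for _ in range(max_length):
        word = predict_next(words)
        if word is None:   # no word is available for the prediction
            break
        words.append(word)
        if word == end:    # the end-of-sequence token was produced
            break
    return ' '.join(words)


def toy_predict_next(words):
    """Hypothetical stand-in for the model: replays a fixed caption."""
    script = ['dog', 'runs', 'on', 'grass', 'endseq']
    position = len(words) - 1
    return script[position] if position < len(script) else None


full = greedy_decode(toy_predict_next, max_length=10)
cut = greedy_decode(toy_predict_next, max_length=2)
print(full)  # startseq dog runs on grass endseq
print(cut)   # startseq dog runs
```

The second call shows the other exit path: when max_length is reached first, the caption ends without endseq.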

Next, we generate a description for every image in the test set and then compute the BLEU scores.

from nltk.translate.bleu_score import corpus_bleu

def evaluate_model(model, descriptions, photos, tokenizer, max_length):
  actual, predicted = list(), list()
  for key, desc_list in descriptions.items():
    yhat = generate_desc(model, tokenizer, photos[key], max_length)
    references = [d.split() for d in desc_list]
    actual.append(references)
    predicted.append(yhat.split())
  # compute BLEU scores
  print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
  print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
  print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
  print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
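
The nesting of actual and predicted is easy to get wrong: each image contributes a list of tokenized references but only one tokenized hypothesis. The sketch below reproduces just that data layout with toy data. build_bleu_inputs, toy_descriptions, and toy_captions are names introduced here, and the captions are made up for illustration.

```python
def build_bleu_inputs(descriptions, generate):
    """Collect references and hypotheses in the layout evaluate_model uses."""
    actual, predicted = [], []
    for key, desc_list in descriptions.items():
        # all tokenized reference captions for one image
        actual.append([d.split() for d in desc_list])
        # one tokenized generated caption for the same image
        predicted.append(generate(key).split())
    return actual, predicted


toy_descriptions = {
    'img1': ['startseq dog runs on grass endseq', 'startseq a dog is running endseq'],
    'img2': ['startseq child plays in water endseq'],
}
# Made-up captions standing in for the output of generate_desc.
toy_captions = {
    'img1': 'startseq dog runs endseq',
    'img2': 'startseq child plays endseq',
}

actual, predicted = build_bleu_inputs(toy_descriptions, toy_captions.get)
print(len(actual), len(predicted))         # 2 2
print(len(actual[0]), len(actual[1]))      # 2 1
print(predicted[0])                        # ['startseq', 'dog', 'runs', 'endseq']
```

The two outer lists must have the same length and the same order, since they are paired image by image.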

Complete Code

The complete code for the model evaluation section is as follows:

from numpy import argmax
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from nltk.translate.bleu_score import corpus_bleu

def word_for_id(integer, tokenizer):
  for word, index in tokenizer.word_index.items():
    if index == integer:
      return word
  return None

def generate_desc(model, tokenizer, photo, max_length):
  in_text = 'startseq'
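  # iterate up to the maximum sequence length
  for i in range(max_length):
    # encode the text generated so far as a sequence of word indices
    sequence = tokenizer.texts_to_sequences([in_text])[0]
    # pad the sequence to the fixed input length
    sequence = pad_sequences([sequence], maxlen=max_length)
    # predict the next word
    yhat = model.predict([photo, sequence], verbose=0)
    # take the index with the highest score
    yhat = argmax(yhat)
    # map the index back to a word
    word = word_for_id(yhat, tokenizer)
    # stop if the index has no word in the vocabulary
    if word is None:
      break
    # append the word to form the input for the next prediction
    in_text += ' ' + word
    # stop once the end-of-sequence token is produced
    if word == 'endseq':
      break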
  return in_text

def evaluate_model(model, descriptions, photos, tokenizer, max_length):
  actual, predicted = list(), list()
  for key, desc_list in descriptions.items():
    yhat = generate_desc(model, tokenizer, photos[key], max_length)
    references = [d.split() for d in desc_list]
    actual.append(references)
    predicted.append(yhat.split())
  # compute BLEU scores
  print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
  print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
  print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
  print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))


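# Note: load_set, load_clean_descriptions, load_photo_features, tokenizer and
# max_length are not defined in this section; they are assumed to come from
# the earlier parts of this series.

# load the test set
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))

# descriptions for the test set
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))

# photo feature vectors for the test set
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))

# load the model
filename = 'model-ep001-loss3.112-val_loss3.153.h5' # replace with your own model file
model = load_model(filename)

# evaluate the model
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)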

Closing Remarks

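Thank you for taking the time to read this part of the tutorial, and stay tuned for the next one! I hope you found it useful. Suggestions for improvement are very welcome in the comments, as are corrections of any errors or unclear wording.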

For more on the latest advances in natural language processing, technical deep dives, and tutorials, follow the WeChat official account “语言智能技术笔记簿”.

Original link: https://blog.csdn.net/jlqCloud/article/details/105261174