Abstract
BERT is a bidirectional encoder built from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. (It performs well on many tasks.)
Introduction
Two strategies for applying pre-trained language representations to downstream tasks (a usage sketch contrasting them follows below):
Fine-tuning-based: the Generative Pre-trained Transformer (GPT), limited by its unidirectional (left-to-right) architecture
Feature-based: ELMo
BERT is pre-trained with two tasks: a cloze task (masked language modeling) and next sentence prediction
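The contrast between the two strategies can be made concrete. The sketch below is not from the paper; it assumes PyTorch, the Hugging Face transformers library, and the bert-base-uncased checkpoint, and uses BERT either as a frozen feature extractor or as a model whose parameters are all updated on downstream labeled data.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT learns bidirectional context.", return_tensors="pt")

# Feature-based (ELMo-style): freeze BERT and use its hidden states as
# fixed input features for a separate downstream model.
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)

# Fine-tuning-based (GPT/BERT-style): add a small task-specific head and
# update all parameters, BERT's included, on labeled downstream data.
classifier = torch.nn.Linear(model.config.hidden_size, 2)
logits = classifier(model(**inputs).last_hidden_state[:, 0])  # [CLS] vector
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([1]))
loss.backward()  # gradients reach both the head and the BERT encoder
```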
Related Work

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).
BERT (model)
There are two steps in our framework: pre-training (trained on unlabeled data) and fine-tuning (trained on labeled data; each downstream task gets its own separately fine-tuned model).
A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.
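As a sketch of this point (assuming the Hugging Face transformers library; class and checkpoint names are that library's, not the paper's), the same pre-trained checkpoint can initialize models for quite different downstream tasks, with only a thin, randomly initialized output layer differing between them:

```python
from transformers import (
    BertForSequenceClassification,  # [CLS] vector -> sentence-level label
    BertForTokenClassification,     # per-token labels (e.g. NER tagging)
)

# Both models load the identical pre-trained encoder weights; only the small
# task-specific output layer is new, and during fine-tuning all parameters
# (encoder + head) are updated.
sentence_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
token_model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=9)
```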
Two model sizes: BERT-Base and BERT-Large.
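For reference, a sketch of the two published configurations, using Hugging Face's BertConfig as a stand-in for the paper's hyperparameter description: BERT-Base has 12 layers, hidden size 768, and 12 attention heads (~110M parameters); BERT-Large has 24 layers, hidden size 1024, and 16 heads (~340M parameters).

```python
from transformers import BertConfig

# BERT-Base: L=12, H=768, A=12 (~110M parameters)
base_config = BertConfig(num_hidden_layers=12, hidden_size=768,
                         num_attention_heads=12, intermediate_size=3072)

# BERT-Large: L=24, H=1024, A=16 (~340M parameters)
large_config = BertConfig(num_hidden_layers=24, hidden_size=1024,
                          num_attention_heads=16, intermediate_size=4096)
```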
For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.
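A minimal PyTorch sketch of this summation (the embedding tables here are freshly initialized and the token ids are illustrative, not the pre-trained bert-base-uncased weights):

```python
import torch
import torch.nn as nn

vocab_size, hidden, max_len = 30522, 768, 512   # BERT-Base sizes
token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)           # sentence A = 0, sentence B = 1
position_emb = nn.Embedding(max_len, hidden)    # learned absolute positions

input_ids = torch.tensor([[101, 7592, 2088, 102]])         # [CLS] hello world [SEP]
segment_ids = torch.zeros_like(input_ids)                   # single-sentence input
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

# The input representation is the element-wise sum of the three embeddings
# (the real model additionally applies LayerNorm and dropout afterwards).
embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(embeddings.shape)  # torch.Size([1, 4, 768])
```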

Task 1, masked language modeling (MLM): the elaborate masking scheme is designed to mitigate the mismatch between pre-training and fine-tuning caused by the [MASK] token, which never appears during fine-tuning.
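A sketch of that masking scheme as described in the paper: 15% of tokens are selected for prediction; of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged. The ids below assume the bert-base-uncased vocabulary.

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30522  # [MASK] id and vocab size in bert-base-uncased

def mask_tokens(input_ids, mask_prob=0.15):
    labels = [-100] * len(input_ids)          # -100 = ignore index for the MLM loss
    corrupted = list(input_ids)
    for i, tok in enumerate(input_ids):
        if random.random() < mask_prob:
            labels[i] = tok                   # predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK_ID                        # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.randrange(VOCAB_SIZE)   # 10%: random token
            # else: 10% keep the original token unchanged
    return corrupted, labels

corrupted, labels = mask_tokens([7592, 2088, 2003, 2307])
```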
Task 2, next sentence prediction (NSP): details omitted here.
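Though skipped in these notes, the pair construction for NSP is simple: for each sentence A, sentence B is its true successor 50% of the time (label IsNext) and a randomly drawn sentence otherwise (label NotNext). A toy sketch, with a placeholder corpus:

```python
import random

corpus = ["the man went to the store .",
          "he bought a gallon of milk .",
          "penguins are flightless birds ."]

def make_nsp_example(idx):
    sentence_a = corpus[idx]
    if random.random() < 0.5 and idx + 1 < len(corpus):
        return sentence_a, corpus[idx + 1], 1    # IsNext
    return sentence_a, random.choice(corpus), 0  # NotNext (may coincide by chance in a toy corpus)

pair = make_nsp_example(0)
```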
Experiments
Omitted.