
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT stands for: Bidirectional Encoder Representations from Transformers.
1. Transformer to BERT

1.1 ELMo
ELMo stands for: Embeddings from Language Models.
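As a rough sketch of the idea (a simplification, not the original ELMo implementation), contextual embeddings can be produced by running a bidirectional LSTM language model over the sentence and taking each token's concatenated forward/backward hidden states; all sizes below are assumed for illustration:

```python
# Minimal sketch of ELMo-style contextual embeddings (not the original
# implementation): run a bidirectional LSTM over the token embeddings and
# concatenate the forward/backward hidden states per token.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 128, 256   # assumed sizes

embed = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                 bidirectional=True, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 7))    # one sentence of 7 tokens
states, _ = bilstm(embed(token_ids))                # (1, 7, 2 * hidden_dim)
# Each token's vector now depends on the whole sentence, unlike static
# word2vec/GloVe embeddings.
```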


1.2 Transformer
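Since this section centers on the Transformer encoder, here is a minimal sketch of its core operation, scaled dot-product self-attention; the tensor sizes are assumptions for illustration:

```python
# Scaled dot-product self-attention, the core operation of a Transformer
# encoder layer: softmax(Q K^T / sqrt(d_k)) V.
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)             # attention over all tokens
    return weights @ v                              # (batch, seq, d_k)

x = torch.randn(1, 7, 64)         # 7 tokens, 64-dim vectors (assumed)
out = self_attention(x, x, x)     # every output vector mixes the whole sentence
```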




1.3 BERT



E_A indicates whether a token belongs to Sentence A or Sentence B (the segment embedding).
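To make the input representation concrete, here is a minimal sketch (sizes follow BERT-base; the example token ids are placeholders) of how BERT sums token, segment, and position embeddings for each token:

```python
# BERT input representation (sketch): token + segment + position embeddings.
# E_A / E_B correspond to segment ids 0 / 1.
import torch
import torch.nn as nn

vocab_size, hidden, max_len = 30522, 768, 512    # BERT-base sizes

tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(2, hidden)        # 0 = Sentence A, 1 = Sentence B
pos_emb = nn.Embedding(max_len, hidden)

token_ids   = torch.tensor([[101, 7592, 102, 2088, 102]])   # [CLS] .. [SEP] .. [SEP] (example ids)
segment_ids = torch.tensor([[0,   0,    0,   1,    1]])     # A, A, A, B, B
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

inputs = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)   # (1, 5, 768)
```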
1.4 Pre-training BERT
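BERT is pre-trained with masked language modeling (MLM) and next sentence prediction (NSP). The sketch below illustrates the paper's MLM corruption rule (15% of tokens are selected; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged); the id constants assume the bert-base-uncased vocabulary:

```python
# Sketch of BERT's masked-LM corruption (the paper's 80/10/10 rule).
import random

MASK_ID = 103          # [MASK] in the bert-base-uncased vocab
VOCAB_SIZE = 30522

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)   # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                    # the model must predict the original token here
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID            # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```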








Reading comprehension is a harder variant of QA.
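For extractive reading comprehension (SQuAD-style), BERT is fine-tuned with a span head that scores every token as a possible answer start and end; the sketch below assumes a sequence length of 384 and the BERT-base hidden size:

```python
# Sketch of the SQuAD-style span head used when fine-tuning BERT for
# extractive QA: score each token as a possible answer start / end.
import torch
import torch.nn as nn

hidden = 768
span_head = nn.Linear(hidden, 2)                 # one logit for start, one for end

sequence_output = torch.randn(1, 384, hidden)    # BERT's per-token outputs (assumed length 384)
start_logits, end_logits = span_head(sequence_output).split(1, dim=-1)
start, end = start_logits.argmax(dim=1), end_logits.argmax(dim=1)   # predicted answer span
```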
3. Recap

Each word's representation is built from the information of the entire sentence (every token attends to all the others).
BERT is pre-trained for roughly 40 epochs, while fine-tuning only takes 2~4 epochs.
Every token gets a 768-dimensional vector at each of the 12 Transformer layers (BERT-base), i.e. 12 × 768 values per token; see the sketch below.
BERT's main drawback is that it is too large.
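To see where the "12 × 768 per token" figure comes from, here is a small sketch using the Hugging Face transformers library (the model name and sentence are just examples): with output_hidden_states=True the model returns the embedding output plus one 768-dimensional vector per token from each of the 12 layers.

```python
# Sketch: inspect the per-token representations of BERT-base
# (12 Transformer layers, hidden size 768) with Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

batch = tokenizer("BERT gives every token a contextual vector.", return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# hidden_states = (embedding output,) + 12 Transformer layers
print(len(out.hidden_states) - 1)      # 12
print(out.hidden_states[-1].shape)     # (1, seq_len, 768)
```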