PLM papers

Large-scale pre-trained language models (PLMs) such as BERT and GPT have achieved great success and become a milestone in NLP.

This repo collects representative PLM papers from recent years, selected by citation count.

Survey

  1. "Pre-trained models for natural language processing: A survey". Science China Technological Sciences(2020) [PDF]
  2. "Which *BERT? A Survey Organizing Contextualized Encoders". EMNLP(2020) [PDF]
  3. "A Primer in BERTology: What We Know About How BERT Works". TACL(2020) [PDF]
  4. "From static to dynamic word representations: a survey". International Journal of Machine Learning and Cybernetics(2020) [PDF]
  5. "Overview of the Transformer-based Models for NLP Tasks". FedCSIS(2020) [PDF]
  6. "A Survey on Contextual Embeddings". arXiv(2020) [PDF]
  7. "The NLP Cookbook: Modern Recipes for Transformer Based Deep Learning Architectures". IEEE Access(2021) [PDF]
  8. "Pre-Trained Models: Past, Present and Future". arXiv(2021) [PDF]
  9. "A Survey of Transformers". arXiv(2021) [PDF]

Benchmark

  1. XNLI: "XNLI: Evaluating Cross-lingual Sentence Representations". EMNLP(2018) [PDF] [Dataset]
  2. GLUE: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". ICLR(2019) [Homepage]
  3. SuperGLUE: "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems". NeurIPS(2019) [Homepage]
  4. CLUE: "CLUE: A Chinese Language Understanding Evaluation Benchmark". COLING(2020) [Homepage]
  5. XTREME: "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization". ICML(2020) [Homepage]
  6. XGLUE: "XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation". EMNLP(2020) [Homepage]
  7. DialoGLUE: "DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue". arXiv(2020) [Homepage]

PLM Design

General

  1. GPT: "Improving Language Understanding by Generative Pre-Training". OpenAI(2018) [Project]
  2. GPT-2: "Language Models are Unsupervised Multitask Learners". OpenAI(2019) [Project]
  3. BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL(2019) [PDF] [Code]
  4. XLNet: "XLNet: Generalized Autoregressive Pretraining for Language Understanding". NeurIPS(2019) [PDF] [Code]
  5. SBERT: "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". EMNLP(2019) [PDF] [Code]
  6. UniLM: "Unified Language Model Pre-training for Natural Language Understanding and Generation". NeurIPS(2019) [PDF] [Code]
  7. MASS: "MASS: Masked Sequence to Sequence Pre-training for Language Generation". ICML(2019) [PDF] [Code]
  8. Chinese-BERT-wwm: "Pre-Training with Whole Word Masking for Chinese BERT". arXiv(2019) [PDF] [Code]
  9. "Cloze-driven Pretraining of Self-attention Networks". EMNLP(2019) [PDF]
  10. "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model". Workshop on Methods for Optimizing and Evaluating Neural Language Generation(2019) [PDF] [Code]
  11. GPT-3: "Language Models are Few-Shot Learners". arXiv(2020) [PDF] [Code]
  12. T5: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". JMLR(2020) [PDF] [Code]
  13. BART: "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension". ACL(2020) [PDF] [Code]
  14. Poly-encoders: "Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring". ICLR(2020) [PDF]
  15. SpanBERT: "SpanBERT: Improving Pre-training by Representing and Predicting Spans". TACL(2020) [PDF] [Code]
  16. ERNIE 2.0: "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding". AAAI(2020) [PDF] [Code]
  17. SemBERT: "Semantics-Aware BERT for Language Understanding". AAAI(2020) [PDF] [Code]
  18. "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks". TACL(2020) [PDF] [Code]
  19. ProphetNet: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training". EMNLP(2020) [PDF]
  20. UniLMv2: "UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training". ICML(2020) [PDF] [Code]
  21. MacBERT: "Revisiting Pre-Trained Models for Chinese Natural Language Processing". EMNLP(2020) [PDF] [Code]
  22. MPNet: "MPNet: Masked and Permuted Pre-training for Language Understanding". arXiv(2020) [PDF] [Code]
  23. DeBERTa: "DeBERTa: Decoding-enhanced BERT with Disentangled Attention". ICLR(2021) [PDF] [Code]
  24. PALM: "PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation". EMNLP(2020) [PDF]

Knowledge

  1. ERNIE(Baidu): "ERNIE: Enhanced Representation through Knowledge Integration". arXiv(2019) [PDF] [Code]
  2. KnowBert: "Knowledge Enhanced Contextual Word Representations". EMNLP(2019) [PDF]
  3. ERNIE(Tsinghua): "ERNIE: Enhanced Language Representation with Informative Entities". ACL(2019) [PDF] [Code]
  4. COMET: "COMET: Commonsense Transformers for Automatic Knowledge Graph Construction". ACL(2019) [PDF] [Code]
  5. K-BERT: "K-BERT: Enabling Language Representation with Knowledge Graph". AAAI(2020) [PDF] [Code]
  6. WKLM: "Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model". ICLR(2020) [PDF]
  7. LUKE: "LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention". EMNLP(2020) [PDF] [Code]
  8. K-Adapter: "K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters". ICLR(2021) [PDF]
  9. KEPLER: "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation". TACL(2021) [PDF] [Code]

Multilingual

  1. XLM: "Cross-lingual Language Model Pretraining". arXiv(2019) [PDF] [Code]
  2. "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond". TACL(2019) [PDF] [Code]
  3. UDify: "75 Languages, 1 Model: Parsing Universal Dependencies Universally". EMNLP(2019) [PDF] [Code]
  4. Unicoder: "Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks". EMNLP(2019) [PDF]
  5. XLM-R: "Unsupervised Cross-lingual Representation Learning at Scale". ACL(2020) [PDF]
  6. "Multilingual Alignment of Contextual Word Representations". ICLR(2020) [PDF]
  7. mBART: "Multilingual Denoising Pre-training for Neural Machine Translation". TACL(2020) [PDF] [Code]
  8. mT5: "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer". NAACL(2021) [PDF] [Code]
  9. InfoXLM: "InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training". NAACL(2021) [PDF] [Code]

Multi-Modal

  1. ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks". NeurIPS(2019) [PDF]
  2. LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". EMNLP(2019) [PDF] [Code]
  3. VideoBERT: "VideoBERT: A Joint Model for Video and Language Representation Learning". ICCV(2019) [PDF]
  4. MulT: "Multimodal Transformer for Unaligned Multimodal Language Sequences". ACL(2019) [PDF] [Code]
  5. VisualBERT: "VisualBERT: A Simple and Performant Baseline for Vision and Language". arXiv(2019) [PDF]
  6. B2T2: "Fusion of Detected Objects in Text for Visual Question Answering". EMNLP(2019) [PDF] [Code]
  7. VL-BERT: "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". ICLR(2020) [PDF] [Code]
  8. Unicoder-VL: "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". AAAI(2020) [PDF]
  9. VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA". AAAI(2020) [PDF] [Code]
  10. UNITER: "UNITER: UNiversal Image-TExt Representation Learning". ECCV(2020) [PDF] [Code]
  11. Oscar: "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks". ECCV(2020) [PDF] [Code]
  12. "12-in-1: Multi-Task Vision and Language Representation Learning". CVPR(2020) [PDF] [Code]
  13. ActBERT: "ActBERT: Learning Global-Local Video-Text Representations". CVPR(2020) [PDF]
  14. VLN: "Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks". CVPR(2020) [PDF]
  15. VILLA: "Large-Scale Adversarial Training for Vision-and-Language Representation Learning". arXiv(2020) [PDF] [Code]
  16. ImageBERT: "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data". arXiv(2020) [PDF]
  17. ALIGN: "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". ICML(2021) [PDF]
  18. ClipBERT: "Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling". CVPR(2021) [PDF] [Code]
  19. DALL·E: "Zero-Shot Text-to-Image Generation". arXiv(2021) [PDF] [Code]
  20. CLIP: "Learning Transferable Visual Models From Natural Language Supervision". arXiv(2021) [PDF] [Code]

Information Retrieval

  1. ORQA: "Latent Retrieval for Weakly Supervised Open Domain Question Answering". ACL(2019) [PDF]
  2. REALM: "REALM: Retrieval-Augmented Language Model Pre-Training". arXiv(2020) [PDF]
  3. RAG: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". NeurIPS(2020) [PDF] [Code]
  4. DPR: "Dense Passage Retrieval for Open-Domain Question Answering". EMNLP(2020) [PDF] [Code]

PLM Analysis

Knowledge

  1. "What Does BERT Look at? An Analysis of BERT’s Attention". BlackboxNLP(2019) [PDF] [Code]
  2. "BERT Rediscovers the Classical NLP Pipeline". ACL(2019) [PDF]
  3. "How Multilingual is Multilingual BERT?". ACL(2019) [PDF]
  4. "A Structural Probe for Finding Syntax in Word Representations". NAACL(2019) [PDF] [Code]
  5. "Language Models as Knowledge Bases?". EMNLP(2019) [PDF] [Code]
  6. "What Does BERT Learn about the Structure of Language?". ACL(2019) [PDF] [Code]
  7. "Linguistic Knowledge and Transferability of Contextual Representations". NAACL(2019) [PDF]
  8. "Assessing BERT's Syntactic Abilities". arXiv(2019) [PDF] [Code]
  9. "Probing Neural Network Comprehension of Natural Language Arguments". ACL(2019) [PDF]
  10. "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings". EMNLP(2019) [PDF]
  11. "Visualizing and Measuring the Geometry of BERT". NeurIPS(2019) [PDF]
  12. "Designing and Interpreting Probes with Control Tasks". EMNLP(2019) [PDF]
  13. "Open Sesame: Getting inside BERT’s Linguistic Knowledge". BlackboxNLP(2019) [PDF] [Code]
  14. "What do you learn from context? Probing for sentence structure in contextualized word representations". ICLR(2019) [PDF] [Code]
  15. "Commonsense Knowledge Mining from Pretrained Models". EMNLP(2019) [PDF]
  16. "Do NLP Models Know Numbers? Probing Numeracy in Embeddings". EMNLP(2019) [PDF]
  17. "On the Cross-lingual Transferability of Monolingual Representations". ACL(2020) [PDF]
  18. "Cross-Lingual Ability of Multilingual BERT: An Empirical Study". ICLR(2020) [PDF] [Code]
  19. "What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models". TACL(2020) [PDF] [Code]
  20. "How Much Knowledge Can You Pack Into the Parameters of a Language Model?". EMNLP(2020) [PDF] [Code]
  21. "How Can We Know What Language Models Know?". TACL(2020) [PDF] [Code]
  22. "oLMpics - On What Language Model Pre-training Captures". TACL(2020) [PDF] [Code]
  23. "Information-Theoretic Probing with Minimum Description Length". EMNLP(2020) [PDF] [Code]
  24. "Inducing Relational Knowledge from BERT". AAAI(2020) [PDF]
  25. AutoPrompt: "AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts". EMNLP(2020) [PDF] [Code]
  26. "Emergent linguistic structure in artificial neural networks trained by self-supervision". PNAS(2020) [PDF]
  27. "Evaluating Commonsense in Pre-Trained Language Models". AAAI(2020) [PDF] [Code]

Robustness

  1. "Universal Adversarial Triggers for Attacking and Analyzing NLP". EMNLP(2019) [PDF] [Code]
  2. "Pretrained Transformers Improve Out-of-Distribution Robustness". ACL(2020) [PDF] [Code]
  3. BERT-ATTACK: "BERT-ATTACK: Adversarial Attack Against BERT Using BERT". EMNLP(2020) [PDF] [Code]
  4. "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment". AAAI(2020) [PDF] [Code]

Sparsity

  1. "Are Sixteen Heads Really Better than One?". NeurIPS(2019) [PDF] [Code]
  2. "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned". ACL(2019) [PDF] [Code]
  3. "Revealing the Dark Secrets of BERT". EMNLP(2019) [PDF]
  4. "The Lottery Ticket Hypothesis for Pre-trained BERT Networks". NeurIPS(2020) [PDF] [Code]
  5. "When BERT Plays the Lottery, All Tickets Are Winning". EMNLP(2020) [PDF] [Code]

Others

  1. "Scaling Laws for Neural Language Models". arXiv(2020) [PDF]
  2. "Extracting Training Data from Large Language Models". arXiv(2020) [PDF] [Code]

Efficient PLM

Training

  1. RoBERTa: "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv(2019) [PDF] [Code]
  2. "Efficient Training of BERT by Progressively Stacking". ICML(2019) [PDF] [Code]
  3. Megatron-LM: "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism". arXiv(2019) [PDF] [Code]
  4. ELECTRA: "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators". ICLR(2020) [PDF] [Code]
  5. "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes". ICLR(2020) [PDF] [Code]
  6. GShard: "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". arXiv(2020) [PDF]
  7. Admin: "Understanding the Difficulty of Training Transformers". EMNLP(2020) [PDF] [Code]
  8. ZeRO: "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models". SC(2020) [PDF] [Code]
  9. Switch Transformers: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity". arXiv(2021) [PDF] [Code]

Inference and Compression

  1. DistilBERT: "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". arXiv(2019) [PDF] [Code]
  2. PKD: "Patient Knowledge Distillation for BERT Model Compression". EMNLP(2019) [PDF] [Code]
  3. "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks". arXiv(2019) [PDF]
  4. Q8BERT: "Q8BERT: Quantized 8Bit BERT". Workshop on Energy Efficient Machine Learning and Cognitive Computing, NeurIPS(2019) [PDF]
  5. ALBERT: "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". ICLR(2020) [PDF] [Code]
  6. TinyBERT: "TinyBERT: Distilling BERT for Natural Language Understanding". EMNLP(2020) [PDF] [Code]
  7. LayerDrop: "Reducing Transformer Depth on Demand with Structured Dropout". ICLR(2020) [PDF] [Code]
  8. Q-BERT: "Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT". AAAI(2020) [PDF]
  9. MobileBERT: "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices". ACL(2020) [PDF] [Code]
  10. "Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning". RepL4NLP(2020) [PDF] [Code]
  11. MiniLM: "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers". arXiv(2020) [PDF] [Code]
  12. FastBERT: "FastBERT: a Self-distilling BERT with Adaptive Inference Time". ACL(2020) [PDF] [Code]
  13. DeeBERT: "DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference". ACL(2020) [PDF] [Code]

PLM Adaptation

Two-Stage

  1. "Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks". arXiv(2018) [PDF] [Code]
  2. "How to Fine-Tune BERT for Text Classification?". CCL(2019) [PDF]
  3. "Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks". ACL(2020) [PDF] [Code]
  4. "Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?". ACL(2020) [PDF]

Multi-Task

  1. MT-DNN: "Multi-Task Deep Neural Networks for Natural Language Understanding". ACL(2019) [PDF] [Code]
  2. "BAM! Born-Again Multi-Task Networks for Natural Language Understanding". ACL(2019) [PDF] [Code]
  3. "Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding". arXiv(2019) [PDF] [Code]

Adapter

  1. "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning". ICML(2019) [PDF] [Code]
  2. Adapter: "Parameter-Efficient Transfer Learning for NLP". ICML(2019) [PDF] [Code]

Prompt

  1. PET: "Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference". EACL(2021) [PDF] [Code]
  2. "It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners". NAACL(2021) [PDF] [Code]
  3. "Prefix-Tuning: Optimizing Continuous Prompts for Generation". arXiv(2021) [PDF]
  4. LM-BFF: "Making Pre-trained Language Models Better Few-shot Learners". ACL(2021) [PDF] [Code]
  5. "What Makes Good In-Context Examples for GPT-3?". arXiv(2021) [PDF] [Code]
  6. "The Power of Scale for Parameter-Efficient Prompt Tuning". arXiv(2021) [PDF]

Others

  1. "To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks". RepL4NLP(2019) [PDF]
  2. "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models". NAACL(2019) [PDF] [Code]
  3. "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping". arXiv(2020) [PDF]
  4. SMART: "SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization". ACL(2020) [PDF] [Code]
  5. "Revisiting Few-sample BERT Fine-tuning". ICLR(2021) [PDF]


