- Information-Theoretic Probing with Minimum Description Length
- Designing and Interpreting Probes with Control Tasks
- Analysis Methods in Neural Language Processing: A Survey
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
- Evaluating NLP Models via Contrast Sets
- BERTScore: Evaluating Text Generation with BERT
- BLEURT: Learning Robust Metrics for Text Generation
- MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
- Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
- Certified Robustness to Adversarial Word Substitutions
- Universal Adversarial Triggers for Attacking and Analyzing NLP
- Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples
- The Unstoppable Rise of Computational Linguistics in Deep Learning
- How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
- A Call for More Rigor in Unsupervised Cross-lingual Learning
- Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
- Language (Re)modelling: Towards Embodied Language Understanding
- To Test Machine Comprehension, Start by Defining Comprehension