ACL 2020

Selections from ACL 2020

This contains a selection** of papers from ACL 2020. For a more complete list of papers organized by topic, see here. The full list of papers (in both the main conference and in workshops) can be found here. The full proceedings, organized by track, are in this giant pdf.

I also enjoyed reading Vered Shwartz's highlights of ACL 2020, as well as Yoav Goldberg's thoughts on the format of ACL 2020 (in particular that the virtual format makes free-form conversations, socializing, meeting people, and generating ideas very difficult).

** NOTE: Some areas (Dialogue, Question Answering, Machine Translation, Language Grounding, Multilingual/cross-lingual Tasks, and Syntax) are not well-represented in this list.

General Remarks
Tutorials
Birds of a feather
Workshops
Keynotes
System Demonstrations
Topics
NLG papers
NLU papers
Applications
Other

General Remarks

Four new tracks this year:

Ethics and NLP (44)
Interpretation and Analysis of Models for NLP (95)
Theory and Formalism in NLP (Linguistic and Mathematical) (12)
Theme: Taking Stock of Where We've Been and Where We're Going (65)

The number of submissions was more than double that of ACL 2018.

Popular topics:

ACL 2020 paper keyword statistics: https://github.com/roomylee/ACL-2020-Papers (top keywords: generation, translation, dialogue, graph, extraction).
The six most popular tracks (by number of papers accepted) according to https://acl2020.org/blog/general-conference-statistics/:
- Machine Translation
- Machine Learning for NLP
- Dialogue and Interactive Technologies
- Information Extraction
- Generation
- NLP Applications

Interestingly, these don't correspond to the number of views per paper. Yoav Goldberg made a box-plot for the number of views each video received:

Box-plot for number of video views in each track. Image from https://twitter.com/yoavgo/status/1282459579339681792

Although there were far fewer Theme, Interpretability, Cognitive Modeling and Ethics papers, these had the highest number of views. On the other hand, Information Extraction, NLP Applications, and Dialogue and Interactive Systems papers had very few views.

Sadly, Discourse and Pragmatics had both few submissions and few views (despite Bonnie Webber winning the lifetime achievement award).

Tutorials

Birds of a feather

Unfortunately many of these sessions were scheduled at the same time, so I missed many interesting sessions: Lexical Semantics, Summarization, Information Retrieval & Text Mining, Interpretability, Language Grounding, and NLP Applications.

There were some interesting discussions in Discourse and Pragmatics, Information Extraction, and Generation (with over a hundred people in the Generation session!). Unfortunately the Zoom format with large numbers of people is not conducive to fluid conversations between participants; these sessions were mostly Q&A with the host. It was good to hear some established researchers express a desire to move beyond the current "pre-training on huge datasets and fine-tuning" kind of NLP.

Workshops

Full list of workshops: https://acl2020.org/program/workshops/

W5: FEVER 3 (papers)
- Maintaining Quality in FEVER Annotation (paper)
"Annotations in the FEVER fact-checking dataset aren't riddled with superficial shortcuts like those Niven&Kao found elsewhere." TO FIND OUT: what about https://arxiv.org/pdf/1908.05267.pdf ?
W7: The Second Workshop on Figurative Language Processing (papers)
W8: NUSE (Workshop on Narrative Understanding, Storylines, and Events) (papers)
W10: 5th Workshop on Representation Learning for NLP (papers)
W20: NLPCovid
WNGT (The 4th Workshop on Neural Generation and Translation) (papers)

And two multi-modal workshops:

Keynotes

Kathleen R. McKeown, Rewriting the Past: Assessing the Field through the Lens of Language Generation

Josh Tenenbaum, Cognitive and computational building blocks for more human-like language in machines

System Demonstrations

Topics

🔝 Methodological

💥 Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (best paper) (paper) (code)

➖ Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses (paper) (code)

🔝 Theme Papers

💥 (Re)construing Meaning in NLP (paper)

See also: https://arxiv.org/abs/2003.11922 , https://arxiv.org/pdf/2006.02419.pdf

💥 Language (Re)modelling: Towards Embodied Language Understanding (paper)

See also: Ronen Tamari mentioned Bayesian program learning (DreamCoder) as a related line of work. Also see Extraction of causal structure from procedural text for discourse representations, Neural Execution of Graph Algorithms, Object Files and Schemata: Factorizing Declarative and Procedural Knowledge in Dynamical Systems, Embodiment and language comprehension: reframing the discussion and Recent Advances in Neural Program Synthesis.

💥 Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data (best theme paper) (paper)

See also this discussion: To Dissect an Octopus: Making Sense of the Form/Meaning Debate

See also: Glenberg and Robertson, Symbol Grounding and Meaning: A Comparison of High-Dimensional and Embodied Theories of Meaning, 2000, Changing Notions of Linguistic Competence in the History of Formal Semantics and Distribution is not enough: going Firther

💥 How Can We Accelerate Progress Towards Human-like Linguistic Generalization? (honorable mention theme paper) (paper)

➖ The Unstoppable Rise of Computational Linguistics in Deep Learning (paper)

➖ To Test Machine Comprehension, Start by Defining Comprehension (paper) (data will be available later)

See also: McNamara and Magliano, Toward a Comprehensive Model of Comprehension, 2009

➖ Automated Evaluation of Writing – 50 Years and Counting (paper)

🔝 New Datasets

➖ Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts (paper) (data)

➖ S2ORC: The Semantic Scholar Open Research Corpus (paper) (data, model)

➖ Cross-modal Coherence Modeling for Caption Generation (paper) (data)

Clue: a new dataset of 10,000 image-caption pairs, annotated with coherence relations (e.g., visible, subjective, action, story).

➖ ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations (paper) (code)

🔝 Integrating text with other kinds of data

See also: Reinald Kim Amplayo, "Rethinking Attribute Representation and Injection for Sentiment Classification", EMNLP 2019

➖ Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information (paper) (code)

💥 Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders (paper) (code)

➖ Injecting Numerical Reasoning Skills into Language Models (paper) (code)

➖ GCAN: Graph-aware Co-Attention Networks for Explainable Fake News Detection on Social Media (paper) (code)

➖ Integrating Semantic and Structural Information with Graph Convolutional Network for Controversy Detection (paper) (data)

➖ Learning to Tag OOV Tokens by Integrating Contextual Representation and Background Knowledge (paper)

🔝 Semantics

💥 What are the Goals of Distributional Semantics? (paper)

💥 Autoencoding Pixies: Amortised Variational Inference with Graph Convolutions for Functional Distributional Semantics (paper) (code)

➖ A Frame-based Sentence Representation for Machine Reading Comprehension (paper)

🔝 Discourse

➖ Discourse-Aware Neural Extractive Text Summarization (paper) (code)

🔝 Explainability

I was glad to see that interpretability/explainability was a popular topic this year. Here is a summary of interpretability and model analysis methods at ACL 2020 by Carolin Lawrence.

See also: "Evaluating Explanation Methods for Neural Machine Translation" (listed under Machine Translation).

Faithfulness

💥 💥 Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? (paper)

A survey of the literature on interpretability, with a focus on faithfulness (whether an interpretation accurately represents the reasoning process behind a model's decision). Many papers conflate faithfulness and plausibility. Some guidelines are offered for future work on evaluating interpretability:

Be explicit about which aspect of interpretability is being evaluated.

Evaluation of faithfulness should not involve human judgment. We cannot judge whether an interpretation is faithful or not, because if we could, there would be no need for interpretability. Rather, human judgment measures plausibility. Similarly, end-task performance, while potentially valuable, is not an evaluation of faithfulness.

Even so-called "inherently interpretable" systems should be tested for faithfulness. For example, some studies claimed attention mechanisms were inherently interpretable, while more recent work (e.g., "Attention is not explanation") has shown one must be cautious about such claims.

Evaluation of faithfulness should not focus only on correct model predictions.

Finally, the authors argue that researchers should move beyond "faithfulness" as a binary notion (since one can usually find counter-examples to disprove faithfulness) to a more nuanced, graded notion of faithfulness (a model should be sufficiently faithful).

💥 💥 Learning to Faithfully Rationalize by Construction (paper) (code)

A rationale is a snippet of text "explaining" a model's decision. We would like faithful rationales, but this can be difficult (e.g., contextualized representations mix signals up, and we don't know which tokens affect the output). Discrete Rationalization (Lei et al., EMNLP 2016) uses an extractor trained to select a rationale, followed by a separate classifier, which only uses the rationale to make a prediction. Unfortunately, to train the extractor one needs human rationales, which are hard to obtain.

Solution: FRESH. First train a black-box model (e.g., BERT). Then apply a saliency scorer (many options are available, and it does not need to be faithful), and use thresholding (with other heuristics) to select a text snippet from the saliency scores. This text snippet is then used to train a fresh model to make a final prediction. The rationale is thus faithful by construction. However, there is still no insight into why a given rationale was selected, or into how the rationale was used.
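The extraction step can be sketched as follows. This is a minimal illustration, not the authors' code: the saliency scores are made up, and `top_k_rationale` and `contiguous_rationale` are hypothetical helper names. It shows two simple thresholding heuristics: keep the k highest-scoring tokens, or keep the length-k span with the highest total score.

```python
from typing import List


def top_k_rationale(tokens: List[str], scores: List[float], k: int) -> List[str]:
    """Keep the k tokens with the highest saliency, in original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


def contiguous_rationale(tokens: List[str], scores: List[float], k: int) -> List[str]:
    """Keep the length-k span whose saliency scores have the largest sum."""
    k = min(k, len(tokens))
    best_start = max(range(len(tokens) - k + 1),
                     key=lambda s: sum(scores[s:s + k]))
    return tokens[best_start:best_start + k]


tokens = "the plot was dull but the acting was superb".split()
# Hypothetical saliency scores from any scorer (attention, gradients, ...).
scores = [0.01, 0.10, 0.02, 0.30, 0.05, 0.01, 0.15, 0.03, 0.33]

print(top_k_rationale(tokens, scores, 3))       # discontiguous rationale
print(contiguous_rationale(tokens, scores, 3))  # single-span rationale
```

Only the selected tokens would then be passed to the second, freshly trained classifier, which is what makes the rationale faithful by construction.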

See also: Jacovi and Goldberg, Aligning Faithful Interpretations with their Social Attribution

💥 Obtaining Faithful Interpretations from Compositional Neural Networks (paper) (code)

A systematic evaluation of the intermediate outputs of neural module networks on the NLVR2 and DROP datasets, to see whether modules are faithful (i.e., that the modules do what we would expect). Results: no, they are not faithful ("the intermediate outputs differ from the expected output"). Several methods are explored which improve faithfulness.

➖ Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport (paper) (code)

➖ NILE : Natural Language Inference with Faithful Natural Language Explanations (paper) (code)

Evaluation

💥 Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? (paper) (code)

Carries out human subject tests to evaluate the "simulatability" of explainability methods ("a model is simulatable when a person can predict its behavior on new inputs"), on both text and tabular data. Users try to predict model output, with or without a corresponding model explanation.
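The quantity of interest in this kind of study can be sketched as the change in how often users correctly guess the model's output once they are shown explanations. The function names and the guesses below are invented for illustration; the paper's actual protocol and statistics are more involved.

```python
from typing import List


def accuracy(user_guesses: List[int], model_outputs: List[int]) -> float:
    """Fraction of inputs where the user guessed the model's output.

    Note the target is the model's prediction, not the gold label.
    """
    correct = sum(g == m for g, m in zip(user_guesses, model_outputs))
    return correct / len(model_outputs)


def explanation_effect(guesses_without: List[int], guesses_with: List[int],
                       model_outputs: List[int]) -> float:
    """Change in simulation accuracy attributable to seeing explanations."""
    return (accuracy(guesses_with, model_outputs)
            - accuracy(guesses_without, model_outputs))


model_outputs = [1, 0, 1, 1]    # what the model actually predicted
guesses_without = [1, 1, 0, 1]  # hypothetical user guesses, no explanation
guesses_with = [1, 0, 0, 1]     # hypothetical user guesses, with explanation

print(explanation_effect(guesses_without, guesses_with, model_outputs))
```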

➖ ERASER: A Benchmark to Evaluate Rationalized NLP Models (paper) (code)

A dataset of rationales for various NLP tasks, together with metrics that capture faithfulness of the rationales (comprehensiveness: "were all features needed to make a prediction?", and sufficiency: "are the rationales sufficient to make a given prediction?")
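The two metrics can be sketched as differences in the model's confidence for the predicted label. This is a minimal sketch under my own assumptions: `toy_predict` is a hypothetical stand-in classifier, and rationales are given as sets of token indices.

```python
from typing import Callable, List, Set

Predict = Callable[[List[str]], List[float]]


def comprehensiveness(predict: Predict, tokens: List[str],
                      rationale: Set[int], label: int) -> float:
    """Drop in confidence when the rationale tokens are removed."""
    rest = [t for i, t in enumerate(tokens) if i not in rationale]
    return predict(tokens)[label] - predict(rest)[label]


def sufficiency(predict: Predict, tokens: List[str],
                rationale: Set[int], label: int) -> float:
    """Drop in confidence when only the rationale tokens are kept."""
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return predict(tokens)[label] - predict(only)[label]


def toy_predict(tokens: List[str]) -> List[float]:
    """Hypothetical stand-in classifier: returns [p_negative, p_positive]."""
    p_pos = 0.5 + 0.4 * ("superb" in tokens) - 0.2 * ("dull" in tokens)
    return [1.0 - p_pos, p_pos]


tokens = "the acting was superb".split()
rationale = {3}  # index of "superb"
print(comprehensiveness(toy_predict, tokens, rationale, label=1))
print(sufficiency(toy_predict, tokens, rationale, label=1))
```

A large comprehensiveness score together with a sufficiency score near zero suggests the model really relied on the rationale.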

Attention

See also these earlier papers on whether attention can be used for explainability: Jain and Wallace, "Attention is not Explanation" (2019), Serrano and Smith, "Is Attention Interpretable" (2019), and Wiegreffe & Pinter, "Attention is not not Explanation" (2019).

💥 Towards Transparent and Explainable Attention Models (paper) (code)

This paper investigates why attention weights in LSTMs are often neither plausible nor faithful; one reason for this is the low diversity of hidden states. To improve interpretability, a diversity objective is added which increases the diversity of the hidden representations. Future work: try something similar for transformer-based models.
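To make "low diversity of hidden states" concrete, here is a minimal sketch that assumes diversity is measured as the mean cosine similarity between each hidden state and their average (the paper's conicity measure is along these lines). The vectors are toy values and the function names are mine; a diversity objective would add a term like this to the training loss so that it is pushed down.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def conicity(hidden_states: List[List[float]]) -> float:
    """Mean cosine similarity of each hidden state to the mean state.

    Values near 1 mean the states all point the same way (low diversity).
    Assumes the mean vector is non-zero.
    """
    n = len(hidden_states)
    mean = [sum(col) / n for col in zip(*hidden_states)]
    return sum(cosine(h, mean) for h in hidden_states) / n


similar = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]  # nearly parallel states
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # spread-out states

print(conicity(similar))
print(conicity(diverse))
```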

💥 Learning to Deceive with Attention-Based Explanations (paper) (code)

Presents a simple method for training models with deceptive attention masks, with very little reduction in accuracy. The method reduces the attention weights of certain tokens which are actually used to drive predictions. A human study shows that the manipulated attention-based explanations can trick people into thinking that gender-biased models are unbiased.
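As I understand the method, the manipulation comes from a penalty added to the task loss that grows with the attention mass placed on a chosen set of "impermissible" tokens. The sketch below assumes a penalty of the form -λ log(1 - mass); the function name, the `eps` clamp, and the example attention weights are mine.

```python
import math
from typing import List


def deception_penalty(attention: List[float], impermissible: List[bool],
                      lam: float = 1.0, eps: float = 1e-12) -> float:
    """Penalty that grows as attention mass on impermissible tokens grows."""
    mass = sum(a for a, bad in zip(attention, impermissible) if bad)
    # Clamp to avoid log(0) when all attention sits on impermissible tokens.
    return -lam * math.log(max(1.0 - mass, eps))


# Hypothetical attention over "she is a nurse"; only "she" is impermissible.
impermissible = [True, False, False, False]
before = [0.70, 0.10, 0.10, 0.10]  # attention that reveals reliance on "she"
after = [0.01, 0.33, 0.33, 0.33]   # attention after training with the penalty

print(deception_penalty(before, impermissible))
print(deception_penalty(after, impermissible))
```

Minimizing this term alongside the task loss pushes attention away from the impermissible tokens without stopping the model from using them through other pathways.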

➖ Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? (paper) (data)

Creates a new dataset recording which parts of a text humans focus on when classifying Yelp reviews. Various metrics are introduced to compare the human and machine attention weights. Results: (1) the "human-likeness" of the attention is very sensitive to attention type, with biRNNs being most similar, and (2) longer texts reduce the similarity, across models. I wonder if Hierarchical Attention Networks would do better here.

➖ Understanding Attention for Text Classification (paper) (code)

Looks at attention and polarity scores as training approaches a local minimum to understand under which conditions attention is more or less interpretable.

Influence Functions

If one has access to the training dataset, one can ask: which are the training examples that influenced the prediction the most?
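For reference, the standard influence-function estimate from Koh and Liang (2017), which this line of work builds on, scores a training example z against a test example z_test as shown below. The notation is mine: θ̂ denotes the trained parameters, L the loss, and n the number of training examples. In practice the inverse Hessian is approximated rather than computed exactly.

```latex
\mathcal{I}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
     H_{\hat\theta}^{-1}\,
     \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

The score approximates how the test loss would change if z were upweighted during training, so a strongly negative value marks a training example that helped the prediction.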

💥 Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions (paper) (code)

An empirical investigation of influence functions in NLP tasks, answering the following questions: (1) do they reliably explain transformer-based models? (2) are they consistent with insights from gradient-based saliency scores? In addition, the authors discuss the use of influence functions to analyze artifacts in the training data.

💥 Generating Fact Checking Explanations (paper)

➖ Generating Hierarchical Explanations on Text Classification via Feature Interaction Detection (paper) (code)

➖ Information-Theoretic Probing for Linguistic Structure (paper) (code)

➖ ExpBERT: Representation Engineering with Natural Language Explanations (paper) (code)

➖ Considering Likelihood in NLP Classification Explanations with Occlusion and Language Modeling (Student Research Workshop, SRW) (paper) (code)

Other

➖ Relation Extraction with Explanation (paper)

➖ Interpretable Operational Risk Classification with Semi-Supervised Variational Autoencoder (paper)

➖ Interpreting Twitter User Geolocation (paper)

➖ DTCA: Decision Tree-based Co-Attention Networks for Explainable Claim Verification (paper)

🔝 Learning with Less Data

Data Augmentation

💥 Good-Enough Compositional Data Augmentation (paper) (code)

➖ Syntactic Data Augmentation Increases Robustness to Inference Heuristics (paper) (code)

Are the limitations of BERT on natural language inference (NLI) due to limitations of the model, or the lack of enough (appropriate) data? The training set is augmented with "syntactically informative examples" (e.g., subject/object inversion), which improves BERT's performance on controlled examples.
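A toy version of subject/object inversion might look like this. It is only a sketch: the template, the `invert_example` name, and the "non-entailment" label string are my assumptions, and the paper works from parsed sentences in the training set rather than hand-written triples.

```python
from typing import Dict


def invert_example(subject: str, verb: str, obj: str) -> Dict[str, str]:
    """Build an NLI pair whose hypothesis swaps subject and object."""
    premise = f"The {subject} {verb} the {obj}."
    hypothesis = f"The {obj} {verb} the {subject}."
    # Swapping the arguments of a transitive verb does not preserve meaning,
    # so the pair is labelled as not entailed (assumed label name).
    return {"premise": premise, "hypothesis": hypothesis,
            "label": "non-entailment"}


example = invert_example("lawyer", "saw", "doctor")
print(example["premise"])
print(example["hypothesis"])
print(example["label"])
```

Examples like this have complete word overlap between premise and hypothesis, so a model relying on an overlap heuristic gets them wrong.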

➖ Incorporating External Knowledge through Pre-training for Natural Language to Code Generation (paper) (code)

Other

➖ Named Entity Recognition without Labelled Data: A Weak Supervision Approach (paper) (code)

➖ Improving Disfluency Detection by Self-Training a Self-Attentive Model (paper) (code)

[less data]: "We show that self-training — a semi-supervised technique for incorporating unlabeled data — sets a new state-of-the-art for the self-attentive parser on disfluency detection, demonstrating that self-training provides benefits orthogonal to the pre-trained contextualized word representations"

See also: (earlier paper by the same authors, which contains some additional analysis) Revisiting the poverty of the stimulus: hierarchical generalization without a hierarchical bias in recurrent neural networks

💥 A Systematic Assessment of Syntactic Generalization in Neural Language Models (paper) (code)

💥 Probing Linguistic Systematicity (paper) (code)

➖ Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models (paper) (data)

➖ How Furiously Can Colourless Green Ideas Sleep? Sentence Acceptability in Context (paper) (code)

NLG papers

🔝 Dialogue

➖ PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable (paper) (code)

➖ Guiding Variational Response Generator to Exploit Persona (paper)

🔝 Text Generation

Evaluation:

💥 BLEURT: Learning Robust Metrics for Text Generation (paper) (code)

Generation with constraints:

💥 Automatic Poetry Generation from Prosaic Text (paper) (code)

Poetry generation, in French and English, conditioned on a given "topic".

➖ Rigid Formats Controlled Text Generation (paper) (code)

Generation of English sonnets and classical Chinese poetry.

➖ Semantic Scaffolds for Pseudocode-to-Code Generation (paper) (code)

Beyond plain autoregressive text generation:

See also the papers under "Infilling" below.

➖ Distilling Knowledge Learned in BERT for Text Generation (paper) (code)

➖ A Study of Non-autoregressive Model for Sequence Generation (paper)

VAE

➖ Pre-train and Plug-in: Flexible Conditional Text Generation with Variational Auto-Encoders (paper) (code)

➖ Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs (paper) (code)

Generation via Retrieval and/or Editing:

See also: "Iterative Edit-Based Unsupervised Sentence Simplification" listed under Summarization and Simplification.

💥 Paraphrase Generation by Learning How to Edit from Samples (paper)

➖ Unsupervised Paraphrasing by Simulated Annealing (paper) (code)

➖ Simple and Effective Retrieve-Edit-Rerank Text Generation (paper)

Other

💥 Explicit Semantic Decomposition for Definition Generation (paper)

💥 Fact-based Text Editing (paper) (code)

➖ Neural Syntactic Preordering for Controlled Paraphrase Generation (paper) (code)

Detecting generated text:

➖ Reverse Engineering Configurations of Neural Text Generation Models (paper)

➖ Automatic Detection of Generated Text is Easiest when Humans are Fooled (paper)

Infilling:

💥 Enabling Language Models to Fill in the Blanks (paper) (code) (demo)

➖ INSET: Sentence Infilling with INter-SEntential Transformer (paper) (code)

➖ Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order (paper)

NLU and NLG together:

➖ Towards Unsupervised Language Understanding and Generation by Joint Dual Learning (paper) (code)

➖ A Generative Model for Joint Natural Language Understanding and Generation (paper)

See also:

🔝 Data-to-Text Generation

➖ Posterior Control of Blackbox Generation (paper) (code)

➖ Few-Shot NLG with Pre-Trained Language Model (paper) (code)

➖ Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence (paper)

➖ Bridging the Structural Gap Between Encoding and Decoding for Data-To-Text Generation (paper)

Table-to-Text Generation:

➖ Logical Natural Language Generation from Open-Domain Tables (paper) (code)

➖ Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints (paper)

AMR-to-Text Generation:

➖ GPT-too: A Language-Model-First Approach for AMR-to-Text Generation (paper) (code)

➖ Line Graph Enhanced AMR-to-Text Generation with Mix-Order Graph Attention Networks (paper) (code)

➖ AMR-To-Text Generation with Graph Transformer (paper) (code)

🔝 Summarization and Simplification

Evaluation:

➖ Fact-based Content Weighting for Evaluating Abstractive Summarisation (paper) (code)

➖ Facet-Aware Evaluation for Extractive Summarization (paper) (data)

➖ SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization (paper) (code)

➖ Asking and Answering Questions to Evaluate the Factual Consistency of Summaries (paper)

Factuality, Truthfulness, Faithfulness:

➖ Improved Natural Language Generation via Loss Truncation (paper) (code)

Noisy data is one reason for "hallucination" of facts in neural summarization systems. One way to deal with this issue is to replace log loss with a more robust alternative.
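The core idea of truncation can be sketched in a few lines: compute the per-example loss, discard the highest-loss fraction (assumed to be the noisy examples), and average the rest. This is a simplified, per-batch illustration with a hypothetical function name; the paper's actual procedure (for example, how the cut-off is estimated and when truncation starts) may differ.

```python
from typing import List


def truncated_mean_loss(losses: List[float], drop_fraction: float) -> float:
    """Average the per-example losses after dropping the largest ones."""
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n_drop = int(len(losses) * drop_fraction)
    kept = sorted(losses)[:len(losses) - n_drop]
    return sum(kept) / len(kept)


# Hypothetical per-example log losses; the last one looks like a noisy pair.
losses = [0.20, 0.30, 0.25, 9.00]

print(truncated_mean_loss(losses, 0.0))   # plain mean, dominated by the outlier
print(truncated_mean_loss(losses, 0.25))  # outlier dropped
```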

➖ FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization (paper) (code)

➖ Improving Truthfulness of Headline Generation (paper) (code)

➖ On Faithfulness and Factuality in Abstractive Summarization (paper) (data)

➖ Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports (paper)

Unsupervised:

💥 Iterative Edit-Based Unsupervised Sentence Simplification (paper) (code)

➖ Unsupervised Opinion Summarization with Noising and Denoising (paper) (code - maybe?)

➖ The Summary Loop: Learning to Write Abstractive Summaries Without Examples (paper) (code)

Simplification

➖ Neural CRF Model for Sentence Alignment in Text Simplification (paper) (code and data)

See also: "Iterative Edit-Based Unsupervised Sentence Simplification" and "ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations", above.

🔝 Machine Translation

💥 Hard-Coded Gaussian Attention for Neural Machine Translation (paper) (code)

See also: Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation

💥 Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics (honorable mention) (paper) (code)

➖ Evaluating Explanation Methods for Neural Machine Translation (paper)

➖ “You Sound Just Like Your Father” Commercial Machine Translation Systems Include Stylistic Biases (paper)

➖ Multi-Hypothesis Machine Translation Evaluation (paper) (code)

➖ Compositional Generalization by Factorizing Alignment and Translation (SRW) (paper)

🔝 Style Transfer

➖ Can Humor Prediction Datasets be used for Humor Generation? Humorous Headline Generation via Style Transfer (paper)

➖ Improving Adversarial Text Generation by Modeling the Distant Future (paper)

➖ Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen (paper) (code)

➖ Politeness Transfer: A Tag and Generate Approach (paper) (code)

➖ Parallel Data Augmentation for Formality Style Transfer (paper) (data)

➖ Learning Implicit Text Generation via Feature Matching (paper)

➖ Hooks in the Headline: Learning to Generate Headlines with Controlled Styles (paper) (code)

➖ Exploring Contextual Word-level Style Relevance for Unsupervised Style Transfer (paper)

NLU papers

🔝 Text Classification

➖ Feature Projection for Improved Text Classification (paper) (code)

➖ Discrete Latent Variable Representations for Low-Resource Text Classification (paper) (code)

➖ GAN-BERT: Generative Adversarial Learning for Robust Text Classification with a Bunch of Labeled Examples (paper) (code)

➖ Text Classification with Negative Supervision (paper)

➖ Contextualized Weak Supervision for Text Classification (paper) (code)

➖ Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks (paper) (code)

➖ Hierarchy-Aware Global Model for Hierarchical Text Classification (paper) (code)

➖ Efficient Strategies for Hierarchical Text Classification: External Knowledge and Auxiliary Tasks (paper)

🔝 Topic Models

➖ Tree-Structured Neural Topic Model (paper) (code)

➖ Neural Topic Modeling with Bidirectional Adversarial Training (paper)

➖ An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering (paper) (code)

🔝 Knowledge Graphs

💥 Orthogonal Relation Transforms with Graph Context Modeling for Knowledge Graph Embedding (see results on WN18RR) (paper) (code)

➖ Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction (paper) (code)

➖ Taxonomy Construction of Unseen Domains via Graph-based Cross-Domain Knowledge Transfer (paper) (code)

🔝 Hypernymy Detection

➖ BiRRE: Learning Bidirectional Residual Relation Embeddings for Supervised Hypernymy Detection (paper)

➖ Hypernymy Detection for Low-Resource Languages via Meta Learning (paper)

🔝 Natural Language Inference

See also these papers from ACL 2020 listed under other sections: "NILE: Natural Language Inference with Faithful Natural Language Explanations" and "Syntactic Data Augmentation Increases Robustness to Inference Heuristics".

💥 Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition (paper) (code)

💥 Inherent Disagreements in Human Textual Inferences (paper) (data)