A Non-Blind Reviewing Proposal for *ACL

Background

The current *ACL arXiv and reviewing policy can be characterized as follows: we know there are implicit biases, so the intervention is to hide the signals (features) that could trigger those biases.

See above link for details, but one important provision is that posting to any non-blind preprint server is prohibited starting one month before the conference deadline. I am indeed on record as calling this policy "idiotic". There are many reasons why authors would post on arXiv (and equally many reasons why some think it's a bad idea), but those motivations are orthogonal to my argument here. I'm happy to discuss them, but that's a separate issue.

My entire argument is that this policy is ineffective, as I've explained:

And as I've pointed out several times: the *ACL policy only works for the initial creation of the paper. If a paper is rejected, the authors will naturally post on arXiv, so blind review is defeated the next conference cycle. We just delayed our impact for no good reason.

— Jimmy Lin (@lintool) May 23, 2018

To which Chris Manning responded:

The @ACL_NLP policy relies on 2 human failings: procrastination and forgetfulness. Everyone could finish papers 35 days before a deadline but few do. Some genuine preprints or previously rejected papers will be broadly available but people are less likely to remember authors. 7/

— Christopher Manning (@chrmanning) May 27, 2018

This, IMO, is not sound evidence-based policymaking. As far as I'm aware, these "two human failings" are unverified assumptions. In practice, reviewers will search the literature while reviewing and unwittingly unblind papers. And no, you can't simply prescribe how reviewers conduct their reviews (i.e., "try not to search..."); it's unenforceable.

I think if you're going to take away someone's right (in this case, to freely post on arXiv), there had better be some evidence that this imposition accomplishes some good. I see hand waving. I don't see empirical evidence. And yes, I believe the burden of proof rests with those who seek to restrict the actions of others, not the other way around.

My Proposal

Given that background, I've clearly stated:

I think everyone is misunderstanding me: I'm not opposed to blind review. It has advantages but it is by no means perfect. What annoys me is the attitude that everyone treats blind review as sacrosanct, the sole source of truth in an axiomatic, inviolable manner.

— Jimmy Lin (@lintool) May 27, 2018

Now, in the spirit of being constructive, I'll propose a completely opposite approach: let's give up on blind review, be explicit about the signals that may lead to implicit bias, and then correct for them. This approach is predicated on there being substantial research on implicit biases and how to combat them. (And there is!)

Under this proposal, a conference would do something like this:

  • All conference officers (PC chairs, SPC, etc.) must take mandatory training that discusses diversity, inclusivity, implicit bias, etc.
  • Conference officers are empowered during mandatory in-person SPC meetings to correct for any biases they encounter, based on the above training.
  • The paper submission system explicitly gathers all features that we might imagine to be relevant to diversity, inclusivity, implicit bias, etc. (e.g., h-index, gender, L2 vs. L1); see the sketch after this list.
  • We make available for decision making empirical evidence with respect to implicit biases, accumulated over time, such as the WSDM experiment.
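To make the third bullet concrete, here's a minimal sketch of what such a submission record might look like. The field names, types, and categories are my own illustrative assumptions, not a concrete spec:

```python
# Hypothetical sketch of a submission record that explicitly gathers
# bias-relevant features. All field names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AuthorProfile:
    name: str
    h_index: Optional[int] = None       # proxy for "fame"
    gender: Optional[str] = None        # self-reported, optional
    english_l1: Optional[bool] = None   # L1 vs. L2 English writer
    institution: Optional[str] = None   # prestige effects

@dataclass
class Submission:
    paper_id: int
    title: str
    authors: List[AuthorProfile] = field(default_factory=list)
    # Reviewers never see AuthorProfile; only the SPC does.
```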

Under this proposal, there are absolutely no restrictions against arXiv submissions, flag-planting, shameless self-promotion on social media, etc. There will, however, be consequences (see below).

How This Would Work

Reviewing would proceed normally, as most conferences are run today. Reviewers remain blind to author identities, but authors must be revealed to the SPC (or track chairs, or whatever). The major change happens at the in-person PC meeting (and I believe the in-person-ness is absolutely critical). The SPC members examine all papers and make adjustments to combat implicit bias:

  • Based on the diversity, inclusivity, implicit bias, etc. training they have received.
  • Based on prior evidence and empirically validated strategies (by other researchers) to handle diversity, inclusivity, implicit bias, etc. issues. This means the techniques will be refined and improved over time.

In other words, we empower the SPC members (or PC chairs, etc.) to override reviews based on their training and judgment. Or even stronger: we expect the SPC members to correct for implicit bias and hold them accountable.

Let's, for the sake of concreteness, assume that the SPC uses the "checklist approach" (although there are other effective techniques as well). The checklist approach attempts to make explicit the decision-making process so as to ensure that decisions are made equitably (for whatever definition of equity the community decides to adopt).

Let's again, for concreteness, assume that under the checklist approach we wish to guard against the implicit biases associated with fame, gender, and L2 vs. L1-ness. (The community can converge through some process on what these characteristics are.) Then I'd imagine that discussions at the in-person meeting might include the following:

  • Paper 342 was reviewed by two second-year Ph.D. students who don't seem to have a solid command of the literature, and as a result we believe there is a "famous author" effect at work here. Despite the high scores, we're going to reject the paper.
  • Paper 1934 has a review that exhibits sexist undertones regarding the topic of study. This is inappropriate and so we are going to discount the negative review and accept the paper.
  • Paper 908 has one review that isn't substantive, and is critical based solely on relatively superficial grammatical mistakes. We're going to recommend a native-speaker editing pass, but otherwise accept the paper because we think the underlying ideas are solid. (Even better, if we add a shepherding process, we make the copyediting mandatory).
  • Paper 85 makes exaggerated novelty claims that went unchallenged because the paper came from a prestigious institution, but in our opinion, the methodology is shoddy. Despite the high scores, we're going to reject.

Note that under this model authors can shamelessly flag-plant all they want on arXiv, but they will be held accountable if they do shoddy science.
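For illustration, here's a minimal sketch of how such a checklist and its overrides might be represented, so that SPC decisions are recorded and auditable over time. The checklist items and record structure are my own assumptions:

```python
# Minimal sketch of the "checklist approach" described above, plus the
# override record that makes SPC decisions auditable. The checklist items
# and field names are illustrative assumptions, not a worked-out design.
from dataclasses import dataclass

BIAS_CHECKLIST = [
    "famous-author effect: do scores track reputation rather than content?",
    "gender bias: do any reviews contain sexist framing or undertones?",
    "L2 bias: is criticism based on surface grammar rather than ideas?",
    "prestige bias: did bold claims from a famous lab go unchallenged?",
]

@dataclass
class Override:
    paper_id: int
    checklist_item: int      # index into BIAS_CHECKLIST
    original_decision: str   # what the raw review scores implied
    final_decision: str      # what the SPC decided after discussion
    rationale: str           # recorded so the SPC can be held accountable

# The first scenario above (Paper 342) might be recorded as:
paper_342 = Override(
    paper_id=342,
    checklist_item=0,
    original_decision="accept",
    final_decision="reject",
    rationale="Junior reviewers; high scores appear driven by author fame.",
)
```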

Why This Is Better

I argue that my proposal is better because it replaces arbitrary non-evidence-based policies with scientifically sound strategies that can be refined over time. Even taking at face value the premise that we must compromise between speed of dissemination and blind review, what's the optimal embargo period? Why a month? Is there evidence that forgetfulness "kicks in" after a month? In short, this requires parameter tuning, and we have absolutely no idea how to do it. Further evidence of the arbitrary nature of this parameter is that Chris Manning's clarification of the policy actually misstates it: he says "35 days".

In contrast, under my proposal, we trust in the scientific evidence of our colleagues in the social sciences who study implicit bias for a living. For the sake of illustration, I've sketched out what the review process might look like with the "checklist approach". If there are better strategies, we'll adopt them.

In short, we're replacing a non-evidence-based policy axe with the much more nuanced scalpel of human judgment that improves over time.

Objections

It's a lot of work!

And we're trying to combat a pernicious problem... your point is?

It puts a lot of power in the hands of the SPC members.

So? We already do. Now we're just giving them a clear mandate and a methodology for redress. If we can't select a handful of members from the community whom we trust to do this, then I think the community has more serious issues...

What about the possibility of retaliation? Won't an SPC member who is an assistant professor worry about axing a paper from a famous author, who might retaliate later, for example in tenure reviews?

I concede that this is a concern. It can perhaps be mitigated by pairing SPC members: someone very senior (a full professor) provides "cover" for junior faculty.

In-person PC meetings are untenable.

No they're not. Last time I checked, SIGIR and WWW still do this. Many conferences have split into tracks now; track-specific in-person meetings are less desirable but still adequate. As a bonus, we could also rotate these meetings to different parts of the globe to further promote diversity.

This plan biases against people with restrictions on traveling, such as parents with young children.

Agreed. I have no response to this. This actually happened to me recently.

Additional Arguments Why the arXiv Policy Is Ineffective

I believe the steady state is as follows:

Hypothesis that new arXiv policy for *ACL simply pushes deadlines one month earlier. Look at the spike in arXiv cs.CL submissions last week, one month before NAACL 2018 long paper deadlines. Thanks to @tuzhucheng for analysis! #NLProc pic.twitter.com/l18zHuI59k

— Jimmy Lin (@lintool) November 20, 2017
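For what it's worth, this kind of spike analysis is easy to reproduce. Here's a minimal sketch using the public arXiv API; the result cap and daily bucketing are my own choices for illustration, and the original analysis may have been done differently:

```python
# Minimal sketch: count recent arXiv cs.CL submissions per day via the
# public arXiv API, to look for a spike ahead of a conference deadline.
# The result cap (500) is an assumption for illustration.
from collections import Counter

import feedparser  # pip install feedparser

URL = ("http://export.arxiv.org/api/query?"
       "search_query=cat:cs.CL"
       "&sortBy=submittedDate&sortOrder=descending"
       "&start=0&max_results=500")

feed = feedparser.parse(URL)

# Entries carry an ISO timestamp like "2017-11-13T18:59:59Z"; bucket by day.
per_day = Counter(entry.published[:10] for entry in feed.entries)

for day in sorted(per_day):
    print(day, per_day[day])  # look for a bump ~1 month before the deadline
```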

Chris Manning would argue:

Everyone could finish their papers 35 (sic) days before a deadline but in practice very few people do.

To which I would respond: consider a young, ambitious graduate student of a "famous author" who would benefit from unblinding. The student is cognizant of the advantages of purposely unblinding a submission and thus has a very strong incentive to finish the paper early. The student could then flout the rules (e.g., broadly advertise on social media) and face no consequences.

There is no possible adjustment to the policy that could work. Whatever the embargo period is, say n days, the student will just submit to arXiv n + 1 days in advance. If we make the embargo period too long, we run into the case where (for example) the ACL notification period overlaps with the embargo period for EMNLP. At that point, we've effectively outlawed arXiv for unpublished work. That is indeed a minority viewpoint per the Report on ACL Survey on Preprint Publishing and Reviewing (page 31); the survey suggests that most in the community would not be comfortable going so far.
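To make the overlap arithmetic concrete, here's a minimal worked example with hypothetical dates (actual notification and deadline dates vary by year):

```python
# Worked example of the overlap argument, using hypothetical dates.
# A rejected ACL paper may only be posted between ACL notification and
# the start of the next EMNLP embargo.
from datetime import date, timedelta

EMBARGO_DAYS = 30                      # the current *ACL parameter
acl_notification = date(2018, 4, 20)   # hypothetical ACL notification
emnlp_deadline = date(2018, 5, 22)     # hypothetical EMNLP deadline

embargo_starts = emnlp_deadline - timedelta(days=EMBARGO_DAYS)
legal_window = (embargo_starts - acl_notification).days

print(f"Days a rejected ACL paper may legally appear on arXiv: {legal_window}")
# Prints 2 here; lengthen the embargo slightly and the window goes
# negative, i.e., arXiv is effectively outlawed for that paper.
```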
