kenantang / anthology-keyword

Automated keyword search in ACL anthology.

Automated Keyword Search in ACL Anthology

Introduction

This tool enables automated search for relevant papers with keywords in the ACL Anthology.

Challenges

One key challenge is that the number of results returned for a query is unstable. Across repeated searches, the smallest number returned is treated as the stable one. The repeated searches are triggered by switching between the two search options ("Relevance" and "Year of Publication"). The clicks are automated with chromedriver, so before running the script, check that the local chromedriver version matches the browser version (version 103 for this initial commit).
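The "take the minimum over repeated searches" strategy can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `stable_count` and the `fetch_count` callable are hypothetical names, and `fetch_count` stands in for whatever chromedriver-driven routine clicks a search option and reads the result count.

```python
from itertools import cycle, islice

# The two search options named in this README.
SEARCH_OPTIONS = ("Relevance", "Year of Publication")


def stable_count(fetch_count, keyword, n_searches=10, options=SEARCH_OPTIONS):
    """Return the smallest count seen over n_searches repeated searches.

    fetch_count(keyword, option) is a hypothetical callable that runs one
    search with the given option selected and returns the reported count.
    """
    if n_searches < 1:
        raise ValueError("n_searches must be at least 1")
    # Alternate between the options so that every call triggers a new search.
    counts = [
        fetch_count(keyword, option)
        for option in islice(cycle(options), n_searches)
    ]
    return min(counts)
```

Passing the search routine in as a callable keeps the aggregation logic separate from the browser automation, so it can be checked without launching a browser.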

Still, the results are not guaranteed to be correct, even when the number of searches is set to 10. For example, as of 07/04/22 the number of "subreddit" papers should be 31, yet multiple runs with the number of searches set to 10 all returned 108.

To (1) allow enough time for each search and (2) space out the clicks so that the script is not immediately identified as a bot, each query now takes around 20 seconds on average. In addition, the script sleeps for 5 to 10 seconds between searches, with the duration sampled uniformly. When the script reports an error, open the webpage manually to check whether a captcha is required. After solving the captcha once yourself, you should be able to run the script again.
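The uniformly sampled pause between searches could look like the sketch below. The function name `polite_pause` and its parameters are assumptions made for illustration; only the 5 to 10 second range comes from the description above.

```python
import random
import time


def polite_pause(low=5.0, high=10.0, sleep=time.sleep):
    """Sleep for a duration drawn uniformly from [low, high] seconds.

    The sleep function is injectable (an assumption of this sketch) so the
    pause can be exercised without actually waiting.
    """
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

Calling `polite_pause()` between two consecutive searches reproduces the spacing described above; the returned delay can be logged when debugging timing issues.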

The output.txt file stores the results for the keywords in my keyword logbook.

Experiments

The file will be updated regularly starting from 22/07/10. An initial test shows that the current strategy works for scraping a list of 42 keywords.

In a later test with 358 keywords, the scraper failed at the following indices: 9, 54, 99, 135, 180, 225, 270, 279, 291, 324, 353.
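One way to collect such failure indices without aborting the whole run is sketched below. This is an assumption about how a run could be structured, not the notebook's implementation: `scrape_all` and `search` are hypothetical names, and the indices here are 0-based positions in the keyword list, which may differ from the numbering used above.

```python
def scrape_all(keywords, search):
    """Run search(keyword) for every keyword, recording failures by index.

    search is a hypothetical callable that returns the result count for one
    keyword and raises an exception when the query fails (for example when a
    captcha blocks the page).
    """
    results = {}
    failed = []
    for index, keyword in enumerate(keywords):
        try:
            results[keyword] = search(keyword)
        except Exception:
            # Keep going; the failed positions can be retried later.
            failed.append(index)
    return results, failed
```

The returned `failed` list can then be used to rerun only the affected keywords, for instance after a captcha has been solved manually.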

Observations

Combinations of keywords do not return good results:

  • gate control GNN, 0
  • gate fusion GNN, 0

Different hyphenation changes the results:

  • multitask-setting, 262
  • multi-task setting, 740
  • subspan, 2
  • sub-span, 40
  • glass-box, 12
  • glass box, 48
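Because hyphenation affects the counts, it may help to query several spellings of the same keyword. The sketch below generates such variants; `hyphenation_variants` is a hypothetical helper and is not part of the notebook.

```python
import re
from itertools import product


def hyphenation_variants(keyword, separators=("-", " ", "")):
    """Return every way of joining the keyword's parts with the separators.

    The keyword is split on existing hyphens and whitespace, then each gap is
    filled with a hyphen, a space, or nothing.
    """
    parts = [p for p in re.split(r"[-\s]+", keyword.strip()) if p]
    if not parts:
        return []
    variants = []
    for seps in product(separators, repeat=len(parts) - 1):
        pieces = [parts[0]]
        for sep, part in zip(seps, parts[1:]):
            pieces.append(sep + part)
        variants.append("".join(pieces))
    # Drop duplicates while keeping the generation order.
    return list(dict.fromkeys(variants))
```

For example, `hyphenation_variants("glass-box")` yields the hyphenated, spaced, and joined spellings. A keyword that is already written as one word, such as "subspan", is returned unchanged, since the helper cannot know where to split it.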

Some words have surprisingly low frequencies:

  • i.i.d., 16
  • gating factor, 1
  • paired student t-test, 0
  • PCA, 42
  • t-SNE, 23
  • CYK algorithm, 4
  • CKY algorithm, 14
  • VLU, 0

Some are surprisingly high:

  • -STS, 90400
  • color, 13400

Some results are not as relevant as desired:

  • bos, 127

Many keywords that appear in the original paper return a frequency of 0.

Word Cloud Visualization

This is a word cloud visualization of all results.

[Image: WordCloud]

TODO

Currently, only the notebook version is available. Command-line support will be added later.
