Paper Dots is a tool that automatically extracts insights from research papers. It:
- Automatically annotates a research paper PDF with important keyphrases, enabling faster skim-reading of papers
- Builds a cumulative Knowledge Graph on top of the papers read so far, helping you track important concepts
- Continuously delivers relevant papers via email, promoting consistent and directed learning
The end-to-end pipeline is shown below:
There are 3 main components to the project:
- Keyphrase Extraction
Implemented using Constituency Parsing (with a pretrained AllenNLP model), followed by a rule-based engine that refines the extracted keyphrases; see the sketch after the list below.
Coming Soon:
- Keyphrase extraction from the entire paper, not just the abstract
- Further division of identified keyphrases into domain-specific entities like Datasets, References, Algorithms, Metrics, etc.
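A minimal sketch of the core idea, assuming AllenNLP's public constituency-parser model (the exact model and the refinement rules in src/constituency_parser.py and src/extractor.py may differ): parse a sentence, then keep its noun phrases as keyphrase candidates.

```python
# Sketch: noun phrases from a constituency parse as keyphrase candidates.
# The model URL is AllenNLP's public constituency parser; the project's
# rule-based engine would further refine these candidates.
from allennlp.predictors.predictor import Predictor
from nltk.tree import Tree

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "elmo-constituency-parser-2020.02.10.tar.gz"
)

def keyphrase_candidates(sentence):
    result = predictor.predict(sentence=sentence)  # returns a bracketed tree string
    tree = Tree.fromstring(result["trees"])
    # Every NP subtree is a raw candidate; rules then drop determiners, pronouns, etc.
    return [" ".join(t.leaves()) for t in tree.subtrees(lambda t: t.label() == "NP")]

print(keyphrase_candidates("The Transformer relies entirely on self-attention."))
```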
- Knowledge Graph construction
Implemented using Open Information Extraction (the pretrained OpenIE model from AllenNLP): SVO triplets are extracted and then refined to generate the final nodes and edges of the knowledge graph, as sketched below.
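A minimal sketch of the triplet-extraction step, assuming AllenNLP's public OpenIE model; the tag-matching logic here is illustrative, and the actual refinement rules live in the pipeline.

```python
# Sketch: SVO triplets from AllenNLP's public OpenIE model. Each predicted
# frame carries BIO tags (B-ARG0, I-ARG0, B-V, B-ARG1, ...) over the words;
# ARG0 becomes the subject node, V the edge, ARG1 the object node.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "openie-model.2020.03.26.tar.gz"
)

def span(words, tags, label):
    # Join the words whose tag carries the given label ("B-ARG0"/"I-ARG0" -> "ARG0").
    return " ".join(w for w, t in zip(words, tags) if t.split("-", 1)[-1] == label)

def svo_triplets(sentence):
    result = predictor.predict(sentence=sentence)
    words = result["words"]
    triplets = []
    for frame in result["verbs"]:
        s, v, o = (span(words, frame["tags"], lbl) for lbl in ("ARG0", "V", "ARG1"))
        if s and v and o:
            triplets.append((s, v, o))  # subject --verb--> object
    return triplets

print(svo_triplets("The Transformer uses multi-head attention."))
```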
- Paper sampling
The papers are sampled from the Arxiv corpus (hosted on Kaggle). To enable semantic search over the papers, we first had to obtain embeddings for each paper in the corpus, for which we used Sentence-Transformers, roughly as sketched below.
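A rough sketch of that embedding step. The all-MiniLM-L6-v2 model and the "embeddings" hdf5 dataset key are illustrative assumptions, not necessarily the repo's choices; the output file names match those under src/paper_sampler/data.

```python
# Sketch: embed every abstract in the corpus and persist embeddings + IDs.
# Model name and hdf5 dataset key ("embeddings") are assumptions.
import pickle
import h5py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def embed_corpus(papers):
    # papers: iterable of dicts with "id" and "abstract",
    # e.g. parsed from the Kaggle Arxiv metadata dump.
    ids = [p["id"] for p in papers]
    embeddings = model.encode([p["abstract"] for p in papers],
                              show_progress_bar=True, convert_to_numpy=True)
    with h5py.File("corpus_embeddings.hdf5", "w") as f:
        f.create_dataset("embeddings", data=embeddings)
    with open("corpus_ids.pkl", "wb") as f:
        pickle.dump(ids, f)
```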
The corpus embeddings are available and can be downloaded from here for research purposes.
Once the corpus embeddings are in place, a new paper can be sampled from the corpus using the seed paper as follows:
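A minimal sketch of the sampling step, under the same assumptions as above (model choice and hdf5 dataset key are illustrative and must match whatever was used to build the embeddings):

```python
# Sketch: embed the seed paper's abstract and return the closest corpus papers.
import pickle
import h5py
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the corpus model

with h5py.File("corpus_embeddings.hdf5", "r") as f:
    corpus_embeddings = f["embeddings"][:]
with open("corpus_ids.pkl", "rb") as f:
    corpus_ids = pickle.load(f)

def sample_papers(seed_abstract, top_k=5):
    query = model.encode(seed_abstract, convert_to_numpy=True)
    hits = util.semantic_search(query, corpus_embeddings, top_k=top_k)[0]
    # Each hit carries "corpus_id" (row index) and "score" (cosine similarity).
    return [(corpus_ids[h["corpus_id"]], h["score"]) for h in hits]
```

The repository is organized as follows: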
Paper-Dots
├── docs
├── tests
├── output
├── LICENSE
├── README
└── src
    ├── config.py
    ├── information_extraction.py    # Driver of the Information Extraction pipeline
    ├── extractor.py
    ├── constituency_parser.py
    ├── mail_sender.py
    ├── model_loader.py
    ├── mongo_utils.py
    ├── paper_walk.py
    ├── task_keyphrase_extraction.py # Task 1
    ├── task_knowledge_graph.py      # Task 2
    ├── utils.py
    └── paper_sampler
        ├── app.py                   # Flask app
        ├── Dockerfile
        ├── paper_sampler.py
        ├── utils.py
        ├── requirements.txt
        └── data
            ├── corpus_embeddings.hdf5 # Embeddings of the Arxiv dataset (5.5 GB)
            └── corpus_ids.pkl         # Corresponding paper IDs
Currently, the end-to-end pipeline is only configured for personal use, but we are working on making it publicly available. In the meantime, you can send an email to paperdotsai@gmail.com with a link to your seed paper, and we will onboard you in the next iteration.
The individual tasks of the Information Extraction sub-pipeline, however, can be used as follows:
Keyphrase Extraction:
python task_keyphrase_extraction.py -fp https://arxiv.org/abs/1706.03762
All the options are as follows:
-fp [--filepath]: Path to the research paper; can be a URL (both abs and pdf links are supported) or a local path
-ca [--clip_abstract]: If true, clips the annotated abstract as an image file and skips annotating the entire PDF
-sa [--save_abstract]: If true, saves the annotated image at ANNOTATE_FILEPATH from the config
Knowledge Graph:
python task_knowledge_graph.py -fp https://arxiv.org/abs/1706.03762
All the options are as follows:
-fp [--filepath]: Path to the research paper; can be a URL (both abs and pdf links are supported) or a local path
Feel free to raise feature requests :)