PaperRank: A Reputation-based Search Engine for Academic Papers

In this project, we propose a reputation-based search engine for academic papers. The work is twofold: (1) we built a basic search engine based on a general similarity criterion, which produces relatively good results; (2) we tried to improve the search results with additional information from citations, using the PageRank algorithm. We also propose a variant called subset PageRank, which targets the problem that some relevant papers are not similar to the query. Finally, we designed a graphical user interface (GUI) for the search engine.

1. Project Structure

webapp/
- HTTP server serving the search webpages
data/
- Papers Data
- Data preprocessing
- Elasticsearch scripts
paperrank/
- Algorithm of PaperRank
nginx/
- Reverse proxy server (for deployment only)
requirements.txt
- Python dependencies

2. Prepare environment

A. Python virtual environment (Development use)

Install python3 and virtualenv.

$ virtualenv env -p python3  # "-p python3" means using python 3, needed if you have python 2 as global default
$ source env/bin/activate  # Get into the virtual env
(env) $ pip install -r requirements.txt  # Install dependencies

# Now we can try running some commands
(env) $ python app.py

B. Docker (Start server & elasticsearch)

Install Docker.

Then, start docker containers:

$ docker-compose up -d  # -d means run in background

This will boot up the following services:

- nginx server listening on :80
- flask server listening on :8000
- elasticsearch server listening on :9200
- dejavu listening on :1358
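
To confirm that the containers came up, a quick check of the four ports listed above can help. The following is a minimal sketch using only the Python standard library; the `SERVICES` mapping, the function names, and the `localhost` default are illustrative assumptions and not part of the repository.

```python
import socket

# Ports are taken from the service list above; the names are only labels.
SERVICES = {"nginx": 80, "flask": 8000, "elasticsearch": 9200, "dejavu": 1358}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host="localhost", services=SERVICES):
    """Map each service name to True (port open) or False (port closed)."""
    return {name: is_listening(host, port) for name, port in services.items()}
```

Calling `check_services()` after `docker-compose up -d` returns a dictionary such as `{"nginx": True, ...}`. An open port only shows that something accepts connections there, not that the service behind it is healthy.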

3. Prepare our database

Here are the instructions on how to set up the whole system.

Download Data

First, prepare the data we need. The dataset is too large to bundle with the code, so it has to be downloaded separately.

Data structure: https://api.semanticscholar.org/corpus/

We removed papers without valid "outCitations" or valid "inCitations".
Original: 7.35 GB, 3,000,000 papers
Now: 2.07 GB, 425,876 papers

Download here and unzip.
Then, put it in /data/local/filtered_papers.json.
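
For reference, the filtering step described above could look roughly like the sketch below. It assumes the corpus is stored as one JSON object per line and that "valid" means a non-empty list in both "outCitations" and "inCitations"; neither assumption is confirmed here, and the function names are hypothetical.

```python
import json


def has_citations(paper):
    """Assumed meaning of 'valid': both fields are non-empty lists."""
    out_c = paper.get("outCitations")
    in_c = paper.get("inCitations")
    return (
        isinstance(out_c, list) and len(out_c) > 0
        and isinstance(in_c, list) and len(in_c) > 0
    )


def filter_papers(lines):
    """Yield parsed papers from an iterable of JSON lines, skipping the rest."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        paper = json.loads(line)
        if has_citations(paper):
            yield paper
```

Because `filter_papers` is a generator, it can be fed an open file handle and will process a multi-gigabyte corpus without loading it all into memory.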

PaperRank

Then, run the PageRank algorithm to compute the scores for ranking.

# Preprocess data into graph (as a csv file)
cd paperrank/data/
python preprocess.py

# Run PageRank algo and output result to data/pagerank_result.json
cd paperrank/
python PageRank.py
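
To illustrate what the ranking step computes, here is a minimal sketch of standard power-iteration PageRank on a citation graph. It is not the code in PageRank.py and does not implement the subset variant; the function signature, the damping factor of 0.85, and the handling of papers that cite nothing are assumptions made for the example.

```python
def pagerank(out_links, damping=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank.

    out_links maps each paper id to the list of paper ids it cites.
    Papers that cite nothing (dangling nodes) spread their score uniformly.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    scores = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Total score held by papers with no outgoing citations.
        dangling = sum(scores[v] for v in nodes if not out_links.get(v))
        # Teleportation term plus the uniformly redistributed dangling mass.
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each citing paper splits its score evenly among the papers it cites.
        for src, targets in out_links.items():
            if targets:
                share = damping * scores[src] / len(targets)
                for dst in targets:
                    new[dst] += share
        delta = sum(abs(new[v] - scores[v]) for v in nodes)
        scores = new
        if delta < tol:
            break
    return scores
```

For a toy graph such as `{"a": ["c"], "b": ["c"], "c": []}`, paper `c` receives the highest score because both other papers cite it, and the scores sum to 1.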

Elasticsearch

Run our elasticsearch cluster and import data into it.

cd data/

# Load initial data into elasticsearch
python load.py

# Sample query
python query.py
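
As an illustration of the import step, the sketch below attaches a PageRank score to each paper and builds actions in the shape accepted by the `elasticsearch.helpers.bulk` helper of the official Python client. It is not the code in load.py: the index name `papers`, the field name `pagerank`, and the assumption that the scores are available as a mapping from paper id to score are all hypothetical.

```python
def bulk_actions(papers, pagerank_scores, index="papers"):
    """Yield one bulk-index action per paper, with its PageRank score attached.

    papers: iterable of dicts, each with an "id" key.
    pagerank_scores: dict mapping paper id to score (assumed layout).
    """
    for paper in papers:
        doc = dict(paper)  # copy so the caller's dict is not modified
        # Papers missing from the score table get 0.0 (an arbitrary default).
        doc["pagerank"] = pagerank_scores.get(paper["id"], 0.0)
        yield {"_index": index, "_id": paper["id"], "_source": doc}
```

With a client connected to http://localhost:9200, the generator could be passed as the `actions` argument of `elasticsearch.helpers.bulk`. Storing the score as a document field is one possible design; it lets a query combine text similarity with the stored score at search time.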

Dejavu

You can browse the elasticsearch data easily with dejavu. Go to http://localhost:1358/ and connect it to the elasticsearch cluster at http://localhost:9200.

Run web UI

Go to http://localhost:8000/ to visit the web UI.

4. Other useful commands

# View your running containers
$ docker ps

# Shut down all containers
$ docker-compose down

# View server logs
$ docker logs -f flask_server

# Get into the container
$ docker exec -it flask_server bash

# Restart containers
$ docker-compose restart flask_server nginx

# If Dockerfile has changed, rebuild the image
$ docker-compose up -d --force-recreate --build flask_server nginx

carsonwah / paperrank-search-engine