Paper Graph

• Built a distributed big data pipeline on AWS to support data collection and graph annotation for the Open Research Corpus dataset of 45 million academic research papers. The pipeline lets researchers identify the most important papers related to their papers of interest.

• Designed processing modules for a multi-node Spark cluster. Applied the MinHash locality-sensitive hashing (LSH) algorithm to measure similarity between papers, and optimized PySpark job performance by tuning and comparing different Spark transformations and user-defined functions (UDFs).

• Deployed a Neo4j database to store graph relationships between academic papers, including both citation relationships and similarity relationships. The Neo4j database serves the front end's queries for graph visualization.
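
To illustrate the MinHash idea mentioned above, here is a minimal pure-Python sketch of the signature step only (no LSH banding). It is not the project's PySpark code: the function names, the 3-word shingle size, the number of hash functions, and the linear hash family are all assumptions chosen for illustration. In PySpark a comparable technique is available as `pyspark.ml.feature.MinHashLSH`; whether the project uses that class is not stated here.

```python
import hashlib
import random

# Modulus for the hash family: the Mersenne prime 2**61 - 1.
PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Split a text into the set of its k-word shingles (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def stable_hash(token):
    """Map a string to an integer deterministically (built-in hash() is salted per process)."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_hash_params(num_hashes=256, seed=42):
    """Draw (a, b) coefficients for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Return, for each hash function, the minimum hash value over the item set."""
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty set")
    base = [stable_hash(item) for item in items]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, used here only to sanity-check the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical usage on two short abstracts.
params = make_hash_params()
doc_a = shingles("minhash estimates the jaccard similarity between two sets of shingles")
doc_b = shingles("minhash estimates the jaccard similarity between two documents quickly")
estimate = estimate_jaccard(minhash_signature(doc_a, params), minhash_signature(doc_b, params))
print("exact:", exact_jaccard(doc_a, doc_b), "estimated:", estimate)
```

The estimate converges to the exact Jaccard similarity as the number of hash functions grows, which is why signatures can replace full set comparisons when the corpus is large.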


Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.


The steps below walk through setting up a development environment.

• 1. Set up Spark (Spark 2.4.3)

1. Install OpenJDK 8, which Spark 2.4.3 supports:
    $ sudo apt update
    $ sudo apt install openjdk-8-jre-headless
Java 8 folder: /usr/lib/jvm/java-8-openjdk-amd64/
2. Check the Java version after installing Java 8:
ubuntu@ip-10-0-0-6:~$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
  0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
* 2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

3. Set the default Java version on the Spark machine to Java 8, not 11:
	•	Modify /usr/local/spark/conf/, set JAVA path
	•	.bash_profile: change export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
4. Check the Spark version before installing the required jars:
ubuntu@ip-10-0-0-6:~$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_212
Compiled by user  on 2019-05-01T05:08:38Z
Type --help for more information.
5. Install the jars needed for Spark 2.4.3:
	•	Create a folder named lib under /usr/local/spark:
/usr/local/spark$ sudo mkdir lib
	•	Make the folder readable and writable:
/usr/local/spark$ sudo chmod -R 777 lib/
	•	Download the required jars into that folder:
    - wget
    - wget

	•	Edit the Spark default configuration file by adding the following lines to spark-defaults.conf:
    - spark.executor.extraClassPath /usr/local/spark/lib/aws-java-sdk-1.7.4.jar:/usr/local/spark/lib/hadoop-aws-2.7.1.jar
    - spark.driver.extraClassPath /usr/local/spark/lib/aws-java-sdk-1.7.4.jar:/usr/local/spark/lib/hadoop-aws-2.7.1.jar
6. Add AWS credentials to the environment:
    - export AWS_ACCESS_KEY_ID=xxxxx
    - export AWS_SECRET_ACCESS_KEY=xxxxxx
	•	Restart Spark:
   $ sh /usr/local/spark/sbin/
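
The two `spark-defaults.conf` entries in step 5 share one colon-separated classpath. As a small illustration, the sketch below builds both lines from a library folder and a list of jar names. The helper name `classpath_conf_lines` is hypothetical; the folder and jar names are the ones used above.

```python
import posixpath


def classpath_conf_lines(lib_dir, jar_names):
    """Build the executor and driver extraClassPath lines for spark-defaults.conf."""
    # Spark expects a single classpath string with entries separated by ':' on Linux.
    classpath = ":".join(posixpath.join(lib_dir, jar) for jar in jar_names)
    return [
        f"spark.executor.extraClassPath {classpath}",
        f"spark.driver.extraClassPath {classpath}",
    ]


conf_lines = classpath_conf_lines(
    "/usr/local/spark/lib",
    ["aws-java-sdk-1.7.4.jar", "hadoop-aws-2.7.1.jar"],
)
for line in conf_lines:
    print(line)
```

Generating the lines this way keeps the executor and driver classpaths identical, which avoids a class being visible on one side but not the other.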

• 2. Set up PostgreSQL

1. Check the host and port:
   $ sudo netstat -plunt | grep postgres
   Change the password:
   $ sudo -u postgres psql postgres
   postgres=# \password postgres
2. Allow remote machines (e.g. Flask, Spark) to connect to PostgreSQL:
   $ cd /etc/postgresql/10/main/
3. Open the file named postgresql.conf:
   $ sudo nano postgresql.conf
4. Add this line to that file:
   listen_addresses = '*'
   Then open the file named pg_hba.conf:
   $ sudo nano pg_hba.conf
   Add this line to that file:
   host  all  all md5
5. Restart the server:
   $ sudo /etc/init.d/postgresql restart
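
Once PostgreSQL accepts remote connections, a Spark job can reach it over JDBC. The sketch below only assembles the connection URL and properties. The host, database name, table name, and the helper `jdbc_settings` are hypothetical, and reading the password from a `PGPASSWORD` environment variable is an assumption made to avoid hard-coding it. The resulting values have the shape expected by PySpark's `DataFrame.write.jdbc(url, table, mode, properties)`.

```python
import os


def jdbc_settings(host, database, user, port=5432, password_env="PGPASSWORD"):
    """Return a (url, properties) pair for a PostgreSQL JDBC connection."""
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": user,
        # Assumption: the password is supplied through an environment variable.
        "password": os.environ.get(password_env, ""),
        "driver": "org.postgresql.Driver",
    }
    return url, properties


# Hypothetical host and database names, for illustration only.
url, props = jdbc_settings("db.example.internal", "papers", "postgres")
print(url)

# With a SparkSession and a DataFrame `df` available, the write could look like:
# df.write.jdbc(url, "paper_similarity", mode="append", properties=props)
```

The PostgreSQL JDBC driver jar also has to be on the Spark classpath for such a write to work.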

• 3. Set up Neo4j (Neo4j 3.1.4)

1. Open port 7687 for bolt, 7474 for Neo4j browser
2. Install Java 8 (Java 10 is not compatible with Neo4j version 3.1.4)
   $ java -showversion
   $ sudo add-apt-repository ppa:webupd8team/java # add the PPA that provides the Java installer
   $ sudo apt-get update # refresh the package index
   $ sudo apt-get install oracle-java8-installer # install Java 8
   $ sudo apt-get update # refresh the package index
   $ sudo apt install openjdk-8-jre-headless
3. After installing Java, start the installation process for Neo4j:
   $ wget -O - | sudo apt-key add -
   $ echo 'deb stable/' >/tmp/neo4j.list
   $ sudo mv /tmp/neo4j.list /etc/apt/sources.list.d
   $ sudo apt-get update # refresh the package index
4. After the installation completes, restart the Neo4j service with the command below.
   $ sudo service neo4j restart
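
With Neo4j running, citation and similarity relationships can be loaded with Cypher `MERGE` statements. The sketch below only builds the query text and its parameters. The `Paper` label, the `id` property, the `CITES` and `SIMILAR_TO` relationship types, and the `score` property are assumptions for illustration, not the project's actual schema. It also assumes the installed version accepts the `$name` parameter syntax (older releases use `{name}`). The query and parameters could then be passed to a Bolt driver session on port 7687.

```python
# Relationship types cannot be passed as Cypher parameters, so they are
# checked against a fixed whitelist before being placed in the query text.
ALLOWED_REL_TYPES = {"CITES", "SIMILAR_TO"}


def relationship_query(rel_type, source_id, target_id, properties=None):
    """Build a Cypher MERGE query and parameter map linking two papers."""
    if rel_type not in ALLOWED_REL_TYPES:
        raise ValueError(f"unsupported relationship type: {rel_type!r}")
    query = (
        "MERGE (a:Paper {id: $source_id}) "
        "MERGE (b:Paper {id: $target_id}) "
        f"MERGE (a)-[r:{rel_type}]->(b) "
        "SET r += $props"
    )
    params = {
        "source_id": source_id,
        "target_id": target_id,
        "props": dict(properties or {}),
    }
    return query, params


# Hypothetical paper identifiers and similarity score.
query, params = relationship_query("SIMILAR_TO", "paper-1", "paper-2", {"score": 0.82})
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` keeps the load idempotent, so re-running a batch does not duplicate nodes or relationships.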

Deployment on a live system (Nginx)

1. Install Nginx, pip, Node.js, npm, Flask, Gunicorn and PM2:
   $ sudo apt-get install nginx python-pip nodejs npm
   $ sudo pip install flask gunicorn
   $ sudo npm install pm2
2. Configure an Nginx proxy for the Flask application. Add the following code to the /etc/nginx/sites-available/default file:
   server {  
    listen 80;
    listen [::]:80;
    location / {

3. Create a bash file for Gunicorn execution that starts the application with Gunicorn using 10 workers. Create and add the following code in
   gunicorn -w 10 hello_world:app

4. Start the application with PM2 (a production process manager):
   $ sudo pm2 start
5. Ensure PM2 restarts the application if the server restarts:
   $ pm2 startup
   $ pm2 save
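
The `hello_world:app` argument in step 3 tells Gunicorn to import a module named `hello_world` and serve the WSGI callable named `app` inside it. The sketch below is a stand-in for such a module using only the standard library. The project itself uses Flask, whose application object is likewise a WSGI callable, so this is an illustration of the interface rather than the project's code; the response text is an assumption.

```python
# hello_world.py (hypothetical module matching "gunicorn -w 10 hello_world:app")


def app(environ, start_response):
    """Minimal WSGI application: answer every request with a plain-text greeting."""
    body = b"Hello, World!\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # WSGI applications return an iterable of byte strings.
    return [body]
```

Each of the 10 Gunicorn workers imports this module and calls `app` once per request, while Nginx forwards incoming traffic on port 80 to Gunicorn.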

Data Source

  • Semantic Scholar - Over 45 million published research papers in Computer Science, Neuroscience, and Biomedical fields provided as an easy-to-use JSON archive.

Built With

  • PySpark - PySpark is the Python API for Spark.
  • Neo4j - Graph database management system
  • AWS S3 - Simple Storage Service, the object storage service offered by Amazon Web Services
  • PostgreSQL - Relational database management system



This project is licensed under the MIT License - see the file for details


