sarbajitdutta / ReCiter-MachineLearning-Analysis

ReCiter Machine Learning / Analysis - a suite of scripts and tools for retrieving and analyzing data from ReCiter


ReCiter stores the results of its analysis in both DynamoDB (a NoSQL database) and S3 (a file storage system), and it also exposes data through its various APIs. Because these systems are optimized for performance, it takes a little extra effort to retrieve data for analysis. Below are several ways to get data out of your ReCiter instance.

Retrieving CSV data from DynamoDB using Python

  1. Prerequisites: if you have not done so already...
    1. Download the Visual Studio Code application.
    2. Install boto3. From the command line, run pip install boto3 --user.
    3. We need to connect to DynamoDB. Install AWS command line interface as per these directions.
    4. Configure the AWS CLI, entering your access key and secret key.
  2. Open DynamoDB_Analysis.py in Visual Studio Code.
    1. This Python file contains a series of commands:
      1. The first part scans the DynamoDB table, retrieves every item, and stores the results in a Python list.
      2. The remaining parts each output one table. For example, personArticle_mysql.csv returns all elements stored in DynamoDB at the level of the individual person / article, such as the PMID and the personTypeScore.
    2. There are currently eight tables to be output.
    3. To skip exporting certain tables from DynamoDB, comment out the corresponding code by wrapping it in triple quotes (''' before and after).
  3. Set the outputPath equal to the location of where you wish to output the files from this script.
  4. Save the file.
  5. Execute the Python file:
    1. Option #1: Use the Terminal
      1. In the Terminal, navigate to the location of this Python file.
      2. Execute python3 DynamoDB_Analysis_upload.py
      3. The code will output a series of CSV files to the same directory as the Python file.
    2. Option #2:
      1. In Visual Studio Code, right-click in the body of the Python file.
      2. Click the “Run Python File in Terminal” option.
      3. The code will output a series of CSV files to the same directory as the Python file.
    3. Notes:
      1. In our experience, scanning about 9,000 people and 100,000 records takes approximately 10-15 minutes, whether you run all the scans together or one at a time.
      2. If your code does not properly anticipate a null key, you will see a “KeyError.” The code includes examples of how to handle attributes that may be null, such as publicationAbstract; a minimal sketch of the pattern appears after this list.
      3. Amazon AWS generally charges more for read/write capacity than for storage, so be mindful of how much read capacity this code consumes.
      4. To add a new feature:
        1. Add the attribute name in “f.write”.
        2. Add it below with the other fields and their locations in the tree.
        3. Add it in a third location, toward the bottom, also under “f.write.”
        4. Each section of code refers to a particular level of the tree structure. Try not to add attributes with other attributes that are at different levels of the tree structure. For example, you wouldn’t add userAssertion at the person level; that belongs at the personArticle level.
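
Below is a minimal sketch of the scan-and-export pattern, showing boto3's pagination and the .get()-based null handling mentioned in the notes. The table name ("Analysis") and the attribute names are illustrative assumptions, not the script's actual values:

```python
import csv
import boto3

# Assumed table name; substitute the name of your ReCiter analysis table.
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Analysis')

# Scan the whole table, following LastEvaluatedKey across 1 MB pages.
items = []
response = table.scan()
items.extend(response['Items'])
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])

# Write one CSV row per item. Using .get() with a default avoids a KeyError
# when an attribute such as publicationAbstract is absent.
with open('personArticle_mysql.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['personIdentifier', 'pmid', 'publicationAbstract'])
    for item in items:
        writer.writerow([
            item.get('personIdentifier', ''),
            item.get('pmid', ''),
            item.get('publicationAbstract', ''),
        ])
```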

Retrieving CSV data from S3 using Python

  1. Prerequisites: if you have not done so already...
    1. Download the Visual Studio Code application.
    2. We need to connect to S3. Install AWS command line interface as per these directions.
    3. Configure AWS CLI. Enter access key and secret key.
  2. Go to the directory on your machine where you wish to output the data from S3.
  3. Create a directory called “s3”.
  4. Run the following command in Terminal to download the data from S3: aws s3 cp s3://reciter-dynamodb/AnalysisOutput /<yourFolderLocation>/s3 --recursive (a Python equivalent appears after this list).
  5. Note: every time you rerun the command, make sure the target directory is empty first.
  6. Open S3_AnalysisOutput.py in Visual Studio Code.
  7. Set the originalDataPath equal to the location of where the downloaded files are.
  8. Set the outputPath equal to the location of where you wish to output the files from this script.
  9. Save the file.
  10. Execute the Python file:
    1. Option #1: Use the Terminal
      1. In the Terminal, navigate to the location of this Python file.
      2. Execute python3 S3_AnalysisOutput_upload.py
      3. The code will output a series of CSV files to the same directory as the Python file.
    2. Option #2:
      1. In Visual Studio Code, right-click in the body of the Python file.
      2. Click the “Run Python File in Terminal” option.
      3. The code will output a series of CSV files to the same directory as the Python file.
    3. Notes:
      1. In our experience, copying data for about 1,000 people and 170,000 records to a local directory takes approximately 3-5 minutes.
      2. To skip exporting certain tables from S3, comment out the corresponding code by wrapping it in triple quotes (''' before and after).
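
If you prefer to stay in Python, here is a rough boto3 equivalent of the aws s3 cp --recursive step above; the bucket and prefix are taken from that command, and the local path is a placeholder you should adjust:

```python
import os
import boto3

s3 = boto3.client('s3')
bucket = 'reciter-dynamodb'
prefix = 'AnalysisOutput'
local_root = '/yourFolderLocation/s3'  # match the directory created earlier

# List every object under the prefix (paginated) and download it,
# recreating the folder structure locally.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('/'):  # skip folder-marker keys
            continue
        target = os.path.join(local_root, os.path.relpath(obj['Key'], prefix))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(bucket, obj['Key'], target)
```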

Retrieving JSON data from the Feature Generator API using a bash script

Here we’re going to use a bash script to retrieve data from ReCiter’s feature generator (or any other) API. This approach affords you greater control and may even return some data that’s not available in S3 or DynamoDB. For example, suppose that your minimumStorageThreshold is set to 4. If you retrieve directly from S3 or DynamoDB, articles with scores below 4 won’t be stored in the system. Additionally, you can take advantage of the Feature Generator API’s ability to return only certain fields.

  1. Prerequisites: if you have not done so already...
    1. You will need to obtain the API key associated with the API you will be querying.
  2. Download the reciterRetrieval.sh file.
  3. Navigate to the directory where this file is stored and make it executable:
  4. chmod +x reciterRetrieval.sh
  5. Open reciterRetrieval.sh in a text editor.
  6. Update the API key.
  7. Update the request URL as you wish; you may want to change the domain, the fields returned, the totalStandardizedArticleScore threshold, etc.
  8. Save the file.
  9. Run ./reciterRetrieval.sh from the command line.
  10. The script will output a series of JSON files. (A Python sketch of an equivalent request appears below.)
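
For reference, here is a rough Python sketch of the same kind of retrieval. The endpoint path, the api-key header name, the query parameters, and the person identifiers are all assumptions; check reciterRetrieval.sh for the exact values your instance uses:

```python
import json
import requests

API_KEY = 'YOUR_API_KEY'  # assumption: replace with your API key
# Hypothetical host and path; mirror the URL used in reciterRetrieval.sh.
BASE_URL = 'https://your-reciter-host/reciter/feature-generator/by/uid'

uids = ['abc2001', 'xyz2002']  # hypothetical person identifiers
for uid in uids:
    response = requests.get(
        BASE_URL,
        params={'uid': uid},  # add field/score parameters as your API supports
        headers={'api-key': API_KEY},
        timeout=60,
    )
    response.raise_for_status()
    # Save one JSON file per person, as the bash script does.
    with open(f'{uid}.json', 'w') as f:
        json.dump(response.json(), f, indent=2)
```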

Importing data into a MySQL database and running queries

  1. If you haven’t installed MySQL on your system, you will need to do so.
    1. Consider MariaDB, a fork of MySQL; MariaDB has a median function, which may come in handy later.
    2. Follow these instructions to install MariaDB.
  2. In your MySQL tool of choice, import the database reciterAnalysis.sql.
  3. Each of the data files corresponds to a table in your newly created database. Select each table and import each one, one by one.

Generating CSV files from JSON output and importing them into a MySQL database

  1. Prerequisites: obtain existing JSON files output from the Feature Generator API.
  2. Install PyMySQL: sudo pip install PyMySQL
  3. Install boto3. From the command line, run pip install boto3 --user.
  4. Import the database reciterAnalysis.sql.
  5. In Terminal, set the following environment variables:
export DB_HOST=[hostname]
export DB_USERNAME=[username]
export DB_PASSWORD=[password]
export DB_NAME=[database name]
  6. Open the uploadToDatabase.py file. Update originalDataPath with the location of the JSON files. Update outputPath with the location of the CSV files to be output.
  7. In Terminal, run python uploadToDatabase.py. (Note: “rank” is a reserved word, so it is surrounded by backticks in the script. A sketch of the database connection appears after these steps.)
  8. This should fully populate a reporting database.
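
Here is a minimal sketch of how a script like uploadToDatabase.py can read the environment variables set above and write rows with PyMySQL; the table, columns, and sample values are illustrative assumptions:

```python
import os
import pymysql

# Connect using the environment variables exported in the previous step.
connection = pymysql.connect(
    host=os.environ['DB_HOST'],
    user=os.environ['DB_USERNAME'],
    password=os.environ['DB_PASSWORD'],
    database=os.environ['DB_NAME'],
)
try:
    with connection.cursor() as cursor:
        # "rank" is a reserved word, hence the backticks.
        cursor.execute(
            "INSERT INTO personArticleAuthor (personIdentifier, pmid, `rank`) "
            "VALUES (%s, %s, %s)",
            ('abc2001', 12345678, 1),  # hypothetical sample row
        )
    connection.commit()
finally:
    connection.close()
```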

Run queries

  1. Now we can run some queries. Here is one to give you the general idea; it flags whether each accepted article’s target author was the first or last author:
select distinct personIdentifier, authorFirstName, authorLastName, a.pmid, articleTitle, e.rank, totalAuthorCount,
case
when e.rank = 1 then 'first'
when e.rank = totalAuthorCount then 'last'
end as firstLast,
publicationTypeCanonical, datePublicationAddedToEntrez
from personArticle a
left join personArticleAuthor e on e.personIdentifier = a.personIdentifier and a.pmid = e.pmid
join (select pmid, max(`rank`) as totalAuthorCount
from personArticleAuthor
group by pmid) x on x.pmid = a.pmid
where a.pmid is not null
and userAssertion = 'ACCEPTED'
and datePublicationAddedToEntrez > '2019-05-01'
and authorTargetAuthor = 1;

Converting ReCiter data from JSON to CSV using Python

These steps show how to convert JSON data retrieved from ReCiter into several CSV files that can be loaded with pandas, a Python data analysis library. (A sketch of the JSON-to-CSV flattening appears after these steps.)

  1. Prerequisites: if you have not done so already...
    1. Download the Visual Studio Code application.
  2. Open ML_Model_Test_upload.py in Visual Studio Code.
  3. Set the originalDataPath equal to the location of where the JSON files are.
  4. Set the outputPath equal to the location of where you wish to output the files from this script.
  5. Save the file.
  6. Execute the Python file:
    1. Option #1: Use the Terminal
      1. In the Terminal, navigate to the location of this file.
      2. Execute python3 ML_Model_Test_upload.py
      3. The code will output a series of CSV files to the same directory as the Python file.
    2. Option #2:
      1. In Visual Studio Code, right-click in the body of the Python file.
      2. Click the “Run Python File in Terminal” option.
      3. The code will output a series of CSV files to the same directory as the Python file.
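
As a point of reference, the flattening looks roughly like the sketch below. The JSON keys (personIdentifier, reCiterArticleFeatures, pmid, totalArticleScoreStandardized) are illustrative assumptions; match them to the actual structure of your Feature Generator output:

```python
import csv
import glob
import json
import os

originalDataPath = '/path/to/json'  # where the downloaded JSON files live
outputPath = '/path/to/csv'         # where the CSV files should be written

with open(os.path.join(outputPath, 'personArticle.csv'), 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['personIdentifier', 'pmid', 'totalArticleScoreStandardized'])
    for path in glob.glob(os.path.join(originalDataPath, '*.json')):
        with open(path) as jf:
            record = json.load(jf)
        # One output row per person/article pair; .get() tolerates missing keys.
        for article in record.get('reCiterArticleFeatures', []):
            writer.writerow([
                record.get('personIdentifier', ''),
                article.get('pmid', ''),
                article.get('totalArticleScoreStandardized', ''),
            ])
```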

Using machine learning to optimize scores and thresholds

Here we will learn how to perform analyses on data retrieved from ReCiter so that we can optimize weights and thresholds from application.properties. We will use pandas, a sophisticated data analysis library.

  1. Install Anaconda Python 3.7+, which includes Jupyter Notebooks.
  2. If you haven’t done so already, create the flat files described in the above section, “Converting ReCiter data from JSON to CSV using Python.” This step prepares data that can be used by the Jupyter Notebook.
  3. Open Anaconda.
  4. Choose Launch Jupyter Notebooks.
  5. If you run the code snippets, you can output any of the following:
    1. Density plot of scores broken down by userAssertion value
    2. Correlation plot of features
    3. Optimized weights for features
  6. Sample output images for each of these are included with the repository (not reproduced here). A sketch of the density-plot step follows.
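
A minimal sketch of the first plot, assuming the flat file has totalArticleScoreStandardized and userAssertion columns (adjust the file and column names to match your CSV output):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical flat file produced by the JSON-to-CSV step above.
df = pd.read_csv('personArticle.csv')

# One density curve per assertion value (e.g., ACCEPTED vs. REJECTED).
for assertion, group in df.groupby('userAssertion'):
    group['totalArticleScoreStandardized'].plot(kind='density', label=assertion)

plt.xlabel('standardized article score')
plt.legend()
plt.show()
```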
