EleMisi / CovidRank

PageRank algorithm on COVID-19 Open Research Dataset

CovidRank

CovidRank is a project developed for the Languages and Algorithms for AI exam of the Master's degree in Artificial Intelligence at the University of Bologna. Its aim is to apply different ranking algorithms to the CORD-19 dataset to identify the most relevant publications in the field of virology.

Available algorithms

The following ranking algorithms were implemented in both a non-distributed and a distributed version:

  • InDegreeRank, which ranks a graph's nodes by their normalized in-degree;
  • PageRank (see References).
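To make the idea of ranking by normalized in-degree concrete, here is a minimal non-distributed sketch. It is not the project's actual code: the object name `InDegreeSketch`, the `Edge` alias, and the choice of normalizer (number of vertices minus one) are assumptions made for illustration.

```scala
object InDegreeSketch {
  // Hypothetical representation: an edge (src, dst) means "src cites dst".
  type Edge = (String, String)

  /** Returns the vertices sorted by decreasing normalized in-degree.
    * Assumption: the in-degree is divided by (number of vertices - 1),
    * the largest in-degree a vertex can have in a graph without self-loops.
    */
  def inDegreeRank(edges: Seq[Edge]): Seq[(String, Double)] = {
    val uniqueEdges = edges.distinct
    val vertices = uniqueEdges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val norm = math.max(vertices.size - 1, 1).toDouble

    // Count the incoming edges of every cited vertex.
    val inDegree: Map[String, Int] =
      uniqueEdges.groupBy(_._2).map { case (v, incoming) => v -> incoming.size }

    vertices
      .map(v => v -> inDegree.getOrElse(v, 0))
      .sortBy { case (v, deg) => (-deg, v) } // highest in-degree first, ties by name
      .map { case (v, deg) => v -> deg / norm }
  }
}
```

Taking the first ten entries of the result (for example with `.take(10)`) gives a top-10 list like the one shown in the Results section.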

The available algorithms are:

  • InDegree
  • DistribuitedInDegree
  • PageRank
  • DistributedPageRank
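As a rough illustration of what the non-distributed PageRank variant computes, the sketch below runs a fixed number of power-iteration steps over an edge list. It follows the textbook formulation only; the object name `PageRankSketch`, the damping factor of 0.85, the iteration count, and the uniform redistribution of rank from vertices without outgoing edges are all assumptions, not details taken from this repository.

```scala
object PageRankSketch {
  // Hypothetical representation: an edge (src, dst) means "src cites dst".
  type Edge = (String, String)

  /** Textbook PageRank by power iteration (assumed parameters, see above). */
  def pageRank(edges: Seq[Edge],
               damping: Double = 0.85,
               iterations: Int = 50): Map[String, Double] = {
    val uniqueEdges = edges.distinct
    val vertices = uniqueEdges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val n = vertices.size

    if (n == 0) Map.empty[String, Double]
    else {
      // Outgoing neighbours of every vertex that cites something.
      val outLinks: Map[String, Seq[String]] =
        uniqueEdges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }

      val initial: Map[String, Double] = vertices.map(v => v -> 1.0 / n).toMap

      (1 to iterations).foldLeft(initial) { (ranks, _) =>
        // Rank held by vertices with no outgoing edges is spread uniformly.
        val danglingMass = vertices.filterNot(outLinks.contains).map(ranks).sum

        // Each vertex splits its rank evenly among the vertices it cites.
        val contributions: Map[String, Double] =
          outLinks.toSeq
            .flatMap { case (src, targets) =>
              targets.map(t => t -> ranks(src) / targets.size)
            }
            .groupBy(_._1)
            .map { case (v, cs) => v -> cs.map(_._2).sum }

        vertices.map { v =>
          val incoming = contributions.getOrElse(v, 0.0) + danglingMass / n
          v -> ((1 - damping) / n + damping * incoming)
        }.toMap
      }
    }
  }
}
```

With this formulation the scores sum to 1, so they can be read as a probability distribution over the articles.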

Available data

From the original dataset, we extracted article names and their citation lists into a graph, then built the following citation graphs by thresholding the in-degree of the vertices:

  • citation_500:

    • Indegree threshold = 500
    • 1,760 nodes
    • 2,747 edges
  • citation_100:

    • Indegree threshold = 100
    • 9,648 nodes
    • 23,437 edges
  • citation_50:

    • Indegree threshold = 50
    • 14,925 nodes
    • 51,814 edges
  • citation_10:

    • Indegree threshold = 10
    • 32,686 nodes
    • 227,433 edges
  • citation_1:

    • Indegree threshold = 1
    • 1,015,682 nodes
    • 1,576,019 edges
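The thresholding step can be pictured with the sketch below. The README does not spell out the exact rule, so this is only one plausible reading: a vertex is kept when its in-degree is at least the threshold, and an edge is kept only when both of its endpoints survive. The object name `ThresholdSketch` and the function name are hypothetical.

```scala
object ThresholdSketch {
  // Hypothetical representation: an edge (src, dst) means "src cites dst".
  type Edge = (String, String)

  /** Keeps the subgraph induced by vertices whose in-degree is >= threshold.
    * Assumption: ">=" rather than ">", and edges are dropped unless both
    * endpoints are kept.
    */
  def thresholdByInDegree(edges: Seq[Edge], threshold: Int): Seq[Edge] = {
    val uniqueEdges = edges.distinct

    // In-degree of every cited vertex in the full graph.
    val inDegree: Map[String, Int] =
      uniqueEdges.groupBy(_._2).map { case (v, incoming) => v -> incoming.size }

    val kept: Set[String] =
      inDegree.collect { case (v, deg) if deg >= threshold => v }.toSet

    uniqueEdges.filter { case (src, dst) => kept(src) && kept(dst) }
  }
}
```

Under this reading, raising the threshold shrinks both the vertex set and the edge set, which matches the trend in the node and edge counts listed above.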

Running the tests

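You can test the program by typing the following command, where <citations_graph> is one of the graphs listed under Available data (for example citation_500) and <algorithm> is one of the available algorithms (for example PageRank):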

run <citations_graph> <algorithm>

Results

We performed our tests on two different configurations:

  • Local machine: 32 GB RAM, Intel Core i7-7700K CPU
  • AWS: 2 machines with 4 cores and 16 GB of memory

In the figures below we report the comparison results on the 5 citation graphs:

[Figure: Algorithms performances 1]

[Figure: Algorithms performances 2]

As an example, we report the top 10 articles extracted from citation_1 using DistribuitedInDegree:

Position  Title
1         Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
2         Identification of a novel coronavirus in patients with severe acute respiratory syndrome
3         A novel coronavirus associated with severe acute respiratory syndrome
4         Coronavirus as a possible cause of severe acute respiratory syndrome
5         Characterization of a novel coronavirus associated with severe acute respiratory syndrome
6         Angiotensin-converting enzyme 2 is a functional receptor for the SARS coronavirus
7         Bats are natural reservoirs of SARS-like coronaviruses
8         The molecular biology of coronaviruses
9         Global trends in emerging infectious diseases
10        A major outbreak of severe acute respiratory syndrome in Hong Kong

Built With

Authors

License

This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.

References

PageRank - Wikipedia

Languages

Language:Scala 100.0%