AIP

Article Information Parser is an instrument to parse, unify, and in some cases correct article meta-data. AIP creates a PostgreSQL database that allows for easily finding related work.

Developing such a database is tricky. The following excerpt from our article introducing this instrument explains why:

Current information sources do not cover the spectrum of the systems community entirely.
For example, DBLP -- which specifically focuses on computer science articles -- lacks certain venues and does not record article abstracts.
Other datasets such as Semantic Scholar and AMiner have similar and other limitations.
Moreover, these datasets also overlap, yet contain important information the others do not offer; they are disjoint.
Our approach is to parse each dataset and filter and unify the information provided.

This instrument combines three data sources: DBLP, Semantic Scholar, and AMiner, which we filter and store in a PostgreSQL database. DBLP is a well-known European archive that focuses on computer science and features all the top-level venues (journals and conferences). Semantic Scholar is an American project created by the Allen Institute for AI. The project aims to analyze and extract important data from scientific publications. AMiner is an Asian project that aims to provide a knowledge graph for mining academic social networks. Both AMiner and Semantic Scholar now incorporate Microsoft's Academic Graph (MAG) in their datasets.

AIP tackles several non-trivial challenges in unifying these datasets:

Data discrepancies between sources. For example, titles in DBLP end with a dot, whereas they do not in the Semantic Scholar and AMiner corpuses, causing exact matching to fail.
Titles and abstracts may contain encoded characters leading to mismatching articles that are in fact the same.
Despite all data sources having a format specified, we encountered several instances where the format specified is not adhered to, or the data is malformed.
Venue strings being different among these sources. Some sources use an abbreviation, some use a BibTeX string, etc. AIP maps all these occurrences to the same abbreviation.
Complementing existing entries. For example, DBLP does not offer abstracts whilst Semantic Scholar and AMiner do.
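As an illustration of the title and venue challenges above, the following is a minimal sketch of how titles and venue strings could be normalized before matching. It is not AIP's actual implementation: the function names, the normalization steps, and the small alias table are assumptions made for this example.

```python
import html
import re
import unicodedata

# Hypothetical alias table for illustration; AIP's real venue mapping is not shown here.
VENUE_ALIASES = {
    "proceedings of the vldb endowment": "VLDB",
    "proc. vldb endow.": "VLDB",
    "pvldb": "VLDB",
}


def normalize_title(title: str) -> str:
    """Return a comparison key for a title (illustrative rules, not AIP's)."""
    text = html.unescape(title)                 # decode entities such as &amp; or &eacute;
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    text = text.rstrip(".")                     # drop the trailing dot that DBLP titles carry
    return text.casefold()                      # case-insensitive comparison


def normalize_venue(venue: str) -> str:
    """Map a venue string to its abbreviation, or return it unchanged if unknown."""
    key = re.sub(r"\s+", " ", venue).strip().casefold()
    return VENUE_ALIASES.get(key, venue.strip())
```

Two records would then be treated as the same article when their normalized titles are equal, for example `normalize_title("A Study of Caf&eacute; Queues.") == normalize_title("a study of café queues")`.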

How to run AIP

We developed two scripts that run AIP and generate the database from the raw data sources.

The steps to run AIP are as follows:

Clone this repository.
Update the PostgreSQL settings in database_manager.py.
Download the released datasets from the three sources and store them in a directory.
Run either one of the two scripts mentioned earlier, or run parse_dblp.py, parse_semantic_scholar.py, or parse_aminer.py separately, passing the root of the data directory as the argument.

Check which arguments each script accepts (such as file locations) for more options.
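The last step can be scripted. The sketch below only builds the command lines for the three parsers; it assumes each parser takes the data root as a single positional argument and that the order shown is acceptable, neither of which is confirmed here. The helper name `build_parser_commands` is hypothetical.

```python
import sys
from pathlib import Path

# Script names as listed in the steps above.
PARSERS = ("parse_dblp.py", "parse_semantic_scholar.py", "parse_aminer.py")


def build_parser_commands(data_root: str, repo_dir: str = ".") -> list[list[str]]:
    """Build one command per parser, passing the data root as the only argument (assumed)."""
    root = str(Path(data_root))
    return [[sys.executable, str(Path(repo_dir) / name), root] for name in PARSERS]


if __name__ == "__main__":
    for command in build_parser_commands("/path/to/datasets"):
        # To actually run a parser, pass the list to subprocess.run(command, check=True).
        print(" ".join(command))
```

Check each script's accepted arguments before relying on this form.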

AIP database structure

The database contains the following tables:

publications

Column name	Explanation
id	A unique id for the paper, usually the ID assigned by DBLP.
venue	The abbreviation of the venue the article was accepted at.
year	The publishing year.
volume	(Optional) The volume of the journal the article was included in.
title	The title of the article.
doi	The DOI of the article. If there are multiple, the first one is usually used.
abstract	The abstract of the article (if present in one of the datasets).
n_citations	The number of times this article has been cited.

authors

Column name	Explanation
id	A unique identifier per author; this is the id used by DBLP.
name	The full name of the author.
orcid	The ORCID of the author if known.

author_paper_pairs

This table links authors to publications. The name uses paper rather than article for legacy reasons.

Column name	Explanation
author_id	The id of an author.
paper_id	The id of an article the author (co-)authored.

cites

This table is currently not used. In the future it will contain two article ids per row, recording which paper cites which.

properties

Column name	Explanation
last_modified	The date when the contents of the database were last altered.
version	The version of the database content. Each script that modifies the database should increment this counter once it is done.
db_schema_version	The version of the database schema. We use this to incrementally alter the database (adding indices, modifying/deleting/adding tables, etc.)
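To illustrate how the publications, authors, and author_paper_pairs tables relate, here is a small self-contained sketch. It uses an in-memory SQLite database instead of PostgreSQL so that it runs without a server. The column names follow the tables above, while the column types, the helper names, and the sample rows are assumptions made for this example.

```python
import sqlite3

# Column names come from the tables above; the types are assumptions.
SCHEMA = """
CREATE TABLE publications (
    id TEXT PRIMARY KEY, venue TEXT, year INTEGER, volume TEXT,
    title TEXT, doi TEXT, abstract TEXT, n_citations INTEGER
);
CREATE TABLE authors (id TEXT PRIMARY KEY, name TEXT, orcid TEXT);
CREATE TABLE author_paper_pairs (
    author_id TEXT REFERENCES authors(id),
    paper_id TEXT REFERENCES publications(id)
);
"""


def titles_by_author(conn, author_name):
    """Return the titles of all articles (co-)authored by author_name, newest first."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM publications AS p
        JOIN author_paper_pairs AS ap ON ap.paper_id = p.id
        JOIN authors AS a ON a.id = ap.author_id
        WHERE a.name = ?
        ORDER BY p.year DESC
        """,
        (author_name,),
    ).fetchall()
    return [title for (title,) in rows]


def build_demo_database():
    """Create the sketch schema in memory and add placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder rows, invented purely for this demonstration.
    conn.execute(
        "INSERT INTO publications (id, venue, year, title) VALUES (?, ?, ?, ?)",
        ("paper-1", "VENUE", 2020, "An example title"),
    )
    conn.execute(
        "INSERT INTO authors (id, name) VALUES (?, ?)",
        ("author-1", "Example Author"),
    )
    conn.execute(
        "INSERT INTO author_paper_pairs (author_id, paper_id) VALUES (?, ?)",
        ("author-1", "paper-1"),
    )
    return conn
```

The same join should carry over to the PostgreSQL database, with the placeholder style adapted to the driver in use.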

Query Example

The following SQL command returns papers from 2011 onwards that contain each of the keywords performance, analysis, and quality in either the title or the abstract, sorted by year in descending order.

SELECT * FROM publications WHERE year >= 2011
AND (lower(title) LIKE '%performance%' 
	OR lower(abstract) LIKE '%performance%')
AND (lower(title) LIKE '%analysis%'
	OR lower(abstract) LIKE '%analysis%')
AND (lower(title) LIKE '%quality%'
	OR lower(abstract) LIKE '%quality%')
ORDER BY year DESC
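The query above can also be generated for an arbitrary list of keywords. The sketch below builds the SQL text and a parameter list using `%s` placeholders, the style accepted by psycopg2. That AIP users would run it through psycopg2 is an assumption, and `build_keyword_query` is a hypothetical helper that is not part of AIP.

```python
def build_keyword_query(keywords, min_year):
    """Return (sql, params) for papers whose title or abstract contains every keyword."""
    clauses = ["year >= %s"]
    params = [min_year]
    for keyword in keywords:
        # Each keyword must appear in the title or the abstract (case-insensitive).
        clauses.append("(lower(title) LIKE %s OR lower(abstract) LIKE %s)")
        pattern = f"%{keyword.lower()}%"
        params.extend([pattern, pattern])
    sql = (
        "SELECT * FROM publications WHERE "
        + " AND ".join(clauses)
        + " ORDER BY year DESC"
    )
    return sql, params


sql, params = build_keyword_query(["performance", "analysis", "quality"], 2011)
```

With an open database cursor, the pair would be passed as `cursor.execute(sql, params)`. Passing the keywords as parameters avoids building SQL from user input, although `%` or `_` inside a keyword still act as LIKE wildcards.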
