Java gEneric DAta Integration (JedAI) Toolkit

JedAI is an open source, highly scalable toolkit that offers out-of-the-box solutions for data integration tasks such as Record Linkage, Entity Resolution (ER) and Link Discovery. At its core lies a set of domain-independent, state-of-the-art techniques that apply to both RDF and relational data. These techniques achieve their scalability through approximate, schema-agnostic (meta-)blocking.

JedAI can be used in three different ways:

As an open source library that implements numerous state-of-the-art methods for all steps of the end-to-end ER workflow presented in the figure below.
As a desktop application with an intuitive Graphical User Interface that can be used by both expert and lay users.
As a workbench that compares the relative performance of different (configurations of) ER workflows.

This repository contains the code (in Java 8) of JedAI's open source library. The code of JedAI's desktop application and workbench is available in this repository.

Several datasets already converted into the serialized data type of JedAI can be found here.

You can find a short presentation of JedAI Toolkit here.

Citation

If you use JedAI, please cite the following paper:

George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: "JedAI: The Force behind Entity Resolution", in ESWC 2017 (pdf).

Consortium

JedAI is a collaboration project involving the following partners:

JedAI Workflow

JedAI implements a schema-agnostic, domain-independent end-to-end workflow for both Clean-Clean and Dirty ER that consists of 7 steps, as shown in the following image:

Below, we explain in more detail the purpose and the functionality of every step.

Data Reading

It transforms the input data into a list of entity profiles. An entity is a uniquely identified set of name-value pairs (e.g., an RDF resource with its URI as identifier and its set of predicates and objects as name-value pairs).
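As a rough illustration of this data model, the sketch below represents an entity profile as an identifier plus a list of name-value pairs. The class and method names (`EntityProfileSketch`, `Attribute`, `addAttribute`) and the example URIs are assumptions made for this example and are not taken from JedAI's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of an entity profile; not JedAI's actual class.
final class EntityProfileSketch {

    // One name-value pair, e.g. an RDF predicate and its object.
    static final class Attribute {
        final String name;
        final String value;

        Attribute(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final String id; // unique identifier, e.g. the URI of an RDF resource
    private final List<Attribute> attributes = new ArrayList<>();

    EntityProfileSketch(String id) {
        this.id = id;
    }

    void addAttribute(String name, String value) {
        attributes.add(new Attribute(name, value));
    }

    String getId() {
        return id;
    }

    List<Attribute> getAttributes() {
        return Collections.unmodifiableList(attributes);
    }

    public static void main(String[] args) {
        EntityProfileSketch profile = new EntityProfileSketch("http://example.org/person/1");
        profile.addAttribute("http://example.org/name", "Ada Lovelace");
        profile.addAttribute("http://example.org/birthYear", "1815");
        System.out.println(profile.getId() + " has " + profile.getAttributes().size() + " attributes");
    }
}
```

Keeping the attributes as an ordered list rather than a map allows the same attribute name to appear several times, which is common for multi-valued RDF predicates.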

The following formats are currently supported:

CSV
RDF (any format, including XML, OWL)
SQL (MySQL, PostgreSQL)
SPARQL endpoints
Java serialized objects

The next version will add support for more formats: JSON, MongoDB, Oracle and SQL Server.

Block Building

It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted from each key, possibly after transforming it, based on its equality or similarity with other keys.
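The simplest case, where every token is a key and blocks are formed by key equality, can be sketched as follows. This is a minimal illustration and not JedAI's implementation: the class name `TokenBlockingSketch`, the integer entity ids and the tokenization rule (lowercasing and splitting on non-word characters) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of schema-agnostic token blocking.
final class TokenBlockingSketch {

    // entities.get(i) holds all attribute values of entity i; attribute names are ignored.
    static Map<String, Set<Integer>> buildBlocks(List<List<String>> entities) {
        Map<String, Set<Integer>> blocks = new TreeMap<>();
        for (int id = 0; id < entities.size(); id++) {
            for (String value : entities.get(id)) {
                // Assumed tokenization: lowercase, split on runs of non-word characters.
                for (String token : value.toLowerCase().split("\\W+")) {
                    if (token.isEmpty()) {
                        continue;
                    }
                    blocks.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(id);
                }
            }
        }
        // A block with a single entity yields no comparison, so it is dropped.
        blocks.values().removeIf(ids -> ids.size() < 2);
        return blocks;
    }

    public static void main(String[] args) {
        List<List<String>> entities = new ArrayList<>();
        entities.add(Arrays.asList("Apple iPhone 7", "black"));
        entities.add(Arrays.asList("iphone 7 by Apple"));
        entities.add(Arrays.asList("Samsung Galaxy", "black"));
        // Entities 0 and 1 share the blocks "7", "apple" and "iphone"; 0 and 2 share "black".
        System.out.println(buildBlocks(entities));
    }
}
```

Because an entity is placed in one block per token, the resulting blocks overlap, which is what the subsequent cleaning steps address.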

The following methods are currently supported:

Standard/Token Blocking
Sorted Neighborhood
Extended Sorted Neighborhood
Q-Grams Blocking
Extended Q-Grams Blocking
Suffix Arrays Blocking
Extended Suffix Arrays Blocking

For more details on the functionality of these methods, see here.

Block Cleaning

Its goal is to clean a set of overlapping blocks from unnecessary comparisons, which can be either redundant (i.e., repeated comparisons that have already been executed in a previously examined block) or superfluous (i.e., comparisons that involve non-matching entities). Its methods operate on the coarse level of individual blocks or entities.

The following methods are currently supported:

Size-based Block Purging
Comparison-based Block Purging
Block Filtering

All methods are optional, but they complement each other and can be used in combination. For more details on the functionality of these methods, see here.
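To give an idea of what block-level cleaning looks like, the sketch below discards every block whose size exceeds a fixed limit, on the assumption that oversized blocks stem from very frequent tokens and contribute mostly superfluous comparisons. The class name `BlockPurgingSketch` and the `maxBlockSize` parameter are hypothetical; how JedAI's purging methods actually choose their thresholds is not described here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of size-based purging with a caller-supplied limit.
final class BlockPurgingSketch {

    // Keeps only the blocks that contain at most maxBlockSize entities.
    static Map<String, Set<Integer>> purge(Map<String, Set<Integer>> blocks, int maxBlockSize) {
        Map<String, Set<Integer>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Integer>> entry : blocks.entrySet()) {
            if (entry.getValue().size() <= maxBlockSize) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    // Number of pairwise comparisons inside a Dirty ER block with n entities.
    static long comparisons(int n) {
        return (long) n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> blocks = new LinkedHashMap<>();
        blocks.put("apple", new LinkedHashSet<>(Arrays.asList(0, 1)));
        blocks.put("the", new LinkedHashSet<>(Arrays.asList(0, 1, 2, 3, 4, 5)));
        // The block "the" alone would require 15 comparisons and is removed with limit 3.
        System.out.println(purge(blocks, 3).keySet());
    }
}
```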

Comparison Cleaning

Similar to Block Cleaning, this step aims to clean a set of blocks from both redundant and superfluous comparisons. Unlike Block Cleaning, its methods operate on the finer granularity of individual comparisons.

The following methods are currently supported:

Comparison Propagation
Cardinality Edge Pruning (CEP)
Cardinality Node Pruning (CNP)
Weighted Edge Pruning (WEP)
Weighted Node Pruning (WNP)
Reciprocal Cardinality Node Pruning (ReCNP)
Reciprocal Weighted Node Pruning (ReWNP)

Most of these methods are Meta-blocking techniques. All methods are optional, but competitive, in the sense that only one of them can be part of an ER workflow. For more details on the functionality of these methods, see here.
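The notion of a redundant comparison can be made concrete with a small sketch: two entities that co-occur in several blocks should be compared only once. The code below collects the distinct pairs with a plain set, which has the same effect as removing redundant comparisons but is not necessarily how Comparison Propagation is implemented in JedAI. The class name `DistinctComparisonsSketch` and the integer ids are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: keep every pair of co-occurring entities exactly once.
final class DistinctComparisonsSketch {

    static Set<List<Integer>> distinctPairs(List<Set<Integer>> blocks) {
        Set<List<Integer>> pairs = new LinkedHashSet<>();
        for (Set<Integer> block : blocks) {
            List<Integer> ids = new ArrayList<>(block);
            for (int i = 0; i < ids.size(); i++) {
                for (int j = i + 1; j < ids.size(); j++) {
                    // Store each pair as (smaller id, larger id) so duplicates collapse.
                    int a = Math.min(ids.get(i), ids.get(j));
                    int b = Math.max(ids.get(i), ids.get(j));
                    pairs.add(Arrays.asList(a, b));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Set<Integer>> blocks = new ArrayList<>();
        blocks.add(new LinkedHashSet<>(Arrays.asList(0, 1)));
        blocks.add(new LinkedHashSet<>(Arrays.asList(0, 1, 2)));
        blocks.add(new LinkedHashSet<>(Arrays.asList(1, 2)));
        // The blocks contain 5 comparisons in total, but only 3 distinct pairs.
        System.out.println(distinctPairs(blocks));
    }
}
```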

Entity Matching

It compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Its output comprises the similarity graph, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities.

The following schema-agnostic methods are currently supported:

Group Linkage
Profile Matcher, which aggregates all attribute values of an individual entity into a single textual representation.

Both methods can be combined with the following representation models:

character n-grams (n=2, 3 or 4)
character n-gram graphs (n=2, 3 or 4)
token n-grams (n=1, 2 or 3)
token n-gram graphs (n=1, 2 or 3)

For more details on the functionality of these bag and graph models, see here.

The bag models can be combined with the following similarity measures, using both TF and TF-IDF weights:

ARCS similarity
Cosine similarity
Jaccard similarity
Generalized Jaccard similarity
Enhanced Jaccard similarity
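As an example of how a bag model and a similarity measure fit together, the sketch below computes the plain set-based Jaccard similarity of two textual representations over token unigrams. It ignores the TF and TF-IDF weighting mentioned above, and the class name `JaccardSketch` and the tokenization rule are assumptions, so it should be read as an illustration rather than as JedAI's measure.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: unweighted Jaccard similarity over token unigrams.
final class JaccardSketch {

    // Assumed tokenization: lowercase, split on runs of non-word characters.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        result.remove("");
        return result;
    }

    // Returns |A ∩ B| / |A ∪ B|, a value in [0,1]; two empty inputs yield 0.
    static double jaccard(String a, String b) {
        Set<String> tokensA = tokens(a);
        Set<String> tokensB = tokens(b);
        if (tokensA.isEmpty() && tokensB.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        int unionSize = tokensA.size() + tokensB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // 3 shared tokens out of 4 distinct tokens: prints 0.75
        System.out.println(jaccard("Apple iPhone 7", "iphone 7 by Apple"));
    }
}
```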

The graph models can be combined with the following graph similarity measures:

Containment similarity
Normalized Value similarity
Value similarity
Overall Graph similarity

Entity Clustering

It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.

The following domain-independent methods are currently supported for Dirty ER:

Center Clustering
Connected Components Clustering
Cut Clustering
Markov Clustering
Merge-Center Clustering
Ricochet SR Clustering

For more details on the functionality of these methods, see here.
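For intuition, Connected Components Clustering can be sketched with a union-find structure: every edge whose similarity reaches a threshold merges the clusters of its two endpoints, so the final clusters are the connected components of the thresholded similarity graph. The `Edge` class, the integer ids and the `threshold` parameter are assumptions of this example, not JedAI's API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of connected components over a thresholded similarity graph.
final class ConnectedComponentsSketch {

    static final class Edge {
        final int a;
        final int b;
        final double similarity;

        Edge(int a, int b, double similarity) {
            this.a = a;
            this.b = b;
            this.similarity = similarity;
        }
    }

    // Follows parent links to the cluster representative, halving the path on the way.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Entities are numbered 0..entityCount-1; edges below the threshold are ignored.
    static Collection<List<Integer>> cluster(int entityCount, List<Edge> edges, double threshold) {
        int[] parent = new int[entityCount];
        for (int i = 0; i < entityCount; i++) {
            parent[i] = i;
        }
        for (Edge edge : edges) {
            if (edge.similarity >= threshold) {
                parent[find(parent, edge.a)] = find(parent, edge.b);
            }
        }
        Map<Integer, List<Integer>> clusters = new TreeMap<>();
        for (int i = 0; i < entityCount; i++) {
            clusters.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
        }
        return clusters.values();
    }

    public static void main(String[] args) {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(0, 1, 0.9));
        edges.add(new Edge(1, 2, 0.8));
        edges.add(new Edge(3, 4, 0.3)); // below the threshold, so 3 and 4 stay apart
        System.out.println(cluster(5, edges, 0.5)); // prints [[0, 1, 2], [3], [4]]
    }
}
```

Note that entities 0 and 2 end up in the same cluster although they were never compared directly: the method takes the transitive closure of the retained matches.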

For Clean-Clean ER, only one method is supported:

Unique Mapping Clustering

For more details on its functionality, see here.
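In Clean-Clean ER each input collection is duplicate-free, so an entity can match at most one entity from the other collection. The sketch below enforces this one-to-one constraint greedily: edges are visited in decreasing similarity, and a pair is accepted only if neither of its entities has been matched yet. The greedy strategy, the `threshold` parameter and all names (`UniqueMappingSketch`, `Edge`, `match`) are assumptions of this example and may differ from JedAI's implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of greedy one-to-one matching for Clean-Clean ER.
final class UniqueMappingSketch {

    // An edge links entity `left` of the first collection to entity `right` of the second.
    static final class Edge {
        final int left;
        final int right;
        final double similarity;

        Edge(int left, int right, double similarity) {
            this.left = left;
            this.right = right;
            this.similarity = similarity;
        }
    }

    static List<Edge> match(List<Edge> edges, double threshold) {
        List<Edge> sorted = new ArrayList<>(edges);
        // Highest similarity first.
        sorted.sort((x, y) -> Double.compare(y.similarity, x.similarity));
        Set<Integer> usedLeft = new HashSet<>();
        Set<Integer> usedRight = new HashSet<>();
        List<Edge> matches = new ArrayList<>();
        for (Edge edge : sorted) {
            if (edge.similarity < threshold) {
                break; // all remaining edges are weaker still
            }
            if (usedLeft.contains(edge.left) || usedRight.contains(edge.right)) {
                continue; // one of the two entities is already matched
            }
            usedLeft.add(edge.left);
            usedRight.add(edge.right);
            matches.add(edge);
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(0, 0, 0.9));
        edges.add(new Edge(0, 1, 0.8));
        edges.add(new Edge(1, 1, 0.6));
        edges.add(new Edge(2, 1, 0.7));
        // Accepts (0,0) and (2,1); (0,1) and (1,1) conflict with earlier matches.
        for (Edge edge : match(edges, 0.5)) {
            System.out.println(edge.left + " <-> " + edge.right);
        }
    }
}
```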
