f4lco / metanome-algorithms

Source code for several Metanome data profiling algorithms

Metanome Algorithm Repository

This repository contains several data profiling algorithms for the Metanome platform. The algorithms have been implemented by students of the information systems group at the Hasso-Plattner-Institut (HPI) in the context of the Metanome project.

Installation

Before building the algorithms, the following prerequisites need to be installed:

  • Java JDK 1.8 or later
  • Maven 3.1.0
  • Git
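Before starting a build, it can help to confirm that the tools are actually on the PATH. The following is a minimal sketch, not part of the repository; the function name check_prerequisites is made up for illustration, and it only checks that the executables exist, not their versions.

```shell
#!/bin/sh
# Hypothetical helper: report which required build tools are missing from PATH.
# Prints one line per missing tool and returns non-zero if any is missing.
check_prerequisites() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# java, mvn and git are the executables behind the prerequisites listed above.
if check_prerequisites java mvn git; then
    echo "all prerequisites found"
fi
```

The installed versions still need to be compared by hand against the list above, for example with java -version, mvn -v and git --version.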

Because all profiling algorithms use the Metanome platform as a dependency, Metanome itself needs to be installed in the local Maven repository first. Visit the Metanome GitHub page, check out the sources, and build them with the following command:

.../metanome$ mvn install

Then, all algorithms can be built with this command:

.../metanome-algorithms$ MAVEN_OPTS="-Xmx1g -Xms20m -Xss10m" mvn clean install

Alternatively, you can open the algorithms project in your IDE of choice, specify -Xmx1g -Xms20m -Xss10m as build parameters, and run mvn clean install from there.

The build creates one "fatjar" for each algorithm in the repository. After the build has succeeded, run either the collect.bat (Windows) or collect.sh (Linux) script to copy all created algorithm jars into one folder named "COLLECTION". You can then choose the algorithms you need and copy them into a Metanome deployment.
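The last step, picking algorithms from "COLLECTION" and copying them into a deployment, can be scripted. The sketch below is an illustration under assumptions: the function name deploy_algorithms is hypothetical, the target directory is whatever folder your Metanome deployment loads algorithm jars from (not specified here), and jar file names are assumed to start with the algorithm name and contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: copy selected algorithm jars from the COLLECTION
# folder into a target directory.
# Usage: deploy_algorithms <collection-dir> <target-dir> <name-prefix>...
deploy_algorithms() {
    collection=$1
    target=$2
    shift 2
    mkdir -p "$target"
    for prefix in "$@"; do
        for jar in "$collection"/$prefix*.jar; do
            # If nothing matched, the glob stays literal and the file does not exist.
            [ -e "$jar" ] || { echo "no jar matching: $prefix" >&2; continue; }
            cp "$jar" "$target"/
            echo "copied: $(basename "$jar")"
        done
    done
}

# Example call (algorithm name and target path are placeholders):
# deploy_algorithms COLLECTION /path/to/deployment/algorithms ExampleAlgorithm
```

Selecting by name prefix keeps the call independent of the version suffix that the build may append to each jar.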

Headless deployment

To run the Metanome algorithms without a full Metanome deployment, consider the Metanome-cli project. This project extends the Metanome framework with a command line interface, so you can configure and execute the jars from a shell. If you need to integrate Metanome algorithms into your own projects, the Metanome-cli implementation can serve as a reference for how to do so.

Adding new algorithms

All algorithms in this repository are continuously maintained and upgraded to newer versions with every release of the Metanome framework. To add a new algorithm to the repository, follow these steps:

  1. Copy the algorithm maven project into a subdirectory of the algorithms repository.

  2. Use the following pattern for the naming of your algorithm artifact:

      <groupId>de.metanome.algorithms.[algorithm-name-lowercase]</groupId>
      <artifactId>[algorithm-name]</artifactId>
      <packaging>jar</packaging>
      <name>[algorithm-name]</name>
    
  3. Set the parent pom to the root pom using the root's current version:

      <parent>
        <groupId>de.metanome.algorithms</groupId>
        <artifactId>algorithms</artifactId>
        <version>1.1-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
      </parent>
    
  4. Add the algorithm project as a module to the root pom of the repository.

  5. Remove the version tags of your project and of all dependencies on Metanome subprojects; these versions are inherited from the root pom.

  6. Remove unnecessary repository information, e.g., all repositories that are defined in root/parent should not be duplicated.

  7. Add a copy command for the jar file of the new algorithm to the collect.bat and collect.sh scripts.
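Some of these steps are easy to forget, so a quick text check can help before committing. The sketch below is an assumption-laden illustration, not a script from the repository: the function name check_algorithm_module is hypothetical, it assumes the module directory name equals the module entry in the root pom, and it uses plain text search rather than XML parsing, so it only covers steps 3, 4 and 7.

```shell
#!/bin/sh
# Hypothetical helper: check that a new algorithm module is wired into the
# repository as described in the steps above.
# Usage: check_algorithm_module <repo-root> <module-dir-name>
check_algorithm_module() {
    root=$1
    module=$2
    problems=0

    # Step 4: the root pom must list the module.
    if ! grep -qF "<module>$module</module>" "$root/pom.xml"; then
        echo "root pom.xml does not list module $module"
        problems=1
    fi

    # Step 3: the module pom must point at the root pom as its parent.
    if ! grep -qF "<relativePath>../pom.xml</relativePath>" "$root/$module/pom.xml"; then
        echo "$module/pom.xml does not reference ../pom.xml as parent"
        problems=1
    fi

    # Step 7: both collect scripts must mention the module.
    for script in collect.sh collect.bat; do
        if ! grep -qF "$module" "$root/$script"; then
            echo "$script has no copy command mentioning $module"
            problems=1
        fi
    done

    return "$problems"
}

# Example call from the repository root (module name is a placeholder):
# check_algorithm_module . ExampleAlgorithm
```

The function prints one line per problem and returns a non-zero status if it found any, so it can also be used as a guard in a larger script.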

License: Apache License 2.0

