Smart Crawler for Big Data Integration

This project aims to produce a set of tools that help big data integration engineers model data automatically, within a given confidence interval.

General Architecture

Getting Started

These instructions will help you get the project running locally for development and testing purposes.

Prerequisites

  1. NetBeans
  2. A Hadoop cluster

Installing

To start developing:

  1. Install NetBeans;
  2. Clone the Git repository;
  3. Open it as a Maven project.

Deployment

  • Package the project with Maven and deploy the resulting artifacts to the big data infrastructure.
  • Update the endpoints in AtlasClient.AtlasCosumer (a hedged example of configuring an Atlas endpoint follows below).
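
The snippet below is a minimal sketch of pointing an Atlas client at a new endpoint and checking connectivity, using the standard AtlasClientV2 API. The host, port, credentials and the class name AtlasEndpointSketch are placeholders; the project's own AtlasClient.AtlasCosumer class is assumed to hold the equivalent settings.

```java
import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.AtlasServiceException;
import org.apache.atlas.model.SearchFilter;
import org.apache.atlas.model.typedef.AtlasTypesDef;

public class AtlasEndpointSketch {

    // Placeholder endpoint and credentials; replace with your cluster's values
    // (and avoid hard-coding credentials in a real deployment).
    private static final String[] ATLAS_URLS = { "http://atlas-host.example.com:21000" };
    private static final String[] BASIC_AUTH = { "admin", "admin" };

    public static void main(String[] args) throws AtlasServiceException {
        AtlasClientV2 atlas = new AtlasClientV2(ATLAS_URLS, BASIC_AUTH);

        // Simple connectivity check: list the type definitions known to this Atlas instance.
        AtlasTypesDef types = atlas.getAllTypeDefs(new SearchFilter());
        System.out.println("Entity type definitions visible: " + types.getEntityDefs().size());
    }
}
```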

Running the tests

  • Quality tests: select the tables to profile in basicProfiler.Profiler (a hedged profiling sketch follows below)
  • Similarity tests: select the tables to compare in Similarity.SimilarityAnalysis
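
As a rough illustration of what a quality test can compute, the sketch below uses Spark's Java API to count rows and per-column null ratios for one table. The table name default.customers is hypothetical, and the class is not the project's basicProfiler.Profiler; it only shows the kind of metric such a profiler might produce.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class ProfilerSketch {
    public static void main(String[] args) {
        // Assumes the job is submitted to a cluster with access to the Hive metastore.
        SparkSession spark = SparkSession.builder()
                .appName("basic-profiler-sketch")
                .enableHiveSupport()
                .getOrCreate();

        // Hypothetical table; in the project the tables are selected in basicProfiler.Profiler.
        Dataset<Row> table = spark.table("default.customers");
        long rows = table.count();
        System.out.println("rows = " + rows);

        // Null ratio per column: one very small example of a data-quality metric.
        for (String column : table.columns()) {
            long nulls = table.filter(col(column).isNull()).count();
            double ratio = rows == 0 ? 0.0 : 100.0 * nulls / rows;
            System.out.printf("%s: %.2f%% null%n", column, ratio);
        }

        spark.stop();
    }
}
```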

Built With

  • Spark - Large-scale data processing engine
  • Atlas - Data Governance and Metadata framework for Hadoop
  • Ranger - Enable, monitor and manage comprehensive data security across the Hadoop platform.

Authors

  • José Magalhães
  • João Galvão
  • Maria Inês Costa

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

License

This project is currently internal.

Acknowledgments

  • Cheers to the LID4 community
