sekikn / gobblin

Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems. Gobblin features integrations with Apache Hadoop, Apache Kafka, Salesforce, S3, MySQL, Google etc.

Home Page: https://github.com/linkedin/gobblin/wiki

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is designed and optimized for ELT patterns, with lightweight inline transformations on ingest (the "small t").
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

  • Battle tested at scale: Runs in production at petabyte-scale at companies like LinkedIn, PayPal, Verizon etc.
  • Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance etc.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Data sync across federated data lakes (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrating external vendor APIs (e.g. Salesforce, Dynamics) with data stores (e.g. HDFS, Couchbase)
  • Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to Spark, Hive etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3
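
The Java requirement above can be checked before starting a build. The following is a minimal sketch, not part of Gobblin: the helper names java_major and java_ok are hypothetical, and it assumes the version string follows either the legacy 1.x scheme (1.8.0_292) or the newer scheme (11.0.2).

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Gobblin) for checking Java >= 1.8.

# Reduce a Java version string to its major version number.
# "1.8.0_292" -> 8 (legacy 1.x scheme), "11.0.2" -> 11, "17-ea" -> 17.
java_major() {
  v=$1
  case "$v" in
    1.*) v=${v#1.} ;;   # drop the legacy "1." prefix
  esac
  echo "${v%%[._-]*}"   # keep everything before the first '.', '_' or '-'
}

# Succeed only when the given version string is Java 8 (1.8) or newer.
java_ok() {
  [ "$(java_major "$1")" -ge 8 ]
}
```

To apply it to the installed JDK, pass in the quoted version from the first line of the java -version output, for example: java_ok "$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')" && echo "Java version OK".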

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. To skip tests and build the distribution, run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain; the distribution will be created in the build/gobblin-distribution/distributions directory.
  3. Alternatively, to run tests and build the distribution (requires Maven), run ./gradlew build; the distribution will be created in the build/gobblin-distribution/distributions directory.
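
The two build variants above differ only in the arguments passed to ./gradlew. As a rough sketch, a small wrapper can make that choice explicit. The function name gradle_args and the mode names skip-tests and with-tests are hypothetical; only the Gradle arguments themselves come from the instructions above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Gobblin): print the Gradle arguments
# for the chosen build mode, as listed in the build instructions.
gradle_args() {
  case "$1" in
    skip-tests)
      # Build the distribution without findbugs, tests, RAT and checkstyle.
      echo "build -x findbugsMain -x test -x rat -x checkstyleMain"
      ;;
    with-tests)
      # Full build, including tests (requires Maven).
      echo "build"
      ;;
    *)
      echo "usage: gradle_args skip-tests|with-tests" >&2
      return 1
      ;;
  esac
}
```

From the extracted source directory this could be used as ./gradlew $(gradle_args skip-tests), leaving the command substitution unquoted so that each argument is passed to Gradle separately.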

License: Apache License 2.0


Languages

Java 98.4%, Shell 0.7%, Python 0.3%, JavaScript 0.3%, CSS 0.1%, HTML 0.1%, XSLT 0.1%, Groovy 0.0%, Dockerfile 0.0%, Roff 0.0%