Intel Big Data Analytic Toolkit

Big Data Analytic Toolkit (BDTK) is a set of acceleration libraries that aims to optimize big data analytic frameworks. It serves several major groups of users:

Data engineers who want Intel architecture-based optimizations
End users of big data analytic frameworks who are looking for performance acceleration
Database developers who are seeking reusable building blocks
Data scientists who are looking for heterogeneous execution

Using this library can significantly improve query performance for front-end SQL engines such as Prestodb and Spark.

Users can reuse the implemented operators and functions to build a full-featured SQL engine. Currently, the library offers a highly optimized compiler that generates JIT-compiled functions for execution.

Building blocks that use compression codecs (based on IAA and QAT) can be plugged directly into Hadoop/Spark for compression acceleration.

Introduction

The following diagram shows the design architecture. Currently, BDTK offers several building blocks: a lightweight LLVM-based SQL compiler on top of the Arrow data format; ICL, a compression codec that leverages the state-of-the-art Intel IAA accelerator; and QATCodec, a compression codec wrapper based on the Intel QAT accelerator.

Cider:

A modularized, general-purpose Just-In-Time (JIT) compiler for data analytic query engines. It uses Substrait as its protocol, which allows it to support multiple front-end engines. Currently it provides an LLVM-based implementation based on HeavyDB.

Velox Plugin:

A bridge that enables Big Data Analytic Toolkit on Velox. It introduces a hybrid execution mode that combines compilation with the vectorization that already exists in Velox. It works as a Velox plugin, without requiring changes to Velox code.

Intel Codec Library:

Intel Codec Library for BigData provides a compression and decompression library that lets Apache Hadoop/Spark use acceleration hardware for compression and decompression.

Supported features

The currently supported features are listed on the Project Page. Features newly supported in release 0.9 are listed on the release page.

Getting Started

Get the BDTK Source

git clone --recursive https://github.com/intel/BDTK.git
cd BDTK
# if you are updating an existing checkout
git submodule sync --recursive
git submodule update --init --recursive
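
After cloning or updating, it can be useful to confirm that every submodule is initialized and in sync. The sketch below is a hypothetical helper (the name submodules_needing_update is not part of BDTK) that filters the output of git submodule status, relying on git's documented line prefixes.

```shell
# Hypothetical helper, not part of BDTK.
# Reads `git submodule status` output on stdin and prints only the
# submodules that need attention. Git prefixes each line with:
#   "-"  submodule is not initialized
#   "+"  checked-out commit differs from the one recorded in the superproject
#   "U"  submodule has merge conflicts
# Lines for submodules that are in sync start with a space and are dropped.
submodules_needing_update() {
  # grep exits non-zero when nothing matches; treat that as success.
  grep -E '^[-+U]' || true
}

# Usage, from the BDTK checkout (empty output means everything is in sync):
#   git submodule status --recursive | submodules_needing_update
```

If the helper prints any lines, rerunning the sync and update commands above should bring those submodules back in line.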

Setting up a BDTK development environment with Docker on Linux

We provide a Dockerfile to help developers set up and install the BDTK dependencies.

Build an image from a Dockerfile

$ cd ${path_to_source_of_bdtk}/ci/docker
$ docker build -t ${image_name} .

Start a docker container for development

$ docker run -d --name ${container_name} --privileged=true -v ${path_to_source_of_bdtk}:/workspace/bdtk ${image_name} /usr/sbin/init

How to build

Once you have set up the Docker build environment for BDTK and obtained the source, enter the BDTK container and build as follows:

Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version, or make release to build an optimized version. Use make test-debug or make test-release to run tests.
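
The build steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: bdtk_build_cmds is a hypothetical function (not shipped with BDTK), /workspace/bdtk is the mount point taken from the docker run command earlier, and the container image is assumed to provide bash. The helper only prints the command line, so nothing runs until you pass it to the container yourself.

```shell
# Hypothetical helper, not part of BDTK.
# Prints the command line that builds BDTK and runs its tests in the
# requested mode ("debug" or "release", default "release").
# Assumes the source tree is mounted at /workspace/bdtk, as in the
# `docker run -v ...:/workspace/bdtk` example in this README.
bdtk_build_cmds() {
  mode="${1:-release}"
  case "$mode" in
    debug|release) ;;
    *)
      echo "unknown build mode: $mode (expected debug or release)" >&2
      return 1
      ;;
  esac
  # Each step runs only if the previous one succeeded.
  printf 'cd /workspace/bdtk && make %s && make test-%s\n' "$mode" "$mode"
}

# Usage from the host (assumes bash exists inside the container):
#   docker exec -it "${container_name}" bash -c "$(bdtk_build_cmds debug)"
```

Printing the commands rather than executing them keeps the helper easy to inspect, and the same line can be pasted into an interactive shell inside the container.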

How to Enable in Presto

To use BDTK with Prestodb, the Intel versions of both Prestodb and Velox are required. Detailed steps are available in the installation guide.

Roadmap

For the coming release, the following work items have been prioritized:

Better test coverage for entire library
Better robustness and more implemented features enabled in Prestodb as the pilot SQL engine, by improving the offloading framework
Better extensibility at multiple levels (including relational algebra operators, expression functions, and data formats), by adopting a state-of-the-art multi-level compiler design
Complete Arrow format migration
Next-gen codegen framework
Support for large-volume data processing
Advanced features development

Code Of Conduct

Big Data Analytic Toolkit's Code of Conduct can be found here.

Online Documentation

You can find all the Big Data Analytic Toolkit documents on the project web page.

License

Big Data Analytic Toolkit is licensed under the Apache 2.0 License. A copy of the license can be found here.
