Designed by Agile Lab, Witboost is a versatile platform that addresses a wide range of sophisticated data engineering challenges. It enables businesses to discover, enhance, and productize their data, fostering the creation of automated data platforms that adhere to the highest standards of data governance. Want to know more about Witboost? Check it out here or contact us!

This repository is part of our Starter Kit meant to showcase Witboost's integration capabilities and provide a "batteries-included" product.

CDP Impala Specific Provisioner

Overview
Building
Running
Configuring
Deploying
How it works
HLD
API specification

Overview

This project implements a Specific Provisioner deploying Output Ports and Storage Areas* (as External Tables or Views) on Apache Impala hosted on a Cloudera Data Platform environment. It supports both CDP Public Cloud with Cloudera Data Warehouse (CDW) using Impala and Amazon Web Services (AWS) S3 storage, and CDP Private Cloud using Impala and HDFS. After deploying this microservice and configuring Witboost to use it, the platform can create Output Ports and Storage Areas* on existing CSV or Parquet tables leveraging an existing Impala instance.

* As of now, this provisioner can only deploy View Output Ports and Storage Areas on CDP Private Cloud environments.

Specifically, this provisioner can create:

CDP Public:
- Output Ports as external tables, which let you define a schema, an HDFS/S3 location, the format of the data files, and extra TBLPROPERTIES.
CDP Private:
- All components deployable on CDP Public mentioned above
- Storage Areas as external tables, which let you define a schema, an HDFS/S3 location, the format of the data files, and extra TBLPROPERTIES.
- Storage Areas as views, defined by a custom SQL statement provided by the user
- Output Ports as simple 1:1 views from a source table, defining a schema as the set of columns to be queried.

What's a Specific Provisioner?

A Specific Provisioner is a microservice in charge of deploying components that use a specific technology. When the deployment of a Data Product is triggered, the platform generates its descriptor and orchestrates the deployment of every component contained in the Data Product. For each component, the platform knows which Specific Provisioner is responsible for its deployment, and sends it a provisioning request containing the descriptor. The Specific Provisioner then performs whatever operation is required to fulfill the request and reports the outcome back to the platform.

You can learn more about how the Specific Provisioners fit in the broader picture here.

Software stack

This microservice is written in Scala 2.13, using HTTP4s and Guardrail for the HTTP layer. The project is built with sbt and supports packaging as JAR, fat-JAR and Docker image, the latter being ideal for Kubernetes deployments (which is the preferred option).

This is a multi-module sbt project with the following modules:

api: Contains the API layer of the service, which can be invoked synchronously in three different ways:
1. POST /provision: provisions the Impala Output Port/Storage Area specified in the request payload. It synchronously calls the service logic to perform the provisioning.
2. POST /validate: validates the request payload and returns a validation result. It should be invoked before provisioning a resource to check that the request is correct.
3. POST /updateacl: updates user access to the provisioned resources. Only available for Output Ports.
core: Contains model case classes and logic shared among the modules
service: Contains the Provisioner Service logic. It is called from the API layer after some checks on the request and returns the deployed resource. This is the module that provisions the Output Port/Storage Area
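
As a rough illustration of the three operations listed above, the following sketch wraps them in a small shell helper. It assumes the default bind address mentioned under Running (localhost on port 8093), that the paths are exposed without any prefix, and that the request body is JSON stored in a local file. The names `PROVISIONER_URL`, `endpoint_url`, `call_provisioner` and `request.json` are made up for this example; the actual request schema is the one defined in the API specification.

```shell
# Assumed base URL: the default bind address from the Running section.
PROVISIONER_URL="${PROVISIONER_URL:-http://localhost:8093}"

# Print the full URL for one of the three operations; reject anything else.
endpoint_url() {
  case "$1" in
    provision|validate|updateacl)
      printf '%s/%s\n' "${PROVISIONER_URL%/}" "$1" ;;
    *)
      echo "unknown operation: $1" >&2
      return 1 ;;
  esac
}

# POST the JSON request body stored in file $2 to operation $1.
call_provisioner() {
  target_url="$(endpoint_url "$1")" || return 1
  curl --silent --show-error --request POST \
    --header 'Content-Type: application/json' \
    --data-binary "@$2" "$target_url"
}

# Example (request.json is a hypothetical file holding the request body):
#   call_provisioner validate request.json
#   call_provisioner provision request.json
```

Calling validate first and inspecting its response before calling provision mirrors the recommendation in the list above.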

In this project we are using the following sbt plugins:

scalafmt: To keep the Scala code style aligned across all collaborators
wartRemover: To keep the code as functional as possible
scoverage: To create a test coverage report
k8tyGitlabPlugin: To publish the packages to Gitlab Package Registry

Artifacts

We produce three different artifacts on the CI/CD for this repository:

The scoverage report, which you can download from the CI/CD to check the test coverage
A Docker image published in the Gitlab Container Registry
A set of jars, one for each module, published in the Maven Gitlab Package Registry

Building

Requirements:

Java >=11
sbt

This project depends on a private library scala-mesh-commons which you should have access to at compile time. Currently, the library is hosted as a package in a Gitlab Maven Package Registry.

To pull these libraries, we need to set up authentication to the Package Registry (see Gitlab docs). We've set authentication based on environment variables that sbt uses to authenticate. Please export the following environment variables before importing the project:

export GITLAB_ARTIFACT_HOST=https://gitlab.com/api/v4/projects/51107980/packages/maven
export GITLAB_ARTIFACT_USER=<Gitlab Username>
export GITLAB_ARTIFACT_TOKEN=<Gitlab Personal Access Token>

Generating sources: this project uses OpenAPI as standard API specification and the sbt-guardrail plugin to generate server code from the specification.

The code generation is done automatically in the compile phase:

sbt compile

Test

Tests are run with the standard sbt task as well:

sbt test

CI/CD

Once you commit and push, the CI/CD is triggered; the test and build phases are executed at each push. The CI/CD uses the job token to push the dependency libraries.

Dev Deploy is executed only for the master branch
Prod Deploy is executed only for the release branch

You can double-check the artifacts that will be deployed by downloading from the CI/CD the artifacts.zip that was cached during the test/build stages.

How to collaborate

We recommend using IntelliJ IDEA Community Edition for developing this project, but you are free to use your favorite IDE. Please remember to add the IDE-specific files to the .gitignore.

If you fork this repository, please modify the project settings with the appropriate Gitlab project id to avoid pushing artifacts to the wrong repository.

Scala style

Use scalafmt to reformat the code while editing. This applies the Scala format specification written in .scalafmt.conf and avoids spurious changes in merge requests.

We added additional compilation rules using the wartRemover library, so if any errors are raised at compile time, please fix them.

Running

To run the server, you need to set up the necessary environment variables to access CDP and the AWS environment. This Specific Provisioner uses the following SDKs:

CDP SDK: please refer to the official documentation to set up the access credentials (only required for CDP Public Cloud).
AWS SDK: please refer to the official documentation to set up the access credentials (only required for CDP Public Cloud).

For example, for local execution you need to set the following environment variables:

# AWS configuration is only required for CDP Public Cloud
export AWS_REGION=<aws_region>
export AWS_ACCESS_KEY_ID=<aws_access_key_id>
export AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>
export AWS_SESSION_TOKEN=<aws_session_token>

export CDP_DEPLOY_ROLE_USER=<cdp_user>
export CDP_DEPLOY_ROLE_PASSWORD=<cdp_password>
export CDP_ACCESS_KEY_ID=<cdp_user_access_key_id> # Only required for CDP Public Cloud
export CDP_PRIVATE_KEY=<cdp_user_private_key> # Only required for CDP Public Cloud

This provisioner uses two sets of credentials to perform operations on Apache Ranger and Apache Impala. The default configuration sets them both equal to the environment variables CDP_DEPLOY_ROLE_USER and CDP_DEPLOY_ROLE_PASSWORD, so that only one user is initially necessary, but the Ranger credentials can be overridden via configuration if they need to be different (see Configuring).
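
Before starting the server locally, it can help to check that the variables listed above are actually set. The sketch below is one way to do that; the function names `runtime_vars` and `missing_runtime_vars` are hypothetical, and the grouping by cloud type follows the "only required for CDP Public Cloud" comments above. `AWS_SESSION_TOKEN` is deliberately left out of the check, on the assumption that it is only needed with temporary AWS credentials.

```shell
# Print the variables listed above for a given cloud type (public or private).
runtime_vars() {
  case "$1" in
    private)
      echo "CDP_DEPLOY_ROLE_USER CDP_DEPLOY_ROLE_PASSWORD" ;;
    public)
      echo "CDP_DEPLOY_ROLE_USER CDP_DEPLOY_ROLE_PASSWORD CDP_ACCESS_KEY_ID CDP_PRIVATE_KEY AWS_REGION AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY" ;;
    *)
      echo "usage: runtime_vars public|private" >&2
      return 1 ;;
  esac
}

# Print every listed variable that is unset or empty in the current shell.
missing_runtime_vars() {
  var_list="$(runtime_vars "$1")" || return 1
  for var_name in $var_list; do
    eval "current_value=\${$var_name:-}"
    [ -n "$current_value" ] || printf '%s\n' "$var_name"
  done
}

# Example: list what is still missing for a CDP Public Cloud setup.
#   missing_runtime_vars public
```

An empty output means every variable in the chosen group has a value; it does not verify that the credentials themselves are valid.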

The CDP users must be Machine Users and must satisfy some requirements depending on the type of CDP Cloud.

CDP Public Cloud

On CDP Public, the Machine User needs to have at least the following roles:

Impala:
- DWAdmin
- DWUser
Ranger:
- EnvironmentAdmin
- EnvironmentUser

As an alternative to the EnvironmentAdmin role, the Machine User for Ranger must have the necessary permissions to manage Ranger, specifically creating/updating/retrieving/deleting Security Zones, Roles, Resource based Policies and the resources related to them. If the same user is used for both services, it must have the four roles and/or permissions.

CDP Private Cloud

On CDP Private, the deploy user needs to have admin privileges on Ranger, as well as have the following permissions (e.g. through Ranger policies):

read, write, execute permissions on the HDFS directory to be used
all permissions on Impala databases and tables to be used

However, if Impala is authenticated using Kerberos, as it is in most cases, the only set of credentials needed is the one used to access Ranger. For Impala, a valid keytab with a principal with service name impala is required instead, accompanied by the necessary Kerberos configuration files (see Configuring).

After this, execute:

sbt compile run

By default, the server binds to port 8093 on localhost. After it's up and running you can make provisioning requests to this address.

Configuring

Most application configurations are handled with the Typesafe Config library. You can find the default settings in the reference.conf of each module. Customize them and use the config.file system property or the other options provided by Typesafe Config according to your needs. The provided docker image expects the config file mounted at path /config/application.conf.

Especially for CDP Private Cloud, a set of required configuration fields must be modified, like Ranger and HDFS base URLs.

Furthermore, you can specify via configuration a set of values called "publicInfo" and "privateInfo", which the provisioner returns at provision time to give the user valuable information about the deployed resources.

For more information on the configuration and to understand how to set up the provisioner for a specific type of CDP Cloud, see Configuring the Impala Specific Provisioner.

Helm chart configuration

CDP Public vs. CDP Private

The chart provides two settings to set up the provisioner to work on either CDP Public Cloud or CDP Private Cloud. The first, private.enabled, controls the environment variables that the provisioner needs in order to work (see Running). Setting it to true removes the Access Key and Private Key used by the Cloudera SDK to contact the public cloud.

The second, kerberos.enabled, sets the system properties needed for the provisioner to authenticate on a Kerberos system to services like Impala. For this, the provisioner expects a jaas.conf file and a krb5.conf file. For more information about these files see Configuring the Impala Specific Provisioner. You can provide override values for these files using the kerberos.krb5Override and kerberos.jaasOverride fields.
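
The two switches described above could be passed on the Helm command line roughly as follows. This is only a sketch: the helper name `deploy_provisioner`, the release name and the chart path are placeholders invented for the example, and only the `private.enabled` and `kerberos.enabled` keys are taken from this section.

```shell
# deploy_provisioner <release> <chart> <private: true|false> <kerberos: true|false>
deploy_provisioner() {
  # Accept only literal booleans so a typo does not reach the chart.
  for flag in "$3" "$4"; do
    case "$flag" in
      true|false) ;;
      *) echo "flags must be true or false" >&2; return 1 ;;
    esac
  done
  helm upgrade --install "$1" "$2" \
    --set "private.enabled=$3" \
    --set "kerberos.enabled=$4"
}

# Example: CDP Private Cloud with Kerberos authentication
# (release name and chart path are hypothetical):
#   deploy_provisioner impala-provisioner ./helm true true
```

Larger overrides such as kerberos.krb5Override and kerberos.jaasOverride are usually easier to keep in a values file than on the command line.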

Custom Root CA

The chart provides the option customCA.enabled to add a custom Root Certification Authority to the JVM truststore. If this option is enabled, the chart loads the custom CA from a secret with key cdp-private-impala-custom-ca. The CA is expected to be in a format compatible with the keytool utility (PEM works fine).
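
One possible way to create such a secret from a PEM file is sketched below. It reads "key" as the data key inside the Kubernetes Secret; the Secret's own name and namespace are not stated in this section, so they are left as parameters, and the helper name `create_custom_ca_secret` is hypothetical.

```shell
# create_custom_ca_secret <secret-name> <namespace> <pem-file>
create_custom_ca_secret() {
  # Fail early if the certificate file is missing.
  [ -f "$3" ] || { echo "CA file not found: $3" >&2; return 1; }
  # Store the file under the data key mentioned above.
  kubectl create secret generic "$1" \
    --namespace "$2" \
    --from-file="cdp-private-impala-custom-ca=$3"
}

# Example (secret name, namespace and file name are placeholders):
#   create_custom_ca_secret <secret-name> <namespace> root-ca.pem
```

Check the chart's templates or values for the secret name it actually expects before using this.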

Deploying

This microservice is meant to be deployed to a Kubernetes cluster.

How it works

1. Parse the request body
2. Retrieve the Impala coordinator host and the Ranger host from either the CDP environment (CDP Public) or the provisioner configuration (CDP Private)
3. Create the Impala resource (table or view)
4. Upsert the Ranger security zone for the specific data product version
5. Upsert Ranger roles for the owners of the component and, for Output Ports, a role for users as well
6. Upsert access policies for said roles, granting read/write access to the owner role and read-only access to the user role
7. Return the deployed resource

Descriptor Input

The Impala Specific Provisioner receives a YAML descriptor containing a data contract schema and a specific field with the information of the table or view to be deployed. It allows defining:

Data contract schema: OpenMetadata Column schema defining the schema of the table or view to be created
Database name: Database to be created to handle the component tables
Table name: Table name to be created, or when provisioning a view, the name of the table exposed by the view
View name: Sent when provisioning a view to define its name
Format: Format of the data files an external table exposes. Only required for table creation
Location: Location in S3 (CDP Public) or HDFS (CDP Private) where the data files are located
Partitions: List of columns used to partition the data
Table parameters: Extra table parameters to define TBLPROPERTIES, text file delimiter and header, etc.
Custom DML Statement: Storage Areas as views can be created by using a query provided by the user.

For the schema specification of this object, check out Descriptor Input.

License

This project is available under the Apache License, Version 2.0; see LICENSE for full details.

About us

Agile Lab creates value for its Clients in data-intensive environments through customizable solutions to establish performance driven processes, sustainable architectures, and automated platforms driven by data governance best practices.

Since 2014 we have implemented 100+ successful Elite Data Engineering initiatives and used that experience to create Witboost: a technology agnostic, modular platform, that empowers modern enterprises to discover, elevate and productize their data both in traditional environments and on fully compliant Data mesh architectures.
