cdoron / arrow-flight-module

The Mesh for Data platform - Arrow/Flight module

arrow-flight-module

The arrow-flight-module (AFM) for Fybrik brings enforcement of data governance policies to the world of Apache Arrow Flight.

What is it?

AFM is a Flight server that enables applications to consume tabular data from data sources. More importantly, AFM is also a Policy Enforcement Point (PEP) capable of performing the enforcement actions dictated by a Policy Decision Point. Such enforcement actions include blocking unauthorized requests and validating and transforming data payloads.

Register as a Fybrik module

To register AFM as a Fybrik module, apply module.yaml to the fybrik-system namespace of your cluster.

To install the latest release run:

kubectl apply -f https://github.com/fybrik/arrow-flight-module/releases/latest/download/module.yaml -n fybrik-system

Version compatibility matrix

Fybrik   AFM      Module manifest (module.yaml)
0.1.x    0.1.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.1.0/module.yaml
0.2.x    0.2.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.2.0/module.yaml
0.3.x    0.3.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.3.0/module.yaml
0.4.x    0.4.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.4.0/module.yaml
0.5.x    0.5.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.5.0/module.yaml
0.6.x    0.6.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.6.0/module.yaml
0.7.x    0.7.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.7.0/module.yaml
1.0.x    0.8.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.8.0/module.yaml
1.1.x    0.8.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.8.0/module.yaml
1.2.x    0.9.x    https://github.com/fybrik/arrow-flight-module/releases/download/v0.9.0/module.yaml
1.2.x    0.10.x   https://github.com/fybrik/arrow-flight-module/releases/download/v0.10.0/module.yaml
master   master   https://raw.githubusercontent.com/fybrik/arrow-flight-module/master/module.yaml
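
The matrix can also be looked up from a script, for example when automating cluster setup. The following is a minimal sketch and not part of AFM: the helper name manifest_url is hypothetical, and the rule of picking the last listed AFM release when a Fybrik version appears in more than one row is an assumption made for illustration.

```python
# Illustrative helper (not part of AFM): choose a module.yaml URL for a Fybrik version.
# The mapping is copied from the compatibility matrix above.
RELEASE_URL = (
    "https://github.com/fybrik/arrow-flight-module/releases/download/v{afm}/module.yaml"
)
MASTER_URL = "https://raw.githubusercontent.com/fybrik/arrow-flight-module/master/module.yaml"

# Fybrik "major.minor" -> AFM release tag (without the leading "v").
# ASSUMPTION: Fybrik 1.2.x is listed with both AFM 0.9.x and 0.10.x;
# this sketch picks the last listed release.
AFM_FOR_FYBRIK = {
    "0.1": "0.1.0",
    "0.2": "0.2.0",
    "0.3": "0.3.0",
    "0.4": "0.4.0",
    "0.5": "0.5.0",
    "0.6": "0.6.0",
    "0.7": "0.7.0",
    "1.0": "0.8.0",
    "1.1": "0.8.0",
    "1.2": "0.10.0",
}


def manifest_url(fybrik_version: str) -> str:
    """Return the module.yaml URL listed for the given Fybrik version."""
    if fybrik_version == "master":
        return MASTER_URL
    # Reduce e.g. "v1.1.2" or "1.1.2" to "1.1".
    major_minor = ".".join(fybrik_version.lstrip("v").split(".")[:2])
    if major_minor not in AFM_FOR_FYBRIK:
        raise ValueError(f"no AFM release listed for Fybrik {fybrik_version}")
    return RELEASE_URL.format(afm=AFM_FOR_FYBRIK[major_minor])


if __name__ == "__main__":
    print(manifest_url("1.1.2"))
```

The returned URL can be passed to kubectl apply -f together with -n fybrik-system, as in the install command above.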

Build and deploy to Kubernetes

These instructions build an image of AFM and deploy it to Kubernetes. Deployment installs a Helm release named afm to the cluster. The release runs the Flight server with an empty configuration.

Requirements

  • make
  • Docker
  • kubectl with access to a Kubernetes cluster (e.g., a kind cluster)
  • Helm 3.7 or above

Deploy to kind clusters

make build push-to-kind deploy

Deploy to other clusters

For other clusters you will need to use an image registry. For example, if your image registry is us.icr.io/username, run:

REPOSITORY=us.icr.io/username/arrow-flight-module make build push deploy

Configuration

You can provide a configuration file by upgrading the deployed Helm chart. For example, to use sample/sample.yaml as the configuration, run:

helm upgrade --reuse-values --set-file config_override=sample/sample.yaml afm ./helm/afm

It is also possible to set individual configuration values following the modules specification of Fybrik. Generally, you would let Fybrik do that for you rather than set them directly.

Usage

Once the server is deployed, you can connect to it using any Flight client SDK. For example, in Python:

import pyarrow.flight as fl
import pandas as pd

if __name__ == '__main__':
  client = fl.connect("grpc://afm-arrow-flight-module.default.svc.cluster.local:80")  # change to the address that the server is deployed to
  info: fl.FlightInfo = client.get_flight_info(
      fl.FlightDescriptor.for_command(r'{"asset": "sample.parquet"}'))  # change to an asset configured in the server config file
  result: fl.FlightStreamReader = client.do_get(info.endpoints[0].ticket)

  df: pd.DataFrame = result.read_pandas()
  print(df)
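
The command passed to FlightDescriptor.for_command is a JSON string, so it can be easier to build it with json.dumps than to write it by hand. The following is a minimal sketch: the helper name build_command is hypothetical, and the "columns" key used for column selection is an assumption about the request schema that should be checked against the server before relying on it. Only the "asset" key appears in the example above.

```python
import json
from typing import Optional, Sequence


def build_command(asset: str, columns: Optional[Sequence[str]] = None) -> str:
    """Return a JSON command string for a Flight descriptor."""
    request = {"asset": asset}
    if columns:
        # ASSUMPTION: "columns" is a hypothetical key for column selection;
        # verify the key name against the server's request schema.
        request["columns"] = list(columns)
    return json.dumps(request)


if __name__ == "__main__":
    print(build_command("sample.parquet"))
```

The result can be used in place of the raw string, for example fl.FlightDescriptor.for_command(build_command("sample.parquet")).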

Development

This project requires Python 3.8 and pipenv.

Run the server locally with pipenv install and then pipenv run server.

See /sample for an example to run locally.

Status

This project is in a very early stage and contains just the following features:

  • Formats
    • Parquet
  • Filesystems
    • S3
  • Queries
    • Asset name
    • Column selection
  • Enforcement Actions
    • Redact
    • RemoveColumn
  • Configuration (assets, actions)
  • Demo
    • Flight client

The project's focus is transforming data based on policies. We plan to create a specialized architecture for a generic PEP for Apache Arrow Flight and to demonstrate it with an existing Flight server such as Ballista.

A full list of desired features (not necessarily fulfilled by AFM alone):

  • Formats
    • Parquet
    • Parquet Modular Encryption (PME)
    • CSV
    • JSON
    • ORC
    • Arrow Flight
  • Filesystems
    • S3
    • Local
  • Queries
    • Asset name
    • Column selection
    • Nested column selection
    • Filter
    • Full SQL
  • Enforcement Actions
    • Redact
    • RemoveColumn
    • Masking (format preserving)
    • Blackout period
  • Pluggable enforcement actions
  • Simplified dynamic configuration (personal asset catalog, action policies)
  • Distributed query planning
  • Serverless query execution
  • Support writes
  • Integrations
  • Demo
    • Flight client
    • Apache Spark client
