aprilwebster / pycascades2021

pycascades2021

pyflink

  • Python API for Apache Flink
  • Flink lets you build scalable batch and streaming workloads
  • UDF (user-defined function) support and integration with pandas

Table API

  • powerful relational queries (e.g. SQL)
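
A minimal sketch of a relational query through the Table API, assuming a recent PyFlink release where `EnvironmentSettings.in_batch_mode()` is available. The table name `words`, the sample rows, and the helper `run_word_count` are made up for illustration and are not from the talk.

```python
# Hypothetical query: count how often each word appears in a small in-memory table.
WORD_COUNT_SQL = "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word"


def run_word_count():
    """Run the query on a local Flink runtime (needs apache-flink and Java installed)."""
    # Imported lazily so the module can be loaded without PyFlink present.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

    # Build a tiny table from Python tuples and expose it to SQL as "words".
    words = t_env.from_elements([("flink",), ("python",), ("flink",)], ["word"])
    t_env.create_temporary_view("words", words)

    # sql_query returns a Table; execute().print() runs the job and prints the rows.
    t_env.sql_query(WORD_COUNT_SQL).execute().print()


if __name__ == "__main__":
    run_word_count()
```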

DataStream API

  • lower-level than the Table API

UDF Support

  • enables use of third-party Python libraries inside Flink jobs
  • scalar, table and modular functions

Scalar Function

  • parallelization
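
As a rough illustration of a scalar function: it maps the values of one row to a single value. The sketch below assumes PyFlink 1.12 or later, where `udf` accepts a plain Python callable and a `result_type`. The function `add_one` is a made-up example, and the import is guarded so the plain Python logic can still be exercised without PyFlink installed.

```python
def add_one(x):
    """Plain Python logic: called once per row with that row's value."""
    return x + 1


try:
    from pyflink.table import DataTypes
    from pyflink.table.udf import udf

    # Wrap the callable so the Table API can call it as a scalar UDF.
    add_one_udf = udf(add_one, result_type=DataTypes.BIGINT())
except ImportError:
    # PyFlink is not installed; only the plain function is available.
    add_one_udf = None
```

With a table environment `t_env` in hand, the wrapped function could then be registered with `t_env.create_temporary_function("add_one", add_one_udf)` and used in SQL as `add_one(col)`.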

Vectorized UDFs

  • can configure the size of the batch that is converted to a pandas Series
  • decreased serialization overhead
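
A sketch of a vectorized UDF under the same assumptions (PyFlink 1.12 or later, where `udf` takes `func_type="pandas"`). Instead of one value per call, the function receives a whole batch as `pandas.Series` and must return a Series of the same length. The name `add_series` is hypothetical.

```python
def add_series(left, right):
    """Called once per batch: both arguments are pandas.Series of equal length."""
    return left + right


try:
    from pyflink.table import DataTypes
    from pyflink.table.udf import udf

    # func_type="pandas" asks Flink to pass Arrow-backed batches as pandas.Series.
    add_series_udf = udf(add_series, result_type=DataTypes.BIGINT(), func_type="pandas")
except ImportError:
    # PyFlink is not installed; only the plain function is available.
    add_series_udf = None

# The batch size is controlled by the Flink option
# "python.fn-execution.arrow.batch.size" (maximum rows per batch).
```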

Basics

  • Python version >= 3.5
  • install from PyPI
  • process and group
  • Latent Dirichlet Allocation (LDA): explains why some parts of a dataset are related or similar to each other
  • set up sources and sinks, and a Table for each
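
The setup steps above might look roughly like the sketch below: install with `pip install apache-flink`, declare a source table and a sink table, then process and group. It assumes a recent PyFlink release and Flink's built-in `datagen` and `print` connectors. The table names, columns, and the helper `run_pipeline` are invented for illustration and do not reproduce the talk's LDA example.

```python
# Source: a bounded sequence of ids 1..10 produced by the datagen connector.
SOURCE_DDL = """
CREATE TABLE numbers (
    id BIGINT
) WITH (
    'connector' = 'datagen',
    'fields.id.kind' = 'sequence',
    'fields.id.start' = '1',
    'fields.id.end' = '10'
)
"""

# Sink: the print connector writes each result row to stdout.
SINK_DDL = """
CREATE TABLE bucket_counts (
    bucket BIGINT,
    cnt BIGINT
) WITH (
    'connector' = 'print'
)
"""

# Process and group: count how many ids fall into each of three buckets.
INSERT_SQL = """
INSERT INTO bucket_counts
SELECT MOD(id, 3) AS bucket, COUNT(*) AS cnt
FROM numbers
GROUP BY MOD(id, 3)
"""


def run_pipeline():
    """Run the job on a local Flink runtime (needs apache-flink and Java installed)."""
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    t_env.execute_sql(SOURCE_DDL)
    t_env.execute_sql(SINK_DDL)
    # wait() blocks until the bounded job has finished.
    t_env.execute_sql(INSERT_SQL).wait()


if __name__ == "__main__":
    run_pipeline()
```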
