
Rust DataFrame

A dataframe implementation in Rust.

This project currently exists as a prototype that uses the Apache Arrow Rust library. Its goal is to act as an additional user of Arrow as the library develops, in order to surface difficulties that Arrow's downstream consumers (this dataframe being one) might encounter.

Functionality

This project is inspired by Pandas and other dataframe libraries, but specifically borrows functions from Apache Spark.

It mainly focuses on computation, and aims to include:

  • Scalar functions
  • Aggregate functions
  • Window functions
  • Array functions

As a point of reference, we use Apache Spark's Python functions for function parity, and we aim to remain compatible with them.

Eager vs Lazy Evaluation

The initial experiments in this project were to see whether it is possible to create some form of dataframe in Rust. We are happy that this goal has been met; however, the initial version relied on eager evaluation, which would make it both slow and difficult to use in a REPL fashion.

We are now mainly focused on creating a process for lazy evaluation (the current LazyFrame), which involves reading an input's schema and then applying transformations to that schema until a materialising action is required. While we are still figuring this out, there might not be much visible progress, as most of this exercise is happening offline.

The plan is to provide a reasonable API for lazily transforming data, and the ability to apply some optimisations on the computation graph (e.g. predicate pushdown, rearranging computations).
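To make the lazy model concrete, here is a minimal, self-contained sketch of the idea: transformations are only recorded in a plan, and nothing is computed until a materialising action runs. The names (`LazyFrame`, `collect`, the toy `Op` enum) are illustrative only and do not reflect this crate's actual API.

```rust
// Illustrative only: a toy lazy "frame" over one integer column.
// A real dataframe would track a schema and Arrow arrays instead.
enum Op {
    FilterGt(i64), // keep values greater than a threshold
    MulBy(i64),    // multiply every value by a scalar
}

struct LazyFrame {
    source: Vec<i64>, // stands in for a scanned input
    plan: Vec<Op>,    // transformations recorded, not yet executed
}

impl LazyFrame {
    fn from_vec(source: Vec<i64>) -> Self {
        LazyFrame { source, plan: Vec::new() }
    }

    fn filter_gt(mut self, n: i64) -> Self {
        self.plan.push(Op::FilterGt(n)); // no data is touched here
        self
    }

    fn mul_by(mut self, n: i64) -> Self {
        self.plan.push(Op::MulBy(n));
        self
    }

    /// The materialising action: run the recorded plan over the data.
    /// A real implementation could first optimise the plan, e.g. push
    /// filters ahead of more expensive operations.
    fn collect(self) -> Vec<i64> {
        self.plan.into_iter().fold(self.source, |data, op| match op {
            Op::FilterGt(n) => data.into_iter().filter(|v| *v > n).collect(),
            Op::MulBy(n) => data.into_iter().map(|v| v * n).collect(),
        })
    }
}

fn main() {
    let result = LazyFrame::from_vec(vec![1, 5, 10])
        .filter_gt(2) // recorded
        .mul_by(3)    // recorded
        .collect();   // executed here
    assert_eq!(result, vec![15, 30]);
}
```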

In the future, LazyFrame will probably be renamed to DataFrame, and the current eagerly-evaluated DataFrame will be removed or made private.

The ongoing experiments on lazy evaluation are in the master branch, and we would appreciate some help 🙏🏾.

Non-Goals

Although we use Apache Spark as a reference, we do not intend to:

  • Creating deferred computation kernels (we'll leverage Arrow Rust)
  • Creating distributed computation kernels

Spark is a convenience to reduce bikeshedding, but we will probably provide a more idiomatic Rust API in the future.

Status

Roadmap

  • Lazy evaluation (Q1 2020)
    • Aggregations
    • Joins
    • Sorting
  • Adding compute functions (Q2 2020)
  • SQL support (Q3 2020) [Uncertain if needed]
  • Python bindings (Q4 2020)

IO

IO support is still limited, but this project is not the best place to implement it. We are contributing to the IO effort in Apache Arrow's Rust implementation, and more contributors would be welcome there.

For now, we're trying to support CSV, JSON, and perhaps other simpler file formats. A note on Feather: Feather file format support can be considered deprecated in favour of Arrow IPC. Though we have implemented Feather, it is meant as a stop-gap measure until Arrow supports IPC in Rust (anticipated at 1.0.0).
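As an illustration of leaning on Arrow for IO, the sketch below reads a CSV file into Arrow RecordBatches using the arrow crate's CSV reader. The file name is a placeholder, and the builder methods shown are from an older arrow release (they have moved and been renamed across versions), so treat this as a sketch rather than a pinned example.

```rust
use std::fs::File;

use arrow::csv::ReaderBuilder;

fn main() {
    // "cities.csv" is a hypothetical path for this sketch.
    let file = File::open("cities.csv").expect("file should exist");

    // Infer a schema from the first 100 rows, then iterate RecordBatches.
    let reader = ReaderBuilder::new()
        .has_header(true)
        .infer_schema(Some(100))
        .build(file)
        .expect("CSV reader should build");

    for batch in reader {
        let batch = batch.expect("batch should decode");
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
}
```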

  • IO Support
    • CSV (using Arrow)
      • Read
      • Write
    • JSON
      • Read (submitted to Arrow)
      • Write
    • Arrow IPC
      • Read File
      • Write File
    • Parquet (relying on Arrow)
      • Read File
      • Write File
    • SQL (planning on relying on other efforts, if someone wants to build a SQL<>Arrow converter)
      • PostgreSQL
        • Read (ongoing, reading of most columns possible)
        • Write
      • MSSQL (using tiberius)
        • Read
        • Write
      • MySQL
        • Read
        • Write

Functionality

  • DataFrame Operations

    • Select single column
    • Select subset of columns, drop columns
    • Add or remove columns
    • Rename columns
    • Create dataframe from record batches (a Vec<RecordBatch> as well as an iterator) (partially supported)
    • Sort dataframes
    • Grouped operations
    • Filter dataframes
    • Join dataframes
  • Scalar Functions

    • Trig functions (sin, cos, tan, asin, asinh, ...) (using the num crate where possible)
    • Basic arithmetic (add, multiply, divide, subtract), implemented using Arrow's compute kernels (see the sketch after this list)
    • Date/Time functions
    • String functions
      • Basic string manipulation
      • Regular expressions (leveraging regex)
      • Casting to and from strings (using Arrow compute's cast kernel)
    • Crypto/hash functions (md5, crc32, sha{x}, ...)
    • Other functions (that we haven't classified)
  • Aggregate Functions

    • Sum, max, min
    • Count
    • Statistical aggregations (mean, mode, median, stddev, ...)
  • Window Functions

    • Lead, lag
    • Rank, percent rank
    • Other
  • Array Functions

    • Compatibility with Spark 2.4 functions
    • Compatibility with Spark 3.0 functions
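To ground the scalar-function items above, here is a hedged sketch of arithmetic and string casting sitting directly on Arrow's compute kernels. The module paths are from an older arrow release and have since moved, so adjust them to the version you use.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array};
use arrow::compute::kernels::{arithmetic, cast};
use arrow::datatypes::DataType;

fn main() {
    let a = Float64Array::from(vec![1.0, 2.0, 3.0]);
    let b = Float64Array::from(vec![10.0, 20.0, 30.0]);

    // Basic arithmetic comes straight from Arrow's compute kernels.
    let sum = arithmetic::add(&a, &b).expect("arrays have equal length");

    // Casting to strings uses Arrow compute's cast kernel.
    let sum: ArrayRef = Arc::new(sum);
    let as_strings = cast::cast(&sum, &DataType::Utf8).expect("cast is supported");
    println!("{:?}", as_strings);
}
```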

Performance

We plan on providing simple benchmarks in the near future. The current blockers are:

  • IO
    • Text format (CSV)
    • Binary format (Arrow IPC)
  • Lazy operations
  • Aggregation
  • Joins

License

Apache License 2.0