liw71 / ballista

Distributed compute platform implemented in Rust, using Apache Arrow memory model.

Home Page: https://ballistacompute.org

Ballista: Distributed Compute Platform

Overview

Ballista is a distributed compute platform primarily implemented in Rust, using Apache Arrow as the memory model. It is built on an architecture that allows other programming languages to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

Ballista can be deployed in Kubernetes, or as a standalone cluster using etcd for discovery.

Architecture

The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.

[Figure: Ballista architecture diagram]

How does this compare to Apache Spark?

Although Ballista is largely inspired by Apache Spark, there are some key differences.

  • The choice of Rust as the main execution language means that memory usage is deterministic and there are no GC pauses.
  • Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
  • The combination of Rust and Arrow provides excellent memory efficiency: memory usage can be 50x to 100x lower than Apache Spark in some cases, so more processing fits on a single node, reducing the overhead of distributed compute.
  • The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
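To make the row-based versus columnar distinction concrete, here is a minimal sketch in plain Rust. It does not use Arrow or Ballista types; `TripRow`, `TripColumns`, and the two summing functions are hypothetical names chosen for illustration. The point is only that a columnar layout stores each field contiguously, so an aggregate over one field reads just that field's data.

```rust
// Row-oriented layout: one struct per record.
#[derive(Debug, Clone)]
struct TripRow {
    passenger_count: u32,
    fare_amount: f64,
}

// Column-oriented layout: one contiguous vector per field.
// This mirrors the idea of a columnar batch, not Arrow's actual types.
#[derive(Debug, Default)]
struct TripColumns {
    passenger_count: Vec<u32>,
    fare_amount: Vec<f64>,
}

impl TripColumns {
    // Transposes a slice of rows into per-field vectors.
    fn from_rows(rows: &[TripRow]) -> Self {
        let mut cols = TripColumns::default();
        for row in rows {
            cols.passenger_count.push(row.passenger_count);
            cols.fare_amount.push(row.fare_amount);
        }
        cols
    }
}

// Sums fares by visiting every row struct; the other fields of each
// record sit between consecutive fares in memory.
fn total_fare_rows(rows: &[TripRow]) -> f64 {
    rows.iter().map(|r| r.fare_amount).sum()
}

// Sums fares by scanning a single contiguous slice of f64 values.
fn total_fare_columns(cols: &TripColumns) -> f64 {
    cols.fare_amount.iter().sum()
}

fn main() {
    let rows = vec![
        TripRow { passenger_count: 1, fare_amount: 10.0 },
        TripRow { passenger_count: 2, fare_amount: 20.5 },
        TripRow { passenger_count: 1, fare_amount: 7.25 },
    ];
    let cols = TripColumns::from_rows(&rows);

    println!("records       = {}", cols.passenger_count.len());
    println!("row total     = {}", total_fare_rows(&rows));
    println!("column total  = {}", total_fare_columns(&cols));
}
```

Both functions return the same total. The difference is the memory access pattern, which is what makes columnar data a good fit for vectorized kernels and compression.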

Example Rust Client

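use std::collections::HashMap;
// Context, col, max, pretty, and Result come from the Ballista crate and the
// Arrow/DataFusion modules it builds on; exact import paths depend on the release.

#[tokio::main]
async fn main() -> Result<()> {
    // connect to a remote Ballista endpoint at localhost:50051 with no extra settings
    let ctx = Context::remote("localhost", 50051, HashMap::new());

    // group by passenger_count and compute the maximum fare_amount per group
    let results = ctx
        .read_parquet("/path/to/data", None)?
        .aggregate(vec![col("passenger_count")], vec![max(col("fare_amount"))])?
        .collect()
        .await?;

    // print the results
    pretty::print_batches(&results)?;

    Ok(())
}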
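
For readers unfamiliar with the aggregate in the client example, the following sketch shows the same logical operation (group by `passenger_count`, take the maximum `fare_amount`) over two in-memory columns, using only the Rust standard library. It is not how Ballista executes the query; the function name `max_fare_by_passenger_count` and the sample data are hypothetical.

```rust
use std::collections::BTreeMap;

// Returns the maximum fare for each distinct passenger count.
// A BTreeMap is used so that groups come back in ascending key order.
fn max_fare_by_passenger_count(
    passenger_count: &[u32],
    fare_amount: &[f64],
) -> BTreeMap<u32, f64> {
    assert_eq!(
        passenger_count.len(),
        fare_amount.len(),
        "columns must have equal length"
    );

    let mut groups: BTreeMap<u32, f64> = BTreeMap::new();
    for (&count, &fare) in passenger_count.iter().zip(fare_amount.iter()) {
        groups
            .entry(count)
            // Existing group: keep the larger of the stored and the new fare.
            .and_modify(|current| *current = current.max(fare))
            // First value seen for this group.
            .or_insert(fare);
    }
    groups
}

fn main() {
    // Hypothetical sample data standing in for two Parquet columns.
    let passenger_count: Vec<u32> = vec![1, 2, 1, 3, 2];
    let fare_amount: Vec<f64> = vec![10.0, 20.5, 12.75, 8.0, 19.0];

    for (count, max_fare) in max_fare_by_passenger_count(&passenger_count, &fare_amount) {
        println!("passenger_count={} max(fare_amount)={}", count, max_fare);
    }
}
```

In a distributed setting, an aggregate like this is typically computed per partition and the partial results are then merged, which works for `max` because the maximum of partial maxima equals the overall maximum.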

Status

An alpha release of Ballista is now available, and we are working towards the full 0.3.0 release in August 2020. Please refer to the user guide for instructions on using a released version of Ballista.

Roadmap

We are now working on support for more complex operators, particularly joins, using the TPC-H benchmarks to drive requirements. The full roadmap is available here.

More Examples

The following examples should help illustrate the current capabilities of Ballista

Documentation

The user guide is hosted at https://ballistacompute.org, along with the blog where news and release notes are posted.

Contributing

See CONTRIBUTING.md for information on contributing to this project.

License: Apache License 2.0
