unum-cloud / udsb

Unlimited Data-Science Benchmarks for Numeric, Tabular and Graph Workloads

Home Page: https://unum.cloud/post/2022-01-25-cpu-vs-gpu/

Unopinionated DataScience Benchmark

A comparison of the most commonly used Data-Science Python packages and their alternatives. Generally, those alternatives have identical Python interfaces, but come with multi-threaded CPU or even GPU backends, implemented in C++, CUDA, Rust, and other low-level languages.

The procedure for running the default configuration is similar in every folder (here x stands for the folder name):

cd x
conda env create -f env.yml
python bench.py

Matrices

For Linear Algebra and Digital Signal Processing we synthetically generate square random matrices, mainly with single-precision floating-point numbers. That differs from the default Python float, which uses the 64-bit representation, more commonly described as double in C-like languages. Participating packages:

  • NumPy over BLIS
  • NumPy over OpenBLAS
  • NumPy over Intel MKL and oneAPI
  • CuPy over CuBLAS
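
To make the precision difference concrete, here is a minimal sketch that uses only the Python standard library, so it runs without any of the packages listed above. The helper `random_square_matrix`, the matrix size, and the seed are hypothetical and are not taken from the repository's benchmark scripts. Both matrices are filled from the same seed, so each single-precision value is the rounded counterpart of a double-precision one.

```python
import random
from array import array


def random_square_matrix(n, typecode="f", seed=0):
    """Return an n x n random matrix as a flat, row-major array.

    typecode "f" stores C floats (single precision),
    typecode "d" stores C doubles (what a Python float is).
    """
    rng = random.Random(seed)
    return array(typecode, (rng.random() for _ in range(n * n)))


n = 64  # hypothetical size, chosen only to keep the example fast
single = random_square_matrix(n, "f")
double = random_square_matrix(n, "d")

# Memory footprint of each matrix in bytes.
single_bytes = single.itemsize * len(single)
double_bytes = double.itemsize * len(double)
print(f"single precision: {single_bytes} bytes")
print(f"double precision: {double_bytes} bytes")

# Largest rounding error introduced by storing values in single precision.
max_rounding_error = max(abs(s - d) for s, d in zip(single, double))
print(f"max rounding error: {max_rounding_error:.3e}")
```

The single-precision matrix takes half the memory of the double-precision one, at the cost of a small rounding error per element. The array libraries listed above expose the same choice through their own data-type settings.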

Graphs or Networks

For Graph Theoretical and Network Science workloads we pick various commonly used datasets from the Stanford Network Repository. They range from under 1 MB to over 1 GB and 100 million edges. Participating packages:

  • NetworkX
  • RetworkX
  • CuGraph
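
As a rough illustration of this kind of workload, the sketch below parses a small edge list and times a connected-components count in pure Python. It assumes a plain-text edge list with `#` comment lines and whitespace-separated node pairs, and treats the graph as undirected. The toy graph and both helper functions are hypothetical. This is not how the repository's benchmark scripts are written, and the packages listed above provide their own, much faster loaders and algorithms.

```python
import time
from collections import defaultdict, deque


def parse_edge_list(lines):
    """Build an undirected adjacency map from 'source target' lines."""
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split()[:2]
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def count_connected_components(adjacency):
    """Count components with a breadth-first search from each unseen node."""
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return components


# Hypothetical toy graph standing in for a real dataset file.
sample = """# toy graph
1\t2
2\t3
4\t5
""".splitlines()

started = time.perf_counter()
graph = parse_edge_list(sample)
components = count_connected_components(graph)
elapsed = time.perf_counter() - started
print(f"{len(graph)} nodes, {components} components, {elapsed:.6f} s")
```

On datasets with millions of edges, both the parsing and the traversal in pure Python become the bottleneck, which is the gap that compiled and GPU-backed graph libraries aim to close.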

Tabular Data: NYC Taxi Rides

We took the NYC Taxi Rides dataset as our primary dataset and ran the classical 4-query benchmark on its subsets. Participating packages:

  • Pandas
  • Modin
  • CuDF
  • Dask-CuDF
  • SQLite
  • Apache DataFusion
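
The sketch below shows what such a four-query run could look like with SQLite, one of the packages above, through Python's built-in `sqlite3` module. It assumes that the "classical 4-query benchmark" refers to the four aggregation queries commonly used with this dataset: rides per cab type, average fare per passenger count, rides per passenger count and year, and rides per passenger count, year, and rounded distance. The table name, column names, and sample rows are hypothetical stand-ins, not the repository's actual schema or data.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE trips ("
    "cab_type TEXT, passenger_count INTEGER, "
    "pickup_datetime TEXT, trip_distance REAL, total_amount REAL)"
)

# Hypothetical sample rows; the real dataset has many more columns and rows.
rows = [
    ("yellow", 1, "2015-01-03 10:00:00", 1.2, 8.0),
    ("yellow", 2, "2015-06-11 18:30:00", 3.6, 16.0),
    ("green", 1, "2016-02-20 09:15:00", 0.8, 6.0),
    ("yellow", 1, "2016-07-04 23:45:00", 5.1, 22.0),
]
connection.executemany("INSERT INTO trips VALUES (?, ?, ?, ?, ?)", rows)

# ORDER BY is added to the first three queries only to make output stable.
queries = {
    "q1": "SELECT cab_type, COUNT(*) FROM trips "
          "GROUP BY cab_type ORDER BY cab_type",
    "q2": "SELECT passenger_count, AVG(total_amount) FROM trips "
          "GROUP BY passenger_count ORDER BY passenger_count",
    "q3": "SELECT passenger_count, strftime('%Y', pickup_datetime) AS year, "
          "COUNT(*) FROM trips "
          "GROUP BY passenger_count, year ORDER BY passenger_count, year",
    "q4": "SELECT passenger_count, strftime('%Y', pickup_datetime) AS year, "
          "CAST(ROUND(trip_distance) AS INTEGER) AS distance, "
          "COUNT(*) AS rides FROM trips "
          "GROUP BY passenger_count, year, distance "
          "ORDER BY year, rides DESC",
}

results = {name: connection.execute(sql).fetchall()
           for name, sql in queries.items()}
for name, result in results.items():
    print(name, result)
```

Each query adds one more grouping key than the previous one, so the set stresses progressively wider group-by aggregations. The same four aggregations can be expressed with the DataFrame APIs of the other packages.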

Languages

Jupyter Notebook 91.1%, Python 8.9%, Shell 0.0%