anmyachev / fuzzydata

Fuzzy Data Benchmark

fuzzydata

The fuzzydata Workflow Generator

The fuzzydata workflow generator enables:

  • Abstract specification of Dataframe-based Workflows
  • Generation of randomized tables and workflows
  • Loading and replay of workflows on multiple clients

fuzzydata currently supports the following clients, selected with the --wf_client option:

  • pandas (the default)
  • modin
  • sql

fuzzydata is designed to be extensible, so you can implement your own client. See the existing clients in fuzzydata/clients for examples of how to extend the abstract Artifact, Operation and Workflow classes.
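The subclassing pattern described above can be sketched as follows. This is only an illustration of the idea: ArtifactBase, OperationBase, ListArtifact and HeadOperation, and all of their method names, are hypothetical stand-ins defined in the snippet itself. They are not the real fuzzydata classes; take the actual base classes and required methods from fuzzydata/clients.

```python
from abc import ABC, abstractmethod
import random


class ArtifactBase(ABC):
    """Hypothetical stand-in for an abstract artifact (a labeled table)."""

    def __init__(self, label):
        self.label = label

    @abstractmethod
    def generate(self, num_rows, column_names):
        """Fill the artifact with randomized data."""

    @abstractmethod
    def num_rows(self):
        """Return the number of rows currently held."""


class OperationBase(ABC):
    """Hypothetical stand-in for an abstract operation on artifacts."""

    @abstractmethod
    def apply(self, source, new_label):
        """Return a new artifact derived from `source`."""


class ListArtifact(ArtifactBase):
    """Toy client: stores the table as a list of dicts, one per row."""

    def __init__(self, label, rows=None):
        super().__init__(label)
        self.rows = rows if rows is not None else []

    def generate(self, num_rows, column_names, seed=0):
        rng = random.Random(seed)  # seeded so runs are repeatable
        self.rows = [
            {name: rng.randint(0, 99) for name in column_names}
            for _ in range(num_rows)
        ]

    def num_rows(self):
        return len(self.rows)


class HeadOperation(OperationBase):
    """Toy operation: keep the first `n` rows of the source artifact."""

    def __init__(self, n):
        self.n = n

    def apply(self, source, new_label):
        return ListArtifact(new_label, rows=source.rows[: self.n])


# Build a base artifact, then derive a new version from it.
base = ListArtifact("artifact_0")
base.generate(num_rows=10, column_names=["a", "b", "c"])
derived = HeadOperation(4).apply(base, "artifact_1")
print(base.label, base.num_rows(), "->", derived.label, derived.num_rows())
```

A real client would wrap its own table type (for example a dataframe or a SQL table) where this sketch uses a list of dicts.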

Installation

Manual build/install using pip.

pip install fuzzydata

fuzzydata does not install modin or SQLAlchemy by default. To include them, request the corresponding extra (modin, sql, or all) at install time:

pip install "fuzzydata[all]"   # or "fuzzydata[modin]" or "fuzzydata[sql]"

Usage

Some examples of fuzzydata usage are in the examples directory. You can also run fuzzydata --help to list the supported command-line options:

$ fuzzydata --help
usage: fuzzydata [-h] [--wf_client WF_CLIENT] [--output_dir OUTPUT_DIR] [--wf_name WF_NAME]
              [--columns COLUMNS] [--rows ROWS] [--versions VERSIONS] [--bfactor BFACTOR]
              [--matfreq MATFREQ] [--npp NPP] [--log LOG] [--replay_dir REPLAY_DIR]
              [--wf_options WF_OPTIONS] [--exclude_ops EXCLUDE_OPS] [--scale_artifact SCALE_ARTIFACT]

optional arguments:
  -h, --help            show this help message and exit
  --wf_client WF_CLIENT
                        Workflow Client to be used (Default pandas). Available Workflows: pandas|modin|sql
  --output_dir OUTPUT_DIR
                        Location of Output datasets to be stored
  --wf_name WF_NAME     prefix for each workflow to be generated dir to be the path prefix for these files.
  --columns COLUMNS     Number of columns in the base version
  --rows ROWS           Number of rows in the base version
  --versions VERSIONS   Number of artifact versions to generate
  --bfactor BFACTOR     Workflow Branching factor, 0.1 is linear, 100 is star-like
  --matfreq MATFREQ     Materialization frequency, i.e. how many operations before writing out an artifact
  --log LOG             Set Logging Level
  --replay_dir REPLAY_DIR
                        Replay existing workflow in directory
  --wf_options WF_OPTIONS
                        JSON-encoded workflow engine options like sql_string or modin_engine
  --exclude_ops EXCLUDE_OPS
                        JSON-encoded list of ops to exclude e.g. ["pivot"]
  --scale_artifact SCALE_ARTIFACT
                        JSON-encoded dict of {artifact_label: new_size} to be scaled up e.g. {"artifact_0"
                        : 1000000}
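Because --exclude_ops and --scale_artifact expect JSON-encoded values, it can be convenient to build the command line from Python rather than quoting JSON by hand. The sketch below uses only flags listed in the help output above. The helper name build_fuzzydata_command, the output path, the workflow name, and the column, row and version counts are illustrative assumptions, not defaults or recommendations from fuzzydata.

```python
import json
import shlex


def build_fuzzydata_command(output_dir, wf_name, *, wf_client="pandas",
                            columns=20, rows=1000, versions=10,
                            exclude_ops=None, scale_artifact=None):
    """Return an argv list for the fuzzydata CLI (hypothetical helper)."""
    argv = [
        "fuzzydata",
        "--wf_client", wf_client,
        "--output_dir", output_dir,
        "--wf_name", wf_name,
        "--columns", str(columns),
        "--rows", str(rows),
        "--versions", str(versions),
    ]
    # JSON-encode the structured options, as the help text requires.
    if exclude_ops is not None:
        argv += ["--exclude_ops", json.dumps(exclude_ops)]
    if scale_artifact is not None:
        argv += ["--scale_artifact", json.dumps(scale_artifact)]
    return argv


argv = build_fuzzydata_command(
    "/tmp/fuzzydata_out",            # illustrative output directory
    "demo_wf",                       # illustrative workflow name
    exclude_ops=["pivot"],
    scale_artifact={"artifact_0": 1000000},
)

# shlex.join quotes each argument so the line can be pasted into a shell.
command_line = shlex.join(argv)
print(command_line)

# To run it directly instead (requires fuzzydata to be installed):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing the argv list to subprocess.run avoids shell quoting altogether; the printed line is only needed when the command is copied into a terminal or a script.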

Documentation

Our paper is available at https://doi.org/10.1145/3531348.3532178.

If you use fuzzydata in your research, please consider citing our paper:

@inproceedings{10.1145/3531348.3532178,
author = {Rehman, Mohammed Suhail and Elmore, Aaron},
title = {FuzzyData: A Scalable Workload Generator for Testing Dataframe Workflow Systems},
year = {2022},
isbn = {9781450393539},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3531348.3532178},
doi = {10.1145/3531348.3532178},
booktitle = {Proceedings of the 2022 Workshop on 9th International Workshop of Testing Database Systems},
pages = {17–24},
numpages = {8},
location = {Philadelphia, PA, USA},
series = {DBTest '22}
}

License

MIT License

Contributing to fuzzydata

Check out the current roadmap in docs/roadmap.md. You are always welcome to develop a new client for fuzzydata.

Contact

Suhail Rehman / ChiData Group @ UChicago CS
