

Octopus-DataFrame

A cross-platform Pandas-like DataFrame based on Pandas, Spark, and Dask.
View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact Us

About The Project

Octopus-DF integrates Pandas, Dask, and Spark as backend computing platforms and exposes the most widely used Pandas-style APIs to users.

Why Octopus-DF?

  • Efficient at different data scales. A DataFrame workload performs very differently across platforms depending on data scale, so no single platform is efficient at every scale. By integrating Pandas, Dask, and Spark, Octopus-DF stays efficient as data grows.
  • Easy to use. Octopus-DF provides a Pandas-style API, so data analysts can model and develop their problems and programs with a familiar interface.

Getting Started

Prerequisites

You need a Python 3.6+ environment.

  • Clone the repo
    git clone https://github.com/PasaLab/Octopus-DF.git
  • Install all dependencies.
    cd Octopus-DF
    pip install -r requirements.txt

Installation

  1. Generate the distribution package.
    python setup.py sdist
  2. Install the package.
    pip install Octopus-DataFrame

Usage

Octopus-DF is built on Pandas, Spark, and Dask, so you need to deploy them in your distributed environment first. For Spark, we use Spark on YARN mode; for Dask, we use Dask distributed. Please check the official documentation of each project to complete the installation and deployment.
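As a quick sanity check on the Dask side, you can connect a dask.distributed client to the running scheduler. This is a minimal sketch, and the scheduler address below is a placeholder for your own deployment:

from dask.distributed import Client

# Connect to the already-running Dask scheduler (placeholder address).
client = Client("tcp://scheduler-host:8786")
print(client)  # reports the connected workers and their total memory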

To enable the secondary-index optimization, you should install Redis first. The suggested way to install Redis is to compile it from source, as Redis has no dependencies other than a working GCC compiler and libc. Please check the official Redis documentation to complete the installation and deployment.
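Before enabling the secondary index, it can help to confirm that Redis is reachable. A minimal sketch using the redis-py client, assuming a default local deployment on port 6379:

import redis

# Ping the Redis server used for the secondary index (default host/port assumed).
r = redis.Redis(host="localhost", port=6379)
print(r.ping())  # True if the server is up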

To enable the local-index optimization, you should install the Plasma object store first. In stand-alone mode, install pyarrow 0.11.1 on the local machine (pip install pyarrow==0.11.1), and run plasma_store -m 1000000000 -s /tmp/plasma & to reserve memory for shared objects; this command reserves 1 GB. For details, please refer to the official Plasma documentation. In cluster mode, install pyarrow and start a plasma store process on each machine.
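To confirm the plasma store started above is usable, you can round-trip a small object through it. A minimal sketch, assuming the pyarrow 0.11.x API, where plasma.connect takes the store socket, a manager socket, and a release delay:

import pyarrow.plasma as plasma

# Connect to the plasma store started with -s /tmp/plasma (pyarrow 0.11.x API).
client = plasma.connect("/tmp/plasma", "", 0)
object_id = client.put([1, 2, 3])  # store a small Python object
print(client.get(object_id))       # read it back: [1, 2, 3]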

Once the above dependencies are installed and deployed, configure Octopus-DF by editing the $HOME/.config/octopus/config.ini file.

The configuration example is as follows:

[installation]
          
HADOOP_CONF_DIR = $HADOOP_HOME/etc/hadoop  
SPARK_HOME = $SPARK_HOME
PYSPARK_PYTHON = ./ANACONDA/pythonenv/bin/python         # python environment
SPARK_YARN_DIST_ARCHIVES = hdfs://host:port/path/to/pythonenv.zip#ANACONDA
SPARK_YARN_QUEUE = root.users.xquant
SPARK_EXECUTOR_CORES = 4
SPARK_EXECUTOR_MEMORY = 10g
SPARK_EXECUTOR_INSTANCES = 6
SPARK_MEMORY_STORAGE_FRACTION = 0.2
SPARK_YARN_JARS = hdfs://host:port/path/to/spark/jars    # jars for spark runtime, which is equivalent to spark.yarn.jars in spark configuration
SPARK_DEBUG_MAX_TO_STRING_FIELDS = 1000
SPARK_YARN_EXECUTOR_MEMORYOVERHEAD = 10g
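Since the file uses standard INI syntax, you can check that it parses with the standard-library configparser. This sketch only illustrates reading the example keys above and is not part of the Octopus-DF API:

import configparser
import os

# Parse $HOME/.config/octopus/config.ini and read one of the example keys.
cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/.config/octopus/config.ini"))
print(cfg["installation"]["SPARK_EXECUTOR_MEMORY"])  # e.g. 10g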

For imperative interfaces such as SparkDataFrame and OctDataFrame, using Octopus-DF feels like using Pandas.

SparkDataFrame

from Octopus.dataframe.core.sparkDataFrame import SparkDataFrame

odf = SparkDataFrame.from_csv("file_path")
print(odf.head(100))         # first 100 rows
print(odf.loc[0:10:2, :])    # rows 0-10 with step 2, all columns
print(odf.filter(like='1'))  # keep columns whose labels contain '1'

OctDataFrame

from Octopus.dataframe.core.octDataFrame import OctDataFrame

# engine_type can be 'spark', 'dask', or 'pandas'
odf = OctDataFrame.from_csv("file_path", engine_type="spark")
print(odf.head(100))
print(odf.loc[0:10:2, :])
print(odf.filter(like='1'))

For declarative interfaces such as SymbolDataFrame, using Octopus-DF is like using Spark: because of its lazy computation mechanism, you must call compute() to trigger the calculation.

SymbolDataFrame

from Octopus.dataframe.core.symbolDataFrame import SymbolDataFrame

# engine_type can be 'spark', 'dask', or 'pandas'.
# If not declared, Octopus-DF will automatically select the best platform.
# Note that automatic selection is experimental and still being improved.
odf = SymbolDataFrame.from_csv("file_path", engine_type='spark')

M = 1000000  # M stands for the number of rows (placeholder value)
odf1 = odf.iloc[0:int(0.8*M):2, :]
odf2 = odf1.iloc[0:int(0.2*M):1, :]
odf2.compute()  # lazy evaluation: nothing runs until compute()

# Show the execution plan and the scheduler's execution time.
odf2.show_execution_plan()

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact Us

Gu Rong - gurong@nju.edu.cn
