In this repository, I am exploring different aspects of working with data in order to optimize the overall performance.
The following projects are available:
- data-serialization: This project investigates different serialization formats in Pandas and PyArrow, such as CSV, JSON, Parquet, etc. It compares the saving time, loading time, file size, and conversion time. This analysis can help us make a better decision the next time we need to save/load our data.
- pandas-speedup: Data manipulation is a time- and resource-expensive part of most Data Science projects. Here I am exploring different scenarios for manipulating a pandas dataframe as fast as possible. I have covered multiple methods, such as a simple loop, the apply function, parallel computations, etc.
- parallel-api: Do you need to call an API in batch? Whether it is for load testing, batch testing, batch inference, or data collection, I have a solution here for you.
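To give a rough idea of what calling an API in batch can look like, here is a minimal sketch using a thread pool from the Python standard library. It is not the code from the parallel-api folder: `call_api` is a hypothetical stand-in for a real HTTP request, and the `max_workers=8` default is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_api(payload):
    """Hypothetical stand-in for a real HTTP request to your API."""
    time.sleep(0.01)  # simulate network latency
    return {"input": payload, "status": 200}


def call_in_batch(payloads, max_workers=8):
    """Send all payloads concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the inputs
        return list(pool.map(call_api, payloads))


if __name__ == "__main__":
    results = call_in_batch(range(20))
    print(len(results), "calls completed")
```

Threads suit this kind of work because the calls spend most of their time waiting on the network. For CPU-heavy work, a process pool would likely be a better fit.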
Each project folder has its own environment.yml file. But if you need a conda environment with all the required libraries for all the projects in this repo, you can use the environment.yml file in the root folder.
The following command creates a conda environment using the given environment.yml:
conda env create -f environment.yml
There is a config-sample.toml file, which should first be renamed to config.toml and then edited with the correct parameters.
If you need to publish the Plotly charts online, add your credentials to the config.toml file:
[plotly]
username = "plotly_username"
api_key = "plotly_api_key"
For more information, you can check Plotly Chart-Studio.
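If you want to check that your config.toml is readable before running a project, the sketch below parses the `[plotly]` section shown above. It is only an illustration, not the loader used in this repository: the function names are hypothetical, and it assumes Python 3.11+ for `tomllib` (or the third-party `tomli` package on older versions).

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumed fallback for older Python versions


def parse_plotly_credentials(toml_text):
    """Return (username, api_key) from TOML text, or None for missing values."""
    config = tomllib.loads(toml_text)
    plotly = config.get("plotly", {})
    return plotly.get("username"), plotly.get("api_key")


def load_plotly_credentials(path="config.toml"):
    """Read the file at `path` and extract the Plotly credentials."""
    with open(path, "r", encoding="utf-8") as f:
        return parse_plotly_credentials(f.read())
```

Calling `load_plotly_credentials()` from the repository root should return the two values you entered in config.toml.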
Kaveh Bakhtiyari - Website | Medium | LinkedIn | GitHub
Feel free to join the open-source community and contribute to this repository.