DataPyground is a data analysis framework and compute engine built for fun. It was started as a foundation for the How Data Platforms Work book, which is associated with the Monthly Python Data Engineering Newsletter, and it is developed while the book is being written to showcase the concepts explained in its chapters.
The main priority of the codebase is to be as feature complete as possible while remaining easy to understand and contribute to for people who have no prior knowledge of compute engines or data processing frameworks in general.
The codebase is heavily documented and commented to make it easy to understand and modify. Contributions are welcomed and encouraged: the project is meant to be a safe playground for learning and experimentation.
Each component of the data platform is self-documented, in a way inspired by the concept of literate programming. The complete documentation is available at Documentation.
To better understand the codebase and the concepts behind it, reading the How Data Platforms Work book is recommended.
Install the datapyground package with pip:
pip install datapyground
Once installed, refer to the Documentation of each component to learn how to use it.
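As a quick sanity check after installation, the following minimal sketch asks Python's standard library whether the distribution is present. It assumes the distribution name is datapyground, as in the pip command above; the helper name installed_version is hypothetical and not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    # "datapyground" is the name used in the pip command above.
    found = installed_version("datapyground")
    print(found if found is not None else "datapyground is not installed")
```

If the script prints a version string, the package is visible to the interpreter that ran it.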
DataPyground exposes some commands to play around with its features. Currently the following commands are provided:
pyground-fquery runs SQL queries on CSV and Parquet files:
$ pyground-fquery -t sales=examples/data/sales.csv "SELECT Product, Quantity, Price, Quantity*Price AS Total FROM sales WHERE Product='Videogame' OR Product='Laptop' ORDER BY Total DESC LIMIT 5"
Product | Quantity | Price | Total
--------- | -------- | ----- | ------
Videogame | 10 | 98.31 | 983.10
Laptop | 10 | 97.24 | 972.40
Videogame | 10 | 97.21 | 972.10
Videogame | 10 | 96.12 | 961.20
Laptop | 10 | 92.23 | 922.30
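To make the steps of the query above concrete, here is a minimal pure-Python sketch of the same filter, projection, sort, and limit. It does not use DataPyground's own API and says nothing about how the engine executes queries internally; the sample rows and the run_query helper are hypothetical and exist only for illustration.

```python
# Hypothetical sample rows; the real examples/data/sales.csv is not reproduced here.
sales = [
    {"Product": "Videogame", "Quantity": 10, "Price": 98.31},
    {"Product": "Laptop", "Quantity": 10, "Price": 97.24},
    {"Product": "Book", "Quantity": 3, "Price": 12.50},
    {"Product": "Videogame", "Quantity": 2, "Price": 50.00},
]


def run_query(rows, limit=5):
    """Mimic the example SQL query step by step on a list of dicts."""
    # WHERE Product='Videogame' OR Product='Laptop'
    selected = [r for r in rows if r["Product"] in ("Videogame", "Laptop")]
    # SELECT Product, Quantity, Price, Quantity*Price AS Total
    projected = [
        {**r, "Total": round(r["Quantity"] * r["Price"], 2)} for r in selected
    ]
    # ORDER BY Total DESC
    ordered = sorted(projected, key=lambda r: r["Total"], reverse=True)
    # LIMIT 5
    return ordered[:limit]


result = run_query(sales)
for row in result:
    print(row["Product"], row["Quantity"], row["Price"], row["Total"])
```

The order of operations matters: the filter runs before the projection so that Total is only computed for matching rows, and the limit is applied last so that it keeps the highest totals.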
Contributions are welcomed and encouraged: the project is meant to be a safe playground for learning and experimentation.
The only requirement is that contributions maintain or increase the level of quality of the documentation and codebase. Contributions that are not properly documented won't be merged. For this project, consider the quality of the documentation more important than the elegance or performance of the codebase.
Contributions are currently meant to be in pure Python. This does not prevent the use of C extensions and Cython for performance in the future, but that will have to wait until the benefit they provide outweighs the added complexity they introduce in the context of a learning project.
Install the uv Python package:
pip install uv
Then install the dependencies and the project in editable mode:
uv sync --dev
Run the tests:
uv run pytest -v
Build the documentation:
cd docs
uv run make html
Once built, the documentation is readable at docs/build/html.