This repository contains:
- a Dockerfile based on jupyter/pyspark-notebook and extended with
  - Delta Lake
  - Soda Spark (deprecated)
  - Soda Core
  - Faker
- notebooks for learning and experimenting with the combination of these technologies.
Build the image
docker build -t jupyter/delta-lake .
Run a container with the current directory mounted at /home/jovyan/work
docker run -it --rm -p 8888:8888 -v "${PWD}":/home/jovyan/work jupyter/delta-lake
About scan.set_data_source_name() and YAML:
https://soda-community.slack.com/archives/C038FFU79J5/p1658849620595999
Test whether Delta Lake and Soda are working.
Introduction to the utils.data_generation module.
Getting scan results in a proper data structure in soda-core is discussed in sodadata/soda-core#1406 and the referenced Slack conversation.
Process all seven batches generated with utils.data_generation.FakerProfileDataSnapshot.
Steps:
- store batch as CSV
- overwrite data in delta table
- run a scan
- store scan results
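The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. It assumes each batch is a list of dicts, that `spark` is a SparkSession already configured for Delta Lake, and that the soda-core-spark-df package is installed. The function names, the `profiles` view name, the directory layout, the JSON output format and the example check are all hypothetical.

```python
import csv
import json
from pathlib import Path

# Hypothetical SodaCL check; "profiles" must match the temp view name used below.
CHECKS_YAML = """
checks for profiles:
  - row_count > 0
"""


def store_batch_as_csv(rows, csv_path):
    """Step 1: write one batch (a non-empty list of dicts) to CSV with a header row."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return csv_path


def overwrite_delta_table(spark, csv_path, table_path):
    """Step 2: replace the contents of the Delta table with the current batch."""
    df = spark.read.csv(str(csv_path), header=True, inferSchema=True)
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")  # inferred schema may differ per batch
        .save(str(table_path))
    )


def run_scan(spark, table_path, checks_yaml, view_name="profiles"):
    """Step 3: run a Soda Core scan against the current state of the table."""
    # Imported here so the CSV/JSON helpers work without Soda installed.
    from soda.scan import Scan

    spark.read.format("delta").load(str(table_path)).createOrReplaceTempView(view_name)
    scan = Scan()
    scan.set_scan_definition_name("batch_scan")
    scan.set_data_source_name("spark_df")
    scan.add_spark_session(spark, data_source_name="spark_df")
    scan.add_sodacl_yaml_str(checks_yaml)
    scan.execute()
    return scan.get_scan_results()


def store_scan_results(results, out_path):
    """Step 4: persist the scan results dict as JSON (an assumed storage format)."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # default=str guards against values such as datetimes that json cannot encode.
    out_path.write_text(json.dumps(results, default=str, indent=2))
    return out_path


def process_batches(spark, batches, work_dir, checks_yaml=CHECKS_YAML):
    """Run steps 1-4 once per batch, overwriting the same Delta table each time."""
    work_dir = Path(work_dir)
    table_path = work_dir / "delta" / "profiles"
    for i, rows in enumerate(batches):
        csv_path = store_batch_as_csv(rows, work_dir / "csv" / f"batch_{i}.csv")
        overwrite_delta_table(spark, csv_path, table_path)
        results = run_scan(spark, table_path, checks_yaml)
        store_scan_results(results, work_dir / "scan_results" / f"scan_{i}.json")
```

Because the table is overwritten on every iteration, each scan sees only the latest batch. Writing one result file per batch keeps the history of scan outcomes available for later queries.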
Contains example queries to consume the scan results.
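As a rough illustration of consuming scan results, the sketch below flattens a results dict into one row per check and summarizes the outcomes. It is plain Python rather than the notebook's own queries, which are not reproduced here. It assumes the dict returned by `scan.get_scan_results()` has a `definitionName` key and a `checks` list whose entries carry `name`, `table` and `outcome`. The sample data is made up for the example.

```python
from collections import Counter


def checks_to_rows(scan_results):
    """Flatten a scan results dict into one flat row per check."""
    definition = scan_results.get("definitionName")
    return [
        {
            "scan": definition,
            "table": check.get("table"),
            "check": check.get("name"),
            "outcome": check.get("outcome"),
        }
        for check in scan_results.get("checks", [])
    ]


def outcome_counts(rows):
    """Count how many checks ended in each outcome (e.g. pass, warn, fail)."""
    return Counter(row["outcome"] for row in rows)


# Made-up sample that mimics the assumed structure of a scan results dict.
sample_results = {
    "definitionName": "batch_scan",
    "checks": [
        {"name": "row_count > 0", "table": "profiles", "outcome": "pass"},
        {"name": "missing_count(name) = 0", "table": "profiles", "outcome": "fail"},
    ],
}

rows = checks_to_rows(sample_results)
failed_checks = [row["check"] for row in rows if row["outcome"] == "fail"]
print(outcome_counts(rows))
print(failed_checks)
```

Rows in this shape can also be loaded into a DataFrame or a table if SQL-style queries are preferred.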
WIP