My portfolio of test tasks from different companies for the Data Engineer role
All the code has been formatted by Black: The Uncompromising Code Formatter
Configured GitHub Actions:
- Dependabot checks on a weekly basis
- After each commit, GitHub workflows run the configured checks
Task 1
Description: calculate pyspark aggregations from the given csv.
Tech:
- python
- spark
- csv
Task 2
Description: calculate pyspark aggregations from the given parquet and csv.
Tech:
- python
- spark
- csv
Task 3
Description: calculate pyspark aggregations from the given csv.
Tech:
- python
- spark
- csv
Task 4
Description:
- calculate pyspark aggregations from the given parquet
- ingest the data to postgres
- read the data from postgres
- calculate pyspark aggregations and save as csv
Tech:
- python
- spark
- parquet
- postgres in docker with persistent storage
Task 5
Description:
- calculate pyspark metrics and dimensions aggregations from given json
- test the app
Tech:
- python
- spark
- pytest: 91% test coverage according to Coverage.py
- json/parquet
Kafka pet project
The project itself lives in a separate GitHub repo. Its purpose is to demonstrate Java, Kafka, Prometheus, and Grafana knowledge.