opendata opendata-api spark reproducible-research reproducibility scala spark-sql demography statistics data-science

Spark data source example: connecting to an open-data API

This repo shows how Spark (3.0) can be leveraged to read open data accessible from remote APIs.

The death registry published by the French government is taken as an example. It contains more than 30 million death records going back to 1970.

The retrieval is performed using the new data source SPI introduced in Spark 3.0. Using the data source SPI to extract data from remote APIs can give cleaner, more reusable code than ad hoc processing, and it is not necessarily more difficult to master.
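To give a feel for the shape of a Spark 3.0 data source, here is a minimal read-only sketch. It is not the code of this repo: the class names (`ExampleSource`, `ExampleTable`, `ExampleScan`, `YearPartition`, `ExampleReaderFactory`), the two-column schema, and the one-partition-per-year split are assumptions made for illustration, and the reader returns a placeholder row where a real implementation would call the remote API.

```scala
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical schema, chosen only for this sketch.
object ExampleSchema {
  val schema: StructType = StructType(Seq(
    StructField("year", IntegerType),
    StructField("payload", StringType)))
}

// Entry point that Spark instantiates from the name given to `format(...)`.
class ExampleSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    ExampleSchema.schema

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table =
    new ExampleTable(schema)
}

// A table that only supports batch reads.
class ExampleTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "example"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.Collections.singleton(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ExampleScan(tableSchema)
}

// Assumption: the remote data can be fetched independently per year.
case class YearPartition(year: Int) extends InputPartition

// For brevity one class plays the roles of ScanBuilder, Scan and Batch.
class ExampleScan(tableSchema: StructType) extends ScanBuilder with Scan with Batch {
  override def build(): Scan = this
  override def readSchema(): StructType = tableSchema
  override def toBatch(): Batch = this
  override def planInputPartitions(): Array[InputPartition] =
    Array[InputPartition](YearPartition(1970), YearPartition(1971))
  override def createReaderFactory(): PartitionReaderFactory =
    new ExampleReaderFactory
}

// Serialized to the executors, where it creates one reader per partition.
class ExampleReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val year = partition.asInstanceOf[YearPartition].year
    new PartitionReader[InternalRow] {
      // Placeholder data: a real reader would request and parse the
      // remote API response for `year` here.
      private val rows: Iterator[InternalRow] =
        Iterator(InternalRow(year, UTF8String.fromString(s"placeholder for $year")))
      private var current: InternalRow = _

      override def next(): Boolean =
        if (rows.hasNext) { current = rows.next(); true } else false
      override def get(): InternalRow = current
      override def close(): Unit = ()
    }
  }
}
```

With these classes on the classpath, such a source could be read with `spark.read.format(classOf[ExampleSource].getName).load()`. Splitting the work into input partitions is what lets Spark fetch several slices of the remote data in parallel, which is harder to get right with ad hoc download code.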

Usage in a notebook or in a script

./tests/cluster-test.sc gives an example of how to use the data source. This example requires sbt, ammonite and docker to be installed locally.

The following instructions create a fat jar with all the code for the Spark data source, spin up a Spark cluster using docker-compose, and run a Spark session in ammonite, a Scala REPL:

sbt assembly
./tests/cluster-test.sh

There is also an example polynote notebook, ./tests/SparkTest.ipynb.

Development

Unit and integration tests:

sbt test

End-to-end tests:

sbt assembly
./tests/cluster-test.sh

Code formatting:

sbt scalafmtAll

License

opendata-example is licensed under The MIT License.
