An experiment to get the MoJ Splink Spark demo(s) running on Microsoft Fabric.
Notebooks:
The best way to get Splink running is to make use of the new Environments feature in Fabric.
Note: Environments are a preview feature and may change before general release [as of 2024-03-05].
To use Splink in this demo, you need to:
- Upload the similarity UDF jar into the Lakehouse that you are using:
![similarity_jar_file](https://private-user-images.githubusercontent.com/7355456/310841586-b3f4f1a0-e293-4bbb-8a12-82541b44329b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI4MTAyMjcsIm5iZiI6MTcyMjgwOTkyNywicGF0aCI6Ii83MzU1NDU2LzMxMDg0MTU4Ni1iM2Y0ZjFhMC1lMjkzLTRiYmItOGExMi04MjU0MWI0NDMyOWIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDgwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA4MDRUMjIxODQ3WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MDFhNjc3ZWQwM2QxMWM1YWIzNmUwMzdmZTgwZWVlNzIxNjE0NTFmMzExYTg3ODYzNzBiMDk3M2E3ODBmZWRlZiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.8BIau2_-KLO481igqZWPUBm7dbZ8rtLXK_xCC1poOsU)
- In your environment, add a "spark.jars" Spark Property that points to the jar file. Use an ABFS path to point to the file, e.g. `abfss://00000000-0000-0000-0000-000000000000@onelake.dfs.fabric.microsoft.com/00000000-0000-0000-0000-000000000000/Files/scala-udf-similarity-0.1.1_spark3.x.jar`:
![spark_jar_property](https://private-user-images.githubusercontent.com/7355456/310841677-d34041d6-7303-42aa-8b06-d6956cfd97e3.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI4MTAyMjcsIm5iZiI6MTcyMjgwOTkyNywicGF0aCI6Ii83MzU1NDU2LzMxMDg0MTY3Ny1kMzQwNDFkNi03MzAzLTQyYWEtOGIwNi1kNjk1NmNmZDk3ZTMucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDgwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA4MDRUMjIxODQ3WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDE2MjNlYjcyZDBmYmRiM2ViODkyNWU3YTQ3NTIxMzEzMTVlMTdmNGE0OTQwMGRhMmE5NTM2NzBjM2M4YTI4NCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.waTQCBkJcIz0k26OeP-rU6VS4-1yPLeZl2toFy5BMa4)
Note: You can get the correct ABFS path for your file by right-clicking on it in your Lakehouse file listing and selecting "Copy ABFS Path".
- In your environment, add Splink as a Public Library from PyPI:
![splink_public_library](https://private-user-images.githubusercontent.com/7355456/310841758-fdf237ed-8a36-4da4-b651-87b3faa155fa.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI4MTAyMjcsIm5iZiI6MTcyMjgwOTkyNywicGF0aCI6Ii83MzU1NDU2LzMxMDg0MTc1OC1mZGYyMzdlZC04YTM2LTRkYTQtYjY1MS04N2IzZmFhMTU1ZmEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDgwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA4MDRUMjIxODQ3WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MzNlNDY3Zjg1YWQ3NjliMzllZDMyYTMxZTc2NTQzM2ExOTk3MGI0YjExMTU4YTFhYTA5OGU5NmY1MzZmYzk5MSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.0B6hW32YmU3yZnZIPeAnzpWU_o6147cWdZIpy6Xkl10)
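
Once the environment is attached to a notebook, a quick check can confirm that the steps above took effect. The following is a minimal sketch, not part of the original demo. `build_abfs_jar_path` and `is_package_installed` are hypothetical helper names, the GUIDs are placeholders, and the path layout (workspace ID before the `@`, Lakehouse ID as the first path segment) is an assumption based on the example path shown earlier.

```python
from importlib.util import find_spec

# OneLake host as it appears in the example ABFS path above.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def build_abfs_jar_path(workspace_id: str, lakehouse_id: str, jar_name: str) -> str:
    """Assemble an ABFS path for a file in the Lakehouse "Files" area.

    Hypothetical helper: the layout is assumed from the example path,
    so prefer "Copy ABFS Path" in the Fabric UI as the source of truth.
    """
    return f"abfss://{workspace_id}@{ONELAKE_HOST}/{lakehouse_id}/Files/{jar_name}"


def is_package_installed(name: str) -> bool:
    """Return True if a top-level package can be found in this Python session."""
    return find_spec(name) is not None


if __name__ == "__main__":
    # Placeholder GUIDs: replace with your own workspace and Lakehouse IDs.
    jar_path = build_abfs_jar_path(
        "00000000-0000-0000-0000-000000000000",
        "00000000-0000-0000-0000-000000000000",
        "scala-udf-similarity-0.1.1_spark3.x.jar",
    )
    print("Expected spark.jars value:", jar_path)
    print("splink installed:", is_package_installed("splink"))
```

In a Fabric notebook, where a `spark` session is normally already defined, the printed path could be compared with the value returned by `spark.sparkContext.getConf().get("spark.jars")` to see whether the environment property was picked up.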