Development

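Development uses overcommit to manage Git hooks. Overcommit is a Ruby gem and can be installed with gem install overcommit; see the overcommit installation instructions for details. Once it is installed, run overcommit --install in this directory. The pre-commit hook for scalastyle will then run on every commit, as in this example: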

base ❯ git commit -m 'added overcommit for hooks'
Running pre-commit hooks

✓ All pre-commit hooks passed

Running commit-msg hooks
Check subject line................................[SingleLineSubject] OK
Check subject capitalization.....................[CapitalizedSubject] WARNING
Subject should start with a capital letter
Check for trailing periods in subject................[TrailingPeriod] OK
Check text width..........................................[TextWidth] OK

⚠ All commit-msg hooks passed, but with warnings

Space Filling Curves

Space-filling curves map points in an n-dimensional space onto a one-dimensional curve while largely preserving locality: points that are close together in the original space tend to stay close together along the curve. Techniques such as Z-ordering let big data platforms store and process large chunks of data efficiently.
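As a rough illustration of the idea (not this library's implementation), a two-dimensional Z-order index can be built by interleaving the bits of the two coordinates. The object and method names below are hypothetical, and the sketch assumes non-negative integer coordinates.

```scala
object MortonSketch {
  /** Interleave the low `bits` bits of x and y.
    * Bit i of x lands at position 2i, bit i of y at position 2i + 1.
    */
  def interleave2(x: Int, y: Int, bits: Int = 16): Long = {
    require(bits >= 1 && bits <= 31, "bits must be in 1..31")
    require(x >= 0 && y >= 0, "coordinates must be non-negative")
    var z = 0L
    var i = 0
    while (i < bits) {
      z |= ((x >> i) & 1).toLong << (2 * i)     // bit i of x
      z |= ((y >> i) & 1).toLong << (2 * i + 1) // bit i of y
      i += 1
    }
    z
  }

  def main(args: Array[String]): Unit = {
    // Walk a 4 x 4 grid in Z-index order and print each cell with its index.
    val cells = for (y <- 0 until 4; x <- 0 until 4) yield (x, y)
    cells
      .sortBy { case (x, y) => interleave2(x, y) }
      .foreach { case (x, y) => println(s"($x, $y) -> ${interleave2(x, y)}") }
  }
}
```

Sorting cells by this index visits the grid in a repeating Z-shaped pattern, so cells that are close in two dimensions usually end up close in the sorted order.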

Available GitHub Packages

Spark-2.3.1 on Scala 2.11.12
Spark-2.4.7 on Scala 2.11.12 and Scala 2.12.13
Spark-3.1.0 on Scala 2.12.13 and Java 11 (versions 0.1.0 and 0.2.0)

Usage

How to compute a Morton (Z) or Hilbert ordering.

Morton (Z Order)

Given the DataFrame below, we want to Z-order (Morton-order) the data by id, x, and y.

// The library is not yet published to Maven.
// For now, publish it locally, or build an assembly jar and use that.
// Morton comes from this library; import it from the jar you built.
import org.apache.spark.sql.DataFrame
// Needed for toDF; assumes a SparkSession named `spark` is in scope.
import spark.implicits._

val orderingCols: Array[String] = Array("id", "x", "y")
val df: DataFrame = Seq(
  (1, 1, 12.23, "a", "m"),
  (4, 9, 5.05, "b", "m"),
  (3, 0, 1.23, "c", "f"),
  (2, 2, 100.4, "d", "f"),
  (1, 25, 3.25, "a", "m")
).toDF("x", "y", "amnt", "id", "sex")

val mortonOrdering: Morton = new Morton(df, orderingCols)
// Sort the whole DataFrame by the generated z_index column.
val zIndexedDF: DataFrame = mortonOrdering
  .mortonIndex.sort("z_index")
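To show what ordering rows by a Z-index can look like without Spark, here is a minimal sketch over the same sample rows using plain Scala collections. It assumes one possible approach: replace each ordering column's values by their dense rank, then interleave the rank bits. The Record type, the helper names, and the rank step are assumptions made for illustration; the library may compute z_index differently.

```scala
object ZOrderRowsSketch {
  // Hypothetical record type mirroring the sample columns: x, y, amnt, id, sex.
  final case class Record(x: Int, y: Int, amnt: Double, id: String, sex: String)

  val rows: Seq[Record] = Seq(
    Record(1, 1, 12.23, "a", "m"),
    Record(4, 9, 5.05, "b", "m"),
    Record(3, 0, 1.23, "c", "f"),
    Record(2, 2, 100.4, "d", "f"),
    Record(1, 25, 3.25, "a", "m")
  )

  /** Map each distinct value to its 0-based position in sorted order. */
  def denseRank[A: Ordering](values: Seq[A]): Map[A, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  /** Interleave rank bits round-robin: bit i of column c lands at position i * k + c. */
  def interleave(ranks: Seq[Int], bits: Int): Long = {
    val k = ranks.length
    require(bits >= 1 && bits * k <= 63, "index must fit in a Long")
    var z = 0L
    for (i <- 0 until bits; c <- 0 until k)
      z |= ((ranks(c) >> i) & 1).toLong << (i * k + c)
    z
  }

  /** Rows paired with their Z-index over (id, x, y), sorted by that index. */
  def zIndexed: Seq[(Record, Long)] = {
    val idRank = denseRank(rows.map(_.id))
    val xRank = denseRank(rows.map(_.x))
    val yRank = denseRank(rows.map(_.y))
    rows
      .map(r => (r, interleave(Seq(idRank(r.id), xRank(r.x), yRank(r.y)), bits = 3)))
      .sortBy(_._2)
  }

  def main(args: Array[String]): Unit =
    zIndexed.foreach { case (r, z) => println(s"$z -> $r") }
}
```

Ranking first lets string and numeric columns share one bit-interleaving routine, at the cost of an extra pass over each column.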

Hilbert Order

Hilbert is only available in version 0.2.0 on Spark 3.
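For intuition about how a Hilbert index differs from a Morton index, here is a minimal sketch of the classic iterative mapping from a cell (x, y) to its distance along a Hilbert curve on an n x n grid, where n is a power of two. It is a standalone illustration with hypothetical names and is not taken from this library's Hilbert implementation.

```scala
object HilbertSketch {
  /** Distance along the Hilbert curve of cell (x0, y0) in an n x n grid.
    * n must be a power of two and both coordinates must lie in [0, n).
    */
  def xy2d(n: Int, x0: Int, y0: Int): Long = {
    require(n > 0 && (n & (n - 1)) == 0, "n must be a power of two")
    require(x0 >= 0 && x0 < n && y0 >= 0 && y0 < n, "x and y must lie in [0, n)")
    var x = x0
    var y = y0
    var d = 0L
    var s = n / 2
    while (s > 0) {
      // Which quadrant of the current square does the cell fall in?
      val rx = if ((x & s) > 0) 1 else 0
      val ry = if ((y & s) > 0) 1 else 0
      d += s.toLong * s * ((3 * rx) ^ ry)
      // Rotate or flip so the sub-square has the standard orientation.
      if (ry == 0) {
        if (rx == 1) {
          x = n - 1 - x
          y = n - 1 - y
        }
        val t = x
        x = y
        y = t
      }
      s /= 2
    }
    d
  }

  /** All cells of an n x n grid, listed in Hilbert order. */
  def cellsInOrder(n: Int): Seq[(Int, Int)] =
    (for (x <- 0 until n; y <- 0 until n) yield (x, y)).sortBy { case (x, y) => xy2d(n, x, y) }
}
```

Unlike the Z-order curve, consecutive Hilbert indices always belong to neighbouring cells, which is why Hilbert ordering can cluster nearby points more tightly.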

Benefits

How do space-filling curves help? Consider the Chicago crime data set, available at Crimes - 2001 to Present. The data was pulled on 8 August 2021. The downloaded CSV file is 1.74 GB and contains 7,374,374 records. First, I converted the CSV to Parquet with the default Snappy compression, which resulted in 13 leaf files, each approximately 38 MB, for a total size of 0.459 GB.

File Type	Compression	Number of Leaf Files	Optimization	Size (MB)
CSV	None	1	None	1781.76
Parquet	Snappy	13	None	470.02
Parquet	gzip	13	None	315.22
Parquet	gzip	1	Semi-linear	269.81
Parquet	gzip	1	Z-order
Parquet	gzip	1	Hilbert

Work in Progress

README
Better organization

Help Needed

Looking for help from people experienced in writing good READMEs and publishing code to Maven.

About

Space filling curve library for Spark

Apache License 2.0
