subpath / space-filling-curves

Space filling curve library for Spark

Development

Development requires overcommit, which is installed through Ruby (see the overcommit installation instructions). Once it is installed, run overcommit --install in this directory. This sets up the Git hooks, including the pre-commit hook that runs scalastyle.

base ❯ git commit -m 'added overcommit for hooks'
Running pre-commit hooks

✓ All pre-commit hooks passed

Running commit-msg hooks
Check subject line................................[SingleLineSubject] OK
Check subject capitalization.....................[CapitalizedSubject] WARNING
Subject should start with a capital letter
Check for trailing periods in subject................[TrailingPeriod] OK
Check text width..........................................[TextWidth] OK

⚠ All commit-msg hooks passed, but with warnings

Space Filling Curves

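Space filling curves let us map points in n-dimensional space onto a one-dimensional curve while largely preserving locality: points that are close together in the original space tend to stay close together along the curve. Techniques such as Z-ordering use this property to help big data platforms store and process large chunks of data efficiently.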

  1. Processing Petabytes of Data in Seconds with Databricks Delta
  2. Z-order curve
  3. Z-order indexing for multifaceted queries in Amazon DynamoDB: Part 1
  4. Z-order indexing for multifaceted queries in Amazon DynamoDB: Part 2
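
To make the locality idea concrete, here is a minimal sketch of a two-dimensional Morton (Z-order) index computed by interleaving the bits of two non-negative integer coordinates. The object name `MortonSketch`, the function `interleave`, and the `bits` parameter are hypothetical names chosen for illustration; this is not the library's implementation.

```scala
object MortonSketch {
  /** Interleave the low `bits` bits of x and y.
    * Bits of x land on even positions, bits of y on odd positions.
    * Any bits of x or y above `bits` are ignored.
    */
  def interleave(x: Int, y: Int, bits: Int = 16): Long = {
    require(x >= 0 && y >= 0, "coordinates must be non-negative")
    require(bits >= 1 && bits <= 31, "bits must be between 1 and 31")
    (0 until bits).foldLeft(0L) { (acc, i) =>
      val xBit = ((x >> i) & 1).toLong
      val yBit = ((y >> i) & 1).toLong
      acc | (xBit << (2 * i)) | (yBit << (2 * i + 1))
    }
  }
}
```

With this sketch, the cells (0,0), (1,0), (0,1), (1,1) map to 0, 1, 2, 3, which traces the "Z" shape that gives the curve its name.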

Available GitHub Packages

Spark-2.3.1 on Scala 2.11.12 
Spark-2.4.7 on Scala 2.11.12 and Scala 2.12.13
Spark-3.1.0 on Scala 2.12.13 and Java 11, versions 0.1.0 and 0.2.0

Usage

How to compute a Morton (Z) or Hilbert ordering for a DataFrame.

Morton (Z Order)

Given the DataFrame below, we want to order our data along a Morton (Z-order) curve by id, x, and y.

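// Currently, this isn't set up to use Maven.
// For now, publish locally or build the assembly and use the jar.
// Morton must also be imported from this library (its package path is not shown here).
import org.apache.spark.sql.DataFrame
// Needed for .toDF; assumes a SparkSession named `spark` is in scope (as in spark-shell).
import spark.implicits._

// Columns to order by.
val orderingCols: Array[String] = Array("id", "x", "y")
val df: DataFrame = Seq(
  (1, 1, 12.23, "a", "m"),
  (4, 9, 5.05, "b", "m"),
  (3, 0, 1.23, "c", "f"),
  (2, 2, 100.4, "d", "f"),
  (1, 25, 3.25, "a", "m")
).toDF("x", "y", "amnt", "id", "sex")

val mortonOrdering: Morton = new Morton(df, orderingCols)
// This orders the whole DataFrame by the z_index column.
val zIndexedDF: DataFrame = mortonOrdering
  .mortonIndex
  .sort("z_index")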
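
To show what sorting by a z_index does to the sample rows above, the following plain-Scala sketch (no Spark) computes an n-dimensional bit-interleaved index and sorts the same five rows by it. It is an illustration under stated assumptions, not the library's code: `ZIndexSketch`, `zIndex`, `bitsPerDim`, and `Row` are hypothetical names, and only the integer columns x and y are used because this README does not say how the library encodes string or floating-point columns.

```scala
object ZIndexSketch {
  /** Round-robin interleave of the low `bitsPerDim` bits of each coordinate.
    * Bit `b` of dimension `d` lands at position b * n + d, where n = coords.length.
    */
  def zIndex(coords: Seq[Int], bitsPerDim: Int = 10): Long = {
    require(coords.nonEmpty && coords.forall(_ >= 0), "need non-negative coordinates")
    require(bitsPerDim >= 1 && coords.length * bitsPerDim <= 63, "index must fit in a Long")
    val n = coords.length
    val positions = for (bit <- 0 until bitsPerDim; dim <- 0 until n) yield (bit, dim)
    positions.foldLeft(0L) { case (acc, (bit, dim)) =>
      acc | (((coords(dim) >> bit) & 1).toLong << (bit * n + dim))
    }
  }

  // Hypothetical stand-in for a DataFrame row with the columns used above.
  final case class Row(x: Int, y: Int, amnt: Double, id: String, sex: String)

  val rows: Seq[Row] = Seq(
    Row(1, 1, 12.23, "a", "m"),
    Row(4, 9, 5.05, "b", "m"),
    Row(3, 0, 1.23, "c", "f"),
    Row(2, 2, 100.4, "d", "f"),
    Row(1, 25, 3.25, "a", "m")
  )

  // Order the rows by the index built from (x, y) only.
  val sorted: Seq[Row] = rows.sortBy(r => zIndex(Seq(r.x, r.y)))
}
```

In this sketch the rows with small x and y, such as (1, 1), (3, 0), and (2, 2), end up next to each other, while (4, 9) and (1, 25) sort later.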

Hilbert Order

Hilbert is only available in version 0.2.0 on Spark 3.
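
This README does not show the library's Hilbert API, so the sketch below only illustrates the idea. It uses the well-known iterative conversion from a cell (x, y) in an n-by-n grid to its distance along a two-dimensional Hilbert curve. `HilbertSketch` and `xy2d` are hypothetical names, and the code assumes n is a power of two.

```scala
object HilbertSketch {
  /** Distance along the Hilbert curve of cell (x0, y0) in an n-by-n grid.
    * Assumes n is a power of two and 0 <= x0, y0 < n.
    */
  def xy2d(n: Int, x0: Int, y0: Int): Int = {
    require(n > 0 && (n & (n - 1)) == 0, "n must be a power of two")
    require(x0 >= 0 && x0 < n && y0 >= 0 && y0 < n, "cell must lie inside the grid")
    var x = x0
    var y = y0
    var d = 0
    var s = n / 2
    while (s > 0) {
      // Which quadrant of the current square does the cell fall in?
      val rx = if ((x & s) > 0) 1 else 0
      val ry = if ((y & s) > 0) 1 else 0
      d += s * s * ((3 * rx) ^ ry)
      // Rotate or flip so the sub-square is in the standard orientation.
      if (ry == 0) {
        if (rx == 1) {
          x = n - 1 - x
          y = n - 1 - y
        }
        val t = x
        x = y
        y = t
      }
      s /= 2
    }
    d
  }
}
```

Unlike the Z-order curve, consecutive positions on a Hilbert curve are always neighbouring grid cells, so there are no long jumps between adjacent index values.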

Benefits

How do space filling curves help? Let's consider the Chicago crime data set available at Crimes - 2001 to Present. This data was pulled on 8 August 2021. The downloaded CSV file is 1.74 GB and contains 7,374,374 records. First, I converted the CSV to Parquet with the default Snappy compression, which resulted in 13 leaf files of approximately 38 MB each, for a total size of 0.459 GB.

| File Type | Compression | Number of Leaf Files | Optimization | Size (MB) |
|-----------|-------------|----------------------|--------------|-----------|
| CSV       | None        | 1                    | None         | 1781.76   |
| Parquet   | Snappy      | 13                   | None         | 470.02    |
| Parquet   | gzip        | 13                   | None         | 315.22    |
| Parquet   | gzip        | 1                    | Semi-linear  | 269.81    |
| Parquet   | gzip        | 1                    | Z-order      |           |
| Parquet   | gzip        | 1                    | Hilbert      |           |
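
One plausible reason ordered files come out smaller is that sorting places equal or similar values next to each other, which run-length and dictionary encodings (both used by Parquet) and general-purpose compressors such as gzip can exploit. The plain-Scala sketch below counts runs of equal adjacent values before and after sorting a small column. The column values are made up for illustration, and `RunLengthSketch` and `countRuns` are hypothetical names; this is not a measurement on the Chicago data.

```scala
object RunLengthSketch {
  /** Number of maximal runs of equal adjacent values.
    * Fewer runs generally means run-length encoding has less to store.
    */
  def countRuns[A](values: Seq[A]): Int =
    if (values.isEmpty) 0
    else 1 + values.zip(values.tail).count { case (a, b) => a != b }

  // Made-up column values, for illustration only.
  val column: Seq[Int] = Seq(3, 1, 3, 2, 1, 2, 3, 1)

  val unsortedRuns: Int = countRuns(column)        // 8 runs: no two neighbours are equal
  val sortedRuns: Int = countRuns(column.sorted)   // 3 runs: 1,1,1 | 2,2 | 3,3,3
}
```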

Work in Progress

  • README
  • Better organization

Help Needed

Looking for help from people experienced in writing good READMEs and publishing code to Maven.

License: Apache License 2.0

