sekikn / spark-executor-dict-plugin

Fast Read-only Data Dictionary Attached to Each Spark Executor


This repository provides a Spark plugin implementation that runs an executor-side RPC dict server, letting users look up the values associated with keys. A common use case is computing values in a map task by referring to static shared state (e.g., a pre-built knowledge base or master data). If the shared state is small, a broadcast variable is a good fit, as follows:

>>> from pyspark.sql.functions import udf, col
>>> broadcasted_hmap = spark.sparkContext.broadcast({"key1": "value1", "key2": "value2", ...})
>>> @udf(returnType='double')
... def lookup_udf(x):
...     hmap = broadcasted_hmap.value
...     value = ...  # Computes a value by referring to the broadcasted dict like 'hmap[key]'
...     return value

>>> df = df.select(lookup_udf(col("x")))
>>> df.show()

Keeping a copy of the state in each task's memory, however, can be wasteful if the state is large (e.g., 10GB or more). To mitigate this, the plugin spins up an RPC server alongside each executor, and the server returns the values associated with keys by referring to a specified state. Since all map tasks in an executor access the same shared state through the RPC server, memory consumption is much smaller than with broadcast variables. A user accesses the shared state via the RPC server as follows:

# 'largeMap.db' is a file-backed hash map created by MapDB (https://mapdb.org)
$ pyspark --jars=./assembly/spark-executor-dict-plugin_2.12_spark3.0-0.1.0-SNAPSHOT-with-dependencies.jar \
  --conf spark.plugins=org.apache.spark.plugin.SparkExecutorDictPlugin \
  --conf spark.files=/tmp/largeMap.db

>>> from pyspark.sql.functions import udf, col
>>> @udf(returnType='double')
... def lookup_udf(x):
...     from client import DictClient
...     hmap = DictClient()
...     value = ...  # Computes a value by talking to an executor-attached RPC map server
...                  # like 'hmap.lookup(key)'
...     return value

>>> df = df.select(lookup_udf(col("x")))
>>> df.show()

An RPC server holds the shared state in an on-disk hash map provided by MapDB. Therefore, only frequently-accessed key-value pairs are expected to stay in memory and the memory footprint can be small. For actual running examples, please see the test code.
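As a more concrete illustration, here is a minimal, self-contained sketch of a lookup UDF built on the DictClient/lookup API shown above. The string return type and the None fallback for missing keys are assumptions, not confirmed behavior; see the test code for authoritative usage:

>>> from pyspark.sql.functions import udf, col

>>> @udf(returnType='string')
... def dict_lookup(key):
...     # Imported inside the UDF so the module resolves on the executor
...     from client import DictClient
...     hmap = DictClient()  # Talks to the executor-local RPC dict server
...     # Assumption: 'lookup' returns the stored string value for a key,
...     # and a missing key maps to None (a NULL in the resulting DataFrame)
...     return hmap.lookup(key) or None

>>> df = spark.createDataFrame([("key1",), ("key2",)], ["x"])
>>> df.select(dict_lookup(col("x")).alias("value")).show()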

MapDB data conversion

To generate a MapDB map file from your data, you can use a helper function included in the package:

$ spark-shell --jars=./assembly/spark-executor-dict-plugin_2.12_spark3.0-0.1.0-SNAPSHOT-with-dependencies.jar

scala> import io.github.maropu.MapDbConverter
scala> val largeMap = Map("key1" -> "value1", "key2" -> "value2", ...)
scala> MapDbConverter.save("/tmp/largeMap.db", largeMap)
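The generated file (/tmp/largeMap.db in this example) can then be shipped to each executor via spark.files, as in the pyspark command shown earlier.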

Configurations

Property Name | Default | Meaning
spark.plugins.executorDict.dbFile | "" | Absolute path of a loadable MapDB file on an executor instance. If not specified, the plugin automatically detects the file in the working directory of each executor.
spark.plugins.executorDict.port | 6543 | Port number for an RPC dict server in an executor.
spark.plugins.executorDict.mapCacheSize | 10000 | Maximum number of cache entries for a shared dict.
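For reference, the same settings can also be passed programmatically when building a session instead of on the command line. This is a minimal sketch that simply restates the options above with their default values; adjust the jar and file paths to your environment:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # Same jar and plugin class as in the pyspark example above
    .config("spark.jars", "./assembly/spark-executor-dict-plugin_2.12_spark3.0-0.1.0-SNAPSHOT-with-dependencies.jar")
    .config("spark.plugins", "org.apache.spark.plugin.SparkExecutorDictPlugin")
    .config("spark.files", "/tmp/largeMap.db")
    # Plugin options from the table above, set to their defaults
    .config("spark.plugins.executorDict.port", "6543")
    .config("spark.plugins.executorDict.mapCacheSize", "10000")
    .getOrCreate())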

TODO

  • Report some metrics via MetricRegistry
  • Support Spark v3.1
  • Add more tests

Bug reports

If you hit a bug or have a feature request, please leave a comment on Issues or Twitter (@maropu).


License

Apache License 2.0

