This repository provides a Spark plugin implementation for executor-side RPC dict servers, which let a user look up the values associated with given keys. A common use case is computing values in a map task by referring to static shared state (e.g., a pre-built knowledge base or master data). If the shared state is small, a broadcast variable is a good fit:
>>> broadcasted_hmap = spark.sparkContext.broadcast({"key1": "value1", "key2": "value2", ...})
>>> @udf(returnType='double')
... def my_udf(x):
...     hmap = broadcasted_hmap.value
...     value = ...  # Computes a value by referring to the broadcasted dict like 'hmap[key]'
...     return value
...
>>> df = df.select(my_udf(col("x")))
>>> df.show()
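For instance, a minimal self-contained sketch of this pattern looks like the following (the sample data and the column name 'x' are illustrative only):

>>> from pyspark.sql.functions import col, udf
>>> df = spark.createDataFrame([("key1",), ("key2",)], ["x"])
>>> broadcasted_hmap = spark.sparkContext.broadcast({"key1": 1.0, "key2": 2.0})
>>> @udf(returnType='double')
... def lookup_value(x):
...     # Returns None (SQL NULL) for keys missing from the broadcasted dict
...     return broadcasted_hmap.value.get(x)
...
>>> df.select(lookup_value(col("x"))).show()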
Having a copy of the state in each task's memory, however, can be wasteful if the state is large (e.g., 10GB or more). To mitigate this issue, the plugin lets a user spin up an RPC server along with each executor; the RPC server returns the values associated with given keys by referring to a specified state. Since all the map tasks in an executor access the same shared state through the RPC server, memory consumption is much smaller than with broadcast variables. A user accesses the shared state via an RPC server as follows:
# '/tmp/largeMap.db' is a file-backed hash map created with MapDB, https://mapdb.org
$ pyspark --jars=./assembly/spark-executor-dict-plugin_2.12_spark3.0-0.1.0-SNAPSHOT-with-dependencies.jar \
    --conf spark.plugins=org.apache.spark.plugin.SparkExecutorDictPlugin \
    --conf spark.files=/tmp/largeMap.db
>>> @udf(returnType='double')
... def my_udf(x):
...     from client import DictClient
...     hmap = DictClient()
...     value = ...  # Computes a value by talking to an executor-attached RPC map server
...                  # like 'hmap.lookup(key)'
...     return value
...
>>> df = df.select(my_udf(col("x")))
>>> df.show()
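For instance, a minimal self-contained sketch of this pattern might look like the following; it assumes the launch command above, that the values stored in '/tmp/largeMap.db' are strings parsable as doubles, and that 'DictClient.lookup' returns an empty result for missing keys (see the test code for the exact client semantics):

>>> from pyspark.sql.functions import col, udf
>>> @udf(returnType='double')
... def lookup_value(x):
...     # Import inside the UDF so the shipped module resolves on executors
...     from client import DictClient
...     hmap = DictClient()  # Talks to the executor-local RPC dict server
...     v = hmap.lookup(x)
...     return float(v) if v else None
...
>>> df = spark.createDataFrame([("key1",), ("key2",)], ["x"])
>>> df.select(lookup_value(col("x"))).show()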
An RPC server holds the shared state as an on-disk hash map provided by MapDB. Therefore, frequently-accessed key-value pairs are expected to stay in memory and the memory footprint can be kept small. For actual running examples, please see the test code.
To generate a MapDB map file for your data, you can use a helper function included in the package:
$ spark-shell --jars=./assembly/spark-executor-dict-plugin_2.12_spark3.0-0.1.0-SNAPSHOT-with-dependencies.jar
scala> import io.github.maropu.MapDbConverter
scala> val largeMap = Map("key1" -> "value1", "key2" -> "value2", ...)
scala> MapDbConverter.save("/tmp/largeMap.db", largeMap)
Property Name | Default | Meaning |
---|---|---|
spark.plugins.executorDict.dbFile | "" | Absolute path of a loadable MapDB file on an executor's instance. If not specified, the plugin automatically detects it in the working directory of each executor. |
spark.plugins.executorDict.port | 6543 | Port number for an RPC dict server in an executor. |
spark.plugins.executorDict.mapCacheSize | 10000 | Maximum number of cache entries for a shared dict. |
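For example, these defaults can be overridden at launch time; the port and cache-size values below are illustrative only:

$ pyspark --jars=./assembly/spark-executor-dict-plugin_2.12_spark3.0-0.1.0-SNAPSHOT-with-dependencies.jar \
    --conf spark.plugins=org.apache.spark.plugin.SparkExecutorDictPlugin \
    --conf spark.plugins.executorDict.dbFile=/tmp/largeMap.db \
    --conf spark.plugins.executorDict.port=6544 \
    --conf spark.plugins.executorDict.mapCacheSize=50000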
TODO:
- Report some metrics via MetricRegistry
- Support Spark v3.1
- Add more tests
If you find any bugs or have feature requests, please leave a comment on Issues or Twitter (@maropu).