rapids-shell

Utility to run/debug Spark RAPIDS in REPL

This repo was started as a wrapper around Spark REPLs for easier use with the Spark RAPIDS plugin. Lately I have been putting more effort into maintaining standalone Jupyter notebooks that can be started without the wrapper script; in particular, they are easy to open directly in VSCode with the Jupyter extension.

Original Utility

A utility to start a RAPIDS-enabled Spark shell with access to unit test resources from https://github.com/NVIDIA/spark-rapids. Before running the examples, make sure to at least execute mvn package in your local spark-rapids repo if you are not using binaries.
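For example, a minimal build sketch, assuming the spark-rapids repo is cloned at ~/spark-rapids (the path is illustrative):

cd ~/spark-rapids
mvn package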

Command line options

See rapids.sh --help for up-to-date information; an illustrative invocation follows the option list below.

Usage: rapids.sh [OPTION]
Options:
  --debug
    enable bash tracing
  -h, --help
    prints this message
  -l4j=LOG4J_CONF_FILE, --log4j-file=LOG4J_CONF_FILE
    LOG4J_CONF_FILE location of a custom log4j config for local mode
  -nsys, --nsys-profile
    run with Nsight Systems profiling
  -m=MASTER, --master=MASTER
    specify MASTER for spark command, default is local[-cluster], see --num-local-execs
  -n, --dry-run
    generates and prints the spark submit command without executing
  -nle=N, --num-local-execs=N
    specify the number of local executors to use, default is 2. If > 1 use pseudo-distributed
    local-cluster, otherwise local[*]
  -uecp, --use-extra-classpath
    use extraClassPath instead of --jars to add RAPIDS jars to spark-submit (default)
  -uj, --use-jars
    use --jars instead of extraClassPath to add RAPIDS jars to spark-submit
  --ucx-shim=spark<3xy>
    Spark buildver to populate shim-dependent package name of RapidsShuffleManager.
    Will be replaced by a Boolean option
  -cmd=CMD, --spark-command=CMD
    specify one of spark-submit (default), spark-shell, pyspark, jupyter, jupyter-lab
  -dopts=EOPTS, --driver-opts=EOPTS
    pass EOPTS as --driver-java-options
  -eopts=EOPTS, --executor-opts=EOPTS
    pass EOPTS as spark.executor.extraJavaOptions
  --gpu-fraction=GPU_FRACTION
    GPU share per executor JVM unless local or local-cluster mode, see spark.rapids.memory.gpu.allocFraction
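As a sketch combining the options above (the flag combination is illustrative), this previews the generated spark-submit command for a two-executor local-cluster spark-shell without executing it:

SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh --dry-run --num-local-execs=2 --spark-command=spark-shell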

Environment variables

  • SPARK_RAPIDS_HOME - the path either to the local repo or to the location used for downloading the binaries

  • SPARK_HOME - the path either to the local Spark repo or to the root of the binary distro

  • SPARK_CMD - one of spark-shell, spark-submit (default), pyspark, jupyter, jupyter-lab

Examples

Use Spark RAPIDS in Jupyter notebook

SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 SPARK_CMD=jupyter[-lab] rapids.sh

Run in pseudo-distributed local-cluster mode

NUM_LOCAL_EXECS=2 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh

Allow attaching a java debugger to the driver JVM

JDBSTR=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh

Running Spark RAPIDS ScalaTests in spark-shell once started

Single test suite

scala> run(new com.nvidia.spark.rapids.InsertPartition311Suite)
InsertPartition311Suite:
...

Single test case

scala> run(new com.nvidia.spark.rapids.HashAggregatesSuite, "sum(floats) group by more_floats 2 partitions")
HashAggregatesSuite:
...

Using integration test datagens

In pyspark-based drivers one can use data generators from spark-rapids/integration-tests or run whole pytests.

Add rapids.py as an ipython startup file, e.g. on *NIX

cp src/python/rapids.py ~/.ipython/profile_default/startup/

Datagen

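# StructGen, IntegerGen, and two_col_df are data generation helpers from the
# spark-rapids integration tests, assumed to be in scope via the rapids.py startup file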
key_data_gen = StructGen([
        ('a', IntegerGen(min_val=0, max_val=4)),
        ('b', IntegerGen(min_val=5, max_val=9)),
    ], nullable=False)
val_data_gen = IntegerGen()
df = two_col_df(spark, key_data_gen, val_data_gen)

...

Pytest

runpytest('test_struct_count_distinct')
