spark-pyrest

A simple Python package for querying Apache Spark's REST API:

  • a simple interface
  • task metrics returned as pandas DataFrames for easy slicing and dicing
  • easy access to executor logs for post-processing

Installation and dependencies

spark-pyrest uses pandas.

To install:

$ git clone https://github.com/rokroskar/spark-pyrest.git
$ cd spark-pyrest
$ python setup.py install
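
Alternatively, installing directly from the repository with pip should also work:

$ pip install git+https://github.com/rokroskar/spark-pyrest.git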

Usage

Initializing the SparkPyRest object with a host address

from spark_pyrest import SparkPyRest

spr = SparkPyRest(host)
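
Here host is whatever address serves Spark's REST API for your application; the value below is only a placeholder for illustration:

host = '172.19.1.1'  # placeholder: replace with the host serving your Spark REST endpoint
spr = SparkPyRest(host)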

Get the current app

spr.app
u'app-20170420091222-0000'

Get the running/completed stages

spr.stages

[(2, u'groupByKey'),
 (1, u'partitionBy'),
 (0, u'partitionBy'),
 (4, u'runJob at PythonRDD.scala:441'),
 (3, u'distinct')]
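
The entries are (stage_id, stage_name) pairs, so the stage ids can be fed straight back into the other methods. A small sketch, assuming the structure shown above:

# pull task metrics for every stage listed above
stage_metrics = {stage_id: spr.tasks(stage_id) for stage_id, name in spr.stages}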

Get task metrics for a stage

spr.tasks(1)
   bytesWritten  executorId  executorRunTime          host  localBytesRead  remoteBytesRead  taskId
0       3508265          33           354254  172.19.1.124               0              112   25604
1       3651003          56           353554   172.19.1.74               0              112   24029
2       3905955           9           347724   172.19.1.62               0              111   19719
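
The result is a plain pandas DataFrame, so the usual pandas machinery applies; for example, to look at the slowest tasks in the stage:

tasks = spr.tasks(1)
# ten tasks with the longest run time
slowest = tasks.sort_values('executorRunTime', ascending=False).head(10)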

Check some properties aggregated by host/executor

host_tasks = spr.tasks(1)[['host','bytesWritten','remoteBytesRead','executorRunTime']].groupby('host')

host_tasks.describe()
                    bytesWritten  executorRunTime  remoteBytesRead
host
172.19.1.100 count  1.270000e+02       127.000000     1.270000e+02
             mean   6.436659e+06    337587.913386     3.230165e+06
             std    1.744411e+06    114722.662028     5.590263e+05
             min    0.000000e+00         0.000000     1.098278e+06
             25%    6.426009e+06    292290.000000     2.967360e+06
             50%    6.805269e+06    348174.000000     3.315189e+06
             75%    7.160529e+06    420091.500000     3.582200e+06
             max    8.050828e+06    500831.000000     4.775107e+06
172.19.1.101 count  1.300000e+02       130.000000     1.300000e+02
             mean   6.470575e+06    335401.584615     3.178066e+06
             std    1.725755e+06    115268.286668     6.332040e+05
             min    0.000000e+00         0.000000     0.000000e+00
             25%    6.534020e+06    290462.750000     2.965009e+06
             50%    6.880597e+06    356160.000000     3.327600e+06
             75%    7.197086e+06    407338.750000     3.558479e+06
             max    7.983406e+06    515372.000000     4.526526e+06
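
The same grouping works per executor as well; a quick sketch totalling the stage 1 metrics by executor:

per_executor = spr.tasks(1).groupby('executorId')[['bytesWritten', 'executorRunTime']].sum()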

Plot the data

mean_runtime = host_tasks.mean()['executorRunTime']

mean_runtime.plot(kind='bar')

(bar plot: mean executorRunTime per host)
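
Outside a notebook you may need matplotlib explicitly to render and save the figure, for example:

import matplotlib.pyplot as plt

ax = mean_runtime.plot(kind='bar')
ax.set_ylabel('mean executorRunTime')
plt.tight_layout()
plt.savefig('mean_runtime.png')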

Grab the log of an executor

If your application prints its own metrics to stdout/stderr, you need access to the executor logs to retrieve them. The logs live on the individual hosts of your cluster, so collecting them programmatically can be tedious; you can view them through the Spark web UI, of course, but processing them that way is not very practical. The executor_log method of SparkPyRest returns the full contents of an executor's log for easy post-processing and data extraction.

log = spr.executor_log(0)
print log
==== Bytes 0-301839 of 301839 of /tmp/work/app-20170420091222-0000/0/stderr ====
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/04/20 09:12:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14086@x04y08
17/04/20 09:12:23 INFO SignalUtils: Registered signal handler for TERM
17/04/20 09:12:23 INFO SignalUtils: Registered signal handler for HUP
17/04/20 09:12:23 INFO SignalUtils: Registered signal handler for INT
17/04/20 09:12:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/20 09:12:23 INFO SecurityManager: Changing view acls to: roskar
17/04/20 09:12:23 INFO SecurityManager: Changing modify acls to: roskar
17/04/20 09:12:23 INFO SecurityManager: Changing view acls groups to: 
17/04/20 09:12:23 INFO SecurityManager: Changing modify acls groups to: 
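
Since executor_log returns the full log as one string, pulling out your own metrics is just text processing. A minimal sketch, assuming a hypothetical application that prints lines like "MYMETRIC: 42.0":

import re

log = spr.executor_log(0)

# hypothetical pattern: extract values from lines printed by the application itself
pattern = re.compile(r'MYMETRIC:\s*([\d.]+)')
values = [float(m.group(1)) for m in pattern.finditer(log)]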

Contributing

Please submit an issue if you discover a bug or have a feature request! Pull requests are also very welcome.

License

MIT License

