acroz / pylivy

A Python client for Apache Livy, enabling use of remote Apache Spark clusters.

Set job description in LivySession

quartox opened this issue · comments

I am going to start working on a PR that would allow users to set the job description using the spark context during LivySession.read(). This is done using sc.setJobDescription("my description here") which will replace the name in the Spark UI and history server for the collect stage of the session.

I would welcome input on whether this should be a new method, a parameter for read that allows injecting code, or just a name injection. We are already setting the job description in LivySession.run, so an optional session attribute that prepends the description-setting code might also make sense.

Hey, thanks for the offer to contribute! Your PR would be very welcome.

I think this should be a new method rather than a parameter for read - this function would work similarly to read (it needs to generate and execute some code based on the passed description) but otherwise serves quite a different purpose. I'd also not be keen to add it as an attribute of the session as this would require us to block in LivySession.__init__ for the session to be available before submitting the statement.

BTW, be aware that I've just merged #93 which affects code in this module - you might want to make sure you're working off the latest master.

Yeah, that makes sense. I don't know if there is a good general-purpose way to expose this new method. One option is essentially "download, but with arbitrary code execution before serializing the dataframe", which could cause problems: the user's code might print extra output that breaks the JSON parsing. Another option is a method with a job description argument that is prepended to the serialization code, but that might be so specific to our use case that it isn't useful to other users.
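The second option might look something like the sketch below. The function name and the serialization code are illustrative only, not pylivy's actual API:

```python
def build_download_statement(variable, description=None):
    """Generate the code for a download statement, optionally prepending
    a sc.setJobDescription call so the description applies to this job."""
    # Illustrative serialization; pylivy's real download code may differ.
    serialize = f"print({variable}.toPandas().to_json(orient='split'))"
    if description is None:
        return serialize
    # Escape double quotes so the generated code stays valid Python.
    escaped = description.replace('"', '\\"')
    return f'sc.setJobDescription("{escaped}")\n{serialize}'

print(build_download_statement("df", "my description here"))
```

The generated string would then be submitted as a single Livy statement, so the description and the serialization run together.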

I am trying to think of a safe way to execute arbitrary code whose output can be separated from the JSON before parsing. Does it seem reasonable to try to separate the text output of the arbitrary code from the JSON string?
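One way that separation might be made robust is a sentinel line emitted between the arbitrary code's output and the JSON payload. A minimal sketch (the sentinel value and helper names are assumptions, not anything pylivy does today):

```python
import json

# Hypothetical sentinel marking where the JSON payload begins in the
# statement's text output.
SENTINEL = "---PYLIVY-JSON-BELOW---"

def build_statement(user_code, serialize_code):
    # user_code may print anything; the sentinel line marks where it ends.
    return "\n".join([user_code, f'print("{SENTINEL}")', serialize_code])

def parse_output(statement_output):
    # Everything after the last sentinel line is the JSON payload.
    _, _, payload = statement_output.rpartition(SENTINEL)
    return json.loads(payload)

# Simulated statement output containing extra prints from the user's code:
output = "some stray user print\n" + SENTINEL + '\n{"a": [1, 2]}'
print(parse_output(output))
```

Splitting on the last occurrence of the sentinel guards against the user's code happening to print the sentinel string itself before the real payload.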

Are you aware that you can already run arbitrary code with LivySession.run()? So for example, you could already use this to set the job description before downloading the dataframe (untested):

from livy import LivySession

LIVY_URL = "http://spark.example.com:8998"

with LivySession.create(LIVY_URL) as session:
    session.run('sc.setJobDescription("my description here")')
    ...
    local_df = session.download("df")

I should have explained the problem more clearly. Livy resets the job description with each statement, so a description set via session.run disappears when the next statement is executed. In addition, the job description only shows up for the second stage of a statement, so for simple queries executed by session.run the new job description does not show up at all.

We are using the job description to differentiate jobs from a pool of sessions. Otherwise we would need to look at log timestamps to compare jobs in the history server. By setting the description in the download we are guaranteed at least one stage will have useful information in the history server that can distinguish which query was run.

The problem is that we pool the sessions and then re-use each session for multiple purposes; one session might run several distinct queries. We can probably get the ID of the Spark application for each query, but when we examine that application in the history server it is difficult to tell which stage belongs to which query. We might be able to do it by comparing timestamps, but it is much easier to set the job description and see the query's name on the stage.
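As a sketch of that use case (the helper and naming scheme are purely illustrative, not part of pylivy), each query submitted to a pooled session could be tagged with a unique description in the same statement as the action, since Livy resets the description between statements:

```python
import itertools

# Monotonic counter so each query on the pooled session gets a distinct tag.
_counter = itertools.count(1)

def tagged_statement(query_name, code):
    """Prepend a unique sc.setJobDescription call to the query's code, so
    the resulting stages are identifiable in the Spark history server."""
    description = f"{next(_counter):04d}-{query_name}"
    escaped = description.replace('"', '\\"')
    # Both lines must go in one statement: a description set in a separate
    # statement would be reset by Livy before the action runs.
    return f'sc.setJobDescription("{escaped}")\n{code}'

print(tagged_statement("daily-aggregates", "df.count()"))
```

Each generated string would then be submitted via session.run, giving at least one stage per query a recognizable name in the history server.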

I believe setting the job description does not overwrite the job group. We were careful about that so that the statement remains cancellable by Livy.

Closing this since it is low priority and we found other workarounds.