acroz / pylivy

A Python client for Apache Livy, enabling use of remote Apache Spark clusters.

Set job description in LivySession

quartox opened this issue · comments

I am going to start working on a PR that would allow users to set the job description using the spark context during LivySession.read(). This is done using sc.setJobDescription("my description here") which will replace the name in the Spark UI and history server for the collect stage of the session.

I would welcome input on whether this should be a new method, a parameter for read that allows injecting code, or just a name injection. We are already setting the job description in LivySession.run, so an optional session attribute that prepends the description-setting code might also make sense.

Hey, thanks for the offer to contribute! Your PR would be very welcome.

I think this should be a new method rather than a parameter for read - this function would work similarly to read (it needs to generate and execute some code based on the passed description) but otherwise serves quite a different purpose. I'd also not be keen to add it as an attribute of the session as this would require us to block in LivySession.__init__ for the session to be available before submitting the statement.

BTW, be aware that I've just merged #93 which affects code in this module - you might want to make sure you're working off the latest master.

Yeah, that makes sense. I don't know if there is a good general-purpose way to expose this new method. One option is essentially "download, but with arbitrary code execution before serializing the dataframe", which could cause problems: the user's code might print extra output that breaks the JSON parsing. Another option is a method with a job description argument that is prepended to the serialization code, but that might be so specific to our use case that it isn't useful to other users.
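The second option might look something like the sketch below. The function name and the serialization code are illustrative only, not pylivy's actual API:

```python
def build_download_statement(variable, description=None):
    """Generate the code for a download statement, optionally prepending
    a sc.setJobDescription call so the description applies to this job."""
    # Illustrative serialization; pylivy's real download code may differ.
    serialize = f"print({variable}.toPandas().to_json(orient='split'))"
    if description is None:
        return serialize
    # Escape double quotes so the generated code stays valid Python.
    escaped = description.replace('"', '\\"')
    return f'sc.setJobDescription("{escaped}")\n{serialize}'

print(build_download_statement("df", "my description here"))
```

The generated string would then be submitted as a single Livy statement, so the description and the serialization run together.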

I am trying to think of a safe way to execute arbitrary code whose output can be separated from the JSON before parsing. Does it seem reasonable to try to separate the text output of the arbitrary code from the JSON string?
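One way that separation might be made robust is a sentinel line emitted between the arbitrary code's output and the JSON payload. A minimal sketch (the sentinel value and helper names are assumptions, not anything pylivy does today):

```python
import json

# Hypothetical sentinel marking where the JSON payload begins in the
# statement's text output.
SENTINEL = "---PYLIVY-JSON-BELOW---"

def build_statement(user_code, serialize_code):
    # user_code may print anything; the sentinel line marks where it ends.
    return "\n".join([user_code, f'print("{SENTINEL}")', serialize_code])

def parse_output(statement_output):
    # Everything after the last sentinel line is the JSON payload.
    _, _, payload = statement_output.rpartition(SENTINEL)
    return json.loads(payload)

# Simulated statement output containing extra prints from the user's code:
output = "some stray user print\n" + SENTINEL + '\n{"a": [1, 2]}'
print(parse_output(output))
```

Splitting on the last occurrence of the sentinel guards against the user's code happening to print the sentinel string itself before the real payload.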

Are you aware that you can already run arbitrary code with LivySession.run()? So for example, you could already use this to set the job description before downloading the dataframe (untested):

from livy import LivySession

LIVY_URL = "http://spark.example.com:8998"

with LivySession.create(LIVY_URL) as session:
    session.run('sc.setJobDescription("my description here")')
    ...
    local_df = session.download("df")

I should have explained the problem more clearly. Livy resets the job description with each statement, so a description set via session.run disappears when the next statement is executed. In addition, the job description only shows up for the second stage of a statement, so for simple queries executed by session.run the new job description does not show up at all.

We are using the job description to differentiate jobs from a pool of sessions. Otherwise we would need to look at log timestamps to compare jobs in the history server. By setting the description in the download we are guaranteed at least one stage will have useful information in the history server that can distinguish which query was run.

The problem is that we pool the sessions and then re-use each session for multiple purposes; one session might run several distinct queries. We can probably get the ID of the Spark application for each query, but when we examine that application in the history server it is difficult to tell which stage belongs to which query. We might be able to do it by comparing timestamps, but it is much easier to set the job description and see the query's name on the stage.
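As a sketch of that use case (the helper and naming scheme are purely illustrative, not part of pylivy), each query submitted to a pooled session could be tagged with a unique description in the same statement as the action, since Livy resets the description between statements:

```python
import itertools

# Monotonic counter so each query on the pooled session gets a distinct tag.
_counter = itertools.count(1)

def tagged_statement(query_name, code):
    """Prepend a unique sc.setJobDescription call to the query's code, so
    the resulting stages are identifiable in the Spark history server."""
    description = f"{next(_counter):04d}-{query_name}"
    escaped = description.replace('"', '\\"')
    # Both lines must go in one statement: a description set in a separate
    # statement would be reset by Livy before the action runs.
    return f'sc.setJobDescription("{escaped}")\n{code}'

print(tagged_statement("daily-aggregates", "df.count()"))
```

Each generated string would then be submitted via session.run, giving at least one stage per query a recognizable name in the history server.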

I believe setting the job description does not overwrite the job group. We were careful about that so that the statement remains cancellable by Livy.

Closing this since it is low priority and we found other workarounds.