prestodb / presto-python-client

Python DB-API client for Presto


Decouple stats polling from results fetching

matthewwardrop opened this issue · comments

Greetings all!

I'm looking to transition a project I curate (omniduct, a library to simplify data acquisition, especially for data scientists) from pyhive to prestodb; but currently I would lose the ability to poll for query progress before actually attempting to retrieve results.

i.e. cursor.fetchone() is used to both collect results and update stats, which means that I cannot show progress of the actual execution of the query, only progress through collection of the results.
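To make the coupling concrete, here is a rough sketch of the usual flow (assuming standard prestodb.dbapi usage; host, table, and credentials below are placeholders):

```python
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="scientist",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT * FROM some_large_table")

rows = []
while True:
    row = cur.fetchone()  # stats are refreshed only as a side effect of this call
    if row is None:
        break
    rows.append(row)
    # Any "progress" observed here reflects result transfer, not query execution.
```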

Would you welcome a patch to add support for this polling? Or are you planning to add it yourselves? Or are you opposed to adding this feature?

@matthewwardrop yes, we would welcome a patch to add support for polling stats independently of fetching results. We're not currently working on it, so your contribution would be greatly appreciated :).

  1. What stats are you most interested in?
  2. What options are you considering to gather stats?

Regarding (2), the client could send an HTTP GET request to the /v1/query/{query_id} endpoint.
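For illustration, a minimal sketch of what that polling could look like (the coordinator address and query id are placeholders, authentication headers may be required, and the exact response shape can vary between Presto versions):

```python
import requests

def poll_query_stats(query_id, coordinator="http://localhost:8080"):
    """Fetch the current state and stats of a query from the coordinator."""
    resp = requests.get("{}/v1/query/{}".format(coordinator, query_id))
    resp.raise_for_status()
    info = resp.json()
    # The QueryInfo response exposes the overall state plus a queryStats block
    # with fields such as completedDrivers / totalDrivers.
    return info.get("state"), info.get("queryStats", {})
```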

At some point, we'll need to consider using asyncio (or a concurrent.futures executor in Python 2.7) to perform some HTTP requests asynchronously, since interleaving the retrieval of results and stats could lead to unexpected behaviors, such as queries failing with an ABANDONED error if the client takes too long to poll the status of a query.
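To illustrate the concern, a rough sketch of a background poller built on a concurrent.futures executor (reusing the hypothetical poll_query_stats helper above; error handling omitted):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def watch_query(query_id, interval=1.0):
    """Print progress until the query reaches a terminal state."""
    while True:
        state, stats = poll_query_stats(query_id)
        print("query {}: {} ({}/{} drivers)".format(
            query_id, state,
            stats.get("completedDrivers", 0),
            stats.get("totalDrivers", 0)))
        if state in ("FINISHED", "FAILED", "CANCELED"):
            return state
        time.sleep(interval)

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(watch_query, "20180101_000000_00000_abcde")
# ...fetch results on the main thread as usual, then:
final_state = future.result()
executor.shutdown()
```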

Hi @ggreg,

Thanks for responding to this.

I'm interested in all of the stats returned by the standard endpoints, but especially the 'progress' field. In terms of methodology, I am imagining polling the same endpoints currently used by PrestoQuery.fetch and returning the stats. At some point, a call will return data and/or the query will enter a finished state; any returned data will be cached on an internal instance attribute, and further status polling will simply return the state as of that time. The user can then use the fetch methods as before, which will first consume the data set aside in the local cache and then append any data returned by subsequent endpoint calls until the data is fully collected locally, as is the current behaviour.
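For illustration, the usage I have in mind could look roughly like this (the poll() method name and the stats keys are placeholders, not a settled API):

```python
import time

# conn: an open prestodb.dbapi connection, as in the earlier sketch
cur = conn.cursor()
cur.execute("SELECT count(*) FROM some_large_table")

# Phase 1: poll execution progress without consuming the result set.
while True:
    stats = cur.poll()  # hypothetical: hits the next URI, caching any rows it receives
    if stats is None or stats.get("state") == "FINISHED":
        break
    print("progress:", stats.get("progress"))
    time.sleep(1)

# Phase 2: fetch as usual; cached rows come back first, then the remainder.
results = cur.fetchall()
```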

This should not suffer any abandonment issues unless the user fails to move on to the fetch methods within some sensible window of time after the polling indicates that the query has successfully run its course.

Perhaps the asyncio/futures approach might belong instead in a wrapping library, such as omniduct, unless you are planning to support multiplexing of queries within this library itself.

I'll put out a PR soon.

Hi @matthewwardrop,
Did you get a chance to add this functionality?

Not yet, @akhandev. I'll try to put out a PR this week. :)