acroz / pylivy

A Python client for Apache Livy, enabling use of remote Apache Spark clusters.

download_sql does not return more than 1000 rows

muracstech opened this issue

Is there a way I can download 100K rows using download_sql?

I am having the same issue, and I would like to download as many rows as needed.

I faced the same issue and found that Livy restricts results to 1000 records. The Spark explain plan shows a global limit of 1000, and I am trying to find out how to raise it.

Would anyone be interested in an s3/hdfs redirect download feature (@acroz not sure if this would be within the scope of this project)?

Users could provide the following additional parameters to the session constructor:

  1. a prefix/directory for temporary storage (s3://, hdfs://, file://, etc.)
  2. a fetcher function that returns a dataframe given a URI as a string.

The download method could take an optional flag to override the default behavior: instead of streaming rows back through Livy, the dataframe is saved to temporary storage at a generated URI, e.g. uri = "TMP_DIR/DF_NAME.parquet", and fetcher(uri) is returned.
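The proposed flow could be sketched roughly like this. Note this is a hypothetical sketch, not pylivy's API: `download_via_storage`, `StubSession`, and the bucket path are all made up for illustration, and `session.run(...)` stands in for executing Spark code on the remote cluster.

```python
import uuid


class StubSession:
    """Stand-in for a remote Livy session; records the code it is asked to run."""

    def run(self, code):
        self.last_code = code


def download_via_storage(session, df_name, tmp_dir, fetcher):
    """Hypothetical redirect download: write to shared storage, fetch locally."""
    # Generate a unique URI under the caller-supplied temporary prefix.
    uri = f"{tmp_dir.rstrip('/')}/{df_name}-{uuid.uuid4().hex}.parquet"
    # Remote side: persist the full dataframe, bypassing Livy's row cap.
    session.run(f'{df_name}.write.parquet("{uri}")')
    # Local side: the user-supplied fetcher turns the URI into a dataframe
    # (e.g. via pandas.read_parquet for s3:// or file:// URIs).
    return fetcher(uri)


session = StubSession()
result = download_via_storage(session, "df", "s3://bucket/tmp/", lambda uri: uri)
```

The key design point is that only small control messages go through the Livy REST API; the bulk data moves through storage both sides can reach.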

With this spark_conf you can control the number of output rows:

```python
LivySession.create(livy_url, kind=SessionKind.SQL, spark_conf={'livy.rsc.sql.num-rows': '2000'})
```
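Building on that comment, a small helper can keep the override in one place. This is a sketch under the assumption that your Livy deployment honours `livy.rsc.sql.num-rows`; the helper name and the endpoint URL are made up for illustration.

```python
def make_session_kwargs(url, num_rows=100_000):
    """Build kwargs for LivySession.create with a raised SQL row cap.

    Assumption: the cluster honours livy.rsc.sql.num-rows (check your
    Livy version); values are passed as strings in spark_conf.
    """
    return {
        "url": url,
        "spark_conf": {"livy.rsc.sql.num-rows": str(num_rows)},
    }


kwargs = make_session_kwargs("http://livy-server:8998")  # hypothetical endpoint
# Usage: LivySession.create(kind=SessionKind.SQL, **kwargs)
```

Keep in mind that very large caps push all rows through the Livy REST API, so for truly large results the storage-redirect approach above the cap may still be preferable.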