Cannot use V2 for streaming read

Question

james-miles-ccy opened this issue a year ago

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

I am trying to read Excel files through the V2 data source as a stream, without success. Is there anything I can do to get this working?

The code is below:

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "excel")
    .option("maxRowsInMemory", 20)
    .schema(schema)
    .load(file_path)
)

display(df)

The exception is:

java.lang.UnsupportedOperationException: ExcelFileFormat as fallback format for V2 supports writing only

Expected Behavior

I expected the read to produce a streaming DataFrame.

Steps To Reproduce

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "excel")
    .option("maxRowsInMemory", 20)
    .schema(schema)
    .load(file_path)
)

display(df)

Environment

- Spark version: 3.3.1
- Spark-Excel version: 2.12:3.3.1_0.18.7
- OS: Windows
- Cluster environment: Databricks

Anything else?

No response

Martin Mauch · Answer 1 · Sun Apr 23 2023 17:26:55 GMT+0800 (China Standard Time)

The documentation reads as if Auto Loader (the cloudFiles source) only supports a few specific file formats:
https://docs.databricks.com/ingestion/auto-loader/options.html#file-format-options
I'm not sure whether those formats are hard-coded somewhere or whether one would need to implement a special API.
I don't have time to look into this, but if you're willing to give it a try yourself I can give you some guidance.
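
To make the point about supported formats concrete, here is a minimal sketch of a pre-flight check you could run before building the stream. The set of formats is an assumption written from memory of the page linked above, and the names ASSUMED_AUTO_LOADER_FORMATS and is_auto_loader_format are hypothetical helpers, not part of Spark, Databricks, or spark-excel. Verify the list against the documentation for your runtime.

```python
# Formats listed under "File format options" in the Auto Loader docs.
# ASSUMPTION: written from memory; the real list may differ by runtime version.
ASSUMED_AUTO_LOADER_FORMATS = frozenset(
    {"avro", "binaryfile", "csv", "json", "orc", "parquet", "text", "xml"}
)


def is_auto_loader_format(fmt: str) -> bool:
    """Return True if fmt is in the assumed list (case-insensitive)."""
    return fmt.strip().lower() in ASSUMED_AUTO_LOADER_FORMATS


# "excel" is not in the assumed list, which matches the failure reported above.
print(is_auto_loader_format("excel"))
print(is_auto_loader_format("binaryFile"))
```

This only checks a name against a list; it says nothing about how Auto Loader resolves formats internally, which is the open question here.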

Richie Caputo · Answer 2 · Tue Apr 30 2024 01:41:09 GMT+0800 (China Standard Time)

We have gotten this to work for other custom file formats with a fixed schema. I wonder whether a similar approach could be applied here while supporting both provided and inferred schemas.