.xls support? And a weird header issue
aodj opened this issue · comments
Firstly, thanks for roapi! I've been playing around with it for a little while and really like it.
I've been trying to use it to proxy some datafiles containing information on currency codes as published by the body behind the ISO 4217 standard. The data is available in .xls or .xml format, but I can't get roapi to parse the .xls file. Opening it in Excel shows that the file has a couple of leading rows of dud data, in a non-tabular format that's not relevant to the data to be loaded.
When running it with roapi I get the following errors:
$ RUST_LOG=debug roapi --table "https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list-one.xls,format=xlsx,sheet_name=Active"
[2023-02-13T11:26:06Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:26:06Z INFO roapi::context] loading `uri(https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list-one.xls)` as table `list-one`
Error: Error loading Xlsx table: Failed to load .xlsx file
Downloading the file locally gives the same error:
$ RUST_LOG=debug roapi --table "list-one.xls,format=xlsx,sheet_name=Active"
[2023-02-13T11:27:03Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:27:03Z INFO roapi::context] loading `uri(list-one.xls)` as table `list-one`
Error: Error loading Xlsx table: Failed to load .xlsx file
If I save it as an .xlsx file I get an error about unparseable data:
$ RUST_LOG=debug RUST_BACKTRACE=1 roapi --table "list-one.xlsx,format=xlsx,sheet_name=Active"
[2023-02-13T11:28:07Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:28:07Z INFO roapi::context] loading `uri(list-one.xlsx)` as table `list-one`
Error: Arrow error: Invalid argument error: column types must match schema types, expected Null but found Utf8 at column index 2
Caused by:
Invalid argument error: column types must match schema types, expected Null but found Utf8 at column index 2
Do you think there's anything that might be done to support the loading of this file? Something like skiprows
or header
in Pandas would work. I've worked around it by processing the data with Pandas and saving it as a .parquet file, but I would like the ability to reference the current upstream file without having to process it.
Thanks for any input you can provide. I look forward to any suggestions you might have.
I am facing same issue with all .xlsx files. Even a simple CSV saved as XLSX (to rule out any data issue) is unable to be read by roapi
$ roapi -c /Users/dm/Desktop/roapi_test.yaml [2023-02-23T20:33:47Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB [2023-02-23T20:33:47Z INFO roapi::context] loading 'uri(/Users/dm/Desktop/Dremio/Dremio_Test/All_Kind_Files/Salary_Sheet1.xlsx)' as table 'EXCEL_TABLE' Error: Error loading Xlsx table: Failed to open .xlsx file.