roapi / roapi

Create full-fledged APIs for slowly moving datasets without writing a single line of code.

Home Page:https://roapi.github.io/docs

.xls support? And a weird header issue

aodj opened this issue

Firstly, thanks for roapi! I've been playing around with it for a little while and really like it.

I've been trying to use it to proxy some data files containing currency code information, as published by the body behind the ISO 4217 standard. The data is available in .xls or .xml format, but I can't get roapi to parse the .xls file. Opening it in Excel shows that the file starts with a couple of rows of dud data, in a non-tabular layout, that are not relevant to the data to be loaded.

When running it with roapi I get the following errors:

$ RUST_LOG=debug roapi --table "https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list-one.xls,format=xlsx,sheet_name=Active"
[2023-02-13T11:26:06Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:26:06Z INFO  roapi::context] loading `uri(https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list-one.xls)` as table `list-one`
Error: Error loading Xlsx table: Failed to load .xlsx file

Downloading the file locally gives the same error:

$ RUST_LOG=debug roapi --table "list-one.xls,format=xlsx,sheet_name=Active"
[2023-02-13T11:27:03Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:27:03Z INFO  roapi::context] loading `uri(list-one.xls)` as table `list-one`
Error: Error loading Xlsx table: Failed to load .xlsx file

If I save it as an .xlsx file I get an error about unparseable data:

$ RUST_LOG=debug RUST_BACKTRACE=1 roapi --table "list-one.xlsx,format=xlsx,sheet_name=Active"
[2023-02-13T11:28:07Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:28:07Z INFO  roapi::context] loading `uri(list-one.xlsx)` as table `list-one`
Error: Arrow error: Invalid argument error: column types must match schema types, expected Null but found Utf8 at column index 2

Caused by:
    Invalid argument error: column types must match schema types, expected Null but found Utf8 at column index 2
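One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis of roapi's loader, is that the column types were inferred from the leading non-tabular rows, where the third column is empty, so a later text value no longer matches the inferred type. The plain-Python sketch below only illustrates that mechanism. The helpers `infer_types` and `first_mismatch` and the sample rows are hypothetical and are not taken from roapi, Arrow, or the real file.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def infer_types(rows: Sequence[Row], sample_size: int) -> List[str]:
    """Label each column 'Utf8' if any sampled cell is filled, else 'Null'."""
    sample = rows[:sample_size]
    width = max(len(row) for row in sample)
    types = []
    for col in range(width):
        filled = any(col < len(row) and row[col] is not None for row in sample)
        types.append("Utf8" if filled else "Null")
    return types


def first_mismatch(rows: Sequence[Row], schema: Sequence[str]) -> Optional[int]:
    """Return the index of the first column holding text where the schema says Null."""
    for row in rows:
        for col, value in enumerate(row):
            if col < len(schema) and value is not None and schema[col] == "Null":
                return col
    return None


# Illustrative rows only: two leading "dud" rows, then a header and one record.
sample_rows = [
    ["List title", "Published", None],
    [None, None, None],
    ["ENTITY", "Currency", "Alphabetic Code"],
    ["JAPAN", "Yen", "JPY"],
]

schema = infer_types(sample_rows, sample_size=2)
mismatch = first_mismatch(sample_rows, schema)
```

With these made-up rows, the third column is inferred as Null from the two leading rows, and the header row is the first to put text in it.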

Do you think there's anything that might be done to support the loading of this file? Something like skiprows or header in Pandas would work. I've worked around it by processing the data with Pandas and saving it as a .parquet file, but I would like the ability to reference the current upstream file without having to process it.
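To make the request concrete, here is a minimal sketch of what a skiprows/header-style option could do to the rows of a sheet before they become a table. It is written in plain Python and does not use roapi or Pandas. The function name `rows_to_records`, its `skiprows` parameter, and the sample rows are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Sequence

Row = Sequence[Optional[str]]


def rows_to_records(rows: Sequence[Row], skiprows: int = 0) -> List[Dict[str, Optional[str]]]:
    """Drop the first `skiprows` rows, treat the next row as the header,
    and return the remaining rows as dictionaries keyed by column name."""
    remaining = list(rows[skiprows:])
    if not remaining:
        return []
    header = [str(name) for name in remaining[0]]
    return [dict(zip(header, row)) for row in remaining[1:]]


# Illustrative rows only: two leading non-tabular rows, a header, one record.
sheet_rows = [
    ["List title", "Published", None],
    [None, None, None],
    ["ENTITY", "Currency", "Alphabetic Code"],
    ["JAPAN", "Yen", "JPY"],
]

records = rows_to_records(sheet_rows, skiprows=2)
```

Skipping the two leading rows leaves a clean header row, so every column gets a name and only real records remain.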

Thanks for any input you can provide. I look forward to any suggestions you might have.

I am facing the same issue with all .xlsx files. Even a simple CSV saved as XLSX (to rule out any data issue) cannot be read by roapi:

$ roapi -c /Users/dm/Desktop/roapi_test.yaml
[2023-02-23T20:33:47Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-23T20:33:47Z INFO roapi::context] loading 'uri(/Users/dm/Desktop/Dremio/Dremio_Test/All_Kind_Files/Salary_Sheet1.xlsx)' as table 'EXCEL_TABLE'
Error: Error loading Xlsx table: Failed to open .xlsx file.

Hi @houqp, @aodj,

I've prepared PR #316, which fixes all the errors described in this issue and also adds the ability to customise the range of rows/columns from the spreadsheet that will be available in the roapi table.