openml / openml-python

Python module to interface with OpenML

Home Page: https://openml.github.io/openml-python/main/

Problematic with `.pq` on dataset 42742

eddiebergman opened this issue

Dataset 42742 tries to download its .pq file but is greeted with an access-denied error at:
https://openml1.win.tue.nl/dataset42742/dataset_42742.pq

This is not a major issue, since we can skip the dataset automatically, but as it is part of the automlbenchmark
we would prefer not to.

Automlbenchmark suite: https://www.openml.org/search?type=study&sort=tasks_included&study_type=task&id=271

@ravinkohli

When I use get_dataset(42742, download_data=True) I still get the arff file, which does allow me to load the data locally. Are you experiencing issues loading the data? Or is this just an early warning you are missing the parquet file? It is also possible that the parquet file does not exist for this dataset (for some files the conversion isn't yet done due to technical difficulties - minio gives the access error either way).

@prabhant is normally in control over the minio setup, but he is on leave now. Resolving this will be slower than usual.

Okay, my bad, I should have done more checking. More specifically, this snippet (which I can't change for evaluation protocol reasons) triggers a web request to the above link. I should have checked that, on my local machine, it correctly falls back to arff.

import openml

# This workaround doesn't seem to disable the web-request either
# openml.dataset.functions._get_dataset_parquet = lambda x: None

d = openml.datasets.get_dataset(42742)
d.get_data(dataset_format="array", target=d.default_target_attribute)

For some reason our cluster just hangs with repeated retries before failing when making API calls through minio. However, this happens with any dataset, not 42742 in particular. Unfortunately, we currently have a very manual protocol of copying datasets from local machines to the cluster.

It would be nice to have an option to simply disable parquet or to specify which file format to use. I understand that if you are transitioning away from arff you may not want this to be an option, in which case we need to figure this out on our side.
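As a rough illustration of the kind of option requested above, here is a minimal sketch of a format-selection helper. Everything in it is hypothetical: `choose_file_format`, its parameters, and the default preference for parquet over arff are assumptions made for illustration, not part of openml-python.

```python
from typing import Iterable, Optional


def choose_file_format(
    available: Iterable[str],
    preferred: Optional[str] = None,
    allow_parquet: bool = True,
) -> str:
    """Pick a file format for a dataset (hypothetical helper, not an openml API).

    available:     formats for which a file is known to exist, e.g. ["arff", "pq"]
    preferred:     explicit user choice, if any
    allow_parquet: set to False to never consider the .pq file
    """
    # Drop parquet up front when the caller has disabled it.
    candidates = [fmt for fmt in available if allow_parquet or fmt != "pq"]
    if not candidates:
        raise ValueError("no usable file format available")

    # An explicit preference wins, but only if that format is usable.
    if preferred is not None:
        if preferred not in candidates:
            raise ValueError(f"preferred format {preferred!r} is not available")
        return preferred

    # Assumed default order: parquet first, then arff.
    for fmt in ("pq", "arff"):
        if fmt in candidates:
            return fmt
    return candidates[0]
```

With such a helper, a cluster without access to the parquet files could call `choose_file_format(["arff", "pq"], allow_parquet=False)` and would always be directed to the arff file.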

Reference issue: #1159

Feel free to close this as my original issue was answered. If you think having an option to choose the file format to download would be useful, then that should probably be a separate issue.

Does #1184 solve this issue?

I am most puzzled by why the monkey patch doesn't work. (I do notice a typo, dataset where it should be datasets, but I assume that wasn't in the original since it should raise an error.) As far as I remember, there are no other points in the code at which the minio server is contacted (get_data internally also calls _get_dataset_parquet). Perhaps you could get a stack trace of the point in openml-python that issues the minio call?
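One way to collect such a stack trace is to replace the function with a stub that records where it was called from. The sketch below runs offline against a stand-in namespace; `fake_functions`, `trace_and_skip`, and the local `get_data` are hypothetical names used only for illustration. With openml installed, the patch target would presumably be `openml.datasets.functions`, going by the thread above.

```python
import functools
import traceback
import types

# Stand-in for a library module so the sketch runs without openml or network access.
fake_functions = types.SimpleNamespace()


def _get_dataset_parquet(description):
    # Simulates the real function reaching out to the minio server.
    raise ConnectionError("simulated minio request")


fake_functions._get_dataset_parquet = _get_dataset_parquet


def trace_and_skip(namespace, name):
    """Replace namespace.<name> with a stub that records its call stack and returns None."""
    recorded = []

    @functools.wraps(getattr(namespace, name))
    def stub(*args, **kwargs):
        # Capture the stack at the moment of the call instead of doing any work.
        recorded.append("".join(traceback.format_stack()))
        return None

    setattr(namespace, name, stub)
    return recorded


def get_data():
    # Looks the function up on the namespace at call time, so the patch takes effect.
    return fake_functions._get_dataset_parquet("dataset 42742")


stacks = trace_and_skip(fake_functions, "_get_dataset_parquet")
result = get_data()
```

After the call, `stacks[0]` holds the formatted stack showing which caller reached the patched function. Note that a patch like this only affects callers that look the function up on the patched module at call time; code that imported the function by name beforehand keeps its own reference to the original.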

Yes, sorry, I think #1184 solves this issue as well. The issue here is slightly different in that the arff is present but _get_dataset_parquet is still called, and there was no way to disable this behaviour. This matters because of the proxy issue in which it hung.

I will close this issue as it seems you'd like to migrate away from arff and adding such an option to skip .pq makes no sense.