Supply direct file name when loading data from dataset

Question

sephiartlist opened this issue 2 years ago

Hi,
Thx for your example using clearml

Regarding loading data in the preprocess_data file: it may be confusing to supply the file names directly.
I would think they should be taken from `dataset.file_entries`, `dataset.list_added_files()`, or some config.

It would be great if you could do a video on the relationship between datasets and add_files, and on the best practices for using them. The documentation is a bit dull.

Victor Sonck · Answer 1 · Tue Jul 26 2022 15:46:51 GMT+0800 (China Standard Time)

Hi @sephiartlist

Thank you for the feedback! Yeah, you're probably right in saying that I shouldn't have hardcoded nasa.csv. That said, even if I used dataset.list_added_files(), I'd still have to choose which file to use, which means that in this case I would do: dataset.list_added_files()[0] which imo is as bad as hardcoding the name.
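One possible middle ground between hardcoding the path and taking `[0]` is to look the file up by name in the listing and fail loudly if it is missing or ambiguous. The sketch below is only an illustration: `pick_file` is a hypothetical helper, not part of ClearML, and the hard-coded `listing` stands in for whatever the dataset listing actually returns.

```python
from pathlib import PurePosixPath


def pick_file(relative_paths, wanted_name):
    """Return the single entry whose file name equals wanted_name."""
    matches = [p for p in relative_paths if PurePosixPath(p).name == wanted_name]
    if len(matches) != 1:
        # Zero or several candidates: refuse to guess.
        raise LookupError(
            f"expected exactly one {wanted_name!r}, found {len(matches)}: {matches}"
        )
    return matches[0]


# Hypothetical listing, standing in for the relative paths stored in the dataset.
listing = ["nasa.csv", "docs/readme.txt"]
print(pick_file(listing, "nasa.csv"))
```

With a real dataset, the listing would presumably come from `dataset.list_files()`, and the returned relative path would be joined onto the folder returned by `dataset.get_local_copy()`. The name is still written in the code, but a missing or duplicated file now raises an error instead of silently loading the wrong one.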

The original reason I did it this way was to make the example very verbose. Knowing we want to get nasa.csv makes it clear to people which file was added to the dataset in the previous step. Do you agree with this?

As for the datasets, check out our new video, which goes a little more in depth on the usage of ClearML Data. And feel free to let us know (here or in the comments) whether it helped or not and which things are still confusing :)

sephiartlist · Answer 2 · Tue Jul 26 2022 18:21:01 GMT+0800 (China Standard Time)

Hi,
Thx for the updated video.
My remark came after considering how to use datasets in our use case, which involves 2 types of data formats: multiple tabular files and images.
What would be the best practice for managing this data? At what stage should we fuse the data? Should the fused data be versioned in some sort of dataset? Or should each of the tables be managed separately, without creating any fused dataset between the tables and/or the relevant images?

Victor Sonck · Answer 3 · Wed Jul 27 2022 17:46:06 GMT+0800 (China Standard Time)

That entirely depends on your use case. ClearML Data will allow you to do both pretty easily.

The easiest from a ClearML Data standpoint is to have everything together in 1 dataset. ClearML Data does not care about a mix of different data types.
My personal opinion is that this makes sense if, for example, the data are created together and used together; you might as well see them as one whole. E.g. images and their json labels make sense to put into a single dataset IMO.

If you want to keep things separated, you could create 2 datasets, one for the images and one for the tabular files, and then have a third dataset that unifies them, for example by using squash. You could also always pull both datasets in the code and merge them there.

Again, my personal opinion is that keeping them separate only makes sense when the 2 are independent of each other, meaning it might happen that new images come in and get a new version while there is no new tabular data. In that case the versions are out of sync. Even then it could make sense to go with route 1 (a single dataset), but that's personal preference I think.
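For the "pull both datasets in the code and merge them there" option, a minimal sketch of the merge step could look like the following. Everything specific here is an assumption made for illustration: the `image_id` column, the convention that it matches the image file name without its extension, and the helper `join_rows_to_images` itself. The in-memory CSV and path list stand in for the local copies of the two datasets.

```python
import csv
import io
from pathlib import PurePosixPath


def join_rows_to_images(rows, image_paths, key="image_id"):
    """Attach an image path to each tabular row whose key matches an image file stem."""
    by_stem = {PurePosixPath(p).stem: p for p in image_paths}
    joined = []
    for row in rows:
        path = by_stem.get(row[key])
        if path is not None:  # rows without a matching image are skipped
            joined.append({**row, "image_path": path})
    return joined


# Hypothetical stand-ins for the contents of the tabular and image datasets.
table = io.StringIO("image_id,label\ncat_01,cat\ndog_07,dog\n")
images = ["images/cat_01.png", "images/bird_03.png"]
merged = join_rows_to_images(csv.DictReader(table), images)
print(merged)
```

Merging at load time like this keeps the two datasets versioned independently. The trade-off is that the pairing logic lives in the training code rather than in a dataset version.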