jeremyjordan / flower-classifier

A simple image classifier for flowers.

Home Page: https://share.streamlit.io/jeremyjordan/flower-classifier/app.py

update how we manage datasets

jeremyjordan opened this issue

What's your idea?
Currently, we have two ways of creating a dataset: CSVDataset and OxfordFlowers102Dataset. We're going to need a way to support the incremental addition of photos that we've collected from our Streamlit app.

Describe the solution you'd like
I think it makes sense to organize photos in Google Drive according to the ImageFolder standard, with one folder per class. This would let us easily add new photos and explore the existing photos in our dataset (e.g. look at all the photos for a single class). However, we'll still want to be able to reproduce a single training run; if the dataset keeps changing, we'll need a way to represent it as it was at the point in time when the model was trained. We can accomplish this by creating a CSV file describing the dataset, which could be reloaded at a later point using the existing CSVDataset.
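
To make the snapshot idea concrete, here is a rough sketch of how an ImageFolder-style directory could be written out as a CSV. It assumes the Drive folder is synced or mounted to a local path. The function name `snapshot_image_folder` and the `filename` and `label` column names are hypothetical and would need to match whatever CSVDataset actually expects.

```python
import csv
from pathlib import Path

# Assumption: only these extensions count as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def snapshot_image_folder(root, csv_path):
    """Write one CSV row per image found under root/<class_name>/.

    Returns the number of rows written. The column names are
    placeholders, not necessarily what CSVDataset reads.
    """
    root = Path(root)
    rows = []
    # Sort directories and files so the same folder contents
    # always produce an identical CSV.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.iterdir()):
            if image_path.is_file() and image_path.suffix.lower() in IMAGE_SUFFIXES:
                rows.append(
                    {
                        # Store paths relative to the dataset root.
                        "filename": image_path.relative_to(root).as_posix(),
                        "label": class_dir.name,
                    }
                )
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["filename", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A snapshot like this only records which files existed and how they were labeled. It reproduces a training run only as long as photos are added and never overwritten or deleted in place.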

Describe alternatives you've considered
We could keep using the CSVDataset as we add new images, but this approach feels a bit more awkward.

We could look into something like DVC to version control the data, but since we're storing the dataset in Google Drive I'm not sure how well this would work.